Speech Recognition
Automatic speech recognition moved from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and quickly became the de facto open-source standard. Whisper large-v3 achieves under 5% word error rate on LibriSpeech clean, while commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech, where error rates run 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500 ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap to fine-tuned Whisper variants is narrow.
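Word error rate, the metric quoted above, is the word-level Levenshtein distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch (the `wer` function here is our own illustration, not any benchmark's official scorer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Row-by-row dynamic programming over the word-level edit-distance matrix.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,              # deletion: a reference word was dropped
                cur[j - 1] + 1,           # insertion: the hypothesis added a word
                prev[j - 1] + (r != h),   # substitution, or free match
            )
        prev = cur
    return prev[-1] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` scores one deletion against six reference words, i.e. 1/6. A WER of 44.1 in the table below means roughly 44 errors per 100 reference words.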
Common Voice
Massive multilingual dataset of transcribed speech, maintained and continuously updated by the Mozilla Foundation. Covers more than 100 languages with diverse demographics and accents.
Top 10
Leading models on the Common Voice Vietnamese (vi) test set, scored by word error rate (lower is better).
| Rank | Model | WER (vi) ↓ | Year | Source |
|---|---|---|---|---|
| 1 | Google USM Chirp v2 | 12.5 | 2024 | paper |
| 2 | Whisper large-v3 | 13.7 | 2024 | paper |
| 3 | MMS 1B-L1107 | 14.5 | 2024 | paper |
| 4 | Google USM Chirp v2 | 14.8 | 2024 | paper |
| 5 | Whisper large-v2 | 18.0 | 2024 | paper |
| 6 | MMS 1B-L1107 | 20.7 | 2024 | paper |
| 7 | Whisper base | 32.6 | 2024 | paper |
| 8 | Whisper base | 34.7 | 2024 | paper |
| 9 | MMS 1B-L1107 | 43.9 | 2024 | paper |
| 10 | Whisper base | 44.1 | 2024 | paper |
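WER comparisons like those in the table are only meaningful when reference and hypothesis text are normalized the same way before scoring; casing and punctuation differences alone can shift WER by several points. A minimal sketch of such a normalizer (this `normalize` helper is a simplified illustration, not the normalizer any of these benchmarks actually uses):

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before WER scoring.

    Simplified stand-in for the heavier text normalizers that ASR
    leaderboards typically apply to both reference and hypothesis.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()
```

Without a shared normalization step, "Hello, world!" versus "hello world" would count as two substitutions despite being the same spoken content.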
All datasets
4 datasets tracked for this task.
Related tasks
Other tasks in Speech.