
Speech Recognition

Automatic speech recognition evolved from a specialized pipeline (acoustic model + language model + decoder) into a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and quickly became the de facto open-source standard. Whisper large-v3 achieves under 5% word error rate on LibriSpeech test-clean, while commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech, where error rates are typically 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500 ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.
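Word error rate (WER), the metric used throughout these leaderboards, is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and a model's hypothesis, normalized by the reference length. A minimal sketch of the computation (the `wer` helper below is illustrative, not any specific library's API):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    # d[j] holds the distance between the ref words seen so far and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # d[i-1][j-1] from the previous row
        d[0] = i             # distance(ref[:i], empty) = i deletions
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (r != h))     # substitution (or match)
            prev = cur
    return d[len(hyp)] / len(ref)

print(wer("hello world", "hello word"))  # one substitution in two words -> 0.5
```

Production toolkits typically normalize text first (casing, punctuation, number formatting) before scoring, which can shift WER by several points; leaderboard numbers are only comparable when the normalization is identical.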

Datasets: 4
Results: 72
Canonical metric: WER
Canonical benchmark: Common Voice

Common Voice

A massive multilingual dataset of transcribed speech covering more than 100 languages and a diverse range of demographics and accents, continuously updated by the Mozilla Foundation.

Primary metric: WER (word error rate; lower is better)

Top 10

Leading models on Common Voice, scored by WER on the Vietnamese (vi) subset.

Rank | Model               | WER (vi) | Year | Source
-----|---------------------|----------|------|-------
1    | Whisper base        | 44.1     | 2024 | paper
2    | MMS 1B-L1107        | 43.9     | 2024 | paper
3    | Whisper base        | 34.7     | 2024 | paper
4    | Whisper base        | 32.6     | 2024 | paper
5    | MMS 1B-L1107        | 20.7     | 2024 | paper
6    | Whisper large-v2    | 18.0     | 2024 | paper
7    | Google USM Chirp v2 | 14.8     | 2024 | paper
8    | MMS 1B-L1107        | 14.5     | 2024 | paper
9    | Whisper large-v3    | 13.7     | 2024 | paper
10   | Google USM Chirp v2 | 12.5     | 2024 | paper

All datasets

4 datasets tracked for this task.


Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.
