LibriSpeech
Johns Hopkins University
1,000 hours of read English speech derived from LibriVox audiobooks. A standard benchmark for automatic speech recognition, with a clean test split (test-clean) and a noisier one (test-other).
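For reference, one common way to fetch the two test splits scored below is torchaudio's bundled LibriSpeech loader; the `./data` path here is arbitrary:

```python
# Download the two LibriSpeech test splits via torchaudio's dataset loader.
import torchaudio

test_clean = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)
test_other = torchaudio.datasets.LIBRISPEECH("./data", url="test-other", download=True)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, *_ = test_clean[0]
print(sample_rate, transcript)  # 16 kHz audio with its reference text
```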
SOTA History
WER (test-other)
Word Error Rate on the noisier, more accented test-other split (lower is better)
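Both tables report word error rate: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference word count. A minimal per-utterance sketch follows; note that official scoring normalizes text (casing, punctuation) first and pools edit counts over the whole test set rather than averaging per utterance.

```python
# Minimal WER via word-level Levenshtein distance:
# WER = (substitutions + deletions + insertions) / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a 4-word reference -> 25.00% WER
print(f"{100 * wer('the sky is grey', 'the sky is gray'):.2f}")
```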
| Rank | Model | Notes | Source | WER (%) | Year |
|---|---|---|---|---|---|
| 1 | Parakeet RNNT 1.1B | NVIDIA + Suno.ai. 1.1B params. Greedy decoding, no LM. | Editorial | 2.47 | 2025 |
| 2 | Parakeet TDT 0.6B v2 | NVIDIA. 0.6B params. FastConformer-TDT. | Editorial | 3.19 | 2025 |
| 3 | wav2vec 2.0 Large (960h) | Fine-tuned on 960h. Table 3, arxiv:2006.11477. | Community | 3.3 | 2020 |
| 4 | Canary 1B v2 | NVIDIA. 1B multilingual ASR+AST. Aug 2025. | Editorial | 3.56 | 2025 |
| 5 | Parakeet TDT 0.6B v3 | NVIDIA. 0.6B params. Multilingual. Sep 2025. | Editorial | 3.59 | 2025 |
| 6 | Whisper Large v3 | OpenAI model card / arxiv:2212.04356. | Editorial | 3.6 | 2024 |
| 7 | HuBERT Large (LS-960) | Fine-tuned on 960h. Table 2, arxiv:2106.07447. | Community | 3.6 | 2021 |
| 8 | Canary-1B | English results. Table 2, arxiv:2310.09873. | Community | 3.8 | 2023 |
| 9 | Voxtral Mini 3B | Mistral AI. 3B multimodal model. July 2025. | Editorial | 4.08 | 2025 |
| 10 | Google USM | 2B params. Table 3, arxiv:2303.01037. | Community | 4.1 | 2023 |
| 11 | Parakeet-CTC-1.1B | Table 1, arxiv:2311.13251. | Community | 4.2 | 2023 |
| 12 | Whisper Large v2 | Table 5, arxiv:2212.04356. | Community | 5.2 | 2022 |
| 13 | Phi-4-multimodal-instruct | Microsoft. 5.6B multimodal model. Feb 2025. | Editorial | 5.97 | 2025 |
WER (test-clean)
Word Error Rate on the clean test-clean split (lower is better)
| Rank | Model | Notes | Source | WER (%) | Year |
|---|---|---|---|---|---|
| 1 | Parakeet RNNT 1.1B | NVIDIA + Suno.ai. 1.1B params. FastConformer-RNNT. Greedy decoding, no LM. | Editorial | 1.46 | 2025 |
| 2 | Phi-4-multimodal-instruct | Microsoft. 5.6B multimodal model. #1 on the Hugging Face Open ASR Leaderboard (March 2025, 6.14% avg WER). | Editorial | 1.67 | 2025 |
| 3 | Parakeet TDT 0.6B v2 | NVIDIA. 0.6B params. FastConformer-TDT. Greedy decoding under the Hugging Face Open ASR Leaderboard framework. | Editorial | 1.69 | 2025 |
| 4 | Parakeet-CTC-1.1B | Table 1, arxiv:2311.13251. | Community | 1.7 | 2023 |
| 5 | Canary-1B | English results. Table 2, arxiv:2310.09873. | Community | 1.7 | 2023 |
| 6 | Conformer-CTC Large | NeMo. NVIDIA NGC model card. | Community | 1.7 | 2021 |
| 7 | Whisper Large v3 | OpenAI model card / arxiv:2212.04356. | Editorial | 1.8 | 2024 |
| 8 | wav2vec 2.0 Large (960h) | Fine-tuned on 960h. Table 3, arxiv:2006.11477. | Community | 1.8 | 2020 |
| 9 | Voxtral Mini 3B | Mistral AI. 3B multimodal model based on Ministral 3B with an audio encoder. July 2025. | Editorial | 1.89 | 2025 |
| 10 | HuBERT Large (LS-960) | Fine-tuned on 960h. Table 2, arxiv:2106.07447. | Community | 1.9 | 2021 |
| 11 | Parakeet TDT 0.6B v3 | NVIDIA. 0.6B params. Multilingual ASR/AST. FastConformer-TDT. RTFx 3332 (fastest throughput). Sep 2025. | Editorial | 1.93 | 2025 |
| 12 | Google USM | 2B params. Table 3, arxiv:2303.01037. | Community | 2.0 | 2023 |
| 13 | Canary 1B v2 | NVIDIA. 1B multilingual ASR+AST. Supports EN/DE/FR/ES. Aug 2025. | Editorial | 2.18 | 2025 |
| 14 | Whisper Large v2 | Table 5, arxiv:2212.04356. | Community | 2.7 | 2022 |
| 15 | wav2vec 2.0 Large | Meta. 317M params. Self-supervised pre-training on 60k hours of speech. Foundational SSL ASR model. | Editorial | 2.9 | 2020 |
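RTFx, cited in the Parakeet TDT 0.6B v3 row above, is the inverse real-time factor: seconds of audio transcribed per second of wall-clock compute, so higher means faster. A minimal sketch of the measurement, assuming a `transcribe` callable that runs the model (a placeholder, not a real API):

```python
import time

def rtfx(audio_seconds: float, transcribe) -> float:
    # Inverse real-time factor: seconds of audio processed per wall-clock second.
    # RTFx 3332 means roughly 3332 seconds of audio transcribed per second.
    start = time.perf_counter()
    transcribe()  # run the model over the batch of audio
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```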