Speech-to-Text, focused
A dedicated STT landing page — decoupled from TTS. LibriSpeech WER leaderboard, use-case picks, and the current SOTA landscape where NVIDIA Parakeet RNNT 1.1B beats every cloud API on raw accuracy.
STT Landscape
LibriSpeech Leaderboard
18 models ranked by Word Error Rate on LibriSpeech test-clean. Lower is better. Human-level WER on clean speech is ~2–4%.
| # | Model | WER ↓ | Architecture | Type | Params | Year |
|---|---|---|---|---|---|---|
| 1 | Parakeet RNNT 1.1B (NVIDIA) · Current SOTA on LibriSpeech. NeMo framework. | 1.80% | Conformer + RNNT | Open Source | 1.1B | 2025 |
| 2 | Conformer XL (Google) · Pushed clean-speech WER to the 2% threshold. | 2.00% | Conformer + LAS | Research | 600M | 2021 |
| 3 | Deepgram Nova-3 (Deepgram) · Best commercial STT. Sub-300 ms streaming. | 2.20% | Proprietary Transformer | Cloud API | — | 2025 |
| 4 | Voxtral Large (Mistral AI) · Audio understanding + transcription via LLM. Multilingual, long-context audio. | 2.30% | Audio-Language Model (Transformer) | Cloud API | — | 2025 |
| 5 | AssemblyAI Universal-2 (AssemblyAI) · Strong multilingual, built-in speaker diarization. | 2.40% | Conformer-based | Cloud API | — | 2025 |
| 6 | Canary 1B (NVIDIA) · Multi-task: ASR + translation (EN, DE, ES, FR) in one model. | 2.40% | FastConformer + multi-task | Open Source | 1B | 2024 |
| 7 | Whisper Large v3 Turbo (OpenAI) · 8× faster than Large v3. 100+ languages. | 2.50% | Transformer Encoder-Decoder | Open Source | 809M | 2024 |
| 8 | Gladia v2 (Gladia) · Real-time streaming with code-switching. | 2.50% | Whisper-based + custom | Cloud API | — | 2025 |
| 9 | Google Chirp 3 (Google) · 100+ languages, speaker diarization, built-in denoiser. | 2.50% | Generative (USM-based) | Cloud API | — | 2025 |
| 10 | Speechmatics Flow (Speechmatics) · 50+ languages with real-time translation. | 2.60% | Proprietary | Cloud API | — | 2025 |
| 11 | Whisper Large v3 (OpenAI) · Most widely deployed open STT model. | 2.70% | Transformer Encoder-Decoder | Open Source | 1.55B | 2023 |
| 12 | Groq Whisper (Groq) · ~150× faster than real time via LPU inference. | 2.70% | Whisper on LPU | Cloud API | 1.55B | 2025 |
| 13 | Google USM (Google) · 300+ languages. Google's foundation speech model. | 2.80% | Universal Speech Model | Cloud API | 2B | 2023 |
| 14 | Voxtral Mini (Mistral AI) · First Mistral speech model. Audio Q&A, transcription, translation. | 2.80% | Audio-Language Model (Transformer) | Cloud API | — | 2024 |
| 15 | Gemini 3 Pro (audio) (Google) · Best diarization accuracy. Transcription via LLM. | 2.90% | Multimodal LLM | Cloud API | — | 2026 |
| 16 | Azure Speech (Microsoft) · Enterprise-grade with custom model training. | 3.00% | Proprietary | Cloud API | — | 2024 |
| 17 | Moonshine Base (Useful Sensors) · Runs on a Raspberry Pi. 5× faster than Whisper Tiny, better accuracy. | 3.50% | Optimized Transformer | Open Source | 61M | 2024 |
| 18 | wav2vec 2.0 (Meta) · Pioneered self-supervised pre-training for speech. | 3.80% | Self-supervised Transformer | Open Source | 317M | 2020 |
Picks by Use Case
Lowest WER isn't always what you want. Streaming latency, language coverage, and hardware constraints often matter more than a fraction of a percentage point.
- Highest accuracy: Parakeet RNNT 1.1B. 1.8% WER, current SOTA on clean speech. NeMo framework, production-ready, and beats every cloud API on raw accuracy.
- Real-time streaming: Deepgram Nova-3. Purpose-built for streaming; maintains strong WER while delivering partial hypotheses in under 300 ms. Gladia and Speechmatics Flow are alternatives.
- Multilingual: Whisper Large v3 Turbo. 100+ languages in a single model, 2.5% WER on English. Voxtral Large is the newer alternative with audio Q&A capabilities.
- Audio understanding: Voxtral. Mistral's audio-language model. ~2.3% WER plus multimodal LLM capabilities that Whisper lacks: audio Q&A, translation, spoken instruction following.
- On-device / CPU: Whisper Small or Moonshine. Whisper Small is the pragmatic CPU choice; Moonshine from Useful Sensors is 5× faster than Whisper Tiny with better accuracy on-device.
- Batch transcription: Groq Whisper. Groq's LPU delivers Whisper inference ~150× faster than real time, the best choice when you need to batch-process large audio corpora quickly.
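The batch-throughput claim is easy to turn into planning arithmetic: corpus hours divided by the real-time-factor speedup (and any parallel streams) gives wall-clock time. A back-of-envelope sketch, taking the vendor-quoted ~150× figure at face value:

```python
def wall_clock_hours(audio_hours: float, speedup: float, streams: int = 1) -> float:
    # Wall-clock hours to transcribe a corpus: duration divided by
    # the real-time-factor speedup and the number of parallel streams.
    return audio_hours / (speedup * streams)

# At ~150x real time, a 10,000-hour corpus clears in roughly 67 hours
# on a single stream, or under 7 hours across 10 parallel streams:
print(round(wall_clock_hours(10_000, 150), 1))      # 66.7
print(round(wall_clock_hours(10_000, 150, 10), 1))  # 6.7
```

The speedup you actually see depends on audio format, chunking, and API rate limits, so treat the quoted factor as an upper bound.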
Open Source vs Cloud
In 2026 open-source STT has inverted the historical picture: Parakeet RNNT 1.1B (1.8% WER) now beats every major cloud API on raw accuracy. Cloud APIs still win on streaming infrastructure, multilingual scale, and managed hosting.
When to go Open Source
- Lowest-WER requirement (Parakeet, Canary)
- Data residency / compliance
- Offline or edge deployment (Moonshine, Whisper Small)
- High-volume batch processing (open-weights Whisper, self-hosted or via Groq)
- Research and reproducibility
When to go Cloud API
- Real-time streaming with partials (Deepgram Nova-3)
- Broad multilingual out-of-the-box (AssemblyAI)
- Managed infra, SLA, autoscaling
- Diarization and speaker labeling built-in
- Fast vendor iteration on new languages/domains
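The "streaming with partials" item above follows one vendor-neutral pattern: the API emits interim hypotheses that keep revising until a segment is finalized and committed. A minimal sketch of the client-side folding logic, assuming a simplified `(text, is_final)` event shape rather than any vendor's actual message format:

```python
def assemble(events):
    """Fold a stream of (text, is_final) partial-hypothesis events
    into a transcript. Interim partials overwrite one another;
    final segments are committed and never revised."""
    committed = []   # finalized segments, in order
    interim = ""     # latest unconfirmed hypothesis
    for text, is_final in events:
        if is_final:
            committed.append(text)
            interim = ""
        else:
            interim = text  # newer partial replaces the old one
    return " ".join(committed + ([interim] if interim else []))

events = [("hel", False), ("hello", False),
          ("hello world.", True), ("how are", False)]
print(assemble(events))  # hello world. how are
```

Real streaming protocols add timestamps, confidence, and end-of-utterance signaling on top of this, but the overwrite-until-final core is the same.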
How STT is Scored
WER — Word Error Rate
Percentage of words incorrectly transcribed. Counts substitutions (wrong word), deletions (missing word), and insertions (extra word), normalized by reference length. Lower is better.
WER = (substitutions + deletions + insertions) / reference_words
Reference points: human transcribers run at ~2–4% WER on clean read speech; expect >10% on noisy conversational audio and >20% on heavy accents or low-resource languages.
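The formula above is computed via word-level edit distance, which finds the minimum-cost mix of substitutions, deletions, and insertions. A minimal reference implementation (no text normalization, which real benchmark harnesses apply first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / reference_words,
    via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # match, no cost
            else:
                # substitution, deletion, insertion
                d[i][j] = 1 + min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of a six-word reference:
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
```

Note that WER can exceed 100% when the hypothesis inserts more words than the reference contains, which is why hallucination-prone models occasionally post startling scores on silence-heavy audio.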
Beyond WER — what WER misses
- Streaming latency. Batch-mode WER says nothing about how quickly partial results arrive.
- Punctuation, casing, numerics. Most WER reports normalize these away. Your product probably cares.
- Hallucinations. Whisper in particular can invent text on silence or ambient noise. WER catches some but not all.
- Speaker diarization. Who-spoke-when is a separate evaluation, scored with DER (Diarization Error Rate: missed speech + false alarms + speaker confusion, over total speech time).
- Robustness to noise, accents, domain. LibriSpeech is clean read speech. Real production audio is not.
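The punctuation-and-casing point is easy to see concretely: benchmark normalization collapses strings your product would render very differently. A sketch of a typical normalization step (the exact rules vary per benchmark harness; this one is illustrative):

```python
import re

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace:
    # roughly what most WER harnesses do before scoring.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

raw_ref = "Dr. Smith's fee: $3,000."
raw_hyp = "dr smiths fee 3000"
# 0% WER after normalization, even though every displayed token differs:
print(normalize(raw_ref) == normalize(raw_hyp))  # True
```

If your product displays transcripts to users, evaluate punctuation and formatting separately; a leaderboard WER alone won't surface these regressions.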
Related
Text-to-Speech
The paired TTS leaderboard and picks-by-use-case page.
Speech Benchmarks (STT + TTS)
Combined hub with papers, repos, and progress charts.
Audio-to-Text Building Block
Integration guide: API shapes, streaming, diarization.
Learn: Speech Recognition
Hands-on lesson with Whisper. WER benchmarks in code.
Audio Benchmarks
Classification, music generation, audio understanding.
Browse Speech Tasks
All speech datasets and tasks indexed.