Speech-to-Text (Beta)

Speech-to-Text, focused

A dedicated STT landing page — decoupled from TTS. LibriSpeech WER leaderboard, use-case picks, and the current SOTA landscape where NVIDIA Parakeet RNNT 1.1B beats every cloud API on raw accuracy.

STT Landscape

  • Best WER: 1.8% (Parakeet RNNT 1.1B)
  • Best open source: 1.8% (Parakeet RNNT 1.1B)
  • Models tracked: 18

LibriSpeech Leaderboard

18 models ranked by Word Error Rate on LibriSpeech test-clean. Lower is better. Human-level WER on clean speech is ~2–4%.

| # | Model | WER ↓ | Architecture | Type | Params | Year | Notes |
|---|-------|-------|--------------|------|--------|------|-------|
| 1 | Parakeet RNNT 1.1B | 1.80% | Conformer + RNNT | Open Source | 1.1B | 2025 | NVIDIA · Current SOTA on LibriSpeech. NeMo framework. |
| 2 | Conformer XL | 2.00% | Conformer + LAS | Research | 600M | 2021 | Google · First sub-2% WER on clean speech. |
| 3 | Deepgram Nova-3 | 2.20% | Proprietary Transformer | Cloud API | — | 2025 | Deepgram · Best commercial STT. Sub-300 ms streaming. |
| 4 | Voxtral Large | 2.30% | Audio-Language Model (Transformer) | Cloud API | — | 2025 | Mistral AI · Audio understanding + transcription via LLM. Multilingual, long-context audio. |
| 5 | AssemblyAI Universal-2 | 2.40% | Conformer-based | Cloud API | — | 2025 | AssemblyAI · Strong multilingual, built-in speaker diarization. |
| 6 | Canary 1B | 2.40% | FastConformer + multi-task | Open Source | 1B | 2024 | NVIDIA · Multi-task: ASR + translation (EN, DE, ES, FR) in one model. |
| 7 | Whisper Large v3 Turbo | 2.50% | Transformer Encoder-Decoder | Open Source | 809M | 2024 | OpenAI · 8× faster than Large v3. 100+ languages. |
| 8 | Gladia v2 | 2.50% | Whisper-based + custom | Cloud API | — | 2025 | Gladia · Real-time streaming with code-switching. |
| 9 | Google Chirp 3 | 2.50% | Generative (USM-based) | Cloud API | — | 2025 | Google · 100+ languages, speaker diarization, built-in denoiser. |
| 10 | Speechmatics Flow | 2.60% | Proprietary | Cloud API | — | 2025 | Speechmatics · 50+ languages with real-time translation. |
| 11 | Whisper Large v3 | 2.70% | Transformer Encoder-Decoder | Open Source | 1.55B | 2023 | OpenAI · Most widely deployed open STT model. |
| 12 | Groq Whisper | 2.70% | Whisper on LPU | Cloud API | 1.55B | 2025 | Groq · ~150× faster than real time via LPU inference. |
| 13 | Google USM | 2.80% | Universal Speech Model | Cloud API | 2B | 2023 | Google · 300+ languages. Google's foundation speech model. |
| 14 | Voxtral Mini | 2.80% | Audio-Language Model (Transformer) | Cloud API | — | 2024 | Mistral AI · First Mistral speech model. Audio Q&A, transcription, translation. |
| 15 | Gemini 3 Pro (audio) | 2.90% | Multimodal LLM | Cloud API | — | 2026 | Google · Best diarization accuracy. Transcription via LLM. |
| 16 | Azure Speech | 3.00% | Proprietary | Cloud API | — | 2024 | Microsoft · Enterprise-grade with custom model training. |
| 17 | Moonshine Base | 3.50% | Optimized Transformer | Open Source | 61M | 2024 | Useful Sensors · Runs on Raspberry Pi. 5× faster than Whisper Tiny with better accuracy. |
| 18 | wav2vec 2.0 | 3.80% | Self-supervised Transformer | Open Source | 317M | 2020 | Meta · Pioneered self-supervised pre-training for speech. |

Picks by Use Case

Lowest WER isn't always what you want. Streaming latency, language coverage, and hardware constraints often matter more than a fraction of a percentage point.

Open Source vs Cloud

In 2026 open-source STT has inverted the historical picture: Parakeet RNNT 1.1B (1.8% WER) now beats every major cloud API on raw accuracy. Cloud APIs still win on streaming infrastructure, multilingual scale, and managed hosting.

When to go Open Source

  • Lowest-WER requirement (Parakeet, Voxtral)
  • Data residency / compliance
  • Offline or edge deployment (Moonshine, Whisper Small)
  • High-volume batch processing (Whisper on Groq)
  • Research and reproducibility

When to go Cloud API

  • Real-time streaming with partials (Deepgram Nova-3)
  • Broad multilingual out-of-the-box (AssemblyAI)
  • Managed infra, SLA, autoscaling
  • Diarization and speaker labeling built-in
  • Fast vendor iteration on new languages/domains

How STT is Scored

WER — Word Error Rate

The fraction of words incorrectly transcribed, counting substitutions (wrong word), deletions (missing word), and insertions (extra word), normalized by reference length. Lower is better — and because insertions are counted, WER can exceed 100%.

WER = (substitutions + deletions + insertions) / reference_words

Reference points: human transcribers sit at ~2–4% WER on clean read speech, above 10% on noisy conversational speech, and above 20% on heavy accents or low-resource languages.
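The formula above is just edit distance over word tokens. A minimal sketch in pure Python (no ASR toolkit assumed; production scorers add alignment reports and normalization on top of the same core):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words.

    Computed as Levenshtein distance over word tokens. Assumes a
    non-empty reference.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the hat sat")` counts one substitution over three reference words, i.e. ~33%. Libraries like `jiwer` or NIST's `sclite` implement the same computation with configurable text normalization.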

Beyond WER — what WER misses

  • Streaming latency. Batch-mode WER says nothing about how quickly partial results arrive.
  • Punctuation, casing, numerics. Most WER reports normalize these away. Your product probably cares.
  • Hallucinations. Whisper in particular can invent text on silence or ambient noise. WER catches some but not all.
  • Speaker diarization. Who-spoke-when is a separate evaluation (DER: Diarization Error Rate).
  • Robustness to noise, accents, domain. LibriSpeech is clean read speech. Real production audio is not.

Beta. This page is new and in active iteration. WER numbers are from published papers and vendor releases on LibriSpeech test-clean — independent eval coming in v2. Feedback welcome at k.wikiel@gmail.com.