Speech-to-Text (Beta)

Speech-to-Text, focused

A dedicated STT landing page — decoupled from TTS. LibriSpeech WER leaderboard, use-case picks, and the current SOTA landscape where NVIDIA Parakeet RNNT 1.1B beats every cloud API on raw accuracy.

STT Landscape

  • Best WER: 1.8% (Parakeet RNNT 1.1B)
  • Best open source: 1.8% (Parakeet RNNT 1.1B)
  • Models tracked: 18

LibriSpeech Leaderboard

18 models ranked by Word Error Rate on LibriSpeech test-clean. Lower is better. Human-level WER on clean speech is ~2–4%.

| # | Model | WER ↓ | Architecture | Type | Params | Year | Notes |
|---|-------|-------|--------------|------|--------|------|-------|
| 1 | Parakeet RNNT 1.1B | 1.80% | Conformer + RNNT | Open Source | 1.1B | 2025 | NVIDIA · Current SOTA on LibriSpeech. NeMo framework. |
| 2 | Conformer XL | 2.00% | Conformer + LAS | Research | 600M | 2021 | Google · First sub-2% WER on clean speech. |
| 3 | Deepgram Nova-3 | 2.20% | Proprietary Transformer | Cloud API | — | 2025 | Deepgram · Best commercial STT. Sub-300 ms streaming. |
| 4 | Voxtral Large | 2.30% | Audio-Language Model (Transformer) | Cloud API | — | 2025 | Mistral AI · Audio understanding + transcription via LLM. Multilingual, long-context audio. |
| 5 | AssemblyAI Universal-2 | 2.40% | Conformer-based | Cloud API | — | 2025 | AssemblyAI · Strong multilingual, built-in speaker diarization. |
| 6 | Canary 1B | 2.40% | FastConformer + multi-task | Open Source | 1B | 2024 | NVIDIA · Multi-task: ASR + translation (EN, DE, ES, FR) in one model. |
| 7 | Whisper Large v3 Turbo | 2.50% | Transformer Encoder-Decoder | Open Source | 809M | 2024 | OpenAI · 8× faster than Large v3. 100+ languages. |
| 8 | Gladia v2 | 2.50% | Whisper-based + custom | Cloud API | — | 2025 | Gladia · Real-time streaming with code-switching. |
| 9 | Google Chirp 3 | 2.50% | Generative (USM-based) | Cloud API | — | 2025 | Google · 100+ languages, speaker diarization, built-in denoiser. |
| 10 | Speechmatics Flow | 2.60% | Proprietary | Cloud API | — | 2025 | Speechmatics · 50+ languages with real-time translation. |
| 11 | Whisper Large v3 | 2.70% | Transformer Encoder-Decoder | Open Source | 1.55B | 2023 | OpenAI · Most widely deployed open STT model. |
| 12 | Groq Whisper | 2.70% | Whisper on LPU | Cloud API | 1.55B | 2025 | Groq · ~150× faster than real time via LPU inference. |
| 13 | Google USM | 2.80% | Universal Speech Model | Cloud API | 2B | 2023 | Google · 300+ languages. Google's foundation speech model. |
| 14 | Voxtral Mini | 2.80% | Audio-Language Model (Transformer) | Cloud API | — | 2024 | Mistral AI · First Mistral speech model. Audio Q&A, transcription, translation. |
| 15 | Gemini 3 Pro (audio) | 2.90% | Multimodal LLM | Cloud API | — | 2026 | Google · Best diarization accuracy. Transcription via LLM. |
| 16 | Azure Speech | 3.00% | Proprietary | Cloud API | — | 2024 | Microsoft · Enterprise-grade with custom model training. |
| 17 | Moonshine Base | 3.50% | Optimized Transformer | Open Source | 61M | 2024 | Useful Sensors · Runs on Raspberry Pi. 5× faster than Whisper Tiny with better accuracy. |
| 18 | wav2vec 2.0 | 3.80% | Self-supervised Transformer | Open Source | 317M | 2020 | Meta · Pioneered self-supervised pre-training for speech. |

Picks by Use Case

Lowest WER isn't always what you want. Streaming latency, language coverage, and hardware constraints often matter more than a fraction of a percentage point.

Open Source vs Cloud

In 2026 open-source STT has inverted the historical picture: Parakeet RNNT 1.1B (1.8% WER) now beats every major cloud API on raw accuracy. Cloud APIs still win on streaming infrastructure, multilingual scale, and managed hosting.

When to go Open Source

  • Lowest-WER requirement (Parakeet, Voxtral)
  • Data residency / compliance
  • Offline or edge deployment (Moonshine, Whisper Small)
  • High-volume batch processing (Whisper on Groq)
  • Research and reproducibility

When to go Cloud API

  • Real-time streaming with partials (Deepgram Nova-3)
  • Broad multilingual out-of-the-box (AssemblyAI)
  • Managed infra, SLA, autoscaling
  • Diarization and speaker labeling built-in
  • Fast vendor iteration on new languages/domains

How STT is Scored

WER — Word Error Rate

The fraction of words incorrectly transcribed, counting substitutions (wrong word), deletions (missing word), and insertions (extra word), normalized by reference length. Lower is better — and because insertions are counted, WER can exceed 100%.

WER = (substitutions + deletions + insertions) / reference_words

Reference points: human transcribers sit at ~2–4% WER on clean read speech, above 10% on noisy conversational speech, and above 20% on heavy accents or low-resource languages.
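The formula above is just edit distance over word tokens. A minimal sketch in pure Python (no ASR toolkit assumed; production scorers add alignment reports and normalization on top of the same core):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words.

    Computed as Levenshtein distance over word tokens. Assumes a
    non-empty reference.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the hat sat")` counts one substitution over three reference words, i.e. ~33%. Libraries like `jiwer` or NIST's `sclite` implement the same computation with configurable text normalization.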

Beyond WER — what WER misses

  • Streaming latency. Batch-mode WER says nothing about how quickly partial results arrive.
  • Punctuation, casing, numerics. Most WER reports normalize these away. Your product probably cares.
  • Hallucinations. Whisper in particular can invent text on silence or ambient noise. WER catches some but not all.
  • Speaker diarization. Who-spoke-when is a separate evaluation (DER: Diarization Error Rate).
  • Robustness to noise, accents, domain. LibriSpeech is clean read speech. Real production audio is not.

Beta. This page is new and in active iteration. WER numbers are from published papers and vendor releases on LibriSpeech test-clean — independent eval coming in v2. Feedback welcome at k.wikiel@gmail.com.