Codesota · Speech · STT leaderboard
SOTA speech-to-text · OSS ASR · API models
Updated May 7, 2026

§ 00 · Direct answer

STT leaderboard: the speech-to-text SOTA.

The short answer: Granite Speech 4.1 2B is the current SOTA entry in this STT leaderboard, with 5.33% mean WER on the HF Open ASR Leaderboard. Pulse Pro leads the hosted APIs on accuracy at 5.42% mean WER, while Deepgram Nova-3 remains the go-to for low-latency streaming. For an open-source ASR model, start with Granite Speech 4.1 2B, or with Whisper Large v3 Turbo if multilingual coverage matters most.

LibriSpeech is still useful for clean-speech comparison, but it is saturated near the frontier. Use the eight-dataset Open ASR mean WER as the primary ranking, then check LibriSpeech clean/other when your workload is clean, read English speech.


§ 01 · Leaderboard

ASR models, ranked by mean WER.

Lower mean WER is better. Rows are ranked by mean WER across the eight HF Open ASR Leaderboard datasets, with each model's LibriSpeech test-clean WER shown alongside for a clean-speech reference point.
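As a rough illustration of the ranking metric, the sketch below computes WER as word-level edit distance divided by the number of reference words, then averages per-dataset WERs. The unweighted average and the function names `wer` and `mean_wer` are assumptions made for this example, not the leaderboard's own code, and real evaluations typically normalize text before scoring, which this sketch skips.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(per_dataset: dict[str, float]) -> float:
    """Unweighted mean of per-dataset WERs (assumed aggregation)."""
    return sum(per_dataset.values()) / len(per_dataset)


# Toy transcripts, not taken from any benchmark.
print(wer("the cat sat on the mat", "the cat sat on mat"))
# Placeholder dataset names and values, for illustration only.
print(mean_wer({"dataset_a": 0.02, "dataset_b": 0.04}))
```

Because the mean is taken over datasets rather than over utterances, a model that is weak on one hard dataset is penalized even if that dataset is small.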


#	Model	Vendor	Kind	Architecture	Params	LS Clean	Mean WER
01	Granite Speech 4.1 2B	IBM	Open Source	Speech-aware LLM (Granite)	2B	1.33%	5.33%
02	Cohere Transcribe (Mar 2026)	Cohere	Open Source	Transformer ASR	2B	1.25%	5.42%
03	Pulse Pro	Smallest AI	Cloud API	Proprietary ASR	—	1.80%	5.42%
04	Zoom Scribe v1	Zoom	Cloud API	Proprietary	—	1.63%	5.47%
05	Granite 4.0 1B Speech	IBM	Open Source	Speech-aware LLM (Granite)	1B	1.42%	5.52%
06	Canary-Qwen-2.5B	NVIDIA	Open Source	FastConformer encoder + Qwen2 LM decoder	2.5B	1.61%	5.63%
07	Granite Speech 3.3 8B	IBM	Open Source	Speech-aware LLM (Granite)	8B	1.43%	5.74%
08	Qwen3-ASR-1.7B	Alibaba	Open Source	Qwen3 backbone fine-tuned for ASR	1.7B	1.63%	5.76%
09	ElevenLabs Scribe v2	ElevenLabs	Cloud API	Proprietary	—	1.54%	5.83%
10	Phi-4 Multimodal Instruct	Microsoft	Open Source	Phi-4 multimodal	6B	1.69%	6.02%
11	Parakeet TDT 0.6B v2	NVIDIA	Open Source	FastConformer (TDT)	0.6B	1.69%	6.05%
12	AssemblyAI Universal-3 Pro	AssemblyAI	Cloud API	Proprietary Conformer-based	—	1.53%	6.21%
13	Canary 1B	NVIDIA	Open Source	FastConformer + multi-task	1B	1.48%	6.50%
14	Voxtral Small 24B	Mistral AI	Open Source	Large multimodal LM with audio encoder	24B	1.59%	6.62%
15	Google Chirp 3	Google	Cloud API	Generative (USM-based)	—	2.04%	6.63%
16	Parakeet TDT 1.1B	NVIDIA	Open Source	FastConformer (TDT)	1.1B	1.40%	7.02%
17	Voxtral Mini 3B	Mistral AI	Open Source	Audio-Language Model (Transformer)	3B	1.88%	7.05%
18	Whisper Large v3	OpenAI	Open Source	Transformer Encoder-Decoder	1.55B	2.01%	7.44%
19	Whisper Large v3 Turbo	OpenAI	Open Source	Transformer Encoder-Decoder (pruned decoder)	809M	2.10%	7.83%
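To read the table by deployment type, a small sketch like the following picks the lowest mean WER per Kind. The model names, kinds and mean WER values are copied from the rows above; the `Row` type and `best_by_kind` helper are hypothetical names introduced for this example.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    kind: str
    mean_wer: float  # percent, from the Mean WER column


ROWS = [
    Row("Granite Speech 4.1 2B", "Open Source", 5.33),
    Row("Cohere Transcribe (Mar 2026)", "Open Source", 5.42),
    Row("Pulse Pro", "Cloud API", 5.42),
    Row("Zoom Scribe v1", "Cloud API", 5.47),
    Row("Granite 4.0 1B Speech", "Open Source", 5.52),
    Row("Canary-Qwen-2.5B", "Open Source", 5.63),
    Row("Granite Speech 3.3 8B", "Open Source", 5.74),
    Row("Qwen3-ASR-1.7B", "Open Source", 5.76),
    Row("ElevenLabs Scribe v2", "Cloud API", 5.83),
    Row("Phi-4 Multimodal Instruct", "Open Source", 6.02),
    Row("Parakeet TDT 0.6B v2", "Open Source", 6.05),
    Row("AssemblyAI Universal-3 Pro", "Cloud API", 6.21),
    Row("Canary 1B", "Open Source", 6.50),
    Row("Voxtral Small 24B", "Open Source", 6.62),
    Row("Google Chirp 3", "Cloud API", 6.63),
    Row("Parakeet TDT 1.1B", "Open Source", 7.02),
    Row("Voxtral Mini 3B", "Open Source", 7.05),
    Row("Whisper Large v3", "Open Source", 7.44),
    Row("Whisper Large v3 Turbo", "Open Source", 7.83),
]


def best_by_kind(rows: list[Row]) -> dict[str, Row]:
    """Return the lowest-mean-WER row for each kind; ties keep the first row listed."""
    best: dict[str, Row] = {}
    for row in rows:
        current = best.get(row.kind)
        if current is None or row.mean_wer < current.mean_wer:
            best[row.kind] = row
    return best


for kind, row in best_by_kind(ROWS).items():
    print(f"{kind}: {row.model} ({row.mean_wer:.2f}%)")
```

On these rows the helper returns Granite Speech 4.1 2B for Open Source and Pulse Pro for Cloud API.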

§ 02 · LibriSpeech

LibriSpeech ASR SOTA, in context.

LibriSpeech test-clean is a classic ASR benchmark, but it no longer separates the strongest speech-to-text systems by itself. Current frontier systems cluster around very low WER on clean audiobook speech, so CodeSOTA uses the broader HF Open ASR Leaderboard as the headline ranking and treats LibriSpeech as one diagnostic slice.

Use LibriSpeech when your workload is clean, read English audio. For meetings, calls, accents, long-form podcasts, or noisy streaming, check AMI, Earnings-22, GigaSpeech, TED-LIUM, VoxPopuli and latency features before choosing a model.

LibriSpeech test-clean · 34 systems · ranked low-to-high

#	Model	Vendor	Kind	LS Clean WER
01	Cohere Transcribe (Mar 2026)	Cohere	Open Source	1.25%
02	Higgs Audio v3 8B STT v2	Boson AI	Open Source	1.27%
03	Granite Speech 4.1 2B (NAR)	IBM	Open Source	1.28%
04	Granite Speech 4.1 2B	IBM	Open Source	1.33%
05	Parakeet TDT 1.1B	NVIDIA	Open Source	1.40%
06	Granite 4.0 1B Speech	IBM	Open Source	1.42%
07	Granite Speech 3.3 8B	IBM	Open Source	1.43%
08	Parakeet RNNT 1.1B	NVIDIA	Open Source	1.45%
09	Canary 1B	NVIDIA	Open Source	1.48%
10	Canary 1B Flash	NVIDIA	Open Source	1.48%
11	AssemblyAI Universal-3 Pro	AssemblyAI	Cloud API	1.53%
12	Granite Speech 3.3 2B	IBM	Open Source	1.53%
13	ElevenLabs Scribe v2	ElevenLabs	Cloud API	1.54%
14	Voxtral Small 24B	Mistral AI	Open Source	1.59%
15	Canary-Qwen-2.5B	NVIDIA	Open Source	1.61%
16	Zoom Scribe v1	Zoom	Cloud API	1.63%
17	Qwen3-ASR-1.7B	Alibaba	Open Source	1.63%
18	Phi-4 Multimodal Instruct	Microsoft	Open Source	1.69%
19	Parakeet TDT 0.6B v2	NVIDIA	Open Source	1.69%
20	Pulse Pro	Smallest AI	Cloud API	1.80%
21	Voxtral Mini 3B	Mistral AI	Open Source	1.88%
22	Conformer XL	Google	Research	2.00%
23	Whisper Large v3	OpenAI	Open Source	2.01%
24	Google Chirp 3	Google	Cloud API	2.04%
25	Whisper Large v3 Turbo	OpenAI	Open Source	2.10%
26	Deepgram Nova-3	Deepgram	Cloud API	2.20%
27	Voxtral Large	Mistral AI	Cloud API	2.30%
28	Gladia v2	Gladia	Cloud API	2.50%
29	Speechmatics Flow	Speechmatics	Cloud API	2.60%
30	Groq Whisper	Groq	Cloud API	2.70%
31	Google USM	Google	Cloud API	2.80%
32	Azure Speech	Microsoft	Cloud API	3.00%
33	Moonshine Base	Useful Sensors	Open Source	3.50%
34	wav2vec 2.0	Meta	Open Source	3.80%

LibriSpeech test-clean WER from the HF Open ASR Leaderboard. Lower is better. The frontier is saturated near 1.3%, so treat small gaps between systems below 2% WER as noise and rank on the eight-dataset mean WER above. Cloud-API rows show vendor- or AA-reported figures.
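One way to see why LibriSpeech alone is a weak ranking signal is to compare each model's position by LS Clean with its position by mean WER. The sketch below does this for six rows copied from the leaderboard table; the `ranks` helper and the choice of rows are assumptions made for illustration.

```python
# (model, LS Clean WER %, Mean WER %) for a subset of rows from the leaderboard table.
ROWS = [
    ("Granite Speech 4.1 2B", 1.33, 5.33),
    ("Cohere Transcribe (Mar 2026)", 1.25, 5.42),
    ("Pulse Pro", 1.80, 5.42),
    ("Canary 1B", 1.48, 6.50),
    ("Parakeet TDT 1.1B", 1.40, 7.02),
    ("Whisper Large v3", 2.01, 7.44),
]


def ranks(rows, column):
    """Map model name to its 1-based position when sorted ascending by one column.

    sorted() is stable, so tied values keep their order in ROWS.
    """
    ordered = sorted(rows, key=lambda row: row[column])
    return {row[0]: position for position, row in enumerate(ordered, start=1)}


clean_rank = ranks(ROWS, 1)
mean_rank = ranks(ROWS, 2)

# Positive shift: the model places better on mean WER than on LibriSpeech clean.
shifts = {model: clean_rank[model] - mean_rank[model] for model in clean_rank}

for model, shift in shifts.items():
    print(f"{model}: clean #{clean_rank[model]}, mean #{mean_rank[model]}, shift {shift:+d}")
```

Within this subset, Pulse Pro moves up two places on mean WER while Parakeet TDT 1.1B moves down two, even though both sit below 2% on clean speech.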

§ 03 · Decision

Open-source ASR or API?

Choosing a SOTA speech-to-text model is not only about the lowest WER. It is a trade-off between accuracy, latency, privacy, language coverage and operations.

Choose open-source when

You need control and unit economics.

Run Granite Speech or Canary-Qwen for top accuracy, Whisper Large v3 Turbo for multilingual coverage, Parakeet TDT for the fastest inference, or Moonshine for edge devices. This is usually the right path for private audio, high volume and custom deployment.

Choose an API when

You need product features now.

Deepgram Nova-3, AssemblyAI, Gladia, Speechmatics, Groq Whisper and Voxtral are better fits when streaming, diarization, hosted scaling, translation or audio-LLM behavior matters more than self-hosting the ASR stack.
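The two lists above can be turned into a rough first-pass check. The sketch below counts how many stated needs fall on each side and suggests a path. The need labels, the `suggest_path` helper and the simple vote count are all assumptions made for this example, not a rule from the leaderboard; the model names come from the text above.

```python
# Needs that the text associates with self-hosting an open-source model.
SELF_HOST_DRIVERS = {"private_audio", "high_volume", "custom_deployment"}

# Needs that the text associates with hosted APIs.
API_FEATURES = {"streaming", "diarization", "hosted_scaling", "translation", "audio_llm"}

# Open-source starting points named in the text, keyed by priority.
OPEN_SOURCE_PICKS = {
    "accuracy": ["Granite Speech", "Canary-Qwen"],
    "multilingual": ["Whisper Large v3 Turbo"],
    "fast_inference": ["Parakeet TDT"],
    "edge": ["Moonshine"],
}


def suggest_path(needs: set[str]) -> str:
    """Return 'open-source', 'api' or 'either' from a simple vote count (illustrative heuristic)."""
    self_host_votes = len(needs & SELF_HOST_DRIVERS)
    api_votes = len(needs & API_FEATURES)
    if self_host_votes > api_votes:
        return "open-source"
    if api_votes > self_host_votes:
        return "api"
    return "either"


print(suggest_path({"private_audio", "high_volume"}))
print(suggest_path({"streaming", "diarization"}))
print(OPEN_SOURCE_PICKS["edge"])
```

A tie returns "either" on purpose: when needs pull both ways, latency tests and a cost estimate on your own audio should decide, not a vote count.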
