SOTA speech-to-text · OSS ASR · API models
Updated May 7, 2026
§ 00 · Direct answer

STT leaderboard: the speech-to-text SOTA.

The short answer: Granite Speech 4.1 2B is the current SOTA entry in this STT leaderboard, with 5.33% mean WER on the HF Open ASR Leaderboard. Pulse Pro leads among hosted APIs on accuracy, while Deepgram Nova-3 remains the go-to for low-latency streaming. For an open-source ASR model, start with Granite Speech 4.1 2B, or with Whisper Large v3 Turbo if multilingual coverage matters most.

LibriSpeech is still useful for clean-speech comparison, but it is saturated near the frontier. Use the eight-dataset Open ASR mean WER as the primary ranking, then check LibriSpeech clean/other when your workload is close to clean, read English speech.
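For readers who want to see what a mean WER figure is made of, here is a minimal sketch. It assumes WER is the word-level edit distance divided by the number of reference words, pooled per dataset, and that the headline number is an unweighted mean over datasets. The function names are hypothetical, and the text normalization that real leaderboards apply before scoring is deliberately left out.

```python
from typing import Dict, List, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Row-by-row Levenshtein distance over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1], len(ref)


def dataset_wer(pairs: List[Tuple[str, str]]) -> float:
    """Corpus-level WER in percent: total errors / total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words  # assumes at least one reference word


def mean_wer(per_dataset: Dict[str, float]) -> float:
    """Unweighted (macro) mean of per-dataset WER values, in percent."""
    return sum(per_dataset.values()) / len(per_dataset)
```

Averaging per dataset rather than pooling every utterance keeps one large corpus from dominating the score, which is the reason a broad mean is more informative than a single saturated benchmark.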

§ 01 · Leaderboard

ASR models, ranked by mean WER.

Lower mean WER is better. Rows are ranked by mean WER across the eight HF Open ASR Leaderboard datasets, with each model's LibriSpeech test-clean WER shown alongside for a clean-speech reference point.

| # | Model | Vendor | Kind | Architecture | Params | LS Clean | Mean WER |
|---|---|---|---|---|---|---|---|
| 01 | Granite Speech 4.1 2B | IBM | Open Source | Speech-aware LLM (Granite) | 2B | 1.33% | 5.33% |
| 02 | Cohere Transcribe (Mar 2026) | Cohere | Open Source | Transformer ASR | 2B | 1.25% | 5.42% |
| 03 | Pulse Pro | Smallest AI | Cloud API | Proprietary ASR | | 1.80% | 5.42% |
| 04 | Zoom Scribe v1 | Zoom | Cloud API | Proprietary | | 1.63% | 5.47% |
| 05 | Granite 4.0 1B Speech | IBM | Open Source | Speech-aware LLM (Granite) | 1B | 1.42% | 5.52% |
| 06 | Canary-Qwen-2.5B | NVIDIA | Open Source | FastConformer encoder + Qwen2 LM decoder | 2.5B | 1.61% | 5.63% |
| 07 | Granite Speech 3.3 8B | IBM | Open Source | Speech-aware LLM (Granite) | 8B | 1.43% | 5.74% |
| 08 | Qwen3-ASR-1.7B | Alibaba | Open Source | Qwen3 backbone fine-tuned for ASR | 1.7B | 1.63% | 5.76% |
| 09 | ElevenLabs Scribe v2 | ElevenLabs | Cloud API | Proprietary | | 1.54% | 5.83% |
| 10 | Phi-4 Multimodal Instruct | Microsoft | Open Source | Phi-4 multimodal | 6B | 1.69% | 6.02% |
| 11 | Parakeet TDT 0.6B v2 | NVIDIA | Open Source | FastConformer (TDT) | 0.6B | 1.69% | 6.05% |
| 12 | AssemblyAI Universal-3 Pro | AssemblyAI | Cloud API | Proprietary Conformer-based | | 1.53% | 6.21% |
| 13 | Canary 1B | NVIDIA | Open Source | FastConformer + multi-task | 1B | 1.48% | 6.50% |
| 14 | Voxtral Small 24B | Mistral AI | Open Source | Large multimodal LM with audio encoder | 24B | 1.59% | 6.62% |
| 15 | Google Chirp 3 | Google | Cloud API | Generative (USM-based) | | 2.04% | 6.63% |
| 16 | Parakeet TDT 1.1B | NVIDIA | Open Source | FastConformer (TDT) | 1.1B | 1.40% | 7.02% |
| 17 | Voxtral Mini 3B | Mistral AI | Open Source | Audio-Language Model (Transformer) | 3B | 1.88% | 7.05% |
| 18 | Whisper Large v3 | OpenAI | Open Source | Transformer Encoder-Decoder | 1.55B | 2.01% | 7.44% |
| 19 | Whisper Large v3 Turbo | OpenAI | Open Source | Transformer Encoder-Decoder (pruned decoder) | 809M | 2.10% | 7.83% |
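The ranking rule can be sketched in a few lines. The example below copies five rows from the table above, sorts them by mean WER, and filters by kind to find the leading open-source and hosted entries. The `Row` type and `ranked` helper are hypothetical names, and breaking ties on LibriSpeech clean WER is an assumption made for the sketch, since the page does not say how ties are ordered.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    model: str
    kind: str        # "Open Source" or "Cloud API"
    ls_clean: float  # LibriSpeech test-clean WER, percent
    mean_wer: float  # mean WER over the eight datasets, percent


# A handful of rows copied from the leaderboard table.
ROWS = [
    Row("Whisper Large v3 Turbo", "Open Source", 2.10, 7.83),
    Row("Pulse Pro", "Cloud API", 1.80, 5.42),
    Row("Granite Speech 4.1 2B", "Open Source", 1.33, 5.33),
    Row("Cohere Transcribe (Mar 2026)", "Open Source", 1.25, 5.42),
    Row("Zoom Scribe v1", "Cloud API", 1.63, 5.47),
]


def ranked(rows: List[Row], kind: Optional[str] = None) -> List[Row]:
    """Sort by mean WER ascending, optionally keeping only one kind."""
    kept = [r for r in rows if kind is None or r.kind == kind]
    # Assumption: ties on mean WER are broken by LibriSpeech clean WER.
    return sorted(kept, key=lambda r: (r.mean_wer, r.ls_clean))


best_open = ranked(ROWS, "Open Source")[0]
best_api = ranked(ROWS, "Cloud API")[0]
print(best_open.model, best_api.model)
```

Filtering before sorting matters in practice: the overall leader is only useful if it matches the deployment model you can actually run.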
§ 02 · LibriSpeech

LibriSpeech ASR SOTA, in context.

LibriSpeech test-clean is a classic ASR benchmark, but it no longer separates the strongest speech-to-text systems by itself. Current frontier systems cluster around very low WER on clean audiobook speech, so CodeSOTA uses the broader HF Open ASR Leaderboard as the headline ranking and treats LibriSpeech as one diagnostic slice.

Use LibriSpeech when your workload is clean, read English audio. For meetings, calls, accents, long-form podcasts, or noisy streaming, check AMI, Earnings-22, GigaSpeech, TED-LIUM, VoxPopuli and latency features before choosing a model.
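One way to act on this advice is to re-score candidates with weights that match your workload instead of the flat eight-dataset mean. The sketch below is illustrative only: the dataset names come from the text, but the `PROFILES` table, its weights, and the `workload_wer` helper are hypothetical, and the per-dataset WER values must be supplied from the leaderboard for each model you compare.

```python
from typing import Dict

# Hypothetical weight profiles: which datasets to emphasise per workload.
# The weights are placeholders, not recommendations from the leaderboard.
PROFILES: Dict[str, Dict[str, float]] = {
    "meetings": {"AMI": 0.7, "Earnings-22": 0.3},
    "talks": {"TED-LIUM": 0.5, "VoxPopuli": 0.3, "GigaSpeech": 0.2},
    "read_english": {"LibriSpeech test-clean": 1.0},
}


def workload_wer(per_dataset_wer: Dict[str, float], workload: str) -> float:
    """Weighted mean WER (percent) over the datasets in a workload profile."""
    weights = PROFILES[workload]
    missing = [name for name in weights if name not in per_dataset_wer]
    if missing:
        raise KeyError(f"no WER supplied for: {', '.join(missing)}")
    total = sum(weights.values())
    return sum(per_dataset_wer[n] * w for n, w in weights.items()) / total
```

Two models with the same overall mean can differ noticeably under a profile like this, which is why the per-dataset columns deserve a look before committing.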

LibriSpeech test-clean · 34 systems · ranked low-to-high
| # | Model | Vendor | Kind | LS Clean WER |
|---|---|---|---|---|
| 01 | Cohere Transcribe (Mar 2026) | Cohere | Open Source | 1.25% |
| 02 | Higgs Audio v3 8B STT v2 | Boson AI | Open Source | 1.27% |
| 03 | Granite Speech 4.1 2B (NAR) | IBM | Open Source | 1.28% |
| 04 | Granite Speech 4.1 2B | IBM | Open Source | 1.33% |
| 05 | Parakeet TDT 1.1B | NVIDIA | Open Source | 1.40% |
| 06 | Granite 4.0 1B Speech | IBM | Open Source | 1.42% |
| 07 | Granite Speech 3.3 8B | IBM | Open Source | 1.43% |
| 08 | Parakeet RNNT 1.1B | NVIDIA | Open Source | 1.45% |
| 09 | Canary 1B | NVIDIA | Open Source | 1.48% |
| 10 | Canary 1B Flash | NVIDIA | Open Source | 1.48% |
| 11 | AssemblyAI Universal-3 Pro | AssemblyAI | Cloud API | 1.53% |
| 12 | Granite Speech 3.3 2B | IBM | Open Source | 1.53% |
| 13 | ElevenLabs Scribe v2 | ElevenLabs | Cloud API | 1.54% |
| 14 | Voxtral Small 24B | Mistral AI | Open Source | 1.59% |
| 15 | Canary-Qwen-2.5B | NVIDIA | Open Source | 1.61% |
| 16 | Zoom Scribe v1 | Zoom | Cloud API | 1.63% |
| 17 | Qwen3-ASR-1.7B | Alibaba | Open Source | 1.63% |
| 18 | Phi-4 Multimodal Instruct | Microsoft | Open Source | 1.69% |
| 19 | Parakeet TDT 0.6B v2 | NVIDIA | Open Source | 1.69% |
| 20 | Pulse Pro | Smallest AI | Cloud API | 1.80% |
| 21 | Voxtral Mini 3B | Mistral AI | Open Source | 1.88% |
| 22 | Conformer XL | Google | Research | 2.00% |
| 23 | Whisper Large v3 | OpenAI | Open Source | 2.01% |
| 24 | Google Chirp 3 | Google | Cloud API | 2.04% |
| 25 | Whisper Large v3 Turbo | OpenAI | Open Source | 2.10% |
| 26 | Deepgram Nova-3 | Deepgram | Cloud API | 2.20% |
| 27 | Voxtral Large | Mistral AI | Cloud API | 2.30% |
| 28 | Gladia v2 | Gladia | Cloud API | 2.50% |
| 29 | Speechmatics Flow | Speechmatics | Cloud API | 2.60% |
| 30 | Groq Whisper | Groq | Cloud API | 2.70% |
| 31 | Google USM | Google | Cloud API | 2.80% |
| 32 | Azure Speech | Microsoft | Cloud API | 3.00% |
| 33 | Moonshine Base | Useful Sensors | Open Source | 3.50% |
| 34 | wav2vec 2.0 | Meta | Open Source | 3.80% |

LibriSpeech test-clean WER is taken from the HF Open ASR Leaderboard. Lower is better. The frontier is saturated near 1.3%, so treat gaps between systems below 2% WER as noise and rank on the eight-dataset mean WER above. Cloud-API rows show vendor- or AA-reported figures.

§ 03 · Decision

Open-source ASR or API?

The SOTA speech-to-text choice is not only about the lowest WER. It is a trade-off among accuracy, latency, privacy, language coverage and operations.

Choose open-source when

You need control and unit economics.

Run Granite Speech or Canary-Qwen for top accuracy, Whisper Large v3 Turbo for multilingual coverage, Parakeet TDT for the fastest inference, or Moonshine for edge devices. This is usually the right path for private audio, high volume and custom deployment.

Choose an API when

You need product features now.

Deepgram Nova-3, AssemblyAI, Gladia, Speechmatics, Groq Whisper and Voxtral are better fits when streaming, diarization, hosted scaling, translation or audio-LLM behavior matters more than self-hosting the ASR stack.
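The two lists above can be turned into a rough checklist. The sketch below is a hypothetical helper, not a rule from the leaderboard: it counts how many self-hosting reasons and how many hosted-feature reasons apply, then returns whichever side has more. The field names and the simple vote are assumptions made for illustration, and a real decision would weigh these needs rather than count them.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    # Reasons the text gives for self-hosting an open-source model.
    private_audio: bool = False
    high_volume: bool = False
    custom_deployment: bool = False
    # Reasons the text gives for choosing a hosted API.
    streaming: bool = False
    diarization: bool = False
    hosted_scaling: bool = False
    translation: bool = False


def recommend(needs: Needs) -> str:
    """Return "open-source", "api", or "either" using a simple vote."""
    self_host = sum([needs.private_audio, needs.high_volume,
                     needs.custom_deployment])
    hosted = sum([needs.streaming, needs.diarization,
                  needs.hosted_scaling, needs.translation])
    if self_host > hosted:
        return "open-source"
    if hosted > self_host:
        return "api"
    return "either"
```

A call such as `recommend(Needs(private_audio=True, high_volume=True))` returns `"open-source"`, while `recommend(Needs(streaming=True))` returns `"api"`.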

§ 04 · Next reads

Related speech pages.

Speech hub · Speech-to-text · Speech recognition guide · LibriSpeech benchmark