
Audio · last verified 2026-04

Speech Recognition (ASR).

The other half of the voice stack. Pair an ASR with a text-to-speech model and a fast LLM and you have a voice agent. In 2026 the ASR market has bifurcated: low-latency streaming (Deepgram, AssemblyAI, Google Chirp 3) for real-time voice, and high-accuracy batch (Whisper Large v3, NVIDIA Canary, Speechmatics) for transcription and captioning. Below: 13 providers compared on cost per hour, streaming latency, language coverage, diarization, word timestamps, and custom vocabulary.


§ 01 · The matrix

13 providers, side by side.

Frontier API · hyperscaler cloud · open weights. Pricing normalised to dollars per hour of audio.


Provider / Model	Tier	License	Cost / hr	Latency	Langs	Diarization	Word timestamps	Custom vocab
OpenAI Whisper API · Whisper Large v3 Turbo	Frontier	Proprietary API	$0.36/hr	Batch only	99	—	✓	—
Deepgram Nova 3	Frontier	Proprietary API	$0.26/hr	~300 ms	36+	✓	✓	✓
AssemblyAI Universal-2	Frontier	Proprietary API	$0.37/hr	~400 ms	99+	✓	✓	✓
Speechmatics Ursa 2 · Enhanced	Frontier	Proprietary API	~$1.20/hr	~700 ms	50+	✓	✓	✓
Rev AI Reverb · Machine + Human	Frontier	Proprietary API	$1.20/hr	~500 ms	36+	✓	✓	✓
Hume EVI 3 · Voice ASR	Frontier	Proprietary API	Per-minute	Realtime	30+	✓	✓	—
Google Cloud Chirp 3 · Speech-to-Text v2	Cloud	Proprietary API	$1.44/hr	~500 ms	125+	✓	✓	✓
Microsoft Azure Speech to Text · Standard / Enhanced	Cloud	Proprietary API	$1.00–1.40/hr	~500 ms	140+ locales	✓	✓	✓
Amazon Web Services Transcribe · Standard / Medical / Call Analytics	Cloud	Proprietary API	$1.44/hr	~600 ms	100+	✓	✓	✓
OpenAI (open) Whisper Large v3 (open)	Open	Open weights	Self-host	Batch (GPU)	99	—	✓	—
NVIDIA NeMo Canary 1B Flash · Parakeet TDT	Open	Open weights	Self-host	GPU-dependent	4 (en, de, es, fr) · Parakeet en-only	—	✓	—
Hugging Face (open) Distil-Whisper Large v3	Open	Open weights	Self-host	Streaming-capable	English (primary)	—	✓	—
Useful Sensors (open) Moonshine Tiny / Base	Open	Open weights	Self-host	Sub-100 ms (CPU)	English	—	✓	—

Pricing is list-price per hour of audio as of 2026-04. Most vendors price per minute or per 1000 minutes; figures are converted here for comparability. Streaming and add-ons (diarization, custom models) often carry surcharges.
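The normalisation behind the Cost / hr column is plain arithmetic. A minimal sketch follows; the function names and the per-minute inputs are illustrative assumptions chosen to reproduce figures that appear in the table, not quoted vendor rate cards.

```python
def per_minute_to_per_hour(price_per_minute: float) -> float:
    """Convert a per-minute list price to dollars per hour of audio."""
    return round(price_per_minute * 60, 2)


def per_1000_minutes_to_per_hour(price_per_1000_minutes: float) -> float:
    """Convert a per-1000-minutes list price to dollars per hour of audio."""
    return round(price_per_1000_minutes / 1000 * 60, 2)


# Hypothetical inputs, picked only so the results match rows in the table.
print(per_minute_to_per_hour(0.006))      # 0.36
print(per_minute_to_per_hour(0.024))      # 1.44
print(per_1000_minutes_to_per_hour(6.0))  # 0.36
```

Any surcharge for streaming or diarization would need to be added to the per-minute figure before converting.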

§ 02 · Which should I use?

Streaming or batch first.

ASR selection is shaped by one binary first — streaming or batch — and four secondary axes: accuracy on YOUR audio, language coverage, custom vocabulary, and license. Shortcuts by use case:

Best accuracy (batch / long-form)

Speechmatics · NVIDIA Canary 1B · Whisper Large v3

Speechmatics tops independent accented-English benchmarks; Canary leads the HF Open ASR leaderboard; Whisper Large v3 is the universal multilingual baseline.

Lowest latency (real-time voice agents)

Deepgram Nova 3 · AssemblyAI · Google Chirp 3

Sub-500 ms streaming with stable partials. Deepgram is the price-performance leader; pair with a TTS like Cartesia or ElevenLabs Turbo to close the voice loop.

Cheapest at scale

Deepgram Nova 3 · OpenAI Whisper API · self-hosted Distil-Whisper

Deepgram at $0.26/hr is the lowest hosted price at frontier accuracy. Whisper API is $0.36/hr. Self-hosting Distil-Whisper drops below $0.10/hr if you can fill a GPU.
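Whether self-hosting undercuts a hosted price depends on throughput and on how busy the GPU stays. A rough sketch under stated assumptions: the GPU rental price, the speed-up over real time, and the utilisation figures are hypothetical placeholders, not measurements of Distil-Whisper.

```python
def self_host_cost_per_audio_hour(gpu_cost_per_hour, speedup, utilization):
    """Dollars per hour of audio transcribed on one rented GPU.

    speedup: hours of audio processed per GPU-hour when fully busy.
    utilization: fraction of paid GPU time spent transcribing (0 < u <= 1).
    """
    if speedup <= 0 or not 0 < utilization <= 1:
        raise ValueError("speedup must be > 0 and utilization in (0, 1]")
    return gpu_cost_per_hour / (speedup * utilization)


def break_even_utilization(gpu_cost_per_hour, speedup, hosted_price_per_hour):
    """Utilization above which self-hosting is cheaper than the hosted price."""
    return gpu_cost_per_hour / (speedup * hosted_price_per_hour)


# Hypothetical setup: $1.00/hr GPU running at 50x real time.
busy = self_host_cost_per_audio_hour(1.00, 50, 0.80)  # 0.025 $/audio-hour
idle = self_host_cost_per_audio_hour(1.00, 50, 0.05)  # 0.40 $/audio-hour
threshold = break_even_utilization(1.00, 50, 0.26)    # roughly 0.077
print(f"busy={busy:.3f} idle={idle:.3f} break-even={threshold:.3f}")
```

With these made-up numbers a mostly idle GPU costs more per audio hour than the $0.26/hr hosted price, which is the point of "if you can fill a GPU".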

On-prem / compliance / air-gapped

Whisper Large v3 (MIT) · NVIDIA Canary (CC-BY-4.0) · Speechmatics on-prem

Whisper / Canary / Distil-Whisper run on your hardware. Speechmatics has the most mature commercial on-prem deployment if you want a vendor SLA without the cloud.

Edge / on-device ASR

Moonshine · Distil-Whisper Tiny · whisper.cpp

Moonshine runs on a Raspberry Pi in real time. Distil-Whisper Tiny + whisper.cpp run in-browser via WASM. The picks when you can't or won't ship audio to the cloud.

Multilingual (50+ languages)

Whisper Large v3 · Azure Speech · Google Chirp 3 · Common Voice fine-tunes

Whisper covers 99 languages out-of-the-box. Azure leads on locale variants (140+). For low-resource languages, fine-tune Whisper on Common Voice rather than waiting for a vendor.

Must-be-perfect transcripts (legal, broadcast)

Rev AI human · AssemblyAI + human review

When 95% word accuracy (a 5% WER) isn't good enough, machine-only ASR is the wrong tool. Rev AI ships a same-API human-transcription path; AssemblyAI's enterprise plans bundle review.

Domain jargon (medical, legal, finance, code)

Deepgram + keyword boosting · AssemblyAI custom vocab · AWS Transcribe Medical

Custom vocabulary lift is real and easy to A/B. Deepgram and AssemblyAI both expose runtime keyword lists. AWS Transcribe Medical is purpose-trained on clinical audio.
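One way to run that A/B is to score recall on your domain terms over the same clips, once without and once with the keyword list. A minimal sketch, assuming you already hold both sets of transcripts as strings; the terms, transcripts, and function name are made up for illustration and no vendor API is called.

```python
import re


def term_recall(terms, references, hypotheses):
    """Fraction of reference term mentions that also appear in the hypothesis.

    Matching is case-insensitive on whole words; hits per clip are capped at
    the number of mentions in that clip's reference transcript.
    """
    expected = found = 0
    for ref, hyp in zip(references, hypotheses):
        for term in terms:
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            n_ref = len(pattern.findall(ref))
            n_hyp = len(pattern.findall(hyp))
            expected += n_ref
            found += min(n_ref, n_hyp)
    if expected == 0:
        raise ValueError("no term mentions in the references")
    return found / expected


# Hypothetical domain terms and transcripts.
terms = ["metformin", "Xarelto"]
references = ["start metformin today", "switch to Xarelto", "metformin twice daily"]
baseline = ["start met forming today", "switch to Xarelto", "metformin twice daily"]
boosted = ["start metformin today", "switch to Xarelto", "metformin twice daily"]

recall_a = term_recall(terms, references, baseline)  # 2 of 3 mentions found
recall_b = term_recall(terms, references, boosted)   # 3 of 3 mentions found
print(f"baseline={recall_a:.2f} boosted={recall_b:.2f}")
```

Overall WER barely moves in a case like this, which is why term recall is worth tracking as its own number.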

§ 03 · What to actually test

Vendor demos lie.

Vendor WER numbers are typically reported on LibriSpeech test-clean: read audio, single speaker, studio mic. Real production audio is none of those things. Build a 30-minute evaluation set that stresses the six failure modes below; most providers stratify sharply on them.

Score not just WER but diarization error rate, proper-noun recall, and tail-token accuracy. A model that gets 98% of words right but misspells the CEO’s name on every call is a worse product than one with 95% word accuracy (5% WER) and a clean entity layer.

Heavy accents

Indian, Nigerian, Scottish, regional US English. Most ASR is trained heavily on US/UK newsreader voices and degrades 5–15 WER points on accents.

Multiple speakers / overlap

Two-person call where speakers interrupt. Diarization (who said what) is dramatically harder than transcription. Test it as a separate metric.

Noisy environments

Cafe background, traffic, phone-call audio (8 kHz μ-law). Frontier models trained on clean podcasts collapse on real call-center audio. Test on YOUR channel.

Domain jargon

Medical drug names, legal Latin, ticker symbols, code identifiers, brand names. Out-of-vocabulary words drop hard — custom vocabulary lift is real and worth A/B testing.

Long-form audio (>30 min)

Whisper-style models built for 30-second windows can drift on hour-long meetings. Test WER at minute 5 vs minute 55 — they’re often different.

Code-switching

English + Spanish mid-sentence (or any L1+L2 mix) is common in real life and brutal for most ASR. Multilingual-by-design models (Whisper, Chirp 3) win here.
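To see how a provider stratifies across the failure modes above, WER can be computed per slice rather than as one pooled number. A minimal sketch, assuming each clip in your evaluation set is tagged with a slice name; the slice names and transcripts are invented, and text normalisation is reduced to lowercasing and whitespace splitting.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str):
    """Return (word-level edit distance, reference length)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_slice(samples):
    """samples: iterable of (slice_name, reference, hypothesis) tuples."""
    errors, words = defaultdict(int), defaultdict(int)
    for name, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[name] += e
        words[name] += n
    # Slices with no reference words are skipped to avoid dividing by zero.
    return {name: errors[name] / words[name] for name in words if words[name]}


# Hypothetical evaluation clips.
samples = [
    ("accent", "the quarterly revenue rose", "the quarterly revenue rose"),
    ("noise", "please hold the line", "please hold line"),
    ("jargon", "prescribe ten milligrams of apixaban",
     "prescribe ten milligrams of a pixie band"),
]
report = wer_by_slice(samples)
print(report)
```

The same grouping works for the long-form check: tag clips by their position in the recording (for example "minute 5" and "minute 55") and compare the two slices.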

§ 04 · Why LibriSpeech lags

Saturated since 2024.

LibriSpeech (2015) is the canonical English ASR benchmark — read audiobook excerpts, clean studio audio, single speaker. For a decade it was the only number in town. In 2026 every frontier ASR is sub-3% WER on test-clean and sub-5% on test-other. Human ceiling is around 2%.

At that point the benchmark stops measuring model quality and starts measuring over-fitting to read-audio quirks. A 0.3-point WER delta on LibriSpeech is noise.

The interesting evals in 2026 are the HF Open ASR Leaderboard (which averages eight datasets — LibriSpeech, AMI, Earnings22, GigaSpeech, SPGISpeech, TED-LIUM, VoxPopuli, Common Voice — to resist single-set over-fitting), DER on AMI (real meeting diarization), and call-center test sets you build from your own audio.

Vendor-quoted WER is almost always test-clean. Treat it as table stakes, not a ranking signal. Always benchmark on your own audio.
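The effect of averaging over several datasets can be illustrated with a small sketch. It assumes an unweighted mean over the eight datasets named above, and every WER value in it is invented for illustration; none is a leaderboard result.

```python
from statistics import mean

DATASETS = [
    "LibriSpeech", "AMI", "Earnings22", "GigaSpeech",
    "SPGISpeech", "TED-LIUM", "VoxPopuli", "Common Voice",
]


def macro_wer(per_dataset_wer):
    """Unweighted mean WER (in percent) over all eight datasets."""
    missing = [d for d in DATASETS if d not in per_dataset_wer]
    if missing:
        raise ValueError(f"missing datasets: {missing}")
    return mean(per_dataset_wer[d] for d in DATASETS)


# Hypothetical WER percentages. Model A wins LibriSpeech by 0.3 points.
model_a = {"LibriSpeech": 1.7, "AMI": 17.0, "Earnings22": 13.0, "GigaSpeech": 10.5,
           "SPGISpeech": 3.0, "TED-LIUM": 4.0, "VoxPopuli": 7.0, "Common Voice": 10.0}
model_b = {"LibriSpeech": 2.0, "AMI": 15.0, "Earnings22": 11.5, "GigaSpeech": 10.0,
           "SPGISpeech": 2.8, "TED-LIUM": 3.8, "VoxPopuli": 6.5, "Common Voice": 9.0}

avg_a, avg_b = macro_wer(model_a), macro_wer(model_b)
print(f"A={avg_a:.3f} B={avg_b:.3f}")
```

In this made-up case the model that wins LibriSpeech loses the average, so ranking on the single saturated set would pick the wrong one.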

§ 05 · Reference datasets & leaderboards

Academic standards.

Useful for academic comparison and open-weights training. Frontier API providers don’t train solely on these — they use proprietary multi-thousand-hour corpora covering call audio, meetings, and accented speech.

LibriSpeech

1,000 hours · read English · audiobooks · 2015

The canonical English ASR benchmark. Clean studio audio, single speakers reading public-domain books. Saturated — frontier models hit sub-3% WER on test-clean. Still the default training set for academic ASR.


Common Voice

30,000+ hours · 100+ languages · crowdsourced · 2019

Mozilla’s ongoing multilingual speech corpus. The go-to for low-resource language fine-tuning. Audio quality varies widely (laptop mics, phones, accents) which is closer to production than LibriSpeech.


TED-LIUM 3

452 hours · TED talks · English · 2018

TED talk audio with human transcripts. Tests prepared but expressive speech, mid-quality auditorium mics, and speaker variety. Common evaluation set in the HF Open ASR Leaderboard average.


VoxPopuli

400K hours · 23 EU languages · parliamentary · 2021

European Parliament recordings — formal multilingual speech with accented English, code-switching, and microphone variability. The default for multilingual ASR research outside English.


Earnings22

125 hours · finance domain · 27 accents · 2022

Earnings call transcripts with deliberately diverse speaker accents and dense domain jargon. Built specifically to stress-test ASR on real-world business audio. Brutal benchmark — frontier WER often 2–3× LibriSpeech.


HF Open ASR Leaderboard

Avg of 8 English ASR datasets · 2024

The reference leaderboard for open-weights ASR. Averages WER across LibriSpeech, AMI, Earnings22, GigaSpeech, SPGISpeech, TED-LIUM, VoxPopuli, and Common Voice — resistant to single-dataset over-fitting. NVIDIA Canary and Whisper Large v3 trade #1.


§ 06 · Practical tips for 2026

Five rules.

Streaming sacrifices accuracy for latency. Real-time partials add roughly 3–5% WER vs the same provider’s batch endpoint — the model has less context per chunk and can’t look ahead. Worth it for voice agents; not worth it for transcription.

Diarization is a separate problem from ASR. “Who said what” is harder than “what was said.” Even when diarization is bundled, test DER independently. For Whisper, the standard pipeline is pyannote.audio for speaker turns plus Whisper for transcription, followed by aligning the word timestamps to those turns.

Domain vocabulary lift is real. Deepgram keyword boosting and AssemblyAI custom vocabulary both materially reduce error on proper nouns, brand names, and jargon. It takes one afternoon to A/B and the lift is often 10–30% on tail tokens.

Open-weights Whisper is the wrong tool for streaming. Whisper was built for 30-second batched windows. For low-latency streaming use Distil-Whisper, Moonshine, or Parakeet TDT — or pay for a managed streaming API (Deepgram, AssemblyAI). Don’t hack 30s Whisper into a real-time loop and expect good UX.

Always benchmark on YOUR audio. Vendor numbers are LibriSpeech. Your audio is phone calls, accented speakers, background noise, and domain jargon. A two-hour eval set built from your own production audio is the only number that matters — and it routinely re-orders the rankings.
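For the diarization rule above (speaker turns from one model, words from another, then alignment), the alignment step can be sketched as a maximum-overlap assignment. This is one simple approach, not the method of any particular library: the `Word` and `Turn` records are hypothetical stand-ins for whatever your ASR and diarizer emit, and the timings are invented.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(word: Word, turn: Turn) -> float:
    """Seconds during which the word and the speaker turn coincide."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it most."""
    labelled = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word, t), default=None)
        if best is None or overlap(word, best) == 0.0:
            labelled.append(("unknown", word.text))
        else:
            labelled.append((best.speaker, word.text))
    return labelled


def to_utterances(labelled):
    """Merge consecutive words from the same speaker into one line."""
    lines = []
    for speaker, text in labelled:
        if lines and lines[-1][0] == speaker:
            lines[-1] = (speaker, lines[-1][1] + " " + text)
        else:
            lines.append((speaker, text))
    return lines


# Hypothetical word timestamps and speaker turns.
words = [Word("hello", 0.0, 0.4), Word("there", 0.4, 0.8), Word("hi", 1.0, 1.3)]
turns = [Turn("A", 0.0, 0.9), Turn("B", 0.9, 2.0)]
transcript = to_utterances(assign_speakers(words, turns))
print(transcript)
```

Words that straddle a turn boundary or fall in overlapped speech are where this kind of assignment goes wrong, which is one reason to measure DER separately from WER.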
