The other half of the voice stack. Pair an ASR with a text-to-speech model and a fast LLM and you have a voice agent. In 2026 the ASR market has bifurcated: low-latency streaming (Deepgram, AssemblyAI, Google Chirp 3) for real-time voice, and high-accuracy batch (Whisper Large v3, NVIDIA Canary, Speechmatics) for transcription and captioning. Below: 13 providers compared on cost per hour, streaming latency, language coverage, diarization, word timestamps, and custom vocabulary.
Frontier API · hyperscaler cloud · open weights. Pricing normalised to dollars per hour of audio.
| Provider / Model | Tier | License | Cost / hr | Latency | Langs | Diariz. | Word ts | Custom vocab |
|---|---|---|---|---|---|---|---|---|
| Whisper API | Frontier | Proprietary API | $0.36/hr | Batch only | 99 | — | ✓ | — |
| Deepgram | Frontier | Proprietary API | $0.26/hr | ~300 ms | 36+ | ✓ | ✓ | ✓ |
| AssemblyAI | Frontier | Proprietary API | $0.37/hr | ~400 ms | 99+ | ✓ | ✓ | ✓ |
| Speechmatics | Frontier | Proprietary API | ~$1.20/hr | ~700 ms | 50+ | ✓ | ✓ | ✓ |
| Rev AI | Frontier | Proprietary API | $1.20/hr | ~500 ms | 36+ | ✓ | ✓ | ✓ |
| H | Frontier | Proprietary API | Per-minute | Realtime | 30+ | ✓ | ✓ | — |
| Google Chirp 3 | Cloud | Proprietary API | $1.44/hr | ~500 ms | 125+ | ✓ | ✓ | ✓ |
| Azure | Cloud | Proprietary API | $1.00–1.40/hr | ~500 ms | 140+ locales | ✓ | ✓ | ✓ |
| AWS Transcribe | Cloud | Proprietary API | $1.44/hr | ~600 ms | 100+ | ✓ | ✓ | ✓ |
| Whisper Large v3 | Open | Open weights | Self-host | Batch (GPU) | 99 | — | ✓ | — |
| NVIDIA Canary / Parakeet | Open | Open weights | Self-host | GPU-dependent | 4 (en, de, es, fr) · Parakeet en-only | — | ✓ | — |
| Distil-Whisper | Open | Open weights | Self-host | Streaming-capable | English (primary) | — | ✓ | — |
| Moonshine | Open | Open weights | Self-host | Sub-100 ms (CPU) | English | — | ✓ | — |
Pricing is list-price per hour of audio as of 2026-04. Most vendors price per minute or per 1000 minutes, converted here for comparability. Streaming and add-ons (diarization, custom models) often carry surcharges.
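The per-minute to per-hour normalisation is plain arithmetic. A minimal sketch follows; the function names are hypothetical and the input rates are illustrative assumptions, not quotes from any vendor's price sheet.

```python
def per_hour_from_per_minute(usd_per_minute: float) -> float:
    """Convert a per-minute list price to dollars per hour of audio."""
    return usd_per_minute * 60


def per_hour_from_per_1000_minutes(usd_per_1000_minutes: float) -> float:
    """Convert a price quoted per 1000 minutes to dollars per hour of audio."""
    return usd_per_1000_minutes / 1000 * 60


def with_surcharges(base_per_hour: float, *surcharges_per_hour: float) -> float:
    """Add per-hour surcharges (for example diarization) to a base rate."""
    return base_per_hour + sum(surcharges_per_hour)


# Illustrative inputs only (assumptions, not quoted vendor prices).
print(f"${per_hour_from_per_minute(0.006):.2f}/hr")
print(f"${per_hour_from_per_1000_minutes(6.0):.2f}/hr")
print(f"${with_surcharges(0.26, 0.12):.2f}/hr")
```

When comparing quotes, it may be worth running the surcharge-inclusive figure, since the base rate alone can understate what a streaming or diarized workload costs.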
ASR selection is shaped by one binary first — streaming or batch — and four secondary axes: accuracy on YOUR audio, language coverage, custom vocabulary, and license. Shortcuts by use case:
Speechmatics tops independent accented-English benchmarks; Canary leads the HF Open ASR leaderboard; Whisper Large v3 is the universal multilingual baseline.
Sub-500 ms streaming with stable partials. Deepgram is the price-performance leader; pair with a TTS like Cartesia or ElevenLabs Turbo to close the voice loop.
Deepgram at $0.26/hr is the lowest hosted price at frontier accuracy. Whisper API is $0.36/hr. Self-hosting Distil-Whisper drops below $0.10/hr if you can fill a GPU.
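Whether self-hosting actually beats a hosted price depends on how busy the GPU is. The sketch below is a rough cost model under stated assumptions: the GPU rental price, the throughput (hours of audio transcribed per GPU-hour at full load), and the utilization are all hypothetical inputs you would measure yourself.

```python
def self_host_cost_per_audio_hour(
    gpu_usd_per_hour: float,
    audio_hours_per_gpu_hour: float,
    utilization: float,
) -> float:
    """Estimate dollars per hour of audio for a self-hosted model.

    audio_hours_per_gpu_hour: measured throughput at full load (assumption).
    utilization: fraction of paid GPU time spent transcribing, in (0, 1].
    """
    if audio_hours_per_gpu_hour <= 0:
        raise ValueError("throughput must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_usd_per_hour / (audio_hours_per_gpu_hour * utilization)


# Hypothetical numbers: a $1.00/hr GPU that transcribes 50x faster than real time.
busy = self_host_cost_per_audio_hour(1.00, 50, utilization=0.5)
idle = self_host_cost_per_audio_hour(1.00, 50, utilization=0.05)
print(f"half-full GPU: ${busy:.2f}/hr, mostly idle GPU: ${idle:.2f}/hr")
```

With these made-up inputs the half-utilized GPU comes in well under $0.10/hr, while the mostly idle one costs more than the cheapest hosted API, which is the "if you can fill a GPU" caveat in numbers.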
Whisper / Canary / Distil-Whisper run on your hardware. Speechmatics has the most mature commercial on-prem deployment if you want a vendor SLA without the cloud.
Moonshine runs on a Raspberry Pi in real time. Distil-Whisper Tiny + whisper.cpp run in-browser via WASM. The picks when you can't or won't ship audio to the cloud.
Whisper covers 99 languages out-of-the-box. Azure leads on locale variants (140+). For low-resource languages, fine-tune Whisper on Common Voice rather than waiting for a vendor.
When 95% word accuracy (5% WER) isn't good enough, machine-only ASR is the wrong tool. Rev AI ships a same-API human-transcription path; AssemblyAI's enterprise plans bundle review.
Custom vocabulary lift is real and easy to A/B. Deepgram and AssemblyAI both expose runtime keyword lists. AWS Transcribe Medical is purpose-trained on clinical audio.
Vendor WER numbers are reported on LibriSpeech test-clean: read audio, single speaker, studio mic. Real production audio is none of those things. Score not just WER but also diarization error rate, proper-noun recall, and tail-token accuracy. A model that gets 98% of words right but misspells the CEO’s name on every call is a worse product than one with 95% word accuracy and a clean entity layer.
Build a 30-minute evaluation set that stresses these six failure modes; most providers stratify sharply on them:
Indian, Nigerian, Scottish, regional US English. Most ASR is trained heavily on US/UK newsreader voices and degrades 5–15 WER points on accents.
Two-person call where speakers interrupt. Diarization (who said what) is dramatically harder than transcription. Test it as a separate metric.
Cafe background, traffic, phone-call audio (8 kHz μ-law). Frontier models trained on clean podcasts collapse on real call-center audio. Test on YOUR channel.
Medical drug names, legal Latin, ticker symbols, code identifiers, brand names. Out-of-vocabulary words drop hard — custom vocabulary lift is real and worth A/B testing.
Whisper-style models built for 30-second windows can drift on hour-long meetings. Test WER at minute 5 vs minute 55 — they’re often different.
English + Spanish mid-sentence (or any L1+L2 mix) is common in real life and brutal for most ASR. Multilingual-by-design models (Whisper, Chirp 3) win here.
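Scoring an evaluation set like this needs WER plus a separate proper-noun metric. The sketch below is one simple way to compute both, assuming lowercase whitespace tokenization and single-word entities; the `entity_recall` helper is a hypothetical illustration, not a standard metric implementation, and a production harness would add punctuation and number normalization.

```python
from collections import Counter


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def entity_recall(reference: str, hypothesis: str, entities: set[str]) -> float:
    """Share of entity mentions in the reference that the hypothesis also contains."""
    wanted = {e.lower() for e in entities}
    mentions = [w for w in reference.lower().split() if w in wanted]
    if not mentions:
        return 1.0
    available = Counter(hypothesis.lower().split())
    hits = 0
    for word in mentions:
        if available[word] > 0:
            hits += 1
            available[word] -= 1
    return hits / len(mentions)


# Made-up transcript pair: only the name is wrong, but it is wrong every time.
ref = "priya nakamura joined the call and priya approved the budget"
hyp = "prea nakamura joined the call and prea approved the budget"
print(f"WER {wer(ref, hyp):.0%}, entity recall {entity_recall(ref, hyp, {'priya', 'nakamura'}):.0%}")
```

In this invented example the transcript is 80% accurate by word count yet recovers only one of three name mentions, which is why the two numbers are worth tracking separately.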
LibriSpeech (2015) is the canonical English ASR benchmark — read audiobook excerpts, clean studio audio, single speaker. For a decade it was the only number in town. In 2026 every frontier ASR is sub-3% WER on test-clean and sub-5% on test-other. Human ceiling is around 2%.
At that point the benchmark stops measuring model quality and starts measuring over-fitting to read-audio quirks. A 0.3-point WER delta on LibriSpeech is noise.
The interesting evals in 2026 are the HF Open ASR Leaderboard (which averages eight datasets — LibriSpeech, AMI, Earnings22, GigaSpeech, SPGISpeech, TED-LIUM, VoxPopuli, Common Voice — to resist single-set over-fitting), DER on AMI (real meeting diarization), and call-center test sets you build from your own audio.
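To see why averaging across datasets resists single-set over-fitting, consider a simple unweighted mean of per-dataset WER. This is a sketch under the assumption that the average is unweighted; the model names and scores below are invented for illustration and are not leaderboard results.

```python
from statistics import mean


def macro_average_wer(per_dataset_wer: dict[str, float]) -> float:
    """Unweighted mean of per-dataset WER (in percent)."""
    if not per_dataset_wer:
        raise ValueError("need at least one dataset score")
    return mean(per_dataset_wer.values())


# Hypothetical scores: model_a is tuned to read audio, model_b is more balanced.
model_a = {"librispeech": 1.5, "ami": 16.0, "earnings22": 12.0}
model_b = {"librispeech": 2.0, "ami": 11.0, "earnings22": 10.0}

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    print(f"{name}: LibriSpeech {scores['librispeech']:.1f}%, "
          f"average {macro_average_wer(scores):.1f}%")
```

Here the model that wins on LibriSpeech alone loses on the average, because a half-point gain on read audio cannot offset several points lost on meetings and earnings calls.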
Vendor-quoted WER is almost always test-clean. Treat it as table stakes, not a ranking signal. Always benchmark on your own audio.
Useful for academic comparison and open-weights training. Frontier API providers don’t train solely on these — they use proprietary multi-thousand-hour corpora covering call audio, meetings, and accented speech.
LibriSpeech: the canonical English ASR benchmark. Clean studio audio, single speakers reading public-domain books. Saturated: frontier models hit sub-3% WER on test-clean. Still the default training set for academic ASR.
Common Voice: Mozilla’s ongoing multilingual speech corpus. The go-to for low-resource language fine-tuning. Audio quality varies widely (laptop mics, phones, accents), which is closer to production than LibriSpeech.
TED-LIUM: TED talk audio with human transcripts. Tests prepared but expressive speech, mid-quality auditorium mics, and speaker variety. Common evaluation set in the HF Open ASR Leaderboard average.
VoxPopuli: European Parliament recordings of formal multilingual speech, with accented English, code-switching, and microphone variability. The default for multilingual ASR research outside English.
Earnings22: earnings call transcripts with deliberately diverse speaker accents and dense domain jargon. Built specifically to stress-test ASR on real-world business audio. A brutal benchmark: frontier WER is often 2–3× LibriSpeech.
HF Open ASR Leaderboard: the reference leaderboard for open-weights ASR. Averages WER across LibriSpeech, AMI, Earnings22, GigaSpeech, SPGISpeech, TED-LIUM, VoxPopuli, and Common Voice, which makes it resistant to single-dataset over-fitting. NVIDIA Canary and Whisper Large v3 trade #1.
Streaming sacrifices accuracy for latency. Real-time partials add roughly 3–5% WER vs the same provider’s batch endpoint, because the model has less context per chunk and can’t look ahead. Worth it for voice agents; not worth it for transcription.
Diarization is a separate problem from ASR. “Who said what” is harder than “what was said.” Even when diarization is bundled, test DER independently. For Whisper, the standard pipeline is Pyannote-Audio for speaker turns + Whisper for transcription, then alignment.
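The alignment step at the end of that pipeline can be sketched without calling either library. The code below assumes you already have word timestamps from the transcriber and speaker turns from the diarizer; the `Word` and `Turn` classes are hypothetical stand-ins for those outputs, and maximum time overlap is one common assignment rule rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for word in words:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labelled.append((best_speaker, word.text))
    return labelled


# Made-up timings: the two turns overlap slightly around the 1.8 to 2.0 second mark.
turns = [Turn("A", 0.0, 2.0), Turn("B", 1.8, 4.0)]
words = [Word("hello", 0.2, 0.6), Word("there", 1.5, 1.9),
         Word("hi", 2.5, 2.8), Word("bye", 5.0, 5.2)]
print(assign_speakers(words, turns))
```

Words that fall outside every turn are labelled "unknown" rather than guessed, which keeps diarization gaps visible when you compute DER separately.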
Domain vocabulary lift is real. Deepgram keyword boosting and AssemblyAI custom vocabulary both materially reduce error on proper nouns, brand names, and jargon. It takes one afternoon to A/B and the lift is often 10–30% on tail tokens.
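One way to report that A/B result is as a relative reduction in tail-token errors. The sketch below assumes the lift is read as a relative error reduction and uses invented error counts; the function name is hypothetical.

```python
def relative_error_reduction(baseline_errors: int, boosted_errors: int) -> float:
    """Fraction of baseline tail-token errors removed by custom vocabulary."""
    if baseline_errors <= 0:
        raise ValueError("baseline must contain at least one error")
    return (baseline_errors - boosted_errors) / baseline_errors


# Hypothetical A/B run over the same audio: tail-token errors without and with boosting.
lift = relative_error_reduction(baseline_errors=80, boosted_errors=60)
print(f"tail-token error reduction: {lift:.0%}")
```

Running both arms on identical audio matters here; comparing different call samples would mix vocabulary lift with ordinary variation between recordings.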
Open-weights Whisper is the wrong tool for streaming. Whisper was built for 30-second batched windows. For low-latency streaming use Distil-Whisper, Moonshine, or Parakeet TDT — or pay for a managed streaming API (Deepgram, AssemblyAI). Don’t hack 30s Whisper into a real-time loop and expect good UX.
Always benchmark on YOUR audio. Vendor numbers are LibriSpeech. Your audio is phone calls, accented speakers, background noise, and domain jargon. A two-hour eval set built from your own production audio is the only number that matters — and it routinely re-orders the rankings.
CodeSOTA’s speech-recognition comparison is read by engineers picking ASR for voice agents, call-center analytics, and transcription. If you represent a vendor above — or one we missed — claim the listing to submit verified pricing, latency benchmarks, and a demo link. Free; credibility-gated, not pay-to-play.