Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard almost overnight. Whisper large-v3 achieves under 5% word error rate (WER) on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech, where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.
History
2012: Deep neural networks replace GMM-HMMs in acoustic modeling; Google ships DNN-based voice search
2014: Deep Speech (Baidu) introduces end-to-end CTC-based ASR, simplifying the traditional pipeline
2015: Listen, Attend and Spell (LAS) brings attention-based seq2seq to ASR; Google later deploys it in production
2020: Wav2Vec 2.0 (Facebook) shows self-supervised pretraining on unlabeled audio dramatically improves ASR
2020: Conformer (Gulati et al.) combines convolution with transformer attention and becomes the dominant ASR architecture
2022: OpenAI releases Whisper, a 1.5B-param model trained on 680K hours, achieving robust multilingual ASR across 97 languages
2023-2024: Whisper large-v3 and Distil-Whisper push accuracy and speed; AssemblyAI Universal-2 and Deepgram Nova-2 lead commercial ASR
2024: Canary (NVIDIA), Parakeet, and Moonshine optimize for real-time on-device ASR; WER drops below 3% on clean English
2023-2024: Universal Speech Model (Google) and Whisper-AT handle 100+ languages; multimodal models (GPT-4o, Gemini) process audio natively
How Speech Recognition Works
Audio preprocessing
Raw audio is converted to mel-spectrograms (80 frequency bins, 25ms windows with 10ms stride)
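The preprocessing step can be sketched in plain NumPy. The parameters mirror the setup above (25 ms windows = 400 samples at 16 kHz, 10 ms stride = 160 samples, 80 mel bins); real systems such as Whisper differ in details like padding, normalization, and the exact mel formula:

```python
import numpy as np

def hz_to_mel(f):
    """Hz to mel scale (HTK formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """80-bin log-mel features: 25 ms windows, 10 ms stride at 16 kHz."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(audio) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank spanning 0 Hz to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Apply the filterbank and compress with log.
    return np.log(power @ fbank.T + 1e-10)

# One second of a 440 Hz tone -> a (frames, 80) feature matrix
t = np.arange(16000) / 16000
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

One second of audio yields 98 frames here; the encoder then subsamples these further.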
Encoder
A conformer or transformer encoder processes the spectrogram, producing hidden representations at ~20ms per frame
Decoder
An autoregressive transformer or CTC head converts encoder outputs to token sequences (subwords or characters)
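A minimal sketch of how a CTC head's frame-level outputs become text: take the best token per frame, collapse consecutive repeats, and drop the blank symbol. Production systems use beam search rather than this greedy pass, and the toy vocabulary and scores below are made up for illustration:

```python
def ctc_greedy_decode(frame_scores, vocab, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    out, prev = [], None
    for tok in best:
        if tok != prev and tok != blank:  # collapse repeats, skip blanks
            out.append(vocab[tok])
        prev = tok
    return "".join(out)

# Toy example: vocab index 0 is the CTC blank symbol.
vocab = ["_", "c", "a", "t"]
scores = [  # one row of scores per ~20 ms encoder frame
    [0.1, 0.8, 0.0, 0.1],  # c
    [0.1, 0.8, 0.0, 0.1],  # c (repeat, collapsed)
    [0.9, 0.0, 0.0, 0.1],  # blank
    [0.0, 0.1, 0.8, 0.1],  # a
    [0.0, 0.1, 0.0, 0.9],  # t
]
print(ctc_greedy_decode(scores, vocab))  # -> cat
```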
Language model fusion
Optional external language model rescores hypotheses to improve accuracy on domain-specific vocabulary
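The rescoring idea can be illustrated with a toy n-best list. This is shallow-fusion-style rescoring under stated assumptions: the hypotheses, scores, LM weight, and the `toy_lm` function are all invented for the example; real systems use a neural or n-gram LM over the same n-best list:

```python
import math

def rescore(hypotheses, lm_logprob, lam=0.3):
    """Pick the hypothesis maximizing ASR log-score + lam * LM log-prob."""
    return max(hypotheses, key=lambda h: h[1] + lam * lm_logprob(h[0]))

# Hypothetical domain LM that prefers the medical term.
def toy_lm(text):
    return math.log(0.9) if "myocardial" in text else math.log(0.01)

n_best = [
    ("my cardial infarction", -1.0),   # slightly better acoustic score
    ("myocardial infarction", -1.2),
]
best = rescore(n_best, toy_lm)
print(best[0])  # -> myocardial infarction
```

The LM weight trades off acoustic evidence against linguistic plausibility; too high a weight makes the system hallucinate in-domain words it never heard.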
Timestamp alignment
Cross-attention weights or forced alignment produce word-level timestamps for subtitling and diarization
Current Landscape
ASR in 2025 is a mature technology where clean English transcription is essentially solved at <3% WER. Whisper single-handedly democratized multilingual ASR — before it, high-quality ASR required expensive commercial APIs or years of data collection. The commercial market (AssemblyAI, Deepgram, Google, AWS) competes on latency, speaker diarization, and domain customization rather than raw accuracy. The architecture has converged on conformer encoders with transformer decoders, and self-supervised pretraining (Wav2Vec, HuBERT) remains critical for low-resource languages.
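WER, the metric quoted throughout this page, is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation:

```python
def wer(reference, hypothesis):
    """Word error rate via dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# 1 substitution + 2 insertions against a 2-word reference:
print(wer("recognize speech", "wreck a nice speech"))  # -> 1.5
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is why it is reported as a rate rather than a percentage of words "gotten wrong."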
Key Challenges
Noisy and far-field audio: WER degrades significantly in reverberant rooms, cocktail party settings, and with background music
Accented and dialectal speech: models trained on standard dialects perform poorly on underrepresented accents
Code-switching: speakers who mix languages mid-sentence break single-language ASR systems
Streaming/real-time: achieving low latency (<500ms) while maintaining accuracy requires specialized architectures
Rare words and proper nouns: ASR systems struggle with domain-specific terminology, names, and technical jargon
Quick Recommendations
Best accuracy (batch)
Whisper large-v3 or AssemblyAI Universal-2
Sub-4% WER on English; strong multilingual support; excellent punctuation and casing
Real-time streaming
Deepgram Nova-2 or NVIDIA Canary
Low-latency streaming ASR with word-level timestamps; optimized for production
On-device / offline
Whisper.cpp (tiny/base) or Moonshine
Runs in real-time on mobile CPUs and edge devices; no cloud dependency
Open-source (self-hosted)
Whisper large-v3 + faster-whisper (CTranslate2)
4x faster inference with equivalent accuracy; batch processing on consumer GPUs
Multilingual / low-resource
Whisper large-v3 or MMS-1B (Meta)
MMS covers 1,100+ languages; Whisper covers 97 with higher accuracy on common ones
What's Next
The frontier is multimodal speech understanding (models that understand not just words but intent, emotion, and speaker identity from audio), zero-shot domain adaptation (accurate transcription of medical dictation or legal proceedings without fine-tuning), and fully on-device ASR that matches cloud quality. Expect ASR to merge into unified audio understanding models that handle transcription, translation, speaker identification, and sound event detection in a single model.
Benchmarks & SOTA
LibriSpeech
LibriSpeech ASR Corpus
1000 hours of English speech from audiobooks. Standard benchmark for automatic speech recognition.
State of the Art: Whisper Large v2 (OpenAI), 5.2% WER on LibriSpeech test-other
Common Voice
Mozilla Common Voice
Massive multilingual dataset of transcribed speech. Covers diverse demographics and accents. Over 100 languages, updated continuously by Mozilla Foundation.
State of the Art: Whisper Large v2 (OpenAI), 11.2% WER
FLEURS
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Multilingual speech benchmark covering 102 languages, built on top of the FLoRes-101 machine translation benchmark. Evaluates ASR systems across diverse languages with standardized evaluation.
No results tracked yet
WildASR
WildASR: A Multilingual Diagnostic Benchmark for ASR Robustness
Multilingual (English, Chinese, Japanese, Korean) diagnostic benchmark evaluating ASR robustness across three out-of-distribution dimensions: environmental degradation (reverberation, noise, clipping), demographic shift (accents, children, older speakers), and linguistic diversity (code-switching, short utterances, incomplete speech). Uses WER for English and CER for CJK languages.
No results tracked yet
Related Tasks
Text-to-Speech
Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while open models like Bark, VALL-E (Microsoft), and F5-TTS demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy — MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.
Speaker Verification
Verifying speaker identity from voice samples.
Speech Translation
Translating spoken audio directly to another language.
Voice Cloning
Replicating a speaker's voice characteristics.