
Speech Recognition

Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard almost overnight. Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech, where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.


Speech recognition (ASR) converts spoken audio to text. Whisper (OpenAI) democratized high-accuracy multilingual ASR, and production systems from Google, Amazon, and AssemblyAI achieve <5% word error rate on clean English. The frontiers are noisy/accented speech, real-time streaming, and code-switching between languages mid-sentence.
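The word error rate (WER) figures quoted throughout this page are the Levenshtein edit distance between the hypothesis and reference transcripts (word-level substitutions + deletions + insertions), divided by the number of reference words. A minimal sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Production evaluations normalize text first (casing, punctuation, number formatting), which is why reported WER varies between benchmarks even for the same model.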

History

2012

Deep neural networks replace GMM-HMMs in acoustic modeling; Google ships DNN-based voice search

2014

DeepSpeech (Baidu) introduces end-to-end CTC-based ASR, simplifying the traditional pipeline

2017

Listen, Attend and Spell (LAS) brings attention-based seq2seq to ASR; Google deploys it in production

2019

wav2vec (Facebook) shows self-supervised pretraining on unlabeled audio improves ASR; its successor wav2vec 2.0 (2020) makes the gains dramatic

2020

Conformer (Gulati et al.) combines convolution with transformer attention — becomes the dominant ASR architecture

2022

OpenAI releases Whisper, a 1.5B-param model trained on 680K hours of audio, achieving robust multilingual ASR across 97 languages

2023

Whisper large-v3 and Distil-Whisper push accuracy and speed; AssemblyAI Universal-2 and Deepgram Nova-2 lead commercial ASR

2024

Canary (NVIDIA), Parakeet, and Moonshine optimize for real-time on-device ASR; WER drops below 3% on clean English

2025

Universal Speech Model (Google) and Whisper-AT handle 100+ languages; multimodal models (GPT-4o, Gemini) process audio natively

How Speech Recognition Works

Speech Recognition Pipeline
1

Audio preprocessing

Raw audio is converted to mel-spectrograms (80 frequency bins, 25ms windows with 10ms stride)

2

Encoder

A conformer or transformer encoder processes the spectrogram, producing hidden representations at ~20ms per frame

3

Decoder

An autoregressive transformer or CTC head converts encoder outputs to token sequences (subwords or characters)

4

Language model fusion

Optional external language model rescores hypotheses to improve accuracy on domain-specific vocabulary

5

Timestamp alignment

Cross-attention weights or forced alignment produce word-level timestamps for subtitling and diarization
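Steps 1 and 3 above can be sketched in a few dozen lines. The following is a toy illustration, not any particular model's implementation: a numpy-only log-mel front end using the 80-bin / 25ms-window / 10ms-stride configuration from step 1, plus greedy CTC collapse for the decoder variant in step 3 (the window, hop, and sample-rate constants are common conventions, assumed here for concreteness).

```python
import numpy as np

SR = 16000                      # sample rate most ASR models assume
N_FFT, HOP = 400, 160           # 25 ms window, 10 ms stride at 16 kHz
N_MELS = 80

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SR):
    # Triangular filters spaced evenly on the mel scale (step 1).
    pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio):
    # Frame the waveform: 25 ms Hann windows every 10 ms, then FFT power.
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP: i * HOP + N_FFT] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(N_FFT), axis=1)) ** 2
    return np.log(spec @ mel_filterbank().T + 1e-10)   # shape: (frames, 80)

def ctc_greedy_decode(logits, blank=0):
    # Step 3, CTC head: argmax per frame, collapse repeats, drop blanks.
    ids = logits.argmax(axis=-1)
    out, prev = [], blank
    for i in ids:
        if i != prev and i != blank:
            out.append(int(i))
        prev = i
    return out

audio = np.random.randn(SR)              # 1 s of noise stands in for real speech
print(log_mel_spectrogram(audio).shape)  # (98, 80): ~100 frames per second
```

A real encoder then downsamples these ~10ms frames to the ~20ms-per-frame representations mentioned in step 2; the CTC blank token is what lets a frame-synchronous decoder emit nothing for silent or mid-phoneme frames.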

Current Landscape

ASR in 2025 is a mature technology where clean English transcription is essentially solved at <3% WER. Whisper single-handedly democratized multilingual ASR — before it, high-quality ASR required expensive commercial APIs or years of data collection. The commercial market (AssemblyAI, Deepgram, Google, AWS) competes on latency, speaker diarization, and domain customization rather than raw accuracy. The architecture has converged on conformer encoders with transformer decoders, and self-supervised pretraining (Wav2Vec, HuBERT) remains critical for low-resource languages.

Key Challenges

Noisy and far-field audio: WER degrades significantly in reverberant rooms, cocktail party settings, and with background music

Accented and dialectal speech: models trained on standard dialects perform poorly on underrepresented accents

Code-switching: speakers who mix languages mid-sentence break single-language ASR systems

Streaming/real-time: achieving low latency (<500ms) while maintaining accuracy requires specialized architectures

Rare words and proper nouns: ASR systems struggle with domain-specific terminology, names, and technical jargon
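The streaming constraint above has a hard floor independent of model speed: the recognizer cannot finalize a word before its chunk (plus any right-context lookahead) has arrived. A toy chunker illustrates the arithmetic; the 320ms/160ms sizes are illustrative assumptions, not any vendor's settings.

```python
import numpy as np

SR = 16000  # 16 kHz mono, the usual ASR input format

def stream_chunks(audio, chunk_ms=320, lookahead_ms=160):
    """Yield overlapping windows as audio 'arrives'.

    Latency floor = chunk + lookahead (480 ms here) before any
    model compute, which is why streaming encoders restrict
    future context so aggressively.
    """
    chunk = int(SR * chunk_ms / 1000)
    look = int(SR * lookahead_ms / 1000)
    for start in range(0, len(audio), chunk):
        # Tokens may only be emitted for [start, start + chunk);
        # the lookahead samples give the encoder limited future context.
        yield audio[start: start + chunk + look]

audio = np.zeros(SR)  # 1 s of silence as a stand-in for a live mic buffer
windows = list(stream_chunks(audio))
print(len(windows))   # 4 chunks of 320 ms each
```

Shrinking the chunk cuts latency but starves the encoder of context, which is the accuracy/latency trade-off the challenge describes.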

Quick Recommendations

Best accuracy (batch)

Whisper large-v3 or AssemblyAI Universal-2

Sub-4% WER on English; strong multilingual support; excellent punctuation and casing

Real-time streaming

Deepgram Nova-2 or NVIDIA Canary

Low-latency streaming ASR with word-level timestamps; optimized for production

On-device / offline

Whisper.cpp (tiny/base) or Moonshine

Runs in real-time on mobile CPUs and edge devices; no cloud dependency

Open-source (self-hosted)

Whisper large-v3 + faster-whisper (CTranslate2)

4x faster inference with equivalent accuracy; batch processing on consumer GPUs

Multilingual / low-resource

Whisper large-v3 or MMS-1B (Meta)

MMS covers 1,100+ languages; Whisper covers 97 with higher accuracy on common ones

What's Next

The frontier is multimodal speech understanding (models that capture not just words but intent, emotion, and speaker identity from audio), zero-shot domain adaptation (accurate transcription of medical dictation or legal proceedings without fine-tuning), and fully on-device ASR that matches cloud quality. Expect ASR to merge into unified audio models that handle transcription, translation, speaker identification, and sound event detection in a single system.

Benchmarks & SOTA

Related Tasks

Something wrong or missing?

Help keep Speech Recognition benchmarks accurate. Report outdated results, missing benchmarks, or errors.
