Speech Recognition

Automatic speech recognition moved from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model. OpenAI's Whisper (2022), trained on 680K hours of web audio, quickly became the de facto open-source standard. Whisper large-v3 reaches under 5% word error rate (WER) on LibriSpeech clean, while commercial APIs from Google, AWS, and Deepgram compete on noisy, accented, and multilingual speech, where error rates are 2-3x higher. The open problems are real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.

11 datasets · 526 results

Speech recognition (ASR) converts spoken audio to text. Whisper (OpenAI) democratized high-accuracy multilingual ASR, and production systems from Google, Amazon, and AssemblyAI achieve <5% word error rate on clean English. The frontiers are noisy/accented speech, real-time streaming, and code-switching between languages mid-sentence.
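Word error rate is the metric quoted throughout this page, so a minimal sketch of how it is typically computed may help: the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function name and the lowercase/whitespace tokenization are illustrative assumptions; real evaluations usually apply a fuller text normalizer (punctuation, numerals, contractions) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count against a fixed reference length, WER can exceed 100% when a system hallucinates many extra words.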

History

2012

Deep neural networks replace GMM-HMMs in acoustic modeling; Google ships DNN-based voice search

2014

DeepSpeech (Baidu) introduces end-to-end CTC-based ASR, simplifying the traditional pipeline

2017

Listen, Attend, and Spell (LAS) brings attention-based seq2seq to ASR; Google deploys in production

2019

wav2vec (Facebook) introduces self-supervised pretraining on unlabeled audio for ASR; its successor Wav2Vec 2.0 (2020) shows that this dramatically improves accuracy when labeled data is scarce

2020

Conformer (Gulati et al.) combines convolution with transformer attention — becomes the dominant ASR architecture

2022

Whisper (OpenAI) releases a 1.5B-param model trained on 680K hours achieving robust multilingual ASR across 97 languages

2023

Whisper large-v3 and Distil-Whisper push accuracy and speed; AssemblyAI Universal-2 and Deepgram Nova-2 lead commercial ASR

2024

Canary (NVIDIA), Parakeet, and Moonshine optimize for real-time on-device ASR; WER drops below 3% on clean English

2025

Universal Speech Model (Google) and Whisper-AT handle 100+ languages; multimodal models (GPT-4o, Gemini) process audio natively

How Speech Recognition Works

Speech Recognition Pipeline
1

Audio preprocessing

Raw audio is converted to mel-spectrograms (80 frequency bins, 25ms windows with 10ms stride)

2

Encoder

A conformer or transformer encoder processes the spectrogram, producing hidden representations at ~20ms per frame

3

Decoder

An autoregressive transformer or CTC head converts encoder outputs to token sequences (subwords or characters)

4

Language model fusion

Optional external language model rescores hypotheses to improve accuracy on domain-specific vocabulary

5

Timestamp alignment

Cross-attention weights or forced alignment produce word-level timestamps for subtitling and diarization
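The frame arithmetic in steps 1 and 2 and the CTC option in step 3 can be sketched in a few lines. This is an illustration under stated assumptions, not any particular model's implementation: the 16 kHz sample rate, the absence of padding, the 2x encoder subsampling factor, and the toy vocabulary are all assumed here, and the function names are hypothetical.

```python
SAMPLE_RATE = 16_000  # Hz; assumed, the pipeline description does not state it


def num_mel_frames(num_samples: int, window_ms: int = 25, stride_ms: int = 10,
                   sample_rate: int = SAMPLE_RATE) -> int:
    """Count spectrogram frames for unpadded audio (25 ms window, 10 ms stride)."""
    window = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    stride = sample_rate * stride_ms // 1000  # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


def num_encoder_frames(mel_frames: int, subsampling: int = 2) -> int:
    """Assumed 2x subsampling turns a 10 ms stride into ~20 ms per encoder frame."""
    return mel_frames // subsampling


def ctc_greedy_decode(frame_ids: list[int], blank: int = 0) -> list[int]:
    """Greedy CTC decoding: collapse consecutive repeats, then drop blanks."""
    decoded = []
    previous = None
    for token in frame_ids:
        if token != previous and token != blank:
            decoded.append(token)
        previous = token
    return decoded


if __name__ == "__main__":
    mel = num_mel_frames(SAMPLE_RATE)  # one second of audio
    print(mel, num_encoder_frames(mel))

    vocab = ["<blank>", "c", "a", "t"]  # toy character vocabulary
    best_per_frame = [1, 1, 0, 2, 2, 2, 0, 0, 3]  # argmax token id per frame
    print("".join(vocab[i] for i in ctc_greedy_decode(best_per_frame)))
```

The blank token is what lets CTC emit a genuinely repeated symbol: two identical ids separated by a blank survive the collapse, while two adjacent identical ids merge into one.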

Current Landscape

ASR in 2025 is a mature technology: on clean English, the best systems report <3% WER. Whisper made high-quality multilingual ASR broadly accessible; before it, comparable quality required expensive commercial APIs or years of data collection. The commercial market (AssemblyAI, Deepgram, Google, AWS) competes on latency, speaker diarization, and domain customization rather than raw accuracy. Architectures have largely converged on conformer or transformer encoders with transformer decoders, and self-supervised pretraining (Wav2Vec, HuBERT) remains critical for low-resource languages.

Key Challenges

Noisy and far-field audio: WER degrades significantly in reverberant rooms, cocktail party settings, and with background music

Accented and dialectal speech: models trained on standard dialects perform poorly on underrepresented accents

Code-switching: speakers who mix languages mid-sentence break single-language ASR systems

Streaming/real-time: achieving low latency (<500ms) while maintaining accuracy requires specialized architectures

Rare words and proper nouns: ASR systems struggle with domain-specific terminology, names, and technical jargon
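To make the <500ms streaming target concrete, here is a rough latency budget for a chunk-based streaming recognizer. It is a simplified model under assumptions not taken from this page: audio is processed in fixed-size chunks, the encoder needs some right-context lookahead, and compute and network times are constant. The class, field names, and example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamingConfig:
    chunk_ms: float          # audio buffered before each model call
    lookahead_ms: float      # future context the encoder waits for
    compute_ms: float        # time to process one chunk
    network_ms: float = 0.0  # round trip to a remote service, if any


def worst_case_latency_ms(cfg: StreamingConfig) -> float:
    """A word spoken at the start of a chunk waits for the rest of the chunk,
    then the lookahead, then compute and network time."""
    return cfg.chunk_ms + cfg.lookahead_ms + cfg.compute_ms + cfg.network_ms


def keeps_up(cfg: StreamingConfig) -> bool:
    """Each chunk must be processed faster than the next one arrives."""
    return cfg.compute_ms < cfg.chunk_ms


def meets_budget(cfg: StreamingConfig, budget_ms: float = 500.0) -> bool:
    return keeps_up(cfg) and worst_case_latency_ms(cfg) < budget_ms


if __name__ == "__main__":
    small_chunks = StreamingConfig(chunk_ms=160, lookahead_ms=80, compute_ms=40, network_ms=50)
    large_chunks = StreamingConfig(chunk_ms=320, lookahead_ms=160, compute_ms=60)
    print(worst_case_latency_ms(small_chunks), meets_budget(small_chunks))
    print(worst_case_latency_ms(large_chunks), meets_budget(large_chunks))
```

The trade-off this exposes is the one named above: larger chunks and more lookahead give the model more context, which tends to help accuracy, but every millisecond of context is added directly to the delay.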

Quick Recommendations

Best accuracy (batch)

Whisper large-v3 or AssemblyAI Universal-2

Sub-4% WER on English; strong multilingual support; excellent punctuation and casing

Real-time streaming

Deepgram Nova-2 or NVIDIA Canary

Low-latency streaming ASR with word-level timestamps; optimized for production

On-device / offline

Whisper.cpp (tiny/base) or Moonshine

Runs in real-time on mobile CPUs and edge devices; no cloud dependency

Open-source (self-hosted)

Whisper large-v3 + faster-whisper (CTranslate2)

Up to 4x faster inference than the reference Whisper implementation with equivalent accuracy; batch processing on consumer GPUs

Multilingual / low-resource

Whisper large-v3 or MMS-1B (Meta)

MMS covers 1,100+ languages; Whisper covers 97 with higher accuracy on common ones

What's Next

The frontier is multimodal speech understanding (models that understand not just words but intent, emotion, and speaker identity from audio), zero-shot domain adaptation (accurate transcription of medical dictation or legal proceedings without fine-tuning), and fully on-device ASR that matches cloud quality. Expect ASR to merge into unified audio understanding models that handle transcription, translation, speaker identification, and sound event detection in a single model.

Benchmarks & SOTA

LibriSpeech

LibriSpeech ASR Corpus

2015 · 111 results

1000 hours of English speech from audiobooks. Standard benchmark for automatic speech recognition.

State of the Art

Mms-1b-fl102

28.7

wer

Open ASR Leaderboard

HF Open ASR Leaderboard (aggregate)

2023 · 102 results

The Hugging Face Open ASR Leaderboard aggregates Word Error Rate and real-time factor across LibriSpeech, AMI, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM, and VoxPopuli to give a single composite score for English ASR systems. The de-facto modern ASR leaderboard.

State of the Art

Stt_en_fastconformer_ctc_large

6399.25031

rtfx
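The two quantities this leaderboard reports can be sketched as follows. RTFx (inverse real-time factor) is audio duration divided by processing time, so higher is faster. The composite WER is shown here as a simple unweighted mean over datasets, which is an assumption about the aggregation rather than something stated on this page. The function names and the per-dataset numbers are hypothetical.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def macro_average_wer(wer_by_dataset: dict[str, float]) -> float:
    """Unweighted mean WER across datasets (assumed aggregation)."""
    if not wer_by_dataset:
        raise ValueError("need at least one dataset")
    return sum(wer_by_dataset.values()) / len(wer_by_dataset)


if __name__ == "__main__":
    # One hour of audio transcribed in two seconds of compute.
    print(rtfx(3600.0, 2.0))

    # Hypothetical WER percentages, for illustration only.
    scores = {"dataset_a": 2.0, "dataset_b": 12.0, "dataset_c": 10.0}
    print(macro_average_wer(scores))
```

An RTFx in the thousands, like the value listed above, means the model processes well over an hour of audio per second of compute. It says nothing about accuracy, so it has to be read alongside WER.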

SPGISpeech

SPGISpeech Earnings Call Corpus

2021 · 56 results

SPGISpeech is a 5,000-hour corpus of professionally transcribed English earnings calls released by S&P Global Market Intelligence. The largest publicly available financial-domain ASR benchmark.

State of the Art

Wav2vec2-base-960h

27.56

wer

VoxPopuli

VoxPopuli Multilingual Speech Corpus

2021 · 55 results

VoxPopuli is a large-scale multilingual speech corpus derived from European Parliament event recordings, providing labelled ASR data for 18 European languages plus large quantities of unlabelled audio for self-supervised pre-training.

State of the Art

Wav2vec2-base-960h

32.48

wer

AMI-IHM

AMI Meeting Corpus — Individual Headset Microphone

2005 · 50 results

The AMI Meeting Corpus IHM subset consists of ~100 hours of recorded English meetings captured with individual headset microphones. Long-form spontaneous speech across overlapping speakers makes it a standard stress-test for ASR systems beyond clean read speech.

State of the Art

Mms-1b-fl102

86.78

wer

Earnings-22

Earnings-22 ASR Benchmark

2022 · 50 results

Earnings-22 is a 119-hour corpus of real-world English-language earnings calls from publicly traded companies around the world (the "22" refers to its 2022 release). Heavy domain vocabulary (financial terminology, proper names) and accented speech make it a tough benchmark for production ASR systems.

State of the Art

Mms-1b-fl102

51.87

wer

TED-LIUM

TED-LIUM v3

2018 · 50 results

TED-LIUM is an English ASR corpus derived from public TED talks, with the v3 release providing ~452 hours of audio aligned to verbatim transcripts. Long-form prepared speech with diverse speakers, accents, and topics.

State of the Art

Mms-1b-fl102

32.35

wer

GigaSpeech

GigaSpeech

2021 · 47 results

GigaSpeech is a 10,000-hour English ASR corpus pulled from audiobooks, podcasts, and YouTube. Released by SpeechColab and widely used as a high-volume training+evaluation set covering diverse speaking styles and noise conditions.

State of the Art

Mms-1b-fl102

42.42

wer

Common Voice

Mozilla Common Voice

2019 · 4 results

Massive multilingual dataset of transcribed speech. Covers diverse demographics and accents. Over 100 languages, updated continuously by Mozilla Foundation.

State of the Art

Whisper Large v2

OpenAI

11.2

wer

FLEURS

Few-shot Learning Evaluation of Universal Representations of Speech

2022 · 1 result

Multilingual speech benchmark covering 100+ languages. Commonly used for ASR and speech-language model evaluation.

State of the Art

Phi-4-Multimodal 5.6B

4

wer

WildASR

WildASR: A Multilingual Diagnostic Benchmark for ASR Robustness

2025 · 0 results

Multilingual (English, Chinese, Japanese, Korean) diagnostic benchmark evaluating ASR robustness across three out-of-distribution dimensions: environmental degradation (reverberation, noise, clipping), demographic shift (accents, children, older speakers), and linguistic diversity (code-switching, short utterances, incomplete speech). Uses WER for English and CER for CJK languages.

No results tracked yet
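The split between WER for English and CER for Chinese, Japanese, and Korean comes down to the unit the edit distance is computed over: whitespace-separated words versus individual characters. A minimal sketch of that switch follows. The language codes, the whitespace stripping for character-level scoring, and the function names are assumptions for illustration, not the benchmark's actual scoring code.

```python
CHAR_LEVEL_LANGS = {"zh", "ja", "ko"}  # assumed language codes


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (rolling row)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match or substitution
            )
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, lang: str) -> float:
    """CER for character-scored languages, WER otherwise."""
    if lang in CHAR_LEVEL_LANGS:
        # Characters are the unit; whitespace is ignored (assumed convention).
        ref = [ch for ch in reference if not ch.isspace()]
        hyp = [ch for ch in hypothesis if not ch.isspace()]
    else:
        ref = reference.split()
        hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(error_rate("hello world", "hello word", "en"))  # word-level
    print(error_rate("今天天气很好", "今天天气真好", "zh"))  # character-level
```

Character-level scoring avoids depending on a word segmenter for languages that are written without spaces between words, where the choice of segmenter would otherwise change the score.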
