7 benchmarks · 6 edges · Updated 2026-04-27
Benchmark lineage

Speech Recognition Benchmarks

How automatic speech recognition evaluation evolved from clean read speech on LibriSpeech, through multi-speaker and noisy conditions, toward naturalistic and multilingual benchmarks that reflect real deployment environments. The spine tracks where word error rate evaluation moved as clean-speech performance saturated; branches cover speaker verification (VoxCeleb), noisy conditions (LibriSpeech-other, GigaSpeech), and multilingual evaluation (FLEURS, Common Voice).

Editor's note

LibriSpeech test-clean has been effectively solved — modern end-to-end systems achieve 1.5–2% WER, near the transcription noise floor. The field's response has been to test harder conditions: multi-speaker meetings (CHiME-6), accented and code-switched speech (Common Voice), and genuinely unconstrained real-world audio (WildASR). FLEURS brought multilingual coverage to 102 languages and is now the standard for evaluating speech foundation models like Whisper. The active frontier as of 2025 is naturalistic multi-speaker diarization + transcription — a task where no current system is close to human parity on challenging domains.

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path; dashed arrows mark scope shifts (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Each node is detailed in § 02 below.

Legend: attention path · scope shift · branch / fork · active · saturating · saturated / superseded

[Lineage graph: LibriSpeech (Apr 2015) · VoxCeleb (Jun 2017) · CHiME-6 (Apr 2020) · Common Voice (Jun 2020) · GigaSpeech (Jun 2021) · FLEURS (May 2022) · WildASR (Jan 2024); edges detailed below]
LibriSpeech → VoxCeleb · scope shift
VoxCeleb covers speaker identity, not transcription — a different task that addresses the 'who spoke' question LibriSpeech ignores. Speaker verification became a standard parallel track in speech evaluation.
LibriSpeech → CHiME-6 · scope shift · attention
LibriSpeech test-other saturated; CHiME-6's multi-speaker dinner-party setup was the first major challenge where clean-speech progress didn't transfer. Where attention moved when LibriSpeech-other WER dropped below 4%.
LibriSpeech → GigaSpeech · scope shift
GigaSpeech is a scale and diversity extension — 10× more data, multi-domain. A training and evaluation resource for robustness rather than a direct successor to LibriSpeech's narrow clean-speech task.
CHiME-6 → Common Voice · scope shift
Common Voice's multilingual coverage shifted attention from acoustic difficulty (CHiME-6) to language and accent diversity. The two benchmarks probe orthogonal failure modes.
Common Voice → FLEURS · direct successor · attention
FLEURS provides a cleaner 102-language evaluation with parallel text-aligned prompts derived from FLoRes, addressing Common Voice's annotation inconsistency across languages. Became the multilingual ASR standard when Whisper adopted it.
FLEURS → WildASR · scope shift · attention
FLEURS evaluates multilingual generalisation; WildASR evaluates naturalness — real ambient noise, spontaneous speech, code-switching, and domain diversity. The current attention path for foundation-model ASR evaluation.
§ 02 · Benchmarks in this lineage

Nodes in detail.

Apr 2015 · Saturated

LibriSpeech

LibriSpeech ASR Corpus

1,000 hours of English audiobook speech split into clean and other (noisy) test sets. Defined ASR evaluation for the deep-learning era. test-clean WER under 2% for strong systems; test-other under 4%. Both effectively saturated for top models.

Panayotov et al. (Johns Hopkins) · paper
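
Word error rate, the metric quoted for every transcription benchmark in this lineage, is word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch in Python; note that published numbers also apply text normalization (casing, punctuation) before scoring, which this omits:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub_cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 1 sub + 1 del over 6 words ≈ 0.33
```
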
Jun 2017 · Active

VoxCeleb

VoxCeleb Speaker Recognition

100K+ utterances from 1,251 celebrities scraped from YouTube. VoxCeleb2 expanded to 6,112 identities. The standard speaker verification benchmark; equal error rate (EER) is the metric. Active as a speaker-modelling benchmark even as ASR has moved on.

Nagrani et al. (Oxford VGG) · paper
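
EER is the operating point where the false-accept rate equals the false-reject rate as the decision threshold sweeps over trial scores. A minimal sketch using scikit-learn's ROC utilities; the trial scores below are illustrative, not real VoxCeleb trials:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the threshold point where false-accept and false-reject rates cross."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = same speaker, 0 = different
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # closest crossing of the two error rates
    return (fpr[idx] + fnr[idx]) / 2

# Illustrative trial scores (e.g. cosine similarities of speaker embeddings)
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.81, 0.74, 0.42, 0.35, 0.28, 0.51, 0.66, 0.12])
print(f"EER = {equal_error_rate(labels, scores):.1%}")  # 25.0% on this toy set
```
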
Apr 2020 · Active

CHiME-6

CHiME-6 Dinner Party ASR

20 dinner-party sessions recorded with distant microphones; multi-speaker, naturally overlapping speech with realistic noise. WER for systems without oracle diarization exceeds 50% for most participants. Exposed the gap between clean-speech WER progress and real conversational ASR.

Watanabe et al. (CHiME Challenge) · paper
Jun 2020 · Active

Common Voice

Mozilla Common Voice

Crowdsourced multilingual speech covering 100+ languages, many low-resource. Accent diversity within English makes it a harder distribution shift test than LibriSpeech. Primary use: multilingual and low-resource ASR evaluation, not English-only benchmarking.

Ardila et al. (Mozilla) · paper
Jun 2021 · Active

GigaSpeech

GigaSpeech Large-Scale ASR

10,000 hours of transcribed English from audiobooks, podcasts, and YouTube. Larger and more diverse than LibriSpeech; tests model robustness to domain and acoustic variation across sources.

Chen et al. · paper
May 2022 · Active

FLEURS

Few-shot Learning Evaluation of Universal Representations of Speech

102-language speech benchmark derived from FLoRes translation pairs. Covers many low-resource languages not represented in LibriSpeech or Common Voice. The standard benchmark for evaluating multilingual speech foundation models like Whisper.

Conneau et al. (Google) · paper
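
As a usage illustration: scoring a speech foundation model on one FLEURS language takes only a few lines with standard tooling. A minimal sketch assuming the Hugging Face `google/fleurs` dataset; the "hi_in" (Hindi) config and the whisper-small checkpoint are illustrative choices, not prescribed by the benchmark:

```python
# Sketch: transcribe one FLEURS test utterance with Whisper and compare to the
# reference transcription. Pairs like this feed into a WER scorer per language.
from datasets import load_dataset   # pip install datasets[audio]
from transformers import pipeline   # pip install transformers

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
fleurs = load_dataset("google/fleurs", "hi_in", split="test")

sample = fleurs[0]
pred = asr({"array": sample["audio"]["array"],
            "sampling_rate": sample["audio"]["sampling_rate"]})

print("hyp:", pred["text"])
print("ref:", sample["transcription"])
```
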

Jan 2024 · Active

WildASR

WildASR In-the-Wild Speech Recognition

Naturalistic audio from diverse real-world environments — phone calls, live events, spontaneous conversation. Designed to expose failure modes that clean-speech benchmarks mask. The emerging standard for assessing deployment readiness of ASR systems.

Shi et al. · paper