Speech Recognition (Audio to Text)
70 years of teaching machines to listen — from single-digit recognizers to Whisper transcribing 99 languages at near-human accuracy.
70 Years of Teaching Machines to Listen
Automatic Speech Recognition (ASR) has been one of AI's longest-running quests. The path from recognizing ten digits to transcribing any speaker in any language required breakthroughs in signal processing, statistical modeling, neural architecture design, and — critically — scale of training data. Each generation solved one fundamental limitation of the last.
Understanding this history explains why Whisper works the way it does, what trade-offs were made, and why certain failure modes still persist today.
Audrey: The First Speech Recognizer
In 1952, at Bell Labs, K.H. Davis, R. Biddulph, and S. Balashek built Audrey (Automatic Digit Recognizer) — a room-sized analog circuit that could recognize spoken digits 0–9 from a single speaker with roughly 97% accuracy. It worked by matching the energy patterns of formant frequencies against stored reference templates.
Audrey was a proof of concept with no practical use — it was tuned to one voice and could only handle isolated digits spoken with pauses between them. But it established the fundamental approach that would dominate for decades: compare incoming audio to stored templates.
— Davis, K.H. et al. (1952). Automatic Recognition of Spoken Digits. JASA, 24(6), 637–642.
IBM Shoebox
IBM demonstrated Shoebox at the 1962 World's Fair — a machine the size of a shoebox that recognized 16 spoken words (digits plus commands like "plus", "minus", "total") and could drive a simple adding machine by voice. It used analog filters to detect formant patterns. The press was amazed; researchers knew the hard problems — continuous speech, speaker independence, vocabulary beyond a handful of words — remained completely unsolved.
Dynamic Time Warping
In 1978, Hiroaki Sakoe and Seibi Chiba formalized Dynamic Time Warping (DTW), an algorithm that could align two speech signals of different speeds. People say "hello" at different rates — DTW could stretch and compress the time axis to find the best alignment. This was the first time speech recognition could handle natural variation in speaking speed, enabling small-vocabulary isolated-word recognizers that actually worked for multiple speakers.
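The warping idea fits in a few lines of Python. A minimal sketch of the DTW recurrence (without the slope constraints of the original formulation), comparing 1-D feature sequences:

```python
import math

def dtw_distance(a, b):
    """Minimal dynamic time warping distance between two 1-D sequences.
    Stretches/compresses the time axis to find the cheapest alignment."""
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] vs b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # moves: match both frames, or repeat a frame on either side
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

# The same pattern spoken slowly and quickly aligns at zero cost...
slow = [0, 1, 1, 2, 2, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))          # 0.0 — perfect warped match
# ...while a genuinely different pattern does not
print(dtw_distance(slow, [3, 2, 1, 0]) > 0)  # True
```

Real recognizers ran this over multi-dimensional spectral features rather than scalars, but the recurrence is the same.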
Hidden Markov Models Take Over
The single most important shift in ASR history. In the mid-1970s, researchers at IBM (Jelinek, Bahl, Mercer), CMU (Baker), and the Institute for Defense Analyses independently converged on Hidden Markov Models (HMMs) as the framework for speech recognition. The key insight: model speech as a sequence of hidden states (phonemes) that generate observable acoustic features, with probabilities governing transitions between states and emissions of observations.
# HMM for speech: two probability distributions
# 1. Transition: P(next_phoneme | current_phoneme)
#    "t" → "r" → "ee" (for the word "tree")
# 2. Emission: P(acoustic_features | phoneme)
#    phoneme "ee" → high F1, high F2 formant frequencies
# Decoding: find most likely phoneme sequence given audio
# Uses Viterbi algorithm (dynamic programming)
best_path = viterbi(observations, transition_probs, emission_probs)
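The Viterbi decoding step above can be made concrete with a toy two-phoneme model (all states, observations, and probabilities here are hypothetical, chosen only to illustrate the dynamic program):

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a sequence of observations."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best predecessor state for s, given this observation
            prob, prev = max(
                (V[-2][p] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best_prob, best_state = max((V[-1][s], s) for s in states)
    return path[best_state], best_prob

# Hypothetical two-phoneme model: which phoneme generated each audio frame?
states = ["t", "ee"]
start_p = {"t": 0.8, "ee": 0.2}
trans_p = {"t": {"t": 0.4, "ee": 0.6}, "ee": {"t": 0.1, "ee": 0.9}}
emit_p = {"t": {"burst": 0.7, "tone": 0.3}, "ee": {"burst": 0.1, "tone": 0.9}}

best_path, prob = viterbi(["burst", "tone", "tone"], states, start_p, trans_p, emit_p)
print(best_path)  # ['t', 'ee', 'ee']
```

Production systems of the era ran the same recursion in log space over thousands of context-dependent states, but the algorithm is unchanged.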
HMMs had a crucial mathematical advantage: the Baum-Welch algorithm (a special case of Expectation-Maximization) could train them from unlabeled audio-text pairs. Combined with Gaussian Mixture Models (GMMs) to model acoustic features, HMM-GMM systems dominated ASR for 30 years. Every commercial speech system from 1985 to 2012 — Dragon NaturallySpeaking, Nuance, Siri's original engine — was built on this foundation.
— Rabiner, L.R. (1989). A Tutorial on Hidden Markov Models. Proc. IEEE, 77(2), 257–286. The definitive HMM tutorial — still the most-cited paper in speech processing.
Large Vocabulary Continuous Speech Recognition
Through the 1990s, driven by the DARPA-funded WSJ (Wall Street Journal) and Switchboard corpora, systems scaled to 60,000+ word vocabularies with continuous speech (no pauses between words). The trick was combining HMM acoustic models with statistical n-gram language models — P(word | previous words) — to constrain the search space. Dragon NaturallySpeaking launched in 1997 as the first commercial large-vocabulary dictation product. WER on clean read speech (WSJ) dropped below 10% for the first time. But noisy, conversational, accented, or multilingual speech remained catastrophically bad.
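The role of the language model can be sketched with a toy bigram scorer (the counts below are hypothetical; real systems trained n-grams on billions of words). Given two hypotheses the acoustic model finds nearly identical, the LM breaks the tie:

```python
import math

# Hypothetical bigram and unigram counts, standing in for a model
# trained on a large text corpus
counts = {
    ("<s>", "recognize"): 50, ("recognize", "speech"): 40,
    ("<s>", "wreck"): 2, ("wreck", "a"): 2,
    ("a", "nice"): 30, ("nice", "beach"): 1,
}
unigrams = {"<s>": 100, "recognize": 50, "wreck": 2,
            "a": 60, "nice": 30, "beach": 1, "speech": 40}
V = len(unigrams)  # vocabulary size, for add-one smoothing

def log_p(sentence):
    """Add-one-smoothed bigram log-probability of a word sequence."""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((counts.get((prev, word), 0) + 1)
                          / (unigrams.get(prev, 0) + V))
    return total

# Classic near-homophone pair: the language model strongly prefers the first
print(log_p("recognize speech") > log_p("wreck a nice beach"))  # True
```

Decoding combined this score with the HMM acoustic score, so the search only pursued word sequences that were both acoustically and linguistically plausible.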
Deep Neural Networks Replace GMMs
In a landmark 2012 collaboration, researchers from Toronto (Hinton), Microsoft (Deng, Yu), Google (Jaitly, Senior), and IBM (Kingsbury) published a joint paper showing that replacing GMMs with deep neural networks (DNNs) in the HMM framework reduced word error rates by 20–30% relative across multiple benchmarks. The HMM structure stayed — the acoustic model got dramatically better.
This triggered the industry's wholesale shift to deep learning. Within two years, every major speech team (Google, Apple, Microsoft, Baidu) had replaced their GMM acoustic models with DNNs. The HMM framework was still there, but its days were numbered.
Deep Speech: End-to-End Learning
In 2014, Awni Hannun, Andrew Ng, and colleagues at Baidu Research published Deep Speech — a system that threw away the entire HMM pipeline. Instead of phoneme-level HMM states, pronunciation dictionaries, and language model rescoring, they used a single deep recurrent neural network trained end-to-end with Connectionist Temporal Classification (CTC).
"Our system does not need a phoneme dictionary, nor even the concept of a phoneme [...] We show that an end-to-end deep learning approach can be competitive with traditional methods on standard benchmarks, and can outperform them in noisy environments."
— Hannun, A. et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567.
The key innovation was CTC loss — a training objective invented by Alex Graves (2006) that lets the network output a sequence of characters without needing to know the exact alignment between audio frames and text characters. CTC marginalizes over all possible alignments, freeing the model from requiring frame-level phoneme labels. This eliminated the need for forced alignment, pronunciation dictionaries, and the entire HMM state machine.
— Graves, A. et al. (2006). Connectionist Temporal Classification. ICML, 369–376.
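CTC's many-to-one mapping from frame-level alignments to transcripts is easiest to see in its collapse rule: merge repeated labels, then remove blanks. A sketch of the decode-time rule (the loss itself sums probabilities over every alignment that collapses to the target):

```python
BLANK = "_"

def ctc_collapse(frame_labels):
    """Collapse a per-frame label sequence: merge repeats, then drop blanks.
    Many frame-level alignments map to one transcript."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# Several different frame-level alignments, one transcript:
print(ctc_collapse(list("__ccc_aa_t_")))  # cat
print(ctc_collapse(list("c_a_t")))        # cat
print(ctc_collapse(list("cc_aaa_ttt")))   # cat
# A blank between repeats is how CTC emits genuine double letters:
print(ctc_collapse(list("h_e_ll_l_o")))   # hello
```

Because any of these alignments is acceptable, the network never needs frame-level phoneme labels — exactly the property that let Deep Speech discard forced alignment.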
Attention-Based Sequence-to-Sequence
In 2015, Jan Chorowski, Dzmitry Bahdanau, and colleagues — and separately William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals with their "Listen, Attend and Spell" model — showed that attention-based encoder-decoder models could transcribe speech by learning to focus on relevant audio frames when emitting each character. Unlike CTC, attention models could learn the alignment implicitly and produce outputs conditioned on previously generated tokens — enabling better handling of language model context.
Transformer & Conformer Enter ASR
The Transformer architecture (Vaswani et al., 2017) arrived in ASR through models like Speech-Transformer (Dong et al., 2018). But the breakthrough came with the Conformer (Gulati et al., 2020 at Google), which interleaved self-attention layers with convolution layers — attention captures global context, convolutions capture local acoustic patterns. Conformer achieved 1.9% WER on LibriSpeech test-clean, a new SOTA.
Meanwhile, wav2vec 2.0 (Baevski et al., 2020 at Meta) demonstrated that self-supervised pre-training on unlabeled audio — masking portions of the speech signal and predicting them — could learn powerful representations. Fine-tuning on just 10 minutes of labeled data achieved results competitive with 100 hours of supervised training.
— Gulati, A. et al. (2020). Conformer. Interspeech.
— Baevski, A. et al. (2020). wav2vec 2.0. NeurIPS.
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
In 2022, Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever at OpenAI took a radically different approach from the self-supervised pre-training trend. Instead of learning representations from unlabeled audio, they collected 680,000 hours of audio with existing transcriptions scraped from the internet — podcasts, audiobooks, lectures, YouTube videos with subtitles — in 99 languages.
The transcriptions were noisy (auto-generated subtitles, imperfect alignments), but the sheer scale compensated. Whisper was trained as a straightforward sequence-to-sequence Transformer with no self-supervised pre-training, no CTC, no external language model — just supervised training on an enormous, diverse, weakly-labeled dataset.
"We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours [...] the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning."
The result was the most robust general-purpose ASR model ever released. Whisper handled accents, background noise, technical jargon, and code-switching between languages — failure modes that had plagued ASR for decades — because it had simply heard them all during training. And OpenAI released it under the MIT license.
The Post-Whisper Landscape
Whisper spawned an ecosystem. Whisper large-v3 (November 2023) added improved multilingual performance with 128 Mel frequency bins (up from 80). Distil-Whisper (Gandhi et al., 2023) distilled the large model into versions 5.8x faster with minimal quality loss. faster-whisper (SYSTRAN) reimplemented inference in CTranslate2 for 4x speedup. WhisperX added forced alignment and speaker diarization on top.
Meanwhile, commercial providers pushed further: Deepgram Nova-2, AssemblyAI Universal-2, and Google Chirp all claim lower WER than Whisper on English benchmarks, particularly for conversational and noisy audio. The open-source world responded with Canary (NVIDIA, 2024) and Parakeet models achieving state-of-the-art English ASR.
The throughline: 1952 to 2025
Seven decades. Four paradigm shifts:
- Template matching (1952–1970s): compare incoming audio against stored reference patterns
- Statistical modeling (1970s–2012): HMM acoustic models, GMMs, and n-gram language models
- End-to-end deep learning (2012–2020): DNNs, CTC, attention, Conformer
- Massive weak supervision (2022–): one Transformer trained on 680,000 hours of web audio
The lesson is consistent across all of ML: more data and simpler architectures trained at scale beat complex systems with less data. Whisper's Transformer is architecturally unremarkable — its power comes from 680,000 hours of diverse audio.
Whisper Architecture: How It Works
Whisper is an encoder-decoder Transformer. The architecture itself is deliberately standard — the innovation is in training data and task formulation, not model design.
Step 1: Audio Preprocessing
Raw audio is resampled to 16 kHz, then converted to an 80-channel (v2) or 128-channel (v3) log-Mel spectrogram using 25ms windows with 10ms stride. The spectrogram is computed over a fixed 30-second chunk — shorter audio is zero-padded, longer audio is processed in 30-second segments.
# Audio → Mel spectrogram (what the encoder actually sees)
# Input: 30 seconds of 16kHz audio = 480,000 samples
# Window: 25ms (400 samples), stride: 10ms (160 samples)
# Output: (80, 3000) for v2 or (128, 3000) for v3
# ↑ mel bins ↑ time frames (30s / 10ms)
import whisper
audio = whisper.load_audio("speech.mp3") # → (N,) float32
audio = whisper.pad_or_trim(audio) # → (480000,) exactly 30s
mel = whisper.log_mel_spectrogram(audio) # → (80, 3000)
Step 2: Encoder
Two 1D convolution layers with GELU activations downsample the spectrogram by 2x in time (3000 frames to 1500), then sinusoidal positional embeddings are added. The result passes through N Transformer encoder blocks (N=32 for large) with self-attention and feed-forward layers.
# Encoder architecture (large-v3: 32 layers, d_model=1280)
mel_spectrogram # (128, 3000) — input
→ Conv1d(128→1280, kernel=3, stride=1) + GELU
→ Conv1d(1280→1280, kernel=3, stride=2) + GELU # downsample 2x
→ + sinusoidal_pos_embed # (1500, 1280)
→ 32× TransformerEncoderBlock:
→ LayerNorm → MultiHeadAttention(20 heads)
→ LayerNorm → FFN(1280 → 5120 → 1280)
→ LayerNorm
→ encoder_output # (1500, 1280)
Step 3: Decoder (Autoregressive)
The decoder is a standard Transformer decoder with learned positional embeddings (not sinusoidal — a key difference from the encoder). It generates tokens one at a time, attending to both previously generated tokens (causal self-attention) and encoder output (cross-attention). Special tokens control behavior:
# Decoder token sequence (the "prompt" that controls Whisper):
<|startoftranscript|> # Begin
<|en|> # Language token (detected or forced)
<|transcribe|> # Task: transcribe (vs <|translate|> to English)
<|notimestamps|> # Or timestamp tokens: <|0.00|> <|0.50|> ...
The quick brown fox... # Generated text tokens
<|endoftext|> # Stop
# This multi-task formulation means ONE model handles:
# - Language identification (predict language token)
# - Transcription (same language out)
# - Translation (any language → English)
# - Timestamp prediction (when each word was spoken)
This multi-task design is what makes Whisper so versatile. The model doesn't just transcribe — it has learned to detect language, generate timestamps, and translate, all conditioned on which special tokens appear in the decoder prefix.
Whisper Model Sizes
| Model | Params | Layers | d_model | VRAM |
|---|---|---|---|---|
| tiny | 39M | 4 | 384 | ~1 GB |
| base | 74M | 6 | 512 | ~1 GB |
| small | 244M | 12 | 768 | ~2 GB |
| medium | 769M | 24 | 1024 | ~5 GB |
| large-v3 | 1.55B | 32 | 1280 | ~10 GB |
| large-v3-turbo | 809M | 4 (dec) | 1280 | ~6 GB |
large-v3-turbo uses the full 32-layer encoder but only 4 decoder layers (vs 32 in large-v3), achieving 8x faster decoding with minimal quality loss on most languages.
Working Code: Three Ways to Transcribe
From the simplest API call to optimized local inference — pick the approach that matches your constraints.
Option 1: OpenAI Whisper API
Easiest: no GPU, no model download, no dependencies beyond the SDK. Pay $0.006/minute. Best for prototyping and low-volume production.
from openai import OpenAI
client = OpenAI()
# Basic transcription
with open("recording.mp3", "rb") as f:
result = client.audio.transcriptions.create(
model="whisper-1",
file=f,
response_format="verbose_json", # includes timestamps
timestamp_granularities=["segment", "word"]
)
print(result.text)
# Word-level timestamps
for word in result.words:
print(f"[{word.start:.2f}s] {word.word}")
# Translation (any language → English)
with open("german_speech.mp3", "rb") as f:
translation = client.audio.translations.create(
model="whisper-1",
file=f
)
print(translation.text) # English output
Option 2: faster-whisper (Local, 4x Faster)
Production: faster-whisper by SYSTRAN reimplements Whisper inference using CTranslate2 — a C++ inference engine with INT8 quantization. 4x faster than the original PyTorch implementation, 3x lower memory, identical accuracy. This is what you should use for production local deployment.
pip install faster-whisper
from faster_whisper import WhisperModel
# Model sizes: tiny, base, small, medium, large-v3
# compute_type: float16 (GPU), int8 (CPU-friendly), float32
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Transcribe with timestamps
segments, info = model.transcribe(
"recording.mp3",
beam_size=5,
language="en", # or None for auto-detection
vad_filter=True, # skip silence (faster)
word_timestamps=True # word-level timing
)
print(f"Detected language: {info.language} ({info.language_probability:.0%})")
for segment in segments:
print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
if segment.words:
for word in segment.words:
print(f" [{word.start:.2f}s] {word.word}")
VAD filter is essential for long files
Whisper processes audio in 30-second chunks. Without Voice Activity Detection (VAD) filtering, it hallucinates text during silent segments — a well-known failure mode. The vad_filter=True option uses Silero VAD to skip silent chunks, dramatically reducing both hallucinations and processing time.
Option 3: HuggingFace Transformers Pipeline
Flexible: The HuggingFace pipeline provides a unified interface for any ASR model — Whisper, wav2vec2, Conformer, or fine-tuned variants. Best when you need to swap models or use community fine-tunes.
pip install transformers torch accelerate
import torch
from transformers import pipeline
# Load any ASR model from the Hub
pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-large-v3",
torch_dtype=torch.float16,
device="cuda:0",
)
# Basic transcription
result = pipe("recording.mp3")
print(result["text"])
# With timestamps and chunking for long files
result = pipe(
"long_meeting.mp3",
chunk_length_s=30, # process in 30s chunks
batch_size=16, # parallel chunks on GPU
return_timestamps=True # segment-level timestamps
)
for chunk in result["chunks"]:
start, end = chunk["timestamp"]
print(f"[{start:.1f}s → {end:.1f}s] {chunk['text']}")
# Swap to a different model (e.g., distil-whisper for speed)
fast_pipe = pipeline(
"automatic-speech-recognition",
model="distil-whisper/distil-large-v3",
torch_dtype=torch.float16,
device="cuda:0",
)
# 5.8x faster, within 1% WER of large-v3 on English
Word Error Rate Benchmarks
Word Error Rate (WER) is the standard metric for ASR accuracy. It measures the percentage of words that are wrong in the transcription — counting substitutions, insertions, and deletions against a human reference transcript.
# WER = (Substitutions + Insertions + Deletions) / Total Reference Words
# Reference:  "the cat sat on the mat"
# Hypothesis: "the cat set on a mat"
#                      ^^^    ^
# 1 substitution + 1 substitution = 2 errors / 6 words = 33.3% WER
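The same metric as runnable code, using a standard word-level edit distance (a minimal sketch; libraries such as jiwer implement this for production use):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat set on a mat"))  # 0.333...
```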
LibriSpeech test-clean (Read English Audiobooks)
The standard academic benchmark. Clean, read speech — relatively easy. Lower is better.
Sources: OpenAI Whisper paper (2022), Gulati et al. (2020), NVIDIA NeMo (2024). Human WER from Liptchinsky et al. (2017). Note: Whisper WER is zero-shot (no fine-tuning on LibriSpeech).
Real-World Speech (Noisy, Conversational, Accented)
Where models are actually tested in production. These numbers tell you much more than LibriSpeech.
| Dataset | Type | Whisper v3 | Context |
|---|---|---|---|
| LibriSpeech test-clean | Read speech | 2.0% | Audiobooks, studio quality |
| LibriSpeech test-other | Harder read speech | 3.5% | Noisier recordings, varied speakers |
| Switchboard (Hub5'00) | Phone conversations | ~8.5% | Casual English, telephony quality |
| Common Voice (en) | Crowdsourced | ~9% | Diverse accents, variable quality |
| Earnings Calls | Business/finance | ~10% | Domain jargon, multiple speakers |
| Fleurs (avg 102 lang) | Multilingual | ~14% | Wide variance by language |
Key insight: LibriSpeech WER has limited predictive value for real-world performance. The gap between 2% (clean audiobooks) and 10%+ (noisy real audio) is where production quality is determined.
Where Whisper Breaks Down
Understanding failure modes is more valuable than memorizing accuracy numbers.
Hallucination on Silence
Whisper's autoregressive decoder will generate text even when there is no speech. Silent segments frequently produce hallucinated phrases — repeated words, URLs, or entire fabricated sentences. This is the most common production issue. Mitigation: always use VAD preprocessing to skip silent chunks.
Repetition Loops
The decoder occasionally enters repetitive loops, generating the same phrase or sentence fragment dozens of times. This is a known pathology of autoregressive Transformers. Mitigation: use temperature fallback (Whisper automatically retries with higher temperature when the compression ratio is suspiciously high) and set condition_on_previous_text=False for long files.
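The compression-ratio check is easy to reproduce with zlib: looping output compresses far better than natural text. A sketch of the heuristic (2.4 is the default compression_ratio_threshold in OpenAI's reference implementation):

```python
import zlib

def compression_ratio(text):
    """Raw bytes / compressed bytes; repetitive text scores much higher."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

normal = "The quick brown fox jumps over the lazy dog near the river bank."
looping = "thank you " * 30  # the kind of output a repetition loop produces

print(compression_ratio(looping) > 2.4)  # True: flag this segment and retry
print(compression_ratio(normal) > 2.4)   # False: normal text stays well below
```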
Low-Resource Languages
Whisper supports 99 languages, but quality varies enormously. High-resource languages (English, Spanish, German, Japanese) get 5–10% WER. Low-resource languages (Yoruba, Marathi, Welsh) can exceed 40–60% WER. The training data is heavily skewed toward English (roughly 65% of the 680K hours).
Timestamp Drift on Long Audio
Because Whisper processes 30-second chunks, timestamps can drift or snap incorrectly at chunk boundaries. For production timestamp accuracy, use WhisperX (which adds forced alignment via phoneme models) or faster-whisper with word_timestamps=True.
Speaker Diarization: Who Said What
Whisper does not identify speakers — it only transcribes. Speaker diarization is a separate task that identifies "who spoke when." For meetings, interviews, and podcasts, you typically combine Whisper with a diarization model.
WhisperX: Transcription + Alignment + Diarization
whisperx + pyannote
import whisperx
# 1. Transcribe with Whisper
model = whisperx.load_model("large-v3", device="cuda", compute_type="float16")
audio = whisperx.load_audio("meeting.wav")
result = model.transcribe(audio, batch_size=16)
# 2. Force-align for accurate word timestamps
align_model, metadata = whisperx.load_align_model(language_code="en", device="cuda")
result = whisperx.align(result["segments"], align_model, metadata, audio, device="cuda")
# 3. Diarize — assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN")
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
for seg in result["segments"]:
speaker = seg.get("speaker", "UNKNOWN")
print(f"[{speaker}] {seg['text']}")
When to Use What
Prototyping / Simple Transcription
Use OpenAI Whisper API. No setup, just works.
$0.006/min | 25MB limit | ~10s latency | Best for quick experiments
Production (Cost-Sensitive / Privacy)
Use faster-whisper locally. No API costs, data stays on-premise.
One-time GPU cost | No data leaves your server | 4x faster than original Whisper
Real-Time / Streaming
Use Deepgram Nova-2. Sub-300ms latency via WebSocket.
~$0.004/min | WebSocket API | Live transcription | Phone/video calls
Meeting Transcription (Multiple Speakers)
Use WhisperX or AssemblyAI Universal-2.
Built-in diarization | Speaker labels | Forced alignment for accurate timestamps
Non-English / Low-Resource Languages
Use Whisper large-v3. Best zero-shot multilingual accuracy.
99 languages | Automatic language detection | Consider fine-tuning for specific low-resource languages
Key Academic References
The Whisper Paper
Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. The foundational paper. Read sections 2 (approach) and 3 (experiments) at minimum.
CTC Loss
Graves, A. et al. (2006). Connectionist Temporal Classification. ICML. The training objective that enabled end-to-end ASR.
Conformer
Gulati, A. et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. The architecture that set SOTA before Whisper.
wav2vec 2.0
Baevski, A. et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Self-supervised pre-training for speech — the alternative paradigm to Whisper's weak supervision.
Deep Speech
Hannun, A. et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. The paper that killed the HMM pipeline.
HMM Tutorial
Rabiner, L.R. (1989). A Tutorial on Hidden Markov Models. Proc. IEEE. Understand what Whisper replaced. Still the best explanation of the statistical ASR paradigm.
Distil-Whisper
Gandhi, S. et al. (2023). Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. How to get 5.8x speedup with minimal quality loss.
Key Takeaways
1. ASR went through four paradigms — template matching, HMMs, end-to-end deep learning, and massive weak supervision. Each was a complete rethinking, not an incremental improvement.
2. Whisper's secret is data, not architecture — a standard encoder-decoder Transformer trained on 680K hours of noisy web audio. The architecture is deliberately boring; the dataset is unprecedented.
3. Use faster-whisper for production — 4x faster, 3x less memory, same accuracy. Always enable VAD filtering to prevent hallucinations on silence.
4. LibriSpeech WER is misleading — 2% WER on clean audiobooks tells you nothing about real-world performance. Test on data that matches your actual use case.
5. Speaker diarization is a separate problem — Whisper transcribes; pyannote/WhisperX identifies who spoke when. Plan your pipeline accordingly.
Practice Exercise
Build intuition for how ASR accuracy varies across conditions:
1. Record a 30-second voice memo in a quiet room. Transcribe it with the OpenAI API code above. Check every word.
2. Record the same text with background music playing. Compare the transcription — where does it fail?
3. Try a non-English language. How does the auto-detection work? Is the WER noticeably worse?
4. If you have a GPU, install faster-whisper and compare tiny vs large-v3. Measure both speed and accuracy.
5. Feed 30 seconds of silence to Whisper without VAD filtering. Document the hallucinated output.