
Speech Recognition

Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.

How Speech Recognition Works

A technical deep-dive into automatic speech recognition. From Whisper to real-time transcription with speaker diarization.

1. ASR Tasks

Speech recognition includes multiple related tasks beyond basic transcription.

Transcription

Audio to text

Output: Plain text transcript
Models: Whisper, Wav2Vec 2.0

Diarization

Who spoke when

Output: Speaker segments
Models: pyannote, NeMo

Word Timestamps

Precise timing

Output: Word-level alignment
Models: Whisper + alignment

Translation

Speech to text in another language

Output: Translated transcript
Models: Whisper, SeamlessM4T
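
For example, openai-whisper exposes the translation task through the same transcribe call it uses for transcription. A minimal sketch (the file name is illustrative; large-v3 is used because turbo was tuned primarily for transcription):

import whisper

model = whisper.load_model('large-v3')

# task='translate' produces an English transcript from non-English speech
result = model.transcribe('interview_fr.mp3', task='translate')
print(result['text'])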

The ASR Pipeline

Audio waveform -> Mel spectrogram (80 channels) -> Encoder (Transformer) -> Decoder (autoregressive) -> Text transcript
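
These stages map directly onto openai-whisper's lower-level API. A minimal sketch of the same steps (the audio file name is illustrative):

import whisper

# Encoder-decoder Transformer checkpoint
model = whisper.load_model('base')

# Audio waveform: load and pad/trim to Whisper's 30-second window
audio = whisper.load_audio('audio.mp3')
audio = whisper.pad_or_trim(audio)

# Mel spectrogram: 80-channel log-mel features (128 channels for large-v3)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Encoder features are used to detect the spoken language
_, probs = model.detect_language(mel)
print(f'Detected language: {max(probs, key=probs.get)}')

# Decoder: autoregressive decoding into text
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)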

2. Model Evolution

From RNN-based models to transformer foundation models.

Model             Year  Architecture     WER     Notes
DeepSpeech        2014  CTC              ~8%     End-to-end RNN
Wav2Vec 2.0       2020  Self-supervised  ~3%     Contrastive learning
Whisper           2022  Encoder-Decoder  ~2%     Multitask, multilingual
Whisper Large v3  2023  Encoder-Decoder  ~1.5%   More data, languages
Distil-Whisper    2023  Distilled        ~2%     6x faster, same quality
Whisper v3 Turbo  2024  Encoder-Decoder  ~1.5%   8x faster than large
Canary-1B         2024  Encoder-Decoder  ~1.2%   NVIDIA, best WER
  • Whisper v3 Turbo: best speed/quality trade-off (8x faster than large-v3, same WER)
  • Canary-1B: best raw accuracy (NVIDIA NeMo, ~1.2% WER)
  • Distil-Whisper: best for edge deployment (6x faster, runs on CPU)

3. Whisper Deep-Dive

OpenAI's Whisper is the most widely used ASR model. Trained on 680,000 hours of multilingual data.

Whisper Model Sizes

Size      Parameters  Relative Speed  VRAM    WER (English)
tiny      39M         32x             ~1GB    ~7.6%
base      74M         16x             ~1GB    ~5.0%
small     244M        6x              ~2GB    ~3.4%
medium    769M        2x              ~5GB    ~2.5%
large-v3  1.5B        1x              ~10GB   ~1.5%
turbo     809M        8x              ~6GB    ~1.5%

Key Features

  • 99 languages supported
  • Multitask: transcribe, translate, timestamps
  • Robust to noise, accents, background audio
  • Word-level timestamps with alignment

Limitations

  • No speaker diarization (needs a separate model)
  • 30-second processing chunks
  • Can hallucinate on silence/noise
  • Autoregressive = sequential decoding

Whisper Special Tokens

  • <|startoftranscript|> : begin output
  • <|en|> : language tag
  • <|transcribe|> : task type
  • <|0.00|> : timestamp
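
The decoder is conditioned on a short prefix of these control tokens and then generates text and timestamp tokens until end-of-text. The sequence below is purely illustrative (token spellings match the table above; the actual integer IDs come from Whisper's tokenizer):

# Illustrative only: structure of a Whisper decoder sequence for English transcription
prompt_tokens = ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|0.00|>']
generated = ['The', ' quick', ' brown', ' fox', '<|1.80|>', '<|endoftext|>']
print(''.join(prompt_tokens + generated))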

4. ASR Metrics

How to measure transcription quality.

Word Error Rate (WER)

The standard metric for ASR quality. Lower is better.

WER = (S + D + I) / N
S = Substitutions (wrong word)
D = Deletions (missing word)
I = Insertions (extra word)
N = Total reference words
Example:
Reference:  "the quick brown fox"
Hypothesis: "the quik brown fox jumps"
Errors: 1 substitution ("quik"), 1 insertion ("jumps"), 0 deletions
WER = (1 + 0 + 1) / 4 = 50%
  • <5% WER: Excellent (near human-level)
  • 5-15% WER: Good (usable with review)
  • >15% WER: Poor (needs improvement)
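
In practice WER is rarely computed by hand; the jiwer package is a common choice. A minimal sketch, assuming pip install jiwer:

import jiwer

reference = 'the quick brown fox'
hypothesis = 'the quik brown fox jumps'

# jiwer aligns the word sequences and counts substitutions, deletions, insertions
error_rate = jiwer.wer(reference, hypothesis)
print(f'WER: {error_rate:.0%}')  # 50% for the example above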

5. Speed Optimization

Techniques for faster transcription.

Faster-Whisper

CTranslate2 backend. 4x faster with int8.

Flash Attention

Memory-efficient attention. 2x speedup.

Batching

Process multiple chunks in parallel.

VAD Pre-filter

Skip silence. Silero VAD integration.
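
Two of these techniques combine naturally in faster-whisper: int8 quantization through the CTranslate2 backend, and Silero VAD pre-filtering via the vad_filter option. A sketch (the VAD parameter value shown is an assumption, not a recommendation):

from faster_whisper import WhisperModel

# int8 quantization via CTranslate2 (use 'float16' on GPU for higher quality)
model = WhisperModel('small', device='cpu', compute_type='int8')

# vad_filter=True runs Silero VAD first and skips silent regions,
# which speeds things up and reduces hallucinations on silence
segments, info = model.transcribe(
    'recording.mp3',
    vad_filter=True,
    vad_parameters={'min_silence_duration_ms': 500},
)

for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')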

6. Code Examples

Get started with speech recognition in Python.

OpenAI Whisper (official)

Install: pip install openai-whisper
import whisper

# Load model (tiny, base, small, medium, large-v3, turbo)
model = whisper.load_model('turbo')

# Transcribe audio file
result = model.transcribe(
    'audio.mp3',
    language='en',           # Optional: auto-detect if not set
    task='transcribe',       # or 'translate' for speech-to-English
    word_timestamps=True,    # Get word-level timing
    fp16=True               # Use FP16 for speed
)

# Results
print(result['text'])  # Full transcript

# Word-level timestamps
for segment in result['segments']:
    for word in segment.get('words', []):
        print(f"{word['start']:.2f}s: {word['word']}")

Quick Reference

For Quality
  • Whisper large-v3
  • Canary-1B (NVIDIA)
For Speed
  • Whisper turbo
  • Distil-Whisper
  • faster-whisper
For Diarization
  • pyannote 3.1
  • NeMo MSDD

Use Cases

  • Meeting transcription
  • Voice assistants
  • Podcast search
  • Call center analytics

Architectural Patterns

End-to-End ASR

Single model that directly maps audio to text (Whisper-style).

Pros:
  • Simple pipeline
  • Handles accents well
  • Multilingual
Cons:
  • Can be slow for long audio
  • Needs chunking strategy

Streaming ASR

Real-time transcription with low latency (a chunked approximation is sketched below).

Pros:
  • Live transcription
  • Sub-second latency
Cons:
  • Slightly lower accuracy
  • More complex deployment
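
True streaming systems decode incrementally with models and servers built for it (e.g. Deepgram's streaming API). The sketch below is only a chunked approximation: it records short microphone chunks and transcribes each with faster-whisper. It assumes the sounddevice package and a working microphone:

import sounddevice as sd
from faster_whisper import WhisperModel

model = WhisperModel('base', device='cpu', compute_type='int8')
SAMPLE_RATE = 16000   # faster-whisper expects 16 kHz mono float32 arrays
CHUNK_SECONDS = 5

for _ in range(6):  # transcribe roughly 30 seconds of microphone audio
    # Record one chunk (blocking)
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype='float32')
    sd.wait()
    segments, _ = model.transcribe(audio.flatten(), language='en')
    for seg in segments:
        print(seg.text, end=' ', flush=True)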

ASR + Diarization Pipeline

Separate speaker identification from transcription.

Pros:
  • Know who said what
  • Better for meetings
Cons:
  • Multi-step pipeline
  • Alignment challenges

Implementations

API Services

OpenAI Whisper API

whisper-1 model. Fast, accurate, handles many languages.

Deepgram

Fast streaming ASR. Nova-2 model. Good for real-time.

AssemblyAI

Best-in-class for English. Includes diarization, summarization.

Open Source

Whisper (local) (MIT)

Run locally. Large-v3 is best quality, turbo for speed.

faster-whisper (MIT)

4x faster Whisper using CTranslate2. Same accuracy.

Canary-1B (CC-BY-4.0)

NVIDIA's multilingual ASR. Strong for non-English.

Code Examples

Transcribe with OpenAI Whisper API

Fast cloud transcription with OpenAI

Install: pip install openai
from openai import OpenAI

client = OpenAI()

# Transcribe audio file
with open('recording.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='text'
    )

print(transcript)

Local Transcription with faster-whisper

4x faster than OpenAI Whisper, runs locally

Install: pip install faster-whisper
from faster_whisper import WhisperModel

# Load model (use 'large-v3' for best quality, 'base' for speed)
model = WhisperModel('large-v3', device='cuda', compute_type='float16')

# Transcribe
segments, info = model.transcribe('recording.mp3', beam_size=5)

print(f'Detected language: {info.language} ({info.language_probability:.2f})')
print('\nTranscript:')
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')

Transcribe with Speaker Diarization

Know who said what using pyannote

Install: pip install faster-whisper pyannote.audio
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline
import torch

# Load models
whisper = WhisperModel('large-v3', device='cuda')
diarization = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token='YOUR_HF_TOKEN'
)

# Diarize (who speaks when)
audio_file = 'meeting.wav'
diarization_result = diarization(audio_file)

# Transcribe
segments, _ = whisper.transcribe(audio_file)
segments = list(segments)

# Combine: assign speakers to transcript segments
for segment in segments:
    # Find speaker at segment midpoint
    t = (segment.start + segment.end) / 2
    speaker = 'UNKNOWN'
    for turn, _, spk in diarization_result.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            speaker = spk
            break
    print(f'[{speaker}] {segment.text}')

Quick Facts

Input: Audio
Output: Text
Implementations: 3 open source, 3 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for speech recognition.

Submit Results