
Speech Recognition

Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.

How Speech Recognition Works

A technical deep-dive into automatic speech recognition. From Whisper to real-time transcription with speaker diarization.

1. ASR Tasks

Speech recognition includes multiple related tasks beyond basic transcription.

Transcription

Audio to text

Output: Plain text transcript
Models: Whisper, Wav2Vec 2.0

Diarization

Who spoke when

Output: Speaker segments
Models: pyannote, NeMo

Word Timestamps

Precise timing

Output: Word-level alignment
Models: Whisper + alignment

Translation

Speech to text in another language

Output: Translated transcript
Models: Whisper, SeamlessM4T
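
For example, openai-whisper exposes the translation task through the same transcribe call it uses for transcription. A minimal sketch (the file name is illustrative; large-v3 is used because turbo was tuned primarily for transcription):

import whisper

model = whisper.load_model('large-v3')

# task='translate' produces an English transcript from non-English speech
result = model.transcribe('interview_fr.mp3', task='translate')
print(result['text'])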

The ASR Pipeline

Audio waveform -> Mel spectrogram (80 channels) -> Encoder (Transformer) -> Decoder (autoregressive) -> Text transcript
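
These stages map directly onto openai-whisper's lower-level API. A minimal sketch of the same steps (the audio file name is illustrative):

import whisper

# Encoder-decoder Transformer checkpoint
model = whisper.load_model('base')

# Audio waveform: load and pad/trim to Whisper's 30-second window
audio = whisper.load_audio('audio.mp3')
audio = whisper.pad_or_trim(audio)

# Mel spectrogram: 80-channel log-mel features (128 channels for large-v3)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Encoder features are used to detect the spoken language
_, probs = model.detect_language(mel)
print(f'Detected language: {max(probs, key=probs.get)}')

# Decoder: autoregressive decoding into text
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)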

2. Model Evolution

From RNN-based models to transformer foundation models.

Model             Year  Architecture     WER     Notes
DeepSpeech        2014  CTC              ~8%     End-to-end RNN
Wav2Vec 2.0       2020  Self-supervised  ~3%     Contrastive learning
Whisper           2022  Encoder-Decoder  ~2%     Multitask, multilingual
Whisper Large v3  2023  Encoder-Decoder  ~1.5%   More data, languages
Distil-Whisper    2023  Distilled        ~2%     6x faster, same quality
Whisper v3 Turbo  2024  Encoder-Decoder  ~1.5%   8x faster than large
Canary-1B         2024  Encoder-Decoder  ~1.2%   NVIDIA, best WER
  • Whisper v3 Turbo: best speed/quality trade-off (8x faster than large-v3, same WER)
  • Canary-1B: best raw accuracy (NVIDIA NeMo, ~1.2% WER)
  • Distil-Whisper: best for edge deployment (6x faster, runs on CPU)

3. Whisper Deep-Dive

OpenAI's Whisper is the most widely used ASR model. Trained on 680,000 hours of multilingual data.

Whisper Model Sizes

Size      Parameters  Relative Speed  VRAM    WER (English)
tiny      39M         32x             ~1GB    ~7.6%
base      74M         16x             ~1GB    ~5.0%
small     244M        6x              ~2GB    ~3.4%
medium    769M        2x              ~5GB    ~2.5%
large-v3  1.5B        1x              ~10GB   ~1.5%
turbo     809M        8x              ~6GB    ~1.5%

Key Features

  • 99 languages supported
  • Multitask: transcribe, translate, timestamps
  • Robust to noise, accents, background audio
  • Word-level timestamps with alignment

Limitations

  • No speaker diarization (needs a separate model)
  • 30-second processing chunks
  • Can hallucinate on silence/noise
  • Autoregressive = sequential decoding

Whisper Special Tokens

  • <|startoftranscript|> : begin output
  • <|en|> : language tag
  • <|transcribe|> : task type
  • <|0.00|> : timestamp
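
The decoder is conditioned on a short prefix of these control tokens and then generates text and timestamp tokens until end-of-text. The sequence below is purely illustrative (token spellings match the table above; the actual integer IDs come from Whisper's tokenizer):

# Illustrative only: structure of a Whisper decoder sequence for English transcription
prompt_tokens = ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|0.00|>']
generated = ['The', ' quick', ' brown', ' fox', '<|1.80|>', '<|endoftext|>']
print(''.join(prompt_tokens + generated))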

4. ASR Metrics

How to measure transcription quality.

Word Error Rate (WER)

The standard metric for ASR quality. Lower is better.

WER = (S + D + I) / N
S = Substitutions (wrong word)
D = Deletions (missing word)
I = Insertions (extra word)
N = Total reference words
Example:
Reference:  "the quick brown fox"
Hypothesis: "the quik brown fox jumps"
Errors: 1 substitution ("quik"), 1 insertion ("jumps"), 0 deletions
WER = (1 + 0 + 1) / 4 = 50%
  • <5% WER: Excellent (near human-level)
  • 5-15% WER: Good (usable with review)
  • >15% WER: Poor (needs improvement)
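
In practice WER is rarely computed by hand; the jiwer package is a common choice. A minimal sketch, assuming pip install jiwer:

import jiwer

reference = 'the quick brown fox'
hypothesis = 'the quik brown fox jumps'

# jiwer aligns the word sequences and counts substitutions, deletions, insertions
error_rate = jiwer.wer(reference, hypothesis)
print(f'WER: {error_rate:.0%}')  # 50% for the example above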

5. Speed Optimization

Techniques for faster transcription.

Faster-Whisper

CTranslate2 backend. 4x faster with int8.

Flash Attention

Memory-efficient attention. 2x speedup.

Batching

Process multiple chunks in parallel.

VAD Pre-filter

Skip silence. Silero VAD integration.
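
Two of these techniques combine naturally in faster-whisper: int8 quantization through the CTranslate2 backend, and Silero VAD pre-filtering via the vad_filter option. A sketch (the VAD parameter value shown is an assumption, not a recommendation):

from faster_whisper import WhisperModel

# int8 quantization via CTranslate2 (use 'float16' on GPU for higher quality)
model = WhisperModel('small', device='cpu', compute_type='int8')

# vad_filter=True runs Silero VAD first and skips silent regions,
# which speeds things up and reduces hallucinations on silence
segments, info = model.transcribe(
    'recording.mp3',
    vad_filter=True,
    vad_parameters={'min_silence_duration_ms': 500},
)

for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')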

6. Code Examples

Get started with speech recognition in Python.

OpenAI Whisper (official)

Install: pip install openai-whisper
import whisper

# Load model (tiny, base, small, medium, large-v3, turbo)
model = whisper.load_model('turbo')

# Transcribe audio file
result = model.transcribe(
    'audio.mp3',
    language='en',           # Optional: auto-detect if not set
    task='transcribe',       # or 'translate' for speech-to-English
    word_timestamps=True,    # Get word-level timing
    fp16=True               # Use FP16 for speed
)

# Results
print(result['text'])  # Full transcript

# Word-level timestamps
for segment in result['segments']:
    for word in segment.get('words', []):
        print(f"{word['start']:.2f}s: {word['word']}")

Quick Reference

For Quality
  • Whisper large-v3
  • Canary-1B (NVIDIA)
For Speed
  • Whisper turbo
  • Distil-Whisper
  • faster-whisper
For Diarization
  • pyannote 3.1
  • NeMo MSDD

Use Cases

  • Meeting transcription
  • Voice assistants
  • Podcast search
  • Call center analytics

Architectural Patterns

End-to-End ASR

Single model that directly maps audio to text (Whisper-style).

Pros:
  • Simple pipeline
  • Handles accents well
  • Multilingual
Cons:
  • Can be slow for long audio
  • Needs chunking strategy

Streaming ASR

Real-time transcription with low latency (a chunked approximation is sketched below).

Pros:
  • Live transcription
  • Sub-second latency
Cons:
  • Slightly lower accuracy
  • More complex deployment
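
True streaming systems decode incrementally with models and servers built for it (e.g. Deepgram's streaming API). The sketch below is only a chunked approximation: it records short microphone chunks and transcribes each with faster-whisper. It assumes the sounddevice package and a working microphone:

import sounddevice as sd
from faster_whisper import WhisperModel

model = WhisperModel('base', device='cpu', compute_type='int8')
SAMPLE_RATE = 16000   # faster-whisper expects 16 kHz mono float32 arrays
CHUNK_SECONDS = 5

for _ in range(6):  # transcribe roughly 30 seconds of microphone audio
    # Record one chunk (blocking)
    audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype='float32')
    sd.wait()
    segments, _ = model.transcribe(audio.flatten(), language='en')
    for seg in segments:
        print(seg.text, end=' ', flush=True)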

ASR + Diarization Pipeline

Separate speaker identification from transcription.

Pros:
  • Know who said what
  • Better for meetings
Cons:
  • Multi-step pipeline
  • Alignment challenges

Implementations

API Services

OpenAI Whisper API

whisper-1 model. Fast, accurate, handles many languages.

Deepgram

Fast streaming ASR. Nova-2 model. Good for real-time.

AssemblyAI

Best-in-class for English. Includes diarization, summarization.

Open Source

Whisper (local) (MIT)

Run locally. Large-v3 is best quality, turbo for speed.

faster-whisper (MIT)

4x faster Whisper using CTranslate2. Same accuracy.

Canary-1B (CC-BY-4.0)

NVIDIA's multilingual ASR. Strong for non-English.

Code Examples

Transcribe with OpenAI Whisper API

Fast cloud transcription with OpenAI

Install: pip install openai
from openai import OpenAI

client = OpenAI()

# Transcribe audio file
with open('recording.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='text'
    )

print(transcript)

Local Transcription with faster-whisper

4x faster than OpenAI Whisper, runs locally

Install: pip install faster-whisper
from faster_whisper import WhisperModel

# Load model (use 'large-v3' for best quality, 'base' for speed)
model = WhisperModel('large-v3', device='cuda', compute_type='float16')

# Transcribe
segments, info = model.transcribe('recording.mp3', beam_size=5)

print(f'Detected language: {info.language} ({info.language_probability:.2f})')
print('\nTranscript:')
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')

Transcribe with Speaker Diarization

Know who said what using pyannote

Install: pip install faster-whisper pyannote.audio
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline
import torch

# Load models
whisper = WhisperModel('large-v3', device='cuda')
diarization = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token='YOUR_HF_TOKEN'
)

# Diarize (who speaks when)
audio_file = 'meeting.wav'
diarization_result = diarization(audio_file)

# Transcribe
segments, _ = whisper.transcribe(audio_file)
segments = list(segments)

# Combine: assign speakers to transcript segments
for segment in segments:
    # Find speaker at segment midpoint
    t = (segment.start + segment.end) / 2
    speaker = 'UNKNOWN'
    for turn, _, spk in diarization_result.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            speaker = spk
            break
    print(f'[{speaker}] {segment.text}')

Quick Facts

Input: Audio
Output: Text
Implementations: 3 open source, 3 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for speech recognition.

Submit Results