Speech Recognition (Audio to Text)
70 years of teaching machines to listen — from single-digit recognizers to Whisper transcribing 99 languages at near-human accuracy.
70 Years of Teaching Machines to Listen
Automatic Speech Recognition (ASR) has been one of AI's longest-running quests. The path from recognizing ten digits to transcribing any speaker in any language required breakthroughs in signal processing, statistical modeling, neural architecture design, and — critically — scale of training data. Each generation solved one fundamental limitation of the last.
Understanding this history explains why Whisper works the way it does, what trade-offs were made, and why certain failure modes still persist today.
Audrey: The First Speech Recognizer
In 1952, at Bell Labs, K.H. Davis, R. Biddulph, and S. Balashek built Audrey (Automatic Digit Recognizer) — a room-sized analog circuit that could recognize spoken digits 0–9 from a single speaker with roughly 97% accuracy. It worked by matching the energy patterns of formant frequencies against stored reference templates.
Audrey was a proof of concept with no practical use — it was tuned to one voice and could only handle isolated digits spoken with pauses between them. But it established the fundamental approach that would dominate for decades: compare incoming audio to stored templates.
— Davis, K.H. et al. (1952). Automatic Recognition of Spoken Digits. JASA, 24(6), 637–642.
IBM Shoebox
IBM demonstrated Shoebox at the 1962 World's Fair — a machine the size of a shoebox that recognized 16 spoken words (digits plus commands like "plus", "minus", "total") and could drive a simple adding machine by voice. It used analog filters to detect formant patterns. The press was amazed; researchers knew the hard problems — continuous speech, speaker independence, vocabulary beyond a handful of words — remained completely unsolved.
Dynamic Time Warping
In 1978, Hiroaki Sakoe and Seibi Chiba formalized Dynamic Time Warping (DTW), an algorithm that could align two speech signals of different speeds. People say "hello" at different rates — DTW could stretch and compress the time axis to find the best alignment. This was the first time speech recognition could handle natural variation in speaking speed, enabling small-vocabulary isolated-word recognizers that actually worked for multiple speakers.
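The warping idea fits in a few lines of Python. A minimal sketch of the DTW recurrence (without the slope constraints of the original formulation), comparing 1-D feature sequences:

```python
import math

def dtw_distance(a, b):
    """Minimal dynamic time warping distance between two 1-D sequences.
    Stretches/compresses the time axis to find the cheapest alignment."""
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] vs b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # moves: match both frames, or repeat a frame on either side
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

# The same pattern spoken slowly and quickly aligns at zero cost...
slow = [0, 1, 1, 2, 2, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))          # 0.0 — perfect warped match
# ...while a genuinely different pattern does not
print(dtw_distance(slow, [3, 2, 1, 0]) > 0)  # True
```

Real recognizers ran this over multi-dimensional spectral features rather than scalars, but the recurrence is the same.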
Hidden Markov Models Take Over
The single most important shift in ASR history. In the mid-1970s, researchers at IBM (Jelinek, Bahl, Mercer), CMU (Baker), and the Institute for Defense Analyses independently converged on Hidden Markov Models (HMMs) as the framework for speech recognition. The key insight: model speech as a sequence of hidden states (phonemes) that generate observable acoustic features, with probabilities governing transitions between states and emissions of observations.
# HMM for speech: two probability distributions
# 1. Transition: P(next_phoneme | current_phoneme)
#    "t" → "r" → "ee" (for the word "tree")
# 2. Emission: P(acoustic_features | phoneme)
#    phoneme "ee" → high F1, high F2 formant frequencies
# Decoding: find most likely phoneme sequence given audio
# Uses Viterbi algorithm (dynamic programming)
best_path = viterbi(observations, transition_probs, emission_probs)
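The Viterbi decoding step above can be made concrete with a toy two-phoneme model (all states, observations, and probabilities here are hypothetical, chosen only to illustrate the dynamic program):

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a sequence of observations."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best predecessor state for s, given this observation
            prob, prev = max(
                (V[-2][p] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best_prob, best_state = max((V[-1][s], s) for s in states)
    return path[best_state], best_prob

# Hypothetical two-phoneme model: which phoneme generated each audio frame?
states = ["t", "ee"]
start_p = {"t": 0.8, "ee": 0.2}
trans_p = {"t": {"t": 0.4, "ee": 0.6}, "ee": {"t": 0.1, "ee": 0.9}}
emit_p = {"t": {"burst": 0.7, "tone": 0.3}, "ee": {"burst": 0.1, "tone": 0.9}}

best_path, prob = viterbi(["burst", "tone", "tone"], states, start_p, trans_p, emit_p)
print(best_path)  # ['t', 'ee', 'ee']
```

Production systems of the era ran the same recursion in log space over thousands of context-dependent states, but the algorithm is unchanged.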
HMMs had a crucial mathematical advantage: the Baum-Welch algorithm (a special case of Expectation-Maximization) could train them from unlabeled audio-text pairs. Combined with Gaussian Mixture Models (GMMs) to model acoustic features, HMM-GMM systems dominated ASR for 30 years. Every commercial speech system from 1985 to 2012 — Dragon NaturallySpeaking, Nuance, Siri's original engine — was built on this foundation.
— Rabiner, L.R. (1989). A Tutorial on Hidden Markov Models. Proc. IEEE, 77(2), 257–286. The definitive HMM tutorial — still the most-cited paper in speech processing.
Large Vocabulary Continuous Speech Recognition
Through the 1990s, driven by the DARPA-funded WSJ (Wall Street Journal) and Switchboard corpora, systems scaled to 60,000+ word vocabularies with continuous speech (no pauses between words). The trick was combining HMM acoustic models with statistical n-gram language models — P(word | previous words) — to constrain the search space. Dragon NaturallySpeaking launched in 1997 as the first commercial large-vocabulary dictation product. WER on clean read speech (WSJ) dropped below 10% for the first time. But noisy, conversational, accented, or multilingual speech remained catastrophically bad.
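The role of the language model can be sketched with a toy bigram scorer (the counts below are hypothetical; real systems trained n-grams on billions of words). Given two hypotheses the acoustic model finds nearly identical, the LM breaks the tie:

```python
import math

# Hypothetical bigram and unigram counts, standing in for a model
# trained on a large text corpus
counts = {
    ("<s>", "recognize"): 50, ("recognize", "speech"): 40,
    ("<s>", "wreck"): 2, ("wreck", "a"): 2,
    ("a", "nice"): 30, ("nice", "beach"): 1,
}
unigrams = {"<s>": 100, "recognize": 50, "wreck": 2,
            "a": 60, "nice": 30, "beach": 1, "speech": 40}
V = len(unigrams)  # vocabulary size, for add-one smoothing

def log_p(sentence):
    """Add-one-smoothed bigram log-probability of a word sequence."""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((counts.get((prev, word), 0) + 1)
                          / (unigrams.get(prev, 0) + V))
    return total

# Classic near-homophone pair: the language model strongly prefers the first
print(log_p("recognize speech") > log_p("wreck a nice beach"))  # True
```

Decoding combined this score with the HMM acoustic score, so the search only pursued word sequences that were both acoustically and linguistically plausible.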
Deep Neural Networks Replace GMMs
In a landmark 2012 collaboration, researchers from Toronto (Hinton), Microsoft (Deng, Yu), Google (Jaitly, Senior), and IBM (Kingsbury) published a joint paper showing that replacing GMMs with deep neural networks (DNNs) in the HMM framework reduced word error rates by 20–30% relative across multiple benchmarks. The HMM structure stayed — the acoustic model got dramatically better.
This triggered the industry's wholesale shift to deep learning. Within two years, every major speech team (Google, Apple, Microsoft, Baidu) had replaced their GMM acoustic models with DNNs. The HMM framework was still there, but its days were numbered.
Deep Speech: End-to-End Learning
In 2014, Awni Hannun, Andrew Ng, and colleagues at Baidu Research published Deep Speech — a system that threw away the entire HMM pipeline. Instead of phoneme-level HMM states, pronunciation dictionaries, and language model rescoring, they used a single deep recurrent neural network trained end-to-end with Connectionist Temporal Classification (CTC).
"Our system does not need a phoneme dictionary, nor even the concept of a phoneme [...] We show that an end-to-end deep learning approach can be competitive with traditional methods on standard benchmarks, and can outperform them in noisy environments."
— Hannun, A. et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567.
The key innovation was CTC loss — a training objective invented by Alex Graves (2006) that lets the network output a sequence of characters without needing to know the exact alignment between audio frames and text characters. CTC marginalizes over all possible alignments, freeing the model from requiring frame-level phoneme labels. This eliminated the need for forced alignment, pronunciation dictionaries, and the entire HMM state machine.
— Graves, A. et al. (2006). Connectionist Temporal Classification. ICML, 369–376.
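CTC's many-to-one mapping from frame-level alignments to transcripts is easiest to see in its collapse rule: merge repeated labels, then remove blanks. A sketch of the decode-time rule (the loss itself sums probabilities over every alignment that collapses to the target):

```python
BLANK = "_"

def ctc_collapse(frame_labels):
    """Collapse a per-frame label sequence: merge repeats, then drop blanks.
    Many frame-level alignments map to one transcript."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# Several different frame-level alignments, one transcript:
print(ctc_collapse(list("__ccc_aa_t_")))  # cat
print(ctc_collapse(list("c_a_t")))        # cat
print(ctc_collapse(list("cc_aaa_ttt")))   # cat
# A blank between repeats is how CTC emits genuine double letters:
print(ctc_collapse(list("h_e_ll_l_o")))   # hello
```

Because any of these alignments is acceptable, the network never needs frame-level phoneme labels — exactly the property that let Deep Speech discard forced alignment.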
Attention-Based Sequence-to-Sequence
In 2015, Jan Chorowski, Dzmitry Bahdanau, and colleagues — and separately William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals with their "Listen, Attend and Spell" model — showed that attention-based encoder-decoder models could transcribe speech by learning to focus on relevant audio frames when emitting each character. Unlike CTC, attention models could learn the alignment implicitly and produce outputs conditioned on previously generated tokens — enabling better handling of language model context.
Transformer & Conformer Enter ASR
The Transformer architecture (Vaswani et al., 2017) arrived in ASR through models like Speech-Transformer (Dong et al., 2018). But the breakthrough came with the Conformer (Gulati et al., 2020 at Google), which interleaved self-attention layers with convolution layers — attention captures global context, convolutions capture local acoustic patterns. Conformer achieved 1.9% WER on LibriSpeech test-clean, a new SOTA.
Meanwhile, wav2vec 2.0 (Baevski et al., 2020 at Meta) demonstrated that self-supervised pre-training on unlabeled audio — masking portions of the speech signal and predicting them — could learn powerful representations. Fine-tuning on just 10 minutes of labeled data achieved results competitive with 100 hours of supervised training.
— Gulati, A. et al. (2020). Conformer. Interspeech.
— Baevski, A. et al. (2020). wav2vec 2.0. NeurIPS.
Whisper: Robust Speech Recognition via Large-Scale Weak Supervision
In 2022, Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever at OpenAI took a radically different approach from the self-supervised pre-training trend. Instead of learning representations from unlabeled audio, they collected 680,000 hours of audio with existing transcriptions scraped from the internet — podcasts, audiobooks, lectures, YouTube videos with subtitles — in 99 languages.
The transcriptions were noisy (auto-generated subtitles, imperfect alignments), but the sheer scale compensated. Whisper was trained as a straightforward sequence-to-sequence Transformer with no self-supervised pre-training, no CTC, no external language model — just supervised training on an enormous, diverse, weakly-labeled dataset.
"We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours [...] the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning."
The result was the most robust general-purpose ASR model ever released. Whisper handled accents, background noise, technical jargon, and code-switching between languages — failure modes that had plagued ASR for decades — because it had simply heard them all during training. And OpenAI released it under the MIT license.
The Post-Whisper Landscape
Whisper spawned an ecosystem. Whisper large-v3 (November 2023) added improved multilingual performance with 128 Mel frequency bins (up from 80). Distil-Whisper (Gandhi et al., 2023) distilled the large model into versions 5.8x faster with minimal quality loss. faster-whisper (SYSTRAN) reimplemented inference in CTranslate2 for 4x speedup. WhisperX added forced alignment and speaker diarization on top.
Meanwhile, commercial providers pushed further: Deepgram Nova-2, AssemblyAI Universal-2, and Google Chirp all claim lower WER than Whisper on English benchmarks, particularly for conversational and noisy audio. The open-source world responded with Canary (NVIDIA, 2024) and Parakeet models achieving state-of-the-art English ASR.
The throughline: 1952 to 2025
Seven decades. Four paradigm shifts:
- Template matching (1952–1970s): compare incoming audio against stored reference patterns
- Statistical modeling (1970s–2012): HMM acoustic models, GMMs, and n-gram language models
- End-to-end deep learning (2012–2020): DNNs, CTC, attention, Conformer
- Massive weak supervision (2022–): one Transformer trained on 680,000 hours of web audio
The lesson is consistent across all of ML: more data and simpler architectures trained at scale beat complex systems with less data. Whisper's Transformer is architecturally unremarkable — its power comes from 680,000 hours of diverse audio.
Whisper Architecture: How It Works
Whisper is an encoder-decoder Transformer. The architecture itself is deliberately standard — the innovation is in training data and task formulation, not model design.
Step 1: Audio Preprocessing
Raw audio is resampled to 16 kHz, then converted to an 80-channel (v2) or 128-channel (v3) log-Mel spectrogram using 25ms windows with 10ms stride. The spectrogram is computed over a fixed 30-second chunk — shorter audio is zero-padded, longer audio is processed in 30-second segments.
# Audio → Mel spectrogram (what the encoder actually sees)
# Input: 30 seconds of 16kHz audio = 480,000 samples
# Window: 25ms (400 samples), stride: 10ms (160 samples)
# Output: (80, 3000) for v2 or (128, 3000) for v3
# ↑ mel bins ↑ time frames (30s / 10ms)
import whisper
audio = whisper.load_audio("speech.mp3") # → (N,) float32
audio = whisper.pad_or_trim(audio) # → (480000,) exactly 30s
mel = whisper.log_mel_spectrogram(audio) # → (80, 3000)
Step 2: Encoder
Two 1D convolution layers with GELU activations downsample the spectrogram by 2x in time (3000 frames to 1500), then sinusoidal positional embeddings are added. The result passes through N Transformer encoder blocks (N=32 for large) with self-attention and feed-forward layers.
# Encoder architecture (large-v3: 32 layers, d_model=1280)
mel_spectrogram # (128, 3000) — input
→ Conv1d(128→1280, kernel=3, stride=1) + GELU
→ Conv1d(1280→1280, kernel=3, stride=2) + GELU # downsample 2x
→ + sinusoidal_pos_embed # (1500, 1280)
→ 32× TransformerEncoderBlock:
→ LayerNorm → MultiHeadAttention(20 heads)
→ LayerNorm → FFN(1280 → 5120 → 1280)
→ LayerNorm
→ encoder_output # (1500, 1280)
Step 3: Decoder (Autoregressive)
The decoder is a standard Transformer decoder with learned positional embeddings (not sinusoidal — a key difference from the encoder). It generates tokens one at a time, attending to both previously generated tokens (causal self-attention) and encoder output (cross-attention). Special tokens control behavior:
# Decoder token sequence (the "prompt" that controls Whisper):
<|startoftranscript|> # Begin
<|en|> # Language token (detected or forced)
<|transcribe|> # Task: transcribe (vs <|translate|> to English)
<|notimestamps|> # Or timestamp tokens: <|0.00|> <|0.50|> ...
The quick brown fox... # Generated text tokens
<|endoftext|> # Stop
# This multi-task formulation means ONE model handles:
# - Language identification (predict language token)
# - Transcription (same language out)
# - Translation (any language → English)
# - Timestamp prediction (when each word was spoken)
This multi-task design is what makes Whisper so versatile. The model doesn't just transcribe — it has learned to detect language, generate timestamps, and translate, all conditioned on which special tokens appear in the decoder prefix.
Whisper Model Sizes
| Model | Params | Layers | d_model | VRAM |
|---|---|---|---|---|
| tiny | 39M | 4 | 384 | ~1 GB |
| base | 74M | 6 | 512 | ~1 GB |
| small | 244M | 12 | 768 | ~2 GB |
| medium | 769M | 24 | 1024 | ~5 GB |
| large-v3 | 1.55B | 32 | 1280 | ~10 GB |
| large-v3-turbo | 809M | 4 (dec) | 1280 | ~6 GB |
large-v3-turbo uses the full 32-layer encoder but only 4 decoder layers (vs 32 in large-v3), achieving 8x faster decoding with minimal quality loss on most languages.
Working Code: Three Ways to Transcribe
From the simplest API call to optimized local inference — pick the approach that matches your constraints.
Option 1: OpenAI Whisper API
Easiest: no GPU, no model download, no dependencies beyond the SDK. Pay $0.006/minute. Best for prototyping and low-volume production.
from openai import OpenAI
client = OpenAI()
# Basic transcription
with open("recording.mp3", "rb") as f:
result = client.audio.transcriptions.create(
model="whisper-1",
file=f,
response_format="verbose_json", # includes timestamps
timestamp_granularities=["segment", "word"]
)
print(result.text)
# Word-level timestamps
for word in result.words:
print(f"[{word.start:.2f}s] {word.word}")
# Translation (any language → English)
with open("german_speech.mp3", "rb") as f:
translation = client.audio.translations.create(
model="whisper-1",
file=f
)
print(translation.text) # English output
Option 2: faster-whisper (Local, 4x Faster)
Production: faster-whisper by SYSTRAN reimplements Whisper inference using CTranslate2 — a C++ inference engine with INT8 quantization. 4x faster than the original PyTorch implementation, 3x lower memory, identical accuracy. This is what you should use for production local deployment.
pip install faster-whisper
from faster_whisper import WhisperModel
# Model sizes: tiny, base, small, medium, large-v3
# compute_type: float16 (GPU), int8 (CPU-friendly), float32
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Transcribe with timestamps
segments, info = model.transcribe(
"recording.mp3",
beam_size=5,
language="en", # or None for auto-detection
vad_filter=True, # skip silence (faster)
word_timestamps=True # word-level timing
)
print(f"Detected language: {info.language} ({info.language_probability:.0%})")
for segment in segments:
print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
if segment.words:
for word in segment.words:
print(f" [{word.start:.2f}s] {word.word}")
VAD filter is essential for long files
Whisper processes audio in 30-second chunks. Without Voice Activity Detection (VAD) filtering, it hallucinates text during silent segments — a well-known failure mode. The vad_filter=True option uses Silero VAD to skip silent chunks, dramatically reducing both hallucinations and processing time.
Option 3: HuggingFace Transformers Pipeline
Flexible: The HuggingFace pipeline provides a unified interface for any ASR model — Whisper, wav2vec2, Conformer, or fine-tuned variants. Best when you need to swap models or use community fine-tunes.
pip install transformers torch accelerate
import torch
from transformers import pipeline
# Load any ASR model from the Hub
pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-large-v3",
torch_dtype=torch.float16,
device="cuda:0",
)
# Basic transcription
result = pipe("recording.mp3")
print(result["text"])
# With timestamps and chunking for long files
result = pipe(
"long_meeting.mp3",
chunk_length_s=30, # process in 30s chunks
batch_size=16, # parallel chunks on GPU
return_timestamps=True # segment-level timestamps
)
for chunk in result["chunks"]:
start, end = chunk["timestamp"]
print(f"[{start:.1f}s → {end:.1f}s] {chunk['text']}")
# Swap to a different model (e.g., distil-whisper for speed)
fast_pipe = pipeline(
"automatic-speech-recognition",
model="distil-whisper/distil-large-v3",
torch_dtype=torch.float16,
device="cuda:0",
)
# 5.8x faster, within 1% WER of large-v3 on English
Word Error Rate Benchmarks
Word Error Rate (WER) is the standard metric for ASR accuracy. It measures the percentage of words that are wrong in the transcription — counting substitutions, insertions, and deletions against a human reference transcript.
# WER = (Substitutions + Insertions + Deletions) / Total Reference Words
# Reference:  "the cat sat on the mat"
# Hypothesis: "the cat set on a mat"
#                      ^^^    ^
# 1 substitution + 1 substitution = 2 errors / 6 words = 33.3% WER
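The same metric as runnable code, using a standard word-level edit distance (a minimal sketch; libraries such as jiwer implement this for production use):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat set on a mat"))  # 0.333...
```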
LibriSpeech test-clean (Read English Audiobooks)
The standard academic benchmark. Clean, read speech — relatively easy. Lower is better.
Sources: OpenAI Whisper paper (2022), Gulati et al. (2020), NVIDIA NeMo (2024). Human WER from Liptchinsky et al. (2017). Note: Whisper WER is zero-shot (no fine-tuning on LibriSpeech).
Real-World Speech (Noisy, Conversational, Accented)
Where models are actually tested in production. These numbers tell you much more than LibriSpeech.
| Dataset | Type | Whisper v3 | Context |
|---|---|---|---|
| LibriSpeech test-clean | Read speech | 2.0% | Audiobooks, studio quality |
| LibriSpeech test-other | Harder read speech | 3.5% | Noisier recordings, varied speakers |
| Switchboard (Hub5'00) | Phone conversations | ~8.5% | Casual English, telephony quality |
| Common Voice (en) | Crowdsourced | ~9% | Diverse accents, variable quality |
| Earnings Calls | Business/finance | ~10% | Domain jargon, multiple speakers |
| Fleurs (avg 102 lang) | Multilingual | ~14% | Wide variance by language |
Key insight: LibriSpeech WER has limited predictive value for real-world performance. The gap between 2% (clean audiobooks) and 10%+ (noisy real audio) is where production quality is determined.
Where Whisper Breaks Down
Understanding failure modes is more valuable than memorizing accuracy numbers.
Hallucination on Silence
Whisper's autoregressive decoder will generate text even when there is no speech. Silent segments frequently produce hallucinated phrases — repeated words, URLs, or entire fabricated sentences. This is the most common production issue. Mitigation: always use VAD preprocessing to skip silent chunks.
Repetition Loops
The decoder occasionally enters repetitive loops, generating the same phrase or sentence fragment dozens of times. This is a known pathology of autoregressive Transformers. Mitigation: use temperature fallback (Whisper automatically retries with higher temperature when the compression ratio is suspiciously high) and set condition_on_previous_text=False for long files.
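The compression-ratio check is easy to reproduce with zlib: looping output compresses far better than natural text. A sketch of the heuristic (2.4 is the default compression_ratio_threshold in OpenAI's reference implementation):

```python
import zlib

def compression_ratio(text):
    """Raw bytes / compressed bytes; repetitive text scores much higher."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

normal = "The quick brown fox jumps over the lazy dog near the river bank."
looping = "thank you " * 30  # the kind of output a repetition loop produces

print(compression_ratio(looping) > 2.4)  # True: flag this segment and retry
print(compression_ratio(normal) > 2.4)   # False: normal text stays well below
```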
Low-Resource Languages
Whisper supports 99 languages, but quality varies enormously. High-resource languages (English, Spanish, German, Japanese) get 5–10% WER. Low-resource languages (Yoruba, Marathi, Welsh) can exceed 40–60% WER. The training data is heavily skewed toward English (roughly 65% of the 680K hours).
Timestamp Drift on Long Audio
Because Whisper processes 30-second chunks, timestamps can drift or snap incorrectly at chunk boundaries. For production timestamp accuracy, use WhisperX (which adds forced alignment via phoneme models) or faster-whisper with word_timestamps=True.
Speaker Diarization: Who Said What
Whisper does not identify speakers — it only transcribes. Speaker diarization is a separate task that identifies "who spoke when." For meetings, interviews, and podcasts, you typically combine Whisper with a diarization model.
WhisperX: Transcription + Alignment + Diarization
whisperx + pyannote
import whisperx
# 1. Transcribe with Whisper
model = whisperx.load_model("large-v3", device="cuda", compute_type="float16")
audio = whisperx.load_audio("meeting.wav")
result = model.transcribe(audio, batch_size=16)
# 2. Force-align for accurate word timestamps
align_model, metadata = whisperx.load_align_model(language_code="en", device="cuda")
result = whisperx.align(result["segments"], align_model, metadata, audio, device="cuda")
# 3. Diarize — assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN")
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
for seg in result["segments"]:
speaker = seg.get("speaker", "UNKNOWN")
print(f"[{speaker}] {seg['text']}")
When to Use What
Prototyping / Simple Transcription
Use OpenAI Whisper API. No setup, just works.
$0.006/min | 25MB limit | ~10s latency | Best for quick experiments
Production (Cost-Sensitive / Privacy)
Use faster-whisper locally. No API costs, data stays on-premise.
One-time GPU cost | No data leaves your server | 4x faster than original Whisper
Real-Time / Streaming
Use Deepgram Nova-2. Sub-300ms latency via WebSocket.
~$0.004/min | WebSocket API | Live transcription | Phone/video calls
Meeting Transcription (Multiple Speakers)
Use WhisperX or AssemblyAI Universal-2.
Built-in diarization | Speaker labels | Forced alignment for accurate timestamps
Non-English / Low-Resource Languages
Use Whisper large-v3. Best zero-shot multilingual accuracy.
99 languages | Automatic language detection | Consider fine-tuning for specific low-resource languages
Key Academic References
The Whisper Paper
Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. The foundational paper. Read sections 2 (approach) and 3 (experiments) at minimum.
CTC Loss
Graves, A. et al. (2006). Connectionist Temporal Classification. ICML. The training objective that enabled end-to-end ASR.
Conformer
Gulati, A. et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. The architecture that set SOTA before Whisper.
wav2vec 2.0
Baevski, A. et al. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Self-supervised pre-training for speech — the alternative paradigm to Whisper's weak supervision.
Deep Speech
Hannun, A. et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. The paper that killed the HMM pipeline.
HMM Tutorial
Rabiner, L.R. (1989). A Tutorial on Hidden Markov Models. Proc. IEEE. Understand what Whisper replaced. Still the best explanation of the statistical ASR paradigm.
Distil-Whisper
Gandhi, S. et al. (2023). Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. How to get 5.8x speedup with minimal quality loss.
Key Takeaways
1. ASR went through four paradigms — template matching, HMMs, end-to-end deep learning, and massive weak supervision. Each was a complete rethinking, not an incremental improvement.
2. Whisper's secret is data, not architecture — a standard encoder-decoder Transformer trained on 680K hours of noisy web audio. The architecture is deliberately boring; the dataset is unprecedented.
3. Use faster-whisper for production — 4x faster, 3x less memory, same accuracy. Always enable VAD filtering to prevent hallucinations on silence.
4. LibriSpeech WER is misleading — 2% WER on clean audiobooks tells you nothing about real-world performance. Test on data that matches your actual use case.
5. Speaker diarization is a separate problem — Whisper transcribes; pyannote/WhisperX identifies who spoke when. Plan your pipeline accordingly.
Practice Exercise
Build intuition for how ASR accuracy varies across conditions:
1. Record a 30-second voice memo in a quiet room. Transcribe it with the OpenAI API code above. Check every word.
2. Record the same text with background music playing. Compare the transcription — where does it fail?
3. Try a non-English language. How does the auto-detection work? Is the WER noticeably worse?
4. If you have a GPU, install faster-whisper and compare tiny vs large-v3. Measure both speed and accuracy.
5. Feed 30 seconds of silence to Whisper without VAD filtering. Document the hallucinated output.