Speech Recognition (Audio to Text)
Your first non-text modality. Convert audio recordings to accurate transcriptions with Whisper and beyond.
From Sound Waves to Text
Speech recognition (also called Speech-to-Text or ASR - Automatic Speech Recognition) converts spoken audio into written text. This is the foundation for voice assistants, meeting transcription, and accessibility tools.
Until 2022, this required either expensive APIs or complex multi-model pipelines. Then OpenAI released Whisper - an open-source model that achieves near-human accuracy.
Why Whisper Changed Everything
- MIT licensed - run it locally, forever free
- Trained on 680,000 hours of multilingual audio
- Works in 99 languages with automatic language detection
- Handles background noise, accents, and multiple speakers
Option 1: OpenAI Whisper API
The fastest way to get started. Hosted by OpenAI, no GPU required, pay per minute of audio.
Install
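The only dependency for the hosted API is the official OpenAI Python SDK:

```bash
pip install openai
```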
Basic Transcription

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the audio file and request a plain-text transcript
with open('recording.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='text'
    )

print(transcript)
```
$0.006 per minute | 25 MB max file size | ~10s typical latency
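The endpoint can also return timestamped subtitle formats directly. A minimal sketch, assuming the same whisper-1 endpoint and that the SDK returns the subtitle text as a plain string for non-JSON formats:

```python
from openai import OpenAI

client = OpenAI()

# Request SubRip subtitles instead of plain text
with open('recording.mp3', 'rb') as audio_file:
    srt = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='srt'  # other formats: 'json', 'verbose_json', 'vtt'
    )

with open('recording.srt', 'w') as f:
    f.write(srt)
```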
Option 2: faster-whisper (Local, 4x Faster)
faster-whisper is a reimplementation using CTranslate2 that runs 4x faster than the original Whisper with the same accuracy. Best choice for local deployment.
Install
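faster-whisper is available on PyPI:

```bash
pip install faster-whisper
```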
Local Transcription with Timestamps

```python
from faster_whisper import WhisperModel

# Model sizes: tiny, base, small, medium, large-v3
model = WhisperModel('large-v3', device='cuda', compute_type='float16')

# transcribe() returns a lazy generator of segments plus audio metadata
segments, info = model.transcribe('recording.mp3', beam_size=5)

print(f'Detected language: {info.language}')
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')
```
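If you need timing per word rather than per segment, faster-whisper can also emit word-level timestamps. A small sketch reusing the model loaded above and assuming the word_timestamps option (each word object exposes start, end, and word attributes):

```python
# Word-level timestamps: each segment carries a list of word objects
segments, _ = model.transcribe('recording.mp3', word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f'[{word.start:.2f}s - {word.end:.2f}s] {word.word}')
```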
Model Size Comparison
| Model | VRAM | Speed (RTF) | WER |
|---|---|---|---|
| tiny | ~1 GB | ~32x | ~7% |
| base | ~1 GB | ~16x | ~5% |
| small | ~2 GB | ~6x | ~4% |
| medium | ~5 GB | ~2x | ~3% |
| large-v3 | ~10 GB | ~1x | ~2.5% |
RTF = Real-Time Factor, i.e. how much faster than real time the model runs: 16x means 1 minute of audio is transcribed in about 4 seconds (60 s / 16 = 3.75 s). WER = Word Error Rate on LibriSpeech.
Option 3: With Speaker Diarization
Speaker diarization identifies "who spoke when" - essential for meeting transcription, interviews, and podcasts. This combines Whisper with pyannote.audio.
Install
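Both libraries are on PyPI:

```bash
pip install faster-whisper pyannote.audio
```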
Note: pyannote requires a Hugging Face token with access to the model.
Transcription with Speaker Labels

```python
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

# Initialize models
whisper = WhisperModel('large-v3', device='cuda')
diarization = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token='YOUR_HF_TOKEN'
)

# Get speaker segments and transcription
diarization_result = diarization('meeting.wav')
segments, _ = whisper.transcribe('meeting.wav')

# Combine: assign speakers to transcript segments (see sketch below)
```
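The combination step is left open above. A minimal sketch of one common approach, assuming each transcript segment is assigned to the diarization speaker whose turn overlaps it the most (the helper function is illustrative, not part of either library):

```python
def assign_speakers(diarization_result, segments):
    """Label each Whisper segment with the most-overlapping pyannote speaker."""
    labeled = []
    for segment in segments:
        best_speaker, best_overlap = 'UNKNOWN', 0.0
        # itertracks yields (turn, track_id, speaker_label) for each speaker turn
        for turn, _, speaker in diarization_result.itertracks(yield_label=True):
            overlap = min(segment.end, turn.end) - max(segment.start, turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, segment))
    return labeled

for speaker, segment in assign_speakers(diarization_result, segments):
    print(f'[{segment.start:.2f}s - {segment.end:.2f}s] {speaker}: {segment.text}')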
Example Output
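Output of the sketch above looks roughly like this (speaker labels follow pyannote's SPEAKER_NN convention; the text is illustrative):

```
[0.00s - 4.20s] SPEAKER_00: Thanks everyone for joining, let's get started.
[4.20s - 9.80s] SPEAKER_01: Sure. First item is the quarterly roadmap review.
[9.80s - 13.50s] SPEAKER_00: Right, I'll share my screen.
```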
Cloud Alternatives
While Whisper is excellent, specialized providers offer additional features like real-time streaming, better punctuation, and built-in diarization.
Deepgram
Best for streaming: real-time transcription with sub-300ms latency. WebSocket API for live audio streams.
pip install deepgram-sdk
AssemblyAI
Best for English + features: best-in-class English accuracy. Built-in diarization, summarization, and content moderation (example below).
pip install assemblyai
Google Speech-to-Text
Enterprise: 125+ languages, medical and phone call models, automatic punctuation.
pip install google-cloud-speech
AWS Transcribe
AWS ecosystem: seamless AWS integration, custom vocabulary, automatic content redaction for PII.
pip install boto3
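To give a flavor of the hosted SDKs, here is a minimal sketch using AssemblyAI's Python SDK with speaker labels enabled (the API key and file name are placeholders; check the current SDK docs before relying on exact names):

```python
import assemblyai as aai

aai.settings.api_key = 'YOUR_ASSEMBLYAI_KEY'  # placeholder key

# Enable built-in diarization so each utterance carries a speaker label
config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe('meeting.mp3', config=config)

for utterance in transcript.utterances:
    print(f'Speaker {utterance.speaker}: {utterance.text}')
```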
Benchmark: Word Error Rate (WER)
Word Error Rate measures transcription accuracy - lower is better. The standard benchmark is LibriSpeech, a corpus of audiobook readings.
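Concretely, WER compares a hypothesis transcript against a reference transcript:

WER = (S + D + I) / N

where S is the number of substituted words, D deleted words, I inserted words, and N the number of words in the reference. For example, a 10-word reference transcribed with one wrong word and one missing word gives WER = (1 + 1 + 0) / 10 = 20%.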
LibriSpeech test-clean WER from OpenAI's Whisper paper. Lower is better. Human transcribers achieve ~5-6% WER.
Note: Cloud providers (AssemblyAI, Deepgram, Google) claim competitive accuracy but don't publish standardized LibriSpeech benchmarks.
Explore More Benchmarks
See how different models perform on speech recognition tasks:
View Speech Recognition Benchmarks →

When to Use What
Prototyping / Simple Transcription
Use OpenAI Whisper API. No setup, just works.
$0.006/min | 25MB limit | ~10s latency | Best for quick experiments
Production (Cost-Sensitive / Privacy)
Use faster-whisper locally. No API costs, data stays on-premise.
One-time GPU cost | No data leaves server | 4x faster than original Whisper
Real-Time / Streaming
Use Deepgram. Sub-300ms latency via WebSocket.
~$0.004/min | WebSocket API | Live transcription | Phone/video calls
Meeting Transcription (Multiple Speakers)
Use AssemblyAI or faster-whisper + pyannote.
Built-in diarization | Speaker labels | Summarization features
Non-English / Low-Resource Languages
Use Whisper large-v3. Best multilingual accuracy.
99 languages | Automatic language detection | Works with accents
Key Takeaways
1. Whisper is the foundation - MIT licensed, 99 languages, near-human accuracy. Start here.
2. faster-whisper for production - 4x faster, lower memory, same accuracy. Best choice for local deployment.
3. Speaker diarization is separate - use pyannote or specialized APIs for "who said what".
4. WER benchmark matters - large-v3 achieves ~2.5% WER on LibriSpeech, at or below typical human transcriber error rates.
Practice Exercise
Try transcribing your own audio:
1. Record a 30-second voice memo on your phone.
2. Transcribe it using the OpenAI API code above.
3. Try adding background music or speaking faster - how does accuracy change?
4. If you have a GPU, compare speed with faster-whisper locally (see the timing sketch below).
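For step 4, a minimal timing sketch, assuming faster-whisper is installed and a CUDA GPU is available (note that transcribe() returns a lazy generator, so the segments must be consumed before stopping the clock):

```python
import time
from faster_whisper import WhisperModel

model = WhisperModel('small', device='cuda', compute_type='float16')

start = time.perf_counter()
segments, info = model.transcribe('memo.mp3')
text = ' '.join(segment.text for segment in segments)  # forces full decoding
elapsed = time.perf_counter() - start

print(f'Transcribed {info.duration:.1f}s of audio in {elapsed:.1f}s '
      f'({info.duration / elapsed:.1f}x real time)')
print(text)
```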