ASR Benchmark Guide · March 2026
Speech Recognition in 2026
Whisper v3 vs Gemini 2.5 Pro vs AssemblyAI Universal-2 vs Deepgram Nova-3 vs Azure Speech vs Speechmatics. WER benchmarks, latency, pricing, and code examples — everything you need to pick the right ASR for your product.
TL;DR
Best Accuracy (English)
AssemblyAI Universal-2
2.1% WER on LibriSpeech clean
Lowest Latency
Deepgram Nova-3
~450ms median streaming latency
Best Open Source
Whisper large-v3
2.8% WER, 99 languages, self-hostable
Best Value
Deepgram Nova-3
$1.31/hr with strong accuracy
Most Languages
Azure Speech
130 languages with custom model training
Best Multimodal
Gemini 2.5 Pro
Transcribe + reason about audio in one call
WER Benchmark Comparison
Word Error Rate (%) — lower is better. Evaluated on standard test sets and a real-world noisy corpus (call center + podcast mix).
| Model | LibriSpeech clean | LibriSpeech other | Common Voice | Noisy Real-World | Streaming | Languages |
|---|---|---|---|---|---|---|
| Whisper large-v3 (OpenAI) | 2.8% | 5.5% | 8.1% | 11.4% | No | 99 |
| Gemini 2.5 Pro (Google) | 2.3% | 4.6% | 6.9% | 8.7% | Yes | 100 |
| AssemblyAI Universal-2 (AssemblyAI) | 2.1% | 4.2% | 7.3% | 7.9% | Yes | 19 |
| Deepgram Nova-3 (Deepgram) | 2.5% | 4.8% | 7.6% | 8.2% | Yes | 36 |
| Azure Speech (Microsoft) | 2.6% | 5.1% | 7.0% | 9.3% | Yes | 130 |
| Speechmatics (Speechmatics) | 2.4% | 4.5% | 6.5% | 8.0% | Yes | 50 |
Benchmarks from published papers, official documentation, and independent evaluations (February-March 2026). Noisy real-world corpus: 50h mix of call center recordings, podcasts, and meeting audio at varying SNR levels.
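For reference, the WER numbers above are the standard word-level edit-distance metric: substitutions, deletions, and insertions divided by the reference word count. A minimal sketch of the computation (the `wer` function name is ours; production evaluations also apply heavier text normalization, e.g. removing punctuation and expanding numerals):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a six-word reference -> 1/6 WER
print(f"{wer('the cat sat on the mat', 'the cat sat on a mat'):.1%}")
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why noisy-audio numbers are not directly comparable across differently normalized test sets.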
Model Deep Dives
Whisper large-v3
OpenAI · large-v3 / large-v3-turbo
Best WER
2.8%
Latency
4.2s
Price/hr
$0.36
- Open-source weights (MIT license)
- large-v3-turbo: 4x faster with ~0.3% WER trade-off
- Massive community ecosystem (faster-whisper, whisper.cpp, WhisperX)
- Best multilingual breadth for open models
Gemini 2.5 Pro
Google · 2.5 Pro (audio input)
Best WER
2.3%
Latency
3.8s
Price/hr
$2.16
- Multimodal: transcribe + reason about audio in one call
- Excellent noise robustness from large-scale pre-training
- Live API enables real-time streaming transcription
- Can handle audio + video + text simultaneously
AssemblyAI Universal-2
AssemblyAI · Universal-2
Best WER
2.1%
Latency
1.1s
Price/hr
$2.22
- Top-tier English accuracy across accents
- Built-in speaker diarization, sentiment, summarization
- Excellent real-time streaming latency
- PII redaction and content safety built in
Deepgram Nova-3
Deepgram · Nova-3
Best WER
2.5%
Latency
450ms
Price/hr
$1.31
- Lowest latency of any commercial ASR
- Best price-to-performance ratio
- Strong multichannel and telephony support
- Topic detection and intent recognition built in
Azure Speech
Microsoft · Speech-to-Text v4
Best WER
2.6%
Latency
800ms
Price/hr
$1.44
- Broadest language support of any commercial API
- Custom model training with your own data
- Deep Azure ecosystem integration
- On-premises deployment via containers
Speechmatics
Speechmatics · Ursa 3
Best WER
2.4%
Latency
700ms
Price/hr
$2.64
- Strongest accuracy on non-English European languages
- On-premises and air-gapped deployment
- Excellent entity formatting and punctuation
- Translation and language identification built in
Pricing Comparison
Cost per hour of transcribed audio. Self-hosted Whisper costs depend on GPU choice.
| Model | Price / Hour | Price / Minute | Free Tier | Notes |
|---|---|---|---|---|
| Whisper (OpenAI API) | $0.36 | $0.006 | No | Cheapest API; no streaming |
| Whisper (self-hosted) | ~$0.05-0.15 | ~$0.001-0.003 | N/A | A100: ~$1.50/hr GPU, processes ~10-30x real-time |
| Deepgram Nova-3 | $1.31 | $0.0218 | $200 credit | Best commercial value; volume discounts |
| Azure Speech | $1.44 | $0.024 | 5hr/mo | Custom models add ~$1.44/hr extra |
| Gemini 2.5 Pro | ~$2.16 | ~$0.036 | Free tier | Token-based pricing; varies with output |
| AssemblyAI | $2.22 | $0.037 | 100hr trial | Includes diarization, summaries, sentiment |
| Speechmatics | $2.64 | $0.044 | Trial | Enterprise on-prem pricing negotiable |
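To see how the per-hour rates translate into a monthly bill, here is a quick sketch using the table above. It assumes flat per-hour pricing; real invoices vary with volume discounts, token-based billing (Gemini), and add-on features, all of which are ignored here:

```python
# Estimated monthly bill at flat per-hour rates (volume discounts ignored).
# Rates are taken from the pricing table above.
RATES_PER_HOUR = {
    "Whisper (OpenAI API)": 0.36,
    "Deepgram Nova-3": 1.31,
    "Azure Speech": 1.44,
    "Gemini 2.5 Pro": 2.16,
    "AssemblyAI": 2.22,
    "Speechmatics": 2.64,
}

def monthly_cost(hours_per_month: float) -> dict[str, float]:
    """Flat-rate monthly estimate per provider, in USD."""
    return {name: round(rate * hours_per_month, 2)
            for name, rate in RATES_PER_HOUR.items()}

# Example: 500 audio hours per month
for name, cost in monthly_cost(500).items():
    print(f"{name:24s} ${cost:>9,.2f}")
```

At 500 hours/month the spread is already wide (roughly $180 for the Whisper API vs. $1,320 for Speechmatics), which is why high-volume teams tend to benchmark the cheaper tiers first.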
Code Examples (Python)
Whisper large-v3 (self-hosted with faster-whisper)
# Self-hosted Whisper with faster-whisper (CTranslate2 backend)
# pip install faster-whisper
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    language="en",
    vad_filter=True,  # Skip silence for faster processing
    vad_parameters=dict(
        min_silence_duration_ms=500,
    ),
)

print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Deepgram Nova-3 (real-time streaming)
# Deepgram Nova-3 real-time streaming
# pip install deepgram-sdk
import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

async def transcribe_stream():
    dg = DeepgramClient("YOUR_API_KEY")
    connection = dg.listen.asynclive.v("1")

    async def on_message(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"[{result.start:.2f}s] {transcript}")

    connection.on(LiveTranscriptionEvents.Transcript, on_message)

    options = LiveOptions(
        model="nova-3",
        language="en",
        smart_format=True,
        diarize=True,
        encoding="linear16",
        sample_rate=16000,
    )
    await connection.start(options)

    # Stream audio chunks from microphone or file
    with open("call_recording.wav", "rb") as f:
        while chunk := f.read(4096):
            connection.send(chunk)
            await asyncio.sleep(0.1)  # Simulate real-time pacing

    await connection.finish()

asyncio.run(transcribe_stream())

AssemblyAI Universal-2 (diarization + chapters)
# AssemblyAI Universal-2 with speaker diarization
# pip install assemblyai
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speech_model=aai.SpeechModel.best,  # Universal-2
    speaker_labels=True,    # Diarization
    auto_chapters=True,     # Chapter summaries
    entity_detection=True,  # PII detection
    sentiment_analysis=True,
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("podcast_episode.mp3", config=config)

# Print with speaker labels
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-generated chapters
for chapter in transcript.chapters:
    print(f"\n## {chapter.headline}")
    print(f"   {chapter.summary}")
    print(f"   [{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s]")

Gemini 2.5 Pro (audio transcription + analysis)
# Gemini 2.5 Pro audio transcription + analysis
# pip install google-genai
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload audio file
audio_file = client.files.upload(file="earnings_call.mp3")

# Transcribe AND analyze in one call
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        audio_file,
        """Transcribe this audio precisely, then provide:
        1. Full transcript with timestamps
        2. Key topics discussed
        3. Action items mentioned
        4. Overall sentiment per speaker""",
    ],
)
print(response.text)

# Streaming transcription via Live API
async def live_transcribe():
    async with client.aio.live.connect(
        model="gemini-2.5-pro",
        config={"response_modalities": ["TEXT"]},
    ) as session:
        # Send audio chunks for real-time transcription
        with open("live_audio.pcm", "rb") as f:
            while chunk := f.read(4096):
                await session.send_realtime_input(
                    audio={"data": chunk, "mime_type": "audio/pcm"}
                )
        response = await session.receive()
        print(response.text)

Decision Matrix
Pick the right ASR based on your primary requirement.
Real-time voice assistant or live captioning
Deepgram Nova-3
Sub-500ms latency and a streaming WebSocket API. Best in class for latency-sensitive applications.
Highest accuracy on English (podcasts, meetings)
AssemblyAI Universal-2
Lowest WER across English benchmarks. Built-in diarization, chapters, and sentiment make it a complete pipeline.
Multilingual transcription (50+ languages)
Azure Speech
130 languages with custom model training. Best for global products and localization workflows.
Audio understanding beyond transcription
Gemini 2.5 Pro
Transcribe, summarize, analyze sentiment, and extract action items in a single API call. Multimodal reasoning over audio.
Self-hosted / data privacy / air-gapped
Whisper large-v3 or Speechmatics
Whisper is fully open source. Speechmatics offers on-prem deployment with better accuracy. Both run without sending data to a third party.
Budget-constrained high volume (1000+ hours/month)
Self-hosted Whisper large-v3-turbo
With faster-whisper on an A100, cost drops to ~$0.05-0.15 per audio hour. The turbo variant processes at ~30x real-time speed.
European language accuracy (DE, FR, ES, IT, etc.)
Speechmatics Ursa 3
Strongest non-English European performance. Excellent entity formatting and an on-prem option for EU data residency.
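The self-hosted cost figures quoted in the matrix follow directly from one ratio: GPU rental rate divided by the real-time factor (hours of audio processed per wall-clock hour). A quick sketch with illustrative numbers (the ~$1.50/hr A100 rate and the 10x/30x real-time factors are the assumptions stated in the pricing table, not measured values):

```python
# Effective cost per transcribed audio hour when self-hosting:
# GPU hourly rate divided by the real-time factor. Inputs are illustrative.
def cost_per_audio_hour(gpu_rate_per_hour: float, real_time_factor: float) -> float:
    """Dollars per hour of audio, given GPU $/hr and throughput multiple."""
    return gpu_rate_per_hour / real_time_factor

# A100 at ~$1.50/hr: large-v3 at ~10x real time, large-v3-turbo at ~30x
print(f"large-v3:       ${cost_per_audio_hour(1.50, 10):.3f}/audio-hr")
print(f"large-v3-turbo: ${cost_per_audio_hour(1.50, 30):.3f}/audio-hr")
```

These two endpoints reproduce the ~$0.05-0.15 per-audio-hour range cited above; batch size, quantization, and GPU utilization shift the real number in practice.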
Frequently Asked Questions
What is the most accurate speech recognition model in 2026?
AssemblyAI Universal-2 leads on English benchmarks with ~2.1% WER on LibriSpeech clean. Speechmatics Ursa 3 and Gemini 2.5 Pro are close behind. For multilingual use, Gemini and Azure Speech offer the broadest coverage with strong accuracy.
Is Whisper still competitive in 2026?
Yes. Whisper large-v3 remains highly competitive at 2.8% WER on LibriSpeech clean and is the best open-source option. The large-v3-turbo variant offers 4x faster inference with only ~0.3% WER increase, making it ideal for self-hosted deployments.
Which ASR API has the lowest latency for real-time applications?
Deepgram Nova-3 has the lowest streaming latency at ~450ms median. This makes it the top choice for live captioning, voice assistants, and real-time transcription use cases.
What is the cheapest speech-to-text API?
Self-hosted Whisper is cheapest at scale (GPU costs only). Among APIs, Deepgram Nova-3 at $0.0218/min ($1.31/hr) offers the best price-to-performance ratio. OpenAI Whisper API is cheapest outright at $0.006/min ($0.36/hr) but lacks streaming.
Should I use a speech-to-text API or self-host Whisper?
Use an API if you need streaming, diarization, or minimal ops overhead. Self-host Whisper if you process >100 hours/day (cost savings), need data privacy, or want full control. The faster-whisper library makes self-hosting practical with 4x speedup.
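The volume threshold in the answer above depends mostly on the fixed ops overhead of running your own GPUs. A rough break-even sketch, where every input is an assumption chosen for illustration (API at Deepgram's $1.31/hr, self-hosting at ~$0.10/hr, and ~$5,000/month of engineering and maintenance effort):

```python
# Break-even volume for self-hosting: the per-hour savings must cover the
# fixed monthly ops cost. All inputs are illustrative assumptions.
def breakeven_hours_per_month(api_rate: float, selfhost_rate: float,
                              fixed_ops_per_month: float) -> float:
    """Monthly audio hours above which self-hosting becomes cheaper."""
    return fixed_ops_per_month / (api_rate - selfhost_rate)

# e.g. API at $1.31/audio-hr vs self-hosted Whisper at ~$0.10/audio-hr,
# with ~$5,000/month assumed for ops, monitoring, and maintenance
hours = breakeven_hours_per_month(1.31, 0.10, 5000)
print(f"Break-even: ~{hours:,.0f} audio hours/month (~{hours / 30:.0f}/day)")
```

With these assumptions the break-even lands a bit above 100 hours/day, consistent with the rule of thumb above; leaner ops (or pricier APIs) pull the threshold down substantially.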
Which ASR model is best for noisy audio like call centers?
AssemblyAI Universal-2 and Speechmatics Ursa 3 perform best on noisy real-world audio with ~7.9-8.0% WER. Deepgram Nova-3 is also strong at 8.2% and offers the best latency for real-time call center use cases.