ASR Benchmark Guide · March 2026

Speech Recognition in 2026

Whisper v3 vs Gemini 2.5 Pro vs AssemblyAI Universal-2 vs Deepgram Nova-3 vs Azure Speech vs Speechmatics. WER benchmarks, latency, pricing, and code examples — everything you need to pick the right ASR for your product.

TL;DR

Best Accuracy (English)

AssemblyAI Universal-2

2.1% WER on LibriSpeech clean

Lowest Latency

Deepgram Nova-3

~450ms median streaming latency

Best Open Source

Whisper large-v3

2.8% WER, 99 languages, self-hostable

Best Value

Deepgram Nova-3

$1.31/hr with strong accuracy

Most Languages

Azure Speech

130 languages with custom model training

Best Multimodal

Gemini 2.5 Pro

Transcribe + reason about audio in one call

WER Benchmark Comparison

Word Error Rate (%) — lower is better. Evaluated on standard test sets and a real-world noisy corpus (call center + podcast mix).

| Model | Vendor | LibriSpeech clean | LibriSpeech other | Common Voice | Noisy Real-World | Streaming | Languages |
|---|---|---|---|---|---|---|---|
| Whisper large-v3 | OpenAI | 2.8% | 5.5% | 8.1% | 11.4% | No | 99 |
| Gemini 2.5 Pro | Google | 2.3% | 4.6% | 6.9% | 8.7% | Yes | 100 |
| AssemblyAI Universal-2 | AssemblyAI | 2.1% | 4.2% | 7.3% | 7.9% | Yes | 19 |
| Deepgram Nova-3 | Deepgram | 2.5% | 4.8% | 7.6% | 8.2% | Yes | 36 |
| Azure Speech | Microsoft | 2.6% | 5.1% | 7.0% | 9.3% | Yes | 130 |
| Speechmatics | Speechmatics | 2.4% | 4.5% | 6.5% | 8.0% | Yes | 50 |

Benchmarks from published papers, official documentation, and independent evaluations (February-March 2026). Noisy real-world corpus: 50h mix of call center recordings, podcasts, and meeting audio at varying SNR levels.
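WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal sketch of how the metric behind the table is computed, using a standard dynamic-programming edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution in six words
```

Production evaluations normalize casing, punctuation, and number formatting before scoring, which this sketch omits; libraries like `jiwer` handle that.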

Model Deep Dives

Whisper large-v3

OpenAI · large-v3 / large-v3-turbo

Open Source

Best WER

2.8%

Latency

4.2s

Price/hr

$0.36

  • Open-source weights (MIT-like license)
  • large-v3-turbo: 4x faster with ~0.3% WER trade-off
  • Massive community ecosystem (faster-whisper, whisper.cpp, WhisperX)
  • Best multilingual breadth for open models

Gemini 2.5 Pro

Google · 2.5 Pro (audio input)

Streaming

Best WER

2.3%

Latency

3.8s

Price/hr

$2.16

  • Multimodal: transcribe + reason about audio in one call
  • Excellent noise robustness from large-scale pre-training
  • Live API enables real-time streaming transcription
  • Can handle audio + video + text simultaneously

AssemblyAI Universal-2

AssemblyAI · Universal-2

Streaming · Diarization

Best WER

2.1%

Latency

1.1s

Price/hr

$2.22

  • Top-tier English accuracy across accents
  • Built-in speaker diarization, sentiment, summarization
  • Excellent real-time streaming latency
  • PII redaction and content safety built in

Deepgram Nova-3

Deepgram · Nova-3

Streaming · Diarization

Best WER

2.5%

Latency

450ms

Price/hr

$1.31

  • Lowest latency of any commercial ASR
  • Best price-to-performance ratio
  • Strong multichannel and telephony support
  • Topic detection and intent recognition built in

Azure Speech

Microsoft · Speech-to-Text v4

Streaming · Diarization

Best WER

2.6%

Latency

800ms

Price/hr

$1.44

  • Broadest language support of any commercial API
  • Custom model training with your own data
  • Deep Azure ecosystem integration
  • On-premises deployment via containers

Speechmatics

Speechmatics · Ursa 3

Streaming · Diarization

Best WER

2.4%

Latency

700ms

Price/hr

$2.64

  • Strongest accuracy on non-English European languages
  • On-premises and air-gapped deployment
  • Excellent entity formatting and punctuation
  • Translation and language identification built in

Pricing Comparison

Cost per hour of transcribed audio. Self-hosted Whisper costs depend on GPU choice.

| Model | Price / Hour | Price / Minute | Free Tier | Notes |
|---|---|---|---|---|
| Whisper (OpenAI API) | $0.36 | $0.006 | No | Cheapest API; no streaming |
| Whisper (self-hosted) | ~$0.05-0.15 | ~$0.001-0.003 | N/A | A100: ~$1.50/hr GPU, processes ~10-30x real-time |
| Deepgram Nova-3 | $1.31 | $0.0218 | $200 credit | Best commercial value; volume discounts |
| Azure Speech | $1.44 | $0.024 | 5 hr/mo | Custom models add ~$1.44/hr extra |
| Gemini 2.5 Pro | ~$2.16 | ~$0.036 | Free tier | Token-based pricing; varies with output |
| AssemblyAI | $2.22 | $0.037 | 100 hr trial | Includes diarization, summaries, sentiment |
| Speechmatics | $2.64 | $0.044 | Trial | Enterprise on-prem pricing negotiable |
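Per-hour prices compound quickly at volume. A quick sketch of what a monthly bill looks like at a hypothetical 1,000 hours/month, using the per-hour prices from the table above:

```python
HOURS_PER_MONTH = 1000  # hypothetical volume

# Per-hour prices from the comparison table above
price_per_hour = {
    "Whisper (OpenAI API)": 0.36,
    "Deepgram Nova-3": 1.31,
    "Azure Speech": 1.44,
    "Gemini 2.5 Pro": 2.16,
    "AssemblyAI": 2.22,
    "Speechmatics": 2.64,
}

monthly = {model: price * HOURS_PER_MONTH for model, price in price_per_hour.items()}
for model, cost in sorted(monthly.items(), key=lambda kv: kv[1]):
    print(f"{model:<22} ${cost:>8,.2f}/month")
```

At this volume the spread between the cheapest and most expensive API is over $2,000/month, which is why the self-hosting break-even question in the FAQ matters.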

Code Examples (Python)

Whisper large-v3 (self-hosted with faster-whisper)

# Self-hosted Whisper with faster-whisper (CTranslate2 backend)
# pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    language="en",
    vad_filter=True,           # Skip silence for faster processing
    vad_parameters=dict(
        min_silence_duration_ms=500,
    ),
)

print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Deepgram Nova-3 (real-time streaming)

# Deepgram Nova-3 real-time streaming
# pip install deepgram-sdk

import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

async def transcribe_stream():
    dg = DeepgramClient("YOUR_API_KEY")
    connection = dg.listen.asynclive.v("1")

    async def on_message(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"[{result.start:.2f}s] {transcript}")

    connection.on(LiveTranscriptionEvents.Transcript, on_message)

    options = LiveOptions(
        model="nova-3",
        language="en",
        smart_format=True,
        diarize=True,
        encoding="linear16",
        sample_rate=16000,
    )

    await connection.start(options)

    # Stream audio chunks from microphone or file
    with open("call_recording.wav", "rb") as f:
        while chunk := f.read(4096):
            connection.send(chunk)
            await asyncio.sleep(0.1)  # Simulate real-time

    await connection.finish()

asyncio.run(transcribe_stream())

AssemblyAI Universal-2 (diarization + chapters)

# AssemblyAI Universal-2 with speaker diarization
# pip install assemblyai

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speech_model=aai.SpeechModel.best,       # Universal-2
    speaker_labels=True,                       # Diarization
    auto_chapters=True,                        # Chapter summaries
    entity_detection=True,                     # PII detection
    sentiment_analysis=True,
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("podcast_episode.mp3", config=config)

# Print with speaker labels
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-generated chapters
for chapter in transcript.chapters:
    print(f"\n## {chapter.headline}")
    print(f"   {chapter.summary}")
    print(f"   [{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s]")

Gemini 2.5 Pro (audio transcription + analysis)

# Gemini 2.5 Pro audio transcription + analysis
# pip install google-genai

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload audio file
audio_file = client.files.upload(file="earnings_call.mp3")

# Transcribe AND analyze in one call
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        audio_file,
        """Transcribe this audio precisely, then provide:
        1. Full transcript with timestamps
        2. Key topics discussed
        3. Action items mentioned
        4. Overall sentiment per speaker""",
    ],
)

print(response.text)

# Streaming transcription via Live API
async def live_transcribe():
    async with client.aio.live.connect(
        model="gemini-2.5-pro",
        config={"response_modalities": ["TEXT"]},
    ) as session:
        # Send audio chunks for real-time transcription
        with open("live_audio.pcm", "rb") as f:
            while chunk := f.read(4096):
                await session.send_realtime_input(
                    audio={"data": chunk, "mime_type": "audio/pcm"}
                )
        # receive() yields messages as they arrive
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

# To run: import asyncio; asyncio.run(live_transcribe())

Decision Matrix

Pick the right ASR based on your primary requirement.

Real-time voice assistant or live captioning

Deepgram Nova-3

Sub-500ms latency and streaming WebSocket API. Best-in-class for latency-sensitive applications.

Highest accuracy on English (podcasts, meetings)

AssemblyAI Universal-2

Lowest WER across English benchmarks. Built-in diarization, chapters, and sentiment make it a complete pipeline.

Multilingual transcription (50+ languages)

Azure Speech

130 languages with custom model training. Best for global products and localization workflows.

Audio understanding beyond transcription

Gemini 2.5 Pro

Transcribe, summarize, analyze sentiment, extract action items in a single API call. Multimodal reasoning over audio.

Self-hosted / data privacy / air-gapped

Whisper large-v3 or Speechmatics

Whisper is fully open-source. Speechmatics offers on-prem deployment with better accuracy. Both run without sending data to a third party.

Budget-constrained high volume (1000+ hours/month)

Self-hosted Whisper large-v3-turbo

With faster-whisper on an A100, cost drops to ~$0.05-0.15/hr. Turbo variant processes at 30x real-time speed.
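The ~$0.05-0.15 per audio hour figure falls out of simple arithmetic: GPU rental cost divided by the real-time factor. A sketch, using the ~$1.50/hr A100 price and ~10-30x real-time throughput quoted in the pricing table:

```python
GPU_COST_PER_HOUR = 1.50  # A100 rental, from the pricing table notes

# Cost per hour of transcribed audio = GPU cost / real-time speedup
costs = {speedup: GPU_COST_PER_HOUR / speedup for speedup in (10, 30)}
for speedup, cost in costs.items():
    print(f"{speedup}x real-time -> ${cost:.3f} per audio hour")
```

These figures exclude idle GPU time and batching overhead, so real-world cost lands at the higher end unless utilization is kept high.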

European language accuracy (DE, FR, ES, IT, etc.)

Speechmatics Ursa 3

Strongest non-English European performance. Excellent entity formatting and on-prem option for EU data residency.

Frequently Asked Questions

What is the most accurate speech recognition model in 2026?

AssemblyAI Universal-2 leads on English benchmarks with ~2.1% WER on LibriSpeech clean. Speechmatics Ursa 3 and Gemini 2.5 Pro are close behind. For multilingual use, Gemini and Azure Speech offer the broadest coverage with strong accuracy.

Is Whisper still competitive in 2026?

Yes. Whisper large-v3 remains highly competitive at 2.8% WER on LibriSpeech clean and is the best open-source option. The large-v3-turbo variant offers 4x faster inference with only ~0.3% WER increase, making it ideal for self-hosted deployments.

Which ASR API has the lowest latency for real-time applications?

Deepgram Nova-3 has the lowest streaming latency at ~450ms median. This makes it the top choice for live captioning, voice assistants, and real-time transcription use cases.

What is the cheapest speech-to-text API?

Self-hosted Whisper is cheapest at scale (GPU costs only). Among APIs, Deepgram Nova-3 at $0.0218/min ($1.31/hr) offers the best price-to-performance ratio. OpenAI Whisper API is cheapest outright at $0.006/min ($0.36/hr) but lacks streaming.

Should I use a speech-to-text API or self-host Whisper?

Use an API if you need streaming, diarization, or minimal ops overhead. Self-host Whisper if you process >100 hours/day (cost savings), need data privacy, or want full control. The faster-whisper library makes self-hosting practical with 4x speedup.

Which ASR model is best for noisy audio like call centers?

AssemblyAI Universal-2 and Speechmatics Ursa 3 perform best on noisy real-world audio with ~7.9-8.0% WER. Deepgram Nova-3 is also strong at 8.2% and offers the best latency for real-time call center use cases.
