Codesota · Guides · Speech RecognitionSix ASR systems, scored honestlyUpdated March 2026
Guide · Automatic speech recognition

Word error rate, in context.

Whisper large-v3, Gemini 2.5 Pro, AssemblyAI Universal-2, Deepgram Nova-3, Azure Speech, Speechmatics Ursa 3 — six systems across accuracy, latency, price and language coverage.

WER figures come from vendor documentation, published papers and independent evaluations dated February–March 2026. Latency numbers are median from production streams.

§ 01 · Accuracy

Word error rate, across four splits.

Percentages — lower is better. LibriSpeech clean and other are the canonical academic splits; Common Voice and noisy real-world test robustness outside a reading studio.

The headline numbers on LibriSpeech have been compressed for years — a 0.5% gap at the top is real but small. The more informative row is the noisy real-world column, where differences of a full percentage point reflect meaningful changes in post-processing work.

ModelLS cleanLS otherCommon VoiceNoisy real-worldLanguages
AssemblyAI Universal-2
AssemblyAI
2.1%4.2%7.3%7.9%19
Gemini 2.5 Pro
Google
2.3%4.6%6.9%8.7%100
Speechmatics
Speechmatics
2.4%4.5%6.5%8%50
Deepgram Nova-3
Deepgram
2.5%4.8%7.6%8.2%36
Azure Speech
Microsoft
2.6%5.1%7%9.3%130
Whisper large-v3
OpenAI
2.8%5.5%8.1%11.4%99
Fig 1 · Benchmarks from published papers, official documentation, and independent evaluations (February–March 2026).
§ 02 · Latency

Real-time is a product decision.

For voice assistants and live captioning, median latency and streaming behaviour matter as much as WER.

ModelMedian latencyStreamingDiarisationNote
Deepgram Nova-3450msYesYesIndustry-leading streaming latency; <300ms p95
Speechmatics700msYesYesReal-time and batch; on-prem option
Azure Speech800msYesYesReal-time streaming; batch transcription available
AssemblyAI Universal-21.1sYesYesReal-time streaming; async available for batch
Gemini 2.5 Pro3.8sYesNoStreaming via Live API; batch via standard Gemini API
Whisper large-v34.2sNoNoBatch only via API; real-time possible self-hosted with faster-whisper
§ 03 · Price

Price per hour.

Cost per hour of transcribed audio. Self-hosted Whisper costs depend on GPU choice.

ModelPer hourPer minuteFree tierNotes
Whisper (OpenAI API)$0.36$0.006NoCheapest API; no streaming
Whisper (self-hosted)~$0.05-0.15~$0.001-0.003N/AA100: ~$1.50/hr GPU, processes ~10-30x real-time
Deepgram Nova-3$1.31$0.0218$200 creditBest commercial value; volume discounts
Azure Speech$1.44$0.0245hr/moCustom models add ~$1.44/hr extra
Gemini 2.5 Pro~$2.16~$0.036Free tierToken-based pricing; varies with output
AssemblyAI$2.22$0.037100hr trialIncludes diarization, summaries, sentiment
Speechmatics$2.64$0.044TrialEnterprise on-prem pricing negotiable
§ 04 · Profiles

Each system, one paragraph deep.

What the system is, who made it, and what it is actually for.

Whisper large-v3 · OpenAI · large-v3 / large-v3-turbo

WER (clean) · 2.8%Latency · 4.2sPrice · $0.36/hrOpen source
  • Open-source weights (MIT-like license)
  • large-v3-turbo: 4x faster with ~0.3% WER trade-off
  • Massive community ecosystem (faster-whisper, whisper.cpp, WhisperX)
  • Best multilingual breadth for open models

Gemini 2.5 Pro · Google · 2.5 Pro (audio input)

WER (clean) · 2.3%Latency · 3.8sPrice · $2.16/hr
  • Multimodal: transcribe + reason about audio in one call
  • Excellent noise robustness from large-scale pre-training
  • Live API enables real-time streaming transcription
  • Can handle audio + video + text simultaneously

AssemblyAI Universal-2 · AssemblyAI · Universal-2

WER (clean) · 2.1%Latency · 1.1sPrice · $2.22/hr
  • Top-tier English accuracy across accents
  • Built-in speaker diarization, sentiment, summarization
  • Excellent real-time streaming latency
  • PII redaction and content safety built in

Deepgram Nova-3 · Deepgram · Nova-3

WER (clean) · 2.5%Latency · 450msPrice · $1.31/hr
  • Lowest latency of any commercial ASR
  • Best price-to-performance ratio
  • Strong multichannel and telephony support
  • Topic detection and intent recognition built in

Azure Speech · Microsoft · Speech-to-Text v4

WER (clean) · 2.6%Latency · 800msPrice · $1.44/hr
  • Broadest language support of any commercial API
  • Custom model training with your own data
  • Deep Azure ecosystem integration
  • On-premises deployment via containers

Speechmatics · Speechmatics · Ursa 3

WER (clean) · 2.4%Latency · 700msPrice · $2.64/hr
  • Strongest accuracy on non-English European languages
  • On-premises and air-gapped deployment
  • Excellent entity formatting and punctuation
  • Translation and language identification built in
§ 05 · Code

Four small programs.

One example per path — self-hosted Whisper, the lowest-latency commercial stream, the richest analysis pipeline, and a multimodal option.

Whisper large-v3 — self-hosted with faster-whisper.
whisper_self_hosted.pypython
# Self-hosted Whisper with faster-whisper (CTranslate2 backend)
# pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    language="en",
    vad_filter=True,           # Skip silence for faster processing
    vad_parameters=dict(
        min_silence_duration_ms=500,
    ),
)

print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Deepgram Nova-3 — real-time streaming with diarisation.
deepgram_streaming.pypython
# Deepgram Nova-3 real-time streaming
# pip install deepgram-sdk

import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

async def transcribe_stream():
    dg = DeepgramClient("YOUR_API_KEY")
    connection = dg.listen.asynclive.v("1")

    async def on_message(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"[{result.start:.2f}s] {transcript}")

    connection.on(LiveTranscriptionEvents.Transcript, on_message)

    options = LiveOptions(
        model="nova-3",
        language="en",
        smart_format=True,
        diarize=True,
        encoding="linear16",
        sample_rate=16000,
    )

    await connection.start(options)

    # Stream audio chunks from microphone or file
    with open("call_recording.wav", "rb") as f:
        while chunk := f.read(4096):
            connection.send(chunk)
            await asyncio.sleep(0.1)  # Simulate real-time

    await connection.finish()

asyncio.run(transcribe_stream())
AssemblyAI Universal-2 — diarisation, chapters, sentiment, PII redaction.
assemblyai_universal2.pypython
# AssemblyAI Universal-2 with speaker diarization
# pip install assemblyai

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speech_model=aai.SpeechModel.best,       # Universal-2
    speaker_labels=True,                       # Diarization
    auto_chapters=True,                        # Chapter summaries
    entity_detection=True,                     # PII detection
    sentiment_analysis=True,
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("podcast_episode.mp3", config=config)

# Print with speaker labels
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-generated chapters
for chapter in transcript.chapters:
    print(f"\n## {chapter.headline}")
    print(f"   {chapter.summary}")
    print(f"   [{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s]")
Gemini 2.5 Pro — transcription and reasoning in a single call.
gemini_transcribe.pypython
# Gemini 2.5 Pro audio transcription + analysis
# pip install google-genai

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

audio_file = client.files.upload(file="earnings_call.mp3")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        audio_file,
        """Transcribe this audio precisely, then provide:
        1. Full transcript with timestamps
        2. Key topics discussed
        3. Action items mentioned
        4. Overall sentiment per speaker""",
    ],
)

print(response.text)
§ 06 · Decision

Pick the tool for the job.

Seven common scenarios, with a primary pick and one-line reason.

Real-time voice assistant or live captioningDeepgram Nova-3
Sub-500ms latency and streaming WebSocket API. Best-in-class for latency-sensitive applications.
Highest accuracy on English (podcasts, meetings)AssemblyAI Universal-3 Pro
Near-SOTA WER (6.21% mean on the Open ASR Leaderboard) with built-in diarization, chapters, and sentiment for a complete pipeline. For raw accuracy, self-hosted IBM Granite Speech 4.1 2B leads at 5.33% mean WER.
Multilingual transcription (50+ languages)Azure Speech
130 languages with custom model training. Best for global products and localization workflows.
Audio understanding beyond transcriptionGemini 2.5 Pro
Transcribe, summarize, analyze sentiment, extract action items in a single API call. Multimodal reasoning over audio.
Self-hosted / data privacy / air-gappedWhisper large-v3 or Speechmatics
Whisper is fully open-source. Speechmatics offers on-prem with better accuracy. Both run without sending data to a third-party.
Budget-constrained high volume (1000+ hours/month)Self-hosted Whisper large-v3-turbo
With faster-whisper on an A100, cost drops to ~$0.05-0.15/hr. Turbo variant processes at 30x real-time speed.
European language accuracy (DE, FR, ES, IT, etc.)Speechmatics Ursa 3
Strongest non-English European performance. Excellent entity formatting and on-prem option for EU data residency.
§ 07 · FAQ

Questions, answered plainly.

The six we get most, with short answers.

Q01What is the most accurate speech recognition model in 2026?
For raw accuracy, open models lead the HF Open ASR Leaderboard — IBM Granite Speech 4.1 2B at 5.33% mean WER across 8 datasets, ahead of Cohere Transcribe and NVIDIA Canary-Qwen-2.5B. Among hosted cloud APIs, AssemblyAI Universal-2 is strong on English (~2.1% WER on LibriSpeech clean), with Speechmatics Ursa 3 and Gemini close behind. For multilingual use, Gemini and Azure Speech offer the broadest coverage.
Q02Is Whisper still competitive in 2026?
Yes. Whisper large-v3 remains highly competitive at 2.8% WER on LibriSpeech clean and is the best open-source option. The large-v3-turbo variant offers 4x faster inference with only ~0.3% WER increase, making it ideal for self-hosted deployments.
Q03Which ASR API has the lowest latency for real-time applications?
Deepgram Nova-3 has the lowest streaming latency at ~450ms median (under 300ms at p95). This makes it the top choice for live captioning, voice assistants, and real-time transcription use cases.
Q04What is the cheapest speech-to-text API?
Self-hosted Whisper is cheapest at scale (GPU costs only). Among APIs, Deepgram Nova-3 at $0.0218/min ($1.31/hr) offers the best price-to-performance ratio. OpenAI Whisper API is cheapest outright at $0.006/min ($0.36/hr) but lacks streaming.
Q05Should I use a speech-to-text API or self-host Whisper?
Use an API if you need streaming, diarization, or minimal ops overhead. Self-host Whisper if you process >100 hours/day (cost savings), need data privacy, or want full control. The faster-whisper library makes self-hosting practical with 4x speedup.
Q06Which ASR model is best for noisy audio like call centers?
AssemblyAI Universal-2 and Speechmatics Ursa 3 perform best on noisy real-world audio with ~7.9-8.0% WER. Deepgram Nova-3 is also strong at 8.2% and offers the best latency for real-time call center use cases.
Related · Further reading

Continue through the registry.