Codesota · Guides · Speech Recognition
Six ASR systems, scored honestly
Updated March 2026
Guide · Automatic speech recognition

Word error rate, in context.

Whisper large-v3, Gemini 2.5 Pro, AssemblyAI Universal-2, Deepgram Nova-3, Azure Speech, Speechmatics Ursa 3 — six systems across accuracy, latency, price and language coverage.

WER figures come from vendor documentation, published papers and independent evaluations dated February–March 2026. Latency numbers are medians measured from production streams.

§ 01 · Accuracy

Word error rate, across four splits.

Percentages — lower is better. LibriSpeech clean and other are the canonical academic splits; Common Voice and noisy real-world test robustness outside a reading studio.

The headline numbers on LibriSpeech have been compressed for years; a 0.5-point gap at the top is real but small. The more informative column is noisy real-world, where a full percentage point of difference translates into meaningfully more or less post-processing work.
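WER itself is just word-level edit distance (substitutions + insertions + deletions) divided by the reference length. A minimal sketch, with `wer` as an illustrative helper; real benchmark harnesses also normalize casing and punctuation before scoring:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution

    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six words
```

On that reading, a model at 2.1% WER makes roughly one word-level mistake every 48 reference words.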

Model | Vendor | LS clean | LS other | Common Voice | Noisy real-world | Languages
AssemblyAI Universal-2 | AssemblyAI | 2.1% | 4.2% | 7.3% | 7.9% | 19
Gemini 2.5 Pro | Google | 2.3% | 4.6% | 6.9% | 8.7% | 100
Speechmatics Ursa 3 | Speechmatics | 2.4% | 4.5% | 6.5% | 8.0% | 50
Deepgram Nova-3 | Deepgram | 2.5% | 4.8% | 7.6% | 8.2% | 36
Azure Speech | Microsoft | 2.6% | 5.1% | 7.0% | 9.3% | 130
Whisper large-v3 | OpenAI | 2.8% | 5.5% | 8.1% | 11.4% | 99
Fig 1 · Benchmarks from published papers, official documentation, and independent evaluations (February–March 2026).
§ 02 · Latency

Real-time is a product decision.

For voice assistants and live captioning, median latency and streaming behaviour matter as much as WER.

Model | Median latency | Streaming | Diarisation | Note
Deepgram Nova-3 | 450ms | Yes | Yes | Industry-leading streaming latency; <300ms p95
Speechmatics | 700ms | Yes | Yes | Real-time and batch; on-prem option
Azure Speech | 800ms | Yes | Yes | Real-time streaming; batch transcription available
AssemblyAI Universal-2 | 1.1s | Yes | Yes | Real-time streaming; async available for batch
Gemini 2.5 Pro | 3.8s | Yes | No | Streaming via Live API; batch via standard Gemini API
Whisper large-v3 | 4.2s | No | No | Batch only via API; real-time possible self-hosted with faster-whisper
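Median and p95 figures like those above are easy to reproduce against your own stream: record the gap between sending a chunk and receiving its transcript, then take percentiles. A minimal sketch, assuming latencies collected in milliseconds; `latency_percentiles` is an illustrative helper, not a vendor API, and it uses the nearest-rank convention for p95:

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (median, p95) from per-chunk latencies in milliseconds."""
    s = sorted(latencies_ms)
    median = statistics.median(s)
    # Nearest-rank p95: smallest observation >= 95% of the sample
    rank = max(0, -(-95 * len(s) // 100) - 1)  # ceil(0.95 * n) - 1
    return median, s[rank]

# e.g. per-chunk (transcript_received - chunk_sent) from a test stream
sample = [420, 440, 455, 430, 445, 450, 460, 470, 900, 290]
med, p95 = latency_percentiles(sample)
print(f"median={med}ms p95={p95}ms")
```

Median hides tail stalls, which is why a figure like Deepgram's "<300ms p95" is worth checking separately.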
§ 03 · Price

Price per hour.

Cost per hour of transcribed audio. Self-hosted Whisper costs depend on GPU choice.

Model | Per hour | Per minute | Free tier | Notes
Whisper (OpenAI API) | $0.36 | $0.006 | No | Cheapest API; no streaming
Whisper (self-hosted) | ~$0.05-0.15 | ~$0.001-0.003 | N/A | A100: ~$1.50/hr GPU, processes ~10-30x real-time
Deepgram Nova-3 | $1.31 | $0.0218 | $200 credit | Best commercial value; volume discounts
Azure Speech | $1.44 | $0.024 | 5hr/mo | Custom models add ~$1.44/hr extra
Gemini 2.5 Pro | ~$2.16 | ~$0.036 | Free tier | Token-based pricing; varies with output
AssemblyAI | $2.22 | $0.037 | 100hr trial | Includes diarization, summaries, sentiment
Speechmatics | $2.64 | $0.044 | Trial | Enterprise on-prem pricing negotiable
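The self-hosted row follows directly from GPU rental cost divided by processing speed. A back-of-envelope sketch using the table's own assumptions ($1.50/hr A100, 10-30x real-time throughput), with Deepgram Nova-3 as the API comparison point:

```python
GPU_RATE = 1.50              # $/hr for a rented A100 (assumption from the table)
API_RATE = 1.31              # $/audio-hour, Deepgram Nova-3, for comparison
MONTHLY_AUDIO_HOURS = 1000   # the guide's high-volume scenario

for speedup in (10, 30):     # faster-whisper throughput range, x real-time
    per_audio_hour = GPU_RATE / speedup
    monthly = per_audio_hour * MONTHLY_AUDIO_HOURS
    print(f"{speedup}x real-time: ${per_audio_hour:.3f}/audio-hour, "
          f"${monthly:.0f}/month vs ${API_RATE * MONTHLY_AUDIO_HOURS:.0f} via API")
```

This ignores engineering time and idle GPU hours, which is why the break-even argument only bites at sustained volume.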
§ 04 · Profiles

Each system, one paragraph deep.

What the system is, who made it, and what it is actually for.

Whisper large-v3 · OpenAI · large-v3 / large-v3-turbo

WER (clean) · 2.8% | Latency · 4.2s | Price · $0.36/hr | Open source
  • Open-source weights (MIT-like license)
  • large-v3-turbo: 4x faster with ~0.3% WER trade-off
  • Massive community ecosystem (faster-whisper, whisper.cpp, WhisperX)
  • Best multilingual breadth for open models

Gemini 2.5 Pro · Google · 2.5 Pro (audio input)

WER (clean) · 2.3% | Latency · 3.8s | Price · $2.16/hr
  • Multimodal: transcribe + reason about audio in one call
  • Excellent noise robustness from large-scale pre-training
  • Live API enables real-time streaming transcription
  • Can handle audio + video + text simultaneously

AssemblyAI Universal-2 · AssemblyAI · Universal-2

WER (clean) · 2.1% | Latency · 1.1s | Price · $2.22/hr
  • Top-tier English accuracy across accents
  • Built-in speaker diarization, sentiment, summarization
  • Excellent real-time streaming latency
  • PII redaction and content safety built in

Deepgram Nova-3 · Deepgram · Nova-3

WER (clean) · 2.5% | Latency · 450ms | Price · $1.31/hr
  • Lowest latency of any commercial ASR
  • Best price-to-performance ratio
  • Strong multichannel and telephony support
  • Topic detection and intent recognition built in

Azure Speech · Microsoft · Speech-to-Text v4

WER (clean) · 2.6% | Latency · 800ms | Price · $1.44/hr
  • Broadest language support of any commercial API
  • Custom model training with your own data
  • Deep Azure ecosystem integration
  • On-premises deployment via containers

Speechmatics · Speechmatics · Ursa 3

WER (clean) · 2.4% | Latency · 700ms | Price · $2.64/hr
  • Strongest accuracy on non-English European languages
  • On-premises and air-gapped deployment
  • Excellent entity formatting and punctuation
  • Translation and language identification built in
§ 05 · Code

Four small programs.

One example per path — self-hosted Whisper, the lowest-latency commercial stream, the richest analysis pipeline, and a multimodal option.

Whisper large-v3 — self-hosted with faster-whisper.
whisper_self_hosted.py
# Self-hosted Whisper with faster-whisper (CTranslate2 backend)
# pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    language="en",
    vad_filter=True,           # Skip silence for faster processing
    vad_parameters=dict(
        min_silence_duration_ms=500,
    ),
)

print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Deepgram Nova-3 — real-time streaming with diarisation.
deepgram_streaming.py
# Deepgram Nova-3 real-time streaming
# pip install deepgram-sdk

import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

async def transcribe_stream():
    dg = DeepgramClient("YOUR_API_KEY")
    connection = dg.listen.asynclive.v("1")

    async def on_message(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"[{result.start:.2f}s] {transcript}")

    connection.on(LiveTranscriptionEvents.Transcript, on_message)

    options = LiveOptions(
        model="nova-3",
        language="en",
        smart_format=True,
        diarize=True,
        encoding="linear16",
        sample_rate=16000,
    )

    await connection.start(options)

    # Stream audio chunks from microphone or file
    with open("call_recording.wav", "rb") as f:
        while chunk := f.read(4096):
            await connection.send(chunk)
            await asyncio.sleep(0.1)  # Simulate real-time

    await connection.finish()

asyncio.run(transcribe_stream())
AssemblyAI Universal-2 — diarisation, chapters, sentiment, PII redaction.
assemblyai_universal2.py
# AssemblyAI Universal-2 with speaker diarization
# pip install assemblyai

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speech_model=aai.SpeechModel.best,       # Universal-2
    speaker_labels=True,                       # Diarization
    auto_chapters=True,                        # Chapter summaries
    entity_detection=True,                     # PII detection
    sentiment_analysis=True,
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("podcast_episode.mp3", config=config)

# Print with speaker labels
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-generated chapters
for chapter in transcript.chapters:
    print(f"\n## {chapter.headline}")
    print(f"   {chapter.summary}")
    print(f"   [{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s]")
Gemini 2.5 Pro — transcription and reasoning in a single call.
gemini_transcribe.py
# Gemini 2.5 Pro audio transcription + analysis
# pip install google-genai

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

audio_file = client.files.upload(file="earnings_call.mp3")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        audio_file,
        """Transcribe this audio precisely, then provide:
        1. Full transcript with timestamps
        2. Key topics discussed
        3. Action items mentioned
        4. Overall sentiment per speaker""",
    ],
)

print(response.text)
§ 06 · Decision

Pick the tool for the job.

Seven common scenarios, with a primary pick and one-line reason.

Real-time voice assistant or live captioning · Deepgram Nova-3
Sub-500ms latency and streaming WebSocket API. Best-in-class for latency-sensitive applications.
Highest accuracy on English (podcasts, meetings) · AssemblyAI Universal-2
Lowest WER across English benchmarks. Built-in diarization, chapters, and sentiment make it a complete pipeline.
Multilingual transcription (50+ languages) · Azure Speech
130 languages with custom model training. Best for global products and localization workflows.
Audio understanding beyond transcription · Gemini 2.5 Pro
Transcribe, summarize, analyze sentiment, and extract action items in a single API call. Multimodal reasoning over audio.
Self-hosted / data privacy / air-gapped · Whisper large-v3 or Speechmatics
Whisper is fully open source. Speechmatics offers on-prem with better accuracy. Neither requires sending data to a third party.
Budget-constrained high volume (1000+ hours/month) · Self-hosted Whisper large-v3-turbo
With faster-whisper on an A100, cost drops to ~$0.05-0.15/hr. The turbo variant processes at ~30x real-time.
European language accuracy (DE, FR, ES, IT, etc.) · Speechmatics Ursa 3
Strongest non-English European performance. Excellent entity formatting and an on-prem option for EU data residency.
§ 07 · FAQ

Questions, answered plainly.

The six we get most, with short answers.

Q01 · What is the most accurate speech recognition model in 2026?
AssemblyAI Universal-2 leads on English benchmarks with ~2.1% WER on LibriSpeech clean. Speechmatics Ursa 3 and Gemini 2.5 Pro are close behind. For multilingual use, Gemini and Azure Speech offer the broadest coverage with strong accuracy.
Q02 · Is Whisper still competitive in 2026?
Yes. Whisper large-v3 remains highly competitive at 2.8% WER on LibriSpeech clean and is the best open-source option. The large-v3-turbo variant offers 4x faster inference with only ~0.3% WER increase, making it ideal for self-hosted deployments.
Q03 · Which ASR API has the lowest latency for real-time applications?
Deepgram Nova-3 has the lowest streaming latency at ~450ms median (under 300ms at p95). This makes it the top choice for live captioning, voice assistants, and real-time transcription use cases.
Q04 · What is the cheapest speech-to-text API?
Self-hosted Whisper is cheapest at scale (GPU costs only). Among APIs, Deepgram Nova-3 at $0.0218/min ($1.31/hr) offers the best price-to-performance ratio. OpenAI Whisper API is cheapest outright at $0.006/min ($0.36/hr) but lacks streaming.
Q05 · Should I use a speech-to-text API or self-host Whisper?
Use an API if you need streaming, diarization, or minimal ops overhead. Self-host Whisper if you process >100 hours/day (cost savings), need data privacy, or want full control. The faster-whisper library makes self-hosting practical with 4x speedup.
Q06 · Which ASR model is best for noisy audio like call centers?
AssemblyAI Universal-2 and Speechmatics Ursa 3 perform best on noisy real-world audio with ~7.9-8.0% WER. Deepgram Nova-3 is also strong at 8.2% and offers the best latency for real-time call center use cases.