ASR Benchmark Guide · March 2026

Speech Recognition in 2026

Whisper v3 vs Gemini 2.5 Pro vs AssemblyAI Universal-2 vs Deepgram Nova-3 vs Azure Speech vs Speechmatics. WER benchmarks, latency, pricing, and code examples — everything you need to pick the right ASR for your product.

TL;DR

Best Accuracy (English)

AssemblyAI Universal-2

2.1% WER on LibriSpeech clean

Lowest Latency

Deepgram Nova-3

~450ms median streaming latency

Best Open Source

Whisper large-v3

2.8% WER, 99 languages, self-hostable

Best Value

Deepgram Nova-3

$1.31/hr with strong accuracy

Most Languages

Azure Speech

130 languages with custom model training

Best Multimodal

Gemini 2.5 Pro

Transcribe + reason about audio in one call

WER Benchmark Comparison

Word Error Rate (%) — lower is better. Evaluated on standard test sets and a real-world noisy corpus (call center + podcast mix).

| Model | Vendor | LibriSpeech clean | LibriSpeech other | Common Voice | Noisy Real-World | Streaming | Languages |
|---|---|---|---|---|---|---|---|
| Whisper large-v3 | OpenAI | 2.8% | 5.5% | 8.1% | 11.4% | No | 99 |
| Gemini 2.5 Pro | Google | 2.3% | 4.6% | 6.9% | 8.7% | Yes | 100 |
| AssemblyAI Universal-2 | AssemblyAI | 2.1% | 4.2% | 7.3% | 7.9% | Yes | 19 |
| Deepgram Nova-3 | Deepgram | 2.5% | 4.8% | 7.6% | 8.2% | Yes | 36 |
| Azure Speech | Microsoft | 2.6% | 5.1% | 7.0% | 9.3% | Yes | 130 |
| Speechmatics | Speechmatics | 2.4% | 4.5% | 6.5% | 8.0% | Yes | 50 |

Benchmarks from published papers, official documentation, and independent evaluations (February-March 2026). Noisy real-world corpus: 50h mix of call center recordings, podcasts, and meeting audio at varying SNR levels.
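WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal sketch of how the metric behind the table is computed, using a standard dynamic-programming edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # one substitution in six words
```

Production evaluations normalize casing, punctuation, and number formatting before scoring, which this sketch omits; libraries like `jiwer` handle that.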

Model Deep Dives

Whisper large-v3

OpenAI · large-v3 / large-v3-turbo

Open Source

Best WER

2.8%

Latency

4.2s

Price/hr

$0.36

  • Open-source weights (MIT-like license)
  • large-v3-turbo: 4x faster with ~0.3% WER trade-off
  • Massive community ecosystem (faster-whisper, whisper.cpp, WhisperX)
  • Best multilingual breadth for open models

Gemini 2.5 Pro

Google · 2.5 Pro (audio input)

Streaming

Best WER

2.3%

Latency

3.8s

Price/hr

$2.16

  • Multimodal: transcribe + reason about audio in one call
  • Excellent noise robustness from large-scale pre-training
  • Live API enables real-time streaming transcription
  • Can handle audio + video + text simultaneously

AssemblyAI Universal-2

AssemblyAI · Universal-2

Streaming · Diarization

Best WER

2.1%

Latency

1.1s

Price/hr

$2.22

  • Top-tier English accuracy across accents
  • Built-in speaker diarization, sentiment, summarization
  • Excellent real-time streaming latency
  • PII redaction and content safety built in

Deepgram Nova-3

Deepgram · Nova-3

Streaming · Diarization

Best WER

2.5%

Latency

450ms

Price/hr

$1.31

  • Lowest latency of any commercial ASR
  • Best price-to-performance ratio
  • Strong multichannel and telephony support
  • Topic detection and intent recognition built in

Azure Speech

Microsoft · Speech-to-Text v4

Streaming · Diarization

Best WER

2.6%

Latency

800ms

Price/hr

$1.44

  • Broadest language support of any commercial API
  • Custom model training with your own data
  • Deep Azure ecosystem integration
  • On-premises deployment via containers

Speechmatics

Speechmatics · Ursa 3

Streaming · Diarization

Best WER

2.4%

Latency

700ms

Price/hr

$2.64

  • Strongest accuracy on non-English European languages
  • On-premises and air-gapped deployment
  • Excellent entity formatting and punctuation
  • Translation and language identification built in

Pricing Comparison

Cost per hour of transcribed audio. Self-hosted Whisper costs depend on GPU choice.

| Model | Price / Hour | Price / Minute | Free Tier | Notes |
|---|---|---|---|---|
| Whisper (OpenAI API) | $0.36 | $0.006 | No | Cheapest API; no streaming |
| Whisper (self-hosted) | ~$0.05-0.15 | ~$0.001-0.003 | N/A | A100: ~$1.50/hr GPU, processes ~10-30x real-time |
| Deepgram Nova-3 | $1.31 | $0.0218 | $200 credit | Best commercial value; volume discounts |
| Azure Speech | $1.44 | $0.024 | 5 hr/mo | Custom models add ~$1.44/hr extra |
| Gemini 2.5 Pro | ~$2.16 | ~$0.036 | Free tier | Token-based pricing; varies with output |
| AssemblyAI | $2.22 | $0.037 | 100 hr trial | Includes diarization, summaries, sentiment |
| Speechmatics | $2.64 | $0.044 | Trial | Enterprise on-prem pricing negotiable |
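Per-hour prices compound quickly at volume. A quick sketch of what a monthly bill looks like at a hypothetical 1,000 hours/month, using the per-hour prices from the table above:

```python
HOURS_PER_MONTH = 1000  # hypothetical volume

# Per-hour prices from the comparison table above
price_per_hour = {
    "Whisper (OpenAI API)": 0.36,
    "Deepgram Nova-3": 1.31,
    "Azure Speech": 1.44,
    "Gemini 2.5 Pro": 2.16,
    "AssemblyAI": 2.22,
    "Speechmatics": 2.64,
}

monthly = {model: price * HOURS_PER_MONTH for model, price in price_per_hour.items()}
for model, cost in sorted(monthly.items(), key=lambda kv: kv[1]):
    print(f"{model:<22} ${cost:>8,.2f}/month")
```

At this volume the spread between the cheapest and most expensive API is over $2,000/month, which is why the self-hosting break-even question in the FAQ matters.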

Code Examples (Python)

Whisper large-v3 (self-hosted with faster-whisper)

# Self-hosted Whisper with faster-whisper (CTranslate2 backend)
# pip install faster-whisper

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "meeting.wav",
    beam_size=5,
    language="en",
    vad_filter=True,           # Skip silence for faster processing
    vad_parameters=dict(
        min_silence_duration_ms=500,
    ),
)

print(f"Detected language: {info.language} (prob: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Deepgram Nova-3 (real-time streaming)

# Deepgram Nova-3 real-time streaming
# pip install deepgram-sdk

import asyncio
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

async def transcribe_stream():
    dg = DeepgramClient("YOUR_API_KEY")
    connection = dg.listen.asynclive.v("1")

    async def on_message(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"[{result.start:.2f}s] {transcript}")

    connection.on(LiveTranscriptionEvents.Transcript, on_message)

    options = LiveOptions(
        model="nova-3",
        language="en",
        smart_format=True,
        diarize=True,
        encoding="linear16",
        sample_rate=16000,
    )

    await connection.start(options)

    # Stream audio chunks from microphone or file
    with open("call_recording.wav", "rb") as f:
        while chunk := f.read(4096):
            connection.send(chunk)
            await asyncio.sleep(0.1)  # Simulate real-time

    await connection.finish()

asyncio.run(transcribe_stream())

AssemblyAI Universal-2 (diarization + chapters)

# AssemblyAI Universal-2 with speaker diarization
# pip install assemblyai

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speech_model=aai.SpeechModel.best,       # Universal-2
    speaker_labels=True,                       # Diarization
    auto_chapters=True,                        # Chapter summaries
    entity_detection=True,                     # PII detection
    sentiment_analysis=True,
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("podcast_episode.mp3", config=config)

# Print with speaker labels
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Auto-generated chapters
for chapter in transcript.chapters:
    print(f"\n## {chapter.headline}")
    print(f"   {chapter.summary}")
    print(f"   [{chapter.start/1000:.0f}s - {chapter.end/1000:.0f}s]")

Gemini 2.5 Pro (audio transcription + analysis)

# Gemini 2.5 Pro audio transcription + analysis
# pip install google-genai

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload audio file
audio_file = client.files.upload(file="earnings_call.mp3")

# Transcribe AND analyze in one call
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        audio_file,
        """Transcribe this audio precisely, then provide:
        1. Full transcript with timestamps
        2. Key topics discussed
        3. Action items mentioned
        4. Overall sentiment per speaker""",
    ],
)

print(response.text)

# Streaming transcription via Live API
async def live_transcribe():
    async with client.aio.live.connect(
        model="gemini-2.5-pro",
        config={"response_modalities": ["TEXT"]},
    ) as session:
        # Send audio chunks for real-time transcription
        with open("live_audio.pcm", "rb") as f:
            while chunk := f.read(4096):
                await session.send_realtime_input(
                    audio={"data": chunk, "mime_type": "audio/pcm"}
                )
        # receive() yields messages as they arrive
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

# To run: import asyncio; asyncio.run(live_transcribe())

Decision Matrix

Pick the right ASR based on your primary requirement.

Real-time voice assistant or live captioning

Deepgram Nova-3

Sub-500ms latency and streaming WebSocket API. Best-in-class for latency-sensitive applications.

Highest accuracy on English (podcasts, meetings)

AssemblyAI Universal-2

Lowest WER across English benchmarks. Built-in diarization, chapters, and sentiment make it a complete pipeline.

Multilingual transcription (50+ languages)

Azure Speech

130 languages with custom model training. Best for global products and localization workflows.

Audio understanding beyond transcription

Gemini 2.5 Pro

Transcribe, summarize, analyze sentiment, extract action items in a single API call. Multimodal reasoning over audio.

Self-hosted / data privacy / air-gapped

Whisper large-v3 or Speechmatics

Whisper is fully open-source. Speechmatics offers on-prem deployment with better accuracy. Both run without sending data to a third party.

Budget-constrained high volume (1000+ hours/month)

Self-hosted Whisper large-v3-turbo

With faster-whisper on an A100, cost drops to ~$0.05-0.15/hr. Turbo variant processes at 30x real-time speed.
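The ~$0.05-0.15 per audio hour figure falls out of simple arithmetic: GPU rental cost divided by the real-time factor. A sketch, using the ~$1.50/hr A100 price and ~10-30x real-time throughput quoted in the pricing table:

```python
GPU_COST_PER_HOUR = 1.50  # A100 rental, from the pricing table notes

# Cost per hour of transcribed audio = GPU cost / real-time speedup
costs = {speedup: GPU_COST_PER_HOUR / speedup for speedup in (10, 30)}
for speedup, cost in costs.items():
    print(f"{speedup}x real-time -> ${cost:.3f} per audio hour")
```

These figures exclude idle GPU time and batching overhead, so real-world cost lands at the higher end unless utilization is kept high.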

European language accuracy (DE, FR, ES, IT, etc.)

Speechmatics Ursa 3

Strongest non-English European performance. Excellent entity formatting and on-prem option for EU data residency.

Frequently Asked Questions

What is the most accurate speech recognition model in 2026?

AssemblyAI Universal-2 leads on English benchmarks with ~2.1% WER on LibriSpeech clean. Speechmatics Ursa 3 and Gemini 2.5 Pro are close behind. For multilingual use, Gemini and Azure Speech offer the broadest coverage with strong accuracy.

Is Whisper still competitive in 2026?

Yes. Whisper large-v3 remains highly competitive at 2.8% WER on LibriSpeech clean and is the best open-source option. The large-v3-turbo variant offers 4x faster inference with only ~0.3% WER increase, making it ideal for self-hosted deployments.

Which ASR API has the lowest latency for real-time applications?

Deepgram Nova-3 has the lowest streaming latency at ~450ms median. This makes it the top choice for live captioning, voice assistants, and real-time transcription use cases.

What is the cheapest speech-to-text API?

Self-hosted Whisper is cheapest at scale (GPU costs only). Among APIs, Deepgram Nova-3 at $0.0218/min ($1.31/hr) offers the best price-to-performance ratio. OpenAI Whisper API is cheapest outright at $0.006/min ($0.36/hr) but lacks streaming.

Should I use a speech-to-text API or self-host Whisper?

Use an API if you need streaming, diarization, or minimal ops overhead. Self-host Whisper if you process >100 hours/day (cost savings), need data privacy, or want full control. The faster-whisper library makes self-hosting practical with 4x speedup.

Which ASR model is best for noisy audio like call centers?

AssemblyAI Universal-2 and Speechmatics Ursa 3 perform best on noisy real-world audio with ~7.9-8.0% WER. Deepgram Nova-3 is also strong at 8.2% and offers the best latency for real-time call center use cases.
