Updated March 2026

The Complete Speech AI Benchmark

Compare 25+ models for Speech-to-Text (STT) and Text-to-Speech (TTS). From NVIDIA Parakeet's record 1.8% WER to Sesame CSM's human-like synthesis — every model that matters in 2026.

Benchmark Stats

  • 1.8%: Best WER (Parakeet RNNT)
  • 4.8: Best MOS (ElevenLabs v2.5)
  • 25+: Models Compared
  • ~150×: Fastest STT (Groq Whisper)

The State of Speech AI in 2026

Speech AI has reached an inflection point. STT models now achieve sub-2% word error rates — matching or exceeding human transcription accuracy on clean speech. TTS models produce voices that are nearly indistinguishable from human speech, with emotional expressiveness and conversational flow.

The open-source ecosystem has exploded. In 2023, Whisper was the only serious open STT option. By 2026, NVIDIA's Parakeet RNNT leads LibriSpeech with 1.8% WER. In TTS, Sesame CSM, Kokoro, Fish Speech, and Dia offer quality that rivals commercial APIs — all under permissive licenses.

This page tracks every model that matters across both domains. We benchmark on standard datasets (LibriSpeech for STT, MOS scores for TTS) and track practical factors like latency, language support, and deployment options.

  • STT Milestone: 1.8% WER. Parakeet RNNT 1.1B, the first open model to break 2% on LibriSpeech.
  • TTS Milestone: 4.7 MOS (open source). Sesame CSM, open-source TTS approaching commercial quality.
  • Speed Milestone: ~150× realtime. Groq Whisper on LPU transcribes 1 hour of audio in ~24 seconds.

Speech-to-Text (STT)

SOTA Progress: From Deep Speech to Parakeet

From Deep Speech 2's 12.6% WER in 2015 to Parakeet RNNT's 1.8% in 2025 — an 86% relative improvement in a decade. The Conformer architecture (2020) and large-scale weak supervision via Whisper (2022) were the two biggest inflection points.

[Chart: STT SOTA progress timeline, 2015–2025, showing WER improvement from Deep Speech 2 (12.6%) through wav2vec 2.0, Conformer, and Whisper to Parakeet RNNT (1.8%).]

Accuracy vs. Speed Tradeoff

Cloud APIs cluster in the bottom-left (fast and accurate), while open-source models offer higher accuracy at the cost of latency. Groq Whisper is an outlier — the same Whisper model running on custom LPU hardware at ~150× realtime speed.
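The ~150× figure is just the ratio of audio duration to processing time (the realtime factor). A quick sanity check, using the approximate numbers quoted above:

```python
# Realtime factor: seconds of audio processed per second of wall-clock time.
# Figures are the approximate values quoted above, not fresh measurements.
audio_seconds = 3600        # 1 hour of audio
processing_seconds = 24     # reported Groq Whisper transcription time

speedup = audio_seconds / processing_seconds
print(f"~{speedup:.0f}x realtime")  # ~150x realtime
```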

[Chart: STT accuracy (WER) vs. latency scatter plot; Groq Whisper is fastest, Parakeet is most accurate, and cloud APIs cluster at low latency.]

Word Error Rate (WER)

WER measures the percentage of words transcribed incorrectly, counting three types of errors:

  • Substitutions (S): a wrong word, e.g. "the cat" becomes "the car"
  • Deletions (D): a missing word, e.g. "the big cat" becomes "the cat"
  • Insertions (I): an extra word, e.g. "the cat" becomes "the big cat"

WER = (S + D + I) / N, where N is the number of words in the reference transcript.

Human-level WER on LibriSpeech test-clean is approximately 2–4%, depending on the annotator. Models like Parakeet (1.8%) now surpass average human transcription accuracy.
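To make the three error types concrete, here is a minimal pure-Python sketch (function name and example strings are illustrative) that aligns a hypothesis to a reference with dynamic programming, counts substitutions, deletions, and insertions, and derives WER as (S + D + I) / N. In practice a library such as jiwer does this for you:

```python
# Edit-distance alignment that tracks S, D, I counts alongside total cost.
def wer_breakdown(reference: str, hypothesis: str):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)            # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)            # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]        # exact match, no cost
            else:
                cs, s1, d1, i1 = dp[i - 1][j - 1]  # substitution
                cd, s2, d2, i2 = dp[i - 1][j]      # deletion
                ci, s3, d3, i3 = dp[i][j - 1]      # insertion
                dp[i][j] = min((cs + 1, s1 + 1, d1, i1),
                               (cd + 1, s2, d2 + 1, i2),
                               (ci + 1, s3, d3, i3 + 1))
    cost, S, D, I = dp[len(ref)][len(hyp)]
    return S, D, I, cost / max(len(ref), 1)

# One substitution (big -> bug) plus one insertion (down): WER = 2/4 = 50%
print(wer_breakdown("the big cat sat", "the bug cat sat down"))
```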

transcribe.py
import whisper

# Transcribe with Whisper (language is detected automatically; 100+ supported)
model = whisper.load_model("large-v3-turbo")
result = model.transcribe("audio.mp3")
print(result["text"])

# Or use jiwer to calculate WER:
from jiwer import wer
error = wer("the quick brown fox", "the quik brown cat")
print(f"WER: {error*100:.1f}%")
# Output: WER: 50.0% (2 of 4 words wrong)

STT Leaderboard

12 models ranked by WER on LibriSpeech test-clean. Lower is better.

 # | Model                  | Org          | WER (%) | Type        | Params | Year
 1 | Parakeet RNNT 1.1B     | NVIDIA       | 1.8     | Open Source | 1.1B   | 2025
 2 | Conformer XL           | Google       | 2.0     | Research    | 600M   | 2021
 3 | Deepgram Nova-3        | Deepgram     | 2.2     | Cloud API   | —      | 2025
 4 | AssemblyAI Universal-2 | AssemblyAI   | 2.4     | Cloud API   | —      | 2025
 5 | Whisper Large v3 Turbo | OpenAI       | 2.5     | Open Source | 809M   | 2024
 6 | Gladia v2              | Gladia       | 2.5     | Cloud API   | —      | 2025
 7 | Speechmatics Flow      | Speechmatics | 2.6     | Cloud API   | —      | 2025
 8 | Whisper Large v3       | OpenAI       | 2.7     | Open Source | 1.55B  | 2023
 9 | Groq Whisper           | Groq         | 2.7     | Cloud API   | 1.55B  | 2025
10 | Google USM             | Google       | 2.8     | Cloud API   | 2B     | 2023
11 | Azure Speech           | Microsoft    | 3.0     | Cloud API   | —      | 2024
12 | wav2vec 2.0            | Meta         | 3.8     | Open Source | 317M   | 2020

STT Datasets

LibriSpeech (2015): 1,000 hours of English speech from audiobooks. The standard benchmark for automatic speech recognition.

Common Voice (2019): Massive multilingual dataset of transcribed speech. Covers diverse demographics and accents.

Text-to-Speech (TTS)

TTS Quality Progress: From WaveNet to Sesame CSM

From WaveNet's groundbreaking 3.0 MOS in 2016 to ElevenLabs' 4.8 MOS in 2024. The open-source gap has nearly closed — Sesame CSM achieves 4.7 MOS, just 0.1 behind the best cloud API. The dashed line shows the human speech reference at 5.0.

[Chart: TTS quality progress from WaveNet (3.0 MOS, 2016) through Tacotron 2, VITS, and XTTS v2 to ElevenLabs v2.5 (4.8 MOS) and Sesame CSM (4.7 MOS, open source).]

Quality vs. Latency Landscape

For voice bots and real-time applications, time-to-first-byte (TTFB) under 200ms is critical. Cartesia Sonic 2 leads at ~90ms with 4.7 MOS, while Piper serves the edge/embedded niche at ~30ms. The vertical pink line marks the voice bot threshold.

[Chart: TTS quality vs. latency scatter plot; Cartesia Sonic 2 and ElevenLabs Flash are the fastest high-quality options, Piper serves edge deployment.]

Mean Opinion Score (MOS)

TTS is harder to evaluate objectively than STT. The gold standard is MOS: human raters listen to generated audio and rate it from 1 (Bad) to 5 (Excellent). Scores above 4.5 are generally indistinguishable from human speech in blind tests.

  • 5: Excellent (human-like, natural intonation and emotion)
  • 4: Good (intelligible, minor robotic artifacts)
  • 3: Fair (understandable but clearly synthetic)
  • 2: Poor (robotic, unnatural prosody)
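Under the hood, a MOS is simply the mean of many 1–5 ratings, usually reported with a confidence interval. A small sketch with made-up rater scores (the ratings list is illustrative, not real study data):

```python
# Aggregate per-rater MOS scores into a mean with a normal-approximation
# 95% confidence interval: mean ± 1.96 * s / sqrt(n).
from statistics import mean, stdev
from math import sqrt

ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 4, 5, 4, 4, 5, 4, 4]  # one 1-5 score per rater

mos = mean(ratings)
ci = 1.96 * stdev(ratings) / sqrt(len(ratings))

print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

Real MOS studies add controls (hidden references, anchor samples, rater screening), but the arithmetic is this simple.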

Other TTS Metrics

  • TTFB (Time-to-First-Byte): critical for voice bots. The best models achieve under 100 ms; Cartesia Sonic 2 leads at ~90 ms.
  • MCD (Mel Cepstral Distortion): objective spectral distance between generated and reference audio. Lower is better.
  • Speaker Similarity: for voice cloning, how closely the output matches the target voice. Measured via cosine similarity of speaker embeddings.
  • Word Accuracy: does the model skip or hallucinate words? Checked by running STT on the generated output.
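Speaker similarity reduces to the cosine similarity of two embedding vectors. A minimal sketch, with toy 4-dimensional vectors standing in for real speaker embeddings (which typically have ~192 dimensions, e.g. from an ECAPA-TDNN model; the numbers below are invented):

```python
# Cosine similarity between a target-speaker embedding and the embedding
# of the cloned/generated speech. Values near 1.0 indicate a close match.
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

target_voice = [0.12, -0.45, 0.83, 0.31]   # embedding of reference speaker
cloned_voice = [0.10, -0.40, 0.85, 0.28]   # embedding of generated speech

sim = cosine_similarity(target_voice, cloned_voice)
print(f"speaker similarity: {sim:.3f}")
```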

TTS Leaderboard

12 models ranked by approximate MOS. Higher is better.

 # | Model                 | Org          | MOS (1–5) | Type        | Params | Year
 1 | ElevenLabs Turbo v2.5 | ElevenLabs   | 4.8       | Cloud API   | —      | 2024
 2 | Sesame CSM            | Sesame       | 4.7       | Open Source | 1B+    | 2025
 3 | OpenAI TTS HD         | OpenAI       | 4.7       | Cloud API   | —      | 2023
 4 | Cartesia Sonic 2      | Cartesia     | 4.7       | Cloud API   | —      | 2025
 5 | ElevenLabs Flash v2.5 | ElevenLabs   | 4.6       | Cloud API   | —      | 2025
 6 | PlayHT 3.0            | PlayHT       | 4.6       | Cloud API   | —      | 2025
 7 | Kokoro v1.0           | Hexgrad      | 4.5       | Open Source | 82M    | 2025
 8 | XTTS v2               | Coqui        | 4.5       | Open Source | 467M   | 2024
 9 | Fish Speech 1.5       | Fish Audio   | 4.4       | Open Source | 500M   | 2025
10 | Dia 1.6B              | Nari Labs    | 4.3       | Open Source | 1.6B   | 2025
11 | Parler-TTS            | Hugging Face | 4.1       | Open Source | 880M   | 2025
12 | Piper                 | Rhasspy      | 3.6       | Open Source | ~20M   | 2023

TTS Datasets

LJ Speech (2017): 13,100 short audio clips of a single speaker reading passages from non-fiction books. The standard benchmark for single-speaker TTS.

VCTK (2019): Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.

Open Source vs. Cloud in 2026

The open-source gap has nearly closed. For STT, Parakeet RNNT beats every cloud API on raw accuracy. For TTS, Sesame CSM matches cloud quality at 4.7 MOS. The remaining cloud advantage is in latency, streaming support, and managed infrastructure.

[Chart: side-by-side comparison of open-source vs. cloud speech models for STT and TTS; open-source models match or exceed cloud accuracy.]

The Conformer Revolution

Before 2020, STT was dominated by RNNs and CTC-based models. The Conformer (2020) combined self-attention with convolutions, capturing both long-range dependencies and local features. This hybrid approach drove WER from ~5% to ~2% on LibriSpeech.

Today, nearly every leading STT model uses Conformer-style blocks: NVIDIA's Parakeet, Google's USM, and AssemblyAI's Universal-2 all build on this foundation. Whisper took a different path with pure Transformer encoder-decoder, trading efficiency for massive multilingual capability.

The Open-Source TTS Explosion

2024–2025 saw an unprecedented wave of open TTS models. Sesame CSM brought conversational, emotionally-aware synthesis to open source. Kokoro proved that 82M parameters is enough for near-commercial quality. Dia introduced non-verbal cues — laughter, breathing, pauses — making generated dialogue feel alive.

The key enabler was neural audio codecs (EnCodec, SoundStream) that compress speech into discrete tokens. This let researchers apply language model techniques to audio generation, dramatically improving quality and enabling zero-shot voice cloning.

How Modern STT Works

From raw audio waveform to text transcript. The typical pipeline for a modern ASR system.

Step 1: Audio Features. Raw audio is converted to mel spectrograms or filter-bank features, typically 80 mel-frequency bins at a 10 ms frame rate.

Step 2: Encoder. Conformer or Transformer blocks process the spectrograms into rich acoustic representations. This is where the model "understands" speech sounds.

Step 3: Decoder. A CTC, RNNT, or attention decoder converts acoustic features into token sequences. RNNT enables streaming; attention yields the highest quality.

Step 4: Post-processing. Language model rescoring, punctuation restoration, speaker diarization, and timestamp alignment produce the final transcript.

Which Model Should You Use?

Speech-to-Text

  • Best Overall (Cloud): Deepgram Nova-3 — 2.2% WER with real-time streaming and speaker diarization.
  • Best Open Source: Whisper Large v3 Turbo — 2.5% WER, 8× faster than v3, 100+ languages.
  • Best Accuracy: Parakeet RNNT 1.1B — 1.8% WER, current SOTA. Requires an NVIDIA GPU.
  • Fastest Inference: Groq Whisper — the same Whisper model at ~150× realtime via LPU hardware.
  • Best for Multilingual: Google USM — 300+ languages. Whisper v3 covers 100+ languages.

Text-to-Speech

  • Best Quality (Cloud): ElevenLabs Turbo v2.5 — 4.8 MOS, indistinguishable from human. Voice library plus cloning.
  • Best Open Source: Sesame CSM — 4.7 MOS, natural conversational flow with emotional expressiveness.
  • Best for Voice Bots: Cartesia Sonic 2 — 4.7 MOS at ~90 ms TTFB. Purpose-built for real-time conversation.
  • Best Lightweight / Edge: Kokoro v1.0 (82M, Apache 2.0) for quality; Piper (~20M) for Raspberry Pi and embedded.
  • Best for Voice Cloning: PlayHT 3.0 (cloud) or Fish Speech 1.5 (open source) — high-fidelity zero-shot cloning.

Key Papers

The foundational research that shaped modern speech AI.

GitHub Repositories

The most important open-source speech projects to explore.

STT Model Comparison

All 12 STT models ranked by WER on LibriSpeech test-clean. Cloud APIs dominate the mid-range, while open-source models hold both the top (Parakeet) and the most popular (Whisper) positions.

[Chart: horizontal bars comparing all 12 STT models by WER, from Parakeet RNNT (1.8%) to wav2vec 2.0 (3.8%).]