The Complete
Speech AI Benchmark
Compare 24 models for Speech-to-Text (STT) and Text-to-Speech (TTS). From NVIDIA Parakeet's record 1.8% WER to Sesame CSM's human-like synthesis — every model that matters in 2026.
Benchmark Stats
The State of Speech AI in 2026
Speech AI has reached an inflection point. STT models now achieve sub-2% word error rates — matching or exceeding human transcription accuracy on clean speech. TTS models produce voices that are nearly indistinguishable from human speech, with emotional expressiveness and conversational flow.
The open-source ecosystem has exploded. In 2023, Whisper was the only serious open STT option. By 2026, NVIDIA's Parakeet RNNT leads LibriSpeech with 1.8% WER. In TTS, Sesame CSM, Kokoro, Fish Speech, and Dia offer quality that rivals commercial APIs — all under permissive licenses.
This page tracks every model that matters across both domains. We benchmark on standard datasets (LibriSpeech for STT, MOS scores for TTS) and track practical factors like latency, language support, and deployment options.
Parakeet RNNT 1.1B — first open model to break 2% on LibriSpeech
Sesame CSM — open-source TTS approaching commercial quality
Groq Whisper on LPU — transcribes 1 hour of audio in ~24 seconds
Speech-to-Text (STT)
SOTA Progress: From Deep Speech to Parakeet
From Deep Speech 2's 12.6% WER in 2015 to Parakeet RNNT's 1.8% in 2025 — an 86% relative improvement in a decade. The Conformer architecture (2020) and large-scale weak supervision via Whisper (2022) were the two biggest inflection points.

Accuracy vs. Speed Tradeoff
Cloud APIs cluster in the bottom-left (fast and accurate), while open-source models offer higher accuracy at the cost of latency. Groq Whisper is an outlier — the same Whisper model running on custom LPU hardware at ~150× realtime speed.

Word Error Rate (WER)
WER measures the percentage of words incorrectly transcribed. It counts three types of errors:
Substitutions
Wrong word: "the cat" becomes "the car"
Deletions
Missing word: "the big cat" becomes "the cat"
Insertions
Extra word: "the cat" becomes "the big cat"
Human-level WER on LibriSpeech test-clean is approximately 2–4%, depending on the annotator. Models like Parakeet (1.8%) now surpass average human transcription accuracy.
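The three error counts behind WER can be recovered with a standard edit-distance alignment over words. A minimal sketch in plain Python (libraries like jiwer do this for you, including text normalization):

```python
# Word-level edit distance that tracks substitutions, deletions, and
# insertions, so WER can be broken down by error type.

def wer_breakdown(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (edit_cost, substitutions, deletions, insertions)
    # for aligning ref[:i] against hyp[:j]
    dp = [[(j, 0, 0, j) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        row = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                row.append(dp[i - 1][j - 1])  # match: carry diagonal
            else:
                c, s, d, n = dp[i - 1][j - 1]
                sub = (c + 1, s + 1, d, n)    # substitution
                c, s, d, n = dp[i - 1][j]
                dele = (c + 1, s, d + 1, n)   # deletion
                c, s, d, n = row[j - 1]
                ins = (c + 1, s, d, n + 1)    # insertion
                row.append(min(sub, dele, ins))
        dp.append(row)
    cost, subs, dels, ins = dp[-1][-1]
    return {"wer": cost / max(len(ref), 1),
            "substitutions": subs, "deletions": dels, "insertions": ins}
```

For example, `wer_breakdown("the big cat", "the cat")` reports one deletion and a WER of 1/3.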
```python
import whisper

# Transcribe with Whisper (supports 100+ languages automatically)
model = whisper.load_model("large-v3-turbo")
result = model.transcribe("audio.mp3")
print(result["text"])

# Calculate WER against a reference transcript with jiwer
from jiwer import wer
error = wer("the quick brown fox", "the quik brown cat")
print(f"WER: {error*100:.1f}%")
```

STT Leaderboard
12 models ranked by WER on LibriSpeech test-clean. Lower is better.
| # | Model | Vendor | WER (%) | Type | Params | Year |
|---|---|---|---|---|---|---|
| 1 | Parakeet RNNT 1.1B | NVIDIA | 1.8 | Open Source | 1.1B | 2025 |
| 2 | Conformer XL | Google | 2.0 | Research | 600M | 2021 |
| 3 | Nova-3 | Deepgram | 2.2 | Cloud API | — | 2025 |
| 4 | Universal-2 | AssemblyAI | 2.4 | Cloud API | — | 2025 |
| 5 | Whisper Large v3 Turbo | OpenAI | 2.5 | Open Source | 809M | 2024 |
| 6 | Gladia v2 | Gladia | 2.5 | Cloud API | — | 2025 |
| 7 | Speechmatics Flow | Speechmatics | 2.6 | Cloud API | — | 2025 |
| 8 | Whisper Large v3 | OpenAI | 2.7 | Open Source | 1.55B | 2023 |
| 9 | Groq Whisper | Groq | 2.7 | Cloud API | 1.55B | 2025 |
| 10 | USM | Google | 2.8 | Cloud API | 2B | 2023 |
| 11 | Azure Speech | Microsoft | 3.0 | Cloud API | — | 2024 |
| 12 | wav2vec 2.0 | Meta | 3.8 | Open Source | 317M | 2020 |
STT Datasets
Text-to-Speech (TTS)
TTS Quality Progress: From WaveNet to Sesame CSM
From WaveNet's groundbreaking 3.0 MOS in 2016 to ElevenLabs' 4.8 MOS in 2024. The open-source gap has nearly closed — Sesame CSM achieves 4.7 MOS, just 0.1 behind the best cloud API. The dashed line shows the human speech reference at 5.0.

Quality vs. Latency Landscape
For voice bots and real-time applications, time-to-first-byte (TTFB) under 200ms is critical. Cartesia Sonic 2 leads at ~90ms with 4.7 MOS, while Piper serves the edge/embedded niche at ~30ms. The vertical pink line marks the voice bot threshold.

Mean Opinion Score (MOS)
TTS is harder to evaluate objectively than STT. The gold standard is MOS: human raters listen to generated audio and rate it from 1 (Bad) to 5 (Excellent). Scores above 4.5 are generally indistinguishable from human speech in blind tests.
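Because MOS is a sample mean over human ratings, it should be reported with a confidence interval. A minimal sketch, assuming a flat list of 1-5 scores from listeners:

```python
import math

def mos(ratings):
    """Mean Opinion Score with a normal-approximation 95% CI.
    `ratings` is a list of 1-5 scores from human listeners."""
    n = len(ratings)
    mean = sum(ratings) / n
    # Sample variance (Bessel-corrected), then standard error of the mean
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return mean, (mean - half, mean + half)
```

With only a handful of raters the interval is wide, which is why published MOS gaps of 0.1 (like Sesame CSM vs. ElevenLabs) should be read cautiously.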
Other TTS Metrics
- TTFB (Time-to-First-Byte): Critical for voice bots. The best models achieve under 100ms; Cartesia Sonic 2 leads at ~90ms.
- MCD (Mel Cepstral Distortion): Objective distance between generated and reference audio spectrograms. Lower is better.
- Speaker Similarity: For voice cloning, how closely the output matches the target voice. Measured via cosine similarity of speaker embeddings.
- Word Accuracy: Does the model skip words or hallucinate? Checked by running STT on the generated output.
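Of these, speaker similarity is the easiest to sketch: it reduces to cosine similarity between embedding vectors. A minimal version, assuming the embeddings were already extracted by a speaker-verification model:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings (lists of floats).
    A score near 1.0 means the generated voice is close to the target."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

In practice the embeddings come from a pretrained speaker encoder, and a threshold (often around 0.7-0.8, depending on the encoder) decides whether two clips count as the same speaker.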
TTS Leaderboard
12 models ranked by approximate MOS. Higher is better.
| # | Model | Vendor | MOS (1-5) | Type | Params | Year |
|---|---|---|---|---|---|---|
| 1 | ElevenLabs Turbo v2.5 | ElevenLabs | 4.8 | Cloud API | — | 2024 |
| 2 | Sesame CSM | Sesame | 4.7 | Open Source | 1B+ | 2025 |
| 3 | OpenAI TTS HD | OpenAI | 4.7 | Cloud API | — | 2023 |
| 4 | Cartesia Sonic 2 | Cartesia | 4.7 | Cloud API | — | 2025 |
| 5 | ElevenLabs Flash v2.5 | ElevenLabs | 4.6 | Cloud API | — | 2025 |
| 6 | PlayHT 3.0 | PlayHT | 4.6 | Cloud API | — | 2025 |
| 7 | Kokoro v1.0 | Hexgrad | 4.5 | Open Source | 82M | 2025 |
| 8 | XTTS v2 | Coqui | 4.5 | Open Source | 467M | 2024 |
| 9 | Fish Speech 1.5 | Fish Audio | 4.4 | Open Source | 500M | 2025 |
| 10 | Dia 1.6B | Nari Labs | 4.3 | Open Source | 1.6B | 2025 |
| 11 | Parler-TTS | Hugging Face | 4.1 | Open Source | 880M | 2025 |
| 12 | Piper | Rhasspy | 3.6 | Open Source | ~20M | 2023 |
TTS Datasets
Open Source vs. Cloud in 2026
The open-source gap has nearly closed. For STT, Parakeet RNNT beats every cloud API on raw accuracy. For TTS, Sesame CSM matches cloud quality at 4.7 MOS. The remaining cloud advantage is in latency, streaming support, and managed infrastructure.

The Conformer Revolution
Before 2020, STT was dominated by RNNs and CTC-based models. The Conformer (2020) combined self-attention with convolutions, capturing both long-range dependencies and local features. This hybrid approach drove WER from ~5% to ~2% on LibriSpeech.
Today, nearly every leading STT model uses Conformer-style blocks: NVIDIA's Parakeet, Google's USM, and AssemblyAI's Universal-2 all build on this foundation. Whisper took a different path with pure Transformer encoder-decoder, trading efficiency for massive multilingual capability.
The Open-Source TTS Explosion
2024–2025 saw an unprecedented wave of open TTS models. Sesame CSM brought conversational, emotionally aware synthesis to open source. Kokoro proved that 82M parameters is enough for near-commercial quality. Dia introduced non-verbal cues — laughter, breathing, pauses — making generated dialogue feel alive.
The key enabler was neural audio codecs (EnCodec, SoundStream) that compress speech into discrete tokens. This let researchers apply language model techniques to audio generation, dramatically improving quality and enabling zero-shot voice cloning.
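The idea can be illustrated with a toy vector quantizer: each continuous feature frame is mapped to the index of its nearest codebook entry, turning audio into a sequence of discrete tokens. (Real codecs like EnCodec learn residual codebooks end-to-end; this sketch only shows the core lookup.)

```python
# Toy vector quantization: map each frame (a feature vector) to the
# index of its nearest codebook entry by squared Euclidean distance.
# The resulting token ids are what an audio language model predicts.

def quantize(frames, codebook):
    def dist2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: dist2(f, codebook[k]))
            for f in frames]
```

Decoding reverses the lookup: replace each token id with its codebook vector and run a decoder network to reconstruct the waveform.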
How Modern STT Works
From raw audio waveform to text transcript. The typical pipeline for a modern ASR system.
Audio Features
Raw audio is converted to mel spectrograms or filter bank features. Typically 80 mel-frequency bins at 10ms frame rate.
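A sketch of this front end in NumPy, using the textbook mel-filterbank construction (80 bins, 25ms window, 10ms hop); production systems differ in details like pre-emphasis, dithering, and normalization:

```python
import numpy as np

def log_mel_features(audio, sr=16000, n_mels=80, win_ms=25, hop_ms=10):
    """Log-mel filterbank features: frame the waveform, take the power
    spectrum, then pool frequencies through triangular mel filters."""
    win, hop, n_fft = int(sr * win_ms / 1000), int(sr * hop_ms / 1000), 512
    # Frame and window the signal
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop: i * hop + win] for i in range(n_frames)])
    frames = frames * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (n_frames, n_fft//2+1)
    # Triangular mel filterbank, evenly spaced on the mel scale
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power @ fbank.T + 1e-10)            # (n_frames, n_mels)
```

One second of 16kHz audio yields roughly 100 frames of 80 features each, which is the input the encoder consumes.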
Encoder
Conformer or Transformer blocks process spectrograms into rich acoustic representations. This is where the model "understands" speech sounds.
Decoder
CTC, RNNT, or attention decoder converts acoustic features to token sequences. RNNT enables streaming; attention enables highest quality.
Post-processing
Language model rescoring, punctuation restoration, speaker diarization, and timestamp alignment produce the final transcript.
Which Model Should You Use?
Speech-to-Text
Text-to-Speech
Key Papers
The foundational research that shaped modern speech AI.
Robust Speech Recognition via Large-Scale Weak Supervision
Radford, Kim, Xu, Brockman, McLeavey, Sutskever — ICML 2023 (8,000+ citations)
Whisper — defined open-source STT
Conformer: Convolution-augmented Transformer for Speech Recognition
Gulati, Qin, Chiu et al. — Interspeech 2020 (4,500+ citations)
Conformer architecture — basis of SOTA STT
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Baevski, Zhou, Mohamed, Auli — NeurIPS 2020 (10,000+ citations)
Self-supervised speech pre-training
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
Zhang, Park, Han et al. — arXiv 2023 (500+ citations)
Massively multilingual (300+ languages)
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Shen, Pang, Weiss et al. — ICASSP 2018 (7,000+ citations)
Tacotron 2 — neural TTS breakthrough
VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End TTS
Kim, Kong, Son — ICML 2021 (3,000+ citations)
End-to-end TTS, basis of many modern models
High Fidelity Neural Audio Compression
Défossez, Copet, Synnaeve, Adi — ICLR 2023 (1,500+ citations)
EnCodec — neural audio codec used in many TTS systems
WaveNet: A Generative Model for Raw Audio
van den Oord, Dieleman, Zen et al. — arXiv 2016 (12,000+ citations)
Started neural speech synthesis revolution
GitHub Repositories
The most important open-source speech projects to explore.
- Whisper: Most popular open-source STT. Supports 100+ languages with a single model.
- NVIDIA NeMo: Home of Parakeet RNNT, the current SOTA on LibriSpeech. Full ASR/TTS toolkit.
- Sesame CSM: Conversational Speech Model. Best open-source TTS for natural dialogue.
- Kokoro: 82M-parameter TTS that rivals cloud APIs. Apache 2.0, runs on CPU.
- Fish Speech: Multilingual TTS with strong CJK support. VQGAN + Transformer architecture.
- Dia: Dialogue TTS with non-verbal cues such as laughter, pauses, and breathing sounds.
- Piper: Edge TTS for Raspberry Pi and embedded devices. 30+ languages, <30ms latency.
- Parler-TTS: Control voice style with text descriptions. Fully open data and training.
STT Model Comparison
All 12 STT models ranked by WER on LibriSpeech test-clean. Cloud APIs dominate the mid-range, while open-source models hold both the top (Parakeet) and the most popular (Whisper) positions.
