Codesota · Speech · Vol. II
The register of speech-to-text and text-to-speech
Issue: April 22, 2026
§ 00 · Speech

Speech AI, both directions.

Two pillars share this register. Speech-to-text now clears the human-accuracy bar on clean audio; text-to-speech clears the blind-test bar for naturalness. We keep both on the same page because the pipeline almost always needs both.

18 STT models and 18 TTS models tracked, sourced from the shared model catalogue. The top row of each leaderboard marks the current state of the art. Numbers shown only where reported; every model links to paper or code where available.

§ 01 · Speech-to-text

Word error rate, ranked.

LibriSpeech test-clean remains the canonical benchmark. Lower is better. Human-annotator WER on this split sits in the 2–4% band, which several models now clear.
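
WER is word-level edit distance: substitutions, insertions and deletions against the reference transcript, divided by the reference length. A minimal sketch of the metric itself, not the harness any leaderboard actually runs:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat in the mat"))  # 1/6 ≈ 0.167
```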


Metric: WER · lower is better
Models: 18 tracked · top 8 shown
Dataset: LibriSpeech test-clean
Full guide · speech recognition →
Top 8 · April 2026
Row 01 marks the current SOTA
#    Model                    Vendor       Kind         Params   WER   Δ
01   Parakeet RNNT 1.1B       NVIDIA       Open Source  1.1B     1.8
02   Conformer XL             Google       Research     600M     2.0   +0.2
03   Deepgram Nova-3          Deepgram     Cloud API             2.2   +0.2
04   Voxtral Large            Mistral AI   Cloud API             2.3   +0.1
05   AssemblyAI Universal-2   AssemblyAI   Cloud API             2.4   +0.1
06   Canary 1B                NVIDIA       Open Source  1B       2.4   0.0
07   Whisper Large v3 Turbo   OpenAI       Open Source  809M     2.5   +0.1
08   Gladia v2                Gladia       Cloud API             2.5   0.0
Fig 1 · WER on LibriSpeech test-clean. Δ is the difference against the row above; Params is blank where no size is reported.
§ 02 · Text-to-speech

Mean opinion score, ranked.

Naturalness is scored by human raters on a 1–5 scale. Commercial and open-source entries now overlap in the 4.5–4.8 band, a gap small enough that the right model is chosen on latency, licence terms or voice-cloning support rather than raw quality.
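
MOS itself is nothing exotic: the mean of those 1–5 ratings, with a confidence interval that explains why small gaps are meaningless. A minimal sketch, assuming a normal approximation and a made-up ten-listener panel:

```python
import math
import statistics

def mos(ratings: list[int]) -> tuple[float, float]:
    """Mean opinion score plus half-width of a ~95% confidence interval."""
    mean = statistics.mean(ratings)
    half_ci = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_ci

ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 4]    # hypothetical listener panel
m, ci = mos(ratings)
print(f"MOS {m:.2f} ± {ci:.2f}")             # MOS 4.30 ± 0.42
```

With panels of this size the interval dwarfs the 0.1 steps in the table below, which is why Fig 2 warns against reading them as real differences.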


Metric: MOS · higher is better
Models: 18 tracked · top 8 shown
Evaluation: Subjective listening tests
Full guide · TTS models →
Top 8 · April 2026
Row 01 marks the current SOTA
#    Model                   Vendor        Kind         Params   MOS   Δ
01   ElevenLabs Turbo v2.5   ElevenLabs    Cloud API             4.8
02   Sesame CSM              Sesame        Open Source  1B+      4.7   -0.1
03   OpenAI TTS HD           OpenAI        Cloud API             4.7   0.0
04   Gemini 2.5 Pro TTS      Google        Cloud API             4.7   0.0
05   Cartesia Sonic 2        Cartesia      Cloud API             4.7   0.0
06   ElevenLabs Flash v2.5   ElevenLabs    Cloud API             4.6   -0.1
07   PlayHT 3.0              PlayHT        Cloud API             4.6   0.0
08   Orpheus TTS             Canopy Labs   Open Source  3B       4.6   0.0
Fig 2 · MOS is subjective. Vendors publish different listener panels and reference tracks; differences below 0.1 should be treated as noise.
§ 03 · Comparison pages

Pairwise, and by use-case.

Long-form reads for the common decisions: which commercial TTS, which open-source, which model fits podcasts, audiobooks, voice bots or cloning.

Fig 3 · Each comparison page has its own evidence table; these are editorial reads, not benchmark duplicates.
§ 04 · Featured deep-dive

How speech becomes a picture.

Eleven open-source TTS voices, the same prompt, rendered through five DSP lenses and Griffin-Lim resynthesis. A reproducible walkthrough of the representations that vocoders, ASR systems and human ears actually read — mel spectrograms, MFCC, F0, formants.

Every figure is generated from the same code path; every voice is labelled with its provenance. No fabricated spectrograms, no stock audio. If the sample cannot be reproduced, it doesn't appear.
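
A minimal sketch of those representations, assuming librosa and soundfile are installed; the filename is a placeholder, formant tracking is omitted for brevity, and this mirrors the idea of the walkthrough rather than its exact code path:

```python
import librosa
import soundfile as sf

# Any mono clip; "voice_sample.wav" is a placeholder filename.
y, sr = librosa.load("voice_sample.wav", sr=22050)

# Mel spectrogram: the representation most vocoders and ASR front-ends read.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# MFCC: the compact cepstral summary classic ASR was built on.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# F0 contour via probabilistic YIN: the pitch track a human ear follows.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"))

# Griffin-Lim resynthesis: invert the mel spectrogram back to a waveform
# with estimated phase. Audibly lossy, which is exactly the point.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
sf.write("resynthesised.wav", y_hat, sr)
```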

§ 05 · Benchmarks

The datasets we believe.

Canonical for each direction plus the community-adopted follow-ups. LibriSpeech, Common Voice and VCTK are canonicalised in our dataset registry; FLEURS, AudioBench and EARS are tracked qualitatively pending canonicalisation.

Rows marked ● live in the registry and carry full lineage.

     Benchmark             Scope            Primary metric             Year   Source
●    LibriSpeech           Speech-to-Text   WER (test-clean)           2015   link →
●    Common Voice          Speech-to-Text   WER                        2019   link →
●    LJ Speech             Text-to-Speech   MOS                        2017   link →
●    VCTK                  Text-to-Speech   MOS                        2019   link →
●    TTS Intelligibility   Text-to-Speech   critical-entity accuracy   2026   link →
○    FLEURS                Speech-to-Text   WER (per-language)         2022   link →
○    AudioBench            Audio-LLM        composite                  2024   link →
○    EARS                  Text-to-Speech   MOS (subjective)           2024   link →
Fig 5 · Solid marker (●) = canonicalised in the Codesota registry. Hollow marker (○) = widely cited, tracked qualitatively, not yet graded.
ASR · English: 1.8 WER ↓ (2023–26)
TTS · naturalness: 4.8 MOS ↑ (2023–26)
Realtime TTS: ~90 ms TTFB ↓ (2023–26)
Open-source TTS: 4.7 MOS ↑ (2023–26)
Fig 6 · Directional trends across four speech axes, 2023–26. Each value is the current SOTA entry from the catalogue.
§ 06 · How it works

Two pipelines, one register.

Modern speech recognition converts raw audio into mel-spectrogram features, runs them through a Conformer or Transformer encoder, and decodes with CTC, RNNT or attention. Post-processing (language-model rescoring, punctuation, diarisation) yields the final transcript.
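
As a concrete instance of the decoding step, greedy CTC decoding in miniature: argmax per frame, collapse repeats, drop blanks. Production systems run beam search with language-model rescoring on top; this sketch shows only the collapsing rule:

```python
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, vocab: list[str],
                      blank: int = 0) -> str:
    """log_probs: (time, vocab) frame-level scores from the encoder."""
    best = log_probs.argmax(axis=1)        # best label per frame
    out, prev = [], blank
    for t in best:
        if t != prev and t != blank:       # collapse repeats, skip blanks
            out.append(vocab[t])
        prev = t
    return "".join(out)

vocab = ["_", "a", "c", "t"]               # "_" is the CTC blank
frames = np.log(np.array([                 # toy six-frame posterior
    [.1, .7, .1, .1], [.1, .7, .1, .1],    # "a" held for two frames
    [.8, .1, .05, .05],                    # blank
    [.1, .1, .7, .1],                      # "c"
    [.1, .1, .1, .7], [.1, .1, .1, .7],    # "t" held for two frames
]))
print(ctc_greedy_decode(frames, vocab))    # -> "act"
```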

Modern speech synthesis runs the pipeline in reverse. Text is embedded by a language model; acoustic tokens are predicted autoregressively or by flow matching; a vocoder or neural codec decodes those tokens back to waveform. The neural audio codec — EnCodec, SoundStream, Mimi — is the hinge that lets TTS borrow the tooling of LLMs.
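
To make that hinge concrete: codecs like EnCodec and SoundStream rest on residual vector quantisation, where each stage quantises what the previous stage left behind, turning a continuous frame into a short tuple of discrete token IDs. A toy sketch with random codebooks; a real codec learns them end to end, and nothing here is any codec's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_stages = 8, 16, 4
codebooks = rng.normal(size=(n_stages, codebook_size, dim))  # random, not learned

def rvq_encode(frame: np.ndarray) -> list[int]:
    residual, codes = frame.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)          # one discrete token per stage
        residual -= cb[idx]        # next stage models what is left over
    return codes

def rvq_decode(codes: list[int]) -> np.ndarray:
    return sum(cb[i] for cb, i in zip(codebooks, codes))

frame = rng.normal(size=dim)       # stand-in for one latent audio frame
codes = rvq_encode(frame)
error = np.linalg.norm(frame - rvq_decode(codes))
print(codes, round(float(error), 3))
```

Once every frame is a tuple of integers like this, predicting audio is token prediction, and the whole LLM toolchain applies.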

What changed recently is the representation. Once audio could be tokenised, every architectural trick from text generation became available to speech: pretraining, instruction-tuning, prompted style control, zero-shot cloning. That is why the open-source gap in TTS closed so quickly after 2023.

On the STT side, the Conformer block — self-attention plus convolution — is still the workhorse. Whisper took a different path with a pure Transformer encoder-decoder trained on weak supervision at scale, trading some efficiency for massive multilingual coverage.
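
A structural sketch of that block (half-step feed-forward, self-attention, convolution module, second half-step feed-forward, all residual), following the published Conformer recipe; the dimensions, kernel size and activations here are illustrative defaults, not any tracked model's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlock(nn.Module):
    """FFN/2 -> self-attention -> conv module -> FFN/2, every step residual."""

    def __init__(self, d: int = 256, heads: int = 4, kernel: int = 31):
        super().__init__()
        self.ff1 = self._ffn(d)
        self.norm_attn = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d)
        self.pw_in = nn.Conv1d(d, 2 * d, 1)   # pointwise, feeds a GLU
        self.dw = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.bn = nn.BatchNorm1d(d)
        self.pw_out = nn.Conv1d(d, d, 1)
        self.ff2 = self._ffn(d)
        self.norm_out = nn.LayerNorm(d)

    @staticmethod
    def _ffn(d: int) -> nn.Sequential:
        return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                             nn.SiLU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, d)
        x = x + 0.5 * self.ff1(x)                 # half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a)[0]             # attention: global context
        c = self.norm_conv(x).transpose(1, 2)     # (batch, d, time) for conv
        c = F.glu(self.pw_in(c), dim=1)           # gated pointwise expansion
        c = self.pw_out(F.silu(self.bn(self.dw(c))))
        x = x + c.transpose(1, 2)                 # convolution: local context
        x = x + 0.5 * self.ff2(x)                 # second half-step feed-forward
        return self.norm_out(x)

x = torch.randn(2, 100, 256)                      # (batch, frames, features)
print(ConformerBlock()(x).shape)                  # torch.Size([2, 100, 256])
```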

Related

Neighbouring registers.

Other modality hubs on Codesota worth reading next.

Guide · TTS models
Long-form overview of the TTS landscape.
Guide · speech recognition
How ASR models are built, trained, evaluated.
OCR · register
Document understanding and text extraction.
LLM · register
Frontier language-model benchmarks.