
The Complete Speech AI Benchmark

Compare the best models for both Speech-to-Text (STT) and Text-to-Speech (TTS). From Whisper to ElevenLabs, see who leads the charts.

Benchmark Stats

  • Best WER (STT): 2.0%
  • Best MOS (TTS): 4.8
  • Models compared: 20+

Speech-to-Text (STT)

Word Error Rate (WER)

WER measures the percentage of words incorrectly transcribed: WER = (S + D + I) / N, where N is the number of words in the reference transcript. It counts three types of errors:

  • Substitutions (S): wrong word, e.g. "the cat" becomes "the car"
  • Deletions (D): missing word, e.g. "the big cat" becomes "the cat"
  • Insertions (I): extra word, e.g. "the cat" becomes "the big cat"

wer_example.py

from jiwer import wer

reference = "the quick brown fox"
hypothesis = "the quik brown cat"   # "quik" and "cat" are substitutions

error_rate = wer(reference, hypothesis)
print("WER:", round(error_rate * 100, 1), "%")
# Output: WER: 50.0 %  (2 of 4 words wrong)
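The error counts behind that percentage come from a word-level edit-distance alignment. A minimal from-scratch sketch (no jiwer required) that also reports the S/D/I breakdown:

```python
def wer_breakdown(reference: str, hypothesis: str):
    """Word-level edit distance, tracking substitutions, deletions, insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (total errors, S, D, I) aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)          # empty hypothesis: all deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)          # empty reference: all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]      # words match: no new error
            else:
                sub, dele, ins = dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]
                best = min(sub, dele, ins, key=lambda t: t[0])
                if best is sub:
                    dp[i][j] = (best[0] + 1, best[1] + 1, best[2], best[3])
                elif best is dele:
                    dp[i][j] = (best[0] + 1, best[1], best[2] + 1, best[3])
                else:
                    dp[i][j] = (best[0] + 1, best[1], best[2], best[3] + 1)
    total, s, d, i_ = dp[len(ref)][len(hyp)]
    return s, d, i_, total / len(ref)

s, d, i, rate = wer_breakdown("the quick brown fox", "the quik brown cat")
print(f"S={s} D={d} I={i} WER={rate:.1%}")   # S=2 D=0 I=0 WER=50.0%
```

When several alignments tie on total errors, any one of them is a valid WER alignment; this sketch simply picks the first minimum.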

STT Leaderboard

WER on LibriSpeech test-clean. Lower is better.

Rank  Model             Org        WER (%)  Type         Year
#1    Conformer XL      Google     2.0      Research     2021
#2    Whisper Large v3  OpenAI     2.7      Open Source  2024
#3    Google USM        Google     2.8      Cloud API    2023
#4    Azure Speech      Microsoft  3.0      Cloud API    2024
#5    Whisper Medium    OpenAI     3.4      Open Source  2023
#6    wav2vec 2.0       Meta       3.8      Open Source  2020

STT Datasets

LibriSpeech (2015)

1000 hours of English speech from audiobooks. Standard benchmark for automatic speech recognition.

Common Voice (2019)

Massive multilingual dataset of transcribed speech. Covers diverse demographics and accents.

Text-to-Speech (TTS)

Mean Opinion Score (MOS)

TTS is harder to evaluate objectively than STT. The gold standard is MOS: human raters listen to generated audio and rate it from 1 (Bad) to 5 (Excellent).

  • 5: Excellent (human-like, natural intonation)
  • 4: Good (intelligible, minor robotic artifacts)
  • 3: Fair (understandable but clearly synthetic)
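A reported MOS is just the arithmetic mean of the raters' scores, usually quoted with a 95% confidence interval. A minimal sketch with made-up ratings (the numbers are illustrative, not from any real study):

```python
import statistics

# Hypothetical scores from 10 listeners for one TTS clip (1-5 scale)
ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 4]

mos = statistics.mean(ratings)
# 95% confidence interval via the normal approximation
sem = statistics.stdev(ratings) / len(ratings) ** 0.5
ci = 1.96 * sem
print(f"MOS = {mos:.2f} ± {ci:.2f}")   # MOS = 4.30 ± 0.42
```

Real evaluations average over many clips and many raters, which is why MOS differences below roughly 0.1 are rarely meaningful.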

Other TTS Metrics

  • MCD (Mel Cepstral Distortion)

    Objective distance between generated and reference audio. Lower is better.

  • Latency (Time-to-First-Byte)

    Critical for voice bots. Best models achieve < 200ms.

  • Word Accuracy

    Does it skip words or hallucinate? Checked via STT on output.
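Of these, MCD is the most mechanical to compute. A minimal single-frame sketch with toy cepstral values; note that conventions vary between papers (whether the energy coefficient c0 is excluded, and how frames are time-aligned, typically via DTW):

```python
import math

def mcd(ref_frame, gen_frame):
    """Mel Cepstral Distortion (dB) between two mel-cepstral frames.

    Common form: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2),
    with c_0 (the energy term) conventionally excluded.
    """
    diff_sq = sum((a - b) ** 2 for a, b in zip(ref_frame[1:], gen_frame[1:]))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * diff_sq)

# Toy 5-dimensional frames; real systems use ~25 coefficients per frame
reference = [1.0, 0.50, -0.20, 0.10, 0.05]
generated = [1.0, 0.48, -0.25, 0.12, 0.04]
print(f"MCD = {mcd(reference, generated):.3f} dB")
```

A full-utterance MCD averages this per-frame distance over all aligned frames.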

TTS Leaderboard

Approximate MOS ratings based on community benchmarks and paper results. Higher is better.

Rank  Model                  Org         MOS (1-5)  Type         Year
#1    ElevenLabs Turbo v2.5  ElevenLabs  4.8        Cloud API    2024
#2    OpenAI TTS HD          OpenAI      4.7        Cloud API    2023
#3    XTTS v2                Coqui       4.5        Open Source  2024
#4    MMS-TTS                Meta        4.0        Open Source  2023
#5    Bark                   Suno        3.9        Open Source  2023
#6    Piper                  Rhasspy     3.6        Open Source  2023

TTS Datasets

LJ Speech (2017)

13,100 short audio clips of a single speaker reading passages from non-fiction books. Standard benchmark for single-speaker TTS.

VCTK (2019)

Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.

Summary: Which Model Should You Use?

Speech-to-Text

Best Overall & Local
Whisper Large v3 (OpenAI) - Free, accurate, and runs on a consumer GPU.
Best for Streaming
Deepgram / Azure Speech - Extremely low latency for real-time apps.

Text-to-Speech

Best Quality
ElevenLabs - Near-human naturalness with expressive, emotive delivery.
Best Open Source
XTTS v2 (Coqui) - Voice cloning and high quality, runs locally.