Codesota · Lineage · Text-to-Speech Benchmarks
5 benchmarks · 4 edges · Updated 2026-04-28
Benchmark lineage

Text-to-Speech Benchmarks

How TTS evaluation evolved from single-speaker naturalness datasets toward production benchmarks that test intelligibility, voice similarity, latency, streaming behavior, and information preservation. The lineage separates "beauty" metrics such as MOS from operational metrics such as round-trip WER, critical entity accuracy, and first-byte latency.

Editor's note

Classic TTS benchmarks such as LJ Speech and VCTK are useful for model development, but they do not answer whether a cloud TTS provider is good for production voice agents. Modern evaluation needs multiple axes: naturalness, intelligibility, speaker similarity, latency, cost, and robustness on hard text containing numbers, names, acronyms, dates, URLs, and addresses.
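The hard-text categories above can be made concrete as a small prompt set. A minimal sketch follows; every prompt here is an invented illustration, not an item from any benchmark's actual test set:

```python
# Illustrative hard-case prompts by category. These are hypothetical
# examples for demonstration, not real benchmark items.
HARD_PROMPTS = {
    "numbers":   "The invoice total is $1,248.07, due in 30 days.",
    "names":     "Siobhan Nguyen will meet Dr. Okonkwo at noon.",
    "acronyms":  "The NASA and ESA teams reviewed the HTTP logs.",
    "dates":     "The launch moved from 2026-03-14 to April 2, 2026.",
    "urls":      "Visit example.com/docs for the setup guide.",
    "addresses": "Ship it to 221B Baker Street, London NW1 6XE.",
}

def prompts_for(categories):
    """Select the prompts for the requested hard-text categories."""
    return [HARD_PROMPTS[c] for c in categories if c in HARD_PROMPTS]
```

Each category stresses a different text-normalization path in the TTS front end, which is why a production suite samples across all of them rather than relying on clean read-aloud sentences.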

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.

Legend: attention path · scope shift · branch / fork · active · saturating · saturated / superseded
Direct successor: CodeSOTA TTS Eval (Apr 2026) → TTS Intelligibility (Apr 2026)
LJ Speech → VCTK · scope shift
TTS evaluation moved from clean single-speaker synthesis to multi-speaker and accent variation.
VCTK → Seed-TTS-Eval · scope shift
Model quality improved enough that basic corpora no longer exposed enough failure modes; harder text, similarity, and robustness became more important.
Seed-TTS-Eval → CodeSOTA TTS Eval · scope shift
Vendor selection requires reproducible API-level measurements, not only research-set model scores.
CodeSOTA TTS Eval → TTS Intelligibility · direct successor · attention
Clean Harvard sentences are not enough for production. The successor benchmark focuses on hard English prompts, critical entity preservation, latency, and cost.
§ 02 · Benchmarks in this lineage

Nodes in detail.

Jul 2017 · Saturating

LJ Speech

The LJ Speech Dataset

Single-speaker English audiobook dataset with 13,100 short clips. Became a standard reference for early neural TTS because it is clean, easy to use, and reproducible. Strong for basic synthesis research; weak for multi-speaker, streaming, latency, and hard text evaluation.

Ito and Johnson · paper

VCTK

CSTR VCTK Corpus

Multi-speaker English corpus with 110 speakers and accent variation. Useful for speaker conditioning, adaptation, and voice conversion. More diverse than LJ Speech, but still not a direct production benchmark for API TTS providers.

University of Edinburgh CSTR · paper
Jun 2024 · Active

Seed-TTS-Eval

Seed-TTS-Eval

Harder TTS evaluation set used with Seed-TTS, including speech naturalness, speaker similarity, and robustness to challenging text. Important bridge from simple MOS-style testing toward more diagnostic TTS evaluation.

ByteDance Seed team · paper
Apr 2026 · Saturating

CodeSOTA TTS Eval

CodeSOTA Independent TTS Vendor Evaluation

First-party CodeSOTA comparison using UTMOS for naturalness and Whisper round-trip WER for intelligibility on Harvard sentences. Useful as an auditable vendor sanity check, but the prompt set is intentionally clean and does not stress production hard cases.

CodeSOTA · paper
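The Whisper round-trip measurement described above can be sketched as follows. The WER computation is standard word-level edit distance; `synthesize` and `transcribe` are caller-supplied stand-ins for a vendor TTS call and an ASR pass, not the benchmark's actual harness:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Single-row Levenshtein DP over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution or match
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

def round_trip_wer(text, synthesize, transcribe):
    """Score one prompt: text -> TTS audio -> ASR transcript -> WER.

    `synthesize` and `transcribe` are hypothetical callables (e.g. a
    vendor API and a Whisper wrapper); nothing here assumes a specific
    provider, audio format, or text normalization scheme.
    """
    return wer(text, transcribe(synthesize(text)))
```

In practice the reference and hypothesis would both be normalized (numbers, punctuation, casing) before scoring, since ASR output conventions otherwise inflate the error rate.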

TTS Intelligibility

English TTS Intelligibility Benchmark

Production TTS information-preservation benchmark. Synthesizes English hard-case prompts with each TTS provider, transcribes the audio with an independent ASR model, and ranks models on quality, speed, and cost: normalized WER/CER, critical entity accuracy, p95 TTFB, severity-weighted errors, and cost per 1K characters.

CodeSOTA · paper
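Two of the metrics above can be sketched directly. This is a minimal illustration, not the benchmark's scoring code: the percentile definition and the entity-matching rules are assumptions, and a real scorer would also handle spoken-form variants (e.g. "two eighty four" for "284"):

```python
import math
import re

def p95_ttfb(latencies_ms):
    """p95 time-to-first-byte via the nearest-rank method.

    Assumption: the benchmark may use a different percentile definition
    (e.g. linear interpolation).
    """
    s = sorted(latencies_ms)
    return s[math.ceil(0.95 * len(s)) - 1]

def critical_entity_accuracy(transcript, entities):
    """Fraction of critical entities (names, numbers, codes) that
    survive the TTS -> ASR round trip.

    Matching is naive substring matching on punctuation-stripped
    lowercase text; spoken-form normalization is out of scope here.
    """
    norm = lambda s: " ".join(re.sub(r"[^a-z0-9 ]", " ", s.lower()).split())
    t = norm(transcript)
    hits = sum(1 for e in entities if norm(e) in t)
    return hits / max(len(entities), 1)
```

Entity accuracy complements WER: a transcript can have low WER overall yet still drop the one phone number or street address that makes a voice-agent turn unusable, which is why the two are reported separately.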