Text-to-Speech Benchmarks
How TTS evaluation evolved from single-speaker naturalness datasets toward production benchmarks that test intelligibility, voice similarity, latency, streaming behavior, and information preservation. The lineage separates beauty metrics like MOS from operational metrics such as round-trip WER, critical entity accuracy, and first-byte latency.
Classic TTS benchmarks such as LJ Speech and VCTK are useful for model development, but they do not answer whether a cloud TTS provider is good for production voice agents. Modern evaluation needs multiple axes: naturalness, intelligibility, speaker similarity, latency, cost, and robustness on hard text containing numbers, names, acronyms, dates, URLs, and addresses.
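As a concrete picture of the hard-text axis, a probe set might look like the sketch below; every prompt and the `synthesize` callable are illustrative placeholders, not drawn from any specific benchmark.

```python
# Hypothetical hard-case probe set, one prompt per axis named above.
HARD_CASES = {
    "numbers":   "The invoice total is $1,234.56, due in 30 days.",
    "names":     "Dr. Siobhan Nguyen will meet Joaquin Xiong at noon.",
    "acronyms":  "NASA and the IEEE co-hosted the NLP workshop.",
    "dates":     "The meeting moved from 2024-03-01 to March 12th, 2024.",
    "urls":      "Visit https://example.com/docs?id=42 for details.",
    "addresses": "Ship it to 221B Baker St, London NW1 6XE.",
}

def probe(synthesize):
    """Run every hard case through a provider's synthesize(text) -> audio bytes."""
    return {name: synthesize(text) for name, text in HARD_CASES.items()}
```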
Attention path plus branches.
Solid arrows follow the attention path; the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.
Nodes in detail.
LJ Speech
Single-speaker English audiobook dataset with 13,100 short clips. Became a standard reference for early neural TTS because it is clean, easy to use, and reproducible. Strong for basic synthesis research; weak for multi-speaker, streaming, latency, and hard text evaluation.
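A minimal loading sketch, assuming the Hugging Face `lj_speech` dataset id for the corpus:

```python
# A quick look at LJ Speech via the Hugging Face datasets mirror
# (assumed dataset id "lj_speech"; the single split holds all 13,100 clips).
from datasets import load_dataset

ds = load_dataset("lj_speech", split="train")
print(len(ds))                            # 13100
sample = ds[0]
print(sample["normalized_text"])          # transcript with numbers spelled out
print(sample["audio"]["sampling_rate"])   # 22050 Hz in the original release
```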
VCTK
Multi-speaker English corpus with 110 speakers and accent variation. Useful for speaker conditioning, adaptation, and voice conversion. More diverse than LJ Speech, but still not a direct production benchmark for API TTS providers.
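A minimal sketch of grouping utterances by speaker for conditioning or adaptation experiments, assuming torchaudio's `VCTK_092` loader (the 0.92 release of the corpus):

```python
# Build per-speaker utterance lists from VCTK, e.g. for adaptation sets.
# Assumes torchaudio.datasets.VCTK_092; the full corpus is a large download.
from collections import defaultdict
import torchaudio

vctk = torchaudio.datasets.VCTK_092(root="./data", download=True)

by_speaker = defaultdict(list)
for i in range(200):  # small slice for illustration; the corpus has ~44k clips
    waveform, sample_rate, transcript, speaker_id, utterance_id = vctk[i]
    by_speaker[speaker_id].append((utterance_id, transcript))

print(sorted(by_speaker))  # speaker IDs like "p225", "p226", ...
```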
Seed-TTS-Eval
Harder TTS evaluation set introduced with Seed-TTS, covering speech naturalness, speaker similarity, and robustness to challenging text. An important bridge from simple MOS-style testing toward more diagnostic TTS evaluation.
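The speaker-similarity half of this kind of evaluation reduces to a cosine between speaker embeddings of the reference and synthesized audio. A minimal sketch, with Resemblyzer swapped in as a convenient encoder; Seed-TTS-Eval itself may use a different speaker-verification model:

```python
# Cosine speaker similarity between a reference clip and a synthesized clip.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
ref = encoder.embed_utterance(preprocess_wav("reference.wav"))
syn = encoder.embed_utterance(preprocess_wav("synthesized.wav"))

# Resemblyzer embeddings are L2-normalized, so a dot product is the cosine.
similarity = float(np.dot(ref, syn))
print(f"speaker similarity: {similarity:.3f}")
```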
CodeSOTA TTS Eval
First-party CodeSOTA comparison using UTMOS for naturalness and Whisper round-trip WER for intelligibility on Harvard sentences. Useful as an auditable vendor sanity check, but the prompt set is intentionally clean and does not stress production hard cases.
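The round-trip check is straightforward to reproduce. A minimal sketch, assuming the openai-whisper and jiwer packages; `synthesize_to_file` is a hypothetical stand-in for a provider SDK call:

```python
# Round-trip intelligibility: TTS -> Whisper ASR -> WER against the input text.
import whisper
from jiwer import wer

asr = whisper.load_model("base")

def round_trip_wer(text: str, synthesize_to_file) -> float:
    path = synthesize_to_file(text)  # provider-specific TTS call (hypothetical)
    hyp = asr.transcribe(path)["text"]
    # Light normalization so punctuation and casing don't dominate the score.
    norm = lambda s: s.lower().strip().rstrip(".")
    return wer(norm(text), norm(hyp))

# round_trip_wer("The birch canoe slid on the smooth planks.", my_tts)
```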
TTS Intelligibility
Production TTS information-preservation benchmark. Synthesizes English hard-case prompts with each TTS provider, transcribes the audio with an independent ASR model, and ranks models on quality, speed, and cost: normalized WER/CER, critical entity accuracy, p95 time-to-first-byte (TTFB), severity-weighted errors, and cost per 1K characters.
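Two of these operational metrics are easy to state precisely. A minimal sketch of critical entity accuracy and p95 TTFB; the substring-match rule and function names are illustrative, not the benchmark's actual scoring code:

```python
# Critical entity accuracy: fraction of key tokens that survive the round trip.
# p95 TTFB: 95th-percentile time to first audio byte across requests.
import statistics

def entity_accuracy(entities: list[str], transcript: str) -> float:
    t = transcript.lower()
    hits = sum(1 for e in entities if e.lower() in t)
    return hits / len(entities) if entities else 1.0

def p95_ttfb(ttfb_seconds: list[float]) -> float:
    return statistics.quantiles(ttfb_seconds, n=100)[94]

print(entity_accuracy(["$1,234.56", "March 12"],
                      "the total is $1,234.56, due march 12"))    # 1.0
print(p95_ttfb([0.21, 0.19, 0.35, 0.28, 0.22, 0.40, 0.25, 0.31]))
```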