Standard zero-shot voice-cloning / TTS evaluation using LibriTTS test-clean speaker prompts. Word error rate (WER) on resynthesized utterances, measured with a frozen ASR model such as HuBERT-Large or Whisper, is the primary intelligibility metric (lower is better); speaker-embedding cosine similarity (SECS) is a secondary metric (higher is better).
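As a point of reference, WER is the word-level edit distance between the ASR transcript and the reference text, divided by the reference length. A minimal pure-Python sketch (the frozen ASR call itself is assumed to have already produced the hypothesis transcript, and is not shown):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (counts substitutions, insertions, and deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over five reference words -> WER = 0.2
print(wer("hello world this is tts", "hello word this is tts"))
```

In practice the leaderboard numbers above are reported as percentages, so a value like 1.81 corresponds to a WER of 0.0181 from a function like this.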
Three results indexed across one metric. The shaded row marks the current SOTA; ties are broken by submission date.
| # | Model | Org | Submitted | Paper / code | WER (%) |
|---|---|---|---|---|---|
| 01 | NaturalSpeech 3 | Microsoft | Apr 2026 | editorial | 1.81 |
| 02 | Voicebox | Meta AI | Apr 2026 | editorial | 1.90 |
| 03 | VALL-E | Microsoft | Apr 2026 | editorial | 5.90 |
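The secondary SECS metric is typically computed as the cosine similarity between speaker embeddings of the prompt audio and the synthesized audio, extracted by a pretrained speaker-verification model. A hedged sketch of just the similarity step, with plain float lists standing in for the embedding vectors (the embedding extractor is an assumption, not specified by this leaderboard):

```python
import math

def secs(emb_a: list[float], emb_b: list[float]) -> float:
    """Cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal embeddings score 0.0.
print(secs([1.0, 0.0], [1.0, 0.0]))
print(secs([1.0, 0.0], [0.0, 1.0]))
```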
Each row below marks a model that broke the previous record on WER; lower scores win, and each subsequent entry improved on the previous best. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.