Standard zero-shot voice-cloning / TTS evaluation using LibriTTS test-clean speaker prompts. WER on resynthesized utterances, measured with a frozen ASR model such as HuBERT-Large or Whisper, is the primary intelligibility metric (lower is better); speaker similarity (SECS, speaker encoder cosine similarity) is a secondary metric.
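As a rough illustration of the two metrics named above, the sketch below computes word error rate as word-level edit distance divided by reference length, and speaker similarity as the cosine between two embedding vectors. The function names `word_error_rate` and `cosine_similarity` are hypothetical, the text normalization (lowercasing and whitespace splitting) is an assumption, and the ASR transcription and speaker-encoder steps are left out entirely; published results may use different normalization and models.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: transcripts are normalized by lowercasing and splitting on
    whitespace. Real evaluations often also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In this sketch, `reference` would be the text the TTS system was asked to speak and `hypothesis` the ASR transcript of the synthesized audio. The function returns a fraction, so multiply by 100 to express it as a percentage. Corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance values.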
WER is the reported evaluation metric for LibriTTS test-clean (Zero-Shot TTS). Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Lower is better
| Rank | Model | Trust | WER (%) | Year | Links |
|---|---|---|---|---|---|
| 01 | NaturalSpeech 3 | paper | 1.81 | 2026 | Source ↗ |
| 02 | Voicebox | paper | 1.90 | 2026 | Source ↗ |
| 03 | VALL-E | paper | 5.90 | 2026 | Source ↗ |