Voice Cloning · 2019 · en

LibriTTS test-clean zero-shot TTS evaluation

Standard zero-shot voice-cloning / TTS evaluation that uses LibriTTS test-clean speaker prompts. WER on the resynthesized utterances, measured with a frozen ASR model such as HuBERT-Large or Whisper, is the primary intelligibility metric (lower is better); speaker similarity (SECS) is a secondary metric.
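To make the two metrics concrete, here is a minimal sketch of how they are commonly computed. It assumes WER is the word-level edit distance between the reference text and the ASR transcript of the synthesized audio, divided by the reference length, and that SECS is the cosine similarity between two speaker embeddings. The function names are illustrative, the ASR model and speaker encoder are not shown, and real evaluations usually apply text normalization beyond the lowercasing used here.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors (assumed SECS definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


if __name__ == "__main__":
    # One dropped word out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {100 * wer:.2f}%")
```

Scores on this page such as 1.81 are presumably percentages, so a per-utterance value from `word_error_rate` would be multiplied by 100, and a corpus figure would normally pool errors and reference words over all test utterances.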

Current State of the Art

NaturalSpeech 3

Microsoft

1.81

wer

LibriTTS test-clean (Zero-Shot TTS) — wer

3 results · 1 SOTA advance · lower is better


wer Progress Over Time

Showing 3 breakthroughs from Jan 2023 to Mar 2024

[Chart: wer (y-axis, 1.4 to 6.3) against Date (x-axis, Jan 2023 to Mar 2024)]

Key Milestones

Jan 2023: VALL-E, wer 5.9
VALL-E zero-shot TTS with LibriTTS-style test prompts; WER measured via ASR. Seed entry, to be verified (the original VALL-E was evaluated on LibriSpeech).

Jun 2023: Voicebox, wer 1.9 (-67.8%)
Voicebox zero-shot TTS; LibriSpeech/LibriTTS test-clean WER. Seed entry, to be verified.

Mar 2024: NaturalSpeech 3 (Current SOTA), wer 1.8 (-4.7%)
NaturalSpeech 3 zero-shot TTS; LibriSpeech/LibriTTS test-clean WER. Seed entry, to be verified.

Total Improvement: 69.3%
Time Span: 1y 2m
Breakthroughs: 3
Current SOTA: 1.8
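The percentage changes in the milestones can be reproduced with a short calculation. This is a sketch that assumes each step is the relative change from the previous best score and that the total improvement is the relative drop from the first score to the last. It uses the unrounded NaturalSpeech 3 score of 1.81 from the results listing, since the displayed 1.8 would give about -5.3% for the last step rather than -4.7%.

```python
# Milestone scores in chronological order (wer, lower is better).
milestones = [("VALL-E", 5.9), ("Voicebox", 1.9), ("NaturalSpeech 3", 1.81)]


def relative_change_percent(previous: float, current: float) -> float:
    """Signed change from previous to current, as a percentage of previous."""
    return (current - previous) / previous * 100.0


# Step-by-step change between consecutive milestones.
step_changes = [
    round(relative_change_percent(prev_score, curr_score), 1)
    for (_, prev_score), (_, curr_score) in zip(milestones, milestones[1:])
]

# Total improvement: relative reduction from the first to the latest score.
total_improvement = round(
    -relative_change_percent(milestones[0][1], milestones[-1][1]), 1
)

if __name__ == "__main__":
    print(step_changes)        # per-milestone change in percent
    print(total_improvement)   # overall reduction in percent
```

Under these assumptions the step changes come out as -67.8 and -4.7, and the total improvement as 69.3, matching the figures shown.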

Top Models Performance Comparison

Top 3 models ranked by wer (lower is better)

1. NaturalSpeech 3: 1.8 (100.0% of best)
2. Voicebox: 1.9 (95.3% of best)
3. VALL-E: 5.9 (30.7% of best)

Best Score: 1.8
Top Model: NaturalSpeech 3
Models Compared: 3
Score Range: 4.1
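The "% of best" and score-range figures appear to follow from a simple ratio. The sketch below assumes that, for a lower-is-better metric, "% of best" is the best score divided by the model's score, and that the score range is the gap between the worst and best scores. Both formulas are inferred from the displayed numbers rather than documented on the page.

```python
# wer scores from the results listing (lower is better).
scores = {"NaturalSpeech 3": 1.81, "Voicebox": 1.9, "VALL-E": 5.9}

best = min(scores.values())

# For a lower-is-better metric, the best model gets 100% and
# worse (higher) scores get proportionally smaller percentages.
percent_of_best = {
    name: round(best / score * 100.0, 1) for name, score in scores.items()
}

# Gap between the worst and the best score.
score_range = round(max(scores.values()) - best, 1)

if __name__ == "__main__":
    for name, pct in percent_of_best.items():
        print(f"{name}: {pct}% of best")
    print(f"Score range: {score_range}")
```

With the scores above this gives 100.0, 95.3, and 30.7 percent and a range of 4.1, consistent with the comparison shown.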

wer (Primary)

#  Model            Organization  Score  Paper / Code  Date
1  NaturalSpeech 3  Microsoft     1.81   -             Apr 2026
2  Voicebox         Meta AI       1.9    -             Apr 2026
3  VALL-E           Microsoft     5.9    -             Apr 2026