Voice Cloning · 2019 · en

LibriTTS test-clean zero-shot TTS evaluation

Standard zero-shot voice-cloning / TTS evaluation using speaker prompts from the LibriTTS test-clean split. The primary metric is word error rate (WER) on the resynthesized utterances, measured with a frozen ASR model such as HuBERT-Large or Whisper; it reflects intelligibility, and lower is better. Speaker similarity, reported as speaker encoder cosine similarity (SECS), is a secondary metric.
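As an illustration of the two metrics described above, here is a minimal sketch in plain Python. It assumes the ASR transcripts and speaker embeddings have already been produced by whatever frozen models the evaluation uses. The function names (`edit_distance`, `corpus_wer`, `cosine_similarity`) and the simple lowercase/whitespace normalization are assumptions for illustration only; published results typically apply their own text normalization, so numbers from this sketch are not directly comparable to the leaderboard.

```python
import math
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Corpus-level WER in percent over (reference text, ASR transcript) pairs.

    Assumption: text is normalized by lowercasing and splitting on whitespace.
    """
    errors = 0
    ref_words = 0
    for reference, transcript in pairs:
        ref = reference.lower().split()
        hyp = transcript.lower().split()
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("references must contain at least one word")
    return 100.0 * errors / ref_words


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings (the quantity SECS reports)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    pairs = [("the cat sat", "the cat sat"), ("on the mat", "on a mat")]
    print(f"WER: {corpus_wer(pairs):.2f}%")
    print(f"SECS: {cosine_similarity([0.2, 0.9], [0.25, 0.85]):.3f}")
```

The corpus-level form sums errors and reference words across utterances before dividing, so long utterances weigh more than short ones; averaging per-utterance WER instead would give a different number.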

Current State of the Art

NaturalSpeech 3

Microsoft

1.81

WER

LibriTTS test-clean (Zero-Shot TTS): WER

3 results · 1 SOTA advance · lower is better

WER Progress Over Time

Showing 3 breakthroughs from Jan 2023 to Mar 2024

Key Milestones

Jan 2023

VALL-E

VALL-E zero-shot TTS with LibriTTS-style test prompts; WER measured via ASR. Seed entry, to be verified: the original VALL-E paper evaluated on LibriSpeech.

5.9

Jun 2023

Voicebox

Voicebox zero-shot TTS; WER on LibriSpeech/LibriTTS test-clean. Seed entry, to be verified.

1.9

-67.8%

Mar 2024

NaturalSpeech 3 (Current SOTA)

NaturalSpeech 3 zero-shot TTS; WER on LibriSpeech/LibriTTS test-clean. Seed entry, to be verified.

1.81

-4.7%

Total Improvement

69.3%

Time Span

1y 2m

Breakthroughs

3

Current SOTA

1.81
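
The percentage figures in this summary appear to be relative WER reductions between consecutive milestones. The sketch below recomputes them from the listed scores, assuming the unrounded NaturalSpeech 3 value of 1.81 from the results table; the variable and function names are illustrative and not taken from the page.

```python
# Milestone scores as listed on this page (WER, lower is better).
milestones = [
    ("VALL-E", 5.9),
    ("Voicebox", 1.9),
    ("NaturalSpeech 3", 1.81),  # assumption: unrounded table value, not the 1.8 shown on cards
]


def relative_reduction(old: float, new: float) -> float:
    """Relative reduction from old to new, in percent."""
    return (old - new) / old * 100.0


# Step-by-step reductions between consecutive milestones.
steps = [
    round(relative_reduction(prev_score, next_score), 1)
    for (_, prev_score), (_, next_score) in zip(milestones, milestones[1:])
]

# Overall reduction from the first milestone to the current best.
total = round(relative_reduction(milestones[0][1], milestones[-1][1]), 1)

# Absolute spread between the worst and best listed scores.
score_range = round(milestones[0][1] - milestones[-1][1], 1)

print(steps)        # [67.8, 4.7]
print(total)        # 69.3
print(score_range)  # 4.1
```

With 1.8 in place of 1.81 the last step would come out at about 5.3% rather than 4.7%, which suggests the page derives its percentages from the unrounded score.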

Top Models Performance Comparison

Top 3 models ranked by WER (lower is better)

Best Score

1.81

Top Model

NaturalSpeech 3

Models Compared

3

Score Range

4.1

WER (Primary)

#	Model	Organization	WER (%)	Paper / Code	Date
1	NaturalSpeech 3	Microsoft	1.81	editorial	Apr 2026
2	Voicebox	Meta AI	1.9	editorial	Apr 2026
3	VALL-E	Microsoft	5.9	editorial	Apr 2026