Text-to-Speech · 2017 · en
The LJ Speech Dataset
13,100 short audio clips of a single speaker reading passages from non-fiction books. Standard benchmark for single-speaker TTS.
Current State of the Art
VALL-E 2 (Microsoft): 4.61 MOS
LJ Speech: MOS
5 results · 3 SOTA advances · higher is better
MOS Progress Over Time
Showing 3 breakthroughs from Jun 2021 to Jun 2024
Key Milestones
Jun 2021
VITS
4.43
Jun 2023
StyleTTS 2
MOS (1–5). Surpasses human baseline (4.44 MOS). Source: Table 2, arXiv:2306.07691 (NeurIPS 2023)
4.55
+2.7%
Jun 2024
VALL-E 2 (Current SOTA)
MOS (1–5). Human parity: CMOS +0.17 above ground truth. Source: Table 1, arXiv:2406.05370 (Jun 2024)
4.61
+1.3%
Total Improvement: 4.1%
Time Span: 3y 1m
Breakthroughs: 3
Current SOTA: 4.61
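The milestone percentages can be reproduced from the scores in the results table on this page. The following is a minimal sketch, not the site's actual method: it assumes a "breakthrough" is any result that beats every earlier score, and that each gain is relative to the previous best. The helper names `sota_frontier` and `pct_gain` are hypothetical, and the day of month in each date is a placeholder, since the page lists only month and year.

```python
from datetime import date

# (model, date, MOS) copied from the results table on this page.
# Day-of-month is a placeholder: the page gives only month and year.
results = [
    ("VITS", date(2021, 6, 1), 4.43),
    ("XTTS v2", date(2023, 4, 1), 4.21),
    ("StyleTTS 2", date(2023, 6, 1), 4.55),
    ("Voicebox", date(2023, 6, 1), 4.30),
    ("VALL-E 2", date(2024, 6, 1), 4.61),
]


def sota_frontier(rows):
    """Return the results that set a new best score, in chronological order."""
    frontier = []
    best = float("-inf")
    for name, when, score in sorted(rows, key=lambda r: r[1]):
        if score > best:  # higher MOS is better
            frontier.append((name, when, score))
            best = score
    return frontier


def pct_gain(old, new):
    """Relative improvement of new over old, in percent, to one decimal."""
    return round((new - old) / old * 100, 1)


frontier = sota_frontier(results)
step_gains = [pct_gain(a[2], b[2]) for a, b in zip(frontier, frontier[1:])]
total_gain = pct_gain(frontier[0][2], frontier[-1][2])

print([name for name, _, _ in frontier])
print(step_gains, total_gain)
```

Under these assumptions the frontier is VITS, StyleTTS 2, VALL-E 2, with step gains of 2.7% and 1.3% and a total gain of 4.1%, matching the figures shown above.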
Top Models Performance Comparison
Top 5 models ranked by MOS
Best Score: 4.61
Top Model: VALL-E 2
Models Compared: 5
Score Range: 0.40
MOS (Primary)
| # | Model | Organization | Open Source | Score (MOS) | Date |
|---|---|---|---|---|---|
| 1 | VALL-E 2 | Microsoft | No | 4.61 | Jun 2024 |
| 2 | StyleTTS 2 | Columbia University | Yes | 4.55 | Jun 2023 |
| 3 | VITS | Kakao | Yes | 4.43 | Jun 2021 |
| 4 | Voicebox | Meta AI | No | 4.30 | Jun 2023 |
| 5 | XTTS v2 | Coqui AI | Yes | 4.21 | Apr 2023 |
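The summary figures for this comparison (best score, top model, models compared, score range) follow directly from the table rows. A short sketch, assuming the score range is simply the best score minus the lowest score among the listed models; the variable names are illustrative only:

```python
# MOS per model, copied from the table above.
scores = {
    "VALL-E 2": 4.61,
    "StyleTTS 2": 4.55,
    "VITS": 4.43,
    "Voicebox": 4.30,
    "XTTS v2": 4.21,
}

# Rank from highest to lowest MOS (higher is better).
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

top_model, best_score = ranked[0]
lowest_score = ranked[-1][1]

# Round to two decimals to avoid floating-point noise in the difference.
score_range = round(best_score - lowest_score, 2)

print(top_model, best_score, len(ranked), score_range)
```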
Related Papers (5)
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Jun 2023 · Models: Voicebox
XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model
Apr 2023 · Models: XTTS v2