LJSpeech: 13,100 short audio clips of a single speaker reading passages from non-fiction books. A standard benchmark for single-speaker TTS.
Metric: MOS (mean opinion score, 1–5 scale). Higher is better.
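As a rough illustration of the metric, a MOS is the mean of listener ratings on the 1–5 scale, usually reported with a confidence interval (the "±" values in the table). The sketch below assumes a normal-approximation 95% interval; individual papers may use a different procedure. The function name `mos_with_ci` and the ratings are hypothetical and not taken from any listed paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 listener ratings.

    Assumption: a normal-approximation 95% interval (z = 1.96).
    Papers may instead use a t-distribution or another procedure.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


# Hypothetical ratings from a single listening test (illustrative only).
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
mean, half_width = mos_with_ci(ratings)
print(f"MOS {mean:.2f} ±{half_width:.2f}")
```

With only a handful of ratings the interval is wide, which is why published listening tests collect many ratings per system.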
| Rank | Model | Notes | Source | Score | Year | Paper |
|---|---|---|---|---|---|---|
| 1 | VALL-E 2 | MOS (1–5). Human parity: CMOS +0.17 above ground truth. Table 1, arXiv:2406.05370 (Jun 2024). | Community | 4.61 | 2026 | Source |
| 2 | NaturalSpeech | MOS 4.56 ±0.13 on LJSpeech. Human ground truth 4.58 ±0.13; difference not statistically significant (p > 0.05, Wilcoxon). Reported as the first TTS system to achieve human-level quality on LJSpeech. IEEE TASLP 2024 (arXiv:2205.04421, Table 2). | Community | 4.56 | 2026 | Source |
| 3 | StyleTTS2 | MOS (1–5). Surpasses human baseline (4.44 MOS). Table 2, arXiv:2306.07691 (NeurIPS 2023). | Community | 4.55 | 2026 | Source |
| 4 | VITS | MOS (1–5). End-to-end TTS. Table 2, arXiv:2106.06103 (ICML 2021). | Community | 4.43 | 2026 | Source |
| 5 | Grad-TTS + HiFi-GAN | MOS 4.37 ±0.13 on LJSpeech, as reported in the NaturalSpeech paper (arXiv:2205.04421, Table 4). Human ground truth 4.58 in the same evaluation. | Community | 4.37 | 2026 | Source |
| 6 | Glow-TTS + HiFi-GAN | MOS 4.34 ±0.13 on LJSpeech, as reported in the NaturalSpeech paper (arXiv:2205.04421, Table 4). Human ground truth 4.58 in the same evaluation. | Community | 4.34 | 2026 | Source |
| 7 | FastSpeech2 + HiFi-GAN | MOS 4.32 ±0.15 on LJSpeech, as reported in the NaturalSpeech paper (arXiv:2205.04421, Table 4). Human ground truth 4.58 in the same evaluation. | Community | 4.32 | 2026 | Source |
| 8 | Voicebox | MOS (1–5). Single-speaker evaluation on LJSpeech. Table 1, arXiv:2306.15687 (NeurIPS 2023). | Community | 4.3 | 2026 | Source |
| 9 | XTTS v2 | MOS (1–5). Evaluated on LJSpeech. arXiv:2304.01196 evaluation. | Community | 4.21 | 2026 | Source |
| 10 | Matcha-TTS | MOS 3.84 ±0.08 on LJSpeech with 10 ODE solver steps (best variant). Vocoded reference 4.13 in the same evaluation. Flow-matching architecture; significantly outperforms Grad-TTS. ICASSP 2024 (arXiv:2309.03199, Table 1). | Community | 3.84 | 2026 | Source |
| 11 | JETS | MOS 3.57 ±0.09 on LJSpeech (in-distribution), as reported in the StyleTTS2 paper (NeurIPS 2023, arXiv:2306.07691, Table 2). Human ground truth 3.81 in the same evaluation. | Community | 3.57 | 2026 | Source |
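Several rows above also give a reference score (human ground truth or vocoded speech) from the same listening test. Because the scores come from different evaluations, the gap to that in-test reference may be a more comparable figure than the raw MOS. The sketch below computes the gap using only the values listed in the table; the helper name `gap_to_reference` is hypothetical, and rows without an in-test reference are omitted.

```python
# (model, MOS, reference MOS from the same listening test), as listed in the table.
rows = [
    ("NaturalSpeech", 4.56, 4.58),
    ("Grad-TTS + HiFi-GAN", 4.37, 4.58),
    ("Glow-TTS + HiFi-GAN", 4.34, 4.58),
    ("FastSpeech2 + HiFi-GAN", 4.32, 4.58),
    ("Matcha-TTS", 3.84, 4.13),  # reference here is vocoded speech, not raw recordings
    ("JETS", 3.57, 3.81),
]


def gap_to_reference(rows):
    """Return (model, MOS minus in-test reference) pairs, smallest shortfall first."""
    gaps = [(name, round(mos - ref, 2)) for name, mos, ref in rows]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


for name, gap in gap_to_reference(rows):
    print(f"{name:24s} {gap:+.2f}")
```

Ordering by gap rather than raw score changes the picture slightly: JETS, ranked last by MOS, sits about as close to its own reference as Glow-TTS + HiFi-GAN does to its reference.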