Text-to-Speech · 2017 · en

The LJ Speech Dataset

13,100 short audio clips of a single speaker reading passages from non-fiction books. Standard benchmark for single-speaker TTS.

Metrics: MOS (mean opinion score), MCD (mel-cepstral distortion)
Current State of the Art

VALL-E 2 (Microsoft): 4.61 MOS

LJ Speech — mos

5 results · 3 SOTA advances · higher is better

[Chart: MOS by year, 2021 to 2025, showing all results and the SOTA frontier (VITS, StyleTTS 2, VALL-E 2)]

mos Progress Over Time

Showing 3 breakthroughs from Jun 2021 to Jun 2024

[Chart: MOS (y-axis) against date (x-axis), Jun 2021 to Jun 2024]

Key Milestones

Jun 2021: VITS, MOS 4.4
MOS (1–5). VITS end-to-end TTS. Source: Table 2, arXiv:2106.06103 (ICML 2021)

Jun 2023: StyleTTS 2, MOS 4.5 (+2.7%)
MOS (1–5). Surpasses human baseline (4.44 MOS). Source: Table 2, arXiv:2306.07279 (NeurIPS 2023)

Jun 2024: VALL-E 2 (Current SOTA), MOS 4.6 (+1.3%)
MOS (1–5). Human parity: CMOS +0.17 above ground truth. Source: Table 1, arXiv:2406.05370 (Jun 2024)

Total Improvement: 4.1%
Time Span: 3y 1m
Breakthroughs: 3
Current SOTA: 4.6
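
The per-milestone gains and the total improvement are relative changes in MOS. A minimal Python sketch of that arithmetic follows, assuming each gain is measured against the previous milestone and the total against the first. The names relative_gain and step_gains are illustrative, not taken from the page. The inputs are the one-decimal scores displayed above, so the results will not exactly match the listed percentages, which were presumably computed from unrounded scores.

```python
def relative_gain(previous: float, current: float) -> float:
    """Percentage improvement of `current` over `previous`."""
    return (current - previous) / previous * 100.0


def step_gains(entries):
    """Gain of each milestone over the one immediately before it."""
    return [
        (name, relative_gain(prev_score, score))
        for (_, prev_score), (name, score) in zip(entries, entries[1:])
    ]


# Rounded MOS values as displayed in the milestone list (assumed inputs).
milestones = [("VITS", 4.4), ("StyleTTS 2", 4.5), ("VALL-E 2", 4.6)]

for name, gain in step_gains(milestones):
    print(f"{name}: +{gain:.1f}%")

total = relative_gain(milestones[0][1], milestones[-1][1])
print(f"Total improvement: +{total:.1f}%")
```

With these rounded inputs the two steps come out near 2.3% and 2.2%, and the total near 4.5%, which is why the displayed figures (+2.7%, +1.3%, 4.1%) point to more precise underlying scores.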

Top Models Performance Comparison

Top 5 models ranked by mos

Rank. Model: MOS (% of best)
1. VALL-E 2: 4.6 (100.0%)
2. StyleTTS 2: 4.5 (98.7%)
3. VITS: 4.4 (96.1%)
4. Voicebox: 4.3 (93.3%)
5. XTTS v2: 4.2 (91.3%)

Best Score: 4.6
Top Model: VALL-E 2
Models Compared: 5
Score Range: 0.400
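
The ranking, the "% of best" column, and the score range can be reproduced with a few lines of Python. This is a sketch under the assumption that "% of best" is each score divided by the top score, and that the score range is the top score minus the lowest. The function names are hypothetical, and the inputs are the rounded scores shown in the comparison, so the percentages differ slightly from the ones displayed.

```python
def rank_models(scores):
    """Return (rank, model, score, percent_of_best) tuples, best first."""
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, name, score, score / best * 100.0)
        for rank, (name, score) in enumerate(ranked, start=1)
    ]


def score_range(scores):
    """Difference between the highest and lowest score."""
    return max(scores.values()) - min(scores.values())


# Rounded MOS values as displayed in the comparison (assumed inputs).
scores = {
    "VITS": 4.4,
    "VALL-E 2": 4.6,
    "XTTS v2": 4.2,
    "StyleTTS 2": 4.5,
    "Voicebox": 4.3,
}

for rank, name, score, pct in rank_models(scores):
    print(f"{rank}. {name}: {score} ({pct:.1f}% of best)")
print(f"Score range: {score_range(scores):.3f}")
```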
