Codesota · Benchmark · LJ Speech

LJ Speech.

LJ Speech contains 13,100 short audio clips of a single speaker reading passages from non-fiction books. It is a standard benchmark for single-speaker text-to-speech (TTS).

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

MOS

MOS (mean opinion score, rated on a 1–5 scale) is the reported evaluation metric for LJ Speech. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
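
Several entries on this page report MOS with a ± interval. As a minimal sketch of how such a figure is commonly derived, the snippet below averages 1–5 listener ratings and attaches a normal-approximation 95% confidence interval. The function name `mos_with_ci` and the ratings are hypothetical, and the individual papers may compute their intervals differently.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width) of a normal-approximation confidence interval.

    `ratings` are individual listener scores on the 1-5 scale.
    z=1.96 corresponds to a 95% interval under the normal approximation.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings, for illustration only (not taken from any paper).
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS {mos:.2f} ±{ci:.2f}")
```

With only a handful of ratings the interval is wide, so a difference of a few hundredths between two systems is not meaningful unless each score is backed by many ratings.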

Trust tiers for MOS: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | VALL-E 2 | verified | 4.61 | 2026 | MOS (1–5). Human parity: CMOS +0.17 above ground truth. Source: Table 1, arXiv:2406.05370 (Jun 2024). |
| 02 | NaturalSpeech | paper | 4.56 | 2026 | MOS 4.56 ±0.13 on LJ Speech. Human GT = 4.58 ±0.13; difference not statistically significant (p > 0.05, Wilcoxon). First TTS system to achieve human-level quality on LJ Speech. IEEE TASLP 2024 (arXiv:2205.04421, Table 2). |
| 03 | StyleTTS 2 | paper | 4.55 | 2026 | MOS (1–5). Surpasses human baseline (4.44 MOS). Source: Table 2, arXiv:2306.07691 (NeurIPS 2023). |
| 04 | StyleTTS 2 | verified | 4.55 | 2023 | MOS (1–5). Surpasses human baseline (4.44 MOS). Source: Table 2, arXiv:2306.07691 (NeurIPS 2023). |
| 05 | VITS | verified | 4.43 | 2021 | MOS (1–5). VITS end-to-end TTS. Source: Table 2, arXiv:2106.06103 (ICML 2021). |
| 06 | Grad-TTS + HiFi-GAN | paper | 4.37 | 2026 | MOS 4.37 ±0.13 on LJ Speech. From the NaturalSpeech paper (arXiv:2205.04421, Table 4). Human GT = 4.58 in the same evaluation. |
| 07 | Glow-TTS + HiFi-GAN | paper | 4.34 | 2026 | MOS 4.34 ±0.13 on LJ Speech. From the NaturalSpeech paper (arXiv:2205.04421, Table 4). Human GT = 4.58 in the same evaluation. |
| 08 | FastSpeech2 + HiFi-GAN | paper | 4.32 | 2026 | MOS 4.32 ±0.15 on LJ Speech. From the NaturalSpeech paper (arXiv:2205.04421, Table 4). Human GT = 4.58 in the same evaluation. |
| 09 | Voicebox | verified | 4.30 | 2026 | MOS (1–5). Voicebox single-speaker on LJ Speech. Source: Table 1, arXiv:2306.15687 (NeurIPS 2023). |
| 10 | XTTS v2 | verified | 4.21 | 2026 | MOS (1–5). XTTS v2 evaluated on LJ Speech. Source: arXiv:2304.01196 evaluation. |
| 11 | Matcha-TTS | paper | 3.84 | 2026 | MOS 3.84 ±0.08 on LJ Speech, 10 ODE solver steps (best variant). Vocoded reference = 4.13 in the same evaluation. ICASSP 2024 (arXiv:2309.03199, Table 1). Flow-matching architecture; significantly outperforms Grad-TTS. |
| 12 | JETS | paper | 3.57 | 2026 | MOS 3.57 ±0.09 on LJ Speech (in-distribution). From the StyleTTS 2 paper (NeurIPS 2023, arXiv:2306.07691, Table 2). Human GT = 3.81 in the same evaluation. |
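
The leaderboard notes show that each listening test has its own reference score (human ground truth of 4.58 in one evaluation and 3.81 in another, and a vocoded reference of 4.13 in a third), so raw MOS values from different papers are not directly comparable. One hedged way to read the entries is to look at each model's gap to the reference from its own evaluation. The sketch below does that arithmetic for the entries whose notes state such a reference; the helper name `gap_to_reference` is hypothetical, and this is not Codesota's ranking method.

```python
# (model, model MOS, reference MOS from the same listening test, reference kind)
# All values are copied from the leaderboard notes on this page.
ENTRIES = [
    ("NaturalSpeech", 4.56, 4.58, "human GT"),
    ("Grad-TTS + HiFi-GAN", 4.37, 4.58, "human GT"),
    ("Glow-TTS + HiFi-GAN", 4.34, 4.58, "human GT"),
    ("FastSpeech2 + HiFi-GAN", 4.32, 4.58, "human GT"),
    ("Matcha-TTS", 3.84, 4.13, "vocoded reference"),
    ("JETS", 3.57, 3.81, "human GT"),
]


def gap_to_reference(entries):
    """Return (model, gap) pairs sorted from smallest shortfall to largest.

    gap = model MOS - reference MOS, rounded to two decimals.
    A gap near zero means the model is close to its own test's reference.
    """
    gaps = [(name, round(score - ref, 2)) for name, score, ref, _kind in entries]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


ranked = gap_to_reference(ENTRIES)
for name, gap in ranked:
    print(f"{name:24s} {gap:+.2f}")
```

Note that the Matcha-TTS reference is vocoded audio rather than raw human recordings, so its gap is measured against a different kind of baseline than the others.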
§ 03 · Lineage

LJ Speech in context.

None — this is where the lineage begins.
This benchmark (1)
saturating · 2017-07
LJ Speech