Who leads the LJ Speech benchmark?

VALL-E 2 currently leads LJ Speech with a MOS of 4.61.

What is the state-of-the-art score on LJ Speech?

The state-of-the-art result on LJ Speech is a MOS of 4.61, achieved by VALL-E 2 as of 2026.

How many models are tracked on LJ Speech?

Codesota tracks 12 models on LJ Speech.

When was the LJ Speech leaderboard last updated?

The LJ Speech leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2021.



LJ Speech.

Name: LJ Speech Benchmark Results
Creator: Unknown
Published: 2021-01-01
License: https://creativecommons.org/licenses/by/4.0/

13,100 short audio clips of a single speaker reading passages from non-fiction books. Standard benchmark for single-speaker TTS.


§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


MOS

MOS (mean opinion score) is the reported evaluation metric for LJ Speech. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
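As a rough illustration of how a MOS figure such as "4.56 ±0.13" can arise, the sketch below averages listener ratings on the 1–5 scale and attaches a normal-approximation confidence interval. This is an assumption-laden sketch, not the procedure of any paper listed here: the function name `mos_with_ci`, the sample ratings, and the choice of a 95% interval (z = 1.96) are all hypothetical, and each source paper may define its ± value differently.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, CI half-width) for 1-5 listener ratings.

    Assumption: the interval is a normal approximation, z * s / sqrt(n).
    Individual papers may use a different interval or protocol.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings from eight listeners for one synthesized clip.
ratings = [5, 4, 5, 4, 4, 5, 3, 5]
mos, ci = mos_with_ci(ratings)
print(f"MOS {mos:.2f} ±{ci:.2f}")
```

Because listener pools and test conditions differ between studies, scores from separate evaluations are only loosely comparable, which is why several rows also quote the human ground-truth score from the same evaluation.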

Trust tiers for MOS: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Notes	Trust	Score	Year
01	VALL-E 2	MOS (1–5). Human parity: CMOS +0.17 above ground truth. Source: Table 1, arXiv:2406.05370 (Jun 2024).	verified	4.61	2026
02	NaturalSpeech	MOS 4.56 ±0.13 on LJSpeech. Human GT = 4.58 ±0.13; difference not statistically significant (p>0.05, Wilcoxon). First TTS system to achieve human-level quality on LJSpeech. IEEE TASLP 2024 (arXiv:2205.04421, Table 2).	paper	4.56	2026
03	StyleTTS2	MOS (1–5). Surpasses human baseline (4.44 MOS). Source: Table 2, arXiv:2306.07279 (NeurIPS 2023).	paper	4.55	2026
04	StyleTTS 2	MOS (1–5). Surpasses human baseline (4.44 MOS). Source: Table 2, arXiv:2306.07279 (NeurIPS 2023).	verified	4.55	2023
05	VITS	MOS (1–5). VITS end-to-end TTS. Source: Table 2, arXiv:2106.06103 (ICML 2021).	verified	4.43	2021
06	Grad-TTS + HiFi-GAN	MOS 4.37 ±0.13 on LJSpeech. From NaturalSpeech paper (arXiv:2205.04421, Table 4). Human GT = 4.58 in same evaluation.	paper	4.37	2026
07	Glow-TTS + HiFi-GAN	MOS 4.34 ±0.13 on LJSpeech. From NaturalSpeech paper (arXiv:2205.04421, Table 4). Human GT = 4.58 in same evaluation.	paper	4.34	2026
08	FastSpeech2 + HiFi-GAN	MOS 4.32 ±0.15 on LJSpeech. From NaturalSpeech paper (arXiv:2205.04421, Table 4). Human GT = 4.58 in same evaluation.	paper	4.32	2026
09	Voicebox	MOS (1–5). Voicebox single-speaker on LJ Speech. Source: Table 1, arXiv:2306.15687 (NeurIPS 2023).	verified	4.30	2026
10	XTTS v2	MOS (1–5). XTTS v2 evaluated on LJ Speech. Source: arXiv:2304.01196 evaluation.	verified	4.21	2026
11	Matcha-TTS	MOS 3.84 ±0.08 on LJSpeech, 10 ODE solver steps (best variant). Vocoded reference = 4.13 in same evaluation. ICASSP 2024 (arXiv:2309.03199, Table 1). Flow-matching architecture; significantly outperforms Grad-TTS.	paper	3.84	2026
12	JETS	MOS 3.57 ±0.09 on LJSpeech (in-distribution). From StyleTTS2 paper (NeurIPS 2023, arXiv:2306.07691, Table 2). Human GT = 3.81 in same evaluation.	paper	3.57	2026
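The muting rule stated before the table can be read as a filter over (score, year) pairs: a row is muted when some other row from the same or an earlier year has a strictly higher score. The sketch below applies that reading to the rows above to recover a year-over-year SOTA history. It is a minimal illustration, not Codesota's implementation: `Row`, `is_muted`, and `sota_history` are hypothetical names, and it assumes the Year column is the date the rule compares against.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float  # MOS, higher is better
    year: int     # Year column from the leaderboard table


# Scores and years copied from the leaderboard rows above.
ROWS = [
    Row("VALL-E 2", 4.61, 2026),
    Row("NaturalSpeech", 4.56, 2026),
    Row("StyleTTS2", 4.55, 2026),
    Row("StyleTTS 2", 4.55, 2023),
    Row("VITS", 4.43, 2021),
    Row("Grad-TTS + HiFi-GAN", 4.37, 2026),
    Row("Glow-TTS + HiFi-GAN", 4.34, 2026),
    Row("FastSpeech2 + HiFi-GAN", 4.32, 2026),
    Row("Voicebox", 4.30, 2026),
    Row("XTTS v2", 4.21, 2026),
    Row("Matcha-TTS", 3.84, 2026),
    Row("JETS", 3.57, 2026),
]


def is_muted(row, rows):
    """True if an earlier or same-year row already scored strictly better."""
    return any(o.year <= row.year and o.score > row.score for o in rows)


def sota_history(rows):
    """Rows that were state of the art when published, oldest first."""
    return sorted((r for r in rows if not is_muted(r, rows)), key=lambda r: r.year)


for r in sota_history(ROWS):
    print(r.year, r.model, r.score)
```

Under this reading, three rows stay unmuted: VITS (2021, 4.43), StyleTTS 2 (2023, 4.55), and VALL-E 2 (2026, 4.61). Every other row is dated 2026 and scores below VALL-E 2, so it is muted.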

§ 03 · Lineage

LJ Speech in context.


None — this is where the lineage begins.

This benchmark (1)

saturating · 2017-07

LJ Speech

Successors (1)

active · 2019-11

VCTK

TTS evaluation moved from clean single-speaker synthesis to multi-speaker and accent variation.
