Who leads the VCTK benchmark?

NaturalSpeech 3 currently leads VCTK with a MOS of 4.36.

What is the state-of-the-art score on VCTK?

The state-of-the-art result on VCTK is a MOS of 4.36, achieved by NaturalSpeech 3 as of 2026.

How many models are tracked on VCTK?

Codesota tracks 10 models on VCTK across 2 metrics.

When was the VCTK leaderboard last updated?

The VCTK leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2022.


VCTK.

Name: VCTK Benchmark Results
Creator: Unknown
Published: 2022-01-01
License: https://creativecommons.org/licenses/by/4.0/

Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.


§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


mos

MOS (mean opinion score, rated 1–5) is one of the two evaluation metrics reported for VCTK. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for mos: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Notes	Trust	Score	Year
01	NaturalSpeech 3	MOS (1–5). Zero-shot VCTK evaluation. Source: Table 3, arxiv:2403.03100 (2024)	verified	4.36	2024
02	Ground Truth (VCTK)	Human recordings from VCTK test set. Reported in YourTTS (Casanova et al., ICML 2022), Table 1.	verified	4.26	2022
03	VITS	MOS (1–5). VITS multispeaker on VCTK. Source: Table 2, arxiv:2106.06103 (ICML 2021)	verified	4.21	2026
04	StyleTTS 2	MOS (1–5). StyleTTS 2 multispeaker on VCTK. Source: Table 3, arxiv:2306.07279 (NeurIPS 2023)	verified	4.19	2023
05	StyleTTS2	MOS (1–5). StyleTTS 2 multispeaker on VCTK. Source: Table 3, arxiv:2306.07279 (NeurIPS 2023)	verified	4.19	2023
06	VALL-E 2	MOS (1–5). Zero-shot multi-speaker on VCTK. Source: Table 1, arxiv:2406.05370 (Jun 2024)	verified	4.18	2024
07	XTTS v2	MOS (1–5). XTTS v2 zero-shot on VCTK speakers. Source: arxiv:2304.01196	verified	4.14	2023
08	YourTTS	MOS (1–5). YourTTS zero-shot on VCTK. Source: Table 2, arxiv:2202.04053 (ICML 2022)	verified	4.07	2022
09	SC-GlowTTS	Multi-speaker GlowTTS baseline. Reported in YourTTS (Casanova et al., ICML 2022), Table 1.	verified	3.78	2022
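
The muted-row rule stated above the mos table can be sketched in a few lines of Python. This is one reading of that rule, not Codesota's actual implementation: the names `Row`, `is_muted`, and `sota_when_published` are hypothetical, the Year column is used exactly as listed, "scored better" is taken to mean strictly higher, and the human-recording reference row is assumed to compete with model rows.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row: model name, MOS, and the Year column as listed."""
    model: str
    score: float
    year: int


# Score and Year values copied from the mos table above.
MOS_ROWS = [
    Row("NaturalSpeech 3", 4.36, 2024),
    Row("Ground Truth (VCTK)", 4.26, 2022),
    Row("VITS", 4.21, 2026),
    Row("StyleTTS 2", 4.19, 2023),
    Row("StyleTTS2", 4.19, 2023),
    Row("VALL-E 2", 4.18, 2024),
    Row("XTTS v2", 4.14, 2023),
    Row("YourTTS", 4.07, 2022),
    Row("SC-GlowTTS", 3.78, 2022),
]


def is_muted(row: Row, rows: list[Row]) -> bool:
    """True if an earlier or same-year row already scored strictly better."""
    return any(o.year <= row.year and o.score > row.score for o in rows)


def sota_when_published(rows: list[Row]) -> list[Row]:
    """Rows that were not beaten by any earlier or same-year result."""
    return [r for r in rows if not is_muted(r, rows)]


if __name__ == "__main__":
    for r in sota_when_published(MOS_ROWS):
        print(r.model, r.score, r.year)
```

Under these assumptions only the NaturalSpeech 3 and Ground Truth (VCTK) rows remain unmuted. Excluding the human-recording reference from the comparison would leave more model rows unmuted, so the result depends on that choice.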

sim-score

Sim-score (speaker-similarity MOS, also written Sim-MOS) is the second evaluation metric reported for VCTK. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for sim-score: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Notes	Trust	Score	Year
01	Ground Truth (VCTK)	Sim-MOS for human recordings, VCTK test set. Reported in YourTTS (Casanova et al., ICML 2022), Table 1.	verified	4.19	2022
02	YourTTS	Sim-MOS on VCTK test set (Exp 1 monolingual) ±0.05. Casanova et al., ICML 2022.	verified	4.16	2022
03	SC-GlowTTS	Sim-MOS on VCTK test set. Reported in YourTTS (Casanova et al., ICML 2022), Table 1.	verified	3.99	2022
04	VITS2	Speaker similarity MOS on VCTK multi-speaker test set ±0.08. Kong et al., Interspeech 2023.	verified	3.99	2023
05	VITS	Speaker similarity MOS on VCTK multi-speaker test set ±0.09. Kong et al., Interspeech 2023 (VITS2 paper, Table 2b).	verified	3.79	2023

Lineage

VCTK in context.

See full text-to-speech benchmarks lineage →

Predecessors (1)

saturating · 2017-07

LJ Speech

TTS evaluation moved from clean single-speaker synthesis to multi-speaker and accent variation.

This benchmark (1)

active · 2019-11

VCTK

Successors (1)

active · 2024-06

Seed-TTS-Eval

Model quality improved enough that basic corpora no longer exposed enough failure modes; harder text, similarity, and robustness became more important.
