Text-to-speech · measured leaderboard

Only rows measured by CodeSOTA are ranked here.

The leaderboard excludes MOS figures reported by vendors, papers, or the community. Every row below uses the same harness family, shared hard-text prompts, explicit metrics, and artifact links. Blind Elo is a separate preference study and does not affect this ranking. Models that have only Elo audio are listed as a sample pool, not as ranked benchmark results.


Separate study · not benchmark-ranked

Active blind Elo sample pool

The eleven male-voice entries below, from seven vendors, have the shared prompt audio ready for preference voting. They will not enter the hard-text measured ranking until ASR transcripts, diffs, entity scoring, latency logs, and artifacts are published for that benchmark.


Model	Vendor	Voice condition	Elo clips	Measured benchmark status
Gradium TTS gradium-tts:kent	Gradium	Kent	30	audio ready · scoring pending
Gradium TTS gradium-tts:damon	Gradium	Damon	30	audio ready · scoring pending
Gradium TTS gradium-tts:russell	Gradium	Russell	30	audio ready · scoring pending
Kokoro v1.0 hexgrad/kokoro-82m:am_michael	Hexgrad	am_michael	30	audio ready · scoring pending
Speech-02 Turbo minimax/speech-02-turbo:english-deep-voiced-gentleman	MiniMax	English_Deep-VoicedGentleman	30	audio ready · scoring pending
Speech-02 HD minimax/speech-02-hd:english-deep-voiced-gentleman	MiniMax	English_Deep-VoicedGentleman	30	audio ready · scoring pending
Qwen3 TTS qwen/qwen3-tts:aiden	Qwen	Aiden	30	audio ready · scoring pending
Chatterbox Turbo resemble-ai/chatterbox-turbo:andy	Resemble AI	Andy	30	audio ready · scoring pending
Chatterbox Turbo resemble-ai/chatterbox-turbo	Resemble AI	default study voice	30	audio ready · scoring pending
ElevenLabs v3 elevenlabs/v3:james	ElevenLabs	James	30	audio ready · scoring pending
XTTS v2 coqui/xtts-v2:damien-black	Coqui	Damien Black	30	audio ready · scoring pending
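
For readers unfamiliar with Elo, the sketch below shows the standard rating update applied to one blind pairwise vote. The K-factor of 32, the starting rating of 1500, and the function names are illustrative assumptions; the study's actual parameters are not stated on this page.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Assumed setup: two systems start level and a listener prefers clip A.
new_a, new_b = update_elo(1500.0, 1500.0, a_won=True)
```

Because the update depends only on which clip was preferred, ratings of this kind say nothing about hard-text accuracy, which is why the pool is kept out of the measured ranking.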

Rank	Model	Benchmark	Verification	Entity acc.	WER	CER	p95 TTFB	CI	Artifacts
1	Gradium TTS audrey · 2026-05-17	codesota-tts-hardtext-v2 30 prompts · sha256:hardtext-v2-en-30-prompts	codesota measured	73.3%	13.4%	6.7%	299 ms	no CI yet	inspect
2	Kokoro v1.0 af_heart · 2026-05-17	codesota-tts-hardtext-v2 30 prompts · sha256:hardtext-v2-en-30-prompts	codesota measured	66.7%	15.6%	6.8%	2123 ms	no CI yet	inspect
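
Both rows show "no CI yet". As a rough illustration of what an interval on a 30-prompt score could look like, the sketch below computes a 95% Wilson score interval. It assumes entity accuracy is scored once per prompt, so that 73.3% corresponds to 22 of 30; that assumption, and the choice of a Wilson interval rather than a bootstrap, are not taken from the harness.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumption: one pass/fail entity judgement per prompt, 22 of 30 passed.
low, high = wilson_interval(22, 30)
print(f"{22 / 30:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Under the same per-prompt assumption, the interval for 20 of 30 overlaps the one for 22 of 30, which is one reason an interval matters at this sample size.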

Gradium TTS · evidence packet

tts-hardtext-v2:gradium-audrey:2026-05-17

This packet is meant to answer one narrow question: did the model keep hard text intact enough for downstream use? It is not trying to prove that the voice is pleasant or expressive.


Reference text

The quarterly revenue increased by 17.8 percent to 4.2 million dollars.

ASR transcript

The quarterly revenue increased by 17.8% to $4.2 million.

Observed difference

meaning preserved; percent and dollar amount normalized by ASR
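
The two strings above differ only in surface form. A scorer would usually normalize both sides before comparing them so that such rewrites are not counted as errors. The sketch below is a minimal, assumed normalizer that covers only the percent and dollar patterns in this one sample; the rules the harness actually applies are not documented on this page.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, spell out '%' and '$' amounts, and strip sentence punctuation."""
    t = text.lower()
    # "$4.2 million" -> "4.2 million dollars"
    t = re.sub(
        r"\$(\d+(?:\.\d+)?)(\s+(?:thousand|million|billion))?",
        lambda m: f"{m.group(1)}{m.group(2) or ''} dollars",
        t,
    )
    # "17.8%" -> "17.8 percent"
    t = re.sub(r"(\d)\s*%", r"\1 percent", t)
    # Remove periods unless they sit between two digits (decimal points).
    t = re.sub(r"(?<!\d)\.|\.(?!\d)", " ", t)
    # Remove any remaining punctuation.
    t = re.sub(r"[^\w\s.]", " ", t)
    return " ".join(t.split())


reference = "The quarterly revenue increased by 17.8 percent to 4.2 million dollars."
transcript = "The quarterly revenue increased by 17.8% to $4.2 million."
same = normalize(reference) == normalize(transcript)
```

With these rules the reference and the transcript reduce to the same token sequence, which matches the "meaning preserved" judgement for this sample.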

What this evidence says

What this row tests

Can the voice preserve numbers, dates, names, URLs, acronyms, and business-critical wording after ASR transcription?

Current signal

73.3% critical entity accuracy across 30 prompts; 13.4% WER and 6.7% CER.
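
For reference, WER and CER are edit-distance rates over words and characters respectively. The sketch below implements the textbook definitions; whether the harness normalizes text first, or how it aggregates across the 30 prompts, is not specified here, so the helper names and the example sentence are illustrative only.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of five gives a WER of 0.2.
print(wer("send it to gate seventeen", "send it to gate seventy"))
```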

Example shown here

meaning preserved; percent and dollar amount normalized by ASR

Latency signal

p95 time-to-first-byte was 299 ms on api-eu.
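
A p95 figure is the value at or below which 95% of the measured first-byte times fall. The sketch below uses the nearest-rank method on a list of millisecond samples; the sample values are made up for illustration, and the harness may use a different percentile definition or log schema.

```python
import math


def percentile_nearest_rank(samples, pct: float) -> float:
    """Smallest sample with at least pct percent of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical TTFB samples in milliseconds, not taken from the published latency log.
ttfb_ms = [210, 225, 198, 240, 260, 231, 219, 305, 222, 215]
p95 = percentile_nearest_rank(ttfb_ms, 95)
```

With only ten samples the nearest-rank p95 is simply the slowest request, so a tail figure like this is sensitive to how many requests were logged.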

Do not infer

This is not a naturalness, acting, emotion, cloning, audiobook, or preference score.

Auditable files

Prompt set, transcript outputs, latency log, model config, and run config are linked below.

Audit trail

Artifact	Path or hash
Run config	/data/tts-intelligibility/runs/gradium-audrey-v1/run_config.json
Prompt manifest	/data/tts-intelligibility/runs/gradium-audrey-v1/prompts.json
Transcript file	/data/tts-intelligibility/runs/gradium-audrey-v1/asr_transcripts.json
Latency log	/data/tts-intelligibility/runs/gradium-audrey-v1/latency.json
Audio manifest	/data/tts-intelligibility/runs/gradium-audrey-v1/audio_manifest.json
Audio hashes	sha256:gradium-audrey-v1-manifest
Model config	/data/tts-intelligibility/runs/gradium-audrey-v1/model_config.json
Eval hash	sha256:hardtext-v2-en-30-prompts
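
The hash fields are there so a reader can confirm that a downloaded file matches what was scored. A minimal sketch of that check follows, assuming the published value is a hex SHA-256 digest of the file bytes. The identifiers shown in this packet look like labels rather than hex digests, so the exact convention is an assumption, as are the function names.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 and return 'sha256:<hex digest>'."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large audio files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def matches(path: Path, expected: str) -> bool:
    """Compare a local file against a published 'sha256:<hex>' value."""
    return sha256_of_file(path) == expected.strip().lower()
```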

Kokoro v1.0 · evidence packet

tts-hardtext-v2:kokoro-af-heart:2026-05-17

This packet is meant to answer one narrow question: did the model keep hard text intact enough for downstream use? It is not trying to prove that the voice is pleasant or expressive.


Reference text

The quarterly revenue increased by 17.8 percent to 4.2 million dollars.

ASR transcript

The quarterly revenue increased by 17.8% to $4.2 million.

Observed difference

meaning preserved; percent and dollar amount normalized by ASR

What this evidence says

What this row tests

Can the voice preserve numbers, dates, names, URLs, acronyms, and business-critical wording after ASR transcription?
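
One simple way to score this question is to check that each critical entity from the reference survives in the transcript. The sketch below does that with exact substring matching on lowercased text. The entity lists, the matching rule, and the per-prompt pass criterion are all assumptions, not the harness's published scoring.

```python
def entity_accuracy(items) -> float:
    """Fraction of prompts whose critical entities all appear in the transcript.

    items: iterable of (transcript, entities) pairs, where entities are the
    strings that must survive verbatim after lowercasing.
    """
    items = list(items)
    if not items:
        raise ValueError("no prompts to score")
    passed = 0
    for transcript, entities in items:
        text = transcript.lower()
        if all(entity.lower() in text for entity in entities):
            passed += 1
    return passed / len(items)


# Hypothetical prompts: the second transcript drops a digit from the order code.
sample = [
    ("Invoice due on March 3rd for 4.2 million.", ["march 3rd", "4.2 million"]),
    ("Your order code is AX-19.", ["ax-194"]),
]
score = entity_accuracy(sample)  # 1 of 2 prompts pass
```

A stricter scorer could count individual entities instead of whole prompts; the two choices give different percentages on the same transcripts.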

Current signal

66.7% critical entity accuracy across 30 prompts; 15.6% WER and 6.8% CER.

Example shown here

meaning preserved; percent and dollar amount normalized by ASR

Latency signal

p95 time-to-first-byte was 2123 ms on m2-max.

Do not infer

This is not a naturalness, acting, emotion, cloning, audiobook, or preference score.

Auditable files

Prompt set, transcript outputs, latency log, model config, and run config are linked below.

Audit trail

Artifact	Path or hash
Run config	/data/tts-intelligibility/runs/kokoro-af-heart-v1/run_config.json
Prompt manifest	/data/tts-intelligibility/runs/kokoro-af-heart-v1/prompts.json
Transcript file	/data/tts-intelligibility/runs/kokoro-af-heart-v1/asr_transcripts.json
Latency log	/data/tts-intelligibility/runs/kokoro-af-heart-v1/latency.json
Audio manifest	/data/tts-intelligibility/runs/kokoro-af-heart-v1/audio_manifest.json
Audio hashes	sha256:kokoro-af-heart-v1-manifest
Model config	/data/tts-intelligibility/runs/kokoro-af-heart-v1/model_config.json
Eval hash	sha256:hardtext-v2-en-30-prompts

Harness commands

codesota-tts synth --model <id> --eval <track> --out runs/<run_id>
codesota-tts score --run runs/<run_id> --metrics wer,cer,entity,utmos,latency
codesota-tts report --run runs/<run_id> --publish
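
The three commands run in sequence for each model. A small wrapper such as the sketch below can assemble them for one run. It uses only the flags shown above and assumes nothing else about the `codesota-tts` CLI; the function name and the example argument values are hypothetical.

```python
import shlex


def harness_commands(model_id: str, track: str, run_id: str):
    """Build the synth, score, and report command lines for one run."""
    run_dir = f"runs/{run_id}"
    return [
        ["codesota-tts", "synth", "--model", model_id, "--eval", track, "--out", run_dir],
        ["codesota-tts", "score", "--run", run_dir,
         "--metrics", "wer,cer,entity,utmos,latency"],
        ["codesota-tts", "report", "--run", run_dir, "--publish"],
    ]


# Placeholder values; pass each list to subprocess.run(cmd, check=True) to execute it.
for cmd in harness_commands("my-model", "hardtext-en", "my-run-v1"):
    print(shlex.join(cmd))
```

Keeping each command as an argument list avoids shell quoting problems when a model id or run id contains unusual characters.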

TTS Eval v2 tracks

Track	Prompt content	Metrics
clean-read-en	Harvard-style clean sentences	UTMOS, WER, CER, latency
hardtext-en	Numbers, dates, currencies, addresses, acronyms, emails, URLs, product codes	WER, CER, critical entity accuracy, severe errors
hardtext-pl	Polish diacritics, dates, currencies, addresses, abbreviations	Polish CER, entity exactness, abbreviation handling
longform	5-15 minute narration/dialogue	voice drift, omission/repetition rate, long-run WER
cloning	Speaker preservation with reference audio	speaker similarity, WER
controllability	Emotion, speed, pitch, whisper, pauses, delivery style	control adherence and acoustic-channel movement