Recent study: Blind TTS Elo is live. Compare two anonymous voice samples, vote after listening, and help separate real preference signal from noise.
Text-to-speech · measured leaderboard

Only rows measured by CodeSOTA rank here.

The leaderboard excludes vendor, paper, and community MOS. Every row below uses the same harness family, shared hard-text prompts, explicit metrics, and artifact links. Blind Elo is a separate preference study and does not affect this ranking. Models with only Elo audio are listed as a sample pool, not ranked benchmark results.

Separate study · not benchmark-ranked

Active blind Elo sample pool

These seven male-voice systems have the shared prompt audio ready for preference voting. They will not enter the hard-text measured ranking until ASR transcripts, diffs, entity scoring, latency logs, and artifacts are published for that benchmark.

| Model | Vendor | Voice condition | Elo clips | Measured benchmark status |
| --- | --- | --- | --- | --- |
| Gradium TTS (gradium-tts:kent) | Gradium | Kent | 30 | audio ready · scoring pending |
| Kokoro v1.0 (hexgrad/kokoro-82m:am_michael) | Hexgrad | am_michael | 30 | audio ready · scoring pending |
| Speech-02 Turbo (minimax/speech-02-turbo:english-deep-voiced-gentleman) | MiniMax | English_Deep-VoicedGentleman | 30 | audio ready · scoring pending |
| Speech-02 HD (minimax/speech-02-hd:english-deep-voiced-gentleman) | MiniMax | English_Deep-VoicedGentleman | 30 | audio ready · scoring pending |
| Qwen3 TTS (qwen/qwen3-tts:aiden) | Qwen | Aiden | 30 | audio ready · scoring pending |
| Chatterbox Turbo (resemble-ai/chatterbox-turbo:andy) | Resemble AI | Andy | 30 | audio ready · scoring pending |
| ElevenLabs v3 (elevenlabs/v3:james) | ElevenLabs | James | 30 | audio ready · scoring pending |
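The page does not say how blind-vote ratings are computed. As a rough illustration only, the sketch below shows the textbook Elo update for a single pairwise vote. The function names, the K-factor of 32, and the starting rating of 1000 are all assumptions, not CodeSOTA's published method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind vote.

    k=32 is an assumed K-factor; the study may use a different value
    or a different rating model entirely.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


# Hypothetical vote: two systems start level and the listener prefers A.
new_a, new_b = update_elo(1000.0, 1000.0, a_won=True)
print(new_a, new_b)
```

Ratings built this way move by at most `k` points per vote, so with only 30 clips per system a handful of votes can reorder closely matched entries.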
| Rank | Model | Benchmark | Verification | Entity acc. | WER | CER | p95 TTFB | CI | Artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Gradium TTS (audrey · 2026-05-17) | codesota-tts-hardtext-v2 (30 prompts · sha256:hardtext-v2-en-30-prompts) | codesota measured | 73.3% | 13.4% | 6.7% | 299 ms | no CI yet | inspect |
| 2 | Kokoro v1.0 (af_heart · 2026-05-17) | codesota-tts-hardtext-v2 (30 prompts · sha256:hardtext-v2-en-30-prompts) | codesota measured | 66.7% | 15.6% | 6.8% | 2123 ms | no CI yet | inspect |
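The p95 TTFB column is a 95th-percentile time-to-first-byte. The page does not state which percentile method the harness uses, so the sketch below assumes the nearest-rank definition. The function name and the latency samples are invented for illustration.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical TTFB samples in milliseconds, one slow outlier included.
ttfb_ms = [180, 210, 195, 240, 205, 1900, 220, 199, 230, 215]
print(percentile_nearest_rank(ttfb_ms, 95))
```

Under this definition, a run with one latency sample per prompt (an assumption) would take p95 over 30 values from the 29th-fastest sample, so one or two slow requests can dominate the reported number.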
Gradium TTS · evidence packet

tts-hardtext-v2:gradium-audrey:2026-05-17

This packet is meant to answer one narrow question: did the model keep hard text intact enough for downstream use? It is not trying to prove that the voice is pleasant or expressive.

Reference text

The quarterly revenue increased by 17.8 percent to 4.2 million dollars.

ASR transcript

The quarterly revenue increased by 17.8% to $4.2 million.

Observed difference

meaning preserved; percent and dollar amount normalized by ASR

What this evidence says
What this row tests

Can the voice preserve numbers, dates, names, URLs, acronyms, and business-critical wording after ASR transcription?

Current signal

73.3% critical entity accuracy across 30 prompts; 13.4% WER and 6.7% CER.

Example shown here

meaning preserved; percent and dollar amount normalized by ASR

Latency signal

p95 time-to-first-byte was 299 ms on api-eu.

Do not infer

This is not a naturalness, acting, emotion, cloning, audiobook, or preference score.

Auditable files

Prompt set, transcript outputs, latency log, model config, and run config are linked below.

Audit trail
Kokoro v1.0 · evidence packet

tts-hardtext-v2:kokoro-af-heart:2026-05-17

This packet is meant to answer one narrow question: did the model keep hard text intact enough for downstream use? It is not trying to prove that the voice is pleasant or expressive.

Reference text

The quarterly revenue increased by 17.8 percent to 4.2 million dollars.

ASR transcript

The quarterly revenue increased by 17.8% to $4.2 million.

Observed difference

meaning preserved; percent and dollar amount normalized by ASR

What this evidence says
What this row tests

Can the voice preserve numbers, dates, names, URLs, acronyms, and business-critical wording after ASR transcription?

Current signal

66.7% critical entity accuracy across 30 prompts; 15.6% WER and 6.8% CER.

Example shown here

meaning preserved; percent and dollar amount normalized by ASR

Latency signal

p95 time-to-first-byte was 2123 ms on m2-max.

Do not infer

This is not a naturalness, acting, emotion, cloning, audiobook, or preference score.

Auditable files

Prompt set, transcript outputs, latency log, model config, and run config are linked below.

Audit trail
Harness commands
codesota-tts synth --model <id> --eval <track> --out runs/<run_id>
codesota-tts score --run runs/<run_id> --metrics wer,cer,entity,utmos,latency
codesota-tts report --run runs/<run_id> --publish
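The `score` step lists `wer` and `cer` among its metrics. The harness internals are not shown on this page, so the following is only a minimal sketch of the conventional definitions: edit distance between reference and ASR transcript, divided by reference length in words (WER) or characters (CER). All function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from reference
                            curr[j - 1] + 1,      # insert into reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# The reference/transcript pair from the evidence packets, scored without normalization.
reference = "The quarterly revenue increased by 17.8 percent to 4.2 million dollars."
transcript = "The quarterly revenue increased by 17.8% to $4.2 million."
print(f"raw WER={wer(reference, transcript):.3f}  raw CER={cer(reference, transcript):.3f}")
```

Scored on raw strings, this pair shows a nonzero error rate even though the packets describe the meaning as preserved. That gap is the usual reason to normalize both sides before scoring; whether and how the CodeSOTA harness does so is not stated here.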
TTS Eval v2 tracks

| Track | Prompt content | Metrics |
| --- | --- | --- |
| clean-read-en | Harvard-style clean sentences | UTMOS, WER, CER, latency |
| hardtext-en | Numbers, dates, currencies, addresses, acronyms, emails, URLs, product codes | WER, CER, critical entity accuracy, severe errors |
| hardtext-pl | Polish diacritics, dates, currencies, addresses, abbreviations | Polish CER, entity exactness, abbreviation handling |
| longform | 5-15 minute narration/dialogue | voice drift, omission/repetition rate, long-run WER |
| cloning | Speaker preservation with reference audio | speaker similarity, WER |
| controllability | Emotion, speed, pitch, whisper, pauses, delivery style | control adherence and acoustic-channel movement |
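The hardtext tracks score critical entity accuracy, and the evidence packets note that ASR rewrites "17.8 percent" as "17.8%" and "4.2 million dollars" as "$4.2 million". The sketch below shows one way such a check could work: normalize both sides to a spoken form, then test whether each expected entity survives in the transcript. The normalization rules, the entity lists, and the per-entity scoring are assumptions for illustration, not the published scoring procedure.

```python
import re

_MONEY = re.compile(r"\$(\d+(?:\.\d+)?)(?:\s+(thousand|million|billion))?")
_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def _money_to_words(match):
    amount, scale = match.group(1), match.group(2)
    return f"{amount} {scale} dollars" if scale else f"{amount} dollars"


def normalize(text: str) -> str:
    """Lowercase, spell out $ and %, and strip punctuation except decimal points."""
    text = text.lower()
    text = _MONEY.sub(_money_to_words, text)
    text = _PERCENT.sub(r"\1 percent", text)
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", " ", text)  # drop periods that are not decimal points
    text = re.sub(r"[^\w\s.]", " ", text)            # drop remaining punctuation
    return " ".join(text.split())


def contains_entity(transcript: str, entity: str) -> bool:
    """True if the normalized entity appears as a whole-token run in the normalized transcript."""
    return f" {normalize(entity)} " in f" {normalize(transcript)} "


def entity_accuracy(cases) -> float:
    """Fraction of expected entities preserved; cases is a list of (transcript, entities)."""
    total = sum(len(entities) for _, entities in cases)
    if total == 0:
        raise ValueError("no entities to score")
    hits = sum(contains_entity(t, e) for t, entities in cases for e in entities)
    return hits / total


# First case is the packet example; the second is a made-up miss.
cases = [
    ("The quarterly revenue increased by 17.8% to $4.2 million.",
     ["17.8 percent", "4.2 million dollars"]),
    ("Call us before March 3rd.", ["March 13th"]),
]
print(f"{entity_accuracy(cases):.3f}")
```

A real scorer would need many more rules (dates, ordinals, URLs, product codes) and a decision on whether accuracy is counted per entity or per prompt; this sketch counts per entity.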