CodeSOTA research note · text-to-speech preference modeling

From human Elo to a small TTS judge

The workflow starts with scarce but high-value human votes. Those votes train a small pairwise judge. The judge estimates win rates for models that have not yet been in the listening test. Bradley-Terry turns those win rates back into an Elo-like ranking, and the next human votes are used to re-check the machine estimates where they are uncertain or surprising.

Scientific loop

A benchmark that can grow without pretending automation is truth

The claim is not that WavLM features replace listeners. The claim is narrower and more useful: a small judge can turn a few hundred pairwise votes into a map of likely winners, likely losers, and high-value comparisons that deserve fresh human attention. It starts from one atom — a blind A/B listening test, two voices reading the same line, a listener picking the one they would ship.

1. Seed: Human Elo

Blind same-prompt votes define the target: which clip a listener actually prefers.

2. Learn: Small Judge

A lightweight pairwise model learns preference probabilities from speech embeddings and acoustic deltas.

3. Scale: Predicted Elo

The judge scores new model pairs, then Bradley-Terry converts predicted win rates into an Elo-like scale.

4. Audit: Fresh Votes

New human comparisons are held out after training to test calibration and find hard cases.

Baseline run

What was trained

The unit of prediction is a pair, not a voice in isolation. Every vote becomes two training rows: the observed winner over loser and the mirrored loser over winner. That makes the model learn a probability surface for comparisons, which is the right shape for Elo and Bradley-Terry.
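A minimal sketch of the mirroring step, assuming each clip is already a feature vector. The function name `mirror_vote` and the toy three-dimensional vectors are illustrative, not taken from the study code.

```python
from typing import List, Sequence, Tuple

Row = Tuple[List[float], int]  # (feature delta, label)


def mirror_vote(winner: Sequence[float], loser: Sequence[float]) -> List[Row]:
    """Turn one human vote into two ordered training rows."""
    forward = [w - l for w, l in zip(winner, loser)]   # winner minus loser
    backward = [-d for d in forward]                    # loser minus winner
    # Label 1: the first clip of the ordered pair won the vote.
    # Label 0: the first clip of the ordered pair lost the vote.
    return [(forward, 1), (backward, 0)]


# Toy vectors (hypothetical); the real rows are 776-dimensional.
rows = mirror_vote(winner=[0.2, 1.0, -0.5], loser=[0.1, 0.4, 0.5])
```

With 147 votes this yields the 294 ordered pairs quoted for the baseline run, and the two labels stay exactly balanced by construction.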

Human votes: 147 (blind same-prompt comparisons)
Audio samples: 204 (embedded clips)
Feature width: 776 (WavLM + acoustic deltas)
Trained: May 22, 2026 (microsoft/wavlm-base-plus)

1. Human pairs: blind same-prompt A/B votes
2. Feature bank: WavLM pooled vectors + acoustic deltas
3. Pair model: P(A beats B | features)
4. Tournament: predicted win matrix
5. BT / Elo: ratings and uncertainty queue

Feature contrast: δ = x_A − x_B

Embedding delta, duration delta, loudness delta, and spectral deltas for clips on the same prompt.

Preference probability: P(A beats B) = σ(w · δ)

A calibrated logistic score: the model estimates the chance that A wins the human vote.

Readable rating: BT / Elo

Predicted win rates are solved into a Bradley-Terry scale and displayed as Elo-like points.

Inside the judge

How a vote becomes a score

The full path from two audio clips to one rating — a few hundred human votes distilled into one number per voice, then used to score models that were never in the listening test. The only learned step is the logistic model in the middle; everything around it is fixed signal processing and a closed-form rating fit. That is deliberate — a small, legible model is easier to trust and recalibrate than an end-to-end black box.

01 · Two clips: same prompt, A and B
16 kHz mono, loudness-matched to −21 LUFS

02 · WavLM encoder: microsoft/wavlm-base-plus
Mean-pooled hidden states → 768-dim embedding per clip

03 · + acoustic: 8 hand features
Duration, RMS, peak, ZCR, spectral centroid/bandwidth/flatness, silence

04 · Pair vector: A minus B
Concatenated embedding & acoustic differences → 776-dim

05 · Logistic model: P(A beats B)
Trained on 147 human votes, each mirrored → 294 ordered pairs

06 · Bradley–Terry: win matrix → rating
Predicted pairwise wins fit to a 1500-centered Elo scale
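Steps 04 and 05 can be sketched as follows. This is a toy illustration, not the study code: the vectors are 3 + 2 dimensional instead of 768 + 8, the weights are invented, and the model is assumed to have no bias term (consistent with the σ(w · δ) form given earlier).

```python
import math


def pair_vector(emb_a, acoustic_a, emb_b, acoustic_b):
    """Concatenate the embedding difference and the acoustic difference (A minus B)."""
    emb_delta = [a - b for a, b in zip(emb_a, emb_b)]
    acoustic_delta = [a - b for a, b in zip(acoustic_a, acoustic_b)]
    return emb_delta + acoustic_delta


def p_a_beats_b(weights, delta):
    """Logistic preference probability: sigmoid of the weighted feature contrast."""
    score = sum(w * d for w, d in zip(weights, delta))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical clip features and weights, for illustration only.
emb_a, ac_a = [0.3, -0.1, 0.8], [2.1, 0.05]
emb_b, ac_b = [0.1, 0.2, 0.5], [2.4, 0.07]
weights = [1.5, -0.4, 0.9, -0.2, 3.0]

delta_ab = pair_vector(emb_a, ac_a, emb_b, ac_b)
delta_ba = pair_vector(emb_b, ac_b, emb_a, ac_a)
p_ab = p_a_beats_b(weights, delta_ab)
p_ba = p_a_beats_b(weights, delta_ba)
```

Because the input is a difference and there is no bias, swapping the clips negates the score, so `p_ab + p_ba` equals 1. That antisymmetry is what makes the output usable as a win matrix.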

Which layer? — a pitfall we tested

A subtle but real trap: self-supervised speech models like WavLM specialise by depth. Middle layers (~6–9) encode speaker identity, timbre, and prosody; the top layers drift toward phonetic content. Reaching for the last layer for every task is a common mistake that can pull a metric away from what listeners actually perceive. So we did not assume — we rebuilt the judge on every transformer layer and measured held-out AUC.

[Bar chart: held-out AUC for WavLM transformer layers 1–12; layer 12 is highlighted at 0.75.]

Leave-prompt-group-out AUC of the judge rebuilt on each of WavLM's 12 transformer layers. The last layer (highlighted) wins for this preference task — but a timbre or prosody judge would peak in the middle, which is why the speaker-similarity axis uses a model that learns its own layer weights.

For this judge, the last layer genuinely wins (AUC 0.75 vs ~0.60–0.67 in the middle) — because preference here is dominated by overall quality and naturalness, which the top layer integrates, not by raw speaker timbre. The pitfall is still real: the moment an axis is about who is speaking, the last layer is the wrong default. That is exactly why the speaker-similarity axis does not use raw WavLM at all — it uses the fine-tuned wavlm-base-plus-sv x-vector model, which learns its own weighting across layers. With only 147 votes this sweep is provisional; a timbre- or prosody-specific judge should re-run it and will likely land mid-stack.
Predicted tournament

Embedding model into Elo

The classifier is trained on ordered pairs: winner clip minus loser clip. After fitting the preference model, every available model pair is scored across shared prompts. Those predicted win probabilities are converted into a Bradley-Terry rating scale centered around 1500. Striped rows are external Replicate models scored by the judge, not direct human-vote Elo rows.
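One way to turn a predicted win matrix into 1500-centered ratings is sketched below. The note does not say which solver was used, so the minorization-maximization update here is an assumption, as are the function name and the toy matrices. Each entry `win[i][j]` is treated as a fractional win of voice i over voice j and must lie strictly between 0 and 1.

```python
import math


def bradley_terry_elo(win, iters=500, center=1500.0):
    """Fit Bradley-Terry strengths to a win-probability matrix, return Elo-like points."""
    n = len(win)
    strength = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            wins_i = sum(win[i][j] for j in range(n) if j != i)
            denom = sum((win[i][j] + win[j][i]) / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            updated.append(wins_i / denom)
        # Fix the free scale: force the geometric mean of the strengths to 1.
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    # 400 * log10(strength) is the usual Elo scaling; the mean lands on `center`.
    return [center + 400.0 * math.log10(s) for s in strength]


# Toy 3-voice matrix (hypothetical values): row beats column.
toy = [[0.5, 0.7, 0.9],
       [0.3, 0.5, 0.7],
       [0.1, 0.3, 0.5]]
ratings = bradley_terry_elo(toy)
```

A quick sanity check on the scale: two voices where the first wins 10 times out of 11 come out 400 points apart.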

Legend: in-study voices · Replicate candidates (scored by the ranker, not human Elo)

Rank | Model · voice | Predicted Elo | Row type
1 | Chatterbox · default study voice | 1770 | in-study
2 | Spark-TTS · Andy clone | 1748 | Replicate estimate
3 | Chatterbox Andy · Andy | 1730 | in-study
4 | Supertonic 3 · M1 (preset) | 1682 | Replicate estimate
5 | Turbo · default study voice | 1674 | in-study
6 | Qwen Aiden · Aiden | 1626 | in-study
7 | HD · default study voice | 1566 | in-study
8 | Spark-TTS · Aiden clone | 1559 | Replicate estimate
9 | Gradium Kent · Kent | 1552 | in-study
10 | Eleven James · James | 1551 | in-study
Leave-prompt-out validation

75% ROC AUC

The validation split withholds prompt groups, which is stricter than randomly splitting votes. Accuracy is 65%, with uneven folds because some prompt families are sparse.
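The split can be sketched as below, assuming every training row carries a prompt-group label. The group names and the helper `leave_group_out` are hypothetical. Both mirrored rows of a vote share a prompt, so they always land on the same side of the split.

```python
from collections import defaultdict


def leave_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per prompt group."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    for group in sorted(by_group):
        test_idx = by_group[group]
        train_idx = [i for i, g in enumerate(groups) if g != group]
        yield group, train_idx, test_idx


# Hypothetical prompt-group label for each training row.
groups = ["numbers", "numbers", "legal", "plain", "legal", "plain"]
folds = list(leave_group_out(groups))
```

Compared with a random split, no prompt family seen at test time has contributed a single row to training, which is why the fold scores are uneven when some families are sparse.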

Fold accuracy by held-out prompt group (%): 63, 48, 80, 76, 59

[ROC curve, AUC 75%: x-axis false positive rate, y-axis true positive rate, both from 0.00 to 1.00.]
Interpretation: the curve sits above the random diagonal, but the middle section is jagged because the holdout has only 294 mirrored rows across five prompt groups.

Fold bars show accuracy by held-out prompt group. Log loss 0.924; Brier 0.256.

Accuracy: 65% · ROC AUC: 75% · 1 − Brier: 74%
Can you trust the numbers?

Calibration: a predicted 80% should win 80% of the time

Accuracy and AUC only ask whether the judge picks the right winner. A benchmark needs more — when it says “80% chance A wins,” A should actually win about 80% of the time. We test this directly: take every out-of-fold prediction from the leave-prompt-group-out cross-validation, bin by predicted probability, and compare each bin's claim against what really happened.

[Reliability diagram: x-axis predicted win probability, y-axis observed win rate.]

Expected calibration error: 22% → 6% (before → after temperature scaling, T = 4.1)

The dotted line is perfect calibration. The red curve is the raw judge — points sag below the line, meaning overconfidence. A single learned temperature pulls the green curve back onto the diagonal without changing any ranking. Dot size is comparisons per bin, over 294 mirrored pairs.

The raw judge is overconfident — and one number fixes it. The ranking is already sound (ROC AUC 75%), but the raw probabilities are too extreme: a claimed 97% win materialized only ~77% of the time. Fitting a single temperature (T = 4.1) on the held-out predictions softens every probability toward the outcome it actually earned, cutting expected calibration error from 22% to 6%. Because the transform is monotonic, the leaderboard order is untouched — you now get a trustworthy confidence on each pair, not just a trustworthy ranking. More votes will tighten it further.
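A sketch of both pieces, temperature scaling and expected calibration error, under stated assumptions: the temperature is found by a simple grid search on mean negative log-likelihood (the note does not say how T was fitted), probabilities are strictly between 0 and 1, and the toy predictions are invented to mimic an overconfident judge.

```python
import math


def _logit(p):
    return math.log(p / (1.0 - p))


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def apply_temperature(p, temperature):
    """Divide the logit by T: T > 1 softens a probability, T < 1 sharpens it."""
    return _sigmoid(_logit(p) / temperature)


def mean_nll(probs, labels, temperature):
    total = 0.0
    for p, y in zip(probs, labels):
        q = apply_temperature(p, temperature)
        total -= math.log(q if y == 1 else 1.0 - q)
    return total / len(probs)


def fit_temperature(probs, labels):
    # Hypothetical grid over T in [0.5, 10.0]; a 1-D optimizer would also work.
    grid = [0.5 + 0.1 * k for k in range(96)]
    return min(grid, key=lambda t: mean_nll(probs, labels, t))


def expected_calibration_error(probs, labels, bins=10):
    """Count-weighted mean gap between predicted and observed win rate per bin."""
    counts, sum_p, sum_y = [0] * bins, [0.0] * bins, [0.0] * bins
    for p, y in zip(probs, labels):
        b = min(int(p * bins), bins - 1)
        counts[b] += 1
        sum_p[b] += p
        sum_y[b] += y
    return sum(abs(sum_p[b] - sum_y[b]) for b in range(bins) if counts[b]) / len(probs)


# Toy mirrored predictions: the judge claims 97% but only 8 of 10 such pairs win.
raw = [0.97] * 10 + [0.03] * 10
labels = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2
temperature = fit_temperature(raw, labels)
calibrated = [apply_temperature(p, temperature) for p in raw]
ece_before = expected_calibration_error(raw, labels)
ece_after = expected_calibration_error(calibrated, labels)
```

Dividing the logit by a positive constant never reorders two predictions, which is why the ranking and the AUC are unchanged while the calibration error drops.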
Win-probability heatmap

Where the model thinks pairwise edges are strong

Each cell is the predicted chance that the row voice beats the column voice. Green favors the row; red favors the column. The diagonal is 50. Sparse support still matters: this matrix is useful for prioritizing new listening tests, not for declaring a final voice leaderboard.

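Row beats column (%) | Chatterbox | Chatterbox Andy | Turbo | Qwen Aiden | HD | Gradium Kent | Eleven James | Turbo Deep | XTTS v2 | HD Deep
Chatterbox | 50 | 50 | 74 | 64 | 99 | 83 | 83 | 98 | 94 | 63
Chatterbox Andy | 50 | 50 | 59 | 59 | 78 | 75 | 76 | 86 | 72 | 73
Turbo | 26 | 41 | 50 | 59 | 76 | 54 | 65 | 56 | – | 71
Qwen Aiden | 36 | 41 | 41 | 50 | 56 | 53 | 62 | 41 | 85 | 61
HD | 1 | 22 | 24 | 44 | 50 | 55 | 32 | 46 | – | 57
Gradium Kent | 17 | 25 | 46 | 47 | 45 | 50 | 48 | 68 | 64 | 55
Eleven James | 17 | 24 | 35 | 38 | 68 | 52 | 50 | 60 | 5 | 59
Turbo Deep | 2 | 14 | 44 | 59 | 54 | 32 | 40 | 50 | 33 | 72
XTTS v2 | 6 | 28 | – | 15 | – | 36 | 95 | 67 | 50 | 44
HD Deep | 37 | 27 | 29 | 39 | 43 | 45 | 41 | 28 | 56 | 50

(– = no value shown for that pair)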
Open-source quality estimates

Use the judge to triage open TTS models

This is the first practical use of the preference model: render the same prompts with open-source or open-weight TTS systems, embed the clips, predict pairwise win rates against the current field, then convert those wins into a provisional Elo. The estimate is a screening tool, not a final benchmark. The current pass covers ten open candidate conditions, including fresh additions — Orpheus 3B (Llama-based), Supertonic 3 (a 99M-parameter on-device model), and Sesame's CSM-1B — so the page can show both model quality and voice-condition sensitivity.

Loudness fairness: the judge reads RMS and peak as features, so a louder clip can win on volume alone. Before scoring, every candidate clip is loudness-matched to the study pool with EBU R128 / LUFS normalization (−21 LUFS, −1 dBFS peak ceiling), collapsing the field's loudness spread from ±2.9 to ±0.3 LUFS. What remains is timbre and quality, not gain.
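The gain arithmetic behind that normalization can be sketched as below. It assumes the integrated loudness and sample peak of each clip have already been measured by an external BS.1770-style meter, and it assumes the peak ceiling is enforced by reducing gain rather than by a limiter; the note does not say which. Function names are illustrative.

```python
def normalization_gain_db(measured_lufs, peak_dbfs,
                          target_lufs=-21.0, ceiling_dbfs=-1.0):
    """Gain in dB that moves a clip to the target loudness without breaking the ceiling."""
    loudness_gain = target_lufs - measured_lufs   # positive means turn the clip up
    headroom = ceiling_dbfs - peak_dbfs           # largest gain the peak allows
    return min(loudness_gain, headroom)


def db_to_linear(gain_db):
    """Convert a dB gain into the factor applied to each sample."""
    return 10.0 ** (gain_db / 20.0)


# A clip measured at -18.1 LUFS with a -3 dBFS peak is turned down by 2.9 dB.
gain = normalization_gain_db(measured_lufs=-18.1, peak_dbfs=-3.0)
scale = db_to_linear(gain)
```

A quiet clip with little headroom will stop short of the target under this rule, so clips that hit the ceiling are worth flagging rather than silently accepting.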
1. Chatterbox Turbo · default study voice · open-source · MIT · Elo 1770
Top current estimate, but includes a default study voice condition.
Source: Resemble Chatterbox GitHub

2. Chatterbox Turbo · Andy · open-source · MIT · Elo 1730
Best clean male-voice open-source estimate in this batch.
Source: Resemble Chatterbox GitHub

3. Qwen3 TTS · Aiden · open-source · Apache 2.0 · Elo 1626
Strong open model candidate; should get more same-voice prompt coverage.
Source: Qwen3-TTS technical report

4. XTTS v2 · Damien Black · open-weight · CPML · Elo 1535
Good local voice-cloning baseline, but not a permissive commercial OSS license.
Source: Coqui XTTS v2 model release

5. Kokoro v1.0 · af_heart · open-source · Apache 2.0 · Elo 1392
Small, cheap, permissive baseline; quality trails larger expressive models here.
Source: Kokoro Hugging Face model card

6. Kokoro v1.0 · am_michael · open-source · Apache 2.0 · Elo 1362
Useful fast baseline, but this judge predicts weak preference against richer voices.
Source: Kokoro Hugging Face model card

7. Qwen3 TTS · default study voice · open-source · Apache 2.0 · Elo 1113
Low estimate is likely a voice-condition artifact, not a model-family verdict.
Source: Qwen3-TTS technical report

Replicate open-weight estimates

These model/voice conditions were not in the human Elo pool. We rendered 12 shared prompts through Replicate, embedded the WAVs, and asked the ranker to predict pairwise wins against the current field. Treat rows as model-plus-reference estimates: the same model can move materially when cloned from a different voice.

Each row was judged against 15 current voices across 98 prompt pairs.

Rank | Model · voice | Predicted Elo | Mean win
1 | Spark-TTS · Andy clone | 1748 | 78%
2 | Supertonic 3 · M1 (preset) | 1682 | 72%
3 | Spark-TTS · Aiden clone | 1559 | 58%
4 | Zonos v0.1 · Andy clone | 1548 | 57%
5 | Dia · Andy clone | 1534 | 55%
6 | F5-TTS · Andy clone | 1532 | 55%
7 | Zonos v0.1 · Aiden clone | 1492 | 50%
8 | Orpheus 3B · Dan (preset) | 1431 | 43%
9 | CSM-1B · Speaker 0 | 1408 | 41%
10 | F5-TTS · Aiden clone | 1395 | 39%
Continuous benchmark plan: Whisper intelligibility and UTMOS naturalness are now live for every candidate (see the three-axis scorecard below). What remains is a clean holdout of new human votes collected after the judge is trained: if it stays calibrated on those fresh votes, CodeSOTA runs as a continuous benchmark — machine scores for breadth, human votes for calibration and hard cases.
Three independent axes

Preference is not intelligibility is not naturalness

A single Elo number hides real trade-offs. Borrowing the axes professional speech-evaluation services keep separate, we score every candidate on three independent measures: predicted human preference (the Elo judge), objective intelligibility (Whisper whisper-small.en word error rate against the known script), and predicted naturalness (UTMOS, the top VoiceMOS-2022 system, on a 1–5 MOS scale). The three rankings disagree — which is exactly why one score is not enough.

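Model · voice | Preference (rank, Elo) | Intelligibility (rank, WER) | Naturalness (rank, UTMOS) | Disagreement
Spark-TTS · Andy clone | #1, 1748 | #5, 10% | #1, 4.48 | ±4
Supertonic 3 · M1 (preset) | #2, 1682 | #2, 10% | #3, 4.41 | ±1
Spark-TTS · Aiden clone | #3, 1559 | #4, 10% | #2, 4.48 | ±2
Zonos v0.1 · Andy clone | #4, 1548 | #7, 14% | #7, 4.28 | ±3
Dia · Andy clone | #5, 1534 | #8, 16% | #10, 3.93 | ±5
F5-TTS · Andy clone | #6, 1532 | #9, 19% | #5, 4.29 | ±4
Zonos v0.1 · Aiden clone | #7, 1492 | #3, 10% | #8, 4.22 | ±5
Orpheus 3B · Dan (preset) | #8, 1431 | #1, 9% | #4, 4.36 | ±7
CSM-1B · Speaker 0 | #9, 1408 | #6, 12% | #9, 3.96 | ±3
F5-TTS · Aiden clone | #10, 1395 | #10, 21% | #6, 4.28 | ±4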
The axes diverge: Spark-TTS leads on preference, but Orpheus 3B is the most intelligible (9% word error rate). The F5-TTS Aiden clone sounds natural yet posts the highest error rate (21% WER, 33% on structured text): it reads fluently but says the wrong words. A model that wins on charm can still fail on a phone number.

The #rank in each cell is the model's standing on that axis; the Disagreement column is the gap between a model's best and worst rank across the three. High values flag models the axes disagree about (accurate but disliked, or liked but unintelligible). Intelligibility bars are inverted (longer = lower WER), and WER uses the Whisper English text normalizer so spoken “ninety-eight thousand” and written “$98,750” count as a match.
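The Disagreement value is simple enough to check by hand. A small sketch, using three rows of per-axis ranks copied from the scorecard (preference, intelligibility, naturalness); the function name is illustrative.

```python
def disagreement(ranks):
    """Gap between a model's worst and best rank across the axes."""
    return max(ranks) - min(ranks)


# (preference rank, intelligibility rank, naturalness rank) from the scorecard.
scorecard = {
    "Spark-TTS · Andy clone": (1, 5, 1),
    "Supertonic 3 · M1 (preset)": (2, 2, 3),
    "Orpheus 3B · Dan (preset)": (8, 1, 4),
}
gaps = {name: disagreement(ranks) for name, ranks in scorecard.items()}
```

Orpheus 3B has the widest gap of the three because it ranks first on intelligibility and eighth on preference.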

Model fingerprints

Every model has a shape

The same five axes, drawn as a profile per voice. A balanced pentagon is an all-rounder; a spiky one is a specialist. You can read each model's personality at a glance — where it reaches the rim and where it caves in.

Each spoke is an axis, normalized across the field (further out = better). Clockwise from top: Preference · Intelligibility · Naturalness · Speaker sim · Expressiveness. Speaker similarity is blank for preset voices with no clone target.

[Radar charts, one per candidate: Spark-TTS · Andy clone; Supertonic 3 · M1 (preset); Spark-TTS · Aiden clone; Zonos v0.1 · Andy clone; Dia · Andy clone; F5-TTS · Andy clone; Zonos v0.1 · Aiden clone; Orpheus 3B · Dan (preset); CSM-1B · Speaker 0; F5-TTS · Aiden clone.]
Can a metric replace listening?

No single objective metric predicts preference

If intelligibility or naturalness alone tracked human taste, you could retire listening tests. So we tested it directly: across the 15 voices that carry real human-vote Elo, how well does each objective metric correlate with the preference ranking? The answer is sobering — both are essentially flat. The most-preferred voice is not the most intelligible, and UTMOS saturates near the top so it cannot separate already-good models at all.

Intelligibility vs preference: ρ = +0.13

[Scatter plot: x-axis preference Elo, y-axis word accuracy.]

No significant relationship · Spearman p = 0.66, Pearson r = +0.32, n = 15.

Naturalness vs preference: ρ = +0.06

[Scatter plot: x-axis preference Elo, y-axis UTMOS.]

No significant relationship · Spearman p = 0.84, Pearson r = -0.10, n = 15.

Why this matters: for modern, already-good TTS the cheap proxies break down — Spearman ρ = 0.13 for intelligibility and 0.06 for naturalness, neither significant. Preference is carried by timbre, expressiveness, and delivery that WER and MOS do not capture. That is the case for the learned preference judge (ROC AUC 75%): it models the comparison humans actually make, while WER and UTMOS stay valuable as guardrails — catching a voice that mangles a phone number or sounds robotic, not ranking the good ones.

Elo from the pairwise judge leaderboard (Bradley-Terry on human votes); WER from whisper-small.en; MOS from utmos22_strong. Restricted-range caveat: this field is all strong commercial-grade voices, which compresses the metrics and is itself the finding.

How the axes relate

The correlation matrix

Every axis against every other, as Spearman rank correlation across the 10 candidate voices. Blue is a positive relationship, red is negative, white is none. The diagonal is trivially 1. Two relationships jump out — and neither is about quality alone.

Axis | Preference | Intelligibility | Naturalness | Pitch dynamism | Speaking rate
Preference | 1 | +0.26 | +0.58 | +0.14 | -0.25
Intelligibility | +0.26 | 1 | +0.36 | +0.42 | -0.66
Naturalness | +0.58 | +0.36 | 1 | +0.06 | -0.06
Pitch dynamism | +0.14 | +0.42 | +0.06 | 1 | -0.89
Speaking rate | -0.25 | -0.66 | -0.06 | -0.89 | 1
Speech rate is the hidden variable. Speaking rate correlates −0.89 with pitch dynamism (fast voices flatten their intonation) and −0.66 with intelligibility (the faster a model talks, the more it fumbles structured text). Preference, meanwhile, tracks naturalness (+0.58) far more than intelligibility (+0.26): on this open-weight field the judge rewards how a voice sounds over whether every token survives. Note the contrast with the human-rated study field above, where even naturalness washed out — a restricted-range effect once every model is already excellent.
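For readers who want to recompute a cell, here is a minimal Spearman sketch: rank both lists (ties get the average rank), then take the Pearson correlation of the ranks. The inputs are the preference and naturalness ranks of the ten candidates as listed in the three-axis scorecard; the helper names are illustrative, and the published cells may have been computed from unrounded scores.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Scorecard ranks, in leaderboard order (Spark-TTS Andy first, F5-TTS Aiden last).
preference_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
naturalness_rank = [1, 3, 2, 7, 10, 5, 8, 4, 9, 6]
rho = spearman(preference_rank, naturalness_rank)
```

On these ranks the result is about 0.58, in line with the preference–naturalness cell of the matrix.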
Voice character

Speaker similarity and prosody

Two more axes a voice team cares about. Speaker similarity asks, for the cloned voices, how close the generated speaker is to the intended identity — cosine similarity of wavlm-base-plus-sv speaker embeddings against a target centroid built from the established study-pool Chatterbox-Turbo Andy and Qwen3-TTS Aiden voices. Preset voices have no clone target. Prosody is reported descriptively, not as a quality score: pitch dynamism (F0 standard deviation in semitones, a proxy for expressive intonation) and speaking rate.
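Both measurements reduce to short formulas once the embeddings and the pitch track exist. The sketch below assumes speaker embeddings are plain lists of floats and that the F0 track marks unvoiced frames with 0; it also uses the population standard deviation, which the note does not specify. Names are illustrative.

```python
import math
from statistics import pstdev


def centroid(vectors):
    """Element-wise mean of the reference speaker embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pitch_dynamism_semitones(f0_hz):
    """Standard deviation of F0 in semitones over voiced frames (F0 > 0)."""
    voiced = [f for f in f0_hz if f > 0]
    # The reference frequency only shifts the values, so it does not affect the spread.
    semitones = [12.0 * math.log2(f / voiced[0]) for f in voiced]
    return pstdev(semitones)


# Toy 2-dimensional embeddings (hypothetical); real x-vectors are much wider.
target = centroid([[1.0, 0.0], [0.0, 1.0]])
similarity = cosine_similarity([1.0, 0.0], target)
dynamism = pitch_dynamism_semitones([100.0, 0.0, 200.0])
```

Working in semitones rather than hertz keeps the pitch spread comparable between low and high voices, since an octave is 12 semitones at any register.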

Model · voice | Speaker similarity | Pitch dynamism | Rate
Spark-TTS · Aiden clone | 98% | 4.2 st | 2.0/s
Zonos v0.1 · Andy clone | 97% | 3.6 st | 2.1/s
Zonos v0.1 · Aiden clone | 97% | 3.4 st | 2.1/s
F5-TTS · Aiden clone | 97% | 3.5 st | 3.2/s
Spark-TTS · Andy clone | 97% | 3.2 st | 2.1/s
F5-TTS · Andy clone | 96% | 2.6 st | 3.4/s
Dia · Andy clone | 82% | 4.8 st | 1.9/s
Orpheus 3B · Dan (preset) | preset · no clone target | 5.2 st | 1.8/s
Supertonic 3 · M1 (preset) | preset · no clone target | 4.3 st | 2.1/s
CSM-1B · Speaker 0 | preset · no clone target | 2.6 st | 2.4/s
Cloning is mostly solved; expressiveness is not preference. Spark-TTS holds the target identity best (98% similarity), while Dia drifts most (82%). Orpheus 3B swings pitch the most (5.2 semitones) yet does not top preference — more intonation is not automatically more likable.

Speaker bars are rescaled over a 70–100% window to spread the cluster. Prosody values are descriptive features, not a quality score; a trained prosody-MOS predictor is future work.

Evaluation standards

Where these scores sit in the ITU framework

Subjective speech evaluation has formal standards. Mapping each axis onto them keeps the method honest and makes clear what is measured, on what scale, and what is still missing.

Preference

Comparative / CMOS

blind A/B → Bradley-Terry → Elo

Listeners pick the better of two same-prompt clips. This is a comparative test — the family behind the ITU-T P.800 comparison-category (CMOS) rating — aggregated with Bradley-Terry into a 1500-centered Elo. Comparative tests resolve small differences that absolute rating blurs.

scale · win / loss → Elo
Naturalness

ACR MOS · ITU-T P.800 / P.808

UTMOS neural predictor

Absolute Category Rating asks a listener to score one clip from 1 (bad) to 5 (excellent); the average is MOS. UTMOS is trained to predict that human ACR score, so it is a no-listener proxy for P.800 (lab) and P.808 (crowdsourced) naturalness.

scale · 1–5 MOS
Intelligibility

Objective WER

Whisper ASR + jiwer

Not an ITU subjective test: we transcribe each clip and measure word error rate against the script. It catches the failure mode naturalness scores miss — a fluent clip that mangles a number, URL, or email.

scale · 0–100% error
Not yet covered: ITU-T P.835 (separate signal / background / overall ratings, built for noise-suppressed speech) and ITU-R BS.1534 (MUSHRA — a 0–100 scale with a hidden reference and low-quality anchor for fine-grained naturalness ranking). MUSHRA with screened listeners is the natural next step for a public benchmark; the Elo + UTMOS + WER stack is the low-cost machine approximation that runs on every new model the moment it ships.
Protocol explainer

How MUSHRA works

MUSHRA — MUltiple Stimuli with Hidden Reference and Anchor (ITU-R BS.1534) — is the most discriminating listening test, built to separate systems that are all already good. Rather than rating one clip in isolation, the listener sees every version of the same passage on a single screen and scores each on a continuous 0–100 scale, split into five quality bands. Crucially, two of the clips on that screen are traps.

Quality bands: Excellent 80–100 · Good 60–80 · Fair 40–60 · Poor 20–40 · Bad 0–20

Example screen:
System A: 84
System B: 67
System C: 43
Hidden ref (a secret copy of the original): 99
Mid anchor (7 kHz low-pass): 55
Low anchor (3.5 kHz low-pass): 19

Legend:
Systems under test.
Hidden reference: a reliable listener parks it near 100; if they don't, their scores are dropped.
Anchors: deliberately degraded clips that pin the bottom of the scale so ratings are comparable across people.
The two hidden controls are what make it rigorous. The hidden reference is an unlabeled copy of the pristine original dropped in among the candidates. A listener who can really hear should rate it at the very top (~100); if they park it at 60, they are guessing or on bad equipment, so their whole session is thrown out. The anchors are deliberately broken versions — typically the source low-pass filtered to 3.5 kHz and 7 kHz — that nail the bottom of the scale to a known degradation, so a “40” means the same thing for every listener and every lab.

Because all versions are heard side by side, MUSHRA resolves differences far smaller than absolute MOS can, and a single trained panel of ~15–20 listeners yields tight confidence intervals. The cost is exactly that: screened, trained listeners and careful session design. That is why our page leans on Elo + UTMOS + WER as the cheap, always-on approximation — and flags MUSHRA as the gold standard to reach for when two voices are too close to call.

Illustrated glossary

Every term, explained

This page leans on a lot of acronyms — error rates, opinion scores, embeddings, rating systems. Here is each one in plain language, with its own picture: what it measures, how it is computed, the scale it lives on, and where it shows up in this study.

WER: Word Error Rate

Run a clip through a speech recognizer, then compare the transcript to the script the model was supposed to read. WER counts the edits needed to fix it — substitutions, insertions, and deletions — divided by the number of reference words. It is the standard objective measure of intelligibility: did the words actually survive the trip through synthesis? We normalize both sides with the Whisper text normalizer first, so spoken “ninety-eight thousand” and written “$98,750” are treated as the same.

0% is perfect; 50% means half the words are wrong. On this page the F5-TTS Andy clone hits ~31% on structured text.
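A minimal word-level edit-distance sketch of that definition. It skips the Whisper text normalizer and assumes both strings are already lower-cased and normalized; the function name `wer` is illustrative. The example pair is the Numbers transcript from the failure examples on this page.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the transcript
            insertion = cur[j - 1] + 1  # extra word in the transcript
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


script = "the confirmation code is 739 184 552"
heard = "the comfort code is 7398048"
numbers_wer = wer(script, heard)
```

Four edits against seven reference words gives about 57%, the figure shown for that clip. Note that WER can exceed 100% when the transcript contains many inserted words.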

ACR / MOS: Absolute Category Rating

The oldest and most common subjective test, standardized in ITU-T P.800. A listener hears one clip in isolation and rates it on a five-point scale: 5 excellent, 4 good, 3 fair, 2 poor, 1 bad. Average those ratings across many listeners and clips and you get the Mean Opinion Score. Because it rates clips independently it is simple to run, but it blurs small differences — two great voices both land near 4.5.

P.808 is the crowdsourced variant. UTMOS predicts this 1–5 score with no human in the loop.

CMOS: Comparison MOS

Instead of rating one clip alone, the listener hears two and judges which is better and by how much, usually on a −3 to +3 scale. Comparative tests resolve differences that absolute rating misses, because the brain is far better at “A is slightly better than B” than at pinning an absolute number on a single clip. Our blind A/B vote is the binary version: just pick the winner, no magnitude.

Podonos’ head-to-head slider is a CMOS readout; our Elo aggregates these binary calls (147 human votes in the baseline run).

MUSHRA: ITU-R BS.1534

Multiple Stimuli with Hidden Reference and Anchor — the most discriminating subjective protocol. The listener rates several clips at once on a continuous 0–100 scale, while a known high-quality reference and a deliberately degraded low anchor are hidden among them to calibrate the scale and catch inattentive raters. It needs trained listeners and is expensive, which is why it is reserved for fine-grained ranking of already-good systems.

This is the gold standard our cheap Elo + UTMOS + WER stack approximates at near-zero cost.

Elo: Pairwise rating

A rating system borrowed from chess. Everyone starts at 1500; after each match the winner takes points from the loser, and the amount depends on how surprising the result was — beating a much higher-rated voice earns more. Over many comparisons the ratings settle into a ranking. The scale is interpretable: a 400-point gap implies the higher voice should win about 10 times out of 11.

On the leaderboard, Chatterbox Turbo sits near 1770 and the weakest voice near 1110.
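The "10 times out of 11" figure follows from the standard Elo expected-score formula, sketched here together with one incremental update. The K-factor of 32 is a common chess default and is an assumption; this page fits ratings with Bradley–Terry rather than match-by-match updates.

```python
def expected_win(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, a_won, k=32.0):
    """One incremental Elo update; points gained by A are taken from B."""
    shift = k * ((1.0 if a_won else 0.0) - expected_win(rating_a, rating_b))
    return rating_a + shift, rating_b - shift


gap_400 = expected_win(1900.0, 1500.0)          # 10 / 11
even_match = update(1500.0, 1500.0, a_won=True)
upset = update(1500.0, 1900.0, a_won=True)      # the underdog wins
```

The upset moves more points than the even match, which is the "how surprising the result was" part of the description.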

Bradley–Terry: Win-prob → rating

A statistical model that takes a whole table of pairwise win probabilities and solves for one strength number per competitor that best explains them. Where Elo updates incrementally one match at a time, Bradley–Terry fits the entire set of comparisons at once, which is more stable when data is sparse. We use it to convert the judge’s predicted win matrix into the 1500-centered scale you see.

It is the math that lets a few hundred votes produce a coherent full ranking.

WavLM: Speech encoder

A large self-supervised transformer from Microsoft, trained on huge amounts of unlabeled audio to predict masked speech. The payoff is that its internal layers turn any clip into a dense vector that already encodes speaker identity, prosody, accent, and recording quality — without anyone labeling those things. The preference judge reads a 768-dimensional WavLM vector per clip as its main feature.

We use microsoft/wavlm-base-plus, mean-pooled across time into one vector.

x-vector: Speaker embedding

A fixed-length fingerprint of who is speaking, designed so that two clips of the same person land close together regardless of what words are said. Speaker-verification models produce them, and the cosine similarity between two x-vectors is a number from −1 to 1, with values near 1 meaning the two voices sound like the same identity. For cloned voices it answers: did the model actually capture the target speaker?

We compare each clone against an Andy/Aiden centroid; Dia drifts most at ~0.82.

UTMOS: Neural MOS predictor

A neural network that listens to a clip and predicts what MOS score human raters would give it, trained on large banks of human-rated audio. utmos22_strong was the top system in the VoiceMOS 2022 challenge. It is a free, instant stand-in for a naturalness panel — but, as this page shows, it saturates near the top, so it cannot separate two already-excellent voices.

On the candidate field it ranges 3.9 (Dia) to 4.5 (Spark-TTS) on the 1–5 scale.

LUFS: Loudness units

Loudness Units relative to Full Scale, the EBU R128 standard for perceived loudness — the same measure streaming services use to keep tracks at an even volume. It models how loud audio actually sounds to a person, not just its peak amplitude. We normalize every clip to −21 LUFS before scoring so the judge compares timbre and delivery, not which clip happens to be louder.

Normalizing collapsed the field’s loudness spread from ±2.9 to ±0.3 LUFS.

Spearman ρ: Rank correlation

A measure of whether two rankings agree, from −1 (perfectly reversed) through 0 (unrelated) to +1 (identical order). Unlike Pearson correlation it only cares about order, not exact values, so it is robust to odd scales and outliers. We use it to ask the central question: does any objective metric rank voices the same way humans do?

Intelligibility vs preference came out at ρ = 0.13 — essentially no relationship.

Restricted range: Why ρ can vanish

A statistical trap: correlation needs spread to detect. When every model in a comparison is already excellent, each metric’s values bunch into a narrow band, and even a real underlying relationship collapses toward zero. The flat correlations on the study field are partly this effect — and noticing it is itself the finding, because it explains why off-the-shelf metrics fail exactly where you most need them: separating the best from the very best.

UTMOS spanning only 4.1–4.5 across all 15 study voices is restricted range in action.

Content-type analysis

Where models break: structured text vs natural prose

Overall Elo hides where a model actually struggles. We split the shared prompts into symbol-heavy structured text (numbers, URLs, emails, dates, addresses, finance, acronyms) and natural prose (plain, entity, legal, medical, support), then asked the judge for each candidate's win rate against the field within each bucket. Six of the ten candidates are weaker on structured text, the input that stresses normalization and grapheme-to-phoneme handling rather than voice quality.

Orpheus 3B is the sharpest example: a 64% win rate on natural prose collapses to 22% on structured text — a 42-point gap. It reads sentences cleanly but mangles numbers and URLs, which is exactly what drags down its overall estimate.
Model · voice | Structured text | Natural prose | Gap (points)
Orpheus 3B · Dan (preset) | 22% | 64% | −42
Dia · Andy clone | 45% | 60% | −15
Supertonic 3 · M1 (preset) | 62% | 76% | −14
Spark-TTS · Aiden clone | 48% | 62% | −14
F5-TTS · Andy clone | 48% | 55% | −7
Spark-TTS · Andy clone | 73% | 78% | −6
F5-TTS · Aiden clone | 40% | 33% | +7
Zonos v0.1 · Andy clone | 58% | 44% | +14
CSM-1B · Speaker 0 | 47% | 32% | +15
Zonos v0.1 · Aiden clone | 57% | 33% | +24

Bars show predicted win rate vs the 15-voice study field, averaged within each content bucket. Gap is structured minus natural in points; red means the model loses ground on structured text.

Failure map

Which content type breaks which model

The bucket view averages a lot away. Here is every model against all twelve content types, coloured by word error rate. The structured columns (left) light up red almost everywhere; natural prose (right) stays green. Numbers, URLs, and emails are where intelligibility goes to die.

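Model · voice | Acronyms | Addresses | Dates | Emails | Finance | Numbers | URLs | Names | Legal | Medical | Plain prose | Support | All
Orpheus 3B · Dan (preset) | 11 | 18 | 17 | 18 | 0 | 14 | 25 | 0 | 0 | 0 | 0 | 0 | 9
Supertonic 3 · M1 (preset) | 11 | 0 | 17 | 27 | 18 | 0 | 25 | 17 | 0 | 0 | 0 | 0 | 10
Zonos v0.1 · Aiden clone | 11 | 0 | 0 | 27 | 18 | 0 | 42 | 0 | 17 | 0 | 0 | 0 | 10
Spark-TTS · Aiden clone | 0 | 0 | 0 | 27 | 0 | 0 | 50 | 0 | 0 | 40 | 0 | 0 | 10
Spark-TTS · Andy clone | 0 | 0 | 17 | 27 | 0 | 14 | 50 | 0 | 0 | 10 | 0 | 0 | 10
CSM-1B · Speaker 0 | 33 | 27 | 0 | 27 | 0 | 43 | 17 | 0 | 0 | 0 | 0 | 0 | 12
Zonos v0.1 · Andy clone | 11 | 27 | 17 | 27 | 27 | 14 | 17 | 0 | 25 | 0 | 0 | 0 | 14
Dia · Andy clone | 11 | 0 | 17 | 100 | 9 | 14 | 25 | 0 | 0 | 10 | 0 | 0 | 16
F5-TTS · Andy clone | 67 | 9 | 17 | 27 | 18 | 57 | 25 | 0 | 0 | 10 | 0 | 0 | 19
F5-TTS · Aiden clone | 67 | 36 | 17 | 27 | 18 | 43 | 25 | 0 | 0 | 20 | 0 | 0 | 21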

Cell = mean word error rate (%) for that model on that content type. Green is accurate, red is broken. The seven left columns are structured/symbol-heavy text; the five right are natural prose.

Anatomy of a failure

What the model actually said

WER is abstract until you read the transcripts. These are real clips, transcribed by Whisper: the script the model was given versus what came out. Struck-through words were dropped; highlighted words are wrong or hallucinated. Smooth on prose, shattered on symbols — the most spectacular failure is a model that abandoned the prompt entirely and recited a YouTube outro.

Emails · Dia · Andy clone · 100% WER
script: the escalation alias is support dash priority at codesota dot com
heard: i will see you next time

Acronyms · F5-TTS · Andy clone · 67% WER
script: the api uses oauth jwt tls and http 2
heard: the api uses yogeo thtp

Numbers · F5-TTS · Andy clone · 57% WER
script: the confirmation code is 739 184 552
heard: the comfort code is 7398048

URLs · Spark-TTS · Andy clone · 50% WER
script: visit status dot example dot com slash incidents slash april dash report
heard: visit status example com incidence april report

Medical · Spark-TTS · Aiden clone · 40% WER
script: record the dosage as 25 milligrams twice daily with food
heard: it will print the dosage as 25 mg twice daily with food

Addresses · F5-TTS · Aiden clone · 36% WER
script: ship the replacement unit to 742 evergreen terrace springfield oregon 97403
heard: ship the replacement unit to 742 evergreen terry springfield oregon 97 4 3
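
The WER figures on these cards can be reproduced with a plain word-level edit distance. The sketch below assumes whitespace tokenization on already-normalized lowercase text, which is how the transcripts are displayed here; the actual scoring pipeline may normalize differently, and the function name `wer` is illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Assumes a non-empty reference and whitespace-separated tokens.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis = i deletions
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # reference word dropped (deletion)
                curr[j - 1] + 1,          # extra hypothesis word (insertion)
                prev[j - 1] + (r != h),   # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)

script = "the confirmation code is 739 184 552"
heard = "the comfort code is 7398048"
print(f"{wer(script, heard):.0%}")
```

On the Numbers card this counts two substitutions and two deletions against seven reference words, which matches the 57% shown.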

Active-learning queue

Ask humans about these next

The best next comparisons are not the most famous models. They are the pairs where the model is near 50/50, where prompt support is thin, or where a voice-condition mismatch could be hiding in the data. This is no longer just a table: the live voting page now samples pairs by the same logic — after every vote it favors near-tied ratings and under-tested voices, so human attention flows to the comparisons that move the ranking most.
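
One way to turn "near 50/50 and thin prompt support" into a sortable score is sketched below. The formula, the `support_scale` constant, and the function name `priority` are assumptions made for illustration; the live voting page's actual sampling rule is not specified in this note.

```python
def priority(p_a_wins: float, prompt_support: int, support_scale: float = 8.0) -> float:
    """Hypothetical priority score in [0, 1]; higher means ask humans sooner."""
    # 1.0 when the judge predicts an exact tie, 0.0 when it is fully certain.
    closeness = 1.0 - 2.0 * abs(p_a_wins - 0.5)
    # 1.0 with no shared prompts, shrinking toward 0 as support grows.
    thinness = support_scale / (support_scale + prompt_support)
    return closeness * thinness

# (pair, predicted A win probability, prompt support): four rows from the queue.
queue = [
    ("Chatterbox Turbo / default study voice vs Chatterbox Turbo / Andy", 0.50, 8),
    ("Gradium TTS / Kent vs Kokoro v1.0 / af_heart", 0.52, 3),
    ("ElevenLabs v3 / James vs Gradium TTS / Kent", 0.52, 12),
    ("ElevenLabs v3 / default study voice vs Gradium TTS / Kent", 0.55, 4),
]
ranked = sorted(queue, key=lambda row: priority(row[1], row[2]), reverse=True)
for pair, p, support in ranked:
    print(f"{priority(p, support):.2f}  {pair}")
```

Multiplying the two terms means a pair has to be both close and thinly supported to reach the top; a weighted sum would instead let either condition alone promote a pair, which is a design choice worth testing against fresh votes.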

Chart legend: predicted A win probability and prompt support. Highest value: near 50/50, low support.

| Pair | Predicted split | Prompt support | Why it matters |
| --- | --- | --- | --- |
| Chatterbox Turbo / default study voice vs Chatterbox Turbo / Andy | 50 / 50 | 8 | Near-tie: high leverage for rank ordering. |
| Gradium TTS / Kent vs Kokoro v1.0 / af_heart | 52 / 48 | 3 | Near-tie: high leverage for rank ordering. |
| ElevenLabs v3 / James vs Gradium TTS / Kent | 52 / 48 | 12 | Near-tie: high leverage for rank ordering. |
| Gradium TTS / Kent vs Qwen3 TTS / Aiden | 47 / 53 | 15 | Near-tie: high leverage for rank ordering. |
| Kokoro v1.0 / am_michael vs Chatterbox Turbo / default study voice | 46 / 54 | 4 | Sparse prompt overlap: add votes before trusting the edge. |
| Gradium TTS / Kent vs Speech-02 Turbo / default study voice | 46 / 54 | 7 | Stable enough to track, still useful for calibration. |
| Speech-02 HD / default study voice vs Speech-02 Turbo / English_Deep-VoicedGentleman | 46 / 54 | 5 | Sparse prompt overlap: add votes before trusting the edge. |
| ElevenLabs v3 / default study voice vs Gradium TTS / Kent | 55 / 45 | 4 | Sparse prompt overlap: add votes before trusting the edge. |
| Gradium TTS / Kent vs Speech-02 HD / default study voice | 45 / 55 | 6 | Stable enough to track, still useful for calibration. |
| Gradium TTS / Kent vs Speech-02 HD / English_Deep-VoicedGentleman | 55 / 45 | 11 | Stable enough to track, still useful for calibration. |

Current limitation: this baseline mixes an older female/default voice pool with the newer male-voice condition. That is useful for debugging the method, but the public benchmark should either split by voice condition or retrain on a clean male-only study batch.

Research position

The useful claim is workflow, not final rank

Low-volume human preference data is too scarce for a universal voice judge. The stronger claim is that a lightweight model can turn a few hundred votes into a ranked map of uncertainty. WavLM embeddings capture speaker, prosody, and quality cues; simple acoustic features catch duration, loudness, and spectral shape; and Bradley-Terry turns model-level pair probabilities into a readable scale.
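
The Bradley-Terry step can be sketched as follows. This is a minimal version that assumes the judge's output has been aggregated into a matrix of model-level win probabilities; it uses the standard minorization-maximization update for Bradley-Terry strengths, and the 1500 center and the 400-point log10 scale are conventional Elo choices rather than values stated in this note.

```python
import math

def bradley_terry(win_prob, iters=200):
    """win_prob[i][j] = predicted probability that model i beats model j (i != j).

    Returns one positive strength per model, normalized to geometric mean 1.
    """
    n = len(win_prob)
    strength = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            # Expected wins for model i across all opponents.
            wins = sum(win_prob[i][j] for j in range(n) if j != i)
            # Comparisons weighted by the current strengths of both sides.
            denom = sum(
                (win_prob[i][j] + win_prob[j][i]) / (strength[i] + strength[j])
                for j in range(n) if j != i
            )
            updated.append(wins / denom)
        # Strengths are only defined up to a constant factor, so fix the scale.
        geo_mean = math.exp(sum(math.log(s) for s in updated) / n)
        strength = [s / geo_mean for s in updated]
    return strength

def to_elo(strength, center=1500.0):
    """Map strengths to an Elo-like scale: a 400-point gap means 10:1 odds."""
    return [center + 400.0 * math.log10(s) for s in strength]

# Illustrative three-model matrix built from made-up true strengths 4 : 2 : 1.
true = [4.0, 2.0, 1.0]
probs = [[a / (a + b) if i != j else 0.0 for j, b in enumerate(true)]
         for i, a in enumerate(true)]
ratings = to_elo(bradley_terry(probs))
print([round(r) for r in ratings])
```

Because only rating differences carry meaning, the center is arbitrary; what the fit preserves is that each pair's rating gap reproduces the predicted win probability as closely as a single scale allows.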

That breadth now spans five axes — preference (Elo), intelligibility (Whisper WER), naturalness (UTMOS), speaker similarity, and prosody — grounded in the ITU subjective-evaluation standards above, and the voting page now records a per-vote factor (expressiveness, pacing, pronunciation, hallucinations) so wins can be attributed, not just counted. What is genuinely still open is calibration: the reliability test above shows the judge is overconfident, so the next steps are a temperature-scaling pass, a held-out set of fresh human votes collected after training to confirm the fix holds, and enough factor votes to report why one clip wins — not just that it does.
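
A temperature-scaling pass for a pairwise judge can be sketched as below. It assumes the judge exposes a pre-sigmoid score (logit) for "A wins" and that a held-out set of human votes is available; the grid search, its bounds, and the function names are illustrative choices, not the project's implementation.

```python
import math

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of binary votes under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        margin = s if y == 1 else -s
        # Numerically stable log(1 + exp(-margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(logits)

def fit_temperature(logits, labels, lo=0.05, hi=20.0, steps=400):
    """Pick the temperature on a log-spaced grid that minimizes held-out NLL."""
    best_t, best_loss = 1.0, float("inf")
    for k in range(steps + 1):
        t = lo * (hi / lo) ** (k / steps)
        loss = nll(logits, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t

# Made-up held-out votes: the judge says about 98% for A every time, but A wins 3 of 4.
logits = [4.0, 4.0, 4.0, 4.0]
labels = [1, 1, 1, 0]
t = fit_temperature(logits, labels)
calibrated = 1.0 / (1.0 + math.exp(-4.0 / t))
print(f"T={t:.2f}, calibrated p(A wins)={calibrated:.2f}")
```

A fitted temperature above 1 softens the probabilities without changing which side the judge favors, so the ranking is unaffected and only the stated confidence moves. Checking the fitted value on votes collected after training is what shows whether the correction holds.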