From human Elo to a small TTS judge
The workflow starts with scarce but high-value human votes. Those votes train a small pairwise judge. The judge estimates win rates for models that have not yet been in the listening test. Bradley-Terry turns those win rates back into an Elo-like ranking, and the next human votes are used to re-check the machine estimates where they are uncertain or surprising.
A benchmark that can grow without pretending automation is truth
The claim is not that WavLM features replace listeners. The claim is narrower and more useful: a small judge can turn a few hundred pairwise votes into a map of likely winners, likely losers, and high-value comparisons that deserve fresh human attention. It starts from one atom — a blind A/B listening test, two voices reading the same line, a listener picking the one they would ship.
Human Elo
Blind same-prompt votes define the target: which clip a listener actually prefers.
Small Judge
A lightweight pairwise model learns preference probabilities from speech embeddings and acoustic deltas.
Predicted Elo
The judge scores new model pairs, then Bradley-Terry converts predicted win rates into an Elo-like scale.
Fresh Votes
New human comparisons are held out after training to test calibration and find hard cases.
What was trained
The unit of prediction is a pair, not a voice in isolation. Every vote becomes two training rows: the observed winner over loser and the mirrored loser over winner. That makes the model learn a probability surface for comparisons, which is the right shape for Elo and Bradley-Terry.
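The mirroring step can be sketched in a few lines. This is an illustration under stated assumptions, not the study's code: the function name `mirrored_rows` and the 3-dimensional toy vectors are hypothetical stand-ins for the real per-clip features.

```python
from typing import List, Tuple

Vector = List[float]


def mirrored_rows(winner: Vector, loser: Vector) -> List[Tuple[Vector, int]]:
    """Turn one human vote into two ordered training rows.

    Row 1: features(winner) - features(loser), label 1 (first clip won).
    Row 2: features(loser) - features(winner), label 0 (first clip lost).
    """
    delta = [w - l for w, l in zip(winner, loser)]
    return [(delta, 1), ([-d for d in delta], 0)]


# Toy 3-dim features (hypothetical values) standing in for the real vectors.
votes = [
    ([0.9, 0.2, 0.5], [0.1, 0.4, 0.5]),
    ([0.3, 0.8, 0.1], [0.6, 0.7, 0.2]),
]
rows = [row for winner, loser in votes for row in mirrored_rows(winner, loser)]
print(len(rows))  # 2 votes -> 4 ordered pairs
```

Because every row appears with its negation, the training set is symmetric around zero, which pushes a logistic model toward P(A beats B) = 1 − P(B beats A).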
Human pairs
Blind same-prompt A/B votes
Feature bank
WavLM pooled vectors + acoustic deltas
Pair model
P(A beats B | features)
Tournament
Predicted win matrix
BT / Elo
Ratings and uncertainty queue
Embedding delta, duration delta, loudness delta, and spectral deltas for clips on the same prompt.
A calibrated logistic score: the model estimates the chance that A wins the human vote.
Predicted win rates are solved into a Bradley-Terry scale and displayed as Elo-like points.
How a vote becomes a score
The full path from two audio clips to one rating — a few hundred human votes distilled into one number per voice, then used to score models that were never in the listening test. The only learned step is the logistic model in the middle; everything around it is fixed signal processing and a closed-form rating fit. That is deliberate — a small, legible model is easier to trust and recalibrate than an end-to-end black box.
Same prompt, A and B
16 kHz mono, loudness-matched to −21 LUFS
microsoft/wavlm-base-plus
Mean-pooled hidden states → 768-dim embedding per clip
8 hand features
Duration, RMS, peak, ZCR, spectral centroid/bandwidth/flatness, silence
A minus B
Concatenated embedding & acoustic differences → 776-dim
P(A beats B)
Trained on 147 human votes, each mirrored → 294 ordered pairs
Win matrix → rating
Predicted pairwise wins fit to a 1500-centered Elo scale
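To make the "fixed signal processing" part concrete, here is a minimal sketch of a few of the hand features and the A-minus-B delta. It covers only the time-domain features (duration, RMS, peak, ZCR, silence); the spectral features and the WavLM embedding are left out. The function names, the silence threshold, and the per-sample silence definition are assumptions for illustration, not the study's settings.

```python
import math
from typing import Dict, List


def hand_features(samples: List[float], sample_rate: int = 16000,
                  silence_threshold: float = 0.01) -> Dict[str, float]:
    """Time-domain features for one mono clip with samples in [-1, 1]."""
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    peak = max(abs(s) for s in samples)
    # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    # Crude silence measure (assumed): share of samples below the threshold.
    silence = sum(1 for s in samples if abs(s) < silence_threshold) / n
    return {
        "duration_s": n / sample_rate,
        "rms": rms,
        "peak": peak,
        "zcr": crossings / (n - 1),
        "silence_frac": silence,
    }


def feature_delta(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """A minus B, feature by feature."""
    return {key: a[key] - b[key] for key in a}


# Synthetic tones standing in for two renderings of the same prompt.
clip_a = [0.5 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
clip_b = [0.25 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(8000)]
feat_a, feat_b = hand_features(clip_a), hand_features(clip_b)
delta = feature_delta(feat_a, feat_b)
print(delta["duration_s"])  # 0.5
```

In the full pipeline these deltas would be concatenated with the embedding difference before the logistic model sees them.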
Which layer? — a pitfall we tested
A subtle but real trap: self-supervised speech models like WavLM specialise by depth. Middle layers (~6–9) encode speaker identity, timbre, and prosody; the top layers drift toward phonetic content. Reaching for the last layer for every task is a common mistake that can pull a metric away from what listeners actually perceive. So we did not assume — we rebuilt the judge on every transformer layer and measured held-out AUC.
wavlm-base-plus-sv x-vector model, which learns its own weighting across layers. With only 147 votes this sweep is provisional; a timbre- or prosody-specific judge should re-run it and will likely land mid-stack.
Embedding model into Elo
The classifier is trained on ordered pairs: winner clip minus loser clip. After fitting the preference model, every available model pair is scored across shared prompts. Those predicted win probabilities are converted into a Bradley-Terry rating scale centered around 1500. Striped rows are external Replicate models scored by the judge, not direct human-vote Elo rows.
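The conversion from predicted win probabilities to ratings can be sketched as follows. This is one common way to fit Bradley-Terry (a minorize-maximize iteration), not necessarily the solver used here; the 400-point base-10 scale is the standard Elo convention and is assumed rather than taken from the source, and the function names are illustrative.

```python
import math
from typing import List


def bradley_terry(win_prob: List[List[float]], iters: int = 500) -> List[float]:
    """Fit one strength per competitor from a matrix of win probabilities.

    win_prob[i][j] is the predicted chance that i beats j, treated as the
    fractional outcome of one comparison. Uses the MM update
    p_i <- wins_i / sum_j 1 / (p_i + p_j), then rescales so the geometric
    mean of the strengths is 1.
    """
    n = len(win_prob)
    strength = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            wins = sum(win_prob[i][j] for j in range(n) if j != i)
            denom = sum(1.0 / (strength[i] + strength[j]) for j in range(n) if j != i)
            updated.append(wins / denom)
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    return strength


def to_elo(strength: List[float], center: float = 1500.0, scale: float = 400.0) -> List[float]:
    """Map strengths to an Elo-like scale (assumed 400 * log10 convention)."""
    return [center + scale * math.log10(s) for s in strength]


def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Chance that A beats B implied by two Elo ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Round trip: build a win matrix from known ratings, then recover them.
true_elo = [1700.0, 1500.0, 1300.0]
matrix = [[elo_win_prob(a, b) for b in true_elo] for a in true_elo]
fitted = to_elo(bradley_terry(matrix))
print([round(r) for r in fitted])
```

Because the strengths are normalized to a geometric mean of 1, the fitted ratings average exactly 1500, which is what "centered around 1500" means in practice.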
75% ROC AUC
The validation split withholds prompt groups, which is stricter than randomly splitting votes. Accuracy is 65%, with uneven folds because some prompt families are sparse.
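A prompt-group split can be sketched as below. The helper `group_folds` and the toy group labels are hypothetical; scikit-learn's `GroupKFold` offers the same guarantee if that library is available.

```python
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def group_folds(groups: Sequence[str], n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) so no prompt group lands on both sides."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    # Greedy balancing: largest groups first, each into the smallest fold so far.
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for group in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[group])
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# One label per vote: the prompt family the vote came from (toy data).
vote_groups = ["numbers", "numbers", "urls", "legal", "legal", "legal", "plain", "urls"]
splits = list(group_folds(vote_groups, n_folds=3))
for train, test in splits:
    print(len(train), len(test))
```

Fold sizes come out uneven whenever group sizes are uneven, which is the same effect the text attributes to sparse prompt families.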
Calibration: a predicted 80% should win 80% of the time
Accuracy and AUC only ask whether the judge picks the right winner. A benchmark needs more — when it says “80% chance A wins,” A should actually win about 80% of the time. We test this directly: take every out-of-fold prediction from the leave-prompt-group-out cross-validation, bin by predicted probability, and compare each bin's claim against what really happened.
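The binning check reads roughly like this. It is a sketch with made-up predictions and outcomes; the bin count and the expected-calibration-error summary are assumptions, not the page's exact settings.

```python
from typing import Dict, List, Sequence


def reliability_table(probs: Sequence[float], outcomes: Sequence[int],
                      n_bins: int = 5) -> List[Dict[str, float]]:
    """Group predictions into equal-width bins and compare claim vs. reality."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        table.append({
            "count": len(members),
            "mean_pred": sum(p for p, _ in members) / len(members),
            "win_rate": sum(y for _, y in members) / len(members),
        })
    return table


def expected_calibration_error(table: List[Dict[str, float]]) -> float:
    """Count-weighted average gap between predicted and observed win rate."""
    total = sum(row["count"] for row in table)
    return sum(row["count"] / total * abs(row["mean_pred"] - row["win_rate"]) for row in table)


# Toy out-of-fold predictions (hypothetical): the confident bin wins too rarely.
probs = [0.90, 0.85, 0.81, 0.82, 0.88, 0.15, 0.10, 0.30, 0.55, 0.45]
outcomes = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]
table = reliability_table(probs, outcomes)
ece = expected_calibration_error(table)
for row in table:
    print(row)
```

In the toy data the top bin claims about 85% but wins 60% of the time, which is the overconfident pattern a reliability plot makes visible.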
Where the model thinks pairwise edges are strong
Each cell is the predicted chance that the row voice beats the column voice. Green favors the row; red favors the column. The diagonal is 50. Sparse support still matters: this matrix is useful for prioritizing new listening tests, not for declaring a final voice leaderboard.
Use the judge to triage open TTS models
This is the first practical use of the preference model: render the same prompts with open-source or open-weight TTS systems, embed the clips, predict pairwise win rates against the current field, then convert those wins into a provisional Elo. The estimate is a screening tool, not a final benchmark. The current pass covers ten open candidate conditions, including fresh additions — Orpheus 3B (Llama-based), Supertonic 3 (a 99M-parameter on-device model), and Sesame's CSM-1B — so the page can show both model quality and voice-condition sensitivity.
Top current estimate, but includes a default study voice condition.
Resemble Chatterbox GitHub
Best clean male-voice open-source estimate in this batch.
Resemble Chatterbox GitHub
Strong open model candidate; should get more same-voice prompt coverage.
Qwen3-TTS technical report
Good local voice-cloning baseline, but not a permissive commercial OSS license.
Coqui XTTS v2 model release
Small, cheap, permissive baseline; quality trails larger expressive models here.
Kokoro Hugging Face model card
Useful fast baseline, but this judge predicts weak preference against richer voices.
Kokoro Hugging Face model card
Low estimate is likely a voice-condition artifact, not a model-family verdict.
Qwen3-TTS technical report
Replicate open-weight estimates
These model/voice conditions were not in the human Elo pool. We rendered 12 shared prompts through Replicate, embedded the WAVs, and asked the ranker to predict pairwise wins against the current field. Treat rows as model-plus-reference estimates: the same model can move materially when cloned from a different voice.
Preference is not intelligibility is not naturalness
A single Elo number hides real trade-offs. Borrowing the axes professional speech-evaluation services keep separate, we score every candidate on three independent measures: predicted human preference (the Elo judge), objective intelligibility (Whisper whisper-small.en word error rate against the known script), and predicted naturalness (UTMOS, the top VoiceMOS-2022 system, on a 1–5 MOS scale). The three rankings disagree — which is exactly why one score is not enough.
Every model has a shape
The same five axes, drawn as a profile per voice. A balanced pentagon is an all-rounder; a spiky one is a specialist. You can read each model's personality at a glance — where it reaches the rim and where it caves in.
No single objective metric predicts preference
If intelligibility or naturalness alone tracked human taste, you could retire listening tests. So we tested it directly: across the 15 voices that carry real human-vote Elo, how well does each objective metric correlate with the preference ranking? The answer is sobering — both are essentially flat. The most-preferred voice is not the most intelligible, and UTMOS saturates near the top so it cannot separate already-good models at all.
Intelligibility vs preference
ρ = +0.13
Naturalness vs preference
ρ = +0.06
The correlation matrix
Every axis against every other, as Spearman rank correlation across the 10 candidate voices. Blue is a positive relationship, red is negative, white is none. The diagonal is trivially 1. Two relationships jump out — and neither is about quality alone.
| | Preference | Intelligibility | Naturalness | Pitch dynamism | Speaking rate |
|---|---|---|---|---|---|
| Preference | 1 | +0.26 | +0.58 | +0.14 | -0.25 |
| Intelligibility | +0.26 | 1 | +0.36 | +0.42 | -0.66 |
| Naturalness | +0.58 | +0.36 | 1 | +0.06 | -0.06 |
| Pitch dynamism | +0.14 | +0.42 | +0.06 | 1 | -0.89 |
| Speaking rate | -0.25 | -0.66 | -0.06 | -0.89 | 1 |
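Each cell in a matrix like this is one Spearman coefficient: rank both axes, then take the Pearson correlation of the ranks. A minimal sketch, with tie handling by average rank and toy values rather than the study's data:

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (hypothetical): faster voices with flatter pitch.
speaking_rate = [2.8, 3.1, 3.4, 3.9, 4.2]
pitch_dynamism = [4.0, 3.6, 3.7, 2.9, 2.1]
rho = spearman(speaking_rate, pitch_dynamism)
print(round(rho, 2))  # -0.9
```

Working on ranks is why a nonlinear but monotone relationship still scores ±1, and why odd scales do not distort the comparison.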
Speaker similarity and prosody
Two more axes a voice team cares about. Speaker similarity asks, for the cloned voices, how close the generated speaker is to the intended identity — cosine similarity of wavlm-base-plus-sv speaker embeddings against a target centroid built from the established study-pool Chatterbox-Turbo Andy and Qwen3-TTS Aiden voices. Preset voices have no clone target. Prosody is reported descriptively, not as a quality score: pitch dynamism (F0 standard deviation in semitones, a proxy for expressive intonation) and speaking rate.
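Both measurements reduce to short formulas once the embeddings and F0 track exist. The sketch below assumes those are already extracted; the 2-dim vectors, the convention that unvoiced frames carry F0 = 0, and the use of population standard deviation are illustrative assumptions.

```python
import math
import statistics
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Per-dimension mean of the target speaker embeddings."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pitch_dynamism(f0_hz: Sequence[float]) -> float:
    """F0 standard deviation in semitones over voiced frames.

    Semitones are taken relative to 1 Hz; the reference only shifts the
    values, so it does not change the standard deviation.
    """
    voiced = [f for f in f0_hz if f > 0]  # assumed: unvoiced frames are 0
    semitones = [12 * math.log2(f) for f in voiced]
    return statistics.pstdev(semitones)


# Toy embeddings (hypothetical) for two target clips and one cloned clip.
target = centroid([[1.0, 0.0], [0.0, 1.0]])
similarity = cosine([1.0, 1.0], target)
# An F0 track that alternates by one octave: +/- 6 semitones around the middle.
dynamism = pitch_dynamism([0.0, 100.0, 200.0, 0.0, 100.0, 200.0])
print(round(similarity, 3), round(dynamism, 3))  # 1.0 6.0
```

Measuring pitch spread in semitones rather than Hz keeps low and high voices comparable, since an octave is 12 semitones at any register.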
Where these scores sit in the ITU framework
Subjective speech evaluation has formal standards. Mapping each axis onto them keeps the method honest and makes clear what is measured, on what scale, and what is still missing.
Comparative / CMOS
blind A/B → Bradley-Terry → Elo
Listeners pick the better of two same-prompt clips. This is a comparative test, the family behind the ITU-T P.800 comparison-category (CMOS) rating, aggregated with Bradley-Terry into a 1500-centered Elo. Comparative tests resolve small differences that absolute rating blurs.
ACR MOS · ITU-T P.800 / P.808
UTMOS neural predictor
Absolute Category Rating asks a listener to score one clip from 1 (bad) to 5 (excellent); the average is MOS. UTMOS is trained to predict that human ACR score, so it is a no-listener proxy for P.800 (lab) and P.808 (crowdsourced) naturalness.
Objective WER
Whisper ASR + jiwer
Not an ITU subjective test: we transcribe each clip and measure word error rate against the script. It catches the failure mode naturalness scores miss: a fluent clip that mangles a number, URL, or email.
How MUSHRA works
MUSHRA — MUltiple Stimuli with Hidden Reference and Anchor (ITU-R BS.1534) — is the most discriminating listening test, built to separate systems that are all already good. Rather than rating one clip in isolation, the listener sees every version of the same passage on a single screen and scores each on a continuous 0–100 scale, split into five quality bands. Crucially, two of the clips on that screen are traps.
Because all versions are heard side by side, MUSHRA resolves differences far smaller than absolute MOS can, and a single trained panel of ~15–20 listeners yields tight confidence intervals. The cost is exactly that: screened, trained listeners and careful session design. That is why our page leans on Elo + UTMOS + WER as the cheap, always-on approximation — and flags MUSHRA as the gold standard to reach for when two voices are too close to call.
Every term, explained
This page leans on a lot of acronyms — error rates, opinion scores, embeddings, rating systems. Here is each one in plain language, with its own picture: what it measures, how it is computed, the scale it lives on, and where it shows up in this study.
Word error rate (WER)

Run a clip through a speech recognizer, then compare the transcript to the script the model was supposed to read. WER counts the edits needed to fix it — substitutions, insertions, and deletions — divided by the number of reference words. It is the standard objective measure of intelligibility: did the words actually survive the trip through synthesis? We normalize both sides with the Whisper text normalizer first, so spoken “ninety-eight thousand” and written “$98,750” are treated as the same.
0% is perfect; 50% means half the words are wrong. On this page F5-TTS hits ~31% on structured text.
Mean opinion score (MOS)

The oldest and most common subjective test, standardized in ITU-T P.800. A listener hears one clip in isolation and rates it on a five-point scale: 5 excellent, 4 good, 3 fair, 2 poor, 1 bad. Average those ratings across many listeners and clips and you get the Mean Opinion Score. Because it rates clips independently it is simple to run, but it blurs small differences — two great voices both land near 4.5.
P.808 is the crowdsourced variant. UTMOS predicts this 1–5 score with no human in the loop.
Comparison MOS (CMOS)

Instead of rating one clip alone, the listener hears two and judges which is better and by how much, usually on a −3 to +3 scale. Comparative tests resolve differences that absolute rating misses, because the brain is far better at “A is slightly better than B” than at pinning an absolute number on a single clip. Our blind A/B vote is the binary version: just pick the winner, no magnitude.
Podonos’ head-to-head slider is a CMOS readout; our Elo aggregates a few hundred of these binary calls.
MUSHRA

Multiple Stimuli with Hidden Reference and Anchor — the most discriminating subjective protocol. The listener rates several clips at once on a continuous 0–100 scale, while a known high-quality reference and a deliberately degraded low anchor are hidden among them to calibrate the scale and catch inattentive raters. It needs trained listeners and is expensive, which is why it is reserved for fine-grained ranking of already-good systems.
This is the gold standard our cheap Elo + UTMOS + WER stack approximates at near-zero cost.
Elo

A rating system borrowed from chess. Everyone starts at 1500; after each match the winner takes points from the loser, and the amount depends on how surprising the result was — beating a much higher-rated voice earns more. Over many comparisons the ratings settle into a ranking. The scale is interpretable: a 400-point gap implies the higher voice should win about 10 times out of 11.
On the leaderboard, Chatterbox Turbo sits near 1770 and the weakest voice near 1110.
Bradley–Terry

A statistical model that takes a whole table of pairwise win probabilities and solves for one strength number per competitor that best explains them. Where Elo updates incrementally one match at a time, Bradley–Terry fits the entire set of comparisons at once, which is more stable when data is sparse. We use it to convert the judge’s predicted win matrix into the 1500-centered scale you see.
It is the math that lets a few hundred votes produce a coherent full ranking.
WavLM

A large self-supervised transformer from Microsoft, trained on huge amounts of unlabeled audio to predict masked speech. The payoff is that its internal layers turn any clip into a dense vector that already encodes speaker identity, prosody, accent, and recording quality — without anyone labeling those things. The preference judge reads a 768-dimensional WavLM vector per clip as its main feature.
We use microsoft/wavlm-base-plus, mean-pooled across time into one vector.
x-vector (speaker embedding)

A fixed-length fingerprint of who is speaking, designed so that two clips of the same person land close together regardless of what words are said. Speaker-verification models produce them, and the cosine similarity between two x-vectors, which ranges from −1 to 1 and sits closer to 1 the more two voices sound like the same identity, measures how well the speakers match. For cloned voices it answers: did the model actually capture the target speaker?
We compare each clone against an Andy/Aiden centroid; Dia drifts most at ~0.82.
UTMOS

A neural network that listens to a clip and predicts what MOS score human raters would give it, trained on large banks of human-rated audio. utmos22_strong was the top system in the VoiceMOS 2022 challenge. It is a free, instant stand-in for a naturalness panel — but, as this page shows, it saturates near the top, so it cannot separate two already-excellent voices.
On the candidate field it ranges 3.9 (Dia) to 4.5 (Spark-TTS) on the 1–5 scale.
LUFS

Loudness Units relative to Full Scale, the perceived-loudness unit defined in ITU-R BS.1770 and used by the EBU R128 recommendation. It is the same measure streaming services use to keep tracks at an even volume. It models how loud audio actually sounds to a person, not just its peak amplitude. We normalize every clip to −21 LUFS before scoring so the judge compares timbre and delivery, not which clip happens to be louder.
Normalizing collapsed the field’s loudness spread from ±2.9 to ±0.3 LUFS.
Spearman rank correlation (ρ)

A measure of whether two rankings agree, from −1 (perfectly reversed) through 0 (unrelated) to +1 (identical order). Unlike Pearson correlation it only cares about order, not exact values, so it is robust to odd scales and outliers. We use it to ask the central question: does any objective metric rank voices the same way humans do?
Intelligibility vs preference came out at ρ = 0.13 — essentially no relationship.
Restricted range

A statistical trap: correlation needs spread to detect. When every model in a comparison is already excellent, each metric’s values bunch into a narrow band, and even a real underlying relationship collapses toward zero. The flat correlations on the study field are partly this effect — and noticing it is itself the finding, because it explains why off-the-shelf metrics fail exactly where you most need them: separating the best from the very best.
UTMOS spanning only 4.1–4.5 across all 15 study voices is restricted range in action.
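The effect is easy to reproduce with synthetic numbers. The simulation below is purely illustrative and uses no study data: a hidden "quality" drives a noisy metric, and the correlation is measured first on everyone and then only on the top 10%. It uses Pearson correlation for brevity; the same shrinkage appears with rank correlation.

```python
import math
import random
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
quality = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# The metric genuinely tracks quality, plus independent noise.
metric = [q + rng.gauss(0.0, 1.0) for q in quality]

full = pearson(quality, metric)

# Keep only the best 10% by quality: an "everyone is already excellent" field.
cutoff = sorted(quality)[int(0.9 * len(quality))]
top = [(q, m) for q, m in zip(quality, metric) if q >= cutoff]
restricted = pearson([q for q, _ in top], [m for _, m in top])

print(round(full, 2), round(restricted, 2))
```

The underlying relationship is identical in both measurements; only the spread of quality changed. That is why a flat correlation on an elite field says little about whether a metric works across the full quality range.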
Where models break: structured text vs natural prose
Overall Elo hides where a model actually struggles. We split the shared prompts into symbol-heavy structured text (numbers, URLs, emails, dates, addresses, finance, acronyms) and natural prose (plain, entity, legal, medical, support), then asked the judge for each candidate's win rate against the field within each bucket. Almost every model is weaker on structured text — the input that stresses normalization and grapheme-to-phoneme handling rather than voice quality.
Which content type breaks which model
The bucket view averages a lot away. Here is every model against all twelve content types, coloured by word error rate. The structured columns (left) light up red almost everywhere; natural prose (right) stays green. Numbers, URLs, and emails are where intelligibility goes to die.
| Model / voice (WER, %) | Acronyms | Addresses | Dates | Emails | Finance | Numbers | URLs | Names | Legal | Medical | Plain prose | Support | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Orpheus 3B / Dan (preset) | 11 | 18 | 17 | 18 | 0 | 14 | 25 | 0 | 0 | 0 | 0 | 0 | 9 |
| Supertonic 3 / M1 (preset) | 11 | 0 | 17 | 27 | 18 | 0 | 25 | 17 | 0 | 0 | 0 | 0 | 10 |
| Zonos v0.1 / Aiden clone | 11 | 0 | 0 | 27 | 18 | 0 | 42 | 0 | 17 | 0 | 0 | 0 | 10 |
| Spark-TTS / Aiden clone | 0 | 0 | 0 | 27 | 0 | 0 | 50 | 0 | 0 | 40 | 0 | 0 | 10 |
| Spark-TTS / Andy clone | 0 | 0 | 17 | 27 | 0 | 14 | 50 | 0 | 0 | 10 | 0 | 0 | 10 |
| CSM-1B / Speaker 0 | 33 | 27 | 0 | 27 | 0 | 43 | 17 | 0 | 0 | 0 | 0 | 0 | 12 |
| Zonos v0.1 / Andy clone | 11 | 27 | 17 | 27 | 27 | 14 | 17 | 0 | 25 | 0 | 0 | 0 | 14 |
| Dia / Andy clone | 11 | 0 | 17 | 100 | 9 | 14 | 25 | 0 | 0 | 10 | 0 | 0 | 16 |
| F5-TTS / Andy clone | 67 | 9 | 17 | 27 | 18 | 57 | 25 | 0 | 0 | 10 | 0 | 0 | 19 |
| F5-TTS / Aiden clone | 67 | 36 | 17 | 27 | 18 | 43 | 25 | 0 | 0 | 20 | 0 | 0 | 21 |
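The structured-versus-prose split can be recomputed from two rows of the table. The sketch assumes each content type counts equally and that the Names column corresponds to the "entity" prompt family; under those assumptions the overall mean reproduces the All column for both rows.

```python
from typing import Dict, List

STRUCTURED = ["Acronyms", "Addresses", "Dates", "Emails", "Finance", "Numbers", "URLs"]
NATURAL = ["Names", "Legal", "Medical", "Plain prose", "Support"]
COLUMNS = STRUCTURED + NATURAL

# WER (%) per content type, copied from the table rows above.
wer_rows: Dict[str, List[int]] = {
    "F5-TTS / Andy clone": [67, 9, 17, 27, 18, 57, 25, 0, 0, 10, 0, 0],
    "Orpheus 3B / Dan (preset)": [11, 18, 17, 18, 0, 14, 25, 0, 0, 0, 0, 0],
}


def bucket_means(row: List[int]) -> Dict[str, float]:
    """Unweighted mean WER per bucket (assumes equal weight per content type)."""
    by_type = dict(zip(COLUMNS, row))
    return {
        "structured": sum(by_type[c] for c in STRUCTURED) / len(STRUCTURED),
        "natural": sum(by_type[c] for c in NATURAL) / len(NATURAL),
        "all": sum(row) / len(row),
    }


f5 = bucket_means(wer_rows["F5-TTS / Andy clone"])
orpheus = bucket_means(wer_rows["Orpheus 3B / Dan (preset)"])
print(round(f5["structured"], 1), round(f5["natural"], 1))  # 31.4 2.0
print(round(orpheus["structured"], 1), round(orpheus["natural"], 1))  # 14.7 0.0
```

The gap between the two buckets is much larger than the gap between models on prose, which is the point the heatmap makes visually.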
What the model actually said
WER is abstract until you read the transcripts. These are real clips, transcribed by Whisper: the script the model was given versus what came out. Struck-through words were dropped; highlighted words are wrong or hallucinated. Smooth on prose, shattered on symbols — the most spectacular failure is a model that abandoned the prompt entirely and recited a YouTube outro.
Script: the escalation alias is support dash priority at codesota dot com
Output: i will see you next time
Script: the api uses oauth jwt tls and http 2
Output: the api uses yogeo thtp
Script: the confirmation code is 739 184 552
Output: the comfort code is 7398048
Script: visit status dot example dot com slash incidents slash april dash report
Output: visit status example com incidence april report
Script: record the dosage as 25 milligrams twice daily with food
Output: it will print the dosage as 25 mg twice daily with food
Script: ship the replacement unit to 742 evergreen terrace springfield oregon 97403
Output: ship the replacement unit to 742 evergreen terry springfield oregon 97 4 3
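To connect these transcripts to the numbers in the heatmap, here is a minimal word-level edit-distance WER. It is a plain-Python sketch rather than the jiwer call the page uses, and it assumes both strings have already been through text normalization, as the lowercase transcripts above appear to be.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # delete a reference word
                cur[j - 1] + 1,                        # insert a hypothesis word
                prev[j - 1] + (ref_word != hyp_word),  # match or substitute
            ))
        prev = cur
    return prev[-1] / len(ref)


code_wer = word_error_rate(
    "the confirmation code is 739 184 552",
    "the comfort code is 7398048",
)
acronym_wer = word_error_rate(
    "the api uses oauth jwt tls and http 2",
    "the api uses yogeo thtp",
)
print(round(code_wer, 3))     # 4 edits / 7 reference words = 0.571
print(round(acronym_wer, 3))  # 6 edits / 9 reference words = 0.667
```

Note how the fused digit string counts as one substitution plus two deletions: WER penalizes tokenization of numbers as heavily as it penalizes wrong words.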
Ask humans about these next
The best next comparisons are not the most famous models. They are the pairs where the judge's predicted split is near 50/50, where prompt support is thin, or where a voice-condition mismatch could be hiding in the data. This is no longer just a table: the live voting page now samples pairs by the same logic. After every vote it favors near-tied ratings and under-tested voices, so human attention flows to the comparisons that move the ranking most.
| Pair | Predicted split | Prompt support | Why it matters |
|---|---|---|---|
| Chatterbox Turbo / default study voice | 50 / 50 | 8 | Near-tie: high leverage for rank ordering. |
| Gradium TTS / Kent | 52 / 48 | 3 | Near-tie: high leverage for rank ordering. |
| ElevenLabs v3 / James | 52 / 48 | 12 | Near-tie: high leverage for rank ordering. |
| Gradium TTS / Kent | 47 / 53 | 15 | Near-tie: high leverage for rank ordering. |
| Kokoro v1.0 / am_michael | 46 / 54 | 4 | Sparse prompt overlap: add votes before trusting the edge. |
| Gradium TTS / Kent | 46 / 54 | 7 | Stable enough to track, still useful for calibration. |
| Speech-02 HD / default study voice | 46 / 54 | 5 | Sparse prompt overlap: add votes before trusting the edge. |
| ElevenLabs v3 / default study voice | 55 / 45 | 4 | Sparse prompt overlap: add votes before trusting the edge. |
| Gradium TTS / Kent | 45 / 55 | 6 | Stable enough to track, still useful for calibration. |
| Gradium TTS / Kent | 55 / 45 | 11 | Stable enough to track, still useful for calibration. |
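One way to turn a table like this into a sampler is to score each pair by closeness to a tie plus a bonus for thin support, then draw pairs in proportion to that score. The scoring formula, the `sparse_below` threshold, and the function name are hypothetical; the live page's actual weighting is not described in this article.

```python
import random
from typing import List, Tuple


def pair_priority(predicted_split: float, prompt_support: int, sparse_below: int = 6) -> float:
    """Higher for near-ties and for pairs with few shared prompts (assumed form)."""
    closeness = 1.0 - 2.0 * abs(predicted_split - 0.5)  # 1.0 at 50/50, 0.0 at 100/0
    sparsity_bonus = max(0, sparse_below - prompt_support) / sparse_below
    return closeness + sparsity_bonus


# (pair, predicted win share of the listed voice, prompt support) from the table.
candidates: List[Tuple[str, float, int]] = [
    ("Chatterbox Turbo / default study voice", 0.50, 8),
    ("Kokoro v1.0 / am_michael", 0.46, 4),
    ("ElevenLabs v3 / James", 0.52, 12),
]

ranked = sorted(candidates, key=lambda c: pair_priority(c[1], c[2]), reverse=True)
for name, split, support in ranked:
    print(name, round(pair_priority(split, support), 3))

# Weighted draw: high-priority pairs come up more often, but none is excluded.
rng = random.Random(0)
weights = [pair_priority(split, support) for _, split, support in candidates]
next_pair = rng.choices([name for name, _, _ in candidates], weights=weights, k=1)[0]
```

Sampling in proportion to priority, rather than always taking the top pair, keeps some votes flowing to every comparison so that no edge goes permanently unchecked.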
The useful claim is workflow, not final rank
Human preference data at this volume is too scarce to support a universal voice judge. The stronger claim is that a lightweight model can turn a few hundred votes into a ranked map of uncertainty. WavLM embeddings capture speaker, prosody, and quality cues; simple acoustic features catch duration, loudness, and spectral shape; and Bradley-Terry turns model-level pair probabilities into a readable scale.
That breadth now spans five axes — preference (Elo), intelligibility (Whisper WER), naturalness (UTMOS), speaker similarity, and prosody — grounded in the ITU subjective-evaluation standards above, and the voting page now records a per-vote factor (expressiveness, pacing, pronunciation, hallucinations) so wins can be attributed, not just counted. What is genuinely still open is calibration: the reliability test above shows the judge is overconfident, so the next steps are a temperature-scaling pass, a held-out set of fresh human votes collected after training to confirm the fix holds, and enough factor votes to report why one clip wins — not just that it does.
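A temperature-scaling pass is small enough to sketch in full. The version below assumes access to the judge's raw logits on held-out votes and fits a single temperature T by grid search on negative log-likelihood; the toy logits and the grid range are illustrative choices, not the study's.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def nll(logits: Sequence[float], outcomes: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood after dividing logits by the temperature."""
    total = 0.0
    for z, y in zip(logits, outcomes):
        p = sigmoid(z / temperature)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits: Sequence[float], outcomes: Sequence[int]) -> float:
    """Grid search over T in [0.5, 5.0]; T > 1 softens overconfident scores."""
    grid = [0.5 + 0.05 * k for k in range(91)]
    return min(grid, key=lambda t: nll(logits, outcomes, t))


# Toy held-out set (hypothetical): the judge says ~95% but is right 75% of the time.
logits = [3.0, 3.0, 3.0, 3.0, -3.0, -3.0, -3.0, -3.0]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]
temperature = fit_temperature(logits, outcomes)
print(round(sigmoid(3.0), 2), round(sigmoid(3.0 / temperature), 2))  # 0.95 0.75
```

Dividing by one scalar never changes which clip the judge prefers, so accuracy and AUC stay the same; only the stated confidence moves. That is also why the fix needs fresh votes to confirm it: T is fit on one set and has to hold on another.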