CodeSOTA research note · text-to-speech preference modeling

From human Elo to a small TTS judge

The workflow starts with scarce but high-value human votes. Those votes train a small pairwise judge. The judge estimates win rates for models that have not yet been in the listening test. Bradley-Terry turns those win rates back into an Elo-like ranking, and the next human votes are used to re-check the machine estimates where they are uncertain or surprising.


Scientific loop

A benchmark that can grow without pretending automation is truth

The claim is not that WavLM features replace listeners. The claim is narrower and more useful: a small judge can turn a few hundred pairwise votes into a map of likely winners, likely losers, and high-value comparisons that deserve fresh human attention. It starts from one atom — a blind A/B listening test, two voices reading the same line, a listener picking the one they would ship.

1. Seed

Human Elo

Blind same-prompt votes define the target: which clip a listener actually prefers.

2. Learn

Small Judge

A lightweight pairwise model learns preference probabilities from speech embeddings and acoustic deltas.

3. Scale

Predicted Elo

The judge scores new model pairs, then Bradley-Terry converts predicted win rates into an Elo-like scale.

4. Audit

Fresh Votes

New human comparisons are held out after training to test calibration and find hard cases.

Baseline run

What was trained

The unit of prediction is a pair, not a voice in isolation. Every vote becomes two training rows: the observed winner over loser and the mirrored loser over winner. That makes the model learn a probability surface for comparisons, which is the right shape for Elo and Bradley-Terry.
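The mirroring step can be sketched as follows. This is a minimal illustration, not the study's code: the `Vote` type and the `mirrored_rows` helper are hypothetical names, and the feature vectors are toy 3-dim lists standing in for the per-clip features.

```python
from typing import List, Sequence, Tuple

# Hypothetical record: one blind A/B vote as (winner_features, loser_features).
Vote = Tuple[Sequence[float], Sequence[float]]


def mirrored_rows(votes: List[Vote]) -> Tuple[List[List[float]], List[int]]:
    """Turn each vote into two rows: (winner - loser, 1) and (loser - winner, 0)."""
    X: List[List[float]] = []
    y: List[int] = []
    for winner, loser in votes:
        delta = [w - l for w, l in zip(winner, loser)]
        X.append(delta)                # observed order: winner listed first
        y.append(1)
        X.append([-d for d in delta])  # mirrored order: loser listed first
        y.append(0)
    return X, y


votes = [([0.9, 0.2, 1.0], [0.4, 0.5, 1.0]),
         ([0.1, 0.8, 0.3], [0.3, 0.6, 0.3])]
X, y = mirrored_rows(votes)
print(len(X), sum(y))  # 4 2
```

Mirroring keeps the two classes exactly balanced, so a logistic model on the difference vector has no reason to favor whichever clip happens to be listed first.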

Human votes

147

blind same-prompt comparisons

Audio samples

204

embedded clips

Feature width

776

WavLM + acoustic deltas

Trained

May 22, 2026

microsoft/wavlm-base-plus

Human pairs

Blind same-prompt A/B votes

Feature bank

WavLM pooled vectors + acoustic deltas

Pair model

P(A beats B | features)

Tournament

Predicted win matrix

BT / Elo

Ratings and uncertainty queue

Feature contrast: xA - xB

Embedding delta, duration delta, loudness delta, and spectral deltas for clips on the same prompt.

Preference probability: sigma(w dot delta)

A calibrated logistic score: the model estimates the chance that A wins the human vote.

Readable rating: BT / Elo

Predicted win rates are solved into a Bradley-Terry scale and displayed as Elo-like points.

Inside the judge

How a vote becomes a score

The full path from two audio clips to one rating — a few hundred human votes distilled into one number per voice, then used to score models that were never in the listening test. The only learned step is the logistic model in the middle; everything around it is fixed signal processing and a closed-form rating fit. That is deliberate — a small, legible model is easier to trust and recalibrate than an end-to-end black box.

Two clips · Same prompt, A and B

16 kHz mono, loudness-matched to −21 LUFS

→

WavLM encoder · microsoft/wavlm-base-plus

Mean-pooled hidden states → 768-dim embedding per clip

→

+ acoustic · 8 hand features

Duration, RMS, peak, ZCR, spectral centroid/bandwidth/flatness, silence

→

Pair vector · A minus B

Concatenated embedding & acoustic differences → 776-dim

→

Logistic model · P(A beats B)

Trained on 147 human votes, each mirrored → 294 ordered pairs

→

Bradley–Terry · Win matrix → rating

Predicted pairwise wins fit to a 1500-centered Elo scale

Which layer? — a pitfall we tested

A subtle but real trap: self-supervised speech models like WavLM specialise by depth. Middle layers (~6–9) encode speaker identity, timbre, and prosody; the top layers drift toward phonetic content. Reaching for the last layer for every task is a common mistake that can pull a metric away from what listeners actually perceive. So we did not assume — we rebuilt the judge on every transformer layer and measured held-out AUC.


Leave-prompt-group-out AUC of the judge rebuilt on each of WavLM's 12 transformer layers. The last layer (highlighted) wins for this preference task — but a timbre or prosody judge would peak in the middle, which is why the speaker-similarity axis uses a model that learns its own layer weights.

For this judge, the last layer genuinely wins (AUC 0.75 vs ~0.60–0.67 in the middle) — because preference here is dominated by overall quality and naturalness, which the top layer integrates, not by raw speaker timbre. The pitfall is still real: the moment an axis is about who is speaking, the last layer is the wrong default. That is exactly why the speaker-similarity axis does not use raw WavLM at all — it uses the fine-tuned wavlm-base-plus-sv x-vector model, which learns its own weighting across layers. With only 147 votes this sweep is provisional; a timbre- or prosody-specific judge should re-run it and will likely land mid-stack.

Predicted tournament

Embedding model into Elo

The classifier is trained on ordered pairs: winner clip minus loser clip, plus the mirrored loser-minus-winner row. After fitting the preference model, every available model pair is scored across shared prompts. Those predicted win probabilities are converted into a Bradley-Terry rating scale centered around 1500. Rows marked "estimated from generated Replicate samples" are external Replicate models scored by the judge, not direct human-vote Elo rows.

Leaderboard: in-study voices, plus Replicate candidates scored by the ranker (not human Elo).

Model	Voice / condition	Predicted Elo
Chatterbox	default study voice	1770
Spark-TTS	Andy clone · estimated from generated Replicate samples	1748
Chatterbox Andy	Andy	1730
Supertonic 3	M1 (preset) · estimated from generated Replicate samples	1682
Turbo	default study voice	1674
Qwen Aiden	Aiden	1626
HD	default study voice	1566
Spark-TTS	Aiden clone · estimated from generated Replicate samples	1559
Gradium Kent	Kent	1552
Eleven James	James	1551
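The conversion from a predicted win matrix to a 1500-centered scale can be sketched with the standard minorization-maximization (MM) update for Bradley-Terry. This is an assumption about one reasonable solver, not a description of the page's own fit: `bradley_terry_elo` is a hypothetical name, each pair is weighted as one comparison, and every voice is assumed to have a nonzero predicted chance against at least one opponent.

```python
import math
from typing import List


def bradley_terry_elo(P: List[List[float]], iters: int = 200,
                      center: float = 1500.0) -> List[float]:
    """Fit Bradley-Terry strengths to a win matrix and return Elo-like points.

    P[i][j] is the predicted probability that voice i beats voice j.
    The diagonal is ignored.
    """
    n = len(P)
    s = [1.0] * n  # BT strengths, refined by the MM iteration
    for _ in range(iters):
        new = []
        for i in range(n):
            wins = sum(P[i][j] for j in range(n) if j != i)
            denom = sum(1.0 / (s[i] + s[j]) for j in range(n) if j != i)
            new.append(wins / denom)
        # Renormalize so the geometric mean of the strengths stays at 1.
        g = math.exp(sum(math.log(v) for v in new) / n)
        s = [v / g for v in new]
    # Elo convention: a 400-point gap corresponds to 10:1 odds.
    return [center + 400.0 * math.log10(v) for v in s]


# Toy check: build a win matrix from known ratings, then recover them.
true_elo = [1600.0, 1500.0, 1400.0]
strength = [10 ** ((e - 1500.0) / 400.0) for e in true_elo]
P = [[strength[i] / (strength[i] + strength[j]) for j in range(3)] for i in range(3)]
ratings = bradley_terry_elo(P)
print([round(r) for r in ratings])  # approximately [1600, 1500, 1400]
```

Because the geometric mean of the strengths is pinned at 1, the ratings always average to the chosen center, which is what "1500-centered" means in this sketch.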

Leave-prompt-out validation

75% ROC AUC

The validation split withholds prompt groups, which is stricter than randomly splitting votes. Accuracy is 65%, with uneven folds because some prompt families are sparse.

Interpretation: the curve sits above the random diagonal, but the middle section is jagged because the holdout has only 294 mirrored rows across five prompt groups.

Fold bars show accuracy by held-out prompt group. Log loss 0.924; Brier 0.256.

Accuracy: 65%

ROC AUC: 75%

1 - Brier: 74%
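A leave-prompt-group-out split and two of the reported metrics can be sketched in plain Python. The group labels, scores, and function names below are illustrative assumptions, not the study's data or code.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_group_out(groups: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_group, train_idx, test_idx), one fold per prompt group."""
    for g in sorted(set(groups)):
        test = [i for i, gi in enumerate(groups) if gi == g]
        train = [i for i, gi in enumerate(groups) if gi != g]
        yield g, train, test


def roc_auc(y: Sequence[int], p: Sequence[float]) -> float:
    """Chance that a random positive row outranks a random negative row (ties count half)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier(y: Sequence[int], p: Sequence[float]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


# Toy rows: prompt group, label, and an out-of-fold prediction for each row.
groups = ["news", "news", "numbers", "numbers", "dialog", "dialog"]
y = [1, 0, 1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.7, 0.8, 0.3]

folds = list(leave_group_out(groups))
auc = roc_auc(y, p)
bs = brier(y, p)
print(len(folds), round(auc, 3), round(bs, 3))
```

Splitting by prompt group also keeps a vote and its mirrored row in the same fold, so the model never sees the mirror image of a test row during training.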

Can you trust the numbers?

Calibration: a predicted 80% should win 80% of the time

Accuracy and AUC only ask whether the judge picks the right winner. A benchmark needs more — when it says “80% chance A wins,” A should actually win about 80% of the time. We test this directly: take every out-of-fold prediction from the leave-prompt-group-out cross-validation, bin by predicted probability, and compare each bin's claim against what really happened.

Expected calibration error

22% → 6%

before → after temperature scaling (T = 4.1)

The dotted line is perfect calibration. The red curve is the raw judge — points sag below the line, meaning overconfidence. A single learned temperature pulls the green curve back onto the diagonal without changing any ranking. Dot size is comparisons per bin, over 294 mirrored pairs.

The raw judge is overconfident — and one number fixes it. The ranking is already sound (ROC AUC 75%), but the raw probabilities are too extreme: a claimed 97% win materialized only ~77% of the time. Fitting a single temperature (T = 4.1) on the held-out predictions softens every probability toward the outcome it actually earned, cutting expected calibration error from 22% to 6%. Because the transform is monotonic, the leaderboard order is untouched — you now get a trustworthy confidence on each pair, not just a trustworthy ranking. More votes will tighten it further.
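Temperature scaling and expected calibration error can be sketched as below. The grid search, the ten equal-width bins, and all function names are assumptions made for illustration; the page does not state how T was fitted or how many bins were used, and the toy data is not the study's.

```python
import math
from typing import Sequence


def _logit(p: float, eps: float = 1e-6) -> float:
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite at 0 and 1
    return math.log(p / (1.0 - p))


def apply_temperature(p: float, T: float) -> float:
    """Divide the logit by T; T > 1 pulls probabilities toward 0.5."""
    return 1.0 / (1.0 + math.exp(-_logit(p) / T))


def nll(y: Sequence[int], p: Sequence[float], T: float) -> float:
    """Mean negative log-likelihood of the outcomes after scaling by T."""
    total = 0.0
    for yi, pi in zip(y, p):
        q = apply_temperature(pi, T)
        total -= math.log(q if yi == 1 else 1.0 - q)
    return total / len(y)


def fit_temperature(y, p, lo: float = 0.1, hi: float = 20.0, steps: int = 400) -> float:
    """Grid-search T on a log scale to minimize held-out negative log-likelihood."""
    grid = [lo * (hi / lo) ** (k / (steps - 1)) for k in range(steps)]
    return min(grid, key=lambda T: nll(y, p, T))


def ece(y, p, bins: int = 10) -> float:
    """Bin-size-weighted mean of |average prediction - observed win rate|."""
    buckets = [[] for _ in range(bins)]
    for yi, pi in zip(y, p):
        buckets[min(int(pi * bins), bins - 1)].append((yi, pi))
    n = len(y)
    return sum(len(b) / n * abs(sum(pi for _, pi in b) / len(b)
                                - sum(yi for yi, _ in b) / len(b))
               for b in buckets if b)


# Toy overconfident judge: claims 90% but wins 70% (and the mirrored rows).
y = [1] * 7 + [0] * 3 + [0] * 7 + [1] * 3
p = [0.9] * 10 + [0.1] * 10
T = fit_temperature(y, p)
p_cal = [apply_temperature(pi, T) for pi in p]
print(round(ece(y, p), 3), round(ece(y, p_cal), 3))
```

Dividing the logit by a positive T is strictly monotonic, so any two pairs keep their relative order; only the confidence attached to them changes.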

Win-probability heatmap

Where the model thinks pairwise edges are strong

Each cell is the predicted chance that the row voice beats the column voice. Green favors the row; red favors the column. The diagonal is 50. Sparse support still matters: this matrix is useful for prioritizing new listening tests, not for declaring a final voice leaderboard.

Heatmap rows and columns (same order on both axes): Chatterbox, Chatterbox Andy, Turbo, Qwen Aiden, Gradium Kent, Eleven James, Turbo Deep, XTTS v2, HD Deep.

Open-source quality estimates

Use the judge to triage open TTS models

This is the first practical use of the preference model: render the same prompts with open-source or open-weight TTS systems, embed the clips, predict pairwise win rates against the current field, then convert those wins into a provisional Elo. The estimate is a screening tool, not a final benchmark. The current pass covers ten open candidate conditions, including fresh additions — Orpheus 3B (Llama-based), Supertonic 3 (a 99M-parameter on-device model), and Sesame's CSM-1B — so the page can show both model quality and voice-condition sensitivity.

Loudness fairness: the judge reads RMS and peak as features, so a louder clip can win on volume alone. Before scoring, every candidate clip is loudness-matched to the study pool with EBU R128 / LUFS normalization (−21 LUFS, −1 dBFS peak ceiling), collapsing the field's loudness spread from ±2.9 to ±0.3 LUFS. What remains is timbre and quality, not gain.
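The gain arithmetic behind that normalization can be sketched as follows. It assumes the integrated loudness and sample peak have already been measured by an EBU R128 meter, and it resolves a conflict between target and ceiling by applying less gain; the page does not say whether the study did that or used a limiter, so treat that choice (and the function names) as hypothetical.

```python
def normalization_gain_db(measured_lufs: float, peak_dbfs: float,
                          target_lufs: float = -21.0,
                          ceiling_dbfs: float = -1.0) -> float:
    """Gain in dB that moves a clip toward the target without breaching the peak ceiling."""
    gain = target_lufs - measured_lufs   # gain needed to hit the loudness target
    headroom = ceiling_dbfs - peak_dbfs  # largest gain the peak ceiling allows
    return min(gain, headroom)


def db_to_linear(gain_db: float) -> float:
    """Convert a dB gain to the factor applied to each sample."""
    return 10.0 ** (gain_db / 20.0)


# A hot clip is turned down; a quiet clip is turned up only as far as the ceiling allows.
loud = normalization_gain_db(measured_lufs=-18.1, peak_dbfs=-0.5)
quiet = normalization_gain_db(measured_lufs=-26.0, peak_dbfs=-3.0)
print(round(loud, 1), round(quiet, 1))  # -2.9 2.0
```

In the second case the clip ends up 3 dB short of the target, which is the kind of residual that a stated ±0.3 LUFS spread would need to rule out with a limiter or quieter target.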

Model	Voice	License	Predicted Elo	Note	Source
Chatterbox Turbo	default study voice	open-source · MIT	1770	Top current estimate, but includes a default study voice condition.	Resemble Chatterbox GitHub
Chatterbox Turbo	Andy	open-source · MIT	1730	Best clean male-voice open-source estimate in this batch.	Resemble Chatterbox GitHub
Qwen3 TTS	Aiden	open-source · Apache 2.0	1626	Strong open model candidate; should get more same-voice prompt coverage.	Qwen3-TTS technical report
XTTS v2	Damien Black	open-weight · CPML	1535	Good local voice-cloning baseline, but not a permissive commercial OSS license.	Coqui XTTS v2 model release
Kokoro v1.0	af_heart	open-source · Apache 2.0	1392	Small, cheap, permissive baseline; quality trails larger expressive models here.	Kokoro Hugging Face model card
Kokoro v1.0	am_michael	open-source · Apache 2.0	1362	Useful fast baseline, but this judge predicts weak preference against richer voices.	Kokoro Hugging Face model card
Qwen3 TTS	default study voice	open-source · Apache 2.0	1113	Low estimate is likely a voice-condition artifact, not a model-family verdict.	Qwen3-TTS technical report

Replicate open-weight estimates

These model/voice conditions were not in the human Elo pool. We rendered 12 shared prompts through Replicate, embedded the WAVs, and asked the ranker to predict pairwise wins against the current field. Treat rows as model-plus-reference estimates: the same model can move materially when cloned from a different voice.

Each row was judged against 15 current voices across 98 prompt pairs.

Model	Voice	Predicted Elo	Mean win
Spark-TTS	Andy clone	1748	78%
Supertonic 3	M1 (preset)	1682	72%
Spark-TTS	Aiden clone	1559	58%
Zonos v0.1	Andy clone	1548	57%
Dia	Andy clone	1534	55%
F5-TTS	Andy clone	1532	55%
Zonos v0.1	Aiden clone	1492	50%
Orpheus 3B	Dan (preset)	1431	43%
CSM-1B	Speaker 0	1408	41%
F5-TTS	Aiden clone	1395	39%


Continuous benchmark plan: Whisper intelligibility and UTMOS naturalness are now live for every candidate (see the three-axis scorecard below). What remains is a clean holdout of new human votes collected after the judge is trained: if it stays calibrated on those fresh votes, CodeSOTA runs as a continuous benchmark — machine scores for breadth, human votes for calibration and hard cases.

Three independent axes

Preference is not intelligibility is not naturalness

A single Elo number hides real trade-offs. Borrowing the axes professional speech-evaluation services keep separate, we score every candidate on three independent measures: predicted human preference (the Elo judge), objective intelligibility (Whisper whisper-small.en word error rate against the known script), and predicted naturalness (UTMOS, the top VoiceMOS-2022 system, on a 1–5 MOS scale). The three rankings disagree — which is exactly why one score is not enough.

Model · voice	Predicted Elo	WER	UTMOS	Disagreement
Spark-TTS · Andy clone	1748	10%	4.48	±4
Supertonic 3 · M1 (preset)	1682	10%	4.41	±1
Spark-TTS · Aiden clone	1559	10%	4.48	±2
Zonos v0.1 · Andy clone	1548	14%	4.28	±3
Dia · Andy clone	1534	16%	3.93 (#10)	±5
F5-TTS · Andy clone	1532	19%	4.29	±4
Zonos v0.1 · Aiden clone	1492	10%	4.22	±5
Orpheus 3B · Dan (preset)	1431	9%	4.36	±7
CSM-1B · Speaker 0	1408	12%	3.96	±3
F5-TTS · Aiden clone	1395 (#10)	21% (#10)	4.28	±4

The axes diverge: Spark-TTS leads on preference, but Orpheus 3B is the most intelligible (9% word error rate). F5-TTS sounds natural yet posts the highest error rate (21% WER, 33% on structured text) — it reads fluently but says the wrong words. A model that wins on charm can still fail on a phone number.

Click any column to re-sort and watch the order reshuffle. The #rank in each cell is the model's standing on that axis; the Disagreement column is the gap between a model's best and worst rank across the three — high values flag models the axes disagree about (accurate but disliked, or liked but unintelligible). Intelligibility bars are inverted (longer = lower WER), and WER uses the Whisper English text normalizer so spoken “ninety-eight thousand” and written “$98,750” count as a match.
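The Disagreement column can be sketched as a rank gap. The four voices and their scores below are made up for illustration, `ranks` and `disagreement` are hypothetical helpers, and ties are broken by input order, which may differ from how the page handles equal scores.

```python
from typing import List, Sequence


def ranks(values: Sequence[float], higher_is_better: bool = True) -> List[int]:
    """Rank 1 = best. Ties are broken by input order (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    out = [0] * len(values)
    for position, i in enumerate(order, start=1):
        out[i] = position
    return out


def disagreement(*axis_ranks: Sequence[int]) -> List[int]:
    """Gap between each model's worst and best rank across the axes."""
    return [max(r) - min(r) for r in zip(*axis_ranks)]


# Toy field of four hypothetical voices.
elo = [1700, 1600, 1500, 1400]   # higher is better
wer = [0.12, 0.10, 0.20, 0.08]   # lower is better
mos = [4.4, 4.5, 4.0, 4.2]       # higher is better

gap = disagreement(ranks(elo), ranks(wer, higher_is_better=False), ranks(mos))
print(gap)  # [2, 1, 1, 3]
```

The last toy voice is the interesting one: worst on preference, best on intelligibility, which is the "accurate but disliked" pattern the column is meant to flag.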

Model fingerprints

Every model has a shape

The same five axes, drawn as a profile per voice. A balanced pentagon is an all-rounder; a spiky one is a specialist. You can read each model's personality at a glance — where it reaches the rim and where it caves in.

Each spoke is an axis, normalized across the field (further out = better). Clockwise from top: Preference · Intelligibility · Naturalness · Speaker sim · Expressiveness. Speaker similarity is blank for preset voices with no clone target.

Profiles shown for: Spark-TTS · Andy clone; Supertonic 3 · M1 (preset); Spark-TTS · Aiden clone; Zonos v0.1 · Andy clone; Dia · Andy clone; F5-TTS · Andy clone; Zonos v0.1 · Aiden clone; Orpheus 3B · Dan (preset); CSM-1B · Speaker 0; F5-TTS · Aiden clone.

Can a metric replace listening?

No single objective metric predicts preference

If intelligibility or naturalness alone tracked human taste, you could retire listening tests. So we tested it directly: across the 15 voices that carry real human-vote Elo, how well does each objective metric correlate with the preference ranking? The answer is sobering — both are essentially flat. The most-preferred voice is not the most intelligible, and UTMOS saturates near the top so it cannot separate already-good models at all.

Intelligibility vs preference

ρ = +0.13

No significant relationship · Spearman p = 0.66, Pearson r = +0.32, n = 15.

Naturalness vs preference

ρ = +0.06

No significant relationship · Spearman p = 0.84, Pearson r = -0.10, n = 15.

Why this matters: for modern, already-good TTS the cheap proxies break down — Spearman ρ = 0.13 for intelligibility and 0.06 for naturalness, neither significant. Preference is carried by timbre, expressiveness, and delivery that WER and MOS do not capture. That is the case for the learned preference judge (ROC AUC 75%): it models the comparison humans actually make, while WER and UTMOS stay valuable as guardrails — catching a voice that mangles a phone number or sounds robotic, not ranking the good ones.

Elo from the pairwise judge leaderboard (Bradley-Terry on human votes); WER from whisper-small.en; MOS from utmos22_strong. Restricted-range caveat: this field is all strong commercial-grade voices, which compresses the metrics and is itself the finding.
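For readers who want to reproduce this kind of check on their own field, Spearman's ρ is Pearson correlation computed on ranks. The sketch below is a generic implementation with tie-averaged ranks; the numbers are toy values, not the study's 15 voices.

```python
import math
from typing import List, Sequence


def average_ranks(xs: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = shared
        i = j + 1
    return out


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson on tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rho, 2))  # 0.8
```

Working on ranks is what makes the statistic insensitive to a metric that saturates: only the ordering of voices matters, not how tightly their scores cluster.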

How the axes relate

The correlation matrix

Every axis against every other, as Spearman rank correlation across the 10 candidate voices. Blue is a positive relationship, red is negative, white is none. The diagonal is trivially 1. Two relationships jump out — and neither is about quality alone.

	Preference	Intelligibility	Naturalness	Pitch dynamism	Speaking rate
Preference	1	+0.26	+0.58	+0.14	-0.25
Intelligibility	+0.26	1	+0.36	+0.42	-0.66
Naturalness	+0.58	+0.36	1	+0.06	-0.06
Pitch dynamism	+0.14	+0.42	+0.06	1	-0.89
Speaking rate	-0.25	-0.66	-0.06	-0.89	1

Speech rate is the hidden variable. Speaking rate correlates −0.89 with pitch dynamism (fast voices flatten their intonation) and −0.66 with intelligibility (the faster a model talks, the more it fumbles structured text). Preference, meanwhile, tracks naturalness (+0.58) far more than intelligibility (+0.26): on this open-weight field the judge rewards how a voice sounds over whether every token survives. Note the contrast with the human-rated study field above, where even naturalness washed out — a restricted-range effect once every model is already excellent.

Voice character

Speaker similarity and prosody

Two more axes a voice team cares about. Speaker similarity asks, for the cloned voices, how close the generated speaker is to the intended identity — cosine similarity of wavlm-base-plus-sv speaker embeddings against a target centroid built from the established study-pool Chatterbox-Turbo Andy and Qwen3-TTS Aiden voices. Preset voices have no clone target. Prosody is reported descriptively, not as a quality score: pitch dynamism (F0 standard deviation in semitones, a proxy for expressive intonation) and speaking rate.

Model · voice	Speaker similarity	Pitch dynamism	Rate
Spark-TTS · Aiden clone	98%	4.2 st	2.0/s
Zonos v0.1 · Andy clone	97%	3.6 st	2.1/s
Zonos v0.1 · Aiden clone	97%	3.4 st	2.1/s
F5-TTS · Aiden clone	97%	3.5 st	3.2/s
Spark-TTS · Andy clone	97%	3.2 st	2.1/s
F5-TTS · Andy clone	96%	2.6 st	3.4/s
Dia · Andy clone	82%	4.8 st	1.9/s
Orpheus 3B · Dan (preset)	preset · no clone target	5.2 st	1.8/s
Supertonic 3 · M1 (preset)	preset · no clone target	4.3 st	2.1/s
CSM-1B · Speaker 0	preset · no clone target	2.6 st	2.4/s

Cloning is mostly solved; expressiveness is not preference. Spark-TTS holds the target identity best (98% similarity), while Dia drifts most (82%). Orpheus 3B swings pitch the most (5.2 semitones) yet does not top preference — more intonation is not automatically more likable.

Speaker bars are rescaled over a 70–100% window to spread the cluster. Prosody features are descriptive, not a quality score; a trained prosody-MOS predictor is future work.
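Both measures in this section reduce to short formulas once the embeddings and F0 track exist. The sketch below assumes the speaker embeddings and per-frame F0 values have already been extracted; the function names are hypothetical, the centroid is a plain unnormalized mean, and unvoiced frames are assumed to be marked as 0 or None, none of which the page specifies.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of the target speaker's embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pitch_dynamism_semitones(f0_hz: Sequence[Optional[float]]) -> float:
    """Standard deviation of voiced F0 on a semitone scale."""
    voiced = [f for f in f0_hz if f and f > 0]
    # 12 * log2(f) is semitones up to a constant offset, which cancels in the deviation.
    semis = [12.0 * math.log2(f) for f in voiced]
    mean = sum(semis) / len(semis)
    return math.sqrt(sum((s - mean) ** 2 for s in semis) / len(semis))


target = centroid([[1.0, 0.0], [0.0, 1.0]])      # toy 2-dim "embeddings"
similarity = cosine([1.0, 0.0], target)
dynamism = pitch_dynamism_semitones([100.0, 0.0, 200.0, None])
print(round(similarity, 3), round(dynamism, 1))  # 0.707 6.0
```

Measuring pitch spread in semitones rather than hertz keeps low and high voices comparable, since an octave is 12 semitones at any base pitch.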

The long view · 1970s → now

Fifty years of trying to grade a voice

None of the choices on this page — pairwise votes, an Elo scale, a neural MOS proxy, word error rate as a guardrail — are arbitrary. They are the current end of a fifty-year argument about how you measure something as slippery as “does this synthetic voice sound good?” That argument is documented in the 2025 Interspeech tutorial Automatic Quality Assessment for Speech and Beyond by Wen-Chin Huang (Nagoya University), Erica Cooper (NICT), and Jiatong Shi (CMU)[1], and the survey it builds on[2]. The failures of each era are exactly why our stack looks the way it does.

1970s–1990s
intelligibility

First, just make it understandable

Early formant and diphone synthesizers sounded robotic, so the only question worth asking was whether a listener could make out the words at all — naturalness came later. Evaluation meant the Diagnostic Rhyme Test[3], the Modified Rhyme Test[4] (BAD · BACK · BAN · BASS · BAT · BATH), and transcription of semantically-unpredictable sentences[5] — lines like “The table walked through the blue truth” that you cannot guess from context, so the score reflects the acoustics, not the listener’s language model. Comprehension tests existed too, but they saturate the moment synthetic speech is merely understandable, so they never saw wide use.

naturalness was not yet the point

1990s–2000s
naturalness

Then, make it sound human — and standardize

Unit-selection and HMM synthesis cleared the intelligibility bar, so the field pivoted to naturalness. The ITU codified subjective testing for voice-output devices in P.85[6], though its many rating scales proved redundant and saw little adoption. The lasting institution from this era is the Blizzard Challenge (2005–): a shared task with shared data and shared listening tests, still the strongest precedent for comparing TTS systems fairly and for releasing the listening-test results back to the community[7].

MOS becomes the default scorecard

2010s–now
crowdsourcing

Then, scale it — and discover it was leaky

Crowdsourcing opened listening tests to thousands of online raters: faster and cheaper, but with no control over the listening environment. Hence attention checks, qualification thresholds, and headphone-screening tests like Huggins pitch[8]. And as MOS became the universal unit of account, the critiques piled up — which is the next section.

cheap data, fragile numbers

The MOS critique

Why a single 1–5 score stopped being enough

Mean Opinion Score — average a five-point rating over many listeners and clips — is still the most-reported number in TTS papers. It is also, by the field’s own admission, deeply flawed. The tutorial collects the indictment, and most of it traces back to a handful of papers worth knowing.

Construct

Nobody agrees what “naturalness” means

Listeners are asked to rate naturalness with no shared definition, so style, expectation, and instruction wording leak into the number[9].

validity

Statistics

Averaging throws away the distribution

A 3.5 from tight agreement and a 3.5 from a bimodal split are very different signals, and a mean cannot tell them apart[11].

lost information

Comparability

You cannot compare across tests

A 4.1 in one paper and a 4.3 in another are not comparable — ratings are relative to the range of quality inside that test[11].

context dependence

Power

Too few listeners to be significant

Many published TTS evaluations simply do not run enough listeners to support the differences they claim[10].

underpowered

Bias

The test design tilts the answer

Scale layout, label words, stimulus spacing, and a listener’s mood all bias the score — a whole taxonomy of affective, response-mapping, and interface effects[12].

response mapping

Ceiling

Modern voices are all near the top

When every system scores 4.3–4.6, MOS saturates and stops separating them — the exact failure this page is built around.

restricted range

The field’s own best-practice answer: never cross-compare MOS across separate tests; report your listener count and protocol; choose test material that actually separates good systems instead of saturating; and where you can, prefer comparative tests, which have more discriminative power per listener and can be sped up with active learning that stops comparing systems already known to differ[13]. Our Elo-from-A/B-votes design is that advice taken literally.

The objective taxonomy

Five ways to score a voice without a panel

Once you accept that human listening tests are slow, expensive, and hard to reproduce, the goal becomes an automatic stand-in. The tutorial organizes every objective metric by one question: what reference does it get to see? That single axis explains why our scorecard mixes the metric types it does — and what it is still missing.

No reference · used here

Reference-free / single-ended

UTMOS · the WavLM preference judge

The metric sees only the clip — no ground truth. This is the only setting available for a fresh TTS model with no paired recording. UTMOS[25] (naturalness) and our pairwise judge both live here.

naturalness + preference

Reference text · used here

Transcript as the reference

Whisper ASR → WER

No reference audio, but you have the script: run ASR and measure word error rate. This is how intelligibility is scored today — ASR-WER tracks human transcription error at ρ≈0.94[32].

intelligibility

Partial reference · used here

Speaker-matched, not lexically matched

WavLM-ECAPA cosine similarity

A recording of the target speaker saying something else. Cosine similarity of speaker embeddings is the standard speaker-similarity proxy — ρ≈0.85 for x-vectors[33], ρ≈0.75 for ECAPA[34].

speaker similarity

Non-matched · the gap

Distributional, no per-clip pair

FAD · TTSDS

Compare the distribution of a model’s output to a pool of natural speech in an embedding space — Fréchet Audio Distance[30] and the TTS-specific TTSDS[31]. We do not run this yet; it is the obvious next axis.

not yet on the scorecard

Matched · the gap

Intrusive / double-ended

MCD · PESQ · SpeechBERTScore

A lexically-matched ground-truth recording, compared frame by frame. Powerful when it exists (MCD[16], the telephony metric PESQ[17], the SSL-based SpeechBERTScore[29]) — but TTS rarely has a paired natural take.

rarely available for TTS

Read through this lens, three of our five axes are the metric families you can actually run on an arbitrary new voice — no reference, reference-text, partial reference. The distributional and intrusive families are where a paired-corpus benchmark would extend the scorecard.

Where UTMOS came from

The lineage of automatic MOS

The naturalness number on this page (UTMOS) is the current rung of a twenty-year ladder. Each rung fixed the previous one’s blind spot, and knowing the chain is the difference between treating UTMOS as a magic 1–5 box and knowing precisely where it breaks.

1990s–2000s

Telephony signal metrics

MCD[16], f0 RMSE, and PESQ[17] / P.563[18] — built for codecs and phone lines, borrowed by early TTS.

2008–2015

Hand-built features + ML

Decision trees and SVMs over prosodic, MFCC, and spectral features; correlations climbed past 0.9 on Blizzard data.

2016–2020

End-to-end neural

MOSNet[19] (CNN-BLSTM, open-sourced) and NISQA-TTS[20] learn the rating directly from the waveform.

2021–2023

Listener modeling

MBNet[21], LDNet[22], and DeePMOS[23] model individual raters, not just the average — quality is per-listener.

2022 →

SSL fine-tuning

SSL-MOS[24] fine-tunes wav2vec2; UTMOS[25] ensembles it to win VoiceMOS 2022; SQuId[26] scales the idea to 65 locales.

2023 →

Reference-model / unsupervised

SpeechLMScore[27], VQScore[28], and FAD[30] score by distance from natural speech — no MOS labels needed.

Why this matters here: UTMOS sits at the SSL rung, trained to predict a human ACR score. That is exactly why it saturates near the top of its range on an all-strong field — it learned the distribution of human MOS, and human MOS itself saturates. The lineage predicts the failure we measure on the correlation plots above.

The case for preference

Why we collect votes, not ratings

The deepest point in the tutorial is also the simplest: MOS is not absolute, it is relative. A model trained with an L1 loss in “score space” pretends 4.2 is a fixed physical quantity, when in fact a listener only ever judges a clip against the others in front of them. Research that leans into that relativity — predicting a quality difference against a non-matching reference, as in NORESQA[35] — generalizes better than raw score regression.

Pairwise preference takes the idea to its conclusion: ask only “which of these two is better?” Preference scores can be compared across listening tests, need fewer samples for a significant result, and can be made cheap with online/active learning that stops spending votes on pairs whose winner is already obvious[13]. Learning directly from preference data has been shown to generalize better than fitting the raw scores[36]. That chain — relative judgments → pairwise tests → active sampling → a Bradley-Terry / Elo scale — is the entire spine of this page. It is the field’s recommendation, not a shortcut.
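One simple way to stop spending votes on obvious pairs is to queue the comparisons whose predicted outcome is closest to a coin flip. This is a deliberately naive heuristic offered as an assumption; the active-learning methods cited above may use richer criteria, and the matrix and voice names here are invented.

```python
from typing import List, Sequence, Tuple


def uncertainty_queue(P: Sequence[Sequence[float]], names: Sequence[str],
                      k: int = 3) -> List[Tuple[str, str]]:
    """Return the k unordered pairs whose predicted win probability is nearest 0.5."""
    scored = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            scored.append((abs(P[i][j] - 0.5), names[i], names[j]))
    scored.sort()  # smallest distance from 0.5 first
    return [(a, b) for _, a, b in scored[:k]]


# Toy win matrix: P[i][j] is the predicted chance that row i beats column j.
names = ["voice_a", "voice_b", "voice_c"]
P = [[0.50, 0.52, 0.90],
     [0.48, 0.50, 0.70],
     [0.10, 0.30, 0.50]]
queue = uncertainty_queue(P, names, k=2)
print(queue)  # [('voice_a', 'voice_b'), ('voice_b', 'voice_c')]
```

The 90/10 pair never reaches the queue, which is the point: a fresh human vote there is unlikely to change the ranking.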

Evaluation standards

Where these scores sit in the ITU framework

Subjective speech evaluation has formal standards. Mapping each axis onto them keeps the method honest and makes clear what is measured, on what scale, and what is still missing.

Preference

Comparative / CMOS

blind A/B → Bradley-Terry → Elo

Listeners pick the better of two same-prompt clips. This is a comparative test — the family behind the ITU-T P.800 comparison-category (CMOS) rating — aggregated with Bradley-Terry into a 1500-centered Elo. Comparative tests resolve small differences that absolute rating blurs.

scale · win / loss → Elo

Naturalness

ACR MOS · ITU-T P.800 / P.808

UTMOS neural predictor

Absolute Category Rating asks a listener to score one clip from 1 (bad) to 5 (excellent); the average is MOS. UTMOS[25] is trained to predict that human ACR score, so it is a no-listener proxy for P.800[14] (lab) and P.808 (crowdsourced) naturalness.

scale · 1–5 MOS

Intelligibility

Objective WER

Whisper ASR + jiwer

Not an ITU subjective test: we transcribe each clip and measure word error rate against the script. It catches the failure mode naturalness scores miss — a fluent clip that mangles a number, URL, or email.

scale · 0–100% error

Not yet covered: ITU-T P.835 (separate signal / background / overall ratings, built for noise-suppressed speech) and ITU-R BS.1534[15] (MUSHRA — a 0–100 scale with a hidden reference and low-quality anchor for fine-grained naturalness ranking). MUSHRA with screened listeners is the natural next step for a public benchmark; the Elo + UTMOS + WER stack is the low-cost machine approximation that runs on every new model the moment it ships.

Protocol explainer

How MUSHRA works

MUSHRA — MUltiple Stimuli with Hidden Reference and Anchor (ITU-R BS.1534) — is the most discriminating listening test, built to separate systems that are all already good. Rather than rating one clip in isolation, the listener sees every version of the same passage on a single screen and scores each on a continuous 0–100 scale, split into five quality bands. Crucially, two of the clips on that screen are traps.

Quality bands: Excellent 80–100 · Good 60–80 · Fair 40–60 · Poor 20–40 · Bad 0–20

Clips on one screen: System A · System B · System C · Hidden ref (a secret copy of the original) · Mid anchor (7 kHz low-pass) · Low anchor (3.5 kHz low-pass)

Legend: Systems A to C are the systems under test. Hidden reference: a reliable listener parks it near 100; if they don't, their scores are dropped. Anchors: deliberately degraded clips that pin the bottom of the scale so ratings are comparable across people.

The two hidden controls are what make it rigorous. The hidden reference is an unlabeled copy of the pristine original dropped in among the candidates. A listener who can really hear should rate it at the very top (~100); if they park it at 60, they are guessing or on bad equipment, so their whole session is thrown out. The anchors are deliberately broken versions — typically the source low-pass filtered to 3.5 kHz and 7 kHz — that nail the bottom of the scale to a known degradation, so a “40” means the same thing for every listener and every lab.
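The listener-screening step can be sketched as a filter on the hidden-reference score followed by a per-condition average. The 90-point cutoff, the session layout, and the function name are assumptions for illustration; the text above only says a reliable listener rates the reference near 100.

```python
from typing import Dict, List, Tuple

Session = Dict[str, float]  # condition name -> 0-100 score from one listener


def screen_and_average(sessions: Dict[str, Session], ref_key: str = "hidden_ref",
                       min_ref: float = 90.0) -> Tuple[List[str], Dict[str, float]]:
    """Drop listeners who under-rate the hidden reference, then average each condition."""
    kept = {name: s for name, s in sessions.items() if s[ref_key] >= min_ref}
    conditions = sorted({c for s in kept.values() for c in s})
    means = {c: sum(s[c] for s in kept.values()) / len(kept) for c in conditions}
    return sorted(kept), means


sessions = {
    "listener_1": {"hidden_ref": 98, "system_a": 70, "low_anchor": 15},
    "listener_2": {"hidden_ref": 60, "system_a": 90, "low_anchor": 50},  # fails screening
    "listener_3": {"hidden_ref": 95, "system_a": 80, "low_anchor": 10},
}
kept, means = screen_and_average(sessions)
print(kept, means["system_a"])  # ['listener_1', 'listener_3'] 75.0
```

Note how the rejected listener also rated the low anchor at 50: failing the reference check and compressing the scale tend to go together.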

Because all versions are heard side by side, MUSHRA resolves differences far smaller than absolute MOS can, and a single trained panel of ~15–20 listeners yields tight confidence intervals. The cost is exactly that: screened, trained listeners and careful session design. That is why our page leans on Elo + UTMOS + WER as the cheap, always-on approximation — and flags MUSHRA as the gold standard to reach for when two voices are too close to call.

Illustrated glossary

Every term, explained

This page leans on a lot of acronyms — error rates, opinion scores, embeddings, rating systems. Here is each one in plain language, with its own picture: what it measures, how it is computed, the scale it lives on, and where it shows up in this study.

WERWord Error Rate

Run a clip through a speech recognizer, then compare the transcript to the script the model was supposed to read. WER counts the edits needed to fix it — substitutions, insertions, and deletions — divided by the number of reference words. It is the standard objective measure of intelligibility: did the words actually survive the trip through synthesis? We normalize both sides with the Whisper text normalizer first, so spoken “ninety-eight thousand” and written “$98,750” are treated as the same.

0% is perfect; 50% means half the words are wrong. On this page F5-TTS hits ~31% on structured text.
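The edit count behind WER is a word-level Levenshtein distance. The sketch below is a minimal version that assumes both strings have already been normalized and lowercased; the Whisper normalizer step is not reproduced here, and the function name is ours. The two example pairs are the acronym and email transcripts printed in the failure cards on this page.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


acronyms = word_error_rate(
    "the api uses oauth jwt tls and http 2",
    "the api uses yogeo thtp",
)
emails = word_error_rate(
    "the escalation alias is support dash priority at codesota dot com",
    "i will see you next time",
)
print(round(acronyms, 2))  # 0.67
print(emails)              # 1.0
```

Because insertions count as errors too, WER can exceed 100% when the transcript is much longer than the script.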

ACR / MOS · Absolute Category Rating

The oldest and most common subjective test, standardized in ITU-T P.800. A listener hears one clip in isolation and rates it on a five-point scale: 5 excellent, 4 good, 3 fair, 2 poor, 1 bad. Average those ratings across many listeners and clips and you get the Mean Opinion Score. Because it rates clips independently it is simple to run, but it blurs small differences — two great voices both land near 4.5.

P.808 is the crowdsourced variant. UTMOS predicts this 1–5 score with no human in the loop.

CMOS · Comparison MOS

Instead of rating one clip alone, the listener hears two and judges which is better and by how much, usually on a −3 to +3 scale. Comparative tests resolve differences that absolute rating misses, because the brain is far better at “A is slightly better than B” than at pinning an absolute number on a single clip. Our blind A/B vote is the binary version: just pick the winner, no magnitude.

Podonos’ head-to-head slider is a CMOS readout; our Elo aggregates thousands of these binary calls.

MUSHRA · ITU-R BS.1534

Multiple Stimuli with Hidden Reference and Anchor — the most discriminating subjective protocol. The listener rates several clips at once on a continuous 0–100 scale, while a known high-quality reference and a deliberately degraded low anchor are hidden among them to calibrate the scale and catch inattentive raters. It needs trained listeners and is expensive, which is why it is reserved for fine-grained ranking of already-good systems.

This is the gold standard our cheap Elo + UTMOS + WER stack approximates at near-zero cost.

Elo · Pairwise rating

A rating system borrowed from chess. Everyone starts at 1500; after each match the winner takes points from the loser, and the amount depends on how surprising the result was — beating a much higher-rated voice earns more. Over many comparisons the ratings settle into a ranking. The scale is interpretable: a 400-point gap implies the higher voice should win about 10 times out of 11.

On the leaderboard, Chatterbox Turbo sits near 1770 and the weakest voice near 1110.
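The 400-point rule of thumb follows directly from the standard Elo expected-score formula. A minimal sketch: the K-factor of 32 is an assumed, commonly used value, since the page does not state which update constant the leaderboard uses, and the function names are ours.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, a_won, k=32.0):
    """Return both ratings after one comparison; k=32 is an assumption."""
    surprise = (1.0 if a_won else 0.0) - expected_score(rating_a, rating_b)
    delta = k * surprise
    # The winner gains exactly what the loser gives up.
    return rating_a + delta, rating_b - delta


# A 400-point gap gives the stronger voice 10 wins out of 11.
print(round(expected_score(1900, 1500), 4))  # 0.9091

# An upset moves ratings far more than an expected result.
upset_gain = update(1500, 1900, a_won=True)[0] - 1500
expected_gain = update(1900, 1500, a_won=True)[0] - 1900
print(round(upset_gain, 2), round(expected_gain, 2))  # 29.09 2.91
```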

Bradley–Terry · Win-prob → rating

A statistical model that takes a whole table of pairwise win probabilities and solves for one strength number per competitor that best explains them. Where Elo updates incrementally one match at a time, Bradley–Terry fits the entire set of comparisons at once, which is more stable when data is sparse. We use it to convert the judge’s predicted win matrix into the 1500-centered scale you see.

It is the math that lets a few hundred votes produce a coherent full ranking.
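As a rough illustration of that conversion, the sketch below fits Bradley–Terry strengths with the classic minorization-maximization update and maps them onto a 1500-centered, 400-points-per-decade scale. It is not the page's actual fitting code: the function names, the iteration count, and the three-voice matrix are assumptions, and the matrix is generated from known ratings so the recovery can be checked.

```python
import math


def bradley_terry(wins, iterations=200):
    """Fit strengths from wins[i][j] = (possibly fractional) wins of i over j.

    Assumes every competitor has some non-zero win mass, which holds when the
    entries are predicted probabilities strictly between 0 and 1.
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Strengths are only defined up to scale; pin the geometric mean to 1.
        log_mean = sum(math.log(s) for s in updated) / n
        strength = [s / math.exp(log_mean) for s in updated]
    return strength


def to_elo_scale(strength, center=1500.0):
    """Geometric mean 1 maps to the center; a 10x strength ratio is 400 points."""
    return [center + 400.0 * math.log10(s) for s in strength]


# Hypothetical predicted win matrix built from known ratings 1700 / 1500 / 1300.
true_ratings = [1700.0, 1500.0, 1300.0]
predicted = [
    [0.0 if i == j else 1.0 / (1.0 + 10 ** ((b - a) / 400.0))
     for j, b in enumerate(true_ratings)]
    for i, a in enumerate(true_ratings)
]

ratings = to_elo_scale(bradley_terry(predicted))
print([round(r) for r in ratings])  # [1700, 1500, 1300]
```

Because the whole matrix is fitted at once, the result does not depend on the order in which comparisons arrived, unlike an incremental Elo update.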

WavLM · Speech encoder

A large self-supervised transformer from Microsoft, trained on huge amounts of unlabeled audio to predict masked speech. The payoff is that its internal layers turn any clip into a dense vector that already encodes speaker identity, prosody, accent, and recording quality — without anyone labeling those things. The preference judge reads a 768-dimensional WavLM vector per clip as its main feature.

We use microsoft/wavlm-base-plus, mean-pooled across time into one vector.

x-vector · Speaker embedding

A fixed-length fingerprint of who is speaking, designed so that two clips of the same person land close together regardless of what words are said. Speaker-verification models produce them, and the cosine similarity between two x-vectors is a number from −1 to 1 (in practice usually between 0 and 1 for speech) measuring how much two voices sound like the same identity. For cloned voices it answers: did the model actually capture the target speaker?

We compare each clone against an Andy/Aiden centroid; Dia drifts most at ~0.82.
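Comparing a clone against a centroid amounts to averaging the reference embeddings and taking one cosine. The sketch below uses toy 3-dimensional vectors purely for illustration; real speaker embeddings have hundreds of dimensions, and the vectors and function names here are hypothetical rather than taken from the study.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Two reference clips of the target speaker (toy values).
reference_clips = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
target = centroid(reference_clips)  # [0.9, 0.3, 0.0]

faithful_clone = [0.9, 0.3, 0.0]   # points the same way as the centroid
drifting_clone = [0.6, 0.8, 0.0]   # partly rotated away from it

print(round(cosine_similarity(target, faithful_clone), 3))  # 1.0
print(round(cosine_similarity(target, drifting_clone), 3))  # 0.822
```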

UTMOS · Neural MOS predictor

A neural network that listens to a clip and predicts what MOS score human raters would give it, trained on large banks of human-rated audio. utmos22_strong was the top system in the VoiceMOS 2022 challenge. It is a free, instant stand-in for a naturalness panel — but, as this page shows, it saturates near the top, so it cannot separate two already-excellent voices.

On the candidate field it ranges 3.9 (Dia) to 4.5 (Spark-TTS) on the 1–5 scale.

LUFS · Loudness units

Loudness Units relative to Full Scale, the unit of perceived loudness used by the EBU R128 standard (measured with the ITU-R BS.1770 algorithm) and the same measure streaming services use to keep tracks at an even volume. It models how loud audio actually sounds to a person, not just its peak amplitude. We normalize every clip to −21 LUFS before scoring so the judge compares timbre and delivery, not which clip happens to be louder.

Normalizing collapsed the field’s loudness spread from ±2.9 to ±0.3 LUFS.
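Once a clip's integrated loudness has been measured, normalization is a single gain. The sketch below covers only that arithmetic and assumes the measurement comes from a BS.1770-style meter that is not implemented here; the −18.1 LUFS reading and the sample values are hypothetical.

```python
def gain_to_target(measured_lufs, target_lufs=-21.0):
    """Return (gain in dB, linear amplitude factor) that moves a clip to the target."""
    gain_db = target_lufs - measured_lufs
    linear = 10 ** (gain_db / 20.0)  # amplitude ratio, hence 20 rather than 10
    return gain_db, linear


def apply_gain(samples, linear_gain):
    """Scale every sample by the same factor."""
    return [sample * linear_gain for sample in samples]


# A clip measured 2.9 LU above the target needs about 2.9 dB of attenuation.
gain_db, linear = gain_to_target(-18.1)
print(round(gain_db, 1), round(linear, 3))  # -2.9 0.716

quieter = apply_gain([0.5, -0.25, 0.1], linear)
```

A uniform gain leaves timbre and dynamics untouched, which is the point: only the overall level changes before the judge sees the clip.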

Spearman ρ · Rank correlation

A measure of whether two rankings agree, from −1 (perfectly reversed) through 0 (unrelated) to +1 (identical order). Unlike Pearson correlation it only cares about order, not exact values, so it is robust to odd scales and outliers. We use it to ask the central question: does any objective metric rank voices the same way humans do?

Intelligibility vs preference came out at ρ = 0.13 — essentially no relationship.
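Spearman ρ is simply Pearson correlation computed on ranks. A self-contained sketch follows, with tied values sharing their average rank; the ratings and error rates in the example are made up for illustration and are not the study's data.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings and word error rates for five voices.
elo = [1770, 1650, 1500, 1320, 1110]
wer = [10, 9, 16, 12, 21]
rho = spearman(elo, wer)
print(round(rho, 2))  # -0.8

# Order is all that matters: a monotone but non-linear relation still scores 1.
rho_monotone = spearman([1, 2, 3, 4], [1, 10, 100, 1000])
print(round(rho_monotone, 2))  # 1.0
```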

Restricted range · Why ρ can vanish

A statistical trap: correlation needs spread in the data to detect a relationship. When every model in a comparison is already excellent, each metric’s values bunch into a narrow band, and even a real underlying relationship collapses toward zero. The flat correlations on the study field are partly this effect, and noticing it is itself the finding, because it explains why off-the-shelf metrics fail exactly where you most need them: separating the best from the very best.

UTMOS spanning only 4.1–4.5 across all 15 study voices is restricted range in action.
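The effect is easy to reproduce with synthetic data. In the sketch below a metric tracks a latent quality score with equal parts signal and noise, so the correlation over the full population should sit near 0.7; keeping only the top tenth by quality shrinks the spread and the correlation with it. Everything here is simulated under those assumptions, with Pearson correlation used for brevity, and none of it is study data.

```python
import math
import random


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)
quality = [random.gauss(0.0, 1.0) for _ in range(5000)]
# The metric is genuinely related to quality, with noise of the same size.
metric = [q + random.gauss(0.0, 1.0) for q in quality]

r_full = pearson(quality, metric)

# Keep only the best 10% of systems: the "everyone is already excellent" case.
cutoff = sorted(quality)[int(0.9 * len(quality))]
top = [(q, m) for q, m in zip(quality, metric) if q >= cutoff]
r_top = pearson([q for q, _ in top], [m for _, m in top])

print(f"full range: r = {r_full:.2f}")
print(f"top decile: r = {r_top:.2f}")
```

The underlying relationship is identical in both cases; only the range of quality values being compared has changed.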

Content-type analysis

Where models break: structured text vs natural prose

Overall Elo hides where a model actually struggles. We split the shared prompts into symbol-heavy structured text (numbers, URLs, emails, dates, addresses, finance, acronyms) and natural prose (plain, entity, legal, medical, support), then asked the judge for each candidate's win rate against the field within each bucket. Six of the ten candidates below are weaker on structured text, the input that stresses normalization and grapheme-to-phoneme handling rather than voice quality; the other four do better on it.

Orpheus 3B is the sharpest example: a 64% win rate on natural prose collapses to 22% on structured text — a 42-point gap. It reads sentences cleanly but mangles numbers and URLs, which is exactly what drags down its overall estimate.

Model · voice	Structured text	Natural prose	Gap
Orpheus 3B · Dan (preset)	22%	64%	−42
Dia · Andy clone	45%	60%	−15
Supertonic 3 · M1 (preset)	62%	76%	−14
Spark-TTS · Aiden clone	48%	62%	−14
F5-TTS · Andy clone	48%	55%	−7
Spark-TTS · Andy clone	73%	78%	−6
F5-TTS · Aiden clone	40%	33%	
Zonos v0.1 · Andy clone	58%	44%	+14
CSM-1B · Speaker 0	47%	32%	+15
Zonos v0.1 · Aiden clone	57%	33%	+24

Bars show predicted win rate vs the 15-voice study field, averaged within each content bucket. Gap is structured minus natural in points; red means the model loses ground on structured text.

Failure map

Which content type breaks which model

The bucket view averages a lot away. Here is every model against all twelve content types, coloured by word error rate. The structured columns (left) light up red almost everywhere; natural prose (right) stays green. Numbers, URLs, and emails are where intelligibility goes to die.

Model · voice	Acronyms	Addresses	Dates	Emails	Finance	Numbers	URLs	Names	Legal	Medical	All
Orpheus 3B · Dan (preset)	11	18	17	18	0	14	25	0	0	0	9
Supertonic 3 · M1 (preset)	11	0	17	27	18	0	25	17	0	0	10
Zonos v0.1 · Aiden clone	11	0	0	27	18	0	42	0	17	0	10
Spark-TTS · Aiden clone	0	0	0	27	0	0	50	0	0	40	10
Spark-TTS · Andy clone	0	0	17	27	0	14	50	0	0	10	10
CSM-1B · Speaker 0	33	27	0	27	0	43	17	0	0	0	12
Zonos v0.1 · Andy clone	11	27	17	27	27	14	17	0	25	0	14
Dia · Andy clone	11	0	17	100	9	14	25	0	0	10	16
F5-TTS · Andy clone	67	9	17	27	18	57	25	0	0	10	19
F5-TTS · Aiden clone	67	36	17	27	18	43	25	0	0	20	21

Cell = mean word error rate (%) for that model on that content type. Green is accurate, red is broken. The seven left columns are structured/symbol-heavy text; the five right are natural prose.

Anatomy of a failure

What the model actually said

WER is abstract until you read the transcripts. These are real clips, transcribed by Whisper: the script the model was given versus what came out. Struck-through words were dropped; highlighted words are wrong or hallucinated. Smooth on prose, shattered on symbols — the most spectacular failure is a model that abandoned the prompt entirely and recited a YouTube outro.

Emails · Dia · Andy clone · 100% WER

script

the escalation alias is support dash priority at codesota dot com

heard

i will see you next time

Acronyms · F5-TTS · Andy clone · 67% WER

script

the api uses oauth jwt tls and http 2

heard

the api uses yogeo thtp

Numbers · F5-TTS · Andy clone · 57% WER

script

the confirmation code is 739 184 552

heard

the comfort code is 7398048

URLs · Spark-TTS · Andy clone · 50% WER

script

visit status dot example dot com slash incidents slash april dash report

heard

visit status example com incidence april report

Medical · Spark-TTS · Aiden clone · 40% WER

script

record the dosage as 25 milligrams twice daily with food

heard

it will print the dosage as 25 mg twice daily with food

Addresses · F5-TTS · Aiden clone · 36% WER

script

ship the replacement unit to 742 evergreen terrace springfield oregon 97403

heard

ship the replacement unit to 742 evergreen terry springfield oregon 97 4 3

Active-learning queue

Ask humans about these next

The best next comparisons are not the most famous models. They are the pairs where the model is near 50/50, where prompt support is thin, or where a voice-condition mismatch could be hiding in the data. This is no longer just a table: the live voting page now samples pairs by the same logic — after every vote it favors near-tied ratings and under-tested voices, so human attention flows to the comparisons that move the ranking most.

[Figure: queued pairs plotted by predicted A win probability and prompt support. Highest value: near 50/50, low support.]

Pair	Predicted split	Prompt support	Why it matters
Chatterbox Turbo / default study voice vs Chatterbox Turbo / Andy	50 / 50	8	Near-tie: high leverage for rank ordering.
Gradium TTS / Kent vs Kokoro v1.0 / af_heart	52 / 48	3	Near-tie: high leverage for rank ordering.
ElevenLabs v3 / James vs Gradium TTS / Kent	52 / 48	12	Near-tie: high leverage for rank ordering.
Gradium TTS / Kent vs Qwen3 TTS / Aiden	47 / 53	15	Near-tie: high leverage for rank ordering.
Kokoro v1.0 / am_michael vs Chatterbox Turbo / default study voice	46 / 54	4	Sparse prompt overlap: add votes before trusting the edge.
Gradium TTS / Kent vs Speech-02 Turbo / default study voice	46 / 54	7	Stable enough to track, still useful for calibration.
Speech-02 HD / default study voice vs Speech-02 Turbo / English_Deep-VoicedGentleman	46 / 54	5	Sparse prompt overlap: add votes before trusting the edge.
ElevenLabs v3 / default study voice vs Gradium TTS / Kent	55 / 45	4	Sparse prompt overlap: add votes before trusting the edge.
Gradium TTS / Kent vs Speech-02 HD / default study voice	45 / 55	6	Stable enough to track, still useful for calibration.
Gradium TTS / Kent vs Speech-02 HD / English_Deep-VoicedGentleman	55 / 45	11	Stable enough to track, still useful for calibration.
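One way to turn "near 50/50, low support" into a sortable number is to multiply a closeness term by a scarcity term. The formula below is a hypothetical heuristic written for this illustration, not the sampler the live voting page uses; the four rows are copied from the top of the table above.

```python
import math


def priority(p_a_wins, prompt_support):
    """Hypothetical score: higher means a human vote is more informative."""
    closeness = 1.0 - abs(2.0 * p_a_wins - 1.0)          # 1 at 50/50, 0 at certainty
    scarcity = 1.0 / math.sqrt(1.0 + prompt_support)     # shrinks as prompts accumulate
    return closeness * scarcity


# (pair, predicted probability that the first voice wins, prompt support)
queue = [
    ("Chatterbox Turbo / default study voice vs Chatterbox Turbo / Andy", 0.50, 8),
    ("Gradium TTS / Kent vs Kokoro v1.0 / af_heart", 0.52, 3),
    ("ElevenLabs v3 / James vs Gradium TTS / Kent", 0.52, 12),
    ("Gradium TTS / Kent vs Qwen3 TTS / Aiden", 0.47, 15),
]

ranked = sorted(queue, key=lambda row: priority(row[1], row[2]), reverse=True)
for pair, p, support in ranked:
    print(f"{priority(p, support):.3f}  {pair}")
```

Under this particular weighting the 52/48 pair with only three shared prompts outranks the exact 50/50 pair with eight, which shows how strongly the choice of scarcity term shapes the queue.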

Current limitation: this baseline mixes an older female/default voice pool with the newer male-voice condition. That is useful for debugging the method, but the public benchmark should either split by voice condition or retrain on a clean male-only study batch.

The VoiceMOS Challenge · generalization

The hard part is generalization, and the fix is a moving baseline

Automatic MOS prediction has its own benchmark series — the VoiceMOS Challenge (2022[41], 2023[42], 2024[43]) and now the AudioMOS Challenge 2025[44] — and its central lesson is blunt: the whole problem is generalization. A predictor that hits 0.939 system-level SRCC in-domain[41] can fall apart on a new TTS system, a new listening test, a new language, or a new distortion. In practice you should assume the next thing you score is out-of-domain.

That is precisely the “restricted range” caveat stamped on our correlation plots above: a judge trained on one field of strong commercial voices is an in-domain instrument, and we say so rather than pretend it is universal. The challenge organizers also make a point baked into CodeSOTA’s design — the baseline should be state-of-the-art. If your starter system is SOTA and a participant beats it, that is provable progress; a weak baseline just manufactures the illusion of it.

Skin in the game: we are building an emotional-TTS MOS predictor for VoiceMOS Challenge 2026 Track 2, extending the UTMOS lineage[25] with an emotion encoder. Competing on the same benchmark we cite here is the cleanest way to keep this page honest — and to harden the auto-scorers behind the scorecard.

The frontier

Evaluation that explains itself

A single number — even a perfectly calibrated one — does not tell a voice team what to fix. Two directions in the tutorial point past the scalar MOS, and both match where this page is already heading.

Multi-dimensional

One model, several named axes

Instead of collapsing everything into “naturalness,” predict interpretable dimensions. NISQA[37] outputs noisiness, coloration, discontinuity, and loudness; Meta’s Audiobox Aesthetics[40] outputs production quality, production complexity, content enjoyment, and content usefulness. Our five axes plus per-vote factor tags (expressiveness · pacing · pronunciation · hallucinations) are the same instinct: a scorecard, not a score.

Explainable

Language, not just numbers

The newest work asks an audio language model to describe the defect — “a distorted, electric-current quality from 1.5–2.0s” — localizing it in time and attributing a cause. QualiSpeech[38] and ALLD[39] are early steps toward evaluation that reads like a reviewer’s note. The tutorial calls this the ultimate goal; for a public benchmark it is the difference between a leaderboard and a diagnosis.

Standard infrastructure

The toolkits the field actually runs on

None of this requires reinventing metrics. The community has converged on shared infrastructure, and our stack is deliberately assembled from the same parts so results stay comparable to published work.

Benchmark

MOS-Bench

Seven training sets and twelve test sets spanning TTS, voice conversion, singing, and distorted speech across five languages and 8–48 kHz — built specifically to measure the generalization a single listening test cannot.

SQA generalization

Toolkit

SHEET

All-in-one recipes for speech MOS prediction — data prep, training, and off-the-shelf models via torch.hub / HuggingFace. The lineage from SSL-MOS[24] to UTMOS[25] ships here.

train + infer

Toolkit

VERSA

Versatile Evaluation of Speech and Audio: ~90 metrics behind one interface, integrated into ESPnet and the CHiME challenges; Uni-VERSA predicts many at once for a ~100× speedup[1].

~90 metrics, one API

Our own stack — UTMOS via SpeechMOS, Whisper for WER, WavLM-ECAPA for speaker similarity, and a Bradley-Terry preference judge — is a small, opinionated slice of exactly these toolkits, chosen to run at near-zero marginal cost on every model the moment it ships.

Research position

The useful claim is workflow, not final rank

Low-volume human preference data is too scarce for a universal voice judge. The stronger claim is that a lightweight model can turn a few hundred votes into a ranked map of uncertainty. WavLM embeddings capture speaker, prosody, and quality cues; simple acoustic features catch duration, loudness, and spectral shape; and Bradley-Terry turns model-level pair probabilities into a readable scale.

The scorecard now spans five axes: preference (Elo), intelligibility (Whisper WER), naturalness (UTMOS), speaker similarity, and prosody. They are grounded in the ITU subjective-evaluation standards above, with a per-vote factor (expressiveness, pacing, pronunciation, hallucinations) recorded so wins can be attributed, not just counted. The roadmap below lays out what already ships, what is being built, and the longer bets, read against the history and taxonomy on this page.

Roadmap

Research goals: what ships, what’s next

This is a benchmark that grows without pretending automation is truth — so the goals are explicit. Each one is tagged by where it stands, and the open items are the field’s open items too, not just ours.

Shipped · Building · Frontier

Shipped

Five-axis scorecard

Preference (Elo via Bradley-Terry), intelligibility (Whisper WER), naturalness (UTMOS[25]), speaker similarity (WavLM-ECAPA cosine), and descriptive prosody — every axis runs on any new voice the moment it ships, at near-zero marginal cost.

Shipped

Attributed wins, not just counts

The voting page records a per-vote factor — expressiveness · pacing · pronunciation · hallucinations — so a win can be traced to a cause. This is the multi-dimensional instinct (cf. NISQA[37], Audiobox Aesthetics[40]) applied to preference.

Shipped

Active-learning vote queue

New comparisons are chosen by uncertainty sampling — near-tied or under-sampled pairs — rather than uniform random, the cheaper-preference-test recipe from the literature[13].

Building

Fix the calibration

The reliability diagram above shows the judge is overconfident (a claimed 97% win lands nearer 77%). Next: a temperature-scaling pass, then a held-out set of fresh human votes collected after training to confirm the fix holds. Trust the ranking, not the literal probability — yet.
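Temperature scaling divides every logit by one scalar T fitted on held-out outcomes, which changes confidence without changing any pair's predicted winner. The sketch below fits T by grid search on a deliberately simple synthetic set built from the 97% versus 77% example above: 100 pairs all predicted at 97%, of which 77 were actually won. The data, the grid bounds, and the function names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def negative_log_likelihood(temperature, logits, outcomes):
    total = 0.0
    for z, won in zip(logits, outcomes):
        p = sigmoid(z / temperature)
        total -= math.log(p) if won else math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, outcomes, low=0.5, high=5.0, step=0.01):
    """Grid search for the temperature that minimizes held-out NLL."""
    steps = int(round((high - low) / step))
    candidates = [low + k * step for k in range(steps + 1)]
    return min(candidates, key=lambda t: negative_log_likelihood(t, logits, outcomes))


# Synthetic held-out set: the judge says 97%, listeners agree 77% of the time.
logit_97 = math.log(0.97 / 0.03)
logits = [logit_97] * 100
outcomes = [1] * 77 + [0] * 23

temperature = fit_temperature(logits, outcomes)
calibrated = sigmoid(logit_97 / temperature)
print(round(temperature, 2), round(calibrated, 2))  # 2.88 0.77
```

A fitted T above 1 is the signature of overconfidence. Because dividing by a positive constant preserves the sign of every logit, the ranking implied by the judge stays as it was.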

Building

Report why a clip wins

Enough factor votes to attribute each win to a named cause at statistical strength — turning the scorecard from “which won” into “won on delivery, lost on a mangled number.”

Building

Add the distributional axis

The non-matched-reference family missing from our five — Fréchet Audio Distance[30] and the TTS-specific TTSDS[31] — scoring a model by how close its output distribution sits to natural speech.

Building

A trained prosody-MOS predictor

Prosody is currently descriptive (F0 dynamism, speaking rate), not a learned quality score. A prosody-specific judge should re-sweep SSL layers: quality lives in the last layer, but timbre and prosody peak mid-stack.

Frontier

A MUSHRA human anchor

A screened-listener MUSHRA panel (ITU-R BS.1534[15]) as the gold-standard calibration that the cheap Elo + UTMOS + WER stack is approximating — the natural next step when two voices are too close to call.

Frontier

Evaluation that explains itself

Localized, attributed defect descriptions in natural language — “electric-current distortion from 1.5–2.0s” — the audio-LLM direction of QualiSpeech[38] and ALLD[39]. The tutorial calls this the ultimate goal.

Frontier

Generalize past the restricted range

Today’s judge is an in-domain instrument trained on one field of strong commercial voices. The open problem — and the whole point of the VoiceMOS Challenge series[41],[43] — is out-of-domain generalization to new systems, languages, and distortions.

Frontier

Compete at VoiceMOS 2026

Skin in the game: an emotional-TTS MOS predictor for Track 2, extending the UTMOS lineage[25] with an emotion encoder. The best way to keep this page honest is to be scored on the same benchmark we cite.

Sources & further reading

Where this comes from

The history, taxonomy, and frontier on this page are drawn from the Interspeech 2025 tutorial Automatic Quality Assessment for Speech and Beyond by Wen-Chin Huang (Nagoya University), Erica Cooper (NICT), and Jiatong Shi (CMU)[1], and the survey it builds on, Cooper et al. (2024)[2]. The numbered references below back the specific claims throughout — each bracketed marker links here.

[1] W.-C. Huang, E. Cooper, and J. Shi. "Automatic Quality Assessment for Speech and Beyond." Tutorial, Interspeech 2025.

[2] E. Cooper, W.-C. Huang, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi. "A review on subjective and objective evaluation of synthetic speech." Acoustical Science and Technology 45, no. 4 (2024): 161–183.

[3] W. D. Voiers. "Evaluating processed speech using the diagnostic rhyme test." Speech Technology 1 (1983): 30–39.

[4] A. S. House, C. Williams, M. H. Hecker, and K. D. Kryter. "Psychoacoustic speech tests: A modified rhyme test." The Journal of the Acoustical Society of America 35 (1963): 1899.

[5] M. Grice. "Syntactic structures and lexicon requirements for semantically unpredictable sentences in a number of languages." Proc. Speech Input/Output Assessment and Speech Databases, vol. 2 (1989): 19–22.

[6] ITU-T Rec. P.85. "A method for subjective performance assessment of the quality of speech voice output devices." International Telecommunication Union, 1994.

[7] O. Perrotin, B. Stephenson, S. Gerber, G. Bailly, and S. King. "Refining the evaluation of speech synthesis: A summary of the Blizzard Challenge 2023." Computer Speech & Language 90 (2025): 101747.

[8] A. E. Milne, R. Bianco, K. C. Poole, S. Zhao, A. J. Oxenham, A. J. Billig, and M. Chait. "An online headphone screening test based on dichotic pitch." Behavior Research Methods 53, no. 4 (2021): 1551–1562.

[9] P. Wagner, J. Beskow, S. Betz, J. Edlund, J. Gustafson, G. Eje Henter, S. Le Maguer, Z. Malisz, É. Székely, C. Tånnander, and J. Voße. "Speech Synthesis Evaluation — State-of-the-Art Assessment and Suggestion for a Novel Research Program." ISCA SSW 10, 2019.

[10] M. Wester, C. Valentini-Botinhao, and G. E. Henter. "Are we using enough listeners? No! — An empirically-supported critique of Interspeech 2014 TTS evaluations." Interspeech 2015.

[11] R. C. Streijl, S. Winkler, and D. S. Hands. "Mean opinion score (MOS) revisited: methods and applications, limitations and alternatives." Multimedia Systems 22, no. 2 (2016): 213–227.

[12] S. Zielinski, F. Rumsey, and S. Bech. "On some biases encountered in modern audio quality listening tests — a review." Journal of the Audio Engineering Society 56, no. 6 (2008): 427–451.

[13] Y. Yasuda and T. Toda. "Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment." arXiv:2403.06100, 2024.

[14] ITU-T Rec. P.800. "Methods for subjective determination of transmission quality." International Telecommunication Union, 1996.

[15] ITU-R Rec. BS.1534-3. "Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems (MUSHRA)." International Telecommunication Union, 2015.

[16] R. Kubichek. "Mel-cepstral distance measure for objective speech quality assessment." IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, 1993.

[17] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra. "Perceptual evaluation of speech quality (PESQ) — a new method for speech quality assessment of telephone networks and codecs." ICASSP 2001. (ITU-T Rec. P.862.)

[18] ITU-T Rec. P.563. "Single-ended method for objective speech quality assessment in narrow-band telephony applications." International Telecommunication Union, 2004.

[19] C.-C. Lo, S.-W. Fu, W.-C. Huang, et al. "MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion." Interspeech 2019.

[20] G. Mittag and S. Möller. "Deep Learning Based Assessment of Synthetic Speech Naturalness." Interspeech 2020.

[21] Y. Leng, X. Tan, S. Zhao, F. Soong, X.-Y. Li, and T. Qin. "MBNet: MOS prediction for synthesized speech with mean-bias network." ICASSP 2021.

[22] W.-C. Huang, E. Cooper, J. Yamagishi, and T. Toda. "LDNet: Unified listener dependent modeling in MOS prediction for synthetic speech." ICASSP 2022.

[23] X. Liang, F. Cumlin, C. Schüldt, and S. Chatterjee. "DeePMOS: Deep Posterior Mean-Opinion-Score of Speech." Interspeech 2023.

[24] E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi. "Generalization ability of MOS prediction networks." ICASSP 2022.

[25] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari. "UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022." Interspeech 2022.

[26] T. Sellam, A. Bapna, J. Camp, D. Mackinnon, A. P. Parikh, and J. Riesa. "SQuId: Measuring speech naturalness in many languages." ICASSP 2023.

[27] S. Maiti, Y. Peng, T. Saeki, and S. Watanabe. "SpeechLMScore: Evaluating speech generation using speech language model." ICASSP 2023.

[28] S.-W. Fu, K.-H. Hung, Y. Tsao, and Y.-C. F. Wang. "Self-supervised speech quality estimation and enhancement using only clean speech (VQScore)." ICLR 2024.

[29] T. Saeki, S. Maiti, S. Takamichi, S. Watanabe, and H. Saruwatari. "SpeechBERTScore: reference-aware automatic evaluation of speech generation leveraging NLP evaluation metrics." Interspeech 2024.

[30] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi. "Fréchet Audio Distance: A reference-free metric for evaluating music enhancement algorithms." Interspeech 2019.

[31] C. Minixhofer, O. Klejch, and P. Bell. "TTSDS — Text-to-Speech Distribution Score." IEEE SLT 2024.

[32] F. Hinterleitner, S. Zander, K.-P. Engelbrecht, and S. Möller. "On the use of automatic speech recognizers for the quality and intelligibility prediction of synthetic speech." Konferenz Elektronische Sprachsignalverarbeitung, 2015.

[33] R. K. Das, T. Kinnunen, W.-C. Huang, et al. "Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions." Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge, 2020.

[34] J. Ahn, Y. Kim, Y. Choi, D. Kwak, J.-H. Kim, S. Mun, and J. S. Chung. "VoxSim: a perceptual voice similarity dataset." Interspeech 2024.

[35] P. Manocha, B. Xu, and A. Kumar. "NORESQA: A framework for speech quality assessment using non-matching references." NeurIPS 2021.

[36] C.-H. Hu, Y. Yasuda, and T. Toda. "E2EPref: An end-to-end preference-based framework for speech quality assessment to alleviate bias in direct assessment scores." Computer Speech & Language 93 (2025).

[37] G. Mittag, B. Naderi, A. Chehadi, and S. Möller. "NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets." Interspeech 2021.

[38] S. Wang, et al. "QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions." arXiv:2503.20290, 2025.

[39] C. Chen, Y. Hu, S. Wang, H. Wang, Z. Chen, C. Zhang, C.-H. Huck Yang, and E. S. Chng. "Audio large language models can be descriptive speech quality evaluators." ICLR 2025.

[40] A. Tjandra, Y.-C. Wu, B. Guo, J. Hoffman, B. Ellis, A. Vyas, B. Shi, S. Chen, M. Le, N. Zacharov, C. Wood, A. Lee, and W.-N. Hsu. "Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound." arXiv:2502.05139, 2025.

[41] W.-C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi. "The VoiceMOS Challenge 2022." Interspeech 2022: 4536–4540.

[42] E. Cooper, W.-C. Huang, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi. "The VoiceMOS Challenge 2023: Zero-Shot Subjective Speech Quality Prediction for Multiple Domains." ASRU 2023.

[43] W.-C. Huang, S.-W. Fu, E. Cooper, R. E. Zezario, T. Toda, H.-M. Wang, J. Yamagishi, and Y. Tsao. "The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction." IEEE SLT 2024.

[44] W.-C. Huang, H. Wang, C. Liu, Y.-C. Wu, A. Tjandra, W.-N. Hsu, E. Cooper, Y. Qin, and T. Toda. "The AudioMOS Challenge 2025." ASRU 2025.

Citations follow the tutorial’s reference lists. Credit for the synthesis of this field belongs to its authors; any errors in paraphrase are ours.