Codesota · Speech · Blind human preference evaluation
Register for the TTS listening study

Help decide which AI voice actually sounds better.

You will listen to short clips labeled A, B, and C, each reading the same text. Your choices help build a public TTS ranking based on human preference rather than on automatic scores alone.

§ 01 · Kokoro lab

The open-source baseline is audible and measurable.

I rendered Kokoro-82M voices with the same prompt, then extracted pitch, voicing, brightness, zero-crossing rate, and MFCC views. These descriptors are not a final ranking, but they show where the human preference study should look closely.
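As a rough illustration of two of these descriptors, here is a minimal sketch of zero-crossing rate and spectral centroid ("brightness") for a single frame. It is not the pipeline used for the study: the function names, the naive DFT, and the synthetic 1000 Hz test tone are all assumptions chosen so the example runs with the standard library only.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of one frame, in Hz (naive DFT)."""
    n = len(frame)
    total = 0.0
    weighted = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        magnitude = abs(coeff)
        total += magnitude
        weighted += magnitude * (k * sample_rate / n)
    return weighted / total if total > 0 else 0.0


# Hypothetical test signal: a 1000 Hz sine with a small phase offset.
sample_rate = 8000
n = 256
tone = [
    math.sin(2 * math.pi * 1000 * t / sample_rate + 0.1) for t in range(n)
]

zcr = zero_crossing_rate(tone)
centroid = spectral_centroid(tone, sample_rate)
print(f"ZCR: {zcr:.3f}, centroid: {centroid:.1f} Hz")
```

For a pure tone the centroid sits at the tone frequency and the zero-crossing rate is close to twice the frequency divided by the sample rate. Real speech frames would normally be windowed and processed with an FFT library instead of this quadratic-time loop.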

Pitch versus timbre map
[Figure: scatter plot of median F0 (Hz, ticks 100 to 220) against spectral centroid (Hz, ticks 2000 to 3200) for the voices af_sarah, af_heart, af_bella, am_fenrir, bf_emma, af_sky, am_liam, bm_george, am_michael, af_nicole, and am_adam.]
Expressiveness radar
[Figure: radar chart over pitch range, motion, brightness, voicing, and edge for the voices af_sarah, am_fenrir, bf_emma, and af_nicole.]
§ 02 · Listen

A concrete Kokoro sample.

This is the kind of evidence the study will pair with blind listener votes: a real clip translated into acoustic signals listeners can actually compare.

Expression: 79 (composite prior)
F0 span: 199 Hz (tracked pitch range)
Voiced: 80% (periodic frames)
Voice portrait: af_sarah (female · american · Kokoro-82M), expression prior 79
Pitch: 199 Hz span
Motion: 44 Hz sigma
Brightness: 2860 Hz centroid
Voicing: 80% voiced
Edge: 0.148 ZCR
Panels: spectral color strip, pitch motion, voiced density
Visual summary of one Kokoro clip. Purple traces pitch motion, blue bars mark voiced frames, amber carries brightness, rose spikes show articulation edge.
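The pitch figures in the portrait (span, sigma, voiced share) can be read as summaries of a frame-level F0 track. The sketch below shows one plausible way to compute them. The `pitch_summary` helper, the use of min/max for the span, the population standard deviation, and the toy track are all assumptions; the values are invented and are not the af_sarah clip.

```python
import statistics


def pitch_summary(f0_track):
    """Summarize a frame-level F0 track; None marks an unvoiced frame."""
    voiced = [f for f in f0_track if f is not None]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return {
        # Span: distance between the highest and lowest tracked pitch.
        "span_hz": max(voiced) - min(voiced),
        # Motion: spread of pitch across voiced frames.
        "sigma_hz": statistics.pstdev(voiced),
        # Voicing: share of frames where a pitch was found.
        "voiced_fraction": len(voiced) / len(f0_track),
    }


# Hypothetical 10-frame track in Hz, with two unvoiced frames.
track = [None, 180.0, 200.0, 220.0, 240.0, None, 210.0, 190.0, 230.0, 250.0]
summary = pitch_summary(track)
print(summary)
```

A production tracker might use percentiles instead of the raw minimum and maximum so that a single octave error does not inflate the span.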
§ 03 · Register

Join the listener pool.

Leave your email and I will send the first listening rounds when the study opens.


You will get an email when the first blind A/B/C rounds are ready.

§ 04 · Method

How the study works.

This is a preference test, not a vendor demo. The same text goes through every TTS system, and the listener only sees neutral labels.

01

Same text

Every system receives the same prompt, so listeners compare voice quality rather than prompt choice.

02

Blind labels

Audio clips are shown as A, B, and C. Provider names, model names, and prices stay hidden until after scoring.

03

Human vote

Listeners pick the clip they prefer and can flag problems like robotic prosody, bad emphasis, noise, or unclear words.

04

Preference ranking

Votes are aggregated into a human preference layer that sits next to WER, latency, cost, and license data.
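The four steps above can be sketched in a few lines: shuffle the systems behind neutral labels, record which label each listener picks, and map votes back to systems only at tally time. This is an illustration under assumptions, not the study's implementation. The system names, the `make_blind_round` and `tally` helpers, and the plain win-rate aggregation are all hypothetical; the page does not say which aggregation method is used.

```python
import random
from collections import Counter


def make_blind_round(systems, rng):
    """Shuffle systems and hide them behind neutral labels A, B, C, ..."""
    shuffled = list(systems)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))


def tally(rounds_and_votes):
    """Return each system's win rate: rounds won / rounds it appeared in."""
    wins, shown = Counter(), Counter()
    for key, voted_label in rounds_and_votes:
        shown.update(key.values())
        wins[key[voted_label]] += 1
    return {system: wins[system] / shown[system] for system in shown}


rng = random.Random(0)  # fixed seed so the example is repeatable
systems = ["system_1", "system_2", "system_3"]  # hypothetical names
rounds = [make_blind_round(systems, rng) for _ in range(4)]

# Simulated listener who always prefers system_2, whatever its label is.
votes = []
for key in rounds:
    preferred = next(label for label, s in key.items() if s == "system_2")
    votes.append((key, preferred))

win_rates = tally(votes)
print(win_rates)
```

Because the label-to-system key is reshuffled every round, a listener cannot learn that a given letter always belongs to the same provider. A pairwise model could replace the win rate if the study needs to handle rounds with different system subsets.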

§ 05 · What we measure

Expressiveness needs more than one number.

The composite score is intentionally transparent: it rewards pitch movement, avoids treating whisperiness as a failure by itself, and keeps timbre separate from intelligibility.

Scoring recipe
Expressiveness score (not a quality claim; a visual prior for blind listening):
pitch range (weight 28%) + pitch motion (weight 24%) + brightness (weight 18%) + voicing (weight 18%) + edge (weight 12%)
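The recipe reads as a weighted sum, which the sketch below spells out. The weights come from the recipe; everything else is an assumption: the page does not say how each component is normalized, so the code assumes every component has already been scaled to the range 0 to 1, and the example component values are invented.

```python
# Weights taken from the scoring recipe; they sum to 100%.
WEIGHTS = {
    "pitch_range": 0.28,
    "pitch_motion": 0.24,
    "brightness": 0.18,
    "voicing": 0.18,
    "edge": 0.12,
}


def expressiveness_score(components):
    """Weighted sum of components (each assumed in [0, 1]), scaled to 0..100."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be normalized to [0, 1]")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical normalized components for one voice (not measured data).
example = {
    "pitch_range": 0.9,
    "pitch_motion": 0.8,
    "brightness": 0.7,
    "voicing": 0.8,
    "edge": 0.5,
}
score = expressiveness_score(example)
print(round(score, 1))
```

Keeping the weights in one dictionary makes the recipe easy to audit and to change, which fits the stated goal of a transparent composite.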
Expressive fingerprints
Each card plots pitch range, pitch motion, brightness, voiced, and edge, with a metric-shaped pitch path.
af_sarah · female · american · score 79 · F0 σ 44 Hz · centroid 2860 Hz
af_heart · female · american · score 68 · F0 σ 37 Hz · centroid 2807 Hz
af_bella · female · american · score 68 · F0 σ 30 Hz · centroid 3111 Hz
am_fenrir · male · american · score 64 · F0 σ 39 Hz · centroid 2423 Hz
bf_emma · female · british · score 64 · F0 σ 23 Hz · centroid 3274 Hz
af_sky · female · american · score 62 · F0 σ 40 Hz · centroid 2135 Hz
§ 06 · Ranking prior

Which voices should listeners compare first?

The ranking below is a triage tool for study design. It identifies voices with more acoustic variation so the blind rounds can test whether listeners actually prefer that variation.

Current Kokoro expressiveness ranking
01 · af_sarah · 79/100 · 44 Hz F0 σ
02 · af_heart · 68/100 · 37 Hz F0 σ
03 · af_bella · 68/100 · 30 Hz F0 σ
04 · am_fenrir · 64/100 · 39 Hz F0 σ
05 · bf_emma · 64/100 · 23 Hz F0 σ
06 · af_sky · 62/100 · 40 Hz F0 σ
07 · am_liam · 57/100 · 39 Hz F0 σ
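The ordering above can be reproduced with a simple sort. The scores and F0 σ values below are copied from the list. The tie-break rule is an assumption: the page does not state one, but sorting tied scores by larger F0 σ happens to match the listed order.

```python
# (voice, score out of 100, F0 sigma in Hz), copied from the ranking.
voices = [
    ("am_liam", 57, 39),
    ("af_sky", 62, 40),
    ("bf_emma", 64, 23),
    ("am_fenrir", 64, 39),
    ("af_bella", 68, 30),
    ("af_heart", 68, 37),
    ("af_sarah", 79, 44),
]

# Sort by score descending, then by F0 sigma descending (assumed tie-break).
ranked = sorted(voices, key=lambda v: (-v[1], -v[2]))

for position, (name, score, sigma) in enumerate(ranked, start=1):
    print(f"{position:02d} · {name} · {score}/100 · {sigma} Hz F0 σ")
```

For triage, the first few entries of `ranked` would be natural candidates for the earliest blind rounds.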
§ 07 · Current baseline

Automatic TTS scores are already live.

See intelligibility benchmark