Codesota · Speech · Blind human preference evaluation

Register for the TTS listening study

Help decide which AI voice actually sounds better.

You will listen to short clips labelled A, B, and C, each reading the same text. Your choices help build a public TTS ranking based on human preference, not only on automatic scores.

§ 01 · Kokoro lab

The open-source baseline is audible and measurable.

I rendered Kokoro-82M voices with the same prompt, then extracted pitch, voicing, brightness, zero-crossing, and MFCC views. These descriptors are not a final ranking, but they show where human preference should look closely.
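
As a rough illustration of what two of these descriptors measure, here is a minimal standard-library sketch that computes a zero-crossing rate and an autocorrelation-based pitch estimate for a single frame. It is not the pipeline used for the study: the function names, the 60 to 400 Hz search range, and the synthetic 200 Hz tone standing in for a voiced frame are all assumptions made for the example.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Return the pitch whose lag maximises the raw autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # A lag of 0 means no positive peak was found, so report "unvoiced".
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic stand-in for one voiced frame: a 200 Hz sine sampled at 8 kHz.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]

f0 = estimate_f0(frame, SAMPLE_RATE)
zcr = zero_crossing_rate(frame)
print(f"F0 ~ {f0:.1f} Hz, zero-crossing rate ~ {zcr:.3f}")
```

On real speech the same two functions would be applied frame by frame, and a production setup would more likely rely on an audio library with a robust pitch tracker than on this raw autocorrelation peak.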

Open full voice fingerprints →

Pitch versus timbre map

Expressiveness radar

§ 02 · Listen

A concrete Kokoro sample.

This is the kind of evidence the study will pair with blind listener votes: a real clip broken down into acoustic measurements that can be set against what listeners actually prefer.

Expression · 79 · composite prior

F0 span · 199 Hz · tracked pitch range

Voiced · 80% · periodic frames

Visual summary of one Kokoro clip. Purple traces pitch motion, blue bars mark voiced frames, amber carries brightness, and rose spikes mark articulation edges.

§ 03 · Register

Join the listener pool.

Leave your email and I will send the first listening rounds when the study opens.

You will get an email when the first blind A/B/C rounds are ready.

§ 04 · Method

How the study works.

This is a preference test, not a vendor demo. The same text goes through every TTS system, and the listener only sees neutral labels.

01 · Same text

Every system receives the same prompt, so listeners compare voice quality rather than prompt choice.

02 · Blind labels

Audio clips are shown as A, B, and C. Provider names, model names, and prices stay hidden until after scoring.

03 · Human vote

Listeners pick the clip they prefer and can flag problems like robotic prosody, bad emphasis, noise, or unclear words.

04 · Preference ranking

Votes are aggregated into a human preference layer that sits next to WER, latency, cost, and license data.
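
The four steps above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the system names are placeholders, the listener is simulated, and a simple win share (wins divided by appearances) stands in for whatever aggregation the study ends up using.

```python
import random
from collections import Counter


def assign_blind_labels(systems, rng):
    """Shuffle the systems and map them to neutral labels A, B, C, ..."""
    shuffled = list(systems)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))


def preference_shares(rounds):
    """rounds: list of (label_to_system, chosen_label) pairs.

    Returns, per system, the share of its appearances that it won.
    """
    wins = Counter()
    appearances = Counter()
    for label_to_system, chosen in rounds:
        for system in label_to_system.values():
            appearances[system] += 1
        wins[label_to_system[chosen]] += 1
    return {s: wins[s] / appearances[s] for s in appearances}


rng = random.Random(0)  # fixed seed so the label assignment is reproducible
systems = ["system_x", "system_y", "system_z"]  # hypothetical names

rounds = []
for _ in range(6):
    mapping = assign_blind_labels(systems, rng)
    # Simulated listener who always prefers system_x, whatever label it wears.
    chosen = next(label for label, s in mapping.items() if s == "system_x")
    rounds.append((mapping, chosen))

shares = preference_shares(rounds)
print(shares)
```

Because the label mapping is stored per round and only resolved inside `preference_shares`, the listener-facing side never needs to see a provider or model name.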

§ 05 · What we measure

Expressiveness needs more than one number.

The composite score is intentionally transparent: it rewards pitch movement, avoids treating whisperiness as a failure by itself, and keeps timbre separate from intelligibility.
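
To make those three design choices concrete, here is a hedged sketch of how such a composite could be structured. The weights, the normalisation ceilings, and the field names are hypothetical, and the sketch does not reproduce the scores published on this page. It only shows the shape of the idea: pitch movement drives the score, voicing does not subtract from it, and timbre is reported alongside rather than folded in.

```python
from dataclasses import dataclass


@dataclass
class ClipDescriptors:
    f0_std_hz: float        # pitch standard deviation over voiced frames
    f0_span_hz: float       # tracked pitch range
    voiced_fraction: float  # share of periodic frames, 0..1
    brightness_hz: float    # spectral centroid, used as a timbre descriptor


# Hypothetical weights and ceilings, chosen only for illustration.
W_STD, W_SPAN = 0.6, 0.4
STD_CEILING_HZ, SPAN_CEILING_HZ = 50.0, 250.0


def clamp01(x):
    return max(0.0, min(1.0, x))


def expressiveness_prior(d):
    """Score from 0 to 100 driven only by pitch movement."""
    movement = (
        W_STD * clamp01(d.f0_std_hz / STD_CEILING_HZ)
        + W_SPAN * clamp01(d.f0_span_hz / SPAN_CEILING_HZ)
    )
    return round(100 * movement)


def report(d):
    """Keep voicing and timbre next to the score instead of inside it."""
    return {
        "expressiveness": expressiveness_prior(d),
        "voiced_fraction": d.voiced_fraction,
        "brightness_hz": d.brightness_hz,
    }


flat = ClipDescriptors(10.0, 50.0, 0.9, 1800.0)
lively = ClipDescriptors(40.0, 200.0, 0.9, 1800.0)
breathy = ClipDescriptors(40.0, 200.0, 0.5, 1800.0)  # same pitch, less voicing

for name, clip in [("flat", flat), ("lively", lively), ("breathy", breathy)]:
    print(name, report(clip))
```

In this sketch the breathy clip scores the same as the lively one, since a low voiced fraction is reported but never penalised on its own.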

Scoring recipe

Expressive fingerprints

§ 06 · Ranking prior

Which voices should listeners compare first?

The ranking below is a triage tool for study design. It identifies voices with more acoustic variation so the blind rounds can test whether listeners actually prefer that variation.

Current Kokoro expressiveness ranking

01 · af_sarah · 79/100 · 44 Hz F0 σ

02 · af_heart · 68/100 · 37 Hz F0 σ

03 · af_bella · 68/100 · 30 Hz F0 σ

04 · am_fenrir · 64/100 · 39 Hz F0 σ

05 · bf_emma · 64/100 · 23 Hz F0 σ

06 · af_sky · 62/100 · 40 Hz F0 σ

07 · am_liam · 57/100 · 39 Hz F0 σ
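
One way to use the ranking above for triage is to take the top few voices and enumerate the A/B/C triples to test first. The sketch below does that with the seven rows listed; the choice of the top four and the tie-break on F0 σ are assumptions for illustration, not the study's selection rule.

```python
from itertools import combinations

# (voice, composite score out of 100, F0 standard deviation in Hz),
# copied from the ranking above.
PRIOR = [
    ("af_sarah", 79, 44),
    ("af_heart", 68, 37),
    ("af_bella", 68, 30),
    ("am_fenrir", 64, 39),
    ("bf_emma", 64, 23),
    ("af_sky", 62, 40),
    ("am_liam", 57, 39),
]


def first_rounds(prior, top_n=4):
    """Return every 3-voice combination among the top_n voices.

    Ties on the composite score are broken by the larger F0 sigma
    (an assumption made for this example).
    """
    ranked = sorted(prior, key=lambda row: (-row[1], -row[2]))
    top = [name for name, _, _ in ranked[:top_n]]
    return list(combinations(top, 3))


triples = first_rounds(PRIOR)
for triple in triples:
    print(" vs ".join(triple))
```

Each triple would still need its voices shuffled onto the labels A, B, and C before being shown to a listener.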

§ 07 · Current baseline

Automatic TTS scores are already live.

See intelligibility benchmark →