§ TTS information preservation

English TTS intelligibility, measured as information fidelity.

Most TTS demos optimize for ten seconds of beautiful speech. This benchmark asks a stricter production question: can independent ASR recover the exact intended English message?


The benchmark pipeline

FIG 1 · ROUND-TRIP INFORMATION CHANNEL
01 · Prompt set: 30 tagged English prompts
02 · TTS: Gradium Audrey · Kokoro af_heart
03 · Audio: WAV artifacts + latency metadata
04 · ASR: Whisper large-v3-turbo
05 · Scoring: WER · CER · entity checks
06 · Report: speed × quality × cost
Same prompts, same ASR, same scoring. Only the TTS provider changes.
§ Gradium angle

Measured takeaways.

First run: Gradium Audrey vs Kokoro af_heart on 30 hard English prompts, transcribed by Whisper large-v3-turbo. This is a small but real benchmark run, not placeholder copy.

Gradium ranked #1

On this 30-prompt run, Gradium beat Kokoro on the composite information-fidelity score: 37.4 vs 44.1, lower is better.

Lower WER

Gradium produced 13.4% normalized WER versus Kokoro at 15.6% after Whisper large-v3-turbo transcription.

Better entity preservation

Gradium preserved 73.3% of critical entities versus Kokoro at 66.7%, across numbers, dates, names, addresses, emails, and URLs.

Much faster first byte

Gradium p95 first-byte latency was 299 ms versus 2,123 ms for Kokoro on this local run. This is the clearest advantage for voice agents.

§ 00 · Visual scorecard

Speed, quality, and cost in one view.

The benchmark should make the tradeoff visible before the reader reaches the table: low WER, high entity accuracy, low first-byte latency, and transparent list-price cost.

FIG 2 · SPEED VS QUALITY VS COST
Bubble chart. Axes: p95 first-byte latency in ms (0 to 2000, lower is faster) and normalized WER (10% to 18%, lower is better). Marked region: agent-ready zone. Points: Gradium ($0.0478 / 1K), Kokoro (local infra / 1K).
Bubble size encodes critical entity accuracy. Gradium sits in the lower-left region: faster first byte and lower WER.
Fig 3 · metric deltas

| Metric | Direction | Gradium | Kokoro |
|---|---|---|---|
| WER | lower is better | 13.4% | 15.6% |
| Entity accuracy | higher is better | 73.3% | 66.7% |
| TTFB p95 | lower is better | 299 ms | 2123 ms |
| High severity errors | lower is better | 6 | 6 |

§ 01 · Leaderboard

Built to decide which provider is better.

Every provider runs the same prompts, the same ASR, the same normalization, and the same entity checks. The winner is the model that preserves the most information with acceptable latency.

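| # | Model | WER ↓ | Entity Acc ↑ | TTFB p95 ↓ | Total p95 ↓ | Cost / 1K ↓ | Best for |
|---|---|---|---|---|---|---|---|
| 1 | Gradium (n=30 · score 37.4) | 13.4% | 73.3% | 299 ms | 3517 ms | $0.0478 | real-time agents |
| 2 | Kokoro (n=30 · score 44.1) | 15.6% | 66.7% | 2123 ms | 2123 ms | local infra | local/open source |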

Measured locally on April 28, 2026: 30 hard English prompts, Gradium Audrey vs Kokoro af_heart, both transcribed with Whisper large-v3-turbo. Gradium cost uses public list pricing from the S plan: 1 TTS character = 1 credit, $43/month for 900k credits, or about $0.0478 per 1K TTS characters. This execution used a granted API key; Kokoro cost is local infrastructure only.
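The list-price figure follows from simple arithmetic on the S plan numbers quoted above. A minimal sketch; the function name `cost_per_1k_chars` is hypothetical and only the plan figures come from the note above.

```python
def cost_per_1k_chars(monthly_price_usd: float, monthly_credits: int,
                      credits_per_char: float = 1.0) -> float:
    """List-price cost of 1,000 TTS characters, in USD."""
    cost_per_credit = monthly_price_usd / monthly_credits
    return cost_per_credit * credits_per_char * 1000


# S plan figures quoted above: $43/month for 900k credits, 1 character = 1 credit.
s_plan = cost_per_1k_chars(43.0, 900_000)
print(f"${s_plan:.4f} per 1K characters")  # $0.0478 per 1K characters
```

This is a list-price estimate that assumes the full monthly credit allowance is used; unused credits raise the effective cost per character.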

§ 02 · Category heatmap

Where each voice breaks.

Darker cells have higher WER. The bar inside each cell shows critical entity accuracy. This exposes whether a model is generally intelligible but weak on emails, URLs, names, or identifiers.

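| Category | Gradium WER | Gradium entity acc. | Kokoro WER | Kokoro entity acc. |
|---|---|---|---|---|
| plain | 0.0% | 100.0% | 0.0% | 100.0% |
| numbers | 11.3% | 100.0% | 16.9% | 88.9% |
| dates | 8.9% | 75.0% | 14.2% | 75.0% |
| time | 4.2% | 50.0% | 9.1% | 100.0% |
| address | 9.1% | 100.0% | 18.2% | 50.0% |
| names | 6.3% | 66.7% | 10.3% | 33.3% |
| acronyms | 11.1% | 33.3% | 11.1% | 33.3% |
| email | 38.3% | 0.0% | 39.4% | 0.0% |
| url | 45.8% | 0.0% | 41.7% | 0.0% |
| long form | 9.4% | 50.0% | 10.9% | 50.0% |
| domain terms | 2.5% | 100.0% | 2.5% | 100.0% |
| technical | 8.3% | 66.7% | 8.3% | 66.7% |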
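
One way a per-category view like this could be produced is to pool word errors over every prompt that carries a given tag. This is a minimal sketch, and its record fields (`tags`, `word_errors`, `ref_words`) and sample numbers are illustrative assumptions, not the harness's actual schema or results. The benchmark may aggregate differently, for example by averaging per-prompt WER.

```python
from collections import defaultdict


def wer_by_category(records: list[dict]) -> dict[str, float]:
    """Pooled WER per tag: total word errors / total reference words."""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for rec in records:
        # A prompt with several tags contributes to each of its categories.
        for tag in rec["tags"]:
            errors[tag] += rec["word_errors"]
            words[tag] += rec["ref_words"]
    return {tag: errors[tag] / words[tag] for tag in words}


# Made-up per-prompt records, for illustration only.
sample_records = [
    {"id": "example_a", "tags": ["dates", "time"], "word_errors": 1, "ref_words": 11},
    {"id": "example_b", "tags": ["numbers"], "word_errors": 0, "ref_words": 5},
    {"id": "example_c", "tags": ["numbers", "time"], "word_errors": 2, "ref_words": 9},
]

for tag, value in sorted(wer_by_category(sample_records).items()):
    print(f"{tag}: {value:.1%}")
```

Because prompts can carry more than one tag, category rows overlap and do not sum to the overall WER.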
§ 03 · Method

Round-trip intelligibility.

Gradium is only used for TTS. The transcript must come from an independent ASR system.

01 · Start with curated English prompts.

02 · Generate speech with Gradium TTS and store audio plus latency metadata.

03 · Transcribe the audio with independent ASR, for example Whisper, Deepgram, AssemblyAI, or Google STT.

04 · Normalize reference and hypothesis.

05 · Compute strict and normalized WER/CER.

06 · Extract critical entities and classify failures by category and severity.
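
The normalization step is where strict and normalized scores diverge. The benchmark's exact normalization rules are not listed here, so the following is only a sketch of a common baseline (Unicode folding, lowercasing, punctuation removal, whitespace collapsing). The function name `normalize` and these rules are assumptions.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Baseline text normalization applied to both reference and ASR output."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())         # collapse runs of whitespace


print(normalize("The API uses OAuth, JWT, TLS, and HTTP/2."))
# the api uses oauth jwt tls and http 2
```

Punctuation stripping this aggressive also splits strings such as `2:45` or `HTTP/2`, which is one reason to run entity checks separately rather than rely on normalized WER alone.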

§ 04 · Coverage

Easy sentences are not enough.

plain speech
numbers
dates and times
currencies
addresses
names and entities
acronyms
emails and URLs
domain terms
long-form speech
§ 05 · Metrics
WER

Word-level intelligibility after ASR round-trip.

CER

Character-level preservation for identifiers and dense strings.

Exact match

Whether normalized reference equals normalized ASR output.

Critical entity accuracy

Whether numbers, dates, identifiers, names, and address-like strings survive.

TTFB

Time from TTS request to first audio byte.

Error taxonomy

Entity-level failure records with severity classes.
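
WER, CER, and exact match can all be derived from one edit-distance routine. The sketch below uses the standard Levenshtein definition (substitutions, insertions, and deletions each cost 1, divided by reference length). It is not taken from the harness, and the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits / reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def exact_match(ref: str, hyp: str) -> bool:
    return ref == hyp


ref = "your appointment is scheduled for march 12th 2026 at 2:45 pm"
hyp = "your appointment is scheduled for march 12th 2026 at 2:40 pm"
print(f"WER={wer(ref, hyp):.3f} CER={cer(ref, hyp):.3f} exact={exact_match(ref, hyp)}")
```

One substituted word out of eleven gives a WER of about 9%, even though the appointment time is now wrong. That gap is what the entity-level metrics are meant to capture.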

§ 06 · Ranking logic
Rank

Composite information-fidelity error.

WER ↓

Normalized word error rate.

CER ↓

Normalized character error rate.

Exact ↑

Strict normalized recovery rate.

Entity Acc ↑

Critical entity preservation.

TTFB p95 ↓

Real-time readiness.
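
The p95 latency columns summarize per-prompt measurements. How the harness computes percentiles is not stated, so the sketch below assumes the nearest-rank definition; the function name and the sample latencies are illustrative, not measured values.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up first-byte latencies in ms for a 30-prompt run: 10, 20, ..., 300.
sample_ttfb_ms = [10.0 * i for i in range(1, 31)]
print(percentile_nearest_rank(sample_ttfb_ms, 95))  # 290.0
```

With n=30, nearest-rank p95 is the second-largest sample, so a single slow request does not set the figure but two do.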

Severity profile
Gradium: 6 high · 2 medium
Kokoro: 6 high · 5 medium
High severity means a number, date, identifier, address, or named entity changed. WER alone does not price those failures correctly.
Example diff
REF: Your appointment is scheduled for March 12th, 2026 at 2:45 PM.
ASR: Your appointment is scheduled for March 12th, 2026 at 2:40 PM.

ERROR:
- time changed from 2:45 PM to 2:40 PM
- category: date/time
- severity: high
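
An error record like the one above could be produced by extracting entities from both strings and comparing them. The following is a minimal sketch for clock times only. The regex, the function names, the record fields, and the positional comparison are all assumptions for illustration, not the harness's implementation.

```python
import re

# Matches 12-hour clock times such as "2:45 PM".
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([AP]M)\b", re.IGNORECASE)


def extract_times(text: str) -> list[str]:
    return [f"{h}:{m} {ampm.upper()}" for h, m, ampm in TIME_RE.findall(text)]


def check_times(ref: str, hyp: str) -> list[dict]:
    """Compare times position by position and report every mismatch."""
    ref_times = extract_times(ref)
    hyp_times = extract_times(hyp)
    errors = []
    for i, expected in enumerate(ref_times):
        found = hyp_times[i] if i < len(hyp_times) else None
        if found != expected:
            errors.append({
                "category": "date/time",
                "severity": "high",  # a changed time alters the message
                "expected": expected,
                "found": found,
            })
    return errors


REF = "Your appointment is scheduled for March 12th, 2026 at 2:45 PM."
ASR = "Your appointment is scheduled for March 12th, 2026 at 2:40 PM."
print(check_times(REF, ASR))
```

A real checker would also need to accept equivalent spoken forms, such as "quarter to three", before counting a mismatch.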
§ 07 · Prompt set
date_001 · dates, time

Your appointment is scheduled for March 12th, 2026 at 2:45 PM.

num_002 · numbers, identifier

The confirmation code is 739-184-552.

acro_001 · acronyms, technical

The API uses OAuth, JWT, TLS, and HTTP/2.

email_001 · email

Please send the invoice to alex.smith plus billing at example dot com.

§ 08 · CodeSOTA harness

Runnable from the repo root.

The API key stays in the environment. Raw audio and transcripts stay in run directories.

export GRADIUM_API_KEY=...

python scripts/tts_intelligibility_generate_gradium.py \
  --prompts data/tts-intelligibility/english_prompts.jsonl \
  --run-dir data/tts-intelligibility/runs/gradium-audrey-v1 \
  --voice-id Zd5POlBGSbD-JBXF

python scripts/tts_intelligibility_transcribe_whisper.py \
  --prompts data/tts-intelligibility/english_prompts.jsonl \
  --run-dir data/tts-intelligibility/runs/gradium-audrey-v1 \
  --model turbo

python scripts/tts_intelligibility_score.py \
  --prompts data/tts-intelligibility/english_prompts.jsonl \
  --run-dir data/tts-intelligibility/runs/gradium-audrey-v1