
§ TTS information preservation

English TTS intelligibility, measured as information fidelity.

Most TTS demos optimize for ten seconds of beautiful speech. This benchmark asks a stricter production question: can independent ASR recover the exact intended English message?


The benchmark pipeline

Same prompts, same ASR, same scoring. Only the TTS provider changes.

§ Gradium angle

Measured takeaways.

First run: Gradium Audrey vs Kokoro af_heart on 30 hard English prompts, transcribed by Whisper large-v3-turbo. This is a small but real benchmark run, not placeholder copy.

Gradium ranked #1

On this 30-prompt run, Gradium beat Kokoro on the composite information-fidelity error score: 37.4 vs 44.1 (lower is better).

Lower WER

Gradium produced 13.4% normalized WER versus Kokoro at 15.6% after Whisper large-v3-turbo transcription.

Better entity preservation

Gradium preserved 73.3% of critical entities versus Kokoro at 66.7%, across numbers, dates, names, addresses, emails, and URLs.

Much faster first byte

Gradium p95 first-byte latency was 299 ms versus Kokoro at 2,123 ms on this local run, the clearest voice-agent advantage.
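A p95 figure like the ones above can be derived from per-request first-byte timings. The sketch below uses the nearest-rank method. This page does not state which percentile method the harness uses, so treat the method as an assumption. The sample latencies are invented for illustration and are not benchmark data.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical first-byte latencies in milliseconds (not benchmark data).
ttfb_ms = [210, 225, 240, 251, 260, 262, 270, 281, 290, 299]
p95 = percentile_nearest_rank(ttfb_ms, 95)
print(f"TTFB p95: {p95} ms")
```

With only 30 prompts per provider, p95 is determined by the one or two slowest requests, so a single outlier can move it noticeably.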

§ 00 · Visual scorecard

Speed, quality, and cost in one view.

The benchmark should make the tradeoff visible before the reader reaches the table: low WER, high entity accuracy, low first-byte latency, and transparent list-price cost.

Bubble size encodes critical entity accuracy. Gradium sits in the lower-left region: faster first byte and lower WER.

Fig 3 · metric deltas

WER (lower is better): Gradium 13.4% · Kokoro 15.6%

Entity accuracy (higher is better): Gradium 73.3% · Kokoro 66.7%

TTFB p95 (lower is better): Gradium 299 ms · Kokoro 2,123 ms

High severity errors (lower is better): Gradium 6 · Kokoro 6

§ 01 · Leaderboard

Built to decide which provider is better.

Every provider runs the same prompts, the same ASR, the same normalization, and the same entity checks. The winner is the model that preserves the most information with acceptable latency.

#	Model	n	Score ↓	WER ↓	Entity Acc ↑	TTFB p95 ↓	Total p95 ↓	Cost / 1K chars ↓	Best for
1	Gradium	30	37.4	13.4%	73.3%	299 ms	3,517 ms	$0.0478	real-time agents
2	Kokoro	30	44.1	15.6%	66.7%	2,123 ms	2,123 ms	local infra	local/open source

Measured locally on April 28, 2026: 30 hard English prompts, Gradium Audrey vs Kokoro af_heart, both transcribed with Whisper large-v3-turbo. Gradium cost uses public list pricing from the S plan: 1 TTS character = 1 credit and $43/month buys 900k credits, which works out to about $0.0478 per 1K TTS characters. This run used a granted API key, so the Gradium figure is the list-price equivalent; Kokoro's cost is local infrastructure only.
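The per-1K-character cost follows directly from the plan numbers quoted above. A minimal sketch of that arithmetic, where the function names are illustrative and not part of the harness:

```python
PLAN_PRICE_USD = 43.0    # S plan monthly list price, as quoted above
PLAN_CREDITS = 900_000   # credits included; 1 TTS character = 1 credit

def cost_per_1k_chars(price_usd, credits):
    """List-price cost of synthesizing 1,000 characters."""
    return price_usd / credits * 1000

def estimate_cost(text, price_usd=PLAN_PRICE_USD, credits=PLAN_CREDITS):
    """List-price cost of synthesizing one prompt."""
    return len(text) * price_usd / credits

per_1k = cost_per_1k_chars(PLAN_PRICE_USD, PLAN_CREDITS)
print(f"${per_1k:.4f} per 1K characters")  # $0.0478 per 1K characters
```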

§ 02 · Category heatmap

Where each voice breaks.

Darker cells have higher WER. The bar inside each cell shows critical entity accuracy. This exposes whether a model is generally intelligible but weak on emails, URLs, names, or identifiers.

Round-trip intelligibility.

Gradium is only used for TTS. The transcript must come from an independent ASR system.

1. Start with curated English prompts.

2. Generate speech with Gradium TTS and store audio plus latency metadata.

3. Transcribe the audio with independent ASR, for example Whisper, Deepgram, AssemblyAI, or Google STT.

4. Normalize reference and hypothesis.

5. Compute strict and normalized WER/CER.

6. Extract critical entities and classify failures by category and severity.
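The normalization and WER/CER steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the harness's scoring code. The normalization rule here (lowercase, punctuation replaced by spaces, whitespace collapsed) is an assumption, and the real normalizer may treat numbers and symbols differently.

```python
import re

def normalize(text):
    """Assumed normalization: lowercase, punctuation to spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate on normalized text."""
    ref_words = normalize(ref).split()
    return edit_distance(ref_words, normalize(hyp).split()) / max(1, len(ref_words))

def cer(ref, hyp):
    """Character error rate on normalized text."""
    ref_chars = normalize(ref)
    return edit_distance(ref_chars, normalize(hyp)) / max(1, len(ref_chars))

ref = "Your appointment is scheduled for March 12th, 2026 at 2:45 PM."
hyp = "Your appointment is scheduled for March 12th, 2026 at 2:40 PM."
print(f"WER={wer(ref, hyp):.3f} CER={cer(ref, hyp):.3f}")
```

On this pair a single wrong digit costs one word out of twelve, so WER stays low even though the appointment time is wrong. That is why the pipeline adds a separate entity check.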

§ 04 · Coverage

Easy sentences are not enough.

plain speech

numbers

dates and times

currencies

addresses

names and entities

acronyms

emails and URLs

domain terms

long-form speech

§ 05 · Metrics

WER

Word-level intelligibility after ASR round-trip.

CER

Character-level preservation for identifiers and dense strings.

Exact match

Whether normalized reference equals normalized ASR output.

Critical entity accuracy

Whether numbers, dates, identifiers, names, and address-like strings survive.

TTFB

Time from TTS request to first audio byte.

Error taxonomy

Entity-level failure records with severity classes.
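Exact match and critical entity accuracy from the list above can be illustrated as follows. The sketch assumes each prompt ships with a hand-labeled list of critical entities (the `entities` argument is hypothetical) and counts an entity as preserved when its normalized form appears in the normalized ASR output. The harness may extract and compare entities differently.

```python
import re

def normalize(text):
    """Assumed normalization: lowercase, punctuation to spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())

def exact_match(ref, hyp):
    """True when the normalized reference equals the normalized ASR output."""
    return normalize(ref) == normalize(hyp)

def entity_accuracy(entities, hyp):
    """Fraction of labeled entities whose normalized form appears in hyp."""
    if not entities:
        return 1.0
    # Pad with spaces so matches align to whole tokens.
    hyp_norm = f" {normalize(hyp)} "
    hits = sum(1 for e in entities if f" {normalize(e)} " in hyp_norm)
    return hits / len(entities)

ref = "Your appointment is scheduled for March 12th, 2026 at 2:45 PM."
hyp = "Your appointment is scheduled for March 12th, 2026 at 2:40 PM."
entities = ["March 12th, 2026", "2:45 PM"]  # hypothetical labels for this prompt

print(exact_match(ref, hyp))            # False
print(entity_accuracy(entities, hyp))   # 0.5
```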

§ 06 · Ranking logic

Rank

Composite information-fidelity error.

WER ↓

Normalized word error rate.

CER ↓

Normalized character error rate.

Exact ↑

Strict normalized recovery rate.

Entity Acc ↑

Critical entity preservation.

TTFB p95 ↓

Real-time readiness.
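One way to fold these columns into a single rank is a weighted error sum. The sketch below is purely illustrative: the weights, the latency cap, and the formula are invented here, the benchmark's actual composite is not given on this page, and the code does not reproduce the 37.4 and 44.1 scores. The provider metrics are hypothetical as well.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    wer: float          # 0..1, lower is better
    cer: float          # 0..1, lower is better
    exact: float        # 0..1, higher is better
    entity_acc: float   # 0..1, higher is better
    ttfb_p95_ms: float  # milliseconds, lower is better

# Hypothetical weights; the benchmark's real weighting is not stated.
WEIGHTS = {"wer": 0.3, "cer": 0.1, "exact": 0.1, "entity": 0.4, "latency": 0.1}
TTFB_CAP_MS = 3000.0  # hypothetical cap so latency maps onto 0..1

def composite_error(m):
    """Weighted error on a 0..100 scale; lower is better."""
    latency = min(m.ttfb_p95_ms, TTFB_CAP_MS) / TTFB_CAP_MS
    score = (WEIGHTS["wer"] * m.wer
             + WEIGHTS["cer"] * m.cer
             + WEIGHTS["exact"] * (1 - m.exact)
             + WEIGHTS["entity"] * (1 - m.entity_acc)
             + WEIGHTS["latency"] * latency)
    return 100 * score

# Hypothetical providers, not the measured results.
providers = {
    "provider_a": Metrics(0.10, 0.04, 0.5, 0.8, 300),
    "provider_b": Metrics(0.20, 0.08, 0.4, 0.6, 2000),
}
ranking = sorted(providers, key=lambda name: composite_error(providers[name]))
print(ranking)
```

Converting the "higher is better" columns to error terms (1 minus the value) keeps every component pointing the same way, so the lowest total ranks first.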

Severity profile

Gradium: 6 high · 2 medium

Kokoro: 6 high · 5 medium

High severity means a number, date, identifier, address, or named entity changed. WER alone does not price those failures correctly.

Example diff

REF: Your appointment is scheduled for March 12th, 2026 at 2:45 PM.
ASR: Your appointment is scheduled for March 12th, 2026 at 2:40 PM.

ERROR:
- time changed from 2:45 PM to 2:40 PM
- category: date/time
- severity: high
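A record like the one above can be produced by extracting entities from both strings and comparing them in order. The sketch below covers only clock times. The regex, the `EntityError` fields, and the rule that any changed time is high severity are assumptions made for illustration, not the harness's taxonomy code.

```python
import re
from dataclasses import dataclass

# Matches clock times such as "2:45 PM" (assumed pattern; other formats are ignored).
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\s?(?:AM|PM)\b", re.IGNORECASE)

@dataclass
class EntityError:
    category: str
    severity: str
    expected: str
    observed: str

def time_errors(ref, hyp):
    """Compare times position by position and record any that changed."""
    ref_times = TIME_RE.findall(ref)
    hyp_times = TIME_RE.findall(hyp)
    errors = []
    for i, expected in enumerate(ref_times):
        # A time missing from the ASR output is recorded with an empty value.
        observed = hyp_times[i] if i < len(hyp_times) else ""
        if expected.upper() != observed.upper():
            errors.append(EntityError("date/time", "high", expected, observed))
    return errors

ref = "Your appointment is scheduled for March 12th, 2026 at 2:45 PM."
hyp = "Your appointment is scheduled for March 12th, 2026 at 2:40 PM."
for err in time_errors(ref, hyp):
    print(f"time changed from {err.expected} to {err.observed} "
          f"({err.category}, severity {err.severity})")
```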

§ 07 · Prompt set

date_001 · dates, time

Your appointment is scheduled for March 12th, 2026 at 2:45 PM.

num_002 · numbers, identifier

The confirmation code is 739-184-552.

acro_001 · acronyms, technical

The API uses OAuth, JWT, TLS, and HTTP/2.

email_001 · email

Please send the invoice to alex.smith plus billing at example dot com.
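Prompts like these live in a JSONL file, one record per line. The field names below (`id`, `categories`, `text`) are an assumed schema for illustration, and the actual layout of `english_prompts.jsonl` may differ. The sketch loads records and indexes prompt IDs by category, which is the grouping a per-category heatmap needs.

```python
import json

# Two records in an assumed schema, using prompts shown above.
SAMPLE_JSONL = """\
{"id": "date_001", "categories": ["dates", "time"], "text": "Your appointment is scheduled for March 12th, 2026 at 2:45 PM."}
{"id": "num_002", "categories": ["numbers", "identifier"], "text": "The confirmation code is 739-184-552."}
"""

def load_prompts(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

def by_category(prompts):
    """Map each category to the IDs of the prompts tagged with it."""
    index = {}
    for prompt in prompts:
        for category in prompt["categories"]:
            index.setdefault(category, []).append(prompt["id"])
    return index

prompts = load_prompts(SAMPLE_JSONL)
print(by_category(prompts))
```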

§ 08 · CodeSOTA harness

Runnable from the repo root.

The API key stays in the environment. Raw audio and transcripts stay in run directories.

export GRADIUM_API_KEY=...

python scripts/tts_intelligibility_generate_gradium.py \
  --prompts data/tts-intelligibility/english_prompts.jsonl \
  --run-dir data/tts-intelligibility/runs/gradium-audrey-v1 \
  --voice-id Zd5POlBGSbD-JBXF

python scripts/tts_intelligibility_transcribe_whisper.py \
  --prompts data/tts-intelligibility/english_prompts.jsonl \
  --run-dir data/tts-intelligibility/runs/gradium-audrey-v1 \
  --model turbo

python scripts/tts_intelligibility_score.py \
  --prompts data/tts-intelligibility/english_prompts.jsonl \
  --run-dir data/tts-intelligibility/runs/gradium-audrey-v1
