English TTS intelligibility, measured as information fidelity.
Most TTS demos optimize for ten seconds of beautiful speech. This benchmark asks a stricter production question: can independent ASR recover the exact intended English message?
The benchmark pipeline
Measured takeaways.
First run: Gradium Audrey vs Kokoro af_heart on 30 hard English prompts, transcribed by Whisper large-v3-turbo. It is a small benchmark, but every number below comes from an actual run.
On this 30-prompt run, Gradium beat Kokoro on the composite information-fidelity score: 37.4 vs 44.1, lower is better.
Gradium produced 13.4% normalized WER versus Kokoro at 15.6% after Whisper large-v3-turbo transcription.
Gradium preserved 73.3% of critical entities versus Kokoro at 66.7%, across numbers, dates, names, addresses, emails, and URLs.
Gradium's p95 first-byte latency was 299 ms versus 2,123 ms for Kokoro on this local run, the clearest advantage for voice agents.
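The latency figures are p95 values over the 30 prompts. The page does not say which percentile definition the scoring script uses. Here is a minimal sketch that assumes the nearest-rank definition. Interpolating methods, such as NumPy's default `percentile`, can give slightly different numbers on only 30 samples.

```python
import math


def percentile_nearest_rank(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Multiply before dividing to limit floating-point drift in the rank.
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical TTFB samples for 30 prompts (200, 210, ..., 490 ms), not benchmark data.
ttfb_ms = [float(ms) for ms in range(200, 500, 10)]
print(percentile_nearest_rank(ttfb_ms, 95))  # 480.0
```

With 30 samples, p95 is the 29th-smallest value, so a single slow outlier on the 30th prompt does not move it.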
Speed, quality, and cost in one view.
The benchmark should make the tradeoff visible before the reader reaches the table: low WER, high entity accuracy, low first-byte latency, and transparent list-price cost.
Built to decide which provider is better.
Every provider runs the same prompts, the same ASR, the same normalization, and the same entity checks. The winner is the model that preserves the most information with acceptable latency.
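The page does not state "acceptable latency" as a number. One way to make the decision rule concrete is to treat latency as a hard budget and then rank the remaining providers by composite error. The sketch below does that. The 500 ms budget is a hypothetical voice-agent threshold and is not part of the benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderResult:
    name: str
    composite_score: float  # information-fidelity error, lower is better
    ttfb_p95_ms: float


def pick_winner(results: list[ProviderResult], ttfb_budget_ms: float) -> Optional[ProviderResult]:
    """Lowest composite error among providers whose p95 TTFB fits the latency budget."""
    eligible = [r for r in results if r.ttfb_p95_ms <= ttfb_budget_ms]
    if not eligible:
        return None  # no provider meets the latency requirement
    return min(eligible, key=lambda r: r.composite_score)


# Composite scores and TTFB p95 values from this page's 30-prompt run.
results = [
    ProviderResult("Gradium", 37.4, 299.0),
    ProviderResult("Kokoro", 44.1, 2123.0),
]
winner = pick_winner(results, ttfb_budget_ms=500.0)  # hypothetical budget
print(winner.name if winner else "no eligible provider")  # Gradium
```

If the budget is above 2,123 ms, both providers qualify and Gradium still wins on composite score. If it is below 299 ms, neither qualifies and the function returns `None`.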
| # | Model | n | Score ↓ | WER ↓ | Entity Acc ↑ | TTFB p95 ↓ | Total p95 ↓ | Cost / 1K chars ↓ | Best for |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Gradium (Audrey) | 30 | 37.4 | 13.4% | 73.3% | 299 ms | 3,517 ms | $0.0478 | real-time agents |
| 2 | Kokoro (af_heart) | 30 | 44.1 | 15.6% | 66.7% | 2,123 ms | 2,123 ms | local infra only | local/open source |
Measured locally on April 28, 2026: 30 hard English prompts, Gradium Audrey vs Kokoro af_heart, both transcribed with Whisper large-v3-turbo. Gradium cost uses public list pricing from the S plan: 1 TTS character = 1 credit, $43/month for 900k credits, or about $0.0478 per 1K TTS characters. This run used a granted API key. Kokoro has no per-character fee, so its cost is local infrastructure only. See Gradium pricing.
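The per-1K figure follows directly from the S plan numbers. Below is a small sketch of that arithmetic. It assumes the full monthly credit allotment is used; unused credits would make the effective rate higher.

```python
def list_price_per_1k_chars(monthly_usd: float, monthly_credits: int,
                            credits_per_char: float = 1.0) -> float:
    """Effective USD per 1,000 TTS characters, assuming every monthly credit is consumed."""
    chars_per_month = monthly_credits / credits_per_char
    return monthly_usd / chars_per_month * 1_000


# S plan figures quoted above: $43/month, 900k credits, 1 credit per character.
per_1k = list_price_per_1k_chars(43.0, 900_000)
print(f"${per_1k:.4f} per 1K characters")  # $0.0478 per 1K characters
```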
Where each voice breaks.
Darker cells have higher WER. The bar inside each cell shows critical entity accuracy. This exposes whether a model is generally intelligible but weak on emails, URLs, names, or identifiers.
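Each heatmap cell aggregates prompts by model and category. The sketch below shows that aggregation. It assumes WER is pooled within a cell, meaning total word errors divided by total reference words, rather than averaged per prompt. The record fields and the model name `A` are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-prompt records; field names are illustrative, not the scripts' schema.
records = [
    {"model": "A", "category": "email", "word_errors": 3, "ref_words": 10, "entities_ok": 0, "entities_total": 1},
    {"model": "A", "category": "email", "word_errors": 1, "ref_words": 10, "entities_ok": 1, "entities_total": 1},
    {"model": "A", "category": "date", "word_errors": 0, "ref_words": 12, "entities_ok": 2, "entities_total": 2},
]


def heatmap_cells(rows):
    """Map (model, category) -> (pooled WER, entity accuracy or None if no entities)."""
    sums = defaultdict(lambda: {"errors": 0, "words": 0, "ok": 0, "total": 0})
    for r in rows:
        cell = sums[(r["model"], r["category"])]
        cell["errors"] += r["word_errors"]
        cell["words"] += r["ref_words"]
        cell["ok"] += r["entities_ok"]
        cell["total"] += r["entities_total"]
    return {
        key: (
            s["errors"] / s["words"] if s["words"] else 0.0,  # cell shading: higher WER is darker
            s["ok"] / s["total"] if s["total"] else None,     # in-cell bar: entity accuracy
        )
        for key, s in sums.items()
    }


for (model, category), (wer, ent_acc) in sorted(heatmap_cells(records).items()):
    print(model, category, f"WER={wer:.1%}", f"entities={ent_acc:.0%}")
```

Pooling weights long prompts more heavily than short ones. Per-prompt averaging would instead let one short, badly garbled email dominate its cell.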
Round-trip intelligibility.
Gradium is only used for TTS. The transcript must come from an independent ASR system.
1. Start with curated English prompts.
2. Generate speech with Gradium TTS and store the audio plus latency metadata.
3. Transcribe the audio with an independent ASR system, for example Whisper, Deepgram, AssemblyAI, or Google STT.
4. Normalize the reference and the hypothesis.
5. Compute strict and normalized WER/CER.
6. Extract critical entities and classify failures by category and severity.
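The steps above can be sketched as a single round-trip function. This is an illustration, not the repo's scripts. `synthesize` and `transcribe` are hypothetical callables standing in for the TTS provider and the independent ASR. The normalizer, which lowercases, strips punctuation, and collapses whitespace, is an assumed rule set. TTFB is recorded when the first audio chunk arrives, so a provider that returns its audio in one piece reports TTFB equal to total time.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class RoundTrip:
    prompt_id: str
    reference: str   # normalized prompt text
    hypothesis: str  # normalized ASR transcript
    ttfb_ms: float   # request start -> first audio chunk
    total_ms: float  # request start -> last audio chunk


def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return " ".join(text.split())


def round_trip(prompt_id: str, reference: str,
               synthesize: Callable[[str], Iterable[bytes]],
               transcribe: Callable[[bytes], str]) -> RoundTrip:
    start = time.perf_counter()
    chunks: list[bytes] = []
    ttfb_ms: Optional[float] = None
    for chunk in synthesize(reference):  # TTS provider, ideally streaming
        if ttfb_ms is None:
            ttfb_ms = (time.perf_counter() - start) * 1000
        chunks.append(chunk)
    total_ms = (time.perf_counter() - start) * 1000
    hypothesis = transcribe(b"".join(chunks))  # independent ASR, never the TTS vendor
    return RoundTrip(prompt_id, normalize(reference), normalize(hypothesis),
                     ttfb_ms if ttfb_ms is not None else total_ms, total_ms)


# Stub provider and ASR, only to show the flow.
rt = round_trip("p001", "The API uses OAuth, JWT, TLS, and HTTP/2.",
                synthesize=lambda text: iter([b"\x00\x01", b"\x02"]),
                transcribe=lambda audio: "The API uses OAuth, JWT, TLS and HTTP 2")
print(rt.reference == rt.hypothesis)  # True
```

This normalizer also erases the `/` in HTTP/2 and the colon in times. On its own, it cannot catch formatting-level errors in dense strings.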
Easy sentences are not enough.
| Metric | What it captures |
|---|---|
| Composite information-fidelity error | Headline composite score; lower is better. |
| Normalized WER | Word-level intelligibility after the ASR round-trip. |
| Normalized CER | Character-level preservation for identifiers and dense strings. |
| Strict normalized recovery rate | Whether the normalized reference equals the normalized ASR output. |
| Critical entity preservation | Whether numbers, dates, identifiers, names, and address-like strings survive. |
| Real-time readiness | Time from TTS request to first audio byte. |
| Failure records | Entity-level failure records with severity classes. |
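WER and CER are both Levenshtein edit distance divided by reference length, computed over words and over characters respectively. Here is a minimal sketch that assumes per-prompt scoring on already-normalized text. The scoring script may instead pool errors across the corpus.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions + deletions + insertions), one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion: reference token missing from hyp
                cur[j - 1] + 1,          # insertion: extra token in hyp
                prev[j - 1] + (r != h),  # substitution, or a free match
            )
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)


def strict_recovery_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of prompts whose normalized ASR output equals the normalized reference."""
    return sum(ref == hyp for ref, hyp in pairs) / len(pairs)


ref = "your appointment is scheduled for march 12th 2026 at 2 45 pm"
hyp = "your appointment is scheduled for march 12th 2026 at 2 40 pm"
print(f"WER={wer(ref, hyp):.3f}  CER={cer(ref, hyp):.3f}  exact={strict_recovery_rate([(ref, hyp)])}")
```

On the appointment example, one wrong token out of twelve gives a WER of about 8%, yet the appointment time is wrong. Entity checks exist to catch exactly this kind of error.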
    REF:   Your appointment is scheduled for March 12th, 2026 at 2:45 PM.
    ASR:   Your appointment is scheduled for March 12th, 2026 at 2:40 PM.
    ERROR:
      - time changed from 2:45 PM to 2:40 PM
      - category: date/time
      - severity: high
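A failure record like the one above can be produced by extracting times from both sides and comparing their canonical forms. The sketch below covers only the date/time category. It assumes the check runs on the raw ASR transcript, before normalization strips the colon, and that every time mismatch is high severity. The regex, field names, and severity rule are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical time pattern: "2:45 PM", "9:05 am", with optional space before AM/PM.
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([ap]m)\b", re.IGNORECASE)


@dataclass(frozen=True)
class EntityFailure:
    expected: str
    got: Optional[str]
    category: str
    severity: str


def canonical_times(text: str) -> list[str]:
    return [f"{int(h)}:{m} {ap.upper()}" for h, m, ap in TIME_RE.findall(text)]


def check_times(ref: str, hyp: str) -> list[EntityFailure]:
    """Flag every reference time whose canonical form is missing from the ASR text."""
    remaining = canonical_times(hyp)
    failures = []
    for expected in canonical_times(ref):
        if expected in remaining:
            remaining.remove(expected)  # consume it so repeated times are counted correctly
            continue
        got = remaining.pop(0) if remaining else None  # naive pairing with the next unmatched time
        failures.append(EntityFailure(expected, got, category="date/time", severity="high"))
    return failures


REF = "Your appointment is scheduled for March 12th, 2026 at 2:45 PM."
ASR = "Your appointment is scheduled for March 12th, 2026 at 2:40 PM."
for f in check_times(REF, ASR):
    print(f"time changed from {f.expected} to {f.got} | category: {f.category} | severity: {f.severity}")
```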
Your appointment is scheduled for March 12th, 2026 at 2:45 PM.
The confirmation code is 739-184-552.
The API uses OAuth, JWT, TLS, and HTTP/2.
Please send the invoice to alex.smith plus billing at example dot com.
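The email prompt is phrased the way a person would say it. A confirmation code may also come back from ASR with spaces instead of hyphens. In both cases, an exact string match is too strict. The sketch below applies per-category canonicalization to identifier spans. It assumes each span has already been located in both texts, and its spoken-token table is hypothetical and covers only the words in this example.

```python
import re

# Hypothetical spoken-form tokens mapped back to email syntax (English only).
SPOKEN_EMAIL = {"at": "@", "dot": ".", "plus": "+"}


def canonical_code(span: str) -> str:
    """Keep digits only, so '739-184-552' and '739 184 552' compare equal."""
    return re.sub(r"\D", "", span)


def canonical_email(span: str) -> str:
    """Collapse a written or spoken email span into address form."""
    span = span.strip().rstrip(".").lower().replace("@", " @ ")
    return "".join(SPOKEN_EMAIL.get(tok, tok) for tok in span.split())


expected = canonical_email("alex.smith plus billing at example dot com.")
print(expected)                                                                # alex.smith+billing@example.com
print(canonical_email("alex.smith+billing@example.com") == expected)          # True
print(canonical_email("alex smith plus billing at example dot com") == expected)  # False: the dot was lost
print(canonical_code("739-184-552") == canonical_code("739 184 552"))         # True
```

Canonicalizing by category keeps the check strict where it matters. A dropped dot in the email or a changed digit in the code still counts as a failure, while harmless formatting differences do not.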
Runnable from the repo root.
The API key stays in the environment. Raw audio and transcripts stay in run directories.
    # The API key stays in the environment.
    export GRADIUM_API_KEY=...

    # 1. Generate speech with Gradium TTS and store audio plus latency metadata.
    python scripts/tts_intelligibility_generate_gradium.py \
      --prompts data/tts-intelligibility/english_prompts.jsonl \
      --run-dir data/tts-intelligibility/runs/gradium-audrey-v1 \
      --voice-id Zd5POlBGSbD-JBXF

    # 2. Transcribe the audio with Whisper as the independent ASR.
    python scripts/tts_intelligibility_transcribe_whisper.py \
      --prompts data/tts-intelligibility/english_prompts.jsonl \
      --run-dir data/tts-intelligibility/runs/gradium-audrey-v1 \
      --model turbo

    # 3. Score the run.
    python scripts/tts_intelligibility_score.py \
      --prompts data/tts-intelligibility/english_prompts.jsonl \
      --run-dir data/tts-intelligibility/runs/gradium-audrey-v1