Text-to-Speech, focused
A dedicated TTS landing page — decoupled from STT. Literature-sourced leaderboard, use-case picks, and first-party measurements where we have them. The gap between commercial APIs and open-source has nearly closed; this page is here to help you pick correctly.
TTS Landscape
TTS Vendor Evaluation — UTMOS + WER round-trip
First-party measurement of Kokoro v1.0 vs Gradium on 50 Harvard sentences. UTMOS22-strong for naturalness, Whisper large-v3 round-trip WER for intelligibility. Reproducible scripts + raw data in-repo. Vendors: email k.wikiel@gmail.com to be in v2.
TTS Leaderboard
18 models ranked by MOS (Mean Opinion Score). Scores marked Subjective are vendor-reported or from literature — not independently verified. For measured numbers see the independent eval.
| # | Model | MOS | Architecture | Type | Params | Year |
|---|---|---|---|---|---|---|
| 1 | ElevenLabs Turbo v2.5 ElevenLabs · Industry-leading naturalness. Voice library + cloning. | 4.8 | Proprietary (diffusion-based) | Cloud API | — | 2024 |
| 2 | Sesame CSM Sesame · Best open-source TTS. Emotionally expressive dialogue. | 4.7 | Conversational Speech Model | Open Source | 1B+ | 2025 |
| 3 | OpenAI TTS HD OpenAI · 6 built-in voices. Simple API integration. | 4.7 | Proprietary | Cloud API | — | 2023 |
| 4 | Gemini 2.5 Pro TTS Google · 30 speakers, 80+ locales. Prompt-controlled style/emotion. | 4.7 | Multimodal LLM (native audio) | Cloud API | — | 2025 |
| 5 | Cartesia Sonic 2 Cartesia · Ultra-low latency (<90ms TTFB). Built for voice bots. | 4.7 | State-space model | Cloud API | — | 2025 |
| 6 | ElevenLabs Flash v2.5 ElevenLabs · Speed-optimized variant. ~120ms TTFB. | 4.6 | Proprietary (optimized) | Cloud API | — | 2025 |
| 7 | PlayHT 3.0 PlayHT · Advanced voice cloning with emotion control. | 4.6 | Proprietary | Cloud API | — | 2025 |
| 8 | Orpheus TTS Canopy Labs · Human-level with emotion tags (<laugh>, <sigh>). Fine-tunable on custom data. | 4.6 | LLM-based (Llama backbone) | Open Source | 3B | 2025 |
| 9 | Gemini 2.5 Flash TTS Google · Real-time optimized. Multi-speaker dialogue. | 4.5 | Multimodal LLM (native audio) | Cloud API | — | 2025 |
| 10 | Kokoro v1.0 Hexgrad · 82M params, Apache 2.0. Runs on CPU. | 4.5 | StyleTTS 2-based (lightweight) | Open Source | 82M | 2025 |
| 11 | XTTS v2 Coqui · Zero-shot voice cloning in 17 languages. | 4.5 | GPT-like + VITS decoder | Open Source | 467M | 2024 |
| 12 | Google Chirp 3 HD Google · 8 voice personas, 31 langs, instant voice cloning. | 4.4 | Generative (USM-based) | Cloud API | — | 2025 |
| 13 | Fish Speech 1.5 Fish Audio · Multilingual. Strong CJK language support. | 4.4 | VQGAN + Transformer | Open Source | 500M | 2025 |
| 14 | F5-TTS Shanghai AI Lab · Fast inference via flow matching. Strong zero-shot voice cloning. | 4.4 | Flow-matching (non-autoregressive) | Open Source | 335M | 2024 |
| 15 | Dia 1.6B Nari Labs · Generates dialogue with laughter, pauses, breathing. | 4.3 | Transformer + non-verbal tokens | Open Source | 1.6B | 2025 |
| 16 | Spark-TTS SparkAudio · Controllable attributes: pitch, speed, emotion. Multilingual. | 4.3 | Controllable Transformer | Open Source | 500M | 2025 |
| 17 | Parler-TTS Hugging Face · Text-described voice styles. Fully open training. | 4.1 | Prompt-controlled Transformer | Open Source | 880M | 2025 |
| 18 | Piper Rhasspy · Runs on Raspberry Pi. 30+ languages, <30ms latency. | 3.6 | VITS (lightweight) | Open Source | ~20M | 2023 |
Picks by Use Case
MOS rankings are one axis. The model you actually want depends on what you're building — a voice agent has different needs than audiobook narration or an embedded on-device assistant.
Voice agents: Cartesia Sonic 2 · <90ms TTFB, state-space architecture purpose-built for interactive voice. ElevenLabs Flash v2.5 is the fallback at ~120ms.
Maximum naturalness: ElevenLabs Turbo v2.5 · 4.8 MOS, widely considered indistinguishable from human. Massive voice library and commercial cloning.
On-device / CPU: Kokoro v1.0 · 82M params, Apache-2.0, ~10× real-time on CPU. Tied with commercial APIs on CodeSOTA's independent UTMOS eval (4.48).
Expressive dialogue: Sesame CSM · 4.7 MOS with emotional expressiveness. Dia 1.6B is the alternative for scripted dialogue with non-verbal cues.
Voice cloning: F5-TTS uses flow-matching for fast cloning; XTTS v2 covers 17 languages. Orpheus TTS is the LLM-based alternative with emotion tags.
Multilingual: Google Chirp 3 HD · 31 languages, 8 voice personas, instant cloning. Fish Speech 1.5 is the open-source alternative for CJK-heavy deployments.
Open Source vs Cloud
As of 2026 the quality gap is nearly closed: the best open-source TTS (Sesame CSM, 4.7 MOS) sits within 0.1 MOS of the top commercial API (ElevenLabs Turbo v2.5, 4.8 MOS). The remaining cloud advantages are in infrastructure, not naturalness.
When to go Open Source
- Data residency / air-gapped deployment
- High volume where per-character pricing hurts
- Custom fine-tuning on your voice or domain
- Edge / on-device inference (Kokoro, Piper)
- Full reproducibility for research
When to go Cloud API
- Sub-200ms TTFB streaming (Cartesia, ElevenLabs Flash)
- Professional voice cloning with licensing support
- Broad multilingual coverage (Chirp 3, ElevenLabs)
- Managed infrastructure, SLA, autoscaling
- You don't want to host a GPU
How TTS is Scored
MOS — Mean Opinion Score
Human raters listen to generated audio and score naturalness 1 (bad) to 5 (indistinguishable from human). A modern TTS target is 4.5+. Reference recordings of real speech typically score 4.5–4.7 — the ceiling.
Because real MOS studies are slow and expensive, papers often use automatic MOS predictors like UTMOS (trained on crowdsourced ratings). UTMOS correlates ~0.9 with true MOS on TTS-like audio — good enough for ranking, noisy enough that small gaps (<0.1) should not be treated as decisive.
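Because small UTMOS gaps are within measurement noise, per-sentence scores are better compared with confidence intervals than with raw means. A minimal sketch, using hypothetical per-sentence scores and a simple normal-approximation interval (real MOS studies also model rater and sentence effects):

```python
import statistics

def mos_summary(scores):
    """Mean score with an approximate 95% confidence interval.

    Treats scores (per-rater MOS or per-sentence UTMOS predictions)
    as i.i.d. samples; a simplification, but enough to show when a
    gap between two models is too small to be decisive.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    half = 1.96 * statistics.stdev(scores) / n ** 0.5  # normal approximation
    return mean, (mean - half, mean + half)

# Two hypothetical models whose per-sentence scores differ by ~0.04:
model_a = [4.5, 4.4, 4.6, 4.5, 4.3, 4.6, 4.5, 4.4, 4.5, 4.6]
model_b = [4.4, 4.5, 4.5, 4.4, 4.4, 4.5, 4.6, 4.3, 4.5, 4.4]

mean_a, ci_a = mos_summary(model_a)
mean_b, ci_b = mos_summary(model_b)
# mean_b falls inside model A's interval: the gap is not decisive.
```

With only 10 sentences per model, the intervals overlap heavily; this is why the page treats gaps under 0.1 as ties.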
Beyond MOS — what MOS misses
- Intelligibility on hard text. Numbers, abbreviations, named entities, code-switching. Measured via WER round-trip through an ASR system.
- Latency (TTFB). Time to first audio byte. For voice agents this matters more than MOS.
- Prosody & emotion. Does it sound bored? Does it pause at commas? Does it laugh believably?
- Voice similarity (SECS). For cloning: cosine similarity of speaker embeddings between reference and output.
- Streaming coherence. Whether chunks concatenate seamlessly or reveal synthesis boundaries.
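Two of the metrics above can be sketched in a few lines of pure Python. The WER round-trip compares the input text against an ASR transcript of the synthesized audio; SECS is cosine similarity between speaker embeddings. The transcript and embeddings below are illustrative placeholders; a real pipeline gets them from an ASR model (e.g. Whisper) and a speaker encoder.

```python
import math

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count.

    For the round-trip check, `hypothesis` would be the ASR transcript
    of the synthesized audio and `reference` the original input text.
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic programming over substitution / deletion / insertion.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def secs(ref_emb: list[float], out_emb: list[float]) -> float:
    """Speaker-embedding cosine similarity between reference and output,
    given fixed-size embeddings from any speaker encoder."""
    dot = sum(a * b for a, b in zip(ref_emb, out_emb))
    return dot / (math.hypot(*ref_emb) * math.hypot(*out_emb))

# Harvard sentence vs. a transcript with one dropped word (1 error / 8 words):
score = wer("the birch canoe slid on the smooth planks",
            "the birch canoe slid on smooth planks")  # 0.125
```

Note that lowercasing and whitespace splitting is the crudest possible normalization; published WER numbers depend heavily on how punctuation and number formatting are normalized before scoring.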
Related
Independent TTS Eval
First-party UTMOS + WER measurement. Kokoro vs Gradium.
Hub · Speech Benchmarks (STT + TTS)
Combined leaderboards, 35+ models, LibriSpeech and MOS.
Guide · Text-to-Audio Building Block
Integration guide: API shapes, voices, streaming patterns.
Guide · TTS Models Guide
Narrative walkthrough of the TTS landscape.
Tutorial · Learn: Build a TTS pipeline
Hands-on lesson comparing models in code.
Hub · Audio Benchmarks
Classification, music generation, audio understanding.