Text-to-Speech (Beta)

Text-to-Speech, focused

A dedicated TTS landing page — decoupled from STT. Literature-sourced leaderboard, use-case picks, and first-party measurements where we have them. The gap between commercial APIs and open-source has nearly closed; this page is here to help you pick correctly.

TTS Landscape

  • 4.8 · Best MOS (ElevenLabs Turbo v2.5)
  • 4.7 · Best Open Source (Sesame CSM)
  • 18 · Models tracked

Measured by CodeSOTA

TTS Vendor Evaluation — UTMOS + WER round-trip

First-party measurement of Kokoro v1.0 vs Gradium on 50 Harvard sentences: UTMOS (UTMOS22-strong) for naturalness, Whisper large-v3 round-trip WER for intelligibility. Reproducible scripts and raw data live in the repo. Vendors: email k.wikiel@gmail.com to be included in v2.
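The round-trip protocol described above can be sketched as follows. This is not the repo's exact script: `tts_fn` and `asr_fn` are placeholders for the real synthesis and Whisper transcription calls, and the WER here is a standard word-level Levenshtein distance.

```python
from typing import Callable, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # edit distances for the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)


def round_trip_wer(sentences: Sequence[str],
                   tts_fn: Callable[[str], bytes],
                   asr_fn: Callable[[bytes], str]) -> float:
    """Mean WER after synthesizing each sentence and transcribing it back."""
    scores = [word_error_rate(s, asr_fn(tts_fn(s))) for s in sentences]
    return sum(scores) / len(scores)
```

A perfect system scores 0.0; in practice the ASR model's own errors set a floor, which is why the same ASR must be used for every TTS system being compared.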

TTS Leaderboard

18 models ranked by MOS (Mean Opinion Score). Scores marked Subjective are vendor-reported or from literature — not independently verified. For measured numbers see the independent eval.

| # | Model | MOS | Architecture | Type | Params | Year | Notes |
|---|-------|-----|--------------|------|--------|------|-------|
| 1 | ElevenLabs Turbo v2.5 | 4.8 | Proprietary (diffusion-based) | Cloud API | n/a | 2024 | ElevenLabs · Industry-leading naturalness. Voice library + cloning. |
| 2 | Sesame CSM | 4.7 | Conversational Speech Model | Open Source | 1B+ | 2025 | Sesame · Best open-source TTS. Emotionally expressive dialogue. |
| 3 | OpenAI TTS HD | 4.7 | Proprietary | Cloud API | n/a | 2023 | OpenAI · 6 built-in voices. Simple API integration. |
| 4 | Gemini 2.5 Pro TTS | 4.7 | Multimodal LLM (native audio) | Cloud API | n/a | 2025 | Google · 30 speakers, 80+ locales. Prompt-controlled style/emotion. |
| 5 | Cartesia Sonic 2 | 4.7 | State-space model | Cloud API | n/a | 2025 | Cartesia · Ultra-low latency (<90ms TTFB). Built for voice bots. |
| 6 | ElevenLabs Flash v2.5 | 4.6 | Proprietary (optimized) | Cloud API | n/a | 2025 | ElevenLabs · Speed-optimized variant. ~120ms TTFB. |
| 7 | PlayHT 3.0 | 4.6 | Proprietary | Cloud API | n/a | 2025 | PlayHT · Advanced voice cloning with emotion control. |
| 8 | Orpheus TTS | 4.6 | LLM-based (Llama backbone) | Open Source | 3B | 2025 | Canopy Labs · Human-level with emotion tags (<laugh>, <sigh>). Fine-tunable on custom data. |
| 9 | Gemini 2.5 Flash TTS | 4.5 | Multimodal LLM (native audio) | Cloud API | n/a | 2025 | Google · Real-time optimized. Multi-speaker dialogue. |
| 10 | Kokoro v1.0 | 4.5 | Lightweight autoregressive | Open Source | 82M | 2025 | Hexgrad · 82M params, Apache 2.0. Runs on CPU. |
| 11 | XTTS v2 | 4.5 | GPT-like + VITS decoder | Open Source | 467M | 2024 | Coqui · Zero-shot voice cloning in 17 languages. |
| 12 | Google Chirp 3 HD | 4.4 | Generative (USM-based) | Cloud API | n/a | 2025 | Google · 8 voice personas, 31 langs, instant voice cloning. |
| 13 | Fish Speech 1.5 | 4.4 | VQGAN + Transformer | Open Source | 500M | 2025 | Fish Audio · Multilingual. Strong CJK language support. |
| 14 | F5-TTS | 4.4 | Flow-matching (non-autoregressive) | Open Source | 335M | 2024 | Shanghai AI Lab · Fast inference via flow matching. Strong zero-shot voice cloning. |
| 15 | Dia 1.6B | 4.3 | Transformer + non-verbal tokens | Open Source | 1.6B | 2025 | Nari Labs · Generates dialogue with laughter, pauses, breathing. |
| 16 | Spark-TTS | 4.3 | Controllable Transformer | Open Source | 500M | 2025 | SparkAudio · Controllable attributes: pitch, speed, emotion. Multilingual. |
| 17 | Parler-TTS | 4.1 | Prompt-controlled Transformer | Open Source | 880M | 2025 | Hugging Face · Text-described voice styles. Fully open training. |
| 18 | Piper | 3.6 | VITS (lightweight) | Open Source | ~20M | 2023 | Rhasspy · Runs on Raspberry Pi. 30+ languages, <30ms latency. |

Picks by Use Case

MOS rankings are one axis. The model you actually want depends on what you're building — a voice agent has different needs than audiobook narration or an embedded on-device assistant.

Open Source vs Cloud

As of 2026 the quality gap is nearly closed: the best open-source TTS (Sesame CSM, 4.7 MOS) is within 0.1 MOS of the top commercial API (ElevenLabs Turbo v2.5, 4.8 MOS). The remaining cloud advantages are in infrastructure, not naturalness.

When to go Open Source

  • Data residency / air-gapped deployment
  • High volume where per-character pricing hurts
  • Custom fine-tuning on your voice or domain
  • Edge / on-device inference (Kokoro, Piper)
  • Full reproducibility for research

When to go Cloud API

  • Sub-200ms TTFB streaming (Cartesia, ElevenLabs Flash)
  • Professional voice cloning with licensing support
  • Broad multilingual coverage (Chirp 3, ElevenLabs)
  • Managed infrastructure, SLA, autoscaling
  • You don't want to host a GPU

How TTS is Scored

MOS — Mean Opinion Score

Human raters listen to generated audio and score naturalness 1 (bad) to 5 (indistinguishable from human). A modern TTS target is 4.5+. Reference recordings of real speech typically score 4.5–4.7 — the ceiling.

Because real MOS studies are slow and expensive, papers often use automatic MOS predictors like UTMOS (trained on crowdsourced ratings). UTMOS correlates ~0.9 with true MOS on TTS-like audio — good enough for ranking, noisy enough that small gaps (<0.1) should not be treated as decisive.
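To see why a <0.1 gap is not decisive, look at the spread of individual ratings behind a single MOS number. A minimal sketch (not this page's scoring code) that computes a mean and a normal-approximation 95% confidence half-width:

```python
import math


def mos_with_ci(ratings: list[float]) -> tuple[float, float]:
    """Mean opinion score and half-width of a normal-approx 95% CI."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)
    return mean, half
```

With 20 raters split evenly between 4s and 5s, the mean is 4.5 but the interval half-width is over 0.2, so two systems 0.1 apart on this sample cannot be confidently ranked.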

Beyond MOS — what MOS misses

  • Intelligibility on hard text. Numbers, abbreviations, named entities, code-switching. Measured via WER round-trip through an ASR system.
  • Latency (TTFB). Time to first audio byte. For voice agents this matters more than MOS.
  • Prosody & emotion. Does it sound bored? Does it pause at commas? Does it laugh believably?
  • Voice similarity (SECS). For cloning: cosine similarity of speaker embeddings between reference and output.
  • Streaming coherence. Whether chunks concatenate seamlessly or reveal synthesis boundaries.
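Of these, voice similarity (SECS) is the simplest to compute once you have speaker embeddings: the score is just a cosine between the reference and generated embeddings. The embeddings themselves come from a separate speaker encoder (e.g. an ECAPA-style model, not shown here); the scoring step is:

```python
import math


def secs(ref_emb: list[float], gen_emb: list[float]) -> float:
    """Cosine similarity between reference and generated speaker embeddings."""
    dot = sum(a * b for a, b in zip(ref_emb, gen_emb))
    norm = (math.sqrt(sum(a * a for a in ref_emb))
            * math.sqrt(sum(b * b for b in gen_emb)))
    return dot / norm
```

A score of 1.0 means identical direction in embedding space; unrelated voices land near 0. Absolute thresholds depend on which speaker encoder produced the embeddings, so only compare SECS numbers computed with the same encoder.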

Beta. This page is new and in active iteration. MOS column is literature-sourced for most rows; see the independent eval for numbers measured in-repo. Feedback welcome at k.wikiel@gmail.com.