
Text-to-Speech

Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while models such as Bark, Microsoft's VALL-E, and the open-weight F5-TTS demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (largely solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy: MOS (Mean Opinion Score) is subjective and expensive to collect, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.


Text-to-speech converts written text into natural-sounding audio. The field has gone from robotic concatenative synthesis to neural models that are nearly indistinguishable from human speech. ElevenLabs, OpenAI TTS, and open-source models like XTTS-v2 and F5-TTS achieve remarkable naturalness, with the frontier now focused on expressiveness, emotion control, and real-time streaming.

History

2016

WaveNet (DeepMind) generates raw audio waveforms with autoregressive neural networks — first truly natural-sounding TTS

2017

Tacotron (Google) introduces end-to-end TTS from characters to spectrograms, simplifying the pipeline

2018

Tacotron 2 + WaveGlow achieve near-human MOS (Mean Opinion Score) of 4.5/5 on LJSpeech

2019

FastSpeech introduces non-autoregressive spectrogram generation, enabling real-time synthesis

2021

VITS (Kim et al.) combines variational inference with adversarial training for end-to-end TTS with high fidelity

2023

XTTS-v2 (Coqui) and Bark enable zero-shot voice cloning from short reference audio

2023

ElevenLabs launches with strikingly natural multi-speaker TTS; captures significant commercial market share

2024

OpenAI TTS API and GPT-4o voice mode demonstrate conversational-quality real-time speech synthesis

2024

F5-TTS and MaskGCT introduce flow-matching and masked generative approaches, rivaling autoregressive quality

2025

Fish Speech, Dia (Nari Labs), and Sesame CSM push open-source TTS to near-commercial quality with multi-speaker support

How Text-to-Speech Works

1

Text normalization

Input text is expanded: numbers to words, abbreviations to full forms, handling of punctuation and special characters
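In code, this step amounts to a cascade of substitution rules. A toy sketch (the abbreviation table and digit list are illustrative placeholders; real normalizers also handle multi-digit numbers, dates, currency, and ordinals):

```python
import re

# Toy text normalizer: expands a few abbreviations and standalone digits.
# The tables below are illustrative, not a production inventory.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text: str) -> str:
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Spell out standalone single digits; a real normalizer covers
    # arbitrary numbers, dates, currency amounts, and ordinals.
    return re.sub(r"\b(\d)\b", lambda m: DIGITS[int(m.group(1))], text)

print(normalize("Dr. Smith adopted 3 cats"))  # Doctor Smith adopted three cats
```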

2

Phoneme conversion

Graphemes are converted to phonemes using a pronunciation model or G2P (grapheme-to-phoneme) system
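Dictionary lookup with a fallback is the classic G2P recipe. A minimal sketch (the two-entry lexicon and letter-by-letter fallback are placeholders; production systems back off to a trained seq2seq model for out-of-vocabulary words):

```python
# Minimal dictionary-lookup G2P: in-vocabulary words map to ARPAbet-style
# phonemes; anything else falls through to a naive placeholder.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def g2p(word: str) -> list[str]:
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # Naive fallback: one symbol per letter. Real systems use a learned
    # grapheme-to-phoneme model here.
    return list(word.upper())

print(g2p("hello"))  # ['HH', 'AH', 'L', 'OW']
```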

3

Acoustic modeling

A transformer or diffusion model generates mel-spectrograms or latent audio tokens from the phoneme sequence
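At the interface level, the acoustic model maps a phoneme-ID sequence to a frames-by-mel-bins matrix. A shape-only sketch (the zero-filled output and fixed frames-per-phoneme are placeholders for a real transformer or diffusion network):

```python
MEL_BINS = 80  # a common mel-spectrogram resolution

def acoustic_model(phoneme_ids, frames_per_phoneme=4):
    # Placeholder: real models predict per-phoneme durations and fill each
    # frame with learned spectral content; here we just produce the shape.
    n_frames = len(phoneme_ids) * frames_per_phoneme
    return [[0.0] * MEL_BINS for _ in range(n_frames)]

mel = acoustic_model([3, 14, 15, 9, 2])
print(len(mel), len(mel[0]))  # 20 80
```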

4

Vocoding

A vocoder (HiFi-GAN, BigVGAN, or flow-based) converts spectrograms to raw audio waveforms at 22-48kHz
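As a crude stand-in for what a vocoder does, the sketch below renders a sine tone from a frame-level F0 track; HiFi-GAN and BigVGAN instead learn the full spectrogram-to-waveform mapping (the sample rate and hop size here are illustrative, though both are common defaults):

```python
import math

SAMPLE_RATE = 22050   # 22.05 kHz, a common TTS output rate
HOP = 256             # audio samples per spectrogram frame

def toy_vocoder(f0_per_frame):
    # Render each frame's fundamental frequency as a continuous sine,
    # carrying phase across frame boundaries to avoid clicks.
    samples, phase = [], 0.0
    for f0 in f0_per_frame:
        for _ in range(HOP):
            samples.append(math.sin(phase))
            phase += 2 * math.pi * f0 / SAMPLE_RATE
    return samples

audio = toy_vocoder([220.0] * 10)  # 10 frames of A3
print(len(audio))  # 2560 samples, about 0.12 s of audio
```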

5

Prosody control

Duration, pitch, and energy are either predicted by the model or controllable via conditioning signals
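Duration control in the FastSpeech style can be sketched as explicit upsampling, with a global speed factor as one simple conditioning signal (a toy illustration, not any particular model's implementation):

```python
def expand_durations(phonemes, durations, speed=1.0):
    # Repeat each phoneme's features for its predicted number of frames;
    # dividing by `speed` rescales the overall speaking rate.
    frames = []
    for ph, dur in zip(phonemes, durations):
        frames.extend([ph] * max(1, round(dur / speed)))
    return frames

# Predicted durations of 2, 4, and 3 frames; speed=2.0 roughly halves them
print(expand_durations(["HH", "AH", "L"], [2, 4, 3]))
print(expand_durations(["HH", "AH", "L"], [2, 4, 3], speed=2.0))
```

Pitch and energy are handled analogously: per-phoneme predictions that can be overridden by external conditioning before the decoder runs.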

Current Landscape

TTS in 2025 has crossed the uncanny valley — casual listeners cannot reliably distinguish top models from human speech. The commercial market is dominated by ElevenLabs and OpenAI, while open-source alternatives (F5-TTS, Fish Speech, Dia) have closed the quality gap remarkably. The architecture landscape is diverse: autoregressive (VALL-E style), flow-matching (F5-TTS), diffusion (NaturalSpeech 3), and codec-based approaches (SoundStorm, MaskGCT) all produce excellent results. The competitive frontier has shifted from quality to control: emotion, style, pacing, and multi-speaker conversations.

Key Challenges

Expressiveness and emotion: conveying sarcasm, excitement, sadness, and subtle tonal shifts naturally remains difficult

Long-form synthesis: maintaining consistent prosody, pacing, and voice quality over paragraphs of text

Multilingual synthesis: natural accent handling and code-switching between languages in a single utterance

Real-time streaming: achieving low first-byte latency (<200ms) for conversational AI applications

Ethical concerns: voice cloning enables deepfakes and impersonation; consent and detection mechanisms are needed
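To see what the sub-200ms streaming target implies, a back-of-envelope budget (the sample rate and chunk size below are illustrative choices, not fixed standards):

```python
# First-byte latency budget for chunked streaming TTS.
SAMPLE_RATE = 24000    # Hz
CHUNK_SAMPLES = 1024   # audio samples emitted per streaming chunk

chunk_ms = CHUNK_SAMPLES / SAMPLE_RATE * 1000   # time to play one chunk
budget_ms = 200                                 # target first-byte latency
model_budget_ms = budget_ms - chunk_ms          # left for model + network

print(round(chunk_ms, 1), round(model_budget_ms, 1))  # 42.7 157.3
```

In other words, generating the first ~43ms chunk leaves only about 157ms for inference, encoding, and network transit, which is why non-autoregressive and streaming-first architectures dominate this use case.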

Quick Recommendations

Best quality (API)

ElevenLabs Turbo v2.5 or OpenAI TTS HD

Near-human naturalness with voice selection and emotion control; low-latency streaming

Open-source (best quality)

F5-TTS or Fish Speech 1.5

Flow-matching architecture with excellent prosody; fully open weights

Zero-shot voice cloning

XTTS-v2 or OpenVoice v2

Clone any voice from 6-30 seconds of reference audio; supports 17+ languages

Real-time / low-latency

VITS or Piper TTS

Non-autoregressive, runs in real-time on CPU; ideal for edge and embedded devices

Conversational AI

GPT-4o voice mode or Sesame CSM

Native speech-in-speech-out with natural turn-taking and expressiveness

What's Next

The next wave is fully conversational TTS that responds in real time with appropriate emotion and turn-taking (in the style of GPT-4o voice). Expect voice agents that maintain personality consistency over hours of dialogue, singing-voice synthesis that rivals studio recordings, and universal multilingual TTS covering 100+ languages from a single model. On the safety side, voice watermarking and synthetic-speech detection will become mandatory features.

