Text-to-Speech
Text-to-speech (TTS) converts written text into natural-sounding audio. The field has moved from robotic concatenative synthesis to neural models whose output is often hard to distinguish from recorded human speech. Commercial systems such as ElevenLabs and OpenAI's TTS, along with open models such as XTTS-v2, produce speech that casual listeners often cannot tell apart from recordings, while systems such as VALL-E (Microsoft), Bark, and F5-TTS showed that a voice can be cloned from a few seconds of reference audio. With intelligibility largely solved, the frontier is now prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy: MOS (Mean Opinion Score) is subjective and expensive to collect, and automated predictors such as UTMOS correlate only loosely with human preference, which makes benchmark comparisons unreliable.
History
WaveNet (DeepMind) generates raw audio waveforms with autoregressive neural networks — first truly natural-sounding TTS
Tacotron (Google) introduces end-to-end TTS from characters to spectrograms, simplifying the pipeline
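Tacotron 2 (Google), paired with a WaveNet vocoder, reaches a near-human MOS (Mean Opinion Score) of about 4.5/5 in its authors' listening tests; WaveGlow (NVIDIA) later offers a faster flow-based vocoder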
FastSpeech introduces non-autoregressive spectrogram generation, enabling real-time synthesis
VITS (Kim et al.) combines variational inference with adversarial training for end-to-end TTS with high fidelity
XTTS-v2 (Coqui) and Bark enable zero-shot voice cloning from short reference audio
ElevenLabs launches with strikingly natural multi-speaker TTS; captures significant commercial market share
OpenAI TTS API and GPT-4o voice mode demonstrate conversational-quality real-time speech synthesis
F5-TTS and MaskGCT introduce flow-matching and masked generative approaches, rivaling autoregressive quality
Fish Speech, Dia (Nari Labs), and Sesame CSM push open-source TTS to near-commercial quality with multi-speaker support
How Text-to-Speech Works
Text normalization
Input text is expanded: numbers to words, abbreviations to full forms, handling of punctuation and special characters
Phoneme conversion
Graphemes are converted to phonemes using a pronunciation model or G2P (grapheme-to-phoneme) system
Acoustic modeling
A transformer or diffusion model generates mel-spectrograms or latent audio tokens from the phoneme sequence
Vocoding
A vocoder (HiFi-GAN, BigVGAN, or flow-based) converts spectrograms to raw audio waveforms at 22-48kHz
Prosody control
Duration, pitch, and energy are either predicted by the model or controllable via conditioning signals
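The first two stages above (text normalization and phoneme conversion) can be sketched in a few lines. This is a minimal illustration, not how any particular TTS system implements them: the abbreviation table, the tiny ARPAbet-style lexicon, and the function names `normalize`, `number_to_words`, and `to_phonemes` are all hypothetical. Production front ends use much larger lexicons plus a learned G2P model for unseen words.

```python
import re

# Hypothetical, deliberately tiny tables for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# ARPAbet-style pronunciations (assumed entries, in the style of CMUdict).
LEXICON = {
    "doctor": ["D", "AA1", "K", "T", "ER0"],
    "forty": ["F", "AO1", "R", "T", "IY0"],
    "two": ["T", "UW1"],
    "main": ["M", "EY1", "N"],
    "street": ["S", "T", "R", "IY1", "T"],
}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0-999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def _spell_number(match: re.Match) -> str:
    digits = match.group()
    if len(digits) <= 3:
        return number_to_words(int(digits))
    # Fallback for longer numbers: read them digit by digit.
    return " ".join(ONES[int(d)] for d in digits)


def normalize(text: str) -> str:
    """Lowercase, expand abbreviations, spell out numbers, drop punctuation."""
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.lower().split()]
    text = re.sub(r"\d+", _spell_number, " ".join(tokens))
    text = re.sub(r"[^a-z' ]", " ", text)
    return " ".join(text.split())


def to_phonemes(normalized: str) -> list[str]:
    """Look each word up in the lexicon; unknown words fall back to letters."""
    phones: list[str] = []
    for word in normalized.split():
        phones.extend(LEXICON.get(word, list(word.upper())))
    return phones


if __name__ == "__main__":
    clean = normalize("Dr. Smith lives at 42 Main St.")
    print(clean)
    print(to_phonemes(clean))
```

The letter-by-letter fallback for out-of-lexicon words is only a placeholder; it marks the spot where a real system would call its G2P model. The resulting phoneme sequence is what the acoustic model consumes in the next stage.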
Current Landscape
TTS in 2025 has crossed the uncanny valley: casual listeners cannot reliably distinguish top models from human speech. The commercial market is dominated by ElevenLabs and OpenAI, while open-source alternatives (F5-TTS, Fish Speech, Dia) have closed much of the quality gap. The architecture landscape is diverse: autoregressive (VALL-E style), flow-matching (F5-TTS), diffusion (NaturalSpeech 3), and codec-based approaches (SoundStorm, MaskGCT) all produce excellent results. The competitive frontier has shifted from quality to control: emotion, style, pacing, and multi-speaker conversations.
Key Challenges
Expressiveness and emotion: conveying sarcasm, excitement, sadness, and subtle tonal shifts naturally remains difficult
Long-form synthesis: maintaining consistent prosody, pacing, and voice quality over paragraphs of text
Multilingual TTS: natural accent handling and code-switching between languages within a single utterance
Real-time streaming with low latency (<200ms first-byte) for conversational AI applications
Ethical concerns: voice cloning enables deepfakes and impersonation; consent and detection mechanisms are needed
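The streaming latency target in the list above is usually tracked with two numbers: time to first audio chunk, and real-time factor (synthesis time divided by the duration of the audio produced). The sketch below shows one way to measure both for any chunk iterator. `fake_streaming_tts` is a hypothetical stand-in that yields silent PCM, and the sample rate and chunk size are assumptions, not values taken from a specific engine.

```python
import time
from typing import Iterator, Tuple

SAMPLE_RATE = 24_000   # assumed output rate in Hz
BYTES_PER_SAMPLE = 2   # assumed 16-bit mono PCM


def fake_streaming_tts(text: str, chunk_ms: int = 40,
                       delay_s: float = 0.005) -> Iterator[bytes]:
    """Hypothetical engine: yields silent PCM chunks after a simulated delay."""
    n_chunks = max(1, len(text.split())) * 5  # pretend each word needs 5 chunks
    samples_per_chunk = SAMPLE_RATE * chunk_ms // 1000
    chunk = bytes(samples_per_chunk * BYTES_PER_SAMPLE)
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulated compute time per chunk
        yield chunk


def measure_stream(chunks: Iterator[bytes]) -> Tuple[float, float, float]:
    """Return (time_to_first_chunk_s, audio_duration_s, real_time_factor)."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    elapsed_s = time.perf_counter() - start
    if first_chunk_s is None or total_bytes == 0:
        raise ValueError("stream produced no audio")
    audio_s = total_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)
    return first_chunk_s, audio_s, elapsed_s / audio_s


if __name__ == "__main__":
    ttfc, audio_s, rtf = measure_stream(fake_streaming_tts("hello world"))
    print(f"first chunk after {ttfc * 1000:.1f} ms, "
          f"{audio_s:.2f} s of audio, RTF {rtf:.2f}")
```

A real-time factor below 1 means audio is generated faster than it plays back. For a networked API the first-chunk time also includes connection and transport delay, so it should be measured from the client side.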
Quick Recommendations
Best quality (API)
ElevenLabs Turbo v2.5 or OpenAI TTS HD
Near-human naturalness with voice selection and emotion control; low-latency streaming
Open-source (best quality)
F5-TTS or Fish Speech 1.5
Excellent prosody with downloadable weights; F5-TTS uses a flow-matching architecture (license terms vary by model, so check them before commercial use)
Zero-shot voice cloning
XTTS-v2 or OpenVoice v2
Clone any voice from 6-30 seconds of reference audio; supports 17+ languages
Real-time / low-latency
VITS or Piper TTS
Non-autoregressive, runs in real-time on CPU; ideal for edge and embedded devices
Conversational AI
GPT-4o voice mode or Sesame CSM
Native speech-in-speech-out with natural turn-taking and expressiveness
What's Next
The next wave is fully conversational TTS that responds in real time with appropriate emotion and turn-taking, in the manner of GPT-4o voice mode. Expect voice agents that maintain personality consistency over hours of dialogue, song synthesis that rivals studio recordings, and universal multilingual TTS covering 100+ languages from a single model. On the safety side, voice watermarking and synthetic speech detection are likely to become standard, and possibly mandatory, features.
Benchmarks & SOTA
VCTK
CSTR VCTK Corpus
Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.
State of the Art: NaturalSpeech 3 (Microsoft Research), MOS 4.36
LJ Speech
The LJ Speech Dataset
13,100 short audio clips of a single speaker reading passages from non-fiction books. Standard benchmark for single-speaker TTS.
State of the Art: VALL-E 2 (Microsoft), MOS 4.61
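Because the scores above are MOS values, it helps to recall how a MOS is computed: it is the arithmetic mean of listener ratings on a 1-5 scale, and it is normally reported with a confidence interval. The sketch below uses made-up ratings and a normal-approximation 95% interval (z = 1.96); both are assumptions for illustration, and with this few raters a t-distribution would be the more careful choice.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (MOS, half-width of the confidence interval) for 1-5 ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


if __name__ == "__main__":
    # Hypothetical ratings from eight listeners for one system.
    ratings = [5, 4, 4, 5, 3, 4, 5, 4]
    mos, half = mos_with_ci(ratings)
    print(f"MOS = {mos:.2f} +/- {half:.2f}")  # MOS = 4.25 +/- 0.49
```

With only eight raters the interval spans roughly half a point, which is wider than the gap between many published systems. This is one reason MOS figures collected in different listening tests are hard to compare directly.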
Related Tasks
Speaker Verification
Verifying speaker identity from voice samples.
Speech Translation
Translating spoken audio directly to another language.
Voice Cloning
Replicating a speaker's voice characteristics.
Speech Recognition
Automatic speech recognition moved from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and quickly became the de facto open-source standard. Whisper large-v3 achieves under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech, where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.