Text-to-speech
Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as "read aloud" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.
The field has progressed from robotic concatenative synthesis to neural models that are nearly indistinguishable from human speech. ElevenLabs, OpenAI TTS, and open-source models such as XTTS-v2 and F5-TTS achieve remarkable naturalness, and the frontier has shifted to expressiveness, emotion control, and real-time streaming.
History
2016: WaveNet (DeepMind) generates raw audio waveforms with autoregressive neural networks — the first truly natural-sounding TTS
2017: Tacotron (Google) introduces end-to-end TTS from characters to spectrograms, simplifying the pipeline
2018: Tacotron 2, paired with neural vocoders such as WaveNet and WaveGlow, achieves near-human MOS (Mean Opinion Score) of ~4.5/5
2019: FastSpeech introduces non-autoregressive spectrogram generation, enabling real-time synthesis
2021: VITS (Kim et al.) combines variational inference with adversarial training for high-fidelity end-to-end TTS
2023: XTTS-v2 (Coqui) and Bark enable zero-shot voice cloning from short reference audio
2023: ElevenLabs launches with strikingly natural multi-speaker TTS and captures significant commercial market share
2023-24: OpenAI's TTS API and GPT-4o voice mode demonstrate conversational-quality real-time speech synthesis
2024: F5-TTS and MaskGCT introduce flow-matching and masked generative approaches that rival autoregressive quality
2024-25: Fish Speech, Dia (Nari Labs), and Sesame CSM push open-source TTS to near-commercial quality with multi-speaker support
How Text-to-speech Works
Text normalization
Input text is expanded: numbers to words, abbreviations to full forms, handling of punctuation and special characters
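The expansion step can be sketched in a few lines. This is a minimal, illustrative normalizer assuming a tiny abbreviation table and numbers limited to 0-99; real TTS front ends handle dates, currency, ordinals, acronyms, and much more.

```python
# Minimal text-normalization sketch (illustrative rules only).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def spell_number(n: int) -> str:
    """Spell out an integer in the range 0-99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        t, o = divmod(n, 10)
        return TENS[t] + ("-" + ONES[o] if o else "")
    raise ValueError("sketch handles 0-99 only")

def normalize(text: str) -> str:
    """Expand abbreviations and digits token by token."""
    out = []
    for tok in text.split():
        low = tok.lower()
        if low in ABBREVIATIONS:
            out.append(ABBREVIATIONS[low])
        elif tok.isdigit():
            out.append(spell_number(int(tok)))
        else:
            out.append(tok)
    return " ".join(out)

print(normalize("Dr. Smith lives at 42 Elm St."))
# doctor Smith lives at forty-two Elm street
```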
Phoneme conversion
Graphemes are converted to phonemes using a pronunciation model or G2P (grapheme-to-phoneme) system
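A toy G2P can be sketched as lexicon lookup with a crude per-letter fallback. The ARPAbet-style entries below are illustrative assumptions; production systems use lexica such as CMUdict plus a trained sequence-to-sequence model for out-of-vocabulary words.

```python
# Toy grapheme-to-phoneme (G2P): lexicon lookup with naive fallback.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "speech": ["S", "P", "IY", "CH"],
}

def g2p(word: str) -> list[str]:
    entry = LEXICON.get(word.lower())
    if entry is not None:
        return entry
    # Fallback: one pseudo-phoneme per letter; a real system would
    # run a trained G2P model here instead.
    return [ch.upper() for ch in word if ch.isalpha()]

print(g2p("hello"))  # ['HH', 'AH', 'L', 'OW']
print(g2p("TTS"))    # ['T', 'T', 'S']
```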
Acoustic modeling
A transformer or diffusion model generates mel-spectrograms or latent audio tokens from the phoneme sequence
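The interface of this stage can be sketched with NumPy: phoneme IDs in, mel frames out. The random matrices stand in for a trained network, and the fixed five-frames-per-phoneme duration is an illustrative assumption.

```python
import numpy as np

# Interface sketch of an acoustic model; random weights stand in for
# a trained transformer or diffusion decoder.
rng = np.random.default_rng(0)
VOCAB, EMB, N_MELS = 50, 64, 80
embed = rng.normal(size=(VOCAB, EMB))   # phoneme embedding table
proj = rng.normal(size=(EMB, N_MELS))   # stand-in for the decoder

def acoustic_model(phoneme_ids: np.ndarray,
                   frames_per_phoneme: int = 5) -> np.ndarray:
    h = embed[phoneme_ids]                        # (P, EMB)
    h = np.repeat(h, frames_per_phoneme, axis=0)  # (P*frames, EMB)
    return h @ proj                               # (T, N_MELS) mel frames

mel = acoustic_model(np.array([3, 7, 12]))
print(mel.shape)  # (15, 80)
```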
Vocoding
A vocoder (HiFi-GAN, BigVGAN, or a flow-based model) converts spectrograms into raw audio waveforms at 22-48 kHz
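To make the spectrogram-to-waveform step concrete, here is a crude oscillator-bank sketch: each magnitude-spectrogram bin drives a sinusoid at its bin-centre frequency, with phase carried across frames. Neural vocoders like HiFi-GAN and BigVGAN learn this mapping instead of computing it directly; all parameters below are illustrative.

```python
import numpy as np

def toy_vocoder(mag_frames: np.ndarray, sr: int = 22050,
                hop: int = 256, n_fft: int = 512) -> np.ndarray:
    """Oscillator-bank sketch: spectrogram frames -> waveform."""
    n_frames, n_bins = mag_frames.shape
    freqs = np.arange(n_bins) * sr / n_fft   # bin centre frequencies (Hz)
    out = np.zeros(n_frames * hop)
    phase = np.zeros(n_bins)
    t = np.arange(hop)[:, None]              # sample index within a frame
    for i, frame in enumerate(mag_frames):
        # Sum one sinusoid per bin, weighted by the bin's magnitude.
        chunk = (frame * np.sin(phase + 2 * np.pi * freqs * t / sr)).sum(axis=1)
        out[i * hop:(i + 1) * hop] = chunk
        phase += 2 * np.pi * freqs * hop / sr  # keep sinusoids continuous
    peak = np.abs(out).max()
    return out / peak if peak > 0 else out     # normalise to [-1, 1]

wave = toy_vocoder(np.abs(np.random.default_rng(1).normal(size=(20, 257))))
print(wave.shape)  # (5120,)
```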
Prosody control
Duration, pitch, and energy are either predicted by the model or controllable via conditioning signals
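The duration side of prosody control can be sketched as FastSpeech-style length regulation: each phoneme's hidden vector is repeated by its predicted duration in frames, which sets the pacing of the output. Durations are supplied explicitly here; in a real model a duration predictor emits them, and pitch/energy are injected as similar per-frame conditioning signals.

```python
import numpy as np

def length_regulate(hidden: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each phoneme's hidden state by its duration in frames."""
    return np.repeat(hidden, durations, axis=0)

h = np.eye(3)                                     # 3 phonemes, toy hidden states
frames = length_regulate(h, np.array([2, 1, 3]))  # 2+1+3 = 6 output frames
print(frames.shape)  # (6, 3)

# Speaking rate is controlled by scaling the durations, e.g. 2x slower:
slow = length_regulate(h, 2 * np.array([2, 1, 3]))
print(slow.shape)  # (12, 3)
```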
Current Landscape
TTS in 2025 has crossed the uncanny valley — casual listeners cannot reliably distinguish top models from human speech. The commercial market is dominated by ElevenLabs and OpenAI, while open-source alternatives (F5-TTS, Fish Speech, Dia) have largely closed the quality gap. The architecture landscape is diverse: autoregressive (VALL-E style), flow-matching (F5-TTS), diffusion (NaturalSpeech 3), and codec-based approaches (SoundStorm, MaskGCT) all produce excellent results. The competitive frontier has shifted from quality to control: emotion, style, pacing, and multi-speaker conversations.
Key Challenges
Expressiveness and emotion: conveying sarcasm, excitement, sadness, and subtle tonal shifts naturally remains difficult
Long-form synthesis: maintaining consistent prosody, pacing, and voice quality over paragraphs of text
Multilingual TTS with natural accent handling — code-switching between languages in a single utterance
Real-time streaming with low latency (<200ms first-byte) for conversational AI applications
Ethical concerns: voice cloning enables deepfakes and impersonation; consent and detection mechanisms are needed
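The streaming-latency challenge above comes down to yielding audio chunk by chunk so playback can start before the full utterance is rendered. A minimal sketch, where `synth()` is a stand-in that returns 0.1 s of silence per text chunk rather than a real engine:

```python
import time

def synth(piece: str) -> bytes:
    """Stand-in synthesizer: pretend cost, then 0.1 s of 16-bit silence."""
    time.sleep(0.01)
    return b"\x00" * 2 * 2205  # 2205 samples at 22.05 kHz, 2 bytes each

def stream_tts(text: str, chunk_chars: int = 40):
    """Yield audio per text chunk instead of waiting for the whole text."""
    for i in range(0, len(text), chunk_chars):
        yield synth(text[i:i + chunk_chars])

start = time.perf_counter()
chunks = stream_tts("This is a long paragraph that we want to hear quickly.")
first = next(chunks)  # playback can begin as soon as this returns
first_byte_ms = (time.perf_counter() - start) * 1000
print(len(first), first_byte_ms < 200)  # 4410 True
```

Time-to-first-chunk stays near the cost of synthesizing one chunk, rather than growing with the length of the input.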
Quick Recommendations
Best quality (API)
ElevenLabs Turbo v2.5 or OpenAI TTS HD
Near-human naturalness with voice selection and emotion control; low-latency streaming
Open-source (best quality)
F5-TTS or Fish Speech 1.5
F5-TTS uses a flow-matching architecture with excellent prosody; both models ship fully open weights
Zero-shot voice cloning
XTTS-v2 or OpenVoice v2
Clone any voice from 6-30 seconds of reference audio; supports 17+ languages
Real-time / low-latency
VITS or Piper TTS
Non-autoregressive, runs in real-time on CPU; ideal for edge and embedded devices
Conversational AI
GPT-4o voice mode or Sesame CSM
Native speech-in-speech-out with natural turn-taking and expressiveness
What's Next
The next wave is fully conversational TTS that responds in real time with appropriate emotion and turn-taking (à la GPT-4o voice). Expect voice agents that maintain personality consistency over hours of dialogue, song synthesis that rivals studio recordings, and universal multilingual TTS covering 100+ languages from a single model. On the safety side, voice watermarking and synthetic speech detection will become mandatory features.
Benchmarks & SOTA
VCTK (CSTR VCTK Corpus)
Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.
State of the art: NaturalSpeech 3 (Microsoft Research), MOS 4.36
LJ Speech (The LJ Speech Dataset)
13,100 short audio clips of a single speaker reading passages from non-fiction books. The standard benchmark for single-speaker TTS.
State of the art: VALL-E 2 (Microsoft), MOS 4.61
Related Tasks
Audio-Language Models
Audio-Language Models (ALMs) extend natural language processing (NLP) to the audio domain, enabling computers to understand, generate, and reason about sounds and speech by integrating audio data with language understanding. Trained on paired audio-text data, ALMs bridge the gap between acoustic signals and linguistic meaning, enabling tasks like zero-shot audio recognition, audio captioning, and generative audio such as text-to-audio synthesis.
Voice cloning
Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.
Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the technology that converts spoken language into written text. ASR systems process audio signals containing human speech and transcribe them into readable text format. These systems use acoustic models, language models, and often neural networks to recognize phonemes, words, and sentences from audio input. ASR is foundational for applications like voice assistants (Siri, Alexa), transcription services, voice-controlled systems, and accessibility tools for the hearing impaired.
Audio Classification
Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.