Text-to-speech
Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as "read aloud" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.
The field has progressed from robotic concatenative synthesis to neural models that are nearly indistinguishable from human speech. ElevenLabs, OpenAI TTS, and open-source models such as XTTS-v2 and F5-TTS achieve remarkable naturalness, and the frontier has shifted to expressiveness, emotion control, and real-time streaming.
History
2016: WaveNet (DeepMind) generates raw audio waveforms with autoregressive neural networks — the first truly natural-sounding TTS
2017: Tacotron (Google) introduces end-to-end TTS from characters to spectrograms, simplifying the pipeline
2018: Tacotron 2, paired with neural vocoders such as WaveNet and WaveGlow, achieves near-human MOS (Mean Opinion Score) of ~4.5/5
2019: FastSpeech introduces non-autoregressive spectrogram generation, enabling real-time synthesis
2021: VITS (Kim et al.) combines variational inference with adversarial training for high-fidelity end-to-end TTS
2023: XTTS-v2 (Coqui) and Bark enable zero-shot voice cloning from short reference audio
2023: ElevenLabs launches with strikingly natural multi-speaker TTS and captures significant commercial market share
2023-24: OpenAI's TTS API and GPT-4o voice mode demonstrate conversational-quality real-time speech synthesis
2024: F5-TTS and MaskGCT introduce flow-matching and masked generative approaches that rival autoregressive quality
2024-25: Fish Speech, Dia (Nari Labs), and Sesame CSM push open-source TTS to near-commercial quality with multi-speaker support
How Text-to-speech Works
Text normalization
Input text is expanded: numbers to words, abbreviations to full forms, handling of punctuation and special characters
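The expansion step can be sketched in a few lines. This is a minimal, illustrative normalizer assuming a tiny abbreviation table and numbers limited to 0-99; real TTS front ends handle dates, currency, ordinals, acronyms, and much more.

```python
# Minimal text-normalization sketch (illustrative rules only).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen "
         "sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def spell_number(n: int) -> str:
    """Spell out an integer in the range 0-99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    if n < 100:
        t, o = divmod(n, 10)
        return TENS[t] + ("-" + ONES[o] if o else "")
    raise ValueError("sketch handles 0-99 only")

def normalize(text: str) -> str:
    """Expand abbreviations and digits token by token."""
    out = []
    for tok in text.split():
        low = tok.lower()
        if low in ABBREVIATIONS:
            out.append(ABBREVIATIONS[low])
        elif tok.isdigit():
            out.append(spell_number(int(tok)))
        else:
            out.append(tok)
    return " ".join(out)

print(normalize("Dr. Smith lives at 42 Elm St."))
# doctor Smith lives at forty-two Elm street
```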
Phoneme conversion
Graphemes are converted to phonemes using a pronunciation model or G2P (grapheme-to-phoneme) system
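A toy G2P can be sketched as lexicon lookup with a crude per-letter fallback. The ARPAbet-style entries below are illustrative assumptions; production systems use lexica such as CMUdict plus a trained sequence-to-sequence model for out-of-vocabulary words.

```python
# Toy grapheme-to-phoneme (G2P): lexicon lookup with naive fallback.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "speech": ["S", "P", "IY", "CH"],
}

def g2p(word: str) -> list[str]:
    entry = LEXICON.get(word.lower())
    if entry is not None:
        return entry
    # Fallback: one pseudo-phoneme per letter; a real system would
    # run a trained G2P model here instead.
    return [ch.upper() for ch in word if ch.isalpha()]

print(g2p("hello"))  # ['HH', 'AH', 'L', 'OW']
print(g2p("TTS"))    # ['T', 'T', 'S']
```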
Acoustic modeling
A transformer or diffusion model generates mel-spectrograms or latent audio tokens from the phoneme sequence
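The interface of this stage can be sketched with NumPy: phoneme IDs in, mel frames out. The random matrices stand in for a trained network, and the fixed five-frames-per-phoneme duration is an illustrative assumption.

```python
import numpy as np

# Interface sketch of an acoustic model; random weights stand in for
# a trained transformer or diffusion decoder.
rng = np.random.default_rng(0)
VOCAB, EMB, N_MELS = 50, 64, 80
embed = rng.normal(size=(VOCAB, EMB))   # phoneme embedding table
proj = rng.normal(size=(EMB, N_MELS))   # stand-in for the decoder

def acoustic_model(phoneme_ids: np.ndarray,
                   frames_per_phoneme: int = 5) -> np.ndarray:
    h = embed[phoneme_ids]                        # (P, EMB)
    h = np.repeat(h, frames_per_phoneme, axis=0)  # (P*frames, EMB)
    return h @ proj                               # (T, N_MELS) mel frames

mel = acoustic_model(np.array([3, 7, 12]))
print(mel.shape)  # (15, 80)
```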
Vocoding
A vocoder (HiFi-GAN, BigVGAN, or a flow-based model) converts spectrograms into raw audio waveforms at 22-48 kHz
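To make the spectrogram-to-waveform step concrete, here is a crude oscillator-bank sketch: each magnitude-spectrogram bin drives a sinusoid at its bin-centre frequency, with phase carried across frames. Neural vocoders like HiFi-GAN and BigVGAN learn this mapping instead of computing it directly; all parameters below are illustrative.

```python
import numpy as np

def toy_vocoder(mag_frames: np.ndarray, sr: int = 22050,
                hop: int = 256, n_fft: int = 512) -> np.ndarray:
    """Oscillator-bank sketch: spectrogram frames -> waveform."""
    n_frames, n_bins = mag_frames.shape
    freqs = np.arange(n_bins) * sr / n_fft   # bin centre frequencies (Hz)
    out = np.zeros(n_frames * hop)
    phase = np.zeros(n_bins)
    t = np.arange(hop)[:, None]              # sample index within a frame
    for i, frame in enumerate(mag_frames):
        # Sum one sinusoid per bin, weighted by the bin's magnitude.
        chunk = (frame * np.sin(phase + 2 * np.pi * freqs * t / sr)).sum(axis=1)
        out[i * hop:(i + 1) * hop] = chunk
        phase += 2 * np.pi * freqs * hop / sr  # keep sinusoids continuous
    peak = np.abs(out).max()
    return out / peak if peak > 0 else out     # normalise to [-1, 1]

wave = toy_vocoder(np.abs(np.random.default_rng(1).normal(size=(20, 257))))
print(wave.shape)  # (5120,)
```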
Prosody control
Duration, pitch, and energy are either predicted by the model or controllable via conditioning signals
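The duration side of prosody control can be sketched as FastSpeech-style length regulation: each phoneme's hidden vector is repeated by its predicted duration in frames, which sets the pacing of the output. Durations are supplied explicitly here; in a real model a duration predictor emits them, and pitch/energy are injected as similar per-frame conditioning signals.

```python
import numpy as np

def length_regulate(hidden: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each phoneme's hidden state by its duration in frames."""
    return np.repeat(hidden, durations, axis=0)

h = np.eye(3)                                     # 3 phonemes, toy hidden states
frames = length_regulate(h, np.array([2, 1, 3]))  # 2+1+3 = 6 output frames
print(frames.shape)  # (6, 3)

# Speaking rate is controlled by scaling the durations, e.g. 2x slower:
slow = length_regulate(h, 2 * np.array([2, 1, 3]))
print(slow.shape)  # (12, 3)
```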
Current Landscape
TTS in 2025 has crossed the uncanny valley — casual listeners cannot reliably distinguish top models from human speech. The commercial market is dominated by ElevenLabs and OpenAI, while open-source alternatives (F5-TTS, Fish Speech, Dia) have largely closed the quality gap. The architecture landscape is diverse: autoregressive (VALL-E style), flow-matching (F5-TTS), diffusion (NaturalSpeech 3), and codec-based approaches (SoundStorm, MaskGCT) all produce excellent results. The competitive frontier has shifted from quality to control: emotion, style, pacing, and multi-speaker conversations.
Key Challenges
Expressiveness and emotion: conveying sarcasm, excitement, sadness, and subtle tonal shifts naturally remains difficult
Long-form synthesis: maintaining consistent prosody, pacing, and voice quality over paragraphs of text
Multilingual TTS with natural accent handling — code-switching between languages in a single utterance
Real-time streaming with low latency (<200ms first-byte) for conversational AI applications
Ethical concerns: voice cloning enables deepfakes and impersonation; consent and detection mechanisms are needed
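The streaming-latency challenge above comes down to yielding audio chunk by chunk so playback can start before the full utterance is rendered. A minimal sketch, where `synth()` is a stand-in that returns 0.1 s of silence per text chunk rather than a real engine:

```python
import time

def synth(piece: str) -> bytes:
    """Stand-in synthesizer: pretend cost, then 0.1 s of 16-bit silence."""
    time.sleep(0.01)
    return b"\x00" * 2 * 2205  # 2205 samples at 22.05 kHz, 2 bytes each

def stream_tts(text: str, chunk_chars: int = 40):
    """Yield audio per text chunk instead of waiting for the whole text."""
    for i in range(0, len(text), chunk_chars):
        yield synth(text[i:i + chunk_chars])

start = time.perf_counter()
chunks = stream_tts("This is a long paragraph that we want to hear quickly.")
first = next(chunks)  # playback can begin as soon as this returns
first_byte_ms = (time.perf_counter() - start) * 1000
print(len(first), first_byte_ms < 200)  # 4410 True
```

Time-to-first-chunk stays near the cost of synthesizing one chunk, rather than growing with the length of the input.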
Quick Recommendations
Best quality (API)
ElevenLabs Turbo v2.5 or OpenAI TTS HD
Near-human naturalness with voice selection and emotion control; low-latency streaming
Open-source (best quality)
F5-TTS or Fish Speech 1.5
F5-TTS uses a flow-matching architecture with excellent prosody; both models ship fully open weights
Zero-shot voice cloning
XTTS-v2 or OpenVoice v2
Clone any voice from 6-30 seconds of reference audio; supports 17+ languages
Real-time / low-latency
VITS or Piper TTS
Non-autoregressive, runs in real-time on CPU; ideal for edge and embedded devices
Conversational AI
GPT-4o voice mode or Sesame CSM
Native speech-in-speech-out with natural turn-taking and expressiveness
What's Next
The next wave is fully conversational TTS that responds in real time with appropriate emotion and turn-taking (à la GPT-4o voice). Expect voice agents that maintain personality consistency over hours of dialogue, song synthesis that rivals studio recordings, and universal multilingual TTS covering 100+ languages from a single model. On the safety side, voice watermarking and synthetic speech detection will become mandatory features.
Benchmarks & SOTA
VCTK (CSTR VCTK Corpus)
Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.
State of the art: NaturalSpeech 3 (Microsoft Research), MOS 4.36
LJ Speech (The LJ Speech Dataset)
13,100 short audio clips of a single speaker reading passages from non-fiction books. The standard benchmark for single-speaker TTS.
State of the art: VALL-E 2 (Microsoft), MOS 4.61
Related Tasks
Audio-Language Models
Audio-Language Models (ALMs) extend natural language processing (NLP) to the audio domain, enabling computers to understand, generate, and reason about sounds and speech by integrating audio data with language understanding. Trained on paired audio-text data, ALMs bridge the gap between acoustic signals and linguistic meaning, enabling tasks like zero-shot audio recognition, audio captioning, and generative audio such as text-to-audio synthesis.
Voice cloning
Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.
Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the technology that converts spoken language into written text. ASR systems process audio signals containing human speech and transcribe them into readable text format. These systems use acoustic models, language models, and often neural networks to recognize phonemes, words, and sentences from audio input. ASR is foundational for applications like voice assistants (Siri, Alexa), transcription services, voice-controlled systems, and accessibility tools for the hearing impaired.
Audio Classification
Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.