Level 1: Single Blocks · ~25 min

Text-to-Speech

Nearly nine decades of teaching machines to speak — from hand-tuned formant oscillators to neural networks that clone a voice from three seconds of audio.

Nearly 90 Years of Synthetic Speech

Text-to-speech is one of the oldest problems in computing. Bell Labs demonstrated an electronic speech synthesizer in 1939. Since then, every generation of TTS has solved a fundamental limitation of the one before it — trading more data and compute for speech that sounds less like a machine and more like a person.

Understanding this arc is the fastest way to see why modern systems work the way they do, what trade-offs they inherit, and what problems remain unsolved.

Era I: Rule-Based Synthesis
1939

The VODER

At the 1939 World's Fair, Bell Labs unveiled the VODER (Voice Operating Demonstrator) — the first electronic device to generate continuous human speech. A trained operator used a keyboard, wrist bar, and foot pedal to control a bank of electronic oscillators in real time. It could produce any English phoneme, but required months of training and sounded unmistakably artificial.

The VODER was not TTS (it had no text input), but it proved the principle: human speech could be decomposed into a small set of acoustic parameters and reconstructed electronically. Homer Dudley, its inventor, had earlier built the Vocoder (1936) for compressing telephone signals — the same analysis-synthesis paradigm that every modern TTS system still uses.

1960s–1980s

Formant Synthesis

The first true TTS systems used formant synthesis: hand-crafted rules that controlled electronic resonators to mimic the resonant frequencies (formants) of the human vocal tract. Dennis Klatt at MIT built the most influential formant synthesizer, which evolved into DECtalk, famously the voice of Stephen Hawking's speech device.

"The goal is to simulate the physics of the human vocal tract — a source (vocal cords) exciting a filter (the throat, mouth, and nasal cavities). Control the filter parameters over time, and you control the speech."

Klatt, D.H. (1980). Software for a cascade/parallel formant synthesizer. JASA, 67(3), 971–995.

Formant synthesis was infinitely flexible — any phoneme in any language could theoretically be produced — but every rule was hand-tuned by a phonetician. Prosody (rhythm, stress, intonation) was nearly impossible to get right. The result: intelligible but robotic speech with the uncanny quality that defined "computer voice" for 30 years.
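The source-filter idea can be sketched in a few lines of numpy: an impulse-train "glottis" filtered through a cascade of two-pole resonators. This is a toy illustration, not Klatt's synthesizer; the formant frequencies and bandwidths below are textbook approximations for the vowel /a/.

```python
import numpy as np

FS = 16_000  # sample rate in Hz

def resonator(x, freq, bw):
    """One formant as a two-pole digital resonator (cascade element)."""
    r = np.exp(-np.pi * bw / FS)
    theta = 2 * np.pi * freq / FS
    b1, b2 = 2 * r * np.cos(theta), -r * r
    a = 1.0 - b1 - b2                     # unity gain at DC
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = a * x[n] + b1 * y[n - 1] + b2 * y[n - 2]
    return y

def synth_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)), dur=0.3):
    """Impulse-train source at f0, filtered through a formant cascade."""
    n = int(FS * dur)
    source = np.zeros(n)
    source[:: FS // f0] = 1.0             # glottal pulse every 1/f0 seconds
    y = source
    for freq, bw in formants:             # cascade: one resonator per formant
        y = resonator(y, freq, bw)
    return y / np.abs(y).max()            # normalize to [-1, 1]

audio = synth_vowel()                     # a buzzy, vaguely /a/-like vowel
```

Writing `audio` to a 16 kHz WAV yields the characteristic buzzy formant timbre; real rule-based systems varied these parameters every few milliseconds under phonetician-written rules.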

1980s–2000s

Concatenative / Unit Selection

Instead of synthesizing speech from rules, concatenative synthesis recorded a human speaker reading tens of hours of text, then sliced the recordings into small units (diphones, triphones, or half-phones). At synthesis time, the system selected and stitched together the best-matching units for the target text.

Unit selection (Hunt & Black, 1996) was the refined version: instead of fixed-length units, it searched a large database for the optimal sequence of variable-length speech segments, minimizing both the "target cost" (how well each unit matches what you want) and the "join cost" (how smoothly adjacent units splice together).

Hunt, A.J. & Black, A.W. (1996). Unit selection in a concatenative speech synthesis system. ICASSP.

This was the technology behind Apple's original Siri, Google's early TTS, and most GPS navigation voices. Quality was far better than formant synthesis within the recorded speaker's voice and style, but it couldn't generalize: a new voice required another 20+ hours of recording, and prosodic expressiveness was limited by what existed in the database.
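The target-cost/join-cost search described above can be sketched as a small dynamic program. The "units" here are invented (pitch, boundary-feature) tuples; real systems score spectral and prosodic features over thousands of candidates per slot.

```python
def unit_select(targets, candidates, join_w=1.0):
    """Pick one unit per slot minimizing target cost + join cost (Viterbi).

    targets: desired pitch per slot.
    candidates: per-slot list of (pitch, boundary) tuples -- toy features.
    """
    cost = [[abs(p - targets[0]) for p, _ in candidates[0]]]  # first slot
    back = []
    for i in range(1, len(targets)):
        row, brow = [], []
        for p, edge in candidates[i]:
            # best predecessor: accumulated cost plus the join cost of
            # splicing the previous unit's boundary onto this one
            joins = [cost[-1][k] + join_w * abs(candidates[i - 1][k][1] - edge)
                     for k in range(len(candidates[i - 1]))]
            best = min(range(len(joins)), key=joins.__getitem__)
            row.append(abs(p - targets[i]) + joins[best])
            brow.append(best)
        cost.append(row)
        back.append(brow)
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][k]
    path = [k]
    for brow in reversed(back):           # backtrack the best unit sequence
        k = brow[k]
        path.append(k)
    return path[::-1], total

targets = [100, 110, 120]                 # desired pitch contour
candidates = [[(95, 0.2), (105, 0.9)],
              [(108, 0.25), (112, 0.95)],
              [(118, 0.3), (125, 0.3)]]
path, total = unit_select(targets, candidates)
```

The search prefers units that fit the target *and* splice smoothly onto their neighbors, which is exactly why quality collapsed when the database lacked a good match.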

Era II: Statistical Parametric Synthesis
2000s

HMM-Based Speech Synthesis (HTS)

Heiga Zen, Keiichi Tokuda, and colleagues at Nagoya Institute of Technology replaced the unit-selection database with a Hidden Markov Model that learned to generate acoustic parameters (spectral features, pitch, duration) from text. A vocoder then converted these parameters into a waveform.

# HMM-TTS conceptual pipeline
text → text analysis → phoneme sequence + prosody labels
     → HMM generates: spectral params (MGC), pitch (logF0), duration
     → MLSA vocoder → waveform

# Key advantage: new voices from ~1 hour of speech
# Key limitation: "buzzy" vocoder quality, over-smoothed prosody

The breakthrough was flexibility: adapting to a new voice required only retraining the model on a small dataset, not recording a massive unit-selection corpus. But HMM outputs were over-smoothed — the model averaged over natural variation, producing speech that was intelligible but flat and buzzy. The vocoder was the bottleneck.

Zen, H., Tokuda, K., & Black, A.W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
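The over-smoothing effect is easy to demonstrate numerically. Below, synthetic pitch contours share a common intonation trend but differ in utterance-level detail; their mean (roughly what averaging over training data produces) keeps the trend and loses the detail. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)

# 50 "natural" F0 contours: a shared intonation trend plus
# utterance-specific variation (random extra wiggles per contour)
trend = 120 + 10 * np.sin(2 * np.pi * 2 * t)
wiggles = 8 * np.sin(2 * np.pi * rng.uniform(3, 6, (50, 1)) * t
                     + rng.uniform(0, 2 * np.pi, (50, 1)))
contours = trend + wiggles

mean_contour = contours.mean(axis=0)       # what averaging produces

natural_var = contours.std(axis=1).mean()  # typical per-utterance variation
smoothed_var = mean_contour.std()          # flatter: detail averaged away
```

The random-phase components cancel in the mean, so `smoothed_var` comes out below `natural_var`: the averaged contour preserves the broad trend but none of the lively variation listeners perceive as natural.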

Era III: The Neural Revolution
September 2016

WaveNet — The Inflection Point

Aäron van den Oord and colleagues at DeepMind published a paper that changed everything. WaveNet was an autoregressive neural network that generated raw audio waveforms one sample at a time — 16,000 samples per second, each conditioned on all previous samples via dilated causal convolutions.

"WaveNet reduces the gap with human performance by over 50% for both US English and Mandarin Chinese… [Listeners] rated WaveNet as significantly more natural than the best existing parametric and concatenative systems."

van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.

The quality was stunning — MOS (Mean Opinion Score) jumped from ~3.8 (concatenative) to ~4.2 (WaveNet), where 5.0 is indistinguishable from human speech. But the original model was catastrophically slow: generating one second of audio took several minutes on a GPU because each of the 16,000 samples depended sequentially on the last.
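The dilated-causal-convolution trick is what made that long context tractable: stacking kernel-2 layers with dilations 1, 2, 4, …, 512 (repeated, as the paper describes) grows the receptive field exponentially with depth. A minimal sketch:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal conv: y[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    y = np.zeros_like(x)
    for t in range(len(x)):
        for i, wi in enumerate(w):
            idx = t - i * dilation
            if idx >= 0:
                y[t] += wi * x[idx]
    return y

# Causality check: an impulse at t=0 through a kernel-2, dilation-4 layer
# influences only t=0 and t=4, never earlier samples.
impulse = np.zeros(8)
impulse[0] = 1.0
out = causal_dilated_conv(impulse, np.array([0.5, 0.5]), dilation=4)

# Receptive field of the stack: kernel 2, dilations 1..512, three blocks
dilations = [2 ** i for i in range(10)] * 3
receptive_field = sum(dilations) + 1   # (kernel-1)*dilation context per layer
# 3070 samples of context per output sample, ~0.19 s at 16 kHz
```

Thirty layers buy roughly 3,000 samples of context at the cost of one sequential pass per sample at generation time, which is exactly the speed problem described above.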

Why WaveNet mattered beyond TTS

WaveNet proved that neural networks could model raw audio waveforms directly, without hand-designed vocoders or signal processing. The same autoregressive architecture was later adapted for music generation (OpenAI Jukebox), speech coding, and audio super-resolution. Google deployed a production-optimized version in Google Assistant in 2017.

2017–2018

Tacotron & Tacotron 2: End-to-End TTS

Yuxuan Wang et al. at Google introduced Tacotron — the first truly end-to-end TTS system. Instead of the traditional multi-stage pipeline (text analysis, duration model, acoustic model, vocoder), Tacotron was a single sequence-to-sequence model with attention that converted text characters directly to mel spectrograms.

Tacotron 2 (Shen et al., 2018) combined this with a modified WaveNet vocoder, achieving a MOS of 4.53 — within the confidence interval of human speech recordings (4.58). For the first time, synthesized speech was statistically indistinguishable from a real human in controlled listening tests.

# Tacotron 2 architecture (simplified)
text = "Hello, how are you?"
     → character embeddings → encoder (3-layer CNN + BiLSTM)
     → attention mechanism (location-sensitive)
     → decoder (2-layer LSTM, autoregressive)
     → mel spectrogram (80 bands, 12.5ms frames)
     → WaveNet vocoder → 24kHz waveform

# The attention mechanism learns alignment between
# text and audio without any forced alignment labels

Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP.
Wang, Y. et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech.

Tacotron's end-to-end approach eliminated the need for linguistic expertise in building TTS systems. No phoneme dictionaries, no prosody rules, no duration models — just text in, audio out. This democratized TTS research and paved the way for rapid iteration.
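The heart of that learned alignment is small enough to sketch. A decoder query is scored against every encoder output; the softmax weights are the text-audio alignment, and the weighted sum is the context vector the decoder consumes. This uses plain dot-product scoring (simplified from Tacotron 2's location-sensitive variant), and all vectors are random stand-ins.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 8))       # encoder outputs: 6 text positions, 8-dim
query = enc[2] + 0.1 * rng.normal(size=8)  # decoder state "seeking" position 2

scores = enc @ query                # one score per text position
weights = softmax(scores)           # alignment: non-negative, sums to 1
context = weights @ enc             # weighted summary fed to the decoder

# Nothing labels which character goes with which audio frame during
# training -- the alignment emerges because good alignments reduce
# the spectrogram prediction loss.
```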

2018–2020

The Vocoder Race: Speed Without Sacrificing Quality

WaveNet's autoregressive generation was too slow for production. A flurry of research produced faster alternatives:

WaveRNN (Kalchbrenner, 2018)

Single-layer RNN, 4x faster than WaveNet. Enabled on-device TTS.

Parallel WaveGAN (Yamamoto, 2020)

GAN-based vocoder. Non-autoregressive. Real-time on CPU.

HiFi-GAN (Kong et al., 2020)

Multi-period and multi-scale discriminators. Near-WaveNet quality at ~167x faster than real time on GPU. Became the default vocoder.

VITS (Kim et al., 2021)

VAE + normalizing flow + GAN. First single-stage model matching Tacotron 2 + HiFi-GAN quality.

Kong, J. et al. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS.

Era IV: Zero-Shot Voice Cloning & Codec Models
January 2023

VALL-E — TTS as Language Modeling

Chengyi Wang et al. at Microsoft reframed TTS entirely: instead of generating spectrograms, treat speech as a sequence of discrete audio tokens from a neural audio codec (EnCodec), then train a language model to predict those tokens conditioned on text and a 3-second voice prompt.

# VALL-E paradigm shift
# Old: text → mel spectrogram → vocoder → waveform
# New: text + 3s voice prompt → discrete audio tokens → waveform

text_tokens = tokenize("Hello, how are you?")
voice_prompt = encodec.encode(3_second_clip)  # 8 codebook streams

# Autoregressive model predicts first codebook
coarse_tokens = ar_model(text_tokens, voice_prompt)
# Non-autoregressive model predicts remaining 7 codebooks
fine_tokens = nar_model(coarse_tokens)

audio = encodec.decode(coarse_tokens + fine_tokens)

Wang, C. et al. (2023). VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv:2301.02111.

Trained on 60,000 hours of English speech (LibriLight), VALL-E achieved zero-shot voice cloning from just a 3-second reference — preserving the speaker's timbre, emotion, and even acoustic environment. This was a paradigm shift: TTS became a prompting problem, just like LLM text generation. VALL-E 2 (2024) improved robustness with repetition-aware sampling and grouped code modeling.

2024–present

The Modern Landscape

The codec language model approach spawned a wave of systems, each pushing different frontiers:

Bark (Suno)

Open-source, generates laughter, music, sound effects via special tokens. GPT-style architecture.

XTTS v2 (Coqui)

Open-source, 17 languages, voice cloning from 6s. GPT + HiFi-GAN decoder.

OpenAI TTS / GPT-4o

API-only. 6 preset voices. Real-time streaming. Native multimodal in GPT-4o.

ElevenLabs

Best commercial quality. Voice cloning, design, 32 languages. Turbo v2.5 for low latency.

Piper (Rhasspy)

VITS-based, runs on Raspberry Pi. 30+ languages. Optimized for local/embedded use.

Fish Speech / CosyVoice

Open-source systems from Chinese labs, strong in Mandarin and English. VQGAN + LLM. Competitive with commercial APIs.

The throughline: 1939 → 2026

Each generation replaced hand-crafted knowledge with learned representations:

1939–1980s · Rules: Hand-tuned oscillators and formant parameters (Klatt, VODER)
1980s–2000s · Data: Record a speaker, stitch segments together (unit selection)
2000s–2015 · Statistics: HMMs learn acoustic parameters, but vocoders limit quality
2016–2020 · Neural: WaveNet, Tacotron — end-to-end learning matches human quality
2023–now · Codec LMs: TTS as language modeling. Zero-shot cloning from seconds of audio

Every advance traded hand-engineering for data. The core challenge remains the same: convert a sequence of symbols into a sequence of air pressure changes that a human brain interprets as speech.

How Modern TTS Works

Despite surface differences, every modern TTS system follows the same conceptual pipeline. Understanding these stages helps you choose and debug any system.

Stage 1: Text Frontend

Raw text is normalized and converted to a pronunciation representation. This is harder than it looks — "Dr. Smith lives on 5th St." requires expanding abbreviations, and "read" has different pronunciations depending on tense.

# Text normalization examples
"Dr. Smith"      → "Doctor Smith"
"$3.50"          → "three dollars and fifty cents"
"2024-01-15"     → "January fifteenth, twenty twenty-four"
"I read a book"  → /aɪ rɛd ə bʊk/ or /aɪ riːd ə bʊk/ (context-dependent)

# G2P (Grapheme-to-Phoneme) conversion
"synthesis"      → /ˈsɪnθəsɪs/
"colonel"        → /ˈkɜːrnəl/  # English is irregular
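A toy normalizer shows the flavor of this stage. The abbreviation table and digit-by-digit rule below are invented for illustration; production frontends handle hundreds of patterns with context-sensitive rules or learned models.

```python
import re

DIGITS = ['zero', 'one', 'two', 'three', 'four',
          'five', 'six', 'seven', 'eight', 'nine']
ABBREV = {'Dr.': 'Doctor', 'St.': 'Street'}  # ambiguous! Dr.=Drive? St.=Saint?

def spell_digits(m):
    """Verbalize a digit run one digit at a time."""
    return ' '.join(DIGITS[int(d)] for d in m.group())

def normalize(text):
    for abbr, full in ABBREV.items():        # naive expansion, no context
        text = text.replace(abbr, full)
    # digit-by-digit fallback; real frontends verbalize '42' as 'forty-two'
    return re.sub(r'\d+', spell_digits, text)

out = normalize('Dr. Smith lives at 42 Oak St.')
# 'Doctor Smith lives at four two Oak Street'
```

Even this tiny example exposes the hard part: "Dr." could mean Drive, "St." could mean Saint, and "42" should usually be "forty-two". Resolving these requires context, which is why the frontend remains a significant source of TTS errors.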

Stage 2: Acoustic Model

The core neural network converts the text representation into an acoustic representation — either mel spectrograms (Tacotron-family) or discrete audio tokens (VALL-E-family). This is where prosody, rhythm, and emotion are determined.

# Two paradigms for the acoustic model:

# 1. Spectrogram prediction (Tacotron, FastSpeech)
phonemes → encoder → attention → decoder → mel spectrogram
# Output: 80-band mel spectrogram, ~86 frames/second

# 2. Codec token prediction (VALL-E, Bark)
text_tokens → transformer LM → audio codec tokens
# Output: 8 codebook streams, 75 tokens/second per stream

Stage 3: Waveform Generation

The acoustic representation is converted to a raw audio waveform. For spectrogram-based systems, this is a neural vocoder (HiFi-GAN, WaveRNN). For codec-based systems, the codec decoder handles this directly.

# Vocoder: mel spectrogram → waveform
mel_spec.shape  # (80, T)  — 80 mel bands, T time frames
waveform = hifi_gan(mel_spec)  # → (1, T*256) at 22.05kHz
# Each mel frame expands to 256 audio samples (hop_size)

# Codec decoder: tokens → waveform
tokens.shape  # (8, T)  — 8 codebooks, T token frames
waveform = encodec.decode(tokens)  # → 24kHz audio
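The shapes in the two snippets above are easy to sanity-check from the stated rates (22.05 kHz with hop 256 on the vocoder path, 75 tokens per second per codebook on the codec path):

```python
SR, HOP = 22_050, 256
frames_per_sec = SR / HOP            # ~86.13 -- the "~86 frames/second" above

seconds = 3.0
mel_frames = int(seconds * frames_per_sec)   # T in an (80, T) mel spectrogram
samples = mel_frames * HOP                   # waveform length after vocoding

codec_rate = 75                      # tokens per second, per codebook stream
codec_tokens = int(seconds * codec_rate)     # per stream; 8 streams total
```

Three seconds of speech is therefore ~258 mel frames (or ~225 codec tokens per stream) but ~66,000 raw samples, which is why the acoustic model predicts the compact representation and leaves sample-level detail to the waveform generator.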

Production Code Examples

Three systems covering the full spectrum: cloud API for simplicity, open-source GPU model for quality and control, lightweight local model for edge deployment.

OpenAI TTS — Cloud API, Simplest Integration

Six preset voices, two quality tiers, streaming support. No voice cloning. Best for applications where you need reliable quality with minimal setup.

# OpenAI TTS — streaming to file and to speaker
from openai import OpenAI
from pathlib import Path

client = OpenAI()

# Simple: generate and save
response = client.audio.speech.create(
    model='tts-1-hd',  # or 'tts-1' for lower latency
    voice='nova',      # alloy, echo, fable, onyx, nova, shimmer
    input='Neural text-to-speech has come a long way since formant synthesis.'
)
response.stream_to_file(Path('output.mp3'))

# Streaming: play audio as it generates (low TTFB)
response = client.audio.speech.create(
    model='tts-1',
    voice='alloy',
    input=text,
    response_format='pcm',  # raw PCM for real-time playback
)
for chunk in response.iter_bytes(chunk_size=4096):
    audio_player.write(chunk)  # play as chunks arrive

Pricing (as of March 2026)

tts-1: $15 / 1M characters | tts-1-hd: $30 / 1M characters | ~150ms TTFB (streaming)

XTTS v2 — Open-Source Voice Cloning

Coqui's XTTS is the best open-source option for voice cloning. It supports 17 languages, clones from a 6-second reference clip, and runs on consumer GPUs. The model uses a GPT-2-style autoregressive decoder with a HiFi-GAN vocoder.

# XTTS v2 — open-source voice cloning
from TTS.api import TTS

# Load XTTS v2 (downloads ~1.8GB on first run)
tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2')

# Clone a voice from a reference audio file
tts.tts_to_file(
    text='This is my cloned voice speaking in English.',
    speaker_wav='reference_voice.wav',  # 6+ seconds of clean speech
    language='en',
    file_path='cloned_output.wav'
)

# Streaming generation for real-time playback
chunks = tts.tts_stream(
    text='Streaming reduces time-to-first-audio.',
    speaker_wav='reference_voice.wav',
    language='en'
)
for chunk in chunks:
    play_audio(chunk)

Requirements

VRAM: ~4GB (inference) | Languages: 17 | License: CPML (non-commercial) / Commercial license available

Piper — Lightweight Local TTS

When you need TTS that runs on a Raspberry Pi, a phone, or any device without internet, Piper is the answer. It's a VITS-based model optimized with ONNX Runtime, producing natural speech at 2–4x real-time on a single CPU core. No GPU required. 30+ languages with pre-trained voices.

# Piper — fast local TTS, no GPU needed
# Install: pip install piper-tts

# Command line (simplest)
echo 'Hello from Piper running locally.' | \
piper --model en_US-lessac-medium.onnx --output_file out.wav

# Python API
import wave
from piper import PiperVoice

voice = PiperVoice.load('en_US-lessac-medium.onnx')
with wave.open('output.wav', 'wb') as wav_file:
    voice.synthesize('Fast local synthesis on any device.', wav_file)

# Model sizes: 15MB (low) to 75MB (high quality)
# Speed: 2-4x real-time on single CPU core

Best for

Home assistants, embedded devices, offline apps, accessibility tools. No internet, no API costs, no GPU.

Measuring TTS Quality: MOS and Beyond

How do you objectively compare TTS systems? Speech quality evaluation is genuinely hard because it's inherently perceptual. The gold standard remains human listening tests, but automated metrics are catching up.

Mean Opinion Score (MOS)

The standard metric since ITU-T P.800 (1996). Human listeners rate speech samples on a 1–5 scale:

5: Excellent (indistinguishable from human)
4: Good (noticeable but not annoying)
3: Fair (slightly annoying)
2: Poor (annoying)
1: Bad (very annoying)

Professional human speech recordings typically score 4.5–4.7 (not 5.0 — recording artifacts and microphone coloration prevent perfect scores).
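Computing a MOS is just a mean over ratings, but the 95% confidence interval is what makes comparisons meaningful. The ratings below are invented; real evaluations follow protocols like ITU-T P.800 with controlled listener pools and anchor stimuli.

```python
import math

ratings = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 3, 5, 4]  # 1-5 per listener

n = len(ratings)
mos = sum(ratings) / n
var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
ci95 = 1.96 * math.sqrt(var / n)                      # normal-approx 95% CI

print(f'MOS = {mos:.2f} +/- {ci95:.2f}')  # MOS = 4.25 +/- 0.33
```

This is why a system can be "within the confidence interval" of human recordings, as Tacotron 2 was: when the intervals of 4.53 and 4.58 overlap, the difference is not statistically significant.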

System | Year | MOS | Architecture
Formant (Klatt) | 1980 | ~2.5 | Rule-based resonators
Unit Selection | 2000s | ~3.8 | Concatenated recordings
WaveNet | 2016 | 4.21 | Autoregressive CNN
Tacotron 2 | 2018 | 4.53 | Seq2seq + WaveNet vocoder
VITS | 2021 | 4.43 | VAE + flow + GAN (single-stage)
VALL-E | 2023 | 3.8* | Codec LM (zero-shot)
VALL-E 2 | 2024 | 4.64 | Codec LM + repetition-aware sampling
Human recordings | — | 4.58 | Ground truth reference

* VALL-E's MOS is for zero-shot voice cloning (3s prompt), not read-speech from a trained voice — a harder task.

Automated Metrics

UTMOS (2022)

Neural MOS predictor trained on human ratings. Correlates ~0.9 with human MOS. Used for rapid iteration.

PESQ / POLQA

ITU standards for telephony quality. Good for comparing degradation, less useful for naturalness.

Speaker Similarity (SV cosine)

Cosine similarity of speaker embeddings between reference and generated audio. Key for voice cloning eval.
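The metric itself is one line once you have embeddings. Real evaluations extract them with a speaker-verification model (for example, a 192-dim ECAPA-TDNN); the random vectors below merely stand in for those embeddings, with the "clone" built as a noisy copy of the reference.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
ref_emb = rng.normal(size=192)                     # reference speaker
clone_emb = ref_emb + 0.3 * rng.normal(size=192)   # good clone: near the ref
other_emb = rng.normal(size=192)                   # unrelated speaker

sim_clone = cosine_sim(ref_emb, clone_emb)  # high: close in embedding space
sim_other = cosine_sim(ref_emb, other_emb)  # near zero for unrelated vectors
```

A cloning system is typically judged by whether `sim_clone`-style scores approach those of two genuine recordings of the same speaker.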

What MOS Doesn't Capture

Prosodic appropriateness

Speech can sound natural in isolation but wrong for the context (e.g., cheerful tone for sad news).

Long-form coherence

MOS tests use 5–15s clips. A system can score high on short samples but produce monotonous 30-minute narration.

Robustness

Does it handle numbers, abbreviations, code-switching, and unusual proper nouns without failing?

Saeki, T. et al. (2022). UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. Interspeech.
ITU-T P.800 (1996). Methods for subjective determination of transmission quality.

Choosing the Right System

There is no single "best" TTS system. The right choice depends on your constraints:

System | Quality | Latency | Cost | Voice Cloning
OpenAI TTS | Good | ~150ms | $15/1M chars | No
ElevenLabs | Excellent | ~200ms | $0.30/1K chars | Yes (30s ref)
XTTS v2 | Very Good | ~1s (GPU) | Free (local) | Yes (6s ref)
Piper | Good | ~50ms (CPU) | Free (local) | No (pre-trained voices)
Bark | Variable | ~5s (GPU) | Free (local) | Yes (prompt-based)

Voice Assistants / Chatbots

Latency is king. Users notice delays over 300ms. Use OpenAI TTS with streaming (tts-1, not tts-1-hd) or Piper for offline. Combine with STT for full voice loop.

Recommended: OpenAI TTS (cloud) or Piper (local)

Audiobook / Podcast Generation

Quality and expressiveness matter more than latency. Long-form coherence is critical — test with 10+ minute passages, not 10-second clips.

Recommended: ElevenLabs or XTTS v2

Accessibility / Screen Readers

Must work offline, handle arbitrary text (URLs, code, math), and be fast. Users listen at 2–3x speed, so intelligibility at high rates matters more than naturalness.

Recommended: Piper (offline, fast, lightweight)

Voice Cloning / Custom Characters

Clone a specific voice for a game character, virtual presenter, or personalized assistant. Quality of the reference audio matters enormously — clean, single-speaker, minimal noise.

Recommended: ElevenLabs (quality) or XTTS v2 (open-source)

Open Problems in TTS

Despite dramatic progress, several fundamental challenges remain unsolved:

Controllable Prosody

How do you tell a TTS system to emphasize this word, pause here, sound sarcastic there? Current systems offer limited control — you can sometimes use SSML tags or prompt engineering, but fine-grained prosodic control remains an open research problem. Recent work on style transfer and prosody embeddings (Wang et al., "Style Tokens", 2018) offers partial solutions.

Wang, Y. et al. (2018). Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. ICML.

Long-Form Coherence

Most TTS systems are evaluated on sentences or short paragraphs. At book or lecture length (30+ minutes), maintaining consistent voice quality, natural pacing, and appropriate paragraph-level prosody remains difficult. The attention mechanism in autoregressive models can drift over long sequences.

Multilingual Code-Switching

Real speech often mixes languages: "Let's meet at the café on Straße and discuss the projet." Handling mid-sentence language switches with correct pronunciation in both languages is still error-prone for most systems.

Ethical Safety

Voice cloning from 3 seconds of audio enables remarkable applications — and remarkable abuse. Deepfake voice calls, impersonation, and non-consensual cloning are active threats. Watermarking synthesized audio, speaker verification, and consent frameworks are still maturing. VALL-E's original paper explicitly noted: "we do not release the code of the model due to the potential risks."

Key Takeaways

1. TTS evolved through four paradigm shifts — rule-based formants, concatenative unit selection, neural spectrogram prediction (Tacotron/WaveNet), and codec language models (VALL-E). Each traded hand-engineering for data.

2. Modern TTS is statistically indistinguishable from human speech — Tacotron 2 hit MOS 4.53 in 2018, matching human recordings (4.58). The frontier has moved to zero-shot voice cloning and expressiveness.

3. The pipeline is: text frontend, acoustic model, waveform generator — whether you use an API or run locally, this structure is universal. Understanding it helps you debug every system.

4. Choose based on constraints, not hype — OpenAI TTS for simplicity, ElevenLabs for quality, XTTS for open-source cloning, Piper for edge/offline. There is no universal "best."

5. Voice cloning raises real ethical questions — 3-second voice cloning is a powerful capability that demands responsible use. Watermarking, consent, and detection are active research areas.

References

Klatt, D.H. (1980). Software for a cascade/parallel formant synthesizer. JASA, 67(3).

Hunt, A.J. & Black, A.W. (1996). Unit selection in a concatenative speech synthesis system. ICASSP.

Zen, H., Tokuda, K., & Black, A.W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11).

van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.

Wang, Y. et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech.

Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP.

Wang, Y. et al. (2018). Style Tokens: Unsupervised Style Modeling, Control and Transfer. ICML.

Kong, J. et al. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS.

Kim, J. et al. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS). ICML.

Wang, C. et al. (2023). VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv:2301.02111.

Saeki, T. et al. (2022). UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. Interspeech.
