Level 1: Single Blocks · ~25 min

Text-to-Speech

Nearly nine decades of teaching machines to speak — from hand-tuned formant oscillators to neural networks that clone a voice from three seconds of audio.

Nearly 90 Years of Synthetic Speech

Text-to-speech is one of the oldest problems in computing. Bell Labs demonstrated an electronic speech synthesizer in 1939. Since then, every generation of TTS has solved a fundamental limitation of the one before it — trading more data and compute for speech that sounds less like a machine and more like a person.

Understanding this arc is the fastest way to see why modern systems work the way they do, what trade-offs they inherit, and what problems remain unsolved.

Era I: Rule-Based Synthesis
1939

The VODER

At the 1939 World's Fair, Bell Labs unveiled the VODER (Voice Operating Demonstrator) — the first electronic device to generate continuous human speech. A trained operator used a keyboard, wrist bar, and foot pedal to control a bank of electronic oscillators in real time. It could produce any English phoneme, but required months of training and sounded unmistakably artificial.

The VODER was not TTS (it had no text input), but it proved the principle: human speech could be decomposed into a small set of acoustic parameters and reconstructed electronically. Homer Dudley, its inventor, had earlier built the Vocoder (1936) for compressing telephone signals — the same analysis-synthesis paradigm that every modern TTS system still uses.

1960s–1980s

Formant Synthesis

The first true TTS systems used formant synthesis: hand-crafted rules that controlled electronic resonators to mimic the resonant frequencies (formants) of the human vocal tract. Dennis Klatt at MIT built the most influential formant synthesizer, which evolved into DECtalk, famously the voice of Stephen Hawking's speech device.

"The goal is to simulate the physics of the human vocal tract — a source (vocal cords) exciting a filter (the throat, mouth, and nasal cavities). Control the filter parameters over time, and you control the speech."

Klatt, D.H. (1980). Software for a cascade/parallel formant synthesizer. JASA, 67(3), 971–995.

Formant synthesis was infinitely flexible — any phoneme in any language could theoretically be produced — but every rule was hand-tuned by a phonetician. Prosody (rhythm, stress, intonation) was nearly impossible to get right. The result: intelligible but robotic speech with the uncanny quality that defined "computer voice" for 30 years.
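The source-filter idea can be sketched in a few lines of numpy: an impulse-train "glottis" filtered through a cascade of two-pole resonators. This is a toy illustration, not Klatt's synthesizer; the formant frequencies and bandwidths below are textbook approximations for the vowel /a/.

```python
import numpy as np

FS = 16_000  # sample rate in Hz

def resonator(x, freq, bw):
    """One formant as a two-pole digital resonator (cascade element)."""
    r = np.exp(-np.pi * bw / FS)
    theta = 2 * np.pi * freq / FS
    b1, b2 = 2 * r * np.cos(theta), -r * r
    a = 1.0 - b1 - b2                     # unity gain at DC
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = a * x[n] + b1 * y[n - 1] + b2 * y[n - 2]
    return y

def synth_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)), dur=0.3):
    """Impulse-train source at f0, filtered through a formant cascade."""
    n = int(FS * dur)
    source = np.zeros(n)
    source[:: FS // f0] = 1.0             # glottal pulse every 1/f0 seconds
    y = source
    for freq, bw in formants:             # cascade: one resonator per formant
        y = resonator(y, freq, bw)
    return y / np.abs(y).max()            # normalize to [-1, 1]

audio = synth_vowel()                     # a buzzy, vaguely /a/-like vowel
```

Writing `audio` to a 16 kHz WAV yields the characteristic buzzy formant timbre; real rule-based systems varied these parameters every few milliseconds under phonetician-written rules.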

1980s–2000s

Concatenative / Unit Selection

Instead of synthesizing speech from rules, concatenative synthesis recorded a human speaker reading tens of hours of text, then sliced the recordings into small units (diphones, triphones, or half-phones). At synthesis time, the system selected and stitched together the best-matching units for the target text.

Unit selection (Hunt & Black, 1996) was the refined version: instead of fixed-length units, it searched a large database for the optimal sequence of variable-length speech segments, minimizing both the "target cost" (how well each unit matches what you want) and the "join cost" (how smoothly adjacent units splice together).

Hunt, A.J. & Black, A.W. (1996). Unit selection in a concatenative speech synthesis system. ICASSP.

This was the technology behind Apple's original Siri, Google's early TTS, and most GPS navigation voices. Quality was far better than formant synthesis within the recorded speaker's voice and style, but it couldn't generalize: a new voice required another 20+ hours of recording, and prosodic expressiveness was limited by what existed in the database.
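The target-cost/join-cost search described above can be sketched as a small dynamic program. The "units" here are invented (pitch, boundary-feature) tuples; real systems score spectral and prosodic features over thousands of candidates per slot.

```python
def unit_select(targets, candidates, join_w=1.0):
    """Pick one unit per slot minimizing target cost + join cost (Viterbi).

    targets: desired pitch per slot.
    candidates: per-slot list of (pitch, boundary) tuples -- toy features.
    """
    cost = [[abs(p - targets[0]) for p, _ in candidates[0]]]  # first slot
    back = []
    for i in range(1, len(targets)):
        row, brow = [], []
        for p, edge in candidates[i]:
            # best predecessor: accumulated cost plus the join cost of
            # splicing the previous unit's boundary onto this one
            joins = [cost[-1][k] + join_w * abs(candidates[i - 1][k][1] - edge)
                     for k in range(len(candidates[i - 1]))]
            best = min(range(len(joins)), key=joins.__getitem__)
            row.append(abs(p - targets[i]) + joins[best])
            brow.append(best)
        cost.append(row)
        back.append(brow)
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][k]
    path = [k]
    for brow in reversed(back):           # backtrack the best unit sequence
        k = brow[k]
        path.append(k)
    return path[::-1], total

targets = [100, 110, 120]                 # desired pitch contour
candidates = [[(95, 0.2), (105, 0.9)],
              [(108, 0.25), (112, 0.95)],
              [(118, 0.3), (125, 0.3)]]
path, total = unit_select(targets, candidates)
```

The search prefers units that fit the target *and* splice smoothly onto their neighbors, which is exactly why quality collapsed when the database lacked a good match.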

Era II: Statistical Parametric Synthesis
2000s

HMM-Based Speech Synthesis (HTS)

Heiga Zen, Keiichi Tokuda, and colleagues at Nagoya Institute of Technology replaced the unit-selection database with a Hidden Markov Model that learned to generate acoustic parameters (spectral features, pitch, duration) from text. A vocoder then converted these parameters into a waveform.

# HMM-TTS conceptual pipeline
text → text analysis → phoneme sequence + prosody labels
     → HMM generates: spectral params (MGC), pitch (logF0), duration
     → MLSA vocoder → waveform

# Key advantage: new voices from ~1 hour of speech
# Key limitation: "buzzy" vocoder quality, over-smoothed prosody

The breakthrough was flexibility: adapting to a new voice required only retraining the model on a small dataset, not recording a massive unit-selection corpus. But HMM outputs were over-smoothed — the model averaged over natural variation, producing speech that was intelligible but flat and buzzy. The vocoder was the bottleneck.

Zen, H., Tokuda, K., & Black, A.W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
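The over-smoothing effect is easy to demonstrate numerically. Below, synthetic pitch contours share a common intonation trend but differ in utterance-level detail; their mean (roughly what averaging over training data produces) keeps the trend and loses the detail. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)

# 50 "natural" F0 contours: a shared intonation trend plus
# utterance-specific variation (random extra wiggles per contour)
trend = 120 + 10 * np.sin(2 * np.pi * 2 * t)
wiggles = 8 * np.sin(2 * np.pi * rng.uniform(3, 6, (50, 1)) * t
                     + rng.uniform(0, 2 * np.pi, (50, 1)))
contours = trend + wiggles

mean_contour = contours.mean(axis=0)       # what averaging produces

natural_var = contours.std(axis=1).mean()  # typical per-utterance variation
smoothed_var = mean_contour.std()          # flatter: detail averaged away
```

The random-phase components cancel in the mean, so `smoothed_var` comes out below `natural_var`: the averaged contour preserves the broad trend but none of the lively variation listeners perceive as natural.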

Era III: The Neural Revolution
September 2016

WaveNet — The Inflection Point

Aäron van den Oord and colleagues at DeepMind published a paper that changed everything. WaveNet was an autoregressive neural network that generated raw audio waveforms one sample at a time — 16,000 samples per second, each conditioned on all previous samples via dilated causal convolutions.

"WaveNet reduces the gap with human performance by over 50% for both US English and Mandarin Chinese… [Listeners] rated WaveNet as significantly more natural than the best existing parametric and concatenative systems."

van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.

The quality was stunning — MOS (Mean Opinion Score) jumped from ~3.8 (concatenative) to ~4.2 (WaveNet), where 5.0 is indistinguishable from human speech. But the original model was catastrophically slow: generating one second of audio took several minutes on a GPU because each of the 16,000 samples depended sequentially on the last.
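The dilated-causal-convolution trick is what made that long context tractable: stacking kernel-2 layers with dilations 1, 2, 4, …, 512 (repeated, as the paper describes) grows the receptive field exponentially with depth. A minimal sketch:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal conv: y[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    y = np.zeros_like(x)
    for t in range(len(x)):
        for i, wi in enumerate(w):
            idx = t - i * dilation
            if idx >= 0:
                y[t] += wi * x[idx]
    return y

# Causality check: an impulse at t=0 through a kernel-2, dilation-4 layer
# influences only t=0 and t=4, never earlier samples.
impulse = np.zeros(8)
impulse[0] = 1.0
out = causal_dilated_conv(impulse, np.array([0.5, 0.5]), dilation=4)

# Receptive field of the stack: kernel 2, dilations 1..512, three blocks
dilations = [2 ** i for i in range(10)] * 3
receptive_field = sum(dilations) + 1   # (kernel-1)*dilation context per layer
# 3070 samples of context per output sample, ~0.19 s at 16 kHz
```

Thirty layers buy roughly 3,000 samples of context at the cost of one sequential pass per sample at generation time, which is exactly the speed problem described above.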

Why WaveNet mattered beyond TTS

WaveNet proved that neural networks could model raw audio waveforms directly, without hand-designed vocoders or signal processing. The same autoregressive architecture was later adapted for music generation (OpenAI Jukebox), speech coding, and audio super-resolution. Google deployed a production-optimized version in Google Assistant in 2017.

2017–2018

Tacotron & Tacotron 2: End-to-End TTS

Yuxuan Wang et al. at Google introduced Tacotron — the first truly end-to-end TTS system. Instead of the traditional multi-stage pipeline (text analysis, duration model, acoustic model, vocoder), Tacotron was a single sequence-to-sequence model with attention that converted text characters directly to mel spectrograms.

Tacotron 2 (Shen et al., 2018) combined this with a modified WaveNet vocoder, achieving a MOS of 4.53 — within the confidence interval of human speech recordings (4.58). For the first time, synthesized speech was statistically indistinguishable from a real human in controlled listening tests.

# Tacotron 2 architecture (simplified)
text = "Hello, how are you?"
     → character embeddings → encoder (3-layer CNN + BiLSTM)
     → attention mechanism (location-sensitive)
     → decoder (2-layer LSTM, autoregressive)
     → mel spectrogram (80 bands, 12.5ms frames)
     → WaveNet vocoder → 24kHz waveform

# The attention mechanism learns alignment between
# text and audio without any forced alignment labels

Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP.
Wang, Y. et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech.

Tacotron's end-to-end approach eliminated the need for linguistic expertise in building TTS systems. No phoneme dictionaries, no prosody rules, no duration models — just text in, audio out. This democratized TTS research and paved the way for rapid iteration.
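The heart of that learned alignment is small enough to sketch. A decoder query is scored against every encoder output; the softmax weights are the text-audio alignment, and the weighted sum is the context vector the decoder consumes. This uses plain dot-product scoring (simplified from Tacotron 2's location-sensitive variant), and all vectors are random stand-ins.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 8))       # encoder outputs: 6 text positions, 8-dim
query = enc[2] + 0.1 * rng.normal(size=8)  # decoder state "seeking" position 2

scores = enc @ query                # one score per text position
weights = softmax(scores)           # alignment: non-negative, sums to 1
context = weights @ enc             # weighted summary fed to the decoder

# Nothing labels which character goes with which audio frame during
# training -- the alignment emerges because good alignments reduce
# the spectrogram prediction loss.
```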

2018–2020

The Vocoder Race: Speed Without Sacrificing Quality

WaveNet's autoregressive generation was too slow for production. A flurry of research produced faster alternatives:

WaveRNN (Kalchbrenner, 2018)

Single-layer RNN, 4x faster than WaveNet. Enabled on-device TTS.

Parallel WaveGAN (Yamamoto, 2020)

GAN-based vocoder. Non-autoregressive. Real-time on CPU.

HiFi-GAN (Kong et al., 2020)

Multi-period and multi-scale discriminators. Near-WaveNet quality at ~167x faster than real time on GPU. Became the default vocoder.

VITS (Kim et al., 2021)

VAE + normalizing flow + GAN. First single-stage model matching Tacotron 2 + HiFi-GAN quality.

Kong, J. et al. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS.

Era IV: Zero-Shot Voice Cloning & Codec Models
January 2023

VALL-E — TTS as Language Modeling

Chengyi Wang et al. at Microsoft reframed TTS entirely: instead of generating spectrograms, treat speech as a sequence of discrete audio tokens from a neural audio codec (EnCodec), then train a language model to predict those tokens conditioned on text and a 3-second voice prompt.

# VALL-E paradigm shift
# Old: text → mel spectrogram → vocoder → waveform
# New: text + 3s voice prompt → discrete audio tokens → waveform

text_tokens = tokenize("Hello, how are you?")
voice_prompt = encodec.encode(3_second_clip)  # 8 codebook streams

# Autoregressive model predicts first codebook
coarse_tokens = ar_model(text_tokens, voice_prompt)
# Non-autoregressive model predicts remaining 7 codebooks
fine_tokens = nar_model(coarse_tokens)

audio = encodec.decode(coarse_tokens + fine_tokens)

Wang, C. et al. (2023). VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv:2301.02111.

Trained on 60,000 hours of English speech (LibriLight), VALL-E achieved zero-shot voice cloning from just a 3-second reference — preserving the speaker's timbre, emotion, and even acoustic environment. This was a paradigm shift: TTS became a prompting problem, just like LLM text generation. VALL-E 2 (2024) improved robustness with repetition-aware sampling and grouped code modeling.

2024–present

The Modern Landscape

The codec language model approach spawned a wave of systems, each pushing different frontiers:

Bark (Suno)

Open-source, generates laughter, music, sound effects via special tokens. GPT-style architecture.

XTTS v2 (Coqui)

Open-source, 17 languages, voice cloning from 6s. GPT + HiFi-GAN decoder.

OpenAI TTS / GPT-4o

API-only. 6 preset voices. Real-time streaming. Native multimodal in GPT-4o.

ElevenLabs

Best commercial quality. Voice cloning, design, 32 languages. Turbo v2.5 for low latency.

Piper (Rhasspy)

VITS-based, runs on Raspberry Pi. 30+ languages. Optimized for local/embedded use.

Fish Speech / CosyVoice

Open-source systems from Chinese labs, strong in Mandarin and English. VQGAN + LLM. Competitive with commercial APIs.

The throughline: 1939 → 2026

Each generation replaced hand-crafted knowledge with learned representations:

1939–1980s · Rules: Hand-tuned oscillators and formant parameters (Klatt, VODER)
1980s–2000s · Data: Record a speaker, stitch segments together (unit selection)
2000s–2015 · Statistics: HMMs learn acoustic parameters, but vocoders limit quality
2016–2020 · Neural: WaveNet, Tacotron — end-to-end learning matches human quality
2023–now · Codec LMs: TTS as language modeling. Zero-shot cloning from seconds of audio

Every advance traded hand-engineering for data. The core challenge remains the same: convert a sequence of symbols into a sequence of air pressure changes that a human brain interprets as speech.

How Modern TTS Works

Despite surface differences, every modern TTS system follows the same conceptual pipeline. Understanding these stages helps you choose and debug any system.

Stage 1: Text Frontend

Raw text is normalized and converted to a pronunciation representation. This is harder than it looks — "Dr. Smith lives on 5th St." requires expanding abbreviations, and "read" has different pronunciations depending on tense.

# Text normalization examples
"Dr. Smith"      → "Doctor Smith"
"$3.50"          → "three dollars and fifty cents"
"2024-01-15"     → "January fifteenth, twenty twenty-four"
"I read a book"  → /aɪ rɛd ə bʊk/ or /aɪ riːd ə bʊk/ (context-dependent)

# G2P (Grapheme-to-Phoneme) conversion
"synthesis"      → /ˈsɪnθəsɪs/
"colonel"        → /ˈkɜːrnəl/  # English is irregular
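A toy normalizer shows the flavor of this stage. The abbreviation table and digit-by-digit rule below are invented for illustration; production frontends handle hundreds of patterns with context-sensitive rules or learned models.

```python
import re

DIGITS = ['zero', 'one', 'two', 'three', 'four',
          'five', 'six', 'seven', 'eight', 'nine']
ABBREV = {'Dr.': 'Doctor', 'St.': 'Street'}  # ambiguous! Dr.=Drive? St.=Saint?

def spell_digits(m):
    """Verbalize a digit run one digit at a time."""
    return ' '.join(DIGITS[int(d)] for d in m.group())

def normalize(text):
    for abbr, full in ABBREV.items():        # naive expansion, no context
        text = text.replace(abbr, full)
    # digit-by-digit fallback; real frontends verbalize '42' as 'forty-two'
    return re.sub(r'\d+', spell_digits, text)

out = normalize('Dr. Smith lives at 42 Oak St.')
# 'Doctor Smith lives at four two Oak Street'
```

Even this tiny example exposes the hard part: "Dr." could mean Drive, "St." could mean Saint, and "42" should usually be "forty-two". Resolving these requires context, which is why the frontend remains a significant source of TTS errors.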

Stage 2: Acoustic Model

The core neural network converts the text representation into an acoustic representation — either mel spectrograms (Tacotron-family) or discrete audio tokens (VALL-E-family). This is where prosody, rhythm, and emotion are determined.

# Two paradigms for the acoustic model:

# 1. Spectrogram prediction (Tacotron, FastSpeech)
phonemes → encoder → attention → decoder → mel spectrogram
# Output: 80-band mel spectrogram, ~86 frames/second

# 2. Codec token prediction (VALL-E, Bark)
text_tokens → transformer LM → audio codec tokens
# Output: 8 codebook streams, 75 tokens/second per stream

Stage 3: Waveform Generation

The acoustic representation is converted to a raw audio waveform. For spectrogram-based systems, this is a neural vocoder (HiFi-GAN, WaveRNN). For codec-based systems, the codec decoder handles this directly.

# Vocoder: mel spectrogram → waveform
mel_spec.shape  # (80, T)  — 80 mel bands, T time frames
waveform = hifi_gan(mel_spec)  # → (1, T*256) at 22.05kHz
# Each mel frame expands to 256 audio samples (hop_size)

# Codec decoder: tokens → waveform
tokens.shape  # (8, T)  — 8 codebooks, T token frames
waveform = encodec.decode(tokens)  # → 24kHz audio
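The shapes in the two snippets above are easy to sanity-check from the stated rates (22.05 kHz with hop 256 on the vocoder path, 75 tokens per second per codebook on the codec path):

```python
SR, HOP = 22_050, 256
frames_per_sec = SR / HOP            # ~86.13 -- the "~86 frames/second" above

seconds = 3.0
mel_frames = int(seconds * frames_per_sec)   # T in an (80, T) mel spectrogram
samples = mel_frames * HOP                   # waveform length after vocoding

codec_rate = 75                      # tokens per second, per codebook stream
codec_tokens = int(seconds * codec_rate)     # per stream; 8 streams total
```

Three seconds of speech is therefore ~258 mel frames (or ~225 codec tokens per stream) but ~66,000 raw samples, which is why the acoustic model predicts the compact representation and leaves sample-level detail to the waveform generator.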

Production Code Examples

Three systems covering the full spectrum: cloud API for simplicity, open-source GPU model for quality and control, lightweight local model for edge deployment.

OpenAI TTS — Cloud API, Simplest Integration

Six preset voices, two quality tiers, streaming support. No voice cloning. Best for applications where you need reliable quality with minimal setup.

# OpenAI TTS — streaming to file and to speaker
from openai import OpenAI
from pathlib import Path

client = OpenAI()

# Simple: generate and save
response = client.audio.speech.create(
    model='tts-1-hd',  # or 'tts-1' for lower latency
    voice='nova',      # alloy, echo, fable, onyx, nova, shimmer
    input='Neural text-to-speech has come a long way since formant synthesis.'
)
response.stream_to_file(Path('output.mp3'))

# Streaming: play audio as it generates (low TTFB)
response = client.audio.speech.create(
    model='tts-1',
    voice='alloy',
    input=text,
    response_format='pcm',  # raw PCM for real-time playback
)
for chunk in response.iter_bytes(chunk_size=4096):
    audio_player.write(chunk)  # play as chunks arrive

Pricing (as of March 2026)

tts-1: $15 / 1M characters | tts-1-hd: $30 / 1M characters | ~150ms TTFB (streaming)

XTTS v2 — Open-Source Voice Cloning

Coqui's XTTS is the best open-source option for voice cloning. It supports 17 languages, clones from a 6-second reference clip, and runs on consumer GPUs. The model uses a GPT-2-style autoregressive decoder with a HiFi-GAN vocoder.

# XTTS v2 — open-source voice cloning
from TTS.api import TTS

# Load XTTS v2 (downloads ~1.8GB on first run)
tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2')

# Clone a voice from a reference audio file
tts.tts_to_file(
    text='This is my cloned voice speaking in English.',
    speaker_wav='reference_voice.wav',  # 6+ seconds of clean speech
    language='en',
    file_path='cloned_output.wav'
)

# Streaming generation for real-time playback
chunks = tts.tts_stream(
    text='Streaming reduces time-to-first-audio.',
    speaker_wav='reference_voice.wav',
    language='en'
)
for chunk in chunks:
    play_audio(chunk)

Requirements

VRAM: ~4GB (inference) | Languages: 17 | License: CPML (non-commercial) / Commercial license available

Piper — Lightweight Local TTS

When you need TTS that runs on a Raspberry Pi, a phone, or any device without internet, Piper is the answer. It's a VITS-based model optimized with ONNX Runtime, producing natural speech at 2–4x real-time on a single CPU core. No GPU required. 30+ languages with pre-trained voices.

# Piper — fast local TTS, no GPU needed
# Install: pip install piper-tts

# Command line (simplest)
echo 'Hello from Piper running locally.' | \
piper --model en_US-lessac-medium.onnx --output_file out.wav

# Python API
import wave
from piper import PiperVoice

voice = PiperVoice.load('en_US-lessac-medium.onnx')
with wave.open('output.wav', 'wb') as wav_file:
    voice.synthesize('Fast local synthesis on any device.', wav_file)

# Model sizes: 15MB (low) to 75MB (high quality)
# Speed: 2-4x real-time on single CPU core

Best for

Home assistants, embedded devices, offline apps, accessibility tools. No internet, no API costs, no GPU.

Measuring TTS Quality: MOS and Beyond

How do you objectively compare TTS systems? Speech quality evaluation is genuinely hard because it's inherently perceptual. The gold standard remains human listening tests, but automated metrics are catching up.

Mean Opinion Score (MOS)

The standard metric since ITU-T P.800 (1996). Human listeners rate speech samples on a 1–5 scale:

5: Excellent (indistinguishable from human)
4: Good (noticeable but not annoying)
3: Fair (slightly annoying)
2: Poor (annoying)
1: Bad (very annoying)

Professional human speech recordings typically score 4.5–4.7 (not 5.0 — recording artifacts and microphone coloration prevent perfect scores).
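Computing a MOS is just a mean over ratings, but the 95% confidence interval is what makes comparisons meaningful. The ratings below are invented; real evaluations follow protocols like ITU-T P.800 with controlled listener pools and anchor stimuli.

```python
import math

ratings = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 3, 5, 4]  # 1-5 per listener

n = len(ratings)
mos = sum(ratings) / n
var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
ci95 = 1.96 * math.sqrt(var / n)                      # normal-approx 95% CI

print(f'MOS = {mos:.2f} +/- {ci95:.2f}')  # MOS = 4.25 +/- 0.33
```

This is why a system can be "within the confidence interval" of human recordings, as Tacotron 2 was: when the intervals of 4.53 and 4.58 overlap, the difference is not statistically significant.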

System | Year | MOS | Architecture
Formant (Klatt) | 1980 | ~2.5 | Rule-based resonators
Unit Selection | 2000s | ~3.8 | Concatenated recordings
WaveNet | 2016 | 4.21 | Autoregressive CNN
Tacotron 2 | 2018 | 4.53 | Seq2seq + WaveNet vocoder
VITS | 2021 | 4.43 | VAE + flow + GAN (single-stage)
VALL-E | 2023 | 3.8* | Codec LM (zero-shot)
VALL-E 2 | 2024 | 4.64 | Codec LM + repetition-aware sampling
Human recordings | — | 4.58 | Ground truth reference

* VALL-E's MOS is for zero-shot voice cloning (3s prompt), not read-speech from a trained voice — a harder task.

Automated Metrics

UTMOS (2022)

Neural MOS predictor trained on human ratings. Correlates ~0.9 with human MOS. Used for rapid iteration.

PESQ / POLQA

ITU standards for telephony quality. Good for comparing degradation, less useful for naturalness.

Speaker Similarity (SV cosine)

Cosine similarity of speaker embeddings between reference and generated audio. Key for voice cloning eval.
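The metric itself is one line once you have embeddings. Real evaluations extract them with a speaker-verification model (for example, a 192-dim ECAPA-TDNN); the random vectors below merely stand in for those embeddings, with the "clone" built as a noisy copy of the reference.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
ref_emb = rng.normal(size=192)                     # reference speaker
clone_emb = ref_emb + 0.3 * rng.normal(size=192)   # good clone: near the ref
other_emb = rng.normal(size=192)                   # unrelated speaker

sim_clone = cosine_sim(ref_emb, clone_emb)  # high: close in embedding space
sim_other = cosine_sim(ref_emb, other_emb)  # near zero for unrelated vectors
```

A cloning system is typically judged by whether `sim_clone`-style scores approach those of two genuine recordings of the same speaker.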

What MOS Doesn't Capture

Prosodic appropriateness

Speech can sound natural in isolation but wrong for the context (e.g., cheerful tone for sad news).

Long-form coherence

MOS tests use 5–15s clips. A system can score high on short samples but produce monotonous 30-minute narration.

Robustness

Does it handle numbers, abbreviations, code-switching, and unusual proper nouns without failing?

Saeki, T. et al. (2022). UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. Interspeech.
ITU-T P.800 (1996). Methods for subjective determination of transmission quality.

Choosing the Right System

There is no single "best" TTS system. The right choice depends on your constraints:

System | Quality | Latency | Cost | Voice Cloning
OpenAI TTS | Good | ~150ms | $15/1M chars | No
ElevenLabs | Excellent | ~200ms | $0.30/1K chars | Yes (30s ref)
XTTS v2 | Very Good | ~1s (GPU) | Free (local) | Yes (6s ref)
Piper | Good | ~50ms (CPU) | Free (local) | No (pre-trained voices)
Bark | Variable | ~5s (GPU) | Free (local) | Yes (prompt-based)

Voice Assistants / Chatbots

Latency is king. Users notice delays over 300ms. Use OpenAI TTS with streaming (tts-1, not tts-1-hd) or Piper for offline. Combine with STT for full voice loop.

Recommended: OpenAI TTS (cloud) or Piper (local)

Audiobook / Podcast Generation

Quality and expressiveness matter more than latency. Long-form coherence is critical — test with 10+ minute passages, not 10-second clips.

Recommended: ElevenLabs or XTTS v2

Accessibility / Screen Readers

Must work offline, handle arbitrary text (URLs, code, math), and be fast. Users listen at 2–3x speed, so intelligibility at high rates matters more than naturalness.

Recommended: Piper (offline, fast, lightweight)

Voice Cloning / Custom Characters

Clone a specific voice for a game character, virtual presenter, or personalized assistant. Quality of the reference audio matters enormously — clean, single-speaker, minimal noise.

Recommended: ElevenLabs (quality) or XTTS v2 (open-source)

Open Problems in TTS

Despite dramatic progress, several fundamental challenges remain unsolved:

Controllable Prosody

How do you tell a TTS system to emphasize this word, pause here, sound sarcastic there? Current systems offer limited control — you can sometimes use SSML tags or prompt engineering, but fine-grained prosodic control remains an open research problem. Recent work on style transfer and prosody embeddings (Wang et al., "Style Tokens", 2018) offers partial solutions.

Wang, Y. et al. (2018). Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. ICML.

Long-Form Coherence

Most TTS systems are evaluated on sentences or short paragraphs. At book or lecture length (30+ minutes), maintaining consistent voice quality, natural pacing, and appropriate paragraph-level prosody remains difficult. The attention mechanism in autoregressive models can drift over long sequences.

Multilingual Code-Switching

Real speech often mixes languages: "Let's meet at the café on Straße and discuss the projet." Handling mid-sentence language switches with correct pronunciation in both languages is still error-prone for most systems.

Ethical Safety

Voice cloning from 3 seconds of audio enables remarkable applications — and remarkable abuse. Deepfake voice calls, impersonation, and non-consensual cloning are active threats. Watermarking synthesized audio, speaker verification, and consent frameworks are still maturing. VALL-E's original paper explicitly noted: "we do not release the code of the model due to the potential risks."

Key Takeaways

1. TTS evolved through four paradigm shifts — rule-based formants, concatenative unit selection, neural spectrogram prediction (Tacotron/WaveNet), and codec language models (VALL-E). Each traded hand-engineering for data.

2. Modern TTS is statistically indistinguishable from human speech — Tacotron 2 hit MOS 4.53 in 2018, matching human recordings (4.58). The frontier has moved to zero-shot voice cloning and expressiveness.

3. The pipeline is: text frontend, acoustic model, waveform generator — whether you use an API or run locally, this structure is universal. Understanding it helps you debug every system.

4. Choose based on constraints, not hype — OpenAI TTS for simplicity, ElevenLabs for quality, XTTS for open-source cloning, Piper for edge/offline. There is no universal "best."

5. Voice cloning raises real ethical questions — 3-second voice cloning is a powerful capability that demands responsible use. Watermarking, consent, and detection are active research areas.

References

Klatt, D.H. (1980). Software for a cascade/parallel formant synthesizer. JASA, 67(3).

Hunt, A.J. & Black, A.W. (1996). Unit selection in a concatenative speech synthesis system. ICASSP.

Zen, H., Tokuda, K., & Black, A.W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11).

van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499.

Wang, Y. et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. Interspeech.

Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP.

Wang, Y. et al. (2018). Style Tokens: Unsupervised Style Modeling, Control and Transfer. ICML.

Kong, J. et al. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS.

Kim, J. et al. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS). ICML.

Wang, C. et al. (2023). VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv:2301.02111.

Saeki, T. et al. (2022). UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. Interspeech.
