Text-to-Speech
The inverse of speech recognition. Convert text into natural-sounding audio for voice assistants, audiobooks, and accessibility tools.
What is Text-to-Speech?
Text-to-Speech (TTS) is the technology that converts written text into spoken audio. Modern neural TTS systems produce remarkably natural-sounding speech, far beyond the robotic voices of early systems.
While speech recognition (STT) converts audio to text, TTS is the inverse operation - converting text back to audio. Together, they enable complete voice interfaces.
Audio Output Pipeline
A typical TTS pipeline involves:
- Text preprocessing (handling abbreviations, numbers, punctuation)
- Phoneme conversion (text to pronunciation)
- Acoustic model (phonemes to mel spectrograms)
- Vocoder (spectrograms to audio waveform)
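As a rough illustration of the first stage only, the sketch below expands a few abbreviations and spells out small integers before the text would reach the phoneme step. The abbreviation table and the helper names `number_to_words` and `normalize_text` are hypothetical, and production systems handle far more cases (dates, currency, decimals, ordinals).

```python
import re

# Hypothetical, deliberately tiny abbreviation table for illustration.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize_text(text: str) -> str:
    """Expand known abbreviations and spell out one- and two-digit integers."""
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # Only standalone 1-2 digit integers are rewritten; longer numbers are left as-is.
    text = re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("Dr. Smith lives at 42 Elm St."))
```

The hosted APIs and libraries covered here perform this kind of normalization internally, so a step like this mainly matters when building or debugging a custom pipeline.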
OpenAI TTS - Good Default Choice
OpenAI's TTS API offers a balance of quality, speed, and simplicity. At $15 per 1 million characters for tts-1 ($30 for tts-1-hd), it's cost-effective for most applications. Choose tts-1 for speed or tts-1-hd for higher quality.
Available Voices
This guide uses six voices: alloy, echo, fable, onyx, nova, and shimmer.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
speech_file = Path('speech.mp3')

# Stream the audio to disk instead of buffering the whole response in memory
with client.audio.speech.with_streaming_response.create(
    model='tts-1',  # or 'tts-1-hd' for higher quality
    voice='alloy',  # alloy, echo, fable, onyx, nova, shimmer
    input='Hello! This is a test of text to speech synthesis.',
) as response:
    response.stream_to_file(speech_file)
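Longer texts usually have to be split before synthesis, because the speech endpoint caps the input length per request. The sketch below assumes a 4096-character cap (the documented limit at the time of writing; check the current API reference) and prefers sentence boundaries when splitting. The helper name `split_for_tts` is hypothetical.

```python
import re

MAX_CHARS = 4096  # assumed per-request input limit; verify against current docs


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("One. Two. Three.", max_chars=10))
```

Each chunk can then be sent as its own request and the resulting audio files played or joined in order.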
Pricing
$15 per 1M characters (tts-1) | $30 per 1M characters (tts-1-hd)
ElevenLabs - Most Realistic
ElevenLabs produces the most natural-sounding speech currently available. Their key differentiator is voice cloning - create a custom voice from just a few minutes of audio samples.
Why ElevenLabs?
- Most realistic prosody and emotion
- Voice cloning from short samples
- 29+ languages with native quality
- Voice design (create new voices from descriptions)
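# Uses the module-level helpers from older (pre-1.0) releases of the elevenlabs
# package; newer releases expose the same functionality through a client object.
from elevenlabs import generate, play

audio = generate(
    text='Hello! This is incredibly realistic speech.',
    voice='Rachel',  # Or voice ID for cloned voices
    model='eleven_turbo_v2_5',
)
play(audio)

# Or save to file
with open('output.mp3', 'wb') as f:
    f.write(audio)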
Coqui TTS - Open Source & Local
Coqui TTS is an open-source library that runs entirely on your machine. There are no API costs, you keep full control over the models, and once a model has been downloaded no internet connection is required. Ideal for privacy-sensitive applications or offline use.
from TTS.api import TTS

# List available models
print(TTS().list_models())

# Load a specific model (downloaded on first use)
tts = TTS('tts_models/en/ljspeech/tacotron2-DDC')
tts.tts_to_file(
    text='Local speech synthesis without API costs.',
    file_path='output.wav',
)
| Model | Quality | Speed |
|---|---|---|
| tacotron2-DDC | Good | Medium |
| vits | Better | Fast |
| xtts_v2 | Best | Slow |
Bark - Expressive with Emotions
Bark (by Suno) is unique in its ability to generate speech with emotions, laughter, music, and sound effects. Use special tokens to control the output.
Special Tokens
- [laughs] - Add laughter
- [sighs] - Add sighing
- [music] - Generate music
- [clears throat] - Natural pause
from bark import generate_audio, preload_models, SAMPLE_RATE
from scipy.io.wavfile import write

preload_models()

audio_array = generate_audio('Hello! [laughs] This is Bark speaking.')
write('bark_output.wav', SAMPLE_RATE, audio_array)
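Because these tokens are plain text inside the prompt, a typo such as a misspelled token name is easy to miss. A small check like the one below can flag bracketed tokens that are not in a known set before a slow generation run. The `KNOWN_TOKENS` set only contains the tokens listed in this article and the helper `unknown_tokens` is hypothetical; Bark may recognize additional tokens.

```python
import re

# Tokens listed in this article; treat the set as an assumption, not Bark's full list.
KNOWN_TOKENS = {"[laughs]", "[sighs]", "[music]", "[clears throat]"}


def unknown_tokens(prompt: str) -> list[str]:
    """Return bracketed tokens in the prompt that are not in KNOWN_TOKENS."""
    found = re.findall(r"\[[^\[\]]+\]", prompt)
    return [tok for tok in found if tok.lower() not in KNOWN_TOKENS]


print(unknown_tokens("Hello! [laughs] This is Bark speaking. [cries]"))
```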
Note
Bark requires significant GPU memory (8GB+ VRAM recommended). It's slower than other options but excels at expressive, emotional speech.
Provider Comparison
| Provider | Quality | Cost | Best For |
|---|---|---|---|
| OpenAI TTS | Good | $15/1M chars | General purpose, quick integration |
| ElevenLabs | Excellent | $300/1M chars ($0.30/1K) | Professional content, voice cloning |
| Coqui TTS | Good | Free (local) | Privacy, offline, no API costs |
| Bark | Variable | Free (local) | Emotions, sound effects, creative |
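To compare the paid options on equal terms, it can help to convert every rate to dollars per 1 million characters and price a concrete workload. The sketch below uses only the figures quoted in this article (which may be out of date); the provider keys and the `estimate_cost` helper are hypothetical names.

```python
# USD per 1 million characters, taken from the figures quoted in this article.
RATES_PER_MILLION = {
    "openai-tts-1": 15.0,
    "openai-tts-1-hd": 30.0,
    "elevenlabs": 300.0,  # quoted as $0.30 per 1K characters
}


def estimate_cost(num_chars: int, provider: str) -> float:
    """Estimate the cost in USD of synthesizing num_chars characters."""
    return round(num_chars / 1_000_000 * RATES_PER_MILLION[provider], 2)


# Example workload: a 500,000-character manuscript.
for name in RATES_PER_MILLION:
    print(f"{name}: ${estimate_cost(500_000, name):.2f}")
```

The local options (Coqui TTS and Bark) have no per-character fee, so their cost is hardware and compute time rather than anything this estimate captures.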
Common Use Cases
Voice Assistants
Build conversational AI that speaks naturally. Combine with STT for full voice interfaces.
Recommended: OpenAI TTS (low latency)
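One turn of such a voice interface can be sketched as three stages chained together: speech to text, a text reply, and text to speech. The function below is a provider-agnostic outline under that assumption; `voice_turn` and its callable parameters are hypothetical, and the lambdas are stand-ins for real STT, language model, and TTS calls.

```python
from typing import Callable


def voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one turn: audio in -> user text -> reply text -> audio out."""
    user_text = transcribe(audio_in)
    reply_text = respond(user_text)
    return synthesize(reply_text)


# Stand-in implementations so the flow can be exercised without any API.
reply_audio = voice_turn(
    b"fake-audio",
    transcribe=lambda audio: "what time is it",
    respond=lambda text: f"You asked: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
print(reply_audio)
```

Keeping the three stages behind plain callables makes it straightforward to swap one TTS provider for another without touching the rest of the loop.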
Audiobook Generation
Convert written content to audio. Long-form narration with consistent voice quality.
Recommended: ElevenLabs (quality)
Accessibility Tools
Screen readers, navigation assistance, and content accessibility for visually impaired users.
Recommended: Coqui TTS (privacy, offline)
Video Voiceovers
Generate narration for videos, presentations, and educational content at scale.
Recommended: ElevenLabs or Bark (expressiveness)
Key Takeaways
1. OpenAI TTS - Good default choice at $15/1M characters. Simple API, decent quality, six voice options.
2. ElevenLabs - Most realistic speech, voice cloning capabilities. Best for professional content.
3. Coqui TTS - Open source, runs locally, no API costs. Ideal for privacy-sensitive or offline use.
4. Bark - Unique expressiveness with emotions, laughter, and sound effects. Great for creative content.