Best TTS for real-time voice agents
For a voice agent to feel human, you have roughly 800ms end-to-end between the user stopping speech and the agent starting to speak. TTS alone consumes 75-380ms of that budget. This page ranks streaming TTS engines by time-to-first-byte (TTFB) and breaks down the full round-trip budget.
TL;DR
- **Winner (cloud):** Cartesia Sonic 2 at ~90ms TTFB, 4.7 MOS. Best balance of quality and latency.
- **Winner (quality-at-speed):** ElevenLabs Flash v2.5 at ~75ms TTFB, 4.55 MOS.
- **Winner (self-host):** Piper running on CPU at ~30ms TTFB, acceptable quality, zero API cost.
- **Avoid for real-time:** OpenAI tts-1-hd (>500ms), Google Studio voices (>500ms), ElevenLabs Turbo (~275ms).
TTFB leaderboard
Time-to-first-byte on each vendor's lowest-latency streaming model. US-East origin, 40-char prompt, measured over 50 calls in April 2026. Piper runs locally on an M2 laptop CPU.
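Measuring TTFB this way reduces to timing the gap between issuing the request and receiving the first audio chunk, then taking the median over repeated calls. A minimal sketch of that harness; the `start_stream` callable and chunk shape are placeholders, not any vendor's API:

```python
import time

def measure_ttfb(start_stream):
    """Milliseconds from request to first audio chunk.

    `start_stream` is any zero-argument callable returning an iterator of
    audio chunks; vendor SDK specifics are abstracted away.
    """
    t0 = time.perf_counter()
    for _chunk in start_stream():
        return (time.perf_counter() - t0) * 1000.0
    raise RuntimeError("stream produced no audio")

def median_ttfb(start_stream, calls=50):
    """Median TTFB over `calls` requests (50 matches the methodology above)."""
    samples = sorted(measure_ttfb(start_stream) for _ in range(calls))
    mid = len(samples) // 2
    if calls % 2:
        return samples[mid]
    return (samples[mid - 1] + samples[mid]) / 2
```

Median rather than mean matters here: a single cold-start or network spike would otherwise dominate a 50-call average.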
[Chart: latency waterfall — streaming TTFB ranking. The dashed pink line at 200ms marks the soft ceiling for pleasant-feeling voice agents.]
The end-to-end round trip
TTS is one of five hops between user voice in and synthesized voice out. Optimizing TTS alone without a fast STT and a streaming LLM is pointless.
[Chart: streaming pipeline — voice-bot round-trip latency. Budget before perceived awkwardness: ~800ms. TTS is one of five hops; optimize the whole pipeline.]
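To make the budget arithmetic concrete, here is the kind of per-hop sum the pipeline diagram summarizes. The individual numbers are illustrative midpoints taken from figures on this page, not guarantees:

```python
# Illustrative per-hop latencies (ms) for the cloud stack described below.
# Rough midpoints, not measured guarantees.
HOPS_MS = {
    "mic -> STT (endpointing + transcription)": 250,
    "STT final -> LLM first token": 200,
    "LLM token -> TTS first byte (Sonic 2)": 90,
    "TTS -> transport (Opus encode + network)": 80,
    "client jitter buffer + playout": 20,
}

BUDGET_MS = 800  # soft ceiling before the pause feels awkward

total = sum(HOPS_MS.values())
print(f"total: {total} ms, headroom: {BUDGET_MS - total} ms")
# -> total: 640 ms, headroom: 160 ms
```

Note that TTS is only ~90ms of 640ms here; shaving 50ms off TTS buys you far less than fixing a slow STT endpointer.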
Quality doesn't have to drop
Fast doesn't mean bad. The upper-left of this plot (fast + natural) is populated: Sonic 2 and Flash v2.5 both clear 4.5 MOS at sub-100ms. Piper is the only budget option; the rest cluster in the 4.2-4.7 range.
[Chart: Pareto frontier — MOS vs. cost, streaming models only. Same data as the leaderboard, filtered to streaming-capable options.]
Voice bot reference stack (2026)
Recommended cloud stack
- **STT:** Deepgram Nova-3 streaming (2.2% WER, <200ms)
- **LLM:** GPT-4o or Claude 3.5 Haiku with streaming tokens
- **TTS:** Cartesia Sonic 2 over WebSocket
- **Transport:** WebRTC with Opus; LiveKit or Pipecat orchestration
- **Budget:** ~640ms end-to-end, leaving 160ms of headroom.
Recommended self-host stack
- **STT:** Parakeet RNNT 1.1B (1.8% WER, GPU streaming)
- **LLM:** Llama 3.3 70B on vLLM with continuous batching
- **TTS:** Kokoro-82M (GPU) or Piper (CPU)
- **Transport:** direct WebSocket; in-process audio pipeline
- **Budget:** ~500ms if colocated, ~900ms otherwise.
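For the CPU path, Piper can stream raw PCM to stdout, which keeps the audio pipeline in-process. A sketch assuming the `piper` CLI and a downloaded voice model; the flags follow the rhasspy/piper README, but verify them against your installed version:

```python
import subprocess

def piper_cmd(model_path):
    # --output-raw streams 16-bit PCM to stdout instead of writing a WAV file.
    return ["piper", "--model", model_path, "--output-raw"]

def synthesize(model_path, text):
    """Run Piper once and return raw PCM bytes (sample rate depends on the voice model)."""
    proc = subprocess.Popen(
        piper_cmd(model_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    pcm, _ = proc.communicate(text.encode("utf-8"))
    return pcm
```

Because the subprocess writes PCM as it synthesizes, you can also read `proc.stdout` incrementally instead of calling `communicate`, which is what gets you the ~30ms TTFB figure above.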
Streaming setup (Cartesia)
```python
# Cartesia Sonic 2 over WebSocket — the lowest-latency setup in production today.
from cartesia import Cartesia

client = Cartesia(api_key="sk_...")
ws = client.tts.websocket()

# Incremental text -> incremental PCM. Send tokens as your LLM produces them.
def stream_tokens(llm_stream, speaker):
    for token in llm_stream:
        ws.send(
            model_id="sonic-2",
            transcript=token,
            voice={"mode": "id", "id": "<voice_id>"},
            output_format={
                "container": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 24000,
            },
            continue_=True,
        )
    ws.send(transcript="", continue_=False)  # flush the context
    for chunk in ws.receive():
        speaker.write(chunk.audio)  # `speaker`: any PCM sink, e.g. a PyAudio stream
```

Note the `continue_=True` flag. You want to send LLM tokens as they arrive rather than waiting for a full sentence — this collapses the LLM→TTS serial delay into a single pipeline.
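Token-by-token streaming minimizes latency, but very short transcripts can hurt prosody because the TTS model sees little context per send. A common compromise is to buffer tokens until a clause boundary before flushing; a vendor-independent sketch (the punctuation set is a tunable assumption):

```python
import re

# Flush on clause-ending punctuation; tune this set for your language/domain.
BOUNDARY = re.compile(r"[.!?;:,]\s*$")

def clause_chunks(llm_stream):
    """Group LLM tokens into clause-sized transcripts for TTS."""
    buf = ""
    for token in llm_stream:
        buf += token
        if BOUNDARY.search(buf):
            yield buf
            buf = ""
    if buf:  # flush any trailing partial clause
        yield buf
```

Each yielded chunk becomes one `ws.send(...)` call, trading a few tens of milliseconds of added latency for more natural phrasing.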
Listen: same prompt, every vendor
“Hi, this is Ada from support. I can see your last order was flagged — let me fix that right now.”