Best TTS for real-time voice agents
For a voice agent to feel human, you have roughly 800ms end-to-end between the user stopping speech and the agent starting to speak. TTS alone consumes 75-380ms of that budget. This page ranks streaming TTS engines by time-to-first-byte (TTFB) and breaks down the full round-trip budget.
TL;DR
- **Winner (cloud):** Cartesia Sonic 2 at ~90ms TTFB, 4.7 MOS. Best balance of quality and latency.
- **Winner (quality-at-speed):** ElevenLabs Flash v2.5 at ~75ms TTFB, 4.55 MOS.
- **Winner (self-host):** Piper running on CPU at ~30ms TTFB, acceptable quality, zero API cost.
- **Avoid for real-time:** OpenAI tts-1-hd (>500ms), Google Studio voices (>500ms), ElevenLabs Turbo (~275ms).
TTFB leaderboard
Time-to-first-byte on each vendor's lowest-latency streaming model. US-East origin, 40-char prompt, measured over 50 calls in April 2026. Piper runs locally on an M2 laptop CPU.
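Measuring TTFB this way reduces to timing the gap between issuing the request and receiving the first audio chunk, then taking the median over repeated calls. A minimal sketch of that harness; the `start_stream` callable and chunk shape are placeholders, not any vendor's API:

```python
import time

def measure_ttfb(start_stream):
    """Milliseconds from request to first audio chunk.

    `start_stream` is any zero-argument callable returning an iterator of
    audio chunks; vendor SDK specifics are abstracted away.
    """
    t0 = time.perf_counter()
    for _chunk in start_stream():
        return (time.perf_counter() - t0) * 1000.0
    raise RuntimeError("stream produced no audio")

def median_ttfb(start_stream, calls=50):
    """Median TTFB over `calls` requests (50 matches the methodology above)."""
    samples = sorted(measure_ttfb(start_stream) for _ in range(calls))
    mid = len(samples) // 2
    if calls % 2:
        return samples[mid]
    return (samples[mid - 1] + samples[mid]) / 2
```

Median rather than mean matters here: a single cold-start or network spike would otherwise dominate a 50-call average.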
[Chart: latency waterfall — streaming TTFB ranking. The dashed pink line at 200ms marks the soft ceiling for pleasant-feeling voice agents.]
The end-to-end round trip
TTS is one of five hops between user voice in and synthesized voice out. Optimizing TTS alone without a fast STT and a streaming LLM is pointless.
[Chart: streaming pipeline — voice-bot round-trip latency. Budget before perceived awkwardness: ~800ms. TTS is one of five hops; optimize the whole pipeline.]
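To make the budget arithmetic concrete, here is the kind of per-hop sum the pipeline diagram summarizes. The individual numbers are illustrative midpoints taken from figures on this page, not guarantees:

```python
# Illustrative per-hop latencies (ms) for the cloud stack described below.
# Rough midpoints, not measured guarantees.
HOPS_MS = {
    "mic -> STT (endpointing + transcription)": 250,
    "STT final -> LLM first token": 200,
    "LLM token -> TTS first byte (Sonic 2)": 90,
    "TTS -> transport (Opus encode + network)": 80,
    "client jitter buffer + playout": 20,
}

BUDGET_MS = 800  # soft ceiling before the pause feels awkward

total = sum(HOPS_MS.values())
print(f"total: {total} ms, headroom: {BUDGET_MS - total} ms")
# -> total: 640 ms, headroom: 160 ms
```

Note that TTS is only ~90ms of 640ms here; shaving 50ms off TTS buys you far less than fixing a slow STT endpointer.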
Quality doesn't have to drop
Fast doesn't mean bad. The upper-left of this plot (fast + natural) is populated: Sonic 2 and Flash v2.5 both clear 4.5 MOS at sub-100ms. Piper is the only budget option; the rest cluster in the 4.2-4.7 range.
[Chart: Pareto frontier — MOS vs. cost, streaming models only. Same data as the leaderboard, filtered to streaming-capable options.]
Voice bot reference stack (2026)
Recommended cloud stack
- **STT:** Deepgram Nova-3 streaming (2.2% WER, <200ms)
- **LLM:** GPT-4o or Claude 3.5 Haiku with streaming tokens
- **TTS:** Cartesia Sonic 2 over WebSocket
- **Transport:** WebRTC with Opus; LiveKit or Pipecat orchestration
- **Budget:** ~640ms end-to-end, leaving 160ms of headroom.
Recommended self-host stack
- **STT:** Parakeet RNNT 1.1B (1.8% WER, GPU streaming)
- **LLM:** Llama 3.3 70B on vLLM with continuous batching
- **TTS:** Kokoro-82M (GPU) or Piper (CPU)
- **Transport:** direct WebSocket; in-process audio pipeline
- **Budget:** ~500ms if colocated, ~900ms otherwise.
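For the CPU path, Piper can stream raw PCM to stdout, which keeps the audio pipeline in-process. A sketch assuming the `piper` CLI and a downloaded voice model; the flags follow the rhasspy/piper README, but verify them against your installed version:

```python
import subprocess

def piper_cmd(model_path):
    # --output-raw streams 16-bit PCM to stdout instead of writing a WAV file.
    return ["piper", "--model", model_path, "--output-raw"]

def synthesize(model_path, text):
    """Run Piper once and return raw PCM bytes (sample rate depends on the voice model)."""
    proc = subprocess.Popen(
        piper_cmd(model_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    pcm, _ = proc.communicate(text.encode("utf-8"))
    return pcm
```

Because the subprocess writes PCM as it synthesizes, you can also read `proc.stdout` incrementally instead of calling `communicate`, which is what gets you the ~30ms TTFB figure above.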
Streaming setup (Cartesia)
```python
# Cartesia Sonic 2 over WebSocket — the lowest-latency setup in production today.
from cartesia import Cartesia

client = Cartesia(api_key="sk_...")
ws = client.tts.websocket()

# Incremental text -> incremental PCM. Send tokens as your LLM produces them.
def stream_tokens(llm_stream, speaker):
    for token in llm_stream:
        ws.send(
            model_id="sonic-2",
            transcript=token,
            voice={"mode": "id", "id": "<voice_id>"},
            output_format={
                "container": "raw",
                "encoding": "pcm_s16le",
                "sample_rate": 24000,
            },
            continue_=True,
        )
    ws.send(transcript="", continue_=False)  # flush the context
    for chunk in ws.receive():
        speaker.write(chunk.audio)  # `speaker`: any PCM sink, e.g. a PyAudio stream
```

Note the `continue_=True` flag. You want to send LLM tokens as they arrive rather than waiting for a full sentence — this collapses the LLM→TTS serial delay into a single pipeline.
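Token-by-token streaming minimizes latency, but very short transcripts can hurt prosody because the TTS model sees little context per send. A common compromise is to buffer tokens until a clause boundary before flushing; a vendor-independent sketch (the punctuation set is a tunable assumption):

```python
import re

# Flush on clause-ending punctuation; tune this set for your language/domain.
BOUNDARY = re.compile(r"[.!?;:,]\s*$")

def clause_chunks(llm_stream):
    """Group LLM tokens into clause-sized transcripts for TTS."""
    buf = ""
    for token in llm_stream:
        buf += token
        if BOUNDARY.search(buf):
            yield buf
            buf = ""
    if buf:  # flush any trailing partial clause
        yield buf
```

Each yielded chunk becomes one `ws.send(...)` call, trading a few tens of milliseconds of added latency for more natural phrasing.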
Listen: same prompt, every vendor
“Hi, this is Ada from support. I can see your last order was flagged — let me fix that right now.”