Voice Assistant Pipeline
From 1960s touch-tone menus to GPT-4o Realtime — the architecture, latency physics, and code behind conversational voice interfaces.
60 Years of Talking to Machines
Voice assistants didn't start with Siri. They are the product of six decades of converging advances in signal processing, speech recognition, natural language understanding, and synthesis — each generation constrained by the hardware and models available, each breakthrough redefining what "talking to a computer" could mean.
Understanding this history matters because today's architectural choices — cascaded vs end-to-end, streaming vs batch, on-device vs cloud — are direct responses to limitations discovered at each stage.
IBM Shoebox
At the 1962 World's Fair, IBM demonstrated the Shoebox — a machine the size of a shoebox that could recognize 16 spoken words: the digits 0–9 plus six commands like "plus" and "total." It used analog circuits to match formant frequencies — the resonant peaks in the audio spectrum that distinguish vowels. There was no learning: each word was a hand-tuned filter bank. But it was the first public demonstration that a machine could take spoken input and produce computed output.
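The matching scheme can be sketched in a few lines: treat each word as a hand-tuned template of formant frequencies and pick the nearest one. The template values below are illustrative, not IBM's actual filter settings.

```python
# Toy formant matcher in the spirit of the Shoebox: each "word" is a
# hand-tuned (F1, F2) formant template in Hz, and recognition is
# nearest-template matching. No learning anywhere.
TEMPLATES = {
    "zero":  (460, 1310),
    "one":   (640, 1190),
    "two":   (300, 870),
    "total": (590, 1850),
}

def recognize(f1: float, f2: float) -> str:
    """Pick the word whose stored formants are nearest (squared distance)."""
    return min(TEMPLATES,
               key=lambda w: (TEMPLATES[w][0] - f1) ** 2
                           + (TEMPLATES[w][1] - f2) ** 2)

print(recognize(310, 880))  # → two
```

Swap the analog filter bank for a dictionary lookup and the 1962 architecture fits in fifteen lines; that is the entire point of how constrained it was.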
IVR: The First Voice "Assistants"
Interactive Voice Response systems became the backbone of telephone customer service. Early IVR was DTMF-only ("Press 1 for billing"), but by the 1990s, systems like Dragon Systems' DragonDictate and AT&T's WATSON (no relation to IBM Watson) introduced speaker-independent recognition of limited vocabularies — typically 50–500 words within a constrained grammar.
The architecture was entirely rule-based: a finite-state grammar defined what the user could say, an HMM acoustic model matched audio to phonemes, and a decision tree determined the response. No language model. No generation. The "intelligence" was hand-authored dialog flows. Airlines, banks, and telecoms deployed millions of these systems, and many still run today.
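A hand-authored flow of this kind is just a state machine with a per-state grammar. A minimal sketch, with state names and phrases invented for illustration:

```python
# Finite-state IVR dialog: each state defines what the caller may say and
# where each recognized phrase leads. No language model, no generation.
FLOW = {
    "main_menu": {
        "prompt": "Say billing, support, or agent.",
        "grammar": {"billing": "billing_menu", "support": "support_menu",
                    "agent": "transfer"},
    },
    "billing_menu": {
        "prompt": "Say balance or payment.",
        "grammar": {"balance": "read_balance", "payment": "take_payment"},
    },
}

def step(state: str, utterance: str) -> str:
    """Advance the dialog; out-of-grammar input re-prompts the same state."""
    node = FLOW.get(state)
    if node is None:
        return state  # terminal node (transfer, read_balance, ...)
    return node["grammar"].get(utterance.strip().lower(), state)

print(step("main_menu", "billing"))  # → billing_menu
```

Everything the system "understands" is enumerated up front, which is exactly why these systems were reliable for airlines and infuriating for everyone who wanted to say anything else.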
Hidden Markov Models Dominate ASR
Jim Baker at CMU (1975) and the IBM Speech group led by Fred Jelinek pioneered Hidden Markov Models for speech recognition. The insight: treat speech as a sequence of hidden states (phonemes) generating observable features (spectral frames), and use the Baum-Welch algorithm to learn transition and emission probabilities from data.
"Every time I fire a linguist, the performance of the speech recognizer goes up."
GMM-HMM acoustic models combined with n-gram language models were the standard ASR stack for 30 years. Word error rates on clean read speech (the Wall Street Journal corpus) dropped from over 40% in the 1980s to roughly 5% by 2010 — but noisy, conversational speech remained brutally difficult.
— Baker, J. (1975). The DRAGON System. IEEE ICASSP.
— Rabiner, L. (1989). A Tutorial on HMMs. Proc. IEEE, 77(2), 257–286.
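The forward recursion at the heart of HMM decoding fits in a dozen lines. This toy uses two hidden states and discrete observation symbols with made-up probabilities; real systems replaced the discrete emission table with GMM densities over spectral frames, but the recursion is the same.

```python
STATES = (0, 1)
INIT  = (0.6, 0.4)                # P(state at t=0)
TRANS = ((0.7, 0.3), (0.4, 0.6))  # TRANS[i][j] = P(next state j | state i)
EMIT  = ((0.9, 0.1), (0.2, 0.8))  # EMIT[i][o]  = P(observe symbol o | state i)

def sequence_likelihood(obs: list[int]) -> float:
    """P(observation sequence) via the forward recursion alpha[t][j]."""
    alpha = [INIT[j] * EMIT[j][obs[0]] for j in STATES]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * TRANS[i][j] for i in STATES) * EMIT[j][o]
                 for j in STATES]
    return sum(alpha)  # marginalize over the final hidden state

print(round(sequence_likelihood([0, 1, 0]), 4))  # → 0.1089
```

Baum-Welch then adjusts `TRANS` and `EMIT` to maximize exactly this likelihood over a training corpus, which is what made the approach data-driven rather than hand-tuned.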
Siri Ships on iPhone 4S
Apple acquired Siri Inc. (a DARPA CALO spinoff) in 2010 and shipped it as a built-in feature in October 2011. For the first time, a voice assistant was available on a device in hundreds of millions of pockets. The architecture was a cascaded pipeline: audio streamed to Apple's servers, Nuance ASR transcribed it, an NLU module extracted intent and slots, and a dialog manager generated a response via templates or API calls (weather, restaurants, reminders).
Siri's real contribution wasn't technical — it was cultural. It normalized the act of talking to a phone in public. Within three years, Google launched Google Now (2012) and Microsoft launched Cortana (2014), each with the same cascaded architecture but different NLU backends.
Amazon Echo and Alexa
Amazon did something nobody expected: put a voice assistant in a speaker on a kitchen counter. The Echo introduced always-on, far-field voice interaction with a 7-microphone array and the "Alexa" wake word processed on-device by a small neural keyword spotter. Everything after wake word detection went to the cloud.
The Alexa Skills Kit (2015) was equally important — it turned Alexa into a platform. Third-party developers could register intents and slot types, and Alexa would route recognized utterances to their Lambda functions. By 2020, there were 100,000+ skills. The weakness: the rigid intent-slot NLU framework couldn't handle open-ended conversation. Users defaulted to timers, weather, and music.
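The intent-slot model can be sketched as template matching: developers register sample utterances with slot placeholders, and the runtime maps a transcript to the first matching intent. The intent names and phrasings below are invented for illustration, not ASK's actual schema format.

```python
import re

INTENTS = {
    "SetTimer": ["set a timer for {minutes} minutes",
                 "start a {minutes} minute timer"],
    "GetWeather": ["what's the weather in {city}"],
}

def parse(utterance: str):
    """Map a transcript to (intent, slots) via the registered templates."""
    for intent, templates in INTENTS.items():
        for template in templates:
            # Turn "{slot}" placeholders into named capture groups
            pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", template)
            match = re.fullmatch(pattern, utterance.lower().strip())
            if match:
                return intent, match.groupdict()
    return None, {}  # out of grammar: "Sorry, I don't know that one"

print(parse("set a timer for 10 minutes"))  # → ('SetTimer', {'minutes': '10'})
```

The rigidity is visible in the last line: any phrasing outside the registered patterns falls through to `(None, {})`, which is why users retreated to timers, weather, and music.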
Deep Learning Replaces HMMs
Between 2014 and 2020, every component of the voice pipeline was rebuilt with neural networks. Deep Speech (Hannun et al., 2014) at Baidu showed that a single end-to-end RNN trained with CTC could match HMM-GMM systems. Google's Listen, Attend and Spell (Chan et al., 2016) introduced attention-based seq2seq for ASR. For TTS, WaveNet (van den Oord et al., 2016) at DeepMind produced speech that listeners rated far closer to human recordings than any prior system — but it took 90 seconds to generate 1 second of audio.
# The cascaded pipeline, circa 2018
audio → [ASR: CTC/Attention Encoder-Decoder] → text
      → [NLU: BERT intent classifier + slot filler] → structured intent
      → [Dialog Manager: state machine] → response template
      → [TTS: Tacotron 2 + WaveGlow vocoder] → audio
# Total latency: 3-5 seconds
# Four separate models, four separate training pipelines, four separate failure modes
— Hannun, A. et al. (2014). Deep Speech. arXiv:1412.5567.
— van den Oord, A. et al. (2016). WaveNet. arXiv:1609.03499.
— Shen, J. et al. (2018). Tacotron 2. arXiv:1712.05884.
Whisper: ASR Goes Universal
OpenAI released Whisper in September 2022: a transformer encoder-decoder trained on 680,000 hours of web-scraped audio spanning 99 languages. The key insight: throw enough diverse, weakly-supervised data at a simple architecture and you get robustness for free. Whisper handled accents, background noise, code-switching, and domain-specific vocabulary that would have required extensive tuning in previous systems.
Combined with ChatGPT (November 2022), this transformed the voice pipeline. Instead of an NLU module parsing rigid intents, the LLM could handle open-ended conversation, follow-ups, and nuanced requests. The "intelligence" bottleneck was solved — but the latency problem got worse: GPT-3.5 alone took 500ms–1.5s per response.
— Radford, A. et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML.
GPT-4o: Audio In, Audio Out
In May 2024, OpenAI demonstrated GPT-4o ("omni") — a single model that natively accepts and generates audio tokens alongside text. No separate ASR or TTS pipeline. The model processes raw audio spectrograms as input and produces audio tokens that a small vocoder converts to waveforms. Response latency: ~320ms average, comparable to human conversational turn-taking.
The Realtime API (October 2024) exposed this capability via WebSocket, enabling developers to build voice assistants with natural interruption handling, emotional tone variation, and sub-second latency — things that cascaded pipelines could never achieve because information was lost at each text bottleneck.
The paradigm shift
For 60 years, voice assistants were pipelines: audio in, text intermediary, audio out. GPT-4o collapses this to a single model. It can hear tone of voice, detect hesitation, laugh, whisper, and sing — because it never discards the audio information into text. This is the same shift that happened in machine translation when seq2seq replaced the analysis-transfer-generation pipeline. Whether end-to-end models will fully replace cascaded systems in production is still an open question (see below).
The Open-Source Response
The open-source ecosystem moved fast. Kyutai Moshi (2024) demonstrated real-time, full-duplex speech interaction with a 160ms theoretical latency. Sesame CSM (March 2025) achieved voice quality that listeners rated as more natural than GPT-4o in blind tests, using a context-aware speech model trained on dialog-specific data. Pipecat, LiveKit Agents, and Vercel AI SDK provided production-grade frameworks for orchestrating either cascaded or end-to-end pipelines with built-in VAD, interruption handling, and transport layers.
— Défossez, A. et al. (2024). Moshi: a speech-text foundation model. arXiv:2410.00037.
— Sesame (2025). Crossing the Uncanny Valley of Voice.
The throughline: 1962 → 2025
The Cascaded Pipeline (Still the Default)
Despite the end-to-end models, most production voice assistants in 2025 still use a cascaded pipeline — three or four models chained sequentially. The reason is simple: each component can be independently swapped, debugged, and optimized.
Speech-to-Text (ASR)
Convert the user's spoken audio into text. The dominant models in 2025: Whisper large-v3 (OpenAI, open-weight, 99 languages), Universal-1 (AssemblyAI, API-only, best WER on English), Chirp 2 (Google, 100+ languages), and faster-whisper (CTranslate2 optimization, 4x speedup).
Typical latency: 200–800ms depending on model size, hardware, and audio length.
LLM Processing (The Brain)
The transcribed text goes to an LLM for response generation. This is the component that transformed voice assistants from rigid intent-matchers into conversational agents. The LLM handles follow-ups, context, multi-step reasoning, and tool use. For voice, you optimize for time-to-first-token (TTFT) rather than total generation time, because you can start TTS as soon as the first sentence is complete.
Typical TTFT: 150–500ms (GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash).
Text-to-Speech (TTS)
Convert the LLM's text response into audio. Modern neural TTS is nearly indistinguishable from human speech. The key metric is time-to-first-byte — how quickly the first audio chunk is ready to play. Streaming TTS models like ElevenLabs Turbo v2.5, OpenAI tts-1, and Cartesia Sonic can start producing audio within 100–300ms of receiving the first text chunk.
Typical time-to-first-byte: 100–300ms with streaming. Voice cloning adds ~50ms.
The Critical Path
In a naive implementation, these run sequentially: 800ms + 500ms + 300ms = 1.6s minimum before the user hears anything. The key optimization is pipelining: stream LLM tokens into TTS as they arrive. This cuts perceived latency to ASR time + LLM TTFT + TTS first-byte — often under 1 second.
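The arithmetic, using the stage numbers above (the TTFT and first-chunk values are assumed mid-range figures, not measurements):

```python
# Perceived latency is the time until the user hears the first audio
# chunk, not the time until all work is done.
asr_ms, llm_total_ms, tts_total_ms = 800, 500, 300
llm_ttft_ms = 200          # assumed: time to the first sentence boundary
tts_first_chunk_ms = 150   # assumed: streaming TTS first byte

sequential_ms = asr_ms + llm_total_ms + tts_total_ms        # wait for everything
pipelined_ms = asr_ms + llm_ttft_ms + tts_first_chunk_ms    # overlap the rest

print(sequential_ms, pipelined_ms)  # → 1600 1150
```

The remaining LLM and TTS work still happens; it just overlaps with playback of the first sentence, so the user never waits for it.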
Latency Anatomy: Where Every Millisecond Goes
Human conversational turn-taking has a median gap of ~200ms (Stivers et al., 2009). Anything over 1 second feels unnatural. Over 2 seconds, users start repeating themselves or abandon the interaction. Understanding where latency comes from is the first step to eliminating it.
Latency Breakdown: Cascaded Pipeline (Optimized Streaming)
- VAD: detect speech end (padding required to avoid cutoffs)
- Network: client to server (depends on geographic proximity)
- ASR: Whisper large-v3 on an A100: ~300ms for 5s audio
- LLM TTFT: time to first token (model + prompt dependent)
- First sentence: tokens until first sentence boundary (. ! ?)
- TTS first chunk: neural vocoder produces first playable audio
- Playback: client-side buffering before speakers fire
Why Streaming Changes Everything
Without streaming, you wait for the LLM to generate the entire response before sending it to TTS, and wait for TTS to synthesize the entire audio before playing it. With streaming, three things happen concurrently:
- LLM generates token by token, accumulating into a sentence buffer
- When a sentence boundary is detected, that sentence is immediately sent to TTS
- TTS generates audio chunks that play as they arrive — the user hears the first sentence while the LLM is still generating the second
This means the user's perceived latency is only: VAD + Network + ASR + LLM TTFT + time to first sentence boundary + TTS first chunk. In practice, 600ms–1.2s for well-optimized systems.
Code: Streaming Cascaded Pipeline
This is the production pattern: stream LLM tokens into a sentence buffer, flush complete sentences to TTS, and play audio chunks as they arrive. The entire response overlaps generation and playback.
Streaming Voice Pipeline
Python + OpenAI
import asyncio
import time
from openai import AsyncOpenAI
client = AsyncOpenAI()
SENTENCE_ENDINGS = {'.', '!', '?', '\n'}
async def transcribe(audio_path: str) -> str:
"""ASR: audio file → text via Whisper."""
with open(audio_path, "rb") as f:
transcript = await client.audio.transcriptions.create(
model="whisper-1",
file=f,
response_format="text"
)
return transcript
async def stream_llm_to_tts(
user_text: str,
conversation: list[dict],
on_audio_chunk: callable
):
"""Stream LLM response sentence-by-sentence into TTS.
This is the core optimization: we don't wait for the full
LLM response before starting TTS. Each complete sentence
is sent to TTS immediately, and audio chunks play as they
arrive.
"""
conversation.append({"role": "user", "content": user_text})
# Stream LLM response
stream = await client.chat.completions.create(
model="gpt-4o-mini", # Fast TTFT (~150ms)
messages=conversation,
max_tokens=200, # Keep voice responses concise
stream=True
)
sentence_buffer = ""
full_response = ""
tts_tasks = []
async for chunk in stream:
token = chunk.choices[0].delta.content or ""
sentence_buffer += token
full_response += token
# Flush on sentence boundary
if token and token[-1] in SENTENCE_ENDINGS and len(sentence_buffer.strip()) > 10:
sentence = sentence_buffer.strip()
sentence_buffer = ""
# Fire TTS for this sentence concurrently
task = asyncio.create_task(
speak_sentence(sentence, on_audio_chunk)
)
tts_tasks.append(task)
# Flush remaining buffer
if sentence_buffer.strip():
task = asyncio.create_task(
speak_sentence(sentence_buffer.strip(), on_audio_chunk)
)
tts_tasks.append(task)
# Wait for all TTS tasks to complete
await asyncio.gather(*tts_tasks)
conversation.append({"role": "assistant", "content": full_response})
return full_response
async def speak_sentence(text: str, on_audio_chunk: callable):
    """TTS: text → streaming audio chunks."""
    async with client.audio.speech.with_streaming_response.create(
        model="tts-1",  # Low-latency model
        voice="nova",
        input=text,
        response_format="opus"  # Efficient for streaming
    ) as response:
        async for chunk in response.iter_bytes(chunk_size=4096):
            on_audio_chunk(chunk)  # Play immediately
async def voice_assistant_loop():
"""Main loop with timing instrumentation."""
conversation = [{
"role": "system",
"content": (
"You are a voice assistant. Respond in 1-3 short sentences. "
"Be conversational and concise — the user is listening, not reading."
)
}]
print("Voice assistant ready. Ctrl+C to exit.")
while True:
# In production: record with VAD, save to temp file
audio_path = await record_with_vad() # Your recording function
t0 = time.perf_counter()
user_text = await transcribe(audio_path)
t_asr = time.perf_counter() - t0
if not user_text.strip():
continue
print(f"User: {user_text} [ASR: {t_asr*1000:.0f}ms]")
t1 = time.perf_counter()
first_audio = False
def on_chunk(chunk):
nonlocal first_audio, t1
if not first_audio:
print(f" [First audio: {(time.perf_counter()-t1)*1000:.0f}ms]")
first_audio = True
play_audio_chunk(chunk) # Your playback function
response = await stream_llm_to_tts(
user_text, conversation, on_chunk
)
t_total = time.perf_counter() - t0
print(f"Assistant: {response} [Total: {t_total*1000:.0f}ms]")Code: WebSocket Realtime API
The end-to-end approach: send raw audio in, receive audio out, through a single WebSocket connection. No ASR/TTS orchestration — the model handles everything internally. This is GPT-4o Realtime, the fastest path to sub-second voice interaction.
GPT-4o Realtime via WebSocket
TypeScript / Node.js
import WebSocket from "ws";
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";
const ws = new WebSocket(url, {
headers: {
Authorization: `Bearer ${OPENAI_API_KEY}`,
"OpenAI-Beta": "realtime=v1",
},
});
ws.on("open", () => {
// Configure the session
ws.send(JSON.stringify({
type: "session.update",
session: {
modalities: ["text", "audio"],
instructions: "You are a friendly voice assistant. Be concise.",
voice: "nova",
input_audio_format: "pcm16",
output_audio_format: "pcm16",
input_audio_transcription: { model: "whisper-1" },
turn_detection: {
type: "server_vad", // Server-side voice activity detection
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 500, // End of turn after 500ms silence
},
},
}));
});
// Stream microphone audio to the model
function sendAudioChunk(pcm16Buffer: Buffer) {
ws.send(JSON.stringify({
type: "input_audio_buffer.append",
audio: pcm16Buffer.toString("base64"),
}));
}
// Receive events from the model
ws.on("message", (data) => {
  const event = JSON.parse(data.toString());
  switch (event.type) {
    case "response.audio.delta": {
      // Audio chunk ready — play immediately
      const audioBytes = Buffer.from(event.delta, "base64");
      playAudioChunk(audioBytes); // Your playback function
      break;
    }
case "response.audio_transcript.delta":
// Real-time transcript of what the model is saying
process.stdout.write(event.delta);
break;
case "input_audio_buffer.speech_started":
// User started speaking — model auto-interrupts its response
stopPlayback();
break;
case "input_audio_buffer.speech_stopped":
// User stopped speaking — model will respond
console.log("\n[User finished speaking]");
break;
case "error":
console.error("Error:", event.error);
break;
}
});

Key differences from the cascaded approach:
- No separate ASR call — the model transcribes internally
- Server-side VAD handles turn detection — no client-side silence detection logic
- Natural interruption: if the user speaks while the model is responding, it stops automatically
- Audio in, audio out — the model preserves tone, emotion, and prosody that text discards
- Trade-off: less control over individual components, higher per-minute cost (~$0.06/min audio input, ~$0.24/min audio output)
Cascaded vs End-to-End: The Real Trade-offs
This is the defining architectural decision for voice assistants in 2025. Neither approach is strictly superior — the choice depends on your constraints.
| Dimension | Cascaded (ASR + LLM + TTS) | End-to-End (GPT-4o / Moshi) |
|---|---|---|
| Latency (optimized) | 0.6–1.5s to first audio | 0.3–0.5s to first audio |
| Audio understanding | Text only — tone, hesitation, emotion lost at ASR | Full audio features preserved through the model |
| Voice quality | Depends on TTS choice. ElevenLabs, Cartesia are excellent | Good but less controllable. Improving rapidly |
| Debuggability | High — inspect text at each stage | Low — audio in, audio out is opaque |
| Component swapping | Swap any model independently | Monolithic — take it or leave it |
| Language support | Whisper: 99 languages. TTS: varies (10–30) | GPT-4o: strong on ~20 languages |
| Cost (per minute) | ~$0.01–0.04 (depends on models) | ~$0.12–0.30 (GPT-4o Realtime pricing) |
| Tool use / function calling | Full LLM tool-use support | Supported in Realtime API |
| Interruption handling | Client-side VAD, manual cancellation | Native — model detects and handles it |
Choose Cascaded When
- You need to log/audit the text at each stage (compliance, healthcare)
- You want to swap models without rewriting the pipeline
- Cost matters — cascaded is 3–10x cheaper per minute
- You need niche language support or domain-specific ASR
Choose End-to-End When
- Sub-500ms latency is critical (real-time conversation)
- You need the model to understand tone, emotion, or non-verbal audio cues
- Natural interruption handling matters (call center, tutoring)
- Development speed matters more than cost optimization
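To make the cost gap concrete, a quick back-of-envelope using mid-range per-minute figures from the table above (the article's estimates, not any provider's price sheet):

```python
def monthly_cost(minutes_per_day: float, rate_per_min: float,
                 days: int = 30) -> float:
    """Voice traffic cost for a month at a flat per-minute rate."""
    return minutes_per_day * rate_per_min * days

cascaded = monthly_cost(1000, 0.03)    # mid-range cascaded estimate
end_to_end = monthly_cost(1000, 0.20)  # mid-range Realtime estimate
print(f"${cascaded:,.0f} vs ${end_to_end:,.0f} per month")
```

At 1,000 minutes of traffic a day, the roughly 7x rate difference compounds into thousands of dollars a month, which is why teams route only latency-critical turns through the end-to-end model.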
The industry consensus in 2025: most teams start with cascaded (cheaper, debuggable, more control) and migrate to end-to-end for latency-critical paths. Many production systems use a hybrid approach: end-to-end for the main conversation loop, cascaded for function calls and structured data extraction. Frameworks like Pipecat and LiveKit abstract over both.
— Pipecat Documentation (2024).
— LiveKit Agents Documentation (2024).
The Unsung Component: Voice Activity Detection
Every voice assistant needs to answer one question before anything else: is the user speaking right now? This is Voice Activity Detection (VAD), and it's more consequential than most developers realize. A bad VAD either clips the end of the user's sentence (cutting off critical words) or waits too long after they stop (adding hundreds of milliseconds of dead latency).
Silero VAD (State of the Art, Open Source)
Python / ONNX
import torch
import numpy as np
# Load Silero VAD (runs in <1ms per frame on CPU)
model, utils = torch.hub.load(
"snakers4/silero-vad", "silero_vad",
force_reload=False
)
(get_speech_timestamps, _, _, _, _) = utils
def detect_speech_end(
audio_frames: list[np.ndarray],
sample_rate: int = 16000,
silence_threshold_ms: int = 500
) -> bool:
"""Detect if the user has stopped speaking.
Uses Silero VAD which outperforms WebRTC VAD and
energy-based methods on noisy audio. Runs on CPU
in real-time with negligible overhead.
Returns True if speech followed by silence > threshold.
"""
audio = np.concatenate(audio_frames)
audio_tensor = torch.from_numpy(audio).float()
# Get speech timestamps
timestamps = get_speech_timestamps(
audio_tensor,
model,
sampling_rate=sample_rate,
threshold=0.5, # Speech probability threshold
min_speech_duration_ms=250, # Ignore very short sounds
min_silence_duration_ms=silence_threshold_ms,
)
if not timestamps:
return True # No speech detected
# Check if last speech ended > threshold ago
last_speech_end = timestamps[-1]["end"] / sample_rate
audio_duration = len(audio) / sample_rate
silence_at_end = audio_duration - last_speech_end
    return silence_at_end > (silence_threshold_ms / 1000)
The 500ms Dilemma
Setting silence_duration_ms is a fundamental tension. Too short (200ms): you clip the user mid-pause ("I want to order... [cut off] ...a pizza"). Too long (800ms): you add dead time after every utterance, making the assistant feel sluggish. Most production systems use 400–600ms and also implement an endpointing model that uses linguistic features (did the user finish a sentence?) in addition to silence duration. This is an active research area — Google's end-pointer model and OpenAI's server-side VAD in the Realtime API both use learned models rather than fixed thresholds.
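The hybrid idea can be sketched as a rule combining silence duration with a cheap linguistic cue. Production endpointers learn this mapping; the thresholds and filler list below are illustrative, but they show the shape of the policy:

```python
COMPLETE_ENDINGS = (".", "!", "?")
TRAILING_FILLERS = {"um", "uh", "and", "so", "but"}

def turn_is_over(partial_transcript: str, silence_ms: int) -> bool:
    """Shift the silence threshold based on how finished the text looks."""
    text = partial_transcript.strip().lower()
    words = text.split()
    if words and words[-1].strip(".,") in TRAILING_FILLERS:
        return silence_ms >= 1200   # mid-thought: wait much longer
    if text.endswith(COMPLETE_ENDINGS):
        return silence_ms >= 300    # looks finished: commit early
    return silence_ms >= 600        # default threshold

print(turn_is_over("I want a large pizza.", 350))  # → True
```

The win is asymmetric: a finished-looking sentence commits 300ms sooner, while a trailing "um" buys the user another 600ms without the assistant barging in.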
Wake Word Detection: Always Listening, Barely Computing
"Hey Siri," "Alexa," "OK Google" — wake words let the assistant listen continuously without streaming audio to the cloud. The model runs on-device, typically on a dedicated DSP or neural accelerator, consuming less power than the screen backlight.
The architecture is a tiny keyword-spotting neural network (50K–500K parameters) that classifies fixed-length audio frames as "wake word detected" or "not detected." Apple's "Hey Siri" detector uses a two-pass system: a small always-on detector on the motion coprocessor triggers a larger verification model on the main CPU, keeping false acceptance rate below 1 in 100,000.
"The always-on processor runs a detector with a small memory footprint [...] When it detects the phrase ‘Hey Siri,’ it passes the audio to the main processor, which runs a larger, more accurate detector to verify."
— Apple Machine Learning Journal (2017). Hey Siri: An On-device DNN-powered Voice Trigger.
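The two-pass structure is independent of the specific models. A schematic version, with stand-in scoring functions in place of real detectors:

```python
def cascade(frame, first_stage, verifier,
            first_threshold=0.3, verify_threshold=0.9) -> bool:
    """Cheap detector runs on every frame; only promising frames pay
    for the larger verification model."""
    if first_stage(frame) < first_threshold:
        return False  # the vast majority of frames stop here
    return verifier(frame) >= verify_threshold

# Stand-in scorers for illustration: a real first stage is a tiny
# always-on DNN, and the verifier a larger model on the main CPU.
def cheap(frame):
    return frame["energy"]

def accurate(frame):
    return frame["similarity"]

print(cascade({"energy": 0.8, "similarity": 0.95}, cheap, accurate))  # → True
```

The permissive first threshold keeps false rejections low; the strict second threshold keeps false acceptances rare; and because the expensive model runs on a tiny fraction of frames, average power stays near the cost of the cheap detector alone.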
For custom wake words, Picovoice Porcupine and OpenWakeWord are the leading options. Porcupine runs cross-platform (including Raspberry Pi and microcontrollers) with sub-millisecond latency and customizable wake phrases. OpenWakeWord is fully open-source and supports training custom keywords with as few as 100 positive samples.
Key Papers and Further Reading
The voice assistant field sits at the intersection of speech processing, natural language understanding, and real-time systems. These papers represent the foundational and frontier work.
ASR Foundations
- Rabiner, L. (1989). A Tutorial on Hidden Markov Models. Proc. IEEE, 77(2), 257–286. The definitive HMM reference. 30,000+ citations.
- Chan, W. et al. (2016). Listen, Attend and Spell. ICASSP. Attention-based seq2seq ASR — the encoder-decoder approach Whisper later scaled up.
- Radford, A. et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML. The Whisper paper. 680K hours, 99 languages.
TTS Milestones
- van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. The paper that made neural TTS sound human.
- Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. ICASSP. Tacotron 2. The standard two-stage pipeline.
- Wang, C. et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers. arXiv:2301.02111. VALL-E. 3 seconds of reference audio = voice cloning.
End-to-End Spoken Dialog
- Rubenstein, P. et al. (2023). AudioPaLM: A Large Language Model That Can Speak and Listen. arXiv:2306.12925. Google's first audio LLM. Unified speech understanding and generation.
- Défossez, A. et al. (2024). Moshi: a speech-text foundation model for real-time dialogue. arXiv:2410.00037. Kyutai. Open-weight, full-duplex, 160ms theoretical latency.
- Fang, H. et al. (2024). LLaMA-Omni: Simultaneous Speech Interaction with LLMs. arXiv:2409.02427. Open-source alternative demonstrating low-latency speech-to-speech.
Conversational Dynamics
- Stivers, T. et al. (2009). Universals and cultural variation in turn-taking in conversation. PNAS, 106(26), 10587–10592. The 200ms turn-taking gap that sets the latency target.
- Apple ML Journal (2017). Hey Siri: An On-device DNN-powered Voice Trigger. The definitive wake-word detection architecture reference.
Key Takeaways
1. Two architectures compete — Cascaded (ASR + LLM + TTS) gives control and debuggability. End-to-end (GPT-4o Realtime, Moshi) gives sub-second latency and audio understanding. Most production systems use a hybrid.
2. Streaming is non-negotiable — Without streaming, you cannot break the 2-second barrier. Pipe LLM tokens into TTS and play audio chunks as they arrive. The user hears the first sentence while the second is still generating.
3. VAD is the hidden latency lever — The time between the user stopping and the system detecting it is pure waste. Silero VAD + learned endpointing can save 200–400ms compared to energy-threshold methods.
4. 200ms is the target — Human turn-taking gaps average 200ms. Current best systems achieve ~320ms (GPT-4o). Closing the remaining gap requires on-device inference, predictive response generation, and better endpointing.
5. The text bottleneck is ending — For 60 years, voice assistants converted speech to text and back. End-to-end models process audio natively, preserving tone, emotion, and prosody. This is the biggest architectural shift since Siri shipped.
Latency Reference (March 2025)
| Component | Model / Service | Latency |
|---|---|---|
| ASR | Whisper large-v3 (API) | 400–800ms |
| | faster-whisper base (GPU) | 80–150ms |
| | Deepgram Nova-2 (streaming) | 100–300ms |
| LLM (TTFT) | GPT-4o | 200–400ms |
| | GPT-4o-mini | 100–200ms |
| | Claude 3.5 Haiku | 150–300ms |
| TTS (first byte) | OpenAI tts-1 | 200–400ms |
| | ElevenLabs Turbo v2.5 | 100–250ms |
| | Cartesia Sonic | 80–150ms |
| Full pipeline | End-to-end (GPT-4o Realtime) | ~320ms average |
| | Optimized cascaded pipeline | 600ms–1.2s to first audio |
Latencies are approximate and vary by network conditions, input length, and server load. Measured from US-East. Add 50–150ms for other regions.