Level 2: Pipelines (~35 min)

Voice Assistant Pipeline

From 1960s touch-tone menus to GPT-4o Realtime — the architecture, latency physics, and code behind conversational voice interfaces.

60 Years of Talking to Machines

Voice assistants didn't start with Siri. They are the product of six decades of converging advances in signal processing, speech recognition, natural language understanding, and synthesis — each generation constrained by the hardware and models available, each breakthrough redefining what "talking to a computer" could mean.

Understanding this history matters because today's architectural choices — cascaded vs end-to-end, streaming vs batch, on-device vs cloud — are direct responses to limitations discovered at each stage.

Era I: Rule-Based Systems
1961

IBM Shoebox

At the 1962 World's Fair, IBM demonstrated the Shoebox — a machine the size of a shoebox that could recognize 16 spoken words: the digits 0–9 plus six commands like "plus" and "total." It used analog circuits to match formant frequencies — the resonant peaks in the audio spectrum that distinguish vowels. There was no learning: each word was a hand-tuned filter bank. But it was the first public demonstration that a machine could take spoken input and produce computed output.

1970s–1990s

IVR: The First Voice "Assistants"

Interactive Voice Response systems became the backbone of telephone customer service. Early IVR was DTMF-only ("Press 1 for billing"), but by the 1990s, recognition engines from vendors such as Nuance and AT&T (whose WATSON engine bears no relation to IBM Watson) brought speaker-independent recognition of limited vocabularies — typically 50–500 words within a constrained grammar.

The architecture was entirely rule-based: a finite-state grammar defined what the user could say, an HMM acoustic model matched audio to phonemes, and a decision tree determined the response. No language model. No generation. The "intelligence" was hand-authored dialog flows. Airlines, banks, and telecoms deployed millions of these systems, and many still run today.
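A rule-based flow of that era can be sketched in a few lines. This is a minimal illustration, not any vendor's actual system; the menu, states, and accepted phrases are all hypothetical:

```python
# Hypothetical IVR dialog flow: each state defines the utterances its
# grammar accepts and the state (or action) each one maps to.
# Nothing is learned; every path is hand-authored.
IVR_FLOW = {
    "main_menu": {
        "prompt": "Say billing, reservations, or agent.",
        "grammar": {
            "billing": "billing_menu",
            "reservations": "reservations_menu",
            "agent": "transfer_to_agent",
        },
    },
    "billing_menu": {
        "prompt": "Say balance or payment.",
        "grammar": {"balance": "read_balance", "payment": "take_payment"},
    },
}

def route(state: str, utterance: str) -> str:
    """Return the next state; out-of-grammar input just reprompts."""
    node = IVR_FLOW[state]
    return node["grammar"].get(utterance.strip().lower(), state)
```

Anything outside the grammar simply loops back to the same prompt — the characteristic brittleness of this generation.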

1980s–2000s

Hidden Markov Models Dominate ASR

Jim Baker at CMU (1975) and the IBM Speech group led by Fred Jelinek pioneered Hidden Markov Models for speech recognition. The insight: treat speech as a sequence of hidden states (phonemes) generating observable features (spectral frames), and use the Baum-Welch algorithm to learn transition and emission probabilities from data.

"Every time I fire a linguist, the performance of the speech recognizer goes up."

Fred Jelinek, IBM (apocryphal but widely attributed).

HMMs combined with n-gram language models and Gaussian Mixture Model (GMM) acoustic models were the standard ASR stack for 30 years. Word error rates on clean read speech (Wall Street Journal corpus) dropped from >40% in the 1980s to ~5% by 2010 — but noisy, conversational speech remained brutally difficult.

Baker, J. (1975). The DRAGON System. IEEE ICASSP.
Rabiner, L. (1989). A Tutorial on HMMs. Proc. IEEE, 77(2), 257–286.
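The HMM machinery can be illustrated with the forward algorithm on a toy two-state model. All probabilities here are made up for illustration — real acoustic models had thousands of states and GMM emissions:

```python
# Toy HMM: 2 hidden states, 2 observable symbols (illustrative numbers only).
A = [[0.7, 0.3],   # transition probabilities between hidden states
     [0.4, 0.6]]
B = [[0.9, 0.1],   # emission probabilities: P(symbol | state)
     [0.2, 0.8]]
pi = [0.5, 0.5]    # initial state distribution

def forward(obs: list[int]) -> float:
    """P(observation sequence), summed over all hidden-state paths.

    This is the quantity an HMM decoder compares across word
    hypotheses; Baum-Welch reuses the same recursion to learn
    A and B from data.
    """
    alpha = [pi[s] * B[s][obs[0]] for s in range(2)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(2)) * B[t][o]
                 for t in range(2)]
    return sum(alpha)
```

A quick sanity check: the probabilities of all observation sequences of a given length sum to 1.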

Era II: Cloud + Deep Learning
October 2011

Siri Ships on iPhone 4S

Apple acquired Siri Inc. (a DARPA CALO spinoff) in 2010 and shipped it as a built-in feature in October 2011. For the first time, a voice assistant was available on a device in hundreds of millions of pockets. The architecture was a cascaded pipeline: audio streamed to Apple's servers, Nuance ASR transcribed it, an NLU module extracted intent and slots, and a dialog manager generated a response via templates or API calls (weather, restaurants, reminders).

Siri's real contribution wasn't technical — it was cultural. It normalized the act of talking to a phone in public. Google followed with Google Now in 2012 and Microsoft with Cortana in 2014, each using the same cascaded architecture but a different NLU backend.

November 2014

Amazon Echo and Alexa

Amazon did something nobody expected: put a voice assistant in a speaker on a kitchen counter. The Echo introduced always-on, far-field voice interaction with a 7-microphone array and the "Alexa" wake word processed on-device by a small neural keyword spotter. Everything after wake word detection went to the cloud.

The Alexa Skills Kit (2015) was equally important — it turned Alexa into a platform. Third-party developers could register intents and slot types, and Alexa would route recognized utterances to their Lambda functions. By 2020, there were 100,000+ skills. The weakness: the rigid intent-slot NLU framework couldn't handle open-ended conversation. Users defaulted to timers, weather, and music.
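The intent-slot model can be sketched as follows. The schema and handlers are hypothetical, loosely in the shape of an ASK interaction model (AMAZON.NUMBER is one of the real built-in slot types):

```python
# Hypothetical skill schema: intents declare sample utterances with
# slot placeholders; the platform routes recognized utterances to
# registered handlers (Lambda functions, in Alexa's case).
SKILL_INTENTS = {
    "SetTimerIntent": {
        "samples": ["set a timer for {minutes} minutes"],
        "slots": {"minutes": "AMAZON.NUMBER"},
    },
    "GetWeatherIntent": {
        "samples": ["what is the weather in {city}"],
        "slots": {"city": "AMAZON.City"},
    },
}

def handle(intent: str, slots: dict) -> str:
    """Stand-in for the handlers a skill developer would register."""
    if intent == "SetTimerIntent":
        return f"Timer set for {slots['minutes']} minutes."
    if intent == "GetWeatherIntent":
        return f"Looking up weather for {slots['city']}."
    # The rigid-NLU failure mode: anything off-schema dead-ends here.
    return "Sorry, I can't help with that."
```

The last branch is exactly why users defaulted to timers and weather: any utterance that didn't match a declared intent fell through.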

2015–2020

Deep Learning Replaces HMMs

Between 2015 and 2020, every component of the voice pipeline was rebuilt with neural networks. DeepSpeech (Hannun et al., 2014) at Baidu showed that a single end-to-end RNN could match HMM-GMM systems. Google's Listen, Attend and Spell (Chan et al., 2016) introduced attention-based seq2seq for ASR. For TTS, WaveNet (van den Oord et al., 2016) at DeepMind produced speech so natural that human listeners couldn't distinguish it from recordings — but it took 90 seconds to generate 1 second of audio.

# The cascaded pipeline, circa 2018
audio → [ASR: CTC/Attention Encoder-Decoder] → text
  → [NLU: BERT intent classifier + slot filler] → structured intent
  → [Dialog Manager: state machine] → response template
  → [TTS: Tacotron 2 + WaveGlow vocoder] → audio

# Total latency: 3-5 seconds
# Four separate models, four separate training pipelines, four separate failure modes

Hannun, A. et al. (2014). Deep Speech. arXiv:1412.5567.
van den Oord, A. et al. (2016). WaveNet. arXiv:1609.03499.
Shen, J. et al. (2018). Tacotron 2. arXiv:1712.05884.

Era III: LLMs + End-to-End Audio
September 2022

Whisper: ASR Goes Universal

OpenAI released Whisper, a transformer encoder-decoder trained on 680,000 hours of web-scraped audio spanning 99 languages. The key insight: throw enough diverse, weakly-supervised data at a simple architecture and you get robustness for free. Whisper handled accents, background noise, code-switching, and domain-specific vocabulary that would have required extensive tuning in previous systems.

Combined with ChatGPT (November 2022), this transformed the voice pipeline. Instead of an NLU module parsing rigid intents, the LLM could handle open-ended conversation, follow-ups, and nuanced requests. The "intelligence" bottleneck was solved — but the latency problem got worse: GPT-3.5 alone took 500ms–1.5s per response.

Radford, A. et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. ICML.

May 2024

GPT-4o: Audio In, Audio Out

OpenAI demonstrated GPT-4o ("omni") — a single model that natively accepts and generates audio tokens alongside text. No separate ASR or TTS pipeline. The model processes raw audio spectrograms as input and produces audio tokens that a small vocoder converts to waveforms. Response latency: ~320ms average, comparable to human conversational turn-taking.

The Realtime API (October 2024) exposed this capability via WebSocket, enabling developers to build voice assistants with natural interruption handling, emotional tone variation, and sub-second latency — things that cascaded pipelines could never achieve because information was lost at each text bottleneck.

The paradigm shift

For 60 years, voice assistants were pipelines: audio in, text intermediary, audio out. GPT-4o collapses this to a single model. It can hear tone of voice, detect hesitation, laugh, whisper, and sing — because it never discards the audio information into text. This is the same shift that happened in machine translation when seq2seq replaced the analysis-transfer-generation pipeline. Whether end-to-end models will fully replace cascaded systems in production is still an open question (see below).

2024–2025

The Open-Source Response

The open-source ecosystem moved fast. Kyutai Moshi (2024) demonstrated real-time, full-duplex speech interaction with a 160ms theoretical latency. Sesame CSM (March 2025) achieved voice quality that listeners rated as more natural than GPT-4o in blind tests, using a context-aware speech model trained on dialog-specific data. Pipecat, LiveKit Agents, and Vercel AI SDK provided production-grade frameworks for orchestrating either cascaded or end-to-end pipelines with built-in VAD, interruption handling, and transport layers.

Défossez, A. et al. (2024). Moshi: a speech-text foundation model. arXiv:2410.00037.
Sesame (2025). Crossing the Uncanny Valley of Voice.

The throughline: 1961 → 2025

1961–1990 · Rule-based: HMMs, grammars, decision trees. No learning from dialog.
2011–2014 · Cloud cascaded: ASR + NLU + Dialog + TTS. Siri, Alexa, Google Assistant.
2022–2023 · LLM-powered cascaded: Whisper + GPT/Claude + TTS. Open-ended conversation, high latency.
2024–now · End-to-end audio: Single model, audio tokens in and out. Sub-second latency.

The Cascaded Pipeline (Still the Default)

Despite the end-to-end models, most production voice assistants in 2025 still use a cascaded pipeline — three or four models chained sequentially. The reason is simple: each component can be independently swapped, debugged, and optimized.

1

Speech-to-Text (ASR)

Convert the user's spoken audio into text. The dominant models in 2025: Whisper large-v3 (OpenAI, open-weight, 99 languages), Universal-1 (AssemblyAI, API-only, best WER on English), Chirp 2 (Google, 100+ languages), and faster-whisper (CTranslate2 optimization, 4x speedup).

Typical latency: 200–800ms depending on model size, hardware, and audio length.

2

LLM Processing (The Brain)

The transcribed text goes to an LLM for response generation. This is the component that transformed voice assistants from rigid intent-matchers into conversational agents. The LLM handles follow-ups, context, multi-step reasoning, and tool use. For voice, you optimize for time-to-first-token (TTFT) rather than total generation time, because you can start TTS as soon as the first sentence is complete.

Typical TTFT: 150–500ms (GPT-4o-mini, Claude 3.5 Haiku, Gemini 2.0 Flash).

3

Text-to-Speech (TTS)

Convert the LLM's text response into audio. Modern neural TTS is nearly indistinguishable from human speech. The key metric is time-to-first-byte — how quickly the first audio chunk is ready to play. Streaming TTS models like ElevenLabs Turbo v2.5, OpenAI tts-1, and Cartesia Sonic can start producing audio within 100–300ms of receiving the first text chunk.

Typical time-to-first-byte: 100–300ms with streaming. Voice cloning adds ~50ms.

The Critical Path

User speaks → [VAD detects end] → [ASR] → text → [LLM streaming] → partial text → [TTS streaming] → first audio chunk plays

In a naive implementation, these run sequentially: 800ms + 500ms + 300ms = 1.6s minimum before the user hears anything. The key optimization is pipelining: stream LLM tokens into TTS as they arrive. This cuts perceived latency to ASR time + LLM TTFT + TTS first-byte — often under 1 second.

Latency Anatomy: Where Every Millisecond Goes

Human conversational turn-taking has a median gap of ~200ms (Stivers et al., 2009). Anything over 1 second feels unnatural. Over 2 seconds, users start repeating themselves or abandon the interaction. Understanding where latency comes from is the first step to eliminating it.

Stivers, T. et al. (2009). Universals and Cultural Variation in Turn-Taking. PNAS, 106(26), 10587–10592.

Latency Breakdown: Cascaded Pipeline (Optimized Streaming)

Voice Activity Detection: 100–200ms (detect speech end; padding required to avoid cutoffs)
Network round-trip: 20–80ms (client to server; depends on geographic proximity)
ASR transcription: 200–600ms (Whisper large-v3 on A100: ~300ms for 5s audio)
LLM TTFT: 150–500ms (time to first token; model and prompt dependent)
LLM sentence completion: 200–800ms (tokens until first sentence boundary: . ! ?)
TTS first audio chunk: 100–300ms (neural vocoder produces first playable audio)
Audio buffer + playback start: 50–100ms (client-side buffering before the speakers fire)

Naive sequential total: 2–4 seconds
With streaming pipeline: 0.6–1.2 seconds

Why Streaming Changes Everything

Without streaming, you wait for the LLM to generate the entire response before sending it to TTS, and wait for TTS to synthesize the entire audio before playing it. With streaming, three things happen concurrently:

  1. LLM generates token by token, accumulating into a sentence buffer
  2. When a sentence boundary is detected, that sentence is immediately sent to TTS
  3. TTS generates audio chunks that play as they arrive — the user hears the first sentence while the LLM is still generating the second

This means the user's perceived latency is only: VAD + Network + ASR + LLM TTFT + time to first sentence boundary + TTS first chunk. In practice, 600ms–1.2s for well-optimized systems.
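The arithmetic is worth making explicit. A minimal sketch with illustrative per-stage budgets (values chosen from the well-optimized end of the ranges above; the full-response times in the naive path are assumptions):

```python
# Illustrative stage budgets for a well-optimized pipeline (milliseconds).
STAGES_MS = {
    "vad_endpoint": 100,        # silence padding before declaring end of turn
    "network_rtt": 30,          # client to a nearby server
    "asr": 250,                 # accelerated Whisper-class transcription
    "llm_ttft": 150,            # fast model, short prompt
    "llm_first_sentence": 300,  # tokens until the first ". ! ?"
    "tts_first_chunk": 100,     # streaming TTS time-to-first-byte
    "playback_buffer": 50,      # client-side jitter buffer
}

def streaming_first_audio_ms(stages: dict) -> int:
    """Pipelined: later stages overlap, so perceived latency is only
    the path to the first playable chunk of the first sentence."""
    return sum(stages.values())

def naive_first_audio_ms(stages: dict,
                         llm_full_response_ms: int = 2000,
                         tts_full_synthesis_ms: int = 1200) -> int:
    """Sequential: wait for the FULL LLM response, then FULL TTS audio,
    before playing anything (full-response times are assumptions)."""
    return (stages["vad_endpoint"] + stages["network_rtt"] + stages["asr"]
            + llm_full_response_ms + tts_full_synthesis_ms
            + stages["playback_buffer"])
```

With these numbers the streaming path lands under one second while the naive path is well over three — the same gap the breakdown above describes.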

Code: Streaming Cascaded Pipeline

This is the production pattern: stream LLM tokens into a sentence buffer, flush complete sentences to TTS, and play audio chunks as they arrive. The entire response overlaps generation and playback.

Streaming Voice Pipeline

Python + OpenAI
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI()

SENTENCE_ENDINGS = {'.', '!', '?', '\n'}

async def transcribe(audio_path: str) -> str:
    """ASR: audio file → text via Whisper."""
    with open(audio_path, "rb") as f:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="text"
        )
    return transcript

async def stream_llm_to_tts(
    user_text: str,
    conversation: list[dict],
    on_audio_chunk: callable
):
    """Stream LLM response sentence-by-sentence into TTS.

    This is the core optimization: we don't wait for the full
    LLM response before starting TTS. Each complete sentence
    is sent to TTS immediately, and audio chunks play as they
    arrive.
    """
    conversation.append({"role": "user", "content": user_text})

    # Stream LLM response
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",         # Fast TTFT (~150ms)
        messages=conversation,
        max_tokens=200,              # Keep voice responses concise
        stream=True
    )

    sentence_buffer = ""
    full_response = ""
    tts_tasks = []

    async for chunk in stream:
        if not chunk.choices:        # final chunk may carry no choices
            continue
        token = chunk.choices[0].delta.content or ""
        sentence_buffer += token
        full_response += token

        # Flush on sentence boundary
        if token and token[-1] in SENTENCE_ENDINGS and len(sentence_buffer.strip()) > 10:
            sentence = sentence_buffer.strip()
            sentence_buffer = ""

            # Fire TTS for this sentence concurrently
            task = asyncio.create_task(
                speak_sentence(sentence, on_audio_chunk)
            )
            tts_tasks.append(task)

    # Flush remaining buffer
    if sentence_buffer.strip():
        task = asyncio.create_task(
            speak_sentence(sentence_buffer.strip(), on_audio_chunk)
        )
        tts_tasks.append(task)

    # Wait for all TTS tasks to complete
    await asyncio.gather(*tts_tasks)

    conversation.append({"role": "assistant", "content": full_response})
    return full_response

async def speak_sentence(text: str, on_audio_chunk: callable):
    """TTS: text → streaming audio chunks."""
    async with client.audio.speech.with_streaming_response.create(
        model="tts-1",              # Low-latency model
        voice="nova",
        input=text,
        response_format="opus"      # Efficient for streaming
    ) as response:
        async for chunk in response.iter_bytes(chunk_size=4096):
            on_audio_chunk(chunk)    # Play immediately

async def voice_assistant_loop():
    """Main loop with timing instrumentation."""
    conversation = [{
        "role": "system",
        "content": (
            "You are a voice assistant. Respond in 1-3 short sentences. "
            "Be conversational and concise — the user is listening, not reading."
        )
    }]

    print("Voice assistant ready. Ctrl+C to exit.")

    while True:
        # In production: record with VAD, save to temp file
        audio_path = await record_with_vad()  # Your recording function

        t0 = time.perf_counter()
        user_text = await transcribe(audio_path)
        t_asr = time.perf_counter() - t0

        if not user_text.strip():
            continue

        print(f"User: {user_text}  [ASR: {t_asr*1000:.0f}ms]")

        t1 = time.perf_counter()
        first_audio = False

        def on_chunk(chunk):
            nonlocal first_audio, t1
            if not first_audio:
                print(f"  [First audio: {(time.perf_counter()-t1)*1000:.0f}ms]")
                first_audio = True
            play_audio_chunk(chunk)    # Your playback function

        response = await stream_llm_to_tts(
            user_text, conversation, on_chunk
        )
        t_total = time.perf_counter() - t0
        print(f"Assistant: {response}  [Total: {t_total*1000:.0f}ms]")

Code: WebSocket Realtime API

The end-to-end approach: send raw audio in, receive audio out, through a single WebSocket connection. No ASR/TTS orchestration — the model handles everything internally. This is GPT-4o Realtime, the fastest path to sub-second voice interaction.

GPT-4o Realtime via WebSocket

TypeScript / Node.js
import WebSocket from "ws";

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview";

const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${OPENAI_API_KEY}`,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", () => {
  // Configure the session
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      modalities: ["text", "audio"],
      instructions: "You are a friendly voice assistant. Be concise.",
      voice: "alloy",               // Realtime API voice (nova is TTS-API only)
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      input_audio_transcription: { model: "whisper-1" },
      turn_detection: {
        type: "server_vad",         // Server-side voice activity detection
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500,   // End of turn after 500ms silence
      },
    },
  }));
});

// Stream microphone audio to the model
function sendAudioChunk(pcm16Buffer: Buffer) {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm16Buffer.toString("base64"),
  }));
}

// Receive events from the model
ws.on("message", (data) => {
  const event = JSON.parse(data.toString());

  switch (event.type) {
    case "response.audio.delta": {
      // Audio chunk ready — play immediately
      const audioBytes = Buffer.from(event.delta, "base64");
      playAudioChunk(audioBytes);   // Your playback function
      break;
    }

    case "response.audio_transcript.delta":
      // Real-time transcript of what the model is saying
      process.stdout.write(event.delta);
      break;

    case "input_audio_buffer.speech_started":
      // User started speaking — model auto-interrupts its response
      stopPlayback();
      break;

    case "input_audio_buffer.speech_stopped":
      // User stopped speaking — model will respond
      console.log("\n[User finished speaking]");
      break;

    case "error":
      console.error("Error:", event.error);
      break;
  }
});

Key differences from the cascaded approach:

  • No separate ASR call — the model transcribes internally
  • Server-side VAD handles turn detection — no client-side silence detection logic
  • Natural interruption: if the user speaks while the model is responding, it stops automatically
  • Audio in, audio out — the model preserves tone, emotion, and prosody that text discards
  • Trade-off: less control over individual components, higher per-minute cost (~$0.06/min audio input, ~$0.24/min audio output)

Cascaded vs End-to-End: The Real Trade-offs

This is the defining architectural decision for voice assistants in 2025. Neither approach is strictly superior — the choice depends on your constraints.

Dimension | Cascaded (ASR + LLM + TTS) | End-to-End (GPT-4o / Moshi)
Latency (optimized) | 0.6–1.5s to first audio | 0.3–0.5s to first audio
Audio understanding | Text only — tone, hesitation, emotion lost at ASR | Full audio features preserved through the model
Voice quality | Depends on TTS choice. ElevenLabs, Cartesia are excellent | Good but less controllable. Improving rapidly
Debuggability | High — inspect text at each stage | Low — audio in, audio out is opaque
Component swapping | Swap any model independently | Monolithic — take it or leave it
Language support | Whisper: 99 languages. TTS: varies (10–30) | GPT-4o: strong on ~20 languages
Cost (per minute) | ~$0.01–0.04 (depends on models) | ~$0.12–0.30 (GPT-4o Realtime pricing)
Tool use / function calling | Full LLM tool-use support | Supported in Realtime API
Interruption handling | Client-side VAD, manual cancellation | Native — model detects and handles it

Choose Cascaded When

  • You need to log/audit the text at each stage (compliance, healthcare)
  • You want to swap models without rewriting the pipeline
  • Cost matters — cascaded is 3–10x cheaper per minute
  • You need niche language support or domain-specific ASR

Choose End-to-End When

  • Sub-500ms latency is critical (real-time conversation)
  • You need the model to understand tone, emotion, or non-verbal audio cues
  • Natural interruption handling matters (call center, tutoring)
  • Development speed matters more than cost optimization

The industry consensus in 2025: most teams start with cascaded (cheaper, debuggable, more control) and migrate to end-to-end for latency-critical paths. Many production systems use a hybrid approach: end-to-end for the main conversation loop, cascaded for function calls and structured data extraction. Frameworks like Pipecat and LiveKit abstract over both.

Pipecat Documentation (2024).
LiveKit Agents Documentation (2024).

The Unsung Component: Voice Activity Detection

Every voice assistant needs to answer one question before anything else: is the user speaking right now? This is Voice Activity Detection (VAD), and it's more consequential than most developers realize. A bad VAD either clips the end of the user's sentence (cutting off critical words) or waits too long after they stop (adding hundreds of milliseconds of dead latency).

Silero VAD (State of the Art, Open Source)

Python / ONNX
import torch
import numpy as np

# Load Silero VAD (runs in <1ms per frame on CPU)
model, utils = torch.hub.load(
    "snakers4/silero-vad", "silero_vad",
    force_reload=False
)
(get_speech_timestamps, _, _, _, _) = utils

def detect_speech_end(
    audio_frames: list[np.ndarray],
    sample_rate: int = 16000,
    silence_threshold_ms: int = 500
) -> bool:
    """Detect if the user has stopped speaking.

    Uses Silero VAD which outperforms WebRTC VAD and
    energy-based methods on noisy audio. Runs on CPU
    in real-time with negligible overhead.

    Returns True if speech followed by silence > threshold.
    """
    audio = np.concatenate(audio_frames)
    audio_tensor = torch.from_numpy(audio).float()

    # Get speech timestamps
    timestamps = get_speech_timestamps(
        audio_tensor,
        model,
        sampling_rate=sample_rate,
        threshold=0.5,              # Speech probability threshold
        min_speech_duration_ms=250, # Ignore very short sounds
        min_silence_duration_ms=silence_threshold_ms,
    )

    if not timestamps:
        return True  # No speech detected

    # Check if last speech ended > threshold ago
    last_speech_end = timestamps[-1]["end"] / sample_rate
    audio_duration = len(audio) / sample_rate
    silence_at_end = audio_duration - last_speech_end

    return silence_at_end > (silence_threshold_ms / 1000)
Silero VAD at a glance: <1ms inference per frame (CPU) · 98.5% accuracy on noisy audio · ~2MB model size (ONNX)

The 500ms Dilemma

Setting silence_duration_ms is a fundamental tension. Too short (200ms): you clip the user mid-pause ("I want to order... [cut off] ...a pizza"). Too long (800ms): you add dead time after every utterance, making the assistant feel sluggish. Most production systems use 400–600ms and also implement an endpointing model that uses linguistic features (did the user finish a sentence?) in addition to silence duration. This is an active research area — Google's end-pointer model and OpenAI's server-side VAD in the Realtime API both use learned models rather than fixed thresholds.
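A toy version of that hybrid idea — combine silence duration with a crude linguistic-completeness check on the running transcript. The thresholds and cue words are illustrative, not from any production endpointer:

```python
# Hypothetical hybrid endpointing: shorten the required silence when the
# partial transcript already looks like a finished sentence, lengthen it
# when it ends mid-thought.
FILLER_ENDINGS = {"and", "but", "so", "um", "uh", "the", "to"}

def required_silence_ms(partial_transcript: str) -> int:
    text = partial_transcript.strip().lower()
    if text.endswith((".", "!", "?")):
        return 300   # sentence looks complete: end the turn quickly
    last_word = text.split()[-1] if text else ""
    if last_word in FILLER_ENDINGS:
        return 900   # user is mid-thought: wait longer
    return 500       # default

def turn_is_over(partial_transcript: str, silence_ms: int) -> bool:
    """True once observed silence exceeds the context-dependent threshold."""
    return silence_ms >= required_silence_ms(partial_transcript)
```

Production endpointers learn this mapping from data rather than hand-coding it, but the shape of the decision is the same.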

Wake Word Detection: Always Listening, Barely Computing

"Hey Siri," "Alexa," "OK Google" — wake words let the assistant listen continuously without streaming audio to the cloud. The model runs on-device, typically on a dedicated DSP or neural accelerator, consuming less power than the screen backlight.

The architecture is a tiny keyword-spotting neural network (50K–500K parameters) that classifies fixed-length audio frames as "wake word detected" or "not detected." Apple's "Hey Siri" detector uses a two-pass system: a small always-on detector on the motion coprocessor triggers a larger verification model on the main CPU, keeping false acceptance rate below 1 in 100,000.

"The always-on processor runs a detector with a small memory footprint [...] When it detects the phrase ‘Hey Siri,’ it passes the audio to the main processor, which runs a larger, more accurate detector to verify."

Apple Machine Learning Journal (2017). Hey Siri: An On-device DNN-powered Voice Trigger.
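The two-pass gating pattern can be sketched generically. The thresholds and model interfaces here are hypothetical stand-ins, not Apple's actual detector:

```python
# Two-pass wake word gating: a cheap always-on scorer gates a larger
# verifier. The coarse threshold is permissive (must not miss wake
# words); the verify threshold is strict (keeps false accepts rare).
COARSE_THRESHOLD = 0.4
VERIFY_THRESHOLD = 0.9

def detect_wake_word(frame, coarse_model, verify_model) -> bool:
    """Run the tiny detector on every frame; only wake the expensive
    model when the cheap score crosses its low threshold."""
    if coarse_model(frame) < COARSE_THRESHOLD:
        return False                 # the common case: nearly free
    return verify_model(frame) >= VERIFY_THRESHOLD
```

The asymmetry of the thresholds is the whole trick: the first pass optimizes recall at negligible power, the second pass optimizes precision only when needed.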

For custom wake words, Picovoice Porcupine and OpenWakeWord are the leading options. Porcupine runs cross-platform (including Raspberry Pi and microcontrollers) with sub-millisecond latency and customizable wake phrases. OpenWakeWord is fully open-source and supports training custom keywords with as few as 100 positive samples.

Key Papers and Further Reading

The voice assistant field sits at the intersection of speech processing, natural language understanding, and real-time systems. These papers represent the foundational and frontier work.

The citations above fall into four threads: ASR Foundations · TTS Milestones · End-to-End Spoken Dialog · Conversational Dynamics

Key Takeaways

  1. Two architectures compete — Cascaded (ASR + LLM + TTS) gives control and debuggability. End-to-end (GPT-4o Realtime, Moshi) gives sub-second latency and audio understanding. Most production systems use a hybrid.

  2. Streaming is non-negotiable — Without streaming, you cannot break the 2-second barrier. Pipe LLM tokens into TTS and play audio chunks as they arrive. The user hears the first sentence while the second is still generating.

  3. VAD is the hidden latency lever — The time between the user stopping and the system detecting it is pure waste. Silero VAD + learned endpointing can save 200–400ms compared to energy-threshold methods.

  4. 200ms is the target — Human turn-taking gaps average 200ms. Current best systems achieve ~320ms (GPT-4o). Closing the remaining gap requires on-device inference, predictive response generation, and better endpointing.

  5. The text bottleneck is ending — For 60 years, voice assistants converted speech to text and back. End-to-end models process audio natively, preserving tone, emotion, and prosody. This is the biggest architectural shift since Siri shipped.

Latency Reference (March 2025)

Component | Model / Service | Latency
ASR | Whisper large-v3 (API) | 400–800ms
ASR | faster-whisper base (GPU) | 80–150ms
ASR | Deepgram Nova-2 (streaming) | 100–300ms
LLM (TTFT) | GPT-4o | 200–400ms
LLM (TTFT) | GPT-4o-mini | 100–200ms
LLM (TTFT) | Claude 3.5 Haiku | 150–300ms
TTS (first byte) | OpenAI tts-1 | 200–400ms
TTS (first byte) | ElevenLabs Turbo v2.5 | 100–250ms
TTS (first byte) | Cartesia Sonic | 80–150ms
End-to-end | GPT-4o Realtime | ~320ms average
Full pipeline | Optimized cascaded | 600ms–1.2s to first audio

Latencies are approximate and vary by network conditions, input length, and server load. Measured from US-East. Add 50–150ms for other regions.
