Real-time AI Systems
From batch inference to sub-100ms responses. The engineering that makes AI feel instant.
The Long Road from Batch to Real-Time
Real-time AI didn't arrive in a single breakthrough. It is the product of three decades of converging advances in hardware, serving infrastructure, and algorithmic efficiency — each generation chipping away at the latency that separates "useful" from "magical."
Understanding this evolution is essential. The techniques you choose for a 500ms conversational system are fundamentally different from a 10ms gaming loop, and both are different from a 100ms interactive search. The history explains why.
The Offline Era
Early ML systems operated in pure batch mode. You collected data, trained a model overnight (or over weeks), and deployed a static artifact. Inference happened in bulk — process a CSV of 10,000 inputs, write results to a file, hand them to a business analyst. "Real-time" meant "we ran the batch job last night instead of last week."
Netflix's original recommendation system (2006) recomputed suggestions once per day. Google's PageRank was a batch job over the entire web graph. Even spam filters at Gmail were periodically retrained and deployed as static classifiers. The latency budget was hours, not milliseconds.
Request/Response Inference
TensorFlow Serving (2016) and Clipper (2017, UC Berkeley) introduced the idea of ML models behind REST APIs. For the first time, you could send a single input and get a prediction back in real-time — classification in 5–50ms, object detection in 100–300ms. But these were small models: ResNet-50 is 25M parameters. The idea of serving a billion-parameter model per-request was absurd.
"The cost of a single forward pass must be small enough that you can afford to do it on every user request. This constraint shaped a decade of production ML."
— Crankshaw, D. et al. (2017). Clipper: A Low-Latency Online Prediction Serving System. NSDI.
TensorRT & ONNX Runtime: Compiler-Level Optimization
NVIDIA's TensorRT treated inference as a compilation problem. Take a trained model, fuse operations (convolution + batch norm + ReLU into a single kernel), select optimal GPU kernels for each layer, quantize weights from FP32 to FP16 or INT8, and output a binary optimized for the specific GPU you're deploying to.
Results were dramatic: 2–6x speedup on the same hardware with negligible accuracy loss. Microsoft's ONNX Runtime brought similar optimizations across hardware backends (CPU, GPU, edge devices). For the first time, a BERT model could run inference in under 10ms on a V100 GPU — fast enough for search autocomplete.
NVIDIA Triton: Multi-Model Serving
Triton Inference Server solved the deployment problem: serve multiple models (TensorRT, ONNX, PyTorch, TensorFlow) behind a single endpoint with dynamic batching, model pipelining, and GPU memory management. It could pack multiple models onto one GPU, batch incoming requests transparently, and route to the right backend. Production ML teams finally had a serving infrastructure that matched the sophistication of web application servers.
FlashAttention: The Memory Wall Breakthrough
Tri Dao et al. at Stanford identified the real bottleneck in transformer inference: not compute, but memory bandwidth. Standard attention writes enormous intermediate matrices to GPU HBM (High Bandwidth Memory), then reads them back. FlashAttention fused the entire attention computation into a single GPU kernel that kept everything in fast SRAM, never materializing the full attention matrix.
The impact: 2–4x speedup on attention computation, 5–20x memory reduction, and the ability to handle context lengths that were previously impossible. FlashAttention-2 (2023) pushed this further with better work partitioning across GPU thread blocks, reaching close to the theoretical maximum memory bandwidth utilization.
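A back-of-envelope calculation shows why the memory wall dominates. The shapes below are illustrative round numbers (not taken from the paper), but the conclusion holds at any long context:

```python
# Why attention hits the memory wall: standard attention materializes an
# N x N score matrix per head in HBM, then reads it back for the softmax
# and the value-weighted sum. Illustrative shapes, FP16 storage.
seq_len, n_heads, bytes_per_elem = 8192, 32, 2

attn_matrix_bytes = seq_len * seq_len * bytes_per_elem * n_heads
print(attn_matrix_bytes / 1e9)  # ~4.3 GB of intermediate data per layer

# An A100 SM has roughly 192 KB of combined shared memory / L1.
# FlashAttention tiles the computation so each tile fits in that SRAM
# and the full N x N matrix is never written to HBM at all.
```

At 8K context the intermediates dwarf the on-chip memory by seven orders of magnitude, which is why avoiding the round-trip to HBM, rather than reducing FLOPs, is the win.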
vLLM & PagedAttention
Woosuk Kwon et al. at UC Berkeley identified the core problem of LLM serving: the KV (key-value) cache. During autoregressive generation, each token's attention keys and values must be stored for all subsequent tokens. For a 13B model with a 2048-token sequence, the KV cache alone consumes 1.7 GB per request. Naively allocated, GPU memory fragmentation meant you could only serve 2–3 concurrent requests on an A100.
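The 1.7 GB figure follows directly from the model's shapes. A quick sanity check, assuming LLaMA-13B dimensions (40 layers, hidden size 5120) and FP16 storage:

```python
# KV cache size per request: keys + values, one pair per layer,
# one hidden-sized vector per token, 2 bytes per element in FP16.
layers, hidden, seq_len, bytes_per_elem = 40, 5120, 2048, 2

kv_bytes = 2 * layers * hidden * seq_len * bytes_per_elem  # 2 = keys + values
print(kv_bytes / 1e9)  # ~1.68 GB per 2048-token request
```

An 80 GB A100 holding 26 GB of FP16 weights for a 13B model leaves roughly 50 GB for KV caches, so without careful allocation only a handful of such requests fit.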
PagedAttention borrowed virtual memory paging from operating systems: split the KV cache into fixed-size blocks, allocate them on-demand from a shared pool, and use a page table to map logical positions to physical GPU memory. The result: near-zero memory waste and the ability to serve 10–24x more concurrent requests.
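The bookkeeping can be sketched in a few lines. This is a toy model of the idea only: block size, names, and the free-list are simplified assumptions, and real vLLM manages GPU tensors, not Python lists:

```python
class PagedKVCache:
    """Toy page table: KV storage split into fixed-size blocks drawn
    on demand from a shared pool, like OS virtual memory pages."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # shared physical block pool
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> tokens stored so far

    def append_token(self, req: str):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # last block full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks — request must wait")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req: str):
        # Finished request: all its blocks return to the shared pool
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("req-a")        # 3 tokens -> 2 blocks allocated
print(len(cache.tables["req-a"]))      # 2
```

Internal fragmentation is bounded by one partially-filled block per request, versus naive allocators that reserve the full maximum sequence length up front.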
Speculative Decoding: Breaking the Sequential Bottleneck
Autoregressive LLM generation is inherently sequential: each token depends on the previous one. Speculative decoding (proposed independently by Leviathan et al. at Google and Chen et al. at DeepMind) breaks this constraint with a clever trick: use a small, fast "draft" model to generate K tokens speculatively, then verify all K tokens in a single forward pass of the large "target" model.
Because the large model can verify K tokens in parallel (the cost of processing K tokens is nearly the same as processing 1 due to GPU parallelism), accepted tokens come essentially for free. With a good draft model, acceptance rates of 70–90% are common, yielding 2–3x speedup with mathematically identical output distribution.
# Speculative decoding — conceptual flow
draft_model = load("llama-68m")    # Tiny, fast: ~2ms/token
target_model = load("llama-70b")   # Large, slow: ~40ms/token

def speculative_decode(prompt, K=5):
    draft_tokens = []
    for _ in range(K):
        token = draft_model.generate_one(prompt + draft_tokens)
        draft_tokens.append(token)  # ~2ms each = 10ms total
    # Verify ALL K tokens in ONE forward pass of target model (~45ms)
    accepted = target_model.verify(prompt, draft_tokens)
    # If 4 of 5 accepted: generated 4 tokens in ~55ms instead of ~160ms
— Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML.
— Chen, C. et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv.
The Optimization Arms Race
The field has exploded with complementary optimizations, each attacking a different bottleneck:
SGLang
RadixAttention for automatic KV cache reuse across requests sharing common prefixes. 3–5x throughput on multi-turn conversations.
Medusa
Multiple decoding heads predict several future tokens simultaneously. No draft model needed. 2x speedup.
Continuous Batching
Instead of waiting for all requests in a batch to finish, insert new requests as slots free up. Near-100% GPU utilization.
Prefix Caching
Cache the KV states for common system prompts. Skip recomputation for every request sharing the same prefix.
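Continuous batching is easy to see in a toy scheduler. The sketch below (all names and the one-token-per-step model are illustrative simplifications) admits a waiting request the moment a slot frees, instead of draining the whole batch first:

```python
from collections import deque

def continuous_batching(requests, max_slots=4):
    """Toy decode scheduler. `requests` is a list of
    (request_id, tokens_to_generate) pairs."""
    waiting = deque(requests)
    running = {}  # request_id -> tokens remaining
    steps = 0
    while waiting or running:
        # Fill free slots immediately — the key difference vs static batching
        while waiting and len(running) < max_slots:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step generates one token for every running sequence
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot frees mid-batch
        steps += 1
    return steps

steps = continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_slots=2)
print(steps)  # 3
```

With static batching the same workload takes 5 steps: the first batch runs for 3 steps (the longest member), then "c" runs alone for 2 more. Filling slots as they free is where the near-100% utilization comes from.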
The throughline: 1990s to now
Each generation solved the previous generation's bottleneck: batch jobs gave way to request/response serving, compiler-level optimization attacked per-model latency, FlashAttention broke the memory wall, PagedAttention eliminated KV-cache waste, and speculative decoding attacked the sequential decode loop itself.
The Latency Budget
"Real-time" is not a single number. Different applications have fundamentally different latency requirements, and each demands a different architecture.
Latency Tiers in Production AI
Where Latency Comes From
A typical LLM API call decomposes into sequential stages: network transit to the server, queueing for a free slot, prefill (processing the input prompt), decode (generating output tokens one at a time), and network transit back to the client.
The Prefill vs Decode Distinction Matters
Prefill is compute-bound: processing N input tokens in parallel using matrix multiplications that fully saturate the GPU. Decode is memory-bound: generating one token at a time, where the bottleneck is reading model weights and the KV cache from GPU memory. These are fundamentally different workloads.
This is why a 70B model can process 4,000 input tokens in the same time it takes to generate 50 output tokens. Optimization strategies differ: prefill benefits from tensor parallelism across GPUs, while decode benefits from smaller model sizes (quantization) and speculative methods.
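The memory-bound ceiling on decode can be estimated with round numbers (illustrative figures, not vendor benchmarks): at batch size 1, each generated token requires reading every model weight from HBM once, so bandwidth divided by model size caps tokens per second:

```python
# Back-of-envelope decode ceiling at batch size 1.
hbm_bandwidth_gb_s = 2000   # roughly an A100 80GB, in decimal GB/s
model_size_gb = 140         # 70B params x 2 bytes (FP16)

max_tps = hbm_bandwidth_gb_s / model_size_gb
print(round(max_tps, 1))    # ~14.3 tokens/s — the bandwidth ceiling

# The same calculation shows why 4-bit quantization helps decode:
int4_size_gb = 35           # 70B params x 0.5 bytes
print(round(hbm_bandwidth_gb_s / int4_size_gb, 1))  # ~57.1 — a 4x higher ceiling
```

Prefill faces no such per-token ceiling because one weight read serves thousands of input tokens at once, which is exactly the asymmetry described above.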
Streaming: The Most Important UX Optimization
Streaming is the single most impactful technique in real-time AI — not because it makes generation faster, but because it makes it feel faster. The total time to generate a 500-token response is identical whether you stream or not. But the user sees the first token in 200ms instead of waiting 15 seconds for the complete response.
This is not just a nice-to-have. Research from Microsoft (2023) showed that streaming reduces perceived wait time by 50–75% and increases user satisfaction scores by 20–30% compared to buffered responses of the same content.
Without Streaming
User stares at a spinner for 15 seconds. Entire response appears at once. Feels broken.
With Streaming
First token at 200ms. User reads as tokens arrive. Feels conversational.
Streaming Architecture with Latency Budgets
Batch vs Streaming Processing
Server-Sent Events (SSE) — The Standard Protocol
SSE is the dominant protocol for LLM streaming. Unlike WebSockets (bidirectional), SSE is unidirectional (server to client), simpler to implement, works through CDNs and proxies, and auto-reconnects. OpenAI, Anthropic, and Google all use SSE for their streaming APIs.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
import json, time

app = FastAPI()
client = OpenAI()

@app.post("/chat/stream")
async def stream_chat(prompt: str):
    """Stream LLM tokens via Server-Sent Events."""
    async def generate():
        start = time.time()
        stream = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            stream_options={"include_usage": True},
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                # SSE format: "data: <json>\n\n"
                payload = {
                    "token": chunk.choices[0].delta.content,
                    "latency_ms": round((time.time() - start) * 1000),
                }
                yield f"data: {json.dumps(payload)}\n\n"
            # Final chunk includes usage stats
            if chunk.usage:
                yield f"data: {json.dumps({'usage': chunk.usage.model_dump()})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # Disable nginx buffering
        },
    )
Client-Side: Consuming the Stream
// Browser: consuming SSE with fetch + ReadableStream
async function streamChat(prompt: string, onToken: (t: string) => void) {
  const response = await fetch("/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n\n");
    buffer = lines.pop()!; // Keep incomplete chunk
    for (const line of lines) {
      const data = line.replace("data: ", "");
      if (data === "[DONE]") return;
      const { token } = JSON.parse(data);
      if (token) onToken(token); // Append to UI
    }
  }
}

// Usage: tokens appear character by character
streamChat("Explain transformers", (token) => {
  document.getElementById("output")!.textContent += token;
});
When to Use WebSockets Instead
WebSockets add complexity but are necessary when you need bidirectional real-time communication — real-time voice (user speaks while AI responds), collaborative editing, or streaming both audio input and text output simultaneously.
# WebSocket server for real-time voice + text
import asyncio
import json
import websockets

async def handle_session(websocket):
    async for message in websocket:
        msg = json.loads(message)
        if msg["type"] == "audio_chunk":
            # Process audio in real-time (STT)
            text = await transcribe(msg["data"])
            await websocket.send(json.dumps({
                "type": "transcript", "text": text
            }))
        elif msg["type"] == "generate":
            # Stream LLM response back
            async for token in generate_stream(msg["prompt"]):
                await websocket.send(json.dumps({
                    "type": "token", "text": token
                }))
Latency Optimization: The Full Toolkit
Streaming hides latency. The techniques below actually reduce it. Each targets a different bottleneck; real production systems combine several.
Quantization
Reduce model weight precision from 16-bit floats to 8-bit or 4-bit integers. Since LLM decoding is memory-bandwidth-bound (reading weights from GPU memory), smaller weights means faster reads. A 4-bit quantized 70B model fits on a single A100 (80GB) and generates tokens 2–3x faster than its FP16 counterpart on two GPUs.
# GPTQ quantization: 70B model in 4-bit
# Model size: FP16 = 140 GB  → INT4 = 35 GB   (4x smaller)
# Tokens/s:   FP16 = 12 t/s  → INT4 = 35 t/s  (2.9x faster)
# Quality:    MMLU 69.8 → MMLU 69.1 (0.7 pt loss)
Modern quantization methods (GPTQ, AWQ, GGUF) are sophisticated enough that 4-bit models lose less than 1% quality on most benchmarks. The real risk is on edge cases and long-tail tasks where the quality degradation concentrates.
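The core round-trip is simple to sketch. Below is a minimal per-tensor symmetric int8 quantizer in plain Python; real methods like GPTQ and AWQ are far more sophisticated, calibrating scales per channel against activations, but the error-bound intuition is the same:

```python
import random

def quantize_int8(w):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(x) for x in w) / 127.0
    q = [round(x / scale) for x in w]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

random.seed(0)
w = [random.uniform(-1, 1) for _ in range(4096)]
q, s = quantize_int8(w)

# Round-to-nearest means per-weight error is at most half a quantization step
err = max(abs(a - b) for a, b in zip(dequantize(q, s), w))
print(err <= s / 2 + 1e-9)  # True
```

Each weight now needs 1 byte instead of 2 (FP16) or 4 (FP32), which is exactly the bandwidth saving that speeds up memory-bound decoding.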
Semantic Caching
Exact-match caching misses almost everything — users rarely phrase questions identically. Semantic caching embeds the query, searches a vector store for similar past queries, and returns the cached response if similarity exceeds a threshold. Cache hits return in 5–10ms instead of 2–15 seconds.
# Semantic cache with cosine similarity
import numpy as np
from redis import Redis

THRESHOLD = 0.92  # Tune: higher = fewer false hits

def cached_generate(query: str):
    query_emb = embed(query)
    # Search cache (Redis + vector index, ~3ms)
    cached = vector_search(query_emb, top_k=1)
    if cached and cached.score > THRESHOLD:
        return cached.response       # Cache hit: 5ms total
    # Cache miss: generate normally
    response = llm.generate(query)   # 2-15s
    cache_store(query_emb, response) # Background write
    return response
In production, semantic caches typically achieve 15–40% hit rates on customer support workloads (many users ask similar questions) but under 5% on creative or code generation tasks. The hit rate determines whether the infrastructure cost is justified.
Intelligent Model Routing
Not every query needs GPT-4. A classifier (itself a small, fast model) examines the query and routes to the appropriate tier. Simple factual questions go to a 7B model (20ms), complex reasoning to a 70B model (500ms), and only truly hard problems to a frontier API (2s).
# Routing with a lightweight classifier
def route_query(query: str) -> str:
    complexity = classifier.predict(query)  # ~5ms
    if complexity == "simple":
        return "llama-8b"     # Local, fast, cheap
    elif complexity == "medium":
        return "llama-70b"    # Local, moderate
    else:
        return "claude-opus"  # API, expensive, best quality

# Result: 60% of queries → 8B (fast + cheap)
#         30% of queries → 70B (moderate)
#         10% of queries → API (slow + expensive)
# Average latency drops 3x, cost drops 5x
KV Cache & Prefix Optimization
If every request starts with the same 2,000-token system prompt, you're recomputing the same KV cache every time. Prefix caching computes it once and reuses it across requests, eliminating 50–80% of prefill latency for structured applications.
# vLLM prefix caching
#
# Without: every request processes the full prompt
#   Request 1: [system prompt 2000 tok] + [user msg 50 tok] → prefill 2050 tok
#   Request 2: [system prompt 2000 tok] + [user msg 80 tok] → prefill 2080 tok
#
# With prefix caching:
#   Request 1: [system prompt 2000 tok → CACHED] + [user msg 50 tok] → prefill 50 tok
#   Request 2: [cache hit!] + [user msg 80 tok] → prefill 80 tok
#
# Prefill latency: 400ms → 20ms (20x improvement)
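A minimal sketch of the mechanism; the names are illustrative and the string returned by `compute_kv` stands in for a real KV tensor (vLLM does this per fixed-size block, not per whole prompt):

```python
import hashlib

prefix_cache = {}  # hash of shared prefix -> cached "KV state"

def compute_kv(tokens):
    # Placeholder for real prefill over `tokens`
    return f"kv({len(tokens)} tokens)"

def prefill(system_tokens, user_tokens):
    """Returns (kv_state, tokens_actually_recomputed)."""
    key = hashlib.sha256(" ".join(system_tokens).encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = compute_kv(system_tokens)  # paid once
        recomputed = len(system_tokens) + len(user_tokens)
    else:
        recomputed = len(user_tokens)  # cache hit: only the user suffix
    return prefix_cache[key], recomputed

system = ["You", "are", "a", "helpful", "assistant"]
_, cold = prefill(system, ["hi"])              # first request pays full prefill
_, warm = prefill(system, ["hello", "there"])  # later requests pay only suffix
print(cold, warm)  # 6 2
```

Every request sharing the system prompt after the first one prefills only its own suffix, which is where the 50–80% prefill saving comes from.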
GPU Serving at Scale: vLLM vs TGI
If you're self-hosting models, the choice of inference server matters more than the choice of model. A well-optimized server can extract 10–24x more throughput from the same hardware. Two frameworks dominate production: vLLM (UC Berkeley) and TGI (HuggingFace).
vLLM
PagedAttention for near-zero memory waste. The throughput king for high-concurrency workloads.
- 24x throughput vs naive HuggingFace
- Continuous batching
- Tensor & pipeline parallelism
- OpenAI-compatible API
- Prefix caching built-in
- Speculative decoding support
Text Generation Inference (TGI)
HuggingFace's production server. Tight HF ecosystem integration, battle-tested at scale.
- Flash Attention 2
- GPTQ / AWQ quantization
- Speculative decoding (Medusa)
- Built-in metrics (Prometheus)
- Watermarking support
- Grammar-constrained generation
vLLM: Production Deployment
# Start vLLM with production settings:
#   --tensor-parallel-size 4       shard across 4 GPUs
#   --max-model-len 8192           max context length
#   --enable-prefix-caching        reuse KV for shared prefixes
#   --quantization awq             4-bit quantization
#   --gpu-memory-utilization 0.9   use 90% of GPU memory
#   --max-num-seqs 256             max concurrent sequences
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --enable-prefix-caching \
    --quantization awq \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256 \
    --port 8000

# Then use with any OpenAI-compatible client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Streaming response
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    stream=True,
    max_tokens=512,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
TGI: Docker Deployment
# TGI with Docker — production-ready in one command
docker run --gpus all -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-70B-Instruct \
    --num-shard 4 \
    --quantize gptq \
    --max-input-tokens 4096 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 16384

# TGI client
from huggingface_hub import InferenceClient
client = InferenceClient("http://localhost:8080")

# Streaming
for token in client.text_generation(
    "Explain continuous batching",
    max_new_tokens=256,
    stream=True,
):
    print(token, end="", flush=True)
Edge vs Cloud Inference
The most impactful latency reduction isn't algorithmic — it's eliminating the network round-trip entirely by running inference on the user's device. With quantized models running on Apple Silicon, Qualcomm NPUs, and even WebGPU in the browser, edge AI has become viable for a growing set of use cases.
Edge Inference
Model runs on the user's device. Zero network latency. Full privacy.
- Latency: 0ms network + 10–100ms compute
- Models: 1–8B quantized (GGUF, CoreML)
- Tools: llama.cpp, MLX, ONNX Runtime Mobile
- Use cases: Autocomplete, on-device search, voice commands, privacy-sensitive medical/legal
- Limitation: Model size capped by device memory
Cloud Inference
Model runs on GPU servers. Unlimited model size. Network latency overhead.
- Latency: 20–200ms network + 50–500ms compute
- Models: Any size (8B to 400B+)
- Tools: vLLM, TGI, cloud APIs
- Use cases: Complex reasoning, long context, multi-modal, high-quality generation
- Advantage: Update models without app releases
The Hybrid Pattern
The most sophisticated production systems use both. A small on-device model handles immediate interactions (autocomplete, quick classification) while complex requests are routed to cloud models. Apple Intelligence uses this pattern: Siri's quick responses run on the Neural Engine, but complex multi-step tasks route to Apple's "Private Cloud Compute" servers.
User input
|
v
[On-device classifier] --simple--> [On-device 3B model] → 30ms response
|
complex
v
[Cloud API (70B model)] → 500ms response (streamed, feels ~200ms)
Cost Optimization
Real-time AI at scale is expensive. An A100 GPU costs $2–3/hour, and a 70B model needs four of them. At 100 requests/second, you're spending $25,000/month on GPU compute alone. Every optimization technique is also a cost optimization — but some are specifically about reducing spend.
Warning: Streaming Costs More
Streaming doesn't just send tokens earlier — it requires holding the connection open for the entire generation. This means your server handles fewer concurrent requests per GPU (longer-lived connections consume more memory and file descriptors). A non-streaming batch endpoint with aggressive request batching can serve 3–5x more throughput per dollar. Choose streaming for user-facing real-time UX, batch for backend processing pipelines.
Cost Per 1M Output Tokens (March 2026)
Self-hosted costs assume 70% GPU utilization and include amortized hardware costs. Actual costs vary significantly with batching efficiency, request patterns, and spot pricing.
Prompt Compression
Input tokens are the hidden cost driver. A 3,000-token system prompt on every request at 1,000 req/s works out to roughly 260B input tokens per day. LLMLingua (Microsoft, 2023) compresses prompts by 2–5x with minimal quality loss by identifying and removing tokens that contribute least to the model's understanding. Combined with prefix caching, this reduces costs by 60–80%.
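The arithmetic at those rates, with round figures:

```python
# Daily input-token volume for a fixed system prompt
prompt_tokens = 3000
req_per_s = 1000
seconds_per_day = 86400

tokens_per_day = prompt_tokens * req_per_s * seconds_per_day
print(f"{tokens_per_day / 1e9:.0f}B input tokens/day")  # 259B

# LLMLingua at ~4x compression (midpoint of the 2-5x range)
compressed = tokens_per_day / 4
print(f"{compressed / 1e9:.0f}B after compression")     # 65B
```

At typical API input pricing, shaving three quarters of that volume is a substantial line item before prefix caching is even applied.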
Distillation
Train a small model to mimic a large model's behavior on your specific task. Generate 50K examples with GPT-4, fine-tune Llama 8B on them. For narrowly-scoped tasks (classification, extraction, structured output), the distilled model often matches the teacher at 1/100th the cost. The catch: it only works for tasks you can define in advance.
Production Architecture: Putting It All Together
A production real-time AI system combines multiple techniques into a layered defense against latency and cost. Here is the architecture that companies like Perplexity, Vercel, and Cursor use in various forms.
Multi-Layer Serving Architecture
User Request (WebSocket or HTTP)
|
v
[Edge CDN / Load Balancer]
|
v
[Rate Limiter + Auth] Latency budget: 5ms
|
v
[Semantic Cache] ──── hit ────────> Return cached response ~8ms total
|
miss
v
[Query Classifier] Latency budget: 10ms
| | |
simple medium complex
v v v
[Local 8B] [Local 70B] [Cloud API]
~200ms ~800ms ~2000ms
| | |
v v v
[Stream tokens via SSE to client]
|
v
[Async: update semantic cache, log metrics, run safety filter]
Real-Time Audio Pipeline
Voice AI systems like phone agents face the tightest latency constraints: humans perceive delays beyond 300ms as "the other person is slow to respond." The architecture is entirely different from text chat.
Microphone input (16kHz PCM audio chunks, every 100ms)
|
v
[Voice Activity Detection] ~5ms (Silero VAD, on-device)
|
speaking detected
v
[Streaming STT] ~100ms (Whisper Streaming / Deepgram)
| Partial transcripts as user speaks
v
[Intent + Turn Detection] ~20ms (Is user done speaking?)
|
user turn complete
v
[LLM Generation — streaming] ~200ms TTFT (time to first token)
|
first tokens
v
[Streaming TTS] ~80ms (ElevenLabs / Cartesia)
| Start playing before LLM finishes
v
Speaker output
Total perceived latency: ~400ms (user stops → AI starts speaking)
Without streaming: ~3-5 seconds (unusable for conversation)
Key Architecture Insight
The dominant pattern in production real-time AI is pipeline parallelism with streaming handoff. Each stage starts processing as soon as the previous stage produces its first output, not when it finishes. The TTS engine doesn't wait for the LLM to finish generating; it starts synthesizing speech from the first sentence. The client doesn't wait for TTS to finish; it starts playing the first audio chunk.
This is the same principle as CPU instruction pipelining, applied to ML inference. The total latency is the sum of each stage's time-to-first-output, not the sum of each stage's total processing time.
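The arithmetic is worth making explicit. Using the per-stage numbers from the voice pipeline above:

```python
# Perceived latency under pipelining = sum of each stage's time-to-first-output
stages_ttfo_ms = {
    "VAD": 5,            # Silero VAD, on-device
    "STT": 100,          # streaming transcription
    "turn detection": 20,
    "LLM TTFT": 200,     # time to first token
    "TTS": 80,           # first audio chunk
}

perceived = sum(stages_ttfo_ms.values())
print(perceived)  # 405 ms — matches the ~400ms figure above
```

A sequential (non-streaming) pipeline instead pays each stage's full processing time, which for a multi-second LLM generation alone pushes the total into the 3–5 second range quoted above.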
Monitoring: What to Measure
You cannot optimize what you don't measure. Real-time AI systems require different metrics than traditional web services.
Critical Metrics for LLM Serving
- Time to first token (TTFT) — the user's perceived latency. Measures prefill time + queue wait. Target: <500ms for chat, <200ms for autocomplete.
- Tokens per second (TPS) — generation speed. Below 15 TPS, users can read faster than the model writes, creating a frustrating experience.
- p99 latency — the worst 1% of requests. Often 3–10x the median. Usually caused by long inputs, cold GPU cache, or GC pauses. The metric that keeps SREs awake.
- GPU utilization — below 60% means you're wasting money. Above 95% means you're one traffic spike from OOM errors. Sweet spot: 70–85%.
- Cache hit rate — semantic cache effectiveness. Track both exact and semantic hits. If below 10%, the cache infra cost may exceed savings.
Structured Logging for LLM Requests
import time, json, logging

logger = logging.getLogger("llm_serving")

async def instrumented_generate(request):
    t0 = time.perf_counter()
    first_token_time = None
    token_count = 0
    async for token in llm.stream(request):
        if first_token_time is None:
            first_token_time = time.perf_counter()
        token_count += 1
        yield token
    total_time = time.perf_counter() - t0
    ttft = first_token_time - t0 if first_token_time else total_time
    logger.info(json.dumps({
        "ttft_ms": round(ttft * 1000, 1),
        "total_ms": round(total_time * 1000, 1),
        "tokens": token_count,
        "tps": round(token_count / total_time, 1),
        "input_tokens": request.input_token_count,
        "model": request.model,
        "cache_hit": request.cache_hit,
    }))
Key Takeaways
1. Streaming is non-negotiable for user-facing AI — it reduces perceived latency by 50–75% even though total generation time stays the same. Use SSE for text, WebSockets for bidirectional audio.
2. vLLM's PagedAttention was the LLM serving breakthrough — borrowing OS virtual memory concepts to achieve 24x throughput over naive inference. If you self-host, your choice of serving framework matters more than your choice of model.
3. Speculative decoding gives 2–3x speedup for free — use a small draft model to generate candidates, verify in bulk with the target model. Same output distribution, dramatically lower latency.
4. Layer your defenses: cache, route, quantize, stream — production systems combine semantic caching (8ms hits), intelligent routing (60% to small models), quantization (2–3x speedup), and streaming (perceived ~200ms). No single technique is sufficient.
5. Measure TTFT and p99, not just median latency — your worst 1% of requests define user experience more than your average. A system with 200ms median but 5s p99 feels broken 1% of the time, which at scale means thousands of frustrated users per day.