Level 4: Advanced (~45 min)

Real-time AI Systems

From batch inference to sub-100ms responses. The engineering that makes AI feel instant.

The Long Road from Batch to Real-Time

Real-time AI didn't arrive in a single breakthrough. It is the product of three decades of converging advances in hardware, serving infrastructure, and algorithmic efficiency — each generation chipping away at the latency that separates "useful" from "magical."

Understanding this evolution is essential. The techniques you choose for a 500ms conversational system are fundamentally different from a 10ms gaming loop, and both are different from a 100ms interactive search. The history explains why.

Era I: Batch Processing
1990s–2000s

The Offline Era

Early ML systems operated in pure batch mode. You collected data, trained a model overnight (or over weeks), and deployed a static artifact. Inference happened in bulk — process a CSV of 10,000 inputs, write results to a file, hand them to a business analyst. "Real-time" meant "we ran the batch job last night instead of last week."

Netflix's original recommendation system (2006) recomputed suggestions once per day. Google's PageRank was a batch job over the entire web graph. Even spam filters at Gmail were periodically retrained and deployed as static classifiers. The latency budget was hours, not milliseconds.

2010–2016

Request/Response Inference

TensorFlow Serving (2016) and Clipper (2017, UC Berkeley) introduced the idea of ML models behind REST APIs. For the first time, you could send a single input and get a prediction back in real time — classification in 5–50ms, object detection in 100–300ms. But these were small models: ResNet-50 has 25M parameters. The idea of serving a billion-parameter model per request was absurd.

"The cost of a single forward pass must be small enough that you can afford to do it on every user request. This constraint shaped a decade of production ML."

Crankshaw, D. et al. (2017). Clipper: A Low-Latency Online Prediction Serving System. NSDI.

Era II: GPU Serving Revolution
2019

TensorRT & ONNX Runtime: Compiler-Level Optimization

NVIDIA's TensorRT treated inference as a compilation problem. Take a trained model, fuse operations (convolution + batch norm + ReLU into a single kernel), select optimal GPU kernels for each layer, quantize weights from FP32 to FP16 or INT8, and output a binary optimized for the specific GPU you're deploying to.

Results were dramatic: 2–6x speedup on the same hardware with negligible accuracy loss. Microsoft's ONNX Runtime brought similar optimizations across hardware backends (CPU, GPU, edge devices). For the first time, a BERT model could run inference in under 10ms on a V100 GPU — fast enough for search autocomplete.

2020

NVIDIA Triton: Multi-Model Serving

Triton Inference Server solved the deployment problem: serve multiple models (TensorRT, ONNX, PyTorch, TensorFlow) behind a single endpoint with dynamic batching, model pipelining, and GPU memory management. It could pack multiple models onto one GPU, batch incoming requests transparently, and route to the right backend. Production ML teams finally had a serving infrastructure that matched the sophistication of web application servers.
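Dynamic batching is easy to grasp as pseudocode: hold incoming requests in a queue, and flush either when the batch fills or when the oldest request has waited too long. Below is a toy sketch of the idea only — Triton configures this declaratively (e.g. a maximum queue delay in the model config) rather than in Python, and the class and parameter names here are illustrative:

```python
import time
from collections import deque

class DynamicBatcher:
    """Toy dynamic batcher: flush when the batch is full OR the
    oldest queued request has waited longer than max_wait_ms."""

    def __init__(self, max_batch=8, max_wait_ms=5.0):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue = deque()  # (arrival_time, request) pairs

    def submit(self, request):
        self.queue.append((time.monotonic(), request))

    def maybe_flush(self):
        """Return a batch worth running now, or None to keep waiting."""
        if not self.queue:
            return None
        full = len(self.queue) >= self.max_batch
        stale = time.monotonic() - self.queue[0][0] >= self.max_wait_s
        if not (full or stale):
            return None
        batch = [req for _, req in list(self.queue)[: self.max_batch]]
        for _ in batch:
            self.queue.popleft()
        return batch

batcher = DynamicBatcher(max_batch=4, max_wait_ms=5.0)
for i in range(4):
    batcher.submit(f"req-{i}")
batch = batcher.maybe_flush()  # queue is full, so it flushes immediately
```

The latency/throughput trade-off lives in the wait limit: a larger value builds bigger, more GPU-efficient batches at the cost of added queueing delay for the first request in each batch.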

2022

FlashAttention: The Memory Wall Breakthrough

Tri Dao et al. at Stanford identified the real bottleneck in transformer inference: not compute, but memory bandwidth. Standard attention writes enormous intermediate matrices to GPU HBM (High Bandwidth Memory), then reads them back. FlashAttention fused the entire attention computation into a single GPU kernel that kept everything in fast SRAM, never materializing the full attention matrix.

The impact: 2–4x speedup on attention computation, 5–20x memory reduction, and the ability to handle context lengths that were previously impossible. FlashAttention-2 (2023) pushed this further with better work partitioning across GPU thread blocks, reaching close to the theoretical maximum memory bandwidth utilization.

Era III: The LLM Serving Crisis
June 2023

vLLM & PagedAttention

Woosuk Kwon et al. at UC Berkeley identified the core problem of LLM serving: the KV (key-value) cache. During autoregressive generation, each token's attention keys and values must be stored for all subsequent tokens. For a 13B model with a 2048-token sequence, the KV cache alone consumes 1.7 GB per request. With naive contiguous allocation, memory fragmentation meant an A100 could serve only 2–3 concurrent requests.

PagedAttention borrowed virtual memory paging from operating systems: split the KV cache into fixed-size blocks, allocate them on-demand from a shared pool, and use a page table to map logical positions to physical GPU memory. The result: near-zero memory waste and the ability to serve 10–24x more concurrent requests.
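The bookkeeping behind that idea fits in a few lines of Python. A toy sketch — real vLLM does this with GPU tensors and custom CUDA kernels, and the class and method names here are illustrative:

```python
class PagedKVCache:
    """Toy sketch of PagedAttention bookkeeping: KV storage is split
    into fixed-size blocks allocated on demand from a shared pool,
    with a per-sequence block table mapping logical token positions
    to physical blocks."""

    def __init__(self, num_blocks=1024, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}                      # seq_id -> [block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a KV slot for one new token; a block is allocated
        only when the sequence crosses a block boundary."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        """Map a logical token position to (physical block, offset)."""
        block = self.block_tables[seq_id][pos // self.block_size]
        return block, pos % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):          # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token("seq-A")
```

Because blocks come from a shared pool and the only waste is the unfilled tail of each sequence's last block, utilization stays near 100% no matter how unevenly sequence lengths vary across requests.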

2023

Speculative Decoding: Breaking the Sequential Bottleneck

Autoregressive LLM generation is inherently sequential: each token depends on the previous one. Speculative decoding (proposed independently by Leviathan et al. at Google and Chen et al. at DeepMind) breaks this constraint with a clever trick: use a small, fast "draft" model to generate K tokens speculatively, then verify all K tokens in a single forward pass of the large "target" model.

Because the large model can verify K tokens in parallel (the cost of processing K tokens is nearly the same as processing 1 due to GPU parallelism), accepted tokens come essentially for free. With a good draft model, acceptance rates of 70–90% are common, yielding 2–3x speedup with mathematically identical output distribution.

# Speculative decoding — conceptual flow
draft_model = load("llama-68m")     # Tiny, fast: ~2ms/token
target_model = load("llama-70b")    # Large, slow: ~40ms/token

def speculative_decode(prompt_tokens, K=5):
    draft_tokens = []
    for _ in range(K):
        token = draft_model.generate_one(prompt_tokens + draft_tokens)
        draft_tokens.append(token)  # ~2ms each = 10ms total

    # Verify ALL K tokens in ONE forward pass of target model (~45ms)
    accepted = target_model.verify(prompt_tokens, draft_tokens)
    # If 4 of 5 accepted: generated 4 tokens in ~55ms instead of ~160ms
    return accepted
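A useful rule of thumb: under the simplifying assumption (used in the Leviathan et al. analysis) that each draft token is accepted independently with probability α, the expected number of tokens produced per target-model pass is (1 − α^(K+1)) / (1 − α):

```python
def expected_tokens_per_pass(alpha: float, K: int) -> float:
    """Expected tokens generated per target forward pass when each
    of K draft tokens is accepted i.i.d. with probability alpha:
    (1 - alpha**(K+1)) / (1 - alpha)."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

e = expected_tokens_per_pass(alpha=0.8, K=5)  # ~3.69 tokens per pass
```

At an illustrative 40ms per target pass plus ~10ms of drafting, that is roughly 3.7 tokens per 50ms versus 1 token per 40ms without speculation, which is where the 2–3x figures come from.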

Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. ICML.
Chen, C. et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv.

2024–present

The Optimization Arms Race

The field has exploded with complementary optimizations, each attacking a different bottleneck:

SGLang

RadixAttention for automatic KV cache reuse across requests sharing common prefixes. 3–5x throughput on multi-turn conversations.

Medusa

Multiple decoding heads predict several future tokens simultaneously. No draft model needed. 2x speedup.

Continuous Batching

Instead of waiting for all requests in a batch to finish, insert new requests as slots free up. Near-100% GPU utilization.

Prefix Caching

Cache the KV states for common system prompts. Skip recomputation for every request sharing the same prefix.

The throughline: 1990s to now

Each generation solved the previous generation's bottleneck:

1990s–2000s   Problem: No serving infrastructure. ML is a batch job.
2016–2019     Solved: Model serving exists. New bottleneck: compute per request too high for large models.
2019–2022     Solved: Compiler optimizations (TensorRT, FlashAttention). New bottleneck: KV cache memory for LLMs.
2023–now      Solved: PagedAttention, speculative decoding. New bottleneck: cost at scale.

The Latency Budget

"Real-time" is not a single number. Different applications have fundamentally different latency requirements, and each demands a different architecture.

Latency Tiers in Production AI

Gaming / robotics (physics, NPC behavior)            <16ms
Search autocomplete (keystroke-level)                <50ms
Interactive UI (classification, embedding lookup)    50–200ms
Conversational AI (first token of LLM response)      200–500ms
Document processing (summarization, extraction)      1–10s

Where Latency Comes From

A typical LLM API call decomposes into these sequential stages:

Network round-trip (user to API)               20–200ms
Queue wait (GPU busy with other requests)      0–5,000ms
Prefill (process all input tokens)             50–500ms
Decode (generate output tokens, one by one)    15–80ms/token
Post-processing (safety filters, formatting)   1–20ms
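These stages compose additively, and only decode scales with output length — which is why output token count dominates total latency. A back-of-envelope model using illustrative midpoints from the stage list above:

```python
def total_latency_ms(output_tokens,
                     network_ms=100, queue_ms=50, prefill_ms=200,
                     per_token_ms=40, postprocess_ms=10):
    """Rough end-to-end latency model for one LLM API call.
    Defaults are illustrative midpoints, not measurements."""
    return (network_ms + queue_ms + prefill_ms
            + output_tokens * per_token_ms + postprocess_ms)

short = total_latency_ms(output_tokens=20)    # 360 fixed + 800 decode
long = total_latency_ms(output_tokens=500)    # 360 fixed + 20,000 decode
```

At 40ms/token, going from 20 to 500 output tokens moves total latency from about a second to over twenty seconds, while every other stage is unchanged.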

The Prefill vs Decode Distinction Matters

Prefill is compute-bound: processing N input tokens in parallel using matrix multiplications that fully saturate the GPU. Decode is memory-bound: generating one token at a time, where the bottleneck is reading model weights and the KV cache from GPU memory. These are fundamentally different workloads.

This is why a 70B model can process 4,000 input tokens in the same time it takes to generate 50 output tokens. Optimization strategies differ: prefill benefits from tensor parallelism across GPUs, while decode benefits from smaller model sizes (quantization) and speculative methods.
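The memory-bound claim is easy to sanity-check: each decode step must stream every weight byte through the memory bus, so bandwidth divided by model size upper-bounds single-sequence decode speed. A sketch assuming ~2 TB/s of HBM bandwidth (roughly an A100; all numbers illustrative):

```python
def max_decode_tps(param_count_b, bytes_per_param, bandwidth_gbs=2000):
    """Upper bound on single-sequence decode tokens/sec: each decode
    step reads all weights once, so tps <= bandwidth / model_bytes."""
    model_gb = param_count_b * bytes_per_param
    return bandwidth_gbs / model_gb

fp16_70b = max_decode_tps(70, 2)    # 2000 / 140 GB ~ 14 tok/s ceiling
int4_70b = max_decode_tps(70, 0.5)  # 2000 / 35 GB  ~ 57 tok/s ceiling
```

This bound also explains why quantization speeds up decode: halving the bytes per weight roughly doubles the ceiling, consistent with the FP16-vs-INT4 throughput figures in the quantization section of this page.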

Streaming: The Most Important UX Optimization

Streaming is the single most impactful technique in real-time AI — not because it makes generation faster, but because it makes it feel faster. The total time to generate a 500-token response is identical whether you stream or not. But the user sees the first token in 200ms instead of waiting 15 seconds for the complete response.

This is not just a nice-to-have. Research from Microsoft (2023) showed that streaming reduces perceived wait time by 50–75% and increases user satisfaction scores by 20–30% compared to buffered responses of the same content.

Without Streaming

User stares at a spinner for 15 seconds. Entire response appears at once. Feels broken.

[----------- 15,000ms -----------] Full response

With Streaming

First token at 200ms. User reads as tokens arrive. Feels conversational.

[200ms] The [+] key [+] insight [+] is [+] ...

Streaming Architecture with Latency Budgets

Input Stream (audio / text chunks)
     |
     v
[Preprocessing]     VAD / tokenize      ~5ms
     |
     v
[LLM Inference]     prefill + decode    ~200ms TTFT
     |
     v
[Post-process]      TTS / format       ~80ms
     |
     v
Output Stream (tokens / audio chunks)

Total TTFT budget: ~285ms (pipeline parallel)
Each stage starts as soon as the previous emits first output -- not when it finishes

Batch vs Streaming Processing

Batch Processing
  Collect all input data → Process entire batch → Respond all at once
  0ms [---------- user waits 15s ----------] 15,000ms
  Perceived latency: 15 seconds
  Higher throughput per dollar
  Better for backend / bulk processing

Streaming Processing
  Process as data arrives → Emit tokens incrementally → Done (same total)
  0ms [200ms first token!] ---------------- 15,000ms
  Perceived latency: 200ms
  Same total generation time
  Essential for user-facing real-time UX

Server-Sent Events (SSE) — The Standard Protocol

SSE is the dominant protocol for LLM streaming. Unlike WebSockets (bidirectional), SSE is unidirectional (server to client), simpler to implement, works through CDNs and proxies, and auto-reconnects. OpenAI, Anthropic, and Google all use SSE for their streaming APIs.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel
import json, time

app = FastAPI()
client = OpenAI()

class ChatRequest(BaseModel):
    prompt: str  # request body: {"prompt": "..."}

@app.post("/chat/stream")
async def stream_chat(req: ChatRequest):
    """Stream LLM tokens via Server-Sent Events."""
    async def generate():
        start = time.time()
        stream = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": req.prompt}],
            stream=True,
            stream_options={"include_usage": True},
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                # SSE format: "data: <json>\n\n"
                payload = {
                    "token": chunk.choices[0].delta.content,
                    "latency_ms": round((time.time() - start) * 1000),
                }
                yield f"data: {json.dumps(payload)}\n\n"

            # Final chunk includes usage stats
            if chunk.usage:
                yield f"data: {json.dumps({'usage': chunk.usage.model_dump()})}\n\n"

        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # Disable nginx buffering
        },
    )

Client-Side: Consuming the Stream

// Browser: consuming SSE with fetch + ReadableStream
async function streamChat(prompt: string, onToken: (t: string) => void) {
  const response = await fetch("/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n\n");
    buffer = lines.pop()!; // Keep incomplete chunk

    for (const line of lines) {
      const data = line.replace("data: ", "");
      if (data === "[DONE]") return;
      const { token } = JSON.parse(data);
      if (token) onToken(token); // Append to UI
    }
  }
}

// Usage: tokens appear character by character
streamChat("Explain transformers", (token) => {
  document.getElementById("output")!.textContent += token;
});

When to Use WebSockets Instead

WebSockets add complexity but are necessary when you need bidirectional real-time communication — real-time voice (user speaks while AI responds), collaborative editing, or streaming both audio input and text output simultaneously.

# WebSocket server for real-time voice + text
import asyncio
import json
import websockets

async def handle_session(websocket):
    async for message in websocket:
        msg = json.loads(message)

        if msg["type"] == "audio_chunk":
            # Process audio in real-time (STT)
            text = await transcribe(msg["data"])
            await websocket.send(json.dumps({
                "type": "transcript", "text": text
            }))

        elif msg["type"] == "generate":
            # Stream LLM response back
            async for token in generate_stream(msg["prompt"]):
                await websocket.send(json.dumps({
                    "type": "token", "text": token
                }))

async def main():
    async with websockets.serve(handle_session, "localhost", 8765):
        await asyncio.Future()  # serve until cancelled

Latency Optimization: The Full Toolkit

Streaming hides latency. The techniques below actually reduce it. Each targets a different bottleneck; real production systems combine several.


Quantization

Reduce model weight precision from 16-bit floats to 8-bit or 4-bit integers. Since LLM decoding is memory-bandwidth-bound (reading weights from GPU memory), smaller weights mean faster reads. A 4-bit quantized 70B model fits on a single A100 (80GB) and generates tokens 2–3x faster than its FP16 counterpart on two GPUs.

# GPTQ quantization: 70B model in 4-bit
Model size:   FP16 = 140 GB  →  INT4 = 35 GB  (4x smaller)
Token/s:      FP16 = 12 t/s  →  INT4 = 35 t/s (2.9x faster)
Quality:      MMLU 69.8       →  MMLU 69.1     (0.7pt loss)

Modern quantization methods (GPTQ, AWQ, GGUF) are sophisticated enough that 4-bit models lose less than 1% quality on most benchmarks. The real risk is on edge cases and long-tail tasks where the quality degradation concentrates.


Semantic Caching

Exact-match caching misses almost everything — users rarely phrase questions identically. Semantic caching embeds the query, searches a vector store for similar past queries, and returns the cached response if similarity exceeds a threshold. Cache hits return in 5–10ms instead of 2–15 seconds.

# Semantic cache with cosine similarity
# (embed, vector_search, cache_store are app-specific helpers,
#  e.g. an embedding model plus Redis with a vector index)

THRESHOLD = 0.92  # Tune: higher = fewer false hits

def cached_generate(query: str):
    query_emb = embed(query)

    # Search cache (Redis + vector index, ~3ms)
    cached = vector_search(query_emb, top_k=1)
    if cached and cached.score > THRESHOLD:
        return cached.response  # Cache hit: 5ms total

    # Cache miss: generate normally
    response = llm.generate(query)   # 2-15s
    cache_store(query_emb, response) # Background write
    return response

In production, semantic caches typically achieve 15–40% hit rates on customer support workloads (many users ask similar questions) but under 5% on creative or code generation tasks. The hit rate determines whether the infrastructure cost is justified.
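The hit-rate threshold falls out of simple expected-value arithmetic: every request pays the vector lookup, and only hits skip generation. A sketch with illustrative numbers:

```python
def expected_latency_ms(hit_rate, hit_ms=8, miss_ms=5000, lookup_ms=3):
    """Expected per-request latency with a semantic cache in front of
    the LLM. Every request pays the lookup; hits skip generation.
    All numbers are illustrative."""
    return lookup_ms + hit_rate * hit_ms + (1 - hit_rate) * miss_ms

support = expected_latency_ms(0.30)  # support-style workload, 30% hits
codegen = expected_latency_ms(0.04)  # codegen-style workload, 4% hits
```

At a 30% hit rate, expected latency drops by roughly 30% versus no cache; at 4%, the cache shaves off almost nothing while still adding lookup latency and infrastructure cost to every request.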


Intelligent Model Routing

Not every query needs GPT-4. A classifier (itself a small, fast model) examines the query and routes to the appropriate tier. Simple factual questions go to a 7B model (20ms), complex reasoning to a 70B model (500ms), and only truly hard problems to a frontier API (2s).

# Routing with a lightweight classifier
def route_query(query: str) -> str:
    complexity = classifier.predict(query)  # ~5ms
    if complexity == "simple":
        return "llama-8b"    # Local, fast, cheap
    elif complexity == "medium":
        return "llama-70b"   # Local, moderate
    else:
        return "claude-opus"  # API, expensive, best quality

# Result: 60% of queries → 8B (fast + cheap)
#         30% of queries → 70B (moderate)
#         10% of queries → API (slow + expensive)
# Average latency drops 3x, cost drops 5x

KV Cache & Prefix Optimization

If every request starts with the same 2,000-token system prompt, you're recomputing the same KV cache every time. Prefix caching computes it once and reuses it across requests, eliminating 50–80% of prefill latency for structured applications.

# vLLM prefix caching
# Without: every request processes full prompt
# Request 1: [system prompt 2000 tok] + [user msg 50 tok] → prefill 2050 tok
# Request 2: [system prompt 2000 tok] + [user msg 80 tok] → prefill 2080 tok

# With prefix caching:
# Request 1: [system prompt 2000 tok → CACHED] + [user msg 50 tok] → prefill 50 tok
# Request 2: [cache hit!] + [user msg 80 tok] → prefill 80 tok
# Prefill latency: 400ms → 20ms (20x improvement)

GPU Serving at Scale: vLLM vs TGI

If you're self-hosting models, the choice of inference server matters more than the choice of model. A well-optimized server can extract 10–24x more throughput from the same hardware. Two frameworks dominate production: vLLM (UC Berkeley) and TGI (HuggingFace).

vLLM

PagedAttention for near-zero memory waste. The throughput king for high-concurrency workloads.

  • 24x throughput vs naive HuggingFace
  • Continuous batching
  • Tensor & pipeline parallelism
  • OpenAI-compatible API
  • Prefix caching built-in
  • Speculative decoding support

Text Generation Inference (TGI)

HuggingFace's production server. Tight HF ecosystem integration, battle-tested at scale.

  • Flash Attention 2
  • GPTQ / AWQ quantization
  • Speculative decoding (Medusa)
  • Built-in metrics (Prometheus)
  • Watermarking support
  • Grammar-constrained generation

vLLM: Production Deployment

# Start vLLM with production settings:
#   tensor-parallel-size 4     -> shard across 4 GPUs
#   max-model-len 8192         -> max context length
#   enable-prefix-caching      -> reuse KV for shared prefixes
#   quantization awq           -> 4-bit quantization
#   gpu-memory-utilization 0.9 -> use 90% of GPU memory
#   max-num-seqs 256           -> max concurrent sequences
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --enable-prefix-caching \
    --quantization awq \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256 \
    --port 8000

# Then use with any OpenAI-compatible client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Streaming response
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention"}],
    stream=True,
    max_tokens=512,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

TGI: Docker Deployment

# TGI with Docker — production-ready in one command
docker run --gpus all -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-70B-Instruct \
    --num-shard 4 \
    --quantize gptq \
    --max-input-tokens 4096 \
    --max-total-tokens 8192 \
    --max-batch-prefill-tokens 16384

# TGI client
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Streaming
for token in client.text_generation(
    "Explain continuous batching",
    max_new_tokens=256,
    stream=True,
):
    print(token, end="", flush=True)

Architecture Decision

Edge vs Cloud Inference

The most impactful latency reduction isn't algorithmic — it's eliminating the network round-trip entirely by running inference on the user's device. With quantized models running on Apple Silicon, Qualcomm NPUs, and even WebGPU in the browser, edge AI has become viable for a growing set of use cases.

Edge Inference

Model runs on the user's device. Zero network latency. Full privacy.

  • Latency: 0ms network + 10–100ms compute
  • Models: 1–8B quantized (GGUF, CoreML)
  • Tools: llama.cpp, MLX, ONNX Runtime Mobile
  • Use cases: Autocomplete, on-device search, voice commands, privacy-sensitive medical/legal
  • Limitation: Model size capped by device memory

Cloud Inference

Model runs on GPU servers. Unlimited model size. Network latency overhead.

  • Latency: 20–200ms network + 50–500ms compute
  • Models: Any size (8B to 400B+)
  • Tools: vLLM, TGI, cloud APIs
  • Use cases: Complex reasoning, long context, multi-modal, high-quality generation
  • Advantage: Update models without app releases

The Hybrid Pattern

The most sophisticated production systems use both. A small on-device model handles immediate interactions (autocomplete, quick classification) while complex requests are routed to cloud models. Apple Intelligence uses this pattern: Siri's quick responses run on the Neural Engine, but complex multi-step tasks route to Apple's "Private Cloud Compute" servers.

User input
     |
     v
[On-device classifier] --simple--> [On-device 3B model] → 30ms response
     |
     complex
     v
[Cloud API (70B model)] → 500ms response (streamed, feels ~200ms)

Cost Optimization

Real-time AI at scale is expensive. An A100 GPU costs $2–3/hour, and a 70B model needs four of them. At 100 requests/second, you're spending $25,000/month on GPU compute alone. Every optimization technique is also a cost optimization — but some are specifically about reducing spend.

Warning: Streaming Costs More

Streaming doesn't just send tokens earlier — it requires holding the connection open for the entire generation. This means your server handles fewer concurrent requests per GPU (longer-lived connections consume more memory and file descriptors). A non-streaming batch endpoint with aggressive request batching can serve 3–5x more throughput per dollar. Choose streaming for user-facing real-time UX, batch for backend processing pipelines.

Cost Per 1M Output Tokens (March 2026)

Claude Opus 4 (API)                               $75.00
GPT-4o (API)                                      $15.00
Claude Sonnet 4 / GPT-4o-mini (API)               $3.00–4.00
Llama 3.1 70B (self-hosted, 4x A100)              $0.40–0.80
Llama 3.1 8B quantized (self-hosted, 1x A100)     $0.03–0.08

Self-hosted costs assume 70% GPU utilization and include amortized hardware costs. Actual costs vary significantly with batching efficiency, request patterns, and spot pricing.


Prompt Compression

Input tokens are the hidden cost driver. A 3,000-token system prompt on every request at 1,000 req/s is roughly 260B input tokens/day. LLMLingua (Microsoft, 2023) compresses prompts by 2–5x with minimal quality loss by identifying and removing tokens that contribute least to the model's understanding. Combined with prefix caching, this reduces costs by 60–80%.
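The arithmetic is worth writing out, with the effect of a 4x compression ratio (illustrative) included:

```python
PROMPT_TOKENS = 3_000     # system prompt length
REQ_PER_SEC = 1_000
SECONDS_PER_DAY = 86_400
COMPRESSION = 4           # LLMLingua-style compression, illustrative

daily_tokens = PROMPT_TOKENS * REQ_PER_SEC * SECONDS_PER_DAY
daily_billions = daily_tokens / 1e9                      # ~259.2B/day
saved_billions = daily_billions * (1 - 1 / COMPRESSION)  # ~194.4B/day saved
```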


Distillation

Train a small model to mimic a large model's behavior on your specific task. Generate 50K examples with GPT-4, fine-tune Llama 8B on them. For narrowly-scoped tasks (classification, extraction, structured output), the distilled model often matches the teacher at 1/100th the cost. The catch: it only works for tasks you can define in advance.

Production Architecture: Putting It All Together

A production real-time AI system combines multiple techniques into a layered defense against latency and cost. Here is the architecture that companies like Perplexity, Vercel, and Cursor use in various forms.

Multi-Layer Serving Architecture

User Request (WebSocket or HTTP)
     |
     v
[Edge CDN / Load Balancer]
     |
     v
[Rate Limiter + Auth]              Latency budget: 5ms
     |
     v
[Semantic Cache] ──── hit ────────> Return cached response     ~8ms total
     |
     miss
     v
[Query Classifier]                  Latency budget: 10ms
     |              |           |
     simple         medium      complex
     v              v           v
[Local 8B]      [Local 70B]    [Cloud API]
  ~200ms          ~800ms        ~2000ms
     |              |           |
     v              v           v
[Stream tokens via SSE to client]
     |
     v
[Async: update semantic cache, log metrics, run safety filter]

Real-Time Audio Pipeline

Voice AI systems like phone agents face the tightest latency constraints: humans perceive delays beyond 300ms as "the other person is slow to respond." The architecture is entirely different from text chat.

Microphone input (16kHz PCM audio chunks, every 100ms)
     |
     v
[Voice Activity Detection]       ~5ms    (Silero VAD, on-device)
     |
     speaking detected
     v
[Streaming STT]                  ~100ms  (Whisper Streaming / Deepgram)
     |                                    Partial transcripts as user speaks
     v
[Intent + Turn Detection]        ~20ms   (Is user done speaking?)
     |
     user turn complete
     v
[LLM Generation — streaming]     ~200ms  TTFT (time to first token)
     |
     first tokens
     v
[Streaming TTS]                  ~80ms   (ElevenLabs / Cartesia)
     |                                    Start playing before LLM finishes
     v
Speaker output

Total perceived latency: ~400ms (user stops → AI starts speaking)
Without streaming: ~3-5 seconds (unusable for conversation)

Key Architecture Insight

The dominant pattern in production real-time AI is pipeline parallelism with streaming handoff. Each stage starts processing as soon as the previous stage produces its first output, not when it finishes. The TTS engine doesn't wait for the LLM to finish generating; it starts synthesizing speech from the first sentence. The client doesn't wait for TTS to finish; it starts playing the first audio chunk.

This is the same principle as CPU instruction pipelining, applied to ML inference. The total latency is the sum of each stage's time-to-first-output, not the sum of each stage's total processing time.
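That principle is worth making concrete. With illustrative per-stage numbers in the spirit of the voice pipeline above, perceived latency is the sum of each stage's time-to-first-output, while a fully sequential pipeline pays the sum of totals:

```python
# (time_to_first_output_ms, total_processing_ms) per stage -- illustrative
stages = {
    "vad": (5, 5),
    "stt": (100, 1500),
    "llm": (200, 3000),
    "tts": (80, 2500),
}

pipelined = sum(ttfo for ttfo, _ in stages.values())     # 385 ms perceived
sequential = sum(total for _, total in stages.values())  # 7005 ms perceived
```

Roughly an 18x difference in perceived latency, from the same four stages doing the same total work.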

Monitoring: What to Measure

You cannot optimize what you don't measure. Real-time AI systems require different metrics than traditional web services.

Critical Metrics for LLM Serving

TTFT (Time to First Token)

The user's perceived latency. Measures prefill time + queue wait. Target: <500ms for chat, <200ms for autocomplete.

TPS (Tokens Per Second)

Generation speed. Below 15 TPS, users can read faster than the model writes, creating a frustrating experience.

p99 Latency

The worst 1% of requests. Often 3–10x the median. Usually caused by long inputs, cold GPU cache, or GC pauses. The metric that keeps SREs awake.

GPU Utilization & Memory

Below 60% utilization means you're wasting money. Above 95% means you're one traffic spike from OOM errors. Sweet spot: 70–85%.

Cache Hit Rate

Semantic cache effectiveness. Track both exact and semantic hits. If below 10%, the cache infra cost may exceed savings.
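Computing these percentiles needs nothing beyond the stdlib; the operational point is to compute them over a rolling window of recent requests rather than the process lifetime. A minimal sketch with illustrative latency samples:

```python
import statistics

# Latency samples (ms) from one monitoring window -- illustrative values
latencies_ms = [120, 135, 140, 150, 160, 180, 210, 240, 900, 4800]

p50 = statistics.median(latencies_ms)
# quantiles(n=100) returns the 1st..99th percentile cut points
p99 = statistics.quantiles(latencies_ms, n=100)[98]
# A healthy median can hide a pathological tail: here p99 is many
# times p50, driven entirely by the two slow outliers.
```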

Structured Logging for LLM Requests

import time, json, logging

logger = logging.getLogger("llm_serving")

async def instrumented_generate(request):
    t0 = time.perf_counter()
    first_token_time = None
    token_count = 0

    async for token in llm.stream(request):
        if first_token_time is None:
            first_token_time = time.perf_counter()
        token_count += 1
        yield token

    total_time = time.perf_counter() - t0
    ttft = first_token_time - t0 if first_token_time else total_time

    logger.info(json.dumps({
        "ttft_ms": round(ttft * 1000, 1),
        "total_ms": round(total_time * 1000, 1),
        "tokens": token_count,
        "tps": round(token_count / total_time, 1),
        "input_tokens": request.input_token_count,
        "model": request.model,
        "cache_hit": request.cache_hit,
    }))

Key Takeaways

  1. Streaming is non-negotiable for user-facing AI — it reduces perceived latency by 50–75% even though total generation time stays the same. Use SSE for text, WebSockets for bidirectional audio.

  2. vLLM's PagedAttention was the LLM serving breakthrough — borrowing OS virtual memory concepts to achieve 24x throughput over naive inference. If you self-host, your choice of serving framework matters more than your choice of model.

  3. Speculative decoding gives 2–3x speedup for free — use a small draft model to generate candidates, verify in bulk with the target model. Same output distribution, dramatically lower latency.

  4. Layer your defenses: cache, route, quantize, stream — production systems combine semantic caching (8ms hits), intelligent routing (60% to small models), quantization (2–3x speedup), and streaming (perceived ~200ms). No single technique is sufficient.

  5. Measure TTFT and p99, not just median latency — your worst 1% of requests define user experience more than your average. A system with 200ms median but 5s p99 feels broken 1% of the time, which at scale means thousands of frustrated users per day.

Further Reading

Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. — The vLLM paper. Essential reading for understanding modern LLM serving.
Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention. — IO-aware attention that changed how we think about GPU memory hierarchy.
Leviathan, Y. et al. (2023). Fast Inference from Transformers via Speculative Decoding. — The speculative decoding technique that enables 2–3x generation speedup.
Zheng, L. et al. (2024). SGLang: Efficient Execution of Structured Language Model Programs. — RadixAttention for automatic KV cache sharing across structured LLM calls.

Level 4 Complete

You've completed the Advanced level. You now understand multi-modal RAG, agent architectures, video understanding, and production real-time systems.
