APIs vs Local Models
The first infrastructure decision you will make. Every AI application needs inference — but where that inference runs determines your cost, latency, privacy, and control.
70 Years of "Where Does the Compute Live?"
The API-vs-local debate is not new. It is the latest incarnation of computing's oldest architectural question: should processing happen centrally or at the edge? Every generation of computing has oscillated between these poles, and understanding the pattern is the fastest way to see where AI inference is headed.
The pendulum swings because the economics change. When bandwidth is cheap and hardware is expensive, centralize. When hardware is cheap and latency matters, decentralize. AI is currently in the middle of a swing.
Mainframe Time-Sharing
At MIT, Fernando Corbató built CTSS (Compatible Time-Sharing System) in 1961 — the first system where multiple users shared a single expensive computer via dumb terminals. An IBM 7094 cost $3.5 million (roughly $35M adjusted for inflation). Nobody could afford their own. The terminal sent keystrokes; the mainframe did everything.
"The key idea was that the machine was so expensive, it was more economical to have several people using it simultaneously than to have it sitting idle between jobs."
— Fernando Corbató, who received the 1990 Turing Award for this work on time-sharing.
This is exactly the API model: expensive centralized hardware, thin clients, pay for what you use. Today's POST api.openai.com/v1/chat/completions is structurally identical to a 1965 terminal sending a batch job to an IBM mainframe.
The PC Revolution: Compute Goes Local
The IBM PC (1981) and its clones put real compute on every desk. Suddenly you didn't need the mainframe for word processing, spreadsheets, or databases. Hardware got cheap enough to decentralize. By 1995, a $2,000 PC had more raw power than a 1975 mainframe. The pendulum swung to local — and stayed there for two decades.
AWS EC2 and the Cloud Era
Amazon launched Elastic Compute Cloud (EC2) in 2006. The mainframe model returned, rebranded: rent compute by the hour instead of buying servers. Amazon turned its internal infrastructure expertise into a business. By 2024, AWS, Azure, and GCP collectively generated over $200B in annual revenue — the economics of centralization at unprecedented scale.
AWS Lambda: Pay Per Invocation
Serverless computing arrived with AWS Lambda in 2014. Don't rent a server — just pay per function call. This is the direct ancestor of AI API pricing: you pay per token, not per hour. The compute exists somewhere in a data center; you don't care where.
OpenAI Launches the GPT-3 API
In June 2020, OpenAI released API access to GPT-3 — a 175-billion-parameter model far beyond any consumer hardware. Training cost an estimated $4.6M in compute alone. The weights were never released. If you wanted GPT-3, you used the API. Period.
This established the modern AI API paradigm: frontier models are too large and too expensive to run yourself, so you pay the provider per token. Within two years, Anthropic (Claude), Cohere, Google (PaLM/Gemini), and dozens of others launched competing APIs.
Meta Releases LLaMA
In February 2023, Hugo Touvron and colleagues at Meta AI released LLaMA (Large Language Model Meta AI) — a family of models from 7B to 65B parameters that matched or exceeded GPT-3's performance while being small enough to run on consumer hardware. The weights leaked within a week. Within a month, the open-source community had fine-tuned variants running on MacBooks.
"LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B."
— Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
llama.cpp & the Quantization Breakthrough
In March 2023, Georgi Gerganov released llama.cpp — a pure C/C++ implementation of LLaMA inference that ran on CPUs without GPUs. The key innovation was aggressive quantization: converting 16-bit floating-point weights to 4-bit integers, shrinking a 7B model from 14GB to 3.5GB with minimal quality loss.
Simultaneously, GPTQ (Frantar et al., 2022) and GGML/GGUF (Gerganov, 2023) formats emerged for GPU and CPU quantization respectively. Tim Dettmers et al. published QLoRA (2023), enabling fine-tuning of quantized models on a single GPU. The barrier to local inference dropped from "data center" to "gaming laptop."
— Frantar, E. et al. (2022). GPTQ: Accurate Post-Training Quantization. ICLR 2023.
— Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
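The size arithmetic behind quantization is simple enough to sketch: a model's weight footprint is parameters × bits-per-weight ÷ 8 bytes. The helper below ignores the small per-block scale factors that real quantization formats like GGUF add on top.

```python
def model_size_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate on-disk size of a model's weights, in decimal GB.

    Ignores the small per-block scale/zero-point overhead that
    formats like GGUF actually store alongside the weights.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# LLaMA-7B in fp16 vs 4-bit, matching the figures in the text
print(model_size_gb(7, 16))  # 14.0
print(model_size_gb(7, 4))   # 3.5
```

The same arithmetic explains why a 70B model at 4 bits (~35GB) still needs a workstation-class GPU or a lot of RAM, while 7B-8B models fit comfortably on a laptop.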
The Ecosystem Explodes
Llama 2 (July 2023), Mistral 7B (September 2023), Llama 3 (April 2024), Llama 3.1 (July 2024), Qwen2.5, DeepSeek-V3, Gemma 2 — open-weight models now routinely match proprietary API models from two generations prior. Tools like Ollama, vLLM, and TensorRT-LLM made serving them trivially easy.
The pendulum is swinging again. Not all the way to local — frontier reasoning models still require clusters — but for the 80% of tasks that don't need the most powerful model, local inference is now viable, cheaper, and faster.
The throughline: 1961 → 2026
The question was never "which is better." It was always "which is cheaper for this specific workload right now." That's the framework for the rest of this lesson.
The Two Models of AI Inference
Every AI application needs inference — the process of running a model to get predictions. You have two fundamental options, and the right choice depends on your constraints.
API-Based Inference
Send requests to cloud providers (OpenAI, Anthropic, Cohere, Google). Pay per token. Zero infrastructure to manage. Models update automatically.
Local / Self-Hosted Inference
Run models on your own hardware or cloud GPU instances. Fixed costs. Full control over data, latency, and model versions.
API Inference: Code & Cost
All major API providers follow the same pattern: authenticate with an API key, send a prompt, receive a response. The differences are in pricing, model quality, and feature sets.
OpenAI
GPT-4o, GPT-4o-mini, o1, o3
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from env
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain API vs local inference in one paragraph."}
],
max_tokens=200
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
# Cost: ~$0.003 for this request (GPT-4o: $2.50/1M input, $10/1M output)
Strengths
- Best-in-class instruction following
- Largest ecosystem (plugins, integrations)
- Function calling / structured outputs
- Excellent documentation
Considerations
- Higher cost at scale
- Rate limits can be restrictive
- Data retention: 30 days (enterprise: 0)
- US-only data residency
Anthropic
Claude Opus 4, Sonnet 4, Haiku 3.5
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=200,
messages=[
{"role": "user", "content": "Explain API vs local inference in one paragraph."}
]
)
print(message.content[0].text)
# Cost: ~$0.002 for this request (Sonnet: $3/1M input, $15/1M output)
Strengths
- Excellent for long-form & reasoning
- 200K context window standard
- Strong coding capabilities
- Better safety alignment
Considerations
- Smaller ecosystem than OpenAI
- Can be overly cautious on edge cases
- EU data residency in progress
- No image generation
Cohere
Command-R+, Command-R, Embed v3
import cohere
co = cohere.ClientV2() # reads CO_API_KEY
response = co.chat(
model="command-r-plus",
messages=[
{"role": "user", "content": "Explain API vs local inference in one paragraph."}
]
)
print(response.message.content[0].text)
# Cost: ~$0.001 (Command-R+: $2.50/1M input, $10/1M output)
Strengths
- Enterprise-focused (on-prem deployment options)
- Excellent embedding models
- Built-in RAG capabilities
- AWS/GCP/Azure deployment
Considerations
- Less consumer mindshare
- Fewer community resources
- Creative tasks lag behind
- Pricing less transparent
Notice the pattern
Every API provider uses essentially the same interface: an HTTP POST with a JSON body containing your model choice and a list of messages. This means switching providers is mostly a matter of changing the import, the model string, and the API key. Libraries like LiteLLM and LangChain abstract even that away. Vendor lock-in is low — which keeps prices competitive.
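The "change the import, the model string, and the API key" claim can be made concrete as a small config table. The model names and environment-variable names below mirror the three snippets above; treat them as illustrative, not an exhaustive registry.

```python
# Everything that actually differs between the three providers shown above.
PROVIDERS = {
    "openai":    {"sdk": "openai",    "model": "gpt-4o",
                  "key_env": "OPENAI_API_KEY"},
    "anthropic": {"sdk": "anthropic", "model": "claude-sonnet-4-20250514",
                  "key_env": "ANTHROPIC_API_KEY"},
    "cohere":    {"sdk": "cohere",    "model": "command-r-plus",
                  "key_env": "CO_API_KEY"},
}

def provider_config(name: str) -> dict:
    """Look up the three things that change when you swap providers."""
    return PROVIDERS[name]

print(provider_config("anthropic")["key_env"])  # ANTHROPIC_API_KEY
```

Routing layers like LiteLLM are, at their core, a richer version of this table plus per-provider request/response translation.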
Local Inference: Code & Setup
Local inference means the model runs on hardware you control — your laptop, your server, or a cloud GPU you rent. The model weights live on your disk. No tokens leave your network.
Ollama
Easiest local inference — Docker for LLMs
# Terminal: install and run (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b
# Downloads ~4.7GB once, then runs locally. No API key needed.
# Python: same OpenAI-compatible interface
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "user", "content": "Explain API vs local inference in one paragraph."}
]
)
print(response.choices[0].message.content)
# Cost: $0.00. Runs on your CPU/GPU. ~30 tok/s on M2 MacBook Pro.
Best For
- Development and prototyping
- Privacy-sensitive applications
- Mac users (Apple Silicon optimized)
- Learning and experimentation
Limitations
- Single-user by default
- No batching or continuous batching
- Limited model parallelism
- Not designed for high-throughput serving
vLLM
Production-grade GPU serving
# Start vLLM server (requires NVIDIA GPU)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Same OpenAI-compatible client works
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain inference."}]
)
# Throughput: ~2000 tok/s on A100. PagedAttention = 24x vs naive HuggingFace.
Best For
- High-throughput production workloads
- Batched inference (multiple users)
- Multi-GPU deployment
- When you need PagedAttention optimization
Key Features
- Continuous batching
- Tensor parallelism
- OpenAI-compatible API out of the box
- Speculative decoding support
llama.cpp
CPU inference, maximum portability, GGUF format
# Build from source
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j
# Run with 4-bit quantized model (3.5GB instead of 14GB)
./llama-cli -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
-p "Explain API vs local inference." -n 200
# Or run as an OpenAI-compatible server
./llama-server -m models/llama-3.1-8b-instruct-Q4_K_M.gguf --port 8080
# Cost: $0. Runs on CPU. ~10 tok/s on modern laptop without GPU.
Best For
- Running on CPU-only machines
- Edge deployment / embedded systems
- Maximum hardware compatibility
- When Ollama is too opinionated
Key Features
- Pure C/C++ — no Python dependency
- GGUF quantization (Q2 through Q8)
- Metal / CUDA / Vulkan / CPU backends
- Used internally by Ollama as its inference engine
Key Insight: OpenAI Compatibility Is the Standard
Notice that Ollama, vLLM, and llama.cpp all expose an OpenAI-compatible API. This is not accidental. OpenAI's /v1/chat/completions endpoint has become the de facto standard for LLM inference. You can write your application against the OpenAI SDK and switch between cloud and local by changing only the base_url. This dramatically reduces migration cost between deployment strategies.
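The shared interface is visible at the wire level: the same JSON body posted to /v1/chat/completions works against api.openai.com, Ollama on port 11434, or vLLM on port 8000. A standard-library sketch of the request each server accepts (the URLs simply repeat the defaults from the sections above):

```python
import json

# Default base URLs from the examples earlier in this lesson.
BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "ollama": "http://localhost:11434/v1",
    "vllm":   "http://localhost:8000/v1",
}

def chat_request(base_url: str, model: str, user_msg: str) -> tuple:
    """Build the (url, json_body) pair for an OpenAI-compatible chat completion.

    Only base_url and the model string differ between cloud and local.
    """
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    })
    return url, body

url, body = chat_request(BASE_URLS["ollama"], "llama3.1:8b", "Explain inference.")
print(url)  # http://localhost:11434/v1/chat/completions
```

Real requests additionally carry an `Authorization: Bearer <key>` header, which local servers typically accept but ignore — hence the dummy `api_key="ollama"` in the earlier snippet.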
Understanding the Economics
API pricing is variable (pay per token). Local inference has fixed costs (hardware, electricity, engineering time). The crossover point is where the math flips — and it depends entirely on your volume.
Worked Example: 1M Requests/Month
Assume each request averages 500 input tokens + 200 output tokens (a typical chatbot interaction).
Option A: GPT-4o API
Input: 1M requests × 500 tokens × $2.50/1M tokens = $1,250/mo
Output: 1M requests × 200 tokens × $10.00/1M tokens = $2,000/mo
Total: $3,250/month
Option B: GPT-4o-mini API
Input: 1M requests × 500 tokens × $0.15/1M tokens = $75/mo
Output: 1M requests × 200 tokens × $0.60/1M tokens = $120/mo
Total: $195/month
Option C: Self-hosted Llama 3.1 8B on A10G (AWS)
GPU instance: g5.xlarge @ $1.006/hr x 730 hrs = $734/mo
Throughput: ~800 req/hr at these token counts
Instances needed for 1M/mo: 2 instances = $1,468/mo
+ Engineering time: ~10 hrs/mo @ $150/hr = $1,500/mo
Total: $2,968/month (first year, amortizing setup)
$1,468/month (steady state, ops automated)
The Crossover Framework
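The worked example above is easy to reproduce. The function below encodes the same assumptions stated in the text — 500 input + 200 output tokens per request, published per-million-token prices, and the $1.006/hr g5.xlarge rate; the throughput and engineering-time figures remain the estimates they are.

```python
def api_monthly_cost(requests, in_tok, out_tok, in_price, out_price):
    """Monthly API bill in USD; prices are USD per 1M tokens."""
    return requests * (in_tok * in_price + out_tok * out_price) / 1_000_000

# 1M requests/month at 500 input + 200 output tokens each
gpt4o = api_monthly_cost(1_000_000, 500, 200, 2.50, 10.00)  # Option A
mini  = api_monthly_cost(1_000_000, 500, 200, 0.15, 0.60)   # Option B

# Option C steady state: two g5.xlarge instances around the clock
selfhost = 2 * 1.006 * 730  # ~ $1,468.76/mo

print(gpt4o, mini, round(selfhost, 2))
```

Plugging in your own request volume and token counts shows where the crossover falls for your workload; the comparison flips well before 1M requests/month if you are comparing against GPT-4o rather than GPT-4o-mini.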
Hidden costs most teams forget
For APIs: Rate limit overage, retry logic for 429/500 errors, prompt caching misses, vendor price changes (OpenAI has raised and lowered prices multiple times).
For self-hosting: DevOps/MLOps engineering time (the biggest hidden cost), model update deployment, monitoring and alerting, GPU memory management, cold start latency, and on-call burden when the inference server goes down at 3am.
Privacy, Compliance & Data Residency
For many applications, privacy is the deciding factor — not cost. When data cannot leave your network, the cost comparison is irrelevant.
When Local Inference Is Non-Negotiable
- HIPAA-covered health data — patient records cannot be sent to third-party APIs without a BAA (Business Associate Agreement). Azure OpenAI and AWS Bedrock offer BAAs; the direct OpenAI API does not by default.
- Financial data (SOX, PCI-DSS) — trade secrets, insider information, and payment card data have strict data handling requirements.
- EU data residency (GDPR) — personal data of EU citizens may not leave the EU without adequate protections. Most US API providers cannot guarantee this.
- Proprietary code / trade secrets — sending source code to an API means trusting the provider's data handling policies.
- Air-gapped environments — defense, critical infrastructure, and some financial systems have no internet connectivity by design.
- Government / FedRAMP — US federal systems require FedRAMP-authorized services; Azure OpenAI (Government) and AWS Bedrock in GovCloud are among the few offerings that qualify.
API Provider Privacy Comparison
| Provider | Data Retention | Training Opt-out | SOC 2 | HIPAA BAA |
|---|---|---|---|---|
| OpenAI API | 30 days (enterprise: 0) | Yes (API default) | Yes | No |
| Anthropic | 30 days | Yes | Yes | Enterprise only |
| Azure OpenAI | 0 days | Default | Yes | Yes |
| AWS Bedrock | 0 days | Default | Yes | Yes |
| Local (Ollama/vLLM) | You control | N/A | Your infra | Your infra |
When to Use What
There is no single right answer. But there is a decision tree that covers 90% of cases.
Use APIs when...
- You need frontier model quality (GPT-4o, Claude Opus, Gemini Ultra). No open model matches these yet.
- Your team has no ML/infra expertise and you want to ship fast.
- Volume is low to moderate (<100K requests/day).
- You need multimodal capabilities (vision, audio, tool use) that open models don't yet match.
- You're prototyping and need to iterate on prompts, not infrastructure.
Use local/self-hosted when...
- Data cannot leave your network (HIPAA, GDPR, classified, air-gapped).
- You need sub-100ms latency and can't tolerate network round-trip variance.
- Volume is high enough that API costs exceed GPU rental (>$2K/mo).
- You need to fine-tune the model on proprietary data for your specific domain.
- You need uptime guarantees independent of third-party providers.
- You're running batch processing where throughput matters more than per-request latency.
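The two checklists above compress into a first-pass routing function. The thresholds ($2K/month, 100ms) are the ones quoted in the lists; the ordering of checks is a simplification of a genuinely messier decision, so treat this as a sketch rather than policy.

```python
def choose_deployment(data_must_stay_local: bool,
                      needs_frontier_quality: bool,
                      monthly_api_cost_usd: float,
                      needs_sub_100ms: bool) -> str:
    """First-pass decision tree distilled from the checklists above."""
    if data_must_stay_local:
        return "local"   # compliance trumps everything else
    if needs_frontier_quality:
        return "api"     # no open model matches frontier quality yet
    if needs_sub_100ms:
        return "local"   # network round-trip alone can blow the budget
    if monthly_api_cost_usd > 2000:
        return "local"   # past the crossover, GPUs are cheaper
    return "api"         # default: simplest option wins

print(choose_deployment(False, False, 500, False))  # api
print(choose_deployment(True, True, 0, False))      # local
```

Note the deliberate ordering: a compliance constraint overrides a quality preference, which is exactly why "frontier quality plus HIPAA data" forces either a local model or a BAA-covered cloud deployment like Azure OpenAI.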
Hybrid Approaches: The Production Reality
Most production systems use both. The engineering skill is knowing how to route between them. Here are four battle-tested patterns.
Pattern 1: Tiered Routing
Route simple queries to local/cheap models, complex ones to powerful APIs. A classifier (often itself a small local model) decides the tier.
def route(prompt: str) -> str:
    complexity = classifier.predict(prompt)
    if complexity < 0.5:
        return ollama.generate(
            model="llama3.1:8b", prompt=prompt)
    else:
        return openai.chat(
            model="gpt-4o", messages=[...])
Pattern 2: Fallback Chain
Start with local. Fall back to API on failure, timeout, or when the local model's confidence is below threshold.
async def generate(prompt: str) -> str:
    try:
        result = await local_llm.generate(
            prompt, timeout=5.0)
        if result.confidence > 0.7:
            return result.text
    except (Timeout, ModelError):
        pass
    return await api_llm.generate(prompt)
Pattern 3: Privacy Split
Sensitive data stays local. Non-sensitive goes to cloud for maximum quality. A PII detector classifies each request.
def generate(data: dict) -> str:
    if pii_detector.contains_pii(data):
        return local_llm.generate(
            sanitize(data))  # Never leaves network
    else:
        return cloud_api.generate(data)
Pattern 4: Speculative Execution
Generate with a fast local model first. If quality checks pass, use it. If not, the API result (requested in parallel) is ready.
async def generate(prompt: str) -> str:
    # Start the API call, then race the fast local model against it
    api_task = asyncio.create_task(cloud_api.generate(prompt))
    local = await local_llm.generate(prompt)
    if quality_score(local) > 0.8:
        api_task.cancel()   # Fast + free: discard the speculative API call
        return local
    return await api_task   # Slow + paid: already in flight, so no extra wait
Latency: Where Local Wins
API latency has three components: network round-trip, queue wait time, and generation time. Local inference eliminates the first two entirely.
Typical Latency Breakdown (Time to First Token)
When Latency Is the Decision Factor
Real-time applications — voice assistants, co-pilots with keystroke-level suggestions, interactive coding tools, gaming NPCs — need consistent sub-200ms TTFT. API latency has a long tail: the p50 might be 300ms, but the p99 can spike to 2-5 seconds during peak load or provider incidents. Local inference gives you deterministic, predictable latency bounded only by your hardware. For latency-critical paths, local is often the only viable option.
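A back-of-the-envelope model of the three components makes the tail-latency story concrete. The millisecond figures below are illustrative assumptions chosen to match the p50/p99 ranges quoted above, not measurements:

```python
def ttft_ms(network_rtt: float, queue_wait: float, generation: float) -> float:
    """Time to first token = network round-trip + provider queue + prefill."""
    return network_rtt + queue_wait + generation

# Illustrative: local inference zeroes out the first two components.
api_p50   = ttft_ms(80, 70, 150)     # a plausible API median
api_p99   = ttft_ms(120, 1900, 180)  # queue spikes dominate the tail
local_p50 = ttft_ms(0, 0, 180)       # bounded only by your own hardware

print(api_p50, api_p99, local_p50)
```

The structural point survives any particular numbers: the API tail is driven by the queue term you don't control, while local TTFT variance comes almost entirely from the generation term you do.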
Key Takeaways
1. APIs are simpler to start — no infrastructure, pay per use, best for prototyping and low-to-medium volume. Start here unless you have a specific reason not to.
2. Local wins on privacy, latency, and cost-at-scale — required for regulated industries, better for real-time applications, cheaper above ~$2K/month.
3. Hybrid is the production answer — route based on complexity, privacy needs, or cost thresholds. The four patterns above cover most architectures.
4. The gap is closing fast — open-weight models in 2026 match proprietary APIs from 2024. The economics shift toward local every quarter. Re-evaluate regularly.
5. The OpenAI-compatible API is the universal interface — write against it once, deploy anywhere. This is the single most important architectural decision for future flexibility.