APIs vs Local Models
The first infrastructure decision you will make. Every AI application needs inference — but where that inference runs determines your cost, latency, privacy, and control.
70 Years of "Where Does the Compute Live?"
The API-vs-local debate is not new. It is the latest incarnation of computing's oldest architectural question: should processing happen centrally or at the edge? Every generation of computing has oscillated between these poles, and understanding the pattern is the fastest way to see where AI inference is headed.
The pendulum swings because the economics change. When bandwidth is cheap and hardware is expensive, centralize. When hardware is cheap and latency matters, decentralize. AI is currently in the middle of a swing.
Mainframe Time-Sharing
At MIT, Fernando Corbató built CTSS (Compatible Time-Sharing System) in 1961 — the first system where multiple users shared a single expensive computer via dumb terminals. An IBM 7094 cost $3.5 million (roughly $35M adjusted for inflation). Nobody could afford their own. The terminal sent keystrokes; the mainframe did everything.
"The key idea was that the machine was so expensive, it was more economical to have several people using it simultaneously than to have it sitting idle between jobs."
— Fernando Corbató, who received the 1990 Turing Award for this work on time-sharing.
This is exactly the API model: expensive centralized hardware, thin clients, pay for what you use. Today's POST api.openai.com/v1/chat/completions is structurally identical to a 1965 terminal sending a batch job to an IBM mainframe.
The PC Revolution: Compute Goes Local
The IBM PC (1981) and its clones put real compute on every desk. Suddenly you didn't need the mainframe for word processing, spreadsheets, or databases. Hardware got cheap enough to decentralize. By 1995, a $2,000 PC had more raw power than a 1975 mainframe. The pendulum swung to local — and stayed there for two decades.
AWS EC2 and the Cloud Era
Amazon launched Elastic Compute Cloud (EC2) in 2006. The mainframe model returned, rebranded: rent compute by the hour instead of buying servers. Amazon turned its internal infrastructure expertise into a business. By 2024, AWS, Azure, and GCP collectively generated over $200B in annual revenue — the economics of centralization at unprecedented scale.
AWS Lambda: Pay Per Invocation
Serverless computing arrived with AWS Lambda in 2014. Don't rent a server — just pay per function call. This is the direct ancestor of AI API pricing: you pay per token, not per hour. The compute exists somewhere in a data center; you don't care where.
OpenAI Launches the GPT-3 API
In June 2020, OpenAI released API access to GPT-3 — a 175-billion-parameter model far beyond any consumer hardware. Training cost an estimated $4.6M in compute alone. The weights were never released. If you wanted GPT-3, you used the API. Period.
This established the modern AI API paradigm: frontier models are too large and too expensive to run yourself, so you pay the provider per token. Within two years, Anthropic (Claude), Cohere, Google (PaLM/Gemini), and dozens of others launched competing APIs.
Meta Releases LLaMA
In February 2023, Hugo Touvron and colleagues at Meta AI released LLaMA (Large Language Model Meta AI) — a family of models from 7B to 65B parameters that matched or exceeded GPT-3's performance while being small enough to run on consumer hardware. The weights leaked within a week. Within a month, the open-source community had fine-tuned variants running on MacBooks.
"LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B."
— Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
llama.cpp & the Quantization Breakthrough
In March 2023, Georgi Gerganov released llama.cpp — a pure C/C++ implementation of LLaMA inference that ran on CPUs without GPUs. The key innovation was aggressive quantization: converting 16-bit floating-point weights to 4-bit integers, shrinking a 7B model from 14GB to 3.5GB with minimal quality loss.
Simultaneously, GPTQ (Frantar et al., 2022) and GGML/GGUF (Gerganov, 2023) formats emerged for GPU and CPU quantization respectively. Tim Dettmers et al. published QLoRA (2023), enabling fine-tuning of quantized models on a single GPU. The barrier to local inference dropped from "data center" to "gaming laptop."
— Frantar, E. et al. (2022). GPTQ: Accurate Post-Training Quantization. ICLR 2023.
— Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
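The size arithmetic behind quantization is simple enough to sketch: a model's weight footprint is parameters × bits-per-weight ÷ 8 bytes. The helper below ignores the small per-block scale factors that real quantization formats like GGUF add on top.

```python
def model_size_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate on-disk size of a model's weights, in decimal GB.

    Ignores the small per-block scale/zero-point overhead that
    formats like GGUF actually store alongside the weights.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# LLaMA-7B in fp16 vs 4-bit, matching the figures in the text
print(model_size_gb(7, 16))  # 14.0
print(model_size_gb(7, 4))   # 3.5
```

The same arithmetic explains why a 70B model at 4 bits (~35GB) still needs a workstation-class GPU or a lot of RAM, while 7B-8B models fit comfortably on a laptop.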
The Ecosystem Explodes
Llama 2 (July 2023), Mistral 7B (September 2023), Llama 3 (April 2024), Llama 3.1 (July 2024), Qwen2.5, DeepSeek-V3, Gemma 2 — open-weight models now routinely match proprietary API models from two generations prior. Tools like Ollama, vLLM, and TensorRT-LLM made serving them trivially easy.
The pendulum is swinging again. Not all the way to local — frontier reasoning models still require clusters — but for the 80% of tasks that don't need the most powerful model, local inference is now viable, cheaper, and faster.
The throughline: 1961 → 2026
The question was never "which is better." It was always "which is cheaper for this specific workload right now." That's the framework for the rest of this lesson.
The Two Models of AI Inference
Every AI application needs inference — the process of running a model to get predictions. You have two fundamental options, and the right choice depends on your constraints.
API-Based Inference
Send requests to cloud providers (OpenAI, Anthropic, Cohere, Google). Pay per token. Zero infrastructure to manage. Models update automatically.
Local / Self-Hosted Inference
Run models on your own hardware or cloud GPU instances. Fixed costs. Full control over data, latency, and model versions.
API Inference: Code & Cost
All major API providers follow the same pattern: authenticate with an API key, send a prompt, receive a response. The differences are in pricing, model quality, and feature sets.
OpenAI
GPT-4o, GPT-4o-mini, o1, o3
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from env
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain API vs local inference in one paragraph."}
],
max_tokens=200
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
# Cost: ~$0.003 for this request (GPT-4o: $2.50/1M input, $10/1M output)
Strengths
- Best-in-class instruction following
- Largest ecosystem (plugins, integrations)
- Function calling / structured outputs
- Excellent documentation
Considerations
- Higher cost at scale
- Rate limits can be restrictive
- Data retention: 30 days (enterprise: 0)
- US-only data residency
Anthropic
Claude Opus 4, Sonnet 4, Haiku 3.5
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY
message = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=200,
messages=[
{"role": "user", "content": "Explain API vs local inference in one paragraph."}
]
)
print(message.content[0].text)
# Cost: ~$0.002 for this request (Sonnet: $3/1M input, $15/1M output)
Strengths
- Excellent for long-form & reasoning
- 200K context window standard
- Strong coding capabilities
- Better safety alignment
Considerations
- Smaller ecosystem than OpenAI
- Can be overly cautious on edge cases
- EU data residency in progress
- No image generation
Cohere
Command-R+, Command-R, Embed v3
import cohere
co = cohere.ClientV2() # reads CO_API_KEY
response = co.chat(
model="command-r-plus",
messages=[
{"role": "user", "content": "Explain API vs local inference in one paragraph."}
]
)
print(response.message.content[0].text)
# Cost: ~$0.001 (Command-R+: $2.50/1M input, $10/1M output)
Strengths
- Enterprise-focused (on-prem deployment options)
- Excellent embedding models
- Built-in RAG capabilities
- AWS/GCP/Azure deployment
Considerations
- Less consumer mindshare
- Fewer community resources
- Creative tasks lag behind
- Pricing less transparent
Notice the pattern
Every API provider uses essentially the same interface: an HTTP POST with a JSON body containing your model choice and a list of messages. This means switching providers is mostly a matter of changing the import, the model string, and the API key. Libraries like LiteLLM and LangChain abstract even that away. Vendor lock-in is low — which keeps prices competitive.
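The "change the import, the model string, and the API key" claim can be made concrete as a small config table. The model names and environment-variable names below mirror the three snippets above; treat them as illustrative, not an exhaustive registry.

```python
# Everything that actually differs between the three providers shown above.
PROVIDERS = {
    "openai":    {"sdk": "openai",    "model": "gpt-4o",
                  "key_env": "OPENAI_API_KEY"},
    "anthropic": {"sdk": "anthropic", "model": "claude-sonnet-4-20250514",
                  "key_env": "ANTHROPIC_API_KEY"},
    "cohere":    {"sdk": "cohere",    "model": "command-r-plus",
                  "key_env": "CO_API_KEY"},
}

def provider_config(name: str) -> dict:
    """Look up the three things that change when you swap providers."""
    return PROVIDERS[name]

print(provider_config("anthropic")["key_env"])  # ANTHROPIC_API_KEY
```

Routing layers like LiteLLM are, at their core, a richer version of this table plus per-provider request/response translation.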
Local Inference: Code & Setup
Local inference means the model runs on hardware you control — your laptop, your server, or a cloud GPU you rent. The model weights live on your disk. No tokens leave your network.
Ollama
Easiest local inference — Docker for LLMs
# Terminal: install and run (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b
# Downloads ~4.7GB once, then runs locally. No API key needed.
# Python: same OpenAI-compatible interface
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[
{"role": "user", "content": "Explain API vs local inference in one paragraph."}
]
)
print(response.choices[0].message.content)
# Cost: $0.00. Runs on your CPU/GPU. ~30 tok/s on M2 MacBook Pro.
Best For
- Development and prototyping
- Privacy-sensitive applications
- Mac users (Apple Silicon optimized)
- Learning and experimentation
Limitations
- Single-user by default
- No batching or continuous batching
- Limited model parallelism
- Not designed for high-throughput serving
vLLM
Production-grade GPU serving
# Start vLLM server (requires NVIDIA GPU)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Same OpenAI-compatible client works
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain inference."}]
)
# Throughput: ~2000 tok/s on A100. PagedAttention = 24x vs naive HuggingFace.
Best For
- High-throughput production workloads
- Batched inference (multiple users)
- Multi-GPU deployment
- When you need PagedAttention optimization
Key Features
- Continuous batching
- Tensor parallelism
- OpenAI-compatible API out of the box
- Speculative decoding support
llama.cpp
CPU inference, maximum portability, GGUF format
# Build from source
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j
# Run with 4-bit quantized model (3.5GB instead of 14GB)
./llama-cli -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
-p "Explain API vs local inference." -n 200
# Or run as an OpenAI-compatible server
./llama-server -m models/llama-3.1-8b-instruct-Q4_K_M.gguf --port 8080
# Cost: $0. Runs on CPU. ~10 tok/s on modern laptop without GPU.
Best For
- Running on CPU-only machines
- Edge deployment / embedded systems
- Maximum hardware compatibility
- When Ollama is too opinionated
Key Features
- Pure C/C++ — no Python dependency
- GGUF quantization (Q2 through Q8)
- Metal / CUDA / Vulkan / CPU backends
- Used internally by Ollama as its inference engine
Key Insight: OpenAI Compatibility Is the Standard
Notice that Ollama, vLLM, and llama.cpp all expose an OpenAI-compatible API. This is not accidental. OpenAI's /v1/chat/completions endpoint has become the de facto standard for LLM inference. You can write your application against the OpenAI SDK and switch between cloud and local by changing only the base_url. This dramatically reduces migration cost between deployment strategies.
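The shared interface is visible at the wire level: the same JSON body posted to /v1/chat/completions works against api.openai.com, Ollama on port 11434, or vLLM on port 8000. A standard-library sketch of the request each server accepts (the URLs simply repeat the defaults from the sections above):

```python
import json

# Default base URLs from the examples earlier in this lesson.
BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "ollama": "http://localhost:11434/v1",
    "vllm":   "http://localhost:8000/v1",
}

def chat_request(base_url: str, model: str, user_msg: str) -> tuple:
    """Build the (url, json_body) pair for an OpenAI-compatible chat completion.

    Only base_url and the model string differ between cloud and local.
    """
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    })
    return url, body

url, body = chat_request(BASE_URLS["ollama"], "llama3.1:8b", "Explain inference.")
print(url)  # http://localhost:11434/v1/chat/completions
```

Real requests additionally carry an `Authorization: Bearer <key>` header, which local servers typically accept but ignore — hence the dummy `api_key="ollama"` in the earlier snippet.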
Understanding the Economics
API pricing is variable (pay per token). Local inference has fixed costs (hardware, electricity, engineering time). The crossover point is where the math flips — and it depends entirely on your volume.
Worked Example: 1M Requests/Month
Assume each request averages 500 input tokens + 200 output tokens (a typical chatbot interaction).
Option A: GPT-4o API
Input: 1M requests × 500 tokens × $2.50/1M tokens = $1,250/mo
Output: 1M requests × 200 tokens × $10.00/1M tokens = $2,000/mo
Total: $3,250/month
Option B: GPT-4o-mini API
Input: 1M requests × 500 tokens × $0.15/1M tokens = $75/mo
Output: 1M requests × 200 tokens × $0.60/1M tokens = $120/mo
Total: $195/month
Option C: Self-hosted Llama 3.1 8B on A10G (AWS)
GPU instance: g5.xlarge @ $1.006/hr x 730 hrs = $734/mo
Throughput: ~800 req/hr at these token counts
Instances needed for 1M/mo: 2 instances = $1,468/mo
+ Engineering time: ~10 hrs/mo @ $150/hr = $1,500/mo
Total: $2,968/month (first year, amortizing setup)
$1,468/month (steady state, ops automated)
The Crossover Framework
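The worked example above is easy to reproduce. The function below encodes the same assumptions stated in the text — 500 input + 200 output tokens per request, published per-million-token prices, and the $1.006/hr g5.xlarge rate; the throughput and engineering-time figures remain the estimates they are.

```python
def api_monthly_cost(requests, in_tok, out_tok, in_price, out_price):
    """Monthly API bill in USD; prices are USD per 1M tokens."""
    return requests * (in_tok * in_price + out_tok * out_price) / 1_000_000

# 1M requests/month at 500 input + 200 output tokens each
gpt4o = api_monthly_cost(1_000_000, 500, 200, 2.50, 10.00)  # Option A
mini  = api_monthly_cost(1_000_000, 500, 200, 0.15, 0.60)   # Option B

# Option C steady state: two g5.xlarge instances around the clock
selfhost = 2 * 1.006 * 730  # ~ $1,468.76/mo

print(gpt4o, mini, round(selfhost, 2))
```

Plugging in your own request volume and token counts shows where the crossover falls for your workload; the comparison flips well before 1M requests/month if you are comparing against GPT-4o rather than GPT-4o-mini.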
Hidden costs most teams forget
For APIs: Rate limit overage, retry logic for 429/500 errors, prompt caching misses, vendor price changes (OpenAI has raised and lowered prices multiple times).
For self-hosting: DevOps/MLOps engineering time (the biggest hidden cost), model update deployment, monitoring and alerting, GPU memory management, cold start latency, and on-call burden when the inference server goes down at 3am.
Privacy, Compliance & Data Residency
For many applications, privacy is the deciding factor — not cost. When data cannot leave your network, the cost comparison is irrelevant.
When Local Inference Is Non-Negotiable
- HIPAA-covered health data — patient records cannot be sent to third-party APIs without a BAA (Business Associate Agreement). Azure OpenAI and AWS Bedrock offer BAAs; the direct OpenAI API does not by default.
- Financial data (SOX, PCI-DSS) — trade secrets, insider information, and payment card data have strict data handling requirements.
- EU data residency (GDPR) — personal data of EU citizens may not leave the EU without adequate protections. Most US API providers cannot guarantee this.
- Proprietary code / trade secrets — sending source code to an API means trusting the provider's data handling policies.
- Air-gapped environments — defense, critical infrastructure, and some financial systems have no internet connectivity by design.
- Government / FedRAMP — US federal systems require FedRAMP-authorized services; Azure OpenAI (Government) and AWS Bedrock in GovCloud are among the few offerings that qualify.
API Provider Privacy Comparison
| Provider | Data Retention | Training Opt-out | SOC 2 | HIPAA BAA |
|---|---|---|---|---|
| OpenAI API | 30 days (enterprise: 0) | Yes (API default) | Yes | No |
| Anthropic | 30 days | Yes | Yes | Enterprise only |
| Azure OpenAI | 0 days | Default | Yes | Yes |
| AWS Bedrock | 0 days | Default | Yes | Yes |
| Local (Ollama/vLLM) | You control | N/A | Your infra | Your infra |
When to Use What
There is no single right answer. But there is a decision tree that covers 90% of cases.
Use APIs when...
- You need frontier model quality (GPT-4o, Claude Opus, Gemini Ultra). No open model matches these yet.
- Your team has no ML/infra expertise and you want to ship fast.
- Volume is low to moderate (<100K requests/day).
- You need multimodal capabilities (vision, audio, tool use) that open models don't yet match.
- You're prototyping and need to iterate on prompts, not infrastructure.
Use local/self-hosted when...
- Data cannot leave your network (HIPAA, GDPR, classified, air-gapped).
- You need sub-100ms latency and can't tolerate network round-trip variance.
- Volume is high enough that API costs exceed GPU rental (>$2K/mo).
- You need to fine-tune the model on proprietary data for your specific domain.
- You need uptime guarantees independent of third-party providers.
- You're running batch processing where throughput matters more than per-request latency.
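The two checklists above compress into a first-pass routing function. The thresholds ($2K/month, 100ms) are the ones quoted in the lists; the ordering of checks is a simplification of a genuinely messier decision, so treat this as a sketch rather than policy.

```python
def choose_deployment(data_must_stay_local: bool,
                      needs_frontier_quality: bool,
                      monthly_api_cost_usd: float,
                      needs_sub_100ms: bool) -> str:
    """First-pass decision tree distilled from the checklists above."""
    if data_must_stay_local:
        return "local"   # compliance trumps everything else
    if needs_frontier_quality:
        return "api"     # no open model matches frontier quality yet
    if needs_sub_100ms:
        return "local"   # network round-trip alone can blow the budget
    if monthly_api_cost_usd > 2000:
        return "local"   # past the crossover, GPUs are cheaper
    return "api"         # default: simplest option wins

print(choose_deployment(False, False, 500, False))  # api
print(choose_deployment(True, True, 0, False))      # local
```

Note the deliberate ordering: a compliance constraint overrides a quality preference, which is exactly why "frontier quality plus HIPAA data" forces either a local model or a BAA-covered cloud deployment like Azure OpenAI.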
Hybrid Approaches: The Production Reality
Most production systems use both. The engineering skill is knowing how to route between them. Here are four battle-tested patterns.
Pattern 1: Tiered Routing
Route simple queries to local/cheap models, complex ones to powerful APIs. A classifier (often itself a small local model) decides the tier.
def route(prompt: str) -> str:
    complexity = classifier.predict(prompt)
    if complexity < 0.5:
        return ollama.generate(
            model="llama3.1:8b", prompt=prompt)
    else:
        return openai.chat(
            model="gpt-4o", messages=[...])
Pattern 2: Fallback Chain
Start with local. Fall back to API on failure, timeout, or when the local model's confidence is below threshold.
async def generate(prompt: str) -> str:
    try:
        result = await local_llm.generate(
            prompt, timeout=5.0)
        if result.confidence > 0.7:
            return result.text
    except (Timeout, ModelError):
        pass
    return await api_llm.generate(prompt)
Pattern 3: Privacy Split
Sensitive data stays local. Non-sensitive goes to cloud for maximum quality. A PII detector classifies each request.
def generate(data: dict) -> str:
    if pii_detector.contains_pii(data):
        return local_llm.generate(
            sanitize(data))  # Never leaves network
    else:
        return cloud_api.generate(data)
Pattern 4: Speculative Execution
Generate with a fast local model first. If quality checks pass, use it. If not, the API result (requested in parallel) is ready.
async def generate(prompt: str) -> str:
    # Start the API call, then race the fast local model against it
    api_task = asyncio.create_task(cloud_api.generate(prompt))
    local = await local_llm.generate(prompt)
    if quality_score(local) > 0.8:
        api_task.cancel()   # Fast + free: discard the speculative API call
        return local
    return await api_task   # Slow + paid: already in flight, so no extra wait
Latency: Where Local Wins
API latency has three components: network round-trip, queue wait time, and generation time. Local inference eliminates the first two entirely.
Typical Latency Breakdown (Time to First Token)
When Latency Is the Decision Factor
Real-time applications — voice assistants, co-pilots with keystroke-level suggestions, interactive coding tools, gaming NPCs — need consistent sub-200ms TTFT. API latency has a long tail: the p50 might be 300ms, but the p99 can spike to 2-5 seconds during peak load or provider incidents. Local inference gives you deterministic, predictable latency bounded only by your hardware. For latency-critical paths, local is often the only viable option.
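A back-of-the-envelope model of the three components makes the tail-latency story concrete. The millisecond figures below are illustrative assumptions chosen to match the p50/p99 ranges quoted above, not measurements:

```python
def ttft_ms(network_rtt: float, queue_wait: float, generation: float) -> float:
    """Time to first token = network round-trip + provider queue + prefill."""
    return network_rtt + queue_wait + generation

# Illustrative: local inference zeroes out the first two components.
api_p50   = ttft_ms(80, 70, 150)     # a plausible API median
api_p99   = ttft_ms(120, 1900, 180)  # queue spikes dominate the tail
local_p50 = ttft_ms(0, 0, 180)       # bounded only by your own hardware

print(api_p50, api_p99, local_p50)
```

The structural point survives any particular numbers: the API tail is driven by the queue term you don't control, while local TTFT variance comes almost entirely from the generation term you do.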
Key Takeaways
1. APIs are simpler to start — no infrastructure, pay per use, best for prototyping and low-to-medium volume. Start here unless you have a specific reason not to.
2. Local wins on privacy, latency, and cost-at-scale — required for regulated industries, better for real-time applications, cheaper above ~$2K/month.
3. Hybrid is the production answer — route based on complexity, privacy needs, or cost thresholds. The four patterns above cover most architectures.
4. The gap is closing fast — open-weight models in 2026 match proprietary APIs from 2024. The economics shift toward local every quarter. Re-evaluate regularly.
5. The OpenAI-compatible API is the universal interface — write against it once, deploy anywhere. This is the single most important architectural decision for future flexibility.