Level 0: Foundations (~15 min)

APIs vs Local Models

The first infrastructure decision you will make. Cloud APIs or self-hosted inference?

The Trade-off

Every AI application needs inference - the process of running your model to get predictions. You have two fundamental options:

API-Based

Send requests to cloud providers (OpenAI, Anthropic, Cohere). Pay per token, no infrastructure to manage.

POST https://api.openai.com/v1/chat/completions
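
For concreteness, here is a minimal sketch of that request in Python. It assumes the requests package is installed and an OPENAI_API_KEY environment variable is set; the model name is an example.

# Minimal sketch: the raw HTTP request behind the endpoint above
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])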

Local / Self-Hosted

Run models on your own hardware or cloud instances. Fixed costs, full control over data and latency.

ollama run llama3.1:8b

Neither is universally better. The right choice depends on your specific constraints: budget, volume, latency needs, privacy requirements, and team capacity.

API Providers: The Major Players

OpenAI

GPT-4o, GPT-4o-mini, o1

Strengths

- Best-in-class instruction following
- Largest ecosystem (plugins, integrations)
- Excellent documentation
- Function calling / structured outputs (see the sketch below)

Considerations

- Higher cost at scale
- Rate limits can be restrictive
- Consumer-product data may be used for training (API traffic is excluded by default)
- US-only data residency
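
To illustrate the structured-output support noted above, here is a minimal sketch using the openai Python SDK's JSON mode. The model name and prompt are examples, and JSON mode requires mentioning JSON in the prompt.

# Minimal sketch: JSON mode guarantees syntactically valid JSON output
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": "Return a JSON object with keys 'city' and 'population' for Paris.",
    }],
)
print(resp.choices[0].message.content)  # a parseable JSON string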

Anthropic

Claude Opus, Sonnet, Haiku

Strengths

- Excellent for long-form content
- 200K context window standard
- Strong reasoning capabilities
- Better safety alignment

Considerations

- Smaller ecosystem
- Can be overly cautious
- EU data residency in progress
- Less structured output support

Cohere

Command-R+, Command-R, Embed v3

Strengths

- Enterprise-focused features
- Excellent embedding models
- Built-in RAG capabilities
- AWS/GCP/Azure deployment options

Considerations

- Less consumer mindshare
- Fewer community resources
- Lags behind on creative tasks
- Pricing less transparent

Local / Self-Hosted Solutions

Ollama

Easiest local inference

Best For

- Development and prototyping
- Privacy-sensitive applications
- Mac users (Apple Silicon optimized)
- Learning and experimentation

Setup

# Install
curl -fsSL https://ollama.com/install.sh | sh
# Run
ollama run llama3.1:8b
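
Once the server is running, you can also query it programmatically. A minimal sketch against Ollama's local REST API (default port 11434; assumes the llama3.1:8b model has been pulled):

# Minimal sketch: call a local Ollama server over its REST API
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])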

vLLM

Production-grade serving

Best For

- High-throughput production workloads
- Batched inference
- When you need PagedAttention optimization
- Multi-GPU deployment

Key Features

- Up to 24x the throughput of HuggingFace Transformers
- OpenAI-compatible API (see the sketch below)
- Continuous batching
- Tensor parallelism
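
Because the server speaks the OpenAI protocol, existing client code can point at it with only a base-URL change. A minimal sketch; the model name, port, and launch command are examples, so check the vLLM docs for your version:

# Launch the server first, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)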

llama.cpp

CPU inference, maximum portability

Best For

- Running on CPU-only machines
- Edge deployment
- Embedded systems
- Maximum hardware compatibility

Key Features

- Pure C/C++ implementation
- GGUF quantization format
- Metal/CUDA/CPU backends
- Active community development
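
llama.cpp also ships a small server (llama-server) with an OpenAI-compatible endpoint. A minimal sketch; the binary name, flags, and GGUF filename vary by build and model:

# Launch the server first, e.g.:
#   ./llama-server -m llama-3.1-8b-instruct-q4_k_m.gguf --port 8080
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="local",  # llama-server accepts any model name
    messages=[{"role": "user", "content": "Hello from a CPU-only box"}],
)
print(resp.choices[0].message.content)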

Cost Comparison

Example workload: 10M tokens per month with a 30% output ratio (7.0M input tokens, 3.0M output tokens). Estimated monthly costs, which vary with your exact token mix and provider pricing:

- M3 Max (Llama 3.1 8B, self-hosted): $0 marginal cost (cheapest; assumes hardware you already own)
- OpenAI GPT-4o-mini: $3
- Cohere Command-R: $3
- Anthropic Claude Haiku: $6
- Together.ai (Llama 3.1 70B): $9
- OpenAI GPT-4o: $48
- Cohere Command-R+: $48
- Anthropic Claude Sonnet: $66

Understanding the Economics

API pricing is pay-per-token. Local inference has fixed costs (hardware, electricity). The crossover point depends on your volume.

Rule of Thumb

- Under $500/mo: use APIs; the infrastructure overhead isn't worth it.
- $500-2K/mo: evaluate; run the numbers for your specific workload.
- Over $2K/mo: self-hosting is likely cheaper; amortize GPU costs over time.

Hidden costs to consider: DevOps time for self-hosting, engineering around API rate limits, model fine-tuning infrastructure, and monitoring/observability overhead.
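
A back-of-the-envelope sketch makes the crossover concrete. All numbers below are assumptions; substitute your own API pricing, hardware, and staffing costs:

# Break-even sketch: fixed self-hosting cost vs. pay-per-token API cost
API_COST_PER_M_TOKENS = 0.60   # blended $/1M tokens (assumed)
GPU_MONTHLY_COST = 1500.0      # GPU server rental, $/month (assumed)
DEVOPS_MONTHLY_COST = 500.0    # slice of an engineer's time (assumed)

def api_cost(tokens: float) -> float:
    return tokens / 1e6 * API_COST_PER_M_TOKENS

def self_hosted_cost() -> float:
    # Fixed regardless of volume, up to the hardware's capacity
    return GPU_MONTHLY_COST + DEVOPS_MONTHLY_COST

for tokens in (10e6, 100e6, 1e9, 5e9):
    print(f"{tokens / 1e6:>6.0f}M tokens/mo: "
          f"API ${api_cost(tokens):>7,.0f} vs self-hosted ${self_hosted_cost():>7,.0f}")

With these assumed numbers the crossover lands around 3.3B tokens per month (about $2K of API spend); cheaper API tiers or pricier hardware move it accordingly.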

Privacy & Data Residency

For many applications, privacy is the deciding factor - not cost.

When You Must Go Local

- HIPAA-covered health data
- Financial data with regulatory requirements
- Proprietary code or trade secrets
- EU data residency requirements (GDPR)
- Air-gapped environments
- Government/defense applications

API Provider Privacy Options

Provider     | Data Retention          | Training Opt-out  | SOC 2
OpenAI API   | 30 days (enterprise: 0) | Yes (API default) | Yes
Anthropic    | 30 days                 | Yes               | Yes
Cohere       | Configurable            | Yes               | Yes
Azure OpenAI | 0 days                  | Default           | Yes + HIPAA

Hybrid Approaches

Many production systems use both. The trick is knowing when to route to which.

Pattern: Tiered Routing

Route simple queries to local/cheap models, complex ones to powerful APIs.

# Pseudo-code
if is_simple_query(prompt):
    return ollama.generate()
else:
    return openai.chat()

Pattern: Fallback Chain

Start with local, fall back to API on failure or for specific capabilities.

# Pseudo-code
try:
    result = local_llm.generate()
except (Timeout, QualityError):
    result = api_llm.generate()

Pattern: Privacy Split

Sensitive data stays local, non-sensitive goes to cloud for quality.

# Pseudo-code
if contains_pii(data):
    return local_llm.generate()
else:
    return cloud_api.generate()

Pattern: Dev/Prod Split

Local for development, API for production quality.

# Pseudo-code
if env == "development":
    client = OllamaClient()
else:
    client = OpenAIClient()
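
Putting the patterns together: a minimal sketch of a router that keeps PII-bearing prompts local (via Ollama's OpenAI-compatible endpoint) and falls back to the local model when the cloud call fails. The endpoint, model names, and the contains_pii placeholder are assumptions; swap in a real PII detector.

# Hybrid router sketch: privacy split + fallback chain
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def contains_pii(text: str) -> bool:
    return "ssn" in text.lower()  # placeholder; use a real detector

def generate(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    if contains_pii(prompt):
        # Privacy split: sensitive data never leaves the machine
        resp = local.chat.completions.create(model="llama3.1:8b", messages=messages)
        return resp.choices[0].message.content
    try:
        resp = cloud.chat.completions.create(
            model="gpt-4o-mini", messages=messages, timeout=30)
        return resp.choices[0].message.content
    except Exception:
        # Fallback chain: degrade to the local model on API failure
        resp = local.chat.completions.create(model="llama3.1:8b", messages=messages)
        return resp.choices[0].message.content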

Key Takeaways

1. APIs are simpler to start - no infrastructure, pay per use, excellent for prototyping and low-to-medium volume.

2. Local wins on privacy and latency - required for regulated industries, better for real-time applications.

3. Hybrid is often the answer - route based on complexity, privacy needs, or cost thresholds.

4. Calculate the crossover point - above ~$2K/month API spend, self-hosting usually saves money.