
The Bitter Lesson

Rich Sutton's 2019 observation that changed how we think about AI progress: general methods leveraging computation beat human-engineered approaches. Every time.

December 2025 | 15 min read | Original Essay

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

Rich Sutton, March 2019

Why "Bitter"?

The lesson is bitter because it contradicts what researchers want to believe. We want our domain expertise to matter. We want clever algorithms to beat brute force. We want human insight to be the key ingredient.

But history tells a different story. Again and again, methods that seemed "inelegant" or "brute force" eventually dominated—once enough compute became available.

The Pattern

  1. Researchers build systems encoding human domain knowledge
  2. This works in the short term and feels intellectually satisfying
  3. Progress plateaus as human knowledge becomes the bottleneck
  4. Someone tries a "simple" approach with more compute
  5. The compute-heavy approach wins decisively
  6. Researchers resist the result, then eventually accept it and move on

Historical Evidence

Every major AI breakthrough follows the same pattern. Here's the timeline:

1997: Deep Blue defeats Kasparov
Brute-force search with specialized hardware beat decades of chess knowledge engineering.
Search + compute > human chess knowledge

2011: IBM Watson wins Jeopardy!
Statistical methods over massive text corpora defeated human champions.
Scale > hand-crafted QA rules

2012: AlexNet wins ImageNet
Deep learning on GPUs crushed hand-engineered features (SIFT, HOG).
Learned features > engineered features

2016: AlphaGo defeats Lee Sedol
Self-play + MCTS beat 2,500 years of Go theory.
Search + learning > human intuition

2017: Transformer architecture
Attention replaced recurrence and enabled massive parallelization.
Parallelizable compute > sequential constraints

2020: GPT-3 (175B parameters)
Few-shot learning emerged from scale alone, with no task-specific training.
Scale creates capabilities

2022: Chinchilla scaling laws
DeepMind showed compute-optimal training at roughly 20 tokens per parameter.
Data and compute must scale together

2023: GPT-4 multimodal
Vision, code, and reasoning from a unified transformer plus scale.
General architecture + scale > specialized models

2024: o1 test-time compute
Reasoning improves with inference-time compute scaling.
Compute at inference matters too

2025: Agentic AI (METR benchmarks)
Autonomous task completion scales with model capability.
General agents > task-specific automation

Chess: Where It All Started

The Knowledge Approach

For decades, chess AI researchers hand-coded opening books, endgame tablebases, positional evaluation heuristics, and strategic concepts from grandmaster play.

  • - Piece-square tables for positional value
  • - King safety heuristics
  • - Pawn structure evaluation
  • - Endgame patterns from theory

The Compute Approach

Deep Blue evaluated about 200 million positions per second. Raw search depth plus a simple evaluation function beat sophisticated knowledge; as Sutton noted, the knowledge-based chess researchers were "not good losers."

  • + Specialized hardware (480 chess chips)
  • + Alpha-beta search at massive depth
  • + Simple material + mobility evaluation
  • + Brute force wins

Today, Stockfish (deep search with a small learned evaluation network) and Leela Chess Zero (a neural network trained purely by self-play) both crush any human. Neither relies on hand-coded human chess knowledge in any meaningful way.
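The core of the compute approach is small enough to sketch. Below is a minimal alpha-beta search in Python; the `Game` interface, search depth, and evaluation hook are hypothetical placeholders rather than anything from Deep Blue's actual implementation, which relied on specialized hardware and a far more elaborate evaluator.

```python
from typing import Iterable

# Hypothetical minimal game interface -- any two-player, zero-sum game fits.
class Game:
    def legal_moves(self) -> Iterable["Game"]: ...   # successor positions
    def is_terminal(self) -> bool: ...
    def evaluate(self) -> float: ...                  # cheap heuristic, e.g. material count

def alphabeta(pos: Game, depth: int, alpha: float, beta: float, maximizing: bool) -> float:
    """Plain alpha-beta search: strength comes from depth, not encoded chess knowledge."""
    if depth == 0 or pos.is_terminal():
        return pos.evaluate()
    if maximizing:
        value = float("-inf")
        for child in pos.legal_moves():
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # prune: the opponent will never allow this line
                break
        return value
    else:
        value = float("inf")
        for child in pos.legal_moves():
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break
        return value

# Usage sketch (requires a concrete Game implementation):
# best = max(pos.legal_moves(),
#            key=lambda c: alphabeta(c, 8, float("-inf"), float("inf"), False))
```

The point is that the evaluation stays deliberately simple; the system gets stronger by searching deeper as hardware allows, not by making `evaluate` smarter.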

Modern Evidence: Large Language Models

LLMs are perhaps the purest expression of the Bitter Lesson. No linguistic rules. No syntax trees. No semantic ontologies. Just next-token prediction at scale.

What GPT Doesn't Have

  • - No grammar rules
  • - No knowledge graphs
  • - No reasoning modules
  • - No world models
  • - No symbolic logic
  • - No explicit memory

What GPT Does Have

  • + 175B+ parameters (GPT-3)
  • + Trillions of training tokens
  • + Thousands of GPUs
  • + Months of training time

Result: emergent capabilities including reasoning, coding, translation, and creative writing—none of which were explicitly programmed.
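Stripped of scale, the training objective itself is remarkably plain. The sketch below shows one step of next-token prediction with cross-entropy loss in PyTorch; the tiny embedding-plus-linear stand-in model and the random token batch are illustrative assumptions, not GPT's architecture or data.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 50_000, 256, 128, 8   # toy sizes, not GPT-3's

# Stand-in for a transformer stack: embed tokens, project back to vocabulary logits.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, vocab_size, (batch, seq_len))      # placeholder for real text
inputs, targets = tokens[:, :-1], tokens[:, 1:]               # predict token t+1 from <= t

logits = model(inputs)                                        # (batch, seq_len-1, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()
# GPT-style pretraining is essentially this loop, repeated over trillions of tokens
# on thousands of GPUs; no grammar rules, ontologies, or task-specific modules.
```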

Scaling Laws: The Math of the Bitter Lesson

The Bitter Lesson isn't just philosophy—it's now quantified. Scaling laws show predictable performance improvements from compute:

Law | Lab | Finding | Interpretation
Kaplan (2020) | OpenAI | N_opt ∝ C^0.73 | Performance scales as a power law with compute; prioritize model size.
Chinchilla (2022) | DeepMind | N_opt ∝ C^0.50 | Optimal: ~20 tokens per parameter; data and model size scale equally.
Llama 3 (2024) | Meta | 200+ tokens per parameter | Overtraining beyond Chinchilla-optimal improves inference efficiency.
o1 (2024) | OpenAI | Test-time scaling | Compute at inference improves reasoning; both train and test compute matter.

The implication: If you have a fixed compute budget, the scaling laws tell you exactly how to allocate it between model size and data. Human intuition about architecture becomes secondary to these mathematical relationships.
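To make that concrete, here is a small calculation using the common approximation C ≈ 6·N·D for training FLOPs (a rule of thumb I am adding here, not something stated in the table) together with the Chinchilla ratio of roughly 20 tokens per parameter:

```python
import math

def chinchilla_allocation(compute_flop: float, tokens_per_param: float = 20.0):
    """Split a training budget using C ~= 6*N*D and D ~= tokens_per_param * N."""
    n_params = math.sqrt(compute_flop / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e23, 1e24, 1e25):
    n, d = chinchilla_allocation(budget)
    print(f"C = {budget:.0e} FLOP -> ~{n/1e9:.0f}B params, ~{d/1e12:.1f}T tokens")
```

Plugging in Chinchilla's own budget of about 5.8e23 FLOP recovers roughly 70B parameters and 1.4T tokens, which matches the published model.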

Sutton's Second Point

"We should build in only the meta-methods that can find and capture arbitrary complexity. We want AI agents that can discover like we can, not which contain what we have discovered."

This is the constructive corollary: instead of encoding human knowledge, build systems that can acquire knowledge. The contents of minds are irredeemably complex. Don't try to specify them—let the system learn them.

Don't build in

  • - Object permanence
  • - Spatial reasoning rules
  • - Theory of mind
  • - Causal models

Do build in

  • + Learning algorithms
  • + Search procedures
  • + Scalable architectures
  • + Compute efficiency
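As a toy illustration of this point (my example, in the spirit of Sutton's reinforcement-learning work, not something from the essay): tabular Q-learning builds in only a generic update rule and an exploration strategy, and the values it ends up with are discovered rather than specified.

```python
import random

# Toy chain environment: states 0..4, move left/right, reward only at the right end.
N_STATES, GOAL = 5, 4
def step(state: int, action: int):
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

Q = [[0.0, 0.0] for _ in range(N_STATES)]      # no values built in -- all learned
alpha, gamma, eps = 0.1, 0.95, 0.1

for _ in range(2000):
    s, done = 0, False
    while not done:
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        # Q-learning update: a generic rule with no knowledge of this particular task.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# Learned values rise as states get closer to the goal (the terminal state stays 0).
print([round(max(q), 2) for q in Q])
```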

Agentic AI: The Latest Evidence

The METR evaluations show the Bitter Lesson extending to autonomous agents. As models scale, their ability to complete complex, multi-step tasks scales too—without explicit task-specific engineering.

METR time-horizon scaling (50% success rate on tasks requiring this duration of autonomous work):

  • GPT-4 (2023): 15 min
  • Claude 3 Opus: 75 min
  • o1-preview: 120 min
  • GPT-5.1: 160 min
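Taking the figures above at face value, the implied doubling rate is simple arithmetic. The roughly 30-month span between GPT-4 and the latest data point is my assumption here, not a METR-reported number:

```python
import math

start_minutes, end_minutes = 15, 160        # GPT-4 vs. the latest model above
span_months = 30                            # assumed: roughly early 2023 to mid 2025
doublings = math.log2(end_minutes / start_minutes)
print(f"{doublings:.1f} doublings -> one doubling every {span_months / doublings:.0f} months")
```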

Hardware: The Engine of the Bitter Lesson

The Bitter Lesson is fundamentally about Moore's Law. Compute gets cheaper exponentially. Human knowledge doesn't compound the same way.

GPU Evolution

2012: GTX 680 (3 TFLOPS FP32)
2017: V100 (125 TFLOPS Tensor)
2020: A100 (312 TFLOPS Tensor)
2022: H100 (989 TFLOPS Tensor)
2024: B200 (2,250 TFLOPS Tensor)

Training Compute Growth

AlexNet (2012): ~10^17 FLOP
GPT-2 (2019): ~10^20 FLOP
GPT-3 (2020): ~10^23 FLOP
GPT-4 (2023): ~10^25 FLOP
Frontier (2025): ~10^26+ FLOP

Implication: Every 2-3 years, the "brute force" approach that seemed impractical becomes practical. Plan accordingly.
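The entries in the growth list can be sanity-checked with the same C ≈ 6·N·D rule of thumb used in the scaling-laws section; the 300B-token figure for GPT-3 comes from its paper rather than from this list:

```python
n_params = 175e9          # GPT-3 parameters
n_tokens = 300e9          # GPT-3 training tokens (reported in the GPT-3 paper)
train_flop = 6 * n_params * n_tokens
print(f"~{train_flop:.1e} FLOP")   # about 3e23, consistent with the ~10^23 entry above
```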

But What About...

Common objections to the Bitter Lesson, and why they don't hold:

"But architecture matters!"

Transformers beat RNNs not because of inductive bias, but because they parallelize. They enabled more compute utilization. The Bitter Lesson stands.
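A toy numpy sketch makes the parallelism argument concrete (illustrative only, not a real transformer or RNN): the recurrent update is forced to run as a loop over time steps, while causal self-attention for every position reduces to a few batched matrix multiplies.

```python
import numpy as np

T, d = 128, 64                              # sequence length, model width
x = np.random.randn(T, d)

# RNN: inherently sequential -- each step depends on the previous hidden state.
Wh, Wx = np.random.randn(d, d) / np.sqrt(d), np.random.randn(d, d) / np.sqrt(d)
h, rnn_states = np.zeros(d), []
for t in range(T):                          # T dependent steps; cannot parallelize over time
    h = np.tanh(x[t] @ Wx + h @ Wh)
    rnn_states.append(h)

# Self-attention: every position handled in one batch of matrix multiplies.
Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)               # (T, T): all position pairs at once
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -1e9   # causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ V                      # (T, d), no sequential dependency
```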

"Human knowledge helped AlphaGo"

AlphaGo used human games initially. AlphaGo Zero used none and performed better. More compute, less human knowledge, better results.

"Data quality matters"

Data curation is valuable, but deduplication and filtering are algorithmic. The trend is automated data pipelines at scale, not human annotation.

"Efficiency research saves compute"

Efficiency gains (FlashAttention, quantization) are valuable precisely because they let you use more effective compute. They serve the Bitter Lesson.
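As one concrete example, here is a minimal sketch of symmetric per-tensor int8 quantization (illustrative only, far simpler than production schemes): the same weights fit in a quarter of the memory, which in practice buys a larger model or a longer context on the same hardware.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: 4x smaller than float32."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)    # one hypothetical weight matrix
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"memory: {w.nbytes/1e6:.0f} MB -> {q.nbytes/1e6:.0f} MB, mean abs error {err:.4f}")
```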

Practical Implications

For Researchers

  • - Prefer methods that scale with compute
  • - Be skeptical of "clever tricks" that don't scale
  • - Focus on removing bottlenecks to scaling
  • - Invest in infrastructure as much as algorithms
  • - Benchmark at multiple scales before concluding

For Practitioners

  • - Use the largest model you can afford
  • - Fine-tuning often beats custom architectures
  • - Wait for foundation models before building from scratch
  • - Your domain knowledge is less unique than you think
  • - Compute cost drops; engineer time doesn't

For Startups

  • - Don't compete on model training with big labs
  • - Application layer and data moats matter more
  • - Assume capabilities will continue scaling
  • - Build for the models that will exist in 2 years
  • - Vertical integration is risky in fast-moving capability space

For Society

  • - Compute access becomes a key resource
  • - Centralization risk from compute concentration
  • - Efficiency research has democratizing potential
  • - Safety research must scale with capabilities
  • - Economic disruption from continued scaling

The Sweet Corollary

The Bitter Lesson has a silver lining: if general methods + compute win, then progress is predictable. We don't need to wait for conceptual breakthroughs. We need to scale.

This doesn't mean breakthroughs don't matter—Transformers mattered enormously. But they mattered because they enabled scaling, not because they encoded human knowledge about language.
