
The Bitter Lesson

Rich Sutton's 2019 observation that changed how we think about AI progress: general methods leveraging computation beat human-engineered approaches. Every time.

December 2025 | 15 min read | Original Essay

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."

Rich Sutton, March 2019

Why "Bitter"?

The lesson is bitter because it contradicts what researchers want to believe. We want our domain expertise to matter. We want clever algorithms to beat brute force. We want human insight to be the key ingredient.

But history tells a different story. Again and again, methods that seemed "inelegant" or "brute force" eventually dominated—once enough compute became available.

The Pattern

  1. Researchers build systems encoding human domain knowledge
  2. This works in the short term and feels intellectually satisfying
  3. Progress plateaus as human knowledge becomes the bottleneck
  4. Someone tries a "simple" approach with more compute
  5. The compute-heavy approach wins decisively
  6. Researchers resist the result, then eventually accept it and move on

Historical Evidence

Every major AI breakthrough follows the same pattern. Here's the timeline:

1997: Deep Blue defeats Kasparov
Brute-force search with specialized hardware beat decades of chess knowledge engineering.
Search + compute > human chess knowledge

2011: IBM Watson wins Jeopardy!
Statistical methods over massive text corpora defeated human champions.
Scale > hand-crafted QA rules

2012: AlexNet wins ImageNet
Deep learning on GPUs crushed hand-engineered features (SIFT, HOG).
Learned features > engineered features

2016: AlphaGo defeats Lee Sedol
Self-play + MCTS beat 2,500 years of Go theory.
Search + learning > human intuition

2017: Transformer architecture
Attention replaced recurrence and enabled massive parallelization.
Parallelizable compute > sequential constraints

2020: GPT-3 (175B parameters)
Few-shot learning emerged from scale alone, with no task-specific training.
Scale creates capabilities

2022: Chinchilla scaling laws
DeepMind showed compute-optimal training at roughly 20 tokens per parameter.
Data and compute must scale together

2023: GPT-4 multimodal
Vision, code, and reasoning from a unified transformer plus scale.
General architecture + scale > specialized models

2024: o1 test-time compute
Reasoning improves with inference-time compute scaling.
Compute at inference matters too

2025: Agentic AI (METR benchmarks)
Autonomous task completion scales with model capability.
General agents > task-specific automation

Chess: Where It All Started

The Knowledge Approach

For decades, chess AI researchers hand-coded opening books, endgame tablebases, positional evaluation heuristics, and strategic concepts from grandmaster play.

  • - Piece-square tables for positional value
  • - King safety heuristics
  • - Pawn structure evaluation
  • - Endgame patterns from theory

The Compute Approach

Deep Blue evaluated about 200 million positions per second. Raw search depth plus a simple evaluation function beat sophisticated knowledge; as Sutton noted, the knowledge-based chess researchers were "not good losers."

  • + Specialized hardware (480 chess chips)
  • + Alpha-beta search at massive depth
  • + Simple material + mobility evaluation
  • + Brute force wins

Today, Stockfish (deep search with a small learned evaluation network) and Leela Chess Zero (a neural network trained purely by self-play) both crush any human. Neither relies on hand-coded human chess knowledge in any meaningful way.
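The core of the compute approach is small enough to sketch. Below is a minimal alpha-beta search in Python; the `Game` interface, search depth, and evaluation hook are hypothetical placeholders rather than anything from Deep Blue's actual implementation, which relied on specialized hardware and a far more elaborate evaluator.

```python
from typing import Iterable

# Hypothetical minimal game interface -- any two-player, zero-sum game fits.
class Game:
    def legal_moves(self) -> Iterable["Game"]: ...   # successor positions
    def is_terminal(self) -> bool: ...
    def evaluate(self) -> float: ...                  # cheap heuristic, e.g. material count

def alphabeta(pos: Game, depth: int, alpha: float, beta: float, maximizing: bool) -> float:
    """Plain alpha-beta search: strength comes from depth, not encoded chess knowledge."""
    if depth == 0 or pos.is_terminal():
        return pos.evaluate()
    if maximizing:
        value = float("-inf")
        for child in pos.legal_moves():
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # prune: the opponent will never allow this line
                break
        return value
    else:
        value = float("inf")
        for child in pos.legal_moves():
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break
        return value

# Usage sketch (requires a concrete Game implementation):
# best = max(pos.legal_moves(),
#            key=lambda c: alphabeta(c, 8, float("-inf"), float("inf"), False))
```

The point is that the evaluation stays deliberately simple; the system gets stronger by searching deeper as hardware allows, not by making `evaluate` smarter.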

Modern Evidence: Large Language Models

LLMs are perhaps the purest expression of the Bitter Lesson. No linguistic rules. No syntax trees. No semantic ontologies. Just next-token prediction at scale.

What GPT Doesn't Have

  • - No grammar rules
  • - No knowledge graphs
  • - No reasoning modules
  • - No world models
  • - No symbolic logic
  • - No explicit memory

What GPT Does Have

  • + 175B+ parameters (GPT-3)
  • + Trillions of training tokens
  • + Thousands of GPUs
  • + Months of training time

Result: emergent capabilities including reasoning, coding, translation, and creative writing—none of which were explicitly programmed.
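Stripped of scale, the training objective itself is remarkably plain. The sketch below shows one step of next-token prediction with cross-entropy loss in PyTorch; the tiny embedding-plus-linear stand-in model and the random token batch are illustrative assumptions, not GPT's architecture or data.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 50_000, 256, 128, 8   # toy sizes, not GPT-3's

# Stand-in for a transformer stack: embed tokens, project back to vocabulary logits.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, vocab_size, (batch, seq_len))      # placeholder for real text
inputs, targets = tokens[:, :-1], tokens[:, 1:]               # predict token t+1 from <= t

logits = model(inputs)                                        # (batch, seq_len-1, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()
# GPT-style pretraining is essentially this loop, repeated over trillions of tokens
# on thousands of GPUs; no grammar rules, ontologies, or task-specific modules.
```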

Scaling Laws: The Math of the Bitter Lesson

The Bitter Lesson isn't just philosophy—it's now quantified. Scaling laws show predictable performance improvements from compute:

Law | Lab | Finding | Interpretation
Kaplan (2020) | OpenAI | N_opt ∝ C^0.73 | Performance scales as a power law with compute; prioritize model size.
Chinchilla (2022) | DeepMind | N_opt ∝ C^0.50 | Optimal: ~20 tokens per parameter; data and model size scale equally.
Llama 3 (2024) | Meta | 200+ tokens per parameter | Overtraining beyond Chinchilla-optimal improves inference efficiency.
o1 (2024) | OpenAI | Test-time scaling | Compute at inference improves reasoning; both train and test compute matter.

The implication: If you have a fixed compute budget, the scaling laws tell you exactly how to allocate it between model size and data. Human intuition about architecture becomes secondary to these mathematical relationships.
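To make that concrete, here is a small calculation using the common approximation C ≈ 6·N·D for training FLOPs (a rule of thumb I am adding here, not something stated in the table) together with the Chinchilla ratio of roughly 20 tokens per parameter:

```python
import math

def chinchilla_allocation(compute_flop: float, tokens_per_param: float = 20.0):
    """Split a training budget using C ~= 6*N*D and D ~= tokens_per_param * N."""
    n_params = math.sqrt(compute_flop / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e23, 1e24, 1e25):
    n, d = chinchilla_allocation(budget)
    print(f"C = {budget:.0e} FLOP -> ~{n/1e9:.0f}B params, ~{d/1e12:.1f}T tokens")
```

Plugging in Chinchilla's own budget of about 5.8e23 FLOP recovers roughly 70B parameters and 1.4T tokens, which matches the published model.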

Sutton's Second Point

"We should build in only the meta-methods that can find and capture arbitrary complexity. We want AI agents that can discover like we can, not which contain what we have discovered."

This is the constructive corollary: instead of encoding human knowledge, build systems that can acquire knowledge. The contents of minds are irredeemably complex. Don't try to specify them—let the system learn them.

Don't build in

  • - Object permanence
  • - Spatial reasoning rules
  • - Theory of mind
  • - Causal models

Do build in

  • + Learning algorithms
  • + Search procedures
  • + Scalable architectures
  • + Compute efficiency
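As a toy illustration of this point (my example, in the spirit of Sutton's reinforcement-learning work, not something from the essay): tabular Q-learning builds in only a generic update rule and an exploration strategy, and the values it ends up with are discovered rather than specified.

```python
import random

# Toy chain environment: states 0..4, move left/right, reward only at the right end.
N_STATES, GOAL = 5, 4
def step(state: int, action: int):
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

Q = [[0.0, 0.0] for _ in range(N_STATES)]      # no values built in -- all learned
alpha, gamma, eps = 0.1, 0.95, 0.1

for _ in range(2000):
    s, done = 0, False
    while not done:
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        # Q-learning update: a generic rule with no knowledge of this particular task.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# Learned values rise as states get closer to the goal (the terminal state stays 0).
print([round(max(q), 2) for q in Q])
```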

Agentic AI: The Latest Evidence

The METR evaluations show the Bitter Lesson extending to autonomous agents. As models scale, their ability to complete complex, multi-step tasks scales too—without explicit task-specific engineering.

METR time-horizon scaling (50% success rate on tasks requiring this duration of autonomous work):

  • GPT-4 (2023): 15 min
  • Claude 3 Opus: 75 min
  • o1-preview: 120 min
  • GPT-5.1: 160 min
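Taking the figures above at face value, the implied doubling rate is simple arithmetic. The roughly 30-month span between GPT-4 and the latest data point is my assumption here, not a METR-reported number:

```python
import math

start_minutes, end_minutes = 15, 160        # GPT-4 vs. the latest model above
span_months = 30                            # assumed: roughly early 2023 to mid 2025
doublings = math.log2(end_minutes / start_minutes)
print(f"{doublings:.1f} doublings -> one doubling every {span_months / doublings:.0f} months")
```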

Hardware: The Engine of the Bitter Lesson

The Bitter Lesson is fundamentally about Moore's Law. Compute gets cheaper exponentially. Human knowledge doesn't compound the same way.

GPU Evolution

2012: GTX 680 (3 TFLOPS FP32)
2017: V100 (125 TFLOPS Tensor)
2020: A100 (312 TFLOPS Tensor)
2022: H100 (989 TFLOPS Tensor)
2024: B200 (2,250 TFLOPS Tensor)

Training Compute Growth

AlexNet (2012): ~10^17 FLOP
GPT-2 (2019): ~10^20 FLOP
GPT-3 (2020): ~10^23 FLOP
GPT-4 (2023): ~10^25 FLOP
Frontier (2025): ~10^26+ FLOP

Implication: Every 2-3 years, the "brute force" approach that seemed impractical becomes practical. Plan accordingly.
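The entries in the growth list can be sanity-checked with the same C ≈ 6·N·D rule of thumb used in the scaling-laws section; the 300B-token figure for GPT-3 comes from its paper rather than from this list:

```python
n_params = 175e9          # GPT-3 parameters
n_tokens = 300e9          # GPT-3 training tokens (reported in the GPT-3 paper)
train_flop = 6 * n_params * n_tokens
print(f"~{train_flop:.1e} FLOP")   # about 3e23, consistent with the ~10^23 entry above
```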

But What About...

Common objections to the Bitter Lesson, and why they don't hold:

"But architecture matters!"

Transformers beat RNNs not because of inductive bias, but because they parallelize. They enabled more compute utilization. The Bitter Lesson stands.
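A toy numpy sketch makes the parallelism argument concrete (illustrative only, not a real transformer or RNN): the recurrent update is forced to run as a loop over time steps, while causal self-attention for every position reduces to a few batched matrix multiplies.

```python
import numpy as np

T, d = 128, 64                              # sequence length, model width
x = np.random.randn(T, d)

# RNN: inherently sequential -- each step depends on the previous hidden state.
Wh, Wx = np.random.randn(d, d) / np.sqrt(d), np.random.randn(d, d) / np.sqrt(d)
h, rnn_states = np.zeros(d), []
for t in range(T):                          # T dependent steps; cannot parallelize over time
    h = np.tanh(x[t] @ Wx + h @ Wh)
    rnn_states.append(h)

# Self-attention: every position handled in one batch of matrix multiplies.
Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)               # (T, T): all position pairs at once
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -1e9   # causal mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ V                      # (T, d), no sequential dependency
```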

"Human knowledge helped AlphaGo"

AlphaGo used human games initially. AlphaGo Zero used none and performed better. More compute, less human knowledge, better results.

"Data quality matters"

Data curation is valuable, but deduplication and filtering are algorithmic. The trend is automated data pipelines at scale, not human annotation.

"Efficiency research saves compute"

Efficiency gains (FlashAttention, quantization) are valuable precisely because they let you use more effective compute. They serve the Bitter Lesson.
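As one concrete example, here is a minimal sketch of symmetric per-tensor int8 quantization (illustrative only, far simpler than production schemes): the same weights fit in a quarter of the memory, which in practice buys a larger model or a longer context on the same hardware.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: 4x smaller than float32."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)    # one hypothetical weight matrix
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"memory: {w.nbytes/1e6:.0f} MB -> {q.nbytes/1e6:.0f} MB, mean abs error {err:.4f}")
```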

Practical Implications

For Researchers

  • - Prefer methods that scale with compute
  • - Be skeptical of "clever tricks" that don't scale
  • - Focus on removing bottlenecks to scaling
  • - Invest in infrastructure as much as algorithms
  • - Benchmark at multiple scales before concluding

For Practitioners

  • - Use the largest model you can afford
  • - Fine-tuning often beats custom architectures
  • - Wait for foundation models before building from scratch
  • - Your domain knowledge is less unique than you think
  • - Compute cost drops; engineer time doesn't

For Startups

  • - Don't compete on model training with big labs
  • - Application layer and data moats matter more
  • - Assume capabilities will continue scaling
  • - Build for the models that will exist in 2 years
  • - Vertical integration is risky in fast-moving capability space

For Society

  • - Compute access becomes a key resource
  • - Centralization risk from compute concentration
  • - Efficiency research has democratizing potential
  • - Safety research must scale with capabilities
  • - Economic disruption from continued scaling

The Sweet Corollary

The Bitter Lesson has a silver lining: if general methods + compute win, then progress is predictable. We don't need to wait for conceptual breakthroughs. We need to scale.

This doesn't mean breakthroughs don't matter—Transformers mattered enormously. But they mattered because they enabled scaling, not because they encoded human knowledge about language.
