The Bitter Lesson
Rich Sutton's 2019 observation that changed how we think about AI progress: general methods leveraging computation beat human-engineered approaches. Every time.
"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
Why "Bitter"?
The lesson is bitter because it contradicts what researchers want to believe. We want our domain expertise to matter. We want clever algorithms to beat brute force. We want human insight to be the key ingredient.
But history tells a different story. Again and again, methods that seemed "inelegant" or "brute force" eventually dominated—once enough compute became available.
The Pattern
- Researchers build systems encoding human domain knowledge
- This works in the short term and feels intellectually satisfying
- Progress plateaus as human knowledge becomes the bottleneck
- Someone tries a "simple" approach with more compute
- The compute-heavy approach wins decisively
- Researchers resist the result at first; eventually the field moves on
Historical Evidence
Every major AI breakthrough follows the same pattern. Here's the timeline:
- Chess (Deep Blue, 1997): brute-force search with specialized hardware beat decades of chess knowledge engineering.
- Jeopardy! (IBM Watson, 2011): statistical methods over massive text corpora defeated human champions.
- ImageNet (AlexNet, 2012): deep learning on GPUs crushed hand-engineered features (SIFT, HOG).
- Go (AlphaGo and successors, 2016-2017): self-play plus MCTS beat 2,500 years of Go theory.
- Transformers (2017): attention replaced recurrence and enabled massive parallelization.
- GPT-3 (2020): few-shot learning emerged from scale alone, with no task-specific training.
- Chinchilla (2022): DeepMind quantified compute-optimal training at roughly 20 tokens per parameter.
- Multimodal frontier models (2023): vision, code, and reasoning from a unified transformer plus scale.
- o1 (2024): reasoning improves with inference-time compute scaling.
- Agentic AI (2024-): autonomous task completion scales with model capability.
Chess: Where It All Started
The Knowledge Approach
For decades, chess AI researchers hand-coded opening books, endgame tablebases, positional evaluation heuristics, and strategic concepts from grandmaster play.
- Piece-square tables for positional value
- King safety heuristics
- Pawn structure evaluation
- Endgame patterns from theory
The Compute Approach
Deep Blue evaluated roughly 200 million positions per second. Raw search depth plus a simple evaluation beat sophisticated knowledge, and, in Sutton's telling, the chess researchers who had bet on human knowledge were "not good losers."
- Specialized hardware (480 chess chips)
- Alpha-beta search at massive depth
- Simple material + mobility evaluation
- Brute force wins
Today, Stockfish (deep search plus a fast, simple evaluation) and Leela Chess Zero (a pure neural network trained from self-play) both crush any human player. Neither relies on human chess knowledge in any meaningful way.
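To make the "deep search plus a simple evaluation" recipe concrete, here is a toy sketch of alpha-beta pruned negamax, the search family behind Deep Blue and Stockfish. The `GameState` interface and its `evaluate` method are hypothetical placeholders rather than a real chess engine API; the point is that the sophistication sits in the search, not the evaluation.

```python
# Toy sketch: alpha-beta pruned negamax over a hypothetical GameState interface.
# The evaluation is deliberately simple (e.g. material + mobility); depth does the work.
from typing import Iterable, Protocol


class GameState(Protocol):
    def legal_moves(self) -> Iterable["GameState"]: ...
    def is_terminal(self) -> bool: ...
    def evaluate(self) -> float: ...  # score from the side to move's perspective


def negamax(state: GameState, depth: int,
            alpha: float = float("-inf"), beta: float = float("inf")) -> float:
    """Return the best achievable score for the side to move, searching `depth` plies."""
    if depth == 0 or state.is_terminal():
        return state.evaluate()
    best = float("-inf")
    for child in state.legal_moves():
        # A position that is good for the opponent is bad for us, hence the negation.
        score = -negamax(child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:  # prune: the opponent will never allow this line
            break
    return best
```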
Modern Evidence: Large Language Models
LLMs are perhaps the purest expression of the Bitter Lesson. No linguistic rules. No syntax trees. No semantic ontologies. Just next-token prediction at scale.
What GPT doesn't have: hand-written grammar rules, parse trees, or semantic ontologies.
What GPT does have: a general architecture, vast amounts of text, and enormous compute, all pointed at the single objective of predicting the next token.
Result: emergent capabilities including reasoning, coding, translation, and creative writing—none of which were explicitly programmed.
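To show how little is specified by hand, here is a minimal sketch of the next-token training objective in PyTorch. The tiny embedding-plus-linear model is a stand-in for a real transformer (which would attend over the whole prefix); the single cross-entropy loss is the point.

```python
# Minimal sketch of the next-token objective. The model here is a stand-in for a
# transformer; the single cross-entropy loss below is the entire training signal.
import torch
import torch.nn.functional as F

vocab_size, dim = 1000, 64
embed = torch.nn.Embedding(vocab_size, dim)
head = torch.nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 128))   # a batch of token-id sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens up to t
logits = head(embed(inputs))                      # (batch, seq-1, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # no grammar, no parse trees: just this gradient
```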
Scaling Laws: The Math of the Bitter Lesson
The Bitter Lesson isn't just philosophy—it's now quantified. Scaling laws show predictable performance improvements from compute:
| Law | Finding | Interpretation |
|---|---|---|
| Kaplan et al. (2020), OpenAI | N_opt ∝ C^0.73 | Performance follows a power law in compute; prioritize model size. |
| Chinchilla (2022), DeepMind | N_opt ∝ C^0.50 | Optimal is roughly 20 tokens per parameter; scale data and model size equally. |
| Llama 3 (2024), Meta | 200+ tokens per parameter | Overtraining beyond the Chinchilla optimum improves inference efficiency. |
| o1 (2024), OpenAI | Test-time scaling | Compute spent at inference improves reasoning; both train-time and test-time compute matter. |
The implication: If you have a fixed compute budget, the scaling laws tell you exactly how to allocate it between model size and data. Human intuition about architecture becomes secondary to these mathematical relationships.
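As a back-of-the-envelope illustration (not a planning tool), the sketch below allocates a fixed FLOPs budget using the common C ≈ 6·N·D approximation and the ~20 tokens-per-parameter rule of thumb from the table above; the budget value is illustrative.

```python
# Back-of-the-envelope Chinchilla-style allocation of a fixed compute budget.
# Assumes the common approximation C ≈ 6 * N * D (FLOPs ≈ 6 x params x tokens)
# and the ~20 tokens-per-parameter rule of thumb.
def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


budget = 1e24                                  # FLOPs budget (illustrative)
n, d = chinchilla_allocation(budget)
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")
```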
Sutton's Second Point
"We should build in only the meta-methods that can find and capture arbitrary complexity. We want AI agents that can discover like we can, not which contain what we have discovered."
This is the constructive corollary: instead of encoding human knowledge, build systems that can acquire knowledge. The contents of minds are irredeemably complex. Don't try to specify them—let the system learn them.
Don't build in:
- Object permanence
- Spatial reasoning rules
- Theory of mind
- Causal models
Do build in:
- Learning algorithms
- Search procedures
- Scalable architectures
- Compute efficiency
Agentic AI: The Latest Evidence
The METR evaluations show the Bitter Lesson extending to autonomous agents. As models scale, their ability to complete complex, multi-step tasks scales too—without explicit task-specific engineering.
[Chart: METR time-horizon scaling. Task duration at which models reach a 50% success rate on autonomous work.]
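A hedged sketch of how such a trend is summarized: fit a line to the log of the time horizon and read off a doubling time. The data points below are illustrative placeholders, not METR's published measurements.

```python
# Sketch: estimating a doubling time for an exponential trend in task time horizons.
# The data points are illustrative placeholders, not METR's actual measurements.
import numpy as np

months_since_ref = np.array([0.0, 12.0, 24.0, 36.0])   # model release dates (months)
horizon_minutes = np.array([4.0, 12.0, 40.0, 120.0])   # hypothetical 50%-success horizons

slope, _ = np.polyfit(months_since_ref, np.log2(horizon_minutes), 1)
print(f"time horizon doubles roughly every {1 / slope:.1f} months (on this toy data)")
```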
Hardware: The Engine of the Bitter Lesson
The Bitter Lesson rides on exponentially cheaper computation: Moore's Law and its successors in GPU and accelerator scaling. Compute gets cheaper exponentially; human knowledge doesn't compound the same way.
[Charts: GPU performance evolution and training compute growth over time.]
Implication: Every 2-3 years, the "brute force" approach that seemed impractical becomes practical. Plan accordingly.
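A trivial compounding calculation makes the point; both the doubling period and the dollar figure are assumed, illustrative values, not measured ones.

```python
# Sketch: how an assumed doubling of price-performance every ~2.5 years shrinks the
# cost of a fixed training run. Both the rate and the $10M figure are illustrative.
def cost_after(initial_cost_usd: float, years: float, doubling_years: float = 2.5) -> float:
    return initial_cost_usd / (2 ** (years / doubling_years))


for years in (0, 2.5, 5, 7.5, 10):
    print(f"year {years:4.1f}: ${cost_after(10_000_000, years):,.0f}")
```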
But What About...
Common objections to the Bitter Lesson, and why they don't hold:
"But architecture matters!"
Transformers beat RNNs not because of a superior inductive bias, but because they parallelize across the sequence and can therefore absorb far more compute. The Bitter Lesson stands.
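A minimal sketch of the difference, with illustrative shapes: the recurrent update is an unavoidable serial loop over time steps, while self-attention touches every position in a single batched matrix product.

```python
# Sketch: why attention parallelizes and recurrence does not (shapes are illustrative).
import torch

seq, dim = 512, 64
x = torch.randn(seq, dim)

# RNN-style update: each hidden state depends on the previous one, so the loop is serial.
W_x, W_h = torch.randn(dim, dim), torch.randn(dim, dim)
h = torch.zeros(dim)
for t in range(seq):                                    # cannot be parallelized across t
    h = torch.tanh(x[t] @ W_x + h @ W_h)

# Self-attention: all positions are computed at once with dense, GPU-friendly matmuls.
W_q, W_k, W_v = (torch.randn(dim, dim) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
out = torch.softmax(q @ k.T / dim ** 0.5, dim=-1) @ v   # (seq, dim), every position in parallel
```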
"Human knowledge helped AlphaGo"
AlphaGo used human games initially. AlphaGo Zero used none and performed better. More compute, less human knowledge, better results.
"Data quality matters"
Data curation is valuable, but deduplication and filtering are algorithmic. The trend is automated data pipelines at scale, not human annotation.
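As one concrete example of an algorithmic filter, here is a sketch of exact deduplication by content hash; production pipelines add near-duplicate detection (e.g. MinHash), but the shape is the same.

```python
# Sketch: exact deduplication by content hash, one example of an algorithmic data filter.
import hashlib

def dedup(docs: list[str]) -> list[str]:
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(dedup(["The cat sat.", "the cat sat.", "A different sentence."]))
```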
"Efficiency research saves compute"
Efficiency gains (FlashAttention, quantization) are valuable precisely because they turn a fixed hardware budget into more effective compute. They serve the Bitter Lesson.
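For instance, a sketch of symmetric int8 weight quantization, the kind of trick that stretches the same hardware into more usable compute; the single per-tensor scale here is a simplification of what real kernels do.

```python
# Sketch: symmetric int8 weight quantization with a single per-tensor scale
# (real deployments typically use per-channel or per-group scales).
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs reconstruction error:", np.abs(w - q.astype(np.float32) * scale).max())
```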
Practical Implications
For Researchers
- Prefer methods that scale with compute
- Be skeptical of "clever tricks" that don't scale
- Focus on removing bottlenecks to scaling
- Invest in infrastructure as much as algorithms
- Benchmark at multiple scales before concluding
For Practitioners
- Use the largest model you can afford
- Fine-tuning often beats custom architectures
- Wait for foundation models before building from scratch
- Your domain knowledge is less unique than you think
- Compute cost drops; engineer time doesn't
For Startups
- Don't compete with the big labs on model training
- The application layer and data moats matter more
- Assume capabilities will continue scaling
- Build for the models that will exist in two years
- Vertical integration is risky in a fast-moving capability space
For Society
- Compute access becomes a key resource
- Centralization risk from compute concentration
- Efficiency research has democratizing potential
- Safety research must scale with capabilities
- Economic disruption from continued scaling
The Sweet Corollary
The Bitter Lesson has a silver lining: if general methods + compute win, then progress is predictable. We don't need to wait for conceptual breakthroughs. We need to scale.
This doesn't mean breakthroughs don't matter—Transformers mattered enormously. But they mattered because they enabled scaling, not because they encoded human knowledge about language.