The Timeline
Each node marks a moment that changed what was possible. Click through 13 years of compounding breakthroughs.
DQN — Playing Atari from Pixels
Deep Q-Network (DQN) combined deep convolutional networks with Q-learning, learning directly from raw pixel input to achieve superhuman performance on multiple Atari 2600 games. Published in Nature in 2015, it is the work that started the deep RL era.
Proved neural networks could replace hand-crafted features in RL.
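The core of DQN's update is the one-step temporal-difference target. A minimal sketch, with the network's Q-value outputs stubbed as plain arrays (the real agent computes them with a convolutional net over pixel frames):

```python
import numpy as np

def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * float(np.max(next_q_values))

def td_loss(q_pred, reward, next_q_values, gamma=0.99, done=False):
    """Squared TD error; DQN minimizes this by gradient descent
    on the network that produced q_pred."""
    return (q_pred - td_target(reward, next_q_values, gamma, done)) ** 2

# e.g. reward 1, best next-state value 2.0 -> target = 1 + 0.99 * 2.0 = 2.98
```

The Nature version adds two stabilizers not shown here: a replay buffer that decorrelates samples, and a slowly updated target network that computes `next_q_values`.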
Double DQN & Dueling DQN
Double DQN addressed overestimation bias by decoupling action selection from evaluation. Dueling DQN separated value and advantage streams, improving policy evaluation in states with many similar-valued actions.
Established that architectural innovations could significantly boost sample efficiency.
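The Double DQN fix is small enough to show directly. An illustrative comparison, with both networks' Q-value outputs stubbed as arrays:

```python
import numpy as np

def dqn_target(reward, next_q_target, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the
    # next action, so noise in its estimates inflates the max.
    return reward + gamma * float(np.max(next_q_target))

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    a = int(np.argmax(next_q_online))
    return reward + gamma * float(next_q_target[a])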
A3C & AlphaGo
Asynchronous Advantage Actor-Critic (A3C) introduced parallel actor-learners for stable training. AlphaGo defeated Lee Sedol 4-1 using policy and value networks with Monte Carlo tree search — a watershed moment for AI.
Combined search + learning dominated the hardest classical game. Policy gradient methods matured.
PPO & AlphaZero
Proximal Policy Optimization became the default policy gradient algorithm — simple, stable, scalable. AlphaZero mastered chess, shogi, and Go from self-play alone with zero human knowledge.
PPO remains the backbone of most RL applications including RLHF. AlphaZero showed tabula rasa learning works.
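PPO's stability comes from one line: the clipped surrogate objective. A minimal per-sample sketch (a real implementation averages this over a batch and adds value and entropy terms):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO-Clip surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r = pi_new(a|s) / pi_old(a|s). Clipping removes any
    incentive to push the ratio outside [1-eps, 1+eps] in the
    direction the advantage favors, keeping updates conservative."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)
```

For a positive advantage the objective stops growing once the ratio exceeds 1+eps; for a negative advantage it stops shrinking below 1-eps.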
SAC & World Models
Soft Actor-Critic introduced entropy regularization for robust continuous control. World Models learned compressed representations of environments, enabling agents to "dream" and train in imagination.
SAC became the go-to for robotics. World Models sparked the model-based RL renaissance.
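SAC's entropy regularization amounts to valuing states by expected Q-value plus a bonus for policy entropy. An illustrative sketch for a discrete policy (SAC itself works with continuous Gaussian policies, but the soft value has the same form):

```python
import numpy as np

def soft_value(q_values, log_probs, alpha=0.2):
    """Soft state value: E_a[Q(s,a) - alpha * log pi(a|s)].
    The -alpha * log pi term is the entropy bonus; alpha trades off
    reward maximization against keeping the policy stochastic."""
    probs = np.exp(log_probs)
    return float(np.sum(probs * (q_values - alpha * log_probs)))
```

A uniform policy earns the maximum entropy bonus; as the policy becomes deterministic the bonus vanishes and the soft value reduces to the ordinary expected Q-value.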
MuZero & OpenAI Five
MuZero planned without knowing the rules — learning a dynamics model, reward model, and policy end-to-end. OpenAI Five defeated world champions at Dota 2, coordinating 5 agents over 45-minute games.
Model-based planning without ground-truth models. Multi-agent coordination at scale.
Dreamer & Agent57
Dreamer v1/v2 trained policies entirely inside learned world models, achieving strong results with far fewer environment interactions. Agent57 was the first agent to outperform the human baseline on all 57 Atari games.
Closed the loop on Atari — superhuman across the full suite. Model-based RL became practical.
Decision Transformer
Reframed RL as sequence modeling: condition a transformer on desired returns, past states, and actions. No value functions, no policy gradients — just autoregressive prediction.
Bridged RL and large language models. Opened the door to offline RL at scale.
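The reframing is concrete: compute returns-to-go, then interleave (return, state, action) triples into one token sequence. A minimal sketch of that data layout (states and actions stubbed as opaque values):

```python
def returns_to_go(rewards):
    """R_t = sum of rewards from step t to the end of the episode."""
    out, total = [], 0.0
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]

def dt_sequence(rtgs, states, actions):
    """Interleave (R_t, s_t, a_t) into the token order a Decision
    Transformer trains on; at test time you feed a high desired
    return and autoregressively sample the action tokens."""
    seq = []
    for R, s, a in zip(rtgs, states, actions):
        seq.extend([("R", R), ("s", s), ("a", a)])
    return seq
```

Training is then ordinary supervised next-token prediction on the action positions; no Bellman backup ever appears.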
RLHF Powers ChatGPT
Reinforcement Learning from Human Feedback (RLHF) used PPO to align GPT models with human preferences. ChatGPT launched and became the fastest-growing consumer app in history.
RL became the alignment mechanism for the entire LLM industry.
RT-2 & Eureka
RT-2 used vision-language models as robot policies, transferring web-scale knowledge to physical manipulation. Eureka used LLMs to automatically generate reward functions for dexterous manipulation tasks.
Foundation models entered robotics. Reward engineering automated by LLMs.
GRPO & Reasoning Models
Group Relative Policy Optimization (GRPO) eliminated the critic network by using group-relative advantages, dramatically simplifying RL for LLMs. OpenAI o1 and DeepSeek-R1 demonstrated that RL could teach models to reason step-by-step.
RL for LLMs became simpler and more effective. Test-time compute scaling emerged.
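The critic-free trick is simple: sample a group of completions for the same prompt, score them, and normalize each reward against its own group. An illustrative sketch of that advantage computation:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: standardize each completion's reward
    against the mean and std of its group. The group baseline replaces
    the learned value network (critic) that PPO would require."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Completions above the group mean get positive advantages and are reinforced; those below are suppressed, with no extra network to train.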
Physical World Models & Sim-to-Real
Large-scale world models trained on video enabled sim-to-real transfer for manipulation and locomotion. Physical intelligence companies deployed RL-trained robots in warehouses and kitchens using foundation world models.
RL escaped simulation. Physical tasks became trainable at scale.
Current State
RL is the fine-tuning mechanism for frontier LLMs (GRPO, REINFORCE++), the training paradigm for humanoid robotics, and the optimization layer for scientific discovery. The field has converged: foundation models provide priors, RL provides optimization.
RL is no longer a research niche — it is core infrastructure for AI.
Paradigm Shifts
The field didn't evolve linearly. Five distinct paradigm shifts redefined what RL meant and what it could do.
Value-Based to Policy Gradient
“Maximizing a value function is brittle. Directly optimizing the policy is more stable and scales to continuous action spaces.”
Model-Free to Model-Based
“Sample efficiency matters. Learning a model of the world and planning inside it dramatically reduces real-world data requirements.”
RL as Optimization to RL as Sequence Modeling
“RL problems can be recast as supervised learning on trajectory data. This unlocks the scaling properties of transformers.”
Game AI to LLM Alignment
“The biggest impact of RL shifted from playing games to shaping how billions of people interact with AI.”
Simulation to Physical Reality
“World models learned from video close the sim-to-real gap. RL-trained robots work in unstructured real environments.”
Current SOTA: Atari
Agent57 (2020) was the first agent to beat human baselines on all 57 Atari games. Current agents achieve superhuman scores by enormous margins.
| Game | Human | Best Agent | Ratio |
|---|---|---|---|
| Breakout | 31.8 | 864.0 | 27x |
| Pong | 14.6 | 21.0 | 1.4x |
| Space Invaders | 1,669 | 54,576 | 33x |
| Seaquest | 42,055 | 999,999 | 24x |
| Q*bert | 13,455 | 999,999 | 74x |
| Montezuma's Revenge | 4,753 | 12,200 | 2.6x |
Scores from published benchmarks. Montezuma's Revenge, once considered unsolvable for RL, has been cracked through exploration bonuses and Go-Explore.
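The simplest form of the exploration bonuses mentioned above is count-based: states visited rarely pay out extra reward. A toy sketch (real agents use pseudo-counts or density models, since raw pixel states almost never repeat exactly):

```python
from collections import Counter

def count_bonus(visits, state, beta=0.1):
    """Count-based exploration bonus beta / sqrt(N(s)): novel states
    get a large intrinsic reward that decays as they become familiar,
    pushing the agent toward unexplored territory."""
    visits[state] += 1
    return beta / visits[state] ** 0.5
```

Adding this intrinsic term to the environment reward is what lets agents make progress on sparse-reward games like Montezuma's Revenge.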
Robotics: State of the Art
RL in robotics has shifted from sim-only curiosities to deployed systems. Three converging trends are driving this.
Foundation World Models
Video prediction models trained on internet-scale data provide physics priors. Robots learn manipulation in these learned simulators with 10-100x less real-world data.
Sim-to-Real Transfer
Domain randomization, system identification, and learned adaptation modules close the reality gap. Policies trained in IsaacGym transfer to physical hardware with minimal fine-tuning.
Language-Conditioned Policies
RT-2 and successors use VLMs as policy backbones. Natural language instructions map to motor commands. Robots generalize to novel objects and tasks zero-shot.
RL for LLMs: RLHF to GRPO
The most impactful application of RL in 2024-2026 isn't games or robots — it's making language models useful, safe, and capable of reasoning.
RLHF (2022)
Train reward model from human preferences, optimize with PPO
DPO (2023)
Direct preference optimization without explicit reward model
GRPO (2024)
Group samples, compute relative advantages within group, no critic needed
REINFORCE++ & Variants (2025-26)
Token-level credit assignment, process reward models, multi-turn RL
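Of the methods above, DPO is the easiest to show in full: it collapses the reward model and the RL loop into one supervised loss on preference pairs. A minimal per-pair sketch (sequence log-probs stubbed as scalars):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), where the margin is how
    much more the policy prefers the chosen completion over the
    rejected one, relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With no margin the loss is log 2; as the policy separates chosen from rejected beyond the reference model's preference, the loss falls toward zero. GRPO and REINFORCE++ keep an explicit reward signal instead, trading DPO's simplicity for online exploration.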
What's Next
The frontiers of RL in 2026 and beyond. These are active research areas where breakthroughs are expected.
Multi-Agent Foundation Models
Active research: Training teams of agents that coordinate through emergent communication. Applications in traffic, supply chains, and collaborative robotics.
RL for Scientific Discovery
Early deployment: Optimizing molecular structures, protein folding strategies, and experimental designs. AlphaFold showed the potential; RL is the optimization layer.
Continuous Learning Agents
Active research: Agents that improve indefinitely in deployment without catastrophic forgetting. Combining RL with continual learning and memory architectures.
RL-Native Hardware
Emerging: Custom silicon for RL workloads: fast simulation, parallel rollouts, real-time inference for robotics control loops at 1kHz+.
Explore RL Benchmarks & Papers
Track the latest reinforcement learning results, compare methods, and find the papers that matter.