Reinforcement Learning
Training agents to make decisions? Benchmark your policies on game playing, continuous control, and offline learning tasks.
Reinforcement learning trains agents to make sequential decisions through interaction with environments. From game-playing breakthroughs to robotics control and RLHF for LLM alignment, RL has become a foundational technique across AI, though sample efficiency and sim-to-real transfer remain key challenges.
Tasks & Benchmarks
State of the Field (2025)
- RLHF and RLVR (RL with Verifiable Rewards) are now standard for LLM alignment and reasoning: DeepSeek-R1, OpenAI o3, and Claude use RL-based training to improve instruction following and chain-of-thought reasoning
- Offline RL matured significantly: Decision Transformer, IQL, and Cal-QL enable learning from static datasets without environment interaction, critical for healthcare, finance, and robotics where online exploration is costly or dangerous
- Multi-agent RL scaled to complex coordination: OpenAI Five (Dota 2), DeepMind's AlphaStar (StarCraft II) demonstrated superhuman team coordination, while MAPPO and QMIX provide practical frameworks for cooperative multi-agent problems
- Sim-to-real transfer improved through domain randomization and system identification, but reliable zero-shot transfer to real robots remains unsolved for contact-rich manipulation tasks
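The offline-RL methods listed above avoid querying actions the dataset never saw. IQL's core trick is expectile regression: fitting the value function with an asymmetric squared loss so it approximates the best in-dataset action value. A minimal sketch of that loss (function name and scalar formulation are ours, for illustration):

```python
def expectile_loss(diff, tau=0.7):
    # Asymmetric squared loss on diff = Q(s, a) - V(s).
    # Positive errors are weighted by tau, negative by (1 - tau);
    # with tau > 0.5, V is pushed toward an upper expectile of Q,
    # approximating a max over dataset actions without ever
    # evaluating out-of-distribution actions.
    weight = tau if diff > 0 else (1.0 - tau)
    return weight * diff ** 2
```

At tau = 0.5 this reduces to ordinary (halved) squared-error regression; raising tau toward 1 makes the value estimate increasingly optimistic about actions the dataset actually contains.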
Quick Recommendations
Game playing and simulation benchmarks
PPO or SAC with vectorized environments
PPO provides robust on-policy training for discrete and continuous action spaces; SAC offers better sample efficiency for continuous control. Both are well supported in Stable-Baselines3 and CleanRL.
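PPO's robustness comes from its clipped surrogate objective, which keeps each policy update close to the policy that collected the data. A per-sample sketch (function name is ours; libraries like Stable-Baselines3 compute this over batches of log-probability ratios):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Clipped surrogate objective for one sample.
    # ratio = pi_new(a|s) / pi_old(a|s); clipping to [1-eps, 1+eps]
    # removes the incentive to move the policy far from the
    # behavior policy in a single update.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Loss to minimize is the negative of the pessimistic objective.
    return -min(ratio * advantage, clipped * advantage)
```

The pessimistic `min` is what makes clipping work: the unclipped term is used only when it is the worse (lower) of the two, so large ratio moves never get extra credit.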
Robotics control (MuJoCo, real-world)
SAC for simulation, offline RL (IQL/Cal-QL) for real-world
SAC's entropy regularization provides robust exploration in simulation. For real robots, offline RL learns from demonstration data without risky online exploration.
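SAC's entropy regularization enters through the soft Bellman target: the critic's bootstrap target includes a bonus for policy entropy, so stochastic (exploratory) policies are rewarded directly. A minimal sketch for a single transition (function name is ours; in practice SAC uses twin critics and often auto-tunes alpha):

```python
def soft_q_target(reward, next_q, next_log_prob, gamma=0.99, alpha=0.2):
    # Soft Bellman target for one transition.
    # -alpha * log pi(a'|s') is the entropy bonus: low-probability
    # (exploratory) actions raise the target, so the critic values
    # states where the policy stays stochastic.
    return reward + gamma * (next_q - alpha * next_log_prob)
```

Setting alpha to 0 recovers the standard (entropy-free) Bellman target, which is one way to see why SAC degrades gracefully as the temperature is annealed.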
LLM alignment and reasoning improvement
RLHF with PPO or DPO (Direct Preference Optimization)
PPO-based RLHF remains the standard for frontier models. DPO simplifies the pipeline by eliminating the reward model, achieving comparable results with less infrastructure.
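DPO eliminates the reward model by scoring each preference pair directly with log-probability ratios against a frozen reference policy. A per-pair sketch (function name and scalar inputs are ours; libraries like TRL batch this over tokenized completions):

```python
import math

def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    # DPO loss for one preference pair. Each logratio is
    # log pi_theta(y|x) - log pi_ref(y|x) for the chosen or
    # rejected completion; the implicit reward is beta * logratio,
    # so no separately trained reward model is needed.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy assigns the same ratio to both completions the margin is zero and the loss is log 2; the gradient then pushes probability mass toward the chosen completion and away from the rejected one.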
Multi-agent coordination
MAPPO or QMIX
MAPPO scales PPO to multi-agent settings with centralized training and decentralized execution. QMIX provides value decomposition for cooperative tasks. Both handle partial observability.
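QMIX's value decomposition rests on a monotonicity constraint: the joint value must be a monotonically increasing function of each agent's individual Q-value, so the greedy joint action decomposes into per-agent greedy actions. A toy linear mixer illustrating the constraint (in the real method the weights come from a hypernetwork conditioned on global state; this simplification is ours):

```python
def qmix_total(agent_qs, weights, bias=0.0):
    # Toy QMIX-style mixer: combines per-agent Q-values with
    # non-negative weights (enforced here via abs, as QMIX does),
    # guaranteeing d(Q_total)/d(Q_i) >= 0. Monotonicity means
    # argmax of Q_total equals the tuple of per-agent argmaxes,
    # enabling decentralized execution after centralized training.
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias
```

Because each agent can maximize its own Q-value independently at execution time, no inter-agent communication is needed once training ends.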
Atari Games
Continuous Control
Offline RL
Honest Takes
RL's biggest impact is inside LLMs, not robotics
The RL community spent decades on game playing and robot control, yet the technology's largest real-world impact turned out to be RLHF for language model alignment. DeepSeek-R1-Zero showed that RL alone, without supervised fine-tuning, can teach models to reason. This is where RL delivers the most value today.
Sample efficiency is still embarrassing
State-of-the-art RL agents need millions of environment interactions to learn tasks a human figures out in minutes. Offline RL and world models help, but the fundamental sample efficiency gap means RL remains impractical for most real-world applications without simulation.
Sim-to-real is the real bottleneck for robotics RL
Papers show impressive MuJoCo results that fail on real hardware. Domain randomization helps but doesn't solve contact dynamics, sensor noise, and actuator delays. Until sim-to-real transfer is reliable, RL for physical robots will remain a research endeavor for most teams.