Reinforcement Learning
Training agents to make decisions? Benchmark your policies on game playing, continuous control, and offline learning tasks.
Reinforcement learning trains agents to make sequential decisions through interaction with environments. From game-playing breakthroughs to robotics control and RLHF for LLM alignment, RL has become a foundational technique across AI, though sample efficiency and sim-to-real transfer remain key challenges.
State of the Field (2025)
- RLHF and RLVR (RL with Verifiable Rewards) are now standard for LLM alignment and reasoning: DeepSeek-R1, OpenAI o3, and Claude use RL-based training to improve instruction following and chain-of-thought reasoning
- Offline RL matured significantly: Decision Transformer, IQL, and Cal-QL enable learning from static datasets without environment interaction, critical for healthcare, finance, and robotics where online exploration is costly or dangerous
- Multi-agent RL scaled to complex coordination: OpenAI Five (Dota 2), DeepMind's AlphaStar (StarCraft II) demonstrated superhuman team coordination, while MAPPO and QMIX provide practical frameworks for cooperative multi-agent problems
- Sim-to-real transfer improved through domain randomization and system identification, but reliable zero-shot transfer to real robots remains unsolved for contact-rich manipulation tasks
Quick Recommendations
Game playing and simulation benchmarks
PPO or SAC with vectorized environments
PPO provides robust on-policy training for discrete and continuous action spaces. SAC offers better sample efficiency for continuous control. Both are well supported in Stable-Baselines3 and CleanRL.
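The core of PPO is its clipped surrogate objective, which caps how far each update can move the policy from the one that collected the data. A minimal sketch for a single transition, in plain Python (function name and scalar form are illustrative; real implementations batch this over tensors):

```python
import math

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate loss for one transition (to be minimized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio to
    [1 - eps, 1 + eps] removes the incentive to move the policy
    far from the data-collecting policy in a single update.
    """
    ratio = math.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    # Take the pessimistic (smaller) objective, negate to get a loss.
    return -min(unclipped, clipped)
```

With a positive advantage and a ratio already at the clip boundary, the loss stops improving as the ratio grows further, which is exactly what keeps on-policy training stable.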
Robotics control (MuJoCo, real-world)
SAC for simulation, offline RL (IQL/Cal-QL) for real-world
SAC's entropy regularization provides robust exploration in simulation. For real robots, offline RL learns from demonstration data without risky online exploration.
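SAC's entropy regularization shows up concretely in its critic target: the next-state value is the minimum of two target Q-estimates minus a temperature-weighted log-probability, so stochastic policies are rewarded. A scalar sketch under simplifying assumptions (real SAC computes this over batches with neural critics; the function name is illustrative):

```python
def sac_target(reward, next_q1, next_q2, next_log_prob,
               alpha=0.2, gamma=0.99, done=False):
    """Entropy-regularized TD target for SAC's critic update.

    min(Q1', Q2') counters overestimation bias; the
    -alpha * log_prob term is the entropy bonus that drives
    SAC's robust exploration.
    """
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + (0.0 if done else gamma * soft_value)
```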
LLM alignment and reasoning improvement
RLHF with PPO, or DPO (Direct Preference Optimization)
PPO-based RLHF remains the standard for frontier models. DPO simplifies the pipeline by eliminating the reward model, achieving comparable results with less infrastructure.
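DPO's simplification is visible in its per-pair loss: it needs only log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, with no separately trained reward model. A minimal sketch assuming scalar sequence log-probs (variable names are illustrative):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin difference).

    Each margin is the policy's log-prob advantage over the
    reference model; the loss pushes the chosen response's margin
    above the rejected one's.
    """
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    logit = beta * (chosen_margin - rejected_margin)
    # log1p(exp(-x)) == -log(sigmoid(x)), numerically stable for x >= 0.
    return math.log1p(math.exp(-logit))
```

At initialization (policy equals reference), every margin is zero and the loss sits at log 2; it falls as the policy separates chosen from rejected responses.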
Multi-agent coordination
MAPPO or QMIX
MAPPO scales PPO to multi-agent settings with centralized training and decentralized execution. QMIX provides value decomposition for cooperative tasks. Both handle partial observability.
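QMIX's value decomposition rests on one constraint: the joint Q-value must be monotonic in each agent's Q-value, so per-agent greedy actions agree with the joint greedy action. A toy sketch of the mixing step, assuming the hypernetwork's weights are given as a plain list (in the real algorithm a network conditioned on global state produces them):

```python
def qmix_mix(agent_qs, raw_weights, bias):
    """Monotonic mixing of per-agent Q-values, QMIX-style.

    Taking abs() of the (hypernetwork-produced) weights enforces
    dQ_tot/dQ_i >= 0, so maximizing each agent's own Q also
    maximizes the joint Q (the Individual-Global-Max property).
    """
    assert len(agent_qs) == len(raw_weights)
    return sum(q * abs(w) for q, w in zip(agent_qs, raw_weights)) + bias
```

Note that even a negative raw weight contributes non-negatively, which is the whole trick: expressiveness comes from the state-dependent weights and bias, not from sign flips.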
Tasks & Benchmarks
Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency — not just final performance — is the new frontier. The benchmark's age is both its strength (decades of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).
Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.
Offline RL
Offline RL — learning policies from fixed datasets without further environment interaction — matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2022) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with benchmarks like CORL tracking reproducibility across dozens of algorithms.
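IQL's way of avoiding out-of-dataset actions is an asymmetric (expectile) regression loss for its value function: by up-weighting positive errors, the value estimate tracks an upper expectile of the Q-targets using only in-dataset actions. A scalar sketch (the function name is illustrative; real IQL applies this over batches):

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss used by IQL's value update.

    diff = q_target - v_estimate. With tau > 0.5, positive errors
    are penalized more, so V approximates an upper expectile of Q
    without ever evaluating actions outside the dataset.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff
```

At tau = 0.5 this reduces to ordinary (scaled) squared error; pushing tau toward 1 makes the value estimate increasingly optimistic about the best actions present in the data.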
Honest Takes
RL's biggest impact is inside LLMs, not robotics
The RL community spent decades on game playing and robot control, but the technology's largest real-world impact turned out to be RLHF for language model alignment. DeepSeek-R1-Zero showed that RL alone, without supervised fine-tuning, can teach models to reason. This is where RL delivers the most value today.
Sample efficiency is still embarrassing
State-of-the-art RL agents need millions of environment interactions to learn tasks a human figures out in minutes. Offline RL and world models help, but the fundamental sample efficiency gap means RL remains impractical for most real-world applications without simulation.
Sim-to-real is the real bottleneck for robotics RL
Papers show impressive MuJoCo results that fail on real hardware. Domain randomization helps but doesn't solve contact dynamics, sensor noise, and actuator delays. Until sim-to-real transfer is reliable, RL for physical robots will remain a research endeavor for most teams.