Offline RL
Offline RL — learning policies from fixed datasets without further environment interaction — matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2022) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with benchmarks like CORL tracking reproducibility across dozens of algorithms.
Offline RL (batch RL) learns policies from fixed datasets of pre-collected experience without any online environment interaction. This is critical for domains like healthcare and autonomous driving where exploration is dangerous. CQL, IQL, and Decision Transformer are the key methods, with conservative estimation and sequence modeling as the two dominant paradigms.
History
Fujimoto et al.'s batch RL paper ("Off-Policy Deep Reinforcement Learning without Exploration", which introduced BCQ) formalizes the distributional shift and extrapolation error problem in offline RL
D4RL benchmark released — standardized offline RL datasets for MuJoCo, maze navigation, and more
CQL (Conservative Q-Learning) addresses overestimation by penalizing Q-values on unseen actions
Decision Transformer recasts offline RL as sequence modeling — return-conditioned autoregression
IQL (Implicit Q-Learning) avoids querying OOD actions entirely via expectile regression
Diffusion policies (Diffuser) model action distributions with denoising diffusion for offline RL
Cal-QL bridges offline and online RL with calibrated conservative learning
Offline RL at scale: large datasets + transformer architectures show strong results
Real-world applications emerge in robotics (RT-2), chip design, and recommendation systems
Foundation models for offline RL — pretrained on diverse offline data, fine-tuned for specific tasks
How Offline RL Works
Dataset Collection
A fixed dataset of (state, action, reward, next_state) transitions is collected by some behavior policy — human demonstrations, random exploration, or a prior agent.
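As a concrete sketch, an offline dataset can be represented as fixed arrays of logged transitions that training samples from. The array names and dimensions below are illustrative, not from any specific library; random data stands in for transitions logged by a behavior policy.

```python
import numpy as np

# Hypothetical minimal offline dataset: fixed arrays of
# (state, action, reward, next_state, done) transitions, as if
# logged by some behavior policy. Names/shapes are illustrative.
rng = np.random.default_rng(0)
N, state_dim, n_actions = 1000, 4, 3

dataset = {
    "states":      rng.normal(size=(N, state_dim)),
    "actions":     rng.integers(0, n_actions, size=N),
    "rewards":     rng.normal(size=N),
    "next_states": rng.normal(size=(N, state_dim)),
    "dones":       rng.random(N) < 0.05,
}

def sample_batch(dataset, batch_size=64):
    """Uniformly sample a minibatch of transitions for offline training."""
    idx = rng.integers(0, len(dataset["actions"]), size=batch_size)
    return {k: v[idx] for k, v in dataset.items()}

batch = sample_batch(dataset)
```

The key point is that `dataset` never grows: unlike online RL, every gradient step draws from the same fixed transitions.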
Conservative Value Estimation
Q-values are learned conservatively — penalizing overestimation on actions not well-represented in the data (CQL) or avoiding OOD action queries entirely (IQL).
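The CQL conservatism term for discrete actions can be sketched as follows. This is a simplified NumPy illustration of the penalty's shape (logsumexp over all actions minus the dataset action's Q-value), not a full CQL implementation; the function name is my own.

```python
import numpy as np

def cql_penalty(q_values, data_actions):
    """CQL-style conservatism term for discrete actions (sketch).

    q_values:     (batch, n_actions) Q(s, .) estimates
    data_actions: (batch,) actions actually taken in the dataset

    Penalty = logsumexp_a Q(s, a) - Q(s, a_data): pushes Q down on
    out-of-distribution actions while holding it up on dataset actions.
    """
    # numerically stable log-sum-exp over the action dimension
    m = q_values.max(axis=1, keepdims=True)
    logsumexp = m.squeeze(1) + np.log(np.exp(q_values - m).sum(axis=1))
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return (logsumexp - q_data).mean()

# In training, this would be added to the usual TD loss:
# loss = td_loss + alpha * cql_penalty(q_values, data_actions)
```

Since logsumexp upper-bounds the maximum Q-value, the penalty is always non-negative; the coefficient `alpha` is the conservatism knob the Key Challenges section flags as hard to tune.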
Policy Extraction
A policy is derived that stays close to the data distribution while improving upon the behavior policy. This can be explicit (policy constraint) or implicit (sequence modeling).
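One common explicit extraction scheme is advantage-weighted regression (used in IQL-style pipelines): the policy is fit by behavioral cloning where each dataset action is weighted by the exponentiated advantage, so training stays on-distribution while upweighting actions the value functions judge better than average. The helper below is an illustrative sketch under that assumption.

```python
import numpy as np

def awr_weights(q_data, v, beta=1.0, max_weight=100.0):
    """Advantage-weighted regression weights (sketch).

    q_data: Q(s, a) for dataset actions; v: V(s) baseline.
    Weight = exp((Q - V) / beta), clipped for numerical stability.
    The policy is then trained to imitate dataset actions with
    these per-sample weights (weighted behavioral cloning).
    """
    advantage = q_data - v
    return np.exp(np.clip(advantage / beta, None, np.log(max_weight)))
```

Because weights are attached only to actions that appear in the dataset, the extracted policy never has to evaluate out-of-distribution actions, which is the core appeal of this family of methods.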
Sequence Modeling Alternative
Decision Transformer and variants treat offline RL as autoregressive prediction: given desired return, predict the action sequence that achieves it.
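The conditioning signal Decision Transformer uses is the return-to-go: the sum of rewards remaining from each timestep onward, computed per trajectory. A minimal sketch:

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at every timestep of one trajectory.

    Decision Transformer conditions each action prediction on the
    remaining (typically undiscounted, gamma=1) return, modeling
    p(action_t | return-to-go_t, states and actions so far).
    """
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# e.g. rewards [1, 0, 2] -> returns-to-go [3, 2, 2]
```

At inference time, the user sets a desired total return as the initial return-to-go and decrements it by observed rewards; the transformer autoregressively emits actions consistent with that target.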
Optional Online Fine-Tuning
The offline-learned policy can be fine-tuned with limited online interaction for significant additional improvement (offline-to-online RL).
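A common offline-to-online recipe keeps sampling the offline dataset while gradually folding in freshly collected online transitions; the blend ratio is a tuning knob. The sketch below treats buffers as plain arrays for illustration; names and the mixing scheme are assumptions, not a specific library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_batch(offline, online, batch_size=64, online_frac=0.5):
    """Sample a minibatch mixing offline data with online transitions.

    `online_frac` controls the blend between the fixed offline dataset
    and the growing online replay buffer. If no online data has been
    collected yet, the batch is purely offline.
    """
    n_online = int(batch_size * online_frac) if len(online) else 0
    n_offline = batch_size - n_online
    off_idx = rng.integers(0, len(offline), size=n_offline)
    parts = [offline[off_idx]]
    if n_online:
        on_idx = rng.integers(0, len(online), size=n_online)
        parts.append(online[on_idx])
    return np.concatenate(parts)
```

Keeping offline data in the mix guards against the performance collapse that can occur when a conservatively trained policy is suddenly updated on purely on-policy data.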
Current Landscape
Offline RL in 2025 has matured from a research curiosity to a practical paradigm, especially for robotics and recommendation systems. Two schools dominate: conservative value methods (CQL, IQL) that constrain learning to stay near the data, and sequence modeling approaches (Decision Transformer) that sidestep value estimation entirely. The field increasingly connects to foundation models — pretraining on large offline datasets and fine-tuning for specific tasks. Real-world deployments in chip design (Google), robotics (RT-2), and recommendation systems demonstrate practical value.
Key Challenges
Distributional shift — the learned policy encounters states unseen in the training data, leading to compounding errors
Dataset quality dependency — offline RL performance is bounded by the quality and coverage of the collected data
Conservatism vs. improvement tradeoff — too conservative means copying the behavior policy; too optimistic means catastrophic overestimation
Hyperparameter sensitivity — offline RL methods are notoriously sensitive to conservatism coefficients and architecture choices
Evaluation challenges — D4RL benchmarks don't fully capture real-world offline RL difficulties (partial observability, non-stationarity)
Quick Recommendations
Standard offline RL benchmark
IQL / CQL
Most reliable methods on D4RL with extensive ablations and reproducible code
Sequence modeling approach
Decision Transformer / Trajectory Transformer
Simple, scalable, and avoids the conservatism tuning problem
Robotics with demonstrations
Diffusion Policy (offline)
Best for learning multi-modal action distributions from demonstrations
Offline-to-online fine-tuning
Cal-QL / IQL + online
Proven pipeline for initializing from offline data then improving online
What's Next
The frontier is scaling offline RL with internet-scale data — learning general behavioral priors from massive demonstration datasets, then fine-tuning for specific tasks. Expect convergence with vision-language-action models and increasing use of offline RL as the 'pretraining' phase of real-world RL systems.
Benchmarks & SOTA
No datasets indexed for this task yet.
Related Tasks
Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency — not just final performance — is the new frontier. The benchmark's age is both its strength (decades of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).
Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.