Reinforcement Learning

Offline RL

Offline RL — learning policies from fixed datasets without further environment interaction — matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2021) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with benchmarks like CORL tracking reproducibility across dozens of algorithms.


Offline RL (batch RL) learns policies from fixed datasets of pre-collected experience without any online environment interaction. This is critical for domains like healthcare and autonomous driving where exploration is dangerous. CQL, IQL, and Decision Transformer are the key methods, with conservative estimation and sequence modeling as the two dominant paradigms.

History

2019

BCQ paper (Fujimoto et al., "Off-Policy Deep Reinforcement Learning without Exploration") formalizes extrapolation error under distributional shift and proposes batch-constrained Q-learning

2020

D4RL benchmark released — standardized offline RL datasets for MuJoCo, maze navigation, and more

2020

CQL (Conservative Q-Learning) addresses overestimation by penalizing Q-values on unseen actions

2021

Decision Transformer recasts offline RL as sequence modeling — return-conditioned autoregression

2021

IQL (Implicit Q-Learning) avoids querying OOD actions entirely via expectile regression

2022

Diffusion policies (Diffuser) model action distributions with denoising diffusion for offline RL

2023

Cal-QL bridges offline and online RL with calibrated conservative learning

2024

Offline RL at scale: large datasets + transformer architectures show strong results

2024

Real-world applications emerge in robotics (RT-2), chip design, and recommendation systems

2025

Foundation models for offline RL — pretrained on diverse offline data, fine-tuned for specific tasks

How Offline RL Works

1

Dataset Collection

A fixed dataset of (state, action, reward, next_state) transitions is collected by some behavior policy — human demonstrations, random exploration, or a prior agent.
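A minimal sketch of this step, using a hypothetical toy environment and helper names (`collect_dataset`, `env_step` are illustrative, not from any library): roll out a behavior policy once, log every transition, and never touch the environment again.

```python
import random
from collections import namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

def collect_dataset(env_step, behavior_policy, initial_state, n_steps):
    """Roll out a behavior policy and log every transition into a fixed dataset."""
    dataset, state = [], initial_state
    for _ in range(n_steps):
        action = behavior_policy(state)
        next_state, reward, done = env_step(state, action)
        dataset.append(Transition(state, action, reward, next_state, done))
        state = initial_state if done else next_state
    return dataset

# Toy 1-D chain environment: actions move left/right, reward for reaching state 4.
def env_step(s, a):
    ns = max(0, min(4, s + (1 if a == 1 else -1)))
    return ns, float(ns == 4), ns == 4

random.seed(0)
data = collect_dataset(env_step, lambda s: random.choice([0, 1]), 0, 100)
print(len(data))  # 100 logged transitions; the offline learner never calls env_step again
```

Here the behavior policy is random exploration; in practice it could equally be human demonstrations or logs from a previously deployed agent.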

2

Conservative Value Estimation

Q-values are learned conservatively — penalizing overestimation on actions not well-represented in the data (CQL) or avoiding OOD action queries entirely (IQL).
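As a rough sketch of the CQL-style penalty for discrete actions (function name and the toy numbers are illustrative): push Q-values down over all actions via a log-sum-exp term while pushing them up on the actions actually seen in the data.

```python
import numpy as np

def cql_penalty(q_values, data_actions):
    """CQL-style conservative penalty: log-sum-exp over all actions
    minus the Q-value of the action logged in the dataset."""
    # q_values: (batch, n_actions) Q-estimates; data_actions: (batch,) int indices
    logsumexp = np.log(np.exp(q_values).sum(axis=1))  # soft maximum over actions
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return (logsumexp - q_data).mean()

q = np.array([[1.0, 3.0],   # state where the logged action already has the top Q
              [2.0, 2.0]])  # state where Q is flat across actions
penalty = cql_penalty(q, np.array([1, 0]))
print(round(penalty, 3))  # small penalty: logged actions are near the soft maximum
```

This term is added (with a conservatism coefficient) to the usual Bellman loss; the coefficient is exactly the hyperparameter the "Key Challenges" section below flags as hard to tune.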

3

Policy Extraction

A policy is derived that stays close to the data distribution while improving upon the behavior policy. This can be explicit (policy constraint) or implicit (sequence modeling).
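One common implicit scheme is advantage-weighted regression (used in AWR and for IQL's policy extraction). A minimal sketch, assuming discrete logged actions and a known value baseline (all names and numbers are illustrative): weight each dataset action by exp(advantage / beta), so the policy imitates the data while favoring higher-advantage actions.

```python
import numpy as np

def awr_weights(q, v, beta=1.0):
    """Advantage-weighted regression weights: exp(A / beta), normalized.
    The policy is then trained to imitate dataset actions with these weights,
    staying on-distribution while tilting toward better actions."""
    adv = q - v
    w = np.exp(np.clip(adv / beta, -10, 10))  # clip for numerical stability
    return w / w.sum()

q = np.array([1.0, 2.0, 0.5])  # Q(s, a_i) for three logged actions
v = np.array([1.0, 1.0, 1.0])  # baseline V(s)
w = awr_weights(q, v)
print(w.round(3))  # the middle (highest-advantage) action gets the largest weight
```

Smaller `beta` sharpens the weighting toward the best logged actions; larger `beta` recovers plain behavior cloning.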

4

Sequence Modeling Alternative

Decision Transformer and variants treat offline RL as autoregressive prediction: given desired return, predict the action sequence that achieves it.
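The conditioning signal is the return-to-go: at each timestep, the return still remaining in the trajectory. A small sketch of how it is computed from logged rewards (function name is illustrative):

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Suffix sums of rewards — the return-to-go sequence that Decision
    Transformer interleaves with states and actions as input tokens."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rewards = np.array([0.0, 0.0, 1.0])  # sparse reward at the final step
print(returns_to_go(rewards))        # every timestep sees the return still to come
# At inference, prompt with a desired target return and decode actions autoregressively.
```

Training is then plain autoregressive prediction over (return-to-go, state, action) token triples.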

5

Optional Online Fine-Tuning

The offline-learned policy can be fine-tuned with limited online interaction for significant additional improvement (offline-to-online RL).
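A common ingredient in offline-to-online pipelines is mixed replay: each training batch combines the fixed offline dataset (as an anchor against forgetting) with freshly collected online transitions. A minimal sketch with illustrative names and a tag-based toy dataset:

```python
import random

def sample_mixed_batch(offline_data, online_data, batch_size, online_fraction=0.5):
    """Sample a training batch that mixes the fixed offline dataset with
    newly collected online transitions (capped by what's available online)."""
    n_online = min(int(batch_size * online_fraction), len(online_data))
    batch = random.sample(online_data, n_online)
    batch += random.sample(offline_data, batch_size - n_online)
    random.shuffle(batch)
    return batch

random.seed(1)
offline = [("off", i) for i in range(100)]  # stand-ins for logged transitions
online = [("on", i) for i in range(10)]     # a few fresh online transitions
batch = sample_mixed_batch(offline, online, 8)
print(len(batch), sum(1 for tag, _ in batch if tag == "on"))  # 8 total, 4 online
```

Early in fine-tuning the online buffer is small, so batches are mostly offline data; the mix shifts as interaction accumulates.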

Current Landscape

Offline RL in 2025 has matured from a research curiosity to a practical paradigm, especially for robotics and recommendation systems. Two schools dominate: conservative value methods (CQL, IQL) that constrain learning to stay near the data, and sequence modeling approaches (Decision Transformer) that sidestep value estimation entirely. The field increasingly connects to foundation models — pretraining on large offline datasets and fine-tuning for specific tasks. Real-world deployments in chip design (Google), robotics (RT-2), and recommendation systems demonstrate practical value.

Key Challenges

Distributional shift — the learned policy encounters states unseen in the training data, leading to compounding errors

Dataset quality dependency — offline RL performance is bounded by the quality and coverage of the collected data

Conservatism vs. improvement tradeoff — too conservative means copying the behavior policy; too optimistic means catastrophic overestimation

Hyperparameter sensitivity — offline RL methods are notoriously sensitive to conservatism coefficients and architecture choices

Evaluation challenges — D4RL benchmarks don't fully capture real-world offline RL difficulties (partial observability, non-stationarity)

Quick Recommendations

Standard offline RL benchmark

IQL / CQL

Most reliable methods on D4RL with extensive ablations and reproducible code

Sequence modeling approach

Decision Transformer / Trajectory Transformer

Simple, scalable, and avoids the conservatism tuning problem

Robotics with demonstrations

Diffusion Policy (offline)

Best for learning multi-modal action distributions from demonstrations

Offline-to-online fine-tuning

Cal-QL / IQL + online

Proven pipeline for initializing from offline data then improving online

What's Next

The frontier is scaling offline RL with internet-scale data — learning general behavioral priors from massive demonstration datasets, then fine-tuning for specific tasks. Expect convergence with vision-language-action models and increasing use of offline RL as the 'pretraining' phase of real-world RL systems.

Benchmarks & SOTA

No datasets indexed for this task yet.
