Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 (2018) became reliable baselines, but the field shifted toward harder tasks (humanoid parkour, dexterous manipulation) and sim-to-real transfer after DeepMind's dm_control and NVIDIA's Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.
Continuous control tasks require RL agents to output real-valued actions (torques, forces, velocities) for locomotion, manipulation, and other physical systems. MuJoCo and Isaac Gym are the standard simulators, with PPO and SAC as dominant algorithms. Sim-to-real transfer remains the key challenge for real-world deployment.
History
MuJoCo (Todorov et al., 2012) becomes the standard physics simulator for continuous control benchmarks
Schulman et al. introduce TRPO (2015) for stable policy gradient optimization
DDPG (Lillicrap et al., 2015) extends DQN to continuous action spaces
PPO (Proximal Policy Optimization, 2017) simplifies TRPO and becomes the default algorithm
SAC (Soft Actor-Critic, 2018) achieves state-of-the-art performance by maximizing entropy alongside reward
OpenAI's Dactyl hand solves a Rubik's Cube via sim-to-real transfer (2019)
Isaac Gym (NVIDIA, 2021) enables GPU-accelerated parallel simulation for massive speedups
DreamerV3 (2023) achieves strong continuous control via learned world models
TD-MPC2 (2023) scales model-based RL to 80+ continuous control tasks with a single model
Foundation policies trained on diverse simulation data show zero-shot transfer capabilities
How Continuous Control Works
State Observation
The agent receives the system state — joint positions, velocities, contact forces — as a continuous vector (or images for vision-based control).
Policy Evaluation
A neural network maps the state to a continuous action distribution (typically Gaussian), parameterized by mean and variance for each action dimension.
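As a concrete illustration, here is a minimal NumPy sketch of such a Gaussian policy head. The dimensions, the randomly initialized weights, and the tanh squashing of the sampled action are illustrative choices (the squashing follows SAC's convention), not any specific library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: an 8-D state (joint positions/velocities)
# and a 2-D action (two joint torques).
STATE_DIM, ACTION_DIM, HIDDEN = 8, 2, 32

# Randomly initialized weights stand in for a trained network.
W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
W_mean = rng.normal(0.0, 0.1, (HIDDEN, ACTION_DIM))
log_std = np.full(ACTION_DIM, -0.5)  # state-independent log-std, a common choice

def policy(state):
    """Map a state to the mean and std of a diagonal Gaussian over actions."""
    h = np.tanh(state @ W1)
    return h @ W_mean, np.exp(log_std)

def sample_action(state):
    """Sample an action, then squash it into (-1, 1) with tanh."""
    mean, std = policy(state)
    raw = mean + std * rng.normal(size=ACTION_DIM)
    return np.tanh(raw)

state = rng.normal(size=STATE_DIM)
action = sample_action(state)
print(action)  # a 2-D action with each component in (-1, 1)
```

In practice the variance is often state-independent (as here) for on-policy methods like PPO, while SAC predicts a state-dependent log-std from the same network trunk.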
Action Execution
An action is sampled from the distribution and applied as torques/forces to the simulated (or real) physical system.
Reward Collection
The environment returns a scalar reward based on task progress — distance traveled, object reached, energy minimized.
Policy Update
The policy network is updated using policy gradients (PPO), Q-learning (SAC), or world-model-based planning (DreamerV3, TD-MPC2).
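The five steps above can be sketched end to end with a REINFORCE-style update on a toy one-dimensional task. The linear-Gaussian policy and the quadratic reward below are invented for illustration (the optimal policy is the gain 0.5), not drawn from any benchmark.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D task: reward is highest when the action matches a hidden
# target that depends linearly on the state (optimal gain = 0.5).
def env_reward(state, action):
    return -(action - 0.5 * state) ** 2

# Linear-Gaussian policy: action ~ N(w * state, std^2).
w, std, lr = 0.0, 0.2, 0.02

for episode in range(5000):
    state = rng.uniform(-1.0, 1.0)        # state observation
    mean = w * state                      # policy evaluation
    action = mean + std * rng.normal()    # action execution (sampled)
    reward = env_reward(state, action)    # reward collection
    # Policy update: REINFORCE, using the Gaussian score function
    # d log pi(a|s) / d w = (action - mean) * state / std**2.
    w += lr * reward * (action - mean) * state / std**2

print(f"learned gain: {w:.2f}")  # converges toward the optimal gain 0.5
```

Real implementations differ mainly in how they estimate this gradient: PPO clips the policy ratio over batched rollouts, SAC replaces the Monte Carlo return with a learned Q-function, and world-model methods backpropagate through imagined trajectories.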
Current Landscape
Continuous control in 2025 has converged on two paradigms: model-free (PPO/SAC with massive parallelism via Isaac Gym) and model-based (DreamerV3/TD-MPC2 learning dynamics models). Standard MuJoCo benchmarks (HalfCheetah, Ant, Humanoid) are well-solved, with research pushing toward more complex manipulation, multi-agent control, and sim-to-real transfer. The field is increasingly integrated with robotics, where continuous control is a necessary component of end-to-end robot learning.
Key Challenges
Reward design — continuous control tasks require carefully shaped rewards to avoid degenerate behaviors
Sim-to-real gap — policies trained in simulation often fail on real hardware due to modeling errors
Sample complexity — complex locomotion tasks can require billions of simulation steps to solve
Multi-task generalization — single policies that handle diverse control tasks remain difficult
Contact dynamics — simulating and learning from contact-rich manipulation is numerically challenging
Quick Recommendations
General-purpose continuous control
SAC / PPO
Most reliable, well-documented algorithms with extensive codebases (CleanRL, Stable-Baselines3)
Sample-efficient control
DreamerV3 / TD-MPC2
World-model-based methods achieve strong performance with 10-100x fewer environment interactions
Large-scale parallel training
Isaac Gym + PPO
GPU-accelerated simulation with thousands of parallel environments cuts wall-clock training from days to minutes on a single GPU
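The speedup comes from stepping thousands of environments as one batched array operation rather than one Python loop per environment. A minimal NumPy sketch of that pattern, with invented point-mass dynamics standing in for a GPU physics engine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batched dynamics: every environment is one entry of an
# array, so a single vectorized update steps all of them at once --
# the same pattern Isaac Gym executes on the GPU.
N_ENVS, DT = 4096, 0.01
pos = np.zeros(N_ENVS)
vel = np.zeros(N_ENVS)

def step(pos, vel, actions):
    """Advance all environments one timestep in a few array ops."""
    vel = vel + DT * actions   # applied force -> velocity
    pos = pos + DT * vel       # velocity -> position
    rewards = -pos ** 2        # stay near the origin
    return pos, vel, rewards

actions = rng.uniform(-1.0, 1.0, N_ENVS)  # one action per environment
pos, vel, rewards = step(pos, vel, actions)
print(rewards.shape)  # (4096,)
```

With PPO on top, each update consumes a batch of 4096 transitions per simulator step, which is what makes minutes-scale training feasible.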
Real robot deployment
SAC + domain randomization
Proven sim-to-real pipeline with robust transfer
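Domain randomization perturbs the simulator's physics each episode so the policy cannot overfit a single model of the world. A minimal sketch with hypothetical parameter names and ranges:

```python
import random

random.seed(0)

# Hypothetical nominal physics parameters for one simulated robot link.
NOMINAL = {"mass": 1.0, "friction": 0.8, "motor_gain": 1.0}

def randomize_params(nominal, scale=0.2):
    """Scale each parameter by a uniform factor in [1 - scale, 1 + scale],
    so every episode trains against a slightly different simulator."""
    return {k: v * random.uniform(1.0 - scale, 1.0 + scale)
            for k, v in nominal.items()}

# A policy that succeeds across many perturbed simulators is more
# likely to survive the modeling errors of real hardware.
for episode in range(3):
    params = randomize_params(NOMINAL)
    print(params)
```

The randomization ranges are the key design choice: too narrow and the sim-to-real gap remains, too wide and the task becomes unlearnable.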
What's Next
The frontier is foundation policies for control — large models pretrained on diverse simulation data that can adapt to new tasks with minimal fine-tuning, analogous to language model pretraining. Expect convergence with vision-language-action models for robots and increasing emphasis on real-world deployment over simulation benchmarks.
Related Tasks
Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency — not just final performance — is the new frontier. The benchmark's age is both its strength (decades of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).
Offline RL
Offline RL — learning policies from fixed datasets without further environment interaction — matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2022) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with benchmarks like CORL tracking reproducibility across dozens of algorithms.