Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 (2018) became reliable baselines, but the field shifted toward harder tasks (humanoid parkour, dexterous manipulation) and sim-to-real transfer after DeepMind's dm_control and NVIDIA's Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.
Continuous control tasks require RL agents to output real-valued actions (torques, forces, velocities) for locomotion, manipulation, and other physical systems. MuJoCo and Isaac Gym are the standard simulators, with PPO and SAC as dominant algorithms. Sim-to-real transfer remains the key challenge for real-world deployment.
History
MuJoCo (Todorov et al., 2012) becomes the standard physics simulator for continuous control benchmarks
Schulman et al. introduce TRPO (2015) for stable policy gradient optimization
DDPG (Lillicrap et al., 2015) extends DQN to continuous action spaces
PPO (Proximal Policy Optimization, 2017) simplifies TRPO and becomes the default algorithm
SAC (Soft Actor-Critic, 2018) achieves state-of-the-art performance by maximizing entropy alongside reward
OpenAI's Dactyl hand solves a Rubik's Cube via sim-to-real transfer (2019)
Isaac Gym (NVIDIA, 2021) enables GPU-accelerated parallel simulation for massive speedups
DreamerV3 (2023) achieves strong continuous control via learned world models
TD-MPC2 (2023) scales model-based RL to 80+ continuous control tasks with a single model
Foundation policies trained on diverse simulation data show zero-shot transfer capabilities
How Continuous Control Works
State Observation
The agent receives the system state — joint positions, velocities, contact forces — as a continuous vector (or images for vision-based control).
Policy Evaluation
A neural network maps the state to a continuous action distribution (typically Gaussian), parameterized by mean and variance for each action dimension.
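As a concrete illustration, here is a minimal NumPy sketch of such a Gaussian policy head. The dimensions, the randomly initialized weights, and the tanh squashing of the sampled action are illustrative choices (the squashing follows SAC's convention), not any specific library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: an 8-D state (joint positions/velocities)
# and a 2-D action (two joint torques).
STATE_DIM, ACTION_DIM, HIDDEN = 8, 2, 32

# Randomly initialized weights stand in for a trained network.
W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
W_mean = rng.normal(0.0, 0.1, (HIDDEN, ACTION_DIM))
log_std = np.full(ACTION_DIM, -0.5)  # state-independent log-std, a common choice

def policy(state):
    """Map a state to the mean and std of a diagonal Gaussian over actions."""
    h = np.tanh(state @ W1)
    return h @ W_mean, np.exp(log_std)

def sample_action(state):
    """Sample an action, then squash it into (-1, 1) with tanh."""
    mean, std = policy(state)
    raw = mean + std * rng.normal(size=ACTION_DIM)
    return np.tanh(raw)

state = rng.normal(size=STATE_DIM)
action = sample_action(state)
print(action)  # a 2-D action with each component in (-1, 1)
```

In practice the variance is often state-independent (as here) for on-policy methods like PPO, while SAC predicts a state-dependent log-std from the same network trunk.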
Action Execution
An action is sampled from the distribution and applied as torques/forces to the simulated (or real) physical system.
Reward Collection
The environment returns a scalar reward based on task progress — distance traveled, object reached, energy minimized.
Policy Update
The policy network is updated using policy gradients (PPO), Q-learning (SAC), or world-model-based planning (DreamerV3, TD-MPC2).
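The five steps above can be sketched end to end with a REINFORCE-style update on a toy one-dimensional task. The linear-Gaussian policy and the quadratic reward below are invented for illustration (the optimal policy is the gain 0.5), not drawn from any benchmark.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D task: reward is highest when the action matches a hidden
# target that depends linearly on the state (optimal gain = 0.5).
def env_reward(state, action):
    return -(action - 0.5 * state) ** 2

# Linear-Gaussian policy: action ~ N(w * state, std^2).
w, std, lr = 0.0, 0.2, 0.02

for episode in range(5000):
    state = rng.uniform(-1.0, 1.0)        # state observation
    mean = w * state                      # policy evaluation
    action = mean + std * rng.normal()    # action execution (sampled)
    reward = env_reward(state, action)    # reward collection
    # Policy update: REINFORCE, using the Gaussian score function
    # d log pi(a|s) / d w = (action - mean) * state / std**2.
    w += lr * reward * (action - mean) * state / std**2

print(f"learned gain: {w:.2f}")  # converges toward the optimal gain 0.5
```

Real implementations differ mainly in how they estimate this gradient: PPO clips the policy ratio over batched rollouts, SAC replaces the Monte Carlo return with a learned Q-function, and world-model methods backpropagate through imagined trajectories.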
Current Landscape
Continuous control in 2025 has converged on two paradigms: model-free (PPO/SAC with massive parallelism via Isaac Gym) and model-based (DreamerV3/TD-MPC2 learning dynamics models). Standard MuJoCo benchmarks (HalfCheetah, Ant, Humanoid) are well-solved, with research pushing toward more complex manipulation, multi-agent control, and sim-to-real transfer. The field is increasingly integrated with robotics, where continuous control is a necessary component of end-to-end robot learning.
Key Challenges
Reward design — continuous control tasks require carefully shaped rewards to avoid degenerate behaviors
Sim-to-real gap — policies trained in simulation often fail on real hardware due to modeling errors
Sample complexity — complex locomotion tasks can require billions of simulation steps to solve
Multi-task generalization — single policies that handle diverse control tasks remain difficult
Contact dynamics — simulating and learning from contact-rich manipulation is numerically challenging
Quick Recommendations
General-purpose continuous control
SAC / PPO
Most reliable, well-documented algorithms with extensive codebases (CleanRL, Stable-Baselines3)
Sample-efficient control
DreamerV3 / TD-MPC2
World-model-based methods achieve strong performance with 10-100x fewer environment interactions
Large-scale parallel training
Isaac Gym + PPO
GPU-accelerated simulation with thousands of parallel environments cuts wall-clock training from days to minutes on a single GPU
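The speedup comes from stepping thousands of environments as one batched array operation rather than one Python loop per environment. A minimal NumPy sketch of that pattern, with invented point-mass dynamics standing in for a GPU physics engine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batched dynamics: every environment is one entry of an
# array, so a single vectorized update steps all of them at once --
# the same pattern Isaac Gym executes on the GPU.
N_ENVS, DT = 4096, 0.01
pos = np.zeros(N_ENVS)
vel = np.zeros(N_ENVS)

def step(pos, vel, actions):
    """Advance all environments one timestep in a few array ops."""
    vel = vel + DT * actions   # applied force -> velocity
    pos = pos + DT * vel       # velocity -> position
    rewards = -pos ** 2        # stay near the origin
    return pos, vel, rewards

actions = rng.uniform(-1.0, 1.0, N_ENVS)  # one action per environment
pos, vel, rewards = step(pos, vel, actions)
print(rewards.shape)  # (4096,)
```

With PPO on top, each update consumes a batch of 4096 transitions per simulator step, which is what makes minutes-scale training feasible.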
Real robot deployment
SAC + domain randomization
Proven sim-to-real pipeline with robust transfer
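Domain randomization perturbs the simulator's physics each episode so the policy cannot overfit a single model of the world. A minimal sketch with hypothetical parameter names and ranges:

```python
import random

random.seed(0)

# Hypothetical nominal physics parameters for one simulated robot link.
NOMINAL = {"mass": 1.0, "friction": 0.8, "motor_gain": 1.0}

def randomize_params(nominal, scale=0.2):
    """Scale each parameter by a uniform factor in [1 - scale, 1 + scale],
    so every episode trains against a slightly different simulator."""
    return {k: v * random.uniform(1.0 - scale, 1.0 + scale)
            for k, v in nominal.items()}

# A policy that succeeds across many perturbed simulators is more
# likely to survive the modeling errors of real hardware.
for episode in range(3):
    params = randomize_params(NOMINAL)
    print(params)
```

The randomization ranges are the key design choice: too narrow and the sim-to-real gap remains, too wide and the task becomes unlearnable.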
What's Next
The frontier is foundation policies for control — large models pretrained on diverse simulation data that can adapt to new tasks with minimal fine-tuning, analogous to language model pretraining. Expect convergence with vision-language-action models for robots and increasing emphasis on real-world deployment over simulation benchmarks.
Related Tasks
Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency — not just final performance — is the new frontier. The benchmark's age is both its strength (decades of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).
Offline RL
Offline RL — learning policies from fixed datasets without further environment interaction — matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2022) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with benchmarks like CORL tracking reproducibility across dozens of algorithms.