Reinforcement Learning

Continuous Control

Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and NVIDIA's Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.


Continuous control tasks require RL agents to output real-valued actions (torques, forces, velocities) for locomotion, manipulation, and other physical systems. MuJoCo and Isaac Gym are the standard simulators, with PPO and SAC as dominant algorithms. Sim-to-real transfer remains the key challenge for real-world deployment.
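To make "real-valued actions" concrete, here is a minimal pure-Python sketch: a 1-D point mass driven by a bounded force, with a hand-coded proportional-derivative controller standing in for a learned policy. The dynamics, gains, and limits are illustrative, not from any benchmark.

```python
def step(state, force, dt=0.05):
    """One Euler step for a 1-D point mass: state = (position, velocity).
    A toy stand-in for a physics simulator such as MuJoCo."""
    pos, vel = state
    force = max(-1.0, min(1.0, force))  # real actuators are force/torque-limited
    vel += force * dt
    pos += vel * dt
    return (pos, vel)

# A hand-coded PD controller (in place of a learned policy) drives the mass
# to the origin, emitting a continuous, real-valued command at every step:
state = (1.0, 0.0)
for _ in range(400):
    action = -5.0 * state[0] - 2.0 * state[1]
    state = step(state, action)
```

An RL agent would replace the PD expression with a neural-network policy trained to maximize reward, but the interface — continuous state in, continuous action out, every timestep — is the same.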

History

2015

DDPG (Lillicrap et al.) extends DQN to continuous action spaces

2015

MuJoCo (Todorov et al., 2012) becomes the standard physics simulator for continuous control benchmarks

2015

Schulman et al. introduce TRPO for stable policy gradient optimization

2017

PPO (Proximal Policy Optimization) simplifies TRPO and becomes the default algorithm

2018

SAC (Soft Actor-Critic) achieves state-of-the-art by maximizing entropy alongside reward

2019

OpenAI's Dactyl system solves a Rubik's Cube with a robot hand, demonstrating sim-to-real transfer via domain randomization

2021

Isaac Gym (NVIDIA) enables GPU-accelerated parallel simulation for massive speedups

2023

DreamerV3 achieves strong continuous control via learned world models

2023

TD-MPC2 scales model-based RL to 80+ continuous control tasks with a single model

2024

Foundation policies trained on diverse simulation data show zero-shot transfer capabilities

How Continuous Control Works

Continuous Control Pipeline
1

State Observation

The agent receives the system state — joint positions, velocities, contact forces — as a continuous vector (or images for vision-based control).

2

Policy Evaluation

A neural network maps the state to a continuous action distribution (typically Gaussian), parameterized by mean and variance for each action dimension.

3

Action Execution

An action is sampled from the distribution and applied as torques/forces to the simulated (or real) physical system.

4

Reward Collection

The environment returns a scalar reward based on task progress — distance traveled, object reached, energy minimized.

5

Policy Update

The policy network is updated using policy gradients (PPO), Q-learning (SAC), or world-model-based planning (DreamerV3, TD-MPC2).
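The loop above can be sketched end to end. Below is a minimal, unbaselined REINFORCE update on a hypothetical one-dimensional task (emit a scalar action near a hidden target); real PPO/SAC implementations add critics, clipping, and entropy terms, but the sample-reward-update cycle is the same.

```python
import math, random

random.seed(1)

# Toy task: output a scalar action close to a hidden target; reward = -(action - target)^2.
# The loop mirrors steps 2-5: evaluate a Gaussian policy, sample an action,
# collect a scalar reward, and apply a policy-gradient update to the mean.
target = 0.7
mean, std, lr = 0.0, 0.3, 0.05

for episode in range(2000):
    action = random.gauss(mean, std)            # sample from the action distribution
    reward = -(action - target) ** 2            # scalar task reward
    grad_log_prob = (action - mean) / std ** 2  # d/d-mean of log N(action; mean, std)
    mean += lr * reward * grad_log_prob         # unbaselined REINFORCE update
```

After training, `mean` sits near the hidden target: the policy has learned to emit the rewarding continuous action purely from sampled rollouts, with no access to the reward function's form.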

Current Landscape

Continuous control in 2025 has converged on two paradigms: model-free (PPO/SAC with massive parallelism via Isaac Gym) and model-based (DreamerV3/TD-MPC2 learning dynamics models). Standard MuJoCo benchmarks (HalfCheetah, Ant, Humanoid) are effectively solved, with research pushing toward more complex manipulation, multi-agent control, and sim-to-real transfer. The field is increasingly integrated with robotics, where continuous control is a necessary component of end-to-end robot learning.
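The "massive parallelism" point is easy to illustrate: vectorized simulators step thousands of environments as one array operation instead of one Python loop iteration each. A NumPy sketch of the idea — Isaac Gym does the same thing with tensorized physics resident on the GPU; the point-mass dynamics here are illustrative:

```python
import numpy as np

def batched_step(pos, vel, actions, dt=0.05):
    """Advance N identical point-mass environments in one vectorized call."""
    force = np.clip(actions, -1.0, 1.0)  # per-environment actuator limits
    vel = vel + force * dt
    pos = pos + vel * dt
    return pos, vel

n_envs = 4096                  # thousands of copies, one tensor op per timestep
pos = np.ones(n_envs)
vel = np.zeros(n_envs)
pos, vel = batched_step(pos, vel, actions=-np.ones(n_envs))
```

Because every environment shares the same compute kernel, rollout throughput scales with hardware width rather than Python overhead — the property that lets PPO training finish in minutes.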

Key Challenges

Reward design — continuous control tasks require carefully shaped rewards to avoid degenerate behaviors

Sim-to-real gap — policies trained in simulation often fail on real hardware due to modeling errors

Sample complexity — complex locomotion tasks can require billions of simulation steps to solve

Multi-task generalization — single policies that handle diverse control tasks remain difficult

Contact dynamics — simulating and learning from contact-rich manipulation is numerically challenging
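The reward-design challenge above is concrete enough to show in code: a sparse success reward gives the learner almost no signal to follow, while a shaped reward provides a dense gradient plus an energy penalty against degenerate high-torque behavior. The functions and weight below are illustrative, not drawn from any benchmark.

```python
def sparse_reward(pos, goal):
    """Reward only on success -- hard to learn from, since nearly all
    trajectories return exactly zero."""
    return 1.0 if abs(pos - goal) < 0.05 else 0.0

def shaped_reward(pos, goal, action, w_energy=0.01):
    """Dense shaping: negative distance to the goal, minus an energy penalty
    that discourages degenerate high-torque solutions."""
    return -abs(pos - goal) - w_energy * action ** 2
```

Shaping is a double-edged sword: the dense term accelerates learning, but a poorly chosen weight can make the agent optimize the shaping signal instead of the task.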

Quick Recommendations

General-purpose continuous control

SAC / PPO

Most reliable, well-documented algorithms with extensive codebases (CleanRL, Stable-Baselines3)

Sample-efficient control

DreamerV3 / TD-MPC2

World-model-based methods achieve strong performance with 10-100x fewer environment interactions

Large-scale parallel training

Isaac Gym + PPO

GPU-accelerated simulation enables training in minutes instead of hours

Real robot deployment

SAC + domain randomization

Proven sim-to-real pipeline with robust transfer
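The "SAC + domain randomization" recipe hinges on resampling simulator parameters every episode, so the policy cannot overfit one dynamics model and must be robust across the whole distribution. A minimal sketch; the parameter names and ranges are illustrative, not tuned values from any published pipeline:

```python
import random

def randomized_params(rng):
    """Sample per-episode physics parameters; training across this
    distribution forces robustness to the sim-to-real modeling gap."""
    return {
        "mass": rng.uniform(0.8, 1.2),            # +/-20% around nominal mass
        "friction": rng.uniform(0.5, 1.5),
        "obs_noise_std": rng.uniform(0.0, 0.02),  # simulated sensor noise
    }

rng = random.Random(0)
# At the start of each training episode, rebuild the simulator with new params:
episode_params = [randomized_params(rng) for _ in range(3)]
```

If the real robot's dynamics fall inside the randomized range, a policy that succeeds across the distribution has a good chance of transferring without fine-tuning.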

What's Next

The frontier is foundation policies for control — large models pretrained on diverse simulation data that can adapt to new tasks with minimal fine-tuning, analogous to language model pretraining. Expect convergence with vision-language-action models for robots and increasing emphasis on real-world deployment over simulation benchmarks.
