Robot Manipulation
Robot manipulation (grasping, placing, and using tools) is where sim-to-real transfer and foundation models meet physical dexterity. Dex-Net (2017) pioneered data-driven grasp planning, but the field accelerated when contact-rich manipulation was tackled with RL in simulation (DexterousHands, 2023) and then transferred to real hardware. The current state of the art combines diffusion policies (Chi et al., 2023) with large pretrained vision encoders to achieve robust 6-DOF manipulation from a handful of demonstrations, though deformable objects and multi-step assembly remain unsolved.
Robot manipulation focuses on grasping, moving, and assembling objects with robotic grippers and hands. Diffusion policies and vision-language-action models have dramatically improved generalization, but dexterous multi-finger manipulation and deformable object handling remain open challenges.
History
2016: Levine et al. scale robotic grasping to 800K attempts across 14 robots
2017: Dex-Net 2.0 plans grasps from depth images using a GQ-CNN
2018: QT-Opt learns vision-based grasping from 580K real-world grasps at 96% success
2019: OpenAI Dactyl solves a Rubik's Cube with dexterous in-hand manipulation
2020: Transporter Networks learn pick-and-place from few demonstrations
2023: Diffusion Policy models multi-modal action distributions for complex manipulation
2023: RT-2 demonstrates language-conditioned manipulation via a VLM backbone
2024: π0 achieves cross-task manipulation transfer from folding to packing to cleaning
2024: DexCap enables learning dexterous hand manipulation from human hand capture data
Multi-finger dexterous manipulation with tactile sensing reaches practical reliability
How Robot Manipulation Works
Scene Perception
RGB-D cameras observe the workspace, and the robot segments objects, estimates poses, and identifies grasp candidates.
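The perception step can be sketched in a few lines: segment depth-image pixels that rise above a known table plane and propose a grasp point per object. This is a minimal sketch under simplifying assumptions (a flat table at known depth, one connected object); the function name and thresholds are illustrative, not any particular library's API.

```python
import numpy as np

def find_grasp_candidates(depth, table_depth, min_pixels=20):
    """Segment objects above the table plane in a depth image and propose
    one top-down grasp candidate (the blob centroid). Illustrative sketch:
    a real pipeline would use instance segmentation and pose estimation."""
    # Pixels closer to the camera than the table belong to objects.
    mask = depth < table_depth - 0.005  # 5 mm margin against sensor noise
    if mask.sum() < min_pixels:
        return []
    # Naive single-object assumption: treat all object pixels as one blob.
    ys, xs = np.nonzero(mask)
    centroid = (float(xs.mean()), float(ys.mean()))
    grasp_height = float(depth[mask].min())  # top surface of the object
    return [{"pixel": centroid, "depth": grasp_height}]

# Synthetic 64x64 depth image: table at 0.80 m, a small box at 0.70 m.
depth = np.full((64, 64), 0.80)
depth[20:30, 25:40] = 0.70
candidates = find_grasp_candidates(depth, table_depth=0.80)
```

A real system would convert the pixel centroid to a 3-D grasp pose using the camera intrinsics and extrinsics.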
Grasp Planning
The system selects a grasp configuration — where and how to grip the object — based on geometry, physics, and task requirements.
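One classic geometric criterion for the "where and how to grip" decision is the antipodal condition for a parallel-jaw gripper: the axis between the two contact points must lie inside both friction cones. The sketch below scores a grasp that way; it is a textbook heuristic, not Dex-Net's learned GQ-CNN, and the function name is illustrative.

```python
import numpy as np

def antipodal_score(p1, n1, p2, n2, friction_coef=0.5):
    """Score a parallel-jaw grasp with the antipodal condition: the axis
    connecting the contacts must fall inside both friction cones, whose
    half-angle is arctan(mu). Normals point inward, into the object."""
    axis = np.asarray(p2, float) - np.asarray(p1, float)
    axis /= np.linalg.norm(axis)
    half_angle = np.arctan(friction_coef)
    # Angle between the grasp axis and each inward surface normal.
    a1 = np.arccos(np.clip(np.dot(axis, n1), -1.0, 1.0))
    a2 = np.arccos(np.clip(np.dot(-axis, n2), -1.0, 1.0))
    in_cone = a1 <= half_angle and a2 <= half_angle
    # Smaller worst-case angle means a more robust grasp; score in [0, 1].
    score = max(0.0, 1.0 - max(a1, a2) / half_angle)
    return in_cone, score

# Opposing contacts on the two faces of a 4 cm box: perfectly antipodal.
ok, s = antipodal_score([0, 0, 0], np.array([1.0, 0, 0]),
                        [0.04, 0, 0], np.array([-1.0, 0, 0]))
```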
Motion Planning
A collision-free trajectory is computed from the current configuration to the grasp pose and then to the placement target.
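A minimal way to compute such a collision-free trajectory is a rapidly-exploring random tree (RRT). The sketch below plans in a 2-D configuration space under strong simplifications (caller-supplied point collision checker, no edge checking, no smoothing); names and parameters are illustrative.

```python
import numpy as np

def rrt_plan(start, goal, is_free, step=0.05, max_iters=4000,
             goal_bias=0.1, seed=0):
    """Minimal RRT sketch in the unit square. `is_free(q)` is a caller-
    supplied collision checker; a real planner would also validate the
    edge between nodes, plan in joint space, and smooth the path."""
    rng = np.random.default_rng(seed)
    nodes = [np.asarray(start, float)]
    parents = [-1]
    for _ in range(max_iters):
        # Sample a random configuration, occasionally biased toward the goal.
        target = np.asarray(goal, float) if rng.random() < goal_bias \
                 else rng.random(2)
        dists = [np.linalg.norm(n - target) for n in nodes]
        near = int(np.argmin(dists))
        direction = target - nodes[near]
        norm = np.linalg.norm(direction)
        if norm == 0:
            continue
        # Extend the nearest node one step toward the sample.
        new = nodes[near] + direction / norm * min(step, norm)
        if not is_free(new):
            continue
        nodes.append(new)
        parents.append(near)
        if np.linalg.norm(new - goal) < step:  # close enough: extract path
            path, i = [], len(nodes) - 1
            while i != -1:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None

# Free space: the unit square minus a circular obstacle in the middle.
free = lambda q: np.linalg.norm(q - np.array([0.5, 0.5])) > 0.2
path = rrt_plan([0.1, 0.1], [0.9, 0.9], free)
```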
Closed-Loop Execution
During execution, force/torque and visual feedback enable real-time adjustments for robust grasping and placement.
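The feedback idea can be illustrated with gripper closing: instead of commanding a blind final width, close in small steps until the measured grip force reaches a target. The two callbacks below stand in for real driver calls and are assumptions, not a specific robot API; the spring model is a toy stand-in for hardware.

```python
def close_with_force_feedback(read_force, command_width, start_width=0.08,
                              target_force=5.0, step=0.002):
    """Close a parallel gripper step by step until the sensed grip force
    reaches `target_force` (N), so object size need not be known exactly."""
    width = start_width
    while width > 0.0:
        command_width(width)
        if read_force() >= target_force:  # firm contact established
            return width
        width -= step                     # keep closing
    return 0.0

# Toy "hardware": a 30 mm object that pushes back once squeezed.
state = {"width": 0.08}

def command_width(w):
    state["width"] = w

def read_force():
    # Linear spring model: force grows as the gripper compresses the object.
    return max(0.0, (0.030 - state["width"]) * 2000.0)

final = close_with_force_feedback(read_force, command_width)
```

The same pattern generalizes to visual servoing: replace the force readout with a pose error from the camera and adjust the trajectory each cycle.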
Skill Composition
Complex manipulation tasks chain multiple primitive skills: reach, grasp, lift, transport, orient, insert, release.
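The primitive-skill chain above can be sketched as plain functions over a shared state dict, with a small executor running them in order. All names here (reach, grasp, transport, release, run_plan) are illustrative, not a real robot API.

```python
def reach(s, pose):
    s["ee"] = pose          # move the end-effector to a named pose
    return s

def grasp(s):
    # Pick up whatever object sits at the current end-effector pose.
    s["holding"] = s["objects"].pop(s["ee"], None)
    return s

def transport(s, pose):
    s["ee"] = pose          # carry the held object to the target pose
    return s

def release(s):
    if s["holding"] is not None:
        s["objects"][s["ee"]] = s["holding"]
        s["holding"] = None
    return s

def run_plan(state, plan):
    """Execute a manipulation task as an ordered chain of primitive skills."""
    for skill, *args in plan:
        state = skill(state, *args)
    return state

state = {"ee": "home", "holding": None, "objects": {"bin_a": "bolt"}}
plan = [(reach, "bin_a"), (grasp,), (transport, "bin_b"), (release,)]
state = run_plan(state, plan)
```

In practice each primitive wraps a learned or planned controller, and a task planner (or a language model) emits the chain.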
Current Landscape
Robot manipulation in 2025 has advanced dramatically through two paradigm shifts: (1) diffusion policies that model multi-modal action distributions for complex contact-rich tasks, and (2) vision-language-action models that enable language-conditioned manipulation. Simple pick-and-place is commercially deployed (Amazon, logistics), while research pushes toward dexterous multi-finger manipulation, deformable object handling, and tool use. The data bottleneck is being addressed through teleoperation (ALOHA) and simulation (Isaac Gym).
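The diffusion-policy mechanism mentioned above can be sketched at inference time: start from Gaussian noise over a short action horizon and iteratively denoise it with a learned noise predictor. Everything below is a heavily simplified DDPM-style sketch; `noise_model` stands in for the trained network (here a stub), and the schedule constants are illustrative, not those of Chi et al.

```python
import numpy as np

def denoise_actions(noise_model, horizon=8, action_dim=2, steps=10, seed=0):
    """Diffusion-policy inference sketch: reverse-diffuse a noisy action
    trajectory. The stochastic noise term of the DDPM posterior is omitted
    to keep the sketch deterministic."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.1, steps)   # simplified variance schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal((horizon, action_dim))  # pure-noise trajectory
    for t in reversed(range(steps)):
        eps = noise_model(x, t)             # predicted noise at step t
        # DDPM posterior mean update (deterministic variant).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    return x

# Stub "network" that predicts the current value as noise, pulling the
# trajectory toward zero; a trained model would pull it toward demonstrations.
trajectory = denoise_actions(lambda x, t: x)
```

Conditioning on images and proprioception enters through the noise predictor, which is where the multi-modality of demonstrations is captured.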
Key Challenges
Deformable objects (cloth, rope, food) have effectively infinite-dimensional state spaces that resist standard planning
Dexterous manipulation with multi-finger hands requires controlling 20+ degrees of freedom simultaneously
Tool use (spatulas, screwdrivers) requires understanding physics beyond direct contact
Tactile sensing integration: combining vision and touch for reliable manipulation under occlusion
Long-horizon assembly: multi-step assembly tasks with tight tolerances remain extremely challenging
Quick Recommendations
General manipulation research: Diffusion Policy + 6-DOF robot. Best framework for learning multi-modal manipulation from demonstrations.
Language-conditioned manipulation: RT-2 / Octo. Maps natural language instructions to manipulation actions.
Dexterous hand manipulation: DexCap + Isaac Gym sim-to-real. State-of-the-art pipeline for learning hand manipulation skills.
Industrial pick-and-place: Dex-Net 4.0 / Contact-GraspNet. Proven in production for bin-picking applications.
What's Next
The frontier is reliable dexterous manipulation in unstructured environments — folding laundry, cooking meals, assembling furniture. Key advances needed: (1) better tactile sensing integration, (2) manipulation foundation models trained on diverse cross-task data, (3) real-time adaptation to novel objects through in-context learning.
Benchmarks & SOTA
No datasets indexed for this task yet.
Related Tasks
Robot Navigation
Autonomous navigation — moving through unstructured environments while avoiding obstacles — spans indoor service robots to outdoor last-mile delivery. Classical SLAM (simultaneous localization and mapping) methods like ORB-SLAM still dominate mapping, but end-to-end learning approaches using habitat simulators (Habitat 2.0, iGibson) show promise for semantic navigation ("go to the kitchen"). The Habitat Challenge results reveal that modular pipelines (map → plan → act) consistently beat monolithic learned policies, suggesting that full end-to-end navigation is still years away from displacing classical stacks in production.
Robotics
End-to-end robotics — learning perception, planning, and control in a single model — entered a new era with vision-language-action (VLA) models. Google's RT-2 (2023) showed that a web-pretrained VLM could directly output robot actions, and the open-source Open X-Embodiment dataset (2023) unified data from 22 robot types across 21 institutions. The key tension is generalization: lab demos on specific robots are plentiful, but a single policy that transfers across embodiments, tasks, and environments remains the holy grail, with π₀ (Physical Intelligence, 2024) and Google's RT-X pushing this frontier.
Sim-to-Real Transfer
Sim-to-real transfer — training policies in simulation and deploying on physical hardware — is the bridge between unlimited virtual data and messy reality. Domain randomization (Tobin et al., 2017) was the first scalable approach, and OpenAI's Rubik's cube hand (2019) showed it could work for dexterous manipulation. The modern toolkit combines photorealistic rendering (Isaac Sim, MuJoCo MJX on GPU), system identification, and real-world fine-tuning, but the gap persists for contact-rich tasks where simulation physics diverge from reality. Narrowing this gap is existential for robotics — it determines whether lab results actually work in factories and homes.