Robot Manipulation

Robot manipulation — grasping, placing, and using tools — is where sim-to-real and foundation models meet physical dexterity. Dex-Net (2017) pioneered data-driven grasp planning, but the field accelerated when contact-rich manipulation was tackled with RL in simulation (DexterousHands, 2023) and then transferred to real hardware. The current state of the art combines diffusion policies (Chi et al., 2023) with large pretrained vision encoders to achieve robust 6-DOF manipulation from a handful of demonstrations, though deformable objects and multi-step assembly remain unsolved.


Robot manipulation focuses on grasping, moving, and assembling objects with robotic grippers and hands. Diffusion policies and vision-language-action models have dramatically improved generalization, but dexterous multi-finger manipulation and deformable object handling remain open challenges.

History

2016

Levine et al. scale robotic grasping to 800K attempts across 14 robots

2017

Dex-Net 2.0 plans grasps from depth images using a GQ-CNN

2018

QT-Opt learns vision-based grasping from 580K real-world grasps at 96% success

2019

OpenAI Dactyl solves Rubik's Cube with dexterous in-hand manipulation

2020

Transporter Networks learn pick-and-place from few demonstrations

2023

Diffusion Policy models multi-modal action distributions for complex manipulation

2023

RT-2 demonstrates language-conditioned manipulation via VLM backbone

2024

π0 achieves cross-task manipulation transfer from folding to packing to cleaning

2024

DexCap enables learning dexterous hand manipulation from human hand capture data

2025

Multi-finger dexterous manipulation with tactile sensing reaches practical reliability

How Robot Manipulation Works

Robot Manipulation Pipeline
1

Scene Perception

RGB-D cameras observe the workspace, and the robot segments objects, estimates poses, and identifies grasp candidates.

2

Grasp Planning

The system selects a grasp configuration — where and how to grip the object — based on geometry, physics, and task requirements.

3

Motion Planning

A collision-free trajectory is computed from the current configuration to the grasp pose and then to the placement target.

4

Closed-Loop Execution

During execution, force/torque and visual feedback enable real-time adjustments for robust grasping and placement.

5

Skill Composition

Complex manipulation tasks chain multiple primitive skills: reach, grasp, lift, transport, orient, insert, release.
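The five stages above can be sketched as a single control loop. Everything below is a hypothetical skeleton — `perceive`, `plan_grasp`, `plan_motion`, and `execute` are placeholder names standing in for real perception, grasp-sampling, motion-planning, and control modules, not any specific library API.

```python
import numpy as np

def perceive(rgbd_image):
    """Stage 1: segment objects, estimate poses, return grasp candidates."""
    # A real system would run instance segmentation + pose estimation here;
    # this placeholder returns one candidate with a fixed quality score.
    return [{"pose": np.eye(4), "quality": 0.9}]

def plan_grasp(candidates, task):
    """Stage 2: pick the candidate best suited to geometry and the task."""
    return max(candidates, key=lambda c: c["quality"])

def plan_motion(start_q, grasp_pose):
    """Stage 3: compute a collision-free joint trajectory (e.g. via an RRT)."""
    return [start_q, start_q]  # placeholder two-waypoint trajectory

def execute(trajectory, feedback):
    """Stage 4: track the trajectory with closed-loop corrections."""
    for waypoint in trajectory:
        target = waypoint + 0.01 * feedback()  # blend in force/visual feedback
        # A real controller would send `target` to the joint servos here.

def manipulate(rgbd_image, start_q):
    """Stage 5: compose primitives (reach -> grasp) into one skill."""
    grasp = plan_grasp(perceive(rgbd_image), task="pick")
    execute(plan_motion(start_q, grasp["pose"]), feedback=lambda: np.zeros(7))
    return grasp
```

In practice each stage is its own subsystem; the value of the decomposition is that any stage can be swapped (learned grasp sampler vs. analytic one) without touching the others.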

Current Landscape

Robot manipulation in 2025 has advanced dramatically through two paradigm shifts: (1) diffusion policies that model multi-modal action distributions for complex contact-rich tasks, and (2) vision-language-action models that enable language-conditioned manipulation. Simple pick-and-place is commercially deployed (Amazon, logistics), while research pushes toward dexterous multi-finger manipulation, deformable object handling, and tool use. The data bottleneck is being addressed through teleoperation (ALOHA) and simulation (Isaac Gym).
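To make the first paradigm shift concrete, here is a minimal DDPM-style denoising loop in the spirit of Diffusion Policy: an action sequence is sampled by starting from Gaussian noise and iteratively denoising it. `predict_noise` is a toy stand-in for the trained observation-conditioned network, and the noise schedule is illustrative, not taken from the paper.

```python
import numpy as np

def predict_noise(actions, obs, t):
    """Toy stand-in for a trained noise-prediction network (U-Net/transformer)."""
    return 0.1 * actions

def sample_actions(obs, horizon=16, action_dim=7, steps=50, rng=None):
    """Sample an action sequence by iterative denoising (DDPM-style)."""
    rng = np.random.default_rng(0) if rng is None else rng
    betas = np.linspace(1e-4, 0.02, steps)          # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, obs, t)
        # DDPM posterior-mean update; the stochastic term is dropped at t == 0.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```

Because the sampler draws from a distribution rather than regressing a single mean action, it can represent multi-modal behavior — e.g. grasping a mug from the left or the right — which is exactly what deterministic behavior cloning struggles with.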

Key Challenges

Deformable objects (cloth, rope, food) have infinite-dimensional state spaces that resist standard planning

Dexterous manipulation with multi-finger hands requires controlling 20+ degrees of freedom simultaneously

Tool use — wielding objects as tools (spatulas, screwdrivers) requires reasoning about physics beyond direct contact, such as leverage and force transmission

Tactile sensing integration — combining vision and touch for reliable manipulation under occlusion

Long-horizon assembly — multi-step assembly tasks with tight tolerances remain extremely challenging

Quick Recommendations

General manipulation research

Diffusion Policy + 6-DOF robot

Best framework for learning multi-modal manipulation from demonstrations

Language-conditioned manipulation

RT-2 / Octo

Map natural language instructions to manipulation actions

Dexterous hand manipulation

DexCap + Isaac Gym sim-to-real

State-of-the-art pipeline for learning hand manipulation skills

Industrial pick-and-place

Dex-Net 4.0 / Contact-GraspNet

Proven in production for bin-picking applications

What's Next

The frontier is reliable dexterous manipulation in unstructured environments — folding laundry, cooking meals, assembling furniture. Key advances needed: (1) better tactile sensing integration, (2) manipulation foundation models trained on diverse cross-task data, (3) real-time adaptation to novel objects through in-context learning.

Benchmarks & SOTA

No datasets indexed for this task yet.

Contribute on GitHub

Related Tasks

Robot Navigation

Autonomous navigation — moving through unstructured environments while avoiding obstacles — spans indoor service robots to outdoor last-mile delivery. Classical SLAM (simultaneous localization and mapping) methods like ORB-SLAM still dominate mapping, but end-to-end learning approaches trained in embodied-AI simulators (Habitat 2.0, iGibson) show promise for semantic navigation ("go to the kitchen"). The Habitat Challenge results reveal that modular pipelines (map → plan → act) consistently beat monolithic learned policies, suggesting that full end-to-end navigation is still years away from displacing classical stacks in production.

Robotics

End-to-end robotics — learning perception, planning, and control in a single model — entered a new era with vision-language-action (VLA) models. Google's RT-2 (2023) showed that a web-pretrained VLM could directly output robot actions, and the open-source Open X-Embodiment dataset (2023) unified data from 22 robot types across 21 institutions. The key tension is generalization: lab demos on specific robots are plentiful, but a single policy that transfers across embodiments, tasks, and environments remains the holy grail, with π₀ (Physical Intelligence, 2024) and Google's RT-X pushing this frontier.

Sim-to-Real Transfer

Sim-to-real transfer — training policies in simulation and deploying on physical hardware — is the bridge between unlimited virtual data and messy reality. Domain randomization (Tobin et al., 2017) was the first scalable approach, and OpenAI's Rubik's cube hand (2019) showed it could work for dexterous manipulation. The modern toolkit combines photorealistic rendering (Isaac Sim, MuJoCo MJX on GPU), system identification, and real-world fine-tuning, but the gap persists for contact-rich tasks where simulation physics diverge from reality. Narrowing this gap is existential for robotics — it determines whether lab results actually work in factories and homes.
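Domain randomization, mentioned above, amounts to resampling simulator parameters every episode so the real system falls within the training distribution. A toy sketch, with made-up parameter names and ranges purely for illustration:

```python
import random

def randomize_domain(rng):
    """Sample one episode's physics/visual parameters (illustrative ranges)."""
    return {
        "friction": rng.uniform(0.5, 1.5),
        "object_mass_kg": rng.uniform(0.05, 0.5),
        "motor_gain": rng.uniform(0.8, 1.2),
        "camera_jitter_px": rng.uniform(0.0, 5.0),
        "light_intensity": rng.uniform(0.3, 1.0),
    }

# Each training episode resets the simulator with a fresh sample:
rng = random.Random(0)
episode_params = [randomize_domain(rng) for _ in range(3)]
```

The design choice is which parameters to randomize and how widely: too narrow and the real robot falls outside the distribution; too wide and the policy becomes overly conservative.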

Something wrong or missing?

Help keep Robot Manipulation benchmarks accurate. Report outdated results, missing benchmarks, or errors.
