Robotics
End-to-end robotics — learning perception, planning, and control in a single model — entered a new era with vision-language-action (VLA) models. Google's RT-2 (2023) showed that a web-pretrained VLM could directly output robot actions, and the open-source Open X-Embodiment dataset (2023) unified data from 22 robot types across 21 institutions. The key tension is generalization: lab demos on specific robots are plentiful, but a single policy that transfers across embodiments, tasks, and environments remains the holy grail, with π₀ (Physical Intelligence, 2024) and Google's RT-X pushing this frontier.
General-purpose robotics combines perception, planning, and control to build machines that manipulate objects and navigate the physical world. Foundation models (RT-2, π₀) are transforming the field by enabling language-conditioned robot behavior learned from internet-scale data combined with robot demonstrations.
History
Levine et al. demonstrate large-scale robotic grasping with deep learning (800K grasp attempts, 2016)
OpenAI Dactyl solves a Rubik's Cube with a robot hand using sim-to-real transfer (2019)
RoboNet provides diverse multi-robot video data for learning visual dynamics (2019)
SayCan (Google) grounds language models in robot affordances for task planning (2022)
Inner Monologue uses LLM reasoning for closed-loop robotic task execution (2022)
RT-2 (Robotic Transformer 2) directly maps vision and language to robot actions using a VLM backbone (2023)
Open X-Embodiment dataset enables cross-robot transfer learning at scale (2023)
Mobile ALOHA enables low-cost bimanual mobile manipulation with teleoperation learning (2024)
Physical Intelligence π₀ — foundation model for robotics trained on diverse manipulation data (2024)
Figure 01 and Tesla Optimus demonstrate humanoid robots performing warehouse tasks (2024)
How Robotics Works
Perception
Cameras, depth sensors, and proprioception provide the robot's understanding of the scene — object positions, shapes, and spatial relationships.
Task Specification
The robot receives a task via natural language ('pick up the red cup'), goal images, or learned reward functions.
Planning
High-level planning decomposes the task into subtasks; low-level planning computes motion trajectories that avoid collisions and respect physical constraints.
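The low-level half of this split can be illustrated with a minimal sampling-based motion planner. The sketch below is a toy 2-D RRT with a single hand-made circular obstacle; all coordinates, step sizes, and the obstacle itself are illustrative, and real planners run in joint space against full collision geometry.

```python
# Toy 2-D RRT sketch for low-level motion planning (illustrative only).
import math
import random

OBSTACLES = [((5.0, 5.0), 2.0)]        # (center, radius) circles, hypothetical scene
START, GOAL = (1.0, 1.0), (9.0, 9.0)
STEP, GOAL_TOL = 0.5, 0.5

def collides(p):
    """True if point p is inside any obstacle."""
    return any(math.dist(p, c) <= r for c, r in OBSTACLES)

def steer(a, b):
    """Move from a toward b by at most STEP."""
    d = math.dist(a, b)
    if d <= STEP:
        return b
    t = STEP / d
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))

def rrt(seed=0, max_iters=5000):
    random.seed(seed)
    nodes, parent = [START], {0: None}
    for _ in range(max_iters):
        # Sample a random point, biased toward the goal 10% of the time.
        sample = GOAL if random.random() < 0.1 else (
            random.uniform(0, 10), random.uniform(0, 10))
        near_i = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        new = steer(nodes[near_i], sample)
        if collides(new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = near_i
        if math.dist(new, GOAL) < GOAL_TOL:
            # Walk the parent pointers back to the start to extract the path.
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None
```

Each tree extension is capped at STEP, so every returned waypoint sequence is collision-free at its sampled points and made of short, executable segments.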
Control Execution
Joint torques or position commands are sent to actuators, with real-time feedback correction for disturbances.
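The feedback-correction step can be sketched with a basic PD controller on a 1-D point mass; the gains, mass, and constant disturbance below are made-up illustrative values, not parameters of any real robot.

```python
# PD position control sketch: drive a 1-D point mass to a setpoint despite
# a constant unmodeled disturbance force (all values illustrative).
def simulate_pd(target=1.0, kp=50.0, kd=10.0, mass=1.0,
                disturbance=-2.0, dt=0.001, steps=5000):
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        cmd = kp * (target - pos) - kd * vel   # PD feedback force command
        acc = (cmd + disturbance) / mass       # disturbance acts on the plant
        vel += acc * dt                        # semi-implicit Euler integration
        pos += vel * dt
    return pos, vel
```

With these gains the system is well damped and settles quickly; the pure-PD loop leaves a small steady-state offset proportional to the disturbance, which is why real controllers often add an integral or feedforward term.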
Learning and Adaptation
The robot improves through demonstration data, simulation experience, and real-world trial-and-error, building generalizable manipulation skills.
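Putting the stages above together, a minimal perceive-plan-act loop might look like the toy sketch below, where the sensor, planner, and actuator are all stand-ins rather than a real robot API.

```python
# Toy perceive -> plan -> act loop tying the stages together (not a real robot API).
import random

def perceive(true_obj, noise=0.01):
    # Perception: noisy estimate of the object position.
    return tuple(x + random.uniform(-noise, noise) for x in true_obj)

def plan(gripper, obj_est, step=0.05):
    # Planning: next waypoint, a small proportional step toward the estimate.
    return tuple(g + step * (o - g) for g, o in zip(gripper, obj_est))

def control(gripper, waypoint):
    # Control: command the gripper to the waypoint (perfect actuation here).
    return waypoint

def run_episode(obj=(0.5, 0.3), start=(0.0, 0.0), steps=200, seed=0):
    random.seed(seed)
    gripper = start
    for _ in range(steps):
        est = perceive(obj)          # sense
        wp = plan(gripper, est)      # plan
        gripper = control(gripper, wp)  # act
    return gripper
```

Even this toy version shows the closed-loop property that matters: re-sensing each tick lets the controller absorb perception noise instead of committing to one stale estimate.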
Current Landscape
Robotics in 2025 is being revolutionized by foundation models — large networks pretrained on internet data and fine-tuned on robot demonstrations. RT-2 showed that VLMs can directly output robot actions, and π₀ demonstrated cross-task generalization. The hardware landscape is diversifying from industrial arms to humanoids (Figure, Tesla), low-cost bimanual systems (ALOHA), and mobile manipulators. Data remains the bottleneck: Open X-Embodiment and similar initiatives are trying to create the 'ImageNet moment' for robotics.
Key Challenges
Data scarcity — robot interaction data is orders of magnitude slower and more expensive to collect than internet text or images
Sim-to-real gap — policies trained in simulation often fail on real hardware due to unmodeled dynamics
Generalization — handling novel objects, lighting, and environments remains extremely difficult
Safety — robots operating near humans must be provably safe, adding hard constraints on learned policies
Hardware cost — research-grade robot arms cost $20K-100K, limiting accessibility
Quick Recommendations
Research platform
Mobile ALOHA / low-cost bimanual setup
Best cost-performance ratio for manipulation research
Language-conditioned manipulation
RT-2 / Octo
State-of-the-art in mapping language instructions to robot actions
Foundation model approach
π₀ (Physical Intelligence)
Most general-purpose robot foundation model as of 2025
Simulation development
Isaac Sim + MuJoCo
Best combination of speed (MuJoCo) and realism (Isaac Sim) for sim-to-real pipelines
What's Next
The frontier is general-purpose household robots — systems that can handle diverse manipulation tasks in unstructured environments with minimal task-specific training. Key enablers: (1) larger robot foundation models trained on cross-embodiment data, (2) fast sim-to-real transfer for new tasks, (3) natural language interfaces for non-expert users.
Benchmarks & SOTA
Related Tasks
Robot Navigation
Autonomous navigation — moving through unstructured environments while avoiding obstacles — spans indoor service robots to outdoor last-mile delivery. Classical SLAM (simultaneous localization and mapping) methods like ORB-SLAM still dominate mapping, but end-to-end learning approaches trained in embodied-AI simulators (Habitat 2.0, iGibson) show promise for semantic navigation ("go to the kitchen"). The Habitat Challenge results reveal that modular pipelines (map → plan → act) consistently beat monolithic learned policies, suggesting that full end-to-end navigation is still years away from displacing classical stacks in production.
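The modular map → plan → act decomposition can be sketched on a toy occupancy grid: a breadth-first shortest-path search stands in for the planner, and the hypothetical 5×5 map below replaces a real SLAM-built map.

```python
# Sketch of the "map -> plan -> act" modular pipeline on a toy occupancy grid.
from collections import deque

GRID = [  # 0 = free, 1 = obstacle; a hypothetical 5x5 map
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]

def plan_path(grid, start, goal):
    """BFS shortest path over 4-connected free cells; None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    frontier, parent = deque([start]), {start: None}
    while frontier:
        cur = frontier.popleft()
        if cur == goal:
            # Reconstruct the path by walking parent pointers back to start.
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and \
               grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cur
                frontier.append((nr, nc))
    return None
```

In a real stack the grid would come from the SLAM map and the "act" stage would feed waypoints to a local controller; the appeal of the modular design is that each stage can be debugged and swapped independently.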
Robot Manipulation
Robot manipulation — grasping, placing, and using tools — is where sim-to-real and foundation models meet physical dexterity. DexNet (2017) pioneered data-driven grasp planning, but the field accelerated when contact-rich manipulation was tackled with RL in simulation (DexterousHands, 2023) and then transferred to real hardware. Current state-of-the-art combines diffusion policies (Chi et al., 2023) with large pretrained vision encoders to achieve robust 6-DOF manipulation from a handful of demonstrations, though deformable objects and multi-step assembly remain unsolved.
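The iterative-refinement idea behind diffusion policies can be illustrated with a toy sketch: sample an action from Gaussian noise, then repeatedly denoise it. The "noise predictor" below is an analytic stand-in for a trained network and the expert action is a made-up 3-DoF target; this is an illustration of the sampling loop, not the Diffusion Policy algorithm itself.

```python
# Toy illustration of diffusion-style action sampling: start from pure noise
# and iteratively denoise toward an action (all values hypothetical).
import random

EXPERT_ACTION = [0.4, -0.2, 0.1]   # made-up 3-DoF action the "policy" imitates

def predict_noise(action):
    # A trained network would predict the noise; this analytic stand-in
    # simply points back at the expert action.
    return [a - e for a, e in zip(action, EXPERT_ACTION)]

def sample_action(steps=50, seed=0):
    rng = random.Random(seed)
    action = [rng.gauss(0, 1) for _ in range(3)]   # start from Gaussian noise
    for k in range(steps, 0, -1):
        eps = predict_noise(action)
        sigma = 0.01 * k / steps                   # shrinking noise schedule
        action = [a - 0.2 * e + rng.gauss(0, sigma)
                  for a, e in zip(action, eps)]
    return action
```

The point of the loop is that the action is refined gradually rather than regressed in one shot, which is what lets real diffusion policies capture multimodal demonstration data.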
Sim-to-Real Transfer
Sim-to-real transfer — training policies in simulation and deploying on physical hardware — is the bridge between unlimited virtual data and messy reality. Domain randomization (Tobin et al., 2017) was the first scalable approach, and OpenAI's Rubik's cube hand (2019) showed it could work for dexterous manipulation. The modern toolkit combines photorealistic rendering (Isaac Sim, MuJoCo MJX on GPU), system identification, and real-world fine-tuning, but the gap persists for contact-rich tasks where simulation physics diverge from reality. Narrowing this gap is existential for robotics — it determines whether lab results actually work in factories and homes.
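In its simplest form, domain randomization just resamples simulator parameters at the start of every training episode so the policy never overfits one dynamics model. The parameter names and ranges below are illustrative placeholders, and the simulator call is left as a commented stub.

```python
# Domain randomization sketch: resample physics each episode so the policy
# trains against a distribution of dynamics (ranges are illustrative).
import random

def randomize_physics(rng):
    return {
        "mass_scale":    rng.uniform(0.8, 1.2),   # +/-20% link masses
        "friction":      rng.uniform(0.5, 1.5),
        "motor_latency": rng.uniform(0.0, 0.03),  # seconds of actuation delay
        "obs_noise_std": rng.uniform(0.0, 0.02),  # sensor noise level
    }

def train(num_episodes=3, seed=0):
    rng = random.Random(seed)
    configs = []
    for _ in range(num_episodes):
        params = randomize_physics(rng)
        configs.append(params)
        # sim.reset(**params)  # rollout and policy update would go here
    return configs
```

A policy that succeeds across this whole distribution is more likely to treat the real robot as just one more sample from it, which is the core bet of the approach.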