
Robotics

End-to-end robotics — learning perception, planning, and control in a single model — entered a new era with vision-language-action (VLA) models. Google's RT-2 (2023) showed that a web-pretrained VLM could directly output robot actions, and the open-source Open X-Embodiment dataset (2023) unified data from 22 robot types across 21 institutions. The key tension is generalization: lab demos on specific robots are plentiful, but a single policy that transfers across embodiments, tasks, and environments remains the holy grail, with π₀ (Physical Intelligence, 2024) and Google's RT-X pushing this frontier.


General-purpose robotics combines perception, planning, and control to build machines that manipulate objects and navigate the physical world. Foundation models (RT-2, π0) are transforming the field by enabling language-conditioned robot behavior learned from internet-scale data combined with robot demonstrations.

History

2016

Levine et al. demonstrate large-scale robotic grasping with deep learning (800K grasps)

2018

OpenAI Dactyl demonstrates dexterous in-hand manipulation with a robot hand using sim-to-real transfer (the Rubik's Cube result followed in 2019)

2019

RoboNet provides diverse multi-robot video data for learning visual dynamics

2022

SayCan (Google) grounds language models in robot affordances for task planning

2022

Inner Monologue uses LLM reasoning for closed-loop robotic task execution

2023

RT-2 (Robotic Transformer 2) directly maps vision and language to robot actions using a VLM backbone

2023

Mobile ALOHA enables low-cost bimanual mobile manipulation with teleoperation learning

2023

Open X-Embodiment dataset enables cross-robot transfer learning at scale

2024

Physical Intelligence π0 — foundation model for robotics trained on diverse manipulation data

2024

Figure 01 and Tesla Optimus demonstrate humanoid robots performing warehouse tasks

How Robotics Works

1. Perception

Cameras, depth sensors, and proprioception provide the robot's understanding of the scene — object positions, shapes, and spatial relationships.

2. Task Specification

The robot receives a task via natural language ('pick up the red cup'), goal images, or learned reward functions.

3. Planning

High-level planning decomposes the task into subtasks; low-level planning computes motion trajectories that avoid collisions and respect physical constraints.
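The two planning levels above can be sketched in a few lines of toy code: a high-level planner that visits subgoals in order, and a low-level planner that finds collision-free paths between them. This is an illustrative 2-D grid-world sketch, not code from any robotics library; all names here are invented for the example.

```python
# Hierarchical planning sketch on a toy 2-D grid (0 = free, 1 = obstacle).
from collections import deque

def plan_path(grid, start, goal):
    """Low-level planner: breadth-first search over free cells."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # no collision-free path exists

def plan_task(grid, start, subgoals):
    """High-level planner: visit each subgoal in order via BFS legs."""
    full_path, pos = [start], start
    for goal in subgoals:
        leg = plan_path(grid, pos, goal)
        if leg is None:
            return None
        full_path += leg[1:]  # drop the duplicated start of each leg
        pos = goal
    return full_path

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = plan_task(grid, (0, 0), [(0, 2), (2, 2)])
```

Real systems replace the BFS with sampling-based or trajectory-optimization planners and the subgoal list with an LLM- or model-generated task decomposition, but the two-level structure is the same.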

4. Control Execution

Joint torques or position commands are sent to actuators, with real-time feedback correction for disturbances.
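A minimal sketch of this feedback loop, assuming a 1-DOF joint under discrete PD control; the gains, inertia, and timestep are toy values, not tuned for any real actuator.

```python
# Discrete PD control loop: torque from position/velocity error,
# integrated with semi-implicit Euler dynamics (frictionless joint).
def pd_control_step(q, qd, q_target, kp=40.0, kd=8.0):
    """Compute a torque command from position and velocity error."""
    return kp * (q_target - q) - kd * qd

def simulate(q_target, steps=500, dt=0.01, inertia=1.0):
    q, qd = 0.0, 0.0
    for _ in range(steps):
        tau = pd_control_step(q, qd, q_target)  # control command
        qdd = tau / inertia                     # joint acceleration
        qd += qdd * dt                          # integrate velocity
        q += qd * dt                            # integrate position
    return q

final = simulate(q_target=1.0)  # converges close to the 1.0 rad target
```

The disturbance correction mentioned above falls out of the same loop: any deviation of `q` or `qd` from the target produces a restoring torque on the next cycle.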

5. Learning and Adaptation

The robot improves through demonstration data, simulation experience, and real-world trial-and-error, building generalizable manipulation skills.
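The simplest form of learning from demonstrations is behavior cloning: fit a policy to (state, action) pairs recorded from an expert. The sketch below uses a linear policy and synthetic noise-free data so ordinary least squares recovers the expert exactly; real systems use neural policies and noisy human teleoperation data.

```python
# Behavior cloning sketch: fit action = W @ state to demonstration pairs.
import numpy as np

rng = np.random.default_rng(0)
expert_W = np.array([[0.5, -1.0],
                     [2.0,  0.3]])            # hidden "expert" policy (toy)

states = rng.normal(size=(200, 2))            # demonstrated states
actions = states @ expert_W.T                 # expert actions (noise-free)

# Ordinary least squares: recover the policy weights from the pairs.
W_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)
W_hat = W_hat.T

def policy(state):
    """Cloned policy: maps a state to an action."""
    return W_hat @ state
```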

Current Landscape

Robotics in 2025 is being revolutionized by foundation models — large networks pretrained on internet data and fine-tuned on robot demonstrations. RT-2 showed that VLMs can directly output robot actions, and π0 demonstrated cross-task generalization. The hardware landscape is diversifying from industrial arms to humanoids (Figure, Tesla), low-cost bimanual systems (ALOHA), and mobile manipulators. Data remains the bottleneck: Open X-Embodiment and similar initiatives are trying to create the 'ImageNet moment' for robotics.

Key Challenges

Data scarcity — robot interaction data is orders of magnitude harder to collect than internet text or images

Sim-to-real gap — policies trained in simulation often fail on real hardware due to unmodeled dynamics

Generalization — handling novel objects, lighting, and environments remains extremely difficult

Safety — robots operating near humans must be provably safe, adding hard constraints on learned policies

Hardware cost — research-grade robot arms cost $20K-100K, limiting accessibility

Quick Recommendations

Research platform

Mobile ALOHA / low-cost bimanual setup

Best cost-performance ratio for manipulation research

Language-conditioned manipulation

RT-2 / Octo

State-of-the-art in mapping language instructions to robot actions

Foundation model approach

π0 (Physical Intelligence)

Most general-purpose robot foundation model as of 2025

Simulation development

Isaac Sim + MuJoCo

Best combination of speed (MuJoCo) and realism (Isaac Sim) for sim-to-real pipelines

What's Next

The frontier is general-purpose household robots — systems that can handle diverse manipulation tasks in unstructured environments with minimal task-specific training. Key enablers: (1) larger robot foundation models trained on cross-embodiment data, (2) fast sim-to-real transfer for new tasks, (3) natural language interfaces for non-expert users.


Related Tasks

Robot Navigation

Autonomous navigation — moving through unstructured environments while avoiding obstacles — spans indoor service robots to outdoor last-mile delivery. Classical SLAM (simultaneous localization and mapping) methods like ORB-SLAM still dominate mapping, but end-to-end learning approaches using embodied-AI simulators (Habitat 2.0, iGibson) show promise for semantic navigation ("go to the kitchen"). The Habitat Challenge results reveal that modular pipelines (map → plan → act) consistently beat monolithic learned policies, suggesting that full end-to-end navigation is still years away from displacing classical stacks in production.

Robot Manipulation

Robot manipulation — grasping, placing, and using tools — is where sim-to-real and foundation models meet physical dexterity. DexNet (2017) pioneered data-driven grasp planning, but the field accelerated when contact-rich manipulation was tackled with RL in simulation (DexterousHands, 2023) and then transferred to real hardware. Current state-of-the-art combines diffusion policies (Chi et al., 2023) with large pretrained vision encoders to achieve robust 6-DOF manipulation from a handful of demonstrations, though deformable objects and multi-step assembly remain unsolved.

Sim-to-Real Transfer

Sim-to-real transfer — training policies in simulation and deploying on physical hardware — is the bridge between unlimited virtual data and messy reality. Domain randomization (Tobin et al., 2017) was the first scalable approach, and OpenAI's Rubik's cube hand (2019) showed it could work for dexterous manipulation. The modern toolkit combines photorealistic rendering (Isaac Sim, MuJoCo MJX on GPU), system identification, and real-world fine-tuning, but the gap persists for contact-rich tasks where simulation physics diverge from reality. Narrowing this gap is existential for robotics — it determines whether lab results actually work in factories and homes.
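The core idea of domain randomization can be sketched in a few lines: evaluate (or train) a controller across many simulated rollouts whose physics parameters are randomized, so a policy that scores well cannot overfit to one simulator configuration. The mass and friction ranges below are arbitrary toy assumptions, not values from any published setup.

```python
# Domain randomization sketch: a PD controller evaluated across rollouts
# with randomized mass and friction, averaging the final tracking error.
import random

def rollout(kp, kd, mass, friction, steps=400, dt=0.01, target=1.0):
    """Simulate a 1-DOF joint under PD control with given dynamics."""
    q, qd = 0.0, 0.0
    for _ in range(steps):
        tau = kp * (target - q) - kd * qd      # PD control law
        qdd = (tau - friction * qd) / mass     # randomized dynamics
        qd += qdd * dt
        q += qd * dt
    return abs(target - q)                     # final tracking error

def randomized_error(kp, kd, trials=50, seed=0):
    """Average error across randomized masses and friction coefficients."""
    rng = random.Random(seed)
    errors = [rollout(kp, kd,
                      mass=rng.uniform(0.5, 2.0),
                      friction=rng.uniform(0.0, 1.0))
              for _ in range(trials)]
    return sum(errors) / len(errors)

err = randomized_error(kp=60.0, kd=12.0)  # low average error => robust gains
```

In practice the randomized parameters also include visual appearance, latency, and sensor noise, and the controller is a learned policy rather than fixed gains, but the evaluate-under-randomized-dynamics loop is the same.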
