What VLA means.
A single transformer that consumes camera frames and a natural-language instruction, and emits robot actions as tokens. RT-2 was the first at scale; OpenVLA (7B) is the open reference.
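How that interface looks in code, using the published OpenVLA quick-start on Hugging Face as the reference point: the prompt template, checkpoint name, and `unnorm_key` follow that release, and the instruction and camera frame here are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

# Load the open 7B reference model (OpenVLA quick-start; requires trust_remote_code).
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

image = Image.open("wrist_cam.png")  # placeholder: current camera frame
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# The model emits discretised action tokens; predict_action de-tokenises and
# un-normalises them into a 7-DoF end-effector delta (xyz, rotation, gripper).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```

A real deployment loops this call at control frequency and hands the 7-DoF delta to the arm's low-level controller.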
The open register of robot-learning benchmarks — Open X-Embodiment, LIBERO, Habitat, ManiSkill, RoboSuite — read next to the vision-language-action models that are starting to cross the sim-to-real gap. Pick-and-place is essentially solved; long-horizon manipulation still sits below fifty percent.
Simulation is where most robot learning still happens — cheaper iteration, no broken gripper, and the only tractable path to billions of interaction steps. These are the benchmarks the field actually agrees on.
| Benchmark | Org | Type | Scale | Best-known result |
|---|---|---|---|---|
| Open X-Embodiment | Google DeepMind + 34 Labs | Multi-robot dataset | 60+ datasets · 22 robot types · 527 skills | RT-2-X · +50% over single-embodiment |
| RoboSuite | Stanford / ARISE | Manipulation benchmark | 8 robots · 12 tasks · MuJoCo | Diffusion Policy · 80%+ on complex tasks |
| LIBERO | UT Austin | Long-horizon manipulation | 5+ step task chains | Sub-50% on deepest chains |
| Habitat | Meta FAIR | Embodied navigation | Photorealistic 3D scenes | 90%+ in structured envs |
| ManiSkill | UC San Diego | Manipulation suite | Dexterous + rigid-body tasks | Active research frontier |
| Meta-World | Stanford / Berkeley | Multi-task RL | 50 manipulation tasks | Pick-and-place solved · long-horizon open |
| M3Bench | Research Community | Mobile manipulation | 30k tasks · 119 household scenes | VLA + motion planning |
| BARN Challenge | IEEE ICRA | Navigation competition | 300 environments | Hybrid learning + planning |
| DROID | Toyota Research / Berkeley | Demonstration dataset | 76k trajectories · 564 scenes · 86 tasks | Training corpus for Octo, RT-X |
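Open X-Embodiment and DROID (both rows above) are distributed as RLDS-formatted TFDS builds, so training pipelines stream whole episodes rather than shuffled transitions. A minimal loading sketch; the bucket path, version, and field names are assumptions that vary per sub-dataset, so treat them as placeholders.

```python
import tensorflow_datasets as tfds

# Assumed public mirror, sub-dataset, and version; each Open X-Embodiment
# sub-dataset is an RLDS build of full episodes.
builder = tfds.builder_from_directory("gs://gresearch/robotics/bridge/0.1.0")
episodes = builder.as_dataset(split="train")

for episode in episodes.take(1):
    for step in episode["steps"]:              # nested per-episode dataset
        obs = step["observation"]
        image = obs["image"]                    # key names differ per sub-dataset
        instruction = step["language_instruction"]  # some datasets nest this under observation
        action = step["action"]
```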
The models that have crossed the sim-to-real gap on a physical arm. Most are transformer policies trained on tele-operated demonstrations; a handful have shipped on humanoids and dexterous hands.
RT-1 opened the era in 2022 (130k demos, 700+ tasks). RT-2 introduced VLA in 2023. 2024 added Octo and OpenVLA as the first open-source generalists; 2025 brought production deployment at Tesla Optimus and Figure AI scale.
| Model | Org | Kind | Trained on | Params | Access |
|---|---|---|---|---|---|
| RT-2-X | Google DeepMind | Vision-Language-Action | Open X-Embodiment | — | Research only |
| Pi0 | Physical Intelligence | Generalist robot policy | Private corpus | — | Commercial |
| Isaac GR00T N1 | NVIDIA | Humanoid foundation model | NVIDIA ecosystem data | — | NVIDIA ecosystem |
| Octo | Berkeley AI Research | Open-source generalist | Open X-Embodiment · DROID | 93M (base) · 27M (small) | Apache 2.0 |
| OpenVLA | Stanford / TRI | Vision-Language-Action | Open X-Embodiment | 7B | Open source |
| RT-1 | Google Robotics | Robot transformer | 130k demos · 700+ tasks | — | Research |
The architectural thesis of the foundation-model era in robotics. One transformer, three modalities, and a training corpus wide enough to cross embodiments.
OpenVLA (Stanford / TRI · 7B) and Octo (Berkeley · Apache 2.0) are the default open-source starting points; RT-2-X, Pi0 and Isaac GR00T N1 sit behind partner or research access.
Training one model across 22 robot types and 527 skills — the Open X-Embodiment recipe — lifts a target-robot score by +50% over training on that robot alone. Scale transfers.
Five-plus step tasks still land below 50% on LIBERO and CALVIN. Errors propagate across steps faster than any current policy can recover from them.
Each panel traces the qualitative arc of a robotics sub-task — entry-level manipulation climbed first, dexterous rotation followed, long-horizon chains are still below half. The copper dot marks today's frontier ceiling.
MuJoCo for contact-rich accuracy; Isaac for GPU-scale RL; PyBullet for the first week; Genesis for the differentiable frontier.
DeepMind acquired MuJoCo in 2021 and made it free — the single largest accelerant of the modern simulation era. Isaac Gym unlocked 1,000s of parallel envs per GPU.
| Engine | Org | License | Strength | Weakness | Best for |
|---|---|---|---|---|---|
| MuJoCo | DeepMind | Apache 2.0 | Accurate contact physics | CPU-only · steeper curve | Contact-rich manipulation · research |
| Isaac Gym | NVIDIA | NVIDIA (free research) | GPU-accelerated · 1,000s parallel envs | NVIDIA required · less accurate contacts | RL at scale · locomotion |
| PyBullet | Erwin Coumans | zlib | Easy to use · Python-native | Less accurate · slower | Beginners · prototyping |
| Genesis | Stanford / CMU | Apache 2.0 | Differentiable · multi-GPU | New (2024) · smaller community | Cutting-edge research |
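What "accurate contact physics" means in practice is a deterministic step function you can call millions of times. A minimal sketch with the official `mujoco` Python bindings; the XML is a toy free-falling box, not a real manipulation scene.

```python
import mujoco

# Toy scene: a single free body falling onto a plane.
XML = """
<mujoco>
  <worldbody>
    <geom type="plane" size="1 1 0.1"/>
    <body pos="0 0 0.5">
      <freejoint/>
      <geom type="box" size="0.05 0.05 0.05"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)
data = mujoco.MjData(model)

# Each mj_step advances the world by model.opt.timestep (2 ms by default),
# resolving contacts between the box and the plane along the way.
for _ in range(1000):
    mujoco.mj_step(model, data)

print(data.qpos)  # final free-joint pose: x y z + quaternion
```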
Entry-level manipulation is solved. Dexterous manipulation is within reach. Contact-rich tolerance work is still brittle. Long-horizon chains remain the honest ceiling of the field.
| Task | Difficulty | Benchmarks | Best-known | Open challenge |
|---|---|---|---|---|
| Pick and Place | Entry | RoboSuite Lift · Meta-World | 95%+ success | Generalisation to novel objects |
| Autonomous Navigation | Medium | BARN Challenge · Habitat | 90%+ in structured envs | Dynamic obstacles |
| Dexterous Manipulation | Hard | DexMV · DexArt · Shadow Hand | ~70% on complex rotation | High DoF · sim-to-real gap |
| Mobile Manipulation | Hard | M3Bench · BEHAVIOR-1K | Active frontier | Whole-body coordination |
| Contact-Rich Tasks | Hard | FurnitureBench · Peg Insertion | 60–80% (tolerance-dependent) | Force sensing · compliance |
| Long-Horizon Tasks | Very Hard | CALVIN · LIBERO | <50% on 5+ step chains | Error propagation · memory |
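The long-horizon ceiling is mostly arithmetic. Under a simplifying independence assumption (real errors also compound through drifting state), a policy needs roughly 87% per-step success just to clear 50% on a five-step chain:

```python
# Chain success under an independence assumption: p_chain = p_step ** n_steps.
p_step = 0.87
for n_steps in (1, 3, 5, 8):
    print(n_steps, round(p_step ** n_steps, 3))
# 1: 0.87, 3: 0.659, 5: 0.498, 8: 0.328  (five steps already dips below one half)
```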
Robotics benchmarks are harder to trust than LLM ones. A score can depend on the gripper, the tabletop, the lighting — and on whether the model was trained on the exact same embodiment it is now being tested on. We report accordingly.
First, same-embodiment comparison. Cross-robot numbers are inflated by the Open X-Embodiment training distribution; we separate single-robot scores from generalist-policy scores.
Second, reported simulation seeds. A 95% pick-and-place on RoboSuite with one seed is not a 95% on ten seeds. When we cite a number, we prefer the multi-seed average (see the sketch after these notes); where only a headline is available we say so.
Third, sim-to-real honesty. A simulation score is not a hardware score. Policies that work perfectly in MuJoCo routinely fail on the physical arm — friction, contacts, sensor noise, and actuator delay all diverge. We flag which lane a number comes from.
Where the existing literature reports a qualitative ceiling rather than a reproduced number, we preserve the qualitative phrasing — “sub-50%”, “active frontier” — rather than invent precision.
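To make the seed point concrete, a sketch of the evaluation harness we have in mind; `run_episode` is a hypothetical stand-in for whatever rollout loop you already have (RoboSuite, LIBERO, or otherwise), and the only load-bearing part is that success is averaged over seeds and reported with its spread.

```python
import numpy as np

def run_episode(env_name: str, seed: int) -> bool:
    """Hypothetical rollout: build the env with this seed, run the policy,
    return True on task success. Replace with your own harness."""
    rng = np.random.default_rng(seed)
    return bool(rng.random() < 0.9)  # placeholder success model

def multi_seed_success(env_name: str, seeds=range(10), episodes_per_seed=50):
    per_seed = []
    for seed in seeds:
        wins = sum(run_episode(env_name, seed * 10_000 + ep) for ep in range(episodes_per_seed))
        per_seed.append(wins / episodes_per_seed)
    per_seed = np.array(per_seed)
    # Report the mean across seeds plus its spread, not the single best seed.
    return per_seed.mean(), per_seed.std()

mean, std = multi_seed_success("Lift")
print(f"success: {mean:.1%} ± {std:.1%} over 10 seeds")
```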
A concrete, contact-rich, long-horizon capability test the field has not yet cleared — a commercial general-purpose robot that assembles unmodified IKEA furniture in a normal home, under $100k, in under a day.
Sourced from a date forecast on Metaculus. Closes 2036-01-01; the community median sits in early 2031, with the middle 50% spanning four years on either side of that.
Read the question on Metaculus →
Early research stage
Berkeley plank-handling robot (Jan 2025) — single plank type, no screws, human still drives the screwdriver. Recent context: Pi 0.7 (Physical Intelligence), Gemini Robotics-ER 1.6.
The frontier LLMs whose visual cousins drive most VLA backbones today.
Detection, segmentation, and the perception stack underneath every robot policy.
The GPUs and edge silicon robot policies actually run on — B200 to Jetson Orin.
Planning and tool-use benchmarks — the cognitive layer above the motor policy.
Every ML task in the register, with its canonical benchmark and trust grade.
How every number on Codesota is reproduced, dated, and preserved under regression.
The predecessor registry we are building a calmer, stricter successor to.
The full benchmark catalogue — by area, by modality, by size.