Codesota · Robotics · Vol. II · From simulation to the factory floor · Issue: April 22, 2026

Robotics, measured honestly.
From simulation to the factory floor.

The open register of robot-learning benchmarks — Open X-Embodiment, LIBERO, Habitat, ManiSkill, RoboSuite — read next to the vision-language-action models that are starting to cross the sim-to-real gap. Pick-and-place is essentially solved; long-horizon manipulation still sits below fifty percent.

§ 01 · Simulation

Benchmarks, simulated.

Simulation is where most robot learning still happens — cheaper iteration, no broken gripper, and the only tractable path to billions of interaction steps. These are the registers the field actually agrees on.


Surface
Datasets · suites · competitions
Engines
MuJoCo · Isaac · PyBullet
Updated
April 2026
How we evaluate robot benchmarks →
Register · April 2026
Benchmark | Org | Type | Scale | Best-known result
Open X-Embodiment | Google DeepMind + 34 labs | Multi-robot dataset | 60+ datasets · 22 robot types · 527 skills | RT-2-X · +50% over single-embodiment
RoboSuite | Stanford / ARISE | Manipulation benchmark | 8 robots · 12 tasks · MuJoCo | Diffusion Policy · 80%+ on complex tasks
LIBERO | UT Austin | Long-horizon manipulation | 5+ step task chains | Sub-50% on deepest chains
Habitat | Meta FAIR | Embodied navigation | Photorealistic 3D scenes | 90%+ in structured envs
ManiSkill | UC San Diego | Manipulation suite | Dexterous + rigid-body tasks | Active research frontier
Meta-World | Stanford / Berkeley | Multi-task RL | 50 manipulation tasks | Pick-and-place solved · long-horizon open
M3Bench | Research community | Mobile manipulation | 30k tasks · 119 household scenes | VLA + motion planning
BARN Challenge | IEEE ICRA | Navigation competition | 300 environments | Hybrid learning + planning
DROID | Toyota Research / Berkeley | Demonstration dataset | 76k trajectories · 564 scenes · 86 tasks | Training corpus for Octo, RT-X
Fig 2 · Simulation register. Shaded row marks the largest open dataset; Open X-Embodiment is the corpus that trained the current generation of generalist policies (RT-2-X, Octo, OpenVLA).
§ 02 · Real-world

Manipulation, on hardware.

The models that have crossed the sim-to-real gap on a physical arm. Most are transformer policies trained on tele-operated demonstrations; a handful have shipped on humanoids and dexterous hands.

RT-1 opened the era in 2022 (130k demos, 700+ tasks). RT-2 introduced the VLA recipe in 2023. 2024 added Octo and OpenVLA as the first open-source generalists; 2025 brought production-scale deployment at Tesla (Optimus) and Figure AI.

Generalist policies · April 2026
Model | Org | Kind | Trained on | Params | Access
RT-2-X | Google DeepMind | Vision-Language-Action | Open X-Embodiment | — | Research only
Pi0 | Physical Intelligence | Generalist robot policy | Private corpus | — | Commercial
Isaac GR00T N1 | NVIDIA | Humanoid foundation model | NVIDIA ecosystem data | — | NVIDIA ecosystem
Octo | Berkeley AI Research | Open-source generalist | Open X-Embodiment · DROID | Octo-base / small | Apache 2.0
OpenVLA | Stanford / TRI | Vision-Language-Action | Open X-Embodiment | 7B | Open source
RT-1 | Google | Robot transformer | 130k demos · 700+ tasks | — | Research
Fig 3 · Generalist robot policies. Shaded rows mark the three frontier systems (RT-2-X, Pi0, Isaac GR00T N1). Open-source recommendation: Octo or OpenVLA — both fine-tunable on a new robot with limited data.
§ 03 · VLA

Vision, language, action.

The architectural thesis of the foundation-model era in robotics. One transformer, three modalities, and a training corpus wide enough to cross embodiments.

OpenVLA (Stanford / TRI · 7B) and Octo (Berkeley · Apache 2.0) are the default open-source starting points; RT-2-X, Pi0 and Isaac GR00T N1 sit behind partner or research access.

Vision-Language-Action

What VLA means.

A single transformer that consumes camera frames and a natural-language instruction, and emits robot actions as tokens. RT-2 was the first at scale; OpenVLA (7B) is the open reference.
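Action-as-token output works by discretising each continuous action dimension into a fixed number of bins, so the policy can emit actions from an ordinary token vocabulary. A minimal sketch of that round trip (the bin count, action ranges, and 7-DoF layout here are illustrative, not any model's actual configuration):

```python
def tokenize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to a discrete bin index."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)              # clamp to the valid range
        frac = (a - lo) / (hi - lo)          # normalise to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens

def detokenize_action(tokens, low, high, n_bins=256):
    """Invert the binning: token index -> bin-centre continuous value."""
    return [lo + (t + 0.5) / n_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]

# Illustrative 7-DoF action: six pose deltas plus a gripper command.
low, high = [-1.0] * 7, [1.0] * 7
tokens = tokenize_action([0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.9], low, high)
recovered = detokenize_action(tokens, low, high)
```

The quantisation error is bounded by half a bin width, which is why a few hundred bins per dimension is enough for arm control.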

Cross-embodiment

Why it matters.

Training one model across 22 robot types and 527 skills — the Open X-Embodiment recipe — lifts a target-robot score by +50% over training on that robot alone. Scale transfers.

Long-horizon

Where it breaks.

Five-plus step tasks still land below 50% on LIBERO and CALVIN. Error propagates across steps faster than any current policy can recover.
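The arithmetic behind that ceiling: if each step succeeds independently with probability p, an n-step chain succeeds with probability p^n. Independence is a simplifying assumption (real failures compound worse than this), but it already shows why a strong per-step policy lands under 50% at depth five:

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Success probability of an n-step chain, assuming independent steps."""
    return p_step ** n_steps

# A policy that clears 87% per step drops just below 50% at five steps.
five_step = chain_success(0.87, 5)

# To hold 50% on a 5-step chain, per-step success must exceed 0.5^(1/5).
p_needed = 0.5 ** (1 / 5)   # roughly 0.871 per step
```

Read the other way: each point of per-step reliability is worth roughly five points at the chain level, which is why recovery behaviour matters more than raw single-step accuracy.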

§ 04 · Trends

Eight registers. One arc at a time.

Each panel traces the qualitative arc of a robotics sub-task — entry-level manipulation climbed first, dexterous rotation followed, long-horizon chains are still below half. The copper dot marks today's frontier ceiling.

Panels · 2023–26 · higher is better:
  • Pick-and-place · sim: 95%+ · RoboSuite
  • Complex manipulation · sim: 80%+ · Diffusion Policy
  • Dexterous rotation: ~70% · Shadow Hand
  • Navigation · structured: 90%+ · BARN
  • Contact-rich · peg-in-hole: 60–80% · tolerance-dependent
  • Long-horizon · 5+ steps: <50% · LIBERO
  • Cross-embodiment lift: +50% · RT-X vs single-robot
  • Open-source VLA params: 7B · OpenVLA
Fig 4 · Qualitative trends per robotics sub-task across recent years. Headline figure is the current frontier score (or qualitative ceiling) reported in the cited benchmark; line shape is indicative of the trajectory, not an exact measurement series.
§ 05 · Simulators

Engines, compared.

MuJoCo for contact-rich accuracy; Isaac for GPU-scale RL; PyBullet for the first week; Genesis for the differentiable frontier.

DeepMind acquired MuJoCo in 2021 and made it free — the single largest accelerant of the modern simulation era. Isaac Gym unlocked 1,000s of parallel envs per GPU.
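What "GPU-scale" buys is stepping every environment in one batched array operation rather than looping over envs in Python. A toy numpy sketch of the pattern (the point-mass dynamics are invented for illustration; this shows the batching idea, not Isaac's API):

```python
import numpy as np

class BatchedPointEnv:
    """Toy batched env: N point-masses nudged by per-env actions.

    Illustrates the 'one array op steps all envs' pattern behind
    GPU-parallel simulators; real engines do this for full rigid bodies.
    """
    def __init__(self, n_envs: int, dt: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.pos = rng.uniform(-1.0, 1.0, size=(n_envs, 2))
        self.vel = np.zeros((n_envs, 2))
        self.dt = dt

    def step(self, actions: np.ndarray) -> np.ndarray:
        # One vectorised update advances every environment at once.
        self.vel += actions * self.dt
        self.pos += self.vel * self.dt
        # Per-env reward: negative distance to the origin.
        return -np.linalg.norm(self.pos, axis=1)

env = BatchedPointEnv(n_envs=4096)
rewards = env.step(-env.pos)   # crude 'move toward origin' policy
```

On a GPU the same pattern runs the physics for thousands of envs per step, which is the throughput RL at scale depends on.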

Engine | Org | License | Strength | Weakness | Best for
MuJoCo | DeepMind | Apache 2.0 | Accurate contact physics | CPU-only · steeper curve | Contact-rich manipulation · research
Isaac Gym | NVIDIA | NVIDIA (free for research) | GPU-accelerated · 1,000s of parallel envs | NVIDIA hardware required · less accurate contacts | RL at scale · locomotion
PyBullet | Erwin Coumans | zlib | Easy to use · Python-native | Less accurate · slower | Beginners · prototyping
Genesis | Stanford / CMU | Apache 2.0 | Differentiable · multi-GPU | New (2024) · smaller community | Cutting-edge research
Fig 5 · The four simulators that own the field. Genesis is the newest entrant; anything more exotic sits alongside these four, not above them.
§ 06 · Difficulty

From pick-and-place, to folding a shirt.

Entry-level manipulation is solved. Dexterous manipulation is within reach. Contact-rich tolerance work is still brittle. Long-horizon chains remain the honest ceiling of the field.

Task | Difficulty | Benchmarks | Best-known | Open challenge
Pick and place | Entry | RoboSuite Lift · Meta-World | 95%+ success | Generalisation to novel objects
Autonomous navigation | Medium | BARN Challenge · Habitat | 90%+ in structured envs | Dynamic obstacles
Dexterous manipulation | Hard | DexMV · DexArt · Shadow Hand | ~70% on complex rotation | High DoF · sim-to-real gap
Mobile manipulation | Hard | M3Bench · BEHAVIOR-1K | Active frontier | Whole-body coordination
Contact-rich tasks | Hard | FurnitureBench · peg insertion | 60–80% (tolerance-dependent) | Force sensing · compliance
Long-horizon tasks | Very hard | CALVIN · LIBERO | <50% on 5+ step chains | Error propagation · memory
Fig 6 · Canonical task classes in robot learning, in rough order of how well current policies handle them.
§ 07
Methodology

How we read robotics numbers.

Robotics benchmarks are harder to trust than LLM ones. A score can depend on the gripper, the tabletop, the lighting — and on whether the model was trained on the exact same embodiment it is now being tested on. We report accordingly.

First, same-embodiment comparison. Cross-robot numbers are inflated by the Open X-Embodiment training distribution; we separate single-robot scores from generalist-policy scores.

Second, reported simulation seeds. A 95% pick-and-place on RoboSuite with one seed is not a 95% on ten seeds. When we cite a number, we prefer the multi-seed average; where only a headline is available we say so.
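A hedged sketch of the multi-seed summary we prefer: mean and standard error across per-seed success rates (the seed values below are hypothetical, chosen to show how a 0.95 headline seed coexists with a lower average):

```python
import statistics

def multi_seed_summary(success_rates):
    """Mean and standard error across per-seed success rates."""
    mean = statistics.fmean(success_rates)
    se = statistics.stdev(success_rates) / len(success_rates) ** 0.5
    return mean, se

# Ten hypothetical pick-and-place seeds: the best seed reports 0.95,
# but the multi-seed average is the number worth citing.
seeds = [0.95, 0.88, 0.91, 0.84, 0.93, 0.79, 0.90, 0.86, 0.92, 0.88]
mean, se = multi_seed_summary(seeds)
```

The standard error also tells you whether two reported policies are actually distinguishable, which a single headline seed never can.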

Third, sim-to-real honesty. A simulation score is not a hardware score. Policies that work perfectly in MuJoCo routinely fail on the physical arm — friction, contacts, sensor noise, and actuator delay all diverge. We flag which lane a number comes from.

Where the existing literature reports a qualitative ceiling rather than a reproduced number, we preserve the qualitative phrasing — “sub-50%”, “active frontier” — rather than invent precision.

§ 08 · Horizon

The IKEA test, still distant.

A concrete, contact-rich, long-horizon capability test the field has not yet cleared — a commercial general-purpose robot that assembles unmodified IKEA furniture in a normal home, under $100k, in under a day.

Sourced from a date forecast on Metaculus. Closes 2036-01-01; the community median sits at January 2031, with the middle 50% spanning roughly two years on either side of that.

Read the question on Metaculus →
Community forecast · Metaculus · Q43262
11 forecasters · opens 2026-04-21 · resolves by 2036-01-01

When will commercially available robots build IKEA furniture on their own?

Median Jan 2031 · 50% interval: Jan 2029 – Mar 2033 · axis 2026–2036
Resolution criteria
  • Commercially available, general-purpose system sold publicly.
  • Assembles ≥ 5 different IKEA items in ≥ 3 categories (table, chair, bed, storage, lighting).
  • Each assembly completes in less than 24 hours.
  • One or two robots, all with self-powered locomotion — no fixed industrial cells.
  • Total retail price under $100,000 (Jan 2025 USD).
  • Works in most home-like environments without task-specific fiducials. Up to 24 h on-site training allowed.
  • No human assistance once assembly begins. Items must be unmodified IKEA retail units.
Today

Early research stage

Berkeley plank-handling robot (Jan 2025) — single plank type, no screws, human still drives the screwdriver. Recent context: Pi 0.7 (Physical Intelligence), Gemini Robotics-ER 1.6.

Fig 7 · Date forecast on Metaculus Q43262. Bar shows the recency-weighted 50% community interval; copper tick marks the median. Snapshot 2026-04-28.
§ 09 · Related

Read next, around the register.

/llm

LLM leaderboard

The frontier LLMs whose visual cousins drive most VLA backbones today.

/vision

Vision register

Detection, segmentation, and the perception stack underneath every robot policy.

/hardware

Hardware register

The GPUs and edge silicon robot policies actually run on — B200 to Jetson Orin.

/agentic

Agentic AI

Planning and tool-use benchmarks — the cognitive layer above the motor policy.

/tasks

Task index

Every ML task in the register, with its canonical benchmark and trust grade.

/methodology

Methodology

How every number on Codesota is reproduced, dated, and preserved under regression.

/papers-with-code

Papers with Code

The predecessor registry we are building a calmer, stricter successor to.

/browse

Browse benchmarks

The full benchmark catalogue — by area, by modality, by size.