Robots

Building robotic systems? Find benchmarks for manipulation, navigation, and simulation-to-reality transfer.

3 tasks · 0 datasets · 0 results

Robot foundation models trained on diverse, distributed datasets have fundamentally changed manipulation from task-specific scripts to generalizable policies. OpenVLA, RT-2-X, and GR00T demonstrate 55-93% success rates across embodiments, but brittleness under perturbations remains the critical deployment challenge.

State of the Field (Dec 2024)

  • Vision-Language-Action models (OpenVLA 7B, RT-2-X 55B, GR00T N1.5) achieve 55-93% success across multiple robot platforms by training on 1M+ diverse trajectories from the Open X-Embodiment and DROID datasets
  • Action chunking with Transformers and diffusion-based trajectory generation are now standard for long-horizon manipulation, with diffusion models providing a 25% improvement over baselines
  • The COLOSSEUM benchmark reveals a critical weakness: 30-50% performance degradation under individual perturbations (lighting, distractors, object color) and 75%+ degradation when perturbations are combined (see the perturbation-sweep sketch after this list)
  • Sim-to-real transfer with safety guarantees (SPiDR) and retrieval-augmented training (70% improvement through strategic data selection) enable practical deployment without massive proprietary datasets
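
For concreteness, robustness numbers like COLOSSEUM's come from re-running the same tasks under controlled perturbations, alone and in combination. A minimal sketch of such a sweep, with a hypothetical `run_episodes` harness and illustrative perturbation settings (this is not the benchmark's actual API):

```python
# Sketch of a perturbation sweep in the spirit of the COLOSSEUM evaluation:
# measure success under each perturbation alone and under pairwise combinations.
# `run_episodes` and the perturbation settings are hypothetical placeholders.
from itertools import combinations

PERTURBATIONS = {
    "lighting": {"brightness_scale": 0.3},
    "distractors": {"n_extra_objects": 5},
    "object_color": {"hue_shift": 0.5},
}

def sweep(policy, run_episodes, n=50):
    results = {"clean": run_episodes(policy, n=n)}
    for name, cfg in PERTURBATIONS.items():               # individual perturbations
        results[name] = run_episodes(policy, n=n, **cfg)
    for a, b in combinations(PERTURBATIONS, 2):           # combined perturbations
        results[f"{a}+{b}"] = run_episodes(policy, n=n, **PERTURBATIONS[a], **PERTURBATIONS[b])
    return results  # compare degradation against the clean baseline
```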

Quick Recommendations

General manipulation research or rapid prototyping

OpenVLA 7B

Open-source, runs on consumer GPUs, 16.5% better than RT-2-X. Efficiently fine-tunable via LoRA. Best starting point for 95% of robot learning projects.
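
For orientation, a minimal sketch of what a LoRA fine-tuning setup could look like with Hugging Face transformers and peft; the model id and target module names are assumptions, so check the OpenVLA repository for its recommended recipe:

```python
# Minimal sketch of LoRA fine-tuning setup via Hugging Face peft.
# The model id and target module names are assumptions; consult the
# OpenVLA repository for its recommended fine-tuning recipe.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "openvla/openvla-7b"  # assumed Hub id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of the 7B weights will train
```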

Parallel-jaw grasping in cluttered environments

AnyDexGrasp framework

75-95% success on 150+ novel objects in clutter. Generalizes across gripper types (3-finger, 5-finger, parallel). Production-ready for pick-and-place.

Bimanual or long-horizon manipulation

Action Chunking with Transformers (ACT) + Diffusion Policy

ALOHA Unleashed achieves 70%+ on contact-rich bimanual tasks. Diffusion trajectory guidance provides 25% improvement on tasks >50 timesteps. Practical and reproducible.
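
The core trick in action chunking is predicting a block of future actions at each step and blending overlapping predictions at execution time. A rough sketch of that rollout loop, assuming a generic `policy` and a gym-style `env`; chunk size and weighting are illustrative rather than ACT's exact settings:

```python
# Sketch of an action-chunking rollout with temporal ensembling.
# `policy` and `env` are placeholders (any chunk-predicting model and a
# gym-style environment); chunk size, action dim, and the exponential
# weighting are illustrative rather than ACT's exact settings.
import numpy as np

def rollout(policy, env, horizon=400, chunk_size=50, temp=0.01):
    chunks = [None] * horizon          # chunk predicted at each timestep
    obs = env.reset()
    for t in range(horizon):
        chunks[t] = policy(obs)        # shape: (chunk_size, action_dim)
        # Blend every past prediction that still covers timestep t
        preds, weights = [], []
        for s in range(max(0, t - chunk_size + 1), t + 1):
            preds.append(chunks[s][t - s])
            weights.append(np.exp(-temp * (t - s)))  # downweight older predictions
        action = np.average(preds, axis=0, weights=weights)
        obs, reward, done, info = env.step(action)
        if done:
            break
```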

Humanoid or high-DOF robot control

GR00T N1.5 (with caveats)

93% language-following and 83% overall success on humanoid hardware. But it requires massive compute for training and careful safety validation. Only worth it if you have NVIDIA-scale resources and a tolerance for brittleness.

Safe sim-to-real transfer for control tasks

SPiDR (Sim-to-Real via Pessimistic Domain Randomization)

Provable safety guarantees despite the sim-to-real gap. Zero-shot constraint satisfaction on real robots. Essential for safety-critical deployments where constraint violations are unacceptable.
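
Setting SPiDR's specifics aside, the underlying pattern is to judge a policy against a pessimistic view of randomized dynamics rather than the average case. A minimal sketch of that pessimism step, with `make_sim` and `constraint_cost` as hypothetical placeholders:

```python
# Sketch of the pessimism-over-randomized-dynamics idea. `make_sim` and
# `constraint_cost` are hypothetical placeholders, and parameter ranges are
# purely illustrative; the real method also shapes policy training, not just
# this evaluation step.
import numpy as np

rng = np.random.default_rng(0)

def sample_dynamics():
    # Randomize physical parameters the real robot might plausibly exhibit
    return {
        "friction": rng.uniform(0.4, 1.2),
        "mass_scale": rng.uniform(0.8, 1.2),
        "actuation_delay": int(rng.integers(0, 3)),
    }

def pessimistic_constraint_estimate(policy, n_worlds=32, episodes=5):
    worst = -np.inf
    for _ in range(n_worlds):
        sim = make_sim(**sample_dynamics())               # hypothetical simulator factory
        costs = [constraint_cost(policy, sim) for _ in range(episodes)]
        worst = max(worst, float(np.mean(costs)))         # pessimism: track the worst world
    return worst  # deploy only if this stays under the safety budget
```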

Edge deployment with <100ms latency

Octo-Base models with 8-bit quantization

Lightweight architecture optimized for real-time control. 3-8x model compression with minimal performance loss. Fits on mobile manipulator compute budgets.
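
To illustrate the compression step, here is a sketch of post-training dynamic 8-bit quantization on a stand-in PyTorch policy head; it is not Octo's own export path, and quantized policies should be re-validated on task success before deployment:

```python
# Sketch of post-training dynamic 8-bit quantization on a stand-in PyTorch
# policy head. This is not Octo's own export tooling; always re-measure task
# success after quantization instead of assuming minimal loss.
import torch
import torch.nn as nn

policy_head = nn.Sequential(        # stand-in for a trained policy network
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 7),             # e.g. a 7-DoF end-effector action
)

quantized = torch.ao.quantization.quantize_dynamic(
    policy_head, {nn.Linear}, dtype=torch.qint8  # Linear weights stored as int8
)

obs_features = torch.randn(1, 512)
with torch.no_grad():
    action = quantized(obs_features)  # int8 matmuls at inference time
```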

Limited proprietary data, need transfer learning

OpenVLA fine-tuned on Open X-Embodiment + retrieval-augmented training

Leverage 1M+ public trajectories, then fine-tune on 50-100 task-specific demos with strategic retrieval. Eliminates need for massive data collection while maintaining strong performance.
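
A minimal sketch of the retrieval step, assuming you already have embeddings for your task demos and the public pool (e.g. from a frozen visual encoder over each trajectory's first frame); function names and the k value are illustrative:

```python
# Sketch of the retrieval step: embed your task demos, then pull the most
# similar trajectories from a large public pool to co-train on. The embedding
# source and k value are illustrative assumptions.
import numpy as np

def retrieve(task_demo_embs, pool_embs, k_per_demo=20):
    task = task_demo_embs / np.linalg.norm(task_demo_embs, axis=1, keepdims=True)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = task @ pool.T                        # cosine similarity, (n_demos, n_pool)
    idx = np.argsort(-sims, axis=1)[:, :k_per_demo]
    return np.unique(idx)                       # pool trajectories to add to fine-tuning

# Fine-tune on: your 50-100 demos + pool[retrieve(...)] instead of the full pool.
```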

Long-horizon tasks with intermediate waypoints

Diffusion Trajectory-guided Policy (DTP)

Generates 2D trajectory supervision via diffusion, which shortens the effective planning horizon. 25% improvement over baselines on long-horizon tasks. Computationally tractable on consumer GPUs.
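
For intuition, the trajectory generator is essentially a reverse-diffusion sampler over waypoint coordinates. A sketch of that loop with a hypothetical trained `noise_model` and a standard DDPM-style schedule (not DTP's exact formulation):

```python
# Sketch of a reverse-diffusion sampler over 2D waypoints, the core mechanism
# behind diffusion-based trajectory guidance. `noise_model` is a hypothetical
# trained denoiser; the schedule and dimensions are illustrative.
import torch

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_trajectory(noise_model, n_waypoints=16):
    x = torch.randn(1, n_waypoints, 2)               # start from pure noise
    for t in reversed(range(T)):
        eps = noise_model(x, torch.tensor([t]))      # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # 2D waypoint trajectory used as intermediate supervision
```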

Tasks & Benchmarks


Robot Manipulation

No datasets indexed yet. Contribute on GitHub

Robot Navigation

No datasets indexed yet. Contribute on GitHub

Sim-to-Real Transfer

No datasets indexed yet. Contribute on GitHub

Honest Takes

Small open beats large closed

OpenVLA (7B params) outperforms RT-2-X (55B params) by 16.5% absolute success rate while running on consumer GPUs. Parameter count is not destiny when architecture and training data are optimized. For most practitioners, OpenVLA is the only foundation model worth deploying.

Data curation matters more than data volume

DROID research shows that not all demonstrations are equal: retrieval-augmented training on similar scenes beats naive training on 10x more data. Camera-pose variation and spatial-arrangement diversity predict generalization better than raw trajectory count. Stop collecting random demos; start curating strategic ones.

Foundation models are still brittle

Despite 1M+ training trajectories, SOTA models lose over 75% of their success rate when you change the lighting and add distractor objects at the same time. These are not edge cases - this is Tuesday in a warehouse. The hype around generalist robots is 2-3 years ahead of reality.

Imitation learning beats RL for production

While RL papers dominate the major venues, imitation learning with action chunking remains the most reliable path to real-world deployment. Action Chunking with Transformers (ACT) plus quality demonstrations consistently outperforms RL in sample efficiency, stability, and interpretability of failures.

Humanoid control is solved for demos, not deployment

GR00T N1.5 achieves 93% language-following on fruit-placement tasks with careful setup. Real warehouses require 99.9% reliability across hundreds of object categories, occlusions, and packaging variations. The gap between research demos and production reliability is wider in robotics than in any other AI domain.
