Grasp benchmark · Parallel-jaw

Grasp-Anything

language-driven · foundation models · parallel-jaw · open-vocabulary

Grasp-Anything is a large-scale grasp dataset synthesized using foundation models: 1 million scenes with text descriptions and more than 3 million objects, with far greater object and scene diversity than hand-collected datasets.

It pushes grasping toward open-vocabulary, language-conditioned picking — predicting grasps for objects described in natural language — and seeds the language-driven line that continues in Grasp-Anything++ and Grasp-Anything-6D.
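To make the input and output of language-conditioned grasp detection concrete, here is a minimal sketch of a scene record and a text query. It assumes the common five-parameter grasp rectangle (center, width, height, angle); the class names, field names, and example scene are hypothetical and are not the dataset's actual schema. The substring match is only a stand-in for the learned text-image grounding a real model would use.

```python
from dataclasses import dataclass, field


@dataclass
class GraspRect:
    """Parallel-jaw grasp as a rotated rectangle in image coordinates (assumed format)."""
    cx: float
    cy: float
    width: float   # gripper opening, in pixels
    height: float  # jaw size, in pixels
    theta: float   # rotation, in radians


@dataclass
class SceneObject:
    name: str
    grasps: list[GraspRect] = field(default_factory=list)


@dataclass
class SceneSample:
    description: str            # free-text scene description
    objects: list[SceneObject]  # objects present in the scene


def grasps_for(sample: SceneSample, query: str) -> list[GraspRect]:
    """Return grasps of every object whose name occurs in the query.

    Naive case-insensitive substring matching; a placeholder for a learned model.
    """
    q = query.lower()
    return [g for obj in sample.objects if obj.name.lower() in q for g in obj.grasps]


# Hypothetical scene, for illustration only.
sample = SceneSample(
    description="A mug and a pair of scissors on a wooden desk",
    objects=[
        SceneObject("mug", [GraspRect(120.0, 80.0, 60.0, 20.0, 0.3)]),
        SceneObject("scissors", [GraspRect(300.0, 210.0, 40.0, 15.0, 1.2),
                                 GraspRect(320.0, 190.0, 45.0, 15.0, 0.9)]),
    ],
)

mug_grasps = grasps_for(sample, "Pick up the mug")
scissor_grasps = grasps_for(sample, "hand me the scissors")
```

The point of the sketch is the interface: a scene plus a natural-language request goes in, and a set of grasp rectangles for the referenced object comes out.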

At a glance
Source: Vuong et al., ICRA 2024
Year: 2024
Scale: 1M samples · 3M+ objects · text descriptions · foundation-model-generated
Gripper: Parallel-jaw
Modality: RGB + language
Best known for: Language-driven grasp synthesis · open-vocabulary scenes
Key results
  • 1M scenes · 3M+ objects generated via foundation models
  • Enables open-vocabulary, language-conditioned grasp detection
  • Basis for language-driven successors (Grasp-Anything++, -6D)

SIM = simulation result · HW = physical hardware. Image-wise accuracy is detection quality, not real-robot pick success. Figures cited from Vuong et al., ICRA 2024.
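As a rough illustration of what image-wise accuracy measures: rectangle-grasp benchmarks commonly count a predicted grasp as correct when its rectangle overlaps some ground-truth rectangle with IoU above 0.25 and its orientation is within 30° of that rectangle. The sketch below assumes that rule and a five-parameter rectangle (cx, cy, width, height, angle in radians). The function names are illustrative and not taken from any benchmark toolkit.

```python
import math


def rect_corners(cx, cy, w, h, theta):
    """Corners of a rotated rectangle, in counter-clockwise order."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]


def polygon_area(pts):
    """Polygon area via the shoelace formula."""
    total = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` against a convex, CCW `clipper`."""
    def inside(p, a, b):
        # True when p lies on or to the left of the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of line pq with line ab (only called when they cross).
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    out = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        prev, out = out, []
        if not prev:
            break
        for j in range(len(prev)):
            p, q = prev[j - 1], prev[j]
            if inside(q, a, b):
                if not inside(p, a, b):
                    out.append(intersect(p, q, a, b))
                out.append(q)
            elif inside(p, a, b):
                out.append(intersect(p, q, a, b))
    return out


def rect_iou(r1, r2):
    """Intersection over union of two rotated rectangles."""
    overlap = clip(rect_corners(*r1), rect_corners(*r2))
    inter = polygon_area(overlap) if len(overlap) >= 3 else 0.0
    union = r1[2] * r1[3] + r2[2] * r2[3] - inter
    return inter / union


def angle_difference(a, b):
    """Smallest angle between two grasp orientations (symmetric under 180°)."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)


def is_correct(pred, ground_truths, iou_thresh=0.25, angle_thresh=math.radians(30)):
    """True if `pred` matches at least one ground-truth rectangle."""
    return any(
        angle_difference(pred[4], gt[4]) < angle_thresh and rect_iou(pred, gt) > iou_thresh
        for gt in ground_truths
    )


pred = (0.0, 0.0, 4.0, 2.0, 0.0)
shifted = (1.0, 0.0, 4.0, 2.0, 0.0)          # IoU with pred = 6 / 10 = 0.6
rotated = (0.0, 0.0, 4.0, 2.0, math.pi / 4)  # 45° away from pred
hit = is_correct(pred, [rotated, shifted])   # matches `shifted`
miss = is_correct(pred, [rotated])           # fails the angle test
```

Image-wise accuracy is then the fraction of test images whose top prediction passes this check, which is why it reflects detection quality and says nothing about whether a physical robot would complete the pick.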

Related benchmarks

  • GraspNet-1Billion (parallel-jaw): de facto clutter benchmark · AnyGrasp current SOTA (AP)
  • Dex-Net 2.0 (parallel-jaw): HW: 93% on adversarial objects · 99% precision on 40 novel objects (YuMi)
  • Jacquard (parallel-jaw): ~95% image-wise accuracy (GR-ConvNet-class)
  • Cornell Grasp (parallel-jaw): ~99% image-wise accuracy · saturated benchmark