Grasp-Anything
Grasp-Anything is a large-scale grasp dataset synthesized using foundation models: 1 million scenes with text descriptions and more than 3 million objects, with far greater object and scene diversity than hand-collected datasets.
It pushes grasping toward open-vocabulary, language-conditioned picking — predicting grasps for objects described in natural language — and seeds the language-driven line that continues in Grasp-Anything++ and Grasp-Anything-6D.
- Source: Vuong et al., ICRA 2024
- Year: 2024
- Scale: 1M samples · 3M+ objects · text descriptions · foundation-model-generated
- Gripper: Parallel-jaw
- Modality: RGB + language
- Best-known: Language-driven grasp synthesis · open-vocabulary scenes

- 1M scenes · 3M+ objects generated via foundation models
- Enables open-vocabulary, language-conditioned grasp detection
- Basis for language-driven successors (Grasp-Anything++, -6D)
SIM = simulation result · HW = physical hardware. Image-wise accuracy is detection quality, not real-robot pick success. Figures cited from Vuong et al., ICRA 2024.
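To make "detection quality" concrete, here is a minimal sketch of how a grasp-detection accuracy figure is often computed. It assumes the Cornell-style rectangle metric, where a predicted grasp counts as correct if its angle is within 30° of a ground-truth grasp and their rectangle IoU exceeds 0.25. This card does not state which metric or thresholds apply, so treat both as assumptions. The `(cx, cy, w, h, theta)` tuple layout and all function names are hypothetical, chosen for illustration only.

```python
import math

def rect_corners(cx, cy, w, h, theta):
    """Corners of a rotated grasp rectangle (theta in radians), in CCW order."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]

def polygon_area(poly):
    """Shoelace formula; degenerate polygons have zero area."""
    if len(poly) < 3:
        return 0.0
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def clip_convex(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` against a CCW convex `clipper`."""
    output = list(subject)
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        current, output = output, []
        for p, q in zip(current, current[1:] + current[:1]):
            # Signed side test: >= 0 means the point is inside this clip edge.
            sp = (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)
            sq = (bx - ax) * (q[1] - ay) - (by - ay) * (q[0] - ax)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output

def rect_iou(a, b):
    """Intersection over union of two rotated rectangles."""
    pa, pb = rect_corners(*a), rect_corners(*b)
    inter = polygon_area(clip_convex(pa, pb))
    union = polygon_area(pa) + polygon_area(pb) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred, truths, iou_thresh=0.25, angle_thresh=math.radians(30)):
    """True if `pred` matches any ground-truth grasp (assumed thresholds)."""
    for gt in truths:
        # A parallel-jaw grasp is symmetric under a 180-degree rotation.
        diff = abs((pred[4] - gt[4] + math.pi / 2) % math.pi - math.pi / 2)
        if diff < angle_thresh and rect_iou(pred, gt) > iou_thresh:
            return True
    return False

def detection_accuracy(predictions, ground_truths):
    """Fraction of images whose top predicted grasp is judged correct."""
    hits = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```

A score computed this way only says that the predicted rectangle overlaps an annotated one on held-out images. It involves no gripper, physics, or execution, which is why it should not be read as a real-robot pick success rate.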
Related benchmarks
- GraspNet-1Billion: de-facto clutter benchmark · AnyGrasp current SOTA (AP)
- Dex-Net 2.0: HW, 93% on adversarial objects · 99% precision on 40 novel objects (YuMi)
- Jacquard: ~95% image-wise (GR-ConvNet-class)
- Cornell Grasp: ~99% image-wise accuracy, saturated benchmark