Grasp-Anything

Name: Grasp-Anything
Creator: Vuong, Vu, Le, Huang, Huynh, Vo, Kugi, Nguyen
Published: 2024
Keywords: language-driven, foundation models, parallel-jaw, open-vocabulary

Grasp-Anything is a large-scale grasp dataset synthesized using foundation models: 1 million scenes with text descriptions and more than 3 million objects, with far greater object and scene diversity than hand-collected datasets.

It pushes grasping toward open-vocabulary, language-conditioned picking (predicting grasps for objects described in natural language) and seeds the language-driven line that continues in Grasp-Anything++ and Grasp-Anything-6D.

At a glance

Source: Vuong et al., ICRA 2024
Year: 2024
Scale: 1M samples · 3M+ objects · text descriptions · foundation-model-generated
Gripper: Parallel-jaw
Modality: RGB + language
Best-known: Language-driven grasp synthesis · open-vocabulary scenes

Key results

1M scenes · 3M+ objects generated via foundation models
Enables open-vocabulary, language-conditioned grasp detection
Basis for language-driven successors (Grasp-Anything++, -6D)

SIM = simulation result · HW = physical hardware. Image-wise accuracy is detection quality, not real-robot pick success. Figures cited from Vuong et al., ICRA 2024.
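
To make "image-wise accuracy is detection quality" concrete, here is a minimal sketch of the rectangle metric commonly used for parallel-jaw grasp detection on image datasets. It rests on assumptions not stated in this entry: a predicted grasp counts as correct when its rectangle overlaps some ground-truth rectangle with IoU above 0.25 and their orientations differ by less than 30 degrees. The `(cx, cy, w, h, theta)` tuple layout and every function name are hypothetical, chosen for illustration only.

```python
import math

def rect_corners(cx, cy, w, h, theta):
    """Corners of a grasp rectangle (centre, size, angle in radians), counter-clockwise."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]

def polygon_area(pts):
    """Unsigned polygon area via the shoelace formula."""
    acc = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2

def clip(subject, clipper):
    """Sutherland-Hodgman clipping of a convex polygon by a convex CCW polygon."""
    out = subject
    for i in range(len(clipper)):
        ax, ay = clipper[i]
        bx, by = clipper[(i + 1) % len(clipper)]

        def side(p):
            # Positive when p lies to the left of edge a->b (inside for CCW order).
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        inp, out = out, []
        for j in range(len(inp)):
            p, q = inp[j], inp[(j + 1) % len(inp)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                out.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # where segment p->q crosses the clip edge
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
        if not out:
            return []
    return out

def rect_iou(a, b):
    """Intersection over union of two rotated rectangles."""
    pa, pb = rect_corners(*a), rect_corners(*b)
    inter_pts = clip(pa, pb)
    inter = polygon_area(inter_pts) if len(inter_pts) >= 3 else 0.0
    union = polygon_area(pa) + polygon_area(pb) - inter
    return inter / union if union > 0 else 0.0

def angle_diff(t1, t2):
    """Orientation difference; a parallel-jaw grasp is unchanged by a 180 degree turn."""
    d = abs(t1 - t2) % math.pi
    return min(d, math.pi - d)

def is_correct(pred, ground_truths, iou_thresh=0.25, angle_thresh=math.radians(30)):
    """True if the prediction matches at least one annotated grasp (assumed thresholds)."""
    return any(
        angle_diff(pred[4], gt[4]) < angle_thresh and rect_iou(pred, gt) > iou_thresh
        for gt in ground_truths
    )

def image_wise_accuracy(predictions, annotations):
    """Fraction of images whose single predicted grasp is judged correct."""
    hits = sum(is_correct(p, gts) for p, gts in zip(predictions, annotations))
    return hits / len(predictions)
```

A score computed this way only says that a predicted rectangle agrees with an annotation in the image; it involves no gripper, physics, or robot execution, which is why it should not be read as pick success.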

Related benchmarks

GraspNet-1Billion
Dex-Net 2.0
Jacquard
Cornell Grasp