Mean picks per hour (MPPH): successful grasps, meaning the object was actually transported, per hour. The dominant industrial throughput metric.
Every robot grasping benchmark that matters, in one map — from GraspNet-1Billion to the warehouse belt.
GraspNet-1Billion, Dex-Net, Contact-GraspNet, AnyGrasp — the datasets, success rates, suction-versus-jaw trade-offs, and the bin-picking throughput metrics that decide what actually ships. Tabletop detection is near-solved; reliably clearing a heap of unseen objects is where the field — and industrial robotics — is still being won.
From toil to autonomy.
Parts ride the belt; you sort them by hand: triangles to green, the rogue circle to blue. It is tedious and you start dropping picks. Then press "Automate this toil" and a multi-joint arm (FABRIK IK; a 6-axis robot, not one pivot) clears the line on its own. That hand-off is the whole story of warehouse AI.
A human sorts the line by hand. It works — until the parts come faster than two hands can move, and picks start hitting the floor.
“Automate this toil.” The job is described once, then handed to a policy. No reprogramming per part: it generalizes to the rogue circle too.
GripAI clears the belt tirelessly at a steadier picks-per-hour. The human moves up the stack — from doing the toil to defining it.
Grasping is solved on the bench and unsolved in the bin.
Read the leaderboards and grasping looks finished. Cornell sits at 99%, Jacquard at 95%, GraspNet has a clean leaderboard with a clear winner. Then you put a robot in front of a tote of mixed retail SKUs and it drops one pick in ten. The gap between those two facts is the entire field, and it is worth being honest about where it actually lives.
First: detection accuracy is a vanity metric. A 99% image-wise score on Cornell means a predicted rectangle overlaps a labeled one. It does not mean a robot lifted anything. A decade of papers optimized a number that never touched a gripper. The metrics that predict a working system — grasp success rate on hardware, declutter rate, and mean picks per hour — are far less flattering, and far more honest.
Suction quietly wins the warehouse. The research glamour goes to dexterous hands; the picks that ship run on a vacuum cup.
Second: the gripper debate is already over. Dexterous five-finger hands dominate the papers and almost none of the deployments. The lesson of Dex-Net 4.0 is blunt: the highest reliability comes from an ambidextrous policy that defaults to suction and falls back to a jaw — 95% reliability at 300 picks per hour. Suction is unglamorous, geometry-tolerant, and fast. In a warehouse, that wins.
Third: the bottleneck moved from grasping to perception. Modern grasp planners are good. What fails is the point cloud feeding them — a single viewpoint occludes most of a pile, glass and metal punch holes straight through the depth map, and clutter turns every scene into a guess. The next ten points of reliability will not come from a better grasp sampler. They will come from better perception: multi-view capture, shape completion, NeRF-based depth for transparent objects.
Fourth, and the one that matters commercially: generalization is the product. A system that hits 95% on a known catalogue is worthless if it needs a week of re-tuning for the next warehouse's SKUs. “Reliable from the first day on unfamiliar items” is not a marketing line — it is the actual unsolved research problem, and the survey literature is converging on the same answer: it is a data and generalization bottleneck, not a grasp-geometry one.
So: grasping is solved on the bench and unsolved in the bin. The interesting work for the next few years is not a new grasp representation. It is making a policy that walks up to a pile of objects it has never seen, in a warehouse it has never been deployed in, and clears it reliably from minute one. Everything else on this page is the scaffolding for that one problem.
We measured the clutter gap ourselves.
Rather than only cite the literature, we ran a bin-picking grasp simulation in PyBullet on an RTX 3090 and measured grasp success rate as the bin fills up. The degradation is real and reproducible.
A free-floating parallel-jaw gripper attempts a top-down grasp on a randomly chosen object dropped into a tray. We sweep the number of objects from one to eight and record whether the target is lifted clear — 1,440 grasp attempts in total (3 seeds × 120 trials × 4 clutter levels).
Grasp success rate falls from 86.7% on an isolated object to 53.3% in an eight-object pile — the same packed-versus-pile collapse the literature reports (GIGA, VGN), reproduced here with our own numbers. Clutter, not grasp geometry, is what breaks picking.
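The two headline rates can be given an uncertainty band. The sketch below computes a Wilson score interval for each. It assumes the 1,440 attempts split evenly across the four clutter levels (360 each) and back-computes success counts of 312 and 192 from the reported 86.7% and 53.3%; both are assumptions for illustration, not figures reported by the run.

```python
import math

def wilson_interval(successes, attempts, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    p = successes / attempts
    denom = 1 + z * z / attempts
    centre = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z * z / (4 * attempts ** 2)) / denom
    return centre - half, centre + half

# Assumption: attempts are split evenly across the 4 clutter levels.
ATTEMPTS_PER_LEVEL = 1440 // 4  # 360

# Assumption: success counts back-computed from the reported 86.7% and 53.3%.
ASSUMED_SUCCESSES = {"isolated object": 312, "eight-object pile": 192}

for label, k in ASSUMED_SUCCESSES.items():
    low, high = wilson_interval(k, ATTEMPTS_PER_LEVEL)
    print(f"{label}: {k / ATTEMPTS_PER_LEVEL:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

Under these assumptions the two intervals do not overlap, which is consistent with the drop being a real effect rather than sampling noise. Pooling the three seeds this way ignores seed-to-seed variance, so treat the band as optimistic.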
What a grasp actually is.
Two representations split the field. Older benchmarks predict a flat grasp rectangle on an image; modern clutter models predict a full 6-DoF gripper pose on a raw point cloud.
A planar grasp is four numbers on an image: a centre (x, y), a rotation θ, and a gripper opening. This is exactly what Cornell- and Jacquard-trained networks predict.
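To make the planar representation concrete, here is a minimal sketch that turns the four numbers into a rectangle and scores a prediction against a label with the commonly used rectangle metric (overlap above 25% IoU and an angle within 30°). The `PlanarGrasp` class, the fixed `jaw_size`, and all function names are illustrative assumptions, not code from any of the benchmarks.

```python
import math
from dataclasses import dataclass

@dataclass
class PlanarGrasp:
    x: float      # grasp centre, pixels
    y: float
    theta: float  # gripper rotation in the image plane, radians
    width: float  # gripper opening, pixels

def corners(g, jaw_size=20.0):
    """Rectangle corners in counter-clockwise order (positive shoelace area).
    jaw_size is an assumed fixed plate length; it is not one of the four numbers."""
    c, s = math.cos(g.theta), math.sin(g.theta)
    hw, hh = g.width / 2, jaw_size / 2
    return [(g.x + c * u - s * v, g.y + s * u + c * v)
            for u, v in [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]]

def area(poly):
    pairs = zip(poly, poly[1:] + poly[:1])
    return 0.5 * abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs))

def clip(subject, clipper):
    """Sutherland-Hodgman: clip a polygon against a convex counter-clockwise polygon."""
    out = list(subject)
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        def side(p):  # >= 0 means p is on the inner side of edge a->b
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)
        def crossing(p, q):  # point where segment p->q meets the edge line
            t = side(p) / (side(p) - side(q))
            return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
        inp, out = out, []
        for p, q in zip(inp, inp[1:] + inp[:1]):
            if side(q) >= 0:
                if side(p) < 0:
                    out.append(crossing(p, q))
                out.append(q)
            elif side(p) >= 0:
                out.append(crossing(p, q))
        if not out:
            break
    return out

def iou(a, b):
    inter = area(clip(a, b))
    return inter / (area(a) + area(b) - inter)

def rectangle_match(pred, label, jaw_size=20.0, iou_min=0.25, angle_max_deg=30.0):
    d = abs(pred.theta - label.theta) % math.pi
    d = min(d, math.pi - d)  # a parallel-jaw grasp is symmetric under a 180 degree turn
    overlap = iou(corners(pred, jaw_size), corners(label, jaw_size))
    return overlap > iou_min and math.degrees(d) < angle_max_deg

label = PlanarGrasp(100, 100, 0.0, 40)
print(rectangle_match(PlanarGrasp(110, 100, 0.0, 40), label))          # shifted 10 px
print(rectangle_match(PlanarGrasp(100, 100, math.pi / 2, 40), label))  # rotated 90 degrees
```

Note what the check never consults: object mass, friction, or whether anything was lifted. That is the sense in which an image-wise score is a detection number rather than a pick rate.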
The physics of a hold.
Before any network, a grasp is a mechanics problem: will the object stay in the gripper under gravity and motion? Two analytic ideas underpin almost every benchmark on this page — force closure for fingers, and seal-plus-wrench-resistance for suction.
Learning-based grasping did not replace this physics — it learned to predict it from pixels and points. Understanding the underlying model is what separates tuning a network from diagnosing why it drops a part.
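In their simplest textbook form, both analytic ideas reduce to short checks. The sketch below shows the classic antipodal test for two frictional point contacts, and an ideal-seal static bound for a suction cup (pressure difference times area against weight). Function names, the safety factor, and the example numbers are assumptions for illustration; the suction check in particular is far cruder than the seal and wrench models used in Dex-Net 3.0 or SuctionNet-1Billion.

```python
import math

def antipodal_force_closure(p1, n1, p2, n2, mu):
    """Two frictional point contacts. n1 and n2 are unit normals pointing
    into the object. True if the line joining the contacts lies strictly
    inside both friction cones (half-angle atan(mu))."""
    line = [b - a for a, b in zip(p1, p2)]
    length = math.sqrt(sum(c * c for c in line))
    if length == 0:
        return False
    u = [c / length for c in line]

    def angle(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return math.acos(max(-1.0, min(1.0, dot)))

    cone = math.atan(mu)
    return angle(u, n1) < cone and angle([-c for c in u], n2) < cone

def suction_holds(cup_diameter_m, vacuum_pa, mass_kg, safety_factor=2.0):
    """Ideal-seal static check for a straight vertical lift: pressure
    difference times cup area must exceed weight times a safety factor."""
    cup_area = math.pi * cup_diameter_m ** 2 / 4
    holding_force_n = vacuum_pa * cup_area
    return holding_force_n >= mass_kg * 9.81 * safety_factor

# Opposite faces of a box, contacts directly facing each other.
print(antipodal_force_closure((-1, 0), (1, 0), (1, 0), (-1, 0), mu=0.5))
# Same normals, but the second contact is badly offset.
print(antipodal_force_closure((-1, 0), (1, 0), (1, 2), (-1, 0), mu=0.5))
# Hypothetical 30 mm cup at 60 kPa vacuum lifting a 0.5 kg item.
print(suction_holds(0.03, 60_000, 0.5))
```

When a learned planner drops a part, these are the quantities worth inspecting first: contact normals and friction for a jaw, seal area and payload for a cup.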
Grasp detection, benchmarked.
The datasets the grasping field actually agrees on, in rough order of how much they drive the current frontier. GraspNet-1Billion and the Dex-Net family anchor clutter and industrial picking; Cornell and Jacquard are the saturated tabletop-detection classics.
SIM = simulation result · HW = physical hardware. Image-wise accuracy is detection quality, not real-robot pick success.
| Benchmark | Source | Year | Scale | Gripper | Modality | Best-known result |
|---|---|---|---|---|---|---|
| Grasp-Anything | Vuong et al., ICRA 2024 | 2024 | 1M samples · 3M+ objects · text descriptions · foundation-model-generated | Parallel-jaw | RGB + language | Language-driven grasp synthesis · open-vocabulary scenes |
| SuctionNet-1Billion | Cao et al., RA-L 2021 | 2021 | 190 scenes · 88 objects · 97,280 images · ~1.1B suction annotations | Suction | RGB-D | HW: 80.65% grasp success · 100% object clearance (their method) |
| ACRONYM | Eppner et al., ICRA 2021 | 2021 | 17.7M grasps · 8,872 objects · 262 categories · FleX physics | Parallel-jaw (Franka) | Simulation-only | SIM: 59.21% of generated grasps succeed (label generation) |
| GIGA | Jiang et al., RSS 2021 | 2021 | Built on VGN synthetic setup · affordance + implicit geometry | Parallel-jaw | TSDF + implicit | HW: 83.3% packed · 86.9% pile · SIM: 87.9% / 69.8% |
| GraspNet-1Billion | Fang et al., CVPR 2020 | 2020 | 97,280 RGB-D images · 190 cluttered scenes · 88 objects · ~1.1B grasp poses | Parallel-jaw | RGB-D · point cloud | De-facto clutter benchmark · AnyGrasp current SOTA (AP) |
| VGN | Breyer et al., CoRL 2020 | 2020 | ~2M synthetic grasps · 303 training meshes | Parallel-jaw (Franka) | TSDF (from depth) | HW: 80% grasp success · 92% clutter clearance · ~10 ms plan |
| EGAD! | Morrison et al., RA-L 2020 | 2020 | 2,000+ evolved objects · 49 diverse 3D-printable eval objects | Parallel-jaw | Mesh · depth | Diagnostic set (geometry × difficulty) · no single SOTA number |
| Dex-Net 4.0 | Mahler et al., Science Robotics 2019 | 2019 | 5M+ synthetic grasps · 1,664 objects in simulated heaps | Ambidextrous (jaw + suction) | Depth | HW: 95% reliability · 300 MPPH (ABB YuMi) |
| Dex-Net 3.0 | Mahler et al., ICRA 2018 | 2018 | 2.8M point clouds · 1,500 models · analytic suction-seal labels | Suction | Depth · point cloud | HW: 98% basic · 82% typical · 58% adversarial |
| Jacquard | Depierre et al., IROS 2018 | 2018 | 50,000+ images · ~11,000 objects · ~1.1M successful grasps | Parallel-jaw | RGB-D (synthetic trials) | ~95% image-wise (GR-ConvNet-class) |
| Dex-Net 2.0 | Mahler et al., RSS 2017 | 2017 | 6.7M synthetic point clouds + grasps from thousands of 3D models | Parallel-jaw | Depth | HW: 93% on adversarial · 99% precision on 40 novel objects (YuMi) |
| YCB Object & Model Set | Calli et al., IEEE R&A Magazine 2015 | 2015 | 77 physical objects + RGB-D scans & meshes | Object set | RGB-D meshes | Standard physical object set, not a scored benchmark |
| Cornell Grasp | Lenz et al., IJRR / RSS 2013–15 | 2011–13 | 885 RGB-D images · 240 objects · 8,019 labeled grasp rectangles | Parallel-jaw | RGB-D | ~99% image-wise accuracy, saturated benchmark |
The grasp, predicted.
From a single depth image or point cloud to a gripper pose. The lineage runs from grasp-quality CNNs (Dex-Net) to dense 6-DoF generators (Contact-GraspNet, AnyGrasp) that clear bins of unseen objects close to human throughput.
Read the lane column carefully: HW is a physical-robot success rate; image-wise is detection accuracy on a labeled dataset and does not imply a real pick.
| Model | Source | Input | Reported result | Lane |
|---|---|---|---|---|
| AnyGrasp | Fang et al. · IEEE T-RO 2023 | Point cloud | 93.3% bin-clearing · >900 MPPH single-arm. Cleared bins of 300+ unseen objects "on par with humans". | HW |
| Contact-GraspNet | Sundermeyer et al. (NVIDIA) · ICRA 2021 | Depth · point cloud | >90% on unseen objects in structured clutter. Trained on ~17M simulated grasps (ACRONYM); ~halves failure rate. | HW |
| 6-DoF GraspNet | Mousavian et al. (NVIDIA) · ICCV 2019 | Depth · point cloud | ~88% success across varied objects. VAE grasp sampler + learned evaluator. | HW |
| Dex-Net GQ-CNN | Mahler et al. · RSS 2017 | Depth | 93% on known adversarial objects. Grasp-quality CNN trained on 6.7M synthetic grasps. | HW |
| GG-CNN | Morrison, Corke, Leitner · RSS 2018 | Depth | 83% adversarial · 81% in dynamic clutter. Lightweight, closed-loop up to 50 Hz. | HW |
| GR-ConvNet v2 | Kumra et al. · 2020 / 2022 | RGB-D | 98.8% Cornell · 95.1% Jacquard · 97.4% GraspNet. Image-wise detection accuracy, not real-robot pick success. | image-wise |
Suction, jaw, or both.
The end-effector decision is the first thing an industrial picking policy makes. Suction is faster on flat packaging; a parallel jaw generalizes across geometry; an ambidextrous system chooses per object to push reliability higher than either alone.
Dex-Net 4.0 made the ambidextrous case quantitatively: 95% reliability at 300 MPPH by learning when to suck and when to pinch.
| Gripper | Principle | Best for | Fails on | Benchmarks |
|---|---|---|---|---|
| Parallel-jaw | Form / force closure across two opposing contacts | Rigid objects of varied geometry · most grasp benchmarks | Large flat faces · heavy smooth surfaces with no graspable edge | GraspNet-1B · Dex-Net 2.0 · Cornell · Jacquard |
| Suction (vacuum) | Air-seal on a single sufficiently flat, non-porous surface | Boxes · flat packaging · fast top-down picks | Porous · perforated · highly curved or deformable surfaces | Dex-Net 3.0 · SuctionNet-1B |
| Ambidextrous | Policy chooses jaw or suction per object from depth | Mixed warehouse SKUs · maximizing reliability across a bin | Added mechanical + planning complexity | Dex-Net 4.0 (95% · 300 MPPH) |
| Multi-finger / dexterous | High-DoF hand · in-hand reorientation possible | Research · tools · complex in-hand manipulation | Not yet a warehouse-throughput standard · sim-to-real gap | Shadow Hand · DexMV · DexArt |
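One simple way to read "chooses jaw or suction per object" is as an arg-max over scored candidates from both grippers. The sketch below shows that idea only; the `Candidate` fields, the quality scores, and the `min_quality` cut-off are hypothetical, and this is not the published Dex-Net 4.0 implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    gripper: str    # "suction" or "jaw"
    pose_id: int    # index into an upstream list of sampled poses (hypothetical)
    quality: float  # predicted probability of a successful lift, in [0, 1]

def select_grasp(candidates, min_quality=0.5):
    """Pick the single best candidate across both grippers, or None if
    nothing clears the (assumed) quality threshold."""
    best = max(candidates, key=lambda c: c.quality, default=None)
    if best is None or best.quality < min_quality:
        return None  # caller would re-perceive or disturb the bin
    return best

scene = [
    Candidate("suction", 0, 0.91),  # flat box top
    Candidate("jaw", 1, 0.78),
    Candidate("suction", 2, 0.35),  # curved surface, poor seal
]
print(select_grasp(scene))
```

Returning `None` instead of forcing a low-confidence pick is a design choice: a skipped attempt costs one cycle, while a dropped item can cost the order.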
What a pick is worth.
A grasp success rate and a throughput number are different currencies. The warehouse cares about picks per hour at a reliability it can trust; the paper usually reports a grasp success rate under controlled conditions.
Commercial single-arm systems publicly cluster in roughly the 300–900 MPPH band; the high end is a research figure and real deployments are item-dependent. No single authoritative cross-vendor SOTA number exists — treat it as a range.
Grasp success rate (GSR): successful grasps ÷ grasp attempts. The headline academic number; says nothing about speed.
Declutter rate: fraction of objects in a bin removed before the policy gives up or fails. Exposes the pile-vs-packed gap.
Reliability: success held over long autonomous runs, usually paired with a human-intervention rate.
Industrial picking is a closed loop, not a single grasp. Throughput (MPPH) is the loop rate; the declutter rate is how much of the bin it clears before giving up. A system with a high grasp success rate that re-perceives slowly still loses on MPPH.
Fig 5.1 · The perceive → detect → score → pick → place cycle.
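The three loop-level numbers can all be derived from one pick log. The sketch below assumes a log that is just a list of success flags plus the run duration, and that every successful pick removes exactly one object (no double picks); the function name and the example run are hypothetical.

```python
def picking_metrics(outcomes, objects_at_start, duration_s):
    """outcomes: one boolean per grasp attempt, True if the object was
    actually transported. Assumes one object removed per success."""
    successes = sum(outcomes)
    return {
        "grasp_success_rate": successes / len(outcomes),
        "declutter_rate": successes / objects_at_start,
        "mpph": successes * 3600.0 / duration_s,
    }

# Hypothetical run: 10 objects in the bin, 12 attempts, 9 successes, 2 minutes.
run = [True] * 9 + [False] * 3
print(picking_metrics(run, objects_at_start=10, duration_s=120.0))
```

The same log yields three different headline numbers, which is why a result quoted as "success" needs its metric named.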
| System | Throughput | Reliability | Source |
|---|---|---|---|
| Dex-Net 4.0 | 300 MPPH | 95% reliability | Science Robotics 2019 · HW |
| AnyGrasp | >900 MPPH | 93.3% bin clearing | IEEE T-RO 2023 · HW, controlled |
| Covariant (vendor) | ~515 picks/hr | <0.1% orders need human | covariant.ai · vendor claim |
Where grasping still breaks.
A near-99% tabletop detection score hides where industrial picking actually fails — clutter, glass, deformables, and the gap between simulation and a real bin.
A single view sees only one side
A depth camera mounted over a bin observes only the surfaces facing it. The back and underside of every object — and anything beneath the top layer — is simply missing from the point cloud. The grasp planner is reasoning about a partial, one-sided reconstruction of the scene.
This is why packed-versus-pile success rates diverge so sharply: in a heap, most graspable surface is occluded, and the model must infer geometry it has never measured. Multi-view capture and shape completion help, but every extra view costs cycle time the warehouse counts against MPPH.
Glass and metal break the depth sensor
Structured-light and time-of-flight depth sensors assume light reflects diffusely off a surface. Transparent objects let the infrared pattern pass straight through and read the background; specular metal scatters it. Both produce holes and false readings exactly where an object is.
Because the grasp planner never sees valid geometry there, it cannot propose a grasp at all. The research answer is to infer the missing surface: ClearGrasp learns transparent geometry from synthetic data, and Dex-NeRF reconstructs it with a neural radiance field before handing depth back to a standard grasp model.
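Before any learned completion, a system can at least notice that it is looking at a hole. The sketch below measures the fraction of invalid depth pixels in a candidate grasp window. It assumes the sensor reports missing depth as 0, NaN, or infinity, which is a common convention but varies by camera; the 30% routing threshold is hypothetical. It is not how ClearGrasp or Dex-NeRF work, only a cheap gate in front of them.

```python
import math

def invalid_fraction(depth, row0, row1, col0, col1):
    """Fraction of pixels in depth[row0:row1][col0:col1] with no usable
    reading. Assumes invalid depth is encoded as 0, NaN, or inf."""
    window = [d for row in depth[row0:row1] for d in row[col0:col1]]
    bad = sum(1 for d in window if d <= 0 or not math.isfinite(d))
    return bad / len(window)

NAN = float("nan")
# Toy 4x4 depth map in metres; the top-right patch is a transparent object.
depth = [
    [0.80, 0.80, 0.00, 0.00],
    [0.80, 0.80, 0.00, NAN],
    [0.80, 0.80, 0.80, 0.80],
    [0.80, 0.80, 0.80, 0.80],
]

MAX_HOLE_FRACTION = 0.3  # hypothetical threshold
needs_completion = invalid_fraction(depth, 0, 2, 2, 4) > MAX_HOLE_FRACTION
print(needs_completion)
```

A window flagged this way would be routed to a depth-completion step rather than silently skipped by the planner.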
Simulation success is not a hardware number
Almost all grasp data is synthetic, because real labels require real picks. The danger is the reality gap: unmodeled friction, sensor noise, soft deformation, and actuator latency all diverge from the simulator, and a policy that scores perfectly in MuJoCo can fail on the arm.
Domain randomization — training across randomized lighting, textures, friction, and poses — is the standard mitigation, and physics-based labels (ACRONYM) transfer better than analytic ones. But the gap is narrowed, never closed, which is why this page tags every number as SIM or HW.
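In code, domain randomization amounts to drawing fresh scene parameters for every training episode. The sketch below samples the four factors named above; the `SceneParams` fields and every numeric range are illustrative assumptions, not settings from Dex-Net, ACRONYM, or any specific simulator.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    light_intensity: float  # relative to a nominal 1.0
    texture_id: int         # index into a texture bank
    friction: float         # contact friction coefficient
    yaw_deg: float          # object rotation about the vertical axis

def sample_scene(rng, n_textures=50):
    """Draw one randomized scene. All ranges are assumed for illustration."""
    return SceneParams(
        light_intensity=rng.uniform(0.5, 1.5),
        texture_id=rng.randrange(n_textures),
        friction=rng.uniform(0.3, 1.0),
        yaw_deg=rng.uniform(0.0, 360.0),
    )

rng = random.Random(0)  # seeded so a training run can be reproduced
episodes = [sample_scene(rng) for _ in range(1000)]
print(episodes[0])
```

The ranges are the real design decision: too narrow and the policy overfits the simulator, too wide and it spends capacity on scenes no warehouse will produce.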
Lifting one item drags another
Real bins contain hangers, cables, and interlocking parts. A grasp can be geometrically perfect and still fail because the target is physically linked to its neighbor — lifting one drags or jams the other, causing a double-pick or a drop mid-transport.
There is no saturated benchmark for this; it is a documented, still-open failure mode. Robust systems detect it after the fact (a weight or vision check on lift) and re-plan — which only works if the perception-to-execution loop is fast enough to retry.
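A weight check on lift is the simplest version of that after-the-fact detection. The sketch below compares a measured payload with the expected item weight and retries on a mismatch. It assumes the item weight is known and that a scale or wrist force sensor is available; `attempt_pick` is a hypothetical callable standing in for the whole perceive-and-grasp cycle, and the 20% tolerance is arbitrary.

```python
def check_lift(measured_g, expected_g, tolerance=0.2):
    """Classify a lift from payload weight alone (tolerance is assumed)."""
    if measured_g < expected_g * (1 - tolerance):
        return "missing"  # empty gripper or item dropped
    if measured_g > expected_g * (1 + tolerance):
        return "extra"    # double pick or an entangled neighbor
    return "ok"

def pick_with_retry(attempt_pick, expected_g, max_attempts=3):
    """attempt_pick() runs one full pick cycle and returns the measured
    payload in grams. Returns (attempts_used, final_outcome)."""
    outcome = "not attempted"
    for attempt in range(1, max_attempts + 1):
        outcome = check_lift(attempt_pick(), expected_g)
        if outcome == "ok":
            return attempt, outcome
    return max_attempts, outcome

# Simulated readings: first lift drags a second item, the retry is clean.
readings = iter([410.0, 200.0])
print(pick_with_retry(lambda: next(readings), expected_g=200.0))
```

Each retry spends a full cycle, so the check only pays off when the loop is fast relative to the cost of a wrong item reaching the order.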
Occlusion in clutter is the core driver of the pile-vs-packed gap. In simulation, GIGA drops from an 87.9% grasp success rate (GSR) on packed scenes to 69.8% on piles; VGN shows the same collapse. Picking from a heap is a different problem from picking from a surface.
Depth sensors fail on glass and shiny plastic. ClearGrasp (ICRA 2020) infers transparent geometry from 50k+ synthetic frames; Dex-NeRF (CoRL 2021) renders depth via a NeRF density field and feeds Dex-Net; Evo-NeRF (CoRL 2022) grasps them in sequence.
Suction success is a physics problem: will the seal hold the wrench? Dex-Net 3.0 introduced an analytic quasi-static seal model; SuctionNet-1Billion evaluates seal formation and wrench resistance at billion-scale.
Cloth, bags, cables, and interlocked items resist rigid grasp models. No saturated benchmark exists; lifting two entangled objects at once is a well-documented, still-open failure mode.
Most grasp training is synthetic. Domain randomization (Dex-Net 4.0) and physics-based labels (ACRONYM, which transfer better than analytic labels) narrow the gap — they do not close it. A simulation GSR is not a hardware GSR.
Cornell and Jacquard 95–99% are image-wise detection scores, not real-robot reliability. Confusing the two is the single most common way grasping numbers get overstated.
The picking challenges.
The Amazon Picking / Robotics Challenge (2015–17) defined warehouse bin-picking as a field and seeded much of the talent now in industry. There is no single flagship successor — GraspNet-1Billion and SuctionNet-1Billion now serve as the standardized leaderboards.
| Year | Event | Winner | Approach |
|---|---|---|---|
| 2015 | Amazon Picking Challenge | Team RBO · TU Berlin | 148 pts · compliant soft hand + suction · ICRA Seattle |
| 2016 | Amazon Picking Challenge | Team Delft · TU Delft | Won Pick & Stow · 3D cameras + hybrid suction/two-finger gripper |
| 2017 | Amazon Robotics Challenge | Team ACRV · "Cartman" | Low-cost (<$24k) Cartesian gantry · rotating suction + jaw · $80k prize |
| 2025 | Multi-Object Grasping benchmark | Chen et al. · arXiv 2503.20820 | Grasping multiple objects per attempt, in pile and on surface |
Where the research goes next.
The 2025 manipulation survey by Bai et al. frames the open problems as three bottlenecks — collection, utilization, generalization. Grasping sits squarely inside all three.
Collection. Real grasp data is expensive: every label is a physical pick. The field leans on synthetic generation (ACRONYM, Dex-Net) and foundation-model synthesis (Grasp-Anything), but each trades realism for scale. Closing that trade-off is the central data problem.
Utilization. Even with billion-scale corpora, models under-use them: most grasp detectors still train per-embodiment and discard cross-task structure. Shared geometry/affordance representations (GIGA) and 3D/implicit inputs are early attempts to extract more signal per sample.
Generalization. A policy that clears one bin distribution often fails on the next: new SKUs, new clutter statistics, transparent or deformable items. Generalization across objects, scenes, and embodiments is the bottleneck that decides whether a system works "from day one" in a new warehouse.
How we read grasping numbers.
Grasping is the most over-claimed corner of robotics benchmarking, because a single word — “success” — hides four different measurements. We separate them.
First, detection is not picking. Cornell and Jacquard 95–99% are image-wise accuracy: does a predicted rectangle overlap a labeled one. That is not a robot lifting an object, and we never present it as one.
Second, simulation is not hardware. ACRONYM is sim-only; its 59% is a label-quality figure, not a pick rate. Dex-Net, Contact-GraspNet and AnyGrasp headline numbers are physical, and we tag the lane on every row.
Third, success rate is not throughput. A 93% grasp success rate and 900 MPPH answer different questions; a warehouse buys the second at a reliability it can trust. We keep GSR and MPPH in separate columns.
Fourth, vendor claims are flagged. Deployment numbers from commercial picking companies are measured under conditions they choose; we mark them and never rank them against peer-reviewed results.
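These four rules can be enforced structurally by tagging every number with its lane and only ranking within one lane and one metric. The sketch below uses the throughput figures quoted on this page; the `Lane` and `Result` types and the `rank` function are hypothetical, and AnyGrasp's ">900" is stored as its lower bound of 900.

```python
from dataclasses import dataclass
from enum import Enum

class Lane(Enum):
    HW = "physical hardware"
    SIM = "simulation"
    IMAGE_WISE = "image-wise detection"
    VENDOR = "vendor claim"

@dataclass(frozen=True)
class Result:
    system: str
    metric: str   # e.g. "mpph" or "gsr"
    value: float
    lane: Lane

def rank(results, lane, metric):
    """Rank only results that share both lane and metric."""
    comparable = [r for r in results if r.lane is lane and r.metric == metric]
    return sorted(comparable, key=lambda r: r.value, reverse=True)

results = [
    Result("Dex-Net 4.0", "mpph", 300, Lane.HW),
    Result("AnyGrasp", "mpph", 900, Lane.HW),        # reported as >900; lower bound
    Result("Covariant", "mpph", 515, Lane.VENDOR),   # vendor claim, never ranked with HW
    Result("GR-ConvNet v2", "cornell_acc", 98.8, Lane.IMAGE_WISE),
]

for r in rank(results, Lane.HW, "mpph"):
    print(r.system, r.value)
```

Because the filter runs before the sort, a vendor figure or an image-wise score cannot appear in a hardware ranking by accident.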