What is the best robot grasping benchmark?

GraspNet-1Billion (Fang et al., CVPR 2020) is the de-facto benchmark for grasp detection in clutter — 97,280 RGB-D images, 190 cluttered scenes, 88 objects and roughly 1.1 billion labeled parallel-jaw grasp poses. AnyGrasp is the current state of the art on it. For suction, SuctionNet-1Billion is the equivalent; for industrial ambidextrous picking, Dex-Net 4.0 reports 95% reliability at 300 mean picks per hour.

What is grasp success rate (GSR)?

Grasp success rate is the number of successful grasps divided by grasp attempts on a physical robot. It is distinct from image-wise detection accuracy (e.g. Cornell or Jacquard 95–99%), which only measures whether a predicted grasp rectangle matches a labeled one — not whether a real robot lifts the object.

What is MPPH in bin picking?

MPPH means Mean Picks Per Hour — successful grasps (the object is actually transported) per hour. It is the dominant industrial throughput metric. Dex-Net 4.0 reports 300 MPPH at 95% reliability; AnyGrasp reports over 900 MPPH single-arm under controlled conditions. Commercial deployments publicly cluster roughly in the 300–900 MPPH band depending on the object mix.

What is the difference between suction and parallel-jaw grasping?

A parallel-jaw gripper achieves force closure across two opposing contacts and handles varied rigid geometry; it is the gripper behind most grasp benchmarks. A suction gripper forms an air-seal on a single flat, non-porous surface and is faster on boxes and flat packaging but fails on porous, perforated or highly curved objects. Ambidextrous systems (Dex-Net 4.0) let a policy choose between the two per object to maximize reliability.

GraspNet-1Billion vs Dex-Net — what is the difference?

GraspNet-1Billion is a real-world RGB-D dataset of cluttered scenes with ~1.1B labeled grasp poses, used to benchmark grasp-detection models. Dex-Net is a family of methods (2.0 parallel-jaw, 3.0 suction, 4.0 ambidextrous) that trains grasp-quality CNNs on millions of synthetic grasps and reports physical-robot reliability — Dex-Net 4.0 reaches 95% reliability at 300 MPPH.

Robotics · Grasping & bin-picking · From the bin to the belt · Issue: May 29, 2026

Grasp detection · suction vs jaw · bin-picking · 2026

Every robot grasping benchmark that matters, in one map — from GraspNet-1Billion to the warehouse belt.

GraspNet-1Billion, Dex-Net, Contact-GraspNet, AnyGrasp — the datasets, success rates, suction-versus-jaw trade-offs, and the bin-picking throughput metrics that decide what actually ships. Tabletop detection is near-solved; reliably clearing a heap of unseen objects is where the field — and industrial robotics — is still being won.

Every number cited · SIM / HW flagged

Interactive · WebGL · a story in three acts

From toil to autonomy.

Parts ride the belt; you sort them by hand: triangles to green, the rogue circle to blue. It is tedious and you start dropping picks. Then press “Automate this toil” and a multi-joint arm (FABRIK IK; a 6-axis robot, not one pivot) clears the line on its own. That hand-off is the whole story of warehouse AI.

Act I · You are on the line. Sort the parts by hand.

Act I · The toil

A human sorts the line by hand. It works — until the parts come faster than two hands can move, and picks start hitting the floor.

Act II · The hand-off

“Automate this toil.” The job is described once, then handed to a policy. No reprogramming per part — it generalises to the rogue circle too.

Act III · Autonomy

A learned policy clears the belt tirelessly at a steadier picks-per-hour. The human moves up the stack — from doing the toil to defining it.

Fig · Real-time multi-joint IK + parallel-jaw pick-and-sort on a conveyor. Play by hand, then press “Automate this toil” to watch the policy take over. WebGL (three.js), custom shaders.

Opinion · State of robot grasping, 2026

Grasping is solved on the bench and unsolved in the bin.

Kacper Wikiel · CodeSOTA · 29 May 2026

Read the leaderboards and grasping looks finished. Cornell sits at 99%, Jacquard at 95%, GraspNet has a clean leaderboard with a clear winner. Then you put a robot in front of a tote of mixed retail SKUs and it drops one pick in ten. The gap between those two facts is the entire field, and it is worth being honest about where it actually lives.

First: detection accuracy is a vanity metric. A 99% image-wise score on Cornell means a predicted rectangle overlaps a labeled one. It does not mean a robot lifted anything. A decade of papers optimized a number that never touched a gripper. The metrics that predict a working system — grasp success rate on hardware, declutter rate, and mean picks per hour — are far less flattering, and far more honest.

Suction quietly wins the warehouse. The research glamour goes to dexterous hands; the picks that ship run on a vacuum cup.

Second: the gripper debate is already over. Dexterous five-finger hands dominate the papers and almost none of the deployments. The lesson of Dex-Net 4.0 is blunt: the highest reliability comes from an ambidextrous policy that defaults to suction and falls back to a jaw — 95% reliability at 300 picks per hour. Suction is unglamorous, geometry-tolerant, and fast. In a warehouse, that wins.

Third: the bottleneck moved from grasping to perception. Modern grasp planners are good. What fails is the point cloud feeding them — a single viewpoint occludes most of a pile, glass and metal punch holes straight through the depth map, and clutter turns every scene into a guess. The next ten points of reliability will not come from a better grasp sampler. They will come from better perception: multi-view capture, shape completion, NeRF-based depth for transparent objects.

Fourth, and the one that matters commercially: generalization is the product. A system that hits 95% on a known catalogue is worthless if it needs a week of re-tuning for the next warehouse's SKUs. “Reliable from the first day on unfamiliar items” is not a marketing line — it is the actual unsolved research problem, and the survey literature is converging on the same answer: it is a data and generalization bottleneck, not a grasp-geometry one.

So: grasping is solved on the bench and unsolved in the bin. The interesting work for the next few years is not a new grasp representation. It is making a policy that walks up to a pile of objects it has never seen, in a warehouse it has never been deployed in, and clears it reliably from minute one. Everything else on this page is the scaffolding for that one problem.

Original experiment · run on our own GPU

We measured the clutter gap ourselves.

Rather than only cite the literature, we ran a bin-picking grasp simulation in PyBullet on an RTX 3090 and measured grasp success rate as the bin fills up. The degradation is real and reproducible.

A free-floating parallel-jaw gripper attempts a top-down grasp on a randomly chosen object dropped into a tray. We sweep the number of objects from one to eight and record whether the target is lifted clear — 1,440 grasp attempts in total (3 seeds × 120 trials × 4 clutter levels).

Grasp success rate falls from 86.7% on an isolated object to 53.3% in an eight-object pile — the same packed-versus-pile collapse the literature reports (GIGA, VGN), reproduced here with our own numbers. Clutter, not grasp geometry, is what breaks picking.

setup · PyBullet · floating parallel-jaw · YCB-style + primitive objects

protocol · random target · lift ≥ 10 cm = success · 3 seeds

Honest caveat: this is simulation with a simplified gripper and our own protocol — the absolute numbers are not comparable to any paper. The trend is the result.
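To make the protocol concrete, here is a minimal sketch of the success rule and the per-seed aggregation behind a mean-plus-range plot. It is not the script used for the experiment: the function names and the toy boolean outcomes in the usage note are assumptions for illustration, and no PyBullet calls are shown.

```python
from statistics import mean

def lifted_clear(start_z_m, end_z_m, min_lift_m=0.10):
    """Success rule from the protocol: the target rose by at least 10 cm."""
    return (end_z_m - start_z_m) >= min_lift_m

def summarize(outcomes_by_seed):
    """Aggregate one clutter level.

    outcomes_by_seed maps a seed to a list of True/False trial outcomes.
    Returns (mean of per-seed success rates, lowest rate, highest rate),
    i.e. the plotted point and its whisker ends.
    """
    rates = [sum(trials) / len(trials) for trials in outcomes_by_seed.values()]
    return mean(rates), min(rates), max(rates)
```

With made-up outcomes such as `{0: [True, True, False, True], 1: [True, False, False, True], 2: [True, True, True, True]}`, `summarize` returns a mean of 0.75 with a per-seed range of 0.5 to 1.0.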

Reproducible script → · raw results (JSON) →

Fig B · Measured grasp success rate vs clutter. Mean of 3 seeds; whiskers show the per-seed range. Run on a single RTX 3090 in PyBullet (DIRECT), May 2026.

§ 00 · Primer

What a grasp actually is.

Two representations split the field. Older benchmarks predict a flat grasp rectangle on an image; modern clutter models predict a full 6-DoF gripper pose on a raw point cloud.

Fig A · The two dominant grasp representations. Left: the 4-parameter planar grasp rectangle (x, y, θ, width) of Cornell and Jacquard. Right: a 6-DoF grasp pose — three translation, three rotation — regressed directly on the point cloud, as in GraspNet-1Billion and Contact-GraspNet.
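As a rough illustration of the two representations, the sketch below stores a planar grasp as four numbers and a 6-DoF grasp as a rotation plus a translation. The class names, the fixed jaw height used to draw the rectangle, and the axis conventions are assumptions made for this example; they are not taken from the Cornell, Jacquard, or GraspNet-1Billion tooling.

```python
import math
from dataclasses import dataclass

@dataclass
class PlanarGrasp:
    """4-parameter image-plane grasp: centre (x, y), angle theta, opening width."""
    x: float
    y: float
    theta: float  # radians, rotation of the opening axis in the image
    width: float  # gripper opening in pixels

    def corners(self, jaw_height=20.0):
        """Four rectangle corners; jaw_height is an assumed drawing size."""
        c, s = math.cos(self.theta), math.sin(self.theta)
        hw, hh = self.width / 2.0, jaw_height / 2.0
        local = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
        return [(self.x + c * u - s * v, self.y + s * u + c * v) for u, v in local]

@dataclass
class Grasp6DoF:
    """6-DoF grasp: 3x3 rotation (nested lists, row-major) plus a 3-vector translation."""
    rotation: list
    translation: tuple

    def as_matrix(self):
        """4x4 homogeneous transform of the gripper frame."""
        r, t = self.rotation, self.translation
        return [r[0] + [t[0]], r[1] + [t[1]], r[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]
```

The planar form is enough when the camera looks straight down at a table; the 6-DoF form is needed once the approach direction itself has to be predicted, as in a cluttered bin.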

Interactive · drag the parameters

Build a grasp

A planar grasp is four numbers on an image. Drag θ and the opening — this is exactly what Cornell- and Jacquard-trained networks predict.

Fig A.1 · Live grasp builder. Switch between the planar and 6-DoF representations and drag the sliders to see exactly what a grasp-detection network has to predict.

§ 01 · Stability

The physics of a hold.

Before any network, a grasp is a mechanics problem: will the object stay in the gripper under gravity and motion? Two analytic ideas underpin almost every benchmark on this page — force closure for fingers, and seal-plus-wrench-resistance for suction.

Learning-based grasping did not replace this physics — it learned to predict it from pixels and points. Understanding the underlying model is what separates tuning a network from diagnosing why it drops a part.

Fig 1.1 · Force closure. A parallel-jaw grasp is stable when the line between the two contact points lies inside both friction cones — then no external wrench can cause slip. The cone half-angle is set by the friction coefficient μ.
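The friction-cone condition can be written as a short check. This is a minimal sketch for two point contacts, assuming inward-pointing surface normals and a single friction coefficient shared by both contacts; the function names are illustrative and do not come from any grasping library.

```python
import math

def _unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def _angle(a, b):
    """Angle in radians between two vectors."""
    dot = sum(x * y for x, y in zip(_unit(a), _unit(b)))
    return math.acos(max(-1.0, min(1.0, dot)))

def is_antipodal(p1, n1, p2, n2, mu):
    """True if the line p1 -> p2 lies inside both friction cones.

    n1 and n2 are the inward-pointing surface normals at the contacts
    (a convention assumed for this sketch). The cone half-angle is atan(mu).
    """
    half_angle = math.atan(mu)
    line = tuple(b - a for a, b in zip(p1, p2))
    reverse = tuple(-c for c in line)
    return _angle(line, n1) <= half_angle and _angle(reverse, n2) <= half_angle
```

Two contacts on opposite faces of a box pass for any positive μ, while a contact whose normal is tilted 45° from the grasp line fails at μ = 0.5 (half-angle of about 26.6°) and passes at μ = 1.2 (about 50.2°).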

Fig 1.2 · Suction as physics. A vacuum grasp must first form a seal on the local surface, then that seal must resist the wrench from gravity and motion. Dex-Net 3.0 modeled both stages analytically; seal quality falls with surface curvature and porosity.
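For intuition about the second stage only, the sketch below uses the textbook ideal holding force of a sealed cup, pressure difference times cup area. It is far simpler than the Dex-Net 3.0 model: it assumes a perfect seal and a straight vertical lift, and it ignores torques and surface curvature. The safety factor and function names are assumptions for this example.

```python
import math

def suction_holding_force(cup_diameter_m, vacuum_pa):
    """Ideal holding force (N) of a sealed cup: pressure difference times area."""
    area = math.pi * (cup_diameter_m / 2.0) ** 2
    return vacuum_pa * area

def can_hold(mass_kg, cup_diameter_m, vacuum_pa, accel_ms2=0.0, safety_factor=2.0):
    """True if the ideal seal force covers the load with the given margin."""
    g = 9.81  # gravitational acceleration, m/s^2
    load = mass_kg * (g + accel_ms2)
    return suction_holding_force(cup_diameter_m, vacuum_pa) >= safety_factor * load
```

A 30 mm cup at a 60 kPa pressure difference gives about 42.4 N of ideal holding force, which covers a 1 kg object with a 2x margin but not a 3 kg one.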

§ 02 · Datasets

Grasp detection, benchmarked.

The datasets the grasping field actually agrees on, in rough order of how much they drive the current frontier. GraspNet-1Billion and the Dex-Net family anchor clutter and industrial picking; Cornell and Jacquard are the saturated tabletop-detection classics.

SIM = simulation result · HW = physical hardware. Image-wise accuracy is detection quality, not real-robot pick success.

Interactive register · May 2026

Benchmark	Source	Year	Scale	Gripper	Modality	Best-known result
Grasp-Anything	Vuong et al., ICRA 2024	2024	1M samples · 3M+ objects · text descriptions · foundation-model-generated	Parallel-jaw	RGB + language	Language-driven grasp synthesis · open-vocabulary scenes
SuctionNet-1Billion	Cao et al., RA-L 2021	2021	190 scenes · 88 objects · 97,280 images · ~1.1B suction annotations	Suction	RGB-D	HW: 80.65% grasp success · 100% object clearance (their method)
ACRONYM	Eppner et al., ICRA 2021	2021	17.7M grasps · 8,872 objects · 262 categories · FleX physics	Parallel-jaw (Franka)	Simulation-only	SIM: 59.21% of generated grasps succeed (label generation)
GIGA	Jiang et al., RSS 2021	2021	Built on VGN synthetic setup · affordance + implicit geometry	Parallel-jaw	TSDF + implicit	HW: 83.3% packed · 86.9% pile · SIM: 87.9% / 69.8%
GraspNet-1Billion	Fang et al., CVPR 2020	2020	97,280 RGB-D images · 190 cluttered scenes · 88 objects · ~1.1B grasp poses	Parallel-jaw	RGB-D · point cloud	De-facto clutter benchmark · AnyGrasp current SOTA (AP)
VGN	Breyer et al., CoRL 2020	2020	~2M synthetic grasps · 303 training meshes	Parallel-jaw (Franka)	TSDF (from depth)	HW: 80% grasp success · 92% clutter clearance · ~10 ms plan
EGAD!	Morrison et al., RA-L 2020	2020	2,000+ evolved objects · 49 diverse 3D-printable eval objects	Parallel-jaw	Mesh · depth	Diagnostic set (geometry × difficulty) · no single SOTA number
Dex-Net 4.0	Mahler et al., Science Robotics 2019	2019	5M+ synthetic grasps · 1,664 objects in simulated heaps	Ambidextrous (jaw + suction)	Depth	HW: 95% reliability · 300 MPPH (ABB YuMi)
Dex-Net 3.0	Mahler et al., ICRA 2018	2018	2.8M point clouds · 1,500 models · analytic suction-seal labels	Suction	Depth · point cloud	HW: 98% basic · 82% typical · 58% adversarial
Jacquard	Depierre et al., IROS 2018	2018	50,000+ images · ~11,000 objects · ~1.1M successful grasps	Parallel-jaw	RGB-D (synthetic trials)	~95% image-wise (GR-ConvNet-class)
Dex-Net 2.0	Mahler et al., RSS 2017	2017	6.7M synthetic point clouds + grasps from thousands of 3D models	Parallel-jaw	Depth	HW: 93% on adversarial · 99% precision on 40 novel objects (YuMi)
YCB Object & Model Set	Calli et al., IEEE R&A Magazine 2015	2015	77 physical objects + RGB-D scans & meshes	Object set	RGB-D meshes	Standard physical object set · not a scored benchmark
Cornell Grasp	Lenz et al., IJRR / RSS 2013–15	2011–13	885 RGB-D images · 240 objects · 8,019 labeled grasp rectangles	Parallel-jaw	RGB-D	~99% image-wise accuracy · saturated benchmark

Fig 2 · Grasp-detection register — filter by gripper, search, sort, and click any benchmark for its detail page. Shaded rows mark the three that define the current frontier: GraspNet-1Billion (clutter), Dex-Net 4.0 (ambidextrous industrial), SuctionNet-1Billion (suction).

§ 03 · Models

The grasp, predicted.

From a single depth image or point cloud to a gripper pose. The lineage runs from grasp-quality CNNs (Dex-Net) to dense 6-DoF generators (Contact-GraspNet, AnyGrasp) that clear bins of unseen objects close to human throughput.

Read the lane column carefully: HW is a physical-robot success rate; image-wise is detection accuracy on a labeled dataset and does not imply a real pick.

Fig 3.1 · The grasp-detection pipeline. Most modern systems share this inference path; the quality network (e.g. GQ-CNN) is what learns to rank candidates.
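The ranking step of that pipeline can be sketched in a few lines. This is a generic illustration, not the GQ-CNN or AnyGrasp code: the `Grasp` tuple, the `quality_fn` callback standing in for a learned quality network, and the 0.5 threshold are all assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Grasp = Tuple[float, float, float]  # hypothetical (x, y, theta) candidate

def rank_grasps(candidates: Sequence[Grasp],
                quality_fn: Callable[[Grasp], float],
                min_quality: float = 0.5) -> List[Tuple[float, Grasp]]:
    """Score every candidate and return those above the threshold, best first."""
    scored = [(quality_fn(g), g) for g in candidates]
    kept = [sg for sg in scored if sg[0] >= min_quality]
    return sorted(kept, key=lambda sg: sg[0], reverse=True)

def plan_grasp(candidates: Sequence[Grasp],
               quality_fn: Callable[[Grasp], float],
               min_quality: float = 0.5) -> Optional[Grasp]:
    """Return the best grasp, or None when nothing clears the threshold."""
    ranked = rank_grasps(candidates, quality_fn, min_quality)
    return ranked[0][1] if ranked else None
```

Returning `None` rather than the least-bad candidate matters in practice: it lets the outer loop re-perceive the scene instead of committing to a grasp the scorer does not trust.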

Model	Source	Input	Reported result	Notes	Lane
AnyGrasp	Fang et al. · IEEE T-RO 2023	Point cloud	93.3% bin-clearing · >900 MPPH single-arm	Cleared bins of 300+ unseen objects "on par with humans"	HW
Contact-GraspNet	Sundermeyer et al. (NVIDIA) · ICRA 2021	Depth · point cloud	>90% on unseen objects in structured clutter	Trained on ~17M simulated grasps (ACRONYM); ~halves failure rate	HW
6-DoF GraspNet	Mousavian et al. (NVIDIA) · ICCV 2019	Depth · point cloud	~88% success across varied objects	VAE grasp sampler + learned evaluator	HW
Dex-Net GQ-CNN	Mahler et al. · RSS 2017	Depth	93% on known adversarial objects	Grasp-quality CNN trained on 6.7M synthetic grasps	HW
GG-CNN	Morrison, Corke, Leitner · RSS 2018	Depth	83% adversarial · 81% in dynamic clutter	Lightweight, closed-loop up to 50 Hz	HW
GR-ConvNet v2	Kumra et al. · 2020 / 2022	RGB-D	98.8% Cornell · 95.1% Jacquard · 97.4% GraspNet	Image-wise detection accuracy, not real-robot pick success	image-wise

Physical-hardware grasp success · % (varied settings)

Model	Grasp success
AnyGrasp	93.3%
Contact-GraspNet	90%
6-DoF GraspNet	88%
Dex-Net GQ-CNN	93%
GG-CNN	83%
VGN	80%
SuctionNet-1B	80.65%

Fig 3 · Headline physical-hardware grasp-success figures. Each was measured under different objects, clutter, and gripper conditions — read as orders of magnitude, not a ranked head-to-head. Dex-Net GQ-CNN's 93% is on known adversarial objects; AnyGrasp's 93.3% is bin-clearing of 300+ unseen objects.

Fig 3.2 · Method lineage. The jaw → suction → ambidextrous arc of Dex-Net runs alongside the dense 6-DoF arc from GG-CNN to AnyGrasp.

§ 04 · Grippers

Suction, jaw, or both.

The end-effector decision is the first thing an industrial picking policy makes. Suction is faster on flat packaging; a parallel jaw generalizes across geometry; an ambidextrous system chooses per object to push reliability higher than either alone.

Dex-Net 4.0 made the ambidextrous case quantitatively: 95% reliability at 300 MPPH by learning when to suck and when to pinch.

Fig 4.1 · The three end-effector strategies. Suction seals a flat face; the parallel jaw closes on antipodal contacts; the ambidextrous policy picks whichever is more reliable per object.
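The per-object choice reduces to comparing predicted qualities across grippers. The sketch below only mirrors that idea and is not the Dex-Net 4.0 implementation: it assumes each gripper already has a quality score for its best candidate, and the threshold and names are illustrative.

```python
def choose_gripper(jaw_quality, suction_quality, min_quality=0.5):
    """Pick the gripper whose best candidate has the higher predicted quality.

    Returns (gripper_name, quality), or (None, best_quality) when neither
    gripper clears the threshold and the system should re-perceive.
    """
    options = {"jaw": jaw_quality, "suction": suction_quality}
    name = max(options, key=options.get)
    if options[name] < min_quality:
        return None, options[name]
    return name, options[name]
```

For a flat box the suction score would typically dominate; for a porous or strongly curved item the jaw score would, which is how a single policy covers both failure columns of the gripper table.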

Gripper	Principle	Best for	Fails on	Benchmarks
Parallel-jaw	Form / force closure across two opposing contacts	Rigid objects of varied geometry · most grasp benchmarks	Large flat faces · heavy smooth surfaces with no graspable edge	GraspNet-1B · Dex-Net 2.0 · Cornell · Jacquard
Suction (vacuum)	Air-seal on a single sufficiently flat, non-porous surface	Boxes · flat packaging · fast top-down picks	Porous · perforated · highly curved or deformable surfaces	Dex-Net 3.0 · SuctionNet-1B
Ambidextrous	Policy chooses jaw or suction per object from depth	Mixed warehouse SKUs · maximizing reliability across a bin	Added mechanical + planning complexity	Dex-Net 4.0 (95% · 300 MPPH)
Multi-finger / dexterous	High-DoF hand · in-hand reorientation possible	Research · tools · complex in-hand manipulation	Not yet a warehouse-throughput standard · sim-to-real gap	Shadow Hand · DexMV · DexArt

Fig 4 · The gripper-choice spectrum. The ambidextrous row is where most warehouse SOTA now sits.

§ 05 · Metrics

What a pick is worth.

A grasp success rate and a throughput number are different currencies. The warehouse cares about picks per hour at a reliability it can trust; the paper usually reports a grasp success rate under controlled conditions.

Commercial single-arm systems publicly cluster in roughly the 300–900 MPPH band; the high end is a research figure and real deployments are item-dependent. No single authoritative cross-vendor SOTA number exists — treat it as a range.

MPPH · Mean Picks Per Hour

Successful grasps — object actually transported — per hour. The dominant industrial throughput metric.

GSR · Grasp Success Rate

Successful grasps ÷ grasp attempts. The headline academic number; says nothing about speed.

Declutter · Clearance rate

Fraction of objects in a bin removed before the policy gives up or fails. Exposes the pile-vs-packed gap.

Reliability · Sustained pick reliability

Success held over long autonomous runs, usually paired with a human-intervention rate.
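The first three definitions can be computed from the same run log. This is a minimal sketch: the `PickLog` fields are hypothetical, and the clearance calculation assumes every successful pick removes exactly one object from the bin.

```python
from dataclasses import dataclass

@dataclass
class PickLog:
    attempts: int        # grasp attempts made
    successes: int       # objects actually transported
    objects_in_bin: int  # objects present at the start of the run
    elapsed_s: float     # wall-clock duration of the run in seconds

def grasp_success_rate(log):
    """Successful grasps divided by grasp attempts."""
    return log.successes / log.attempts if log.attempts else 0.0

def mean_picks_per_hour(log):
    """Successful picks scaled to one hour of wall-clock time."""
    return log.successes * 3600.0 / log.elapsed_s if log.elapsed_s else 0.0

def clearance_rate(log):
    """Fraction of the bin removed, assuming one object per successful pick."""
    return log.successes / log.objects_in_bin if log.objects_in_bin else 0.0
```

A run with 18 successes in 20 attempts over 216 seconds on a 20-object bin scores 90% GSR, 300 MPPH, and 90% clearance: three different numbers from one log, which is why they belong in separate columns.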

Industrial picking is a closed loop, not a single grasp. Throughput (MPPH) is the loop rate; the declutter rate is how much of the bin it clears before giving up. A system with a high grasp success rate that re-perceives slowly still loses on MPPH.

Fig 5.1 · The perceive → detect → score → pick → place cycle.

Publicly reported throughput

System	Throughput (picks/hr)	Reliability	Source
Dex-Net 4.0	300	95% reliability	Science Robotics 2019 · HW
AnyGrasp	>900	93.3% bin clearing	IEEE T-RO 2023 · HW, controlled
Covariant (vendor)	~515	<0.1% orders need human	covariant.ai · vendor claim

Fig 5 · Reported industrial picking figures. Vendor numbers (Covariant) are deployment claims under differing conditions, not peer-reviewed — do not compare directly against the academic rows.

§ 06 · Open problems

Where grasping still breaks.

A near-99% tabletop detection score hides where industrial picking actually fails — clutter, glass, deformables, and the gap between simulation and a real bin.

Fig 6 · The clutter gap, quantified. The same model loses ~18 points of grasp success rate moving from an ordered, packed bin to a heaped pile — occlusion and contact ambiguity are the dominant industrial failure mode.

Perception · occlusion

A single view sees only one side

A depth camera mounted over a bin observes only the surfaces facing it. The back and underside of every object — and anything beneath the top layer — is simply missing from the point cloud. The grasp planner is reasoning about a partial, one-sided reconstruction of the scene.

This is why packed-versus-pile success rates diverge so sharply: in a heap, most graspable surface is occluded, and the model must infer geometry it has never measured. Multi-view capture and shape completion help, but every extra view costs cycle time the warehouse counts against MPPH.
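The cycle-time cost can be made concrete with simple arithmetic. The sketch assumes every attempt takes one full cycle and that throughput is success rate times attempts per hour; the cycle times and success rates in the usage note are invented for illustration, not measured.

```python
def mpph(success_rate, base_cycle_s, extra_views=0, seconds_per_view=0.0):
    """Successful picks per hour when each attempt takes one full cycle.

    Each extra camera view adds seconds_per_view to the cycle.
    """
    cycle_s = base_cycle_s + extra_views * seconds_per_view
    return success_rate * 3600.0 / cycle_s
```

Under these assumptions, a single-view system at 80% success and a 6 s cycle delivers 480 picks per hour. Adding three extra views at 1.5 s each stretches the cycle to 10.5 s, so even if success rises to 90% the throughput falls to roughly 309 picks per hour.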

Perception · transparency

Glass and metal break the depth sensor

Structured-light and time-of-flight depth sensors assume light reflects diffusely off a surface. Transparent objects let the infrared pattern pass straight through and read the background; specular metal scatters it. Both produce holes and false readings exactly where an object is.

Because the grasp planner never sees valid geometry there, it cannot propose a grasp at all. The research answer is to infer the missing surface: ClearGrasp learns transparent geometry from synthetic data, and Dex-NeRF reconstructs it with a neural radiance field before handing depth back to a standard grasp model.

Transfer · sim-to-real

Simulation success is not a hardware number

Almost all grasp data is synthetic, because real labels require real picks. The danger is the reality gap: unmodeled friction, sensor noise, soft deformation, and actuator latency all diverge from the simulator, and a policy that scores perfectly in MuJoCo can fail on the arm.

Domain randomization — training across randomized lighting, textures, friction, and poses — is the standard mitigation, and physics-based labels (ACRONYM) transfer better than analytic ones. But the gap is narrowed, never closed, which is why this page tags every number as SIM or HW.
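In code, domain randomization amounts to drawing fresh simulator parameters for every training episode. The sketch below is simulator-agnostic; the parameter names and ranges are hypothetical placeholders, not values from Dex-Net, ACRONYM, or any published setup.

```python
import random

# Hypothetical ranges chosen for illustration only.
RANGES = {
    "friction": (0.3, 1.2),
    "light_intensity": (0.5, 1.5),
    "object_yaw_rad": (-3.1416, 3.1416),
    "depth_noise_std_m": (0.0, 0.003),
}

def sample_sim_params(rng):
    """Draw one randomized set of simulator parameters."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

def randomized_episodes(n, seed=0):
    """Yield n parameter sets, reproducible for a given seed."""
    rng = random.Random(seed)
    for _ in range(n):
        yield sample_sim_params(rng)
```

Each yielded dictionary would be applied to the simulator before an episode; seeding the generator keeps a training run repeatable while still covering the whole range.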

Mechanics · entanglement

Lifting one item drags another

Real bins contain hangers, cables, and interlocking parts. A grasp can be geometrically perfect and still fail because the target is physically linked to its neighbor — lifting one drags or jams the other, causing a double-pick or a drop mid-transport.

There is no saturated benchmark for this; it is a documented, still-open failure mode. Robust systems detect it after the fact (a weight or vision check on lift) and re-plan — which only works if the perception-to-execution loop is fast enough to retry.
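A weight check on lift followed by a bounded retry might look like the sketch below. It assumes the expected item weight is known and that a payload reading is available after each lift; the 25% tolerance, the labels, and the `attempt_fn` callback are illustrative choices, not a documented system.

```python
def classify_lift(measured_g, expected_g, tolerance=0.25):
    """Label a lift from payload weight: 'empty', 'ok', or 'double_pick'."""
    if measured_g < expected_g * (1.0 - tolerance):
        return "empty"
    if measured_g > expected_g * (1.0 + tolerance):
        return "double_pick"
    return "ok"

def pick_with_retry(attempt_fn, expected_g, max_attempts=3):
    """Call attempt_fn() (returns measured grams) until a lift checks out.

    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        if classify_lift(attempt_fn(), expected_g) == "ok":
            return True, attempt
    return False, max_attempts
```

Every retry is a full perceive-and-grasp cycle, so the attempt cap is effectively a throughput budget.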

More failure modes

Clutter & occlusion

The core driver of the pile-vs-packed gap. GIGA drops from 87.9% (packed) to 69.8% GSR (pile) in simulation; VGN shows the same collapse. Picking from a heap is a different problem from picking from a surface.

Transparent & reflective

Depth sensors fail on glass and shiny plastic. ClearGrasp (ICRA 2020) infers transparent geometry from 50k+ synthetic frames; Dex-NeRF (CoRL 2021) renders depth via a NeRF density field and feeds Dex-Net; Evo-NeRF (CoRL 2022) grasps transparent objects in sequence.

Suction-seal modeling

Suction success is a physics problem: will the seal hold the wrench? Dex-Net 3.0 introduced an analytic quasi-static seal model; SuctionNet-1Billion evaluates seal formation and wrench resistance at billion-scale.

Deformables & entanglement

Cloth, bags, cables, and interlocked items resist rigid grasp models. No saturated benchmark exists; lifting two entangled objects at once is a well-documented, still-open failure mode.

Sim-to-real gap

Most grasp training is synthetic. Domain randomization (Dex-Net 4.0) and physics-based labels (ACRONYM, which transfer better than analytic labels) narrow the gap — they do not close it. A simulation GSR is not a hardware GSR.

Detection ≠ pick success

Cornell and Jacquard 95–99% are image-wise detection scores, not real-robot reliability. Confusing the two is the single most common way grasping numbers get overstated.
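To see how little an image-wise score checks, here is a sketch of the rectangle metric commonly used with Cornell-style labels: a prediction counts as correct when its angle is within 30° of a labeled rectangle and their intersection-over-union exceeds 25%. The tuple layout `(x, y, theta, w, h)` and the helper names are assumptions for this example; the polygon intersection uses Sutherland-Hodgman clipping written out in plain Python.

```python
import math

def rect_corners(x, y, theta, w, h):
    """Counter-clockwise corners of a rotated rectangle."""
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + c * u - s * v, y + s * u + c * v) for u, v in pts]

def polygon_area(poly):
    """Shoelace area; returns 0.0 for an empty polygon."""
    n = len(poly)
    twice = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
                for i in range(n))
    return abs(twice) / 2.0

def _side(a, b, p):
    """Positive when p is left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def clip(subject, clipper):
    """Sutherland-Hodgman intersection of two convex counter-clockwise polygons."""
    out = subject
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        src, out = out, []
        for j in range(len(src)):
            p, q = src[j], src[(j + 1) % len(src)]
            sp, sq = _side(a, b, p), _side(a, b, q)
            if sp >= 0:
                out.append(p)
            if sp * sq < 0:  # edge p -> q crosses the clipping line
                t = sp / (sp - sq)
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
        if not out:
            return []
    return out

def rectangle_metric(pred, label, iou_thresh=0.25, angle_thresh_deg=30.0):
    """pred and label are (x, y, theta, w, h); True if angle and IoU both match."""
    diff = abs(pred[2] - label[2]) % math.pi
    diff = min(diff, math.pi - diff)  # a grasp rectangle is symmetric under 180 degrees
    if math.degrees(diff) > angle_thresh_deg:
        return False
    a, b = rect_corners(*pred), rect_corners(*label)
    inter = polygon_area(clip(a, b))
    union = polygon_area(a) + polygon_area(b) - inter
    return inter / union > iou_thresh
```

Nothing in this check involves contact, friction, or a lift, which is why a model can pass it on nearly every image and still drop objects on hardware.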

§ 07 · Competitions

The picking challenges.

The Amazon Picking / Robotics Challenge (2015–17) defined warehouse bin-picking as a field and seeded much of the talent now in industry. There is no single flagship successor — GraspNet-1Billion and SuctionNet-1Billion now serve as the standardized leaderboards.

Year	Event	Winner / authors	Approach
2015	Amazon Picking Challenge	Team RBO · TU Berlin	148 pts · compliant soft hand + suction · ICRA Seattle
2016	Amazon Picking Challenge	Team Delft · TU Delft	Won Pick & Stow · 3D cameras + hybrid suction/two-finger gripper
2017	Amazon Robotics Challenge	Team ACRV · "Cartman"	Low-cost (<$24k) Cartesian gantry · rotating suction + jaw · $80k prize
2025	Multi-Object Grasping benchmark	Chen et al. · arXiv 2503.20820	Grasping multiple objects per attempt, in pile and on surface

Fig 7 · The competitions that built industrial bin-picking. Winners converged early on hybrid suction + two-finger grippers, the same ambidextrous logic that now defines warehouse SOTA.

§ 08 · Bottlenecks

Where the research goes next.

The 2025 manipulation survey by Bai et al. frames the open problems as three bottlenecks — collection, utilization, generalization. Grasping sits squarely inside all three.

Data collection

Real grasp data is expensive — every label is a physical pick. The field leans on synthetic generation (ACRONYM, Dex-Net) and foundation-model synthesis (Grasp-Anything), but each trades realism for scale. Closing that trade-off is the central data problem.

Data utilization

Even with billion-scale corpora, models under-use them: most grasp detectors still train per-embodiment and discard cross-task structure. Shared geometry/affordance representations (GIGA) and 3D/implicit inputs are early attempts to extract more signal per sample.

Generalization

A policy that clears one bin distribution often fails on the next — new SKUs, new clutter statistics, transparent or deformable items. Generalization across objects, scenes, and embodiments is the bottleneck that decides whether a system works "from day one" in a new warehouse.

Taxonomy after Bai et al., “Towards a Unified Understanding of Robot Manipulation: A Comprehensive Survey” (arXiv:2510.10903, 2025). Framing applied to grasping by CodeSOTA.

§ 09 · Methodology

How we read grasping numbers.

Grasping is the most over-claimed corner of robotics benchmarking, because a single word — “success” — hides four different measurements. We separate them.

First, detection is not picking. Cornell and Jacquard 95–99% are image-wise accuracy: does a predicted rectangle overlap a labeled one. That is not a robot lifting an object, and we never present it as one.

Second, simulation is not hardware. ACRONYM is sim-only; its 59% is a label-quality figure, not a pick rate. Dex-Net, Contact-GraspNet and AnyGrasp headline numbers are physical, and we tag the lane on every row.

Third, success rate is not throughput. A 93% grasp success rate and 900 MPPH answer different questions; a warehouse buys the second at a reliability it can trust. We keep GSR and MPPH in separate columns.

Fourth, vendor claims are flagged. Deployment numbers from commercial picking companies are measured under conditions they choose; we mark them and never rank them against peer-reviewed results.

Note 1 · Flagship corpora

GraspNet-1Billion (1.1B parallel-jaw poses) and SuctionNet-1Billion (1.1B suction annotations) share 190 scenes and 88 objects — the closest thing to a common substrate for clutter grasping.

Note 2 · Synthetic scale

ACRONYM's 17.7M physics-based grasps train Contact-GraspNet; physics labels transfer to hardware better than analytic ones, but the sim-to-real gap is never zero.

Note 3 · Lineage

Dex-Net 2.0 (jaw, 2017) → 3.0 (suction, 2018) → 4.0 (ambidextrous, 2019) · GG-CNN (2018) → 6-DoF GraspNet (2019) → Contact-GraspNet (2021) → AnyGrasp (2023).

Note 4 · Unverified

We could not confirm a public picks-per-hour figure for several commercial vendors, nor a single cross-vendor warehouse SOTA point estimate. Where a number is not verifiable, we give the range and say so.

Note 5 · Further reading

The canonical survey is Newbury et al., “Deep Learning Approaches to Grasp Synthesis: A Review” (T-RO 2023). Newer language-driven datasets — Grasp-Anything and its successors — push grasping toward open-vocabulary, instruction-conditioned picking.

§ 10 · Related