Vision Benchmarks
How computer vision evaluation moved from image classification on ImageNet through object detection and dense prediction on COCO, to open-world promptable segmentation with SA-1B and SA-V. The lineage reflects a structural shift: early benchmarks measured closed-set accuracy on fixed categories; modern benchmarks ask models to segment anything a user points at, including in video. Branches include CIFAR and Pascal VOC (historically important precursors) and ADE20K / Open Images (semantic and large-scale detection offshoots). SAM and SAM 2 are the reference *models* Meta shipped alongside their respective benchmarks — included here only as the systems that established SOTA on each.
ImageNet is the rare benchmark that defined an era, trained a generation of researchers, and then quietly saturated — top-1 accuracy at 91–92% means the last few percentage points are noise, not insight. COCO is still the de facto object detection standard; mAP on test-dev is what every detector paper reports, though performance on common categories is approaching ceiling. The open-vocabulary and promptable-segmentation line (SA-1B, then SA-V) is the active frontier: the evaluation question shifted from 'what class is this box?' to 'segment whatever the user points at, including in video'. SAM and SAM 2 are the reference models that ship with each benchmark — useful as anchors but not the benchmark itself. SA-V's video tracks are genuinely unsolved at human parity.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.
Nodes in detail.
Pascal VOC
20-class detection and segmentation challenge. The dominant benchmark before COCO. Introduced mAP at IoU=0.5 as the standard detection metric. Superseded by COCO's larger class vocabulary and multi-object density.
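To make the VOC metric concrete, here is a minimal numpy sketch of IoU matching and per-class AP; the function names and the all-point precision-recall integration are illustrative, not taken from the official VOC devkit. mAP is this AP averaged over the 20 classes.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(scores, is_tp, n_gt):
    """AP for one class: detections flagged TP/FP by IoU >= 0.5 matching."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp, dtype=bool)[order])
    fp = np.cumsum(~np.asarray(is_tp, dtype=bool)[order])
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # Monotone precision envelope, then all-point PR-curve integration.
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, envelope):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```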
CIFAR-10/100
60,000 32×32 images in 10 or 100 classes. The standard small-scale benchmark through the 2010s. CIFAR-10 top-1 accuracy now sits above 99%, so it survives mainly as a sanity-check and ablation dataset.
ImageNet
1.28M training images across 1,000 classes. AlexNet's 2012 win on ILSVRC launched the deep-learning era. Top-1 accuracy reached ~92% by 2022; the classic human baseline (≈5.1% top-5 error, i.e. ~95% top-5 accuracy) was surpassed back in 2015. Still cited as a pretraining quality proxy, but no longer used as a frontier discriminator.
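For concreteness, the headline ILSVRC metric in a few lines of numpy; array shapes are assumed as noted, and k=5 gives the top-5 variant that the human baseline refers to.

```python
import numpy as np

def top_k_accuracy(logits, labels, k=1):
    """logits: (N, 1000) class scores; labels: (N,) integer class ids."""
    top_k = np.argsort(-logits, axis=1)[:, :k]     # k highest-scoring classes per image
    hits = (top_k == labels[:, None]).any(axis=1)  # true label anywhere in the top k?
    return hits.mean()

# top-1 is the headline ILSVRC metric; top_k_accuracy(logits, labels, k=5)
# gives the top-5 number that the ~5.1%-error human baseline refers to.
```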
COCO
328K images, 80 object categories, 2.5M instance annotations. Box detection, instance segmentation, and keypoint tracks. Still the standard detection benchmark; every major detection model cites COCO test-dev mAP. Common-category performance is saturating but tail-class and dense-scene detection still discriminates.
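The headline COCO number is stricter than VOC's: AP is averaged over ten IoU thresholds from 0.50 to 0.95 before averaging over the 80 categories. A sketch of that outer loop, assuming a per-class, per-threshold ap_at_iou routine like the VOC sketch above (the name is hypothetical):

```python
import numpy as np

def coco_map(detections_by_class, gt_by_class, ap_at_iou):
    """COCO-style mAP: mean over IoU thresholds 0.50:0.05:0.95, then classes.

    ap_at_iou(dets, gts, thr) is assumed to be a per-class AP routine like
    the VOC sketch above, with the TP/FP match threshold as an argument.
    """
    thresholds = np.arange(0.50, 1.00, 0.05)  # 0.50, 0.55, ..., 0.95
    aps = [
        ap_at_iou(detections_by_class[c], gt_by_class[c], thr)
        for c in detections_by_class
        for thr in thresholds
    ]
    return float(np.mean(aps))
```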
ADE20K
20,000 images densely labelled with 150 semantic categories. The dominant semantic segmentation benchmark. mIoU on val is the standard metric. Complementary to COCO — finer per-pixel labelling, more categories, scenes rather than objects.
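mIoU is per-class intersection-over-union averaged across the 150 categories; a minimal self-contained sketch computed via a confusion matrix:

```python
import numpy as np

def mean_iou(pred, target, num_classes=150):
    """pred, target: integer label maps of identical shape."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.bincount(
        target.ravel() * num_classes + pred.ravel(),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)   # guard divide-by-zero for absent classes
    return iou[union > 0].mean()         # average only over classes that appear
```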
SA-1B
11M images, 1B masks — the largest segmentation dataset ever released. The benchmark introduced the promptable-segmentation task definition: given an image and a click/box/text prompt, produce a mask. SAM is the reference model Meta shipped alongside it and remains the SOTA anchor; the benchmark is the test split + the task itself.
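The task definition is easiest to see in code. A minimal sketch using the predictor interface from Meta's open-source segment_anything package; the checkpoint filename matches the released ViT-H weights, and the image path and click coordinates are illustrative:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the released ViT-H checkpoint (filename as published by Meta).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The predictor expects an HxWx3 uint8 RGB array; "example.jpg" is illustrative.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One positive click (label 1) at an illustrative pixel; multimask_output
# returns three candidate masks plus predicted quality scores.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean HxW array
```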
SA-V
51K videos with mask annotations — released alongside the SAM 2 model. Evaluates promptable video object segmentation: a click on frame 0 must propagate accurately through the rest of the clip. DAVIS and YouTube-VOS remain the standard external benchmarks for cross-comparison. SAM 2 is the reference model; SA-V is the benchmark that defines the task.
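Propagation quality is scored the same way DAVIS scores it, with J&F: region similarity J (mean per-frame mask IoU along the track) plus boundary accuracy F. A minimal sketch of the J half; the F term needs contour matching and is omitted here:

```python
import numpy as np

def region_similarity_j(pred_masks, gt_masks):
    """J for one object track: mean per-frame mask IoU.

    pred_masks, gt_masks: (T, H, W) boolean arrays, one mask per frame.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(1.0 if union == 0 else inter / union)  # empty-vs-empty scores 1
    return float(np.mean(ious))
```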