Codesota · Lineage · Vision Benchmarks · 7 benchmarks · 6 edges · Updated 2026-04-27
Benchmark lineage

Vision Benchmarks

How computer vision evaluation moved from image classification on ImageNet through object detection and dense prediction on COCO, to open-world promptable segmentation with SA-1B and SA-V. The lineage reflects a structural shift: early benchmarks measured closed-set accuracy on fixed categories; modern benchmarks ask models to segment anything a user points at, including in video. CIFAR and Pascal VOC appear as historically important precursors, while ADE20K and Open Images are semantic-segmentation and large-scale-detection offshoots. SAM and SAM 2 are the reference *models* Meta shipped alongside their respective benchmarks — included here only as the systems that established SOTA on each.

Editor's note

ImageNet is the rare benchmark that defined an era, trained a generation of researchers, and then quietly saturated — top-1 accuracy at 91–92% means the last few percentage points are noise, not insight. COCO is still the de facto object detection standard; mAP on test-dev is what every detector paper reports, though performance on common categories is approaching ceiling. The open-vocabulary and promptable-segmentation line (SA-1B, then SA-V) is the active frontier: the evaluation question shifted from 'what class is this box?' to 'segment whatever the user points at, including video'. SAM and SAM 2 are the reference models that ship with each benchmark — useful as anchors but not the benchmark itself. SA-V's video tracks are genuinely unsolved at human parity.

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.

Legend: attention path · scope shift · branch / fork · active · saturating · saturated / superseded
[Lineage graph] Nodes: Pascal VOC (Jan 2005) · CIFAR-10/100 (Jan 2009) · ImageNet (Jun 2009, SOTA 91.0%) · COCO (May 2014, SOTA 66.12) · ADE20K (Aug 2016, SOTA 62.90) · SA-1B (Apr 2023) · SA-V (Jul 2024). The six edges are listed below.
CIFAR-10/100 → ImageNet · direct successor · attention
ImageNet replaced CIFAR as the canonical vision benchmark when GPU compute made large-scale image classification practical. AlexNet's 2012 ImageNet win effectively ended CIFAR's era as the frontier benchmark.
Pascal VOC → COCO · direct successor · attention
COCO superseded Pascal VOC with 4× more classes, denser annotations, and the stricter AP averaged across IoU thresholds (0.5:0.95). Detection papers stopped citing VOC as the primary result almost immediately.
ImageNet → COCO · scope shift
ImageNet covers classification; COCO extended the frontier to object detection and instance segmentation. Successive benchmarks for successive vision tasks, not the same task at higher difficulty.
COCO → ADE20K · scope shift
ADE20K extends from object instances to dense scene-level semantic parsing with finer category resolution. Semantic segmentation researchers adopted ADE20K as the standard while detection researchers stayed on COCO.
COCO → SA-1B · scope shift · attention
SA-1B moves from fixed-category closed-set detection to open-vocabulary promptable segmentation. The task boundary shifted from 'detect known categories' to 'segment anything a user points at'. SA-1B's scale (1B masks) dwarfs COCO's 2.5M annotations, and SAM is the reference model that established SOTA on the benchmark.
SA-1B → SA-V · direct successor · attention
SA-V extends the promptable-segmentation benchmark from still images to video — a click on frame 0 must propagate through the rest of the clip. Released alongside SAM 2 (the reference model). The current frontier benchmark for zero-shot visual segmentation across both image and video domains.
§ 02 · Benchmarks in this lineage

Nodes in detail.

Jan 2005 · Superseded

Pascal VOC

PASCAL Visual Object Classes

20-class detection and segmentation challenge. The dominant benchmark before COCO. Introduced mAP at IoU=0.5 as the standard detection metric. Superseded by COCO's larger class vocabulary and multi-object density.

Everingham et al. · paper
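VOC's matching rule is worth seeing concretely: a detection counts as a true positive only if its IoU with a not-yet-claimed ground-truth box is at least 0.5. A minimal sketch, assuming (x1, y1, x2, y2) box tuples; the function names are ours, not the VOC dev kit's:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def match_detections(preds, gts, iou_thresh=0.5):
    """VOC-style greedy matching: walk predictions in descending confidence;
    each ground-truth box may be claimed by at most one prediction."""
    claimed, is_tp = set(), []
    for box, score in sorted(preds, key=lambda p: -p[1]):    # preds: [(box, score), ...]
        ious = [(box_iou(box, gt), i) for i, gt in enumerate(gts) if i not in claimed]
        best_iou, best_gt = max(ious, default=(0.0, None))
        if best_iou >= iou_thresh:
            claimed.add(best_gt)
            is_tp.append(True)
        else:
            is_tp.append(False)
    return is_tp  # feeds the precision/recall curve behind AP
```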
Jan 2009 · Saturated

CIFAR-10/100

CIFAR-10 and CIFAR-100

60,000 32×32 images in 10 or 100 classes. The standard small-scale benchmark through the 2010s. CIFAR-10 top-1 accuracy reached 99%+ before ImageNet-scale models were routine.

Krizhevsky, Hinton (Toronto) · paper
Jun 2009 · Saturated

ImageNet

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

1.28M training images across 1,000 classes. AlexNet's 2012 win on ILSVRC launched the deep-learning era. Top-1 accuracy reached ~91–92% by 2022, and models surpassed estimated human-level top-5 accuracy (~95%) back in 2015. Still cited as a pretraining quality proxy, but no longer used as a frontier discriminator.

Deng, Fei-Fei et al. (Stanford / Princeton) · paper
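Top-1 and top-5 accuracy are simple argmax-agreement statistics over the 50K-image validation set; a minimal NumPy sketch with stand-in arrays (names and shapes are illustrative):

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(logits, axis=1)[:, -k:]               # (N, k) top-k class indices
    return float(np.mean(np.any(topk == labels[:, None], axis=1)))

logits = np.random.randn(50_000, 1_000)                     # stand-in for model outputs on val
labels = np.random.randint(0, 1_000, size=50_000)
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```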
May 2014 · Saturating

COCO

MS COCO: Common Objects in Context

328K images, 80 object categories, 2.5M instance annotations. Box detection, instance segmentation, and keypoint tracks. Still the standard detection benchmark; every major detection model cites COCO test-dev mAP. Common-category performance is saturating but tail-class and dense-scene detection still discriminates.

Lin et al. (Microsoft) · paper
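In practice, the COCO number every paper reports comes out of pycocotools, which averages AP over the ten IoU thresholds 0.50:0.05:0.95. A minimal sketch, assuming ground truth and detections in the standard COCO JSON formats (file paths are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")       # ground-truth annotation file
coco_dt = coco_gt.loadRes("detections.json")   # detections in COCO result format
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # first printed line is AP @ IoU=0.50:0.95, the headline mAP
```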

Aug 2016

ADE20K

ADE20K Scene Parsing Benchmark

20,000 images densely labelled with 150 semantic categories. The dominant semantic segmentation benchmark. mIoU on val is the standard metric. Complementary to COCO — finer per-pixel labelling, more categories, scenes rather than objects.

Zhou et al. (MIT CSAIL) · paper
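mIoU is per-class intersection-over-union computed from a global confusion matrix and averaged over the 150 categories; a minimal NumPy sketch (the function name and the ignore-label convention are ours):

```python
import numpy as np

def mean_iou(pred, gt, num_classes=150, ignore_index=255):
    """mIoU from two flat integer label maps, via a global confusion matrix."""
    valid = gt != ignore_index                               # pixels with a real label
    conf = np.bincount(
        num_classes * gt[valid] + pred[valid],
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)                      # rows: gt, cols: pred
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return float(iou[union > 0].mean())                      # average over classes present
```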
Apr 2023 · Active

SA-1B

Segment Anything 1B (image-segmentation benchmark + dataset)

11M images, 1B masks — the largest segmentation dataset ever released. The benchmark introduced the promptable-segmentation task definition: given an image and a click/box/text prompt, produce a mask. SAM is the reference model Meta shipped alongside it and remains the SOTA anchor; the benchmark is the test split + the task itself.

Kirillov et al. (Meta AI) · paper
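The promptable loop itself is short. A sketch using the segment_anything package Meta released with SAM; the image path, click coordinates, and checkpoint file are placeholders:

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

image = np.array(Image.open("photo.jpg").convert("RGB"))  # placeholder image

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)                 # embeds the image once; prompts are then cheap

masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),   # one foreground click, (x, y) in pixels
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    multimask_output=True,                 # three candidate masks for an ambiguous click
)
best_mask = masks[np.argmax(scores)]       # (H, W) bool mask with highest predicted quality
```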
Jul 2024 · Active

SA-V

Segment Anything in Video (SA-V benchmark)

51K videos with mask annotations — released alongside the SAM 2 model. Evaluates promptable video object segmentation: a click on frame 0 must propagate accurately through the rest of the clip. Results are typically reported alongside DAVIS and YouTube-VOS for cross-comparison. SAM 2 is the reference model; SA-V is the benchmark that defines the task.

Ravi et al. (Meta AI) · paper
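SA-V scoring centers on J&F. The J half is just per-frame mask IoU averaged over the clip; a minimal sketch (the helper name and the empty-vs-empty convention are ours, and the F boundary measure, which needs contour matching, is omitted):

```python
import numpy as np

def region_j(pred_masks, gt_masks):
    """J (region similarity): mean per-frame IoU between predicted and
    ground-truth binary masks for a single object track."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):               # one (H, W) bool mask per frame
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        ious.append(inter / union if union else 1.0)         # empty-vs-empty scored as perfect
    return float(np.mean(ious))
```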