The ontology unit is not a dataset row. It is the workflow capability being evaluated: recognize, localize, segment, retrieve, estimate geometry or track across time.

Each row keeps three choices visible: the academic ceiling (SOTA), the production tradeoff (Premium) and the economy/edge deployment (Economy / edge).

Use case	Primitive	SOTA	Premium	Economy / edge
Detect objects	localize / detect	Co-DETR · COCO ceiling	YOLO11x / RT-DETR	YOLO11n/s · Hailo / Jetson class
Segment pixels or objects	segment	SegGPT / Mask2Former	SAM 2 / Mask DINO	MobileSAM / FastSAM-style
Classify images	recognize	CoCa · ImageNet-1K	EVA-02-L / ConvNeXt-V2	MobileNetV4 / EfficientViT
Search images	retrieve / embed	SigLIP-class image-text	OpenCLIP + domain eval	MobileCLIP-style
Estimate scene geometry	estimate geometry	Depth Anything / stereo stacks	metric depth + calibration	task-specific depth head
Track video objects	track / video	SAM 2 video / MOT stacks	detector + tracker pipeline	ByteTrack / lightweight MOT

Fig 1 · Router view. Raw benchmark rows remain below; this table is the buyer surface for choosing a model tier.
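The router view can be read as a small lookup structure. The sketch below is a minimal illustration, not Codesota's implementation: the names `Route`, `ROUTER` and `pick` are hypothetical, and only three of the six rows are copied in to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    primitive: str
    sota: str      # academic ceiling
    premium: str   # production tradeoff
    economy: str   # economy / edge deployment


# Three rows copied from the router table; the remaining rows follow the same shape.
ROUTER = {
    "Detect objects": Route(
        "localize / detect",
        "Co-DETR · COCO ceiling",
        "YOLO11x / RT-DETR",
        "YOLO11n/s · Hailo / Jetson class",
    ),
    "Classify images": Route(
        "recognize",
        "CoCa · ImageNet-1K",
        "EVA-02-L / ConvNeXt-V2",
        "MobileNetV4 / EfficientViT",
    ),
    "Search images": Route(
        "retrieve / embed",
        "SigLIP-class image-text",
        "OpenCLIP + domain eval",
        "MobileCLIP-style",
    ),
}

TIERS = ("sota", "premium", "economy")


def pick(use_case: str, tier: str) -> str:
    """Return the table entry for a use case and tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return getattr(ROUTER[use_case], tier)


print(pick("Detect objects", "premium"))
```

Keying on the use case rather than on a benchmark name keeps the lookup aligned with the capability-first framing: the benchmark only appears inside the SOTA cell.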

§ 02 · Classification

Image classification, ranked.

ImageNet top-1 remains the canonical classifier benchmark, but it is now a saturated ceiling metric. Codesota should show the absolute ImageNet leader separately from the open deployable classifier slice, then push production decisions toward robustness, latency and hardware fit.

Metric: Top-1 · higher is better
Models: 6 shown
Dataset: ImageNet

April 2026

Shaded row marks current SOTA

#	Model	Org	Licence	Dataset	Score	Unit	Note
01	CoCa (finetuned)	Google	research	ImageNet-1K	91.0	%	absolute top-1
02	EVA-02-L	BAAI	open	ImageNet-1K	90.1	%	open ViT-L-class
03	ConvNeXt-V2 (H)	Meta	open	ImageNet	88.9	%	top-1
04	EfficientNetV2-XL	Google	open	ImageNet	87.3	%	top-1
05	ViT-L/14	Google	open	ImageNet	85.3	%	top-1
06	MobileNetV4-L	Google	open	ImageNet	83.4	%	top-1 · edge

Fig 2 · ImageNet top-1 at named registry slices. CoCa is the absolute headline row; EVA-02 and ConvNeXt are clearer production candidates.
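Separating the absolute leader from the open deployable slice amounts to one filter on the licence column. The following is a minimal sketch over the rows of the table above; the function names `absolute_leader` and `open_slice` are hypothetical, and the tuple layout is an assumption made for illustration.

```python
# (model, licence, top-1 %) copied from the ImageNet table above.
ROWS = [
    ("CoCa (finetuned)", "research", 91.0),
    ("EVA-02-L", "open", 90.1),
    ("ConvNeXt-V2 (H)", "open", 88.9),
    ("EfficientNetV2-XL", "open", 87.3),
    ("ViT-L/14", "open", 85.3),
    ("MobileNetV4-L", "open", 83.4),
]


def absolute_leader(rows):
    """Highest top-1 score regardless of licence."""
    return max(rows, key=lambda row: row[2])


def open_slice(rows):
    """Open-licence rows only, best score first."""
    open_rows = [row for row in rows if row[1] == "open"]
    return sorted(open_rows, key=lambda row: row[2], reverse=True)


print(absolute_leader(ROWS)[0])
print([model for model, _, _ in open_slice(ROWS)])
```

The two views answer different questions, so a page can show the headline row and the open slice side by side without one hiding the other.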

§ 03 · Detection

Object detection, ranked.

COCO box AP is only comparable when the split, pretraining, multiscale testing and ensemble status are explicit. Co-DETR-style detectors hold the academic ceiling; YOLO11 and RT-DETR own the production real-time band.

Metric: mAP · higher is better
Models: 5 shown
Dataset: COCO

April 2026

Shaded row marks current SOTA

#	Model	Org	Licence	Dataset	Score	Unit	Note
01	Co-DETR	SenseTime	open	COCO test-dev	66.0	mAP	box AP · strong config
02	DINO	IDEA Research	open	COCO	63.3	mAP	box AP · config-dependent
03	RT-DETR-X	Baidu	open	COCO	54.8	mAP	real-time
04	YOLO11x	Ultralytics	open	COCO	54.7	mAP	real-time
05	YOLOv8x	Ultralytics	open	COCO	53.9	mAP	real-time

Fig 3 · COCO scores are config-sensitive. Use the AP ceiling to track research progress, then choose from the real-time band for production.
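One way to enforce the comparability rule is to store the four config fields next to every score and refuse to rank entries whose fields are missing or different. This is a hedged sketch: `EvalConfig`, `is_explicit` and `comparable` are hypothetical names, and the field values in the usage lines are placeholders, not claims about any model in the table.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class EvalConfig:
    # None means "not stated by the source", which blocks comparison.
    split: Optional[str] = None
    pretraining: Optional[str] = None
    multiscale_test: Optional[bool] = None
    ensemble: Optional[bool] = None


def is_explicit(cfg: EvalConfig) -> bool:
    """True only when every config field has been stated."""
    return all(getattr(cfg, f.name) is not None for f in fields(cfg))


def comparable(a: EvalConfig, b: EvalConfig) -> bool:
    """Two scores are comparable when both configs are explicit and identical."""
    return is_explicit(a) and is_explicit(b) and a == b


# Placeholder values for illustration only.
entry_a = EvalConfig("test-dev", "extra-data", True, False)
entry_b = EvalConfig("test-dev", "extra-data", True, False)
entry_c = EvalConfig(split="val")  # other fields unknown

print(comparable(entry_a, entry_b))
print(comparable(entry_a, entry_c))
```

Treating an unknown field as a blocker is deliberately strict: an entry with an unstated config is never ranked against anything, including itself.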

§ 04 · Segmentation

Segmentation families, separated.

ADE20K semantic mIoU, COCO instance mask AP and promptable SAM-style segmentation answer different questions. Keep them in one orientation table, but do not pretend they are one leaderboard.

Metric: mIoU · AP
Models: 5 shown
Dataset: ADE20K · COCO

April 2026

Shaded row marks current SOTA

#	Model	Org	Licence	Dataset	Score	Unit	Note
01	SegGPT	BAAI	open	ADE20K	62.6	mIoU	semantic
02	Mask2Former	Meta	open	ADE20K	57.7	mIoU	semantic
03	SegFormer-B5	NVIDIA	open	ADE20K	51.8	mIoU	semantic · efficient
04	Mask DINO	IDEA Research	open	COCO	50.9	AP	instance
05	SAM 2	Meta	open	zero-shot	—	—	prompt-based; no single mIoU

Fig 4 · SAM 2 is an unscored production pick here because promptable segmentation is not the same task as ADE20K semantic mIoU or COCO mask AP.
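Keeping the families in one table while ranking them separately can be sketched as grouping by (dataset, unit) and setting unscored rows aside. The rows below are copied from the table above; the function name `leaderboards` and the use of `None` for the unscored SAM 2 row are assumptions made for this illustration.

```python
from collections import defaultdict

# (model, dataset, score, unit) copied from the segmentation table above.
ROWS = [
    ("SegGPT", "ADE20K", 62.6, "mIoU"),
    ("Mask2Former", "ADE20K", 57.7, "mIoU"),
    ("SegFormer-B5", "ADE20K", 51.8, "mIoU"),
    ("Mask DINO", "COCO", 50.9, "AP"),
    ("SAM 2", "zero-shot", None, None),  # promptable: no single score
]


def leaderboards(rows):
    """Split rows into one ranked list per (dataset, unit) plus unscored models."""
    boards = defaultdict(list)
    unscored = []
    for model, dataset, score, unit in rows:
        if score is None:
            unscored.append(model)
        else:
            boards[(dataset, unit)].append((model, score))
    for entries in boards.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(boards), unscored


boards, unscored = leaderboards(ROWS)
for key, entries in boards.items():
    print(key, entries)
print("unscored:", unscored)
```

Ranking happens only inside each group, so an mIoU value is never sorted against a mask AP value.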

§ 05 · Adjacent vertical

Document AI is not a primitive CV task.

OCR and document parsing belong in the Vision & Documents area, but the model selection problem is different: text OCR, layout detection, table structure, formula recognition, reading order and document VQA are all active at once.

Subtask	Scope
OCR	recognize text lines, handwriting, scene text
Layout	detect blocks, figures, reading order and captions
Tables	recover cells, spans, headers and structure
Formulas	recognize math and scientific notation
Document VQA	answer over page images with evidence
End-to-end parsing	combine all subtasks into Markdown, HTML or JSON

Fig 5 · OmniDocBench and DocVQA stay tracked, but as a document-AI stack rather than a sibling of classification, detection and segmentation.

§ 06 · Retrieval

Image embeddings, ranked.

Retrieval is the piece of CV that leaves the lab as infrastructure. CLIP-class models vectorise images so they can be searched with text or compared with other images. Flickr30k R@1 is a research metric; product search still needs catalog recall@K and an explicit false-positive cost.

DINOv2 is not in this table because it is a self-supervised visual feature extractor, not a text-image retriever. Track it under visual features, localization transfer or segmentation transfer.

Image-text retrieval · Flickr30k

Shaded row marks current SOTA

#	Model	Org	Licence	Dataset	Score	Metric
01	SigLIP	Google	open	Flickr30k	97.1	R@1
02	OpenCLIP ViT-G/14	LAION	open	Flickr30k	94.4	R@1
03	CLIP ViT-L/14	OpenAI	open	Flickr30k	87.4	R@1

Fig 6 · Flickr30k R@1 text-to-image retrieval. SigLIP uses the sigmoid loss variant; OpenCLIP is LAION's community reproduction of OpenAI CLIP.
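Recall@K for a catalog can be measured with a few lines once ranked results and relevance labels exist. The sketch below uses one common definition (the fraction of queries with at least one relevant item in the top K); the function name `recall_at_k` and the toy item ids are hypothetical and do not come from Flickr30k.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant item among the top k results.

    ranked_ids:   one ranked list of item ids per query, best match first
    relevant_ids: one set of correct item ids per query
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one relevance set per query")
    if not ranked_ids:
        raise ValueError("no queries given")
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)


# Toy data: three queries against a made-up catalog.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"f"}, {"z"}]

print(recall_at_k(ranked, relevant, 1))
print(recall_at_k(ranked, relevant, 3))
```

Running the same function on a held-out slice of the real catalog gives the domain number that Flickr30k R@1 cannot provide.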

§ 07 · Benchmarks

The datasets we believe.

These are the canonical benchmarks for each task family. ImageNet, COCO and ADE20K are primitive CV anchors; DocVQA and OmniDocBench are adjacent document-AI anchors that should route to the OCR register.

Rows with a mark live in the registry and carry full lineage.

Benchmark	Scope	Primary metric	Year	Source
ImageNet	Image classification	Top-1 %	2009	link →
COCO	Detection · segmentation · captions	mAP · AP · CIDEr	2014	link →
ADE20K	Semantic segmentation	mIoU	2017	link →
Cityscapes	Urban scene segmentation	mIoU	2016	link →
DocVQA	Adjacent: document VQA	ANLS	2021	link →
OmniDocBench	Adjacent: document parsing stack	composite	2025	link →
NYU Depth V2	Monocular depth	AbsRel	2012	link →
KITTI	Driving · depth · detection	AbsRel · AP	2012	link →
Flickr30k	Image-text retrieval	R@1	2014	link →
VQAv2	Visual question answering	Accuracy	2017	link →
IAM	Handwriting recognition	CER	2002	link →

Fig 7 · Solid marker = canonicalised in the Codesota registry. Hollow marker = widely cited, tracked qualitatively, not yet graded.

§ 08 · Ontology

Capability first, benchmark second.

The stable top-level area is Vision & Documents, but the user-facing route should not flatten everything into one table. Primitive CV asks whether a model can recognize, localize, segment, retrieve, estimate geometry or track. Vertical stacks such as document AI, medical imaging and industrial inspection combine several primitives and need their own benchmark envelopes.

Capability	Tasks	Benchmark envelope
recognize	image classification	ImageNet-1K, ImageNet-R, ObjectNet
localize	object detection, keypoints	COCO, LVIS, KITTI, aerial/small-object sets
segment	semantic, instance, panoptic, promptable	ADE20K, COCO mask AP, SA-V, DAVIS
retrieve	image-text retrieval, image-image search	Flickr30k, MSCOCO retrieval, catalog recall@K
estimate geometry	depth, optical flow, pose	NYU Depth V2, KITTI, Sintel, MPII
track	multi-object tracking, video segmentation	MOTChallenge, DAVIS, MOSE
parse documents	OCR, layout, tables, formulas, reading order	OmniDocBench, DocVQA, IAM · routed to OCR
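The capability table maps directly onto a task lookup. This is a minimal sketch under stated assumptions: `ONTOLOGY`, `ROUTED` and `resolve` are hypothetical names, and the "routed to OCR" note from the last row is modelled as a separate mapping.

```python
# capability -> (tasks, benchmark envelope), copied from the ontology table.
ONTOLOGY = {
    "recognize": (["image classification"],
                  ["ImageNet-1K", "ImageNet-R", "ObjectNet"]),
    "localize": (["object detection", "keypoints"],
                 ["COCO", "LVIS", "KITTI", "aerial/small-object sets"]),
    "segment": (["semantic", "instance", "panoptic", "promptable"],
                ["ADE20K", "COCO mask AP", "SA-V", "DAVIS"]),
    "retrieve": (["image-text retrieval", "image-image search"],
                 ["Flickr30k", "MSCOCO retrieval", "catalog recall@K"]),
    "estimate geometry": (["depth", "optical flow", "pose"],
                          ["NYU Depth V2", "KITTI", "Sintel", "MPII"]),
    "track": (["multi-object tracking", "video segmentation"],
              ["MOTChallenge", "DAVIS", "MOSE"]),
    "parse documents": (["OCR", "layout", "tables", "formulas", "reading order"],
                        ["OmniDocBench", "DocVQA", "IAM"]),
}

# Vertical stacks that leave the primitive CV tables for their own register.
ROUTED = {"parse documents": "OCR"}


def resolve(task):
    """Return (capability, benchmark envelope, routed register or None) for a task."""
    for capability, (tasks, envelope) in ONTOLOGY.items():
        if task in tasks:
            return capability, envelope, ROUTED.get(capability)
    raise KeyError(task)


print(resolve("object detection"))
print(resolve("tables"))
```

The lookup starts from the task a user names and only then reaches the benchmarks, which mirrors the capability-first ordering of the table.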