Codesota · Vision · Vol. II · Use-case router over the vision ontology · Issue: April 22, 2026
§ 00 · Vision

Vision task router

Pick the visual output you need, then compare models by quality, latency and deployment fit.

Classify
cat · product · defect
Detect
box + label + confidence
Segment
mask / pixels
Retrieve
nearest visual matches
Geometry
depth / pose / motion
Track
object id over time
Registry slices · Live
§ 01 · Model selection

Start with the use case, then the benchmark.

The ontology unit is not a dataset row. It is the workflow capability being evaluated: recognize, localize, segment, retrieve, estimate geometry or track across time.

Each row keeps three choices visible: academic ceiling, production tradeoff and edge/economy deployment.

Use case | Primitive | SOTA | Premium | Economy / edge
Detect objects | localize / detect | Co-DETR · COCO ceiling | YOLO11x / RT-DETR | YOLO11n/s · Hailo / Jetson class
Segment pixels or objects | segment | SegGPT / Mask2Former | SAM 2 / Mask DINO | MobileSAM / FastSAM-style
Classify images | recognize | CoCa · ImageNet-1K | EVA-02 / ConvNeXt | MobileNetV4 / EfficientViT
Search images | retrieve / embed | SigLIP-class image-text | OpenCLIP + domain eval | MobileCLIP-style
Estimate scene geometry | estimate geometry | Depth Anything / stereo stacks | metric depth + calibration | task-specific depth head
Track video objects | track / video | SAM 2 video / MOT stacks | detector + tracker pipeline | ByteTrack / lightweight MOT
Fig 1 · Router view. Raw benchmark rows remain below; this table is the buyer surface for choosing a model tier.
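The router table in Fig 1 can be read as a lookup keyed by use case and tier. The following is a minimal sketch of that idea, not a Codesota API: the `ROUTER` dict and the `pick` helper are hypothetical names, and the entries are copied from three rows of Fig 1 for brevity.

```python
# Hypothetical in-memory copy of part of the Fig 1 router table.
# Keys and helper names are illustrative only, not a Codesota API.
ROUTER = {
    "detect objects": {
        "primitive": "localize / detect",
        "sota": "Co-DETR",
        "premium": "YOLO11x / RT-DETR",
        "economy": "YOLO11n/s",
    },
    "classify images": {
        "primitive": "recognize",
        "sota": "CoCa",
        "premium": "EVA-02 / ConvNeXt",
        "economy": "MobileNetV4 / EfficientViT",
    },
    "search images": {
        "primitive": "retrieve / embed",
        "sota": "SigLIP-class image-text",
        "premium": "OpenCLIP + domain eval",
        "economy": "MobileCLIP-style",
    },
}

TIERS = ("sota", "premium", "economy")


def pick(use_case: str, tier: str = "premium") -> str:
    """Return the model family listed for a use case at the given tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    row = ROUTER[use_case.strip().lower()]  # KeyError if the use case is not listed
    return row[tier]


print(pick("Detect objects"))                   # premium tier by default
print(pick("Classify images", tier="economy"))  # edge tier
```

Defaulting to the premium tier reflects the page's framing that the SOTA column tracks the academic ceiling rather than the usual production choice; a different default is equally defensible.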
§ 02 · Classification

Image classification, ranked.

ImageNet top-1 remains the canonical classifier benchmark, but it is now a saturated ceiling metric. Codesota should show the absolute ImageNet leader separately from the open deployable classifier slice, then push production decisions toward robustness, latency and hardware fit.


Metric
Top-1 · higher is better
Models
6 shown
Dataset
ImageNet
April 2026
Shaded row marks current SOTA
# | Model | Org | Licence | Dataset | Score | Unit | Note
01 | CoCa (finetuned) | Google | research | ImageNet-1K | 91.0 | % | absolute top-1
02 | EVA-02-L | BAAI | open | ImageNet-1K | 90.1 | % | open ViT-L-class
03 | ConvNeXt-V2 (H) | Meta | open | ImageNet | 88.9 | % | top-1
04 | EfficientNetV2-XL | Google | open | ImageNet | 87.3 | % | top-1
05 | ViT-L/14 | Google | open | ImageNet | 85.3 | % | top-1
06 | MobileNetV4-L | Google | open | ImageNet | 83.4 | % | top-1 · edge
Fig 2 · ImageNet top-1 at named registry slices. CoCa is the absolute headline row; EVA-02 and ConvNeXt are clearer production candidates.
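Showing the absolute leader separately from the open deployable slice amounts to two views over the same rows. A minimal sketch, assuming a hypothetical `Row` record and helper names; the scores and licence labels are the ones listed in Fig 2.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    licence: str  # "open" or "research", as labelled in Fig 2
    top1: float   # ImageNet top-1, percent


# Rows copied from Fig 2.
ROWS = [
    Row("CoCa (finetuned)", "research", 91.0),
    Row("EVA-02-L", "open", 90.1),
    Row("ConvNeXt-V2 (H)", "open", 88.9),
    Row("EfficientNetV2-XL", "open", 87.3),
    Row("ViT-L/14", "open", 85.3),
    Row("MobileNetV4-L", "open", 83.4),
]


def headline(rows):
    """Absolute leader, regardless of licence."""
    return max(rows, key=lambda r: r.top1)


def open_slice(rows):
    """Openly licensed rows only, best first."""
    return sorted(
        (r for r in rows if r.licence == "open"),
        key=lambda r: r.top1,
        reverse=True,
    )


print("headline:", headline(ROWS).model)
print("open slice:", [r.model for r in open_slice(ROWS)])
```

Robustness, latency and hardware fit are not in Fig 2, so this sketch stops at the licence filter; a production shortlist would need those extra columns.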
§ 03 · Detection

Object detection, ranked.

COCO box AP is only comparable when the split, pretraining, multiscale testing and ensemble status are explicit. Co-DETR-style detectors hold the academic ceiling; YOLO11 and RT-DETR own the production real-time band.


Metric
mAP · higher is better
Models
5 shown
Dataset
COCO
April 2026
Shaded row marks current SOTA
# | Model | Org | Licence | Dataset | Score | Unit | Note
01 | Co-DETR | SenseTime | open | COCO test-dev | 66.0 | mAP | box AP · strong config
02 | DINO | IDEA Research | open | COCO | 63.3 | mAP | box AP · config-dependent
03 | RT-DETR-X | Baidu | open | COCO | 54.8 | mAP | real-time
04 | YOLO11x | Ultralytics | open | COCO | 54.7 | mAP | real-time
05 | YOLOv8x | Ultralytics | open | COCO | 53.9 | mAP | real-time
Fig 3 · COCO scores are config-sensitive. Use the AP ceiling to track research progress, then choose from the real-time band for production.
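The comparability rule stated at the top of this section (split, pretraining, multiscale testing and ensemble status must all be explicit) can be sketched as a config record that two scores must share before they are ranked together. The `EvalConfig` type, its field names and the example values below are hypothetical; they are not the recorded settings of any model in Fig 3.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalConfig:
    """Evaluation settings that must match before two COCO box AP scores are compared."""
    split: str               # e.g. "test-dev" or "val2017"
    extra_pretraining: bool  # extra detection pretraining data used
    multiscale_test: bool    # multiscale / test-time augmentation
    ensemble: bool           # more than one model contributes to the score


def mismatches(a: EvalConfig, b: EvalConfig) -> list:
    """Names of the settings on which the two configs differ."""
    return [f.name for f in fields(EvalConfig) if getattr(a, f.name) != getattr(b, f.name)]


def comparable(a: EvalConfig, b: EvalConfig) -> bool:
    """Two scores are directly comparable only if every setting matches."""
    return not mismatches(a, b)


# Illustrative values only (assumptions, not taken from the registry).
ceiling_run = EvalConfig("test-dev", extra_pretraining=True, multiscale_test=True, ensemble=False)
realtime_run = EvalConfig("val2017", extra_pretraining=False, multiscale_test=False, ensemble=False)

print(comparable(ceiling_run, realtime_run))
print(mismatches(ceiling_run, realtime_run))
```

Returning the list of differing fields, rather than only a boolean, makes it possible to annotate a row with the reason it sits outside a ranking.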
§ 04 · Segmentation

Segmentation families, separated.

ADE20K semantic mIoU, COCO instance mask AP and promptable SAM-style segmentation answer different questions. Keep them in one orientation table, but do not pretend they are one leaderboard.


Metric
mIoU · AP
Models
5 shown
Dataset
ADE20K · COCO
April 2026
Shaded row marks current SOTA
# | Model | Org | Licence | Dataset | Score | Unit | Note
01 | SegGPT | BAAI | open | ADE20K | 62.6 | mIoU | semantic
02 | Mask2Former | Meta | open | ADE20K | 57.7 | mIoU | semantic
03 | SegFormer-B5 | NVIDIA | open | ADE20K | 51.8 | mIoU | semantic · efficient
04 | Mask DINO | IDEA Research | open | COCO | 50.9 | AP | instance
05 | SAM 2 | Meta | open | zero-shot | n/a | n/a | prompt-based; no single mIoU
Fig 4 · SAM 2 is an unscored production pick here because promptable segmentation is not the same task as ADE20K semantic mIoU or COCO mask AP.
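To make the distinction concrete, here is a minimal sketch of semantic mIoU over flattened per-pixel class labels. It assumes a single toy image and a hypothetical `mean_iou` helper; benchmark implementations typically accumulate intersections and unions per class over the whole validation set before averaging. The metric needs a class id for every pixel, which a promptable mask does not supply on its own, so a model like SAM 2 has no natural entry in this column.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target.

    pred and target are equal-length sequences of integer class ids,
    one per pixel (a flattened label map).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either map")
    return sum(ious) / len(ious)


# Toy 4-pixel example: one pixel of class 1 is mislabelled as class 0.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(round(mean_iou(pred, target, num_classes=2), 4))
```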
§ 05 · Adjacent vertical

Document AI is not a primitive CV task.

OCR and document parsing belong in the Vision & Documents area, but the model selection problem is different: text OCR, layout detection, table structure, formula recognition, reading order and document VQA are all active at once.

OCR

recognize text lines, handwriting, scene text

Layout

detect blocks, figures, reading order and captions

Tables

recover cells, spans, headers and structure

Formulas

recognize math and scientific notation

Document VQA

answer over page images with evidence

End-to-end parsing

combine all subtasks into Markdown, HTML or JSON

Fig 5 · OmniDocBench and DocVQA stay tracked, but as a document-AI stack rather than a sibling of classification, detection and segmentation.
§ 06 · Retrieval

Image embeddings, ranked.

Retrieval is the piece of CV that leaves the lab as infrastructure. CLIP-class models vectorise images so they can be searched with text or compared with other images. Flickr30k R@1 is a research metric; product search still needs catalog recall@K and false-positive cost.


DINOv2 is not in this table because it is a self-supervised visual feature extractor, not a text-image retriever. Track it under visual features, localization transfer or segmentation transfer.

Image-text retrieval · Flickr30k
Shaded row marks current SOTA
# | Model | Org | Licence | Dataset | R@1 | Metric
01 | SigLIP | Google | open | Flickr30k | 97.1 | R@1
02 | OpenCLIP ViT-G/14 | LAION | open | Flickr30k | 94.4 | R@1
03 | CLIP ViT-L/14 | OpenAI | open | Flickr30k | 87.4 | R@1
Fig 6 · Flickr30k R@1 text-to-image retrieval. SigLIP uses the sigmoid loss variant; OpenCLIP is LAION's community reproduction of OpenAI CLIP.
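Recall@K, the metric behind both Flickr30k R@1 and a catalog-level check, can be sketched in a few lines. This is a toy illustration under stated assumptions: the two-dimensional vectors stand in for real embeddings, and `cosine` and `recall_at_k` are hypothetical helper names, not part of any model's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(queries, gallery, relevant, k=1):
    """Fraction of queries with at least one relevant gallery item in the top k.

    queries:  list of query embeddings (e.g. text)
    gallery:  list of gallery embeddings (e.g. catalog images)
    relevant: list of sets of gallery indices, one set per query
    """
    hits = 0
    for q, rel in zip(queries, relevant):
        ranked = sorted(
            range(len(gallery)),
            key=lambda i: cosine(q, gallery[i]),
            reverse=True,
        )
        if rel & set(ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy data: the second query's nearest item is a near-duplicate, not the target.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.6, 0.8]]
relevant = [{0}, {1}]

print(recall_at_k(queries, gallery, relevant, k=1))
print(recall_at_k(queries, gallery, relevant, k=2))
```

The gap between the k=1 and k=2 results on the toy data is the kind of effect a domain recall@K evaluation is meant to surface: a near-duplicate in the catalog can displace the correct item from the top slot without the model being badly wrong.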
§ 07 · Benchmarks

The datasets we believe.

Canonical for each task family. ImageNet, COCO and ADE20K are primitive CV anchors; DocVQA and OmniDocBench are adjacent document-AI anchors that should route to the OCR register.

Rows with a mark live in the registry and carry full lineage.

Benchmark | Scope | Primary metric | Year | Source
ImageNet | Image classification | Top-1 % | 2009 | link →
COCO | Detection · segmentation · captions | mAP · AP · CIDEr | 2014 | link →
ADE20K | Semantic segmentation | mIoU | 2017 | link →
Cityscapes | Urban scene segmentation | mIoU | 2016 | link →
DocVQA | Adjacent: document VQA | ANLS | 2021 | link →
OmniDocBench | Adjacent: document parsing stack | composite | 2025 | link →
NYU Depth V2 | Monocular depth | AbsRel | 2012 | link →
KITTI | Driving · depth · detection | AbsRel · AP | 2012 | link →
Flickr30k | Image-text retrieval | R@1 | 2014 | link →
VQAv2 | Visual question answering | Accuracy | 2017 | link →
IAM | Handwriting recognition | CER | 2002 | link →
Fig 7 · Solid marker = canonicalised in the Codesota registry. Hollow marker = widely cited, tracked qualitatively, not yet graded.
§ 08 · Ontology

Capability first, benchmark second.

The stable top-level area is Vision & Documents, but the user-facing route should not flatten everything into one table. Primitive CV asks whether a model can recognize, localize, segment, retrieve, estimate geometry or track. Vertical stacks such as document AI, medical imaging and industrial inspection combine several primitives and need their own benchmark envelopes.

Capability | Tasks | Benchmark envelope
recognize | image classification | ImageNet-1K, ImageNet-R, ObjectNet
localize | object detection, keypoints | COCO, LVIS, KITTI, aerial/small-object sets
segment | semantic, instance, panoptic, promptable | ADE20K, COCO mask AP, SA-V, DAVIS
retrieve | image-text retrieval, image-image search | Flickr30k, MSCOCO retrieval, catalog recall@K
estimate geometry | depth, optical flow, pose | NYU Depth V2, KITTI, Sintel, MPII
track | multi-object tracking, video segmentation | MOTChallenge, DAVIS, MOSE
parse documents | OCR, layout, tables, formulas, reading order | OmniDocBench, DocVQA, IAM · routed to OCR
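One way to keep the capability-first route from flattening into a single table is to store each capability with its benchmark envelope and the register that owns it. The sketch below is an assumption about how that could look, not the registry's actual schema: `Capability`, `ONTOLOGY` and `route` are hypothetical names, and the register labels "vision" and "ocr" are illustrative. The benchmark lists are copied from the table above.

```python
from typing import NamedTuple


class Capability(NamedTuple):
    benchmarks: tuple  # benchmark envelope, as listed in the table above
    register: str      # which register owns the comparison (illustrative label)


# Hypothetical encoding of the capability table.
ONTOLOGY = {
    "recognize": Capability(("ImageNet-1K", "ImageNet-R", "ObjectNet"), "vision"),
    "localize": Capability(("COCO", "LVIS", "KITTI", "aerial/small-object sets"), "vision"),
    "segment": Capability(("ADE20K", "COCO mask AP", "SA-V", "DAVIS"), "vision"),
    "retrieve": Capability(("Flickr30k", "MSCOCO retrieval", "catalog recall@K"), "vision"),
    "estimate geometry": Capability(("NYU Depth V2", "KITTI", "Sintel", "MPII"), "vision"),
    "track": Capability(("MOTChallenge", "DAVIS", "MOSE"), "vision"),
    "parse documents": Capability(("OmniDocBench", "DocVQA", "IAM"), "ocr"),
}


def route(capability: str):
    """Return (register, benchmark envelope) for a capability name."""
    entry = ONTOLOGY[capability.strip().lower()]  # KeyError if the capability is unknown
    return entry.register, entry.benchmarks


print(route("segment"))
print(route("parse documents"))
```

Keeping the register on the capability, rather than on each benchmark, means a benchmark such as KITTI can appear in more than one envelope without being assigned to a single leaderboard.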

Benchmark traps

ImageNet
Saturated headline split; use robustness and deployment slices before choosing a production classifier.
COCO
Scores shift with split, pretraining, multiscale testing, ensembling, and latency constraints.
ADE20K
Semantic mIoU is not comparable to instance masks or promptable segmentation.
Flickr30k
Useful for research retrieval, weak as a proxy for enterprise visual search without domain recall@K.
OmniDocBench
A document parsing stack benchmark, not a primitive computer-vision leaderboard.
§ 09 · Adjacent reads

Deeper, by task.

When the top-level ranking isn’t enough: the per-task register, the building block, the cost-vs-quality comparison pages.

Fig 9 · Each page has its own evidence surface; these are editorial reads, not benchmark duplicates.
Related

Neighbouring registers.

Other modality hubs on Codesota worth reading next.

OCR · register
Document understanding and text extraction.
LLM · register
Frontier language-model benchmarks.
Speech · register
Speech-to-text and text-to-speech, both directions.
All tasks
Every benchmark the registry carries.
Methodology
How scores are sourced, graded, and dated.
Product roadmap
The smart-router thesis: one API, every task, three tiers.