Codesota · Vision · Vol. II
The register of image classification, detection, segmentation and docs
Issue: April 22, 2026
§ 00 · Vision

Computer vision, measured.

Pixels in, structure out. The register that carries the oldest leaderboards on Codesota (ImageNet, COCO, ADE20K) and the newest hinge benchmark, OmniDocBench. Five tables, one rhythm: classification, detection, segmentation, document parsing, and retrieval.

Scores shown are the vendor- or paper-reported headline number on the canonical benchmark for each task. Shaded rows mark the current state of the art. Saturating benchmarks are flagged in the methodology note.

§ 01 · Classification

Image classification, ranked.

ImageNet top-1 remains the canonical benchmark. Human top-5 accuracy on the ILSVRC validation set is commonly cited at around 95%; the current open SOTA at 90.0% top-1 is saturating the headline split, and the action has moved to efficiency and robustness variants.


Metric: Top-1, higher is better · Models: 5 shown · Dataset: ImageNet · April 2026
Shaded row marks the current SOTA.

#   Model              Org     Licence  Dataset   Score  Unit  Note
01  EVA-02-L           BAAI    open     ImageNet  90.0   %     top-1
02  ConvNeXt-V2 (H)    Meta    open     ImageNet  88.9   %     top-1
03  EfficientNetV2-XL  Google  open     ImageNet  87.3   %     top-1
04  ViT-L/14           Google  open     ImageNet  85.3   %     top-1
05  MobileNetV4-L      Google  open     ImageNet  83.4   %     top-1 · edge
Fig 1 · ImageNet top-1 at standard resolution. EVA-02-L is the current SOTA; production teams ship EfficientNetV2 or ConvNeXt for the accuracy/latency trade-off.
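
Reproducing a headline top-1 number is mostly evaluation plumbing: single centre crop, standard normalisation, argmax over 1,000 classes. A minimal sketch with torchvision, assuming a local ImageNet-1k validation set in the usual class-per-folder layout; resnet50 stands in for the checkpoints above, which would normally be pulled through timm.

    import torch
    from torchvision import datasets, models, transforms

    # Standard single-crop eval pipeline: resize, centre-crop, normalise.
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Path to a local ImageNet-1k validation set is assumed.
    val = datasets.ImageFolder("imagenet/val", transform=preprocess)
    loader = torch.utils.data.DataLoader(val, batch_size=64, num_workers=4)

    model = models.resnet50(weights="IMAGENET1K_V2").eval()

    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)          # top-1 prediction per image
            correct += (preds == labels).sum().item()
            total += labels.numel()

    print(f"top-1: {100 * correct / total:.1f}%")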
§ 02 · Detection

Object detection, ranked.

COCO box mAP (0.5:0.95) is the canonical benchmark. Transformer-based detectors (Co-DETR, DINO) hold the accuracy ceiling at 63.3; YOLO11 and RT-DETR own the real-time band at ~55 mAP with end-to-end inference under 10 ms.


Metric: mAP, higher is better · Models: 5 shown · Dataset: COCO · April 2026
Shaded row marks the current SOTA.

#   Model      Org            Licence  Dataset  Score  Unit  Note
01  Co-DETR    SenseTime      open     COCO     63.3   mAP   box mAP
02  DINO       IDEA Research  open     COCO     63.3   mAP   box mAP
03  RT-DETR-X  Baidu          open     COCO     54.8   mAP   real-time
04  YOLO11x    Ultralytics    open     COCO     54.7   mAP   real-time
05  YOLOv8x    Ultralytics    open     COCO     53.9   mAP   real-time
Fig 2 · Co-DETR and DINO tie for the SOTA line. For production, latency constraints usually pin the choice to the YOLO11 / RT-DETR tier.
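
The mAP column is the COCO box metric averaged over IoU thresholds 0.50:0.95. A minimal sketch of how that number is produced with pycocotools, assuming detections have already been exported to the standard COCO results JSON (the file paths are placeholders):

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    # Ground-truth annotations plus a detections file in COCO results format:
    # a list of {"image_id", "category_id", "bbox", "score"} entries.
    gt = COCO("annotations/instances_val2017.json")
    dt = gt.loadRes("detections_val2017.json")

    ev = COCOeval(gt, dt, iouType="bbox")
    ev.evaluate()
    ev.accumulate()
    ev.summarize()                                # prints the 12 standard COCO metrics

    print("box mAP (0.50:0.95):", ev.stats[0])    # the headline number in Fig 2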
§ 03 · Segmentation

Semantic & instance segmentation, ranked.

Two canonical benchmarks: ADE20K mIoU for semantic segmentation, COCO mask AP for instance segmentation. SegGPT holds the ADE20K line at 62.6 mIoU; Mask DINO leads COCO instance at 50.9 AP. SAM 2 sits outside these numbers as a prompt-based zero-shot model.


Metric: mIoU / AP · Models: 5 shown · Datasets: ADE20K / COCO · April 2026
Shaded row marks the current SOTA.

#   Model         Org            Licence  Dataset  Score  Unit  Note
01  SegGPT        BAAI           open     ADE20K   62.6   mIoU  semantic
02  Mask2Former   Meta           open     ADE20K   57.7   mIoU  semantic
03  Mask DINO     IDEA Research  open     COCO     50.9   AP    instance
04  SAM 2         Meta           open     —        —      —     zero-shot · prompt-based; no single mIoU
05  SegFormer-B5  NVIDIA         open     ADE20K   51.8   mIoU  semantic · efficient
Fig 3 · SAM 2 is scored with an em-dash rather than a number — it is promptable and not single-shot comparable. Mask DINO owns COCO instance; SegGPT owns ADE20K semantic.
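
mIoU is per-class intersection-over-union averaged over classes, normally accumulated through a confusion matrix across the whole validation set rather than per image. A minimal numpy sketch, assuming 150 ADE20K classes and label maps where ignore pixels carry an out-of-range value:

    import numpy as np

    NUM_CLASSES = 150                                     # ADE20K semantic classes

    def update_confusion(conf, pred, target):
        """Accumulate a NUM_CLASSES x NUM_CLASSES confusion matrix from flat label maps."""
        mask = (target >= 0) & (target < NUM_CLASSES)     # drop ignore pixels
        idx = NUM_CLASSES * target[mask] + pred[mask]
        conf += np.bincount(idx, minlength=NUM_CLASSES ** 2).reshape(NUM_CLASSES, NUM_CLASSES)
        return conf

    def mean_iou(conf):
        """Per-class IoU = TP / (TP + FP + FN), averaged over classes that appear."""
        tp = np.diag(conf)
        fp = conf.sum(axis=0) - tp
        fn = conf.sum(axis=1) - tp
        denom = tp + fp + fn
        iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
        return np.nanmean(iou)

    # conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
    # for pred, target in predictions:                    # HxW label maps per image
    #     conf = update_confusion(conf, pred.ravel(), target.ravel())
    # print("mIoU:", mean_iou(conf))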
§ 04 · Document parsing

Document OCR, ranked.

The newest hinge of the register. Open-source parsers now clear closed-source APIs on OmniDocBench: PaddleOCR-VL-1.5 scores 94.5 composite against GPT-5.4’s 85.8, at roughly two orders of magnitude lower cost per page. Qwen3-VL leads DocVQA at 96.5% ANLS.


Metric: composite / ANLS / CER · Models: 5 shown · Datasets: OmniDocBench / DocVQA / IAM · April 2026
Shaded row marks the current SOTA.

#   Model             Org        Licence  Dataset       Score  Unit       Note
01  PaddleOCR-VL-1.5  Baidu      open     OmniDocBench  94.5   composite  end-to-end
02  Qwen3-VL          Alibaba    open     DocVQA        96.5   ANLS
03  GPT-5.4           OpenAI     closed   DocVQA        92.8   ANLS
04  dots.ocr 3B       Rednote    open     OmniDocBench  88.4   composite
05  TrOCR             Microsoft  open     IAM           2.89   CER %      handwriting · lower is better
Fig 4 · Mixed metrics across three benchmarks. Full table, cost-per-page and decision tools on the OCR register.
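
The three units are not interchangeable: the OmniDocBench composite and ANLS reward higher scores, while CER is an error rate, so lower is better. CER itself is just character edit distance over reference length; a minimal sketch with a plain Levenshtein DP (the example strings are made up):

    def levenshtein(a: str, b: str) -> int:
        """Character-level edit distance: insertions, deletions, substitutions."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def cer(reference: str, hypothesis: str) -> float:
        """Character error rate: edits needed to fix the hypothesis, per reference character."""
        return levenshtein(reference, hypothesis) / max(len(reference), 1)

    print(cer("handwritten note", "handwriten note"))       # one deletion -> 0.0625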
§ 05 · Retrieval

Image embeddings, ranked.

The piece of CV that leaves the lab as infrastructure. CLIP-class models vectorise images so they can be searched with text or with other images. Flickr30k R@1 is the canonical metric.


DINOv2 is listed without a Flickr30k score — it is a self-supervised feature extractor, not a text-image retriever. It’s here because production teams pair it with SigLIP to cover both modes.

Image-text retrieval · Flickr30k
Shaded row marks current SOTA
#   Model              Org     Licence  Dataset    R@1   Metric
01  SigLIP             Google  open     Flickr30k  97.1  R@1
02  OpenCLIP ViT-G/14  LAION   open     Flickr30k  94.4  R@1
03  DINOv2             Meta    open     —          —     self-sup
04  CLIP ViT-L/14      OpenAI  open     Flickr30k  87.4  R@1
Fig 5 · Flickr30k R@1 text-to-image retrieval. SigLIP uses the sigmoid loss variant; OpenCLIP is LAION's community reproduction of OpenAI CLIP.
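
R@1 here is text-to-image: for each caption, is the nearest image embedding the right one? A minimal numpy sketch over pre-computed, L2-normalised embeddings from any CLIP-class model; it assumes one caption per image for simplicity, whereas Flickr30k pairs five captions with each image, which only changes the index bookkeeping.

    import numpy as np

    def recall_at_k(text_emb, image_emb, k=1):
        """text_emb: (N, d) caption embeddings; image_emb: (N, d) image embeddings,
        with row i of each belonging to the same pair. Both L2-normalised."""
        sims = text_emb @ image_emb.T                       # cosine similarity matrix
        topk = np.argsort(-sims, axis=1)[:, :k]             # k nearest images per caption
        hits = (topk == np.arange(len(text_emb))[:, None]).any(axis=1)
        return hits.mean()

    # text_emb  = model.encode_text(captions)               # hypothetical encoder calls
    # image_emb = model.encode_image(images)
    # print("R@1:", recall_at_k(text_emb, image_emb, k=1))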
§ 06 · Benchmarks

The datasets we believe.

Canonical for each task. ImageNet, COCO, ADE20K and OmniDocBench are canonicalised in our dataset registry; the rest are tracked qualitatively pending canonicalisation.

Rows with a mark live in the registry and carry full lineage.

Registry  Benchmark     Scope                                Primary metric    Year  Source
●         ImageNet      Image classification                 Top-1 %           2009  link →
●         COCO          Detection · segmentation · captions  mAP · AP · CIDEr  2014  link →
●         ADE20K        Semantic segmentation                mIoU              2017  link →
○         Cityscapes    Urban scene segmentation             mIoU              2016  link →
○         DocVQA        Document understanding               ANLS              2021  link →
●         OmniDocBench  End-to-end OCR                       composite         2025  link →
○         NYU Depth V2  Monocular depth                      AbsRel            2012  link →
○         KITTI         Driving · depth · detection          AbsRel · AP       2012  link →
○         Flickr30k     Image-text retrieval                 R@1               2014  link →
○         VQAv2         Visual question answering            Accuracy          2017  link →
○         IAM           Handwriting recognition              CER               2002  link →
Fig 6 · Solid marker = canonicalised in the Codesota registry. Hollow marker = widely cited, tracked qualitatively, not yet graded.
§ 07 · How it works

Four tasks, one backbone.

Classification — “is this a cat?” — is the oldest framing and the one that set the pace. After AlexNet in 2012, ResNet’s skip connections (2015) and the Vision Transformer (2020) pushed ImageNet top-1 from ~74% to ~90%. Detection and segmentation are classification at finer spatial granularity: boxes and masks over the same feature maps.

The modern pattern is a shared encoder — a ViT, a ConvNeXt, or a hybrid Swin backbone — feeding a task head. Co-DETR and DINO attach a transformer decoder for set-prediction detection; Mask2Former unifies semantic, instance and panoptic heads over the same features; SAM 2 trains the head to accept point, box and mask prompts.
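
A minimal PyTorch sketch of that shared-encoder pattern. The module names and the patchify stand-in encoder are invented for illustration; a real stack would plug in a pretrained ViT, ConvNeXt or Swin backbone and the task-specific decoders named above.

    import torch
    import torch.nn as nn

    class SharedBackboneModel(nn.Module):
        """One encoder, several task heads; only the head changes per task."""
        def __init__(self, embed_dim=768, num_classes=1000, num_seg_classes=150):
            super().__init__()
            # Stand-in encoder: a 16x16 patchify conv in place of a real pretrained backbone.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),
                nn.GELU(),
            )
            self.cls_head = nn.Linear(embed_dim, num_classes)          # "is this a cat?"
            self.seg_head = nn.Conv2d(embed_dim, num_seg_classes, 1)   # per-patch class logits
            self.det_head = nn.Conv2d(embed_dim, 4 + num_classes, 1)   # box offsets + class, per location

        def forward(self, images, task="cls"):
            feats = self.encoder(images)                      # (B, C, H/16, W/16) feature map
            if task == "cls":
                return self.cls_head(feats.mean(dim=(2, 3)))  # global pool, then classify
            if task == "seg":
                return self.seg_head(feats)                   # upsampled to full resolution downstream
            return self.det_head(feats)

    model = SharedBackboneModel()
    logits = model(torch.randn(1, 3, 224, 224), task="cls")   # -> shape (1, 1000)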

What changed after 2023 is that the encoder can be multimodal. CLIP and SigLIP align image and text into a shared vector space; the same embedding that powers visual search also conditions a vision-language model like Qwen3-VL or GPT-5.4, which then runs document parsing, VQA and captioning through one decoder.
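
A minimal sketch of that shared image-text space through the Hugging Face transformers CLIP wrapper; the checkpoint id is one of the published OpenAI CLIP models, the image path is a placeholder, and SigLIP exposes an equivalent interface.

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    image = Image.open("photo.jpg")                           # placeholder local image
    texts = ["a cat on a sofa", "a street at night", "a scanned invoice"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

    # One similarity score per (image, text) pair in the shared embedding space.
    probs = outputs.logits_per_image.softmax(dim=-1)
    for text, p in zip(texts, probs[0].tolist()):
        print(f"{p:.2f}  {text}")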

Document OCR is the current hinge task. A VLM that reads PDFs well has to handle layout, table structure, multi-column text and chart reasoning — a broader surface than any prior CV benchmark. That’s why PaddleOCR-VL, Qwen3-VL and dots.ocr are leading where hand-tuned OCR stacks did a year ago.

§ 08 · Adjacent reads

Deeper, by task.

When the top-level ranking isn’t enough: the per-task register, the building block, the cost-vs-quality comparison pages.

Fig 8 · Each page has its own evidence surface; these are editorial reads, not benchmark duplicates.
Related

Neighbouring registers.

Other modality hubs on Codesota worth reading next.

OCR · register
Document understanding and text extraction.
LLM · register
Frontier language-model benchmarks.
Speech · register
Speech-to-text and text-to-speech, both directions.
All tasks
Every benchmark the registry carries.
Methodology
How scores are sourced, graded, and dated.
Product roadmap
The smart-router thesis: one API, every task, three tiers.