Codesota · Registry · Computer Vision · The area-level register · Issue: April 22, 2026
Area hub · Computer Vision

Computer vision,
measured.

Pixels in, structure out: classification, detection, segmentation, and depth. This is the area with the oldest leaderboards in the register, and most of its headline numbers are saturating in public view.

Computer vision in 2026 looks nothing like 2023. Foundation models (DINOv2, SAM 3) have replaced task-specific training for most pipelines. NMS-free detection (YOLO26, RF-DETR) is the new production standard. Open-source rivals proprietary across every task. The bottleneck has shifted from models to data, deployment, and evaluation on your actual domain.

§ 01 · Top tasks

Sub-tasks in computer vision.

Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.

Fig 01 · Showing top 12 of 27 tasks under Computer Vision.
§ 02 · Top benchmarks

Current state of the art.

Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.

# · Task · Benchmark · Leading model · Score · Note
01 · Image Classification · ImageNet-1K · CoCa · 91.0% top-1 · Benchmark saturated; focus shifting to robustness variants
02 · Object Detection · COCO test-dev · ScyllaNet · 66.0 AP · RF-DETR: 60+ AP real-time (<5 ms)
03 · Object Detection (open-vocab) · LVIS-minival · DINO-X Pro · 59.8 AP · Zero-shot, no LVIS training
04 · Semantic Segmentation · ADE20K · InternImage-H · 62.9 mIoU · 1.08B params
05 · Panoptic Segmentation · COCO · SAM 3 · SOTA · Also: open-vocab + video tracking
06 · Depth Estimation · Multi-view · Depth Anything 3 · +44% vs VGGT · Single DINOv2 transformer, any number of views
07 · Image Generation · ImageNet-256 FID · DiT variant · 1.35 FID · FLUX.2 best open-source for text-to-image
08 · Video Understanding · Kinetics-400 · InternVideo 2.5 · ~92% · Multimodal, SOTA across 39 video datasets
Fig 02 · Headline benchmarks for Computer Vision. Full leaderboards, dated history and reproduction status live on the task pages.
Side note

State of the Field (2026)

  • 01 · DINOv2 is the default backbone, used by RF-DETR (detection), Depth Anything 3 (depth), and SAM 3 (segmentation). It's the new ImageNet-pretrained ResNet (see the sketch after this list).
  • 02 · SAM 3 (Meta, Nov 2025) does open-vocabulary detection + segmentation + video tracking from text prompts. The 'GPT moment' for segmentation.
  • 03 · DINO-X achieves 56.0 AP on COCO zero-shot, with no training on COCO at all, and 59.8 AP on LVIS-minival. The best open-set detector, period.
  • 04 · RF-DETR is the first real-time model to exceed 60 AP on COCO. 54.7% mAP at <5ms latency on a T4 GPU.
  • 05 · YOLO26 (Sep 2025) removes NMS entirely. 43% faster CPU inference than YOLO11. Purpose-built for edge deployment.
  • 06 · ImageNet top-1 is 91% (CoCa). COCO AP is 66% (ScyllaNet). Further gains cost orders of magnitude more compute for diminishing returns.
  • 07 · The line between 'vision model' and 'vision-language model' has dissolved. SAM 3, InternVL3.5, DINO-X all accept text prompts natively.
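A minimal sketch of the backbone pattern from item 01, using the DINOv2 weights Meta publishes through torch.hub to pull a global image embedding. The image path and the ViT-B/14 variant are illustrative; task heads (detection, depth, segmentation) are trained on top of features like these.

    import torch
    from PIL import Image
    from torchvision import transforms

    # Hypothetical local image; any RGB image works.
    image = Image.open("sample.jpg").convert("RGB")

    # DINOv2 ViT-B/14 backbone, published by Meta via torch.hub.
    backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
    backbone.eval()

    # ImageNet normalisation; input sides must be divisible by the patch size (14).
    preprocess = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    with torch.no_grad():
        features = backbone(preprocess(image).unsqueeze(0))  # (1, 768) global embedding

    print(features.shape)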
Key models

Names to watch.

  • SAM 3 (Meta) · Open-vocab detect + segment + track
  • DINO-X (IDEA Research) · Zero-shot detection (1200+ categories)
  • RF-DETR (Roboflow) · First real-time >60 AP on COCO
  • YOLO26 (Ultralytics) · NMS-free edge detection standard
  • DINOv2 (Meta) · Self-supervised visual features backbone
  • Depth Anything 3 (ByteDance) · Unified monocular + multi-view depth
  • InternVL 3.5 (OpenGVLab) · Best open-source VLM (72.2 MMMU)
  • FLUX.2 (Black Forest Labs) · Production-grade open image generation

Picks by use-case

What to reach for.

Editorial picks · not vendor rankings
Detection (production, known classes)
YOLO26 (edge) or RF-DETR (server)

YOLO26: NMS-free, 43% faster CPU. RF-DETR: first >60 AP real-time. Fine-tune on your data. Always.
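A rough sketch of that fine-tuning step with the Ultralytics API. The dataset YAML, checkpoint name and epoch count are placeholders; substitute the YOLO26 weights under whatever name the release uses.

    from ultralytics import YOLO

    # "defects.yaml" is a hypothetical dataset config in the standard
    # Ultralytics YAML format (train/val paths plus class names).
    model = YOLO("yolo11n.pt")  # swap in the YOLO26 checkpoint once you have it
    model.train(data="defects.yaml", epochs=100, imgsz=640)

    # Validate, then export for edge deployment.
    metrics = model.val()
    model.export(format="onnx")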

Detection (open-vocabulary)
DINO-X Pro or Grounding DINO 1.6

Best zero-shot accuracy. Use as a labelling assistant, then train YOLO for production.
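A hedged sketch of the labelling-assistant workflow using the older Grounding DINO checkpoints published on the Hugging Face Hub (DINO-X Pro and Grounding DINO 1.6 sit behind APIs). The image path and prompt classes are placeholders, and post-processing argument names vary slightly across transformers versions.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

    model_id = "IDEA-Research/grounding-dino-base"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    image = Image.open("frame.jpg").convert("RGB")
    # Class prompts are lower-case phrases, each terminated with a period.
    text = "forklift. pallet. person."

    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Threshold names differ by transformers version; check your installed docs.
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids,
        box_threshold=0.35, text_threshold=0.25,
        target_sizes=[image.size[::-1]],
    )
    print(results[0]["boxes"], results[0]["labels"])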

Segmentation
SAM 3 (interactive) or Mask2Former (production)

SAM 3 for annotation and prompting. Mask2Former/OneFormer fine-tuned for deployment metrics.
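For the production half, a minimal sketch of running a public ADE20K Mask2Former checkpoint through transformers. In practice you would swap in a checkpoint fine-tuned on your own label set; the image path is a placeholder.

    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

    ckpt = "facebook/mask2former-swin-large-ade-semantic"
    processor = AutoImageProcessor.from_pretrained(ckpt)
    model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt)

    image = Image.open("scene.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Per-pixel class ids, resized back to the input resolution.
    seg_map = processor.post_process_semantic_segmentation(
        outputs, target_sizes=[image.size[::-1]]
    )[0]
    print(seg_map.shape)  # (H, W) tensor of ADE20K class ids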

Depth estimation
Depth Anything V2 (single image) or V3 (multi-view)

Production-ready, fast, well-supported. Metric3D v2 if you need absolute scale for robotics.
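A short sketch of single-image relative depth via the transformers depth-estimation pipeline. The exact Depth Anything V2 repo id is an assumption worth checking on the Hub, and the output is relative, not metric, depth.

    from PIL import Image
    from transformers import pipeline

    # Repo id assumed; confirm the converted checkpoint name on the Hub.
    depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

    image = Image.open("room.jpg").convert("RGB")
    result = depth(image)
    result["depth"].save("room_depth.png")  # PIL image of relative depth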

Vision-language understanding
InternVL3.5 (open-source) or GPT-4o (API)

InternVL3.5: 72.2 MMMU, runs locally. GPT-4o: best reasoning but 100x cost. Gemini 2.0 Flash for high-volume.
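For the API route, a minimal sketch of asking GPT-4o one question about one image with the OpenAI Python SDK; the image path and prompt are placeholders.

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("chart.png", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)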

Image generation
FLUX.2 (local) or SD3.5 (ecosystem)

FLUX.2 rivals proprietary quality. SD3.5 has the LoRA/ControlNet ecosystem. SDXL still best for low VRAM.
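A minimal local-generation sketch with diffusers, shown against SD3.5 since FLUX checkpoints ship their own pipeline classes (FluxPipeline for FLUX.1; check the FLUX.2 model card for its equivalent). Prompt and sampler settings are illustrative.

    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
    ).to("cuda")

    image = pipe(
        "a macro photo of a circuit board, shallow depth of field",
        num_inference_steps=28,
        guidance_scale=3.5,
    ).images[0]
    image.save("out.png")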

Editor's note

Honest takes.

Classification and clean-doc OCR are solved. Move on.

91% top-1 on ImageNet. Real-time detection at ~55 AP in under 5 ms. Monocular depth is production-ready. Stop optimising saturated benchmarks and focus on your actual domain gap.

Zero-shot is a starting point, not an endpoint

DINO-X gets 56 AP zero-shot on COCO. A fine-tuned YOLO26 will beat it on your specific domain every time. Use zero-shot for labelling and prototyping, then train a specialist for production.

The bottleneck is data and deployment, not models

Foundation models are good enough. The real work is getting labelled data for your domain (industrial defects, medical images, satellite), then quantising and distilling for your hardware.

3D vision is still 5 years behind 2D

Depth maps look cool in demos. In production, you need multi-view or LiDAR for anything safety-critical. No single foundation model does 3D as well as DINOv2 does 2D features.

FID scores for image generation are meaningless

FID doesn't capture what humans care about — coherence, prompt following, aesthetics. FLUX.1 'feels' better than models with lower FID. Trust human evals, not automated metrics.
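For the record, FID reduces each image set to the mean and covariance of Inception-v3 features and measures the Fréchet distance between the two Gaussians:

    \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)

where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the real and generated feature statistics. Nothing in that expression sees the prompt, so prompt adherence and global coherence cannot move the score.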

§ 03 · Method
How this area is tracked

Every row in this register is dated and sourced.

The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.
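As a rough illustration of the shape each row takes (field names here are assumptions for illustration, not the actual Postgres schema):

    from dataclasses import dataclass
    from datetime import date
    from typing import Literal, Optional

    @dataclass
    class ScoreRecord:
        task: str                 # e.g. "Object Detection"
        dataset: str              # the task's single canonical dataset, e.g. "COCO test-dev"
        model: str                # e.g. "ScyllaNet"
        metric: str               # e.g. "AP"
        direction: Literal["higher_better", "lower_better"]
        value: float
        reported: date            # when the score was published
        reproduction: Optional[Literal["reproduced", "unverified", "contested"]] = None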

When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.

Full methodology · The unified task index
In-depth guides

Further reading.

Image Segmentation: Models, Methods & Benchmarks

SAM 2 vs Mask2Former vs OneFormer — when to use which

Multimodal AI: State of Benchmarks

GPT-4o, Gemini, Claude, InternVL compared on MMMU, MathVista, more

Code Generation Models Compared

Claude Opus 4, GPT-5, Gemini 2.5 Pro, DeepSeek-V3

§ Final · Related

Neighbouring registers.

Sibling area hubs, the unified task index and the methodology that binds them.

Editorial invitation

Need help choosing?

We benchmark models on your actual data. Same methodology as CodeSOTA, your domain, your hardware constraints.

Book Assessment