Pixels in, structure out: classification, detection, segmentation and depth. The area with the oldest leaderboards in the register — and most of its headline numbers are saturating in public view.
Computer vision in 2026 looks nothing like 2023. Foundation models (DINOv2, SAM 3) have replaced task-specific training for most pipelines. NMS-free detection (YOLO26, RF-DETR) is the new production standard. Open-source rivals proprietary across every task. The bottleneck has shifted from models to data, deployment, and evaluation on your actual domain.
Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.
Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.
| # | Task | Benchmark | Leading model | Score | Note |
|---|---|---|---|---|---|
| 01 | Image Classification | ImageNet-1K | CoCa | 91.0% top-1 | Benchmark saturated — focus shifting to robustness variants |
| 02 | Object Detection | COCO test-dev | ScyllaNet | 66.0 AP | RF-DETR: 60+ AP real-time (<5ms) |
| 03 | Object Detection (open-vocab) | LVIS-minival | DINO-X Pro | 59.8 AP | Zero-shot, no LVIS training |
| 04 | Semantic Segmentation | ADE20K | InternImage-H | 62.9 mIoU | 1.08B params |
| 05 | Panoptic Segmentation | COCO | SAM 3 | SOTA | Also: open-vocab + video tracking |
| 06 | Depth Estimation | Multi-view | Depth Anything 3 | +44% vs VGGT | Single DINOv2 transformer, any number of views |
| 07 | Image Generation | ImageNet-256 FID | DiT variant | 1.35 FID | FLUX.2 best open-source for text-to-image |
| 08 | Video Understanding | Kinetics-400 | InternVideo 2.5 | ~92% | Multimodal, SOTA across 39 video datasets |
The headline shifts behind those numbers:

- SAM 3: open-vocab detect + segment + track
- DINO-X: zero-shot detection (1200+ categories)
- RF-DETR: first real-time >60 AP on COCO
- YOLO26: NMS-free edge detection standard
- DINOv2: self-supervised visual features backbone
- Depth Anything 3: unified monocular + multi-view depth
- InternVL3.5: best open-source VLM (72.2 MMMU)
- FLUX.2: production-grade open image generation
What to actually use, task by task:

**Object detection.** YOLO26: NMS-free, 43% faster on CPU. RF-DETR: first real-time detector above 60 AP. Fine-tune on your data. Always.
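A minimal sketch of that fine-tuning step, assuming an Ultralytics-style training API; the `yolo26n.pt` checkpoint name and `defects.yaml` dataset file are placeholders for whatever your release and data actually look like.

```python
# Sketch: fine-tune a small NMS-free detector on a custom dataset.
# Assumes an Ultralytics-style API; "yolo26n.pt" and "defects.yaml" are
# placeholders. Substitute your actual checkpoint and dataset config.
from ultralytics import YOLO

model = YOLO("yolo26n.pt")      # hypothetical YOLO26 nano checkpoint
model.train(
    data="defects.yaml",        # dataset config: train/val paths + class names
    epochs=100,
    imgsz=640,
    batch=16,
)
metrics = model.val()           # mAP on your validation split, not COCO
model.export(format="onnx")     # hand off to your deployment runtime
```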
**Open-vocabulary detection.** DINO-X has the best zero-shot accuracy. Use it as a labelling assistant, then train a YOLO model for production.
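A sketch of that labelling-assistant step. DINO-X's own API isn't assumed here, so an OWL-ViT checkpoint through the `transformers` zero-shot object detection pipeline stands in; class names and paths are placeholders.

```python
# Sketch: pre-label images with an open-vocabulary detector, review the boxes,
# then train a specialist on the result. OWL-ViT stands in for DINO-X here;
# class names and paths are placeholders for your own domain.
from pathlib import Path
from PIL import Image
from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
classes = ["scratch", "dent", "missing screw"]   # your domain vocabulary

for path in Path("unlabelled/").glob("*.jpg"):
    image = Image.open(path)
    detections = detector(image, candidate_labels=classes)
    # Keep confident boxes as draft labels; humans correct them before the
    # dataset is exported in YOLO format for fine-tuning.
    draft = [d for d in detections if d["score"] > 0.4]
    print(path.name, [(d["label"], round(d["score"], 2)) for d in draft])
```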
**Segmentation.** SAM 3 for annotation and prompting. Mask2Former or OneFormer, fine-tuned, for deployment metrics.
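SAM 3's own predictor API isn't assumed below; as a sketch of the annotate-with-a-promptable-model workflow, the `transformers` mask-generation pipeline with an original SAM checkpoint follows the same pattern.

```python
# Sketch: automatic mask proposals as a starting point for annotation.
# Uses the Hugging Face mask-generation pipeline with an original SAM
# checkpoint; SAM 3 may expose its own predictor API, so treat as illustrative.
from PIL import Image
from transformers import pipeline

generator = pipeline("mask-generation", model="facebook/sam-vit-base")
image = Image.open("factory_part.jpg")          # placeholder image path
outputs = generator(image, points_per_batch=64)

# Keep the confident masks and hand them to a labelling tool for human review.
keep = [m for m, s in zip(outputs["masks"], outputs["scores"]) if s > 0.9]
print(f"{len(keep)} candidate masks for review")
```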
**Depth estimation.** Depth Anything 3 is production-ready, fast and well supported. Metric3D v2 if you need absolute scale for robotics.
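Monocular depth takes a few lines; a sketch via the `transformers` depth-estimation pipeline, where the Depth Anything V2 checkpoint name is one public release used as a placeholder, not necessarily the latest.

```python
# Sketch: relative monocular depth with the depth-estimation pipeline.
# The checkpoint is one public Depth Anything release, used as a placeholder.
# Output is relative depth; use Metric3D v2 or calibration for absolute scale.
from PIL import Image
from transformers import pipeline

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
image = Image.open("warehouse.jpg")             # placeholder image path
result = depth(image)

result["depth"].save("warehouse_depth.png")     # PIL image of the depth map
print(result["predicted_depth"].shape)          # raw tensor if you need values
```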
**Vision-language models.** InternVL3.5: 72.2 MMMU, runs locally. GPT-4o: best reasoning but 100x the cost. Gemini 2.0 Flash for high-volume work.
**Image generation.** FLUX.2 rivals proprietary quality. SD3.5 has the LoRA/ControlNet ecosystem. SDXL is still best for low VRAM.
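A sketch of local generation with `diffusers`. FLUX.2 integration isn't assumed, so a FLUX.1 schnell checkpoint stands in; SDXL or SD3.5 slot into the same pattern if VRAM or the LoRA ecosystem matters more.

```python
# Sketch: local text-to-image with diffusers. FLUX.2 integration isn't assumed;
# the FLUX.1 schnell checkpoint is a small, permissively licensed stand-in.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()     # trade speed for VRAM headroom

image = pipe(
    "a macro photo of a corroded weld seam, studio lighting",
    num_inference_steps=4,          # schnell is distilled for few steps
    guidance_scale=0.0,             # schnell ignores classifier-free guidance
).images[0]
image.save("weld.png")
```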
91% top-1 on ImageNet. Real-time detection at 55+ AP in <5ms. Monocular depth is production-ready. Stop optimising saturated benchmarks and focus on your actual domain gap.
DINO-X gets 56 AP zero-shot on COCO. A fine-tuned YOLO26 will beat it on your specific domain every time. Use zero-shot for labelling and prototyping, then train a specialist for production.
Foundation models are good enough. The real work is getting labelled data for your domain (industrial defects, medical images, satellite), then quantising and distilling for your hardware.
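One sketch of that deployment step, assuming an ONNX path: export the fine-tuned PyTorch model, then apply onnxruntime's dynamic quantization. File names are placeholders, and conv-heavy models often do better with static quantization against a calibration set, or with TensorRT/CoreML.

```python
# Sketch: shrink a fine-tuned model for edge hardware via ONNX export plus
# dynamic quantization. File names are placeholders; static quantization with
# a calibration set (or TensorRT/CoreML) often wins for conv-heavy CV models.
import torch
from onnxruntime.quantization import QuantType, quantize_dynamic

model = torch.load("defect_detector.pt", weights_only=False).eval()
dummy = torch.randn(1, 3, 640, 640)      # match your deployment input resolution

torch.onnx.export(model, dummy, "detector.onnx", opset_version=17)
quantize_dynamic("detector.onnx", "detector.int8.onnx", weight_type=QuantType.QInt8)
```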
Depth maps look cool in demos. In production, you need multi-view or LiDAR for anything safety-critical. No single foundation model does 3D as well as DINOv2 does 2D features.
FID doesn't capture what humans care about — coherence, prompt following, aesthetics. FLUX.1 'feels' better than models with lower FID. Trust human evals, not automated metrics.
The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.
When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.
Sibling area hubs, the unified task index and the methodology that binds them.
We benchmark models on your actual data. Same methodology as CodeSOTA, your domain, your hardware constraints.