recognize text lines, handwriting, scene text
Vision task router
Pick the visual output you need, then compare models by quality, latency and deployment fit.
Start with the use case, then the benchmark.
The ontology unit is not a dataset row. It is the workflow capability being evaluated: recognize, localize, segment, retrieve, estimate geometry or track across time.
Each row keeps three choices visible: academic ceiling, production tradeoff and edge/economy deployment.
| Use case | Primitive | SOTA | Premium | Economy / edge |
|---|---|---|---|---|
| Detect objects | localize / detect | Co-DETR · COCO ceiling | YOLO11x / RT-DETR | YOLO11n/s · Hailo / Jetson class |
| Segment pixels or objects | segment | SegGPT / Mask2Former | SAM 2 / Mask DINO | MobileSAM / FastSAM-style |
| Classify images | recognize | CoCa · ImageNet-1K | EVA-02 / ConvNeXt | MobileNetV4 / EfficientViT |
| Search images | retrieve / embed | SigLIP-class image-text | OpenCLIP + domain eval | MobileCLIP-style |
| Estimate scene geometry | estimate geometry | Depth Anything / stereo stacks | metric depth + calibration | task-specific depth head |
| Track video objects | track / video | SAM 2 video / MOT stacks | detector + tracker pipeline | ByteTrack / lightweight MOT |
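
The router table above can be read as a small lookup structure. The sketch below is one way to encode it, assuming hypothetical slug keys (`detect`, `segment`, and so on) and a hypothetical `pick` helper; the cell text is copied from the table, nothing else is implied about how the site stores it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One row of the router table: a primitive plus three deployment tiers."""
    primitive: str
    sota: str
    premium: str
    economy: str


# Keys are hypothetical slugs; values are the table cells, copied verbatim.
ROUTES = {
    "detect": Route("localize / detect", "Co-DETR · COCO ceiling",
                    "YOLO11x / RT-DETR", "YOLO11n/s · Hailo / Jetson class"),
    "segment": Route("segment", "SegGPT / Mask2Former",
                     "SAM 2 / Mask DINO", "MobileSAM / FastSAM-style"),
    "classify": Route("recognize", "CoCa · ImageNet-1K",
                      "EVA-02 / ConvNeXt", "MobileNetV4 / EfficientViT"),
    "search": Route("retrieve / embed", "SigLIP-class image-text",
                    "OpenCLIP + domain eval", "MobileCLIP-style"),
    "geometry": Route("estimate geometry", "Depth Anything / stereo stacks",
                      "metric depth + calibration", "task-specific depth head"),
    "track": Route("track / video", "SAM 2 video / MOT stacks",
                   "detector + tracker pipeline", "ByteTrack / lightweight MOT"),
}

TIERS = ("sota", "premium", "economy")


def pick(use_case: str, tier: str) -> str:
    """Return the table cell for a use case and tier."""
    if tier not in TIERS:
        raise ValueError(f"tier must be one of {TIERS}, got {tier!r}")
    return getattr(ROUTES[use_case], tier)
```

For example, `pick("detect", "economy")` returns the edge cell of the detection row, which keeps the three tiers visible for every use case instead of collapsing them into one ranking.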
Image classification, ranked.
ImageNet top-1 remains the canonical classifier benchmark, but it is now a saturated ceiling metric. Codesota should show the absolute ImageNet leader separately from the open deployable classifier slice, then push production decisions toward robustness, latency and hardware fit.
- Metric: Top-1 · higher is better
- Models: 6 shown
- Dataset: ImageNet
| # | Model | Org | Licence | Dataset | Score | Unit | Note |
|---|---|---|---|---|---|---|---|
| 01 | CoCa (finetuned) | | research | ImageNet-1K | 91.0 | % | absolute top-1 |
| 02 | EVA-02-L | BAAI | open | ImageNet-1K | 90.1 | % | open ViT-L-class |
| 03 | ConvNeXt-V2 (H) | Meta | open | ImageNet | 88.9 | % | top-1 |
| 04 | EfficientNetV2-XL | | open | ImageNet | 87.3 | % | top-1 |
| 05 | ViT-L/14 | | open | ImageNet | 85.3 | % | top-1 |
| 06 | MobileNetV4-L | open | ImageNet | 83.4 | % | top-1 · edge |
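
For readers who want the Score column made concrete, top-1 accuracy is the share of images whose highest-scoring class equals the label. The following is a minimal sketch in plain Python; the function name and the list-of-lists input format are assumptions for illustration, not the evaluation code behind the table.

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose arg-max class matches the label.

    scores: list of per-class score lists, one inner list per image.
    labels: list of integer class indices, same length as scores.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal length")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)
```

A single number like this says nothing about robustness, latency or hardware fit, which is why the text above treats it as a ceiling metric and not a deployment decision.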
Object detection, ranked.
COCO box AP is only comparable when the split, pretraining, multiscale testing and ensemble status are explicit. Co-DETR-style detectors hold the academic ceiling; YOLO11 and RT-DETR own the production real-time band.
- Metric: mAP · higher is better
- Models: 5 shown
- Dataset: COCO
| # | Model | Org | Licence | Dataset | Score | Unit | Note |
|---|---|---|---|---|---|---|---|
| 01 | Co-DETR | SenseTime | open | COCO test-dev | 66.0 | mAP | box AP · strong config |
| 02 | DINO | IDEA Research | open | COCO | 63.3 | mAP | box AP · config-dependent |
| 03 | RT-DETR-X | Baidu | open | COCO | 54.8 | mAP | real-time |
| 04 | YOLO11x | Ultralytics | open | COCO | 54.7 | mAP | real-time |
| 05 | YOLOv8x | Ultralytics | open | COCO | 53.9 | mAP | real-time |
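
The comparability rule stated above (split, pretraining, multiscale testing and ensemble status must be explicit) can be sketched as a guard that refuses to rank two results evaluated under different protocols. Everything here is illustrative: the `DetectionResult` fields, the `comparable` helper and the two example entries are hypothetical and are not real benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionResult:
    """A COCO box AP score together with the protocol it was measured under."""
    model: str
    box_ap: float
    split: str               # for example "test-dev" or "val2017"
    extra_pretraining: bool  # pretrained on data beyond the standard recipe
    multiscale_test: bool    # multiscale or test-time augmentation used
    ensemble: bool           # score comes from more than one model


def protocol(result: DetectionResult) -> tuple:
    """The fields that must match before two scores can share a ranking."""
    return (result.split, result.extra_pretraining,
            result.multiscale_test, result.ensemble)


def comparable(a: DetectionResult, b: DetectionResult) -> bool:
    """True only when both results were produced under the same protocol."""
    return protocol(a) == protocol(b)


# Hypothetical entries, for illustration only.
big = DetectionResult("detector-a", 60.0, "test-dev", True, True, False)
fast = DetectionResult("detector-b", 50.0, "val2017", False, False, False)
```

Under this sketch `comparable(big, fast)` is false, so a register would show the two rows side by side for orientation but would not claim that one beats the other by ten points.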
Segmentation families, separated.
ADE20K semantic mIoU, COCO instance mask AP and promptable SAM-style segmentation answer different questions. Keep them in one orientation table, but do not pretend they are one leaderboard.
- Metric: mIoU · AP
- Models: 5 shown
- Dataset: ADE20K · COCO
| # | Model | Org | Licence | Dataset | Score | Unit | Note |
|---|---|---|---|---|---|---|---|
| 01 | SegGPT | BAAI | open | ADE20K | 62.6 | mIoU | semantic |
| 02 | Mask2Former | Meta | open | ADE20K | 57.7 | mIoU | semantic |
| 03 | SegFormer-B5 | NVIDIA | open | ADE20K | 51.8 | mIoU | semantic · efficient |
| 04 | Mask DINO | IDEA Research | open | COCO | 50.9 | AP | instance |
| 05 | SAM 2 | Meta | open | zero-shot | — | — | prompt-based; no single mIoU |
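
To show why semantic mIoU is a different question from instance mask AP, here is a minimal sketch of mIoU over flattened label maps. It assumes every pixel carries a valid class index and that classes absent from both prediction and target are skipped; real ADE20K evaluation also handles an ignore label, which this sketch leaves out.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union, in percent, over classes that occur.

    pred, target: flat lists of integer class indices, one per pixel.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel is in both masks of class p: counts once for each tally.
            intersection[p] += 1
            union[p] += 1
        else:
            # Pixel is in the predicted mask of p and the target mask of t.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return 100.0 * sum(ious) / len(ious)
```

The score averages over classes, not over object instances, so it cannot say whether two adjacent objects of the same class were separated. That is the question mask AP answers, and it is why the table keeps the Unit column explicit.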
Document AI is not a primitive CV task.
OCR and document parsing belong in the Vision & Documents area, but the model selection problem is different: text OCR, layout detection, table structure, formula recognition, reading order and document VQA are all active at once.
Open the OCR register →
- detect blocks, figures, reading order and captions
- recover cells, spans, headers and structure
- recognize math and scientific notation
- answer over page images with evidence
- combine all subtasks into Markdown, HTML or JSON
Image embeddings, ranked.
Image embeddings are the piece of CV that leaves the lab as infrastructure. CLIP-class models vectorise images so they can be searched with text or compared with other images. Flickr30k R@1 is a research metric; product search still needs catalog recall@K and false-positive cost.
DINOv2 is not in this table because it is a self-supervised visual feature extractor, not a text-image retriever. Track it under visual features, localization transfer or segmentation transfer.
| # | Model | Org | Licence | Dataset | R@1 | Metric |
|---|---|---|---|---|---|---|
| 01 | SigLIP | | open | Flickr30k | 97.1 | R@1 |
| 02 | OpenCLIP ViT-G/14 | LAION | open | Flickr30k | 94.4 | R@1 |
| 03 | CLIP ViT-L/14 | OpenAI | open | Flickr30k | 87.4 | R@1 |
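
Both R@1 and catalog recall@K come from the same calculation: embed queries and gallery items, rank the gallery by cosine similarity, and check whether a correct item lands in the top K. The sketch below uses plain Python lists as stand-ins for model embeddings; the function names and the `relevant` format are assumptions, and vectors are assumed to be non-zero.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length so dot product equals cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def recall_at_k(query_vecs, gallery_vecs, relevant, k=1):
    """Percentage of queries with at least one relevant item in the top k.

    relevant[i] is the set of gallery indices that are correct for query i.
    """
    gallery = [_normalize(g) for g in gallery_vecs]
    hits = 0
    for query, rel in zip(query_vecs, relevant):
        q = _normalize(query)
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        # Gallery indices ordered from most to least similar, cut at k.
        top = sorted(range(len(gallery)), key=lambda j: sims[j], reverse=True)[:k]
        if rel & set(top):
            hits += 1
    return 100.0 * hits / len(query_vecs)
```

Running the same function on a product catalog with a larger K gives the catalog recall@K mentioned above. It still does not capture false-positive cost, which needs its own measurement.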
The datasets we believe.
These are the canonical benchmarks for each task family. ImageNet, COCO and ADE20K are primitive CV anchors; DocVQA and OmniDocBench are adjacent document-AI anchors that should route to the OCR register.
Rows with a mark live in the registry and carry full lineage.
| Benchmark | Scope | Primary metric | Year | Source | |
|---|---|---|---|---|---|
| ImageNet | Image classification | Top-1 % | 2009 | link → | |
| COCO | Detection · segmentation · captions | mAP · AP · CIDEr | 2014 | link → | |
| ADE20K | Semantic segmentation | mIoU | 2017 | link → | |
| Cityscapes | Urban scene segmentation | mIoU | 2016 | link → | |
| DocVQA | Adjacent: document VQA | ANLS | 2021 | link → | |
| OmniDocBench | Adjacent: document parsing stack | composite | 2025 | link → | |
| NYU Depth V2 | Monocular depth | AbsRel | 2012 | link → | |
| KITTI | Driving · depth · detection | AbsRel · AP | 2012 | link → | |
| Flickr30k | Image-text retrieval | R@1 | 2014 | link → | |
| VQAv2 | Visual question answering | Accuracy | 2017 | link → | |
| IAM | Handwriting recognition | CER | 2002 | link → | |
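
The primary metrics in this table do not all point the same way: AbsRel and CER are error rates where lower is better, while the others listed here improve as they rise. The sketch below ranks results with that direction made explicit. The `HIGHER_IS_BETTER` map and `rank` helper are hypothetical, and the OmniDocBench composite is left out because its direction is not stated here.

```python
# Direction of each primary metric named in the table above.
HIGHER_IS_BETTER = {
    "Top-1": True,
    "mAP": True,
    "AP": True,
    "CIDEr": True,
    "mIoU": True,
    "ANLS": True,
    "R@1": True,
    "Accuracy": True,
    "AbsRel": False,  # absolute relative depth error
    "CER": False,     # character error rate
}


def rank(results, metric):
    """Sort (model, score) pairs best-first for the given metric."""
    if metric not in HIGHER_IS_BETTER:
        raise KeyError(f"unknown metric direction for {metric!r}")
    return sorted(results, key=lambda r: r[1], reverse=HIGHER_IS_BETTER[metric])
```

With hypothetical entries, `rank([("a", 0.12), ("b", 0.08)], "AbsRel")` puts `b` first, while the same call with `"Top-1"` and percentage scores puts the larger number first.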
Capability first, benchmark second.
The stable top-level area is Vision & Documents, but the user-facing route should not flatten everything into one table. Primitive CV asks whether a model can recognize, localize, segment, retrieve, estimate geometry or track. Vertical stacks such as document AI, medical imaging and industrial inspection combine several primitives and need their own benchmark envelopes.
| Capability | Tasks | Benchmark envelope |
|---|---|---|
| recognize | image classification | ImageNet-1K, ImageNet-R, ObjectNet |
| localize | object detection, keypoints | COCO, LVIS, KITTI, aerial/small-object sets |
| segment | semantic, instance, panoptic, promptable | ADE20K, COCO mask AP, SA-V, DAVIS |
| retrieve | image-text retrieval, image-image search | Flickr30k, MSCOCO retrieval, catalog recall@K |
| estimate geometry | depth, optical flow, pose | NYU Depth V2, KITTI, Sintel, MPII |
| track | multi-object tracking, video segmentation | MOTChallenge, DAVIS, MOSE |
| parse documents | OCR, layout, tables, formulas, reading order | OmniDocBench, DocVQA, IAM · routed to OCR |
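
One way to read the table above is that a vertical stack starts from the union of the envelopes of the primitives it combines. The sketch below encodes the primitive rows and merges them in order. The `stack_envelope` helper is hypothetical, and the result is only a starting checklist, since the text above says vertical stacks need their own benchmark envelopes on top.

```python
# Primitive capability rows copied from the table above.
ENVELOPES = {
    "recognize": ["ImageNet-1K", "ImageNet-R", "ObjectNet"],
    "localize": ["COCO", "LVIS", "KITTI", "aerial/small-object sets"],
    "segment": ["ADE20K", "COCO mask AP", "SA-V", "DAVIS"],
    "retrieve": ["Flickr30k", "MSCOCO retrieval", "catalog recall@K"],
    "estimate geometry": ["NYU Depth V2", "KITTI", "Sintel", "MPII"],
    "track": ["MOTChallenge", "DAVIS", "MOSE"],
}


def stack_envelope(primitives):
    """Merge the envelopes of several primitives, keeping first-seen order."""
    merged = []
    for primitive in primitives:
        for benchmark in ENVELOPES[primitive]:
            if benchmark not in merged:  # DAVIS and KITTI appear in two rows
                merged.append(benchmark)
    return merged
```

For a stack that segments and tracks, `stack_envelope(["segment", "track"])` lists DAVIS once even though both rows name it.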
Benchmark traps
Deeper, by task.
When the top-level ranking isn’t enough: the per-task register, the building block, the cost-vs-quality comparison pages.
- Browse · Computer Vision. Full benchmark data: every dataset, every result, every submission.
- Object Detection. YOLO, RT-DETR and DINO compared: real-time vs accuracy trade-offs.
- OCR · register. Document parsing as its own OCR/layout/table/formula stack.
- Image embeddings. CLIP, SigLIP and DINOv2: picking a vectoriser for search.
- Object detection block. Boxes out of pixels: the contract each detector has to meet.
- Segmentation block. SAM 2 and Mask2Former as interchangeable pixel-mask providers.
Neighbouring registers.
Other modality hubs on Codesota worth reading next.