Detecting and localizing objects in images with bounding boxes and class labels.
Microsoft COCO is the gold standard for large-scale object detection, segmentation, and captioning, with 330k+ images, 1.5M+ object instances, and 80 categories. The primary metric is box mAP: average precision averaged over 10 IoU thresholds, from 0.50 to 0.95 in steps of 0.05.
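To make the IoU thresholds concrete, here is a minimal sketch of how a single predicted box is compared against a ground-truth box. It is not the official COCO evaluation code, which also handles score ranking, per-category matching, and crowd regions. The function names `box_iou` and `thresholds_passed` and the `(x1, y1, x2, y2)` box format are assumptions chosen for illustration.

```python
# Illustrative sketch only: helper names and box format are assumptions,
# not part of any official COCO tooling.

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so non-overlapping boxes yield zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The 10 thresholds 0.50, 0.55, ..., 0.95 over which box mAP is averaged.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, gt):
    """Thresholds at which `pred` overlaps `gt` enough to count as a match."""
    iou = box_iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if iou >= t]


if __name__ == "__main__":
    gt = (0, 0, 100, 100)
    pred = (0, 0, 100, 77)  # covers 77% of the ground-truth box
    print(box_iou(pred, gt))
    print(thresholds_passed(pred, gt))
```

A prediction that overlaps its target loosely matches only at the lower thresholds, so averaging over all ten rewards tighter localization than a single IoU 0.5 cutoff would.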
Leading models on COCO.
| Rank | Model | Box mAP | Year | Source |
|---|---|---|---|---|
| 1 | ScyllaNet | 66.1 | 2026 | paper |
| 2 | co-detr-swin-l | 66.0 | 2025 | paper |
| 3 | CW_Detection | 66.0 | 2026 | paper |
| 4 | Thinker | 66.0 | 2026 | paper |
| 5 | SenseTime Basemodel | 66.0 | 2026 | paper |
| 6 | InternImage-H (OneFormer) | 65.5 | 2026 | paper |
| 7 | internimage-h | 65.4 | 2025 | paper |
| 8 | Focal-Stable-DINO | 64.6 | 2023 | paper |
| 9 | dino-swin-l | 63.3 | 2025 | paper |
| 10 | DINO-ViT-L | 63.3 | 2026 | paper |
3 datasets tracked for this task.