Benchmark · ECCV 2014 · Microsoft Research

COCO Benchmark

Microsoft Common Objects in Context (COCO) is the gold standard for large-scale object detection, segmentation, and captioning. With over 330,000 images and 1.5 million object instances, it challenges models to understand complex scenes in realistic environments.

SOTA AP: 66.12 · Categories: 80 · Instances: 1.5M · Citations: 45k+

The Standard for Scene Understanding

Before COCO, datasets like PASCAL VOC focused on iconic views of objects. COCO shifted the paradigm toward contextual understanding. Images contain multiple objects, often small, occluded, or in complex backgrounds. This forced the development of Feature Pyramid Networks (FPN) and more robust backbones.

Scale Variation

Objects range from a few pixels to the entire frame, requiring multi-scale feature extraction.

Non-Iconic Views

Objects are shown in natural settings, often partially hidden or at unusual angles.

Evaluation Metric

Primary Metric: AP

COCO uses Average Precision (AP) averaged over 10 IoU thresholds (0.50 to 0.95 with 0.05 steps). This rewards models with high localization accuracy.

  • AP50: AP at IoU = 0.50
  • APS: AP for small objects (area < 32² px)
  • APM: AP for medium objects (32² ≤ area < 96² px)
  • APL: AP for large objects (area ≥ 96² px)
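As an illustrative sketch (not the official pycocotools implementation), the IoU-threshold sweep behind COCO's primary metric can be shown for a single predicted box:

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# COCO's 10 IoU thresholds: 0.50, 0.55, ..., 0.95
THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

def match_fraction(gt_box, pred_box):
    """Fraction of the 10 thresholds at which pred counts as a true positive."""
    score = iou(gt_box, pred_box)
    return sum(score >= t for t in THRESHOLDS) / len(THRESHOLDS)
```

A prediction that overlaps its ground truth with IoU 0.5 is a true positive at only 1 of the 10 thresholds, so averaging over thresholds heavily rewards tight localization.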

SOTA Evolution

The journey from early CNNs to modern Vision Transformers.

mAP score by year (box AP, test-dev):

Year  Model         AP
2015  Faster R-CNN  37.4
2017  Mask R-CNN    39.8
2020  DETR          43.3
2021  Swin-L        58.9
2023  DINO          63.3
2025  ScyllaNet     66.1

Detection Leaderboard

Rank  Model                       Organization          Date     AP
#1    ScyllaNet                   Scylla Technologies   2025-09  66.1
#2    CW_Detection                Independent           2025-01  66.0
#3    SenseTime Basemodel         SenseTime             2024-11  66.0
#4    Thinker                     UBTECH                2024-08  66.0
#5    InternImage-H (OneFormer)   PJLab & Tsinghua      2024-03  65.5
#6    DINO-ViT-L                  IDEA-Research         2023-03  63.3
#7    ViT-Adapter-L               Nanjing University    2022-11  60.5
#8    Swin-L (Cascade R-CNN)      Microsoft Research    2021-07  58.9

Error Analysis


Most modern detectors struggle with False Positives on background textures and Localization Errors for small objects. COCO's analysis tools categorize errors into: Clutter, Similar Categories, and Poor Localization.

Speed vs. Accuracy


While SOTA models reach 60+ AP, they often run at < 5 FPS. Real-time models like YOLOv11 or RT-DETR target the 45-55 AP range while maintaining 100+ FPS on modern GPUs.
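FPS figures like these are typically measured as end-to-end throughput. A minimal, hypothetical timing harness illustrates the idea; the detector here is a stand-in function, not a real model:

```python
import time

def measure_fps(model, frames, warmup=3):
    """Average frames per second of `model` over `frames`, after warmup runs."""
    for f in frames[:warmup]:       # warm caches / lazy init before timing
        model(f)
    start = time.perf_counter()
    for f in frames:
        model(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

def dummy_detector(frame):
    """Stand-in 'inference' taking roughly 1 ms per frame."""
    time.sleep(0.001)
    return []

fps = measure_fps(dummy_detector, frames=[None] * 20)
```

In practice, batch size, input resolution, and GPU warmup dominate such measurements, which is why published FPS numbers are only comparable under matched settings.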

Dataset Variants

  • COCO 2017 Core (active): 118k train / 5k val images. Standard object detection & segmentation benchmark.
  • COCO-Stuff (extension): 164k images. Adds 91 "stuff" categories (sky, grass, wall) for semantic context.
  • COCO-Keypoints (extension): 250k people. Human pose estimation with 17 annotated keypoints.
  • COCO-Captions (extension): 330k images. Five natural-language descriptions per image for multimodal tasks.
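All variants share one annotation format: a single JSON file with `images`, `annotations`, and `categories` arrays, where each detection annotation carries an `[x, y, width, height]` bbox. A standard-library-only sketch of reading it (the tiny inline dict is illustrative, not real COCO data):

```python
import json

# Illustrative stand-in for a file such as annotations/instances_val2017.json
coco = json.loads("""{
  "images": [{"id": 1, "file_name": "000000000001.jpg", "width": 640, "height": 480}],
  "annotations": [{"id": 10, "image_id": 1, "category_id": 18,
                   "bbox": [10.0, 20.0, 100.0, 50.0], "area": 5000.0, "iscrowd": 0}],
  "categories": [{"id": 18, "name": "dog", "supercategory": "animal"}]
}""")

# Index categories by id and group annotations by image
cat_name = {c["id"]: c["name"] for c in coco["categories"]}
by_image = {}
for ann in coco["annotations"]:
    by_image.setdefault(ann["image_id"], []).append(ann)

for img in coco["images"]:
    for ann in by_image.get(img["id"], []):
        x, y, w, h = ann["bbox"]   # COCO bboxes are [x, y, width, height], not corners
        print(img["file_name"], cat_name[ann["category_id"]], (x, y, w, h))
```

Note the bbox convention: top-left corner plus width/height, unlike the corner-pair format many frameworks use internally.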


Comparison with Other Benchmarks

Benchmark    Focus                  Key Difference
LVIS         Long-tail recognition  1,000+ categories; addresses class imbalance better than COCO.
PASCAL VOC   Early detection        Smaller scale (20 classes), mostly centered objects.
Open Images  Massive scale          9M images; uses image-level labels and bounding boxes.

Ready to Benchmark?

Download the COCO 2017 dataset and start training your models. Use the official API for standardized evaluation.
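Detections are submitted for evaluation as a flat JSON list of result dicts, which the official pycocotools evaluator consumes. A sketch of writing one (the IDs and scores here are made up for illustration):

```python
import json

# One dict per detection; image_id / category_id must match the annotation file
detections = [
    {"image_id": 1, "category_id": 18, "bbox": [10.0, 20.0, 100.0, 50.0], "score": 0.92},
    {"image_id": 1, "category_id": 1,  "bbox": [200.0, 30.0, 40.0, 90.0], "score": 0.71},
]

with open("results.json", "w") as f:
    json.dump(detections, f)

# Standardized evaluation with the official API would then be (requires pycocotools):
#   from pycocotools.coco import COCO
#   from pycocotools.cocoeval import COCOeval
#   gt = COCO("annotations/instances_val2017.json")
#   dt = gt.loadRes("results.json")
#   ev = COCOeval(gt, dt, iouType="bbox")
#   ev.evaluate(); ev.accumulate(); ev.summarize()
```

`summarize()` prints the full metric suite from the Evaluation Metric section above: AP, AP50, AP75, and the small/medium/large breakdowns.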