Object Detection
Object detection is a computer vision task that identifies and localizes objects within an image: the goal is to find every instance of given classes (such as people, buildings, or cars) in images and video. Detection models typically output a set of bounding boxes, each with a predicted class label and confidence score.
It is the backbone of autonomous driving, surveillance, and robotics. COCO mAP has climbed from 19.7% in the early R-CNN era (2014-15) to 65%+ (Co-DETR, 2024), and the field has split between closed-set detectors and open-vocabulary models that can find anything described in text.
History
R-CNN (Girshick et al.) combines selective search proposals with CNN features, achieving 53.7% mAP on PASCAL VOC 2010 — a large jump over prior methods — the first successful deep detector
Faster R-CNN introduces the Region Proposal Network (RPN), making detection end-to-end trainable at 5 FPS
SSD and YOLO (v1-v2) prove single-shot detection is viable for real-time (45+ FPS), trading accuracy for speed
Feature Pyramid Networks (FPN) solve multi-scale detection, and RetinaNet's focal loss fixes class imbalance in one-stage detectors — reaching 40.8% COCO AP
EfficientDet optimizes compound scaling for detection; FCOS proves anchor-free detection works, simplifying pipelines
DETR (Carion et al.) eliminates NMS and anchors entirely by casting detection as set prediction with transformers
DINO-DETR achieves 63.3% COCO AP, making transformer detectors decisively better than CNN-based ones for the first time
YOLOv8 (Ultralytics) and RT-DETR bridge the real-time gap — DETR-quality accuracy at YOLO-like speeds (100+ FPS)
Grounding DINO and OWLv2 enable open-vocabulary detection — find any object described in natural language without retraining
Co-DETR and Group-DETR push COCO AP above 65% with collaborative training; Florence-2 unifies detection with other vision tasks in a single model
How Object Detection Works
Backbone Feature Extraction
A pretrained backbone (ResNet-50, Swin Transformer, InternViT) processes the input image into multi-scale feature maps at 1/8, 1/16, and 1/32 resolution.
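Those resolutions can be made concrete with a quick shape calculation (a sketch assuming a ResNet-50-style backbone and a 640×640 input; channel counts follow ResNet-50's last three stages):

```python
# Stride-s stages of a ResNet-50-style backbone shrink a 640x640 input into
# feature maps at 1/8, 1/16 and 1/32 resolution.
input_size = 640
feature_shapes = {
    stride: (channels, input_size // stride, input_size // stride)
    for stride, channels in [(8, 512), (16, 1024), (32, 2048)]
}
print(feature_shapes[8])   # (512, 80, 80)  — finest map, used for small objects
print(feature_shapes[32])  # (2048, 20, 20) — coarsest map, used for large objects
```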
Neck / Feature Fusion
FPN or BiFPN merges multi-scale features top-down and bottom-up, ensuring small and large objects are represented at appropriate resolutions.
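The top-down half of that fusion can be sketched in a few lines of NumPy (nearest-neighbour upsampling stands in for interpolation; the 256-channel lateral maps are assumed, as in the FPN paper, and the values here are random placeholders):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling along the spatial axes.
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

# Hypothetical lateral features, already projected to 256 channels by 1x1 convs,
# at strides 8/16/32 of a 256x256 input.
rng = np.random.default_rng(0)
c3 = rng.random((256, 32, 32))
c4 = rng.random((256, 16, 16))
c5 = rng.random((256, 8, 8))

# Top-down pathway: coarse, semantically strong features flow downward and are
# summed with the finer lateral maps (a 3x3 smoothing conv would follow in FPN).
p5 = c5
p4 = c4 + upsample2x(p5)
p3 = c3 + upsample2x(p4)
```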
Proposal Generation or Query Matching
Two-stage detectors (Faster R-CNN) generate ~300 region proposals via RPN. Transformer detectors (DETR) use learned object queries (100-900) that attend to the feature map. Single-shot detectors (YOLO) predict directly on a dense grid.
Box Regression + Classification
Each proposal/query is refined into a bounding box (x, y, w, h) and classified. DETR uses bipartite matching (Hungarian algorithm) to assign predictions to ground truth; YOLO/SSD use anchor-based assignment with IoU thresholds.
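DETR's bipartite matching step can be illustrated with SciPy's Hungarian solver on a toy cost matrix (three queries vs. two ground-truth objects; the cost weights mirror DETR's defaults, but the numbers themselves are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: 3 predicted queries; columns: 2 ground-truth objects.
cls_cost = np.array([[0.9, 0.2],    # classification cost (1 - predicted prob)
                     [0.1, 0.8],
                     [0.5, 0.5]])
box_cost = np.array([[0.7, 0.1],    # L1 distance between boxes (illustrative)
                     [0.2, 0.6],
                     [0.4, 0.4]])
cost = 1.0 * cls_cost + 5.0 * box_cost  # DETR weights: 1 for class, 5 for L1

# Hungarian algorithm: each ground truth is assigned exactly one query; the
# unmatched queries are trained to predict "no object".
query_idx, gt_idx = linear_sum_assignment(cost)
print(query_idx, gt_idx)  # query 0 -> gt 1, query 1 -> gt 0; query 2 unmatched
```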
Post-Processing
Non-maximum suppression (NMS) removes duplicate boxes in anchor-based detectors. DETR avoids NMS entirely. Output: list of (box, class, confidence) tuples, evaluated with mAP at IoU thresholds 0.5:0.95.
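Greedy NMS itself fits in a few lines of NumPy (a minimal sketch with boxes in [x1, y1, x2, y2] format, not a production kernel):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping duplicates."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] — box 1 (IoU ≈ 0.68 with box 0) is suppressed
```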
Current Landscape
Object detection in 2025 is dominated by two parallel tracks: DETR-family transformers for maximum accuracy (Co-DETR, DINO-DETR) and the YOLO lineage for real-time deployment. The gap between them has narrowed dramatically — RT-DETR showed that transformer detectors can match YOLO speeds, and YOLOv8/v9 incorporated transformer ideas into the YOLO framework. Meanwhile, open-vocabulary detection (Grounding DINO, OWLv2) is disrupting the entire paradigm: instead of training a detector per domain, you describe what you want to find in text. Foundation models like Florence-2 are further blurring the boundary between detection, segmentation, and captioning.
Key Challenges
Small object detection — objects under 32×32 pixels make up roughly 41% of COCO instances, yet AP on small objects (AP_S) typically trails AP on large objects by 20+ points
Real-time inference constraints for autonomous driving (10-30ms latency budget) force painful accuracy/speed tradeoffs
Domain adaptation — detectors trained on COCO (everyday objects) fail on specialized domains like aerial imagery, medical scans, or manufacturing defects without significant fine-tuning
Crowded scenes with heavy occlusion (e.g., pedestrians in dense urban environments) cause proposal collision and NMS failures
Annotation cost — drawing bounding boxes takes 25-35 seconds per instance, making large-scale labeled datasets expensive to create
Quick Recommendations
Best accuracy (no latency constraint)
Co-DETR with Swin-L backbone
65%+ COCO mAP, best available closed-set detector; uses collaborative hybrid assignments for superior training
Real-time detection
YOLOv8-L or RT-DETR-L
54-56% COCO mAP at 100+ FPS on an A100; YOLOv8 for simpler deployment, RT-DETR for NMS-free inference
Open-vocabulary / zero-shot
Grounding DINO 1.5 or OWLv2
Detect any object described in text without retraining — critical for robotics, content moderation, and novel domains
Edge / mobile deployment
YOLOv8-N or NanoDet-Plus
~37% COCO mAP at 1.5-3M params, runs at 30+ FPS on mobile NPUs
Low-annotation regime
Grounding DINO + SAM
Use text prompts to generate pseudo-labels, then fine-tune a smaller detector — bootstraps detection without manual annotation
What's Next
The field is converging toward unified vision models that handle detection as one of many tasks (Florence-2, PaLI-X). Open-vocabulary detection will likely make closed-set training obsolete for most applications within 2-3 years. Active research frontiers include 3D object detection from monocular images (crucial for autonomous driving without LiDAR), temporal object detection in video (tracking + detection jointly), and detection foundation models that work zero-shot across wildly different domains like satellite imagery, microscopy, and underwater robotics.
Benchmarks & SOTA
COCO
Microsoft Common Objects in Context
Microsoft COCO is the gold standard for large-scale object detection, segmentation, and captioning, with 330k+ images, 1.5M+ object instances, and 80 categories. Primary metric is box mAP averaged over 10 IoU thresholds (0.5:0.95).
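AP at a single IoU threshold reduces to integrating precision over recall; a simplified all-point-interpolation version is sketched below (COCO additionally averages this over the 10 IoU thresholds and 80 classes, and handles per-image matching — omitted here):

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP at one IoU threshold: area under the interpolated precision-recall curve."""
    order = np.argsort(scores)[::-1]          # rank detections by confidence
    hits = np.asarray(is_tp, float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # Interpolate: precision at each recall level is the max precision to its right.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate([[0.0], recall])) * precision))

# 3 detections, 2 ground-truth objects; the middle detection is a false positive.
ap = average_precision(scores=[0.9, 0.8, 0.7], is_tp=[1, 0, 1], n_gt=2)
print(round(ap, 4))  # 0.8333
```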
State of the Art
ScyllaNet
Scylla Technologies
66.12
box-map
LVIS v1.0
Large Vocabulary Instance Segmentation v1.0
1,203 object categories with federated, long-tail distribution across 164K COCO images. Tests real-world detection with rare and fine-grained categories.
State of the Art
DINO-X
IDEA Research
71.4
box-ap
Pascal VOC 2012
Pascal Visual Object Classes Challenge 2012
11,530 images with 27,450 ROI annotated objects and 6,929 segmentations. Classic object detection benchmark.
State of the Art
SSD512 (VGG-16)
Google / UNC
80
mAP-coco-pretrain
ImageNet Detection (ILSVRC DET)
ImageNet Large Scale Visual Recognition Challenge — Detection (ILSVRC DET)
ImageNet Detection (commonly called ILSVRC DET) is the object detection track of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It provides bounding-box annotations for images across 200 object categories and was used as a large-scale benchmark for object detection in ILSVRC competitions (2012–2017). Models are evaluated with detection metrics (mean Average Precision, commonly reported at IoU = 0.5 / mAP@0.5, following the ILSVRC evaluation protocol). The dataset and challenge are described in the ILSVRC overview paper (Russakovsky et al., 2014) and on the ImageNet challenge website, which hosts the list of 200 detection synsets, development kits and per-year results.
No results tracked yet
ImageNet Localization (ILSVRC LOC)
ImageNet Large Scale Visual Recognition Challenge — Localization (ILSVRC LOC)
ImageNet Localization (ILSVRC LOC) is the localization subset of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It provides per-image annotations (bounding boxes) for target object instances across the 1,000 ILSVRC categories and is used to evaluate object localization performance (commonly reported as top-5 localization error %). The localization task requires a model to both classify the primary object in an image and provide its bounding box (typically one localized box per image in the ILSVRC LOC setup). The dataset and challenge are described in the original ImageNet paper (Deng et al., CVPR 2009) and in the ILSVRC challenge overview (Russakovsky et al., arXiv:1409.0575).
No results tracked yet
DIOR
DIOR (Dataset for Object detection in Optical Remote sensing images)
DIOR is a large-scale benchmark dataset for object detection in optical remote sensing (aerial/satellite) images. It contains approximately 23,463 images (800×800 px) and ~192,472 axis-aligned object instances covering 20 object categories (e.g., airplane, airport, ship, bridge, stadium, vehicle, windmill, storage tank, dam, chimney, golf course, tennis court, baseball field, basketball court, expressway toll station/service area, harbor, overpass, ground track field, train station). Images have varying spatial resolutions (~0.5 m to 30 m). Standard splits are provided (training, validation, test — commonly reported splits: train ~5,862, val ~5,863, test ~11,725). DIOR is typically evaluated using object-detection metrics such as mean Average Precision (mAP). A rotated-box variant (DIOR-R) with oriented bounding-box annotations has also been released and adopted by the community.
No results tracked yet
COCO val2017
COCO 2017 Object Detection (validation split)
COCO 2017 validation split (5K images) for object detection evaluation. This dataset is specifically used for object detection tasks, where models are evaluated on their ability to detect and localize objects in images using bounding boxes.
No results tracked yet
COCO 2014 val
COCO 2014 Validation Split
COCO 2014 validation split.
No results tracked yet
Roboflow100-VL (RF100-VL)
Roboflow100-VL (RF100-VL)
Roboflow100-VL (RF100-VL) is a multi-domain object-detection benchmark designed to evaluate vision-language models (VLMs) on diverse, out-of-distribution concepts and imaging modalities. The benchmark aggregates 100 heterogeneous object-detection datasets (drawn from Roboflow/Roboflow Universe collections) spanning domains such as medical imagery (X-ray), thermal, aerial, industrial inspection, synthetic/game imagery, and more. The paper reports aggregate metrics (e.g., AP, latency, FLOPs) averaged across all 100 tasks and evaluates models in zero-shot, few-shot, semi-supervised, and fully supervised settings; the project provides code, dataset interfaces (PyPI package rf100vl), and a public website. Primary sources: paper (arXiv:2505.20612), project site (https://rf100-vl.org), code repository (https://github.com/roboflow/rf100-vl), and a Hugging Face mirror/hosted collection (https://hf.co/datasets/gatilin/rf100-vl).
No results tracked yet
COCO test-dev
COCO test-dev Split
COCO test-dev evaluation split used for benchmark submissions and leaderboard rankings.
No results tracked yet
PASCAL VOC 2007
PASCAL Visual Object Classes (VOC) Challenge 2007
PASCAL VOC 2007 (PASCAL Visual Object Classes Challenge 2007) is a standard benchmark dataset for object detection, classification and segmentation. VOC2007 contains 9,963 images with annotations for 20 object classes (e.g., person, car, bicycle, dog) and about 24,640 annotated object instances. Annotations include class labels, object bounding boxes and (for some images) pixel-level segmentation masks, plus object attributes such as "difficult" and "truncated". The dataset is provided with standard train/val/test splits (the official VOC2007 test annotations were held out on the evaluation server), and the canonical detection evaluation metric reported on this dataset is mean Average Precision (mAP) computed using the PASCAL VOC protocol (AP at IoU 0.5). VOC2007 is widely used for benchmarking object detection models and is often combined with VOC2012 or COCO for additional training (e.g., VOC07+12 or COCO+07+12).
No results tracked yet
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
Open-Vocabulary Object Detection
Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.
Object counting
Object counting is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection with deep learning models such as convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation into a total count. Applications include manufacturing quality control and production monitoring.
Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.