Object Detection
Object detection localizes and classifies multiple objects in an image with bounding boxes — finding what's in an image and where. It is the backbone of autonomous driving, surveillance, and robotics. The two-stage R-CNN lineage (2014–2017) gave way to single-shot detectors like YOLO, now in its 11th iteration and still getting faster, while DETR (2020) proved transformers could replace hand-designed components like NMS entirely, spawning a family of end-to-end detectors that dominate COCO leaderboards above 60% mAP. COCO mAP has climbed from 19.7% (R-CNN era, 2014) to 65%+ (Co-DETR, 2024), and the field has split between closed-set detectors and open-vocabulary models that find anything described in natural language rather than a fixed category list.
History
2014: R-CNN (Girshick et al.) combines selective-search proposals with CNN features, achieving 31.4% mAP on the ILSVRC2013 detection benchmark — the first widely adopted deep detector
2015: Faster R-CNN introduces the Region Proposal Network (RPN), making detection end-to-end trainable at roughly 5 FPS
2016: SSD and YOLO (v1-v2) prove single-shot detection is viable for real-time use (45+ FPS), trading accuracy for speed
2017: Feature Pyramid Networks (FPN) tackle multi-scale detection, and RetinaNet's focal loss fixes class imbalance in one-stage detectors — reaching 40.8% COCO AP
2019-2020: FCOS proves anchor-free detection works, simplifying pipelines; EfficientDet optimizes compound scaling for detection
2020: DETR (Carion et al.) eliminates NMS and anchors entirely by casting detection as set prediction with transformers
2022: DINO-DETR achieves 63.3% COCO AP, making transformer detectors decisively better than CNN-based ones for the first time
2023: YOLOv8 (Ultralytics) and RT-DETR bridge the real-time gap — DETR-quality accuracy at YOLO-like speeds (100+ FPS)
2023: Grounding DINO and OWLv2 enable open-vocabulary detection — find any object described in natural language without retraining
2023-2024: Co-DETR and Group-DETR push COCO AP above 65% with collaborative training; Florence-2 unifies detection with other vision tasks in a single model
How Object Detection Works
Backbone Feature Extraction
A pretrained backbone (ResNet-50, Swin Transformer, InternViT) processes the input image into multi-scale feature maps at 1/8, 1/16, and 1/32 resolution.
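The spatial sizes of those multi-scale maps follow directly from the strides — a minimal plain-Python sketch (no specific framework assumed):

```python
def feature_map_sizes(height, width, strides=(8, 16, 32)):
    """Spatial size of each backbone feature map for a given input image."""
    return [(height // s, width // s) for s in strides]

# A 640x640 input yields 80x80, 40x40, and 20x20 feature maps
# at strides 8, 16, and 32 respectively.
print(feature_map_sizes(640, 640))  # [(80, 80), (40, 40), (20, 20)]
```

The stride-8 map keeps the fine detail needed for small objects; the stride-32 map gives large receptive fields for big ones.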
Neck / Feature Fusion
FPN or BiFPN merges multi-scale features top-down and bottom-up, ensuring small and large objects are represented at appropriate resolutions.
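A minimal sketch of one top-down FPN step, operating on nested lists for clarity — a real FPN wraps this in learned 1x1 lateral and 3x3 output convolutions, which are omitted here:

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def topdown_merge(coarse, lateral):
    """One FPN top-down step: upsample the coarser level, add the lateral map."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(urow, lrow)]
            for urow, lrow in zip(up, lateral)]

coarse = [[1.0]]             # 1x1 map from the stride-32 level
lateral = [[0.5, 0.5],
           [0.5, 0.5]]       # 2x2 map from the stride-16 level
print(topdown_merge(coarse, lateral))  # [[1.5, 1.5], [1.5, 1.5]]
```

Repeating this step down the pyramid injects coarse semantic context into the high-resolution levels where small objects live.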
Proposal Generation or Query Matching
Two-stage detectors (Faster R-CNN) generate ~300 region proposals via RPN. Transformer detectors (DETR) use learned object queries (100-900) that attend to the feature map. Single-shot detectors (YOLO) predict directly on a dense grid.
Box Regression + Classification
Each proposal/query is refined into a bounding box (x, y, w, h) and classified. DETR uses bipartite matching (Hungarian algorithm) to assign predictions to ground truth; YOLO/SSD use anchor-based assignment with IoU thresholds.
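To make the matching objective concrete: DETR seeks the one-to-one assignment of predictions to ground-truth boxes that minimizes total matching cost. DETR solves this with the Hungarian algorithm (O(n^3)); the brute-force version below is only to make the objective explicit on a tiny example, with an illustrative hand-written cost matrix:

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Minimum-cost one-to-one assignment of predictions to ground-truth boxes.
    Brute force over permutations -- feasible only for tiny n."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best

# cost[i][j]: e.g. classification loss + L1 box loss + (1 - GIoU)
# for prediction i against ground-truth box j (values here are made up).
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.9],
        [0.8, 0.8, 0.1]]
perm, total = min_cost_assignment(cost)
print(perm)  # (1, 0, 2): pred 0 -> GT 1, pred 1 -> GT 0, pred 2 -> GT 2
```

Unmatched queries are trained to predict a special "no object" class, which is what lets DETR skip NMS.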
Post-Processing
Non-maximum suppression (NMS) removes duplicate boxes in anchor-based detectors. DETR avoids NMS entirely. Output: list of (box, class, confidence) tuples, evaluated with mAP at IoU thresholds 0.5:0.95.
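Greedy NMS and the IoU it depends on fit in a few lines of plain Python — production code uses vectorized implementations such as `torchvision.ops.nms`, but the logic is the same:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- box 1 suppressed as a duplicate of box 0
```

The same `iou` function underlies evaluation: a prediction counts as a true positive only if its IoU with a ground-truth box exceeds the threshold, and COCO mAP averages over thresholds from 0.5 to 0.95.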
Current Landscape
Object detection in 2025 is dominated by two parallel tracks: DETR-family transformers for maximum accuracy (Co-DETR, DINO-DETR) and the YOLO lineage for real-time deployment. The gap between them has narrowed dramatically — RT-DETR showed that transformer detectors can match YOLO speeds, and YOLOv8/v9 incorporated transformer ideas into the YOLO framework. Meanwhile, open-vocabulary detection (Grounding DINO, OWLv2) is disrupting the entire paradigm: instead of training a detector per domain, you describe what you want to find in text. Foundation models like Florence-2 are further blurring the boundary between detection, segmentation, and captioning.
Key Challenges
Small object detection — objects under 32×32 pixels make up roughly 41% of COCO annotations, yet small-object AP (AP_S) remains far below overall AP; most detectors struggle here
Real-time inference constraints for autonomous driving (10-30ms latency budget) force painful accuracy/speed tradeoffs
Domain adaptation — detectors trained on COCO (everyday objects) fail on specialized domains like aerial imagery, medical scans, or manufacturing defects without significant fine-tuning
Crowded scenes with heavy occlusion (e.g., pedestrians in dense urban environments) cause proposal collision and NMS failures
Annotation cost — drawing bounding boxes takes 25-35 seconds per instance, making large-scale labeled datasets expensive to create
Quick Recommendations
Best accuracy (no latency constraint)
Co-DETR with Swin-L backbone
65%+ COCO mAP, best available closed-set detector; uses collaborative hybrid assignments for superior training
Real-time detection
YOLOv8-L or RT-DETR-L
54-56% COCO mAP at 100+ FPS on an A100; YOLOv8 for simpler deployment, RT-DETR for NMS-free inference
Open-vocabulary / zero-shot
Grounding DINO 1.5 or OWLv2
Detect any object described in text without retraining — critical for robotics, content moderation, and novel domains
Edge / mobile deployment
YOLOv8-N or NanoDet-Plus
~37% COCO mAP at 1.5-3M params, runs at 30+ FPS on mobile NPUs
Low-annotation regime
Grounding DINO + SAM
Use text prompts to generate pseudo-labels, then fine-tune a smaller detector — bootstraps detection without manual annotation
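Whatever open-vocabulary model produces the raw boxes, the bookkeeping step of the pseudo-labeling recipe is the same: drop low-confidence detections and write the rest in the target trainer's label format. A sketch assuming YOLO-style label lines (`class cx cy w h`, normalized to [0, 1]) and a hypothetical detector output tuple:

```python
def to_yolo_labels(detections, img_w, img_h, score_thresh=0.5):
    """Convert (x1, y1, x2, y2, score, class_id) pseudo-detections into
    YOLO-format label lines: 'class cx cy w h' with coordinates normalized
    to [0, 1]. Low-confidence boxes are dropped before they pollute training."""
    lines = []
    for x1, y1, x2, y2, score, cls in detections:
        if score < score_thresh:
            continue
        cx = (x1 + x2) / 2 / img_w
        cy = (y1 + y2) / 2 / img_h
        w = (x2 - x1) / img_w
        h = (y2 - y1) / img_h
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines

# One confident box and one noisy one on a 640x480 image:
dets = [(100, 100, 300, 200, 0.92, 0), (5, 5, 20, 20, 0.31, 3)]
print(to_yolo_labels(dets, 640, 480))
```

The threshold is the main knob: set it too low and label noise dominates; too high and rare classes vanish from the pseudo-labeled set.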
What's Next
The field is converging toward unified vision models that handle detection as one of many tasks (Florence-2, PaLI-X). Open-vocabulary detection will likely make closed-set training obsolete for most applications within 2-3 years. Active research frontiers include 3D object detection from monocular images (crucial for autonomous driving without LiDAR), temporal object detection in video (tracking + detection jointly), and detection foundation models that work zero-shot across wildly different domains like satellite imagery, microscopy, and underwater robotics.
Benchmarks & SOTA
LVIS v1.0 (Large Vocabulary Instance Segmentation v1.0)
1,203 object categories with a federated, long-tail distribution across 164K COCO images. Tests real-world detection with rare and fine-grained categories.
State of the Art: DINO-X (IDEA Research), 71.4 box AP
COCO (Microsoft COCO: Common Objects in Context)
330K images, 1.5 million object instances, 80 object categories. The standard benchmark for object detection and segmentation.
State of the Art: Co-DETR (Swin-L), 66.0 mAP
Pascal VOC 2012 (Pascal Visual Object Classes Challenge 2012)
11,530 images with 27,450 ROI-annotated objects and 6,929 segmentations. Classic object detection benchmark.
State of the Art: SSD512 (VGG-16, Google / UNC), 80.0 mAP with COCO pretraining