
Object Detection

Object detection — finding what's in an image and where — is the backbone of autonomous vehicles, surveillance, and robotics. The two-stage R-CNN lineage (2014–2017) gave way to single-shot detectors like YOLO, now in its 11th iteration and still getting faster. DETR (2020) proved transformers could replace hand-designed components like NMS entirely, spawning a family of end-to-end detectors that dominate COCO leaderboards above 60 mAP. The field's current obsession: open-vocabulary detection that works on any object described in natural language, not just fixed categories.


Object detection localizes and classifies multiple objects in an image with bounding boxes. COCO mAP has climbed from 19.7% (Fast R-CNN, 2015) to 65%+ (Co-DETR, 2024), and the field has split between closed-set detectors and open-vocabulary models that can find anything described in text.

History

2014

R-CNN (Girshick et al.) combines selective search proposals with CNN features, achieving 31.4% mAP on the ILSVRC2013 detection set — the first successful deep detector

2015

Faster R-CNN introduces the Region Proposal Network (RPN), making detection end-to-end trainable at 5 FPS

2016

SSD and YOLO (v1-v2) prove single-shot detection is viable for real-time (45+ FPS), trading accuracy for speed

2017

Feature Pyramid Networks (FPN) solve multi-scale detection, and RetinaNet's focal loss fixes class imbalance in one-stage detectors — reaching 40.8% COCO AP

2019

EfficientDet optimizes compound scaling for detection; FCOS proves anchor-free detection works, simplifying pipelines

2020

DETR (Carion et al.) eliminates NMS and anchors entirely by casting detection as set prediction with transformers

2022

DINO-DETR achieves 63.3% COCO AP, making transformer detectors decisively better than CNN-based ones for the first time

2023

YOLOv8 (Ultralytics) and RT-DETR bridge the real-time gap — DETR-quality accuracy at YOLO-like speeds (100+ FPS)

2024

Grounding DINO and OWLv2 enable open-vocabulary detection — find any object described in natural language without retraining

2025

Co-DETR and Group-DETR push COCO AP above 65% with collaborative training; Florence-2 unifies detection with other vision tasks in a single model

How Object Detection Works

Object Detection Pipeline
1. Backbone Feature Extraction

A pretrained backbone (ResNet-50, Swin Transformer, InternViT) processes the input image into multi-scale feature maps at 1/8, 1/16, and 1/32 resolution.
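The multi-scale output can be sketched by computing the shapes each pyramid level produces; strides of 8/16/32 match the text, and the channel counts are illustrative (they loosely follow ResNet-50's C3–C5 stages):

```python
# Sketch: shapes of multi-scale backbone feature maps for one input image.
# Strides (8, 16, 32) are from the text; channel counts are illustrative,
# loosely matching ResNet-50's C3/C4/C5 stages (512/1024/2048 channels).

def feature_map_shapes(height, width, strides=(8, 16, 32),
                       channels=(512, 1024, 2048)):
    """Return (channels, H/stride, W/stride) for each pyramid level."""
    return [(c, height // s, width // s) for s, c in zip(strides, channels)]

shapes = feature_map_shapes(480, 640)
# stride 8  -> (512, 60, 80)
# stride 16 -> (1024, 30, 40)
# stride 32 -> (2048, 15, 20)
```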

2. Neck / Feature Fusion

FPN or BiFPN merges multi-scale features top-down and bottom-up, ensuring small and large objects are represented at appropriate resolutions.
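A minimal sketch of FPN's top-down pass, assuming the 1×1 lateral convolutions have already projected all levels to a common channel count, and standing in nearest-neighbor upsampling for the learned path:

```python
import numpy as np

def fpn_top_down(c3, c4, c5):
    """Minimal FPN top-down pass (upsample coarse level 2x, add to the
    lateral feature). Lateral 1x1 convs and output 3x3 convs from the
    real FPN are omitted; inputs are assumed pre-projected to the same
    channel count."""
    p5 = c5
    p4 = c4 + np.repeat(np.repeat(p5, 2, axis=1), 2, axis=2)  # 2x upsample + add
    p3 = c3 + np.repeat(np.repeat(p4, 2, axis=1), 2, axis=2)
    return p3, p4, p5

# Toy feature maps at strides 8/16/32 for an illustrative 64x64 input
c3 = np.ones((4, 8, 8)); c4 = np.ones((4, 4, 4)); c5 = np.ones((4, 2, 2))
p3, p4, p5 = fpn_top_down(c3, c4, c5)
print(p3.shape)  # (4, 8, 8): finest level now carries coarse-level context
```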

3. Proposal Generation or Query Matching

Two-stage detectors (Faster R-CNN) generate ~300 region proposals via RPN. Transformer detectors (DETR) use learned object queries (100-900) that attend to the feature map. Single-shot detectors (YOLO) predict directly on a dense grid.
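The dense-grid alternative can be illustrated by enumerating cell centers; a YOLO-style head predicts one (or a few) boxes per cell at each stride (image size and stride here are illustrative):

```python
def grid_centers(height, width, stride):
    """Centers, in image pixels, of each cell in a dense prediction grid.
    Single-shot detectors attach box predictions to every such cell."""
    return [((x + 0.5) * stride, (y + 0.5) * stride)
            for y in range(height // stride)
            for x in range(width // stride)]

centers = grid_centers(64, 64, stride=32)
print(len(centers), centers[0])  # 4 cells; first center at (16.0, 16.0)
```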

4. Box Regression + Classification

Each proposal/query is refined into a bounding box (x, y, w, h) and classified. DETR uses bipartite matching (Hungarian algorithm) to assign predictions to ground truth; YOLO/SSD use anchor-based assignment with IoU thresholds.
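DETR's one-to-one assignment can be sketched with a brute-force search over permutations (real implementations use the Hungarian algorithm, e.g. SciPy's linear_sum_assignment); the cost below is a simplified L1 box term only, omitting DETR's classification and GIoU cost terms:

```python
from itertools import permutations

def match_cost(pred, gt):
    """Simplified matching cost: L1 distance between (x, y, w, h) boxes."""
    return sum(abs(p - g) for p, g in zip(pred, gt))

def bipartite_match(preds, gts):
    """Assign one distinct prediction to each ground-truth box so the total
    cost is minimal. Exhaustive O(n!) search for clarity; the Hungarian
    algorithm does this in O(n^3)."""
    best = min(permutations(range(len(preds)), len(gts)),
               key=lambda perm: sum(match_cost(preds[i], gts[j])
                                    for j, i in enumerate(perm)))
    return list(best)  # best[j] = index of the prediction matched to gt j

preds = [(0.9, 0.9, 0.2, 0.2), (0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
gts = [(0.12, 0.1, 0.2, 0.2), (0.5, 0.52, 0.3, 0.3)]
print(bipartite_match(preds, gts))  # [1, 2]: pred 0 is left unmatched ("no object")
```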

5. Post-Processing

Non-maximum suppression (NMS) removes duplicate boxes in anchor-based detectors. DETR avoids NMS entirely. Output: list of (box, class, confidence) tuples, evaluated with mAP at IoU thresholds 0.5:0.95.
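A minimal greedy NMS pass, with boxes as (x1, y1, x2, y2) corner coordinates and an illustrative IoU threshold of 0.5:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it above the threshold, repeat with the next survivor."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 suppressed (IoU with box 0 ~ 0.68)
```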

Current Landscape

Object detection in 2025 is dominated by two parallel tracks: DETR-family transformers for maximum accuracy (Co-DETR, DINO-DETR) and the YOLO lineage for real-time deployment. The gap between them has narrowed dramatically — RT-DETR showed that transformer detectors can match YOLO speeds, and YOLOv8/v9 incorporated transformer ideas into the YOLO framework. Meanwhile, open-vocabulary detection (Grounding DINO, OWLv2) is disrupting the entire paradigm: instead of training a detector per domain, you describe what you want to find in text. Foundation models like Florence-2 are further blurring the boundary between detection, segmentation, and captioning.

Key Challenges

Small object detection — objects under 32×32 pixels account for 41% of COCO annotations, yet AP on small objects (COCO AP_S) typically trails large-object AP by 20-30 points

Real-time inference constraints for autonomous driving (10-30ms latency budget) force painful accuracy/speed tradeoffs

Domain adaptation — detectors trained on COCO (everyday objects) fail on specialized domains like aerial imagery, medical scans, or manufacturing defects without significant fine-tuning

Crowded scenes with heavy occlusion (e.g., pedestrians in dense urban environments) cause proposal collision and NMS failures

Annotation cost — drawing bounding boxes takes 25-35 seconds per instance, making large-scale labeled datasets expensive to create

Quick Recommendations

Best accuracy (no latency constraint)

Co-DETR with Swin-L backbone

65%+ COCO mAP, best available closed-set detector; uses collaborative hybrid assignments for superior training

Real-time detection

YOLOv8-L or RT-DETR-L

52-54% COCO mAP at 100+ FPS on an A100; YOLOv8 for simpler deployment, RT-DETR for NMS-free inference

Open-vocabulary / zero-shot

Grounding DINO 1.5 or OWLv2

Detect any object described in text without retraining — critical for robotics, content moderation, and novel domains

Edge / mobile deployment

YOLOv8-N or NanoDet-Plus

~37% COCO mAP at 1.5-3M params, runs at 30+ FPS on mobile NPUs

Low-annotation regime

Grounding DINO + SAM

Use text prompts to generate pseudo-labels, then fine-tune a smaller detector — bootstraps detection without manual annotation

What's Next

The field is converging toward unified vision models that handle detection as one of many tasks (Florence-2, PaLI-X). Open-vocabulary detection will likely make closed-set training obsolete for most applications within 2-3 years. Active research frontiers include 3D object detection from monocular images (crucial for autonomous driving without LiDAR), temporal object detection in video (tracking + detection jointly), and detection foundation models that work zero-shot across wildly different domains like satellite imagery, microscopy, and underwater robotics.
