Object Detection
Object detection is a computer vision task that identifies and localizes objects within an image: the goal is to find every instance of given classes (such as people, buildings, or cars) in images and video. Detection models typically output a set of bounding boxes, each with a predicted class label and confidence score.
It is the backbone of autonomous driving, surveillance, and robotics. COCO mAP has climbed from 19.7% in the early R-CNN era (2014-15) to 65%+ (Co-DETR, 2024), and the field has split between closed-set detectors and open-vocabulary models that can find anything described in text.
History
R-CNN (Girshick et al.) combines selective search proposals with CNN features, achieving 53.7% mAP on PASCAL VOC 2010 — a large jump over prior methods — the first successful deep detector
Faster R-CNN introduces the Region Proposal Network (RPN), making detection end-to-end trainable at 5 FPS
SSD and YOLO (v1-v2) prove single-shot detection is viable for real-time (45+ FPS), trading accuracy for speed
Feature Pyramid Networks (FPN) solve multi-scale detection, and RetinaNet's focal loss fixes class imbalance in one-stage detectors — reaching 40.8% COCO AP
EfficientDet optimizes compound scaling for detection; FCOS proves anchor-free detection works, simplifying pipelines
DETR (Carion et al.) eliminates NMS and anchors entirely by casting detection as set prediction with transformers
DINO-DETR achieves 63.3% COCO AP, making transformer detectors decisively better than CNN-based ones for the first time
YOLOv8 (Ultralytics) and RT-DETR bridge the real-time gap — DETR-quality accuracy at YOLO-like speeds (100+ FPS)
Grounding DINO and OWLv2 enable open-vocabulary detection — find any object described in natural language without retraining
Co-DETR and Group-DETR push COCO AP above 65% with collaborative training; Florence-2 unifies detection with other vision tasks in a single model
How Object Detection Works
Backbone Feature Extraction
A pretrained backbone (ResNet-50, Swin Transformer, InternViT) processes the input image into multi-scale feature maps at 1/8, 1/16, and 1/32 resolution.
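Those resolutions can be made concrete with a quick shape calculation (a sketch assuming a ResNet-50-style backbone and a 640×640 input; channel counts follow ResNet-50's last three stages):

```python
# Stride-s stages of a ResNet-50-style backbone shrink a 640x640 input into
# feature maps at 1/8, 1/16 and 1/32 resolution.
input_size = 640
feature_shapes = {
    stride: (channels, input_size // stride, input_size // stride)
    for stride, channels in [(8, 512), (16, 1024), (32, 2048)]
}
print(feature_shapes[8])   # (512, 80, 80)  — finest map, used for small objects
print(feature_shapes[32])  # (2048, 20, 20) — coarsest map, used for large objects
```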
Neck / Feature Fusion
FPN or BiFPN merges multi-scale features top-down and bottom-up, ensuring small and large objects are represented at appropriate resolutions.
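The top-down half of that fusion can be sketched in a few lines of NumPy (nearest-neighbour upsampling stands in for interpolation; the 256-channel lateral maps are assumed, as in the FPN paper, and the values here are random placeholders):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling along the spatial axes.
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

# Hypothetical lateral features, already projected to 256 channels by 1x1 convs,
# at strides 8/16/32 of a 256x256 input.
rng = np.random.default_rng(0)
c3 = rng.random((256, 32, 32))
c4 = rng.random((256, 16, 16))
c5 = rng.random((256, 8, 8))

# Top-down pathway: coarse, semantically strong features flow downward and are
# summed with the finer lateral maps (a 3x3 smoothing conv would follow in FPN).
p5 = c5
p4 = c4 + upsample2x(p5)
p3 = c3 + upsample2x(p4)
```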
Proposal Generation or Query Matching
Two-stage detectors (Faster R-CNN) generate ~300 region proposals via RPN. Transformer detectors (DETR) use learned object queries (100-900) that attend to the feature map. Single-shot detectors (YOLO) predict directly on a dense grid.
Box Regression + Classification
Each proposal/query is refined into a bounding box (x, y, w, h) and classified. DETR uses bipartite matching (Hungarian algorithm) to assign predictions to ground truth; YOLO/SSD use anchor-based assignment with IoU thresholds.
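DETR's bipartite matching step can be illustrated with SciPy's Hungarian solver on a toy cost matrix (three queries vs. two ground-truth objects; the cost weights mirror DETR's defaults, but the numbers themselves are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: 3 predicted queries; columns: 2 ground-truth objects.
cls_cost = np.array([[0.9, 0.2],    # classification cost (1 - predicted prob)
                     [0.1, 0.8],
                     [0.5, 0.5]])
box_cost = np.array([[0.7, 0.1],    # L1 distance between boxes (illustrative)
                     [0.2, 0.6],
                     [0.4, 0.4]])
cost = 1.0 * cls_cost + 5.0 * box_cost  # DETR weights: 1 for class, 5 for L1

# Hungarian algorithm: each ground truth is assigned exactly one query; the
# unmatched queries are trained to predict "no object".
query_idx, gt_idx = linear_sum_assignment(cost)
print(query_idx, gt_idx)  # query 0 -> gt 1, query 1 -> gt 0; query 2 unmatched
```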
Post-Processing
Non-maximum suppression (NMS) removes duplicate boxes in anchor-based detectors. DETR avoids NMS entirely. Output: list of (box, class, confidence) tuples, evaluated with mAP at IoU thresholds 0.5:0.95.
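Greedy NMS itself fits in a few lines of NumPy (a minimal sketch with boxes in [x1, y1, x2, y2] format, not a production kernel):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping duplicates."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] — box 1 (IoU ≈ 0.68 with box 0) is suppressed
```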
Current Landscape
Object detection in 2025 is dominated by two parallel tracks: DETR-family transformers for maximum accuracy (Co-DETR, DINO-DETR) and the YOLO lineage for real-time deployment. The gap between them has narrowed dramatically — RT-DETR showed that transformer detectors can match YOLO speeds, and YOLOv8/v9 incorporated transformer ideas into the YOLO framework. Meanwhile, open-vocabulary detection (Grounding DINO, OWLv2) is disrupting the entire paradigm: instead of training a detector per domain, you describe what you want to find in text. Foundation models like Florence-2 are further blurring the boundary between detection, segmentation, and captioning.
Key Challenges
Small object detection — objects under 32×32 pixels make up roughly 41% of COCO instances, yet AP on small objects (AP_S) typically trails AP on large objects by 20+ points
Real-time inference constraints for autonomous driving (10-30ms latency budget) force painful accuracy/speed tradeoffs
Domain adaptation — detectors trained on COCO (everyday objects) fail on specialized domains like aerial imagery, medical scans, or manufacturing defects without significant fine-tuning
Crowded scenes with heavy occlusion (e.g., pedestrians in dense urban environments) cause proposal collision and NMS failures
Annotation cost — drawing bounding boxes takes 25-35 seconds per instance, making large-scale labeled datasets expensive to create
Quick Recommendations
Best accuracy (no latency constraint)
Co-DETR with Swin-L backbone
65%+ COCO mAP, best available closed-set detector; uses collaborative hybrid assignments for superior training
Real-time detection
YOLOv8-L or RT-DETR-L
54-56% COCO mAP at 100+ FPS on an A100; YOLOv8 for simpler deployment, RT-DETR for NMS-free inference
Open-vocabulary / zero-shot
Grounding DINO 1.5 or OWLv2
Detect any object described in text without retraining — critical for robotics, content moderation, and novel domains
Edge / mobile deployment
YOLOv8-N or NanoDet-Plus
~37% COCO mAP at 1.5-3M params, runs at 30+ FPS on mobile NPUs
Low-annotation regime
Grounding DINO + SAM
Use text prompts to generate pseudo-labels, then fine-tune a smaller detector — bootstraps detection without manual annotation
What's Next
The field is converging toward unified vision models that handle detection as one of many tasks (Florence-2, PaLI-X). Open-vocabulary detection will likely make closed-set training obsolete for most applications within 2-3 years. Active research frontiers include 3D object detection from monocular images (crucial for autonomous driving without LiDAR), temporal object detection in video (tracking + detection jointly), and detection foundation models that work zero-shot across wildly different domains like satellite imagery, microscopy, and underwater robotics.
Benchmarks & SOTA
COCO
Microsoft Common Objects in Context
Microsoft COCO is the gold standard for large-scale object detection, segmentation, and captioning, with 330k+ images, 1.5M+ object instances, and 80 categories. Primary metric is box mAP averaged over 10 IoU thresholds (0.5:0.95).
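AP at a single IoU threshold reduces to integrating precision over recall; a simplified all-point-interpolation version is sketched below (COCO additionally averages this over the 10 IoU thresholds and 80 classes, and handles per-image matching — omitted here):

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """AP at one IoU threshold: area under the interpolated precision-recall curve."""
    order = np.argsort(scores)[::-1]          # rank detections by confidence
    hits = np.asarray(is_tp, float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # Interpolate: precision at each recall level is the max precision to its right.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate([[0.0], recall])) * precision))

# 3 detections, 2 ground-truth objects; the middle detection is a false positive.
ap = average_precision(scores=[0.9, 0.8, 0.7], is_tp=[1, 0, 1], n_gt=2)
print(round(ap, 4))  # 0.8333
```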
State of the Art
ScyllaNet
Scylla Technologies
66.12
box-map
LVIS v1.0
Large Vocabulary Instance Segmentation v1.0
1,203 object categories with federated, long-tail distribution across 164K COCO images. Tests real-world detection with rare and fine-grained categories.
State of the Art
DINO-X
IDEA Research
71.4
box-ap
Pascal VOC 2012
Pascal Visual Object Classes Challenge 2012
11,530 images with 27,450 ROI annotated objects and 6,929 segmentations. Classic object detection benchmark.
State of the Art
SSD512 (VGG-16)
Google / UNC
80
mAP-coco-pretrain
ImageNet Detection (ILSVRC DET)
ImageNet Large Scale Visual Recognition Challenge — Detection (ILSVRC DET)
ImageNet Detection (commonly called ILSVRC DET) is the object detection track of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It provides bounding-box annotations for images across 200 object categories and was used as a large-scale benchmark for object detection in ILSVRC competitions (2012–2017). Models are evaluated with detection metrics (mean Average Precision, commonly reported at IoU = 0.5 / mAP@0.5, following the ILSVRC evaluation protocol). The dataset and challenge are described in the ILSVRC overview paper (Russakovsky et al., 2014) and on the ImageNet challenge website, which hosts the list of 200 detection synsets, development kits and per-year results.
No results tracked yet
ImageNet Localization (ILSVRC LOC)
ImageNet Large Scale Visual Recognition Challenge — Localization (ILSVRC LOC)
ImageNet Localization (ILSVRC LOC) is the localization subset of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It provides per-image annotations (bounding boxes) for target object instances across the 1,000 ILSVRC categories and is used to evaluate object localization performance (commonly reported as top-5 localization error %). The localization task requires a model to both classify the primary object in an image and provide its bounding box (typically one localized box per image in the ILSVRC LOC setup). The dataset and challenge are described in the original ImageNet paper (Deng et al., CVPR 2009) and in the ILSVRC challenge overview (Russakovsky et al., arXiv:1409.0575).
No results tracked yet
DIOR
DIOR (Dataset for Object detection in Optical Remote sensing images)
DIOR is a large-scale benchmark dataset for object detection in optical remote sensing (aerial/satellite) images. It contains approximately 23,463 images (800×800 px) and ~192,472 axis-aligned object instances covering 20 object categories (e.g., airplane, airport, ship, bridge, stadium, vehicle, windmill, storage tank, dam, chimney, golf course, tennis court, baseball field, basketball court, expressway toll station/service area, harbor, overpass, ground track field, train station). Images have varying spatial resolutions (~0.5 m to 30 m). Standard splits are provided (training, validation, test — commonly reported splits: train ~5,862, val ~5,863, test ~11,725). DIOR is typically evaluated using object-detection metrics such as mean Average Precision (mAP). A rotated-box variant (DIOR-R) with oriented bounding-box annotations has also been released and adopted by the community.
No results tracked yet
COCO val2017
COCO 2017 Object Detection (validation split)
COCO 2017 validation split (5K images) for object detection evaluation. This dataset is specifically used for object detection tasks, where models are evaluated on their ability to detect and localize objects in images using bounding boxes.
No results tracked yet
COCO 2014 val
COCO 2014 Validation Split
COCO 2014 validation split.
No results tracked yet
Roboflow100-VL (RF100-VL)
Roboflow100-VL (RF100-VL)
Roboflow100-VL (RF100-VL) is a multi-domain object-detection benchmark designed to evaluate vision-language models (VLMs) on diverse, out-of-distribution concepts and imaging modalities. The benchmark aggregates 100 heterogeneous object-detection datasets (drawn from Roboflow/Roboflow Universe collections) spanning domains such as medical imagery (X-ray), thermal, aerial, industrial inspection, synthetic/game imagery, and more. The paper reports aggregate metrics (e.g., AP, latency, FLOPs) averaged across all 100 tasks and evaluates models in zero-shot, few-shot, semi-supervised, and fully supervised settings; the project provides code, dataset interfaces (PyPI package rf100vl), and a public website. Primary sources: paper (arXiv:2505.20612), project site (https://rf100-vl.org), code repository (https://github.com/roboflow/rf100-vl), and a Hugging Face mirror/hosted collection (https://hf.co/datasets/gatilin/rf100-vl).
No results tracked yet
COCO test-dev
COCO test-dev Split
COCO test-dev evaluation split used for benchmark submissions and leaderboard rankings.
No results tracked yet
PASCAL VOC 2007
PASCAL Visual Object Classes (VOC) Challenge 2007
PASCAL VOC 2007 (PASCAL Visual Object Classes Challenge 2007) is a standard benchmark dataset for object detection, classification and segmentation. VOC2007 contains 9,963 images with annotations for 20 object classes (e.g., person, car, bicycle, dog) and about 24,640 annotated object instances. Annotations include class labels, object bounding boxes and (for some images) pixel-level segmentation masks, plus object attributes such as "difficult" and "truncated". The dataset is provided with standard train/val/test splits (the official VOC2007 test annotations were held out on the evaluation server), and the canonical detection evaluation metric reported on this dataset is mean Average Precision (mAP) computed using the PASCAL VOC protocol (AP at IoU 0.5). VOC2007 is widely used for benchmarking object detection models and is often combined with VOC2012 or COCO for additional training (e.g., VOC07+12 or COCO+07+12).
No results tracked yet
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
Open-Vocabulary Object Detection
Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.
Object counting
Object counting is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection with deep learning models such as convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation into a total count. Applications include manufacturing quality control and production monitoring.
Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.