
Object Detection

Locate and classify objects in images with bounding boxes. Foundational for autonomous vehicles, surveillance, and robotics.

How Object Detection Works

A technical deep dive into object detection systems, from bounding box predictions to real-time inference with YOLO.

1. Detection Basics: What Gets Predicted

Object detection = localization + classification. For each object, predict a bounding box and class label.

Image Classification
One label for the entire image (e.g. "dog").
Output: class_label

Object Detection
One prediction per object, each with a box, class, and confidence (e.g. person 0.97, dog 0.92, car 0.89).
Output: [{box, class, conf}, ...]

Detection Output Format

| Field | Meaning | Example |
|---|---|---|
| Bounding Box | 4 coordinates | [x1, y1, x2, y2] |
| Class Label | Object category | "person", "car", ... |
| Confidence | Detection score | 0.0 to 1.0 |
| Instance ID | Distinguishes multiple objects of the same class | person_1, person_2 |
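
To make the format concrete, a detector's output for one image might look like the following Python structure (values are illustrative, not from any specific model):

detections = [
    {"box": [34, 50, 210, 380],  "class": "person", "conf": 0.97},
    {"box": [220, 310, 470, 420], "class": "dog",    "conf": 0.92},
    {"box": [480, 300, 630, 400], "class": "car",    "conf": 0.89},
]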

Common Box Formats

| Format | Layout | Description |
|---|---|---|
| xyxy (corners) | [x1, y1, x2, y2] | Top-left and bottom-right corners |
| xywh (center) | [cx, cy, w, h] | Center point + width/height |
| normalized | values in [0, 1] | Coordinates relative to image width/height |
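
As a concrete illustration, here is a minimal NumPy sketch of converting between these formats; the function names (xyxy_to_xywh and friends) are our own, not from any particular library:

import numpy as np

def xyxy_to_xywh(box):
    """[x1, y1, x2, y2] -> [cx, cy, w, h]."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1])

def xywh_to_xyxy(box):
    """[cx, cy, w, h] -> [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def normalize_xyxy(box, img_w, img_h):
    """Scale pixel coordinates into the [0, 1] range relative to image size."""
    x1, y1, x2, y2 = box
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h])

box = np.array([50.0, 60.0, 150.0, 260.0])     # a 100x200 box in xyxy
print(xyxy_to_xywh(box))                       # [100. 160. 100. 200.]
print(normalize_xyxy(box, 640, 480))           # coordinates relative to a 640x480 image
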
2. Detector Architectures

Two main paradigms: two-stage (propose then classify) vs single-stage (detect directly). Transformers are the new contender.

| Architecture | Best For | Speed | Accuracy | Pipeline |
|---|---|---|---|---|
| Two-Stage (Faster R-CNN) | When accuracy matters more than speed | 5-7 FPS | High | Region Proposal Network (RPN) -> RoI Pooling -> Classification + Regression |
| Single-Stage (YOLO) | Real-time applications | 30-500 FPS | Good-High | Backbone -> Neck (FPN) -> Detection Head |
| Transformer (DETR) | End-to-end detection, no NMS needed | 10-30 FPS | High | Backbone -> Transformer Encoder -> Transformer Decoder -> FFN Heads |

Two-Stage Pipeline

1. Region Proposal: generate ~2,000 candidate boxes per image
2. Classification + Refinement: classify each region and refine its box
Examples: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN

Single-Stage Pipeline

1. Direct Prediction: predict all boxes and classes in a single pass (much faster)
No region proposal stage. Predictions are made at multiple scales using a Feature Pyramid Network (FPN).
Examples: YOLO, SSD, RetinaNet, CenterNet, FCOS
3. YOLO Evolution: From v1 to v11

You Only Look Once (YOLO) is the most popular family of real-time detectors, with nine years of continuous improvement from 2015 to 2024.

[Chart: mAP (%) and FPS for YOLOv1 (2015) through YOLOv11 (2024); values are listed in the table below]
| Version | Year | Key Innovation | mAP | Speed |
|---|---|---|---|---|
| YOLOv1 | 2015 | Single-shot detection | 63.4% (VOC) | 45 FPS |
| YOLOv2 | 2016 | Batch norm, anchor boxes | 76.8% (VOC) | 67 FPS |
| YOLOv3 | 2018 | Multi-scale predictions | 33% (COCO) | 35 FPS |
| YOLOv4 | 2020 | CSPDarknet, PANet | 43.5% (COCO) | 54 FPS |
| YOLOv5 | 2020 | PyTorch, easy training | 50.7% (COCO) | 140 FPS |
| YOLOv8 | 2023 | Anchor-free, decoupled head | 53.9% (COCO) | 280 FPS |
| YOLOv11 | 2024 | C3k2 blocks, attention | 54.7% (COCO) | 320 FPS |

Note: YOLOv1 and YOLOv2 report mAP@50 on PASCAL VOC, while later versions report the stricter mAP@50:95 on COCO, so values are not directly comparable across that boundary.

Modern YOLO Architecture (v8/v11)

Backbone (CSPDarknet / C3k2) -> Neck (PANet feature pyramid) -> Head (decoupled cls/reg) -> Output ([N, 4+C] boxes)

YOLOv8+ uses anchor-free detection and decoupled classification/regression heads for better accuracy.

4. Non-Maximum Suppression (NMS)

Detectors often predict multiple overlapping boxes for the same object. NMS removes duplicates by keeping only the highest-confidence box.

Before NMS: four overlapping boxes for the same object, with confidences 0.95, 0.88, 0.82, and 0.79.

After NMS (IoU threshold = 0.5): only the highest-confidence box (0.95) is kept.

NMS Algorithm

1. Sort all boxes by confidence score (descending).
2. Take the highest-scoring box and add it to the final detections.
3. Remove all remaining boxes whose IoU with the selected box exceeds the threshold.
4. Repeat steps 2-3 until no boxes remain.
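
A minimal NumPy sketch of this greedy procedure (the helper is written from scratch here; in practice frameworks provide built-in NMS, e.g. torchvision.ops.nms):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over boxes in [x1, y1, x2, y2] format; returns kept indices."""
    order = scores.argsort()[::-1]                 # step 1: sort by confidence
    keep = []
    while order.size > 0:
        i = order[0]                               # step 2: highest-scoring box
        keep.append(int(i))
        # step 3: IoU of the selected box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # step 4: drop heavily overlapping boxes, then repeat with the rest
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[50, 50, 150, 150], [55, 55, 155, 155], [300, 300, 400, 400]], dtype=float)
scores = np.array([0.95, 0.88, 0.82])
print(nms(boxes, scores))   # [0, 2]: the 0.88 box is suppressed by the 0.95 box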

IoU (Intersection over Union)

IoU measures the overlap between two boxes A and B:

IoU = area(A ∩ B) / area(A ∪ B), where area(A ∪ B) = area(A) + area(B) - area(A ∩ B)

A value of 1.0 means perfect overlap; 0.0 means no overlap at all.
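
A minimal sketch of the same formula in code (the iou function name is ours):

def iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    # Intersection rectangle (zero area if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = area(A) + area(B) - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 100, 100], [50, 0, 150, 100]))   # ≈ 0.33: each box overlaps half of the other
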
5. Detection Metrics

Understanding mAP, IoU thresholds, and speed metrics for comparing detectors.

| Metric | Full Name | Meaning |
|---|---|---|
| mAP | Mean Average Precision | Average precision across all classes (and IoU thresholds) |
| mAP@50 | mAP at IoU 0.5 | AP where a predicted box with >= 50% overlap counts as correct |
| mAP@50:95 | COCO mAP | mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 |
| IoU | Intersection over Union | Overlap between predicted and ground-truth box |
| FPS | Frames Per Second | Inference speed; critical for real-time use |

Understanding mAP@50:95 (COCO metric)

The COCO benchmark uses a strict metric that averages AP across:

  • 10 IoU thresholds: 0.50, 0.55, ..., 0.95
  • 80 object categories
  • ~5,000 validation images
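
As a toy sketch of the averaging step, assume you already have per-threshold AP values for one class (computing each AP requires matching predictions to ground truth, which tools such as pycocotools handle):

# AP at each IoU threshold for one class (illustrative numbers)
ap_per_threshold = {
    0.50: 0.72, 0.55: 0.70, 0.60: 0.67, 0.65: 0.63, 0.70: 0.58,
    0.75: 0.51, 0.80: 0.42, 0.85: 0.31, 0.90: 0.18, 0.95: 0.05,
}
ap_50_95 = sum(ap_per_threshold.values()) / len(ap_per_threshold)
print(f"AP@50    = {ap_per_threshold[0.50]:.3f}")   # lenient single-threshold metric
print(f"AP@50:95 = {ap_50_95:.3f}")                 # stricter COCO-style average (0.477 here)
# mAP@50:95 is then the mean of this value across all categories.
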
COCO 2017 Val Leaderboard (selected)

| Model | mAP@50:95 |
|---|---|
| YOLOv11-X | 54.7% |
| YOLOv8-X | 53.9% |
| RT-DETR-X | 54.8% |
| DINO (Swin-L) | 63.2% |

Speed vs Accuracy Trade-off

| Model Size | Speed | mAP |
|---|---|---|
| YOLO-N (nano) | ~500 FPS | ~39% |
| YOLO-S (small) | ~400 FPS | ~45% |
| YOLO-M (medium) | ~250 FPS | ~50% |
| YOLO-L (large) | ~150 FPS | ~53% |
| YOLO-X (extra-large) | ~100 FPS | ~55% |

FPS measured on RTX 4090. Choose model size based on your speed/accuracy requirements.

6. Code Examples

Get started with object detection in Python.

YOLOv11 (Ultralytics), recommended
Install: pip install ultralytics
from ultralytics import YOLO

# Load a pretrained model
model = YOLO('yolo11x.pt')

# Run inference on an image
results = model('image.jpg')

# Process results
for result in results:
    boxes = result.boxes.xyxy    # [x1, y1, x2, y2]
    confs = result.boxes.conf    # confidence scores
    classes = result.boxes.cls   # class indices

    for box, conf, cls in zip(boxes, confs, classes):
        label = model.names[int(cls)]
        print(f'{label}: {conf:.2f} at {box.tolist()}')
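
If you prefer a two-stage baseline, here is a sketch using torchvision's pretrained Faster R-CNN (assumes a recent torchvision; the weights argument and output dictionary keys follow torchvision.models.detection):

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained two-stage detector with COCO weights
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("image.jpg").convert("RGB"))   # [C, H, W] tensor in [0, 1]

with torch.no_grad():
    pred = model([img])[0]    # one dict per input image: boxes, labels, scores

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score >= 0.5:          # simple confidence threshold
        print(f"class {int(label)}: {score:.2f} at {box.tolist()}")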

Quick Reference

For Real-Time (30+ FPS)
  • YOLO v8/v11 (any size)
  • RT-DETR
  • YOLO-NAS
For Max Accuracy
  • DINO / DINO-X
  • Mask R-CNN
  • Co-DETR
For Open-Vocabulary
  • Grounding DINO
  • Florence-2
  • YOLO-World

Use Cases

  • Autonomous driving
  • Security monitoring
  • Inventory management
  • Wildlife tracking

Architectural Patterns

Two-Stage Detectors

First propose regions, then classify (Faster R-CNN family).

Pros:
  • High accuracy
  • Good for small objects
  • Flexible backbone
Cons:
  • Slower inference
  • Complex architecture

Single-Stage Detectors

Predict boxes and classes in one pass (YOLO, SSD).

Pros:
  • Fast inference
  • Real-time capable
  • Simple deployment
Cons:
  • May miss small objects
  • Lower accuracy on crowded scenes

Transformer-Based

End-to-end detection with transformers (DETR family).

Pros:
  • No NMS needed
  • Elegant architecture
  • Good accuracy
Cons:
  • Slow training
  • Needs large datasets

Implementations

API Services

AWS Rekognition

AWS
API

Managed detection API. Good for cloud-native apps.

Open Source

YOLOv8/YOLOv11

AGPL-3.0
Open Source

State-of-the-art speed/accuracy. Easy to use, many sizes (n/s/m/l/x).

RT-DETR

Apache 2.0
Open Source

Real-time DETR. Transformer-based with competitive speed.

Grounding DINO

Apache 2.0
Open Source

Open-vocabulary detection. Detect anything with text prompts.

Florence-2

MIT
Open Source

Unified vision model. Detection, segmentation, captioning in one.

Benchmarks

Quick Facts

  • Input: Image
  • Output: Bounding Boxes
  • Implementations: 4 open source, 1 API
  • Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for object detection.

Submit Results