
Object Detection

Locate and classify objects in images with bounding boxes. Foundational for autonomous vehicles, surveillance, and robotics.

How Object Detection Works

A technical deep dive into object detection systems, from bounding box predictions to real-time inference with YOLO.

1. Detection Basics: What Gets Predicted

Object detection = localization + classification. For each object, predict a bounding box and class label.

Image Classification
One label for the entire image (e.g. "dog").
Output: class_label

Object Detection
One prediction per object, each with a box, class, and confidence (e.g. person 0.97, dog 0.92, car 0.89).
Output: [{box, class, conf}, ...]

Detection Output Format

| Field | Meaning | Example |
|---|---|---|
| Bounding Box | 4 coordinates | [x1, y1, x2, y2] |
| Class Label | Object category | "person", "car", ... |
| Confidence | Detection score | 0.0 to 1.0 |
| Instance ID | Distinguishes multiple objects of the same class | person_1, person_2 |
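
To make the format concrete, a detector's output for one image might look like the following Python structure (values are illustrative, not from any specific model):

detections = [
    {"box": [34, 50, 210, 380],  "class": "person", "conf": 0.97},
    {"box": [220, 310, 470, 420], "class": "dog",    "conf": 0.92},
    {"box": [480, 300, 630, 400], "class": "car",    "conf": 0.89},
]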

Common Box Formats

| Format | Layout | Description |
|---|---|---|
| xyxy (corners) | [x1, y1, x2, y2] | Top-left and bottom-right corners |
| xywh (center) | [cx, cy, w, h] | Center point + width/height |
| normalized | values in [0, 1] | Coordinates relative to image width/height |
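
As a concrete illustration, here is a minimal NumPy sketch of converting between these formats; the function names (xyxy_to_xywh and friends) are our own, not from any particular library:

import numpy as np

def xyxy_to_xywh(box):
    """[x1, y1, x2, y2] -> [cx, cy, w, h]."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1])

def xywh_to_xyxy(box):
    """[cx, cy, w, h] -> [x1, y1, x2, y2]."""
    cx, cy, w, h = box
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def normalize_xyxy(box, img_w, img_h):
    """Scale pixel coordinates into the [0, 1] range relative to image size."""
    x1, y1, x2, y2 = box
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h])

box = np.array([50.0, 60.0, 150.0, 260.0])     # a 100x200 box in xyxy
print(xyxy_to_xywh(box))                       # [100. 160. 100. 200.]
print(normalize_xyxy(box, 640, 480))           # coordinates relative to a 640x480 image
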
2. Detector Architectures

Two main paradigms: two-stage (propose then classify) vs single-stage (detect directly). Transformers are the new contender.

| Architecture | Best For | Speed | Accuracy | Pipeline |
|---|---|---|---|---|
| Two-Stage (Faster R-CNN) | When accuracy matters more than speed | 5-7 FPS | High | Region Proposal Network (RPN) -> RoI Pooling -> Classification + Regression |
| Single-Stage (YOLO) | Real-time applications | 30-500 FPS | Good-High | Backbone -> Neck (FPN) -> Detection Head |
| Transformer (DETR) | End-to-end detection, no NMS needed | 10-30 FPS | High | Backbone -> Transformer Encoder -> Transformer Decoder -> FFN Heads |

Two-Stage Pipeline

1. Region Proposal: generate ~2,000 candidate boxes per image
2. Classification + Refinement: classify each region and refine its box
Examples: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN

Single-Stage Pipeline

1. Direct Prediction: predict all boxes and classes in a single pass (much faster)
No region proposal stage. Predictions are made at multiple scales using a Feature Pyramid Network (FPN).
Examples: YOLO, SSD, RetinaNet, CenterNet, FCOS
3. YOLO Evolution: From v1 to v11

You Only Look Once (YOLO) is the most popular family of real-time detectors, with nine years of continuous improvement from 2015 to 2024.

[Chart: mAP (%) and FPS for YOLOv1 (2015) through YOLOv11 (2024); values are listed in the table below]
| Version | Year | Key Innovation | mAP | Speed |
|---|---|---|---|---|
| YOLOv1 | 2015 | Single-shot detection | 63.4% (VOC) | 45 FPS |
| YOLOv2 | 2016 | Batch norm, anchor boxes | 76.8% (VOC) | 67 FPS |
| YOLOv3 | 2018 | Multi-scale predictions | 33% (COCO) | 35 FPS |
| YOLOv4 | 2020 | CSPDarknet, PANet | 43.5% (COCO) | 54 FPS |
| YOLOv5 | 2020 | PyTorch, easy training | 50.7% (COCO) | 140 FPS |
| YOLOv8 | 2023 | Anchor-free, decoupled head | 53.9% (COCO) | 280 FPS |
| YOLOv11 | 2024 | C3k2 blocks, attention | 54.7% (COCO) | 320 FPS |

Note: YOLOv1 and YOLOv2 report mAP@50 on PASCAL VOC, while later versions report the stricter mAP@50:95 on COCO, so values are not directly comparable across that boundary.

Modern YOLO Architecture (v8/v11)

Backbone (CSPDarknet / C3k2) -> Neck (PANet feature pyramid) -> Head (decoupled cls/reg) -> Output ([N, 4+C] boxes)

YOLOv8+ uses anchor-free detection and decoupled classification/regression heads for better accuracy.

4. Non-Maximum Suppression (NMS)

Detectors often predict multiple overlapping boxes for the same object. NMS removes duplicates by keeping only the highest-confidence box.

Before NMS: four overlapping boxes for the same object, with confidences 0.95, 0.88, 0.82, and 0.79.

After NMS (IoU threshold = 0.5): only the highest-confidence box (0.95) is kept.

NMS Algorithm

1. Sort all boxes by confidence score (descending).
2. Take the highest-scoring box and add it to the final detections.
3. Remove all remaining boxes whose IoU with the selected box exceeds the threshold.
4. Repeat steps 2-3 until no boxes remain.
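
A minimal NumPy sketch of this greedy procedure (the helper is written from scratch here; in practice frameworks provide built-in NMS, e.g. torchvision.ops.nms):

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over boxes in [x1, y1, x2, y2] format; returns kept indices."""
    order = scores.argsort()[::-1]                 # step 1: sort by confidence
    keep = []
    while order.size > 0:
        i = order[0]                               # step 2: highest-scoring box
        keep.append(int(i))
        # step 3: IoU of the selected box against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # step 4: drop heavily overlapping boxes, then repeat with the rest
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[50, 50, 150, 150], [55, 55, 155, 155], [300, 300, 400, 400]], dtype=float)
scores = np.array([0.95, 0.88, 0.82])
print(nms(boxes, scores))   # [0, 2]: the 0.88 box is suppressed by the 0.95 box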

IoU (Intersection over Union)

IoU measures the overlap between two boxes A and B:

IoU = area(A ∩ B) / area(A ∪ B), where area(A ∪ B) = area(A) + area(B) - area(A ∩ B)

A value of 1.0 means perfect overlap; 0.0 means no overlap at all.
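
A minimal sketch of the same formula in code (the iou function name is ours):

def iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    # Intersection rectangle (zero area if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = area(A) + area(B) - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 100, 100], [50, 0, 150, 100]))   # ≈ 0.33: each box overlaps half of the other
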
5. Detection Metrics

Understanding mAP, IoU thresholds, and speed metrics for comparing detectors.

| Metric | Full Name | Meaning |
|---|---|---|
| mAP | Mean Average Precision | Average precision across all classes (and IoU thresholds) |
| mAP@50 | mAP at IoU 0.5 | AP where a predicted box with >= 50% overlap counts as correct |
| mAP@50:95 | COCO mAP | mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 |
| IoU | Intersection over Union | Overlap between predicted and ground-truth box |
| FPS | Frames Per Second | Inference speed; critical for real-time use |

Understanding mAP@50:95 (COCO metric)

The COCO benchmark uses a strict metric that averages AP across:

  • 10 IoU thresholds: 0.50, 0.55, ..., 0.95
  • 80 object categories
  • ~5,000 validation images
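
As a toy sketch of the averaging step, assume you already have per-threshold AP values for one class (computing each AP requires matching predictions to ground truth, which tools such as pycocotools handle):

# AP at each IoU threshold for one class (illustrative numbers)
ap_per_threshold = {
    0.50: 0.72, 0.55: 0.70, 0.60: 0.67, 0.65: 0.63, 0.70: 0.58,
    0.75: 0.51, 0.80: 0.42, 0.85: 0.31, 0.90: 0.18, 0.95: 0.05,
}
ap_50_95 = sum(ap_per_threshold.values()) / len(ap_per_threshold)
print(f"AP@50    = {ap_per_threshold[0.50]:.3f}")   # lenient single-threshold metric
print(f"AP@50:95 = {ap_50_95:.3f}")                 # stricter COCO-style average (0.477 here)
# mAP@50:95 is then the mean of this value across all categories.
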
COCO 2017 Val Leaderboard (selected)

| Model | mAP@50:95 |
|---|---|
| YOLOv11-X | 54.7% |
| YOLOv8-X | 53.9% |
| RT-DETR-X | 54.8% |
| DINO (Swin-L) | 63.2% |

Speed vs Accuracy Trade-off

| Model Size | Speed | mAP |
|---|---|---|
| YOLO-N (nano) | ~500 FPS | ~39% |
| YOLO-S (small) | ~400 FPS | ~45% |
| YOLO-M (medium) | ~250 FPS | ~50% |
| YOLO-L (large) | ~150 FPS | ~53% |
| YOLO-X (extra-large) | ~100 FPS | ~55% |

FPS measured on RTX 4090. Choose model size based on your speed/accuracy requirements.

6. Code Examples

Get started with object detection in Python.

YOLOv11 (Ultralytics), recommended
Install: pip install ultralytics
from ultralytics import YOLO

# Load a pretrained model
model = YOLO('yolo11x.pt')

# Run inference on an image
results = model('image.jpg')

# Process results
for result in results:
    boxes = result.boxes.xyxy    # [x1, y1, x2, y2]
    confs = result.boxes.conf    # confidence scores
    classes = result.boxes.cls   # class indices

    for box, conf, cls in zip(boxes, confs, classes):
        label = model.names[int(cls)]
        print(f'{label}: {conf:.2f} at {box.tolist()}')
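
If you prefer a two-stage baseline, here is a sketch using torchvision's pretrained Faster R-CNN (assumes a recent torchvision; the weights argument and output dictionary keys follow torchvision.models.detection):

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained two-stage detector with COCO weights
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("image.jpg").convert("RGB"))   # [C, H, W] tensor in [0, 1]

with torch.no_grad():
    pred = model([img])[0]    # one dict per input image: boxes, labels, scores

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score >= 0.5:          # simple confidence threshold
        print(f"class {int(label)}: {score:.2f} at {box.tolist()}")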

Quick Reference

For Real-Time (30+ FPS)
  • YOLO v8/v11 (any size)
  • RT-DETR
  • YOLO-NAS
For Max Accuracy
  • DINO / DINO-X
  • Mask R-CNN
  • Co-DETR
For Open-Vocabulary
  • Grounding DINO
  • Florence-2
  • YOLO-World

Use Cases

  • Autonomous driving
  • Security monitoring
  • Inventory management
  • Wildlife tracking

Architectural Patterns

Two-Stage Detectors

First propose regions, then classify (Faster R-CNN family).

Pros:
  • High accuracy
  • Good for small objects
  • Flexible backbone
Cons:
  • Slower inference
  • Complex architecture

Single-Stage Detectors

Predict boxes and classes in one pass (YOLO, SSD).

Pros:
  • Fast inference
  • Real-time capable
  • Simple deployment
Cons:
  • May miss small objects
  • Lower accuracy on crowded scenes

Transformer-Based

End-to-end detection with transformers (DETR family).

Pros:
  • No NMS needed
  • Elegant architecture
  • Good accuracy
Cons:
  • Slow training
  • Needs large datasets

Implementations

API Services

AWS Rekognition

AWS
API

Managed detection API. Good for cloud-native apps.

Open Source

YOLOv8/YOLOv11

AGPL-3.0
Open Source

State-of-the-art speed/accuracy. Easy to use, many sizes (n/s/m/l/x).

RT-DETR

Apache 2.0
Open Source

Real-time DETR. Transformer-based with competitive speed.

Grounding DINO

Apache 2.0
Open Source

Open-vocabulary detection. Detect anything with text prompts.

Florence-2

MIT
Open Source

Unified vision model. Detection, segmentation, captioning in one.

Benchmarks

Quick Facts

  • Input: Image
  • Output: Bounding Boxes
  • Implementations: 4 open source, 1 API
  • Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for object detection.

Submit Results