Object Detection
Locate and classify objects in images with bounding boxes. Foundational for autonomous vehicles, surveillance, and robotics.
How Object Detection Works
A technical deep dive into object detection systems, from bounding box predictions to real-time inference with YOLO.
Detection Basics: What Gets Predicted
Object detection = localization + classification. For each object, predict a bounding box and class label.
Image Classification → class_label (one label per image)
Object Detection → [{box, class, conf}, ...] (one record per detected object)
Detection Output Format
Each detection is a record with three fields: a bounding box, a class label, and a confidence score in [0, 1].
Common Box Formats
- xyxy: corner coordinates (x1, y1, x2, y2), as in Pascal VOC
- xywh: top-left corner plus width and height, as in COCO annotations
- cxcywh (normalized): box center plus width and height, divided by image size, as in YOLO label files
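To make the formats concrete, here is a minimal conversion sketch in plain Python (function names are illustrative, not from any particular library):

```python
def xyxy_to_xywh(x1, y1, x2, y2):
    """Corner format (Pascal VOC) -> top-left + size (COCO)."""
    return x1, y1, x2 - x1, y2 - y1

def xyxy_to_cxcywh_norm(x1, y1, x2, y2, img_w, img_h):
    """Corner format -> normalized center + size (YOLO label files)."""
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    return cx, cy, (x2 - x1) / img_w, (y2 - y1) / img_h

print(xyxy_to_xywh(10, 20, 50, 80))                   # (10, 20, 40, 60)
print(xyxy_to_cxcywh_norm(10, 20, 50, 80, 100, 100))  # (0.3, 0.5, 0.4, 0.6)
```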
Detector Architectures
Two main paradigms: two-stage (propose then classify) vs single-stage (detect directly). Transformers are the new contender.
- Two-Stage (Faster R-CNN): when accuracy > speed
- Single-Stage (YOLO): real-time applications
- Transformer (DETR): no NMS needed, elegant
Two-Stage Pipeline
Image → backbone → region proposal network → RoI pooling → per-proposal classification and box refinement
Single-Stage Pipeline
Image → backbone → dense grid of box and class predictions in a single pass → NMS
YOLO Evolution: From v1 to v11
You Only Look Once - the most popular real-time detector. 9 years of continuous improvement.
| Version | Year | Key Innovation | mAP* | Speed |
|---|---|---|---|---|
| YOLOv1 | 2015 | Single-shot detection | 63.4% | 45 FPS |
| YOLOv2 | 2016 | Batch norm, anchor boxes | 76.8% | 67 FPS |
| YOLOv3 | 2018 | Multi-scale predictions | 33.0% | 35 FPS |
| YOLOv4 | 2020 | CSPDarknet, PANet | 43.5% | 54 FPS |
| YOLOv5 | 2020 | PyTorch, easy training | 50.7% | 140 FPS |
| YOLOv8 | 2023 | Anchor-free, decoupled head | 53.9% | 280 FPS |
| YOLOv11 | 2024 | C3k2 blocks, attention | 54.7% | 320 FPS |

*YOLOv1/v2 reported mAP@50 on Pascal VOC; YOLOv3 onward report the stricter mAP@50:95 on COCO, so the column is not directly comparable across that boundary.
Modern YOLO Architecture (v8/v11)
YOLOv8+ uses anchor-free detection and decoupled classification/regression heads for better accuracy.
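The decoupled head idea is easy to see in code. Below is a conceptual PyTorch sketch (layer sizes and names are illustrative, not the actual Ultralytics implementation): classification and box regression get separate convolutional branches, and the regression branch predicts box values directly per grid cell, with no anchor boxes.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Conceptual sketch: separate cls/reg branches over one feature map."""
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_classes, 1))  # class scores per grid cell
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, 4, 1))            # 4 box values per cell (anchor-free)

    def forward(self, x):
        return self.cls_branch(x), self.reg_branch(x)

cls, reg = DecoupledHead()(torch.randn(1, 256, 20, 20))
print(cls.shape, reg.shape)  # (1, 80, 20, 20) and (1, 4, 20, 20)
```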
Non-Maximum Suppression (NMS)
Detectors often predict multiple overlapping boxes for the same object. NMS removes duplicates by keeping only the highest-confidence box.
Before NMS vs. after NMS (IoU threshold 0.5): many overlapping candidate boxes collapse to one box per object.
NMS Algorithm
1. Sort all boxes by confidence score (descending)
2. Take the highest-scoring box and add it to the final detections
3. Remove all remaining boxes whose IoU with the selected box exceeds the threshold
4. Repeat steps 2-3 until no boxes remain
IoU (Intersection over Union)
IoU measures box overlap: IoU(A, B) = area(A ∩ B) / area(A ∪ B), ranging from 0 (disjoint) to 1 (identical). A from-scratch sketch of the full algorithm follows.
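Here is a minimal sketch of greedy NMS with a plain-Python IoU helper. It is for illustration only; production code typically calls a library routine such as torchvision.ops.nms.

```python
def iou(a, b):
    """Intersection over Union of two xyxy boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # 1. Sort indices by confidence, descending
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:                 # 4. repeat until no boxes remain
        best = order.pop(0)      # 2. keep the highest-scoring box
        keep.append(best)
        # 3. drop remaining boxes that overlap it beyond the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed as a duplicate of box 0
```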
Detection Metrics
Understanding mAP, IoU thresholds, and speed metrics for comparing detectors.
Understanding mAP@50:95 (COCO metric)
The COCO benchmark uses a strict metric that averages AP across:
- 10 IoU thresholds: 0.50, 0.55, ..., 0.95
- 80 object categories
- ~5,000 validation images
| Model | mAP@50:95 |
|---|---|
| YOLOv11-X | 54.7% |
| YOLOv8-X | 53.9% |
| RT-DETR-X | 54.8% |
| DINO (Swin-L) | 63.2% |
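If you have predictions and ground truth in COCO JSON format, the standard way to compute mAP@50:95 is pycocotools. A minimal sketch (the file names are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('instances_val2017.json')      # ground-truth annotations
coco_dt = coco_gt.loadRes('detections.json')  # your model's detections

evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.50:.95], AP@.50, AP@.75, per-size APs, ...
```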
Speed vs Accuracy Trade-off
FPS measured on RTX 4090. Choose model size based on your speed/accuracy requirements.
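FPS figures only transfer so far, so it is worth measuring on your own hardware. A rough timing sketch with Ultralytics (the image path is a placeholder; the warm-up run keeps one-time setup cost out of the measurement):

```python
import time
from ultralytics import YOLO

model = YOLO('yolo11n.pt')         # nano: fastest, least accurate size
model('image.jpg', verbose=False)  # warm-up (model load, kernel init)

n = 50
start = time.perf_counter()
for _ in range(n):
    model('image.jpg', verbose=False)
print(f'{n / (time.perf_counter() - start):.1f} FPS')
```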
Code Examples
Get started with object detection in Python.
```python
from ultralytics import YOLO

# Load a pretrained model
model = YOLO('yolo11x.pt')

# Run inference on an image
results = model('image.jpg')

# Process results
for result in results:
    boxes = result.boxes.xyxy    # [x1, y1, x2, y2]
    confs = result.boxes.conf    # confidence scores
    classes = result.boxes.cls   # class indices
    for box, conf, cls in zip(boxes, confs, classes):
        label = model.names[int(cls)]
        print(f'{label}: {conf:.2f} at {box.tolist()}')
```
Quick Reference
- YOLO v8/v11 (any size)
- RT-DETR
- YOLO-NAS
- DINO / DINO-X
- Mask R-CNN
- Co-DETR
- Grounding DINO
- Florence-2
- YOLO-World
Use Cases
- ✓ Autonomous driving
- ✓ Security monitoring
- ✓ Inventory management
- ✓ Wildlife tracking
Architectural Patterns
Two-Stage Detectors
First propose regions, then classify (Faster R-CNN family).
Pros:
- High accuracy
- Good for small objects
- Flexible backbone
Cons:
- Slower inference
- Complex architecture
Single-Stage Detectors
Predict boxes and classes in one pass (YOLO, SSD).
Pros:
- Fast inference
- Real-time capable
- Simple deployment
Cons:
- May miss small objects
- Lower accuracy on crowded scenes
Transformer-Based
End-to-end detection with transformers (DETR family).
Pros:
- No NMS needed
- Elegant architecture
- Good accuracy
Cons:
- Slow training
- Needs large datasets
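For a feel of the DETR family in practice, here is a minimal inference sketch using the Hugging Face transformers port of the original DETR (facebook/detr-resnet-50); the image path is a placeholder. Note that no NMS step appears anywhere:

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')
model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')

image = Image.open('image.jpg')
inputs = processor(images=image, return_tensors='pt')
outputs = model(**inputs)

# Convert logits + normalized boxes to thresholded detections in pixel coords
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes)[0]
for score, label, box in zip(detections['scores'], detections['labels'], detections['boxes']):
    print(model.config.id2label[label.item()], f'{score.item():.2f}', box.tolist())
```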
Implementations
API Services
AWS Rekognition (AWS)
Managed detection API. Good for cloud-native apps.
Open Source
YOLOv8/YOLOv11 (AGPL-3.0)
State-of-the-art speed/accuracy. Easy to use, many sizes (n/s/m/l/x).
Grounding DINO (Apache-2.0)
Open-vocabulary detection. Detect anything with text prompts.
Florence-2 (MIT)
Unified vision model. Detection, segmentation, captioning in one.
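Open-vocabulary detection is also available directly in Ultralytics via YOLO-World. A minimal sketch following the documented set_classes API (the weights name matches the Ultralytics docs; the image path and class list are placeholders):

```python
from ultralytics import YOLO

# YOLO-World: detect classes described by free-text prompts
model = YOLO('yolov8s-world.pt')
model.set_classes(['person', 'forklift', 'safety helmet'])  # any text labels

results = model.predict('warehouse.jpg')
results[0].show()  # draw boxes for the prompted classes
```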
Quick Facts
- Input: Image
- Output: Bounding boxes
- Implementations: 4 open source, 1 API
- Patterns: 3 approaches