Level 1: Single Blocks (~15 min)

Object Detection: Image to Bounding Boxes

Locate and classify objects in images. The foundation of visual perception systems.

What is Object Detection?

Object detection answers two questions simultaneously: What objects are in this image? and Where are they located?

Unlike image classification (which outputs a single label), object detection outputs a list of bounding boxes - rectangles that localize each detected object along with its class label and confidence score.

Output Format

Each detection contains:

  • Bounding box: (x1, y1, x2, y2) pixel coordinates of the top-left and bottom-right corners
  • Class: what type of object (person, car, dog, etc.)
  • Confidence: how certain the model is (0.0 to 1.0)
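In code, a single detection is often carried around as a small record. Here is a minimal sketch; the Detection dataclass and its field names are illustrative, not a type from any detection library:

# Illustrative container for one detection (not a library type)
from dataclasses import dataclass

@dataclass
class Detection:
    x1: float          # left edge (pixels)
    y1: float          # top edge (pixels)
    x2: float          # right edge (pixels)
    y2: float          # bottom edge (pixels)
    label: str         # class name, e.g. 'person'
    confidence: float  # model certainty in [0.0, 1.0]

det = Detection(34.0, 50.0, 210.0, 380.0, 'person', 0.92)
print(f'{det.label}: {det.confidence:.2f}')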

Closed vs Open Vocabulary Detection

Object detectors fall into two categories based on what they can detect:

Closed Vocabulary

Can only detect the classes the model was trained on. For example, the COCO dataset covers 80 classes (person, car, dog, etc.).

Examples:

YOLO, RT-DETR, Faster R-CNN

+ Faster inference
+ Higher accuracy on known classes
- Cannot detect novel objects

Open Vocabulary

Can detect any object you describe in natural language. Uses vision-language models to match text descriptions to image regions.

Examples:

Grounding DINO, OWL-ViT, GLIP

+ Detect any describable object
+ Zero-shot capability
- Slower inference

YOLO v11 (Ultralytics)

YOLO (You Only Look Once) is the most popular object detection architecture. YOLO v11 is the latest version from Ultralytics, achieving ~54.7 mAP on COCO with real-time performance.

Model Variants

Model     Params   mAP    Speed (T4)   Use Case
yolo11n   2.6M     39.5   1.5ms        Edge/mobile
yolo11s   9.4M     47.0   2.5ms        Balanced
yolo11m   20.1M    51.5   4.7ms        General purpose
yolo11l   25.3M    53.4   6.2ms        High accuracy
yolo11x   56.9M    54.7   11.3ms       Maximum accuracy

mAP = mean Average Precision on COCO val2017. Speed measured on NVIDIA T4 GPU.

# YOLO v11 Object Detection
from ultralytics import YOLO

# Load model (downloads automatically on first run)
model = YOLO('yolo11x.pt')  # or yolo11n.pt for speed

# Run inference
results = model('image.jpg')

# Process detections
for result in results:
    boxes = result.boxes
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0]  # corner coordinates in pixels
        conf = box.conf[0]            # confidence score
        cls = int(box.cls[0])         # class index
        print(f'{model.names[cls]}: {conf:.2f} at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})')

Installation: pip install ultralytics
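
The speeds in the variants table depend heavily on hardware. As a rough sketch for checking them on your own machine, Ultralytics attaches per-stage timings (in milliseconds) to each result; run a few warm-up inferences first for stable numbers:

# Rough per-image latency check across variants (values vary by hardware)
from ultralytics import YOLO

for variant in ['yolo11n.pt', 'yolo11m.pt', 'yolo11x.pt']:
    model = YOLO(variant)
    results = model('image.jpg', verbose=False)
    # results[0].speed holds preprocess/inference/postprocess times in ms
    print(variant, results[0].speed)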

RT-DETR (Real-Time Detection Transformer)

RT-DETR is a transformer-based detector from Baidu that achieves competitive accuracy with an end-to-end design. Unlike YOLO's CNN-based design, RT-DETR uses attention mechanisms for better global context understanding.

RT-DETR Advantages

  • End-to-end detection (no NMS post-processing needed; NMS is sketched below)
  • Better at detecting overlapping objects
  • More flexible input resolution scaling
  • ~54.8 mAP on COCO with RT-DETR-X (~53.0 with RT-DETR-L)
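
For context, non-maximum suppression (NMS) is the post-processing step that closed detectors like YOLO use to discard duplicate boxes for the same object; RT-DETR's end-to-end design makes it unnecessary. A minimal sketch of the idea using torchvision's built-in operator (torchvision is assumed to be installed; it ships alongside PyTorch):

# What RT-DETR skips: NMS drops lower-scoring boxes that overlap a kept box too much
import torch
from torchvision.ops import nms

boxes = torch.tensor([[0., 0., 100., 100.],      # detection A
                      [5., 5., 105., 105.],      # near-duplicate of A
                      [200., 200., 300., 300.]]) # a separate object
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) - the near-duplicate at index 1 is suppressed
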
# RT-DETR Object Detection
from ultralytics import RTDETR

# Load RT-DETR model
model = RTDETR('rtdetr-l.pt')  # or rtdetr-x.pt for max accuracy

# Run inference - same API as YOLO
results = model('image.jpg')

# Process results (identical interface)
for result in results:
    boxes = result.boxes
    ...  # Same as YOLO

RT-DETR uses the same Ultralytics API, making it easy to swap between YOLO and RT-DETR depending on your accuracy vs speed requirements.
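
A minimal sketch of that swap; the use_transformer flag is illustrative, not part of the library:

# Both model classes expose the same results interface
from ultralytics import YOLO, RTDETR

use_transformer = True  # illustrative config flag
model = RTDETR('rtdetr-l.pt') if use_transformer else YOLO('yolo11m.pt')

results = model('image.jpg')
for box in results[0].boxes:
    print(model.names[int(box.cls[0])], float(box.conf[0]))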

Grounding DINO (Open Vocabulary)

Grounding DINO combines a DINO detector with grounded pre-training to enable open-vocabulary detection. You describe what you want to find in natural language, and the model locates it.

The Power of Text Prompts

Unlike YOLO, which can only detect its 80 trained classes, Grounding DINO can detect:

  • "person wearing a red hat"
  • "damaged car parts"
  • "brand logos"
  • Any object you can describe in words
# Grounding DINO - Open Vocabulary Detection
from groundingdino.util.inference import load_model, load_image, predict

# Load model (config file and weights from the GitHub repo)
model = load_model(
    'GroundingDINO_SwinT_OGC.py',
    'groundingdino_swint_ogc.pth'
)

# Load and preprocess the input image
image_source, image = load_image('image.jpg')

# Detect with text prompts (period-separated)
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption='person . dog . car',  # Any text prompt!
    box_threshold=0.35,
    text_threshold=0.25
)

# boxes: tensor of [cx, cy, w, h] normalized coordinates
# logits: confidence scores
# phrases: matched text phrases

Installation:

pip install groundingdino-py

Note: Grounding DINO requires downloading model weights separately from the GitHub repo.

Speed vs Accuracy Tradeoffs

Choosing the right detector depends on your constraints. Here is a comparison:

Model             Vocabulary   mAP    Speed (T4)
YOLO11x           Closed       54.7   11.3ms
RT-DETR-X         Closed       54.8   9.3ms
RT-DETR-L         Closed       53.0   5.0ms
YOLO11m           Closed       51.5   4.7ms
YOLO11n           Closed       39.5   1.5ms
Grounding DINO-T  Open         48.4   45ms

mAP on COCO val2017. Speed on NVIDIA T4 GPU. Open vocabulary detectors trade speed for flexibility.

When to Use Which Detector

Real-Time Applications (30+ FPS)

Use YOLO11n or YOLO11s. Surveillance, robotics, live video.

1.5-2.5ms per frame. Sacrifice accuracy for speed.

Balanced Production

Use YOLO11m or RT-DETR-L. General purpose, batch processing.

~5ms per frame. Good tradeoff between speed and accuracy.

Maximum Accuracy

Use YOLO11x or RT-DETR-X. Medical imaging, quality inspection.

~10ms per frame. When accuracy is more important than speed.

Custom Object Detection

Use Grounding DINO. Novel objects, dynamic categories, prototyping.

~45ms per frame. No training needed for new categories.
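
The guidance above condenses into a small helper. This is an illustrative sketch; the function name and latency cutoffs are ours, distilled from the comparison table, not any library API:

# Illustrative: map deployment constraints to a detector choice
def pick_detector(needs_open_vocab: bool, max_latency_ms: float) -> str:
    if needs_open_vocab:
        return 'Grounding DINO (~45ms, text-prompted)'
    if max_latency_ms < 3:
        return 'yolo11n / yolo11s (1.5-2.5ms, real-time)'
    if max_latency_ms < 6:
        return 'yolo11m / rtdetr-l (~5ms, balanced)'
    return 'yolo11x / rtdetr-x (~10ms, maximum accuracy)'

print(pick_detector(needs_open_vocab=False, max_latency_ms=5.0))  # balanced tier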

COCO Benchmark

COCO (Common Objects in Context) is the standard benchmark for object detection. It contains 80 object categories with bounding box annotations across 200K+ images.

Key Metrics

mAP (mean Average Precision)

Primary metric. Averages precision across IoU thresholds from 0.50 to 0.95 in steps of 0.05. Higher is better.

AP50 / AP75

AP at IoU threshold 0.5 (lenient) and 0.75 (strict). Useful for understanding localization quality.
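
Both thresholds are defined by IoU (intersection over union): the overlap area between a predicted box and the ground-truth box, divided by the area of their union. A minimal sketch of the computation:

# IoU of two (x1, y1, x2, y2) boxes: intersection area / union area
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted by half its width overlaps ground truth at IoU = 1/3,
# so it misses both the 0.5 (AP50) and 0.75 (AP75) thresholds
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # 0.333...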

Key Takeaways

  1. Object detection outputs bounding boxes - coordinates, class labels, and confidence scores for each detected object.

  2. YOLO v11 is the go-to for speed - from 1.5ms (nano) to 11.3ms (extra-large), covering edge to server deployments.

  3. RT-DETR offers a transformer architecture - end-to-end detection without NMS, better for overlapping objects.

  4. Grounding DINO enables open-vocabulary detection - detect any object via text prompts, no retraining needed.