Multi-Object Tracking
Track multiple objects across video frames with consistent identities.
How Multi-Object Tracking Works
A comprehensive guide to tracking multiple objects across video frames. From detection-based association to re-identification after occlusion.
The Core Problem: Detection is Not Enough
Object detection tells you what is in a frame. Multi-object tracking tells you which object is which across frames. The same person detected in frame 1 should have the same ID in frame 100.
Interactive Tracking Visualization
[Side-by-side demo: detection only (IDs change every frame) vs. tracking (IDs stay consistent across frames).]
Data Association: Matching Detections to Tracks
The heart of tracking is the association problem: given N existing tracks and M new detections, which detection belongs to which track? This is a bipartite matching problem solved via cost matrices.
Association Pipeline
Cost Matrix Example
Each cell shows the cost (1 - IoU) of assigning a detection to a track. Lower cost = better match. The Hungarian algorithm finds the optimal assignment.
| | Det 1 | Det 2 | Det 3 |
|---|---|---|---|
| Track 1 | 0.15 | 0.85 | 0.92 |
| Track 2 | 0.78 | 0.22 | 0.88 |
| Track 3 | 0.95 | 0.80 | 0.18 |
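The optimal assignment for the matrix above can be computed directly with SciPy's implementation of the Hungarian algorithm; a minimal sketch (the cost values are copied from the table):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Cost matrix: rows = tracks, columns = detections, values = 1 - IoU
cost = np.array([
    [0.15, 0.85, 0.92],  # Track 1 vs Det 1..3
    [0.78, 0.22, 0.88],  # Track 2
    [0.95, 0.80, 0.18],  # Track 3
])

# Hungarian algorithm: minimizes the total assignment cost
track_idx, det_idx = linear_sum_assignment(cost)
for t, d in zip(track_idx, det_idx):
    print(f"Track {t + 1} -> Det {d + 1} (cost {cost[t, d]:.2f})")
```

Here the diagonal is clearly optimal, so each track keeps its matching detection; in practice assignments above a cost threshold are rejected and treated as unmatched.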
Kalman Filter: Motion Prediction
The Kalman Filter maintains a state estimate for each track: position (x, y), velocity (vx, vy), and bounding box dimensions.
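A minimal constant-velocity Kalman filter sketch in NumPy, tracking position and velocity only (trackers like SORT also include box scale and aspect ratio in the state; the noise covariances here are illustrative assumptions):

```python
import numpy as np

dt = 1.0  # one frame per time step
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])  # constant-velocity transition
H = np.array([[1.0, 0, 0, 0],
              [0, 1, 0, 0]])  # we observe position only
Q = np.eye(4) * 0.01          # process noise (illustrative)
R = np.eye(2) * 1.0           # measurement noise (illustrative)

x = np.array([0.0, 0.0, 2.0, 1.0])  # state: [x, y, vx, vy]
P = np.eye(4)                        # state covariance

def predict(x, P):
    """Project the state forward one frame."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a matched detection z = [x, y]."""
    y = z - H @ x                        # innovation
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

x, P = predict(x, P)                       # prediction: position (2.0, 1.0)
x, P = update(x, P, np.array([2.1, 0.9]))  # blend with a detection at (2.1, 0.9)
```

The corrected position lands between the prediction and the measurement, weighted by the Kalman gain; during occlusion the filter simply keeps predicting without updates.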
Occlusion and Re-Identification
The hardest part of tracking: what happens when objects overlap, disappear behind obstacles, or leave the frame? Re-identification (ReID) uses appearance features to recover tracks.
Occlusion Scenario
Motion-Only Recovery
Simple trackers like SORT rely on the Kalman Filter to predict where a track should be. If the prediction aligns with a detection when the object reappears, the track is recovered.
Appearance-Based Re-ID
DeepSORT and BoT-SORT extract appearance embeddings from each detection using a CNN (e.g., OSNet). These embeddings are compared against each track's saved embeddings, so an object can be re-identified even after a long occlusion.
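The comparison is typically a cosine distance between L2-normalized embeddings. A sketch with random stand-in vectors (a real system would use CNN features such as OSNet's):

```python
import numpy as np

def cosine_distance(a, b):
    """Pairwise cosine distance: rows of a (tracks) vs rows of b (detections)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T  # shape (n_tracks, n_dets)

rng = np.random.default_rng(0)
track_embs = rng.normal(size=(3, 128))  # saved per-track appearance features
# Simulated re-detections: the same objects, slightly perturbed features
det_embs = track_embs + rng.normal(scale=0.05, size=(3, 128))

dist = cosine_distance(track_embs, det_embs)
matches = dist.argmin(axis=1)  # each track matched to its nearest detection
```

In full trackers this appearance distance is blended with the motion cost (e.g., a weighted sum with the IoU or Mahalanobis cost) before running the assignment step.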
Track Lifecycle States
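A typical lifecycle moves a track through tentative, confirmed, lost, and removed states. A minimal sketch (state names and thresholds vary by implementation; `n_init` and `max_age` follow DeepSORT's naming):

```python
from enum import Enum

class TrackState(Enum):
    TENTATIVE = "tentative"  # newly created, not yet confirmed
    CONFIRMED = "confirmed"  # matched for enough consecutive frames
    LOST = "lost"            # unmatched, kept alive for a buffer of frames
    REMOVED = "removed"      # buffer exhausted, track deleted

class Track:
    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.state = TrackState.TENTATIVE
        self.hits = 0                # consecutive matched frames
        self.time_since_update = 0
        self.n_init = n_init         # matches needed to confirm
        self.max_age = max_age       # frames a lost track survives

    def mark_matched(self):
        self.hits += 1
        self.time_since_update = 0
        if self.state is TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED
        elif self.state is TrackState.LOST:
            self.state = TrackState.CONFIRMED  # re-identified

    def mark_missed(self):
        self.hits = 0
        self.time_since_update += 1
        if self.state is TrackState.TENTATIVE:
            self.state = TrackState.REMOVED    # unconfirmed tracks die fast
        elif self.time_since_update > self.max_age:
            self.state = TrackState.REMOVED
        else:
            self.state = TrackState.LOST
```

The `max_age` parameter corresponds to the `track_buffer` setting in ByteTrack-style configs: larger values survive longer occlusions but risk stale identity assignments.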
Tracking Methods: SORT to BoT-SORT
The evolution of multi-object tracking algorithms, from simple IoU matching to sophisticated appearance-aware methods. Each builds on its predecessors.
ByteTrack
- ✓ Uses all detections
- ✓ Very accurate
- ✓ Fast
- ✗ Motion-based only
- ✗ May lose identities when appearance cues are needed
Evolution of Multi-Object Tracking
ByteTrack Key Innovation: Using All Detections
Traditional trackers discard low-confidence detections. ByteTrack's insight is that these often correspond to occluded objects that should still be tracked. It performs association in two stages: high-confidence detections are matched first, then low-confidence detections are matched against the remaining unmatched tracks.
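A simplified sketch of the two-stage association with IoU-only costs (box coordinates and thresholds here are illustrative; the real ByteTrack adds Kalman prediction, track spawning, and confirmation logic):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, dets, thresh=0.3):
    """IoU matching via the Hungarian algorithm; returns matches + leftover tracks."""
    if not tracks or not dets:
        return [], list(range(len(tracks)))
    cost = np.array([[1 - iou(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if 1 - cost[r, c] >= thresh]
    matched_rows = [m[0] for m in matches]
    unmatched = [r for r in range(len(tracks)) if r not in matched_rows]
    return matches, unmatched

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]
high_dets = [[1, 0, 11, 10]]   # score above the track threshold
low_dets = [[21, 20, 31, 30]]  # score below it (e.g., an occluded object)

# Stage 1: match tracks to high-confidence detections.
matches1, leftover = associate(tracks, high_dets)
# Stage 2: match leftover tracks to low-confidence detections.
matches2, _ = associate([tracks[i] for i in leftover], low_dets)
```

In this toy example the second track would have been dropped by a single-stage tracker, but stage 2 recovers it from the low-confidence detection.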
Tracking Metrics
How do we measure tracking quality? The key metrics balance detection accuracy with identity preservation.
MOTA Formula
MOTA = 1 − (FN + FP + IDSW) / GT, where FN counts missed objects, FP false detections, IDSW identity switches, and GT the total number of ground-truth boxes across all frames.
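MOTA penalizes misses (FN), false positives (FP), and identity switches (IDSW) relative to the number of ground-truth boxes (GT). A worked example with made-up counts:

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT over a whole sequence."""
    return 1.0 - (fn + fp + idsw) / gt

# Illustrative numbers for one sequence with 100k ground-truth boxes
print(round(mota(fn=8000, fp=3000, idsw=500, gt=100000), 3))  # prints 0.885
```

Note that identity switches are weighted the same as detection errors, which is why IDF1 and HOTA are reported alongside MOTA to capture identity preservation.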
MOT17 Benchmark (Selected Results)
| Method | MOTA | IDF1 | HOTA | ID Sw |
|---|---|---|---|---|
| BoT-SORT | 80.5 | 80.2 | 65.0 | 1212 |
| ByteTrack | 80.3 | 77.3 | 63.1 | 2196 |
| OC-SORT | 78.0 | 77.5 | 63.2 | 1950 |
| DeepSORT | 61.4 | 62.2 | 45.6 | 2442 |
| SORT | 59.8 | 53.8 | 42.7 | 4852 |
Code Examples
Get started with multi-object tracking in Python. From simple one-liners to production-ready pipelines.
import cv2
import numpy as np
from bytetrack import BYTETracker
from ultralytics import YOLO

# Initialize detector and tracker
detector = YOLO('yolov8x.pt')
tracker = BYTETracker(
    track_thresh=0.5,   # High confidence threshold
    track_buffer=30,    # Frames to keep lost tracks
    match_thresh=0.8,   # IoU threshold for matching
    frame_rate=30
)

cap = cv2.VideoCapture('video.mp4')
while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Get detections from YOLO
    results = detector(frame)
    detections = results[0].boxes

    # Format: [x1, y1, x2, y2, score, class]
    dets = []
    for box in detections:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        score = box.conf[0].cpu().numpy()
        cls = box.cls[0].cpu().numpy()
        dets.append([x1, y1, x2, y2, score, cls])

    # Update tracker
    tracks = tracker.update(
        np.array(dets),
        img_info=(frame.shape[0], frame.shape[1]),
        img_size=(frame.shape[0], frame.shape[1])
    )

    # Draw tracks
    for track in tracks:
        x1, y1, x2, y2 = track.tlbr.astype(int)
        track_id = track.track_id
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f'ID: {track_id}',
                    (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX,
                    0.7, (0, 255, 0), 2)

cap.release()

Quick Reference
- SORT (simplest)
- ByteTrack
- OC-SORT
- BoT-SORT (SOTA)
- DeepSORT
- StrongSORT
- supervision (Roboflow)
- boxmot (Mikel Broström)
- ultralytics (built-in)
With ultralytics, tracking is a single call: model.track(source, tracker="bytetrack.yaml", persist=True). Switch to BoT-SORT ("botsort.yaml") if you need better re-identification after long occlusions.

Use Cases
- ✓ Surveillance
- ✓ Sports player tracking
- ✓ Autonomous driving
- ✓ Retail footfall analysis
Architectural Patterns
Detect-Then-Track
Per-frame detection plus association (SORT/ByteTrack).
Joint Detection and Embedding
Detector produces re-ID embeddings for association.
Quick Facts
- Input: Video
- Output: Structured Data
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches