Multi-Object Tracking

Track multiple objects across video frames with consistent identities.

How Multi-Object Tracking Works

A guide to tracking multiple objects across video frames, from detection-based association to re-identification after occlusion.

1

The Core Problem: Detection is Not Enough

Object detection tells you what is in a frame. Multi-object tracking tells you which object is which across frames. The same person detected in frame 1 should have the same ID in frame 100.

[Interactive tracking visualization: a 6-frame sequence with three tracked objects: Person 1 (ID: 1), Person 2 (ID: 2), and Car (ID: 3).]

Without Tracking (Detection Only)

# Frame 1
Person @ [10, 20, 22, 45]
Person @ [75, 25, 87, 50]
# Frame 2
Person @ [18, 20, 30, 45]
Person @ [67, 25, 79, 50]
Problem: Which person in Frame 2 is which person from Frame 1? We have no way to know from detections alone.

With Tracking

# Frame 1
Person ID=1 @ [10, 20, 22, 45]
Person ID=2 @ [75, 25, 87, 50]
# Frame 2
Person ID=1 @ [18, 20, 30, 45]
Person ID=2 @ [67, 25, 79, 50]
Solution: Tracking assigns persistent IDs. Person 1 remains Person 1. We can now analyze individual trajectories, count unique visitors, etc.
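
As a small illustration of what persistent IDs make possible, the sketch below groups the tracked boxes from the example above into one trajectory per ID and counts unique objects. The tuple layout and the helper name box_center are assumptions made for this example only.

```python
from collections import defaultdict

# Tracked detections as (frame, track_id, [x1, y1, x2, y2]), copied from the example above.
tracked = [
    (1, 1, [10, 20, 22, 45]),
    (1, 2, [75, 25, 87, 50]),
    (2, 1, [18, 20, 30, 45]),
    (2, 2, [67, 25, 79, 50]),
]

def box_center(box):
    """Return the (cx, cy) center of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# Group box centers by track ID to get one trajectory per object.
trajectories = defaultdict(list)
for frame, track_id, box in tracked:
    trajectories[track_id].append(box_center(box))

# Each distinct ID is one unique object (e.g., one visitor).
unique_objects = len(trajectories)
```

With detections alone, the same four boxes could not be split into two trajectories, because nothing links a box in frame 2 to a box in frame 1.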
2

Data Association: Matching Detections to Tracks

The heart of tracking is the association problem: given N existing tracks and M new detections, which detection belongs to which track? This is a bipartite matching problem solved via cost matrices.

Association Pipeline

Detections (frame t) + predicted tracks (from t-1) -> cost matrix (IoU / distance, N x M costs) -> Hungarian algorithm (optimal match) -> updated tracks (frame t)

Cost Matrix Example

Each cell shows the cost (1 - IoU) of assigning a detection to a track. Lower cost = better match. The Hungarian algorithm finds the optimal assignment.

          Det 1   Det 2   Det 3
Track 1   0.15    0.85    0.92
Track 2   0.78    0.22    0.88
Track 3   0.95    0.80    0.18
Result: Track 1 matched to Det 1, Track 2 matched to Det 2, Track 3 matched to Det 3
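
The assignment for the cost matrix above can be reproduced with SciPy's linear_sum_assignment, which solves the same minimum-cost bipartite matching problem. This is a minimal sketch; the gating threshold MAX_COST is an assumed value, not one taken from a specific tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows are tracks, columns are detections; values are 1 - IoU from the table above.
cost = np.array([
    [0.15, 0.85, 0.92],
    [0.78, 0.22, 0.88],
    [0.95, 0.80, 0.18],
])

# Find the assignment that minimizes the total cost.
track_idx, det_idx = linear_sum_assignment(cost)
matches = [(int(t), int(d)) for t, d in zip(track_idx, det_idx)]
total_cost = float(cost[track_idx, det_idx].sum())

# Gating: drop pairs whose cost is too high to be a plausible match.
MAX_COST = 0.7  # assumed threshold for illustration
matches = [(t, d) for t, d in matches if cost[t, d] <= MAX_COST]
```

Gating matters because the solver always returns a full assignment, even when a track's best available detection overlaps it very little.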

  • IoU (Intersection over Union): measures the overlap between predicted and detected boxes. IoU = Area(A ∩ B) / Area(A ∪ B)
  • Hungarian Algorithm: finds the optimal assignment for a cost matrix by minimizing the total cost of all assignments.
  • Kalman Filter: predicts the next position from the current position and velocity. x_t = F * x_{t-1} + noise
  • Cosine Distance: measures dissimilarity between appearance embeddings. d = 1 - (a . b) / (|a| * |b|)
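
To make the IoU cost concrete, here is a minimal sketch that computes IoU for [x1, y1, x2, y2] boxes and builds a 1 - IoU cost matrix. It uses the two person boxes from frames 1 and 2 of the earlier example; the function names are illustrative.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_cost_matrix(tracks, detections):
    """Cost matrix with one row per track and one column per detection."""
    cost = np.ones((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            cost[i, j] = 1.0 - iou(t, d)
    return cost

tracks = [[10, 20, 22, 45], [75, 25, 87, 50]]      # boxes from frame 1
detections = [[18, 20, 30, 45], [67, 25, 79, 50]]  # boxes from frame 2
cost = iou_cost_matrix(tracks, detections)
```

Here each person overlaps only with their own previous box, so the diagonal costs are lower than the off-diagonal ones, which are exactly 1.0 (no overlap).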

Kalman Filter: Motion Prediction

The Kalman Filter maintains a state estimate for each track: position (x, y), velocity (vx, vy), and bounding box dimensions.

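# State vector: position (x, y), velocity (vx, vy), box width and height (w, h),
# and aspect ratio (ar = w / h)
state = [x, y, vx, vy, w, h, ar]
# Predict step (constant-velocity model; dt is the time between frames)
x_pred = x + vx * dt
y_pred = y + vy * dt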
When a detection matches, the Kalman Filter updates the state. When no detection matches (occlusion), the prediction carries the track forward.
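
The predict and update steps can be sketched with a small constant-velocity Kalman filter in NumPy. This version tracks only the box center [x, y, vx, vy] rather than the full state above, and the noise covariances are assumed values chosen for illustration.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal Kalman filter over the state [x, y, vx, vy]."""

    def __init__(self, x, y, dt=1.0):
        self.x = np.array([x, y, 0.0, 0.0])           # state estimate
        self.P = np.diag([10.0, 10.0, 100.0, 100.0])  # state covariance (assumed)
        self.F = np.array([[1, 0, dt, 0],             # constant-velocity transition
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],              # only position is observed
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01                     # process noise (assumed)
        self.R = np.eye(2) * 1.0                      # measurement noise (assumed)

    def predict(self):
        """Propagate the state one frame forward; returns predicted (x, y)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Correct the state with a measured position z = (x, y)."""
        z = np.asarray(z, dtype=float)
        y = z - self.H @ self.x                       # innovation
        S = self.H @ self.P @ self.H.T + self.R       # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

kf = ConstantVelocityKF(x=16.0, y=32.5)
for cx in (24.0, 32.0, 40.0):       # object moves +8 px per frame
    kf.predict()
    kf.update([cx, 32.5])

# Occlusion: no detections for two frames, so the track coasts on predictions.
coasted = [kf.predict().copy() for _ in range(2)]
```

After three matched frames the filter has learned a velocity close to 8 px per frame, so the coasted positions keep moving in the same direction while the covariance P grows.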
3

Occlusion and Re-Identification

The hardest part of tracking: what happens when objects overlap, disappear behind obstacles, or leave the frame? Re-identification (ReID) uses appearance features to recover tracks.

Occlusion Scenario

  • Before occlusion (t=1): ID 1 and ID 2 are both visible.
  • During occlusion (t=3): the two tracks cross.
  • After occlusion (t=5): both tracks are recovered with their correct IDs.

Motion-Only Recovery

Simple trackers like SORT rely on the Kalman Filter to predict where a track should be. If the prediction aligns with a detection when the object reappears, the track is recovered.

Limitation
Long occlusions cause predictions to drift. If two similar objects cross paths, IDs may be swapped. Motion alone cannot distinguish identical-looking objects.

Appearance-Based Re-ID

DeepSORT and BoT-SORT extract appearance embeddings from each detection using a CNN (e.g., OSNet). These embeddings are compared to saved embeddings from each track.

Advantage
Even after long occlusion, a person's appearance (clothing, body shape) remains consistent. ReID can correctly re-associate tracks that motion alone would confuse.
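
The embedding comparison can be sketched with the cosine distance d = 1 - (a . b) / (|a| * |b|). The 3-dimensional vectors, the gallery dictionary, and the max_distance threshold below are hypothetical; real ReID embeddings have hundreds of dimensions and come from a trained network.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def reidentify(query, gallery, max_distance=0.3):
    """Return the track ID whose stored embedding is closest to the query,
    or None if no stored embedding is within max_distance."""
    best_id, best_d = None, max_distance
    for track_id, embedding in gallery.items():
        d = cosine_distance(query, embedding)
        if d < best_d:
            best_id, best_d = track_id, d
    return best_id

# Hypothetical stored embeddings for two lost tracks.
gallery = {1: [0.9, 0.1, 0.0], 2: [0.0, 0.2, 0.9]}

same_person = reidentify([0.8, 0.2, 0.1], gallery)   # close to track 1
stranger = reidentify([0.0, 1.0, 0.0], gallery)      # far from both tracks
```

Returning None for a distant query lets the tracker start a new track instead of forcing a wrong re-association.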

Track Lifecycle States

Tentative (new detection; n_init hits needed) -> Confirmed (active track; matched each frame) -> Lost (no match; using prediction) -> Deleted (max_age exceeded; track removed)

Lost tracks can return to Confirmed if a matching detection is found before max_age is exceeded.
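
The lifecycle can be sketched as a small state machine driven by two events per frame: matched or missed. The class below is illustrative rather than any library's implementation; in particular, deleting a Tentative track on its first miss is an assumption (it mirrors DeepSORT-style behavior), and the default values of n_init and max_age are examples.

```python
from enum import Enum

class TrackState(Enum):
    TENTATIVE = "tentative"
    CONFIRMED = "confirmed"
    LOST = "lost"
    DELETED = "deleted"

class Track:
    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.n_init = n_init      # hits needed before the track is confirmed
        self.max_age = max_age    # consecutive misses tolerated before deletion
        self.hits = 1             # the detection that created the track
        self.misses = 0           # consecutive frames without a match
        self.state = TrackState.TENTATIVE

    def mark_matched(self):
        """Call when a detection was associated with this track."""
        if self.state == TrackState.DELETED:
            return
        self.hits += 1
        self.misses = 0
        if self.state == TrackState.LOST:
            self.state = TrackState.CONFIRMED
        elif self.state == TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED

    def mark_missed(self):
        """Call when no detection was associated with this track."""
        if self.state == TrackState.DELETED:
            return
        self.misses += 1
        if self.state == TrackState.TENTATIVE:
            self.state = TrackState.DELETED   # assumption: unconfirmed tracks die quickly
        elif self.misses > self.max_age:
            self.state = TrackState.DELETED
        else:
            self.state = TrackState.LOST
```

A short max_age removes stale tracks quickly but gives up on long occlusions; a long max_age does the opposite, so the value is usually tuned to the frame rate and the scene.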
4

Tracking Methods: SORT to BoT-SORT

Multi-object tracking algorithms have evolved from simple IoU matching to sophisticated appearance-aware methods, with each method building on its predecessors.

ByteTrack

Paper: ByteTrack: Multi-Object Tracking by Associating Every Detection Box
Published: 2022
Speed: 150 FPS
MOTA: 80.3
Approach: Two-stage association (high + low confidence)
Re-ID Features: No
Strengths
  • Uses all detections
  • Very accurate
  • Fast
Weaknesses
  • Motion-based only
  • May swap IDs when appearance is needed to tell objects apart

Evolution of Multi-Object Tracking

SORT (2016) -> DeepSORT (2017) -> ByteTrack (2022) -> BoT-SORT (2022) -> OC-SORT (2023)

ByteTrack Key Innovation: Using All Detections

Traditional trackers discard low-confidence detections. The ByteTrack authors observed that these detections often correspond to occluded objects that should still be tracked. ByteTrack therefore performs association in two stages:

Stage 1: High-Confidence
Associate tracks with detections above track_thresh (e.g., 0.6). These are clear, unoccluded detections.
Stage 2: Low-Confidence
Unmatched tracks are associated with remaining low-confidence detections. Rescues occluded or partially visible objects.
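
The two stages can be sketched with IoU costs and SciPy's linear_sum_assignment. This is a simplified illustration, not the ByteTrack implementation: it omits Kalman prediction and track management, and low_thresh and the two min_iou values are assumed for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(a, b):
    """Pairwise IoU between two sets of [x1, y1, x2, y2] boxes."""
    a = np.asarray(a, dtype=float).reshape(-1, 4)
    b = np.asarray(b, dtype=float).reshape(-1, 4)
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(track_boxes, det_boxes, min_iou):
    """Hungarian matching on 1 - IoU; returns (track_idx, det_idx) pairs."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return []
    iou = iou_matrix(track_boxes, det_boxes)
    rows, cols = linear_sum_assignment(1.0 - iou)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if iou[r, c] >= min_iou]

def two_stage_associate(track_boxes, dets, scores, track_thresh=0.6, low_thresh=0.1):
    """Return {track index: detection index} after both association stages."""
    dets = np.asarray(dets, dtype=float)
    scores = np.asarray(scores, dtype=float)
    high = np.flatnonzero(scores >= track_thresh)
    low = np.flatnonzero((scores >= low_thresh) & (scores < track_thresh))

    # Stage 1: all tracks against high-confidence detections.
    matches = {t: int(high[d]) for t, d in match(track_boxes, dets[high], min_iou=0.2)}

    # Stage 2: still-unmatched tracks against low-confidence detections.
    remaining = [t for t in range(len(track_boxes)) if t not in matches]
    stage2 = match([track_boxes[t] for t in remaining], dets[low], min_iou=0.5)
    for t, d in stage2:
        matches[remaining[t]] = int(low[d])
    return matches

tracks = [[10, 20, 22, 45], [75, 25, 87, 50]]
dets = [[12, 20, 24, 45], [74, 25, 86, 50]]
scores = [0.9, 0.3]   # the second object is partially occluded
result = two_stage_associate(tracks, dets, scores)
```

In this example the second track is only recovered in stage 2; a tracker that dropped every detection below track_thresh would have marked it as lost.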
5

Tracking Metrics

How do we measure tracking quality? The key metrics balance detection accuracy with identity preservation.

  • MOTA (Multi-Object Tracking Accuracy): Combines FP, FN, and ID switches into one score. Higher is better.
  • IDF1 (ID F1 Score): Harmonic mean of ID precision and recall. Measures identity preservation.
  • HOTA (Higher Order Tracking Accuracy): Balances detection and association. Modern replacement for MOTA.
  • ID Sw (ID Switches): Number of times a track ID changes for the same object. Lower is better.
  • FP/FN (False Positives/Negatives): Spurious or missed detections affecting tracking.
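
As a worked example of the IDF1 definition, the sketch below computes ID precision, ID recall, and IDF1 from identity-level counts. The counts are hypothetical, and the step that produces them (a global matching between ground-truth and predicted trajectories) is not shown.

```python
def id_scores(idtp, idfp, idfn):
    """Return (ID precision, ID recall, IDF1) from identity-level counts.

    idtp: detections assigned to the correct identity
    idfp: predicted detections with no correct identity match
    idfn: ground-truth detections with no correct identity match
    """
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return idp, idr, idf1

# Hypothetical counts for illustration.
idp, idr, idf1 = id_scores(idtp=90, idfp=10, idfn=30)
```

The closed form 2 * IDTP / (2 * IDTP + IDFP + IDFN) is algebraically the same as the harmonic mean of ID precision and ID recall.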

MOTA Formula

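MOTA = 1 - (FN + FP + ID_Sw) / GT

where all counts are summed over every frame of the sequence:
  • FN: false negatives (ground-truth objects that were missed)
  • FP: false positives (tracker outputs with no matching ground-truth object)
  • ID_Sw: ID switches
  • GT: total number of ground-truth objects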
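
A direct translation of the formula, with hypothetical counts, shows how the terms combine and why MOTA can be negative.

```python
def mota(fn, fp, id_sw, gt):
    """MOTA = 1 - (FN + FP + ID_Sw) / GT, with counts summed over all frames."""
    if gt <= 0:
        raise ValueError("gt must be a positive number of ground-truth objects")
    return 1.0 - (fn + fp + id_sw) / gt

# Hypothetical counts for illustration.
good = mota(fn=120, fp=60, id_sw=20, gt=1000)   # 200 errors on 1000 objects
bad = mota(fn=80, fp=50, id_sw=0, gt=100)       # more errors than objects
```

Because FN and FP usually far outnumber ID switches, MOTA is dominated by detection quality, which is one reason IDF1 and HOTA are reported alongside it.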

MOT17 Benchmark (Selected Results)

Method      MOTA   IDF1   HOTA   ID Sw
BoT-SORT    80.5   80.2   65.0   1212
ByteTrack   80.3   77.3   63.1   2196
OC-SORT     78.0   77.5   63.2   1950
DeepSORT    61.4   62.2   45.6   2442
SORT        59.8   53.8   42.7   4852
6

Code Examples

Get started with multi-object tracking in Python, from simple one-liners to production-ready pipelines.

ByteTrack (Official)
pip install bytetrack

import cv2
import numpy as np
from bytetrack import BYTETracker
from ultralytics import YOLO

# Initialize detector and tracker.
# The keyword arguments assume the installed package's BYTETracker accepts them;
# the BYTETracker class in the official ByteTrack repository takes an `args`
# namespace instead, so check the signature of the package you install.
detector = YOLO('yolov8x.pt')
tracker = BYTETracker(
    track_thresh=0.5,      # High confidence threshold
    track_buffer=30,       # Frames to keep lost tracks
    match_thresh=0.8,      # IoU threshold for matching
    frame_rate=30
)

cap = cv2.VideoCapture('video.mp4')

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Get detections from YOLO
    boxes = detector(frame)[0].boxes

    # Build an (N, 6) array: [x1, y1, x2, y2, score, class].
    # Column-stacking keeps the shape (0, 6) when nothing is detected.
    dets = np.column_stack([
        boxes.xyxy.cpu().numpy(),
        boxes.conf.cpu().numpy(),
        boxes.cls.cpu().numpy(),
    ])

    # Update tracker
    height, width = frame.shape[:2]
    tracks = tracker.update(
        dets,
        img_info=(height, width),
        img_size=(height, width)
    )

    # Draw tracks
    for track in tracks:
        x1, y1, x2, y2 = map(int, track.tlbr)  # plain ints for OpenCV
        track_id = track.track_id

        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f'ID: {track_id}',
                    (x1, y1-10), cv2.FONT_HERSHEY_SIMPLEX,
                    0.7, (0, 255, 0), 2)

    # Show the annotated frame; press 'q' to stop early
    cv2.imshow('Tracking', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
Quick Reference

For Speed (100+ FPS)
  • SORT (simplest)
  • ByteTrack
  • OC-SORT
For Accuracy (Re-ID)
  • BoT-SORT (SOTA)
  • DeepSORT
  • StrongSORT
Libraries
  • supervision (Roboflow)
  • boxmot (Mikel Brostr.)
  • ultralytics (built-in)
Recommended Starting Point
For most applications, start with Ultralytics YOLO + ByteTrack. One line of code: model.track(source, tracker="bytetrack.yaml", persist=True). Add BoT-SORT if you need better re-identification after long occlusions.
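
The one-liner can be expanded into a short sketch that runs the tracker over a video and groups boxes by track ID. The helper names collect_tracks and track_video are hypothetical; model.track, the bytetrack.yaml tracker config, and the boxes.id attribute come from Ultralytics, and stream=True is used here so results are produced frame by frame.

```python
from collections import defaultdict

def collect_tracks(results):
    """Group boxes by track ID from an iterable of Ultralytics Results objects."""
    tracks = defaultdict(list)
    for frame_idx, result in enumerate(results):
        boxes = result.boxes
        if boxes is None or boxes.id is None:   # no tracked objects in this frame
            continue
        for track_id, xyxy in zip(boxes.id.tolist(), boxes.xyxy.tolist()):
            tracks[int(track_id)].append((frame_idx, xyxy))
    return dict(tracks)

def track_video(source, weights="yolov8x.pt"):
    """Run YOLO + ByteTrack on a video and return {track_id: [(frame, box), ...]}."""
    from ultralytics import YOLO  # imported here so collect_tracks stays dependency-free
    model = YOLO(weights)
    results = model.track(source, tracker="bytetrack.yaml", persist=True, stream=True)
    return collect_tracks(results)

# Usage (requires the ultralytics package and a local video file):
# tracks = track_video("video.mp4")
# print(f"{len(tracks)} unique objects tracked")
```

Switching to BoT-SORT in this sketch would only require passing tracker="botsort.yaml" instead.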

Use Cases

  • Surveillance
  • Sports player tracking
  • Autonomous driving
  • Retail footfall analysis

Architectural Patterns

Detect-Then-Track

Per-frame detection plus association (SORT/ByteTrack).

Joint Detection and Embedding

Detector produces re-ID embeddings for association.

Implementations

Open Source

  • ByteTrack (MIT): State-of-the-art accuracy/speed.
  • StrongSORT/BoT-SORT (MIT): Robust tracking with re-ID.
  • OC-SORT (MIT): Occlusion-aware tracker.

Quick Facts

  • Input: Video
  • Output: Structured Data
  • Implementations: 3 open source, 0 API
  • Patterns: 2 approaches
