Multi-Object Tracking
Track multiple objects across video frames with consistent identities.
How Multi-Object Tracking Works
A comprehensive guide to tracking multiple objects across video frames. From detection-based association to re-identification after occlusion.
The Core Problem: Detection is Not Enough
Object detection tells you what is in a frame. Multi-object tracking tells you which object is which across frames. The same person detected in frame 1 should have the same ID in frame 100.
Interactive Tracking Visualization
[Side-by-side demo: detection only (IDs change every frame) vs. tracking (IDs stay consistent across frames).]
Data Association: Matching Detections to Tracks
The heart of tracking is the association problem: given N existing tracks and M new detections, which detection belongs to which track? This is a bipartite matching problem solved via cost matrices.
Association Pipeline
Cost Matrix Example
Each cell shows the cost (1 - IoU) of assigning a detection to a track. Lower cost = better match. The Hungarian algorithm finds the optimal assignment.
| | Det 1 | Det 2 | Det 3 |
|---|---|---|---|
| Track 1 | 0.15 | 0.85 | 0.92 |
| Track 2 | 0.78 | 0.22 | 0.88 |
| Track 3 | 0.95 | 0.80 | 0.18 |
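The optimal assignment for the matrix above can be computed directly with SciPy's implementation of the Hungarian algorithm; a minimal sketch (the cost values are copied from the table):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Cost matrix: rows = tracks, columns = detections, values = 1 - IoU
cost = np.array([
    [0.15, 0.85, 0.92],  # Track 1 vs Det 1..3
    [0.78, 0.22, 0.88],  # Track 2
    [0.95, 0.80, 0.18],  # Track 3
])

# Hungarian algorithm: minimizes the total assignment cost
track_idx, det_idx = linear_sum_assignment(cost)
for t, d in zip(track_idx, det_idx):
    print(f"Track {t + 1} -> Det {d + 1} (cost {cost[t, d]:.2f})")
```

Here the diagonal is clearly optimal, so each track keeps its matching detection; in practice assignments above a cost threshold are rejected and treated as unmatched.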
Kalman Filter: Motion Prediction
The Kalman Filter maintains a state estimate for each track: position (x, y), velocity (vx, vy), and bounding box dimensions.
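A minimal constant-velocity Kalman filter sketch in NumPy, tracking position and velocity only (trackers like SORT also include box scale and aspect ratio in the state; the noise covariances here are illustrative assumptions):

```python
import numpy as np

dt = 1.0  # one frame per time step
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])  # constant-velocity transition
H = np.array([[1.0, 0, 0, 0],
              [0, 1, 0, 0]])  # we observe position only
Q = np.eye(4) * 0.01          # process noise (illustrative)
R = np.eye(2) * 1.0           # measurement noise (illustrative)

x = np.array([0.0, 0.0, 2.0, 1.0])  # state: [x, y, vx, vy]
P = np.eye(4)                        # state covariance

def predict(x, P):
    """Project the state forward one frame."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a matched detection z = [x, y]."""
    y = z - H @ x                        # innovation
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

x, P = predict(x, P)                       # prediction: position (2.0, 1.0)
x, P = update(x, P, np.array([2.1, 0.9]))  # blend with a detection at (2.1, 0.9)
```

The corrected position lands between the prediction and the measurement, weighted by the Kalman gain; during occlusion the filter simply keeps predicting without updates.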
Occlusion and Re-Identification
The hardest part of tracking: what happens when objects overlap, disappear behind obstacles, or leave the frame? Re-identification (ReID) uses appearance features to recover tracks.
Occlusion Scenario
Motion-Only Recovery
Simple trackers like SORT rely on the Kalman Filter to predict where a track should be. If the prediction aligns with a detection when the object reappears, the track is recovered.
Appearance-Based Re-ID
DeepSORT and BoT-SORT extract appearance embeddings from each detection using a CNN (e.g., OSNet). These embeddings are compared against each track's saved embeddings, so an object can be re-identified even after a long occlusion.
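The comparison is typically a cosine distance between L2-normalized embeddings. A sketch with random stand-in vectors (a real system would use CNN features such as OSNet's):

```python
import numpy as np

def cosine_distance(a, b):
    """Pairwise cosine distance: rows of a (tracks) vs rows of b (detections)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T  # shape (n_tracks, n_dets)

rng = np.random.default_rng(0)
track_embs = rng.normal(size=(3, 128))  # saved per-track appearance features
# Simulated re-detections: the same objects, slightly perturbed features
det_embs = track_embs + rng.normal(scale=0.05, size=(3, 128))

dist = cosine_distance(track_embs, det_embs)
matches = dist.argmin(axis=1)  # each track matched to its nearest detection
```

In full trackers this appearance distance is blended with the motion cost (e.g., a weighted sum with the IoU or Mahalanobis cost) before running the assignment step.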
Track Lifecycle States
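A typical lifecycle moves a track through tentative, confirmed, lost, and removed states. A minimal sketch (state names and thresholds vary by implementation; `n_init` and `max_age` follow DeepSORT's naming):

```python
from enum import Enum

class TrackState(Enum):
    TENTATIVE = "tentative"  # newly created, not yet confirmed
    CONFIRMED = "confirmed"  # matched for enough consecutive frames
    LOST = "lost"            # unmatched, kept alive for a buffer of frames
    REMOVED = "removed"      # buffer exhausted, track deleted

class Track:
    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.state = TrackState.TENTATIVE
        self.hits = 0                # consecutive matched frames
        self.time_since_update = 0
        self.n_init = n_init         # matches needed to confirm
        self.max_age = max_age       # frames a lost track survives

    def mark_matched(self):
        self.hits += 1
        self.time_since_update = 0
        if self.state is TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED
        elif self.state is TrackState.LOST:
            self.state = TrackState.CONFIRMED  # re-identified

    def mark_missed(self):
        self.hits = 0
        self.time_since_update += 1
        if self.state is TrackState.TENTATIVE:
            self.state = TrackState.REMOVED    # unconfirmed tracks die fast
        elif self.time_since_update > self.max_age:
            self.state = TrackState.REMOVED
        else:
            self.state = TrackState.LOST
```

The `max_age` parameter corresponds to the `track_buffer` setting in ByteTrack-style configs: larger values survive longer occlusions but risk stale identity assignments.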
Tracking Methods: SORT to BoT-SORT
The evolution of multi-object tracking algorithms, from simple IoU matching to sophisticated appearance-aware methods. Each builds on its predecessors.
ByteTrack
- ✓ Uses all detections
- ✓ Very accurate
- ✓ Fast
- ✗ Motion-based only
- ✗ May lose identities when appearance cues are needed
Evolution of Multi-Object Tracking
ByteTrack Key Innovation: Using All Detections
Traditional trackers discard low-confidence detections. ByteTrack's insight is that these often correspond to occluded objects that should still be tracked. It performs association in two stages: high-confidence detections are matched first, then low-confidence detections are matched against the remaining unmatched tracks.
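A simplified sketch of the two-stage association with IoU-only costs (box coordinates and thresholds here are illustrative; the real ByteTrack adds Kalman prediction, track spawning, and confirmation logic):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, dets, thresh=0.3):
    """IoU matching via the Hungarian algorithm; returns matches + leftover tracks."""
    if not tracks or not dets:
        return [], list(range(len(tracks)))
    cost = np.array([[1 - iou(t, d) for d in dets] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if 1 - cost[r, c] >= thresh]
    matched_rows = [m[0] for m in matches]
    unmatched = [r for r in range(len(tracks)) if r not in matched_rows]
    return matches, unmatched

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]
high_dets = [[1, 0, 11, 10]]   # score above the track threshold
low_dets = [[21, 20, 31, 30]]  # score below it (e.g., an occluded object)

# Stage 1: match tracks to high-confidence detections.
matches1, leftover = associate(tracks, high_dets)
# Stage 2: match leftover tracks to low-confidence detections.
matches2, _ = associate([tracks[i] for i in leftover], low_dets)
```

In this toy example the second track would have been dropped by a single-stage tracker, but stage 2 recovers it from the low-confidence detection.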
Tracking Metrics
How do we measure tracking quality? The key metrics balance detection accuracy with identity preservation.
MOTA Formula
MOTA = 1 − (FN + FP + IDSW) / GT, where FN counts missed objects, FP false detections, IDSW identity switches, and GT the total number of ground-truth boxes across all frames.
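MOTA penalizes misses (FN), false positives (FP), and identity switches (IDSW) relative to the number of ground-truth boxes (GT). A worked example with made-up counts:

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT over a whole sequence."""
    return 1.0 - (fn + fp + idsw) / gt

# Illustrative numbers for one sequence with 100k ground-truth boxes
print(round(mota(fn=8000, fp=3000, idsw=500, gt=100000), 3))  # prints 0.885
```

Note that identity switches are weighted the same as detection errors, which is why IDF1 and HOTA are reported alongside MOTA to capture identity preservation.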
MOT17 Benchmark (Selected Results)
| Method | MOTA | IDF1 | HOTA | ID Sw |
|---|---|---|---|---|
| BoT-SORT | 80.5 | 80.2 | 65.0 | 1212 |
| ByteTrack | 80.3 | 77.3 | 63.1 | 2196 |
| OC-SORT | 78.0 | 77.5 | 63.2 | 1950 |
| DeepSORT | 61.4 | 62.2 | 45.6 | 2442 |
| SORT | 59.8 | 53.8 | 42.7 | 4852 |
Code Examples
Get started with multi-object tracking in Python. From simple one-liners to production-ready pipelines.
import cv2
import numpy as np
from bytetrack import BYTETracker
from ultralytics import YOLO

# Initialize detector and tracker
detector = YOLO('yolov8x.pt')
tracker = BYTETracker(
    track_thresh=0.5,   # High confidence threshold
    track_buffer=30,    # Frames to keep lost tracks
    match_thresh=0.8,   # IoU threshold for matching
    frame_rate=30
)

cap = cv2.VideoCapture('video.mp4')
while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Get detections from YOLO
    results = detector(frame)
    detections = results[0].boxes

    # Format: [x1, y1, x2, y2, score, class]
    dets = []
    for box in detections:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        score = box.conf[0].cpu().numpy()
        cls = box.cls[0].cpu().numpy()
        dets.append([x1, y1, x2, y2, score, cls])

    # Update tracker
    tracks = tracker.update(
        np.array(dets),
        img_info=(frame.shape[0], frame.shape[1]),
        img_size=(frame.shape[0], frame.shape[1])
    )

    # Draw tracks
    for track in tracks:
        x1, y1, x2, y2 = track.tlbr.astype(int)
        track_id = track.track_id
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f'ID: {track_id}',
                    (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX,
                    0.7, (0, 255, 0), 2)

cap.release()

Quick Reference
- SORT (simplest)
- ByteTrack
- OC-SORT
- BoT-SORT (SOTA)
- DeepSORT
- StrongSORT
- supervision (Roboflow)
- boxmot (Mikel Broström)
- ultralytics (built-in)
With ultralytics, tracking is a single call: model.track(source, tracker="bytetrack.yaml", persist=True). Switch to BoT-SORT ("botsort.yaml") if you need better re-identification after long occlusions.

Use Cases
- ✓ Surveillance
- ✓ Sports player tracking
- ✓ Autonomous driving
- ✓ Retail footfall analysis
Architectural Patterns
Detect-Then-Track
Per-frame detection plus association (SORT/ByteTrack).
Joint Detection and Embedding
Detector produces re-ID embeddings for association.
Quick Facts
- Input: Video
- Output: Structured Data
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches