Computer Vision Deep Dive
Kalman Filter: The Hidden Foundation of Computer Vision
Every YOLO tracker uses it. Self-driving cars depend on it. Yet most ML engineers treat it as a black box. Let's fix that.
December 2024 | 15 min read
TL;DR
- What: Kalman filter predicts where objects will be, even when detectors miss them
- Why: Detectors are noisy and miss frames. Kalman smooths trajectories and fills gaps
- Where: SORT, DeepSORT, ByteTrack, OC-SORT - all major trackers use it
- How: Two steps - predict (physics model) then update (correct with detection)
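The two steps can be written compactly in the standard Kalman notation; these equations are exactly what the implementation further down computes, step by step:

```latex
% Predict: propagate state and uncertainty through the motion model
\hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1}, \qquad
P_{k|k-1} = F\,P_{k-1|k-1}\,F^\top + Q

% Update: correct the prediction with measurement z_k
y_k = z_k - H\,\hat{x}_{k|k-1}, \qquad
S_k = H\,P_{k|k-1}\,H^\top + R, \qquad
K_k = P_{k|k-1}\,H^\top S_k^{-1}

\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,y_k, \qquad
P_{k|k} = (I - K_k H)\,P_{k|k-1}
```

The Kalman gain $K_k$ is the balance knob: large measurement noise $R$ shrinks it (trust the motion model), large state uncertainty $P$ grows it (trust the detection).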
The Problem: Why Detection Alone Fails
Run YOLO on a video. You'll get bounding boxes for each frame. But try to answer:
- Identity: Is the person in frame 10 the same as frame 11?
- Missing detections: Frame 15 has no detection - where did they go?
- Jitter: The box jumps around even when the person stands still
Detection is memoryless. Each frame is processed independently. Tracking requires memory - knowing where things were to predict where they'll be.
Implementation: From Scratch
Under 100 lines of Python. No libraries except NumPy.
kalman_filter.py
import numpy as np


class KalmanFilter:
    """
    Kalman filter for 2D bounding box tracking.

    State: [x, y, w, h, vx, vy, vw, vh] (position + velocity)
    Measurement: [x, y, w, h] (bounding box center + size)
    """

    def __init__(self):
        # State dimension: 8 (x, y, w, h + velocities)
        # Measurement dimension: 4 (x, y, w, h)
        self.dt = 1.0  # Time step (1 frame)

        # State transition matrix (constant velocity model)
        self.F = np.array([
            [1, 0, 0, 0, self.dt, 0, 0, 0],  # x = x + vx*dt
            [0, 1, 0, 0, 0, self.dt, 0, 0],  # y = y + vy*dt
            [0, 0, 1, 0, 0, 0, self.dt, 0],  # w = w + vw*dt
            [0, 0, 0, 1, 0, 0, 0, self.dt],  # h = h + vh*dt
            [0, 0, 0, 0, 1, 0, 0, 0],        # vx = vx
            [0, 0, 0, 0, 0, 1, 0, 0],        # vy = vy
            [0, 0, 0, 0, 0, 0, 1, 0],        # vw = vw
            [0, 0, 0, 0, 0, 0, 0, 1],        # vh = vh
        ])

        # Measurement matrix (we only observe position, not velocity)
        self.H = np.array([
            [1, 0, 0, 0, 0, 0, 0, 0],
            [0, 1, 0, 0, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 0, 0, 0, 0],
        ])

        # Process noise covariance (how much we trust our motion model)
        self.Q = np.eye(8) * 0.01

        # Measurement noise covariance (how much we trust detections)
        self.R = np.eye(4) * 1.0

        # Initial state covariance (high uncertainty initially)
        self.P = np.eye(8) * 100

        # State vector
        self.x = np.zeros(8)

    def init(self, measurement):
        """Initialize with first detection."""
        self.x[:4] = measurement  # Position from detection
        self.x[4:] = 0            # Zero initial velocity

    def predict(self):
        """Predict next state (called every frame)."""
        # State prediction: x = F @ x
        self.x = self.F @ self.x
        # Covariance prediction: P = F @ P @ F.T + Q
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]  # Return predicted position

    def update(self, measurement):
        """Update state with new detection."""
        # Innovation (measurement residual)
        y = measurement - self.H @ self.x
        # Innovation covariance
        S = self.H @ self.P @ self.H.T + self.R
        # Kalman gain (how much to trust measurement vs prediction)
        K = self.P @ self.H.T @ np.linalg.inv(S)
        # State update
        self.x = self.x + K @ y
        # Covariance update
        I = np.eye(8)
        self.P = (I - K @ self.H) @ self.P
        return self.x[:4]  # Return updated position