
Kalman Filter: The Hidden Foundation of Computer Vision

Every YOLO tracker uses it. Self-driving cars depend on it. Yet most ML engineers treat it as a black box. Let's fix that.

December 2024 | 15 min read

TL;DR

  • What: Kalman filter predicts where objects will be, even when detectors miss them
  • Why: Detectors are noisy and miss frames. Kalman smooths trajectories and fills gaps
  • Where: SORT, DeepSORT, ByteTrack, OC-SORT - all major trackers use it
  • How: Two steps - predict (physics model) then update (correct with detection)

The Problem: Why Detection Alone Fails

Run YOLO on a video. You'll get bounding boxes for each frame. But try to answer:

  • Identity: Is the person in frame 10 the same as frame 11?
  • Missing detections: Frame 15 has no detection - where did they go?
  • Jitter: The box jumps around even when the person stands still

Detection is memoryless. Each frame is processed independently. Tracking requires memory - knowing where things were to predict where they'll be.
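
In standard notation, the filter maintains a state estimate $\hat{x}$ and its covariance $P$, alternating two steps. These are exactly the equations the code below implements:

Predict:

\[
\hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1}, \qquad P_{k|k-1} = F\,P_{k-1|k-1}\,F^\top + Q
\]

Update, given a detection $z_k$:

\[
y_k = z_k - H\,\hat{x}_{k|k-1}, \qquad S_k = H\,P_{k|k-1}\,H^\top + R, \qquad K_k = P_{k|k-1}\,H^\top S_k^{-1}
\]
\[
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,y_k, \qquad P_{k|k} = (I - K_k H)\,P_{k|k-1}
\]

The Kalman gain $K_k$ sets the blend: a large $P$ (uncertain prediction) pulls the estimate toward the detection; a large $R$ (noisy detector) pulls it toward the prediction.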

Implementation: From Scratch

~100 lines of Python. No libraries except NumPy.

kalman_filter.py
import numpy as np

class KalmanFilter:
    """
    Kalman filter for 2D bounding box tracking.
    State: [x, y, w, h, vx, vy, vw, vh] (position + velocity)
    Measurement: [x, y, w, h] (bounding box center + size)
    """
    def __init__(self):
        # State dimension: 8 (x, y, w, h + velocities)
        # Measurement dimension: 4 (x, y, w, h)
        self.dt = 1.0  # Time step (1 frame)

        # State transition matrix (constant velocity model)
        self.F = np.array([
            [1, 0, 0, 0, self.dt, 0, 0, 0],  # x = x + vx*dt
            [0, 1, 0, 0, 0, self.dt, 0, 0],  # y = y + vy*dt
            [0, 0, 1, 0, 0, 0, self.dt, 0],  # w = w + vw*dt
            [0, 0, 0, 1, 0, 0, 0, self.dt],  # h = h + vh*dt
            [0, 0, 0, 0, 1, 0, 0, 0],         # vx = vx
            [0, 0, 0, 0, 0, 1, 0, 0],         # vy = vy
            [0, 0, 0, 0, 0, 0, 1, 0],         # vw = vw
            [0, 0, 0, 0, 0, 0, 0, 1],         # vh = vh
        ])

        # Measurement matrix (we only observe position, not velocity)
        self.H = np.array([
            [1, 0, 0, 0, 0, 0, 0, 0],
            [0, 1, 0, 0, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 0, 0, 0, 0],
        ])

        # Process noise covariance (how much we trust our motion model)
        self.Q = np.eye(8) * 0.01

        # Measurement noise covariance (how much we trust detections)
        self.R = np.eye(4) * 1.0

        # Initial state covariance (high uncertainty initially)
        self.P = np.eye(8) * 100

        # State vector
        self.x = np.zeros(8)

    def init(self, measurement):
        """Initialize with first detection."""
        self.x[:4] = measurement  # Position from detection
        self.x[4:] = 0            # Zero initial velocity

    def predict(self):
        """Predict next state (called every frame)."""
        # State prediction: x = F @ x
        self.x = self.F @ self.x

        # Covariance prediction: P = F @ P @ F.T + Q
        self.P = self.F @ self.P @ self.F.T + self.Q

        return self.x[:4]  # Return predicted position

    def update(self, measurement):
        """Update state with new detection."""
        # Innovation (measurement residual)
        y = measurement - self.H @ self.x

        # Innovation covariance
        S = self.H @ self.P @ self.H.T + self.R

        # Kalman gain (how much to trust measurement vs prediction)
        K = self.P @ self.H.T @ np.linalg.inv(S)

        # State update
        self.x = self.x + K @ y

        # Covariance update
        I = np.eye(8)
        self.P = (I - K @ self.H) @ self.P

        return self.x[:4]  # Return updated position
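
Here is a minimal sketch of how a tracker loop might drive this filter. The detections are made up for illustration (including the missed frame); a real tracker would get them from a detector and associate them to tracks first:

usage_example.py
import numpy as np

from kalman_filter import KalmanFilter

# Hypothetical per-frame detections: [x, y, w, h]; None = detector missed
detections = [
    np.array([100.0, 50.0, 30.0, 60.0]),
    np.array([104.0, 51.0, 30.0, 60.0]),
    None,                                 # frame 2: no detection
    np.array([112.0, 50.0, 30.0, 60.0]),
]

kf = KalmanFilter()
kf.init(detections[0])

for frame, z in enumerate(detections[1:], start=1):
    predicted = kf.predict()   # always predict first
    if z is not None:
        box = kf.update(z)     # correct with the detection
    else:
        box = predicted        # coast on the prediction to fill the gap
    print(f"frame {frame}: box = {np.round(box, 1)}")

This is the same predict-then-update rhythm SORT-style trackers follow: predict every frame, update only when a detection is associated with the track. Tuning is mostly a Q-versus-R trade-off: shrink R (or grow Q) to follow detections more tightly, or do the reverse to smooth out jitter.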
