Level 4: Advanced · ~45 min

Video Understanding

From optical flow and hand-crafted descriptors to video transformers and multimodal foundation models — how machines learned to see time.

Why Video Is the Hardest Modality

An image is a spatial snapshot. A video is a spatiotemporal volume — width, height, and time, multiplied by three color channels, often accompanied by an audio track. A single minute of 1080p video at 30 fps contains 1,800 frames, roughly 3.7 billion pixels — over 11 billion values once you count the color channels. No model can process all of it naively.

But the real difficulty isn't scale — it's temporal reasoning. Understanding "a person picks up a cup, drinks, then sets it down" requires recognizing objects, tracking them across frames, inferring causality, and grounding all of it in a timeline. This is what separates video understanding from running an image classifier 30 times per second.

Computational Scale

A 10-minute video is 18,000 frames. Passing each through a ViT-L costs ~$3.60 in GPU time. Processing every frame through GPT-4o would cost ~$90 in API fees. You must sample.
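The arithmetic behind these numbers is worth a quick sanity check; a back-of-envelope sketch assuming 1080p at 30 fps:

```python
fps = 30
width, height = 1920, 1080            # 1080p

one_min_frames = fps * 60             # frames in one minute of video
one_min_pixels = one_min_frames * width * height

ten_min_frames = fps * 600            # frames in a 10-minute video
print(one_min_frames, ten_min_frames, round(one_min_pixels / 1e9, 1))
```

One minute is 1,800 frames and roughly 3.7 billion pixels; ten minutes is 18,000 frames. Hence the need to sample.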

Temporal Reasoning

"The dog caught the ball" vs "the ball hit the dog" — same objects, different events. Single-frame models can't distinguish them. You need motion and order.

Multi-modal Fusion

Audio carries crucial signal: speech identifies topics, laughter marks humor, a crash sound signals an accident. The visual and auditory streams must be aligned in time and fused meaningfully.

Temporal Localization

"When does the goal happen?" requires mapping answers to timestamps — not just classification, but grounding predictions in the temporal axis. This is video's unique challenge.

The Evolution of Video Understanding

Video understanding has gone through five distinct eras, each solving a fundamental limitation of the previous one. Understanding this arc explains why modern approaches work — and where they still break down.

Era I: Hand-Crafted Features
1981

Optical Flow

Berthold Horn and Brian Schunck formalized optical flow — the pattern of apparent motion of objects between consecutive frames. By computing pixel-level displacement vectors, they could estimate how objects moved through a scene. This became the foundational representation for motion in computer vision for the next three decades.

"The optical flow field is the distribution of apparent velocities of movement of brightness patterns in an image."

Horn, B. & Schunck, B. (1981). Determining Optical Flow. Artificial Intelligence, 17(1-3), 185–203.

Optical flow was elegant but brittle: expensive to compute, sensitive to lighting changes, and it captured only local motion without any semantic understanding. A person waving and a tree branch swaying produced similar flow fields.

2003–2005

Space-Time Interest Points & HOG/HOF

Ivan Laptev extended Harris corner detection into the temporal dimension, detecting "Space-Time Interest Points" (STIPs) — locations in a video where significant spatial and temporal change co-occur. These were described using Histograms of Oriented Gradients (HOG) for appearance and Histograms of Optical Flow (HOF) for motion.

The pipeline was classic pre-deep-learning computer vision: detect keypoints, extract hand-designed descriptors, cluster them into a "bag of visual words," train an SVM. It worked surprisingly well on constrained datasets like KTH (6 action classes, static camera) but collapsed on real-world video where backgrounds varied, cameras moved, and actions overlapped.

Laptev, I. (2005). On Space-Time Interest Points. IJCV, 64(2-3), 107–123.
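That detect → describe → bag-of-words → SVM pipeline can be sketched end-to-end with scikit-learn; random vectors stand in for real HOG/HOF descriptors here, since the descriptor extraction is the part being elided:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for per-video HOG/HOF descriptors: 40 videos, ~50 descriptors each,
# with class 1 shifted so the two synthetic classes are actually separable
videos = [rng.normal(size=(50, 72)) + (i % 2) for i in range(40)]
labels = [i % 2 for i in range(40)]

# 1) Cluster all descriptors into a "visual vocabulary"
vocab = KMeans(n_clusters=16, n_init=4, random_state=0).fit(np.vstack(videos))

# 2) Represent each video as a normalized histogram over visual words
def bow(descriptors: np.ndarray) -> np.ndarray:
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=16) / len(words)

X = np.array([bow(v) for v in videos])

# 3) Train an SVM on the bag-of-words histograms
clf = SVC(kernel="rbf").fit(X, labels)
```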

2013

Improved Dense Trajectories (iDT)

Heng Wang and Cordelia Schmid at INRIA produced the last great hand-crafted video feature. Instead of sparse keypoints, they tracked dense point trajectories across frames, describing each with HOG, HOF, and Motion Boundary Histograms (MBH). Camera motion was estimated and removed. iDT achieved 85.9% on UCF-101 and remained the strongest non-neural baseline for two years.

Wang, H. & Schmid, C. (2013). Action Recognition with Improved Trajectories. ICCV.

Era II: Learning to See Motion
2014

Two-Stream Convolutional Networks

Karen Simonyan and Andrew Zisserman at Oxford proposed a deceptively simple idea: use two separate CNNs, one for appearance (a single RGB frame) and one for motion (a stack of optical flow fields), then fuse their predictions. This two-stream architecture was inspired by the two-pathway hypothesis of the human visual cortex — the ventral ("what") and dorsal ("where/how") streams.

The temporal stream CNN took 10 stacked optical flow frames as input, capturing short-term motion patterns. Despite its simplicity, the approach achieved 88.0% on UCF-101, surpassing iDT. It established a key principle: decomposing appearance and motion into separate processing streams works better than trying to learn both from raw pixels.

Simonyan, K. & Zisserman, A. (2014). Two-Stream Convolutional Networks for Action Recognition. NeurIPS.

2015

C3D: 3D Convolutions for Video

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri at Facebook AI asked: why use pre-computed optical flow at all? Instead, apply 3D convolutions that convolve across both spatial and temporal dimensions simultaneously, learning motion features directly from raw pixel volumes. Their 3×3×3 kernels slid across 16-frame clips, learning spatiotemporal patterns end-to-end.

# C3D: 3D convolution operates on video volume
# Input shape: (batch, channels=3, depth=16, height=112, width=112)
# Conv3D kernel: (3, 3, 3) — learns spatiotemporal features
import torch.nn as nn

conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
# Temporal dimension preserved early, compressed later
# Output after 5 conv blocks + fc: 4096-dim video feature vector

Tran, D. et al. (2015). Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV.

C3D features became the default video representation for years — the "Word2Vec of video." Extract 4096-dim features per clip, use them for downstream tasks. But 3D convolutions were computationally expensive: C3D had 78M parameters and processed only 16-frame clips, limiting temporal context.

2017

I3D: Inflating ImageNet into Video

Joao Carreira and Andrew Zisserman at DeepMind had an elegant insight: take a 2D CNN pre-trained on ImageNet (e.g., Inception-v1), "inflate" every 2D filter into a 3D filter by repeating it along the temporal axis and rescaling, then fine-tune on video. This transferred powerful spatial features while learning temporal patterns.
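The inflation trick itself is tiny; a sketch of the filter transformation (the 1/t rescaling ensures that a "boring" video of identical frames yields the same activations the 2D filter produced on one frame):

```python
import torch

def inflate_conv2d(w2d: torch.Tensor, t: int) -> torch.Tensor:
    """Inflate a 2D conv weight (out, in, kH, kW) into 3D (out, in, t, kH, kW)
    by repeating it t times along a new temporal axis and rescaling by 1/t."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

w2d = torch.randn(64, 3, 7, 7)    # e.g. an ImageNet-pretrained first conv layer
w3d = inflate_conv2d(w2d, t=7)    # shape: (64, 3, 7, 7, 7)
```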

I3D achieved 98.0% on UCF-101 and introduced the Kinetics-400 dataset — 400 action classes, ~300K clips from YouTube — which became the ImageNet of video. The paper also ran the most thorough comparison of video architectures to date: LSTM encoders, 3D convnets, two-stream networks, and their inflation variants. I3D with two streams (RGB + flow) won decisively.

Carreira, J. & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR. 10,000+ citations.

Era III: Video Transformers
2021

ViViT & TimeSformer: Attention Over Space and Time

The Vision Transformer (ViT) had proven that images could be understood as sequences of patches. Two groups independently extended this to video. ViViT (Arnab et al., Google) tokenized video into spatiotemporal "tubes" and explored four transformer variants — the most efficient used factorized self-attention: spatial attention within each frame, then temporal attention across frames. This reduced complexity from O(T²·N²) to O(T·N² + T²·N).

TimeSformer (Bertasius et al., Facebook AI) took a similar approach with "divided space-time attention" — each patch first attends to all patches at the same temporal position (spatial), then to all patches at the same spatial position across time (temporal). Both architectures showed that factorized attention could match or exceed 3D CNNs while being more scalable.

# Factorized attention (ViViT Model 3): runnable sketch with PyTorch
import torch
import torch.nn as nn

T, N, D = 8, 196, 768                  # frames, patches per frame, embed dim
patches = torch.randn(T, N, D)         # video as (frames, patches, dim)
spatial_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
temporal_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# Step 1: Spatial attention within each frame (frames act as the batch)
spatial_tokens, _ = spatial_attn(patches, patches, patches)    # O(N²) per frame

# Step 2: Temporal attention across frames (patch positions act as the batch)
t_in = spatial_tokens.transpose(0, 1)                          # (N, T, D)
temporal_tokens, _ = temporal_attn(t_in, t_in, t_in)           # O(T²) per patch

# Total: O(T·N² + T²·N) instead of O(T²·N²) — massive savings for long videos

Arnab, A. et al. (2021). ViViT: A Video Vision Transformer. ICCV.
Bertasius, G. et al. (2021). Is Space-Time Attention All You Need for Video Understanding? ICML.

2022

VideoMAE: Self-Supervised Video Pre-training

Tong et al. applied masked autoencoding to video, masking 90–95% of spatiotemporal patches and training the transformer to reconstruct them. The key insight was that video's temporal redundancy allows extremely high masking ratios — far higher than the 75% used for images in MAE. This made self-supervised pre-training on video computationally feasible: you only process 5–10% of the tokens during training.
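The compute saving is easy to see in a sketch; random token masking is shown for simplicity, whereas VideoMAE actually uses tube masking (one spatial mask shared across frames):

```python
import torch

T, N, D = 16, 196, 768                # frames, patches per frame, embed dim
mask_ratio = 0.90
tokens = torch.randn(T * N, D)        # 3,136 spatiotemporal tokens in total

keep = int(tokens.shape[0] * (1 - mask_ratio))
perm = torch.randperm(tokens.shape[0])
visible = tokens[perm[:keep]]         # the encoder only ever sees these ~10%
```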

VideoMAE-v2 scaled this to over 1 billion parameters and set new records on Kinetics-400/600/700, Something-Something-v2, and AVA. It proved that video transformers could match or beat convolutional models even without labeled data.

Tong, Z. et al. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS.

2022–2024

InternVideo: The Video Foundation Model

Shanghai AI Lab built InternVideo by combining masked video modeling (generative) with video-language contrastive learning (discriminative) in a unified framework. InternVideo2 scaled to 6 billion parameters and achieved state-of-the-art on 60+ video benchmarks simultaneously — action recognition, temporal grounding, video retrieval, video QA.

Wang, Y. et al. (2022). InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv.

Era IV: Video Meets Language
2021

VideoCLIP: Contrastive Video-Language Learning

Xu et al. at Meta AI trained a model to align video clips with their text descriptions using contrastive learning on 1.1M video-text pairs from HowTo100M. The key innovation was temporally overlapping positive pairs — rather than requiring exact temporal alignment (which is noisy in instructional videos), they treated any overlapping video-text pair as a soft positive. This produced a shared embedding space where you could search video with natural language.

Xu, H. et al. (2021). VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. EMNLP.
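The underlying objective is the standard symmetric InfoNCE loss, sketched below; VideoCLIP's contribution is in how the positive pairs are constructed (overlapping rather than exactly aligned), not in the loss itself:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired video/text embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) cosine similarities
    targets = torch.arange(len(v))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))  # scalar tensor
```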

2023

Video-LLaVA: Visual Instruction Tuning for Video

Lin et al. unified image and video understanding in a single model by projecting both modalities into a shared feature space before feeding them to a language model backbone. Video frames were encoded with a ViT, projected through a learned MLP, and concatenated with text token embeddings. The language model (Vicuna-7B/13B) then generated free-form responses about the video content.
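At the shape level the wiring looks like this; the vision-side dimensions below are illustrative placeholders, while 4096 is Vicuna-7B's actual hidden size:

```python
import torch
import torch.nn as nn

B, T, N = 1, 8, 256        # batch, frames, patches per frame (placeholders)
Dv, Dt = 1024, 4096        # vision feature dim (placeholder), LM hidden dim

frame_feats = torch.randn(B, T * N, Dv)     # ViT patch features for all frames
projector = nn.Sequential(nn.Linear(Dv, Dt), nn.GELU(), nn.Linear(Dt, Dt))
visual_tokens = projector(frame_feats)      # projected into the LM's token space

text_tokens = torch.randn(B, 32, Dt)        # embedded prompt tokens
lm_input = torch.cat([visual_tokens, text_tokens], dim=1)  # what the LM consumes
```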

This was the moment video understanding became conversational. Instead of classifying into predefined action labels, you could ask open-ended questions: "What happens after the man enters the kitchen?" and get natural-language answers grounded in the video.

Lin, B. et al. (2023). Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv.

2024–2026

Frontier Models: Native Video Understanding

The current generation processes video natively in their context windows, without requiring separate video encoders or frame sampling logic on the user's side:

Gemini 2.5 Pro

Google. Processes up to 1 hour of video natively in its 1M-token context window. Samples frames internally.

GPT-4o

OpenAI. Multi-frame image input. No native video upload — requires client-side frame extraction.

Qwen2.5-VL

Alibaba. Open-weight. Processes video at dynamic resolution with temporal position embeddings.

Twelve Labs

Purpose-built video understanding API. Embedding, search, and generation over video libraries.

The sampling question hasn't disappeared

Even models that accept "native video" sample internally. Gemini 2.5 Pro extracts frames at ~1 fps from uploaded video. GPT-4o requires you to sample explicitly. Understanding frame sampling strategies remains essential — you're either choosing the strategy yourself or trusting the model's default. For production systems where cost, latency, and accuracy matter, you want control over that decision.

The throughline: 1981 → 2026

Four decades, one goal: make machines understand what happens in video, not just what appears in frames.

1981–2013 · Features: hand-craft motion descriptors (optical flow, HOG/HOF, dense trajectories)
2014–2017 · Learning: learn spatiotemporal features from data (two-stream, C3D, I3D)
2021–2022 · Attention: replace convolutions with factorized self-attention (ViViT, TimeSformer, VideoMAE)
2023–now · Language: fuse video with LLMs for open-ended understanding (Video-LLaVA, Gemini, GPT-4o)

Frame Sampling Strategies

A 30-fps video contains 1,800 frames per minute. Most of them are redundant — adjacent frames in a static shot are nearly identical. The art of video understanding is choosing which frames to process and how many to budget. The wrong sampling strategy can miss critical events or waste 90% of your compute on duplicate information.

[Figure: a 0:00–2:00 video timeline with three automatically detected scene changes, comparing three sampling strategies. Uniform (16 frames): equal spacing, simple but may miss events. Keyframe (9 frames): extracts at shot boundaries, good for edited video. Adaptive (14 frames): dense at changes, sparse when static.]
1. Uniform Sampling

Extract frames at fixed intervals (e.g., 1 fps, or every 30th frame). The simplest approach and the default for most video-language models. Gemini samples at ~1 fps internally. Uniform sampling works well when the information density is roughly constant — lectures, surveillance, dashcam footage.

Failure mode: Misses brief but critical events (a punch in a fight, a traffic light change) that fall between sample points. A 1-fps sample of a 120-fps slow-motion replay will miss 99.2% of frames.

Best for: General summarization, content understanding, meeting recordings

2. Keyframe / Shot-boundary Detection

Detect frames where significant visual change occurs — a scene cut, a camera pan, or a major action event. Algorithms compute inter-frame difference (pixel-level, histogram-based, or feature-level) and extract frames that exceed a threshold. This produces an adaptive sample: more frames during action, fewer during static shots.

Implementations: FFmpeg's select='gt(scene,0.3)' filter, PySceneDetect, or compute SSIM/histogram distance between consecutive frames.

Best for: Movies, TV shows, edited content, event summarization

3. Clustering-based Sampling

Extract all frames (or a dense uniform sample), compute lightweight embeddings (e.g., CLIP ViT-B), cluster them with K-means, and pick the frame nearest each cluster centroid. This guarantees visual diversity in your sample — you'll never get 10 near-identical frames from a static shot.

Best for: Video retrieval, content indexing, diverse thumbnail generation
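A minimal version of the centroid-nearest selection, assuming per-frame embeddings are already computed (random vectors stand in for CLIP features here):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_sample(embeddings: np.ndarray, k: int = 9) -> list[int]:
    """Pick k visually diverse frame indices: for each K-means cluster,
    keep the frame closest to the cluster centroid."""
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(embeddings)
    picks = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(
            embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return sorted(picks)

emb = np.random.default_rng(0).normal(size=(120, 512))  # 120 frame embeddings
indices = cluster_sample(emb, k=9)
```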

4. Audio-guided Sampling

Use the audio track to guide visual sampling. Transcribe with Whisper to get word-level timestamps, then sample frames aligned to speech onset, topic changes, or audio events (applause, music cues, sound effects). This is especially powerful for lecture videos and podcasts where the audio carries the primary information.

Best for: Lectures, interviews, webinars, conference talks, podcast video
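A sketch of the timestamp-to-frame mapping, assuming Whisper-style segment dicts; the `gap` threshold for what counts as a fresh speech onset is an illustrative heuristic, not a Whisper feature:

```python
def frames_at_speech_onsets(segments: list[dict], fps: float = 30.0,
                            gap: float = 2.0) -> list[int]:
    """One frame index per speech onset: the start of each transcript
    segment that follows at least `gap` seconds of silence."""
    picks, last_end = [], -gap
    for seg in segments:                 # [{"start": s, "end": e, ...}, ...]
        if seg["start"] - last_end >= gap:
            picks.append(int(seg["start"] * fps))
        last_end = seg["end"]
    return picks

segments = [{"start": 0.0, "end": 4.0}, {"start": 4.5, "end": 9.0},
            {"start": 15.0, "end": 20.0}]
frames_at_speech_onsets(segments)        # → [0, 450]
```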

Practical guideline: how many frames?

For GPT-4o, the sweet spot is 8–32 frames depending on video length. Beyond ~50 frames, costs escalate and the model's performance plateaus. For Gemini, upload the full video and let the model handle sampling — it tokenizes ~1 fps and the cost scales with duration. For local models like Qwen2.5-VL, budget depends on your GPU memory: 8 frames at 448×448 requires ~4GB VRAM.

Building a Video Understanding Pipeline

Let's build a complete video analysis pipeline: extract frames, optionally detect keyframes, transcribe audio, and analyze with a VLM. Every code block is production-usable.

Video Understanding Pipeline

[Diagram: Video → Frame Extraction (uniform / keyframe) → Per-frame Encoding (ViT / CLIP / VLM) → Temporal Aggregation (pool / attend / fuse) → Output (captions, video QA answers, action labels, temporal grounding), with a parallel audio branch (Whisper STT).]

Step 1: Frame Extraction with Multiple Strategies

import cv2
import numpy as np
from typing import Literal

def extract_frames(
    video_path: str,
    strategy: Literal["uniform", "keyframe", "scene"] = "uniform",
    target_frames: int = 16,
    scene_threshold: float = 30.0,
) -> list[tuple[np.ndarray, float]]:
    """Extract frames from video with timestamp.

    Returns list of (frame, timestamp_seconds) tuples.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total_frames / fps

    if strategy == "uniform":
        # Sample evenly across the video
        indices = np.linspace(0, total_frames - 1, target_frames, dtype=int)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                frames.append((frame, idx / fps))
        cap.release()
        return frames

    elif strategy == "keyframe":
        # Extract frames with significant visual change
        frames = []
        prev_hist = None
        frame_idx = 0

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            # Compute color histogram
            hist = cv2.calcHist([frame], [0, 1, 2], None,
                               [8, 8, 8], [0, 256, 0, 256, 0, 256])
            hist = cv2.normalize(hist, hist).flatten()

            if prev_hist is None:
                frames.append((frame, frame_idx / fps))
            else:
                diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CHISQR)
                if diff > scene_threshold:
                    frames.append((frame, frame_idx / fps))
            prev_hist = hist
            frame_idx += 1

        cap.release()
        # If too many, subsample uniformly
        if len(frames) > target_frames:
            indices = np.linspace(0, len(frames) - 1, target_frames, dtype=int)
            frames = [frames[i] for i in indices]
        return frames

    elif strategy == "scene":
        # Approximate scene boundaries via I-frames (shell out to ffprobe)
        import subprocess, json
        cmd = [
            "ffprobe", "-v", "quiet", "-select_streams", "v",
            "-show_frames", "-show_entries", "frame=pts_time,pict_type",
            "-of", "json", video_path
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        scene_data = json.loads(result.stdout)
        # Filter I-frames (scene changes)
        i_frames = [
            float(f["pts_time"]) for f in scene_data.get("frames", [])
            if f.get("pict_type") == "I"
        ]
        # Sample from scene boundaries
        timestamps = i_frames[:target_frames] if len(i_frames) <= target_frames \
            else [i_frames[i] for i in np.linspace(0, len(i_frames)-1, target_frames, dtype=int)]

        frames = []
        for ts in timestamps:
            cap.set(cv2.CAP_PROP_POS_MSEC, ts * 1000)
            ret, frame = cap.read()
            if ret:
                frames.append((frame, ts))
        cap.release()
        return frames

    raise ValueError(f"Unknown sampling strategy: {strategy!r}")

Step 2: Video Analysis with GPT-4o

import base64
from openai import OpenAI

def encode_frame(frame: np.ndarray) -> str:
    """Encode frame as base64 JPEG for API consumption."""
    _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
    return base64.b64encode(buffer).decode('utf-8')

def analyze_video_gpt4o(
    frames: list[tuple[np.ndarray, float]],
    question: str,
    model: str = "gpt-4o",
) -> str:
    """Analyze video frames with GPT-4o.

    Includes timestamps in the prompt so the model can
    reference specific moments in its response.
    """
    client = OpenAI()

    # Build the message content with timestamped frames
    content = [
        {
            "type": "text",
            "text": (
                f"You are analyzing a video. I'm providing {len(frames)} frames "
                f"sampled from the video, each labeled with its timestamp.\n\n"
                f"Question: {question}"
            ),
        }
    ]

    for frame, timestamp in frames:
        # Add timestamp label before each frame
        minutes = int(timestamp // 60)
        seconds = timestamp % 60
        content.append({
            "type": "text",
            "text": f"[{minutes}:{seconds:05.2f}]"
        })
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encode_frame(frame)}",
                "detail": "low"  # Use "high" for fine detail, costs 4x more
            }
        })

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        max_tokens=2000,
    )
    return response.choices[0].message.content

# Usage
frames = extract_frames("lecture.mp4", strategy="uniform", target_frames=16)
summary = analyze_video_gpt4o(
    frames,
    "Summarize the key points discussed in this presentation. "
    "Reference timestamps when the topic changes."
)

Cost awareness

Each low-detail image costs 85 tokens (~$0.0004). At 16 frames, that's ~1,360 image tokens plus your prompt — roughly $0.01 per video analysis. High-detail mode costs 4x more per frame. For batch processing thousands of videos, this adds up fast. Consider using Gemini Flash for high-volume workloads ($0.075/1M tokens) or running Qwen2.5-VL locally.

Step 3: Native Video with Gemini

import google.generativeai as genai
import time

genai.configure(api_key="YOUR_API_KEY")

def analyze_video_gemini(video_path: str, question: str) -> str:
    """Analyze video natively with Gemini.

    Gemini handles frame sampling internally at ~1 fps.
    Supports up to 1 hour of video in Gemini 2.5 Pro.
    """
    model = genai.GenerativeModel("gemini-2.5-pro")

    # Upload video file (supports mp4, mov, avi, mkv, webm)
    video_file = genai.upload_file(video_path)

    # Wait for server-side processing (transcoding + frame extraction)
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)

    if video_file.state.name == "FAILED":
        raise ValueError(f"Video processing failed: {video_file.state.name}")

    # Analyze — Gemini sees frames + audio natively
    response = model.generate_content(
        [video_file, question],
        generation_config=genai.GenerationConfig(
            temperature=0.2,
            max_output_tokens=4000,
        ),
    )
    return response.text

# Usage — no frame extraction needed
result = analyze_video_gemini(
    "meeting_recording.mp4",
    "List all action items discussed in this meeting with timestamps."
)

When to use GPT-4o vs Gemini for video

GPT-4o: You need precise control over which frames are analyzed. Better for short clips (<2 min) where frame selection matters. Supports structured outputs via function calling.

Gemini 2.5 Pro: Long-form video (10–60 min). Native audio understanding without separate transcription. Simpler API — upload and ask. Better for meeting recordings, lectures, tutorials.

Qwen2.5-VL (local): Privacy-sensitive use cases, high-volume batch processing, or when you need to run on-premise. 72B model matches GPT-4o on many video benchmarks.

Multi-modal Pipeline: Visual + Audio

Video understanding is incomplete without audio. A person nodding while saying "no" means something different from nodding while saying "yes." Speech content, speaker tone, background sounds, and music all carry semantic signal that pure visual analysis misses.

The standard approach: extract audio with FFmpeg, transcribe with Whisper (getting word-level timestamps), then fuse the transcript with visual analysis in a final LLM call that sees both modalities.

Complete Multi-modal Video Pipeline

import whisper
import subprocess
from dataclasses import dataclass

@dataclass
class VideoAnalysis:
    transcript: str
    visual_description: str
    combined_summary: str
    timestamps: list[dict]

def extract_audio(video_path: str) -> str:
    """Extract audio track from video using ffmpeg."""
    audio_path = video_path.rsplit('.', 1)[0] + '.wav'
    subprocess.run([
        'ffmpeg', '-y', '-i', video_path,
        '-vn', '-acodec', 'pcm_s16le',
        '-ar', '16000', '-ac', '1',
        audio_path
    ], check=True, capture_output=True)
    return audio_path

def transcribe_with_timestamps(audio_path: str) -> dict:
    """Transcribe audio with word-level timestamps."""
    model = whisper.load_model("base")
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en",
    )
    return result

def full_video_analysis(video_path: str, question: str) -> VideoAnalysis:
    """Combine visual and audio understanding.

    Pipeline:
    1. Extract frames (uniform, 1 fps)
    2. Extract + transcribe audio (Whisper)
    3. Analyze frames with GPT-4o (visual)
    4. Synthesize visual + audio with final LLM call
    """
    # Visual: extract and analyze frames
    frames = extract_frames(video_path, strategy="uniform", target_frames=16)
    visual_desc = analyze_video_gpt4o(
        frames,
        "Describe what you see in each frame. Note any text, "
        "people, actions, objects, and scene changes."
    )

    # Audio: extract and transcribe
    audio_path = extract_audio(video_path)
    transcript_result = transcribe_with_timestamps(audio_path)
    transcript = transcript_result["text"]

    # Collect segment timestamps
    timestamps = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        for seg in transcript_result.get("segments", [])
    ]

    # Synthesis: combine both modalities
    client = OpenAI()
    synthesis = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a video analyst. You receive a visual description "
                    "of video frames and an audio transcript. Synthesize both "
                    "into a coherent analysis. Reference specific timestamps."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"VISUAL DESCRIPTION:\n{visual_desc}\n\n"
                    f"AUDIO TRANSCRIPT:\n{transcript}\n\n"
                    f"QUESTION: {question}"
                ),
            },
        ],
    )

    return VideoAnalysis(
        transcript=transcript,
        visual_description=visual_desc,
        combined_summary=synthesis.choices[0].message.content,
        timestamps=timestamps,
    )

# Usage
analysis = full_video_analysis(
    "product_demo.mp4",
    "What features are demonstrated and what claims are made about each?"
)
print(analysis.combined_summary)

Video Search and Retrieval

The most impactful production use case for video understanding is semantic search over video libraries. Instead of relying on manual tags or metadata, you embed video segments into a shared vector space with text queries, enabling natural-language search: "Find the moment where the speaker discusses pricing" returns a timestamp, not a document.

Video Search with Twelve Labs

from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="YOUR_API_KEY")

# Create an index (a searchable video collection)
index = client.index.create(
    name="product_demos",
    engines=[{
        "name": "marengo2.7",  # Video understanding engine
        "options": ["visual", "conversation", "text_in_video", "logo"],
    }],
)

# Upload videos to the index
task = client.task.create(
    index_id=index.id,
    file="demo_video.mp4",
)
task.wait_for_done()  # Processing: ~1 min per 1 min of video

# Natural language search — returns timestamps
results = client.search.query(
    index_id=index.id,
    query_text="moment where the speaker demonstrates the API",
    options=["visual", "conversation"],
)

for clip in results.data:
    print(f"[{clip.start:.1f}s - {clip.end:.1f}s] "
          f"score={clip.score:.3f} | {clip.video_id}")

# Generate text from a specific segment
summary = client.generate.text(
    video_id=task.video_id,
    prompt="Summarize the key features shown in this segment",
    temperature=0.2,
)

DIY Video Search with CLIP Embeddings

# Build your own video search with CLIP + vector DB
import torch
from transformers import CLIPModel, CLIPProcessor
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Initialize CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Initialize vector database
qdrant = QdrantClient(":memory:")  # or url="http://localhost:6333"
qdrant.create_collection(
    collection_name="video_frames",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

def index_video(video_path: str, video_id: str):
    """Index a video by embedding frames into Qdrant."""
    frames = extract_frames(video_path, strategy="uniform", target_frames=60)

    points = []
    for i, (frame, timestamp) in enumerate(frames):
        # Convert BGR (OpenCV) to RGB (PIL)
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        inputs = processor(images=rgb_frame, return_tensors="pt")

        with torch.no_grad():
            embedding = model.get_image_features(**inputs)
            embedding = embedding / embedding.norm(dim=-1, keepdim=True)

        points.append(PointStruct(
            # Qdrant point IDs must be unsigned ints or UUIDs; hash() can be negative
            id=abs(hash(f"{video_id}_{i}")),
            vector=embedding[0].numpy().tolist(),
            payload={"video_id": video_id, "timestamp": timestamp, "frame_idx": i},
        ))

    qdrant.upsert(collection_name="video_frames", points=points)

def search_video(query: str, top_k: int = 5):
    """Search indexed videos with natural language."""
    inputs = processor(text=query, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embedding = model.get_text_features(**inputs)
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

    results = qdrant.search(
        collection_name="video_frames",
        query_vector=text_embedding[0].numpy().tolist(),
        limit=top_k,
    )
    return [
        {"video_id": r.payload["video_id"],
         "timestamp": r.payload["timestamp"],
         "score": r.score}
        for r in results
    ]

# Index and search
index_video("keynote.mp4", "keynote_2024")
hits = search_video("slide showing revenue growth chart")
# Returns: [{"video_id": "keynote_2024", "timestamp": 847.3, "score": 0.31}, ...]

CLIP vs dedicated video models for search

CLIP encodes individual frames — it has no temporal understanding. Searching for "person running" works because it's visible in a single frame. Searching for "person who tripped and fell" will fail because it requires understanding a sequence of frames. For temporal queries, use dedicated video embedding models like Twelve Labs Marengo, InternVideo2, or LanguageBind — these encode short clips (4–16 frames) into a single vector that captures motion and temporal relationships.

The Temporal Understanding Challenge

The hardest problems in video understanding are temporal — they require reasoning about sequences of events, not just recognizing objects in frames. This is where most current systems still struggle.

Action Recognition

Classifying what action is being performed in a video clip. Early benchmarks (UCF-101, Kinetics-400) focused on short clips (3–10 seconds) with a single action. Models now achieve 90%+ on these, partly because many actions are identifiable from a single frame ("playing guitar" is recognizable without motion). The field has moved to fine-grained temporal reasoning benchmarks.

UCF-101 · Kinetics-700 · Something-Something v2 · Moments in Time

Temporal Grounding

Given a natural language query and a long video, find the start and end timestamps of the described moment. Example: "The moment the speaker first mentions competition" in a 40-minute earnings call. This requires understanding language, scanning the full video, and localizing precisely. Current SOTA uses models like UniVTG and Moment-DETR.

ActivityNet Captions · Charades-STA · QVHighlights
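At its simplest, embedding-based grounding reduces to scoring clip vectors against the text query and returning the best-matching window. A toy sketch under assumed inputs (L2-normalized query and clip embeddings); real systems like Moment-DETR regress start/end boundaries directly rather than scoring fixed windows:

```python
import numpy as np

def ground_query(text_emb, clip_embs, clip_seconds=4.0, window=3):
    """Return (start_s, end_s) of the `window` consecutive clips whose
    mean similarity to the query is highest.

    text_emb:  (D,) L2-normalized query embedding
    clip_embs: (N, D) L2-normalized clip embeddings, one per clip_seconds
    """
    sims = clip_embs @ text_emb                       # cosine similarity per clip
    # Mean similarity over each sliding window of `window` clips
    window_scores = np.convolve(sims, np.ones(window) / window, mode="valid")
    best = int(np.argmax(window_scores))
    return best * clip_seconds, (best + window) * clip_seconds

# Hypothetical: a 20-clip video where clips 10-12 match the query strongly
rng = np.random.default_rng(0)
clips = rng.normal(size=(20, 64))
query = clips[10] + clips[11] + clips[12]
clips /= np.linalg.norm(clips, axis=1, keepdims=True)
query /= np.linalg.norm(query)
print(ground_query(query, clips))  # -> (40.0, 52.0)
```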

Long-form Video Understanding

Understanding hour-long videos — movies, meetings, lectures — is the current frontier. The challenges compound: you need to track entities across scenes, maintain a narrative state, handle topic drift, and answer questions that require synthesizing information from multiple distant segments. Benchmarks like EgoSchema (3-min egocentric clips requiring temporal reasoning) and MovieChat (hour-long movies) expose how far even frontier models have to go.

Gemini 2.5 Pro's 1M-token context window can process ~1 hour of video, but performance degrades significantly on questions requiring reasoning about events separated by more than 10 minutes. The needle-in-a-haystack problem for video is far harder than for text.

EgoSchema · MovieChat · Video-MME · LVBench

Video Question Answering (VideoQA)

Open-ended question answering about video content. "How many times did the batter swing and miss?" requires counting events across time. "Why did the person leave the room?" requires causal reasoning. Current models handle factual questions well but struggle with counterfactual reasoning ("What would have happened if...") and questions requiring real-world knowledge not present in the video.

MSRVTT-QA · NExT-QA · STAR · VideoBench
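Counting questions like the swing-and-miss example reduce to event segmentation: collapse per-frame predictions into contiguous runs, then count the runs matching the target label. A minimal sketch over hypothetical per-frame action labels:

```python
from itertools import groupby

def count_events(frame_labels, target):
    """Count contiguous runs of `target` in a per-frame label sequence.
    Each run is one event, so a miss spanning several adjacent frames
    counts once, not once per frame.
    """
    return sum(1 for label, _ in groupby(frame_labels) if label == target)

# Hypothetical per-frame action labels from a video classifier
labels = ["idle", "swing_miss", "swing_miss", "idle", "swing_hit",
          "idle", "swing_miss", "idle"]
print(count_events(labels, "swing_miss"))  # 2
```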

Production Use Cases

Video understanding has moved from research benchmarks to production systems. Here are the use cases where it delivers measurable value today.

Surveillance & Security

Anomaly detection in CCTV feeds: detect fights, unattended bags, intrusions, or vehicle accidents in real time. Modern systems combine YOLOv8 for object detection with a video classifier for action recognition, triggering alerts only when both agree.

Key requirements: <500ms latency, high recall (can't miss events), low false-positive rate, 24/7 operation on edge devices.
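The "trigger only when both agree" rule above is simple gating logic. A sketch with assumed detector and classifier output formats (names and thresholds are illustrative, not from any specific system):

```python
def should_alert(detections, action_probs, action="fight",
                 det_classes=("person",), min_persons=2, threshold=0.8):
    """Raise an alert only when the object detector and the action
    classifier agree: enough relevant objects in frame AND a
    high-confidence action prediction. Gating on both cuts false
    positives at some cost in recall.
    """
    n_relevant = sum(1 for d in detections if d["class"] in det_classes)
    return n_relevant >= min_persons and action_probs.get(action, 0.0) >= threshold

# Hypothetical outputs from a YOLO-style detector and a clip classifier
dets = [{"class": "person", "conf": 0.9}, {"class": "person", "conf": 0.7}]
probs = {"fight": 0.93, "walk": 0.05}
print(should_alert(dets, probs))  # True
```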

Content Moderation

Identify policy violations in user-uploaded video: violence, NSFW content, self-harm, dangerous challenges. Platforms like YouTube process 500+ hours of video per minute. The pipeline: fast frame-level classifier (cheap, high recall) followed by a VLM for nuanced review (expensive, high precision) on flagged content.

Key requirements: Explainable decisions, timestamp extraction, multi-language support, cultural context awareness.
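The two-stage cascade described above can be sketched as a threshold pipeline. `cheap_score` and `vlm_review` are placeholder callables standing in for a real frame classifier and VLM call:

```python
def moderate(frames, cheap_score, vlm_review, flag_threshold=0.3):
    """Two-stage cascade: a cheap high-recall classifier scores every
    frame; only frames above the (deliberately low) threshold are sent
    to the expensive high-precision VLM reviewer.
    cheap_score: frame -> float in [0, 1]
    vlm_review:  frame -> bool (violation or not)
    Returns indices of frames the VLM confirms as violations.
    """
    flagged = [i for i, f in enumerate(frames) if cheap_score(f) >= flag_threshold]
    return [i for i in flagged if vlm_review(frames[i])]

# Hypothetical stand-ins: frames are raw scores, VLM confirms only score > 0.9
frames = [0.1, 0.5, 0.95, 0.2]
hits = moderate(frames, cheap_score=lambda f: f, vlm_review=lambda f: f > 0.9)
print(hits)  # [2]
```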

Video Search & Discovery

Natural language search over corporate video archives: "Find where the CEO discusses Q3 results in last month's all-hands." Used by media companies (search across footage libraries), enterprises (search meeting recordings), and education platforms (search across lecture archives).

Key requirements: Multi-modal indexing (visual + speech + text-on-screen), sub-second query latency, timestamp-level precision.

Sports Analytics

Automatic play detection, player tracking, formation recognition, and highlight generation. Companies like Hawk-Eye (tennis/cricket), StatsBomb (football), and Second Spectrum (basketball) use video understanding to generate real-time statistics that were previously only available through manual annotation.

Key requirements: Object tracking at 30+ fps, action recognition with sub-second precision, multi-camera view synthesis.
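Frame-to-frame player tracking at its simplest is greedy IoU matching: associate each new detection with the overlapping track from the previous frame, and start a new track when nothing overlaps. A minimal sketch (production trackers like SORT add a Kalman-filter motion model on top of this association step):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_tracks(prev_tracks, detections, min_iou=0.3):
    """Greedily assign each detection to the previous-frame track with
    the highest IoU; unmatched detections start new tracks."""
    assignments, used = {}, set()
    next_id = max(prev_tracks, default=-1) + 1
    for det in detections:
        best_id, best_iou = None, min_iou
        for t_id, box in prev_tracks.items():
            if t_id not in used and iou(box, det) >= best_iou:
                best_id, best_iou = t_id, iou(box, det)
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        used.add(best_id)
        assignments[best_id] = det
    return assignments

# Hypothetical: player 0 moved slightly; a second player entered the frame
prev = {0: (10, 10, 50, 100)}
dets = [(12, 11, 52, 101), (200, 20, 240, 110)]
print(match_tracks(prev, dets))  # {0: (12, 11, 52, 101), 1: (200, 20, 240, 110)}
```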

Medical Video Analysis

Surgical procedure recognition, endoscopy anomaly detection, and physical therapy compliance monitoring. AI-assisted colonoscopy (detecting polyps in real time) has already been shown to improve detection rates by 14% in randomized clinical trials.

Key requirements: FDA/CE-mark certification, on-device inference, explainable outputs, zero tolerance for false negatives.

Autonomous Driving

Multi-camera video feeds processed in real time for lane detection, pedestrian prediction, traffic sign recognition, and scenario understanding. Tesla's vision-only approach processes 8 cameras simultaneously with a temporal backbone that reasons across frames.

Key requirements: <50ms latency, multi-camera fusion, temporal prediction (where will objects be in 3 seconds?), extreme reliability.

Key Takeaways

1. Video = frames + audio + time — The temporal dimension is what separates video from a batch of images. Temporal reasoning (understanding what happened, not just what appears) remains the hardest problem.

2. Sampling strategy determines everything — Uniform for general use, keyframe detection for edited content, clustering for diversity, audio-guided for speech-heavy video. The wrong strategy wastes compute and misses events.

3. The field evolved through five eras — Hand-crafted features (HOG/HOF) → 3D CNNs (C3D, I3D) → video transformers (ViViT) → video-language models (Video-LLaVA) → frontier multimodal models (Gemini, GPT-4o). Each generation solved one limitation of the last.

4. Combine visual + audio for production — Whisper for transcription, a VLM for frame analysis, an LLM for synthesis. Or use Gemini, which handles both natively. The multi-modal pipeline catches what either modality alone misses.

5. Long-form video is the current frontier — Understanding hour-long videos with complex narratives, tracking entities across scenes, and answering questions that require reasoning over distant segments. Even Gemini 2.5 Pro degrades beyond ~10 minutes of temporal separation.
