Video Understanding
From optical flow and hand-crafted descriptors to video transformers and multimodal foundation models — how machines learned to see time.
Why Video Is the Hardest Modality
An image is a spatial snapshot. A video is a spatiotemporal volume — width, height, and time, multiplied by three color channels, often accompanied by an audio track. A single minute of 1080p video at 30 fps contains 1,800 frames: roughly 3.7 billion pixels, and more than 11 billion raw values once you count the color channels. No model can process all of it naively.
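The arithmetic is worth spelling out; a quick back-of-envelope check:

```python
# Scale of one minute of 1080p video at 30 fps.
width, height, channels, fps, seconds = 1920, 1080, 3, 30, 60

frame_count = fps * seconds              # 1,800 frames
pixels = width * height * frame_count    # ~3.7 billion pixels
values = pixels * channels               # ~11.2 billion raw values

print(f"{frame_count:,} frames, {pixels/1e9:.2f}B pixels, {values/1e9:.2f}B values")
```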
But the real difficulty isn't scale — it's temporal reasoning. Understanding "a person picks up a cup, drinks, then sets it down" requires recognizing objects, tracking them across frames, inferring causality, and grounding all of it in a timeline. This is what separates video understanding from running an image classifier 30 times per second.
Computational Scale
A 10-minute video is 18,000 frames. Encoding every frame with a ViT-L costs roughly $3.60 of GPU time; sending every frame to GPT-4o would cost around $90 in API fees. You must sample.
Temporal Reasoning
"The dog caught the ball" vs "the ball hit the dog" — same objects, different events. Single-frame models can't distinguish them. You need motion and order.
Multi-modal Fusion
Audio carries crucial signal: speech identifies topics, laughter marks humor, a crash sound signals an accident. The visual and auditory streams must be aligned in time and fused meaningfully.
Temporal Localization
"When does the goal happen?" requires mapping answers to timestamps — not just classification, but grounding predictions in the temporal axis. This is video's unique challenge.
The Evolution of Video Understanding
Video understanding has gone through five distinct eras, each solving a fundamental limitation of the previous one. Understanding this arc explains why modern approaches work — and where they still break down.
Optical Flow
Berthold Horn and Brian Schunck formalized optical flow — the pattern of apparent motion of objects between consecutive frames. By computing pixel-level displacement vectors, they could estimate how objects moved through a scene. This became the foundational representation for motion in computer vision for the next three decades.
"The optical flow field is the distribution of apparent velocities of movement of brightness patterns in an image."
— Horn, B. & Schunck, B. (1981). Determining Optical Flow. Artificial Intelligence, 17(1-3), 185–203.
Optical flow was elegant but brittle: expensive to compute, sensitive to lighting changes, and it captured only local motion without any semantic understanding. A person waving and a tree branch swaying produced similar flow fields.
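Even so, the classical machinery is compact. Below is a toy sketch of the brightness-constancy constraint (I_x·u + I_y·v + I_t = 0), solved in pure NumPy for a single global translation; it is illustrative only, since real solvers like Horn–Schunck add a smoothness term and estimate flow per pixel:

```python
import numpy as np

def global_flow(prev: np.ndarray, nxt: np.ndarray) -> tuple[float, float]:
    """Estimate one (u, v) translation from the brightness-constancy
    equation I_x*u + I_y*v + I_t = 0, in the least-squares sense."""
    Ix = np.gradient(prev.astype(float), axis=1)   # horizontal gradient
    Iy = np.gradient(prev.astype(float), axis=0)   # vertical gradient
    It = nxt.astype(float) - prev.astype(float)    # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(u), float(v)

# A smooth pattern shifted one pixel to the right between frames
x = np.linspace(0, 4 * np.pi, 64)
prev = np.sin(x)[None, :] * np.ones((64, 1))
nxt = np.roll(prev, 1, axis=1)
u, v = global_flow(prev, nxt)   # u close to +1, v close to 0
```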
Space-Time Interest Points & HOG/HOF
Ivan Laptev extended Harris corner detection into the temporal dimension, detecting "Space-Time Interest Points" (STIPs) — locations in a video where significant spatial and temporal change co-occur. These were described using Histograms of Oriented Gradients (HOG) for appearance and Histograms of Optical Flow (HOF) for motion.
The pipeline was classic pre-deep-learning computer vision: detect keypoints, extract hand-designed descriptors, cluster them into a "bag of visual words," train an SVM. It worked surprisingly well on constrained datasets like KTH (6 action classes, static camera) but collapsed on real-world video where backgrounds varied, cameras moved, and actions overlapped.
— Laptev, I. (2005). On Space-Time Interest Points. IJCV, 64(2-3), 107–123.
Improved Dense Trajectories (iDT)
Heng Wang and Cordelia Schmid at INRIA produced the last great hand-crafted video feature. Instead of sparse keypoints, they tracked dense point trajectories across frames, describing each with HOG, HOF, and Motion Boundary Histograms (MBH). Camera motion was estimated and removed. iDT achieved 85.9% on UCF-101 and remained the strongest non-neural baseline for two years.
— Wang, H. & Schmid, C. (2013). Action Recognition with Improved Trajectories. ICCV.
Two-Stream Convolutional Networks
Karen Simonyan and Andrew Zisserman at Oxford proposed a deceptively simple idea: use two separate CNNs, one for appearance (a single RGB frame) and one for motion (a stack of optical flow fields), then fuse their predictions. This two-stream architecture was inspired by the two-pathway hypothesis of the human visual cortex — the ventral ("what") and dorsal ("where/how") streams.
The temporal stream CNN took 10 stacked optical flow frames as input, capturing short-term motion patterns. Despite its simplicity, the approach achieved 88.0% on UCF-101, surpassing iDT. It established a key principle: decomposing appearance and motion into separate processing streams works better than trying to learn both from raw pixels.
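A minimal sketch of the two-stream idea in PyTorch, with toy CNNs standing in for the paper's AlexNet-style streams (layer sizes here are illustrative, not the original architecture):

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Toy two-stream network with late fusion by averaging class scores.

    The spatial stream sees one RGB frame (3 channels); the temporal
    stream sees 10 stacked optical-flow fields (10 * 2 = 20 channels)."""

    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.spatial = self._make_stream(in_channels=3, num_classes=num_classes)
        self.temporal = self._make_stream(in_channels=20, num_classes=num_classes)

    @staticmethod
    def _make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, rgb: torch.Tensor, flow_stack: torch.Tensor) -> torch.Tensor:
        # Late fusion: average the two streams' class scores
        return (self.spatial(rgb) + self.temporal(flow_stack)) / 2

model = TwoStreamNet()
rgb = torch.randn(1, 3, 224, 224)       # one RGB frame
flow = torch.randn(1, 20, 224, 224)     # 10 stacked (dx, dy) flow fields
scores = model(rgb, flow)               # shape: (1, 101)
```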
C3D: 3D Convolutions for Video
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri at Facebook AI asked: why use pre-computed optical flow at all? Instead, apply 3D convolutions that convolve across both spatial and temporal dimensions simultaneously, learning motion features directly from raw pixel volumes. Their 3×3×3 kernels slid across 16-frame clips, learning spatiotemporal patterns end-to-end.
# C3D: 3D convolution operates on a video volume
# Input shape: (batch, channels=3, depth=16, height=112, width=112)
# Conv3D kernel: (3, 3, 3) — learns spatiotemporal features
import torch.nn as nn

conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
# Temporal dimension preserved early, compressed later
# Output after 5 conv blocks + fc: 4096-dim video feature vector
— Tran, D. et al. (2015). Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV.
C3D features became the default video representation for years — the "Word2Vec of video." Extract 4096-dim features per clip, use them for downstream tasks. But 3D convolutions were computationally expensive: C3D had 78M parameters and processed only 16-frame clips, limiting temporal context.
I3D: Inflating ImageNet into Video
Joao Carreira and Andrew Zisserman at DeepMind had an elegant insight: take a 2D CNN pre-trained on ImageNet (e.g., Inception-v1), "inflate" every 2D filter into a 3D filter by repeating it along the temporal axis and rescaling, then fine-tune on video. This transferred powerful spatial features while learning temporal patterns.
I3D achieved 98.0% on UCF-101 and introduced the Kinetics-400 dataset — 400 action classes, ~300K clips from YouTube — which became the ImageNet of video. The paper also ran the most thorough comparison of video architectures to date: LSTM encoders, 3D convnets, two-stream networks, and their inflation variants. I3D with two streams (RGB + flow) won decisively.
— Carreira, J. & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR. 10,000+ citations.
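The inflation trick itself is a few lines. A sketch in PyTorch: the 1/T rescaling means a "boring" video of T identical frames produces the same activations as the source image model, since the time slices sum back to the original filter.

```python
import torch

def inflate_conv2d_weight(w2d: torch.Tensor, time: int) -> torch.Tensor:
    """Inflate a 2D conv filter (out, in, kH, kW) into a 3D filter
    (out, in, T, kH, kW) by repeating along time and rescaling by 1/T."""
    return w2d.unsqueeze(2).repeat(1, 1, time, 1, 1) / time

w2d = torch.randn(64, 3, 7, 7)            # an ImageNet-pretrained 7x7 filter
w3d = inflate_conv2d_weight(w2d, time=7)  # a 7x7x7 spatiotemporal filter
# Summing the time slices recovers the original 2D filter exactly
```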
ViViT & TimeSformer: Attention Over Space and Time
The Vision Transformer (ViT) had proven that images could be understood as sequences of patches. Two groups independently extended this to video. ViViT (Arnab et al., Google) tokenized video into spatiotemporal "tubes" and explored four transformer variants — the most efficient used factorized self-attention: spatial attention within each frame, then temporal attention across frames. This reduced complexity from O(T²·N²) to O(T·N² + T²·N).
TimeSformer (Bertasius et al., Facebook AI) took a similar approach with "divided space-time attention" — each patch first attends to all patches at the same temporal position (spatial), then to all patches at the same spatial position across time (temporal). Both architectures showed that factorized attention could match or exceed 3D CNNs while being more scalable.
# Factorized attention (ViViT Model 3)
# Video: T frames, each frame has N spatial patches
# Instead of full attention over T*N tokens (O(T²N²)):

# Step 1: Spatial attention within each frame
for t in range(T):
    spatial_tokens[t] = self_attention(patches[t])  # O(N²) per frame

# Step 2: Temporal attention across frames
for n in range(N):
    temporal_tokens[:, n] = self_attention(spatial_tokens[:, n])  # O(T²) per patch

# Total: O(T·N² + T²·N) instead of O(T²·N²) — massive savings for long videos

— Arnab, A. et al. (2021). ViViT: A Video Vision Transformer. ICCV.
— Bertasius, G. et al. (2021). Is Space-Time Attention All You Need for Video Understanding? ICML.
VideoMAE: Self-Supervised Video Pre-training
Tong et al. applied masked autoencoding to video, masking 90–95% of spatiotemporal patches and training the transformer to reconstruct them. The key insight was that video's temporal redundancy allows extremely high masking ratios — far higher than the 75% used for images in MAE. This made self-supervised pre-training on video computationally feasible: you only process 5–10% of the tokens during training.
VideoMAE-v2 scaled this to over 1 billion parameters and set new records on Kinetics-400/600/700, Something-Something-v2, and AVA. It proved that video transformers could match or beat convolutional models even without labeled data.
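A simplified sketch of tube masking follows; the paper's sampling scheme differs in details, but the core idea is that the same spatial positions are masked in every frame, so the model can't cheat by copying a visible patch from a neighboring frame:

```python
import torch

def tube_mask(num_frames: int, num_patches: int, mask_ratio: float = 0.9):
    """VideoMAE-style tube masking (simplified): mask the same spatial
    patch positions in every frame. Returns a boolean (T, N) mask where
    True means the token is hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    masked_spatial = torch.randperm(num_patches)[:num_masked]
    mask = torch.zeros(num_frames, num_patches, dtype=torch.bool)
    mask[:, masked_spatial] = True   # same "tube" across all frames
    return mask

mask = tube_mask(num_frames=16, num_patches=196, mask_ratio=0.9)
visible = (~mask).sum().item()   # only ~10% of tokens reach the encoder
print(mask.shape, visible)       # torch.Size([16, 196]) 320
```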
InternVideo: The Video Foundation Model
Shanghai AI Lab built InternVideo by combining masked video modeling (generative) with video-language contrastive learning (discriminative) in a unified framework. InternVideo2 scaled to 6 billion parameters and achieved state-of-the-art on 60+ video benchmarks simultaneously — action recognition, temporal grounding, video retrieval, video QA.
VideoCLIP: Contrastive Video-Language Learning
Xu et al. at Meta AI trained a model to align video clips with their text descriptions using contrastive learning on 1.1M video-text pairs from HowTo100M. The key innovation was temporally overlapping positive pairs — rather than requiring exact temporal alignment (which is noisy in instructional videos), they treated any overlapping video-text pair as a soft positive. This produced a shared embedding space where you could search video with natural language.
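The contrastive core is the standard symmetric InfoNCE objective; VideoCLIP's overlapping soft positives sit on top of it. A minimal sketch of that core, with matching (video, text) pairs on the diagonal of the similarity matrix:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (video, text) pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.T / temperature   # (B, B) similarities
    targets = torch.arange(len(logits))             # diagonal = positives
    loss_v2t = F.cross_entropy(logits, targets)     # video -> text
    loss_t2v = F.cross_entropy(logits.T, targets)   # text -> video
    return (loss_v2t + loss_t2v) / 2

video_emb = torch.randn(8, 512)   # clip embeddings from a video encoder
text_emb = torch.randn(8, 512)    # caption embeddings from a text encoder
loss = clip_style_loss(video_emb, text_emb)
```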
Video-LLaVA: Visual Instruction Tuning for Video
Lin et al. unified image and video understanding in a single model by projecting both modalities into a shared feature space before feeding them to a language model backbone. Video frames were encoded with a ViT, projected through a learned MLP, and concatenated with text token embeddings. The language model (Vicuna-7B/13B) then generated free-form responses about the video content.
This was the moment video understanding became conversational. Instead of classifying into predefined action labels, you could ask open-ended questions: "What happens after the man enters the kitchen?" and get natural-language answers grounded in the video.
Frontier Models: Native Video Understanding
The current generation processes video natively in their context windows, without requiring separate video encoders or frame sampling logic on the user's side:
Gemini 2.5 Pro
Google. Processes up to 1 hour of video natively in its 1M-token context window. Samples frames internally.
GPT-4o
OpenAI. Multi-frame image input. No native video upload — requires client-side frame extraction.
Qwen2.5-VL
Alibaba. Open-weight. Processes video at dynamic resolution with temporal position embeddings.
Twelve Labs
Purpose-built video understanding API. Embedding, search, and generation over video libraries.
The sampling question hasn't disappeared
Even models that accept "native video" sample internally. Gemini 2.5 Pro extracts frames at ~1 fps from uploaded video. GPT-4o requires you to sample explicitly. Understanding frame sampling strategies remains essential — you're either choosing the strategy yourself or trusting the model's default. For production systems where cost, latency, and accuracy matter, you want control over that decision.
The throughline: 1981 → 2026
Four decades, one goal: make machines understand what happens in video, not just what appears in frames.
Frame Sampling Strategies
A 30-fps video contains 1,800 frames per minute. Most of them are redundant — adjacent frames in a static shot are nearly identical. The art of video understanding is choosing which frames to process and how many to budget. The wrong sampling strategy can miss critical events or waste 90% of your compute on duplicate information.
Uniform Sampling
Extract frames at fixed intervals (e.g., 1 fps, or every 30th frame). The simplest approach and the default for most video-language models. Gemini samples at ~1 fps internally. Uniform sampling works well when the information density is roughly constant — lectures, surveillance, dashcam footage.
Failure mode: Misses brief but critical events (a punch in a fight, a traffic light change) that fall between sample points. A 1-fps sample of a 120-fps slow-motion replay will miss 99.2% of frames.
Best for: General summarization, content understanding, meeting recordings
Keyframe / Shot-boundary Detection
Detect frames where significant visual change occurs — a scene cut, a camera pan, or a major action event. Algorithms compute inter-frame difference (pixel-level, histogram-based, or feature-level) and extract frames that exceed a threshold. This produces an adaptive sample: more frames during action, fewer during static shots.
Implementations: FFmpeg's select='gt(scene,0.3)' filter, PySceneDetect, or compute SSIM/histogram distance between consecutive frames.
Best for: Movies, TV shows, edited content, event summarization
Clustering-based Sampling
Extract all frames (or a dense uniform sample), compute lightweight embeddings (e.g., CLIP ViT-B), cluster them with K-means, and pick the frame nearest each cluster centroid. This guarantees visual diversity in your sample — you'll never get 10 near-identical frames from a static shot.
Best for: Video retrieval, content indexing, diverse thumbnail generation
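A sketch of the approach, assuming you already have per-frame embeddings and using scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_frame_indices(embeddings: np.ndarray, k: int = 8) -> list[int]:
    """Pick k visually diverse frames: cluster frame embeddings with
    K-means, then keep the frame nearest each cluster centroid.
    `embeddings` is (num_frames, dim), e.g. CLIP features."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    picks = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[dists.argmin()]))
    return sorted(picks)

# Toy data: three "shots" of 20 near-identical frames each
rng = np.random.default_rng(0)
shots = [rng.normal(loc=i * 10, scale=0.1, size=(20, 512)) for i in range(3)]
embeddings = np.concatenate(shots)
picks = diverse_frame_indices(embeddings, k=3)   # one frame per shot
```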
Audio-guided Sampling
Use the audio track to guide visual sampling. Transcribe with Whisper to get word-level timestamps, then sample frames aligned to speech onset, topic changes, or audio events (applause, music cues, sound effects). This is especially powerful for lecture videos and podcasts where the audio carries the primary information.
Best for: Lectures, interviews, webinars, conference talks, podcast video
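Once you have segment timestamps (e.g. from Whisper), mapping them to frame indices is straightforward. A sketch; the half-second offset is an illustrative choice that gives the speaker time to appear on screen:

```python
def frames_for_segments(segments: list[dict], fps: float,
                        offset_s: float = 0.5) -> list[int]:
    """Map transcript segments to frame indices: one frame shortly after
    each speech onset, deduplicated and in order. `segments` follows the
    Whisper output shape: [{"start": 0.0, "end": 4.2, "text": "..."}]."""
    indices = []
    for seg in segments:
        idx = round((seg["start"] + offset_s) * fps)
        if not indices or idx > indices[-1]:
            indices.append(idx)
    return indices

segments = [
    {"start": 0.0, "end": 4.2, "text": "Welcome to the talk."},
    {"start": 4.2, "end": 9.8, "text": "Today we cover three topics."},
    {"start": 65.0, "end": 71.5, "text": "Moving on to pricing."},
]
frame_indices = frames_for_segments(segments, fps=30.0)
print(frame_indices)  # [15, 141, 1965]
```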
Practical guideline: how many frames?
For GPT-4o, the sweet spot is 8–32 frames depending on video length. Beyond ~50 frames, costs escalate and the model's performance plateaus. For Gemini, upload the full video and let the model handle sampling — it tokenizes ~1 fps and the cost scales with duration. For local models like Qwen2.5-VL, budget depends on your GPU memory: 8 frames at 448×448 requires ~4GB VRAM.
Building a Video Understanding Pipeline
Let's build a complete video analysis pipeline: extract frames, optionally detect keyframes, transcribe audio, and analyze with a VLM. Every code block is production-usable.
Video Understanding Pipeline
Step 1: Frame Extraction with Multiple Strategies
import cv2
import numpy as np
from typing import Literal

def extract_frames(
    video_path: str,
    strategy: Literal["uniform", "keyframe", "scene"] = "uniform",
    target_frames: int = 16,
    scene_threshold: float = 30.0,
) -> list[tuple[np.ndarray, float]]:
    """Extract frames from video with timestamps.

    Returns a list of (frame, timestamp_seconds) tuples.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total_frames / fps

    if strategy == "uniform":
        # Sample evenly across the video
        indices = np.linspace(0, total_frames - 1, target_frames, dtype=int)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                frames.append((frame, idx / fps))
        cap.release()
        return frames

    elif strategy == "keyframe":
        # Extract frames with significant visual change
        frames = []
        prev_hist = None
        frame_idx = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            # Compute color histogram
            hist = cv2.calcHist([frame], [0, 1, 2], None,
                                [8, 8, 8], [0, 256, 0, 256, 0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            if prev_hist is None:
                frames.append((frame, frame_idx / fps))
            else:
                diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CHISQR)
                if diff > scene_threshold:
                    frames.append((frame, frame_idx / fps))
            prev_hist = hist
            frame_idx += 1
        cap.release()
        # If too many, subsample uniformly
        if len(frames) > target_frames:
            indices = np.linspace(0, len(frames) - 1, target_frames, dtype=int)
            frames = [frames[i] for i in indices]
        return frames

    elif strategy == "scene":
        # Use ffmpeg scene detection (shell out)
        import subprocess, json
        cmd = [
            "ffprobe", "-v", "quiet", "-select_streams", "v",
            "-show_frames", "-show_entries", "frame=pts_time,pict_type",
            "-of", "json", video_path
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        scene_data = json.loads(result.stdout)
        # Filter I-frames (scene changes)
        i_frames = [
            float(f["pts_time"]) for f in scene_data.get("frames", [])
            if f.get("pict_type") == "I"
        ]
        # Sample from scene boundaries
        if len(i_frames) <= target_frames:
            timestamps = i_frames
        else:
            idx = np.linspace(0, len(i_frames) - 1, target_frames, dtype=int)
            timestamps = [i_frames[i] for i in idx]
        frames = []
        for ts in timestamps:
            cap.set(cv2.CAP_PROP_POS_MSEC, ts * 1000)
            ret, frame = cap.read()
            if ret:
                frames.append((frame, ts))
        cap.release()
        return frames

Step 2: Video Analysis with GPT-4o
import base64
from openai import OpenAI

def encode_frame(frame: np.ndarray) -> str:
    """Encode frame as base64 JPEG for API consumption."""
    _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
    return base64.b64encode(buffer).decode('utf-8')

def analyze_video_gpt4o(
    frames: list[tuple[np.ndarray, float]],
    question: str,
    model: str = "gpt-4o",
) -> str:
    """Analyze video frames with GPT-4o.

    Includes timestamps in the prompt so the model can
    reference specific moments in its response.
    """
    client = OpenAI()
    # Build the message content with timestamped frames
    content = [
        {
            "type": "text",
            "text": (
                f"You are analyzing a video. I'm providing {len(frames)} frames "
                f"sampled from the video, each labeled with its timestamp.\n\n"
                f"Question: {question}"
            ),
        }
    ]
    for frame, timestamp in frames:
        # Add timestamp label before each frame
        minutes = int(timestamp // 60)
        seconds = timestamp % 60
        content.append({
            "type": "text",
            "text": f"[{minutes}:{seconds:05.2f}]"
        })
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encode_frame(frame)}",
                "detail": "low"  # Use "high" for fine detail, costs 4x more
            }
        })
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        max_tokens=2000,
    )
    return response.choices[0].message.content

# Usage
frames = extract_frames("lecture.mp4", strategy="uniform", target_frames=16)
summary = analyze_video_gpt4o(
    frames,
    "Summarize the key points discussed in this presentation. "
    "Reference timestamps when the topic changes."
)

Cost awareness
Each low-detail image costs 85 tokens (~$0.0004). At 16 frames, that's ~1,360 image tokens plus your prompt — roughly $0.01 per video analysis. High-detail mode costs 4x more per frame. For batch processing thousands of videos, this adds up fast. Consider using Gemini Flash for high-volume workloads ($0.075/1M tokens) or running Qwen2.5-VL locally.
Step 3: Native Video with Gemini
import google.generativeai as genai
import time

genai.configure(api_key="YOUR_API_KEY")

def analyze_video_gemini(video_path: str, question: str) -> str:
    """Analyze video natively with Gemini.

    Gemini handles frame sampling internally at ~1 fps.
    Supports up to 1 hour of video in Gemini 2.5 Pro.
    """
    model = genai.GenerativeModel("gemini-2.5-pro")
    # Upload video file (supports mp4, mov, avi, mkv, webm)
    video_file = genai.upload_file(video_path)
    # Wait for server-side processing (transcoding + frame extraction)
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name == "FAILED":
        raise ValueError(f"Video processing failed: {video_file.state.name}")
    # Analyze — Gemini sees frames + audio natively
    response = model.generate_content(
        [video_file, question],
        generation_config=genai.GenerationConfig(
            temperature=0.2,
            max_output_tokens=4000,
        ),
    )
    return response.text

# Usage — no frame extraction needed
result = analyze_video_gemini(
    "meeting_recording.mp4",
    "List all action items discussed in this meeting with timestamps."
)

When to use GPT-4o vs Gemini for video
GPT-4o: You need precise control over which frames are analyzed. Better for short clips (<2 min) where frame selection matters. Supports structured outputs via function calling.
Gemini 2.5 Pro: Long-form video (10–60 min). Native audio understanding without separate transcription. Simpler API — upload and ask. Better for meeting recordings, lectures, tutorials.
Qwen2.5-VL (local): Privacy-sensitive use cases, high-volume batch processing, or when you need to run on-premise. 72B model matches GPT-4o on many video benchmarks.
Multi-modal Pipeline: Visual + Audio
Video understanding is incomplete without audio. A person nodding while saying "no" means something different from nodding while saying "yes." Speech content, speaker tone, background sounds, and music all carry semantic signal that pure visual analysis misses.
The standard approach: extract audio with FFmpeg, transcribe with Whisper (getting word-level timestamps), then fuse the transcript with visual analysis in a final LLM call that sees both modalities.
Complete Multi-modal Video Pipeline
import whisper
import subprocess
from dataclasses import dataclass

@dataclass
class VideoAnalysis:
    transcript: str
    visual_description: str
    combined_summary: str
    timestamps: list[dict]

def extract_audio(video_path: str) -> str:
    """Extract audio track from video using ffmpeg."""
    audio_path = video_path.rsplit('.', 1)[0] + '.wav'
    subprocess.run([
        'ffmpeg', '-y', '-i', video_path,
        '-vn', '-acodec', 'pcm_s16le',
        '-ar', '16000', '-ac', '1',
        audio_path
    ], check=True, capture_output=True)
    return audio_path

def transcribe_with_timestamps(audio_path: str) -> dict:
    """Transcribe audio with word-level timestamps."""
    model = whisper.load_model("base")
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en",
    )
    return result

def full_video_analysis(video_path: str, question: str) -> VideoAnalysis:
    """Combine visual and audio understanding.

    Pipeline:
    1. Extract frames (uniform, 1 fps)
    2. Extract + transcribe audio (Whisper)
    3. Analyze frames with GPT-4o (visual)
    4. Synthesize visual + audio with final LLM call
    """
    # Visual: extract and analyze frames
    frames = extract_frames(video_path, strategy="uniform", target_frames=16)
    visual_desc = analyze_video_gpt4o(
        frames,
        "Describe what you see in each frame. Note any text, "
        "people, actions, objects, and scene changes."
    )
    # Audio: extract and transcribe
    audio_path = extract_audio(video_path)
    transcript_result = transcribe_with_timestamps(audio_path)
    transcript = transcript_result["text"]
    # Collect segment timestamps
    timestamps = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        for seg in transcript_result.get("segments", [])
    ]
    # Synthesis: combine both modalities
    client = OpenAI()
    synthesis = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a video analyst. You receive a visual description "
                    "of video frames and an audio transcript. Synthesize both "
                    "into a coherent analysis. Reference specific timestamps."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"VISUAL DESCRIPTION:\n{visual_desc}\n\n"
                    f"AUDIO TRANSCRIPT:\n{transcript}\n\n"
                    f"QUESTION: {question}"
                ),
            },
        ],
    )
    return VideoAnalysis(
        transcript=transcript,
        visual_description=visual_desc,
        combined_summary=synthesis.choices[0].message.content,
        timestamps=timestamps,
    )

# Usage
analysis = full_video_analysis(
    "product_demo.mp4",
    "What features are demonstrated and what claims are made about each?"
)
print(analysis.combined_summary)

Video Search and Retrieval
The most impactful production use case for video understanding is semantic search over video libraries. Instead of relying on manual tags or metadata, you embed video segments into a shared vector space with text queries, enabling natural-language search: "Find the moment where the speaker discusses pricing" returns a timestamp, not a document.
Video Search with Twelve Labs
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="YOUR_API_KEY")

# Create an index (a searchable video collection)
index = client.index.create(
    name="product_demos",
    engines=[{
        "name": "marengo2.7",  # Video understanding engine
        "options": ["visual", "conversation", "text_in_video", "logo"],
    }],
)

# Upload videos to the index
task = client.task.create(
    index_id=index.id,
    file="demo_video.mp4",
)
task.wait_for_done()  # Processing: ~1 min per 1 min of video

# Natural language search — returns timestamps
results = client.search.query(
    index_id=index.id,
    query_text="moment where the speaker demonstrates the API",
    options=["visual", "conversation"],
)
for clip in results.data:
    print(f"[{clip.start:.1f}s - {clip.end:.1f}s] "
          f"score={clip.score:.3f} | {clip.video_id}")

# Generate text from a specific segment
summary = client.generate.text(
    video_id=task.video_id,
    prompt="Summarize the key features shown in this segment",
    temperature=0.2,
)

DIY Video Search with CLIP Embeddings
# Build your own video search with CLIP + vector DB
import torch
from transformers import CLIPModel, CLIPProcessor
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Initialize CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Initialize vector database
qdrant = QdrantClient(":memory:")  # or url="http://localhost:6333"
qdrant.create_collection(
    collection_name="video_frames",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

def index_video(video_path: str, video_id: str):
    """Index a video by embedding frames into Qdrant."""
    frames = extract_frames(video_path, strategy="uniform", target_frames=60)
    points = []
    for i, (frame, timestamp) in enumerate(frames):
        # Convert BGR (OpenCV) to RGB (PIL)
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        inputs = processor(images=rgb_frame, return_tensors="pt")
        with torch.no_grad():
            embedding = model.get_image_features(**inputs)
            embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        points.append(PointStruct(
            id=abs(hash(f"{video_id}_{i}")),  # Qdrant ids must be unsigned ints or UUIDs
            vector=embedding[0].numpy().tolist(),
            payload={"video_id": video_id, "timestamp": timestamp, "frame_idx": i},
        ))
    qdrant.upsert(collection_name="video_frames", points=points)

def search_video(query: str, top_k: int = 5):
    """Search indexed videos with natural language."""
    inputs = processor(text=query, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embedding = model.get_text_features(**inputs)
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    results = qdrant.search(
        collection_name="video_frames",
        query_vector=text_embedding[0].numpy().tolist(),
        limit=top_k,
    )
    return [
        {"video_id": r.payload["video_id"],
         "timestamp": r.payload["timestamp"],
         "score": r.score}
        for r in results
    ]

# Index and search
index_video("keynote.mp4", "keynote_2024")
hits = search_video("slide showing revenue growth chart")
# Returns: [{"video_id": "keynote_2024", "timestamp": 847.3, "score": 0.31}, ...]

CLIP vs dedicated video models for search
CLIP encodes individual frames — it has no temporal understanding. Searching for "person running" works because it's visible in a single frame. Searching for "person who tripped and fell" will fail because it requires understanding a sequence of frames. For temporal queries, use dedicated video embedding models like Twelve Labs Marengo, InternVideo2, or LanguageBind — these encode short clips (4–16 frames) into a single vector that captures motion and temporal relationships.
The Temporal Understanding Challenge
The hardest problems in video understanding are temporal — they require reasoning about sequences of events, not just recognizing objects in frames. This is where most current systems still struggle.
Action Recognition
Classifying what action is being performed in a video clip. Early benchmarks (UCF-101, Kinetics-400) focused on short clips (3–10 seconds) with a single action. Models now achieve 90%+ on these, partly because many actions are identifiable from a single frame ("playing guitar" is recognizable without motion). The field has moved to fine-grained temporal reasoning benchmarks.
Temporal Grounding
Given a natural language query and a long video, find the start and end timestamps of the described moment. Example: "The moment the speaker first mentions competition" in a 40-minute earnings call. This requires understanding language, scanning the full video, and localizing precisely. Current SOTA uses models like UniVTG and Moment-DETR.
Long-form Video Understanding
Understanding hour-long videos — movies, meetings, lectures — is the current frontier. The challenges compound: you need to track entities across scenes, maintain a narrative state, handle topic drift, and answer questions that require synthesizing information from multiple distant segments. Benchmarks like EgoSchema (3-min egocentric clips requiring temporal reasoning) and MovieChat (hour-long movies) expose how far even frontier models have to go.
Gemini 2.5 Pro's 1M-token context window can process ~1 hour of video, but performance degrades significantly on questions requiring reasoning about events separated by more than 10 minutes. The needle-in-a-haystack problem for video is far harder than for text.
Video Question Answering (VideoQA)
Open-ended question answering about video content. "How many times did the batter swing and miss?" requires counting events across time. "Why did the person leave the room?" requires causal reasoning. Current models handle factual questions well but struggle with counterfactual reasoning ("What would have happened if...") and questions requiring real-world knowledge not present in the video.
Production Use Cases
Video understanding has moved from research benchmarks to production systems. Here are the use cases where it delivers measurable value today.
Surveillance & Security
Anomaly detection in CCTV feeds: detect fights, unattended bags, intrusions, or vehicle accidents in real-time. Modern systems combine YOLOv8 for object detection with a video classifier for action recognition, triggering alerts only when both agree.
Content Moderation
Identify policy violations in user-uploaded video: violence, NSFW content, self-harm, dangerous challenges. Platforms like YouTube process 500+ hours of video per minute. The pipeline: fast frame-level classifier (cheap, high recall) followed by a VLM for nuanced review (expensive, high precision) on flagged content.
Video Search & Discovery
Natural language search over corporate video archives: "Find where the CEO discusses Q3 results in last month's all-hands." Used by media companies (search across footage libraries), enterprises (search meeting recordings), and education platforms (search across lecture archives).
Sports Analytics
Automatic play detection, player tracking, formation recognition, and highlight generation. Companies like Hawk-Eye (tennis/cricket), StatsBomb (football), and Second Spectrum (basketball) use video understanding to generate real-time statistics that were previously only available through manual annotation.
Medical Video Analysis
Surgical procedure recognition, endoscopy anomaly detection, physical therapy compliance monitoring. AI-assisted colonoscopy (detecting polyps in real-time) has already been shown to improve detection rates by 14% in randomized clinical trials.
Autonomous Driving
Multi-camera video feeds processed in real-time for lane detection, pedestrian prediction, traffic sign recognition, and scenario understanding. Tesla's vision-only approach processes 8 cameras simultaneously with a temporal backbone that reasons across frames.
Key Takeaways
1. Video = frames + audio + time — The temporal dimension is what separates video from a batch of images. Temporal reasoning (understanding what happened, not just what appears) remains the hardest problem.
2. Sampling strategy determines everything — Uniform for general use, keyframe detection for edited content, clustering for diversity, audio-guided for speech-heavy video. The wrong strategy wastes compute and misses events.
3. The field evolved through five eras — Hand-crafted features (HOG/HOF) → 3D CNNs (C3D, I3D) → video transformers (ViViT) → video-language models (Video-LLaVA) → frontier multimodal models (Gemini, GPT-4o). Each generation solved one limitation of the last.
4. Combine visual + audio for production — Whisper for transcription, a VLM for frame analysis, an LLM for synthesis. Or use Gemini, which handles both natively. The multi-modal pipeline catches what either modality alone misses.
5. Long-form video is the current frontier — Understanding hour-long videos with complex narratives, tracking entities across scenes, and answering questions that require reasoning over distant segments. Even Gemini 2.5 Pro degrades beyond ~10 minutes of temporal separation.
Further Reading
Foundational Papers
- Two-Stream Convolutional Networks (Simonyan & Zisserman, 2014) — Established the two-stream paradigm
- Quo Vadis, Action Recognition? (Carreira & Zisserman, 2017) — I3D and the Kinetics dataset
- ViViT: A Video Vision Transformer (Arnab et al., 2021) — Factorized space-time attention
- Video-LLaVA (Lin et al., 2023) — Unified image-video LLM
Benchmarks
- CodeSOTA Benchmarks — Track video understanding SOTA on our leaderboards
- Papers With Code: Video Understanding — Current leaderboards and papers