Video Understanding
Understand and describe video content. Powers video search, summarization, and analysis.
How Video to Text Works
A technical deep-dive into video captioning and description. From frame sampling to temporal understanding, and how modern VLMs turn moving images into text.
The Problem
Why is video understanding fundamentally different from image understanding?
Imagine you are shown a single photograph of someone mid-jump. You can describe their pose, the setting, perhaps guess they are exercising. Now imagine you see a 10-second video of the same scene. Suddenly you understand: they are doing a triple jump, this is a track meet, they just set a personal record.
The difference is temporal context. A video is not just a collection of images; it is images connected by time, motion, and causality. Understanding video means understanding what came before, what is happening now, and what will happen next.
- Scale: a 1-minute video at 30 FPS is 1,800 frames. Processing all of them is computationally prohibitive.
- Redundancy: most adjacent frames are nearly identical. The challenge is extracting only the frames that matter.
- Motion: actions like walking, running, and waving only make sense across multiple frames.
- Narrative: videos tell stories, with a beginning, middle, and end, cause and effect, and context that spans seconds or minutes.
Video-to-text is fundamentally about compression with preservation. We must compress hours of visual data into a few hundred tokens of text, while preserving the essential narrative, actions, and meaning. The art lies in deciding what to keep and what to discard.
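To make that budget concrete, here is a back-of-the-envelope calculation (a sketch; the ~85-tokens-per-low-detail-image figure is OpenAI's published cost for GPT-4o-class vision input and may change):

```python
# Rough token budget: all frames vs. 1 FPS sampling (illustrative numbers only).
fps_native = 30          # source frame rate
duration_s = 60          # 1-minute clip
tokens_per_frame = 85    # approx. cost of one "detail: low" image for GPT-4o (assumption)

all_frames = fps_native * duration_s   # 1,800 frames
sampled_frames = duration_s            # 1 frame per second -> 60 frames

print(f"All frames:  {all_frames} -> ~{all_frames * tokens_per_frame:,} tokens")
print(f"1 FPS:       {sampled_frames} -> ~{sampled_frames * tokens_per_frame:,} tokens")
```

Even at 1 FPS, a one-minute clip costs thousands of input tokens, which is why sampling strategy and frame count dominate the cost and quality trade-off.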
Frame Sampling
How do we select which frames to analyze? This is the first and most critical decision.
- Uniform sampling: extract frames at fixed intervals (e.g., 1 frame per second).
- Keyframe detection: extract frames when a significant visual change occurs.
- Dense sampling: extract many frames (5-30 FPS) for fine-grained analysis.
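A minimal sketch of the first two strategies using OpenCV; the function names and the diff_threshold value are illustrative, and production keyframe detectors usually rely on histogram or shot-boundary methods rather than raw pixel differences:

```python
import cv2
import numpy as np

def sample_uniform(video_path: str, every_n_seconds: float = 1.0) -> list:
    """Uniform sampling: keep one frame every `every_n_seconds`."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(round(fps * every_n_seconds)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def sample_keyframes(video_path: str, diff_threshold: float = 30.0) -> list:
    """Naive keyframe detection: keep a frame when the mean absolute pixel
    difference from the last kept frame exceeds a threshold."""
    cap = cv2.VideoCapture(video_path)
    frames, last_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last_gray is None or np.mean(cv2.absdiff(gray, last_gray)) > diff_threshold:
            frames.append(frame)
            last_gray = gray
    cap.release()
    return frames
```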
Temporal Understanding
How models learn to understand what happens across time, not just within a single frame.
Consider the phrase "the ball was caught." To understand this from video, a model must: track the ball across frames, identify the catching motion, recognize when the action completes. This is temporal reasoning - understanding how things change over time.
- Frame-by-frame (GPT-4V style): extract frames, send them as images to a VLM, and aggregate the captions.
- Native video encoder (Gemini, Twelve Labs): the model processes the video directly as a continuous input.
- Two-stage (Video-LLaMA): a video encoder creates embeddings, and a Q-Former bridges them to the LLM.
Most VLMs were trained on images, not video. When we send frames to GPT-4V, the model sees them as separate images and must infer temporal relationships from context (like frame numbers in the prompt). Native video models like Gemini process the actual video stream, preserving motion information that frame sampling destroys.
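As a sketch of that workaround, the snippet below interleaves timestamp labels with sampled frames so a frame-based VLM can recover temporal order from the prompt itself. It assumes the frames are already base64-encoded JPEGs (e.g., from the extract_frames helper in the Code Examples section below), and the one-second spacing is an assumption tied to 1 FPS sampling:

```python
from typing import Any, Dict, List

def build_timestamped_content(frames_b64: List[str],
                              seconds_per_frame: float = 1.0) -> List[Dict[str, Any]]:
    """Interleave 'Frame at t=Ns' text markers with frames for a multi-image VLM call."""
    content = [{"type": "text",
                "text": "These frames were sampled from one video, in order. "
                        "Describe what happens over time, citing timestamps."}]
    for i, frame in enumerate(frames_b64):
        content.append({"type": "text", "text": f"Frame at t={i * seconds_per_frame:.0f}s:"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame}",
                                      "detail": "low"}})
    return content
```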
Dense vs Sparse Captioning
Different use cases require different levels of detail. Should you describe every second, or summarize the whole video?
Sparse Captioning
One description for the entire video
A chef prepares a pasta dish, starting by boiling water, then sauteing garlic, and finally plating the finished meal with fresh herbs.
Dense Captioning
Timestamped descriptions for each segment
[0:00-0:15] Chef fills pot with water
[0:15-0:30] Adds salt, brings to boil
[0:30-1:00] Sautes garlic in olive oil
[1:00-1:30] Adds pasta to boiling water
Highlight Detection
Identify and describe key moments only
[0:45] Key moment: the chef adds a secret ingredient, truffle oil
[2:30] Key moment: final plating with microgreens
Sparse captioning works best for:
- Building search indexes
- Quick content categorization
- Limited token budgets
- Cases where an overview is sufficient

Dense captioning works best for:
- Creating transcripts
- Accessibility descriptions
- Training data generation
- Detailed analysis

Highlight detection works best for:
- Sports clips
- Meeting recordings
- Finding key moments
- Creating summaries
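A minimal dense-captioning loop, as a sketch: split the sampled frames into fixed-length segments, caption each segment, and emit timestamped lines. It reuses the extract_frames helper defined in the Code Examples section below; the 30-second segment length and the every-fifth-frame subsampling are illustrative assumptions.

```python
import openai
from typing import List

def dense_captions(video_path: str, segment_s: int = 30) -> List[str]:
    """Caption a video in timestamped segments using frame sampling + GPT-4o."""
    client = openai.OpenAI()
    frames = extract_frames(video_path, fps=1.0)  # one base64 frame per second
    lines = []
    for start in range(0, len(frames), segment_s):
        segment = frames[start:start + segment_s]
        content = [{"type": "text",
                    "text": "Describe this video segment in one sentence."}]
        content += [{"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{f}", "detail": "low"}}
                    for f in segment[::5]]  # a few evenly spaced frames per segment
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": content}],
            max_tokens=80,
        )
        end = min(start + segment_s, len(frames))
        lines.append(f"[{start // 60}:{start % 60:02d}-{end // 60}:{end % 60:02d}] "
                     f"{resp.choices[0].message.content.strip()}")
    return lines
```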
Models and Methods
The landscape of video understanding models, from APIs to open source.
| Model | Vendor | Approach | Capacity | Type | Cost |
|---|---|---|---|---|---|
| GPT-4V with Frames | OpenAI | Multi-image | ~50-100 images | API | $$$ |
| Gemini 1.5 Pro/Flash | Google | Native video | ~1 hour of video | API | $$ |
| Video-LLaMA | DAMO Academy | Video encoder + LLM | ~32-64 frames | Open Source | Free |
| VideoCLIP | Meta | Contrastive | ~32 frames | Open Source | Free |
| Twelve Labs | Twelve Labs | Native video | Hours of video | API | $$ |
| Qwen2-VL | Alibaba | Native video | ~64 frames | Open Source | Free |
GPT-4V with frames
- Pros: best reasoning; flexible prompting; handles any video type
- Cons: no native video input; frame selection is manual; expensive at scale

Gemini 1.5 Pro/Flash
- Pros: native video input; long context; fast inference
- Cons: less precise on details; availability varies

Video-LLaMA
- Pros: open weights; audio understanding; fine-tunable
- Cons: lower quality than the APIs; high VRAM requirements; complex setup

VideoCLIP
- Pros: fast embeddings; good for retrieval; efficient
- Cons: no text generation; fixed output format
Code Examples
Production-ready code for video captioning with different providers.
import openai
import cv2
import base64
from typing import List


def extract_frames(video_path: str, fps: float = 1.0) -> List[str]:
    """Extract frames from video at the specified FPS, returned as base64 JPEGs."""
    video = cv2.VideoCapture(video_path)
    original_fps = video.get(cv2.CAP_PROP_FPS) or 30.0
    # Guard against requesting a higher rate than the source provides
    frame_interval = max(1, int(round(original_fps / fps)))

    frames = []
    frame_count = 0
    while True:
        success, frame = video.read()
        if not success:
            break
        if frame_count % frame_interval == 0:
            # Resize for efficiency before encoding
            frame = cv2.resize(frame, (512, 512))
            _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
            frames.append(base64.b64encode(buffer).decode('utf-8'))
        frame_count += 1

    video.release()
    return frames


def describe_video(video_path: str, max_frames: int = 20) -> str:
    """Generate a description of a video using GPT-4o with frame sampling."""
    client = openai.OpenAI()

    # Extract frames at 1 FPS
    frames = extract_frames(video_path, fps=1.0)

    # Limit frames to avoid token limits (uniform subsampling)
    if len(frames) > max_frames:
        indices = [int(i * len(frames) / max_frames) for i in range(max_frames)]
        frames = [frames[i] for i in indices]

    # Build a single message: instructions first, then the frames in order
    content = [{"type": "text", "text": """Analyze this video represented as frames.
Provide:
1. A brief summary of what happens
2. Key actions or events in chronological order
3. Notable objects, people, or settings"""}]

    for frame in frames:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{frame}",
                "detail": "low"  # use "high" for more detail (at higher cost)
            }
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )
    return response.choices[0].message.content


# Usage
description = describe_video("cooking_video.mp4")
print(description)

Tips for the GPT-4V/GPT-4o frame approach:
- Use detail: "low" for cost savings
- Resize frames to 512x512
- Include frame numbers in the prompt
- Max ~50-100 frames per request
Gemini tips:
- Flash is faster, Pro is better
- Upload files for long videos
- Request timestamps explicitly
- Clean up uploaded files when done
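A minimal sketch of that upload-then-prompt flow with the google-generativeai Python SDK; the model name, polling interval, and prompt are assumptions, so check the current SDK docs for the exact file-handling API:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the video, then poll until the file is processed and ready to use
video_file = genai.upload_file(path="cooking_video.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video_file,
    "Describe this video. Include timestamps for key events.",
])
print(response.text)

# Clean up the uploaded file when finished
genai.delete_file(video_file.name)
```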
Twelve Labs tips:
- Great for search + chapters
- Indexing takes time upfront
- Audio + visual understanding
- Good for video libraries
Quick Reference
Frame sampling:
- Uniform: 1 FPS default
- Keyframe: scene changes
- Dense: 5-30 FPS

Captioning modes:
- Sparse: overall summary
- Dense: timestamped
- Highlights: key moments

Recommended models:
- Gemini 1.5 (native)
- GPT-4o (reasoning)
- Qwen2-VL (open)

Common pitfalls:
- Too many frames = token limit
- Missing motion due to sparse sampling
- Ignoring audio context
Use Cases
- ✓ Video captioning
- ✓ Video Q&A
- ✓ Content moderation
- ✓ Surveillance analysis
- ✓ Video search
Architectural Patterns
Frame Sampling + VLM
Sample frames, process with vision-language model.
- Pros: simple; leverages existing image VLMs
- Cons: may miss temporal information; sampling strategy matters
Video Transformers
Native video understanding with temporal modeling.
- Pros: understands motion; full temporal context
- Cons: high compute; long videos are hard to handle
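For an open-weights example of this pattern, here is a sketch of native video input with Qwen2-VL through Hugging Face transformers, following the model's published usage recipe; the model ID, fps setting, and token budget are assumptions, so check the model card for current details:

```python
# pip install transformers accelerate qwen-vl-utils
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# The video is passed natively; the processor samples and encodes frames internally.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/cooking_video.mp4", "fps": 1.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
caption = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0]
print(caption)
```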
Hierarchical Processing
Process clips, then aggregate at video level.
- Pros: handles long videos; efficient
- Cons: may lose details; more complex pipeline
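A sketch of the aggregation step, assuming each clip has already been captioned (for example with the frame-sampling describe_video function from the Code Examples section); the prompt wording is an assumption:

```python
import openai
from typing import List

def summarize_video(clip_captions: List[str]) -> str:
    """Merge per-clip captions into one video-level summary with an LLM."""
    client = openai.OpenAI()
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(clip_captions))
    prompt = ("These captions describe consecutive clips of one video, in order.\n"
              f"{numbered}\n\n"
              "Write a single coherent summary of the full video.")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
    )
    return resp.choices[0].message.content
```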
Implementations
API Services
Gemini 1.5 Pro
Google. Best for long videos, with 1 hour+ of context.
GPT-4V (with frames)
OpenAI. Processes video as an image sequence; good reasoning.
Twelve Labs
Twelve Labs. Video search and understanding API with semantic search.
Quick Facts
- Input: Video
- Output: Text
- Implementations: 2 open source, 3 API
- Patterns: 3 approaches