Video → Text

Video Understanding

Understand and describe video content. Powers video search, summarization, and analysis.

How Video to Text Works

A technical deep-dive into video captioning and description. From frame sampling to temporal understanding, and how modern VLMs turn moving images into text.

1

The Problem

Why is video understanding fundamentally different from image understanding?

Imagine you are shown a single photograph of someone mid-jump. You can describe their pose, the setting, perhaps guess they are exercising. Now imagine you see a 10-second video of the same scene. Suddenly you understand: they are doing a triple jump, this is a track meet, they just set a personal record.

The difference is temporal context. A video is not just a collection of images; it is images connected by time, motion, and causality. Understanding video means understanding what came before, what is happening now, and what will happen next.

Scale

A 1-minute video at 30 FPS is 1,800 frames. Processing all of them is computationally prohibitive.

Redundancy

Most adjacent frames are nearly identical. The challenge is extracting only the frames that matter.

Motion

Actions like walking, running, and waving only make sense across multiple frames.

Narrative

Videos tell stories. Beginning, middle, end. Cause and effect. Context that spans seconds or minutes.

The Core Insight

Video-to-text is fundamentally about compression with preservation. We must compress hours of visual data into a few hundred tokens of text, while preserving the essential narrative, actions, and meaning. The art lies in deciding what to keep and what to discard.
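A back-of-the-envelope sketch of that budget in Python. The per-image token figure is an assumption (roughly what OpenAI documents for a low-detail image); treat the numbers as illustrative rather than exact.

TOKENS_PER_LOW_DETAIL_IMAGE = 85  # assumed cost per low-detail image; varies by provider

def frame_budget(duration_s: float, native_fps: float = 30.0, sample_fps: float = 1.0) -> dict:
    """Compare raw frame count vs. sampled frames and their approximate token cost."""
    raw_frames = int(duration_s * native_fps)
    sampled_frames = int(duration_s * sample_fps)
    return {
        "raw_frames": raw_frames,
        "sampled_frames": sampled_frames,
        "approx_image_tokens": sampled_frames * TOKENS_PER_LOW_DETAIL_IMAGE,
    }

print(frame_budget(60))    # 1 minute: 1,800 raw frames -> 60 sampled, ~5,100 tokens
print(frame_budget(3600))  # 1 hour: 108,000 raw frames -> 3,600 sampled, ~306,000 tokens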

2

Frame Sampling

How do we select which frames to analyze? This is the first and most critical decision.

Frame Sampling Strategies

Example: an 8-frame timeline of a cooking video. The frame at 0:00 (setup) shows a wide shot of a kitchen counter with ingredients.

Uniform Sampling

Extract frames at fixed intervals (e.g., 1 frame per second)

+Simple, Predictable, Good for long videos
-May miss key moments, Ignores scene changes
Best for: General video understanding, long content

Keyframe Detection

Extract frames when significant visual change occurs (a code sketch follows the strategy list below)

+Captures scene transitions, Efficient representation
-Complex to implement, May over-sample action scenes
Best for: Movies, presentations, lectures

Dense Sampling

Extract many frames (5-30 FPS) for fine-grained analysis

+Captures subtle motion, Best for action understanding
-High compute cost, Token limits quickly exceeded
Best for: Sports, cooking, how-to videos
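A minimal keyframe-detection sketch with OpenCV: check every few frames, compare a grayscale histogram against the last kept frame, and keep the frame only when the difference crosses a threshold. The step size and threshold here are illustrative defaults, not tuned values.

import cv2

def detect_keyframes(video_path: str, diff_threshold: float = 0.4, step: int = 5) -> list:
    """Return frame indices where the scene changes noticeably (histogram difference)."""
    video = cv2.VideoCapture(video_path)
    keyframes, last_hist, idx = [], None, 0

    while True:
        success, frame = video.read()
        if not success:
            break
        if idx % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            # Bhattacharyya distance: 0 = identical, 1 = completely different
            if last_hist is None or cv2.compareHist(last_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > diff_threshold:
                keyframes.append(idx)
                last_hist = hist
        idx += 1

    video.release()
    return keyframes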
3

Temporal Understanding

How models learn to understand what happens across time, not just within a single frame.

Consider the phrase "the ball was caught." To understand this from video, a model must track the ball across frames, identify the catching motion, and recognize when the action completes. This is temporal reasoning: understanding how things change over time.

Frame-by-Frame (GPT-4V style)

Extract frames, send as images to VLM, aggregate captions

Video
->
Frame Extraction
->
VLM (per frame)
->
Caption Aggregation
->
Final Description
+Works with any VLM, Full control over sampling
-Loses temporal context, High token cost

Native Video Encoder (Gemini/Twelve Labs)

Model directly processes video as continuous input

Video
->
Video Encoder
->
Temporal Modeling
->
LLM
->
Description
+Preserves motion, Efficient, Handles long videos
-Limited model choice, Less interpretable

Two-Stage (Video-LLaMA)

Video encoder creates embeddings, Q-Former bridges to LLM

Video
->
ViT + Temporal
->
Q-Former
->
LLM
->
Description
+Open source, Modular design
-Complex training, Lower quality
The Temporal Modeling Challenge

Most VLMs were trained on images, not video. When we send frames to GPT-4V, the model sees them as separate images and must infer temporal relationships from context (like frame numbers in the prompt). Native video models like Gemini process the actual video stream, preserving motion information that frame sampling destroys.
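One low-effort way to restore some temporal grounding with a frame-based VLM is to interleave timestamp labels with the images, so the prompt itself encodes order and pacing. A minimal sketch, assuming the frames are base64 JPEGs produced by a sampler like the one in the code section below:

def build_timestamped_content(frames: list, sample_fps: float = 1.0) -> list:
    """Interleave [m:ss] labels with frames so the model can reason about order."""
    content = [{"type": "text", "text": "Describe what happens over time in this video."}]
    for i, frame_b64 in enumerate(frames):
        t = i / sample_fps
        content.append({"type": "text", "text": f"[{int(t // 60)}:{int(t % 60):02d}]"})
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}", "detail": "low"},
        })
    return content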

4

Dense vs Sparse Captioning

Different use cases require different levels of detail. Should you describe every second, or summarize the whole video? A prompt sketch for each mode follows the comparisons below.

Sparse Captioning

One description for the entire video

~50-100 tokens
Example output:
A chef prepares a pasta dish, starting by boiling water, then sautéing garlic, and finally plating the finished meal with fresh herbs.

Dense Captioning

Timestamped descriptions for each segment

~200-500+ tokens
Example output:
[0:00-0:15] Chef fills pot with water
[0:15-0:30] Adds salt, brings to boil
[0:30-1:00] Sautés garlic in olive oil
[1:00-1:30] Adds pasta to boiling water

Highlight Detection

Identify and describe key moments only

~50-150 tokens
Example output:
[0:45] Key moment: The chef adds a secret ingredient - truffle oil
[2:30] Key moment: Final plating with microgreens
Use Sparse When
  • - Building search indexes
  • - Quick content categorization
  • - Token budget is limited
  • - Overview is sufficient
Use Dense When
  • - Creating transcripts
  • - Accessibility descriptions
  • - Training data generation
  • - Detailed analysis needed
Use Highlights When
  • - Sports clips
  • - Meeting recordings
  • - Finding key moments
  • - Creating summaries
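The main lever between these modes is the prompt (plus how many output tokens you allow). A minimal sketch of prompt templates for the three modes; the wording is illustrative, not a tested recipe.

CAPTION_PROMPTS = {
    "sparse": (
        "Summarize this video in 2-3 sentences. "
        "Focus on the overall activity, setting, and outcome."
    ),
    "dense": (
        "Describe this video segment by segment. For each segment, output a line "
        "in the form [start-end] description, covering every notable action in "
        "chronological order."
    ),
    "highlights": (
        "Identify only the 3-5 most important moments in this video. "
        "For each, output a line in the form [timestamp] Key moment: description."
    ),
}

def captioning_prompt(mode: str) -> str:
    """Look up the prompt for a captioning mode: 'sparse', 'dense', or 'highlights'."""
    return CAPTION_PROMPTS[mode]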
5

Models and Methods

The landscape of video understanding models, from APIs to open source.

Model                  Vendor          Approach               Capacity          Type          Cost
GPT-4V with Frames     OpenAI          Multi-image            ~50-100 images    API           $$$
Gemini 1.5 Pro/Flash   Google          Native video           1 hour of video   API           $$
Video-LLaMA            DAMO Academy    Video encoder + LLM    ~32-64 frames     Open Source   Free
VideoCLIP              Meta            Contrastive            ~32 frames        Open Source   Free
Twelve Labs            Twelve Labs     Native video           Hours of video    API           $$
Qwen2-VL               Alibaba         Native video           ~64 frames        Open Source   Free
GPT-4V with Frames
API
Strengths
  • + Best reasoning
  • + Flexible prompting
  • + Handles any video type
Weaknesses
  • - No native video
  • - Frame selection is manual
  • - Expensive at scale
Gemini 1.5 Pro/Flash
API
Strengths
  • + Native video input
  • + Long context
  • + Fast inference
Weaknesses
  • - Less precise on details
  • - Availability varies
Video-LLaMA
Open Source
Strengths
  • + Open weights
  • + Audio understanding
  • + Fine-tunable
Weaknesses
  • - Lower quality than APIs
  • - High VRAM
  • - Complex setup
VideoCLIP
Open Source
Strengths
  • + Fast embeddings
  • + Good for retrieval
  • + Efficient
Weaknesses
  • - No generation
  • - Fixed output format
Best Overall
Gemini 1.5 Pro
Native video, long context, good quality
Best Reasoning
GPT-4o with Frames
Superior analysis, more control
Best Open Source
Qwen2-VL-72B
SOTA open weights, video native
6

Code Examples

Production-ready code for video captioning with different providers.

OpenAI GPT-4V with Frame Extraction
pip install openai opencv-python
Frame-based
import openai
import cv2
import base64
from typing import List

def extract_frames(video_path: str, fps: float = 1.0) -> List[str]:
    """Extract frames from video at specified FPS, return as base64 JPEGs."""
    video = cv2.VideoCapture(video_path)
    if not video.isOpened():
        raise ValueError(f"Could not open video: {video_path}")

    # Guard against missing FPS metadata and intervals below 1
    original_fps = video.get(cv2.CAP_PROP_FPS) or fps
    frame_interval = max(1, int(original_fps / fps))

    frames = []
    frame_count = 0

    while True:
        success, frame = video.read()
        if not success:
            break

        if frame_count % frame_interval == 0:
            # Resize for efficiency before JPEG-encoding
            frame = cv2.resize(frame, (512, 512))
            _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
            frames.append(base64.b64encode(buffer).decode('utf-8'))

        frame_count += 1

    video.release()
    return frames

def describe_video(video_path: str, max_frames: int = 20) -> str:
    """Generate description of video using GPT-4V with frame sampling."""
    client = openai.OpenAI()

    # Extract frames (1 FPS)
    frames = extract_frames(video_path, fps=1.0)

    # Limit frames to avoid token limits
    if len(frames) > max_frames:
        # Uniform sampling
        indices = [int(i * len(frames) / max_frames) for i in range(max_frames)]
        frames = [frames[i] for i in indices]

    # Build message with frames
    content = [{"type": "text", "text": """Analyze this video represented as frames.
Provide:
1. A brief summary of what happens
2. Key actions or events in chronological order
3. Notable objects, people, or settings"""}]

    for i, frame in enumerate(frames):
        # Label each frame so the model can reason about order
        content.append({"type": "text", "text": f"Frame {i + 1}:"})
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{frame}",
                "detail": "low"  # Use 'high' for more detail
            }
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )

    return response.choices[0].message.content

# Usage
description = describe_video("cooking_video.mp4")
print(description)
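
Gemini 1.5 with Native Video Input
pip install google-generativeai
A minimal sketch using the google-generativeai SDK's File API; method names reflect the SDK at the time of writing and may differ in newer versions.

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def describe_video_gemini(video_path: str) -> str:
    """Upload a video and ask Gemini 1.5 Pro for a timestamped description."""
    # Upload the file, then wait for server-side processing to finish
    video_file = genai.upload_file(path=video_path)
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name == "FAILED":
        raise RuntimeError("Video processing failed")

    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([
        video_file,
        "Describe this video. Include timestamps for key events.",
    ])

    # Clean up the uploaded file when done
    genai.delete_file(video_file.name)
    return response.text

# Usage
# print(describe_video_gemini("cooking_video.mp4"))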
GPT-4V Tips
  • - Use detail: low for cost savings
  • - Resize frames to 512x512
  • - Include frame numbers in prompt
  • - Max ~50-100 frames per request
Gemini Tips
  • - Flash is faster, Pro is better
  • - Upload files for long videos
  • - Request timestamps explicitly
  • - Clean up uploaded files
Twelve Labs Tips
  • - Great for search + chapters
  • - Indexing takes time upfront
  • - Audio + visual understanding
  • - Good for video libraries

Quick Reference

Frame Sampling
  • - Uniform: 1 FPS default
  • - Keyframe: scene changes
  • - Dense: 5-30 FPS
Captioning Type
  • - Sparse: overall summary
  • - Dense: timestamped
  • - Highlights: key moments
Best Models
  • - Gemini 1.5 (native)
  • - GPT-4o (reasoning)
  • - Qwen2-VL (open)
Common Pitfalls
  • - Too many frames = token limit
  • - Missing motion from sampling
  • - Ignoring audio context

Use Cases

  • Video captioning
  • Video Q&A
  • Content moderation
  • Surveillance analysis
  • Video search

Architectural Patterns

Frame Sampling + VLM

Sample frames, process with vision-language model.

Pros:
  • +Simple
  • +Leverages image VLMs
Cons:
  • -May miss temporal info
  • -Sampling matters

Video Transformers

Native video understanding with temporal modeling.

Pros:
  • +Understands motion
  • +Full temporal context
Cons:
  • -High compute
  • -Long videos hard

Hierarchical Processing

Process clips, then aggregate at video level (a sketch follows below).

Pros:
  • +Handles long videos
  • +Efficient
Cons:
  • -May lose details
  • -Complex pipeline
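A minimal sketch of this pattern, assuming ffmpeg is installed and reusing the frame-based describe_video helper from the code examples above as a stand-in clip captioner: split the video into fixed-length clips, caption each clip, then merge the clip captions with one LLM call.

import os
import subprocess
import tempfile
import openai

def split_into_clips(video_path: str, clip_seconds: int = 60) -> list:
    """Cut the video into fixed-length clips with ffmpeg; returns temp file paths."""
    out_dir = tempfile.mkdtemp()
    pattern = os.path.join(out_dir, "clip_%03d.mp4")
    # Stream-copied segments split only at keyframes, so clip lengths are approximate
    subprocess.run([
        "ffmpeg", "-i", video_path, "-c", "copy", "-map", "0",
        "-segment_time", str(clip_seconds), "-f", "segment", pattern,
    ], check=True, capture_output=True)
    return sorted(os.path.join(out_dir, f) for f in os.listdir(out_dir) if f.endswith(".mp4"))

def describe_long_video(video_path: str) -> str:
    """Caption each clip independently, then merge into one video-level summary."""
    clip_captions = []
    for i, clip in enumerate(split_into_clips(video_path)):
        clip_captions.append(f"Clip {i + 1}: {describe_video(clip)}")

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Combine these clip descriptions into one coherent video summary:\n\n"
                       + "\n".join(clip_captions),
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content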

Implementations

API Services

Gemini 1.5 Pro

Google
API

Best for long videos. 1 hour+ context.

GPT-4V (with frames)

OpenAI
API

Process video as image sequence. Good reasoning.

Twelve Labs

Twelve Labs
API

Video search and understanding API. Semantic search.

Open Source

Video-LLaMA

Apache 2.0
Open Source

Audio-visual LLM for video understanding.

InternVideo2

MIT
Open Source

Strong video foundation model. Action, caption, QA.

Quick Facts

Input
Video
Output
Text
Implementations
2 open source, 3 API
Patterns
3 approaches
