Video Understanding
Understand and describe video content. Powers video search, summarization, and analysis.
How Video to Text Works
A technical deep-dive into video captioning and description. From frame sampling to temporal understanding, and how modern VLMs turn moving images into text.
The Problem
Why is video understanding fundamentally different from image understanding?
Imagine you are shown a single photograph of someone mid-jump. You can describe their pose, the setting, perhaps guess they are exercising. Now imagine you see a 10-second video of the same scene. Suddenly you understand: they are doing a triple jump, this is a track meet, they just set a personal record.
The difference is temporal context. A video is not just a collection of images; it is images connected by time, motion, and causality. Understanding video means understanding what came before, what is happening now, and what will happen next.
- Scale: a 1-minute video at 30 FPS is 1,800 frames. Processing all of them is computationally prohibitive.
- Redundancy: most adjacent frames are nearly identical. The challenge is extracting only the frames that matter.
- Motion: actions like walking, running, and waving only make sense across multiple frames.
- Narrative: videos tell stories, with a beginning, middle, and end, cause and effect, and context that spans seconds or minutes.
Video-to-text is fundamentally about compression with preservation. We must compress hours of visual data into a few hundred tokens of text, while preserving the essential narrative, actions, and meaning. The art lies in deciding what to keep and what to discard.
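To make that budget concrete, here is a back-of-the-envelope calculation (a sketch; the ~85-tokens-per-low-detail-image figure is OpenAI's published cost for GPT-4o-class vision input and may change):

```python
# Rough token budget: all frames vs. 1 FPS sampling (illustrative numbers only).
fps_native = 30          # source frame rate
duration_s = 60          # 1-minute clip
tokens_per_frame = 85    # approx. cost of one "detail: low" image for GPT-4o (assumption)

all_frames = fps_native * duration_s   # 1,800 frames
sampled_frames = duration_s            # 1 frame per second -> 60 frames

print(f"All frames:  {all_frames} -> ~{all_frames * tokens_per_frame:,} tokens")
print(f"1 FPS:       {sampled_frames} -> ~{sampled_frames * tokens_per_frame:,} tokens")
```

Even at 1 FPS, a one-minute clip costs thousands of input tokens, which is why sampling strategy and frame count dominate the cost and quality trade-off.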
Frame Sampling
How do we select which frames to analyze? This is the first and most critical decision.
- Uniform sampling: extract frames at fixed intervals (e.g., 1 frame per second).
- Keyframe detection: extract frames when a significant visual change occurs.
- Dense sampling: extract many frames (5-30 FPS) for fine-grained analysis.
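A minimal sketch of the first two strategies using OpenCV; the function names and the diff_threshold value are illustrative, and production keyframe detectors usually rely on histogram or shot-boundary methods rather than raw pixel differences:

```python
import cv2
import numpy as np

def sample_uniform(video_path: str, every_n_seconds: float = 1.0) -> list:
    """Uniform sampling: keep one frame every `every_n_seconds`."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(round(fps * every_n_seconds)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def sample_keyframes(video_path: str, diff_threshold: float = 30.0) -> list:
    """Naive keyframe detection: keep a frame when the mean absolute pixel
    difference from the last kept frame exceeds a threshold."""
    cap = cv2.VideoCapture(video_path)
    frames, last_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last_gray is None or np.mean(cv2.absdiff(gray, last_gray)) > diff_threshold:
            frames.append(frame)
            last_gray = gray
    cap.release()
    return frames
```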
Temporal Understanding
How models learn to understand what happens across time, not just within a single frame.
Consider the phrase "the ball was caught." To understand this from video, a model must: track the ball across frames, identify the catching motion, recognize when the action completes. This is temporal reasoning - understanding how things change over time.
- Frame-by-frame (GPT-4V style): extract frames, send them as images to a VLM, and aggregate the captions.
- Native video encoder (Gemini, Twelve Labs): the model processes the video directly as a continuous input.
- Two-stage (Video-LLaMA): a video encoder creates embeddings, and a Q-Former bridges them to the LLM.
Most VLMs were trained on images, not video. When we send frames to GPT-4V, the model sees them as separate images and must infer temporal relationships from context (like frame numbers in the prompt). Native video models like Gemini process the actual video stream, preserving motion information that frame sampling destroys.
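As a sketch of that workaround, the snippet below interleaves timestamp labels with sampled frames so a frame-based VLM can recover temporal order from the prompt itself. It assumes the frames are already base64-encoded JPEGs (e.g., from the extract_frames helper in the Code Examples section below), and the one-second spacing is an assumption tied to 1 FPS sampling:

```python
from typing import Any, Dict, List

def build_timestamped_content(frames_b64: List[str],
                              seconds_per_frame: float = 1.0) -> List[Dict[str, Any]]:
    """Interleave 'Frame at t=Ns' text markers with frames for a multi-image VLM call."""
    content = [{"type": "text",
                "text": "These frames were sampled from one video, in order. "
                        "Describe what happens over time, citing timestamps."}]
    for i, frame in enumerate(frames_b64):
        content.append({"type": "text", "text": f"Frame at t={i * seconds_per_frame:.0f}s:"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame}",
                                      "detail": "low"}})
    return content
```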
Dense vs Sparse Captioning
Different use cases require different levels of detail. Should you describe every second, or summarize the whole video?
Sparse Captioning
One description for the entire video
A chef prepares a pasta dish, starting by boiling water, then sauteing garlic, and finally plating the finished meal with fresh herbs.
Dense Captioning
Timestamped descriptions for each segment
[0:00-0:15] Chef fills pot with water
[0:15-0:30] Adds salt, brings to boil
[0:30-1:00] Sautes garlic in olive oil
[1:00-1:30] Adds pasta to boiling water
Highlight Detection
Identify and describe key moments only
[0:45] Key moment: the chef adds a secret ingredient, truffle oil
[2:30] Key moment: final plating with microgreens
Sparse captioning works best for:
- Building search indexes
- Quick content categorization
- Limited token budgets
- Cases where an overview is sufficient

Dense captioning works best for:
- Creating transcripts
- Accessibility descriptions
- Training data generation
- Detailed analysis

Highlight detection works best for:
- Sports clips
- Meeting recordings
- Finding key moments
- Creating summaries
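A minimal dense-captioning loop, as a sketch: split the sampled frames into fixed-length segments, caption each segment, and emit timestamped lines. It reuses the extract_frames helper defined in the Code Examples section below; the 30-second segment length and the every-fifth-frame subsampling are illustrative assumptions.

```python
import openai
from typing import List

def dense_captions(video_path: str, segment_s: int = 30) -> List[str]:
    """Caption a video in timestamped segments using frame sampling + GPT-4o."""
    client = openai.OpenAI()
    frames = extract_frames(video_path, fps=1.0)  # one base64 frame per second
    lines = []
    for start in range(0, len(frames), segment_s):
        segment = frames[start:start + segment_s]
        content = [{"type": "text",
                    "text": "Describe this video segment in one sentence."}]
        content += [{"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{f}", "detail": "low"}}
                    for f in segment[::5]]  # a few evenly spaced frames per segment
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": content}],
            max_tokens=80,
        )
        end = min(start + segment_s, len(frames))
        lines.append(f"[{start // 60}:{start % 60:02d}-{end // 60}:{end % 60:02d}] "
                     f"{resp.choices[0].message.content.strip()}")
    return lines
```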
Models and Methods
The landscape of video understanding models, from APIs to open source.
| Model | Vendor | Approach | Capacity | Type | Cost |
|---|---|---|---|---|---|
| GPT-4V with Frames | OpenAI | Multi-image | ~50-100 images | API | $$$ |
| Gemini 1.5 Pro/Flash | Google | Native video | ~1 hour of video | API | $$ |
| Video-LLaMA | DAMO Academy | Video encoder + LLM | ~32-64 frames | Open Source | Free |
| VideoCLIP | Meta | Contrastive | ~32 frames | Open Source | Free |
| Twelve Labs | Twelve Labs | Native video | Hours of video | API | $$ |
| Qwen2-VL | Alibaba | Native video | ~64 frames | Open Source | Free |
GPT-4V with frames
- Pros: best reasoning; flexible prompting; handles any video type
- Cons: no native video input; frame selection is manual; expensive at scale

Gemini 1.5 Pro/Flash
- Pros: native video input; long context; fast inference
- Cons: less precise on details; availability varies

Video-LLaMA
- Pros: open weights; audio understanding; fine-tunable
- Cons: lower quality than the APIs; high VRAM requirements; complex setup

VideoCLIP
- Pros: fast embeddings; good for retrieval; efficient
- Cons: no text generation; fixed output format
Code Examples
Production-ready code for video captioning with different providers.
import openai
import cv2
import base64
from typing import List


def extract_frames(video_path: str, fps: float = 1.0) -> List[str]:
    """Extract frames from video at the specified FPS, returned as base64 JPEGs."""
    video = cv2.VideoCapture(video_path)
    original_fps = video.get(cv2.CAP_PROP_FPS) or 30.0
    # Guard against requesting a higher rate than the source provides
    frame_interval = max(1, int(round(original_fps / fps)))

    frames = []
    frame_count = 0
    while True:
        success, frame = video.read()
        if not success:
            break
        if frame_count % frame_interval == 0:
            # Resize for efficiency before encoding
            frame = cv2.resize(frame, (512, 512))
            _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
            frames.append(base64.b64encode(buffer).decode('utf-8'))
        frame_count += 1

    video.release()
    return frames


def describe_video(video_path: str, max_frames: int = 20) -> str:
    """Generate a description of a video using GPT-4o with frame sampling."""
    client = openai.OpenAI()

    # Extract frames at 1 FPS
    frames = extract_frames(video_path, fps=1.0)

    # Limit frames to avoid token limits (uniform subsampling)
    if len(frames) > max_frames:
        indices = [int(i * len(frames) / max_frames) for i in range(max_frames)]
        frames = [frames[i] for i in indices]

    # Build a single message: instructions first, then the frames in order
    content = [{"type": "text", "text": """Analyze this video represented as frames.
Provide:
1. A brief summary of what happens
2. Key actions or events in chronological order
3. Notable objects, people, or settings"""}]

    for frame in frames:
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{frame}",
                "detail": "low"  # use "high" for more detail (at higher cost)
            }
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1000
    )
    return response.choices[0].message.content


# Usage
description = describe_video("cooking_video.mp4")
print(description)

Tips for the GPT-4V/GPT-4o frame approach:
- Use detail: "low" for cost savings
- Resize frames to 512x512
- Include frame numbers in the prompt
- Max ~50-100 frames per request
Gemini tips:
- Flash is faster, Pro is better
- Upload files for long videos
- Request timestamps explicitly
- Clean up uploaded files when done
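A minimal sketch of that upload-then-prompt flow with the google-generativeai Python SDK; the model name, polling interval, and prompt are assumptions, so check the current SDK docs for the exact file-handling API:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the video, then poll until the file is processed and ready to use
video_file = genai.upload_file(path="cooking_video.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video_file,
    "Describe this video. Include timestamps for key events.",
])
print(response.text)

# Clean up the uploaded file when finished
genai.delete_file(video_file.name)
```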
Twelve Labs tips:
- Great for search + chapters
- Indexing takes time upfront
- Audio + visual understanding
- Good for video libraries
Quick Reference
Frame sampling:
- Uniform: 1 FPS default
- Keyframe: scene changes
- Dense: 5-30 FPS

Captioning modes:
- Sparse: overall summary
- Dense: timestamped
- Highlights: key moments

Recommended models:
- Gemini 1.5 (native)
- GPT-4o (reasoning)
- Qwen2-VL (open)

Common pitfalls:
- Too many frames = token limit
- Missing motion due to sparse sampling
- Ignoring audio context
Use Cases
- ✓ Video captioning
- ✓ Video Q&A
- ✓ Content moderation
- ✓ Surveillance analysis
- ✓ Video search
Architectural Patterns
Frame Sampling + VLM
Sample frames, process with vision-language model.
- Pros: simple; leverages existing image VLMs
- Cons: may miss temporal information; sampling strategy matters
Video Transformers
Native video understanding with temporal modeling.
- Pros: understands motion; full temporal context
- Cons: high compute; long videos are hard to handle
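For an open-weights example of this pattern, here is a sketch of native video input with Qwen2-VL through Hugging Face transformers, following the model's published usage recipe; the model ID, fps setting, and token budget are assumptions, so check the model card for current details:

```python
# pip install transformers accelerate qwen-vl-utils
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# The video is passed natively; the processor samples and encodes frames internally.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/cooking_video.mp4", "fps": 1.0},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
caption = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0]
print(caption)
```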
Hierarchical Processing
Process clips, then aggregate at video level.
- Pros: handles long videos; efficient
- Cons: may lose details; more complex pipeline
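A sketch of the aggregation step, assuming each clip has already been captioned (for example with the frame-sampling describe_video function from the Code Examples section); the prompt wording is an assumption:

```python
import openai
from typing import List

def summarize_video(clip_captions: List[str]) -> str:
    """Merge per-clip captions into one video-level summary with an LLM."""
    client = openai.OpenAI()
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(clip_captions))
    prompt = ("These captions describe consecutive clips of one video, in order.\n"
              f"{numbered}\n\n"
              "Write a single coherent summary of the full video.")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
    )
    return resp.choices[0].message.content
```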
Implementations
API Services
Gemini 1.5 Pro
Google. Best for long videos, with 1 hour+ of context.
GPT-4V (with frames)
OpenAI. Processes video as an image sequence; good reasoning.
Twelve Labs
Twelve Labs. Video search and understanding API with semantic search.
Quick Facts
- Input: Video
- Output: Text
- Implementations: 2 open source, 3 API
- Patterns: 3 approaches