Level 4: Advanced (~30 min)

Video Understanding

Analyze video content with AI. From frame sampling to temporal reasoning and action recognition.

Video as a Modality

Video is fundamentally images over time + audio. A 1-minute video at 30fps contains 1,800 frames. You can't process all of them through a VLM - you need smart sampling strategies.

The key challenges in video understanding:

Scale

Videos are huge. A 10-minute video at 30fps contains 18,000 frames. Processing each one with a VLM is prohibitively expensive.

Temporal Context

Understanding "what happened" requires seeing events unfold over time. Single frames miss the action.

Multi-modal Fusion

Audio provides crucial context. Speech, music, and sound effects all carry meaning.

Localization

"When does X happen?" requires mapping answers to timestamps, not just frame indices.

Frame Sampling Strategies

Since you can't process every frame, you need to sample intelligently. The strategy depends on your use case.

1. Uniform Sampling

Extract frames at fixed intervals (e.g., 1 FPS). Simple and predictable.

Good for: General summarization, scene understanding

2. Keyframe Detection

Extract frames where significant visual change occurs and skip redundant ones (see the sketch after this list).

Good for: Action detection, event summarization

3. Scene-based Sampling

Detect scene changes, sample one frame per scene. Captures narrative structure.

Good for: Movie analysis, content indexing

4. Audio-guided Sampling

Sample more frames during speech or important audio events.

Good for: Lecture videos, interviews, podcasts
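
As a concrete example of strategy 2, here is a minimal keyframe-detection sketch based on grayscale frame differencing. The function name and the diff_threshold value are illustrative choices you would tune per video; dedicated tools such as PySceneDetect offer more robust scene-change detection for strategy 3.

# Keyframe detection via frame differencing (illustrative sketch)
# Keeps a frame only when it differs enough from the last kept frame
import cv2

def extract_keyframes(video_path: str, diff_threshold: float = 30.0):
    """Return frames whose mean pixel difference to the previous keyframe exceeds a threshold"""
    cap = cv2.VideoCapture(video_path)
    keyframes = []
    prev_gray = None

    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is None or cv2.absdiff(gray, prev_gray).mean() > diff_threshold:
            keyframes.append(frame)
            prev_gray = gray
    cap.release()
    return keyframes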

Video Processing Pipeline

Here's a practical implementation for video understanding with frame sampling and GPT-4V analysis:

Frame Extraction

# Video understanding with frame sampling
import cv2
from openai import OpenAI
import base64

def extract_frames(video_path: str, fps: float = 1.0):
    """Extract frames at the specified sampling rate (frames per second)"""
    cap = cv2.VideoCapture(video_path)
    frames = []
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    frame_interval = max(1, int(video_fps / fps))

    frame_count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            frames.append(frame)
        frame_count += 1
    cap.release()
    return frames

def encode_frame(frame) -> str:
    """Encode frame as base64 JPEG"""
    _, buffer = cv2.imencode('.jpg', frame)
    return base64.b64encode(buffer).decode('utf-8')

Video Analysis with GPT-4V

# Analyze with GPT-4V
def analyze_video(frames: list, question: str):
    client = OpenAI()

    # Encode frames as base64
    images = [encode_frame(f) for f in frames[:10]]  # Limit frames

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                *[{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}}
                  for img in images]
            ]
        }]
    )
    return response.choices[0].message.content

# Usage
frames = extract_frames("presentation.mp4", fps=0.5)  # 1 frame every 2 seconds
summary = analyze_video(frames, "Summarize the key points in this presentation")

Performance tip: GPT-4o can handle up to ~50 images per request. For longer videos, process in chunks and aggregate results. Consider using timestamps in your prompts to maintain temporal coherence.
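
A minimal sketch of that chunk-and-aggregate pattern, reusing the extract_frames and analyze_video helpers above. The chunk size and the synthesis prompt are illustrative choices, not API requirements.

# Analyze a long video in chunks, then aggregate the per-chunk answers
# (reuses extract_frames and analyze_video from above)
def analyze_long_video(video_path: str, question: str,
                       fps: float = 0.5, frames_per_chunk: int = 10):
    client = OpenAI()
    frames = extract_frames(video_path, fps=fps)
    seconds_per_frame = 1.0 / fps

    chunk_summaries = []
    for start in range(0, len(frames), frames_per_chunk):
        chunk = frames[start:start + frames_per_chunk]
        start_ts = start * seconds_per_frame
        end_ts = (start + len(chunk)) * seconds_per_frame
        # Put the time range in the prompt so answers stay temporally grounded
        prompt = (f"These frames cover {start_ts:.0f}s to {end_ts:.0f}s of the video. "
                  f"{question}")
        chunk_summaries.append(analyze_video(chunk, prompt))

    # Synthesize one answer from the chunk-level notes
    synthesis = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Combine these chunk-level notes into one coherent answer:\n\n"
                       + "\n\n".join(chunk_summaries)
        }]
    )
    return synthesis.choices[0].message.content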

Video-Language Models

Dedicated video-language models process video natively, understanding temporal relationships without explicit frame sampling.

Current Video-Language Models

Video-LLaVA (Open Source) - Video + language understanding
Gemini 1.5 Pro (Google API) - Long-context video input
GPT-4V / GPT-4o (OpenAI API) - Multi-frame analysis
InternVideo2 (Open Source) - Video foundation model

Using Gemini for Video

import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel('gemini-1.5-pro')

# Upload video file
video_file = genai.upload_file("video.mp4")

# Wait for processing
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Analyze the video
response = model.generate_content([
    video_file,
    "What are the main topics discussed in this video?"
])
print(response.text)

Practical Use Cases

Surveillance and Security

Detect anomalies, identify objects of interest, generate alerts for specific events.

Key: Low latency, high recall, temporal localization

Content Moderation

Identify policy violations, NSFW content, dangerous activities in user-uploaded videos.

Key: High precision, explainable decisions, timestamp extraction

Video Search and Retrieval

Natural language search over video archives. "Find scenes where the CEO mentions Q3 results."

Key: Dense indexing, speech transcription, semantic matching (see the sketch after this list)

Sports Analytics

Track player movements, detect plays, generate highlight reels automatically.

Key: Object tracking, action recognition, temporal segmentation
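
To make the video search use case concrete, here is a minimal sketch that embeds Whisper transcript segments and retrieves the best-matching timestamps by cosine similarity. The embedding model name and the segment-level indexing are assumptions for illustration, not a prescribed pipeline.

# Semantic search over a video transcript (illustrative sketch)
# Assumes Whisper segments from result["segments"], each with "start", "end", "text"
import numpy as np
from openai import OpenAI

def build_index(segments: list, client: OpenAI):
    """Embed each transcript segment for semantic search"""
    texts = [seg["text"] for seg in segments]
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vectors = np.array([item.embedding for item in resp.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query: str, segments: list, vectors, client: OpenAI, top_k: int = 3):
    """Return (timestamp, text) for the top-k segments most similar to the query"""
    resp = client.embeddings.create(model="text-embedding-3-small", input=[query])
    q = np.array(resp.data[0].embedding)
    q = q / np.linalg.norm(q)
    best = np.argsort(vectors @ q)[::-1][:top_k]
    return [(segments[i]["start"], segments[i]["text"]) for i in best]

# Usage (hypothetical): search("CEO mentions Q3 results", segments, vectors, client)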

Integrating Audio

Video understanding is incomplete without audio. Combine Whisper transcription with visual frame analysis so both modalities inform the answer.

Multi-modal Video Pipeline

# Reuses extract_frames() and encode_frame() from the frame extraction section above
import subprocess

import whisper
from openai import OpenAI

def extract_audio(video_path: str, audio_path: str):
    """Extract audio from video using ffmpeg"""
    subprocess.run([
        'ffmpeg', '-i', video_path,
        '-vn', '-acodec', 'pcm_s16le',
        '-ar', '16000', '-ac', '1',
        audio_path
    ], check=True)

def transcribe_audio(audio_path: str) -> str:
    """Transcribe audio with Whisper"""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result["text"]

def full_video_analysis(video_path: str, question: str):
    """Combine visual and audio understanding"""
    # Extract frames
    frames = extract_frames(video_path, fps=1)

    # Extract and transcribe audio
    audio_path = video_path.replace('.mp4', '.wav')
    extract_audio(video_path, audio_path)
    transcript = transcribe_audio(audio_path)

    # Analyze with both modalities
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Video transcript: {transcript}\n\nQuestion: {question}"},
                *[{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(f)}"}}
                  for f in frames[:8]]
            ]
        }]
    )
    return response.choices[0].message.content

Key Takeaways

1. Video = frames + audio + time - You need smart sampling because processing every frame is impractical.

2. Sampling strategy matters - Uniform, keyframe, scene-based, or audio-guided, depending on your use case.

3. GPT-4V handles multi-frame analysis - Pass sampled frames as images; Gemini can process video files directly.

4. Combine visual + audio - Whisper for transcription, VLM for visuals, LLM for synthesis.