Video Understanding
From optical flow and hand-crafted descriptors to video transformers and multimodal foundation models — how machines learned to see time.
Why Video Is the Hardest Modality
An image is a spatial snapshot. A video is a spatiotemporal volume — width, height, and time, multiplied by three color channels, often accompanied by an audio track. A single minute of 1080p video at 30 fps contains 1,800 frames: roughly 3.7 billion pixels, and more than 11 billion raw values once you count the color channels. No model can process all of it naively.
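The arithmetic is worth spelling out; a quick back-of-envelope check:

```python
# Scale of one minute of 1080p video at 30 fps.
width, height, channels, fps, seconds = 1920, 1080, 3, 30, 60

frame_count = fps * seconds              # 1,800 frames
pixels = width * height * frame_count    # ~3.7 billion pixels
values = pixels * channels               # ~11.2 billion raw values

print(f"{frame_count:,} frames, {pixels/1e9:.2f}B pixels, {values/1e9:.2f}B values")
```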
But the real difficulty isn't scale — it's temporal reasoning. Understanding "a person picks up a cup, drinks, then sets it down" requires recognizing objects, tracking them across frames, inferring causality, and grounding all of it in a timeline. This is what separates video understanding from running an image classifier 30 times per second.
Computational Scale
A 10-minute video is 18,000 frames. Encoding every frame with a ViT-L costs roughly $3.60 of GPU time; sending every frame to GPT-4o would cost around $90 in API fees. You must sample.
Temporal Reasoning
"The dog caught the ball" vs "the ball hit the dog" — same objects, different events. Single-frame models can't distinguish them. You need motion and order.
Multi-modal Fusion
Audio carries crucial signal: speech identifies topics, laughter marks humor, a crash sound signals an accident. The visual and auditory streams must be aligned in time and fused meaningfully.
Temporal Localization
"When does the goal happen?" requires mapping answers to timestamps — not just classification, but grounding predictions in the temporal axis. This is video's unique challenge.
The Evolution of Video Understanding
Video understanding has gone through five distinct eras, each solving a fundamental limitation of the previous one. Understanding this arc explains why modern approaches work — and where they still break down.
Optical Flow
Berthold Horn and Brian Schunck formalized optical flow — the pattern of apparent motion of objects between consecutive frames. By computing pixel-level displacement vectors, they could estimate how objects moved through a scene. This became the foundational representation for motion in computer vision for the next three decades.
"The optical flow field is the distribution of apparent velocities of movement of brightness patterns in an image."
— Horn, B. & Schunck, B. (1981). Determining Optical Flow. Artificial Intelligence, 17(1-3), 185–203.
Optical flow was elegant but brittle: expensive to compute, sensitive to lighting changes, and it captured only local motion without any semantic understanding. A person waving and a tree branch swaying produced similar flow fields.
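Even so, the classical machinery is compact. Below is a toy sketch of the brightness-constancy constraint (I_x·u + I_y·v + I_t = 0), solved in pure NumPy for a single global translation; it is illustrative only, since real solvers like Horn–Schunck add a smoothness term and estimate flow per pixel:

```python
import numpy as np

def global_flow(prev: np.ndarray, nxt: np.ndarray) -> tuple[float, float]:
    """Estimate one (u, v) translation from the brightness-constancy
    equation I_x*u + I_y*v + I_t = 0, in the least-squares sense."""
    Ix = np.gradient(prev.astype(float), axis=1)   # horizontal gradient
    Iy = np.gradient(prev.astype(float), axis=0)   # vertical gradient
    It = nxt.astype(float) - prev.astype(float)    # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(u), float(v)

# A smooth pattern shifted one pixel to the right between frames
x = np.linspace(0, 4 * np.pi, 64)
prev = np.sin(x)[None, :] * np.ones((64, 1))
nxt = np.roll(prev, 1, axis=1)
u, v = global_flow(prev, nxt)   # u close to +1, v close to 0
```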
Space-Time Interest Points & HOG/HOF
Ivan Laptev extended Harris corner detection into the temporal dimension, detecting "Space-Time Interest Points" (STIPs) — locations in a video where significant spatial and temporal change co-occur. These were described using Histograms of Oriented Gradients (HOG) for appearance and Histograms of Optical Flow (HOF) for motion.
The pipeline was classic pre-deep-learning computer vision: detect keypoints, extract hand-designed descriptors, cluster them into a "bag of visual words," train an SVM. It worked surprisingly well on constrained datasets like KTH (6 action classes, static camera) but collapsed on real-world video where backgrounds varied, cameras moved, and actions overlapped.
— Laptev, I. (2005). On Space-Time Interest Points. IJCV, 64(2-3), 107–123.
Improved Dense Trajectories (iDT)
Heng Wang and Cordelia Schmid at INRIA produced the last great hand-crafted video feature. Instead of sparse keypoints, they tracked dense point trajectories across frames, describing each with HOG, HOF, and Motion Boundary Histograms (MBH). Camera motion was estimated and removed. iDT achieved 85.9% on UCF-101 and remained the strongest non-neural baseline for two years.
— Wang, H. & Schmid, C. (2013). Action Recognition with Improved Trajectories. ICCV.
Two-Stream Convolutional Networks
Karen Simonyan and Andrew Zisserman at Oxford proposed a deceptively simple idea: use two separate CNNs, one for appearance (a single RGB frame) and one for motion (a stack of optical flow fields), then fuse their predictions. This two-stream architecture was inspired by the two-pathway hypothesis of the human visual cortex — the ventral ("what") and dorsal ("where/how") streams.
The temporal stream CNN took 10 stacked optical flow frames as input, capturing short-term motion patterns. Despite its simplicity, the approach achieved 88.0% on UCF-101, surpassing iDT. It established a key principle: decomposing appearance and motion into separate processing streams works better than trying to learn both from raw pixels.
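A minimal sketch of the two-stream idea in PyTorch, with toy CNNs standing in for the paper's AlexNet-style streams (layer sizes here are illustrative, not the original architecture):

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Toy two-stream network with late fusion by averaging class scores.

    The spatial stream sees one RGB frame (3 channels); the temporal
    stream sees 10 stacked optical-flow fields (10 * 2 = 20 channels)."""

    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.spatial = self._make_stream(in_channels=3, num_classes=num_classes)
        self.temporal = self._make_stream(in_channels=20, num_classes=num_classes)

    @staticmethod
    def _make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, rgb: torch.Tensor, flow_stack: torch.Tensor) -> torch.Tensor:
        # Late fusion: average the two streams' class scores
        return (self.spatial(rgb) + self.temporal(flow_stack)) / 2

model = TwoStreamNet()
rgb = torch.randn(1, 3, 224, 224)       # one RGB frame
flow = torch.randn(1, 20, 224, 224)     # 10 stacked (dx, dy) flow fields
scores = model(rgb, flow)               # shape: (1, 101)
```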
C3D: 3D Convolutions for Video
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri at Facebook AI asked: why use pre-computed optical flow at all? Instead, apply 3D convolutions that convolve across both spatial and temporal dimensions simultaneously, learning motion features directly from raw pixel volumes. Their 3×3×3 kernels slid across 16-frame clips, learning spatiotemporal patterns end-to-end.
# C3D: 3D convolution operates on a video volume
# Input shape: (batch, channels=3, depth=16, height=112, width=112)
# Conv3D kernel: (3, 3, 3) — learns spatiotemporal features
import torch.nn as nn

conv1 = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
# Temporal dimension preserved early, compressed later
# Output after 5 conv blocks + fc: 4096-dim video feature vector
— Tran, D. et al. (2015). Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV.
C3D features became the default video representation for years — the "Word2Vec of video." Extract 4096-dim features per clip, use them for downstream tasks. But 3D convolutions were computationally expensive: C3D had 78M parameters and processed only 16-frame clips, limiting temporal context.
I3D: Inflating ImageNet into Video
Joao Carreira and Andrew Zisserman at DeepMind had an elegant insight: take a 2D CNN pre-trained on ImageNet (e.g., Inception-v1), "inflate" every 2D filter into a 3D filter by repeating it along the temporal axis and rescaling, then fine-tune on video. This transferred powerful spatial features while learning temporal patterns.
I3D achieved 98.0% on UCF-101 and introduced the Kinetics-400 dataset — 400 action classes, ~300K clips from YouTube — which became the ImageNet of video. The paper also ran the most thorough comparison of video architectures to date: LSTM encoders, 3D convnets, two-stream networks, and their inflation variants. I3D with two streams (RGB + flow) won decisively.
— Carreira, J. & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR. 10,000+ citations.
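The inflation trick itself is a few lines. A sketch in PyTorch: the 1/T rescaling means a "boring" video of T identical frames produces the same activations as the source image model, since the time slices sum back to the original filter.

```python
import torch

def inflate_conv2d_weight(w2d: torch.Tensor, time: int) -> torch.Tensor:
    """Inflate a 2D conv filter (out, in, kH, kW) into a 3D filter
    (out, in, T, kH, kW) by repeating along time and rescaling by 1/T."""
    return w2d.unsqueeze(2).repeat(1, 1, time, 1, 1) / time

w2d = torch.randn(64, 3, 7, 7)            # an ImageNet-pretrained 7x7 filter
w3d = inflate_conv2d_weight(w2d, time=7)  # a 7x7x7 spatiotemporal filter
# Summing the time slices recovers the original 2D filter exactly
```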
ViViT & TimeSformer: Attention Over Space and Time
The Vision Transformer (ViT) had proven that images could be understood as sequences of patches. Two groups independently extended this to video. ViViT (Arnab et al., Google) tokenized video into spatiotemporal "tubes" and explored four transformer variants — the most efficient used factorized self-attention: spatial attention within each frame, then temporal attention across frames. This reduced complexity from O(T²·N²) to O(T·N² + T²·N).
TimeSformer (Bertasius et al., Facebook AI) took a similar approach with "divided space-time attention" — each patch first attends to all patches at the same temporal position (spatial), then to all patches at the same spatial position across time (temporal). Both architectures showed that factorized attention could match or exceed 3D CNNs while being more scalable.
# Factorized attention (ViViT Model 3)
# Video: T frames, each frame has N spatial patches
# Instead of full attention over T*N tokens (O(T²N²)):

# Step 1: Spatial attention within each frame
for t in range(T):
    spatial_tokens[t] = self_attention(patches[t])  # O(N²) per frame

# Step 2: Temporal attention across frames
for n in range(N):
    temporal_tokens[:, n] = self_attention(spatial_tokens[:, n])  # O(T²) per patch

# Total: O(T·N² + T²·N) instead of O(T²·N²) — massive savings for long videos

— Arnab, A. et al. (2021). ViViT: A Video Vision Transformer. ICCV.
— Bertasius, G. et al. (2021). Is Space-Time Attention All You Need for Video Understanding? ICML.
VideoMAE: Self-Supervised Video Pre-training
Tong et al. applied masked autoencoding to video, masking 90–95% of spatiotemporal patches and training the transformer to reconstruct them. The key insight was that video's temporal redundancy allows extremely high masking ratios — far higher than the 75% used for images in MAE. This made self-supervised pre-training on video computationally feasible: you only process 5–10% of the tokens during training.
VideoMAE-v2 scaled this to over 1 billion parameters and set new records on Kinetics-400/600/700, Something-Something-v2, and AVA. It proved that video transformers could match or beat convolutional models even without labeled data.
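A simplified sketch of tube masking follows; the paper's sampling scheme differs in details, but the core idea is that the same spatial positions are masked in every frame, so the model can't cheat by copying a visible patch from a neighboring frame:

```python
import torch

def tube_mask(num_frames: int, num_patches: int, mask_ratio: float = 0.9):
    """VideoMAE-style tube masking (simplified): mask the same spatial
    patch positions in every frame. Returns a boolean (T, N) mask where
    True means the token is hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    masked_spatial = torch.randperm(num_patches)[:num_masked]
    mask = torch.zeros(num_frames, num_patches, dtype=torch.bool)
    mask[:, masked_spatial] = True   # same "tube" across all frames
    return mask

mask = tube_mask(num_frames=16, num_patches=196, mask_ratio=0.9)
visible = (~mask).sum().item()   # only ~10% of tokens reach the encoder
print(mask.shape, visible)       # torch.Size([16, 196]) 320
```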
InternVideo: The Video Foundation Model
Shanghai AI Lab built InternVideo by combining masked video modeling (generative) with video-language contrastive learning (discriminative) in a unified framework. InternVideo2 scaled to 6 billion parameters and achieved state-of-the-art on 60+ video benchmarks simultaneously — action recognition, temporal grounding, video retrieval, video QA.
VideoCLIP: Contrastive Video-Language Learning
Xu et al. at Meta AI trained a model to align video clips with their text descriptions using contrastive learning on 1.1M video-text pairs from HowTo100M. The key innovation was temporally overlapping positive pairs — rather than requiring exact temporal alignment (which is noisy in instructional videos), they treated any overlapping video-text pair as a soft positive. This produced a shared embedding space where you could search video with natural language.
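The contrastive core is the standard symmetric InfoNCE objective; VideoCLIP's overlapping soft positives sit on top of it. A minimal sketch of that core, with matching (video, text) pairs on the diagonal of the similarity matrix:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (video, text) pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.T / temperature   # (B, B) similarities
    targets = torch.arange(len(logits))             # diagonal = positives
    loss_v2t = F.cross_entropy(logits, targets)     # video -> text
    loss_t2v = F.cross_entropy(logits.T, targets)   # text -> video
    return (loss_v2t + loss_t2v) / 2

video_emb = torch.randn(8, 512)   # clip embeddings from a video encoder
text_emb = torch.randn(8, 512)    # caption embeddings from a text encoder
loss = clip_style_loss(video_emb, text_emb)
```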
Video-LLaVA: Visual Instruction Tuning for Video
Lin et al. unified image and video understanding in a single model by projecting both modalities into a shared feature space before feeding them to a language model backbone. Video frames were encoded with a ViT, projected through a learned MLP, and concatenated with text token embeddings. The language model (Vicuna-7B/13B) then generated free-form responses about the video content.
This was the moment video understanding became conversational. Instead of classifying into predefined action labels, you could ask open-ended questions: "What happens after the man enters the kitchen?" and get natural-language answers grounded in the video.
Frontier Models: Native Video Understanding
The current generation processes video natively in their context windows, without requiring separate video encoders or frame sampling logic on the user's side:
Gemini 2.5 Pro
Google. Processes up to 1 hour of video natively in its 1M-token context window. Samples frames internally.
GPT-4o
OpenAI. Multi-frame image input. No native video upload — requires client-side frame extraction.
Qwen2.5-VL
Alibaba. Open-weight. Processes video at dynamic resolution with temporal position embeddings.
Twelve Labs
Purpose-built video understanding API. Embedding, search, and generation over video libraries.
The sampling question hasn't disappeared
Even models that accept "native video" sample internally. Gemini 2.5 Pro extracts frames at ~1 fps from uploaded video. GPT-4o requires you to sample explicitly. Understanding frame sampling strategies remains essential — you're either choosing the strategy yourself or trusting the model's default. For production systems where cost, latency, and accuracy matter, you want control over that decision.
The throughline: 1981 → 2026
Four decades, one goal: make machines understand what happens in video, not just what appears in frames.
Frame Sampling Strategies
A 30-fps video contains 1,800 frames per minute. Most of them are redundant — adjacent frames in a static shot are nearly identical. The art of video understanding is choosing which frames to process and how many to budget. The wrong sampling strategy can miss critical events or waste 90% of your compute on duplicate information.
Uniform Sampling
Extract frames at fixed intervals (e.g., 1 fps, or every 30th frame). The simplest approach and the default for most video-language models. Gemini samples at ~1 fps internally. Uniform sampling works well when the information density is roughly constant — lectures, surveillance, dashcam footage.
Failure mode: Misses brief but critical events (a punch in a fight, a traffic light change) that fall between sample points. A 1-fps sample of a 120-fps slow-motion replay will miss 99.2% of frames.
Best for: General summarization, content understanding, meeting recordings
Keyframe / Shot-boundary Detection
Detect frames where significant visual change occurs — a scene cut, a camera pan, or a major action event. Algorithms compute inter-frame difference (pixel-level, histogram-based, or feature-level) and extract frames that exceed a threshold. This produces an adaptive sample: more frames during action, fewer during static shots.
Implementations: FFmpeg's select='gt(scene,0.3)' filter, PySceneDetect, or compute SSIM/histogram distance between consecutive frames.
Best for: Movies, TV shows, edited content, event summarization
Clustering-based Sampling
Extract all frames (or a dense uniform sample), compute lightweight embeddings (e.g., CLIP ViT-B), cluster them with K-means, and pick the frame nearest each cluster centroid. This guarantees visual diversity in your sample — you'll never get 10 near-identical frames from a static shot.
Best for: Video retrieval, content indexing, diverse thumbnail generation
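A sketch of the approach, assuming you already have per-frame embeddings and using scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_frame_indices(embeddings: np.ndarray, k: int = 8) -> list[int]:
    """Pick k visually diverse frames: cluster frame embeddings with
    K-means, then keep the frame nearest each cluster centroid.
    `embeddings` is (num_frames, dim), e.g. CLIP features."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    picks = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[dists.argmin()]))
    return sorted(picks)

# Toy data: three "shots" of 20 near-identical frames each
rng = np.random.default_rng(0)
shots = [rng.normal(loc=i * 10, scale=0.1, size=(20, 512)) for i in range(3)]
embeddings = np.concatenate(shots)
picks = diverse_frame_indices(embeddings, k=3)   # one frame per shot
```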
Audio-guided Sampling
Use the audio track to guide visual sampling. Transcribe with Whisper to get word-level timestamps, then sample frames aligned to speech onset, topic changes, or audio events (applause, music cues, sound effects). This is especially powerful for lecture videos and podcasts where the audio carries the primary information.
Best for: Lectures, interviews, webinars, conference talks, podcast video
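Once you have segment timestamps (e.g. from Whisper), mapping them to frame indices is straightforward. A sketch; the half-second offset is an illustrative choice that gives the speaker time to appear on screen:

```python
def frames_for_segments(segments: list[dict], fps: float,
                        offset_s: float = 0.5) -> list[int]:
    """Map transcript segments to frame indices: one frame shortly after
    each speech onset, deduplicated and in order. `segments` follows the
    Whisper output shape: [{"start": 0.0, "end": 4.2, "text": "..."}]."""
    indices = []
    for seg in segments:
        idx = round((seg["start"] + offset_s) * fps)
        if not indices or idx > indices[-1]:
            indices.append(idx)
    return indices

segments = [
    {"start": 0.0, "end": 4.2, "text": "Welcome to the talk."},
    {"start": 4.2, "end": 9.8, "text": "Today we cover three topics."},
    {"start": 65.0, "end": 71.5, "text": "Moving on to pricing."},
]
frame_indices = frames_for_segments(segments, fps=30.0)
print(frame_indices)  # [15, 141, 1965]
```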
Practical guideline: how many frames?
For GPT-4o, the sweet spot is 8–32 frames depending on video length. Beyond ~50 frames, costs escalate and the model's performance plateaus. For Gemini, upload the full video and let the model handle sampling — it tokenizes ~1 fps and the cost scales with duration. For local models like Qwen2.5-VL, budget depends on your GPU memory: 8 frames at 448×448 requires ~4GB VRAM.
Building a Video Understanding Pipeline
Let's build a complete video analysis pipeline: extract frames, optionally detect keyframes, transcribe audio, and analyze with a VLM. Every code block is production-usable.
Video Understanding Pipeline
Step 1: Frame Extraction with Multiple Strategies
import cv2
import numpy as np
from typing import Literal

def extract_frames(
    video_path: str,
    strategy: Literal["uniform", "keyframe", "scene"] = "uniform",
    target_frames: int = 16,
    scene_threshold: float = 30.0,
) -> list[tuple[np.ndarray, float]]:
    """Extract frames from video with timestamps.

    Returns a list of (frame, timestamp_seconds) tuples.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total_frames / fps

    if strategy == "uniform":
        # Sample evenly across the video
        indices = np.linspace(0, total_frames - 1, target_frames, dtype=int)
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                frames.append((frame, idx / fps))
        cap.release()
        return frames

    elif strategy == "keyframe":
        # Extract frames with significant visual change
        frames = []
        prev_hist = None
        frame_idx = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            # Compute color histogram
            hist = cv2.calcHist([frame], [0, 1, 2], None,
                                [8, 8, 8], [0, 256, 0, 256, 0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            if prev_hist is None:
                frames.append((frame, frame_idx / fps))
            else:
                diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CHISQR)
                if diff > scene_threshold:
                    frames.append((frame, frame_idx / fps))
            prev_hist = hist
            frame_idx += 1
        cap.release()
        # If too many, subsample uniformly
        if len(frames) > target_frames:
            indices = np.linspace(0, len(frames) - 1, target_frames, dtype=int)
            frames = [frames[i] for i in indices]
        return frames

    elif strategy == "scene":
        # Use ffmpeg scene detection (shell out)
        import subprocess, json
        cmd = [
            "ffprobe", "-v", "quiet", "-select_streams", "v",
            "-show_frames", "-show_entries", "frame=pts_time,pict_type",
            "-of", "json", video_path
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        scene_data = json.loads(result.stdout)
        # Filter I-frames (scene changes)
        i_frames = [
            float(f["pts_time"]) for f in scene_data.get("frames", [])
            if f.get("pict_type") == "I"
        ]
        # Sample from scene boundaries
        if len(i_frames) <= target_frames:
            timestamps = i_frames
        else:
            idx = np.linspace(0, len(i_frames) - 1, target_frames, dtype=int)
            timestamps = [i_frames[i] for i in idx]
        frames = []
        for ts in timestamps:
            cap.set(cv2.CAP_PROP_POS_MSEC, ts * 1000)
            ret, frame = cap.read()
            if ret:
                frames.append((frame, ts))
        cap.release()
        return frames

Step 2: Video Analysis with GPT-4o
import base64
from openai import OpenAI

def encode_frame(frame: np.ndarray) -> str:
    """Encode frame as base64 JPEG for API consumption."""
    _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
    return base64.b64encode(buffer).decode('utf-8')

def analyze_video_gpt4o(
    frames: list[tuple[np.ndarray, float]],
    question: str,
    model: str = "gpt-4o",
) -> str:
    """Analyze video frames with GPT-4o.

    Includes timestamps in the prompt so the model can
    reference specific moments in its response.
    """
    client = OpenAI()
    # Build the message content with timestamped frames
    content = [
        {
            "type": "text",
            "text": (
                f"You are analyzing a video. I'm providing {len(frames)} frames "
                f"sampled from the video, each labeled with its timestamp.\n\n"
                f"Question: {question}"
            ),
        }
    ]
    for frame, timestamp in frames:
        # Add timestamp label before each frame
        minutes = int(timestamp // 60)
        seconds = timestamp % 60
        content.append({
            "type": "text",
            "text": f"[{minutes}:{seconds:05.2f}]"
        })
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encode_frame(frame)}",
                "detail": "low"  # Use "high" for fine detail, costs 4x more
            }
        })
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        max_tokens=2000,
    )
    return response.choices[0].message.content

# Usage
frames = extract_frames("lecture.mp4", strategy="uniform", target_frames=16)
summary = analyze_video_gpt4o(
    frames,
    "Summarize the key points discussed in this presentation. "
    "Reference timestamps when the topic changes."
)

Cost awareness
Each low-detail image costs 85 tokens (~$0.0004). At 16 frames, that's ~1,360 image tokens plus your prompt — roughly $0.01 per video analysis. High-detail mode costs 4x more per frame. For batch processing thousands of videos, this adds up fast. Consider using Gemini Flash for high-volume workloads ($0.075/1M tokens) or running Qwen2.5-VL locally.
Step 3: Native Video with Gemini
import google.generativeai as genai
import time

genai.configure(api_key="YOUR_API_KEY")

def analyze_video_gemini(video_path: str, question: str) -> str:
    """Analyze video natively with Gemini.

    Gemini handles frame sampling internally at ~1 fps.
    Supports up to 1 hour of video in Gemini 2.5 Pro.
    """
    model = genai.GenerativeModel("gemini-2.5-pro")
    # Upload video file (supports mp4, mov, avi, mkv, webm)
    video_file = genai.upload_file(video_path)
    # Wait for server-side processing (transcoding + frame extraction)
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name == "FAILED":
        raise ValueError(f"Video processing failed: {video_file.state.name}")
    # Analyze — Gemini sees frames + audio natively
    response = model.generate_content(
        [video_file, question],
        generation_config=genai.GenerationConfig(
            temperature=0.2,
            max_output_tokens=4000,
        ),
    )
    return response.text

# Usage — no frame extraction needed
result = analyze_video_gemini(
    "meeting_recording.mp4",
    "List all action items discussed in this meeting with timestamps."
)

When to use GPT-4o vs Gemini for video
GPT-4o: You need precise control over which frames are analyzed. Better for short clips (<2 min) where frame selection matters. Supports structured outputs via function calling.
Gemini 2.5 Pro: Long-form video (10–60 min). Native audio understanding without separate transcription. Simpler API — upload and ask. Better for meeting recordings, lectures, tutorials.
Qwen2.5-VL (local): Privacy-sensitive use cases, high-volume batch processing, or when you need to run on-premise. 72B model matches GPT-4o on many video benchmarks.
Multi-modal Pipeline: Visual + Audio
Video understanding is incomplete without audio. A person nodding while saying "no" means something different from nodding while saying "yes." Speech content, speaker tone, background sounds, and music all carry semantic signal that pure visual analysis misses.
The standard approach: extract audio with FFmpeg, transcribe with Whisper (getting word-level timestamps), then fuse the transcript with visual analysis in a final LLM call that sees both modalities.
Complete Multi-modal Video Pipeline
import whisper
import subprocess
from dataclasses import dataclass

@dataclass
class VideoAnalysis:
    transcript: str
    visual_description: str
    combined_summary: str
    timestamps: list[dict]

def extract_audio(video_path: str) -> str:
    """Extract audio track from video using ffmpeg."""
    audio_path = video_path.rsplit('.', 1)[0] + '.wav'
    subprocess.run([
        'ffmpeg', '-y', '-i', video_path,
        '-vn', '-acodec', 'pcm_s16le',
        '-ar', '16000', '-ac', '1',
        audio_path
    ], check=True, capture_output=True)
    return audio_path

def transcribe_with_timestamps(audio_path: str) -> dict:
    """Transcribe audio with word-level timestamps."""
    model = whisper.load_model("base")
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en",
    )
    return result

def full_video_analysis(video_path: str, question: str) -> VideoAnalysis:
    """Combine visual and audio understanding.

    Pipeline:
    1. Extract frames (uniform, 1 fps)
    2. Extract + transcribe audio (Whisper)
    3. Analyze frames with GPT-4o (visual)
    4. Synthesize visual + audio with final LLM call
    """
    # Visual: extract and analyze frames
    frames = extract_frames(video_path, strategy="uniform", target_frames=16)
    visual_desc = analyze_video_gpt4o(
        frames,
        "Describe what you see in each frame. Note any text, "
        "people, actions, objects, and scene changes."
    )
    # Audio: extract and transcribe
    audio_path = extract_audio(video_path)
    transcript_result = transcribe_with_timestamps(audio_path)
    transcript = transcript_result["text"]
    # Collect segment timestamps
    timestamps = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        for seg in transcript_result.get("segments", [])
    ]
    # Synthesis: combine both modalities
    client = OpenAI()
    synthesis = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a video analyst. You receive a visual description "
                    "of video frames and an audio transcript. Synthesize both "
                    "into a coherent analysis. Reference specific timestamps."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"VISUAL DESCRIPTION:\n{visual_desc}\n\n"
                    f"AUDIO TRANSCRIPT:\n{transcript}\n\n"
                    f"QUESTION: {question}"
                ),
            },
        ],
    )
    return VideoAnalysis(
        transcript=transcript,
        visual_description=visual_desc,
        combined_summary=synthesis.choices[0].message.content,
        timestamps=timestamps,
    )

# Usage
analysis = full_video_analysis(
    "product_demo.mp4",
    "What features are demonstrated and what claims are made about each?"
)
print(analysis.combined_summary)

Video Search and Retrieval
The most impactful production use case for video understanding is semantic search over video libraries. Instead of relying on manual tags or metadata, you embed video segments into a shared vector space with text queries, enabling natural-language search: "Find the moment where the speaker discusses pricing" returns a timestamp, not a document.
Video Search with Twelve Labs
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="YOUR_API_KEY")

# Create an index (a searchable video collection)
index = client.index.create(
    name="product_demos",
    engines=[{
        "name": "marengo2.7",  # Video understanding engine
        "options": ["visual", "conversation", "text_in_video", "logo"],
    }],
)

# Upload videos to the index
task = client.task.create(
    index_id=index.id,
    file="demo_video.mp4",
)
task.wait_for_done()  # Processing: ~1 min per 1 min of video

# Natural language search — returns timestamps
results = client.search.query(
    index_id=index.id,
    query_text="moment where the speaker demonstrates the API",
    options=["visual", "conversation"],
)
for clip in results.data:
    print(f"[{clip.start:.1f}s - {clip.end:.1f}s] "
          f"score={clip.score:.3f} | {clip.video_id}")

# Generate text from a specific segment
summary = client.generate.text(
    video_id=task.video_id,
    prompt="Summarize the key features shown in this segment",
    temperature=0.2,
)

DIY Video Search with CLIP Embeddings
# Build your own video search with CLIP + vector DB
import torch
from transformers import CLIPModel, CLIPProcessor
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Initialize CLIP
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Initialize vector database
qdrant = QdrantClient(":memory:")  # or url="http://localhost:6333"
qdrant.create_collection(
    collection_name="video_frames",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

def index_video(video_path: str, video_id: str):
    """Index a video by embedding frames into Qdrant."""
    frames = extract_frames(video_path, strategy="uniform", target_frames=60)
    points = []
    for i, (frame, timestamp) in enumerate(frames):
        # Convert BGR (OpenCV) to RGB (PIL)
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        inputs = processor(images=rgb_frame, return_tensors="pt")
        with torch.no_grad():
            embedding = model.get_image_features(**inputs)
            embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        points.append(PointStruct(
            id=abs(hash(f"{video_id}_{i}")),  # Qdrant ids must be unsigned ints or UUIDs
            vector=embedding[0].numpy().tolist(),
            payload={"video_id": video_id, "timestamp": timestamp, "frame_idx": i},
        ))
    qdrant.upsert(collection_name="video_frames", points=points)

def search_video(query: str, top_k: int = 5):
    """Search indexed videos with natural language."""
    inputs = processor(text=query, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embedding = model.get_text_features(**inputs)
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    results = qdrant.search(
        collection_name="video_frames",
        query_vector=text_embedding[0].numpy().tolist(),
        limit=top_k,
    )
    return [
        {"video_id": r.payload["video_id"],
         "timestamp": r.payload["timestamp"],
         "score": r.score}
        for r in results
    ]

# Index and search
index_video("keynote.mp4", "keynote_2024")
hits = search_video("slide showing revenue growth chart")
# Returns: [{"video_id": "keynote_2024", "timestamp": 847.3, "score": 0.31}, ...]

CLIP vs dedicated video models for search
CLIP encodes individual frames — it has no temporal understanding. Searching for "person running" works because it's visible in a single frame. Searching for "person who tripped and fell" will fail because it requires understanding a sequence of frames. For temporal queries, use dedicated video embedding models like Twelve Labs Marengo, InternVideo2, or LanguageBind — these encode short clips (4–16 frames) into a single vector that captures motion and temporal relationships.
The Temporal Understanding Challenge
The hardest problems in video understanding are temporal — they require reasoning about sequences of events, not just recognizing objects in frames. This is where most current systems still struggle.
Action Recognition
Classifying what action is being performed in a video clip. Early benchmarks (UCF-101, Kinetics-400) focused on short clips (3–10 seconds) with a single action. Models now achieve 90%+ on these, partly because many actions are identifiable from a single frame ("playing guitar" is recognizable without motion). The field has moved to fine-grained temporal reasoning benchmarks.
Temporal Grounding
Given a natural language query and a long video, find the start and end timestamps of the described moment. Example: "The moment the speaker first mentions competition" in a 40-minute earnings call. This requires understanding language, scanning the full video, and localizing precisely. Current SOTA uses models like UniVTG and Moment-DETR.
Long-form Video Understanding
Understanding hour-long videos — movies, meetings, lectures — is the current frontier. The challenges compound: you need to track entities across scenes, maintain a narrative state, handle topic drift, and answer questions that require synthesizing information from multiple distant segments. Benchmarks like EgoSchema (3-min egocentric clips requiring temporal reasoning) and MovieChat (hour-long movies) expose how far even frontier models have to go.
Gemini 2.5 Pro's 1M-token context window can process ~1 hour of video, but performance degrades significantly on questions requiring reasoning about events separated by more than 10 minutes. The needle-in-a-haystack problem for video is far harder than for text.
Video Question Answering (VideoQA)
Open-ended question answering about video content. "How many times did the batter swing and miss?" requires counting events across time. "Why did the person leave the room?" requires causal reasoning. Current models handle factual questions well but struggle with counterfactual reasoning ("What would have happened if...") and questions requiring real-world knowledge not present in the video.
Production Use Cases
Video understanding has moved from research benchmarks to production systems. Here are the use cases where it delivers measurable value today.
Surveillance & Security
Anomaly detection in CCTV feeds: detect fights, unattended bags, intrusions, or vehicle accidents in real-time. Modern systems combine YOLOv8 for object detection with a video classifier for action recognition, triggering alerts only when both agree.
Content Moderation
Identify policy violations in user-uploaded video: violence, NSFW content, self-harm, dangerous challenges. Platforms like YouTube process 500+ hours of video per minute. The pipeline: fast frame-level classifier (cheap, high recall) followed by a VLM for nuanced review (expensive, high precision) on flagged content.
Video Search & Discovery
Natural language search over corporate video archives: "Find where the CEO discusses Q3 results in last month's all-hands." Used by media companies (search across footage libraries), enterprises (search meeting recordings), and education platforms (search across lecture archives).
Sports Analytics
Automatic play detection, player tracking, formation recognition, and highlight generation. Companies like Hawk-Eye (tennis/cricket), StatsBomb (football), and Second Spectrum (basketball) use video understanding to generate real-time statistics that were previously only available through manual annotation.
Medical Video Analysis
Surgical procedure recognition, endoscopy anomaly detection, physical therapy compliance monitoring. AI-assisted colonoscopy (detecting polyps in real-time) has already been shown to improve detection rates by 14% in randomized clinical trials.
Autonomous Driving
Multi-camera video feeds processed in real-time for lane detection, pedestrian prediction, traffic sign recognition, and scenario understanding. Tesla's vision-only approach processes 8 cameras simultaneously with a temporal backbone that reasons across frames.
Key Takeaways
1. Video = frames + audio + time — The temporal dimension is what separates video from a batch of images. Temporal reasoning (understanding what happened, not just what appears) remains the hardest problem.
2. Sampling strategy determines everything — Uniform for general use, keyframe detection for edited content, clustering for diversity, audio-guided for speech-heavy video. The wrong strategy wastes compute and misses events.
3. The field evolved through five eras — Hand-crafted features (HOG/HOF) → 3D CNNs (C3D, I3D) → video transformers (ViViT) → video-language models (Video-LLaVA) → frontier multimodal models (Gemini, GPT-4o). Each generation solved one limitation of the last.
4. Combine visual + audio for production — Whisper for transcription, a VLM for frame analysis, an LLM for synthesis. Or use Gemini, which handles both natively. The multi-modal pipeline catches what either modality alone misses.
5. Long-form video is the current frontier — Understanding hour-long videos with complex narratives, tracking entities across scenes, and answering questions that require reasoning over distant segments. Even Gemini 2.5 Pro degrades beyond ~10 minutes of temporal separation.
Further Reading
Foundational Papers
- Two-Stream Convolutional Networks (Simonyan & Zisserman, 2014) — Established the two-stream paradigm
- Quo Vadis, Action Recognition? (Carreira & Zisserman, 2017) — I3D and the Kinetics dataset
- ViViT: A Video Vision Transformer (Arnab et al., 2021) — Factorized space-time attention
- Video-LLaVA (Lin et al., 2023) — Unified image-video LLM
Benchmarks
- CodeSOTA Benchmarks — Track video understanding SOTA on our leaderboards
- Papers With Code: Video Understanding — Current leaderboards and papers