Video Understanding
Analyze video content with AI. From frame sampling to temporal reasoning and action recognition.
Video as a Modality
Video is fundamentally images over time + audio. A 1-minute video at 30fps contains 1,800 frames. You can't process all of them through a vision-language model (VLM) - you need smart sampling strategies.
The key challenges in video understanding:
Scale
Videos are huge. A 10-minute video is 18,000 frames. Processing each with a VLM is prohibitively expensive.
Temporal Context
Understanding "what happened" requires seeing events unfold over time. Single frames miss the action.
Multi-modal Fusion
Audio provides crucial context. Speech, music, and sound effects all carry meaning.
Localization
"When does X happen?" requires mapping answers to timestamps, not just frame indices.
Frame Sampling Strategies
Since you can't process every frame, you need to sample intelligently. The strategy depends on your use case.
Uniform Sampling
Extract frames at fixed intervals (e.g., 1 FPS). Simple and predictable.
Good for: General summarization, scene understanding
Keyframe Detection
Extract frames where significant visual change occurs. Skip redundant frames.
Good for: Action detection, event summarization
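For instance, a minimal keyframe detector can threshold the mean absolute difference between consecutive frames. This is a sketch, not a production detector; extract_keyframes and the diff_threshold value are illustrative names and numbers you would tune per video:

import cv2
import numpy as np

def extract_keyframes(video_path: str, diff_threshold: float = 30.0):
    """Keep frames that differ significantly from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    keyframes, prev_gray = [], None
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Keep the first frame, then any frame whose mean pixel change is large
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            keyframes.append(frame)
            prev_gray = gray
    cap.release()
    return keyframes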
Scene-based Sampling
Detect scene changes, sample one frame per scene. Captures narrative structure.
Good for: Movie analysis, content indexing
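Scene detection is usually easier with a dedicated library than with hand-rolled cut detection. A sketch assuming PySceneDetect's detect/ContentDetector API (verify the exact calls against the docs for your installed version):

import cv2
from scenedetect import detect, ContentDetector  # PySceneDetect >= 0.6 assumed

def sample_scene_frames(video_path: str):
    """One representative frame per detected scene (the scene's midpoint)."""
    scenes = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    cap = cv2.VideoCapture(video_path)
    frames = []
    for start, end in scenes:
        mid = (start.get_frames() + end.get_frames()) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, mid)  # seek to the scene's middle frame
        ret, frame = cap.read()
        if ret:
            frames.append(frame)
    cap.release()
    return frames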
Audio-guided Sampling
Sample more frames during speech or important audio events.
Good for: Lecture videos, interviews, podcasts
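One lightweight way to approximate audio-guided sampling is an energy heuristic over the extracted audio track: sample more densely wherever the signal is loud. The sketch below assumes a mono 16-bit WAV; active_audio_seconds, the window size, and the energy ratio are all illustrative choices:

import numpy as np
from scipy.io import wavfile

def active_audio_seconds(audio_path: str, window_s: float = 1.0,
                         energy_ratio: float = 1.5):
    """Return second offsets where RMS energy exceeds the clip's median."""
    rate, samples = wavfile.read(audio_path)  # mono 16-bit WAV assumed
    samples = samples.astype(np.float32)
    win = int(rate * window_s)
    # RMS energy per non-overlapping window
    rms = np.array([np.sqrt(np.mean(samples[i:i + win] ** 2))
                    for i in range(0, len(samples) - win, win)])
    threshold = energy_ratio * np.median(rms)
    return [i * window_s for i, e in enumerate(rms) if e > threshold]

Frames at the returned offsets can then be grabbed by seeking with cv2.CAP_PROP_POS_MSEC.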
Video Processing Pipeline
Here's a practical implementation for video understanding with frame sampling and GPT-4o analysis:
Frame Extraction
# Video understanding with frame sampling
import cv2
from openai import OpenAI
import base64
def extract_frames(video_path: str, fps: float = 1.0):
    """Extract frames at the specified sampling rate (frames per second)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    # Source frames to skip between samples; guard against a zero interval
    frame_interval = max(1, int(cap.get(cv2.CAP_PROP_FPS) / fps))
    frame_count = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            frames.append(frame)
        frame_count += 1
    cap.release()
    return frames

def encode_frame(frame) -> str:
    """Encode a frame as a base64 JPEG string."""
    _, buffer = cv2.imencode('.jpg', frame)
    return base64.b64encode(buffer).decode('utf-8')

Video Analysis with GPT-4o
# Analyze with GPT-4o
def analyze_video(frames: list, question: str):
    client = OpenAI()
    # Encode frames as base64; cap the count to stay within request limits
    images = [encode_frame(f) for f in frames[:10]]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                *[{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}}
                  for img in images]
            ]
        }]
    )
    return response.choices[0].message.content
# Usage
frames = extract_frames("presentation.mp4", fps=0.5)  # one frame every 2 seconds
summary = analyze_video(frames, "Summarize the key points in this presentation")

GPT-4o can handle up to ~50 images per request. For longer videos, process in chunks and aggregate the results. Consider including timestamps in your prompts to maintain temporal coherence.
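Here's one way that chunk-and-aggregate pattern might look. analyze_long_video is a helper introduced here (not a library function); it reuses extract_frames and analyze_video from above, tags each chunk with its approximate start time, and synthesizes the per-chunk notes in a final call:

def analyze_long_video(video_path: str, question: str,
                       fps: float = 0.5, chunk_size: int = 8) -> str:
    """Sketch: summarize a long video chunk by chunk, then synthesize."""
    frames = extract_frames(video_path, fps=fps)
    chunk_notes = []
    for i in range(0, len(frames), chunk_size):
        start_sec = int(i / fps)  # approximate start time of this chunk
        note = analyze_video(frames[i:i + chunk_size],
                             f"These frames start around {start_sec}s into the video. "
                             "Briefly describe what happens.")
        chunk_notes.append(f"[{start_sec}s] {note}")
    # Final pass: aggregate the timestamped notes into one answer
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   "Per-chunk notes from a video:\n" + "\n".join(chunk_notes) +
                   f"\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content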
Video-Language Models
Dedicated video-language models process video natively, understanding temporal relationships without explicit frame sampling.
Current Video-Language Models
Gemini 1.5 Pro is a prominent example: you upload a video file and the model reasons over its frames and audio track together, with no client-side frame extraction.
Using Gemini for Video
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel('gemini-1.5-pro')

# Upload video file
video_file = genai.upload_file("video.mp4")

# Wait for server-side processing to finish
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Analyze the video
response = model.generate_content([
    video_file,
    "What are the main topics discussed in this video?"
])
print(response.text)
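Because Gemini ingests the file natively (audio track included), you can also address the localization challenge from earlier by asking for timestamps directly in the prompt. A sketch reusing the video_file handle from above; the prompt wording is an illustrative choice:

# Same API as above; the prompt does the localization work
response = model.generate_content([
    video_file,
    "At what timestamp (MM:SS) is the first chart shown? "
    "Answer with the timestamp, then one sentence of context."
])
print(response.text)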
Practical Use Cases
Surveillance and Security
Detect anomalies, identify objects of interest, generate alerts for specific events.
Content Moderation
Identify policy violations, NSFW content, dangerous activities in user-uploaded videos.
Video Search and Retrieval
Natural language search over video archives. "Find scenes where the CEO mentions Q3 results."
Sports Analytics
Track player movements, detect plays, generate highlight reels automatically.
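To make the search-and-retrieval idea concrete, here is a rough sketch that embeds sampled frames and text queries into a shared space with a CLIP-style model. The sentence-transformers checkpoint, build_frame_index, and search_video are assumptions introduced for this example; it reuses extract_frames from the pipeline above:

import cv2
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # maps images and text into one space

def build_frame_index(video_path: str, fps: float = 0.2):
    """Embed sampled frames; return (timestamps, embedding matrix)."""
    frames = extract_frames(video_path, fps=fps)
    images = [Image.fromarray(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)) for f in frames]
    embeddings = clip.encode(images, normalize_embeddings=True)
    timestamps = [i / fps for i in range(len(frames))]  # seconds per sampled frame
    return timestamps, np.asarray(embeddings)

def search_video(query: str, timestamps, embeddings, top_k: int = 3):
    """Return the top-k (timestamp_seconds, score) matches for a text query."""
    q = clip.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity on normalized vectors
    best = np.argsort(-scores)[:top_k]
    return [(timestamps[i], float(scores[i])) for i in best]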
Integrating Audio
Video understanding is incomplete without audio. Combine Whisper for transcription with visual analysis for complete understanding.
Multi-modal Video Pipeline
import whisper
import subprocess

def extract_audio(video_path: str, audio_path: str):
    """Extract mono 16 kHz WAV audio from a video using ffmpeg."""
    subprocess.run([
        'ffmpeg', '-i', video_path,
        '-vn', '-acodec', 'pcm_s16le',
        '-ar', '16000', '-ac', '1',
        audio_path
    ], check=True)

def transcribe_audio(audio_path: str) -> str:
    """Transcribe audio with Whisper"""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result["text"]

def full_video_analysis(video_path: str, question: str):
    """Combine visual and audio understanding"""
    # Extract frames
    frames = extract_frames(video_path, fps=1)

    # Extract and transcribe audio
    audio_path = video_path.replace('.mp4', '.wav')
    extract_audio(video_path, audio_path)
    transcript = transcribe_audio(audio_path)

    # Analyze with both modalities
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Video transcript: {transcript}\n\nQuestion: {question}"},
                *[{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(f)}"}}
                  for f in frames[:8]]
            ]
        }]
    )
    return response.choices[0].message.content
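A quick usage sketch (the file name and question are illustrative):

# Ask a question that needs both the visuals and the speech
answer = full_video_analysis(
    "all_hands.mp4",
    "What did the presenter say about the roadmap, and what was shown on the slides?"
)
print(answer)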
Key Takeaways
1. Video = frames + audio + time - you need smart sampling because processing every frame is impractical.
2. Sampling strategy matters - choose uniform, keyframe, scene-based, or audio-guided depending on your use case.
3. GPT-4o handles multi-frame analysis - pass sampled frames as images. Gemini can process video files directly.
4. Combine visual + audio - Whisper for transcription, a VLM for visuals, an LLM for synthesis.