Video Understanding
Video understanding asks models to reason over temporal sequences — answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches like VideoBERT and TimeSformer processed short clips, but Gemini 1.5 Pro's million-token context (2024) enabled reasoning over hour-long videos natively, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models can describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.
Video understanding models process video input (frames + optional audio) to answer questions, generate summaries, detect events, and reason about temporal dynamics. This is one of the hardest multimodal tasks because it requires integrating spatial perception, temporal reasoning, and often audio understanding across potentially hours of content.
History
Two-Stream Networks (Simonyan & Zisserman, 2014) combine spatial (RGB) and temporal (optical flow) CNNs for video classification
I3D (Carreira & Zisserman, 2017) inflates 2D ImageNet-pretrained filters into 3D convolutions for video, establishing the Kinetics benchmark
TimeSformer and ViViT (2021) apply Vision Transformers to video, replacing 3D CNNs with factored spatiotemporal attention
InternVideo (2022) achieves SOTA across 39 video benchmarks using a unified video foundation model
VideoChat and Video-LLaMA (2023) connect video encoders to LLMs for open-ended video dialogue
Gemini 1.5 Pro (2024) processes up to 1 hour of video natively in its 1M token context, enabling long-form video QA
LLaVA-Video and Qwen2-VL (2024) demonstrate strong video understanding in open-source models
Gemini 2.0 and GPT-4o (2024–2025) handle multi-hour video; InternVideo2.5 and Qwen2.5-VL lead open-source video benchmarks
How Video Understanding Works
Frame Sampling
Videos are sampled at a reduced frame rate (typically 1-8 fps) to manage token budgets. Uniform sampling, keyframe extraction, or adaptive sampling strategies select the most informative frames.
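A minimal sketch of the first two strategies, using NumPy on raw frame arrays (function names and the pixel-difference keyframe heuristic are illustrative, not any particular model's implementation):

```python
import numpy as np

def uniform_sample(num_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Frame indices at a reduced rate, e.g. 1 fps from a 30 fps video."""
    step = native_fps / target_fps
    return [round(i * step) for i in range(int(num_frames / step))]

def keyframe_sample(frames: np.ndarray, budget: int) -> list[int]:
    """Greedy keyframe selection: keep frame 0 plus the frames with the
    largest mean pixel change from their predecessor."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    top = np.argsort(diffs)[-(budget - 1):] + 1  # diff i compares frame i+1 vs i
    return sorted({0, *top.tolist()})
```

Uniform sampling is cheap and predictable; keyframe sampling spends the same token budget on the frames most likely to carry new information, which matters for videos with long static stretches.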
Visual Encoding
Each sampled frame is encoded through a vision encoder (ViT, SigLIP). Some models use video-specific encoders with temporal attention layers; others process frames independently and rely on the LLM for temporal reasoning.
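For the frame-independent path, the per-frame encoding reduces to patch embedding followed by transformer layers. A sketch of just the patch-embedding step (a toy linear projection standing in for a full ViT/SigLIP encoder):

```python
import numpy as np

def patchify(frame: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an HxWxC frame into flattened non-overlapping patches —
    the token unit a ViT-style encoder embeds."""
    h, w, c = frame.shape
    gh, gw = h // patch, w // patch
    x = frame[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def encode_frame(frame: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Linear patch embedding only; a real encoder stacks attention layers
    on top, and a video-specific one adds temporal attention across frames."""
    return patchify(frame) @ proj
```

A 224×224 frame with 16×16 patches yields 196 visual tokens, which is why per-frame token counts in the hundreds are typical.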
Temporal Modeling
Temporal relationships between frames are modeled via temporal attention layers, 3D convolutions, or by simply concatenating frame tokens in sequence for the LLM to attend over. Token compression strategies reduce the quadratic cost of attending over thousands of visual tokens.
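One common compression strategy is to pool groups of adjacent tokens within each frame before handing them to the LLM. A minimal average-pooling sketch (illustrative, not a specific model's method):

```python
import numpy as np

def compress_frame_tokens(tokens: np.ndarray, pool: int = 4) -> np.ndarray:
    """tokens: (T, N, D) — T frames, N visual tokens each, embedding dim D.
    Average-pool groups of `pool` tokens within each frame, shrinking the
    sequence the LLM attends over by a factor of `pool`."""
    T, N, D = tokens.shape
    n = (N // pool) * pool  # drop any remainder tokens
    return tokens[:, :n].reshape(T, n // pool, pool, D).mean(axis=2)
```

Because self-attention cost is quadratic in sequence length, a 4× token reduction cuts attention compute by roughly 16×, which is what makes thousand-frame inputs tractable.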
Language-conditioned Reasoning
The LLM processes the sequence of visual tokens alongside the text query, generating answers that require temporal reasoning — what happened before/after, cause-effect relationships, and activity recognition.
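The input the LLM actually sees is an interleaved sequence. A toy sketch of one common layout, with per-frame timestamp markers so the model can ground answers in time (the marker strings are placeholders, not any model's real special tokens):

```python
def build_input_sequence(frame_tokens, timestamps, query_tokens):
    """Interleave per-frame visual tokens with timestamp markers,
    then append the text query."""
    seq = []
    for ts, toks in zip(timestamps, frame_tokens):
        seq.append(f"<frame t={ts}s>")  # placeholder timestamp marker
        seq.extend(toks)
    return seq + ["<query>"] + list(query_tokens)
```

Explicit timestamp markers are one way models learn temporal grounding — without them, the LLM only knows frame order, not wall-clock spacing between sampled frames.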
Current Landscape
Video understanding in 2025 is defined by the context window race. Gemini 2.0 Pro's ability to natively ingest hours of video has set the bar for what 'video understanding' means — not just classifying 10-second clips, but answering questions about full-length movies and surveillance feeds. The field has evolved from classification (Kinetics, ActivityNet) to open-ended video dialogue. Proprietary models dominate long-form video tasks, while open-source models (Qwen2.5-VL, InternVideo2.5) are competitive on shorter videos. The key architectural debate is whether to use video-specific encoders with temporal modules or to simply feed many frames to a powerful LLM and let it handle temporal reasoning implicitly.
Key Challenges
Token budget explosion — a 10-minute video at 1 fps with 256 tokens per frame consumes 153,600 tokens, straining context windows and compute
Temporal grounding — models struggle to pinpoint when specific events occur in a video and to reason about temporal ordering precisely
Long-form understanding — summarizing or answering questions about hour-long videos requires maintaining coherent understanding across thousands of frames
Action recognition in complex scenes — fine-grained action discrimination (e.g., 'stirring' vs. 'mixing') remains challenging in cluttered real-world video
Audio-visual integration — most video understanding models ignore the audio track, missing critical cues for event detection and scene understanding
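The token-budget arithmetic in the first challenge above can be checked directly (a hypothetical helper for back-of-envelope planning):

```python
def visual_token_count(duration_s: int, fps: float, tokens_per_frame: int) -> int:
    """Visual tokens a sampled video contributes to the context window."""
    return int(duration_s * fps) * tokens_per_frame

# 10-minute video, 1 fps, 256 tokens/frame:
budget = visual_token_count(600, 1, 256)  # 153,600 tokens
```

At the same settings an hour of video costs 921,600 tokens, which is why long-form models lean on aggressive frame sampling and token compression.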
Quick Recommendations
Best overall
Gemini 2.0 Pro
Best long-form video understanding with native support for multi-hour videos; jointly processes audio and visual streams
Best for short video QA
GPT-4o
Highest accuracy on short video benchmarks (ActivityNet-QA, MSVD-QA); strong temporal reasoning for clips under 2 minutes
Open source (large)
Qwen2.5-VL-72B
Best open-weight video understanding; handles variable-length videos with dynamic resolution and frame sampling
Open source (efficient)
InternVL2.5-8B
Competitive video understanding in an 8B model; efficient enough for real-time processing pipelines
Video search & retrieval
InternVideo2.5
Strong video-text alignment for temporal grounding and moment retrieval; best open-source video representation model
What's Next
Real-time streaming video understanding is the next milestone — models that can process live video feeds, detect events as they happen, and maintain running state over hours of continuous input. Expect tighter audio-visual fusion where models reason jointly over what they see and hear, temporal grounding that can cite specific timestamps, and integration with robotic perception for embodied AI. The benchmark frontier is moving toward EgoSchema-style long-form reasoning and full-movie comprehension tasks.
Benchmarks & SOTA
Related Tasks
Audio-Text-to-Text
Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.
Image-Text-to-Image
Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
Image-Text-to-Text
Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models far less prone to hallucinating about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Image-Text-to-Video
Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.