Video Understanding
Video understanding asks models to reason over temporal sequences — answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches like VideoBERT and TimeSformer processed short clips, but Gemini 1.5 Pro's million-token context (2024) enabled reasoning over hour-long videos natively, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models can describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.
Video understanding models process video input (frames + optional audio) to answer questions, generate summaries, detect events, and reason about temporal dynamics. This is one of the hardest multimodal tasks because it requires integrating spatial perception, temporal reasoning, and often audio understanding across potentially hours of content.
History
Two-Stream Networks (Simonyan & Zisserman, 2014) combine spatial (RGB) and temporal (optical flow) CNNs for video classification
I3D (Carreira & Zisserman, 2017) inflates 2D ImageNet-pretrained filters into 3D convolutions for video, establishing the Kinetics benchmark
TimeSformer and ViViT (2021) apply Vision Transformers to video, replacing 3D CNNs with factored spatiotemporal attention
InternVideo (2022) achieves SOTA across 39 video benchmarks using a unified video foundation model
VideoChat and Video-LLaMA (2023) connect video encoders to LLMs for open-ended video dialogue
Gemini 1.5 Pro (2024) processes up to 1 hour of video natively in its 1M token context, enabling long-form video QA
LLaVA-Video and Qwen2-VL (2024) demonstrate strong video understanding in open-source models
Gemini 2.0 and GPT-4o (2024–2025) handle multi-hour video; InternVideo2.5 and Qwen2.5-VL lead open-source video benchmarks
How Video Understanding Works
Frame Sampling
Videos are sampled at a reduced frame rate (typically 1-8 fps) to manage token budgets. Uniform sampling, keyframe extraction, or adaptive sampling strategies select the most informative frames.
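A minimal sketch of the first two strategies, using NumPy on raw frame arrays (function names and the pixel-difference keyframe heuristic are illustrative, not any particular model's implementation):

```python
import numpy as np

def uniform_sample(num_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Frame indices at a reduced rate, e.g. 1 fps from a 30 fps video."""
    step = native_fps / target_fps
    return [round(i * step) for i in range(int(num_frames / step))]

def keyframe_sample(frames: np.ndarray, budget: int) -> list[int]:
    """Greedy keyframe selection: keep frame 0 plus the frames with the
    largest mean pixel change from their predecessor."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    top = np.argsort(diffs)[-(budget - 1):] + 1  # diff i compares frame i+1 vs i
    return sorted({0, *top.tolist()})
```

Uniform sampling is cheap and predictable; keyframe sampling spends the same token budget on the frames most likely to carry new information, which matters for videos with long static stretches.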
Visual Encoding
Each sampled frame is encoded through a vision encoder (ViT, SigLIP). Some models use video-specific encoders with temporal attention layers; others process frames independently and rely on the LLM for temporal reasoning.
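For the frame-independent path, the per-frame encoding reduces to patch embedding followed by transformer layers. A sketch of just the patch-embedding step (a toy linear projection standing in for a full ViT/SigLIP encoder):

```python
import numpy as np

def patchify(frame: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an HxWxC frame into flattened non-overlapping patches —
    the token unit a ViT-style encoder embeds."""
    h, w, c = frame.shape
    gh, gw = h // patch, w // patch
    x = frame[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def encode_frame(frame: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Linear patch embedding only; a real encoder stacks attention layers
    on top, and a video-specific one adds temporal attention across frames."""
    return patchify(frame) @ proj
```

A 224×224 frame with 16×16 patches yields 196 visual tokens, which is why per-frame token counts in the hundreds are typical.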
Temporal Modeling
Temporal relationships between frames are modeled via temporal attention layers, 3D convolutions, or by simply concatenating frame tokens in sequence for the LLM to attend over. Token compression strategies reduce the quadratic cost of attending over thousands of visual tokens.
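One common compression strategy is to pool groups of adjacent tokens within each frame before handing them to the LLM. A minimal average-pooling sketch (illustrative, not a specific model's method):

```python
import numpy as np

def compress_frame_tokens(tokens: np.ndarray, pool: int = 4) -> np.ndarray:
    """tokens: (T, N, D) — T frames, N visual tokens each, embedding dim D.
    Average-pool groups of `pool` tokens within each frame, shrinking the
    sequence the LLM attends over by a factor of `pool`."""
    T, N, D = tokens.shape
    n = (N // pool) * pool  # drop any remainder tokens
    return tokens[:, :n].reshape(T, n // pool, pool, D).mean(axis=2)
```

Because self-attention cost is quadratic in sequence length, a 4× token reduction cuts attention compute by roughly 16×, which is what makes thousand-frame inputs tractable.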
Language-conditioned Reasoning
The LLM processes the sequence of visual tokens alongside the text query, generating answers that require temporal reasoning — what happened before/after, cause-effect relationships, and activity recognition.
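The input the LLM actually sees is an interleaved sequence. A toy sketch of one common layout, with per-frame timestamp markers so the model can ground answers in time (the marker strings are placeholders, not any model's real special tokens):

```python
def build_input_sequence(frame_tokens, timestamps, query_tokens):
    """Interleave per-frame visual tokens with timestamp markers,
    then append the text query."""
    seq = []
    for ts, toks in zip(timestamps, frame_tokens):
        seq.append(f"<frame t={ts}s>")  # placeholder timestamp marker
        seq.extend(toks)
    return seq + ["<query>"] + list(query_tokens)
```

Explicit timestamp markers are one way models learn temporal grounding — without them, the LLM only knows frame order, not wall-clock spacing between sampled frames.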
Current Landscape
Video understanding in 2025 is defined by the context window race. Gemini 2.0 Pro's ability to natively ingest hours of video has set the bar for what 'video understanding' means — not just classifying 10-second clips, but answering questions about full-length movies and surveillance feeds. The field has evolved from classification (Kinetics, ActivityNet) to open-ended video dialogue. Proprietary models dominate long-form video tasks, while open-source models (Qwen2.5-VL, InternVideo2.5) are competitive on shorter videos. The key architectural debate is whether to use video-specific encoders with temporal modules or to simply feed many frames to a powerful LLM and let it handle temporal reasoning implicitly.
Key Challenges
Token budget explosion — a 10-minute video at 1 fps with 256 tokens per frame consumes 153,600 tokens, straining context windows and compute
Temporal grounding — models struggle to pinpoint when specific events occur in a video and to reason about temporal ordering precisely
Long-form understanding — summarizing or answering questions about hour-long videos requires maintaining coherent understanding across thousands of frames
Action recognition in complex scenes — fine-grained action discrimination (e.g., 'stirring' vs. 'mixing') remains challenging in cluttered real-world video
Audio-visual integration — most video understanding models ignore the audio track, missing critical cues for event detection and scene understanding
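The token-budget arithmetic in the first challenge above can be checked directly (a hypothetical helper for back-of-envelope planning):

```python
def visual_token_count(duration_s: int, fps: float, tokens_per_frame: int) -> int:
    """Visual tokens a sampled video contributes to the context window."""
    return int(duration_s * fps) * tokens_per_frame

# 10-minute video, 1 fps, 256 tokens/frame:
budget = visual_token_count(600, 1, 256)  # 153,600 tokens
```

At the same settings an hour of video costs 921,600 tokens, which is why long-form models lean on aggressive frame sampling and token compression.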
Quick Recommendations
Best overall
Gemini 2.0 Pro
Best long-form video understanding with native support for multi-hour videos; jointly processes audio and visual streams
Best for short video QA
GPT-4o
Highest accuracy on short video benchmarks (ActivityNet-QA, MSVD-QA); strong temporal reasoning for clips under 2 minutes
Open source (large)
Qwen2.5-VL-72B
Best open-weight video understanding; handles variable-length videos with dynamic resolution and frame sampling
Open source (efficient)
InternVL2.5-8B
Competitive video understanding in an 8B model; efficient enough for real-time processing pipelines
Video search & retrieval
InternVideo2.5
Strong video-text alignment for temporal grounding and moment retrieval; best open-source video representation model
What's Next
Real-time streaming video understanding is the next milestone — models that can process live video feeds, detect events as they happen, and maintain running state over hours of continuous input. Expect tighter audio-visual fusion where models reason jointly over what they see and hear, temporal grounding that can cite specific timestamps, and integration with robotic perception for embodied AI. The benchmark frontier is moving toward EgoSchema-style long-form reasoning and full-movie comprehension tasks.
Benchmarks & SOTA
Related Tasks
Audio-Text-to-Text
Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.
Image-Text-to-Image
Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
Image-Text-to-Text
Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models far less prone to hallucinating about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Image-Text-to-Video
Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.