
Video Understanding

Video understanding asks models to reason over temporal sequences — answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches like VideoBERT and TimeSformer processed only short clips, but Gemini 1.5 Pro's million-token context (2024) enabled native reasoning over hour-long videos, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. Benchmarks such as Video-MME and EgoSchema push evaluation beyond simple recognition toward genuine temporal understanding.

Datasets tracked: 2 · Results: 0 · Canonical metric: accuracy
Canonical Benchmark

MVBench

Multi-task video understanding with 20 temporal reasoning tasks

Primary metric: accuracy
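Since MVBench reports plain multiple-choice accuracy, scoring reduces to comparing predicted option letters against references. The sketch below is illustrative, not MVBench's official scorer; the helper names (`extract_choice`, `score_accuracy`) are assumptions.

```python
import re

def extract_choice(text):
    """Pull the first standalone option letter (A-D) from a model's free-form answer.

    Illustrative heuristic: real harnesses often use stricter parsing or
    constrained decoding to avoid ambiguous matches.
    """
    m = re.search(r"\b([A-D])\b", text.strip().upper())
    return m.group(1) if m else None

def score_accuracy(predictions, references):
    """Fraction of examples where the predicted option matches the reference."""
    if not references:
        raise ValueError("references must be non-empty")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

For example, `score_accuracy(["A", "C", "B"], ["A", "B", "B"])` yields 2/3, and `extract_choice("The answer is (B).")` returns `"B"`.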

Top 10

Leading models on MVBench.

No results have been recorded yet.

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Multimodal.

Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.