Video Understanding
Video understanding asks models to reason over temporal sequences: answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches such as VideoBERT and TimeSformer processed short clips; Gemini 1.5 Pro's million-token context (2024) enabled native reasoning over hour-long videos, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models describe individual frames well but struggle to track causal chains, count repetitions, or preserve temporal ordering across long sequences. Benchmarks such as Video-MME and EgoSchema push evaluation beyond simple recognition toward genuine temporal understanding.
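One concrete face of the scale problem: an hour of footage is far more frames than any model ingests directly, so most video-language pipelines subsample frames before inference, which is exactly where fine-grained temporal information (repetition counts, event order) gets lost. A minimal sketch of uniform temporal sampling, assuming a hypothetical `sample_frame_indices` helper rather than any specific model's API:

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread uniformly across a video.

    Takes the midpoint of each of num_samples equal-length segments,
    a common strategy before feeding frames to a vision-language model.
    """
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# An hour of 30 fps video is 108,000 frames; keeping 32 of them
# discards over 99.97% of the frames -- and most temporal detail.
indices = sample_frame_indices(108_000, 32)
```

Denser sampling or learned frame selection trades compute for temporal fidelity, which is precisely the trade-off long-video benchmarks probe.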
MVBench
Multi-task video understanding with 20 temporal reasoning tasks
Top 10
Leading models on MVBench.
No results yet. Be the first to contribute.
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Multimodal.
Looking to run a model? Hugging Face hosts inference for this task type.