
Video Understanding

Video understanding asks models to reason over temporal sequences: answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches such as VideoBERT and TimeSformer processed short clips, but Gemini 1.5 Pro's million-token context (2024) enabled native reasoning over hour-long videos, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models describe individual frames well but struggle to track causal chains, count repetitions, or maintain temporal ordering across long sequences. Benchmarks such as Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.
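Long-context models aside, most video-language systems still reduce footage to a sparse set of frames before reasoning over it. Below is a minimal sketch of that preprocessing step using OpenCV; the function name, the default of 16 frames, and the example file path are illustrative choices, not taken from any particular model.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, num_frames: int = 16):
    """Uniformly sample frames from a video file.

    Mirrors the common preprocessing step for video-language models:
    minutes or hours of footage are reduced to a sparse, evenly
    spaced set of frames before being passed to the model.
    """
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        raise ValueError(f"could not read frames from {video_path}")

    # Evenly spaced frame indices across the whole clip.
    step = (total - 1) / max(num_frames - 1, 1)
    indices = [round(i * step) for i in range(num_frames)]

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # BGR array, shape (H, W, 3)
    cap.release()
    return frames

# Hypothetical usage: 16 frames from an hour-long recording.
# frames = sample_frames("lecture.mp4")
```

Sparse sampling is also part of why counting repetitions and temporal ordering are hard: events that fall between sampled frames are simply invisible to the model.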

Datasets tracked: 2 · Results: 0 · Canonical metric: accuracy

Canonical Benchmark

MVBench

Multi-task video understanding with 20 temporal reasoning tasks.

Primary metric: accuracy
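Since MVBench questions are multiple-choice, the headline number is plain accuracy, typically computed per task and then averaged across the 20 tasks. A minimal scoring sketch follows; the record key names and the macro-averaging choice are illustrative assumptions, not the benchmark's official harness.

```python
from collections import defaultdict

def mvbench_accuracy(results):
    """Per-task and overall accuracy for MVBench-style
    multiple-choice results.

    `results` is a list of dicts with illustrative keys:
      {"task": "Action Sequence", "pred": "B", "answer": "B"}
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["pred"] == r["answer"])

    per_task = {t: correct[t] / total[t] for t in total}
    # Macro average: each task weighs equally, regardless of
    # how many questions it contains.
    overall = sum(per_task.values()) / len(per_task)
    return per_task, overall

# Hypothetical usage:
# per_task, overall = mvbench_accuracy(predictions)
# print(f"MVBench average accuracy: {overall:.1%}")
```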

Top 10

Leading models on MVBench.

No results yet.


All datasets

2 datasets tracked for this task.

