
Video Understanding

Video understanding asks models to reason over temporal sequences — answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches like VideoBERT and TimeSformer processed only short clips, but Gemini 1.5 Pro's million-token context (2024) enabled native reasoning over hour-long videos, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. Benchmarks such as Video-MME and EgoSchema push evaluation beyond simple recognition toward genuine temporal understanding.

Datasets tracked: 2 · Results: 0 · Canonical metric: accuracy
Canonical Benchmark

MVBench

Multi-task video understanding with 20 temporal reasoning tasks

Primary metric: accuracy
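Since MVBench reports plain multiple-choice accuracy, scoring reduces to comparing predicted option letters against references. The sketch below is illustrative, not MVBench's official scorer; the helper names (`extract_choice`, `score_accuracy`) are assumptions.

```python
import re

def extract_choice(text):
    """Pull the first standalone option letter (A-D) from a model's free-form answer.

    Illustrative heuristic: real harnesses often use stricter parsing or
    constrained decoding to avoid ambiguous matches.
    """
    m = re.search(r"\b([A-D])\b", text.strip().upper())
    return m.group(1) if m else None

def score_accuracy(predictions, references):
    """Fraction of examples where the predicted option matches the reference."""
    if not references:
        raise ValueError("references must be non-empty")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

For example, `score_accuracy(["A", "C", "B"], ["A", "B", "B"])` yields 2/3, and `extract_choice("The answer is (B).")` returns `"B"`.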

Top 10

Leading models on MVBench.

No results have been recorded yet.

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Multimodal.

Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.