Video-Holmes is a benchmark for evaluating complex multimodal (video + audio + text) reasoning in multimodal large language models (MLLMs). It was created from manually annotated suspense short films and emphasizes tasks that require actively locating, integrating, and connecting multiple visual/audio clues dispersed across different video segments, rather than answering from single-shot or explicitly grounded cues.

Key facts from the authors' release and dataset page:

- Full dataset: 1,837 questions sourced from 270 manually annotated suspense short films (videos typically 1–5 minutes long).
- Task coverage: seven purpose-designed reasoning tasks, abbreviated by the authors as SR (Social Reasoning), IMC (Intention and Motive Chaining), TCI (Temporal Causal Inference), TA (Timeline Analysis), MHR (Multimodal Hint Reasoning), PAR (Physical Anomaly Reasoning), and CTI (Core Theme Inference).
- Data and tooling: the release packages videos (with audio), questions, and evaluation code; the authors also reported a training split of 233 videos and 1,551 questions.
- Motivation and findings: the benchmark highlights a substantial performance gap between current MLLMs and human-like complex reasoning, and is intended as a "Holmes-test" to stimulate better multimodal reasoning and reasoning-process analysis.

Sources: GitHub repository and homepage (TencentARC/Video-Holmes), the Hugging Face dataset page for TencentARC/Video-Holmes, and the corresponding arXiv paper (arXiv:2505.21374).
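Since the dataset is distributed through Hugging Face, one plausible starting point is to pull the repository and inspect its contents locally before touching the evaluation code. A minimal sketch using the standard `huggingface_hub` client; note that the internal file layout of the repo (where questions and videos live) is not confirmed here and may differ from what you find:

```python
# Minimal sketch: fetch the Video-Holmes dataset repo and list its files.
# Assumes the standard huggingface_hub client; the layout inside the repo
# is not documented here, so we enumerate it rather than guess a schema.
from pathlib import Path

from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) the full dataset repository.
local_dir = snapshot_download(
    repo_id="TencentARC/Video-Holmes",
    repo_type="dataset",
)

# List what was actually downloaded.
for path in sorted(Path(local_dir).rglob("*")):
    if path.is_file():
        print(path.relative_to(local_dir))
```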
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
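For reference, a reproduction script mostly needs to be deterministic and self-describing. The skeleton below is purely illustrative: `load_model`, `answer`, the field names, and `questions.json` are hypothetical placeholders, not part of the Video-Holmes release, and a real submission should score answers with the authors' published evaluation code:

```python
# Illustrative skeleton of a reproduction script. Every identifier here
# (load_model, answer, questions.json, the JSON field names) is a
# hypothetical placeholder; use the authors' released evaluation code
# for the actual scoring.
import json
from pathlib import Path

from my_model import load_model  # hypothetical checkpoint loader

model = load_model("checkpoint.pt")

questions = json.loads(Path("questions.json").read_text())
correct = 0
for q in questions:
    # answer() is assumed to take a video path plus the question text and
    # candidate options, and to return one of the option labels.
    pred = model.answer(q["video"], q["question"], q["options"])
    correct += pred == q["answer"]

print(f"accuracy: {correct / len(questions):.4f}")
```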