Video-Holmes is a benchmark for evaluating complex multimodal (video + audio + text) reasoning in multimodal large language models (MLLMs). It was created from manually annotated suspense short films and emphasizes tasks that require actively locating, integrating, and connecting multiple visual/audio clues dispersed across different video segments, rather than answering from single-shot or explicitly grounded cues.

Key facts from the authors' release and dataset page:

- Full dataset: 1,837 questions sourced from 270 manually annotated suspense short films (videos typically 1–5 minutes long).
- Task coverage: seven purpose-designed reasoning tasks, abbreviated by the authors as SR (Social Reasoning), IMC (Intention and Motive Chaining), TCI (Temporal Causal Inference), TA (Timeline Analysis), MHR (Multimodal Hint Reasoning), PAR (Physical Anomaly Reasoning), and CTI (Core Theme Inference).
- Data and tooling: the release packages videos (with audio), questions, and evaluation code; the authors also reported a training split of 233 videos and 1,551 questions.
- Motivation and findings: the benchmark highlights a substantial performance gap between current MLLMs and human-like complex reasoning, and is intended as a "Holmes-test" to stimulate better multimodal reasoning and reasoning-process analysis.

Sources: GitHub repository and homepage (TencentARC/Video-Holmes), the Hugging Face dataset page for TencentARC/Video-Holmes, and the corresponding arXiv paper (arXiv:2505.21374).
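Since the dataset is distributed through Hugging Face, one plausible starting point is to pull the repository and inspect its contents locally before touching the evaluation code. A minimal sketch using the standard `huggingface_hub` client; note that the internal file layout of the repo (where questions and videos live) is not confirmed here and may differ from what you find:

```python
# Minimal sketch: fetch the Video-Holmes dataset repo and list its files.
# Assumes the standard huggingface_hub client; the layout inside the repo
# is not documented here, so we enumerate it rather than guess a schema.
from pathlib import Path

from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) the full dataset repository.
local_dir = snapshot_download(
    repo_id="TencentARC/Video-Holmes",
    repo_type="dataset",
)

# List what was actually downloaded.
for path in sorted(Path(local_dir).rglob("*")):
    if path.is_file():
        print(path.relative_to(local_dir))
```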
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.
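For reference, a reproduction script mostly needs to be deterministic and self-describing. The skeleton below is purely illustrative: `load_model`, `answer`, the field names, and `questions.json` are hypothetical placeholders, not part of the Video-Holmes release, and a real submission should score answers with the authors' published evaluation code:

```python
# Illustrative skeleton of a reproduction script. Every identifier here
# (load_model, answer, questions.json, the JSON field names) is a
# hypothetical placeholder; use the authors' released evaluation code
# for the actual scoring.
import json
from pathlib import Path

from my_model import load_model  # hypothetical checkpoint loader

model = load_model("checkpoint.pt")

questions = json.loads(Path("questions.json").read_text())
correct = 0
for q in questions:
    # answer() is assumed to take a video path plus the question text and
    # candidate options, and to return one of the option labels.
    pred = model.answer(q["video"], q["question"], q["options"])
    correct += pred == q["answer"]

print(f"accuracy: {correct / len(questions):.4f}")
```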