General

Video-Language Models

Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using special encoders to convert video data into a format that a standard text-based large language model (LLM) can process, enabling tasks like video analysis, content generation, and question answering about video content.
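At a high level, the pipeline described above can be sketched in a few lines: frames are encoded into feature vectors, projected into the LLM's embedding space, and prepended to the embedded text prompt as "visual tokens". The dimensions and random features below are purely illustrative stand-ins, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only)
n_frames, enc_dim, llm_dim = 8, 512, 1024

# 1) A vision encoder turns each sampled frame into a feature vector.
#    Random features stand in for a real encoder's output here.
frame_features = rng.standard_normal((n_frames, enc_dim))

# 2) A learned projection maps visual features into the LLM's embedding
#    space so they can be consumed as "visual tokens".
projection = rng.standard_normal((enc_dim, llm_dim)) / np.sqrt(enc_dim)
visual_tokens = frame_features @ projection            # (n_frames, llm_dim)

# 3) Visual tokens are prepended to the embedded text prompt, and the
#    combined sequence is fed to a standard decoder-only LLM.
text_tokens = rng.standard_normal((5, llm_dim))        # e.g. a 5-token question
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)

print(llm_input.shape)  # (13, 1024): 8 visual tokens + 5 text tokens
```

Real systems differ mainly in the choice of encoder, the projector (a linear layer or small MLP), and how many tokens per frame survive pooling, but the overall shape of the pipeline is the same.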


Video-Language Models is a key task in the General category. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.

Benchmarks & SOTA

PLM-VideoBench

PLM-VideoBench (PerceptionLM Video Benchmark)

4 results

PLM-VideoBench is a human-annotated video evaluation suite introduced in the PerceptionLM paper (arXiv:2504.13180). It is designed to test detailed video understanding and reasoning about “what”, “where”, “when” and “how” in video content.

The benchmark contains multiple task-specific subsets: FGQA (fine-grained multiple-choice QA), SGQA (smart-glasses open-ended QA), RCap (video region captioning), RTLoc (region temporal localization), and RDCap (region dense video captioning). The PerceptionLM paper states the full PLM release includes 2.8M human-labeled instances across video QA and spatio-temporal captioning, and reports test-set sizes of FGQA ~4.3K, SGQA ~665, RCap ~10.06K, RTLoc ~7.91K, and RDCap ~2.62K.

Evaluation metrics used in the paper include MBAcc for FGQA, LLM-judge accuracy for SGQA and RCap, SODA for RDCap, and mean Recall@1 (averaged over IoU thresholds) for RTLoc.

The Hugging Face dataset page (facebook/PLM-VideoBench) provides downloadable parquet subsets and metadata; the HF page lists subset row counts (for example: fgqa ~11k rows, rcap ~14.7k rows, rdcap ~5.17k rows, rtloc ~12.5k rows, sgqa 665 rows), which reflect the distributed dataset files on the hub. License: CC BY 4.0. Modalities: video + text (QA/captions/temporal spans).
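As an illustration of the RTLoc metric, mean Recall@1 over IoU thresholds can be sketched as below. The threshold set (0.3/0.5/0.7/0.9) is an assumption for this sketch, not necessarily the exact set used in the paper.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_recall_at_1(top1_preds, ground_truths, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Recall@1 at each IoU threshold, averaged over thresholds.

    top1_preds[i] is the model's highest-ranked (start, end) segment for
    query i; ground_truths[i] is the annotated segment for that query.
    """
    recalls = []
    for t in thresholds:
        hits = sum(temporal_iou(p, g) >= t
                   for p, g in zip(top1_preds, ground_truths))
        recalls.append(hits / len(ground_truths))
    return sum(recalls) / len(recalls)

# One perfect localization and one poor one (IoU = 0.25):
print(mean_recall_at_1([(0, 10), (20, 30)], [(0, 10), (25, 40)]))  # 0.5
```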

State of the Art

PLM (8B): 67.7 MBAcc

Video-MMMU

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

0 results

Video-MMMU is a multi-modal, multi-disciplinary benchmark designed to assess the ability of Large Multimodal Models (LMMs) to acquire and utilize knowledge from videos. It features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through question-answer pairs aligned to three stages: Perception, Comprehension, and Adaptation.

No results tracked yet

MMVU

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

0 results

A comprehensive benchmark for evaluating expert-level multi-discipline video understanding capabilities. MMVU provides 3,000 expert-annotated QA examples spanning 1,529 specialized-domain videos across 27 subjects in four key disciplines (Science, Healthcare, Humanities & Social Sciences, and Engineering). Each example comes with expert-annotated reasoning rationales and relevant domain knowledge, enabling researchers to assess not just answer correctness but also reasoning quality.

No results tracked yet

MMWorld

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

0 results

MMWorld is a comprehensive benchmark for multi-discipline, multi-faceted world-model evaluation in videos. It provides a curated collection of videos across multiple disciplines, with questions that test various aspects of video understanding, including visual perception, domain knowledge, and reasoning capabilities. The dataset pairs videos from different domains with structured questions and answers to evaluate Large Multimodal Models on their ability to understand and reason about video content.

No results tracked yet

CG-Bench

CG-Bench: A Clue-grounded Question Answering Benchmark for Long Video Understanding

0 results

CG-Bench is a clue-grounded question-answering benchmark for evaluating Large Multimodal Models (LMMs) on long videos. Each question is annotated with the clue intervals needed to answer it, so evaluation can check not only whether a model's answer is correct but whether that answer is grounded in the right parts of the video. This design penalizes models that reach correct answers by guessing or exploiting priors rather than genuinely understanding long video content.

No results tracked yet

EgoLife

EgoLife: Towards Egocentric Life Assistant

0 results

EgoLife is an ambitious egocentric AI project capturing multimodal daily activities of six participants over a week. Using Meta Aria glasses, synchronized third-person cameras, and mmWave sensors, it provides a rich dataset for long-term video understanding. The project enables AI assistants—powered by EgoGPT and EgoRAG—to support memory, habit tracking, event recall, and task management, advancing real-world egocentric AI applications. The dataset includes EgoIT-99K for instruction tuning and comprehensive egocentric video understanding.

No results tracked yet

CinePile

CinePile: A Long Video Question Answering Dataset and Benchmark

0 results

CinePile is a long-form video understanding dataset created using advanced large language models (LLMs) in a human-in-the-loop pipeline that leverages existing human-generated raw data. The dataset comprises multiple-choice questions across 86 diverse question templates, such as Emotional Transition and Object Description, generated automatically using Gemini. These templates are categorized into five high-level categories: Character and Relationship Dynamics (CRD), Narrative and Plot Analysis (NPA), Thematic Exploration (TE), Setting and Technical Analysis (STA), and Temporal (TEMP). The dataset contains 298,888 training points and 4,940 test-set points from 9,396 movie clips.

No results tracked yet

EgoSchema

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

0 results

EgoSchema is a very long-form video question-answering dataset and benchmark for evaluating the long-video understanding capabilities of modern vision-language systems. Derived from Ego4D, EgoSchema consists of over 5,000 human-curated multiple-choice question-answer pairs spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip.

No results tracked yet

VideoHolmes

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

0 results

Video-Holmes is a benchmark for evaluating complex multimodal (video+audio+text) reasoning of multimodal large language models (MLLMs). It was created from manually annotated suspense short films and emphasizes tasks that require actively locating, integrating, and connecting multiple visual/audio clues dispersed across different video segments, rather than answering from single-shot or explicitly grounded cues. Key facts from the authors' release and dataset page:

- Full dataset: 1,837 questions sourced from 270 manually annotated suspense short films (videos typically 1–5 minutes long).
- Task coverage: seven purpose-designed reasoning tasks (SR: Social Reasoning; IMC: Intention and Motive Chaining; TCI: Temporal Causal Inference; TA: Timeline Analysis; MHR: Multimodal Hint Reasoning; PAR: Physical Anomaly Reasoning; CTI: Core Theme Inference).
- Data and tooling: the authors package videos (and audio), questions, and evaluation code; the release includes a training split of 233 videos and 1,551 questions.
- Motivation and findings: the benchmark highlights a substantial performance gap between current MLLMs and human-like complex reasoning, and is intended as a "Holmes test" to stimulate better multimodal reasoning and process analysis.

Sources: dataset GitHub & homepage (TencentARC/Video-Holmes), Hugging Face dataset page for TencentARC/Video-Holmes, and the corresponding arXiv paper (arXiv:2505.21374).

No results tracked yet

TemporalBench (MBA-short QA)

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

0 results

TemporalBench is a multimodal video benchmark for fine-grained temporal understanding and reasoning. Introduced in the paper "TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models" (arXiv:2410.10818), the benchmark evaluates video–language models on a set of temporally focused tasks (e.g., short question answering, multi-binary temporal checks, event ordering, frequency/amplitude reasoning). The dataset provides evaluation splits and task-specific subsets; the subset referenced as "MBA-short QA" is a short question-answering subset where performance is reported as multi-binary short-QA accuracy, i.e., an item counts as correct only when all of its binary sub-questions are answered correctly. Project page, code, and dataset resources are available from the authors, along with a hosted dataset entry on Hugging Face.
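A minimal sketch of multi-binary accuracy, assuming the common definition in which a grouped question is credited only when every one of its binary sub-questions is answered correctly:

```python
def multi_binary_accuracy(grouped_results):
    """grouped_results: one list of booleans per question, holding the
    correctness of each of its binary (yes/no) sub-questions.  A question
    scores 1 only when every sub-question is answered correctly, which is
    stricter than averaging over sub-questions individually."""
    return sum(all(group) for group in grouped_results) / len(grouped_results)

# Three questions; the second fails one of its binary sub-questions:
results = [[True, True], [True, False], [True, True, True]]
print(multi_binary_accuracy(results))  # 2 of 3 questions fully correct
```

This scoring makes single-frame or prior-based guessing much less rewarding, since one wrong binary variant zeroes out the whole question.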

No results tracked yet

MVP

Minimal Video Pairs (MVP)

0 results

Minimal Video Pairs (MVP) is a shortcut-aware Video Question Answering (Video-QA) benchmark designed to evaluate the spatio-temporal and intuitive-physics understanding of video-language models. The benchmark is constructed from minimally different video pairs: the videos in each pair differ only in small ways yet have opposite correct answers to the same question, a design that reduces reliance on superficial visual or textual shortcuts. The dataset contains multiple-choice QA examples (reported as ~55K in the paper) curated from nine video sources spanning egocentric/first-person and third-person domains. It is organized into thematic subsets (e.g., human_object_interactions, intuitive_physics, robot_object_interactions, temporal_reasoning) and provides scripts to download the underlying videos (which are not hosted directly on Hugging Face for legal reasons). The primary evaluation metric is paired accuracy: a pair counts as correct only when both of its videos are answered correctly.
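Paired accuracy over minimal video pairs can be sketched as follows (a pair is credited only when both of its videos are answered correctly):

```python
def paired_accuracy(pair_results):
    """pair_results: one (bool, bool) per minimal pair -- correctness on
    each of the two near-identical videos.  A pair scores 1 only when
    both videos are answered correctly; a model that ignores the visual
    difference and gives the same answer to both can get at most one
    right per pair, so shortcut strategies score zero on that pair."""
    return sum(a and b for a, b in pair_results) / len(pair_results)

print(paired_accuracy([(True, True), (True, False), (False, False), (True, True)]))  # 0.5
```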

No results tracked yet

TOMATO

TOMATO (Temporal Reasoning Multimodal Evaluation)

0 results

TOMATO (Temporal Reasoning Multimodal Evaluation) is a video question-answer benchmark designed to rigorously evaluate visual temporal reasoning in multimodal (video+language) foundation models. The benchmark was constructed according to three principles proposed by the authors — Multi-Frame Gain, Frame Order Sensitivity, and Frame Information Disparity — to ensure questions require reasoning across multiple frames and correct temporal ordering. TOMATO contains diverse scenarios (self-recorded human-centric interactions, gesture/interactive scenarios, and simulated scenes) and is organized into six task-types. The paper reports 1,484 questions applied to 1,417 videos and uses accuracy as the primary evaluation metric. The benchmark highlights a substantial gap between human and current model performance and is released with code and data resources in the authors' repository.

No results tracked yet

TempCompass

TempCompass

0 results

TempCompass is a temporal-understanding video QA benchmark designed to evaluate the temporal perception abilities of Video LLMs. The benchmark targets fine-grained temporal aspects (e.g., speed and direction) and uses multiple task formats to avoid reliance on single-frame cues or language priors. The authors collect “conflicting” video pairs that share the same static content but differ in a specific temporal aspect, and they use a human-annotation plus LLM-based instruction-generation pipeline to produce diverse task instructions. The public Hugging Face release provides four subsets / task formats, each distributed as a single "test" split: multi-choice (~1.58k examples), yes/no (~2.45k examples), captioning (~2.0k examples), and caption_matching (~1.5k examples). TempCompass was introduced in the paper “TempCompass: Do Video LLMs Really Understand Videos?” and is intended to benchmark nuanced temporal understanding.

No results tracked yet

Video-MME

0 results

Video-MME is a comprehensive evaluation benchmark for multi-modal large language models (MLLMs) in video analysis. It evaluates MLLMs on video understanding tasks using 900 newly collected and human-annotated videos, including those with subtitles and audio. The dataset covers a full spectrum of video lengths, various video types across 6 key domains and 30 sub-class video types, and integrates multi-modal inputs like subtitles and audio to assess all-round MLLM capabilities.

No results tracked yet

LVBench

0 results

LVBench is a benchmark for video-language models, specifically designed for extreme long video understanding. It consists of hour-scale videos paired with human-annotated multiple-choice questions that probe abilities such as temporal grounding, summarization, reasoning, and key information retrieval, which current models struggle to sustain over such long durations.

No results tracked yet

MLVU

0 results

MLVU is a multi-task benchmark designed for long video understanding, consisting of 3,102 questions across 9 categories. It is divided into a dev set (2,593 questions) and a test set (509 questions). The code and dataset can be accessed from <https://github.com/JUNJIE99/MLVU>.

No results tracked yet

MVBench

0 results

MVBench is a dataset for video language models that covers a wide range of temporal tasks, emphasizing temporally-sensitive videos. It facilitates systematic generation of video tasks requiring various temporal abilities, from perception to cognition. MVBench efficiently creates multiple-choice QA for task evaluation by automatically transforming public video annotations, ensuring fairness through ground-truth video annotations and avoiding biased LLM scoring.

No results tracked yet

Video-MMLU

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

0 results

Video-MMLU is a benchmark designed to rigorously evaluate how well Large Multimodal Models perform on Massive Multi-discipline Lecture Understanding. It specifically focuses on testing whether models can truly understand and reason about knowledge-intensive lecture videos – like those demonstrating theorems or solving problems in fields such as math, physics, and chemistry, including their dynamic formulas and animations – requiring them to integrate visual and temporal information and grasp the reasoning behind them, much like a human student would.

No results tracked yet

Perception Test

0 results

The Perception Test is a diagnostic benchmark for multimodal video models, particularly visual language models (VLMs), introduced by DeepMind. It comprises purposefully scripted real-world videos designed to probe core perception and reasoning skills (memory, abstraction, physics, and semantics) across visual, audio, and text modalities, with task types including multiple-choice video QA, grounded video QA, and object and point tracking.

No results tracked yet
