Video-Language Models
Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using special encoders to convert video data into a format that a standard text-based large language model (LLM) can process, enabling tasks like video analysis, content generation, and question answering about video content.
Video-language understanding is a key task in multimodal AI. Below you will find the standard benchmarks used to evaluate these models, along with current state-of-the-art results.
Benchmarks & SOTA
PLM-VideoBench
PLM-VideoBench (PerceptionLM Video Benchmark)
PLM-VideoBench is a human-annotated video evaluation suite introduced in the PerceptionLM paper (arXiv:2504.13180). It is designed to test detailed video understanding and reasoning about “what”, “where”, “when” and “how” in video content. The benchmark contains multiple task-specific subsets: FGQA (fine-grained multiple-choice QA), SGQA (smart-glasses open-ended QA), RCap (video region captioning), RTLoc (region temporal localization), and RDCap (region dense video captioning). The PerceptionLM paper states the full PLM release includes 2.8M human-labeled instances across video QA and spatio-temporal captioning; the paper reports test-set sizes of FGQA ~4.3K, SGQA ~665, RCap ~10.06K, RTLoc ~7.91K and RDCap ~2.62K. Evaluation metrics used in the paper include MBAcc for FGQA, LLM-judge accuracy for SGQA and RCap, SODA for RDCap, and mean Recall@1 (averaged over IoU thresholds) for RTLoc. The Hugging Face dataset page (facebook/PLM-VideoBench) provides downloadable parquet subsets and metadata; the HF page lists subset row counts (for example: fgqa ~11k rows, rcap ~14.7k rows, rdcap ~5.17k rows, rtloc ~12.5k rows, sgqa 665 rows) which reflect the distributed dataset files on the hub. License: CC BY 4.0. Modalities: video + text (QA/captions/temporal spans).
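For the RTLoc subset, the paper reports mean Recall@1 averaged over IoU thresholds. The sketch below is a minimal illustration of that metric, assuming one predicted (start, end) interval in seconds per ground-truth segment; the threshold set shown is an assumption for illustration, and the exact values should be taken from the paper's evaluation code.

```python
from typing import List, Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def mean_recall_at_1(preds: List[Tuple[float, float]],
                     gts: List[Tuple[float, float]],
                     thresholds=(0.3, 0.5, 0.7, 0.9)) -> float:
    """Recall@1 at each IoU threshold, averaged over thresholds.

    The default `thresholds` tuple is an illustrative assumption, not
    necessarily the set used in the PerceptionLM paper.
    """
    per_threshold = []
    for t in thresholds:
        hits = sum(temporal_iou(p, g) >= t for p, g in zip(preds, gts))
        per_threshold.append(hits / len(gts))
    return sum(per_threshold) / len(per_threshold)
```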
State of the Art
PLM (8B)
67.7
MBAcc
Video-MMMU
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU is a multi-modal, multi-disciplinary benchmark designed to assess the ability of Large Multimodal Models (LMMs) to acquire and utilize knowledge from videos. It features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through question-answer pairs aligned with three stages: Perception, Comprehension, and Adaptation.
No results tracked yet
MMVU
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
A comprehensive benchmark for evaluating expert-level multi-discipline video understanding capabilities. MMVU provides 3,000 expert-annotated QA examples spanning 1,529 specialized-domain videos across 27 subjects in four key disciplines (Science, Healthcare, Humanities & Social Sciences, and Engineering). Each example comes with expert-annotated reasoning rationales and relevant domain knowledge, enabling researchers to assess not just answer correctness but also reasoning quality.
No results tracked yet
MMWorld
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
MMWorld is a comprehensive benchmark for evaluating multi-discipline multi-faceted world model evaluation in videos. It provides a curated collection of videos across multiple disciplines with questions that test various aspects of video understanding, including visual perception, domain knowledge, and reasoning capabilities. The dataset includes videos from different domains with structured questions and answers to evaluate Large Multimodal Models on their ability to understand and reason about video content.
No results tracked yet
CG-Bench
CG-Bench: Clue-Grounded Question Answering Benchmark for Long Video Understanding
CG-Bench is a clue-grounded question answering benchmark for evaluating Large Multimodal Models (LMMs) on long video understanding. It pairs long videos from diverse domains with human-annotated multiple-choice questions, each grounded in the specific clue interval that supports the answer, so models are rewarded for retrieving the relevant evidence rather than guessing from a few frames. Alongside standard multiple-choice accuracy, the benchmark includes clue-grounded evaluation settings that check whether a model can localize and use the supporting clues, making it a test of genuine long-video comprehension rather than shortcut reasoning.
No results tracked yet
EgoLife
EgoLife: Towards Egocentric Life Assistant
EgoLife is an ambitious egocentric AI project capturing multimodal daily activities of six participants over a week. Using Meta Aria glasses, synchronized third-person cameras, and mmWave sensors, it provides a rich dataset for long-term video understanding. The project enables AI assistants—powered by EgoGPT and EgoRAG—to support memory, habit tracking, event recall, and task management, advancing real-world egocentric AI applications. The dataset includes EgoIT-99K for instruction tuning and comprehensive egocentric video understanding.
No results tracked yet
CinePile
CinePile: A Long Video Question Answering Dataset and Benchmark
CinePile is a long-form video understanding dataset created with a human-in-the-loop pipeline in which large language models (LLMs) build on existing human-generated raw data. The dataset comprises multiple-choice questions across 86 diverse question templates (e.g., Emotional Transition, Object Description), generated automatically using Gemini. These templates fall into five high-level categories: Character and Relationship Dynamics (CRD), Narrative and Plot Analysis (NPA), Thematic Exploration (TE), Setting and Technical Analysis (STA), and Temporal. The dataset contains 298,888 training points and 4,940 test-set points drawn from 9,396 movie clips.
No results tracked yet
EgoSchema
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
EgoSchema is a very long-form video question-answering dataset and benchmark for evaluating the long-video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5,000 human-curated multiple-choice question-answer pairs spanning more than 250 hours of real video data and covering a very broad range of natural human activity and behavior. For each question, the correct answer must be selected from five given options based on a three-minute-long video clip.
No results tracked yet
VideoHolmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Video-Holmes is a benchmark for evaluating complex multimodal (video + audio + text) reasoning in multimodal large language models (MLLMs). It was created from manually annotated suspense short films and emphasizes tasks that require actively locating, integrating, and connecting multiple visual and audio clues dispersed across different video segments, rather than answering from single-shot or explicitly grounded cues. Key facts from the authors' release and dataset page:
- Full dataset: 1,837 questions sourced from 270 manually annotated suspense short films (videos typically 1–5 minutes long).
- Task coverage: seven purpose-designed reasoning tasks (SR: Social Reasoning; IMC: Intention and Motive Chaining; TCI: Temporal Causal Inference; TA: Timeline Analysis; MHR: Multimodal Hint Reasoning; PAR: Physical Anomaly Reasoning; CTI: Core Theme Inference).
- Data and tooling: the release packages videos (and audio), questions, and evaluation code, and includes a training split of 233 videos and 1,551 questions.
- Motivation and findings: the benchmark highlights a substantial performance gap between current MLLMs and human-like complex reasoning, and is intended as a "Holmes test" to encourage better multimodal reasoning and process analysis.
Sources: dataset GitHub and homepage (TencentARC/Video-Holmes), the Hugging Face dataset page for TencentARC/Video-Holmes, and the corresponding arXiv paper (arXiv:2505.21374).
No results tracked yet
TemporalBench (MBA-short QA)
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
TemporalBench is a multimodal video benchmark for fine-grained temporal understanding and reasoning. Introduced in the paper "TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models" (arXiv:2410.10818), it evaluates video-language models on a set of temporally focused tasks (e.g., short question answering, multi-binary temporal checks, event ordering, and frequency/amplitude reasoning). The dataset provides evaluation splits and task-specific subsets; the subset referenced here as "MBA-short QA" is a short question-answering subset scored with multi-binary accuracy, i.e., an item counts as correct only when all of its derived binary checks are answered correctly. Project page, code, and dataset resources are available from the authors, along with a hosted dataset entry on Hugging Face.
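The multi-binary aggregation can be sketched as follows: each original QA item expands into several binary checks, and the item only scores when every check is right. This is a minimal illustration of that scoring rule, not the authors' evaluation code; the item fields (`binary_predictions`, `binary_labels`) are hypothetical names for illustration.

```python
def multi_binary_accuracy(items):
    """Fraction of QA items for which *all* derived binary checks are correct.

    Each `item` is assumed (hypothetically) to carry parallel lists of model
    predictions and ground-truth labels for its binary sub-questions.
    """
    correct = 0
    for item in items:
        preds, labels = item["binary_predictions"], item["binary_labels"]
        if len(preds) == len(labels) and all(p == l for p, l in zip(preds, labels)):
            correct += 1
    return correct / len(items) if items else 0.0

# Example: one item with all sub-answers right, one with a miss -> 0.5
items = [
    {"binary_predictions": ["yes", "no"], "binary_labels": ["yes", "no"]},
    {"binary_predictions": ["yes", "yes"], "binary_labels": ["yes", "no"]},
]
print(multi_binary_accuracy(items))  # 0.5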
No results tracked yet
MVP
Minimal Video Pairs (MVP)
Minimal Video Pairs (MVP) is a shortcut-aware Video Question Answering (Video-QA) benchmark designed to evaluate the spatio-temporal and intuitive-physics understanding of video-language models. The benchmark is built from minimally different video pairs: the two videos in each pair differ only in small ways yet yield opposite correct answers to the same question, a design that reduces reliance on superficial visual or textual shortcuts. The dataset contains multiple-choice QA examples (reported as ~55K in the paper) curated from nine video sources spanning egocentric/first-person and third-person domains. It is organized into thematic subsets (e.g., human_object_interactions, intuitive_physics, robot_object_interactions, temporal_reasoning) and provides scripts to download the underlying videos (which are not hosted directly on Hugging Face for legal reasons). The primary evaluation metric is paired accuracy, computed over the minimal video pairs.
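Paired accuracy credits a model only when it answers the question correctly for both videos of a minimal pair, which removes solutions that ignore the visual difference. Below is a minimal sketch of that aggregation; the grouping key `pair_id` and the `prediction`/`answer` field names are assumptions for illustration, not the benchmark's actual schema.

```python
from collections import defaultdict

def paired_accuracy(examples):
    """Fraction of minimal pairs where both members are answered correctly.

    Each example is assumed (hypothetically) to carry a `pair_id` shared by
    the two videos of a pair, plus `prediction` and `answer` fields.
    """
    pairs = defaultdict(list)
    for ex in examples:
        pairs[ex["pair_id"]].append(ex["prediction"] == ex["answer"])
    complete = [v for v in pairs.values() if len(v) == 2]
    if not complete:
        return 0.0
    return sum(all(v) for v in complete) / len(complete)
```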
No results tracked yet
TOMATO
TOMATO (Temporal Reasoning Multimodal Evaluation)
TOMATO (Temporal Reasoning Multimodal Evaluation) is a video question-answer benchmark designed to rigorously evaluate visual temporal reasoning in multimodal (video+language) foundation models. The benchmark was constructed according to three principles proposed by the authors — Multi-Frame Gain, Frame Order Sensitivity, and Frame Information Disparity — to ensure questions require reasoning across multiple frames and correct temporal ordering. TOMATO contains diverse scenarios (self-recorded human-centric interactions, gesture/interactive scenarios, and simulated scenes) and is organized into six task-types. The paper reports 1,484 questions applied to 1,417 videos and uses accuracy as the primary evaluation metric. The benchmark highlights a substantial gap between human and current model performance and is released with code and data resources in the authors' repository.
No results tracked yet
TempCompass
TempCompass
TempCompass is a temporal-understanding video QA benchmark designed to evaluate the temporal perception abilities of Video LLMs. It targets fine-grained temporal aspects (e.g., speed and direction) and uses multiple task formats to avoid reliance on single-frame cues or language priors. The authors collect "conflicting" video pairs that share the same static content but differ in a specific temporal aspect, and they use a human-annotation plus LLM-based instruction-generation pipeline to produce diverse task instructions. The public Hugging Face release provides four subsets / task formats (the uploaded dataset exposes a single "test" split): multi-choice (~1.58k examples), yes/no (~2.45k examples), captioning (~2.0k examples), and caption_matching (~1.5k examples). TempCompass was introduced in the paper "TempCompass: Do Video LLMs Really Understand Videos?" and is intended to benchmark nuanced temporal understanding.
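A hedged sketch of how one of these subsets might be pulled from the Hub with the `datasets` library is shown below; the repository id and configuration name are assumptions taken from the subset names above and should be verified on the dataset page.

```python
from datasets import load_dataset

# Repository id and config name are assumptions (check the TempCompass
# dataset page on the Hugging Face Hub for the exact identifiers).
REPO_ID = "lmms-lab/TempCompass"   # hypothetical id
CONFIG = "multi-choice"            # one of the four task formats listed above

ds = load_dataset(REPO_ID, CONFIG, split="test")
print(ds[0])  # a single multi-choice temporal-understanding question
```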
No results tracked yet
Video-MME
Video-MME is a comprehensive evaluation benchmark for multi-modal large language models (MLLMs) in video analysis. It evaluates MLLMs on video understanding tasks using 900 newly collected and human-annotated videos, including accompanying subtitles and audio. The dataset covers the full spectrum of video lengths, spans 6 key domains with 30 sub-class video types, and integrates multi-modal inputs such as subtitles and audio to assess all-round MLLM capabilities.
No results tracked yet
LVBench
LVBench is a benchmark for video language models specifically designed for extreme long video understanding. It consists of publicly sourced long-form videos, substantially longer than those in typical video QA benchmarks, paired with human-annotated multiple-choice questions, and is intended to test capabilities such as long-term memory, temporal grounding, event understanding, summarization, and reasoning over extended video content.
No results tracked yet
MLVU
MLVU is a multi-task benchmark designed for long video understanding, consisting of 3,102 questions across 9 categories. It is divided into a dev set (2,593 questions) and a test set (509 questions). The code and dataset can be accessed from <https://github.com/JUNJIE99/MLVU>.
No results tracked yet
MVBench
MVBench is a dataset for video language models that covers a wide range of temporal tasks, emphasizing temporally-sensitive videos. It facilitates systematic generation of video tasks requiring various temporal abilities, from perception to cognition. MVBench efficiently creates multiple-choice QA for task evaluation by automatically transforming public video annotations, ensuring fairness through ground-truth video annotations and avoiding biased LLM scoring.
No results tracked yet
Video-MMLU
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Video-MMLU is a benchmark designed to rigorously evaluate how well Large Multimodal Models perform on Massive Multi-discipline Lecture Understanding. It specifically focuses on testing whether models can truly understand and reason about knowledge-intensive lecture videos – like those demonstrating theorems or solving problems in fields such as math, physics, and chemistry, including their dynamic formulas and animations – requiring them to integrate visual and temporal information and grasp the reasoning behind them, much like a human student would.
No results tracked yet
Perception Test
The Perception Test is a diagnostic benchmark for multimodal video models, particularly visual language models (VLMs), introduced by DeepMind. It consists of purposefully designed, real-world videos that probe perception and reasoning skills across four areas — memory, abstraction, physics, and semantics — using video, audio, and text modalities. Tasks include multiple-choice and grounded video question answering as well as object and point tracking and temporal localization, making it suitable for diagnosing perceptual abilities beyond standard QA accuracy.
No results tracked yet
Related Tasks
World Models
World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
Omni models
Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Some examples of the first omni models include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).
Coding Agents
Coding agents are autonomous, AI-powered software development tools that understand natural language prompts and execute multi-step tasks to automate coding, bug fixing, and entire software workflows. They act as intelligent assistants within the software development lifecycle, capable of understanding code, generating new code, optimizing existing code, debugging, and handling tasks like documentation and feature scaffolding with minimal user supervision. Examples of coding agents include Claude Code and Cursor Agent.