Vision-Language Models
Vision-Language Models (VLMs) are advanced AI systems that unify computer vision and natural language processing, enabling them to understand and reason about both visual and textual data simultaneously. By processing images and text together, VLMs can perform tasks such as image captioning, visual question answering, and generating images from text. They are trained on large datasets of image-text pairs, allowing them to learn the relationships between visual features and language, leading to comprehensive, multimodal understanding.
Vision-language modeling is a core multimodal task. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
RefCOCO
Referring Expressions COCO
RefCOCO is a dataset for referring expression comprehension. It contains 142,209 referring expressions for 50,000 objects in 19,994 images from MS COCO.
No results tracked yet
GQA
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
GQA is a dataset for visual question answering featuring compositional questions over real-world images. The dataset consists of 22M questions about various day-to-day images, where each image is associated with a scene graph of the objects, attributes and relations. Each question is associated with a structured representation of its semantics, a functional program that specifies the reasoning steps. The dataset is designed to address shortcomings in existing VQA benchmarks by mitigating language priors and conditional biases, enabling fine-grained diagnosis for different question types.
No results tracked yet
MME
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a comprehensive evaluation benchmark for Multimodal Large Language Models (MLLMs) that assesses both perception and cognition abilities across 14 subtasks. The benchmark features manually designed instruction-answer pairs to prevent data leakage and uses concise instruction design to facilitate fair comparisons among MLLMs. Over 50 advanced MLLMs have been evaluated using MME, providing quantitative analysis and highlighting areas for improvement in multimodal model development.
No results tracked yet
MTVQA
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
MTVQA is a multilingual Text-Centric Visual Question Answering (TEC-VQA) benchmark featuring high-quality human-expert annotations across 9 diverse languages (AR, DE, FR, IT, JA, KO, RU, TH, VI). It evaluates multimodal large language models on their ability to understand and answer questions about text in images across multiple languages.
No results tracked yet
VCR-Wiki-EN-Easy
VCR-Wiki English Easy: Visual Caption Restoration
English easy mode variant of VCR-Wiki benchmark for visual caption restoration. VCR challenges models to accurately restore partially obscured texts using pixel-level hints within images, requiring the combined information from provided images, context, and subtle cues from tiny exposed areas of masked texts.
No results tracked yet
VCR-Wiki-ZH-Easy
VCR-Wiki Chinese Easy: Visual Caption Restoration
Chinese easy mode variant of VCR-Wiki benchmark for visual caption restoration. VCR challenges models to accurately restore partially obscured texts using pixel-level hints within images, requiring the combined information from provided images, context, and subtle cues from tiny exposed areas of masked texts.
No results tracked yet
MMBench-EN
MMBench English Test: Is Your Multi-modal Model an All-around Player?
English test split of MMBench, a comprehensive benchmark to evaluate the multi-modal understanding capability of large vision-language models across 20 ability dimensions including perception and reasoning. Contains 1784 multiple-choice questions with circular evaluation strategy.
No results tracked yet
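MMBench's circular evaluation strategy can be illustrated with a short sketch (a simplified illustration with assumed interfaces, not the benchmark's actual code): a question only counts as correct if the model picks the right option under every rotation of the choice list, which filters out positional guessing.

```python
def circular_eval(options, correct_idx, predict):
    """Count a question correct only if `predict` returns the index of
    the correct option under every rotation of the option list."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        rotated_correct = rotated.index(options[correct_idx])
        if predict(rotated) != rotated_correct:
            return False
    return True

options = ["red", "green", "blue", "yellow"]
# Toy "model" that answers by content, so it survives rotation:
robust = lambda opts: opts.index("blue")
# Toy "model" that always picks option A regardless of content:
positional = lambda opts: 0

print(circular_eval(options, 2, robust))      # True
print(circular_eval(options, 2, positional))  # False
```

The rotation count equals the number of options, so a purely positional guesser passes all rotations only by chance across every position at once, which is impossible for a deterministic choice.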
MMBench-CN
MMBench Chinese Test: Is Your Multi-modal Model an All-around Player?
Chinese test split of MMBench, a comprehensive benchmark to evaluate the multi-modal understanding capability of large vision-language models across 20 ability dimensions. Contains 1784 multiple-choice questions translated to Chinese with circular evaluation strategy.
No results tracked yet
MMBench-V1.1
MMBench V1.1 Test
Version 1.1 test split of MMBench, an updated version of the comprehensive multi-modal benchmark evaluating vision-language models across multiple ability dimensions with improved question quality and coverage.
No results tracked yet
MMStar
MMStar: Are We on the Right Way for Evaluating Large Vision-Language Models?
MMStar is a vision-language benchmark designed to address key issues in LVLM evaluation by providing a more challenging and reliable test set. It focuses on eliminating data leakage and reducing bias to better assess true multimodal capabilities.
No results tracked yet
HallusionBench
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench is a comprehensive benchmark designed to evaluate language hallucination and visual illusion in large vision-language models. It presents challenging image-context reasoning tasks to assess model robustness and accuracy.
No results tracked yet
Vibe-Eval
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
Vibe-Eval is an open benchmark for evaluating multimodal chat models. It consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. The benchmark is designed to be open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, the hard set contains >50% questions that all frontier models answer incorrectly.
No results tracked yet
Meta-World authors' collected dataset
Meta-World MT50 (authors' collected dataset)
Meta-World (authors' collected dataset) is a collection of simulated demonstrations in the Meta-World MT50 benchmark used by the SmolVLA paper (arXiv:2506.01844). According to the Hugging Face dataset card (lerobot/metaworld_mt50), the dataset was created with LeRobot and contains 2,500 episodes (204,806 total frames) across 49 tasks (HF metadata lists total_tasks: 49), recorded at 80 fps and stored in Parquet/video chunks under an Apache-2.0 license. From the SmolVLA paper: the authors collected 50 demonstrations for each of the 50 MT50 tasks (2,500 episodes) and evaluate with 10 trials per task, reporting a binary success rate averaged across tasks. Hugging Face dataset: https://huggingface.co/datasets/lerobot/metaworld_mt50 (meta/info.json lists the metadata shown above).
No results tracked yet
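The evaluation protocol described above (10 trials per task, binary success, averaged across tasks) reduces to a simple macro average. A minimal sketch with hypothetical task names:

```python
def mt50_score(results, trials_per_task=10):
    """Average per-task success rate: each task contributes the fraction
    of successful trials (1 = success, 0 = failure), then task-level
    rates are averaged so every task counts equally."""
    per_task = [sum(trials) / trials_per_task for trials in results.values()]
    return sum(per_task) / len(per_task)

# Hypothetical trial outcomes for two tasks:
results = {
    "pick-place": [1] * 7 + [0] * 3,  # 70% success
    "door-open":  [1] * 10,           # 100% success
}
print(mt50_score(results))  # 0.85
```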
SO100 real-world: Pick-Place, Stacking, Sorting
SO100 (real-world: Pick-Place, Stacking, Sorting)
Three small real-world robot-manipulation datasets collected with the SO-100 (SO100) robot and released on the Hugging Face Hub. The datasets correspond to three tasks: Pick-Place, Stacking, and Sorting. According to the SmolVLA paper (arXiv:2506.01844), each dataset contains 10 trajectories from each of 5 starting positions (50 demonstrations total) and is scored with fine-grained subtasks. The released data uses the LeRobot dataset format (Parquet time-series tables plus video frames), is provided under an Apache-2.0-compatible license, and is intended for training and evaluating vision-language-action and robotics models. Representative Hugging Face dataset pages include fracapuano/so100_test and related so100 repositories.
No results tracked yet
SO101 real-world: Pick-Place-Lego
SO101 (real-world: Pick-Place-Lego) — lerobot/svla_so101_pickplace
SO101 (real-world: Pick-Place-Lego) is a community-collected robotics dataset created with the LeRobot tooling. The Hugging Face dataset entry (lerobot/svla_so101_pickplace) contains 50 real-world pick-and-place demonstrations recorded with an SO-101/so100_follower robot: total_episodes=50, total_frames=11,939, total_videos=100, fps=30. Data is provided in chunked Parquet files (tabular / timeseries) alongside video, and is organized with a single split (train: 0:50). Modalities: video, tabular, timeseries. Format: parquet. License: Apache-2.0. Typical use: imitation learning / vision-language-action evaluation for manipulation tasks (Pick-Place Lego). The dataset was used for evaluation in the SmolVLA paper (arXiv:2506.01844) as a real-world Pick-Place-Lego benchmark; the SmolVLA authors note their model was not pretrained on SO101 data. Source/hub page: https://huggingface.co/datasets/lerobot/svla_so101_pickplace.
No results tracked yet
OmniBench
OmniBench
OmniBench is a tri-modal (audio + image + text) benchmark designed to evaluate omni-language / cross-modal models' ability to recognize, interpret, and reason across visual, acoustic and textual inputs simultaneously. The benchmark collects multi-modal QA-style examples covering diverse task types (e.g., action/activity recognition, multi-modal question answering). The Hugging Face dataset card (m-a-p/OmniBench) shows the dataset as a single split with ~1.14k rows and a schema including fields such as task type, question, options, answer, audio/image content and file paths; the HF dataset is provided in parquet format and tagged with modalities audio, image, and text. The paper (arXiv:2409.15272) and project page describe the benchmark, motivations, and evaluation protocol.
No results tracked yet
DocVQA
DocVQA is a dataset for Visual Question Answering (VQA) on document images. It consists of 50,000 questions defined on over 12,000 document images, covering various document types with textual, graphical, and structural elements like tables, forms, and figures. The document images are sourced from the UCSF Industry Documents Library and include a mix of printed, typewritten, and handwritten content, such as letters, memos, notes, and reports. The dataset is split into a training set (39,463 questions, 10,194 images), a validation set (5,349 questions, 1,286 images), and a test set (5,188 questions, 1,287 images).
No results tracked yet
ChartQA
ChartQA is a dataset for question answering about charts with visual and logical reasoning. It is used for vision language models and involves complex reasoning questions that require several logical and arithmetic operations.
No results tracked yet
MMMU
MMMU is a large multimodal benchmark for evaluating multimodal models on college-level, multi-discipline understanding and reasoning. It contains ~11.5K carefully collected multimodal questions from college exams, quizzes, and textbooks spanning 30 subjects and 183 subfields, with 30 heterogeneous image types (e.g., charts, diagrams, maps, tables, music sheets, chemical structures) to test expert-level reasoning across disciplines.
No results tracked yet
MMMU-Pro
MMMU-Pro serves as a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images.
No results tracked yet
AI2D
Dataset of ~5,000 diagrams with exhaustive annotations of diagram constituents and their relationships, together with about 15,000 question–answer pairs for diagram question answering. Introduced for studying diagram parsing and reasoning in “A Diagram Is Worth A Dozen Images” (Kembhavi et al., 2016).
No results tracked yet
MathVista
MathVista is a dataset designed to evaluate mathematical reasoning in visual contexts for vision language models. It includes 6,141 examples collected from 31 different datasets, divided into a "testmini" subset (1,000 examples for model development and validation) and a "test" subset (5,141 examples for standard evaluation).
No results tracked yet
SEED (SeedBench)
SEED-Bench
SEED-Bench is a large-scale multimodal benchmark for evaluating generative comprehension of Multimodal Large Language Models (MLLMs). Introduced in the paper “SEED-Bench: Benchmarking Multimodal Large Language Models with Generative Comprehension” (arXiv:2307.16125, CVPR 2024), the benchmark contains ~19K multiple-choice questions with human-verified ground-truth answers spanning 12 evaluation dimensions (covering both image and video modalities and a range of capabilities such as scene understanding, instance identity/attribute/location/counting, spatial relations, text recognition, action recognition/prediction, visual reasoning, chart understanding, meme comprehension, etc.). Questions were generated with an automated pipeline followed by manual verification to ensure high-quality human annotations; the format (multiple-choice with gold options) enables objective, automated evaluation without human/GPT intervention. The dataset is distributed under CC BY-NC 4.0 and is available on Hugging Face (author/repo: AILab-CVC/SEED-Bench).
No results tracked yet
VQAv2
Visual Question Answering v2.0 (VQA v2.0)
VQA v2.0 (Visual Question Answering v2.0) is a large-scale visual question answering dataset and benchmark designed to reduce language priors present in the original VQA dataset. It contains open-ended natural-language questions about images (primarily COCO images) that require joint image and language understanding and commonsense reasoning to answer. The dataset was constructed by pairing complementary images so that language-only shortcuts are less effective. Key statistics (official site): ~204,721 COCO images (balanced real images), ~1,105,904 questions (≈5.4 questions per image), and 10 ground-truth answers per question (≈11,059,040 answers total). VQA v2.0 provides standard train/validation/test splits and an automatic evaluation metric for open-ended answers.
No results tracked yet
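The standard open-ended VQA accuracy metric credits an answer in proportion to annotator agreement: an answer is fully correct if at least 3 of the 10 human annotators gave it, with scores averaged over leave-one-out subsets of the annotations. A minimal sketch (omitting the official answer normalization such as lowercasing and punctuation stripping):

```python
def vqa_accuracy(prediction, gt_answers):
    """VQA accuracy: for each leave-one-out subset of the 10 human
    answers, the prediction scores min(#matches / 3, 1); the final
    score is the mean over all subsets."""
    scores = []
    for i in range(len(gt_answers)):
        others = gt_answers[:i] + gt_answers[i + 1:]
        matches = sum(a == prediction for a in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

print(vqa_accuracy("2", ["2"] * 10))                    # 1.0
print(vqa_accuracy("cat", ["cat"] * 2 + ["dog"] * 8))   # ≈ 0.6
```

Partial credit for minority answers (2 of 10 annotators gives roughly 0.6 here, not 0) is what distinguishes this metric from exact-match accuracy.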
WISE
WISE: A World Knowledge-Informed Semantic Evaluation
WISE (World Knowledge-Informed Semantic Evaluation) is a benchmark and dataset for evaluating text-to-image (T2I) models on their ability to integrate world knowledge and complex semantic understanding into generated images. The benchmark contains 1,000 carefully crafted prompts organized across 25 sub-domains spanning cultural common sense, spatio-temporal reasoning, and natural science. The project introduces WiScore, a quantitative metric designed to assess knowledge–image alignment beyond traditional CLIP-based metrics. The repository includes prompt JSON files (structured prompts and explanations), evaluation code and scripts, example assets, and instructions to compute WiScore and run evaluations. Code and data are hosted in the public GitHub repository (https://github.com/PKU-YuanGroup/WISE); the accompanying paper is available at arXiv:2503.07265.
No results tracked yet
MM-Vet
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities ("Multimodal Veterinarian")
MM-Vet (short for “Multimodal Veterinarian”) is an evaluation benchmark for large multimodal models (LMMs) that examines models on complex, integrated vision-language capabilities. The benchmark is designed around the insight that advanced multimodal abilities arise from integrating core vision-language capabilities: the authors define six core VL capabilities and evaluate 16 capability integrations of interest. MM‑Vet includes both open‑ended and closed QA style items, an LLM‑based evaluator for open‑ended answers, and aims to provide diagnostic insights beyond single-number rankings. The project provides code, data, and an online evaluator (GitHub) and a formatted dataset version used in the lmms-eval pipeline (Hugging Face). The Hugging Face formatted dataset includes fields such as question_id, image, question, answer, image_source, and capability.
No results tracked yet
IntelligentBench
IntelligentBench (BAGEL evaluation suite)
IntelligentBench is an evaluation suite introduced in the paper "Emerging Properties in Unified Multimodal Pretraining" (BAGEL). It is designed to evaluate free-form image manipulation and complex multimodal reasoning capabilities of unified multimodal models. The paper reports an initial release of 350 examples, with evaluations scored by GPT-4o. The benchmark probes advanced multimodal reasoning behaviours demonstrated by BAGEL (e.g., free-form image manipulation, future-frame prediction, 3D manipulation, and world navigation). No public Hugging Face dataset entry for IntelligentBench is currently available; the benchmark appears to be introduced in the BAGEL paper and may later be hosted on the project/GitHub page.
No results tracked yet
MathVision
MathVision (MATH-V) is a benchmark for evaluating the mathematical reasoning of vision-language models. It consists of 3,040 mathematical problems with visual contexts, curated from real math competitions and spanning 16 mathematical disciplines across 5 levels of difficulty.
No results tracked yet
MMT-Bench
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
MMT-Bench is a large, curated multimodal multitask benchmark for evaluating large vision-language models (LVLMs). It contains 31,325 multiple-choice visual questions covering 32 core meta-tasks and 162 subtasks spanning diverse multimodal scenarios (e.g., vehicle driving, embodied navigation) that require visual recognition, localization, reasoning, expert knowledge and planning. The benchmark is intended to provide a task-map style, comprehensive evaluation of LVLMs’ multitask capabilities; the project provides dataset files on Hugging Face, code on GitHub, and a public leaderboard. Dataset release metadata indicates an MIT license.
No results tracked yet
RefCOCO / RefCOCO+ / RefCOCOg (overall)
RefCOCO / RefCOCO+ / RefCOCOg (referring-expression visual grounding datasets on MS COCO)
RefCOCO / RefCOCO+ / RefCOCOg are a family of referring-expression (visual grounding) benchmarks built on MS COCO images. Each dataset pairs natural-language referring expressions with target object instances (bounding boxes) so models can localize the described object in the image. Key characteristics: RefCOCO — ~142,209 expressions for ~50,000 object instances in 19,994 COCO images (short, concise expressions; split into train/val/testA/testB). RefCOCO+ — ~141,564 expressions for ~49,856 objects in 19,992 images; similar to RefCOCO but location/absolute-position words are banned (encourages appearance-based descriptions). RefCOCOg — ~85,474 (longer, more complex) expressions for ~54,822 objects in 26,711 images (collected with different protocol; expressions average much longer than RefCOCO/RefCOCO+). These datasets are widely used to evaluate referring expression comprehension / visual grounding / vision-language localization models. (Information from the original papers and dataset releases: Yu et al. (ECCV/ArXiv) and Mao et al. (CVPR/ArXiv), and standard dataset metadata / TFDS / HF dataset entries.)
No results tracked yet
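Referring expression comprehension on these datasets is typically scored by whether the predicted bounding box overlaps the ground-truth box with IoU ≥ 0.5. A minimal sketch of that protocol:

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of expressions whose predicted box reaches IoU >= 0.5
    with the ground-truth box (standard RefCOCO scoring)."""
    hits = sum(box_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Hypothetical predictions vs. ground truth for two expressions:
preds = [(0, 0, 10, 10), (0, 0, 2, 2)]
gts   = [(1, 1, 11, 11), (8, 8, 10, 10)]
print(grounding_accuracy(preds, gts))  # 0.5 (first hits, second misses)
```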
AI2D
AI2D (AI2 Diagrams Dataset) — “A Diagram Is Worth A Dozen Images”
AI2D (often cited from the paper “A Diagram Is Worth A Dozen Images” by Kembhavi et al., arXiv:1603.07396) is a dataset of elementary-school–level science diagrams intended for diagram understanding, parsing and multi-modal reasoning. The dataset contains roughly 4.9K diagrams (reported as ~4,903 images) that have been densely annotated with their constituent elements and the semantic/structural relationships between them. The authors introduce a Diagram Parse Graph (DPG) representation to capture diagram components (e.g., diagram regions/figures, diagram text, arrows/lines) and the relations that connect them; the dataset has been used for diagram parsing, diagram question answering / visual reasoning over diagrams, and related vision–language research. The original paper (ECCV/ArXiv) describes the collection, annotation format and the DPG representation. — Key references: arXiv:1603.07396, AI2D dataset on Hugging Face (lmms-lab/ai2d).
No results tracked yet
M-LongDoc
M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
M-LongDoc is a benchmark introduced in Chia et al. (arXiv:2411.06176) for multimodal super-long document understanding. The benchmark consists of 851 examples/questions constructed from long PDF documents that contain multimodal content (interleaved text, figures, tables, etc.) and is intended to evaluate models' ability to read and answer questions over very long, multi-page documents. The paper also provides an automated evaluation framework for reliably assessing open-ended model answers and proposes a retrieval-aware tuning approach that retrieves relevant pages/regions to enable efficient long-document reading. Project/paper information and a demo are available from the project page (https://multimodal-documents.github.io/) and the paper on arXiv.
No results tracked yet
MEGA-Bench (macro)
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
MEGA-Bench is a large-scale multimodal evaluation suite that consolidates over 500 real-world multimodal tasks into a unified evaluation format. Released by TIGER-Lab, MEGA-Bench provides curated high-quality data samples (images/videos + text) and standardized example/metric fields (e.g., task_name, task_description, example_text, example_media, metric_info, answer, eval_context) to enable cost-effective, accurate evaluation of multimodal/vision-language models. The Hugging Face dataset contains subsets (e.g., core and open), a test split (core ≈ 6.53k rows), and metadata describing each task and its evaluation metric. The accompanying paper (ICLR 2025 / arXiv:2410.10563) describes the benchmark and reports aggregated metrics including a macro metric across tasks. License: Apache-2.0. Main resources: paper (arXiv), code (GitHub), dataset and leaderboard on Hugging Face.
No results tracked yet
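The "macro" in MEGA-Bench's headline metric refers to averaging per-task mean scores so that every task counts equally, in contrast to a micro average pooled over all examples, which large tasks dominate. A generic sketch (not the benchmark's scoring code, with hypothetical task names):

```python
def macro_average(task_scores):
    """Mean of per-task means: every task weighs equally."""
    return sum(sum(s) / len(s) for s in task_scores.values()) / len(task_scores)

def micro_average(task_scores):
    """Mean over all examples pooled together: big tasks dominate."""
    pooled = [s for scores in task_scores.values() for s in scores]
    return sum(pooled) / len(pooled)

scores = {
    "ocr":   [1, 1, 1, 1, 1, 1, 1, 1],  # large, easy task
    "chess": [0, 0],                    # small, hard task
}
print(macro_average(scores))  # 0.5
print(micro_average(scores))  # 0.8
```

The gap between 0.5 and 0.8 on the same results shows why a macro metric is preferred when task sizes vary widely.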
OlympiadBench (full)
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
OlympiadBench is an Olympiad-level bilingual multimodal scientific benchmark (paper/ACL 2024, arXiv:2402.14008). It contains 8,476 problems drawn from high‑difficulty mathematics and physics competitions (including examples from Chinese exams) presented in both Chinese and English. Problems are multimodal (text + images) and the dataset includes expert annotations with step‑by‑step reasoning and final answers. The benchmark is intended to evaluate advanced reasoning, multimodal understanding, and problem‑solving capabilities of LLMs and LMMs (tasks: question answering / visual question answering). The Hugging Face dataset page groups the data into multiple subsets (math/physics, Chinese/English, multimodal/text‑only variants) and the paper/report refers to evaluations on a reported “full” split. (Sources: arXiv:2402.14008, ACL 2024 paper, Hugging Face dataset page, OpenBMB GitHub.)
No results tracked yet
NIAH/Multi-needle
MMNeedle (Multimodal Needle-in-a-haystack)
MMNeedle (MultiModal Needle-in-a-haystack) is a benchmark for evaluating long-context capabilities of multimodal large language models (MLLMs). The benchmark stresses sub-image level retrieval and understanding by asking models to locate a target "needle" (a sub-image or region) inside a large "haystack" composed of many images or stitched images to create very long visual contexts. The benchmark includes a protocol to generate labels for sub-image retrieval and supports multi-image and stitched-image inputs to scale context length; evaluation focuses on the model's ability to find the correct sub-image given textual instructions and visual context. The dataset, code and leaderboard are linked from the project page and GitHub repository for the MMNeedle benchmark.
No results tracked yet
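The stitched-image construction can be sketched with NumPy (an illustrative sketch, not the benchmark's generation code): sub-images are tiled into one large canvas to lengthen the visual context, and the ground-truth label is the grid cell containing the needle.

```python
import numpy as np

def stitch(images, rows, cols):
    """Tile equally sized (H, W, C) images into a rows x cols canvas."""
    h, w, c = images[0].shape
    canvas = np.zeros((rows * h, cols * w, c), dtype=images[0].dtype)
    for idx, img in enumerate(images):
        r, k = divmod(idx, cols)
        canvas[r * h:(r + 1) * h, k * w:(k + 1) * w] = img
    return canvas

rng = np.random.default_rng(0)
tiles = [rng.integers(0, 255, (8, 8, 3), dtype=np.uint8) for _ in range(6)]
needle_idx = 4                      # ground-truth grid cell of the "needle"
canvas = stitch(tiles, rows=2, cols=3)

# The needle tile is recoverable exactly from its grid cell:
r, k = divmod(needle_idx, 3)
assert (canvas[r * 8:(r + 1) * 8, k * 8:(k + 1) * 8] == tiles[needle_idx]).all()
print(canvas.shape)  # (16, 24, 3)
```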
RealWorldQA
RealWorldQA is a benchmark for vision-language models consisting of over 700 images drawn from real-world scenarios, each paired with a question and an easily verifiable answer. The dataset is fully human-annotated and is described by its authors as featuring high average image resolution and challenging spatial-understanding tasks.
No results tracked yet
BLINK
BLINK is a benchmark for vision-language models containing 14 visual perception tasks that humans can solve "within a blink" but that pose significant challenges for current multimodal large language models (MLLMs).
No results tracked yet
TextVQA
TextVQA is a dataset for visual question answering (VQA) that requires models to read and reason about text within images to answer questions. It contains 45,336 questions over 28,408 images, specifically designed for tasks where questions require understanding scene text in the given image. The dataset uses VQA accuracy for evaluation.
No results tracked yet
ZeroBench
ZeroBench is a lightweight but challenging visual reasoning benchmark for large multimodal models (LMMs). It consists of 100 hand-crafted questions and 334 subquestions covering a wide range of domains and visual capabilities. All 20 models evaluated scored 0.0% on the main questions, making ZeroBench effectively impossible for contemporary frontier LMMs.
No results tracked yet
InfoVQA
InfoVQA is a dataset for Vision Language Models that contains infographics collected from the Internet. It includes 30,000 questions and 5,000 images, with questions and answers that were manually annotated.
No results tracked yet
Related Tasks
World Models
World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
Omni models
Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Some examples of the first omni models include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).
Video-Language Models
Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using special encoders to convert video data into a format that a standard text-based large language model (LLM) can process, enabling tasks like video analysis, content generation, and question answering about video content.