Vision-Language Models
Vision-Language Models (VLMs) are advanced AI systems that unify computer vision and natural language processing, enabling them to understand and reason about both visual and textual data simultaneously. By processing images and text together, VLMs can perform tasks such as image captioning, visual question answering, and generating images from text. They are trained on large datasets of image-text pairs, allowing them to learn the relationships between visual features and language, leading to comprehensive, multimodal understanding.
Vision-language modeling is a core multimodal task. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
Benchmarks & SOTA
RefCOCO
Referring Expressions COCO
RefCOCO is a dataset for referring expression comprehension. It contains 142,209 referring expressions for 50,000 objects in 19,994 images from MS COCO.
No results tracked yet
GQA
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering
GQA is a dataset for visual question answering featuring compositional questions over real-world images. The dataset consists of 22M questions about various day-to-day images, where each image is associated with a scene graph of the objects, attributes and relations. Each question is associated with a structured representation of its semantics, a functional program that specifies the reasoning steps. The dataset is designed to address shortcomings in existing VQA benchmarks by mitigating language priors and conditional biases, enabling fine-grained diagnosis for different question types.
No results tracked yet
MME
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a comprehensive evaluation benchmark for Multimodal Large Language Models (MLLMs) that assesses both perception and cognition abilities across 14 subtasks. The benchmark features manually designed instruction-answer pairs to prevent data leakage and uses concise instruction design to facilitate fair comparisons among MLLMs. Over 50 advanced MLLMs have been evaluated using MME, providing quantitative analysis and highlighting areas for improvement in multimodal model development.
No results tracked yet
MTVQA
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
MTVQA is a multilingual Text-Centric Visual Question Answering (TEC-VQA) benchmark featuring high-quality human-expert annotations across 9 diverse languages (AR, DE, FR, IT, JA, KO, RU, TH, VI). It evaluates multimodal large language models on their ability to understand and answer questions about text in images across multiple languages.
No results tracked yet
VCR-Wiki-EN-Easy
VCR-Wiki English Easy: Visual Caption Restoration
English easy mode variant of VCR-Wiki benchmark for visual caption restoration. VCR challenges models to accurately restore partially obscured texts using pixel-level hints within images, requiring the combined information from provided images, context, and subtle cues from tiny exposed areas of masked texts.
No results tracked yet
VCR-Wiki-ZH-Easy
VCR-Wiki Chinese Easy: Visual Caption Restoration
Chinese easy mode variant of VCR-Wiki benchmark for visual caption restoration. VCR challenges models to accurately restore partially obscured texts using pixel-level hints within images, requiring the combined information from provided images, context, and subtle cues from tiny exposed areas of masked texts.
No results tracked yet
MMBench-EN
MMBench English Test: Is Your Multi-modal Model an All-around Player?
English test split of MMBench, a comprehensive benchmark to evaluate the multi-modal understanding capability of large vision-language models across 20 ability dimensions including perception and reasoning. Contains 1784 multiple-choice questions with circular evaluation strategy.
No results tracked yet
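MMBench's circular evaluation strategy can be illustrated with a short sketch (a simplified illustration with assumed interfaces, not the benchmark's actual code): a question only counts as correct if the model picks the right option under every rotation of the choice list, which filters out positional guessing.

```python
def circular_eval(options, correct_idx, predict):
    """Count a question correct only if `predict` returns the index of
    the correct option under every rotation of the option list."""
    n = len(options)
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        rotated_correct = rotated.index(options[correct_idx])
        if predict(rotated) != rotated_correct:
            return False
    return True

options = ["red", "green", "blue", "yellow"]
# Toy "model" that answers by content, so it survives rotation:
robust = lambda opts: opts.index("blue")
# Toy "model" that always picks option A regardless of content:
positional = lambda opts: 0

print(circular_eval(options, 2, robust))      # True
print(circular_eval(options, 2, positional))  # False
```

The rotation count equals the number of options, so a purely positional guesser passes all rotations only by chance across every position at once, which is impossible for a deterministic choice.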
MMBench-CN
MMBench Chinese Test: Is Your Multi-modal Model an All-around Player?
Chinese test split of MMBench, a comprehensive benchmark to evaluate the multi-modal understanding capability of large vision-language models across 20 ability dimensions. Contains 1784 multiple-choice questions translated to Chinese with circular evaluation strategy.
No results tracked yet
MMBench-V1.1
MMBench V1.1 Test
Version 1.1 test split of MMBench, an updated version of the comprehensive multi-modal benchmark evaluating vision-language models across multiple ability dimensions with improved question quality and coverage.
No results tracked yet
MMStar
MMStar: Are We on the Right Way for Evaluating Large Vision-Language Models?
MMStar is a vision-language benchmark designed to address key issues in LVLM evaluation by providing a more challenging and reliable test set. It focuses on eliminating data leakage and reducing bias to better assess true multimodal capabilities.
No results tracked yet
HallusionBench
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench is a comprehensive benchmark designed to evaluate language hallucination and visual illusion in large vision-language models. It presents challenging image-context reasoning tasks to assess model robustness and accuracy.
No results tracked yet
Vibe-Eval
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
Vibe-Eval is an open benchmark for evaluating multimodal chat models. It consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts. The benchmark is designed to be open-ended and challenging with dual objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) rigorously testing and probing the capabilities of present frontier models. Notably, the hard set contains >50% questions that all frontier models answer incorrectly.
No results tracked yet
Meta-World authors' collected dataset
Meta-World MT50 (authors' collected dataset)
Meta-World (authors' collected dataset) is a collection of simulated demonstrations in the Meta-World MT50 benchmark used by the SmolVLA paper (arXiv:2506.01844). According to the Hugging Face dataset card (lerobot/metaworld_mt50), the dataset was created with LeRobot and contains 2,500 episodes (204,806 total frames) across 49 tasks (HF metadata lists total_tasks: 49), recorded at 80 fps and stored in Parquet/video chunks under an Apache-2.0 license. From the SmolVLA paper: the authors collected 50 demonstrations for each of the 50 MT50 tasks (2,500 episodes) and evaluate with 10 trials per task, reporting a binary success rate averaged across tasks. Hugging Face dataset: https://huggingface.co/datasets/lerobot/metaworld_mt50 (meta/info.json lists the metadata shown above).
No results tracked yet
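The evaluation protocol described above (10 trials per task, binary success, averaged across tasks) reduces to a simple macro average. A minimal sketch with hypothetical task names:

```python
def mt50_score(results, trials_per_task=10):
    """Average per-task success rate: each task contributes the fraction
    of successful trials (1 = success, 0 = failure), then task-level
    rates are averaged so every task counts equally."""
    per_task = [sum(trials) / trials_per_task for trials in results.values()]
    return sum(per_task) / len(per_task)

# Hypothetical trial outcomes for two tasks:
results = {
    "pick-place": [1] * 7 + [0] * 3,  # 70% success
    "door-open":  [1] * 10,           # 100% success
}
print(mt50_score(results))  # 0.85
```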
SO100 real-world: Pick-Place, Stacking, Sorting
SO100 (real-world: Pick-Place, Stacking, Sorting)
Three small real-world robot-manipulation datasets collected with the SO-100 (SO100) robot and released on the Hugging Face Hub. The datasets correspond to three tasks: Pick-Place, Stacking, and Sorting. According to the SmolVLA paper (arXiv:2506.01844), each dataset contains 10 trajectories from each of 5 starting positions (50 demonstrations total) and is scored with fine-grained subtasks. The released data uses the LeRobot dataset format (Parquet time-series tables plus video frames), is provided under an Apache-2.0-compatible license, and is intended for training and evaluating vision-language-action and robotics models. Representative Hugging Face dataset pages include fracapuano/so100_test and related so100 repositories.
No results tracked yet
SO101 real-world: Pick-Place-Lego
SO101 (real-world: Pick-Place-Lego) — lerobot/svla_so101_pickplace
SO101 (real-world: Pick-Place-Lego) is a community-collected robotics dataset created with the LeRobot tooling. The Hugging Face dataset entry (lerobot/svla_so101_pickplace) contains 50 real-world pick-and-place demonstrations recorded with an SO-101/so100_follower robot: total_episodes=50, total_frames=11,939, total_videos=100, fps=30. Data is provided in chunked Parquet files (tabular / timeseries) alongside video, and is organized with a single split (train: 0:50). Modalities: video, tabular, timeseries. Format: parquet. License: Apache-2.0. Typical use: imitation learning / vision-language-action evaluation for manipulation tasks (Pick-Place Lego). The dataset was used for evaluation in the SmolVLA paper (arXiv:2506.01844) as a real-world Pick-Place-Lego benchmark; the SmolVLA authors note their model was not pretrained on SO101 data. Source/hub page: https://huggingface.co/datasets/lerobot/svla_so101_pickplace.
No results tracked yet
OmniBench
OmniBench
OmniBench is a tri-modal (audio + image + text) benchmark designed to evaluate omni-language / cross-modal models' ability to recognize, interpret, and reason across visual, acoustic and textual inputs simultaneously. The benchmark collects multi-modal QA-style examples covering diverse task types (e.g., action/activity recognition, multi-modal question answering). The Hugging Face dataset card (m-a-p/OmniBench) shows the dataset as a single split with ~1.14k rows and a schema including fields such as task type, question, options, answer, audio/image content and file paths; the HF dataset is provided in parquet format and tagged with modalities audio, image, and text. The paper (arXiv:2409.15272) and project page describe the benchmark, motivations, and evaluation protocol.
No results tracked yet
DocVQA
DocVQA is a dataset for Visual Question Answering (VQA) on document images. It consists of 50,000 questions defined on over 12,000 document images, covering various document types with textual, graphical, and structural elements like tables, forms, and figures. The document images are sourced from the UCSF Industry Documents Library and include a mix of printed, typewritten, and handwritten content, such as letters, memos, notes, and reports. The dataset is split into a training set (39,463 questions, 10,194 images), a validation set (5,349 questions, 1,286 images), and a test set (5,188 questions, 1,287 images).
No results tracked yet
ChartQA
ChartQA is a dataset for question answering about charts with visual and logical reasoning. It is used for vision language models and involves complex reasoning questions that require several logical and arithmetic operations.
No results tracked yet
MMMU
MMMU is a large multimodal benchmark for evaluating multimodal models on college-level, multi-discipline understanding and reasoning. It contains ~11.5K carefully collected multimodal questions from college exams, quizzes, and textbooks spanning 30 subjects and 183 subfields, with 30 heterogeneous image types (e.g., charts, diagrams, maps, tables, music sheets, chemical structures) to test expert-level reasoning across disciplines.
No results tracked yet
MMMU-Pro
MMMU-Pro serves as a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images.
No results tracked yet
AI2D
Dataset of ~5,000 diagrams with exhaustive annotations of diagram constituents and their relationships, together with about 15,000 question–answer pairs for diagram question answering. Introduced for studying diagram parsing and reasoning in “A Diagram Is Worth A Dozen Images” (Kembhavi et al., 2016).
No results tracked yet
MathVista
MathVista is a dataset designed to evaluate mathematical reasoning in visual contexts for vision language models. It includes 6,141 examples collected from 31 different datasets, divided into a "testmini" subset (1,000 examples for model development and validation) and a "test" subset (5,141 examples for standard evaluation).
No results tracked yet
SEED (SeedBench)
SEED-Bench
SEED-Bench is a large-scale multimodal benchmark for evaluating generative comprehension of Multimodal Large Language Models (MLLMs). Introduced in the paper “SEED-Bench: Benchmarking Multimodal Large Language Models with Generative Comprehension” (arXiv:2307.16125, CVPR 2024), the benchmark contains ~19K multiple-choice questions with human-verified ground-truth answers spanning 12 evaluation dimensions (covering both image and video modalities and a range of capabilities such as scene understanding, instance identity/attribute/location/counting, spatial relations, text recognition, action recognition/prediction, visual reasoning, chart understanding, meme comprehension, etc.). Questions were generated with an automated pipeline followed by manual verification to ensure high-quality human annotations; the format (multiple-choice with gold options) enables objective, automated evaluation without human/GPT intervention. The dataset is distributed under CC BY-NC 4.0 and is available on Hugging Face (author/repo: AILab-CVC/SEED-Bench).
No results tracked yet
VQAv2
Visual Question Answering v2.0 (VQA v2.0)
VQA v2.0 (Visual Question Answering v2.0) is a large-scale visual question answering dataset and benchmark designed to reduce language priors present in the original VQA dataset. It contains open-ended natural-language questions about images (primarily COCO images) that require joint image and language understanding and commonsense reasoning to answer. The dataset was constructed by pairing complementary images so that language-only shortcuts are less effective. Key statistics (official site): ~204,721 COCO images (balanced real images), ~1,105,904 questions (≈5.4 questions per image), and 10 ground-truth answers per question (≈11,059,040 answers total). VQA v2.0 provides standard train/validation/test splits and an automatic evaluation metric for open-ended answers.
No results tracked yet
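The standard open-ended VQA accuracy metric credits an answer in proportion to annotator agreement: an answer is fully correct if at least 3 of the 10 human annotators gave it, with scores averaged over leave-one-out subsets of the annotations. A minimal sketch (omitting the official answer normalization such as lowercasing and punctuation stripping):

```python
def vqa_accuracy(prediction, gt_answers):
    """VQA accuracy: for each leave-one-out subset of the 10 human
    answers, the prediction scores min(#matches / 3, 1); the final
    score is the mean over all subsets."""
    scores = []
    for i in range(len(gt_answers)):
        others = gt_answers[:i] + gt_answers[i + 1:]
        matches = sum(a == prediction for a in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

print(vqa_accuracy("2", ["2"] * 10))                    # 1.0
print(vqa_accuracy("cat", ["cat"] * 2 + ["dog"] * 8))   # ≈ 0.6
```

Partial credit for minority answers (2 of 10 annotators gives roughly 0.6 here, not 0) is what distinguishes this metric from exact-match accuracy.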
WISE
WISE: A World Knowledge-Informed Semantic Evaluation
WISE (World Knowledge-Informed Semantic Evaluation) is a benchmark and dataset for evaluating text-to-image (T2I) models on their ability to integrate world knowledge and complex semantic understanding into generated images. The benchmark contains 1,000 carefully crafted prompts organized across 25 sub-domains spanning cultural common sense, spatio-temporal reasoning, and natural science. The project introduces WiScore, a quantitative metric designed to assess knowledge–image alignment beyond traditional CLIP-based metrics. The repository includes prompt JSON files (structured prompts and explanations), evaluation code and scripts, example assets, and instructions to compute WiScore and run evaluations. Code and data are hosted in the public GitHub repository (https://github.com/PKU-YuanGroup/WISE); the accompanying paper is available at arXiv:2503.07265.
No results tracked yet
MM-Vet
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities ("Multimodal Veterinarian")
MM-Vet (short for “Multimodal Veterinarian”) is an evaluation benchmark for large multimodal models (LMMs) that examines models on complex, integrated vision-language capabilities. The benchmark is designed around the insight that advanced multimodal abilities arise from integrating core vision-language capabilities: the authors define six core VL capabilities and evaluate 16 capability integrations of interest. MM‑Vet includes both open‑ended and closed QA style items, an LLM‑based evaluator for open‑ended answers, and aims to provide diagnostic insights beyond single-number rankings. The project provides code, data, and an online evaluator (GitHub) and a formatted dataset version used in the lmms-eval pipeline (Hugging Face). The Hugging Face formatted dataset includes fields such as question_id, image, question, answer, image_source, and capability.
No results tracked yet
IntelligentBench
IntelligentBench (BAGEL evaluation suite)
IntelligentBench is an evaluation suite introduced in the paper "Emerging Properties in Unified Multimodal Pretraining" (BAGEL). It is designed to evaluate free-form image manipulation and complex multimodal reasoning capabilities of unified multimodal models. The paper reports an initial release of 350 examples, with evaluations scored by GPT-4o. The benchmark probes advanced multimodal reasoning behaviours demonstrated by BAGEL (e.g., free-form image manipulation, future-frame prediction, 3D manipulation, and world navigation). No public Hugging Face dataset entry for IntelligentBench is currently available; the benchmark appears to be introduced in the BAGEL paper and may later be hosted on the project/GitHub page.
No results tracked yet
MathVision
MathVision (MATH-V) is a benchmark for evaluating the mathematical reasoning of vision-language models. It consists of 3,040 mathematical problems with visual contexts, curated from real math competitions and spanning 16 mathematical disciplines across 5 levels of difficulty.
No results tracked yet
MMT-Bench
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
MMT-Bench is a large, curated multimodal multitask benchmark for evaluating large vision-language models (LVLMs). It contains 31,325 multiple-choice visual questions covering 32 core meta-tasks and 162 subtasks spanning diverse multimodal scenarios (e.g., vehicle driving, embodied navigation) that require visual recognition, localization, reasoning, expert knowledge and planning. The benchmark is intended to provide a task-map style, comprehensive evaluation of LVLMs’ multitask capabilities; the project provides dataset files on Hugging Face, code on GitHub, and a public leaderboard. Dataset release metadata indicates an MIT license.
No results tracked yet
RefCOCO / RefCOCO+ / RefCOCOg (overall)
RefCOCO / RefCOCO+ / RefCOCOg (referring-expression visual grounding datasets on MS COCO)
RefCOCO / RefCOCO+ / RefCOCOg are a family of referring-expression (visual grounding) benchmarks built on MS COCO images. Each dataset pairs natural-language referring expressions with target object instances (bounding boxes) so models can localize the described object in the image. Key characteristics: RefCOCO — ~142,209 expressions for ~50,000 object instances in 19,994 COCO images (short, concise expressions; split into train/val/testA/testB). RefCOCO+ — ~141,564 expressions for ~49,856 objects in 19,992 images; similar to RefCOCO but location/absolute-position words are banned (encourages appearance-based descriptions). RefCOCOg — ~85,474 (longer, more complex) expressions for ~54,822 objects in 26,711 images (collected with different protocol; expressions average much longer than RefCOCO/RefCOCO+). These datasets are widely used to evaluate referring expression comprehension / visual grounding / vision-language localization models. (Information from the original papers and dataset releases: Yu et al. (ECCV/ArXiv) and Mao et al. (CVPR/ArXiv), and standard dataset metadata / TFDS / HF dataset entries.)
No results tracked yet
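Referring expression comprehension on these datasets is typically scored by whether the predicted bounding box overlaps the ground-truth box with IoU ≥ 0.5. A minimal sketch of that protocol:

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of expressions whose predicted box reaches IoU >= 0.5
    with the ground-truth box (standard RefCOCO scoring)."""
    hits = sum(box_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Hypothetical predictions vs. ground truth for two expressions:
preds = [(0, 0, 10, 10), (0, 0, 2, 2)]
gts   = [(1, 1, 11, 11), (8, 8, 10, 10)]
print(grounding_accuracy(preds, gts))  # 0.5 (first hits, second misses)
```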
AI2D
AI2D (AI2 Diagrams Dataset) — “A Diagram Is Worth A Dozen Images”
AI2D (often cited from the paper “A Diagram Is Worth A Dozen Images” by Kembhavi et al., arXiv:1603.07396) is a dataset of elementary-school–level science diagrams intended for diagram understanding, parsing and multi-modal reasoning. The dataset contains roughly 4.9K diagrams (reported as ~4,903 images) that have been densely annotated with their constituent elements and the semantic/structural relationships between them. The authors introduce a Diagram Parse Graph (DPG) representation to capture diagram components (e.g., diagram regions/figures, diagram text, arrows/lines) and the relations that connect them; the dataset has been used for diagram parsing, diagram question answering / visual reasoning over diagrams, and related vision–language research. The original paper (ECCV/ArXiv) describes the collection, annotation format and the DPG representation. — Key references: arXiv:1603.07396, AI2D dataset on Hugging Face (lmms-lab/ai2d).
No results tracked yet
M-LongDoc
M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
M-LongDoc is a benchmark introduced in Chia et al. (arXiv:2411.06176) for multimodal super-long document understanding. The benchmark consists of 851 examples/questions constructed from long PDF documents that contain multimodal content (interleaved text, figures, tables, etc.) and is intended to evaluate models' ability to read and answer questions over very long, multi-page documents. The paper also provides an automated evaluation framework for reliably assessing open-ended model answers and proposes a retrieval-aware tuning approach that retrieves relevant pages/regions to enable efficient long-document reading. Project/paper information and a demo are available from the project page (https://multimodal-documents.github.io/) and the paper on arXiv.
No results tracked yet
MEGA-Bench (macro)
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
MEGA-Bench is a large-scale multimodal evaluation suite that consolidates over 500 real-world multimodal tasks into a unified evaluation format. Released by TIGER-Lab, MEGA-Bench provides curated high-quality data samples (images/videos + text) and standardized example/metric fields (e.g., task_name, task_description, example_text, example_media, metric_info, answer, eval_context) to enable cost-effective, accurate evaluation of multimodal/vision-language models. The Hugging Face dataset contains subsets (e.g., core and open), a test split (core ≈ 6.53k rows), and metadata describing each task and its evaluation metric. The accompanying paper (ICLR 2025 / arXiv:2410.10563) describes the benchmark and reports aggregated metrics including a macro metric across tasks. License: Apache-2.0. Main resources: paper (arXiv), code (GitHub), dataset and leaderboard on Hugging Face.
No results tracked yet
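The "macro" in MEGA-Bench's headline metric refers to averaging per-task mean scores so that every task counts equally, in contrast to a micro average pooled over all examples, which large tasks dominate. A generic sketch (not the benchmark's scoring code, with hypothetical task names):

```python
def macro_average(task_scores):
    """Mean of per-task means: every task weighs equally."""
    return sum(sum(s) / len(s) for s in task_scores.values()) / len(task_scores)

def micro_average(task_scores):
    """Mean over all examples pooled together: big tasks dominate."""
    pooled = [s for scores in task_scores.values() for s in scores]
    return sum(pooled) / len(pooled)

scores = {
    "ocr":   [1, 1, 1, 1, 1, 1, 1, 1],  # large, easy task
    "chess": [0, 0],                    # small, hard task
}
print(macro_average(scores))  # 0.5
print(micro_average(scores))  # 0.8
```

The gap between 0.5 and 0.8 on the same results shows why a macro metric is preferred when task sizes vary widely.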
OlympiadBench (full)
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
OlympiadBench is an Olympiad-level bilingual multimodal scientific benchmark (paper/ACL 2024, arXiv:2402.14008). It contains 8,476 problems drawn from high‑difficulty mathematics and physics competitions (including examples from Chinese exams) presented in both Chinese and English. Problems are multimodal (text + images) and the dataset includes expert annotations with step‑by‑step reasoning and final answers. The benchmark is intended to evaluate advanced reasoning, multimodal understanding, and problem‑solving capabilities of LLMs and LMMs (tasks: question answering / visual question answering). The Hugging Face dataset page groups the data into multiple subsets (math/physics, Chinese/English, multimodal/text‑only variants) and the paper/report refers to evaluations on a reported “full” split. (Sources: arXiv:2402.14008, ACL 2024 paper, Hugging Face dataset page, OpenBMB GitHub.)
No results tracked yet
NIAH/Multi-needle
MMNeedle (Multimodal Needle-in-a-haystack)
MMNeedle (MultiModal Needle-in-a-haystack) is a benchmark for evaluating long-context capabilities of multimodal large language models (MLLMs). The benchmark stresses sub-image level retrieval and understanding by asking models to locate a target "needle" (a sub-image or region) inside a large "haystack" composed of many images or stitched images to create very long visual contexts. The benchmark includes a protocol to generate labels for sub-image retrieval and supports multi-image and stitched-image inputs to scale context length; evaluation focuses on the model's ability to find the correct sub-image given textual instructions and visual context. The dataset, code and leaderboard are linked from the project page and GitHub repository for the MMNeedle benchmark.
No results tracked yet
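The stitched-image construction can be sketched with NumPy (an illustrative sketch, not the benchmark's generation code): sub-images are tiled into one large canvas to lengthen the visual context, and the ground-truth label is the grid cell containing the needle.

```python
import numpy as np

def stitch(images, rows, cols):
    """Tile equally sized (H, W, C) images into a rows x cols canvas."""
    h, w, c = images[0].shape
    canvas = np.zeros((rows * h, cols * w, c), dtype=images[0].dtype)
    for idx, img in enumerate(images):
        r, k = divmod(idx, cols)
        canvas[r * h:(r + 1) * h, k * w:(k + 1) * w] = img
    return canvas

rng = np.random.default_rng(0)
tiles = [rng.integers(0, 255, (8, 8, 3), dtype=np.uint8) for _ in range(6)]
needle_idx = 4                      # ground-truth grid cell of the "needle"
canvas = stitch(tiles, rows=2, cols=3)

# The needle tile is recoverable exactly from its grid cell:
r, k = divmod(needle_idx, 3)
assert (canvas[r * 8:(r + 1) * 8, k * 8:(k + 1) * 8] == tiles[needle_idx]).all()
print(canvas.shape)  # (16, 24, 3)
```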
RealWorldQA
RealWorldQA is a benchmark for vision-language models consisting of over 700 images drawn from real-world scenarios, each paired with a question and an easily verifiable answer. The dataset is fully human-annotated and is described by its authors as featuring high average image resolution and challenging spatial-understanding tasks.
No results tracked yet
BLINK
BLINK is a benchmark for vision-language models containing 14 visual perception tasks that humans can solve "within a blink" but that pose significant challenges for current multimodal large language models (MLLMs).
No results tracked yet
TextVQA
TextVQA is a dataset for visual question answering (VQA) that requires models to read and reason about text within images to answer questions. It contains 45,336 questions over 28,408 images, specifically designed for tasks where questions require understanding scene text in the given image. The dataset uses VQA accuracy for evaluation.
No results tracked yet
ZeroBench
ZeroBench is a lightweight but challenging visual reasoning benchmark for large multimodal models (LMMs). It consists of 100 hand-crafted questions and 334 subquestions covering a wide range of domains and visual capabilities. All 20 models evaluated scored 0.0% on the main questions, making ZeroBench effectively impossible for contemporary frontier LMMs.
No results tracked yet
InfoVQA
InfoVQA is a dataset for Vision Language Models that contains infographics collected from the Internet. It includes 30,000 questions and 5,000 images, with questions and answers that were manually annotated.
No results tracked yet
Related Tasks
World Models
World models are internal, learned representations in AI that function like a "computational snow globe," allowing an agent to understand its environment, predict future states, and simulate the outcomes of actions before acting in the real world. They are essential for building sophisticated AI systems that can reason, make decisions, and interact with complex environments by simulating dynamics like physics, motion, and spatial relationships.
Omni models
Omni models are AI models that take multiple modalities (language, vision, audio) as input and produce multiple modalities as output. Some examples of the first omni models include [Qwen2.5 Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) and [BAGEL](https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT).
Video-Language Models
Video Language Models (Video LLMs) are advanced AI systems that combine large language models with video processing capabilities to understand and generate descriptive content from videos. They bridge the gap between visual and textual information by using special encoders to convert video data into a format that a standard text-based large language model (LLM) can process, enabling tasks like video analysis, content generation, and question answering about video content.