Who leads the VQA v2.0 benchmark?

Qwen2-VL 72B currently leads VQA v2.0 with an accuracy of 87.60.

What is the state-of-the-art score on VQA v2.0?

The state-of-the-art result on VQA v2.0 is 87.60 (accuracy), achieved by Qwen2-VL 72B as of 2026.

How many models are tracked on VQA v2.0?

Codesota tracks 16 models on VQA v2.0.

When was the VQA v2.0 leaderboard last updated?

The VQA v2.0 leaderboard on Codesota includes results through May 2026 (ZAYA1-VL-8B), with the earliest tracked result from January 2022 (BLIP CapFilt-L).

Visual Question Answering · benchmark dataset · 2017 · EN

Visual Question Answering v2.0.

Name: Visual Question Answering v2.0 Benchmark Results
Creator: Codesota
Published: 2022-01-01
License: https://creativecommons.org/licenses/by/4.0/

265K images with 1.1M questions. Balanced dataset to reduce language biases found in v1.

§ 01 · Leaderboard

Best published scores.

16 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.

Primary: accuracy · higher is better

#	Model	Access	Org	Submitted	Paper / code	accuracy
01	Qwen2-VL 72B	Open	Alibaba	Sep 2024	Qwen2-VL: Enhancing Vision-Language Model's Perception o…	87.60
02	InternVL2-76B	Open	Shanghai AI Lab	Apr 2024	InternVL: Scaling up Vision Foundation Models and Aligni…	87.20
03	Gemini 1.5 Pro	API	Google	Feb 2024	Gemini 1.5: Unlocking multimodal understanding across mi…	86.50
04	BLIP3-o (8B)	—	—	May 2025	BLIP3-o: A Family of Fully Open Unified Multimodal Model… · code	83.10
05	BLIP-2 ViT-g OPT 6.7B	—	—	Jan 2023	BLIP-2: Bootstrapping Language-Image Pre-training with F… · code	82.30
06	BLIP-2	Open	Salesforce	Jan 2023	BLIP-2: Bootstrapping Language-Image Pre-training with F…	82.19
07	AIMv2 ViT-3B/14 + Llama 3.0 8B	—	—	Nov 2024	Multimodal Autoregressive Pre-training of Large Vision E… · code	80.90
08	Llama 3-V (405B)	—	—	Jul 2024	The Llama 3 Herd of Models · code	80.20
09	LLaVA-1.5	Open	UW-Madison / Microsoft	Oct 2023	Improved Baselines with Visual Instruction Tuning (LLaVA…	80
10	ZAYA1-VL-8B	—	—	May 2026	pwc-dump · code	80
11	GPT-4o	API	OpenAI	Oct 2024	SWE-bench Verified	78.50
12	BLIP CapFilt-L	—	—	Jan 2022	BLIP: Bootstrapping Language-Image Pre-training for Unif… · code	78.32
13	GPT-4V	—	—	Mar 2023	GPT-4 Technical Report	77.20
14	GLIPv2-H (fine-tuned)	—	—	Jun 2022	GLIPv2: Unifying Localization and Vision-Language Unders… · code	74.80
15	Chameleon-MultiTask	—	—	May 2024	Chameleon: Mixed-Modal Early-Fusion Foundation Models · code	69.60
16	Flamingo (32-shot)	—	—	Apr 2022	Flamingo: a Visual Language Model for Few-Shot Learning · code	67.60

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
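The ordering rule stated for this table (higher accuracy first, ties broken by submission date) can be sketched in a few lines. This is a minimal illustration, not Codesota's actual code: the `Row` type and `rank` function are hypothetical names, and the direction of the tie-break (earlier submission first) is an assumption inferred from rows 09 and 10, where LLaVA-1.5 (Oct 2023) sits above ZAYA1-VL-8B (May 2026) at the same score.

```python
from typing import List, NamedTuple, Tuple


class Row(NamedTuple):
    model: str
    submitted: Tuple[int, int]  # (year, month), as shown in the Submitted column
    accuracy: float


# A few rows copied from the leaderboard, deliberately out of order.
rows = [
    Row("ZAYA1-VL-8B", (2026, 5), 80.0),
    Row("GPT-4o", (2024, 10), 78.50),
    Row("LLaVA-1.5", (2023, 10), 80.0),
    Row("Qwen2-VL 72B", (2024, 9), 87.60),
    Row("InternVL2-76B", (2024, 4), 87.20),
]


def rank(rows: List[Row]) -> List[Row]:
    # Higher accuracy first; on equal accuracy, the earlier submission
    # comes first (assumed direction of the tie-break).
    return sorted(rows, key=lambda r: (-r.accuracy, r.submitted))


ranked = rank(rows)
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d}\t{row.model}\t{row.accuracy:.2f}")
```

Sorting on the tuple `(-accuracy, submitted)` keeps the rule in one place, so changing the tie-break only means changing the second key.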

§ 03 · Progress

5 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy

Date	Model	Org	accuracy
Jan 28, 2022	BLIP CapFilt-L	—	78.32
Jan 30, 2023	BLIP-2 ViT-g OPT 6.7B	—	82.30
Feb 15, 2024	Gemini 1.5 Pro	Google	86.50
Apr 25, 2024	InternVL2-76B	Shanghai AI Lab	87.20
Sep 18, 2024	Qwen2-VL 72B	Alibaba	87.60

Fig 3 · SOTA-setting models only. 5 entries span Jan 2022 → Sep 2024.
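The SOTA line is a running maximum over the leaderboard in date order: an entry is re-listed only if it strictly beats every earlier score. The sketch below reproduces that under stated assumptions. The function name `sota_steps` is hypothetical, and dates are taken at the month granularity of the Submitted column, so rows sharing a month (the two BLIP-2 entries in Jan 2023) fall back on their leaderboard order.

```python
# (model, (year, month), accuracy), in leaderboard order.
LEADERBOARD = [
    ("Qwen2-VL 72B", (2024, 9), 87.60),
    ("InternVL2-76B", (2024, 4), 87.20),
    ("Gemini 1.5 Pro", (2024, 2), 86.50),
    ("BLIP3-o (8B)", (2025, 5), 83.10),
    ("BLIP-2 ViT-g OPT 6.7B", (2023, 1), 82.30),
    ("BLIP-2", (2023, 1), 82.19),
    ("AIMv2 ViT-3B/14 + Llama 3.0 8B", (2024, 11), 80.90),
    ("Llama 3-V (405B)", (2024, 7), 80.20),
    ("LLaVA-1.5", (2023, 10), 80.0),
    ("ZAYA1-VL-8B", (2026, 5), 80.0),
    ("GPT-4o", (2024, 10), 78.50),
    ("BLIP CapFilt-L", (2022, 1), 78.32),
    ("GPT-4V", (2023, 3), 77.20),
    ("GLIPv2-H (fine-tuned)", (2022, 6), 74.80),
    ("Chameleon-MultiTask", (2024, 5), 69.60),
    ("Flamingo (32-shot)", (2022, 4), 67.60),
]


def sota_steps(rows):
    """Return the rows that strictly beat the best accuracy seen so far, in date order."""
    steps = []
    best = float("-inf")
    # sorted() is stable, so rows sharing a month keep their leaderboard order.
    for model, date, accuracy in sorted(rows, key=lambda r: r[1]):
        if accuracy > best:
            steps.append((model, date, accuracy))
            best = accuracy
    return steps


steps = sota_steps(LEADERBOARD)
for model, (year, month), accuracy in steps:
    print(f"{year}-{month:02d}\t{model}\t{accuracy:.2f}")
```

Run on the 16 leaderboard rows, this yields the same five models as the progress list, from BLIP CapFilt-L (78.32) to Qwen2-VL 72B (87.60).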

§ 04 · Literature

14 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
May 2025 · BLIP3-o (8B) · arXiv, code

Multimodal Autoregressive Pre-training of Large Vision Encoders
Nov 2024 · AIMv2 ViT-3B/14 + Llama 3.0 8B · arXiv, code

SWE-bench Verified
Oct 2024 · GPT-4o · arXiv

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Sep 2024 · Qwen2-VL 72B · arXiv

The Llama 3 Herd of Models
Jul 2024 · Llama 3-V (405B) · arXiv, code

Chameleon: Mixed-Modal Early-Fusion Foundation Models
May 2024 · Chameleon-MultiTask · arXiv, code

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Apr 2024 · InternVL2-76B · arXiv

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Feb 2024 · Gemini 1.5 Pro · arXiv

Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)
Oct 2023 · LLaVA-1.5 · arXiv

GPT-4 Technical Report
Mar 2023 · GPT-4V · arXiv

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Jan 2023 · BLIP-2 ViT-g OPT 6.7B, BLIP-2 · arXiv, code

GLIPv2: Unifying Localization and Vision-Language Understanding
Jun 2022 · GLIPv2-H (fine-tuned) · arXiv, code

Flamingo: a Visual Language Model for Few-Shot Learning
Apr 2022 · Flamingo (32-shot) · arXiv, code

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Jan 2022 · BLIP CapFilt-L · arXiv, code

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
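The five requirements above can be checked mechanically before sending anything. The sketch below is purely illustrative: the `Submission` class, its field names, and the `problems` method are hypothetical and do not describe Codesota's actual submission format, which this page does not specify. All values in the example are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The only metric this dataset declares on the page.
DATASET_METRICS = ["accuracy"]


@dataclass
class Submission:
    """Hypothetical manifest; field names are illustrative, not Codesota's schema."""

    checkpoint_or_endpoint: str   # 01: public checkpoint or API endpoint
    repro_script: str             # 02: reproduction script
    commit: str                   # 02: frozen commit hash
    seed: Optional[int]           # 02: random seed
    environment: Dict[str, str]   # 03: Python version and dependencies
    scores: Dict[str, float]      # 04: one value per declared metric
    contact: str                  # 05: where to send follow-up questions

    def problems(self) -> List[str]:
        """Return human-readable gaps; an empty list means the checklist is met."""
        issues = []
        if not self.checkpoint_or_endpoint:
            issues.append("missing checkpoint or API endpoint")
        if not self.repro_script or not self.commit or self.seed is None:
            issues.append("reproduction script needs a frozen commit and a seed")
        if "python" not in self.environment:
            issues.append("evaluation environment must declare the Python version")
        if sorted(self.scores) != sorted(DATASET_METRICS):
            issues.append("scores must contain exactly one row per declared metric")
        if not self.contact:
            issues.append("missing contact")
        return issues


example = Submission(
    checkpoint_or_endpoint="https://example.org/my-model",  # placeholder URL
    repro_script="eval_vqa.py",                             # placeholder file name
    commit="0123abc",                                       # placeholder hash
    seed=0,
    environment={"python": "3.11", "deps": "requirements.txt"},
    scores={"accuracy": 0.0},                               # placeholder, not a real result
    contact="",                                             # left empty on purpose
)
print(example.problems())
```

Here the example leaves the contact empty on purpose, so the check reports that single gap.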
