Visual Question Answering · benchmark dataset · 2017 · EN

Visual Question Answering v2.0.

265K images with 1.1M questions. The dataset is balanced to reduce the language biases found in v1, by pairing questions with similar images that yield different answers.

§ 01 · Leaderboard

Best published scores.

16 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy (higher is better) · 16 rows
| # | Model | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|
| 01 | Qwen2-VL 72B (Open) | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… | 87.60 |
| 02 | InternVL2-76B (Open) | Shanghai AI Lab | Apr 2024 | InternVL: Scaling up Vision Foundation Models and Aligni… | 87.20 |
| 03 | Gemini 1.5 Pro (API) | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 86.50 |
| 04 | BLIP3-o (8B) | | May 2025 | BLIP3-o: A Family of Fully Open Unified Multimodal Model… · code | 83.10 |
| 05 | BLIP-2 ViT-g OPT 6.7B | | Jan 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with F… · code | 82.30 |
| 06 | BLIP-2 (Open) | Salesforce | Jan 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with F… | 82.19 |
| 07 | AIMv2 ViT-3B/14 + Llama 3.0 8B | | Nov 2024 | Multimodal Autoregressive Pre-training of Large Vision E… · code | 80.90 |
| 08 | Llama 3-V (405B) | | Jul 2024 | The Llama 3 Herd of Models · code | 80.20 |
| 09 | LLaVA-1.5 (Open) | UW-Madison / Microsoft | Oct 2023 | Improved Baselines with Visual Instruction Tuning (LLaVA… | 80 |
| 10 | ZAYA1-VL-8B | | May 2026 | pwc-dump · code | 80 |
| 11 | GPT-4o (API) | OpenAI | Oct 2024 | SWE-bench Verified | 78.50 |
| 12 | BLIP CapFilt-L | | Jan 2022 | BLIP: Bootstrapping Language-Image Pre-training for Unif… · code | 78.32 |
| 13 | GPT-4V | | Mar 2023 | GPT-4 Technical Report | 77.20 |
| 14 | GLIPv2-H (fine-tuned) | | Jun 2022 | GLIPv2: Unifying Localization and Vision-Language Unders… · code | 74.80 |
| 15 | Chameleon-MultiTask | | May 2024 | Chameleon: Mixed-Modal Early-Fusion Foundation Models · code | 69.60 |
| 16 | Flamingo (32-shot) | | Apr 2022 | Flamingo: a Visual Language Model for Few-Shot Learning · code | 67.60 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
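The ordering rule stated for the leaderboard (sort by score, break ties by submission date) can be sketched as follows. This is a minimal illustration, not Codesota's implementation: the `Row` type and `rank` function are hypothetical names, and the assumption that the earlier submission wins a tie is inferred from rows 09 and 10 of the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    submitted: tuple[int, int]  # (year, month), as shown in the table
    accuracy: float


def rank(rows: list[Row]) -> list[Row]:
    # Highest accuracy first; on equal accuracy, the earlier submission
    # ranks higher (assumed tie-break direction).
    return sorted(rows, key=lambda r: (-r.accuracy, r.submitted))


rows = [
    Row("ZAYA1-VL-8B", (2026, 5), 80.0),
    Row("GPT-4o", (2024, 10), 78.50),
    Row("LLaVA-1.5", (2023, 10), 80.0),
    Row("Llama 3-V (405B)", (2024, 7), 80.20),
]
ranked = rank(rows)
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d} {row.model} {row.accuracy:.2f}")
```

With these four rows, LLaVA-1.5 and ZAYA1-VL-8B tie on accuracy and the older entry is listed first, matching their relative order in the table.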
§ 03 · Progress

5 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Jan 28, 2022 · BLIP CapFilt-L · 78.32
  2. Jan 30, 2023 · BLIP-2 ViT-g OPT 6.7B · 82.30
  3. Feb 15, 2024 · Gemini 1.5 Pro · Google · 86.50
  4. Apr 25, 2024 · InternVL2-76B · Shanghai AI Lab · 87.20
  5. Sep 18, 2024 · Qwen2-VL 72B · Alibaba · 87.60
Fig 3 · SOTA-setting models only. 5 entries span Jan 2022 to Sep 2024.
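The SOTA line can be read as a running maximum over submissions in chronological order. The sketch below assumes that reading; `sota_steps` is a hypothetical helper, and the day-of-month values for the three non-record entries are placeholders because the leaderboard lists only their month.

```python
from datetime import date


def sota_steps(entries):
    """Return the entries that set a new best score, in chronological order."""
    steps = []
    best = float("-inf")
    for released, model, accuracy in sorted(entries):
        if accuracy > best:  # strictly better than every earlier entry
            steps.append((released, model, accuracy))
            best = accuracy
    return steps


entries = [
    (date(2022, 1, 28), "BLIP CapFilt-L", 78.32),
    (date(2022, 4, 1), "Flamingo (32-shot)", 67.60),     # day is a placeholder
    (date(2022, 6, 1), "GLIPv2-H (fine-tuned)", 74.80),  # day is a placeholder
    (date(2023, 1, 30), "BLIP-2 ViT-g OPT 6.7B", 82.30),
    (date(2023, 10, 1), "LLaVA-1.5", 80.0),              # day is a placeholder
    (date(2024, 2, 15), "Gemini 1.5 Pro", 86.50),
    (date(2024, 4, 25), "InternVL2-76B", 87.20),
    (date(2024, 9, 18), "Qwen2-VL 72B", 87.60),
]
steps = sota_steps(entries)
for released, model, accuracy in steps:
    print(released.isoformat(), model, f"{accuracy:.2f}")
```

Entries that score below the best seen so far, such as LLaVA-1.5 at 80 after BLIP-2 at 82.30, stay in the leaderboard but do not appear as steps.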
§ 04 · Literature

14 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 A public checkpoint or API endpoint
  • 02 A reproduction script with frozen commit + seed
  • 03 Declared evaluation environment (Python, deps)
  • 04 One row per metric declared by this dataset
  • 05 A contact so we can follow up on discrepancies
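One way to check a submission against the five requirements above before sending it is a small completeness test. The page does not define a submission format, so everything here is an assumption for illustration: the manifest layout, the field names, the `missing_fields` helper, and the example values are all hypothetical.

```python
# Hypothetical field names; the submission guide may use different ones.
REQUIRED_FIELDS = (
    "checkpoint_or_endpoint",  # 01: public checkpoint or API endpoint
    "repro_script",            # 02: reproduction script
    "commit",                  # 02: frozen commit
    "seed",                    # 02: seed (0 is a valid value)
    "environment",             # 03: Python version and dependencies
    "contact",                 # 05: contact for follow-up
)


def missing_fields(manifest: dict, dataset_metrics: list[str]) -> list[str]:
    """List the required entries that are absent from the manifest."""
    problems = [name for name in REQUIRED_FIELDS if manifest.get(name) is None]
    # 04: one reported score per metric declared by the dataset.
    reported = manifest.get("scores", {})
    problems += [f"scores.{m}" for m in dataset_metrics if m not in reported]
    return problems


# Placeholder values only; none of these refer to a real submission.
manifest = {
    "checkpoint_or_endpoint": "https://example.org/model.ckpt",
    "repro_script": "eval_vqa.py",
    "commit": "0123abc",
    "seed": 0,
    "environment": {"python": "3.11", "deps": "requirements.txt"},
    "scores": {"accuracy": 0.0},
}
missing = missing_fields(manifest, ["accuracy"])
print(missing)
```

For this dataset the declared metric list would be just `["accuracy"]`, since the leaderboard indexes a single metric.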