Codesota · Benchmark · VQA v2.0

VQA v2.0.

VQA v2.0 contains 265K images with 1.1M questions. The dataset is balanced to reduce the language biases found in v1.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

accuracy

Accuracy is the reported evaluation metric for VQA v2.0. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
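
For readers unfamiliar with how VQA accuracy is usually computed, here is a minimal sketch. It assumes the commonly cited simplified formula, in which a predicted answer earns min(matches / 3, 1) credit against the ten human answers collected per question. The function names `vqa_accuracy` and `mean_accuracy` are hypothetical, and the sketch leaves out the answer normalization and annotator-subset averaging that the official evaluation script performs, so its numbers will not exactly reproduce published scores.

```python
from collections import Counter


def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy for one question: min(matches / 3, 1).

    Assumption: answers are compared after lowercasing and trimming only.
    The official script applies further normalization (punctuation,
    articles, number words), which is omitted here.
    """
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[predicted.strip().lower()]
    return min(matches / 3.0, 1.0)


def mean_accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Average per-question accuracy, reported as a percentage."""
    scores = [vqa_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)
```

Under this formula, an answer given by at least three annotators receives full credit, and an answer given by one or two receives partial credit.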

Trust tiers for accuracy: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.
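
The muting rule can be sketched as follows. This is an illustration of the rule as stated, not Codesota's actual implementation: the `Result` type and `is_muted` function are hypothetical, and treating "scored better" as strictly greater (so ties are not muted) is an assumption. The sample rows reuse scores and years from the leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score: float
    year: int


def is_muted(row: Result, rows: list[Result]) -> bool:
    """True if an earlier or same-year result already scored strictly better."""
    return any(o.year <= row.year and o.score > row.score for o in rows)


rows = [
    Result("Flamingo (32-shot)", 67.6, 2022),
    Result("BLIP CapFilt-L", 78.32, 2022),
    Result("PaLI-X 55B", 86.1, 2023),
    Result("NVLM-D 1.0 72B", 85.4, 2024),
]

# Names of the rows that would be shown muted under this rule.
muted = [r.model for r in rows if is_muted(r, rows)]
```

In this sample, Flamingo is muted by a better same-year result and NVLM-D by the earlier PaLI-X score, while BLIP CapFilt-L and PaLI-X each led at their publication year.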

| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | Qwen2-VL 72B | verified | 87.6 | 2026 | VQA v2 test-dev. Table 1, arXiv:2409.12191. |
| 02 | InternVL2-76B | verified | 87.2 | 2026 | VQA v2 test-dev. Table 3, arXiv:2404.16821. |
| 03 | Gemini 1.5 Pro | verified | 86.5 | 2026 | VQA v2 test-dev. Table 5, Gemini 1.5 paper (arXiv:2403.05530). |
| 04 | PaLI-X 55B | verified | 86.1 | 2023 | VQA v2 test-dev. Table 3, PaLI-X paper (arXiv:2305.18565). State of the art for encoder-decoder VLMs. |
| 05 | NVLM-D 1.0 72B | verified | 85.4 | 2024 | VQA v2 test-dev. Table 7, NVLM paper (arXiv:2409.11402). Decoder-only architecture. Highest among open-access models at time of release. |
| 06 | NVLM-H 1.0 72B | verified | 85.2 | 2024 | VQA v2 test-dev. Table 7, NVLM paper (arXiv:2409.11402). Hybrid architecture. |
| 07 | NVLM-X 1.0 72B | verified | 85.2 | 2024 | VQA v2 test-dev. Table 7, NVLM paper (arXiv:2409.11402). Cross-attention architecture. |
| 08 | VILA-1.5 40B | verified | 84.3 | 2024 | VQA v2 test-dev. Reported in Table 7 of the NVLM paper (arXiv:2409.11402). VILA-1.5 40B released Apr 2024. |
| 09 | LLaVA-NeXT 34B | verified | 83.7 | 2024 | VQA v2 test-dev. Official LLaVA-NeXT (LLaVA-1.6) blog post, Jan 2024. Best open-source model at time of release. |
| 10 | BLIP3-o (8B) | unverified | 83.1 | 2025 | |
| 11 | LLaVA-NeXT 13B | verified | 82.8 | 2024 | VQA v2 test-dev. Official LLaVA-NeXT (LLaVA-1.6) blog post, Jan 2024. |
| 12 | BLIP-2 ViT-g OPT 6.7B | unverified | 82.3 | 2023 | |
| 13 | CogVLM-17B | verified | 82.3 | 2023 | VQA v2 test-dev. NeurIPS 2024. Tsinghua/Zhipu. |
| 14 | LLaVA-NeXT 7B (Mistral) | verified | 82.2 | 2024 | VQA v2 test-dev. Official LLaVA-NeXT (LLaVA-1.6) blog post, Jan 2024. |
| 15 | BLIP-2 | verified | 82.19 | 2023 | VQA v2 test-dev. FlanT5-XXL backbone. Table 9, arXiv:2301.12597. |
| 16 | LLaVA-NeXT 7B (Vicuna) | verified | 81.8 | 2024 | VQA v2 test-dev. Official LLaVA-NeXT (LLaVA-1.6) blog post, Jan 2024. |
| 17 | AIMv2 ViT-3B/14 + Llama 3.0 8B | unverified | 80.9 | 2024 | |
| 18 | Pixtral Large | paper | 80.9 | 2024 | VQA v2. Self-reported by Mistral AI. Pixtral Large 124B released Nov 2024. Score reported as 0.809 (80.9%). |
| 19 | Llama 3-V (405B) | unverified | 80.2 | 2024 | |
| 20 | Llama 3-V 405B | verified | 80.2 | 2024 | VQA v2 test-dev. Reported in Table 7 of the NVLM paper (arXiv:2409.11402). |
| 21 | LLaVA-1.5 | verified | 80 | 2026 | VQA v2 test-dev. 13B (Vicuna) backbone. Table 1, arXiv:2310.03744. |
| 22 | LLaVA-1.5 13B | verified | 80 | 2023 | VQA v2 test-dev. From "Improved Baselines with Visual Instruction Tuning" (LLaVA-1.5), CVPR 2024. Also reported as a baseline in the LLaVA-NeXT blog. |
| 23 | ZAYA1-VL-8B | unverified | 80 | 2026 | |
| 24 | Llama 3-V 70B | verified | 79.1 | 2024 | VQA v2 test-dev. Reported in Table 7 of the NVLM paper (arXiv:2409.11402). |
| 25 | Pixtral-12B | paper | 78.6 | 2024 | VQA v2. Self-reported by Mistral AI. Pixtral-12B released Sep 2024. Score reported as 0.786 (78.6%). |
| 26 | GPT-4o | verified | 78.5 | 2024 | VQA v2 test-dev. GPT-4o system card, Table 1, arXiv:2410.21276. |
| 27 | BLIP CapFilt-L | unverified | 78.32 | 2022 | |
| 28 | Llama 3.2 90B Vision Instruct | paper | 78.1 | 2024 | VQA v2. Reported by Meta for the Llama 3.2 90B multimodal model. Self-reported score of 0.781 (78.1%). |
| 29 | GPT-4V | verified | 77.2 | 2023 | VQA v2 val, 0-shot. Table 2, GPT-4 Technical Report (arXiv:2303.08774). |
| 30 | GLIPv2-H (fine-tuned) | unverified | 74.8 | 2022 | |
| 31 | Chameleon-MultiTask | unverified | 69.6 | 2024 | |
| 32 | Flamingo (32-shot) | unverified | 67.6 | 2022 | |

§ 03 · Lineage

VQAv2 in context.

Predecessors (1)
VQA (2015-05, superseded)
Models could often answer correctly without looking at the image; VQAv2's balanced pairs force visual grounding.

This benchmark (1)
VQAv2 (2017-04, saturated)