Codesota · Benchmark · VQA v2.0

VQA v2.0

265K images paired with 1.1M questions. The dataset is balanced: each question appears with complementary images that yield different answers, reducing the language biases that inflated scores on v1.
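A quick way to see this balance in the raw data is to group the official annotations by question text and look for questions whose answers differ across images. A minimal sketch in Python, assuming the standard question/annotation JSON files distributed at visualqa.org (file names, paths, and split are illustrative):

```python
import json
from collections import defaultdict

# Load the official VQA v2 val-split files (names as distributed at
# visualqa.org; adjust paths/split for your setup).
with open("v2_OpenEnded_mscoco_val2014_questions.json") as f:
    question_text = {q["question_id"]: q["question"]
                     for q in json.load(f)["questions"]}
with open("v2_mscoco_val2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

# Group (image, answer) pairs by question text: a balanced/complementary
# pair asks the same question of two images that have different answers.
by_text = defaultdict(list)
for ann in annotations:
    by_text[question_text[ann["question_id"]]].append(
        (ann["image_id"], ann["multiple_choice_answer"]))

for text, entries in by_text.items():
    if len({ans for _, ans in entries}) > 1:
        print(text, entries[:2])  # same wording, conflicting answers
        break
```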

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

accuracy · higher is better
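Scores below follow the standard VQA accuracy metric: a predicted answer earns min(#matching human answers / 3, 1), averaged over the ten leave-one-annotator-out subsets of the ten collected answers. A minimal sketch, omitting the official evaluator's answer normalization (articles, punctuation, number words):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Standard VQA accuracy over the 10 human answers for one question."""
    pred = prediction.strip().lower()
    answers = [a.strip().lower() for a in human_answers]
    # Average min(matches/3, 1) over the 10 subsets formed by dropping
    # each annotator in turn (the metric's robustness trick).
    scores = []
    for i in range(len(answers)):
        subset = answers[:i] + answers[i + 1:]
        matches = sum(a == pred for a in subset)
        scores.append(min(1.0, matches / 3.0))
    return sum(scores) / len(scores)

# 2 of 10 annotators said "2": partial credit rather than all-or-nothing.
print(vqa_accuracy("2", ["2", "2", "3", "3", "3", "4", "4", "4", "4", "4"]))  # ≈ 0.6
```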

Trust tiers for accuracy: verified · paper · vendor · community · unverified
Rank · Model · Trust · Score · Year · Source

01 · Qwen2-VL 72B · verified · 87.6 · 2024 · Source ↗
    VQA-v2 test-dev. Table 1. arxiv:2409.12191
02 · InternVL2-76B · verified · 87.2 · 2024 · Source ↗
    VQA-v2 test-dev. Table 3. arxiv:2404.16821
03 · Gemini 1.5 Pro · verified · 86.5 · 2024 · Source ↗
    VQA-v2 test-dev. Table 5. Gemini 1.5 paper, arxiv:2403.05530
04 · PaLI-X 55B · verified · 86.1 · 2023 · Source ↗
    VQA-v2 test-dev. Table 3 of the PaLI-X paper, arxiv:2305.18565. State of the art among encoder-decoder VLMs at release.
05 · NVLM-D 1.0 72B · verified · 85.4 · 2024 · Source ↗
    VQA-v2 test-dev. NVLM paper (arxiv:2409.11402), Table 7. Decoder-only architecture. Highest among open-access models at time of release.
06 · NVLM-H 1.0 72B · verified · 85.2 · 2024 · Source ↗
    VQA-v2 test-dev. NVLM paper (arxiv:2409.11402), Table 7. Hybrid architecture.
07 · NVLM-X 1.0 72B · verified · 85.2 · 2024 · Source ↗
    VQA-v2 test-dev. NVLM paper (arxiv:2409.11402), Table 7. Cross-attention architecture.
08 · VILA-1.5 40B · verified · 84.3 · 2024 · Source ↗
    VQA-v2 test-dev. Reported in the NVLM paper (arxiv:2409.11402), Table 7. VILA-1.5 40B released Apr 2024.
09 · LLaVA-NeXT 34B · verified · 83.7 · 2024 · Source ↗
    VQA-v2 test-dev. Official LLaVA-NeXT (LLaVA-1.6) blog post, Jan 2024. Best open-source model at time of release.
10 · LLaVA-NeXT 13B · verified · 82.8 · 2024 · Source ↗
    VQA-v2 test-dev. Official LLaVA-NeXT (LLaVA-1.6) blog post, Jan 2024.
11 · CogVLM-17B · verified · 82.3 · 2023 · Source ↗
    VQA-v2 test-dev. CogVLM paper, NeurIPS 2024. Tsinghua/Zhipu.
12 · LLaVA-NeXT 7B (Mistral) · verified · 82.2 · 2024 · Source ↗
    VQA-v2 test-dev. Official LLaVA-NeXT (LLaVA-1.6) blog post, Jan 2024.
13 · BLIP-2 · verified · 82.19 · 2023 · Source ↗
    VQA-v2 test-dev. FlanT5-XXL backbone. Table 9. arxiv:2301.12597
14 · LLaVA-NeXT 7B (Vicuna) · verified · 81.8 · 2024 · Source ↗
    VQA-v2 test-dev. Official LLaVA-NeXT (LLaVA-1.6) blog post, Jan 2024.
15 · Pixtral Large · paper · 80.9 · 2024 · Source ↗
    VQA-v2. Self-reported by Mistral AI. Pixtral Large 124B released Nov 2024. Score reported as 0.809 (80.9%).
16 · Llama 3-V 405B · verified · 80.2 · 2024 · Source ↗
    VQA-v2 test-dev. Reported in the NVLM paper (arxiv:2409.11402), Table 7.
17 · LLaVA-1.5 13B · verified · 80.0 · 2023 · Source ↗
    VQA-v2 test-dev. Vicuna-13B backbone. Table 1 of "Improved Baselines with Visual Instruction Tuning" (arxiv:2310.03744, CVPR 2024). Also reported as a baseline in the LLaVA-NeXT blog.
18 · Llama 3-V 70B · verified · 79.1 · 2024 · Source ↗
    VQA-v2 test-dev. Reported in the NVLM paper (arxiv:2409.11402), Table 7.
19 · Pixtral-12B · paper · 78.6 · 2024 · Source ↗
    VQA-v2. Self-reported by Mistral AI. Pixtral-12B released Sep 2024. Score reported as 0.786 (78.6%).
20 · GPT-4o · verified · 78.5 · 2024 · Source ↗
    VQA-v2 test-dev. GPT-4o system card, Table 1. arxiv:2410.21276
21 · Llama 3.2 90B Vision Instruct · paper · 78.1 · 2024 · Source ↗
    VQA-v2. Reported by Meta for the Llama 3.2 90B multimodal model. Self-reported score of 0.781 (78.1%).
22 · GPT-4V · verified · 77.2 · 2023 · Source ↗
    VQA-v2 val, 0-shot. Table 2. GPT-4 Technical Report, arxiv:2303.08774
§ 03 · Lineage

VQAv2 in context.

See full visual question answering lineage →
Predecessors (1)

VQA · superseded · 2015-05
Models answered correctly without looking at the image; VQA v2's balanced pairs force visual grounding (see the sketch after this list).
This benchmark (1)

VQA v2 · saturated · 2017-04
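To make the predecessor note concrete, here is a hypothetical "blind" baseline of the kind that exposed v1's language priors: it answers from the question text alone, never looking at the image. All names and the key scheme are invented for illustration; v2's complementary pairs ensure the same question text maps to conflicting answers, breaking this shortcut.

```python
from collections import Counter, defaultdict

def train_prior(qa_pairs):  # qa_pairs: [(question_text, answer), ...]
    """Memorize the majority answer for each crude question 'type'."""
    table = defaultdict(Counter)
    for question, answer in qa_pairs:
        key = " ".join(question.lower().split()[:4])  # first words as the type
        table[key][answer] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}

def predict(prior, question):
    key = " ".join(question.lower().split()[:4])
    return prior.get(key, "yes")  # "yes" was the classic blind fallback

prior = train_prior([("What color is the banana?", "yellow"),
                     ("What color is the banana?", "yellow"),
                     ("What color is the banana?", "green")])
print(predict(prior, "What color is the banana?"))  # "yellow", image unseen
```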
§ 04 · Submit a result

Add to the leaderboard.

← Back to Visual Question Answering