
Visual Question Answering

Visual question answering (VQA) is the original multimodal reasoning task: given an image and a natural-language question, produce the correct answer. The original VQA dataset (2015) defined the field, and VQA v2.0 (2017) rebalanced it to curb language priors; newer benchmarks such as GQA, OK-VQA, and TextVQA push toward compositional reasoning, external knowledge, and OCR-dependent understanding. In its classic form the task was largely "solved" once multimodal LLMs arrived, with GPT-4V and Gemini saturating the standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's lasting contribution is establishing that vision-language models need more than pattern matching; they need genuine visual understanding.
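Operationally the task is a single call: hand a model an image plus a question string and read back an answer. A minimal sketch, assuming the Hugging Face transformers visual-question-answering pipeline with the ViLT checkpoint fine-tuned on VQA v2.0; the image path and question are illustrative.

```python
# Minimal VQA round trip: one image, one question, ranked candidate answers.
# Assumes: pip install transformers torch pillow; "kitchen.jpg" is a placeholder.
from transformers import pipeline

vqa = pipeline(
    "visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa",  # classic VQAv2-finetuned baseline
)

# The pipeline accepts a local path, URL, or PIL image plus a question string.
predictions = vqa(image="kitchen.jpg", question="How many chairs are at the table?")

for p in predictions:  # each entry is {"answer": str, "score": float}
    print(f"{p['answer']}: {p['score']:.3f}")
```

Classification-style models such as ViLT pick from a fixed answer vocabulary; the leaderboard below is dominated by generative multimodal LLMs that produce free-form answers instead.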

6 datasets · 56 results · Canonical metric: accuracy
Canonical Benchmark

VQA v2.0

265K images with 1.1M questions. Balanced so that each question is paired with complementary images that lead to different answers, reducing the language biases found in v1.

Primary metric: accuracy
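VQA accuracy is a consensus metric rather than exact match: each question carries 10 human answers, and a prediction scores min(matches / 3, 1), so agreement with at least three annotators counts as fully correct. A minimal sketch of that formula; the official evaluator additionally normalizes answer strings (lowercasing, stripping articles and punctuation) and averages the score over all 9-annotator subsets, which this sketch omits.

```python
# Simplified VQAv2 consensus accuracy: a prediction is fully correct if at
# least 3 of the 10 human annotators gave the same answer.

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Score one prediction against the 10 crowd answers for a question."""
    matches = sum(1 for ans in human_answers if ans == prediction)
    return min(matches / 3.0, 1.0)

def dataset_accuracy(predictions: list[str], answer_sets: list[list[str]]) -> float:
    """Mean per-question score, reported as a percentage on leaderboards."""
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, answer_sets)]
    return 100.0 * sum(scores) / len(scores)

# Example: 4 of 10 annotators answered "2", so the prediction scores min(4/3, 1) = 1.0.
print(vqa_accuracy("2", ["2", "2", "3", "2", "two", "2", "3", "4", "3", "1"]))
```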

Top 10

Leading models on VQA v2.0.

Rank  Model            Accuracy  Year  Source
1     Qwen2-VL 72B         87.6  2024  paper
2     InternVL2-76B        87.2  2024  paper
3     Gemini 1.5 Pro       86.5  2024  paper
4     PaLI-X 55B           86.1  2023  paper
5     NVLM-D 1.0 72B       85.4  2024  paper
6     NVLM-X 1.0 72B       85.2  2024  paper
7     NVLM-H 1.0 72B       85.2  2024  paper
8     VILA-1.5 40B         84.3  2024  paper
9     LLaVA-NeXT 34B       83.7  2024  paper
10    LLaVA-NeXT 13B       82.8  2024  paper


