
Visual Question Answering

Visual question answering (VQA) is the original multimodal reasoning task: given an image and a natural-language question, produce the correct answer. The original VQA dataset (2015) defined the field and VQAv2 (2017) rebalanced it, while modern benchmarks such as GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating the standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's lasting contribution is establishing that vision-language models need more than pattern matching; they need genuine visual understanding.

6 datasets · 56 results · Canonical metric: accuracy

Canonical Benchmark

VQA v2.0

265K images with 1.1M questions, balanced to reduce the language biases found in v1.

Primary metric: accuracy
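
The VQA accuracy metric scores a prediction against the 10 human answers collected per question: an answer earns credit in proportion to how many annotators gave it, capped at full credit once three agree. Below is a simplified sketch of that rule; the official evaluation script additionally normalizes answer strings and averages over leave-one-out subsets of the annotators.

```python
# Simplified sketch of VQA accuracy: a prediction scores
# min(#matching human answers / 3, 1.0) against the 10 collected answers.
# The official script also normalizes strings (case, punctuation, number
# words) and averages over leave-one-out annotator subsets.
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    matches = sum(answer == prediction for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Example: three of the ten annotators answered "2", so the prediction gets full credit.
answers = ["2", "2", "2", "two", "3", "2 dogs", "two", "3", "2 or 3", "couple"]
print(vqa_accuracy("2", answers))  # 1.0
```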

Top 10

Leading models on VQA v2.0.

Rank  Model            Accuracy  Year  Source
1     Qwen2-VL 72B         87.6  2024  paper
2     InternVL2-76B        87.2  2024  paper
3     Gemini 1.5 Pro       86.5  2024  paper
4     PaLI-X 55B           86.1  2023  paper
5     NVLM-D 1.0 72B       85.4  2024  paper
6     NVLM-X 1.0 72B       85.2  2024  paper
7     NVLM-H 1.0 72B       85.2  2024  paper
8     VILA-1.5 40B         84.3  2024  paper
9     LLaVA-NeXT 34B       83.7  2024  paper
10    LLaVA-NeXT 13B       82.8  2024  paper

All datasets

6 datasets tracked for this task.


Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
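
A minimal sketch of what that looks like with the transformers pipeline is below; the checkpoint name and image path are illustrative placeholders, not something this page specifies.

```python
# Minimal sketch of VQA inference via the Hugging Face transformers pipeline.
# dandelin/vilt-b32-finetuned-vqa is an illustrative Hub checkpoint; any
# VQA-capable model can be substituted, and the image path is a placeholder.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
result = vqa(image="path/to/image.jpg", question="What color is the car?")
print(result)  # e.g. [{"score": 0.92, "answer": "red"}, ...]
```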
