Visual Question Answering
Visual question answering (VQA) is one of the earliest multimodal reasoning tasks: given an image and a natural language question, produce the correct answer. VQA v2.0 (2017) became the field's standard benchmark, and later benchmarks such as GQA, OK-VQA, and TextVQA pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. In its classic form the task was largely "solved" once multimodal LLMs arrived, with models such as GPT-4V and Gemini saturating the standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is the demonstration that vision-language models need more than pattern matching: they need genuine visual understanding.
VQA v2.0
265K images with 1.1M questions, balanced to reduce the language biases found in v1.
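Scores on VQA v2.0 are conventionally reported as VQA accuracy, where a predicted answer counts as fully correct if at least three of the ten human annotators gave the same answer. The sketch below shows the commonly quoted simplified form, min(matches / 3, 1). The function name `vqa_accuracy` and the lowercase-and-strip normalization are assumptions made for illustration; the official evaluation script applies more extensive answer normalization and averages over subsets of annotators, so its numbers can differ slightly.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy for a single question.

    Assumption: answers are compared after lowercasing and stripping
    whitespace only. The official script normalizes more aggressively.
    """
    normalized = predicted.strip().lower()
    # Count how many annotators gave exactly the predicted answer.
    matches = sum(1 for a in human_answers if a.strip().lower() == normalized)
    # Three or more agreeing annotators give full credit.
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(predictions: list[str], answers: list[list[str]]) -> float:
    """Average the per-question score over a set of questions."""
    scores = [vqa_accuracy(p, a) for p, a in zip(predictions, answers)]
    return sum(scores) / len(scores) if scores else 0.0
```

Partial credit is the point of this design: a prediction that agrees with only one or two annotators still earns one third or two thirds of a point, which softens the effect of annotator disagreement on ambiguous questions.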
Top 10
Leading models on VQA v2.0.
All datasets
6 datasets tracked for this task.