Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task: given an image and a natural language question, produce the correct answer. The original VQA dataset (2015) defined the field, and VQAv2 (2017) rebalanced it to curb language priors; modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating the standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching; they need genuine visual understanding.
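In its modern form, the task reduces to a single inference call. A minimal sketch using the transformers visual-question-answering pipeline; the image path and question are placeholders, and the checkpoint is just one commonly used option:

    from transformers import pipeline
    from PIL import Image

    # dandelin/vilt-b32-finetuned-vqa is the pipeline's usual default;
    # swap in any VQA-capable checkpoint.
    vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

    image = Image.open("street_scene.jpg")  # placeholder path
    result = vqa(image=image, question="How many people are crossing the street?")

    # The pipeline returns ranked candidate answers,
    # e.g. [{'answer': '2', 'score': 0.87}, ...]
    print(result[0]["answer"], round(result[0]["score"], 2))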
VQA v2.0
265K images paired with 1.1M questions. The dataset is balanced to reduce the language biases found in v1: each question is paired with complementary images that yield different answers, so a model cannot succeed on question text alone.
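One way to pull the dataset is through the datasets library. The hub id HuggingFaceM4/VQAv2 is a community mirror and an assumption here, as is the exact field layout; check the dataset card of whichever copy you use:

    from datasets import load_dataset

    # Streaming avoids downloading the full 265K-image corpus up front.
    ds = load_dataset("HuggingFaceM4/VQAv2", split="validation", streaming=True)

    sample = next(iter(ds))
    # Field names assumed to follow the official VQA annotation format:
    # one question plus ten human answers per example.
    print(sample["question"])
    print([a["answer"] for a in sample["answers"]])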
Top 10
Leading models on VQA v2.0.
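The accuracy behind these leaderboard numbers is the consensus metric from the VQA papers: a predicted answer earns min(matches / 3, 1) against the ten human annotations, averaged over all ten leave-one-out annotator subsets. A minimal sketch, omitting the official string normalization (article stripping, punctuation and number handling):

    def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
        pred = prediction.strip().lower()
        answers = [a.strip().lower() for a in human_answers]
        scores = []
        for i in range(len(answers)):  # leave each annotator out in turn
            others = answers[:i] + answers[i + 1:]
            matches = sum(a == pred for a in others)
            scores.append(min(matches / 3.0, 1.0))
        return sum(scores) / len(scores)

    # Four of ten annotators said "two": every nine-answer subset keeps at
    # least three matches, so the prediction earns full credit.
    print(vqa_accuracy("two", ["two"] * 4 + ["three"] * 6))  # 1.0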
All datasets
6 datasets tracked for this task.
Related tasks
Other tasks in Multimodal.