Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. The original VQA dataset (2015) defined the field and its rebalanced successor VQAv2 (2017) became the standard benchmark, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.
Visual Question Answering (VQA) models answer natural language questions about images, requiring joint visual perception and language understanding. Once a standalone research task, VQA is now largely subsumed by general vision-language models but remains a critical evaluation benchmark for visual reasoning capabilities.
History
2015: VQA v1.0 dataset (Antol et al.) formalizes the task with 250K images and 760K questions built on MS COCO
2017: VQA v2.0 rebalances the dataset with complementary image pairs to reduce language bias, becoming the standard benchmark
2019: ViLBERT and LXMERT introduce transformer-based vision-language pretraining, dramatically improving VQA accuracy
2021: VinVL achieves 76.6% on VQA v2.0 test-std by combining better object detection with vision-language pretraining
2022: Flamingo (DeepMind) achieves strong few-shot VQA performance without task-specific fine-tuning
2022: PaLI (Google) scales vision-language models to 17B parameters, hitting 84.3% on VQA v2.0
2023: GPT-4V and Gemini Ultra surpass 85% on VQA v2.0, effectively saturating the benchmark
2024: Community shifts focus to harder benchmarks: MMMU (college-level exams), RealWorldQA, and domain-specific VQA (medical, scientific)
Today: VQA v2.0 is considered solved; evaluation moves to compositional, multi-hop, and knowledge-intensive visual questions
How Visual Question Answering Works
Image Feature Extraction
The image is processed through a vision encoder (ViT, SigLIP) producing patch-level features. Earlier approaches used region features from object detectors (Faster R-CNN), but modern models use end-to-end ViT encoders.
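The patchification step can be sketched in a few lines. This is a minimal illustration of how a ViT-style encoder turns an image into patch-level features, not the code of any particular model; the 16-pixel patch size, 224×224 input, and random projection matrix stand in for a trained encoder's learned parameters.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))       # stand-in for a real image
tokens = patchify(img)                         # (196, 768): 14x14 patches
W_embed = rng.standard_normal((768, 512))      # learned projection in a real ViT
features = tokens @ W_embed                    # (196, 512) patch-level features
```

In a real ViT the projection is followed by position embeddings and a stack of self-attention layers, but the interface is the same: one feature vector per image patch.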
Question Encoding
The natural language question is tokenized and encoded by the LLM's text encoder. The question guides attention toward relevant image regions.
Cross-modal Reasoning
Vision and language features are fused via cross-attention mechanisms. The model must ground question concepts in visual features — 'what color is the car?' requires locating the car and identifying its color attribute.
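The fusion step is, at its core, scaled dot-product cross-attention: question tokens act as queries, image patches as keys and values. The sketch below shows the mechanism with random features and a single head; real models use multi-head attention with learned query/key/value projections.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_text, k_img, v_img):
    """Question tokens (queries) attend over image patches (keys/values)."""
    d = q_text.shape[-1]
    scores = q_text @ k_img.T / np.sqrt(d)   # (n_text, n_patches)
    weights = softmax(scores, axis=-1)       # each question token's attention map
    return weights @ v_img, weights          # visually grounded text features

rng = np.random.default_rng(0)
text = rng.standard_normal((7, 64))          # e.g. 7 question tokens
patches = rng.standard_normal((196, 64))     # 196 image patch features
fused, attn = cross_attention(text, patches, patches)
```

For "what color is the car?", a trained model's attention weights for the token "car" would concentrate on the patches containing the car, so the fused representation carries that region's color information into answer generation.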
Answer Generation
Modern VLMs generate free-form text answers autoregressively. Earlier VQA models used classification over a fixed answer vocabulary (e.g., top 3,129 answers in VQA v2.0).
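The classic classification-style head mentioned above can be sketched as follows. The six-answer vocabulary is a toy stand-in for VQA v2.0's 3,129-answer head, and the random weights stand in for trained parameters; modern VLMs replace this entire head with autoregressive decoding over the LLM's full token vocabulary.

```python
import numpy as np

# Toy vocabulary; classic VQA v2.0 heads classified over ~3,129 frequent answers.
ANSWER_VOCAB = ["yes", "no", "red", "blue", "2", "3"]

def classify_answer(fused: np.ndarray, W: np.ndarray, b: np.ndarray) -> str:
    """Mean-pool the fused features, then score every candidate answer."""
    pooled = fused.mean(axis=0)    # (d,) single vector summarizing the fusion
    logits = pooled @ W + b        # (vocab_size,) one score per answer
    return ANSWER_VOCAB[int(np.argmax(logits))]

rng = np.random.default_rng(0)
fused = rng.standard_normal((7, 64))              # fused question tokens
W = rng.standard_normal((64, len(ANSWER_VOCAB)))  # learned in a real model
b = np.zeros(len(ANSWER_VOCAB))
answer = classify_answer(fused, W, b)
```

The fixed-vocabulary design explains a classic failure mode: any answer outside the top-k list was unreachable, which is one reason free-form generation took over.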
Current Landscape
VQA as a standalone task has been largely absorbed into general vision-language modeling. VQA v2.0 accuracy exceeds 85% for frontier models, and the benchmark no longer differentiates model capabilities meaningfully. The field has migrated to harder evaluation suites: MMMU tests college-level multimodal reasoning, MathVista tests mathematical visual understanding, and domain-specific benchmarks (PathVQA, ScienceQA) test specialized knowledge. In practice, any modern VLM (GPT-4o, Claude 3.5, Gemini 2.0, Qwen2.5-VL) handles standard VQA tasks reliably — the differentiators are now reasoning depth, hallucination resistance, and domain expertise.
Key Challenges
Compositional reasoning — questions requiring multiple reasoning steps ('Is the number of red objects greater than blue objects?') remain challenging
Knowledge-intensive questions — answering 'Who painted this?' or 'What species is this?' requires external world knowledge beyond the image
Spatial reasoning — questions about relative positions, distances, and orientations ('Is the cat to the left of the dog?') have lower accuracy
Counting — accurately counting objects, especially in crowded scenes, remains unreliable even for frontier models
Benchmark saturation — VQA v2.0 is effectively solved; the field needs harder, more diagnostic benchmarks
Quick Recommendations
Best accuracy
GPT-4o
Highest accuracy across VQA benchmarks including hard compositional and knowledge-intensive questions
Best for domain-specific VQA
Gemini 2.0 Pro
Long context window allows injecting domain knowledge alongside images; strong on scientific and medical VQA
Open source (large)
Qwen2.5-VL-72B
Near-GPT-4o accuracy on VQA benchmarks; best open-weight choice for production visual Q&A systems
Open source (efficient)
InternVL2.5-8B
Strong VQA performance in a compact 8B model; 4-bit quantized version runs on consumer GPUs
Medical VQA
Med-PaLM M (Google)
Purpose-built for medical visual question answering; achieves expert-level accuracy on pathology and radiology questions
What's Next
VQA's future lies in embodied and interactive settings — answering questions about real-time video feeds, 3D environments, and dynamic scenes. Multi-hop visual reasoning (requiring information from multiple images or video frames) will become the new frontier. Expect evaluation to shift toward open-ended visual dialogue quality rather than single-turn accuracy, and toward measuring calibration (does the model know when it does not know?) rather than raw correctness.
Benchmarks & SOTA
MMMU
Massive Multidiscipline Multimodal Understanding
Massive Multidiscipline Multimodal Understanding benchmark covering 11.5K multimodal questions across 183 subfields from college-level exams in Art, Business, Science, Health, Humanities, and Tech. Requires deep reasoning over images, diagrams, and text. Spans 30 subjects across the six disciplines. Tests multi-image understanding and expert-level domain knowledge. A key VLM reasoning benchmark since early 2024.
State of the Art
InternVL3-78B
Shanghai AI Lab
73.3
accuracy
TextVQA
TextVQA: Towards VQA Models That Can Read
Visual Question Answering dataset requiring models to read and reason about text in natural images. Contains 45,336 questions about 28,408 images from Open Images dataset. Questions require OCR-based reasoning, e.g. "What does the sign say?". A standard benchmark for evaluating text understanding within visual scenes. ANLS and exact-match accuracy metrics.
State of the Art
Qwen2.5-VL 72B
Alibaba
85.5
accuracy
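ANLS (Average Normalized Levenshtein Similarity), one of the metrics named above, gives partial credit for near-miss OCR answers: the score is 1 minus the normalized edit distance, zeroed out below a 0.5 threshold, taking the best match over the ground-truth answers. A sketch of the per-question score (the official evaluators also average this over all questions):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, ground_truths: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best similarity over ground truths, thresholded."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.lower().strip(), gt.lower().strip()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best
```

So `anls("stp", ["stop"])` scores 0.75 (one edit out of four characters), while a prediction more than half-wrong scores 0 rather than accumulating noise credit.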
MMBench
MMBench: Is Your Multi-modal Model an All-around Player?
Comprehensive multimodal model evaluation benchmark covering 20 ability dimensions including object recognition, attribute reasoning, spatial reasoning, commonsense, and more. Contains 3,000+ multiple-choice questions. Uses CircularEval strategy to avoid positional bias. Maintained by the OpenCompass team, widely used for VLM evaluation in 2024-2025.
State of the Art
Qwen2.5-VL 72B
Alibaba
90.5
accuracy
VQA v2.0
Visual Question Answering v2.0
265K images with 1.1M questions. Balanced by pairing each question with complementary images that yield different answers, reducing the language biases found in v1.
State of the Art
Qwen2-VL 72B
Alibaba
87.6
accuracy
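VQA v2.0's "accuracy" is not exact match: each question has 10 human answers, and a prediction is fully correct if at least 3 annotators gave it, with partial credit below that. A simplified per-question sketch (the official script additionally normalizes answer strings and averages over subsets of 9 annotators, omitted here):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """VQA v2.0 consensus accuracy: min(#matching annotators / 3, 1)."""
    matches = sum(a == prediction for a in human_answers)
    return min(matches / 3.0, 1.0)

# 4 of 10 annotators agree -> full credit; 2 of 10 -> partial credit of 2/3.
full = vqa_accuracy("red", ["red"] * 4 + ["dark red"] * 6)
partial = vqa_accuracy("red", ["red"] * 2 + ["maroon"] * 8)
```

This consensus scheme is why reported VQA v2.0 numbers tolerate annotator disagreement on ambiguous images instead of penalizing every minority answer as wrong.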
GQA
GQA: Visual Reasoning in the Real World
22M compositional questions grounded in real images via scene graphs. Tests multi-step visual reasoning, spatial understanding, and attribute comparison.
No results tracked yet
OK-VQA
Outside Knowledge Visual Question Answering
14,055 questions requiring outside knowledge to answer. Tests models that must consult external knowledge beyond the image.
No results tracked yet
Related Tasks
Audio-Text-to-Text
Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.
Image-Text-to-Image
Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
Image-Text-to-Text
Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF substantially reduces hallucination about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Image-Text-to-Video
Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.
Something wrong or missing?
Help keep Visual Question Answering benchmarks accurate. Report outdated results, missing benchmarks, or errors.