Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural language question, produce the correct answer. The original VQA dataset (2015) defined the field and its rebalanced successor VQAv2 (2017) became the standard benchmark, but modern benchmarks like GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching — they need genuine visual understanding.
Visual Question Answering (VQA) models answer natural language questions about images, requiring joint visual perception and language understanding. Once a standalone research task, VQA is now largely subsumed by general vision-language models but remains a critical evaluation benchmark for visual reasoning capabilities.
History
2015: VQA v1.0 dataset (Antol et al.) formalizes the task with 250K images and 760K questions built on MS COCO
2017: VQA v2.0 rebalances the dataset with complementary image pairs to reduce language bias, becoming the standard benchmark
2019: ViLBERT and LXMERT introduce transformer-based vision-language pretraining, dramatically improving VQA accuracy
2021: VinVL achieves 76.6% on VQA v2.0 test-std by combining better object detection with vision-language pretraining
2022: Flamingo (DeepMind) achieves strong few-shot VQA performance without task-specific fine-tuning
2022: PaLI (Google) scales vision-language models to 17B parameters, hitting 84.3% on VQA v2.0
2023: GPT-4V and Gemini Ultra surpass 85% on VQA v2.0, effectively saturating the benchmark
2024: Community shifts focus to harder benchmarks: MMMU (college-level exams), RealWorldQA, and domain-specific VQA (medical, scientific)
Today: VQA v2.0 is considered solved; evaluation moves to compositional, multi-hop, and knowledge-intensive visual questions
How Visual Question Answering Works
Image Feature Extraction
The image is processed through a vision encoder (ViT, SigLIP) producing patch-level features. Earlier approaches used region features from object detectors (Faster R-CNN), but modern models use end-to-end ViT encoders.
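The patchification step can be sketched in a few lines. This is a minimal illustration of how a ViT-style encoder turns an image into patch-level features, not the code of any particular model; the 16-pixel patch size, 224×224 input, and random projection matrix stand in for a trained encoder's learned parameters.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))       # stand-in for a real image
tokens = patchify(img)                         # (196, 768): 14x14 patches
W_embed = rng.standard_normal((768, 512))      # learned projection in a real ViT
features = tokens @ W_embed                    # (196, 512) patch-level features
```

In a real ViT the projection is followed by position embeddings and a stack of self-attention layers, but the interface is the same: one feature vector per image patch.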
Question Encoding
The natural language question is tokenized and encoded by the LLM's text encoder. The question guides attention toward relevant image regions.
Cross-modal Reasoning
Vision and language features are fused via cross-attention mechanisms. The model must ground question concepts in visual features — 'what color is the car?' requires locating the car and identifying its color attribute.
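The fusion step is, at its core, scaled dot-product cross-attention: question tokens act as queries, image patches as keys and values. The sketch below shows the mechanism with random features and a single head; real models use multi-head attention with learned query/key/value projections.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_text, k_img, v_img):
    """Question tokens (queries) attend over image patches (keys/values)."""
    d = q_text.shape[-1]
    scores = q_text @ k_img.T / np.sqrt(d)   # (n_text, n_patches)
    weights = softmax(scores, axis=-1)       # each question token's attention map
    return weights @ v_img, weights          # visually grounded text features

rng = np.random.default_rng(0)
text = rng.standard_normal((7, 64))          # e.g. 7 question tokens
patches = rng.standard_normal((196, 64))     # 196 image patch features
fused, attn = cross_attention(text, patches, patches)
```

For "what color is the car?", a trained model's attention weights for the token "car" would concentrate on the patches containing the car, so the fused representation carries that region's color information into answer generation.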
Answer Generation
Modern VLMs generate free-form text answers autoregressively. Earlier VQA models used classification over a fixed answer vocabulary (e.g., top 3,129 answers in VQA v2.0).
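The classic classification-style head mentioned above can be sketched as follows. The six-answer vocabulary is a toy stand-in for VQA v2.0's 3,129-answer head, and the random weights stand in for trained parameters; modern VLMs replace this entire head with autoregressive decoding over the LLM's full token vocabulary.

```python
import numpy as np

# Toy vocabulary; classic VQA v2.0 heads classified over ~3,129 frequent answers.
ANSWER_VOCAB = ["yes", "no", "red", "blue", "2", "3"]

def classify_answer(fused: np.ndarray, W: np.ndarray, b: np.ndarray) -> str:
    """Mean-pool the fused features, then score every candidate answer."""
    pooled = fused.mean(axis=0)    # (d,) single vector summarizing the fusion
    logits = pooled @ W + b        # (vocab_size,) one score per answer
    return ANSWER_VOCAB[int(np.argmax(logits))]

rng = np.random.default_rng(0)
fused = rng.standard_normal((7, 64))              # fused question tokens
W = rng.standard_normal((64, len(ANSWER_VOCAB)))  # learned in a real model
b = np.zeros(len(ANSWER_VOCAB))
answer = classify_answer(fused, W, b)
```

The fixed-vocabulary design explains a classic failure mode: any answer outside the top-k list was unreachable, which is one reason free-form generation took over.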
Current Landscape
VQA as a standalone task has been largely absorbed into general vision-language modeling. VQA v2.0 accuracy exceeds 85% for frontier models, and the benchmark no longer differentiates model capabilities meaningfully. The field has migrated to harder evaluation suites: MMMU tests college-level multimodal reasoning, MathVista tests mathematical visual understanding, and domain-specific benchmarks (PathVQA, ScienceQA) test specialized knowledge. In practice, any modern VLM (GPT-4o, Claude 3.5, Gemini 2.0, Qwen2.5-VL) handles standard VQA tasks reliably — the differentiators are now reasoning depth, hallucination resistance, and domain expertise.
Key Challenges
Compositional reasoning — questions requiring multiple reasoning steps ('Is the number of red objects greater than blue objects?') remain challenging
Knowledge-intensive questions — answering 'Who painted this?' or 'What species is this?' requires external world knowledge beyond the image
Spatial reasoning — questions about relative positions, distances, and orientations ('Is the cat to the left of the dog?') have lower accuracy
Counting — accurately counting objects, especially in crowded scenes, remains unreliable even for frontier models
Benchmark saturation — VQA v2.0 is effectively solved; the field needs harder, more diagnostic benchmarks
Quick Recommendations
Best accuracy
GPT-4o
Highest accuracy across VQA benchmarks including hard compositional and knowledge-intensive questions
Best for domain-specific VQA
Gemini 2.0 Pro
Long context window allows injecting domain knowledge alongside images; strong on scientific and medical VQA
Open source (large)
Qwen2.5-VL-72B
Near-GPT-4o accuracy on VQA benchmarks; best open-weight choice for production visual Q&A systems
Open source (efficient)
InternVL2.5-8B
Strong VQA performance in a compact 8B model; 4-bit quantized version runs on consumer GPUs
Medical VQA
Med-PaLM M (Google)
Purpose-built for medical visual question answering; achieves expert-level accuracy on pathology and radiology questions
What's Next
VQA's future lies in embodied and interactive settings — answering questions about real-time video feeds, 3D environments, and dynamic scenes. Multi-hop visual reasoning (requiring information from multiple images or video frames) will become the new frontier. Expect evaluation to shift toward open-ended visual dialogue quality rather than single-turn accuracy, and toward measuring calibration (does the model know when it does not know?) rather than raw correctness.
Benchmarks & SOTA
MMMU
Massive Multidiscipline Multimodal Understanding
Massive Multidiscipline Multimodal Understanding benchmark covering 11.5K multimodal questions across 183 subfields from college-level exams in Art, Business, Science, Health, Humanities, and Tech. Requires deep reasoning over images, diagrams, and text. Spans 30 subjects across the six disciplines. Tests multi-image understanding and expert-level domain knowledge. A key VLM reasoning benchmark since early 2024.
State of the Art
InternVL3-78B
Shanghai AI Lab
73.3
accuracy
TextVQA
TextVQA: Towards VQA Models That Can Read
Visual Question Answering dataset requiring models to read and reason about text in natural images. Contains 45,336 questions about 28,408 images from Open Images dataset. Questions require OCR-based reasoning, e.g. "What does the sign say?". A standard benchmark for evaluating text understanding within visual scenes. ANLS and exact-match accuracy metrics.
State of the Art
Qwen2.5-VL 72B
Alibaba
85.5
accuracy
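ANLS (Average Normalized Levenshtein Similarity), one of the metrics named above, gives partial credit for near-miss OCR answers: the score is 1 minus the normalized edit distance, zeroed out below a 0.5 threshold, taking the best match over the ground-truth answers. A sketch of the per-question score (the official evaluators also average this over all questions):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, ground_truths: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best similarity over ground truths, thresholded."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.lower().strip(), gt.lower().strip()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best
```

So `anls("stp", ["stop"])` scores 0.75 (one edit out of four characters), while a prediction more than half-wrong scores 0 rather than accumulating noise credit.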
MMBench
MMBench: Is Your Multi-modal Model an All-around Player?
Comprehensive multimodal model evaluation benchmark covering 20 ability dimensions including object recognition, attribute reasoning, spatial reasoning, commonsense, and more. Contains 3,000+ multiple-choice questions. Uses CircularEval strategy to avoid positional bias. Maintained by the OpenCompass team, widely used for VLM evaluation in 2024-2025.
State of the Art
Qwen2.5-VL 72B
Alibaba
90.5
accuracy
VQA v2.0
Visual Question Answering v2.0
265K images with 1.1M questions. Balanced by pairing each question with complementary images that yield different answers, reducing the language biases found in v1.
State of the Art
Qwen2-VL 72B
Alibaba
87.6
accuracy
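VQA v2.0's "accuracy" is not exact match: each question has 10 human answers, and a prediction is fully correct if at least 3 annotators gave it, with partial credit below that. A simplified per-question sketch (the official script additionally normalizes answer strings and averages over subsets of 9 annotators, omitted here):

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """VQA v2.0 consensus accuracy: min(#matching annotators / 3, 1)."""
    matches = sum(a == prediction for a in human_answers)
    return min(matches / 3.0, 1.0)

# 4 of 10 annotators agree -> full credit; 2 of 10 -> partial credit of 2/3.
full = vqa_accuracy("red", ["red"] * 4 + ["dark red"] * 6)
partial = vqa_accuracy("red", ["red"] * 2 + ["maroon"] * 8)
```

This consensus scheme is why reported VQA v2.0 numbers tolerate annotator disagreement on ambiguous images instead of penalizing every minority answer as wrong.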
GQA
GQA: Visual Reasoning in the Real World
22M compositional questions grounded in real images via scene graphs. Tests multi-step visual reasoning, spatial understanding, and attribute comparison.
No results tracked yet
OK-VQA
Outside Knowledge Visual Question Answering
14,055 questions requiring outside knowledge to answer. Tests models that must consult external knowledge beyond the image.
No results tracked yet
Related Tasks
Audio-Text-to-Text
Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.
Image-Text-to-Image
Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
Image-Text-to-Text
Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF substantially reduces hallucination about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Image-Text-to-Video
Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.
Something wrong or missing?
Help keep Visual Question Answering benchmarks accurate. Report outdated results, missing benchmarks, or errors.