Visual Question Answering
Answer natural language questions about images. Combines vision and language understanding.
How Visual Question Answering Works
A technical deep-dive into Visual Question Answering, from attention mechanisms to the modern vision-language models that can reason about images.
The Problem
Why is answering questions about images hard for machines?
Picture a photograph of a family picnic. A human glances at it and can instantly answer "How many people are eating?" or "Is the weather nice?" without conscious effort. For a machine, this requires solving multiple hard problems simultaneously:
- Scene understanding: the system must detect objects, understand their relationships, recognize actions, and infer scene context. A "picnic" is not just objects, but their arrangement and context.
- Question diversity: questions come in infinite variety. "How many?" needs counting. "Is it raining?" needs visual inference. "What might happen next?" needs reasoning.
- Grounding: the hardest part is connecting words to visual concepts. What does "eating" look like? Where in the image is the answer to "What color is the blanket?"
- External knowledge: many questions require knowledge beyond the image. "What city is this?" needs landmark recognition. "Is this food healthy?" needs nutrition knowledge.
Types of VQA Questions
Questions with direct visual answers:
- "What color is the car?"
- "How many people are there?"
Questions requiring logical inference from visual cues:
- "Is it going to rain?"
- "What might happen next?"
Questions needing external world knowledge beyond the image:
- "What city is this?"
- "Who painted this?"
Questions that require reading and understanding text in the image:
- "What does the sign say?"
- "What is the price?"
How Vision-Language Models Work
The architecture evolved from simple feature concatenation to sophisticated multimodal transformers.
Modern VLM Architecture (Simplified)
Modern VLMs treat images as a special kind of "text." The vision encoder converts image patches into tokens that look like word embeddings to the LLM. The LLM then processes image tokens and text tokens together, allowing it to reason about both modalities using the same attention mechanisms.
- Early fusion: concatenate image and text features early, then process them together.
- Late fusion: process each modality separately and combine the results at the end.
- Cross-attention: text tokens attend to image regions and image tokens attend to words, iteratively across layers.
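To make the "images as tokens" idea concrete, here is a toy sketch of early fusion in PyTorch: patch features from a vision encoder are projected into the LLM's embedding space and concatenated with the text embeddings. All dimensions, tensors, and the simple linear projector are illustrative stand-ins (real models use an MLP or Q-Former trained on image-text data).

```python
import torch
import torch.nn as nn

# Toy dimensions: real models use e.g. 1024-dim ViT patch features and 4096-dim LLM embeddings
vision_dim, llm_dim = 1024, 4096
num_patches, num_text_tokens = 256, 12

# A learned projector maps patch features into the LLM's embedding space,
# so image patches look like word embeddings to the language model
projector = nn.Linear(vision_dim, llm_dim)

patch_features = torch.randn(1, num_patches, vision_dim)    # output of the vision encoder
text_embeddings = torch.randn(1, num_text_tokens, llm_dim)  # output of the LLM's token embedder

image_tokens = projector(patch_features)                      # (1, 256, 4096)
sequence = torch.cat([image_tokens, text_embeddings], dim=1)  # one mixed sequence, (1, 268, 4096)
print(sequence.shape)  # this combined sequence is what the LLM's attention layers process
```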
Key Models
The models you should know for VQA in 2024-2025.
BLIP-2
- +Efficient Q-Former bridge
- +Works with any LLM
- +Good zero-shot
- -Less strong on complex reasoning
- -Frozen vision encoder
LLaVA
- +Visual instruction following
- +Open weights
- +Active community
- -Requires fine-tuning for best results
- -Single image only
Qwen2-VL
- +Dynamic resolution
- +Video support
- +Strong OCR
- -Requires significant VRAM
- -Complex inference setup
GPT-4o
- +Best reasoning
- +Handles complex questions
- +Multi-image
- -Expensive
- -Rate limited
- -No local deployment
Gemini 1.5
- +Long context
- +Video understanding
- +Fast inference
- -API only
- -Variable availability
Claude 3.5 Sonnet
- +Strong reasoning
- +Good at charts/diagrams
- +Reliable
- -API only
- -Image limits per request
Benchmarks
Standard datasets for evaluating VQA models.
| Dataset | Focus | Size | Metric | SOTA |
|---|---|---|---|---|
| VQAv2 | General VQA | 1.1M QA pairs | Accuracy | 86.1% (Gemini) |
| OK-VQA | Knowledge VQA | 14K questions | Accuracy | 66.1% (PaLI-X) |
| TextVQA | Scene Text | 45K questions | Accuracy | 77.6% (GPT-4V) |
| GQA | Compositional | 22M questions | Accuracy | 72.1% (PaLI) |
| VizWiz | Accessibility | 31K questions | Accuracy | 73.2% (GPT-4V) |
| DocVQA | Document | 50K questions | ANLS | 93.4% (GPT-4V) |
VQAv2: the standard benchmark. 1.1 million questions about COCO images, balanced so that guessing from the question alone performs poorly. Accuracy is scored against a consensus of 10 human annotators (see the sketch below).
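As a sketch of how that consensus scoring works: a predicted answer gets full credit if at least 3 of the 10 annotators gave it, and partial credit otherwise. The official scorer additionally averages over annotator subsets and applies answer normalization; this simplified version keeps only the core min(count / 3, 1) rule.

```python
from collections import Counter

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """VQAv2-style consensus accuracy: full credit if at least 3 of the
    10 annotators gave the predicted answer, partial credit otherwise."""
    counts = Counter(answer.strip().lower() for answer in human_answers)
    return min(counts[predicted.strip().lower()] / 3.0, 1.0)

# 4 of 10 annotators said "red", so "red" scores 1.0;
# "maroon" was given by only 2 annotators, so it scores 2/3.
annotators = ["red"] * 4 + ["dark red"] * 3 + ["maroon"] * 2 + ["crimson"]
print(vqa_accuracy("red", annotators))     # 1.0
print(vqa_accuracy("maroon", annotators))  # 0.666...
```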
"Outside Knowledge" VQA. Questions require knowledge not in the image. Example: "What vitamin is this fruit high in?" Tests knowledge retrieval.
Reading text in images. Street signs, product labels, documents. Requires OCR capability integrated with reasoning.
22M compositional questions from scene graphs. Tests reasoning: "Is the cat on the table to the left of the lamp?"
Code Examples
Get started with VQA using different frameworks and models.
```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load the BLIP-2 model and its processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the image
image = Image.open("photo.jpg")

# Ask a question
question = "What is happening in this image?"
inputs = processor(image, question, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=100)
answer = processor.decode(output[0], skip_special_tokens=True).strip()

print(f"Q: {question}")
print(f"A: {answer}")
```
Quick Reference
- Models: GPT-4o for best quality, Qwen2-VL for self-hosted, LLaVA for local deployment
- Benchmarks: VQAv2 (general), TextVQA (OCR + QA), OK-VQA (knowledge)
- Gotchas: image resolution matters, question phrasing affects accuracy, watch for hallucinations on OCR tasks
Use Cases
- ✓Accessibility for blind users
- ✓Image-based search
- ✓Visual reasoning
- ✓Educational tools
- ✓Customer support with images
Architectural Patterns
Vision-Language Models
End-to-end models trained on image-question-answer triplets.
- +State-of-the-art accuracy
- +Handles complex reasoning
- -Large models
- -Expensive inference
Vision Encoder + LLM
Encode image, feed features to LLM decoder.
- +Leverages LLM capabilities
- +Flexible
- -Two-stage pipeline
- -May lose visual details
Object Detection + QA
Detect objects first, then reason over the detections (see the sketch after this list).
- +Interpretable
- +Good for counting
- -Limited by detector
- -Complex pipeline
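A minimal sketch of the detection-then-reason pattern, using torchvision's Faster R-CNN as the detector. The confidence threshold, the counting logic, and the file path are illustrative; a real pipeline would pass the structured detections to an LLM for anything beyond simple counting.

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_V2_Weights,
    fasterrcnn_resnet50_fpn_v2,
)

# Stage 1: detect objects in the image
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn_v2(weights=weights).eval()
image = read_image("photo.jpg")
with torch.no_grad():
    detections = detector([weights.transforms()(image)])[0]

# Stage 2: keep confident detections as a structured scene description
labels = [
    weights.meta["categories"][label]
    for label, score in zip(detections["labels"], detections["scores"])
    if score > 0.8
]
print("Detected objects:", ", ".join(labels))

# Stage 3: answer a counting question directly from the detections
question = "How many people are there?"
print(f"Q: {question}")
print(f"A: {sum(1 for label in labels if label == 'person')}")
```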
Implementations
API Services
GPT-4V
OpenAI. Best overall VQA; handles complex reasoning well.
Claude 3.5 Sonnet
Anthropic. Excellent for detailed image analysis.
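Calling either API service follows the same pattern: send the (usually base64-encoded) image together with the question in a single user message. Here is a minimal sketch with the Anthropic Python SDK; the model identifier, file path, and question are illustrative, and the client reads ANTHROPIC_API_KEY from the environment.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # uses the ANTHROPIC_API_KEY environment variable

with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=200,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data},
                },
                {"type": "text", "text": "What does the sign in this photo say?"},
            ],
        }
    ],
)

print(message.content[0].text)
```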
Open Source
LLaVA
Open weights with an active community; strong visual instruction following. A good default for local deployment.
Qwen2-VL
Dynamic resolution, video support, and strong OCR. A strong self-hosted option, but needs significant VRAM.
BLIP-2
Efficient Q-Former bridge over a frozen vision encoder; good zero-shot VQA that works with any LLM.
Quick Facts
- Input: Image
- Output: Text
- Implementations: 3 open source, 2 API
- Patterns: 3 approaches