Visual Question Answering
From the original image+question task to broad multimodal reasoning. The attention path tracks where leaderboard focus has moved; branches show specialized variants that remain active.
MMMU is placed on the attention path as a scope_shift — it's not strictly the same task as VQAv2, but the field's attention migrated there once VQAv2 saturated. Specialized VQA variants (knowledge, text, compositional) are shown as branches and remain active in their own right.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
VQA
The original image+question → answer benchmark on COCO images; it established the task.
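The consensus metric this benchmark introduced (and VQAv2 kept) is simple enough to show inline. A minimal sketch, ignoring the answer normalization and the leave-one-annotator-out averaging the official evaluator also applies:

```python
from collections import Counter

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Consensus VQA accuracy: a prediction is fully correct if at
    least 3 of the 10 human annotators gave the same answer; fewer
    matches earn partial credit."""
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[prediction.strip().lower()]
    return min(matches / 3.0, 1.0)

# 2 of 10 annotators said "red" -> partial credit of 2/3
print(vqa_accuracy("red", ["red", "red", "maroon"] + ["dark red"] * 7))
```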
VQAv2
Balanced pairs kill language priors — each question has two similar images with different answers so models must actually look.
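Seen as an evaluation, the pairs also support a stricter consistency score, along the lines of the balanced-pair accuracy reported with VQAv2. A minimal sketch with hypothetical field names:

```python
def pair_accuracy(model, pairs) -> float:
    """Fraction of complementary pairs where the model answers BOTH
    images correctly. A blind model that memorized the language prior
    gives the same answer twice, so it always fails one side of a pair
    whose two answers differ. Field names here are illustrative."""
    correct = 0
    for p in pairs:
        a1 = model(p["image_a"], p["question"])
        a2 = model(p["image_b"], p["question"])
        correct += (a1 == p["answer_a"]) and (a2 == p["answer_b"])
    return correct / len(pairs)

# A prior-only "model" ignores the image entirely:
blind = lambda image, question: "yes"
pairs = [{"image_a": 1, "image_b": 2, "question": "Is it raining?",
          "answer_a": "yes", "answer_b": "no"}]
print(pair_accuracy(blind, pairs))  # 0.0 -- one side of the pair fails
```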
A-OKVQA
Successor to OK-VQA, the outside-knowledge variant: more diverse knowledge types and questions that require commonsense reasoning rather than a knowledge-base lookup.
MMMU
College-level multimodal questions across 30 subjects. Broader in scope than task-specific VQA; this is where leaderboard attention moved once VQAv2 saturated.
MMMU-Pro
A harder MMMU with vision-only questions and ten answer choices, closing the text-only shortcuts models exploited in MMMU.
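The effect of widening the choice set is easy to quantify: a minimal sketch of letter-based multiple-choice scoring (names are illustrative) shows the guessing floor falling from roughly 25% with four options to roughly 10% with ten.

```python
import random

def mcq_accuracy(preds: list[str], answers: list[str]) -> float:
    """Exact-match accuracy over multiple-choice letter answers."""
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

random.seed(0)
for options in ("ABCD", "ABCDEFGHIJ"):  # four-way vs. ten-way choice sets
    answers = [random.choice(options) for _ in range(100_000)]
    guesses = [random.choice(options) for _ in range(100_000)]
    print(len(options), round(mcq_accuracy(guesses, answers), 3))
# -> 4  ~0.25  (four-option guessing floor)
# -> 10 ~0.10  (ten-option guessing floor)
```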