Visual Question Answering
From the original image+question task to broad multimodal reasoning. The attention path tracks where leaderboard focus has moved; branches show specialized variants that remain active.
MMMU is placed on the attention path as a scope_shift — it's not strictly the same task as VQAv2, but the field's attention migrated there once VQAv2 saturated. Specialized VQA variants (knowledge, text, compositional) are shown as branches and remain active in their own right.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
VQA
The original image+question → answer benchmark on COCO images; it established the task.
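The consensus metric this benchmark introduced (and VQAv2 kept) is simple enough to show inline. A minimal sketch, ignoring the answer normalization and the leave-one-annotator-out averaging the official evaluator also applies:

```python
from collections import Counter

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Consensus VQA accuracy: a prediction is fully correct if at
    least 3 of the 10 human annotators gave the same answer; fewer
    matches earn partial credit."""
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts[prediction.strip().lower()]
    return min(matches / 3.0, 1.0)

# 2 of 10 annotators said "red" -> partial credit of 2/3
print(vqa_accuracy("red", ["red", "red", "maroon"] + ["dark red"] * 7))
```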
VQAv2
Balanced pairs kill language priors — each question has two similar images with different answers so models must actually look.
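Seen as an evaluation, the pairs also support a stricter consistency score, along the lines of the balanced-pair accuracy reported with VQAv2. A minimal sketch with hypothetical field names:

```python
def pair_accuracy(model, pairs) -> float:
    """Fraction of complementary pairs where the model answers BOTH
    images correctly. A blind model that memorized the language prior
    gives the same answer twice, so it always fails one side of a pair
    whose two answers differ. Field names here are illustrative."""
    correct = 0
    for p in pairs:
        a1 = model(p["image_a"], p["question"])
        a2 = model(p["image_b"], p["question"])
        correct += (a1 == p["answer_a"]) and (a2 == p["answer_b"])
    return correct / len(pairs)

# A prior-only "model" ignores the image entirely:
blind = lambda image, question: "yes"
pairs = [{"image_a": 1, "image_b": 2, "question": "Is it raining?",
          "answer_a": "yes", "answer_b": "no"}]
print(pair_accuracy(blind, pairs))  # 0.0 -- one side of the pair fails
```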
A-OKVQA
Successor to OK-VQA, the outside-knowledge variant: more diverse knowledge types and questions that require commonsense reasoning rather than a knowledge-base lookup.
MMMU
College-level multimodal questions across 30 subjects. Broader in scope than task-specific VQA; this is where leaderboard attention moved once VQAv2 saturated.
MMMU-Pro
A harder MMMU with vision-only questions and ten answer choices, closing the text-only shortcuts models exploited in MMMU.
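The effect of widening the choice set is easy to quantify: a minimal sketch of letter-based multiple-choice scoring (names are illustrative) shows the guessing floor falling from roughly 25% with four options to roughly 10% with ten.

```python
import random

def mcq_accuracy(preds: list[str], answers: list[str]) -> float:
    """Exact-match accuracy over multiple-choice letter answers."""
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)

random.seed(0)
for options in ("ABCD", "ABCDEFGHIJ"):  # four-way vs. ten-way choice sets
    answers = [random.choice(options) for _ in range(100_000)]
    guesses = [random.choice(options) for _ in range(100_000)]
    print(len(options), round(mcq_accuracy(guesses, answers), 3))
# -> 4  ~0.25  (four-option guessing floor)
# -> 10 ~0.10  (ten-option guessing floor)
```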