Multimodal Reasoning Benchmarks
How vision-language model evaluation moved beyond visual question answering (covered in the VQA lineage) into multimodal reasoning — science, mathematics, chart understanding, and expert-level perception. When VQA-v2 saturated, the field needed benchmarks that tested whether models could integrate vision and language for genuine reasoning, not pattern matching. This lineage tracks that shift from ScienceQA through MMMU, MathVista, and into the expert-difficulty frontier.
VQA-v2 saturating around 2023 (top models at 82–86%, near human parity on its simple distribution) pushed the community toward benchmarks with genuine reasoning requirements. MMMU was the first large-scale exam-style multimodal benchmark: 30 subjects, requiring college-level knowledge to answer correctly. MathVista filled the mathematical-reasoning gap that VQA benchmarks missed entirely. The current active frontier is MMMU-Pro and CharXiv: MMMU-Pro's 10-option format and vision-only input setting suggest that models pattern-match more than they reason; CharXiv's chart-understanding tasks expose weak spatial-quantitative reasoning in current VLMs. The HLE multimodal split, expert-contributed image+text questions, is where no model yet exceeds 20%.
Attention path plus branches.
Solid arrows follow the attention path; the dashed arrow marks a scope shift (leaderboard attention jumping between tasks); thin grey arcs drop down to specialized branches.
Nodes in detail.
ScienceQA
21,208 multimodal science questions from elementary and high school curricula. Text and image inputs, multiple-choice answers with chain-of-thought explanations. The first widely-used benchmark designed specifically for multimodal reasoning rather than visual grounding. Top models now exceed human accuracy on this distribution.
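To make the format concrete, here is a minimal sketch of a ScienceQA-style record and exact-match scoring. The field names (question, choices, answer, image, solution) follow the commonly seen release layout but are assumptions to verify against the actual dataset.

```python
# Minimal sketch of a ScienceQA-style multiple-choice record and exact-match
# scoring. Field names are illustrative; check them against the release you use.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScienceQAItem:
    question: str
    choices: list[str]        # answer options, e.g. ["igneous", "sedimentary", "metamorphic"]
    answer: int               # index of the correct choice
    image: Optional[bytes]    # None for text-only items
    solution: str             # chain-of-thought rationale shipped with the question

def accuracy(predictions: list[int], items: list[ScienceQAItem]) -> float:
    """Exact-match accuracy over predicted choice indices."""
    correct = sum(int(p == item.answer) for p, item in zip(predictions, items))
    return correct / len(items)
```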
MathVista
6,141 math problems with visual context — geometry figures, statistical charts, scientific plots. Requires combining visual parsing with mathematical reasoning. Human performance ~60.3%. GPT-4V scored 49.9% at launch; top models approached 70% by 2024.
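MathVista mixes multiple-choice and free-form numeric answers, and the released evaluation uses an LLM to extract the final answer from a model's response before scoring. The snippet below is a simplified regex-based stand-in for the numeric case, an illustration rather than the official pipeline.

```python
# Illustrative only (not the official MathVista pipeline): pull a final numeric
# answer out of a free-form response and compare it to the target value.
import re

def extract_numeric_answer(response: str) -> float | None:
    """Return the last number mentioned in the response, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(response: str, target: float, tol: float = 1e-3) -> bool:
    pred = extract_numeric_answer(response)
    return pred is not None and abs(pred - target) <= tol

assert is_correct("The slope is (8 - 2) / (3 - 0) = 2.0", 2.0)
```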
MMMU
11,550 college-level questions across 30 subjects requiring image+text understanding: medical imaging, engineering diagrams, financial charts, art history. Human baseline ~88.6%. GPT-4V scored 56% at launch; top VLMs reached ~70% by end of 2024. The exam-style multimodal standard.
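A typical way to run such exam-style items is to lay the options out with letters and ask for a single letter back. The helper below sketches that prompt format; the example question is invented, and the actual release schema (images stored in separate fields alongside the text) should be checked.

```python
# Sketch of exam-style prompting for an MMMU-like multiple-choice item.
# The question and options here are made up; real items also attach one or more images.
import string

def format_mc_prompt(question: str, options: list[str]) -> str:
    lines = [question, ""]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    lines += ["", "Answer with the letter of the correct option only."]
    return "\n".join(lines)

print(format_mc_prompt(
    "Which imaging modality most likely produced the scan shown?",
    ["CT", "MRI", "Ultrasound", "PET"],
))
```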
CharXiv
2,323 charts from arXiv papers paired with descriptive and reasoning questions — requiring precise extraction of values, trend interpretation, and multi-step quantitative reasoning from real scientific figures. Human accuracy ~80%; top VLMs scored 30–47% at launch. Exposes weak chart-reasoning in models that perform well on cleaner MMMU charts.
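Because CharXiv separates descriptive from reasoning questions, per-category accuracy is the number to track. A small breakdown helper follows; the field names are illustrative rather than the official schema, and the benchmark's own grading is more involved.

```python
# Per-category accuracy breakdown for graded CharXiv-style results.
# "category" and "correct" are illustrative field names, not the official schema.
from collections import defaultdict

def accuracy_by_category(records: list[dict]) -> dict[str, float]:
    totals: dict[str, list[int]] = defaultdict(list)
    for rec in records:
        totals[rec["category"]].append(int(rec["correct"]))
    return {cat: sum(v) / len(v) for cat, v in totals.items()}

print(accuracy_by_category([
    {"category": "descriptive", "correct": True},
    {"category": "descriptive", "correct": False},
    {"category": "reasoning", "correct": False},
]))
# {'descriptive': 0.5, 'reasoning': 0.0}
```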
MMMU-Pro
Harder variant of MMMU: 10-option questions (vs. 4), filtering out questions answerable from the text alone, and a vision-only input setting where the question and options are embedded in a screenshot rather than provided as text. Top models that scored 70% on MMMU drop to 40–50% here. The vision-only setting specifically tests perceptual reasoning rather than language-mediated shortcuts.
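One reason raw scores are not directly comparable across the two formats is the guessing floor: chance is 25% with 4 options but 10% with 10. A guess-corrected score, shown below as an illustration (not part of the official metric), makes the drop easier to interpret.

```python
# Guess-corrected ("above chance") accuracy for multiple-choice benchmarks.
# Illustrative normalization, not an official MMMU / MMMU-Pro metric.
def above_chance(accuracy: float, n_options: int) -> float:
    chance = 1.0 / n_options
    return (accuracy - chance) / (1.0 - chance)

print(round(above_chance(0.70, 4), 3))   # 0.70 on MMMU (4 options)   -> 0.6
print(round(above_chance(0.45, 10), 3))  # 0.45 on MMMU-Pro (10 opts) -> 0.389
```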
HLE (multimodal split)
~1,000 questions from HLE requiring both image understanding and expert domain knowledge — medical imaging, microscopy, engineering diagrams, mathematical figures. The hardest multimodal reasoning benchmark available; top VLMs score below 20%.