6 benchmarks · 5 edges · Updated 2026-04-27
Benchmark lineage

Multimodal Reasoning Benchmarks

How vision-language model evaluation moved beyond visual question answering (covered in the VQA lineage) into multimodal reasoning — science, mathematics, chart understanding, and expert-level perception. When VQA-v2 saturated, the field needed benchmarks that tested whether models could integrate vision and language for genuine reasoning, not pattern matching. This lineage tracks that shift from ScienceQA through MMMU, MathVista, and into the expert-difficulty frontier.

Editor's note

VQA-v2 saturating around 2023 (top models at 82–86%, near human parity on the simple distribution) pushed the community toward benchmarks with genuine reasoning requirements. MMMU was the first large-scale exam-style multimodal benchmark: 30 subjects requiring college-level knowledge to answer correctly. MathVista filled the mathematical-reasoning gap that VQA benchmarks missed entirely. The current active frontier is MMMU-Pro and CharXiv: MMMU-Pro's 10-option format and vision-only condition show that models are pattern-matching more than reasoning; CharXiv's chart-understanding tasks expose weak spatial-quantitative reasoning in current VLMs. The HLE multimodal split, expert-contributed image+text questions, remains the hardest target: no model exceeded 20% at launch, and the best reported score is still only 38.3%.

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path; dashed arrows mark scope shifts (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Each node is detailed in § 02 below; a sketch of the same graph as data follows the edge list.

[Lineage graph (interactive in the original). Legend: attention path · scope shift · branch/fork · active · saturating · saturated/superseded. Attention path: ScienceQA (Sep 2022) → MMMU (Nov 2023, SOTA 86.0%) → MMMU-Pro (Sep 2024) → HLE multimodal split (Jan 2025, SOTA 38.3%). Branches off MMMU: MathVista (Oct 2023), CharXiv (Aug 2024).]
ScienceQA → MMMU · scope shift · attention
ScienceQA proved the multimodal-reasoning benchmark concept worked; MMMU raised the scope to college-level knowledge across 30 subjects with genuine exam difficulty. Top models were saturating ScienceQA at ~90% while scoring 56% on MMMU at launch, confirming that the benchmark transition was necessary.
MMMU → MathVista · scope shift
MathVista fills the specific mathematical-reasoning gap in MMMU — visual geometry, charts, and plots require math skills MMMU's multi-discipline framing doesn't isolate. Complementary rather than competitive.
MMMU → MMMU-Pro · direct successor · attention
MMMU-Pro was built after evidence emerged that models were exploiting MMMU's 4-option format and text-based shortcuts. Its 10-option questions and vision-only condition expose how much language-mediated pattern matching inflated MMMU scores.
MMMU → CharXiv · scope shift
CharXiv narrows focus to scientific chart understanding using real arXiv figures — finer-grained than MMMU's chart questions and sourced from the domain where imprecision in chart reading matters most.
MMMU-Pro → HLE (multimodal split) · scope shift · attention
HLE's multimodal split extends MMMU-Pro's expert-difficulty thesis to genuinely unsolved territory: expert-contributed image+text questions on which VLMs scored below 20% at launch. The current frontier endpoint for multimodal reasoning evaluation.
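For readers who want the graph in machine-readable form, here is a minimal sketch of the lineage as data. The Node/Edge classes and field names are illustrative, not Codesota's actual schema; the dates and SOTA figures are the ones shown in the graph above.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative encoding of this lineage; class and field names are
# hypothetical, not Codesota's schema. Dates/SOTA come from the graph.

@dataclass
class Node:
    name: str
    date: str                     # first release, YYYY-MM
    sota: Optional[float] = None  # top reported score (%), where shown

@dataclass
class Edge:
    src: str
    dst: str
    kind: str        # "scope shift" | "direct successor"
    attention: bool  # True if the edge lies on the main attention path

NODES = [
    Node("ScienceQA", "2022-09"),
    Node("MathVista", "2023-10"),
    Node("MMMU", "2023-11", sota=86.0),
    Node("CharXiv", "2024-08"),
    Node("MMMU-Pro", "2024-09"),
    Node("HLE (multimodal split)", "2025-01", sota=38.3),
]

EDGES = [
    Edge("ScienceQA", "MMMU", "scope shift", attention=True),
    Edge("MMMU", "MathVista", "scope shift", attention=False),
    Edge("MMMU", "MMMU-Pro", "direct successor", attention=True),
    Edge("MMMU", "CharXiv", "scope shift", attention=False),
    Edge("MMMU-Pro", "HLE (multimodal split)", "scope shift", attention=True),
]

def attention_path(start: str) -> list[str]:
    """Follow attention edges from `start` out to the frontier."""
    path, cur = [start], start
    while True:
        nxt = [e.dst for e in EDGES if e.src == cur and e.attention]
        if not nxt:
            return path
        cur = nxt[0]
        path.append(cur)

print(" → ".join(attention_path("ScienceQA")))
# ScienceQA → MMMU → MMMU-Pro → HLE (multimodal split)
```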
§ 02 · Benchmarks in this lineage

Nodes in detail.

Sep 2022 · Saturating

ScienceQA

ScienceQA: Multimodal Science Questions

21,208 multimodal science questions from elementary and high school curricula. Text and image inputs, multiple-choice answers with chain-of-thought explanations. The first widely-used benchmark designed specifically for multimodal reasoning rather than visual grounding. Top models now exceed human accuracy on this distribution.

Lu et al. (UCLA / ASU / AI2) · paper

Oct 2023 · Saturating

MathVista

MathVista: Mathematical Reasoning with Visual Context

6,141 math problems with visual context — geometry figures, statistical charts, scientific plots. Requires combining visual parsing with mathematical reasoning. Human performance ~60.3%. GPT-4V scored 49.9% at launch; top models approached 70% by 2024.

Lu et al. (UCLA / Microsoft) · paper

Nov 2023 · Superseded

MMMU

Massive Multi-discipline Multimodal Understanding and Reasoning

11,550 college-level questions across 30 subjects requiring image+text understanding: medical imaging, engineering diagrams, financial charts, art history. Human baseline ~88.6%. GPT-4V scored 56% at launch; top VLMs reached ~70% by end of 2024. The exam-style multimodal standard.

Yue et al. (CMU / Meta / Google / UC Berkeley) · paper
Aug 2024 · Active

CharXiv

CharXiv: Chart Understanding Evaluation

2,323 charts from arXiv papers paired with descriptive and reasoning questions — requiring precise extraction of values, trend interpretation, and multi-step quantitative reasoning from real scientific figures. Human accuracy ~80%; top VLMs scored 30–47% at launch. Exposes weak chart-reasoning in models that perform well on cleaner MMMU charts.

Wang et al. (Princeton) · paper
Sep 2024 · Active

MMMU-Pro

MMMU-Pro: Harder Multimodal Exam

Harder variant of MMMU: 10-option questions (vs. 4), a vision-only condition where all textual cues are embedded in the image, and filtering to remove questions answerable from text alone. Top models that scored 70% on MMMU drop to 40–50% here. The vision-only condition specifically tests perceptual reasoning rather than language-mediated shortcuts; a sketch of why the option count matters follows this entry.

Yue et al. · paper
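Why the jump from 4 to 10 options matters: the random-guess floor drops from 25% to 10%, so the same raw accuracy carries more genuine signal. A minimal sketch of this arithmetic (the helper names are ours, not from the MMMU-Pro evaluation code):

```python
import random

# Chance floor for k-option multiple choice: a pure guesser scores 1/k.
def chance_floor(num_options: int) -> float:
    return 1.0 / num_options

# Exact-match accuracy of predictions over a multiple-choice split.
def accuracy(preds: list[str], golds: list[str]) -> float:
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Under MMMU's 4-option format the floor is 25%; MMMU-Pro's 10-option
# format lowers it to 10%, shrinking lucky-guess inflation.
for k in (4, 10):
    print(f"{k}-option chance floor: {chance_floor(k):.0%}")

# Simulated random guesser on a 10-option split (illustrative only).
options = [chr(ord("A") + i) for i in range(10)]
golds = [random.choice(options) for _ in range(1000)]
preds = [random.choice(options) for _ in range(1000)]
print(f"random-guess accuracy: {accuracy(preds, golds):.1%}")  # ~10%
```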

Jan 2025 · Active

HLE (multimodal split)

Humanity's Last Exam — Multimodal Subset

~1,000 questions from HLE requiring both image understanding and expert domain knowledge: medical imaging, microscopy, engineering diagrams, mathematical figures. The hardest multimodal reasoning benchmark available; top VLMs scored below 20% at launch.

Phan et al. (Center for AI Safety / Scale AI) · paper