Codesota · Benchmark · MMMU

MMMU.

The Massive Multidiscipline Multimodal Understanding (MMMU) benchmark comprises 11.5K multimodal questions drawn from college-level exams. It covers six disciplines (Art, Business, Science, Health, Humanities, and Tech), spanning 30 subjects and 183 subfields. Questions require deep reasoning over images, diagrams, and text, and test multi-image understanding and expert-level domain knowledge. It has been a key VLM reasoning benchmark since early 2024.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


Accuracy

Accuracy is the reported evaluation metric for MMMU. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
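
As a rough illustration of the metric, the sketch below computes accuracy as the percentage of exact matches between predicted and reference answers. It assumes every answer is a single option letter; the function name `accuracy` and the sample data are hypothetical, and the official MMMU evaluation may extract and compare answers differently.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical option letters for four multiple-choice questions.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
print(accuracy(preds, gold))  # 75.0 (three of four match)
```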

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Links | Note |
|------|-------|-------|-------|------|-------|------|
| 01 | Qwen3.6 Plus | verified | 86 | 2026 | Source | |
| 02 | GPT-5.1 | verified | 85.4 | 2025 | Source | |
| 03 | GPT-5.1 Instant | verified | 85.4 | 2025 | Source | |
| 04 | GPT-5.1 Thinking | verified | 85.4 | 2025 | Source | |
| 05 | Qwen3.5-122B-A10B | verified | 83.9 | 2025 | Source | |
| 06 | Qwen3.5-397B-A17B | verified | 83.9 | 2025 | Source | |
| 07 | Qwen3.5-27B | verified | 82.3 | 2025 | Source | |
| 08 | Gemini 2.5 Pro | unverified | 82 | 2025 | Paper | |
| 09 | Gemini 2.5 Flash | unverified | 79.7 | 2025 | Paper | |
| 10 | InternVL3-78B | verified | 73.3 | 2026 | Source | MMMU val. InternVL3-78B. Table 2. arxiv:2501.12891 |
| 11 | Gemini 2.0 Flash | verified | 71.9 | 2025 | Paper | MMMU val. Gemini 2.0 Flash. Technical report. |
| 12 | Qwen2.5-VL 72B | verified | 70.2 | 2025 | Paper | MMMU val. Qwen2.5-VL 72B. Table 2. arxiv:2502.13923 |
| 13 | GPT-4o | verified | 69.1 | 2026 | Source | MMMU val. GPT-4o system card Table 1. arxiv:2410.21276 |
| 14 | MiniMax-VL-01 | unverified | 68.5 | 2025 | Paper, Code | |
| 15 | Claude 3.5 Sonnet | verified | 68.3 | 2026 | Source | MMMU val. Claude 3.5 Sonnet (Oct 2024). Anthropic model card. |
| 16 | InternVL2-76B | verified | 67.4 | 2026 | Source | MMMU val. InternVL2-76B. Table 10. arxiv:2404.16821 |
| 17 | Gemma 3 (27B, IT) | unverified | 64.9 | 2025 | Paper, Code | |
| 18 | Qwen2-VL 72B | verified | 64.5 | 2024 | Paper | MMMU val. Qwen2-VL 72B. Table 6. arxiv:2409.12191 |
| 19 | Gemini 1.5 Pro | verified | 62.2 | 2024 | Paper | MMMU val. Table 5. Gemini 1.5 paper arxiv:2403.05530 |
| 20 | Llama 3.2 Vision 90B | verified | 60.3 | 2026 | Source | MMMU val. Llama 3.2 Vision 90B. Table 3. arxiv:2407.21783 |
| 21 | Claude 3 Opus | verified | 59.4 | 2026 | Source | MMMU val. 0-shot. Anthropic Claude 3 family model card. March 2024. |
| 22 | Qwen3-Omni-30B-A3B-Base-202507 | unverified | 59.33 | 2025 | Paper, Code | |
| 23 | GPT-4V | verified | 56.8 | 2026 | Source | MMMU val. 0-shot. MMMU benchmark paper Table 1. Source cross-referenced with GPT-4 Technical Report. |
| 24 | BAGEL (7B MoT) | unverified | 55.3 | 2025 | Paper, Code | |
| 25 | Qwen2-VL 7B | unverified | 54.1 | 2024 | Paper, Code | |
| 26 | BLIP3-o (8B) | unverified | 50.6 | 2025 | Paper, Code | |
| 27 | VideoLLaMA3 2B | unverified | 45.3 | 2025 | Paper, Code | |
| 28 | Qwen2-VL-2B | unverified | 41.1 | 2024 | Paper, Code | |
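
Because entries carry different trust tiers, it can help to compare scores within one tier at a time. The sketch below is one way to do that, using a handful of rows copied from the leaderboard above. The `Result` record and the `top_by_trust` helper are hypothetical names chosen for illustration and are not part of any Codesota API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical record type for this sketch)."""
    model: str
    trust: str
    score: float
    year: int


# A few rows copied from the leaderboard.
results = [
    Result("Qwen3.6 Plus", "verified", 86.0, 2026),
    Result("GPT-5.1", "verified", 85.4, 2025),
    Result("Gemini 2.5 Pro", "unverified", 82.0, 2025),
    Result("Gemini 2.5 Flash", "unverified", 79.7, 2025),
    Result("InternVL3-78B", "verified", 73.3, 2026),
]


def top_by_trust(rows, trust, k=3):
    """Return the k highest-scoring rows that carry the given trust tier."""
    matching = [r for r in rows if r.trust == trust]
    return sorted(matching, key=lambda r: r.score, reverse=True)[:k]


for r in top_by_trust(results, "verified"):
    print(f"{r.model}: {r.score}")
```

Filtering before ranking keeps unverified numbers from being read as directly comparable with verified ones.
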
§ 03 · Lineage

MMMU in context.

Predecessors (1)
saturating · 2022-09
ScienceQA
ScienceQA proved that the multimodal-reasoning benchmark concept worked; MMMU widened the scope to college-level knowledge across 30 subjects with genuine exam difficulty. Top models were saturating ScienceQA at around 90% while scoring 56% on MMMU at launch, which confirmed that the transition to a harder benchmark was necessary.
This benchmark (1)
active · 2023-11
MMMU
Successors (3)
active · 2023-10
MathVista
MathVista fills the specific mathematical-reasoning gap in MMMU: visual geometry, charts, and plots require math skills that MMMU's multi-discipline framing doesn't isolate. It is complementary rather than competitive.
active · 2024-09
MMMU-Pro
MMMU-Pro was built after evidence emerged that models were exploiting MMMU's 4-option format and text-based shortcuts. 10-option questions and a vision-only condition expose how much language-mediated pattern matching inflated MMMU scores.
active · 2024-08
CharXiv
CharXiv narrows the focus to scientific chart understanding using real arXiv figures. It is finer-grained than MMMU's chart questions and is sourced from the domain where imprecision in chart reading matters most.
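
The MMMU-Pro note above turns on the number of answer options, which can be made concrete with a little arithmetic. Under uniform random guessing, a 4-option question is answered correctly 25% of the time and a 10-option question 10% of the time. The sketch below computes that baseline and a generic chance-corrected rescaling. It assumes every question is multiple choice with the same number of options, which is a simplification, and the rescaling is a standard illustration rather than a method used by Codesota or the benchmark authors.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy (%) of uniform random guessing among num_options choices."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 100.0 / num_options


def chance_corrected(score: float, num_options: int) -> float:
    """Rescale an accuracy (%) so random guessing maps to 0 and a perfect score to 100."""
    chance = chance_accuracy(num_options)
    return 100.0 * (score - chance) / (100.0 - chance)


print(chance_accuracy(4))   # 25.0
print(chance_accuracy(10))  # 10.0

# GPT-4V's 56.8 from the leaderboard, rescaled under the 4-option assumption.
print(round(chance_corrected(56.8, 4), 1))
```

The lower guessing floor is one reason a 10-option format leaves less room for scores to be lifted by elimination or lucky guesses.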
§ 04 · Submit a result

Add to the leaderboard.
