Who leads the MMMU benchmark?

Qwen3.6 Plus currently leads MMMU with an Accuracy score of 86.

What is the state-of-the-art score on MMMU?

The state-of-the-art result on MMMU is 86 (Accuracy), achieved by Qwen3.6 Plus as of 2026.

How many models are tracked on MMMU?

Codesota tracks 28 models on MMMU.

When was the MMMU leaderboard last updated?

The MMMU leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2024.

MMMU.

Name: MMMU Benchmark Results
Creator: Unknown
Published: 2024-01-01
License: https://creativecommons.org/licenses/by/4.0/

MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark of 11.5K multimodal questions drawn from college-level exams. It spans six disciplines (Art, Business, Science, Health, Humanities, and Tech), with 30 subjects and 183 subfields in total. The questions require deep reasoning over images, diagrams, and text, and test multi-image understanding and expert-level domain knowledge. MMMU has been a key VLM reasoning benchmark since early 2024.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Accuracy

Accuracy is the reported evaluation metric for MMMU. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year	Links	Notes
01	Qwen3.6 Plus	verified	86	2026	Source
02	GPT-5.1	verified	85.4	2025	Source
03	GPT-5.1 Instant	verified	85.4	2025	Source
04	GPT-5.1 Thinking	verified	85.4	2025	Source
05	Qwen3.5-122B-A10B	verified	83.9	2025	Source
06	Qwen3.5-397B-A17B	verified	83.9	2025	Source
07	Qwen3.5-27B	verified	82.3	2025	Source
08	Gemini 2.5 Pro	unverified	82	2025	Paper
09	Gemini 2.5 Flash	unverified	79.7	2025	Paper
10	InternVL3-78B	verified	73.3	2026	Source	MMMU val. Table 2. arxiv:2501.12891
11	Gemini 2.0 Flash	verified	71.9	2025	Paper	MMMU val. Technical report.
12	Qwen2.5-VL 72B	verified	70.2	2025	Paper	MMMU val. Table 2. arxiv:2502.13923
13	GPT-4o	verified	69.1	2026	Source	MMMU val. GPT-4o system card Table 1. arxiv:2410.21276
14	MiniMax-VL-01	unverified	68.5	2025	Paper, Code
15	Claude 3.5 Sonnet	verified	68.3	2026	Source	MMMU val. Claude 3.5 Sonnet (Oct 2024). Anthropic model card.
16	InternVL2-76B	verified	67.4	2026	Source	MMMU val. Table 10. arxiv:2404.16821
17	Gemma 3 (27B, IT)	unverified	64.9	2025	Paper, Code
18	Qwen2-VL 72B	verified	64.5	2024	Paper	MMMU val. Table 6. arxiv:2409.12191
19	Gemini 1.5 Pro	verified	62.2	2024	Paper	MMMU val. Table 5. Gemini 1.5 paper arxiv:2403.05530
20	Llama 3.2 Vision 90B	verified	60.3	2026	Source	MMMU val. Table 3. arxiv:2407.21783
21	Claude 3 Opus	verified	59.4	2026	Source	MMMU val. 0-shot. Anthropic Claude 3 family model card. March 2024.
22	Qwen3-Omni-30B-A3B-Base-202507	unverified	59.33	2025	Paper, Code
23	GPT-4V	verified	56.8	2026	Source	MMMU val. 0-shot. MMMU benchmark paper Table 1. Source cross-referenced with GPT-4 Technical Report.
24	BAGEL (7B MoT)	unverified	55.3	2025	Paper, Code
25	Qwen2-VL 7B	unverified	54.1	2024	Paper, Code
26	BLIP3-o (8B)	unverified	50.6	2025	Paper, Code
27	VideoLLaMA3 2B	unverified	45.3	2025	Paper, Code
28	Qwen2-VL-2B	unverified	41.1	2024	Paper, Code
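
The muting rule stated above the table (a row is muted when an earlier or same-year result already scored better) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Codesota's actual implementation: the `Result` class and the `muted_rows` function are hypothetical names, the comparison works at year granularity because the table lists only years, and tied scores are assumed to stay unmuted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    score: float
    year: int


def muted_rows(results):
    """Return the models whose result was not state of the art when published.

    A row counts as muted if some other row from the same year or earlier
    has a strictly higher score. Ties are assumed to stay unmuted.
    """
    muted = []
    for r in results:
        beaten = any(
            o.year <= r.year and o.score > r.score
            for o in results
            if o is not r
        )
        if beaten:
            muted.append(r.model)
    return muted


# A handful of rows copied from the table above.
rows = [
    Result("Qwen3.6 Plus", 86.0, 2026),
    Result("GPT-5.1", 85.4, 2025),
    Result("InternVL3-78B", 73.3, 2026),
    Result("Qwen2-VL 72B", 64.5, 2024),
    Result("Gemini 1.5 Pro", 62.2, 2024),
]

print(muted_rows(rows))  # ['InternVL3-78B', 'Gemini 1.5 Pro']
```

On this subset, InternVL3-78B is muted because Qwen3.6 Plus scored higher in the same year, and Gemini 1.5 Pro is muted because Qwen2-VL 72B scored higher in 2024. With finer publication dates the outcome could differ for same-year rows.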

Lineage

MMMU in context.

Predecessors (1)

saturating · 2022-09

ScienceQA

ScienceQA proved that the multimodal-reasoning benchmark concept worked; MMMU raised the scope to college-level knowledge across 30 subjects with genuine exam difficulty. Top models were saturating ScienceQA at around 90% while scoring about 56% on MMMU at launch, which confirmed that the benchmark transition was necessary.

This benchmark (1)

active · 2023-11

MMMU

Successors (3)

active · 2023-10

MathVista

MathVista fills the specific mathematical-reasoning gap in MMMU: visual geometry, charts, and plots require math skills that MMMU's multi-discipline framing doesn't isolate. It is complementary rather than competitive.

active · 2024-09

MMMU-Pro

MMMU-Pro was built after evidence emerged that models were exploiting MMMU's 4-option format and text-based shortcuts. 10-option questions and a vision-only condition expose how much language-mediated pattern matching inflated MMMU scores.

active · 2024-08

CharXiv

CharXiv narrows focus to scientific chart understanding using real arXiv figures — finer-grained than MMMU's chart questions and sourced from the domain where imprecision in chart reading matters most.
