Codesota · Multimodal · Visual Question Answering · MMMU-Pro
Visual Question Answering · benchmark dataset · 2024 · EN

Massive Multi-discipline Multimodal Understanding · Pro.

Harder MMMU variant with vision-only questions and ten answer choices. It closes the text-only shortcuts that models exploited in the original.

§ 01 · Leaderboard

Best published scores.

31 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy (higher is better) · 31 rows
| # | Model | Access | Org | Submitted | Paper / code | Accuracy |
|---|---|---|---|---|---|---|
| 01 | Gemini-3.1-Pro | API | Google | Mar 2026 | artificialanalysis.ai | 82 |
| 02 | GPT-5.2 | | OpenAI | Dec 2025 | artificialanalysis.ai | 81 |
| 03 | Gemini 3 Pro | API | Google | Jan 2026 | artificialanalysis.ai | 80 |
| 04 | Kimi K2.6 | | | Apr 2026 | pwc-dump | 79.40 |
| 05 | Qwen3.5-397B-A17B | Open | Alibaba | Feb 2026 | pwc-dump · code | 79 |
| 06 | Kimi-K2.5 | Open | Moonshot.AI | Feb 2026 | Kimi K2.5: Visual Agentic Intelligence · code | 78.50 |
| 07 | Qwen3.5-122B-A10B | Open | Alibaba | Feb 2026 | pwc-dump · code | 76.90 |
| 08 | Gemma 4 31B | | Google | Apr 2026 | pwc-dump | 76.90 |
| 09 | GPT-5.1 | | OpenAI | Nov 2025 | artificialanalysis.ai | 76.50 |
| 10 | Qwen3.6-27B | | | Apr 2026 | pwc-dump · code | 75.80 |
| 11 | Qwen3.6-35B-A3B | | | Apr 2026 | pwc-dump · code | 75.30 |
| 12 | Qwen3.5-35B-A3B | Open | Alibaba | Feb 2026 | pwc-dump · code | 75.10 |
| 13 | Qwen3.5-27B | Open | Alibaba | Feb 2026 | pwc-dump · code | 75 |
| 14 | Qwen3.5-Omni-Plus | | | Apr 2026 | Qwen3.5-Omni Technical Report | 73.90 |
| 15 | Qwen3.6 Plus | | Alibaba | Mar 2026 | artificialanalysis.ai | 73.80 |
| 16 | SenseNova-U1-A3B-MoT | | SenseTime | May 2026 | SenseNova-U1: Unifying Multimodal Understanding and Gene… · code | 72.83 |
| 17 | Intern-S1-Pro | | Shanghai AI Lab | Mar 2026 | Intern-S1-Pro: Scientific Multimodal Foundation Model at… | 72.80 |
| 18 | Qwen3-VL-235B-A22B-Thinking | | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 69.30 |
| 19 | Qwen3-VL-235B-A22B-Instruct | | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 68.10 |
| 20 | Qwen3-Omni-Flash-Thinking | | | Sep 2025 | Qwen3-Omni Technical Report · code | 60.80 |
| 21 | Qwen3-VL-8B-Instruct | | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 55.90 |
| 22 | Ovis2.5-9B | | | Aug 2025 | Ovis2.5 Technical Report · code | 54.40 |
| 23 | MiniMax-VL-01 | | | Jan 2025 | MiniMax-01: Scaling Foundation Models with Lightning Att… · code | 52.70 |
| 24 | Qwen2.5-VL-72B | | | Feb 2025 | Qwen2.5-VL Technical Report · code | 51.10 |
| 25 | Kimi-VL-A3B-Thinking-2506 | | | Apr 2025 | Kimi-VL Technical Report · code | 46.30 |
| 26 | Qwen2-VL 72B | Open | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 46.20 |
| 27 | Qwen2-VL 7B | | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 43.50 |
| 28 | Qwen2-VL-2B | | | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 37.60 |
| 29 | VideoLLaMA3 7B | | | Jan 2025 | VideoLLaMA 3: Frontier Multimodal Foundation Models for … · code | 33.60 |
| 30 | MiniCPM-V 4.6-Thinking (16x) | | | May 2026 | pwc-dump | 32.50 |
| 31 | VideoLLaMA3 2B | | | Jan 2025 | VideoLLaMA 3: Frontier Multimodal Foundation Models for … · code | 28.60 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
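The ordering rule stated for this table (higher accuracy first, ties broken by submission date) can be sketched in a few lines of Python. This is a minimal illustration, not Codesota's actual ranking code: the `Row` type and `rank` function are hypothetical names, dates are reduced to (year, month) pairs, and the assumption that the earlier date wins a tie is inferred from rows 07 and 08 above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    submitted: tuple[int, int]  # (year, month), as shown in the table
    accuracy: float


# A handful of rows copied from the leaderboard, deliberately out of order.
ROWS = [
    Row("Gemma 4 31B", (2026, 4), 76.90),
    Row("GPT-5.1", (2025, 11), 76.50),
    Row("Gemini-3.1-Pro", (2026, 3), 82.0),
    Row("Qwen3.5-122B-A10B", (2026, 2), 76.90),
    Row("GPT-5.2", (2025, 12), 81.0),
]


def rank(rows):
    # Higher accuracy first; on equal accuracy the earlier date comes first
    # (assumption: matches the order of rows 07 and 08 in the table).
    return sorted(rows, key=lambda r: (-r.accuracy, r.submitted))


ranked = rank(ROWS)
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d} {row.model} {row.accuracy:.2f}")
```

Sorting on a single composite key keeps the tie-break explicit and avoids a second pass over the rows.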
§ 03 · Progress

7 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Sep 18, 2024 · Qwen2-VL 72B · Alibaba · 46.20
  2. Jan 14, 2025 · MiniMax-VL-01 · 52.70
  3. Aug 15, 2025 · Ovis2.5-9B · 54.40
  4. Sep 22, 2025 · Qwen3-Omni-Flash-Thinking · 60.80
  5. Nov 13, 2025 · GPT-5.1 · OpenAI · 76.50
  6. Dec 11, 2025 · GPT-5.2 · OpenAI · 81
  7. Mar 18, 2026 · Gemini-3.1-Pro · Google · 82
Fig 3 · SOTA-setting models only. 7 entries span Sep 2024 to Mar 2026.
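The selection of SOTA-setting entries described in this section amounts to a running maximum over results in date order. The sketch below shows that idea on five rows taken from the leaderboard; `sota_steps` is a hypothetical helper, dates are simplified to "YYYY-MM" strings (the leaderboard lists months only), and the strict `>` comparison is an assumption based on the wording "broke the previous record".

```python
def sota_steps(results):
    """Return the entries that strictly beat every earlier score.

    results: iterable of (date, model, accuracy), with date as a
    sortable "YYYY-MM" string (an assumed simplification).
    """
    steps = []
    best = float("-inf")
    # sorted() is stable, so same-month entries keep their input order.
    for date, model, accuracy in sorted(results, key=lambda r: r[0]):
        if accuracy > best:  # a tie does not count as a new record
            steps.append((date, model, accuracy))
            best = accuracy
    return steps


# Rows copied from the leaderboard; two of them never held the record.
RESULTS = [
    ("2025-02", "Qwen2.5-VL-72B", 51.10),
    ("2024-09", "Qwen2-VL 72B", 46.20),
    ("2025-08", "Ovis2.5-9B", 54.40),
    ("2025-01", "MiniMax-VL-01", 52.70),
    ("2025-04", "Kimi-VL-A3B-Thinking-2506", 46.30),
]

steps = sota_steps(RESULTS)
for date, model, accuracy in steps:
    print(date, model, accuracy)
```

Qwen2.5-VL-72B and Kimi-VL-A3B-Thinking-2506 drop out because each arrived after a higher score was already on record, which is why they appear in the leaderboard but not in the list above.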
§ 04 · Literature

12 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
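As a rough pre-flight check, the five requirements above could be expressed as a small validation function. Everything here is illustrative: the field names, the `REQUIRED_FIELDS` mapping, and the `find_problems` helper are hypothetical and do not describe Codesota's real submission format, and the example values are placeholders rather than real results.

```python
# Hypothetical field names, one per requirement in the list above.
REQUIRED_FIELDS = {
    "checkpoint": str,    # 01: public checkpoint or API endpoint
    "commit": str,        # 02: frozen commit of the reproduction script
    "seed": int,          # 02: random seed used for the run
    "environment": dict,  # 03: Python version and dependencies
    "scores": dict,       # 04: one entry per declared metric
    "contact": str,       # 05: who to reach about discrepancies
}

# This dataset declares a single metric on the page.
DECLARED_METRICS = ("accuracy",)


def find_problems(submission):
    """Return a list of human-readable problems; empty means it looks complete."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in submission:
            problems.append(f"missing: {name}")
        elif not isinstance(submission[name], expected_type):
            problems.append(f"wrong type: {name}")
    scores = submission.get("scores")
    if isinstance(scores, dict):
        for metric in DECLARED_METRICS:
            if metric not in scores:
                problems.append(f"missing metric: {metric}")
    return problems


# Placeholder submission; none of these values are real.
complete = {
    "checkpoint": "https://example.com/checkpoint",
    "commit": "0000000",
    "seed": 0,
    "environment": {"python": "3.11", "deps": []},
    "scores": {"accuracy": 0.0},
    "contact": "someone@example.com",
}
draft = {k: v for k, v in complete.items() if k != "seed"}

print(find_problems(complete))
print(find_problems(draft))
```

Returning a list of problems rather than raising on the first one lets a submitter fix everything in a single pass.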