Codesota · Multimodal · Visual Question Answering · MMMU
Visual Question Answering · benchmark dataset · 2024 · EN

Massive Multidiscipline Multimodal Understanding.

MMMU (Massive Multidiscipline Multimodal Understanding) is a benchmark of 11.5K multimodal questions drawn from college-level exams. It spans six disciplines (Art, Business, Science, Health, Humanities, and Tech), 30 subjects, and 183 subfields. Questions require deep reasoning over images, diagrams, and text, and test multi-image understanding and expert-level domain knowledge. It has been a key VLM reasoning benchmark since early 2024.

§ 01 · Leaderboard

Best published scores.

30 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
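
The ordering rule stated above (score first, submission date as tie-breaker) can be sketched as a single sort key. This is a minimal sketch, not Codesota's implementation: the `Result` type, the `rank` function, and the `model-a`/`model-b`/`model-c` rows are hypothetical, and it assumes that on a tie the earlier submission ranks first, which the page does not specify.

```python
from datetime import date
from typing import NamedTuple


class Result(NamedTuple):
    """One leaderboard row (hypothetical structure)."""
    model: str
    submitted: date
    accuracy: float


def rank(results: list[Result]) -> list[Result]:
    # Higher accuracy first; on equal accuracy, the earlier
    # submission date ranks first (assumed tie-break direction).
    return sorted(results, key=lambda r: (-r.accuracy, r.submitted))


# Hypothetical rows, chosen only to exercise the tie-break.
rows = [
    Result("model-a", date(2025, 3, 1), 70.0),
    Result("model-b", date(2025, 1, 1), 70.0),
    Result("model-c", date(2025, 2, 1), 72.5),
]
ranked = [r.model for r in rank(rows)]
print(ranked)  # ['model-c', 'model-b', 'model-a']
```

If the site instead favors the most recent submission on a tie, the second key component would be negated (for example by sorting on `-r.submitted.toordinal()`).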


Primary metric: accuracy (higher is better) · 30 rows
| # | Model | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|
| 01 | Qwen3.6 Plus | Alibaba | Mar 2026 | llm-stats.com | 86 |
| 02 | GPT-5.1 Thinking | OpenAI | Nov 2025 | llm-stats.com | 85.40 |
| 03 | GPT-5.1 Instant | OpenAI | Nov 2025 | llm-stats.com | 85.40 |
| 04 | GPT-5.1 | OpenAI | Nov 2025 | llm-stats.com | 85.40 |
| 05 | Qwen3.5-122B-A10B | Alibaba Cloud | Sep 2025 | llm-stats.com | 83.90 |
| 06 | Qwen3.5-397B-A17B [Open] | Alibaba | Sep 2025 | llm-stats.com | 83.90 |
| 07 | Qwen3.5-27B | Alibaba Cloud | Sep 2025 | llm-stats.com | 82.30 |
| 08 | Gemini 2.5 Pro | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 82 |
| 09 | Gemini 2.5 Flash | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 79.70 |
| 10 | InternVL3-78B [Open] | Shanghai AI Lab | Jan 2025 | InternVL3: Exploring Advanced Training and Test-Time Rec… | 73.30 |
| 11 | InternVL3-78B [Open] | Shanghai AI Lab | Apr 2025 | InternVL3: Exploring Advanced Training and Test-Time Rec… · code | 72.20 |
| 12 | Gemini 2.0 Flash [API] | Google | Jan 2025 | Gemini 2.0 Flash Technical Report | 71.90 |
| 13 | Qwen2.5-VL 72B [Open] | Alibaba | Feb 2025 | Qwen2.5-VL Technical Report | 70.20 |
| 14 | GPT-4o [API] | OpenAI | Oct 2024 | SWE-bench Verified | 69.10 |
| 15 | MiniMax-VL-01 | | Jan 2025 | MiniMax-01: Scaling Foundation Models with Lightning Att… · code | 68.50 |
| 16 | Claude 3.5 Sonnet [API] | Anthropic | Oct 2024 | Claude 3.5 Sonnet Model Card | 68.30 |
| 17 | InternVL2-76B [Open] | Shanghai AI Lab | Apr 2024 | InternVL: Scaling up Vision Foundation Models and Aligni… | 67.40 |
| 18 | Gemma 3 (27B, IT) | | Mar 2025 | Gemma 3 Technical Report · code | 64.90 |
| 19 | Qwen2-VL 72B [Open] | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… | 64.50 |
| 20 | Qwen2-VL 72B [Open] | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 64.50 |
| 21 | Gemini 1.5 Pro [API] | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 62.20 |
| 22 | Llama 3.2 Vision 90B [Open] | Meta | Jul 2024 | The Llama 3 Herd of Models | 60.30 |
| 23 | Claude 3 Opus [API] | Anthropic | Mar 2024 | Claude 3 Model Family (Haiku, Sonnet, Opus) | 59.40 |
| 24 | Qwen3-Omni-30B-A3B-Base-202507 | | Sep 2025 | Qwen3-Omni Technical Report · code | 59.33 |
| 25 | GPT-4V | | Mar 2023 | GPT-4 Technical Report | 56.80 |
| 26 | BAGEL (7B MoT) | | May 2025 | Emerging Properties in Unified Multimodal Pretraining · code | 55.30 |
| 27 | Qwen2-VL 7B | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 54.10 |
| 28 | BLIP3-o (8B) | | May 2025 | BLIP3-o: A Family of Fully Open Unified Multimodal Model… · code | 50.60 |
| 29 | VideoLLaMA3 2B | | Jan 2025 | VideoLLaMA 3: Frontier Multimodal Foundation Models for … · code | 45.30 |
| 30 | Qwen2-VL-2B | | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 41.10 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
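
The accuracy values in the table are percentages of correctly answered questions. As a rough illustration only, the sketch below scores multiple-choice answers by exact letter match. The `accuracy` function and the sample answers are hypothetical, and it assumes predictions have already been reduced to a single option letter; a full evaluation of this benchmark would also need to handle open-ended questions and answer extraction from free-form model output, which this sketch does not attempt.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Percentage of predictions that match the reference option letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty set")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return round(100.0 * correct / len(answers), 2)


# Hypothetical predictions and references: three of four match.
score = accuracy(["A", "c", "B", "D"], ["A", "C", "D", "D"])
print(score)  # 75.0
```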
§ 03 · Progress

11 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Mar 15, 2023 · GPT-4V · 56.80
  2. Feb 15, 2024 · Gemini 1.5 Pro · Google · 62.20
  3. Apr 25, 2024 · InternVL2-76B · Shanghai AI Lab · 67.40
  4. Oct 22, 2024 · Claude 3.5 Sonnet · Anthropic · 68.30
  5. Oct 25, 2024 · GPT-4o · OpenAI · 69.10
  6. Jan 15, 2025 · Gemini 2.0 Flash · Google · 71.90
  7. Jan 22, 2025 · InternVL3-78B · Shanghai AI Lab · 73.30
  8. Jul 7, 2025 · Gemini 2.5 Pro · 82
  9. Sep 1, 2025 · Qwen3.5-122B-A10B · Alibaba Cloud · 83.90
  10. Nov 13, 2025 · GPT-5.1 Thinking · OpenAI · 85.40
  11. Mar 15, 2026 · Qwen3.6 Plus · Alibaba · 86
Fig 3 · SOTA-setting models only. 11 entries span Mar 2023 to Mar 2026.
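
The record-setting entries above amount to a running maximum over results ordered by date. The following is a minimal sketch of that selection, not the site's code: `sota_steps` is a hypothetical name, the scores and the dated entries come from the lists on this page, and the day of month for the two non-record rows (Claude 3 Opus and Qwen2-VL 72B) is an assumption because the leaderboard only gives month and year for them.

```python
from datetime import date


def sota_steps(results):
    """Keep only entries that strictly beat every earlier score."""
    steps = []
    best = float("-inf")
    for when, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:  # a tie does not count as a new record
            steps.append((when, model, score))
            best = score
    return steps


results = [
    (date(2024, 10, 25), "GPT-4o", 69.10),
    (date(2023, 3, 15), "GPT-4V", 56.80),
    (date(2024, 3, 1), "Claude 3 Opus", 59.40),   # day assumed
    (date(2024, 2, 15), "Gemini 1.5 Pro", 62.20),
    (date(2024, 9, 1), "Qwen2-VL 72B", 64.50),    # day assumed
    (date(2024, 4, 25), "InternVL2-76B", 67.40),
    (date(2024, 10, 22), "Claude 3.5 Sonnet", 68.30),
]
step_models = [model for _, model, _ in sota_steps(results)]
print(step_models)
```

On this subset the function returns GPT-4V, Gemini 1.5 Pro, InternVL2-76B, Claude 3.5 Sonnet, and GPT-4o, in that order, which matches the first five steps listed above. Using a strict comparison means that a later model with an equal score is not counted as a new step.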
§ 04 · Literature

19 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
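
Before submitting, the five requirements above can be checked mechanically. The sketch below is purely illustrative: the page does not publish a submission schema, so the `Submission` fields, the `missing_items` helper, and the example values are all hypothetical. The only detail taken from the page is that this dataset declares a single metric, accuracy.

```python
from dataclasses import dataclass, field
from typing import Optional

# This dataset declares one metric (see the leaderboard above).
DATASET_METRICS = ("accuracy",)


@dataclass
class Submission:
    """Hypothetical submission record; field names are assumptions."""
    checkpoint_or_endpoint: str = ""
    repro_commit: str = ""
    seed: Optional[int] = None
    python_version: str = ""
    dependencies: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)
    contact: str = ""


def missing_items(sub: Submission, metrics=DATASET_METRICS) -> list:
    """Return the checklist items that the submission does not satisfy."""
    problems = []
    if not sub.checkpoint_or_endpoint:
        problems.append("01 public checkpoint or API endpoint")
    if not sub.repro_commit or sub.seed is None:
        problems.append("02 reproduction script with frozen commit + seed")
    if not sub.python_version or not sub.dependencies:
        problems.append("03 declared evaluation environment")
    if set(sub.scores) != set(metrics):
        problems.append("04 one row per declared metric")
    if not sub.contact:
        problems.append("05 contact")
    return problems


complete = Submission(
    checkpoint_or_endpoint="https://example.org/checkpoint",  # placeholder
    repro_commit="abc1234",                                   # placeholder
    seed=0,
    python_version="3.11",
    dependencies=["torch"],
    scores={"accuracy": 70.0},                                # placeholder
    contact="author@example.org",                             # placeholder
)
print(missing_items(complete))           # []
print(len(missing_items(Submission())))  # 5
```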