Codesota · Tasks · Multimodal · Visual Question Answering · MMMU
Visual Question Answering · benchmark dataset · 2024 · EN

Massive Multi-discipline Multimodal Understanding (MMMU).

The Massive Multi-discipline Multimodal Understanding (MMMU) benchmark covers 11.5K multimodal questions drawn from college-level exams across 30 subjects and 183 subfields in six disciplines: Art, Business, Science, Health, Humanities, and Tech. It demands deep reasoning over images, diagrams, and text, and tests multi-image understanding as well as expert-level domain knowledge. MMMU has been a key VLM reasoning benchmark since early 2024.
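MMMU is scored by plain answer accuracy over its (mostly multiple-choice) questions. A minimal sketch of the metric, with a hypothetical function name and toy answer strings of my own:

```python
def accuracy(predictions, references):
    """Fraction of predicted answers that exactly match the gold answers."""
    if not references:
        raise ValueError("empty reference list")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Toy example: 3 of 4 multiple-choice answers correct.
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))  # 0.75
```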

§ 01 · Leaderboard

Best published scores.

18 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
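The ordering rule above (higher score first, ties broken by earlier submission date) can be sketched as a single sort key; the exact day-level dates in this toy example are hypothetical:

```python
from datetime import date

# Hypothetical rows: (model, accuracy, submission date).
rows = [
    ("GPT-5.1 Thinking", 85.4, date(2025, 11, 13)),
    ("GPT-5.1 Instant",  85.4, date(2025, 11, 20)),
    ("Qwen3.6 Plus",     86.0, date(2026, 3, 15)),
]

# Higher score first; among equal scores, the earlier submission ranks higher.
ranked = sorted(rows, key=lambda r: (-r[1], r[2]))
print([model for model, _, _ in ranked])
# ['Qwen3.6 Plus', 'GPT-5.1 Thinking', 'GPT-5.1 Instant']
```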


Primary metric: accuracy · higher is better · 18 rows
 # | Model | Access | Org | Submitted | Paper / code | Accuracy
01 | Qwen3.6 Plus | | Alibaba Cloud | Mar 2026 | llm-stats.com | 86.00
02 | GPT-5.1 Thinking | | OpenAI | Nov 2025 | llm-stats.com | 85.40
03 | GPT-5.1 Instant | | OpenAI | Nov 2025 | llm-stats.com | 85.40
04 | GPT-5.1 | | OpenAI | Nov 2025 | llm-stats.com | 85.40
05 | Qwen3.5-122B-A10B | | Alibaba Cloud | Sep 2025 | llm-stats.com | 83.90
06 | Qwen3.5-397B-A17B | | Alibaba Cloud | Sep 2025 | llm-stats.com | 83.90
07 | Qwen3.5-27B | | Alibaba Cloud | Sep 2025 | llm-stats.com | 82.30
08 | InternVL3-78B | OSS | Shanghai AI Lab | Jan 2025 | InternVL3: Exploring Advanced Training and Test-Time Rec… | 73.30
09 | Gemini 2.0 Flash | API | Google | Jan 2025 | Gemini 2.0 Flash Technical Report | 71.90
10 | Qwen2.5-VL 72B | OSS | Alibaba | Feb 2025 | Qwen2.5-VL Technical Report | 70.20
11 | GPT-4o | API | OpenAI | Oct 2024 | GPT-4o System Card | 69.10
12 | Claude 3.5 Sonnet | API | Anthropic | Oct 2024 | Claude 3.5 Sonnet Model Card | 68.30
13 | InternVL2-76B | OSS | Shanghai AI Lab | Apr 2024 | InternVL: Scaling up Vision Foundation Models and Aligni… | 67.40
14 | Qwen2-VL 72B | OSS | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… | 64.50
15 | Gemini 1.5 Pro | API | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 62.20
16 | Llama 3.2 Vision 90B | OSS | Meta | Jul 2024 | The Llama 3 Herd of Models | 60.30
17 | Claude 3 Opus | API | Anthropic | Mar 2024 | Claude 3 Model Family (Haiku, Sonnet, Opus) | 59.40
18 | GPT-4V | API | OpenAI | Mar 2023 | GPT-4 Technical Report | 56.80
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

10 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.
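The record-extraction rule described above can be sketched as a single pass over chronologically sorted results; the day-level dates in the toy history are illustrative:

```python
def sota_steps(results):
    """Given (date, model, score) tuples, return only the entries that
    beat every earlier score, in chronological order."""
    steps, best = [], float("-inf")
    for entry in sorted(results, key=lambda r: r[0]):
        if entry[2] > best:  # strict improvement sets a new record
            steps.append(entry)
            best = entry[2]
    return steps

history = [
    ("2023-03-15", "GPT-4V", 56.8),
    ("2024-02-15", "Gemini 1.5 Pro", 62.2),
    ("2024-04-25", "InternVL2-76B", 67.4),
    ("2024-09-30", "Qwen2-VL 72B", 64.5),  # below the standing record: skipped
]
print([model for _, model, _ in sota_steps(history)])
# ['GPT-4V', 'Gemini 1.5 Pro', 'InternVL2-76B']
```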

SOTA line · accuracy
  1. Mar 15, 2023 · GPT-4V · OpenAI · 56.80
  2. Feb 15, 2024 · Gemini 1.5 Pro · Google · 62.20
  3. Apr 25, 2024 · InternVL2-76B · Shanghai AI Lab · 67.40
  4. Oct 22, 2024 · Claude 3.5 Sonnet · Anthropic · 68.30
  5. Oct 25, 2024 · GPT-4o · OpenAI · 69.10
  6. Jan 15, 2025 · Gemini 2.0 Flash · Google · 71.90
  7. Jan 22, 2025 · InternVL3-78B · Shanghai AI Lab · 73.30
  8. Sep 1, 2025 · Qwen3.5-122B-A10B · Alibaba Cloud · 83.90
  9. Nov 13, 2025 · GPT-5.1 Thinking · OpenAI · 85.40
  10. Mar 15, 2026 · Qwen3.6 Plus · Alibaba Cloud · 86.00
Fig 3 · SOTA-setting models only. 10 entries span Mar 2023 to Mar 2026.
§ 04 · Literature

11 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
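The frozen-seed requirement (item 02) can be sketched as a small helper at the top of a reproduction script. The function name and default seed are my own; a real submission would also pin framework RNGs (NumPy, PyTorch) and record exact dependency versions:

```python
import os
import random
import sys

def freeze_environment(seed: int = 1234) -> dict:
    """Pin the sources of nondeterminism a reviewer would check:
    the stdlib RNG seed, plus a record of the interpreter version."""
    random.seed(seed)
    # Note: PYTHONHASHSEED only affects hashing if set before interpreter
    # startup; it is recorded here so reruns can export it first.
    os.environ["PYTHONHASHSEED"] = str(seed)
    return {
        "seed": seed,
        "python": sys.version.split()[0],
    }

env = freeze_environment(1234)
print(env["seed"])  # 1234
```

Calling the helper twice with the same seed makes subsequent `random` draws repeat, which is the property a re-run of the evaluation relies on.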