Visual Question Answering
Massive Multi-discipline Multimodal Understanding (MMMU)
A benchmark of 11.5K multimodal questions drawn from college-level exams across 30 subjects and 183 subfields spanning six disciplines: Art, Business, Science, Health, Humanities, and Tech. It requires deep reasoning over images, diagrams, and text, and tests multi-image understanding and expert-level domain knowledge. A key VLM reasoning benchmark since early 2024.
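Since MMMU questions are multiple-choice, scoring reduces to exact-match accuracy over predicted option letters. A minimal sketch of that metric, assuming illustrative record fields (`answer`, `prediction`) rather than the official evaluation harness:

```python
# Hedged sketch of MMMU-style scoring: percent of questions whose
# predicted option letter exactly matches the gold answer.
# The field names below are illustrative, not the official harness.

def mmmu_accuracy(examples):
    """Percent of examples whose predicted option letter matches the gold answer."""
    correct = sum(
        1 for ex in examples
        if ex["prediction"].strip().upper() == ex["answer"].strip().upper()
    )
    return 100.0 * correct / len(examples)

preds = [
    {"answer": "A", "prediction": "A"},
    {"answer": "C", "prediction": "B"},
    {"answer": "D", "prediction": "d"},  # letter match is case-insensitive
    {"answer": "B", "prediction": "B"},
]
print(round(mmmu_accuracy(preds), 1))  # 75.0
```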
Current State of the Art
InternVL3-78B
Shanghai AI Lab
73.3
accuracy
Accuracy Progress Over Time
Showing 5 breakthroughs from Mar 2023 to Jan 2025
Key Milestones
Mar 2023
GPT-4V
MMMU validation set, 0-shot. Score from the MMMU benchmark paper, Table 1; cross-referenced with the GPT-4 Technical Report.
56.8
Total Improvement
29.0%
Time Span
1y 10m
Breakthroughs
5
Current SOTA
73.3
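The summary stats above follow from simple arithmetic; a quick check that the 29.0% figure is the relative gain of the current SOTA over the first milestone score:

```python
# Reproducing the "Total Improvement" stat as relative gain
# over the first milestone.

baseline = 56.8  # GPT-4V, Mar 2023
sota = 73.3      # InternVL3-78B, Jan 2025

absolute_gain = sota - baseline                   # 16.5 points
relative_gain = 100.0 * absolute_gain / baseline  # gain relative to baseline
print(f"{relative_gain:.1f}%")  # 29.0%
```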
Top Models Performance Comparison
Top 10 models ranked by accuracy, plus the GPT-4V baseline
Best Score
73.3
Top Model
InternVL3-78B
Models Compared
10
Score Range
13.9
Accuracy (primary metric)
| # | Model | Access | Org | Score | Date |
|---|---|---|---|---|---|
| 1 | InternVL3-78B | Open Source | Shanghai AI Lab | 73.3 | Jan 2025 |
| 2 | Gemini 2.0 Flash | API | Google | 71.9 | Jan 2025 |
| 3 | Qwen2.5-VL 72B | Open Source | Alibaba | 70.2 | Feb 2025 |
| 4 | GPT-4o | API | OpenAI | 69.1 | Oct 2024 |
| 5 | Claude 3.5 Sonnet | API | Anthropic | 68.3 | Oct 2024 |
| 6 | InternVL2-76B | Open Source | Shanghai AI Lab | 67.4 | Apr 2024 |
| 7 | Qwen2-VL 72B | Open Source | Alibaba | 64.5 | Sep 2024 |
| 8 | Gemini 1.5 Pro | API | Google | 62.2 | Feb 2024 |
| 9 | Llama 3.2 Vision 90B | Open Source | Meta | 60.3 | Jul 2024 |
| 10 | Claude 3 Opus | API | Anthropic | 59.4 | Mar 2024 |
| 11 | GPT-4V | API | OpenAI | 56.8 | Mar 2023 |
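A sanity check on the "Score Range 13.9" stat above: it matches max minus min over the ten ranked models, with the GPT-4V baseline row excluded (consistent with "Models Compared: 10"):

```python
# "Score Range" = spread between best and worst of the 10 ranked models.
# GPT-4V (56.8), listed as a baseline row, is assumed excluded here.
top10 = [73.3, 71.9, 70.2, 69.1, 68.3, 67.4, 64.5, 62.2, 60.3, 59.4]
score_range = round(max(top10) - min(top10), 1)
print(score_range)  # 13.9
```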
Related Papers (11)
Qwen2.5-VL Technical Report
Feb 2025 · Models: Qwen2.5-VL 72B
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jan 2025 · Models: InternVL3-78B
Gemini 2.0 Flash Technical Report
Jan 2025 · Models: Gemini 2.0 Flash
GPT-4o System Card
Oct 2024 · Models: GPT-4o
Claude 3.5 Sonnet Model Card
Oct 2024 · Models: Claude 3.5 Sonnet
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Sep 2024 · Models: Qwen2-VL 72B
The Llama 3 Herd of Models
Jul 2024 · Models: Llama 3.2 Vision 90B
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Apr 2024 · Models: InternVL2-76B
Claude 3 Model Family (Haiku, Sonnet, Opus)
Mar 2024 · Models: Claude 3 Opus
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Feb 2024 · Models: Gemini 1.5 Pro
GPT-4 Technical Report
Mar 2023 · Models: GPT-4V