Visual Question Answering · benchmark dataset · 2023 · EN

MMBench: Is Your Multi-modal Model an All-around Player?

Multimodal capability benchmark for vision-language models, covering perception and reasoning abilities across multiple dimensions.

§ 01 · Leaderboard

Best published scores.

20 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy · higher is better
| # | Model | Org | Submitted | Paper / code | accuracy |
| --- | --- | --- | --- | --- | --- |
| 01 | SenseNova-U1-A3B-MoT | SenseTime | May 2026 | SenseNova-U1: Unifying Multimodal Understanding and Gene… · code | 91.59 |
| 02 | Qwen2.5-VL 72B [Open] | Alibaba | Feb 2025 | Qwen2.5-VL Technical Report | 90.50 |
| 03 | InternVL3-78B [Open] | Shanghai AI Lab | Jan 2025 | InternVL3: Exploring Advanced Training and Test-Time Rec… | 90.10 |
| 04 | LongCat-Flash-Omni | | Oct 2025 | LongCat-Flash-Omni Technical Report · code | 89.80 |
| 05 | Qwen3-VL-235B-A22B-Instruct | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 89.30 |
| 06 | Qwen3-VL-235B-A22B-Thinking | Qwen | Nov 2025 | Qwen3-VL Technical Report · code | 88.80 |
| 07 | Qwen2.5-VL-72B | | Feb 2025 | Qwen2.5-VL Technical Report · code | 88.60 |
| 08 | Qwen2-VL 72B [Open] | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… | 88 |
| 09 | Infinity-Parser2-Pro | | May 2026 | pwc-dump | 87.54 |
| 10 | Qwen2-VL 72B [Open] | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 86.50 |
| 11 | InternVL2-76B [Open] | Shanghai AI Lab | Apr 2024 | InternVL: Scaling up Vision Foundation Models and Aligni… | 86.50 |
| 12 | BAGEL (7B MoT) | | May 2025 | Emerging Properties in Unified Multimodal Pretraining · code | 85 |
| 13 | GPT-4o [API] | OpenAI | Oct 2024 | SWE-bench Verified | 83.40 |
| 14 | MiniCPM-V 4.6-Thinking (16x) | | May 2026 | pwc-dump | 83.10 |
| 15 | Qwen2-VL 7B | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 83 |
| 16 | MiniCPM-Llama3-V 2.5 | | Aug 2024 | MiniCPM-V: A GPT-4V Level MLLM on Your Phone · code | 77.20 |
| 17 | GPT-4V | | Mar 2023 | GPT-4 Technical Report | 75.80 |
| 18 | Qwen2-VL-2B | | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 74.90 |
| 19 | Gemini 1.5 Pro [API] | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 73.90 |
| 20 | LLaVA-1.5 [Open] | UW-Madison / Microsoft | Oct 2023 | Improved Baselines with Visual Instruction Tuning (LLaVA… | 67.70 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
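
The ordering rule (score first, submission date as tie-break) can be illustrated with a short sketch. This is not Codesota's actual code: the `rank` helper and the tuple layout are hypothetical, and the tie-break direction (newer submission first) is an assumption inferred from rows 10 and 11, where the Sep 2024 entry sits above the Apr 2024 entry at 86.50.

```python
# Rows copied from the leaderboard: (model, submitted "YYYY-MM", accuracy).
rows = [
    ("InternVL2-76B", "2024-04", 86.50),
    ("Qwen2-VL 72B", "2024-09", 88.0),
    ("Qwen2-VL 72B", "2024-09", 86.50),
    ("BAGEL (7B MoT)", "2025-05", 85.0),
]


def rank(rows):
    """Sort by accuracy (highest first); break ties by date (newest first).

    Assumption: newer submissions win ties. The page states only that
    ties are broken by submission date, not in which direction.
    """
    # Python's sort is stable, so sorting by the secondary key first and
    # the primary key second yields the combined ordering.
    by_date = sorted(rows, key=lambda r: r[1], reverse=True)
    return sorted(by_date, key=lambda r: r[2], reverse=True)


for position, (model, submitted, accuracy) in enumerate(rank(rows), start=1):
    print(f"{position:02d} {model} {submitted} {accuracy:.2f}")
```

Sorting in two stable passes keeps each key's direction explicit, which is easier to change if the tie-break turns out to run the other way.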
§ 03 · Progress

6 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Mar 15, 2023 · GPT-4V · 75.80
  2. Apr 25, 2024 · InternVL2-76B · Shanghai AI Lab · 86.50
  3. Sep 18, 2024 · Qwen2-VL 72B · Alibaba · 88
  4. Jan 22, 2025 · InternVL3-78B · Shanghai AI Lab · 90.10
  5. Feb 19, 2025 · Qwen2.5-VL 72B · Alibaba · 90.50
  6. May 12, 2026 · SenseNova-U1-A3B-MoT · SenseTime · 91.59
Fig 3 · SOTA-setting models only. 6 entries span Mar 2023 to May 2026.
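
The SOTA line can be reproduced from leaderboard rows with a running maximum over date-ordered results. The sketch below is illustrative only: `Result` and `sota_steps` are hypothetical names, the rows are a subset of the leaderboard at month granularity, and it assumes a result must strictly exceed the previous best to count as a new step.

```python
from typing import NamedTuple


class Result(NamedTuple):
    month: str       # "YYYY-MM", from the leaderboard's Submitted column
    model: str
    accuracy: float


def sota_steps(results):
    """Return the results that strictly beat every earlier score, oldest first."""
    steps = []
    best = float("-inf")
    for result in sorted(results, key=lambda r: r.month):
        if result.accuracy > best:  # assumption: ties do not set a new record
            steps.append(result)
            best = result.accuracy
    return steps


# Subset of the leaderboard rows above.
rows = [
    Result("2023-03", "GPT-4V", 75.80),
    Result("2023-10", "LLaVA-1.5", 67.70),
    Result("2024-02", "Gemini 1.5 Pro", 73.90),
    Result("2024-04", "InternVL2-76B", 86.50),
    Result("2024-09", "Qwen2-VL 72B", 88.0),
    Result("2024-10", "GPT-4o", 83.40),
    Result("2025-01", "InternVL3-78B", 90.10),
    Result("2025-02", "Qwen2.5-VL 72B", 90.50),
    Result("2025-10", "LongCat-Flash-Omni", 89.80),
    Result("2026-05", "SenseNova-U1-A3B-MoT", 91.59),
]

for step in sota_steps(rows):
    print(step.month, step.model, f"{step.accuracy:.2f}")
```

On this subset the function keeps the same six models listed in the SOTA line and drops the intermediate submissions.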
§ 04 · Literature

13 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 A public checkpoint or API endpoint
  • 02 A reproduction script with frozen commit + seed
  • 03 Declared evaluation environment (Python, deps)
  • 04 One row per metric declared by this dataset
  • 05 A contact so we can follow up on discrepancies