
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models.

MME is a comprehensive evaluation benchmark for Multimodal Large Language Models (MLLMs) that measures both perception and cognition abilities across 14 subtasks. All instruction-answer pairs are manually designed rather than drawn directly from public datasets, which avoids data leakage, and the instructions are kept concise so that models can be compared fairly without prompt engineering. Over 50 advanced MLLMs have been evaluated on MME, yielding quantitative comparisons and highlighting directions for improvement in multimodal model development.
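
For context on how MME reports scores: each image comes with two yes/no questions, and every subtask is scored as accuracy (over individual questions) plus accuracy+ (over images, where an image counts only if both of its questions are answered correctly), giving a maximum of 200 per subtask. Below is a minimal Python sketch of that convention; the prediction records are hypothetical.

```python
from collections import defaultdict

def mme_subtask_score(records):
    """Score one MME subtask from (image_id, is_correct) pairs.

    MME pairs two yes/no questions with every image. The subtask
    score is accuracy (fraction of questions answered correctly)
    plus accuracy+ (fraction of images with BOTH questions answered
    correctly), each as a percentage, so the maximum is 200.
    """
    per_image = defaultdict(list)
    for image_id, is_correct in records:
        per_image[image_id].append(is_correct)

    n_questions = sum(len(answers) for answers in per_image.values())
    accuracy = 100.0 * sum(
        a for answers in per_image.values() for a in answers
    ) / n_questions
    accuracy_plus = 100.0 * sum(
        all(answers) for answers in per_image.values()
    ) / len(per_image)
    return accuracy + accuracy_plus

# Hypothetical results: image "0001" answers both questions correctly,
# image "0002" gets one of its two questions wrong.
records = [("0001", True), ("0001", True), ("0002", True), ("0002", False)]
print(mme_subtask_score(records))  # 75.0 + 50.0 = 125.0
```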

Paper · Submit a result
§ 01 · Leaderboard

Best published scores.

No results indexed yet — be the first to submit a score.

§ 02 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with a frozen commit + seed (see the sketch after this list)
  • 03 · A declared evaluation environment (Python version, dependencies)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
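
To make items 02 and 03 concrete, here is one possible shape for a reproduction entry point. This is a sketch, not a required Codesota format: the file name, the --checkpoint flag, the commit hash, and the manifest fields are all hypothetical.

```python
# reproduce.py: hypothetical reproduction entry point, not a required format.
import argparse
import json
import platform
import random
import sys

PINNED_COMMIT = "abc1234"  # hypothetical: the frozen commit your score was produced at
SEED = 42                  # the exact seed used for the reported score

def main() -> None:
    parser = argparse.ArgumentParser(description="Reproduce a reported MME score")
    parser.add_argument("--checkpoint", required=True,
                        help="public checkpoint path or API endpoint")
    args = parser.parse_args()

    # Freeze every source of randomness the evaluation touches (item 02).
    random.seed(SEED)

    # Declare the evaluation environment alongside the score (item 03).
    manifest = {
        "commit": PINNED_COMMIT,
        "seed": SEED,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "checkpoint": args.checkpoint,
    }
    print(json.dumps(manifest, indent=2))

    # ... run the MME evaluation here and emit one row per metric (item 04).

if __name__ == "__main__":
    main()
```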