MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks.

MEGA-Bench is a large-scale multimodal evaluation suite from TIGER-Lab that consolidates more than 500 real-world tasks into a unified evaluation format. Each task comes with curated, high-quality samples (images/videos plus text) and standardized example/metric fields (e.g., task_name, task_description, example_text, example_media, metric_info, answer, eval_context), enabling cost-effective, accurate evaluation of vision-language models. The Hugging Face dataset provides subsets (e.g., core and open), a test split (core ≈ 6.53k rows), and per-task metadata describing each evaluation metric. The accompanying paper (ICLR 2025, arXiv:2410.10563) describes the benchmark and reports aggregated results, including a macro-averaged score across tasks. License: Apache-2.0. Main resources: paper (arXiv), code (GitHub), dataset and leaderboard on Hugging Face.
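
For orientation, here is a minimal sketch of loading the core subset from Hugging Face and inspecting the standardized fields. The dataset id, config name, split, and field names below follow the description above; verify them against the dataset card before depending on them.

```python
# Minimal sketch: load the MEGA-Bench "core" subset and peek at one row.
# Config/split/field names follow the page description (assumed to match
# the Hugging Face dataset card).
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MEGA-Bench", "core", split="test")
print(len(ds))  # expected: roughly 6.53k rows for the core test split

row = ds[0]
for field in ("task_name", "task_description", "example_text",
              "example_media", "metric_info", "answer", "eval_context"):
    print(f"{field}: {str(row[field])[:80]}")
```

On the reported macro metric: assuming the usual definition, the macro score is the unweighted mean of per-task scores, so a task with 20 examples counts as much as one with 500. A toy illustration with hypothetical scores:

```python
# Hypothetical per-task scores; macro aggregation averages over tasks, not examples.
task_scores = {"task_a": 0.72, "task_b": 0.55, "task_c": 0.31}
macro_score = sum(task_scores.values()) / len(task_scores)
print(round(macro_score, 3))  # 0.527
```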

§ 01 · Leaderboard

Best published scores.

No results indexed yet — be the first to submit a score.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with a frozen commit + seed (see the sketch after this list)
  • 03 · A declared evaluation environment (Python version, pinned dependencies)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
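
As referenced in item 02, the hypothetical helper below illustrates the frozen-seed and declared-environment requirements (items 02 and 03). The seed value, output file name, and the placement of the evaluation call are assumptions, not part of the MEGA-Bench tooling.

```python
# Hypothetical submission helper: fix the seed and snapshot the evaluation
# environment so the published score can be re-derived. Illustrative only.
import platform
import random
import subprocess
import sys

SEED = 1234  # declared seed; freeze it alongside the commit you evaluated at
random.seed(SEED)

# Declare the evaluation environment: Python version plus exact dependency pins.
print("python:", platform.python_version())
frozen = subprocess.run(
    [sys.executable, "-m", "pip", "freeze"],
    check=True, capture_output=True, text=True,
).stdout
with open("requirements-frozen.txt", "w") as f:
    f.write(frozen)

# From here, check out the frozen commit of the evaluation code and run the
# benchmark's entry point against your checkpoint or API endpoint, emitting
# one score row per metric declared by the dataset.
```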