
M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

M-LongDoc is a benchmark introduced by Chia et al. (arXiv:2411.06176) for multimodal super-long document understanding. It consists of 851 questions over long PDF documents with multimodal content (interleaved text, figures, and tables) and evaluates a model's ability to read and answer questions over very long, multi-page documents. The paper also provides an automated framework for reliably evaluating open-ended model answers, and proposes a retrieval-aware tuning approach that retrieves the relevant pages or regions so that models can read long documents efficiently. Paper details and a demo are available from the project page (https://multimodal-documents.github.io/) and from arXiv.
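As a minimal sketch of the retrieval-aware reading idea (not the paper's actual pipeline), the snippet below scores each pre-extracted page of a document against a question with a simple lexical-overlap heuristic and keeps only the top-k pages as context for the answering model. The page representation, the scoring function, and the top-k cut-off are illustrative assumptions; the paper's approach tunes the model with learned retrieval over multimodal content.

```python
# Hypothetical sketch: retrieval-aware reading over a long multimodal document.
# Pages are assumed to be pre-extracted as text (figures/tables replaced by captions).
from collections import Counter
import math
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def score_page(question: str, page_text: str) -> float:
    """Simple lexical-overlap score between a question and one page (illustrative only)."""
    q_tokens = set(tokenize(question))
    p_counts = Counter(tokenize(page_text))
    # Sum log-scaled counts of question terms that appear on the page.
    return sum(math.log1p(p_counts[t]) for t in q_tokens if t in p_counts)


def retrieve_pages(question: str, pages: list[str], k: int = 5) -> list[int]:
    """Return indices of the k pages most relevant to the question."""
    ranked = sorted(range(len(pages)), key=lambda i: score_page(question, pages[i]), reverse=True)
    return sorted(ranked[:k])  # keep original page order for the model prompt


if __name__ == "__main__":
    pages = ["...page 1 text...", "...page 2 text with a revenue table caption...", "...page 3 text..."]
    question = "What does the revenue table report for 2023?"
    top_pages = retrieve_pages(question, pages, k=2)
    context = "\n\n".join(pages[i] for i in top_pages)
    # A multimodal model would then answer the question conditioned only on `context`
    # (plus the corresponding figure/table images), instead of the full document.
    print(top_pages)
```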

Paper · Submit a result
§ 01 · Leaderboard

Best published scores.

No results indexed yet — be the first to submit a score.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 A public checkpoint or API endpoint
  • 02 A reproduction script with frozen commit + seed (sketched below)
  • 03 Declared evaluation environment (Python, deps)
  • 04 One row per metric declared by this dataset
  • 05 A contact so we can follow up on discrepancies
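As a rough, hypothetical skeleton covering items 02–04 (not a requirement of this site), a reproduction script might pin a seed, record the evaluation environment, and emit one row per metric. The `run_mlongdoc_eval` function and the output layout are placeholders for whatever harness a submission actually uses.

```python
# Hypothetical reproduction-script skeleton for a benchmark submission.
# The evaluation call itself is a placeholder; only the bookkeeping is shown.
import json
import platform
import random
import sys

SEED = 1234  # frozen seed declared with the submission
random.seed(SEED)


def run_mlongdoc_eval(checkpoint: str, seed: int) -> dict[str, float]:
    """Placeholder: replace with the real harness (model loading, inference, scoring).

    Should return one entry per metric declared by the dataset.
    """
    return {}


if __name__ == "__main__":
    checkpoint = sys.argv[1] if len(sys.argv) > 1 else "org/public-checkpoint"  # hypothetical name
    results = {
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
            "seed": SEED,
        },
        "checkpoint": checkpoint,
        "metrics": run_mlongdoc_eval(checkpoint, SEED),
    }
    json.dump(results, sys.stdout, indent=2)
```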