
M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

M-LongDoc is a benchmark introduced by Chia et al. (arXiv:2411.06176) for multimodal super-long document understanding. It consists of 851 questions over long PDF documents with multimodal content (interleaved text, figures, and tables) and evaluates a model's ability to read and answer questions over very long, multi-page documents. The paper also provides an automated framework for reliably evaluating open-ended model answers, and proposes a retrieval-aware tuning approach that retrieves the relevant pages or regions so that models can read long documents efficiently. Paper details and a demo are available from the project page (https://multimodal-documents.github.io/) and from arXiv.
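As a minimal sketch of the retrieval-aware reading idea (not the paper's actual pipeline), the snippet below scores each pre-extracted page of a document against a question with a simple lexical-overlap heuristic and keeps only the top-k pages as context for the answering model. The page representation, the scoring function, and the top-k cut-off are illustrative assumptions; the paper's approach tunes the model with learned retrieval over multimodal content.

```python
# Hypothetical sketch: retrieval-aware reading over a long multimodal document.
# Pages are assumed to be pre-extracted as text (figures/tables replaced by captions).
from collections import Counter
import math
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def score_page(question: str, page_text: str) -> float:
    """Simple lexical-overlap score between a question and one page (illustrative only)."""
    q_tokens = set(tokenize(question))
    p_counts = Counter(tokenize(page_text))
    # Sum log-scaled counts of question terms that appear on the page.
    return sum(math.log1p(p_counts[t]) for t in q_tokens if t in p_counts)


def retrieve_pages(question: str, pages: list[str], k: int = 5) -> list[int]:
    """Return indices of the k pages most relevant to the question."""
    ranked = sorted(range(len(pages)), key=lambda i: score_page(question, pages[i]), reverse=True)
    return sorted(ranked[:k])  # keep original page order for the model prompt


if __name__ == "__main__":
    pages = ["...page 1 text...", "...page 2 text with a revenue table caption...", "...page 3 text..."]
    question = "What does the revenue table report for 2023?"
    top_pages = retrieve_pages(question, pages, k=2)
    context = "\n\n".join(pages[i] for i in top_pages)
    # A multimodal model would then answer the question conditioned only on `context`
    # (plus the corresponding figure/table images), instead of the full document.
    print(top_pages)
```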

Paper · Submit a result
§ 01 · Leaderboard

Best published scores.

No results indexed yet — be the first to submit a score.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 A public checkpoint or API endpoint
  • 02 A reproduction script with frozen commit + seed (sketched below)
  • 03 Declared evaluation environment (Python, deps)
  • 04 One row per metric declared by this dataset
  • 05 A contact so we can follow up on discrepancies
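As a rough, hypothetical skeleton covering items 02–04 (not a requirement of this site), a reproduction script might pin a seed, record the evaluation environment, and emit one row per metric. The `run_mlongdoc_eval` function and the output layout are placeholders for whatever harness a submission actually uses.

```python
# Hypothetical reproduction-script skeleton for a benchmark submission.
# The evaluation call itself is a placeholder; only the bookkeeping is shown.
import json
import platform
import random
import sys

SEED = 1234  # frozen seed declared with the submission
random.seed(SEED)


def run_mlongdoc_eval(checkpoint: str, seed: int) -> dict[str, float]:
    """Placeholder: replace with the real harness (model loading, inference, scoring).

    Should return one entry per metric declared by the dataset.
    """
    return {}


if __name__ == "__main__":
    checkpoint = sys.argv[1] if len(sys.argv) > 1 else "org/public-checkpoint"  # hypothetical name
    results = {
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
            "seed": SEED,
        },
        "checkpoint": checkpoint,
        "metrics": run_mlongdoc_eval(checkpoint, SEED),
    }
    json.dump(results, sys.stdout, indent=2)
```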