M-LongDoc is a benchmark introduced by Chia et al. (arXiv:2411.06176) for multimodal super-long document understanding. It consists of 851 questions constructed from long PDF documents containing multimodal content (interleaved text, figures, and tables) and is intended to evaluate models' ability to read and answer questions over very long, multi-page documents. The paper also provides an automated framework for reliably evaluating open-ended model answers, and proposes a retrieval-aware tuning approach that retrieves the relevant pages or regions of a document so models can read long documents efficiently. Paper details and a demo are available from the project page (https://multimodal-documents.github.io/) and from the paper on arXiv.
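To make the retrieval-then-read idea concrete: before answering, a system can rank a document's pages against the question and pass only the top-scoring pages to the reader model. The sketch below is a deliberately simplified illustration, not the paper's actual method (which uses learned multimodal retrievers over page content); it ranks pages with a plain bag-of-words cosine similarity, and the example pages and question are invented for demonstration.

```python
import math
from collections import Counter

def similarity(query: str, page: str) -> float:
    """Cosine similarity over bag-of-words term counts (toy stand-in
    for a real text/image retriever)."""
    q, p = Counter(query.lower().split()), Counter(page.lower().split())
    num = sum(q[t] * p[t] for t in set(q) & set(p))
    den = math.sqrt(sum(v * v for v in q.values())) * \
          math.sqrt(sum(v * v for v in p.values()))
    return num / den if den else 0.0

def retrieve_pages(question: str, pages: list[str], k: int = 2) -> list[int]:
    """Return the indices of the top-k pages most similar to the
    question, in document order, so the reader sees only a small
    slice of the full document."""
    ranked = sorted(range(len(pages)),
                    key=lambda i: similarity(question, pages[i]),
                    reverse=True)
    return sorted(ranked[:k])

# Hypothetical four-page document (stand-in for extracted PDF pages).
pages = [
    "Page 1: introduction and motivation for the study",
    "Page 2: table of quarterly revenue figures by region",
    "Page 3: figure showing model accuracy versus context length",
    "Page 4: appendix with hyperparameter settings",
]
print(retrieve_pages("What does the revenue table show by region?", pages, k=1))
# → [1]  (the page with the revenue table)
```

In the benchmark setting, the reader model would then answer the question conditioned only on the retrieved pages, which is what makes reading hundreds of pages tractable.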
No results indexed yet — be the first to submit a score.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.