Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables — where layout and typography carry as much meaning as the text itself. LayoutLMv3 (2022) and Donut pioneered layout-aware pretraining, but the field shifted when GPT-4V and Claude 3 demonstrated that general-purpose multimodal LLMs could match or exceed specialist models on DocVQA and InfographicVQA without fine-tuning. The persistent challenges are multi-page reasoning, handwritten text mixed with print, and accurate extraction of structured data from complex table layouts. The task sits at the intersection of OCR, layout analysis, and language understanding, making it one of the highest-value enterprise AI applications.
Document understanding goes beyond extraction to comprehend document content — answering questions, summarizing, comparing, and reasoning about documents. DocVQA is the flagship benchmark, where models answer natural language questions about document images. Large VLMs (GPT-4o, Qwen2-VL) now achieve 93%+ ANLS on DocVQA, approaching human performance.
History
LayoutLM (Xu et al.) introduces the concept of multimodal document pretraining — combining text, layout, and image features
DocVQA dataset (Mathew et al.) establishes document visual question answering as a benchmark task with 50K questions on 12K document images
LayoutLMv2 adds visual embeddings and achieves 86.72% ANLS on DocVQA; InfographicVQA extends to information-rich graphics
Donut and Pix2Struct enable OCR-free document understanding, directly mapping document images to answers
LayoutLMv3 achieves 83.37% on DocVQA with unified multimodal pretraining; becomes the standard document AI backbone
TextMonkey and UReader push document-specific VLMs, handling multi-page and high-resolution documents
GPT-4o achieves 92.8% ANLS on DocVQA; Qwen2-VL-72B reaches 96.5% — general VLMs surpass document-specific models
DocGenome introduces a large-scale benchmark with 500K documents covering 13 types, pushing toward comprehensive document understanding
Multi-page and multi-document understanding (MP-DocVQA, DUDE) become active research frontiers as single-page understanding is largely solved
How Document Understanding Works
Document Encoding
Two pipelines dominate. OCR-based: extract text and bounding boxes with an OCR engine, then encode them with LayoutLM-family models that embed each token's spatial position alongside its text. OCR-free: a ViT encoder processes the document image directly, often at high resolution (1344px+ to resolve small text). Modern VLMs use a vision encoder (SigLIP, InternViT) to turn the page into visual tokens.
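As a concrete sketch of the OCR-based route: LayoutLM-family models expect each word's bounding box normalized into a 0–1000 coordinate space, independent of the page's pixel size. The OCR words and boxes below are made up for illustration.

```python
def normalize_box(box, width, height):
    """Scale an absolute (x0, y0, x1, y1) OCR box into LayoutLM's 0-1000 space."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    )

# Each OCR word carries its text plus a layout position the model embeds
# alongside the token embedding (illustrative values for a 1000x1400 page).
words = [("Invoice", (50, 40, 210, 80)), ("Total:", (50, 900, 140, 940))]
encoded = [(w, normalize_box(b, width=1000, height=1400)) for w, b in words]
```

The fixed 0–1000 scale is what lets one set of learned 2D position embeddings serve pages of any physical size.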
Question Encoding
The natural language question is tokenized and either concatenated with document tokens (LayoutLM approach) or processed as text input alongside visual tokens (VLM approach).
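A minimal sketch of the LayoutLM-style concatenation, assuming the common convention that question and special tokens carry a dummy (0, 0, 0, 0) box because they have no position on the page; tokens and coordinates are illustrative.

```python
# Joint input sequence: [CLS] + question tokens + [SEP] + document tokens.
# Question tokens get a zero box; document tokens keep their 0-1000 layout box.
question_tokens = ["what", "is", "the", "total", "?"]
doc_tokens = [("Invoice", (50, 28, 210, 57)), ("Total:", (50, 642, 140, 671))]

sequence = (
    [("[CLS]", (0, 0, 0, 0))]
    + [(t, (0, 0, 0, 0)) for t in question_tokens]
    + [("[SEP]", (0, 0, 0, 0))]
    + doc_tokens
)
```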
Cross-Modal Reasoning
Transformer attention allows the model to locate relevant document regions for the question. In VLMs, this is implicit in the autoregressive generation — the model 'reads' the document image and generates an answer conditioned on both the visual content and the question.
Answer Generation
Extractive models point to spans in the OCR text. Generative models (Donut, VLMs) produce the answer token-by-token. VLMs can generate free-form explanations alongside the answer.
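The extractive route can be sketched as span selection over the OCR tokens: the model emits per-token start and end scores, and the answer is the best-scoring valid span. The tokens and logits below are made up for illustration.

```python
def best_span(start_logits, end_logits, max_len=8):
    """Pick the span (i, j) maximizing start + end score, with i <= j."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best[2]:
                best = (i, j, s + end_logits[j])
    return best[0], best[1]

tokens = ["Invoice", "date", ":", "March", "3", ",", "2024"]
start = [0.1, 0.0, 0.0, 2.5, 0.3, 0.0, 0.2]
end   = [0.0, 0.1, 0.0, 0.4, 0.2, 0.0, 2.8]
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])  # "March 3 , 2024"
```

Generative models skip this step entirely and decode the answer string token by token, which is why they can also answer questions whose answer never appears verbatim in the OCR text.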
Evaluation
ANLS (Average Normalized Levenshtein Similarity) is the standard metric: it is edit-distance-based, so minor OCR errors are tolerated, while scores below a similarity threshold (τ = 0.5) are zeroed to penalize genuinely wrong answers rather than typos. DocVQA, InfographicVQA, and ChartQA are the primary benchmarks; multi-page benchmarks (MP-DocVQA, DUDE) test document-level reasoning.
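ANLS is simple to state: per question, take 1 minus the normalized edit distance to the closest ground-truth answer, zero the score when that distance exceeds τ = 0.5, and average over questions. A self-contained sketch (lowercasing is a common normalization; check the official evaluation script for exact preprocessing):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, answers, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    `answers[i]` is the list of accepted ground truths for question i."""
    scores = []
    for pred, gts in zip(predictions, answers):
        sims = []
        for gt in gts:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            sims.append(1 - nl if nl <= tau else 0.0)  # zero below threshold
        scores.append(max(sims))
    return sum(scores) / len(scores)

anls(["$1,250.00"], [["$1,250.00", "1250.00"]])  # exact match -> 1.0
```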
Current Landscape
Document understanding in 2025 is dominated by large VLMs. The LayoutLM era (2019-2023) of task-specific document AI models has been superseded by general-purpose vision-language models that simply 'read' document images and answer questions. Qwen2-VL and InternVL2 achieve 96%+ on DocVQA without task-specific fine-tuning; document reading emerges from their general visual training. The remaining challenges are multi-page reasoning, numerical precision, and chart understanding. For production systems, the choice is between API-based VLMs (highest accuracy, but data leaves your infrastructure) and self-hosted smaller models (lower accuracy, full data control).
Key Challenges
Multi-page reasoning — real-world questions often require synthesizing information across multiple pages (e.g., 'what was the total revenue increase from page 4 to page 12?')
Table reasoning — understanding table structure and performing operations (sums, comparisons, lookups) within tables embedded in documents
Chart and graph understanding — extracting quantitative information from visualizations (bar charts, line graphs, pie charts) embedded in documents
Resolution requirements — small text, dense tables, and fine print require high-resolution input (2000+ pixels), which is expensive for transformer-based models
Hallucination — VLMs sometimes generate plausible but incorrect answers, especially for numerical data and precise dates extracted from documents
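One common workaround for the resolution challenge (used, in spirit, by dynamic-tiling schemes such as InternVL's) is to slice the page into fixed-size tiles matching the vision encoder's input, plus a downscaled thumbnail for global context. The 448px tile size and 12-tile budget below are illustrative assumptions, not any specific model's values.

```python
import math

def tile_grid(width, height, tile=448, max_tiles=12):
    """Choose a (cols, rows) tiling for a page, coarsening to fit the budget."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    while cols * rows > max_tiles:  # shrink the longer axis until within budget
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

tile_grid(2480, 3508)  # A4 page scanned at 300 DPI
```

The trade-off is explicit: more tiles preserve fine print but multiply the visual-token count (and thus attention cost), which is why tile budgets exist at all.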
Quick Recommendations
Best accuracy (API)
GPT-4o or Gemini 2.5 Pro
93%+ ANLS on DocVQA; handles complex reasoning, charts, and multi-page documents; best for highest-accuracy requirements
Best open-source
Qwen2-VL-72B or InternVL2-76B
96%+ ANLS on DocVQA; competitive with closed-source models; can be self-hosted for data privacy
Efficient / self-hosted
Qwen2-VL-7B or InternVL2-8B
90%+ ANLS on DocVQA at 7-8B parameters; runs on a single A100; best accuracy-per-compute ratio
OCR-free simplicity
Donut-Base or Florence-2
No OCR dependency and a simpler pipeline; 80%+ ANLS on DocVQA; faster inference for high-throughput workloads
Enterprise KYC / forms
Azure Document Intelligence + GPT-4o
Specialized prebuilt models for common document types (invoices, IDs, receipts) with VLM fallback for edge cases
What's Next
The frontier is agentic document workflows — models that don't just answer questions about documents but take actions (fill forms, route approvals, generate summaries, flag anomalies). Multi-document reasoning (cross-referencing contracts, comparing financial statements) is the next benchmark frontier. The long-term vision is AI assistants that can process, understand, and act on any document as well as a human knowledge worker.
Benchmarks & SOTA
DocLayNet
DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
IBM dataset with 80,863 pages across 6 document categories (financial, scientific, patents, law, government, manuals). 11 layout element classes. Supersedes PubLayNet for general-purpose layout analysis.
State of the Art: DocFormerv2-Large (Adobe Research), 84.1 mAP
FUNSD
Form Understanding in Noisy Scanned Documents
199 fully annotated forms. Tests semantic entity labeling and linking.
No results tracked yet