Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables — where layout and typography carry as much meaning as the text itself. LayoutLMv3 (2022) and Donut pioneered layout-aware pretraining, but the field shifted when GPT-4V and Claude 3 demonstrated that general-purpose multimodal LLMs could match or exceed specialist models on DocVQA and InfographicVQA without fine-tuning. The persistent challenges are multi-page reasoning, handwritten text mixed with print, and accurate extraction of structured data from complex table layouts. The task sits at the intersection of OCR, layout analysis, and language understanding, making it one of the highest-value enterprise AI applications.
Document understanding goes beyond extraction to comprehend document content — answering questions, summarizing, comparing, and reasoning about documents. DocVQA is the flagship benchmark, where models answer natural language questions about document images. Large VLMs (GPT-4o, Qwen2-VL) now achieve 93%+ ANLS on DocVQA, approaching human performance.
History
LayoutLM (Xu et al.) introduces the concept of multimodal document pretraining — combining text, layout, and image features
DocVQA dataset (Mathew et al.) establishes document visual question answering as a benchmark task with 50K questions on 12K document images
LayoutLMv2 adds visual embeddings and achieves 86.72% ANLS on DocVQA; InfographicVQA extends to information-rich graphics
Donut and Pix2Struct enable OCR-free document understanding, directly mapping document images to answers
LayoutLMv3 achieves 83.37% on DocVQA with unified multimodal pretraining; becomes the standard document AI backbone
TextMonkey and UReader push document-specific VLMs, handling multi-page and high-resolution documents
GPT-4o achieves 92.8% ANLS on DocVQA; Qwen2-VL-72B reaches 96.5% — general VLMs surpass document-specific models
DocGenome introduces a large-scale benchmark with 500K documents covering 13 types, pushing toward comprehensive document understanding
Multi-page and multi-document understanding (MP-DocVQA, DUDE) become active research frontiers as single-page understanding is largely solved
How Document Understanding Works
Document Encoding
Two pipelines dominate. OCR-based: extract text and bounding boxes with an OCR engine, then encode them with LayoutLM-family models that embed each token's spatial position alongside its text. OCR-free: a ViT encoder processes the document image directly, often at high resolution (1344px+ to resolve small text). Modern VLMs use a vision encoder (SigLIP, InternViT) to turn the page into visual tokens.
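As a concrete sketch of the OCR-based route: LayoutLM-family models expect each word's bounding box normalized into a 0–1000 coordinate space, independent of the page's pixel size. The OCR words and boxes below are made up for illustration.

```python
def normalize_box(box, width, height):
    """Scale an absolute (x0, y0, x1, y1) OCR box into LayoutLM's 0-1000 space."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    )

# Each OCR word carries its text plus a layout position the model embeds
# alongside the token embedding (illustrative values for a 1000x1400 page).
words = [("Invoice", (50, 40, 210, 80)), ("Total:", (50, 900, 140, 940))]
encoded = [(w, normalize_box(b, width=1000, height=1400)) for w, b in words]
```

The fixed 0–1000 scale is what lets one set of learned 2D position embeddings serve pages of any physical size.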
Question Encoding
The natural language question is tokenized and either concatenated with document tokens (LayoutLM approach) or processed as text input alongside visual tokens (VLM approach).
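A minimal sketch of the LayoutLM-style concatenation, assuming the common convention that question and special tokens carry a dummy (0, 0, 0, 0) box because they have no position on the page; tokens and coordinates are illustrative.

```python
# Joint input sequence: [CLS] + question tokens + [SEP] + document tokens.
# Question tokens get a zero box; document tokens keep their 0-1000 layout box.
question_tokens = ["what", "is", "the", "total", "?"]
doc_tokens = [("Invoice", (50, 28, 210, 57)), ("Total:", (50, 642, 140, 671))]

sequence = (
    [("[CLS]", (0, 0, 0, 0))]
    + [(t, (0, 0, 0, 0)) for t in question_tokens]
    + [("[SEP]", (0, 0, 0, 0))]
    + doc_tokens
)
```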
Cross-Modal Reasoning
Transformer attention allows the model to locate relevant document regions for the question. In VLMs, this is implicit in the autoregressive generation — the model 'reads' the document image and generates an answer conditioned on both the visual content and the question.
Answer Generation
Extractive models point to spans in the OCR text. Generative models (Donut, VLMs) produce the answer token-by-token. VLMs can generate free-form explanations alongside the answer.
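The extractive route can be sketched as span selection over the OCR tokens: the model emits per-token start and end scores, and the answer is the best-scoring valid span. The tokens and logits below are made up for illustration.

```python
def best_span(start_logits, end_logits, max_len=8):
    """Pick the span (i, j) maximizing start + end score, with i <= j."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best[2]:
                best = (i, j, s + end_logits[j])
    return best[0], best[1]

tokens = ["Invoice", "date", ":", "March", "3", ",", "2024"]
start = [0.1, 0.0, 0.0, 2.5, 0.3, 0.0, 0.2]
end   = [0.0, 0.1, 0.0, 0.4, 0.2, 0.0, 2.8]
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])  # "March 3 , 2024"
```

Generative models skip this step entirely and decode the answer string token by token, which is why they can also answer questions whose answer never appears verbatim in the OCR text.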
Evaluation
ANLS (Average Normalized Levenshtein Similarity) is the standard metric: it is edit-distance-based, so minor OCR errors are tolerated, while scores below a similarity threshold (τ = 0.5) are zeroed to penalize genuinely wrong answers rather than typos. DocVQA, InfographicVQA, and ChartQA are the primary benchmarks; multi-page benchmarks (MP-DocVQA, DUDE) test document-level reasoning.
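ANLS is simple to state: per question, take 1 minus the normalized edit distance to the closest ground-truth answer, zero the score when that distance exceeds τ = 0.5, and average over questions. A self-contained sketch (lowercasing is a common normalization; check the official evaluation script for exact preprocessing):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, answers, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    `answers[i]` is the list of accepted ground truths for question i."""
    scores = []
    for pred, gts in zip(predictions, answers):
        sims = []
        for gt in gts:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            sims.append(1 - nl if nl <= tau else 0.0)  # zero below threshold
        scores.append(max(sims))
    return sum(scores) / len(scores)

anls(["$1,250.00"], [["$1,250.00", "1250.00"]])  # exact match -> 1.0
```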
Current Landscape
Document understanding in 2025 is dominated by large VLMs. The LayoutLM era (2019-2023) of task-specific document AI models has been superseded by general-purpose vision-language models that simply 'read' document images and answer questions. Qwen2-VL and InternVL2 achieve 96%+ on DocVQA without task-specific fine-tuning; document reading emerges from their general visual training. The remaining challenges are multi-page reasoning, numerical precision, and chart understanding. For production systems, the choice is between API-based VLMs (highest accuracy, but data leaves your infrastructure) and self-hosted smaller models (lower accuracy, full data control).
Key Challenges
Multi-page reasoning — real-world questions often require synthesizing information across multiple pages (e.g., 'what was the total revenue increase from page 4 to page 12?')
Table reasoning — understanding table structure and performing operations (sums, comparisons, lookups) within tables embedded in documents
Chart and graph understanding — extracting quantitative information from visualizations (bar charts, line graphs, pie charts) embedded in documents
Resolution requirements — small text, dense tables, and fine print require high-resolution input (2000+ pixels), which is expensive for transformer-based models
Hallucination — VLMs sometimes generate plausible but incorrect answers, especially for numerical data and precise dates extracted from documents
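One common workaround for the resolution challenge (used, in spirit, by dynamic-tiling schemes such as InternVL's) is to slice the page into fixed-size tiles matching the vision encoder's input, plus a downscaled thumbnail for global context. The 448px tile size and 12-tile budget below are illustrative assumptions, not any specific model's values.

```python
import math

def tile_grid(width, height, tile=448, max_tiles=12):
    """Choose a (cols, rows) tiling for a page, coarsening to fit the budget."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    while cols * rows > max_tiles:  # shrink the longer axis until within budget
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

tile_grid(2480, 3508)  # A4 page scanned at 300 DPI
```

The trade-off is explicit: more tiles preserve fine print but multiply the visual-token count (and thus attention cost), which is why tile budgets exist at all.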
Quick Recommendations
Best accuracy (API)
GPT-4o or Gemini 2.5 Pro
93%+ ANLS on DocVQA; handles complex reasoning, charts, and multi-page documents; best for highest-accuracy requirements
Best open-source
Qwen2-VL-72B or InternVL2-76B
96%+ ANLS on DocVQA; competitive with closed-source models; can be self-hosted for data privacy
Efficient / self-hosted
Qwen2-VL-7B or InternVL2-8B
90%+ ANLS on DocVQA at 7-8B parameters; runs on a single A100; best accuracy-per-compute ratio
OCR-free simplicity
Donut-Base or Florence-2
No OCR dependency and a simpler pipeline; 80%+ ANLS on DocVQA; faster inference for high-throughput workloads
Enterprise KYC / forms
Azure Document Intelligence + GPT-4o
Specialized prebuilt models for common document types (invoices, IDs, receipts) with VLM fallback for edge cases
What's Next
The frontier is agentic document workflows — models that don't just answer questions about documents but take actions (fill forms, route approvals, generate summaries, flag anomalies). Multi-document reasoning (cross-referencing contracts, comparing financial statements) is the next benchmark frontier. The long-term vision is AI assistants that can process, understand, and act on any document as well as a human knowledge worker.
Benchmarks & SOTA
DocLayNet
DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
IBM dataset with 80,863 pages across 6 document categories (financial, scientific, patents, law, government, manuals). 11 layout element classes. Supersedes PubLayNet for general-purpose layout analysis.
State of the Art: DocFormerv2-Large (Adobe Research), 84.1 mAP
FUNSD
Form Understanding in Noisy Scanned Documents
199 fully annotated forms. Tests semantic entity labeling and linking.
No results tracked yet