
Document Understanding

Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables — where layout and typography carry as much meaning as the text itself. LayoutLMv3 (2022) and Donut pioneered layout-aware pretraining, but the game changed when GPT-4V and Claude 3 demonstrated that general-purpose multimodal LLMs could match or exceed specialist models on DocVQA and InfographicVQA without fine-tuning. The persistent challenges are multi-page reasoning, handling handwritten text mixed with print, and accurately extracting structured data from complex table layouts. This task sits at the intersection of OCR, layout analysis, and language understanding, making it one of the highest-value enterprise AI applications.

Datasets tracked: 2 · Results: 17 · Canonical metric: F1

Canonical Benchmark

FUNSD

199 fully annotated forms. Tests semantic entity labeling and linking.

Primary metric: F1
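On FUNSD-style semantic entity labeling, F1 is typically computed at the entity level: a prediction counts only on an exact span-and-label match against the gold annotations. A minimal sketch (the representation and function name are illustrative, not from any official FUNSD toolkit):

```python
# Entity-level F1 for FUNSD-style semantic entity labeling.
# Each entity is a (start_token, end_token, label) triple; a prediction is
# a true positive only if both the span and the label match exactly.

def entity_f1(gold, pred):
    """Micro-averaged F1 over exact (span, label) matches."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "question"), (3, 5, "answer"), (6, 6, "header")]
pred = [(0, 2, "question"), (3, 5, "other"), (6, 6, "header")]
print(round(entity_f1(gold, pred), 3))  # 2 of 3 entities match -> 0.667
```

Note that mislabeling a correctly located span (as with the second entity above) costs both precision and recall, which is why entity-level F1 is stricter than token-level accuracy.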

Top 10

Leading models on FUNSD.

Rank  Model              F1    Year  Source
1     LayoutLMv3-large   92.1  2022  paper
2     UDOP               91.6  2023  paper
3     LayoutLMv3-base    90.3  2022  paper
4     DocFormerv2-large  88.9  2023  paper
5     LiLT[EN-R2]-base   88.4  2022  paper
6     DocFormerv2-base   88.4  2023  paper
7     StructuralLM       85.1  2021  paper
8     FormNet            84.7  2022  paper
9     BROS-large         84.5  2022  paper
10    LayoutLMv2-large   84.2  2021  paper

All datasets

2 datasets tracked for this task.


Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.
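In the transformers library, document question answering is exposed as a built-in pipeline task. A minimal sketch — the model choice (`impira/layoutlm-document-qa` is one commonly used checkpoint) and the image file name are illustrative:

```python
def answer_document_question(image_path, question):
    """Run a document-question-answering pipeline on a document image."""
    # Requires: pip install transformers pillow pytesseract
    # (imported lazily so the sketch is self-contained)
    from transformers import pipeline

    qa = pipeline(
        "document-question-answering",
        model="impira/layoutlm-document-qa",
    )
    # Returns a list of candidate answers, each a dict with
    # "answer", "score", and word-index "start"/"end" fields.
    return qa(image=image_path, question=question)

# Example usage (downloads the model and needs a local document image):
# answer_document_question("invoice.png", "What is the invoice total?")
```

The pipeline handles OCR internally (via pytesseract) when the checkpoint expects word boxes, so a plain image path is enough to get started.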