
Document Understanding

Document understanding means parsing visually rich documents (invoices, forms, scientific papers, tables) where layout and typography carry as much meaning as the text itself. LayoutLMv3 (2022) pioneered layout-aware pretraining and Donut showed that OCR-free, end-to-end parsing was viable, but the field shifted when GPT-4V and Claude 3 demonstrated that general-purpose multimodal LLMs could match or exceed specialist models on DocVQA and InfographicsVQA without fine-tuning. The persistent challenges are multi-page reasoning, handling handwritten text mixed with print, and accurately extracting structured data from complex table layouts. The task sits at the intersection of OCR, layout analysis, and language understanding, which makes it one of the highest-value enterprise AI applications.
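A concrete feel for the layout-analysis half of the problem: before any text reaches a language model, OCR word boxes usually have to be sorted into reading order, since naive top-left sorting interleaves columns and misaligns labels with their values. The sketch below is purely illustrative (the function name, box format, and tolerance heuristic are our own, not from any particular library): it groups word boxes into lines by vertical-center overlap, then reads lines top-to-bottom and words left-to-right.

```python
# Illustrative layout-analysis step: sort OCR word boxes into reading order.
# A box is (text, x0, y0, x1, y1). Two words share a line when their vertical
# centers differ by less than `line_tol` times the smaller box height.
# This is a simplified heuristic, not the method of any specific system.

def reading_order(words, line_tol=0.5):
    """Return word texts in reading order (lines top-to-bottom, words left-to-right)."""
    lines = []  # each line is a list of boxes
    for w in sorted(words, key=lambda w: (w[2], w[1])):  # rough top-left pass
        for line in lines:
            ref = line[0]
            tol = line_tol * min(w[4] - w[2], ref[4] - ref[2])
            # compare vertical centers to decide whether w belongs to this line
            if abs((w[2] + w[4]) / 2 - (ref[2] + ref[4]) / 2) <= tol:
                line.append(w)
                break
        else:
            lines.append([w])
    lines.sort(key=lambda line: min(w[2] for w in line))
    return [w[0] for line in lines for w in sorted(line, key=lambda w: w[1])]

# Invoice-header example: label and value sit on the same visual line even
# though their y0 coordinates differ by a pixel.
boxes = [
    ("Invoice", 10, 10, 80, 24),
    ("#1042", 90, 11, 130, 25),
    ("Date:", 10, 40, 50, 54),
    ("2024-03-01", 60, 41, 140, 55),
]
print(reading_order(boxes))
# → ['Invoice', '#1042', 'Date:', '2024-03-01']
```

Real systems (and the layout-aware models below) go further by feeding the 2-D coordinates themselves into the model rather than flattening them away, but a reading-order pass like this is still a common preprocessing baseline.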

2 datasets · 17 results · Canonical metric: f1
Canonical Benchmark

FUNSD

199 fully annotated forms. Tests semantic entity labeling and linking.

Primary metric: f1
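For semantic entity labeling, f1 is typically computed at the entity level: a predicted entity counts as correct only if both its label and its exact span match a gold entity. A minimal sketch of that computation, assuming entities are represented as `(label, start, end)` tuples (our own simplification, not FUNSD's annotation format):

```python
# Minimal sketch of entity-level micro-F1 for semantic entity labeling.
# An entity is a (label, start, end) tuple; only exact matches count as
# true positives. Representation and function name are illustrative.
from collections import Counter

def entity_f1(gold, pred):
    """Micro-averaged F1 over exact-match (label, start, end) entities."""
    gold_c, pred_c = Counter(gold), Counter(pred)
    tp = sum((gold_c & pred_c).values())           # exact label+span matches
    precision = tp / max(sum(pred_c.values()), 1)  # guard empty predictions
    recall = tp / max(sum(gold_c.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("question", 0, 2), ("answer", 3, 5), ("header", 6, 7)]
pred = [("question", 0, 2), ("answer", 3, 6)]  # second span is off by one
print(round(entity_f1(gold, pred), 3))
# → 0.4
```

Published FUNSD numbers are usually produced with a standard sequence-labeling scorer over BIO tags, but the exact-match principle is the same.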

Top 10

Leading models on FUNSD.

| Rank | Model | f1 | Year | Source |
|------|-------|----|------|--------|
| 1 | LayoutLMv3-large | 92.1 | 2022 | paper |
| 2 | UDOP | 91.6 | 2023 | paper |
| 3 | LayoutLMv3-base | 90.3 | 2022 | paper |
| 4 | DocFormerv2-large | 88.9 | 2023 | paper |
| 5 | LiLT[EN-R2]-base | 88.4 | 2022 | paper |
| 6 | DocFormerv2-base | 88.4 | 2023 | paper |
| 7 | StructuralLM | 85.1 | 2021 | paper |
| 8 | FormNet | 84.7 | 2022 | paper |
| 9 | BROS-large | 84.5 | 2022 | paper |
| 10 | LayoutLMv2-large | 84.2 | 2021 | paper |


All datasets

2 datasets tracked for this task.

