Who leads the rvl-cdip benchmark?

EAML currently leads rvl-cdip with a score of 97.70 on accuracy.

What is the state-of-the-art score on rvl-cdip?

The state-of-the-art result on rvl-cdip is 97.70 (accuracy), achieved by EAML as of 2025.

How many models are tracked on rvl-cdip?

Codesota tracks 37 models on rvl-cdip across 3 metrics.

When was the rvl-cdip leaderboard last updated?

The rvl-cdip leaderboard on Codesota includes results through 2025, with the earliest tracked result from 2017.


Document Image Classification · benchmark dataset · 2020 · EN

rvl-cdip.

Name: rvl-cdip Benchmark Results
Creator: Codesota
Published: 2017-01-01
License: https://creativecommons.org/licenses/by/4.0/

Dataset from Papers With Code

Saturated benchmark

Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years
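
The "stagnant" half of this label can be checked mechanically. The following is a minimal sketch, assuming the rule is "at least two years since the last record". The function name `is_stagnant`, the 365.25-day year, and the reference date are illustrative assumptions, not Codesota's actual criterion.

```python
from datetime import date


def is_stagnant(last_sota: date, as_of: date, min_years: float = 2.0) -> bool:
    """Return True when at least `min_years` have passed since the last record."""
    elapsed_years = (as_of - last_sota).days / 365.25
    return elapsed_years >= min_years


# EAML set the current accuracy record on May 11, 2023 (per the progress
# section of this page). The reference date below is an assumption chosen
# for illustration only.
print(is_stagnant(date(2023, 5, 11), as_of=date(2025, 10, 1)))
```

The "near ceiling" half of the label is not covered here, since the page does not state a threshold for it.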


§ 01 · Leaderboard

Best published scores.

38 results indexed across 3 metrics. Shaded row marks current SOTA; ties broken by submission date.

Primary: accuracy · higher is better
All metrics: accuracy, far, war

accuracy · primary

36 rows

#	Model	Org	Submitted	Paper / code	accuracy
01	EAML	—	May 2023	EAML: Ensemble Self-Attention-based Mutual Learning Netw…	97.70
02	Cross-Modal	—	Jun 2020	papers-with-code	97.05
03	DocFormerBASE	—	Jun 2021	DocFormer: End-to-End Transformer for Document Understan… · code	96.17
04	LayoutLMV3Large	—	Apr 2022	LayoutLMv3: Pre-training for Document AI with Unified Te… · code	95.93
05	LiLT[EN-R]BASE	—	Feb 2022	LiLT: A Simple yet Effective Language-Independent Layout… · code	95.68
06	LayoutLMv2LARGE	—	Dec 2020	LayoutLMv2: Multi-modal Pre-training for Visually-Rich D… · code	95.64
07	LayoutLMv2 Large	—	Dec 2020	LayoutLMv2: Multi-modal Pre-training for Visually-Rich D… · code	95.64
08	TILT-Large	—	Feb 2021	Going Full-TILT Boogie on Document Understanding with Te… · code	95.52
09	DocFormer large	—	Jun 2021	DocFormer: End-to-End Transformer for Document Understan… · code	95.50
10	LayoutLMv3BASE	—	Apr 2022	LayoutLMv3: Pre-training for Document AI with Unified Te… · code	95.44
11	Donut	—	Nov 2021	OCR-free Document Understanding Transformer · code	95.30
12	TILT-Base	—	Feb 2021	Going Full-TILT Boogie on Document Understanding with Te… · code	95.25
13	LayoutLMv2BASE	—	Dec 2020	LayoutLMv2: Multi-modal Pre-training for Visually-Rich D… · code	95.25
14	LayoutXLM	—	Apr 2021	LayoutXLM: Multimodal Pre-training for Multilingual Visu… · code	95.21
15	StrucTexTv2 (large)	—	Mar 2023	StrucTexTv2: Masked Visual-Textual Prediction for Docume… · code	94.62
16	Pre-trained LayoutLM	—	Dec 2019	LayoutLM: Pre-training of Text and Layout for Document I… · code	94.42
17	DoPTA	—	Dec 2024	DoPTA: Improving Document Layout Analysis using Patch-Te…	94.12
18	DoPTA-HR (512×512)	—	Dec 2024	arxiv	94.07
19	DocXClassifier-B	—	Mar 2022	papers-with-code · code	94
20	HEADoC-Large	—	Oct 2025	springer	93.62
21	StrucTexTv2 (small)	—	Mar 2023	StrucTexTv2: Masked Visual-Textual Prediction for Docume… · code	93.40
22	VLCDoC	—	May 2022	VLCDoC: Vision-Language Contrastive Pre-Training Model f…	93.19
23	TransferDoc	—	Sep 2023	GlobalDoc: A Cross-Modal Vision-Language Framework for R…	93.18
24	DoPTA (224×224)	—	Dec 2024	arxiv	92.96
25	HEADoC-Base	—	Oct 2025	springer	92.95
26	Multimodal (ResNet50)	—	Jan 2023	Multimodal Side-Tuning for Document Classification · code	92.70
27	DiT-L	—	Mar 2022	DiT: Self-supervised Pre-training for Document Image Tra… · code	92.69
28	Pre-trained EfficientNet	—	Jun 2020	Improving accuracy and speeding up Document Image Classi… · code	92.31
29	Transfer Learning from VGG16 trained on Imagenet	—	Jan 2018	Document Image Classification with Intra-Domain Transfer… · code	92.21
30	Multimodal (MobileNetV2)	—	Jan 2023	Multimodal Side-Tuning for Document Classification · code	92.20
31	DiT-B	—	Mar 2022	DiT: Self-supervised Pre-training for Document Image Tra… · code	92.11
32	BEiT-B	—	Jun 2021	BEiT: BERT Pre-Training of Image Transformers · code	91.09
33	Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50	—	Apr 2017	Cutting the Error by Half: Investigation of Very Deep CN… · code	90.97
34	AlexNet + spatial pyramidal pooling + image resizing	—	Aug 2017	Analysis of Convolutional Neural Networks for Document I…	90.94
35	DeiT-B	Meta	Dec 2020	Training data-efficient image transformers & distillatio… · code	90.32
36	Roberta base	—	Jul 2019	RoBERTa: A Robustly Optimized BERT Pretraining Approach · code	90.06

far

1 row

#	Model	Org	Submitted	Paper / code	far
01	VisualWordGrid	—	Oct 2020	VisualWordGrid: Information Extraction From Scanned Docu…	28.70

war

1 row

#	Model	Org	Submitted	Paper / code	war
01	VisualWordGrid	—	Oct 2020	VisualWordGrid: Information Extraction From Scanned Docu…	18.70

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
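
The ordering described in the caption can be reproduced with a two-key sort. The following is a minimal sketch using four rows from the accuracy table; `Row` and `rank` are hypothetical names. The tie-break direction is an assumption: in the one tie above where the dates differ (TILT-Base and LayoutLMv2BASE at 95.25), the later submission is listed first, so the sketch does the same.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    submitted: tuple[int, int]  # (year, month), as shown in the table
    accuracy: float


def rank(rows: list[Row]) -> list[Row]:
    """Sort by accuracy, highest first; on ties, later submission first (assumed)."""
    return sorted(rows, key=lambda r: (r.accuracy, r.submitted), reverse=True)


rows = [
    Row("LayoutLMv2BASE", (2020, 12), 95.25),
    Row("EAML", (2023, 5), 97.70),
    Row("TILT-Base", (2021, 2), 95.25),
    Row("Donut", (2021, 11), 95.30),
]

for position, row in enumerate(rank(rows), start=1):
    print(f"{position:02d}\t{row.model}\t{row.accuracy:.2f}")
```

Sorting on the tuple `(accuracy, submitted)` keeps the rule in one place; flipping the tie-break would mean negating one of the keys rather than using `reverse=True`.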

§ 03 · Progress

5 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy

Date	Model	accuracy
Apr 11, 2017	Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50	90.97
Jan 29, 2018	Transfer Learning from VGG16 trained on Imagenet	92.21
Dec 31, 2019	Pre-trained LayoutLM	94.42
Jun 16, 2020	Cross-Modal	97.05
May 11, 2023	EAML	97.70

Fig 3 · SOTA-setting models only. 5 entries span Apr 2017 → May 2023.
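
The record-setting entries can be derived from the leaderboard with a running maximum over results in date order. The following is a minimal sketch on a subset of the accuracy rows above; `sota_steps` is a hypothetical name, and dates are kept at the month granularity the leaderboard shows. Entries from the same month keep their input order, which is an assumption.

```python
def sota_steps(results):
    """Yield entries that strictly beat every earlier score (higher is better)."""
    best = float("-inf")
    # sorted() is stable, so same-month entries keep their input order.
    for submitted, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            yield submitted, model, score


# ((year, month), model, accuracy): a subset of the leaderboard rows.
results = [
    ((2021, 6), "DocFormerBASE", 96.17),
    ((2017, 4), "Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50", 90.97),
    ((2017, 8), "AlexNet + spatial pyramidal pooling + image resizing", 90.94),
    ((2018, 1), "Transfer Learning from VGG16 trained on Imagenet", 92.21),
    ((2019, 7), "Roberta base", 90.06),
    ((2019, 12), "Pre-trained LayoutLM", 94.42),
    ((2020, 6), "Cross-Modal", 97.05),
    ((2023, 5), "EAML", 97.70),
    ((2024, 12), "DoPTA", 94.12),
]

for (year, month), model, score in sota_steps(results):
    print(f"{year}-{month:02d}\t{model}\t{score:.2f}")
```

Using a strict `>` means a later result that only equals the record does not count as a new step.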

§ 04 · Literature

24 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

DoPTA: Improving Document Layout Analysis using Patch-Text Alignment
Dec 2024 · DoPTA
arXiv

GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification
Sep 2023 · TransferDoc
arXiv

EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification
May 2023 · EAML
arXiv

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training
Mar 2023 · StrucTexTv2 (large), StrucTexTv2 (small)
arXiv · Code

Multimodal Side-Tuning for Document Classification
Jan 2023 · Multimodal (ResNet50), Multimodal (MobileNetV2)
arXiv · Code

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification
May 2022 · VLCDoC
arXiv

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Apr 2022 · LayoutLMV3Large, LayoutLMv3BASE
arXiv · Code

DiT: Self-supervised Pre-training for Document Image Transformer
Mar 2022 · DiT-L, DiT-B
arXiv · Code

LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding
Feb 2022 · LiLT[EN-R]BASE
arXiv · Code

OCR-free Document Understanding Transformer
Nov 2021 · Donut
arXiv · Code

DocFormer: End-to-End Transformer for Document Understanding
Jun 2021 · DocFormerBASE, DocFormer large
arXiv · Code

BEiT: BERT Pre-Training of Image Transformers
Jun 2021 · BEiT-B
arXiv · Code

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
Apr 2021 · LayoutXLM
arXiv · Code

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
Feb 2021 · TILT-Large, TILT-Base
arXiv · Code

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Dec 2020 · LayoutLMv2LARGE, LayoutLMv2BASE
arXiv · Code

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Dec 2020 · LayoutLMv2 Large
arXiv · Code

Training data-efficient image transformers & distillation through attention
Dec 2020 · DeiT-B
arXiv · Code

VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach
Oct 2020 · VisualWordGrid
arXiv

Improving accuracy and speeding up Document Image Classification through parallel systems
Jun 2020 · Pre-trained EfficientNet
arXiv · Code

LayoutLM: Pre-training of Text and Layout for Document Image Understanding
Dec 2019 · Pre-trained LayoutLM
arXiv · Code

RoBERTa: A Robustly Optimized BERT Pretraining Approach
Jul 2019 · Roberta base
arXiv · Code

Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks
Jan 2018 · Transfer Learning from VGG16 trained on Imagenet
arXiv · Code

Analysis of Convolutional Neural Networks for Document Image Classification
Aug 2017 · AlexNet + spatial pyramidal pooling + image resizing
arXiv

Cutting the Error by Half: Investigation of Very Deep CNN and Advanced Training Strategies for Document Image Classification
Apr 2017 · Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50
arXiv · Code

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the progress chart with your name.


What a submission needs

01 · A public checkpoint or API endpoint
02 · A reproduction script with frozen commit + seed
03 · Declared evaluation environment (Python, deps)
04 · One row per metric declared by this dataset
05 · A contact so we can follow up on discrepancies
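
The checklist above can be expressed as a pre-submission self-check. The following is a minimal sketch under stated assumptions: the `Submission` schema, its field names, and `missing_items` are hypothetical and do not describe Codesota's actual submission format. Item 04 is interpreted here as "at least one score, and every reported metric is one the dataset declares", which is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Metrics listed for rvl-cdip on this page.
DATASET_METRICS = {"accuracy", "far", "war"}


@dataclass
class Submission:  # hypothetical schema, for illustration only
    checkpoint_url: str
    repro_commit: str
    seed: Optional[int]
    environment: dict[str, str]
    scores: dict[str, float]
    contact: str


def missing_items(sub: Submission) -> list[str]:
    """Return the checklist items this submission does not yet satisfy."""
    problems = []
    if not sub.checkpoint_url:
        problems.append("01 checkpoint or API endpoint")
    if not sub.repro_commit or sub.seed is None:
        problems.append("02 frozen commit + seed")
    if "python" not in sub.environment:
        problems.append("03 evaluation environment")
    if not sub.scores or set(sub.scores) - DATASET_METRICS:
        problems.append("04 metric rows")
    if not sub.contact:
        problems.append("05 contact")
    return problems


# All values below are placeholders, not a real submission.
example = Submission(
    checkpoint_url="https://example.org/model.ckpt",
    repro_commit="0123abc",
    seed=0,
    environment={"python": "3.11"},
    scores={"accuracy": 0.0},
    contact="author@example.org",
)
print(missing_items(example))
```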
