Who leads the rvl-cdip benchmark?

EAML currently leads rvl-cdip with a score of 97.7 on Accuracy.

What is the state-of-the-art score on rvl-cdip?

The state-of-the-art result on rvl-cdip is 97.7 (Accuracy), achieved by EAML in 2023 and still the top score as of 2025.

How many models are tracked on rvl-cdip?

Codesota tracks 37 models on rvl-cdip across 3 metrics.

When was the rvl-cdip leaderboard last updated?

The rvl-cdip leaderboard on Codesota includes results through 2025, with the earliest tracked result from 2017.



rvl-cdip.

Name: rvl-cdip Benchmark Results
Creator: Unknown
Published: 2017-01-01
License: https://creativecommons.org/licenses/by/4.0/

rvl-cdip is a machine learning benchmark for document image classification, indexed on Codesota. This page tracks published model results, top scores per metric, and the SOTA timeline for rvl-cdip.


§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


Accuracy

Accuracy is one of three evaluation metrics reported for rvl-cdip. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Source	Trust	Score	Year	Links
01	EAML	EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification	verified	97.7	2023	Paper
02	Cross-Modal	Visual and Textual Deep Feature Fusion for Document Image Classification	verified	97.05	2020	Paper
03	DocFormerBASE	DocFormer: End-to-End Transformer for Document Understanding	verified	96.17	2021	Paper, Code
04	LayoutLMV3Large	LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking	verified	95.93	2022	Paper, Code
05	LiLT[EN-R]BASE	LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding	verified	95.68	2022	Paper, Code
06	LayoutLMv2LARGE	LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding	verified	95.64	2020	Paper, Code
07	LayoutLMv2 Large	-	unverified	95.64	2020	Paper, Code
08	TILT-Large	Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer	verified	95.52	2021	Paper, Code
09	DocFormer large	DocFormer: End-to-End Transformer for Document Understanding	verified	95.5	2021	Paper, Code
10	LayoutLMv3BASE	LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking	verified	95.44	2022	Paper, Code
11	Donut	OCR-free Document Understanding Transformer	verified	95.3	2021	Paper, Code
12	TILT-Base	Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer	verified	95.25	2021	Paper, Code
13	LayoutLMv2BASE	LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding	verified	95.25	2020	Paper, Code
14	LayoutXLM	LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding	verified	95.21	2021	Paper, Code
15	StrucTexTv2 (large)	StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training	verified	94.62	2023	Paper, Code
16	Pre-trained LayoutLM	LayoutLM: Pre-training of Text and Layout for Document Image Understanding	verified	94.42	2019	Paper, Code
17	DoPTA	DoPTA: Improving Document Layout Analysis using Patch-Text Alignment	verified	94.12	2024	Paper
18	DoPTA-HR (512×512)	DoPTA: Improving Document Layout Analysis using Patch-Text Alignment. High-resolution (512×512) variant. Table 1. Outperforms Donut-Encoder (93.37%) and StrucTexTv2-Small (93.4%) at comparable resolutions.	verified	94.07	2024	Source
19	DocXClassifier-B	DocXClassifier: High Performance Explainable Deep Network for Document Image Classification	verified	94	2022	Paper, Code
20	HEADoC-Large	HEADoC: Highly Efficient and Accurate Document Classifier Optimized Using Semantic Distances. LARGE variant (90.58M params). Published in Progress in Artificial Intelligence, Oct 2025. Deep attention mechanism fusing textual and visual modalities via semantic distances.	verified	93.62	2025	Source
21	StrucTexTv2 (small)	StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training	verified	93.4	2023	Paper, Code
22	VLCDoC	VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification	verified	93.19	2022	Paper
23	TransferDoc	GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification	verified	93.18	2023	Paper
24	DoPTA (224×224)	DoPTA: Improving Document Layout Analysis using Patch-Text Alignment. Standard resolution (224×224) variant. Table 1. Outperforms DiT-L with <1/3rd parameters. 250k pretraining steps on document images.	verified	92.96	2024	Source
25	HEADoC-Base	HEADoC: Highly Efficient and Accurate Document Classifier Optimized Using Semantic Distances. BASE variant (27.7M params). Published in Progress in Artificial Intelligence, Oct 2025. Deep attention mechanism fusing textual and visual modalities via semantic distances.	verified	92.95	2025	Source
26	Multimodal (ResNet50)	Multimodal Side-Tuning for Document Classification	verified	92.7	2023	Paper, Code
27	DiT-L	DiT: Self-supervised Pre-training for Document Image Transformer	verified	92.69	2022	Paper, Code
28	Pre-trained EfficientNet	Improving accuracy and speeding up Document Image Classification through parallel systems	verified	92.31	2020	Paper, Code
29	Transfer Learning from VGG16 trained on Imagenet	Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks	verified	92.21	2018	Paper, Code
30	Multimodal (MobileNetV2)	Multimodal Side-Tuning for Document Classification	verified	92.2	2023	Paper, Code
31	DiT-B	DiT: Self-supervised Pre-training for Document Image Transformer	verified	92.11	2022	Paper, Code
32	BEiT-B	BEiT: BERT Pre-Training of Image Transformers	verified	91.09	2021	Paper, Code
33	Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50	Cutting the Error by Half: Investigation of Very Deep CNN and Advanced Training Strategies for Document Image Classification	verified	90.97	2017	Paper, Code
34	AlexNet + spatial pyramidal pooling + image resizing	Analysis of Convolutional Neural Networks for Document Image Classification	verified	90.94	2017	Paper
35	DeiT-B	Training data-efficient image transformers & distillation through attention	verified	90.32	2020	Paper, Code
36	Roberta base	RoBERTa: A Robustly Optimized BERT Pretraining Approach	verified	90.06	2019	Paper, Code
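
The muted-row rule stated above the Accuracy table (a row is muted when an earlier or same-year result already scored better) can be sketched as a small filter. This is only an illustration under stated assumptions, not Codesota's actual implementation: the `Result` type and the `is_muted` and `sota_timeline` helpers are hypothetical names, ties are assumed not to mute a row, and only the year is used for ordering because the table gives no finer publication date. The sample rows are a subset copied from the Accuracy table.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """One leaderboard row (hypothetical type, for illustration only)."""
    model: str
    score: float
    year: int


def is_muted(row: Result, rows: List[Result]) -> bool:
    # Muted: some result from the same year or earlier scored strictly higher.
    # Treating ties as "not muted" is an assumption; the page does not say.
    return any(o.year <= row.year and o.score > row.score for o in rows)


def sota_timeline(rows: List[Result]) -> List[Result]:
    # Keep only rows that were state of the art when published, oldest first.
    kept = [r for r in rows if not is_muted(r, rows)]
    return sorted(kept, key=lambda r: (r.year, r.score))


# Subset of the Accuracy table above (higher is better).
rows = [
    Result("Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50", 90.97, 2017),
    Result("AlexNet + spatial pyramidal pooling + image resizing", 90.94, 2017),
    Result("Transfer Learning from VGG16 trained on Imagenet", 92.21, 2018),
    Result("Roberta base", 90.06, 2019),
    Result("Pre-trained LayoutLM", 94.42, 2019),
    Result("Cross-Modal", 97.05, 2020),
    Result("DocFormerBASE", 96.17, 2021),
    Result("EAML", 97.7, 2023),
]

for r in sota_timeline(rows):
    print(r.year, r.score, r.model)
```

On this subset the sketch keeps five rows, one each for 2017, 2018, 2019, 2020 and 2023, and mutes the rest. DocFormerBASE (96.17, 2021) is muted because Cross-Modal had already reached 97.05 in 2020.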

Far

Far is one of three evaluation metrics reported for rvl-cdip. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Far: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Source	Trust	Score	Year	Links
01	VisualWordGrid	VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach	verified	28.7	2020	Paper

War

War is one of three evaluation metrics reported for rvl-cdip. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for War: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Source	Trust	Score	Year	Links
01	VisualWordGrid	VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach	verified	18.7	2020	Paper

