Codesota · Benchmark · rvl-cdip

rvl-cdip.

rvl-cdip is a document image classification benchmark indexed on Codesota. This page tracks published model results, the top scores per metric, and the state-of-the-art (SOTA) timeline for rvl-cdip.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Accuracy

Accuracy is the primary evaluation metric reported for rvl-cdip: the percentage of evaluation documents assigned the correct class. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
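
As a rough illustration of how a score on this leaderboard is typically computed, the sketch below calculates top-1 accuracy as a percentage. The function name `top1_accuracy` and the toy labels are assumptions made for illustration; they are not taken from Codesota or from any of the listed papers, and individual papers may differ in evaluation details such as the split used.

```python
from typing import Sequence


def top1_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Return the percentage of predictions that match the true class labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count positions where the predicted class equals the true class.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Toy example with hypothetical class indices (not real rvl-cdip results).
predicted_labels = [0, 3, 3, 7, 15, 2, 2, 9]
actual_labels = [0, 3, 4, 7, 15, 2, 1, 9]
score = top1_accuracy(predicted_labels, actual_labels)
print(f"accuracy: {score:.2f}")  # 6 of 8 correct
```

Because the result is a percentage of correct predictions, a larger value means a better model, which is why the ranking runs from highest to lowest.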

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

| Rank | Model | Source | Trust | Score | Year | Links |
|---|---|---|---|---|---|---|
| 01 | EAML | EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification | verified | 97.7 | 2023 | Paper |
| 02 | Cross-Modal | Visual and Textual Deep Feature Fusion for Document Image Classification | verified | 97.05 | 2020 | Paper |
| 03 | DocFormer BASE | DocFormer: End-to-End Transformer for Document Understanding | verified | 96.17 | 2021 | Paper, Code |
| 04 | LayoutLMv3 Large | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | verified | 95.93 | 2022 | Paper, Code |
| 05 | LiLT[EN-R] BASE | LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | verified | 95.68 | 2022 | Paper, Code |
| 06 | LayoutLMv2 LARGE | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | verified | 95.64 | 2020 | Paper, Code |
| 07 | LayoutLMv2 Large | (not listed) | unverified | 95.64 | 2020 | Paper, Code |
| 08 | TILT-Large | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | verified | 95.52 | 2021 | Paper, Code |
| 09 | DocFormer large | DocFormer: End-to-End Transformer for Document Understanding | verified | 95.5 | 2021 | Paper, Code |
| 10 | LayoutLMv3 BASE | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | verified | 95.44 | 2022 | Paper, Code |
| 11 | Donut | OCR-free Document Understanding Transformer | verified | 95.3 | 2021 | Paper, Code |
| 12 | TILT-Base | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | verified | 95.25 | 2021 | Paper, Code |
| 13 | LayoutLMv2 BASE | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | verified | 95.25 | 2020 | Paper, Code |
| 14 | LayoutXLM | LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding | verified | 95.21 | 2021 | Paper, Code |
| 15 | StrucTexTv2 (large) | StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training | verified | 94.62 | 2023 | Paper, Code |
| 16 | Pre-trained LayoutLM | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | verified | 94.42 | 2019 | Paper, Code |
| 17 | DoPTA | DoPTA: Improving Document Layout Analysis using Patch-Text Alignment | verified | 94.12 | 2024 | Paper |
| 18 | DoPTA-HR (512×512) | DoPTA: Improving Document Layout Analysis using Patch-Text Alignment. High-resolution (512×512) variant, Table 1. Outperforms Donut-Encoder (93.37%) and StrucTexTv2-Small (93.4%) at comparable resolutions. | verified | 94.07 | 2024 | Source |
| 19 | DocXClassifier-B | DocXClassifier: High Performance Explainable Deep Network for Document Image Classification | verified | 94 | 2022 | Paper, Code |
| 20 | HEADoC-Large | HEADoC: Highly Efficient and Accurate Document Classifier Optimized Using Semantic Distances. LARGE variant (90.58M params). Published in Progress in Artificial Intelligence, Oct 2025. Deep attention mechanism fusing textual and visual modalities via semantic distances. | verified | 93.62 | 2025 | Source |
| 21 | StrucTexTv2 (small) | StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training | verified | 93.4 | 2023 | Paper, Code |
| 22 | VLCDoC | VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification | verified | 93.19 | 2022 | Paper |
| 23 | TransferDoc | GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification | verified | 93.18 | 2023 | Paper |
| 24 | DoPTA (224×224) | DoPTA: Improving Document Layout Analysis using Patch-Text Alignment. Standard-resolution (224×224) variant, Table 1. Outperforms DiT-L with <1/3rd parameters. 250k pretraining steps on document images. | verified | 92.96 | 2024 | Source |
| 25 | HEADoC-Base | HEADoC: Highly Efficient and Accurate Document Classifier Optimized Using Semantic Distances. BASE variant (27.7M params). Published in Progress in Artificial Intelligence, Oct 2025. Deep attention mechanism fusing textual and visual modalities via semantic distances. | verified | 92.95 | 2025 | Source |
| 26 | Multimodal (ResNet50) | Multimodal Side-Tuning for Document Classification | verified | 92.7 | 2023 | Paper, Code |
| 27 | DiT-L | DiT: Self-supervised Pre-training for Document Image Transformer | verified | 92.69 | 2022 | Paper, Code |
| 28 | Pre-trained EfficientNet | Improving accuracy and speeding up Document Image Classification through parallel systems | verified | 92.31 | 2020 | Paper, Code |
| 29 | Transfer Learning from VGG16 trained on ImageNet | Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks | verified | 92.21 | 2018 | Paper, Code |
| 30 | Multimodal (MobileNetV2) | Multimodal Side-Tuning for Document Classification | verified | 92.2 | 2023 | Paper, Code |
| 31 | DiT-B | DiT: Self-supervised Pre-training for Document Image Transformer | verified | 92.11 | 2022 | Paper, Code |
| 32 | BEiT-B | BEiT: BERT Pre-Training of Image Transformers | verified | 91.09 | 2021 | Paper, Code |
| 33 | Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50 | Cutting the Error by Half: Investigation of Very Deep CNN and Advanced Training Strategies for Document Image Classification | verified | 90.97 | 2017 | Paper, Code |
| 34 | AlexNet + spatial pyramidal pooling + image resizing | Analysis of Convolutional Neural Networks for Document Image Classification | verified | 90.94 | 2017 | Paper |
| 35 | DeiT-B | Training data-efficient image transformers & distillation through attention | verified | 90.32 | 2020 | Paper, Code |
| 36 | RoBERTa base | RoBERTa: A Robustly Optimized BERT Pretraining Approach | verified | 90.06 | 2019 | Paper, Code |

Far

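Far (FAR, which in the VisualWordGrid paper appears to stand for Field Accuracy Rate) is a secondary evaluation metric reported for rvl-cdip. Codesota tracks published model scores on this metric so readers can compare results across sources and model families; only one result is currently listed.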

Higher is better

Trust tiers for Far: verified, paper, vendor, community, unverified

| Rank | Model | Source | Trust | Score | Year | Links |
|---|---|---|---|---|---|---|
| 01 | VisualWordGrid | VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach | verified | 28.7 | 2020 | Paper |

War

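War (WAR, which in the VisualWordGrid paper appears to stand for Word Accuracy Rate) is a secondary evaluation metric reported for rvl-cdip. Codesota tracks published model scores on this metric so readers can compare results across sources and model families; only one result is currently listed.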

Higher is better

Trust tiers for War: verified, paper, vendor, community, unverified

| Rank | Model | Source | Trust | Score | Year | Links |
|---|---|---|---|---|---|---|
| 01 | VisualWordGrid | VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach | verified | 18.7 | 2020 | Paper |

§ 04 · Submit a result

Add to the leaderboard.
