| Rank | Model | Paper / notes | Status | Accuracy (%) | Year | Links |
| --- | --- | --- | --- | --- | --- | --- |
| 01 | EAML | EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification | verified | 97.7 | 2023 | Paper |
| 02 | Cross-Modal | Visual and Textual Deep Feature Fusion for Document Image Classification | verified | 97.05 | 2020 | Paper |
| 03 | DocFormer BASE | DocFormer: End-to-End Transformer for Document Understanding | verified | 96.17 | 2021 | Paper, Code |
| 04 | LayoutLMv3 Large | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | verified | 95.93 | 2022 | Paper, Code |
| 05 | LiLT[EN-R] BASE | LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding | verified | 95.68 | 2022 | Paper, Code |
| 06 | LayoutLMv2 LARGE | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | verified | 95.64 | 2020 | Paper, Code |
| 07 | LayoutLMv2 Large | | unverified | 95.64 | 2020 | Paper, Code |
| 08 | TILT-Large | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | verified | 95.52 | 2021 | Paper, Code |
| 09 | DocFormer large | DocFormer: End-to-End Transformer for Document Understanding | verified | 95.5 | 2021 | Paper, Code |
| 10 | LayoutLMv3 BASE | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking | verified | 95.44 | 2022 | Paper, Code |
| 11 | Donut | OCR-free Document Understanding Transformer | verified | 95.3 | 2021 | Paper, Code |
| 12 | TILT-Base | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | verified | 95.25 | 2021 | Paper, Code |
| 13 | LayoutLMv2 BASE | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | verified | 95.25 | 2020 | Paper, Code |
| 14 | LayoutXLM | LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding | verified | 95.21 | 2021 | Paper, Code |
| 15 | StrucTexTv2 (large) | StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training | verified | 94.62 | 2023 | Paper, Code |
| 16 | Pre-trained LayoutLM | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | verified | 94.42 | 2019 | Paper, Code |
| 17 | DoPTA | DoPTA: Improving Document Layout Analysis using Patch-Text Alignment | verified | 94.12 | 2024 | Paper |
| 18 | DoPTA-HR (512×512) | DoPTA: Improving Document Layout Analysis using Patch-Text Alignment. High-resolution (512×512) variant, Table 1. Outperforms Donut-Encoder (93.37%) and StrucTexTv2-Small (93.4%) at comparable resolutions. | verified | 94.07 | 2024 | Source |
| 19 | DocXClassifier-B | DocXClassifier: High Performance Explainable Deep Network for Document Image Classification | verified | 94 | 2022 | Paper, Code |
| 20 | HEADoC-Large | HEADoC: Highly Efficient and Accurate Document Classifier Optimized Using Semantic Distances. LARGE variant (90.58M params). Published in Progress in Artificial Intelligence, Oct 2025. Deep attention mechanism fusing textual and visual modalities via semantic distances. | verified | 93.62 | 2025 | Source |
| 21 | StrucTexTv2 (small) | StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training | verified | 93.4 | 2023 | Paper, Code |
| 22 | VLCDoC | VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification | verified | 93.19 | 2022 | Paper |
| 23 | TransferDoc | GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification | verified | 93.18 | 2023 | Paper |
| 24 | DoPTA (224×224) | DoPTA: Improving Document Layout Analysis using Patch-Text Alignment. Standard-resolution (224×224) variant, Table 1. Outperforms DiT-L with less than one third of the parameters. 250k pretraining steps on document images. | verified | 92.96 | 2024 | Source |
| 25 | HEADoC-Base | HEADoC: Highly Efficient and Accurate Document Classifier Optimized Using Semantic Distances. BASE variant (27.7M params). Published in Progress in Artificial Intelligence, Oct 2025. Deep attention mechanism fusing textual and visual modalities via semantic distances. | verified | 92.95 | 2025 | Source |
| 26 | Multimodal (ResNet50) | Multimodal Side-Tuning for Document Classification | verified | 92.7 | 2023 | Paper, Code |
| 27 | DiT-L | DiT: Self-supervised Pre-training for Document Image Transformer | verified | 92.69 | 2022 | Paper, Code |
| 28 | Pre-trained EfficientNet | Improving accuracy and speeding up Document Image Classification through parallel systems | verified | 92.31 | 2020 | Paper, Code |
| 29 | Transfer Learning from VGG16 trained on ImageNet | Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks | verified | 92.21 | 2018 | Paper, Code |
| 30 | Multimodal (MobileNetV2) | Multimodal Side-Tuning for Document Classification | verified | 92.2 | 2023 | Paper, Code |
| 31 | DiT-B | DiT: Self-supervised Pre-training for Document Image Transformer | verified | 92.11 | 2022 | Paper, Code |
| 32 | BEiT-B | BEiT: BERT Pre-Training of Image Transformers | verified | 91.09 | 2021 | Paper, Code |
| 33 | Transfer Learning from AlexNet, VGG-16, GoogLeNet and ResNet50 | Cutting the Error by Half: Investigation of Very Deep CNN and Advanced Training Strategies for Document Image Classification | verified | 90.97 | 2017 | Paper, Code |
| 34 | AlexNet + spatial pyramidal pooling + image resizing | Analysis of Convolutional Neural Networks for Document Image Classification | verified | 90.94 | 2017 | Paper |
| 35 | DeiT-B | Training data-efficient image transformers & distillation through attention | verified | 90.32 | 2020 | Paper, Code |
| 36 | RoBERTa base | RoBERTa: A Robustly Optimized BERT Pretraining Approach | verified | 90.06 | 2019 | Paper, Code |