Document Image Classification
Classifying documents by type or category
Document image classification categorizes scanned documents or photos of documents into types (invoice, receipt, contract, ID, resume, etc.) based on visual layout and content. It is typically the entry point of a document processing pipeline: the predicted type routes each document to the right extraction model. LayoutLMv3 and DocFormerv2 achieve 95%+ accuracy on RVL-CDIP (16 classes, 400K documents) by combining OCR text, visual features, and spatial layout.
History
RVL-CDIP dataset (Harley et al.) establishes the standard benchmark with 400K document images across 16 categories
CNN-based classifiers (VGG, ResNet on document images) achieve 89-90% accuracy on RVL-CDIP, treating documents as regular images
LayoutLM (Xu et al.) combines BERT-style text encoding with 2D position embeddings, showing that spatial layout is critical for document understanding
LayoutLMv2 adds visual features and cross-modal alignment, reaching 95.25% on RVL-CDIP — multimodal approach dominates
LayoutLMv3 unifies text, image, and layout pretraining with masked image/language modeling, achieving 95.93% on RVL-CDIP
DiT (Document Image Transformer) shows that self-supervised pretraining on document images alone reaches 92.69% without OCR
DocFormerv2 and Donut (OCR-free) achieve competitive accuracy without explicit OCR, simplifying the pipeline
Large VLMs (Qwen2-VL, InternVL2) classify documents zero-shot using visual understanding alone, matching fine-tuned specialists
How Document Image Classification Works
Input Processing
Documents are either (a) rendered as images and processed by a vision encoder, or (b) OCR'd to extract text + bounding boxes, which are combined with the document image.
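Path (b) can be sketched in a few lines. LayoutLM-family models expect word bounding boxes normalized to a 0–1000 coordinate grid; the OCR output format below is a hypothetical stand-in for whatever engine (Tesseract, a cloud OCR API) produces the words and pixel boxes.

```python
# Sketch: normalizing OCR word boxes to the 0-1000 grid that
# LayoutLM-family models expect. The OCR output here is hypothetical.

def normalize_box(box, page_width, page_height):
    """Map a pixel-space (x0, y0, x1, y1) box onto a 0-1000 grid."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# Hypothetical OCR output for one page: (word, pixel box)
ocr_words = [
    ("INVOICE", (100, 40, 380, 90)),
    ("Total:", (100, 900, 180, 930)),
    ("$1,250.00", (200, 900, 340, 930)),
]
page_w, page_h = 850, 1100

tokens = [w for w, _ in ocr_words]
boxes = [normalize_box(b, page_w, page_h) for _, b in ocr_words]
print(tokens[0], boxes[0])
```

The normalized boxes are what get paired with each text token in the multimodal encoder.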
Multimodal Encoding
LayoutLM-family models embed each text token with its 2D position (x, y, width, height) from OCR bounding boxes, plus visual features from the document image. This captures layout structure (headers are at the top, tables have grid patterns) alongside text content.
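The fusion of text and layout can be illustrated with a toy embedding: each token's input vector is the sum of a word embedding and one embedding per box coordinate. Real models use learned lookup tables; the deterministic functions below are illustrative stand-ins, not any model's actual parameters.

```python
# Toy sketch of LayoutLM-style input embedding: a token's vector is the
# sum of a text embedding and four 1-D position embeddings (x0, y0, x1,
# y1 on the 0-1000 grid). Learned tables are replaced by stand-ins here.
import math

DIM = 4  # tiny embedding width for illustration

def text_embedding(token):
    # Deterministic stand-in for a learned word embedding table.
    return [math.sin(hash(token) % 97 + i) for i in range(DIM)]

def coord_embedding(coord):
    # Stand-in for a learned 1-D position embedding table (one per axis).
    return [math.cos(coord / 1000 + i) for i in range(DIM)]

def layout_embedding(token, box):
    vec = text_embedding(token)
    for c in box:  # x0, y0, x1, y1 each contribute an embedding
        emb = coord_embedding(c)
        vec = [v + e for v, e in zip(vec, emb)]
    return vec

v = layout_embedding("INVOICE", (117, 36, 447, 81))
print(len(v))  # one fused text+layout vector per token
```

Because the position embeddings enter the sum, two identical words at different page locations produce different input vectors, which is exactly what lets the model distinguish a header "Total" from one inside a table.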
OCR-Free Encoding
Donut and Pix2Struct bypass OCR entirely — they encode the document image with a ViT and decode text autoregressively. The model implicitly learns to read and understand layout simultaneously.
Classification Head
A [CLS] token representation is projected to document type logits, and the model is trained with cross-entropy loss. Some approaches use hierarchical classification for fine-grained subtypes.
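The head itself is small enough to write out in full. This is a minimal pure-Python sketch of the logits, softmax, and cross-entropy steps; the weights and 3-class setup are illustrative values, not taken from any model.

```python
# Minimal sketch of the classification head: project a [CLS] vector to
# class logits, softmax, and compute cross-entropy against the label.
# Weights and the 3-class label set are illustrative.
import math

def classify(cls_vec, weights, biases):
    # One logit per document type: w . h + b
    return [sum(w * h for w, h in zip(row, cls_vec)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, true_idx):
    return -math.log(probs[true_idx])

cls_vec = [0.2, -0.5, 0.9]                    # pooled [CLS] representation
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 0.3]]
biases = [0.0, 0.0, 0.1]
probs = softmax(classify(cls_vec, weights, biases))
loss = cross_entropy(probs, true_idx=0)       # suppose the label is class 0
print(max(range(3), key=probs.__getitem__))   # predicted class index
```

A hierarchical variant would replace the single logit vector with one head per level (e.g. "financial" then "invoice" vs. "receipt").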
Evaluation
Accuracy on RVL-CDIP (16 classes) is the primary benchmark. Tobacco3482 (10 classes) is a smaller alternative. Real-world evaluation includes processing speed and robustness to scan quality (skew, noise, low resolution).
Current Landscape
Document classification in 2025 is mature for standard benchmarks — RVL-CDIP accuracy exceeds 95% with multiple approaches. The field has bifurcated between OCR-dependent multimodal models (LayoutLMv3, highest accuracy) and OCR-free models (Donut, simpler pipeline). Large VLMs are disrupting the space by enabling zero-shot classification — describe a document type in text and the model classifies it without training data. For production systems, the choice is increasingly between fine-tuning a small specialist model (fast, cheap) or calling a large VLM API (flexible, no training needed).
Key Challenges
Visual variability within classes — 'invoices' look radically different across companies, countries, and eras; production systems encounter hundreds of thousands of unique templates
Multi-page documents — most benchmarks classify single pages, but real documents (contracts, reports) span many pages with varying content per page
Scan quality — real-world document images include skew, blur, fax artifacts, handwritten annotations, and partial occlusions
Class imbalance — in production pipelines, some document types (emails) are 100× more common than others (legal notices), requiring careful sampling or loss weighting
OCR dependency — multimodal models relying on OCR inherit its errors; OCR-free models avoid this but sacrifice text understanding depth
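The class-imbalance point above has a standard mitigation: weight the loss by inverse class frequency. The counts below are made up, and normalizing so the weights average 1 is one common convention among several.

```python
# Sketch of inverse-frequency loss weights for imbalanced document
# types. Counts are invented; mean-1 normalization is one common choice.
from collections import Counter

counts = Counter(email=100_000, invoice=8_000, legal_notice=1_000)

raw = {cls: 1.0 / n for cls, n in counts.items()}
scale = len(raw) / sum(raw.values())          # rescale so weights average 1
weights = {cls: w * scale for cls, w in raw.items()}

# Rare classes get proportionally larger weight in the cross-entropy loss.
print(weights["legal_notice"] > weights["email"])
```

These weights would be passed to the cross-entropy loss (e.g. the `weight` argument of PyTorch's `CrossEntropyLoss`) so that rare legal notices are not drowned out by emails during training.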
Quick Recommendations
Best accuracy (OCR available)
LayoutLMv3-Large
95.93% on RVL-CDIP; leverages OCR text + layout + image jointly; the established production choice
OCR-free pipeline
Donut-Base or Pix2Struct
No OCR dependency eliminates a failure point; Donut reaches 95.3% on RVL-CDIP with a simpler pipeline
Zero-shot / new document types
Qwen2-VL-72B or GPT-4o
Classify into arbitrary document categories with a text prompt; no training data needed for new types
Fast / edge deployment
DiT-Base or EfficientNet on document images
Image-only classification at 90%+ accuracy without OCR overhead; suitable for mobile scanning apps
Multi-page documents
Hi-VT5 or custom LayoutLMv3 with page aggregation
Process each page then aggregate features; handles contracts and reports that span 10+ pages
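For the zero-shot route above, the engineering work is mostly prompt construction and reply parsing rather than training. This sketch shows both under assumed conventions (the label list, the prompt wording, and the substring-matching parser are all illustrative); the actual VLM API call is omitted.

```python
# Hedged sketch of zero-shot classification with a VLM: build a prompt
# listing candidate types, then parse the model's free-text reply back
# to a label. The VLM call itself is omitted.

LABELS = ["invoice", "receipt", "contract", "resume"]

def build_prompt(labels):
    return ("Classify this document image into exactly one of: "
            + ", ".join(labels) + ". Answer with the label only.")

def parse_reply(reply, labels):
    text = reply.strip().lower()
    for label in labels:
        if label in text:
            return label
    return None  # reply outside the label set; fall back or retry

prompt = build_prompt(LABELS)
print(parse_reply("This appears to be an Invoice.", LABELS))
```

Adding a new document type is then a one-line change to the label list, which is the flexibility the zero-shot recommendation trades against per-call API cost.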
What's Next
The field is moving toward: unified document understanding models that classify, extract, and answer questions in a single pass (no separate classification step); multi-page document classification with hierarchical attention; and continual learning systems that adapt to new document types without retraining on old ones. The end state is document AI pipelines where classification is an emergent capability of the extraction model rather than a separate step.
Benchmarks & SOTA
State-of-the-art results from Papers With Code:
rvl-cdip: EAML, 97.7% accuracy
tobacco-3482: DocXClassifier-L, 95.57% accuracy
noisy-bangla-numeral: PCGAN-CHAR, 96.68% accuracy
noisy-bangla-characters: PCGAN-CHAR, 89.54% accuracy
noisy-mnist: PCGAN-CHAR, 98.43% accuracy
aip: ResNet-RS (ResNet-200 + RS training tricks), 83.4 top-1 accuracy (verb)
n-mnist: Pixel-level RC, 97.62% accuracy