Computer Vision

Document Image Classification

Classifying documents by type or category


Document image classification categorizes scanned documents or photos of documents into types (invoice, receipt, contract, ID, resume, etc.) based on visual layout and content. It is typically the entry point of a document-processing pipeline: the predicted class routes each document to the right extraction model. LayoutLMv3 and DocFormerv2 achieve 95%+ accuracy on RVL-CDIP (16 classes, 400K documents) by combining OCR text, visual features, and spatial layout.

History

2015

RVL-CDIP dataset (Harley et al.) establishes the standard benchmark with 400K document images across 16 categories

2017

CNN-based classifiers (VGG, ResNet on document images) achieve 89-90% accuracy on RVL-CDIP, treating documents as regular images

2019

LayoutLM (Xu et al.) combines BERT-style text encoding with 2D position embeddings, showing that spatial layout is critical for document understanding

2021

LayoutLMv2 adds visual features and cross-modal alignment, reaching 95.25% on RVL-CDIP — multimodal approach dominates

2022

LayoutLMv3 unifies text, image, and layout pretraining with masked image/language modeling, achieving 95.93% on RVL-CDIP

2022

DiT (Document Image Transformer) shows that self-supervised pretraining on document images alone reaches 92.69% without OCR

2023

DocFormerv2 and Donut (OCR-free) achieve competitive accuracy without explicit OCR, simplifying the pipeline

2024

Large VLMs (Qwen2-VL, InternVL2) classify documents zero-shot using visual understanding alone, matching fine-tuned specialists

How Document Image Classification Works

Document Image Classification Pipeline
1

Input Processing

Documents are either (a) rendered as images and processed by a vision encoder, or (b) OCR'd to extract text + bounding boxes, which are combined with the document image.

2

Multimodal Encoding

LayoutLM-family models embed each text token with its 2D position (x, y, width, height) from OCR bounding boxes, plus visual features from the document image. This captures layout structure (headers are at the top, tables have grid patterns) alongside text content.
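The LayoutLM family normalizes OCR bounding boxes to a fixed 0–1000 coordinate grid so that layouts are comparable across page sizes before they are embedded. A minimal sketch of that normalization (the function name is illustrative):

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# A word box from a 1000x500 px page, rescaled to the shared grid
print(normalize_box((100, 50, 300, 100), 1000, 500))  # → (100, 100, 300, 200)
```

The normalized coordinates index learned x/y/width/height embedding tables, which are summed with the token embeddings.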

3

OCR-Free Encoding

Donut and Pix2Struct bypass OCR entirely — they encode the document image with a ViT and decode text autoregressively. The model implicitly learns to read and understand layout simultaneously.

4

Classification Head

A [CLS] token representation is projected to document-type logits, and the head is trained with cross-entropy loss. Some approaches use hierarchical classification for fine-grained subtypes.
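A framework-free sketch of such a head, a linear projection of the [CLS] vector followed by softmax (the weights and label names below are illustrative, not from any trained model):

```python
import math

def classify(cls_vec, weights, biases, labels):
    """Project a [CLS] vector to per-class logits, softmax them,
    and return the argmax label plus the probability distribution."""
    logits = [sum(w * x for w, x in zip(row, cls_vec)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

label, probs = classify(
    cls_vec=[1.0, 0.0],
    weights=[[2.0, 0.0], [0.0, 2.0]],
    biases=[0.0, 0.0],
    labels=["invoice", "letter"],
)
print(label)  # → invoice
```

At training time, the cross-entropy loss is simply the negative log of the probability assigned to the gold class.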

5

Evaluation

Accuracy on RVL-CDIP (16 classes) is the primary benchmark. Tobacco3482 (10 classes) is a smaller alternative. Real-world evaluation includes processing speed and robustness to scan quality (skew, noise, low resolution).
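Overall accuracy can hide weak classes, so evaluation scripts often report per-class accuracy as well. A small sketch of both computations:

```python
from collections import defaultdict

def accuracy_report(preds, golds):
    """Overall and per-class accuracy for an RVL-CDIP-style evaluation."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for p, g in zip(preds, golds):
        total[g] += 1
        if p == g:
            correct[g] += 1
    overall = sum(correct.values()) / len(golds)
    per_class = {c: correct[c] / total[c] for c in total}
    return overall, per_class

overall, per_class = accuracy_report(
    preds=["invoice", "letter", "invoice"],
    golds=["invoice", "letter", "letter"],
)
print(overall, per_class)  # → 0.666... {'invoice': 1.0, 'letter': 0.5}
```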

Current Landscape

Document classification in 2025 is mature for standard benchmarks — RVL-CDIP accuracy exceeds 95% with multiple approaches. The field has bifurcated between OCR-dependent multimodal models (LayoutLMv3, highest accuracy) and OCR-free models (Donut, simpler pipeline). Large VLMs are disrupting the space by enabling zero-shot classification — describe a document type in text and the model classifies it without training data. For production systems, the choice is increasingly between fine-tuning a small specialist model (fast, cheap) or calling a large VLM API (flexible, no training needed).
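As a concrete illustration of the zero-shot route, the classification prompt can simply enumerate the candidate types alongside the document image. The wording below is a hypothetical template, not any particular model's required format:

```python
def zero_shot_prompt(categories):
    """Build a classification prompt for a vision-language model.
    The phrasing is an illustrative example, not a fixed API."""
    opts = ", ".join(categories)
    return (
        "Look at this document image and classify it as exactly one of: "
        f"{opts}. Reply with only the category name."
    )

print(zero_shot_prompt(["invoice", "receipt", "contract"]))
```

Constraining the reply to a single category name keeps the output trivially parseable, which is the main practical hurdle when replacing a softmax head with a free-text model.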

Key Challenges

Visual variability within classes — 'invoices' look radically different across companies, countries, and eras; a production corpus can contain hundreds of thousands of unique templates

Multi-page documents — most benchmarks classify single pages, but real documents (contracts, reports) span many pages with varying content per page

Scan quality — real-world document images include skew, blur, fax artifacts, handwritten annotations, and partial occlusions

Class imbalance — in production pipelines, some document types (emails) are 100× more common than others (legal notices), requiring careful sampling or loss weighting

OCR dependency — multimodal models relying on OCR inherit its errors; OCR-free models avoid this but sacrifice text understanding depth
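The loss weighting mentioned for class imbalance can be as simple as inverse-frequency weights, normalized so the mean weight stays 1.0. This is a common heuristic, not a prescribed formula:

```python
from collections import Counter

def inverse_freq_weights(labels):
    """Per-class loss weights inversely proportional to class frequency,
    normalized so the average weight is 1.0."""
    counts = Counter(labels)
    raw = {c: 1.0 / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

# Emails are 3x as common as legal notices in this toy training set
print(inverse_freq_weights(["email"] * 3 + ["legal"]))
# → {'email': 0.5, 'legal': 1.5}
```

The resulting dictionary maps directly onto the per-class `weight` argument that most cross-entropy implementations accept.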

Quick Recommendations

Best accuracy (OCR available)

LayoutLMv3-Large

95.93% on RVL-CDIP; leverages OCR text + layout + image jointly; the established production choice

OCR-free pipeline

Donut-Base or Pix2Struct

No OCR dependency eliminates a failure point; Donut reaches 95.3% on RVL-CDIP with a simpler pipeline

Zero-shot / new document types

Qwen2-VL-72B or GPT-4o

Classify into arbitrary document categories with a text prompt; no training data needed for new types

Fast / edge deployment

DiT-Base or EfficientNet on document images

Image-only classification at 90%+ accuracy without OCR overhead; suitable for mobile scanning apps

Multi-page documents

Hi-VT5 or custom LayoutLMv3 with page aggregation

Process each page then aggregate features; handles contracts and reports that span 10+ pages
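The page-then-aggregate recommendation can be sketched with the simplest possible aggregator, mean pooling of per-page feature vectors; a hierarchical-attention model would replace this pooling step with learned weights:

```python
def aggregate_pages(page_embeddings):
    """Mean-pool per-page feature vectors into one document-level vector.
    A simple stand-in for hierarchical attention over pages."""
    n = len(page_embeddings)
    dim = len(page_embeddings[0])
    return [sum(page[d] for page in page_embeddings) / n for d in range(dim)]

# Two pages, each encoded as a 2-dim feature vector (toy sizes)
print(aggregate_pages([[1.0, 2.0], [3.0, 4.0]]))  # → [2.0, 3.0]
```

The pooled vector then feeds the same classification head used for single-page documents.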

What's Next

The field is moving toward: unified document understanding models that classify, extract, and answer questions in a single pass (no separate classification step); multi-page document classification with hierarchical attention; and continual learning systems that adapt to new document types without retraining on old ones. The end state is document AI pipelines where classification is an emergent capability of the extraction model rather than a separate step.
