Parsing Every Document
OmniDocBench is the most comprehensive benchmark for PDF document parsing, evaluating text extraction, table recognition, formula detection, and layout analysis across 9 diverse document types.
What is OmniDocBench?
OmniDocBench is a comprehensive document parsing benchmark created by Shanghai AI Laboratory and accepted at CVPR 2025. It evaluates the ability of AI systems to convert PDF documents into structured formats like Markdown, preserving text, tables, formulas, and reading order.
Unlike earlier benchmarks that focus on narrow document types (only academic papers, or only scanned receipts), OmniDocBench covers 9 diverse document categories including academic papers, textbooks, slides, financial reports, newspapers, handwritten notes, exam papers, magazines, and research reports.
The benchmark uses 19 layout categories and 15 attribute labels for multi-level annotation, enabling both end-to-end evaluation and fine-grained task-specific analysis. This makes it the most thorough document parsing evaluation available.
The Document Parsing Pipeline
From raw PDF pages to structured Markdown output. Understanding each stage reveals where models succeed and fail.
PDF Document
Raw PDF pages with mixed content: text paragraphs, tables, mathematical formulas, figures, headers, footers, and complex multi-column layouts.
Layout Analysis
Detect and classify 19 layout elements: text blocks, tables, formulas, figures, titles, headers, footers, page numbers, captions, and more.
Content Recognition
Each detected region gets specialized processing: OCR for text, structure recognition for tables (HTML/LaTeX), and LaTeX conversion for formulas.
Structured Format
Final Markdown/HTML output preserving reading order, table structure, formula notation, and document hierarchy. Ready for downstream tasks.
End-to-end VLMs (like Qwen3-VL, Gemini 2.5 Pro) collapse stages 2-4 into a single forward pass. Pipeline methods (MinerU, PaddleOCR) use specialized models per stage.
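The pipeline approach can be sketched as a simple orchestration loop. This is an illustrative skeleton only: the stage functions (`detect_layout`, the per-type recognizers) are hypothetical placeholders, not APIs from MinerU or PaddleOCR.

```python
# Hypothetical pipeline orchestration sketch. The stage functions passed in
# (detect_layout, recognizers) stand in for specialized models per stage;
# real systems like MinerU use their own interfaces.
from dataclasses import dataclass

@dataclass
class Region:
    kind: str          # "text", "table", "formula", ...
    bbox: tuple        # (x0, y0, x1, y1) in page coordinates
    content: str = ""

def parse_page(page_image, detect_layout, recognizers) -> str:
    """Stages 2-4: detect regions, recognize each with a specialized
    model, then emit Markdown in reading order."""
    regions = detect_layout(page_image)                      # stage 2
    for r in regions:
        r.content = recognizers[r.kind](page_image, r.bbox)  # stage 3
    # Stage 4: naive reading order = top-to-bottom, left-to-right.
    regions.sort(key=lambda r: (r.bbox[1], r.bbox[0]))
    return "\n\n".join(r.content for r in regions)
```

An end-to-end VLM replaces this entire loop with one model call, which avoids cascading errors between stages but gives up per-stage specialization.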
OmniDocBench Composite Leaderboard
Composite Score = ((1 - TextEditDist) × 100 + TableTEDS + FormulaCDM) / 3. Higher is better.
| Rank | Model | Composite | Source |
|---|---|---|---|
| 1 | PaddleOCR-VL | 92.86 | alphaxiv-leaderboard |
| 2 | PaddleOCR-VL 0.9B | 92.56 | alphaxiv-leaderboard |
| 3 | MinerU 2.5 | 90.67 | alphaxiv-leaderboard |
| 4 | Qwen3-VL 235B | 89.15 | alphaxiv-leaderboard |
| 5 | MonkeyOCR Pro 3B | 88.85 | alphaxiv-leaderboard |
| 6 | OCRVerse 4B | 88.56 | github-leaderboard |
| 7 | dots.ocr 3B | 88.41 | github-leaderboard |
| 8 | Gemini 2.5 Pro | 88.03 | alphaxiv-leaderboard |
| 9 | Qwen2.5-VL | 87.02 | alphaxiv-leaderboard |
| 10 | Mistral OCR (2512) | 79.75 | codesota-verified |
| 11 | Mistral OCR 3 | 79.75 | codesota-verified |
| 12 | clearOCR (TeamQuest) | 31.70 | codesota-verified |
Best Scores by Metric
Individual metric leaders across all tracked OmniDocBench dimensions.
Text Edit Distance
Character-level edit distance for OCR accuracy. Lower is better.
Table TEDS
Tree Edit Distance Score for table structure. Higher is better.
Layout mAP
Mean Average Precision for layout detection. Higher is better.
Formula Edit Distance
Edit distance between predicted and ground-truth LaTeX formulas. Lower is better.
Reading Order
Accuracy of element reading order. Higher is better.
Why Document Parsing is Hard
Document parsing sits at the intersection of computer vision (layout detection, figure recognition), NLP (text extraction, reading order), and structured prediction (table/formula reconstruction).
- Layout Diversity: Academic papers, newspapers, and slides have radically different layouts
- Nested Structures: Tables within tables, formulas within table cells, multi-column text flows
- OCR Errors Cascade: A single misread character in a formula renders the entire equation wrong
- Language Agnosticism: Documents span dozens of languages with different scripts
The Rise of Vision-Language Models
Traditional document parsing relied on pipeline approaches: separate models for layout detection, OCR, table recognition, and formula detection. Each module could be optimized independently but errors cascaded between stages.
Now, end-to-end VLMs like Qwen3-VL and Gemini 2.5 Pro convert entire pages in a single forward pass. They score competitively on OmniDocBench without any document-specific training.
However, pipeline methods like PaddleOCR-VL and MinerU still hold the top spots, suggesting that specialized architectures remain valuable for structured document understanding.
Understanding the Metrics
Text Edit Distance
Measures character-level accuracy of extracted text against ground truth using normalized Levenshtein distance. A score of 0.02 means only 2% of characters need editing.
Table TEDS (Tree Edit Distance Score)
Evaluates table structure recognition by comparing the predicted HTML/LaTeX table tree against the ground truth tree. Captures both cell content and structural accuracy.
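To make the idea concrete, here is a simplified tree-similarity sketch in the spirit of TEDS. This is NOT the official TEDS implementation (which runs the Zhang-Shasha tree edit distance over full HTML trees); it uses a constrained edit distance on small `(label, children)` tuples purely to show how structural errors lower the score.

```python
# Simplified TEDS-style similarity sketch (illustrative, not the official
# metric). Tables are modeled as (label, [children]) tuples.

def tree_size(t):
    label, children = t
    return 1 + sum(tree_size(c) for c in children)

def ted(a, b):
    """Constrained tree edit distance: relabel cost at the root plus a
    sequence alignment over the children, recursing for substitutions."""
    cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # deleting a child costs its size
        dp[i][0] = dp[i - 1][0] + tree_size(ca[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + tree_size(cb[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + tree_size(ca[i - 1]),
                           dp[i][j - 1] + tree_size(cb[j - 1]),
                           dp[i - 1][j - 1] + ted(ca[i - 1], cb[j - 1]))
    return cost + dp[m][n]

def teds_like(pred, gold):
    """Similarity in [0, 1]: 1 - distance / larger tree size."""
    return 1.0 - ted(pred, gold) / max(tree_size(pred), tree_size(gold))

gold = ("table", [("tr", [("td", []), ("td", [])]),
                  ("tr", [("td", []), ("td", [])])])
pred = ("table", [("tr", [("td", []), ("td", [])]),
                  ("tr", [("td", [])])])   # one missing cell
```

Here the missing cell costs one edit against a 7-node ground truth, so the similarity drops to about 0.86 rather than collapsing to zero, which is exactly why TEDS is preferred over exact-match table scoring.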
Layout mAP (Mean Average Precision)
Standard object detection metric applied to document layout elements. Measures how accurately the model detects and classifies text blocks, tables, figures, formulas, etc.
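The building block of mAP is intersection-over-union (IoU) between predicted and ground-truth boxes; a detection typically counts as correct at IoU ≥ 0.5. A minimal sketch:

```python
# Minimal IoU sketch -- the core of mAP matching. Boxes are
# (x0, y0, x1, y1); full mAP additionally sweeps confidence thresholds
# and averages precision per layout class.

def iou(a, b) -> float:
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

iou((0, 0, 10, 10), (5, 0, 15, 10))  # overlap 50 / union 150 -> 1/3
```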
Formula CDM (Character Detection Matching)
Evaluates mathematical formula recognition by matching detected characters and symbols against ground truth LaTeX. Captures both symbol accuracy and spatial arrangement.
Composite Score Formula
Composite = ((1 - TextEditDist) × 100 + TableTEDS + FormulaCDM) / 3
This balanced formula ensures models must excel at all three core tasks. A model strong at OCR but weak at tables will be penalized.
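The formula above translates directly to code. Note the unit convention it assumes: text edit distance is on [0, 1], while TEDS and CDM are percentages on [0, 100].

```python
# Composite score as defined above: text edit distance in [0, 1],
# TableTEDS and FormulaCDM as percentages in [0, 100].

def composite(text_edit_dist: float, table_teds: float,
              formula_cdm: float) -> float:
    return ((1 - text_edit_dist) * 100 + table_teds + formula_cdm) / 3

composite(0.035, 90.2, 88.1)  # -> ~91.6 for a strong all-round model
```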
Related Benchmarks Comparison
How OmniDocBench compares to other document understanding benchmarks.
| Benchmark | Focus | Documents | Doc Types | Key Metric | Year |
|---|---|---|---|---|---|
| OmniDocBench | End-to-end parsing | 981 | 9 categories | Composite (Text + Table + Formula) | 2024 |
| DocLayNet | Layout detection | 80,863 | 6 categories | mAP@0.5 | 2022 |
| PubLayNet | Layout detection | 360,000+ | Academic papers | mAP | 2019 |
| olmOCR-Bench | PDF extraction | 1,402 | Mixed PDFs | Pass Rate (unit tests) | 2025 |
| OCRBench v2 | OCR capabilities | 10,000+ | 23 task types | Overall Score | 2024 |
| TableBank | Table detection | 417,234 | Academic papers | F1 Score | 2019 |
| CC-OCR | Multi-scene OCR | - | 4 task domains | F1 Score | 2024 |
OmniDocBench is unique in evaluating the full end-to-end parsing pipeline (text + tables + formulas + layout) on diverse document types, rather than focusing on a single sub-task.
Dataset Access
GitHub Repository
Source code, evaluation scripts, and benchmark data. Open source under Apache 2.0.
Research Paper
arXiv:2412.07626 -- Full methodology, annotation guidelines, and baseline results. Accepted at CVPR 2025.
AlphaXiv Leaderboard
Official live leaderboard with the latest model submissions and verified scores.
Have OmniDocBench Results?
If you have run your model on OmniDocBench and want to be listed on this leaderboard with verified results, submit your scores for independent verification.