OmniDocBench is one of the most comprehensive benchmarks for PDF document parsing, evaluating text extraction, table recognition, formula detection, and layout analysis across 9 diverse document types.
OmniDocBench is a comprehensive document parsing benchmark created by Shanghai AI Laboratory and accepted at CVPR 2025. It evaluates the ability of AI systems to convert PDF documents into structured formats like Markdown, preserving text, tables, formulas, and reading order.
Unlike earlier benchmarks that focus on narrow document types (only academic papers, or only scanned receipts), OmniDocBench covers 9 diverse document categories including academic papers, textbooks, slides, financial reports, newspapers, handwritten notes, exam papers, magazines, and research reports.
The benchmark uses 19 layout categories and 15 attribute labels for multi-level annotation, enabling both end-to-end evaluation and fine-grained task-specific analysis. This makes it one of the most thorough document parsing evaluations available.
From raw PDF pages to structured Markdown output. Understanding each stage reveals where models succeed and fail.
Raw PDF pages with mixed content: text paragraphs, tables, mathematical formulas, figures, headers, footers, and complex multi-column layouts.
Detect and classify 19 layout elements: text blocks, tables, formulas, figures, titles, headers, footers, page numbers, captions, and more.
Each detected region gets specialized processing: OCR for text, structure recognition for tables (HTML/LaTeX), and LaTeX conversion for formulas.
Final Markdown/HTML output preserving reading order, table structure, formula notation, and document hierarchy. Ready for downstream tasks.
End-to-end VLMs (like Qwen3-VL, Gemini 2.5 Pro) collapse stages 2-4 into a single forward pass. Pipeline methods (MinerU, PaddleOCR) use specialized models per stage.
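The pipeline route described above can be sketched in a few lines. All model names, signatures, and the top-to-bottom reading-order heuristic here are illustrative placeholders, not any real library's API:

```python
from dataclasses import dataclass

@dataclass
class Region:
    category: str   # e.g. "text", "table", "formula"
    bbox: tuple     # (x0, y0, x1, y1) in page coordinates
    content: str = ""

def parse_page(page_image, layout_model, ocr, table_model, formula_model):
    """Stages 2-4: detect regions, recognize each, assemble output."""
    regions = layout_model(page_image)          # stage 2: layout detection
    for r in regions:
        crop = page_image  # in practice: crop the image to r.bbox
        if r.category == "table":
            r.content = table_model(crop)       # HTML/LaTeX table structure
        elif r.category == "formula":
            r.content = formula_model(crop)     # LaTeX source
        else:
            r.content = ocr(crop)               # plain text
    # Stage 4: sort into reading order. Top-to-bottom suffices for this
    # sketch; real systems use a learned reading-order model for
    # multi-column pages.
    regions.sort(key=lambda r: (r.bbox[1], r.bbox[0]))
    return "\n\n".join(r.content for r in regions)
```

An end-to-end VLM replaces all of this with one image-to-Markdown forward pass, trading per-stage control for simplicity.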
Composite Score = ((1 - TextEditDist) × 100 + TableTEDS + FormulaCDM) / 3. Higher is better.
| Rank | Model | Composite | Source |
|---|---|---|---|
| 1 | GLM-OCR | 94.62 | codesota-api |
| 2 | PaddleOCR-VL | 92.86 | codesota-api |
| 3 | PaddleOCR-VL 0.9B | 92.56 | codesota-api |
| 4 | MinerU 2.5 | 90.67 | codesota-api |
| 5 | Qwen3-VL 235B | 89.15 | codesota-api |
| 6 | MonkeyOCR Pro 3B | 88.85 | codesota-api |
| 7 | OCRVerse 4B | 88.56 | codesota-api |
| 8 | dots.ocr 3B | 88.41 | codesota-api |
| 9 | Gemini 2.5 Pro | 88.03 | codesota-api |
| 10 | Qwen2.5-VL | 87.02 | codesota-api |
| 11 | Mistral OCR 3 (verified) | 79.75 | codesota-api |
| 12 | Mistral OCR (2512) (verified) | 79.75 | codesota-api |
| 13 | clearOCR (TeamQuest) (verified) | 31.70 | codesota-api |
Individual metric leaders across all tracked OmniDocBench dimensions.
Character-level edit distance for OCR accuracy. Lower is better.
Tree Edit Distance Score for table structure. Higher is better.
Mean Average Precision for layout detection. Higher is better.
LaTeX formula recognition accuracy (CDM). Higher is better.
Accuracy of element reading order. Higher is better.
Document parsing sits at the intersection of computer vision (layout detection, figure recognition), NLP (text extraction, reading order), and structured prediction (table/formula reconstruction).
Traditional document parsing relied on pipeline approaches: separate models for layout detection, OCR, table recognition, and formula detection. Each module could be optimized independently but errors cascaded between stages.
Now, end-to-end VLMs like Qwen3-VL and Gemini 2.5 Pro convert entire pages in a single forward pass. They score competitively on OmniDocBench without a dedicated document-parsing pipeline.
However, pipeline methods like PaddleOCR-VL and MinerU still hold many of the top spots, suggesting that specialized architectures remain valuable for structured document understanding.
Measures character-level accuracy of extracted text against ground truth using normalized Levenshtein distance. A score of 0.02 means only 2% of characters need editing.
Evaluates table structure recognition by comparing the predicted HTML/LaTeX table tree against the ground truth tree. Captures both cell content and structural accuracy.
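Real TEDS runs a tree edit distance (Zhang-Shasha) over the parsed HTML tree, which is too long to sketch here. A simplified cell-grid proxy still illustrates what the metric rewards, namely getting both the content and the position of each cell right; this is an illustrative stand-in, not the actual TEDS algorithm:

```python
from html.parser import HTMLParser

class TableGrid(HTMLParser):
    """Flatten an HTML table into a list of rows of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []
    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th") and self._cell is not None:
            if self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def grid_similarity(pred_html: str, gt_html: str) -> float:
    """Fraction of ground-truth cells matched at the same (row, col)."""
    pg, gg = TableGrid(), TableGrid()
    pg.feed(pred_html)
    gg.feed(gt_html)
    total = sum(len(r) for r in gg.rows) or 1
    hits = sum(1
               for i, row in enumerate(gg.rows)
               for j, cell in enumerate(row)
               if i < len(pg.rows) and j < len(pg.rows[i])
               and pg.rows[i][j] == cell)
    return hits / total
```

Unlike this proxy, true TEDS also credits partially correct structure (e.g. merged cells, extra rows) through tree edit operations.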
Standard object detection metric applied to document layout elements. Measures how accurately the model detects and classifies text blocks, tables, figures, formulas, etc.
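At the core of mAP is the intersection-over-union test that decides whether a predicted box counts as a correct detection. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes.

    mAP counts a detection as correct when its IoU with a same-class
    ground-truth box exceeds a threshold (e.g. 0.5), then averages
    precision over recall levels and over classes.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))  # overlap width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))  # overlap height
    inter = iw * ih
    union = ((ax1 - ax0) * (ay1 - ay0)
             + (bx1 - bx0) * (by1 - by0) - inter)
    return inter / union if union > 0 else 0.0
```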
Evaluates mathematical formula recognition by matching detected characters and symbols against ground truth LaTeX. Captures both symbol accuracy and spatial arrangement.
Composite = ((1 - TextEditDist) × 100 + TableTEDS + FormulaCDM) / 3
This balanced formula ensures models must excel at all three core tasks. A model strong at OCR but weak at tables will be penalized.
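The formula above is straightforward to compute. Note the unit convention assumed here: TextEditDist lies in [0, 1] (lower is better), while TableTEDS and FormulaCDM are already on a 0-100 scale (higher is better):

```python
def composite_score(text_edit_dist: float, table_teds: float,
                    formula_cdm: float) -> float:
    """Composite = ((1 - TextEditDist) * 100 + TableTEDS + FormulaCDM) / 3."""
    return ((1 - text_edit_dist) * 100 + table_teds + formula_cdm) / 3

# A model with near-perfect OCR but weak tables is pulled down:
# composite_score(0.02, 60.0, 90.0) -> (98 + 60 + 90) / 3 ≈ 82.67
```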
How OmniDocBench compares to other document understanding benchmarks.
| Benchmark | Focus | Documents | Doc Types | Key Metric | Year |
|---|---|---|---|---|---|
| OmniDocBench | End-to-end parsing | 981 | 9 categories | Composite (Text + Table + Formula) | 2024 |
| DocLayNet | Layout detection | 80,863 | 6 categories | mAP@0.5 | 2022 |
| PubLayNet | Layout detection | 360,000+ | Academic papers | mAP | 2019 |
| olmOCR-Bench | PDF extraction | 1,402 | Mixed PDFs | Pass Rate (unit tests) | 2025 |
| OCRBench v2 | OCR capabilities | 10,000+ | 23 task types | Overall Score | 2024 |
| TableBank | Table detection | 417,234 | Academic papers | F1 Score | 2019 |
| CC-OCR | Multi-scene OCR | - | 4 task domains | F1 Score | 2024 |
OmniDocBench is unique in evaluating the full end-to-end parsing pipeline (text + tables + formulas + layout) on diverse document types, rather than focusing on a single sub-task.
Source code, evaluation scripts, and benchmark data. Open source under Apache 2.0.
arXiv:2412.07626 -- Full methodology, annotation guidelines, and baseline results. Accepted at CVPR 2025.
Official live leaderboard with the latest model submissions and verified scores.
If you have run your model on OmniDocBench and want to be listed on this leaderboard with verified results, submit your scores for independent verification.