CVPR 2025 Benchmark

Parsing Every Document

OmniDocBench is the most comprehensive benchmark for PDF document parsing, evaluating text extraction, table recognition, formula detection, and layout analysis across 9 diverse document types.

Benchmark Stats

  • 981 Annotated Pages
  • 92.9 SOTA Composite Score
  • 7 Metrics Tracked
  • 13 Models Evaluated

What is OmniDocBench?

OmniDocBench is a comprehensive document parsing benchmark created by Shanghai AI Laboratory and accepted at CVPR 2025. It evaluates the ability of AI systems to convert PDF documents into structured formats like Markdown, preserving text, tables, formulas, and reading order.

Unlike earlier benchmarks that focus on narrow document types (only academic papers, or only scanned receipts), OmniDocBench covers 9 diverse document categories including academic papers, textbooks, slides, financial reports, newspapers, handwritten notes, exam papers, magazines, and research reports.

The benchmark uses 19 layout categories and 15 attribute labels for multi-level annotation, enabling both end-to-end evaluation and fine-grained task-specific analysis. This makes it the most thorough document parsing evaluation available.

Key Properties

  • Multi-source Coverage: 9 document types, from academic papers to handwritten notes
  • Multi-level Annotations: 19 layout categories, 15 attribute labels
  • Composite Scoring: balanced metric across text, tables, and formulas
  • Pipeline + VLM Evaluation: compares traditional pipelines and vision-language models
  • Open Access: dataset and evaluation code publicly available on GitHub

The Document Parsing Pipeline

The pipeline runs from raw PDF pages to structured Markdown output; understanding each stage reveals where models succeed and where they fail.

1. Input: PDF Document

Raw PDF pages with mixed content: text paragraphs, tables, mathematical formulas, figures, headers, footers, and complex multi-column layouts.

2. Detection: Layout Analysis

Detect and classify 19 layout elements: text blocks, tables, formulas, figures, titles, headers, footers, page numbers, captions, and more.

3. Extraction: Content Recognition

Each detected region gets specialized processing: OCR for text, structure recognition for tables (HTML/LaTeX), and LaTeX conversion for formulas.

4. Output: Structured Format

Final Markdown/HTML output preserving reading order, table structure, formula notation, and document hierarchy. Ready for downstream tasks.

PDF Page → Layout Detection → Text OCR + Table Structure + Formula LaTeX → Structured Markdown

End-to-end VLMs (like Qwen3-VL, Gemini 2.5 Pro) collapse stages 2-4 into a single forward pass. Pipeline methods (MinerU, PaddleOCR) use specialized models per stage.
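Stage 3 of a pipeline method can be sketched as a dispatcher that routes each detected region to a specialized recognizer. Everything below (the `Region` type, handler names, output formats) is an illustrative placeholder, not any specific library's API:

```python
from dataclasses import dataclass

# Illustrative output of the layout-analysis stage.
@dataclass
class Region:
    kind: str      # "text", "table", or "formula"
    content: str   # stand-in for the cropped region's pixels

def recognize_text(region: Region) -> str:     # placeholder for an OCR model
    return region.content

def recognize_table(region: Region) -> str:    # placeholder for a table-structure model
    return f"<table><tr><td>{region.content}</td></tr></table>"

def recognize_formula(region: Region) -> str:  # placeholder for a LaTeX recognizer
    return f"${region.content}$"

HANDLERS = {
    "text": recognize_text,
    "table": recognize_table,
    "formula": recognize_formula,
}

def parse_page(regions: list[Region]) -> str:
    """Route each detected region to its specialized recognizer
    and join the results in reading order."""
    return "\n\n".join(HANDLERS[r.kind](r) for r in regions)

page = [Region("text", "Introduction"), Region("formula", "E = mc^2")]
print(parse_page(page))
```

Because each handler is an independent model, a mistake in layout detection (stage 2) propagates: a table misclassified as text gets OCR'd instead of structure-recognized, which is exactly the error cascade the surrounding text describes.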

OmniDocBench Composite Leaderboard

Composite Score = ((1 − TextEditDist) × 100 + TableTEDS + FormulaCDM) / 3. Higher is better.

| Rank | Model | Composite | Source |
|------|-------|-----------|--------|
| 1 | PaddleOCR-VL | 92.86 | alphaxiv-leaderboard |
| 2 | PaddleOCR-VL 0.9B | 92.56 | alphaxiv-leaderboard |
| 3 | MinerU 2.5 | 90.67 | alphaxiv-leaderboard |
| 4 | Qwen3-VL 235B | 89.15 | alphaxiv-leaderboard |
| 5 | MonkeyOCR Pro 3B | 88.85 | alphaxiv-leaderboard |
| 6 | OCRVerse 4B | 88.56 | github-leaderboard |
| 7 | dots.ocr 3B | 88.41 | github-leaderboard |
| 8 | Gemini 2.5 Pro | 88.03 | alphaxiv-leaderboard |
| 9 | Qwen2.5-VL | 87.02 | alphaxiv-leaderboard |
| 10 | Mistral OCR (2512) | 79.75 | codesota-verified |
| 11 | Mistral OCR 3 | 79.75 | codesota-verified |
| 12 | clearOCR (TeamQuest) | 31.70 | codesota-verified |

Best Scores by Metric

Individual metric leaders across all tracked OmniDocBench dimensions.

Text Edit Distance

Character-level edit distance for OCR accuracy. Lower is better.

  • GPT-4o — 0.020
  • Mistral OCR 3 — 0.099
  • clearOCR (TeamQuest) — 0.154

Table TEDS

Tree Edit Distance Score for table structure. Higher is better.

  • PaddleOCR-VL — 93.52
  • Mistral OCR 3 — 70.88
  • clearOCR (TeamQuest) — 0.80

Layout mAP

Mean Average Precision for layout detection. Higher is better.

  • MinerU 2.5 — 97.5

Formula Edit Distance

LaTeX formula recognition accuracy. Lower is better.

  • Mistral OCR 3 — 0.218
  • clearOCR (TeamQuest) — 0.902

Reading Order

Accuracy of element reading order. Higher is better.

  • Mistral OCR 3 — 91.63
  • clearOCR (TeamQuest) — 86.04

Why Document Parsing is Hard

Document parsing sits at the intersection of computer vision (layout detection, figure recognition), NLP (text extraction, reading order), and structured prediction (table/formula reconstruction).

  • Layout Diversity: Academic papers, newspapers, and slides have radically different layouts
  • Nested Structures: Tables within tables, formulas within table cells, multi-column text flows
  • OCR Errors Cascade: A single misread character in a formula renders the entire equation wrong
  • Language Agnosticism: Documents span dozens of languages with different scripts

The Rise of Vision-Language Models

Traditional document parsing relied on pipeline approaches: separate models for layout detection, OCR, table recognition, and formula detection. Each module could be optimized independently but errors cascaded between stages.

Now, end-to-end VLMs like Qwen3-VL and Gemini 2.5 Pro convert entire pages in a single forward pass. They score competitively on OmniDocBench without any document-specific training.

However, pipeline methods like PaddleOCR-VL and MinerU still hold the top spots, suggesting that specialized architectures remain valuable for structured document understanding.

Understanding the Metrics

Text Edit Distance

Measures character-level accuracy of extracted text against ground truth using normalized Levenshtein distance. A score of 0.02 means only 2% of characters need editing.

Lower is better. Range: 0.0 (perfect) to 1.0 (completely wrong)
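A minimal implementation of this metric, assuming the standard dynamic-programming Levenshtein definition normalized by the longer string's length:

```python
def normalized_edit_distance(pred: str, gt: str) -> float:
    """Levenshtein distance divided by the longer string's length,
    so 0.0 is a perfect match and 1.0 is completely wrong."""
    m, n = len(pred), len(gt)
    if max(m, n) == 0:
        return 0.0
    # Classic DP table, kept one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, n)

print(normalized_edit_distance("OmniDocBench", "OmniDocBench"))  # 0.0
```

One substituted character in a 50-character line yields 0.02, matching the "2% of characters need editing" reading above.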

Table TEDS (Tree Edit Distance Score)

Evaluates table structure recognition by comparing the predicted HTML/LaTeX table tree against the ground truth tree. Captures both cell content and structural accuracy.

Higher is better. Range: 0 to 100
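Full TEDS runs a tree edit distance (Zhang–Shasha) over the table's HTML tree. As a rough intuition only, here is a much simpler stand-in that compares flattened cell sequences and ignores row/column structure — not the official TEDS implementation:

```python
from html.parser import HTMLParser
from difflib import SequenceMatcher

class CellExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, in document order."""
    def __init__(self):
        super().__init__()
        self.cells, self._in_cell = [], False
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data.strip()

def cell_similarity(pred_html: str, gt_html: str) -> float:
    """0-100 score over the flattened cell sequences (content only)."""
    p, g = CellExtractor(), CellExtractor()
    p.feed(pred_html)
    g.feed(gt_html)
    return 100 * SequenceMatcher(None, p.cells, g.cells).ratio()

gt = "<table><tr><td>a</td><td>b</td></tr></table>"
print(cell_similarity(gt, gt))  # 100.0
```

Real TEDS additionally penalizes structural errors (wrong rowspan/colspan, merged or split rows), which this cell-only sketch cannot see.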

Layout mAP (Mean Average Precision)

Standard object detection metric applied to document layout elements. Measures how accurately the model detects and classifies text blocks, tables, figures, formulas, etc.

Higher is better. Range: 0 to 100
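The matching criterion underlying mAP is intersection-over-union between predicted and ground-truth boxes. A minimal sketch of IoU follows; real mAP additionally sweeps confidence thresholds and averages precision across the layout classes:

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes.
    A prediction typically counts as a hit when IoU >= 0.5."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (clamped to zero if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

# A predicted table box half-overlapping the ground-truth box:
print(iou((0, 0, 100, 100), (50, 0, 150, 100)))
```

Two boxes that each cover 100×100 pixels and overlap on half their width share one third of their combined area, so the call above returns 1/3 — below the usual 0.5 hit threshold.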

Formula CDM (Character Detection Matching)

Evaluates mathematical formula recognition by matching detected characters and symbols against ground truth LaTeX. Captures both symbol accuracy and spatial arrangement.

Higher is better. Used in composite score calculation

Composite Score Formula

Composite = ((1 - TextEditDist) × 100 + TableTEDS + FormulaCDM) / 3

This balanced formula ensures models must excel at all three core tasks. A model strong at OCR but weak at tables will be penalized.
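The formula translates directly into code; the metric values in the example are purely illustrative, not taken from the leaderboard:

```python
def composite_score(text_edit_dist: float, table_teds: float,
                    formula_cdm: float) -> float:
    """OmniDocBench composite:
    ((1 - TextEditDist) * 100 + TableTEDS + FormulaCDM) / 3."""
    return ((1 - text_edit_dist) * 100 + table_teds + formula_cdm) / 3

# Hypothetical model: 0.05 text edit distance, 90 TEDS, 88 CDM.
print(round(composite_score(0.05, 90.0, 88.0), 2))  # 91.0
```

Because each of the three terms is on a 0-100 scale, a 10-point weakness in any one task costs the same ~3.3 composite points, which is what forces balanced performance.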

Related Benchmarks Comparison

How OmniDocBench compares to other document understanding benchmarks.

| Benchmark | Focus | Documents | Doc Types | Key Metric | Year |
|-----------|-------|-----------|-----------|------------|------|
| OmniDocBench | End-to-end parsing | 981 | 9 categories | Composite (Text + Table + Formula) | 2024 |
| DocLayNet | Layout detection | 80,863 | 6 categories | mAP@0.5 | 2022 |
| PubLayNet | Layout detection | 360,000+ | Academic papers | mAP | 2019 |
| olmOCR-Bench | PDF extraction | 1,402 | Mixed PDFs | Pass Rate (unit tests) | 2025 |
| OCRBench v2 | OCR capabilities | 10,000+ | 23 task types | Overall Score | 2024 |
| TableBank | Table detection | 417,234 | Academic papers | F1 Score | 2019 |
| CC-OCR | Multi-scene OCR | — | 4 task domains | F1 Score | 2024 |

OmniDocBench is unique in evaluating the full end-to-end parsing pipeline (text + tables + formulas + layout) on diverse document types, rather than focusing on a single sub-task.


Have OmniDocBench Results?

If you have run your model on OmniDocBench and want to be listed on this leaderboard with verified results, submit your scores for independent verification.