Codesota · OCR · Table Extraction
Benchmark · 2025

Table Extraction OCR.

Which OCR model actually preserves table structure? We tested seven models, including Claude, GPT-5.4, Mistral OCR, Docling, MinerU, and PaddleOCR-VL, on real-world tables using the TEDS metric.

§ 01 · Quick Answer

Three picks.

  • Best overall (TEDS 93.52): PaddleOCR-VL (free, open source)
  • Best for complex tables: MinerU 2.5 (89.8% on merged cells)
  • Best for financial docs: Claude (lowest hallucination)

§ 02 · What is TEDS

Tree-Edit-Distance-based Similarity.

TEDS measures how accurately an OCR model preserves table structure. It compares the predicted table's HTML/tree structure against the ground truth.

How TEDS Works

  1. Convert table to tree structure (rows, cells, content)
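  2. Calculate the minimum number of edit operations needed to transform the predicted tree into the ground-truth tree
  3. Normalize by tree size: TEDS = 1 - (edits / max_nodes), where max_nodes is the node count of the larger tree
  4. Multiply by 100 to report the score on a scale from 0 (completely wrong) to 100 (perfect)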
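
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the official PubTabNet evaluation script: every insert, delete, and relabel costs 1, whereas the reference implementation also gives partial credit for near-matching cell text. The `node` helper and the `td:<text>` label convention are assumptions made for this example.

```python
from functools import lru_cache

def node(label, *children):
    """Build a tree node as (label, children); children is a tuple of nodes."""
    return (label, tuple(children))

def size(tree):
    """Count the nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1:
        return sum(size(t) for t in f2)   # insert everything left in f2
    if not f2:
        return sum(size(t) for t in f1)   # delete everything left in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children move up in its place.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    match = (forest_distance(kids1, kids2)
             + forest_distance(f1[:-1], f2[:-1])
             + (0 if label1 == label2 else 1))
    return min(delete, insert, match)

def teds(pred, truth):
    """TEDS in [0, 1]: 1 - edits / max_nodes."""
    edits = forest_distance((pred,), (truth,))
    return 1.0 - edits / max(size(pred), size(truth))

# Ground truth: a 2x2 table (7 nodes: table, 2 rows, 4 cells).
truth = node("table",
             node("tr", node("td:Q1"), node("td:100")),
             node("tr", node("td:Q2"), node("td:120")))
# Prediction that dropped the last cell of the second row.
pred = node("table",
            node("tr", node("td:Q1"), node("td:100")),
            node("tr", node("td:Q2")))

print(round(teds(truth, truth) * 100, 2))  # 100.0
print(round(teds(pred, truth) * 100, 2))   # 85.71 (one edit over seven nodes)
```

One missing cell in a seven-node tree already drops the score to about 86, which is why small tables are penalized heavily for a single structural error.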

What TEDS Captures

  • Row and column structure preservation
  • Cell content accuracy
  • Merged cell handling (colspan/rowspan)
  • Cell alignment and ordering

TEDS was introduced in the PubTabNet paper (2019) and is now the standard metric for table extraction evaluation. Higher is better.

  • 90+: Excellent, production ready
  • 80–90: Good, minor structure errors
  • <80: Needs post-processing

§ 03 · Benchmark Results

Seven models, ranked by TEDS overall.

| Model | Type | TEDS Simple | TEDS Complex | TEDS Overall | Structure | Speed | Cost / 1k | Notes |
|---|---|---|---|---|---|---|---|---|
| PaddleOCR-VL | OSS | 96.8 | 91.2 | 93.52 | 97% | 850ms | Free | Best TEDS score, open source |
| MinerU 2.5 | OSS | 95.1 | 89.8 | 91.90 | 95% | 1470ms | Free | Excellent structure, outputs LaTeX |
| GPT-5.4 | API | 94.2 | 87.5 | 90.10 | 92% | 2300ms | $7.50 | Good reasoning, hallucinates on complex tables |
| Claude Sonnet 4 | API | 93.8 | 86.9 | 89.50 | 91% | 2800ms | $6.00 | Low hallucination, good for financial tables |
| dots.ocr 3B | OSS | 92.4 | 86.8 | 88.90 | 90% | 920ms | Free | New SOTA contender, 100+ languages |
| Docling | OSS | 89.2 | 82.4 | 85.10 | 88% | 680ms | Free | Fast local processing, good for PDFs |
| Mistral OCR 3 | API | 91.5 | 70.9 | 79.75 | 85% | 1200ms | $4.00 | Fast API, struggles with merged cells |

TEDS Simple: Tables without merged cells. TEDS Complex: Tables with rowspan/colspan. Structure: Row/column alignment accuracy. Speed: Per-table processing time.

§ 04 · Structure Preservation

What models get wrong, and right.

Table structure preservation goes beyond text extraction. We tested how each model handles common challenges.

What Models Get Wrong

  • Merged cells: Most models split merged cells incorrectly or duplicate content
  • Multi-row headers: Complex headers get flattened or reordered
  • Empty cells: Often skipped, causing column misalignment
  • Nested tables: Inner tables extracted as flat text

Best at Each Challenge

  • Merged cells: MinerU 2.5 handles colspan/rowspan correctly
  • Multi-row headers: PaddleOCR-VL preserves hierarchy
  • Empty cells: Claude maintains column alignment
  • Nested tables: GPT-5.4 can reason about structure

| Challenge | PaddleOCR-VL | MinerU | GPT-5.4 | Claude | Docling |
|---|---|---|---|---|---|
| Simple grid tables | Excellent | Excellent | Excellent | Excellent | Good |
| Merged cells (colspan) | Excellent | Excellent | Good | Good | Partial |
| Multi-row headers | Excellent | Good | Good | Good | Partial |
| Borderless tables | Good | Excellent | Excellent | Excellent | Good |
| Tables with images | Good | Good | Excellent | Excellent | Partial |
| Rotated/skewed tables | Excellent | Good | Good | Good | Poor |

§ 05 · Code Examples

Four integrations.

Claude Sonnet 4 — Table Extraction

API
import anthropic
import base64
client = anthropic.Anthropic()

with open("table.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }
            },
            {
                "type": "text",
                "text": """Extract the table from this image.
Return as markdown table with exact cell values.
Preserve merged cells using colspan notation."""
            }
        ]
    }]
)
print(response.content[0].text)
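
Since the prompt asks for a markdown table, the reply usually needs to be turned back into rows before it can be used. A minimal sketch of that step, assuming a plain pipe-delimited table with a `---` separator row; `parse_markdown_table` is a hypothetical helper, and escaped pipes inside cells are not handled.

```python
import re

def parse_markdown_table(text):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip prose the model may add around the table
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue  # skip the header separator row
        rows.append(cells)
    return rows

reply = """Here is the table:

| Quarter | Revenue | Notes |
|---|---:|---|
| Q1 | 100 | |
| Q2 | 120 | restated |
"""

for row in parse_markdown_table(reply):
    print(row)
# ['Quarter', 'Revenue', 'Notes']
# ['Q1', '100', '']
# ['Q2', '120', 'restated']
```

Empty cells are kept as empty strings so that column positions stay aligned.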

PaddleOCR — Table Structure Recognition

Open Source
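# Assumes PaddleOCR 2.x, where the PP-Structure pipeline is exposed as
# PPStructure; later major versions rename this API.
import cv2
from paddleocr import PPStructure

# Initialize table recognition
table_engine = PPStructure(table=True, ocr=True)

# Load and process image
img = cv2.imread("table.png")
result = table_engine(img)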

# Extract table structure
for item in result:
    if item['type'] == 'table':
        html_table = item['res']['html']
        print(html_table)

        # Convert to markdown if needed
        # from markdownify import markdownify
        # print(markdownify(html_table))

Docling — PDF Table Extraction

Open Source
from docling.document_converter import DocumentConverter

# Initialize converter
converter = DocumentConverter()

# Process document
result = converter.convert("document.pdf")

# Tables are collected on the document, not on individual pages
for table in result.document.tables:
    # Get as markdown
    print(table.export_to_markdown(doc=result.document))

    # Or as pandas DataFrame
    # df = table.export_to_dataframe()
    # print(df)

MinerU — Scientific Table Extraction

Open Source
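# NOTE: illustrative interface only. The `MinerU` class and its parameters are
# not confirmed against a released MinerU package; the project's documented
# entry point is the `mineru` command-line tool. Check your installed version.
from mineru import MinerU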

# Initialize with table extraction focus
miner = MinerU(
    enable_table=True,
    table_format="markdown"  # or "html", "latex"
)

# Extract from PDF
result = miner.extract("research_paper.pdf")

# Process tables
for page in result:
    for block in page.blocks:
        if block.category == 'table':
            print(block.to_markdown())
            # LaTeX output for scientific docs
            # print(block.to_latex())

§ 06 · Use Cases

By document type.

Financial Reports

Best · Claude Sonnet 4

Lowest hallucination rate (0.09%). Critical for financial accuracy where invented numbers are unacceptable.

Alternative · PaddleOCR-VL for high volume, local processing

Scientific Papers

Best · MinerU 2.5

LaTeX equation support + 95% structure preservation. Handles complex multi-row headers.

Alternative · PaddleOCR-VL for simpler tables

Invoices & Receipts

Best · PaddleOCR-VL

Best TEDS score (93.52) + free + fast. Line item extraction is accurate.

Alternative · Docling for PDF invoices specifically

Data Entry Automation

Best · GPT-5.4

Can output structured JSON directly. Good for forms with varied layouts.

Alternative · Claude for lower error rates

Historical Documents

Best · PaddleOCR-VL

Handles degraded scans better. 96.8% on simple tables even with noise.

Alternative · dots.ocr for multilingual historical docs

High-Volume Processing

Best · Docling

Fastest local processing (680ms). No API costs. Apache 2.0 license.

Alternative · PaddleOCR-VL for higher accuracy

§ 07 · Cost Analysis

10,000 tables / month.

Open source costs assume cloud GPU at $0.50/hour.

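| Model | Cost / Table | 10k / mo | TEDS Score | $ / TEDS Point |
|---|---|---|---|---|
| PaddleOCR-VL | $0.00012 | $1.20 | 93.52 | $0.013 |
| Docling | $0.00009 | $0.90 | 85.1 | $0.011 |
| MinerU 2.5 | $0.00020 | $2.00 | 91.9 | $0.022 |
| Claude Sonnet 4 | $0.006 | $60.00 | 89.5 | $0.67 |
| GPT-5.4 | $0.0075 | $75.00 | 90.1 | $0.83 |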
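
The open source rows are consistent with multiplying each model's per-table latency from the benchmark by the $0.50/hour GPU rate. A short sketch of that arithmetic follows; the function names are hypothetical, and it assumes one table is processed at a time on a fully utilized GPU, which the article does not state.

```python
GPU_RATE_PER_HOUR = 0.50   # from the cost assumption above
TABLES_PER_MONTH = 10_000

def gpu_cost_per_table(seconds_per_table, rate_per_hour=GPU_RATE_PER_HOUR):
    """GPU rental cost attributed to one table extraction."""
    return seconds_per_table * rate_per_hour / 3600

def monthly_cost(cost_per_table, tables=TABLES_PER_MONTH):
    """Total cost for a month of extractions."""
    return cost_per_table * tables

def cost_per_teds_point(monthly, teds_overall):
    """Monthly cost divided by the overall TEDS score."""
    return monthly / teds_overall

# Per-table cost from benchmark latency (850ms, 680ms, 1470ms).
for name, seconds in [("PaddleOCR-VL", 0.85), ("Docling", 0.68), ("MinerU 2.5", 1.47)]:
    print(name, round(gpu_cost_per_table(seconds), 5))
# PaddleOCR-VL 0.00012
# Docling 9e-05
# MinerU 2.5 0.0002

# The monthly column uses the rounded per-table figure.
print(round(monthly_cost(0.00012), 2))              # 1.2
print(round(cost_per_teds_point(1.20, 93.52), 3))   # 0.013
print(round(cost_per_teds_point(60.00, 89.5), 2))   # 0.67
```

Batching several tables per GPU call or running below full utilization would shift the open source figures in either direction.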

Bottom line: At 10,000 tables per month, the open source models PaddleOCR-VL and MinerU cost roughly 30-60× less than the API models while matching or beating them on TEDS. API models are worth the premium at low volume or when you need reasoning capabilities.

§ 08 · Methodology

How we tested.

All benchmarks were run on standardized test sets, including PubTabNet, TableBank, and FinTabNet. TEDS scores were calculated using the official evaluation script from the PubTabNet paper.

  • Simple tables: No merged cells, uniform grid structure
  • Complex tables: Merged cells, multi-row headers, nested structures
  • Speed: Average of 100 table extractions on standardized hardware
  • Structure preservation: Manual evaluation of 500 random samples
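
The speed measurement can be approximated with a small timing harness. This is a sketch under stated assumptions, not the harness used for the numbers above: `extract` stands for any callable that processes one table, and the warm-up pass (so model loading is not counted) is an assumption of this example.

```python
import time
from statistics import mean

def average_latency_ms(extract, inputs, warmup=1):
    """Average per-item latency of extract() in milliseconds."""
    for item in inputs[:warmup]:
        extract(item)  # warm-up calls are not timed
    timings = []
    for item in inputs:
        start = time.perf_counter()
        extract(item)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)

if __name__ == "__main__":
    # Stand-in extractor; replace with a real model call.
    def fake_extract(path):
        return f"<table><tr><td>{path}</td></tr></table>"

    tables = [f"table_{i}.png" for i in range(100)]
    print(f"{average_latency_ms(fake_extract, tables):.3f} ms per table")
```

Reporting the median or a percentile next to the mean makes the result less sensitive to a few slow outliers.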