A perfectly transcribed table is still useless if you don't know which cell each number is in.

Table extraction is structure recognition, not transcription. The job is to recover the grid — cell boundaries, row and column membership, merged and spanning cells, header association — and serialise it to HTML, CSV or Markdown. OCR gives you the characters; this task gives you the table.

Same pixels, two outputs. Only the one that keeps the grid is a table.

Why this is another task entirely

OCR reads in a line. A table is two-dimensional. Read left to right, the cells 1,200, 3 and 400 become a meaningless sequence: which is the unit price, which is the quantity, which is the total? Recovering that 2-D membership, and surviving merged cells, spanning headers and borderless layouts, is what makes table extraction a distinct problem with its own models and its own metric.
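
A minimal sketch of the difference, using three illustrative cell values. The column names, their order and the `lookup` helper are assumptions made for the example, not part of any particular model's output format.

```python
# Reading-order OCR output: the characters are right, the positions are gone.
ocr_tokens = ["1,200", "3", "400"]

# Structure recognition attaches a (row, column) address to every cell.
# The column order here is an assumption made for illustration.
header = ["Total", "Qty", "Unit price"]
grid_cells = [
    {"row": 0, "col": c, "text": t} for c, t in enumerate(ocr_tokens)
]


def lookup(cells, header, row, column_name):
    """Return the text of the cell at (row, column_name), or None if absent."""
    col = header.index(column_name)
    for cell in cells:
        if cell["row"] == row and cell["col"] == col:
            return cell["text"]
    return None


flat = " ".join(ocr_tokens)  # one string; there is no way to ask it for "Qty"
qty = lookup(grid_cells, header, 0, "Qty")
```

With the flat string, "which number is the quantity?" has no answer. With addressed cells it is a lookup.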

§ 01 · Metrics

Why tables get their own metric.

TEDS

Tree-Edit-Distance Similarity. Represents the table as an HTML tree and measures edit distance between prediction and ground truth — scoring both structure and cell text. Higher is better.

TEDS-Struct

The same tree distance with cell contents ignored — pure structure. The gap between TEDS and TEDS-Struct tells you whether errors are layout or text.
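
For reference, TEDS is usually written as below, following the definition introduced with PubTabNet. Here $T_a$ and $T_b$ are the predicted and ground-truth HTML trees, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in a tree. Leaderboards typically report the value multiplied by 100.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

TEDS-Struct uses the same expression with the cell text removed from both trees before the distance is computed.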

Cell F1 / GriTS

Cell-level precision/recall on detected cells and their row/col indices. GriTS generalises this to a grid-similarity score robust to spanning cells.

None of CER / WER / edit distance — the reading metrics — capture whether the grid is right. That is the whole point.
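
A small sketch of why reading metrics miss grid errors. The `cell_f1` function is a deliberately simple exact-match score over (row, col, text) triples; it is a stand-in for illustration, not GriTS, and it ignores spanning cells. The sample data is made up.

```python
def cell_f1(pred, truth):
    """Exact-match F1 over (row, col, text) triples."""
    pred_set, truth_set = set(pred), set(truth)
    hits = len(pred_set & truth_set)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_set)
    recall = hits / len(truth_set)
    return 2 * precision * recall / (precision + recall)


def reading_order_text(cells):
    """Concatenate non-empty cell text in row-major order, as OCR would."""
    return " ".join(text for _, _, text in sorted(cells) if text)


# Ground truth: the Gadget row has an empty Qty cell.
truth = [
    (0, 0, "Widget"), (0, 1, "3"), (0, 2, "12.00"),
    (1, 0, "Gadget"), (1, 1, ""), (1, 2, "4.50"),
]
# Prediction: every character is right, but the empty cell was dropped,
# so "4.50" slid left into the Qty column.
pred = [
    (0, 0, "Widget"), (0, 1, "3"), (0, 2, "12.00"),
    (1, 0, "Gadget"), (1, 1, "4.50"),
]

same_text = reading_order_text(pred) == reading_order_text(truth)
score = cell_f1(pred, truth)
```

The two tables produce identical reading-order text, so any character- or word-level metric scores the prediction as perfect, while the cell-level score drops because one value sits in the wrong column and one cell is missing.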

§ 02 · Two Regimes

Cropped tables are near-solved. In-the-wild is not.

On clean, pre-cropped tables such as PubTabNet (568k tables) and FinTabNet (113k financial tables), specialist models score around 97 TEDS (96.5 to 97.3 on PubTabNet). But that assumes someone already found the table and cut it out. Run the same models on a full document page, where the table must first be located and may be borderless or rotated, and scores fall to the high 80s and low 90s. The benchmark you cite should match the regime you ship in.

Cropped · PubTabNet

Model	TEDS-S	TEDS
TFLOP	98.38	96.66
UniTable	97.89	96.50
TABLET	97.67	96.79
SEMv3	97.50	97.30
TableFormer	—	96.80

In-document · full page

Model	TEDS	Complex
PaddleOCR-VL	93.52	91.2
MinerU 2.5	91.9	89.8
GPT-5.4	90.1	87.5
Claude Sonnet 4	89.5	86.9
dots.ocr	88.9	86.8

Cropped table: PubTabNet val (TFLOP arXiv:2501.11800; UniTable arXiv:2403.04822; TableFormer); TEDS-S = TEDS-Struct. FinTabNet's SOTA (VAST) sits at ≈ 97.1 TEDS. In-document table: CodeSOTA document-context table benchmark. “Complex” = spanning/borderless tables.
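
As a quick illustration of reading the TEDS-Struct versus TEDS gap, the sketch below subtracts the two columns for the cropped PubTabNet rows that report both. The numbers are copied from the table; since they come from different papers, treat the comparison as indicative only.

```python
# (TEDS-Struct, TEDS) pairs copied from the cropped PubTabNet table.
scores = {
    "TFLOP": (98.38, 96.66),
    "UniTable": (97.89, 96.50),
    "TABLET": (97.67, 96.79),
    "SEMv3": (97.50, 97.30),
}

# Gap in points: how much is lost once cell text is scored too.
gaps = {model: round(struct - full, 2) for model, (struct, full) in scores.items()}

# Models ordered from largest gap to smallest.
by_gap = sorted(gaps, key=gaps.get, reverse=True)
```

A larger gap suggests more of the remaining error is in cell text than in structure; a small gap suggests the two are lost together.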

Full methodology and the complete vendor list are on the table-extraction benchmark page.

§ 03 · Hard Cases

What separates 97 TEDS from 88.

›Spanning / merged cells. A header spanning three columns must map to three logical positions, not one.
›Borderless tables. No ruling lines — structure must be inferred from whitespace and alignment alone.
›Nested & multi-row headers. Two-level column headers wreck flat row/col indexing.
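
The first and third cases can be sketched in a few lines. Assume each cell arrives as a hypothetical `(text, rowspan, colspan)` tuple; `expand_spans` copies a spanning cell into every logical position it covers, and `flatten_header` joins multi-row headers into one label per column. Both helper names and the " / " separator are choices made for this example.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    Each spanning cell's text is copied into every logical position it covers.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(row_idx for row_idx, _ in grid) + 1
    n_cols = max(col_idx for _, col_idx in grid) + 1
    return [[grid.get((i, j), "") for j in range(n_cols)] for i in range(n_rows)]


def flatten_header(grid, header_rows):
    """Join the first `header_rows` rows column-wise into one label per column."""
    labels = []
    for column in zip(*grid[:header_rows]):
        parts = []
        for part in column:
            if part and part not in parts:  # drop repeats created by rowspans
                parts.append(part)
        labels.append(" / ".join(parts))
    return labels, grid[header_rows:]


rows = [
    [("Item", 2, 1), ("2023", 1, 2)],           # "2023" spans two columns
    [("Qty", 1, 1), ("Price", 1, 1)],           # second header level
    [("Widget", 1, 1), ("3", 1, 1), ("12.00", 1, 1)],
]
labels, body = flatten_header(expand_spans(rows), header_rows=2)
```

The spanning "2023" header ends up attached to both of the columns beneath it, which is the mapping a flat row/col index would lose.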

−Multi-page tables. Rows continue across a page break; headers must be carried over.
−Rotated / wide tables. Landscape financial tables rotated 90° defeat row detectors.
−Empty & sparse cells. Models hallucinate values into blanks or drop the empty cell entirely, shifting the grid.
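
For the multi-page and empty-cell cases, a hedged sketch of the stitching step. It assumes each page fragment has already been recovered as a list of rows with the same column count, and that a continuation page may or may not repeat the header row verbatim; `stitch_pages` is a hypothetical helper, not a library function.

```python
def stitch_pages(pages):
    """Concatenate per-page grids that belong to one logical table."""
    header = pages[0][0]
    body = list(pages[0][1:])
    for page in pages[1:]:
        # Drop a repeated header; keep the first row if it is real data.
        rows = page[1:] if page and page[0] == header else page
        for row in rows:
            # A changed width usually means a dropped or invented cell.
            if len(row) != len(header):
                raise ValueError(f"column count changed: {row!r}")
            body.append(row)
    return header, body


page_1 = [["Item", "Qty", "Price"], ["Widget", "3", "12.00"]]
page_2 = [["Item", "Qty", "Price"], ["Gadget", "", "4.50"]]  # header repeated
page_3 = [["Total", "", "16.50"]]                             # header missing
header, body = stitch_pages([page_1, page_2, page_3])
```

Empty cells are kept as empty strings so that every row stays the same width; a row that arrives one cell short raises instead of silently shifting the grid.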

§ 04 · Implementation

Ask for HTML, then reconstruct the frame.

1 · Extract structure as HTML

import base64

import replicate

def extract_table(png_bytes: bytes) -> str:
    """Ask a document VLM for HTML, not prose. HTML carries the structure."""
    # Send the page image inline as a base64 data URI.
    b64 = base64.b64encode(png_bytes).decode()
    out = replicate.run(
        "rednote-hilab/dots.ocr",
        # Input keys depend on the model's schema on Replicate.
        input={
            "image": f"data:image/png;base64,{b64}",
            "prompt": "Extract every table as HTML. Preserve rowspan and "
                      "colspan exactly. Do not flatten merged cells.",
            "temperature": 0.0,
        },
    )
    # join() accepts either a single string or a sequence of string chunks.
    return "".join(out)

2 · HTML → DataFrame (spans resolved)

import pandas as pd
from io import StringIO

def html_table_to_dataframe(html: str) -> pd.DataFrame:
    """A table model returns HTML with the grid intact — spans and all.
    pandas reconstructs rows/cols; OCR text alone could not."""
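    # read_html resolves rowspan / colspan into a rectangular frame by
    # repeating a spanned cell's text in every position it covers.
    # It raises ValueError if the HTML holds no <table>; [0] keeps only
    # the first table, so loop over the list if the page has several.
    return pd.read_html(StringIO(html))[0]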

# vs. what plain OCR gives you for the same table:
#   "Item Qty Price Widget 3 12.00 Gadget 1 4.50 Total 16.50"
# ...no idea which number belongs to which row or column.
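
Once the grid is recovered, cell addresses make simple consistency checks possible. The sketch below uses plain lists standing in for the DataFrame and mirrors the Widget/Gadget example; the `check_total` helper, its parameter names and the empty Qty cell on the Total row are assumptions for illustration.

```python
from decimal import Decimal

# Rows as they might come out of a recovered grid.
header = ["Item", "Qty", "Price"]
rows = [
    ["Widget", "3", "12.00"],
    ["Gadget", "1", "4.50"],
    ["Total", "", "16.50"],
]


def check_total(header, rows, amount_col="Price", label_col="Item",
                total_label="Total"):
    """Return True if the line items in amount_col sum to the Total row."""
    amount_idx = header.index(amount_col)
    label_idx = header.index(label_col)
    items = [Decimal(r[amount_idx]) for r in rows if r[label_idx] != total_label]
    totals = [Decimal(r[amount_idx]) for r in rows if r[label_idx] == total_label]
    # Exactly one Total row, and it must equal the sum of the line items.
    return len(totals) == 1 and sum(items) == totals[0]


ok = check_total(header, rows)
```

The same check is impossible on the flat OCR string, because nothing there says which numbers are prices and which row is the total.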

§ 05 · References

Sources.

§ 06 · Adjacent Tasks

Where tables fit in the pipeline.

Task · Layout

page → regions

The step that finds and isolates the table before structure recovery starts.

Use case · Forms

invoice → fields

Where table extraction earns its keep: line items, quantities, totals.
