Codesota · OCR · Task: Tables
Task Brief · table image → cells

A perfectly transcribed table is still useless if you don't know which cell each number is in.

Table extraction is structure recognition, not transcription. The job is to recover the grid — cell boundaries, row and column membership, merged and spanning cells, header association — and serialise it to HTML, CSV or Markdown. OCR gives you the characters; this task gives you the table.

plain OCR → "Item Qty Price Widget 3 12 …"
structure → HTML / CSV / cells
Same pixels, two outputs. Only the one that keeps the grid is a table.
Why this is another task entirely

OCR reads in a line. A table is two-dimensional. Read left to right, the cells "1,200", "3" and "400" become a meaningless sequence. Which is the unit price, which is the quantity, which is the total? Recovering that 2-D membership, and surviving merged cells, spanning headers and borderless layouts, is what makes table extraction a distinct problem with its own models and its own metric.
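A minimal sketch of the difference, using illustrative values (the `Cell` type, the header names and the `lookup` helper are assumptions made for this example, not part of any benchmark or library). The flat list is what a line reader emits; the cell list is what table extraction has to recover.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: int
    col: int
    text: str


# What a line reader emits: tokens in reading order, with no membership.
flat_tokens = ["Qty", "Unit price", "Total", "3", "400", "1,200"]

# What table extraction recovers: every token tied to a (row, col) position.
cells = [
    Cell(0, 0, "Qty"), Cell(0, 1, "Unit price"), Cell(0, 2, "Total"),
    Cell(1, 0, "3"), Cell(1, 1, "400"), Cell(1, 2, "1,200"),
]


def lookup(cells: list[Cell], row: int, header: str) -> str:
    """Return the value in `row` under the column whose header is `header`."""
    col = next(c.col for c in cells if c.row == 0 and c.text == header)
    return next(c.text for c in cells if c.row == row and c.col == col)


print(lookup(cells, 1, "Total"))
```

The same question cannot be answered from `flat_tokens` without guessing how many columns the table has.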

§ 01 · Metrics

Why tables get their own metric.

TEDS

Tree-Edit-Distance Similarity. Represents the table as an HTML tree and measures edit distance between prediction and ground truth — scoring both structure and cell text. Higher is better.
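For reference, the definition commonly given for the metric (it comes from the paper that introduced PubTabNet, not from this page) normalises the tree edit distance by the size of the larger tree. Here T_a and T_b are the predicted and ground-truth HTML trees and |T| is the number of nodes in T. Leaderboards, including the tables in § 02, typically report the value multiplied by 100.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```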

TEDS-Struct

The same tree distance with cell contents ignored — pure structure. The gap between TEDS and TEDS-Struct tells you whether errors are layout or text.
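One rough way to read that gap, offered as a heuristic rather than a property of the metric: treat the shortfall of TEDS-Struct from 100 as layout loss, and the further drop from TEDS-Struct to TEDS as text loss. The function name and the example scores below are illustrative assumptions, not benchmark results.

```python
def split_teds_loss(teds: float, teds_struct: float) -> tuple[float, float]:
    """Split the shortfall from 100 into a layout part and a text part.

    layout = 100 - TEDS-Struct   (errors visible with cell text ignored)
    text   = TEDS-Struct - TEDS  (extra loss once cell text is scored)
    Both inputs are on the 0-100 scale.
    """
    layout = round(100.0 - teds_struct, 2)
    text = round(teds_struct - teds, 2)
    return layout, text


# Hypothetical model: structure is mostly right, cell text is the weak spot.
layout_loss, text_loss = split_teds_loss(teds=90.0, teds_struct=96.0)
print(layout_loss, text_loss)
```

A large text part with a small layout part points at the recogniser; the reverse points at the structure model.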

Cell F1 / GriTS

Cell-level precision/recall on detected cells and their row/col indices. GriTS generalises this to a grid-similarity score robust to spanning cells.
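A simplified sketch of the cell-level idea, assuming each cell is reduced to an exact (row, col, text) triple. This is plain cell F1, not GriTS: it gives no partial credit and does not model spanning cells. The function name and the sample table are made up for illustration.

```python
CellKey = tuple[int, int, str]  # (row index, column index, cell text)


def cell_f1(pred: set[CellKey], gold: set[CellKey]) -> tuple[float, float, float]:
    """Precision, recall and F1 over exact (row, col, text) matches."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = {(0, 0, "Item"), (0, 1, "Qty"), (1, 0, "Widget"), (1, 1, "3")}
# Every character is read correctly, but "3" lands in the wrong column.
pred = {(0, 0, "Item"), (0, 1, "Qty"), (1, 0, "Widget"), (1, 2, "3")}

print(cell_f1(pred, gold))
```

The character error rate of this prediction is zero, yet one cell in four is wrong, which is exactly the failure the reading metrics cannot see.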

None of CER / WER / edit distance — the reading metrics — capture whether the grid is right. That is the whole point.

§ 02 · Two Regimes

Cropped tables are near-solved. In-the-wild is not.

On clean, pre-cropped tables — PubTabNet (568k tables) and FinTabNet (113k financial tables) — specialist models exceed 97 TEDS. But that assumes someone already found the table and cut it out. Run the same models on a full document page, where the table must first be located and may be borderless or rotated, and scores fall to the high-80s / low-90s. The benchmark you cite should match the regime you ship in.

Cropped · PubTabNet

Model          TEDS-S   TEDS
TFLOP          98.38    96.66
UniTable       97.89    96.50
TABLET         97.67    96.79
SEMv3          97.50    97.30
TableFormer    96.80    -

In-document · full page

Model             TEDS     Complex
PaddleOCR-VL      93.52    91.2
MinerU 2.5        91.9     89.8
GPT-5.4           90.1     87.5
Claude Sonnet 4   89.5     86.9
dots.ocr          88.9     86.8

Cropped: PubTabNet val (TFLOP arXiv:2501.11800; UniTable arXiv:2403.04822; TableFormer). FinTabNet's SOTA (VAST) sits ≈ 97.1 TEDS. In-document: CodeSOTA document-context table benchmark. “Complex” = spanning/borderless tables.

Full methodology and the complete vendor list are on the table-extraction benchmark page.

§ 03 · Hard Cases

What separates 97 TEDS from 88.

  • Spanning / merged cells. A header spanning three columns must map to three logical positions, not one.
  • Borderless tables. No ruling lines — structure must be inferred from whitespace and alignment alone.
  • Nested & multi-row headers. Two-level column headers wreck flat row/col indexing.
  • Multi-page tables. Rows continue across a page break; headers must be carried over.
  • Rotated / wide tables. Landscape financial tables rotated 90° defeat row detectors.
  • Empty & sparse cells. Models hallucinate values into blanks or drop the empty cell entirely, shifting the grid.
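The first bullet can be made concrete with a small sketch that expands `rowspan` and `colspan` into logical grid positions using only the standard library. The class and function names are assumptions for this example, and the parser ignores nested tables and malformed markup; it shows the bookkeeping rather than a production parser.

```python
from html.parser import HTMLParser


class _RowCollector(HTMLParser):
    """Collect (text, rowspan, colspan) for each <td>/<th>, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []      # list of rows; each row is a list of cell tuples
        self._cell = None   # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_spans(html: str) -> list[list[str]]:
    """Return a rectangular grid where a spanning cell fills every position it covers."""
    parser = _RowCollector()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip positions claimed by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions no cell covers stay as empty strings instead of shifting the grid.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


html = (
    "<table>"
    "<tr><th rowspan='2'>Item</th><th colspan='2'>2024</th></tr>"
    "<tr><th>Qty</th><th>Price</th></tr>"
    "<tr><td>Widget</td><td>3</td><td>12.00</td></tr>"
    "</table>"
)
for row in expand_spans(html):
    print(row)
```

Here the "2024" header occupies two logical columns and "Item" occupies two logical rows, so every body cell still lines up under the right header.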
§ 04 · Implementation

Ask for HTML, then reconstruct the frame.

1 · Extract structure as HTML

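import base64

import replicate  # reads the API token from the REPLICATE_API_TOKEN env var

def extract_table(png_bytes: bytes) -> str:
    """Ask a document VLM for HTML, not prose. HTML carries the structure."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    out = replicate.run(
        "rednote-hilab/dots.ocr",
        # Input keys must match the schema the hosted model publishes.
        input={
            "image": f"data:image/png;base64,{b64}",
            "prompt": "Extract every table as HTML. Preserve rowspan and "
                      "colspan exactly. Do not flatten merged cells.",
            "temperature": 0.0,
        },
    )
    # The output may be a single string or an iterable of text chunks.
    return out if isinstance(out, str) else "".join(out)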

2 · HTML → DataFrame (spans resolved)

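from io import StringIO

import pandas as pd

def html_table_to_dataframe(html: str) -> pd.DataFrame:
    """A table model returns HTML with the grid intact, spans and all.
    pandas reconstructs rows/cols; OCR text alone could not."""
    # read_html needs an HTML parser installed (lxml, or bs4 + html5lib).
    # It resolves rowspan / colspan into a rectangular frame by repeating
    # the spanning cell's text in every position it covers.
    tables = pd.read_html(StringIO(html))  # one DataFrame per <table>
    return tables[0]  # first table only; raises ValueError if none is found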

# vs. what plain OCR gives you for the same table:
#   "Item Qty Price Widget 3 12.00 Gadget 1 4.50 Total 16.50"
# ...no idea which number belongs to which row or column.
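Multi-page tables from § 03 need one more step once each page fragment has been turned into rows. A minimal sketch, assuming the first row of the first fragment is the header and that a continuation page either repeats that header verbatim or omits it (`stitch_pages` and the sample rows are hypothetical):

```python
Grid = list[list[str]]  # rows of cell text, one grid per page fragment


def stitch_pages(fragments: list[Grid]) -> Grid:
    """Concatenate per-page grids, keeping the header row only once."""
    if not fragments or not fragments[0]:
        return []
    header = fragments[0][0]
    merged = [header]
    for frag in fragments:
        # Drop a leading row only when it is an exact copy of the header.
        body = frag[1:] if frag and frag[0] == header else frag
        merged.extend(body)
    return merged


page1 = [["Item", "Qty"], ["Widget", "3"]]
page2 = [["Item", "Qty"], ["Gadget", "1"]]  # header repeated on the new page
page3 = [["Total", "4"]]                    # header omitted
for row in stitch_pages([page1, page2, page3]):
    print(row)
```

Exact matching is deliberately strict: a header that the model transcribed slightly differently on the second page is kept as a data row rather than silently discarded, so the mismatch stays visible.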
§ 05 · References

Sources.

§ 06 · Adjacent Tasks

Where tables fit in the pipeline.

Task · Layout
page → regions
The step that finds and isolates the table before structure recovery starts.
Use case · Forms
invoice → fields
Where table extraction earns its keep: line items, quantities, totals.