Computer Vision

Table Recognition

Detecting and parsing tables in documents


Table recognition detects tables in documents and extracts their structure (rows, columns, cells, spanning headers) into machine-readable formats. It's one of the hardest document understanding subtasks because tables vary enormously in structure — from simple grids to complex multi-level headers with merged cells. Table Transformer (TATR) and TableFormer achieve 80%+ accuracy on complex tables, but heavily formatted financial and regulatory tables remain challenging.

History

2010

Rule-based table detection using line detection, whitespace analysis, and heuristic parsing — works for simple grid tables but fails on borderless and complex layouts

2017

DeepDeSRT (Schreiber et al.) first applies deep learning to table detection and structure recognition, using Faster R-CNN

2019

TableNet and CascadeTabNet apply segmentation approaches (an encoder-decoder network and Cascade Mask R-CNN, respectively) to joint table detection and cell/column segmentation

2020

The ICDAR cTDaR competition and datasets like TableBank establish standardized benchmarks spanning diverse document types

2021

Table Transformer (TATR by Microsoft) applies DETR architecture to table structure recognition, detecting rows, columns, and cells as objects

2022

PubTables-1M (Smock et al.) provides 1M annotated tables from scientific papers, enabling large-scale training for table structure recognition; TableFormer (IBM, CVPR 2022) predicts HTML table structure from images with a transformer decoder, handling complex spanning cells

2024

Docling and Marker integrate table recognition into end-to-end document parsing pipelines; the FinTabNet benchmark of financial-report tables becomes a standard stress test for complex structures

2025

VLMs (GPT-4o, Qwen2-VL) convert table images to structured formats (HTML, Markdown) with impressive accuracy, challenging specialized models

How Table Recognition Works

1. Table Detection

A detection model (DETR, YOLO, Faster R-CNN) finds table regions in the document image. This is relatively easy for bordered tables but challenging for borderless tables that blend with surrounding text. The detector outputs a bounding box per table.
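The detector's raw output needs light post-processing before the table is cropped. A minimal sketch, assuming a DETR-style detector that emits normalized (cx, cy, w, h) boxes with confidence scores; the function name and threshold are illustrative, not from any particular library:

```python
def postprocess_detections(boxes, scores, img_w, img_h, threshold=0.7):
    """Convert DETR-style normalized (cx, cy, w, h) boxes to pixel
    (x0, y0, x1, y1) corners, keeping only confident table detections."""
    tables = []
    for (cx, cy, w, h), score in zip(boxes, scores):
        if score < threshold:
            continue  # discard low-confidence candidates
        tables.append((
            (cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h,
        ))
    return tables

# One confident detection on a 1000x800 page, one rejected candidate.
boxes = [(0.5, 0.5, 0.5, 0.25), (0.1, 0.1, 0.05, 0.05)]
scores = [0.95, 0.3]
print(postprocess_detections(boxes, scores, 1000, 800))
# -> [(250.0, 300.0, 750.0, 500.0)]
```

Each surviving box is then cropped from the page image and handed to the structure recognizer.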

2. Structure Recognition

The cropped table image is processed by a structure recognition model that identifies rows, columns, and cells. TATR uses object detection (one box per row, column, cell). TableFormer predicts HTML tags sequentially. Some methods use segmentation masks for cells.
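As a sketch of the object-detection formulation, a TATR-style recognizer yields one labeled box per structure element, and the first post-processing step is grouping by class and ordering spatially. The detection list below is invented for illustration (TATR's actual label set also includes headers and spanning cells):

```python
# Hypothetical structure-recognition output: (label, (x0, y0, x1, y1))
# in table-crop coordinates.
detections = [
    ("table column", (300, 0, 600, 200)),
    ("table row",    (0, 100, 600, 200)),
    ("table column", (0, 0, 300, 200)),
    ("table row",    (0, 0, 600, 100)),
]

# Group by class and order spatially: rows top-to-bottom by y0,
# columns left-to-right by x0, so grid indices match reading order.
rows = sorted((box for label, box in detections if label == "table row"),
              key=lambda b: b[1])
cols = sorted((box for label, box in detections if label == "table column"),
              key=lambda b: b[0])
print(rows)  # [(0, 0, 600, 100), (0, 100, 600, 200)]
print(cols)  # [(0, 0, 300, 200), (300, 0, 600, 200)]
```

The ordered row and column boxes are the input to the cell-association step that follows.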

3. Cell Association

Rows and columns are matched to form a grid structure. Spanning cells (merged across rows/columns) are identified by detecting cells that overlap multiple row/column regions. This is the hardest step — complex headers with hierarchical spanning cells cause most failures.
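The matching described above can be sketched as rectangle intersection: base cells are row-by-column overlaps, and a detected spanning cell claims every grid slot it covers beyond an area threshold. Function names and the 0.5 threshold are illustrative choices, not from any particular implementation:

```python
def intersect(a, b):
    """Intersection rectangle of two (x0, y0, x1, y1) boxes, or None."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def build_grid(rows, cols):
    """Base cell (i, j) is the overlap of row i and column j."""
    return [[intersect(r, c) for c in cols] for r in rows]

def covered_slots(span_box, rows, cols, min_frac=0.5):
    """Grid slots a detected spanning cell occupies: every (i, j) whose
    base cell it overlaps by at least min_frac of that cell's area."""
    slots = []
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            cell = intersect(r, c)
            inter = intersect(span_box, cell) if cell else None
            if inter:
                area = (inter[2] - inter[0]) * (inter[3] - inter[1])
                base = (cell[2] - cell[0]) * (cell[3] - cell[1])
                if area / base >= min_frac:
                    slots.append((i, j))
    return slots

rows = [(0, 0, 600, 100), (0, 100, 600, 200)]
cols = [(0, 0, 300, 200), (300, 0, 600, 200)]
# A header box spanning both columns of the first row:
print(covered_slots((0, 0, 600, 100), rows, cols))  # [(0, 0), (0, 1)]
```

A spanning cell covering (0, 0) and (0, 1) becomes a single cell with colspan 2 in the output; irregular overlaps below the threshold are where real pipelines tend to fail.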

4. Cell Content Extraction

Each detected cell region is OCR'd to extract text content. Cell content is aligned with the grid structure to produce a fully structured table (HTML, CSV, or JSON with row/column indices).
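A minimal sketch of the final assembly step, serializing a grid of OCR'd texts to HTML with merged-cell attributes. The representation, with None marking slots absorbed by a spanning cell, is an assumption for illustration:

```python
def grid_to_html(texts, spans=()):
    """Render a grid of OCR'd cell texts as an HTML table.
    `spans` lists (row, col, rowspan, colspan) for merged cells, whose
    covered slots (other than the anchor) must hold None in `texts`."""
    html = ["<table>"]
    for i, row in enumerate(texts):
        html.append("<tr>")
        for j, text in enumerate(row):
            if text is None:
                continue  # slot absorbed by a spanning cell
            attrs = ""
            for (r, c, rs, cs) in spans:
                if (r, c) == (i, j):
                    attrs = f' rowspan="{rs}" colspan="{cs}"'
            html.append(f"<td{attrs}>{text}</td>")
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)

texts = [["Revenue", None], ["2023", "2024"]]
print(grid_to_html(texts, spans=[(0, 0, 1, 2)]))
# <table><tr><td rowspan="1" colspan="2">Revenue</td></tr><tr><td>2023</td><td>2024</td></tr></table>
```

CSV or JSON output follows the same pattern, except that spanning cells must be either duplicated or flattened, since those formats cannot express merges.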

5. Evaluation

TEDS (Tree Edit Distance Similarity) compares predicted and ground-truth HTML table structures — accounts for both structure and content. GriTS (Grid Table Similarity) evaluates row/column topology. PubTables-1M, FinTabNet, and SciTSR are standard benchmarks.
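To make the metric idea concrete, here is a deliberately crude proxy, not actual TEDS or GriTS: it scores exact cell-text matches over grids that are assumed to be already aligned, whereas the real metrics align structures via tree edit distance or a best 2D substructure match:

```python
def grid_similarity(pred, truth):
    """Crude proxy for grid-based table metrics: fraction of aligned
    slots whose text matches exactly. Real GriTS searches for the best
    2D substructure alignment; this assumes grids line up slot-for-slot."""
    n_rows = max(len(pred), len(truth))
    n_cols = max((len(r) for r in pred + truth), default=0)
    if n_rows == 0 or n_cols == 0:
        return 1.0
    match = 0
    for i in range(n_rows):
        for j in range(n_cols):
            p = pred[i][j] if i < len(pred) and j < len(pred[i]) else None
            t = truth[i][j] if i < len(truth) and j < len(truth[i]) else None
            match += p == t
    return match / (n_rows * n_cols)

truth = [["Year", "Sales"], ["2024", "120"]]
pred  = [["Year", "Sales"], ["2024", "12O"]]  # one OCR error
print(grid_similarity(pred, truth))  # 0.75
```

Even this toy version shows why evaluation is subtle: a single wrong cell (or one misplaced span that shifts a row) drags the score down across all the slots it displaces.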

Current Landscape

Table recognition in 2025 is evolving rapidly. Specialized models (TATR, TableFormer) achieve 80-90% TEDS on benchmark tables but struggle with the long tail of real-world table formats. VLMs have emerged as strong competitors — GPT-4o can convert most tables to HTML/Markdown with 85%+ accuracy zero-shot, which is competitive with fine-tuned specialists on diverse tables. The field is bifurcated: high-throughput production systems use specialized detectors + structure recognizers (TATR + OCR), while low-volume, high-diversity applications increasingly use VLMs. PubTables-1M has been the most impactful resource, but it's biased toward scientific tables — financial, legal, and government tables remain underserved.

Key Challenges

Spanning cells — headers that span 3 columns or data cells that merge 2 rows are extremely common in financial and scientific tables but hard to detect correctly

Borderless tables — tables without visible gridlines (common in modern PDFs) require the model to infer structure from alignment and spacing alone

Complex multi-level headers — financial statements often have 3-4 levels of nested column headers with irregular spanning patterns

Rotated and embedded tables — tables within multi-column documents, tables in presentations, and tables at odd angles require robust detection

Evaluation difficulty — TEDS is sensitive to minor structural differences; a table that is 95% correct but has one wrong span gets disproportionately penalized

Quick Recommendations

Best accuracy (research)

Table Transformer (TATR) or TableFormer

Best TEDS scores on PubTables-1M and FinTabNet; TATR is well-maintained by Microsoft Research

Integrated pipeline

Docling v2 or MinerU with table module

End-to-end document parsing with table extraction built in; handles the full document context, not just isolated tables

Financial/complex tables

TableFormer-Large or FinTabNet-trained TATR

Better handling of multi-level headers and spanning cells common in financial documents

Simple tables (fast)

Camelot or Tabula (digital PDFs)

For digital-born PDFs with accessible structure, rule-based tools extract tables instantly without ML

VLM-based (flexible)

GPT-4o or Claude Sonnet with structured output

Send table image, request HTML/Markdown output; handles unusual tables that specialized models miss

What's Next

The field is moving toward: (1) end-to-end table understanding — not just extracting structure but answering questions about table content directly, (2) cross-page table stitching for tables that span multiple pages, (3) chart-to-table conversion (extracting data from bar charts, line graphs), and (4) real-time table capture from phone cameras with AR overlay. VLMs will likely become the default approach for table recognition within 2 years, with specialized models persisting only for high-throughput batch processing.
