
Chart and Table Understanding

Parse charts, diagrams, and tables into structured data for analysis and QA.

How Chart and Table Understanding Works

A technical deep dive into extracting structured data from charts, graphs, and tables: from bar charts to complex financial tables, this guide unpacks the visual grammar of data visualization.

1. The Problem

Why is understanding charts hard for machines when humans find it trivial?

The Challenge

Charts and tables encode data visually. Humans read them effortlessly, extracting trends, comparisons, and specific values. But to a machine, a bar chart is just colored rectangles. How do we bridge this gap?

The Insight

The key is recognizing that charts have grammar: axes define scales, marks represent data points, legends map colors to categories. By understanding this visual grammar, we can reverse-engineer the underlying data.

The Key Idea

Chart understanding is not just OCR. It requires spatial reasoning (this bar is taller than that one), semantic understanding (the x-axis represents time), and numerical precision (that bar reaches exactly 72).

Three Skills a Chart Reader Needs

Visual Recognition

Detecting chart elements: bars, lines, points, axes, legends, titles. Each chart type has its own visual vocabulary.

Similar to: Object detection, OCR

Spatial Reasoning

Understanding that position encodes value. A bar reaching 75% height means 75% of the scale. Relative positions matter.

Similar to: Geometric reasoning, measurement

Semantic Understanding

Connecting visual elements to meaning. The blue line represents revenue, the x-axis is time, the legend explains the colors.

Similar to: VQA, document understanding

2. Chart Types and Their Challenges

Each chart type encodes data differently. Understanding these visual grammars is step one.

Bar Chart

Categorical comparisons using rectangular bars

Key Challenges
  • Stacked vs. grouped bars
  • Horizontal vs. vertical orientation
  • Reading exact values

Extraction Pipeline

1. Detect axis labels
2. Identify bar boundaries
3. Map colors to legend
4. Extract heights/widths
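Step 4 is pure geometry once the axis is calibrated. Below is a minimal sketch (the function name, axis dictionary, and pixel coordinates are all hypothetical) that maps a bar's top pixel to a data value by linear interpolation between two known axis ticks:

def bar_value(bar_top_px: float, axis: dict) -> float:
    """Map a bar's top pixel y-coordinate to a data value.

    `axis` holds two calibrated ticks (pixel y and data value for each),
    e.g. from OCR'd tick labels. Pixel y grows downward, so taller bars
    have smaller y-coordinates.
    """
    scale = (axis["y1_val"] - axis["y0_val"]) / (axis["y1_px"] - axis["y0_px"])
    return axis["y0_val"] + (bar_top_px - axis["y0_px"]) * scale

# Axis calibrated from tick labels: value 0 at y=400px, value 100 at y=100px
axis = {"y0_px": 400, "y0_val": 0, "y1_px": 100, "y1_val": 100}
print(bar_value(184, axis))  # -> 72.0 (the Q2 bar in the example below)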

Example: From Chart Image to Extracted Data

Given a bar chart titled "Quarterly Revenue" with bars for Q1 through Q4, extraction produces:
{
  "chart_type": "bar",
  "title": "Quarterly Revenue",
  "data": [
    {
      "label": "Q1",
      "value": 45
    },
    {
      "label": "Q2",
      "value": 72
    },
    {
      "label": "Q3",
      "value": 63
    },
    {
      "label": "Q4",
      "value": 89
    }
  ]
}
3. The Processing Pipeline

From raw pixels to structured data. Each stage transforms the representation closer to machine-readable format.

1. Input: chart image or document
2. Type Detection: what kind of chart is this?
3. Structure Analysis: find visual elements
4. Data Extraction: convert visuals to numbers
5. Output: structured data

Type Detection: The First Decision

Before extracting data, we must know what kind of chart we are looking at. A bar chart extracts differently than a line chart. This classification determines the entire downstream pipeline.

  • Visual classifiers (CNN-based) for chart type
  • Or a multimodal LLM: "What type of chart is this?"
  • Accuracy here is critical: wrong type = wrong extraction (see the sketch below)
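A quick way to prototype type detection is zero-shot image classification with CLIP. A minimal sketch (the label set and checkpoint choice are illustrative; a classifier fine-tuned on chart types will be more accurate):

from transformers import pipeline

# Zero-shot chart type detection: score the image against
# natural-language descriptions of each chart type
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)
chart_types = ["bar chart", "line chart", "pie chart", "scatter plot", "data table"]
results = classifier("chart.png", candidate_labels=chart_types)

# Results are sorted by score; the top label drives the downstream pipeline
print(results[0]["label"], round(results[0]["score"], 3))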

Structure Analysis: Finding the Grammar

Every chart has structural elements: axes define the coordinate system, legends map visual properties to meaning, titles provide context. Detecting these is like parsing the syntax of a visual language.

  • Object detection for visual elements
  • OCR for text (labels, values, titles)
  • Geometric reasoning for relationships
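For the OCR step, pytesseract returns each text token with its bounding box, which is exactly what the geometric-reasoning stage needs to tie labels to axes. A minimal sketch (assumes the Tesseract binary is installed; the confidence threshold is arbitrary):

import pytesseract
from pytesseract import Output
from PIL import Image

image = Image.open("chart.png")

# image_to_data returns parallel lists: text plus left/top/width/height boxes
data = pytesseract.image_to_data(image, output_type=Output.DICT)
for text, x, y, w, h, conf in zip(
    data["text"], data["left"], data["top"],
    data["width"], data["height"], data["conf"],
):
    if text.strip() and float(conf) > 60:  # drop empty and low-confidence tokens
        print(f"{text!r} at x={x}, y={y}, w={w}, h={h}")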
4. Architectural Approaches

Three fundamentally different ways to approach chart understanding. Each has its place.

Component Pipeline

Detect elements, then extract each

How it works
  1. Chart type classification
  2. Element detection (bars, lines, points)
  3. OCR for text
  4. Geometric reasoning for values

Pros
+ Interpretable
+ High precision possible
+ Domain knowledge encoded
Cons
- Complex pipeline
- Error propagation
- Brittle to variations

End-to-End Neural

Image in, structured data out

How it works
  1. Vision encoder (ViT, Swin)
  2. Cross-attention to query
  3. Text decoder generates JSON/markdown

Pros
+ Simple architecture
+ Learns from data
+ Handles variations
Cons
- May hallucinate
- Needs lots of data
- Black box

Multimodal LLM

Leverage general vision-language capabilities

How it works
  1. Image encoded as tokens
  2. Natural language query
  3. Structured output via prompting

Pros
+ Most flexible
+ Best reasoning
+ Zero-shot
Cons
- Expensive
- Non-deterministic
- May miss details
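To make the multimodal-LLM approach concrete, here is a minimal sketch using the OpenAI Python SDK (the model name and JSON shape are illustrative assumptions; the same pattern works with Claude via Anthropic's SDK):

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send the chart as a base64 data URL alongside a structured-output prompt
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Extract this chart as JSON: "
                '{"chart_type": "...", "title": "...", '
                '"data": [{"label": "...", "value": 0}]}. '
                "Return only the JSON."
            )},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)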

When to Use What

Component Pipeline

When you need high precision, interpretable results, and have clean, standardized charts. Best for production systems with known chart formats.

End-to-End Neural

When you have varied chart styles and can tolerate some errors. Good for quick prototypes and when training data is available.

Multimodal LLM

When you need to reason about charts, answer questions, or handle unexpected formats. Best for analysis tasks, not bulk extraction.

5. Key Models

The models you should know for chart and table understanding in 2024-2025.

ChartOCR
Component-based pipeline
Specialized
Strengths:
  • High accuracy on clean charts
  • Interpretable outputs
  • Chart-specific logic
Weaknesses:
  • Requires chart type classification
  • Struggles with unusual layouts
Best for:

Production chart extraction

Donut
OCR-free document understanding
End-to-End
Strengths:
  • No OCR preprocessing
  • Handles diverse layouts
  • One model for many tasks
Weaknesses:
  • May miss fine numerical details
  • Needs task-specific prompting
Best for:

Quick prototyping, varied documents

Pix2Struct
Screenshot parsing via masked patches
End-to-End
Strengths:
  • Trained on web screenshots
  • Good for infographics
  • Chart-specific fine-tuning available
Weaknesses:
  • Limited to training distribution
  • Can hallucinate values
Best for:

Infographics, web charts, UI screenshots

GPT-4V / Claude
General vision-language reasoning
Multimodal LLM
Strengths:
  • Best reasoning about charts
  • Handles questions naturally
  • Zero-shot capability
Weaknesses:
  • Expensive at scale
  • May hallucinate numbers
  • Not deterministic
Best for:

Chart QA, analysis, insights

TableTransformer
DETR-based table detection and structure
Specialized
Strengths:
  • State-of-the-art table detection
  • Cell-level extraction
  • Handles complex layouts
Weaknesses:
  • Tables only, not charts
  • Requires OCR for text
Best for:

Document tables, forms

Recommendations at a glance:

  • For Tables: TableTransformer (state-of-the-art structure recognition)
  • For Chart QA: GPT-4V / Claude (best reasoning, handles any format)
  • For Data Extraction: DePlot + LLM (chart to table, then reason)
6. Benchmarks

Standard datasets for evaluating chart and table understanding systems.

| Benchmark  | Focus                | Size          | Metric      | SOTA               |
|------------|----------------------|---------------|-------------|--------------------|
| ChartQA    | Chart QA             | 32K QA pairs  | Accuracy    | GPT-4V: 78.5%      |
| PlotQA     | Scientific charts    | 224K QA pairs | Accuracy    | DePlot: 54.8%      |
| ChartInfo  | Chart summarization  | 7K charts     | BLEU/METEOR | MatCha: 0.42       |
| PubTabNet  | Table structure      | 568K tables   | TEDS        | TableFormer: 96.8% |
| FinTabNet  | Financial tables     | 113K tables   | TEDS        | VAST: 97.1%        |
| SciGraphQA | Scientific figures   | 295K QA pairs | Accuracy    | LLaVA: 45.2%       |
TEDS (Tree Edit Distance Similarity)

Measures structural similarity between predicted and ground-truth tables, accounting for both content and structure (rows, columns, spans). A score of 1.0 means a perfect match; good systems commonly score above 0.9.
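The underlying formula (from the PubTabNet paper, where tables are represented as HTML trees) is:

TEDS(Ta, Tb) = 1 − EditDist(Ta, Tb) / max(|Ta|, |Tb|)

where EditDist is the tree edit distance between predicted tree Ta and ground-truth tree Tb, and |T| is the number of nodes in T.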

ChartQA Accuracy

Percentage of questions answered correctly about charts. Questions range from simple value lookup to complex reasoning. Human performance is around 85%; best models reach 78%.

7. Code Examples

Get started with chart and table understanding in Python.

Donut (End-to-End)

pip install transformers torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch
import json

# Load Donut fine-tuned for chart understanding
processor = DonutProcessor.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-docvqa"
)
model = VisionEncoderDecoderModel.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-docvqa"
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load chart image
image = Image.open("chart.png").convert("RGB")

# DocVQA-style task prompt; this checkpoint is tuned for question answering,
# so full-chart extraction works best with a chart-tuned model (see note below)
task_prompt = "<s_docvqa><s_question>Extract all data from this chart</s_question><s_answer>"

# Process
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# Generate
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
    early_stopping=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=4,
)

# Decode the generated sequence
result = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(result)

# For checkpoints that emit structured field tokens, decode without
# skip_special_tokens and convert with processor.token2json(sequence)

# For chart-specific Donut, try:
# "naver-clova-ix/donut-base-finetuned-cord-v2" for receipts
# Custom fine-tuning on ChartQA dataset for charts
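
DePlot (Chart-to-Table)

pip install transformers torch

For bulk extraction, the "DePlot + LLM" pattern recommended above first converts the chart into a linearized table. A minimal sketch using the google/deplot checkpoint (the prompt text follows the Hugging Face model card):

from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration
from PIL import Image

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("chart.png").convert("RGB")
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
predictions = model.generate(**inputs, max_new_tokens=512)

# Output is a linearized table (rows separated by <0x0A>, cells by " | ")
# that can be handed to an LLM for downstream reasoning
print(processor.decode(predictions[0], skip_special_tokens=True))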

Quick Reference

For Charts
  • Pix2Struct (charts/infographics)
  • DePlot (chart to table)
  • GPT-4V (analysis/QA)

For Tables
  • TableTransformer (detection)
  • Donut (end-to-end)
  • AWS Textract (production)

Key Benchmarks
  • ChartQA (chart understanding)
  • PubTabNet (table structure)
  • PlotQA (scientific charts)

Common Pitfalls
  • Hallucinated numbers from LLMs
  • Wrong chart type detection
  • Complex table structures

Key Takeaways

1. Chart understanding requires visual, spatial, and semantic reasoning combined
2. Tables and charts need different approaches; use specialized tools
3. Multimodal LLMs excel at reasoning but may hallucinate numbers
4. For production extraction, consider DePlot + LLM or TableTransformer

Use Cases

  • Financial chart QA
  • Research figure extraction
  • Table-to-CSV
  • Dashboard auditing

Architectural Patterns

Layout-Aware Parsing

Detect cells/regions then recognize text and structure (table grid detection + OCR).
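A minimal sketch of the detection half of this pattern with Table Transformer (using the microsoft/table-transformer-detection checkpoint; the 0.7 threshold is arbitrary):

from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image
import torch

image = Image.open("page.png").convert("RGB")
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw detections to labeled boxes in image coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
for score, label, box in zip(
    detections["scores"], detections["labels"], detections["boxes"]
):
    print(model.config.id2label[label.item()], round(score.item(), 3),
          [round(v) for v in box.tolist()])

# Each detected table region is then cropped and passed to OCR, or to
# microsoft/table-transformer-structure-recognition for cell-level structure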

Vision-Language Chart QA

Use chart-specific VLMs to answer questions or extract series.

Implementations

Open Source

Table Transformer (TATR): MIT, open source

Detects table structure; pair with OCR for content.

DocTR: Apache 2.0, open source

Document OCR with table detection utilities.

ChartQA Models: MIT, open source

Chart question answering and series extraction.


Quick Facts

  • Input: Image
  • Output: Structured Data
  • Implementations: 3 open source, 0 API
  • Patterns: 2 approaches

Submit Results