
Chart and Table Understanding

Parse charts, diagrams, and tables into structured data for analysis and QA.

How Chart and Table Understanding Works

A technical deep dive into extracting structured data from charts, graphs, and tables: from bar charts to complex financial tables, this guide unpacks the visual grammar of data visualization.

1. The Problem

Why is understanding charts hard for machines when humans find it trivial?

The Challenge

Charts and tables encode data visually. Humans read them effortlessly, extracting trends, comparisons, and specific values. But to a machine, a bar chart is just colored rectangles. How do we bridge this gap?

The Insight

The key is recognizing that charts have grammar: axes define scales, marks represent data points, legends map colors to categories. By understanding this visual grammar, we can reverse-engineer the underlying data.

The Key Idea

Chart understanding is not just OCR. It requires spatial reasoning (this bar is taller than that one), semantic understanding (the x-axis represents time), and numerical precision (that bar reaches exactly 72).

Three Skills a Chart Reader Needs

Visual Recognition

Detecting chart elements: bars, lines, points, axes, legends, titles. Each chart type has its own visual vocabulary.

Similar to: Object detection, OCR

Spatial Reasoning

Understanding that position encodes value. A bar reaching 75% height means 75% of the scale. Relative positions matter.

Similar to: Geometric reasoning, measurement

Semantic Understanding

Connecting visual elements to meaning. The blue line represents revenue, the x-axis is time, the legend explains the colors.

Similar to: VQA, document understanding

2. Chart Types and Their Challenges

Each chart type encodes data differently. Understanding these visual grammars is step one.

Bar Chart

Categorical comparisons using rectangular bars

Key Challenges
  • Stacked vs. grouped bars
  • Horizontal vs. vertical orientation
  • Reading exact values

Extraction Pipeline

1. Detect axis labels
2. Identify bar boundaries
3. Map colors to legend
4. Extract heights/widths
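Step 4 is pure geometry once the axis is calibrated. Below is a minimal sketch (the function name, axis dictionary, and pixel coordinates are all hypothetical) that maps a bar's top pixel to a data value by linear interpolation between two known axis ticks:

def bar_value(bar_top_px: float, axis: dict) -> float:
    """Map a bar's top pixel y-coordinate to a data value.

    `axis` holds two calibrated ticks (pixel y and data value for each),
    e.g. from OCR'd tick labels. Pixel y grows downward, so taller bars
    have smaller y-coordinates.
    """
    scale = (axis["y1_val"] - axis["y0_val"]) / (axis["y1_px"] - axis["y0_px"])
    return axis["y0_val"] + (bar_top_px - axis["y0_px"]) * scale

# Axis calibrated from tick labels: value 0 at y=400px, value 100 at y=100px
axis = {"y0_px": 400, "y0_val": 0, "y1_px": 100, "y1_val": 100}
print(bar_value(184, axis))  # -> 72.0 (the Q2 bar in the example below)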

Example: From Chart Image to Extracted Data

Given a bar chart titled "Quarterly Revenue" with bars for Q1 through Q4, extraction produces:
{
  "chart_type": "bar",
  "title": "Quarterly Revenue",
  "data": [
    {
      "label": "Q1",
      "value": 45
    },
    {
      "label": "Q2",
      "value": 72
    },
    {
      "label": "Q3",
      "value": 63
    },
    {
      "label": "Q4",
      "value": 89
    }
  ]
}
3. The Processing Pipeline

From raw pixels to structured data. Each stage transforms the representation closer to machine-readable format.

1. Input: chart image or document
2. Type Detection: what kind of chart is this?
3. Structure Analysis: find visual elements
4. Data Extraction: convert visuals to numbers
5. Output: structured data

Type Detection: The First Decision

Before extracting data, we must know what kind of chart we are looking at. A bar chart extracts differently than a line chart. This classification determines the entire downstream pipeline.

  • Visual classifiers (CNN-based) for chart type
  • Or a multimodal LLM: "What type of chart is this?"
  • Accuracy here is critical: wrong type = wrong extraction (see the sketch below)
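A quick way to prototype type detection is zero-shot image classification with CLIP. A minimal sketch (the label set and checkpoint choice are illustrative; a classifier fine-tuned on chart types will be more accurate):

from transformers import pipeline

# Zero-shot chart type detection: score the image against
# natural-language descriptions of each chart type
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)
chart_types = ["bar chart", "line chart", "pie chart", "scatter plot", "data table"]
results = classifier("chart.png", candidate_labels=chart_types)

# Results are sorted by score; the top label drives the downstream pipeline
print(results[0]["label"], round(results[0]["score"], 3))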

Structure Analysis: Finding the Grammar

Every chart has structural elements: axes define the coordinate system, legends map visual properties to meaning, titles provide context. Detecting these is like parsing the syntax of a visual language.

  • Object detection for visual elements
  • OCR for text (labels, values, titles)
  • Geometric reasoning for relationships
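For the OCR step, pytesseract returns each text token with its bounding box, which is exactly what the geometric-reasoning stage needs to tie labels to axes. A minimal sketch (assumes the Tesseract binary is installed; the confidence threshold is arbitrary):

import pytesseract
from pytesseract import Output
from PIL import Image

image = Image.open("chart.png")

# image_to_data returns parallel lists: text plus left/top/width/height boxes
data = pytesseract.image_to_data(image, output_type=Output.DICT)
for text, x, y, w, h, conf in zip(
    data["text"], data["left"], data["top"],
    data["width"], data["height"], data["conf"],
):
    if text.strip() and float(conf) > 60:  # drop empty and low-confidence tokens
        print(f"{text!r} at x={x}, y={y}, w={w}, h={h}")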
4. Architectural Approaches

Three fundamentally different ways to approach chart understanding. Each has its place.

Component Pipeline

Detect elements, then extract each

How it works
  1. Chart type classification
  2. Element detection (bars, lines, points)
  3. OCR for text
  4. Geometric reasoning for values

Pros
+ Interpretable
+ High precision possible
+ Domain knowledge encoded
Cons
- Complex pipeline
- Error propagation
- Brittle to variations

End-to-End Neural

Image in, structured data out

How it works
  1. Vision encoder (ViT, Swin)
  2. Cross-attention to query
  3. Text decoder generates JSON/markdown

Pros
+ Simple architecture
+ Learns from data
+ Handles variations
Cons
- May hallucinate
- Needs lots of data
- Black box

Multimodal LLM

Leverage general vision-language capabilities

How it works
  1. Image encoded as tokens
  2. Natural language query
  3. Structured output via prompting

Pros
+ Most flexible
+ Best reasoning
+ Zero-shot
Cons
- Expensive
- Non-deterministic
- May miss details
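To make the multimodal-LLM approach concrete, here is a minimal sketch using the OpenAI Python SDK (the model name and JSON shape are illustrative assumptions; the same pattern works with Claude via Anthropic's SDK):

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send the chart as a base64 data URL alongside a structured-output prompt
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Extract this chart as JSON: "
                '{"chart_type": "...", "title": "...", '
                '"data": [{"label": "...", "value": 0}]}. '
                "Return only the JSON."
            )},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)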

When to Use What

Component Pipeline

When you need high precision, interpretable results, and have clean, standardized charts. Best for production systems with known chart formats.

End-to-End Neural

When you have varied chart styles and can tolerate some errors. Good for quick prototypes and when training data is available.

Multimodal LLM

When you need to reason about charts, answer questions, or handle unexpected formats. Best for analysis tasks, not bulk extraction.

5. Key Models

The models you should know for chart and table understanding in 2024-2025.

ChartOCR
Component-based pipeline
Specialized
Strengths:
  • High accuracy on clean charts
  • Interpretable outputs
  • Chart-specific logic
Weaknesses:
  • Requires chart type classification
  • Struggles with unusual layouts
Best for:

Production chart extraction

Donut
OCR-free document understanding
End-to-End
Strengths:
  • No OCR preprocessing
  • Handles diverse layouts
  • One model for many tasks
Weaknesses:
  • May miss fine numerical details
  • Needs task-specific prompting
Best for:

Quick prototyping, varied documents

Pix2Struct
Screenshot parsing via masked patches
End-to-End
Strengths:
  • Trained on web screenshots
  • Good for infographics
  • Chart-specific fine-tuning available
Weaknesses:
  • Limited to training distribution
  • Can hallucinate values
Best for:

Infographics, web charts, UI screenshots

GPT-4V / Claude
General vision-language reasoning
Multimodal LLM
Strengths:
  • Best reasoning about charts
  • Handles questions naturally
  • Zero-shot capability
Weaknesses:
  • Expensive at scale
  • May hallucinate numbers
  • Not deterministic
Best for:

Chart QA, analysis, insights

TableTransformer
DETR-based table detection and structure
Specialized
Strengths:
  • State-of-the-art table detection
  • Cell-level extraction
  • Handles complex layouts
Weaknesses:
  • Tables only, not charts
  • Requires OCR for text
Best for:

Document tables, forms

Recommendations at a glance:

  • For Tables: TableTransformer (state-of-the-art structure recognition)
  • For Chart QA: GPT-4V / Claude (best reasoning, handles any format)
  • For Data Extraction: DePlot + LLM (chart to table, then reason)
6. Benchmarks

Standard datasets for evaluating chart and table understanding systems.

| Benchmark  | Focus                | Size          | Metric      | SOTA               |
|------------|----------------------|---------------|-------------|--------------------|
| ChartQA    | Chart QA             | 32K QA pairs  | Accuracy    | GPT-4V: 78.5%      |
| PlotQA     | Scientific charts    | 224K QA pairs | Accuracy    | DePlot: 54.8%      |
| ChartInfo  | Chart summarization  | 7K charts     | BLEU/METEOR | MatCha: 0.42       |
| PubTabNet  | Table structure      | 568K tables   | TEDS        | TableFormer: 96.8% |
| FinTabNet  | Financial tables     | 113K tables   | TEDS        | VAST: 97.1%        |
| SciGraphQA | Scientific figures   | 295K QA pairs | Accuracy    | LLaVA: 45.2%       |
TEDS (Tree Edit Distance Similarity)

Measures structural similarity between predicted and ground-truth tables, accounting for both content and structure (rows, columns, spans). A score of 1.0 means a perfect match; good systems commonly score above 0.9.
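The underlying formula (from the PubTabNet paper, where tables are represented as HTML trees) is:

TEDS(Ta, Tb) = 1 − EditDist(Ta, Tb) / max(|Ta|, |Tb|)

where EditDist is the tree edit distance between predicted tree Ta and ground-truth tree Tb, and |T| is the number of nodes in T.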

ChartQA Accuracy

Percentage of questions answered correctly about charts. Questions range from simple value lookup to complex reasoning. Human performance is around 85%; best models reach 78%.

7. Code Examples

Get started with chart and table understanding in Python.

Donut (End-to-End)

pip install transformers torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch
import json

# Load Donut fine-tuned for chart understanding
processor = DonutProcessor.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-docvqa"
)
model = VisionEncoderDecoderModel.from_pretrained(
    "naver-clova-ix/donut-base-finetuned-docvqa"
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load chart image
image = Image.open("chart.png").convert("RGB")

# DocVQA-style task prompt; this checkpoint is tuned for question answering,
# so full-chart extraction works best with a chart-tuned model (see note below)
task_prompt = "<s_docvqa><s_question>Extract all data from this chart</s_question><s_answer>"

# Process
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# Generate
outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=512,
    early_stopping=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=4,
)

# Decode the generated sequence
result = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(result)

# For checkpoints that emit structured field tokens, decode without
# skip_special_tokens and convert with processor.token2json(sequence)

# For chart-specific Donut, try:
# "naver-clova-ix/donut-base-finetuned-cord-v2" for receipts
# Custom fine-tuning on ChartQA dataset for charts
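
DePlot (Chart-to-Table)

pip install transformers torch

For bulk extraction, the "DePlot + LLM" pattern recommended above first converts the chart into a linearized table. A minimal sketch using the google/deplot checkpoint (the prompt text follows the Hugging Face model card):

from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration
from PIL import Image

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

image = Image.open("chart.png").convert("RGB")
inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
predictions = model.generate(**inputs, max_new_tokens=512)

# Output is a linearized table (rows separated by <0x0A>, cells by " | ")
# that can be handed to an LLM for downstream reasoning
print(processor.decode(predictions[0], skip_special_tokens=True))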

Quick Reference

For Charts
  • Pix2Struct (charts/infographics)
  • DePlot (chart to table)
  • GPT-4V (analysis/QA)

For Tables
  • TableTransformer (detection)
  • Donut (end-to-end)
  • AWS Textract (production)

Key Benchmarks
  • ChartQA (chart understanding)
  • PubTabNet (table structure)
  • PlotQA (scientific charts)

Common Pitfalls
  • Hallucinated numbers from LLMs
  • Wrong chart type detection
  • Complex table structures

Key Takeaways

1. Chart understanding requires visual, spatial, and semantic reasoning combined
2. Tables and charts need different approaches; use specialized tools
3. Multimodal LLMs excel at reasoning but may hallucinate numbers
4. For production extraction, consider DePlot + LLM or TableTransformer

Use Cases

  • Financial chart QA
  • Research figure extraction
  • Table-to-CSV
  • Dashboard auditing

Architectural Patterns

Layout-Aware Parsing

Detect cells/regions then recognize text and structure (table grid detection + OCR).
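A minimal sketch of the detection half of this pattern with Table Transformer (using the microsoft/table-transformer-detection checkpoint; the 0.7 threshold is arbitrary):

from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image
import torch

image = Image.open("page.png").convert("RGB")
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw detections to labeled boxes in image coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
for score, label, box in zip(
    detections["scores"], detections["labels"], detections["boxes"]
):
    print(model.config.id2label[label.item()], round(score.item(), 3),
          [round(v) for v in box.tolist()])

# Each detected table region is then cropped and passed to OCR, or to
# microsoft/table-transformer-structure-recognition for cell-level structure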

Vision-Language Chart QA

Use chart-specific VLMs to answer questions or extract series.

Implementations

Open Source

Table Transformer (TATR): MIT, open source

Detects table structure; pair with OCR for content.

DocTR: Apache 2.0, open source

Document OCR with table detection utilities.

ChartQA Models: MIT, open source

Chart question answering and series extraction.


Quick Facts

  • Input: Image
  • Output: Structured Data
  • Implementations: 3 open source, 0 API
  • Patterns: 2 approaches

Submit Results