Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
Table QA answers natural language questions over structured tables, bridging the gap between SQL databases and end users. TAPAS and TAPEX established the transformer approach, but LLMs now dominate by generating SQL or Python code that executes against the table. The remaining challenges are complex multi-step reasoning and faithfulness to the actual data.
History
WikiTableQuestions (Pasupat & Liang) introduces the task of answering questions over Wikipedia tables
Seq2SQL (Zhong et al.) frames table QA as text-to-SQL generation, introducing the WikiSQL dataset and launching the text-to-SQL era
Spider (Yu et al.) provides a complex cross-database text-to-SQL benchmark with 10,181 questions
TAPAS (Herzig et al., Google) introduces table-aware pretraining with cell selection and aggregation operations
TAPEX (Liu et al.) pretrains on synthetic SQL execution traces, outperforming TAPAS on WikiTableQuestions
GRAPPA and OmniTab show that table-specific pretraining objectives consistently improve downstream accuracy
DIN-SQL and DAIL-SQL use GPT-4 for text-to-SQL, reaching 85%+ execution accuracy on Spider with decomposed prompting
Claude 3.5 Sonnet and GPT-4o achieve 90%+ on Spider-dev via code generation; open models like DeepSeek-Coder close the gap
BIRD benchmark and Spider 2.0 test real-world database complexity; LLM agents with self-correction approach expert-level SQL writing
How Table Question Answering Works
Table serialization
The table is linearized into a token sequence — headers and cell values separated by special tokens or markdown formatting
Question encoding
The natural language question is concatenated with the serialized table and encoded by the transformer
Cell selection / SQL generation
TAPAS-style models select relevant cells directly; LLM-based models generate SQL or Python code to query the table
Aggregation
Operations like COUNT, SUM, and AVERAGE are either predicted by a classification head (TAPAS) or expressed in the generated code
Execution and verification
Generated code is executed against the table; self-consistency or execution feedback loops catch errors
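The steps above can be sketched end to end. This is a minimal illustration, not any specific system: the sales table and column names are made up, and the "generated" SQL is hard-coded where a real pipeline would call an LLM with the serialized table in the prompt.

```python
import sqlite3

# Toy table standing in for a spreadsheet; data is invented for illustration.
headers = ["quarter", "revenue"]
rows = [("Q1", 1200), ("Q2", 1500), ("Q3", 1800), ("Q4", 2100)]

# Steps 1-2: serialize the table as markdown so it can be placed in the prompt
# alongside the question.
def to_markdown(headers, rows):
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(v) for v in r) + " |" for r in rows]
    return "\n".join(lines)

prompt_table = to_markdown(headers, rows)

# Steps 3-4: a real system would have the model generate this SQL from the
# question "what was Q3 revenue?"; here it is hard-coded.
generated_sql = "SELECT revenue FROM sales WHERE quarter = 'Q3'"

# Step 5: execute the generated query against the table and read the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
(answer,) = conn.execute(generated_sql).fetchone()
print(answer)  # 1800
```

Executing the query rather than asking the model to read off the value is what grounds the answer in the actual data; a verification loop would additionally re-run or re-generate the query when execution fails.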
Current Landscape
Table QA in 2025 has bifurcated: simple lookup questions are solved by both fine-tuned models and LLMs, while complex analytical queries over real databases (multi-table joins, window functions, CTEs) remain the frontier. The Spider leaderboard is nearly saturated, pushing the community toward harder benchmarks like BIRD (with real dirty data and domain knowledge) and Spider 2.0 (enterprise-scale databases). Code-generation approaches have definitively won over direct cell-selection methods for complex queries.
Key Challenges
Large tables exceed context windows — a 10,000-row table can't fit in most model contexts, requiring schema-aware truncation
Multi-hop reasoning (joins across tables, subqueries) remains error-prone even for frontier LLMs
Ambiguous questions that map to multiple valid SQL interpretations cause inconsistent answers
Numerical reasoning (percentages, date arithmetic, comparisons) is a systematic weakness in purely neural approaches
Hallucination: models sometimes produce plausible but incorrect answers not grounded in the actual table data
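The context-window challenge is usually attacked with relevance-based truncation. A crude sketch of the idea, under the simplifying assumption that token overlap with the question is a good proxy for row relevance (production systems use embeddings or schema linking instead):

```python
def truncate_table(headers, rows, question, max_rows=20):
    """Keep rows whose cells share tokens with the question, then pad with
    remaining rows up to max_rows. A stand-in for schema-aware truncation."""
    q_tokens = {t.lower() for t in question.split()}
    def score(row):
        return sum(str(cell).lower() in q_tokens for cell in row)
    relevant = [r for r in rows if score(r) > 0]
    filler = [r for r in rows if score(r) == 0]
    return [headers] + (relevant + filler)[:max_rows]

rows = [("Q1", 1200), ("Q2", 1500), ("Q3", 1800), ("Q4", 2100)]
kept = truncate_table(["quarter", "revenue"], rows,
                      "what was Q3 revenue?", max_rows=2)
# The Q3 row is promoted to the front so it survives truncation.
```

The headers are always kept because the generated SQL or code must reference column names even when most rows are dropped.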
Quick Recommendations
Best accuracy on complex SQL
GPT-4o or Claude 3.5 Sonnet with text-to-SQL prompting
90%+ on Spider; handles joins, subqueries, and aggregations reliably
Direct table QA (no SQL)
TAPAS-large fine-tuned
Operates directly on table cells without code generation; good for simple lookups and aggregations
Open-source text-to-SQL
DeepSeek-Coder-33B or CodeLlama-34B
Strong SQL generation at a fraction of API cost; self-hostable for data privacy
Enterprise data assistant
LLM agent with schema retrieval + execution loop
Retrieves relevant tables, generates SQL, executes, and self-corrects — handles real-world database complexity
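The execution-and-self-correction loop behind such an agent can be sketched in a few lines. Everything here is an assumption of the sketch: `generate_sql` stands in for an LLM call, and the error-message feedback is the simplest possible correction signal.

```python
import sqlite3

def answer_with_retry(conn, question, generate_sql, max_attempts=3):
    """Execution-feedback loop: run the model's SQL and, on failure, feed the
    database error back so the next generation attempt can self-correct."""
    feedback = None
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = str(exc)  # e.g. "no such column: revenu"
    raise RuntimeError(f"gave up after {max_attempts} attempts: {feedback}")

# Fake "LLM" that makes a typo on the first try and fixes it when it sees
# the execution error -- purely illustrative.
def fake_llm(question, feedback):
    if feedback is None:
        return "SELECT revenu FROM sales"   # deliberate typo
    return "SELECT revenue FROM sales"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Q1", 1200), ("Q2", 1500)])
result = answer_with_retry(conn, "total revenue?", fake_llm)
```

Real agents extend this loop with schema retrieval before generation and with semantic checks (not just syntax errors) after execution.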
What's Next
The next wave is conversational table QA — multi-turn dialogues where the model maintains state across questions, handles clarifications, and produces visualizations alongside answers. Expect tighter integration with BI tools (Tableau, Metabase) and the rise of text-to-SQL agents that can autonomously explore database schemas, identify relevant tables, and compose multi-step analytical pipelines.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.