Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
Table QA answers natural language questions over structured tables, bridging the gap between SQL databases and end users. TAPAS and TAPEX established the transformer approach, but LLMs now dominate by generating SQL or Python code that executes against the table. The remaining challenges are complex multi-step reasoning and faithfulness to the actual data.
History
WikiTableQuestions (Pasupat & Liang) introduces the task of answering questions over Wikipedia tables
Seq2SQL (Zhong et al.) frames table QA as text-to-SQL generation, introducing the WikiSQL dataset and launching the text-to-SQL era
Spider (Yu et al.) provides a complex cross-database text-to-SQL benchmark with 10,181 questions
TAPAS (Herzig et al., Google) introduces table-aware pretraining with cell selection and aggregation operations
TAPEX (Liu et al.) pretrains on synthetic SQL execution traces, outperforming TAPAS on WikiTableQuestions
GRAPPA and OmniTab show that table-specific pretraining objectives consistently improve downstream accuracy
DIN-SQL and DAIL-SQL use GPT-4 for text-to-SQL, reaching 85%+ execution accuracy on Spider with decomposed prompting
Claude 3.5 Sonnet and GPT-4o achieve 90%+ on Spider-dev via code generation; open models like DeepSeek-Coder close the gap
BIRD benchmark and Spider 2.0 test real-world database complexity; LLM agents with self-correction approach expert-level SQL writing
How Table Question Answering Works
Table serialization
The table is linearized into a token sequence — headers and cell values separated by special tokens or markdown formatting
Question encoding
The natural language question is concatenated with the serialized table and encoded by the transformer
Cell selection / SQL generation
TAPAS-style models select relevant cells directly; LLM-based models generate SQL or Python code to query the table
Aggregation
Operations like COUNT, SUM, and AVERAGE are either predicted by a classification head (TAPAS) or expressed in the generated code
Execution and verification
Generated code is executed against the table; self-consistency or execution feedback loops catch errors
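The steps above can be sketched end to end. This is a minimal illustration, not any specific system: the sales table and column names are made up, and the "generated" SQL is hard-coded where a real pipeline would call an LLM with the serialized table in the prompt.

```python
import sqlite3

# Toy table standing in for a spreadsheet; data is invented for illustration.
headers = ["quarter", "revenue"]
rows = [("Q1", 1200), ("Q2", 1500), ("Q3", 1800), ("Q4", 2100)]

# Steps 1-2: serialize the table as markdown so it can be placed in the prompt
# alongside the question.
def to_markdown(headers, rows):
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(v) for v in r) + " |" for r in rows]
    return "\n".join(lines)

prompt_table = to_markdown(headers, rows)

# Steps 3-4: a real system would have the model generate this SQL from the
# question "what was Q3 revenue?"; here it is hard-coded.
generated_sql = "SELECT revenue FROM sales WHERE quarter = 'Q3'"

# Step 5: execute the generated query against the table and read the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
(answer,) = conn.execute(generated_sql).fetchone()
print(answer)  # 1800
```

Executing the query rather than asking the model to read off the value is what grounds the answer in the actual data; a verification loop would additionally re-run or re-generate the query when execution fails.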
Current Landscape
Table QA in 2025 has bifurcated: simple lookup questions are solved by both fine-tuned models and LLMs, while complex analytical queries over real databases (multi-table joins, window functions, CTEs) remain the frontier. The Spider leaderboard is nearly saturated, pushing the community toward harder benchmarks like BIRD (with real dirty data and domain knowledge) and Spider 2.0 (enterprise-scale databases). Code-generation approaches have definitively won over direct cell-selection methods for complex queries.
Key Challenges
Large tables exceed context windows — a 10,000-row table can't fit in most model contexts, requiring schema-aware truncation
Multi-hop reasoning (joins across tables, subqueries) remains error-prone even for frontier LLMs
Ambiguous questions that map to multiple valid SQL interpretations cause inconsistent answers
Numerical reasoning (percentages, date arithmetic, comparisons) is a systematic weakness in purely neural approaches
Hallucination: models sometimes produce plausible but incorrect answers not grounded in the actual table data
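The context-window challenge is usually attacked with relevance-based truncation. A crude sketch of the idea, under the simplifying assumption that token overlap with the question is a good proxy for row relevance (production systems use embeddings or schema linking instead):

```python
def truncate_table(headers, rows, question, max_rows=20):
    """Keep rows whose cells share tokens with the question, then pad with
    remaining rows up to max_rows. A stand-in for schema-aware truncation."""
    q_tokens = {t.lower() for t in question.split()}
    def score(row):
        return sum(str(cell).lower() in q_tokens for cell in row)
    relevant = [r for r in rows if score(r) > 0]
    filler = [r for r in rows if score(r) == 0]
    return [headers] + (relevant + filler)[:max_rows]

rows = [("Q1", 1200), ("Q2", 1500), ("Q3", 1800), ("Q4", 2100)]
kept = truncate_table(["quarter", "revenue"], rows,
                      "what was Q3 revenue?", max_rows=2)
# The Q3 row is promoted to the front so it survives truncation.
```

The headers are always kept because the generated SQL or code must reference column names even when most rows are dropped.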
Quick Recommendations
Best accuracy on complex SQL
GPT-4o or Claude 3.5 Sonnet with text-to-SQL prompting
90%+ on Spider; handles joins, subqueries, and aggregations reliably
Direct table QA (no SQL)
TAPAS-large fine-tuned
Operates directly on table cells without code generation; good for simple lookups and aggregations
Open-source text-to-SQL
DeepSeek-Coder-33B or CodeLlama-34B
Strong SQL generation at a fraction of API cost; self-hostable for data privacy
Enterprise data assistant
LLM agent with schema retrieval + execution loop
Retrieves relevant tables, generates SQL, executes, and self-corrects — handles real-world database complexity
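The execution-and-self-correction loop behind such an agent can be sketched in a few lines. Everything here is an assumption of the sketch: `generate_sql` stands in for an LLM call, and the error-message feedback is the simplest possible correction signal.

```python
import sqlite3

def answer_with_retry(conn, question, generate_sql, max_attempts=3):
    """Execution-feedback loop: run the model's SQL and, on failure, feed the
    database error back so the next generation attempt can self-correct."""
    feedback = None
    for _ in range(max_attempts):
        sql = generate_sql(question, feedback)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = str(exc)  # e.g. "no such column: revenu"
    raise RuntimeError(f"gave up after {max_attempts} attempts: {feedback}")

# Fake "LLM" that makes a typo on the first try and fixes it when it sees
# the execution error -- purely illustrative.
def fake_llm(question, feedback):
    if feedback is None:
        return "SELECT revenu FROM sales"   # deliberate typo
    return "SELECT revenue FROM sales"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Q1", 1200), ("Q2", 1500)])
result = answer_with_retry(conn, "total revenue?", fake_llm)
```

Real agents extend this loop with schema retrieval before generation and with semantic checks (not just syntax errors) after execution.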
What's Next
The next wave is conversational table QA — multi-turn dialogues where the model maintains state across questions, handles clarifications, and produces visualizations alongside answers. Expect tighter integration with BI tools (Tableau, Metabase) and the rise of text-to-SQL agents that can autonomously explore database schemas, identify relevant tables, and compose multi-step analytical pipelines.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.