
Question Answering

Question answering is one of NLP's longest-running benchmark tasks, from the original SQuAD (2016) to the harder, more realistic Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT-based systems around 2020, effectively saturating the benchmark, but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), moving the bottleneck from answer extraction to retrieval quality. Production systems like Perplexity and Google's AI Overviews show that QA is now an end-to-end pipeline problem, not a single-model benchmark.


Question answering spans extractive (finding answer spans in a passage), abstractive (generating free-form answers), and open-domain (retrieving then answering from large corpora). RAG pipelines with LLMs have become the dominant production architecture, but hallucination and faithfulness remain the core unsolved problems.

History

2016

SQuAD (Rajpurkar et al.) establishes extractive QA as a benchmark with 100K+ question-passage pairs

2018

SQuAD 2.0 adds unanswerable questions, testing whether models know what they don't know

2018

BERT achieves human-level F1 (93.2) on SQuAD 1.1, sparking widespread adoption of QA fine-tuning

2019

Natural Questions (Google) and TyDi QA shift focus to real user queries and multilingual QA

2020

DPR (Karpukhin et al.) introduces dense passage retrieval, outperforming BM25 for open-domain QA

2020

RAG (Lewis et al.) combines retrieval with generation — the architecture that would dominate production QA

2022

Atlas and RETRO show that retrieval-augmented models can match 10x larger models without retrieval

2023

GPT-4 with RAG becomes the standard enterprise QA architecture; LangChain and LlamaIndex enable rapid prototyping

2024

Long-context models (Gemini 1.5 at 1M tokens, Claude 3 at 200K) challenge RAG by fitting entire document collections in context

How Question Answering Works

1

Retrieval

A query encoder maps the question to a dense vector; top-k relevant passages are retrieved via approximate nearest neighbor search (FAISS, HNSW)

2

Context assembly

Retrieved passages are concatenated with the question as context for the reader model

3

Reading / generation

An extractive reader highlights answer spans, or a generative reader (LLM) produces a free-form answer grounded in the passages

4

Answer extraction

For extractive QA, start and end logits identify the answer span; for generative QA, the model outputs text with optional citations

5

Verification

Advanced pipelines add a verification step that checks whether the generated answer is actually supported by the retrieved passages
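The five steps above can be sketched end-to-end. This is a toy illustration, not a production system: the bag-of-words `embed` stands in for a learned dense encoder such as DPR, exhaustive cosine search stands in for FAISS/HNSW, and `best_span` shows the start/end-logit decoding rule an extractive reader uses. All names and data here are illustrative.

```python
import re
import numpy as np
from collections import Counter

# Toy passage index (illustrative data).
passages = [
    "SQuAD was released in 2016 by Rajpurkar et al.",
    "Dense passage retrieval encodes questions and passages as vectors.",
    "BM25 is a sparse lexical retrieval baseline.",
]
vocab = sorted({w for p in passages for w in re.findall(r"\w+", p.lower())})

def embed(text):
    # Bag-of-words vector: a crude stand-in for a learned query/passage encoder.
    counts = Counter(re.findall(r"\w+", text.lower()))
    v = np.array([counts[w] for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(question, k=2):
    # Step 1: cosine similarity against every passage (exhaustive search;
    # FAISS/HNSW approximate this at scale).
    q = embed(question)
    scores = np.array([q @ embed(p) for p in passages])
    return [passages[i] for i in np.argsort(-scores)[:k]]

def best_span(start_logits, end_logits, max_len=10):
    # Step 4 (extractive): choose (start, end) maximizing
    # start_logits[s] + end_logits[e] subject to s <= e < s + max_len.
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            if start_logits[s] + end_logits[e] > best_score:
                best, best_score = (s, e), start_logits[s] + end_logits[e]
    return best

# Steps 2-3: assemble retrieved passages as context for a reader model.
context = " ".join(retrieve("When was SQuAD released?"))
```

In a real pipeline the reader (step 3) would consume `context`; here only the retrieval and span-decoding mechanics are shown.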

Current Landscape

QA in 2025 is dominated by two paradigms: RAG (retrieval-augmented generation) for production systems that need grounded, verifiable answers, and long-context LLMs that skip retrieval entirely by fitting entire document collections in the prompt. The SQuAD benchmark is long saturated; real-world QA evaluation has moved to KILT, Natural Questions, and domain-specific benchmarks (BioASQ, FinQA). The key metric is shifting from F1/EM to faithfulness — can you trust the answer?

Key Challenges

Hallucination: generative models produce fluent but unfaithful answers not supported by the source documents

Retrieval quality is the bottleneck — if the right passage isn't retrieved, no reader can produce the correct answer

Multi-hop questions requiring synthesis across 2+ documents remain significantly harder than single-hop

Temporal reasoning: answering questions about 'current' events requires up-to-date retrieval corpora

Evaluation is difficult for open-ended questions where multiple valid answers exist
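The hallucination challenge above motivates answer verification. A crude sketch of one such check, using lexical overlap as a stand-in for the NLI models or LLM judges that production systems actually use; the function name and data are illustrative:

```python
import re

def support_score(answer, passages):
    # Fraction of answer content words that appear somewhere in the
    # retrieved passages -- a lexical proxy for faithfulness.
    tokens = set(re.findall(r"\w+", answer.lower()))
    evidence = set(re.findall(r"\w+", " ".join(passages).lower()))
    return len(tokens & evidence) / len(tokens) if tokens else 0.0

passages = ["SQuAD was released in 2016 by Rajpurkar et al."]
faithful = support_score("SQuAD was released in 2016", passages)      # fully supported
suspect = support_score("SQuAD was released in 2019", passages)       # "2019" unsupported
```

Answers scoring below a threshold would be flagged or regenerated; lexical overlap misses paraphrases and negation, which is why stronger verifiers use entailment models.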

Quick Recommendations

Enterprise RAG QA

Claude 3.5 Sonnet or GPT-4o + vector store (Pinecone, Weaviate)

Best faithfulness to retrieved context with strong citation generation

Extractive QA (span selection)

DeBERTa-v3-large fine-tuned on SQuAD 2.0

93+ F1 on SQuAD; fast inference, no hallucination by design

Open-source RAG

Llama 3.1 70B + ColBERT v2 retriever

Strong generation quality with late-interaction retrieval for better passage matching

Multilingual QA

mDeBERTa-v3 or XLM-RoBERTa + mContriever

Covers 100+ languages for both retrieval and reading components

What's Next

The frontier is agentic QA: systems that don't just retrieve and read, but actively search, verify, and synthesize across multiple sources with tool use. Expect multi-step reasoning pipelines (retrieve → reason → re-retrieve → synthesize) to replace single-pass RAG. Attribution and citation quality will become first-class evaluation metrics, and hybrid architectures that use long-context for high-value queries and RAG for scale will emerge as the practical middle ground.
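The retrieve → reason → re-retrieve loop can be sketched with a toy keyword retriever; `search`, the corpus, and the keyword-style query are all illustrative stand-ins for a real retriever and query reformulator.

```python
def search(query, corpus):
    # Toy retriever: top-1 passage containing any query term.
    terms = query.lower().split()
    hits = [p for p in corpus if any(t in p.lower() for t in terms)]
    return hits[:1]

def answer_with_retries(question, corpus, max_hops=3):
    # Retrieve -> reason -> re-retrieve: keep gathering evidence until
    # every question term is covered, reformulating the query around
    # whatever is still unresolved.
    query, evidence = question, []
    for _ in range(max_hops):
        evidence += [p for p in search(query, corpus) if p not in evidence]
        missing = [t for t in question.lower().split()
                   if t not in " ".join(evidence).lower()]
        if not missing:
            break
        query = " ".join(missing)  # re-retrieve on the unresolved part
    return evidence

corpus = [
    "Marie Curie won the Nobel Prize in Physics.",
    "The Nobel Prize is awarded in Stockholm.",
]
# Keyword-style multi-hop query (toy setup): a second hop fires because
# "stockholm" is not covered by the first retrieved passage.
evidence = answer_with_retries("curie nobel stockholm", corpus)
```

Agentic systems replace the coverage heuristic with an LLM that decides what is still missing and issues tool calls, but the control flow is the same loop.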

Benchmarks & SOTA

Related Tasks

Natural Language Inference

Determining entailment relationships between sentences (SNLI, MNLI).

Reading Comprehension

Understanding and answering questions about passages.

Text Ranking

Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
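Late interaction, as in ColBERT, scores a document by summing each query token's maximum similarity over all document token embeddings (MaxSim), rather than collapsing everything into one vector. A minimal numpy sketch with one-hot toy embeddings standing in for contextual token vectors:

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    # ColBERT-style late interaction: for each query token embedding,
    # take its max similarity over all document token embeddings, then
    # sum over query tokens (vectors assumed L2-normalized).
    sim = query_vecs @ doc_vecs.T   # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()

# One-hot "token embeddings" as toy stand-ins for contextual vectors.
query = np.eye(4)[[0, 1]]           # two query tokens
doc_match = np.eye(4)[[0, 1, 2]]    # covers both query tokens
doc_miss = np.eye(4)[[3]]           # covers neither
```

Because document token vectors can be indexed independently, the max-over-document step is what lets ColBERT keep token-level matching while remaining precomputable at scale.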

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
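The tool-augmented approach mentioned above amounts to having the model emit and run small table programs rather than answer neurally. A sketch of the kind of generated code, using pandas and an invented toy table (the column names and numbers are illustrative):

```python
import pandas as pd

# Toy financial table (illustrative data).
df = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
    "revenue": [120, 135, 160, 170],
})

# What a tool-augmented LLM might generate and execute for
# "what was Q3 revenue?" -- exact lookup instead of neural guessing.
q3_revenue = df.loc[df["quarter"] == "Q3", "revenue"].item()

# Multi-step numerical reasoning, e.g. "which quarter grew the most?"
growth = df["revenue"].diff()
fastest = df.loc[growth.idxmax(), "quarter"]
```

Executing code makes the arithmetic exact regardless of table size, which is why this approach scales to the hundreds-of-rows regime where in-context neural reading breaks down.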
