Question Answering
Question answering is one of the oldest NLP benchmark tasks, from the original extractive SQuAD (2016) to the naturally occurring user queries of Natural Questions and TriviaQA. ALBERT-based systems claimed human parity on SQuAD 2.0 around 2020, effectively saturating the benchmark, but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck has moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.
Question answering spans extractive (finding answer spans in a passage), abstractive (generating free-form answers), and open-domain (retrieving then answering from large corpora). RAG pipelines with LLMs have become the dominant production architecture, but hallucination and faithfulness remain the core unsolved problems.
History
SQuAD (Rajpurkar et al.) establishes extractive QA as a benchmark with 100K+ question-passage pairs
SQuAD 2.0 adds unanswerable questions, testing whether models know what they don't know
BERT achieves human-level F1 (93.2) on SQuAD 1.1, sparking widespread adoption of QA fine-tuning
Natural Questions (Google) and TyDi QA shift focus to real user queries and multilingual QA
DPR (Karpukhin et al.) introduces dense passage retrieval, outperforming BM25 for open-domain QA
RAG (Lewis et al.) combines retrieval with generation — the architecture that would dominate production QA
Atlas and RETRO show that retrieval-augmented models can match 10x larger models without retrieval
GPT-4 with RAG becomes the standard enterprise QA architecture; LangChain and LlamaIndex enable rapid prototyping
Long-context models (Gemini 1.5 1M tokens, Claude 100K+) challenge RAG by fitting entire document collections in context
How Question Answering Works
Retrieval
A query encoder maps the question to a dense vector; top-k relevant passages are retrieved via approximate nearest neighbor search (FAISS, HNSW)
Context assembly
Retrieved passages are concatenated with the question as context for the reader model
Reading / generation
An extractive reader highlights answer spans, or a generative reader (LLM) produces a free-form answer grounded in the passages
Answer extraction
For extractive QA, start and end logits identify the answer span; for generative QA, the model outputs text with optional citations
Verification
Advanced pipelines add a verification step that checks whether the generated answer is actually supported by the retrieved passages
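The retrieval, context-assembly, and verification steps above can be sketched end to end. This is a minimal illustration with toy bag-of-words vectors standing in for a trained dense encoder; the corpus, the `embed` and `supported` helpers, and the lexical verification check are all illustrative assumptions (production systems use learned encoders, ANN indexes such as FAISS or HNSW, and NLI-based verifiers):

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words vector; a real system uses a trained dense encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, passages, k=2):
    # Step 1: exhaustive top-k scoring; production systems use
    # approximate nearest neighbor search instead.
    q = embed(query)
    scored = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return scored[:k]

def assemble_context(query, passages, k=2):
    # Step 2: concatenate retrieved passages with the question for the reader.
    top = retrieve(query, passages, k)
    return "\n".join(top) + "\n\nQuestion: " + query

def supported(answer, context):
    # Step 5 (verification): crude lexical check that every answer token
    # occurs in the retrieved context; real pipelines use an NLI model.
    ctx = {t.strip(".,?") for t in context.lower().split()}
    return all(t.strip(".,?") in ctx for t in answer.lower().split())

corpus = [
    "SQuAD is an extractive question answering benchmark.",
    "Dense passage retrieval outperforms BM25 on open-domain QA.",
    "The capital of France is Paris.",
]
print(assemble_context("what is the capital of France", corpus, k=1))
```

The reader model (step 3) would then answer from the assembled context; the `supported` check gives a cheap post-hoc guard against answers not grounded in the retrieved passages.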
Current Landscape
QA in 2025 is dominated by two paradigms: RAG (retrieval-augmented generation) for production systems that need grounded, verifiable answers, and long-context LLMs that skip retrieval entirely by fitting entire document collections in the prompt. The SQuAD benchmark is long saturated; real-world QA evaluation has moved to KILT, Natural Questions, and domain-specific benchmarks (BioASQ, FinQA). The key metric is shifting from F1/EM to faithfulness — can you trust the answer?
Key Challenges
Hallucination: generative models produce fluent but unfaithful answers not supported by the source documents
Retrieval quality is the bottleneck — if the right passage isn't retrieved, no reader can produce the correct answer
Multi-hop questions requiring synthesis across 2+ documents remain significantly harder than single-hop
Temporal reasoning: answering questions about 'current' events requires up-to-date retrieval corpora
Evaluation is difficult for open-ended questions where multiple valid answers exist
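The evaluation difficulty is easiest to see in the metrics themselves. A sketch of SQuAD-style exact match and token-overlap F1 (simplified: the official evaluation script also strips articles and punctuation before comparing, this version only lowercases) shows why a prediction that differs by one word still earns partial credit under F1 but zero under EM:

```python
from collections import Counter

def exact_match(prediction, gold):
    # EM: 1 if the normalized strings are identical, else 0.
    return prediction.strip().lower() == gold.strip().lower()

def token_f1(prediction, gold):
    # F1: harmonic mean of token-level precision and recall.
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # partial credit: 0.8
```

Neither metric can reward a correct answer phrased differently from the gold span, which is exactly why open-ended QA evaluation is moving toward faithfulness-based measures.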
Quick Recommendations
Enterprise RAG QA
Claude 3.5 Sonnet or GPT-4o + vector store (Pinecone, Weaviate)
Best faithfulness to retrieved context with strong citation generation
Extractive QA (span selection)
DeBERTa-v3-large fine-tuned on SQuAD 2.0
93+ F1 on SQuAD; fast inference, no hallucination by design
Open-source RAG
Llama 3.1 70B + ColBERT v2 retriever
Strong generation quality with late-interaction retrieval for better passage matching
Multilingual QA
mDeBERTa-v3 or XLM-RoBERTa + mContriever
Covers 100+ languages for both retrieval and reading components
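The "no hallucination by design" property of the extractive recommendation above follows from span selection: the model can only return a substring of the input passage, chosen by maximizing the sum of start and end logits over valid spans. A minimal sketch with made-up logits (a real reader such as a fine-tuned DeBERTa produces one start and one end logit per token):

```python
def best_span(start_logits, end_logits, max_len=10):
    # Pick (i, j) maximizing start_logits[i] + end_logits[j],
    # subject to i <= j < i + max_len.
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["the", "tower", "is", "in", "paris", "france"]
start = [0.1, 0.2, 0.0, 0.3, 2.5, 0.4]  # illustrative logits
end   = [0.0, 0.1, 0.2, 0.1, 1.9, 2.2]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> "paris france"
```

SQuAD 2.0-style readers extend this with a no-answer score (typically the logits at the [CLS] position) so the model can abstain when the passage does not contain the answer.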
What's Next
The frontier is agentic QA: systems that don't just retrieve and read, but actively search, verify, and synthesize across multiple sources with tool use. Expect multi-step reasoning pipelines (retrieve → reason → re-retrieve → synthesize) to replace single-pass RAG. Attribution and citation quality will become first-class evaluation metrics, and hybrid architectures that use long-context for high-value queries and RAG for scale will emerge as the practical middle ground.
Benchmarks & SOTA
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.