Question Answering
Question answering now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning, factuality, long-context QA, and web-browsing agents. SQuAD is historical; current QA evaluation needs Natural Questions, TriviaQA, HotpotQA, MuSiQue, DROP, KILT, SimpleQA, FRAMES, and BrowseComp.
Question answering is no longer just SQuAD-style span extraction. The active surface now splits into extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning, factuality checks, long-context QA, and web-browsing QA. SQuAD remains useful as a historical sanity check, but it should not be the only benchmark on this page.
History
2016: SQuAD makes extractive reading comprehension the default QA benchmark.
2017: TriviaQA scales QA to trivia questions with independently gathered evidence documents.
2018: SQuAD 2.0 adds unanswerable questions; HotpotQA tests multi-hop evidence chaining.
2019: Natural Questions uses real Google search queries grounded in Wikipedia pages; DROP stresses discrete reasoning over passages.
2020–2022: KILT unifies knowledge-intensive QA and retrieval-grounded tasks over one Wikipedia snapshot; MuSiQue makes multi-hop composition harder to shortcut.
2024: SimpleQA and FRAMES shift attention toward factuality, retrieval, and answer support instead of span F1 alone.
2025: BrowseComp measures hard-to-find web QA where browsing strategy and citation discipline matter.
How Question Answering Works
Question analysis
Classify the query as span extraction, open-domain lookup, multi-hop reasoning, long-context lookup, or web research.
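A minimal sketch of this routing step, assuming a rule-based router; the regex heuristics and category names are illustrative stand-ins for the trained classifier or LLM router a real system would use.

```python
import re

def classify_question(q: str) -> str:
    """Hypothetical heuristics for routing a question to a QA strategy."""
    ql = q.lower()
    if re.search(r"\b(how many|difference between|total|average)\b", ql):
        return "discrete_reasoning"      # DROP-style arithmetic or comparison
    if re.search(r"\bthe .+ of the .+ of\b", ql):
        return "multi_hop"               # nested references suggest evidence chaining
    if len(ql.split()) > 50:
        return "long_context"            # the question itself carries a long context
    if re.search(r"\b(latest|current|this year|as of)\b", ql):
        return "web_research"            # freshness implies browsing, not a static corpus
    return "open_domain_lookup"          # default: single-hop retrieval

print(classify_question("How many Nobel Prizes did Marie Curie win?"))  # discrete_reasoning
```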
Evidence retrieval
Fetch candidate passages, documents, or web pages with lexical search, dense retrieval, or a hybrid of both, then rerank the results.
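As a sketch of the hybrid part of this step, reciprocal rank fusion is a common way to merge a lexical ranking and a dense ranking without tuning score scales. The document IDs and the constant k=60 below are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; documents ranked high anywhere float to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical_ranking = ["doc3", "doc1", "doc7"]  # e.g. output of a BM25 index
dense_ranking = ["doc1", "doc9", "doc3"]    # e.g. output of an embedding search
print(reciprocal_rank_fusion([lexical_ranking, dense_ranking])[:3])
```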
Grounded reading
Read the retrieved evidence and produce a short answer, span, or free-form response with support.
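A sketch of how the grounded-reading step can be framed so the model must point to the passage it used; `call_llm` is a placeholder for whatever model client the system actually runs.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that forces the answer to cite one of the numbered passages."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered passages below.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}\n"
        "Give a short answer followed by the number of the supporting passage, "
        "or reply 'unanswerable' if no passage contains the answer."
    )

# answer = call_llm(build_grounded_prompt(question, top_passages))  # placeholder client
```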
Verification
Check answerability, citation support, contradiction risk, and whether the answer is actually present in the evidence.
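One cheap verification check is to confirm the answer string actually appears in the passage the model cited. Real verifiers add NLI models or a second LLM pass; the exact substring match below is a deliberately strict stand-in.

```python
def _norm(s: str) -> str:
    return " ".join(s.lower().split())

def is_supported(answer: str, cited_passage: str) -> bool:
    """Strict support check: the normalized answer must appear verbatim in the citation."""
    return bool(answer.strip()) and _norm(answer) in _norm(cited_passage)

print(is_supported("Marie Curie", "Marie Curie won the Nobel Prize twice."))  # True
print(is_supported("1912", "Marie Curie won the Nobel Prize twice."))         # False
```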
Abstention
For unanswerable or under-supported cases, the system should say it cannot answer instead of hallucinating.
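Tying verification and abstention together, a sketch of the final decision; the 0.5 confidence cutoff and the refusal string are illustrative assumptions.

```python
REFUSAL = "I can't answer this from the available evidence."

def answer_or_abstain(answer: str, cited_passage: str, confidence: float) -> str:
    """Return the answer only when it is supported and the model is confident enough."""
    supported = bool(answer.strip()) and answer.lower() in cited_passage.lower()
    if answer.lower() == "unanswerable" or confidence < 0.5 or not supported:
        return REFUSAL
    return answer

print(answer_or_abstain("Paris", "The capital of France is Paris.", 0.92))  # Paris
print(answer_or_abstain("Lyon", "The capital of France is Paris.", 0.92))   # refuses
```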
Current Landscape
QA in 2026 is best treated as an evaluation stack. SQuAD and TriviaQA cover the historical reading-comprehension baseline. Natural Questions and KILT cover retrieval-grounded open-domain QA. HotpotQA, MuSiQue, and DROP cover reasoning over evidence. SimpleQA, FRAMES, and BrowseComp cover the newer frontier: factuality, retrieval quality, and hard-to-find answers with tools.
Key Challenges
SQuAD-style F1 is saturated and no longer separates frontier systems.
Retrieval failure still dominates production QA errors.
Multi-hop datasets expose whether systems can combine evidence rather than match one passage.
Factuality benchmarks punish plausible but unsupported answers.
Browsing QA requires search strategy, persistence, and source judgment, not only language modeling.
Long-context QA can still miss small facts even when the answer is technically inside the prompt.
Quick Recommendations
Historical extractive QA
SQuAD v2.0 plus an abstention metric
Useful for regression testing span readers, but too saturated to stand alone.
Open-domain QA
Natural Questions, TriviaQA, and KILT
Better reflects retrieval-grounded QA over real or broad knowledge sources.
Multi-hop reasoning
HotpotQA, MuSiQue, and DROP
Tests composition, numerical reasoning, and evidence chaining beyond single-pass lookup.
Modern factual QA
SimpleQA, FRAMES, and BrowseComp
Tracks hallucination, retrieval quality, and hard web-search behavior for LLM systems.
What's Next
The useful leaderboard will separate answer accuracy from evidence quality. Expect QA pages to track not only exact match or F1, but answerability, citation support, retrieval recall, calibration, and whether the system can abstain when the evidence is insufficient.
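A sketch of what such a scorecard could look like: each record is a hypothetical dict with 'prediction', 'gold', 'cited_support', and 'answerable' fields, and an empty prediction marks an abstention.

```python
def qa_scorecard(records: list[dict]) -> dict[str, float]:
    """Report answer accuracy and evidence quality as separate numbers."""
    answerable = [r for r in records if r["answerable"]]
    unanswerable = [r for r in records if not r["answerable"]]
    exact = sum(r["prediction"].strip().lower() == r["gold"].strip().lower()
                for r in answerable)
    supported = sum(bool(r["cited_support"]) for r in answerable)
    abstained = sum(r["prediction"] == "" for r in unanswerable)
    return {
        "exact_match": exact / max(len(answerable), 1),
        "citation_support": supported / max(len(answerable), 1),
        "abstention_accuracy": abstained / max(len(unanswerable), 1),
    }

records = [
    {"prediction": "1867", "gold": "1867", "cited_support": True, "answerable": True},
    {"prediction": "", "gold": "", "cited_support": False, "answerable": False},
]
print(qa_scorecard(records))
```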
Benchmarks & SOTA
SQuAD v2.0
Stanford Question Answering Dataset v2.0
Historical extractive QA benchmark over Wikipedia paragraphs with unanswerable questions. Valuable as a regression test, but saturated for frontier LLM comparison.
State of the Art
GPT-4o
OpenAI
91.4
F1
HotpotQA
HotpotQA
Multi-hop question-answering benchmark requiring reasoning across multiple Wikipedia documents.
State of the Art
GPT-4o
OpenAI
71.3
F1
FRAMES
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
RAG evaluation dataset for factuality, retrieval, and reasoning in answer generation.
No results tracked yet
KILT
Knowledge Intensive Language Tasks
Unified benchmark for retrieval-grounded knowledge-intensive NLP tasks over a shared Wikipedia snapshot, including open-domain QA.
No results tracked yet
MuSiQue
MuSiQue: Multihop Questions via Single-hop Question Composition
Multi-hop QA benchmark constructed to require connected reasoning over multiple passages.
No results tracked yet
Natural Questions
Natural Questions: a Benchmark for Question Answering Research
Open-domain QA benchmark built from real Google search queries with answers annotated from Wikipedia pages.
No results tracked yet
SimpleQA
Measuring Short-form Factuality in Large Language Models
Short-form factuality benchmark with single-answer fact-seeking questions designed to expose hallucination and calibration failures.
No results tracked yet
BrowseComp
BrowseComp: A Benchmark for Browsing Agents
Hard web-browsing QA benchmark with short factual answers that require persistent search over many online sources.
No results tracked yet
TriviaQA
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Large-scale QA benchmark with trivia questions and independently gathered evidence documents.
No results tracked yet
DROP
Discrete Reasoning Over Paragraphs
Reading-comprehension benchmark requiring arithmetic, counting, sorting, comparison, and other discrete reasoning over paragraphs.
No results tracked yet
Related Tasks
Polish LLM General
General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, question answering, cyberbullying detection, and emotional intelligence.
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.