Question Answering
Question answering now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning, factuality, long-context QA, and web-browsing agents. SQuAD is historical; current QA evaluation needs Natural Questions, TriviaQA, HotpotQA, MuSiQue, DROP, KILT, SimpleQA, FRAMES, and BrowseComp.
Question answering is no longer just SQuAD-style span extraction. The active surface now splits into extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning, factuality checks, long-context QA, and web-browsing QA. SQuAD remains useful as a historical sanity check, but it should not be the only benchmark on this page.
History
2016: SQuAD makes extractive reading comprehension the default QA benchmark.
2017: TriviaQA scales QA to trivia questions with independently gathered evidence documents.
2018: SQuAD 2.0 adds unanswerable questions; HotpotQA tests multi-hop evidence chaining.
2019: Natural Questions uses real Google search queries grounded in Wikipedia pages; DROP stresses discrete reasoning over passages.
2020–2022: KILT unifies knowledge-intensive QA and retrieval-grounded tasks over one Wikipedia snapshot; MuSiQue makes multi-hop composition harder to shortcut.
2024: SimpleQA and FRAMES shift attention toward factuality, retrieval, and answer support instead of span F1 alone.
2025: BrowseComp measures hard-to-find web QA where browsing strategy and citation discipline matter.
How Question Answering Works
Question analysis
Classify the query as span extraction, open-domain lookup, multi-hop reasoning, long-context lookup, or web research.
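A minimal sketch of this routing step, assuming a rule-based router; the regex heuristics and category names are illustrative stand-ins for the trained classifier or LLM router a real system would use.

```python
import re

def classify_question(q: str) -> str:
    """Hypothetical heuristics for routing a question to a QA strategy."""
    ql = q.lower()
    if re.search(r"\b(how many|difference between|total|average)\b", ql):
        return "discrete_reasoning"      # DROP-style arithmetic or comparison
    if re.search(r"\bthe .+ of the .+ of\b", ql):
        return "multi_hop"               # nested references suggest evidence chaining
    if len(ql.split()) > 50:
        return "long_context"            # the question itself carries a long context
    if re.search(r"\b(latest|current|this year|as of)\b", ql):
        return "web_research"            # freshness implies browsing, not a static corpus
    return "open_domain_lookup"          # default: single-hop retrieval

print(classify_question("How many Nobel Prizes did Marie Curie win?"))  # discrete_reasoning
```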
Evidence retrieval
Fetch candidate passages, documents, or web pages with lexical search, dense retrieval, or a hybrid of both, then rerank the results.
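As a sketch of the hybrid part of this step, reciprocal rank fusion is a common way to merge a lexical ranking and a dense ranking without tuning score scales. The document IDs and the constant k=60 below are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; documents ranked high anywhere float to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical_ranking = ["doc3", "doc1", "doc7"]  # e.g. output of a BM25 index
dense_ranking = ["doc1", "doc9", "doc3"]    # e.g. output of an embedding search
print(reciprocal_rank_fusion([lexical_ranking, dense_ranking])[:3])
```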
Grounded reading
Read the retrieved evidence and produce a short answer, span, or free-form response with support.
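A sketch of how the grounded-reading step can be framed so the model must point to the passage it used; `call_llm` is a placeholder for whatever model client the system actually runs.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that forces the answer to cite one of the numbered passages."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered passages below.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}\n"
        "Give a short answer followed by the number of the supporting passage, "
        "or reply 'unanswerable' if no passage contains the answer."
    )

# answer = call_llm(build_grounded_prompt(question, top_passages))  # placeholder client
```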
Verification
Check answerability, citation support, contradiction risk, and whether the answer is actually present in the evidence.
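One cheap verification check is to confirm the answer string actually appears in the passage the model cited. Real verifiers add NLI models or a second LLM pass; the exact substring match below is a deliberately strict stand-in.

```python
def _norm(s: str) -> str:
    return " ".join(s.lower().split())

def is_supported(answer: str, cited_passage: str) -> bool:
    """Strict support check: the normalized answer must appear verbatim in the citation."""
    return bool(answer.strip()) and _norm(answer) in _norm(cited_passage)

print(is_supported("Marie Curie", "Marie Curie won the Nobel Prize twice."))  # True
print(is_supported("1912", "Marie Curie won the Nobel Prize twice."))         # False
```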
Abstention
For unanswerable or under-supported cases, the system should say it cannot answer instead of hallucinating.
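Tying verification and abstention together, a sketch of the final decision; the 0.5 confidence cutoff and the refusal string are illustrative assumptions.

```python
REFUSAL = "I can't answer this from the available evidence."

def answer_or_abstain(answer: str, cited_passage: str, confidence: float) -> str:
    """Return the answer only when it is supported and the model is confident enough."""
    supported = bool(answer.strip()) and answer.lower() in cited_passage.lower()
    if answer.lower() == "unanswerable" or confidence < 0.5 or not supported:
        return REFUSAL
    return answer

print(answer_or_abstain("Paris", "The capital of France is Paris.", 0.92))  # Paris
print(answer_or_abstain("Lyon", "The capital of France is Paris.", 0.92))   # refuses
```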
Current Landscape
QA in 2026 is best treated as an evaluation stack. SQuAD and TriviaQA cover the historical reading-comprehension baseline. Natural Questions and KILT cover retrieval-grounded open-domain QA. HotpotQA, MuSiQue, and DROP cover reasoning over evidence. SimpleQA, FRAMES, and BrowseComp cover the newer frontier: factuality, retrieval quality, and hard-to-find answers with tools.
Key Challenges
SQuAD-style F1 is saturated and no longer separates frontier systems.
Retrieval failure still dominates production QA errors.
Multi-hop datasets expose whether systems can combine evidence rather than match one passage.
Factuality benchmarks punish plausible but unsupported answers.
Browsing QA requires search strategy, persistence, and source judgment, not only language modeling.
Long-context QA can still miss small facts even when the answer is technically inside the prompt.
Quick Recommendations
Historical extractive QA
SQuAD v2.0 plus an abstention metric
Useful for regression testing span readers, but too saturated to stand alone.
Open-domain QA
Natural Questions, TriviaQA, and KILT
Better reflects retrieval-grounded QA over real or broad knowledge sources.
Multi-hop reasoning
HotpotQA, MuSiQue, and DROP
Tests composition, numerical reasoning, and evidence chaining beyond single-pass lookup.
Modern factual QA
SimpleQA, FRAMES, and BrowseComp
Tracks hallucination, retrieval quality, and hard web-search behavior for LLM systems.
What's Next
The useful leaderboard will separate answer accuracy from evidence quality. Expect QA pages to track not only exact match or F1, but answerability, citation support, retrieval recall, calibration, and whether the system can abstain when the evidence is insufficient.
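A sketch of what such a scorecard could look like: each record is a hypothetical dict with 'prediction', 'gold', 'cited_support', and 'answerable' fields, and an empty prediction marks an abstention.

```python
def qa_scorecard(records: list[dict]) -> dict[str, float]:
    """Report answer accuracy and evidence quality as separate numbers."""
    answerable = [r for r in records if r["answerable"]]
    unanswerable = [r for r in records if not r["answerable"]]
    exact = sum(r["prediction"].strip().lower() == r["gold"].strip().lower()
                for r in answerable)
    supported = sum(bool(r["cited_support"]) for r in answerable)
    abstained = sum(r["prediction"] == "" for r in unanswerable)
    return {
        "exact_match": exact / max(len(answerable), 1),
        "citation_support": supported / max(len(answerable), 1),
        "abstention_accuracy": abstained / max(len(unanswerable), 1),
    }

records = [
    {"prediction": "1867", "gold": "1867", "cited_support": True, "answerable": True},
    {"prediction": "", "gold": "", "cited_support": False, "answerable": False},
]
print(qa_scorecard(records))
```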
Benchmarks & SOTA
SQuAD v2.0
Stanford Question Answering Dataset v2.0
Historical extractive QA benchmark over Wikipedia paragraphs with unanswerable questions. Valuable as a regression test, but saturated for frontier LLM comparison.
State of the Art
GPT-4o
OpenAI
91.4
F1
HotpotQA
HotpotQA
Multi-hop question-answering benchmark requiring reasoning across multiple Wikipedia documents.
State of the Art
GPT-4o
OpenAI
71.3
F1
FRAMES
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
RAG evaluation dataset for factuality, retrieval, and reasoning in answer generation.
No results tracked yet
KILT
Knowledge Intensive Language Tasks
Unified benchmark for retrieval-grounded knowledge-intensive NLP tasks over a shared Wikipedia snapshot, including open-domain QA.
No results tracked yet
MuSiQue
MuSiQue: Multihop Questions via Single-hop Question Composition
Multi-hop QA benchmark constructed to require connected reasoning over multiple passages.
No results tracked yet
Natural Questions
Natural Questions: a Benchmark for Question Answering Research
Open-domain QA benchmark built from real Google search queries with answers annotated from Wikipedia pages.
No results tracked yet
SimpleQA
Measuring Short-form Factuality in Large Language Models
Short-form factuality benchmark with single-answer fact-seeking questions designed to expose hallucination and calibration failures.
No results tracked yet
BrowseComp
BrowseComp: A Benchmark for Browsing Agents
Hard web-browsing QA benchmark with short factual answers that require persistent search over many online sources.
No results tracked yet
TriviaQA
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Large-scale QA benchmark with trivia questions and independently gathered evidence documents.
No results tracked yet
DROP
Discrete Reasoning Over Paragraphs
Reading-comprehension benchmark requiring arithmetic, counting, sorting, comparison, and other discrete reasoning over paragraphs.
No results tracked yet
Related Tasks
Polish LLM General
General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, question answering, cyberbullying detection, and emotional intelligence.
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.