
Question Answering

Question answering, both extractive and abstractive, is one of the oldest NLP benchmark tasks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.
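The retrieval bottleneck described above can be made concrete with a minimal sketch. This is an illustrative toy, not any production system's pipeline: it scores documents by term overlap with the question (real RAG systems use dense embeddings and an LLM reader), and the function names are hypothetical.

```python
from collections import Counter

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Toy retrieval step of a RAG pipeline: rank documents by
    bag-of-words overlap with the question and return the top k.
    In a real system this would be dense-vector similarity search."""
    q_tokens = Counter(question.lower().split())

    def overlap(doc: str) -> int:
        # Number of shared token occurrences between question and doc.
        return sum((q_tokens & Counter(doc.lower().split())).values())

    return sorted(docs, key=overlap, reverse=True)[:k]

# If retrieval surfaces the wrong document, no reader model can
# recover the answer — hence "the bottleneck moved to retrieval".
docs = [
    "The Eiffel Tower is in Paris.",
    "Bananas are a good source of potassium.",
]
top = retrieve("Where is the Eiffel Tower?", docs)
```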

Datasets: 1 · Results: 0 · Canonical metric: F1
Canonical Benchmark

SQuAD v2.0

150K questions on Wikipedia articles, including 50K unanswerable ones. Tests both reading comprehension and the ability to abstain when a question cannot be answered from the passage.

Primary metric: F1
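The F1 metric here is token-level overlap between the predicted and gold answer spans, averaged over questions. A simplified sketch of the official SQuAD scoring (omitting its article/punctuation normalization):

```python
from collections import Counter

def squad_f1(prediction: str, gold: str) -> float:
    """Token-level F1 in the style of the SQuAD evaluation script,
    simplified: no article stripping or punctuation normalization."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # SQuAD 2.0 unanswerable case: gold is the empty string, and the
    # model scores 1.0 only if it also predicts no answer.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap is rewarded: "the eiffel tower" vs. "eiffel tower"
# gives precision 2/3, recall 1.0 → F1 = 0.8.
```

This is why F1, rather than exact match alone, is the canonical metric: it gives partial credit for boundary disagreements in the extracted span.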

Top 10

Leading models on SQuAD v2.0.

No results yet. Be the first to contribute.

What were you looking for on Question Answering?

Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Natural Language Processing.
