Question Answering
Question answering, in both its extractive and abstractive forms, is one of the oldest NLP tasks, with benchmarks running from the original SQuAD (2016) to the open-domain questions of Natural Questions and TriviaQA. ALBERT claimed human parity on SQuAD 2.0 in 2020, effectively saturating the benchmark, but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck has moved from answer extraction to retrieval quality. Production systems such as Perplexity and Google's AI Overviews show that QA is now an end-to-end pipeline problem, not a single-model benchmark.
SQuAD v2.0
About 150K questions over Wikipedia articles, of which roughly 50K are unanswerable from the given passage. Tests reading comprehension and the ability to recognize when a question cannot be answered.
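Scoring on this kind of dataset usually combines exact match (EM) with token-level F1, and treats an unanswerable question as having the empty string as its gold answer. The following is a minimal sketch of that convention, assuming Python 3.9+ and SQuAD-style normalization (lowercasing, then stripping punctuation and English articles). The function names are illustrative and are not taken from any official script.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a prediction and one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: full credit only when both sides are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, gold_answers: list[str]) -> tuple[int, float]:
    """Best EM and F1 over all gold answers.

    Assumption: an empty gold list marks the question as unanswerable,
    and an empty prediction string means the model abstained.
    """
    golds = gold_answers or [""]
    em = max(exact_match(prediction, g) for g in golds)
    f1 = max(token_f1(prediction, g) for g in golds)
    return em, f1
```

Under these assumptions, abstaining on an unanswerable question (`score("", [])`) earns full credit, while producing any answer for it (`score("Paris", [])`) or abstaining on an answerable one (`score("", ["Paris"])`) earns none. That is why a model cannot score well by always extracting a span.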
Top 10
Leading models on SQuAD v2.0.
No results yet.
All datasets
1 dataset tracked for this task.