Benchmark · EMNLP 2016 · Stanford NLP

SQuAD Benchmark

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.

SOTA F1 Score: 89.795
Total Samples: 150K+
Metric: F1 / EM
Human EM: 86.83

What is SQuAD?

SQuAD quickly became the gold standard for evaluating question answering systems, driving rapid progress in neural language understanding. Unlike previous datasets that used multiple-choice formats, SQuAD requires models to identify the exact start and end indices of the answer within a passage.

The dataset evolved into SQuAD 2.0, which introduced a significant challenge: determining when a question is unanswerable based on the provided text. This forced models not only to find answers but also to recognize when no answer exists and return a "null" response, reducing confidently wrong predictions.

SQuAD 1.1 (Legacy Standard)

Original dataset where all questions have answers in the text.

100,000+ pairs
SQuAD 2.0 (Current Standard)

Combines SQuAD 1.1 with 50,000 unanswerable questions written adversarially.

150,000+ pairs

Key Innovations

  • 1. Span-based format: Requiring models to select exact text spans rather than generating free-form text or choosing from options.

  • 2. Large scale: Over 100,000 question-answer pairs enabled, for the first time, effective training of complex neural architectures like BiDAF and, later, fine-tuning of models like BERT.

  • 3. Adversarial unanswerability: SQuAD 2.0 introduced questions that look relevant but cannot be answered, testing true comprehension.
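The span-based format above is usually implemented with two scoring heads: one score per token for being the answer's start, one for being its end, with the prediction being the highest-scoring valid span. A minimal sketch (the scores and tokens here are illustrative, not from any real model):

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Pick the (start, end) pair with the highest combined score,
    subject to start <= end and a maximum answer length."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best, best_score

# Toy passage with made-up per-token scores from a hypothetical QA head.
tokens = ["The", "dataset", "was", "released", "in", "2016", "."]
start = [0.1, 0.2, 0.0, 0.3, 0.1, 2.5, 0.0]
end   = [0.0, 0.1, 0.0, 0.2, 0.0, 2.8, 0.1]
(i, j), _ = best_span(start, end)
print(" ".join(tokens[i:j + 1]))  # -> 2016
```

Real systems score spans over contextual embeddings, but the decoding step reduces to this argmax over valid (start, end) pairs.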

SQuAD Performance Distribution

[Chart: performance distribution visualization]

SOTA Evolution

The journey from feature engineering to Transformer dominance.

Milestone                F1 Score  Date      Note
Logistic Regression      51.0      2016      Original SQuAD 1.1 baseline
BERT (Google AI)         83.1      Nov 2018  Transformer breakthrough
BERT + AoA               88.6      Mar 2019  Attention-over-Attention
SpanBERT                 88.7      Jul 2019  Span-level pre-training
XLNet + Verifier         89.1      Oct 2019  Permutation-based XLNet
RoBERTa + Verify         90.0      Nov 2019  Robustly optimized BERT
RoBERTa (single model)   89.8      Jul 2020  Late single-model peak

Official Leaderboard

Top performing models on the SQuAD 2.0 hidden test set.

Rank  Model                                         Vendor / Team           F1 Score  Date
#01   RoBERTa (single model)                        Facebook AI             89.795    Jul 2020
#02   Enhanced Albert+Verifier3 (ensemble)          Microsoft STCA AIC      89.778    May 2020
#03   RoBERTa+Verify (single model)                 CW                      89.586    Nov 2019
#04   BERT + ConvLSTM + MTL + Verifier (ensemble)   Layer 6 AI              89.286    Mar 2019
#05   Xlnet+Verifier (single model)                 Google/CMU              89.082    Oct 2019
#06   Xlnet+Verifier (single model)                 Ping An Life Insurance  89.063    Aug 2019
#07   BERT + DAE + AoA (single model)               HIT & iFLYTEK           88.621    Mar 2019
#08   SpanBERT (single model)                       FAIR & UW               88.709    Jul 2019
#09   xlnet (single model)                          Verified XiaoPAI        88.000    Sep 2019
#10   Insight-baseline-BERT (single model)          PAII Insight Team       87.644    Apr 2019
#11   Hanvon_model (single model)                   Hanvon_WuHan            87.117    Sep 2019
#12   SLQA+ (single model)                          Alibaba iDST            87.021    Jan 2018

Domain Adaptation Challenges

While SQuAD models achieve human-level performance on Wikipedia text, they often struggle when deployed to specialized domains. The heatmap below shows vocabulary overlap and context length disparities between SQuAD and specialized QA benchmarks.

Benchmark  Domain              Vocab Overlap  Context Length
MOVIE-QA   Plot summaries      41.4%          150-300 tokens
COVID-QA   Biomedical papers   36.0%          4,000+ tokens
CUAD-QA    Legal contracts     34.8%          5,000+ tokens
BioASQ     Medical abstracts   31.2%          200-500 tokens

Practitioner Tip

When fine-tuning SQuAD models for legal or medical domains, prioritize increasing the max_seq_length. SQuAD passages average 116 tokens, while CUAD (legal) can exceed 5,000.

Metric Explainer

Exact Match (EM)

Percentage of predictions that match one of the ground truth answers exactly.

F1 Score

The harmonic mean of precision and recall, measuring the overlap between prediction and truth at the token level.
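Both metrics are simple to compute; a sketch that follows the official SQuAD evaluation's answer normalization (lowercasing, stripping punctuation and the articles "a", "an", "the", collapsing whitespace):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Normalize an answer string the way the SQuAD evaluation does."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())

def exact_match(prediction, truth):
    return normalize(prediction) == normalize(truth)

def f1_score(prediction, truth):
    """Token-level F1: harmonic mean of precision and recall on overlap."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))           # True
print(round(f1_score("in the Eiffel Tower", "Eiffel Tower Paris"), 2))  # 0.67
```

On the real dataset, each prediction is scored against every ground-truth answer and the maximum per-question score is averaged over the set.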

Foundational Papers

Ready to train your model?

Download the SQuAD 2.0 dataset and join the leaderboard. Access the official training and development sets via the Stanford NLP group.

Related QA Benchmarks

  • Natural Questions
  • HotpotQA
  • TriviaQA
  • QuAC