Who leads the SQuAD v2.0 benchmark?

An ALBERT ensemble currently leads SQuAD v2.0 with an F1 score of 92.20.

What is the state-of-the-art score on SQuAD v2.0?

The state-of-the-art result on SQuAD v2.0 is an F1 score of 92.20, achieved by an ALBERT ensemble as of 2024.

How many models are tracked on SQuAD v2.0?

Codesota tracks 26 models on SQuAD v2.0 across 2 metrics.

When was the SQuAD v2.0 leaderboard last updated?

The SQuAD v2.0 leaderboard on Codesota includes results through 2024, with the earliest tracked result from 2016.

SQuAD Leaderboard: Question Answering SOTA Results

Name: Stanford Question Answering Dataset v2.0 Benchmark Results
Creator: Codesota
Published: 2016-01-01
License: https://creativecommons.org/licenses/by/4.0/

What is SQuAD?

SQuAD quickly became the gold standard for evaluating question answering systems, driving rapid progress in neural language understanding. Unlike previous datasets that used multiple-choice formats, SQuAD requires models to identify the exact start and end indices of the answer within a passage.
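The span-selection step can be illustrated with a minimal sketch. It assumes a model has already produced one start score and one end score per passage token; the scores, the `best_span` helper, and the `max_answer_len` cap are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j]
    subject to i <= j < i + max_answer_len."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Made-up passage tokens and per-token scores (not real model output).
tokens = ["The", "Normans", "settled", "in", "France", "in", "the", "10th", "century"]
start_scores = [0.1, 0.2, 0.0, 0.1, 0.3, 0.2, 1.5, 2.9, 0.4]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.5, 0.1, 0.3, 1.0, 3.1]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # 10th century
```

Constraining the end index to be at or after the start index rules out invalid spans, and the length cap keeps the search cheap on long passages.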

The dataset evolved into SQuAD 2.0, which introduced a significant challenge: determining when a question is unanswerable based on the provided text. This forced models not only to find answers but also to return a "null" response when the passage does not support one, discouraging confident but unsupported answers.
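One common way to implement the "null" response in BERT-style systems is to compare a no-answer score against the best span score and abstain when the difference exceeds a tuned threshold. The sketch below assumes both scores are already available; the `predict` function, its parameter names, and the example numbers are hypothetical.

```python
def predict(span_text: str, span_score: float, null_score: float,
            null_threshold: float = 0.0) -> str:
    """Return the span text, or "" (abstain) when the null score beats
    the best span score by more than null_threshold."""
    if null_score - span_score > null_threshold:
        return ""
    return span_text

# Span clearly preferred over the null option: answer.
print(repr(predict("10th century", span_score=6.0, null_score=1.2)))
# Null option wins by a wide margin: abstain.
print(repr(predict("Paris", span_score=2.0, null_score=5.5)))
```

Raising `null_threshold` makes the system answer more often; lowering it makes the system abstain more often. In practice the value would be tuned on a development set.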

SQuAD 1.1 (Legacy Standard)

Original dataset where all questions have answers in the text.

100,000+ pairs

SQuAD 2.0 (Current Standard)

Combines the SQuAD 1.1 questions with over 50,000 unanswerable questions written adversarially by crowdworkers.

150,000+ pairs

Key Innovations

1. Span-based format: Models must select a text span from the passage rather than generate free-form text or choose from options.
2. Large scale: With over 100,000 question-answer pairs, the dataset was large enough to train complex neural architectures such as BiDAF and, later, BERT.
3. Adversarial unanswerability: SQuAD 2.0 introduced questions that look relevant but cannot be answered from the passage, testing true comprehension.

SOTA Evolution

The journey from feature engineering to Transformer dominance.

Model	F1 Score	Date	Note
Logistic Regression	51	2016	Original SQuAD 1.1 baseline
BERT (Google AI)	83.1	Nov 2018	Transformer breakthrough
BERT + AoA	88.6	Mar 2019	Attention-over-Attention
SpanBERT	88.7	Jul 2019	Span-level pre-training
XLNet + Verifier	89.1	Oct 2019	Permutation-based XLNet
RoBERTa + Verify	90	Nov 2019	Robustly optimized BERT
RoBERTa (Single)	89.8	Jul 2020	Late single-model peak

Official Leaderboard

Top performing models on the SQuAD 2.0 hidden test set.

Rank	Model	Vendor / Team	F1 Score	Date
#01	RoBERTa (single model)	Facebook AI	89.795	Jul 2020
#02	Enhanced Albert+Verifier3 (ensemble)	Microsoft STCA AIC	89.778	May 2020
#03	RoBERTa+Verify (single model)	CW	89.586	Nov 2019
#04	BERT + ConvLSTM + MTL + Verifier (ensemble)	Layer 6 AI	89.286	Mar 2019
#05	Xlnet+Verifier (single model)	Google/CMU	89.082	Oct 2019
#06	Xlnet+Verifier (single model)	Ping An Life Insurance	89.063	Aug 2019
#07	BERT + DAE + AoA (single model)	HIT & iFLYTEK	88.621	Mar 2019
#08	SpanBERT (single model)	FAIR & UW	88.709	Jul 2019
#09	xlnet (single model)	Verified XiaoPAI	88.000	Sep 2019
#10	Insight-baseline-BERT (single model)	PAII Insight Team	87.644	Apr 2019
#11	Hanvon_model (single model)	Hanvon_WuHan	87.117	Sep 2019
#12	SLQA+ (single model)	Alibaba iDST	87.021	Jan 2018
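For readers unfamiliar with the F1 column, the following is a minimal sketch of token-overlap F1 and exact match for a single prediction. It is modeled on the answer normalization commonly used in SQuAD-style evaluation (lowercasing, stripping punctuation and English articles) and is not the official evaluation script; the function names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import List

def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def exact_match(prediction: str, truth: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))

def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Unanswerable case: credit only if both sides are empty.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the 10th century", "10th century"))  # 1.0
print(f1_score("Paris", ""))                         # 0.0
```

A benchmark-level score is then typically the average of the per-question values, expressed as a percentage. For unanswerable questions the reference answer is empty, so a model earns credit only by abstaining.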

Domain Adaptation Challenges

While SQuAD models achieve human-level performance on Wikipedia text, they often struggle when deployed to specialized domains. The figures below show each specialized QA benchmark's vocabulary overlap with SQuAD and its typical context length.

Benchmark	Domain	Vocabulary Overlap	Context Length
MOVIE-QA	Plot summaries	41.4%	150-300 tokens
COVID-QA	Biomedical papers	36%	4000+ tokens
CUAD-QA	Legal contracts	34.8%	5000+ tokens
BioASQ	Medical abstracts	31.2%	200-500 tokens
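The page does not state how vocabulary overlap is measured. One plausible definition, assumed here purely for illustration, is the share of distinct target-domain word types that also appear in the source corpus. The tokenizer, function names, and example sentences below are all hypothetical.

```python
import re
from typing import Iterable, Set

def vocabulary(texts: Iterable[str]) -> Set[str]:
    """Collect distinct lowercase alphanumeric word types."""
    vocab: Set[str] = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z0-9]+", text.lower()))
    return vocab

def vocab_overlap(source_texts: Iterable[str], target_texts: Iterable[str]) -> float:
    """Fraction of target word types that also occur in the source (assumed definition)."""
    src, tgt = vocabulary(source_texts), vocabulary(target_texts)
    if not tgt:
        return 0.0
    return len(src & tgt) / len(tgt)

source = ["The Normans settled in France."]
target = ["The lessee shall indemnify the lessor in full."]
overlap = vocab_overlap(source, target)
print(f"{overlap:.1%}")  # 28.6%
```

Other definitions, such as Jaccard similarity or overlap over subword tokens, would give different numbers, so the percentages in the table should not be expected to match this sketch.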

Practitioner Tip

When fine-tuning SQuAD models for legal or medical domains, prioritize increasing the max_seq_length. SQuAD passages average 116 tokens, while CUAD (legal) can exceed 5,000.
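Raising max_seq_length only goes so far, because most encoder models have a fixed upper limit on input length. A common complement is to split long contexts into overlapping windows. The sketch below computes window boundaries over token positions; `sliding_windows`, `window`, and `step` are hypothetical names, and the sketch ignores the tokens a real pipeline would reserve for the question and special tokens.

```python
from typing import List, Tuple

def sliding_windows(num_tokens: int, window: int, step: int) -> List[Tuple[int, int]]:
    """Return [start, end) token ranges covering num_tokens.
    Consecutive windows overlap by (window - step) tokens."""
    if window <= 0 or step <= 0 or step > window:
        raise ValueError("require 0 < step <= window")
    spans = []
    start = 0
    while True:
        end = min(start + window, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        start += step
    return spans

# A SQuAD-sized passage fits in a single window.
print(sliding_windows(116, window=512, step=384))  # [(0, 116)]

# A 5,000-token contract needs many overlapping windows.
contract_windows = sliding_windows(5000, window=512, step=384)
print(len(contract_windows), contract_windows[-1])  # 13 (4608, 5000)
```

The overlap between windows reduces the chance that an answer span is cut in half at a boundary. Predictions from each window would then be compared by score to pick a final answer.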