Who leads the SQuAD v2.0 benchmark?

ALBERT ensemble currently leads SQuAD v2.0 with an F1 score of 92.2.

What is the state-of-the-art score on SQuAD v2.0?

The state-of-the-art result on SQuAD v2.0 is 92.2 F1, achieved by ALBERT ensemble in 2019. No later result tracked on the leaderboard, which runs through 2024, has surpassed it.

How many models are tracked on SQuAD v2.0?

Codesota tracks 26 models on SQuAD v2.0 across 2 metrics.

When was the SQuAD v2.0 leaderboard last updated?

The SQuAD v2.0 leaderboard on Codesota includes results through 2024, with the earliest tracked result from 2016.

Codesota · Benchmark · SQuAD v2.0

SQuAD v2.0.

Name: SQuAD v2.0 Benchmark Results
Creator: Unknown
Published: 2016-01-01
License: https://creativecommons.org/licenses/by/4.0/

150K questions on Wikipedia articles, including 50K unanswerable questions. Tests reading comprehension and knowing when a question cannot be answered.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

f1

F1 is one of the two evaluation metrics Codesota tracks for SQuAD v2.0. The table lists published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
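
For readers unfamiliar with the metric, here is a minimal sketch of a SQuAD-style token-overlap F1 for a single prediction. It is an illustration only: the function names `normalize_answer` and `token_f1` are assumptions, and the normalization (lowercasing, stripping punctuation and English articles) follows the convention commonly used for SQuAD evaluation rather than anything stated on this page. Leaderboard scores are this value averaged over all questions and scaled to 0-100.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall for one prediction."""
    pred_tokens = normalize_answer(prediction)
    gold_tokens = normalize_answer(gold)

    # Unanswerable questions have an empty gold answer: the prediction
    # scores 1.0 only if it is empty too.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

The empty-answer branch is what lets the metric reward a model for abstaining on the unanswerable questions that SQuAD v2.0 includes.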

Trust tiers for f1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.
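
The muting rule above can be sketched as a small filter. This is a hypothetical reconstruction, not Codesota's implementation: the `Result` class and `muted_rows` function are assumed names, and the sample data is a handful of rows copied from the f1 table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score: float
    year: int


def muted_rows(results: list[Result]) -> list[Result]:
    """Return rows beaten by a result from the same year or earlier."""
    return [
        row
        for row in results
        if any(
            other.year <= row.year and other.score > row.score
            for other in results
        )
    ]


# A subset of the f1 leaderboard rows, for illustration only.
SAMPLE = [
    Result("ALBERT ensemble", 92.2, 2019),
    Result("GPT-4o", 91.4, 2023),
    Result("RoBERTa", 89.8, 2019),
    Result("SLQA+ (single model)", 87.021, 2018),
    Result("BERT Large", 83.1, 2018),
    Result("Logistic Regression (SQuAD baseline)", 51, 2016),
]

for row in muted_rows(SAMPLE):
    print(f"muted: {row.model} ({row.score}, {row.year})")
```

On this subset the filter mutes GPT-4o, RoBERTa, and BERT Large, since each was published in or after a year when a higher score already existed. Tied scores are not treated as "better" here, which is an assumption about how the page handles ties.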

Rank	Model	Notes	Trust	Score	Year	Links
01	ALBERT ensemble	-	unverified	92.2	2019	Paper, Code
02	GPT-4o	GPT-4o few-shot.	verified	91.4	2023	Paper, Source
03	DeBERTa-v3-large	DeBERTa-v3-Large fine-tuned. Source: Table 2, arxiv:2111.09543.	verified	91.4	2021	Paper
04	Gemini 1.5 Pro	Gemini 1.5 Pro few-shot. Source: Gemini 1.5 technical report (2024).	verified	90.5	2024	Paper
05	Claude 3.5 Sonnet	Claude 3.5 Sonnet few-shot on SQuAD 2.0. Reported in model card.	verified	90.2	2024	Paper
06	RoBERTa	-	unverified	89.8	2019	Paper, Code
07	RoBERTa (single model)	SQuAD 2.0 hidden test set. Rank 1 on shadow-page leaderboard.	verified	89.795	2020	Source
08	Enhanced Albert+Verifier3 (ensemble)	Ensemble. SQuAD 2.0 hidden test set.	verified	89.778	2020	Source
09	RoBERTa+Verify (single model)	Single model. SQuAD 2.0 hidden test set.	verified	89.586	2019	Source
10	BERT + ConvLSTM + MTL + Verifier (ensemble)	Ensemble. SQuAD 2.0 hidden test set.	verified	89.286	2019	Source
11	BART	-	unverified	89.2	2019	Paper, Code
12	XLNet+Verifier (single, Google/CMU)	Single model. SQuAD 2.0 hidden test set.	verified	89.082	2019	Source
13	XLNet+Verifier (single, Ping An)	Single model. SQuAD 2.0 hidden test set.	verified	89.063	2019	Source
14	SpanBERT (single model)	Single model. SQuAD 2.0 hidden test set.	verified	88.709	2019	Source
15	Llama 3.1 405B	Llama 3.1 405B Instruct few-shot. Source: Llama 3 paper Table 7.	verified	88.7	2024	Paper
16	BERT + DAE + AoA (single model)	Single model. SQuAD 2.0 hidden test set.	verified	88.621	2019	Source
17	BERT + AoA	BERT + Attention-over-Attention. Reported on SQuAD shadow-page timeline.	verified	88.6	2019	Source
18	XLNet (single, Verified XiaoPAI)	Single model. SQuAD 2.0 hidden test set.	verified	88	2019	Source
19	Insight-baseline-BERT (single model)	Single model. SQuAD 2.0 hidden test set.	verified	87.644	2019	Source
20	Hanvon_model (single model)	Single model. SQuAD 2.0 hidden test set.	verified	87.117	2019	Source
21	SLQA+ (single model)	Single model. SQuAD 2.0 hidden test set.	verified	87.021	2018	Source
22	Qwen2 72B	Qwen2 72B Instruct. Source: Qwen2 technical report (2024).	verified	86.1	2024	Paper
23	Llama 3 70B	Llama 3 70B Instruct. Source: Llama 3 paper.	verified	85.3	2024	Paper
24	BERT (Google AI)	BERT Transformer breakthrough on SQuAD. Reported on SQuAD shadow-page timeline.	verified	83.1	2018	Source
25	BERT Large	-	unverified	83.1	2018	Paper, Code
26	Logistic Regression (SQuAD baseline)	Original SQuAD 1.1 baseline (Rajpurkar et al. 2016). Reported on SQuAD shadow-page timeline.	verified	51	2016	Source

em

EM (exact match) is one of the two evaluation metrics Codesota tracks for SQuAD v2.0. The table lists published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
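
As a rough illustration of what exact match measures, the sketch below scores a prediction as 1 when it equals any reference answer after normalization and 0 otherwise. The names `normalize_answer`, `exact_match`, and `em_score` are assumptions, and the normalization mirrors the convention commonly used for SQuAD evaluation, which this page does not spell out.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """1 if the prediction matches any gold answer after normalization."""
    # An empty gold list marks an unanswerable question; the only
    # correct prediction is then the empty string.
    golds = gold_answers or [""]
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in golds))


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Corpus-level EM on a 0-100 scale, as reported on leaderboards."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Because EM gives no partial credit, it is never higher than token-level F1 for the same predictions, which is consistent with the gap between the two tables for DeBERTa-v3-large and GPT-4o.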

Trust tiers for em: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Notes	Trust	Score	Year	Links
01	DeBERTa-v3-large	DeBERTa-v3-Large fine-tuned. Source: Table 2, arxiv:2111.09543.	verified	88.4	2021	Paper
02	GPT-4o	GPT-4o few-shot.	verified	87.1	2023	Paper, Source
