Codesota · Benchmark · SQuAD v2.0

SQuAD v2.0.

150K questions on Wikipedia articles, including 50K unanswerable questions. The benchmark tests reading comprehension and the ability to recognize when a question cannot be answered.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

f1

F1 is one of the reported evaluation metrics for SQuAD v2.0. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
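As a rough illustration of what an F1 number on this benchmark measures, the sketch below computes token-overlap F1 between a predicted answer and a set of gold answers. It mirrors the normalization steps commonly used for SQuAD-style scoring (lowercase, strip punctuation, drop English articles, collapse whitespace) and treats an unanswerable question as one whose only gold answer is the empty string. The function names are illustrative assumptions, and this is not Codesota's scoring code.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(gold: str, pred: str) -> float:
    """Token-overlap F1 between one gold answer and one prediction."""
    gold_toks = normalize_answer(gold).split()
    pred_toks = normalize_answer(pred).split()
    # If either side is empty (no-answer case), score 1 only when both are empty.
    if not gold_toks or not pred_toks:
        return float(gold_toks == pred_toks)
    overlap = sum((Counter(gold_toks) & Counter(pred_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def squad_f1(golds: list[str], pred: str) -> float:
    """Best F1 over all gold answers; an empty list means 'unanswerable'."""
    return max(token_f1(g, pred) for g in (golds or [""]))
```

A leaderboard score would then be the mean of `squad_f1` over all questions, multiplied by 100.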

Higher is better

Trust tiers for f1: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published: an earlier or same-year result already scored better.
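The muting rule can be sketched in a few lines. The `Row` type and `is_muted` function below are hypothetical names for illustration, and the sketch assumes "scored better" means strictly higher, so ties are not muted. The sample rows are taken from the f1 table on this page.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    year: int


def is_muted(row: Row, rows: list[Row]) -> bool:
    """True if any result from the same or an earlier year scored strictly higher."""
    return any(o.year <= row.year and o.score > row.score for o in rows)


rows = [
    Row("ALBERT ensemble", 92.2, 2019),
    Row("DeBERTa-v3-large", 91.4, 2021),
    Row("SLQA+ (single model)", 87.021, 2018),
    Row("BERT (Google AI)", 83.1, 2018),
]

# Models that would be shown muted under this reading of the rule.
muted = [r.model for r in rows if is_muted(r, rows)]
```

Under this reading, a 2021 result of 91.4 is muted because a 2019 result already reached 92.2.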

| Rank | Model | Trust | Score | Year | Links | Notes |
|---|---|---|---|---|---|---|
| 01 | ALBERT ensemble | unverified | 92.2 | 2019 | Paper, Code | |
| 02 | GPT-4o | verified | 91.4 | 2023 | Paper, Source | GPT-4o few-shot. |
| 03 | DeBERTa-v3-large | verified | 91.4 | 2021 | Paper | DeBERTa-v3-Large fine-tuned. Source: Table 2, arxiv:2111.09543. |
| 04 | Gemini 1.5 Pro | verified | 90.5 | 2024 | Paper | Gemini 1.5 Pro few-shot. Source: Gemini 1.5 technical report (2024). |
| 05 | Claude 3.5 Sonnet | verified | 90.2 | 2024 | Paper | Claude 3.5 Sonnet few-shot on SQuAD 2.0. Reported in model card. |
| 06 | RoBERTa | unverified | 89.8 | 2019 | Paper, Code | |
| 07 | RoBERTa (single model) | verified | 89.795 | 2020 | Source | SQuAD 2.0 hidden test set. Rank 1 on shadow-page leaderboard. |
| 08 | Enhanced Albert+Verifier3 (ensemble) | verified | 89.778 | 2020 | Source | Ensemble. SQuAD 2.0 hidden test set. |
| 09 | RoBERTa+Verify (single model) | verified | 89.586 | 2019 | Source | Single model. SQuAD 2.0 hidden test set. |
| 10 | BERT + ConvLSTM + MTL + Verifier (ensemble) | verified | 89.286 | 2019 | Source | Ensemble. SQuAD 2.0 hidden test set. |
| 11 | BART | unverified | 89.2 | 2019 | Paper, Code | |
| 12 | XLNet+Verifier (single, Google/CMU) | verified | 89.082 | 2019 | Source | Single model. SQuAD 2.0 hidden test set. |
| 13 | XLNet+Verifier (single, Ping An) | verified | 89.063 | 2019 | Source | Single model. SQuAD 2.0 hidden test set. |
| 14 | SpanBERT (single model) | verified | 88.709 | 2019 | Source | Single model. SQuAD 2.0 hidden test set. |
| 15 | Llama 3.1 405B | verified | 88.7 | 2024 | Paper | Llama 3.1 405B Instruct few-shot. Source: Llama 3 paper Table 7. |
| 16 | BERT + DAE + AoA (single model) | verified | 88.621 | 2019 | Source | Single model. SQuAD 2.0 hidden test set. |
| 17 | BERT + AoA | verified | 88.6 | 2019 | Source | BERT + Attention-over-Attention. Reported on SQuAD shadow-page timeline. |
| 18 | XLNet (single, Verified XiaoPAI) | verified | 88 | 2019 | Source | Single model. SQuAD 2.0 hidden test set. |
| 19 | Insight-baseline-BERT (single model) | verified | 87.644 | 2019 | Source | Single model. SQuAD 2.0 hidden test set. |
| 20 | Hanvon_model (single model) | verified | 87.117 | 2019 | Source | Single model. SQuAD 2.0 hidden test set. |
| 21 | SLQA+ (single model) | verified | 87.021 | 2018 | Source | Single model. SQuAD 2.0 hidden test set. |
| 22 | Qwen2 72B | verified | 86.1 | 2024 | Paper | Qwen2 72B Instruct. Source: Qwen2 technical report (2024). |
| 23 | Llama 3 70B | verified | 85.3 | 2024 | Paper | Llama 3 70B Instruct. Source: Llama 3 paper. |
| 24 | BERT (Google AI) | verified | 83.1 | 2018 | Source | BERT Transformer breakthrough on SQuAD. Reported on SQuAD shadow-page timeline. |
| 25 | BERT Large | unverified | 83.1 | 2018 | Paper, Code | |
| 26 | Logistic Regression (SQuAD baseline) | verified | 51 | 2016 | Source | Original SQuAD 1.1 baseline (Rajpurkar et al. 2016). Reported on SQuAD shadow-page timeline. |

em

EM (exact match) is one of the reported evaluation metrics for SQuAD v2.0. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
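To make the metric concrete, here is a minimal sketch of an exact-match score aggregated over a small set of questions. It assumes the usual SQuAD-style answer normalization and treats a question with no gold answers as unanswerable, so only an empty prediction counts as correct. The names `exact_match` and `em_score`, the dict-based input layout, and the choice to count a missing prediction as an empty answer are all assumptions made for illustration.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(golds: list[str], pred: str) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred_norm = normalize_answer(pred)
    # An empty gold list stands for an unanswerable question.
    return float(any(normalize_answer(g) == pred_norm for g in (golds or [""])))


def em_score(references: dict[str, list[str]], predictions: dict[str, str]) -> float:
    """Percentage of questions answered exactly; missing predictions count as empty."""
    hits = sum(exact_match(golds, predictions.get(qid, ""))
               for qid, golds in references.items())
    return 100.0 * hits / len(references)


# Hypothetical two-question example: one answerable, one unanswerable.
references = {"q1": ["Paris"], "q2": []}
predictions = {"q1": "paris.", "q2": "London"}
score = em_score(references, predictions)
```

Here the first prediction matches after normalization and the second wrongly answers an unanswerable question, so half of the questions count as exact matches.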

Higher is better

Trust tiers for em: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published: an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year | Links | Notes |
|---|---|---|---|---|---|---|
| 01 | DeBERTa-v3-large | verified | 88.4 | 2021 | Paper | DeBERTa-v3-Large fine-tuned. Source: Table 2, arxiv:2111.09543. |
| 02 | GPT-4o | verified | 87.1 | 2023 | Paper, Source | GPT-4o few-shot. |

§ 04 · Submit a result

Add to the leaderboard.
