Codesota · Benchmark · SQuAD v2.0

SQuAD v2.0.

About 150K questions on Wikipedia articles, including more than 50K that cannot be answered from the passage. The benchmark tests reading comprehension and whether a model recognizes when a question has no answer in the text.
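To make the answerable/unanswerable split concrete, here is a minimal sketch that counts both kinds of question in SQuAD v2.0-style JSON. It assumes the commonly distributed layout (`data` → `paragraphs` → `qas`, with an `is_impossible` flag on each question). The `SAMPLE` record and the `count_questions` helper are made up for illustration and are not taken from the dataset or from this page.

```python
import json

# Hypothetical two-question sample in the SQuAD v2.0 JSON layout (illustration only).
SAMPLE = json.loads("""
{
  "version": "v2.0",
  "data": [{
    "title": "Example",
    "paragraphs": [{
      "context": "The Eiffel Tower is in Paris.",
      "qas": [
        {"id": "q1", "question": "Where is the Eiffel Tower?",
         "answers": [{"text": "Paris", "answer_start": 23}],
         "is_impossible": false},
        {"id": "q2", "question": "Who painted the Eiffel Tower?",
         "answers": [], "is_impossible": true}
      ]
    }]
  }]
}
""")


def count_questions(dataset: dict) -> tuple[int, int]:
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Questions without the flag are treated as answerable.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


if __name__ == "__main__":
    print(count_questions(SAMPLE))
```

Running the same function over a full dataset file loaded with `json.load` should give the split described above, provided the file follows this layout.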

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.
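The two metrics on this page, f1 and em, are usually computed per question and then averaged and reported as percentages. The sketch below is modeled on the widely used SQuAD-style scoring: answers are normalized, em checks for an exact match, and f1 measures token overlap. An unanswerable question is represented by an empty gold answer, so an empty prediction is the only one that scores. The function names are illustrative, and this page does not state how each listed entry was actually scored.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a prediction and one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: credit only when both sides are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, gold_answers: list[str]) -> tuple[float, float]:
    """Return (em, f1), taking the best score over all gold answers.

    An empty gold list stands for an unanswerable question.
    """
    golds = gold_answers or [""]
    em = max(exact_match(prediction, g) for g in golds)
    f1 = max(f1_score(prediction, g) for g in golds)
    return em, f1


if __name__ == "__main__":
    print(score("the Eiffel Tower", ["Eiffel Tower"]))
    print(score("in Paris France", ["Paris"]))
    print(score("", []))
```

Because f1 gives partial credit for overlapping tokens while em does not, a system's f1 is never lower than its em under this scheme, which matches the pattern in the tables on this page.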

f1

Higher is better

Trust tiers for f1: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | DeBERTa-v3-large | verified | 91.4 | 2021 | DeBERTa-v3-Large fine-tuned. Source: Table 2, arxiv:2111.09543. |
| 02 | GPT-4o | verified | 91.4 | 2023 | GPT-4o few-shot. Source: Papers With Code SQuAD 2.0 leaderboard, 2024. |
| 03 | Gemini 1.5 Pro | verified | 90.5 | 2024 | Gemini 1.5 Pro few-shot. Source: Gemini 1.5 technical report (2024). |
| 04 | Claude 3.5 Sonnet | verified | 90.2 | 2024 | Claude 3.5 Sonnet few-shot on SQuAD 2.0. Reported in model card. |
| 05 | RoBERTa (single model) | verified | 89.795 | 2020 | SQuAD 2.0 hidden test set. Rank 1 on shadow-page leaderboard. |
| 06 | Enhanced Albert+Verifier3 (ensemble) | verified | 89.778 | 2020 | Ensemble. SQuAD 2.0 hidden test set. |
| 07 | RoBERTa+Verify (single model) | verified | 89.586 | 2019 | Single model. SQuAD 2.0 hidden test set. |
| 08 | BERT + ConvLSTM + MTL + Verifier (ensemble) | verified | 89.286 | 2019 | Ensemble. SQuAD 2.0 hidden test set. |
| 09 | XLNet+Verifier (single, Google/CMU) | verified | 89.082 | 2019 | Single model. SQuAD 2.0 hidden test set. |
| 10 | XLNet+Verifier (single, Ping An) | verified | 89.063 | 2019 | Single model. SQuAD 2.0 hidden test set. |
| 11 | SpanBERT (single model) | verified | 88.709 | 2019 | Single model. SQuAD 2.0 hidden test set. |
| 12 | Llama 3.1 405B | verified | 88.7 | 2024 | Llama 3.1 405B Instruct few-shot. Source: Llama 3 paper Table 7. |
| 13 | BERT + DAE + AoA (single model) | verified | 88.621 | 2019 | Single model. SQuAD 2.0 hidden test set. |
| 14 | BERT + AoA | verified | 88.6 | 2019 | BERT + Attention-over-Attention. Reported on SQuAD shadow-page timeline. |
| 15 | XLNet (single, Verified XiaoPAI) | verified | 88 | 2019 | Single model. SQuAD 2.0 hidden test set. |
| 16 | Insight-baseline-BERT (single model) | verified | 87.644 | 2019 | Single model. SQuAD 2.0 hidden test set. |
| 17 | Hanvon_model (single model) | verified | 87.117 | 2019 | Single model. SQuAD 2.0 hidden test set. |
| 18 | SLQA+ (single model) | verified | 87.021 | 2018 | Single model. SQuAD 2.0 hidden test set. |
| 19 | Qwen2 72B | verified | 86.1 | 2024 | Qwen2 72B Instruct. Source: Qwen2 technical report (2024). |
| 20 | Llama 3 70B | verified | 85.3 | 2024 | Llama 3 70B Instruct. Source: Llama 3 paper. |
| 21 | BERT (Google AI) | verified | 83.1 | 2018 | BERT Transformer breakthrough on SQuAD. Reported on SQuAD shadow-page timeline. |
| 22 | Logistic Regression (SQuAD baseline) | verified | 51 | 2016 | Original SQuAD 1.1 baseline (Rajpurkar et al. 2016). Reported on SQuAD shadow-page timeline. |

em

Higher is better

Trust tiers for em: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | DeBERTa-v3-large | verified | 88.4 | 2021 | DeBERTa-v3-Large fine-tuned. Source: Table 2, arxiv:2111.09543. |
| 02 | GPT-4o | verified | 87.1 | 2023 | GPT-4o few-shot. Source: Papers With Code SQuAD 2.0 leaderboard, 2024. |