Question Answering · benchmark dataset · 2018 · EN

Stanford Question Answering Dataset v2.0.

Historical extractive QA benchmark over Wikipedia paragraphs with unanswerable questions. Valuable as a regression test, but saturated for frontier LLM comparison.

Saturated benchmark · last significant update May 2026
§ 01 · Leaderboard

Best published scores.

28 results indexed across 2 metrics. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: f1 · higher is better
All metrics: em, f1
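As a rough illustration of what the two metrics measure, the sketch below follows the widely used SQuAD-style scoring convention: answers are normalized, em checks for an identical string, and f1 measures token overlap, with an empty string standing for "no answer" on unanswerable questions. That Codesota scores submissions exactly this way is an assumption, and the function names are illustrative. Leaderboard values would be these per-question scores averaged over the dataset and multiplied by 100.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between one prediction and one gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Unanswerable case: an empty answer only matches another empty answer.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_question(prediction: str, gold_answers: list[str]) -> tuple[int, float]:
    """Best em and f1 over all gold answers; an empty list means unanswerable."""
    golds = gold_answers or [""]
    em = max(exact_match(prediction, g) for g in golds)
    f1 = max(f1_score(prediction, g) for g in golds)
    return em, f1
```

For an unanswerable question, `score_question("", [])` counts as fully correct, while any non-empty prediction scores zero on both metrics.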
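em · 2 rows

| # | Model | Access | Org | Submitted | Paper / code | em |
|---|---|---|---|---|---|---|
| 01 | DeBERTa-v3-large | Open | Microsoft | Nov 2021 | DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Tra… | 88.40 |
| 02 | GPT-4o | API | OpenAI | Mar 2023 | GPT-4 Technical Report | 87.10 |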
f1 · primary · 26 rows

| # | Model | Access | Org | Submitted | Paper / code | f1 |
|---|---|---|---|---|---|---|
| 01 | ALBERT ensemble | | | Sep 2019 | ALBERT: A Lite BERT for Self-supervised Learning of Lang… · code | 92.20 |
| 02 | GPT-4o | API | OpenAI | Mar 2023 | GPT-4 Technical Report | 91.40 |
| 03 | DeBERTa-v3-large | Open | Microsoft | Nov 2021 | DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Tra… | 91.40 |
| 04 | Gemini 1.5 Pro | API | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 90.50 |
| 05 | Claude 3.5 Sonnet | API | Anthropic | Jun 2024 | Claude 3.5 Sonnet Model Card | 90.20 |
| 06 | RoBERTa | | | Jul 2019 | RoBERTa: A Robustly Optimized BERT Pretraining Approach · code | 89.80 |
| 07 | RoBERTa (single model) | Open | Facebook AI | Jul 2020 | squad-shadow-page | 89.80 |
| 08 | Enhanced Albert+Verifier3 (ensemble) | Open | Microsoft STCA AIC | May 2020 | squad-shadow-page | 89.78 |
| 09 | RoBERTa+Verify (single model) | Open | CW | Nov 2019 | squad-shadow-page | 89.59 |
| 10 | BERT + ConvLSTM + MTL + Verifier (ensemble) | Open | Layer 6 AI | Mar 2019 | squad-shadow-page | 89.29 |
| 11 | BART | | | Oct 2019 | BART: Denoising Sequence-to-Sequence Pre-training for Na… · code | 89.20 |
| 12 | XLNet+Verifier (single, Google/CMU) | Open | Google/CMU | Oct 2019 | squad-shadow-page | 89.08 |
| 13 | XLNet+Verifier (single, Ping An) | Open | Ping An Life Insurance | Aug 2019 | squad-shadow-page | 89.06 |
| 14 | SpanBERT (single model) | Open | FAIR & UW | Jul 2019 | squad-shadow-page | 88.71 |
| 15 | Llama 3.1 405B | Open | Meta | Jul 2024 | The Llama 3 Herd of Models | 88.70 |
| 16 | BERT + DAE + AoA (single model) | Open | HIT & iFLYTEK | Mar 2019 | squad-shadow-page | 88.62 |
| 17 | BERT + AoA | Open | HIT & iFLYTEK | Mar 2019 | squad-shadow-page | 88.60 |
| 18 | XLNet (single, Verified XiaoPAI) | Open | Verified XiaoPAI | Sep 2019 | squad-shadow-page | 88 |
| 19 | Insight-baseline-BERT (single model) | Open | PAII Insight Team | Apr 2019 | squad-shadow-page | 87.64 |
| 20 | Hanvon_model (single model) | Open | Hanvon_WuHan | Sep 2019 | squad-shadow-page | 87.12 |
| 21 | SLQA+ (single model) | Open | Alibaba iDST | Jan 2018 | squad-shadow-page | 87.02 |
| 22 | Qwen2 72B | | Alibaba | Jul 2024 | Qwen2 Technical Report | 86.10 |
| 23 | Llama 3 70B | Open | Meta | Jul 2024 | The Llama 3 Herd of Models | 85.30 |
| 24 | BERT (Google AI) | Open | Google AI | Nov 2018 | squad-shadow-page | 83.10 |
| 25 | BERT Large | | | Oct 2018 | BERT: Pre-training of Deep Bidirectional Transformers fo… · code | 83.10 |
| 26 | Logistic Regression (SQuAD baseline) | Open | Stanford NLP | Jun 2016 | squad-shadow-page | 51 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

5 steps of state of the art.

Each row below marks a model that broke the previous record on f1. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · f1
  1. Jun 1, 2016 · Logistic Regression (SQuAD baseline) · Stanford NLP · 51
  2. Jan 1, 2018 · SLQA+ (single model) · Alibaba iDST · 87.02
  3. Mar 1, 2019 · BERT + ConvLSTM + MTL + Verifier (ensemble) · Layer 6 AI · 89.29
  4. Jul 26, 2019 · RoBERTa · 89.80
  5. Sep 26, 2019 · ALBERT ensemble · 92.20
Fig 3 · SOTA-setting models only. 5 entries span Jun 2016 to Sep 2019.
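The SOTA line can be thought of as a running maximum over dated results: walk the entries in date order and keep only those that strictly beat the best f1 seen so far. The sketch below applies that idea to a subset of the leaderboard rows, using month-level dates as they appear in the table. The `Result` type and `sota_steps` function are hypothetical names, and that Codesota derives the chart this way is an assumption.

```python
from typing import NamedTuple


class Result(NamedTuple):
    year: int
    month: int
    model: str
    f1: float


def sota_steps(results):
    """Return the entries that strictly beat every earlier f1 score."""
    steps = []
    best = float("-inf")
    # sorted() is stable, so entries from the same month keep their input order.
    for r in sorted(results, key=lambda r: (r.year, r.month)):
        if r.f1 > best:
            steps.append(r)
            best = r.f1
    return steps


# A subset of the leaderboard rows, deliberately listed out of date order.
RESULTS = [
    Result(2023, 3, "GPT-4o", 91.40),
    Result(2019, 9, "ALBERT ensemble", 92.20),
    Result(2016, 6, "Logistic Regression (SQuAD baseline)", 51.0),
    Result(2018, 11, "BERT (Google AI)", 83.10),
    Result(2018, 1, "SLQA+ (single model)", 87.02),
    Result(2019, 4, "Insight-baseline-BERT (single model)", 87.64),
    Result(2019, 3, "BERT + ConvLSTM + MTL + Verifier (ensemble)", 89.29),
    Result(2021, 11, "DeBERTa-v3-large", 91.40),
    Result(2019, 7, "RoBERTa", 89.80),
]

for step in sota_steps(RESULTS):
    print(f"{step.year}-{step.month:02d}  {step.model}  {step.f1}")
```

Later entries such as DeBERTa-v3-large and GPT-4o drop out because neither exceeds the 92.20 set in Sep 2019, which is why the line stops there.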
§ 04 · Literature

10 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
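One way to check a submission against this list before sending it is a small self-check over a manifest. The sketch below is purely illustrative: the manifest layout, every field name (`checkpoint_url`, `api_endpoint`, `commit`, `seed`, `environment`, `scores`, `contact`), and the 40-character commit hash rule are assumptions, not a format Codesota documents here. Consult the submission guide for the actual requirements.

```python
import re

# Metrics declared by this dataset on the leaderboard.
REQUIRED_METRICS = {"em", "f1"}


def validate_submission(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means all five items are present."""
    problems = []
    # 01: a public checkpoint or an API endpoint (hypothetical field names)
    if not (manifest.get("checkpoint_url") or manifest.get("api_endpoint")):
        problems.append("need a public checkpoint or an API endpoint")
    # 02: frozen commit + seed (assumes a full 40-character git hash)
    commit = str(manifest.get("commit", ""))
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        problems.append("commit must be a full 40-character hex hash")
    if not isinstance(manifest.get("seed"), int):
        problems.append("seed must be an integer")
    # 03: declared evaluation environment
    env = manifest.get("environment") or {}
    if not env.get("python") or not env.get("dependencies"):
        problems.append("declare the Python version and dependencies")
    # 04: one score per metric declared by the dataset
    missing = REQUIRED_METRICS - set(manifest.get("scores") or {})
    if missing:
        problems.append("missing metrics: " + ", ".join(sorted(missing)))
    # 05: a contact for follow-up
    if not manifest.get("contact"):
        problems.append("contact is required")
    return problems


# Placeholder values only; none of these identify a real submission.
example = {
    "checkpoint_url": "https://example.com/model.bin",
    "commit": "0" * 40,
    "seed": 1234,
    "environment": {"python": "3.11", "dependencies": ["torch"]},
    "scores": {"em": 0.0, "f1": 0.0},
    "contact": "someone@example.com",
}
print(validate_submission(example))
```

Returning a list of problems rather than raising on the first one lets a submitter fix everything in a single pass.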