Codesota · Natural Language Processing · Question Answering · SQuAD v2.0
Question Answering · benchmark dataset · 2018 · EN

Stanford Question Answering Dataset v2.0.

150K questions on Wikipedia articles, including 50K unanswerable questions written to look answerable. A system must both extract answer spans and abstain when the passage supports no answer.

Paper · Download dataset · Submit a result
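For local experiments, the dataset is also mirrored on the Hugging Face hub under the ID squad_v2 (the community mirror, an assumption here; the canonical files sit behind the Download dataset link above). A minimal loading sketch:

```python
# Minimal sketch: load SQuAD v2.0 from the Hugging Face hub mirror.
from datasets import load_dataset

ds = load_dataset("squad_v2")  # splits: train (~130K), validation (~12K)
example = ds["validation"][0]
print(example["question"])
print(example["context"][:200])
print(example["answers"])  # an empty "text" list marks an unanswerable question
```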
§ 01 · Leaderboard

Best published scores.

24 results indexed across 2 metrics. Shaded row marks current SOTA; ties broken by submission date.
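The two metrics are em (exact string match after light answer normalization) and f1 (token-level overlap between predicted and gold answers); on unanswerable questions a system scores only by predicting the empty string. A minimal Python sketch of the scoring, simplified from the official SQuAD evaluation script (real scoring takes the max over all gold answers per question):

```python
# Sketch of SQuAD-style em/f1 scoring, simplified from the official
# evaluation script (which additionally maxes over all gold answers).
import re
import string
from collections import Counter

def normalize(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    # Unanswerable questions are scored via empty strings: both empty -> 1.0.
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```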


Metrics: f1 (primary · higher is better) · em.

em · 2 rows

| # | Model | Access | Org | Submitted | Paper / code | em |
|---|-------|--------|-----|-----------|--------------|----|
| 01 | DeBERTa-v3-large | OSS | Microsoft | Nov 2021 | DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Tra… | 88.40 |
| 02 | GPT-4o | API | OpenAI | Mar 2023 | GPT-4 Technical Report | 87.10 |
f1 · primary · 22 rows

| # | Model | Access | Org | Submitted | Paper / code | f1 |
|---|-------|--------|-----|-----------|--------------|----|
| 01 | GPT-4o | API | OpenAI | Mar 2023 | GPT-4 Technical Report | 91.40 |
| 02 | DeBERTa-v3-large | OSS | Microsoft | Nov 2021 | DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Tra… | 91.40 |
| 03 | Gemini 1.5 Pro | API | Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 90.50 |
| 04 | Claude 3.5 Sonnet | API | Anthropic | Jun 2024 | Claude 3.5 Sonnet Model Card | 90.20 |
| 05 | RoBERTa (single model) | OSS | Facebook AI | Jul 2020 | squad-shadow-page | 89.80 |
| 06 | Enhanced Albert+Verifier3 (ensemble) | OSS | Microsoft STCA AIC | May 2020 | squad-shadow-page | 89.78 |
| 07 | RoBERTa+Verify (single model) | OSS | CW | Nov 2019 | squad-shadow-page | 89.59 |
| 08 | BERT + ConvLSTM + MTL + Verifier (ensemble) | OSS | Layer 6 AI | Mar 2019 | squad-shadow-page | 89.29 |
| 09 | XLNet+Verifier (single, Google/CMU) | OSS | Google/CMU | Oct 2019 | squad-shadow-page | 89.08 |
| 10 | XLNet+Verifier (single, Ping An) | OSS | Ping An Life Insurance | Aug 2019 | squad-shadow-page | 89.06 |
| 11 | SpanBERT (single model) | OSS | FAIR & UW | Jul 2019 | squad-shadow-page | 88.71 |
| 12 | Llama 3.1 405B | OSS | Meta | Jul 2024 | The Llama 3 Herd of Models | 88.70 |
| 13 | BERT + DAE + AoA (single model) | OSS | HIT & iFLYTEK | Mar 2019 | squad-shadow-page | 88.62 |
| 14 | BERT + AoA | OSS | HIT & iFLYTEK | Mar 2019 | squad-shadow-page | 88.60 |
| 15 | XLNet (single, Verified XiaoPAI) | OSS | Verified XiaoPAI | Sep 2019 | squad-shadow-page | 88 |
| 16 | Insight-baseline-BERT (single model) | OSS | PAII Insight Team | Apr 2019 | squad-shadow-page | 87.64 |
| 17 | Hanvon_model (single model) | OSS | Hanvon_WuHan | Sep 2019 | squad-shadow-page | 87.12 |
| 18 | SLQA+ (single model) | OSS | Alibaba iDST | Jan 2018 | squad-shadow-page | 87.02 |
| 19 | Qwen2 72B | | Alibaba | Jul 2024 | Qwen2 Technical Report | 86.10 |
| 20 | Llama 3 70B | OSS | Meta | Jul 2024 | The Llama 3 Herd of Models | 85.30 |
| 21 | BERT (Google AI) | OSS | Google AI | Nov 2018 | squad-shadow-page | 83.10 |
| 22 | Logistic Regression (SQuAD baseline) | OSS | Stanford NLP | Jun 2016 | squad-shadow-page | 51 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

7 steps of state of the art.

Each row below marks a model that broke the previous record on f1. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.
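In other words, the SOTA line is a running maximum over date-sorted submissions. A minimal sketch of that derivation, using a subset of the rows above for illustration:

```python
# Sketch: derive the SOTA line as a running max over date-sorted rows.
# Tuples are (date, model, f1), taken from the leaderboard above.
from datetime import date

rows = [
    (date(2016, 6, 1), "Logistic Regression (SQuAD baseline)", 51.00),
    (date(2018, 1, 1), "SLQA+ (single model)", 87.02),
    (date(2019, 3, 1), "BERT + ConvLSTM + MTL + Verifier (ensemble)", 89.29),
    (date(2021, 11, 18), "DeBERTa-v3-large", 91.40),
]

best = float("-inf")
for when, model, score in sorted(rows):
    if score > best:  # only a strict improvement breaks the record
        best = score
        print(f"{when}  {model}  {score:.2f}")
```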

SOTA line · f1
  1. Jun 1, 2016 · Logistic Regression (SQuAD baseline) · Stanford NLP · 51
  2. Jan 1, 2018 · SLQA+ (single model) · Alibaba iDST · 87.02
  3. Mar 1, 2019 · BERT + ConvLSTM + MTL + Verifier (ensemble) · Layer 6 AI · 89.29
  4. Nov 1, 2019 · RoBERTa+Verify (single model) · CW · 89.59
  5. May 1, 2020 · Enhanced Albert+Verifier3 (ensemble) · Microsoft STCA AIC · 89.78
  6. Jul 1, 2020 · RoBERTa (single model) · Facebook AI · 89.80
  7. Nov 18, 2021 · DeBERTa-v3-large · Microsoft · 91.40
Fig 3 · SOTA-setting models only. 7 entries span Jun 2016 to Nov 2021.
§ 04 · Literature

6 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed (a minimal sketch follows this list)
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
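A minimal sketch of such a reproduction script, assuming a Hugging Face checkpoint and the hub mirror of the dataset (the checkpoint ID and seed below are illustrative, not requirements of the submission process):

```python
# Sketch of a minimal reproduction script: pin the seed, load a public
# checkpoint, and score the SQuAD v2.0 validation split. The checkpoint
# ID is illustrative; substitute your own submission.
import random

import numpy as np
import torch
from datasets import load_dataset
from transformers import pipeline

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
ds = load_dataset("squad_v2", split="validation")

for ex in ds.select(range(3)):  # a full run iterates the whole split
    out = qa(question=ex["question"], context=ex["context"],
             handle_impossible_answer=True)  # allow empty-string abstention
    print(out["answer"] or "<no answer>", "| gold:", ex["answers"]["text"])
```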