Multi-hop questions, graded on F1.
HotpotQA is the multi-hop question-answering benchmark built on Wikipedia: each question requires a system to reason across two or more paragraphs drawn from different articles. The test is whether a model can actually chain facts, not just retrieve one.
Answer F1, ranked.
Harmonic mean of precision and recall over shared answer tokens after normalisation. (higher is better)
| # | Model | Answer F1 | Verified | Source |
|---|---|---|---|---|
| 01 | gpt-4o | 71.3 | — | src |
| 02 | claude-35-sonnet | 68.5 | — | src |
F1, on short free-form answers.
HotpotQA reports answer F1 — the harmonic mean of precision and recall over the tokens shared between the predicted answer and the gold answer, after lower-casing and punctuation stripping. It is the standard metric for short free-form QA, more forgiving than exact match but still demanding that the extracted span actually contain the answer.
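The token-overlap F1 described above can be sketched in a few lines. This is an approximation, not the authors' reference script; the exact normalisation steps (article stripping in particular) are assumptions modelled on common short-answer QA scoring.

```python
# Sketch of answer F1 over normalised tokens. Normalisation here
# (lower-case, strip punctuation, drop articles) is an assumption
# modelled on standard short-answer QA scoring, not the official script.
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lower-case, strip punctuation, drop a/an/the, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def answer_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalisation
```

Note how the metric is more forgiving than exact match: a prediction of "Barack Obama" against a gold answer of "Obama" scores F1 ≈ 0.67 rather than 0.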
The benchmark separately scores supporting-fact prediction. The rows below focus on the headline answer F1; supporting-fact scores are tracked in the full registry.
113K questions, two Wikipedia paragraphs each.
HotpotQA contains roughly 113,000 question–answer pairs; every question is written so that the answer requires reasoning over at least two Wikipedia paragraphs. The dataset also includes annotated sentence-level supporting facts, which lets the benchmark measure whether the model used the right evidence, not just whether it produced the right string.
Two settings are in common use: distractor (a small pool of paragraphs is given, most of them irrelevant) and full-wiki (the model must retrieve from all of Wikipedia). The scores below report the distractor setting unless noted.
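To make the distractor setting concrete, here is a synthetic record in the shape of the public release (field names `question`, `answer`, `context` as [title, sentences] pairs, `supporting_facts` as [title, sentence-index] pairs are taken from that release; the content itself is invented for illustration, and real records carry ten context paragraphs, not three):

```python
# Synthetic record in the HotpotQA distractor format. Field names follow
# the public release; the question, answer, and paragraphs are invented.
record = {
    "question": "Which city is home to the older of the two towers?",
    "answer": "Paris",
    "context": [  # ten paragraphs in the real data; three shown here
        ["Eiffel Tower", ["The Eiffel Tower is in Paris.", "It opened in 1889."]],
        ["CN Tower", ["The CN Tower is in Toronto.", "It opened in 1976."]],
        ["Louvre", ["The Louvre is a museum in Paris."]],  # distractor
    ],
    "supporting_facts": [  # [title, sentence index] pairs
        ["Eiffel Tower", 1],
        ["CN Tower", 1],
    ],
}

# The annotated sentences are the gold evidence; every context paragraph
# whose title never appears in supporting_facts is a distractor.
gold = {(title, idx) for title, idx in record["supporting_facts"]}
evidence_titles = {title for title, _ in gold}
distractors = [t for t, _ in record["context"] if t not in evidence_titles]
print(distractors)  # ['Louvre']
```

The sentence-level `supporting_facts` annotations are what let the benchmark score evidence selection separately from the answer string, as noted above.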
Reported, then reproduced.
Closed-API models are evaluated through the vendor endpoint with the model version and access date recorded. F1 is computed with the reference script distributed by the HotpotQA authors, so scores are comparable across years.
Full policy: /methodology.