Codesota · QA · Benchmark · HotpotQA · 2 scored runs · 2 distinct models · Updated 2026-04-20
§ 00 · Opening

Multi-hop questions, graded on F1.

HotpotQA is the multi-hop question-answering benchmark built on Wikipedia: each question requires a system to reason across two or more paragraphs drawn from different articles. The test is whether a model can actually chain facts, not just retrieve one.

§ 01 · Leaderboard · Answer F1

Answer F1, ranked.

Harmonic mean of precision and recall over shared answer tokens after normalisation. (higher is better)

#    Model               Answer F1   Verified                 Source
01   gpt-4o              71.3        Non-API entry from src   src
02   claude-35-sonnet    68.5        Non-API entry from src   src

Fig · 2 results on Answer F1. Rows sourced from benchmarks.json; the top row marks current SOTA.

§ 02 · What it measures

F1, on short free-form answers.

HotpotQA reports answer F1 — the harmonic mean of precision and recall over the tokens shared between the predicted answer and the gold answer, after lower-casing and punctuation stripping. It is the standard metric for short free-form QA, more forgiving than exact match but still demanding that the extracted span actually contain the answer.
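
To make the metric concrete, here is a minimal sketch of token-level answer F1. It mirrors the normalisation steps that standard QA eval scripts apply (lower-casing, punctuation and article stripping, whitespace collapsing), but it is an illustrative reimplementation, not the official HotpotQA evaluator; the function names are our own.

    import re
    import string
    from collections import Counter

    def normalize(s: str) -> str:
        # Lower-case, strip punctuation and articles, collapse whitespace.
        s = s.lower()
        s = "".join(ch for ch in s if ch not in string.punctuation)
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())

    def answer_f1(prediction: str, gold: str) -> float:
        pred_tokens = normalize(prediction).split()
        gold_tokens = normalize(gold).split()
        # Each shared token counts at most min(pred count, gold count) times.
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    answer_f1("The Eiffel Tower.", "Eiffel Tower")  # -> 1.0 after normalisation
    answer_f1("a tower in Paris", "Eiffel Tower")   # -> 0.4 (partial overlap)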

The benchmark separately scores supporting-fact prediction. The rows above focus on the headline answer F1; supporting-fact scores are tracked in the full registry.

§ 03 · Dataset details

113K questions, two Wikipedia paragraphs each.

HotpotQA contains roughly 113,000 question–answer pairs; every question is written so that the answer requires reasoning over at least two Wikipedia paragraphs. The dataset also includes annotated sentence-level supporting facts, which lets the benchmark measure whether the model used the right evidence, not just whether it produced the right string.

Two settings are in common use: distractor (a small pool of paragraphs is given, most of them irrelevant) and full-wiki (the model must retrieve from all of Wikipedia). The scores above report the distractor setting unless noted.
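
Both settings are downloadable via the Hugging Face datasets hub; the sketch below assumes the hotpot_qa dataset card, its distractor and fullwiki configurations, and the field names used there (check the card if these have changed).

    from datasets import load_dataset

    # Distractor setting: each question comes with 10 paragraphs,
    # the 2 gold ones plus 8 retrieved distractors.
    distractor = load_dataset("hotpot_qa", "distractor", split="validation")

    # Full-wiki setting: no paragraph pool; retrieval is the system's job.
    fullwiki = load_dataset("hotpot_qa", "fullwiki", split="validation")

    ex = distractor[0]
    print(ex["question"], ex["answer"])
    print(ex["supporting_facts"])  # sentence-level evidence annotations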

§ 04 · How scores are verified

Reported, then reproduced.

Closed-API models are evaluated through the vendor endpoint with the model version and access date recorded. F1 is computed with the reference script distributed by the HotpotQA authors, so scores are comparable across years.
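
For orientation, here is a sketch of what the reference evaluator consumes, based on our reading of hotpot_evaluate_v1.py in the authors' repository; the file names and JSON layout should be verified against the repo, and the question id below is a placeholder.

    import json

    # The reference script expects one JSON object with an "answer" map
    # (question id -> answer string) and an "sp" map (question id ->
    # list of [paragraph title, sentence index] supporting facts).
    predictions = {
        "answer": {"example-question-id": "yes"},
        "sp": {"example-question-id": [["Some Article Title", 0]]},
    }

    with open("pred.json", "w") as f:
        json.dump(predictions, f)

    # Scoring is then a single command against the gold file, e.g.:
    #   python hotpot_evaluate_v1.py pred.json hotpot_dev_distractor_v1.json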

Full policy: /methodology.

§ Final · Related QA benchmarks

Cross-links, sibling leaderboards.