QA is not one task. Span extraction, open-domain retrieval, multi-hop reasoning, and conversational answers have different failure modes. Start from the evidence source and the answer shape you need.
Question answering systems map a question plus evidence to an answer, a refusal, or a cited explanation. Extractive QA measures whether the model can find a span in a passage; open-domain QA adds retrieval; multi-hop QA tests whether the system can combine evidence across documents. The modern production version is usually retrieval-augmented generation (RAG) with citation and abstention checks.
One leaderboard rarely captures the task. Use the canonical benchmark to place a model in the lineage of prior results, then add harder or more domain-specific checks before choosing a model.
| Benchmark | Role | Metric | Caveat |
|---|---|---|---|
| SQuAD 2.0 | Extractive QA lineage | Exact Match / F1 | Saturated and passage-bound; useful for span extraction, not broad QA reliability. |
| Natural Questions | Open-domain QA | Long answer / short answer F1 | Closer to search QA, but still rewards answer overlap more than source faithfulness. |
| HotpotQA | Multi-hop reasoning | Joint EM / F1 | Tests linked evidence, but systems can exploit dataset artifacts without robust reasoning. |
| RAG eval | Production QA | Groundedness / citation support / refusal rate | Needs local documents and human review for high-liability domains. |
The public benchmark is a shortlist signal. Production choice still depends on latency, cost, domain drift, and how expensive mistakes are.
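The Exact Match and F1 metrics named in the benchmark table can be sketched in a few lines. This is a simplified illustration in the style of SQuAD-like evaluation (lowercase, strip punctuation and English articles, then compare tokens), not the official scorer; the function names are assumptions made for this example, and an unanswerable question is assumed to be represented as an empty string.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against one gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    # Assumption: "no answer" is the empty string, so an abstention only
    # scores when the gold answer is also empty.
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(metric, prediction: str, golds: list[str]) -> float:
    """Score against every reference answer and keep the best match."""
    return max(metric(prediction, gold) for gold in golds)
```

For example, `exact_match("The Eiffel Tower", "eiffel tower")` returns `1.0`, while `token_f1("in Paris, France", "Paris")` returns `0.5`: the prediction contains the answer but is penalized for extra tokens. Neither score checks whether the supplied passage supports the answer, which is the caveat the table raises.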
| Axis | Value | Why it matters |
|---|---|---|
| Extractive QA | SQuAD 2.0 | Best when the answer must be a span from supplied context. |
| Knowledge QA | TriviaQA / Natural Questions | Tests retrieval and answer selection from broader evidence. |
| Modern production QA | RAG + answer verification | Most real systems need retrieval, citation, and refusal behavior. |
| Failure mode | Fluent hallucinated answer | Measure groundedness and source support, not just answer text overlap. |
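Groundedness, citation support, and refusal rate are usually aggregated from reviewed examples rather than computed from string overlap. A minimal sketch follows, assuming each evaluated question has already been labeled by a reviewer or an automated judge; the `EvalRecord` schema and the metric names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    answerable: bool  # label: do the local documents contain the answer?
    refused: bool     # did the system abstain?
    supported: bool   # label: does the cited source back the answer? (unused when refused)


def _rate(numerator: int, denominator: int) -> float:
    """Safe ratio that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def summarize(records: list[EvalRecord]) -> dict[str, float]:
    answered = [r for r in records if not r.refused]
    unanswerable = [r for r in records if not r.answerable]
    return {
        # Share of all questions where the system abstained.
        "refusal_rate": _rate(len(records) - len(answered), len(records)),
        # Share of given answers whose citation actually supports them.
        "citation_support": _rate(sum(r.supported for r in answered), len(answered)),
        # Share of unanswerable questions where the system correctly abstained.
        "correct_refusal": _rate(sum(r.refused for r in unanswerable), len(unanswerable)),
    }
```

Reporting refusals separately for unanswerable questions matters because a high overall refusal rate can hide a system that abstains on easy questions while still answering unsupported ones.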
An extractive reader is cheaper and more auditable when the answer is a literal span.
Retrieval controls freshness and lets the answer cite evidence.
For multi-hop questions, the model must combine facts across passages and show support.
For production use, add a second pass that checks source coverage, contradiction, and abstention.
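The retrieve, read, and verify steps can be sketched as a toy pipeline. Everything here is an assumption for illustration: retrieval is plain token overlap over an in-memory, non-empty dictionary of passages, `reader` is a hypothetical callable standing in for whatever model produces the answer, and the second pass only checks that the answer is a literal span of the cited passage. Contradiction checks would need an entailment model and are not covered.

```python
import re
from typing import Callable, Optional


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: dict[str, str]) -> tuple[str, float]:
    """Return the id of the passage sharing the most question tokens, plus the overlap ratio."""
    q = tokens(question)

    def score(pid: str) -> float:
        return len(q & tokens(passages[pid])) / max(len(q), 1)

    best = max(passages, key=score)
    return best, score(best)


def answer_with_checks(
    question: str,
    passages: dict[str, str],
    reader: Callable[[str, str], Optional[str]],
    min_overlap: float = 0.3,  # hypothetical threshold; tune on local data
) -> dict[str, Optional[str]]:
    pid, overlap = retrieve(question, passages)
    # Abstain when retrieval finds nothing that looks relevant.
    if overlap < min_overlap:
        return {"answer": None, "citation": None, "reason": "weak retrieval"}
    candidate = reader(question, passages[pid])
    # Second pass: the answer must appear literally in the cited passage.
    if not candidate or candidate.lower() not in passages[pid].lower():
        return {"answer": None, "citation": None, "reason": "unsupported answer"}
    return {"answer": candidate, "citation": pid, "reason": "supported"}
```

Returning the reason alongside the answer keeps abstentions auditable: a reviewer can tell whether the system refused because retrieval was weak or because the reader produced text the source does not contain.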