Question answering task router

Question answering (QA) is not one task. Span extraction, open-domain retrieval, multi-hop reasoning, and conversational answering have different failure modes. Start from the evidence source and the answer shape you need.

Benchmark

SQuAD - TriviaQA - Natural Questions

Current pick

GPT-5 / Claude 4

01 - Explainer

What this task measures.

Question answering systems map a question plus evidence to an answer, a refusal, or a cited explanation. Extractive QA measures whether the model can find the answer span in a passage; open-domain QA adds retrieval; multi-hop QA tests whether the system can combine evidence across documents. The modern production version is usually retrieval-augmented generation (RAG) with citation and abstention checks.
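To make the "answer, refusal, or cited explanation" contract concrete, here is a minimal sketch of a retrieve-then-answer loop that abstains when evidence is weak. Everything in it is an assumption for illustration: the function names (`retrieve`, `answer_or_abstain`), the token-overlap scoring, and the `min_score` threshold are hypothetical stand-ins, and a real system would use a proper retriever and call a generator where the placeholder sits.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a toy stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: dict[str, str], k: int = 2) -> list[tuple[float, str]]:
    """Score each passage by the fraction of question tokens it contains.

    Hypothetical lexical retriever; returns the top-k (score, doc_id) pairs.
    """
    q_tokens = tokenize(question)
    scored = []
    for doc_id, text in corpus.items():
        score = len(q_tokens & tokenize(text)) / len(q_tokens) if q_tokens else 0.0
        scored.append((score, doc_id))
    # Highest score first; ties broken by doc_id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


def answer_or_abstain(question: str, corpus: dict[str, str], min_score: float = 0.5) -> dict:
    """Return an answer with citations, or abstain if no passage clears min_score."""
    hits = retrieve(question, corpus)
    if not hits or hits[0][0] < min_score:
        return {"answer": None, "citations": [], "abstained": True}
    cited = [doc_id for score, doc_id in hits if score >= min_score]
    # Placeholder: a real pipeline would pass the cited passages to a generator.
    return {"answer": corpus[cited[0]], "citations": cited, "abstained": False}


corpus = {
    "d1": "The capital of France is Paris.",
    "d2": "Mount Everest is the highest mountain.",
}
print(answer_or_abstain("What is the capital of France?", corpus))
print(answer_or_abstain("Who invented the telephone?", corpus))
```

The point of the sketch is the output shape: every response carries either citations or an explicit abstention flag, which is what the citation and abstention checks later inspect.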

02 - Benchmarks

Use a benchmark ladder.

One leaderboard rarely captures the task. Use the canonical benchmark for lineage, then add harder or more domain-specific checks before choosing a model.

Benchmark	Role	Metric	Caveat
SQuAD 2.0	Extractive QA lineage	Exact Match / F1	Saturated and passage-bound; useful for span extraction, not broad QA reliability.
Natural Questions	Open-domain QA	Long answer / short answer F1	Closer to search QA, but still rewards answer overlap more than source faithfulness.
HotpotQA	Multi-hop reasoning	Joint EM / F1	Tests linked evidence, but systems can exploit dataset artifacts without robust reasoning.
RAG eval	Production QA	Groundedness / citation support / refusal rate	Needs local documents and human review for high-liability domains.
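The Exact Match and F1 metrics in the table can be sketched as follows. This is a simplified version that follows the commonly used SQuAD-style normalization (lowercasing, stripping punctuation and English articles); it is not the official evaluation script, and the function names are chosen here for illustration. Treating two empty strings as a match mirrors the SQuAD 2.0 convention for unanswerable questions.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as agreement (e.g. a correct "no answer").
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 3))
```

Both metrics reward string overlap with a reference answer. Neither checks whether the answer is supported by the evidence, which is why the table treats groundedness as a separate measure.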

03 - Evaluation

What to compare.

The public benchmark is a shortlist signal. Production choice still depends on latency, cost, domain drift, and how expensive mistakes are.

Axis	Value	Why it matters
Extractive QA	SQuAD 2.0	Best when the answer must be a span from supplied context.
Knowledge QA	TriviaQA / Natural Questions	Tests retrieval and answer selection from broader evidence.
Modern production QA	RAG + answer verification	Most real systems need retrieval, citation, and refusal behavior.
Failure mode	Fluent hallucinated answer	Measure groundedness and source support, not just answer text overlap.
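As a rough illustration of measuring source support rather than answer overlap, the sketch below scores each claim against the passage it cites. The token-overlap score is only a crude lexical proxy: a production check would more likely use an entailment model or human review. The names (`support_score`, `citation_support_rate`), the stopword list, and the 0.8 threshold are all assumptions made for this example.

```python
import re

# Small hypothetical stopword list; tune or replace for real use.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "was", "are", "were",
             "to", "and", "or", "by", "for", "with", "at"}


def content_tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def support_score(claim: str, passage: str) -> float:
    """Fraction of the claim's content tokens that appear in the passage."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & content_tokens(passage)) / len(claim_tokens)


def citation_support_rate(claims: list[tuple[str, str]],
                          passages: dict[str, str],
                          threshold: float = 0.8) -> float:
    """Share of (claim, cited_passage_id) pairs whose score clears the threshold.

    A citation pointing at a missing passage counts as unsupported.
    """
    if not claims:
        return 0.0
    supported = sum(
        1 for claim, passage_id in claims
        if support_score(claim, passages.get(passage_id, "")) >= threshold
    )
    return supported / len(claims)


passages = {"p1": "Marie Curie won the Nobel Prize in Physics in 1903."}
claims = [
    ("Marie Curie won the Nobel Prize in 1903", "p1"),  # supported by p1
    ("Marie Curie was born in Berlin", "p1"),           # fluent but unsupported
]
print(citation_support_rate(claims, passages))
```

The second claim reads fluently and cites a real passage, yet the passage does not back it. An overlap metric against a gold answer would not surface that gap.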

04 - Routing

Pick by task shape.

Task shape	Pick	Why
Answer from known document	Extractive reader	Cheaper and more auditable when the answer is a literal span.
Answer from corpus	Retriever + generator	Retrieval controls freshness and lets the answer cite evidence.
Multi-hop question	Reasoning LLM + citations	The model must combine facts across passages and show support.
High-liability QA	Answer verifier	Add a second pass for source coverage, contradiction, and abstention.
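The routing rules above can be sketched as a small lookup. The `QATask` fields and the `route` function are hypothetical names for this example, and the precedence (multi-hop overrides the evidence source, and the verifier is appended as an extra pass) is one reasonable reading of the guidance, not something the guidance itself specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QATask:
    """Hypothetical description of a QA workload."""
    evidence: str                 # "document" (known passage) or "corpus"
    multi_hop: bool = False       # answer needs facts from several passages
    high_liability: bool = False  # mistakes are expensive


def route(task: QATask) -> list[str]:
    """Return the pipeline stages suggested by the task shape."""
    if task.evidence not in ("document", "corpus"):
        raise ValueError(f"unknown evidence source: {task.evidence!r}")

    if task.multi_hop:
        # Assumption: multi-hop takes precedence over the evidence source.
        stages = ["reasoning LLM + citations"]
    elif task.evidence == "document":
        stages = ["extractive reader"]
    else:
        stages = ["retriever + generator"]

    if task.high_liability:
        # Second pass for source coverage, contradiction, and abstention.
        stages.append("answer verifier")
    return stages


print(route(QATask(evidence="document")))
print(route(QATask(evidence="corpus", high_liability=True)))
print(route(QATask(evidence="corpus", multi_hop=True, high_liability=True)))
```

Modeling the verifier as an appended stage rather than a separate route reflects that it is a second pass over whatever system produced the answer.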
