Question Answering

Answer questions based on context or knowledge. Foundation for chatbots, search, and knowledge systems.

How Question Answering Works

A technical deep-dive into Question Answering systems. From extractive span prediction to generative reasoning, and from fine-tuned models to RAG pipelines.

1. The Core Insight

Why question answering is fundamentally different from search.

The Problem

Information is buried in text. Users have questions, but documents have paragraphs. Traditional search returns documents, not answers. You ask 'When was Einstein born?' and get 10 articles about Einstein instead of '1879.'

The Solution

Train models to read passages and either extract exact answer spans (extractive QA) or generate natural language answers (generative QA). The model becomes a reading comprehension machine.

The Key Idea

QA models don't just match keywords - they understand the question's intent and locate or synthesize the precise information needed to answer it.

Search Returns Documents. QA Returns Answers.

Traditional Search
Query: "When was Einstein born?"
Results:
  • "Albert Einstein - Biography | Life and Career..."
  • "Einstein: A Life of Genius | History Channel..."
  • "10 Facts About Albert Einstein You Didn't Know..."
The user must read the documents to find the answer.

Question Answering
Question: "When was Einstein born?"
Answer: March 14, 1879 (from: Albert Einstein Biography, score: 0.98)
A direct answer with a source citation.

2. Extractive vs Generative QA

The fundamental choice: copy from the text or generate an answer.

Extractive QA
How It Works: the model outputs start and end token positions within the passage (see the code sketch after this card).
Example
Context:
Albert Einstein was born on March 14, 1879, in Ulm, Germany. He developed the theory of relativity.
Question:
When was Einstein born?
Answer:
March 14, 1879
Pros
+ Always grounded in source text
+ No hallucination risk
+ Easy to verify and cite
+ Fast inference
Cons
- Cannot synthesize information
- Limited to verbatim text
- Fails if answer is paraphrased
- Cannot handle yes/no questions naturally
Common Models: BERT-QA, RoBERTa-QA, ALBERT, DistilBERT
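
To make the span prediction concrete, here is a minimal sketch that reads the raw start/end logits directly instead of going through the pipeline helper. It reuses the deepset/roberta-base-squad2 checkpoint from the code section below; variable names are illustrative, and edge cases (e.g. end before start) are not handled.

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "When was Einstein born?"
context = "Albert Einstein was born on March 14, 1879, in Ulm, Germany."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model scores every token as a candidate start and end of the answer span
start_idx = int(outputs.start_logits.argmax())
end_idx = int(outputs.end_logits.argmax())

answer_ids = inputs["input_ids"][0, start_idx : end_idx + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
# Expected: March 14, 1879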

3. Interactive Demo: See QA in Action

Watch how extractive QA highlights answer spans in the passage.

Passage
Albert Einstein was born on March 14, 1879, in Ulm, in the Kingdom of Württemberg in the German Empire. He developed the theory of special relativity in 1905 while working as a patent clerk in Bern, Switzerland. His famous equation E=mc² showed the equivalence of mass and energy. Einstein received the Nobel Prize in Physics in 1921, not for relativity, but for his explanation of the photoelectric effect.
Notice the Difference
Extractive QA copies exactly from the text (positions 28-42). Generative QA synthesizes a natural response that may rephrase or add context.

4. The Context Window Problem

How much text can your model read at once?

Every QA model has a limit on how much text it can read at once. BERT reads ~512 tokens. GPT-4 reads ~128K tokens. When your documents exceed this limit, you must choose: truncate and miss information, or retrieve relevant passages first (RAG).

Context Window Sizes

BERT-base: 512 tokens (~1 page)
RoBERTa-large: 512 tokens (~1 page)
Longformer: 4,096 tokens (~8 pages)
GPT-4-turbo: 128,000 tokens (~300 pages)
Claude 3.5: 200,000 tokens (~500 pages)
When Context is Too Small
  • - Must chunk documents and retrieve relevant pieces
  • - Risk missing information if retrieval fails
  • - Cannot answer questions spanning multiple sections
Long Context Advantage
  • + Can read entire documents without chunking
  • + No retrieval errors possible
  • + Simpler pipeline (no vector DB needed)
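
When documents do exceed the window, the usual first step is token-based chunking. The sketch below splits a long text into overlapping windows that fit a 512-token extractive model; chunk_by_tokens, max_tokens, and stride are illustrative names and values, not recommendations.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_by_tokens(text, max_tokens=384, stride=64):
    # Tokenize once, then slide a window over the token ids.
    # max_tokens stays below 512 to leave room for the question
    # and special tokens when a chunk is later paired with a query.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - stride  # overlap so answers on chunk borders are not lost
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start : start + max_tokens]))
        if start + max_tokens >= len(ids):
            break
    return chunks

# Each chunk can then be embedded for retrieval or passed to a QA model directly.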

5. RAG vs Fine-tuned vs Hybrid

Three fundamentally different ways to add domain knowledge to your QA system.

Fine-tuned Model

Train the model on your specific domain data

Question -> Fine-tuned Model -> Answer
Pros
+ Fast inference (no retrieval)
+ Consistent domain expertise
+ Works offline
Cons
- Expensive to update (retrain)
- Knowledge gets stale
- Limited to training data
- Can hallucinate confidently
Best for: Static domains with stable knowledge (medical terminology, legal definitions)
RAG Pipeline

Retrieve relevant documents, then generate answer from context

Question -> Retrieve -> Context + LLM -> Answer
Pros
+ Easy to update (just update docs)
+ Always current information
+ Citable sources
+ No retraining needed
Cons
- Retrieval quality is critical
- Higher latency (two-stage)
- More infrastructure
- Can fail if retrieval misses
Best for: Dynamic content, enterprise search, knowledge bases that change
Hybrid

Fine-tuned model with RAG augmentation

Question -> Retrieve -> Fine-tuned + Context -> Answer
Pros
+ Best accuracy
+ Domain expertise + fresh data
+ Graceful degradation
Cons
- Most complex
- Highest cost
- Two systems to maintain
Best for: Production systems requiring both accuracy and freshness

Quick Decision Guide

Knowledge changes frequently? -> Use RAG
Need specialized domain reasoning? -> Fine-tune
Production system with both needs? -> Hybrid
Simple use case, limited budget? -> Start with RAG (see the sketch below)
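
A minimal retrieve-then-read sketch of the RAG flow above, assuming the sentence-transformers library for retrieval and the same extractive reader used in the code section. The three-document list stands in for a real document store, and the model choices are illustrative, not prescriptive.

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Toy in-memory "knowledge base" standing in for a real document store
docs = [
    "Albert Einstein was born on March 14, 1879, in Ulm, Germany.",
    "Einstein received the 1921 Nobel Prize in Physics for the photoelectric effect.",
    "The theory of special relativity was published in 1905.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, convert_to_tensor=True)
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

def answer(question, top_k=2):
    # 1) Retrieve: rank documents by embedding similarity to the question
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=top_k)[0]
    sources = [docs[hit["corpus_id"]] for hit in hits]
    # 2) Read: extract the answer span from the retrieved context only
    result = reader(question=question, context=" ".join(sources))
    return result["answer"], sources

print(answer("When did Einstein win the Nobel Prize?"))
# Swapping the extractive reader for an LLM prompt turns this into generative RAG.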

6. Models for Question Answering

From BERT to GPT-4: how QA models have evolved.

LSTM + Attention (2016, RNN-based): attention over the passage for answer selection
BERT (2018, Transformer): bidirectional pre-training, span prediction
RoBERTa (2019, Transformer): optimized BERT training, more data
ALBERT (2019, Transformer): parameter sharing for efficiency
T5 (2020, Seq2Seq): text-to-text framing, generative QA
GPT-3 (2020, Decoder-only): few-shot learning, no fine-tuning needed
Flan-T5 (2022, Instruction-tuned): instruction following for QA
GPT-4 (2023, Multimodal LLM): chain-of-thought reasoning
Llama 3 (2024, Open-source LLM): competitive with proprietary models, fully open

Model | Org | Type | Params | Context | Best For
BERT-base-uncased | Google | Extractive | 110M | 512 tokens | Production extractive QA with low latency requirements
RoBERTa-large | Meta | Extractive | 355M | 512 tokens | When you need better accuracy than BERT
Flan-T5-XL | Google | Generative | 3B | 512 tokens | Generative QA with instruction following
GPT-4 | OpenAI | Generative | Unknown | 128K tokens | Complex QA requiring reasoning or long documents
Llama 3 70B | Meta | Generative | 70B | 8K tokens | Self-hosted generative QA without API dependencies

Best for Speed: DistilBERT-QA (40% smaller, 60% faster, retains ~97% of BERT's accuracy)
Best for Quality: GPT-4 (strongest reasoning, very long context)
Best for Self-hosting: Llama 3 8B (open weights, runs on a single GPU)

7. Benchmarks

Standard datasets for evaluating question answering systems.

Benchmark | Type | Size | Metric | SOTA
SQuAD 2.0 (Wikipedia paragraphs with unanswerable questions) | Reading Comprehension | 150K QA pairs | EM / F1 | 93.2 / 95.3 (Human: 86.8 / 89.5)
Natural Questions (real Google search questions) | Open Domain | 307K QA pairs | EM / F1 | 52.7 / 58.9
TriviaQA (trivia questions with evidence documents) | Knowledge-Intensive | 95K QA pairs | EM / F1 | 73.3 / 77.5
HotpotQA (questions requiring reasoning over multiple docs) | Multi-hop Reasoning | 113K QA pairs | EM / F1 | 72.5 / 84.8
QuALITY (questions about full-length articles and stories) | Long Document | 6.7K QA pairs | Accuracy | 62.3%
Exact Match (EM)

Binary: 1 if the prediction exactly matches any ground-truth answer, 0 otherwise. Strict but clear: after standard answer normalization (case, punctuation, articles), the prediction must match an accepted answer exactly.

F1 Score

Token-level overlap between prediction and ground truth. Computed as harmonic mean of precision and recall. More forgiving than EM for partial matches.
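
The sketch below implements simplified, single-reference versions of both metrics with the usual SQuAD-style normalization (lowercasing, stripping punctuation and articles); the official evaluation scripts additionally take the maximum over multiple ground-truth answers.

import re
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    # 1 only if the normalized strings are identical
    return int(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction, ground_truth):
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("March 14, 1879", "the March 14, 1879"))             # 1
print(round(f1_score("born on March 14, 1879", "March 14, 1879"), 2))  # 0.75 (partial credit)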

8. Code Examples

From quick extractive QA to production RAG pipelines.

HuggingFace Extractive (Recommended)
pip install transformers torch
from transformers import pipeline

# Load a pre-trained extractive QA model
qa_pipeline = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    device=0  # GPU, use -1 for CPU
)

# Your context passage
context = """
Albert Einstein was born on March 14, 1879, in Ulm, Germany.
He developed the theory of special relativity in 1905 while
working as a patent clerk in Bern, Switzerland.
"""

# Ask a question
result = qa_pipeline(
    question="When was Einstein born?",
    context=context
)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.3f}")
print(f"Start: {result['start']}, End: {result['end']}")

# Output:
# Answer: March 14, 1879
# Confidence: 0.987
# Start: 28, End: 42
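
HuggingFace Generative
For the generative side, here is a minimal sketch using Flan-T5. It loads the small google/flan-t5-base checkpoint (rather than the XL model listed above) purely to stay laptop-friendly, and the prompt format is an assumption, not a fixed API.

from transformers import pipeline

# Load an instruction-tuned seq2seq model for generative QA
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    device=-1  # CPU; set to 0 for GPU
)

context = """
Albert Einstein was born on March 14, 1879, in Ulm, Germany.
He developed the theory of special relativity in 1905 while
working as a patent clerk in Bern, Switzerland.
"""

prompt = (
    "Answer the question using only the context.\n"
    f"Context: {context}\n"
    "Question: What was Einstein's job in 1905?"
)

result = generator(prompt, max_new_tokens=32)
print(result[0]["generated_text"])

# The output is typically a short phrase drawn from the context; unlike the
# extractive model, the wording may be rephrased rather than copied verbatim.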

Quick Reference

Extractive QA
  • - BERT, RoBERTa, ALBERT
  • - Fast, grounded, no hallucination
  • - Limited to verbatim text
Generative QA
  • - T5, GPT-4, Llama
  • - Natural answers, reasoning
  • - Watch for hallucination
RAG Pipeline
  • - Retrieve then generate
  • - Easy to update knowledge
  • - Citable sources
Key Decisions
  • - Context size determines approach
  • - Static vs dynamic knowledge
  • - Speed vs accuracy trade-off
Key Takeaways
  • 1. Extractive QA is fast and grounded but cannot synthesize or rephrase
  • 2. Generative QA is flexible but requires hallucination mitigation
  • 3. Context window determines whether you need RAG or can use long-context LLMs
  • 4. Start with RAG for most use cases - easier to update and maintain

Use Cases

  • Customer support bots
  • Knowledge base search
  • Reading comprehension
  • FAQ automation

Architectural Patterns

Extractive QA

Find answer spans within provided context.

Pros:
  • + Grounded in source
  • + Fast
  • + No hallucination
Cons:
  • - Needs context provided
  • - Can't synthesize

Generative QA

Generate answers using LLMs with retrieved context.

Pros:
  • + Fluent answers
  • + Can synthesize
  • + Handles complex questions
Cons:
  • - May hallucinate
  • - Slower
  • - Needs good retrieval

Open-Domain QA

Answer from parametric knowledge without context.

Pros:
  • + No retrieval needed
  • + Simple pipeline
Cons:
  • - Hallucination risk
  • - Knowledge cutoff
  • - Can't cite sources

Implementations

API Services

Perplexity API (Perplexity): Real-time search + generation. Cites sources.
You.com API (You.com): Search-augmented answers. Good for current events.

Open Source

RoBERTa-SQuAD (Apache 2.0): Extractive QA. Fast, accurate for span extraction.
DPR, Dense Passage Retrieval (Apache 2.0): Retrieval for open-domain QA. Use with reader.
FiD, Fusion-in-Decoder (MIT): Multi-document reading. Good for complex questions.

Quick Facts

Input: Text
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches
