Question Answering

Answer questions based on context or knowledge. Foundation for chatbots, search, and knowledge systems.

How Question Answering Works

A technical deep-dive into Question Answering systems. From extractive span prediction to generative reasoning, and from fine-tuned models to RAG pipelines.

1. The Core Insight

Why question answering is fundamentally different from search.

The Problem

Information is buried in text. Users have questions, but documents have paragraphs. Traditional search returns documents, not answers. You ask 'When was Einstein born?' and get 10 articles about Einstein instead of '1879.'

The Solution

Train models to read passages and either extract exact answer spans (extractive QA) or generate natural language answers (generative QA). The model becomes a reading comprehension machine.

The Key Idea

QA models don't just match keywords - they understand the question's intent and locate or synthesize the precise information needed to answer it.

Search Returns Documents. QA Returns Answers.

Traditional Search
Query: "When was Einstein born?"
Results:
  • "Albert Einstein - Biography | Life and Career..."
  • "Einstein: A Life of Genius | History Channel..."
  • "10 Facts About Albert Einstein You Didn't Know..."
The user must read the documents to find the answer.

Question Answering
Question: "When was Einstein born?"
Answer: March 14, 1879 (from: Albert Einstein Biography, score: 0.98)
A direct answer with a source citation.

2. Extractive vs Generative QA

The fundamental choice: copy from the text or generate an answer.

Extractive QA
How It Works: the model outputs start and end token positions within the passage (see the code sketch after this card).
Example
Context:
Albert Einstein was born on March 14, 1879, in Ulm, Germany. He developed the theory of relativity.
Question:
When was Einstein born?
Answer:
March 14, 1879
Pros
+ Always grounded in source text
+ No hallucination risk
+ Easy to verify and cite
+ Fast inference
Cons
- Cannot synthesize information
- Limited to verbatim text
- Fails if answer is paraphrased
- Cannot handle yes/no questions naturally
Common Models: BERT-QA, RoBERTa-QA, ALBERT, DistilBERT
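
To make the span prediction concrete, here is a minimal sketch that reads the raw start/end logits directly instead of going through the pipeline helper. It reuses the deepset/roberta-base-squad2 checkpoint from the code section below; variable names are illustrative, and edge cases (e.g. end before start) are not handled.

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "When was Einstein born?"
context = "Albert Einstein was born on March 14, 1879, in Ulm, Germany."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model scores every token as a candidate start and end of the answer span
start_idx = int(outputs.start_logits.argmax())
end_idx = int(outputs.end_logits.argmax())

answer_ids = inputs["input_ids"][0, start_idx : end_idx + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
# Expected: March 14, 1879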

3. Interactive Demo: See QA in Action

Watch how extractive QA highlights answer spans in the passage.

Passage
Albert Einstein was born on March 14, 1879, in Ulm, in the Kingdom of Württemberg in the German Empire. He developed the theory of special relativity in 1905 while working as a patent clerk in Bern, Switzerland. His famous equation E=mc² showed the equivalence of mass and energy. Einstein received the Nobel Prize in Physics in 1921, not for relativity, but for his explanation of the photoelectric effect.
Notice the Difference
Extractive QA copies exactly from the text (positions 28-42). Generative QA synthesizes a natural response that may rephrase or add context.

4. The Context Window Problem

How much text can your model read at once?

Every QA model has a limit on how much text it can read at once. BERT reads ~512 tokens. GPT-4 reads ~128K tokens. When your documents exceed this limit, you must choose: truncate and miss information, or retrieve relevant passages first (RAG).

Context Window Sizes

BERT-base: 512 tokens (~1 page)
RoBERTa-large: 512 tokens (~1 page)
Longformer: 4,096 tokens (~8 pages)
GPT-4-turbo: 128,000 tokens (~300 pages)
Claude 3.5: 200,000 tokens (~500 pages)
When Context is Too Small
  • - Must chunk documents and retrieve relevant pieces
  • - Risk missing information if retrieval fails
  • - Cannot answer questions spanning multiple sections
Long Context Advantage
  • + Can read entire documents without chunking
  • + No retrieval errors possible
  • + Simpler pipeline (no vector DB needed)
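
When documents do exceed the window, the usual first step is token-based chunking. The sketch below splits a long text into overlapping windows that fit a 512-token extractive model; chunk_by_tokens, max_tokens, and stride are illustrative names and values, not recommendations.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_by_tokens(text, max_tokens=384, stride=64):
    # Tokenize once, then slide a window over the token ids.
    # max_tokens stays below 512 to leave room for the question
    # and special tokens when a chunk is later paired with a query.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - stride  # overlap so answers on chunk borders are not lost
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start : start + max_tokens]))
        if start + max_tokens >= len(ids):
            break
    return chunks

# Each chunk can then be embedded for retrieval or passed to a QA model directly.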

5. RAG vs Fine-tuned vs Hybrid

Three fundamentally different ways to add domain knowledge to your QA system.

Fine-tuned Model

Train the model on your specific domain data

Question -> Fine-tuned Model -> Answer
Pros
+ Fast inference (no retrieval)
+ Consistent domain expertise
+ Works offline
Cons
- Expensive to update (retrain)
- Knowledge gets stale
- Limited to training data
- Can hallucinate confidently
Best for: Static domains with stable knowledge (medical terminology, legal definitions)
RAG Pipeline

Retrieve relevant documents, then generate answer from context

Question -> Retrieve -> Context + LLM -> Answer
Pros
+ Easy to update (just update docs)
+ Always current information
+ Citable sources
+ No retraining needed
Cons
- Retrieval quality is critical
- Higher latency (two-stage)
- More infrastructure
- Can fail if retrieval misses
Best for: Dynamic content, enterprise search, knowledge bases that change
Hybrid

Fine-tuned model with RAG augmentation

Question -> Retrieve -> Fine-tuned + Context -> Answer
Pros
+ Best accuracy
+ Domain expertise + fresh data
+ Graceful degradation
Cons
- Most complex
- Highest cost
- Two systems to maintain
Best for: Production systems requiring both accuracy and freshness

Quick Decision Guide

Knowledge changes frequently? -> Use RAG
Need specialized domain reasoning? -> Fine-tune
Production system with both needs? -> Hybrid
Simple use case, limited budget? -> Start with RAG (see the sketch below)
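
A minimal retrieve-then-read sketch of the RAG flow above, assuming the sentence-transformers library for retrieval and the same extractive reader used in the code section. The three-document list stands in for a real document store, and the model choices are illustrative, not prescriptive.

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Toy in-memory "knowledge base" standing in for a real document store
docs = [
    "Albert Einstein was born on March 14, 1879, in Ulm, Germany.",
    "Einstein received the 1921 Nobel Prize in Physics for the photoelectric effect.",
    "The theory of special relativity was published in 1905.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, convert_to_tensor=True)
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

def answer(question, top_k=2):
    # 1) Retrieve: rank documents by embedding similarity to the question
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=top_k)[0]
    sources = [docs[hit["corpus_id"]] for hit in hits]
    # 2) Read: extract the answer span from the retrieved context only
    result = reader(question=question, context=" ".join(sources))
    return result["answer"], sources

print(answer("When did Einstein win the Nobel Prize?"))
# Swapping the extractive reader for an LLM prompt turns this into generative RAG.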

6. Models for Question Answering

From BERT to GPT-4: how QA models have evolved.

LSTM + Attention (2016, RNN-based): attention over the passage for answer selection
BERT (2018, Transformer): bidirectional pre-training, span prediction
RoBERTa (2019, Transformer): optimized BERT training, more data
ALBERT (2019, Transformer): parameter sharing for efficiency
T5 (2020, Seq2Seq): text-to-text framing, generative QA
GPT-3 (2020, Decoder-only): few-shot learning, no fine-tuning needed
Flan-T5 (2022, Instruction-tuned): instruction following for QA
GPT-4 (2023, Multimodal LLM): chain-of-thought reasoning
Llama 3 (2024, Open-source LLM): competitive with proprietary models, fully open

Model | Org | Type | Params | Context | Best For
BERT-base-uncased | Google | Extractive | 110M | 512 tokens | Production extractive QA with low latency requirements
RoBERTa-large | Meta | Extractive | 355M | 512 tokens | When you need better accuracy than BERT
Flan-T5-XL | Google | Generative | 3B | 512 tokens | Generative QA with instruction following
GPT-4 | OpenAI | Generative | Unknown | 128K tokens | Complex QA requiring reasoning or long documents
Llama 3 70B | Meta | Generative | 70B | 8K tokens | Self-hosted generative QA without API dependencies

Best for Speed: DistilBERT-QA (40% smaller, 60% faster, retains ~97% of BERT's accuracy)
Best for Quality: GPT-4 (strongest reasoning, very long context)
Best for Self-hosting: Llama 3 8B (open weights, runs on a single GPU)

7. Benchmarks

Standard datasets for evaluating question answering systems.

Benchmark | Type | Size | Metric | SOTA
SQuAD 2.0 (Wikipedia paragraphs with unanswerable questions) | Reading Comprehension | 150K QA pairs | EM / F1 | 93.2 / 95.3 (Human: 86.8 / 89.5)
Natural Questions (real Google search questions) | Open Domain | 307K QA pairs | EM / F1 | 52.7 / 58.9
TriviaQA (trivia questions with evidence documents) | Knowledge-Intensive | 95K QA pairs | EM / F1 | 73.3 / 77.5
HotpotQA (questions requiring reasoning over multiple docs) | Multi-hop Reasoning | 113K QA pairs | EM / F1 | 72.5 / 84.8
QuALITY (questions about full-length articles and stories) | Long Document | 6.7K QA pairs | Accuracy | 62.3%
Exact Match (EM)

Binary: 1 if the prediction exactly matches any ground-truth answer, 0 otherwise. Strict but clear: after standard answer normalization (case, punctuation, articles), the prediction must match an accepted answer exactly.

F1 Score

Token-level overlap between prediction and ground truth. Computed as harmonic mean of precision and recall. More forgiving than EM for partial matches.
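
The sketch below implements simplified, single-reference versions of both metrics with the usual SQuAD-style normalization (lowercasing, stripping punctuation and articles); the official evaluation scripts additionally take the maximum over multiple ground-truth answers.

import re
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace
    text = re.sub(r"[^\w\s]", " ", text.lower())
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    # 1 only if the normalized strings are identical
    return int(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction, ground_truth):
    pred = normalize(prediction).split()
    gold = normalize(ground_truth).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("March 14, 1879", "the March 14, 1879"))             # 1
print(round(f1_score("born on March 14, 1879", "March 14, 1879"), 2))  # 0.75 (partial credit)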

8. Code Examples

From quick extractive QA to production RAG pipelines.

HuggingFace Extractive (Recommended)
pip install transformers torch
from transformers import pipeline

# Load a pre-trained extractive QA model
qa_pipeline = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    device=0  # GPU, use -1 for CPU
)

# Your context passage
context = """
Albert Einstein was born on March 14, 1879, in Ulm, Germany.
He developed the theory of special relativity in 1905 while
working as a patent clerk in Bern, Switzerland.
"""

# Ask a question
result = qa_pipeline(
    question="When was Einstein born?",
    context=context
)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.3f}")
print(f"Start: {result['start']}, End: {result['end']}")

# Output:
# Answer: March 14, 1879
# Confidence: 0.987
# Start: 28, End: 42
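
HuggingFace Generative
For the generative side, here is a minimal sketch using Flan-T5. It loads the small google/flan-t5-base checkpoint (rather than the XL model listed above) purely to stay laptop-friendly, and the prompt format is an assumption, not a fixed API.

from transformers import pipeline

# Load an instruction-tuned seq2seq model for generative QA
generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    device=-1  # CPU; set to 0 for GPU
)

context = """
Albert Einstein was born on March 14, 1879, in Ulm, Germany.
He developed the theory of special relativity in 1905 while
working as a patent clerk in Bern, Switzerland.
"""

prompt = (
    "Answer the question using only the context.\n"
    f"Context: {context}\n"
    "Question: What was Einstein's job in 1905?"
)

result = generator(prompt, max_new_tokens=32)
print(result[0]["generated_text"])

# The output is typically a short phrase drawn from the context; unlike the
# extractive model, the wording may be rephrased rather than copied verbatim.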

Quick Reference

Extractive QA
  • - BERT, RoBERTa, ALBERT
  • - Fast, grounded, no hallucination
  • - Limited to verbatim text
Generative QA
  • - T5, GPT-4, Llama
  • - Natural answers, reasoning
  • - Watch for hallucination
RAG Pipeline
  • - Retrieve then generate
  • - Easy to update knowledge
  • - Citable sources
Key Decisions
  • - Context size determines approach
  • - Static vs dynamic knowledge
  • - Speed vs accuracy trade-off
Key Takeaways
  • 1. Extractive QA is fast and grounded but cannot synthesize or rephrase
  • 2. Generative QA is flexible but requires hallucination mitigation
  • 3. Context window determines whether you need RAG or can use long-context LLMs
  • 4. Start with RAG for most use cases - easier to update and maintain

Use Cases

  • Customer support bots
  • Knowledge base search
  • Reading comprehension
  • FAQ automation

Architectural Patterns

Extractive QA

Find answer spans within provided context.

Pros:
  • + Grounded in source
  • + Fast
  • + No hallucination
Cons:
  • - Needs context provided
  • - Can't synthesize

Generative QA

Generate answers using LLMs with retrieved context.

Pros:
  • + Fluent answers
  • + Can synthesize
  • + Handles complex questions
Cons:
  • - May hallucinate
  • - Slower
  • - Needs good retrieval

Open-Domain QA

Answer from parametric knowledge without context.

Pros:
  • + No retrieval needed
  • + Simple pipeline
Cons:
  • - Hallucination risk
  • - Knowledge cutoff
  • - Can't cite sources

Implementations

API Services

Perplexity API (Perplexity): Real-time search + generation. Cites sources.
You.com API (You.com): Search-augmented answers. Good for current events.

Open Source

RoBERTa-SQuAD (Apache 2.0): Extractive QA. Fast, accurate for span extraction.
DPR, Dense Passage Retrieval (Apache 2.0): Retrieval for open-domain QA. Use with reader.
FiD, Fusion-in-Decoder (MIT): Multi-document reading. Good for complex questions.

Quick Facts

Input: Text
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches
