Basic RAG Pipeline
Ground your LLM in real data. The architecture, history, and engineering behind retrieval-augmented generation — from the 2020 paper to production systems.
The Road to RAG: 2017 – Present
Retrieval-Augmented Generation did not appear from nowhere. It emerged from a specific collision of problems: LLMs were getting dramatically better at language but remained fundamentally limited by their training data cutoff, prone to hallucination, and unable to access private knowledge. Researchers at Facebook AI (now Meta) formalized the solution in 2020, but the ideas had been building for years.
Understanding this history matters because every architectural choice in a modern RAG system — chunking strategies, retrieval methods, prompt templates — traces back to trade-offs identified in these foundational papers.
DrQA: Reading Wikipedia to Answer Questions
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes at Facebook AI Research built DrQA, a system that combined a TF-IDF document retriever with a neural reader to answer open-domain questions using the entirety of English Wikipedia. The architecture was simple but prophetic: retrieve relevant passages first, then have a neural network extract the answer from those passages.
"We use a simple TF-IDF bag-of-words model as our document retriever [...] combined with a multi-layer recurrent neural network model trained to detect answer spans in paragraphs."
— Chen, D. et al. (2017). Reading Wikipedia to Answer Open-Domain Questions. ACL.
DrQA established the retriever-reader paradigm that RAG would later generalize. The reader was extractive (it highlighted spans in the text) rather than generative, but the fundamental insight — feed retrieved context to a neural model — was already there.
Dense Passage Retrieval & ORQA
Two critical limitations of DrQA drove the next wave of research. First, TF-IDF retrieval was purely lexical — it could not find passages that used different words to express the same concept. Second, the retriever and reader were trained separately, preventing end-to-end optimization.
Lee et al. (2019) introduced ORQA (Open-Retrieval Question Answering), which jointly pre-trained retriever and reader using an Inverse Cloze Task. Then Karpukhin et al. (2020) released DPR (Dense Passage Retrieval), replacing TF-IDF with dual BERT encoders — one for queries, one for passages. DPR improved top-20 retrieval accuracy on Natural Questions from 59.1% (BM25) to 78.4%.
— Lee, K. et al. (2019). Latent Retrieval for Weakly Supervised Open Domain QA. ACL.
— Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain QA. EMNLP.
REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang at Google Research published REALM, arguably the first true retrieval-augmented language model. REALM augmented BERT's masked language modeling pre-training with a learned retriever that fetched relevant Wikipedia passages at training time, not just inference.
"For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents."
— Guu, K. et al. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. ICML.
REALM's key contribution was making the retriever differentiable — the model could learn which documents were useful through gradient updates. This was computationally expensive (requiring periodic re-encoding of the entire corpus), but it demonstrated that retrieval and generation could be trained jointly.
RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela at Facebook AI Research published the paper that gave the technique its name. RAG combined a pre-trained seq2seq model (BART) with a dense retriever (DPR), creating a general-purpose architecture for any text generation task that benefits from external knowledge.
# The RAG architecture (Lewis et al. 2020), sketched as annotated pseudocode
# 1. Query encoder: BERT_q(x) → dense query vector
q = query_encoder(input_text)  # shape: (768,)
# 2. Retrieve top-k passages using MIPS (Maximum Inner Product Search)
passages = index.search(q, k=5)  # from DPR-indexed Wikipedia
# 3. Generate conditioned on EACH retrieved passage
for doc in passages:
    p_y_given_doc = generator(input_text, doc)  # BART seq2seq: p(y|x,z)
# 4. Marginalize over documents:
#    RAG-Token:    p(y_i|x) = Σ_z p(z|x) · p(y_i|x,z,y_{1:i-1})
#    RAG-Sequence: p(y|x)   = Σ_z p(z|x) · p(y|x,z)
— Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. 4,500+ citations.
The paper introduced two variants: RAG-Sequence, which uses the same document to generate the entire output, and RAG-Token, which can use different documents for each output token. RAG set new state-of-the-art on open-domain QA (Natural Questions, WebQuestions, CuratedTrec), abstractive QA (MSMARCO), and the Jeopardy question generation task.
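The RAG-Sequence marginalization above reduces to a weighted sum. A toy numeric sketch with made-up probabilities (real systems work with log-probabilities over whole token sequences, but the arithmetic is the same):

```python
# Two retrieved documents z1, z2 with retrieval probabilities p(z|x),
# and the generator's probability of producing answer y given each one.
# All numbers here are illustrative, not from the paper.
p_z = {"z1": 0.7, "z2": 0.3}          # retriever scores, softmax-normalized
p_y_given_z = {"z1": 0.9, "z2": 0.2}  # p(y | x, z) from the generator

# RAG-Sequence: marginalize the whole output sequence over documents
p_y = sum(p_z[z] * p_y_given_z[z] for z in p_z)
print(p_y)  # 0.7*0.9 + 0.3*0.2 = 0.69
```

Note how the high-probability document dominates: a confident retriever makes the mixture behave almost like conditioning on a single passage.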
Why the RAG paper mattered more than REALM
REALM was technically first, but RAG won adoption. Three reasons: (1) it kept the document encoder and index fixed, fine-tuning only the query encoder — no backpropagation through periodic corpus re-encoding, making it far simpler to implement; (2) it was generative (using BART) rather than extractive, enabling open-ended text generation; (3) the name stuck. "RAG" became the universal shorthand for any system that retrieves context before generating, even systems that look nothing like the original paper.
Fusion-in-Decoder & Atlas
Gautier Izacard and Edouard Grave introduced Fusion-in-Decoder (FiD), which encoded each retrieved passage independently (enabling parallelism) and fused them only in the decoder's cross-attention layers. This scaled gracefully to 100+ retrieved passages. Their follow-up, Atlas (Izacard et al., 2023), combined FiD with contrastive retriever training, achieving state-of-the-art few-shot performance: an 11B-parameter model matched the 540B-parameter PaLM on Natural Questions with just 64 examples.
— Izacard, G. & Grave, E. (2021). Leveraging Passage Retrieval with Generative Models. EACL.
— Izacard, G. et al. (2023). Atlas: Few-shot Learning with Retrieval Augmented Language Models. JMLR.
ChatGPT Ignites the RAG Gold Rush
The release of ChatGPT (November 2022) created explosive demand for connecting LLMs to private data. Enterprises wanted GPT-quality responses grounded in their own documents, not hallucinated generalizations. "RAG" went from an NLP research term to a startup pitch deck buzzword overnight.
Pinecone
Managed vector DB. Raised $138M in 2023. Serverless index launch.
Weaviate
Open-source + cloud. Built-in vectorizers. GraphQL API.
Chroma
Developer-friendly Python client. Embedded mode for prototyping.
Qdrant
Rust-based. Filtering + payload support. Self-hostable.
pgvector
PostgreSQL extension. Use your existing DB. HNSW + IVFFlat indexes.
FAISS
Meta's library. Not a DB — a fast similarity search engine. Powers most vector DBs internally.
LangChain and LlamaIndex emerged as orchestration frameworks, abstracting the retrieve-then-generate pattern into simple API calls. The ecosystem exploded: by mid-2023, "RAG" appeared in over 2,000 arXiv papers and every major cloud provider had launched a managed RAG product.
Advanced RAG & Beyond Naive Retrieval
As production deployments revealed the limitations of naive RAG (poor chunking, irrelevant retrieval, lost-in-the-middle effects), the field advanced rapidly:
- Self-RAG (Asai et al., 2023) — the model decides whether to retrieve and self-critiques retrieved passages
- CRAG (Yan et al., 2024) — Corrective RAG that evaluates retrieval quality and falls back to web search when confidence is low
- GraphRAG (Microsoft, 2024) — builds knowledge graphs from documents, enabling multi-hop reasoning over entity relationships
- Contextual Retrieval (Anthropic, 2024) — prepends LLM-generated context to each chunk before embedding, reducing retrieval failure by 49%
- Late Chunking (Jina AI, 2024) — embeds the full document through a long-context model, then chunks the output embeddings, preserving cross-chunk context
— Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate and Critique. ICLR 2024.
— Yan, S. et al. (2024). Corrective Retrieval Augmented Generation. arXiv.
The throughline: 2017 → 2026
Nine years. One idea, refined relentlessly: retrieve relevant context first, then generate.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances LLMs by connecting them to external knowledge sources. Instead of relying solely on what the model memorized during training, RAG retrieves relevant documents at query time and injects them into the prompt as context.
Think of it as giving the LLM an open-book exam instead of a closed-book test. The model can look up information rather than trying to recall everything from memory — and crucially, it can cite where the information came from.
// Without RAG (closed-book)
User: "What's our Q4 revenue target?"
LLM: "I don't have access to your company's financial data..."
// With RAG (open-book)
User: "What's our Q4 revenue target?"
LLM: "Per the 2026 strategic plan [slide 14], Q4 target is $4.2M, up 18% YoY..."
Why RAG Matters
The Hallucination Problem
LLMs confidently generate plausible-sounding but factually incorrect information. They have no mechanism to distinguish recalled facts from fabricated ones.
Huang et al. (2023) found that even GPT-4 hallucinates on 3–15% of factual questions depending on domain, with medical and legal queries being worst.
The RAG Solution
By grounding responses in retrieved documents, RAG provides verifiable sources. The model can cite where information came from, making answers auditable.
Shuster et al. (2021) showed that RAG reduces hallucination in knowledge-grounded dialogue by 36% compared to a parametric-only model of the same size.
Stale Knowledge
Training data has a cutoff date. A model trained in January cannot answer questions about events in March. Re-training costs millions of dollars.
RAG decouples knowledge freshness from model training: update the document index in minutes, not weeks. No GPU cluster required.
Private Data Access
RAG lets you query internal documents, databases, and proprietary knowledge without fine-tuning or risking data leakage through training.
Enterprise adoption is driven by this: your data stays in your infrastructure, injected only at inference time, with access controls at the retrieval layer.
The RAG Architecture: 4 Stages
Every RAG pipeline follows the same fundamental pattern: Chunk → Embed → Retrieve → Generate. The first two stages happen offline (indexing time). The last two happen online (query time). Understanding where the boundary falls is critical for latency optimization.
┌─────────────────────────────────────────────────────────────────┐
│                   INDEXING PIPELINE (offline)                   │
│                                                                 │
│  Documents ──→ [Chunk] ──→ [Embed] ──→ [Store in Vector DB]     │
│   PDF             ↓            ↓               ↓                │
│   HTML         chunks[]    vectors[]    index + metadata        │
│   TXT                                                           │
│   Markdown                                                      │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    QUERY PIPELINE (online)                      │
│                                                                 │
│  User Query ──→ [Embed] ──→ [Retrieve top-k] ──→ [Generate]     │
│                    ↓              ↓                  ↓          │
│                query_vec    relevant_chunks     LLM response    │
│                           + similarity scores   with citations  │
└─────────────────────────────────────────────────────────────────┘
Chunk
Split your documents into smaller pieces. Large documents exceed context windows, and retrieval precision degrades when chunks are too big — the embedding averages over too many topics. The ideal chunk contains a single coherent idea with enough context to be understood in isolation.
This stage is often where RAG systems fail. We'll cover chunking strategies in detail below.
Embed
Convert each chunk into a dense vector embedding using a model like text-embedding-3-small or BAAI/bge-base-en-v1.5. The embedding captures the semantic meaning of the chunk — similar content produces similar vectors, enabling semantic search rather than keyword matching.
See Lesson 0.1: What is an Embedding? for how this works under the hood.
Retrieve
When a query arrives, embed it using the same model as the chunks, then find the most similar chunk vectors using approximate nearest neighbor (ANN) search. Common algorithms: HNSW (Hierarchical Navigable Small World), IVF (Inverted File Index), and product quantization. Return the top-k most relevant chunks as context for generation.
ANN search is sub-linear: searching 10 million vectors takes milliseconds, not minutes. This is what makes RAG practical at scale.
Generate
Pass the retrieved chunks along with the user's question to the LLM in a carefully structured prompt. The model synthesizes an answer grounded in the provided context. Key design decisions: context window budget, citation format, fallback behavior when context is insufficient, and whether to use a system prompt or few-shot examples.
Liu et al. (2024) showed that LLMs attend more to context at the beginning and end of the prompt ("lost in the middle"), so ordering retrieved chunks by relevance at the boundaries matters.
Chunking Strategies: The Most Underrated Stage
Chunking is deceptively important. Bad chunking leads to bad retrieval, which leads to bad answers, and no amount of prompt engineering can recover from irrelevant context. In practice, chunking strategy often has a larger impact on RAG quality than the choice of embedding model.
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed Size | Split at N characters/tokens with M overlap | Homogeneous text, quick prototypes |
| Sentence-based | Group N sentences, split at sentence boundaries | Articles, documentation, books |
| Recursive | Try splitting on \n\n, then \n, then sentences, then characters | Mixed-format documents (default in LangChain) |
| Semantic | Embed sentences, split where cosine similarity drops | Documents with clear topic shifts |
| Document-structure | Split on headers, sections, paragraphs (Markdown/HTML-aware) | Technical docs, legal contracts, specs |
Chunking Code: Four Strategies
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
def count_tokens(text: str) -> int:
return len(enc.encode(text))
# ── Strategy 1: Fixed-size with overlap ──────────────────────
def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
"""Split into token-counted chunks with overlap for context continuity."""
tokens = enc.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = start + size
chunk_tokens = tokens[start:end]
chunks.append(enc.decode(chunk_tokens))
start = end - overlap # slide back by overlap
return chunks
# ── Strategy 2: Sentence-based grouping ──────────────────────
import re
def chunk_sentences(text: str, max_tokens: int = 512) -> list[str]:
"""Group sentences until token budget is reached."""
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks, current = [], []
current_len = 0
for sent in sentences:
sent_len = count_tokens(sent)
if current_len + sent_len > max_tokens and current:
chunks.append(" ".join(current))
current, current_len = [], 0
current.append(sent)
current_len += sent_len
if current:
chunks.append(" ".join(current))
return chunks
# ── Strategy 3: Recursive splitting ──────────────────────────
def chunk_recursive(text: str, max_tokens: int = 512,
                    separators: list[str] | None = None) -> list[str]:
"""Try each separator in order; recurse on pieces that are too large."""
if separators is None:
separators = ["\n\n", "\n", ". ", " "]
if count_tokens(text) <= max_tokens:
return [text.strip()] if text.strip() else []
sep = separators[0] if separators else " "
parts = text.split(sep)
chunks, current = [], ""
for part in parts:
candidate = current + sep + part if current else part
if count_tokens(candidate) > max_tokens:
if current:
chunks.append(current.strip())
# If single part exceeds limit, recurse with finer separator
if count_tokens(part) > max_tokens and len(separators) > 1:
chunks.extend(chunk_recursive(part, max_tokens, separators[1:]))
current = ""
else:
current = part
else:
current = candidate
if current.strip():
chunks.append(current.strip())
return chunks
# ── Strategy 4: Semantic chunking ────────────────────────────
import numpy as np
def chunk_semantic(text: str, model, threshold: float = 0.5,
max_tokens: int = 1024) -> list[str]:
"""Split where cosine similarity between consecutive sentences drops."""
sentences = re.split(r'(?<=[.!?])\s+', text)
if len(sentences) <= 1:
return [text]
embeddings = model.encode(sentences)
# Compute cosine similarity between consecutive sentences
similarities = [
np.dot(embeddings[i], embeddings[i+1]) /
(np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i+1]))
for i in range(len(embeddings) - 1)
]
# Split where similarity drops below threshold
chunks, current = [], [sentences[0]]
for i, sim in enumerate(similarities):
if sim < threshold or count_tokens(" ".join(current)) > max_tokens:
chunks.append(" ".join(current))
current = []
current.append(sentences[i + 1])
if current:
chunks.append(" ".join(current))
    return chunks
The Overlap Question
Overlap (also called "stride") duplicates tokens at chunk boundaries so that information split across two chunks still appears in at least one. A common rule of thumb:
0 overlap
No duplicate storage. Fastest indexing.
- + Minimal storage cost
- - Information at boundaries is lost
- - Only for well-structured docs
10–15% overlap
Balanced. Most common in production.
- + Recovers boundary context
- + Modest storage increase
- - May retrieve near-duplicate chunks
25%+ overlap
Maximum coverage. High storage cost.
- + Almost no boundary information loss
- - 25%+ more embeddings to store/search
- - Deduplication becomes necessary
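The storage cost of overlap is easy to quantify. A minimal sketch mirroring the sliding-window loop in chunk_fixed above (the corpus size is illustrative):

```python
import math

def num_chunks(n_tokens: int, size: int, overlap: int) -> int:
    """Approximate chunk count for fixed-size chunking with overlap:
    the window advances by (size - overlap) tokens per step."""
    stride = size - overlap
    return max(1, math.ceil(n_tokens / stride))

# A hypothetical 100k-token corpus at 512-token chunks:
for ov in (0, 64, 128):  # 0%, 12.5%, 25% overlap
    print(f"overlap={ov}: {num_chunks(100_000, 512, ov)} chunks")
```

At 12.5% overlap you store roughly 14% more embeddings than with none; at 25% overlap, roughly a third more. That extra storage (and search fan-out) is the price of recovering boundary context.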
Chunk Size Experiments
128–256 tokens
Fine-grained. High precision.
- + Best for factoid QA
- + Each chunk ~ one idea
- - Loses broader context
- - Many more embeddings to store
512 tokens
The default sweet spot.
- + Good balance of precision/context
- + Works for most document types
- - May still split related content
1024–2048 tokens
Broad context. Used with rerankers.
- + Rich context in each chunk
- + Fewer retrieval round-trips
- - Dilutes relevance signal in embedding
- - Needs reranker to work well
Retrieval: Top-K, Thresholds, and Reranking
The retrieval stage has three levers that control what gets passed to the generator:
Top-K Selection
Return the K most similar chunks regardless of absolute similarity score.
k=3: Focused, minimal context. Fast.
k=5: Balanced (common default).
k=10: Broad context, more noise.
k=20+: Requires reranking to be useful.
Similarity Threshold
Only return chunks above a minimum score. Prevents irrelevant results when the question is out of scope.
0.7+: Strict. Might miss relevant context.
0.4–0.6: Moderate. Good default.
<0.3: Too loose. Noise dominates.
Cross-Encoder Reranking
Retrieve k=20 with a fast bi-encoder, then rerank with a slow but accurate cross-encoder. Return the top 5.
Bi-encoder: ~1ms per query. Independent encoding.
Cross-encoder: ~50ms per pair. Sees query + doc together.
Production pattern: Combine all three. Bi-encoder retrieves top-20 candidates (fast). Similarity threshold filters out irrelevant results. Cross-encoder reranks survivors and takes top-5. This gives you the precision of a cross-encoder at near bi-encoder latency.
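The combined pattern can be sketched end to end. To keep the example runnable without model downloads, the cross-encoder is replaced here by a toy lexical-overlap scorer; in practice you would swap in a real cross-encoder (for example sentence-transformers' CrossEncoder). The function name and documents are illustrative:

```python
def retrieve_filter_rerank(query: str, candidates: list[tuple[str, float]],
                           threshold: float = 0.4, final_k: int = 5) -> list[str]:
    """Sketch of retrieve → threshold-filter → rerank.
    `candidates` is the bi-encoder's top-k (chunk, similarity) list."""
    def rerank_score(q: str, doc: str) -> float:
        # Stand-in for a cross-encoder: fraction of query tokens in the doc.
        # A real cross-encoder jointly encodes (q, doc) and outputs a
        # learned relevance score.
        q_tok, d_tok = set(q.lower().split()), set(doc.lower().split())
        return len(q_tok & d_tok) / max(1, len(q_tok))

    survivors = [(c, s) for c, s in candidates if s >= threshold]  # lever 2
    reranked = sorted(survivors, reverse=True,
                      key=lambda cs: rerank_score(query, cs[0]))   # lever 3
    return [c for c, _ in reranked[:final_k]]                      # lever 1

docs = [("Python was released in 1991 by Guido van Rossum", 0.8),
        ("The Eiffel Tower was built in 1889", 0.2),
        ("Guido van Rossum created Python", 0.7)]
print(retrieve_filter_rerank("who created Python", docs, final_k=2))
```

Note the order of operations: the cheap similarity threshold runs before the expensive reranker, so the cross-encoder only ever scores plausible candidates.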
Full RAG Pipeline: Working Code
Here is a complete, minimal RAG pipeline you can run locally. No frameworks — just the raw components so you understand what LangChain and LlamaIndex abstract away.
"""
Minimal RAG pipeline — no frameworks, just numpy + openai.
pip install openai numpy tiktoken
"""
import numpy as np
import tiktoken
from openai import OpenAI
client = OpenAI() # reads OPENAI_API_KEY from env
enc = tiktoken.get_encoding("cl100k_base")
# ── 1. CHUNK ─────────────────────────────────────────────────
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
"""Fixed-size chunking with token-based overlap."""
tokens = enc.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = min(start + max_tokens, len(tokens))
chunks.append(enc.decode(tokens[start:end]))
if end == len(tokens):
break
start = end - overlap
return chunks
# ── 2. EMBED ─────────────────────────────────────────────────
def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
"""Embed a batch of texts. Returns (N, dim) array."""
response = client.embeddings.create(input=texts, model=model)
return np.array([e.embedding for e in response.data])
# ── 3. INDEX (in-memory for demo; use a vector DB in production) ─
class SimpleVectorIndex:
def __init__(self):
self.chunks: list[str] = []
self.vectors: np.ndarray | None = None
def add(self, texts: list[str]):
"""Add chunks to the index."""
self.chunks.extend(texts)
new_vecs = embed(texts)
if self.vectors is None:
self.vectors = new_vecs
else:
self.vectors = np.vstack([self.vectors, new_vecs])
def search(self, query: str, k: int = 5, threshold: float = 0.3
) -> list[tuple[str, float]]:
"""Return top-k chunks above similarity threshold."""
q_vec = embed([query])[0]
# Cosine similarity (vectors are already normalized by OpenAI)
scores = self.vectors @ q_vec
# Filter by threshold, then take top-k
indices = np.where(scores >= threshold)[0]
indices = indices[np.argsort(scores[indices])[::-1]][:k]
return [(self.chunks[i], float(scores[i])) for i in indices]
# ── 4. GENERATE ──────────────────────────────────────────────
def generate(query: str, context_chunks: list[tuple[str, float]],
model: str = "gpt-4o-mini") -> str:
"""Generate answer grounded in retrieved context."""
context = "\n\n".join(
f"[{i+1}] (score: {score:.2f}) {chunk}"
for i, (chunk, score) in enumerate(context_chunks)
)
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": (
"Answer the user's question using ONLY the provided context. "
"Cite sources using [1], [2], etc. If the context doesn't contain "
"the answer, say 'I don't have enough information to answer that.'"
)},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
],
temperature=0.1, # Low temperature for factual accuracy
)
return response.choices[0].message.content
# ── USAGE ────────────────────────────────────────────────────
# Index some documents
index = SimpleVectorIndex()
documents = [
"The Eiffel Tower was built in 1889 for the World's Fair in Paris...",
"Python was created by Guido van Rossum and released in 1991...",
"The mitochondria is the powerhouse of the cell, producing ATP...",
]
for doc in documents:
chunks = chunk_text(doc)
index.add(chunks)
# Query
query = "When was Python created?"
results = index.search(query, k=3)
answer = generate(query, results)
print(answer)
# → "Python was created by Guido van Rossum and released in 1991 [1]."Install: pip install openai numpy tiktoken. For production, replace SimpleVectorIndex with a real vector database (pgvector, Qdrant, Pinecone, etc.) and add a reranker.
Prompt Engineering for RAG
The prompt template you use significantly impacts answer quality. Research on RAG-specific prompting reveals several non-obvious principles:
Production RAG Prompt Template
System:
You are a helpful assistant. Answer questions using ONLY the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer that question." Always cite your sources using [1], [2], etc. Be concise but thorough. If multiple sources agree, synthesize them. If sources conflict, note the discrepancy.
Context:
[1] (relevance: 0.89) {chunk_1_text}
[2] (relevance: 0.82) {chunk_2_text}
[3] (relevance: 0.76) {chunk_3_text}
Question:
{user_question}
1. Constrain to context
Explicitly tell the model to ONLY use provided context. Without this instruction, the model will freely mix retrieved facts with parametric memory, making hallucinations undetectable.
2. Handle missing information gracefully
Give the model a way out. If context is insufficient, it should admit it rather than guess. This is the single most important instruction for reducing hallucination in RAG systems.
3. Require citations
Ask for source attribution. This makes answers verifiable, builds user trust, and — as a side effect — forces the model to ground each claim in a specific chunk rather than synthesizing from memory.
4. Context placement matters
Liu et al. (2024) demonstrated the "lost in the middle" effect: LLMs attend more to context at the beginning and end of the prompt. Place the most relevant chunks first and last, not buried in the middle of a long context window.
— Liu, N. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL.
5. Include relevance scores
Passing similarity scores alongside chunks helps the model weigh its sources. A chunk with 0.92 similarity deserves more weight than one at 0.45. Some models can learn to discount low-confidence retrievals when scores are visible.
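The template above can be assembled programmatically. A minimal sketch; `build_rag_messages` is a hypothetical helper, and the exact wording follows the template rather than any canonical API:

```python
def build_rag_messages(question: str,
                       chunks: list[tuple[str, float]]) -> list[dict]:
    """Turn retrieved (text, score) chunks into a chat-completions
    message list with numbered, score-annotated context."""
    system = (
        "Answer questions using ONLY the provided context. If the context "
        "doesn't contain the answer, say \"I don't have enough information "
        "to answer that question.\" Cite sources as [1], [2], etc."
    )
    context = "\n".join(
        f"[{i+1}] (relevance: {score:.2f}) {text}"
        for i, (text, score) in enumerate(chunks)
    )
    user = f"Context:\n{context}\n\nQuestion:\n{question}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

msgs = build_rag_messages("When was Python created?",
                          [("Python was released in 1991.", 0.89)])
print(msgs[1]["content"])
```

Keeping the system prompt fixed and injecting only context and question into the user turn makes prompt changes easy to version and A/B test.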
Evaluating RAG: Metrics That Matter
RAG has two failure modes that need separate evaluation: the retriever can fail to find relevant context, or the generator can fail to use the context correctly. Measuring end-to-end answer quality alone cannot distinguish between these — you need component-level metrics.
Retrieval Metrics
Recall@k
Of all relevant documents, what fraction appears in the top-k results? The most important retrieval metric for RAG because the generator cannot use what the retriever doesn't find.
MRR (Mean Reciprocal Rank)
How high does the first relevant result rank? MRR rewards retrieval systems that place the best chunk at position 1, not buried at position 5.
NDCG@k
Normalized Discounted Cumulative Gain. Accounts for graded relevance (some documents are more relevant than others) and position (top results matter more).
Context Relevance (LLM-as-judge)
Use a separate LLM to score whether each retrieved chunk is relevant to the query. Automated, scalable, and correlates well with human judgments.
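Recall@k and MRR are a few lines each. A minimal sketch over document ids (the retrieved list and relevant set are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant doc ids that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found).
    Average this over many queries to get Mean Reciprocal Rank."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d7", "d1", "d9"]  # retriever output, best-first
relevant = {"d1", "d2"}               # ground-truth relevant docs
print(recall_at_k(retrieved, relevant, 3))  # 0.5  — only d1 in the top-3
print(mrr(retrieved, relevant))             # 0.333… — first hit at rank 3
```

Both need ground-truth relevance labels per query, which is why LLM-as-judge context relevance is often used as a label-free complement.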
Generation Metrics
Faithfulness
Does the answer contain only information supported by the retrieved context? The core anti-hallucination metric. Can be measured by extracting claims from the answer and verifying each against the context.
Used by RAGAS (Retrieval Augmented Generation Assessment) framework.
Answer Relevance
Does the answer actually address the user's question? A faithful answer that misses the point is still a bad answer. Measured by generating questions that the answer would address and comparing to the original question.
Answer Correctness
The end-to-end metric: is the answer factually correct? Requires ground-truth labels. Combines semantic similarity with factual overlap against reference answers.
Citation Accuracy
Do the cited sources actually support the claims they're attached to? A surprisingly common failure: the model cites [1] but the claim actually comes from [3], or from no retrieved source at all.
Evaluation Code: Using RAGAS
# pip install ragas
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from datasets import Dataset
# Prepare evaluation dataset
eval_data = {
"question": ["When was Python created?"],
"answer": ["Python was created by Guido van Rossum and released in 1991 [1]."],
"contexts": [["Python was created by Guido van Rossum and released in 1991."]],
"ground_truth": ["Python was created by Guido van Rossum in 1991."],
}
dataset = Dataset.from_dict(eval_data)
result = evaluate(
dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# {'faithfulness': 1.0, 'answer_relevancy': 0.95,
# 'context_precision': 1.0, 'context_recall': 1.0}
— Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv.
Common Failure Modes
Most RAG systems fail in predictable ways. Knowing these patterns saves weeks of debugging:
1. Retrieval Failure: Relevant docs exist but aren't retrieved
Cause: Query-document vocabulary mismatch. The user asks "how to cancel my subscription" but the docs say "termination of service agreement."
Fix: Hybrid search (dense + sparse/BM25), query expansion, or HyDE (Hypothetical Document Embeddings — generate a hypothetical answer, embed that instead).
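Hybrid search needs a way to merge the BM25 ranking with the dense ranking; Reciprocal Rank Fusion (RRF, Cormack et al. 2009) is a common choice. A sketch with hypothetical document ids, where the two input rankings stand in for real BM25 and dense retriever output:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc ids. Each doc accumulates
    1 / (k + rank) per list it appears in; k=60 is the constant from
    the original RRF formulation."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["termination-policy", "billing-faq", "refund-terms"]
dense_hits = ["cancel-subscription", "termination-policy", "billing-faq"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # "termination-policy" wins: it ranks high in both lists
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.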
2. Context Stuffing: Too many irrelevant chunks
Cause: High k with no threshold or reranking. The model is given 10 chunks where only 2 are relevant, and the noise dilutes the signal.
Fix: Add a reranker, lower k, increase similarity threshold, or use a model with instruction-following strong enough to ignore irrelevant context.
3. Lost in the Middle: Key context buried at position 5 of 10
Cause: The most relevant chunk ranked mid-pack. The LLM attends primarily to the first and last chunks in the prompt.
Fix: Rerank by relevance. Or use a "bookend" strategy: place the most relevant chunks at positions 1 and k, less relevant in the middle.
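The bookend reordering takes only a few lines. An illustrative helper (similar in spirit to LangChain's LongContextReorder), assuming chunks arrive sorted best-first:

```python
def bookend_order(chunks: list[str]) -> list[str]:
    """Reorder relevance-sorted chunks so the strongest land at the
    prompt's start and end: even indices go to the front, odd indices
    (reversed) to the back, leaving the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks):  # chunks sorted best-first
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks r1 (best) … r5 (worst):
print(bookend_order(["r1", "r2", "r3", "r4", "r5"]))
# → ['r1', 'r3', 'r5', 'r4', 'r2'] — best first, second-best last
```

The best chunk stays at position 1 and the second-best moves to the final position, matching where the "lost in the middle" results say attention is strongest.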
4. Chunk Boundary Splits: Answer spans two chunks
Cause: A critical fact was split by the chunker. Each half is retrieved but neither contains the complete information.
Fix: Add overlap, use sentence-aware chunking, or implement "parent document retrieval" — retrieve small chunks for precision, but pass the larger parent document to the LLM.
5. Confident Hallucination Despite Context
Cause: The model's parametric knowledge conflicts with the retrieved context, and it trusts its training over the prompt. More common with smaller models.
Fix: Stronger system prompt constraints, lower temperature, or a model specifically tuned for faithful, grounded generation.
Key Takeaways
1. RAG = Chunk + Embed + Retrieve + Generate — The first two stages happen offline (indexing), the last two online (query). Every optimization targets one of these four stages.
2. Chunking is the highest-leverage optimization — Bad chunks produce bad embeddings which produce bad retrieval. No amount of prompt engineering can compensate. Start with recursive or sentence-based chunking at 512 tokens.
3. Evaluate retrieval and generation separately — Use Recall@k and MRR for retrieval, Faithfulness and Answer Relevance for generation. RAGAS provides automated evaluation that correlates with human judgment.
4. Add a reranker before scaling up — A cross-encoder reranker (retrieve 20, rerank to 5) often improves answer quality more than switching embedding models or increasing chunk counts.
5. The field is moving fast — From Lewis et al. (2020) to Self-RAG, GraphRAG, and contextual retrieval in under four years. The basic pattern (retrieve then generate) is stable; the implementation details change quarterly.
References
- Chen, D. et al. (2017). Reading Wikipedia to Answer Open-Domain Questions. ACL.
- Lee, K. et al. (2019). Latent Retrieval for Weakly Supervised Open Domain QA. ACL.
- Guu, K. et al. (2020). REALM: Retrieval-Augmented Language Model Pre-Training. ICML.
- Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain QA. EMNLP.
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
- Nogueira, R. et al. (2020). Document Ranking with a Pretrained Sequence-to-Sequence Model. Findings of EMNLP.
- Izacard, G. & Grave, E. (2021). Leveraging Passage Retrieval with Generative Models. EACL.
- Izacard, G. et al. (2023). Atlas: Few-shot Learning with Retrieval Augmented Language Models. JMLR.
- Liu, N. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL.
- Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv.
- Asai, A. et al. (2023). Self-RAG: Learning to Retrieve, Generate and Critique. ICLR 2024.
- Yan, S. et al. (2024). Corrective Retrieval Augmented Generation. arXiv.