
Document RAG Pipeline

Build a complete Retrieval-Augmented Generation system for documents. Parse PDFs and other document formats, chunk intelligently, embed for semantic search, retrieve relevant context, and generate grounded answers with LLMs.

How Document RAG Works

Without RAG

You ask: "What's our company's refund policy?"

The LLM, trained on the public internet, has no access to your documents. It responds: "I don't have information about your specific company policies..." Or worse, it makes something up.

With RAG

You ask: "What's our company's refund policy?"

The system retrieves matching passages from your docs: "30-day money-back guarantee... return in original packaging..." The LLM reasons over the retrieved context and responds: "Our refund policy offers a 30-day money-back guarantee. Items must be returned in original packaging."

The RAG Pipeline: From Documents to Answers

At indexing time: parse documents -> chunk -> embed -> store in a vector database. At query time: embed the question -> retrieve the most similar chunks -> generate a grounded answer. Each stage is detailed below.

1. Why Chunking Matters

Documents are too long for embedding models, which accept only a limited number of tokens, so we must split them intelligently.

Original Document

Our company provides a 30-day money-back guarantee on all physical products. Customers may return items for a full refund if they are unsatisfied with their purchase.

To initiate a return, contact our support team with your order number. Items must be in original packaging and unused condition. Shipping costs for returns are the customer's responsibility unless the item arrived damaged.

Refunds are processed within 5-7 business days after we receive the returned item. The refund will be credited to your original payment method.

For digital products, refunds are available within 14 days of purchase, but only if the product has not been downloaded or accessed.

Chunks Created
Chunk #1 (166 chars): Our company provides a 30-day money-back guarantee on all physical products. Customers may return items for a full refund if they are unsatisfied with their purchase.
Chunk #2 (222 chars): To initiate a return, contact our support team with your order number. Items must be in original packaging and unused condition. Shipping costs for returns are the customer's responsibility unless the item arrived damaged.
Chunk #3 (143 chars): Refunds are processed within 5-7 business days after we receive the returned item. The refund will be credited to your original payment method.
Chunk #4 (132 chars): For digital products, refunds are available within 14 days of purchase, but only if the product has not been downloaded or accessed.
The Key Insight
Good chunks preserve complete thoughts. If a chunk is cut mid-sentence, the embedding won't capture the full meaning, and retrieval quality suffers.
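The chunks above follow paragraph boundaries. A minimal sketch of that strategy, with a character cap as a fallback (the 500-character limit is an arbitrary choice for illustration):

import re

def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines so each chunk is a complete thought;
    oversized paragraphs are packed sentence by sentence up to the cap."""
    chunks: list[str] = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if current and len(current) + len(sentence) > max_chars:
                chunks.append(current.rstrip())
                current = ""
            current += sentence + " "
        if current.strip():
            chunks.append(current.rstrip())
    return chunks

# policy_text is a placeholder for the refund document shown above;
# on that text, this produces the four chunks listed.
chunks = chunk_by_paragraph(policy_text)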

2. Embeddings: Meaning as Geometry

Each chunk becomes a point in high-dimensional space. Similar meanings = nearby points.

In a 2D projection of the embedding space, the four chunks form clusters by topic: refund guarantees, the return process, digital products, and refund processing. Queries land near the chunks they are most similar to.

How Similarity Works

Cosine similarity measures the angle θ between two vectors: cos(θ) = (a · b) / (‖a‖ ‖b‖). Same direction = 1.0, opposite = -1.0, perpendicular = 0.
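A minimal end-to-end sketch: embed the chunks and a query, then rank by cosine similarity. The embedding model name is an assumption; any embedding API works the same way.

import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = [
    "Our company provides a 30-day money-back guarantee on all physical products.",
    "For digital products, refunds are available within 14 days of purchase.",
]
vectors = embed(chunks + ["Can I return a digital download?"])
query_vec = vectors[-1]
for chunk, vec in zip(chunks, vectors):
    print(f"{cosine_similarity(vec, query_vec):.2f}  {chunk[:50]}...")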

3. Retrieval: Finding the Right Context

The retrieved chunks become the LLM's knowledge for answering your question.

Your question: "Can I return a digital download?"

Retrieved (top 2):
  • 0.94 (best match): "For digital products, refunds are available within 14 days..."
  • 0.72 (2nd match): "30-day money-back guarantee on all physical products..."

Not retrieved:
  • 0.31: "Contact support with your order number..."
  • 0.28: "Refunds processed in 5-7 days..."

The retrieved chunks plus the question go to the LLM, which reasons over the retrieved context.

Generated answer: "Yes, you can return a digital download, but only within 14 days of purchase and only if you haven't downloaded or accessed it yet."

Key retrieval parameters:
  • Top-K (k): how many chunks to retrieve (typically 3-10)
  • Threshold (τ): minimum similarity to include (e.g., 0.7)
  • MMR (diversity): avoid redundant chunks with similar content
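A minimal sketch combining Top-K and a threshold, reusing the embed and cosine_similarity helpers sketched earlier:

def retrieve(query_vec, chunk_vecs, chunks, k=3, threshold=0.7):
    """Score every chunk against the query, keep the best k above the threshold."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), chunk)
         for vec, chunk in zip(chunk_vecs, chunks)),
        reverse=True,
        key=lambda pair: pair[0],
    )
    return [(score, chunk) for score, chunk in scored[:k] if score >= threshold]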

The "Lost in the Middle" Problem

LLMs pay more attention to the beginning and end of context, often missing crucial middle content.

Attention over the context window is roughly U-shaped: high at the beginning, high at the end, and low across the middle, which is often ignored.
Solution: Put Most Relevant First
Use a reranker to reorder retrieved chunks by relevance before sending to the LLM. The best match should be first in the context.
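A minimal sketch of one reordering scheme, assuming chunks arrive sorted best-first: keep the best match at the very start and alternate the rest toward the end, so the weakest chunks land in the often-ignored middle (LangChain ships a similar LongContextReorder transformer).

def reorder_for_attention(chunks_best_first: list[str]) -> list[str]:
    """Alternate chunks between the front and back of the context so the
    least relevant ones end up in the low-attention middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# retrieved: (score, chunk) pairs from the retrieve() sketch above
ordered = reorder_for_attention([chunk for _, chunk in retrieved])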

RAG Patterns: From Simple to Advanced

Naive RAG: Chunk -> Embed -> Retrieve -> Generate. A good starting point that works for many cases.

+ Reranking: Retrieve many -> Rerank -> Keep best. Higher precision; handles ambiguous queries.

Hybrid Search: Vectors + Keywords (BM25). Catches exact matches that vectors miss (see the fusion sketch after this list).

Agentic RAG: The LLM decides what to retrieve. Multi-hop reasoning for complex questions.
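A minimal hybrid-search sketch using reciprocal rank fusion (RRF) to merge a vector ranking with a BM25 keyword ranking. The k=60 constant follows the common RRF convention; the two input rankings are placeholders for output from your vector store and a BM25 library such as rank_bm25.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; RRF scores by rank alone, so vector
    scores and BM25 scores never need to share a scale."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["chunk4", "chunk1", "chunk3", "chunk2"]  # from the vector store
bm25_ranking = ["chunk1", "chunk4", "chunk2", "chunk3"]    # from keyword search
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))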

When RAG Fails (And How to Fix It)

1. Retrieved wrong chunks
The query and the relevant content use different words ("cancel subscription" vs "end membership").
Fix: query expansion, HyDE (hypothetical document embeddings, sketched after this list), or domain-specific embedding models.

2. Context too long / truncated
You retrieved 20 chunks but only 5 fit in the context window.
Fix: smaller chunks, summarization, or models with longer context windows (GPT-4, Claude).

3. LLM ignores context / hallucinates
The answer sounds plausible but isn't grounded in the retrieved documents.
Fix: explicit prompting ("ONLY use the provided context"), required citations, or grounded models like Command R.
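A minimal HyDE sketch: ask the LLM for a hypothetical answer, then embed that draft instead of the raw query, so the query vector lands closer to answer-like chunks. The model names are assumptions.

from openai import OpenAI

client = OpenAI()

def hyde_query_vector(question: str) -> list[float]:
    """Embed a hypothetical answer rather than the question itself."""
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {question}",
        }],
    ).choices[0].message.content
    return client.embeddings.create(
        model="text-embedding-3-small", input=[draft]
    ).data[0].embedding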

Measuring RAG Quality

Use the RAGAS framework to evaluate your pipeline. Example scores:

Retrieval quality:
  • Context Precision: 0.85
  • Context Recall: 0.78

Generation quality:
  • Faithfulness: 0.92
  • Answer Relevancy: 0.88

Aim for >0.8 on all metrics for production quality.
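A minimal evaluation sketch using ragas' classic API (imports and column names have shifted between ragas versions, so treat this as a pattern rather than exact code); the metrics use an LLM judge, so an OPENAI_API_KEY is expected in the environment.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

# One evaluation row: the question, the pipeline's answer, the chunks it
# retrieved, and a human-written reference answer.
dataset = Dataset.from_dict({
    "question": ["Can I return a digital download?"],
    "answer": ["Yes, within 14 days, if it hasn't been downloaded or accessed."],
    "contexts": [[
        "For digital products, refunds are available within 14 days of "
        "purchase, but only if the product has not been downloaded or accessed."
    ]],
    "ground_truth": ["Digital products are refundable within 14 days if unused."],
})

result = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores, e.g. {'faithfulness': 0.92, ...}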

Use Cases

  • Enterprise knowledge base Q&A
  • Legal document analysis
  • Research paper assistant
  • Technical documentation search
  • Customer support automation
  • Compliance and policy lookup

Architectural Patterns

Naive RAG

Simple chunk-embed-retrieve-generate pipeline. Split documents into fixed-size chunks, embed each chunk, retrieve top-k by similarity, pass to LLM.

Pros:
  • Simple to implement
  • Works for many use cases
  • Fast development
Cons:
  • Chunks may split context
  • No semantic boundaries
  • Retrieved chunks may be redundant

Advanced RAG with Reranking

Add a cross-encoder reranker after initial retrieval to improve precision. Retrieve more candidates, then rerank for relevance.

Pros:
  • Higher precision
  • Better answer quality
  • Handles ambiguous queries
Cons:
  • Added latency
  • Requires reranker model
  • More complexity

Hierarchical / Parent-Child Chunking

Store small chunks for retrieval but return parent (larger) chunks for context. Captures both precision and context.

Pros:
  • Better context preservation
  • Precise retrieval
  • Reduces context fragmentation
Cons:
  • More complex indexing
  • Storage overhead
  • Implementation complexity
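A minimal sketch of the parent-child idea: index small child chunks for precise matching, but return the larger parent section they came from. split_small(), embed(), vector_store_add(), and vector_search() are placeholders for your chunker, embedder, and vector store (LangChain's ParentDocumentRetriever implements the same pattern).

child_to_parent: dict[str, str] = {}

def index_documents(documents: list[str]) -> None:
    for p_id, parent_text in enumerate(documents):         # parent = large section
        for c_id, child_text in enumerate(split_small(parent_text)):
            chunk_id = f"{p_id}:{c_id}"
            vector_store_add(chunk_id, embed(child_text))  # match on the child...
            child_to_parent[chunk_id] = parent_text        # ...return the parent

def retrieve_parents(query: str, k: int = 3) -> list[str]:
    child_ids = vector_search(embed(query), k=k)
    # Deduplicate: several children may share one parent
    return list(dict.fromkeys(child_to_parent[cid] for cid in child_ids))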

Agentic RAG

LLM-driven retrieval with query decomposition, multi-hop reasoning, and tool use. The agent decides what to retrieve and when.

Pros:
  • Handles complex questions
  • Multi-step reasoning
  • Can combine sources
Cons:
  • Higher latency
  • More API calls
  • Harder to debug

Graph RAG

Build a knowledge graph from documents, traverse relationships during retrieval. Good for entities and relationships.

Pros:
  • Captures relationships
  • Entity-centric queries
  • Structured knowledge
Cons:
  • Complex to build
  • Graph extraction errors
  • Maintenance overhead

Implementations

API Services

Cohere RAG (Cohere, API)

Built-in RAG with Command R models. Grounded generation with citations.

Perplexity API (Perplexity, API)

Online RAG with real-time web search. Always up-to-date answers.

OpenAI Assistants (OpenAI, API)

File search with vector store. Managed RAG infrastructure.

Open Source

LlamaIndex (MIT, open source)

Comprehensive RAG framework. Supports many loaders, indexes, and query engines. Best for complex pipelines.

LangChain (MIT, open source)

Popular LLM framework with strong RAG support. Good ecosystem, many integrations.

Haystack (Apache 2.0, open source)

Production-ready NLP framework. Strong on document processing and pipelines.

RAGFlow (Apache 2.0, open source)

Deep document understanding RAG engine. Strong on complex layouts.

Vercel AI SDK (Apache 2.0, open source)

Lightweight RAG building blocks for TypeScript/JavaScript. Good for web apps.

Code Examples

Simple RAG with LlamaIndex

Build a basic document Q&A system in minutes

Install: pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

# Load documents from a folder
documents = SimpleDirectoryReader('data/').load_data()

# Build index (chunks, embeds, stores)
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine(
    llm=OpenAI(model='gpt-4o'),
    similarity_top_k=3
)

# Ask a question
response = query_engine.query(
    'What are the main findings in these documents?'
)
print(response)

RAG with Reranking

Add a reranker for better precision using Cohere

Install: pip install llama-index llama-index-postprocessor-cohere-rerank
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Load and index documents
documents = SimpleDirectoryReader('data/').load_data()
index = VectorStoreIndex.from_documents(documents)

# Add Cohere reranker
reranker = CohereRerank(
    api_key='YOUR_COHERE_KEY',
    top_n=3
)

# Query with reranking
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more, rerank to 3
    node_postprocessors=[reranker]
)

response = query_engine.query('What is the refund policy?')
print(response)

Production RAG with Chroma + OpenAI

Persistent vector store with structure-aware recursive chunking

Install: pip install chromadb openai langchain langchain-openai langchain-chroma unstructured
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain.chains import RetrievalQA

# Load documents
loader = DirectoryLoader('docs/', glob='**/*.pdf')
documents = loader.load()

# Structure-aware chunking: split on paragraphs, then lines, then sentences
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=['\n\n', '\n', '. ', ' ', '']
)
chunks = text_splitter.split_documents(documents)

# Create persistent vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model='text-embedding-3-large'),
    persist_directory='./chroma_db'
)

# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model='gpt-4o'),
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    return_source_documents=True
)

# Query with sources
result = qa_chain.invoke({'query': 'Summarize the key points'})
print(result['result'])
print('\nSources:')
for doc in result['source_documents']:
    print(f'  - {doc.metadata.get("source", "unknown")}')

Quick Facts

Input: Document
Output: Text
Implementations: 5 open source, 3 API
Patterns: 5 approaches
