
Document RAG Pipeline

Build a complete Retrieval-Augmented Generation system for documents. Parse PDFs and other document formats, chunk intelligently, embed for semantic search, retrieve relevant context, and generate grounded answers with LLMs.

How Document RAG Works

Without RAG

You ask: "What's our company's refund policy?"

The LLM, trained on the public internet, has no access to your documents. It responds: "I don't have information about your specific company policies..." Or worse, it makes something up.

With RAG

You ask: "What's our company's refund policy?"

The system retrieves matching passages from your docs: "30-day money-back guarantee... return in original packaging..." The LLM reasons over the retrieved context and responds: "Our refund policy offers a 30-day money-back guarantee. Items must be returned in original packaging."

The RAG Pipeline: From Documents to Answers

At indexing time: parse documents -> chunk -> embed -> store in a vector database. At query time: embed the question -> retrieve the most similar chunks -> generate a grounded answer. Each stage is detailed below.

1. Why Chunking Matters

Documents are too long for embedding models, which accept only a limited number of tokens, so we must split them intelligently.

Original Document

Our company provides a 30-day money-back guarantee on all physical products. Customers may return items for a full refund if they are unsatisfied with their purchase.

To initiate a return, contact our support team with your order number. Items must be in original packaging and unused condition. Shipping costs for returns are the customer's responsibility unless the item arrived damaged.

Refunds are processed within 5-7 business days after we receive the returned item. The refund will be credited to your original payment method.

For digital products, refunds are available within 14 days of purchase, but only if the product has not been downloaded or accessed.

Chunks Created
Chunk #1 (166 chars): Our company provides a 30-day money-back guarantee on all physical products. Customers may return items for a full refund if they are unsatisfied with their purchase.
Chunk #2 (222 chars): To initiate a return, contact our support team with your order number. Items must be in original packaging and unused condition. Shipping costs for returns are the customer's responsibility unless the item arrived damaged.
Chunk #3 (143 chars): Refunds are processed within 5-7 business days after we receive the returned item. The refund will be credited to your original payment method.
Chunk #4 (132 chars): For digital products, refunds are available within 14 days of purchase, but only if the product has not been downloaded or accessed.
The Key Insight
Good chunks preserve complete thoughts. If a chunk is cut mid-sentence, the embedding won't capture the full meaning, and retrieval quality suffers.
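The chunks above follow paragraph boundaries. A minimal sketch of that strategy, with a character cap as a fallback (the 500-character limit is an arbitrary choice for illustration):

import re

def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines so each chunk is a complete thought;
    oversized paragraphs are packed sentence by sentence up to the cap."""
    chunks: list[str] = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if current and len(current) + len(sentence) > max_chars:
                chunks.append(current.rstrip())
                current = ""
            current += sentence + " "
        if current.strip():
            chunks.append(current.rstrip())
    return chunks

# policy_text is a placeholder for the refund document shown above;
# on that text, this produces the four chunks listed.
chunks = chunk_by_paragraph(policy_text)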

2. Embeddings: Meaning as Geometry

Each chunk becomes a point in high-dimensional space. Similar meanings = nearby points.

In a 2D projection of the embedding space, the four chunks form clusters by topic: refund guarantees, the return process, digital products, and refund processing. Queries land near the chunks they are most similar to.

How Similarity Works

Cosine similarity measures the angle θ between two vectors: cos(θ) = (a · b) / (‖a‖ ‖b‖). Same direction = 1.0, opposite = -1.0, perpendicular = 0.
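A minimal end-to-end sketch: embed the chunks and a query, then rank by cosine similarity. The embedding model name is an assumption; any embedding API works the same way.

import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = [
    "Our company provides a 30-day money-back guarantee on all physical products.",
    "For digital products, refunds are available within 14 days of purchase.",
]
vectors = embed(chunks + ["Can I return a digital download?"])
query_vec = vectors[-1]
for chunk, vec in zip(chunks, vectors):
    print(f"{cosine_similarity(vec, query_vec):.2f}  {chunk[:50]}...")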

3. Retrieval: Finding the Right Context

The retrieved chunks become the LLM's knowledge for answering your question.

Your question: "Can I return a digital download?"

Retrieved (top 2):
  • 0.94 (best match): "For digital products, refunds are available within 14 days..."
  • 0.72 (2nd match): "30-day money-back guarantee on all physical products..."

Not retrieved:
  • 0.31: "Contact support with your order number..."
  • 0.28: "Refunds processed in 5-7 days..."

The retrieved chunks plus the question go to the LLM, which reasons over the retrieved context.

Generated answer: "Yes, you can return a digital download, but only within 14 days of purchase and only if you haven't downloaded or accessed it yet."

Key retrieval parameters:
  • Top-K (k): how many chunks to retrieve (typically 3-10)
  • Threshold (τ): minimum similarity to include (e.g., 0.7)
  • MMR (diversity): avoid redundant chunks with similar content
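A minimal sketch combining Top-K and a threshold, reusing the embed and cosine_similarity helpers sketched earlier:

def retrieve(query_vec, chunk_vecs, chunks, k=3, threshold=0.7):
    """Score every chunk against the query, keep the best k above the threshold."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), chunk)
         for vec, chunk in zip(chunk_vecs, chunks)),
        reverse=True,
        key=lambda pair: pair[0],
    )
    return [(score, chunk) for score, chunk in scored[:k] if score >= threshold]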

The "Lost in the Middle" Problem

LLMs pay more attention to the beginning and end of context, often missing crucial middle content.

Attention over the context window is roughly U-shaped: high at the beginning, high at the end, and low across the middle, which is often ignored.
Solution: Put Most Relevant First
Use a reranker to reorder retrieved chunks by relevance before sending to the LLM. The best match should be first in the context.
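A minimal sketch of one reordering scheme, assuming chunks arrive sorted best-first: keep the best match at the very start and alternate the rest toward the end, so the weakest chunks land in the often-ignored middle (LangChain ships a similar LongContextReorder transformer).

def reorder_for_attention(chunks_best_first: list[str]) -> list[str]:
    """Alternate chunks between the front and back of the context so the
    least relevant ones end up in the low-attention middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# retrieved: (score, chunk) pairs from the retrieve() sketch above
ordered = reorder_for_attention([chunk for _, chunk in retrieved])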

RAG Patterns: From Simple to Advanced

Naive RAG: Chunk -> Embed -> Retrieve -> Generate. A good starting point that works for many cases.

+ Reranking: Retrieve many -> Rerank -> Keep best. Higher precision; handles ambiguous queries.

Hybrid Search: Vectors + Keywords (BM25). Catches exact matches that vectors miss (see the fusion sketch after this list).

Agentic RAG: The LLM decides what to retrieve. Multi-hop reasoning for complex questions.
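A minimal hybrid-search sketch using reciprocal rank fusion (RRF) to merge a vector ranking with a BM25 keyword ranking. The k=60 constant follows the common RRF convention; the two input rankings are placeholders for output from your vector store and a BM25 library such as rank_bm25.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; RRF scores by rank alone, so vector
    scores and BM25 scores never need to share a scale."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["chunk4", "chunk1", "chunk3", "chunk2"]  # from the vector store
bm25_ranking = ["chunk1", "chunk4", "chunk2", "chunk3"]    # from keyword search
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking]))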

When RAG Fails (And How to Fix It)

1. Retrieved wrong chunks
The query and the relevant content use different words ("cancel subscription" vs "end membership").
Fix: query expansion, HyDE (hypothetical document embeddings, sketched after this list), or domain-specific embedding models.

2. Context too long / truncated
You retrieved 20 chunks but only 5 fit in the context window.
Fix: smaller chunks, summarization, or models with longer context windows (GPT-4, Claude).

3. LLM ignores context / hallucinates
The answer sounds plausible but isn't grounded in the retrieved documents.
Fix: explicit prompting ("ONLY use the provided context"), required citations, or grounded models like Command R.
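A minimal HyDE sketch: ask the LLM for a hypothetical answer, then embed that draft instead of the raw query, so the query vector lands closer to answer-like chunks. The model names are assumptions.

from openai import OpenAI

client = OpenAI()

def hyde_query_vector(question: str) -> list[float]:
    """Embed a hypothetical answer rather than the question itself."""
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {question}",
        }],
    ).choices[0].message.content
    return client.embeddings.create(
        model="text-embedding-3-small", input=[draft]
    ).data[0].embedding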

Measuring RAG Quality

Use the RAGAS framework to evaluate your pipeline. Example scores:

Retrieval quality:
  • Context Precision: 0.85
  • Context Recall: 0.78

Generation quality:
  • Faithfulness: 0.92
  • Answer Relevancy: 0.88

Aim for >0.8 on all metrics for production quality.
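A minimal evaluation sketch using ragas' classic API (imports and column names have shifted between ragas versions, so treat this as a pattern rather than exact code); the metrics use an LLM judge, so an OPENAI_API_KEY is expected in the environment.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy, context_precision, context_recall, faithfulness,
)

# One evaluation row: the question, the pipeline's answer, the chunks it
# retrieved, and a human-written reference answer.
dataset = Dataset.from_dict({
    "question": ["Can I return a digital download?"],
    "answer": ["Yes, within 14 days, if it hasn't been downloaded or accessed."],
    "contexts": [[
        "For digital products, refunds are available within 14 days of "
        "purchase, but only if the product has not been downloaded or accessed."
    ]],
    "ground_truth": ["Digital products are refundable within 14 days if unused."],
})

result = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores, e.g. {'faithfulness': 0.92, ...}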

Use Cases

  • Enterprise knowledge base Q&A
  • Legal document analysis
  • Research paper assistant
  • Technical documentation search
  • Customer support automation
  • Compliance and policy lookup

Architectural Patterns

Naive RAG

Simple chunk-embed-retrieve-generate pipeline. Split documents into fixed-size chunks, embed each chunk, retrieve top-k by similarity, pass to LLM.

Pros:
  • Simple to implement
  • Works for many use cases
  • Fast development
Cons:
  • Chunks may split context
  • No semantic boundaries
  • Retrieved chunks may be redundant

Advanced RAG with Reranking

Add a cross-encoder reranker after initial retrieval to improve precision. Retrieve more candidates, then rerank for relevance.

Pros:
  • Higher precision
  • Better answer quality
  • Handles ambiguous queries
Cons:
  • Added latency
  • Requires reranker model
  • More complexity

Hierarchical / Parent-Child Chunking

Store small chunks for retrieval but return parent (larger) chunks for context. Captures both precision and context.

Pros:
  • Better context preservation
  • Precise retrieval
  • Reduces context fragmentation
Cons:
  • More complex indexing
  • Storage overhead
  • Implementation complexity
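A minimal sketch of the parent-child idea: index small child chunks for precise matching, but return the larger parent section they came from. split_small(), embed(), vector_store_add(), and vector_search() are placeholders for your chunker, embedder, and vector store (LangChain's ParentDocumentRetriever implements the same pattern).

child_to_parent: dict[str, str] = {}

def index_documents(documents: list[str]) -> None:
    for p_id, parent_text in enumerate(documents):         # parent = large section
        for c_id, child_text in enumerate(split_small(parent_text)):
            chunk_id = f"{p_id}:{c_id}"
            vector_store_add(chunk_id, embed(child_text))  # match on the child...
            child_to_parent[chunk_id] = parent_text        # ...return the parent

def retrieve_parents(query: str, k: int = 3) -> list[str]:
    child_ids = vector_search(embed(query), k=k)
    # Deduplicate: several children may share one parent
    return list(dict.fromkeys(child_to_parent[cid] for cid in child_ids))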

Agentic RAG

LLM-driven retrieval with query decomposition, multi-hop reasoning, and tool use. The agent decides what to retrieve and when.

Pros:
  • Handles complex questions
  • Multi-step reasoning
  • Can combine sources
Cons:
  • Higher latency
  • More API calls
  • Harder to debug

Graph RAG

Build a knowledge graph from documents, traverse relationships during retrieval. Good for entities and relationships.

Pros:
  • Captures relationships
  • Entity-centric queries
  • Structured knowledge
Cons:
  • Complex to build
  • Graph extraction errors
  • Maintenance overhead

Implementations

API Services

Cohere RAG (Cohere, API)

Built-in RAG with Command R models. Grounded generation with citations.

Perplexity API (Perplexity, API)

Online RAG with real-time web search. Always up-to-date answers.

OpenAI Assistants (OpenAI, API)

File search with vector store. Managed RAG infrastructure.

Open Source

LlamaIndex (MIT, open source)

Comprehensive RAG framework. Supports many loaders, indexes, and query engines. Best for complex pipelines.

LangChain (MIT, open source)

Popular LLM framework with strong RAG support. Good ecosystem, many integrations.

Haystack (Apache 2.0, open source)

Production-ready NLP framework. Strong on document processing and pipelines.

RAGFlow (Apache 2.0, open source)

Deep document understanding RAG engine. Strong on complex layouts.

Vercel AI SDK (Apache 2.0, open source)

Lightweight RAG building blocks for TypeScript/JavaScript. Good for web apps.

Code Examples

Simple RAG with LlamaIndex

Build a basic document Q&A system in minutes

Install: pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

# Load documents from a folder
documents = SimpleDirectoryReader('data/').load_data()

# Build index (chunks, embeds, stores)
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine(
    llm=OpenAI(model='gpt-4o'),
    similarity_top_k=3
)

# Ask a question
response = query_engine.query(
    'What are the main findings in these documents?'
)
print(response)

RAG with Reranking

Add a reranker for better precision using Cohere

Install: pip install llama-index llama-index-postprocessor-cohere-rerank
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Load and index documents
documents = SimpleDirectoryReader('data/').load_data()
index = VectorStoreIndex.from_documents(documents)

# Add Cohere reranker
reranker = CohereRerank(
    api_key='YOUR_COHERE_KEY',
    top_n=3
)

# Query with reranking
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more, rerank to 3
    node_postprocessors=[reranker]
)

response = query_engine.query('What is the refund policy?')
print(response)

Production RAG with Chroma + OpenAI

Persistent vector store with structure-aware recursive chunking

Install: pip install chromadb openai langchain langchain-openai langchain-chroma unstructured
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain.chains import RetrievalQA

# Load documents
loader = DirectoryLoader('docs/', glob='**/*.pdf')
documents = loader.load()

# Structure-aware chunking: split on paragraphs, then lines, then sentences
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=['\n\n', '\n', '. ', ' ', '']
)
chunks = text_splitter.split_documents(documents)

# Create persistent vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model='text-embedding-3-large'),
    persist_directory='./chroma_db'
)

# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model='gpt-4o'),
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    return_source_documents=True
)

# Query with sources
result = qa_chain.invoke({'query': 'Summarize the key points'})
print(result['result'])
print('\nSources:')
for doc in result['source_documents']:
    print(f'  - {doc.metadata.get("source", "unknown")}')

Quick Facts

Input: Document
Output: Text
Implementations: 5 open source, 3 API
Patterns: 5 approaches
