Document RAG
From OCR pipelines and vision-language models to structure-aware chunking, multi-representation indexing, and rigorous evaluation. The full story of making RAG work on real documents.
Four Decades of Teaching Machines to Read Documents
RAG over plain text is straightforward — chunk, embed, retrieve. Real documents are nothing like plain text. They have tables, headers, figures, footnotes, multi-column layouts, mathematical notation, and page numbers that interrupt sentences mid-word. Getting RAG to work on actual PDFs, scanned contracts, and research papers required solving the document understanding problem first — and that took decades.
Understanding this history explains why naive "pdf-to-text then chunk" pipelines fail, and what the current state of the art actually does differently.
Tesseract and the OCR Era
Hewlett-Packard Labs developed Tesseract between 1985 and 1994. It was one of the top three OCR engines in the 1995 UNLV accuracy test. HP open-sourced it in 2005, Google took over development in 2006, and it became the default document digitization tool for a generation. The approach was classic: binarize the image, detect connected components, match against character templates, apply dictionary correction.
The fundamental limitation: OCR sees characters, not structure. A two-column PDF becomes a single stream of text where the left column's line 1 merges with the right column's line 1. Tables become gibberish. Headers lose their hierarchy. Every downstream RAG system built on raw OCR output inherits these corruptions.
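The column-merge failure is easy to reproduce. A minimal sketch, with invented coordinates and words, showing why strict top-to-bottom reading order interleaves a two-column layout while a column-aware order does not:

```python
# Hypothetical two-column page: each word carries (x, y) coordinates.
words = [
    (0, 0, "RAG"),    (50, 0, "Tables"),
    (0, 1, "works"),  (50, 1, "break"),
    (0, 2, "well."),  (50, 2, "badly."),
]

# Naive OCR reading order: strictly top-to-bottom, left-to-right.
naive = " ".join(w for _, _, w in sorted(words, key=lambda t: (t[1], t[0])))
# "RAG Tables works break well. badly."  (columns interleaved)

# Column-aware order: cluster by x first, then read each column down.
left = " ".join(w for x, _, w in sorted(words, key=lambda t: (t[0], t[1])) if x < 25)
right = " ".join(w for x, _, w in sorted(words, key=lambda t: (t[0], t[1])) if x >= 25)
column_aware = left + " " + right
# "RAG works well. Tables break badly."
```

Real layout analysis is harder (column boundaries are not known in advance), but the failure mode is exactly this sort.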
PDF Parsers: pdfminer, Apache Tika, Poppler
Tools like pdfminer (2004), Apache Tika (2007), and Poppler extracted text from born-digital PDFs by reading the underlying text stream directly — no OCR needed. This was faster and more accurate for clean PDFs, but still suffered from the same structural blindness: reading order was guessed from coordinates, table cells were output as disconnected text fragments, and figures were invisible. The mantra of this era was "garbage in, garbage out" — no amount of clever chunking could fix fundamentally broken text extraction.
LayoutLM: Text + Layout in One Model
Yiheng Xu et al. at Microsoft Research made the critical leap: instead of treating documents as flat text, feed the model both the text tokens and their 2D bounding-box coordinates on the page. LayoutLM extended BERT by adding x/y position embeddings alongside the standard token and segment embeddings.
```python
# LayoutLM input: text + spatial position
token_embedding = text_embed(token) + position_2d_embed(x0, y0, x1, y1)
# x0, y0, x1, y1 = bounding box of the token on the page
# The model learns that tokens at the top of the page are headers,
# tokens aligned vertically are in the same column,
# and tokens in a grid pattern form a table.
```
— Xu, Y. et al. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD.
The result was dramatic: LayoutLM achieved state-of-the-art on form understanding (FUNSD), receipt extraction (SROIE), and document classification tasks — beating text-only models by large margins. For the first time, a model could distinguish a table header from body text based on where it sat on the page, not just what it said.
LayoutLMv2 & v3: Adding Vision
LayoutLM v1 still relied on OCR for text extraction. LayoutLMv2 integrated a visual backbone (ResNeXt-FPN) that processed the raw document image alongside text and layout, enabling the model to "see" visual cues like bold text, colored cells, and underlined headers. LayoutLMv3 unified the architecture further into a single multimodal transformer trained with masked language modeling, masked image modeling, and word-patch alignment.
— Xu, Y. et al. (2021). LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. ACL.
— Huang, Y. et al. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. ACM MM.
Donut: OCR-Free Document Understanding
Geewook Kim et al. at NAVER AI Lab asked a radical question: what if we skip OCR entirely? Donut (Document Understanding Transformer) was an end-to-end vision encoder-decoder that took a document image as input and directly generated structured JSON output — no text extraction step, no bounding boxes, no pipeline.
"We show that a simple encoder-decoder architecture with a visual encoder and a text decoder can achieve state-of-the-art performance on document understanding tasks without relying on OCR."
— Kim, G. et al. (2022). OCR-free Document Understanding Transformer. ECCV.
This was important philosophically: it proved that OCR was a bottleneck, not a prerequisite. But Donut was trained for extraction tasks (parsing receipts, forms), not for generating embeddings suitable for retrieval. That gap would take two more years to close.
Nougat: Academic PDF to Markdown
Lukas Blecher et al. at Meta AI built Nougat — a Donut-based model specifically trained to convert academic papers (with equations, tables, and citations) into structured Markdown. For the first time, you could feed in a scanned arXiv paper and get back clean, parseable text with LaTeX equations preserved. This was a game-changer for scientific RAG pipelines.
— Blecher, L. et al. (2023). Nougat: Neural Optical Understanding for Academic Documents. arXiv.
ColPali: Late Interaction for Document Retrieval
Manuel Faysse et al. proposed the approach that finally unified vision-language understanding with efficient retrieval. ColPali treats each document page as an image, processes it through a vision-language model (PaliGemma), and produces a set of multi-vector embeddings — one per image patch. At query time, it uses late interaction (the ColBERT paradigm) to compute fine-grained token-to-patch similarity.
```python
# ColPali: no OCR, no chunking — just images and late interaction
doc_patches = vision_encoder(page_image)   # (N_patches, dim)
query_tokens = text_encoder(query)         # (N_tokens, dim)

# Late interaction: max-sim between each query token and all patches
score = sum(max(dot(q_i, p_j) for p_j in doc_patches) for q_i in query_tokens)
# Each query token finds its best-matching visual region on the page
```
— Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv.
ColPali outperformed text-based retrieval pipelines on document benchmarks while being radically simpler — no OCR, no layout detection, no chunking decisions. The query "what was Q3 revenue?" can match directly against a table cell in the page image without ever extracting that cell as text. ColQwen2, the follow-up using Qwen2-VL as backbone, pushed accuracy further on the ViDoRe benchmark.
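The late-interaction (MaxSim) scoring reduces to a couple of matrix operations. A toy numpy version with made-up 2-dimensional embeddings (real models use hundreds of dimensions):

```python
import numpy as np

# Toy MaxSim: 2 query-token embeddings vs 3 page-patch embeddings.
# Values are invented for illustration.
query_tokens = np.array([[1.0, 0.0],
                         [0.0, 1.0]])        # (N_tokens, dim)
doc_patches = np.array([[0.9, 0.1],
                        [0.0, 0.8],
                        [0.5, 0.5]])         # (N_patches, dim)

sims = query_tokens @ doc_patches.T          # (N_tokens, N_patches)
score = sims.max(axis=1).sum()               # each token takes its best patch
# Token 1 matches patch 1 (0.9), token 2 matches patch 2 (0.8): score 1.7
```

Because each query token independently picks its best-matching patch, a single query term like "revenue" can latch onto one table cell even if the rest of the page is irrelevant.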
DSE, MMLongBench-Doc & Multi-Page Reasoning
The frontier has moved to multi-page document reasoning. Document Screenshot Embedding (DSE) extends the ColPali paradigm to handle long documents by embedding screenshots of every page and retrieving the relevant pages before generation. MMLongBench-Doc (Ma et al., 2024) established that even GPT-4o only achieves 42.7% accuracy on long-document QA — meaning the field still has enormous room for improvement. Multi-modal RAG over 100+ page documents remains an open problem.
The throughline: 1985 → 2026
Each generation solved the previous one's blindness:
The trajectory is clear: remove pipeline stages. OCR → layout parser → chunker → embedder is being replaced by image → multi-vector embedding. Fewer stages means fewer failure modes.
Full Document RAG Pipeline
The Document Extraction Problem
Before you can chunk a document, you need to extract its content. This is where most RAG pipelines silently fail. Here's what goes wrong and what to do about it.
Naive Extraction (What Fails)
- Tables become interleaved lines of text
- Multi-column layouts merge into one stream
- Headers and footers repeat on every page
- Equations become gibberish: "E = mc2"
- Figures and captions are completely lost
- Footnotes split from their references
Structure-Aware Extraction
- Tables extracted as structured data (CSV/JSON)
- Reading order follows visual flow
- Headers/footers identified and stripped
- Equations preserved as LaTeX
- Figures captioned and linked to text
- Footnotes merged with their paragraphs
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("financial_report.pdf")

# Structured output: sections, tables, figures all preserved
doc = result.document

# Access tables as structured data
for table in doc.tables:
    print(f"Table: {table.caption}")
    for row in table.data:
        print(row)  # Each row is a list of cell values

# Access text with hierarchy
for section in doc.sections:
    print(f"## {section.heading}")
    for paragraph in section.paragraphs:
        print(paragraph.text)

# Export to Markdown (preserves structure for chunking)
markdown = doc.export_to_markdown()
print(markdown)
```
Extraction Tool Comparison (2026)
| Tool | Approach | Tables | Speed |
|---|---|---|---|
| PyMuPDF / pdfminer | Text stream extraction | Poor | Fast |
| Unstructured.io | ML-based layout detection | Good | Medium |
| Docling (IBM) | Vision + layout + OCR fusion | Excellent | Medium |
| Nougat | End-to-end vision model | Excellent | Slow (GPU) |
| ColPali / DSE | No extraction — image retrieval | Native | Fast (retrieval) |
Chunking Strategies Compared
Structure-Aware Chunking
Chunking is where most document RAG systems fail. The difference between naive and structure-aware chunking is the difference between a system that hallucinates and one that cites correctly. The key insight: chunk boundaries should respect document structure, not arbitrary character counts.
1. Section-Aware Recursive Splitting
Use Markdown headers from structured extraction as primary split points. Never split mid-section unless the section exceeds the chunk size limit.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Step 1: Split by document structure (headers)
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
)
header_chunks = header_splitter.split_text(markdown_text)

# Step 2: Sub-split large sections by character count
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

final_chunks = []
for chunk in header_chunks:
    if len(chunk.page_content) > 500:
        sub_chunks = char_splitter.split_text(chunk.page_content)
        for sc in sub_chunks:
            # Inherit section metadata from parent
            final_chunks.append(Document(
                page_content=sc,
                metadata={**chunk.metadata, "chunk_type": "text"}
            ))
    else:
        final_chunks.append(chunk)
```
2. Semantic Chunking
Use embedding similarity to find natural breakpoints. When consecutive sentences have low similarity, that's a topic boundary. Produces variable-length chunks that respect meaning, not character counts.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunking: split where meaning shifts
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90  # Split at top-10% dissimilarity
)
semantic_chunks = semantic_splitter.split_documents(documents)
# Result: chunks that follow topic boundaries, not character counts
# "Section 4.2: Benefits" stays together even if it's 800 chars
```
3. Parent-Child Retrieval
Embed small chunks for retrieval precision, but return their parent context to the LLM. This is the single most impactful architectural pattern for document RAG.
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Small chunks = precise retrieval (find the right sentence)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Large chunks = rich context (give the LLM the full section)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)

store = InMemoryStore()
# Note: FAISS cannot be built from an empty list; in practice seed the
# index with your first batch of documents.
vectorstore = FAISS.from_documents([], OpenAIEmbeddings())

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(documents)

# Query matches a 200-char child, but returns the 1500-char parent
results = retriever.invoke("What is the vacation policy?")
# Result includes surrounding context the LLM needs to answer fully
```
4. Table-Aware Chunking
Tables must never be split across chunks. Extract them as self-contained units with their caption and surrounding context.
```python
from langchain_core.documents import Document

def chunk_with_tables(doc_sections: list, tables: list) -> list:
    """Keep tables intact as separate chunks with context."""
    chunks = []
    for section in doc_sections:
        # Regular text: recursive split
        text_chunks = char_splitter.split_text(section.text)
        chunks.extend(text_chunks)
    for table in tables:
        # Table chunk: caption + serialized table + surrounding text
        table_text = f"Table: {table.caption}\n"
        table_text += table.to_markdown()  # Markdown table format
        if table.preceding_paragraph:
            table_text = table.preceding_paragraph + "\n\n" + table_text
        chunks.append(Document(
            page_content=table_text,
            metadata={
                "chunk_type": "table",
                "page": table.page_number,
                "source": table.source_file,
            }
        ))
    return chunks
```
Chunking Strategy Comparison
| Strategy | Best For | Weakness | Chunk Size |
|---|---|---|---|
| Section-Aware | Structured docs (reports, papers) | Requires Markdown extraction | Variable |
| Semantic | Topic-based retrieval, mixed docs | Expensive (requires embeddings) | Variable |
| Parent-Child | Long documents, context-heavy QA | Complex, more storage | 200 / 1500 |
| Table-Aware | Financial docs, scientific papers | Requires table detection | Per-table |
Multi-Representation Indexing
Production document RAG does not use a single representation per chunk. The state of the art indexes multiple representations of each document element — text, summary, hypothetical questions, and table serializations — and retrieves the original source regardless of which representation matched.
This technique, popularized by LangChain's Multi-Vector Retriever and the RAPTOR paper (Sarthi et al., 2024), is the single biggest accuracy improvement most teams can make with minimal code changes.
How It Works
Original Chunk → Rep 1: Raw Text / Rep 2: LLM Summary / Rep 3: Hypothetical Qs — all three representations point back to the same original chunk.
```python
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import uuid

# Stores: vector index for search, byte store for original docs
# (seed FAISS with your first batch; it can't be built from an empty list)
vectorstore = FAISS.from_documents([], OpenAIEmbeddings())
docstore = InMemoryByteStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=docstore,
    id_key=id_key,
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

for chunk in chunks:
    doc_id = str(uuid.uuid4())
    chunk.metadata[id_key] = doc_id

    # Store original chunk (the retriever wraps the byte store so
    # Documents are serialized correctly)
    retriever.docstore.mset([(doc_id, chunk)])

    # Representation 1: raw text embedding
    vectorstore.add_documents([chunk])

    # Representation 2: LLM summary
    summary = llm.invoke(f"Summarize in 2 sentences:\n{chunk.page_content}")
    summary_doc = Document(
        page_content=summary.content,
        metadata={id_key: doc_id, "type": "summary"}
    )
    vectorstore.add_documents([summary_doc])

    # Representation 3: hypothetical questions (HyDE variant)
    questions = llm.invoke(
        f"Generate 3 questions this text answers:\n{chunk.page_content}"
    )
    for q in questions.content.split("\n"):
        if q.strip():
            q_doc = Document(
                page_content=q.strip(),
                metadata={id_key: doc_id, "type": "question"}
            )
            vectorstore.add_documents([q_doc])

# At query time: user question matches a hypothetical question,
# but the LLM receives the original full chunk
results = retriever.invoke("What was the revenue in Q3?")
# Returns the original financial table, not the summary or question
```
Why This Works So Well
The vocabulary mismatch problem is the #1 cause of RAG retrieval failures. A user asks "What's the PTO policy?" but the document says "Annual leave entitlement per Section 4.2." The raw text embedding misses it. But the hypothetical question "How many vacation days do employees get?" bridges the gap — it matches the user's query while pointing to the correct source chunk.
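The mismatch is visible even with a crude lexical measure. A toy sketch (hypothetical strings; `overlap` is a stand-in for embedding similarity):

```python
def overlap(a: str, b: str) -> float:
    """Crude lexical similarity: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

query = "what is the pto policy"
raw_chunk = "annual leave entitlement per section 4.2"
hypo_q = "what is the company pto policy for employees"

sim_raw = overlap(query, raw_chunk)   # 0.0  — zero shared vocabulary
sim_hypo = overlap(query, hypo_q)     # 0.625 — the generated question bridges it
```

Dense embeddings soften this gap but do not close it; indexing the hypothetical question gives retrieval a representation written in the user's vocabulary while still pointing at the original chunk.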
Teams that add multi-representation indexing typically see 15–30% improvement in recall@5 on their domain-specific benchmarks.
Table Extraction and Retrieval
Tables are the hardest element in document RAG. They contain dense, structured information where row-column relationships carry meaning that vanishes when serialized as flat text. Here are the three production approaches.
Approach 1: Markdown Serialization
Convert tables to Markdown format. Simple, works surprisingly well for LLMs that were trained on Markdown-heavy data.
```python
from langchain_core.documents import Document

# Serialized table as Markdown (embeds well, LLMs parse easily)
table_md = """
| Quarter | Revenue ($M) | Growth |
|---------|--------------|--------|
| Q1 2025 | 142.3        | +12%   |
| Q2 2025 | 158.7        | +11.5% |
| Q3 2025 | 171.2        | +7.9%  |
| Q4 2025 | 189.4        | +10.6% |
"""

# Embed with caption context for better retrieval
chunk = Document(
    page_content=f"Table 3: Quarterly Revenue Summary\n{table_md}",
    metadata={"chunk_type": "table", "page": 15}
)
```
Approach 2: LLM-Generated Natural Language Summary
Ask an LLM to describe the table in prose. Index the description for retrieval, return the original table for generation.
```python
# Generate a natural language description of the table
prompt = f"""Describe this table in 2-3 sentences, including
key data points and trends:\n{table_md}"""
description = llm.invoke(prompt).content
# "Table 3 shows quarterly revenue for 2025, growing from
#  $142.3M in Q1 to $189.4M in Q4. Growth rates ranged from
#  7.9% to 12%, with Q3 the slowest quarter."

# Index the description, but store the original table
vectorstore.add_documents([Document(
    page_content=description,
    metadata={"doc_id": table_id, "type": "table_summary"}
)])
```
Approach 3: Vision-Based (ColPali / GPT-4o)
Skip extraction entirely. Send the page image containing the table to a vision-language model. This handles complex table layouts (merged cells, nested headers) that text extraction cannot.
```python
import base64
from openai import OpenAI

client = OpenAI()

# Convert page to image, send directly to vision model
with open("page_15.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What was Q3 2025 revenue?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{image_b64}"
            }}
        ]
    }]
)
# "$171.2M with 7.9% growth" — reads directly from the table image
```
Citation Generation
Production RAG without citations is a liability. Users need to verify answers, auditors need trails, and your system needs a mechanism to detect when the LLM fabricates information. Here's the engineering.
```python
import re

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

def generate_with_citations(query: str, chunks: list) -> dict:
    """Generate answer with inline citations and verification."""
    # Format chunks with citation markers and rich metadata
    context_parts = []
    for i, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "Unknown")
        page = chunk.metadata.get("page", "N/A")
        section = chunk.metadata.get("h2", chunk.metadata.get("section", ""))
        chunk_type = chunk.metadata.get("chunk_type", "text")
        header = f"[{i+1}] Source: {source} | Page: {page}"
        if section:
            header += f" | Section: {section}"
        if chunk_type == "table":
            header += " | [TABLE]"
        context_parts.append(f"{header}\n{chunk.page_content}")
    context = "\n\n---\n\n".join(context_parts)

    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a precise document analyst. Rules:
1. Answer ONLY using the provided context chunks.
2. Include inline citations [1], [2], etc. for EVERY claim.
3. If a claim spans multiple sources, cite all: [1][3].
4. For tables, reference the specific row/column.
5. If the context is insufficient, say "The provided documents
   do not contain enough information to answer this question."
6. Never infer beyond what the sources explicitly state."""),
        ("user", """Context chunks:
{context}

Question: {question}

Answer with citations:""")
    ])

    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    chain = prompt | llm
    response = chain.invoke({
        "context": context,
        "question": query
    })

    # Extract and validate citations
    cited_nums = set(int(n) for n in re.findall(r'\[(\d+)\]', response.content))
    available_nums = set(range(1, len(chunks) + 1))
    hallucinated_citations = cited_nums - available_nums

    return {
        "answer": response.content,
        "sources": [
            {
                "citation": f"[{i+1}]",
                "source": chunks[i].metadata.get("source"),
                "page": chunks[i].metadata.get("page"),
                "section": chunks[i].metadata.get("h2", ""),
                "preview": chunks[i].page_content[:200] + "..."
            }
            for i in range(len(chunks))
            if (i + 1) in cited_nums
        ],
        "citation_integrity": len(hallucinated_citations) == 0,
        "hallucinated_refs": list(hallucinated_citations),
    }
```
Example output:

According to the 2025 annual report, total revenue reached $661.6M for the fiscal year [1]. Q4 was the strongest quarter at $189.4M, representing 10.6% quarter-over-quarter growth [1][2]. The CEO noted in the shareholder letter that growth was "primarily driven by enterprise expansion" rather than new customer acquisition [3].
Sources:
[1]: annual_report_2025.pdf, page 15, Section: Financial Summary
[2]: annual_report_2025.pdf, page 15, Section: Financial Summary [TABLE]
[3]: annual_report_2025.pdf, page 3, Section: Letter to Shareholders
Citation integrity: PASS (all references valid)
Full Production Pipeline
Here is a complete production document RAG pipeline combining structure-aware extraction, multi-representation indexing, hybrid retrieval with reranking, and cited generation.
Architecture Overview
Ingestion: PDF / DOCX → Extract (Docling) → Chunk (Section-Aware) → Index (Multi-Rep) → Store (Vector DB)
Query: Query → Retrieve (Hybrid + Rerank) → Generate (LLM + Citations) → Output (Cited Answer)
```python
import uuid

import numpy as np
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from sentence_transformers import CrossEncoder

class DocumentRAGPipeline:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
        self.vectorstore = FAISS.from_documents([], self.embeddings)
        self.docstore = InMemoryByteStore()
        self.retriever = MultiVectorRetriever(
            vectorstore=self.vectorstore,
            byte_store=self.docstore,
            id_key="doc_id",
        )

    def ingest(self, markdown: str, source: str) -> int:
        """Ingest a structured Markdown document."""
        # Step 1: Section-aware splitting
        header_splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
        )
        sections = header_splitter.split_text(markdown)

        # Step 2: Sub-split + multi-representation indexing
        char_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500, chunk_overlap=50
        )
        count = 0
        for section in sections:
            sub_chunks = char_splitter.split_text(section.page_content)
            for text in sub_chunks:
                doc_id = str(uuid.uuid4())
                original = Document(
                    page_content=text,
                    metadata={**section.metadata, "source": source, "doc_id": doc_id}
                )
                # Store original
                self.retriever.docstore.mset([(doc_id, original)])
                # Rep 1: raw text
                self.vectorstore.add_documents([original])
                # Rep 2: hypothetical questions
                qs = self.llm.invoke(
                    f"Generate 2 questions this answers:\n{text}"
                ).content
                for q in qs.strip().split("\n"):
                    if q.strip():
                        self.vectorstore.add_documents([Document(
                            page_content=q.strip(),
                            metadata={"doc_id": doc_id, "type": "question"}
                        )])
                count += 1
        return count

    def query(self, question: str, k: int = 10, top_k: int = 5) -> dict:
        """Full RAG pipeline: retrieve, rerank, generate with citations."""
        # Stage 1: Multi-vector retrieval
        candidates = self.retriever.invoke(question)[:k]
        if not candidates:
            return {"answer": "No relevant documents found.", "sources": []}

        # Stage 2: Rerank
        if len(candidates) > top_k:
            pairs = [[question, doc.page_content] for doc in candidates]
            scores = self.reranker.predict(pairs)
            ranked = np.argsort(scores)[::-1][:top_k]
            candidates = [candidates[i] for i in ranked]

        # Stage 3: Generate with citations
        return generate_with_citations(question, candidates)

# Usage
pipeline = DocumentRAGPipeline()
count = pipeline.ingest(markdown_text, source="annual_report_2025.pdf")
print(f"Indexed {count} chunks with multi-representation")

result = pipeline.query("What was the total revenue in 2025?")
print(result["answer"])
for src in result["sources"]:
    print(f"  {src['citation']}: {src['source']}, p.{src['page']}")
```
Evaluation with RAGAS
RAGAS (Retrieval Augmented Generation Assessment) by Shahul Es et al. (2023) is the standard evaluation framework. It measures both retrieval quality and generation quality without requiring human evaluation for every test case — using LLM-as-judge to decompose and verify claims.
— Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv.
Retrieval Metrics
Context Precision
Are retrieved chunks actually relevant? Measures signal-to-noise in your retrieval set. Low precision = the LLM is drowning in irrelevant context.
Context Recall
Do retrieved chunks cover all the information needed? Low recall = the answer exists in your corpus but retrieval missed it.
Generation Metrics
Faithfulness
Is every claim in the answer supported by the retrieved context? The hallucination detector. Most important metric for trust.
Answer Relevancy
Does the answer actually address the question asked? Catches tangential or generic responses.
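In spirit, the headline metrics are just ratios over judge verdicts. A simplified sketch (RAGAS itself decomposes answers into claims with an LLM and rank-weights context precision, so real values will differ):

```python
# Faithfulness: fraction of answer claims the judge finds supported
# by the retrieved context (1.0 = no hallucinated claims).
claim_supported = [True, True, True, False]
faithfulness_score = sum(claim_supported) / len(claim_supported)   # 0.75

# Context precision (simplified): fraction of retrieved chunks judged
# relevant to the question. RAGAS weights this by rank (precision@k).
chunk_relevant = [True, False, True, True, False]
context_precision_score = sum(chunk_relevant) / len(chunk_relevant)  # 0.6
```

This framing makes the thresholds concrete: a faithfulness of 0.75 means one claim in four has no support in the retrieved context.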
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)

# Build evaluation dataset from your RAG pipeline
eval_questions = [
    "What was total revenue in 2025?",
    "How many vacation days do employees get?",
    "What are the API rate limits?",
]
ground_truths = [...]  # One reference answer per question

eval_data = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for q, gt in zip(eval_questions, ground_truths):
    result = pipeline.query(q)
    eval_data["question"].append(q)
    eval_data["answer"].append(result["answer"])
    # Assumes pipeline.query also returns the retrieved chunks
    eval_data["contexts"].append([c.page_content for c in result["chunks"]])
    eval_data["ground_truth"].append(gt)

dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(results)
# {'context_precision': 0.91, 'context_recall': 0.87,
#  'faithfulness': 0.94, 'answer_relevancy': 0.89}

# Dig into failures:
df = results.to_pandas()
low_faith = df[df["faithfulness"] < 0.7]
print(f"\n{len(low_faith)} questions with faithfulness < 0.7:")
for _, row in low_faith.iterrows():
    print(f"  Q: {row['question']}")
    print(f"  A: {row['answer'][:100]}...")
    print(f"  Faithfulness: {row['faithfulness']:.2f}\n")
```
Benchmark Targets for Document RAG
| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| Context Precision | > 0.70 | > 0.85 | > 0.95 |
| Context Recall | > 0.65 | > 0.80 | > 0.90 |
| Faithfulness | > 0.80 | > 0.90 | > 0.95 |
| Answer Relevancy | > 0.70 | > 0.85 | > 0.95 |
For document RAG specifically, faithfulness thresholds should be higher than for general RAG because users make decisions based on cited documents. A faithfulness score of 0.80 means roughly one in five claims in your answers has no support in the retrieved sources.
Why Document RAG Pipelines Fail in Production
Most failures are not in the LLM generation step. They are upstream, in extraction and chunking.
1. Table Corruption
A financial PDF has a revenue table. PyMuPDF extracts it as: "Q1 142.3 Q2 158.7 Q3 171.2 Revenue Growth +12% +11.5% +7.9%". The row-column relationship is destroyed. When the LLM sees this, it might associate Q1 with +7.9% growth.
Fix: Use Docling or Nougat for table extraction. Serialize tables as Markdown with headers preserved. Add table-specific chunks.
2. Cross-Chunk Information
The answer requires combining information from two chunks — the definition from page 3 and the exception from page 47. Your retriever finds chunk A but misses chunk B because B doesn't contain any keywords from the query.
Fix: Multi-representation indexing. The hypothetical question for chunk B ("Are there exceptions to the PTO policy?") bridges the vocabulary gap. Also consider increasing retrieval depth (k=15) with aggressive reranking (top_k=5).
3. Stale Citations
Document version 2 replaces version 1, but old chunks remain in the vector store. The LLM cites page 42 of the employee handbook, but that page no longer exists in the current version. The user follows the citation and finds different content.
Fix: Namespace your vector store by document version. When a document is updated, delete all chunks from the previous version before ingesting the new one. Add version and ingested_at metadata to every chunk.
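A minimal sketch of version-scoped replacement, using a plain dict as a stand-in for a vector store's metadata-filtered delete (function and field names are hypothetical):

```python
import uuid
from datetime import datetime, timezone

def ingest_version(store: dict, doc_name: str, version: str, chunks: list):
    """Delete every chunk from older versions of a document, then ingest
    the new chunks with version + timestamp metadata."""
    stale = [k for k, v in store.items()
             if v["source"] == doc_name and v["version"] != version]
    for k in stale:
        del store[k]
    for text in chunks:
        store[str(uuid.uuid4())] = {
            "source": doc_name,
            "version": version,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "text": text,
        }

store = {}
ingest_version(store, "handbook.pdf", "v1", ["old page 42"])
ingest_version(store, "handbook.pdf", "v2", ["new page 40", "new page 41"])
# Only v2 chunks remain: the old page 42 can no longer be cited
```

Most production vector databases (Pinecone, Qdrant, Weaviate) support this as a metadata-filtered delete or a per-version namespace.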
4. Context Window Overflow
You retrieve 10 chunks of 500 tokens each = 5,000 tokens of context. Add the system prompt, the query, and the generation overhead. If the LLM's effective context window is smaller than advertised (common with long-context models), it silently ignores later chunks — and those are often the most relevant ones from reranking.
Fix: Place the most relevant chunks first (reranker output is already sorted). Monitor actual token usage. Consider lost-in-the-middle effects — Liu et al. (2024) showed LLMs perform worst on information placed in the middle of the context.
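The budget check itself is cheap. A sketch using the rough 4-characters-per-token heuristic (swap in a real tokenizer such as tiktoken for exact counts; `fit_to_budget` is an illustrative helper, not a library function):

```python
def fit_to_budget(chunks: list, budget_tokens: int = 4000) -> list:
    """Keep reranked chunks (already sorted best-first) until the rough
    token budget is exhausted, dropping the least relevant ones."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk) // 4 + 1   # ~4 chars per token heuristic
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept

# Best-first order means overflow drops the tail, not the head:
chunks = ["a" * 8000, "b" * 8000, "c" * 8000]   # ~2000 tokens each
kept = fit_to_budget(chunks, budget_tokens=4500)
# Only the two most relevant chunks fit the budget
```

Truncating explicitly like this is strictly better than letting the model silently ignore whatever falls past its effective window.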
Key Takeaways
1. Extraction quality determines RAG quality — No amount of clever prompting fixes corrupted table data or merged columns. Use structure-aware extraction (Docling, Nougat) or skip it entirely (ColPali).
2. Chunk boundaries must respect document structure — Tables, sections, and lists are semantic units. Splitting them mid-element is the #1 source of wrong answers in production.
3. Multi-representation indexing closes the vocabulary gap — Index summaries and hypothetical questions alongside raw text. This is the highest-leverage improvement for retrieval recall.
4. Citations are non-negotiable in production — Every claim must trace back to a source chunk with page number and section. Validate citation integrity programmatically. Users need to verify.
5. Measure faithfulness, not just relevancy — RAGAS faithfulness > 0.90 should be the gate for production deployment. A system that gives relevant but unfaithful answers is worse than one that says "I don't know."