Level 3: Production · ~45 min

Document RAG

From OCR pipelines and vision-language models to structure-aware chunking, multi-representation indexing, and rigorous evaluation. The full story of making RAG work on real documents.

30 Years of Teaching Machines to Read Documents

RAG over plain text is straightforward — chunk, embed, retrieve. Real documents are nothing like plain text. They have tables, headers, figures, footnotes, multi-column layouts, mathematical notation, and page numbers that interrupt sentences mid-word. Getting RAG to work on actual PDFs, scanned contracts, and research papers required solving the document understanding problem first — and that took decades.

Understanding this history explains why naive "pdf-to-text then chunk" pipelines fail, and what the current state of the art actually does differently.

Era I: OCR & Rule-Based Extraction
1985–1995

Tesseract and the OCR Era

Hewlett-Packard Labs developed Tesseract between 1985 and 1994. It was one of the top three OCR engines in the 1995 UNLV accuracy test. HP open-sourced it in 2005, Google took over development in 2006, and it became the default document digitization tool for a generation. The approach was classic: binarize the image, detect connected components, match against character templates, apply dictionary correction.

The fundamental limitation: OCR sees characters, not structure. A two-column PDF becomes a single stream of text where the left column's line 1 merges with the right column's line 1. Tables become gibberish. Headers lose their hierarchy. Every downstream RAG system built on raw OCR output inherits these corruptions.
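The column-merge failure is easy to reproduce: sort word boxes purely by page coordinates and the two columns interleave line by line. A toy sketch with invented token coordinates:

```python
# Toy word boxes from a two-column page: (text, x, y)
tokens = [
    ("Revenue", 50, 100), ("grew", 50, 115),   # left column
    ("Costs", 300, 100), ("fell", 300, 115),   # right column
]

# Naive OCR-style reading order: top-to-bottom, then left-to-right
naive = " ".join(t[0] for t in sorted(tokens, key=lambda t: (t[2], t[1])))
print(naive)  # -> "Revenue Costs grew fell" (columns interleaved)

# Column-aware order: split at the column gap, read each column top-down
left = [t[0] for t in sorted(tokens, key=lambda t: t[2]) if t[1] < 200]
right = [t[0] for t in sorted(tokens, key=lambda t: t[2]) if t[1] >= 200]
print(" ".join(left + right))  # -> "Revenue grew Costs fell"
```

The hard-coded column boundary (x = 200) is the cheat: real pages need layout detection to find it, which is exactly what this era's tools lacked.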

2000s

PDF Parsers: pdfminer, Apache Tika, Poppler

Tools like pdfminer (2004), Apache Tika (2007), and Poppler extracted text from born-digital PDFs by reading the underlying text stream directly — no OCR needed. This was faster and more accurate for clean PDFs, but still suffered from the same structural blindness: reading order was guessed from coordinates, table cells were output as disconnected text fragments, and figures were invisible. The mantra of this era was "garbage in, garbage out" — no amount of clever chunking could fix fundamentally broken text extraction.

Era II: Layout-Aware Neural Models
2020

LayoutLM: Text + Layout in One Model

Yiheng Xu et al. at Microsoft Research made the critical leap: instead of treating documents as flat text, feed the model both the text tokens and their 2D bounding-box coordinates on the page. LayoutLM extended BERT by adding x/y position embeddings alongside the standard token and segment embeddings.

# LayoutLM input: text + spatial position
token_embedding = text_embed(token) + position_2d_embed(x0, y0, x1, y1)
# x0,y0,x1,y1 = bounding box of the token on the page

# The model learns that tokens at the top of the page are headers,
# tokens aligned vertically are in the same column,
# and tokens in a grid pattern form a table.

Xu, Y. et al. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. KDD.

The result was dramatic: LayoutLM achieved state-of-the-art on form understanding (FUNSD), receipt extraction (SROIE), and document classification tasks — beating text-only models by large margins. For the first time, a model could distinguish a table header from body text based on where it sat on the page, not just what it said.

2021–2022

LayoutLMv2 & v3: Adding Vision

LayoutLM v1 still relied on OCR for text extraction. LayoutLMv2 integrated a visual backbone (ResNeXt-FPN) that processed the raw document image alongside text and layout, enabling the model to "see" visual cues like bold text, colored cells, and underlined headers. LayoutLMv3 unified the architecture further with a single multimodal transformer trained with masked image-text alignment.

Xu, Y. et al. (2021). LayoutLMv2. ACL.
Huang, Y. et al. (2022). LayoutLMv3. ACM MM.

2022

Donut: OCR-Free Document Understanding

Geewook Kim et al. at NAVER AI Lab asked a radical question: what if we skip OCR entirely? Donut (Document Understanding Transformer) was an end-to-end vision encoder-decoder that took a document image as input and directly generated structured JSON output — no text extraction step, no bounding boxes, no pipeline.

"We show that a simple encoder-decoder architecture with a visual encoder and a text decoder can achieve state-of-the-art performance on document understanding tasks without relying on OCR."

Kim, G. et al. (2022). OCR-free Document Understanding Transformer. ECCV.

This was important philosophically: it proved that OCR was a bottleneck, not a prerequisite. But Donut was trained for extraction tasks (parsing receipts, forms), not for generating embeddings suitable for retrieval. That gap would take two more years to close.

Era III: Vision-Language Retrieval
2023

Nougat: Academic PDF to Markdown

Lukas Blecher et al. at Meta AI built Nougat — a Donut-based model specifically trained to convert academic papers (with equations, tables, and citations) into structured Markdown. For the first time, you could feed a scanned ArXiv paper in and get clean, parseable text with LaTeX equations preserved. This was a game-changer for scientific RAG pipelines.

Blecher, L. et al. (2023). Nougat: Neural Optical Understanding for Academic Documents. arXiv.

2024

ColPali: Late Interaction for Document Retrieval

Manuel Faysse et al. proposed the approach that finally unified vision-language understanding with efficient retrieval. ColPali treats each document page as an image, processes it through a vision-language model (PaliGemma), and produces a set of multi-vector embeddings — one per image patch. At query time, it uses late interaction (the ColBERT paradigm) to compute fine-grained token-to-patch similarity.

# ColPali: no OCR, no chunking — just images and late interaction
doc_patches = vision_encoder(page_image)    # (N_patches, dim)
query_tokens = text_encoder(query)          # (N_tokens, dim)

# Late interaction: max-sim between each query token and all patches
score = sum(max(dot(q_i, p_j) for p_j in doc_patches) for q_i in query_tokens)
# Each query token finds its best-matching visual region on the page

Faysse, M. et al. (2024). ColPali: Efficient Document Retrieval with Vision Language Models. arXiv.

ColPali outperformed text-based retrieval pipelines on document benchmarks while being radically simpler — no OCR, no layout detection, no chunking decisions. The query "what was Q3 revenue?" can match directly against a table cell in the page image without ever extracting that cell as text. ColQwen2, the follow-up using Qwen2-VL as backbone, pushed accuracy further on the ViDoRe benchmark.
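The late-interaction score in the pseudocode above reduces to a max over a similarity matrix, then a sum. A minimal NumPy sketch (toy 2-d embeddings, not the ColPali codebase):

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """ColBERT-style late interaction: each query token takes the score of
    its best-matching patch; sum those maxima over all query tokens."""
    sim = query_tokens @ doc_patches.T          # (N_tokens, N_patches)
    return float(sim.max(axis=1).sum())

# Two query tokens, three page patches
q = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
print(maxsim(q, p))  # max per query token: 0.9 and 0.8
```

Because the max is taken per query token, a single token like "Q3" can latch onto the one patch containing that table cell, which is what makes the matching fine-grained.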

2025

DSE, MMLongBench-Doc & Multi-Page Reasoning

The frontier has moved to multi-page document reasoning. Document Screenshot Embedding (DSE) extends the ColPali paradigm to handle long documents by embedding screenshots of every page and retrieving the relevant pages before generation. MMLongBench-Doc (Ma et al., 2024) established that even GPT-4o only achieves 42.7% accuracy on long-document QA — meaning the field still has enormous room for improvement. Multi-modal RAG over 100+ page documents remains an open problem.

Ma, Y. et al. (2024). MMLongBench-Doc. arXiv.

The throughline: 1985 → 2026

Each generation solved the previous one's blindness:

  • 1985–2005 · Characters: OCR reads pixels into text, blind to structure (Tesseract, pdfminer)
  • 2020–2022 · Layout: Models see text + bounding boxes, understand spatial relationships (LayoutLM)
  • 2022–2023 · Vision: End-to-end image-to-text, no OCR bottleneck (Donut, Nougat)
  • 2024–now · Retrieval: Vision-language embeddings for direct page-level search (ColPali, DSE)

The trajectory is clear: remove pipeline stages. OCR → layout parser → chunker → embedder is being replaced by image → multi-vector embedding. Fewer stages means fewer failure modes.

Full Document RAG Pipeline

Ingestion pipeline: PDF documents → Parse (extract text) → Chunk (split text) → Embed (vectorize) → Store (vector DB: Qdrant / Weaviate)

Query pipeline: User query ("vacation policy?") → Retrieve (hybrid + rerank) → Augment prompt (context + query + citation markers) → LLM (GPT-4o / Claude) → Answer + citations ("20 days PTO [1]...", [1] handbook.pdf, p.42) → RAGAS evaluation (faithfulness, context precision, context recall, answer relevancy)

The Document Extraction Problem

Before you can chunk a document, you need to extract its content. This is where most RAG pipelines silently fail. Here's what goes wrong and what to do about it.

Naive Extraction (What Fails)

  • Tables become interleaved lines of text
  • Multi-column layouts merge into one stream
  • Headers and footers repeat on every page
  • Equations become gibberish: "E = mc2"
  • Figures and captions are completely lost
  • Footnotes split from their references

Structure-Aware Extraction

  • Tables extracted as structured data (CSV/JSON)
  • Reading order follows visual flow
  • Headers/footers identified and stripped
  • Equations preserved as LaTeX
  • Figures captioned and linked to text
  • Footnotes merged with their paragraphs
# Modern document extraction with Docling (IBM, 2024)
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter()
result = converter.convert("financial_report.pdf")

# Structured output: sections, tables, figures all preserved
doc = result.document

# Access tables as structured data
# (accessor names below are illustrative; exact attributes vary by Docling version)
for table in doc.tables:
    print(f"Table: {table.caption}")
    for row in table.data:
        print(row)  # Each row is a list of cell values

# Access text with hierarchy
for section in doc.sections:
    print(f"## {section.heading}")
    for paragraph in section.paragraphs:
        print(paragraph.text)

# Export to Markdown (preserves structure for chunking)
markdown = doc.export_to_markdown()
print(markdown)

Extraction Tool Comparison (2026)

| Tool | Approach | Tables | Speed |
|------|----------|--------|-------|
| PyMuPDF / pdfminer | Text stream extraction | Poor | Fast |
| Unstructured.io | ML-based layout detection | Good | Medium |
| Docling (IBM) | Vision + layout + OCR fusion | Excellent | Medium |
| Nougat | End-to-end vision model | Excellent | Slow (GPU) |
| ColPali / DSE | No extraction, image retrieval | Native | Fast (retrieval) |

Chunking Strategies Compared

  • Fixed-size chunks (500 chars, 50 overlap): simple, predictable sizes, but may split mid-thought.
  • Semantic chunks (split on topic boundaries): preserves topic coherence; variable sizes, needs embeddings.
  • Recursive splitting (hierarchical: paragraph → sentence → word): split by paragraph first, sub-split oversized pieces by sentence, then by word. Best general-purpose strategy; respects structure hierarchy with configurable separators.

Structure-Aware Chunking

Chunking is where most document RAG systems fail. The difference between naive and structure-aware chunking is the difference between a system that hallucinates and one that cites correctly. The key insight: chunk boundaries should respect document structure, not arbitrary character counts.

1. Section-Aware Recursive Splitting

Use Markdown headers from structured extraction as primary split points. Never split mid-section unless the section exceeds the chunk size limit.

from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Step 1: Split by document structure (headers)
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
)
header_chunks = header_splitter.split_text(markdown_text)

# Step 2: Sub-split large sections by character count
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

final_chunks = []
for chunk in header_chunks:
    if len(chunk.page_content) > 500:
        sub_chunks = char_splitter.split_text(chunk.page_content)
        for sc in sub_chunks:
            # Inherit section metadata from parent
            final_chunks.append(Document(
                page_content=sc,
                metadata={**chunk.metadata, "chunk_type": "text"}
            ))
    else:
        final_chunks.append(chunk)

2. Semantic Chunking

Use embedding similarity to find natural breakpoints. When consecutive sentences have low similarity, that's a topic boundary. Produces variable-length chunks that respect meaning, not character counts.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Semantic chunking: split where meaning shifts
semantic_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90  # Split at top-10% dissimilarity
)

semantic_chunks = semantic_splitter.split_documents(documents)
# Result: chunks that follow topic boundaries, not character counts
# "Section 4.2: Benefits" stays together even if it's 800 chars

3. Parent-Child Retrieval

Embed small chunks for retrieval precision, but return their parent context to the LLM. This is the single most impactful architectural pattern for document RAG.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
import faiss

# Small chunks = precise retrieval (find the right sentence)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Large chunks = rich context (give the LLM the full section)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)

store = InMemoryStore()

# FAISS cannot be built from an empty document list; start with an empty index
embeddings = OpenAIEmbeddings()
index = faiss.IndexFlatL2(len(embeddings.embed_query("init")))
vectorstore = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(documents)

# Query matches a 200-char child, but returns the 1500-char parent
results = retriever.invoke("What is the vacation policy?")
# Result includes surrounding context the LLM needs to answer fully

4. Table-Aware Chunking

Tables must never be split across chunks. Extract them as self-contained units with their caption and surrounding context.

from langchain_core.documents import Document

def chunk_with_tables(doc_sections: list, tables: list) -> list:
    """Keep tables intact as separate chunks with context."""
    chunks = []

    for section in doc_sections:
        # Regular text: recursive split (char_splitter as defined earlier)
        for text in char_splitter.split_text(section.text):
            chunks.append(Document(
                page_content=text,
                metadata={"chunk_type": "text"},
            ))

    for table in tables:
        # Table chunk: caption + serialized table + surrounding text
        table_text = f"Table: {table.caption}\n"
        table_text += table.to_markdown()  # Markdown table format
        if table.preceding_paragraph:
            table_text = table.preceding_paragraph + "\n\n" + table_text

        chunks.append(Document(
            page_content=table_text,
            metadata={
                "chunk_type": "table",
                "page": table.page_number,
                "source": table.source_file,
            }
        ))

    return chunks

Chunking Strategy Comparison

| Strategy | Best For | Weakness | Chunk Size |
|----------|----------|----------|------------|
| Section-Aware | Structured docs (reports, papers) | Requires Markdown extraction | Variable |
| Semantic | Topic-based retrieval, mixed docs | Expensive (requires embeddings) | Variable |
| Parent-Child | Long documents, context-heavy QA | Complex, more storage | 200 / 1500 |
| Table-Aware | Financial docs, scientific papers | Requires table detection | Per-table |

Multi-Representation Indexing

Production document RAG does not use a single representation per chunk. The state of the art indexes multiple representations of each document element — text, summary, hypothetical questions, and table serializations — and retrieves the original source regardless of which representation matched.

This technique, popularized by LangChain's Multi-Vector Retriever and the RAPTOR paper (Sarthi et al., 2024), is the single biggest accuracy improvement most teams can make with minimal code changes.

How It Works

Original chunk → Rep 1 (raw text) + Rep 2 (LLM summary) + Rep 3 (hypothetical questions); all three embeddings point back to the same original chunk.

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import faiss
import uuid

# Stores: vector index for search, byte store for original docs
# (FAISS cannot be built from an empty document list; start with an empty index)
embeddings = OpenAIEmbeddings()
index = faiss.IndexFlatL2(len(embeddings.embed_query("init")))
vectorstore = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
docstore = InMemoryByteStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=docstore,
    id_key=id_key,
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

for chunk in chunks:
    doc_id = str(uuid.uuid4())
    chunk.metadata[id_key] = doc_id

    # Store original chunk via the retriever's docstore
    # (the raw byte store expects bytes; the retriever's wrapper serializes Documents)
    retriever.docstore.mset([(doc_id, chunk)])

    # Representation 1: raw text embedding
    vectorstore.add_documents([chunk])

    # Representation 2: LLM summary
    summary = llm.invoke(f"Summarize in 2 sentences:\n{chunk.page_content}")
    summary_doc = Document(
        page_content=summary.content,
        metadata={id_key: doc_id, "type": "summary"}
    )
    vectorstore.add_documents([summary_doc])

    # Representation 3: hypothetical questions (HyDE variant)
    questions = llm.invoke(
        f"Generate 3 questions this text answers:\n{chunk.page_content}"
    )
    for q in questions.content.split("\n"):
        if q.strip():
            q_doc = Document(
                page_content=q.strip(),
                metadata={id_key: doc_id, "type": "question"}
            )
            vectorstore.add_documents([q_doc])

# At query time: user question matches a hypothetical question,
# but the LLM receives the original full chunk
results = retriever.invoke("What was the revenue in Q3?")
# Returns the original financial table, not the summary or question

Why This Works So Well

The vocabulary mismatch problem is the #1 cause of RAG retrieval failures. A user asks "What's the PTO policy?" but the document says "Annual leave entitlement per Section 4.2." The raw text embedding misses it. But the hypothetical question "How many vacation days do employees get?" bridges the gap — it matches the user's query while pointing to the correct source chunk.

Teams that add multi-representation indexing typically see 15–30% improvement in recall@5 on their domain-specific benchmarks.
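To verify that kind of claim on your own corpus rather than take it on faith, recall@k is a one-liner against a labeled query set (a sketch; the relevant IDs come from your own annotations):

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

# Toy example: 3 relevant chunks for this query, 2 retrieved in the top 5
score = recall_at_k(["c1", "c9", "c3", "c7", "c4"], {"c1", "c3", "c8"}, k=5)
print(score)
```

Run it once before and once after adding the extra representations; the delta on your own benchmark is the number that matters.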

Table Extraction and Retrieval

Tables are the hardest element in document RAG. They contain dense, structured information where row-column relationships carry meaning that vanishes when serialized as flat text. Here are the three production approaches.

Approach 1: Markdown Serialization

Convert tables to Markdown format. Simple, works surprisingly well for LLMs that were trained on Markdown-heavy data.

# Serialized table as Markdown (embeds well, LLMs parse easily)
table_md = """
| Quarter | Revenue ($M) | Growth |
|---------|-------------|--------|
| Q1 2025 | 142.3       | +12%   |
| Q2 2025 | 158.7       | +11.5% |
| Q3 2025 | 171.2       | +7.9%  |
| Q4 2025 | 189.4       | +10.6% |
"""

# Embed with caption context for better retrieval
chunk = Document(
    page_content=f"Table 3: Quarterly Revenue Summary\n{table_md}",
    metadata={"chunk_type": "table", "page": 15}
)
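If your extractor hands back raw header and row lists instead of ready-made Markdown, the serialization itself is mechanical. A small generic helper (not tied to any particular extraction library):

```python
def rows_to_markdown(header: list, rows: list) -> str:
    """Serialize a header row plus data rows as a Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)

print(rows_to_markdown(
    ["Quarter", "Revenue ($M)", "Growth"],
    [["Q1 2025", 142.3, "+12%"], ["Q2 2025", 158.7, "+11.5%"]],
))
```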

Approach 2: LLM-Generated Natural Language Summary

Ask an LLM to describe the table in prose. Index the description for retrieval, return the original table for generation.

# Generate a natural language description of the table
prompt = f"""Describe this table in 2-3 sentences, including
key data points and trends:\n{table_md}"""

description = llm.invoke(prompt).content
# "Table 3 shows quarterly revenue for 2025, growing from
#  $142.3M in Q1 to $189.4M in Q4. Growth rates ranged from
#  7.9% to 12%, with Q3 showing the slowest quarter."

# Index the description, but store the original table
vectorstore.add_documents([Document(
    page_content=description,
    metadata={"doc_id": table_id, "type": "table_summary"}
)])

Approach 3: Vision-Based (ColPali / GPT-4o)

Skip extraction entirely. Send the page image containing the table to a vision-language model. This handles complex table layouts (merged cells, nested headers) that text extraction cannot.

import base64
from openai import OpenAI

client = OpenAI()

# Convert page to image, send directly to vision model
with open("page_15.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What was Q3 2025 revenue?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{image_b64}"
            }}
        ]
    }]
)
# "$171.2M with 7.9% growth" — reads directly from the table image

Citation Generation

Production RAG without citations is a liability. Users need to verify answers, auditors need trails, and your system needs a mechanism to detect when the LLM fabricates information. Here's the engineering.

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

def generate_with_citations(query: str, chunks: list) -> dict:
    """Generate answer with inline citations and verification."""

    # Format chunks with citation markers and rich metadata
    context_parts = []
    for i, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "Unknown")
        page = chunk.metadata.get("page", "N/A")
        section = chunk.metadata.get("h2", chunk.metadata.get("section", ""))
        chunk_type = chunk.metadata.get("chunk_type", "text")

        header = f"[{i+1}] Source: {source} | Page: {page}"
        if section:
            header += f" | Section: {section}"
        if chunk_type == "table":
            header += " | [TABLE]"

        context_parts.append(f"{header}\n{chunk.page_content}")

    context = "\n\n---\n\n".join(context_parts)

    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a precise document analyst. Rules:
1. Answer ONLY using the provided context chunks.
2. Include inline citations [1], [2], etc. for EVERY claim.
3. If a claim spans multiple sources, cite all: [1][3].
4. For tables, reference the specific row/column.
5. If the context is insufficient, say "The provided documents
   do not contain enough information to answer this question."
6. Never infer beyond what the sources explicitly state."""),
        ("user", """Context chunks:
{context}

Question: {question}

Answer with citations:""")
    ])

    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    chain = prompt | llm

    response = chain.invoke({
        "context": context,
        "question": query
    })

    # Extract and validate citations
    import re
    cited_nums = set(int(n) for n in re.findall(r'\[(\d+)\]', response.content))
    available_nums = set(range(1, len(chunks) + 1))
    hallucinated_citations = cited_nums - available_nums

    return {
        "answer": response.content,
        "sources": [
            {
                "citation": f"[{i+1}]",
                "source": chunks[i].metadata.get("source"),
                "page": chunks[i].metadata.get("page"),
                "section": chunks[i].metadata.get("h2", ""),
                "preview": chunks[i].page_content[:200] + "..."
            }
            for i in range(len(chunks))
            if (i + 1) in cited_nums
        ],
        "citation_integrity": len(hallucinated_citations) == 0,
        "hallucinated_refs": list(hallucinated_citations),
    }
Example output:

According to the 2025 annual report, total revenue reached $661.6M for the fiscal year [1]. Q4 was the strongest quarter at $189.4M, representing 10.6% quarter-over-quarter growth [1][2]. The CEO noted in the shareholder letter that growth was "primarily driven by enterprise expansion" rather than new customer acquisition [3].

Sources:

[1]: annual_report_2025.pdf, page 15, Section: Financial Summary

[2]: annual_report_2025.pdf, page 15, Section: Financial Summary [TABLE]

[3]: annual_report_2025.pdf, page 3, Section: Letter to Shareholders

Citation integrity: PASS (all references valid)
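The integrity check inside generate_with_citations is worth extracting as a standalone helper you can run on any generated answer before showing it to a user:

```python
import re

def check_citations(answer: str, n_sources: int):
    """Return (ok, invalid_refs): every [n] marker in the answer must
    point at one of the n_sources provided context chunks."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    invalid = cited - set(range(1, n_sources + 1))
    return len(invalid) == 0, sorted(invalid)

print(check_citations("Revenue grew [1], margins fell [4].", n_sources=3))
# (False, [4]): citation [4] has no matching source chunk
```

A failed check is a strong hallucination signal: the model invented a reference that was never in its context.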

Full Production Pipeline

Here is a complete production document RAG pipeline combining structure-aware extraction, multi-representation indexing, hybrid retrieval with reranking, and cited generation.

Architecture Overview

Ingestion: Input (PDF / DOCX) → Extract (Docling) → Chunk (section-aware) → Index (multi-representation) → Store (vector DB)

Query: Input (query) → Retrieve (hybrid + rerank) → Generate (LLM + citations) → Output (cited answer)

from langchain.text_splitter import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from sentence_transformers import CrossEncoder
import faiss
import numpy as np
import uuid

class DocumentRAGPipeline:
    def __init__(self):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
        # FAISS cannot start from an empty document list; build an empty index
        index = faiss.IndexFlatL2(len(self.embeddings.embed_query("init")))
        self.vectorstore = FAISS(
            embedding_function=self.embeddings,
            index=index,
            docstore=InMemoryDocstore(),
            index_to_docstore_id={},
        )
        self.docstore = InMemoryByteStore()
        self.retriever = MultiVectorRetriever(
            vectorstore=self.vectorstore,
            byte_store=self.docstore,
            id_key="doc_id",
        )

    def ingest(self, markdown: str, source: str) -> int:
        """Ingest a structured Markdown document."""
        # Step 1: Section-aware splitting
        header_splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
        )
        sections = header_splitter.split_text(markdown)

        # Step 2: Sub-split + multi-representation indexing
        char_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500, chunk_overlap=50
        )
        count = 0
        for section in sections:
            sub_chunks = char_splitter.split_text(section.page_content)
            for text in sub_chunks:
                doc_id = str(uuid.uuid4())
                original = Document(
                    page_content=text,
                    metadata={**section.metadata, "source": source, "doc_id": doc_id}
                )

                # Store original via the retriever's docstore
                # (serializes the Document into the underlying byte store)
                self.retriever.docstore.mset([(doc_id, original)])

                # Rep 1: raw text
                self.vectorstore.add_documents([original])

                # Rep 2: hypothetical questions
                qs = self.llm.invoke(
                    f"Generate 2 questions this answers:\n{text}"
                ).content
                for q in qs.strip().split("\n"):
                    if q.strip():
                        self.vectorstore.add_documents([Document(
                            page_content=q.strip(),
                            metadata={"doc_id": doc_id, "type": "question"}
                        )])
                count += 1
        return count

    def query(self, question: str, k: int = 10, top_k: int = 5) -> dict:
        """Full RAG pipeline: retrieve, rerank, generate with citations."""
        # Stage 1: Multi-vector retrieval
        candidates = self.retriever.invoke(question)[:k]

        if not candidates:
            return {"answer": "No relevant documents found.", "sources": [], "chunks": []}

        # Stage 2: Rerank
        if len(candidates) > top_k:
            pairs = [[question, doc.page_content] for doc in candidates]
            scores = self.reranker.predict(pairs)
            ranked = np.argsort(scores)[::-1][:top_k]
            candidates = [candidates[i] for i in ranked]

        # Stage 3: Generate with citations (keep chunks for downstream evaluation)
        result = generate_with_citations(question, candidates)
        result["chunks"] = candidates
        return result

# Usage
pipeline = DocumentRAGPipeline()
count = pipeline.ingest(markdown_text, source="annual_report_2025.pdf")
print(f"Indexed {count} chunks with multi-representation")

result = pipeline.query("What was the total revenue in 2025?")
print(result["answer"])
for src in result["sources"]:
    print(f"  {src['citation']}: {src['source']}, p.{src['page']}")

Evaluation with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) by Shahul Es et al. (2023) is the standard evaluation framework. It measures both retrieval quality and generation quality without requiring human evaluation for every test case — using LLM-as-judge to decompose and verify claims.

Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv.

Retrieval Metrics

  • Context Precision

    Are retrieved chunks actually relevant? Measures signal-to-noise in your retrieval set. Low precision = the LLM is drowning in irrelevant context.

  • Context Recall

    Do retrieved chunks cover all the information needed? Low recall = the answer exists in your corpus but retrieval missed it.

Generation Metrics

  • Faithfulness

    Is every claim in the answer supported by the retrieved context? The hallucination detector. Most important metric for trust.

  • Answer Relevancy

    Does the answer actually address the question asked? Catches tangential or generic responses.

# pip install ragas datasets
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Build evaluation dataset from your RAG pipeline
eval_questions = [
    "What was total revenue in 2025?",
    "How many vacation days do employees get?",
    "What are the API rate limits?",
]

eval_data = {"question": [], "answer": [], "contexts": [], "ground_truth": []}

for q, gt in zip(eval_questions, ground_truths):
    result = pipeline.query(q)
    eval_data["question"].append(q)
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append([c.page_content for c in result["chunks"]])
    eval_data["ground_truth"].append(gt)

dataset = Dataset.from_dict(eval_data)

results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

print(results)
# {'context_precision': 0.91, 'context_recall': 0.87,
#  'faithfulness': 0.94, 'answer_relevancy': 0.89}

# Dig into failures:
df = results.to_pandas()
low_faith = df[df["faithfulness"] < 0.7]
print(f"\n{len(low_faith)} questions with faithfulness < 0.7:")
for _, row in low_faith.iterrows():
    print(f"  Q: {row['question']}")
    print(f"  A: {row['answer'][:100]}...")
    print(f"  Faithfulness: {row['faithfulness']:.2f}\n")

Benchmark Targets for Document RAG

| Metric | Acceptable | Good | Excellent |
|--------|------------|------|-----------|
| Context Precision | > 0.70 | > 0.85 | > 0.95 |
| Context Recall | > 0.65 | > 0.80 | > 0.90 |
| Faithfulness | > 0.80 | > 0.90 | > 0.95 |
| Answer Relevancy | > 0.70 | > 0.85 | > 0.95 |

For document RAG specifically, faithfulness thresholds should be higher than general RAG because users are making decisions based on cited documents. A faithfulness score of 0.80 means roughly one in five claims in your answers is not supported by the retrieved context.
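Turning these thresholds into a deployment gate is a few lines. A sketch using the "Good" tier as the minimum (tune the numbers for your domain):

```python
# Minimum RAGAS scores before a pipeline version ships ("Good" tier)
GATE = {
    "context_precision": 0.85,
    "context_recall": 0.80,
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
}

def passes_gate(scores: dict) -> bool:
    """True only if every gated metric meets its threshold."""
    return all(scores.get(metric, 0.0) >= threshold
               for metric, threshold in GATE.items())

print(passes_gate({"context_precision": 0.91, "context_recall": 0.87,
                   "faithfulness": 0.94, "answer_relevancy": 0.89}))  # True
```

Run this in CI against a fixed evaluation set so a regression in extraction or chunking blocks the release instead of reaching users.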

Failure Modes

Why Document RAG Pipelines Fail in Production

Most failures are not in the LLM generation step. They are upstream, in extraction and chunking.

1. Table Corruption

A financial PDF has a revenue table. PyMuPDF extracts it as: "Q1 142.3 Q2 158.7 Q3 171.2 Revenue Growth +12% +11.5% +7.9%". The row-column relationship is destroyed. When the LLM sees this, it might associate Q1 with +7.9% growth.

Fix: Use Docling or Nougat for table extraction. Serialize tables as Markdown with headers preserved. Add table-specific chunks.

2. Cross-Chunk Information

The answer requires combining information from two chunks — the definition from page 3 and the exception from page 47. Your retriever finds chunk A but misses chunk B because B doesn't contain any keywords from the query.

Fix: Multi-representation indexing. The hypothetical question for chunk B ("Are there exceptions to the PTO policy?") bridges the vocabulary gap. Also consider increasing retrieval depth (k=15) with aggressive reranking (top_k=5).

3. Stale Citations

Document version 2 replaces version 1, but old chunks remain in the vector store. The LLM cites page 42 of the employee handbook, but that page no longer exists in the current version. The user follows the citation and finds different content.

Fix: Namespace your vector store by document version. When a document is updated, delete all chunks from the previous version before ingesting the new one. Add version and ingested_at metadata to every chunk.
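The delete-then-ingest flow can be sketched against a plain dict index (swap in your vector store's metadata-filtered delete; the chunk shape here is a simplified stand-in):

```python
import time

def reingest(index: dict, source: str, version: str, chunks: list) -> dict:
    """Drop every chunk belonging to older versions of `source`,
    then ingest the new chunks with version metadata attached."""
    index = {cid: c for cid, c in index.items()
             if c["metadata"].get("source") != source}
    for i, chunk in enumerate(chunks):
        chunk["metadata"].update({
            "source": source,
            "version": version,
            "ingested_at": time.time(),
        })
        index[f"{source}:{version}:{i}"] = chunk
    return index

# Old v1 chunks are removed before v2 is ingested
index = {"handbook.pdf:v1:0": {"metadata": {"source": "handbook.pdf", "version": "v1"}}}
index = reingest(index, "handbook.pdf", "v2",
                 [{"metadata": {}, "text": "20 days PTO"}])
print(sorted(index))  # ['handbook.pdf:v2:0']
```

Deleting before inserting (rather than after) also guarantees a query never sees both versions at once.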

4. Context Window Overflow

You retrieve 10 chunks of 500 tokens each = 5,000 tokens of context. Add the system prompt, the query, and the generation overhead. If the LLM's effective context window is smaller than advertised (common with long-context models), it silently ignores later chunks — and those are often the most relevant ones from reranking.

Fix: Place the most relevant chunks first (reranker output is already sorted). Monitor actual token usage. Consider lost-in-the-middle effects — Liu et al. (2024) showed LLMs perform worst on information placed in the middle of the context.

Liu, N. F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL.
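Keeping reranked order while respecting a token budget can be sketched as follows (count_tokens is a stand-in for your real tokenizer, e.g. tiktoken):

```python
def pack_context(chunks_by_relevance: list, token_budget: int, count_tokens) -> list:
    """Keep the reranker's order (most relevant first) and stop adding
    chunks before the prompt overflows the budget."""
    packed, used = [], 0
    for chunk in chunks_by_relevance:
        n = count_tokens(chunk)
        if used + n > token_budget:
            break
        packed.append(chunk)
        used += n
    return packed

# Crude whitespace count as a stand-in for a real tokenizer
chunks = ["most relevant chunk", "second chunk here",
          "a long tail chunk that would overflow the budget"]
print(pack_context(chunks, token_budget=7,
                   count_tokens=lambda c: len(c.split())))
```

Dropping the tail instead of truncating mid-chunk means every chunk the model sees is intact, and the most relevant ones sit at the front where attention is strongest.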

Key Takeaways

  1. Extraction quality determines RAG quality — No amount of clever prompting fixes corrupted table data or merged columns. Use structure-aware extraction (Docling, Nougat) or skip it entirely (ColPali).

  2. Chunk boundaries must respect document structure — Tables, sections, and lists are semantic units. Splitting them mid-element is the #1 source of wrong answers in production.

  3. Multi-representation indexing closes the vocabulary gap — Index summaries and hypothetical questions alongside raw text. This is the highest-leverage improvement for retrieval recall.

  4. Citations are non-negotiable in production — Every claim must trace back to a source chunk with page number and section. Validate citation integrity programmatically. Users need to verify.

  5. Measure faithfulness, not just relevancy — RAGAS faithfulness > 0.90 should be the gate for production deployment. A system that gives relevant but unfaithful answers is worse than one that says "I don't know."
