Level 2: Pipelines (~20 min)

Basic RAG Pipeline

Ground your LLM in real data. Build a complete retrieval-augmented generation system from scratch.

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances LLMs by connecting them to external knowledge sources. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents at query time.

Think of it as giving the LLM an open-book exam instead of a closed-book test. The model can look up information rather than trying to recall everything from memory.

// Without RAG (closed-book)

User: "What's our company's vacation policy?"

LLM: "I don't have information about your specific company..."

// With RAG (open-book)

User: "What's our company's vacation policy?"

LLM: "According to section 4.2 of the employee handbook, you receive 20 days PTO..."

Why RAG Matters

The Hallucination Problem

LLMs confidently generate plausible-sounding but factually incorrect information. They can't distinguish between what they know and what they're making up.

The RAG Solution

By grounding responses in retrieved documents, RAG provides verifiable sources. The model can cite where information came from, making it auditable.

Stale Knowledge

Training data has a cutoff date. GPT-4's knowledge stops at a fixed point. It can't know about events, products, or changes that happened after training.

Private Data Access

RAG lets you query internal documents, databases, and proprietary knowledge without fine-tuning. Update your knowledge base anytime without retraining.

The 4 Stages of RAG

Every RAG pipeline follows the same fundamental pattern: Chunk, Embed, Retrieve, Generate.

1. Chunk

Split your documents into smaller pieces. Large documents don't fit in context windows, and smaller chunks provide more precise retrieval. Choose between fixed-size, sentence-based, or semantic chunking strategies.

2. Embed

Convert each chunk into a vector embedding that captures its semantic meaning. Similar content produces similar vectors, enabling semantic search rather than keyword matching.

3. Retrieve

When a query arrives, embed it and find the most similar chunks using vector similarity (cosine similarity or dot product). Return the top-k most relevant chunks as context for generation.

4. Generate

Pass the retrieved chunks along with the user's question to the LLM. The model synthesizes an answer grounded in the provided context, ideally citing its sources.
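
The sketch below strings the four stages together in Python. It is a minimal illustration under stated assumptions, not a production pipeline: embed() is a placeholder for whatever embedding model you use, call_llm is a stand-in for your LLM client, and the file name in the usage comment is hypothetical.

import numpy as np

# Stage 1: Chunk - fixed-size character chunks with a small overlap.
# 2000 characters is roughly 500 tokens at ~4 characters per token.
def chunk(text, chunk_size=2000, overlap=200):
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Stage 2: Embed - placeholder; plug in any embedding model or API that
# returns one vector per input string.
def embed(texts):
    raise NotImplementedError("swap in your embedding model here")

# Stage 3: Retrieve - cosine similarity between the query vector and every
# chunk vector, then keep the top-k most similar chunks.
# chunk_vecs: (n_chunks, dim) array of chunk embeddings.
def retrieve(query_vec, chunk_vecs, chunks, k=5):
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]

# Stage 4: Generate - number the chunks, build a grounded prompt, and hand
# it to the LLM.
def answer(question, context_chunks, call_llm):
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    prompt = (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

# Usage, once embed() and an LLM client are wired up (paths are placeholders):
#   chunks = chunk(open("handbook.txt").read())
#   vecs = np.array(embed(chunks))
#   query_vec = np.array(embed(["What's our vacation policy?"])[0])
#   hits = retrieve(query_vec, vecs, chunks)
#   print(answer("What's our vacation policy?", hits, call_llm=my_llm))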

Interactive Pipeline Demo

Walk through each stage of the RAG pipeline. Try different chunking strategies, embed the chunks, retrieve relevant context, and see how the final answer is generated.

RAG Pipeline Visualizer (interactive): the indexing path (offline) runs Document → Chunk → Embed → Store; the querying path (online) runs Query → Retrieve → Generate. Side panels recap the hallucination and stale-knowledge problems that query-time retrieval addresses.

Chunking Strategies

Chunking looks like a minor detail, but it isn't: bad chunking leads to bad retrieval, which leads to bad answers. Here are the main strategies:

Strategy              | Pros                              | Cons
Fixed Size            | Simple, predictable chunk counts  | May split mid-sentence
Sentence-based        | Preserves sentence boundaries     | Variable sizes, may be too granular
Semantic / Paragraph  | Preserves topic coherence         | Chunks may be too large
Recursive             | Preserves hierarchical structure  | More complex implementation
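
For example, a sentence-based chunker can be as simple as packing whole sentences into chunks up to a size budget. This is only a sketch: the regex splitter is naive, and the budget is counted in words rather than true tokens.

import re

def sentence_chunks(text, max_words=120):
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would overflow it.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks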

Chunk Size Experiments

256 tokens

Fine-grained, precise retrieval

  • + High precision for specific facts
  • + More chunks = more retrieval options
  • - May lack surrounding context
  • - More embeddings to store

512 tokens

Balanced approach (common default)

  • + Good balance of precision/context
  • + Works well for most use cases
  • - May still split related content

1024 tokens

More context per chunk

  • + Rich context in each chunk
  • + Fewer total chunks
  • - Less precise retrieval
  • - May dilute relevance signal
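
A quick way to get a feel for these trade-offs on your own corpus is to count how many chunks each size produces. This snippet reuses the fixed-size chunk() helper from the pipeline sketch above and assumes roughly 4 characters per token; the file name is just a placeholder.

sample = open("employee_handbook.txt").read()  # placeholder document

for tokens in (256, 512, 1024):
    n = len(chunk(sample, chunk_size=tokens * 4, overlap=0))  # ~4 chars/token
    print(f"{tokens:>4}-token chunks -> {n} chunks to embed and store")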

Retrieval: Top-K and Thresholds

Two key parameters control what gets retrieved:

Top-K Selection

Return the K most similar chunks regardless of their absolute similarity score.

k=3: Focused, minimal context

k=5: Balanced (common default)

k=10: Broad context, more noise

Similarity Threshold

Only return chunks above a minimum similarity score. Prevents irrelevant results.

0.7: Strict - only highly relevant

0.5: Moderate threshold

0.3: Loose - may include tangential

Pro tip: Combine both approaches. Use a similarity threshold to filter out irrelevant chunks, then take the top-k from what remains. If nothing clears the threshold, the pipeline can tell the model there is no relevant context instead of handing it noise to answer from.
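
A minimal sketch of that combination, assuming you already have a (chunk, similarity) score for every candidate:

def select_context(scored_chunks, k=5, min_score=0.5):
    # Drop anything below the threshold first...
    relevant = [(text, score) for text, score in scored_chunks if score >= min_score]
    # ...then keep the top-k of what survives. An empty result is meaningful:
    # it signals that no relevant context exists for this query.
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return relevant[:k]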

Prompt Engineering for RAG

The prompt template you use significantly impacts answer quality. Here are key patterns:

Example RAG Prompt

System:

You are a helpful assistant. Answer questions using ONLY the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer that question." Cite your sources using [1], [2], etc.

Context:

[1] {chunk_1_text}

[2] {chunk_2_text}

[3] {chunk_3_text}

Question:

{user_question}

1. Constrain to context

Explicitly tell the model to ONLY use provided context. This reduces hallucination.

2. Handle missing information

Give the model a way out. If context is insufficient, it should admit it rather than guess.

3. Require citations

Ask for source attribution. This makes answers verifiable and builds user trust.

4. Context placement

Place the context before the question. Models tend to weight content near the end of the prompt more heavily, so this keeps the question in a high-attention position.
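
The helper below assembles that template from retrieved chunks, following the four patterns above. It is a sketch; the system wording is simply the example from this section.

def build_rag_prompt(question, chunks):
    system = (
        "You are a helpful assistant. Answer questions using ONLY the provided "
        "context. If the context doesn't contain the answer, say \"I don't have "
        "enough information to answer that question.\" Cite your sources using "
        "[1], [2], etc."
    )
    # Number the chunks so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    # Context goes before the question (pattern 4).
    user = f"Context:\n{context}\n\nQuestion:\n{question}"
    return system, user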

Key Takeaways

  1. RAG = Chunk + Embed + Retrieve + Generate - This four-stage pipeline grounds LLM outputs in real data.

  2. Chunking strategy matters - Bad chunking leads to bad retrieval. Experiment with size and approach.

  3. Tune top-k and thresholds - Too few chunks = missing context. Too many = noise and distraction.

  4. Prompt engineering is critical - Constrain to context, handle unknowns gracefully, require citations.