Quick Answer: What is DSPy?

Programming language models instead of prompting them.

What it does:
Replaces brittle prompts with declarative signatures + automatic optimization
Key insight:
Prompts should be learned, not hand-written
Created by:
Omar Khattab @ Stanford NLP (same lab as ColBERT)
When to use:
Multi-step pipelines, model-agnostic code, optimizing for metrics

DSPy: The Framework for Programming LLMs

Stop writing prompts. Start writing programs. DSPy lets you define what you want and automatically finds the prompts, examples, and strategies to achieve it.


The Problem with Prompts

Traditional Prompting

  • Brittle: Small changes break everything
  • Model-specific: Prompts don't transfer between models
  • Hard to optimize: Which version is best? Who knows.
  • Not composable: Multi-step pipelines become spaghetti
  • Not reproducible: "It worked yesterday..."

DSPy Approach

  • Declarative: Define inputs/outputs, not exact wording
  • Model-agnostic: Same code, swap LLM provider
  • Optimizable: Automatic prompt tuning with metrics
  • Composable: Modules combine like PyTorch layers
  • Reproducible: Save, load, version, test

Core Concepts

DSPy has four main abstractions. Master these and you can build anything.

Signatures

Declarative I/O specs

Define what your LLM should do, not how. Signatures specify input and output fields with types and descriptions.

"question -> answer"  # Simple
"context, question -> reasoning, answer"  # Multi-field
class QA(dspy.Signature):
    """Answer questions with citations."""
    context = dspy.InputField(desc="relevant passages")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="cite sources")
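
Signatures can also carry Python type annotations (assuming a recent DSPy release, roughly 2.5+), which document each field and let DSPy parse and constrain outputs:

from typing import Literal
import dspy

# Typed signature: annotations document and constrain each field
class Classify(dspy.Signature):
    """Classify the sentiment of a sentence."""
    sentence: str = dspy.InputField()
    sentiment: Literal["positive", "negative", "neutral"] = dspy.OutputField()
    confidence: float = dspy.OutputField(desc="confidence between 0 and 1")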

Modules

Reusable LLM components

Pre-built patterns like ChainOfThought, ReAct, ProgramOfThought. Compose them like PyTorch layers.

# Built-in modules
dspy.Predict(signature)      # Direct prediction
dspy.ChainOfThought(sig)     # Step-by-step reasoning
dspy.ReAct(sig, tools=[...]) # Tool-using agent
dspy.ProgramOfThought(sig)   # Generate & execute code

# Custom module
class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

Optimizers

Automatic prompt tuning

Find optimal prompts, few-shot examples, and instructions. Like hyperparameter search for LLM programs.

# Define your metric
def accuracy(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

# Choose an optimizer
optimizer = dspy.MIPROv2(
    metric=accuracy,
    num_candidates=10,
    init_temperature=1.0
)

# Compile (optimize) your program
optimized_rag = optimizer.compile(
    RAG(),
    trainset=train_examples,
    max_bootstrapped_demos=4,
    max_labeled_demos=4
)

Assertions

Runtime constraints

Add hard constraints that trigger retries. Catch hallucinations, format errors, and policy violations.

class FactualQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, context, question):
        answer = self.generate(context=context, question=question)

        # Hard constraint: answer must be in context
        dspy.Assert(
            answer.answer in context,
            "Answer must be grounded in context"
        )

        # Soft suggestion: prefer concise answers
        dspy.Suggest(
            len(answer.answer.split()) < 50,
            "Keep answers concise"
        )

        return answer

Getting Started

Installation & Basic Usage

pip install dspy
import dspy

# Configure LLM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Define a simple signature
class Summarize(dspy.Signature):
    """Summarize the document in 2-3 sentences."""
    document = dspy.InputField()
    summary = dspy.OutputField()

# Create a module
summarizer = dspy.Predict(Summarize)

# Use it
result = summarizer(document="DSPy is a framework for programming...")
print(result.summary)

Chain of Thought Reasoning

Automatically elicit step-by-step reasoning

import dspy

# ChainOfThought injects a step-by-step `reasoning` output field automatically,
# so the signature only declares the task's own inputs and outputs
class MathProblem(dspy.Signature):
    """Solve the math problem step by step."""
    problem = dspy.InputField()
    answer = dspy.OutputField(desc="final numerical answer")

# CoT automatically elicits reasoning
solver = dspy.ChainOfThought(MathProblem)

result = solver(problem="If a train travels 120 miles in 2 hours, what is its average speed?")
print(f"Reasoning: {result.reasoning}")
print(f"Answer: {result.answer}")

RAG Pipeline

Retrieval-Augmented Generation with DSPy

import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM

# Configure retriever and LLM
retriever = ChromadbRM(collection_name="docs", persist_directory="./chroma")
lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=lm, rm=retriever)

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought(
            "context, question -> answer"
        )

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Use it
rag = RAG()
result = rag("What is the refund policy?")
print(result.answer)

Optimization: The Secret Sauce

DSPy's killer feature is automatic optimization. Give it examples and a metric, and it finds the best prompts and few-shot examples.

Optimizing a RAG Pipeline

From 60% to 85% accuracy with MIPROv2

import dspy
from dspy.evaluate import Evaluate

# Your module
rag = RAG()

# Your training data
trainset = [
    dspy.Example(question="What is X?", answer="X is...").with_inputs("question"),
    # ... more examples
]

# Your metric
def answer_correctness(example, pred, trace=None):
    # Simple exact match (use better metrics in practice)
    return example.answer.lower() in pred.answer.lower()

# Evaluate before optimization
evaluator = Evaluate(devset=trainset[:20], metric=answer_correctness)
print(f"Before: {evaluator(rag)}")

# Optimize with MIPROv2
optimizer = dspy.MIPROv2(
    metric=answer_correctness,
    num_candidates=7,
    init_temperature=1.0,
)

optimized_rag = optimizer.compile(
    rag,
    trainset=trainset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)

print(f"After: {evaluator(optimized_rag)}")

# Save the optimized program
optimized_rag.save("optimized_rag.json")

Optimizer Comparison

Optimizer | Speed | Quality | Compute | Best For

BootstrapFewShot | Fast | Good | Low | Quick iteration, limited compute budget
  Generates few-shot examples from your training data by running the pipeline and keeping successful traces.

BootstrapFewShotWithRandomSearch | Medium | Better | Medium | Balanced quality/speed tradeoff
  Runs BootstrapFewShot several times and uses random search over the candidate demonstration sets to pick the best program.

MIPROv2 | Slow | Best | High | Production deployment, maximizing quality
  State-of-the-art optimizer. Uses Bayesian optimization to search the prompt space and generates diverse instructions.

COPRO | Medium | Good | Medium | When instructions matter more than examples
  Optimizes instructions using coordinate ascent. Good for instruction-following tasks.

BootstrapFinetune | Slow | Best | Very High | High-volume production, latency-sensitive
  Generates training data and fine-tunes the underlying model, combining prompting with fine-tuning.
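
As a concrete starting point for the lightest row above, a BootstrapFewShot compile takes only a few lines. This sketch reuses the RAG module, answer_correctness metric, and trainset from the "Optimizing a RAG Pipeline" example above; adjust the names to your own code.

import dspy

# Sketch: compile with the lightest optimizer, reusing the RAG module,
# answer_correctness metric, and trainset defined in the example above
optimizer = dspy.BootstrapFewShot(metric=answer_correctness, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(RAG(), trainset=trainset)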

Advanced Patterns

Assertions for Grounding

Enforce constraints with automatic retries

import dspy

class GroundedQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, context, question):
        result = self.generate(context=context, question=question)

        # Assertion: answer must be grounded in context
        dspy.Assert(
            self._is_grounded(result.answer, context),
            f"Answer must be supported by context. Got: {result.answer}"
        )

        return result

    def _is_grounded(self, answer, context):
        # Check if key claims in answer appear in context
        # (Simplified - use NLI model in production)
        answer_words = set(answer.lower().split())
        context_words = set(context.lower().split())
        overlap = len(answer_words & context_words) / len(answer_words)
        return overlap > 0.3

# Assertions trigger retries with backtracking once activated on the module
grounded_qa = GroundedQA().activate_assertions()
result = grounded_qa(context="...", question="...")

Multi-Model Pipelines

Use different models for different tasks

import dspy

# Configure multiple models
gpt4 = dspy.LM("openai/gpt-4o")
claude = dspy.LM("anthropic/claude-3-5-sonnet-20241022")
local = dspy.LM("ollama/llama3.1:8b")

# Use different models for different tasks
class Pipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # Cheap model for classification
        self.classify = dspy.Predict("text -> category")
        # Powerful model for generation
        self.generate = dspy.ChainOfThought("category, text -> response")

    def forward(self, text):
        with dspy.context(lm=local):  # Use local for classification
            category = self.classify(text=text).category

        with dspy.context(lm=gpt4):  # Use GPT-4 for generation
            response = self.generate(category=category, text=text)

        return response

# Or cascade: try cheap first, fall back to expensive
class CascadePipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        # Try cheap model first
        with dspy.context(lm=local):
            result = self.answer(question=question)
            if self._is_confident(result):
                return result

        # Fall back to expensive model
        with dspy.context(lm=gpt4):
            return self.answer(question=question)

    def _is_confident(self, result):
        # Placeholder heuristic; use a calibrated check or an LLM judge in practice
        return len(result.answer.strip()) > 0

ReAct Agent with Tools

Build tool-using agents with reasoning traces

import dspy

# Define tools
def search_web(query: str) -> str:
    """Search the web for current information."""
    # Implement web search
    return f"Results for: {query}"

def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    # eval() is unsafe on untrusted input; use a proper math parser in production
    return str(eval(expression))

def lookup_database(entity: str) -> str:
    """Look up entity in knowledge base."""
    # Implement database lookup
    return f"Info about: {entity}"

# ReAct agent with tools
class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.react = dspy.ReAct(
            "question -> answer",
            tools=[search_web, calculate, lookup_database],
            max_iters=5
        )

    def forward(self, question):
        return self.react(question=question)

agent = ResearchAgent()
result = agent("What is the population of Tokyo times 2?")

DSPy vs Alternatives

When to use DSPy vs other frameworks.

Manual Prompting

Pros:
  • Simple for one-off tasks
  • Full control
  • No framework overhead
Cons:
  • Brittle to model changes
  • Hard to optimize systematically
  • Prompts become unmanageable
Best for:

Simple, stable tasks with one model

Start here, graduate to DSPy when complexity grows

LangChain

Pros:
  • Large ecosystem
  • Many integrations
  • Good for RAG pipelines
Cons:
  • Imperative, not declarative
  • No automatic optimization
  • Complex abstractions
Best for:

Integrating many tools and data sources

Use for integrations, DSPy for the LLM logic

LlamaIndex

Pros:
  • Excellent for RAG
  • Strong document handling
  • Query engines
Cons:
  • Focused on retrieval
  • Less flexible for non-RAG
  • No prompt optimization
Best for:

Document Q&A and knowledge bases

Use for document pipelines, DSPy for complex reasoning

Guidance / LMQL

Pros:
  • Constrained generation
  • Template control
  • Guaranteed format
Cons:
  • Lower-level control
  • More verbose
  • No optimization
Best for:

Structured output with strict format requirements

Use when you need token-level control

Production Deployment

DSPy is production-ready. Here's how to deploy effectively.

Caching

DSPy caches LLM calls by default. Set cache directory for persistence across runs.

import dspy
# Enable persistent caching
dspy.configure(
    lm=dspy.LM("openai/gpt-4o"),
    cache_dir="./dspy_cache"  # Persists across runs
)

Async Execution

Use async for high-throughput applications.

import dspy
import asyncio

# Configure the LM as usual; dspy.asyncify wraps a module so calls can be awaited
dspy.configure(lm=dspy.LM("openai/gpt-4o"))

async def process_batch(questions):
    module = dspy.asyncify(dspy.ChainOfThought("question -> answer"))
    tasks = [module(question=q) for q in questions]
    return await asyncio.gather(*tasks)

Serialization

Save and load optimized programs for deployment.

# After optimization
optimized_module.save("./model/optimized_rag.json")

# In production
loaded_module = RAG()
loaded_module.load("./model/optimized_rag.json")

Observability

Integrate with tracing tools for debugging and monitoring.

import dspy

# trace=[] collects per-call traces in dspy.settings.trace; the LM also records call history
dspy.configure(lm=dspy.LM("openai/gpt-4o"), trace=[])

# After running a module
module = dspy.ChainOfThought("question -> answer")
result = module(question="...")

# Inspect the prompts and completions of recent LLM calls
dspy.inspect_history(n=3)  # Show last 3 LLM calls

Should You Use DSPy?

  • Do you have a metric to optimize? Yes: DSPy shines here. No: manual prompting is fine.
  • Are you building multi-step pipelines? Yes: DSPy modules help. No: a single Predict is fine.
  • Will you switch models frequently? Yes: DSPy abstracts models. No: just hardcode prompts.
  • Do you need reproducible experiments? Yes: DSPy tracks everything. No: ad-hoc is fine.
  • High-volume production? Yes: optimize and cache. No: start simple.

Rule of thumb: If you're copy-pasting prompts between projects, tweaking wording to improve results, or maintaining different prompts for different models - DSPy will save you time and improve quality.

Frequently Asked Questions

What is DSPy?

DSPy is a framework for programming (not prompting) language models. You define signatures (input/output specs), compose modules, and let optimizers automatically find the best prompts and few-shot examples.

When should I use DSPy instead of manual prompting?

Use DSPy when: (1) you have a metric to optimize, (2) you're building multi-step pipelines, (3) prompts need to work across models, or (4) you want reproducible, testable code. Manual prompts are fine for simple one-off tasks.

How does optimization work?

Optimizers like MIPROv2 generate prompt and example variations, evaluate them against your metric using training data, and keep the best-performing combinations. It's like hyperparameter tuning for prompts.
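
In pseudocode, the selection loop is simple; the optimizer's real value is in generating good candidates (bootstrapped demos, proposed instructions, Bayesian search). A conceptual sketch in plain Python, not DSPy's internals:

# Conceptual sketch only: candidate_programs are variants of your program,
# each carrying different instructions and few-shot demos proposed by the optimizer
def pick_best(candidate_programs, trainset, metric):
    def average_score(program):
        # run the candidate on every training example and average the metric
        return sum(metric(ex, program(ex)) for ex in trainset) / len(trainset)
    return max(candidate_programs, key=average_score)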

Is DSPy production-ready?

Yes. Used at Databricks, VMware, JetBlue, and others. Supports caching, async, serialization, and observability. Main requirement: you need training examples and compute for optimization.

Can I use DSPy with any LLM?

Yes. DSPy supports OpenAI, Anthropic, Google, Cohere, local models via Ollama/vLLM, and more. The same code works across providers - just change the LM configuration.
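
For example, reusing the model identifiers from the multi-model example above (swap in whatever provider/model strings you actually use):

import dspy

qa = dspy.ChainOfThought("question -> answer")

# Same module, different providers: only the LM configuration changes
for model in ["openai/gpt-4o-mini",
              "anthropic/claude-3-5-sonnet-20241022",
              "ollama/llama3.1:8b"]:
    dspy.configure(lm=dspy.LM(model))
    print(model, "->", qa(question="What is DSPy?").answer)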


Ready to try DSPy?

Start with a simple pipeline, add optimization when you have metrics, scale from there.