Programming language models instead of prompting them.
- What it does: Replaces brittle prompts with declarative signatures plus automatic optimization
- Key insight: Prompts should be learned, not hand-written
- Created by: Omar Khattab @ Stanford NLP (the same lab as ColBERT)
- When to use: Multi-step pipelines, model-agnostic code, optimizing for metrics
DSPy: The Framework for Programming LLMs
Stop writing prompts. Start writing programs. DSPy lets you define what you want, and automatically finds the prompts, examples, and strategies to achieve it.
The Problem with Prompts
Traditional Prompting
- Brittle: Small changes break everything
- Model-specific: Prompts don't transfer between models
- Hard to optimize: Which version is best? Who knows.
- Not composable: Multi-step pipelines become spaghetti
- Not reproducible: "It worked yesterday..."
DSPy Approach
- Declarative: Define inputs/outputs, not exact wording
- Model-agnostic: Same code, swap LLM provider
- Optimizable: Automatic prompt tuning with metrics
- Composable: Modules combine like PyTorch layers
- Reproducible: Save, load, version, test
Core Concepts
DSPy has four main abstractions. Master these and you can build anything.
Signatures
Declarative I/O specs. Define what your LLM should do, not how. Signatures specify input and output fields with types and descriptions.
"question -> answer" # Simple
"context, question -> reasoning, answer" # Multi-field
class QA(dspy.Signature):
"""Answer questions with citations."""
context = dspy.InputField(desc="relevant passages")
question = dspy.InputField()
answer = dspy.OutputField(desc="cite sources")Modules
Reusable LLM components. Pre-built patterns like ChainOfThought, ReAct, and ProgramOfThought. Compose them like PyTorch layers.
# Built-in modules
dspy.Predict(signature)       # Direct prediction
dspy.ChainOfThought(sig)      # Step-by-step reasoning
dspy.ReAct(sig, tools=[...])  # Tool-using agent
dspy.ProgramOfThought(sig)    # Generate & execute code

# Custom module
class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
Optimizers
Automatic prompt tuning. Find optimal prompts, few-shot examples, and instructions. Like hyperparameter search for LLM programs.
# Define your metric
def accuracy(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

# Choose an optimizer
optimizer = dspy.MIPROv2(
    metric=accuracy,
    num_candidates=10,
    init_temperature=1.0
)

# Compile (optimize) your program
optimized_rag = optimizer.compile(
    RAG(),
    trainset=train_examples,
    max_bootstrapped_demos=4,
    max_labeled_demos=4
)
Assertions
Runtime constraints. Add hard constraints that trigger retries. Catch hallucinations, format errors, and policy violations.
class FactualQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, context, question):
        answer = self.generate(context=context, question=question)

        # Hard constraint: answer must be in context
        dspy.Assert(
            answer.answer in context,
            "Answer must be grounded in context"
        )

        # Soft suggestion: prefer concise answers
        dspy.Suggest(
            len(answer.answer.split()) < 50,
            "Keep answers concise"
        )

        return answer
Getting Started
Installation & Basic Usage
pip install dspy

import dspy

# Configure LLM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Define a simple signature
class Summarize(dspy.Signature):
    """Summarize the document in 2-3 sentences."""
    document = dspy.InputField()
    summary = dspy.OutputField()

# Create a module
summarizer = dspy.Predict(Summarize)

# Use it
result = summarizer(document="DSPy is a framework for programming...")
print(result.summary)
Chain of Thought Reasoning
Automatically elicit step-by-step reasoning
import dspy

# Chain of Thought adds step-by-step reasoning
class MathProblem(dspy.Signature):
    """Solve the math problem step by step."""
    problem = dspy.InputField()
    reasoning = dspy.OutputField(desc="step-by-step solution")
    answer = dspy.OutputField(desc="final numerical answer")

# CoT automatically elicits reasoning
solver = dspy.ChainOfThought(MathProblem)
result = solver(problem="If a train travels 120 miles in 2 hours, what is its average speed?")

print(f"Reasoning: {result.reasoning}")
print(f"Answer: {result.answer}")
RAG Pipeline
Retrieval-Augmented Generation with DSPy
import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM

# Configure retriever and LLM
retriever = ChromadbRM(collection_name="docs", persist_directory="./chroma")
lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=lm, rm=retriever)

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought(
            "context, question -> answer"
        )

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Use it
rag = RAG()
result = rag("What is the refund policy?")
print(result.answer)
Optimization: The Secret Sauce
DSPy's killer feature is automatic optimization. Give it examples and a metric, and it finds the best prompts and few-shot examples.
Optimizing a RAG Pipeline
From 60% to 85% accuracy with MIPROv2
import dspy
from dspy.evaluate import Evaluate

# Your module
rag = RAG()

# Your training data
trainset = [
    dspy.Example(question="What is X?", answer="X is...").with_inputs("question"),
    # ... more examples
]

# Your metric
def answer_correctness(example, pred, trace=None):
    # Simple exact match (use better metrics in practice)
    return example.answer.lower() in pred.answer.lower()

# Evaluate before optimization
evaluator = Evaluate(devset=trainset[:20], metric=answer_correctness)
print(f"Before: {evaluator(rag)}")

# Optimize with MIPROv2
optimizer = dspy.MIPROv2(
    metric=answer_correctness,
    num_candidates=7,
    init_temperature=1.0,
)

optimized_rag = optimizer.compile(
    rag,
    trainset=trainset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)

print(f"After: {evaluator(optimized_rag)}")

# Save the optimized program
optimized_rag.save("optimized_rag.json")
Optimizer Comparison
| Optimizer | How it works | Speed | Quality | Compute | Best For |
|---|---|---|---|---|---|
| BootstrapFewShot | Generates few-shot examples from your training data by running the pipeline and keeping successful traces. | Fast | Good | Low | Quick iteration, limited compute budget |
| BootstrapFewShotWithRandomSearch | Runs BootstrapFewShot repeatedly with random search over the generated demo sets, keeping the best candidate program. | Medium | Better | Medium | Balanced quality/speed tradeoff |
| MIPROv2 | State-of-the-art optimizer. Uses Bayesian optimization to search the prompt space and generates diverse instructions. | Slow | Best | High | Production deployment, maximizing quality |
| COPRO | Optimizes instructions using coordinate ascent. Good for instruction-following tasks. | Medium | Good | Medium | When instructions matter more than examples |
| BootstrapFinetune | Generates training data and fine-tunes the underlying model, combining prompting with fine-tuning. | Slow | Best | Very High | High-volume production, latency-sensitive |
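For the quick-iteration end of the table, BootstrapFewShot needs only a few lines. A minimal sketch that reuses the `RAG` module, `trainset`, and `answer_correctness` metric defined above:

```python
import dspy

# BootstrapFewShot: fast, low-compute optimization that bootstraps demos
# by running your own pipeline and keeping traces that pass the metric
optimizer = dspy.BootstrapFewShot(
    metric=answer_correctness,
    max_bootstrapped_demos=4,  # demos generated from successful pipeline runs
    max_labeled_demos=4,       # demos taken directly from the trainset
)

optimized_rag = optimizer.compile(RAG(), trainset=trainset)
optimized_rag.save("bootstrap_rag.json")
```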
Advanced Patterns
Assertions for Grounding
Enforce constraints with automatic retries
import dspy

class GroundedQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, context, question):
        result = self.generate(context=context, question=question)

        # Assertion: answer must be grounded in context
        dspy.Assert(
            self._is_grounded(result.answer, context),
            f"Answer must be supported by context. Got: {result.answer}"
        )
        return result

    def _is_grounded(self, answer, context):
        # Check if key claims in answer appear in context
        # (Simplified - use an NLI model in production)
        answer_words = set(answer.lower().split())
        context_words = set(context.lower().split())
        overlap = len(answer_words & context_words) / len(answer_words)
        return overlap > 0.3

# Activating assertions enables retries with backtracking on failure
grounded_qa = GroundedQA().activate_assertions()
result = grounded_qa(context="...", question="...")
Multi-Model Pipelines
Use different models for different tasks
import dspy

# Configure multiple models
gpt4 = dspy.LM("openai/gpt-4o")
claude = dspy.LM("anthropic/claude-3-5-sonnet-20241022")
local = dspy.LM("ollama/llama3.1:8b")

# Use different models for different tasks
class Pipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # Cheap model for classification
        self.classify = dspy.Predict("text -> category")
        # Powerful model for generation
        self.generate = dspy.ChainOfThought("category, text -> response")

    def forward(self, text):
        with dspy.context(lm=local):  # Use the local model for classification
            category = self.classify(text=text).category
        with dspy.context(lm=gpt4):   # Use GPT-4 for generation
            response = self.generate(category=category, text=text)
        return response

# Or cascade: try cheap first, fall back to expensive
class CascadePipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        # Try the cheap model first
        with dspy.context(lm=local):
            result = self.answer(question=question)
            if self._is_confident(result):
                return result
        # Fall back to the expensive model
        with dspy.context(lm=gpt4):
            return self.answer(question=question)

    def _is_confident(self, result):
        # Placeholder confidence check; replace with a real heuristic or judge
        return len(result.answer.strip()) > 0
ReAct Agent with Tools
Build tool-using agents with reasoning traces
import dspy

# Define tools
def search_web(query: str) -> str:
    """Search the web for current information."""
    # Implement web search
    return f"Results for: {query}"

def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    # Note: eval() is unsafe for untrusted input; use a proper math parser in production
    return str(eval(expression))

def lookup_database(entity: str) -> str:
    """Look up entity in knowledge base."""
    # Implement database lookup
    return f"Info about: {entity}"

# ReAct agent with tools
class ResearchAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.react = dspy.ReAct(
            "question -> answer",
            tools=[search_web, calculate, lookup_database],
            max_iters=5
        )

    def forward(self, question):
        return self.react(question=question)

agent = ResearchAgent()
result = agent("What is the population of Tokyo times 2?")
DSPy vs Alternatives
When to use DSPy vs other frameworks.
Manual Prompting
- Pros: Simple for one-off tasks; full control; no framework overhead
- Cons: Brittle to model changes; hard to optimize systematically; prompts become unmanageable
- Best for: Simple, stable tasks with one model
- Verdict: Start here, graduate to DSPy when complexity grows (see the sketch below)
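To make the contrast concrete, here is a rough side-by-side of a hand-written prompt against the DSPy equivalent (an illustrative sketch; the prompt wording and model choice are arbitrary):

```python
import dspy
from openai import OpenAI

# Manual prompting: the exact wording is the program
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Answer concisely: What is DSPy?"}],
)
print(resp.choices[0].message.content)

# DSPy: declare the I/O contract and let the framework own the wording
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
qa = dspy.Predict("question -> answer")
print(qa(question="What is DSPy?").answer)
```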
LangChain
- Pros: Large ecosystem; many integrations; good for RAG pipelines
- Cons: Imperative, not declarative; no automatic optimization; complex abstractions
- Best for: Integrating many tools and data sources
- Verdict: Use it for integrations, DSPy for the LLM logic
LlamaIndex
- Pros: Excellent for RAG; strong document handling; query engines
- Cons: Focused on retrieval; less flexible for non-RAG work; no prompt optimization
- Best for: Document Q&A and knowledge bases
- Verdict: Use it for document pipelines, DSPy for complex reasoning
Guidance / LMQL
- Pros: Constrained generation; template control; guaranteed output format
- Cons: Lower-level control; more verbose; no optimization
- Best for: Structured output with strict format requirements
- Verdict: Use them when you need token-level control (for plain structured output, see the DSPy sketch below)
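If the requirement is structured output rather than token-level control, recent DSPy versions let class-based signatures carry Python type annotations that are parsed and validated for you. A minimal sketch, assuming DSPy ≥ 2.5 typed fields; the `Ticket` signature and its categories are illustrative:

```python
import dspy
from typing import Literal

class Ticket(dspy.Signature):
    """Classify a support ticket."""
    text: str = dspy.InputField()
    category: Literal["billing", "bug", "feature_request"] = dspy.OutputField()
    priority: int = dspy.OutputField(desc="1 (low) to 5 (urgent)")

classify = dspy.Predict(Ticket)
result = classify(text="I was charged twice this month.")
print(result.category, result.priority)  # parsed into the annotated types
```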
Production Deployment
DSPy is production-ready. Here's how to deploy effectively.
Caching
DSPy caches LLM calls by default. Set a cache directory to persist the cache across runs.
import dspy

# Enable persistent caching
dspy.configure(
    lm=dspy.LM("openai/gpt-4o"),
    cache_dir="./dspy_cache"  # Persists across runs
)
Async Execution
Use async for high-throughput applications.
import dspy
import asyncio

dspy.configure(lm=dspy.LM("openai/gpt-4o"))

# asyncify wraps a module so it can be awaited (available in recent DSPy releases)
async_answer = dspy.asyncify(dspy.ChainOfThought("question -> answer"))

async def process_batch(questions):
    tasks = [async_answer(question=q) for q in questions]
    return await asyncio.gather(*tasks)
Serialization
Save and load optimized programs for deployment.
# After optimization
optimized_module.save("./model/optimized_rag.json")

# In production
loaded_module = RAG()
loaded_module.load("./model/optimized_rag.json")
Observability
Integrate with tracing tools for debugging and monitoring.
import dspy

# Enable detailed logging
dspy.configure(lm=dspy.LM("openai/gpt-4o"), trace=[])

# After running
result = module(question="...")

# Inspect what happened
dspy.inspect_history(n=3)  # Show the last 3 LLM calls
Should You Use DSPy?
Rule of thumb: If you're copy-pasting prompts between projects, tweaking wording to improve results, or maintaining different prompts for different models, DSPy will save you time and improve quality.
Frequently Asked Questions
What is DSPy?
DSPy is a framework for programming (not prompting) language models. You define signatures (input/output specs), compose modules, and let optimizers automatically find the best prompts and few-shot examples.
When should I use DSPy instead of manual prompting?
Use DSPy when: (1) you have a metric to optimize, (2) you're building multi-step pipelines, (3) prompts need to work across models, or (4) you want reproducible, testable code. Manual prompts are fine for simple one-off tasks.
How does optimization work?
Optimizers like MIPROv2 generate prompt and example variations, evaluate them against your metric using training data, and keep the best-performing combinations. It's like hyperparameter tuning for prompts.
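In pseudocode, the loop looks roughly like this (a conceptual sketch of the idea, not DSPy's actual internals; `run_with` is a hypothetical helper that applies a candidate prompt):

```python
def optimize(program, trainset, metric, candidates):
    """Conceptual sketch: score each candidate prompt/demo set, keep the best."""
    best, best_score = None, float("-inf")
    for candidate in candidates:  # proposed instructions + few-shot demos
        outputs = [program.run_with(candidate, ex) for ex in trainset]  # hypothetical helper
        score = sum(metric(ex, out) for ex, out in zip(trainset, outputs)) / len(trainset)
        if score > best_score:
            best, best_score = candidate, score
    return best  # best-performing prompt configuration
```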
Is DSPy production-ready?
Yes. Used at Databricks, VMware, JetBlue, and others. Supports caching, async, serialization, and observability. Main requirement: you need training examples and compute for optimization.
Can I use DSPy with any LLM?
Yes. DSPy supports OpenAI, Anthropic, Google, Cohere, local models via Ollama/vLLM, and more. The same code works across providers - just change the LM configuration.
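For example, switching providers is just a different model string passed to `dspy.LM` (the model names below are illustrative):

```python
import dspy

# Same program, different providers: only the LM configuration changes
for model in ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet-20241022", "ollama/llama3.1:8b"]:
    dspy.configure(lm=dspy.LM(model))
    qa = dspy.ChainOfThought("question -> answer")
    print(model, "->", qa(question="What is DSPy?").answer)
```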
Ready to try DSPy?
Start with a simple pipeline, add optimization when you have metrics, scale from there.