DSPy (short for Declarative Self-improving Python) is an open-source programming framework developed at Stanford NLP for building systems powered by large language models (LLMs). Rather than relying on hand-written prompt engineering, DSPy treats LLM calls as declarative modules that can be composed into pipelines and then automatically optimized ("compiled") to maximize a given metric. The framework was created by Omar Khattab and collaborators and first released in October 2023. The foundational paper, "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines," was published as a conference paper at ICLR 2024.
DSPy is hosted under the stanfordnlp/dspy repository on GitHub. As of late 2025, it has accumulated over 28,000 GitHub stars and 160,000+ monthly PyPI downloads, making it one of the most widely adopted frameworks for programmatic LLM development.
Traditional approaches to building LLM-powered applications depend heavily on manually crafted prompts and prompt templates. Developers typically invest significant effort in trial-and-error experimentation to discover prompt phrasing that produces acceptable results. This process is fragile: when the underlying model changes, the task requirements shift, or the pipeline grows more complex, the hand-tuned prompts often break down and must be reworked from scratch.
DSPy grew out of the Demonstrate-Search-Predict (DSP) framework, whose research began at Stanford NLP in February 2022. The original DSP paper (Khattab et al., December 2022) introduced the idea of composing retrieval and language models for knowledge-intensive natural language processing tasks by passing natural language texts through sophisticated pipelines between a language model and a retrieval model. DSP programs for open-domain, multi-hop, and conversational question answering delivered 37-120% relative gains against vanilla LM baselines.
DSPy generalized and extended DSP's core insight: rather than manually specifying how an LLM should be prompted, developers should declare what a text transformation needs to accomplish and let the system figure out how to prompt the model effectively. This shift from imperative prompt engineering to declarative programming is the central contribution of DSPy.
The analogy to traditional software compilation is intentional. Just as a compiler translates high-level source code into optimized machine instructions, DSPy's optimizers translate high-level signatures and module compositions into optimized prompts, demonstrations, and weight updates that a language model can execute effectively. The developer works at the level of task specification, while the framework handles the low-level details of prompt construction.
DSPy introduces three primary abstractions that work together to define and optimize LLM programs: Signatures, Modules, and Optimizers (originally called Teleprompters).
A Signature in DSPy is a natural-language typed function declaration. It specifies the input and output fields of a text transformation, describing what the transformation should achieve without dictating how the LLM should be prompted to accomplish it.
Signatures can be expressed in two ways:
Inline (shorthand) notation:
```python
classify = dspy.Predict('sentence -> sentiment: bool')
qa = dspy.ChainOfThought('question -> answer')
summarize = dspy.Predict('document -> summary')
```
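Under the hood, the shorthand is just a compact field specification. The following toy parser is an illustration only, not DSPy's actual implementation; it shows how an 'inputs -> outputs' string decomposes into named, typed fields:

```python
# Toy illustration of how inline signature shorthand decomposes into
# input and output fields. This is NOT DSPy's real parser, just a sketch
# of the idea behind 'question -> answer'-style strings.

def parse_signature(spec: str):
    """Split 'a, b -> c: type' shorthand into (inputs, outputs) lists."""
    inputs_part, outputs_part = spec.split("->")

    def parse_fields(part):
        fields = []
        for field in part.split(","):
            name, _, type_hint = field.partition(":")
            # Untyped fields default to str, matching DSPy's convention.
            fields.append((name.strip(), type_hint.strip() or "str"))
        return fields

    return parse_fields(inputs_part), parse_fields(outputs_part)

inputs, outputs = parse_signature("sentence -> sentiment: bool")
print(inputs)   # [('sentence', 'str')]
print(outputs)  # [('sentiment', 'bool')]
```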
Class-based notation (for more complex specifications):
```python
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="often between 1 and 5 words")
```
Signatures serve as the contract between the developer and DSPy. The framework handles translating these high-level specifications into effective prompts, few-shot demonstrations, or fine-tuning data for the underlying language model. DSPy automatically validates that values passed to input fields match the types specified in the signature.
The class-based format also supports docstrings that describe the task in natural language, typed fields with descriptions and constraints, and default values. Starting with DSPy 3.0, signatures can include multi-modal types such as dspy.Image for image inputs, enabling vision-language pipelines to benefit from the same optimization infrastructure as text-only programs.
A Module in DSPy is a parameterized component that implements a specific prompting or reasoning strategy. Modules take a Signature and apply a particular technique to fulfill it. Each module has learnable parameters, including the instructions the LLM follows, stored demonstrations of input/output behavior, and (optionally) LM weights.
DSPy provides several built-in modules:
| Module | Description |
|---|---|
| dspy.Predict | The basic predictor. Does not modify the signature. Handles the key forms of learning (demonstrations, instructions). |
| dspy.ChainOfThought | Teaches the LM to think step by step before committing to the response. Automatically injects a reasoning field before the output fields. |
| dspy.ProgramOfThought | Teaches the LM to output executable code, whose execution results dictate the final response. Useful for mathematical or algorithmic tasks. |
| dspy.ReAct | An agent that can use tools to implement the given signature. Based on the Reasoning and Acting paradigm (Yao et al., 2022). |
| dspy.MultiChainComparison | Generates multiple chain-of-thought outputs and compares them to produce a final, higher-quality prediction. |
| dspy.majority | A utility that performs voting across multiple predictions and returns the most popular response. |
All built-in modules are constructed using dspy.Predict internally. Modules can be composed into larger programs, just as layers compose into neural networks in PyTorch.
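The idea behind dspy.majority can be sketched in a few lines. The version below is a toy stand-in, not the library's implementation; it simply normalizes and tallies candidate completions:

```python
from collections import Counter

def majority(predictions):
    """Return the most common answer among candidate completions,
    a toy stand-in for the voting performed by dspy.majority."""
    counts = Counter(p.strip().lower() for p in predictions)
    answer, _ = counts.most_common(1)[0]
    return answer

print(majority(["Paris", "paris", "Lyon"]))  # paris
```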
A typical DSPy program defines a class that inherits from dspy.Module, declares its sub-modules in __init__, and implements a forward method:
```python
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought('context, question -> answer')

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```
Optimizers (formerly called Teleprompters, a portmanteau of "telescoping prompts") are algorithms that automatically tune a DSPy program's parameters to maximize a given evaluation metric. This is the compilation step that distinguishes DSPy from traditional prompt engineering. Given a program, a training dataset, and a metric function, an optimizer searches for the best combination of instructions, demonstrations, and/or model weights for each module in the pipeline.
DSPy provides a range of optimizers targeting different aspects of the program:
| Optimizer | What It Optimizes | How It Works |
|---|---|---|
| LabeledFewShot | Few-shot examples | Randomly selects k labeled examples from the training set as demonstrations. |
| BootstrapFewShot | Few-shot examples | Uses a teacher module to generate demonstrations for every program stage, validating each against the metric. |
| BootstrapFewShotWithRandomSearch | Few-shot examples | Runs BootstrapFewShot multiple times with random search, keeping the best-performing candidate program. |
| KNNFewShot | Few-shot examples | Uses k-nearest neighbors to select the most relevant training examples as demonstrations. |
| COPRO | Instructions | Generates and refines instructions for each module step using coordinate ascent (hill-climbing). |
| MIPROv2 | Instructions + examples | Jointly optimizes instructions and few-shot examples using Bayesian optimization. |
| SIMBA | Instructions + examples | Uses stochastic mini-batch sampling to find challenging examples and applies introspective analysis to generate improvement rules. |
| GEPA | Instructions | Uses LM reflection on program trajectories to identify weaknesses and propose improved prompts. Based on Genetic-Pareto evolution (Agrawal et al., 2025). |
| BootstrapFinetune | Model weights | Distills a prompt-based DSPy program into weight updates by fine-tuning the underlying LM. |
| BetterTogether | Instructions + weights | A meta-optimizer that combines prompt optimization and weight optimization in configurable sequences. |
The compilation (optimization) process in DSPy follows a systematic, metrics-driven approach that replaces manual prompt engineering. It requires three ingredients: a DSPy program, a set of training examples (often just a few tens or hundreds), and a metric function that scores program outputs.
The process varies by optimizer, but a representative workflow using MIPROv2 involves three stages:
1. Bootstrapping Stage: The optimizer runs the program across many different training inputs, collecting traces of intermediate input/output behavior at every module. It filters these traces, retaining only those that appear in trajectories scored highly by the metric. These filtered traces become candidate few-shot demonstrations.
2. Grounded Proposal Stage: MIPROv2 inspects the DSPy program's code structure, the training data characteristics, and the collected traces. Using this information, it drafts multiple candidate instructions for every prompt in the program. This step is both data-aware (grounded in actual examples) and demonstration-aware (informed by which traces worked well).
3. Discrete Search Stage: The optimizer samples mini-batches from the training set, proposes combinations of instructions and demonstrations for constructing every prompt in the pipeline, and evaluates each candidate program on the mini-batch. MIPROv2 uses Bayesian optimization to efficiently search this combinatorial space, balancing exploration of new configurations with exploitation of promising ones.
The final output is a "compiled" program where each module's parameters (instructions, demonstrations, or weights) have been tuned for the target task and metric. Importantly, this compiled program remains a standard DSPy program that can be further optimized, composed with other modules, or saved and loaded for deployment.
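The bootstrap-then-search procedure can be sketched in plain Python. The code below is a drastically simplified illustration, not DSPy's implementation: real optimizers such as MIPROv2 add instruction proposal, mini-batch evaluation, and Bayesian search on top of this basic loop.

```python
import random

def compile_program(program, trainset, metric, num_candidates=8, num_demos=2, seed=0):
    rng = random.Random(seed)

    # "Bootstrapping": keep training examples the zero-shot program already
    # answers correctly; these become candidate demonstrations.
    good = [ex for ex in trainset if metric(ex, program(ex, demos=[]))]

    # "Search": try random demonstration subsets, keep the best-scoring one.
    best_demos, best_score = [], -1
    for _ in range(num_candidates):
        demos = rng.sample(good, min(num_demos, len(good)))
        score = sum(metric(ex, program(ex, demos=demos)) for ex in trainset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score

# Toy stand-ins for an LM-backed module and a metric.
ANSWERS = {"2+2": "4", "3+3": "6"}

def toy_program(example, demos):
    return ANSWERS.get(example["question"], "unknown")

def exact_match(example, prediction):
    return prediction == example["answer"]

trainset = [
    {"question": "2+2", "answer": "4"},
    {"question": "3+3", "answer": "6"},
    {"question": "7*6", "answer": "42"},  # the toy program cannot answer this
]
demos, score = compile_program(toy_program, trainset, exact_match)
print(score)  # 2 of 3 training examples answered correctly
```

In real use the "program" would be a DSPy module whose behavior genuinely changes with its demonstrations, which is what makes the search worthwhile.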
This approach can produce dramatic quality improvements. The original DSPy paper demonstrated that compilation could raise the quality of simple programs from 33% to 82% accuracy on math word problems using GPT-3.5, and from 9% to 47% using Llama 2-13b-chat. On multi-hop question answering, improvements ranged from 25% to 65% across different models.
The following table summarizes how DSPy's programmatic approach compares to traditional prompt engineering:
| Aspect | Traditional Prompt Engineering | DSPy |
|---|---|---|
| Prompt creation | Manual, trial-and-error writing of prompt templates | Automatic generation and optimization of prompts via compilation |
| Abstraction level | Developer writes low-level prompt text | Developer declares high-level signatures and composes modules |
| Adaptability to new models | Prompts often need rewriting when switching LLMs | Recompile the same program for a different LM |
| Multi-step pipelines | Complex chain of string manipulations | Modular composition of typed modules |
| Optimization | Manual A/B testing and subjective evaluation | Metric-driven, automated search over instructions and demonstrations |
| Reproducibility | Difficult to version-control free-form prompts | Programs are standard Python code, easy to version and test |
| Portability | Tied to specific model behavior | Same program can target GPT-4, Claude, Llama, or fine-tuned models |
| Learning curve | Lower initial barrier | Requires understanding of optimization loops and metric design |
DSPy Assertions extend the framework with computational constraints that LLM outputs should satisfy. Introduced in a companion paper by Arnav Singhvi, Manish Shetty, Shangyin Tan, Christopher Potts, Koushik Sen, Matei Zaharia, and Omar Khattab (December 2023), assertions provide a structured way to enforce rules on LM behavior.
The framework defines two types of constraints:
dspy.Assert: Enforces strict constraints that must be met. When an assertion fails, the pipeline enters a retry state, allowing it to reattempt the failing LM call with awareness of previous attempts and the error message. If the assertion continues to fail after a maximum number of retries, an AssertionError is raised.
dspy.Suggest: Offers a softer approach. It maintains the same retry backtracking as dspy.Assert, but if the constraint still fails after the maximum number of attempts, it logs the persistent failure and continues execution rather than halting.
Assertions can be integrated both at compile time (via automatic prompt optimization) and at inference time (via automatic self-refinement and backtracking). According to the paper, LM Assertions improve constraint compliance by up to 164% and generate up to 37% more higher-quality responses compared to programs without assertions.
Example usage:
```python
dspy.Assert(
    len(prediction.answer.split()) < 50,
    "Answer must be fewer than 50 words."
)

dspy.Suggest(
    prediction.confidence > 0.8,
    "Try to provide a higher-confidence answer."
)
```
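The retry-and-backtrack semantics can be illustrated with a minimal pure-Python loop. This is a conceptual sketch, not DSPy's implementation: a hard constraint retries and finally raises, while a soft constraint retries, then logs the failure and continues.

```python
# Conceptual sketch of dspy.Assert vs. dspy.Suggest retry behavior.
def with_backtracking(generate, constraint, message, max_retries=3, hard=True):
    feedback = None
    for _ in range(max_retries + 1):
        result = generate(feedback)  # a retry sees the previous error message
        if constraint(result):
            return result
        feedback = message
    if hard:
        raise AssertionError(message)  # dspy.Assert-style failure
    print(f"Suggestion not met, continuing: {message}")  # dspy.Suggest-style
    return result

# A stand-in generator that "improves" after receiving feedback.
attempts = iter(["a very long answer " * 20, "short answer"])
result = with_backtracking(
    lambda feedback: next(attempts),
    constraint=lambda r: len(r.split()) < 50,
    message="Answer must be fewer than 50 words.",
)
print(result)  # short answer
```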
DSPy supports a wide range of language model providers through LiteLLM, which serves as its backend for model communication. Developers configure their LM with a simple API call using the provider/model naming convention.
OpenAI models:

```python
lm = dspy.LM('openai/gpt-4o-mini', api_key='your-key')
dspy.configure(lm=lm)
```

Anthropic models:

```python
lm = dspy.LM('anthropic/claude-sonnet-4-20250514', api_key='your-key')
dspy.configure(lm=lm)
```

Local models via Ollama:

```python
lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434')
dspy.configure(lm=lm)
```

Local models via SGLang or vLLM (OpenAI-compatible endpoints):

```python
lm = dspy.LM('openai/meta-llama/Llama-3-8B', api_base='http://localhost:30000/v1')
dspy.configure(lm=lm)
```
The dspy.configure() function sets the default LM globally, while dspy.context() allows temporarily switching the model within a specific block of code. This design makes it straightforward to experiment with different models or to use different models for different stages of a pipeline.
DSPy supports any provider that LiteLLM supports, including OpenAI, Anthropic, Google (Gemini), Mistral, Cohere, Azure OpenAI, AWS Bedrock, and many others. Authentication is handled through environment variables (such as OPENAI_API_KEY or ANTHROPIC_API_KEY) or by passing the key directly.
The foundational DSPy paper was authored by Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. The authors represented Stanford University, UC Berkeley, Carnegie Mellon University, Amazon Alexa AI, IIT Bombay, Microsoft, and Calera Capital.
The paper was first posted as an arXiv preprint (arXiv:2310.03714) in October 2023 and was accepted as a conference paper at the International Conference on Learning Representations (ICLR) 2024.
Key contributions of the paper include:

- The DSPy programming model, which abstracts LM pipelines as text-transformation graphs built from declarative modules specified by natural-language signatures.
- A compiler (the teleprompter, later optimizer, machinery) that automatically generates instructions and few-shot demonstrations to maximize a given metric.
- Case studies on math word problems and multi-hop question answering showing that compiled programs can outperform hand-written prompt pipelines, with the quality gains reported above.
A follow-up paper, "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs" (Opsahl-Ong, Ryan, Purtell, Broman, Potts, Zaharia, and Khattab), introduced the MIPRO optimizer and was published at EMNLP 2024 in Miami, Florida. MIPRO outperformed baseline optimizers on five of seven diverse multi-stage LM programs using Llama-3-8B, with gains of up to 13% accuracy.
DSPy has been adopted by a diverse set of organizations across industries. The following table highlights selected production deployments:
| Company | Use Case |
|---|---|
| JetBlue | Multiple chatbot applications and LLM pipeline optimization |
| Replit | Code diff synthesis for automated code repair |
| Databricks | LM Judges, RAG systems, classification, and customer solutions |
| Sephora | Agent-based applications |
| Zoro UK | Structured e-commerce product attribute normalization across millions of items |
| VMware | RAG and prompt optimization |
| Haize Labs | Automated red-teaming to test LLM vulnerabilities |
| Moody's | RAG optimization, LLM-as-a-Judge, and financial agentic systems |
| PingCAP | Knowledge graph construction and querying |
| Truelaw | Custom LLM pipelines for law firm workflows |
| RadiantLogic | AI Data Assistant with routing, extraction, text-to-SQL, and summarization |
| Infinitus | Healthcare AI agent development and optimization |
The DSPy community includes over 200 contributors to the open-source project. The framework has also seen adoption in academic research, with the original paper and related work receiving extensive citations since publication.
Beyond enterprise use, DSPy has found traction in research laboratories and educational settings. Its declarative approach makes it a natural fit for rapid prototyping of NLP pipelines in academic contexts, where researchers can quickly test different module compositions and optimization strategies without investing weeks in manual prompt tuning. Several university courses on LLM application development have incorporated DSPy into their curricula as an example of the programmatic paradigm.
DSPy has undergone rapid development since its initial release:
| Version | Release Period | Key Changes |
|---|---|---|
| DSP (predecessor) | February 2022 | Demonstrate-Search-Predict framework for composing retrieval and language models |
| DSPy 1.0 | October 2023 | Initial release with Signatures, Modules, and Teleprompters |
| DSPy 2.0 | 2024 | Expanded optimizer support, improved documentation, renamed Teleprompters to Optimizers |
| DSPy 2.5 | Late 2024 | Enhanced multi-module optimization, advanced telemetry and tracing, expanded model support (GPT-4, Claude, Gemini) |
| DSPy 3.0 | August 2025 | Multi-modal support (images and audio), thread-safe settings, native async support, native MLflow integration, improved prompt optimization and fine-tuning/RL capabilities |
| DSPy 3.0.4 | November 2025 | Stability improvements and bug fixes |
| DSPy 3.1.0b1 | January 2026 | Pre-release with additional features under testing |
DSPy 3.0, released in August 2025, represents the most significant update since the framework's inception. Key features include:
- Multi-modal support, including image and audio input types.
- Thread-safe settings and native async support, plus a batch function on DSPy modules for thread-safe parallel execution.
- Native MLflow integration.
- Improved prompt optimization and fine-tuning/RL capabilities.

Omar Khattab, the creator of DSPy, joined Databricks as a Research Scientist in 2024 before joining the MIT faculty in July 2025. His move to Databricks led to increased investment in the open-source DSPy community, with DSPy 3.0 being presented at the Data + AI Summit 2025. The core development team includes contributors from Databricks, Zenbase, Weaviate, and various academic institutions, with mentorship from researchers at Stanford, UC Berkeley, Carnegie Mellon University, and Two Sigma.
DSPy occupies a distinct position in the LLM framework landscape. While frameworks like LangChain focus on orchestration and integration with external tools, databases, and APIs, DSPy focuses specifically on programmatic prompt optimization. The two frameworks are not mutually exclusive and can be used together.
| Feature | DSPy | LangChain |
|---|---|---|
| Primary focus | Prompt and pipeline optimization | Orchestration and tool integration |
| Prompt handling | Automatically optimized via compilation | Manually engineered templates |
| Integration ecosystem | Focused (LLM providers via LiteLLM) | Broad (vector databases, APIs, tools, observability platforms) |
| Optimization | Built-in metric-driven optimizers | Requires external tooling or manual tuning |
| Learning curve | Steeper (requires understanding of optimization, metrics) | Gentler (familiar patterns for web developers) |
| Best suited for | Complex multi-stage reasoning pipelines where performance metrics matter | Applications requiring extensive integration with diverse data sources and APIs |
A typical DSPy development workflow proceeds as follows:

1. Define the task as one or more signatures and compose modules into a program.
2. Collect a small set of training examples and write a metric function that scores program outputs.
3. Evaluate the uncompiled program to establish a baseline.
4. Compile the program with an optimizer suited to the task and budget (for example, BootstrapFewShot or MIPROv2).
5. Inspect the results, refine the program or metric, and recompile as needed.
This workflow mirrors conventional machine learning development practices, where training data and evaluation metrics drive improvement, rather than the ad hoc experimentation typical of prompt engineering.
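The metric in this workflow is an ordinary Python function; by DSPy convention it takes a gold example, a prediction, and an optional trace. The minimal exact-match sketch below uses plain namespaces as stand-ins for DSPy's Example and Prediction objects:

```python
from types import SimpleNamespace

def exact_match_metric(example, pred, trace=None):
    """Score 1.0 when the predicted answer matches the gold answer
    (case-insensitive); optimizers maximize the average of this score."""
    return float(example.answer.strip().lower() == pred.answer.strip().lower())

# Stand-ins for dspy.Example and a module's Prediction output.
gold = SimpleNamespace(question="Capital of France?", answer="Paris")
pred = SimpleNamespace(answer="paris")
print(exact_match_metric(gold, pred))  # 1.0
```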
While DSPy offers significant advantages for building LLM-powered systems, several considerations apply:

- Optimization requires labeled examples and a well-designed metric; a poorly chosen metric can steer compilation toward the wrong behavior.
- Compilation issues many LM calls, so optimization runs incur nontrivial cost and latency.
- The learning curve is steeper than template-based prompting, since developers must understand optimization loops and evaluation design.
- Compiled prompts are generated rather than hand-written, which can make individual prompts harder to inspect and audit.