DSPy (short for Declarative Self-improving Python) is an open-source programming framework developed at Stanford NLP for building systems powered by large language models (LLMs). Rather than relying on hand-written prompt engineering, DSPy treats LLM calls as declarative modules that can be composed into pipelines and then automatically optimized ("compiled") to maximize a given metric. The framework was created by Omar Khattab and collaborators and first released in October 2023. The foundational paper, "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines," was published as a conference paper at ICLR 2024.
DSPy is hosted under the stanfordnlp/dspy repository on GitHub. As of late 2025, it has accumulated over 28,000 GitHub stars and 160,000+ monthly PyPI downloads, making it one of the most widely adopted frameworks for programmatic LLM development.
Traditional approaches to building LLM-powered applications depend heavily on manually crafted prompts and prompt templates. Developers typically invest significant effort in trial-and-error experimentation to discover prompt phrasing that produces acceptable results. This process is fragile: when the underlying model changes, the task requirements shift, or the pipeline grows more complex, the hand-tuned prompts often break down and must be reworked from scratch.
DSPy grew out of the Demonstrate-Search-Predict (DSP) framework, whose research began at Stanford NLP in February 2022. The original DSP paper (Khattab et al., December 2022) introduced the idea of composing retrieval and language models for knowledge-intensive natural language processing tasks by passing natural language texts through sophisticated pipelines between a language model and a retrieval model. DSP programs for open-domain, multi-hop, and conversational question answering delivered 37-120% relative gains against vanilla LM baselines.
DSPy generalized and extended DSP's core insight: rather than manually specifying how an LLM should be prompted, developers should declare what a text transformation needs to accomplish and let the system figure out how to prompt the model effectively. This shift from imperative prompt engineering to declarative programming is the central contribution of DSPy.
The analogy to traditional software compilation is intentional. Just as a compiler translates high-level source code into optimized machine instructions, DSPy's optimizers translate high-level signatures and module compositions into optimized prompts, demonstrations, and weight updates that a language model can execute effectively. The developer works at the level of task specification, while the framework handles the low-level details of prompt construction.
DSPy introduces three primary abstractions that work together to define and optimize LLM programs: Signatures, Modules, and Optimizers (originally called Teleprompters).
A Signature in DSPy is a natural-language typed function declaration. It specifies the input and output fields of a text transformation, describing what the transformation should achieve without dictating how the LLM should be prompted to accomplish it.
Signatures can be expressed in two ways:
Inline (shorthand) notation:
```python
classify = dspy.Predict('sentence -> sentiment: bool')
qa = dspy.ChainOfThought('question -> answer')
summarize = dspy.Predict('document -> summary')
```
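Under the hood, the shorthand is just a compact field specification. The following toy parser is an illustration only, not DSPy's actual implementation; it shows how an 'inputs -> outputs' string decomposes into named, typed fields:

```python
# Toy illustration of how inline signature shorthand decomposes into
# input and output fields. This is NOT DSPy's real parser, just a sketch
# of the idea behind 'question -> answer'-style strings.

def parse_signature(spec: str):
    """Split 'a, b -> c: type' shorthand into (inputs, outputs) lists."""
    inputs_part, outputs_part = spec.split("->")

    def parse_fields(part):
        fields = []
        for field in part.split(","):
            name, _, type_hint = field.partition(":")
            # Untyped fields default to str, matching DSPy's convention.
            fields.append((name.strip(), type_hint.strip() or "str"))
        return fields

    return parse_fields(inputs_part), parse_fields(outputs_part)

inputs, outputs = parse_signature("sentence -> sentiment: bool")
print(inputs)   # [('sentence', 'str')]
print(outputs)  # [('sentiment', 'bool')]
```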
Class-based notation (for more complex specifications):
```python
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="often between 1 and 5 words")
```
Signatures serve as the contract between the developer and DSPy. The framework handles translating these high-level specifications into effective prompts, few-shot demonstrations, or fine-tuning data for the underlying language model. DSPy automatically validates that values passed to input fields match the types specified in the signature.
The class-based format also supports docstrings that describe the task in natural language, typed fields with descriptions and constraints, and default values. Starting with DSPy 3.0, signatures can include multi-modal types such as dspy.Image for image inputs, enabling vision-language pipelines to benefit from the same optimization infrastructure as text-only programs.
A Module in DSPy is a parameterized component that implements a specific prompting or reasoning strategy. Modules take a Signature and apply a particular technique to fulfill it. Each module has learnable parameters, including the instructions the LLM follows, stored demonstrations of input/output behavior, and (optionally) LM weights.
DSPy provides several built-in modules:
| Module | Description |
|---|---|
| dspy.Predict | The basic predictor. Does not modify the signature. Handles the key forms of learning (demonstrations, instructions). |
| dspy.ChainOfThought | Teaches the LM to think step by step before committing to the response. Automatically injects a reasoning field before the output fields. |
| dspy.ProgramOfThought | Teaches the LM to output executable code, whose execution results dictate the final response. Useful for mathematical or algorithmic tasks. |
| dspy.ReAct | An agent that can use tools to implement the given signature. Based on the Reasoning and Acting paradigm (Yao et al., 2022). |
| dspy.MultiChainComparison | Generates multiple chain-of-thought outputs and compares them to produce a final, higher-quality prediction. |
| dspy.majority | A utility that performs voting across multiple predictions and returns the most popular response. |
All built-in modules are constructed using dspy.Predict internally. Modules can be composed into larger programs, just as layers compose into neural networks in PyTorch.
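The idea behind dspy.majority can be sketched in a few lines. The version below is a toy stand-in, not the library's implementation; it simply normalizes and tallies candidate completions:

```python
from collections import Counter

def majority(predictions):
    """Return the most common answer among candidate completions,
    a toy stand-in for the voting performed by dspy.majority."""
    counts = Counter(p.strip().lower() for p in predictions)
    answer, _ = counts.most_common(1)[0]
    return answer

print(majority(["Paris", "paris", "Lyon"]))  # paris
```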
A typical DSPy program defines a class that inherits from dspy.Module, declares its sub-modules in __init__, and implements a forward method:
```python
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought('context, question -> answer')

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```
Optimizers (formerly called Teleprompters, a portmanteau of "telescoping prompts") are algorithms that automatically tune a DSPy program's parameters to maximize a given evaluation metric. This is the compilation step that distinguishes DSPy from traditional prompt engineering. Given a program, a training dataset, and a metric function, an optimizer searches for the best combination of instructions, demonstrations, and/or model weights for each module in the pipeline.
DSPy provides a range of optimizers targeting different aspects of the program:
| Optimizer | What It Optimizes | How It Works |
|---|---|---|
| LabeledFewShot | Few-shot examples | Randomly selects k labeled examples from the training set as demonstrations. |
| BootstrapFewShot | Few-shot examples | Uses a teacher module to generate demonstrations for every program stage, validating each against the metric. |
| BootstrapFewShotWithRandomSearch | Few-shot examples | Runs BootstrapFewShot multiple times with random search, keeping the best-performing candidate program. |
| KNNFewShot | Few-shot examples | Uses k-nearest neighbors to select the most relevant training examples as demonstrations. |
| COPRO | Instructions | Generates and refines instructions for each module step using coordinate ascent (hill-climbing). |
| MIPROv2 | Instructions + examples | Jointly optimizes instructions and few-shot examples using Bayesian optimization. |
| SIMBA | Instructions + examples | Uses stochastic mini-batch sampling to find challenging examples and applies introspective analysis to generate improvement rules. |
| GEPA | Instructions | Uses LM reflection on program trajectories to identify weaknesses and propose improved prompts. Based on Genetic-Pareto evolution (Agrawal et al., 2025). |
| BootstrapFinetune | Model weights | Distills a prompt-based DSPy program into weight updates by fine-tuning the underlying LM. |
| BetterTogether | Instructions + weights | A meta-optimizer that combines prompt optimization and weight optimization in configurable sequences. |
The compilation (optimization) process in DSPy follows a systematic, metrics-driven approach that replaces manual prompt engineering. It requires three ingredients: a DSPy program, a set of training examples (often just a few tens or hundreds), and a metric function that scores program outputs.
The process varies by optimizer, but a representative workflow using MIPROv2 involves three stages:
1. Bootstrapping Stage: The optimizer runs the program across many different training inputs, collecting traces of intermediate input/output behavior at every module. It filters these traces, retaining only those that appear in trajectories scored highly by the metric. These filtered traces become candidate few-shot demonstrations.
2. Grounded Proposal Stage: MIPROv2 inspects the DSPy program's code structure, the training data characteristics, and the collected traces. Using this information, it drafts multiple candidate instructions for every prompt in the program. This step is both data-aware (grounded in actual examples) and demonstration-aware (informed by which traces worked well).
3. Discrete Search Stage: The optimizer samples mini-batches from the training set, proposes combinations of instructions and demonstrations for constructing every prompt in the pipeline, and evaluates each candidate program on the mini-batch. MIPROv2 uses Bayesian optimization to efficiently search this combinatorial space, balancing exploration of new configurations with exploitation of promising ones.
The final output is a "compiled" program where each module's parameters (instructions, demonstrations, or weights) have been tuned for the target task and metric. Importantly, this compiled program remains a standard DSPy program that can be further optimized, composed with other modules, or saved and loaded for deployment.
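The bootstrap-then-search procedure can be sketched in plain Python. The code below is a drastically simplified illustration, not DSPy's implementation: real optimizers such as MIPROv2 add instruction proposal, mini-batch evaluation, and Bayesian search on top of this basic loop.

```python
import random

def compile_program(program, trainset, metric, num_candidates=8, num_demos=2, seed=0):
    rng = random.Random(seed)

    # "Bootstrapping": keep training examples the zero-shot program already
    # answers correctly; these become candidate demonstrations.
    good = [ex for ex in trainset if metric(ex, program(ex, demos=[]))]

    # "Search": try random demonstration subsets, keep the best-scoring one.
    best_demos, best_score = [], -1
    for _ in range(num_candidates):
        demos = rng.sample(good, min(num_demos, len(good)))
        score = sum(metric(ex, program(ex, demos=demos)) for ex in trainset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score

# Toy stand-ins for an LM-backed module and a metric.
ANSWERS = {"2+2": "4", "3+3": "6"}

def toy_program(example, demos):
    return ANSWERS.get(example["question"], "unknown")

def exact_match(example, prediction):
    return prediction == example["answer"]

trainset = [
    {"question": "2+2", "answer": "4"},
    {"question": "3+3", "answer": "6"},
    {"question": "7*6", "answer": "42"},  # the toy program cannot answer this
]
demos, score = compile_program(toy_program, trainset, exact_match)
print(score)  # 2 of 3 training examples answered correctly
```

In real use the "program" would be a DSPy module whose behavior genuinely changes with its demonstrations, which is what makes the search worthwhile.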
This approach can produce dramatic quality improvements. The original DSPy paper demonstrated that compilation could raise the quality of simple programs from 33% to 82% accuracy on math word problems using GPT-3.5, and from 9% to 47% using Llama 2-13b-chat. On multi-hop question answering, improvements ranged from 25% to 65% across different models.
The following table summarizes how DSPy's programmatic approach compares to traditional prompt engineering:
| Aspect | Traditional Prompt Engineering | DSPy |
|---|---|---|
| Prompt creation | Manual, trial-and-error writing of prompt templates | Automatic generation and optimization of prompts via compilation |
| Abstraction level | Developer writes low-level prompt text | Developer declares high-level signatures and composes modules |
| Adaptability to new models | Prompts often need rewriting when switching LLMs | Recompile the same program for a different LM |
| Multi-step pipelines | Complex chain of string manipulations | Modular composition of typed modules |
| Optimization | Manual A/B testing and subjective evaluation | Metric-driven, automated search over instructions and demonstrations |
| Reproducibility | Difficult to version-control free-form prompts | Programs are standard Python code, easy to version and test |
| Portability | Tied to specific model behavior | Same program can target GPT-4, Claude, Llama, or fine-tuned models |
| Learning curve | Lower initial barrier | Requires understanding of optimization loops and metric design |
DSPy Assertions extend the framework with computational constraints that LLM outputs should satisfy. Introduced in a companion paper by Arnav Singhvi, Manish Shetty, Shangyin Tan, Christopher Potts, Koushik Sen, Matei Zaharia, and Omar Khattab (December 2023), assertions provide a structured way to enforce rules on LM behavior.
The framework defines two types of constraints:
dspy.Assert: Enforces strict constraints that must be met. When an assertion fails, the pipeline enters a retry state, allowing it to reattempt the failing LM call with awareness of previous attempts and the error message. If the assertion continues to fail after a maximum number of retries, an AssertionError is raised.
dspy.Suggest: Offers a softer approach. It maintains the same retry backtracking as dspy.Assert, but if the constraint still fails after the maximum number of attempts, it logs the persistent failure and continues execution rather than halting.
Assertions can be integrated both at compile time (via automatic prompt optimization) and at inference time (via automatic self-refinement and backtracking). According to the paper, LM Assertions improve constraint compliance by up to 164% and generate up to 37% more higher-quality responses compared to programs without assertions.
Example usage:
```python
dspy.Assert(
    len(prediction.answer.split()) < 50,
    "Answer must be fewer than 50 words."
)

dspy.Suggest(
    prediction.confidence > 0.8,
    "Try to provide a higher-confidence answer."
)
```
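The retry-and-backtrack semantics can be illustrated with a minimal pure-Python loop. This is a conceptual sketch, not DSPy's implementation: a hard constraint retries and finally raises, while a soft constraint retries, then logs the failure and continues.

```python
# Conceptual sketch of dspy.Assert vs. dspy.Suggest retry behavior.
def with_backtracking(generate, constraint, message, max_retries=3, hard=True):
    feedback = None
    for _ in range(max_retries + 1):
        result = generate(feedback)  # a retry sees the previous error message
        if constraint(result):
            return result
        feedback = message
    if hard:
        raise AssertionError(message)  # dspy.Assert-style failure
    print(f"Suggestion not met, continuing: {message}")  # dspy.Suggest-style
    return result

# A stand-in generator that "improves" after receiving feedback.
attempts = iter(["a very long answer " * 20, "short answer"])
result = with_backtracking(
    lambda feedback: next(attempts),
    constraint=lambda r: len(r.split()) < 50,
    message="Answer must be fewer than 50 words.",
)
print(result)  # short answer
```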
DSPy supports a wide range of language model providers through LiteLLM, which serves as its backend for model communication. Developers configure their LM with a simple API call using the provider/model naming convention.
OpenAI models:

```python
lm = dspy.LM('openai/gpt-4o-mini', api_key='your-key')
dspy.configure(lm=lm)
```

Anthropic models:

```python
lm = dspy.LM('anthropic/claude-sonnet-4-20250514', api_key='your-key')
dspy.configure(lm=lm)
```

Local models via Ollama:

```python
lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434')
dspy.configure(lm=lm)
```

Local models via SGLang or vLLM (OpenAI-compatible endpoints):

```python
lm = dspy.LM('openai/meta-llama/Llama-3-8B', api_base='http://localhost:30000/v1')
dspy.configure(lm=lm)
```
The dspy.configure() function sets the default LM globally, while dspy.context() allows temporarily switching the model within a specific block of code. This design makes it straightforward to experiment with different models or to use different models for different stages of a pipeline.
DSPy supports any provider that LiteLLM supports, including OpenAI, Anthropic, Google (Gemini), Mistral, Cohere, Azure OpenAI, AWS Bedrock, and many others. Authentication is handled through environment variables (such as OPENAI_API_KEY or ANTHROPIC_API_KEY) or by passing the key directly.
The foundational DSPy paper was authored by Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. The authors represented Stanford University, UC Berkeley, Carnegie Mellon University, Amazon Alexa AI, IIT Bombay, Microsoft, and Calera Capital.
The paper was first posted as an arXiv preprint (arXiv:2310.03714) in October 2023 and was accepted as a conference paper at the International Conference on Learning Representations (ICLR) 2024.
Key contributions of the paper include:

- The DSPy programming model, which abstracts LM pipelines as text-transformation graphs built from declarative modules specified by natural-language signatures.
- A compiler (the teleprompter, later optimizer, machinery) that automatically generates instructions and few-shot demonstrations to maximize a given metric.
- Case studies on math word problems and multi-hop question answering showing that compiled programs can outperform hand-written prompt pipelines, with the quality gains reported above.
A follow-up paper, "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs" (Opsahl-Ong, Ryan, Purtell, Broman, Potts, Zaharia, and Khattab), introduced the MIPRO optimizer and was published at EMNLP 2024 in Miami, Florida. MIPRO outperformed baseline optimizers on five of seven diverse multi-stage LM programs using Llama-3-8B, with gains of up to 13% accuracy.
DSPy has been adopted by a diverse set of organizations across industries. The following table highlights selected production deployments:
| Company | Use Case |
|---|---|
| JetBlue | Multiple chatbot applications and LLM pipeline optimization |
| Replit | Code diff synthesis for automated code repair |
| Databricks | LM Judges, RAG systems, classification, and customer solutions |
| Sephora | Agent-based applications |
| Zoro UK | Structured e-commerce product attribute normalization across millions of items |
| VMware | RAG and prompt optimization |
| Haize Labs | Automated red-teaming to test LLM vulnerabilities |
| Moody's | RAG optimization, LLM-as-a-Judge, and financial agentic systems |
| PingCAP | Knowledge graph construction and querying |
| Truelaw | Custom LLM pipelines for law firm workflows |
| RadiantLogic | AI Data Assistant with routing, extraction, text-to-SQL, and summarization |
| Infinitus | Healthcare AI agent development and optimization |
The DSPy community includes over 200 contributors to the open-source project. The framework has also seen adoption in academic research, with the original paper and related work receiving extensive citations since publication.
Beyond enterprise use, DSPy has found traction in research laboratories and educational settings. Its declarative approach makes it a natural fit for rapid prototyping of NLP pipelines in academic contexts, where researchers can quickly test different module compositions and optimization strategies without investing weeks in manual prompt tuning. Several university courses on LLM application development have incorporated DSPy into their curricula as an example of the programmatic paradigm.
DSPy has undergone rapid development since its initial release:
| Version | Release Period | Key Changes |
|---|---|---|
| DSP (predecessor) | February 2022 | Demonstrate-Search-Predict framework for composing retrieval and language models |
| DSPy 1.0 | October 2023 | Initial release with Signatures, Modules, and Teleprompters |
| DSPy 2.0 | 2024 | Expanded optimizer support, improved documentation, renamed Teleprompters to Optimizers |
| DSPy 2.5 | Late 2024 | Enhanced multi-module optimization, advanced telemetry and tracing, expanded model support (GPT-4, Claude, Gemini) |
| DSPy 3.0 | August 2025 | Multi-modal support (images and audio), thread-safe settings, native async support, native MLflow integration, improved prompt optimization and fine-tuning/RL capabilities |
| DSPy 3.0.4 | November 2025 | Stability improvements and bug fixes |
| DSPy 3.1.0b1 | January 2026 | Pre-release with additional features under testing |
DSPy 3.0, released in August 2025, represents the most significant update since the framework's inception. Key features include:
- Multi-modal support, including image and audio input types.
- Thread-safe settings and native async support, plus a batch function on DSPy modules for thread-safe parallel execution.
- Native MLflow integration.
- Improved prompt optimization and fine-tuning/RL capabilities.

Omar Khattab, the creator of DSPy, joined Databricks as a Research Scientist in 2024 before joining the MIT faculty in July 2025. His move to Databricks led to increased investment in the open-source DSPy community, with DSPy 3.0 being presented at the Data + AI Summit 2025. The core development team includes contributors from Databricks, Zenbase, Weaviate, and various academic institutions, with mentorship from researchers at Stanford, UC Berkeley, Carnegie Mellon University, and Two Sigma.
DSPy occupies a distinct position in the LLM framework landscape. While frameworks like LangChain focus on orchestration and integration with external tools, databases, and APIs, DSPy focuses specifically on programmatic prompt optimization. The two frameworks are not mutually exclusive and can be used together.
| Feature | DSPy | LangChain |
|---|---|---|
| Primary focus | Prompt and pipeline optimization | Orchestration and tool integration |
| Prompt handling | Automatically optimized via compilation | Manually engineered templates |
| Integration ecosystem | Focused (LLM providers via LiteLLM) | Broad (vector databases, APIs, tools, observability platforms) |
| Optimization | Built-in metric-driven optimizers | Requires external tooling or manual tuning |
| Learning curve | Steeper (requires understanding of optimization, metrics) | Gentler (familiar patterns for web developers) |
| Best suited for | Complex multi-stage reasoning pipelines where performance metrics matter | Applications requiring extensive integration with diverse data sources and APIs |
A typical DSPy development workflow proceeds as follows:

1. Define the task as one or more signatures and compose modules into a program.
2. Collect a small set of training examples and write a metric function that scores program outputs.
3. Evaluate the uncompiled program to establish a baseline.
4. Compile the program with an optimizer suited to the task and budget (for example, BootstrapFewShot or MIPROv2).
5. Inspect the results, refine the program or metric, and recompile as needed.
This workflow mirrors conventional machine learning development practices, where training data and evaluation metrics drive improvement, rather than the ad hoc experimentation typical of prompt engineering.
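The metric in this workflow is an ordinary Python function; by DSPy convention it takes a gold example, a prediction, and an optional trace. The minimal exact-match sketch below uses plain namespaces as stand-ins for DSPy's Example and Prediction objects:

```python
from types import SimpleNamespace

def exact_match_metric(example, pred, trace=None):
    """Score 1.0 when the predicted answer matches the gold answer
    (case-insensitive); optimizers maximize the average of this score."""
    return float(example.answer.strip().lower() == pred.answer.strip().lower())

# Stand-ins for dspy.Example and a module's Prediction output.
gold = SimpleNamespace(question="Capital of France?", answer="Paris")
pred = SimpleNamespace(answer="paris")
print(exact_match_metric(gold, pred))  # 1.0
```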
While DSPy offers significant advantages for building LLM-powered systems, several considerations apply:

- Optimization requires labeled examples and a well-designed metric; a poorly chosen metric can steer compilation toward the wrong behavior.
- Compilation issues many LM calls, so optimization runs incur nontrivial cost and latency.
- The learning curve is steeper than template-based prompting, since developers must understand optimization loops and evaluation design.
- Compiled prompts are generated rather than hand-written, which can make individual prompts harder to inspect and audit.