# DSPy

> Source: https://aiwiki.ai/wiki/dspy
> Updated: 2026-06-22
> Categories: Developer Tools, Large Language Models, Machine Learning, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**DSPy** (short for **Declarative Self-improving Python**) is an open-source framework, developed at [Stanford NLP](/wiki/stanford_nlp), for programming rather than prompting [large language models](/wiki/large_language_model) (LLMs). Instead of hand-writing brittle prompt strings, developers express an LLM pipeline as composable Python code built from three abstractions, signatures, modules, and optimizers, and DSPy then "compiles" that program by automatically generating and tuning the underlying prompts, few-shot demonstrations, and (optionally) model weights to maximize a chosen metric. The framework's own documentation summarizes the idea as "DSPy is the framework for programming, rather than prompting, language models," with the tagline "Program, don't prompt, your LLMs."[6][7] DSPy was created by Omar Khattab and collaborators and first released in October 2023, and its foundational paper was published at [ICLR](/wiki/iclr) 2024.[1]

DSPy is the most widely adopted framework for the program-not-prompt paradigm and a direct alternative to template-driven [prompt engineering](/wiki/prompt_engineering) and orchestration libraries such as [LangChain](/wiki/langchain). As of June 2026 the official `stanfordnlp/dspy` repository on GitHub reports roughly 35,300 stars and about 3,000 forks,[7] and the project's own site reports more than 433 contributors and over 6.4 million monthly downloads.[6]

## Why was DSPy created?

Traditional approaches to building LLM-powered applications depend heavily on manually crafted prompts and prompt templates. Developers typically invest significant effort in trial-and-error experimentation to discover prompt phrasing that produces acceptable results. This process is fragile: when the underlying model changes, the task requirements shift, or the pipeline grows more complex, the hand-tuned prompts often break down and must be reworked from scratch.

DSPy grew out of the **Demonstrate-Search-Predict (DSP)** framework, whose research began at Stanford NLP in February 2022. The original DSP paper (Khattab et al., December 2022) introduced the idea of composing retrieval and language models for knowledge-intensive [natural language processing](/wiki/natural_language_processing) tasks by passing natural language texts through sophisticated pipelines between a language model and a retrieval model.[2] DSP programs for open-domain, multi-hop, and conversational question answering delivered 37-120% relative gains against vanilla LM baselines.[2]

DSPy generalized and extended DSP's core insight: rather than manually specifying how an LLM should be prompted, developers should declare *what* a text transformation needs to accomplish and let the system figure out *how* to prompt the model effectively. This shift from imperative prompt engineering to declarative programming is the central contribution of DSPy.[1]

The analogy to traditional software compilation is intentional. Just as a compiler translates high-level source code into optimized machine instructions, DSPy's optimizers translate high-level signatures and module compositions into optimized prompts, demonstrations, and weight updates that a language model can execute effectively. The developer works at the level of task specification, while the framework handles the low-level details of prompt construction.

## What are signatures, modules, and optimizers?

DSPy introduces three primary abstractions that work together to define and optimize LLM programs: **Signatures**, **Modules**, and **Optimizers** (originally called Teleprompters).

### Signatures

A **Signature** in DSPy is a natural-language typed function declaration.[1] It specifies the input and output fields of a text transformation, describing *what* the transformation should achieve without dictating *how* the LLM should be prompted to accomplish it.

Signatures can be expressed in two ways:

**Inline (shorthand) notation:**

```python
classify = dspy.Predict('sentence -> sentiment: bool')
qa = dspy.ChainOfThought('question -> answer')
summarize = dspy.Predict('document -> summary')
```

**Class-based notation** (for more complex specifications):

```python
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="often between 1 and 5 words")
```

Signatures serve as the contract between the developer and DSPy. The framework handles translating these high-level specifications into effective prompts, [few-shot](/wiki/few-shot_learning) demonstrations, or [fine-tuning](/wiki/fine_tuning) data for the underlying language model. DSPy automatically validates that values passed to input fields match the types specified in the signature.[6]

The class-based format also supports docstrings that describe the task in natural language, typed fields with descriptions and constraints, and default values. Starting with DSPy 3.0, signatures can include multi-modal types such as `dspy.Image` for image inputs, enabling vision-language pipelines to benefit from the same optimization infrastructure as text-only programs.[11]

### Modules

A **Module** in DSPy is a parameterized component that implements a specific prompting or reasoning strategy. Modules take a Signature and apply a particular technique to fulfill it. Each module has learnable parameters, including the instructions the LLM follows, stored demonstrations of input/output behavior, and (optionally) LM weights.[1]

DSPy provides several built-in modules:

| Module | Description |
|---|---|
| `dspy.Predict` | The basic predictor. Does not modify the signature. Handles the key forms of learning (demonstrations, instructions).[6] |
| `dspy.ChainOfThought` | Teaches the LM to think step-by-step before committing to the response. Automatically injects a reasoning field before the output fields.[6] |
| `dspy.ProgramOfThought` | Teaches the LM to output executable code, whose execution results dictate the final response. Useful for mathematical or algorithmic tasks.[6] |
| `dspy.ReAct` | An agent that can use tools to implement the given signature. Based on the Reasoning and Acting paradigm (Yao et al., 2022). |
| `dspy.MultiChainComparison` | Generates multiple [chain-of-thought](/wiki/chain_of_thought) outputs and compares them to produce a final, higher-quality prediction. |
| `dspy.majority` | A utility that performs voting across multiple predictions and returns the most popular response. |
| `dspy.CodeAct` | Introduced in DSPy 3.0, combines elements of ReAct and Program-of-Thought by generating Python code that calls predefined tools, executing it in a sandboxed interpreter, and iterating up to a configurable number of steps until a final answer is reached.[^codeact] |
| `dspy.Refine` | Introduced in DSPy 3.0, runs a wrapped module up to *N* times at `temperature=1.0` with different rollout IDs, scores each prediction with a reward function, and returns the first attempt that crosses a threshold or otherwise the highest-scoring result. Useful for enforcing output constraints.[^refine] |
| `dspy.Reasoning` | Added in DSPy 3.1.0 (January 2026) to capture native reasoning traces from dedicated reasoning models, exposing the underlying chain of thought separately from the final answer.[^v310] |

All built-in modules are constructed using `dspy.Predict` internally.[6] Modules can be composed into larger programs, just as layers compose into neural networks in [PyTorch](/wiki/pytorch).

A typical DSPy program defines a class that inherits from `dspy.Module`, declares its sub-modules in `__init__`, and implements a `forward` method:

```python
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought('context, question -> answer')

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
```

### Optimizers (Teleprompters)

**Optimizers** (formerly called **Teleprompters**, a portmanteau of "telescoping prompts") are algorithms that automatically tune a DSPy program's parameters to maximize a given evaluation metric.[1] This is the compilation step that distinguishes DSPy from traditional prompt engineering. Given a program, a training dataset, and a metric function, an optimizer searches for the best combination of instructions, demonstrations, and/or model weights for each module in the pipeline.

DSPy provides a range of optimizers targeting different aspects of the program:

| Optimizer | What It Optimizes | How It Works |
|---|---|---|
| `LabeledFewShot` | Few-shot examples | Randomly selects *k* labeled examples from the training set as demonstrations. |
| `BootstrapFewShot` | Few-shot examples | Uses a teacher module to generate demonstrations for every program stage, validating each against the metric.[6] |
| `BootstrapFewShotWithRandomSearch` | Few-shot examples | Runs BootstrapFewShot multiple times with random search, keeping the best-performing candidate program. |
| `KNNFewShot` | Few-shot examples | Uses k-nearest neighbors to select the most relevant training examples as demonstrations. |
| `COPRO` | Instructions | Generates and refines instructions for each module step using coordinate ascent (hill-climbing).[6] |
| `MIPROv2` | Instructions + examples | Jointly optimizes instructions and few-shot examples using [Bayesian optimization](/wiki/bayesian_optimization).[4] |
| `SIMBA` | Instructions + examples | Uses stochastic mini-batch sampling to find challenging examples and applies introspective analysis to generate improvement rules. |
| `GEPA` | Instructions | Uses LM reflection on program trajectories to identify weaknesses and propose improved prompts. Based on Genetic-Pareto evolution (Agrawal et al., 2025).[5] |
| `BootstrapFinetune` | Model weights | Distills a prompt-based DSPy program into weight updates by fine-tuning the underlying LM. |
| `BetterTogether` | Instructions + weights | A meta-optimizer that combines prompt optimization and weight optimization in configurable sequences.[20] From DSPy 3.2.0 onwards it accepts arbitrary optimizers as keyword arguments and chains them through strategy strings, allowing sequences such as prompt optimization, then fine-tuning, then prompt optimization again.[^v320] |
| `GRPO` | Model weights via [reinforcement learning](/wiki/reinforcement_learning) | Runs multi-module Group Relative Policy Optimization on a DSPy program through the companion Arbor library. Ships in DSPy 3.0 as `dspy.GRPO` and is the framework's first native online RL optimizer.[^grpo][^arbor] |

The GEPA method, which shipped with DSPy 3.0, is described in a paper accepted as an oral presentation at ICLR 2026. Across six tasks, the authors report that reflective prompt evolution outperforms the [GRPO](/wiki/grpo) reinforcement learning algorithm by roughly 6 points on average while using up to 35x fewer rollouts, and that it beats MIPROv2 by more than 10 points (for example, a 12-point gain on AIME 2025).[18] GEPA is also maintained as a standalone open-source library, which `dspy.GEPA` consumes as a dependency.[19]

## How does DSPy compilation work?

The compilation (optimization) process in DSPy follows a systematic, metrics-driven approach that replaces manual prompt engineering. It requires three ingredients: a DSPy program, a set of training examples (often just a few tens or hundreds), and a metric function that scores program outputs.

The process varies by optimizer, but a representative workflow using MIPROv2 involves three stages:[4]

**1. Bootstrapping Stage:** The optimizer runs the program across many different training inputs, collecting traces of intermediate input/output behavior at every module. It filters these traces, retaining only those that appear in trajectories scored highly by the metric. These filtered traces become candidate few-shot demonstrations.

**2. Grounded Proposal Stage:** MIPROv2 inspects the DSPy program's code structure, the training data characteristics, and the collected traces. Using this information, it drafts multiple candidate instructions for every prompt in the program. This step is both data-aware (grounded in actual examples) and demonstration-aware (informed by which traces worked well).

**3. Discrete Search Stage:** The optimizer samples mini-batches from the training set, proposes combinations of instructions and demonstrations for constructing every prompt in the pipeline, and evaluates each candidate program on the mini-batch. MIPROv2 uses Bayesian optimization to efficiently search this combinatorial space, balancing exploration of new configurations with exploitation of promising ones.[4]

The final output is a "compiled" program where each module's parameters (instructions, demonstrations, or weights) have been tuned for the target task and metric. Importantly, this compiled program remains a standard DSPy program that can be further optimized, composed with other modules, or saved and loaded for deployment.

This approach can produce dramatic quality improvements. The original DSPy paper demonstrated that compilation could raise the quality of simple programs from 33% to 82% accuracy on math word problems using [GPT-3.5](/wiki/gpt-3), and from 9% to 47% using [Llama](/wiki/llama) 2-13b-chat.[1] On multi-hop question answering, improvements ranged from 25% to 65% across different models.[1]

## How is DSPy different from prompt engineering?

The following table summarizes how DSPy's programmatic approach compares to traditional prompt engineering:

| Aspect | Traditional Prompt Engineering | DSPy |
|---|---|---|
| **Prompt creation** | Manual, trial-and-error writing of prompt templates | Automatic generation and optimization of prompts via compilation |
| **Abstraction level** | Developer writes low-level prompt text | Developer declares high-level signatures and composes modules |
| **Adaptability to new models** | Prompts often need rewriting when switching LLMs | Recompile the same program for a different LM |
| **Multi-step pipelines** | Complex chain of string manipulations | Modular composition of typed modules |
| **Optimization** | Manual A/B testing and subjective evaluation | Metric-driven, automated search over instructions and demonstrations |
| **Reproducibility** | Difficult to version-control free-form prompts | Programs are standard Python code, easy to version and test |
| **Portability** | Tied to specific model behavior | Same program can target GPT-4, Claude, Llama, or fine-tuned models |
| **Learning curve** | Lower initial barrier | Requires understanding of optimization loops and metric design |

## DSPy Assertions

DSPy Assertions extend the framework with **computational constraints** that LLM outputs should satisfy. Introduced in a companion paper by Arnav Singhvi, Manish Shetty, Shangyin Tan, Christopher Potts, Koushik Sen, Matei Zaharia, and Omar Khattab (December 2023), assertions provide a structured way to enforce rules on LM behavior.[3]

The framework defines two types of constraints:

- **`dspy.Assert`**: Enforces strict constraints that must be met. When an assertion fails, the pipeline enters a retry state, allowing it to reattempt the failing LM call with awareness of previous attempts and the error message. If the assertion continues to fail after a maximum number of retries, an `AssertionError` is raised.

- **`dspy.Suggest`**: Offers a softer approach. It maintains the same retry backtracking as `dspy.Assert`, but if the constraint still fails after the maximum number of attempts, it logs the persistent failure and continues execution rather than halting.

Assertions can be integrated both at compile time (via automatic prompt optimization) and at inference time (via automatic self-refinement and backtracking). According to the paper, LM Assertions improve constraint compliance by up to 164% and generate up to 37% more higher-quality responses compared to programs without assertions.[3]

Example usage:

```python
dspy.Assert(
    len(prediction.answer.split()) < 50,
    "Answer must be fewer than 50 words."
)

dspy.Suggest(
    prediction.confidence > 0.8,
    "Try to provide a higher-confidence answer."
)
```

## Which language models does DSPy support?

DSPy supports a wide range of language model providers through [LiteLLM](/wiki/litellm), which serves as its backend for model communication.[6] Developers configure their LM with a simple API call using the `provider/model` naming convention.

**OpenAI models:**

```python
lm = dspy.LM('openai/gpt-4o-mini', api_key='your-key')
dspy.configure(lm=lm)
```

**[Anthropic](/wiki/anthropic) models:**

```python
lm = dspy.LM('anthropic/claude-sonnet-4-20250514', api_key='your-key')
dspy.configure(lm=lm)
```

**Local models via [Ollama](/wiki/ollama):**

```python
lm = dspy.LM('ollama_chat/llama3.2', api_base='http://localhost:11434')
dspy.configure(lm=lm)
```

**Local models via [SGLang](/wiki/sglang) or [vLLM](/wiki/vllm) (OpenAI-compatible endpoints):**

```python
lm = dspy.LM('openai/meta-llama/Llama-3-8B', api_base='http://localhost:30000/v1')
dspy.configure(lm=lm)
```

The `dspy.configure()` function sets the default LM globally, while `dspy.context()` allows temporarily switching the model within a specific block of code. This design makes it straightforward to experiment with different models or to use different models for different stages of a pipeline.

DSPy supports any provider that LiteLLM supports, including [OpenAI](/wiki/openai), Anthropic, [Google](/wiki/google) ([Gemini](/wiki/gemini)), [Mistral](/wiki/mistral_ai), [Cohere](/wiki/cohere), Azure OpenAI, AWS Bedrock, and many others.[6] Authentication is handled through environment variables (such as `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`) or by passing the key directly.

## What does the foundational DSPy paper say?

The foundational DSPy paper was authored by Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts.[1] The authors represented Stanford University, UC Berkeley, Carnegie Mellon University, Amazon Alexa AI, IIT Bombay, Microsoft, and Calera Capital.[1]

The paper was first posted as an arXiv preprint (arXiv:2310.03714) in October 2023 and was accepted as a conference paper at the International Conference on Learning Representations (ICLR) 2024 under the title "DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines."[1] The paper was designated a spotlight presentation at the conference.[1]

The abstract frames the core idea in compiler terms, describing DSPy as a programming model that abstracts "LM pipelines as text transformation graphs, i.e. imperative computational graphs where LMs are invoked through declarative modules."[1] In its case studies the authors report that, within minutes of compiling, DSPy programs "outperform standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively)" for GPT-3.5 and llama2-13b-chat.[1]

Key contributions of the paper include:

1. The **Signature** abstraction for declaratively specifying input/output behavior of LM modules.
2. A set of **parameterized modules** (Predict, ChainOfThought, ReAct, etc.) that replace hand-crafted prompt templates.
3. **Teleprompter algorithms** that automatically compile programs by generating demonstrations and optimizing instructions.
4. Two case studies demonstrating that compact DSPy programs can express and optimize pipelines for math word problems, multi-hop retrieval, complex question answering, and agent control loops.
5. Evidence that DSPy programs compiled for relatively small models (770M-parameter [T5](/wiki/t5) and Llama 2-13b-chat) can be competitive with approaches relying on larger, proprietary LMs such as [GPT-3](/wiki/gpt-3).5.[1]

A follow-up paper, "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs" (Opsahl-Ong, Ryan, Purtell, Broman, Potts, Zaharia, and Khattab), introduced the MIPRO optimizer and was published at [EMNLP](/wiki/emnlp) 2024 in Miami, Florida.[4] MIPRO outperformed baseline optimizers on five of seven diverse multi-stage LM programs using Llama-3-8B, with gains of up to 13% accuracy.[4]

Two further studies extended this line of optimizer research. "Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together" (Soylu, Potts, and Khattab, EMNLP 2024), the study behind the `BetterTogether` optimizer, reported that alternating between optimizing prompts and fine-tuning weights outperformed optimizing weights alone by up to 60% and prompts alone by up to 6% on average, across mistral-7b, llama-2-7b, and llama-3-8b on multi-hop question answering, mathematical reasoning, and classification tasks.[20] In February 2025, researchers from the DSPy ecosystem released LangProBe, the first large-scale benchmark for evaluating language-program architectures and optimization strategies, covering over 2,000 combinations of tasks, architectures, optimizers, and language models; the study found that optimized language programs offer strong cost-quality Pareto improvements over raw calls to the underlying models, while architecture decisions still benefit from human judgment.[21]

## Who uses DSPy in production?

DSPy has been adopted by a diverse set of organizations across industries.[9] The following table highlights selected production deployments:

| Company | Use Case |
|---|---|
| JetBlue | Multiple chatbot applications and LLM pipeline optimization |
| Replit | Code diff synthesis for automated code repair |
| Databricks | LM Judges, [RAG](/wiki/retrieval_augmented_generation) systems, classification, and customer solutions |
| Sephora | Agent-based applications |
| Zoro UK | Structured e-commerce product attribute normalization across millions of items |
| VMware | RAG and prompt optimization |
| Haize Labs | Automated red-teaming to test LLM vulnerabilities |
| Moody's | RAG optimization, LLM-as-a-Judge, and financial agentic systems |
| PingCAP | Knowledge graph construction and querying |
| Truelaw | Custom LLM pipelines for law firm workflows |
| RadiantLogic | AI Data Assistant with routing, extraction, [text-to-SQL](/wiki/text_to_sql), and summarization |
| Infinitus | Healthcare AI agent development and optimization |
| Shopify | Structured product metadata extraction at scale, with a reported cost reduction of roughly 550x[9] |
| Dropbox | Optimized LLM relevance judge for search ranking[9] |
| [AWS](/wiki/amazon_web_services) | Model migration while maintaining performance[9] |

The framework has also seen heavy adoption in academic research, with the original paper and related work receiving extensive citations since publication. As of June 2026, the project's official site reports more than 433 contributors and over 6.4 million monthly downloads,[6] the GitHub repository shows roughly 35,300 stars and about 3,000 forks,[7] and independent PyPI statistics recorded several million downloads of the `dspy` package in the preceding month.[22]

Beyond enterprise use, DSPy has found traction in research laboratories and educational settings. Its declarative approach makes it a natural fit for rapid prototyping of NLP pipelines in academic contexts, where researchers can quickly test different module compositions and optimization strategies without investing weeks in manual prompt tuning. Several university courses on LLM application development have incorporated DSPy into their curricula as an example of the programmatic paradigm. A prominent research example is STORM, an LLM-powered knowledge curation system from Stanford that researches a topic and drafts Wikipedia-style articles with citations; both STORM and its collaborative extension, Co-STORM, are implemented in a modular way using DSPy.[23]

## Version History and Recent Developments

DSPy has undergone rapid development since its initial release:

| Version | Release Date | Key Changes |
|---|---|---|
| DSP (predecessor) | February 2022 | Demonstrate-Search-Predict framework for composing retrieval and language models |
| DSPy 1.0 | October 2023 | Initial release with Signatures, Modules, and Teleprompters |
| DSPy 2.0 | 2024 | Expanded optimizer support, improved documentation, renamed Teleprompters to Optimizers |
| DSPy 2.5 | Late 2024 | Enhanced multi-module optimization, advanced telemetry and tracing, expanded model support (GPT-4, Claude, Gemini) |
| DSPy 3.0.0 | August 12, 2025 | Multi-modal I/O (`dspy.Image`, `dspy.Audio`), thread-safe settings, native async support, native [MLflow](/wiki/mlflow) 3.0 integration, new optimizers `GEPA`, `SIMBA`, and `GRPO`, new modules `dspy.CodeAct` and `dspy.Refine`, improved ReAct and `PythonInterpreter`, built-in adapters (`ChatAdapter`, `JSONAdapter`, `XMLAdapter`, `BAMLAdapter`). Python 3.9 support dropped.[^v300][^v300notes] |
| DSPy 3.0.1-3.0.3 | August 2025 | Bug fixes and minor enhancements following the 3.0 launch.[^pypi] |
| DSPy 3.0.4 | November 10, 2025 | Stability improvements and bug fixes.[^pypi] |
| DSPy 3.1.0 | January 6, 2026 | Adds `dspy.Reasoning` for capturing native reasoning traces, new `dspy.File` type, Python 3.14 support, pickle-loading safety guards, GEPA dependency update, MIPROv2 fixes, fixes to Anthropic model handling, XMLAdapter formatting, and OpenAI Responses API conversion.[^v310] |
| DSPy 3.1.1-3.1.3 | January-February 2026 | Recursive Language Model (RLM) improvements[24], parallel `timeout`/`straggler_limit` parameters, code fence parsing fixes, GEPA dependency bumped to 0.0.26.[^pypi] |
| DSPy 3.2.0 | April 21, 2026 | `BetterTogether` made compatible with arbitrary chained optimizers, beginning of decoupling DSPy from [LiteLLM](/wiki/litellm) so custom LM backends can be plugged in, input-field type-validation warnings, optuna made an optional dependency.[^v320] |
| DSPy 3.2.1 | May 5, 2026 | Removed the upper bound on the LiteLLM dependency, fixed custom-header propagation in async streaming LM calls, fixed `dspy.Embedder` so per-call `caching=False` is honored for both sync and async embedding calls.[^v321] |
| DSPy 3.3.0b1 (beta) | May 28, 2026 | Beta pre-release: experimental `ReActV2` module built on native tool calling and `dspy.History`, start of a typed provider-neutral LM interface (`LMRequest`/`LMResponse`), GEPA dependency updated to 0.1.1, and fewer base dependencies (NumPy removed from the base install).[25] |

### DSPy 3.0 Highlights

DSPy 3.0, released in August 2025, represents the most significant update since the framework's inception.[11] Key features include:

- **Multi-modal support:** Native handling of images and audio inputs, with support for composite types and [Pydantic](/wiki/pydantic) models for structured outputs.
- **Production readiness:** Thread-safe settings and native async support for high-concurrency applications.
- **MLflow integration:** Native integration with [MLflow](/wiki/mlflow) for experiment tracking, model management, and observability.
- **Enhanced optimization:** Improved prompt optimization algorithms and expanded fine-tuning and [reinforcement learning](/wiki/reinforcement_learning) capabilities.
- **Batch processing:** A `batch` function on DSPy modules for thread-safe parallel execution.
- **Streaming and history:** Enhanced streaming support and execution history for better debugging and observability.

### Reinforcement Learning via Arbor and `dspy.GRPO`

One of the headline additions in DSPy 3.0 is native [reinforcement learning](/wiki/reinforcement_learning) of multi-module language model programs through `dspy.GRPO`. The optimizer is backed by **Arbor**, an open-source RL framework for compound AI systems written by Noah Ziems that builds on the Will Brown Verifiers library and [Hugging Face](/wiki/hugging_face)'s TRL.[^arbor] A developer starts a local Arbor server with `arbor.init()`, attaches it to a DSPy program through an `ArborProvider`, defines a reward function, and calls `ArborGRPO.compile()` with configurable batch size, learning rate, and [LoRA](/wiki/lora) settings to produce an RL-optimized version of the original program.[^arbor]

The underlying algorithm, **Multi-module GRPO**, generalizes Group Relative Policy Optimization (popularized by [DeepSeek](/wiki/deepseek) for single-prompt LM training) to sophisticated, modular LM programs in which different modules may use different prompts. The companion paper by Ziems, Soylu, Agrawal, Miller, Lai, Qian, Song, Jiang, Klein, Zaharia, D'Oosterlinck, Potts, and Khattab was first posted on arXiv in August 2025 (arXiv:2508.04660) and accepted at ACM CAIS 2026. The authors report that GRPO composes well with automatic prompt optimization, yielding an average 11% accuracy improvement across multiple task types and 5% gains over prompt optimization alone.[^grpo]

### Recursive Language Models and the DSPy 3.3 Beta

DSPy 3.1.1 (January 2026) introduced `dspy.RLM`, a module implementing **Recursive Language Models** (RLMs), an inference strategy proposed by Alex L. Zhang, Tim Kraska, and Omar Khattab at [MIT](/wiki/mit).[24][26] Instead of placing an entire long input into the prompt, an RLM treats the prompt as part of an external environment: the model inspects metadata about the stored context, writes Python code in a sandboxed REPL to examine and decompose it, and can recursively invoke sub-LM calls over snippets before submitting a final answer.[24] The authors report that RLMs can process inputs up to two orders of magnitude beyond a model's [context window](/wiki/context_window) and outperform base models and common long-context scaffolds on four long-context tasks at comparable or lower cost per query.[26]

On May 28, 2026, the project published DSPy 3.3.0b1, a beta release previewing the 3.3 line.[25] It adds an experimental `dspy.ReActV2` module rebuilt around native [tool calling](/wiki/tool_use): each turn is stored in `dspy.History` as structured user, assistant, and tool messages rather than a single growing trajectory string, which enables parallel tool calls, multi-turn replay of prior tool interactions, and better reuse of provider prompt caches; the maintainers reported cost reductions of up to 50% on some tasks in internal testing.[25] The release also begins migrating DSPy to a typed, provider-neutral language model interface in which custom backends implement a single `dspy.LMRequest` to `dspy.LMResponse` contract, with LiteLLM planned to become an optional compatibility fallback rather than a required core dependency in later releases.[25]

### Leadership and Governance

Omar Khattab, the creator of DSPy, joined [Databricks](/wiki/databricks) as a Research Scientist in 2024 before starting as MIT faculty in July 2025.[27] His move to Databricks led to increased investment in the open-source DSPy community[27], with DSPy 3.0 being presented at the Data + AI Summit 2025. The core development team includes contributors from Databricks, Zenbase, [Weaviate](/wiki/weaviate), and various academic institutions, with mentorship from researchers at Stanford, UC Berkeley, Carnegie Mellon University, and Two Sigma.[8] As of 2026, Khattab is an assistant professor in the EECS department at MIT and a member of CSAIL.[28]

## How does DSPy compare to LangChain?

DSPy occupies a distinct position in the LLM framework landscape. While frameworks like [LangChain](/wiki/langchain) focus on orchestration and integration with external tools, databases, and APIs, DSPy focuses specifically on programmatic prompt optimization. The two frameworks are not mutually exclusive and can be used together.

| Feature | DSPy | LangChain |
|---|---|---|
| **Primary focus** | Prompt and pipeline optimization | Orchestration and tool integration |
| **Prompt handling** | Automatically optimized via compilation | Manually engineered templates |
| **Integration ecosystem** | Focused (LLM providers via LiteLLM) | Broad (vector databases, APIs, tools, observability platforms) |
| **Optimization** | Built-in metric-driven optimizers | Requires external tooling or manual tuning |
| **Learning curve** | Steeper (requires understanding of optimization, metrics) | Gentler (familiar patterns for web developers) |
| **Best suited for** | Complex multi-stage reasoning pipelines where performance metrics matter | Applications requiring extensive integration with diverse data sources and APIs |

## Practical Workflow

A typical DSPy development workflow proceeds as follows:

1. **Define signatures** for each task the program needs to perform.
2. **Select and compose modules** (Predict, ChainOfThought, ReAct, etc.) to build the pipeline.
3. **Prepare a small training set** of representative examples (often 20-200 examples suffice).
4. **Define a metric function** that scores program outputs on the desired criteria.
5. **Choose an optimizer** and compile the program.
6. **Evaluate** the compiled program on a held-out test set.
7. **Iterate** by adjusting signatures, modules, training data, or optimizer choices.

This workflow mirrors conventional [machine learning](/wiki/machine_learning) development practices, where training data and evaluation metrics drive improvement, rather than the ad hoc experimentation typical of prompt engineering.

## What are the limitations of DSPy?

While DSPy offers significant advantages for building LLM-powered systems, several considerations apply:

- **Data requirements:** Optimizers need training examples with expected outputs. In scenarios where labeled data is unavailable or difficult to produce, the compilation step has less material to work with.
- **Metric design:** The quality of optimization depends on the metric function. Poorly designed metrics can lead to degenerate solutions where the program optimizes for the metric without genuinely improving output quality.
- **Compilation cost:** Running an optimizer involves many LLM calls across the training set, which incurs API costs and takes time. More powerful optimizers like MIPROv2 require more trials and therefore more compute.
- **Debugging complexity:** When a compiled program produces unexpected results, diagnosing whether the issue lies in the signature, the module choice, the optimizer configuration, or the training data can be challenging.
- **Rapidly evolving API:** As a fast-moving open-source project, DSPy's API has changed between major versions, requiring developers to update their code when upgrading.
- **Black-box optimization:** Because optimizers work by searching over prompt and demonstration spaces, the resulting compiled programs can be difficult to interpret. Understanding *why* a particular set of demonstrations was selected or *why* a specific instruction phrasing works well is not always straightforward.
- **Metric sensitivity:** Small changes in the metric function or training data distribution can lead to noticeably different compiled programs, which makes reproducibility across runs a consideration for production deployments.

## See Also

- [Prompt Engineering](/wiki/prompt_engineering)
- [LangChain](/wiki/langchain)
- [Retrieval-Augmented Generation](/wiki/retrieval_augmented_generation)
- [Chain-of-Thought Prompting](/wiki/chain_of_thought)
- [Fine-Tuning](/wiki/fine_tuning)
- [Few-Shot Learning](/wiki/few-shot_learning)

## References

1. Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., Zaharia, M., & Potts, C. (2024). "DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines." *Proceedings of the International Conference on Learning Representations (ICLR) 2024*. [https://openreview.net/forum?id=sY5N0zY5Od](https://openreview.net/forum?id=sY5N0zY5Od)

2. Khattab, O., Santhanam, K., Li, X.L., Hall, D., Liang, P., Potts, C., & Zaharia, M. (2022). "Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP." *arXiv preprint arXiv:2212.14024*. [https://arxiv.org/abs/2212.14024](https://arxiv.org/abs/2212.14024)

3. Singhvi, A., Shetty, M., Tan, S., Potts, C., Sen, K., Zaharia, M., & Khattab, O. (2024). "DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines." *arXiv preprint arXiv:2312.13382*. [https://arxiv.org/abs/2312.13382](https://arxiv.org/abs/2312.13382)

4. Opsahl-Ong, K., Ryan, M.J., Purtell, J., Broman, D., Potts, C., Zaharia, M., & Khattab, O. (2024). "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs." *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)*. [https://aclanthology.org/2024.emnlp-main.525/](https://aclanthology.org/2024.emnlp-main.525/)

5. Agrawal, A., et al. (2025). "GEPA: Reflective [Prompt](/wiki/prompt) Evolution Can Outperform Reinforcement Learning." *arXiv preprint arXiv:2507.19457*. [https://dspy.ai/api/optimizers/GEPA/overview/](https://dspy.ai/api/optimizers/GEPA/overview/)

6. DSPy Official Documentation. [https://dspy.ai/](https://dspy.ai/)

7. DSPy GitHub Repository. [https://github.com/stanfordnlp/dspy](https://github.com/stanfordnlp/dspy)

8. DSPy Roadmap. [https://dspy.ai/roadmap/](https://dspy.ai/roadmap/)

9. DSPy Community Use Cases. [https://dspy.ai/community/use-cases/](https://dspy.ai/community/use-cases/)

10. Ziems, N., Soylu, D., Agrawal, L.A., Miller, I., Lai, L., Qian, C., Song, K., Jiang, M., Klein, D., Zaharia, M., D'Oosterlinck, K., Potts, C., & Khattab, O. (2025). "Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs." *arXiv preprint arXiv:2508.04660* (accepted to ACM CAIS 2026). [https://arxiv.org/abs/2508.04660](https://arxiv.org/abs/2508.04660)

11. DSPy 3.0.0 release notes, stanfordnlp/dspy GitHub. [https://github.com/stanfordnlp/dspy/releases/tag/3.0.0](https://github.com/stanfordnlp/dspy/releases/tag/3.0.0)

12. DSPy 3.1.0 release notes, stanfordnlp/dspy GitHub. [https://github.com/stanfordnlp/dspy/releases/tag/3.1.0](https://github.com/stanfordnlp/dspy/releases/tag/3.1.0)

13. DSPy 3.2.0 and 3.2.1 release notes, stanfordnlp/dspy GitHub. [https://github.com/stanfordnlp/dspy/releases/tag/3.2.0](https://github.com/stanfordnlp/dspy/releases/tag/3.2.0) and [https://github.com/stanfordnlp/dspy/releases/tag/3.2.1](https://github.com/stanfordnlp/dspy/releases/tag/3.2.1)

14. dspy package release history on PyPI. [https://pypi.org/project/dspy/#history](https://pypi.org/project/dspy/#history)

15. Arbor: A framework for optimizing DSPy programs with RL, by Noah Ziems. [https://github.com/Ziems/arbor](https://github.com/Ziems/arbor)

16. dspy.CodeAct API reference. [https://dspy.ai/api/modules/CodeAct/](https://dspy.ai/api/modules/CodeAct/)

17. dspy.Refine API reference. [https://dspy.ai/api/modules/Refine/](https://dspy.ai/api/modules/Refine/)

18. Agrawal, L.A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M.J., Jiang, M., Potts, C., Sen, K., Dimakis, A.G., Stoica, I., Klein, D., Zaharia, M., & Khattab, O. (2025). "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning." *arXiv preprint arXiv:2507.19457*, accepted as an oral presentation at ICLR 2026. [https://arxiv.org/abs/2507.19457](https://arxiv.org/abs/2507.19457)

19. gepa-ai/gepa GitHub repository (standalone GEPA optimization library used by dspy.GEPA). [https://github.com/gepa-ai/gepa](https://github.com/gepa-ai/gepa)

20. Soylu, D., Potts, C., & Khattab, O. (2024). "Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together." *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)*. [https://arxiv.org/abs/2407.10930](https://arxiv.org/abs/2407.10930)

21. Tan, S., Agrawal, L.A., Singhvi, A., Lai, L., Ryan, M.J., Klein, D., Khattab, O., Sen, K., & Zaharia, M. (2025). "LangProBe: a Language Programs Benchmark." *arXiv preprint arXiv:2502.20315*. [https://arxiv.org/abs/2502.20315](https://arxiv.org/abs/2502.20315)

22. dspy download statistics, PyPI Stats. [https://pypistats.org/packages/dspy](https://pypistats.org/packages/dspy)

23. stanford-oval/storm GitHub repository (STORM and Co-STORM, an LLM-powered knowledge curation system implemented using DSPy). [https://github.com/stanford-oval/storm](https://github.com/stanford-oval/storm)

24. dspy.RLM API reference (Recursive Language Model module). [https://dspy.ai/api/modules/RLM/](https://dspy.ai/api/modules/RLM/)

25. DSPy 3.3.0b1 release notes, stanfordnlp/dspy GitHub. [https://github.com/stanfordnlp/dspy/releases/tag/3.3.0b1](https://github.com/stanfordnlp/dspy/releases/tag/3.3.0b1)

26. Zhang, A.L., Kraska, T., & Khattab, O. (2025). "Recursive Language Models." *arXiv preprint arXiv:2512.24601*. [https://arxiv.org/abs/2512.24601](https://arxiv.org/abs/2512.24601)

27. Omar Khattab (@lateinteraction) on X, August 2024: announcement of joining Databricks as a Research Scientist before starting as MIT faculty in July 2025, with increased investment in the open-source DSPy community. [https://x.com/lateinteraction/status/1825623373655024035](https://x.com/lateinteraction/status/1825623373655024035)

28. Omar Khattab, personal homepage. [https://omarkhattab.com/](https://omarkhattab.com/)

[^github]: stanfordnlp/dspy on GitHub, star and fork counts as of June 2026. [https://github.com/stanfordnlp/dspy](https://github.com/stanfordnlp/dspy)

[^codeact]: dspy.CodeAct API reference. [https://dspy.ai/api/modules/CodeAct/](https://dspy.ai/api/modules/CodeAct/)

[^refine]: dspy.Refine API reference. [https://dspy.ai/api/modules/Refine/](https://dspy.ai/api/modules/Refine/)

[^v310]: DSPy 3.1.0 release notes. [https://github.com/stanfordnlp/dspy/releases/tag/3.1.0](https://github.com/stanfordnlp/dspy/releases/tag/3.1.0)

[^v320]: DSPy 3.2.0 release notes. [https://github.com/stanfordnlp/dspy/releases/tag/3.2.0](https://github.com/stanfordnlp/dspy/releases/tag/3.2.0)

[^v321]: DSPy 3.2.1 release notes. [https://github.com/stanfordnlp/dspy/releases/tag/3.2.1](https://github.com/stanfordnlp/dspy/releases/tag/3.2.1)

[^v300]: DSPy 3.0.0 release notes, stanfordnlp/dspy GitHub. [https://github.com/stanfordnlp/dspy/releases/tag/3.0.0](https://github.com/stanfordnlp/dspy/releases/tag/3.0.0)

[^v300notes]: dspy package release history on PyPI, confirming 3.0.0 release date of August 12, 2025. [https://pypi.org/project/dspy/#history](https://pypi.org/project/dspy/#history)

[^pypi]: dspy package release history on PyPI. [https://pypi.org/project/dspy/#history](https://pypi.org/project/dspy/#history)

[^grpo]: Ziems et al. (2025), "Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs", arXiv:2508.04660. [https://arxiv.org/abs/2508.04660](https://arxiv.org/abs/2508.04660)

[^arbor]: Arbor: A framework for optimizing DSPy programs with RL, by Noah Ziems. [https://github.com/Ziems/arbor](https://github.com/Ziems/arbor)