# Compound AI System

> Source: https://aiwiki.ai/wiki/compound_ai_system
> Updated: 2026-07-07
> Categories: Artificial Intelligence, Large Language Models, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **compound AI system** is an AI system that achieves its objectives by combining multiple interacting components, such as [large language models](/wiki/large_language_model), retrieval mechanisms, external tools, guardrails, and orchestration logic, rather than relying on a single monolithic model. Matei Zaharia, Omar Khattab, and colleagues, who popularized the term in a February 18, 2024 post on the Berkeley Artificial Intelligence Research (BAIR) blog titled "The Shift from Models to Compound AI Systems," define it as "a system that tackles AI tasks using multiple interacting components, including multiple calls to models, retrievers, or external tools" [1]. Their central argument was that state-of-the-art AI results are increasingly obtained not by training bigger models but by composing multiple components into integrated systems, an approach they predicted "will likely be the best way to maximize AI results in the future, and might be one of the most impactful trends in AI in 2024" [1].

The compound AI system perspective represents a significant shift in how AI practitioners think about building applications. Instead of asking "which model should I use?" the question becomes "what system of components, including models, retrievers, tools, and control logic, will best solve this problem?" This framing has become central to modern AI engineering, influencing the design of [retrieval-augmented generation](/wiki/retrieval_augmented_generation) (RAG) pipelines, [AI agents](/wiki/ai_agent), and production AI applications across industries [1]. By early 2024, most production LLM applications were already compound systems: a Databricks survey the authors cited found that 60% of LLM applications used some form of RAG and 30% used multi-step chains [1].

## Origins and definition

### The Berkeley blog post (February 2024)

On February 18, 2024, a group of researchers published "The Shift from Models to Compound AI Systems" on the Berkeley Artificial Intelligence Research (BAIR) blog. The authors included Matei Zaharia (co-creator of Apache Spark and co-founder of [Databricks](/wiki/databricks)), Omar Khattab (creator of [DSPy](/wiki/dspy)), Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi [1].

The post made several key observations:

1. The highest-quality AI results in 2024 were increasingly achieved by compound systems, not individual models.
2. Even simple composition strategies (such as having an LLM call a search engine) could outperform much larger or more expensive standalone models.
3. The trend toward compound systems was likely to accelerate because it offered a more accessible path to improving AI quality than training ever-larger models.
4. New tools, frameworks, and optimization techniques were needed specifically for designing, building, and optimizing compound systems.

The authors defined a compound AI system broadly as any system that combines multiple AI components, potentially including non-AI components, to accomplish a task. This definition intentionally encompasses a wide spectrum of architectures, from a simple LLM plus retriever pipeline to complex multi-agent systems with dozens of interacting components [1].

To illustrate the point, the authors listed several well-known systems that were already compound rather than monolithic [1]:

| System | Components |
|---|---|
| [AlphaCode 2](/wiki/alphacode) | Fine-tuned LLMs, code execution, and clustering |
| [AlphaGeometry](/wiki/alphageometry) | A fine-tuned LLM paired with a symbolic math engine |
| Medprompt | GPT-4 with nearest-neighbor retrieval and chain-of-thought inference |
| [Gemini](/wiki/gemini) on MMLU | The Gemini model with custom inference logic (CoT@32) |
| ChatGPT Plus | An LLM with a web browser, code interpreter, and image-generation plugins |
| RAG, ORQA, Bing, Baleen | LLMs combined with retrieval systems |

As concrete evidence that composition can beat scaling, the authors pointed to Microsoft's report that a chaining strategy (the Medprompt result) "exceeded GPT-4's accuracy on medical exams by 9%" [1].

### Why the shift matters

The compound systems perspective challenged the prevailing narrative in AI, which had focused heavily on scaling individual models. The [scaling laws](/wiki/scaling_laws) paradigm, popularized by research at [OpenAI](/wiki/openai) and elsewhere, suggested that the primary path to better AI was training larger models on more data with more compute. While this approach yielded remarkable results, it faced several limitations that compound systems could address.

## Why not just use a single large model?

Single-model approaches, regardless of model size, encounter fundamental limitations that motivate the compound systems approach.

| Limitation | Description | Compound system solution |
|---|---|---|
| [Hallucination](/wiki/hallucination) | Models generate plausible but factually incorrect information | Retrieval components provide grounding in verified sources |
| Stale knowledge | Model training data has a cutoff date; the model does not know about recent events | Real-time retrieval from up-to-date databases, APIs, or web search |
| No tool use | A standalone model cannot execute code, query databases, or interact with external services | Tool-use components allow the system to take actions in the real world |
| Lack of specialization | A single generalist model may underperform domain-specific solutions | Multiple specialized models or components, each optimized for a specific subtask |
| Inconsistent quality | Model outputs vary in quality and may not meet reliability requirements | Guardrails, validators, and retry logic enforce quality standards |
| Context window limits | Models have finite [context windows](/wiki/context_window) that limit how much information they can process at once | Retrieval and summarization components manage information flow across large document collections |
| Opacity | Model reasoning is difficult to inspect or debug | Structured pipelines with intermediate outputs enable better observability |
| Cost and latency | The largest models are expensive and slow; not every subtask requires frontier-model capability | Routing and model selection components direct each subtask to the most cost-effective model |

Compound systems address these limitations not by building better individual models (though they benefit from model improvements) but by combining models with complementary components that compensate for their weaknesses [1].

## Components of compound AI systems

A compound AI system integrates multiple types of components, each serving a distinct function within the overall architecture.

### Core components

| Component | Function | Examples |
|---|---|---|
| Language model (LLM) | Generates text, reasons about tasks, makes decisions | [GPT-4](/wiki/gpt-4), [Claude](/wiki/claude), [Gemini](/wiki/gemini), [Llama](/wiki/llama), [Mistral](/wiki/mistral) |
| Retriever | Fetches relevant information from external knowledge sources | Vector databases, [semantic search](/wiki/semantic_search) engines, BM25 indexes |
| Tools / APIs | Execute actions in external systems: run code, query databases, call web services | Code interpreters, calculators, web browsers, database connectors |
| Guardrails | Validate and constrain inputs and outputs to enforce safety, quality, or policy requirements | Content filters, format validators, factuality checkers |
| Memory | Stores and retrieves information across interactions or processing steps | Conversation history stores, [knowledge graphs](/wiki/knowledge_graph), working memory buffers |
| Orchestrator | Controls the flow of execution, deciding which components to invoke and in what order | LangGraph, programmatic control flow, state machines |
| Evaluator / verifier | Assesses intermediate or final outputs for correctness, relevance, or quality | LLM-as-judge, unit tests, assertion checks |
| Router / classifier | Directs inputs to the appropriate processing path based on their characteristics | Intent classifiers, difficulty estimators, model selectors |

### Interaction patterns

Components in a compound AI system interact through several common patterns:

- **Sequential pipeline:** Components execute in a fixed order. Example: a RAG system where a query is first sent to a retriever, then retrieved documents and the query are passed to an LLM for answer generation.
- **Parallel execution:** Multiple components process the same input simultaneously, and their outputs are aggregated. Example: querying multiple search indexes in parallel and merging results.
- **Iterative refinement:** A component's output is fed back into the system for further processing. Example: an LLM generates a response, a verifier checks it, and if the check fails, the LLM generates a revised response.
- **Conditional branching:** The system routes inputs to different processing paths based on classification or analysis. Example: simple queries go directly to an LLM, while complex queries trigger a multi-step research pipeline.
- **[Agent](/wiki/agent) loops:** An LLM decides which tools to call, observes the results, and decides the next action in an open-ended loop. Example: an [AI agent](/wiki/ai_agent) that researches a topic by iteratively searching, reading, and synthesizing information [2].
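
The iterative refinement pattern in particular is compact enough to sketch in a few lines. The following is a minimal illustration, not a reference implementation: `draft_answer` and `passes_check` are hypothetical stand-ins for an LLM call and a verifier, and the retry budget of three attempts is an arbitrary assumption.

```python
from typing import Callable, Optional


def refine(
    generate: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> str:
    """Generate, verify, and retry with feedback until the check passes."""
    feedback: Optional[str] = None
    answer = ""
    for _ in range(max_attempts):
        answer = generate(task, feedback)
        feedback = verify(answer)  # None means the answer passed the check
        if feedback is None:
            return answer
    return answer  # best effort once the retry budget is exhausted


# Hypothetical stand-ins for an LLM call and a format validator.
def draft_answer(task: str, feedback: Optional[str]) -> str:
    # The "model" only produces digits after it has seen verifier feedback.
    return "42" if feedback else "forty-two"


def passes_check(answer: str) -> Optional[str]:
    return None if answer.isdigit() else "Answer must be digits only."


result = refine(draft_answer, passes_check, "What is 6 * 7?")
```

The same skeleton covers guardrail-style validation: the verifier can be a schema check, a unit test, or a second model acting as a judge.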

## Key examples

### Retrieval-augmented generation (RAG)

[Retrieval-augmented generation](/wiki/retrieval_augmented_generation) is the most widely deployed type of compound AI system. A RAG system combines an LLM with a retrieval component that fetches relevant documents from a knowledge base before the LLM generates a response. This simple two-component architecture addresses two major model limitations: hallucination (by grounding responses in retrieved evidence) and stale knowledge (by retrieving from up-to-date sources) [3].

A basic RAG pipeline typically involves:

1. **Query processing:** The user's question is optionally reformulated for better retrieval.
2. **Retrieval:** A search component queries a vector database or other index to find relevant documents.
3. **Context assembly:** Retrieved documents are combined with the original query into a prompt.
4. **Generation:** The LLM generates a response based on the assembled context.
5. **Optional post-processing:** The response is checked for quality, citations are verified, or the output is formatted.
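
The retrieval, context assembly, and generation steps above can be sketched as follows. This is a toy illustration under stated assumptions: the three-document corpus, the word-overlap scoring (a stand-in for a vector database or BM25 index), and `echo_llm` (a stand-in for a real model call) are all hypothetical.

```python
import re
from typing import Callable, List, Set

DOCS = [
    "Compound AI systems combine models, retrievers, and tools.",
    "BM25 is a ranking function used by search engines.",
    "Apache Spark is a distributed data processing engine.",
]


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Rank documents by the number of words shared with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Combine numbered passages and the question into a single prompt."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {query}"


def rag_answer(query: str, llm: Callable[[str], str]) -> str:
    passages = retrieve(query, DOCS)        # step 2: retrieval
    prompt = build_prompt(query, passages)  # step 3: context assembly
    return llm(prompt)                      # step 4: generation


def echo_llm(prompt: str) -> str:
    """Hypothetical model stand-in: returns the top-ranked passage verbatim."""
    return prompt.split("[1] ")[1].split("\n")[0]


answer = rag_answer("What do compound AI systems combine?", echo_llm)
```

Because the model is passed in as a plain callable, the retriever and the generator can be swapped or tested independently, which is the modularity the compound-systems framing emphasizes.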

More advanced RAG systems add components for query decomposition, multi-step retrieval, re-ranking, citation verification, and fallback strategies, illustrating how compound systems grow in complexity as requirements increase [3].

### AI agents

AI agents represent a more complex class of compound AI systems in which an LLM serves as the central reasoning component that dynamically decides which actions to take. Unlike fixed pipelines, agents operate in open-ended loops: the LLM observes the current state, selects a tool or action, observes the result, and repeats until the task is complete [2].

Agent architectures typically combine:

- An LLM for planning and reasoning
- A set of tools the agent can invoke ([function calling](/wiki/function_calling))
- A memory system for tracking state across steps
- An orchestration framework that manages the agent loop

Frameworks like [LangChain](/wiki/langchain), [LlamaIndex](/wiki/llamaindex), and [CrewAI](/wiki/crewai) provide infrastructure for building agent-based compound systems. [Multi-agent systems](/wiki/multi_agent_system), where multiple specialized agents collaborate on a task, add another layer of composition [2].
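
The observe-act loop at the heart of an agent can be sketched without any framework. In this minimal example, `scripted_policy` is a hypothetical stand-in for the LLM's decision step, and the two tools and the five-step budget are illustrative assumptions rather than any library's API.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (action, argument, observation)

# Toy tools the agent is allowed to call.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
    "upper": lambda arg: arg.upper(),
}


def scripted_policy(task: str, history: List[Step]) -> Tuple[str, str]:
    """Hypothetical stand-in for the LLM: choose the next (action, argument)."""
    if not history:
        return ("add", task)           # first step: call a tool
    return ("finish", history[-1][2])  # then stop with the last observation


def run_agent(
    task: str,
    policy: Callable[[str, List[Step]], Tuple[str, str]] = scripted_policy,
    max_steps: int = 5,
) -> str:
    history: List[Step] = []  # memory that tracks state across steps
    for _ in range(max_steps):
        action, arg = policy(task, history)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)  # invoke the selected tool
        history.append((action, arg, observation))
    return "stopped: step budget exhausted"


result = run_agent("2+3")
```

The step budget matters in practice: because the loop is open-ended, an explicit cap is a simple guard against an agent that never decides to finish.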

### Multi-model pipelines

Some compound systems use multiple models, each specialized for different subtasks. For example, a document processing system might use:

- A [vision model](/wiki/vision_transformer) to extract text and layout from scanned documents
- A classification model to categorize the document type
- A specialized extraction model to pull structured data from the classified document
- A large LLM to answer questions about the extracted data

This approach allows each component to be optimized independently for its specific subtask, often achieving better overall performance than a single large model attempting to handle everything [1].

### AlphaCode 2 and competitive programming

Google [DeepMind](/wiki/google_deepmind)'s [AlphaCode 2](/wiki/alphacode), highlighted in the Berkeley post as a canonical compound system, exemplifies the generate-filter-select pattern. Rather than having a single model produce one solution, AlphaCode 2 uses a Gemini-based model to sample up to one million candidate C++ programs per problem, then prunes them to at most 10 submissions through test-based filtering, behavioral clustering, and a fine-tuned scoring model [11]. On Codeforces contests it solved 43% of problems, placing it above roughly 85% of human competitors and solving about 1.7 times as many problems as the original AlphaCode [11]. This generate-filter-select pipeline outperformed what any single model call could achieve [1].
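
The generate-filter-select pattern can be illustrated with a toy sketch. This is not AlphaCode 2's implementation: the hand-written candidate functions stand in for sampled programs, and ranking clusters by size is a simplifying assumption in place of the fine-tuned scoring model described in the technical report.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Candidate = Callable[[int], int]

# Toy "sampled programs" for the task "return the square of x".
candidates: List[Candidate] = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x + x,            # passes the public test (x=2) but is wrong
    lambda x: abs(x) * abs(x),
    lambda x: x * 2,            # same wrong behaviour as x + x
    lambda x: x + 1,            # fails the public test
]

public_tests: List[Tuple[int, int]] = [(2, 4)]
probe_inputs: List[int] = [-3, 0, 5]


def select(
    cands: List[Candidate],
    tests: List[Tuple[int, int]],
    probes: List[int],
    max_submissions: int = 2,
) -> List[Candidate]:
    # Filter: keep only candidates that pass every public test.
    survivors = [c for c in cands if all(c(i) == o for i, o in tests)]
    # Cluster: group survivors by their outputs on extra probe inputs.
    clusters: Dict[Tuple[int, ...], List[Candidate]] = defaultdict(list)
    for c in survivors:
        clusters[tuple(c(p) for p in probes)].append(c)
    # Select: submit one representative from each of the largest clusters.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:max_submissions]]


picks = select(candidates, public_tests, probe_inputs)
```

Clustering by behavior avoids spending the small submission budget on several programs that are functionally identical.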

## DSPy: Programming compound AI systems

### Overview

DSPy (Declarative Self-improving Python) is a framework developed by Omar Khattab and colleagues, initially at Stanford NLP and published at ICLR 2024, that provides a programming model for building and optimizing compound AI systems. DSPy's central thesis is that building effective AI systems should involve programming, not prompting: developers should write modular code that defines the structure of their system, and let automated optimization find the best prompts, examples, and parameters [4].

### The prompting problem

Traditional LLM application development relies heavily on [prompt engineering](/wiki/prompt_engineering): manually crafting prompt templates through trial and error. This approach has several weaknesses:

- Prompts are brittle and break when models are updated or swapped.
- Manually optimizing prompts for multi-step pipelines is combinatorially difficult.
- There is no systematic way to ensure that prompt changes in one component do not degrade performance in others.
- Prompt engineering does not scale well as systems grow in complexity.

DSPy addresses these problems by treating prompts as parameters that can be optimized automatically, similar to how weights are optimized in [neural network](/wiki/neural_network) training [4].

### How DSPy works

In DSPy, developers define their compound AI system as a series of modules (each representing an LLM call or other operation) with typed input/output signatures. Rather than writing specific prompt text, the developer specifies what each module should accomplish (its input and output types and a brief description). DSPy then uses optimization algorithms to automatically determine the best instructions, few-shot examples, and other prompt parameters for each module [4].

The optimization process proceeds as follows:

1. The developer defines a pipeline of DSPy modules.
2. The developer provides a small set of training examples and a quality metric.
3. DSPy's optimizer (originally called a "teleprompter") automatically generates and evaluates different prompt configurations.
4. The optimizer selects prompt configurations that maximize end-to-end performance on the quality metric.
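
The idea of treating prompt contents as parameters to be searched can be shown with a deliberately small sketch in plain Python. It does not use DSPy's API: `toy_lm` is a hypothetical stand-in for a language model that imitates its few-shot demonstrations, and the exhaustive subset search is a simplification of what a real optimizer does.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)

train: List[Example] = [
    ("great movie", "pos"),
    ("terrible plot", "neg"),
    ("great acting", "pos"),
    ("boring and terrible", "neg"),
]
dev: List[Example] = [
    ("a great film", "pos"),
    ("terrible pacing", "neg"),
    ("boring script", "neg"),
]


def toy_lm(demos: Sequence[Example], text: str) -> str:
    """Model stand-in: copy the label of the demo sharing the most words."""
    words = set(text.split())
    best = max(demos, key=lambda d: len(words & set(d[0].split())))
    return best[1]


def accuracy(demos: Sequence[Example], data: Sequence[Example]) -> float:
    """Quality metric: fraction of examples the prompted model gets right."""
    return sum(toy_lm(demos, x) == y for x, y in data) / len(data)


def optimize_demos(
    train: Sequence[Example], dev: Sequence[Example], k: int = 2
) -> Tuple[Example, ...]:
    """Try every k-demo subset and keep the one with the best metric."""
    return max(combinations(train, k), key=lambda demos: accuracy(demos, dev))


best_demos = optimize_demos(train, dev)
best_score = accuracy(best_demos, dev)
```

The developer supplies only the data and the metric; which demonstrations end up in the prompt is decided by the search, not by hand.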

### Performance results

DSPy demonstrated that automatically optimized pipelines could outperform carefully hand-crafted prompts. In experiments, a few lines of DSPy code allowed [GPT-3.5](/wiki/gpt-3.5) and Llama-2-13b-chat to self-bootstrap pipelines that outperformed standard few-shot prompting (by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively) [4].

### DSPy 2.0 and beyond

Since its initial release, DSPy has evolved significantly. Later versions expanded the optimization algorithms, added support for more LLM providers, and improved the developer experience. The framework has been adopted by researchers and practitioners building production compound AI systems across various domains [4].

## How do compound AI systems differ from monolithic models?

The compound systems approach and the monolithic model approach represent different philosophies for building AI applications.

| Dimension | Monolithic model | Compound AI system |
|---|---|---|
| Architecture | Single large model handles all aspects of the task | Multiple specialized components collaborate |
| Improvement path | Train a larger model or fine-tune on more data | Add, replace, or optimize individual components |
| Knowledge update | Requires retraining or fine-tuning | Update the retrieval source or tool without touching the model |
| Debugging | Opaque: difficult to identify why a specific output was produced | Modular: intermediate outputs from each component can be inspected |
| Cost structure | Every query is served by the same model, so per-token cost is set by that model's size | Variable cost: simple queries can use cheaper components |
| Reliability | Dependent on the single model's capabilities | Can add verification, retry, and fallback mechanisms |
| Development speed | Fast for prototypes; slow for reliability improvements | More initial setup; faster iteration on individual components |
| Customization | Requires fine-tuning or prompt engineering for the entire model | Can swap or customize individual components |

The Berkeley authors argued that compound systems would become the dominant paradigm for production AI applications, even as individual models continued to improve. Their reasoning was that compound systems offer more accessible paths to improvement: adding a retrieval component or a verification step requires engineering effort, not the massive compute budgets needed to train larger models [1].

## Optimization challenges

Building effective compound AI systems introduces optimization challenges that do not exist with single-model approaches.

### End-to-end optimization

Each component in a compound system has parameters (prompts, retrieval configurations, model choices, thresholds) that affect the overall system's performance. Optimizing these parameters jointly is difficult because:

- The search space grows combinatorially with the number of components.
- Components interact in complex ways; a change in one component may require adjustments in others.
- Evaluation requires running the full pipeline, which can be expensive.

DSPy and similar frameworks attempt to automate this optimization, but the problem remains an active area of research [4].
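
As a rough worked example of the combinatorial growth (the numbers are illustrative assumptions, not figures from the cited work): if a pipeline has $n$ components and component $i$ has $c_i$ candidate configurations, an exhaustive search must evaluate

$$
N = \prod_{i=1}^{n} c_i
$$

full-pipeline runs. With $n = 4$ components and $c_i = 10$ options each, $N = 10^4$ runs; adding a fifth component with ten options raises this to $N = 10^5$. Since each run may involve several model calls, exhaustive search quickly becomes impractical, which is why optimizers sample or prune the space instead.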

### Scaling properties

A paper titled "Are More LLM Calls All You Need? Towards the Scaling Properties of Compound AI Systems," published at [NeurIPS](/wiki/neurips) 2024, investigated how compound system performance scales with the number of LLM calls. Analyzing two aggregation designs, Vote and Filter-Vote, the authors found that performance can first increase and then decrease as more calls are added: additional calls help on "easy" queries but hurt on "hard" ones, so a task containing both produces non-monotonic scaling rather than steady improvement [5].
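
The non-monotonic effect can be reproduced with a small calculation. This is a simplified sketch rather than the paper's model: it assumes binary-answer queries, statistically independent calls, majority voting over an odd number of calls, and illustrative per-call accuracies of 0.9 on easy queries and 0.45 on hard ones.

```python
from math import comb


def majority_accuracy(p: float, k: int) -> float:
    """Probability that a majority of k independent calls is correct (k odd)."""
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(k // 2 + 1, k + 1)
    )


def system_accuracy(
    k: int, p_easy: float = 0.9, p_hard: float = 0.45, easy_fraction: float = 0.5
) -> float:
    """Expected accuracy over a mix of easy and hard queries with k votes."""
    return easy_fraction * majority_accuracy(p_easy, k) + (
        1 - easy_fraction
    ) * majority_accuracy(p_hard, k)


# Accuracy as a function of the number of calls; it rises at first, then falls.
curve = {k: round(system_accuracy(k), 3) for k in (1, 3, 5, 11, 51, 101)}
```

Under these assumptions, voting pushes easy queries toward certain correctness within a few calls, while hard queries (where a single call is right less than half the time) drift toward certain failure as calls accumulate, so the combined curve peaks at a small number of calls.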

### Latency management

Compound systems inherently involve more computation than single-model calls. Maintaining acceptable latency requires careful design:

- Parallelizing independent component calls
- Caching frequently used retrieval results
- Using smaller, faster models for subtasks that do not require frontier-model capability
- Implementing streaming to deliver partial results while processing continues

ALTO (An Efficient Network Orchestrator for Compound AI Systems), presented at EuroMLSys '24 in April 2024 by researchers from Stanford, Carnegie Mellon, and UC Berkeley, addresses the network orchestration challenge in distributed compound AI systems by streaming partial outputs between pipeline stages as they are generated token by token. On a chatbot verification pipeline, ALTO increased throughput by up to 3 times at a fixed 4-second latency target and reduced tail latency by 1.8 times compared with a baseline serving approach [6].

### Cost optimization

Production compound AI systems must balance quality against cost. Model routing, where a classifier directs simple queries to smaller, cheaper models and complex queries to larger, more expensive ones, is a common cost optimization strategy. Research on optimizing model selection for compound AI systems has explored how to make these routing decisions automatically. One such framework, LLMSelector, assigns each module in a compound system the model that best performs that module's job while holding the others fixed, yielding 5% to 70% accuracy gains over using a single model for every module across systems such as multi-agent debate and self-refine (tested with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5) [7].
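
A query-level router of the kind described at the start of this section can be sketched as follows. This is a toy illustration, not LLMSelector: the model names, the cost figures, the word-count difficulty heuristic, and the 0.5 threshold are all hypothetical assumptions. A production router would typically use a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    name: str
    cost_per_call: float  # illustrative units, not real prices
    answer: Callable[[str], str]


def estimate_difficulty(query: str) -> float:
    """Toy estimator: longer, multi-part, or 'why' questions score higher."""
    score = min(len(query.split()) / 20, 1.0)
    if " and " in query or "why" in query.lower():
        score = min(score + 0.4, 1.0)
    return score


def route(query: str, small: Model, large: Model, threshold: float = 0.5) -> Model:
    """Send hard queries to the large model and everything else to the small one."""
    return large if estimate_difficulty(query) >= threshold else small


small = Model("small-model", 1.0, lambda q: f"[small] {q}")
large = Model("large-model", 20.0, lambda q: f"[large] {q}")

easy = route("What is BM25?", small, large)
hard = route(
    "Explain why majority voting can hurt on hard queries and how to detect that",
    small,
    large,
)
```

The threshold is the main tuning knob: raising it sends more traffic to the cheaper model at the risk of lower quality on borderline queries.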

## How do compound AI systems relate to AI agents?

[Agentic AI](/wiki/ai_agent) can be understood as a specific class of compound AI systems: those in which an LLM dynamically controls the system's execution flow. All AI agents are compound AI systems (they combine models with tools, memory, and orchestration), but not all compound AI systems are agents (a fixed RAG pipeline, for example, does not involve dynamic decision-making by the model) [2].

The compound systems framework provides a useful lens for analyzing agent architectures. An agent's effectiveness depends not only on the quality of its core LLM but also on the quality of its tools, the reliability of its memory, the robustness of its guardrails, and the efficiency of its orchestration. Improving any of these components can improve the agent's overall performance, even without changing the underlying model.

As agentic AI has grown more prominent through 2024 and 2025, the compound systems perspective has become increasingly relevant. Complex agent workflows involving multiple specialized agents, shared knowledge bases, and sophisticated orchestration are fundamentally compound systems problems [2].

## Frameworks and tooling

The growth of compound AI systems has driven the development of specialized frameworks and tools.

| Framework | Primary focus | Key features |
|---|---|---|
| DSPy | Pipeline optimization | Automatic prompt optimization, modular signatures, teleprompters |
| [LangChain](/wiki/langchain) | General-purpose LLM application building | Chains, agents, retrieval, tool integration, LangGraph for stateful workflows |
| [LlamaIndex](/wiki/llamaindex) | Data-connected LLM applications | Document ingestion, indexing, retrieval, query engines |
| Haystack | Production-ready NLP pipelines | Component-based pipeline design, retrieval, document processing |
| [Semantic Kernel](/wiki/semantic_kernel) | Enterprise AI orchestration (Microsoft) | Planner, skills, connectors, memory integration |
| [CrewAI](/wiki/crewai) | Multi-agent systems | Role-based agents, task delegation, collaborative workflows |

These frameworks share a common architectural philosophy: decompose AI applications into modular, composable components that can be developed, tested, and optimized independently [2].

## Why do enterprises use compound AI systems?

Compound AI systems have become a common architecture for enterprise AI applications. The Databricks survey cited by the Berkeley authors found that 60% of LLM applications used some form of RAG and 30% used multi-step chains [1]. Enterprises favor compound systems because they offer:

- **Controllability:** Organizations can enforce policies, compliance requirements, and brand guidelines through guardrail components.
- **Auditability:** Intermediate outputs from each component create an audit trail that supports regulatory compliance.
- **Updatability:** Knowledge bases can be updated without retraining models, keeping AI applications current.
- **Cost management:** Routing and model selection allow organizations to manage inference costs as usage scales.

A 2024 paper from Megagon Labs, "A Blueprint Architecture of Compound AI Systems for Enterprise," proposed a reference architecture for enterprise compound systems built around a central "stream" orchestration concept, with task and data planners that break down, map, and optimize work across the available agents and data sources [8].

## Current state (2026)

As of mid-2026, the compound AI systems paradigm has moved from a research insight to mainstream practice. Several developments characterize the current landscape.

First, the distinction between "model" and "system" has become standard vocabulary in AI engineering. Major cloud providers (AWS, Google Cloud, Azure) offer compound system building blocks as managed services, including retrieval, guardrails, orchestration, and evaluation components.

Second, optimization of compound systems remains an active research frontier. DSPy and related work have demonstrated that automatic optimization is possible, but scaling these techniques to production systems with many components and complex interactions is still challenging [4].

Third, the relationship between compound systems and [foundation models](/wiki/foundation_model) is evolving. Newer models increasingly incorporate capabilities (tool use, retrieval, structured output) that were previously handled by external components. This trend does not eliminate the need for compound systems but changes which components are external versus built into the model [1].

Fourth, the agentic AI trend has made compound systems more complex and dynamic. Multi-agent workflows, where multiple LLM-powered agents collaborate on tasks, represent some of the most sophisticated compound systems being built. These systems push the boundaries of what current frameworks and optimization techniques can handle.

### Interoperability standards: MCP and the Agentic AI Foundation

A significant development for compound AI system interoperability was the emergence of the [Model Context Protocol](/wiki/model_context_protocol) (MCP). Originally introduced by Anthropic in November 2024 as an open standard for connecting AI agents to external tools, data sources, and APIs, MCP was donated to the Agentic AI Foundation (AAIF) under the Linux Foundation on December 9, 2025. The AAIF was co-founded by Anthropic, Block, and OpenAI, with support from Google, Microsoft, Amazon Web Services, Cloudflare, and Bloomberg. By the time of the donation, MCP had accumulated more than 97 million monthly SDK downloads across Python and TypeScript and more than 10,000 active public servers, with support across major AI platforms including ChatGPT, Cursor, Gemini, Microsoft Copilot, and Visual Studio Code, and over 75 connectors available in Claude. MCP has become a de facto standard for the tool-use layer in compound AI systems [9].

### Framework consolidation

A notable framework consolidation event was Microsoft's merger of AutoGen and [Semantic Kernel](/wiki/semantic_kernel) into the unified Microsoft Agent Framework. The framework reached Release Candidate status in February 2026 and shipped version 1.0 (general availability) for both .NET and Python on April 3, 2026, at which point AutoGen and Semantic Kernel moved into maintenance mode. This consolidation reflected the maturation of the compound AI systems market, where competing framework approaches were being standardized into enterprise-grade platforms [10].

The Berkeley authors' original prediction, that compound AI systems might be one of the most impactful trends in AI, appears to have been borne out. The trend reflects a maturation of the field: as AI moves from research demonstrations to production applications, the engineering of complete systems, not just models, becomes the central challenge.

## See also

- [Retrieval-augmented generation](/wiki/retrieval_augmented_generation)
- [AI agent](/wiki/ai_agent)
- [LangChain](/wiki/langchain)
- [Prompt engineering](/wiki/prompt_engineering)
- [Large language model](/wiki/large_language_model)
- [Multi-agent system](/wiki/multi_agent_system)
- [Model Context Protocol](/wiki/model_context_protocol)

## References

1. [The Shift from Models to Compound AI Systems - Berkeley Artificial Intelligence Research Blog (February 18, 2024)](https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/)
2. [Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG - arXiv (January 2025)](https://arxiv.org/abs/2501.09136)
3. [RAG Architecture Explained: A Comprehensive Guide - Orq.ai (2025)](https://orq.ai/blog/rag-architecture)
4. [DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines - ICLR 2024](https://openreview.net/forum?id=sY5N0zY5Od)
5. [Are More LLM Calls All You Need? Towards the Scaling Properties of Compound AI Systems - NeurIPS 2024](https://arxiv.org/abs/2403.02419)
6. [ALTO: An Efficient Network Orchestrator for Compound AI Systems - EuroMLSys '24 (2024)](https://dl.acm.org/doi/10.1145/3642970.3655844)
7. [Optimizing Model Selection for Compound AI Systems - arXiv (2025)](https://arxiv.org/abs/2502.14815)
8. [A Blueprint Architecture of Compound AI Systems for Enterprise (Megagon Labs) - arXiv (2024)](https://arxiv.org/abs/2406.00584)
9. [Donating the Model Context Protocol and establishing the Agentic AI Foundation - Anthropic (2025)](https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation)
10. [Microsoft Agent Framework Overview - Microsoft Learn (2026)](https://learn.microsoft.com/en-us/agent-framework/overview/)
11. [AlphaCode 2 Technical Report - Google DeepMind (December 2023)](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf)