A compound AI system is an AI system that achieves its objectives by combining multiple interacting components, such as large language models, retrieval mechanisms, external tools, guardrails, and orchestration logic, rather than relying on a single monolithic model. The term was popularized by Matei Zaharia, Omar Khattab, and colleagues at UC Berkeley in a February 2024 blog post titled "The Shift from Models to Compound AI Systems," which argued that state-of-the-art AI results are increasingly obtained not by training bigger models but by composing multiple components into integrated systems [1].
The compound AI system perspective represents a significant shift in how AI practitioners think about building applications. Instead of asking "which model should I use?" the question becomes "what system of components, including models, retrievers, tools, and control logic, will best solve this problem?" This framing has become central to modern AI engineering, influencing the design of retrieval-augmented generation (RAG) pipelines, AI agents, and production AI applications across industries [1].
On February 18, 2024, a group of researchers published "The Shift from Models to Compound AI Systems" on the Berkeley Artificial Intelligence Research (BAIR) blog. The authors included Matei Zaharia (co-creator of Apache Spark and co-founder of Databricks), Omar Khattab (creator of DSPy), Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi [1].
The post made several key observations:

- State-of-the-art AI results are increasingly achieved by compound systems rather than single model calls; the authors pointed to examples such as AlphaCode and ChatGPT Plus with plugins [1].
- Some tasks are easier to improve through system design than through additional model training, because engineering iteration is faster and cheaper than large training runs [1].
- Systems can be dynamic in ways static models cannot, for example incorporating real-time data through retrieval and respecting access controls [1].
- System design improves control and trust: components such as filters and verifiers can enforce behavior that prompting a single model cannot guarantee [1].
- Performance and cost goals vary widely across applications, and composing components lets developers tune that tradeoff per use case [1].
The authors defined a compound AI system broadly as any system that combines multiple AI components, potentially including non-AI components, to accomplish a task. This definition intentionally encompasses a wide spectrum of architectures, from a simple LLM plus retriever pipeline to complex multi-agent systems with dozens of interacting components [1].
The compound systems perspective challenged the prevailing narrative in AI, which had focused heavily on scaling individual models. The scaling laws paradigm, popularized by research at OpenAI and elsewhere, suggested that the primary path to better AI was training larger models on more data with more compute. While this approach yielded remarkable results, it faced several limitations that compound systems could address.
Single-model approaches, regardless of model size, encounter fundamental limitations that motivate the compound systems approach.
| Limitation | Description | Compound system solution |
|---|---|---|
| Hallucination | Models generate plausible but factually incorrect information | Retrieval components provide grounding in verified sources |
| Stale knowledge | Model training data has a cutoff date; the model does not know about recent events | Real-time retrieval from up-to-date databases, APIs, or web search |
| No tool use | A standalone model cannot execute code, query databases, or interact with external services | Tool-use components allow the system to take actions in the real world |
| Lack of specialization | A single generalist model may underperform domain-specific solutions | Multiple specialized models or components, each optimized for a specific subtask |
| Inconsistent quality | Model outputs vary in quality and may not meet reliability requirements | Guardrails, validators, and retry logic enforce quality standards |
| Context window limits | Models have finite context windows that limit how much information they can process at once | Retrieval and summarization components manage information flow across large document collections |
| Opacity | Model reasoning is difficult to inspect or debug | Structured pipelines with intermediate outputs enable better observability |
| Cost and latency | The largest models are expensive and slow; not every subtask requires frontier-model capability | Routing and model selection components direct each subtask to the most cost-effective model |
Compound systems address these limitations not by building better individual models (though they benefit from model improvements) but by combining models with complementary components that compensate for their weaknesses [1].
A compound AI system integrates multiple types of components, each serving a distinct function within the overall architecture.
| Component | Function | Examples |
|---|---|---|
| Language model (LLM) | Generates text, reasons about tasks, makes decisions | GPT-4, Claude, Gemini, Llama, Mistral |
| Retriever | Fetches relevant information from external knowledge sources | Vector databases, semantic search engines, BM25 indexes |
| Tools / APIs | Execute actions in external systems: run code, query databases, call web services | Code interpreters, calculators, web browsers, database connectors |
| Guardrails | Validate and constrain inputs and outputs to enforce safety, quality, or policy requirements | Content filters, format validators, factuality checkers |
| Memory | Stores and retrieves information across interactions or processing steps | Conversation history stores, knowledge graphs, working memory buffers |
| Orchestrator | Controls the flow of execution, deciding which components to invoke and in what order | LangGraph, programmatic control flow, state machines |
| Evaluator / verifier | Assesses intermediate or final outputs for correctness, relevance, or quality | LLM-as-judge, unit tests, assertion checks |
| Router / classifier | Directs inputs to the appropriate processing path based on their characteristics | Intent classifiers, difficulty estimators, model selectors |
Components in a compound AI system interact through several common patterns; the most widespread are retrieval-augmented generation, agentic tool use, and multi-model pipelines.
Retrieval-augmented generation is the most widely deployed type of compound AI system. A RAG system combines an LLM with a retrieval component that fetches relevant documents from a knowledge base before the LLM generates a response. This simple two-component architecture addresses two major model limitations: hallucination (by grounding responses in retrieved evidence) and stale knowledge (by retrieving from up-to-date sources) [3].
A basic RAG pipeline typically involves:

1. Embedding the user's query into a vector representation
2. Retrieving the top-k most similar documents or chunks from an index
3. Inserting the retrieved text into the LLM's prompt as context
4. Generating a response grounded in that context
More advanced RAG systems add components for query decomposition, multi-step retrieval, re-ranking, citation verification, and fallback strategies, illustrating how compound systems grow in complexity as requirements increase [3].
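The basic two-component pipeline can be sketched in plain Python. The retriever below is a toy word-overlap scorer and `generate` is a placeholder for a real LLM API call; both are illustrative stand-ins, since production systems use embedding-based vector search and a hosted model:

```python
# Minimal sketch of a two-component RAG pipeline. The retriever is a toy
# word-overlap scorer; real systems rank by embedding similarity instead.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and return the top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    # Placeholder: a production system would call an LLM API here.
    return f"[LLM answer grounded in prompt of {len(prompt)} chars]"

def rag_answer(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

The key property is visible even in the sketch: the model only sees evidence selected at query time, so updating the corpus updates the system's knowledge without touching the model.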
AI agents represent a more complex class of compound AI systems in which an LLM serves as the central reasoning component that dynamically decides which actions to take. Unlike fixed pipelines, agents operate in open-ended loops: the LLM observes the current state, selects a tool or action, observes the result, and repeats until the task is complete [2].
Agent architectures typically combine:

- An LLM serving as the reasoning and planning core
- A set of tools the model can invoke (search, code execution, external APIs)
- Memory for tracking state across steps
- An orchestration loop with stopping criteria and step limits
Frameworks like LangChain, LlamaIndex, and CrewAI provide infrastructure for building agent-based compound systems. Multi-agent systems, where multiple specialized agents collaborate on a task, add another layer of composition [2].
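The open-ended observe-act loop at the core of an agent can be sketched as follows. The `policy` function is a scripted stand-in for the LLM's action selection, and the tool set is invented for illustration; a real agent would parse a model's output into a tool name and arguments:

```python
# Sketch of an agent's observe-act loop. `policy` stands in for an LLM
# deciding the next action from the task and the history so far.

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy tool; never eval untrusted input
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}

def policy(task: str, history: list) -> tuple[str, str]:
    """Stand-in for the LLM: pick the next (action, argument) pair."""
    if not history:
        return ("lookup", "capital_of_france")
    return ("finish", history[-1][1])  # finish with the last observation

def run_agent(task: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        action, arg = policy(task, history)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)  # act, then observe the result
        history.append((action, observation))
    return "max steps exceeded"
```

The step limit matters in practice: because the model, not a fixed pipeline, controls the flow, agents need explicit guards against unbounded loops.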
Some compound systems use multiple models, each specialized for different subtasks. For example, a document processing system might use:

- A small, fast classifier to route documents by type
- A vision or OCR model to extract text from scanned pages
- A mid-sized model for summarization and structured extraction
- A frontier model reserved for the hardest reasoning steps
This approach allows each component to be optimized independently for its specific subtask, often achieving better overall performance than a single large model attempting to handle everything [1].
Google DeepMind's AlphaCode, which achieved competitive-level performance on programming contests, exemplifies a compound system. Rather than having a single model generate one solution, the system generates a large number of candidate programs, filters and clusters them, and selects the best candidates for submission. This generate-filter-select pipeline outperformed what any single model call could achieve [1].
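The generate-filter-select pattern can be illustrated with toy candidates, using plain Python functions in place of sampled programs. Everything here is a simplified stand-in, not AlphaCode's actual implementation:

```python
# Toy generate-filter-select pipeline: filter candidates on an example test,
# cluster survivors by behavior on probe inputs, keep one per cluster.

from collections import defaultdict

# "Generated" candidates for the task "double the input".
candidates = [
    lambda x: x * 2,      # correct
    lambda x: x + x,      # correct, behaviorally identical to the first
    lambda x: x ** 2,     # wrong
    lambda x: x * 2 + 1,  # wrong
]

def filter_candidates(cands, example):
    inp, expected = example
    return [f for f in cands if f(inp) == expected]

def cluster_and_select(cands, probes):
    clusters = defaultdict(list)
    for f in cands:
        # Candidates with identical outputs on the probes share a cluster.
        clusters[tuple(f(p) for p in probes)].append(f)
    return [group[0] for group in clusters.values()]

survivors = filter_candidates(candidates, example=(3, 6))
finalists = cluster_and_select(survivors, probes=[0, 1, 5])
```

Filtering removes candidates that fail the known example, and clustering deduplicates behaviorally equivalent survivors, which is how the real system narrowed a huge candidate pool to a handful of submissions.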
DSPy (Declarative Self-improving Python) is a framework created by Omar Khattab and colleagues, initially at Stanford NLP, with its accompanying paper published at ICLR 2024. It provides a programming model for building and optimizing compound AI systems. DSPy's central thesis is that building effective AI systems should involve programming, not prompting: developers write modular code that defines the structure of their system, and automated optimization finds the best prompts, examples, and parameters [4].
Traditional LLM application development relies heavily on prompt engineering: manually crafting prompt templates through trial and error. This approach has several weaknesses:

- Prompts are brittle: small wording changes can significantly alter outputs
- Prompts tuned for one model often fail to transfer to another
- Manual iteration is slow and does not scale to pipelines with many modules
- There is no principled way to know when a prompt is close to optimal
DSPy addresses these problems by treating prompts as parameters that can be optimized automatically, similar to how weights are optimized in neural network training [4].
In DSPy, developers define their compound AI system as a series of modules (each representing an LLM call or other operation) with typed input/output signatures. Rather than writing specific prompt text, the developer specifies what each module should accomplish (its input and output types and a brief description). DSPy then uses optimization algorithms to automatically determine the best instructions, few-shot examples, and other prompt parameters for each module [4].
The optimization process works by:

1. Running the pipeline on a set of training examples
2. Collecting traces of executions that score well on a user-defined metric
3. Bootstrapping few-shot demonstrations for each module from those traces
4. Searching over candidate instructions and demonstrations to maximize the metric
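The prompts-as-parameters idea can be illustrated with a toy search loop. This is not DSPy's actual API; `mock_lm`, the candidate instructions, and the dev set are all invented for illustration:

```python
# Toy illustration of treating the prompt as an optimizable parameter:
# score each candidate instruction on a labeled dev set, keep the best.

def mock_lm(instruction: str, question: str) -> str:
    """Stand-in for an LLM call; a real optimizer would query a model here."""
    # Pretend the model answers correctly only under a sufficiently
    # specific instruction.
    answers = {"2+2": "4", "3+3": "6"}
    return answers[question] if "step by step" in instruction else "unsure"

def score(instruction: str, devset: list) -> float:
    """Fraction of dev-set questions answered correctly (the metric)."""
    return sum(mock_lm(instruction, q) == gold for q, gold in devset) / len(devset)

def optimize(candidates: list, devset: list) -> str:
    """Search over candidate instructions, keeping the best-scoring one."""
    return max(candidates, key=lambda ins: score(ins, devset))

devset = [("2+2", "4"), ("3+3", "6")]
best = optimize(["Answer the question.",
                 "Think step by step, then answer."], devset)
```

Real optimizers search a far larger space (instructions, few-shot demonstrations, and their combinations across every module), but the structure is the same: a metric, a candidate generator, and a selection loop, with no human prompt-tweaking in between.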
DSPy demonstrated that automatically optimized pipelines could outperform carefully hand-crafted prompts. In experiments, a few lines of DSPy code allowed GPT-3.5 and Llama-2-13b-chat to self-bootstrap pipelines that outperformed standard few-shot prompting (by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively) [4].
Since its initial release, DSPy has evolved significantly. Later versions expanded the optimization algorithms, added support for more LLM providers, and improved the developer experience. The framework has been adopted by researchers and practitioners building production compound AI systems across various domains [4].
The compound systems approach and the monolithic model approach represent different philosophies for building AI applications.
| Dimension | Monolithic model | Compound AI system |
|---|---|---|
| Architecture | Single large model handles all aspects of the task | Multiple specialized components collaborate |
| Improvement path | Train a larger model or fine-tune on more data | Add, replace, or optimize individual components |
| Knowledge update | Requires retraining or fine-tuning | Update the retrieval source or tool without touching the model |
| Debugging | Opaque: difficult to identify why a specific output was produced | Modular: intermediate outputs from each component can be inspected |
| Cost structure | Fixed cost per query based on model size | Variable cost: simple queries can use cheaper components |
| Reliability | Dependent on the single model's capabilities | Can add verification, retry, and fallback mechanisms |
| Development speed | Fast for prototypes; slow for reliability improvements | More initial setup; faster iteration on individual components |
| Customization | Requires fine-tuning or prompt engineering for the entire model | Can swap or customize individual components |
The Berkeley authors argued that compound systems would become the dominant paradigm for production AI applications, even as individual models continued to improve. Their reasoning was that compound systems offer more accessible paths to improvement: adding a retrieval component or a verification step requires engineering effort, not the massive compute budgets needed to train larger models [1].
Building effective compound AI systems introduces optimization challenges that do not exist with single-model approaches.
Each component in a compound system has parameters (prompts, retrieval configurations, model choices, thresholds) that affect the overall system's performance. Optimizing these parameters jointly is difficult because:

- The search space is combinatorial: choices for one component multiply with choices for every other
- Components interact, so a locally optimal setting for one module may degrade the end-to-end result
- Gradients do not flow through discrete prompts and black-box API calls, ruling out standard end-to-end training
- Evaluating a full pipeline configuration is slow and costly, limiting how many candidates can be tried
DSPy and similar frameworks attempt to automate this optimization, but the problem remains an active area of research [4].
A paper titled "Are More LLM Calls All You Need? Towards the Scaling Properties of Compound AI Systems," published at NeurIPS 2024, investigated how compound system performance scales with the number of LLM calls. The research found that simply adding more model calls (through ensembling, verification, or multi-step reasoning) could improve performance, but with diminishing returns that depend on the specific task and system architecture [5].
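The diminishing-returns effect can be illustrated with a simple majority-voting model: if each of n independent calls is correct with probability p and the system takes a majority vote, accuracy rises with n but each additional call helps less. The independence assumption is a strong simplification of the paper's actual setting:

```python
# Majority-vote accuracy over n independent calls, each correct with
# probability p. Accuracy improves with n, but with shrinking increments.

from math import comb

def majority_correct(p: float, n: int) -> float:
    """P(majority of n odd independent calls is correct)."""
    k = n // 2 + 1  # votes needed for a majority
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Accuracy for an ensemble of 1, 3, 5, 11, and 21 calls at p = 0.7.
gains = [majority_correct(0.7, n) for n in (1, 3, 5, 11, 21)]
```

Under this model, going from 1 to 3 calls buys more accuracy than going from 11 to 21, mirroring the paper's finding that scaling the number of calls has task- and architecture-dependent diminishing returns.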
Compound systems inherently involve more computation than single-model calls. Maintaining acceptable latency requires careful design:

- Running independent component calls in parallel rather than sequentially
- Streaming intermediate and final outputs instead of waiting for complete responses
- Caching retrieval results and repeated model calls
- Using smaller, faster models for latency-critical steps
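A common latency technique is to run independent component calls concurrently rather than sequentially. A minimal asyncio sketch, with `asyncio.sleep` standing in for network and inference time, and the component names invented for illustration:

```python
# Fan out independent component calls concurrently: total latency is
# roughly the slowest call, not the sum of all calls.

import asyncio

async def call_component(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for network + inference time
    return f"{name}: done"

async def fan_out() -> list[str]:
    # e.g., two retrievers and a classifier that do not depend on each other
    return await asyncio.gather(
        call_component("retriever_a", 0.05),
        call_component("retriever_b", 0.05),
        call_component("classifier", 0.05),
    )

results = asyncio.run(fan_out())
```

Only calls with no data dependency can be fanned out this way; a step that consumes another step's output still has to wait, which is why pipeline structure directly shapes achievable latency.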
ALTO (An Efficient Network Orchestrator for Compound AI Systems), presented at OSDI 2024, addresses the network orchestration challenge in distributed compound AI systems, optimizing how data flows between components across a network [6].
Production compound AI systems must balance quality against cost. Model routing, where a classifier directs simple queries to smaller, cheaper models and complex queries to larger, more expensive ones, is a common cost optimization strategy. Research on optimizing model selection for compound AI systems has explored how to make these routing decisions automatically [7].
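A minimal sketch of such a router, with a length-and-keyword heuristic standing in for a trained difficulty classifier; the model labels and the threshold are illustrative, not a real provider's API:

```python
# Cost-aware model routing: estimate query difficulty cheaply, then send
# easy queries to a small model and hard ones to a large model.

def estimate_difficulty(query: str) -> float:
    """Toy difficulty score in [0, 1]; real routers use trained classifiers."""
    score = min(len(query.split()) / 50, 1.0)  # longer queries score higher
    if any(w in query.lower() for w in ("prove", "derive", "analyze")):
        score = max(score, 0.8)  # reasoning-heavy verbs force the large model
    return score

def route(query: str, threshold: float = 0.5) -> str:
    return "large-model" if estimate_difficulty(query) >= threshold else "small-model"
```

The threshold is itself a tunable system parameter: raising it cuts cost at the risk of sending hard queries to the weaker model, which is exactly the quality-versus-cost tradeoff routing is meant to manage.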
Agentic AI can be understood as a specific class of compound AI systems: those in which an LLM dynamically controls the system's execution flow. All AI agents are compound AI systems (they combine models with tools, memory, and orchestration), but not all compound AI systems are agents (a fixed RAG pipeline, for example, does not involve dynamic decision-making by the model) [2].
The compound systems framework provides a useful lens for analyzing agent architectures. An agent's effectiveness depends not only on the quality of its core LLM but also on the quality of its tools, the reliability of its memory, the robustness of its guardrails, and the efficiency of its orchestration. Improving any of these components can improve the agent's overall performance, even without changing the underlying model.
As agentic AI has grown more prominent through 2024 and 2025, the compound systems perspective has become increasingly relevant. Complex agent workflows involving multiple specialized agents, shared knowledge bases, and sophisticated orchestration are fundamentally compound systems problems [2].
The growth of compound AI systems has driven the development of specialized frameworks and tools.
| Framework | Primary focus | Key features |
|---|---|---|
| DSPy | Pipeline optimization | Automatic prompt optimization, modular signatures, teleprompters |
| LangChain | General-purpose LLM application building | Chains, agents, retrieval, tool integration, LangGraph for stateful workflows |
| LlamaIndex | Data-connected LLM applications | Document ingestion, indexing, retrieval, query engines |
| Haystack | Production-ready NLP pipelines | Component-based pipeline design, retrieval, document processing |
| Semantic Kernel | Enterprise AI orchestration (Microsoft) | Planner, skills, connectors, memory integration |
| CrewAI | Multi-agent systems | Role-based agents, task delegation, collaborative workflows |
These frameworks share a common architectural philosophy: decompose AI applications into modular, composable components that can be developed, tested, and optimized independently [2].
Compound AI systems have become the standard architecture for enterprise AI applications. A 2024 survey of enterprise AI practitioners found that the vast majority of production LLM applications used some form of compound system architecture, most commonly RAG. Enterprises favor compound systems because they offer:

- Grounding in proprietary data through retrieval, without retraining or fine-tuning models
- Auditability: intermediate outputs can be logged and inspected for governance and compliance
- Cost control through routing, caching, and component-level model selection
- Incremental improvement: individual components can be upgraded without rebuilding the system
A 2024 paper from IBM Research, "A Blueprint Architecture of Compound AI Systems for Enterprise," proposed a reference architecture for enterprise compound systems that includes components for data management, model orchestration, evaluation, governance, and deployment [8].
As of early 2026, the compound AI systems paradigm has moved from a research insight to mainstream practice. Several developments characterize the current landscape.
First, the distinction between "model" and "system" has become standard vocabulary in AI engineering. Major cloud providers (AWS, Google Cloud, Azure) offer compound system building blocks as managed services, including retrieval, guardrails, orchestration, and evaluation components.
Second, optimization of compound systems remains an active research frontier. DSPy and related work have demonstrated that automatic optimization is possible, but scaling these techniques to production systems with many components and complex interactions is still challenging [4].
Third, the relationship between compound systems and foundation models is evolving. Newer models increasingly incorporate capabilities (tool use, retrieval, structured output) that were previously handled by external components. This trend does not eliminate the need for compound systems but changes which components are external versus built into the model [1].
Fourth, the agentic AI trend has made compound systems more complex and dynamic. Multi-agent workflows, where multiple LLM-powered agents collaborate on tasks, represent some of the most sophisticated compound systems being built. These systems push the boundaries of what current frameworks and optimization techniques can handle.
The Berkeley authors' original prediction, that compound AI systems would be one of the most important trends in AI, has been borne out. The trend reflects a maturation of the field: as AI moves from research demonstrations to production applications, the engineering of complete systems, not just models, becomes the central challenge.