A compound AI system is an AI system that achieves its objectives by combining multiple interacting components, such as large language models, retrieval mechanisms, external tools, guardrails, and orchestration logic, rather than relying on a single monolithic model. The term was popularized by Matei Zaharia, Omar Khattab, and colleagues at UC Berkeley in a February 2024 blog post titled "The Shift from Models to Compound AI Systems," which argued that state-of-the-art AI results are increasingly obtained not by training bigger models but by composing multiple components into integrated systems [1].
The compound AI system perspective represents a significant shift in how AI practitioners think about building applications. Instead of asking "which model should I use?" the question becomes "what system of components, including models, retrievers, tools, and control logic, will best solve this problem?" This framing has become central to modern AI engineering, influencing the design of retrieval-augmented generation (RAG) pipelines, AI agents, and production AI applications across industries [1].
On February 18, 2024, a group of researchers published "The Shift from Models to Compound AI Systems" on the Berkeley Artificial Intelligence Research (BAIR) blog. The authors included Matei Zaharia (co-creator of Apache Spark and co-founder of Databricks), Omar Khattab (creator of DSPy), Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi [1].
The post made several key observations:

- State-of-the-art AI results are increasingly achieved by compound systems rather than single model calls; the authors pointed to examples such as AlphaCode and ChatGPT Plus with plugins [1].
- Some tasks are easier to improve through system design than through additional model training, because engineering iteration is faster and cheaper than large training runs [1].
- Systems can be dynamic in ways static models cannot, for example incorporating real-time data through retrieval and respecting access controls [1].
- System design improves control and trust: components such as filters and verifiers can enforce behavior that prompting a single model cannot guarantee [1].
- Performance and cost goals vary widely across applications, and composing components lets developers tune that tradeoff per use case [1].
The authors defined a compound AI system broadly as any system that combines multiple AI components, potentially including non-AI components, to accomplish a task. This definition intentionally encompasses a wide spectrum of architectures, from a simple LLM plus retriever pipeline to complex multi-agent systems with dozens of interacting components [1].
The compound systems perspective challenged the prevailing narrative in AI, which had focused heavily on scaling individual models. The scaling laws paradigm, popularized by research at OpenAI and elsewhere, suggested that the primary path to better AI was training larger models on more data with more compute. While this approach yielded remarkable results, it faced several limitations that compound systems could address.
Single-model approaches, regardless of model size, encounter fundamental limitations that motivate the compound systems approach.
| Limitation | Description | Compound system solution |
|---|---|---|
| Hallucination | Models generate plausible but factually incorrect information | Retrieval components provide grounding in verified sources |
| Stale knowledge | Model training data has a cutoff date; the model does not know about recent events | Real-time retrieval from up-to-date databases, APIs, or web search |
| No tool use | A standalone model cannot execute code, query databases, or interact with external services | Tool-use components allow the system to take actions in the real world |
| Lack of specialization | A single generalist model may underperform domain-specific solutions | Multiple specialized models or components, each optimized for a specific subtask |
| Inconsistent quality | Model outputs vary in quality and may not meet reliability requirements | Guardrails, validators, and retry logic enforce quality standards |
| Context window limits | Models have finite context windows that limit how much information they can process at once | Retrieval and summarization components manage information flow across large document collections |
| Opacity | Model reasoning is difficult to inspect or debug | Structured pipelines with intermediate outputs enable better observability |
| Cost and latency | The largest models are expensive and slow; not every subtask requires frontier-model capability | Routing and model selection components direct each subtask to the most cost-effective model |
Compound systems address these limitations not by building better individual models (though they benefit from model improvements) but by combining models with complementary components that compensate for their weaknesses [1].
A compound AI system integrates multiple types of components, each serving a distinct function within the overall architecture.
| Component | Function | Examples |
|---|---|---|
| Language model (LLM) | Generates text, reasons about tasks, makes decisions | GPT-4, Claude, Gemini, Llama, Mistral |
| Retriever | Fetches relevant information from external knowledge sources | Vector databases, semantic search engines, BM25 indexes |
| Tools / APIs | Execute actions in external systems: run code, query databases, call web services | Code interpreters, calculators, web browsers, database connectors |
| Guardrails | Validate and constrain inputs and outputs to enforce safety, quality, or policy requirements | Content filters, format validators, factuality checkers |
| Memory | Stores and retrieves information across interactions or processing steps | Conversation history stores, knowledge graphs, working memory buffers |
| Orchestrator | Controls the flow of execution, deciding which components to invoke and in what order | LangGraph, programmatic control flow, state machines |
| Evaluator / verifier | Assesses intermediate or final outputs for correctness, relevance, or quality | LLM-as-judge, unit tests, assertion checks |
| Router / classifier | Directs inputs to the appropriate processing path based on their characteristics | Intent classifiers, difficulty estimators, model selectors |
Components in a compound AI system interact through several common patterns; the most widespread are retrieval-augmented generation, agentic tool use, and multi-model pipelines.
Retrieval-augmented generation is the most widely deployed type of compound AI system. A RAG system combines an LLM with a retrieval component that fetches relevant documents from a knowledge base before the LLM generates a response. This simple two-component architecture addresses two major model limitations: hallucination (by grounding responses in retrieved evidence) and stale knowledge (by retrieving from up-to-date sources) [3].
A basic RAG pipeline typically involves:

1. Embedding the user's query into a vector representation
2. Retrieving the top-k most similar documents or chunks from an index
3. Inserting the retrieved text into the LLM's prompt as context
4. Generating a response grounded in that context
More advanced RAG systems add components for query decomposition, multi-step retrieval, re-ranking, citation verification, and fallback strategies, illustrating how compound systems grow in complexity as requirements increase [3].
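The basic two-component pipeline can be sketched in plain Python. The retriever below is a toy word-overlap scorer and `generate` is a placeholder for a real LLM API call; both are illustrative stand-ins, since production systems use embedding-based vector search and a hosted model:

```python
# Minimal sketch of a two-component RAG pipeline. The retriever is a toy
# word-overlap scorer; real systems rank by embedding similarity instead.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and return the top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    # Placeholder: a production system would call an LLM API here.
    return f"[LLM answer grounded in prompt of {len(prompt)} chars]"

def rag_answer(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

The key property is visible even in the sketch: the model only sees evidence selected at query time, so updating the corpus updates the system's knowledge without touching the model.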
AI agents represent a more complex class of compound AI systems in which an LLM serves as the central reasoning component that dynamically decides which actions to take. Unlike fixed pipelines, agents operate in open-ended loops: the LLM observes the current state, selects a tool or action, observes the result, and repeats until the task is complete [2].
Agent architectures typically combine:

- An LLM serving as the reasoning and planning core
- A set of tools the model can invoke (search, code execution, external APIs)
- Memory for tracking state across steps
- An orchestration loop with stopping criteria and step limits
Frameworks like LangChain, LlamaIndex, and CrewAI provide infrastructure for building agent-based compound systems. Multi-agent systems, where multiple specialized agents collaborate on a task, add another layer of composition [2].
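The open-ended observe-act loop at the core of an agent can be sketched as follows. The `policy` function is a scripted stand-in for the LLM's action selection, and the tool set is invented for illustration; a real agent would parse a model's output into a tool name and arguments:

```python
# Sketch of an agent's observe-act loop. `policy` stands in for an LLM
# deciding the next action from the task and the history so far.

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy tool; never eval untrusted input
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}

def policy(task: str, history: list) -> tuple[str, str]:
    """Stand-in for the LLM: pick the next (action, argument) pair."""
    if not history:
        return ("lookup", "capital_of_france")
    return ("finish", history[-1][1])  # finish with the last observation

def run_agent(task: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        action, arg = policy(task, history)
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)  # act, then observe the result
        history.append((action, observation))
    return "max steps exceeded"
```

The step limit matters in practice: because the model, not a fixed pipeline, controls the flow, agents need explicit guards against unbounded loops.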
Some compound systems use multiple models, each specialized for different subtasks. For example, a document processing system might use:

- A small, fast classifier to route documents by type
- A vision or OCR model to extract text from scanned pages
- A mid-sized model for summarization and structured extraction
- A frontier model reserved for the hardest reasoning steps
This approach allows each component to be optimized independently for its specific subtask, often achieving better overall performance than a single large model attempting to handle everything [1].
Google DeepMind's AlphaCode, which achieved competitive-level performance on programming contests, exemplifies a compound system. Rather than having a single model generate one solution, the system generates a large number of candidate programs, filters and clusters them, and selects the best candidates for submission. This generate-filter-select pipeline outperformed what any single model call could achieve [1].
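The generate-filter-select pattern can be illustrated with toy candidates, using plain Python functions in place of sampled programs. Everything here is a simplified stand-in, not AlphaCode's actual implementation:

```python
# Toy generate-filter-select pipeline: filter candidates on an example test,
# cluster survivors by behavior on probe inputs, keep one per cluster.

from collections import defaultdict

# "Generated" candidates for the task "double the input".
candidates = [
    lambda x: x * 2,      # correct
    lambda x: x + x,      # correct, behaviorally identical to the first
    lambda x: x ** 2,     # wrong
    lambda x: x * 2 + 1,  # wrong
]

def filter_candidates(cands, example):
    inp, expected = example
    return [f for f in cands if f(inp) == expected]

def cluster_and_select(cands, probes):
    clusters = defaultdict(list)
    for f in cands:
        # Candidates with identical outputs on the probes share a cluster.
        clusters[tuple(f(p) for p in probes)].append(f)
    return [group[0] for group in clusters.values()]

survivors = filter_candidates(candidates, example=(3, 6))
finalists = cluster_and_select(survivors, probes=[0, 1, 5])
```

Filtering removes candidates that fail the known example, and clustering deduplicates behaviorally equivalent survivors, which is how the real system narrowed a huge candidate pool to a handful of submissions.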
DSPy (Declarative Self-improving Python) is a framework created by Omar Khattab and colleagues, initially at Stanford NLP, with its accompanying paper published at ICLR 2024. It provides a programming model for building and optimizing compound AI systems. DSPy's central thesis is that building effective AI systems should involve programming, not prompting: developers write modular code that defines the structure of their system, and automated optimization finds the best prompts, examples, and parameters [4].
Traditional LLM application development relies heavily on prompt engineering: manually crafting prompt templates through trial and error. This approach has several weaknesses:

- Prompts are brittle: small wording changes can significantly alter outputs
- Prompts tuned for one model often fail to transfer to another
- Manual iteration is slow and does not scale to pipelines with many modules
- There is no principled way to know when a prompt is close to optimal
DSPy addresses these problems by treating prompts as parameters that can be optimized automatically, similar to how weights are optimized in neural network training [4].
In DSPy, developers define their compound AI system as a series of modules (each representing an LLM call or other operation) with typed input/output signatures. Rather than writing specific prompt text, the developer specifies what each module should accomplish (its input and output types and a brief description). DSPy then uses optimization algorithms to automatically determine the best instructions, few-shot examples, and other prompt parameters for each module [4].
The optimization process works by:

1. Running the pipeline on a set of training examples
2. Collecting traces of executions that score well on a user-defined metric
3. Bootstrapping few-shot demonstrations for each module from those traces
4. Searching over candidate instructions and demonstrations to maximize the metric
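The prompts-as-parameters idea can be illustrated with a toy search loop. This is not DSPy's actual API; `mock_lm`, the candidate instructions, and the dev set are all invented for illustration:

```python
# Toy illustration of treating the prompt as an optimizable parameter:
# score each candidate instruction on a labeled dev set, keep the best.

def mock_lm(instruction: str, question: str) -> str:
    """Stand-in for an LLM call; a real optimizer would query a model here."""
    # Pretend the model answers correctly only under a sufficiently
    # specific instruction.
    answers = {"2+2": "4", "3+3": "6"}
    return answers[question] if "step by step" in instruction else "unsure"

def score(instruction: str, devset: list) -> float:
    """Fraction of dev-set questions answered correctly (the metric)."""
    return sum(mock_lm(instruction, q) == gold for q, gold in devset) / len(devset)

def optimize(candidates: list, devset: list) -> str:
    """Search over candidate instructions, keeping the best-scoring one."""
    return max(candidates, key=lambda ins: score(ins, devset))

devset = [("2+2", "4"), ("3+3", "6")]
best = optimize(["Answer the question.",
                 "Think step by step, then answer."], devset)
```

Real optimizers search a far larger space (instructions, few-shot demonstrations, and their combinations across every module), but the structure is the same: a metric, a candidate generator, and a selection loop, with no human prompt-tweaking in between.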
DSPy demonstrated that automatically optimized pipelines could outperform carefully hand-crafted prompts. In experiments, a few lines of DSPy code allowed GPT-3.5 and Llama-2-13b-chat to self-bootstrap pipelines that outperformed standard few-shot prompting (by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively) [4].
Since its initial release, DSPy has evolved significantly. Later versions expanded the optimization algorithms, added support for more LLM providers, and improved the developer experience. The framework has been adopted by researchers and practitioners building production compound AI systems across various domains [4].
The compound systems approach and the monolithic model approach represent different philosophies for building AI applications.
| Dimension | Monolithic model | Compound AI system |
|---|---|---|
| Architecture | Single large model handles all aspects of the task | Multiple specialized components collaborate |
| Improvement path | Train a larger model or fine-tune on more data | Add, replace, or optimize individual components |
| Knowledge update | Requires retraining or fine-tuning | Update the retrieval source or tool without touching the model |
| Debugging | Opaque: difficult to identify why a specific output was produced | Modular: intermediate outputs from each component can be inspected |
| Cost structure | Fixed cost per query based on model size | Variable cost: simple queries can use cheaper components |
| Reliability | Dependent on the single model's capabilities | Can add verification, retry, and fallback mechanisms |
| Development speed | Fast for prototypes; slow for reliability improvements | More initial setup; faster iteration on individual components |
| Customization | Requires fine-tuning or prompt engineering for the entire model | Can swap or customize individual components |
The Berkeley authors argued that compound systems would become the dominant paradigm for production AI applications, even as individual models continued to improve. Their reasoning was that compound systems offer more accessible paths to improvement: adding a retrieval component or a verification step requires engineering effort, not the massive compute budgets needed to train larger models [1].
Building effective compound AI systems introduces optimization challenges that do not exist with single-model approaches.
Each component in a compound system has parameters (prompts, retrieval configurations, model choices, thresholds) that affect the overall system's performance. Optimizing these parameters jointly is difficult because:

- The search space is combinatorial: choices for one component multiply with choices for every other
- Components interact, so a locally optimal setting for one module may degrade the end-to-end result
- Gradients do not flow through discrete prompts and black-box API calls, ruling out standard end-to-end training
- Evaluating a full pipeline configuration is slow and costly, limiting how many candidates can be tried
DSPy and similar frameworks attempt to automate this optimization, but the problem remains an active area of research [4].
A paper titled "Are More LLM Calls All You Need? Towards the Scaling Properties of Compound AI Systems," published at NeurIPS 2024, investigated how compound system performance scales with the number of LLM calls. The research found that simply adding more model calls (through ensembling, verification, or multi-step reasoning) could improve performance, but with diminishing returns that depend on the specific task and system architecture [5].
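The diminishing-returns effect can be illustrated with a simple majority-voting model: if each of n independent calls is correct with probability p and the system takes a majority vote, accuracy rises with n but each additional call helps less. The independence assumption is a strong simplification of the paper's actual setting:

```python
# Majority-vote accuracy over n independent calls, each correct with
# probability p. Accuracy improves with n, but with shrinking increments.

from math import comb

def majority_correct(p: float, n: int) -> float:
    """P(majority of n odd independent calls is correct)."""
    k = n // 2 + 1  # votes needed for a majority
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Accuracy for an ensemble of 1, 3, 5, 11, and 21 calls at p = 0.7.
gains = [majority_correct(0.7, n) for n in (1, 3, 5, 11, 21)]
```

Under this model, going from 1 to 3 calls buys more accuracy than going from 11 to 21, mirroring the paper's finding that scaling the number of calls has task- and architecture-dependent diminishing returns.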
Compound systems inherently involve more computation than single-model calls. Maintaining acceptable latency requires careful design:

- Running independent component calls in parallel rather than sequentially
- Streaming intermediate and final outputs instead of waiting for complete responses
- Caching retrieval results and repeated model calls
- Using smaller, faster models for latency-critical steps
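A common latency technique is to run independent component calls concurrently rather than sequentially. A minimal asyncio sketch, with `asyncio.sleep` standing in for network and inference time, and the component names invented for illustration:

```python
# Fan out independent component calls concurrently: total latency is
# roughly the slowest call, not the sum of all calls.

import asyncio

async def call_component(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for network + inference time
    return f"{name}: done"

async def fan_out() -> list[str]:
    # e.g., two retrievers and a classifier that do not depend on each other
    return await asyncio.gather(
        call_component("retriever_a", 0.05),
        call_component("retriever_b", 0.05),
        call_component("classifier", 0.05),
    )

results = asyncio.run(fan_out())
```

Only calls with no data dependency can be fanned out this way; a step that consumes another step's output still has to wait, which is why pipeline structure directly shapes achievable latency.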
ALTO (An Efficient Network Orchestrator for Compound AI Systems), presented at OSDI 2024, addresses the network orchestration challenge in distributed compound AI systems, optimizing how data flows between components across a network [6].
Production compound AI systems must balance quality against cost. Model routing, where a classifier directs simple queries to smaller, cheaper models and complex queries to larger, more expensive ones, is a common cost optimization strategy. Research on optimizing model selection for compound AI systems has explored how to make these routing decisions automatically [7].
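A minimal sketch of such a router, with a length-and-keyword heuristic standing in for a trained difficulty classifier; the model labels and the threshold are illustrative, not a real provider's API:

```python
# Cost-aware model routing: estimate query difficulty cheaply, then send
# easy queries to a small model and hard ones to a large model.

def estimate_difficulty(query: str) -> float:
    """Toy difficulty score in [0, 1]; real routers use trained classifiers."""
    score = min(len(query.split()) / 50, 1.0)  # longer queries score higher
    if any(w in query.lower() for w in ("prove", "derive", "analyze")):
        score = max(score, 0.8)  # reasoning-heavy verbs force the large model
    return score

def route(query: str, threshold: float = 0.5) -> str:
    return "large-model" if estimate_difficulty(query) >= threshold else "small-model"
```

The threshold is itself a tunable system parameter: raising it cuts cost at the risk of sending hard queries to the weaker model, which is exactly the quality-versus-cost tradeoff routing is meant to manage.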
Agentic AI can be understood as a specific class of compound AI systems: those in which an LLM dynamically controls the system's execution flow. All AI agents are compound AI systems (they combine models with tools, memory, and orchestration), but not all compound AI systems are agents (a fixed RAG pipeline, for example, does not involve dynamic decision-making by the model) [2].
The compound systems framework provides a useful lens for analyzing agent architectures. An agent's effectiveness depends not only on the quality of its core LLM but also on the quality of its tools, the reliability of its memory, the robustness of its guardrails, and the efficiency of its orchestration. Improving any of these components can improve the agent's overall performance, even without changing the underlying model.
As agentic AI has grown more prominent through 2024 and 2025, the compound systems perspective has become increasingly relevant. Complex agent workflows involving multiple specialized agents, shared knowledge bases, and sophisticated orchestration are fundamentally compound systems problems [2].
The growth of compound AI systems has driven the development of specialized frameworks and tools.
| Framework | Primary focus | Key features |
|---|---|---|
| DSPy | Pipeline optimization | Automatic prompt optimization, modular signatures, teleprompters |
| LangChain | General-purpose LLM application building | Chains, agents, retrieval, tool integration, LangGraph for stateful workflows |
| LlamaIndex | Data-connected LLM applications | Document ingestion, indexing, retrieval, query engines |
| Haystack | Production-ready NLP pipelines | Component-based pipeline design, retrieval, document processing |
| Semantic Kernel | Enterprise AI orchestration (Microsoft) | Planner, skills, connectors, memory integration |
| CrewAI | Multi-agent systems | Role-based agents, task delegation, collaborative workflows |
These frameworks share a common architectural philosophy: decompose AI applications into modular, composable components that can be developed, tested, and optimized independently [2].
Compound AI systems have become the standard architecture for enterprise AI applications. A 2024 survey of enterprise AI practitioners found that the vast majority of production LLM applications used some form of compound system architecture, most commonly RAG. Enterprises favor compound systems because they offer:

- Grounding in proprietary data through retrieval, without retraining or fine-tuning models
- Auditability: intermediate outputs can be logged and inspected for governance and compliance
- Cost control through routing, caching, and component-level model selection
- Incremental improvement: individual components can be upgraded without rebuilding the system
A 2024 paper from IBM Research, "A Blueprint Architecture of Compound AI Systems for Enterprise," proposed a reference architecture for enterprise compound systems that includes components for data management, model orchestration, evaluation, governance, and deployment [8].
As of early 2026, the compound AI systems paradigm has moved from a research insight to mainstream practice. Several developments characterize the current landscape.
First, the distinction between "model" and "system" has become standard vocabulary in AI engineering. Major cloud providers (AWS, Google Cloud, Azure) offer compound system building blocks as managed services, including retrieval, guardrails, orchestration, and evaluation components.
Second, optimization of compound systems remains an active research frontier. DSPy and related work have demonstrated that automatic optimization is possible, but scaling these techniques to production systems with many components and complex interactions is still challenging [4].
Third, the relationship between compound systems and foundation models is evolving. Newer models increasingly incorporate capabilities (tool use, retrieval, structured output) that were previously handled by external components. This trend does not eliminate the need for compound systems but changes which components are external versus built into the model [1].
Fourth, the agentic AI trend has made compound systems more complex and dynamic. Multi-agent workflows, where multiple LLM-powered agents collaborate on tasks, represent some of the most sophisticated compound systems being built. These systems push the boundaries of what current frameworks and optimization techniques can handle.
The Berkeley authors' original prediction, that compound AI systems would be one of the most important trends in AI, has been borne out. The trend reflects a maturation of the field: as AI moves from research demonstrations to production applications, the engineering of complete systems, not just models, becomes the central challenge.