A multi-agent system (MAS) is a system composed of multiple interacting intelligent agents that collaborate, compete, or negotiate to accomplish tasks that would be difficult or impossible for a single agent. In the context of modern artificial intelligence, multi-agent systems have evolved from their roots in distributed computing and game theory into sophisticated architectures where multiple large language model-powered agents work together on complex problems such as software development, scientific research, and data analysis.
The concept of multi-agent systems originated in the field of distributed artificial intelligence (DAI) during the 1980s. Early research focused on how independent software agents could coordinate their actions to solve problems that no single agent could handle alone. These systems drew on principles from economics, game theory, and organizational theory to model agent interactions.
Classical MAS research addressed questions like how agents should divide labor, how they should communicate, and how conflicts between agents with competing goals should be resolved. The Nash equilibrium, a concept from game theory, became an important tool for analyzing multi-agent interactions where each agent's optimal strategy depends on the choices made by other agents.
Key areas of classical MAS research included:

- Task allocation: dividing labor among agents with different capabilities
- Communication: protocols and languages for exchanging information between agents
- Negotiation and conflict resolution: reconciling agents with competing goals
- Game-theoretic analysis: modeling strategic interactions where each agent's best strategy depends on the others
These foundational ideas continue to influence how modern LLM-based multi-agent systems are designed, even as the underlying technology has changed dramatically.
The arrival of large language models, particularly GPT-4 and Claude in 2023, triggered a wave of experimentation with LLM-powered multi-agent systems. Researchers and engineers discovered that assigning different roles, instructions, and tools to multiple LLM instances could produce results superior to what a single LLM could achieve, even with extensive prompt engineering.
In an LLM-based MAS, each agent is typically an LLM instance configured with a specific persona, set of instructions, and access to particular tools or data sources. One agent might be configured as a "researcher" with access to web search, while another might serve as a "code reviewer" with access to a codebase. These agents communicate by exchanging natural language messages, and a coordination mechanism determines the order and flow of their interactions.
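This configuration pattern can be shown in a minimal Python sketch. The `fake_llm` function is a stand-in for a real model call, and the agent names and tool strings are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass, field

def fake_llm(agent_name: str, message: str) -> str:
    # Stand-in for a real LLM API call; returns a tagged placeholder reply.
    return f"[{agent_name}] response to: {message}"

@dataclass
class Agent:
    name: str
    persona: str                         # system-prompt-style role description
    tools: list = field(default_factory=list)

    def respond(self, message: str) -> str:
        return fake_llm(self.name, message)

# Two specialized agents: one with search access, one with codebase access.
researcher = Agent("researcher", "You research topics.", tools=["web_search"])
reviewer = Agent("reviewer", "You review code.", tools=["read_codebase"])

# A trivial coordination mechanism: the researcher's output feeds the reviewer.
draft = researcher.respond("Summarize recent MAS papers")
review = reviewer.respond(draft)
```

In a real deployment the coordination loop would be driven by a framework, and `persona` and `tools` would be passed to the model as a system prompt and tool schema.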
The key insight behind modern multi-agent systems is that specialization improves performance. Rather than asking a single LLM to handle an entire complex workflow (writing code, reviewing it, testing it, and documenting it), the work is divided among agents that each focus on one aspect. This mirrors how human teams operate: specialists in different domains collaborate to produce better outcomes than any individual could.
By 2025, LLM-based multi-agent systems had moved from academic experiments to production deployments. Organizations began using them for automated software development pipelines, research workflows, customer service operations, and data analysis tasks.
Several open-source and commercial frameworks have emerged to simplify building multi-agent systems. The following table compares the major frameworks as of early 2026.
| Framework | Developer | Release Year | Architecture Style | Key Features | License |
|---|---|---|---|---|---|
| AutoGen | Microsoft | 2023 | Conversational agents | Flexible agent routing, async communication, LLM caching with Redis/disk, human-in-the-loop support | MIT |
| CrewAI | CrewAI Inc. | 2024 | Role-based orchestration | Role definitions for agents, task delegation, beginner-friendly API, teamwork-oriented workflows | MIT |
| LangGraph | LangChain | 2024 | State machine / graph | Explicit node-and-edge control flow, parallel execution, state persistence, reached v1.0 in late 2025 | MIT |
| Swarm | OpenAI | 2024 | Lightweight handoffs | Minimal stateless abstraction, educational focus, client-side execution; replaced by OpenAI Agents SDK in 2025 | MIT |
| Claude Agent Teams | Anthropic | 2026 | Lead-plus-teammates | Multiple Claude Code instances with a team lead, inter-agent messaging, shared task management | Proprietary |
| MetaGPT | DeepWisdom | 2023 | Software company simulation | Structured communication (not free-form natural language), SOP-based workflows with product managers, architects, and engineers | MIT |
| ChatDev | Tsinghua University | 2023 | Chat-powered development | Role-playing agents guided through design, coding, and testing phases using natural and programming languages | Apache 2.0 |
| CAMEL | CAMEL-AI | 2023 | Role-playing conversation | Prompt-defined agent personalities, two-or-three-agent conversations for task completion | Apache 2.0 |
Microsoft's AutoGen defines agents as adaptive units that communicate through asynchronous message exchanges. It supports flexible routing so that messages can flow between agents (and optionally humans) based on the content and context of the conversation. One of its distinguishing features is LLM response caching, which can use disk or Redis backends. This allows shared caches across agents, reducing costs and improving reproducibility. AutoGen has become one of the most popular frameworks for enterprise multi-agent deployments.
CrewAI takes a role-driven approach to multi-agent orchestration. Users define agents by specifying who the agent is, what it should do, and what tools it has access to, similar to writing a job description. CrewAI then handles orchestration, making it practical for teams that think about workflows in terms of human roles and responsibilities. In benchmark comparisons, CrewAI tasks typically complete in 45 to 60 seconds for a standard four-agent workflow with 8 to 12 LLM calls.
LangGraph models agent workflows as explicit state machines where developers define nodes (processing steps) and edges (transitions between steps). This gives maximum control over execution flow at the cost of more code for simple workflows. LangGraph reached version 1.0 in late 2025 and became the default runtime for all LangChain agents. It leads in token efficiency because it minimizes redundant LLM calls through direct state transitions rather than repetitive chat history. A four-agent workflow in LangGraph typically completes in 25 to 35 seconds with parallel node execution.
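The node-and-edge model can be illustrated with a small plain-Python executor. This is a conceptual sketch, not the LangGraph API: nodes are functions that transform a shared state dict, and an edge table picks the next node:

```python
# Nodes: each processing step reads and updates the shared state.
def plan(state):
    state["steps"] = ["write", "review"]
    return state

def write(state):
    state["draft"] = "fn main() {}"
    return state

def review(state):
    state["approved"] = "fn main" in state["draft"]
    return state

NODES = {"plan": plan, "write": write, "review": review}
EDGES = {"plan": "write", "write": "review", "review": None}  # linear flow

def run(start: str, state: dict) -> dict:
    node = start
    while node is not None:
        state = NODES[node](state)   # execute the current node
        node = EDGES[node]           # follow the edge to the next node
    return state

final = run("plan", {})
```

Because control flow lives in the edge table rather than in accumulated chat history, each node sees only the state it needs, which is the source of the token-efficiency advantage described above.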
OpenAI released Swarm in October 2024 as a lightweight, educational framework for multi-agent orchestration. Swarm was deliberately minimal: an agent encapsulates instructions and functions, with explicit handoff capabilities to other agents. It ran almost entirely on the client side and did not store state between calls. OpenAI later replaced Swarm with the Agents SDK, a production-ready evolution with active maintenance and improved capabilities. OpenAI recommends migrating all production use cases to the Agents SDK.
Anthropic's agent teams feature, officially released on February 5, 2026, allows coordination of multiple Claude Code instances working together. One session acts as the team lead, coordinating work and assigning tasks, while teammates work independently in their own context windows and communicate with each other directly. In a notable stress test, 16 agents working across nearly 2,000 Claude Code sessions produced a 100,000-line Rust-based C compiler capable of building Linux 6.9 on x86, ARM, and RISC-V architectures.
MetaGPT simulates the structure of a software company, with agents taking on roles such as product manager, architect, project manager, and engineer. Unlike most LLM-based multi-agent frameworks, MetaGPT uses structured communication rather than unconstrained natural language for agent interactions. Given a single-line requirement as input, MetaGPT produces user stories, competitive analysis, requirements documents, data structures, APIs, and code. DeepWisdom launched MGX (MetaGPT X) on February 19, 2025, described as the world's first AI agent development team product.
Developed by researchers at Tsinghua University, ChatDev is a chat-powered software development framework where specialized agents collaborate through design, coding, and testing phases. The agents communicate using both natural language and programming languages. In comparative studies, ChatDev outperformed MetaGPT on code quality metrics due to its cooperative communication methods.
CAMEL (Communicative Agents for "Mind" Exploration of Large Language Model Society) is a role-playing-based framework that demonstrates how prompting can define agent personalities. It supports two- or three-agent conversations where agents with defined roles work toward task completion through structured dialogue.
Multi-agent systems use several architectural patterns, each with distinct tradeoffs in control, flexibility, and complexity.
In a hierarchical architecture, a lead agent (sometimes called an orchestrator or supervisor) decomposes the overall task and delegates subtasks to worker agents. The lead agent collects results, resolves conflicts, and synthesizes the final output. This pattern is straightforward to implement and reason about, but the lead agent can become a bottleneck. If the orchestrator makes a poor decomposition decision, the entire workflow suffers. Claude Agent Teams and CrewAI both support hierarchical orchestration.
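A minimal sketch of hierarchical delegation, with hardcoded stubs where a real lead agent would call an LLM to decompose the task and real workers would do the work:

```python
def decompose(task: str) -> list:
    # The lead agent would use an LLM to split the task; hardcoded here.
    return [f"{task}: design", f"{task}: implement", f"{task}: test"]

# Worker agents, stubbed as simple functions keyed by role.
WORKERS = {
    "design": lambda t: f"design doc for {t}",
    "implement": lambda t: f"code for {t}",
    "test": lambda t: f"test report for {t}",
}

def orchestrate(task: str) -> str:
    results = []
    for subtask in decompose(task):
        role = subtask.split(": ")[1]      # route each subtask to a worker
        results.append(WORKERS[role](subtask))
    return " | ".join(results)             # the lead synthesizes the output

out = orchestrate("build CLI")
```

The bottleneck risk is visible even in this toy: every subtask flows through `decompose` and the final join, so a bad decomposition poisons everything downstream.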
In a flat architecture, agents operate as peers and communicate directly with each other, requesting input or support as needed. There is no central coordinator. This enables high flexibility and parallelism but introduces complexity in managing communication protocols and preventing circular dependencies. LangGraph's network architecture and AutoGen's conversational patterns can both support flat topologies.
In the debate pattern, multiple agents independently analyze the same problem and then present their conclusions. A judge agent (or the agents themselves through discussion) evaluates the competing analyses and selects or synthesizes the best answer. This approach is effective for tasks where verification is important, such as fact-checking or code review. Research has shown that LLM debate can reduce hallucination rates by forcing agents to justify their reasoning to skeptical counterparts.
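A toy version of the debate pattern, with three stubbed debaters and majority voting standing in for LLM-based adjudication:

```python
from collections import Counter

# Stubbed debaters: each would be an independent LLM call in practice.
def debater_a(question): return "4"
def debater_b(question): return "4"
def debater_c(question): return "5"   # a dissenting (wrong) agent

def debate(question: str) -> str:
    answers = [d(question) for d in (debater_a, debater_b, debater_c)]
    winner, _ = Counter(answers).most_common(1)[0]  # judge by majority
    return winner
```

Real debate systems replace the vote with a judge agent that reads each side's justification, but the structure is the same: independent analyses first, adjudication second.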
Role-playing architectures assign distinct personas to agents (e.g., "customer," "support agent," "supervisor") and let them interact according to those roles. CAMEL pioneered this approach, and ChatDev refined it for software development. Role-playing can produce more diverse and creative outputs because agents with different personas approach problems differently.
In the handoff pattern, tasks flow linearly from one agent to the next. Each agent performs its operation and passes control to the next agent in the chain. OpenAI's Swarm was built around this concept, with explicit handoff functions controlling when and how control transfers between agents. This is the simplest architecture to implement but also the most fragile: if any agent in the chain fails, the process stalls.
The first step in most multi-agent workflows is breaking a complex task into smaller, manageable subtasks. This can happen in several ways:

- Upfront decomposition, where a lead or orchestrator agent plans all subtasks before delegating them
- Dynamic decomposition, where agents split off new subtasks as the work reveals them
- Predefined pipelines, where the system designer fixes the sequence of stages in advance
Chain-of-Thought (CoT) reasoning helps agents plan their decomposition by thinking through the problem step by step. Tree of Thoughts (ToT) allows exploration of multiple decomposition paths simultaneously, and Graph of Thought supports graph-structured reasoning for more complex task relationships.
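The difference between committing to one decomposition and exploring several can be sketched with a toy Tree-of-Thoughts-style search. Both the candidate paths and the scoring rule are illustrative assumptions; a real system would generate and evaluate them with LLM calls:

```python
def expand(task):
    # An LLM would propose alternative decompositions; hardcoded here.
    return [
        [f"{task} -> research", f"{task} -> write"],
        [f"{task} -> write"],                # a shallower alternative path
    ]

def score(path):
    # Stub evaluator: prefer more thorough decompositions.
    return len(path)

def best_decomposition(task):
    # Explore all candidate paths, keep the highest-scoring one.
    return max(expand(task), key=score)

chosen = best_decomposition("report")
```

Chain-of-Thought corresponds to generating a single path; Tree of Thoughts adds the `expand`/`score`/select loop over multiple candidates.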
Each agent in a multi-agent system is typically specialized for a particular role or domain. Specialization is achieved through a combination of system prompts (which define the agent's persona and instructions), tool access (which determines what actions the agent can take), and knowledge bases (which provide domain-specific information).
A software development MAS might include agents specialized in requirements analysis, system design, code generation, code review, testing, and documentation. Each agent focuses on what it does best, and the combined output exceeds what any single general-purpose agent could produce.
Agents in a multi-agent system need structured ways to exchange information. Common communication approaches include:
| Communication Method | Description | Used By |
|---|---|---|
| Direct messaging | Agents send natural language messages to specific other agents | AutoGen, Claude Agent Teams |
| Shared memory / blackboard | Agents read from and write to a shared state object | LangGraph |
| Structured artifacts | Agents exchange formatted documents, code, or data structures | MetaGPT |
| Broadcast | An agent sends a message to all other agents simultaneously | Custom implementations |
| Publish-subscribe | Agents subscribe to topics and receive relevant messages | Enterprise MAS deployments |
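The shared-memory (blackboard) method from the table above can be shown in a few lines: agents never address each other directly, they only read and write a common state object. The agents here are stubs:

```python
blackboard = {}

def collector(board):
    board["data"] = [3, 1, 2]                    # writes raw data

def analyst(board):
    board["sorted"] = sorted(board["data"])      # reads data, writes analysis

def reporter(board):
    board["report"] = f"min={board['sorted'][0]}, max={board['sorted'][-1]}"

# Agents run in turn against the shared state; none messages another directly.
for agent in (collector, analyst, reporter):
    agent(blackboard)
```

The tradeoff relative to direct messaging is that ordering and access control must be managed externally, which is why frameworks like LangGraph pair shared state with an explicit execution graph.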
Communication optimization is an active research area. Techniques include attentional communication (agents learn when communication is necessary), message filtering (subscription-based relevance determination), and structured protocols with built-in error handling.
Modern multi-agent systems do not just exchange messages; they also interact with external tools and environments. Agents may execute code, query databases, search the web, call APIs, or interact with computer interfaces. Anthropic's Model Context Protocol (MCP) has become a standard way to give agents access to tools, while Google's Agent2Agent (A2A) protocol handles agent-to-agent communication at a higher level.
Software development was one of the first domains where LLM-based multi-agent systems proved their value. Frameworks like MetaGPT and ChatDev showed that assigning different software engineering roles to agents could produce functional software from a single natural language requirement. As of 2025, multi-agent systems are used in production for:

- Writing code from natural language requirements
- Reviewing code for correctness and style
- Generating and running tests
- Producing and maintaining documentation
Anthropic's Claude Code agent teams demonstrated the scale achievable by multi-agent software development when 16 agents produced a working C compiler comprising 100,000 lines of Rust.
Multi-agent systems are increasingly used for research workflows where different agents handle literature search, data collection, analysis, and synthesis. A research MAS might include a "librarian" agent that searches academic databases, a "statistician" agent that analyzes data, and a "writer" agent that produces the final report. This parallel specialization significantly accelerates research timelines.
For data analysis tasks, multi-agent systems can assign specialized agents to different stages of the analysis pipeline: data cleaning, exploratory analysis, statistical modeling, and visualization. Each agent brings domain-specific tools and knowledge to its stage. Organizations have reported that multi-agent data analysis pipelines can process, in a matter of hours, datasets that would take human analysts weeks.
Enterprise deployments often use multi-agent systems for customer service, where a triage agent routes incoming requests to specialized agents handling billing, technical support, or account management. Each specialized agent has access to relevant internal systems and knowledge bases.
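The triage-and-route structure can be sketched as follows. The keyword rules and specialist responses are illustrative stand-ins; a production triage agent would classify requests with an LLM and the specialists would call internal systems:

```python
# Specialist agents, stubbed as functions keyed by department.
SPECIALISTS = {
    "billing": lambda msg: "billing team: invoice reviewed",
    "technical": lambda msg: "tech support: diagnostics started",
    "account": lambda msg: "account team: account updated",
}

def triage(message: str) -> str:
    # Toy keyword classifier standing in for an LLM-based router.
    text = message.lower()
    if "invoice" in text or "charge" in text:
        return SPECIALISTS["billing"](message)
    if "error" in text or "crash" in text:
        return SPECIALISTS["technical"](message)
    return SPECIALISTS["account"](message)
```

The key design property is that each specialist only ever sees requests in its domain, so each can be given narrow system access rather than organization-wide credentials.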
Multi-agent systems have been used to simulate social dynamics, economic markets, and organizational behavior. Projects like Stanford's "Generative Agents" (2023) demonstrated that LLM-powered agents with memory and planning capabilities could produce emergent social behaviors in a simulated town environment.
Google launched the Agent2Agent (A2A) protocol in April 2025, with support from more than 50 technology partners including Atlassian, Box, Cohere, Intuit, LangChain, MongoDB, PayPal, Salesforce, SAP, ServiceNow, UKG, and Workday.
The A2A protocol enables AI agents built on different frameworks to communicate with each other, exchange information securely, and coordinate actions across enterprise platforms. It was designed around five principles: embracing agentic capabilities (letting agents collaborate in unstructured modalities), building on existing standards (HTTP, SSE, JSON-RPC), being secure by default with enterprise-grade authentication, supporting long-running tasks, and allowing modality-agnostic communication.
A2A introduces several key abstractions:
| Concept | Description |
|---|---|
| Agent Cards | JSON documents that describe an agent's capabilities, skills, and connection information, enabling discovery |
| Tasks | The primary unit of work, with defined lifecycle states (submitted, working, completed, failed) |
| Messages | Structured communications between agents carrying context and instructions |
| Artifacts | Structured data and results that agents share across communication boundaries |
Agent discovery works through Agent Cards, which allow clients to locate and identify available remote agents without hardcoded connections. This enables a dynamic ecosystem where new agents can be discovered and utilized as they become available.
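A sketch of discovery via Agent Cards, with a simplified card shape (the field names here are illustrative; the A2A specification defines the authoritative schema):

```python
import json

# Toy registry of Agent Cards: JSON documents advertising skills and endpoints.
CARDS = [
    {"name": "translator", "skills": ["translate"], "url": "https://example.com/a"},
    {"name": "summarizer", "skills": ["summarize"], "url": "https://example.com/b"},
]

def discover(skill: str) -> list:
    """Return cards for remote agents advertising the given skill."""
    return [card for card in CARDS if skill in card["skills"]]

# Cards travel as JSON documents, so any client can parse them.
card_json = json.dumps(CARDS[0])
matches = discover("summarize")
```

Because clients match on advertised skills rather than hardcoded endpoints, new agents become reachable as soon as their cards are published.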
A2A is designed to complement Anthropic's Model Context Protocol (MCP). While MCP standardizes how agents connect to tools and data sources, A2A handles agent-to-agent communication. Together, they form a two-layer interoperability stack: MCP for agent-to-tool connections and A2A for agent-to-agent coordination.
In June 2025, Google contributed the A2A protocol to the Linux Foundation, establishing it as a vendor-neutral open standard. Version 1.0 was released with gRPC support, signed security cards, and extended client-side support in the Python SDK.
Every interaction between agents consumes tokens, and coordination messages can add up quickly. In a four-agent system, the overhead from inter-agent communication can account for 30 to 50 percent of total token usage. This makes multi-agent systems significantly more expensive than single-agent approaches for simple tasks. The cost-benefit tradeoff only favors multi-agent systems when the task is complex enough that specialization and parallelism provide genuine advantages.
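The overhead figure translates into a simple back-of-envelope cost model. All numbers below are illustrative assumptions, not measurements:

```python
def total_tokens(task_tokens: int, overhead_fraction: float) -> int:
    """Task tokens plus inter-agent coordination overhead."""
    return round(task_tokens * (1 + overhead_fraction))

# Assume a task that needs 10,000 tokens of actual work.
single_agent = total_tokens(10_000, 0.0)   # no coordination overhead
four_agents = total_tokens(10_000, 0.4)    # 40% overhead from coordination
```

At 40% overhead, the four-agent run spends 14,000 tokens to the single agent's 10,000, so the multi-agent approach only pays off if specialization or parallelism improves quality or latency by more than that margin.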
When one agent in a multi-agent system produces an incorrect output, downstream agents may build on that error, amplifying it through the system. This cascading failure mode is particularly dangerous because each agent may appear to be functioning correctly in isolation. Detecting and recovering from such errors requires robust monitoring, validation checkpoints, and sometimes redundant agents that can cross-check each other's work.
Multi-agent systems are inherently more expensive than single-agent systems because they require multiple LLM calls for every task. A workflow that a single agent might handle in one or two API calls could require dozens of calls when distributed across multiple agents. Token costs scale with the number of agents and the complexity of their communication. Organizations must carefully evaluate whether the quality improvement justifies the additional cost.
When agents rely on each other's outputs, a hallucination by one agent can propagate and be reinforced by others. If an agent generates a plausible but incorrect fact, downstream agents may treat it as established truth and build further reasoning on top of it. Debate architectures and cross-validation patterns can mitigate this, but they add further cost and complexity.
Debugging multi-agent systems is substantially harder than debugging single-agent systems. When the final output is wrong, tracing the error back to a specific agent and a specific turn in the conversation requires tools for logging, visualization, and replay that are still maturing. There is also a lack of standardized evaluation metrics for multi-agent system performance.
As of early 2026, multi-agent systems are transitioning from experimental research to production infrastructure. The framework landscape has consolidated around four major options: AutoGen, CrewAI, LangGraph, and the OpenAI Agents SDK. Each serves different use cases and developer preferences.
A notable trend is the emergence of what practitioners call the "agentic mesh," where different frameworks are combined in a single deployment. A LangGraph orchestrator might coordinate a CrewAI team of marketing agents while calling OpenAI tools for specific sub-tasks. The A2A protocol and MCP are enabling this kind of cross-framework interoperability.
Performance benchmarks from early 2026 show LangGraph and OpenAI's Agents SDK leading in token efficiency, while CrewAI remains the most accessible framework for teams new to multi-agent development. AutoGen continues to dominate in enterprise settings where its caching and async capabilities provide cost advantages.
Anthropic's entry into the space with Claude Agent Teams in February 2026, along with their multi-agent code review tool for Claude Teams and Enterprise users, signals that major AI companies view multi-agent systems as a core product category rather than a research curiosity.
The field faces ongoing challenges around cost, reliability, and standardization, but the rapid pace of framework development and the growing body of production deployments suggest that multi-agent systems will become a standard pattern for complex AI applications.