Agent orchestration is the coordinated management of multiple AI agents working together as a unified system to accomplish tasks that exceed the capability of any single agent. It encompasses the routing of tasks to appropriate agents, the flow of context between them, and the lifecycle management that governs how agents start, communicate, fail, recover, and terminate. As multi-agent systems have moved from research prototypes to production deployments, orchestration has become the central engineering challenge in building reliable agentic AI applications.
According to a 2026 Deloitte analysis, the autonomous AI agent market is projected to reach $8.5 billion by 2026, with growth to $45 billion by 2030 contingent on enterprises improving their orchestration capabilities. Gartner documented a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025, and predicts that by 2028, 33% of enterprise software applications will include agentic AI, up from less than 1% in 2024. Nearly 50% of surveyed technology vendors now identify AI orchestration as their primary competitive differentiator.
A single AI agent with access to a large language model and a set of tools can handle many tasks on its own. But when the work spans multiple domains, requires parallel processing, or demands specialized expertise at different stages, a single agent runs into practical limits. Its context window fills up. Its prompt becomes overloaded with instructions for too many tools. Security boundaries require isolating certain capabilities. At that point, the problem calls for multiple agents, and multiple agents call for orchestration.
Multi-agent AI systems have demonstrated 3x faster task completion and 60% better accuracy compared to single-agent implementations in benchmarks. But the gains come with coordination overhead. Most agent failures in production are not failures of the underlying model; they are orchestration and context-transfer failures at handoff points. Getting orchestration right is what separates a demo from a production system.
Microsoft's Azure Architecture Center recommends starting with the lowest level of complexity that reliably meets requirements. A direct model call with a well-crafted prompt is sufficient for single-step tasks like classification or summarization. A single agent with tools handles varied queries within one domain. Multi-agent orchestration becomes necessary only for cross-functional problems, scenarios requiring distinct security boundaries per agent, or tasks that benefit from parallel specialization.
Several well-established patterns have emerged for coordinating multiple agents. Each pattern optimizes for different coordination requirements, and most production systems combine more than one.
Sequential orchestration (also called pipeline or prompt chaining) arranges agents in a predefined linear order. Each agent processes the output of the previous agent, creating a pipeline of specialized transformations. The choice of which agent runs next is deterministic and defined as part of the workflow; agents do not choose their successors.
This pattern works well for multistage processes with clear linear dependencies, such as a "draft, review, polish" workflow. A law firm might use it for contract generation: a template selection agent picks the base document, a clause customization agent modifies terms, a regulatory compliance agent checks against applicable laws, and a risk assessment agent evaluates liability exposure. Each stage builds on the complete output of the previous one.
The main drawback is that failures in early stages propagate through the entire pipeline. Latency compounds because each step must wait for the previous one to finish. If stages can run independently, a different pattern is more appropriate.
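The control flow of a sequential pipeline is simple enough to sketch directly. In the minimal Python sketch below, each "agent" is a plain function standing in for an LLM call; the agent names are invented for illustration and do not come from any particular framework:

```python
from typing import Callable, List

# Illustrative stand-ins for LLM-backed agents in a draft/review/polish flow.
def draft_agent(text: str) -> str:
    return f"DRAFT({text})"

def review_agent(text: str) -> str:
    return f"REVIEWED({text})"

def polish_agent(text: str) -> str:
    return f"POLISHED({text})"

def run_pipeline(agents: List[Callable[[str], str]], task: str) -> str:
    """Run agents in a fixed, predefined order; each consumes the
    previous agent's complete output. Agents never pick their successor."""
    output = task
    for agent in agents:
        output = agent(output)
    return output

result = run_pipeline([draft_agent, review_agent, polish_agent], "contract terms")
```

The deterministic ordering is the defining property: swapping the list reorders the pipeline without touching any agent.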
Concurrent orchestration (also called parallel, fan-out/fan-in, or scatter-gather) runs multiple agents simultaneously on the same input. Each agent provides independent analysis from its own specialization, and the results are aggregated at the end through voting, weighted merging, or LLM-synthesized summarization.
A financial services application might evaluate a stock by dispatching the same ticker to four agents running in parallel: a fundamental analysis agent, a technical analysis agent, a sentiment analysis agent, and an ESG (environmental, social, governance) agent. Each works independently, and their results are combined into a comprehensive recommendation.
This pattern reduces overall latency for tasks that can be parallelized and provides diverse perspectives. It requires a clear conflict-resolution strategy when results contradict each other, and it is more resource-intensive than sequential processing.
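The fan-out/fan-in shape maps naturally onto `asyncio.gather`. The sketch below dispatches one ticker to several analyst agents in parallel and collects their labeled results; the agents return canned strings here, where a real system would make concurrent LLM calls:

```python
import asyncio
from typing import Dict

# Hypothetical specialist agents; each would normally call an LLM with
# its own analysis prompt.
async def fundamental_agent(ticker: str) -> str:
    return f"{ticker}: fundamentals strong"

async def technical_agent(ticker: str) -> str:
    return f"{ticker}: uptrend intact"

async def sentiment_agent(ticker: str) -> str:
    return f"{ticker}: sentiment neutral"

async def fan_out_fan_in(ticker: str) -> Dict[str, str]:
    """Dispatch the same input to every agent concurrently (fan-out),
    then aggregate the labeled results (fan-in)."""
    agents = {
        "fundamental": fundamental_agent,
        "technical": technical_agent,
        "sentiment": sentiment_agent,
    }
    results = await asyncio.gather(*(fn(ticker) for fn in agents.values()))
    return dict(zip(agents.keys(), results))

report = asyncio.run(fan_out_fan_in("ACME"))
```

The aggregation step here is a simple dictionary merge; a production system would replace it with voting, weighted merging, or an LLM-synthesized summary as described above.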
Hierarchical orchestration organizes agents in a tree-like structure with clear authority relationships. A supervisor or coordinator agent at the top delegates tasks to subordinate agents, monitors their progress, and synthesizes their results. Subordinate agents may further delegate to their own sub-agents.
This pattern is effective for managing large problems by decomposing them into manageable parts. It mirrors how human organizations work, with managers distributing work to specialists. The risk is that the supervisor becomes a single point of failure; if it makes poor routing decisions, the entire workflow suffers.
Handoff orchestration (also called routing, triage, or delegation) enables dynamic transfer of control between specialized agents. Each agent assesses the current task and decides whether to handle it directly or pass it to a more appropriate agent. Only one agent is active at a time, and full control transfers from one to another.
This pattern suits scenarios where the optimal agent for a task is not known upfront. A customer support system might start with a triage agent that interprets the request and handles common problems. When it recognizes a billing dispute, it hands off to a financial resolution agent. If that agent discovers an account access issue, it passes control to an account access agent. The key risk is infinite handoff loops, where agents keep bouncing tasks between each other.
The OpenAI Agents SDK represents handoffs as tools visible to the language model. A handoff to a "Refund Agent" becomes a callable tool named transfer_to_refund_agent. When the model invokes that tool, control transfers to the target agent along with relevant context.
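The handoff mechanism can be sketched without any SDK. In the minimal stand-alone sketch below (not the OpenAI Agents SDK's actual API), each handoff target registers under a tool-style name, a handler returning `None` signals a handoff, and a hop cap guards against the infinite-loop risk noted above:

```python
from typing import Callable, Dict, Optional

class Agent:
    """Toy agent whose handoff targets are exposed as pseudo-tools.
    In a real system the LLM would decide when to invoke them."""
    def __init__(self, name: str, handle: Callable[[str], Optional[str]]):
        self.name = name
        self.handle = handle  # returns an answer, or None to hand off
        self.handoffs: Dict[str, "Agent"] = {}

    def add_handoff(self, agent: "Agent") -> None:
        # Registered under a tool-style name, e.g. transfer_to_refund_agent
        self.handoffs[f"transfer_to_{agent.name}"] = agent

def run(agent: Agent, task: str, max_hops: int = 5) -> str:
    """Only one agent is active at a time; control transfers fully."""
    for _ in range(max_hops):  # cap guards against handoff loops
        answer = agent.handle(task)
        if answer is not None:
            return f"[{agent.name}] {answer}"
        # Simulated routing: a real system lets the model pick the tool.
        agent = next(iter(agent.handoffs.values()))
    raise RuntimeError("handoff loop exceeded max_hops")

refund = Agent("refund_agent", lambda t: "refund issued")
triage = Agent("triage_agent", lambda t: None if "refund" in t else "resolved")
triage.add_handoff(refund)

outcome = run(triage, "customer wants a refund")
```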
Group chat orchestration (also called roundtable, multi-agent debate, or council) places multiple agents in a shared conversation thread. A chat manager coordinates the flow by determining which agents respond next and managing interaction modes from collaborative brainstorming to structured quality gates.
A specific variant is the maker-checker loop, where one agent proposes output and another evaluates it against defined criteria. If the checker finds issues, it sends feedback to the maker, which revises and resubmits. This cycle repeats until approval or an iteration cap is reached. The pattern requires clear acceptance criteria and an iteration limit to prevent infinite refinement loops.
This pattern provides transparency and auditability because all contributions appear in a single thread. It works well for human-in-the-loop scenarios where people can guide conversations. The main challenge is managing conversation flow; Microsoft recommends limiting group chat orchestration to three or fewer agents to maintain control.
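The maker-checker loop described above reduces to a small control structure. In this sketch, the maker and checker are toy functions standing in for LLM-backed agents, and the iteration cap enforces the termination guarantee:

```python
from typing import Callable, Tuple

def maker_checker(
    make: Callable[[str], str],
    check: Callable[[str], Tuple[bool, str]],
    task: str,
    max_iterations: int = 3,
) -> str:
    """Maker proposes; checker returns (approved, feedback). Loop until
    approval or the iteration cap, which prevents infinite refinement."""
    feedback = ""
    for _ in range(max_iterations):
        draft = make(task + feedback)
        approved, feedback = check(draft)
        if approved:
            return draft
    return draft  # best effort after hitting the cap

# Toy agents: the maker folds feedback into its next draft; the checker
# demands a disclaimer. Real agents would be LLM calls with clear criteria.
make = lambda prompt: f"report: {prompt}"
check = lambda draft: ("disclaimer" in draft, " add disclaimer")

final = maker_checker(make, check, "Q3 summary")
```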
Magentic orchestration (also called dynamic orchestration or task-ledger-based orchestration) handles open-ended problems without a predetermined plan. A magentic manager agent builds and refines a task ledger that documents the approach, goals, and subgoals. It consults with specialized agents, iterates, backtracks, and delegates as needed until the original request is satisfied or the system detects a stall.
This pattern was introduced by Microsoft Research through Magentic-One, a generalist multi-agent system. It is well suited for incident response, where the specific remediation steps are unknown upfront. An SRE automation might start with a diagnostics agent analyzing logs, then update the task ledger based on findings, bring in an infrastructure agent or rollback agent as needed, and maintain a complete audit trail throughout.
Event-driven orchestration coordinates agents through asynchronous event propagation using data streaming and publish-subscribe patterns. Agents react to events rather than being called directly, which provides temporal decoupling, event replay for debugging, and scalability through partitioning.
Confluent has documented design patterns for event-driven multi-agent systems built on Apache Kafka, where agents produce and consume events on topic streams. This approach works well for real-time triggers and high-throughput scenarios but introduces challenges around increased latency from asynchronous communication and the difficulty of debugging distributed event flows.
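The publish-subscribe coordination style can be illustrated with an in-memory bus, a crude stand-in for Kafka topic streams. Agents subscribe to topics and react to events instead of being called directly, and a retained event log enables the replay-for-debugging property mentioned above:

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

class EventBus:
    """In-memory stand-in for a Kafka-style topic bus."""
    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.log: List[Tuple[str, dict]] = []  # retained for replay/debugging

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        self.log.append((topic, event))
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
alerts: List[str] = []

# A "diagnostics agent" reacts to incident events and emits a follow-up
# event; an "alerting agent" reacts to that. Neither calls the other.
bus.subscribe("incidents", lambda e: bus.publish("alerts", {"summary": e["service"]}))
bus.subscribe("alerts", lambda e: alerts.append(e["summary"]))

bus.publish("incidents", {"service": "checkout"})
```

A real deployment would add partitioned topics, durable storage, and asynchronous consumers, which is where the scaling benefits and the debugging difficulties both come from.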
| Pattern | Coordination style | Routing | Best for | Key risk |
|---|---|---|---|---|
| Sequential | Linear pipeline; each agent processes previous output | Deterministic, predefined order | Step-by-step refinement with clear dependencies | Early failures propagate; no parallelism |
| Concurrent | Parallel; agents work independently on same input | Deterministic or dynamic agent selection | Independent analysis from multiple perspectives | Contradictory results require conflict resolution |
| Hierarchical | Tree structure; supervisor delegates to subordinates | Top-down delegation | Large problems decomposed into parts | Supervisor is single point of failure |
| Handoff | Dynamic delegation; one active agent at a time | Agents decide when to transfer control | Tasks where the right specialist emerges during processing | Infinite handoff loops |
| Group chat | Conversational; agents contribute to shared thread | Chat manager controls turn order | Consensus-building, brainstorming, maker-checker validation | Conversation loops; hard to control with many agents |
| Magentic | Plan-build-execute; manager adapts a task ledger | Manager assigns and reorders tasks dynamically | Open-ended problems with no predetermined solution | Slow to converge; stalls on ambiguous goals |
| Event-driven | Asynchronous; agents react to events on streams | Publish-subscribe routing | Real-time triggers, high-throughput scenarios | Debugging distributed event flows |
The rapid growth of agent orchestration has produced a diverse ecosystem of frameworks. Each takes a different architectural approach to the problem.
LangGraph, developed by LangChain, uses a graph-based workflow design that treats agent interactions as nodes in a directed graph. Edges between nodes can be conditional, allowing branching, looping, and dynamic adaptation based on intermediate results. State is passed explicitly along graph edges.
LangGraph demands a higher upfront investment in setup and learning but offers long-term flexibility for stateful workflows with conditional logic. It excels at enforcing strict output formats and state transitions through its state graph model. LangGraph integrates tightly with LangSmith for observability and tracing.
CrewAI follows a role-based model inspired by real-world organizational structures. Agents are defined as team members with specific roles, goals, and backstories. Tasks are assigned to agents based on their roles, and the framework manages execution order and information passing.
CrewAI uses a YAML-driven configuration approach that balances simplicity with clarity. It is well suited for projects focused on defined role delegation where the workflow maps naturally to a team structure. The framework is less flexible than LangGraph for workflows requiring conditional branching.
AutoGen, originally developed at Microsoft Research, focuses on conversational agent architecture. Agents collaborate through natural language conversations, with the framework managing message routing and turn-taking. AutoGen 0.4, released in January 2025, introduced a complete architectural reimagining based on the actor model for distributed, event-driven systems.
The framework supports cross-language agent communication and includes built-in debugging and monitoring for agent workflows. In late 2024, the original creators departed Microsoft to establish AG2 as a community-driven fork that maintains backward compatibility with AutoGen 0.2. Microsoft subsequently merged AutoGen with Semantic Kernel into the Microsoft Agent Framework for production workloads.
The OpenAI Agents SDK, released in March 2025 as a replacement for the experimental Swarm framework, provides production-grade building blocks for tool use, handoffs, guardrails, and tracing. It supports two main collaboration patterns: handoff collaboration, where agents transfer control to each other mid-conversation, and agent-as-tool, where a central planner invokes sub-agents as if they were tools and incorporates their results.
Handoffs are represented as tools to the LLM, with customization options including callback functions on handoff, structured input types for metadata like escalation reasons, and input filters that control what conversation history the receiving agent sees. The SDK is provider-agnostic; while optimized for OpenAI models, it works with over 100 other LLMs through the Chat Completions API.
The Claude Agent SDK, developed by Anthropic, enables building autonomous agents with Claude's capabilities. Originally called the Claude Code SDK, it was renamed to reflect broader applications beyond coding. The SDK operates around a four-stage feedback cycle: gather context, take action, verify work, and iterate until task completion.
The SDK supports subagents by default. Developers define agent types with descriptions, system prompts, and restricted tool access. When Claude determines a subtask fits a subagent's definition, it spawns the subagent and receives only the final result. This enables parallel task execution while keeping each agent's context window isolated. The SDK handles orchestration details including tool execution, context management, retries, and automatic context compaction that summarizes previous messages when the context limit approaches.
Google's Agent Development Kit (ADK), introduced at Google Cloud NEXT 2025, is a framework for building and deploying multi-agent systems. It is model-agnostic, deployment-agnostic, and built for compatibility with other frameworks.
ADK provides workflow agents (SequentialAgent, ParallelAgent, LoopAgent) for predictable pipelines and LLM-driven dynamic routing for adaptive behavior. Agents are organized hierarchically, with root agents coordinating subordinate agents through description-driven routing: the LLM considers the query, the current agent's description, and related agents' descriptions to determine delegation. ADK Python 2.0 Alpha added graph-based workflows, and the framework is available in Python, TypeScript, and Go.
| Framework | Architecture | Orchestration style | Language support | Key strength |
|---|---|---|---|---|
| LangGraph | Graph-based workflows | Directed graph with conditional edges | Python, TypeScript | Flexible stateful workflows with conditional logic |
| CrewAI | Role-based teams | YAML-driven role delegation | Python | Simple team-based workflows |
| AutoGen | Conversational / Actor model | Message-based agent collaboration | Python, .NET, Go | Distributed event-driven systems |
| OpenAI Agents SDK | Tool-based handoffs | Handoff and agent-as-tool patterns | Python, TypeScript | Provider-agnostic with built-in guardrails |
| Claude Agent SDK | Subagent spawning | Parallel subagents with isolated context | Python, TypeScript | Context management and automatic compaction |
| Google ADK | Hierarchical with workflow agents | Sequential, parallel, loop, and dynamic routing | Python, TypeScript, Go | Multi-language support with Google Cloud integration |
Handoffs are the mechanism by which one agent transfers control, context, and responsibility to another. They are one of the most failure-prone points in multi-agent systems; Deloitte notes that most "agent failures" are actually orchestration and context-transfer issues at handoff points rather than model capability failures.
When an agent determines that a task falls outside its specialization or that another agent would handle it better, it initiates a handoff. The implementation varies by framework, but the general flow involves three steps: the current agent signals intent to hand off, relevant context is packaged and transferred, and the receiving agent takes over processing.
In the OpenAI Agents SDK, handoffs are modeled as tools. Each potential handoff destination registers as a tool (for example, transfer_to_billing_agent), and the LLM decides when to call it based on conversation context. The handoff can include structured metadata through an input_type parameter, allowing the model to pass along information like the reason for escalation or a priority level. Input filters give developers control over what conversation history the receiving agent sees, preventing context window bloat while preserving essential information.
The Claude Agent SDK takes a different approach by spawning subagents with isolated context windows. Rather than transferring the full conversation, the orchestrator creates a subagent with a specific task description and restricted tool access. The subagent works independently and returns only its final result to the parent agent.
How much context to transfer during a handoff is a core design decision with significant cost and quality implications. Three strategies are common in production:
Full context forwarding passes the entire conversation history to the receiving agent. This is simple to implement but expensive. A 50-message thread forwarded across four handoffs means roughly 200 messages are reprocessed in total, and cumulative token costs scale quadratically with the number of handoffs because each receiving agent re-reads an ever-growing history.
Structured context objects use a typed data structure (containing fields like customer ID, detected intent, extracted entities, and resolution status) that the orchestrator maintains and passes selectively to each worker. This is the most token-efficient approach, typically requiring 200 to 500 tokens compared to 5,000 to 20,000 tokens for full conversation forwarding.
Shared memory stores context in a vector database or object store that agents can read from and write to asynchronously. This allows agents to remain loosely coupled while staying coordinated. It works well for long-running workflows where agents may not execute consecutively.
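A structured context object is typically just a typed record. The sketch below uses a dataclass with illustrative field names drawn from the customer-support example; `asdict` produces the compact payload that crosses the handoff boundary instead of the full transcript:

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class HandoffContext:
    """Typed context passed between agents in place of the transcript.
    Field names are illustrative; adapt them to the domain."""
    customer_id: str
    detected_intent: str
    extracted_entities: List[str] = field(default_factory=list)
    resolution_status: str = "open"
    escalation_reason: Optional[str] = None

ctx = HandoffContext(
    customer_id="C-1042",
    detected_intent="billing_dispute",
    extracted_entities=["invoice #88"],
)
ctx.escalation_reason = "charge disputed twice"
payload = asdict(ctx)  # hundreds of tokens vs. thousands for the transcript
```

Because the orchestrator owns the object, each worker sees only the fields it needs, which also enforces the scoped-context principle.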
State management in multi-agent systems determines how information persists and flows between agents across a workflow's lifetime. Poor state management leads to lost context, duplicated work, and inconsistent behavior.
Short-term memory handles active session state using sliding windows and in-memory storage. It tracks the current conversation, intermediate results, and pending tasks within a single workflow execution.
Long-term memory persists across sessions using vector databases for semantic retrieval and structured storage for relational knowledge. An agent can retrieve relevant information from past interactions, user preferences, or domain knowledge that was accumulated over time. Frameworks like LangChain integrate with vector stores such as Pinecone, Weaviate, and Chroma for this purpose.
In multi-agent orchestrations, context windows grow rapidly because each agent adds its own reasoning, tool results, and intermediate outputs. Production systems use several techniques to manage this:
Context compaction summarizes previous interactions into key facts rather than passing complete histories. For example, a multi-message customer support exchange might be compressed into a structured summary of the customer's identity, issue type, and resolution status.
Selective pruning removes tool call details and intermediate reasoning steps that are no longer relevant, keeping only the conclusions and decisions.
Scoped context limits each agent to only the information relevant to its task, rather than exposing the full system state. The Claude Agent SDK implements this through isolated subagent context windows.
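Compaction and pruning can be combined in one pass over the message history. In the sketch below the "summary" is a crude string join for illustration; a production system would generate it with an LLM call:

```python
from typing import Dict, List

def compact_history(messages: List[Dict[str, str]],
                    keep_last: int = 2) -> List[Dict[str, str]]:
    """Collapse older messages into one summary entry, keeping only the
    most recent turns verbatim. Tool-call chatter in the old segment is
    pruned entirely (selective pruning)."""
    old, recent = messages[:-keep_last], messages[-keep_last:]
    facts = [m["content"] for m in old if m["role"] != "tool"]
    summary = {"role": "system", "content": "Summary: " + "; ".join(facts)}
    return [summary] + recent

history = [
    {"role": "user", "content": "my invoice is wrong"},
    {"role": "tool", "content": "lookup(invoice=88)"},
    {"role": "assistant", "content": "invoice 88 was double-billed"},
    {"role": "user", "content": "please fix it"},
]
compacted = compact_history(history)
```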
Production multi-agent systems need layered error handling because failures can occur at every level: model API calls, tool execution, inter-agent communication, and orchestration logic.
The first layer of defense handles transient errors like rate limits and network timeouts. Retries with exponential backoff and jitter prevent thundering herd problems when multiple agents hit the same API. Timeouts should be calibrated using the 95th percentile of response times rather than averages, to capture realistic worst-case behavior without triggering premature timeouts.
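Exponential backoff with full jitter is a few lines of code. This sketch retries only `ConnectionError` as a stand-in for whatever transient exceptions a given provider raises; the flaky API is simulated:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(call: Callable[[], T], max_attempts: int = 4,
                 base_delay: float = 0.01) -> T:
    """Retry a transient-failure-prone call with exponential backoff plus
    full jitter, so parallel agents don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the backoff ceiling.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("unreachable")

attempts = {"n": 0}
def flaky_api() -> str:
    """Simulated endpoint that rate-limits the first two calls."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("rate limited")
    return "ok"

result = with_retries(flaky_api)
```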
When a primary model provider experiences an outage, a fallback chain routes requests to alternative models. An orchestrator might try GPT-4o first, fall back to Claude, and then to a smaller local model. Each fallback may produce different quality levels, so the system needs to account for degraded performance.
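A fallback chain is an ordered list of providers tried until one succeeds. Returning the name of the provider that answered lets callers account for degraded quality downstream; provider names and failure modes here are invented for illustration:

```python
from typing import Callable, List, Optional, Tuple

def call_with_fallbacks(providers: List[Tuple[str, Callable[[str], str]]],
                        prompt: str) -> Tuple[str, str]:
    """Try providers in preference order; return (provider_name, answer)
    from the first success, or raise if every provider fails."""
    last_error: Optional[Exception] = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in practice, catch provider-specific errors
            last_error = exc
    raise RuntimeError(f"all providers failed: {last_error}")

def primary(prompt: str) -> str:
    raise TimeoutError("primary model outage")  # simulated outage

def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"

provider, answer = call_with_fallbacks(
    [("primary", primary), ("secondary", secondary)], "summarize the incident"
)
```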
Different errors require different responses. A rate limit needs a retry. A tool that returns invalid output needs the LLM to reformulate its query. Missing user input needs a human-in-the-loop escalation. Classifying errors at the orchestration layer and routing them to the appropriate recovery mechanism prevents wasted retries on non-transient failures.
For long-running multi-agent workflows, checkpoint-based recovery saves the system state at defined points so that a crash does not require restarting from the beginning. This is especially important for workflows that involve expensive operations or external side effects that cannot be easily reversed.
Patterns borrowed from distributed systems engineering help contain failures. Circuit breakers monitor failure rates for downstream services and stop sending requests when failures exceed a threshold, giving the service time to recover. Bulkhead patterns compartmentalize the system into failure domains; if one group of agents fails, others continue operating independently.
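A minimal circuit breaker for an agent's downstream dependency can be sketched as follows. This simplified version counts consecutive failures rather than a failure rate over a window, which production implementations typically use:

```python
import time
from typing import Optional

class CircuitBreaker:
    """Opens after `threshold` consecutive failures and rejects calls
    until `cooldown` seconds pass, then allows a single probe through."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None  # half-open: let one probe through
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker(threshold=2, cooldown=60.0)
breaker.record(success=False)
breaker.record(success=False)  # second consecutive failure trips the breaker
blocked = not breaker.allow()
```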
Two open protocols have emerged to standardize how agents interact with tools and with each other.
The Model Context Protocol (MCP), announced by Anthropic in November 2024, is an open standard for connecting AI agents to external data sources and tools. Built on JSON-RPC 2.0, MCP provides three core capabilities: tools (executable functions that perform actions), resources (access to data), and prompts (templates for common interactions).
Before MCP, developers had to build custom connectors for each data source, creating an N-by-M integration problem. MCP reduces this to N+M by providing a standardized interface. The protocol has been adopted by OpenAI, Google DeepMind, and other major providers. Thousands of MCP servers have been built by the community, and SDKs are available for all major programming languages. MCP is not an agent framework itself; it is an integration layer that complements orchestration frameworks like LangChain, LangGraph, and CrewAI.
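Because MCP is built on JSON-RPC 2.0, a tool invocation is just a structured request message. The shape below follows MCP's `tools/call` framing; the tool name and arguments are invented for illustration:

```python
import json

# An MCP tool invocation expressed as a JSON-RPC 2.0 request.
# "search_documents" and its arguments are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_documents",
        "arguments": {"query": "quarterly revenue", "limit": 5},
    },
}
wire_message = json.dumps(request)
```

The server's response carries the tool's result under the same request `id`, which is what lets any MCP-capable agent talk to any MCP server without a custom connector.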
The Agent2Agent protocol (A2A), introduced by Google in April 2025, enables communication between AI agents from different providers and frameworks. Built on HTTP, JSON-RPC, and Server-Sent Events (SSE), A2A provides capability discovery through "Agent Cards" in JSON format, task management with defined lifecycle states, context and instruction sharing between agents, and user experience negotiation.
A2A launched with support from more than 50 technology partners including Atlassian, Salesforce, SAP, and ServiceNow. In June 2025, Google contributed the protocol to the Linux Foundation. Version 0.3, released in July 2025, added gRPC support and signed security cards. The protocol now counts over 150 supporting organizations. A2A complements MCP: where MCP connects agents to tools, A2A connects agents to each other.
Observing multi-agent systems requires tracking multiple LLM calls, control flows, decision-making processes, tool invocations, and outputs across agents. Traditional application monitoring is insufficient because agent behavior is non-deterministic and context-dependent.
Distributed tracing captures the full execution path of a multi-agent workflow, including each LLM call, tool invocation, handoff, and decision point. When an agent takes a 12-step path to answer a query, developers need to understand every decision: why it chose specific tools, why it retried steps, and where time was spent.
The industry is converging on OpenTelemetry (OTEL) as a standard for collecting agent telemetry data. This prevents vendor lock-in and enables interoperability across frameworks.
Several platforms specialize in agent observability:
| Platform | Type | Key capabilities |
|---|---|---|
| LangSmith | Commercial (LangChain) | Tracing, real-time monitoring, cost and latency tracking; virtually no measurable overhead |
| Langfuse | Open source | LLM engineering platform for monitoring, evaluation, and debugging; supports LangGraph, OpenAI Agents, CrewAI, and more |
| AgentOps | Commercial | Agent lifecycle tracking and debugging |
| Arize Phoenix | Open source | LLM tracing and evaluation |
LangSmith demonstrated virtually no measurable overhead in benchmarks, making it suitable for performance-critical production environments. Langfuse and AgentOps showed higher overhead (15% and 12% respectively) in multi-step workflows but offer different pricing and self-hosting options.
Production agent systems should track accuracy rates (target of 95% or higher), task completion rates (target of 90% or higher), response latency per agent and end-to-end, token consumption and cost per workflow, handoff success rates, and error rates by type and agent.
Deploying multi-agent systems to production introduces challenges around reliability, latency, scaling, and governance that do not surface during prototyping.
Large language models make agents flexible but also lead to inconsistent outputs. Small changes in wording can derail entire interactions, a phenomenon called "prompt brittleness" that requires rigorous testing and careful prompt engineering. Hallucinations (agents making up facts or tool inputs) can grind processes to a halt. Production systems mitigate these risks through guardrails, output validation, and structured output formats.
LLM-powered agents can be too slow for high-traffic or real-time applications. Teams often need to rearchitect for efficiency by using caching, swapping in faster models for simpler tasks, or simplifying agent logic. An orchestration pattern that works at 100 requests per minute may fall apart at 10,000.
As organizations move beyond 100 agents, emergent behaviors become a primary concern. Agents may interact in unexpected ways, create feedback loops, or compete for shared resources. Architectural approaches that work at small scale (a single orchestrator managing all agents) do not hold up at larger scales, requiring distributed orchestration with partitioned namespaces and independent failure domains.
Production agent orchestration requires clear governance: defined agent roles, accountability for agent decisions, fallback routes when agents fail, and human oversight mechanisms.
Before scaling, organizations should stress-test orchestrations by simulating real-world complexity including incomplete data, conflicting goals, adversarial inputs, and simultaneous high-volume requests. End-to-end tests should cover the full multi-agent workflow, not just individual agent behavior.
Several principles have emerged from early production deployments of multi-agent orchestration systems.
Start simple. Use the minimum number of agents that reliably solve the problem. Each additional agent adds coordination overhead, latency, and failure modes. A single agent with multiple tools is preferable to a multi-agent system if it can handle the task.
Give each agent a single, well-defined responsibility. Agents with broad, overlapping responsibilities produce complex prompts and degrade performance. Clear boundaries reduce ambiguity in routing decisions.
Make orchestration deterministic where possible. Use state machines and explicit routing rules for flow control. Reserve LLM-based decision-making for bounded judgments within agents rather than for choosing which agent runs next.
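Deterministic flow control can be as simple as a transition table: a lookup from (current state, outcome) to the next agent, so no LLM ever chooses the route. State and agent names below are illustrative:

```python
from typing import Dict, Tuple

# Explicit state machine for routing; the LLM's judgment is confined to
# producing an outcome label within each agent, never to picking the route.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("triage", "billing"): "billing_agent",
    ("triage", "technical"): "support_agent",
    ("billing_agent", "resolved"): "done",
    ("support_agent", "resolved"): "done",
}

def route(state: str, outcome: str) -> str:
    """Look up the next agent; unknown transitions fail fast instead of
    letting the workflow drift into undefined behavior."""
    try:
        return TRANSITIONS[(state, outcome)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {outcome!r}")

next_agent = route("triage", "billing")
```

Failing fast on an undefined transition surfaces routing gaps during testing rather than in production.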
Design for failure. Every inter-agent communication point is a potential failure point. Build retry logic, fallback paths, timeouts, and circuit breakers into the orchestration layer from the beginning.
Compress context aggressively. Summarize and prune information between agents rather than forwarding complete conversation histories. Token costs and latency grow linearly (or worse) with context size.
Monitor everything. Instrument every LLM call, tool invocation, handoff, and decision point. Without observability, debugging multi-agent systems is nearly impossible.
Test with realistic scenarios. Agent behavior under controlled test conditions often differs from behavior with real user inputs. Include edge cases, ambiguous requests, and adversarial inputs in test suites.