An agentic workflow is a multi-step process in which one or more AI agents independently plan a sequence of actions, select and use tools, evaluate intermediate results, and iterate until a goal is reached. Unlike conventional large language model usage, where a single prompt produces a single response (zero-shot prompting), agentic workflows let AI systems break complex tasks into subtasks, reflect on their own outputs, and revise their work across multiple turns.
The term gained broad usage in early 2024 after Andrew Ng, co-founder of Google Brain and founder of DeepLearning.AI, presented a talk at Sequoia Capital's AI Ascent event in March 2024. Ng argued that agentic workflows would "drive massive AI progress this year, perhaps even more than the next generation of foundation models." He demonstrated this with a striking benchmark result: on OpenAI's HumanEval coding benchmark, GPT-3.5 scored 48.1% in zero-shot mode and GPT-4 scored 67.0%, but GPT-3.5 wrapped in an agentic workflow reached 95.1%, surpassing GPT-4's zero-shot performance by a wide margin. The takeaway was clear: iterative workflow design can matter more than raw model capability.
Traditional LLM applications follow a request-response pattern. A user sends a prompt, the model generates a completion, and the interaction ends. This works for simple tasks like translation, summarization, or question answering, but it struggles with open-ended problems that require research, planning, or multi-step reasoning.
Human experts rarely solve complex problems in a single pass. A software engineer writes code, runs tests, reads error messages, and revises. A researcher formulates a hypothesis, gathers data, analyzes results, and adjusts the hypothesis. Agentic workflows apply this same iterative pattern to AI systems. The agent receives a high-level goal, decomposes it into steps, executes each step (often calling external tools), evaluates whether the result meets the goal, and loops back if it does not.
This shift from single-turn completions to multi-turn autonomous execution is what distinguishes agentic AI from earlier prompt engineering techniques. Where prompt engineering optimizes a single model call, agentic workflows orchestrate many calls into a coherent process.
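The orchestration loop described above can be sketched in a few lines. In the sketch below, `plan`, `execute_step`, and `goal_met` are placeholder stubs standing in for LLM and tool calls; this illustrates only the control flow, not a production implementation.

```python
def plan(goal: str, history: list[str]) -> list[str]:
    # Placeholder for an LLM planning call: decompose the goal into steps.
    return [f"gather data for {goal}", f"draft answer for {goal}"]

def execute_step(step: str) -> str:
    # Placeholder for step execution (tool calls, code execution, etc.).
    return f"done: {step}"

def goal_met(goal: str, results: list[str]) -> bool:
    # Placeholder for an evaluation call: do the results satisfy the goal?
    return len(results) >= 2

def run_agent(goal: str, max_iterations: int = 10) -> list[str]:
    """Minimal agent loop: plan, execute each step, evaluate, repeat."""
    results: list[str] = []
    for _ in range(max_iterations):
        for step in plan(goal, results):
            results.append(execute_step(step))  # may invoke external tools
        if goal_met(goal, results):             # loop back if the goal is unmet
            break
    return results
```

The loop terminates either when the evaluation step is satisfied or when an iteration budget is exhausted, a common safeguard against runaway agents.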
In his March 2024 talk and subsequent posts on X (formerly Twitter), Andrew Ng identified four core design patterns for agentic AI. These patterns are not mutually exclusive; production systems often combine two or more.
In a reflection workflow, the model generates an output and then critiques its own work. Ng described a common setup with two agents: one prompted to generate high-quality output and another prompted to provide constructive criticism. The conversation between these two agents leads to iterative improvement. For example, a coding agent might generate a function, then a reviewer agent points out an edge case, and the first agent revises its code.
Reflection is the simplest agentic pattern and often the easiest to implement. Despite its simplicity, Ng noted that he had been "delighted by how much it improved applications' results in a few cases." The pattern works especially well for writing, code generation, and data analysis where quality can be evaluated programmatically.
Tool use extends the agent's capabilities beyond text generation. Instead of relying solely on its training data, the agent can call external APIs, query databases, execute code, search the web, or interact with file systems. The ReAct (Reasoning and Acting) framework, introduced by Yao et al. in a 2022 paper and published at ICLR 2023, formalized this pattern by interleaving reasoning traces with tool-calling actions. At each step, the agent articulates its reasoning (a "thought"), decides on an action (such as calling a search API), observes the result, and then reasons about the next step.
Tool use is what transforms a language model from a text generator into a functional agent. Without tools, the model can only produce text based on its training data. With tools, it can retrieve current information, perform calculations, modify files, send messages, and interact with software systems.
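A ReAct-style loop can be sketched as below. The tool implementations are stand-ins, and the hard-coded "thought" policy replaces what would be an LLM reasoning call in a real system; only the thought-action-observation cycle itself is the point.

```python
TOOLS = {
    "search": lambda q: f"results for '{q}'",   # stand-in for a search API
    "calculate": lambda e: str(eval(e)),        # stand-in only; never eval untrusted input
}

def react(question: str, max_steps: int = 5) -> str:
    """ReAct-style loop: thought -> action -> observation, repeated."""
    observation = None
    for _ in range(max_steps):
        # Thought: in a real system an LLM produces the reasoning trace
        # and selects the next action; here the policy is hard-coded.
        if observation is None:
            action, arg = "search", question          # first, gather information
        else:
            return f"Answer based on: {observation}"  # enough context to answer
        observation = TOOLS[action](arg)              # act, then observe the result
    return "no answer found within the step budget"
```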
Planning involves the agent decomposing a complex task into a sequence of smaller, manageable steps before executing them. Rather than attempting to solve the entire problem at once, the agent creates a plan, executes it step by step, and adjusts the plan if intermediate results reveal that the original approach is not working.
Sophisticated planning agents use techniques like chain-of-thought reasoning to map out their approach. Some systems maintain an explicit plan data structure that tracks completed steps, pending steps, and alternative paths. Planning becomes especially valuable when tasks have dependencies, where step three cannot begin until step two completes, or when the agent needs to allocate limited resources (such as API calls or compute time) across competing priorities.
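An explicit plan data structure of the kind described above might look like the following sketch, where executing a step is stubbed out and `revise` models the agent inserting new steps after intermediate results reveal a problem.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    """Explicit plan state: pending and completed steps, adjustable mid-run."""
    pending: list[str]
    completed: list[str] = field(default_factory=list)

    def execute_next(self) -> str:
        step = self.pending.pop(0)
        result = f"ok: {step}"        # placeholder for actually executing the step
        self.completed.append(step)
        return result

    def revise(self, extra_steps: list[str]) -> None:
        # If intermediate results show the approach is not working,
        # insert corrective steps ahead of the remaining ones.
        self.pending = extra_steps + self.pending

plan = Plan(pending=["load data", "compute summary", "write report"])
plan.execute_next()                   # completes "load data"
plan.revise(["clean missing values"]) # adjust the plan after seeing the data
```

Tracking completed versus pending steps explicitly also makes the workflow resumable and auditable, which matters once plans span many tool calls.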
Multi-agent collaboration assigns different roles or specializations to separate agents that work together on a shared task. One agent might act as a researcher, another as a writer, a third as a fact-checker, and a fourth as an editor. Each agent focuses on its specialty, and they communicate through structured message passing.
This pattern mirrors how human teams operate. A product team has a designer, an engineer, a product manager, and a QA tester. Each contributes their expertise, and the final product benefits from multiple perspectives. Multi-agent systems can tackle problems too complex for any single agent by dividing labor and enabling specialization.
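Structured message passing between role-specialized agents can be sketched as a pipeline of handoffs. Each function below stands in for a separately prompted agent; the role names and message format are illustrative.

```python
def researcher(task: str) -> dict:
    # Stand-in for a research agent's LLM call.
    return {"role": "researcher", "content": f"findings on {task}"}

def writer(msg: dict) -> dict:
    # Stand-in for a writer agent consuming the researcher's output.
    return {"role": "writer", "content": f"article from {msg['content']}"}

def fact_checker(msg: dict) -> dict:
    # Stand-in for a fact-checking agent reviewing the draft.
    return {"role": "fact_checker", "content": f"verified {msg['content']}"}

def pipeline(task: str) -> str:
    """Agents collaborate via structured messages, each adding its specialty."""
    msg = researcher(task)
    msg = writer(msg)
    msg = fact_checker(msg)
    return msg["content"]
```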
Gartner reported a 1,445% increase in enterprise inquiries about multi-agent systems from Q1 2024 to Q2 2025, indicating strong industry interest in this pattern.
Beyond Ng's four high-level categories, practitioners have identified several specific patterns that appear repeatedly in production agentic systems.
| Pattern | Description | Typical use case | Complexity |
|---|---|---|---|
| ReAct loop | The agent alternates between reasoning (thought) and acting (tool call), observing results at each step | Search and retrieval, data analysis, question answering | Low |
| Reflection / self-critique | The agent generates output, critiques it, and revises iteratively | Code generation, writing, debugging | Low |
| Plan-and-execute | The agent creates an explicit multi-step plan, then executes each step sequentially | Complex research, project management, multi-file code changes | Medium |
| Tool routing | A dispatcher agent decides which specialized tool or sub-agent to invoke based on the user's request | Customer support triage, IT helpdesk, function calling | Medium |
| Multi-agent debate | Multiple agents argue different positions and converge on a consensus answer | Fact verification, decision making, risk assessment | Medium |
| Hierarchical delegation | A supervisor agent delegates subtasks to worker agents and aggregates their results | Report generation, data pipelines, software engineering | High |
| Human-in-the-loop | The agent pauses at specified checkpoints to request human approval before proceeding | Financial transactions, medical recommendations, legal review | Medium |
| Map-reduce | The agent splits a large task into parallel subtasks, processes them concurrently, and merges the outputs | Document analysis, batch processing, large-scale summarization | High |
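Of the patterns tabulated above, tool routing is among the simplest to sketch. A production dispatcher would use an LLM or a trained classifier to categorize requests; the keyword matching below merely illustrates the control flow, and the handler names are hypothetical.

```python
HANDLERS = {
    "billing": lambda req: f"billing agent handles: {req}",
    "technical": lambda req: f"technical agent handles: {req}",
    "general": lambda req: f"general agent handles: {req}",
}

def route(request: str) -> str:
    """Dispatcher: classify the request, then invoke the matching sub-agent.
    Keyword rules stand in for an LLM-based classifier."""
    text = request.lower()
    if "refund" in text or "invoice" in text:
        category = "billing"
    elif "error" in text or "crash" in text:
        category = "technical"
    else:
        category = "general"
    return HANDLERS[category](request)
```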
How much independence an agent should have in production is one of the central design questions. Anthropic's research on measuring agent autonomy (published in 2025) emphasized that "autonomy is not a fixed property of a model or system but an emergent characteristic of a deployment." Three broad patterns have emerged.
In this pattern, the agent proposes actions but waits for explicit human approval before executing them. Every tool call, API request, or output goes through a human review step. This is the safest approach and is common in high-stakes domains like healthcare, finance, and legal work. The downside is speed: the workflow can only move as fast as the human reviewer.
The agent operates autonomously for routine, low-risk, reversible actions but escalates to a human for high-risk, irreversible, or ambiguous decisions. This "calibrated autonomy" approach tries to balance safety with efficiency. A customer service agent, for instance, might process a standard refund on its own but escalate a complaint involving potential fraud to a human supervisor.
Anthropic's autonomy research found that experienced users naturally shift from approving individual actions to a monitoring-and-intervening style. Over 40% of experienced users adopted this pattern, compared to roughly 20% of new users.
The agent executes its entire workflow without human intervention. Guardrails and safety checks are embedded programmatically rather than relying on human review. This pattern is used for low-stakes, high-volume tasks like data entry, content tagging, or log analysis. Fully autonomous agents require robust monitoring and observability infrastructure to catch failures after the fact.
Most production deployments in 2025 and 2026 sit in the supervised autonomy zone. A Deloitte report on agentic AI strategy noted that organizations are moving from single-step automation to systems that manage entire workflows, but with human oversight at decision points.
Several open-source and commercial frameworks have emerged to simplify building agentic workflows. They handle common concerns like state management, tool registration, agent communication, error handling, and observability.
LangGraph, developed by LangChain, uses a graph-based workflow design. Agent interactions are modeled as nodes in a directed graph, with edges representing transitions between steps. This architecture supports conditional branching, cycles (for iterative refinement), and parallel execution paths. LangGraph provides fine-grained control over agent state through a persistent state object that flows through the graph. It is written in Python and TypeScript and integrates tightly with the broader LangChain ecosystem.
LangGraph is well suited for complex workflows with conditional logic, error recovery, and dynamic routing. Its graph-based approach makes it easier to reason about and debug multi-step processes.
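LangGraph's actual API (state graphs, nodes, conditional edges) is more elaborate than can be shown here; the framework-free sketch below illustrates only the underlying idea: nodes transform a shared state object, and each node names the next node, which permits cycles for iterative refinement.

```python
def draft(state: dict) -> tuple[dict, str]:
    state["text"] = "draft"
    return state, "review"                 # edge to the review node

def review(state: dict) -> tuple[dict, str]:
    state["attempts"] = state.get("attempts", 0) + 1
    if state["attempts"] < 2:
        return state, "draft"              # cycle back for another revision
    return state, "END"                    # terminate the graph

NODES = {"draft": draft, "review": review}

def run_graph(start: str, state: dict) -> dict:
    """Walk the graph: each node updates shared state and names its successor."""
    node = start
    while node != "END":
        state, node = NODES[node](state)
    return state
```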
CrewAI takes a role-based approach inspired by human team structures. Developers define "crews" of agents, each with a distinct role, backstory, and set of capabilities. Tasks are assigned to agents based on their roles, and agents can communicate with each other and delegate subtasks. CrewAI focuses on simplicity and rapid prototyping; defining a multi-agent workflow takes relatively little code.
CrewAI works well for workflows where agent responsibilities map cleanly to human job roles: researcher, writer, analyst, reviewer. It is popular for content generation pipelines, competitive analysis, and customer support workflows.
AutoGen, originally developed by Microsoft Research, approaches multi-agent orchestration through structured conversations. Agents communicate by exchanging messages in a group chat format, with different agents taking on roles like planner, researcher, coder, and executor. AutoGen supports both fully autonomous conversations and human-in-the-loop configurations where a human participant can intervene in the conversation.
In October 2025, Microsoft merged AutoGen with Semantic Kernel into the unified Microsoft Agent Framework. This combined platform offers AutoGen's flexible multi-agent orchestration with Semantic Kernel's enterprise features, including session-based state management, middleware, and telemetry. The 1.0 general availability release is targeted for Q1 2026.
Semantic Kernel is Microsoft's model-agnostic SDK for building AI agents and multi-agent systems. Available in Python, C#, and Java, it targets enterprise developers already working in the Microsoft ecosystem. The framework supports multiple orchestration patterns: sequential, concurrent, group chat, handoff, and what Microsoft calls "Magentic" orchestration, a manager-directed pattern derived from its Magentic-One multi-agent system, in which a manager agent dynamically assigns subtasks to the agents best suited to them.
Following the merger with AutoGen, Semantic Kernel provides the production foundation (state management, type safety, telemetry) while AutoGen contributes the agent abstractions and conversational patterns.
Other notable frameworks include LlamaIndex Workflows (focused on retrieval-augmented generation agents), DSPy (which compiles declarative agent programs into optimized prompts), Haystack (by deepset, for NLP pipeline orchestration), and Amazon Bedrock Agents (a managed service on AWS for building production agents). The framework landscape is rapidly evolving, with new entrants appearing regularly.
| Framework | Developer | Architecture style | Primary languages | Release year | Ideal use case |
|---|---|---|---|---|---|
| LangGraph | LangChain | Graph-based workflows | Python, TypeScript | 2024 | Complex conditional workflows |
| CrewAI | CrewAI Inc. | Role-based crews | Python | 2023 | Team-structured content and research workflows |
| AutoGen / Microsoft Agent Framework | Microsoft | Conversation-based | Python, .NET | 2023 (AutoGen), 2025 (merged) | Research, collaborative problem-solving |
| Semantic Kernel | Microsoft | Plugin-based, multi-pattern | Python, C#, Java | 2023 | Enterprise .NET/Java applications |
| LlamaIndex Workflows | LlamaIndex | Event-driven | Python, TypeScript | 2024 | RAG-heavy agent applications |
| DSPy | Stanford NLP | Declarative compilation | Python | 2023 | Optimized prompt pipelines |
| Amazon Bedrock Agents | AWS | Managed service | Python (SDK) | 2023 | Serverless production agents on AWS |
The Model Context Protocol (MCP) is an open standard introduced by Anthropic on November 25, 2024, to standardize how AI agents connect to external tools, data sources, and systems. Before MCP, every integration between an agent and an external service required custom code. MCP replaces this fragmented approach with a single, universal protocol.
MCP uses a client-server architecture. An MCP client (the AI agent or application) connects to one or more MCP servers, each of which exposes tools and data from a specific system. A single MCP server for GitHub, for example, might expose tools for creating pull requests, reading issues, and searching repositories. The agent connects to the server once and can call any of its tools through a standardized interface.
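MCP is built on JSON-RPC 2.0, and `tools/list` and `tools/call` are method names from the MCP specification. The sketch below shows the shape of a tool-call request; the tool name and arguments are hypothetical, and real clients use the official MCP SDKs rather than constructing messages by hand.

```python
import json

# Shape of an MCP tool-call request (JSON-RPC 2.0). The method name
# "tools/call" comes from the MCP specification; the tool name
# "create_pull_request" and its arguments are hypothetical examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "create_pull_request",
        "arguments": {"repo": "octo/demo", "title": "Fix typo"},
    },
}
wire = json.dumps(request)   # sent to the MCP server over stdio or HTTP
```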
At launch, Anthropic released pre-built MCP servers for Google Drive, Slack, GitHub, Git, Postgres, and Puppeteer. Early adopters included Block, Apollo, Zed, Replit, Codeium, and Sourcegraph. In March 2025, OpenAI adopted MCP across its products, including the ChatGPT desktop application. By the end of 2025, the ecosystem had grown to over 10,000 active MCP servers with 97 million monthly SDK downloads across Python and TypeScript.
In December 2025, Anthropic donated MCP to the newly formed Agentic AI Foundation (AAIF) under the Linux Foundation, with OpenAI and Block as co-founders. This move signaled industry-wide commitment to a shared standard for agent-tool integration.
MCP has become the de facto standard for connecting agents to tools. Its rapid adoption across competing AI companies and thousands of third-party developers suggests that the fragmentation problem for tool integration is largely solved at the protocol level.
While MCP standardizes how agents connect to tools, the Agent-to-Agent protocol (A2A) standardizes how agents communicate with each other. Google announced A2A on April 9, 2025, with support from over 50 technology partners including Atlassian, Salesforce, SAP, ServiceNow, PayPal, LangChain, and MongoDB.
A2A addresses a different problem than MCP. In a multi-agent system, agents from different vendors or built on different frameworks need a common language to discover each other's capabilities, delegate tasks, exchange results, and negotiate how to present information to users. A2A provides this common language.
The protocol has four core capabilities: capability discovery (agents advertise what they can do through machine-readable "agent cards"), task management (delegating tasks and tracking them through their lifecycle from submission to completion), collaboration (exchanging messages, artifacts, and intermediate results), and user experience negotiation (agreeing on the format in which results are presented to users).
A2A is built on existing web standards (HTTP, Server-Sent Events, and JSON-RPC) and supports enterprise-grade authentication and authorization. In June 2025, Google contributed A2A to the Linux Foundation. Version 0.3, released in July 2025, provided a more stable API surface for production use.
A2A and MCP are complementary. MCP handles agent-to-tool communication (vertical integration), while A2A handles agent-to-agent communication (horizontal integration). Together, they form the emerging protocol stack for multi-agent systems.
Agentic AI moved from experimentation to early production deployment across several industries during 2025 and into 2026. Gartner predicted that 40% of enterprise applications would include task-specific AI agents by the end of 2026, up from less than 5% in 2025. The global AI agents market was valued at approximately $7.8 billion in 2025, with projections reaching $52.6 billion by 2030 (a 46.3% compound annual growth rate).
Customer service was among the first domains to adopt agentic workflows at scale. Agents handle routine inquiries, process refunds, update account information, and escalate complex cases to human representatives. Workflow automation is the primary use case in 64% of agent deployments. Organizations report an average 6.7% increase in customer satisfaction scores after deploying AI agents, and small teams report saving over 40 hours per month.
Agentic coding tools like GitHub Copilot, Cursor, Claude Code, and Devin represent some of the most visible agentic workflow deployments. These tools go beyond autocomplete to plan multi-file changes, run tests, interpret error messages, and iterate on code until tests pass. Anthropic's 2026 Agentic Coding Trends Report noted that agentic AI is "reconfiguring the software development lifecycle," with coding agents increasingly handling not just code generation but debugging, testing, documentation, and code review.
Research agents can search the web, read documents, synthesize findings, and produce structured reports. They are used in competitive intelligence, market research, due diligence, and scientific literature review. The iterative nature of agentic workflows is particularly well suited to research tasks, where initial findings often raise follow-up questions that require additional investigation.
Banks and financial institutions use agentic workflows for Know Your Customer (KYC) verification, anti-money laundering (AML) screening, fraud detection, and regulatory compliance. McKinsey reported that banks implementing agentic AI for KYC/AML workflows have realized productivity gains ranging from 200% to 2,000%.
Data analysis agents can write SQL queries, execute them against databases, interpret the results, generate visualizations, and produce written summaries. They iterate: if a query returns unexpected results, the agent can reformulate it. This pattern is especially valuable for business analysts who need answers from data but lack SQL expertise.
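The reformulate-on-failure loop can be sketched with `sqlite3`. In a real agent, the next candidate query would be generated by an LLM given the previous error message; here the candidates are pre-written, so only the retry structure is illustrated.

```python
import sqlite3

def query_with_retry(conn: sqlite3.Connection, candidates: list[str]):
    """Try each candidate query in turn, moving on after failures.
    A real agent would feed each error message back to an LLM to
    produce the next reformulation."""
    last_error = None
    for sql in candidates:
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            last_error = err          # context for the next reformulation
    raise RuntimeError(f"all queries failed: {last_error}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('west', 100.0), ('east', 50.0)")

rows = query_with_retry(conn, [
    "SELECT region, total FROM sales",   # wrong column name: fails
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
])
```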
Despite rapid adoption, agentic workflows face several challenges in production. A Gartner forecast warned that over 40% of agentic AI projects will be canceled by 2027 due to escalating costs, unclear business value, or inadequate risk controls.
The gap between a demo and a reliable production system is wide. Multi-agent architectures where agents delegate to other agents, retry failed steps, or dynamically choose tools create orchestration complexity that compounds with scale. Error propagation is a particular concern: if one agent in a chain produces a flawed output, downstream agents may build on that flaw without detecting it. Retries, timeouts, and fallback logic add further complexity.
Agentic workflows consume significantly more LLM tokens than single-turn interactions. A reflection loop that runs five iterations costs roughly five times as much as a single generation. Multi-agent systems multiply this further, with each agent making its own LLM calls. Token costs, tool-call costs, and compute costs can spike unpredictably when agents encounter edge cases that trigger extensive retries or recursive calls.
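The cost multiplication is easy to see with a back-of-envelope model. The per-token price below is a hypothetical placeholder, not any provider's actual rate.

```python
# Back-of-envelope cost model. The price is a hypothetical placeholder.
PRICE_PER_1K_TOKENS = 0.01           # dollars, combined input + output

def workflow_cost(tokens_per_call: int, calls_per_iteration: int,
                  iterations: int) -> float:
    total_tokens = tokens_per_call * calls_per_iteration * iterations
    return total_tokens * PRICE_PER_1K_TOKENS / 1000

single = workflow_cost(2000, 1, 1)       # one-shot generation
reflective = workflow_cost(2000, 1, 5)   # five reflection iterations: ~5x
multi_agent = workflow_cost(2000, 4, 5)  # four agents per iteration: ~20x
```

The model ignores real-world complications (cached prompts, growing context windows, retries), all of which tend to push actual costs higher still.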
Multi-step workflows are inherently slower than single-turn responses. Each step involves at least one LLM call (often hundreds of milliseconds to several seconds), plus any time spent on tool calls (database queries, API requests, web searches). A workflow with ten steps might take 30 seconds to several minutes, which is acceptable for background tasks but problematic for interactive applications.
Agents that can take real-world actions (sending emails, modifying databases, executing code, transferring money) need robust safeguards. The challenge is building guardrails that prevent harmful actions without being so restrictive that they negate the value of automation. Common approaches include permission scoping (restricting the agent to an allowlist of approved tools and actions), human approval checkpoints for irreversible or high-stakes actions, sandboxed execution environments that isolate agent-run code from production systems, rate and spend limits that cap how many actions or tokens an agent can consume, and audit logging of every action so failures can be detected and traced after the fact.
Safety infrastructure is still maturing. Most production deployments rely on a combination of programmatic guardrails and human oversight rather than any single safety mechanism.
Understanding what an agent is doing and why is harder than monitoring a conventional application. Traditional logging captures inputs and outputs but misses what matters for agents: the reasoning behind tool selection, intermediate plans, decision paths, and context that informed each step. Tracing infrastructure for deep agent observability remains immature, and limited lineage tracking slows incident response and auditing.
Evaluating agentic workflows requires different approaches than evaluating single-turn LLM outputs. The core metrics shift from response quality to task completion and process efficiency.
| Metric | What it measures | Why it matters |
|---|---|---|
| Goal completion rate | Percentage of tasks the agent completes successfully end-to-end | The most direct measure of whether the agent is useful |
| Tool use efficiency | Whether the agent selects the right tools and uses them correctly | Incorrect tool selection wastes time and tokens |
| Step efficiency | Number of steps taken relative to the minimum required | Excessive steps indicate poor planning or unnecessary retries |
| Error recovery rate | How often the agent recovers from errors without human intervention | Directly impacts the value of autonomous operation |
| Latency (time to completion) | Total wall-clock time from task start to completion | Determines whether the workflow is practical for its intended use |
| Cost per task | Total token and tool-call costs for completing a task | Determines economic viability |
| Safety violation rate | Frequency of actions that violate defined guardrails | Measures the risk profile of the deployment |
Several benchmarks specifically target agentic capabilities, including SWE-bench (resolving real GitHub issues in open-source repositories), WebArena (completing tasks on self-hosted replicas of real websites), GAIA (general assistant questions requiring multi-step reasoning and tool use), and AgentBench (evaluating LLMs as agents across diverse interactive environments).
Pre-deployment benchmarks are necessary but not sufficient. Amazon's engineering team published a blog post in 2025 on lessons learned from building agentic systems, emphasizing the need for continuous evaluation in production. Pre-deployment testing cannot fully capture the diversity of real-world user behaviors, usage patterns, and edge cases. Production evaluation monitors for performance degradation, identifies failure modes, and provides data for iterative improvement.
Common production evaluation approaches include A/B testing (comparing agent performance against human baselines or alternative agent configurations), shadow mode (running the agent alongside human workers without taking live actions), and human review sampling (having human reviewers score a random sample of agent outputs on a regular cadence).
Agentic workflows connect to several related areas in AI, including prompt engineering and chain-of-thought reasoning (which agentic systems build on for planning), tool use and function-calling APIs, retrieval-augmented generation (often embedded as a step within an agent workflow), and multi-agent systems research.
The agentic AI ecosystem is evolving rapidly. Several trends are shaping the next phase:
Protocol convergence around MCP and A2A is reducing the integration burden for multi-agent systems. As these protocols mature and stabilize, building interoperable agent systems should become significantly simpler.
The shift from single agents to multi-agent teams continues to accelerate. Organizations are adopting a "microservices" approach to agents, replacing monolithic all-purpose agents with orchestrated teams of specialized agents.
Cost reduction through smarter orchestration is an active area of work. Techniques like routing simple queries to smaller, cheaper models while reserving expensive frontier models for complex reasoning steps can reduce costs substantially without sacrificing quality.
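Model routing of the kind described above can be sketched as follows. The model names and the 20-word threshold are illustrative; production routers typically use a small classifier or a lightweight LLM to estimate complexity.

```python
def estimate_complexity(prompt: str) -> int:
    # Crude placeholder heuristic: word count. Production routers use
    # a trained classifier or a small LLM to score query complexity.
    return len(prompt.split())

def route_model(prompt: str) -> str:
    """Send simple queries to a cheap model, complex ones to a frontier model.
    Model names and the threshold are hypothetical."""
    return "small-fast-model" if estimate_complexity(prompt) < 20 else "frontier-model"
```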
Improved evaluation and observability tools are emerging to close the gap between demo-quality and production-quality agent systems. The ability to trace agent reasoning, replay failures, and systematically test edge cases is critical for enterprise adoption.
Gartner predicts that by the end of 2026, 40% of enterprise applications will include embedded AI agents. Whether this prediction holds depends on continued progress in reliability, cost management, safety, and the maturity of orchestration frameworks and interoperability protocols.