Agentic workflow
Last reviewed
May 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 8,003 words
An agentic workflow is a multi-step process in which one or more AI agents independently plan a sequence of actions, select and use tools, evaluate intermediate results, and iterate until a goal is reached. Unlike conventional large language model usage, where a single prompt produces a single response (zero-shot prompting), agentic workflows let AI systems break complex tasks into subtasks, reflect on their own outputs, and revise their work across multiple turns.[1][2]
The term gained broad usage in early 2024 after Andrew Ng, co-founder of Google Brain and founder of DeepLearning.AI, presented a talk at Sequoia Capital's AI Ascent event in March 2024. Ng argued that agentic workflows would "drive massive AI progress this year, perhaps even more than the next generation of foundation models."[2] He demonstrated this with a striking benchmark result: on OpenAI's HumanEval coding benchmark, GPT-3.5 scored 48.1% in zero-shot mode and GPT-4 scored 67.0%, but GPT-3.5 wrapped in an agentic workflow reached 95.1%, surpassing GPT-4's zero-shot performance by a wide margin.[3] The takeaway was clear: iterative workflow design can matter more than raw model capability.
In December 2024, Anthropic published "Building effective agents," which sharpened the vocabulary by distinguishing two kinds of "agentic systems": workflows, in which "LLMs and tools are orchestrated through predefined code paths," and agents, in which "LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."[4] The term "agentic workflow" is now often used either as an umbrella for both modes or, more narrowly, for the workflow side of the distinction.
Traditional LLM applications follow a request-response pattern. A user sends a prompt, the model generates a completion, and the interaction ends. This works for simple tasks like translation, summarization, or question answering, but it struggles with open-ended problems that require research, planning, or multi-step reasoning. Human experts rarely solve complex problems in a single pass: a software engineer writes code, runs tests, reads error messages, and revises. Agentic workflows apply this iterative pattern to AI systems, where the agent receives a high-level goal, decomposes it into steps, executes each step (often calling external tools), evaluates the result, and loops back if it falls short.
This shift from single-turn completions to multi-turn autonomous execution is what distinguishes agentic AI from earlier prompt engineering techniques. Where prompt engineering optimizes a single model call, agentic workflows orchestrate many calls into a coherent process. The conceptual roots reach back to the ReAct framework introduced by Yao et al. in 2022, which interleaves reasoning traces with tool-calling actions,[5] and to the Reflexion paper by Shinn et al. (NeurIPS 2023), which showed that an agent reflecting verbally on its own task feedback could lift HumanEval pass@1 from 80% (GPT-4 baseline) to 91% within a few iterations.[6] Early autonomous-agent experiments such as Auto-GPT, released March 30, 2023 by Toran Bruce Richards, and BabyAGI, released in April 2023 by Yohei Nakajima, popularised the idea of letting a language model loop on a goal until completion, though both were widely criticised for getting stuck and burning tokens with little useful output.[7]
In his March 2024 talk and a follow-up post on X dated March 27, 2024, Andrew Ng identified four core design patterns for agentic AI: reflection, tool use, planning, and multi-agent collaboration.[2][8] These patterns are not mutually exclusive; production systems often combine two or more.
In a reflection workflow, the model generates an output and then critiques its own work. Ng described a common setup with two agents: one prompted to generate high-quality output and another prompted to provide constructive criticism. The conversation between these two agents leads to iterative improvement. For example, a coding agent might generate a function, then a reviewer agent points out an edge case, and the first agent revises its code.[8]
Reflection is the simplest agentic pattern and often the easiest to implement. Despite its simplicity, Ng noted that he had been "delighted by how much it improved applications' results in a few cases."[8] The pattern works especially well for writing, code generation, and data analysis where quality can be evaluated programmatically. The Reflexion paper (2023) is the canonical academic formulation of this idea, framing verbal self-reflection as a substitute for weight updates.[6]
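The generate-and-critique loop described above can be sketched in a few lines. This is a minimal illustration, not Ng's or the Reflexion authors' implementation: `generate` and `critique` are hypothetical stand-ins for two differently prompted model calls, and the toy functions exist only so the sketch runs without a model.

```python
from typing import Callable


def reflect_loop(
    generate: Callable[[str, str], str],
    critique: Callable[[str, str], str],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Generate a draft, then alternate critique and revision.

    generate(task, feedback) and critique(task, draft) are hypothetical
    stand-ins for model calls. The critic returns an empty string when it
    has no further objections.
    """
    draft = generate(task, "")
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if not feedback:  # critic is satisfied, stop early
            break
        draft = generate(task, feedback)
    return draft


# Toy stand-ins (assumptions) so the sketch runs offline.
def toy_generate(task: str, feedback: str) -> str:
    if "empty list" in feedback:
        return "def mean(xs): return sum(xs) / len(xs) if xs else 0.0"
    return "def mean(xs): return sum(xs) / len(xs)"


def toy_critique(task: str, draft: str) -> str:
    return "" if "if xs" in draft else "Handle the empty list edge case."


final = reflect_loop(toy_generate, toy_critique, "Write mean(xs)")
```

The `max_rounds` cap is a design choice rather than part of the pattern itself: without it, a critic that never returns an empty response would loop indefinitely.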
Tool use extends the agent's capabilities beyond text generation. Instead of relying solely on its training data, the agent can call external APIs, query databases, execute code, search the web, or interact with file systems. The ReAct (reasoning and acting) framework, introduced by Yao et al. in a 2022 paper published at ICLR 2023, formalized this pattern by interleaving reasoning traces with tool-calling actions. At each step, the agent articulates its reasoning (a "thought"), decides on an action (such as calling a search API), observes the result, and then reasons about the next step.[5]
Tool use is what transforms a language model from a text generator into a functional agent. With tools, it can retrieve current information, perform calculations, modify files, send messages, and interact with software systems. The technical substrate in modern systems is structured function calling, in which the model emits a typed JSON object specifying which tool to invoke and with which arguments.
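As a rough sketch of the thought, action, observation cycle and of structured function calling, the loop below parses a JSON object emitted by the model and dispatches it to a registered function. The JSON field names (`tool`, `arguments`, `thought`, `final`), the tool registry, and the toy model are all assumptions made for illustration; real function-calling APIs define their own schemas.

```python
import json
from typing import Callable

# Hypothetical tool registry: names and functions are illustrative only.
TOOLS: dict[str, Callable[..., str]] = {
    "add": lambda a, b: str(a + b),
    "upper": lambda text: text.upper(),
}


def run_tool_call(raw: str) -> str:
    """Parse a model-emitted JSON tool call and return the observation."""
    call = json.loads(raw)
    name, args = call["tool"], call.get("arguments", {})
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](**args)
    except TypeError as exc:  # wrong or missing arguments
        return f"error: {exc}"


def react_loop(model: Callable[[list[str]], str], question: str,
               max_steps: int = 5) -> str:
    """Alternate model steps and tool observations until a final answer."""
    transcript = [f"question: {question}"]
    for _ in range(max_steps):
        raw = model(transcript)
        step = json.loads(raw)
        if "final" in step:
            return step["final"]
        transcript.append(f"thought: {step.get('thought', '')}")
        transcript.append(f"observation: {run_tool_call(raw)}")
    return "error: step limit reached"


# Toy model (assumption): calls the tool once, then answers.
def toy_model(transcript: list[str]) -> str:
    if any(line.startswith("observation:") for line in transcript):
        obs = transcript[-1].split(": ", 1)[1]
        return json.dumps({"final": f"The sum is {obs}."})
    return json.dumps({"thought": "I should add the numbers.",
                       "tool": "add", "arguments": {"a": 2, "b": 3}})


answer = react_loop(toy_model, "What is 2 + 3?")
```

Returning errors as observations, rather than raising, lets the model see the failure and try a different call on the next step.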
Planning involves the agent decomposing a complex task into a sequence of smaller, manageable steps before executing them. Rather than attempting to solve the entire problem at once, the agent creates a plan, executes it step by step, and adjusts the plan if intermediate results reveal that the original approach is not working.[8] Sophisticated planning agents use techniques like chain-of-thought reasoning to map out their approach, and some systems maintain an explicit plan data structure that tracks completed and pending steps. Planning becomes especially valuable when tasks have dependencies or when the agent needs to allocate limited resources across competing priorities.
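One way to picture the explicit plan data structure mentioned above is a small dependency-aware task list. The class and step names here are hypothetical, and a real planner would have a model produce and revise the steps rather than hard-coding them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    name: str
    depends_on: list[str] = field(default_factory=list)
    done: bool = False


@dataclass
class Plan:
    steps: dict[str, Step] = field(default_factory=dict)

    def add(self, name: str, depends_on: Optional[list[str]] = None) -> None:
        self.steps[name] = Step(name, depends_on or [])

    def ready(self) -> list[str]:
        """Pending steps whose dependencies have all completed."""
        return [
            s.name for s in self.steps.values()
            if not s.done and all(self.steps[d].done for d in s.depends_on)
        ]

    def complete(self, name: str) -> None:
        self.steps[name].done = True


plan = Plan()
plan.add("gather sources")
plan.add("draft report", depends_on=["gather sources"])
plan.add("fact-check", depends_on=["draft report"])

order = []
while (ready := plan.ready()):
    step = ready[0]      # an agent would execute the step here
    order.append(step)
    plan.complete(step)
```

Because the plan is plain data, the agent can insert, drop, or reorder pending steps between iterations when an intermediate result shows the original approach is not working.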
Multi-agent collaboration assigns different roles or specializations to separate agents that work together on a shared task. One agent might act as a researcher, another as a writer, a third as a fact-checker, and a fourth as an editor. Each agent focuses on its specialty, and they communicate through structured message passing.[8] The pattern mirrors how human product teams divide labor across designer, engineer, product manager, and QA tester. Gartner reported a 1,445% increase in enterprise inquiries about multi-agent systems from Q1 2024 to Q2 2025, indicating strong industry interest in this pattern.[9]
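The role-based division of labour with structured message passing might look like the following sketch. The `Message` type, the role names, and the string transforms are assumptions for illustration; in practice each agent would wrap a separately prompted model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Message:
    sender: str
    content: str


Agent = Callable[[Message], Message]


def make_agent(role: str, transform: Callable[[str], str]) -> Agent:
    """Wrap a text transform (a stand-in for a model call) as a named agent."""
    def agent(msg: Message) -> Message:
        return Message(sender=role, content=transform(msg.content))
    return agent


def run_pipeline(agents: list[Agent], brief: str) -> list[Message]:
    """Pass the latest message to each agent in turn and keep the full log."""
    log = [Message("user", brief)]
    for agent in agents:
        log.append(agent(log[-1]))
    return log


team = [
    make_agent("researcher", lambda text: text + " | notes"),
    make_agent("writer", lambda text: text + " | draft"),
    make_agent("fact-checker", lambda text: text + " | verified"),
    make_agent("editor", lambda text: text + " | polished"),
]
log = run_pipeline(team, "brief")
```

A linear hand-off is the simplest topology; frameworks typically also support agents addressing each other directly or through a shared conversation.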
Anthropic's December 19, 2024 essay "Building effective agents" is the most influential post-Ng formalisation of agentic workflows. It enumerates five concrete workflow patterns and contrasts them with the open-ended agent pattern.[4] The five workflow patterns are prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer.
Anthropic contrasts these predetermined patterns with the agent pattern proper, in which an LLM operates a tool-use loop, decides at each step what to do next, and runs until it judges the task complete or hits a stop condition.[4] The essay urges practitioners to "find the simplest solution possible, and only increase complexity when needed," noting that many production use cases are best served by a workflow rather than a full agent.[4]
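To make "predefined code paths" concrete, here is a minimal sketch of a prompt-chaining workflow with a programmatic check between steps. The `LLM` callable, the prompts, and the gate condition are assumptions for illustration, not code from the essay. The control flow lives in ordinary code, and the model only fills in each step.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def chain_with_gate(llm: LLM, topic: str) -> str:
    """Fixed three-step chain: outline, draft, polish."""
    outline = llm(f"Outline an article about: {topic}")
    # Programmatic gate between steps: stop early if the outline is unusable.
    if len(outline.splitlines()) < 2:
        raise ValueError("outline too short; aborting chain")
    draft = llm(f"Write the article from this outline:\n{outline}")
    return llm(f"Tighten this draft:\n{draft}")


# Toy model (assumption) so the sketch runs offline.
def toy_llm(prompt: str) -> str:
    if prompt.startswith("Outline"):
        return "1. Intro\n2. Body"
    if prompt.startswith("Write"):
        return "DRAFT"
    return "FINAL"


result = chain_with_gate(toy_llm, "agentic workflows")
```

In the agent pattern, by contrast, the model itself would decide after each step whether to outline again, draft, or stop.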
Beyond Ng's four high-level categories and Anthropic's five workflow patterns, practitioners have identified several specific patterns that appear repeatedly in production agentic systems.
| Pattern | Description | Typical use case | Complexity |
|---|---|---|---|
| ReAct loop | The agent alternates between reasoning (thought) and acting (tool call), observing results at each step | Search and retrieval, data analysis, question answering | Low |
| Reflection / self-critique | The agent generates output, critiques it, and revises iteratively | Code generation, writing, debugging | Low |
| Plan-and-execute | The agent creates an explicit multi-step plan, then executes each step sequentially | Complex research, project management, multi-file code changes | Medium |
| Tool routing | A dispatcher agent decides which specialized tool or sub-agent to invoke based on the user's request | Customer support triage, IT helpdesk, function calling | Medium |
| Multi-agent debate | Multiple agents argue different positions and converge on a consensus answer | Fact verification, decision making, risk assessment | Medium |
| Hierarchical delegation | A supervisor agent delegates subtasks to worker agents and aggregates their results | Report generation, data pipelines, software engineering | High |
| Human-in-the-loop | The agent pauses at specified checkpoints to request human approval before proceeding | Financial transactions, medical recommendations, legal review | Medium |
| Map-reduce | The agent splits a large task into parallel subtasks, processes them concurrently, and merges the outputs | Document analysis, batch processing, large-scale summarization | High |
| Orchestrator-worker | A planner LLM fans out subtasks to multiple worker LLMs and synthesises their outputs | Long research reports, complex code changes | High |
| Evaluator-optimizer | A generator LLM and a critic LLM loop until a quality threshold is met | High-quality writing, structured extraction | Medium |
How much independence an agent should have in production is one of the central design questions. Anthropic's 2026 research on measuring agent autonomy emphasized that "autonomy is not a fixed property of a model or system but an emergent characteristic of a deployment."[10] Three broad patterns have emerged.
In the human-in-the-loop pattern, the agent proposes actions but waits for explicit human approval before executing them. Every tool call, API request, or output goes through a human review step. This is the safest approach and is common in high-stakes domains like healthcare, finance, and legal work. The downside is speed: the workflow can only move as fast as the human reviewer.
In the supervised autonomy pattern, the agent operates autonomously for routine, low-risk, reversible actions but escalates to a human for high-risk, irreversible, or ambiguous decisions. This "calibrated autonomy" approach tries to balance safety with efficiency. A customer service agent, for instance, might process a standard refund on its own but escalate a complaint involving potential fraud to a human supervisor.
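The escalation rule can be sketched as a small gate in front of every action. The `Action` fields, the numeric risk score, and the 0.5 threshold are assumptions chosen for illustration; how risk is actually scored is deployment-specific.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    risk: float  # 0.0 (routine) to 1.0 (high stakes); scoring is assumed


def execute(action: Action, approve: Callable[[Action], bool],
            risk_threshold: float = 0.5) -> str:
    """Run low-risk reversible actions directly; escalate everything else."""
    needs_human = (not action.reversible) or action.risk >= risk_threshold
    if needs_human and not approve(action):
        return "blocked"
    return "escalated-and-run" if needs_human else "auto-run"


refund = Action("standard refund", reversible=True, risk=0.1)
fraud = Action("close account for suspected fraud", reversible=False, risk=0.9)

refund_outcome = execute(refund, approve=lambda a: False)  # human never asked
fraud_outcome = execute(fraud, approve=lambda a: False)    # human declines
```

Treating irreversibility as an automatic trigger, independent of the risk score, keeps a miscalibrated score from letting an unrecoverable action through unreviewed.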
Anthropic's autonomy research, drawing on nearly a million tool calls and hundreds of thousands of live coding sessions in deployed environments, found that approximately 73% of tool calls were human-in-the-loop in practice and that only about 0.8% of calls were irreversible. New users started with an auto-approve rate near 20%, while experienced users moved to over 50%, even as they also interrupted the agent roughly twice as often.[10] Between October 2025 and January 2026, the 99.9th percentile of Claude Code session length nearly doubled, from under 25 minutes to over 45 minutes, indicating that users were trusting agents with longer-running tasks.[10]
In the fully autonomous pattern, the agent executes its entire workflow without human intervention. Guardrails and safety checks are embedded programmatically rather than relying on human review. This pattern is used for low-stakes, high-volume tasks like data entry, content tagging, or log analysis. Fully autonomous agents require robust monitoring and observability infrastructure to catch failures after the fact.
Most production deployments in 2025 and 2026 sit in the supervised autonomy zone. A Deloitte report on agentic AI strategy noted that organizations are moving from single-step automation to systems that manage entire workflows, but with human oversight at decision points.[11]
Several open-source and commercial frameworks have emerged to simplify building agentic workflows. They handle common concerns like state management, tool registration, agent communication, error handling, and observability.
LangGraph, developed by LangChain, uses a graph-based workflow design. Agent interactions are modeled as nodes in a directed graph, with edges representing transitions between steps. This architecture supports conditional branching, cycles (for iterative refinement), and parallel execution paths. LangGraph provides fine-grained control over agent state through a persistent state object that flows through the graph. It is written in Python and TypeScript and integrates tightly with the broader LangChain ecosystem.[12]
LangGraph implements all five Anthropic workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer) as documented reference implementations.[12] Its Send API supports dynamic dispatch of worker nodes from an orchestrator, enabling fan-out and fan-in patterns over arbitrary numbers of parallel subtasks.[12]
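The graph model (nodes that transform a shared state, edges that pick the next node, and cycles for refinement) can be illustrated without any framework. The sketch below is plain Python and deliberately does not use LangGraph's API; the `run_graph` helper, the `"END"` sentinel, and the node names are assumptions made for illustration.

```python
from typing import Callable

State = dict
Node = Callable[[State], State]
Edge = Callable[[State], str]  # returns the name of the next node


def run_graph(nodes: dict[str, Node], edges: dict[str, Edge],
              start: str, state: State, max_steps: int = 20) -> State:
    """Walk the graph from `start` until an edge returns "END"."""
    current = start
    for _ in range(max_steps):
        if current == "END":
            return state
        state = nodes[current](state)     # node updates the shared state
        current = edges[current](state)   # edge chooses where to go next
    raise RuntimeError("step limit reached")


nodes = {
    "draft": lambda s: {**s, "attempts": s["attempts"] + 1},
    "review": lambda s: {**s, "approved": s["attempts"] >= 2},
}
edges = {
    "draft": lambda s: "review",
    # Conditional edge: loop back to "draft" until the review passes.
    "review": lambda s: "END" if s["approved"] else "draft",
}

result = run_graph(nodes, edges, "draft", {"attempts": 0, "approved": False})
```

Each node returns a new state dictionary rather than mutating the old one, which makes it straightforward to checkpoint or replay a run.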
CrewAI takes a role-based approach inspired by human team structures. Developers define "crews" of agents, each with a distinct role, backstory, and set of capabilities. Tasks are assigned to agents based on their roles, and agents can communicate with each other and delegate subtasks. CrewAI focuses on simplicity and rapid prototyping and is popular for content generation pipelines, competitive analysis, and customer support workflows.
AutoGen, originally developed by Microsoft Research, approaches multi-agent orchestration through structured conversations. Agents communicate by exchanging messages in a group chat format, with different agents taking on roles like planner, researcher, coder, and executor. AutoGen supports both fully autonomous conversations and human-in-the-loop configurations where a human participant can intervene in the conversation.
In October 2025, Microsoft merged AutoGen with Semantic Kernel into the unified Microsoft Agent Framework. This combined platform offers AutoGen's flexible multi-agent orchestration with Semantic Kernel's enterprise features, including session-based state management, middleware, and telemetry. The 1.0 general availability release is targeted for Q1 2026.[13]
Semantic Kernel is Microsoft's model-agnostic SDK for building AI agents and multi-agent systems. Available in Python, C#, and Java, it targets enterprise developers already working in the Microsoft ecosystem. The framework supports sequential, concurrent, group-chat, handoff, and "Magentic" orchestration patterns. Following the merger with AutoGen, Semantic Kernel provides the production foundation (state management, type safety, telemetry) while AutoGen contributes the agent abstractions and conversational patterns.[13]
The Claude Agent SDK, released by Anthropic in 2025, exposes the same primitives used by Claude Code: a tool-use loop, subagent spawning, Claude Skills, and integration with Model Context Protocol servers. Anthropic shipped these building blocks in rapid succession, with the Agent SDK, Claude Code Subagents, Skills (October 2025), and Agent Teams (early 2026) arriving over a span of about a year.[14][15]
Other notable frameworks include LlamaIndex Workflows (focused on retrieval-augmented generation agents), DSPy (which compiles declarative agent programs into optimized prompts), Haystack (by deepset, for NLP pipeline orchestration), Amazon Bedrock Agents (a managed service on AWS for building production agents), and the OpenAI Agents SDK. The framework landscape is rapidly evolving, with new entrants appearing regularly.
| Framework | Developer | Architecture style | Primary languages | Release year | Ideal use case |
|---|---|---|---|---|---|
| LangGraph | LangChain | Graph-based workflows | Python, TypeScript | 2024 | Complex conditional workflows |
| CrewAI | CrewAI Inc. | Role-based crews | Python | 2023 | Team-structured content and research workflows |
| AutoGen / Microsoft Agent Framework | Microsoft | Conversation-based | Python, .NET | 2023 (AutoGen), 2025 (merged) | Research, collaborative problem-solving |
| Semantic Kernel | Microsoft | Plugin-based, multi-pattern | Python, C#, Java | 2023 | Enterprise .NET/Java applications |
| LlamaIndex Workflows | LlamaIndex | Event-driven | Python, TypeScript | 2024 | RAG-heavy agent applications |
| DSPy | Stanford NLP | Declarative compilation | Python | 2023 | Optimized prompt pipelines |
| Amazon Bedrock Agents | AWS | Managed service | Python (SDK) | 2023 | Serverless production agents on AWS |
| Claude Agent SDK | Anthropic | Tool-loop + subagents + MCP | Python, TypeScript | 2025 | Long-running coding and research agents |
| OpenAI Agents SDK | OpenAI | Handoff-based, function-tool | Python, TypeScript | 2025 | OpenAI-native multi-agent apps |
The Model Context Protocol (MCP) is an open standard introduced by Anthropic on November 25, 2024, to standardize how AI agents connect to external tools, data sources, and systems. Before MCP, every integration between an agent and an external service required custom code. MCP replaces this fragmented approach with a single, universal protocol.[16]
MCP uses a client-server architecture. An MCP client (the AI agent or application) connects to one or more MCP servers, each of which exposes tools and data from a specific system. A single MCP server for GitHub, for example, might expose tools for creating pull requests, reading issues, and searching repositories. The agent connects to the server once and can call any of its tools through a standardized interface.
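On the wire, MCP messages are JSON-RPC 2.0 objects, with `tools/list` and `tools/call` as the methods for discovering and invoking tools. The sketch below only builds the request payloads; transport, initialization, and response handling are omitted, and the tool name `create_pull_request` and its arguments are hypothetical rather than taken from any real server.

```python
import itertools
import json
from typing import Optional

_ids = itertools.count(1)  # JSON-RPC request ids


def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request such as an MCP client would send."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)


# Ask the server which tools it exposes.
list_req = jsonrpc_request("tools/list")

# Invoke one tool by name; the name and arguments here are hypothetical.
call_req = jsonrpc_request("tools/call", {
    "name": "create_pull_request",
    "arguments": {"title": "Fix typo", "base": "main"},
})
```

Because every server speaks the same two-step shape (list, then call), a client needs no integration code specific to GitHub, Slack, or any other backing system.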
At launch, Anthropic released pre-built MCP servers for Google Drive, Slack, GitHub, Git, Postgres, and Puppeteer. Early adopters included Block, Apollo, Zed, Replit, Codeium, and Sourcegraph.[16] In March 2025, OpenAI adopted MCP across its products, including the ChatGPT desktop application. By the end of 2025, the ecosystem had grown to over 10,000 active MCP servers with 97 million monthly SDK downloads across Python and TypeScript.
On December 9, 2025, Anthropic donated MCP to the newly formed Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation. The foundation was co-founded by Anthropic, Block, and OpenAI, with founding contributions of three projects: Anthropic's MCP, Block's goose, and OpenAI's AGENTS.md. Platinum members included Amazon Web Services, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI.[17] This move signaled industry-wide commitment to a shared standard for agent-tool integration.
MCP has become the de facto standard for connecting agents to tools. Its rapid adoption across competing AI companies and thousands of third-party developers suggests that the fragmentation problem for tool integration is largely solved at the protocol level.
While MCP standardizes how agents connect to tools, the Agent2Agent Protocol (A2A) standardizes how agents communicate with each other. Google announced A2A on April 9, 2025, with support from over 50 technology partners including Atlassian, Salesforce, SAP, ServiceNow, PayPal, LangChain, and MongoDB.[18]
A2A addresses a different problem than MCP. In a multi-agent system, agents from different vendors or built on different frameworks need a common language to discover each other's capabilities, delegate tasks, exchange results, and negotiate how to present information to users. A2A provides this common language.
The protocol has four core capabilities: capability discovery, task management, collaboration, and user experience negotiation.
A2A is built on existing web standards (HTTP, Server-Sent Events, and JSON-RPC) and supports enterprise-grade authentication and authorization. In June 2025, Google contributed A2A to the Linux Foundation. Version 0.3, released in July 2025, provided a more stable API surface for production use.[19]
A2A and MCP are complementary. MCP handles agent-to-tool communication (vertical integration), while A2A handles agent-to-agent communication (horizontal integration). Together, they form the emerging protocol stack for multi-agent systems.
The eighteen months from late 2024 through mid-2026 transformed agentic workflows from a research idea into a mass-market product category. Several developments are particularly worth recording in detail.
On October 22, 2024, Anthropic introduced computer use as a public beta capability of an upgraded Claude 3.5 Sonnet, making Claude one of the first frontier systems to offer autonomous desktop control through an API tool. Instead of building specialised tools for each task, Anthropic taught Claude general computer skills: looking at a screen, moving a cursor, clicking buttons, and typing text. Anthropic released the feature as experimental, acknowledging that it was "at times cumbersome and error-prone."[20]
OpenAI followed on January 23, 2025 with Operator, a browser-based agent powered by a Computer-Using Agent (CUA) model that combined GPT-4o's vision with reinforcement-learning-based reasoning over screenshots. Operator was made available to ChatGPT Pro subscribers in the US at $200 per month. CUA scored 38.1% on OSWorld for full computer use tasks, 58.1% on WebArena, and 87% on WebVoyager.[21] On July 17, 2025, OpenAI launched ChatGPT Agent, a unified system that combined Operator's browser, deep research's web synthesis, and ChatGPT's conversational interface, rolled out to Pro, Plus, and Team subscribers; the model underlying ChatGPT Agent scored 41.6% pass@1 on Humanity's Last Exam, roughly double the o3 and o4-mini baselines.[22] The standalone Operator site at operator.chatgpt.com was deprecated and shut down on August 31, 2025.[22]
Google's Project Mariner entered the same space in December 2024 as a research prototype built on Gemini 2.0, an experimental Chrome extension that could reason across page text, images, forms, and code to drive a browser. At Google I/O on May 20, 2025, Project Mariner gained cloud-based execution, the ability to handle up to ten simultaneous tasks, and a "Teach & Repeat" feature; it scored 83.5% on the WebVoyager benchmark.[23] In December 2025 Google folded Project Mariner's capabilities into Gemini Agent inside the Gemini app, and the standalone Mariner landing page listed May 4, 2026 as the shutdown date.[24]
Agentic coding became the most commercially successful application of the technology. Anthropic released Claude Code in February 2025 as an agentic command-line tool, made it generally available in May 2025 alongside Claude 4, and shipped Claude Code 2.0 in September 2025 with split-panel views, a built-in terminal, parallel sessions, checkpoints, an IDE extension, and automation hooks.[25] Claude Code reached approximately $1 billion in annualized run-rate revenue within roughly six months of general availability, and Anthropic's overall annualized revenue jumped from approximately $1 billion at the start of 2025 to around $5 billion by August 2025.[26]
Cursor shipped Cursor 2.0 on October 29, 2025, pairing a purpose-built coding model named Composer with an agent-centred interface. Composer is a mixture-of-experts model trained with reinforcement learning for software engineering and is roughly four times faster than comparably intelligent models, completing most turns in under 30 seconds. Cursor 2.0 supports up to eight parallel agents on a single prompt, using git worktrees or remote machines to keep their workspaces isolated.[27]
Replit Agent launched in September 2024 and helped drive Replit to $100 million ARR in 2025; Replit Agent 3, released September 10, 2025, can operate for up to 200 minutes continuously, self-tests its own bugs in a loop, and can build other agents and automations through an "Agents & Automations" interface.[28] OpenAI shipped a succession of agentic Codex models throughout 2025 and into 2026, including GPT-5-Codex (September 23, 2025), GPT-5.1-Codex-Max, and GPT-5.2-Codex (December 18, 2025), each positioned as more capable than the last for long-horizon agentic coding work.[29]
Devin, the autonomous software engineer introduced by Cognition AI in March 2024 with the tagline "the first AI software engineer," had a mixed reception. An independent analysis by Answer.AI in early 2025 found that across 20 tasks Devin produced 14 failures, three inconclusive results, and three successes, a 15% success rate, and observed cases where Devin "got stuck on npm package dependency errors and committed a bunch of hallucinated, non-incrementally-beneficial code."[30] Cognition responded with Devin 2.0 in April 2025, which lowered the entry price to $20 per month, added a cloud IDE supporting multiple parallel Devins, an Interactive Planning mode, and a Devin Search/Wiki feature.[31] On July 14, 2025 Cognition acquired the AI coding startup Windsurf after Google had hired away Windsurf's leadership in a separate $2.4 billion talent and licensing deal; the acquisition included Windsurf's IP, product, and team, and reportedly more than doubled Cognition's ARR. Cognition's valuation reached $10.2 billion by September 2025.[32] Devin 2.2, released February 24, 2026, added desktop-app testing, made startup three times faster, and introduced Devin Review for automated pull-request review.[33]
The trend toward subagent and multi-agent patterns accelerated through 2025. In Claude Code, the Task tool spawns subagents that each receive their own context window, system prompt, tool allowlist, and permissions; the parent agent receives only the subagent's final output, not its full reasoning trace, which preserves orchestrator context.[15] Multiple Task invocations can run concurrently, with up to ten subagents in parallel in a typical configuration.[15] Anthropic's "Agent Teams" feature, released in early 2026, added higher-level coordination primitives: a shared task list with dependency tracking, peer-to-peer messaging between teammates, and file locking to prevent merge conflicts.[14]
This "split-and-merge" pattern, in which an orchestrator decomposes a large task, fans it out to parallel subagents, and merges their results in a single session, became the canonical way to use a coding agent at scale. Cursor 2.0's eight-parallel-agents feature is the same idea wrapped in a graphical UI.[27]
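A minimal sketch of the split-and-merge shape follows, using a thread pool as a stand-in for parallel subagents. It is not how Claude Code or Cursor implement the feature: `run_subagent`, the worker signature, and the default of ten parallel workers (mirroring the figure quoted above) are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_subagent(worker: Callable[[list[str]], str], subtask: str) -> str:
    """Run one subagent in a fresh context and return only its final output."""
    context = [subtask]       # private to this subagent
    return worker(context)    # intermediate scratch work stays in `context`


def split_and_merge(subtasks: list[str],
                    worker: Callable[[list[str]], str],
                    merge: Callable[[list[str]], str],
                    max_parallel: int = 10) -> str:
    """Fan subtasks out to parallel subagents, then merge their outputs."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so results line up with subtasks.
        results = list(pool.map(lambda t: run_subagent(worker, t), subtasks))
    return merge(results)


# Toy worker (assumption): does some scratch work, returns a summary.
def toy_worker(context: list[str]) -> str:
    context.append("scratch work the parent never sees")
    return f"done: {context[0]}"


report = split_and_merge(["auth module", "billing module"],
                         toy_worker, "\n".join)
```

Only the returned summaries reach the merge step, which is the property that keeps the orchestrator's context small as the number of subagents grows.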
On March 6, 2025, the Chinese startup Butterfly Effect launched Manus AI in invitation-only beta. Manus presented itself as the world's first general AI agent, built on top of Claude Sonnet 4.5 (and earlier 3.5) plus fine-tuned versions of Alibaba's open-source Qwen, with multiple specialised subagents orchestrated to act autonomously on a wide range of tasks. The launch demo, which showed Manus screening resumes and analysing stocks, drew more than one million views within twenty hours, and invitation codes traded second-hand for tens of thousands of renminbi.[34] Voices ranging from Jack Dorsey to Hugging Face product lead Victor Mustar called Manus a "second DeepSeek moment."
The hype translated into rapid commercial growth. By December 2025, Manus had millions of subscribers, more than $100 million in annual recurring revenue, and a revenue run rate above $125 million eight months after launch. In December 2025, Meta announced an agreement to acquire Manus at a reported valuation between US$2 billion and US$3 billion. The transaction closed in December 2025, and Meta integrated Manus's operations. On April 27, 2026, China's National Development and Reform Commission blocked the acquisition, citing concerns about technology leakage and ordering all parties to withdraw, although Manus employees had already joined Meta and earlier backers had received their proceeds.[35]
Salesforce launched Agentforce on October 29, 2024 as an enterprise AI agent platform, then shipped Agentforce 2.0 in December 2024 (general availability February 2025), Agentforce 3 in June 2025, and Agentforce 360 in October 2025. By the time Agentforce 360 became generally available, Salesforce reported over 12,000 customers; one early reference customer, Reddit, deflected 46% of its support cases and cut average response times from 8.9 minutes to 1.4 minutes.[36] Microsoft pursued a parallel strategy, embedding agentic capabilities into Microsoft 365 Copilot through 2025 and shipping the "Copilot Cowork" wave of specialised teammates in early 2026.
OpenAI's Deep Research, announced in early 2025 and integrated into ChatGPT Agent in July 2025, established the "browse for hours, then write a report" pattern as a category of its own. Google released a reimagined Gemini Deep Research agent on December 11, 2025, based on Gemini 3 Pro and exposed to developers through a new Interactions API, on the same day OpenAI released GPT-5.2.[37]
METR's time-horizon methodology, published in March 2025, attempts to quantify how long a task an AI agent can complete autonomously with 50% success. METR's 2026 update showed the 50% time horizon growing roughly exponentially with a doubling time of about seven months, from nine seconds for early GPT-3 agents in 2020 to around 14.5 hours for Claude Opus 4.5 class systems in early 2026. METR's task suite grew from 170 to 228 tasks in the 2026 revision.[38][39]
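The arithmetic behind a constant doubling time is simple exponential growth, and a short sketch makes the extrapolation explicit. The function names are hypothetical, the seven-month default comes from the figure quoted above, and assuming the doubling time stays constant is a simplification rather than a claim METR makes about the future.

```python
import math


def projected_horizon(h0_hours: float, months_elapsed: float,
                      doubling_months: float = 7.0) -> float:
    """Time horizon after `months_elapsed`, assuming a constant doubling time."""
    return h0_hours * 2 ** (months_elapsed / doubling_months)


def doublings_between(h_start_hours: float, h_end_hours: float) -> float:
    """Number of doublings needed to grow from one horizon to another."""
    return math.log2(h_end_hours / h_start_hours)


# A one-hour horizon doubles twice in 14 months at a 7-month doubling time.
after_14_months = projected_horizon(1.0, 14.0)

# Doublings separating a 9-second horizon from a 14.5-hour one.
n = doublings_between(9 / 3600, 14.5)
```

Multiplying the number of doublings by the doubling time gives the elapsed time the trend implies, which is a quick consistency check on any quoted pair of endpoints.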
Anthropic's empirical autonomy study, drawing on roughly a million tool calls and hundreds of thousands of coding sessions, found that software engineering accounted for around 50% of tool calls on the Anthropic API; that approximately 73% of tool calls still ran with a human in the loop; and that only about 0.8% of calls were judged irreversible.[10] Crossing roughly 750 sessions of experience was associated with a qualitative shift: users tended to stop approving individual actions and instead monitored from a distance, intervening when necessary.[10]
BrowseComp, released by OpenAI on April 10, 2025, is a 1,266-question benchmark for browsing agents that asks for "hard-to-find, entangled information" that might require navigating tens or hundreds of websites. On launch, GPT-4o with browsing reached only 1.9% accuracy, while OpenAI's Deep Research scored 51.5%, a 27x gap that suggested architectural choices, not tool access, were the binding constraint.[40]
On τ-bench (Sierra Research), which simulates a tool-using customer-service agent in dialogue with a simulated user across retail and airline domains, Claude Sonnet 4.5 led the airline split with 70.0% by late 2025. Sierra subsequently released τ²-bench (2025) and τ³-bench, which added a banking domain and a voice-evaluation modality.[41]
On SWE-bench Verified, the 500-task human-reviewed subset of real GitHub issues, the first system crossed 50% in early 2025; by April 2026 the top four systems were above 60% and the leader reached 72.0%. Human software engineers on the same tasks score around 90% with reasonable time budgets, so the benchmark is not yet saturated.[42]
On the GAIA benchmark for general AI assistants, where humans score 92% but GPT-4 with plugins managed 15% at the original 2023 release, H2O.ai's h2oGPTe Agent reached 75% in 2025, the first time a system received a passing grade.[43]
The UK AI Security Institute and Meridian Labs developed Inspect AI, an open-source Python framework for reproducible LLM and agent evaluations. Inspect has been adopted by METR, Apollo Research, other government AISIs, and major safety labs. Its built-in agent primitives include ReAct, multi-agent composition, an external-agent bridge for tools like Claude Code, Codex CLI, and Gemini CLI, and a sandboxing toolkit with Docker built in and Kubernetes and Proxmox adapters; over 200 pre-built evaluations ship with the framework.[44][45]
On October 16, 2025, Anthropic released Claude Skills as a skills-2025-10-02 API beta. A Skill is an organized folder of instructions, scripts, and resources that an agent loads dynamically when it judges a task to match. Pre-built skills shipped for PowerPoint, Excel, Word, and PDF generation, and Skills work across Claude.ai, Claude Code, the Claude Agent SDK, and the Claude Developer Platform, so a developer can write a skill once and use it everywhere.[46]
Agentic AI moved from experimentation to early production deployment across several industries during 2025 and into 2026. Gartner predicted that 40% of enterprise applications would include task-specific AI agents by the end of 2026, up from less than 5% in 2025.[47] The global AI agents market was valued at approximately $7.8 billion in 2025, with projections reaching $52.6 billion by 2030 (a 46.3% compound annual growth rate). A January 2025 Gartner poll of 3,412 webinar attendees found that 19% of organizations had made significant investments in agentic AI, 42% had made conservative investments, 8% had made none, and the remaining 31% were taking a wait-and-see approach.[47]
Customer service was among the first domains to adopt agentic workflows at scale. Agents handle routine inquiries, process refunds, update account information, and escalate complex cases to human representatives. Workflow automation is the primary use case in 64% of agent deployments. Reddit, a reference customer for Salesforce's Agentforce 360, deflected 46% of support cases and cut resolution times by 84%.[36]
Agentic coding tools like GitHub Copilot, Cursor, Claude Code, and Devin are among the most visible agentic workflow deployments. They go beyond autocomplete to plan multi-file changes, run tests, interpret error messages, and iterate until tests pass. Anthropic's 2026 Agentic Coding Trends Report noted that agentic AI is "reconfiguring the software development lifecycle," with coding agents increasingly handling debugging, testing, documentation, and code review in addition to code generation.[48] Non-programmers also adopted the technology rapidly through what is now called vibe coding.
Research agents search the web, read documents, synthesize findings, and produce structured reports. They are used in competitive intelligence, market research, due diligence, and scientific literature review. OpenAI Deep Research, Gemini Deep Research, Claude Code with web tools, and Manus all targeted this market in 2025.
Banks and financial institutions use agentic workflows for Know Your Customer (KYC) verification, anti-money laundering (AML) screening, fraud detection, and regulatory compliance. McKinsey reported that banks implementing agentic AI for KYC/AML workflows have realized productivity gains ranging from 200% to 2,000%.[49]
Data analysis agents can write SQL queries, execute them against databases, interpret the results, generate visualizations, and produce written summaries. They iterate when a query returns unexpected results, reformulating until they reach an answer. The pattern is especially valuable for business analysts who need answers from data but lack SQL expertise.
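The query-and-retry loop described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: `propose_sql` is a hypothetical stand-in for an LLM call (it returns canned queries), and the table and column names are invented for the example.

```python
import sqlite3

def propose_sql(question, feedback):
    """Hypothetical stand-in for an LLM call that returns a SQL string.

    A real agent would send the question, schema, and feedback to a model.
    Here the first attempt uses a wrong column name so the retry path runs.
    """
    if feedback is None:
        return "SELECT SUM(amount) FROM orders WHERE region = 'EU'"
    return "SELECT SUM(total) FROM orders WHERE region = 'EU'"

def answer_with_retries(conn, question, max_attempts=3):
    """Ask for a query, run it, and feed any failure back into the next attempt."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        sql = propose_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"query failed: {exc}"
            continue
        if not rows or rows[0][0] is None:
            feedback = "query returned no data"
            continue
        return {"sql": sql, "rows": rows, "attempts": attempt}
    raise RuntimeError(f"no usable answer after {max_attempts} attempts: {feedback}")

# Illustrative in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EU", 120.0), ("EU", 80.0), ("US", 50.0)],
)
result = answer_with_retries(conn, "What is total EU revenue?")
```

Capping the number of attempts keeps a confused agent from retrying indefinitely, which also bounds token spend.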
Despite rapid adoption, agentic workflows face several challenges in production. A Gartner forecast warned that over 40% of agentic AI projects will be canceled by 2027 due to escalating costs, unclear business value, or inadequate risk controls.[50]
The gap between a demo and a reliable production system is wide. Multi-agent architectures create orchestration complexity that compounds with scale. Error propagation is a particular concern: if one agent in a chain produces a flawed output, downstream agents may build on that flaw without detecting it. Hallucinated tool calls and fabricated success claims have proven especially difficult to suppress. Bug reports against Claude Code through 2025 and 2026 documented cases of research subagents returning fabricated data (non-existent files, wrong prices, fictional API capabilities), of the main agent trusting those results and building plans around them, and of "false success" rates approaching 29-30% in some workflows, where an agent reports a task complete without verifying.[51]
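One mitigation for the false-success pattern is to check a subagent's claims against the environment before the orchestrator acts on them. The sketch below is illustrative only: the `SubagentReport` shape and the `verify_report` helper are assumptions made for this example, not part of any specific tool.

```python
import os
import tempfile
from dataclasses import dataclass, field

@dataclass
class SubagentReport:
    """Hypothetical shape of what a subagent returns to the orchestrator."""
    claimed_success: bool
    files_written: list = field(default_factory=list)

def verify_report(report):
    """Return a list of problems; an empty list means the claims checked out."""
    problems = []
    if not report.claimed_success:
        problems.append("subagent reported failure")
    for path in report.files_written:
        if not os.path.isfile(path):
            problems.append(f"claimed file does not exist: {path}")
        elif os.path.getsize(path) == 0:
            problems.append(f"claimed file is empty: {path}")
    return problems

with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "summary.md")
    with open(real, "w") as fh:
        fh.write("# Findings\n")
    missing = os.path.join(tmp, "prices.csv")  # never created
    report = SubagentReport(claimed_success=True, files_written=[real, missing])
    problems = verify_report(report)
```

The orchestrator would refuse to build on the report while `problems` is non-empty, so a fabricated artifact is caught at the handoff instead of propagating downstream.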
Agentic workflows consume significantly more LLM tokens than single-turn interactions. A reflection loop that runs five iterations costs roughly five times as much as a single generation. Multi-agent systems multiply this further, with each agent making its own LLM calls. Token costs, tool-call costs, and compute costs can spike unpredictably when agents encounter edge cases that trigger extensive retries or recursive calls. Anthropic's Opus 4.5 release in November 2025 specifically targeted this pain point with a 67% price reduction and an average 76% reduction in output tokens per task relative to prior Opus models.[52]
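The five-times figure can be checked with a small cost model. This is a rough sketch under stated assumptions: the prices are placeholders (not any vendor's actual rates), and `loop_cost` is a hypothetical helper. It also shows that when each iteration re-reads earlier outputs, cost grows faster than linearly.

```python
def loop_cost(iterations, prompt_tokens, output_tokens,
              price_in_per_mtok, price_out_per_mtok, resend_history=False):
    """Estimate the dollar cost of a reflection loop.

    With resend_history=False every iteration sees the same prompt size,
    so cost scales linearly with the iteration count. With True, each
    iteration also re-reads all earlier outputs, so input tokens grow.
    """
    total = 0.0
    for i in range(iterations):
        extra = i * output_tokens if resend_history else 0
        input_tokens = prompt_tokens + extra
        total += input_tokens * price_in_per_mtok / 1_000_000
        total += output_tokens * price_out_per_mtok / 1_000_000
    return total

# Placeholder prices in dollars per million tokens.
single = loop_cost(1, 2_000, 1_000, 3.0, 15.0)
five = loop_cost(5, 2_000, 1_000, 3.0, 15.0)
five_with_history = loop_cost(5, 2_000, 1_000, 3.0, 15.0, resend_history=True)
```

Under these assumptions `five` is five times `single`, while `five_with_history` is higher still because the input side grows with every pass.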
Multi-step workflows are inherently slower than single-turn responses. Each step involves at least one LLM call (often hundreds of milliseconds to several seconds), plus any time spent on tool calls (database queries, API requests, web searches). A workflow with ten steps might take 30 seconds to several minutes, which is acceptable for background tasks but problematic for interactive applications. Cursor's Composer model was explicitly trained for low-latency agentic coding, targeting "most turns in under 30 seconds."[27]
Agents that can take real-world actions (sending emails, modifying databases, executing code, transferring money) need robust safeguards. The challenge is building guardrails that prevent harmful actions without being so restrictive that they negate the value of automation. Common approaches include:
Safety infrastructure is still maturing. Most production deployments rely on a combination of programmatic guardrails and human oversight rather than any single safety mechanism. The UK AISI Inspect Sandboxing Toolkit, released in 2025, was a notable contribution from the safety community.[45]
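A combination of programmatic guardrails and human oversight might look like the following minimal sketch. The tool names, the transfer limit, and the `check_action` helper are all hypothetical; the `approve` callable stands in for a human reviewer.

```python
from dataclasses import dataclass, field

# Hypothetical policy: tools blocked outright, tools needing sign-off, a hard limit.
BLOCKED_TOOLS = {"drop_database"}
NEEDS_APPROVAL = {"send_email", "transfer_money"}
MAX_TRANSFER = 100.0

@dataclass
class Action:
    tool: str
    args: dict = field(default_factory=dict)

def check_action(action, approve):
    """Return 'allow' or 'deny' for a proposed action.

    Programmatic rules run first and cannot be overridden by the reviewer;
    `approve` is only consulted for tools on the sign-off list.
    """
    if action.tool in BLOCKED_TOOLS:
        return "deny"
    if action.tool == "transfer_money" and action.args.get("amount", 0) > MAX_TRANSFER:
        return "deny"
    if action.tool in NEEDS_APPROVAL:
        return "allow" if approve(action) else "deny"
    return "allow"

always_yes = lambda action: True
always_no = lambda action: False

decisions = {
    "read": check_action(Action("read_file"), always_no),
    "drop": check_action(Action("drop_database"), always_yes),
    "big_transfer": check_action(Action("transfer_money", {"amount": 5000}), always_yes),
    "small_transfer": check_action(Action("transfer_money", {"amount": 20}), always_yes),
    "email_rejected": check_action(Action("send_email"), always_no),
}
```

Placing the hard rules ahead of the approval step means a reviewer who approves by habit still cannot authorize an action the policy forbids.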
Understanding what an agent is doing and why is harder than monitoring a conventional application. Traditional logging captures inputs and outputs but misses what matters for agents: the reasoning behind tool selection, intermediate plans, decision paths, and context that informed each step. Tracing infrastructure for agent observability remains immature, and limited lineage tracking slows incident response and auditing.
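As a rough illustration of what an agent trace could record beyond inputs and outputs, the sketch below logs the plan, the chosen tool, and the rationale for each step, plus a parent link for lineage. The `StepTrace` field names are assumptions for this example, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class StepTrace:
    """One agent step, including the reasoning a plain I/O log would miss."""
    step: int
    parent_step: Optional[int]  # the step whose result led to this one
    plan: str
    tool: str
    tool_args: dict
    rationale: str
    result_summary: str

def to_log_line(trace):
    """Serialize a step as one JSON line for a structured log."""
    return json.dumps(asdict(trace), sort_keys=True)

def lineage(traces, step):
    """Walk parent links back from `step` to the root and return step numbers."""
    by_step = {t.step: t for t in traces}
    chain = []
    current = step
    while current is not None:
        chain.append(current)
        current = by_step[current].parent_step
    return chain

traces = [
    StepTrace(1, None, "find the failing test", "run_tests", {"path": "tests/"},
              "need a baseline before editing", "1 failure in test_retry"),
    StepTrace(2, 1, "inspect the failure", "read_file", {"path": "retry.py"},
              "the failure points at retry.py", "off-by-one in backoff loop"),
    StepTrace(3, 2, "apply a fix", "edit_file", {"path": "retry.py"},
              "the loop bound is wrong", "patch applied"),
]
log_lines = [to_log_line(t) for t in traces]
chain = lineage(traces, 3)
```

During an incident, `lineage` lets a reviewer trace a bad action back through the decisions that produced it.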
Evaluating agentic workflows requires different approaches than evaluating single-turn LLM outputs. The core metrics shift from response quality to task completion and process efficiency.
| Metric | What it measures | Why it matters |
|---|---|---|
| Goal completion rate | Percentage of tasks the agent completes successfully end-to-end | The most direct measure of whether the agent is useful |
| Tool use efficiency | Whether the agent selects the right tools and uses them correctly | Incorrect tool selection wastes time and tokens |
| Step efficiency | Number of steps taken relative to the minimum required | Excessive steps indicate poor planning or unnecessary retries |
| Error recovery rate | How often the agent recovers from errors without human intervention | Directly impacts the value of autonomous operation |
| Latency (time to completion) | Total wall-clock time from task start to completion | Determines whether the workflow is practical for its intended use |
| Cost per task | Total token and tool-call costs for completing a task | Determines economic viability |
| Safety violation rate | Frequency of actions that violate defined guardrails | Measures the risk profile of the deployment |
| pass^k reliability | Probability that the agent succeeds on all k of k independent trials of the same task | Captures the reliability gap between best-case and average behavior |
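
Several of the metrics in the table reduce to simple arithmetic over logged trials. The sketch below is one possible formulation, with hypothetical function names. For pass^k it assumes the combinatorial estimator commonly used with τ-bench-style evaluation: given c successes in n trials of a task, the chance that k trials drawn without replacement all succeed.

```python
from math import comb

def goal_completion_rate(outcomes):
    """Fraction of tasks completed end-to-end; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes)

def step_efficiency(steps_taken, min_steps):
    """Steps taken relative to the minimum required; 1.0 is ideal, higher is worse."""
    return steps_taken / min_steps

def pass_hat_k(n, c, k):
    """Estimate pass^k for one task from n trials with c successes."""
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and n")
    return comb(c, k) / comb(n, k)

completion = goal_completion_rate([True, True, False, True])
efficiency = step_efficiency(steps_taken=12, min_steps=8)
pass_1 = pass_hat_k(n=8, c=6, k=1)
pass_4 = pass_hat_k(n=8, c=6, k=4)
```

With six successes in eight trials, the estimate falls from 0.75 at k=1 to about 0.21 at k=4, which is the reliability gap the metric is meant to expose.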
Several benchmarks specifically target agentic capabilities:
Pre-deployment benchmarks are necessary but not sufficient. Pre-deployment testing cannot fully capture the diversity of real-world user behaviors, usage patterns, and edge cases. Common production evaluation approaches include A/B testing (comparing agent performance against human baselines or alternative agent configurations), shadow mode (running the agent alongside human workers without taking live actions), and human review sampling (having human reviewers score a random sample of agent outputs on a regular cadence).
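Shadow mode and human review sampling can both be sketched in a few lines. The example below is a simplified illustration under assumptions: `agent_decide` is a stub standing in for the real agent, the case format is invented, and nothing here takes a live action.

```python
import random

def shadow_compare(cases, agent_decide):
    """Run the agent on each case without acting and compare with the human decision."""
    disagreements = []
    for case in cases:
        proposed = agent_decide(case["input"])
        if proposed != case["human_decision"]:
            disagreements.append(
                {"id": case["id"], "agent": proposed, "human": case["human_decision"]}
            )
    agreement = 1 - len(disagreements) / len(cases)
    return agreement, disagreements

def sample_for_review(output_ids, rate, seed=0):
    """Pick a reproducible random sample of outputs for human scoring."""
    rng = random.Random(seed)
    k = max(1, round(len(output_ids) * rate))
    return sorted(rng.sample(output_ids, k))

def agent_decide(ticket):
    """Stub policy: refund small amounts, escalate everything else."""
    return "refund" if ticket["amount"] <= 50 else "escalate"

cases = [
    {"id": 1, "input": {"amount": 20}, "human_decision": "refund"},
    {"id": 2, "input": {"amount": 45}, "human_decision": "escalate"},
    {"id": 3, "input": {"amount": 300}, "human_decision": "escalate"},
    {"id": 4, "input": {"amount": 10}, "human_decision": "refund"},
]
agreement, disagreements = shadow_compare(cases, agent_decide)
review_batch = sample_for_review(list(range(100)), rate=0.05)
```

The disagreement list is often more useful than the agreement rate itself, since each entry is a concrete case to inspect before the agent is allowed to act on its own.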
Agentic workflows connect to several related areas in AI:
Protocol convergence around MCP and A2A is reducing the integration burden for multi-agent systems. The Agentic AI Foundation under the Linux Foundation now stewards both MCP (contributed by Anthropic) and OpenAI's AGENTS.md, with backing from AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI.[17] As these protocols mature, building interoperable agent systems should become significantly simpler.
Organizations are adopting a "microservices" approach to agents, replacing monolithic all-purpose agents with orchestrated teams of specialized agents. Both Forrester and Gartner described 2026 as the breakthrough year for multi-agent systems.[47] Cost reduction through smarter orchestration is an active area of work. Routing simple queries to smaller, cheaper models while reserving frontier models for complex reasoning steps can reduce costs substantially without sacrificing quality. On the pricing side, Anthropic's Opus 4.5 release in November 2025 cut prices by 67% while reducing output tokens by 76% on equivalent tasks.[52]
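The routing idea can be illustrated with a toy example. Everything here is an assumption made for the sketch: the model names are placeholders, and `estimate_complexity` is a crude keyword heuristic standing in for whatever classifier a real system would use.

```python
SMALL_MODEL = "small-model"        # placeholder name
FRONTIER_MODEL = "frontier-model"  # placeholder name

# Words that hint a task needs multi-step reasoning (illustrative only).
REASONING_HINTS = ("prove", "plan", "debug", "refactor", "trade-off")

def estimate_complexity(task):
    """Crude heuristic stand-in for a learned or LLM-based classifier."""
    text = task.lower()
    score = len(text.split()) / 50
    score += sum(0.5 for hint in REASONING_HINTS if hint in text)
    return score

def choose_model(task, threshold=0.5):
    """Send low-complexity tasks to the cheap model and the rest to the frontier model."""
    if estimate_complexity(task) >= threshold:
        return FRONTIER_MODEL
    return SMALL_MODEL

simple = choose_model("What is the capital of France?")
hard = choose_model("Debug the failing payment test and plan a refactor of the retry logic")
```

In practice the threshold would be tuned against measured quality, since a router that underestimates complexity saves money at the cost of wrong answers.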
Improved evaluation and observability tools are closing the gap between demo-quality and production-quality systems. UK AISI's Inspect framework, METR's time-horizon methodology, and Anthropic's empirical autonomy study are early instances. Gartner predicts that by the end of 2026, 40% of enterprise applications will include embedded AI agents, and that at least 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028, up from 0% in 2024.[47][53] Whether these predictions hold depends on continued progress in reliability, cost management, safety, and the maturity of orchestration frameworks.