Agentic workflow

Updated Jun 21, 2026

An agentic workflow is a multi-step process in which one or more AI agents independently plan a sequence of actions, select and use tools, evaluate intermediate results, and iterate until a goal is reached. Unlike conventional large language model usage, where a single prompt produces a single response (zero-shot prompting), agentic workflows let AI systems break complex tasks into subtasks, reflect on their own outputs, and revise their work across multiple turns.^[1]^[2] The four design patterns most often used to build them, named by Andrew Ng in March 2024, are reflection, tool use, planning, and multi-agent collaboration.^[8]

The term gained broad usage after Andrew Ng, co-founder of Google Brain and founder of DeepLearning.AI, spoke at Sequoia Capital's AI Ascent event in March 2024. Ng argued that agentic workflows would "drive massive AI progress this year, perhaps even more than the next generation of foundation models."^[2] He illustrated the point with a benchmark result: on OpenAI's HumanEval coding benchmark, GPT-3.5 scored 48.1% in zero-shot mode and GPT-4 scored 67.0%, but GPT-3.5 wrapped in an agentic workflow reached 95.1%, well above GPT-4's zero-shot score.^[3] As Ng put it, "The improvement from GPT-3.5 to GPT-4 is dwarfed by incorporating an iterative agent workflow."^[1] The implication is that, on tasks of this kind, iterative workflow design can matter more than raw model capability.

In December 2024, Anthropic published "Building effective agents," which sharpened the vocabulary by distinguishing two kinds of "agentic systems": workflows, in which "LLMs and tools are orchestrated through predefined code paths," and agents, in which "LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."^[4] The term "agentic workflow" is now often used either as an umbrella for both modes or, more narrowly, for the workflow side of the distinction.

Why did agentic workflows emerge?

Traditional LLM applications follow a request-response pattern. A user sends a prompt, the model generates a completion, and the interaction ends. This works for simple tasks like translation, summarization, or question answering, but it struggles with open-ended problems that require research, planning, or multi-step reasoning. Human experts rarely solve complex problems in a single pass: a software engineer writes code, runs tests, reads error messages, and revises. Agentic workflows apply this iterative pattern to AI systems, where the agent receives a high-level goal, decomposes it into steps, executes each step (often calling external tools), evaluates the result, and loops back if it falls short.

This shift from single-turn completions to multi-turn autonomous execution is what distinguishes agentic AI from earlier prompt engineering techniques. Where prompt engineering optimizes a single model call, agentic workflows orchestrate many calls into a coherent process. The conceptual roots reach back to the ReAct framework introduced by Yao et al. in 2022, which interleaves reasoning traces with tool-calling actions,^[5] and to the Reflexion paper by Shinn et al. (NeurIPS 2023), which showed that an agent reflecting verbally on its own task feedback could lift HumanEval pass@1 from 80% (GPT-4 baseline) to 91% within a few iterations.^[6] Early autonomous-agent experiments such as Auto-GPT, released March 30, 2023 by Toran Bruce Richards, and BabyAGI, released in April 2023 by Yohei Nakajima, popularised the idea of letting a language model loop on a goal until completion, though both were widely criticised for getting stuck and burning tokens with little useful output.^[7]

What are Andrew Ng's four agentic design patterns?

In his March 2024 talk and a follow-up post on X dated March 27, 2024, Andrew Ng identified four core design patterns for agentic AI: reflection, tool use, planning, and multi-agent collaboration.^[2]^[8] These patterns are not mutually exclusive; production systems often combine two or more.

Reflection

In a reflection workflow, the model generates an output and then critiques its own work. Ng described a common setup with two agents: one prompted to generate high-quality output and another prompted to provide constructive criticism. The conversation between these two agents leads to iterative improvement. For example, a coding agent might generate a function, then a reviewer agent points out an edge case, and the first agent revises its code.^[8]

Reflection is the simplest agentic pattern and often the easiest to implement. Despite its simplicity, Ng noted that he had been "delighted by how much it improved applications' results in a few cases."^[8] The pattern works especially well for writing, code generation, and data analysis where quality can be evaluated programmatically. The Reflexion paper (2023) is the canonical academic formulation of this idea, framing verbal self-reflection as a substitute for weight updates.^[6]
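
The generate-critique-revise loop can be sketched in a few lines. This is a minimal illustration rather than Ng's or the Reflexion authors' implementation: `generate` and `critique` are hypothetical stand-ins for two LLM calls, and the critic here is a programmatic check (running the candidate on an edge case), which is one way quality can be evaluated automatically.

```python
from typing import Optional

def generate(task: str, feedback: list[str]) -> str:
    """Stand-in generator: a real system would prompt an LLM with the task and feedback."""
    if any("empty" in note for note in feedback):
        return "def mean(xs):\n    return sum(xs) / len(xs) if xs else 0.0\n"
    return "def mean(xs):\n    return sum(xs) / len(xs)\n"

def critique(code: str) -> Optional[str]:
    """Stand-in critic: runs the candidate on an edge case and reports what failed."""
    namespace: dict = {}
    exec(code, namespace)  # acceptable for a toy; sandbox untrusted code in practice
    try:
        namespace["mean"]([])
    except ZeroDivisionError:
        return "fails on empty input"
    return None

def reflect(task: str, max_rounds: int = 3) -> tuple[str, int]:
    """Loop until the critic has no complaint or the round budget runs out."""
    feedback: list[str] = []
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        problem = critique(draft)
        if problem is None:
            return draft, round_no
        feedback.append(problem)  # the critique becomes context for the next attempt
    return draft, max_rounds

final_code, rounds_used = reflect("write a mean() function")
```

The round budget matters: without `max_rounds`, a generator that never satisfies the critic would loop indefinitely.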

Tool use

Tool use extends the agent's capabilities beyond text generation. Instead of relying solely on its training data, the agent can call external APIs, query databases, execute code, search the web, or interact with file systems. The ReAct (reasoning and acting) framework, introduced by Yao et al. in a 2022 paper published at ICLR 2023, formalized this pattern by interleaving reasoning traces with tool-calling actions. At each step, the agent articulates its reasoning (a "thought"), decides on an action (such as calling a search API), observes the result, and then reasons about the next step.^[5]
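
The thought-action-observation cycle described above can be sketched as follows. This is an illustrative toy, not the code from the ReAct paper: `scripted_model` is a hypothetical stand-in for an LLM call, and the `search` tool returns canned data so the example runs offline.

```python
from typing import Callable

# Hypothetical tool registry; names and behavior are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: "Paris" if "capital of France" in query else "no result",
}

def scripted_model(history: list[str]) -> dict:
    """Stand-in for an LLM: returns a thought plus either an action or a final answer."""
    observations = [line for line in history if line.startswith("Observation: ")]
    if not observations:
        return {"thought": "I should look this up.",
                "action": "search", "input": "capital of France"}
    answer = observations[-1].removeprefix("Observation: ")
    return {"thought": "The observation answers the question.", "final": answer}

def react(question: str, model=scripted_model, max_steps: int = 5) -> str:
    """Alternate reasoning and tool calls until the model emits a final answer."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(history)
        history.append(f"Thought: {step['thought']}")
        if "final" in step:
            return step["final"]
        observation = TOOLS[step["action"]](step["input"])
        history.append(f"Action: {step['action']}[{step['input']}]")
        history.append(f"Observation: {observation}")
    raise RuntimeError("step budget exhausted without a final answer")

answer = react("What is the capital of France?")
```

The growing `history` list plays the role of the prompt context: each observation is appended so the next reasoning step can condition on it.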

Tool use is what transforms a language model from a text generator into a functional agent. With tools, it can retrieve current information, perform calculations, modify files, send messages, and interact with software systems. The technical substrate in modern systems is structured function calling, in which the model emits a typed JSON object specifying which tool to invoke and with which arguments.
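
As a rough sketch of what the receiving side of structured function calling might look like, the snippet below parses a model-emitted JSON object, checks it against a simplified schema, and invokes the matching function. The tool name `get_weather`, the `SCHEMAS` table, and the exact JSON shape are assumptions for illustration; each model provider defines its own wire format.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> str:
    """Hypothetical tool; returns canned data so the example stays offline."""
    return f"18 degrees {unit} in {city}"

TOOLS = {"get_weather": get_weather}

# Simplified schema: required argument names mapped to their Python types.
SCHEMAS = {"get_weather": {"city": str}}

def dispatch(tool_call_json: str) -> str:
    """Validate a model-emitted tool call and invoke the matching function."""
    call = json.loads(tool_call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    for field, expected in SCHEMAS[name].items():
        if not isinstance(args.get(field), expected):
            raise TypeError(f"{name}: argument {field!r} must be {expected.__name__}")
    return TOOLS[name](**args)

model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
result = dispatch(model_output)
```

Validating before dispatch is the important design choice: the model's output is untrusted input, so unknown tool names and mistyped arguments are rejected rather than executed.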

Planning

Planning involves the agent decomposing a complex task into a sequence of smaller, manageable steps before executing them. Rather than attempting to solve the entire problem at once, the agent creates a plan, executes it step by step, and adjusts the plan if intermediate results reveal that the original approach is not working.^[8] Sophisticated planning agents use techniques like chain-of-thought reasoning to map out their approach, and some systems maintain an explicit plan data structure that tracks completed and pending steps. Planning becomes especially valuable when tasks have dependencies or when the agent needs to allocate limited resources across competing priorities.
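
An explicit plan data structure of the kind mentioned above might look like the following sketch. The `Step` and `Plan` classes are hypothetical names chosen for this example; the point is only that tracking status and dependencies lets the executor pick whichever steps are currently unblocked.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    depends_on: list[str] = field(default_factory=list)
    status: str = "pending"  # "pending" or "done"

@dataclass
class Plan:
    steps: list[Step]

    def ready(self) -> list[Step]:
        """Pending steps whose dependencies have all completed."""
        done = {s.name for s in self.steps if s.status == "done"}
        return [s for s in self.steps
                if s.status == "pending" and set(s.depends_on) <= done]

    def run(self, execute: Callable[[Step], None]) -> list[str]:
        """Execute steps in dependency order and return the order actually used."""
        order: list[str] = []
        while batch := self.ready():
            for step in batch:
                execute(step)  # a real agent would call a model or tool here
                step.status = "done"
                order.append(step.name)
        return order

plan = Plan([
    Step("write report", depends_on=["analyze"]),
    Step("collect data"),
    Step("analyze", depends_on=["collect data"]),
])
execution_order = plan.run(lambda step: None)
```

A replanning agent would mutate `plan.steps` between batches, for example inserting a new step when an intermediate result shows the original approach is not working.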

Multi-agent collaboration

Multi-agent collaboration assigns different roles or specializations to separate agents that work together on a shared task. One agent might act as a researcher, another as a writer, a third as a fact-checker, and a fourth as an editor. Each agent focuses on its specialty, and they communicate through structured message passing.^[8] The pattern mirrors how human product teams divide labor across designer, engineer, product manager, and QA tester. Gartner reported a 1,445% increase in enterprise inquiries about multi-agent systems from Q1 2024 to Q2 2025, indicating strong industry interest in this pattern.^[9]
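
A toy version of role-based collaboration through structured messages is sketched below. Each "agent" is a plain function standing in for an LLM call with a role-specific prompt; the `Message` type and the three roles are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Message:
    sender: str
    content: str

Agent = Callable[[list[Message]], Message]

def researcher(thread: list[Message]) -> Message:
    return Message("researcher", "agents plan, act, and observe in a loop")

def writer(thread: list[Message]) -> Message:
    notes = thread[-1].content  # reads the researcher's message
    return Message("writer", f"in short, {notes}")

def editor(thread: list[Message]) -> Message:
    draft = thread[-1].content  # reads the writer's draft
    return Message("editor", draft[0].upper() + draft[1:] + ".")

def run_team(task: str, team: list[Agent]) -> list[Message]:
    """Pass a shared message thread through each role in turn."""
    thread = [Message("user", task)]
    for agent in team:
        thread.append(agent(thread))
    return thread

thread = run_team("Summarize how agents work", [researcher, writer, editor])
```

Here the hand-off order is fixed; frameworks differ mainly in how they decide which agent speaks next.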

What are Anthropic's workflow patterns?

Anthropic's December 19, 2024 essay "Building effective agents" is the most influential post-Ng formalisation of agentic workflows. It enumerates five concrete workflow patterns and contrasts them with the open-ended agent pattern.^[4] The five workflow patterns are:

Prompt chaining: decompose a task into sequential steps with programmatic checkpoints between LLM calls.
Routing: classify the input and dispatch it to a specialised downstream prompt or tool.
Parallelization: run multiple LLM calls concurrently and either section the work or vote on the same task.
Orchestrator-workers: a central LLM dynamically breaks down the task, delegates subtasks to worker LLMs, and synthesises their results.
Evaluator-optimizer: one LLM generates a candidate response while another evaluates it, looping until quality criteria are met.

Anthropic contrasts these predetermined patterns with the agent pattern proper, in which an LLM operates a tool-use loop, decides at each step what to do next, and runs until it judges the task complete or hits a stop condition.^[4] The essay urges practitioners to "find the simplest solution possible, and only increase complexity when needed," noting that many production use cases are best served by a workflow rather than a full agent.^[4]
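
Two of the workflow patterns listed above, prompt chaining with a programmatic checkpoint and parallelization by voting, can be sketched with ordinary functions. This is an illustration under stated assumptions, not code from Anthropic's essay: the step functions and voters are hypothetical stand-ins for LLM calls.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def chain(text: str, steps: list[Callable[[str], str]],
          gate: Callable[[str], bool]) -> str:
    """Prompt chaining: run steps in order, checking each intermediate result."""
    for step in steps:
        text = step(text)
        if not gate(text):
            raise ValueError(f"checkpoint failed after {step.__name__}")
    return text

def vote(prompt: str, voters: list[Callable[[str], str]]) -> str:
    """Parallelization (voting): ask several callables the same question, keep the majority."""
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda voter: voter(prompt), voters))
    return Counter(answers).most_common(1)[0][0]

# Stand-ins for LLM calls.
def outline(topic: str) -> str:
    return f"outline({topic})"

def draft(outline_text: str) -> str:
    return f"draft({outline_text})"

chained = chain("agents", [outline, draft], gate=lambda t: len(t) < 200)
verdict = vote("Is this message spam?",
               [lambda p: "yes", lambda p: "no", lambda p: "yes"])
```

Both functions follow a predefined code path, which is what places them on the workflow side of Anthropic's distinction rather than the agent side.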

What are the common agentic design patterns?

Beyond Ng's four high-level categories and Anthropic's five workflow patterns, practitioners have identified several specific patterns that appear repeatedly in production agentic systems.

Pattern	Description	Typical use case	Complexity
ReAct loop	The agent alternates between reasoning (thought) and acting (tool call), observing results at each step	Search and retrieval, data analysis, question answering	Low
Reflection / self-critique	The agent generates output, critiques it, and revises iteratively	Code generation, writing, debugging	Low
Plan-and-execute	The agent creates an explicit multi-step plan, then executes each step sequentially	Complex research, project management, multi-file code changes	Medium
Tool routing	A dispatcher agent decides which specialized tool or sub-agent to invoke based on the user's request	Customer support triage, IT helpdesk, function calling	Medium
Multi-agent debate	Multiple agents argue different positions and converge on a consensus answer	Fact verification, decision making, risk assessment	Medium
Hierarchical delegation	A supervisor agent delegates subtasks to worker agents and aggregates their results	Report generation, data pipelines, software engineering	High
Human-in-the-loop	The agent pauses at specified checkpoints to request human approval before proceeding	Financial transactions, medical recommendations, legal review	Medium
Map-reduce	The agent splits a large task into parallel subtasks, processes them concurrently, and merges the outputs	Document analysis, batch processing, large-scale summarization	High
Orchestrator-worker	A planner LLM fans out subtasks to multiple worker LLMs and synthesises their outputs	Long research reports, complex code changes	High
Evaluator-optimizer	A generator LLM and a critic LLM loop until a quality threshold is met	High-quality writing, structured extraction	Medium

How much autonomy should a production agent have?

How much independence an agent should have in production is one of the central design questions. Anthropic's 2026 research on measuring agent autonomy emphasized that "autonomy is not a fixed property of a model or system but an emergent characteristic of a deployment."^[10] Three broad patterns have emerged.

Human-in-the-loop

In this pattern, the agent proposes actions but waits for explicit human approval before executing them. Every tool call, API request, or output goes through a human review step. This is the safest approach and is common in high-stakes domains like healthcare, finance, and legal work. The downside is speed: the workflow can only move as fast as the human reviewer.

Supervised autonomy (human-on-the-loop)

The agent operates autonomously for routine, low-risk, reversible actions but escalates to a human for high-risk, irreversible, or ambiguous decisions. This "calibrated autonomy" approach tries to balance safety with efficiency. A customer service agent, for instance, might process a standard refund on its own but escalate a complaint involving potential fraud to a human supervisor.
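
One way to express such an escalation rule in code is a small policy function. The fields on `Action` and the 100.0 auto-approval limit are hypothetical values chosen for this sketch; a real deployment would derive them from its own risk assessment.

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    ESCALATE = "escalate"

@dataclass
class Action:
    name: str
    reversible: bool
    amount: float = 0.0
    fraud_suspected: bool = False

def review_policy(action: Action, auto_limit: float = 100.0) -> Decision:
    """Escalate anything irreversible, suspicious, or above the auto-approval limit."""
    if (not action.reversible
            or action.fraud_suspected
            or action.amount > auto_limit):
        return Decision.ESCALATE
    return Decision.AUTO_APPROVE

standard_refund = Action("standard refund", reversible=True, amount=40.0)
fraud_complaint = Action("refund", reversible=True, amount=40.0, fraud_suspected=True)
```

With these inputs, `review_policy(standard_refund)` auto-approves and `review_policy(fraud_complaint)` escalates, matching the customer-service example above.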

Anthropic's autonomy research, drawing on nearly a million tool calls and hundreds of thousands of live coding sessions in deployed environments, found that approximately 73% of tool calls were human-in-the-loop in practice and that only about 0.8% of calls were irreversible. New users started with an auto-approve rate near 20%, while experienced users moved to over 50%, even as they also interrupted the agent roughly twice as often.^[10] Between October 2025 and January 2026, the 99.9th percentile of Claude Code session length nearly doubled, from under 25 minutes to over 45 minutes, indicating that users were trusting agents with longer-running tasks.^[10]

Fully autonomous

The agent executes its entire workflow without human intervention. Guardrails and safety checks are embedded programmatically rather than relying on human review. This pattern is used for low-stakes, high-volume tasks like data entry, content tagging, or log analysis. Fully autonomous agents require robust monitoring and observability infrastructure to catch failures after the fact.

Most production deployments in 2025 and 2026 sit in the supervised autonomy zone. A Deloitte report on agentic AI strategy noted that organizations are moving from single-step automation to systems that manage entire workflows, but with human oversight at decision points.^[11]

Which frameworks build agentic workflows?

Several open-source and commercial frameworks have emerged to simplify building agentic workflows. They handle common concerns like state management, tool registration, agent communication, error handling, and observability.

LangGraph

LangGraph, developed by LangChain, uses a graph-based workflow design. Agent interactions are modeled as nodes in a directed graph, with edges representing transitions between steps. This architecture supports conditional branching, cycles (for iterative refinement), and parallel execution paths. LangGraph provides fine-grained control over agent state through a persistent state object that flows through the graph. It is written in Python and TypeScript and integrates tightly with the broader LangChain ecosystem.^[12]
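
The graph model can be illustrated without the library itself. The sketch below is a toy executor written for this article, not LangGraph's API: the `Graph` class, its method names, and the `END` sentinel are all assumptions. It shows a state object flowing through nodes, a conditional edge, and a cycle used for iterative refinement.

```python
from typing import Callable

State = dict
END = "__end__"  # sentinel chosen for this sketch

class Graph:
    def __init__(self) -> None:
        self.nodes: dict[str, Callable[[State], State]] = {}
        self.routers: dict[str, Callable[[State], str]] = {}

    def add_node(self, name: str, fn: Callable[[State], State]) -> None:
        self.nodes[name] = fn

    def add_edge(self, source: str, router: Callable[[State], str]) -> None:
        """The router inspects the state and names the next node (or END)."""
        self.routers[source] = router

    def run(self, start: str, state: State, max_steps: int = 20) -> State:
        current, steps = start, 0
        while current != END:
            if steps >= max_steps:
                raise RuntimeError("max_steps exceeded; possible infinite cycle")
            state = self.nodes[current](state)
            current = self.routers[current](state)
            steps += 1
        return state

graph = Graph()
graph.add_node("draft", lambda s: {**s, "revision": s["revision"] + 1})
graph.add_node("review", lambda s: {**s, "approved": s["revision"] >= 2})
graph.add_edge("draft", lambda s: "review")                              # unconditional edge
graph.add_edge("review", lambda s: END if s["approved"] else "draft")    # conditional edge forming a cycle
final_state = graph.run("draft", {"revision": 0})
```

The `max_steps` guard is worth keeping in any cyclic graph, since a reviewer that never approves would otherwise loop forever.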

LangGraph implements all five Anthropic workflow patterns (prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer) as documented reference implementations.^[12] Its Send API supports dynamic dispatch of worker nodes from an orchestrator, enabling fan-out and fan-in patterns over arbitrary numbers of parallel subtasks.^[12]

CrewAI

CrewAI takes a role-based approach inspired by human team structures. Developers define "crews" of agents, each with a distinct role, backstory, and set of capabilities. Tasks are assigned to agents based on their roles, and agents can communicate with each other and delegate subtasks. CrewAI focuses on simplicity and rapid prototyping and is popular for content generation pipelines, competitive analysis, and customer support workflows.

AutoGen

AutoGen, originally developed by Microsoft Research, approaches multi-agent orchestration through structured conversations. Agents communicate by exchanging messages in a group chat format, with different agents taking on roles like planner, researcher, coder, and executor. AutoGen supports both fully autonomous conversations and human-in-the-loop configurations where a human participant can intervene in the conversation.

In October 2025, Microsoft merged AutoGen with Semantic Kernel into the unified Microsoft Agent Framework. This combined platform offers AutoGen's flexible multi-agent orchestration with Semantic Kernel's enterprise features, including session-based state management, middleware, and telemetry. The 1.0 general availability release is targeted for Q1 2026.^[13]

Semantic Kernel

Semantic Kernel is Microsoft's model-agnostic SDK for building AI agents and multi-agent systems. Available in Python, C#, and Java, it targets enterprise developers already working in the Microsoft ecosystem. The framework supports sequential, concurrent, group-chat, handoff, and "Magentic" orchestration patterns. Following the merger with AutoGen, Semantic Kernel provides the production foundation (state management, type safety, telemetry) while AutoGen contributes the agent abstractions and conversational patterns.^[13]

Claude Agent SDK

The Claude Agent SDK, released by Anthropic in 2025, exposes the same primitives used by Claude Code: a tool-use loop, subagent spawning, Claude Skills, and integration with Model Context Protocol servers. Anthropic shipped these building blocks in rapid succession, with the Agent SDK, Claude Code Subagents, Skills (October 2025), and Agent Teams (early 2026) arriving over a span of about a year.^[14]^[15]

Other frameworks

Other notable frameworks include LlamaIndex Workflows (focused on retrieval-augmented generation agents), DSPy (which compiles declarative agent programs into optimized prompts), Haystack (by deepset, for NLP pipeline orchestration), Amazon Bedrock Agents (a managed service on AWS for building production agents), and the OpenAI Agents SDK. The framework landscape is rapidly evolving, with new entrants appearing regularly.

Framework	Developer	Architecture style	Primary languages	Release year	Ideal use case
LangGraph	LangChain	Graph-based workflows	Python, TypeScript	2024	Complex conditional workflows
CrewAI	CrewAI Inc.	Role-based crews	Python	2023	Team-structured content and research workflows
AutoGen / Microsoft Agent Framework	Microsoft	Conversation-based	Python, .NET	2023 (AutoGen), 2025 (merged)	Research, collaborative problem-solving
Semantic Kernel	Microsoft	Plugin-based, multi-pattern	Python, C#, Java	2023	Enterprise .NET/Java applications
LlamaIndex Workflows	LlamaIndex	Event-driven	Python, TypeScript	2024	RAG-heavy agent applications
DSPy	Stanford NLP	Declarative compilation	Python	2023	Optimized prompt pipelines
Amazon Bedrock Agents	AWS	Managed service	Python (SDK)	2023	Serverless production agents on AWS
Claude Agent SDK	Anthropic	Tool-loop + subagents + MCP	Python, TypeScript	2025	Long-running coding and research agents
OpenAI Agents SDK	OpenAI	Handoff-based, function-tool	Python, TypeScript	2025	OpenAI-native multi-agent apps

What is the Model Context Protocol (MCP)?

The Model Context Protocol (MCP) is an open standard introduced by Anthropic on November 25, 2024, to standardize how AI agents connect to external tools, data sources, and systems. Before MCP, every integration between an agent and an external service required custom code. MCP replaces this fragmented approach with a single, universal protocol.^[16]

MCP uses a client-server architecture. An MCP client (the AI agent or application) connects to one or more MCP servers, each of which exposes tools and data from a specific system. A single MCP server for GitHub, for example, might expose tools for creating pull requests, reading issues, and searching repositories. The agent connects to the server once and can call any of its tools through a standardized interface.
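
The request and response shapes can be sketched with a toy in-process server. MCP messages are JSON-RPC 2.0, and the `tools/list` and `tools/call` method names below follow the published protocol as understood here; the `read_issue` tool is hypothetical, and the sketch omits the initialization handshake, input schemas, and transport layer that a real server needs.

```python
import json

# Hypothetical tool table for a toy GitHub-style server.
TOOLS = {
    "read_issue": lambda args: f"Issue #{args['number']}: example title",
}

def handle(request_json: str) -> str:
    """Answer a single JSON-RPC request with a JSON-RPC response."""
    request = json.loads(request_json)
    method = request["method"]
    if method == "tools/list":
        result = {"tools": [{"name": name} for name in TOOLS]}
    elif method == "tools/call":
        params = request["params"]
        text = TOOLS[params["name"]](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        # -32601 is the standard JSON-RPC "method not found" code.
        return json.dumps({"jsonrpc": "2.0", "id": request["id"],
                           "error": {"code": -32601, "message": "Method not found"}})
    return json.dumps({"jsonrpc": "2.0", "id": request["id"], "result": result})

call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "read_issue", "arguments": {"number": 42}}}
response = json.loads(handle(json.dumps(call)))
```

Because every server answers the same two methods, a client needs only one code path to discover and invoke tools, whichever system sits behind the server.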

At launch, Anthropic released pre-built MCP servers for Google Drive, Slack, GitHub, Git, Postgres, and Puppeteer. Early adopters included Block, Apollo, Zed, Replit, Codeium, and Sourcegraph.^[16] In March 2025, OpenAI adopted MCP across its products, including the ChatGPT desktop application. By the end of 2025, the ecosystem had grown to over 10,000 active MCP servers with 97 million monthly SDK downloads across Python and TypeScript.

On December 9, 2025, Anthropic donated MCP to the newly formed Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation. The foundation was co-founded by Anthropic, Block, and OpenAI, with founding contributions of three projects: Anthropic's MCP, Block's goose, and OpenAI's AGENTS.md. Platinum members included Amazon Web Services, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI.^[17] This move signaled industry-wide commitment to a shared standard for agent-tool integration.

MCP has become the de facto standard for connecting agents to tools. Its rapid adoption across competing AI companies and thousands of third-party developers suggests that the fragmentation problem for tool integration is largely solved at the protocol level.

How do agents talk to each other (A2A)?

While MCP standardizes how agents connect to tools, the Agent2Agent Protocol (A2A) standardizes how agents communicate with each other. Google announced A2A on April 9, 2025, with support from over 50 technology partners including Atlassian, Salesforce, SAP, ServiceNow, PayPal, LangChain, and MongoDB.^[18]

A2A addresses a different problem than MCP. In a multi-agent system, agents from different vendors or built on different frameworks need a common language to discover each other's capabilities, delegate tasks, exchange results, and negotiate how to present information to users. A2A provides this common language.

The protocol has four core capabilities:

Capability discovery: Agents publish "Agent Cards" in JSON format that describe what they can do. A client agent reads these cards to find the right agent for a given task.
Task management: Communication centers on task objects with a defined lifecycle. Tasks can complete immediately or run asynchronously with continuous status updates.
Collaboration: Agents exchange context, instructions, and partial results through structured messages containing typed "parts" (text, images, forms, or other media).
User experience negotiation: Agents negotiate how to present results, adapting to the capabilities of the user interface (plain text, iframes, web forms, video).
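
Capability discovery can be illustrated with a small matching function over Agent Cards. The card fields used here (`name`, `url`, `skills`, `tags`) and the example URLs are assumptions made for this sketch and should be checked against the published A2A specification before being relied on.

```python
import json

# Two illustrative Agent Cards; field names are assumptions, not the normative schema.
CARDS_JSON = """
[
  {"name": "invoice-agent", "url": "https://agents.example.com/invoice",
   "skills": [{"id": "extract-invoice", "tags": ["finance", "ocr"]}]},
  {"name": "travel-agent", "url": "https://agents.example.com/travel",
   "skills": [{"id": "book-flight", "tags": ["travel"]}]}
]
"""

def find_agents(cards: list[dict], tag: str) -> list[str]:
    """Return the names of agents advertising at least one skill with the given tag."""
    return [card["name"] for card in cards
            if any(tag in skill.get("tags", []) for skill in card.get("skills", []))]

cards = json.loads(CARDS_JSON)
finance_agents = find_agents(cards, "finance")
```

A client agent would typically fetch each card over HTTP first; the matching step afterward is as simple as shown.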

A2A is built on existing web standards (HTTP, Server-Sent Events, and JSON-RPC) and supports enterprise-grade authentication and authorization. In June 2025, Google contributed A2A to the Linux Foundation. Version 0.3, released in July 2025, provided a more stable API surface for production use.^[19]

A2A and MCP are complementary. MCP handles agent-to-tool communication (vertical integration), while A2A handles agent-to-agent communication (horizontal integration). Together, they form the emerging protocol stack for multi-agent systems.

How did agentic workflows evolve in 2025-2026?

The eighteen months from late 2024 through mid-2026 transformed agentic workflows from a research idea into a mass-market product category. Several developments are particularly worth recording in detail.

Computer-use agents

On October 22, 2024, Anthropic introduced computer use as a public beta capability of an upgraded Claude 3.5 Sonnet, making Claude one of the first frontier systems to offer autonomous desktop control through an API tool. Instead of building specialised tools for each task, Anthropic taught Claude general computer skills: looking at a screen, moving a cursor, clicking buttons, and typing text. Anthropic released the feature as experimental, acknowledging that it was "at times cumbersome and error-prone."^[20]

OpenAI followed on January 23, 2025 with Operator, a browser-based agent powered by a Computer-Using Agent (CUA) model that combined GPT-4o's vision with reinforcement-learning-based reasoning over screenshots. Operator was made available to ChatGPT Pro subscribers in the US at $200 per month. CUA scored 38.1% on OSWorld for full computer use tasks, 58.1% on WebArena, and 87% on WebVoyager.^[21] On July 17, 2025, OpenAI launched ChatGPT Agent, a unified system that combined Operator's browser, deep research's web synthesis, and ChatGPT's conversational interface, rolled out to Pro, Plus, and Team subscribers; the model underlying ChatGPT Agent scored 41.6% pass@1 on Humanity's Last Exam, roughly double the o3 and o4-mini baselines.^[22] The standalone Operator site at operator.chatgpt.com was deprecated and shut down on August 31, 2025.^[22]

Google's Project Mariner entered the same space in December 2024 as a research prototype built on Gemini 2.0, an experimental Chrome extension that could reason across page text, images, forms, and code to drive a browser. At Google I/O on May 20, 2025, Project Mariner gained cloud-based execution, the ability to handle up to ten simultaneous tasks, and a "Teach & Repeat" feature; it scored 83.5% on the WebVoyager benchmark.^[23] In December 2025 Google folded Project Mariner's capabilities into Gemini Agent inside the Gemini app, and the standalone Mariner landing page listed May 4, 2026 as the shutdown date.^[24]

Agentic coding tools

Agentic coding became the most commercially successful application of the technology. Anthropic released Claude Code in February 2025 as an agentic command-line tool, made it generally available in May 2025 alongside Claude 4, and shipped Claude Code 2.0 in September 2025 with split-panel views, a built-in terminal, parallel sessions, checkpoints, an IDE extension, and automation hooks.^[25] Claude Code reached approximately $1 billion in annualized run-rate revenue within roughly six months of general availability, and Anthropic's overall annualized revenue jumped from approximately $1 billion at the start of 2025 to around $5 billion by August 2025.^[26]

Cursor shipped Cursor 2.0 on October 29, 2025, pairing a purpose-built coding model named Composer with an agent-centred interface. Composer is a mixture-of-experts model trained with reinforcement learning for software engineering and is roughly four times faster than comparably intelligent models, completing most turns in under 30 seconds. Cursor 2.0 supports up to eight parallel agents on a single prompt, using git worktrees or remote machines to keep their workspaces isolated.^[27]

Replit Agent launched in September 2024 and helped drive Replit to $100 million ARR in 2025. Replit Agent 3, released September 10, 2025, can operate for up to 200 minutes continuously, tests and fixes its own output in a loop, and can build other agents and automations through an "Agents & Automations" interface.^[28] OpenAI shipped a succession of agentic Codex models throughout 2025 and into 2026, including GPT-5-Codex (September 23, 2025), GPT-5.1-Codex-Max, and GPT-5.2-Codex (December 18, 2025), each positioned as more capable than the last for long-horizon agentic coding work.^[29]

Devin, the autonomous software engineer introduced by Cognition AI in March 2024 with the tagline "the first AI software engineer," had a mixed reception. An independent analysis by Answer.AI in early 2025 found that across 20 tasks Devin produced 14 failures, three inconclusive results, and three successes, a 15% success rate, and observed cases where Devin "got stuck on npm package dependency errors and committed a bunch of hallucinated, non-incrementally-beneficial code."^[30] Cognition responded with Devin 2.0 in April 2025, which lowered the entry price to $20 per month, added a cloud IDE supporting multiple parallel Devins, an Interactive Planning mode, and a Devin Search/Wiki feature.^[31] On July 14, 2025 Cognition acquired the AI coding startup Windsurf after Google had hired away Windsurf's leadership in a separate $2.4 billion talent and licensing deal; the acquisition included Windsurf's IP, product, and team, and reportedly more than doubled Cognition's ARR. Cognition's valuation reached $10.2 billion by September 2025.^[32] Devin 2.2, released February 24, 2026, added desktop-app testing, made startup three times faster, and introduced Devin Review for automated pull-request review.^[33]

Subagent and multi-agent patterns

The trend toward subagent and multi-agent patterns accelerated through 2025. In Claude Code, the Task tool spawns subagents that each receive their own context window, system prompt, tool allowlist, and permissions; the parent agent receives only the subagent's final output, not its full reasoning trace, which preserves orchestrator context.^[15] Multiple Task invocations can run concurrently, with up to ten subagents in parallel in a typical configuration.^[15] Anthropic's "Agent Teams" feature, released in early 2026, added higher-level coordination primitives: a shared task list with dependency tracking, peer-to-peer messaging between teammates, and file locking to prevent merge conflicts.^[14]

This "split-and-merge" pattern, in which an orchestrator decomposes a large task, fans it out to parallel subagents, and merges their results in a single session, became the canonical way to use a coding agent at scale. Cursor 2.0's eight-parallel-agents feature is the same idea wrapped in a graphical UI.^[27]
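
The split-and-merge shape can be sketched with a thread pool. This is a toy illustration, not Claude Code's or Cursor's implementation: `run_subagent` is a hypothetical stand-in whose private scratchpad never reaches the orchestrator, mirroring the idea that only the final output is returned, and the cap of ten follows the typical configuration mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 10  # cap borrowed from the typical configuration described above

def run_subagent(subtask: str) -> str:
    """Stand-in subagent: works in a private scratchpad and returns only a summary."""
    scratchpad = [f"inspect {subtask}", f"edit {subtask}", f"test {subtask}"]
    # The scratchpad is discarded on return, so the orchestrator's context stays small.
    return f"{subtask}: done in {len(scratchpad)} steps"

def orchestrate(task: str, subtasks: list[str]) -> str:
    """Fan subtasks out to parallel workers, then merge their summaries in input order."""
    workers = max(1, min(MAX_PARALLEL, len(subtasks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(run_subagent, subtasks))  # map preserves input order
    return "\n".join([task] + [f"- {summary}" for summary in summaries])

report = orchestrate("Rename API", ["auth.py", "billing.py"])
```

Real coding agents also need workspace isolation (for example separate git worktrees) so that parallel edits do not collide; the sketch leaves that out.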

Manus and the agent-product hype cycle

On March 6, 2025, the Chinese startup Butterfly Effect launched Manus AI in invitation-only beta. Manus presented itself as the world's first general AI agent. It was built on top of Claude Sonnet (version 3.5 at first, later 4.5) plus fine-tuned versions of Alibaba's open-source Qwen, with multiple specialised subagents orchestrated to act autonomously on a wide range of tasks. The launch demo, which showed Manus screening resumes and analysing stocks, drew more than one million views within twenty hours, and invitation codes traded second-hand for tens of thousands of renminbi.^[34] Voices ranging from Jack Dorsey to Hugging Face product lead Victor Mustar called Manus a "second DeepSeek moment."

The hype translated into rapid commercial growth. By December 2025, Manus had millions of subscribers, more than $100 million in annual recurring revenue, and a revenue run rate above $125 million eight months after launch. In December 2025, Meta announced an agreement to acquire Manus at a reported valuation between US$2 billion and US$3 billion. The transaction closed in December 2025, and Meta integrated Manus's operations. On April 27, 2026, China's National Development and Reform Commission blocked the acquisition, citing concerns about technology leakage and ordering all parties to withdraw, although Manus employees had already joined Meta and earlier backers had received their proceeds.^[35]

Salesforce and enterprise agent platforms

Salesforce launched Agentforce on October 29, 2024 as an enterprise AI agent platform, then shipped Agentforce 2.0 in December 2024 (general availability February 2025), Agentforce 3 in June 2025, and Agentforce 360 in October 2025. By the time Agentforce 360 became generally available, Salesforce reported over 12,000 customers; one early reference customer, Reddit, deflected 46% of its support cases and cut average response times from 8.9 minutes to 1.4 minutes.^[36] Microsoft pursued a parallel strategy, embedding agentic capabilities into Microsoft 365 Copilot through 2025 and shipping the "Copilot Cowork" wave of specialised teammates in early 2026.

Deep research and long-horizon agents

OpenAI's Deep Research, announced in early 2025 and integrated into ChatGPT Agent in July 2025, established the "browse for hours, then write a report" pattern as a category of its own. Google released a reimagined Gemini Deep Research agent on December 11, 2025, based on Gemini 3 Pro and exposed to developers through a new Interactions API, on the same day OpenAI released GPT-5.2.^[37]

METR's time-horizon methodology, published in March 2025, attempts to quantify how long a task an AI agent can complete autonomously with 50% success. METR's 2026 update showed the 50% time horizon growing roughly exponentially with a doubling time of about seven months, from nine seconds for early GPT-3 agents in 2020 to around 14.5 hours for Claude Opus 4.5 class systems in early 2026. METR's task suite grew from 170 to 228 tasks in the 2026 revision.^[38]^[39]
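
The exponential trend can be written as horizon(t) = horizon(0) × 2^(t / d), where d is the doubling time. The sketch below is a back-of-the-envelope calculation using only the figures quoted above, not METR's fitting procedure; the function names are chosen for this example.

```python
import math

def horizon_after(start_seconds: float, months_elapsed: float,
                  doubling_months: float = 7.0) -> float:
    """Project a time horizon forward under a constant doubling time."""
    return start_seconds * 2 ** (months_elapsed / doubling_months)

def doublings_between(start_seconds: float, end_seconds: float) -> float:
    """Number of doublings needed to grow from one horizon to another."""
    return math.log2(end_seconds / start_seconds)

# Endpoints quoted above: nine seconds and roughly 14.5 hours.
implied_doublings = doublings_between(9, 14.5 * 3600)

# A one-minute horizon after 14 months at a seven-month doubling time.
projected = horizon_after(60, 14)
```

The quoted endpoints imply about 12.5 doublings, and a one-minute horizon would reach four minutes after two doubling periods. Since the endpoint figures are rounded, this is a rough consistency check rather than a measurement.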

Anthropic's empirical autonomy study, drawing on roughly a million tool calls and hundreds of thousands of coding sessions, found that software engineering accounted for around 50% of tool calls on the Anthropic API; that approximately 73% of tool calls still ran with a human in the loop; and that only about 0.8% of calls were judged irreversible.^[10] Crossing roughly 750 sessions of experience was associated with a qualitative shift: users tended to stop approving individual actions and instead monitored from a distance, intervening when necessary.^[10]

Evaluation in the wild

BrowseComp, released by OpenAI on April 10, 2025, is a 1,266-question benchmark for browsing agents that asks for "hard-to-find, entangled information" that might require navigating tens or hundreds of websites. On launch, GPT-4o with browsing reached only 1.9% accuracy, while OpenAI's Deep Research scored 51.5%, a 27x gap that suggested architectural choices, not tool access, were the binding constraint.^[40]

On τ-bench (Sierra Research), which simulates a tool-using customer-service agent in dialogue with a simulated user across retail and airline domains, Claude Sonnet 4.5 led the airline split with 70.0% by late 2025. Sierra subsequently released τ²-bench (2025) and τ³-bench, which added a banking domain and a voice-evaluation modality.^[41]

On SWE-bench Verified, the 500-task human-reviewed subset of real GitHub issues, the first system crossed 50% in early 2025; by April 2026 the top four systems were above 60% and the leader reached 72.0%. Human software engineers on the same tasks score around 90% with reasonable time budgets, so the benchmark is not yet saturated.^[42]

On the GAIA benchmark for general AI assistants, humans score 92%, while GPT-4 with plugins managed only 15% when the benchmark was released in 2023. H2O.ai's h2oGPTe Agent reached 75% in 2025, the first time a system received a passing grade.^[43]

The UK AI Security Institute and Meridian Labs developed Inspect AI, an open-source Python framework for reproducible LLM and agent evaluations. Inspect has been adopted by METR, Apollo Research, other government AISIs, and major safety labs. Its built-in agent primitives include ReAct, multi-agent composition, an external-agent bridge for tools like Claude Code, Codex CLI, and Gemini CLI, and a sandboxing toolkit with Docker built in and Kubernetes and Proxmox adapters; over 200 pre-built evaluations ship with the framework.^[44]^[45]

Claude Skills and a portable capability layer

On October 16, 2025, Anthropic released Claude Skills as a skills-2025-10-02 API beta. A Skill is an organized folder of instructions, scripts, and resources that an agent loads dynamically when it judges a task to match. Pre-built skills shipped for PowerPoint, Excel, Word, and PDF generation, and Skills work across Claude.ai, Claude Code, the Claude Agent SDK, and the Claude Developer Platform, so a developer can write a skill once and use it everywhere.^[46]

Where are agentic workflows used in the enterprise?

Agentic AI moved from experimentation to early production deployment across several industries during 2025 and into 2026. Gartner predicted that 40% of enterprise applications would include task-specific AI agents by the end of 2026, up from less than 5% in 2025.^[47] The global AI agents market was valued at approximately $7.8 billion in 2025, with projections reaching $52.6 billion by 2030 (a 46.3% compound annual growth rate). A January 2025 Gartner poll of 3,412 webinar attendees found that 19% of organisations had made significant investments in agentic AI, 42% had made conservative investments, 8% had made none, and the remaining 31% were taking a wait-and-see approach.^[47]

Customer service

Customer service was among the first domains to adopt agentic workflows at scale. Agents handle routine inquiries, process refunds, update account information, and escalate complex cases to human representatives. Workflow automation is the primary use case in 64% of agent deployments. Salesforce's Agentforce 360 reference customer Reddit deflected 46% of support cases and cut resolution times by 84%.^[36]

Software development

Agentic coding tools like GitHub Copilot, Cursor, Claude Code, and Devin are among the most visible agentic workflow deployments. They go beyond autocomplete to plan multi-file changes, run tests, interpret error messages, and iterate until tests pass. Anthropic's 2026 Agentic Coding Trends Report noted that agentic AI is "reconfiguring the software development lifecycle," with coding agents increasingly handling debugging, testing, documentation, and code review in addition to code generation.^[48] The non-programmer audience also adopted the technology rapidly through what is now called vibe coding.

Research and analysis

Research agents search the web, read documents, synthesize findings, and produce structured reports. They are used in competitive intelligence, market research, due diligence, and scientific literature review. OpenAI Deep Research, Gemini Deep Research, Claude Code with web tools, and Manus all targeted this market in 2025.

Financial services

Banks and financial institutions use agentic workflows for Know Your Customer (KYC) verification, anti-money laundering (AML) screening, fraud detection, and regulatory compliance. McKinsey reported that banks implementing agentic AI for KYC/AML workflows have realized productivity gains ranging from 200% to 2,000%.^[49]

Data analysis

Data analysis agents can write SQL queries, execute them against databases, interpret the results, generate visualizations, and produce written summaries. They iterate when a query returns unexpected results, reformulating until they reach an answer. The pattern is especially valuable for business analysts who need answers from data but lack SQL expertise.
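The query, execute, and reformulate loop described above can be sketched with the standard-library `sqlite3` module. The model is replaced by a scripted stand-in (`scripted_model`) that first proposes a query with a wrong column name and then a corrected one. Every name here is hypothetical, and a real deployment would call an LLM and use a read-only connection.

```python
import sqlite3

def run_with_retries(conn, propose_query, max_attempts=3):
    """Ask for a query, execute it, and feed errors back until one succeeds."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        sql = propose_query(feedback)  # hypothetical stand-in for an LLM call
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"attempt {attempt} failed: {exc}"
            continue
        if rows:
            return sql, rows
        feedback = f"attempt {attempt} returned no rows"
    raise RuntimeError(f"no usable result after {max_attempts} attempts: {feedback}")

# Small in-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# Scripted "model": a query against a non-existent column, then a fixed one.
candidates = iter([
    "SELECT region, SUM(total) FROM orders GROUP BY region",
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region",
])

def scripted_model(feedback):
    return next(candidates)

sql, rows = run_with_retries(conn, scripted_model)
print(sql)
print(rows)
```

The attempt limit matters as much as the retry itself: without it, a query the model cannot repair turns into an unbounded stream of LLM calls.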

What are the main challenges of agentic workflows?

Despite rapid adoption, agentic workflows face several challenges in production. A Gartner forecast warned that over 40% of agentic AI projects will be canceled by 2027 due to escalating costs, unclear business value, or inadequate risk controls.^[50]

Reliability

The gap between a demo and a reliable production system is wide. Multi-agent architectures create orchestration complexity that compounds with scale. Error propagation is a particular concern: if one agent in a chain produces a flawed output, downstream agents may build on that flaw without detecting it. Hallucinated tool calls and fabricated success claims have proven especially difficult to suppress. Bug reports against Claude Code through 2025 and 2026 documented cases of research subagents returning fabricated data (non-existent files, wrong prices, fictional API capabilities), of the main agent trusting those results and building plans around them, and of "false success" rates approaching 29-30% in some workflows, where an agent reports a task complete without verifying.^[51]
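One inexpensive defence against fabricated success claims is to check a subagent's report against the file system before the orchestrator acts on it. The sketch below assumes a hypothetical report format (a dict with `status` and `files` keys) and only verifies that cited files exist. It illustrates the idea and does not describe how any particular tool works.

```python
import tempfile
from pathlib import Path

def verify_claimed_files(claimed_paths, root):
    """Split claimed relative paths into those that exist under root and those that do not."""
    root = Path(root)
    found, missing = [], []
    for rel in claimed_paths:
        (found if (root / rel).is_file() else missing).append(rel)
    return found, missing

def accept_report(report, root):
    """Accept a 'done' report only if every file it cites is actually present."""
    found, missing = verify_claimed_files(report.get("files", []), root)
    accepted = report.get("status") == "done" and not missing
    return accepted, missing

with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "summary.md").write_text("real output")
    # Hypothetical subagent report that cites one real and one invented file.
    report = {"status": "done", "files": ["summary.md", "prices.csv"]}
    accepted, missing = accept_report(report, workdir)

print(accepted, missing)
```

Existence checks catch only the crudest fabrications. Verifying content (re-running a test, re-fetching a price) follows the same pattern, with a stronger check in place of `is_file()`.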

Cost

Agentic workflows consume significantly more LLM tokens than single-turn interactions. A reflection loop that runs five iterations costs at least roughly five times as much as a single generation, and more when each iteration re-reads the accumulated drafts and critiques as input. Multi-agent systems multiply this further, with each agent making its own LLM calls. Token costs, tool-call costs, and compute costs can spike unpredictably when agents encounter edge cases that trigger extensive retries or recursive calls. Anthropic's Opus 4.5 release in November 2025 specifically targeted this pain point with a 67% price reduction and an average 76% reduction in output tokens per task relative to prior Opus models.^[52]
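The cost of a reflection loop can be estimated with a small model. The sketch below is illustrative only: the token counts and per-million-token prices are made-up assumptions, not any vendor's rates, and `reflection_cost` is a hypothetical helper. It compares a loop whose iterations are independent with one in which each iteration carries all earlier outputs forward as input.

```python
def reflection_cost(prompt_tokens, output_tokens, iterations,
                    price_in_per_mtok, price_out_per_mtok, carry_context=True):
    """Estimate the dollar cost of a generate-critique-revise loop.

    With carry_context=True, every iteration re-reads all earlier outputs as input.
    """
    total_in = 0
    total_out = 0
    context = prompt_tokens
    for _ in range(iterations):
        total_in += context
        total_out += output_tokens
        if carry_context:
            context += output_tokens
    return (total_in * price_in_per_mtok + total_out * price_out_per_mtok) / 1_000_000

# ASSUMPTION: illustrative prices of $3 (input) and $15 (output) per million tokens.
single = reflection_cost(2000, 1000, 1, 3, 15)
flat = reflection_cost(2000, 1000, 5, 3, 15, carry_context=False)
carried = reflection_cost(2000, 1000, 5, 3, 15, carry_context=True)

print(f"single: ${single:.3f}  five flat: ${flat:.3f}  five carried: ${carried:.3f}")
```

With independent iterations the cost scales linearly with the number of rounds. Once context is carried forward, input tokens grow with every round and the total rises faster than linearly.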

Latency

Multi-step workflows are inherently slower than single-turn responses. Each step involves at least one LLM call (often hundreds of milliseconds to several seconds), plus any time spent on tool calls (database queries, API requests, web searches). A workflow with ten steps might take 30 seconds to several minutes, which is acceptable for background tasks but problematic for interactive applications. Cursor's Composer model was explicitly trained for low-latency agentic coding, targeting "most turns in under 30 seconds."^[27]

Safety and guardrails

Agents that can take real-world actions (sending emails, modifying databases, executing code, transferring money) need robust safeguards. The challenge is building guardrails that prevent harmful actions without being so restrictive that they negate the value of automation. Common approaches include:

Allowlisting specific tools and actions the agent can take
Requiring human approval for irreversible or high-stakes actions
Sandboxing code execution in isolated environments
Monitoring agent behavior with anomaly detection
Setting budget limits on token usage and API calls
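Three of these approaches (allowlisting, human approval for irreversible actions, and budget limits) can be combined in a single pre-execution check. The class below is a minimal sketch with hypothetical names and tool identifiers. Sandboxing and anomaly detection need infrastructure outside the agent loop and are not shown.

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    allowed_tools: set        # tools the agent may call at all
    irreversible_tools: set   # tools that need explicit human sign-off
    max_calls: int            # simple budget on the number of tool calls
    calls_made: int = 0

    def check(self, tool, approved_by_human=False):
        """Return 'allow', 'needs_approval', or 'deny' for a proposed tool call."""
        if tool not in self.allowed_tools:
            return "deny"
        if self.calls_made >= self.max_calls:
            return "deny"
        if tool in self.irreversible_tools and not approved_by_human:
            return "needs_approval"
        self.calls_made += 1  # only allowed calls consume budget
        return "allow"

# Hypothetical tool names, for illustration only.
guard = Guardrail(
    allowed_tools={"search", "send_email"},
    irreversible_tools={"send_email"},
    max_calls=2,
)
decisions = [
    guard.check("drop_table"),                          # not on the allowlist
    guard.check("send_email"),                          # irreversible, no approval yet
    guard.check("send_email", approved_by_human=True),  # approved
    guard.check("search"),                              # within budget
    guard.check("search"),                              # budget exhausted
]
print(decisions)
```

Placing the check in ordinary code, outside the model's control, is the point of the design: the agent can propose any action, but only the guardrail decides whether it runs.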

Safety infrastructure is still maturing. Most production deployments rely on a combination of programmatic guardrails and human oversight rather than any single safety mechanism. The UK AISI Inspect Sandboxing Toolkit, released in 2025, was a notable contribution from the safety community.^[45]

Observability

Understanding what an agent is doing and why is harder than monitoring a conventional application. Traditional logging captures inputs and outputs but misses what matters for agents: the reasoning behind tool selection, intermediate plans, decision paths, and context that informed each step. Tracing infrastructure for agent observability remains immature, and limited lineage tracking slows incident response and auditing.
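A trace that captures these missing pieces records, for each step, the tool chosen, the stated rationale, the arguments, and a summary of the result. The sketch below uses only the standard library and hypothetical field names. It shows the shape of such a record and is not the schema of any particular tracing product.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Step:
    tool: str
    rationale: str        # why the agent chose this tool
    arguments: dict
    result_summary: str
    started_at: float = field(default_factory=time.time)

@dataclass
class Trace:
    task: str
    steps: list = field(default_factory=list)

    def record(self, tool, rationale, arguments, result_summary):
        self.steps.append(Step(tool, rationale, arguments, result_summary))

    def to_jsonl(self):
        """Serialize one JSON object per step, suitable for a log pipeline."""
        return "\n".join(
            json.dumps({"task": self.task, "index": i, **asdict(step)})
            for i, step in enumerate(self.steps)
        )

trace = Trace(task="summarize Q3 refunds")
trace.record("sql_query", "need refund totals by month",
             {"table": "refunds"}, "3 rows returned")
trace.record("write_report", "totals look complete, draft the summary",
             {"format": "markdown"}, "report of 120 words")
print(trace.to_jsonl())
```

Keeping the rationale next to the arguments makes later auditing possible: an incident review can see why a call was made as well as what it did.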

How are agentic workflows evaluated?

Evaluating agentic workflows requires different approaches than evaluating single-turn LLM outputs. The core metrics shift from response quality to task completion and process efficiency.

Key metrics

Metric	What it measures	Why it matters
Goal completion rate	Percentage of tasks the agent completes successfully end-to-end	The most direct measure of whether the agent is useful
Tool use efficiency	Whether the agent selects the right tools and uses them correctly	Incorrect tool selection wastes time and tokens
Step efficiency	Number of steps taken relative to the minimum required	Excessive steps indicate poor planning or unnecessary retries
Error recovery rate	How often the agent recovers from errors without human intervention	Directly impacts the value of autonomous operation
Latency (time to completion)	Total wall-clock time from task start to completion	Determines whether the workflow is practical for its intended use
Cost per task	Total token and tool-call costs for completing a task	Determines economic viability
Safety violation rate	Frequency of actions that violate defined guardrails	Measures the risk profile of the deployment
pass^k reliability	Whether the agent succeeds on all k of k independent trials of the same task	Captures the gap between succeeding once and succeeding consistently
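The difference between pass^k (all k trials succeed) and the more familiar pass@k (at least one of k trials succeeds) is easiest to see numerically. The sketch below uses the standard combinatorial estimators computed from n recorded trials with c successes. The function names are hypothetical, and benchmark harnesses may differ in detail.

```python
from math import comb

def pass_hat_k(n, c, k):
    """Estimate the chance that k trials drawn from n recorded trials all succeed."""
    return comb(c, k) / comb(n, k)

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k drawn trials succeeds."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_hat_k(results, k):
    """Average pass^k over tasks; results is a list of (n_trials, n_successes)."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)

# One task attempted 8 times with 6 successes (illustrative numbers).
n, c = 8, 6
for k in (1, 2, 4):
    print(k, round(pass_hat_k(n, c, k), 3), round(pass_at_k(n, c, k), 3))
```

For an agent that succeeds on six of eight attempts, pass@k rises toward 1 as k grows while pass^k falls. That is why pass^k is the more demanding measure for deployments that must work every time.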

Benchmarks

Several benchmarks specifically target agentic capabilities:

SWE-bench: Tests agents on real GitHub issues from popular open-source Python repositories. The agent must read the issue, understand the codebase, write a patch, and pass the repository's test suite. SWE-bench Verified is the human-reviewed 500-task subset; the leader reached 72.0% in April 2026.^[42]
WebArena and VisualWebArena: Test agents on realistic web browsing tasks like booking flights, managing e-commerce accounts, and navigating complex websites.
GAIA benchmark: Tests general AI assistants on tasks requiring multi-step reasoning, tool use, and web search. H2O.ai's h2oGPTe Agent reached 75% in 2025.^[43]
HumanEval and MBPP: Code generation benchmarks that, when used with agentic workflows (iterative generation, testing, and debugging), measure the improvement from agentic patterns over zero-shot generation.
τ-bench (Sierra Research): Tests tool-using customer-service agents in retail and airline domains, with simulated users and a pass^k reliability metric.^[41]
BrowseComp (OpenAI, April 2025): 1,266 challenging browsing problems requiring entangled information retrieval.^[40]
Humanity's Last Exam: Designed as the "last" academic benchmark before saturation; ChatGPT Agent scored 41.6% pass@1 in July 2025.^[22]
METR time horizons: A meta-evaluation that reports the maximum task length at which an agent achieves 50% success.^[38]
Context-Bench (Letta, October 2025): Tests agents' ability to maintain and reason over long-running context across extended, multi-step workflows.
TRAIL (Patronus AI): Evaluates agents on multi-turn, tool-using tasks with complex dependency chains.

Production evaluation

Pre-deployment benchmarks are necessary but not sufficient, because they cannot fully capture the diversity of real-world user behaviors, usage patterns, and edge cases. Common production evaluation approaches include A/B testing (comparing agent performance against human baselines or alternative agent configurations), shadow mode (running the agent alongside human workers without taking live actions), and human review sampling (having human reviewers score a random sample of agent outputs on a regular cadence).
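Two of these approaches reduce to very little code. The sketch below assumes hypothetical record formats: shadow mode is scored as the fraction of cases where the agent's proposed action matches what the human actually did, and review sampling draws a reproducible random subset of outputs for human scoring.

```python
import random

def shadow_agreement(pairs):
    """Fraction of cases where the agent's proposed action equals the human's action.

    pairs is a list of (agent_action, human_action); returns None if it is empty.
    """
    if not pairs:
        return None
    return sum(agent == human for agent, human in pairs) / len(pairs)

def sample_for_review(records, rate, seed=0):
    """Select roughly rate * len(records) records for human review, reproducibly."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < rate]

# Illustrative shadow-mode log: the agent proposes, the human decides.
log = [
    ("refund", "refund"),
    ("escalate", "refund"),
    ("close", "close"),
    ("close", "close"),
]
print(shadow_agreement(log))
print(len(sample_for_review(list(range(1000)), rate=0.05)))
```

Agreement with humans is only a proxy for quality, since the human action is not always the right one. In practice, disagreements are often the cases routed into the review sample.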

How do agentic workflows relate to other concepts?

Agentic workflows connect to several related areas in AI:

Retrieval-augmented generation (RAG): RAG is a specific form of tool use where the agent retrieves relevant documents before generating a response. Many agentic workflows incorporate RAG as one step in a larger process.
Function calling: The ability for LLMs to output structured function calls is the technical foundation that enables tool use in agentic workflows.
Chain-of-thought prompting: Chain-of-thought reasoning within a single LLM call is a precursor to the multi-step planning that agentic workflows perform across multiple calls.
Reinforcement learning: Some agentic systems use reinforcement learning to optimize the agent's decision-making policy over time, learning which tools to use and when. Cursor's Composer is one such example, trained via RL across software-engineering environments.^[27]
Robotic process automation (RPA): RPA automates repetitive tasks through scripted interactions with software interfaces. Agentic workflows differ in that the agent decides what to do at each step rather than following a fixed script.
Vibe coding: A cultural and product layer on top of agentic coding workflows, in which non-programmers describe what they want in natural language and an agent produces a working app.
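The function-calling entry above can be made concrete with a small dispatcher. The model emits a structured call, and ordinary code validates and executes it. The JSON shape used here (`name` plus `arguments`) and the `get_order_status` tool are hypothetical, since each provider defines its own schema.

```python
import json

def get_order_status(order_id: str) -> str:
    """Hypothetical tool: look up an order in a toy in-memory store."""
    orders = {"A1": "shipped", "B2": "processing"}
    return orders.get(order_id, "unknown")

# Registry mapping tool names the model may emit to real Python callables.
TOOLS = {"get_order_status": get_order_status}

def dispatch(call_json):
    """Parse a structured call emitted by a model and run the matching tool."""
    call = json.loads(call_json)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**args)}
    except TypeError as exc:  # wrong or missing argument names
        return {"error": f"bad arguments: {exc}"}

ok = dispatch('{"name": "get_order_status", "arguments": {"order_id": "A1"}}')
unknown = dispatch('{"name": "delete_everything", "arguments": {}}')
bad = dispatch('{"name": "get_order_status", "arguments": {"id": "A1"}}')
print(ok, unknown, bad)
```

Returning errors as data instead of raising lets the agent see what went wrong and correct its next call, which is how hallucinated tool names are usually handled.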

Future directions

Protocol convergence around MCP and A2A is reducing the integration burden for multi-agent systems. The Agentic AI Foundation under the Linux Foundation now stewards both MCP (contributed by Anthropic) and OpenAI's AGENTS.md, with backing from AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI.^[17] As these protocols mature, building interoperable agent systems should become significantly simpler.

Organizations are adopting a "microservices" approach to agents, replacing monolithic all-purpose agents with orchestrated teams of specialised agents. Both Forrester and Gartner described 2026 as the breakthrough year for multi-agent systems.^[47] Cost reduction through smarter orchestration is an active area of work. Techniques such as routing simple queries to smaller, cheaper models while reserving frontier models for complex reasoning steps can reduce costs substantially without sacrificing quality. On the model side, Anthropic's Opus 4.5 release in November 2025 cut prices by 67% while reducing output tokens by 76% on equivalent tasks.^[52]
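Model routing can be sketched as a classifier in front of two model tiers. The keyword heuristic, the tier names, and the per-query costs below are all assumptions chosen for illustration. Production routers typically use a trained classifier or a cheap model to make the decision.

```python
import re

# ASSUMPTION: a crude keyword list standing in for a real difficulty classifier.
HARD_MARKERS = {"prove", "refactor", "debug", "plan", "migrate"}

def route(query, max_simple_words=40):
    """Return 'frontier' for long or keyword-flagged queries, 'small' otherwise."""
    words = re.findall(r"[a-z]+", query.lower())
    is_hard = len(words) > max_simple_words or bool(HARD_MARKERS & set(words))
    return "frontier" if is_hard else "small"

def blended_cost(n_queries, share_small, cost_small, cost_frontier):
    """Total cost when a share of queries goes to the small model."""
    return n_queries * (share_small * cost_small + (1 - share_small) * cost_frontier)

print(route("What is our refund policy?"))
print(route("Debug the failing integration test and refactor the module"))

# ASSUMPTION: illustrative per-query costs of $0.001 (small) and $0.02 (frontier).
print(blended_cost(1000, 0.8, 0.001, 0.02), blended_cost(1000, 0.0, 0.001, 0.02))
```

The savings depend entirely on how often the router is right: a misrouted hard query costs a failed attempt plus a retry on the larger model, so routing accuracy belongs in the cost model alongside the prices.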

Improved evaluation and observability tools are closing the gap between demo-quality and production-quality systems. UK AISI's Inspect framework, METR's time-horizon methodology, and Anthropic's empirical autonomy study are early instances. Gartner predicts that by the end of 2026, 40% of enterprise applications will include embedded AI agents, and that at least 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028, up from 0% in 2024.^[47]^[53] Whether these predictions hold depends on continued progress in reliability, cost management, safety, and the maturity of orchestration frameworks.

References

Andrew Ng. "AI Agentic Workflows Could Drive More AI Progress Than Even the Next Generation of Foundation Models." DeepLearning.AI The Batch, 2024-03-26. https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/. Accessed 2026-05-24. ↩
Andrew Ng [@AndrewYNg]. "I think AI agentic workflows will drive massive AI progress this year." X (Twitter), 2024-03-21. https://x.com/AndrewYNg/status/1770897666702233815. Accessed 2026-05-24. ↩
Andrew Ng. "AI Agentic Workflows." Talk at Sequoia Capital AI Ascent 2024, 2024-03-26. https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/. Accessed 2026-05-24. ↩
Anthropic. "Building effective agents." Anthropic Research, 2024-12-19. https://www.anthropic.com/research/building-effective-agents. Accessed 2026-05-24. ↩
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629, 2022-10-06 (revised 2023). Published at ICLR 2023. https://arxiv.org/abs/2210.03629. Accessed 2026-05-24. ↩
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. "Reflexion: Language Agents with Verbal Reinforcement Learning." arXiv:2303.11366, 2023-03-20. NeurIPS 2023. https://arxiv.org/abs/2303.11366. Accessed 2026-05-24. ↩
Jeremy Kahn. "What are BabyAGI and AutoGPT, the new A.I. tools Silicon Valley is buzzing about?" Fortune, 2023-04-15. https://fortune.com/2023/04/15/babyagi-autogpt-openai-gpt-4-autonomous-assistant-agi/. Accessed 2026-05-24. ↩
Andrew Ng [@AndrewYNg]. "Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress this year: Reflection, Tool use, Planning and Multi-agent collaboration." X (Twitter), 2024-03-27. https://x.com/AndrewYNg/status/1773393357022298617. Accessed 2026-05-24. ↩
Gartner. "Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up from Less Than 5% in 2025." Press release, 2025-08-26. https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025. Accessed 2026-05-24. ↩
Anthropic. "Measuring AI Agent Autonomy." Anthropic Research, 2026-02. https://www.anthropic.com/research. Accessed 2026-05-24. ↩
Deloitte Insights. "Agentic AI Strategy." Tech Trends 2026, 2025. https://www2.deloitte.com/us/en/insights/focus/tech-trends.html. Accessed 2026-05-24. ↩
LangChain. "Workflows and agents." LangGraph documentation, 2025. https://docs.langchain.com/oss/python/langgraph/workflows-agents. Accessed 2026-05-24. ↩
David Ramel. "Semantic Kernel + AutoGen = Open-Source Microsoft Agent Framework." Visual Studio Magazine, 2025-10. https://visualstudiomagazine.com/. Accessed 2026-05-24. ↩
Anthropic. "Building agents with the Claude Agent SDK." Anthropic Engineering, 2025. https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk. Accessed 2026-05-24. ↩
Anthropic. "Create custom subagents." Claude Code documentation, 2025. https://code.claude.com/docs/en/sub-agents. Accessed 2026-05-24. ↩
Anthropic. "Introducing the Model Context Protocol." Anthropic News, 2024-11-25. https://www.anthropic.com/news/model-context-protocol. Accessed 2026-05-24. ↩
Linux Foundation. "Linux Foundation Announces the Formation of the Agentic AI Foundation (AAIF), Anchored by New Project Contributions Including Model Context Protocol (MCP), goose and AGENTS.md." Press release, 2025-12-09. https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation. Accessed 2026-05-24. ↩
Google Developers Blog. "Announcing the Agent2Agent Protocol (A2A)." Google for Developers, 2025-04-09. https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/. Accessed 2026-05-24. ↩
Linux Foundation. "Linux Foundation Launches the Agent2Agent Protocol Project to Enable Secure, Intelligent Communication Between AI Agents." Press release, 2025-06-23. https://www.linuxfoundation.org/. Accessed 2026-05-24. ↩
Anthropic. "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." Anthropic News, 2024-10-22. https://www.anthropic.com/news/3-5-models-and-computer-use. Accessed 2026-05-24. ↩
OpenAI. "Introducing Operator." OpenAI, 2025-01-23. https://openai.com/index/introducing-operator/. Accessed 2026-05-24. ↩
OpenAI. "Introducing ChatGPT agent: bridging research and action." OpenAI, 2025-07-17. https://openai.com/index/introducing-chatgpt-agent/. Accessed 2026-05-24. ↩
TechCrunch (Maxwell Zeff). "Google rolls out Project Mariner, its web-browsing AI agent." TechCrunch, 2025-05-20. https://techcrunch.com/2025/05/20/google-rolls-out-project-mariner-its-web-browsing-ai-agent/. Accessed 2026-05-24. ↩
Google. "Project Mariner." labs.google.com, 2025-2026. https://labs.google.com/mariner/landing. Accessed 2026-05-24. ↩
Anthropic. "Claude Code 2.0 release notes and changelog." Claude Code documentation, 2025-09. https://code.claude.com/docs/en/changelog. Accessed 2026-05-24. ↩
Bilal Jamy. "Anthropic's Claude Code is having its 'ChatGPT' moment." Uncover Alpha, 2025-08. https://www.uncoveralpha.com/p/anthropics-claude-code-is-having. Accessed 2026-05-24. ↩
Cursor. "Introducing Cursor 2.0 and Composer." Cursor blog, 2025-10-29. https://cursor.com/blog/2-0. Accessed 2026-05-24. ↩
Replit. "Introducing Agent 3: Our Most Autonomous Agent Yet." Replit blog, 2025-09-10. https://blog.replit.com/introducing-agent-3-our-most-autonomous-agent-yet. Accessed 2026-05-24. ↩
OpenAI. "Introducing GPT-5.2-Codex." OpenAI, 2025-12-18. https://openai.com/index/introducing-gpt-5-2-codex/. Accessed 2026-05-24. ↩
Maggie Harrison Dupré. "The 'First AI Software Engineer' Is Bungling the Vast Majority of Tasks It's Asked to Do." Futurism, 2025-01-23. https://futurism.com/first-ai-software-engineer-devin-bungling-tasks. Accessed 2026-05-24. ↩
Cognition AI. "Devin 2.0." Cognition blog, 2025-04-03. https://cognition.ai/blog/devin-2. Accessed 2026-05-24. ↩
TechCrunch (Maxwell Zeff). "Cognition, maker of the AI coding agent Devin, acquires Windsurf." TechCrunch, 2025-07-14. https://techcrunch.com/2025/07/14/cognition-maker-of-the-ai-coding-agent-devin-acquires-windsurf/. Accessed 2026-05-24. ↩
Cognition AI. "Introducing Devin 2.2." Cognition blog, 2026-02-24. https://cognition.ai/blog/introducing-devin-2-2. Accessed 2026-05-24. ↩
MIT Technology Review (Caiwei Chen). "Everyone in AI is talking about Manus. We put it to the test." MIT Technology Review, 2025-03-11. https://www.technologyreview.com/2025/03/11/1113133/manus-ai-review/. Accessed 2026-05-24. ↩
CNBC. "China blocks Meta's $2 billion takeover of AI startup Manus." CNBC, 2026-04-27. https://www.cnbc.com/2026/04/27/meta-manus-china-blocks-acquisition-ai-startup.html. Accessed 2026-05-24. ↩
Salesforce. "Welcome to the Agentic Enterprise: With Agentforce 360, Salesforce Elevates Human Potential in the Age of AI." Salesforce press release, 2025-10-13. https://www.salesforce.com/news/press-releases/2025/10/13/agentic-enterprise-announcement/. Accessed 2026-05-24. ↩
Maxwell Zeff. "Google launched its deepest AI research agent yet, on the same day OpenAI dropped GPT-5.2." TechCrunch, 2025-12-11. https://techcrunch.com/2025/12/11/google-launched-its-deepest-ai-research-agent-yet-on-the-same-day-openai-dropped-gpt-5-2/. Accessed 2026-05-24. ↩
METR. "Measuring AI Ability to Complete Long Tasks." METR Research, 2025-03-19. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/. Accessed 2026-05-24. ↩
METR. "Time Horizon 1.1." METR Blog, 2026-01-29. https://metr.org/blog/2026-1-29-time-horizon-1-1/. Accessed 2026-05-24. ↩
Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, et al. "BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents." OpenAI, 2025-04-10. https://openai.com/index/browsecomp/. Accessed 2026-05-24. ↩
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." Sierra Research / arXiv:2406.12045, 2024. ICLR 2025 poster, 2025. https://arxiv.org/abs/2406.12045. Accessed 2026-05-24. ↩
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" Princeton NLP / arXiv:2310.06770, 2023; SWE-bench Verified leaderboard, 2026. https://www.swebench.com/. Accessed 2026-05-24. ↩
H2O.ai. "H2O.ai Tops the General AI Assistant (GAIA) Test." H2O.ai Blog, 2025. https://h2o.ai/blog/2025/h2o-ai-tops-the-general-ai-assistant-test/. Accessed 2026-05-24. ↩
UK AI Security Institute. "Announcing Inspect Evals." AISI Work blog, 2024-2025. https://www.aisi.gov.uk/blog/inspect-evals. Accessed 2026-05-24. ↩
UK AI Security Institute. "The Inspect Sandboxing Toolkit: Scalable and secure AI agent evaluations." AISI Work blog, 2025. https://www.aisi.gov.uk/blog/the-inspect-sandboxing-toolkit-scalable-and-secure-ai-agent-evaluations. Accessed 2026-05-24. ↩
Anthropic. "Equipping agents for the real world with Agent Skills." Anthropic Engineering, 2025-10-16. https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills. Accessed 2026-05-24. ↩
Gartner. "Top Strategic Technology Trends for 2025: Agentic AI." Gartner Research, 2024-2025. https://www.gartner.com/en/documents/5850847. Accessed 2026-05-24. ↩
Anthropic. "2026 Agentic Coding Trends Report." Anthropic, 2026. https://www.anthropic.com/. Accessed 2026-05-24. ↩
McKinsey & Company. "How agentic AI is transforming KYC and AML in banking." McKinsey Insights, 2025. https://www.mckinsey.com/. Accessed 2026-05-24. ↩
Gartner. "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027." Gartner press release, 2025-06-25. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027. Accessed 2026-05-24. ↩
anthropics/claude-code. "Opus 4.6 Max 20x: systematic hallucinations, rule violations, 80% weekly usage wasted, April 2026" (Issue #46727). GitHub, 2026-04. https://github.com/anthropics/claude-code/issues/46727. Accessed 2026-05-24. ↩
Anthropic. "Introducing Claude Opus 4.5." Anthropic News, 2025-11. https://www.anthropic.com/news/claude-opus-4-5. Accessed 2026-05-24. ↩
Gartner. "Press Release: Gartner Predicts 60% of Brands Will Use Agentic AI to Deliver Streamlined One-to-One Interactions by 2028." Gartner press release, 2026-01-15. https://www.gartner.com/en/newsroom/press-releases/2026-01-15-gartner-predicts-60-percent-of-brands-will-use-agentic-ai-to-deliver-streamlined-one-to-one-interactions-by-2028. Accessed 2026-05-24. ↩
