Agentic AI

AI Agents Artificial Intelligence Generative AI

34 min read

Updated Jun 20, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 20, 2026

Fact-checked

In review queue

Sources

50 citations

Revision

v4 · 6,835 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

This article covers "agentic AI" as a paradigm and discourse: the framing and why the term emerged, the design patterns treated as a movement, the commentary and debate around it, and the market and adoption narrative. For the foundational concept of an agent, see Agent; for the practical engineering of contemporary LLM agents (frameworks, protocols, products, benchmarks), see AI agents.

Agentic AI refers to artificial intelligence systems that can take autonomous, multi-step actions to accomplish goals, rather than simply responding to individual prompts. Unlike traditional chatbots or single-step generative AI assistants that answer one question at a time and wait for the next instruction, agentic systems receive a high-level objective and then plan, execute, and iterate through a sequence of steps on their own, often using external tools, browsing the web, writing and running code, or coordinating with other AI agents.^[1]^[2] The term gained wide currency in 2024 when Andrew Ng, the AI Fund founder and former head of Google Brain, presented it at Sequoia Capital's AI Ascent conference, and it was subsequently adopted by Gartner, McKinsey, and most major AI labs as the dominant framing for the next generation of AI deployment.^[1]^[3]

Gartner named agentic AI the top strategic technology trend for 2025, defining it as systems that "possess the capability to act autonomously to complete tasks" within defined guardrails.^[3] McKinsey, in its 2025 report "Seizing the agentic AI advantage," defined the category as "AI agents built on generative AI foundation models that can act in the real world and carry out multistep processes," describing agents as "skilled virtual coworkers" that shift generative AI "from a reactive tool to a proactive, goal-driven virtual collaborator."^[6]^[48] As of 2026, adoption has spread rapidly: roughly 79% of organizations report some level of agentic AI deployment, though fewer than one in nine runs such systems in full production.^[15]

What is agentic AI? Definition and scope

The word "agentic" comes from "agency," the capacity of an entity to act independently in pursuit of a goal. Applied to AI, agentic behavior means a system can perceive its environment, reason about what to do, choose and invoke tools or sub-processes, observe the results, and loop back until the goal is reached or the system determines it cannot proceed.

Agentic AI overlaps with, but is broader than, several related concepts. An AI agent is any system that acts on behalf of a user; the agentic AI framing emphasizes the class of agents built on large language models (LLMs) that reason in natural language and use a wide variety of tools.^[13] The foundational, conceptual idea of an agent (rational agents, classical architectures, reinforcement learning) is treated in Agent, while the practical engineering of present-day LLM agents (frameworks, protocols, products, and benchmarks) is the subject of AI agents; "agentic AI" denotes the surrounding paradigm and discourse. An agentic workflow describes the specific sequence of steps such a system follows. The term agentic AI, as it emerged in 2024 to 2026, typically refers to the overall paradigm: LLM-based systems capable of operating autonomously over extended periods with minimal human intervention.

Some researchers distinguish between "AI agents" (a long-standing term in academic AI, referring to any goal-directed system) and the newer class of "LLM-powered autonomous agents" that can reason, plan, and act through natural language.^[13] In industry usage, the two phrases are often interchangeable. The Salesforce and IBM distinctions between "AI agents" and "AI assistants" treat the former as goal-driven and autonomous and the latter as reactive responders to specific instructions.^[16]^[17]

Lilian Weng's three-component framework

In an influential June 2023 essay on her blog Lil'Log, Lilian Weng (then head of safety systems at OpenAI) summarized the architecture of LLM-powered autonomous agents with a compact formula: Agent = LLM + planning + memory + tool use.^[13] As she put it, "In a LLM-powered autonomous agent system, LLM functions as the agent's brain, complemented by several key components": planning, memory, and tool use.^[13] Each component plays a distinct role: planning breaks a goal into sub-goals and revises the plan in response to feedback ("the agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks"); memory holds context within and across sessions, often combining a short in-context buffer with a long-term vector store; and tool use connects the model to APIs, code execution, and the wider software environment.^[13] Weng's post became one of the most cited primers on the topic and shaped how production frameworks like LangChain and LangGraph organize agent code.^[13]

How does the agent loop work?

Most agentic AI systems share a common control structure often called the agent loop, or the think-act-observe cycle. In each iteration, the model produces a thought (an internal reasoning step), selects an action (typically a tool call with arguments), and then receives an observation (the tool's response, an environment state, or an error message). The model then conditions on the updated context and produces the next thought, repeating until it decides the task is complete or a termination condition is met.^[21]^[22] The code that wraps the LLM with this loop, manages tool calls, persists memory, and applies guardrails is often called the agent's scaffold.^[22]

The loop can be extended in several ways. Multiple tool calls can run in parallel within a single step, which Anthropic's Claude Sonnet 4.5 and Opus 4.5 explicitly optimize for, firing speculative searches simultaneously to build context faster.^[36] Sub-agents can be spawned as tools, producing a supervisor tree. Long-running agents persist state to disk through checkpointing so they can resume after a crash. Human-in-the-loop pauses can interrupt the loop at sensitive steps for approval. The simplicity of the abstract loop combined with the many variations that production systems introduce is one reason the field is still consolidating around shared frameworks rather than a single dominant architecture.^[21]

What are the common agentic patterns?

The agentic AI movement has organized around a small set of recurring design patterns that distinguish agentic workflows from one-shot prompting.

ReAct (Reason + Act)

The ReAct pattern, introduced by Shunyu Yao and colleagues at Google and Princeton in October 2022, interleaves natural-language reasoning steps with tool-using actions in a single trajectory.^[23] The agent writes out a thought, takes an action, observes the result, and then continues with another thought, all in the same chain of tokens that the model produces. On HotpotQA and Fever, ReAct overcame hallucination and error-propagation problems that plagued pure chain-of-thought prompting by grounding the reasoning trace in retrieved Wikipedia content. On the ALFWorld and WebShop benchmarks, ReAct beat imitation-learning and reinforcement-learning baselines by 34 and 10 absolute percentage points respectively, using only one or two in-context examples.^[23] ReAct became the default pattern in early LangChain agents and remains a building block in nearly every modern agent framework.

Plan-and-Execute

The plan-and-execute pattern separates planning from execution into two phases handled by two different LLM calls (or even two different models).^[24] A planner LLM examines the user goal and produces a structured multi-step plan; one or more executor LLMs then take each step, invoke tools, and report results back. The planner can be re-invoked between steps when execution surfaces new information that invalidates the original plan, a variation sometimes called plan-and-re-plan. The pattern was popularized by a 2024 LangChain blog post and is closely related to the BabyAGI task list architecture from April 2023.^[24] Plan-and-execute often performs better than pure ReAct on tasks that require many sequential steps, because the planner forces the system to think through the full task before any execution begins, and a smaller or cheaper model can then handle each step.^[24]

Reflection lets an agent examine its own output, criticize it, and revise. Two papers established the pattern in 2023. Self-Refine, by Aman Madaan and colleagues, used the same LLM as a generator, feedback provider, and refiner, looping until the output stopped improving. Across seven diverse tasks evaluated with GPT-3.5, ChatGPT, and GPT-4, Self-Refine improved performance by roughly 20 absolute percentage points over single-step generation.^[25] Reflexion, by Noah Shinn and colleagues, framed reflection as verbal reinforcement learning: the agent stores natural-language reflections on past failures in an episodic memory buffer, then conditions on those reflections during the next attempt. Reflexion reached 91% pass@1 on HumanEval, surpassing the GPT-4 baseline of 80%.^[26]

Tool use

Tool use is the ability of an LLM to invoke external functions, APIs, or services. Rather than relying only on knowledge baked into its weights, an agent with tool use can search the web, run code, read files, call a database, send emails, or interact with virtually any software system that exposes an interface. OpenAI formalized this capability as "function calling" in its API in June 2023, and it has become standard across virtually all frontier model providers. Tool use is what lets an agentic system actually affect the world, not just describe it.

Planning

Planning refers to the agent's ability to decompose a high-level goal into a sequence of steps and then execute them in order, adjusting when unexpected results appear. Planning is often implemented through chain-of-thought prompting or through explicit scratchpad outputs where the agent writes out its intended steps before taking them. More sophisticated planning involves tree-of-thoughts approaches or model-based reasoning that evaluates multiple possible sequences before committing to one. Planning is the pattern most closely tied to the autonomous quality of agentic AI: a system that can only respond to individual instructions is reactive, while one that plans is proactive. Ng characterized planning as "incredibly powerful" but noted it "works less consistently" than reflection and tool use, cautioning that "the devil is in the details."^[1]^[2]

Multi-agent collaboration

Multi-agent collaboration distributes work across several specialized agents that coordinate to complete tasks too large for a single agent. One agent might gather research, a second might draft a report, a third might critique and revise it, and a fourth might format and deliver it. Ng noted that multi-agent workflows are more difficult to control because the system's behavior is harder to predict in advance, but that they consistently produce better results on complex tasks.^[1] This pattern maps onto Agentic AI Foundation architectures and is central to how production systems at companies like Salesforce, ServiceNow, and Atlassian are structured.

Lifelong learning agents

A separate research line, exemplified by the 2023 Voyager system from researchers at Nvidia, Caltech, Stanford, and UT Austin, explores agents that accumulate skills over time. Voyager, an embodied agent in Minecraft built on GPT-4, maintained an ever-growing skill library of executable code, used an automatic curriculum that selected what to explore next, and used iterative prompting with environment feedback to refine its programs. It collected 3.3 times more unique items, travelled 2.3 times longer, and unlocked tech-tree milestones up to 15.3 times faster than prior state-of-the-art Minecraft agents.^[27] The pattern of growing a persistent skill library has since influenced production coding agents that build up project-specific helper functions across sessions.

Andrew Ng's four patterns synthesis

The four patterns that Andrew Ng presented at AI Ascent in March 2024 (reflection, tool use, planning, and multi-agent collaboration) drew together prior research into a single, communicable framework, and the synthesis is what gave the agentic AI movement its industrial momentum.^[1]^[2] Ng argued that agentic reasoning, applied to existing models, could yield more improvement per dollar than continued scaling of model size, stating at AI Ascent that $100,000 spent on agentic reasoning could outpace $100 million spent on training larger models for many practical tasks.^[2]

Where did the term come from? Background and history

The technical building blocks of agentic AI accumulated over several years before the term became mainstream.

In 2022, Google researchers published the ReAct (Reasoning + Acting) framework, which showed that LLMs could interleave natural language reasoning with discrete actions such as querying Wikipedia. This established the basic loop that most modern agentic systems still follow: think, act, observe, repeat.^[23]

In March 2023, two viral projects appeared within days of each other. On March 26, 2023, Yohei Nakajima, then a partner at Untapped Capital, released BabyAGI, a small Python script that maintained a list of tasks and used an LLM to add, prioritize, and complete them in a loop.^[28] Four days later, on March 30, 2023, Toran Bruce Richards of Significant Gravitas published Auto-GPT on GitHub.^[28] Auto-GPT crossed 50,000 GitHub stars in its first month and roughly 100,000 over its lifetime. Both projects were brittle (they often looped forever, hallucinated subgoals, or burned through API budgets without finishing) but they demonstrated public appetite for autonomous AI and surfaced the key failure modes that subsequent work had to address.

In June 2023, Lilian Weng's Lil'Log post synthesized the academic literature into the "LLM + planning + memory + tool use" framework that would shape how engineers built and described agents for the next several years.^[13] The same month, OpenAI introduced function calling in the GPT API, giving developers a structured way to expose tools to language models. LangChain, launched in October 2022 by Harrison Chase, became the dominant library for building LLM-powered applications and agents. By 2024, LangGraph had emerged from the LangChain ecosystem as a dedicated orchestration layer for complex, stateful agent workflows.

In early 2024, Andrew Ng's AI Ascent presentation and the launch of Devin (March 12, 2024) by Cognition Labs shifted the conversation from research curiosity to production readiness. Devin was marketed as the world's first AI software engineer, capable of handling entire coding tasks end to end. By the second half of 2024, major AI labs had all launched or announced dedicated agent offerings, and enterprise pilots were multiplying rapidly.

Is agentic AI overhyped? Karpathy's "decade of agents" thesis

In an October 17, 2025 podcast with Dwarkesh Patel, Andrej Karpathy (a co-founder of OpenAI and former Senior Director of AI at Tesla) responded to the widespread industry framing of 2025 as "the year of agents" by arguing that the realistic horizon was an order of magnitude longer.^[29]^[30] Karpathy described his position as "it's the decade of agents," meaning the 2025-2035 period rather than any single year. Of the current generation of systems he said bluntly, "they just don't work," arguing that they are "cognitively lacking" in four specific ways: insufficient intelligence on long-horizon tasks, weak multimodal perception, unreliable computer-use grounding, and absent continual learning that would let them remember and improve from new experience. "You can't just tell them something and they'll remember it," he added.^[29]^[30]

The remark became one of the most-cited reframings of the agentic AI hype cycle. It was widely read both as a corrective to vendor claims that fully autonomous agents were already viable and as an endorsement of the broader research direction, since Karpathy was forecasting that agents would eventually work but only after another decade of engineering. Karpathy paired the talk with public skepticism of the term "AGI by 2027" then circulating among AI labs, arguing that the kind of always-on, learning, autonomous systems that the agentic AI framing implies are themselves part of the gap to AGI rather than a near-term stepping stone.^[29]^[30] The "decade of agents" framing was subsequently echoed in Gartner's June 2025 prediction that more than 40% of agentic AI projects would be canceled by the end of 2027 because of escalating costs, unclear business value, and inadequate risk controls.^[5]

Agentic AI frameworks

Several software frameworks handle the orchestration logic that agentic AI requires: managing state across steps, routing between agents, handling failures, and streaming results back to users. The comprehensive framework comparison, with release dates, licenses, and download counts, is maintained in the AI agents article, which owns the framework landscape; the frameworks most associated with the agentic AI movement are summarized here by their distinguishing design idea.

LangGraph (MIT licensed, maintained by LangChain) models agent workflows as directed graphs whose nodes are processing steps and whose edges define transitions including conditional branches and cycles. It supports durable execution (an agent can resume from a checkpoint after a failure), human-in-the-loop pauses, and both short-term and long-term memory. Companies such as Klarna, Replit, and Elastic have used it in production, and between Q1 2024 and Q1 2025 the broader LangChain ecosystem saw a 220% increase in GitHub stars and a 300% increase in package downloads. CrewAI, an open-source Python framework built independently of LangChain, organizes work around the "crew": each agent gets a role, goal, and backstory, and the framework supports sequential, hierarchical (a manager delegates to workers), and consensual (agents vote) process types; it became the second most widely searched agent framework, popular for low-barrier multi-agent automation. The OpenAI Agents SDK (March 2025, replacing the experimental Swarm) centers on the handoff, where one agent explicitly transfers control and context to another; OpenAI added enterprise safety features and guardrails in an April 2026 update.^[18] The Claude Agent SDK takes a tool-use-first philosophy in which sub-agents are invoked as tools of a parent agent (a supervisor tree) and integrates tightly with the Model Context Protocol. Google's Agent Development Kit (ADK) (April 2025, part of Google Cloud) uses a hierarchical agent tree with native support for Gemini's multimodal inputs and Vertex AI deployment.

Protocols

As agentic AI systems multiplied, two open protocols emerged to solve coordination problems that no single framework could address alone. Both are documented in full, including SDK download trends, the registry of available servers, version history, and the Agentic AI Foundation's founding projects, in the AI agents article; the paradigm-level summary follows.

The Model Context Protocol (MCP), introduced by Anthropic in November 2024, standardizes how AI systems connect to external tools and data sources.^[9] It addresses the "N times M problem" (connecting ten applications to one hundred tools could require up to one thousand custom connectors) by defining a single JSON-RPC 2.0 client-server protocol that a tool provider implements once and any MCP-compatible client can use. OpenAI adopted MCP across its products, including the ChatGPT desktop app, in March 2025, and in December 2025 Anthropic donated MCP governance to the Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation co-founded by Anthropic, Block, and OpenAI.^[20]

The Agent2Agent Protocol (A2A), announced by Google in April 2025 with support from more than 50 technology and consulting partners (including Salesforce, SAP, ServiceNow, Deloitte, McKinsey, and Accenture), solves a different problem: it lets agents built by different vendors and frameworks communicate with each other.^[10] A2A introduces Agent Cards (JSON documents describing an agent's capabilities), task management with defined lifecycle states, context sharing for long-running collaboration, and user-experience negotiation. Google donated A2A governance to the Linux Foundation under Apache 2.0 in June 2025, and by April 2026 more than 150 organizations had adopted it.^[14] The two are complementary: MCP governs the relationship between an agent and its tools, while A2A governs the relationship between agents and other agents.

Major products

The products below are the ones most often invoked to illustrate the agentic AI paradigm; the full product roster, with revenue figures, acquisitions, and detailed benchmark scores, is maintained in the AI agents article, which owns the product landscape. This section keeps only the facts that bear on the paradigm narrative.

Devin, released by Cognition Labs on March 12, 2024, was widely described as the first AI software engineer capable of handling an entire software task autonomously.^[7] It plans, writes code, runs tests, debugs, and can deploy. At launch it resolved 13.86% of real GitHub issues on SWE-bench end-to-end, against a previous state-of-the-art of 1.96%.^[7] Cognition was founded by ten engineers, including CEO Scott Wu, and funded by Founders Fund; Devin was later deployed at Goldman Sachs as part of a "hybrid workforce" initiative. Reception was mixed, with reviewers noting that the public demos did not fully represent reliability on unscripted tasks, a recurring theme in the paradigm's hype debate.

Manus AI launched in invitation-only beta on March 6, 2025, developed by Butterfly Effect, a startup backed by Tencent Holdings.^[12] Positioned as a general AI agent, its launch demo (autonomously screening job applications and conducting stock analysis) drew more than one million views within twenty hours. Manus distinguishes itself through a "Manus's Computer" interface that lets users observe and intervene in real time. In a notable turn, it shelved a planned Chinese-language version announced as an Alibaba Qwen partnership, closed its Chinese social accounts, and blocked access from mainland China.

OpenAI Operator and ChatGPT agent: Operator was announced January 23, 2025, and offered as a research preview to ChatGPT Pro users from February 1, 2025.^[11] It was powered by a Computer-Using Agent (CUA) model combining GPT-4o vision with reinforcement learning on graphical user interfaces, and set new WebArena and WebVoyager results at release. On July 17, 2025, its capabilities were merged into ChatGPT as the ChatGPT agent feature, combining Operator's remote browser, Deep Research's web synthesis, and ChatGPT's conversational interface.^[31]

Deep Research, introduced February 2, 2025 and powered initially by the o3 reasoning model, autonomously browses the web and produces long-form cited reports.^[32] Quotas rose from 100 queries per month for Pro at launch to 250 for Pro, 25 for Plus/Team/Enterprise/Edu, and 5 for free users in April 2025. It outperformed DeepSeek-R1 by 17.2 points and Gemini Thinking by 20.4 points on Humanity's Last Exam at release, and triggered a wave of competitors; Google released its own Deep Research agent the same day OpenAI launched GPT-5.2 in December 2025.^[33]

Computer use, introduced by Anthropic in October 2024 for Claude 3.5 Sonnet, lets Claude take screenshots, move a cursor, click, and type across a full desktop rather than only a browser as Operator does.^[8] Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company integrated it during the beta, and Anthropic has continued expanding support across Claude generations.

Coding agents

A distinct ecosystem of coding agents has matured around frontier models since 2024, blurring the line between IDE and autonomous agent.

Product	Form factor	Released	Notes
Cursor	VS Code fork (IDE)	2023	Agent mode added 2024-2025, runs multi-step edits across a repo
Aider	CLI	2023	Open source, edits a local git repo with model-driven commits
Cline	VS Code extension	2024	Brings autonomous task execution into the IDE; open source
Claude Code	CLI, IDE extensions, web	2025	Anthropic's official coding agent, runs on the Claude Agent SDK
Devin	Web app, Slack	2024	Long-running headless agent for autonomous tasks

By late 2025, most professional developers using coding agents reported using more than one tool in parallel: Cursor or Cline for inside-the-loop pair programming and Claude Code or Devin for asynchronous, longer-running tasks.^[34] The release of Claude Sonnet 4.5 in September 2025 and Claude Opus 4.5 in November 2025 was specifically pitched at sustained, multi-hour autonomous coding work, with parallel tool use and improved long-context reliability cited as the main agentic upgrades.^[36]

How is agentic AI measured? Benchmarks

A distinct family of benchmarks has emerged to evaluate agentic capabilities rather than single-turn language modeling.

Benchmark	Released	What it measures	Notable result
SWE-bench	2023	Real GitHub issue resolution in Python repos	Claude Opus 4.7 reaches 87.6% on SWE-bench Verified; Claude Sonnet 4.5 leads the Agentic Coding split at 77.2%^[37]
GAIA	2023	Multi-step real-world questions requiring tool use and web browsing	Humans 92%, GPT-4 with plugins 15% at launch^[38]
WebArena	2023	Web tasks across self-hosted simulated sites	Used as the canonical browser-agent benchmark
OSWorld	2024 (NeurIPS)	369 desktop and OS tasks across Ubuntu, Windows, macOS	Humans 72.4%, best 2024 model 12.2%^[39]
TheAgentCompany	2024 (CMU)	175 simulated tasks at a small software company	Gemini 2.5 Pro completed 30.3% autonomously, 39.3% with partial credit^[40]
METR time horizon	2025	Task length AI can complete at 50% reliability	Doubling roughly every 7 months since 2019, accelerating to every 4 months in 2024 to 2025^[41]

The METR result is the most cited statistical claim of the era. In its March 2025 paper "Measuring AI Ability to Complete Long Tasks," METR evaluated 13 frontier models on a suite of 170 software, cybersecurity, and reasoning tasks calibrated against 800 human professional baselines. The 50%-reliability task length doubled approximately every 7 months across 2019 to 2025 and roughly every 4 months across 2024 to 2025, a trend the authors summarized by noting that "extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks."^[41] At the time of the paper, frontier models such as Claude 3.7 Sonnet had a 50% time horizon of roughly 50 minutes.^[41]

How widely is agentic AI adopted? Enterprise adoption

By 2025 and 2026, agentic AI had moved from research labs to enterprise procurement conversations.

Gartner predicted in its 2025 strategic technology trends report that at least 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028, up from 0% in 2024.^[3] Gartner also projected that 33% of enterprise software applications will include agentic AI by 2028, up from less than 1% in 2024.^[3] In a separate August 2025 forecast, Gartner predicted that 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025.^[49] In a more cautionary note, Gartner predicted in June 2025 that more than 40% of agentic AI projects would be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls.^[5]

G2's August 2025 enterprise survey found that 57% of companies had AI agents in production, 22% were in pilot, and 21% were pre-pilot.^[15] The AI agents market is projected to reach between $9.1 billion and $10.9 billion in 2026, up from approximately $7.6 billion in 2025, at a compound annual growth rate above 45%.^[42] Grand View Research estimated the global AI agents market at $7.63 billion in 2025 and projected $10.91 billion in 2026, reaching $182.97 billion by 2033 at a 49.6% compound annual growth rate.^[50] IDC has forecast AI spending overall to reach $1.3 trillion by 2029 and a 10x increase in agent usage by 2027, with inference demand growing by a factor of 1,000 over the same period.^[42] Enterprise deployments that do reach production have reported average returns on investment of 171%, according to Arcade Research's 2025 survey, with US enterprises reporting closer to 192%.^[42] McKinsey's 2026 "State of AI trust" report described the period as a shift to "the agentic era" and emphasized that organizations succeeding at scale were redesigning workflows around agent capabilities rather than retrofitting agents into legacy processes.^[43]

Software development and customer service are the two use cases most commonly cited by enterprises in production. Gartner predicted that by 2029, agentic AI would autonomously resolve 80% of common customer service issues, cutting operational costs by 30%.^[4]

Use cases

Agentic AI has been applied across a wide range of domains:

Domain	Representative application
Software engineering	Full software task completion (Devin, GitHub Copilot Workspace, Cursor, Claude Code)
Research	Multi-source synthesis and report generation (OpenAI Deep Research, Perplexity)
Customer service	Autonomous ticket resolution and escalation routing
Web automation	Booking travel, filling forms, scraping data (Operator, Browser Use)
Data analysis	Pulling from databases, running calculations, generating visualizations
Document workflows	Contract review, summarization, redlining
IT operations	Incident triage, automated remediation, infrastructure provisioning
Sales	Lead research, personalized outreach drafting, CRM updates
Healthcare	Clinical note generation, prior authorization, appointment scheduling
Finance	Reconciliation, fraud pattern detection, regulatory report drafting
Embodied / gaming	Open-ended exploration and skill acquisition (Voyager in Minecraft)

How reliable is agentic AI? Reliability challenges

Despite rapid adoption, agentic AI systems face reliability problems that meaningfully distinguish them from traditional software.

A factual error in a chatbot answer is merely misleading. A hallucinated action taken by an autonomous agent, such as calling a non-existent API endpoint, deleting the wrong file, or submitting incorrect information to a third-party service, can cause irreversible damage. Researchers have termed this the "hallucination in action" problem.

In multi-step workflows, errors compound. Even if each step in an agent workflow succeeds 95% of the time, a 20-step workflow finishes without error only about 36% of the time, while production systems for critical processes often target 99.9% or higher overall reliability.^[44] A single mistake in an early reasoning step can propagate through subsequent steps, producing a cascade of failures by the end of the task. In multi-agent systems, the problem is more severe: a bad decision by one agent can propagate across interconnected workflows affecting other agents.

Agents also sometimes enter infinite loops, retrying a failed action repeatedly without modifying their approach. This is distinct from hallucination but equally disruptive in production systems.

Common mitigation strategies include:

Human-in-the-loop checkpoints where agents pause and request approval before irreversible actions
Sandboxed execution environments that limit what tools an agent can invoke
Token budget limits and circuit breakers that terminate agents exceeding defined thresholds
Comprehensive logging and tracing so failures can be investigated and reproduced
Evals and automated testing pipelines specific to agent behavior
Context engineering: deliberate management of which tokens (system instructions, tools, memory, retrieved documents) enter the agent's window at each step, a discipline that emerged in mid-2025 as the production successor to single-prompt prompt engineering^[35]

The NIST AI Risk Management Framework was updated in 2025 to include specific guidance for agentic AI, requiring organizations to map all agent tool access permissions and implement circuit breakers.

Legal RAG (retrieval-augmented generation) systems, one of the more mature deployment patterns, still hallucinate citations between 17% and 33% of the time in 2025 assessments. The overall failure rate for production deployment remains high: one study found that 88% of AI agents fail to reach production from pilot, though those that do reach production deliver significant ROI.

Critiques and limitations

The agentic AI framing has attracted sharp scrutiny from researchers who argue that current systems are oversold relative to their measured capabilities.

Subbarao Kambhampati, an Arizona State University professor and former AAAI president, has argued in a series of papers and talks since 2023 that autoregressive LLMs cannot, on their own, perform planning or reliable self-verification.^[45] In his 2024 ICML position paper "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks," Kambhampati showed that LLM success rates on classical planning benchmarks average around 3% when used in isolation and argued that apparent gains from self-critique are often a Clever Hans effect, with the human prompter unconsciously steering the model. His proposed remedy, the LLM-Modulo framework, pairs an LLM as a fast knowledge source with an external symbolic verifier in a closed loop, treating the LLM as a generator of candidates and the verifier as the source of correctness.^[45]

Gary Marcus has used his Substack to track agent failures in production and to argue that mimicry-based systems will not reach the reliability thresholds required for autonomous deployment in important domains. He predicted at the start of 2025 that agents would be "endlessly hyped throughout 2025 but far from reliable, except possibly in very narrow use cases" and pointed to TheAgentCompany's roughly 25% completion rate as evidence that the gap between marketing and capability remained large.^[46] Marcus and Kambhampati both advocate hybrid neurosymbolic approaches that combine LLMs with explicit world models and symbolic verifiers.

The Karpathy "decade of agents" position, while less polemical than Marcus's, fits the same broad pattern: a recognition that the underlying paradigm has potential paired with skepticism that the current generation of systems is close to autonomous, reliable deployment.^[29]

Security

Agents that have access to tools, APIs, and file systems dramatically expand the attack surface compared to read-only assistants.

Indirect prompt injection, first systematically documented by Kai Greshake, Sahar Abdelnabi and colleagues in their February 2023 paper "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection," is the dominant practical concern.^[47] In this attack class, the adversary embeds instructions in data that the agent will later retrieve such as a webpage, email, document, ticket, calendar invite, PDF, or image with hidden text. When the agent reads that data, the embedded instructions enter its context on the same footing as the user's own request. Unlike direct prompt injection, the attacker never speaks to the LLM; they speak to a document that the LLM will someday read. Demonstrated impacts in subsequent research include data exfiltration, persistent backdoors, denial of service, and remote control of integrated tools.^[47]

The CSO Online assessment in 2025 described the agentic AI boom as "a CISO's worst nightmare" because agents with broad tool access can exfiltrate data or trigger unintended actions if compromised.^[19] Microsoft's 2025 documentation of how it defends Copilot against indirect prompt injection describes a layered defense including spotlighting (visually delimiting untrusted content within the prompt), classifier-based input screening, and tool-level guardrails. As of 2026 there is no universally agreed solution, and most production deployments rely on a combination of sandboxing, least-privilege tool permissions, human-in-the-loop approvals for sensitive actions, and continuous monitoring.

How does agentic AI differ from a chatbot? Comparison with traditional AI assistants

The following table summarizes the main differences between agentic AI systems and the earlier generation of AI assistants and chatbots:^[16]^[17]

Dimension	Traditional AI assistant / chatbot	Agentic AI
Interaction model	Responds to each user prompt individually	Receives a goal and works autonomously through multiple steps
Tool access	Limited or none	Broad: web, code execution, APIs, file systems
Memory	Typically stateless within a session	Can maintain state across steps; some systems persist memory across sessions
Error handling	Returns incorrect answer or fails silently	Can observe errors, retry, or escalate
Human involvement	Required for every action	Can operate with minimal human input; human-in-the-loop optional
Output type	Text response	Actions taken in the world plus a summary
Task length	One exchange	Minutes to hours or longer
Risk profile	Misinformation	Misinformation plus irreversible real-world actions
Typical interface	Chat window	Background process, dashboard, or embedded in other software

Limitations

Beyond reliability and security, agentic AI faces several structural limitations as of 2026.

Long context degradation: While context windows have grown substantially, LLM attention quality tends to degrade over very long contexts, which means agents that accumulate too much history in a single session can make progressively worse decisions. Context engineering practices, including summarization, selective retrieval, and explicit memory tiers, aim to limit this degradation.^[35]

Alignment drift: In extended autonomous operation, an agent's behavior can diverge from user intent in ways that are difficult to detect without continuous monitoring. The problem is more acute when agents are coordinating with each other, because misaligned sub-goals can compound.

Cost: Multi-step agentic runs consume many more tokens than single-turn interactions. Complex agent tasks can cost dollars per run on frontier models, which limits scalability for high-volume use cases.

Explainability: When an agent produces an incorrect result after dozens of steps, tracing the root cause is substantially harder than auditing a single model response. This is a compliance problem in regulated industries such as healthcare and finance.

Continual learning: Karpathy and others have noted that current frontier models do not learn from experience between sessions without explicit fine-tuning or external memory; an agent that just made the same mistake yesterday will make it again today.^[29] Building durable, online learning into the agent loop remains an unsolved problem.

Over-reliance on individual models: Most agentic frameworks depend on a single frontier model for orchestration, which means quality, latency, and cost are all tightly coupled to that model's performance. Model provider outages or price changes propagate directly to agent system availability.

References

Ng, Andrew. "Four design patterns for AI agent workflows." LinkedIn / X, March 2024. https://x.com/AndrewYNg/status/1773393357022298617 ↩
Octet Consulting. "Notes on Agentic Reasoning from Andrew Ng at Sequoia AI Ascent 2024." https://octetdata.com/blog/notes-andrew-ng-agentic-reasoning-2024/ ↩
Gartner. "Top Strategic Technology Trends for 2025: Agentic AI." https://www.gartner.com/en/documents/5850847 ↩
Gartner. "Gartner Predicts Agentic AI Will Autonomously Resolve 80% of Common Customer Service Issues Without Human Intervention by 2029." March 5, 2025. https://www.gartner.com/en/newsroom/press-releases/2025-03-05-gartner-predicts-agentic-ai-will-autonomously-resolve-80-percent-of-common-customer-service-issues-without-human-intervention-by-20290 ↩
Gartner. "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027." June 25, 2025. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027 ↩
McKinsey. "The agentic organization: Contours of the next paradigm for the AI era." https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/the-agentic-organization-contours-of-the-next-paradigm-for-the-ai-era ↩
Cognition Labs. "Introducing Devin, the first AI software engineer." March 12, 2024. https://cognition.ai/blog/introducing-devin ↩
Anthropic. "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." October 2024. https://www.anthropic.com/news/3-5-models-and-computer-use ↩
Anthropic. "Introducing the Model Context Protocol." November 2024. https://www.anthropic.com/news/model-context-protocol ↩
Google Developers Blog. "Announcing the Agent2Agent Protocol (A2A)." April 2025. https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/ ↩
OpenAI. "Introducing Operator." January 23, 2025. https://openai.com/index/introducing-operator/ ↩
MIT Technology Review. "Everyone in AI is talking about Manus. We put it to the test." March 11, 2025. https://www.technologyreview.com/2025/03/11/1113133/manus-ai-review/ ↩
Weng, Lilian. "LLM Powered Autonomous Agents." Lil'Log, June 23, 2023. https://lilianweng.github.io/posts/2023-06-23-agent/ ↩
Linux Foundation. "Linux Foundation Launches the Agent2Agent Protocol Project." June 2025. https://www.linuxfoundation.org/press/linux-foundation-launches-the-agent2agent-protocol-project-to-enable-secure-intelligent-communication-between-ai-agents ↩
G2. "Enterprise AI Agents Report: Industry Outlook for 2026." https://learn.g2.com/enterprise-ai-agents-report ↩
Salesforce. "AI Agent vs. Chatbot." https://www.salesforce.com/agentforce/ai-agent-vs-chatbot/ ↩
IBM. "AI Agents vs. AI Assistants." https://www.ibm.com/think/topics/ai-agents-vs-ai-assistants ↩
TechCrunch. "OpenAI updates its Agents SDK to help enterprises build safer, more capable agents." April 15, 2026. https://techcrunch.com/2026/04/15/openai-updates-its-agents-sdk-to-help-enterprises-build-safer-more-capable-agents/ ↩
CSO Online. "Why 2025's agentic AI boom is a CISO's worst nightmare." https://www.csoonline.com/article/4132860/why-2025s-agentic-ai-boom-is-a-cisos-worst-nightmare.html ↩
Ajeet Raina. "One Year of Model Context Protocol: From Experiment to Industry Standard." https://www.ajeetraina.com/one-year-of-model-context-protocol-from-experiment-to-industry-standard/ ↩
Hugging Face. "Understanding AI Agents through the Thought-Action-Observation Cycle." https://huggingface.co/learn/agents-course/en/unit1/agent-steps-and-structure ↩
ZBrain. "Agent scaffolding: Architecture, types and enterprise applications." https://zbrain.ai/agent-scaffolding/ ↩
Yao, Shunyu et al. "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629, October 6, 2022. https://arxiv.org/abs/2210.03629 ↩
LangChain Blog. "Plan-and-Execute Agents." 2024. https://blog.langchain.com/planning-agents/ ↩
Madaan, Aman et al. "Self-Refine: Iterative Refinement with Self-Feedback." arXiv:2303.17651, March 30, 2023. https://arxiv.org/abs/2303.17651 ↩
Shinn, Noah et al. "Reflexion: Language Agents with Verbal Reinforcement Learning." arXiv:2303.11366, NeurIPS 2023. https://arxiv.org/abs/2303.11366 ↩
Wang, Guanzhi et al. "Voyager: An Open-Ended Embodied Agent with Large Language Models." arXiv:2305.16291, 2023. https://voyager.minedojo.org/ ↩
Fortune. "What are BabyAGI and AutoGPT, the new A.I. tools Silicon Valley is buzzing about?" April 15, 2023. https://fortune.com/2023/04/15/babyagi-autogpt-openai-gpt-4-autonomous-assistant-agi/ ↩
Patel, Dwarkesh. "Andrej Karpathy: AGI is still a decade away." Dwarkesh Podcast, October 17, 2025. https://www.dwarkesh.com/p/andrej-karpathy ↩
Willison, Simon. "Andrej Karpathy: AGI is still a decade away." October 18, 2025. https://simonwillison.net/2025/Oct/18/agi-is-still-a-decade-away/ ↩
OpenAI. "Introducing ChatGPT agent: bridging research and action." July 17, 2025. https://openai.com/index/introducing-chatgpt-agent/ ↩
OpenAI. "Introducing deep research." February 2, 2025. https://openai.com/index/introducing-deep-research/ ↩
TechCrunch. "Google launched its deepest AI research agent yet, on the same day OpenAI dropped GPT-5.2." December 11, 2025. https://techcrunch.com/2025/12/11/google-launched-its-deepest-ai-research-agent-yet-on-the-same-day-openai-dropped-gpt-5-2/ ↩
Requesty Blog. "Agentic Coding Tools Compared (2026): Claude Code, Cursor, Codex, Aider." https://www.requesty.ai/blog/agentic-coding-tools-compared-2026-claude-code-cursor-codex-aider ↩
Anthropic. "Effective context engineering for AI agents." 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents ↩
Anthropic. "Introducing Claude Sonnet 4.5." September 29, 2025. https://www.anthropic.com/news/claude-sonnet-4-5 ↩
SWE-bench Leaderboards. https://www.swebench.com/ ↩
Mialon, Grégoire et al. "GAIA: a benchmark for General AI Assistants." arXiv:2311.12983, 2023. https://arxiv.org/abs/2311.12983 ↩
Xie, Tianbao et al. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." NeurIPS 2024. https://os-world.github.io/ ↩
Xu, Frank F. et al. "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks." arXiv:2412.14161, 2024. https://arxiv.org/abs/2412.14161 ↩
METR. "Measuring AI Ability to Complete Long Tasks." March 19, 2025. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ ↩
Axis Intelligence. "Agentic AI Adoption Statistics 2026: Enterprise Deployment Rates, Market Projections & ROI Data." https://axis-intelligence.com/agentic-ai-adoption-statistics-2026/ ↩
McKinsey. "State of AI trust in 2026: Shifting to the agentic era." https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/tech-forward/state-of-ai-trust-in-2026-shifting-to-the-agentic-era ↩
Marcus, Gary. "AI Agents have, so far, mostly been a dud." Marcus on AI Substack, 2025. https://garymarcus.substack.com/p/ai-agents-have-so-far-mostly-been ↩
Kambhampati, Subbarao et al. "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks." ICML 2024. https://arxiv.org/abs/2402.01817 ↩
Marcus, Gary. "Breaking: Autonomous Agents are a Shitshow." Marcus on AI Substack, 2025. https://garymarcus.substack.com/p/breaking-autonomous-agents-are-a ↩
Greshake, Kai et al. "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv:2302.12173, February 2023. https://arxiv.org/abs/2302.12173 ↩
McKinsey. "Seizing the agentic AI advantage." QuantumBlack, 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/seizing-the-agentic-ai-advantage ↩
Gartner. "Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up From Less Than 5% in 2025." August 26, 2025. https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025 ↩
Grand View Research. "AI Agents Market Size, Share and Trends Report, 2026-2033." https://www.grandviewresearch.com/industry-analysis/ai-agents-market-report ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · full history

Suggest edit

Agentic AI

What is agentic AI? Definition and scope

Lilian Weng's three-component framework

How does the agent loop work?

What are the common agentic patterns?

ReAct (Reason + Act)

Plan-and-Execute

Reflection and self-refinement

Tool use

Planning

Multi-agent collaboration

Lifelong learning agents

Andrew Ng's four patterns synthesis

Where did the term come from? Background and history

Is agentic AI overhyped? Karpathy's "decade of agents" thesis

Agentic AI frameworks

Protocols

Major products

Coding agents

How is agentic AI measured? Benchmarks

How widely is agentic AI adopted? Enterprise adoption

Use cases

How reliable is agentic AI? Reliability challenges

Critiques and limitations

Security

How does agentic AI differ from a chatbot? Comparison with traditional AI assistants

Limitations

See also

References

Improve this article

What links here (24 of 94)

What links here (24 of 94)

What is agentic AI? Definition and scope

Lilian Weng's three-component framework

How does the agent loop work?

What are the common agentic patterns?

ReAct (Reason + Act)

Plan-and-Execute

Reflection and self-refinement

Tool use

Planning

Multi-agent collaboration

Lifelong learning agents

Andrew Ng's four patterns synthesis

Where did the term come from? Background and history

Is agentic AI overhyped? Karpathy's "decade of agents" thesis

Agentic AI frameworks

Protocols

Major products

Coding agents

How is agentic AI measured? Benchmarks

How widely is agentic AI adopted? Enterprise adoption

Use cases

How reliable is agentic AI? Reliability challenges

Critiques and limitations

Security

How does agentic AI differ from a chatbot? Comparison with traditional AI assistants

Limitations

See also

References

Improve this article

Related Articles

Replit Agent

Perplexity Comet

Dia (browser)

ChatGPT Atlas

GitHub Spark

Agentic Context Engineering

What links here (24 of 94)

Related Articles

Replit Agent

Perplexity Comet

Dia (browser)

ChatGPT Atlas

GitHub Spark

Agentic Context Engineering

What links here (24 of 94)