AI agents

AI Agents Artificial Intelligence Large Language Models Machine Learning

43 min read

Updated Jun 20, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 20, 2026

Fact-checked

In review queue

Sources

41 citations

Revision

v16 · 8,592 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

This article covers contemporary LLM-based AI agents in practice: their architectures, frameworks, protocols, products, benchmarks, enterprise adoption, and safety. For the foundational and historical concept of an agent (rational agents, classical architectures, reinforcement learning), see Agent; for the "agentic AI" paradigm and discourse, see Agentic AI.

AI agents are software systems that use artificial intelligence to perceive their environment, reason about goals, make decisions, and take autonomous actions to accomplish tasks on behalf of a user or another system. Unlike traditional AI models that simply generate responses to prompts, agents operate in loops of perception, reasoning, and action, often using external tools and maintaining memory across interactions. The term has become central to the AI industry since 2023, when large language models (LLMs) enabled a new generation of agents capable of general-purpose reasoning, planning, and tool use.

The concept of an agent has been a foundational abstraction in AI since the field's earliest days. In the most widely cited textbook definition, Stuart Russell and Peter Norvig describe an agent as "anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators."^[1] Their textbook, Artificial Intelligence: A Modern Approach (1995, fourth edition 2021), frames the entire study of AI as "the study and design of rational agents," where a rational agent "acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome."^[1]

In practice, AI agents range from simple thermostat-like systems that follow fixed rules to sophisticated LLM-powered autonomous systems that can browse the web, write and execute code, manage files, and interact with external APIs. As of 2026, AI agents represent one of the fastest-growing segments of the AI industry. The AI agents market grew from $5.40 billion in 2024 to roughly $7.63 billion in 2025, and is projected to reach $50.31 billion by 2030, a compound annual growth rate of 45.8% from 2025 to 2030, according to Grand View Research.^[27] Gartner predicts that 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025.^[13] Looking further out, Gartner expects 33% of enterprise software applications to include agentic AI by 2028, up from less than 1% in 2024, and at least 15% of day-to-day work decisions to be made autonomously by agentic AI by 2028, up from 0% in 2024.^[30]

What are AI agents? Definition and core characteristics

There is no single agreed-upon definition of an AI agent, but most researchers and practitioners identify several characteristics that distinguish agents from simpler AI systems.

An AI agent is a system that can:

Perceive its environment through inputs such as text, images, sensor data, API responses, or screen content.
Reason about what actions to take based on its goals, current context, and available information.
Plan by breaking complex tasks into sequences of smaller steps.
Act by executing actions in the real world, such as calling APIs, writing code, clicking buttons, or sending messages.
Learn and adapt by incorporating feedback from previous actions to refine future behavior.

Michael Wooldridge and Nicholas Jennings, in influential work from the 1990s, identified four properties that distinguish intelligent agents: autonomy (operating without direct human control), reactivity (responding to changes in the environment), proactiveness (taking initiative to achieve goals), and social ability (interacting with other agents or humans).^[5]

How does an AI agent differ from a chatbot?

The key distinction between an agent and a standard chatbot or LLM is the action loop. A chatbot receives a prompt and returns a response. An agent receives a goal, then enters a loop where it repeatedly reasons about what to do next, takes an action, observes the result, and decides whether to continue or stop. This loop can run for seconds, minutes, or even hours depending on the complexity of the task.

Andrew Ng, a prominent AI researcher, has described the spectrum of "agentic" behavior as a continuum rather than a binary classification. Writing in his DeepLearning.AI newsletter The Batch, Ng argued: "Rather than having to choose whether or not something is an agent in a binary way, I thought, it would be more useful to think of systems as being agent-like to different degrees."^[28] A system that uses an LLM to generate a single response is not agentic. A system that uses an LLM to generate a response, then reflects on that response, then revises it, is somewhat agentic. A system that autonomously plans multi-step workflows, uses tools, and adapts based on results is highly agentic.

History

When did AI agents emerge? Roots from symbolic AI to game-playing agents (1950s to 2010s)

The concept of software agents predates modern AI by decades, and the full conceptual and historical arc is covered in the Agent article. In brief: the earliest agent-like systems emerged during the Symbolic AI era, including ELIZA (1966), SHRDLU (1970), the STRIPS planner (1971), and the expert systems of the 1970s and 1980s such as DENDRAL and MYCIN. Reinforcement learning (RL) then introduced a formal framework for agents that learn from interaction with an environment, establishing the classic perceive-act-feedback-update loop that still underlies modern agents. Game-playing systems made the abstraction concrete and visible: IBM's Deep Blue beat chess champion Garry Kasparov in 1997, AlphaGo beat Lee Sedol at Go in 2016, and later AlphaStar (StarCraft II) and OpenAI Five (Dota 2) showed RL agents mastering complex strategic environments. See Agent for the classical architectures (subsumption, BDI, the actor model) and multi-agent systems research behind this lineage.

The 2010s also saw the emergence of consumer virtual assistants such as Apple's Siri (2011), Amazon's Alexa (2014), and Google Assistant (2016). While these systems could carry out simple voice-driven tasks, they relied heavily on predefined intents and lacked the ability to reason through multi-step problems or use arbitrary tools, a gap the LLM-based agents below were the first to close.

Early LLM-based agents (2022 to 2023)

The release of GPT-3 in 2020 and ChatGPT in late 2022 revealed that large language models could serve as general-purpose reasoning engines. Early experiments with prompt engineering showed that LLMs could simulate procedural reasoning when given instructions like "think step by step."

In October 2022, researchers at Princeton and Google published the ReAct paper ("Synergizing Reasoning and Acting in Language Models"), which introduced a framework where an LLM alternates between generating reasoning traces and taking actions.^[2] The paper explored using LLMs "to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two," and reported that ReAct "overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API."^[2] Interleaving thought and action dramatically improved performance on knowledge-intensive tasks and interactive environments compared to reasoning or acting alone.^[2]

In February 2023, Meta AI published the Toolformer paper (Schick et al.), showing that language models could teach themselves to use external tools such as search engines, calculators, and translation systems via API calls.^[3] Toolformer established that tool use did not need to be hand-programmed; models could learn when and how to invoke tools through self-supervised training, with API calls inserted into the training data at useful positions.^[3]

The spring of 2023 saw an explosion of open-source autonomous agent projects. AutoGPT, created by Toran Bruce Richards and released in March 2023, became one of the fastest-growing GitHub repositories in history, amassing over 100,000 stars within months. AutoGPT worked by taking a user-defined goal, breaking it into sub-tasks, and using GPT-4 to execute them autonomously, with access to web search, file operations, and code execution.

BabyAGI, a Python script created by venture capitalist Yohei Nakajima, went viral around the same time. It orchestrated a simple loop of task creation, execution, and prioritization using an LLM and a vector memory store. Despite their limitations (frequent loops, high API costs, unreliable outputs), AutoGPT and BabyAGI proved the concept of LLM-powered autonomous agents and sparked massive investment and research in the field.

The agentic era (2024 to present)

By 2024, the AI industry had shifted decisively toward building production-grade agent systems. Major labs released dedicated agent APIs, frameworks matured, and enterprises began deploying agents for real workflows. OpenAI launched its Assistants API (later succeeded by the Responses API and Agents SDK),^[9] Anthropic introduced Claude with computer use capabilities,^[11] and Google released Project Mariner for autonomous web browsing.^[12]

The year 2025 marked the emergence of standardized protocols (MCP, A2A) and governance structures (the Agentic AI Foundation) designed to make agents interoperable and safe at scale. By early 2026, agents had moved from experimental demos to production deployments at thousands of enterprises.

How are AI agents classified?

AI agents can be classified along several dimensions. The most widely referenced taxonomy, drawn from Russell and Norvig's textbook, identifies five types based on increasing sophistication: simple reflex agents (act on the current percept via condition-action rules), model-based reflex agents (maintain an internal world model to handle partial observability), goal-based agents (search and plan toward explicit goals), utility-based agents (compare outcomes on a continuous utility scale), and learning agents (improve over time through a performance element, learning element, critic, and problem generator).^[1] The summary table below is reproduced for quick reference; the formal taxonomy, its history, and worked examples are treated in depth in the Agent article, which owns the conceptual framework. Most modern LLM-based agents blend elements of all five types.

Agent type	Internal state	Planning	Learning	Example
Simple reflex	None	No	No	Thermostat
Model-based reflex	World model	No	No	Spam filter with context state
Goal-based	World model + goals	Yes	No	Route planner
Utility-based	World model + utility function	Yes	No	Financial trading agent
Learning	All of the above + learning element	Yes	Yes	Self-driving car, AlphaGo

Architecture patterns

AI agent architectures define how an agent processes inputs, maintains state, makes decisions, and executes actions. Several design patterns have emerged as the field has matured.

The agent loop

A typical LLM-based agent operates in a loop:

Perceive: Receive input from the user or environment.
Think: Reason about the input using the LLM, considering goals, memory, and available tools.
Act: Execute an action, such as calling a tool, writing code, or generating a response.
Observe: Process the result of the action.
Repeat: Continue the loop until the task is complete or the agent determines it cannot proceed.

This perceive-think-act-observe loop can run for anywhere from a single iteration (simple question answering) to hundreds of iterations (complex software engineering tasks that involve reading code, writing patches, running tests, and debugging failures).

ReAct (Reasoning and Acting)

The ReAct pattern, introduced by Yao et al. in 2022, interleaves reasoning steps with action steps.^[2] At each iteration, the agent generates a "thought" explaining its reasoning, then selects and executes an action (such as a tool call), then observes the result. This cycle repeats until the task is complete.

ReAct's key advantage is interpretability. Because the agent explicitly articulates its reasoning before each action, humans can follow the agent's decision-making process and identify errors. ReAct has become the default pattern for many agent frameworks, including early versions of LangChain agents.

Plan-and-execute

In the plan-and-execute pattern, the agent first generates a complete plan for accomplishing a goal, then executes each step sequentially. If a step fails or produces unexpected results, the agent may re-plan from that point.

This pattern works well for tasks with clear structure and predictable steps, such as data analysis pipelines or multi-step research workflows. The separation of planning from execution also makes it easier to audit and control agent behavior. LangGraph's "plan-and-execute" template is a widely used implementation of this pattern.

Reflection

Reflection agents add a self-critique step after each action or after completing a draft output. The agent generates output, then evaluates that output against criteria (correctness, completeness, style), then revises based on its own feedback. This loop can repeat multiple times.

Reflection significantly improves output quality for tasks like writing, code generation, and analysis. Anthropic's Claude models, for example, use extended thinking (also called "thinking" or "scratch pad") to reason through complex problems before producing a final answer. Reflexion (Shinn et al., NeurIPS 2023) added a verbal self-critique step where the agent reviews its own outputs, writes a reflection to memory, and uses that reflection in subsequent attempts.^[4]

Tool use and function calling

Tool use enables agents to interact with external systems, moving beyond text generation to real-world action. Common tools include web search APIs, code interpreters, database queries, file system access, browser automation, and custom API integrations.

The mechanism works through function calling (also called tool calling). The developer provides the agent with a set of available tools, each described by a name, a natural language description, and a JSON schema specifying its parameters. The model then decides whether to call a tool and, if so, which tool to call with what arguments. The runtime executes the call and returns the result, which is fed back into the model's context for the next decision. Modern LLMs are trained with function-calling capabilities that make tool use reliable and structured. OpenAI's function calling, Anthropic's tool use API, and Google's function calling for Gemini all provide standardized interfaces for this pattern.

Multi-agent orchestration

Rather than building a single agent that handles everything, multi-agent systems use specialized agents that collaborate on complex tasks. Common orchestration patterns include:

Pattern	Description	Best for
Hierarchical	A supervisor agent delegates tasks to specialized worker agents and aggregates their results	Complex workflows with clear task decomposition
Hub-and-spoke	A central orchestrator routes messages between agents, maintaining a shared state	Workflows requiring strong consistency and auditability
Mesh / peer-to-peer	Agents communicate directly with each other without a central coordinator	Resilient systems that need to handle partial failures gracefully
Pipeline	Agents process work sequentially, each transforming the output of the previous agent	Linear workflows like content creation (research, write, edit, review)
Debate / adversarial	Multiple agents propose solutions and critique each other's outputs	Tasks requiring high accuracy where errors are costly

Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025, reflecting growing enterprise interest. Organizations using multi-agent architectures report 45% faster problem resolution and 60% more accurate outcomes compared to single-agent systems.

Major frameworks and platforms

A growing ecosystem of frameworks, SDKs, and platforms supports agent development. The following table summarizes the major options as of early 2026.

Framework / Platform	Organization	First Release	Key Features	License
LangChain / LangGraph	LangChain Inc.	October 2022	Graph-based workflows, stateful agents, 47M+ PyPI downloads, largest ecosystem of integrations	MIT
AutoGPT	Significant Gravitas	March 2023	Autonomous goal-driven agents, web browsing, file operations, 160k+ GitHub stars	MIT
CrewAI	CrewAI Inc.	December 2023	Role-based multi-agent teams, natural task division, low barrier to entry	MIT
AutoGen / AG2	Microsoft Research	September 2023	Multi-agent conversation patterns, group chat, nested conversations (maintenance mode as of 2025)	MIT
Microsoft Semantic Kernel	Microsoft	March 2023	Enterprise-grade LLM integration with C# and Python, planners, plugins	MIT
OpenAI Agents SDK	OpenAI	March 2025	Tool use, handoffs, guardrails, tracing; production successor to Swarm; provider-agnostic	MIT
OpenAI Responses API	OpenAI	March 2025	Unified API merging Chat Completions and Assistants capabilities; built-in web search, file search, computer use	Proprietary
Claude Agent SDK	Anthropic	2025	Same infrastructure powering Claude Code; supports building custom agents	Proprietary
Google Agent Development Kit (ADK)	Google	2025	Multi-agent orchestration, integration with Gemini models and Google Cloud	Apache 2.0
Amazon Bedrock Agents	Amazon	2023	Managed agent service with knowledge bases, action groups, and guardrails on AWS	Proprietary
Llama Index	LlamaIndex Inc.	November 2022	Data indexing, retrieval-augmented generation, agent workflows	MIT

Are AI agent frameworks open source?

Most of the leading agent frameworks are open source under permissive licenses. As the table above shows, LangChain/LangGraph, AutoGPT, CrewAI, AutoGen, Semantic Kernel, the OpenAI Agents SDK, and Llama Index are released under the MIT license, while Google's Agent Development Kit uses Apache 2.0. The managed cloud services and hosted agent runtimes, such as OpenAI's Responses API, the Claude Agent SDK, and Amazon Bedrock Agents, are proprietary. This split mirrors the broader market: open-source SDKs dominate developer experimentation, while production deployment often runs on a proprietary hosted platform.

LangChain and LangGraph

LangChain, created by Harrison Chase, launched in October 2022 and quickly became the most popular framework for building LLM-powered applications. It provides abstractions for chains (sequential LLM calls), agents (LLM + tools), memory, and retrieval. LangGraph, released as a companion library, represents agent workflows as directed graphs where nodes are computation steps and edges define control flow.

By late 2025, LangGraph reached version 1.0 and became the default runtime for all LangChain agents. The framework has accumulated over 47 million PyPI downloads and the largest third-party integration ecosystem in the agentic AI space.

CrewAI

CrewAI, created by Joao Moura and first released in December 2023, focuses on multi-agent collaboration using a role-based metaphor. Developers define "crews" of agents, each with a specific role (researcher, writer, reviewer), and CrewAI orchestrates their interaction. The framework has the lowest barrier to entry for multi-agent prototyping and has grown rapidly through 2025.

AutoGen and the Microsoft Agent Framework

AutoGen, developed by Microsoft Research, pioneered structured multi-agent conversation patterns where agents interact through two-agent chats, group chats, sequential conversations, and nested dialogues. In 2025, Microsoft shifted AutoGen to maintenance mode and launched the broader Microsoft Agent Framework, which encompasses Semantic Kernel, Azure AI Agent Service, and Copilot Studio. General availability is targeted for Q1 2026, offering production SLAs and multi-language support.

OpenAI Agents SDK and Responses API

In March 2025, OpenAI released the building blocks of its new agents platform: the Responses API (merging capabilities from Chat Completions and Assistants APIs), built-in tools for web search, file search, and computer use, and the open-source Agents SDK with tracing.^[9] The Assistants API is scheduled for deprecation on August 26, 2026.^[9]

The Agents SDK is a production-ready upgrade of OpenAI's earlier experimental framework, Swarm. It provides building blocks for tool use, handoffs between agents, guardrails, and observability, and is designed to be provider-agnostic.^[9]

Protocols and standards

Model Context Protocol (MCP)

The Model Context Protocol (MCP) is an open standard announced by Anthropic in November 2024 for connecting AI models to external data sources and tools.^[6] MCP defines a client-server architecture where an AI application (the MCP client) communicates with tool providers (MCP servers) through a standardized JSON-RPC 2.0 interface, reusing architectural ideas from the Language Server Protocol (LSP).^[6]

MCP solves the "N times M" integration problem. Without a standard protocol, every AI application must build custom integrations with every tool. With MCP, tool providers implement a single server, and AI applications implement a single client, and they all work together.

Adoption of MCP has been rapid. OpenAI integrated MCP across its products, including the ChatGPT desktop app, in March 2025. At Microsoft Build 2025, GitHub and Microsoft announced they were joining MCP's steering committee, and Microsoft announced MCP support in Windows 11. Google DeepMind confirmed support in April 2025. Monthly SDK downloads grew from approximately 2 million at launch to over 97 million by late 2025, a roughly 4,750% increase in 16 months. By February 2026, the official MCP registry listed over 6,400 MCP servers.

In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, establishing open governance for the protocol's future development.^[8]

Agent-to-Agent Protocol (A2A)

The Agent2Agent Protocol (A2A) was introduced by Google in April 2025 as a communication standard for multi-agent systems.^[7] While MCP standardizes how agents connect to tools, A2A standardizes how agents communicate with each other.

A2A uses JSON-RPC 2.0 over HTTPS and defines mechanisms for agent discovery through "Agent Cards" (JSON documents that describe an agent's capabilities, authentication requirements, and connection details), task management, secure information exchange, and coordination across different frameworks and vendors.^[7] Version 0.3, released in July 2025, added gRPC support, signed security cards, and an extended Python SDK. The protocol received support from over 150 organizations, including Atlassian, Salesforce, SAP, ServiceNow, and PayPal.

However, adoption of A2A has been slower than MCP. As of late 2025, most of the AI agent ecosystem consolidated around MCP for tool connectivity, and A2A's development pace slowed. The Linux Foundation launched A2A as an open-source project under the Apache 2.0 license to encourage broader participation.

Agentic AI Foundation (AAIF)

The Agentic AI Foundation (AAIF) was announced on December 9, 2025, as a directed fund under the Linux Foundation.^[8] It was co-founded by Anthropic, Block, and OpenAI, with the goal of providing neutral, open governance for the standards and tools that power agentic AI.^[8]

The three founding projects are:

Model Context Protocol (MCP): The standard for connecting AI systems to tools and data, contributed by Anthropic.
Goose: An open-source, local-first AI agent framework by Block that combines language models with MCP-based tool integration.
AGENTS.md: A standard by OpenAI (released August 2025) that provides AI coding agents with project-specific guidance across different repositories and toolchains.

Platinum members of the AAIF include Amazon Web Services, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI.^[8] Together, MCP and A2A are forming the foundation of an interoperable agent ecosystem, sometimes compared to how HTTP and SMTP standardized web and email communication.

Commercial agent products

ChatGPT with tools, Operator, and ChatGPT Agent

OpenAI has progressively added agentic capabilities to ChatGPT. The product now includes browsing (powered by Bing), code execution via a Python sandbox, image generation with DALL-E, and file analysis. In late July 2025, OpenAI introduced a Tools dropdown providing access to six specialized modes: Agent mode, Deep research, Create image, Study and learn, Web search, and Canvas.

Deep Research, launched in February 2025, operates as an autonomous research agent that browses the web for 5 to 30 minutes, synthesizing findings into structured reports with citations.^[20] Operator, launched January 23, 2025, was powered by a Computer-Using Agent (CUA) model that combined GPT-4o vision with reinforcement learning on GUI control; at launch it scored 38.1% on OSWorld, 58.1% on WebArena, and 87% on WebVoyager.

On July 17, 2025, OpenAI launched ChatGPT Agent, a unified agentic system that consolidated Operator's web-interaction skills, Deep Research's synthesis ability, and ChatGPT's conversational fluency into a single end-to-end trained model in the same series as o3.[^chatgpt-agent] At launch ChatGPT Agent achieved 41.6% on Humanity's Last Exam (rising to 44.4% with parallel-attempt voting), 45.5% on SpreadsheetBench with direct edit access (versus 20.0% for Excel Copilot), and 89.9% on DSBench data-analysis tasks (versus a 64% human baseline).[^chatgpt-agent] OpenAI announced that the standalone Operator product at operator.chatgpt.com would be deprecated, with its functionality absorbed into ChatGPT Agent's built-in virtual browser.[^chatgpt-agent]

Claude with computer use

Anthropic introduced computer use capabilities for Claude in October 2024, allowing the model to interact with desktop environments through screenshot analysis and mouse/keyboard control.^[11] The feature enables Claude to operate software, navigate websites, fill out forms, and perform multi-step desktop tasks. Anthropic explicitly framed the launch as experimental, noting it was "at times cumbersome and error prone."^[11] Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company were the first listed adopters.^[11]

On the OSWorld benchmark for real-world computer tasks, Claude Sonnet 4.5 reached 61.4% in October 2025; its successor Claude Sonnet 4.6, released February 2026, took the lead at 72.5%, roughly matching the human baseline of about 72%.[^sonnet46-osworld] Claude Code, launched in May 2025, is Anthropic's terminal-based agentic coding tool built on the Claude Agent SDK. Anthropic reported in early 2026 that Claude Code's run-rate revenue had passed $2.5 billion and that its weekly active users had doubled since the start of the year.[^claude-code-growth] In January 2026, Anthropic released Claude Cowork, a graphical agent interface aimed at non-technical users.

Devin

Devin, created by Cognition Labs and announced in March 2024, was described as the "first AI software engineer." Devin operates as an autonomous coding agent with its own code editor, browser, and terminal, capable of planning and executing multi-step software engineering tasks. At launch it resolved 13.86% of issues on the full SWE-bench benchmark, far above the previous best unassisted result of 1.96%.

Devin's annual recurring revenue grew from approximately $1 million in September 2024 to roughly $73 million by June 2025.^[10] In July 2025, Cognition acquired Windsurf (formerly Codeium), an AI-powered IDE bringing about $82 million in ARR and 350-plus enterprise customers, raising combined ARR to roughly $150 million.[^cognition-windsurf] In September 2025, Cognition closed a $400 million round led by Founders Fund that valued the company at $10.2 billion.[^cognition-400m] Devin is used by engineering teams at thousands of companies, including Goldman Sachs, Santander, and Nubank.

Manus

Manus, launched by the Singapore-based startup originally founded as part of Chinese company Butterfly Effect (Monica.im) in March 2025, is a general-purpose agent built on a multi-agent architecture running in dedicated virtual machines. It scored 86.5%, 70.1%, and 57.7% on GAIA Levels 1, 2, and 3 respectively, exceeding OpenAI's Deep Research scores reported at the same time. Within a week of launch, more than two million people joined its waitlist. In late December 2025, Meta announced the acquisition of Manus in a deal valued at over $2 billion, with Meta stating there would be no continuing Chinese ownership interests.^[23] In April 2026, China's National Development and Reform Commission moved to block the acquisition.

Google Project Mariner

Project Mariner is a research prototype by Google DeepMind that explores autonomous web browsing.^[12] Powered by Gemini 2.0, it operates as a Chrome extension that can understand screen content (images, code, forms), plan multi-step tasks, and navigate websites autonomously.^[12]

At Google I/O 2025, Google expanded access to Project Mariner and announced it could handle up to 10 simultaneous tasks. It achieves an 83.5% success rate on the WebVoyager benchmark.^[12] Access is available to subscribers of Google's $249.99/month AI Ultra plan, and Google is bringing Mariner's capabilities to the Gemini API and Vertex AI for developers.

Agentic coding

One of the most visible applications of AI agents is in software development, where agentic coding tools assist or replace human programmers in writing, editing, debugging, and deploying code.

Major agentic coding tools

Tool	Developer	Type	Key Features
Claude Code	Anthropic	Terminal agent	Agentic coding in terminal, 46% "most loved" rating, works on complex multi-file tasks
Cursor	Anysphere	AI-native IDE	VS Code fork, 1M+ users, 360K+ paying customers, background agents, parallel sub-agents
Windsurf	Cognition (formerly Codeium)	AI IDE	Cascade agent tracks edits/commands/clipboard, #1 in LogRocket rankings (Feb 2026)
GitHub Copilot	GitHub / Microsoft	IDE extension + agent	Agent mode for autonomous issue resolution, self-review, security scanning
Devin	Cognition Labs	Autonomous agent	Full development environment (editor, browser, terminal), autonomous task completion
Google Antigravity	Google	AI IDE	Multi-agent orchestration from day one, launched 2025
OpenAI Codex	OpenAI	Cloud agent	Standalone cloud agent with desktop app; surpassed 2 million weekly active users by early 2026
Replit Agent	Replit	In-browser agent	Full-stack app generation from natural language prompts

Vibe coding

The term "vibe coding" was coined by Andrej Karpathy, co-founder of OpenAI and former AI leader at Tesla, in a post on X (formerly Twitter) in February 2025.^[17] Karpathy described it as a coding approach where you "fully give in to the vibes, embrace exponentials, and forget that the code even exists," relying entirely on LLMs to generate working code from natural language descriptions.^[17]

The term went viral and was named the Collins English Dictionary Word of the Year for 2025. Merriam-Webster listed it as a "slang and trending" expression in March 2025. Advocates argue that vibe coding democratizes software creation, allowing non-programmers to build functional applications. Critics point to serious concerns about code quality, maintainability, and security. A December 2025 analysis by CodeRabbit of 470 open-source GitHub pull requests found that AI co-authored code contained approximately 1.7 times more "major" issues than human-written code, with security vulnerabilities 2.74 times more common.^[21]

By early 2026, Karpathy himself described vibe coding as "passe," proposing "agentic engineering" as the next evolution, where developers work alongside AI agents in a more structured, engineering-driven manner rather than simply accepting whatever the LLM produces.^[17]

Multi-agent systems

Multi-agent systems use multiple specialized AI agents that collaborate, compete, or coordinate to accomplish complex tasks. The approach mirrors how human organizations work: rather than one generalist handling everything, specialists focus on what they do best while a coordination layer manages the overall workflow.

Why use multi-agent architectures?

Single agents face limitations as tasks grow more complex. A single agent trying to handle research, analysis, code generation, testing, and deployment may lose context, make errors, or exhaust its context window. Multi-agent systems address this by:

Allowing each agent to specialize in one domain or skill
Enabling parallel execution of independent subtasks
Providing natural checkpoints where one agent's output is reviewed by another
Reducing the chance of cascading errors through isolation

Common patterns in practice

In practice, multi-agent systems often combine several orchestration patterns. A typical enterprise deployment might use a hierarchical structure where a planning agent decomposes a user request, delegates subtasks to specialized worker agents (a database agent, an email agent, a document agent), and aggregates their results. A reflection agent may review the final output before returning it to the user.

CrewAI popularized the role-based metaphor, where developers define agents as team members with specific roles, goals, and backstories. Microsoft's AutoGen pioneered conversation-based coordination, where agents interact through structured dialogue patterns. LangGraph provides the most flexible approach, representing workflows as arbitrary directed graphs.

How widely are AI agents adopted in enterprises?

According to a G2 survey from August 2025, 57% of companies have AI agents in production, 22% are in pilot, and 21% are in pre-pilot.^[18] However, only 16% of enterprise deployments qualify as true agents where an LLM plans, executes, observes feedback, and adapts.^[18] Most production deployments still use fixed-sequence or routing-based workflows.

The highest-ROI enterprise agent deployments in 2025 were in document processing, data reconciliation, compliance checking, and invoice handling. By industry, financial services lead with customer support (23%) and software development (18%), while retail leads with customer support (27%).^[18]

Adoption is not without setbacks. Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls, based on a poll of more than 3,400 organizations.^[29] Gartner also warns of "agent washing," the rebranding of existing chatbots, robotic process automation, and AI assistants as agentic AI without substantial agentic capabilities, and estimates that of the thousands of vendors claiming agentic AI, only about 130 are real.^[29] Anushree Verma, a senior director analyst at Gartner, said that "most agentic AI projects right now are early-stage experiments or proof of concepts that are mostly driven by hype and are often misapplied."^[29]

A persistent challenge is that agents often act on incomplete context. Enterprise data is spread across structured databases, emails, contracts, policy documents, and meeting notes. Agents typically see only the structured 10 to 20% of this data while remaining blind to the 70 to 85% that lives in unstructured formats.

How are AI agents evaluated? Benchmarks

Evaluating AI agents is significantly more complex than evaluating standard language models, because agents must be assessed not just on the quality of their outputs but on their ability to complete multi-step tasks in real environments.

Major agent benchmarks

Benchmark	Focus	Description	Top Score (early 2026)
SWE-bench Verified	Software engineering	500 human-validated GitHub issues from popular open-source Python repos; agents must produce code patches that pass unit tests	~87.6% (Claude Opus 4.7, April 2026)[^swe-leader]
SWE-bench Pro	Software engineering	1,865 problems across 41 actively maintained repositories in Python, Go, TypeScript, and JavaScript; introduced by Scale AI in September 2025	~57% (GPT-5.3 Codex)[^swe-pro]
WebArena	Web navigation	Self-hosted web environment with interactive replicas of e-commerce, social media, coding, and CMS platforms	~60% (top agents, up from 14% in 2024)
OSWorld	Desktop computer use	Real-world desktop tasks across operating systems and applications	72.5% (Claude Sonnet 4.6, Feb 2026)[^sonnet46-osworld]
Terminal-Bench	Command-line operations	Sandboxed CLI environment testing multi-step terminal workflows	Launched May 2025
DPAI Arena	Full developer lifecycle	Multi-workflow evaluation: patching, test generation, PR review, static analysis, repo navigation	Launched October 2025 by JetBrains
HumanEval	Code generation	164 hand-written Python programming problems	>95% (multiple models)
GAIA	General AI assistants	466 real-world questions across three difficulty levels requiring reasoning, multimodality, and tool use	Writer's Action Agent at 61% on Level 3 (mid-2025)^[26]
tau-bench	Customer support	Tool-Agent-User interaction in retail and airline domains (extended to telecom in τ²-bench); introduced by Yao et al. at Sierra Research in June 2024[^tau-bench]	Task completion with consistency
AgentBench	General agent ability	8 environments (OS, database, knowledge graphs, household, gaming) covering reasoning and decision-making; introduced by Liu et al. (Tsinghua/UC Berkeley/OSU) at ICLR 2024[^agentbench]	Composite score across environments
ToolBench	Tool use	API call chains across real-world APIs	Pass rate

SWE-bench in detail

SWE-bench has become the most closely watched agent benchmark in the industry. The original benchmark, introduced by Jimenez et al. at Princeton in 2023, contains over 2,200 real GitHub issues.[^swebench-paper] SWE-bench Verified is a human-validated subset of 500 problems created in collaboration with OpenAI.^[22] Claude Opus 4.5 became the first model to break 80% on Verified at its November 24, 2025 launch (scoring 80.9%); by April 2026 Claude Opus 4.7 had pushed the leader score to 87.6%.[^swe-leader]

SWE-bench Pro, released by Scale AI in September 2025 as a more rigorous evaluation, reveals significant gaps in agent capabilities. The same models that score 70 to 80% on Verified score only 23 to 57% on Pro, reflecting the greater difficulty of multi-language, multi-repo problems and reduced risk of benchmark contamination.[^swe-pro] OpenAI stopped reporting Verified scores in early 2026 and now points developers toward SWE-bench Pro instead.[^swe-leader]

The SWE-bench Verified scaffold was significantly upgraded in February 2026 to improve the reliability and fairness of evaluations.

WebArena and browser agents

WebArena provides a fully self-hosted web environment for testing autonomous web navigation agents. It includes replicas of popular website types (e-commerce, forums, coding platforms, content management systems) where agents must complete realistic tasks like placing orders, finding information, or managing content.

In two years, AI agents leaped from a 14% success rate on WebArena to approximately 60%, demonstrating rapid progress in browser-based agent capabilities. Google's Project Mariner achieves 83.5% on the related WebVoyager benchmark.^[12]

Limitations of current benchmarks

A significant gap exists between benchmark performance and real-world deployment success. Existing benchmarks tend to optimize for task completion accuracy, while production environments require evaluation across cost efficiency, latency, reliability, security, and the ability to handle ambiguous or adversarial inputs. Researchers have called for multi-dimensional evaluation frameworks that better capture the realities of enterprise deployment.

Are AI agents safe? Safety and alignment

AI agents introduce safety challenges that go beyond those of standard language models. Because agents take actions in the real world, errors can have concrete consequences: deleting files, sending incorrect emails, making unauthorized purchases, or exposing sensitive data.

Key safety challenges

Goal misspecification: An agent may pursue a literal interpretation of its goal in ways that produce unintended side effects. A scheduling agent told to "clear my calendar" might cancel important meetings rather than rescheduling them.

Reward hacking: Agents optimizing for measurable outcomes may find shortcuts that technically satisfy their objective but violate the user's intent. A coding agent measured on passing tests might write tests that always pass rather than fixing the underlying bug.

Cascading failures: In multi-agent systems, an error in one agent's output can propagate through the system. If a research agent provides incorrect information, a writing agent may produce a convincing but factually wrong report, and a publishing agent may distribute it widely. Research has found that a single compromised agent can poison downstream decision-making in the majority of connected agents within hours.

Prompt injection: Agents that browse the web or process user-submitted documents are vulnerable to prompt injection attacks, where malicious instructions embedded in external content cause the agent to deviate from its intended behavior. OWASP's 2025 Top 10 for LLM Applications ranked prompt injection (LLM01) as the number one security risk for the second consecutive edition.^[31]

Excessive autonomy and privilege escalation: Agents given broad permissions may take actions that are technically within their capabilities but beyond what the user intended. Research has found that 80% of organizations have encountered risky agent behaviors, including unauthorized system access and improper data exposure.

Memory poisoning and supply chain attacks: Adversaries can corrupt an agent's long-term memory or inject malicious tools into an agent's supply chain. While less frequent than direct misuse incidents, these attacks carry disproportionate severity because they can persist across sessions and affect all future agent behavior.

Goal misalignment: Stress tests of frontier models have repeatedly shown agents choosing deceptive or extreme actions when goals are poorly specified, including blackmail and corporate espionage in simulated red-team scenarios.

Opacity: As agents chain multiple reasoning steps, tool calls, and sub-agent interactions, it becomes difficult for users and developers to understand why an agent took a particular action, complicating debugging and accountability.

Industry safety measures

Leading AI companies have adopted several approaches to agent safety:

Human-in-the-loop: Requiring explicit user approval before high-stakes actions (purchases, deletions, external communications).
Sandboxing: Running agents in isolated environments where they cannot cause irreversible damage.
Bounded autonomy: Defining clear operational limits, escalation paths to humans, and comprehensive audit trails.
Guardrails and filters: Using separate models or rule-based systems to check agent actions before execution.
Red-teaming: Testing agents against adversarial scenarios to identify failure modes before deployment.
Rate limiting and budget constraints: Capping the number of actions, API calls, or computational resources an agent can consume in a single session.

The 2025 AI Agent Index, published by MATS Research, documented the technical and safety features of deployed agentic AI systems, finding significant variation in safety practices across vendors.^[15] Only 3 of 7 leading AI firms (Anthropic, OpenAI, and Google DeepMind) reported substantive testing for dangerous capabilities linked to large-scale risks.^[14]

In February 2026, the International AI Safety Report noted that capabilities are accelerating faster than risk management practices, and the gap between leading and lagging firms is widening.^[16] Twelve companies published or updated Frontier AI Safety Frameworks in 2025, describing how they plan to manage risks as they build more capable models.^[16] Gartner has predicted that by 2026, more than 50% of AI agent failures will stem from inadequate governance and security controls rather than core model errors. Non-human and agentic identities are expected to exceed 45 billion by the end of 2026, more than twelve times the global human workforce.

Types of AI agents

Researchers and practitioners classify AI agents in several ways depending on their capabilities, autonomy level, and underlying architecture.

By autonomy level

Level	Description	Example
Assistive	Suggests actions but requires human approval for each step	GitHub Copilot inline suggestions
Semi-autonomous	Executes routine steps independently but asks for confirmation on critical decisions	Claude Code (asks permission before file writes)
Fully autonomous	Completes entire workflows without human intervention	Devin working on assigned GitHub issues

By architecture

Type	Description
Single-model agent	One LLM handles all reasoning and action selection
Router agent	A classifier or LLM routes requests to specialized sub-agents
Hierarchical agent	A planner agent decomposes tasks and delegates to worker agents
Conversational multi-agent	Multiple agents interact through structured dialogue
Mixture-of-agents	Multiple LLMs contribute answers that are synthesized by an aggregator

By domain

Agents are increasingly specialized by domain:

Coding agents: Write, debug, test, and deploy code (Claude Code, Cursor, Devin)
Research agents: Gather and synthesize information from multiple sources (ChatGPT Deep Research, Perplexity)
Browser agents: Navigate websites and complete web-based tasks (Project Mariner, Operator, Browser Use)
Data analysis agents: Query databases, generate visualizations, and produce reports
Customer support agents: Handle support tickets, answer questions, and escalate complex issues
DevOps agents: Monitor systems, diagnose incidents, and execute remediation steps
Scientific discovery agents: Automate literature reviews, generate hypotheses, design experiments, and analyze data

Browser Use, an open-source library for AI browser automation, grew to over 79,000 GitHub stars by 2025, reflecting strong demand for agents that can automate web-based workflows. In drug discovery, agents search chemical databases, predict molecular properties, and suggest compound modifications.

Technical components

The LLM core

At the center of every modern AI agent is a large language model that serves as the reasoning engine. The LLM interprets instructions, generates plans, decides which tools to use, and produces outputs. The quality and capabilities of the underlying LLM directly determine the agent's effectiveness. Reasoning models such as OpenAI's o-series and DeepSeek-R1 that employ chain-of-thought and extended thinking have shown particular strength in agentic tasks because they can break down complex problems before acting.

As of early 2026, the most commonly used LLMs for agent applications include the Claude family (Anthropic), GPT-4o and GPT-5 (OpenAI), Gemini 2.0 and 3.0 (Google), and open-weight models like Llama 3 (Meta) and Mistral Large.

Memory

Agents require memory to maintain context across interactions and learn from past experiences. Memory systems typically include:

Working memory (short-term): The current conversation context and recent tool outputs, usually held in the LLM's context window. Context windows range from roughly 8,000 tokens to over 1 million tokens (e.g., Gemini 1.5 Pro).
Episodic memory: Records of past interactions ("the user asked about Python debugging yesterday"), stored in a database or file system, that the agent can retrieve when relevant.
Semantic memory: General knowledge stored in vector databases such as Pinecone, Weaviate, or Chroma, accessed through retrieval-augmented generation (RAG).
Procedural memory: Learned procedures and workflows ("when deploying code, always run tests first"), often encoded as prompts, templates, or fine-tuned model weights.

Research systems like Mem0 (2025) and A-Mem (2025) have introduced more dynamic memory architectures that consolidate, organize, and retrieve memories at runtime, drawing inspiration from how human memory works.

Tool integration

Tools are what allow agents to move beyond generating text and take meaningful action. A modern agent may have access to dozens or hundreds of tools, each providing a specific capability. MCP has emerged as the standard protocol for tool integration, allowing agents to discover and use tools through a unified interface.

Common tool categories include:

Information retrieval: Web search, database queries, document search
Code execution: Python interpreters, shell commands, sandboxed environments
Communication: Email, messaging, calendar management
File operations: Reading, writing, and transforming documents
Browser automation: Navigating websites, filling forms, extracting content
API interactions: Calling external services and processing responses

Planning and reasoning

Advanced agents use explicit planning mechanisms to decompose complex goals into manageable steps. Planning approaches include:

Zero-shot planning: The LLM generates a plan directly from the goal description.
Chain-of-thought planning: The LLM reasons through the problem step by step before committing to a plan.
Tree-of-thought planning: The LLM explores multiple possible plans and selects the best one.
Iterative planning: The agent plans one step at a time, adjusting based on the results of each action.

Future directions

Several trends are likely to shape the development of AI agents in 2026 and beyond.

Specialization over generalization: Rather than building all-purpose agents, the industry is moving toward ecosystems of specialized agents that collaborate through standard protocols. Each agent excels in a narrow domain, and orchestration layers coordinate their work.

Agent-native interfaces: Traditional GUIs and chat interfaces are being supplemented by interfaces designed specifically for agent interaction, including tools like AGENTS.md files for code repositories, MCP servers for tool access, and A2A endpoints for inter-agent communication.

Specialized agent marketplaces: Major cloud providers including Oracle, AWS, Microsoft, Google, and Salesforce are building agent marketplaces where organizations can discover, deploy, and compose agents from multiple vendors.

Longer autonomy horizons: Early agents operated for seconds or minutes. Current systems like Devin and Claude Code can work autonomously for hours on complex tasks. The trend is toward agents that can manage multi-day or multi-week projects with periodic human check-ins.

Improved reliability: Agent failure rates remain a significant barrier to adoption. Research into better planning, self-verification, and error recovery mechanisms is a high priority across the industry.

Regulation and governance: As agents take more consequential actions, regulatory frameworks are emerging. The EU AI Act includes provisions relevant to autonomous AI systems, and industry groups like the AAIF are establishing standards for safe and interoperable agent deployment. The International AI Safety Report 2026 has called for specific governance standards for autonomous AI systems.^[16]

Edge and local agents: While most current agents rely on cloud-hosted LLMs, there is growing interest in agents that run locally on devices, offering better privacy and lower latency. Projects like Block's Goose demonstrate the viability of local-first agent architectures.

References

Russell, S., & Norvig, P. (2021). *Artificial Intelligence: A Modern Approach* (4th ed.). Pearson. Chapter 2: Intelligent Agents. ↩
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629. https://arxiv.org/abs/2210.03629 ↩
Schick, T. et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761. Presented at NeurIPS 2023. ↩
Shinn, N. et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." arXiv:2303.11366. ↩
Wooldridge, M. & Jennings, N. (1995). "Intelligent Agents: Theory and Practice." *The Knowledge Engineering Review*, 10(2), 115-152. ↩
Anthropic. (2024). "Introducing the Model Context Protocol." anthropic.com. ↩
Google Developers Blog. (2025). "Announcing the Agent2Agent Protocol (A2A)." developers.googleblog.com. ↩
Linux Foundation. (2025). "Linux Foundation Announces the Formation of the Agentic AI Foundation (AAIF)." linuxfoundation.org. ↩
OpenAI. (2025). "OpenAI for Developers in 2025." developers.openai.com. ↩
Cognition Labs. (2025). "Devin's 2025 Performance Review." cognition.ai. ↩
Anthropic. (2024). "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." anthropic.com. ↩
Google DeepMind. (2025). "Project Mariner." deepmind.google. ↩
Gartner. (August 26, 2025). "Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up from Less Than 5% in 2025." gartner.com. https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026 ↩
Future of Life Institute. (2025). "2025 AI Safety Index." futureoflife.org. ↩
MATS Research. (2025). "The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems." matsprogram.org. ↩
International AI Safety Report. (2026). "International AI Safety Report 2026." internationalaisafetyreport.org. ↩
Karpathy, A. (2025). "Vibe coding" post on X (formerly Twitter), February 2025. ↩
G2. (2025). "Enterprise AI Agents Report: Industry Outlook for 2026." learn.g2.com. ↩
Deloitte. (2026). "The State of AI in the Enterprise." deloitte.com.
OpenAI. (2025). "Introducing deep research." openai.com. ↩
CodeRabbit. (2025). "A semantic history of vibe coding." coderabbit.ai. ↩
Epoch AI. (2026). "SWE-bench Verified." epoch.ai. ↩
CNBC. (2025). "Meta acquires intelligent agent firm Manus, capping year of aggressive AI moves." December 30, 2025. ↩
Huyen, C. (2025). "Agents." huyenchip.com.
Jimenez, C. E. et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770.
Mialon, G. et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983. ↩
Grand View Research. (2025). "AI Agents Market Size, Share & Trends Analysis Report." The market was estimated at USD 5.40 billion in 2024 and is projected to reach USD 50.31 billion by 2030 at a 45.8% CAGR. grandviewresearch.com; PR Newswire, "AI Agents Market Size to Hit $50.31 Billion by 2030 at CAGR 45.8%." https://www.grandviewresearch.com/industry-analysis/ai-agents-market-report ↩
Ng, A. (2024). "Agentic Design Patterns" letter, *The Batch*, DeepLearning.AI. "Rather than having to choose whether or not something is an agent in a binary way, I thought, it would be more useful to think of systems as being agent-like to different degrees." https://www.deeplearning.ai/the-batch/ ↩
Gartner. (June 25, 2025). "Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027." Based on a poll of more than 3,400 organizations; introduces the term "agent washing" and estimates only about 130 of thousands of agentic AI vendors are real; quotes Anushree Verma, Senior Director Analyst. gartner.com. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027 ↩
Gartner. (2025). "Gartner predictions for agentic AI in enterprise software." 33% of enterprise software applications will include agentic AI by 2028 (up from less than 1% in 2024); at least 15% of day-to-day work decisions will be made autonomously through agentic AI by 2028 (up from 0% in 2024). gartner.com. ↩
OWASP. (2025). "OWASP Top 10 for LLM Applications 2025." Prompt injection (LLM01) ranked the number one risk for the second consecutive edition. owasp.org. https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-v2025.pdf ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

15 revisions by 1 contributors · full history

Suggest edit