See also: ai_agent, Machine learning terms
In artificial intelligence (AI), an agent is an entity that perceives its environment through sensors and acts upon that environment through actuators in pursuit of objectives. The concept of an agent is one of the most foundational ideas in AI, spanning classical AI planning, reinforcement learning, robotics, and the modern wave of large language model powered autonomous systems. Stuart Russell and Peter Norvig define an agent simply as "anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators," making it the central abstraction around which AI research is organized.[1]
Since 2024, the term "agent" has taken on renewed significance in the AI industry. While the classical definition remains intact, a new generation of LLM-based AI agents has emerged that can use software tools, browse the web, write and execute code, and carry out multi-step tasks with minimal human supervision. This development has made "agentic AI" one of the defining trends of 2025 and 2026.[2] The companion article ai_agent covers the modern LLM-driven incarnation in greater depth; this article focuses on the broader concept that unifies classical AI, reinforcement learning, and modern systems under one definition.
Imagine you have a robot friend who can look around a room, think about what to do, and then do it. If the robot sees that the floor is dirty, it decides to vacuum. If it bumps into a chair, it turns and goes another way. That robot is an "agent" because it can sense things (see the dirty floor), think about them (decide to vacuum), and take action (start cleaning). AI agents work the same way, but they live inside computers. Some play video games, some answer questions, and some help drive cars. The smartest ones can even learn from their mistakes and get better over time, just like you get better at riding a bike the more you practice.
Formally, an agent is defined by an agent function that maps every possible percept sequence to an action:
f : P* -> A
where P* is the set of all possible percept sequences and A is the set of actions available to the agent. The agent function is an abstract mathematical description; the agent program is the concrete implementation that runs on a physical or virtual system (the agent architecture). A rational agent is one that selects actions expected to maximize its performance measure, given what it has perceived so far and any built-in knowledge it possesses.[1]
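The distinction between agent function and agent program can be phrased as a rough sketch in Python. Everything below (the percept and action types, the lookup table) is invented for illustration, not part of any standard API:

```python
from typing import Callable, Sequence

Percept = str   # placeholder percept type for illustration
Action = str    # placeholder action type

# The agent function: a percept sequence in, an action out.
AgentFunction = Callable[[Sequence[Percept]], Action]

def table_driven_agent(percepts: Sequence[Percept]) -> Action:
    """A hypothetical agent program that implements the agent function
    as an explicit lookup table. This is tractable only for tiny
    percept spaces, which is why real agent programs compute the
    mapping instead of storing it."""
    table = {
        ("floor dirty",): "vacuum",
        ("floor dirty", "bumped chair"): "turn",
    }
    return table.get(tuple(percepts), "do nothing")
```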
Four ingredients fully specify an agent in this view: the percepts it can receive, the actions it can take, the goals or performance measure it tries to optimize, and the environment in which it operates. Russell and Norvig package these into the PEAS framework (Performance measure, Environment, Actuators, Sensors), which they use to specify the task environment of any agent before discussing its design.[1]
| PEAS element | Question it answers | Self-driving taxi example |
|---|---|---|
| Performance measure | What counts as success? | Safe, fast, legal, comfortable trips; profits |
| Environment | What does the agent live in? | Roads, traffic, pedestrians, customers, weather |
| Actuators | What can the agent change? | Steering, accelerator, brake, signal, horn, display |
| Sensors | What can the agent perceive? | Cameras, lidar, GPS, speedometer, accelerometer, microphone |
A second taxonomy classifies the task environment itself along several axes that influence how hard the design problem is.[1] An environment is fully observable if sensors give access to the complete state, partially observable otherwise; deterministic if the next state is fully determined by the current state and action, stochastic otherwise; episodic if each action stands alone, sequential if actions have lasting consequences; static or dynamic; discrete or continuous; and single-agent or multi-agent. A chess agent operates in a fully observable, deterministic, sequential, static, discrete, multi-agent environment. A self-driving car lives in the hardest possible setting along almost every axis.
The interaction between an agent and its environment follows a cyclical pattern that is especially well formalized in reinforcement learning. At each discrete time step t, the agent:

1. Observes the current state s_t (or a partial observation of it);
2. Selects an action a_t according to its policy;
3. Receives a scalar reward r_{t+1} and transitions to the next state s_{t+1}.
This cycle repeats until a terminal condition is met or the process continues indefinitely. The agent's objective is to learn a policy that maximizes the expected cumulative reward over time. When the environment satisfies the Markov property (future states depend only on the current state and action, not on history), this framework is called a Markov Decision Process (MDP).[3]
| Component | Symbol | Description |
|---|---|---|
| State | s_t | Representation of the environment at time t |
| Action | a_t | Choice made by the agent |
| Reward | r_{t+1} | Scalar feedback signal from the environment |
| Policy | pi(s) | Mapping from states to actions |
| Value function | V(s) | Expected cumulative reward from state s |
| Action-value function | Q(s, a) | Expected cumulative reward from taking action a in state s |
| Transition function | T(s, a, s') | Probability of moving to state s' after taking action a in state s |
| Discount factor | gamma | Weight on future rewards, typically 0.9 to 0.99 |
The expected return from a state under policy pi is V_pi(s) = E[ sum_t gamma^t r_t | s_0 = s, pi ]. Optimal control reduces to finding the policy that maximizes this quantity. The same loop describes a thermostat regulating a room, a chess engine choosing moves, a robot arm placing a chip on a circuit board, and an LLM agent calling tools in a browser. The differences are in the action space, the observation space, and how the policy is computed, not in the abstraction.
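A minimal sketch of this loop in Python makes the abstraction concrete. The toy environment and random policy below are invented stand-ins; the loop structure and the discounted-return computation are the point:

```python
import random

GAMMA = 0.95  # discount factor (illustrative value)

class ToyEnv:
    """A hypothetical 5-state corridor: move right to reach the goal."""
    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        # action: 0 = left, 1 = right
        self.state = max(0, self.state + (1 if action == 1 else -1))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state: int) -> int:
    return random.choice([0, 1])

env, rewards = ToyEnv(), []
s, done = env.reset(), False
while not done:
    a = random_policy(s)          # policy pi(s)
    s, r, done = env.step(a)      # environment transition
    rewards.append(r)

# Discounted return from the start state: sum_t gamma^t r_t
G = sum(GAMMA**t * r for t, r in enumerate(rewards))
print(f"episode length={len(rewards)}, return={G:.3f}")
```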
Russell and Norvig's textbook Artificial Intelligence: A Modern Approach, now in its fourth edition (Pearson, 2021), classifies agents into five types based on their internal structure and level of sophistication. Each successive type builds on the capabilities of the previous one. The framing has structured AI courses for nearly thirty years and is still the canonical introduction to the agent concept.[1]
Simple reflex agents select actions based solely on the current percept, ignoring the entire percept history. They operate using condition-action rules ("if the car ahead is braking, then apply brakes"). These agents work well in fully observable environments but fail when the environment is partially observable because they have no memory of past events. A household thermostat is a classic example: it turns on heating when the temperature drops below a threshold and turns it off when the threshold is exceeded. A spam filter that only inspects the current email and applies fixed rules is another. Simple reflex agents fail catastrophically in any setting where the right action depends on context the current percept does not reveal.
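A thermostat-style simple reflex agent fits in a few lines; the thresholds below are illustrative:

```python
def thermostat_agent(percept: float) -> str:
    """Condition-action rules over the *current* percept only.
    A real thermostat would add hysteresis; this is a sketch."""
    if percept < 19.0:
        return "heat on"
    if percept > 21.0:
        return "heat off"
    return "no-op"

print(thermostat_agent(17.5))  # -> "heat on"
```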
Model-based reflex agents maintain an internal model of the world that tracks aspects of the environment that are not directly visible. This internal state is updated after each action and percept using two kinds of knowledge: how the world evolves independently of the agent, and how the agent's own actions affect the world. By maintaining this model, the agent can handle partially observable environments far more effectively than a simple reflex agent. A self-driving car that remembers a pedestrian who briefly stepped behind a parked truck is a model-based reflex agent. The internal model can be as simple as a flag ("the lights are on") or as elaborate as a 3D occupancy grid of the surrounding street.
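A minimal sketch of the state-update step, with invented keys, might look like this:

```python
def update_state(state: dict, last_action: str, percept: dict) -> dict:
    """Fold the latest percept and the agent's own last action into
    the internal model. All keys here are invented for illustration."""
    new = dict(state)
    # How the agent's own actions affect the world:
    if last_action == "toggle switch":
        new["lights_on"] = not new.get("lights_on", False)
    # How the world evolves, including things no longer visible:
    if percept.get("pedestrian_visible"):
        new["pedestrian_nearby"] = True  # persists even if occluded later
    return new
```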
Goal-based agents extend model-based agents by incorporating explicit goal information that describes desirable states. Rather than just reacting, these agents use search and planning algorithms to identify sequences of actions that will achieve their goals. This makes them more flexible: when the environment or goals change, the agent can recompute its plan rather than requiring a complete rewrite of its condition-action rules. A robot vacuum that plans an efficient path through a room is a goal-based agent. So is a route planner that searches a graph of intersections to compute the shortest path to a destination, and a STRIPS planner that orders preconditions and effects to assemble a plan.
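The route-planner example reduces to graph search. A breadth-first sketch over a toy intersection graph (the graph and node names are invented):

```python
from collections import deque

def shortest_route(graph: dict, start: str, goal: str) -> list:
    """Breadth-first search over an intersection graph -- the kind of
    planning a goal-based route planner performs."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return []

city = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_route(city, "A", "E"))  # ['A', 'B', 'D', 'E']
```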
Utility-based agents go further by employing a utility function that maps each state (or sequence of states) to a real number representing how desirable that state is. While goal-based agents have a binary notion of success and failure, utility-based agents can compare multiple outcomes on a continuous scale. This is especially important when there are conflicting goals ("arrive on time" vs. "avoid bumpy roads"), when goals can be achieved to different degrees, or when there is uncertainty about outcomes. A rational utility-based agent selects the action that maximizes expected utility, weighing probabilities and desirability of potential outcomes. A financial trading agent that balances expected return against variance is a utility-based agent.
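Expected-utility action selection is a one-line argmax once outcomes and utilities are specified. The probabilities and utility values below are invented for illustration:

```python
def best_action(actions, outcomes, utility):
    """Maximize expected utility: argmax_a sum_s P(s|a) * U(s).
    `outcomes` maps each action to (probability, state) pairs."""
    def eu(a):
        return sum(p * utility(s) for p, s in outcomes[a])
    return max(actions, key=eu)

# A fast-but-bumpy route vs. a slow-but-smooth one.
outcomes = {
    "highway":  [(0.7, "on time, bumpy"), (0.3, "late, bumpy")],
    "backroad": [(0.9, "late, smooth"), (0.1, "on time, smooth")],
}
U = {"on time, bumpy": 8, "late, bumpy": 2,
     "late, smooth": 5, "on time, smooth": 10}
print(best_action(outcomes.keys(), outcomes, U.get))  # -> "highway"
```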
Learning agents can improve their performance over time through experience. They consist of four conceptual components: a learning element that makes improvements based on feedback, a performance element that selects actions, a critic that evaluates how well the agent is doing relative to a fixed performance standard, and a problem generator that suggests exploratory actions to discover new experiences. Nearly all sophisticated AI systems today are learning agents in some form. AlphaGo learned its policy from millions of self-play games; an LLM-based coding agent learns implicitly from the gradient updates that produced its base model and then explicitly from examples in its prompt.
| Agent type | Internal state | Planning | Learning | Example |
|---|---|---|---|---|
| Simple reflex | None | No | No | Thermostat |
| Model-based reflex | World model | No | No | Spam filter with context state |
| Goal-based | World model + goals | Yes | No | Route planner |
| Utility-based | World model + utility function | Yes | No | Financial trading agent |
| Learning | All of the above + learning element | Yes | Yes | Self-driving car, AlphaGo |
This taxonomy is conceptual rather than architectural. A modern coding agent like Devin blends elements of all five: it reacts to immediate test failures, maintains a model of the codebase, decomposes goals into subtasks, weighs alternative implementations against a quality utility, and updates its plan as new information arrives.
In reinforcement learning (RL), the agent is the central learning entity. Unlike supervised learning where correct answers are provided, an RL agent must discover which actions yield the highest reward through trial and error. The agent interacts with its environment over many episodes, gradually improving its policy.
RL agents can be broadly categorized as model-based or model-free. Model-based agents build an internal model of the environment's transition dynamics and use it for planning. Model-free agents, such as those using Q-learning or policy gradient methods, learn directly from experience without constructing an explicit environment model. Model-free approaches are often simpler to implement but may require more training data, while model-based methods can be more sample-efficient but rely on the accuracy of their learned model.[3]
Key RL algorithms for training agents include:

- Q-learning and its deep variant DQN, which learn action-value functions (a minimal tabular update is sketched below);
- policy gradient methods such as REINFORCE, which optimize the policy directly;
- actor-critic methods, which combine a learned value function with a policy;
- Proximal Policy Optimization (PPO), the workhorse of large-scale systems such as OpenAI Five.
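For concreteness, the tabular Q-learning update referenced above can be sketched as follows; the hyperparameter values and two-action space are illustrative:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # illustrative hyperparameters
Q = defaultdict(float)                    # Q[(state, action)] -> value
ACTIONS = [0, 1]

def choose_action(state):
    """Epsilon-greedy: explore with probability EPSILON, else exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```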
DQN's success in 2013 marked the start of the deep RL era and made the agent abstraction concrete in a way classical control theory had not. The same conceptual loop, now powered by neural function approximators, would later produce AlphaGo, AlphaStar, OpenAI Five, and many of the simulation results that influenced modern LLM training.
Game playing has been one of the most visible domains for AI agents, producing landmark achievements that demonstrated the power of different agent architectures.
Deep Blue (IBM, 1997) defeated chess world champion Garry Kasparov using brute-force search with hand-crafted evaluation functions. While not a learning agent, Deep Blue showcased the power of combining search algorithms with domain expertise and dedicated hardware.[5]
AlphaGo (DeepMind, 2015 to 2017) became the first computer program to defeat a professional human Go player without handicap. AlphaGo combined deep neural networks with Monte Carlo tree search and was trained through a combination of supervised learning on human games and reinforcement learning through self-play. Its successor, AlphaZero, learned to play Go, chess, and shogi entirely through self-play with no human game data, achieving superhuman performance in all three games within twenty-four hours of training.[6]
OpenAI Five (2019) defeated the world champions of Dota 2, a complex five-on-five multiplayer video game. The system used a team of five neural network agents trained with PPO over the equivalent of 45,000 years of gameplay. OpenAI Five won 99.4% of its public games, demonstrating that RL agents could master highly complex, partially observable, multi-agent environments.[7]
AlphaStar (DeepMind, 2019) reached Grandmaster level in StarCraft II, a real-time strategy game requiring long-horizon planning, imperfect information handling, and real-time decision making.
Voyager (Wang et al., 2023) was the first LLM-powered embodied lifelong learning agent. Built on top of GPT-4, Voyager played Minecraft autonomously and combined three components: an automatic curriculum that proposed exploration tasks, an ever-growing skill library of executable code, and an iterative prompting loop that incorporated environment feedback and self-verification. It obtained 3.3 times more unique items, traveled 2.3 times longer distances, and unlocked tech tree milestones up to 15.3 times faster than prior state of the art, all without any model fine-tuning.[8] Voyager helped popularize the idea that an LLM could serve as the policy of an open-ended agent in a simulated world.
| Agent | Game | Year | Key technique | Achievement |
|---|---|---|---|---|
| Deep Blue | Chess | 1997 | Search + evaluation | Beat world champion Kasparov |
| AlphaGo | Go | 2016 | Neural nets + MCTS + RL | Beat 9 dan professional Lee Sedol |
| AlphaZero | Go, chess, shogi | 2017 | Pure self-play RL | Superhuman in all three games |
| OpenAI Five | Dota 2 | 2019 | Multi-agent PPO | Beat world champion team OG |
| AlphaStar | StarCraft II | 2019 | Multi-agent RL + imitation | Grandmaster level |
| MuZero | Atari, Go, chess, shogi | 2020 | Learned model + MCTS | Matched AlphaZero without rules |
| Voyager | Minecraft | 2023 | GPT-4 + skill library | Lifelong learning embodied agent |
The most significant recent development in agent research is the emergence of agents built on top of large language models. These systems use an LLM as the core reasoning engine (the "brain") and augment it with the ability to use external tools, access memory, and take actions in digital environments. The companion article ai_agent covers this incarnation in detail; the focus here is on the underlying patterns and the most influential systems.
An LLM-based agent typically operates in a loop:

1. Receive a goal or task description from the user;
2. Reason about the current state and decide on the next step;
3. Emit either a tool call (search, code execution, API request) or a final answer;
4. Execute the tool call and observe the result;
5. Append the observation to the context and repeat until the task is complete.
This loop is closely related to the classical agent-environment cycle, but the "environment" is now the digital world of APIs, websites, and software tools, and the "policy" is the LLM's reasoning ability shaped by its training and prompt.
In an LLM agent, the language model serves as the cognitive engine that interprets state, decides what to do next, and produces the next action token by token. The capability ceiling of the agent is set by the model's reasoning ability. Reasoning-tuned models such as OpenAI's o-series, DeepSeek-R1, and grok_4_1_fast consistently outperform their non-reasoning counterparts on agentic benchmarks because they can plan, deliberate, and self-correct before committing to an action.
Tool use is what separates an agent from a chatbot. A chatbot answers using only its training data and the user's prompt; an agent calls external tools to fetch fresh information, perform reliable computation, or change the state of the world. Modern LLMs implement tool use through function calling, where the developer provides a list of named functions with JSON schemas and the model emits a structured call object that the runtime executes. The result is fed back into the context for the next turn.
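A vendor-neutral sketch of the pattern follows. The get_weather tool, its schema, and the call-object format are invented for illustration; the exact envelope varies by provider:

```python
import json

# A tool schema in the JSON-schema style most function-calling APIs
# expect; field names here follow a common convention, not one vendor.
TOOLS = [{
    "name": "get_weather",
    "description": "Current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"18C and cloudy in {city}"   # stub implementation

REGISTRY = {"get_weather": get_weather}

def run_tool_call(model_output: str) -> str:
    """Execute a structured call object emitted by the model and
    return the result to feed back into the context."""
    call = json.loads(model_output)
    fn = REGISTRY[call["name"]]
    return fn(**call["arguments"])

print(run_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```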
Meta's Toolformer paper (Schick et al., February 2023) demonstrated that an LLM could teach itself to use external tools by training on data the model itself augmented with API calls inserted at useful positions. This established that tool use could be learned rather than only hand-engineered.[9]
Chip Huyen identifies three categories of tools available to LLM agents:[10]
| Category | Purpose | Examples |
|---|---|---|
| Knowledge augmentation | Retrieve external information | Web search, document retrieval, API queries |
| Capability extension | Perform computations the LLM cannot | Calculators, code interpreters, translators |
| Write actions | Modify external state | Database writes, sending emails, making purchases |
ReAct (Reasoning + Acting): Introduced by Shunyu Yao and colleagues at Princeton and Google in October 2022, ReAct interleaves reasoning traces with actions in a Thought-Action-Observation loop. The agent first generates a verbal reasoning trace ("I need to find the population of France"), then formulates a tool call ("search: population of France 2025"), and finally incorporates the result into its context for the next reasoning step. The original paper showed ReAct outperforming imitation and reinforcement learning baselines on the ALFWorld text adventure and WebShop web-shopping benchmark by 34 and 10 absolute percentage points respectively, while also reducing hallucination on HotpotQA and Fever question answering. ReAct has become the most widely adopted pattern for building LLM agents.[11]
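A compact sketch of the loop, assuming a hypothetical llm callable that emits text in the Thought/Action format described above (or "Final: <answer>" when done):

```python
def react_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Thought-Action-Observation loop in the style of ReAct.
    Expected model output per step, e.g.:
        Thought: I need the population of France.
        Action: search[population of France 2025]
    """
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(context)                      # reasoning trace + action
        context += step + "\n"
        if step.strip().startswith("Final:"):
            return step.split("Final:", 1)[1].strip()
        actions = [l for l in step.splitlines() if l.startswith("Action:")]
        if not actions:
            continue  # no action this step; let the model think again
        # Parse 'Action: tool[argument]' and execute the tool call.
        name, arg = actions[-1][len("Action: "):].rstrip("]").split("[", 1)
        context += f"Observation: {tools[name](arg)}\n"
    return "gave up"
```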
Chain-of-Thought (CoT) prompting: Encourages the agent to "think step by step" before acting, reducing errors and hallucinations. CoT is often combined with ReAct in practice, with the thought portion of each ReAct step itself being a short chain of thought.
Tree of Thoughts (ToT): Yao et al. (May 2023) generalized chain-of-thought into a tree where the agent generates multiple alternative reasoning paths, evaluates them, and searches with backtracking and lookahead. On the Game of 24 puzzle, GPT-4 with chain-of-thought prompting solved only 4% of problems; with Tree of Thoughts the same model solved 74%.[12]
Reflexion: Noah Shinn and colleagues (NeurIPS 2023) added a verbal self-critique step where the agent reviews its own outputs, writes a reflection to memory, and uses that reflection in subsequent attempts. Reflexion reached 91% pass@1 on the HumanEval coding benchmark, surpassing the previously reported GPT-4 baseline of 80%.[13]
Plan-and-Execute: Separates planning from execution. The agent first generates a complete plan, then executes each step, and can revise the plan if intermediate results are unexpected. Frameworks like LangGraph implement this as a graph of nodes representing planning, execution, and revision.
Self-consistency: Generates many independent samples of a reasoning trace and selects the answer that appears most often across samples, reducing variance.
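A minimal implementation, assuming a stochastic llm callable whose outputs end in "Answer: X":

```python
from collections import Counter

def self_consistent_answer(llm, prompt: str, n: int = 20) -> str:
    """Sample n independent reasoning traces and return the most
    common final answer, reducing variance across samples."""
    answers = [llm(prompt).rsplit("Answer:", 1)[-1].strip()
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```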
LLM agents typically combine a short-term working memory (the model's context window, holding the active conversation, recent tool outputs, and current plan) with a long-term memory implemented in external storage. Long-term memory itself commonly splits into:

- episodic memory of past interactions and events;
- semantic memory of facts and knowledge, typically a vector store queried by similarity search;
- procedural memory of learned skills, such as the executable code library used by Voyager.
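A bare-bones version of the retrieval step, assuming embeddings are computed upstream by some encoder not shown here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumes non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, memory, k=3):
    """memory: list of (embedding, text) pairs in an external store.
    Returns the k most similar entries to splice into the context
    window -- the core of most long-term memory implementations."""
    ranked = sorted(memory, key=lambda m: cosine(query_vec, m[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]
```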
Research systems like Mem0 (2025) and A-Mem (2025) have introduced more dynamic memory architectures that consolidate, organize, and retrieve memories at runtime, drawing inspiration from how human memory works.
A critical challenge for LLM agents is the accumulation of errors across steps. If an agent has 95% accuracy at each individual step, its reliability drops to approximately 60% over 10 steps and below 1% over 100 steps. This compound error problem means that even small improvements in per-step reliability can have outsized effects on end-to-end agent performance.[10] It explains why long-horizon agents often look much weaker in practice than their per-step benchmark scores would suggest, and why retry, verification, and self-reflection mechanisms have become essential rather than optional.
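The arithmetic is easy to verify: per-step success p compounds multiplicatively over an n-step task.

```python
# Per-step success p compounds multiplicatively over an n-step task.
for p in (0.95, 0.99):
    for n in (10, 50, 100):
        print(f"p={p}, n={n}: end-to-end success ~ {p**n:.3f}")
# p=0.95 gives ~0.599 at 10 steps and ~0.006 at 100 steps,
# matching the figures cited above.
```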
A handful of papers and open source projects shaped the modern agent landscape between 2022 and 2024.
| Project | Authors / org | Date | Contribution |
|---|---|---|---|
| ReAct | Yao et al. (Princeton, Google) | Oct 2022 | Interleaved reasoning and acting; the canonical agent loop[11] |
| Toolformer | Schick et al. (Meta AI) | Feb 2023 | Self-supervised tool use during model training[9] |
| HuggingGPT (Jarvis) | Shen et al. (Microsoft) | Mar 2023 | LLM as a controller orchestrating specialist models from Hugging Face |
| AutoGPT | Toran Bruce Richards | Mar 2023 | First viral autonomous LLM agent; reached 100k+ GitHub stars in months |
| BabyAGI | Yohei Nakajima | Apr 2023 | Minimal Python script demonstrating task creation, prioritization, execution |
| Reflexion | Shinn et al. | Mar 2023 (NeurIPS 2023) | Verbal self-reflection; 91% on HumanEval[13] |
| Tree of Thoughts | Yao et al. (Princeton) | May 2023 | Search over reasoning paths; 74% on Game of 24[12] |
| Voyager | Wang et al. (NVIDIA, Caltech) | May 2023 | Lifelong learning Minecraft agent with skill library[8] |
| Generative Agents | Park et al. (Stanford) | Apr 2023 | 25-character "Smallville" simulation with memory, reflection, planning[14] |
| SWE-agent | Yang et al. (Princeton) | Apr 2024 | Agent-computer interface for the SWE-bench coding benchmark |
The Generative Agents paper from Joon Sung Park and colleagues at Stanford (UIST 2023) was particularly influential in popular imagination. The team built a simulated town called Smallville where 25 LLM-driven characters lived for two days, woke up, made breakfast, went to work, and chatted. Starting with the seed instruction that one agent wanted to throw a Valentine's Day party, the agents autonomously spread invitations, asked each other on dates, and coordinated to show up at the party at the right time, demonstrating that combining a language model with a memory stream, periodic reflection, and a planner could produce surprisingly believable social behavior.[14]
A practical agent is an assembly of capabilities. The classical Russell and Norvig categorization above describes them at a high level; the modern engineering view is more granular.
| Capability | What it does | Typical implementation in 2026 |
|---|---|---|
| Perception | Convert raw inputs into a usable representation | Text tokenization, vision encoders, ASR, screenshot parsing |
| Reasoning | Decide what to do given current state | LLM forward pass, possibly with extended thinking |
| Planning | Decompose a goal into ordered subgoals | Chain of thought, ReAct, tree of thoughts, explicit planner |
| Action | Change the environment | Tool calls, code execution, browser actions, mouse and keyboard |
| Memory | Retain useful information across time | Context window, vector DB, key-value store, structured DB |
| Reflection | Critique own behavior | Reflexion, verifier model, judge LLM, test runs |
| Communication | Talk to humans or other agents | Chat UI, MCP, A2A protocol, structured outputs |
A single-LLM agent uses one model in the central role, with all other capabilities exposed as tools. A multi-LLM agent assigns specialist roles to different models (a small model for routing, a reasoning model for planning, a vision model for screen reading). A multi-agent system goes further by letting several full agents collaborate or compete.
Four broad architectural patterns dominate practical agent deployments.
A dense ecosystem of open source frameworks supports agent development.
| Framework | Developer | Strength | Style |
|---|---|---|---|
| LangChain | LangChain Inc. | Largest ecosystem, modular components | Chains and tools |
| LangGraph | LangChain Inc. | Stateful, cyclical graphs for control flow | Finite state machine |
| LlamaIndex | LlamaIndex Inc. | Data-centric agents, strong RAG integration | Index + query engine |
| AutoGen | Microsoft Research | Multi-agent conversations, async events | Conversable agents |
| Semantic Kernel | Microsoft | Enterprise C# and Python with Azure tie-ins | Plugin system |
| CrewAI | CrewAI Inc. | Role-based collaboration in fewer than 50 lines of code | Role delegation |
| Haystack | deepset | NLP-first pipelines for search and QA | Pipeline graph |
| OpenAI Agents SDK | OpenAI | Handoffs, guardrails, tracing | OpenAI-native |
| Claude Agent SDK | Anthropic | Tool use with Claude models | Anthropic-native |
| AutoGPT | Significant Gravitas | Fully autonomous loop, large community | Autonomous goal pursuit |
| Smol Agents | Hugging Face | Minimal code-execution agent | Code agents |
In October 2025 Microsoft folded AutoGen and Semantic Kernel into a unified Microsoft Agent Framework with general availability targeted for early 2026. The ecosystem has begun to consolidate around a smaller number of mature frameworks rather than the explosion of options seen in 2023.
By 2026, every major frontier lab and several startups ship general-purpose agentic products. The browser and the IDE have emerged as the two most lucrative environments for agent deployment.
| Product | Vendor | Released | Domain | Notes |
|---|---|---|---|---|
| Anthropic computer use | Anthropic | Oct 2024 | Desktop GUI | First frontier model to control mouse and keyboard via screenshots; public beta with Claude 3.5 Sonnet[15] |
| Project Mariner | Google DeepMind | Dec 2024 | Browser | Gemini 2.0 powered; 83.5% on WebVoyager benchmark[16] |
| Operator | OpenAI | Jan 2025 | Browser | Powered by Computer-Using Agent (CUA); 38.1% OSWorld, 87% WebVoyager[17] |
| ChatGPT Agent | OpenAI | 2025 | Browser + desktop | Successor to Operator integrated into ChatGPT |
| Manus | Monica.im | Mar 2025 | General | Multi-agent VM-based system; 86.5%, 70.1%, 57.7% on GAIA Levels 1-3, beating OpenAI Deep Research at the time[18] |
| Deep Research | OpenAI | 2025 | Research | Long-horizon web research with citation-rich reports |
| Gemini Deep Research | Google | 2024-2025 | Research | Iterative search-and-read inside Gemini |
| Claude Code | Anthropic | 2025 | Coding | Terminal-native coding agent; reads code, edits files, runs commands |
| Devin | Cognition AI | Mar 2024 | Coding | First public autonomous SWE agent; 13.86% on full SWE-bench at launch[19] |
| Cursor | Anysphere | 2023 | Coding (IDE) | AI-first VS Code fork with agent mode |
| Windsurf | Codeium | 2024 | Coding (IDE) | Cascade flows combining suggestions with agentic actions |
| GitHub Copilot Agent | GitHub / Microsoft | 2025 | Coding | Agent mode added to existing Copilot product |
| Cline | Cline | 2024-2025 | Coding | Open-source autonomous coding agent VS Code extension |
| Bolt.new | StackBlitz | 2024 | Web app generation | Browser-based full-stack agent that runs and previews code |
| v0 | Vercel | 2023-2025 | Web app generation | UI-first generation tied to the Next.js ecosystem |
| Lovable | Lovable | 2024-2025 | Web app generation | Conversational app builder targeting non-engineers |
OpenAI's Operator launched on January 23, 2025 and is powered by a Computer-Using Agent (CUA) model that combines GPT-4o vision with reinforcement learning on GUI control. CUA achieved 38.1% on OSWorld for full computer use, 58.1% on WebArena, and 87% on WebVoyager at launch.[17]
Anthropic's Computer Use, released October 22, 2024 alongside the upgraded Claude 3.5 Sonnet, was the first frontier model offering to take screenshots, move a cursor, click buttons, and type on a real desktop through an API. Anthropic explicitly framed the launch as experimental, noting it was "at times cumbersome and error-prone." Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company were the first listed adopters.[15]
Devin, introduced by Cognition Labs on March 12, 2024, was the first stand-alone product built around an autonomous software engineering agent. At launch, Devin resolved 13.86% of issues on the full SWE-bench benchmark, far above the previous best unassisted result of 1.96%, and 18.0% on a 25% random subset.[19] Subsequent specialized coding agents from OpenAI, Anthropic, and others have pushed SWE-bench Verified scores past 70%.
Manus, launched by the Chinese startup Monica.im in March 2025, is a general-purpose agent built on a multi-agent architecture running in dedicated virtual machines. It scored 86.5%, 70.1%, and 57.7% on GAIA Levels 1, 2, and 3 respectively, exceeding OpenAI's Deep Research scores reported at the same time. Within a week of launch, more than two million people joined its waitlist.[18]
Evaluating agents is harder than evaluating models. Benchmarks have to score not just final answers but the agent's ability to interact with environments, recover from errors, use tools, and complete multi-step tasks within budget. The major agent benchmarks of 2025 and 2026 are summarized below.
| Benchmark | Focus | Tasks | Key metric | Authors |
|---|---|---|---|---|
| SWE-bench | Real GitHub issues in Python repos | 2,294 issues; ~500 in Verified subset | % issues resolved | Jimenez et al. (Princeton) |
| WebArena | Realistic web navigation | 812 tasks across 5 sites | Task success rate | Zhou et al. (CMU) |
| WebVoyager | End to end web tasks | 643 tasks across 15 popular sites | Success rate | Tencent / He et al. |
| GAIA | General AI assistant | 466 questions across 3 difficulty levels | Pass rate at each level | Mialon et al. (Meta, HF) |
| tau-bench | Customer support | Retail and airline domains with simulated DB | pass^k reliability | Yao et al. (Sierra) |
| tau2-bench | Tool-agent-user interaction | Telecom and others, shared world state | pass^k reliability | Sierra Research |
| OSWorld | Desktop OS tasks (Ubuntu, macOS, Windows) | 369 real computer tasks | Success rate | Xie et al. (XLang Lab) |
| AgentBench | General agent ability | 8 environments (OS, DB, KG, gaming, web) | Composite score | Liu et al. (Tsinghua) |
| BFCL | Function calling | Single, parallel, multiple, multi-turn tool calls | Accuracy by category | Berkeley Function Calling Leaderboard |
| WebShop | Online shopping | 1.18M products, 12k natural-language goals | Reward, success | Princeton |
| ALFWorld | Embodied text adventure | Multi-step household tasks | Success rate | UToronto / Microsoft |
SWE-bench Verified, the human-validated subset, has become the de facto benchmark for coding agents; leading systems resolve more than 70% of issues by early 2026. OSWorld posed an unusually large gap between human performance (above 72%) and the best machine score (12.24%) when introduced in 2024, and that gap has narrowed but not closed in subsequent updates such as OSWorld-Verified.[20] GAIA Level 3, despite being just 99 questions, remains a strong differentiator between agentic systems because it requires reasoning, browsing, and tool use composed over many steps.
Five shifts define the agent landscape in this period.
Agent capability is improving rapidly. Per-step accuracy of frontier models on agent tasks has roughly doubled across two model generations, and the resulting end-to-end success on multi-step benchmarks has improved disproportionately because of the compound effect.
Every frontier lab ships agentic products. OpenAI (Operator, ChatGPT Agent, Codex, Deep Research), Anthropic (Computer Use, Claude Code, Claude Agent SDK), Google (Project Mariner, Gemini Deep Research, Vertex AI Agent Builder), Microsoft (Copilot, Copilot Studio, AutoGen, Microsoft Agent Framework), and xAI (Grok agentic features) all sell agents directly to end users or developers, in addition to providing the underlying APIs.
Browser and computer use are central. The browser has emerged as the universal interface for agents to act on the web because most enterprise systems lack stable APIs. Computer-use agents extend that to desktop applications. Both are bottlenecked by GUI grounding and slow, expensive vision models.
Code agents are the first commercial success story. Cursor, Devin, Claude Code, Codex, GitHub Copilot Agent, Windsurf, and Cline have collectively turned agentic coding into a multi-billion dollar segment of the AI industry. Cursor's annualized revenue crossed nine figures in 2024; Codex reportedly surpassed two million weekly active users in early 2026.
Interoperability protocols have started to standardize how agents talk to tools and to each other. The Model Context Protocol (model_context_protocol) was announced by Anthropic in November 2024 and adopted within a year by OpenAI, Google DeepMind, and Microsoft. Google introduced the Agent-to-Agent Protocol (A2A) in April 2025 to standardize agent-agent communication. In December 2025 the Agentic AI Foundation, a directed fund under the Linux Foundation, was launched to govern these emerging standards.[2]
Agents remain unreliable in ways that constrain where they can be deployed.
The practical response in 2025 and 2026 has been to keep humans in the loop for high-stakes actions, sandbox agents in isolated environments, scope permissions narrowly, log everything, and cap budget per session.
The distinction between an agent and a chatbot turns on autonomous action. A chatbot answers questions and produces text in response to prompts. An agent decides on its own to call tools, take actions in external systems, and pursue multi-step goals over time. The same underlying LLM can serve both roles depending on the scaffolding around it. ChatGPT in plain conversation is a chatbot; ChatGPT with browsing, code interpreter, and the ability to navigate websites becomes an agent.
Many products blur the line. A coding assistant that suggests a single completion is a chatbot; the same product in agent mode that reads files, runs tests, and applies a multi-file refactor is an agent. The relevant question is whether the system maintains state, takes consequential actions, and makes its own decisions about what to do next.
The concept of an agent in AI has evolved across more than seven decades.
| Year | Milestone |
|---|---|
| 1950 | Alan Turing's "Computing Machinery and Intelligence" proposes the imitation game; agent-style framing implicit |
| 1956 | Dartmouth conference establishes AI as a field; early programs like Logic Theorist embody agent-like behavior |
| 1966 | ELIZA at MIT becomes the first chatbot |
| 1971 | STRIPS introduces formal planning with preconditions and effects |
| 1976 | MYCIN expert system at Stanford diagnoses bacterial infections from rules |
| 1986 | Term "software agent" gains traction in distributed AI research |
| 1995 | Russell and Norvig's Artificial Intelligence: A Modern Approach (1st ed.) makes the agent the central organizing concept of AI |
| 1997 | IBM Deep Blue defeats Garry Kasparov at chess |
| 2013 | DeepMind DQN learns to play Atari games from raw pixels |
| 2016 | AlphaGo defeats Lee Sedol at Go |
| 2017 | AlphaZero masters Go, chess, and shogi via self-play |
| 2019 | OpenAI Five defeats Dota 2 world champions; AlphaStar reaches StarCraft II Grandmaster |
| Oct 2022 | ReAct paper formalizes the reasoning + acting loop for LLM agents |
| Feb 2023 | Toolformer demonstrates self-supervised tool use |
| Mar 2023 | AutoGPT and BabyAGI bring autonomous LLM agents to mainstream attention |
| Apr 2023 | Generative Agents "Smallville" simulation; Stanford UIST 2023 |
| May 2023 | Tree of Thoughts, Voyager Minecraft agent |
| Mar 2024 | Devin debuts as the first commercial autonomous software engineer |
| Apr 2024 | OSWorld benchmark released for real desktop tasks |
| Jun 2024 | tau-bench introduced for customer support agents |
| Oct 2024 | Anthropic releases Computer Use with Claude 3.5 Sonnet |
| Nov 2024 | Anthropic introduces Model Context Protocol (MCP) |
| Dec 2024 | Google unveils Project Mariner browser agent on Gemini 2.0 |
| Jan 2025 | OpenAI launches Operator (Computer-Using Agent / CUA) |
| Mar 2025 | Manus launches in China; tops GAIA leaderboard |
| Apr 2025 | Google introduces Agent-to-Agent (A2A) protocol |
| Dec 2025 | Agentic AI Foundation launched under Linux Foundation; AI agent market estimated at $7.63 billion[2] |
| 2026 | SWE-bench Verified scores from leading agents exceed 70%; non-human and agentic identities projected to surpass 45 billion |