See also: ai_agent, Machine learning terms
In artificial intelligence (AI), an agent is an entity that perceives its environment through sensors and acts upon that environment through actuators in pursuit of objectives. The concept of an agent is one of the most foundational ideas in AI, spanning classical AI planning, reinforcement learning, robotics, and the modern wave of large language model powered autonomous systems. Stuart Russell and Peter Norvig define an agent simply as "anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators," making it the central abstraction around which AI research is organized.[1]
Since 2024, the term "agent" has taken on renewed significance in the AI industry. While the classical definition remains intact, a new generation of LLM-based AI agents has emerged that can use software tools, browse the web, write and execute code, and carry out multi-step tasks with minimal human supervision. This development has made "agentic AI" one of the defining trends of 2025 and 2026.[2] The companion article ai_agent covers the modern LLM-driven incarnation in greater depth; this article focuses on the broader concept that unifies classical AI, reinforcement learning, and modern systems under one definition.
Imagine you have a robot friend who can look around a room, think about what to do, and then do it. If the robot sees that the floor is dirty, it decides to vacuum. If it bumps into a chair, it turns and goes another way. That robot is an "agent" because it can sense things (see the dirty floor), think about them (decide to vacuum), and take action (start cleaning). AI agents work the same way, but they live inside computers. Some play video games, some answer questions, and some help drive cars. The smartest ones can even learn from their mistakes and get better over time, just like you get better at riding a bike the more you practice.
Formally, an agent is defined by an agent function that maps every possible percept sequence to an action:
f : P* -> A
where P* is the set of all possible percept sequences and A is the set of actions available to the agent. The agent function is an abstract mathematical description; the agent program is the concrete implementation that runs on a physical or virtual system (the agent architecture). A rational agent is one that selects actions expected to maximize its performance measure, given what it has perceived so far and any built-in knowledge it possesses.[1]
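The distinction between agent function and agent program can be phrased as a rough sketch in Python. Everything below (the percept and action types, the lookup table) is invented for illustration, not part of any standard API:

```python
from typing import Callable, Sequence

Percept = str   # placeholder percept type for illustration
Action = str    # placeholder action type

# The agent function: a percept sequence in, an action out.
AgentFunction = Callable[[Sequence[Percept]], Action]

def table_driven_agent(percepts: Sequence[Percept]) -> Action:
    """A hypothetical agent program that implements the agent function
    as an explicit lookup table. This is tractable only for tiny
    percept spaces, which is why real agent programs compute the
    mapping instead of storing it."""
    table = {
        ("floor dirty",): "vacuum",
        ("floor dirty", "bumped chair"): "turn",
    }
    return table.get(tuple(percepts), "do nothing")
```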
Four ingredients fully specify an agent in this view: the percepts it can receive, the actions it can take, the goals or performance measure it tries to optimize, and the environment in which it operates. Russell and Norvig package these into the PEAS framework (Performance measure, Environment, Actuators, Sensors), which they use to specify the task environment of any agent before discussing its design.[1]
| PEAS element | Question it answers | Self-driving taxi example |
|---|---|---|
| Performance measure | What counts as success? | Safe, fast, legal, comfortable trips; profits |
| Environment | What does the agent live in? | Roads, traffic, pedestrians, customers, weather |
| Actuators | What can the agent change? | Steering, accelerator, brake, signal, horn, display |
| Sensors | What can the agent perceive? | Cameras, lidar, GPS, speedometer, accelerometer, microphone |
A second taxonomy classifies the task environment itself along several axes that influence how hard the design problem is.[1] An environment is fully observable if sensors give access to the complete state, partially observable otherwise; deterministic if the next state is fully determined by the current state and action, stochastic otherwise; episodic if each action stands alone, sequential if actions have lasting consequences; static or dynamic; discrete or continuous; and single-agent or multi-agent. A chess agent operates in a fully observable, deterministic, sequential, static, discrete, multi-agent environment. A self-driving car lives in the hardest possible setting along almost every axis.
The interaction between an agent and its environment follows a cyclical pattern that is especially well formalized in reinforcement learning. At each discrete time step t, the agent:

1. Observes the current state s_t (or a partial observation of it);
2. Selects an action a_t according to its policy;
3. Receives a scalar reward r_{t+1} and transitions to the next state s_{t+1}.
This cycle repeats until a terminal condition is met or the process continues indefinitely. The agent's objective is to learn a policy that maximizes the expected cumulative reward over time. When the environment satisfies the Markov property (future states depend only on the current state and action, not on history), this framework is called a Markov Decision Process (MDP).[3]
| Component | Symbol | Description |
|---|---|---|
| State | s_t | Representation of the environment at time t |
| Action | a_t | Choice made by the agent |
| Reward | r_{t+1} | Scalar feedback signal from the environment |
| Policy | pi(s) | Mapping from states to actions |
| Value function | V(s) | Expected cumulative reward from state s |
| Action-value function | Q(s, a) | Expected cumulative reward from taking action a in state s |
| Transition function | T(s, a, s') | Probability of moving to state s' after taking action a in state s |
| Discount factor | gamma | Weight on future rewards, typically 0.9 to 0.99 |
The expected return from a state under policy pi is V_pi(s) = E[ sum_t gamma^t r_t | s_0 = s, pi ]. Optimal control reduces to finding the policy that maximizes this quantity. The same loop describes a thermostat regulating a room, a chess engine choosing moves, a robot arm placing a chip on a circuit board, and an LLM agent calling tools in a browser. The differences are in the action space, the observation space, and how the policy is computed, not in the abstraction.
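A minimal sketch of this loop in Python makes the abstraction concrete. The toy environment and random policy below are invented stand-ins; the loop structure and the discounted-return computation are the point:

```python
import random

GAMMA = 0.95  # discount factor (illustrative value)

class ToyEnv:
    """A hypothetical 5-state corridor: move right to reach the goal."""
    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        # action: 0 = left, 1 = right
        self.state = max(0, self.state + (1 if action == 1 else -1))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state: int) -> int:
    return random.choice([0, 1])

env, rewards = ToyEnv(), []
s, done = env.reset(), False
while not done:
    a = random_policy(s)          # policy pi(s)
    s, r, done = env.step(a)      # environment transition
    rewards.append(r)

# Discounted return from the start state: sum_t gamma^t r_t
G = sum(GAMMA**t * r for t, r in enumerate(rewards))
print(f"episode length={len(rewards)}, return={G:.3f}")
```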
Russell and Norvig's textbook Artificial Intelligence: A Modern Approach, now in its fourth edition (Pearson, 2021), classifies agents into five types based on their internal structure and level of sophistication. Each successive type builds on the capabilities of the previous one. The framing has structured AI courses for nearly thirty years and is still the canonical introduction to the agent concept.[1]
Simple reflex agents select actions based solely on the current percept, ignoring the entire percept history. They operate using condition-action rules ("if the car ahead is braking, then apply brakes"). These agents work well in fully observable environments but fail when the environment is partially observable because they have no memory of past events. A household thermostat is a classic example: it turns on heating when the temperature drops below a threshold and turns it off when the threshold is exceeded. A spam filter that only inspects the current email and applies fixed rules is another. Simple reflex agents fail catastrophically in any setting where the right action depends on context the current percept does not reveal.
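A thermostat-style simple reflex agent fits in a few lines; the thresholds below are illustrative:

```python
def thermostat_agent(percept: float) -> str:
    """Condition-action rules over the *current* percept only.
    A real thermostat would add hysteresis; this is a sketch."""
    if percept < 19.0:
        return "heat on"
    if percept > 21.0:
        return "heat off"
    return "no-op"

print(thermostat_agent(17.5))  # -> "heat on"
```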
Model-based reflex agents maintain an internal model of the world that tracks aspects of the environment that are not directly visible. This internal state is updated after each action and percept using two kinds of knowledge: how the world evolves independently of the agent, and how the agent's own actions affect the world. By maintaining this model, the agent can handle partially observable environments far more effectively than a simple reflex agent. A self-driving car that remembers a pedestrian who briefly stepped behind a parked truck is a model-based reflex agent. The internal model can be as simple as a flag ("the lights are on") or as elaborate as a 3D occupancy grid of the surrounding street.
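A minimal sketch of the state-update step, with invented keys, might look like this:

```python
def update_state(state: dict, last_action: str, percept: dict) -> dict:
    """Fold the latest percept and the agent's own last action into
    the internal model. All keys here are invented for illustration."""
    new = dict(state)
    # How the agent's own actions affect the world:
    if last_action == "toggle switch":
        new["lights_on"] = not new.get("lights_on", False)
    # How the world evolves, including things no longer visible:
    if percept.get("pedestrian_visible"):
        new["pedestrian_nearby"] = True  # persists even if occluded later
    return new
```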
Goal-based agents extend model-based agents by incorporating explicit goal information that describes desirable states. Rather than just reacting, these agents use search and planning algorithms to identify sequences of actions that will achieve their goals. This makes them more flexible: when the environment or goals change, the agent can recompute its plan rather than requiring a complete rewrite of its condition-action rules. A robot vacuum that plans an efficient path through a room is a goal-based agent. So is a route planner that searches a graph of intersections to compute the shortest path to a destination, and a STRIPS planner that orders preconditions and effects to assemble a plan.
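The route-planner example reduces to graph search. A breadth-first sketch over a toy intersection graph (the graph and node names are invented):

```python
from collections import deque

def shortest_route(graph: dict, start: str, goal: str) -> list:
    """Breadth-first search over an intersection graph -- the kind of
    planning a goal-based route planner performs."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return []

city = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_route(city, "A", "E"))  # ['A', 'B', 'D', 'E']
```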
Utility-based agents go further by employing a utility function that maps each state (or sequence of states) to a real number representing how desirable that state is. While goal-based agents have a binary notion of success and failure, utility-based agents can compare multiple outcomes on a continuous scale. This is especially important when there are conflicting goals ("arrive on time" vs. "avoid bumpy roads"), when goals can be achieved to different degrees, or when there is uncertainty about outcomes. A rational utility-based agent selects the action that maximizes expected utility, weighing probabilities and desirability of potential outcomes. A financial trading agent that balances expected return against variance is a utility-based agent.
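Expected-utility action selection is a one-line argmax once outcomes and utilities are specified. The probabilities and utility values below are invented for illustration:

```python
def best_action(actions, outcomes, utility):
    """Maximize expected utility: argmax_a sum_s P(s|a) * U(s).
    `outcomes` maps each action to (probability, state) pairs."""
    def eu(a):
        return sum(p * utility(s) for p, s in outcomes[a])
    return max(actions, key=eu)

# A fast-but-bumpy route vs. a slow-but-smooth one.
outcomes = {
    "highway":  [(0.7, "on time, bumpy"), (0.3, "late, bumpy")],
    "backroad": [(0.9, "late, smooth"), (0.1, "on time, smooth")],
}
U = {"on time, bumpy": 8, "late, bumpy": 2,
     "late, smooth": 5, "on time, smooth": 10}
print(best_action(outcomes.keys(), outcomes, U.get))  # -> "highway"
```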
Learning agents can improve their performance over time through experience. They consist of four conceptual components: a learning element that makes improvements based on feedback, a performance element that selects actions, a critic that evaluates how well the agent is doing relative to a fixed performance standard, and a problem generator that suggests exploratory actions to discover new experiences. Nearly all sophisticated AI systems today are learning agents in some form. AlphaGo learned its policy from millions of self-play games; an LLM-based coding agent learns implicitly from the gradient updates that produced its base model and then explicitly from examples in its prompt.
| Agent type | Internal state | Planning | Learning | Example |
|---|---|---|---|---|
| Simple reflex | None | No | No | Thermostat |
| Model-based reflex | World model | No | No | Spam filter with context state |
| Goal-based | World model + goals | Yes | No | Route planner |
| Utility-based | World model + utility function | Yes | No | Financial trading agent |
| Learning | All of the above + learning element | Yes | Yes | Self-driving car, AlphaGo |
This taxonomy is conceptual rather than architectural. A modern coding agent like Devin blends elements of all five: it reacts to immediate test failures, maintains a model of the codebase, decomposes goals into subtasks, weighs alternative implementations against a quality utility, and updates its plan as new information arrives.
In reinforcement learning (RL), the agent is the central learning entity. Unlike supervised learning where correct answers are provided, an RL agent must discover which actions yield the highest reward through trial and error. The agent interacts with its environment over many episodes, gradually improving its policy.
RL agents can be broadly categorized as model-based or model-free. Model-based agents build an internal model of the environment's transition dynamics and use it for planning. Model-free agents, such as those using Q-learning or policy gradient methods, learn directly from experience without constructing an explicit environment model. Model-free approaches are often simpler to implement but may require more training data, while model-based methods can be more sample-efficient but rely on the accuracy of their learned model.[3]
Key RL algorithms for training agents include:

- Q-learning and its deep variant DQN, which learn action-value functions (a minimal tabular update is sketched below);
- policy gradient methods such as REINFORCE, which optimize the policy directly;
- actor-critic methods, which combine a learned value function with a policy;
- Proximal Policy Optimization (PPO), the workhorse of large-scale systems such as OpenAI Five.
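For concreteness, the tabular Q-learning update referenced above can be sketched as follows; the hyperparameter values and two-action space are illustrative:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # illustrative hyperparameters
Q = defaultdict(float)                    # Q[(state, action)] -> value
ACTIONS = [0, 1]

def choose_action(state):
    """Epsilon-greedy: explore with probability EPSILON, else exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```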
DQN's success in 2013 marked the start of the deep RL era and made the agent abstraction concrete in a way classical control theory had not. The same conceptual loop, now powered by neural function approximators, would later produce AlphaGo, AlphaStar, OpenAI Five, and many of the simulation results that influenced modern LLM training.
Game playing has been one of the most visible domains for AI agents, producing landmark achievements that demonstrated the power of different agent architectures.
Deep Blue (IBM, 1997) defeated chess world champion Garry Kasparov using brute-force search with hand-crafted evaluation functions. While not a learning agent, Deep Blue showcased the power of combining search algorithms with domain expertise and dedicated hardware.[5]
AlphaGo (DeepMind, 2015 to 2017) became the first computer program to defeat a professional human Go player without handicap. AlphaGo combined deep neural networks with Monte Carlo tree search and was trained through a combination of supervised learning on human games and reinforcement learning through self-play. Its successor, AlphaZero, learned to play Go, chess, and shogi entirely through self-play with no human game data, achieving superhuman performance in all three games within twenty-four hours of training.[6]
OpenAI Five (2019) defeated the world champions of Dota 2, a complex five-on-five multiplayer video game. The system used a team of five neural network agents trained with PPO over the equivalent of 45,000 years of gameplay. OpenAI Five won 99.4% of its public games, demonstrating that RL agents could master highly complex, partially observable, multi-agent environments.[7]
AlphaStar (DeepMind, 2019) reached Grandmaster level in StarCraft II, a real-time strategy game requiring long-horizon planning, imperfect information handling, and real-time decision making.
Voyager (Wang et al., 2023) was the first LLM-powered embodied lifelong learning agent. Built on top of GPT-4, Voyager played Minecraft autonomously and combined three components: an automatic curriculum that proposed exploration tasks, an ever-growing skill library of executable code, and an iterative prompting loop that incorporated environment feedback and self-verification. It obtained 3.3 times more unique items, traveled 2.3 times longer distances, and unlocked tech tree milestones up to 15.3 times faster than prior state of the art, all without any model fine-tuning.[8] Voyager helped popularize the idea that an LLM could serve as the policy of an open-ended agent in a simulated world.
| Agent | Game | Year | Key technique | Achievement |
|---|---|---|---|---|
| Deep Blue | Chess | 1997 | Search + evaluation | Beat world champion Kasparov |
| AlphaGo | Go | 2016 | Neural nets + MCTS + RL | Beat 9 dan professional Lee Sedol |
| AlphaZero | Go, chess, shogi | 2017 | Pure self-play RL | Superhuman in all three games |
| OpenAI Five | Dota 2 | 2019 | Multi-agent PPO | Beat world champion team OG |
| AlphaStar | StarCraft II | 2019 | Multi-agent RL + imitation | Grandmaster level |
| MuZero | Atari, Go, chess, shogi | 2020 | Learned model + MCTS | Matched AlphaZero without rules |
| Voyager | Minecraft | 2023 | GPT-4 + skill library | Lifelong learning embodied agent |
The most significant recent development in agent research is the emergence of agents built on top of large language models. These systems use an LLM as the core reasoning engine (the "brain") and augment it with the ability to use external tools, access memory, and take actions in digital environments. The companion article ai_agent covers this incarnation in detail; the focus here is on the underlying patterns and the most influential systems.
An LLM-based agent typically operates in a loop:

1. Receive a goal or task description from the user;
2. Reason about the current state and decide on the next step;
3. Emit either a tool call (search, code execution, API request) or a final answer;
4. Execute the tool call and observe the result;
5. Append the observation to the context and repeat until the task is complete.
This loop is closely related to the classical agent-environment cycle, but the "environment" is now the digital world of APIs, websites, and software tools, and the "policy" is the LLM's reasoning ability shaped by its training and prompt.
In an LLM agent, the language model serves as the cognitive engine that interprets state, decides what to do next, and produces the next action token by token. The capability ceiling of the agent is set by the model's reasoning ability. Reasoning-tuned models such as OpenAI's o-series, DeepSeek-R1, and grok_4_1_fast consistently outperform their non-reasoning counterparts on agentic benchmarks because they can plan, deliberate, and self-correct before committing to an action.
Tool use is what separates an agent from a chatbot. A chatbot answers using only its training data and the user's prompt; an agent calls external tools to fetch fresh information, perform reliable computation, or change the state of the world. Modern LLMs implement tool use through function calling, where the developer provides a list of named functions with JSON schemas and the model emits a structured call object that the runtime executes. The result is fed back into the context for the next turn.
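A vendor-neutral sketch of the pattern follows. The get_weather tool, its schema, and the call-object format are invented for illustration; the exact envelope varies by provider:

```python
import json

# A tool schema in the JSON-schema style most function-calling APIs
# expect; field names here follow a common convention, not one vendor.
TOOLS = [{
    "name": "get_weather",
    "description": "Current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> str:
    return f"18C and cloudy in {city}"   # stub implementation

REGISTRY = {"get_weather": get_weather}

def run_tool_call(model_output: str) -> str:
    """Execute a structured call object emitted by the model and
    return the result to feed back into the context."""
    call = json.loads(model_output)
    fn = REGISTRY[call["name"]]
    return fn(**call["arguments"])

print(run_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```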
Meta's Toolformer paper (Schick et al., February 2023) demonstrated that an LLM could teach itself to use external tools by training on data the model itself augmented with API calls inserted at useful positions. This established that tool use could be learned rather than only hand-engineered.[9]
Chip Huyen identifies three categories of tools available to LLM agents:[10]
| Category | Purpose | Examples |
|---|---|---|
| Knowledge augmentation | Retrieve external information | Web search, document retrieval, API queries |
| Capability extension | Perform computations the LLM cannot | Calculators, code interpreters, translators |
| Write actions | Modify external state | Database writes, sending emails, making purchases |
ReAct (Reasoning + Acting): Introduced by Shunyu Yao and colleagues at Princeton and Google in October 2022, ReAct interleaves reasoning traces with actions in a Thought-Action-Observation loop. The agent first generates a verbal reasoning trace ("I need to find the population of France"), then formulates a tool call ("search: population of France 2025"), and finally incorporates the result into its context for the next reasoning step. The original paper showed ReAct outperforming imitation and reinforcement learning baselines on the ALFWorld text adventure and WebShop web-shopping benchmark by 34 and 10 absolute percentage points respectively, while also reducing hallucination on HotpotQA and Fever question answering. ReAct has become the most widely adopted pattern for building LLM agents.[11]
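A compact sketch of the loop, assuming a hypothetical llm callable that emits text in the Thought/Action format described above (or "Final: <answer>" when done):

```python
def react_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Thought-Action-Observation loop in the style of ReAct.
    Expected model output per step, e.g.:
        Thought: I need the population of France.
        Action: search[population of France 2025]
    """
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(context)                      # reasoning trace + action
        context += step + "\n"
        if step.strip().startswith("Final:"):
            return step.split("Final:", 1)[1].strip()
        actions = [l for l in step.splitlines() if l.startswith("Action:")]
        if not actions:
            continue  # no action this step; let the model think again
        # Parse 'Action: tool[argument]' and execute the tool call.
        name, arg = actions[-1][len("Action: "):].rstrip("]").split("[", 1)
        context += f"Observation: {tools[name](arg)}\n"
    return "gave up"
```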
Chain-of-Thought (CoT) prompting: Encourages the agent to "think step by step" before acting, reducing errors and hallucinations. CoT is often combined with ReAct in practice, with the thought portion of each ReAct step itself being a short chain of thought.
Tree of Thoughts (ToT): Yao et al. (May 2023) generalized chain-of-thought into a tree where the agent generates multiple alternative reasoning paths, evaluates them, and searches with backtracking and lookahead. On the Game of 24 puzzle, GPT-4 with chain-of-thought prompting solved only 4% of problems; with Tree of Thoughts the same model solved 74%.[12]
Reflexion: Noah Shinn and colleagues (NeurIPS 2023) added a verbal self-critique step where the agent reviews its own outputs, writes a reflection to memory, and uses that reflection in subsequent attempts. Reflexion reached 91% pass@1 on the HumanEval coding benchmark, surpassing the previously reported GPT-4 baseline of 80%.[13]
Plan-and-Execute: Separates planning from execution. The agent first generates a complete plan, then executes each step, and can revise the plan if intermediate results are unexpected. Frameworks like LangGraph implement this as a graph of nodes representing planning, execution, and revision.
Self-consistency: Generates many independent samples of a reasoning trace and selects the answer that appears most often across samples, reducing variance.
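A minimal implementation, assuming a stochastic llm callable whose outputs end in "Answer: X":

```python
from collections import Counter

def self_consistent_answer(llm, prompt: str, n: int = 20) -> str:
    """Sample n independent reasoning traces and return the most
    common final answer, reducing variance across samples."""
    answers = [llm(prompt).rsplit("Answer:", 1)[-1].strip()
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```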
LLM agents typically combine a short-term working memory (the model's context window, holding the active conversation, recent tool outputs, and current plan) with a long-term memory implemented in external storage. Long-term memory itself commonly splits into:

- episodic memory of past interactions and events;
- semantic memory of facts and knowledge, typically a vector store queried by similarity search;
- procedural memory of learned skills, such as the executable code library used by Voyager.
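A bare-bones version of the retrieval step, assuming embeddings are computed upstream by some encoder not shown here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumes non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, memory, k=3):
    """memory: list of (embedding, text) pairs in an external store.
    Returns the k most similar entries to splice into the context
    window -- the core of most long-term memory implementations."""
    ranked = sorted(memory, key=lambda m: cosine(query_vec, m[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]
```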
Research systems like Mem0 (2025) and A-Mem (2025) have introduced more dynamic memory architectures that consolidate, organize, and retrieve memories at runtime, drawing inspiration from how human memory works.
A critical challenge for LLM agents is the accumulation of errors across steps. If an agent has 95% accuracy at each individual step, its reliability drops to approximately 60% over 10 steps and below 1% over 100 steps. This compound error problem means that even small improvements in per-step reliability can have outsized effects on end-to-end agent performance.[10] It explains why long-horizon agents often look much weaker in practice than their per-step benchmark scores would suggest, and why retry, verification, and self-reflection mechanisms have become essential rather than optional.
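The arithmetic is easy to verify: per-step success p compounds multiplicatively over an n-step task.

```python
# Per-step success p compounds multiplicatively over an n-step task.
for p in (0.95, 0.99):
    for n in (10, 50, 100):
        print(f"p={p}, n={n}: end-to-end success ~ {p**n:.3f}")
# p=0.95 gives ~0.599 at 10 steps and ~0.006 at 100 steps,
# matching the figures cited above.
```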
A handful of papers and open source projects shaped the modern agent landscape between 2022 and 2024.
| Project | Authors / org | Date | Contribution |
|---|---|---|---|
| ReAct | Yao et al. (Princeton, Google) | Oct 2022 | Interleaved reasoning and acting; the canonical agent loop[11] |
| Toolformer | Schick et al. (Meta AI) | Feb 2023 | Self-supervised tool use during model training[9] |
| HuggingGPT (Jarvis) | Shen et al. (Microsoft) | Mar 2023 | LLM as a controller orchestrating specialist models from Hugging Face |
| AutoGPT | Toran Bruce Richards | Mar 2023 | First viral autonomous LLM agent; reached 100k+ GitHub stars in months |
| BabyAGI | Yohei Nakajima | Apr 2023 | Minimal Python script demonstrating task creation, prioritization, execution |
| Reflexion | Shinn et al. | Mar 2023 (NeurIPS 2023) | Verbal self-reflection; 91% on HumanEval[13] |
| Tree of Thoughts | Yao et al. (Princeton) | May 2023 | Search over reasoning paths; 74% on Game of 24[12] |
| Voyager | Wang et al. (NVIDIA, Caltech) | May 2023 | Lifelong learning Minecraft agent with skill library[8] |
| Generative Agents | Park et al. (Stanford) | Apr 2023 | 25-character "Smallville" simulation with memory, reflection, planning[14] |
| SWE-agent | Yang et al. (Princeton) | Apr 2024 | Agent-computer interface for the SWE-bench coding benchmark |
The Generative Agents paper from Joon Sung Park and colleagues at Stanford (UIST 2023) was particularly influential in popular imagination. The team built a simulated town called Smallville where 25 LLM-driven characters lived for two days, woke up, made breakfast, went to work, and chatted. Starting with the seed instruction that one agent wanted to throw a Valentine's Day party, the agents autonomously spread invitations, asked each other on dates, and coordinated to show up at the party at the right time, demonstrating that combining a language model with a memory stream, periodic reflection, and a planner could produce surprisingly believable social behavior.[14]
A practical agent is an assembly of capabilities. The classical Russell and Norvig categorization above describes them at a high level; the modern engineering view is more granular.
| Capability | What it does | Typical implementation in 2026 |
|---|---|---|
| Perception | Convert raw inputs into a usable representation | Text tokenization, vision encoders, ASR, screenshot parsing |
| Reasoning | Decide what to do given current state | LLM forward pass, possibly with extended thinking |
| Planning | Decompose a goal into ordered subgoals | Chain of thought, ReAct, tree of thoughts, explicit planner |
| Action | Change the environment | Tool calls, code execution, browser actions, mouse and keyboard |
| Memory | Retain useful information across time | Context window, vector DB, key-value store, structured DB |
| Reflection | Critique own behavior | Reflexion, verifier model, judge LLM, test runs |
| Communication | Talk to humans or other agents | Chat UI, MCP, A2A protocol, structured outputs |
A single-LLM agent uses one model in the central role, with all other capabilities exposed as tools. A multi-LLM agent assigns specialist roles to different models (a small model for routing, a reasoning model for planning, a vision model for screen reading). A multi-agent system goes further by letting several full agents collaborate or compete.
Four broad architectural patterns dominate practical agent deployments.
A dense ecosystem of open source frameworks supports agent development.
| Framework | Developer | Strength | Style |
|---|---|---|---|
| LangChain | LangChain Inc. | Largest ecosystem, modular components | Chains and tools |
| LangGraph | LangChain Inc. | Stateful, cyclical graphs for control flow | Finite state machine |
| LlamaIndex | LlamaIndex Inc. | Data-centric agents, strong RAG integration | Index + query engine |
| AutoGen | Microsoft Research | Multi-agent conversations, async events | Conversable agents |
| Semantic Kernel | Microsoft | Enterprise C# and Python with Azure tie-ins | Plugin system |
| CrewAI | CrewAI Inc. | Role-based collaboration in fewer than 50 lines of code | Role delegation |
| Haystack | deepset | NLP-first pipelines for search and QA | Pipeline graph |
| OpenAI Agents SDK | OpenAI | Handoffs, guardrails, tracing | OpenAI-native |
| Claude Agent SDK | Anthropic | Tool use with Claude models | Anthropic-native |
| AutoGPT | Significant Gravitas | Fully autonomous loop, large community | Autonomous goal pursuit |
| Smol Agents | Hugging Face | Minimal code-execution agent | Code agents |
In October 2025 Microsoft folded AutoGen and Semantic Kernel into a unified Microsoft Agent Framework with general availability targeted for early 2026. The ecosystem has begun to consolidate around a smaller number of mature frameworks rather than the explosion of options seen in 2023.
By 2026, every major frontier lab and several startups ship general-purpose agentic products. The browser and the IDE have emerged as the two most lucrative environments for agent deployment.
| Product | Vendor | Released | Domain | Notes |
|---|---|---|---|---|
| Anthropic computer use | Anthropic | Oct 2024 | Desktop GUI | First frontier model to control mouse and keyboard via screenshots; public beta with Claude 3.5 Sonnet[15] |
| Project Mariner | Google DeepMind | Dec 2024 | Browser | Gemini 2.0 powered; 83.5% on WebVoyager benchmark[16] |
| Operator | OpenAI | Jan 2025 | Browser | Powered by Computer-Using Agent (CUA); 38.1% OSWorld, 87% WebVoyager[17] |
| ChatGPT Agent | OpenAI | 2025 | Browser + desktop | Successor to Operator integrated into ChatGPT |
| Manus | Monica.im | Mar 2025 | General | Multi-agent VM-based system; 86.5%, 70.1%, 57.7% on GAIA Levels 1-3, beating OpenAI Deep Research at the time[18] |
| Deep Research | OpenAI | 2025 | Research | Long-horizon web research with citation-rich reports |
| Gemini Deep Research | Google | 2024-2025 | Research | Iterative search-and-read inside Gemini |
| Claude Code | Anthropic | 2025 | Coding | Terminal-native coding agent; reads code, edits files, runs commands |
| Devin | Cognition AI | Mar 2024 | Coding | First public autonomous SWE agent; 13.86% on full SWE-bench at launch[19] |
| Cursor | Anysphere | 2023 | Coding (IDE) | AI-first VS Code fork with agent mode |
| Windsurf | Codeium | 2024 | Coding (IDE) | Cascade flows combining suggestions with agentic actions |
| GitHub Copilot Agent | GitHub / Microsoft | 2025 | Coding | Agent mode added to existing Copilot product |
| Cline | Cline | 2024-2025 | Coding | Open-source autonomous coding agent VS Code extension |
| Bolt.new | StackBlitz | 2024 | Web app generation | Browser-based full-stack agent that runs and previews code |
| v0 | Vercel | 2023-2025 | Web app generation | UI-first generation tied to the Next.js ecosystem |
| Lovable | Lovable | 2024-2025 | Web app generation | Conversational app builder targeting non-engineers |
OpenAI's Operator launched on January 23, 2025 and is powered by a Computer-Using Agent (CUA) model that combines GPT-4o vision with reinforcement learning on GUI control. CUA achieved 38.1% on OSWorld for full computer use, 58.1% on WebArena, and 87% on WebVoyager at launch.[17]
Anthropic's Computer Use, released October 22, 2024 alongside the upgraded Claude 3.5 Sonnet, was the first frontier model offering to take screenshots, move a cursor, click buttons, and type on a real desktop through an API. Anthropic explicitly framed the launch as experimental, noting it was "at times cumbersome and error-prone." Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company were the first listed adopters.[15]
Devin, introduced by Cognition Labs on March 12, 2024, was the first stand-alone product built around an autonomous software engineering agent. At launch, Devin resolved 13.86% of issues on the full SWE-bench benchmark, far above the previous best unassisted result of 1.96%, and 18.0% on a 25% random subset.[19] Subsequent specialized coding agents from OpenAI, Anthropic, and others have pushed SWE-bench Verified scores past 70%.
Manus, launched by the Chinese startup Monica.im in March 2025, is a general-purpose agent built on a multi-agent architecture running in dedicated virtual machines. It scored 86.5%, 70.1%, and 57.7% on GAIA Levels 1, 2, and 3 respectively, exceeding OpenAI's Deep Research scores reported at the same time. Within a week of launch, more than two million people joined its waitlist.[18]
Evaluating agents is harder than evaluating models. Benchmarks have to score not just final answers but the agent's ability to interact with environments, recover from errors, use tools, and complete multi-step tasks within budget. The major agent benchmarks of 2025 and 2026 are summarized below.
| Benchmark | Focus | Tasks | Key metric | Authors |
|---|---|---|---|---|
| SWE-bench | Real GitHub issues in Python repos | 2,294 issues; ~500 in Verified subset | % issues resolved | Jimenez et al. (Princeton) |
| WebArena | Realistic web navigation | 812 tasks across 5 sites | Task success rate | Zhou et al. (CMU) |
| WebVoyager | End to end web tasks | 643 tasks across 15 popular sites | Success rate | Tencent / He et al. |
| GAIA | General AI assistant | 466 questions across 3 difficulty levels | Pass rate at each level | Mialon et al. (Meta, HF) |
| tau-bench | Customer support | Retail and airline domains with simulated DB | pass^k reliability | Yao et al. (Sierra) |
| tau2-bench | Tool-agent-user interaction | Telecom and others, shared world state | pass^k reliability | Sierra Research |
| OSWorld | Desktop OS tasks (Ubuntu, macOS, Windows) | 369 real computer tasks | Success rate | Xie et al. (XLang Lab) |
| AgentBench | General agent ability | 8 environments (OS, DB, KG, gaming, web) | Composite score | Liu et al. (Tsinghua) |
| BFCL | Function calling | Single, parallel, multiple, multi-turn tool calls | Accuracy by category | Berkeley Function Calling Leaderboard |
| WebShop | Online shopping | 1.18M products, 12k natural-language goals | Reward, success | Princeton |
| ALFWorld | Embodied text adventure | Multi-step household tasks | Success rate | UToronto / Microsoft |
SWE-bench Verified, the human-validated subset, has become the de facto benchmark for coding agents; leading systems resolve more than 70% of issues by early 2026. OSWorld posed an unusually large gap between human performance (above 72%) and the best machine score (12.24%) when introduced in 2024, and that gap has narrowed but not closed in subsequent updates such as OSWorld-Verified.[20] GAIA Level 3, despite being just 99 questions, remains a strong differentiator between agentic systems because it requires reasoning, browsing, and tool use composed over many steps.
Five shifts define the agent landscape in this period.
Agent capability is improving rapidly. Per-step accuracy of frontier models on agent tasks has roughly doubled across two model generations, and the resulting end-to-end success on multi-step benchmarks has improved disproportionately because of the compound effect.
Every frontier lab ships agentic products. OpenAI (Operator, ChatGPT Agent, Codex, Deep Research), Anthropic (Computer Use, Claude Code, Claude Agent SDK), Google (Project Mariner, Gemini Deep Research, Vertex AI Agent Builder), Microsoft (Copilot, Copilot Studio, AutoGen, Microsoft Agent Framework), and xAI (Grok agentic features) all sell agents directly to end users or developers, in addition to providing the underlying APIs.
Browser and computer use are central. The browser has emerged as the universal interface for agents to act on the web because most enterprise systems lack stable APIs. Computer-use agents extend that to desktop applications. Both are bottlenecked by GUI grounding and slow, expensive vision models.
Code agents are the first commercial success story. Cursor, Devin, Claude Code, Codex, GitHub Copilot Agent, Windsurf, and Cline have collectively turned agentic coding into a multi-billion dollar segment of the AI industry. Cursor's annualized revenue crossed nine figures in 2024; Codex reportedly surpassed two million weekly active users in early 2026.
Interoperability protocols have started to standardize how agents talk to tools and to each other. The Model Context Protocol (model_context_protocol) was announced by Anthropic in November 2024 and adopted within a year by OpenAI, Google DeepMind, and Microsoft. Google introduced the Agent-to-Agent Protocol (A2A) in April 2025 to standardize agent-agent communication. In December 2025 the Agentic AI Foundation, a directed fund under the Linux Foundation, was launched to govern these emerging standards.[2]
Agents remain unreliable in ways that constrain where they can be deployed.
The practical response in 2025 and 2026 has been to keep humans in the loop for high-stakes actions, sandbox agents in isolated environments, scope permissions narrowly, log everything, and cap budget per session.
The distinction between an agent and a chatbot turns on autonomous action. A chatbot answers questions and produces text in response to prompts. An agent decides on its own to call tools, take actions in external systems, and pursue multi-step goals over time. The same underlying LLM can serve both roles depending on the scaffolding around it. ChatGPT in plain conversation is a chatbot; ChatGPT with browsing, code interpreter, and the ability to navigate websites becomes an agent.
Many products blur the line. A coding assistant that suggests a single completion is a chatbot; the same product in agent mode that reads files, runs tests, and applies a multi-file refactor is an agent. The relevant question is whether the system maintains state, takes consequential actions, and makes its own decisions about what to do next.
The concept of an agent in AI has evolved across more than seven decades.
| Year | Milestone |
|---|---|
| 1950 | Alan Turing's "Computing Machinery and Intelligence" proposes the imitation game; agent-style framing implicit |
| 1956 | Dartmouth conference establishes AI as a field; early programs like Logic Theorist embody agent-like behavior |
| 1966 | ELIZA at MIT becomes the first chatbot |
| 1971 | STRIPS introduces formal planning with preconditions and effects |
| 1976 | MYCIN expert system at Stanford diagnoses bacterial infections from rules |
| 1986 | Term "software agent" gains traction in distributed AI research |
| 1995 | Russell and Norvig's Artificial Intelligence: A Modern Approach (1st ed.) makes the agent the central organizing concept of AI |
| 1997 | IBM Deep Blue defeats Garry Kasparov at chess |
| 2013 | DeepMind DQN learns to play Atari games from raw pixels |
| 2016 | AlphaGo defeats Lee Sedol at Go |
| 2017 | AlphaZero masters Go, chess, and shogi via self-play |
| 2019 | OpenAI Five defeats Dota 2 world champions; AlphaStar reaches StarCraft II Grandmaster |
| Oct 2022 | ReAct paper formalizes the reasoning + acting loop for LLM agents |
| Feb 2023 | Toolformer demonstrates self-supervised tool use |
| Mar 2023 | AutoGPT and BabyAGI bring autonomous LLM agents to mainstream attention |
| Apr 2023 | Generative Agents "Smallville" simulation; Stanford UIST 2023 |
| May 2023 | Tree of Thoughts, Voyager Minecraft agent |
| Mar 2024 | Devin debuts as the first commercial autonomous software engineer |
| Apr 2024 | OSWorld benchmark released for real desktop tasks |
| Jun 2024 | tau-bench introduced for customer support agents |
| Oct 2024 | Anthropic releases Computer Use with Claude 3.5 Sonnet |
| Nov 2024 | Anthropic introduces Model Context Protocol (MCP) |
| Dec 2024 | Google unveils Project Mariner browser agent on Gemini 2.0 |
| Jan 2025 | OpenAI launches Operator (Computer-Using Agent / CUA) |
| Mar 2025 | Manus launches in China; tops GAIA leaderboard |
| Apr 2025 | Google introduces Agent-to-Agent (A2A) protocol |
| Dec 2025 | Agentic AI Foundation launched under Linux Foundation; AI agent market estimated at $7.63 billion[2] |
| 2026 | SWE-bench Verified scores from leading agents exceed 70%; non-human and agentic identities projected to surpass 45 billion |