An AI browser agent is a software system that uses artificial intelligence to autonomously navigate, interpret, and interact with web browsers to complete tasks on behalf of a user. These agents can click buttons, fill out forms, scroll through pages, follow links, and perform multi-step workflows across websites, replicating the actions a human would take when using a web browser. Unlike traditional browser automation tools that rely on hard-coded scripts and CSS selectors, AI browser agents use large language models (LLMs) and computer vision to understand web pages dynamically, making them capable of handling unfamiliar websites without site-specific programming.
The development of AI browser agents accelerated rapidly between 2024 and 2025, with major AI companies releasing competing products. Anthropic launched computer use for Claude in October 2024. Google previewed Project Mariner, built on Gemini 2.0, in December 2024. OpenAI released Operator, powered by its Computer-Using Agent (CUA) model, in January 2025. Amazon introduced Nova Act in March 2025. Microsoft integrated computer-using agent capabilities into Copilot Studio in 2025. These releases signaled a broad industry consensus that autonomous browser control represents a natural next step for AI systems.
AI browser agents operate through a continuous loop of perception, reasoning, and action. At each step, the agent observes the current state of the browser, decides what to do next, and then executes an action such as clicking, typing, or scrolling. This loop repeats until the agent determines the task is complete or requires human intervention.
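The loop described above can be sketched in a few lines. This is a minimal illustration with toy stand-ins: the `observe`, `decide`, and state-update logic are hypothetical placeholders for a real browser driver and model call, not any vendor's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "scroll", or "done"
    target: str = ""   # element identifier or coordinates
    text: str = ""     # text to type, if any

def observe(browser_state: dict) -> dict:
    """Perception: capture the current page state (screenshot or DOM)."""
    return {"url": browser_state["url"], "elements": browser_state["elements"]}

def decide(observation: dict, goal: str) -> Action:
    """Reasoning: a real agent would call an LLM here; this is a toy rule."""
    if "search_box" in observation["elements"]:
        return Action(kind="type", target="search_box", text=goal)
    return Action(kind="done")

def run_agent(goal: str, browser_state: dict, max_steps: int = 10) -> list[Action]:
    """Repeat observe -> decide -> act until done or the step budget runs out."""
    trace = []
    for _ in range(max_steps):
        action = decide(observe(browser_state), goal)
        trace.append(action)
        if action.kind == "done":
            break
        if action.kind == "type":
            # Acting: pretend the typed query navigated to a results page
            browser_state = {"url": "results", "elements": ["result_link"]}
    return trace
```

The `max_steps` budget reflects a real design constraint: without it, an agent that never recognizes completion would loop indefinitely.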
The perception step is where the major architectural differences between browser agents emerge. There are two primary approaches: screenshot-based (also called vision-based or pixel-based) and DOM-based (also called structure-based). Some systems use a hybrid of both.
In the screenshot-based approach, the agent captures a screenshot of the browser window and feeds it to a multimodal AI model that can interpret images. The model analyzes the visual layout of the page, identifies interactive elements like buttons and text fields, and returns coordinates for where to click or text to type. This process closely mirrors how a human perceives and interacts with a screen.
OpenAI's CUA model and Anthropic's Claude computer use both employ this approach. The CUA model combines GPT-4o's vision capabilities with advanced reasoning trained through reinforcement learning. It examines a screenshot, reasons about the next steps using chain-of-thought, and issues commands such as `click(x, y)` or `type(text)`. The environment executes these commands and returns a new screenshot, and the cycle repeats.
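On the environment side, the command strings the model emits must be parsed into structured actions before execution. The sketch below is illustrative only; the actual wire formats used by CUA and Claude differ, and this shows just the parse-then-dispatch pattern.

```python
import re

# Illustrative parser for model-emitted commands such as
# 'click(120, 340)' or 'type("hello")'. The command grammar here is
# hypothetical, not any vendor's actual format.

CLICK_RE = re.compile(r"click\((\d+),\s*(\d+)\)")
TYPE_RE = re.compile(r'type\("([^"]*)"\)')

def parse_command(cmd: str) -> dict:
    if m := CLICK_RE.fullmatch(cmd.strip()):
        return {"action": "click", "x": int(m.group(1)), "y": int(m.group(2))}
    if m := TYPE_RE.fullmatch(cmd.strip()):
        return {"action": "type", "text": m.group(1)}
    raise ValueError(f"unrecognized command: {cmd!r}")
```

Raising on unrecognized input matters in practice: a malformed model response should trigger a retry or re-prompt, never a silent no-op.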
Anthropic's computer use capability, introduced with Claude 3.5 Sonnet in October 2024, works similarly. The model receives a screenshot of the current screen state and returns specific pixel coordinates for mouse actions. Before this update, frontier AI models could not reliably provide coordinates on a screenshot; the updated Claude 3.5 Sonnet was the first major model to demonstrate this capability in public beta.
The primary advantage of the screenshot approach is universality. Because it works at the pixel level, it can handle any graphical interface, including web applications, desktop software, canvas-based applications, and graphical elements that have no representation in the DOM. The disadvantages are speed and cost. Screenshot-based agents typically take 2 to 7 seconds per action because each step requires image capture, upload, model inference, and response parsing. The token cost is also higher, since image tokens in models like GPT-4V are 10 to 20 times more expensive than processing equivalent HTML text.
The DOM-based approach reads the Document Object Model, the structured tree representation of a web page's HTML elements. Instead of looking at pixels, the agent parses the underlying code to identify buttons, links, input fields, and other interactive elements. It then reasons about which element to interact with and issues commands directly against the DOM.
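A simplified version of this element-extraction step can be shown with Python's standard-library HTML parser. Real agents inspect a live DOM through a driver such as Playwright; the stdlib parser is used here only to keep the sketch self-contained.

```python
from html.parser import HTMLParser

# Sketch of DOM-based perception: walk the HTML and collect the
# interactive elements (buttons, links, inputs) an agent could act on.

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

class ElementCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            self.elements.append({"tag": tag, "attrs": dict(attrs)})

def interactive_elements(html: str) -> list[dict]:
    """Return the actionable elements found in an HTML snippet."""
    collector = ElementCollector()
    collector.feed(html)
    return collector.elements
```

The resulting element list, rather than a raw screenshot, is what a DOM-based agent passes to the model, which is why its token footprint is so much smaller.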
Google's Project Mariner uses a hybrid approach that incorporates DOM understanding. The agent takes a screenshot of the active Chrome tab and simultaneously analyzes both the visible layout and the underlying HTML structure. This dual understanding of visual elements and code enables more reliable browser automation, particularly on text-heavy pages with forms and structured data.
DOM-based agents are significantly faster, executing actions in 20 to 100 milliseconds compared to several seconds for screenshot-based agents. They also consume fewer tokens, typically 1,000 to 3,000 tokens per page compared to 10,000 or more for vision-based processing. This makes them cheaper to operate at scale. However, DOM-based agents struggle with canvas-based applications, dynamically rendered content that does not appear in the HTML source, and visual elements like images or charts that carry meaning not captured in the DOM.
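The per-page token figures above imply a substantial cost gap at scale. A back-of-envelope estimate makes this concrete; the flat token price below is a hypothetical placeholder, since real pricing varies by model and by text versus image tokens.

```python
# Worked cost comparison using the per-page figures cited above:
# roughly 1,000-3,000 tokens per page for DOM text vs 10,000+ for
# vision processing. The price is an illustrative placeholder.

PRICE_PER_1K_TOKENS = 0.01  # hypothetical flat rate, USD

def task_cost(pages: int, tokens_per_page: int) -> float:
    """Estimated model cost for a multi-page task."""
    return pages * tokens_per_page / 1000 * PRICE_PER_1K_TOKENS

dom_cost = task_cost(pages=20, tokens_per_page=2000)      # mid-range DOM estimate
vision_cost = task_cost(pages=20, tokens_per_page=10000)  # low-end vision estimate
```

Even with vision at its low-end estimate, the same 20-step task costs five times as much as the DOM route, before accounting for the higher per-token price of image inputs.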
Many modern browser agents combine both approaches, using DOM parsing for most steps and falling back to screenshots when the page structure is ambiguous or when visual understanding is required. The open-source Browser Use framework, which gained over 78,000 GitHub stars by early 2026, exemplifies this hybrid strategy. It uses Playwright for browser control, parses the DOM for element identification, and takes screenshots for visual analysis when needed.
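The mode-selection decision in a hybrid agent can be expressed as a simple heuristic. This sketch is not Browser Use's actual logic; the signals and thresholds are illustrative assumptions.

```python
# Heuristic sketch of a hybrid agent's perception choice: prefer cheap
# DOM parsing, but fall back to a screenshot when the page structure
# is uninformative.

def choose_perception_mode(dom_elements: list[dict], has_canvas: bool) -> str:
    if has_canvas:
        return "screenshot"  # canvas content has no useful DOM representation
    if len(dom_elements) == 0:
        return "screenshot"  # nothing actionable found in the HTML
    return "dom"
```

In production systems the trigger conditions are richer (ambiguous labels, overlapping elements, failed prior actions), but the cost-driven ordering is the same: try the cheap path first.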
Several major technology companies and startups have released AI browser agents, each with different technical approaches and target audiences.
OpenAI launched Operator on January 23, 2025, as a research preview available to ChatGPT Pro subscribers in the United States. Operator is powered by the Computer-Using Agent (CUA) model, which combines GPT-4o's vision capabilities with reasoning trained through reinforcement learning to interact with graphical user interfaces. The CUA model can identify page elements, break tasks into multi-step plans, and self-correct when challenges arise.
Operator runs in a dedicated browser environment at operator.chatgpt.com. Users provide a natural language instruction, and the agent navigates websites to complete the task. Safety measures include asking for user confirmation before finalizing actions with external side effects, such as submitting an order or sending an email.
In October 2025, OpenAI launched ChatGPT Atlas, a dedicated browser product that consolidated the capabilities previously available through Operator. Atlas introduced Agent Mode as a core feature, allowing the browser to execute multi-step tasks autonomously. The Operator site was deprecated, and its agentic capabilities were integrated into both the main ChatGPT interface and the Atlas browser. Agent Mode in Atlas is available to Plus, Pro, and Business users.
Anthropic introduced computer use as a public beta on October 22, 2024, making Claude 3.5 Sonnet the first frontier AI model to offer autonomous computer control capabilities at this level. The feature allows developers to direct Claude to interact with computers by looking at screenshots, moving a cursor, clicking buttons, and typing text.
Unlike OpenAI's approach, Anthropic does not provide hosted virtual machines. Instead, developers call the Claude API, sending both text instructions and screenshots of the computer being controlled. Claude responds with commands describing what actions to perform next. The model can use three tools: a Computer tool (providing access to screen, keyboard, and mouse), a Text Editor, and a Bash terminal.
At launch, Claude 3.5 Sonnet scored 14.9% on OSWorld's screenshot-only category, which was notably better than the next-best system's score of 7.8% at that time. Anthropic has continued to improve the feature, and in August 2025, launched a research preview of Claude for Chrome, a browser-based AI agent available as a Chrome extension for a limited group of subscribers.
Google DeepMind unveiled Project Mariner on December 11, 2024, alongside the launch of Gemini 2.0. Project Mariner is a research prototype implemented as a Chrome extension that uses the browser's side panel interface. It can understand and reason across information on the screen, including pixels, web elements, text, code, images, and forms.
Project Mariner operates using an Observe-Plan-Act loop. The Observe step captures the current browser state, including visual elements and underlying code structures. The Plan step analyzes the captured data and formulates a sequence of actions. The Act step executes the planned actions by simulating user interactions. This combination of visual and structural understanding distinguishes it from purely screenshot-based agents.
On the WebVoyager benchmark, Project Mariner achieved a state-of-the-art score of 83.5% as a single-agent setup at the time of its announcement. The agent can simultaneously handle up to 10 different tasks and includes a Teach and Repeat feature that allows users to demonstrate workflows for the agent to learn and replicate.
Amazon introduced Nova Act on March 31, 2025, as the first public product from Amazon's AGI lab. This lab is co-led by former Adept co-founder David Luan and former Covariant co-founder Pieter Abbeel, both of whom Amazon hired away from their respective startups in 2024.
Nova Act is powered by a custom Nova 2 Lite model, a small and fast multimodal model in the Amazon Nova family, trained specifically for browser task completion. The Nova Act SDK allows developers to build agents that complete step-by-step browser tasks such as submitting time-off requests or placing online orders. The system is designed to handle interface elements that often trip up other agents, including drop-down menus, date pickers, and pop-up dialogs.
Microsoft introduced computer-using agent capabilities within Copilot Studio, first available in public preview for United States-based environments. These agents can operate applications and websites directly using a virtual mouse and keyboard, clicking, typing, and navigating user interfaces. A hosted browser powered by Windows 365 provides the automation environment without requiring users to configure their own machines.
As of late 2025, computer-using agents in Copilot Studio support multiple foundation models, including Anthropic's Claude Sonnet 4.5 alongside OpenAI's CUA, giving developers flexibility to choose the best model for each task and interface type.
Several open-source projects and startups have built browser agent platforms.
Browser Use is an open-source Python library that became the most popular browser automation framework for AI agents, reaching over 78,000 GitHub stars. It uses Playwright for browser control and a hybrid DOM-plus-vision approach for page understanding. Browser Use scored 89.1% on the WebVoyager benchmark.
Skyvern is a Y Combinator (S23) startup that automates browser-based workflows using LLMs and computer vision. It has helped customers automate tasks such as job applications, invoice retrieval, government form completion, and product purchases across hundreds of websites. Skyvern scored 85.8% on WebVoyager.
BrowserGym is an open-source framework from ServiceNow that provides a unified Gym environment for web agent research. It integrates multiple benchmarks including WebArena, VisualWebArena, and WorkArena, and includes AgentLab, a companion framework for implementing, testing, and evaluating web agents.
MultiOn is a startup that provides an API for autonomous web agents capable of executing tasks such as event planning, service booking, and workflow automation. The platform includes built-in handling for authentication, CAPTCHAs, and dynamic content.
| Agent | Developer | Release Date | Perception Method | Foundation Model | WebVoyager Score | WebArena Score | Availability |
|---|---|---|---|---|---|---|---|
| Operator / Atlas | OpenAI | January 2025 / October 2025 | Screenshot-based | CUA (GPT-4o + RL) | 87.0% | 58.1% | ChatGPT Plus, Pro, Business |
| Computer Use | Anthropic | October 2024 | Screenshot-based | Claude 3.5 Sonnet | N/A | N/A | API (public beta) |
| Project Mariner | Google DeepMind | December 2024 | Hybrid (pixels + DOM) | Gemini 2.0 | 83.5% | N/A | Chrome extension (limited preview) |
| Nova Act | Amazon | March 2025 | Screenshot-based | Nova 2 Lite | N/A | N/A | SDK (research preview) |
| Copilot Studio CUA | Microsoft | 2025 | Screenshot-based | CUA / Claude Sonnet 4.5 | N/A | N/A | Public preview (US) |
| Browser Use | Open-source | 2024 | Hybrid (DOM + vision) | Various (pluggable) | 89.1% | N/A | Open-source (MIT) |
| Skyvern | Skyvern (YC S23) | June 2024 | Hybrid (LLM + CV) | Various | 85.8% | N/A | SaaS + open-source |
Evaluating AI browser agents requires specialized benchmarks that test the ability to complete realistic, multi-step web tasks. Several benchmarks have become standard in the field.
WebArena was published at ICLR 2024 and is widely considered the most rigorous benchmark for web agents. It provides a self-hosted environment with fully functional copies of websites across four domains: e-commerce (modeled on shopping sites), social forums (modeled on Reddit), collaborative software development (modeled on GitLab), and content management (modeled on a CMS). The environment also includes tools like maps and user manuals to encourage realistic task-solving.
WebArena contains 812 long-horizon tasks described in natural language. Evaluation focuses on functional correctness: programmatic validators check whether the result of execution actually achieves the desired goal, rather than simply matching text output. Human performance on WebArena is 78.24%. When the benchmark was first published, the best GPT-4-based agent achieved only 14.41%. By early 2025, OpenAI's CUA model had pushed this to 58.1%, and by mid-2025, the best agents reached approximately 61.7%. The gap between human and AI performance remains significant, highlighting that general web task automation is still an unsolved problem.
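The functional-correctness idea can be illustrated with a toy validator. WebArena's real validators query the hosted site backends; here a plain dictionary stands in for that environment state.

```python
# Sketch of WebArena-style functional evaluation: instead of comparing
# the agent's text output, a validator inspects the resulting
# environment state. The dict is a toy stand-in for a site backend.

def validate_order_placed(env_state: dict, expected_item: str) -> bool:
    """Task passes only if an order for the expected item actually exists."""
    return any(order["item"] == expected_item
               for order in env_state.get("orders", []))

# An agent that merely *claims* success fails; one that changed the
# environment state passes.
before = {"orders": []}
after = {"orders": [{"item": "desk lamp", "qty": 1}]}
```

This is why the benchmark is hard to game: fluent but ineffective agents score zero, because the check is on what the site's state became, not on what the agent said.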
VisualWebArena extends WebArena by introducing 910 tasks across three web applications (a classifieds site, a shopping site, and a Reddit-like forum) that explicitly require visual understanding. Tasks involve reasoning about images, spatial layouts, and visual elements that cannot be understood from HTML alone. This benchmark is particularly important for evaluating multimodal agents that must process both visual and textual information.
OSWorld, introduced at NeurIPS 2024, goes beyond web-only tasks to evaluate agents in a full desktop environment. It provides 369 diverse tasks on Ubuntu Linux and Windows operating systems, requiring agents to interact with multiple desktop applications, file systems, and system settings in addition to web browsers. OpenAI's CUA model achieved 38.1% on OSWorld, representing a notable step toward general computer use, though still far from human-level performance.
WebVoyager, published at ACL 2024, comprises 643 manually validated tasks distributed across 15 high-traffic websites. Tasks span diverse web functionalities including search engines, e-commerce platforms, and social media sites. The benchmark uses an automatic evaluation protocol leveraging GPT-4V's multimodal understanding to assess task completion, achieving 85.3% agreement with human judgment. Scores on WebVoyager have climbed rapidly: the original paper reported 59.1%, while later agents pushed far past that, with Browser Use reaching 89.1%, Magnitude 93.9%, and Surfer 2 reporting 97.1% by early 2026.
| Benchmark | Venue | Year | Number of Tasks | Environment | Human Performance | Best Agent (as of early 2026) |
|---|---|---|---|---|---|---|
| WebArena | ICLR | 2024 | 812 | Self-hosted websites (4 domains) | 78.24% | ~61.7% |
| VisualWebArena | ICLR | 2024 | 910 | Self-hosted websites (3 apps) | N/A | N/A |
| OSWorld | NeurIPS | 2024 | 369 | Full desktop (Ubuntu/Windows) | 72.4% | 38.1% (CUA) |
| WebVoyager | ACL | 2024 | 643 | Live websites (15 sites) | N/A | 97.1% (Surfer 2) |
The development of AI browser agents has roots in research dating back to the early 2020s.
OpenAI's WebGPT project, published in December 2021, was one of the earliest demonstrations of an LLM interacting with web content to answer questions. The researchers fine-tuned GPT-3 to use a text-based web browser, where the model could submit search queries, follow links, and scroll through pages. Unlike modern browser agents that operate on visual interfaces, WebGPT interacted with a simplified text representation of web pages.
The model was trained using behavior cloning from human demonstrations, then refined using rejection sampling against a reward model trained to predict human preferences. In evaluations, WebGPT's answers were preferred over human-written answers 56% of the time and were preferred over the highest-voted Reddit answers 69% of the time. On TruthfulQA, the model produced true answers 75% of the time.
Mind2Web, presented as a Spotlight paper at NeurIPS 2023, introduced the first large-scale dataset and benchmark for building generalist web agents. The dataset contains 2,350 tasks spanning 137 websites across 31 domains. The accompanying MindAct model uses a two-stage approach: a fine-tuned small language model first filters relevant web elements from the page, and then a larger LLM selects from the filtered elements and predicts the corresponding action. Mind2Web was a key contribution from the OSU NLP Group and demonstrated that LLMs could be trained for generalized web interaction rather than site-specific automation.
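The filter-then-select pattern behind MindAct can be sketched with toy scoring functions standing in for both models. The substring-overlap ranker and the "take the top candidate" selector below are illustrative placeholders, not MindAct's trained components.

```python
# Sketch of MindAct's two-stage pattern: a cheap ranker first filters
# candidate elements, then a stronger model picks one from the short list.

def rank_elements(elements: list[dict], query: str, top_k: int = 3) -> list[dict]:
    """Stage 1: keep the elements whose text overlaps the query terms."""
    scored = [(sum(w in el["text"].lower() for w in query.lower().split()), el)
              for el in elements]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [el for score, el in scored[:top_k] if score > 0]

def select_action(candidates: list[dict], query: str) -> dict:
    """Stage 2: a real system prompts an LLM with the short list;
    here we simply take the top-ranked candidate."""
    if not candidates:
        return {"action": "none"}
    return {"action": "click", "element": candidates[0]}
```

The design rationale is cost: the expensive model never sees the full page, only the handful of elements the cheap filter lets through.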
SeeAct, also from the OSU NLP Group and published at ICML 2024, advanced the field by demonstrating that multimodal models could serve as generalist web agents. The system uses GPT-4V to visually perceive websites and generate action plans in textual form. SeeAct instructs the model to imitate human browsing behavior, analyzing the task, current webpage, and previous actions to determine the next step. In experiments, SeeAct successfully completed 51.1% of tasks on live websites when provided with manual grounding for action execution, demonstrating the potential and current limitations of vision-based web agents.
Adept, a startup founded by former OpenAI and Google researchers, developed ACT-1 (Action Transformer), one of the earliest models trained specifically to operate computer interfaces. In a notable early demonstration, ACT-1 navigated complex enterprise procurement software entirely from a natural language instruction, which helped Adept secure $350 million in funding in 2023. However, Adept never released a public product. In June 2024, Amazon hired Adept's co-founders and key employees, and licensed the company's technology, models, and datasets. This talent and technology contributed directly to the development of Amazon Nova Act.
AI browser agents face several significant technical challenges that limit their reliability and applicability.
Modern websites frequently use JavaScript frameworks to render content dynamically, load data asynchronously, and update the page without full reloads. Elements may appear, disappear, or change position based on user interactions, network conditions, or server responses. Browser agents must handle loading states, wait for elements to become interactive, and adapt to pages that look different on each visit due to personalized content, A/B testing, or time-sensitive information.
Websites deploy CAPTCHAs, rate limiting, browser fingerprinting, and behavioral analysis to prevent automated access. These mechanisms are specifically designed to block non-human interactions, creating a fundamental tension with browser agents, whose purpose is to automate exactly the interactions these defenses are meant to stop. Modern CAPTCHA systems have moved beyond simple image recognition to analyze behavioral signals such as mouse movement patterns, typing cadence, and navigation timing, making them difficult for agents to bypass.
Some emerging solutions address this challenge. Amazon's Bedrock AgentCore Browser uses Web Bot Auth, a draft IETF protocol that cryptographically identifies AI agents to websites and bot control vendors, reducing CAPTCHA friction for verified agents. This approach establishes a legitimate identity for the agent rather than attempting to disguise automated behavior.
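The shape of that exchange can be sketched as a signed request header that the site verifies. Note this is illustrative only: Web Bot Auth builds on HTTP Message Signatures with asymmetric keys, which requires third-party cryptography libraries, so an HMAC with a shared key stands in here purely to show the sign-and-verify flow.

```python
import hmac
import hashlib

# Illustrative stand-in for agent identification: the agent signs its
# request, and the site verifies the signature before easing bot checks.
# The real protocol uses public-key HTTP Message Signatures, not HMAC.

SHARED_KEY = b"demo-key"  # hypothetical; real deployments use asymmetric keys

def sign_request(agent_id: str, method: str, path: str) -> dict:
    message = f"{agent_id}:{method}:{path}".encode()
    tag = hmac.new(SHARED_KEY, message, hashlib.sha256).hexdigest()
    return {"Signature-Agent": agent_id, "Signature": tag}

def verify_request(headers: dict, method: str, path: str) -> bool:
    message = f"{headers['Signature-Agent']}:{method}:{path}".encode()
    expected = hmac.new(SHARED_KEY, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers["Signature"])
```

The key property is that identity is asserted cryptographically rather than inferred from behavior, so the site can grant verified agents passage without weakening its defenses against unidentified bots.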
Many useful web tasks require logging into accounts, which raises both technical and security challenges. Agents need access to credentials, and providing those credentials to an AI system introduces risks around data exposure and unintended actions. Most browser agent platforms handle this by asking users to log in manually before handing control to the agent, or by requesting explicit confirmation before any authentication step.
Web interactions can fail in unpredictable ways: pages may not load, buttons may be obscured by pop-ups or cookie consent banners, dropdown menus may behave differently across browsers, and network errors can interrupt multi-step workflows. Reliable browser agents need robust error detection and recovery strategies, including the ability to recognize when they are stuck and try alternative approaches.
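A common recovery pattern is retry-with-backoff followed by a switch to an alternative strategy. The sketch below assumes hypothetical action callables standing in for real browser steps such as clicking a button or dismissing a pop-up.

```python
import time

# Sketch of a recovery strategy: retry a flaky action with exponential
# backoff, then fall back to an alternative approach once the primary
# one keeps failing.

def with_recovery(primary, fallback, retries: int = 3, delay: float = 0.0):
    """Run `primary` up to `retries` times; on repeated failure, run `fallback`."""
    for attempt in range(retries):
        try:
            return primary()
        except RuntimeError:
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    # e.g. dismiss a cookie banner first, then click via a different selector
    return fallback()
```

The harder part in practice is the detection side, recognizing that an action silently failed (the page did not change) rather than raising an error, which is why agents typically re-observe the page after every action.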
Screenshot-based agents that call large multimodal models for every action are inherently slow and expensive. A task that takes a human 30 seconds might take a browser agent several minutes due to the latency of model inference at each step. For enterprise applications requiring high throughput, this overhead can be prohibitive. Reducing cost and latency while maintaining accuracy remains an active area of engineering.
AI browser agents introduce novel safety and security concerns that go beyond those of traditional AI systems.
Prompt injection is the most widely discussed security risk for browser agents. In this attack, malicious instructions are hidden within web page content, emails, or documents that the agent processes. Because the agent interprets both user instructions and web content, an attacker can embed text on a web page that instructs the agent to perform unintended actions, such as extracting personal data, making purchases, or clicking malicious links.
OpenAI has acknowledged that prompt injection "is unlikely to ever be fully solved" for browser agents, comparing it to scams and social engineering on the web. Brave's security research has confirmed that indirect prompt injection is a systemic challenge affecting all AI-powered browsers, not an isolated issue. Researchers have demonstrated that invisible prompt injections can be embedded in screenshots, making them undetectable to human oversight.
Defense strategies include training agents to distinguish between user instructions and third-party content, requiring user confirmation before high-stakes actions, and using adversarial testing with LLM-based automated attackers. OpenAI uses a reinforcement-learning-trained attacker to probe for vulnerabilities in its Atlas browser. Anthropic has published research on layered defenses against prompt injection for browser agents.
Because browser agents can take real-world actions such as sending messages, making purchases, and modifying account settings, errors carry tangible consequences. An agent that misinterprets a task or encounters an unexpected page state could place an incorrect order, send a message to the wrong person, or change critical settings. Most commercial browser agents mitigate this by requiring human confirmation before irreversible actions, but this limits the degree of autonomy and offsets some of the efficiency gains.
Browser agents necessarily process sensitive information visible on the user's screen, including emails, financial data, personal messages, and account credentials. This data passes through the AI model's inference pipeline, raising questions about data retention, model training on user data, and compliance with privacy regulations. Users must trust that the agent provider handles this data appropriately, and enterprises must evaluate browser agents against their data governance policies.
Browser agents could be misused for spam, fraud, credential stuffing, or other malicious activities. The ability to automate browser interactions at scale lowers the barrier for attacks that previously required human effort. Agent providers implement usage policies and monitoring to detect and prevent abuse, but the dual-use nature of the technology means that safeguards must continually evolve.
AI browser agents are being applied across a wide range of use cases.
Enterprise workflow automation: Agents automate repetitive browser-based tasks such as data entry across multiple systems, invoice processing, compliance form completion, and report generation. This is particularly valuable when the target applications lack APIs and can only be accessed through a web interface.
Personal task automation: Consumer-facing agents handle tasks like booking travel, ordering groceries, comparing prices across shopping sites, scheduling appointments, and filling out forms. OpenAI's Operator and Atlas target this category.
Software testing: Browser agents can be used to test web applications by simulating user interactions, identifying broken workflows, and verifying that pages render correctly. Their ability to navigate unfamiliar interfaces makes them useful for exploratory testing.
Research and data collection: Agents can gather information from multiple websites, compile results, and synthesize findings, going beyond simple web scraping to tasks that require navigating complex multi-page workflows.
Accessibility: Browser agents have the potential to make the web more accessible to people with disabilities by performing complex navigation tasks on their behalf, effectively translating high-level intents into detailed browser interactions.
Several trends are shaping the future of AI browser agents. Agent reliability continues to improve as models become better at understanding web interfaces, recovering from errors, and handling edge cases. The gap between human and agent performance on benchmarks like WebArena is closing, though it remains significant.
Standardization efforts such as the Web Bot Auth IETF protocol aim to create legitimate channels for AI agents to identify themselves to websites, potentially reducing friction from anti-bot measures while maintaining security. Multi-agent architectures, where multiple specialized agents collaborate on complex tasks, are being explored to improve reliability and handle tasks that span many websites.
The integration of browser agents into existing products, as demonstrated by Microsoft's Copilot Studio and OpenAI's Atlas browser, suggests that browser agent capabilities will become a standard feature of productivity software rather than standalone tools. As these systems mature, the balance between autonomy and human oversight will remain a central design challenge.