AI browser agent
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 8,010 words
An AI browser agent is a software system that uses artificial intelligence to autonomously navigate, interpret, and interact with web browsers to complete tasks on behalf of a user. These agents can click buttons, fill out forms, scroll through pages, follow links, and perform multi-step workflows across websites, replicating the actions a human would take when using a web browser. Unlike traditional browser automation tools that rely on hard-coded scripts and CSS selectors, AI browser agents use large language models (LLMs) and computer vision to understand web pages dynamically, making them capable of handling unfamiliar websites without site-specific programming.[1][2]
The development of AI browser agents accelerated rapidly between 2024 and 2025, with major AI companies releasing competing products. Anthropic launched computer use for Claude in October 2024.[1] Google previewed Project Mariner, built on Gemini 2.0, in December 2024.[3] OpenAI released Operator, powered by its Computer-Using Agent (CUA) model, in January 2025.[4] Amazon introduced Nova Act in March 2025.[5] Microsoft integrated computer-using agent capabilities into Copilot Studio in 2025.[6] By 2026, the category had matured into a multibillion-dollar product space spanning consumer chatbots, enterprise automation suites, and open-source developer libraries.
AI browser agents operate through a continuous loop of perception, reasoning, and action. At each step, the agent observes the current state of the browser, decides what to do next, and then executes an action such as clicking, typing, or scrolling. This loop repeats until the agent determines the task is complete or requires human intervention.[1]
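The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: `Action`, `FakeBrowser`, `scripted_policy`, and `run_agent` are hypothetical names, and the scripted policy stands in for the model call a real agent would make at each step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    kind: str           # "click", "type", "scroll", "done", or "ask_human"
    argument: str = ""


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a real browser; it only records actions."""
    url: str = "https://example.com/form"
    log: List[str] = field(default_factory=list)

    def observe(self) -> str:
        # A real agent would return a screenshot or a DOM snapshot here.
        return f"url={self.url}; actions_so_far={len(self.log)}"

    def execute(self, action: Action) -> None:
        self.log.append(f"{action.kind}:{action.argument}")


def scripted_policy(observation: str) -> Action:
    """Stand-in for the reasoning step: replays a fixed three-step plan."""
    steps_done = int(observation.rsplit("=", 1)[1])
    plan = [Action("click", "#name"), Action("type", "Ada"), Action("click", "#submit")]
    return plan[steps_done] if steps_done < len(plan) else Action("done")


def run_agent(browser: FakeBrowser,
              decide: Callable[[str], Action],
              max_steps: int = 10) -> str:
    """Perceive, reason, act; stop on completion, hand-off, or step limit."""
    for _ in range(max_steps):
        observation = browser.observe()      # perception
        action = decide(observation)         # reasoning
        if action.kind in ("done", "ask_human"):
            return action.kind
        browser.execute(action)              # action
    return "step_limit"
```

The step limit is a design choice in this sketch: it keeps a policy that never signals completion from looping indefinitely.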
The perception step is where the major architectural differences between browser agents emerge. There are two primary approaches: screenshot-based (also called vision-based or pixel-based) and DOM-based (also called structure-based). Some systems use a hybrid of both.
In the screenshot-based approach, the agent captures a screenshot of the browser window and feeds it to a multimodal AI model that can interpret images. The model analyzes the visual layout of the page, identifies interactive elements like buttons and text fields, and returns coordinates for where to click or text to type. This process closely mirrors how a human perceives and interacts with a screen.[1]
OpenAI's CUA model and Anthropic's Claude computer use both employ this approach. The CUA model combines GPT-4o's vision capabilities with advanced reasoning trained through reinforcement learning. It examines a screenshot, reasons about the next steps using chain-of-thought, and issues commands such as click(x, y) or type(text). The environment executes these commands and returns a new screenshot, and the cycle repeats.[4]
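To make the command format concrete, the sketch below parses textual commands of the form click(x, y) and type(text) and rejects clicks that fall outside the screenshot. It assumes commands arrive as plain strings; the actual wire format used by CUA is not described here, and `parse_command` is a hypothetical helper.

```python
import re
from typing import Tuple, Union

# Patterns for the two command shapes mentioned in the text.
CLICK_PATTERN = re.compile(r"^click\(\s*(\d+)\s*,\s*(\d+)\s*\)$")
TYPE_PATTERN = re.compile(r"^type\((.*)\)$", re.DOTALL)

Command = Tuple[str, Union[Tuple[int, int], str]]


def parse_command(command: str, width: int, height: int) -> Command:
    """Turn a model-issued command string into a validated (kind, payload) pair."""
    command = command.strip()

    match = CLICK_PATTERN.match(command)
    if match:
        x, y = int(match.group(1)), int(match.group(2))
        # Coordinates must land inside the screenshot the model was shown.
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"click target ({x}, {y}) is outside {width}x{height}")
        return ("click", (x, y))

    match = TYPE_PATTERN.match(command)
    if match:
        return ("type", match.group(1))

    raise ValueError(f"unrecognized command: {command!r}")
```

Validating coordinates before execution is one way for the environment to catch a malformed model response before it reaches the browser.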
Anthropic's computer use capability, introduced with Claude 3.5 Sonnet in October 2024, works similarly. The model receives a screenshot of the current screen state and returns specific pixel coordinates for mouse actions. Before this update, frontier AI models could not reliably provide coordinates on a screenshot; the updated Claude 3.5 Sonnet was the first major model to demonstrate this capability in public beta.[1]
The primary advantage of the screenshot approach is universality. Because it works at the pixel level, it can handle any graphical interface, including web applications, desktop software, canvas-based applications, and graphical elements that have no representation in the DOM. The disadvantages are speed and cost. Screenshot-based agents typically take 2 to 7 seconds per action because each step requires image capture, upload, model inference, and response parsing. The token cost is also higher: in models like GPT-4V, representing a page as an image can cost 10 to 20 times as much as processing the equivalent HTML text.
The DOM-based approach reads the Document Object Model, the structured tree representation of a web page's HTML elements. Instead of looking at pixels, the agent parses the underlying code to identify buttons, links, input fields, and other interactive elements. It then reasons about which element to interact with and issues commands directly against the DOM.
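A minimal sketch of this idea, using only Python's standard library, collects the interactive elements from an HTML string and gives each one an index that a model could refer to. It is an illustration under simplifying assumptions: real agents query the browser's live DOM through a driver rather than parsing static HTML, and `InteractiveElementCollector` and `INTERACTIVE_TAGS` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Tags treated as interactive in this sketch; a real agent would also
# consider ARIA roles, click handlers, and visibility.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementCollector(HTMLParser):
    """Walks the markup and records each interactive element in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            attr_map = dict(attrs)
            self.elements.append({
                "index": str(len(self.elements)),
                "tag": tag,
                "id": attr_map.get("id"),
                "type": attr_map.get("type"),
            })


def list_interactive_elements(html: str) -> List[Dict[str, Optional[str]]]:
    collector = InteractiveElementCollector()
    collector.feed(html)
    return collector.elements
```

The resulting list is compact text, which is why this representation tends to use far fewer tokens than a screenshot.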
Google's Project Mariner uses a hybrid approach that incorporates DOM understanding. The agent takes a screenshot of the active Chrome tab and simultaneously analyzes both the visible layout and the underlying HTML structure. This dual understanding of visual elements and code enables more reliable browser automation, particularly on text-heavy pages with forms and structured data.[3]
DOM-based agents are significantly faster, executing actions in 20 to 100 milliseconds compared to several seconds for screenshot-based agents. They also consume fewer tokens, typically 1,000 to 3,000 tokens per page compared to 10,000 or more for vision-based processing. This makes them cheaper to operate at scale. However, DOM-based agents struggle with canvas-based applications, dynamically rendered content that does not appear in the HTML source, and visual elements like images or charts that carry meaning not captured in the DOM.
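As a rough worked example using the ranges quoted above, consider a hypothetical 30-step task with one page observation per step. Both the step count and the one-observation-per-step assumption are chosen for illustration only, and the DOM latency figure covers action execution rather than model inference.

```python
from typing import Tuple


def task_range(steps: int, low: int, high: int) -> Tuple[int, int]:
    """Scale a per-step (low, high) range to a whole task."""
    return steps * low, steps * high


STEPS = 30  # hypothetical task length

# Per-action latency in milliseconds, from the ranges in the text.
dom_latency_ms = task_range(STEPS, 20, 100)
screenshot_latency_ms = task_range(STEPS, 2_000, 7_000)

# Tokens per page observation, from the ranges in the text.
dom_tokens = task_range(STEPS, 1_000, 3_000)
vision_tokens_min = STEPS * 10_000  # "10,000 or more" gives only a lower bound

print("DOM latency (ms):", dom_latency_ms)
print("Screenshot latency (ms):", screenshot_latency_ms)
print("DOM tokens:", dom_tokens)
print("Vision tokens (minimum):", vision_tokens_min)
```

Under these assumptions the DOM-based run spends 0.6 to 3 seconds executing actions and reads 30,000 to 90,000 tokens, while the screenshot-based run takes 60 to 210 seconds and consumes at least 300,000 tokens.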
Many modern browser agents combine both approaches, using DOM parsing for most steps and falling back to screenshots when the page structure is ambiguous or when visual understanding is required. The open-source Browser Use framework, which gained over 78,000 GitHub stars by early 2026, exemplifies this hybrid strategy.[7] It uses Playwright for browser control, parses the DOM for element identification, and takes screenshots for visual analysis when needed.
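The fallback decision can be illustrated with a small sketch. It assumes the agent already has a list of text labels for DOM elements and treats "exactly one match" as unambiguous; this rule and the name `choose_perception` are hypothetical and are not taken from Browser Use.

```python
from typing import List, Optional, Tuple


def choose_perception(target_label: str,
                      dom_labels: List[str]) -> Tuple[str, Optional[int]]:
    """Prefer the DOM when it identifies the target uniquely; otherwise use vision."""
    needle = target_label.lower()
    matches = [i for i, label in enumerate(dom_labels) if needle in label.lower()]
    if len(matches) == 1:
        # Unambiguous: act directly on the matching DOM element.
        return ("dom", matches[0])
    # Zero matches (e.g. a canvas widget) or several matches: take a screenshot.
    return ("screenshot", None)
```

For example, a page with two buttons both labelled "Submit" would route to the screenshot path, where layout can disambiguate them.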
A common control structure across browser agents is an Observe-Plan-Act loop in which the agent first captures the current browser state, then reasons about a sequence of next actions, and finally executes one action before re-observing.[3] Some agents extend this with a Reflect or Critic phase, where a separate model checks whether the previous action achieved its goal. Browserbase's Stagehand framework formalizes this loop through four primitive operations called act, extract, observe, and agent, allowing developers to mix deterministic Playwright code with AI-resolved natural-language instructions.[8]
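The Reflect or Critic extension can be sketched as a check that runs after each action and triggers a retry on failure. This is an assumption-laden illustration: `act_with_critic` is a hypothetical helper, and the critic here is an ordinary function rather than a separate model.

```python
from typing import Callable


def act_with_critic(execute: Callable[[str], str],
                    critic: Callable[[str, str], bool],
                    action: str,
                    max_attempts: int = 3) -> int:
    """Execute an action, let the critic judge the result, and retry if needed.

    Returns the number of attempts used; raises if the critic never approves.
    """
    for attempt in range(1, max_attempts + 1):
        result = execute(action)
        if critic(action, result):
            return attempt
    raise RuntimeError(f"action {action!r} failed after {max_attempts} attempts")
```

Separating the critic from the executor means the check can be swapped for a cheaper or stricter one without touching the action logic.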
Several major technology companies and startups have released AI browser agents, each with different technical approaches and target audiences.
OpenAI launched Operator on January 23, 2025, as a research preview available to ChatGPT Pro subscribers in the United States.[4] Operator was powered by the Computer-Using Agent (CUA) model, which combines GPT-4o's vision capabilities with reasoning trained through reinforcement learning to interact with graphical user interfaces. The CUA model can identify page elements, break tasks into multi-step plans, and self-correct when challenges arise.[9] On the WebArena benchmark, the CUA model scored 58.1%, and on OSWorld it scored 38.1%. At the time this was a notable step toward general computer use, although still well short of human performance.[9]
Operator ran in a dedicated browser environment at operator.chatgpt.com. Users provided a natural language instruction, and the agent navigated websites to complete the task. Safety measures included asking for user confirmation before finalizing actions with external side effects, such as submitting an order or sending an email.[4]
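A confirmation gate of this kind can be sketched as follows. The set of side-effect actions and the names `SIDE_EFFECT_ACTIONS` and `guarded_execute` are hypothetical; the source does not describe how Operator classified actions internally.

```python
from typing import Callable, List

# Hypothetical list of actions that change something outside the browser session.
SIDE_EFFECT_ACTIONS = {"submit_order", "send_email"}


def guarded_execute(action: str,
                    execute: Callable[[str], None],
                    confirm: Callable[[str], bool]) -> str:
    """Run the action, but ask the user first if it has external side effects."""
    if action in SIDE_EFFECT_ACTIONS and not confirm(action):
        return "declined"
    execute(action)
    return "executed"


# Usage: the user declines the order, so only the scroll is performed.
performed: List[str] = []
print(guarded_execute("scroll_down", performed.append, lambda a: False))
print(guarded_execute("submit_order", performed.append, lambda a: False))
```

Read-only actions bypass the prompt entirely, so the user is interrupted only when something irreversible is about to happen.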
On July 17, 2025, OpenAI launched ChatGPT Agent, a unified agentic system that combined Operator's action-taking remote browser, deep research's web synthesis, and ChatGPT's conversational interface.[10] ChatGPT Agent uses its own virtual computer to handle requests such as briefing a user on upcoming meetings based on calendar and recent news, planning and buying ingredients for a meal, or analyzing competitors and creating a slide deck. ChatGPT Agent was made available to Pro users at launch, with Plus, Team, Enterprise, and Education tiers following over subsequent weeks. Operator was deprecated after the release of ChatGPT Agent and the operator.chatgpt.com site was shut down on August 31, 2025.[11]
In October 2025, OpenAI launched ChatGPT Atlas, a Chromium-based browser product that consolidated agent capabilities into a dedicated web browser for macOS.[12] Atlas introduced Agent Mode as a core feature, allowing the browser to execute multi-step tasks autonomously, available in preview to Plus, Pro, and Business users. Agent Mode in Atlas is restricted from running code in the browser, downloading files, installing extensions, or accessing other applications on the user's computer, and pauses on sensitive sites such as financial institutions.[12]
Anthropic introduced computer use as a public beta on October 22, 2024, making Claude 3.5 Sonnet the first frontier AI model to offer autonomous computer control capabilities at this level.[1] The feature allows developers to direct Claude to interact with computers by looking at screenshots, moving a cursor, clicking buttons, and typing text. Unlike OpenAI's hosted approach, Anthropic does not provide hosted virtual machines. Instead, developers call the Claude API, sending both text instructions and screenshots of the computer being controlled. Claude responds with commands describing what actions to perform next. The model can use three tools: a Computer tool (providing access to screen, keyboard, and mouse), a Text Editor, and a Bash terminal.[1]
At launch, Claude 3.5 Sonnet scored 14.9% on OSWorld's screenshot-only category, which was notably better than the next-best system's score of 7.8% at that time, and 22.0% when afforded more steps.[1] Subsequent Claude versions pushed performance much higher: by September 2025 Claude Sonnet 4.5 scored 61.4% on OSWorld, up from 42.2% for Claude Sonnet 4 four months earlier,[13] and Claude Opus 4.5 reached 66.26% (P@1; avg@5) on the same benchmark in November 2025.[14]
On August 26, 2025, Anthropic launched a research preview of Claude for Chrome, a browser-based AI agent delivered as a Chrome extension and initially limited to 1,000 subscribers on the Claude Max plan.[15] The extension expanded to all Max subscribers in November 2025 and to all Pro, Team, and Enterprise subscribers on December 18, 2025.[16] Claude for Chrome lets users issue natural-language requests in a sidebar; with permission, the agent then operates inside the active Chrome tab, navigating sites, filling forms, managing calendars, and drafting emails. Subsequent updates added scheduled tasks, multi-tab workflows, and a record-a-workflow feature that lets users demonstrate a sequence for Claude to learn and replay.[16]
Google DeepMind unveiled Project Mariner on December 11, 2024, alongside the launch of Gemini 2.0.[3] Project Mariner was a research prototype implemented as a Chrome extension that uses the browser's side panel interface. It can understand and reason across information on the screen, including pixels, web elements, text, code, images, and forms.
Project Mariner operates using an Observe-Plan-Act loop. The Observe step captures the current browser state, including visual elements and underlying code structures. The Plan step analyzes the captured data and formulates a sequence of actions. The Act step executes the planned actions by simulating user interactions. This combination of visual and structural understanding distinguishes it from purely screenshot-based agents.[3]
On the WebVoyager benchmark, Project Mariner achieved a state-of-the-art score of 83.5% as a single-agent setup at the time of its announcement.[3] The agent can simultaneously handle up to 10 different tasks and includes a Teach and Repeat feature that allows users to demonstrate workflows for the agent to learn and replicate. At Google I/O 2025, Google made an updated Mariner available to Google AI Ultra subscribers in the United States at the plan's $249.99 per month price point, while integrating its computer-use capabilities into the Gemini API and Vertex AI.[17]
On October 7, 2025, Google released the Gemini 2.5 Computer Use model in public preview through the Gemini API on AI Studio and Vertex AI.[18] Built on Gemini 2.5 Pro's visual understanding and reasoning, the model is optimized for browser automation but also demonstrates capability for mobile UI control. Google reported that the model outperformed competing systems on multiple web and mobile control benchmarks with lower latency.[18] Browserbase, an infrastructure partner, collaborated with Google DeepMind on the evaluation harness used during development.[8]
Amazon introduced Nova Act on March 31, 2025, as the first public product from Amazon's AGI lab.[5] This lab is co-led by former Adept co-founder David Luan and former Covariant co-founder Pieter Abbeel, both of whom Amazon hired away from their respective startups in 2024.
Nova Act is powered by a custom Nova 2 Lite model, a small and fast multimodal model in the Amazon Nova family, trained specifically for browser task completion.[5] The Nova Act SDK allows developers to build agents that complete step-by-step browser tasks such as submitting time-off requests or placing online orders. The system is designed to handle interface elements that often trip up other agents, including drop-down menus, date pickers, and pop-up dialogs. Amazon also launched Bedrock AgentCore Browser, a managed browser environment that adds Web Bot Auth signing to AgentCore-hosted sessions, reducing CAPTCHA friction by cryptographically identifying agent traffic to bot control vendors such as Cloudflare, Akamai, and HUMAN Security.[19]
Microsoft introduced computer-using agent capabilities within Copilot Studio, first available in public preview for United States-based environments.[6] These agents can operate applications and websites directly using a virtual mouse and keyboard, clicking, typing, and navigating user interfaces. A hosted browser powered by Windows 365 provides the automation environment without requiring users to configure their own machines.
As of late 2025, computer-using agents in Copilot Studio support multiple foundation models, including Anthropic's Claude Sonnet 4.5 alongside OpenAI's CUA, giving developers flexibility to choose the best model for each task and interface type.[20] Claude Opus 4.5 was added as an additional choice in December 2025.[21]
The startup Butterfly Effect launched Manus AI in invitation-only beta on March 6, 2025, as a general-purpose autonomous agent designed to operate a cloud-based virtual browser.[22] Influential figures including Twitter cofounder Jack Dorsey and Hugging Face product lead Victor Mustar publicly praised the demos, and the launch video drew more than one million views within the first 20 hours. MIT Technology Review's hands-on review described the system as ambitious but inconsistent, with frequent timeouts and difficulty on novel tasks.[23] In April 2025, Butterfly Effect raised approximately $75 million in a Series B round led by Benchmark, valuing the company at roughly $500 million; the team subsequently relocated headquarters from Wuhan and Beijing to Singapore.[24] In November 2025, Manus shifted from a cloud-only setup to a local browser extension called Manus Browser Operator.
Perplexity AI launched the Comet browser on July 9, 2025, initially limited to subscribers of its Max tier. The browser is built on Chromium and ships with a sidecar AI assistant that can summarize pages, answer questions about the current tab, and complete tasks like sending emails or buying products. Perplexity made Comet free to download worldwide on October 2, 2025, and added a Max-tier "background assistant" that runs multiple tasks asynchronously.[25] Comet later expanded to Android in November 2025 and to iOS in March 2026.[25]
Several open-source projects and startups have built browser agent platforms.
Browser Use is an open-source Python library that became the most popular browser automation framework for AI agents, reaching approximately 79,000 GitHub stars by early 2026.[7] It was founded in 2024 by Magnus Müller and Gregor Zunic and joined Y Combinator's Winter 2025 batch. Browser Use uses Playwright for browser control and a hybrid DOM-plus-vision approach for page understanding, and scored 89.1% on the WebVoyager benchmark.[7] The team released a custom in-house model branded BU 2.0 on January 27, 2026, reporting a 12 percentage-point accuracy improvement at the same speed.[7] Browser Use also powers several higher-level products and was widely cited as the underlying engine for Manus AI's web browsing capability.[26]
Stagehand is an open-source browser automation SDK from Browserbase, available in TypeScript, Python, and C#. Stagehand exposes four primitives, act, extract, observe, and agent, that let developers write automations using natural language instead of brittle CSS selectors, with the framework resolving instructions like "click the submit button" at runtime so scripts survive site redesigns.[8] Stagehand v3 (October 2025) extended the framework across browser drivers, languages, and ecosystems.[8]
Skyvern is a Y Combinator (S23) startup that automates browser-based workflows using LLMs and computer vision. It has helped customers automate tasks such as job applications, invoice retrieval, government form completion, and product purchases across hundreds of websites. Skyvern scored 85.8% on WebVoyager, and raised $2.7 million in seed funding in 2024 to expand its open-source platform.[27]
BrowserGym is an open-source framework from ServiceNow that provides a unified Gym environment for web agent research. It integrates multiple benchmarks including WebArena, VisualWebArena, and WorkArena, and includes AgentLab, a companion framework for implementing, testing, and evaluating web agents.
MultiOn is a startup that provides an API for autonomous web agents capable of executing tasks such as event planning, service booking, and workflow automation. The platform includes built-in handling for authentication, CAPTCHAs, and dynamic content.
Surfer 2, released in October 2025 by H Company, is a cross-platform computer use agent that achieved 97.1% on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld. With test-time scaling it reached a perfect 100% pass@10 on WebVoyager, effectively saturating the benchmark.[28]
| Agent | Developer | Release date | Perception method | Foundation model | WebVoyager score | WebArena score | Availability |
|---|---|---|---|---|---|---|---|
| Operator / ChatGPT Agent / Atlas | OpenAI | January 2025 / July 2025 / October 2025 | Screenshot-based | CUA (GPT-4o + RL) | 87.0% | 58.1% | ChatGPT Plus, Pro, Business |
| Computer use / Claude for Chrome | Anthropic | October 2024 / August 2025 | Screenshot-based | Claude Sonnet 4.5 / Opus 4.5 | N/A | N/A | API + Chrome extension |
| Project Mariner | Google DeepMind | December 2024 | Hybrid (pixels + DOM) | Gemini 2.0 | 83.5% | N/A | Chrome extension (limited preview) |
| Gemini 2.5 Computer Use | Google DeepMind | October 2025 | Screenshot-based | Gemini 2.5 Pro | N/A | N/A | API preview (AI Studio, Vertex AI) |
| Nova Act | Amazon | March 2025 | Screenshot-based | Nova 2 Lite | N/A | N/A | SDK (research preview) |
| Copilot Studio CUA | Microsoft | 2025 | Screenshot-based | CUA / Claude Sonnet 4.5 / Claude Opus 4.5 | N/A | N/A | Public preview (US) |
| Manus | Butterfly Effect | March 2025 | Hybrid (cloud VM) | Various | N/A | N/A | Invitation-only beta |
| Comet | Perplexity AI | July 2025 | Hybrid | Various | N/A | N/A | Chromium browser (free) |
| Browser Use | Open-source | 2024 | Hybrid (DOM + vision) | Various (pluggable) | 89.1% | N/A | Open-source (MIT) |
| Skyvern | Skyvern (YC S23) | June 2024 | Hybrid (LLM + CV) | Various | 85.8% | N/A | SaaS + open-source |
| Surfer 2 | H Company | October 2025 | Screenshot-based | Custom | 97.1% | 69.6% | Research |
Evaluating AI browser agents requires specialized benchmarks that test the ability to complete realistic, multi-step web tasks. Several benchmarks have become standard in the field.
WebShop, introduced at NeurIPS 2022 by Princeton researchers led by Shunyu Yao, was one of the first large-scale benchmarks for grounded web interaction.[29] The environment is a simulated e-commerce website populated with 1.18 million real product offers scraped from Amazon and 12,087 crowd-sourced natural-language instructions. The agent must navigate multiple page types, issue queries, customize options, and complete a purchase that satisfies the instruction. WebShop has remained an influential evaluation harness for shopping-style agents and inspired subsequent benchmarks in the e-commerce domain.[29]
Mind2Web, presented as a Spotlight paper at NeurIPS 2023, introduced the first large-scale dataset and benchmark for building generalist web agents.[30] The dataset contains 2,350 tasks spanning 137 websites across 31 domains. The accompanying MindAct model uses a two-stage approach: a fine-tuned small language model first filters relevant web elements from the page, and then a larger LLM selects from the filtered elements and predicts the corresponding action. Mind2Web was a key contribution from the OSU NLP Group and demonstrated that LLMs could be trained for generalized web interaction rather than site-specific automation.
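The two-stage idea can be illustrated with a toy sketch. Word overlap stands in for the fine-tuned ranking model and a caller-supplied function stands in for the larger LLM; both substitutions, and the names `rank_candidates` and `select_element`, are assumptions made for illustration and do not reproduce MindAct itself.

```python
from typing import Callable, List


def rank_candidates(task: str, elements: List[str], top_k: int) -> List[str]:
    """Stage 1: keep the top_k elements most related to the task.

    Word overlap is a crude stand-in for a learned ranking model.
    """
    task_words = set(task.lower().split())

    def score(element: str) -> int:
        return len(task_words & set(element.lower().split()))

    # sorted() is stable, so ties keep their original page order.
    return sorted(elements, key=score, reverse=True)[:top_k]


def select_element(task: str,
                   elements: List[str],
                   chooser: Callable[[str, List[str]], str],
                   top_k: int = 2) -> str:
    """Stage 2: let the (stand-in) LLM choose among the filtered shortlist."""
    shortlist = rank_candidates(task, elements, top_k)
    return chooser(task, shortlist)
```

Filtering first keeps the prompt for the larger model short, which matters because a raw page can contain thousands of elements.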
In April 2025, the OSU NLP Group released Online-Mind2Web, a live, online version of Mind2Web with 300 multi-step tasks across 136 high-traffic websites and a paper titled "An Illusion of Progress? Assessing the Current State of Web Agents."[31] On Online-Mind2Web, OpenAI's Operator achieved a 61.3% human-evaluated success rate while most other agents clustered at 28 to 30%, a sharp drop from the >90% scores those same systems reported on cached benchmarks. The OSU team also introduced a WebJudge LLM-as-a-Judge evaluation method that achieves roughly 85% agreement with human judgment.[31]
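Agreement between an automatic judge and human annotators is typically reported as the share of tasks on which both give the same verdict. The sketch below computes that figure for made-up labels; `agreement_rate` is a hypothetical helper and the data is not from the paper.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Percentage of tasks where the automatic judge matches the human verdict."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return 100.0 * matches / len(judge_labels)


# Illustrative labels only: True means "task judged successful".
judge = [True, True, False, False, True]
human = [True, False, False, False, True]
print(agreement_rate(judge, human))
```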
WebArena was published at ICLR 2024 and is widely considered one of the most rigorous benchmarks for web agents.[32] It provides a self-hosted environment with fully functional copies of websites across four domains: e-commerce (modeled on shopping sites), social forums (modeled on Reddit), collaborative software development (modeled on GitLab), and content management (modeled on a CMS). The environment also includes tools like maps and user manuals to encourage realistic task-solving.
WebArena contains 812 long-horizon tasks described in natural language. Evaluation focuses on functional correctness: programmatic validators check whether the result of execution actually achieves the desired goal, rather than simply matching text output. Human performance on WebArena is 78.24%. When the benchmark was first published, the best GPT-4-based agent achieved only 14.41%. By early 2025, OpenAI's CUA model had pushed this to 58.1%, and Surfer 2 reached 69.6% in October 2025.[28]
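The difference between functional correctness and text matching can be shown with a small sketch. The state layout and the validator below are hypothetical and much simpler than WebArena's actual validators; the point is only that the check inspects the resulting site state rather than what the agent says.

```python
from typing import Dict, List


def validate_order_placed(state: Dict[str, List[dict]], item: str, quantity: int) -> bool:
    """Pass only if the site's final state contains the expected order."""
    return any(
        order["item"] == item and order["quantity"] == quantity
        for order in state["orders"]
    )


def success_rate(results: List[bool]) -> float:
    """Percentage of tasks whose validator passed."""
    return 100.0 * sum(results) / len(results)


# The agent claims success in both runs, but only the first changed the state.
agent_message = "I have placed the order."
state_after_run_1 = {"orders": [{"item": "usb-c cable", "quantity": 2}]}
state_after_run_2 = {"orders": []}

results = [
    validate_order_placed(state_after_run_1, "usb-c cable", 2),
    validate_order_placed(state_after_run_2, "usb-c cable", 2),
]
print(success_rate(results))
```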
VisualWebArena extends WebArena by introducing 910 tasks across three web applications (a classifieds site, a shopping site, and a Reddit-like forum) that explicitly require visual understanding.[33] Tasks involve reasoning about images, spatial layouts, and visual elements that cannot be understood from HTML alone. This benchmark is particularly important for evaluating multimodal agents that must process both visual and textual information.
OSWorld, introduced at NeurIPS 2024, goes beyond web-only tasks to evaluate agents in a full desktop environment.[34] It provides 369 diverse tasks on Ubuntu Linux and Windows operating systems, requiring agents to interact with multiple desktop applications, file systems, and system settings in addition to web browsers. OpenAI's CUA model achieved 38.1% on OSWorld at the start of 2025;[9] Claude Sonnet 4.5 reached 61.4% in September 2025,[13] and Claude Opus 4.5 hit 66.26% in November 2025, still well below human performance of 72.4%.[14]
WebVoyager, published at ACL 2024, comprises 643 manually validated tasks distributed across 15 high-traffic websites.[35] Tasks span diverse web functionalities including search engines, e-commerce platforms, and social media sites. The benchmark uses an automatic evaluation protocol leveraging GPT-4V's multimodal understanding to assess task completion, achieving 85.3% agreement with human judgment. Scores on WebVoyager have climbed rapidly, with the original paper reporting 59.1%, later agents like Browser Use reaching 89.1%, Magnitude reaching 93.9%, and Surfer 2 reaching 97.1% (and a perfect 100% pass@10) in October 2025.[28]
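The pass@10 figure can be read with a simple definition in mind: a task counts as solved if at least one of k attempts succeeds. The sketch below uses that definition on made-up outcomes; the exact protocol behind the reported numbers is not described here, so treat `pass_at_k` as an assumption.

```python
from typing import List


def pass_at_k(attempts_per_task: List[List[bool]], k: int) -> float:
    """Percentage of tasks with at least one success among the first k attempts."""
    solved = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return 100.0 * solved / len(attempts_per_task)


# Illustrative outcomes for four tasks with three attempts each.
outcomes = [
    [False, True, False],
    [False, False, False],
    [True, True, True],
    [False, False, True],
]
print(pass_at_k(outcomes, 1), pass_at_k(outcomes, 3))
```

Because extra attempts can only add successes, pass@k never decreases as k grows, which is why a pass@10 score is not directly comparable to a single-attempt score.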
OpenAI released BrowseComp on April 10, 2025, an open-source benchmark of 1,266 challenging fact-finding problems that test an agent's ability to locate hard-to-find, entangled information on the open web.[36] The benchmark was constructed using an "inverted question" methodology in which human trainers started with verifiable facts and then composed questions that made those facts hard to find but easy to verify, with each question accompanied by a canary GUID to detect training-set leakage. On BrowseComp, OpenAI's o1 scored 9.9% while OpenAI's Deep Research model scored 51.5%, the highest at launch, illustrating that even strong reasoners with browsing tools solve only about half the problems.[36]
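The canary mechanism amounts to a string search: if the unique identifier appears in a document, that document contains benchmark data and should be excluded from training. The sketch below uses a made-up GUID, not BrowseComp's real canary, and `contains_canary` and `filter_training_docs` are hypothetical helpers.

```python
from typing import List

# Made-up identifier for illustration; this is NOT the real BrowseComp canary.
EXAMPLE_CANARY = "0f1e2d3c-4b5a-6978-8899-aabbccddeeff"


def contains_canary(text: str, canary: str = EXAMPLE_CANARY) -> bool:
    """True if the document embeds the canary string (case-insensitive)."""
    return canary.lower() in text.lower()


def filter_training_docs(docs: List[str], canary: str = EXAMPLE_CANARY) -> List[str]:
    """Drop any document that carries the canary before it reaches training."""
    return [doc for doc in docs if not contains_canary(doc, canary)]
```

The same check also works in reverse: if a model can reproduce the canary string, benchmark data probably leaked into its training set.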
AssistantBench, presented at EMNLP 2024, evaluates web agents on 214 realistic, time-consuming tasks that span 525 pages on 258 different websites, covering domains from information lookup to math problem-solving.[37] The benchmark was designed to expose the limitations of language-model-only and retrieval-augmented baselines, none of which exceeded 26 accuracy points at the time of release. State-of-the-art web agents at that time scored close to zero. The proposed SEE-PLAN-ACT (SPA) baseline, combined in an ensemble with a closed-book model, achieved the highest scores in the original paper.[37]
ScreenSpot-Pro, introduced in April 2025, tests GUI grounding in professional, high-resolution desktop environments rather than web pages.[38] It contains 1,581 instructions over screenshots drawn from 23 applications (development tools and creative software such as VS Code, Photoshop, Blender, and DaVinci Resolve) on three operating systems. The best existing GUI grounding model in the original paper reached only 18.9% accuracy; the authors' ScreenSeekeR visual search method improved that to 48.1% without additional training.[38] ScreenSpot-Pro has become an important component of evaluation suites that test agents on small, dense GUI elements at high resolution.
| Benchmark | Venue | Year | Number of tasks | Environment | Human performance | Best agent (as of early 2026) |
|---|---|---|---|---|---|---|
| WebShop | NeurIPS | 2022 | 12,087 instructions | Simulated e-commerce | N/A | Various |
| Mind2Web | NeurIPS | 2023 | 2,350 | Cached pages, 137 sites | N/A | Various |
| WebArena | ICLR | 2024 | 812 | Self-hosted websites (4 domains) | 78.24% | 69.6% (Surfer 2) |
| VisualWebArena | ACL | 2024 | 910 | Self-hosted websites (3 apps) | N/A | N/A |
| OSWorld | NeurIPS | 2024 | 369 | Full desktop (Ubuntu/Windows) | 72.4% | 66.26% (Opus 4.5) |
| WebVoyager | ACL | 2024 | 643 | Live websites (15 sites) | N/A | 97.1% (Surfer 2) |
| AssistantBench | EMNLP | 2024 | 214 | Live web, 258 sites | N/A | ~26% (paper baselines) |
| BrowseComp | OpenAI | 2025 | 1,266 | Open web fact-finding | N/A | 51.5% (Deep Research) |
| Online-Mind2Web | OSU NLP | 2025 | 300 | Live websites (136 sites) | N/A | 61.3% (Operator) |
| ScreenSpot-Pro | ACM MM | 2025 | 1,581 | Professional desktop apps | N/A | 48.1% (ScreenSeekeR) |
The development of AI browser agents has roots in research dating back to the early 2020s.
Adept, a startup founded by former OpenAI and Google researchers, demonstrated ACT-1 (Action Transformer) in September 2022, one of the earliest models trained specifically to operate computer interfaces.[39] ACT-1 was hooked into a Chrome extension that gave it access to a custom rendering of the browser viewport and an action space of available UI elements; in one widely circulated demo it navigated a real estate site from a text prompt to find a home for a family of four in Houston under $600,000.[39] The ACT-1 demonstration helped Adept secure $350 million in funding in 2023. However, Adept never released a public product. In June 2024, Amazon hired Adept's co-founders and key employees and licensed the company's technology, models, and datasets. This talent and technology contributed directly to the development of Amazon Nova Act.
OpenAI's WebGPT project, published in December 2021, was one of the earliest demonstrations of an LLM interacting with web content to answer questions.[40] The researchers fine-tuned GPT-3 to use a text-based web browser, where the model could submit search queries, follow links, and scroll through pages. Unlike modern browser agents that operate on visual interfaces, WebGPT interacted with a simplified text representation of web pages.
The model was trained using behavior cloning from human demonstrations, then refined using rejection sampling against a reward model trained to predict human preferences. In evaluations, WebGPT's answers were preferred over human-written answers 56% of the time and were preferred over the highest-voted Reddit answers 69% of the time. On TruthfulQA, the model produced true answers 75% of the time.[40]
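Rejection sampling in this sense is often called best-of-n: generate several candidate answers, score each with the reward model, and keep the highest-scoring one. The sketch below uses a toy reward that counts citation markers as a stand-in for WebGPT's learned reward model; that substitution and the names `toy_reward` and `best_of_n` are assumptions for illustration.

```python
import re
from typing import Callable, List


def toy_reward(answer: str) -> float:
    """Stand-in reward: number of bracketed citation markers such as [1]."""
    return float(len(re.findall(r"\[\d+\]", answer)))


def best_of_n(candidates: List[str], reward: Callable[[str], float]) -> str:
    """Keep the candidate the reward model scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward)


candidates = [
    "Paris is the capital.",
    "Paris is the capital of France [1].",
    "Paris [1] is the capital of France [2].",
]
print(best_of_n(candidates, toy_reward))
```

No model weights change in this step; quality improves only because more samples are drawn and filtered at inference time.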
The WebShop benchmark introduced by Yao et al. at NeurIPS 2022 was a turning point for grounded web agents, providing the first large-scale environment where natural-language instructions had to be translated into multi-step browser actions over real product data.[29] The WebShop paper trained baselines using imitation learning, reinforcement learning, and pre-trained vision-language models on more than 1,600 human demonstrations, establishing many of the patterns that later agent training pipelines would refine.
Mind2Web demonstrated that LLMs could be trained for generalized web interaction rather than site-specific automation, providing a benchmark and dataset of 2,350 tasks across 137 websites that spurred subsequent generalist agent research.[30]
SeeAct, also from the OSU NLP Group and published at ICML 2024, advanced the field by demonstrating that multimodal models could serve as generalist web agents. The system uses GPT-4V to visually perceive websites and generate action plans in textual form. SeeAct instructs the model to imitate human browsing behavior, analyzing the task, current webpage, and previous actions to determine the next step. In experiments, SeeAct successfully completed 51.1% of tasks on live websites when provided with manual grounding for action execution, demonstrating the potential and current limitations of vision-based web agents.[41]
From early 2025 through early 2026, the AI browser agent category moved from research previews to commercial products, with several inflection points that reshaped the landscape.
OpenAI consolidated three previously separate products, Operator's action-taking remote browser, Deep Research's web synthesis, and ChatGPT's conversational interface, into a single offering named ChatGPT Agent on July 17, 2025.[10] The standalone operator.chatgpt.com site was sunset on August 31, 2025.[11] OpenAI's blog explained that the merger was a response to the limitation that Operator could not perform deep analysis and Deep Research could not interact with authenticated sites; ChatGPT Agent's virtual computer can do both inside a single trajectory.[10]
On October 21, 2025, OpenAI shipped ChatGPT Atlas, a Chromium-based browser for macOS that puts ChatGPT and its agent capability at the center of the browsing experience.[12] Atlas added a browser memory feature that uses recent browsing history to personalize responses, and an Agent Mode that can complete end-to-end tasks like compiling a meal plan and ordering ingredients. OpenAI also published a follow-up post in December 2025 describing how it had continuously hardened Atlas against prompt injection through reinforcement learning against an adversarial LLM-based attacker.[42]
Anthropic shipped Claude for Chrome as a research preview on August 26, 2025, starting with 1,000 Max subscribers, and expanded the audience throughout the fall, with general availability for paid Pro, Team, and Enterprise users by December 18, 2025.[15][16] Alongside the release, Anthropic published systematic measurements of prompt-injection success rates on a 123-test-case, 29-scenario benchmark: without safety mitigations, attacks succeeded 23.6% of the time; with mitigations enabled, the rate dropped to 11.2%, and browser-specific attacks dropped from 35.7% to 0%.[15] After Anthropic released Claude Opus 4.5 on November 24, 2025, follow-up testing reported a 1.4% prompt-injection success rate on the new model against the same benchmark, compared to 10.8% for Claude Sonnet 4.5 under the older safeguards.[43]
Google released the Gemini 2.5 Computer Use model in preview on October 7, 2025, exposing browser control through the Gemini API on AI Studio and Vertex AI.[18] Project Mariner remained accessible through Google AI Ultra throughout 2025, and Google subsequently absorbed its functionality into the broader Gemini Agent and Chrome's Auto Browse features. Browserbase released companion infrastructure for running Gemini Computer Use sessions, including the gemini-browser repository for evaluation work.[8]
Manus AI's invitation-only beta, launched on March 6, 2025, became one of the most-discussed product launches of the year. Industry figures including Jack Dorsey and Hugging Face's Victor Mustar publicly praised the demo videos within hours, and MIT Technology Review's hands-on review (March 11, 2025) found the system both impressive and inconsistent, with frequent failures on novel tasks and long execution times.[23] Browser Use, an open-source library, was identified as a core component of Manus's web-browsing capability, which contributed to its own surge in popularity.[26] In April 2025, Manus's parent company, Butterfly Effect, raised $75 million from Benchmark at a $500 million valuation;[24] the team relocated to Singapore later in 2025. In November 2025, Manus released a local Chromium-based Manus Browser Operator extension.
Perplexity AI released the Comet browser on July 9, 2025, made it free on October 2, 2025, and continued to roll out an asynchronous "background assistant" for Max-tier subscribers that runs multiple agentic tasks in parallel.[25] Security researchers at Brave published two reports during 2025 documenting systemic indirect prompt-injection vulnerabilities in Comet, including a screenshot-based attack in which faint-text injections become OCR-extracted commands when Comet processes a screenshot.[44][45] These reports framed indirect prompt injection as a category-wide systemic problem rather than a single product bug.
Browser agents at scale encountered an authentication and identity gap: anti-bot systems blocked legitimate agent traffic because there was no cryptographic way to prove that an automated request came from an accountable, authorized agent. The IETF draft Web Bot Auth standard, championed by Cloudflare and built on the HTTP Message Signatures specification, addressed this by letting bot operators publish a public key and sign each outbound request, so receiving sites can verify identity without IP allow lists.[46] In October 2025, Amazon Bedrock AgentCore Browser added Web Bot Auth signing in preview across nine AWS regions, allowing AgentCore-hosted sessions to be verified by Cloudflare, Akamai, and HUMAN Security.[19] Adoption across Visa TAP, Mastercard Agent Pay, AWS WAF, Shopify, and Vercel through 2025 and 2026 turned Web Bot Auth into a de facto identity layer for agent traffic.
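The signing step described above can be illustrated with a simplified sketch. It assembles a signature base and the accompanying headers in the general style of HTTP Message Signatures, then hands the bytes to a pluggable signer. The covered components, the parameter names, the label sig1, the key identifier, and the agent URL are assumptions chosen for illustration and should be checked against the current draft before any real use. The HMAC signer is only a stand-in so that the example runs on the Python standard library; an actual deployment would sign with a private key whose public half is published, which is what lets receiving sites verify the request.

```python
import base64
import hashlib
import hmac
from typing import Callable, Dict


def build_signature_base(authority: str, signature_agent: str, params: str) -> str:
    """Join the covered components into the newline-separated string that gets signed."""
    lines = [
        f'"@authority": {authority.lower()}',
        f'"signature-agent": "{signature_agent}"',
        f'"@signature-params": {params}',
    ]
    return "\n".join(lines)


def sign_request(
    authority: str,
    signature_agent: str,
    key_id: str,
    created: int,
    expires: int,
    sign: Callable[[bytes], bytes],
) -> Dict[str, str]:
    """Return the headers an agent would attach to one outbound request."""
    params = (
        '("@authority" "signature-agent")'
        f';created={created};expires={expires};keyid="{key_id}";tag="web-bot-auth"'
    )
    base = build_signature_base(authority, signature_agent, params)
    signature = base64.b64encode(sign(base.encode("ascii"))).decode("ascii")
    return {
        "Signature-Agent": f'"{signature_agent}"',
        "Signature-Input": f"sig1={params}",
        "Signature": f"sig1=:{signature}:",
    }


# Stand-in signer (assumption): HMAC is NOT what a public-key scheme would use.
DEMO_KEY = b"demo-secret-not-for-production"


def hmac_stand_in(message: bytes) -> bytes:
    return hmac.new(DEMO_KEY, message, hashlib.sha256).digest()


headers = sign_request(
    authority="Example.com",                  # hypothetical target site
    signature_agent="https://agent.example",  # hypothetical agent identity URL
    key_id="demo-key-1",                      # hypothetical key identifier
    created=1_700_000_000,
    expires=1_700_000_300,
    sign=hmac_stand_in,
)
```

Keeping the signer as a parameter separates the question of what is signed from the question of how it is signed, so the stand-in can be swapped for an asymmetric signer without touching the header construction.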
Microsoft expanded Copilot Studio's computer-using-agent functionality through 2025 to support not just OpenAI's CUA but also Anthropic's Claude Sonnet 4.5 (October 2025) and Claude Opus 4.5 (December 2025), letting builders pick the underlying model per agent.[20][21] Salesforce launched Agentforce 360 on October 13, 2025, generalizing its existing Agentforce 2dx (March 2025) and Agentforce 3 (June 2025) platforms into a coordinated set of action-taking agents capable of working across email, ERP, and other enterprise systems, with adopters reporting cycle-time reductions of up to 70% on processes such as auditing and onboarding.[47] These announcements signaled that browser agents were becoming a feature embedded in incumbent enterprise platforms rather than a separate product category.
H Company released Surfer 2 in October 2025 with results that effectively saturated WebVoyager (97.1% pass@1; 100% pass@10) and pushed other benchmarks to new highs.[28] The paper notes that the remaining WebVoyager failures cluster around anti-bot measures such as CAPTCHAs rather than agent reasoning, reinforcing that the bottleneck for browser agents has shifted away from base-model capability toward authentication, identity, and environment robustness.
AI browser agents face several significant technical challenges that limit their reliability and applicability.
Modern websites frequently use JavaScript frameworks to render content dynamically, load data asynchronously, and update the page without full reloads. Elements may appear, disappear, or change position based on user interactions, network conditions, or server responses. Browser agents must handle loading states, wait for elements to become interactive, and adapt to pages that look different on each visit due to personalized content, A/B testing, or time-sensitive information.
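One common way to cope with asynchronous rendering is to poll for a condition with a deadline instead of assuming the page is ready. The following is a minimal sketch of that pattern, not the mechanism of any particular product. The helper wait_until and the FakePage class are hypothetical names introduced for illustration; a real agent would replace the probe with a query against its browser driver.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    probe: Callable[[], Optional[T]],
    timeout_s: float = 10.0,
    poll_interval_s: float = 0.25,
) -> T:
    """Poll `probe` until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} s")
        time.sleep(poll_interval_s)


class FakePage:
    """Hypothetical stand-in for a page whose button renders after a few polls."""

    def __init__(self, ready_after_polls: int) -> None:
        self.polls = 0
        self.ready_after_polls = ready_after_polls

    def find_interactive_button(self) -> Optional[str]:
        self.polls += 1
        if self.polls >= self.ready_after_polls:
            return "submit-button"
        return None


page = FakePage(ready_after_polls=3)
button = wait_until(page.find_interactive_button, timeout_s=2.0, poll_interval_s=0.01)
```

Raising TimeoutError rather than returning a default gives the surrounding agent loop an explicit signal that the page never reached the expected state, which it can treat as a trigger for recovery.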
Websites deploy CAPTCHAs, rate limiting, browser fingerprinting, and behavioral analysis to prevent automated access. These mechanisms are designed to block non-human interaction, which puts them in direct tension with browser agents, whose purpose is to automate that interaction. Modern CAPTCHA systems have moved beyond simple image recognition to analyze behavioral signals such as mouse movement patterns, typing cadence, and navigation timing, making them difficult for agents to bypass.
Some emerging solutions address this challenge. Amazon's Bedrock AgentCore Browser uses Web Bot Auth, a draft IETF protocol that cryptographically identifies AI agents to websites and bot control vendors, reducing CAPTCHA friction for verified agents.[19][46] This approach establishes a legitimate identity for the agent rather than attempting to disguise automated behavior.
Many useful web tasks require logging into accounts, which raises both technical and security challenges. Agents need access to credentials, and providing those credentials to an AI system introduces risks around data exposure and unintended actions. Most browser agent platforms handle this by asking users to log in manually before handing control to the agent, or by requesting explicit confirmation before any authentication step.
Web interactions can fail in unpredictable ways: pages may not load, buttons may be obscured by pop-ups or cookie consent banners, dropdown menus may behave differently across browsers, and network errors can interrupt multi-step workflows. Reliable browser agents need robust error detection and recovery strategies, including the ability to recognize when they are stuck and try alternative approaches.
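Two of the recovery ideas mentioned above, recognizing a stuck state and trying alternative approaches, can be sketched as follows. This is an illustrative assumption about how such logic might be structured, not a description of any specific agent. StuckDetector, run_with_fallbacks, and the window and repeat thresholds are hypothetical.

```python
import hashlib
from collections import deque
from typing import Callable, Deque, Sequence


class StuckDetector:
    """Flags a loop when the same page-state fingerprint repeats too often."""

    def __init__(self, window: int = 5, max_repeats: int = 3) -> None:
        self.recent: Deque[str] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, page_state: str) -> bool:
        """Record one observation; return True if the agent appears stuck."""
        fingerprint = hashlib.sha256(page_state.encode("utf-8")).hexdigest()
        self.recent.append(fingerprint)
        return self.recent.count(fingerprint) >= self.max_repeats


def run_with_fallbacks(strategies: Sequence[Callable[[], bool]]) -> int:
    """Try each strategy in order; return the index of the first that succeeds."""
    for index, strategy in enumerate(strategies):
        try:
            if strategy():
                return index
        except Exception:
            continue  # treat an exception like a failed attempt and move on
    raise RuntimeError("all strategies failed; escalate to the user")
```

In this arrangement a strategy list might start with a direct click, then dismiss a cookie banner and retry, then reload the page. When every strategy fails, the error is the point at which control returns to a human.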
Screenshot-based agents that call large multimodal models for every action are inherently slow and expensive. A task that takes a human 30 seconds might take a browser agent several minutes due to the latency of model inference at each step. For enterprise applications requiring high throughput, this overhead can be prohibitive. Reducing cost and latency while maintaining accuracy remains an active area of engineering.
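The overhead is easy to see with back-of-the-envelope arithmetic: total latency and cost both grow linearly with the number of steps, because each step pays for one model call. The sketch below uses made-up figures purely for illustration. The step count, per-step latencies, token counts, and prices are assumptions, not measurements of any real model or product.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StepProfile:
    inference_latency_s: float  # time waiting on the model for one action
    browser_latency_s: float    # time for the page to respond to the action
    input_tokens: int           # screenshot plus prompt, expressed as tokens
    output_tokens: int          # the model's chosen action


def estimate_task(
    steps: int,
    profile: StepProfile,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> Tuple[float, float]:
    """Return (total seconds, total USD) for a task of `steps` actions."""
    total_latency_s = steps * (profile.inference_latency_s + profile.browser_latency_s)
    per_step_usd_micro = (
        profile.input_tokens * usd_per_million_input
        + profile.output_tokens * usd_per_million_output
    )
    total_usd = steps * per_step_usd_micro / 1_000_000
    return total_latency_s, total_usd


# Illustrative assumptions only: 20 steps, 4 s inference, 1 s page response.
profile = StepProfile(inference_latency_s=4.0, browser_latency_s=1.0,
                      input_tokens=2_000, output_tokens=100)
seconds, dollars = estimate_task(20, profile,
                                 usd_per_million_input=3.0,
                                 usd_per_million_output=15.0)
```

Under these assumed figures the task takes 100 seconds and costs about 15 cents. The model makes clear why reducing the number of steps, or the tokens sent per step, matters more than shaving browser response time.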
Online benchmarks against live websites continue to show that headline numbers from cached or self-hosted benchmarks substantially overstate real-world reliability. On Online-Mind2Web, Operator's human-evaluated success rate was 61.3% and most other agents clustered at 28 to 30%, well below their reported WebVoyager scores.[31] The paper's authors attributed most failures to misuse of filter and sort controls (57.7% of major failures) and to difficulty interpreting implicit numerical, temporal, and compositional constraints, suggesting that many headline scores reflect benchmark inflation rather than genuine task-solving ability.[31]
AI browser agents introduce novel safety and security concerns that go beyond those of traditional AI systems.
Prompt injection is the most widely discussed security risk for browser agents.[48] In this attack, malicious instructions are hidden within web page content, emails, or documents that the agent processes. Because the agent interprets both user instructions and web content, an attacker can embed text on a web page that instructs the agent to perform unintended actions, such as extracting personal data, making purchases, or clicking malicious links.
OpenAI has acknowledged that prompt injection "is unlikely to ever be fully solved" for browser agents, comparing it to scams and social engineering on the web. Brave's security research has confirmed that indirect prompt injection is a systemic challenge affecting all AI-powered browsers, not an isolated issue.[44] Researchers have demonstrated that invisible prompt injections can be embedded in screenshots, with hidden text on yellow backgrounds extracted via OCR and treated as agent commands.[45]
Defense strategies include training agents to distinguish between user instructions and third-party content, requiring user confirmation before high-stakes actions, and using adversarial testing with LLM-based automated attackers. OpenAI uses a reinforcement-learning-trained attacker to probe for vulnerabilities in its Atlas browser.[42] Anthropic has published research on layered defenses against prompt injection for browser agents, reporting that mitigations reduced attack success on a 123-case test set from 23.6% to 11.2% at the Claude for Chrome launch, with follow-up testing reporting a 1.4% success rate on Claude Opus 4.5.[15][43] Independent commentary by Simon Willison has framed the problem with the phrase "the lethal trifecta": when an agent simultaneously has access to private data, exposure to untrusted content, and the ability to communicate externally, an attacker can chain these capabilities to exfiltrate data.[49]
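The "lethal trifecta" framing lends itself to a simple policy check: a session is only exposed to the chained attack when all three capabilities are present at once, so removing any one of them breaks the chain. The sketch below is one hypothetical way to express that check. SessionCapabilities and has_lethal_trifecta are illustrative names, not part of any vendor's API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SessionCapabilities:
    reads_private_data: bool          # e.g. logged-in mail or banking tabs
    sees_untrusted_content: bool      # e.g. arbitrary web pages or inbound email
    can_communicate_externally: bool  # e.g. can submit forms or send messages


def has_lethal_trifecta(caps: SessionCapabilities) -> bool:
    """True only when all three risky capabilities are enabled together."""
    return (
        caps.reads_private_data
        and caps.sees_untrusted_content
        and caps.can_communicate_externally
    )


full_access = SessionCapabilities(True, True, True)
# Dropping any single capability is enough to break the chain.
read_only = replace(full_access, can_communicate_externally=False)
```

A scheduler could use such a check to decide which sessions may run unattended and which must pause for user confirmation.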
Because browser agents can take real-world actions such as sending messages, making purchases, and modifying account settings, errors carry tangible consequences. An agent that misinterprets a task or encounters an unexpected page state could place an incorrect order, send a message to the wrong person, or change critical settings. Most commercial browser agents mitigate this by requiring human confirmation before irreversible actions, but this limits the degree of autonomy and negates some of the efficiency gains. Atlas's Agent Mode pauses on financial-institution sites; Claude for Chrome blocks financial-services, adult, and pirated-content categories outright; ChatGPT Agent prompts for confirmation before purchases.[12][15]
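The confirmation pattern can be sketched as a gate placed between the agent's decision and its execution. This is a minimal illustration under stated assumptions rather than a description of how any of the products above implement it. The Action type, the set of high-risk kinds, and the callback signatures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "click", "purchase", "send_message"
    target: str  # human-readable description shown in the confirmation prompt


# Assumed policy: these kinds are treated as irreversible.
HIGH_RISK_KINDS: FrozenSet[str] = frozenset(
    {"purchase", "send_message", "change_settings"}
)


def execute_with_gate(
    action: Action,
    execute: Callable[[Action], None],
    confirm: Callable[[Action], bool],
) -> bool:
    """Run `action`, asking the user first if it is high risk.

    Returns True if the action ran and False if the user declined.
    """
    if action.kind in HIGH_RISK_KINDS and not confirm(action):
        return False
    execute(action)
    return True
```

Low-risk actions such as scrolling pass straight through, so the user is interrupted only at the points where a mistake would be costly. That is the trade-off between autonomy and oversight described above.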
Browser agents necessarily process sensitive information visible on the user's screen, including emails, financial data, personal messages, and account credentials. This data passes through the AI model's inference pipeline, raising questions about data retention, model training on user data, and compliance with privacy regulations. Users must trust that the agent provider handles this data appropriately, and enterprises must evaluate browser agents against their data governance policies.
Browser agents could be misused for spam, fraud, credential stuffing, or other malicious activities. The ability to automate browser interactions at scale lowers the barrier for attacks that previously required human effort. Agent providers implement usage policies and monitoring to detect and prevent abuse, but the dual-use nature of the technology means that safeguards must continually evolve.
AI browser agents are being applied across a wide range of use cases.
Enterprise workflow automation: Agents automate repetitive browser-based tasks such as data entry across multiple systems, invoice processing, compliance form completion, and report generation. This is particularly valuable when the target applications lack APIs and can only be accessed through a web interface. Salesforce Agentforce, Microsoft Copilot Studio, and Amazon Bedrock AgentCore Browser are examples of commercial platforms that target this segment.[6][19][47]
Personal task automation: Consumer-facing agents handle tasks like booking travel, ordering groceries, comparing prices across shopping sites, scheduling appointments, and filling out forms. OpenAI's ChatGPT Agent and Atlas, Anthropic's Claude for Chrome, and Perplexity Comet target this category.
Software testing: Browser agents can be used to test web applications by simulating user interactions, identifying broken workflows, and verifying that pages render correctly. Their ability to navigate unfamiliar interfaces makes them useful for exploratory testing. The Browserbase Stagehand SDK and Browser Use library are both positioned as testing and automation backends.[7][8]
Research and data collection: Agents can gather information from multiple websites, compile results, and synthesize findings, going beyond simple web scraping to tasks that require navigating complex multi-page workflows. OpenAI's Deep Research, integrated into ChatGPT Agent, is one widely cited example.[10][36]
Accessibility: Browser agents have the potential to make the web more accessible to people with disabilities by performing complex navigation tasks on their behalf, effectively translating high-level intents into detailed browser interactions.
Robotic process automation alternative: Compared to traditional Robotic Process Automation (RPA) tooling, which relies on rule-based scripts tied to specific UI elements, AI browser agents can adapt to UI changes without manual reconfiguration. They are better suited to dynamic, exception-heavy processes, while RPA remains preferable for stable, structured, high-throughput workflows.[50]
Several trends are shaping the future of AI browser agents. Agent reliability continues to improve as models become better at understanding web interfaces, recovering from errors, and handling edge cases. The gap between human and agent performance on benchmarks like WebArena is closing, though it remains significant.
Standardization efforts such as the Web Bot Auth IETF protocol aim to create legitimate channels for AI agents to identify themselves to websites, potentially reducing friction from anti-bot measures while maintaining security.[46] Multi-agent architectures, where multiple specialized agents collaborate on complex tasks, are being explored to improve reliability and handle tasks that span many websites.
The integration of browser agents into existing products, as demonstrated by Microsoft's Copilot Studio and OpenAI's Atlas browser, suggests that browser agents will become a standard feature of productivity software rather than remain standalone tools. As these systems mature, the balance between autonomy and human oversight remains a central design challenge.