An AI browser agent is a software system that uses artificial intelligence to autonomously navigate, interpret, and interact with web browsers to complete tasks on behalf of a user. These agents can click buttons, fill out forms, scroll through pages, follow links, and perform multi-step workflows across websites, replicating the actions a human would take when using a web browser. Unlike traditional browser automation tools that rely on hard-coded scripts and CSS selectors, AI browser agents use large language models (LLMs) and computer vision to understand web pages dynamically, making them capable of handling unfamiliar websites without site-specific programming.
The development of AI browser agents accelerated rapidly between 2024 and 2025, with major AI companies releasing competing products. Anthropic launched computer use for Claude in October 2024. Google previewed Project Mariner, built on Gemini 2.0, in December 2024. OpenAI released Operator, powered by its Computer-Using Agent (CUA) model, in January 2025. Amazon introduced Nova Act in March 2025. Microsoft integrated computer-using agent capabilities into Copilot Studio in 2025. These releases signaled a broad industry consensus that autonomous browser control represents a natural next step for AI systems.
AI browser agents operate through a continuous loop of perception, reasoning, and action. At each step, the agent observes the current state of the browser, decides what to do next, and then executes an action such as clicking, typing, or scrolling. This loop repeats until the agent determines the task is complete or requires human intervention.
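The loop described above can be sketched in a few lines. This is a minimal illustration with toy stand-ins: the `observe`, `decide`, and state-update logic are hypothetical placeholders for a real browser driver and model call, not any vendor's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "scroll", or "done"
    target: str = ""   # element identifier or coordinates
    text: str = ""     # text to type, if any

def observe(browser_state: dict) -> dict:
    """Perception: capture the current page state (screenshot or DOM)."""
    return {"url": browser_state["url"], "elements": browser_state["elements"]}

def decide(observation: dict, goal: str) -> Action:
    """Reasoning: a real agent would call an LLM here; this is a toy rule."""
    if "search_box" in observation["elements"]:
        return Action(kind="type", target="search_box", text=goal)
    return Action(kind="done")

def run_agent(goal: str, browser_state: dict, max_steps: int = 10) -> list[Action]:
    """Repeat observe -> decide -> act until done or the step budget runs out."""
    trace = []
    for _ in range(max_steps):
        action = decide(observe(browser_state), goal)
        trace.append(action)
        if action.kind == "done":
            break
        if action.kind == "type":
            # Acting: pretend the typed query navigated to a results page
            browser_state = {"url": "results", "elements": ["result_link"]}
    return trace
```

The `max_steps` budget reflects a real design constraint: without it, an agent that never recognizes completion would loop indefinitely.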
The perception step is where the major architectural differences between browser agents emerge. There are two primary approaches: screenshot-based (also called vision-based or pixel-based) and DOM-based (also called structure-based). Some systems use a hybrid of both.
In the screenshot-based approach, the agent captures a screenshot of the browser window and feeds it to a multimodal AI model that can interpret images. The model analyzes the visual layout of the page, identifies interactive elements like buttons and text fields, and returns coordinates for where to click or text to type. This process closely mirrors how a human perceives and interacts with a screen.
OpenAI's CUA model and Anthropic's Claude computer use both employ this approach. The CUA model combines GPT-4o's vision capabilities with advanced reasoning trained through reinforcement learning. It examines a screenshot, reasons about the next steps using chain-of-thought, and issues commands such as `click(x, y)` or `type(text)`. The environment executes these commands and returns a new screenshot, and the cycle repeats.
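On the environment side, the command strings the model emits must be parsed into structured actions before execution. The sketch below is illustrative only; the actual wire formats used by CUA and Claude differ, and this shows just the parse-then-dispatch pattern.

```python
import re

# Illustrative parser for model-emitted commands such as
# 'click(120, 340)' or 'type("hello")'. The command grammar here is
# hypothetical, not any vendor's actual format.

CLICK_RE = re.compile(r"click\((\d+),\s*(\d+)\)")
TYPE_RE = re.compile(r'type\("([^"]*)"\)')

def parse_command(cmd: str) -> dict:
    if m := CLICK_RE.fullmatch(cmd.strip()):
        return {"action": "click", "x": int(m.group(1)), "y": int(m.group(2))}
    if m := TYPE_RE.fullmatch(cmd.strip()):
        return {"action": "type", "text": m.group(1)}
    raise ValueError(f"unrecognized command: {cmd!r}")
```

Raising on unrecognized input matters in practice: a malformed model response should trigger a retry or re-prompt, never a silent no-op.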
Anthropic's computer use capability, introduced with Claude 3.5 Sonnet in October 2024, works similarly. The model receives a screenshot of the current screen state and returns specific pixel coordinates for mouse actions. Before this update, frontier AI models could not reliably provide coordinates on a screenshot; the updated Claude 3.5 Sonnet was the first major model to demonstrate this capability in public beta.
The primary advantage of the screenshot approach is universality. Because it works at the pixel level, it can handle any graphical interface, including web applications, desktop software, canvas-based applications, and graphical elements that have no representation in the DOM. The disadvantages are speed and cost. Screenshot-based agents typically take 2 to 7 seconds per action because each step requires image capture, upload, model inference, and response parsing. The token cost is also higher, since image tokens in models like GPT-4V are 10 to 20 times more expensive than processing equivalent HTML text.
The DOM-based approach reads the Document Object Model, the structured tree representation of a web page's HTML elements. Instead of looking at pixels, the agent parses the underlying code to identify buttons, links, input fields, and other interactive elements. It then reasons about which element to interact with and issues commands directly against the DOM.
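A simplified version of this element-extraction step can be shown with Python's standard-library HTML parser. Real agents inspect a live DOM through a driver such as Playwright; the stdlib parser is used here only to keep the sketch self-contained.

```python
from html.parser import HTMLParser

# Sketch of DOM-based perception: walk the HTML and collect the
# interactive elements (buttons, links, inputs) an agent could act on.

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

class ElementCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            self.elements.append({"tag": tag, "attrs": dict(attrs)})

def interactive_elements(html: str) -> list[dict]:
    """Return the actionable elements found in an HTML snippet."""
    collector = ElementCollector()
    collector.feed(html)
    return collector.elements
```

The resulting element list, rather than a raw screenshot, is what a DOM-based agent passes to the model, which is why its token footprint is so much smaller.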
Google's Project Mariner uses a hybrid approach that incorporates DOM understanding. The agent takes a screenshot of the active Chrome tab and simultaneously analyzes both the visible layout and the underlying HTML structure. This dual understanding of visual elements and code enables more reliable browser automation, particularly on text-heavy pages with forms and structured data.
DOM-based agents are significantly faster, executing actions in 20 to 100 milliseconds compared to several seconds for screenshot-based agents. They also consume fewer tokens, typically 1,000 to 3,000 tokens per page compared to 10,000 or more for vision-based processing. This makes them cheaper to operate at scale. However, DOM-based agents struggle with canvas-based applications, dynamically rendered content that does not appear in the HTML source, and visual elements like images or charts that carry meaning not captured in the DOM.
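The per-page token figures above imply a substantial cost gap at scale. A back-of-envelope estimate makes this concrete; the flat token price below is a hypothetical placeholder, since real pricing varies by model and by text versus image tokens.

```python
# Worked cost comparison using the per-page figures cited above:
# roughly 1,000-3,000 tokens per page for DOM text vs 10,000+ for
# vision processing. The price is an illustrative placeholder.

PRICE_PER_1K_TOKENS = 0.01  # hypothetical flat rate, USD

def task_cost(pages: int, tokens_per_page: int) -> float:
    """Estimated model cost for a multi-page task."""
    return pages * tokens_per_page / 1000 * PRICE_PER_1K_TOKENS

dom_cost = task_cost(pages=20, tokens_per_page=2000)      # mid-range DOM estimate
vision_cost = task_cost(pages=20, tokens_per_page=10000)  # low-end vision estimate
```

Even with vision at its low-end estimate, the same 20-step task costs five times as much as the DOM route, before accounting for the higher per-token price of image inputs.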
Many modern browser agents combine both approaches, using DOM parsing for most steps and falling back to screenshots when the page structure is ambiguous or when visual understanding is required. The open-source Browser Use framework, which gained over 78,000 GitHub stars by early 2026, exemplifies this hybrid strategy. It uses Playwright for browser control, parses the DOM for element identification, and takes screenshots for visual analysis when needed.
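The mode-selection decision in a hybrid agent can be expressed as a simple heuristic. This sketch is not Browser Use's actual logic; the signals and thresholds are illustrative assumptions.

```python
# Heuristic sketch of a hybrid agent's perception choice: prefer cheap
# DOM parsing, but fall back to a screenshot when the page structure
# is uninformative.

def choose_perception_mode(dom_elements: list[dict], has_canvas: bool) -> str:
    if has_canvas:
        return "screenshot"  # canvas content has no useful DOM representation
    if len(dom_elements) == 0:
        return "screenshot"  # nothing actionable found in the HTML
    return "dom"
```

In production systems the trigger conditions are richer (ambiguous labels, overlapping elements, failed prior actions), but the cost-driven ordering is the same: try the cheap path first.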
Several major technology companies and startups have released AI browser agents, each with different technical approaches and target audiences.
OpenAI launched Operator on January 23, 2025, as a research preview available to ChatGPT Pro subscribers in the United States. Operator is powered by the Computer-Using Agent (CUA) model, which combines GPT-4o's vision capabilities with reasoning trained through reinforcement learning to interact with graphical user interfaces. The CUA model can identify page elements, break tasks into multi-step plans, and self-correct when challenges arise.
Operator runs in a dedicated browser environment at operator.chatgpt.com. Users provide a natural language instruction, and the agent navigates websites to complete the task. Safety measures include asking for user confirmation before finalizing actions with external side effects, such as submitting an order or sending an email.
In October 2025, OpenAI launched ChatGPT Atlas, a dedicated browser product that consolidated the capabilities previously available through Operator. Atlas introduced Agent Mode as a core feature, allowing the browser to execute multi-step tasks autonomously. The Operator site was deprecated, and its agentic capabilities were integrated into both the main ChatGPT interface and the Atlas browser. Agent Mode in Atlas is available to Plus, Pro, and Business users.
Anthropic introduced computer use as a public beta on October 22, 2024, making Claude 3.5 Sonnet the first frontier AI model to offer autonomous computer control capabilities at this level. The feature allows developers to direct Claude to interact with computers by looking at screenshots, moving a cursor, clicking buttons, and typing text.
Unlike OpenAI's approach, Anthropic does not provide hosted virtual machines. Instead, developers call the Claude API, sending both text instructions and screenshots of the computer being controlled. Claude responds with commands describing what actions to perform next. The model can use three tools: a Computer tool (providing access to screen, keyboard, and mouse), a Text Editor, and a Bash terminal.
At launch, Claude 3.5 Sonnet scored 14.9% on OSWorld's screenshot-only category, which was notably better than the next-best system's score of 7.8% at that time. Anthropic has continued to improve the feature, and in August 2025, launched a research preview of Claude for Chrome, a browser-based AI agent available as a Chrome extension for a limited group of subscribers.
Google DeepMind unveiled Project Mariner on December 11, 2024, alongside the launch of Gemini 2.0. Project Mariner is a research prototype implemented as a Chrome extension that uses the browser's side panel interface. It can understand and reason across information on the screen, including pixels, web elements, text, code, images, and forms.
Project Mariner operates using an Observe-Plan-Act loop. The Observe step captures the current browser state, including visual elements and underlying code structures. The Plan step analyzes the captured data and formulates a sequence of actions. The Act step executes the planned actions by simulating user interactions. This combination of visual and structural understanding distinguishes it from purely screenshot-based agents.
On the WebVoyager benchmark, Project Mariner achieved a state-of-the-art score of 83.5% as a single-agent setup at the time of its announcement. The agent can simultaneously handle up to 10 different tasks and includes a Teach and Repeat feature that allows users to demonstrate workflows for the agent to learn and replicate.
Amazon introduced Nova Act on March 31, 2025, as the first public product from Amazon's AGI lab. This lab is co-led by former Adept co-founder David Luan and former Covariant co-founder Pieter Abbeel, both of whom Amazon hired away from their respective startups in 2024.
Nova Act is powered by a custom Nova 2 Lite model, a small and fast multimodal model in the Amazon Nova family, trained specifically for browser task completion. The Nova Act SDK allows developers to build agents that complete step-by-step browser tasks such as submitting time-off requests or placing online orders. The system is designed to handle interface elements that often trip up other agents, including drop-down menus, date pickers, and pop-up dialogs.
Microsoft introduced computer-using agent capabilities within Copilot Studio, first available in public preview for United States-based environments. These agents can operate applications and websites directly using a virtual mouse and keyboard, clicking, typing, and navigating user interfaces. A hosted browser powered by Windows 365 provides the automation environment without requiring users to configure their own machines.
As of late 2025, computer-using agents in Copilot Studio support multiple foundation models, including Anthropic's Claude Sonnet 4.5 alongside OpenAI's CUA, giving developers flexibility to choose the best model for each task and interface type.
Several open-source projects and startups have built browser agent platforms.
Browser Use is an open-source Python library that became the most popular browser automation framework for AI agents, reaching over 78,000 GitHub stars. It uses Playwright for browser control and a hybrid DOM-plus-vision approach for page understanding. Browser Use scored 89.1% on the WebVoyager benchmark.
Skyvern is a Y Combinator (S23) startup that automates browser-based workflows using LLMs and computer vision. It has helped customers automate tasks such as job applications, invoice retrieval, government form completion, and product purchases across hundreds of websites. Skyvern scored 85.8% on WebVoyager.
BrowserGym is an open-source framework from ServiceNow that provides a unified Gym environment for web agent research. It integrates multiple benchmarks including WebArena, VisualWebArena, and WorkArena, and includes AgentLab, a companion framework for implementing, testing, and evaluating web agents.
MultiOn is a startup that provides an API for autonomous web agents capable of executing tasks such as event planning, service booking, and workflow automation. The platform includes built-in handling for authentication, CAPTCHAs, and dynamic content.
| Agent | Developer | Release Date | Perception Method | Foundation Model | WebVoyager Score | WebArena Score | Availability |
|---|---|---|---|---|---|---|---|
| Operator / Atlas | OpenAI | January 2025 / October 2025 | Screenshot-based | CUA (GPT-4o + RL) | 87.0% | 58.1% | ChatGPT Plus, Pro, Business |
| Computer Use | Anthropic | October 2024 | Screenshot-based | Claude 3.5 Sonnet | N/A | N/A | API (public beta) |
| Project Mariner | Google DeepMind | December 2024 | Hybrid (pixels + DOM) | Gemini 2.0 | 83.5% | N/A | Chrome extension (limited preview) |
| Nova Act | Amazon | March 2025 | Screenshot-based | Nova 2 Lite | N/A | N/A | SDK (research preview) |
| Copilot Studio CUA | Microsoft | 2025 | Screenshot-based | CUA / Claude Sonnet 4.5 | N/A | N/A | Public preview (US) |
| Browser Use | Open-source | 2024 | Hybrid (DOM + vision) | Various (pluggable) | 89.1% | N/A | Open-source (MIT) |
| Skyvern | Skyvern (YC S23) | June 2024 | Hybrid (LLM + CV) | Various | 85.8% | N/A | SaaS + open-source |
Evaluating AI browser agents requires specialized benchmarks that test the ability to complete realistic, multi-step web tasks. Several benchmarks have become standard in the field.
WebArena was published at ICLR 2024 and is widely considered the most rigorous benchmark for web agents. It provides a self-hosted environment with fully functional copies of websites across four domains: e-commerce (modeled on shopping sites), social forums (modeled on Reddit), collaborative software development (modeled on GitLab), and content management (modeled on a CMS). The environment also includes tools like maps and user manuals to encourage realistic task-solving.
WebArena contains 812 long-horizon tasks described in natural language. Evaluation focuses on functional correctness: programmatic validators check whether the result of execution actually achieves the desired goal, rather than simply matching text output. Human performance on WebArena is 78.24%. When the benchmark was first published, the best GPT-4-based agent achieved only 14.41%. By early 2025, OpenAI's CUA model had pushed this to 58.1%, and by mid-2025, the best agents reached approximately 61.7%. The gap between human and AI performance remains significant, highlighting that general web task automation is still an unsolved problem.
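The functional-correctness idea can be illustrated with a toy validator. WebArena's real validators query the hosted site backends; here a plain dictionary stands in for that environment state.

```python
# Sketch of WebArena-style functional evaluation: instead of comparing
# the agent's text output, a validator inspects the resulting
# environment state. The dict is a toy stand-in for a site backend.

def validate_order_placed(env_state: dict, expected_item: str) -> bool:
    """Task passes only if an order for the expected item actually exists."""
    return any(order["item"] == expected_item
               for order in env_state.get("orders", []))

# An agent that merely *claims* success fails; one that changed the
# environment state passes.
before = {"orders": []}
after = {"orders": [{"item": "desk lamp", "qty": 1}]}
```

This is why the benchmark is hard to game: fluent but ineffective agents score zero, because the check is on what the site's state became, not on what the agent said.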
VisualWebArena extends WebArena by introducing 910 tasks across three web applications (a classifieds site, a shopping site, and a Reddit-like forum) that explicitly require visual understanding. Tasks involve reasoning about images, spatial layouts, and visual elements that cannot be understood from HTML alone. This benchmark is particularly important for evaluating multimodal agents that must process both visual and textual information.
OSWorld, introduced at NeurIPS 2024, goes beyond web-only tasks to evaluate agents in a full desktop environment. It provides 369 diverse tasks on Ubuntu Linux and Windows operating systems, requiring agents to interact with multiple desktop applications, file systems, and system settings in addition to web browsers. OpenAI's CUA model achieved 38.1% on OSWorld, representing a notable step toward general computer use, though still far from human-level performance.
WebVoyager, published at ACL 2024, comprises 643 manually validated tasks distributed across 15 high-traffic websites. Tasks span diverse web functionalities including search engines, e-commerce platforms, and social media sites. The benchmark uses an automatic evaluation protocol leveraging GPT-4V's multimodal understanding to assess task completion, achieving 85.3% agreement with human judgment. Scores on WebVoyager have climbed rapidly: the original paper reported 59.1%, while later agents pushed far past that, with Browser Use reaching 89.1%, Magnitude 93.9%, and Surfer 2 reporting 97.1% by early 2026.
| Benchmark | Venue | Year | Number of Tasks | Environment | Human Performance | Best Agent (as of early 2026) |
|---|---|---|---|---|---|---|
| WebArena | ICLR | 2024 | 812 | Self-hosted websites (4 domains) | 78.24% | ~61.7% |
| VisualWebArena | ICLR | 2024 | 910 | Self-hosted websites (3 apps) | N/A | N/A |
| OSWorld | NeurIPS | 2024 | 369 | Full desktop (Ubuntu/Windows) | 72.4% | 38.1% (CUA) |
| WebVoyager | ACL | 2024 | 643 | Live websites (15 sites) | N/A | 97.1% (Surfer 2) |
The development of AI browser agents has roots in research dating back to the early 2020s.
OpenAI's WebGPT project, published in December 2021, was one of the earliest demonstrations of an LLM interacting with web content to answer questions. The researchers fine-tuned GPT-3 to use a text-based web browser, where the model could submit search queries, follow links, and scroll through pages. Unlike modern browser agents that operate on visual interfaces, WebGPT interacted with a simplified text representation of web pages.
The model was trained using behavior cloning from human demonstrations, then refined using rejection sampling against a reward model trained to predict human preferences. In evaluations, WebGPT's answers were preferred over human-written answers 56% of the time and were preferred over the highest-voted Reddit answers 69% of the time. On TruthfulQA, the model produced true answers 75% of the time.
Mind2Web, presented as a Spotlight paper at NeurIPS 2023, introduced the first large-scale dataset and benchmark for building generalist web agents. The dataset contains 2,350 tasks spanning 137 websites across 31 domains. The accompanying MindAct model uses a two-stage approach: a fine-tuned small language model first filters relevant web elements from the page, and then a larger LLM selects from the filtered elements and predicts the corresponding action. Mind2Web was a key contribution from the OSU NLP Group and demonstrated that LLMs could be trained for generalized web interaction rather than site-specific automation.
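The filter-then-select pattern behind MindAct can be sketched with toy scoring functions standing in for both models. The substring-overlap ranker and the "take the top candidate" selector below are illustrative placeholders, not MindAct's trained components.

```python
# Sketch of MindAct's two-stage pattern: a cheap ranker first filters
# candidate elements, then a stronger model picks one from the short list.

def rank_elements(elements: list[dict], query: str, top_k: int = 3) -> list[dict]:
    """Stage 1: keep the elements whose text overlaps the query terms."""
    scored = [(sum(w in el["text"].lower() for w in query.lower().split()), el)
              for el in elements]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [el for score, el in scored[:top_k] if score > 0]

def select_action(candidates: list[dict], query: str) -> dict:
    """Stage 2: a real system prompts an LLM with the short list;
    here we simply take the top-ranked candidate."""
    if not candidates:
        return {"action": "none"}
    return {"action": "click", "element": candidates[0]}
```

The design rationale is cost: the expensive model never sees the full page, only the handful of elements the cheap filter lets through.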
SeeAct, also from the OSU NLP Group and published at ICML 2024, advanced the field by demonstrating that multimodal models could serve as generalist web agents. The system uses GPT-4V to visually perceive websites and generate action plans in textual form. SeeAct instructs the model to imitate human browsing behavior, analyzing the task, current webpage, and previous actions to determine the next step. In experiments, SeeAct successfully completed 51.1% of tasks on live websites when provided with manual grounding for action execution, demonstrating the potential and current limitations of vision-based web agents.
Adept, a startup founded by former OpenAI and Google researchers, developed ACT-1 (Action Transformer), one of the earliest models trained specifically to operate computer interfaces. In a notable early demonstration, ACT-1 navigated complex enterprise procurement software entirely from a natural language instruction, which helped Adept secure $350 million in funding in 2023. However, Adept never released a public product. In June 2024, Amazon hired Adept's co-founders and key employees, and licensed the company's technology, models, and datasets. This talent and technology contributed directly to the development of Amazon Nova Act.
AI browser agents face several significant technical challenges that limit their reliability and applicability.
Modern websites frequently use JavaScript frameworks to render content dynamically, load data asynchronously, and update the page without full reloads. Elements may appear, disappear, or change position based on user interactions, network conditions, or server responses. Browser agents must handle loading states, wait for elements to become interactive, and adapt to pages that look different on each visit due to personalized content, A/B testing, or time-sensitive information.
Websites deploy CAPTCHAs, rate limiting, browser fingerprinting, and behavioral analysis to prevent automated access. These mechanisms are specifically designed to block non-human interactions, creating a fundamental tension with browser agents, whose purpose is to automate exactly the interactions these defenses are meant to stop. Modern CAPTCHA systems have moved beyond simple image recognition to analyze behavioral signals such as mouse movement patterns, typing cadence, and navigation timing, making them difficult for agents to bypass.
Some emerging solutions address this challenge. Amazon's Bedrock AgentCore Browser uses Web Bot Auth, a draft IETF protocol that cryptographically identifies AI agents to websites and bot control vendors, reducing CAPTCHA friction for verified agents. This approach establishes a legitimate identity for the agent rather than attempting to disguise automated behavior.
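The shape of that exchange can be sketched as a signed request header that the site verifies. Note this is illustrative only: Web Bot Auth builds on HTTP Message Signatures with asymmetric keys, which requires third-party cryptography libraries, so an HMAC with a shared key stands in here purely to show the sign-and-verify flow.

```python
import hmac
import hashlib

# Illustrative stand-in for agent identification: the agent signs its
# request, and the site verifies the signature before easing bot checks.
# The real protocol uses public-key HTTP Message Signatures, not HMAC.

SHARED_KEY = b"demo-key"  # hypothetical; real deployments use asymmetric keys

def sign_request(agent_id: str, method: str, path: str) -> dict:
    message = f"{agent_id}:{method}:{path}".encode()
    tag = hmac.new(SHARED_KEY, message, hashlib.sha256).hexdigest()
    return {"Signature-Agent": agent_id, "Signature": tag}

def verify_request(headers: dict, method: str, path: str) -> bool:
    message = f"{headers['Signature-Agent']}:{method}:{path}".encode()
    expected = hmac.new(SHARED_KEY, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers["Signature"])
```

The key property is that identity is asserted cryptographically rather than inferred from behavior, so the site can grant verified agents passage without weakening its defenses against unidentified bots.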
Many useful web tasks require logging into accounts, which raises both technical and security challenges. Agents need access to credentials, and providing those credentials to an AI system introduces risks around data exposure and unintended actions. Most browser agent platforms handle this by asking users to log in manually before handing control to the agent, or by requesting explicit confirmation before any authentication step.
Web interactions can fail in unpredictable ways: pages may not load, buttons may be obscured by pop-ups or cookie consent banners, dropdown menus may behave differently across browsers, and network errors can interrupt multi-step workflows. Reliable browser agents need robust error detection and recovery strategies, including the ability to recognize when they are stuck and try alternative approaches.
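A common recovery pattern is retry-with-backoff followed by a switch to an alternative strategy. The sketch below assumes hypothetical action callables standing in for real browser steps such as clicking a button or dismissing a pop-up.

```python
import time

# Sketch of a recovery strategy: retry a flaky action with exponential
# backoff, then fall back to an alternative approach once the primary
# one keeps failing.

def with_recovery(primary, fallback, retries: int = 3, delay: float = 0.0):
    """Run `primary` up to `retries` times; on repeated failure, run `fallback`."""
    for attempt in range(retries):
        try:
            return primary()
        except RuntimeError:
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    # e.g. dismiss a cookie banner first, then click via a different selector
    return fallback()
```

The harder part in practice is the detection side, recognizing that an action silently failed (the page did not change) rather than raising an error, which is why agents typically re-observe the page after every action.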
Screenshot-based agents that call large multimodal models for every action are inherently slow and expensive. A task that takes a human 30 seconds might take a browser agent several minutes due to the latency of model inference at each step. For enterprise applications requiring high throughput, this overhead can be prohibitive. Reducing cost and latency while maintaining accuracy remains an active area of engineering.
AI browser agents introduce novel safety and security concerns that go beyond those of traditional AI systems.
Prompt injection is the most widely discussed security risk for browser agents. In this attack, malicious instructions are hidden within web page content, emails, or documents that the agent processes. Because the agent interprets both user instructions and web content, an attacker can embed text on a web page that instructs the agent to perform unintended actions, such as extracting personal data, making purchases, or clicking malicious links.
OpenAI has acknowledged that prompt injection "is unlikely to ever be fully solved" for browser agents, comparing it to scams and social engineering on the web. Brave's security research has confirmed that indirect prompt injection is a systemic challenge affecting all AI-powered browsers, not an isolated issue. Researchers have demonstrated that invisible prompt injections can be embedded in screenshots, making them undetectable to human oversight.
Defense strategies include training agents to distinguish between user instructions and third-party content, requiring user confirmation before high-stakes actions, and using adversarial testing with LLM-based automated attackers. OpenAI uses a reinforcement-learning-trained attacker to probe for vulnerabilities in its Atlas browser. Anthropic has published research on layered defenses against prompt injection for browser agents.
Because browser agents can take real-world actions such as sending messages, making purchases, and modifying account settings, errors carry tangible consequences. An agent that misinterprets a task or encounters an unexpected page state could place an incorrect order, send a message to the wrong person, or change critical settings. Most commercial browser agents mitigate this by requiring human confirmation before irreversible actions, but this limits the degree of autonomy and offsets some of the efficiency gains.
Browser agents necessarily process sensitive information visible on the user's screen, including emails, financial data, personal messages, and account credentials. This data passes through the AI model's inference pipeline, raising questions about data retention, model training on user data, and compliance with privacy regulations. Users must trust that the agent provider handles this data appropriately, and enterprises must evaluate browser agents against their data governance policies.
Browser agents could be misused for spam, fraud, credential stuffing, or other malicious activities. The ability to automate browser interactions at scale lowers the barrier for attacks that previously required human effort. Agent providers implement usage policies and monitoring to detect and prevent abuse, but the dual-use nature of the technology means that safeguards must continually evolve.
AI browser agents are being applied across a wide range of use cases.
Enterprise workflow automation: Agents automate repetitive browser-based tasks such as data entry across multiple systems, invoice processing, compliance form completion, and report generation. This is particularly valuable when the target applications lack APIs and can only be accessed through a web interface.
Personal task automation: Consumer-facing agents handle tasks like booking travel, ordering groceries, comparing prices across shopping sites, scheduling appointments, and filling out forms. OpenAI's Operator and Atlas target this category.
Software testing: Browser agents can be used to test web applications by simulating user interactions, identifying broken workflows, and verifying that pages render correctly. Their ability to navigate unfamiliar interfaces makes them useful for exploratory testing.
Research and data collection: Agents can gather information from multiple websites, compile results, and synthesize findings, going beyond simple web scraping to tasks that require navigating complex multi-page workflows.
Accessibility: Browser agents have the potential to make the web more accessible to people with disabilities by performing complex navigation tasks on their behalf, effectively translating high-level intents into detailed browser interactions.
Several trends are shaping the future of AI browser agents. Agent reliability continues to improve as models become better at understanding web interfaces, recovering from errors, and handling edge cases. The gap between human and agent performance on benchmarks like WebArena is closing, though it remains significant.
Standardization efforts such as the Web Bot Auth IETF protocol aim to create legitimate channels for AI agents to identify themselves to websites, potentially reducing friction from anti-bot measures while maintaining security. Multi-agent architectures, where multiple specialized agents collaborate on complex tasks, are being explored to improve reliability and handle tasks that span many websites.
The integration of browser agents into existing products, as demonstrated by Microsoft's Copilot Studio and OpenAI's Atlas browser, suggests that browser agent capabilities will become a standard feature of productivity software rather than standalone tools. As these systems mature, the balance between autonomy and human oversight will remain a central design challenge.