Browser-use agent
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 5,950 words
A browser-use agent (also called an autonomous web agent or LLM-based browser agent) is an artificial intelligence software agent that operates a standard web browser through its graphical user interface to accomplish goals that users specify in natural language.[1][2] The term is also strongly associated with Browser Use, an open-source Python library and cloud platform founded in late 2024 by Magnus Müller and Gregor Žunič. It became one of the fastest-growing developer projects of 2025 and reached more than 95,000 GitHub stars by mid-2026.[3][4]
Unlike traditional web scraping, API-based approaches, or simple automation scripts that follow predefined rules, browser-use agents use the reasoning of large language models (LLMs) combined with browser automation infrastructure to perceive web page content, plan sequences of actions, and execute them across diverse websites without bespoke integration.[5][6] They represent a step toward general-purpose digital assistants that can handle real-world web-based tasks such as booking travel, completing forms, researching information, or running enterprise workflows.[7]
The term "browser-use agent" describes agents that complete tasks by controlling a web browser rather than calling site-specific APIs.[8] These systems share three traits:
Browser-use agents are distinguished from computer-use agents, which operate in broader desktop environments, by their focus on web interactions inside a browser instance.[9] In contemporary usage the phrase has also become a brand name: the open-source library browser-use published on PyPI and developed by Browser Use Inc. is sometimes referred to simply as "the browser-use agent."[4]
The concept of browser-use agents emerged from the convergence of advances in large language models and web automation technologies.
OpenAI's WebGPT, introduced in December 2021, demonstrated early browser-assisted question-answering with human feedback, fine-tuning GPT-3 to issue search and click commands in a text-based browser environment.[10] In September 2022, Adept AI introduced ACT-1, a transformer trained to use common software tools including web browsers, framed as an "action transformer" mapping natural language to UI operations.[11] During 2023, academic benchmarks WebArena from Carnegie Mellon and Mind2Web from Ohio State University established standardized evaluation frameworks that would shape later progress.[1][12] OpenAI's GPT-4V vision API, released in late 2023, made it feasible for general-purpose LLMs to interpret screenshots, opening the door to vision-based web agents.[13]
Browser Use was created by Magnus Müller and Gregor Žunič, two ETH Zurich graduate students.[3][14] According to Müller, the project began as "a weekend experiment to see if LLMs could navigate the web like humans," with an initial prototype built in four days and launched on Hacker News in October 2024.[3] The pair worked out of ETH Zurich's Student Project House (SPH), a campus incubator. The project went viral after gaining traction on Hacker News and X (formerly Twitter) in late November 2024 and acquired tens of thousands of GitHub stars within weeks, becoming one of the fastest-growing open-source repositories in the agentic-AI ecosystem.[4][15] The team was admitted to the Y Combinator Winter 2025 batch.[15]
In October 2024, Anthropic released the Computer Use feature for Claude 3.5 Sonnet, allowing the model to read pixels of a screen and emit keystrokes and mouse coordinates.[16] In December 2024, Google DeepMind unveiled Project Mariner, an experimental browser agent built on Gemini 2.0.[17] On 23 January 2025, OpenAI launched Operator, powered by the Computer-Using Agent (CUA) model, initially as a research preview to ChatGPT Pro subscribers.[2][18] OpenAI later integrated Operator capabilities into a unified ChatGPT Agent mode and announced deprecation of the standalone operator.chatgpt.com site.[19]
On 22 March 2025, Browser Use announced a $17 million seed round led by Felicis Ventures, with participation from A Capital, Nexus Venture Partners, SV Angel, Liquid2, Pioneer Fund, and angel investors including Paul Graham.[20][21] The round was led at Felicis by Senior Venture Partner Astasia Myers.[21] At the time of the announcement, the open-source repository had passed 50,000 GitHub stars and 15,000 active developers.[22][21]
In May 2025, Salesforce acquired Convergence AI, a UK-based competitor whose Proxy product launched browser-based agents; the acquisition was framed as accelerating Salesforce's Agentforce roadmap.[23] In June 2025, Browserbase, the cloud infrastructure provider behind the Stagehand agent framework, announced a $40 million Series B led by Notable Capital and launched a no-code product called Director.[24] Later in 2025 Browser Use shipped a managed cloud platform (browser-use.com) with Pay-As-You-Go and Enterprise tiers offering SLAs, on-premise deployment, HIPAA compliance, and zero-retention contracts.[25]
On 27 January 2026, Browser Use released BU 2.0, a proprietary in-house model tuned for web automation, claiming a 12% accuracy improvement over the prior default while preserving speed.[26] By May 2026 the open-source browser-use package had passed 95,000 GitHub stars and 10,000 forks.[4]
Many real-world workflows remain locked behind human-oriented web interfaces that lack public APIs. Browser-use agents aim to generalize across diverse sites without bespoke integration by:[27]
Browser Use co-founder Müller has framed the core technical wager as treating the browser "not just as an interface for humans, but as the execution environment for intelligent agents."[21]
A browser-use agent runs a perception-reasoning-action loop: it perceives the state of a web page, reasons about the next best action toward its goal, and executes that action. The cycle repeats until the task completes, fails, or hits a step limit.[28]
| Component | Description | Technologies | Implementation details |
|---|---|---|---|
| Perception layer | Understands content and layout of current web page | DOM parsing, CSS selectors, XPath, Accessibility Tree APIs, vision models | DOM extraction for interactive elements; screenshot processing (base64); visual analysis for layout; text extraction and semantic parsing |
| Reasoning and planning layer | Core decision-making powered by LLMs | GPT-4, Claude, Gemini, Llama; chain-of-thought; ReAct framework | Task decomposition into sub-goals; multi-step action planning; context management across pages; error detection and recovery |
| Action execution layer | Translates abstract actions into browser commands | Selenium, Playwright, Puppeteer, browser extensions, Chrome DevTools Protocol | Low-level control (click, type, scroll); multi-browser support; headless and visible modes; session management |
| Memory management | Maintains state and context | Vector databases, session storage, reinforcement learning memories | Working memory for active tasks; persistent memory across sessions; semantic memory; episodic action history |
| Safety and monitoring | Ensures safe operation and compliance | Refusal mechanisms, audit logging, permission systems | Prompt injection prevention; sensitive action gates; user approval workflows; activity logging and rollback |
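The loop and layers above can be sketched in a few lines. This is a minimal illustration, not the code of any particular agent: `FakePage`, `choose_action`, and `run_agent` are hypothetical names, and a rule-based stub stands in for the LLM call in the reasoning layer.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page: a two-field form with a submit button."""
    fields: dict = field(default_factory=lambda: {"name": "", "email": ""})
    submitted: bool = False

    def observe(self) -> dict:
        # Perception: return a structured snapshot of the current page state.
        return {"fields": dict(self.fields), "submitted": self.submitted}

    def act(self, action: dict) -> None:
        # Action execution: apply one low-level command to the page.
        if action["type"] == "type":
            self.fields[action["target"]] = action["text"]
        elif action["type"] == "click" and action["target"] == "submit":
            self.submitted = True


def choose_action(goal: dict, state: dict):
    """Reasoning stub; a real agent would prompt an LLM with the goal and state."""
    if state["submitted"]:
        return None  # goal reached, nothing left to do
    for name, wanted in goal.items():
        if state["fields"].get(name) != wanted:
            return {"type": "type", "target": name, "text": wanted}
    return {"type": "click", "target": "submit"}


def run_agent(page: FakePage, goal: dict, max_steps: int = 10):
    """Repeat perceive -> reason -> act until done or the step limit is hit."""
    history = []
    for _ in range(max_steps):
        state = page.observe()
        action = choose_action(goal, state)
        if action is None:
            return "done", history
        page.act(action)
        history.append(action)  # episodic action history (memory layer)
    return "step_limit", history
```

Calling `run_agent(FakePage(), {"name": "Ada", "email": "ada@example.com"})` fills both fields, clicks submit, and stops once the next observation shows the form as submitted. A small `max_steps` ends the run early with the `"step_limit"` status.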
In the open-source browser-use library, the agent loop is exposed through a high-level Agent class instantiated with a natural-language task and an LLM (the ChatBrowserUse, ChatOpenAI, ChatAnthropic, ChatGoogle, or ChatOllama wrappers).[29] When agent.run() is called, the library executes the following cycle for each step until completion or max_steps is reached:[30]
The library's design philosophy departs from screenshot-only agents (such as OpenAI's Operator) by relying primarily on the structured DOM, with vision as an optional channel; the team argues this is more deterministic, cheaper in tokens, and easier to debug.[21][31] Configuration parameters include max_actions_per_step (default 4), max_failures (default 3), use_vision ("auto", True, or False), vision_detail_level, fallback_llm, flash_mode (skips evaluation and next-goal generation for speed), and a separate page_extraction_llm for text extraction.[29]
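One way to picture a DOM-first observation is as a short, numbered list of interactive elements that the model can refer to by index. The sketch below uses only the Python standard library. The `[index]<tag>label</tag>` line format, the 1-based numbering, and the names `InteractiveElementIndexer` and `index_interactive_elements` are assumptions for illustration, not the library's actual serializer.

```python
from html.parser import HTMLParser

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
VOID_TAGS = {"input"}  # interactive tags that have no closing tag


class InteractiveElementIndexer(HTMLParser):
    """Collects interactive elements and numbers them in document order."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._open = None  # (tag, text parts) of the element currently being read

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        if tag in VOID_TAGS:
            attributes = dict(attrs)
            label = attributes.get("placeholder") or attributes.get("name") or ""
            self._emit(tag, label)
        else:
            self._open = (tag, [])

    def handle_data(self, data):
        # Only text inside an open interactive element becomes part of its label.
        if self._open is not None:
            self._open[1].append(data.strip())

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            self._emit(tag, " ".join(part for part in self._open[1] if part))
            self._open = None

    def _emit(self, tag, label):
        index = len(self.lines) + 1  # 1-based index the model can cite in an action
        self.lines.append(f"[{index}]<{tag}>{label}</{tag}>")


def index_interactive_elements(html: str) -> str:
    parser = InteractiveElementIndexer()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

Non-interactive markup such as headings and paragraphs is dropped, which is where the token savings of a DOM-first observation come from. A production serializer would also need visibility checks, nested interactive elements, and iframe handling, none of which this sketch attempts.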
The original 2024 implementation built actions on top of Microsoft's Playwright cross-browser framework.[32] In August 2025 the team announced a rewrite that drops Playwright and speaks Chrome DevTools Protocol (CDP) directly.[33] In a public blog post the team described the move as "leaving the curse of abstraction," citing several reasons: the extra hop through Playwright's Node.js websocket server added latency; Playwright's synchronous "update view between actions" model fought against CDP's event-driven nature; and cross-origin iframe support and async reactions were difficult to retrofit.[33]
The CDP rewrite introduced an event-driven architecture in which "watchdog" services subscribe to CDP events. Examples documented by the project include a downloads_watchdog that monitors spontaneous file downloads, a crash_watchdog that detects renderer crashes via a single CDP event subscription, and dialog watchdogs that auto-handle beforeunload and JavaScript modals.[33] The DOM extraction pipeline produces "super-selectors" that combine target ID, frame ID, backend node ID, position data, and fallback CSS selectors, enabling reliable element tracking across cross-origin iframes.[33] The CLI 2.0 release (March 2025) reported approximately 50 ms command latency via a persistent background daemon and a roughly 50% reduction in token use compared to prior versions.[34]
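The watchdog pattern described above can be sketched as handlers subscribed to named events on a small in-process bus. `EventBus`, `CrashWatchdog`, and `DialogWatchdog` are hypothetical names, and the dialog policy is an assumption. The event and command names (`Target.targetCrashed`, `Page.javascriptDialogOpening`, `Page.handleJavaScriptDialog`) come from the Chrome DevTools Protocol, but which events the project's own watchdogs subscribe to is not stated in this article.

```python
from collections import defaultdict


class EventBus:
    """Tiny publish/subscribe hub standing in for a CDP event stream."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._handlers[event_name].append(handler)

    def dispatch(self, event_name, params):
        for handler in self._handlers[event_name]:
            handler(params)


class CrashWatchdog:
    """Records crashed targets from a single event subscription."""

    def __init__(self, bus):
        self.crashed_targets = []
        bus.subscribe("Target.targetCrashed", self.on_crash)

    def on_crash(self, params):
        self.crashed_targets.append(params["targetId"])


class DialogWatchdog:
    """Queues a response command for each JavaScript dialog that opens."""

    def __init__(self, bus):
        self.commands = []  # (method, params) pairs a real client would send over CDP
        bus.subscribe("Page.javascriptDialogOpening", self.on_dialog)

    def on_dialog(self, params):
        # Assumed policy: acknowledge alerts and beforeunload, dismiss confirm/prompt.
        accept = params["type"] in ("alert", "beforeunload")
        self.commands.append(("Page.handleJavaScriptDialog", {"accept": accept}))
```

Each watchdog reacts only when its event fires, so the agent loop does not have to poll for crashes or modals between actions.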
| Framework | Primary use case | Advantages | Limitations | Browser-use agent adoption |
|---|---|---|---|---|
| Playwright | Cross-browser automation | Fast, reliable, modern API, built-in waiting | Newer ecosystem; extra abstraction layer | Default for many agents; was used by browser-use until mid-2025[32] |
| Selenium | Traditional web testing | Mature, wide language support | Slower, more complex setup | Legacy support |
| Puppeteer | Chrome/Chromium control | Direct CDP access, lightweight | Chrome-only | Specialized use cases |
| Chrome DevTools Protocol (CDP) | Low-level browser control | Maximum control, event-driven | Complex, browser-specific | Used directly by browser-use from August 2025[33] |
| Mode | Description | Token usage | Speed | Accuracy | Best for |
|---|---|---|---|---|---|
| Snapshot mode | Uses accessibility tree or indexed DOM for element identification | Low (500 to 2K) | Fast (less than 1s) | High for simple pages | Form filling, standard layouts |
| Vision mode | Processes screenshots for visual understanding | High (5K to 15K) | Slow (2 to 5s) | High for complex layouts | Dynamic content, visual elements |
| Hybrid mode | Combines DOM parsing with visual processing | Medium (2K to 8K) | Medium (1 to 3s) | Highest overall | General-purpose automation |
| Streaming mode | Continuous observation and action | Very high | Real-time | Variable | Interactive applications |
Browser-use agents support various LLM providers with different capabilities:[17]
| Provider | Models | Vision support | Cost (per 1M tokens) | Latency | Best use case |
|---|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4-turbo, o-series | Yes | $5 to 15 | Low | Production systems |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude Sonnet 4 | Yes | $3 to 15 | Low | Complex reasoning |
| Google | Gemini 1.5 Pro, Gemini 2.0, Gemini 2.5 | Yes | $3.5 to 7 | Low | Multimodal tasks |
| Open source | Llama 3, Mistral, Qwen | Limited | $0.5 to 2 | Variable | Cost-sensitive applications |
The Browser Use library exposes wrapper classes (ChatOpenAI, ChatAnthropic, ChatGoogle, ChatGroq, ChatOllama) plus its own ChatBrowserUse client that defaults to the in-house BU 2.0 model.[29][26] In April and May 2026 the project removed litellm from core dependencies in response to a supply chain incident, while keeping a ChatLiteLLM wrapper for users who install it separately.[35]
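The per-step token ranges in the observation-mode table and the per-token prices in the provider table combine into a rough per-task cost. The sketch below is back-of-the-envelope arithmetic only: the 20-step task length is an assumption, `estimate_task_cost` is a hypothetical helper, and a single blended price ignores the usual difference between input and output token pricing.

```python
def estimate_task_cost(steps: int, tokens_per_step: int, usd_per_million_tokens: float) -> float:
    """Rough cost of one task: total tokens times the price per token."""
    total_tokens = steps * tokens_per_step
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hybrid mode uses roughly 2K to 8K tokens per step; assume 20 steps at $5 per 1M tokens.
low = estimate_task_cost(steps=20, tokens_per_step=2_000, usd_per_million_tokens=5.0)
high = estimate_task_cost(steps=20, tokens_per_step=8_000, usd_per_million_tokens=5.0)
print(f"${low:.2f} to ${high:.2f} per task")  # $0.20 to $0.80 per task
```

Under these assumptions the estimate is 40,000 to 160,000 tokens per task, which is the same order of magnitude as the per-task cost figures quoted elsewhere in this article.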
| Benchmark | Focus area | Task count | Characteristics | Key metrics |
|---|---|---|---|---|
| WebArena | Realistic multi-site environment | 812 tasks | Self-hostable sites across e-commerce, CMS, social platforms; execution-based evaluation | Task success rate, efficiency score[1] |
| Mind2Web | Cross-website generalization | 2,350 tasks | 137 websites, real-world task diversity, action sequence annotation | Element accuracy, action F1 score[12] |
| WebVoyager | Live website interaction | 643 tasks | Amazon, GitHub, Google Maps, real-time execution | End-to-end success rate[18] |
| VisualWebArena | Multimodal and visual tasks | 910 tasks | Image-heavy tasks, visual grounding | Visual element accuracy[36] |
| BrowserGym | Unified ecosystem | 5,000+ tasks | Standardized obs/action spaces, cross-benchmark evaluation | Aggregate performance score[27] |
| WebShop | E-commerce navigation | 12,087 instructions | Product search and selection, attribute matching | Purchase success rate[37] |
| OSWorld | Full OS control | 369 tasks | Ubuntu, Windows, macOS environments | Cross-platform success rate[18] |
| Online-Mind2Web | Real websites, live tasks | 300 tasks | 136 real websites; binary scoring; replaces self-hosted Mind2Web | Human-evaluated success rate[38] |
| Agent / model | WebArena | WebVoyager | Mind2Web | OSWorld | Online-Mind2Web |
|---|---|---|---|---|---|
| Human baseline | 78.2% | 90.0% | 85.3% | 72.4% | n/a |
| Browser Use (open source, GPT-4o) | 51.2% | 89.1%[31] | 73.4% | n/a | n/a |
| Browser Use Auto-Research (2025) | n/a | n/a | n/a | n/a | 97.7%[38] |
| OpenAI CUA (Operator) | 58.1% | 87.0% | 76.2% | 38.1% | 61.3%[38] |
| Anthropic Computer Use | 45.3% | 56.0% | 62.1% | 22.0% | n/a |
| Project Mariner (Google) | 52.4% | 83.5% | 71.3% | n/a | n/a |
| Agent S2 (Simular, OSWorld 50-step) | n/a | n/a | n/a | 34.5%[39] | n/a |
The 89.1% WebVoyager number Browser Use reported in early 2025 has been disputed: independent evaluations by Browserable and others produced lower numbers (77.3% self-reported, 60.2% LLM-verified) on the same tasks, and competing projects such as Magnitude have reported 93.9% on WebVoyager.[31][40] On Online-Mind2Web (a successor benchmark with 300 tasks on 136 live websites), Browser Use reported 97.7% success rate using its Auto-Research approach built on the Claude Agent SDK, compared with 61.3% for OpenAI's Operator and 28 to 40% for most other agents.[38]
The Online-Mind2Web paper, titled "An Illusion of Progress? Assessing the Current State of Web Agents" (Xue et al., 2025), argued that earlier benchmarks systematically overestimated real-world agent ability because of impossible tasks, drifted website content, and inadequate evaluators.[41]
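Because binary task scoring is sensitive to which tasks count, it helps to see how excluding tasks changes a reported rate. The helper below is a generic sketch with made-up task IDs and outcomes. It does not reproduce any benchmark's official scorer, and `success_rate` is a hypothetical name.

```python
def success_rate(outcomes: dict, excluded=()) -> float:
    """Fraction of scored tasks that succeeded, skipping any excluded task IDs."""
    scored = {task: ok for task, ok in outcomes.items() if task not in excluded}
    if not scored:
        raise ValueError("no scorable tasks")
    return sum(scored.values()) / len(scored)


# Toy data: four tasks, one of which is later judged impossible.
outcomes = {"t1": True, "t2": False, "t3": True, "t4": False}
print(success_rate(outcomes))                   # 0.5
print(success_rate(outcomes, excluded={"t4"}))  # 0.666...
```

The same agent run yields 50% or about 67% depending on whether the impossible task is counted, which is why the denominator should accompany any headline number.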
browser-use is an open-source MIT-licensed Python library that enables LLM-powered browser interaction via natural language.[4] Notable characteristics include:
Installation: pip install browser-use, or via the uv package manager.[4]
Cloud platform: cloud.browser-use.com, with typed Python and TypeScript SDKs, pay-as-you-go pricing (~$0.01 per task initialization plus per-step LLM cost), live session preview, and Enterprise plans with on-premise deployment and HIPAA support.[25][42]
Released 23 January 2025, OpenAI Operator is powered by the Computer-Using Agent (CUA) model and combines GPT-4o's vision with reinforcement learning:[2][18]
The standalone operator.chatgpt.com site was later deprecated as Operator capabilities moved into ChatGPT Agent mode.[19]
Released October 2024, Anthropic Computer Use lets Claude models interact with computer interfaces through visual perception and simulated input.[16] On 26 August 2025, Anthropic launched a research preview of Claude for Chrome, a browser extension that lets Claude operate the user's actual Chrome session.[43] The extension entered general availability for Claude Pro, Team, Enterprise, and Max subscribers in late December 2025 with site-level permissions, action confirmations for high-risk operations, and built-in prompt injection defenses.[44][45]
Project Mariner is an experimental browser agent from Google DeepMind built on Gemini 2.0 and later Gemini 2.5, focused on multimodal understanding and announced in December 2024.[17] Google brought aspects of Mariner into its Gemini app and Project Astra during 2025 and 2026, and added an Agent Mode to Chrome in 2026 alongside agentic web protocols (the Agent Payments Protocol, Agent2Agent, and Trust Tokens for Agents).[46]
Stagehand is a TypeScript framework from Browserbase that adds AI methods (act, extract, observe) to existing Playwright code, with a hybrid model where deterministic Playwright handles predictable flows and AI commands handle ambiguous steps.[47] Browserbase, the cloud infrastructure platform behind Stagehand, raised a $40 million Series B in June 2025 and launched a no-code product called Director that emits Stagehand scripts from natural-language goals.[24]
Skyvern is an open-source agent that treats the DOM as unreliable and instead feeds screenshots to vision models. It offers native two-factor authentication, CAPTCHA support, structured schema-based extraction, and a no-code workflow builder.[48] Skyvern is often cited as a fork-friendly alternative to Browser Use for form-heavy automation.[49]
Browser-use agents are increasingly positioned as a successor to traditional Robotic Process Automation (RPA), with applications including:
Modern single-page applications with asynchronous loading, virtual scrolling, and lazy loading complicate element discovery; AJAX-heavy interfaces require sophisticated waiting strategies.[53] Browser Use's CDP migration was motivated in part by the inadequacy of synchronous Playwright waits for the agent's event-driven needs.[33]
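Two common waiting strategies for asynchronous pages are polling for a condition and waiting for the page to stop changing. The sketch below is a generic illustration of both. `wait_until` and `wait_for_stable` are hypothetical helpers, not functions from Playwright, CDP, or browser-use.

```python
import time


def wait_until(predicate, timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def wait_for_stable(snapshot, quiet_polls: int = 3, timeout: float = 5.0, interval: float = 0.05):
    """Return the snapshot once it is unchanged for `quiet_polls` consecutive polls."""
    deadline = time.monotonic() + timeout
    last = snapshot()
    unchanged = 0
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = snapshot()
        unchanged = unchanged + 1 if current == last else 0
        last = current
        if unchanged >= quiet_polls:
            return current
    raise TimeoutError("page did not settle before the timeout")
```

Here `snapshot` could be any cheap fingerprint of the page, such as a count of DOM nodes or a hash of the visible text. Polling is simple but wastes time between checks, which is one reason event-driven designs are attractive for agents.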
Shadow DOM and iframes create isolation barriers; dynamically generated IDs defeat naive selectors; and visually similar elements require disambiguation. Browser Use addresses these through cross-origin iframe handling and "super-selector" tuples (target ID, frame ID, backend node ID, position, fallback selector).[33][54]
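The "super-selector" idea can be sketched as a record that carries several ways to find the same element, tried from most to least specific. The field names follow the tuple described above. The resolution order, the `resolve` function, and the shape of `live_nodes` and `query_css` are assumptions for illustration rather than the project's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class SuperSelector:
    target_id: str               # browser target (tab or out-of-process iframe)
    frame_id: str                # frame inside that target
    backend_node_id: int         # node identifier within the frame
    position: Tuple[int, int]    # last known (x, y) of the element
    fallback_css: str            # selector to retry if the node ID went stale


def resolve(
    selector: SuperSelector,
    live_nodes: dict,
    query_css: Callable[[str, str], Optional[str]],
):
    """Try the exact node identity first, then fall back to a CSS query in the same frame."""
    key = (selector.target_id, selector.frame_id, selector.backend_node_id)
    node = live_nodes.get(key)
    if node is not None:
        return node, "backend_node_id"
    node = query_css(selector.frame_id, selector.fallback_css)
    if node is not None:
        return node, "fallback_css"
    return None, "unresolved"
```

Keeping the frame and target IDs in the record is what lets the lookup cross iframe boundaries, since a bare CSS selector is only meaningful inside one document.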
Persistent sessions across page transitions, authentication and two-factor flows, and unexpected logouts or timeouts present recurring difficulties.[55] Browser Use exposes real browser profiles for authentication, and Skyvern provides native 2FA and TOTP integration.[48]
Documented success rates on real-world tasks ranged from 60 to 90% in 2025; reliability gaps remain the dominant blocker for production deployment.[56] Specific failure modes include CAPTCHA handling, modal and popup dialogs, network failures, and rate limiting.[57]
| Issue | Impact | Current solutions | Future approaches |
|---|---|---|---|
| LLM inference latency | 2 to 5 second delays per action | Caching, batching, persistent daemons[34] | Edge deployment, model distillation |
| Token consumption | $0.10 to $1.00 per complex task | Efficient prompting, DOM-only mode | Specialized models (BU 2.0), compression[26] |
| Memory limits | Context window constraints | Summarization, pruning | Extended context, hierarchical memory |
| Reliability | 60 to 90% success rates | Retry logic, fallback LLMs | Reinforcement learning, self-improvement |
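The "retry logic, fallback LLMs" mitigation in the table can be sketched as a wrapper that retries a primary model a bounded number of times before switching to a backup. `call_with_fallback` is a hypothetical helper. The `max_failures` name echoes the library parameter mentioned earlier, but its semantics here are an assumption.

```python
def call_with_fallback(primary, fallback, prompt: str, max_failures: int = 3):
    """Call `primary` up to `max_failures` times, then fall back to `fallback`.

    Returns (result, source, errors), where `source` names which model answered.
    """
    errors = []
    for _ in range(max_failures):
        try:
            return primary(prompt), "primary", errors
        except Exception as exc:  # any provider error counts as one failure
            errors.append(str(exc))
    if fallback is None:
        raise RuntimeError(f"primary failed {max_failures} times: {errors}")
    return fallback(prompt), "fallback", errors
```

Returning the collected errors alongside the result keeps the failures visible for audit logging instead of silently hiding them behind the fallback.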
Prompt injection attacks from malicious web content are widely considered the dominant security risk for browser-use agents.[58][59] Notable 2025 vulnerabilities include EchoLeak (CVE-2025-32711), a zero-click vulnerability in Microsoft 365 Copilot that allowed remote attackers to exfiltrate data via email content;[59] CurXecute (CVE-2025-54135), a remote code execution flaw in Cursor IDE triggered by malicious README content (CVSS 9.8);[60] and a prompt-injection-via-navigation flaw in Perplexity Comet disclosed in October 2025.[61] OpenAI stated in December 2025 that prompt injection in AI browsers is "unlikely to ever be fully solved," and the UK National Cyber Security Centre echoed that this class of attack "may be a problem that is never fully fixed."[62][63] Google's January 2026 report noted a 32% relative increase in malicious prompt-injection activity between November 2025 and February 2026.[64] Indirect prompt injection, which delivers attacker payloads through third-party content (web pages, documents, emails) rather than direct user prompts, was identified by OWASP as the #1 AI security threat for 2026.[65]
Other concerns include credential theft, cross-site scripting via injected payloads, data exfiltration, processing of sensitive personal information, screenshot capture of private data, audit-trail retention, automated spam, large-scale unauthorized scraping, and terms-of-service violations.[66] Browser Use's documentation includes a sensitive_data parameter that masks specified strings in LLM prompts and screenshots.[29]
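Masking of the kind the sensitive_data parameter is described as providing can be pictured as a substitution pass: secrets are swapped for named placeholders before text reaches the model, and swapped back only when an action is executed in the browser. The `<secret>name</secret>` placeholder format and the helper names below are assumptions, not the library's implementation.

```python
def mask_sensitive(text: str, sensitive_data: dict) -> str:
    """Replace each secret value with a named placeholder before prompting the LLM."""
    # Longest secrets first, so a secret that contains another is masked whole.
    ordered = sorted(sensitive_data.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, secret in ordered:
        if secret:
            text = text.replace(secret, f"<secret>{name}</secret>")
    return text


def unmask(text: str, sensitive_data: dict) -> str:
    """Restore real values in an action just before it is executed in the browser."""
    for name, secret in sensitive_data.items():
        text = text.replace(f"<secret>{name}</secret>", secret)
    return text


secrets = {"x_user": "ada@example.com", "x_pass": "hunter2"}
prompt = mask_sensitive("Log in as ada@example.com with password hunter2", secrets)
# prompt == "Log in as <secret>x_user</secret> with password <secret>x_pass</secret>"
```

The model only ever sees and emits placeholder names, so the plaintext values never enter the prompt or the model's output. String replacement alone does not cover screenshots, which need separate redaction.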
The 2025 to 2026 period transformed Browser Use from a viral library into a venture-backed company with a commercial cloud and an in-house model:
April to May 2026: removal of litellm from core dependencies after a supply-chain incident; release of install-lite.sh; security hardening for daemon socket access and file handling.[35]
The browser-use agent market consolidated rapidly during this period:
Three related agent protocols emerged in this period. The Model Context Protocol (MCP), introduced by Anthropic in November 2024 and adopted broadly by 2025, gave LLM hosts a standard interface for tools, including browser tools.[68] The Agent2Agent (A2A) protocol, released by Google in 2025, specified interoperability between agents.[69] The Agent Payments Protocol (AP2), also from Google, defined trusted-credential handshakes for agent-initiated payments.[46] Browser Use has integrated MCP via a server that exposes its actions as MCP tools.[29]
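Exposing browser actions as MCP tools amounts to answering JSON-RPC 2.0 requests such as `tools/list` and `tools/call`. The dispatcher below is a simplified sketch of that message shape. The tool names `browser_navigate` and `browser_click` are hypothetical, the handlers only return strings, and a real server would also implement the initialization handshake and a transport.

```python
import json

# Hypothetical tools: name -> (description, JSON Schema for arguments, handler).
TOOLS = {
    "browser_navigate": (
        "Open a URL in the current tab",
        {"type": "object", "properties": {"url": {"type": "string"}}, "required": ["url"]},
        lambda args: f"navigated to {args['url']}",
    ),
    "browser_click": (
        "Click an indexed element",
        {"type": "object", "properties": {"index": {"type": "integer"}}, "required": ["index"]},
        lambda args: f"clicked element {args['index']}",
    ),
}


def handle_request(raw: str) -> str:
    """Answer one JSON-RPC request with a JSON-RPC response string."""
    request = json.loads(raw)
    response = {"jsonrpc": "2.0", "id": request.get("id")}
    method = request.get("method")
    if method == "tools/list":
        response["result"] = {
            "tools": [
                {"name": name, "description": desc, "inputSchema": schema}
                for name, (desc, schema, _handler) in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        params = request.get("params", {})
        entry = TOOLS.get(params.get("name"))
        if entry is None:
            response["error"] = {"code": -32602, "message": f"Unknown tool: {params.get('name')}"}
        else:
            text = entry[2](params.get("arguments", {}))
            response["result"] = {"content": [{"type": "text", "text": text}], "isError": False}
    else:
        response["error"] = {"code": -32601, "message": "Method not found"}
    return json.dumps(response)
```

A host that speaks MCP can discover the tools with `tools/list` and then invoke one by name, so the same browser actions become usable from any compatible client.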
By mid-2026 the state of the art on Online-Mind2Web was held by Browser Use's Auto-Research configuration at 97.7%, with two caveats: two tasks were judged "impossible", and the judge was a Claude-based agentic system aligned with human reviewers.[38] Among open agents on OSWorld, Simular's Agent S2 led at 34.5% (50 steps).[39] On WebVoyager, project-reported numbers from Magnitude (93.9%), Browserable (90.4%), and Browser Use (89.1%) approached or exceeded the reported human baseline of 90.0% but were difficult to compare because each came from an independent harness.[40][50][31]
By May 2026, the open-source browser-use repository had passed 95,000 GitHub stars and 10,000 forks.[4]
| Aspect | Browser-use agent | Computer-use agent | Traditional RPA | Web scraping |
|---|---|---|---|---|
| Scope | Web browsers | Full desktop OS | Predefined workflows | Data extraction only |
| Adaptability | High (LLM-based) | High (LLM-based) | Low (scripted) | Low (rule-based) |
| Setup complexity | Medium | High | High | Low |
| Maintenance | Self-adapting | Self-adapting | Frequent updates | Regular updates |
| Cost | $0.10 to $1.00 per task | $0.50 to $2.00 per task | High initial, low per-task | Low |
| Use cases | General web automation | Any desktop application | Repetitive business processes | Data collection |
| Error handling | Intelligent recovery | Intelligent recovery | Basic retry logic | Minimal |