Browser-use agent
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 6,537 words
A browser-use agent is an artificial intelligence software agent that operates a standard web browser through its normal user interface (clicking, typing, scrolling, and navigating) to complete tasks described in plain natural language.[1][2] The term is most strongly associated with Browser Use, an open-source Python library and cloud platform created in late 2024 by Magnus Müller and Gregor Žunič, two ETH Zurich data-science master's students, whose stated mission is to "make websites accessible for AI agents."[3][4] It became one of the fastest-growing developer projects of 2025 and surpassed 100,000 GitHub stars by mid-2026.[3][4] Browser Use works by converting a web page's interactive elements into a structured, indexed text representation that a large language model can read and act on, letting the model decide which button to click or field to fill rather than relying on hand-written, site-specific scripts.[21][29]
More broadly, a browser-use agent (also called an autonomous web agent or LLM-based browser agent) differs from traditional web scraping, API-based integrations, or rule-following automation scripts: it uses the reasoning of large language models (LLMs) combined with browser automation infrastructure to perceive web page content, plan sequences of actions, and execute them across diverse websites without bespoke integration.[5][6] Such agents represent a step toward general-purpose AI agents that can handle real-world web tasks such as booking travel, completing forms, researching information, or running enterprise workflows.[7]
The term "browser-use agent" describes agents that complete tasks by controlling a web browser rather than calling site-specific APIs.[8] These systems share three traits:
Browser-use agents are distinguished from computer-use agents, which operate in broader desktop environments, by their focus on web interactions inside a browser instance.[9] In contemporary usage the phrase has also become a brand name: the open-source library browser-use published on PyPI and developed by Browser Use Inc. is sometimes referred to simply as "the browser-use agent."[4] The repository's one-line description states the project's goal directly: "Make websites accessible for AI agents. Automate tasks online with ease."[4]
The core browser-use library is open source under the permissive MIT license and is freely installable from PyPI.[4] Its co-founders have framed open access as central to the project: in the seed-round announcement they wrote that "we are building the infrastructure that enables AI to interact with the web as seamlessly as humans do."[3] Alongside the open-source library, Browser Use Inc. offers a separate commercial cloud platform (browser-use.com) with managed infrastructure, Pay-As-You-Go and Enterprise tiers, and features such as SLAs, on-premise deployment, HIPAA compliance, and zero-retention contracts.[25] The library itself is approximately 98% Python, requires Python 3.11 or newer, and can be installed via pip install browser-use or the uv package manager.[4]
A browser-use agent follows a perception-reasoning-action loop: it perceives the state of a web page, reasons about the next best action toward its goal, and executes that action. The cycle repeats until the task completes, fails, or hits a step limit.[28] Rather than feeding raw pixels to the model, Browser Use "convert[s] website interfaces into structured text that LLMs can process deterministically," according to its founders.[21] This structured-text approach is what makes the page "accessible" to the agent: buttons, links, and inputs are extracted, numbered, and described so the model can reference them by index.
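The loop and the indexed text representation can be illustrated with a toy sketch. This is not the library's implementation: the `Element` class, the `[index]<tag>text</tag>` line format, and the scripted `toy_decide` function (standing in for an LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Element:
    tag: str   # e.g. "button", "input"
    text: str  # visible label or placeholder


def serialize(elements: list[Element]) -> str:
    """Render interactive elements as numbered lines the model can cite by index."""
    return "\n".join(
        f"[{i}]<{e.tag}>{e.text}</{e.tag}>" for i, e in enumerate(elements)
    )


def run_loop(
    perceive: Callable[[], list[Element]],
    decide: Callable[[str], dict],
    act: Callable[[dict], None],
    max_steps: int = 10,
) -> str:
    """Repeat perceive -> reason -> act until done, failure, or the step limit."""
    for _ in range(max_steps):
        state = serialize(perceive())   # perception
        action = decide(state)          # reasoning (an LLM call in practice)
        if action["type"] == "done":
            return "completed"
        try:
            act(action)                 # action
        except Exception:
            return "failed"
    return "step_limit"


# Toy page and a scripted stand-in for the model (hypothetical).
page = [Element("input", "Email"), Element("button", "Submit")]
clicked: list[int] = []


def toy_decide(state: str) -> dict:
    if clicked:
        return {"type": "done"}
    # Pick the index of the line that mentions "Submit".
    for line in state.splitlines():
        if "Submit" in line:
            return {"type": "click", "index": int(line[1:line.index("]")])}
    return {"type": "done"}


result = run_loop(lambda: page, toy_decide, lambda a: clicked.append(a["index"]))
```

The point of the design is that the model only ever names an index; translating that index back into a concrete browser action is left to deterministic code.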
| Component | Description | Technologies | Implementation details |
|---|---|---|---|
| Perception layer | Understands content and layout of current web page | DOM parsing, CSS selectors, XPath, Accessibility Tree APIs, vision models | DOM extraction for interactive elements; screenshot processing (base64); visual analysis for layout; text extraction and semantic parsing |
| Reasoning and planning layer | Core decision-making powered by LLMs | GPT-4, Claude, Gemini, Llama; chain-of-thought; ReAct framework | Task decomposition into sub-goals; multi-step action planning; context management across pages; error detection and recovery |
| Action execution layer | Translates abstract actions into browser commands | Selenium, Playwright, Puppeteer, browser extensions, Chrome DevTools Protocol | Low-level control (click, type, scroll); multi-browser support; headless and visible modes; session management |
| Memory management | Maintains state and context | Vector databases, session storage, reinforcement learning memories | Working memory for active tasks; persistent memory across sessions; semantic memory; episodic action history |
| Safety and monitoring | Ensures safe operation and compliance | Refusal mechanisms, audit logging, permission systems | Prompt injection prevention; sensitive action gates; user approval workflows; activity logging and rollback |
In the open-source browser-use library, the agent loop is exposed through a high-level Agent class instantiated with a natural-language task and an LLM wrapper (ChatBrowserUse, ChatOpenAI, ChatAnthropic, ChatGoogle, or ChatOllama).[29] When agent.run() is called, the library repeats an observe-decide-act cycle for each step until the task completes or max_steps is reached.[30]
The library's design philosophy departs from screenshot-only agents (such as OpenAI's Operator) by relying primarily on the structured DOM with vision as an optional channel, which the team argues is more deterministic, cheaper in tokens, and easier to debug.[21][31] Configuration parameters include max_actions_per_step (default 4), max_failures (default 3), use_vision ("auto", True, or False), vision_detail_level, fallback_llm, flash_mode (skips evaluation and next-goal generation for speed), and a separate page_extraction_llm for text extraction.[29]
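A minimal sketch of how these parameters might be passed to the Agent class follows. The parameter names and defaults come from the text above; the import path, the model name, the task string, and the step cap are assumptions, and the exact constructor signature may differ between library versions, so check the installed release before relying on it.

```python
def build_agent_options() -> dict:
    """Collect the tuning parameters named in the text into keyword arguments."""
    return {
        "max_actions_per_step": 4,  # default per the text
        "max_failures": 3,          # default per the text
        "use_vision": "auto",       # "auto", True, or False
        "flash_mode": False,        # True skips evaluation and next-goal generation
    }


async def main() -> None:
    # Imported lazily so the option builder can be used without the library installed.
    # Assumed import path; verify against the installed version.
    from browser_use import Agent, ChatOpenAI

    agent = Agent(
        task="Find the current top story on Hacker News",  # illustrative task
        llm=ChatOpenAI(model="gpt-4o"),                    # model name is an assumption
        **build_agent_options(),
    )
    await agent.run(max_steps=20)  # step cap chosen for illustration
```

Running the agent would mean calling `asyncio.run(main())` with the library installed and a provider API key configured; neither is assumed here.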
The original 2024 implementation built actions on top of Microsoft's Playwright cross-browser framework.[32] In August 2025 the team announced a rewrite that drops Playwright and speaks Chrome DevTools Protocol (CDP) directly.[33] In a public blog post the team described the move as "leaving the curse of abstraction," citing several reasons: the extra hop through Playwright's Node.js websocket server added latency; Playwright's synchronous "update view between actions" model fought against CDP's event-driven nature; and cross-origin iframe support and async reactions were difficult to retrofit.[33]
The CDP rewrite introduced an event-driven architecture in which "watchdog" services subscribe to CDP events. Examples documented by the project include a downloads_watchdog that monitors spontaneous file downloads, a crash_watchdog that detects renderer crashes via a single CDP event subscription, and dialog watchdogs that auto-handle beforeunload and JavaScript modals.[33] The DOM extraction pipeline produces "super-selectors" that combine target ID, frame ID, backend node ID, position data, and fallback CSS selectors, enabling reliable element tracking across cross-origin iframes.[33] The CLI 2.0 release (March 2025) reported approximately 50 ms command latency via a persistent background daemon and a roughly 50% reduction in token use compared to prior versions.[34]
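The watchdog pattern can be sketched as a small publish-subscribe bus. The `EventBus` class and the two handler functions are hypothetical stand-ins, not the project's code; the event names `Target.targetCrashed` and `Page.javascriptDialogOpening` are real CDP events, and the events are dispatched by hand here rather than received from a browser.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]


class EventBus:
    """Route each incoming event to every watchdog subscribed to its name."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    async def dispatch(self, event: str, params: dict) -> None:
        await asyncio.gather(*(h(params) for h in self._handlers[event]))


log: list[str] = []


async def crash_watchdog(params: dict) -> None:
    # A single subscription is enough to notice a crashed target.
    log.append(f"crash: target {params['targetId']}")


async def dialog_watchdog(params: dict) -> None:
    # A real handler would answer the dialog; here we only record it.
    log.append(f"dialog handled: {params['type']}")


async def demo() -> None:
    bus = EventBus()
    bus.subscribe("Target.targetCrashed", crash_watchdog)
    bus.subscribe("Page.javascriptDialogOpening", dialog_watchdog)
    await bus.dispatch("Page.javascriptDialogOpening",
                       {"type": "beforeunload", "message": ""})
    await bus.dispatch("Target.targetCrashed", {"targetId": "T1"})


asyncio.run(demo())
```

Each watchdog stays independent of the agent's step loop, so a dialog or crash can be handled at the moment it occurs instead of being discovered on the next observation.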
The concept of browser-use agents emerged from the convergence of advances in large language models and web automation technologies, but the Browser Use project specifically was created by Magnus Müller and Gregor Žunič, who met in 2024 while completing master's degrees in data science at ETH Zurich and worked out of the university's Student Project House (SPH) incubator.[3][14][20] Müller had spent years building web-scraping tools before the pair teamed up.[20] The open-source project was launched in November 2024 and "quickly became one of the fastest-growing projects in the developer ecosystem," reaching roughly 46,000 GitHub stars within months of release.[20] The team was subsequently admitted to Y Combinator's Winter 2025 batch.[15]
OpenAI's WebGPT, introduced in December 2021, demonstrated early browser-assisted question-answering with human feedback, fine-tuning GPT-3 to issue search and click commands in a text-based browser environment.[10] In September 2022, Adept AI introduced ACT-1, a transformer trained to use common software tools including web browsers, framed as an "action transformer" mapping natural language to UI operations.[11] During 2023, academic benchmarks WebArena from Carnegie Mellon and Mind2Web from Ohio State University established standardized evaluation frameworks that would shape later progress.[1][12] OpenAI's GPT-4V vision API, released in late 2023, made it feasible for general-purpose LLMs to interpret screenshots, opening the door to vision-based web agents.[13]
According to Müller, the project began as "a weekend experiment to see if LLMs could navigate the web like humans," with an initial prototype built in four days and launched on Hacker News in October 2024.[3] The project went viral after gaining traction on Hacker News and X (formerly Twitter) in late November 2024 and acquired tens of thousands of GitHub stars within weeks, becoming one of the fastest-growing open-source repositories in the agentic AI ecosystem.[4][15]
In October 2024, Anthropic released the Computer Use feature for Claude 3.5 Sonnet, allowing the model to read pixels of a screen and emit keystrokes and mouse coordinates.[16] In December 2024, Google DeepMind unveiled Project Mariner, an experimental browser agent built on Gemini 2.0.[17] On 23 January 2025, OpenAI launched Operator, powered by the Computer-Using Agent (CUA) model, initially as a research preview to ChatGPT Pro subscribers.[2][18] OpenAI later integrated Operator capabilities into a unified ChatGPT Agent mode and announced deprecation of the standalone operator.chatgpt.com site.[19]
On 22 March 2025, Browser Use announced a $17 million seed round led by Felicis Ventures, with participation from A Capital, Nexus Venture Partners, SV Angel, Liquid2, Pioneer Fund, and angel investors including Y Combinator co-founder Paul Graham.[20][21] The round was led at Felicis by Senior Venture Partner Astasia Myers.[21] At the time of the announcement, the open-source repository had passed 50,000 GitHub stars and 15,000 active developers.[3][21] In the announcement the company described its plan to "build the infrastructure that makes this transition possible," predicting that "within a few years, automated workflows will outnumber human interactions on the web."[3]
In May 2025, Salesforce acquired Convergence AI, a UK-based competitor whose Proxy product launched browser-based agents; the acquisition was framed as accelerating Salesforce's Agentforce roadmap.[23] In June 2025, Browserbase, the cloud infrastructure provider behind the Stagehand agent framework, announced a $40 million Series B led by Notable Capital and launched a no-code product called Director.[24] Later in 2025 Browser Use shipped a managed cloud platform (browser-use.com) with Pay-As-You-Go and Enterprise tiers offering SLAs, on-premise deployment, HIPAA compliance, and zero-retention contracts.[25]
On 27 January 2026, Browser Use released BU 2.0, a proprietary in-house model tuned for web automation, claiming a 12% accuracy improvement over the prior default while preserving speed.[26] By mid-2026 the open-source browser-use package had passed 100,000 GitHub stars and 11,000 forks.[4]
Many real-world workflows remain locked behind human-oriented web interfaces that lack public APIs. Browser-use agents aim to generalize across diverse sites without bespoke integration by:[27]
Browser Use co-founder Müller has framed the core technical wager as treating the browser "not just as an interface for humans, but as the execution environment for intelligent agents."[21]
| Framework | Primary use case | Advantages | Limitations | Browser-use agent adoption |
|---|---|---|---|---|
| Playwright | Cross-browser automation | Fast, reliable, modern API, built-in waiting | Newer ecosystem; extra abstraction layer | Default for many agents; was used by browser-use until mid-2025[32] |
| Selenium | Traditional web testing | Mature, wide language support | Slower, more complex setup | Legacy support |
| Puppeteer | Chrome/Chromium control | Direct CDP access, lightweight | Chrome-only | Specialized use cases |
| Chrome DevTools Protocol (CDP) | Low-level browser control | Maximum control, event-driven | Complex, browser-specific | Used directly by browser-use from August 2025[33] |
| Mode | Description | Token usage | Speed | Accuracy | Best for |
|---|---|---|---|---|---|
| Snapshot mode | Uses accessibility tree or indexed DOM for element identification | Low (500 to 2K) | Fast (less than 1s) | High for simple pages | Form filling, standard layouts |
| Vision mode | Processes screenshots for visual understanding | High (5K to 15K) | Slow (2 to 5s) | High for complex layouts | Dynamic content, visual elements |
| Hybrid mode | Combines DOM parsing with visual processing | Medium (2K to 8K) | Medium (1 to 3s) | Highest overall | General-purpose automation |
| Streaming mode | Continuous observation and action | Very high | Real-time | Variable | Interactive applications |
Browser-use agents support various LLM providers with different capabilities:[17]
| Provider | Models | Vision support | Cost (per 1M tokens) | Latency | Best use case |
|---|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4-turbo, o-series | Yes | $5 to 15 | Low | Production systems |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude Sonnet 4 | Yes | $3 to 15 | Low | Complex reasoning |
| Google | Gemini 1.5 Pro, Gemini 2.0, Gemini 2.5 | Yes | $3.5 to 7 | Low | Multimodal tasks |
| Open source | Llama 3, Mistral, Qwen | Limited | $0.5 to 2 | Variable | Cost-sensitive applications |
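
The token ranges in the mode table and the per-million-token prices in the provider table combine into a rough per-task cost. The sketch below works through that arithmetic under loud assumptions: a task of 20 steps (a hypothetical figure), a single blended price for all tokens, and no caching or retries.

```python
def estimate_cost(steps: int, tokens_per_step: int, usd_per_million: float) -> float:
    """Approximate task cost: total tokens times the per-token price."""
    return steps * tokens_per_step * usd_per_million / 1_000_000


STEPS = 20  # assumed task length, not a figure from the tables

# Hybrid mode, low end: 2K tokens per step at $5 per 1M tokens.
low = estimate_cost(STEPS, 2_000, 5.0)    # 40,000 tokens -> $0.20

# Hybrid mode, high end: 8K tokens per step at $15 per 1M tokens.
high = estimate_cost(STEPS, 8_000, 15.0)  # 160,000 tokens -> $2.40
```

The spread between the two ends is more than tenfold, which is why mode selection and step count usually matter more to cost than the choice between similarly priced providers.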
The Browser Use library exposes wrapper classes (ChatOpenAI, ChatAnthropic, ChatGoogle, ChatGroq, ChatOllama) plus its own ChatBrowserUse client that defaults to the in-house BU 2.0 model.[29][26] In April and May 2026 the project removed litellm from core dependencies in response to a supply chain incident, while keeping a ChatLiteLLM wrapper for users who install it separately.[35]
| Benchmark | Focus area | Task count | Characteristics | Key metrics |
|---|---|---|---|---|
| WebArena | Realistic multi-site environment | 812 tasks | Self-hostable sites across e-commerce, CMS, social platforms; execution-based evaluation | Task success rate, efficiency score[1] |
| Mind2Web | Cross-website generalization | 2,350 tasks | 137 websites, real-world task diversity, action sequence annotation | Element accuracy, action F1 score[12] |
| WebVoyager | Live website interaction | 643 tasks | Amazon, GitHub, Google Maps, real-time execution | End-to-end success rate[18] |
| VisualWebArena | Multimodal and visual tasks | 910 tasks | Image-heavy tasks, visual grounding | Visual element accuracy[36] |
| BrowserGym | Unified ecosystem | 5,000+ tasks | Standardized obs/action spaces, cross-benchmark evaluation | Aggregate performance score[27] |
| WebShop | E-commerce navigation | 12,087 instructions | 1.18 million products; product search and selection, attribute matching | Purchase success rate[37] |
| OSWorld | Full OS control | 369 tasks | Ubuntu, Windows, macOS environments | Cross-platform success rate[18] |
| Online-Mind2Web | Real websites, live tasks | 300 tasks | 136 real websites; binary scoring; replaces self-hosted Mind2Web | Human-evaluated success rate[38] |
| Agent / model | WebArena | WebVoyager | Mind2Web | OSWorld | Online-Mind2Web |
|---|---|---|---|---|---|
| Human baseline | 78.2% | 90.0% | 85.3% | 72.4% | n/a |
| Browser Use (open source, GPT-4o) | 51.2% | 89.1%[31] | 73.4% | n/a | n/a |
| Browser Use Auto-Research (2025) | n/a | n/a | n/a | n/a | 97.7%[38] |
| OpenAI CUA (Operator) | 58.1% | 87.0% | 76.2% | 38.1% | 61.3%[38] |
| Anthropic Computer Use | 45.3% | 56.0% | 62.1% | 22.0% | n/a |
| Project Mariner (Google) | 52.4% | 83.5% | 71.3% | n/a | n/a |
| Agent S2 (Simular, OSWorld 50-step) | n/a | n/a | n/a | 34.5%[39] | n/a |
Browser Use reported an 89.1% success rate on the WebVoyager benchmark (run with GPT-4o across 586 tasks), which the project described as "state of the art" for an open-source web agent at the time, compared with 87% for OpenAI's Operator.[31] That number has since been disputed: independent evaluations by Browserable and others produced lower figures (77.3% self-reported, 60.2% LLM-verified) on the same tasks, and competing projects such as Magnitude have reported 93.9% on WebVoyager.[31][40] The Browser Use team has itself noted that it modified the original WebVoyager harness (different prompts, a LangChain migration, and manual review of "unknown" or "failed" tasks) because of issues with the default evaluator.[31] On Online-Mind2Web (a successor benchmark with 300 tasks on 136 live websites), Browser Use reported a 97.7% success rate using its Auto-Research approach built on the Claude Agent SDK, compared with 61.3% for OpenAI's Operator and 28 to 40% for most other agents.[38]
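Small gaps between headline success rates are easier to judge with a confidence interval. The sketch below applies the standard Wilson score interval to the 89.1% figure on 586 tasks; the success count of 522 is back-calculated from those two numbers and is an assumption, and the interval captures sampling noise only, not the harness differences discussed above.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_tasks = 586
successes = round(0.891 * n_tasks)  # back-calculated count, an assumption
lower, upper = wilson_interval(successes, n_tasks)
```

The resulting interval runs from roughly 86% to 91%, so an 87% score from another agent falls inside it: on sampling noise alone, a two-point gap on a benchmark of this size does not separate the two systems.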
The Online-Mind2Web paper, titled "An Illusion of Progress? Assessing the Current State of Web Agents" (Xue et al., 2025), argued that earlier benchmarks systematically overestimated real-world agent ability because of impossible tasks, drifted website content, and inadequate evaluators.[41]
browser-use is an open-source MIT-licensed Python library that enables LLM-powered browser interaction via natural language.[4] Notable characteristics include:
- Installation with pip install browser-use or via the uv package manager.[4]
- A managed cloud at cloud.browser-use.com with typed Python and TypeScript SDKs, pay-as-you-go pricing (~$0.01 per task initialization plus per-step LLM cost), live session preview, and Enterprise plans with on-premise deployment and HIPAA support.[25][42]

Released 23 January 2025, OpenAI Operator is powered by the Computer-Using Agent (CUA) model and combines GPT-4o's vision with reinforcement learning:[2][18]
- Later folded into a unified ChatGPT Agent mode, with deprecation announced for the standalone operator.chatgpt.com site.[19]

Released in October 2024, Anthropic Computer Use lets Claude models interact with computer interfaces through visual perception and simulated input.[16] On 26 August 2025, Anthropic launched a research preview of Claude for Chrome, a browser extension that lets Claude operate the user's actual Chrome session.[43] The extension entered general availability for Claude Pro, Team, Enterprise, and Max subscribers in late December 2025 with site-level permissions, action confirmations for high-risk operations, and built-in prompt injection defenses.[44][45]
Project Mariner is an experimental browser agent from Google DeepMind built on Gemini 2.0 and later Gemini 2.5, focused on multimodal understanding and announced in December 2024.[17] Google brought aspects of Mariner into its Gemini app and Project Astra during 2025 and 2026, and added an Agent Mode to Chrome in 2026 alongside agentic web protocols (the Agent Payments Protocol, Agent2Agent, and Trust Tokens for Agents).[46]
Stagehand is a TypeScript framework from Browserbase that adds AI methods (act, extract, observe) to existing Playwright code, with a hybrid model where deterministic Playwright handles predictable flows and AI commands handle ambiguous steps.[47] Browserbase, the cloud infrastructure platform behind Stagehand, raised a $40 million Series B in June 2025 and launched a no-code product called Director that emits Stagehand scripts from natural-language goals.[24]
Skyvern is an open-source agent that treats the DOM as unreliable and instead feeds screenshots to vision models, with native two-factor authentication, CAPTCHA support, structured schema-based extraction, and a no-code workflow builder.[48] Skyvern is often cited as a fork-friendly alternative to Browser Use for form-heavy automation.[49]
Browser-use agents are increasingly positioned as a successor to traditional Robotic Process Automation (RPA), with applications including:
Modern single-page applications with asynchronous loading, virtual scrolling, and lazy loading complicate element discovery; AJAX-heavy interfaces require sophisticated waiting strategies.[53] Browser Use's CDP migration was motivated in part by the inadequacy of synchronous Playwright waits for the agent's event-driven needs.[33]
Shadow DOM and iframes create isolation barriers; dynamically generated IDs defeat naive selectors; and visually similar elements require disambiguation. Browser Use addresses these through cross-origin iframe handling and "super-selector" tuples (target ID, frame ID, backend node ID, position, fallback selector).[33][54]
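A "super-selector" of this kind can be pictured as a record with several independent ways to find the same element. The field names below mirror the tuple in the text, but the `SuperSelector` class, the `resolve` function, and its resolution order (backend node ID first, CSS fallback second) are assumptions for illustration, not the library's actual logic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SuperSelector:
    target_id: str                  # browser target (tab or out-of-process iframe)
    frame_id: str                   # frame that owns the element
    backend_node_id: int            # node handle within that frame
    position: tuple[float, float]   # last known on-screen coordinates
    fallback_css: str               # used when the node handle has gone stale


def resolve(
    sel: SuperSelector,
    node_exists: Callable[[str, int], bool],
    query_css: Callable[[str, str], Optional[int]],
) -> Optional[int]:
    """Prefer the recorded node id; fall back to CSS within the same frame."""
    if node_exists(sel.frame_id, sel.backend_node_id):
        return sel.backend_node_id
    return query_css(sel.frame_id, sel.fallback_css)


# Toy DOM: frame id -> {backend node id -> css selector}
dom = {"frame-A": {41: "#pay", 42: "button.submit"}}


def node_exists(frame_id: str, node_id: int) -> bool:
    return node_id in dom.get(frame_id, {})


def query_css(frame_id: str, css: str) -> Optional[int]:
    for node_id, node_css in dom.get(frame_id, {}).items():
        if node_css == css:
            return node_id
    return None


# Node 99 no longer exists (for example after a re-render), so CSS finds node 42.
stale = SuperSelector("target-1", "frame-A", 99, (120.0, 240.0), "button.submit")
resolved = resolve(stale, node_exists, query_css)
```

Scoping every lookup to a frame ID is what keeps the fallback meaningful across cross-origin iframes, where the same CSS selector may match unrelated elements in different documents.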
Persistent sessions across page transitions, authentication and two-factor flows, and unexpected logouts or timeouts present recurring difficulties.[55] Browser Use exposes real browser profiles for authentication, and Skyvern provides native 2FA and TOTP integration.[48]
Documented success rates on real-world tasks ranged from 60 to 90% in 2025; reliability gaps remain the dominant blocker for production deployment.[56] Specific failure modes include CAPTCHA handling, modal and popup dialogs, network failures, and rate limiting.[57]
| Issue | Impact | Current solutions | Future approaches |
|---|---|---|---|
| LLM inference latency | 2 to 5 second delays per action | Caching, batching, persistent daemons[34] | Edge deployment, model distillation |
| Token consumption | $0.10 to $1.00 per complex task | Efficient prompting, DOM-only mode | Specialized models (BU 2.0), compression[26] |
| Memory limits | Context window constraints | Summarization, pruning | Extended context, hierarchical memory |
| Reliability | 60 to 90% success rates | Retry logic, fallback LLMs | Reinforcement learning, self-improvement |
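
The "retry logic, fallback LLMs" mitigation in the table can be sketched as a bounded retry followed by a switch to a second model. The function and the two toy models below are hypothetical; the `max_failures` name echoes the library parameter mentioned earlier, but this is not a description of how the library applies it.

```python
from typing import Callable

Model = Callable[[str], str]


def call_with_fallback(primary: Model, fallback: Model, prompt: str,
                       max_failures: int = 3) -> str:
    """Try the primary model up to max_failures times, then use the fallback."""
    failures = 0
    while failures < max_failures:
        try:
            return primary(prompt)
        except Exception:
            failures += 1  # count the failure and try again
    return fallback(prompt)


attempts: list[str] = []


def flaky_primary(prompt: str) -> str:
    attempts.append(prompt)
    raise TimeoutError("model timed out")


def steady_fallback(prompt: str) -> str:
    return f"fallback answered: {prompt}"


answer = call_with_fallback(flaky_primary, steady_fallback, "next action?")
```

A production version would normally add backoff between attempts and distinguish retryable errors (timeouts, rate limits) from permanent ones, both omitted here for brevity.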
Prompt injection attacks from malicious web content are widely considered the dominant security risk for browser-use agents.[58][59] Notable 2025 vulnerabilities include EchoLeak (CVE-2025-32711), a zero-click vulnerability in Microsoft 365 Copilot that allowed remote attackers to exfiltrate data via email content;[59] CurXecute (CVE-2025-54135), a remote code execution flaw in Cursor IDE triggered by malicious README content (CVSS 9.8);[60] and a prompt-injection-via-navigation flaw in Perplexity Comet disclosed in October 2025.[61] OpenAI stated in December 2025 that prompt injection in AI browsers is "unlikely to ever be fully solved," and the UK National Cyber Security Centre echoed that this class of attack "may be a problem that is never fully fixed."[62][63] Google's January 2026 report noted a 32% relative increase in malicious prompt-injection activity between November 2025 and February 2026.[64] Indirect prompt injection, which delivers attacker payloads through third-party content (web pages, documents, emails) rather than direct user prompts, was identified by OWASP as the #1 AI security threat for 2026.[65]
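Because injected instructions cannot be filtered out reliably, a common mitigation is to constrain what the agent may do regardless of what it has read. The sketch below shows one such gate, combining a domain allowlist with human approval for high-risk actions; the action names, the `gate` function, and the policy itself are hypothetical and are not taken from any of the products named above.

```python
from typing import Callable
from urllib.parse import urlparse

# Hypothetical set of action types that always need a human decision.
HIGH_RISK = {"submit_payment", "send_email", "download_file"}


def gate(action: dict, allowed_domains: set[str],
         approve: Callable[[dict], bool]) -> bool:
    """Return True only if the action targets an allowed site and is approved."""
    host = urlparse(action["url"]).hostname or ""
    on_allowlist = any(
        host == d or host.endswith("." + d) for d in allowed_domains
    )
    if not on_allowlist:
        return False            # page content cannot widen the allowlist
    if action["type"] in HIGH_RISK:
        return approve(action)  # defer to the user for risky operations
    return True


allowed = {"example.com"}
deny_all = lambda action: False

ok_click = gate({"type": "click", "url": "https://shop.example.com/cart"},
                allowed, deny_all)
blocked_site = gate({"type": "click", "url": "https://notexample.com/"},
                    allowed, deny_all)
blocked_payment = gate({"type": "submit_payment", "url": "https://example.com/pay"},
                       allowed, deny_all)
```

The gate runs outside the model, so a page that persuades the model to attempt a payment or visit an attacker's domain still cannot make the action execute.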
Other concerns include credential theft, cross-site scripting via injected payloads, data exfiltration, processing of sensitive personal information, screenshot capture of private data, audit-trail retention, automated spam, large-scale unauthorized scraping, and terms-of-service violations.[66] Browser Use's documentation includes a sensitive_data parameter that masks specified strings in LLM prompts and screenshots.[29]
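The idea behind masking can be shown with a simple string substitution applied before text reaches the model. This is a sketch of the general technique only: the `mask_sensitive` function and the `<secret>name</secret>` placeholder format are assumptions, and the library's own handling of its sensitive_data parameter may differ.

```python
def mask_sensitive(text: str, sensitive_data: dict[str, str]) -> str:
    """Replace each secret value with a named placeholder before prompting."""
    # Longer secrets first, so a secret containing another is masked whole.
    ordered = sorted(sensitive_data.items(), key=lambda kv: -len(kv[1]))
    for name, secret in ordered:
        if secret:  # skip empty values, which would match everywhere
            text = text.replace(secret, f"<secret>{name}</secret>")
    return text


masked = mask_sensitive(
    "login as alice with hunter2",
    {"user": "alice", "password": "hunter2"},
)
```

The model then plans with placeholder names, and the real values are substituted back only at the moment an action types them into the page.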
The 2025 to 2026 period transformed Browser Use from a viral library into a venture-backed company with a commercial cloud and an in-house model:
- April and May 2026: removal of litellm from core dependencies after a supply-chain incident; release of install-lite.sh; security hardening for daemon socket access and file handling.[35]

The browser-use agent market consolidated rapidly during this period:
Several related agent protocols emerged in this period. The Model Context Protocol (MCP), introduced by Anthropic in November 2024 and adopted broadly by 2025, gave LLM hosts a standard interface for tools, including browser tools.[68] The Agent2Agent (A2A) protocol from Google in 2025 specified interoperability between agents.[69] The Agent Payments Protocol (AP2), also from Google, defined trusted-credentials handshakes for agent-initiated payments.[46] Browser Use has integrated MCP via a server that exposes its actions as MCP tools.[29]
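Exposing a browser action as an MCP tool comes down to publishing a tool definition and accepting JSON-RPC calls against it. The sketch below builds both as plain data. The `tools/call` method and the `name`, `description`, and `inputSchema` fields follow the MCP specification; the tool name `browser_click` and its schema are hypothetical and are not taken from the Browser Use server.

```python
import json

# Hypothetical tool definition, shaped like an entry in a tools/list response.
click_tool = {
    "name": "browser_click",
    "description": "Click the interactive element with the given index.",
    "inputSchema": {
        "type": "object",
        "properties": {"index": {"type": "integer"}},
        "required": ["index"],
    },
}


def tools_call_request(request_id: int, tool: dict, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 tools/call request after a basic argument check."""
    missing = [k for k in tool["inputSchema"].get("required", [])
               if k not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    })


request = tools_call_request(1, click_tool, {"index": 3})
```

Since the schema travels with the tool, any MCP-capable host can discover the action and validate arguments without code specific to the browser library.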
By mid-2026 the state of the art on Online-Mind2Web was held by Browser Use's Auto-Research configuration at 97.7%, but with the caveat that two tasks were judged "impossible" and that the agentic judge was a Claude-based system aligned with human reviewers.[38] On OSWorld the open Agent S2 from Simular led at 34.5% (50 steps).[39] On WebVoyager, project-internal numbers from Magnitude (93.9%), Browserable (90.4%), and Browser Use (89.1%) all exceeded human-claimed baselines but were difficult to compare across independent harnesses.[40][50][31]
By mid-2026, the open-source browser-use repository had passed 100,000 GitHub stars and 11,000 forks.[4]
| Aspect | Browser-use agent | Computer-use agent | Traditional RPA | Web scraping |
|---|---|---|---|---|
| Scope | Web browsers | Full desktop OS | Predefined workflows | Data extraction only |
| Adaptability | High (LLM-based) | High (LLM-based) | Low (scripted) | Low (rule-based) |
| Setup complexity | Medium | High | High | Low |
| Maintenance | Self-adapting | Self-adapting | Frequent updates | Regular updates |
| Cost | $0.10 to $1.00 per task | $0.50 to $2.00 per task | High initial, low per-task | Low |
| Use cases | General web automation | Any desktop application | Repetitive business processes | Data collection |
| Error handling | Intelligent recovery | Intelligent recovery | Basic retry logic | Minimal |
Imagine you have a very smart assistant who can use a computer the same way you do. Instead of needing a special back-door connection to every website, the assistant looks at the page, reads the buttons and boxes as plain text, decides what to click or type, and does it, all because you told it a goal in everyday words like "book me the cheapest flight to Tokyo." Browser Use is a free, open tool that gives AI assistants this ability by turning a messy webpage into a tidy numbered list the AI can understand. Two graduate students built it in 2024, it became wildly popular on GitHub, and investors gave the project $17 million to keep improving it.