BrowserGym is an open-source Gymnasium-style environment and unified benchmark ecosystem for web agent research, developed by ServiceNow Research. It provides a single Python interface that wraps multiple existing web-agent benchmarks (including WebArena, VisualWebArena, WorkArena, WorkArena++, MiniWoB++, AssistantBench, and WebLINX) under a common observation space, action space, and reset/step API, so that the same agent code can be evaluated across very different web tasks without rewriting per-benchmark glue.
The project was first released alongside the WorkArena paper at the International Conference on Machine Learning (ICML) 2024, where it served as the evaluation environment for that benchmark. It was then formalized as a research artifact in its own right in the December 2024 paper The BrowserGym Ecosystem for Web Agent Research (Le Sellier De Chezelles et al., arXiv:2412.05467), which describes the unified action space, observation modalities, and the companion AgentLab framework. BrowserGym is implemented as a thin layer over Playwright-driven Chromium, so agents interact with real browsers rather than HTML traces or screenshots alone.[1][2][3]
By 2025, BrowserGym had become one of the most widely used research environments for browser-based LLM agents, in part because the same harness could be used to compare frontier models like GPT-4o, Claude 3.5 Sonnet, Llama 3, and others on a wide variety of tasks under identical conditions.[1]
Research on browser-based AI agents grew rapidly between 2022 and 2024, but each new benchmark tended to ship with its own scaffolding. WebArena used a Docker-hosted set of self-contained sites with a custom Python harness, MiniWoB++ relied on a Selenium-based runner originally written for reinforcement learning research, Mind2Web used cached HTML snapshots replayed offline, and AssistantBench targeted the open web. The result was that a single research group wanting to test their agent across all of these environments had to maintain several different evaluation pipelines, each with its own action vocabulary, observation format, and execution model.
This fragmentation made head-to-head comparisons difficult. A reported success rate on WebArena could not be directly compared to a number on MiniWoB++ because the two benchmarks defined "clicking" differently, exposed different page representations to the agent, and recorded different metrics. The BrowserGym authors argued that the bottleneck for the field was not raw model capability but rather the lack of a shared interface, similar to how Gym (and later Gymnasium) standardized reinforcement learning environments.[1]
BrowserGym was designed to fill that gap. Instead of inventing a new benchmark, it imports the existing benchmarks as plug-ins and exposes them all through a single BrowserEnv class with a uniform observation tuple and action space. Researchers can then write one agent and run it across MiniWoB++, WebArena, WorkArena, and the rest with only a configuration change.[1][2]
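For illustration, a minimal sketch of that configuration change (the task IDs below are examples of the registered-environment naming pattern, not an authoritative list):

```python
import gymnasium as gym

# Importing a benchmark sub-package registers its tasks under the
# "browsergym/" namespace; the specific task IDs here are illustrative.
import browsergym.miniwob    # e.g. "browsergym/miniwob.click-dialog"
import browsergym.workarena  # e.g. "browsergym/workarena.servicenow.order-standard-laptop"

# The agent code stays the same; only the environment ID changes.
env = gym.make("browsergym/miniwob.click-dialog")
```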
BrowserGym originated at ServiceNow Research, the AI research arm of the enterprise software company ServiceNow. Its first public appearance was in the WorkArena paper (Drouin et al., "WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?", ICML 2024), which introduced both the WorkArena benchmark and the BrowserGym environment that hosted it. WorkArena tested whether LLM-based agents could complete realistic enterprise tasks on a live ServiceNow instance, such as filling forms, ordering items from a service catalog, filtering lists, and reading dashboards. BrowserGym was the underlying harness that translated agent actions into real Playwright calls against the ServiceNow web UI.[4]
After the WorkArena release, the same group at ServiceNow extended the harness to support additional benchmarks and published it as a stand-alone ecosystem in the December 2024 paper. The 2024 ecosystem paper has 19 co-authors across ServiceNow Research, Mila, Polytechnique Montreal, McGill University, Tel Aviv University, and Carnegie Mellon University, reflecting the multi-institutional nature of the project. Lead authors include Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, and Léo Boisvert, with senior co-authors including Graham Neubig, Ruslan Salakhutdinov, and Siva Reddy.[1]
The code is hosted on GitHub at github.com/ServiceNow/BrowserGym under an open-source license. As of January 2026 the project had reached version 0.14.3, with active maintenance and frequent point releases.[2]
BrowserGym is organized as a small Python package, browsergym-core, plus a set of optional sub-packages that each contribute one task suite. This split keeps the core dependency footprint small and lets users install only the benchmarks they need.
At the bottom of the stack is a real Chromium browser controlled by Playwright. When an environment is reset, BrowserGym launches a fresh Chromium context, navigates to the task's starting URL, and exposes a Page object that is shared between the task setup code and the agent. Because the underlying browser is a real one, agents see real CSS, real iframes, real JavaScript-driven UI components, and real DOM mutations. This is in contrast to benchmarks like the original Mind2Web, which replays cached HTML snapshots, or text-only sandboxes that only return rendered page text.[2][3]
On top of Playwright, the BrowserEnv class wraps the browser in a Gymnasium v26-style API. The standard methods are:
- reset(seed=None): clears the page, sets up the task, and returns an initial (observation, info) tuple.
- step(action): parses the agent's action string, executes it against the live page, and returns (observation, reward, terminated, truncated, info).
- close(): shuts down the Chromium context.

This matches the Gymnasium contract, so researchers familiar with classic reinforcement learning loops can use the environment directly.[3]
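A minimal episode loop under this contract might look like the following sketch, which uses the generic openended task from the core package; the keyword arguments follow the project README, and the no-op agent is a placeholder.

```python
import gymnasium as gym
import browsergym.core  # registers the generic "browsergym/openended" task

# Minimal episode loop following the Gymnasium v26 contract described above.
env = gym.make(
    "browsergym/openended",
    task_kwargs={"start_url": "https://www.example.com"},  # where the episode starts
)
obs, info = env.reset(seed=0)

for _ in range(10):  # cap this sketch at 10 steps
    action = "noop()"  # a real agent would pick an action from the observation here
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

env.close()  # shuts down the Chromium context
```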
Each benchmark is registered as a subclass of AbstractBrowserTask. A task implements four methods: setup(page) to initialize state and return the natural-language goal; validate(page, chat_messages) to score the current state and decide whether the episode is done; teardown() to release resources; and an optional cheat(page, chat_messages) to provide a reference solution for debugging. Because all benchmarks share this interface, the same training, evaluation, and analysis code paths can target any of them.[3]
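As a rough sketch of that interface, the toy task below checks whether the agent has reached a page whose main heading contains a target phrase; the method names follow the description above, while the exact signatures and return values of the real AbstractBrowserTask may differ between versions.

```python
from browsergym.core.task import AbstractBrowserTask

class FindHeadingTask(AbstractBrowserTask):
    """Toy task: succeed when the page's <h1> mentions 'Example Domain'."""

    def setup(self, page):
        # Initialize state and return the natural-language goal.
        page.goto("https://www.example.com")
        goal = "Find a page whose main heading mentions 'Example Domain'."
        return goal, {}

    def validate(self, page, chat_messages):
        # Score the current state and decide whether the episode is done.
        heading = page.text_content("h1") or ""
        success = "Example Domain" in heading
        return (1.0 if success else 0.0), success, "", {}

    def teardown(self):
        pass  # nothing to release in this toy task

    def cheat(self, page, chat_messages):
        # Optional reference solution, useful for debugging the evaluator.
        page.goto("https://www.example.com")
```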
BrowserGym ships several optional packages corresponding to its supported benchmarks. Researchers install only what they need.[2]
| Package | Adds |
|---|---|
| browsergym-core | The base environment, action space, and POMDP wrapper |
| browsergym-miniwob | MiniWoB++ synthetic web tasks |
| browsergym-webarena | WebArena self-hosted realistic websites |
| browsergym-webarena-verified | A cleaned-up, human-verified WebArena variant |
| browsergym-visualwebarena | VisualWebArena multimodal extension |
| browsergym-workarena | WorkArena and WorkArena++ on a live ServiceNow instance |
| browsergym-assistantbench | AssistantBench open-web tasks |
| weblinx-browsergym | WebLINX conversational web traces |
| browsergym-experiments | Meta-package that pulls in all of the above |
The BrowserGym action space is designed to be expressive enough for capable agents while still being structured enough for evaluation.
Each action is a Python expression that calls one of a fixed set of primitive functions. The high-level primitives include click(bid), dblclick(bid), hover(bid), fill(bid, value), press(bid, key), select_option(bid, value), focus(bid), clear(bid), scroll(x, y), mouse_click(x, y), keyboard_press(key), goto(url), go_back(), go_forward(), new_tab(), tab_focus(index), tab_close(), upload_file(bid, path), and noop(). There are also send_msg_to_user(text) and report_infeasible(reason) actions for agents to communicate with the human or signal that a task cannot be completed.[1][3]
In each call, bid refers to a unique browser-element identifier (BID) that BrowserGym injects into the page during observation extraction. BIDs let agents reference DOM nodes symbolically without having to manage CSS selectors or XPaths, which often break when the page re-renders. Agents that prefer pixel coordinates can use the mouse_click(x, y) family instead.[1][3]
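For example, a few action strings an agent might emit in the high-level mode (the BID values here are invented for illustration):

```python
# Each action is a single Python-expression string over the allowed primitives.
example_actions = [
    'click("a51")',                               # click the element with BID "a51"
    'fill("b12", "standing desk")',               # type text into an input field
    'select_option("c7", "Price: low to high")',  # pick a dropdown option
    'scroll(0, 600)',                             # scroll down by a pixel offset
    'send_msg_to_user("Done - the order was submitted.")',
]
```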
For more advanced research, BrowserGym also offers a Python action mode in which the agent emits arbitrary Playwright code. This makes the action space Turing-complete, at the cost of being harder to constrain or analyze. Most published agents use the high-level mode.[1]
The action space is configurable per benchmark and per agent. A researcher can disable certain primitives (for example, removing mouse_click to force the agent to use BIDs), or extend it with new actions for benchmark-specific behavior. WorkArena++ adds chat actions so the agent can ask clarifying questions, and AssistantBench adds an answer(text) action so the agent can return a final response to an open-web question. The harness validates each action string against the configured allowed list before executing it, which both guards against malformed output and gives precise error feedback to the agent on the next observation.[1][3]
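A sketch of what such a restriction can look like, assuming the HighLevelActionSet class exposed by browsergym.core.action.highlevel (argument and method names may vary across versions):

```python
from browsergym.core.action.highlevel import HighLevelActionSet

# Keep only BID-based element actions and chat actions; coordinate-based
# mouse/keyboard primitives are excluded, so the agent must reference
# elements symbolically through their BIDs.
action_set = HighLevelActionSet(subsets=["bid", "chat"], strict=False)

# The action set can render its own documentation, which agents typically
# paste into their system prompt so the model knows the allowed primitives.
print(action_set.describe(with_long_description=False, with_examples=True))
```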
Observations are returned as a structured dictionary so that agents can choose which modalities to consume. The main observation fields are:
| Field | Type | Description |
|---|---|---|
| goal or chat_messages | string or list | The natural-language task instruction or running chat history |
| dom_object | structured DOM | A snapshot of the live HTML, augmented with BIDs, x/y coordinates, bounding boxes, and visibility flags |
| axtree_object | accessibility tree | The Chromium accessibility tree, also annotated with BIDs |
| screenshot | image | A PNG of the current viewport, suitable for vision-language models |
| open_pages_urls | list | URLs of all open tabs |
| active_page_index | int | Index of the currently focused tab |
| last_action | string | The action emitted by the agent on the previous step |
| last_action_error | string | Any error message from the last step (e.g., "element not found") |
| elapsed_time | float | Wall-clock time since reset |
Because the agent only sees this dictionary and not the full browser state, BrowserGym treats each task as a Partially Observable Markov Decision Process (POMDP). Different agents can choose different observation slices: a text-only agent might consume only the accessibility tree, while a multimodal agent might pair the screenshot with the DOM.[1][3]
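As an illustration of slicing the observation, the sketch below builds a text-only prompt from the goal and the accessibility tree; it assumes the flatten_axtree_to_str helper shipped in browsergym.utils.obs (a hand-rolled serializer would work equally well).

```python
from browsergym.utils.obs import flatten_axtree_to_str

def text_only_policy(obs: dict) -> str:
    """Consume only the goal and the BID-annotated accessibility tree."""
    goal = obs["goal"]
    axtree_text = flatten_axtree_to_str(obs["axtree_object"])
    prompt = (
        f"Goal: {goal}\n\n"
        f"Page (accessibility tree):\n{axtree_text}\n\n"
        "Next action:"
    )
    # ... call an LLM with `prompt` and return its single action string ...
    return 'click("a12")'  # placeholder; the BID is illustrative
```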
The headline contribution of the BrowserGym ecosystem is that it makes seven major web-agent benchmarks available behind one API. Each benchmark contributes its own tasks and evaluators, but agent code is unchanged.[1][2]
| Benchmark | Source paper | Domain | Approximate size | Style |
|---|---|---|---|---|
| MiniWoB++ | Shi et al., 2017; Liu et al., 2018 | Synthetic micro-tasks | 125 task types | Short tasks (clicks, drags, form fills) on small synthetic pages |
| WebArena | Zhou et al., 2024 | Self-hosted realistic websites | 812 tasks | Long-horizon tasks on Docker copies of GitLab, Reddit, Shopping, Map, etc. |
| VisualWebArena | Koh et al., 2024 | Visual web tasks | 910 tasks | Multimodal extension of WebArena requiring image understanding |
| WorkArena (L1) | Drouin et al., 2024 | Enterprise SaaS | 33 atomic task templates, ~19,912 instances | Common knowledge-worker actions on a live ServiceNow instance |
| WorkArena++ (L2/L3) | Boisvert et al., 2024 | Enterprise SaaS | 682 compositional tasks | Multi-step workflows requiring planning and reasoning |
| AssistantBench | Yoran et al., 2024 | Live open web | 214 tasks | Realistic information-finding tasks on the public internet |
| WebLINX | Lù et al., 2024 | Conversational web navigation | 100k+ interaction turns across 155 sites | Multi-turn dialogue plus web actions |
Most of these benchmarks were originally published with their own custom code. The BrowserGym packages re-implement their evaluators on top of the unified Playwright interface so that the same agent harness can be used end-to-end.[1]
AgentLab is the companion framework to BrowserGym, also developed by ServiceNow Research. While BrowserGym is the environment, AgentLab is the experiment runner: it provides higher-level building blocks for designing agents, launching large benchmark sweeps, and analyzing results.[1][5]
AgentLab adds GenericAgent, an agent with a configurable prompt, observation flags, and chain-of-thought options that researchers can use as a baseline or as a starting point for new agents.

A significant fraction of published BrowserGym numbers come from AgentLab runs because the framework handles the long tail of distributed evaluation, retries, and accounting that would otherwise need to be implemented for every paper.[5]
A typical AgentLab run looks like this. The researcher defines a Study object that pairs an agent configuration (model, prompt, observation flags) with a benchmark (WorkArena L1, for example). They call study.run(), which fans out tasks across worker processes, captures every prompt-response pair plus a recorded video of the browser, scores each episode against the task's validate() method, and writes a row to a results database. The researcher can then load the results in Jupyter, slice them by task category, and inspect individual trajectories in AgentXRay. This workflow has become a de facto standard for reporting BrowserGym numbers because it ensures that every reported success rate is backed by a recoverable, reproducible trace.[5]
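A schematic version of such a run, assuming AgentLab's Study class and a packaged GenericAgent configuration (exact import paths, configuration names, and arguments differ across AgentLab versions):

```python
from agentlab.agents.generic_agent import AGENT_4o      # a bundled GenericAgent config
from agentlab.experiments.study import Study

# Pair one agent configuration with one benchmark, then fan tasks out
# across worker processes; results and traces land in the study directory.
study = Study([AGENT_4o], benchmark="workarena_l1")
study.run(n_jobs=4)
```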
Because BrowserGym standardizes the harness, several research groups have adopted it for evaluating frontier models against web tasks. The ecosystem paper itself reports a head-to-head comparison of six state-of-the-art LLMs across six benchmarks, all run through the same GenericAgent with the same prompt template. The results show that no single model dominated every benchmark, and that the performance ordering can flip depending on whether the task is text-heavy or visual.[1]
The table below summarizes representative success rates from the December 2024 BrowserGym ecosystem paper and follow-up reports. All numbers are best-effort reproductions of figures published in the paper or its companion materials and should be cross-checked against the latest version of the paper before citing in academic work.[1][6]
| Benchmark | Claude 3.5 Sonnet | GPT-4o | Notes |
|---|---|---|---|
| MiniWoB++ | ~69.8% | comparable | Both models near saturation on the synthetic suite |
| WebArena | ~36.0% | ~31.5% | Long-horizon multi-site tasks |
| WorkArena L1 | ~56.4% | ~45.5% | Atomic ServiceNow knowledge-work tasks |
| WorkArena L2 | ~39.1% | ~8.5% | Compositional ServiceNow workflows |
| VisualWebArena | competitive | leads | GPT-4o stronger on visual tasks |
| AssistantBench | low (single digits to low teens) | similar | Both models struggle with open-web information seeking |
The headline finding from the paper is that Claude 3.5 Sonnet led on most benchmarks while GPT-4o pulled ahead on vision-heavy tasks, and that there is a striking gap between models on the harder compositional benchmarks (notably WorkArena L2). This pattern, where open-web and compositional tasks remain unsolved while synthetic micro-tasks approach saturation, has shaped subsequent web-agent research.[1][6]
BrowserGym is widely used in the academic web-agent literature. Research groups at ServiceNow Research, Mila, Carnegie Mellon University, McGill University, and Tel Aviv University collaborate on the ecosystem itself. Beyond the original authors, BrowserGym appears as the evaluation harness in numerous follow-up papers on web agents, distillation, retrieval-augmented browsing, and agent reward modeling, including AgentRewardBench, BrowserArena, and structured distillation studies. The ecosystem also underpins community leaderboards such as the Steel.dev browser-agent leaderboard, which uses BrowserGym tasks to rank publicly available agents.[6][7]
While ServiceNow maintains the project, the benchmarks themselves are model-agnostic: agents based on GPT-4, Claude, Gemini, Llama, and open-source variants have all been evaluated under the harness in published work.[1]
BrowserGym occupies a particular niche in the web-agent landscape. It is not itself a benchmark; it is the connective tissue that lets multiple benchmarks share an agent. The table below compares it to several adjacent projects.
| System | Type | Live browser? | Tasks | Primary contribution |
|---|---|---|---|---|
| BrowserGym | Environment + ecosystem | Yes (Playwright on real Chromium) | Aggregates 7+ benchmarks | Unified action/observation API across many benchmarks |
| Mind2Web | Benchmark | Originally no (cached HTML); later live variants | 2,000+ tasks across 137 sites | Static evaluation of generalist web agents |
| WebArena | Benchmark | Yes (self-hosted Docker) | 812 tasks | Realistic, reproducible multi-site environment |
| WorkArena | Benchmark | Yes (live ServiceNow) | 33 atomic templates + 682 compositional | Enterprise knowledge-work tasks |
| MiniWoB++ | Benchmark | Yes (Selenium originally) | 125 task types | Short, synthetic micro-tasks for RL |
| AssistantBench | Benchmark | Yes (live open web) | 214 tasks | Realistic time-consuming research questions |
| AgentBench | Multi-domain agent benchmark | Some web environments only | 8 environments | General LLM-as-agent evaluation, not web-specific |
| OSWorld | Benchmark | Full desktop OS | 369 tasks | Whole-OS GUI agents, not browser-only |
The sharpest contrast is with Mind2Web. Both projects evaluate web agents, but Mind2Web's original release was offline: it replayed cached HTML snapshots of websites, which made evaluation cheap and reproducible at the cost of not catching agents that drift from the recorded trajectory. BrowserGym is online: every action runs on a real Chromium tab, so the agent can encounter dynamic content, popups, or JavaScript state changes that a static replay cannot capture. This makes BrowserGym slower and more expensive but closer to real deployment.[1][7]
Compared to whole-OS environments like OSWorld, BrowserGym is narrower (browser-only) but considerably easier to run, since it does not require a virtual machine per task.
The authors and subsequent users have flagged several limitations of the BrowserGym ecosystem: driving a real browser makes evaluation slower and more costly than static-replay benchmarks, several task suites depend on external infrastructure (WorkArena requires a live ServiceNow instance, AssistantBench depends on the public internet), and results on live or open-web tasks can drift as the underlying sites change.
Despite these caveats, BrowserGym is widely viewed as the most consolidated environment for browser-based agent research as of 2025-2026, and its design has influenced subsequent harnesses for general computer-use agents.[1][6]
BrowserGym is maintained on GitHub with frequent releases. Selected recent versions:
| Version | Date | Notes |
|---|---|---|
| 0.1.x | Mid 2024 | Initial release alongside the WorkArena ICML paper |
| 0.10.x | Late 2024 | Released alongside the BrowserGym ecosystem paper (arXiv:2412.05467) |
| 0.14.0 / 0.14.1 | June 16, 2025 | Action-space and observation refactors |
| 0.14.2 | August 5, 2025 | MiniWoB zoom support, Playwright fixes, automatic MiniWoB install |
| 0.14.3 | January 20, 2026 | Latest stable release as of early 2026 |
The companion AgentLab framework follows a separate release cadence; its 0.4.x line was current in January 2026.[2][5]