BrowserGym is an open-source Gymnasium-style environment and unified benchmark ecosystem for web agent research, developed by ServiceNow Research. It provides a single Python interface that wraps multiple existing web-agent benchmarks (including WebArena, VisualWebArena, WorkArena, WorkArena++, MiniWoB++, AssistantBench, and WebLINX) under a common observation space, action space, and reset/step API, so that the same agent code can be evaluated across very different web tasks without rewriting per-benchmark glue.
The project was first released alongside the WorkArena paper at the International Conference on Machine Learning (ICML) 2024, where it served as the evaluation environment for that benchmark. It was then formalized as a research artifact in its own right in the December 2024 paper The BrowserGym Ecosystem for Web Agent Research (Le Sellier De Chezelles et al., arXiv:2412.05467), which describes the unified action space, observation modalities, and the companion AgentLab framework. BrowserGym is implemented as a thin layer over Playwright-driven Chromium, so agents interact with real browsers rather than HTML traces or screenshots alone.[1][2][3]
By 2025, BrowserGym had become one of the most widely used research environments for browser-based LLM agents, in part because the same harness could be used to compare frontier models like GPT-4o, Claude 3.5 Sonnet, Llama 3, and others on a wide variety of tasks under identical conditions.[1]
Research on browser-based AI agents grew rapidly between 2022 and 2024, but each new benchmark tended to ship with its own scaffolding. WebArena used a Docker-hosted set of self-contained sites with a custom Python harness, MiniWoB++ relied on a Selenium-based runner originally written for reinforcement learning research, Mind2Web used cached HTML snapshots replayed offline, and AssistantBench targeted the open web. The result was that a single research group wanting to test their agent across all of these environments had to maintain several different evaluation pipelines, each with its own action vocabulary, observation format, and execution model.
This fragmentation made head-to-head comparisons difficult. A reported success rate on WebArena could not be directly compared to a number on MiniWoB++ because the two benchmarks defined "clicking" differently, exposed different page representations to the agent, and recorded different metrics. The BrowserGym authors argued that the bottleneck for the field was not raw model capability but rather the lack of a shared interface, similar to how Gym (and later Gymnasium) standardized reinforcement learning environments.[1]
BrowserGym was designed to fill that gap. Instead of inventing a new benchmark, it imports the existing benchmarks as plug-ins and exposes them all through a single BrowserEnv class with a uniform observation tuple and action space. Researchers can then write one agent and run it across MiniWoB++, WebArena, WorkArena, and the rest with only a configuration change.[1][2]
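For illustration, a minimal sketch of that configuration change (the task IDs below are examples of the registered-environment naming pattern, not an authoritative list):

```python
import gymnasium as gym

# Importing a benchmark sub-package registers its tasks under the
# "browsergym/" namespace; the specific task IDs here are illustrative.
import browsergym.miniwob    # e.g. "browsergym/miniwob.click-dialog"
import browsergym.workarena  # e.g. "browsergym/workarena.servicenow.order-standard-laptop"

# The agent code stays the same; only the environment ID changes.
env = gym.make("browsergym/miniwob.click-dialog")
```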
BrowserGym originated at ServiceNow Research, the AI research arm of the enterprise software company ServiceNow. Its first public appearance was in the WorkArena paper (Drouin et al., "WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?", ICML 2024), which introduced both the WorkArena benchmark and the BrowserGym environment that hosted it. WorkArena tested whether LLM-based agents could complete realistic enterprise tasks on a live ServiceNow instance, such as filling forms, ordering items from a service catalog, filtering lists, and reading dashboards. BrowserGym was the underlying harness that translated agent actions into real Playwright calls against the ServiceNow web UI.[4]
After the WorkArena release, the same group at ServiceNow extended the harness to support additional benchmarks and published it as a stand-alone ecosystem in the December 2024 paper. The 2024 ecosystem paper has 19 co-authors across ServiceNow Research, Mila, Polytechnique Montreal, McGill University, Tel Aviv University, and Carnegie Mellon University, reflecting the multi-institutional nature of the project. Lead authors include Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, and Léo Boisvert, with senior co-authors including Graham Neubig, Ruslan Salakhutdinov, and Siva Reddy.[1]
The code is hosted on GitHub at github.com/ServiceNow/BrowserGym under an open-source license. As of January 2026 the project had reached version 0.14.3, with active maintenance and frequent point releases.[2]
BrowserGym is organized as a small Python package, browsergym-core, plus a set of optional sub-packages that each contribute one task suite. This split keeps the core dependency footprint small and lets users install only the benchmarks they need.
At the bottom of the stack is a real Chromium browser controlled by Playwright. When an environment is reset, BrowserGym launches a fresh Chromium context, navigates to the task's starting URL, and exposes a Page object that is shared between the task setup code and the agent. Because the underlying browser is a real one, agents see real CSS, real iframes, real JavaScript-driven UI components, and real DOM mutations. This is in contrast to benchmarks like the original Mind2Web, which replays cached HTML snapshots, or text-only sandboxes that only return rendered page text.[2][3]
On top of Playwright, the BrowserEnv class wraps the browser in a Gymnasium v26-style API. The standard methods are:
- reset(seed=None): clears the page, sets up the task, and returns an initial (observation, info) tuple.
- step(action): parses the agent's action string, executes it against the live page, and returns (observation, reward, terminated, truncated, info).
- close(): shuts down the Chromium context.

This matches the Gymnasium contract, so researchers familiar with classic reinforcement learning loops can use the environment directly.[3]
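A minimal episode loop under this contract might look like the following sketch, which uses the generic openended task from the core package; the keyword arguments follow the project README, and the no-op agent is a placeholder.

```python
import gymnasium as gym
import browsergym.core  # registers the generic "browsergym/openended" task

# Minimal episode loop following the Gymnasium v26 contract described above.
env = gym.make(
    "browsergym/openended",
    task_kwargs={"start_url": "https://www.example.com"},  # where the episode starts
)
obs, info = env.reset(seed=0)

for _ in range(10):  # cap this sketch at 10 steps
    action = "noop()"  # a real agent would pick an action from the observation here
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

env.close()  # shuts down the Chromium context
```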
Each benchmark is registered as a subclass of AbstractBrowserTask. A task implements four methods: setup(page) to initialize state and return the natural-language goal; validate(page, chat_messages) to score the current state and decide whether the episode is done; teardown() to release resources; and an optional cheat(page, chat_messages) to provide a reference solution for debugging. Because all benchmarks share this interface, the same training, evaluation, and analysis code paths can target any of them.[3]
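As a rough sketch of that interface, the toy task below checks whether the agent has reached a page whose main heading contains a target phrase; the method names follow the description above, while the exact signatures and return values of the real AbstractBrowserTask may differ between versions.

```python
from browsergym.core.task import AbstractBrowserTask

class FindHeadingTask(AbstractBrowserTask):
    """Toy task: succeed when the page's <h1> mentions 'Example Domain'."""

    def setup(self, page):
        # Initialize state and return the natural-language goal.
        page.goto("https://www.example.com")
        goal = "Find a page whose main heading mentions 'Example Domain'."
        return goal, {}

    def validate(self, page, chat_messages):
        # Score the current state and decide whether the episode is done.
        heading = page.text_content("h1") or ""
        success = "Example Domain" in heading
        return (1.0 if success else 0.0), success, "", {}

    def teardown(self):
        pass  # nothing to release in this toy task

    def cheat(self, page, chat_messages):
        # Optional reference solution, useful for debugging the evaluator.
        page.goto("https://www.example.com")
```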
BrowserGym ships several optional packages corresponding to its supported benchmarks. Researchers install only what they need.[2]
| Package | Adds |
|---|---|
| browsergym-core | The base environment, action space, and POMDP wrapper |
| browsergym-miniwob | MiniWoB++ synthetic web tasks |
| browsergym-webarena | WebArena self-hosted realistic websites |
| browsergym-webarena-verified | A cleaned-up, human-verified WebArena variant |
| browsergym-visualwebarena | VisualWebArena multimodal extension |
| browsergym-workarena | WorkArena and WorkArena++ on a live ServiceNow instance |
| browsergym-assistantbench | AssistantBench open-web tasks |
| weblinx-browsergym | WebLINX conversational web traces |
| browsergym-experiments | Meta-package that pulls in all of the above |
The BrowserGym action space is designed to be expressive enough for capable agents while still being structured enough for evaluation.
Each action is a Python expression that calls one of a fixed set of primitive functions. The high-level primitives include click(bid), dblclick(bid), hover(bid), fill(bid, value), press(bid, key), select_option(bid, value), focus(bid), clear(bid), scroll(x, y), mouse_click(x, y), keyboard_press(key), goto(url), go_back(), go_forward(), new_tab(), tab_focus(index), tab_close(), upload_file(bid, path), and noop(). There are also send_msg_to_user(text) and report_infeasible(reason) actions for agents to communicate with the human or signal that a task cannot be completed.[1][3]
In each call, bid refers to a unique browser-element identifier (BID) that BrowserGym injects into the page during observation extraction. BIDs let agents reference DOM nodes symbolically without having to manage CSS selectors or XPaths, which often break when the page re-renders. Agents that prefer pixel coordinates can use the mouse_click(x, y) family instead.[1][3]
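For example, a few action strings an agent might emit in the high-level mode (the BID values here are invented for illustration):

```python
# Each action is a single Python-expression string over the allowed primitives.
example_actions = [
    'click("a51")',                               # click the element with BID "a51"
    'fill("b12", "standing desk")',               # type text into an input field
    'select_option("c7", "Price: low to high")',  # pick a dropdown option
    'scroll(0, 600)',                             # scroll down by a pixel offset
    'send_msg_to_user("Done - the order was submitted.")',
]
```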
For more advanced research, BrowserGym also offers a Python action mode in which the agent emits arbitrary Playwright code. This makes the action space Turing-complete, at the cost of being harder to constrain or analyze. Most published agents use the high-level mode.[1]
The action space is configurable per benchmark and per agent. A researcher can disable certain primitives (for example, removing mouse_click to force the agent to use BIDs), or extend it with new actions for benchmark-specific behavior. WorkArena++ adds chat actions so the agent can ask clarifying questions, and AssistantBench adds an answer(text) action so the agent can return a final response to an open-web question. The harness validates each action string against the configured allowed list before executing it, which both guards against malformed output and gives precise error feedback to the agent on the next observation.[1][3]
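A sketch of what such a restriction can look like, assuming the HighLevelActionSet class exposed by browsergym.core.action.highlevel (argument and method names may vary across versions):

```python
from browsergym.core.action.highlevel import HighLevelActionSet

# Keep only BID-based element actions and chat actions; coordinate-based
# mouse/keyboard primitives are excluded, so the agent must reference
# elements symbolically through their BIDs.
action_set = HighLevelActionSet(subsets=["bid", "chat"], strict=False)

# The action set can render its own documentation, which agents typically
# paste into their system prompt so the model knows the allowed primitives.
print(action_set.describe(with_long_description=False, with_examples=True))
```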
Observations are returned as a structured dictionary so that agents can choose which modalities to consume. The main observation fields are:
| Field | Type | Description |
|---|---|---|
| goal or chat_messages | string or list | The natural-language task instruction or running chat history |
| dom_object | structured DOM | A snapshot of the live HTML, augmented with BIDs, x/y coordinates, bounding boxes, and visibility flags |
| axtree_object | accessibility tree | The Chromium accessibility tree, also annotated with BIDs |
| screenshot | image | A PNG of the current viewport, suitable for vision-language models |
| open_pages_urls | list | URLs of all open tabs |
| active_page_index | int | Index of the currently focused tab |
| last_action | string | The action emitted by the agent on the previous step |
| last_action_error | string | Any error message from the last step (e.g., "element not found") |
| elapsed_time | float | Wall-clock time since reset |
Because the agent only sees this dictionary and not the full browser state, BrowserGym treats each task as a Partially Observable Markov Decision Process (POMDP). Different agents can choose different observation slices: a text-only agent might consume only the accessibility tree, while a multimodal agent might pair the screenshot with the DOM.[1][3]
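As an illustration of slicing the observation, the sketch below builds a text-only prompt from the goal and the accessibility tree; it assumes the flatten_axtree_to_str helper shipped in browsergym.utils.obs (a hand-rolled serializer would work equally well).

```python
from browsergym.utils.obs import flatten_axtree_to_str

def text_only_policy(obs: dict) -> str:
    """Consume only the goal and the BID-annotated accessibility tree."""
    goal = obs["goal"]
    axtree_text = flatten_axtree_to_str(obs["axtree_object"])
    prompt = (
        f"Goal: {goal}\n\n"
        f"Page (accessibility tree):\n{axtree_text}\n\n"
        "Next action:"
    )
    # ... call an LLM with `prompt` and return its single action string ...
    return 'click("a12")'  # placeholder; the BID is illustrative
```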
The headline contribution of the BrowserGym ecosystem is that it makes seven major web-agent benchmarks available behind one API. Each benchmark contributes its own tasks and evaluators, but agent code is unchanged.[1][2]
| Benchmark | Source paper | Domain | Approximate size | Style |
|---|---|---|---|---|
| MiniWoB++ | Shi et al., 2017; Liu et al., 2018 | Synthetic micro-tasks | 125 task types | Short tasks (clicks, drags, form fills) on small synthetic pages |
| WebArena | Zhou et al., 2024 | Self-hosted realistic websites | 812 tasks | Long-horizon tasks on Docker copies of GitLab, Reddit, Shopping, Map, etc. |
| VisualWebArena | Koh et al., 2024 | Visual web tasks | 910 tasks | Multimodal extension of WebArena requiring image understanding |
| WorkArena (L1) | Drouin et al., 2024 | Enterprise SaaS | 33 atomic task templates, ~19,912 instances | Common knowledge-worker actions on a live ServiceNow instance |
| WorkArena++ (L2/L3) | Boisvert et al., 2024 | Enterprise SaaS | 682 compositional tasks | Multi-step workflows requiring planning and reasoning |
| AssistantBench | Yoran et al., 2024 | Live open web | 214 tasks | Realistic information-finding tasks on the public internet |
| WebLINX | Lù et al., 2024 | Conversational web navigation | 100k+ interaction turns across 155 sites | Multi-turn dialogue plus web actions |
Most of these benchmarks were originally published with their own custom code. The BrowserGym packages re-implement their evaluators on top of the unified Playwright interface so that the same agent harness can be used end-to-end.[1]
AgentLab is the companion framework to BrowserGym, also developed by ServiceNow Research. While BrowserGym is the environment, AgentLab is the experiment runner: it provides higher-level building blocks for designing agents, launching large benchmark sweeps, and analyzing results.[1][5]
AgentLab adds GenericAgent, an agent with a configurable prompt, observation flags, and chain-of-thought options that researchers can use as a baseline or as a starting point for new agents.

A significant fraction of published BrowserGym numbers come from AgentLab runs because the framework handles the long tail of distributed evaluation, retries, and accounting that would otherwise need to be implemented for every paper.[5]
A typical AgentLab run looks like this. The researcher defines a Study object that pairs an agent configuration (model, prompt, observation flags) with a benchmark (WorkArena L1, for example). They call study.run(), which fans out tasks across worker processes, captures every prompt-response pair plus a recorded video of the browser, scores each episode against the task's validate() method, and writes a row to a results database. The researcher can then load the results in Jupyter, slice them by task category, and inspect individual trajectories in AgentXRay. This workflow has become a de facto standard for reporting BrowserGym numbers because it ensures that every reported success rate is backed by a recoverable, reproducible trace.[5]
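A schematic version of such a run, assuming AgentLab's Study class and a packaged GenericAgent configuration (exact import paths, configuration names, and arguments differ across AgentLab versions):

```python
from agentlab.agents.generic_agent import AGENT_4o      # a bundled GenericAgent config
from agentlab.experiments.study import Study

# Pair one agent configuration with one benchmark, then fan tasks out
# across worker processes; results and traces land in the study directory.
study = Study([AGENT_4o], benchmark="workarena_l1")
study.run(n_jobs=4)
```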
Because BrowserGym standardizes the harness, several research groups have adopted it for evaluating frontier models against web tasks. The ecosystem paper itself reports a head-to-head comparison of six state-of-the-art LLMs across six benchmarks, all run through the same GenericAgent with the same prompt template. The results show that no single model dominated every benchmark, and that the performance ordering can flip depending on whether the task is text-heavy or visual.[1]
The table below summarizes representative success rates from the December 2024 BrowserGym ecosystem paper and follow-up reports. All numbers are best-effort reproductions of figures published in the paper or its companion materials and should be cross-checked against the latest version of the paper before citing in academic work.[1][6]
| Benchmark | Claude 3.5 Sonnet | GPT-4o | Notes |
|---|---|---|---|
| MiniWoB++ | ~69.8% | comparable | Both models near saturation on the synthetic suite |
| WebArena | ~36.0% | ~31.5% | Long-horizon multi-site tasks |
| WorkArena L1 | ~56.4% | ~45.5% | Atomic ServiceNow knowledge-work tasks |
| WorkArena L2 | ~39.1% | ~8.5% | Compositional ServiceNow workflows |
| VisualWebArena | competitive | leads | GPT-4o stronger on visual tasks |
| AssistantBench | low (single digits to low teens) | similar | Both models struggle with open-web information seeking |
The headline finding from the paper is that Claude 3.5 Sonnet led on most benchmarks while GPT-4o pulled ahead on vision-heavy tasks, and that there is a striking gap between models on the harder compositional benchmarks (notably WorkArena L2). This pattern, where open-web and compositional tasks remain unsolved while synthetic micro-tasks approach saturation, has shaped subsequent web-agent research.[1][6]
BrowserGym is widely used in the academic web-agent literature. Research groups at ServiceNow Research, Mila, Carnegie Mellon University, McGill University, and Tel Aviv University collaborate on the ecosystem itself. Beyond the original authors, BrowserGym appears as the evaluation harness in numerous follow-up papers on web agents, distillation, retrieval-augmented browsing, and agent reward modeling, including AgentRewardBench, BrowserArena, and structured distillation studies. The ecosystem also underpins community leaderboards such as the Steel.dev browser-agent leaderboard, which uses BrowserGym tasks to rank publicly available agents.[6][7]
While ServiceNow maintains the project, the benchmarks themselves are model-agnostic: agents based on GPT-4, Claude, Gemini, Llama, and open-source variants have all been evaluated under the harness in published work.[1]
BrowserGym occupies a particular niche in the web-agent landscape. It is not itself a benchmark; it is the connective tissue that lets multiple benchmarks share an agent. The table below compares it to several adjacent projects.
| System | Type | Live browser? | Tasks | Primary contribution |
|---|---|---|---|---|
| BrowserGym | Environment + ecosystem | Yes (Playwright on real Chromium) | Aggregates 7+ benchmarks | Unified action/observation API across many benchmarks |
| Mind2Web | Benchmark | Originally no (cached HTML); later live variants | 2,000+ tasks across 137 sites | Static evaluation of generalist web agents |
| WebArena | Benchmark | Yes (self-hosted Docker) | 812 tasks | Realistic, reproducible multi-site environment |
| WorkArena | Benchmark | Yes (live ServiceNow) | 33 atomic templates + 682 compositional | Enterprise knowledge-work tasks |
| MiniWoB++ | Benchmark | Yes (Selenium originally) | 125 task types | Short, synthetic micro-tasks for RL |
| AssistantBench | Benchmark | Yes (live open web) | 214 tasks | Realistic time-consuming research questions |
| AgentBench | Multi-domain agent benchmark | Some web environments only | 8 environments | General LLM-as-agent evaluation, not web-specific |
| OSWorld | Benchmark | Full desktop OS | 369 tasks | Whole-OS GUI agents, not browser-only |
The sharpest contrast is with Mind2Web. Both projects evaluate web agents, but Mind2Web's original release was offline: it replayed cached HTML snapshots of websites, which made evaluation cheap and reproducible at the cost of not catching agents that drift from the recorded trajectory. BrowserGym is online: every action runs on a real Chromium tab, so the agent can encounter dynamic content, popups, or JavaScript state changes that a static replay cannot capture. This makes BrowserGym slower and more expensive but closer to real deployment.[1][7]
Compared to whole-OS environments like OSWorld, BrowserGym is narrower (browser-only) but considerably easier to run, since it does not require a virtual machine per task.
The authors and subsequent users have flagged several limitations of the BrowserGym ecosystem: driving a real browser makes evaluation slower and more costly than static-replay benchmarks, several task suites depend on external infrastructure (WorkArena requires a live ServiceNow instance, AssistantBench depends on the public internet), and results on live or open-web tasks can drift as the underlying sites change.
Despite these caveats, BrowserGym is widely viewed as the most consolidated environment for browser-based agent research as of 2025-2026, and its design has influenced subsequent harnesses for general computer-use agents.[1][6]
BrowserGym is maintained on GitHub with frequent releases. Selected recent versions:
| Version | Date | Notes |
|---|---|---|
| 0.1.x | Mid 2024 | Initial release alongside the WorkArena ICML paper |
| 0.10.x | Late 2024 | Released alongside the BrowserGym ecosystem paper (arXiv:2412.05467) |
| 0.14.0 / 0.14.1 | June 16, 2025 | Action-space and observation refactors |
| 0.14.2 | August 5, 2025 | MiniWoB zoom support, Playwright fixes, automatic MiniWoB install |
| 0.14.3 | January 20, 2026 | Latest stable release as of early 2026 |
The companion AgentLab framework follows a separate release cadence; its 0.4.x line was current in January 2026.[2][5]