BrowserGym is an open-source web-agent research environment and benchmark ecosystem described in the 2024 paper "The BrowserGym Ecosystem for Web Agent Research." The project aims to standardize how researchers evaluate browser agents by providing a gym-like interface, unified action and observation spaces, and a common way to run many web benchmarks.[1]
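The "gym-like interface" means agents interact with a browser task through the familiar reset/step loop from reinforcement-learning gyms. The toy sketch below illustrates that loop only; the class, observation fields, and action strings here are illustrative stand-ins, not BrowserGym's actual API (in the real library, observations bundle things like the page DOM and accessibility tree, and actions come from a unified action space of browser operations).

```python
# Toy stand-in illustrating the gym-style loop BrowserGym standardizes.
# This is NOT BrowserGym's real API; names and formats are hypothetical.
class ToyBrowserEnv:
    """Minimal env with gym semantics: reset() -> obs, step(action) -> (obs, reward, done)."""

    def __init__(self, goal="click the submit button"):
        self.goal = goal

    def reset(self):
        # Observation: in a real web env this would carry DOM/AXTree/screenshot data.
        return {"goal": self.goal, "page": "<button id='submit'>Submit</button>"}

    def step(self, action):
        # Actions would be drawn from a unified action space (click, type, scroll, ...).
        done = action == "click('submit')"
        reward = 1.0 if done else 0.0
        obs = {"goal": self.goal, "page": "done" if done else "<button id='submit'>Submit</button>"}
        return obs, reward, done


env = ToyBrowserEnv()
obs = env.reset()
obs, reward, done = env.step("click('submit')")  # a trivial one-step "policy"
```

Because every benchmark in the ecosystem is exposed through the same loop, swapping the task or the agent does not require changing the harness around them.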
In practice, BrowserGym is as much infrastructure as evaluation suite. It is not a single benchmark task set in the way Mind2Web is; rather, it is an ecosystem that brings multiple web benchmarks under one interface and pairs them with tooling such as AgentLab.[1][2]
The paper highlights two main components:[1]
| Component | Role |
|---|---|
| BrowserGym | Unified browser environment for agent evaluation |
| AgentLab | Framework for building, testing, and analyzing agents |
The authors say the ecosystem is meant to reduce fragmentation and make benchmark comparisons more reproducible across the growing web-agent literature.[1]
The BrowserGym paper reports a large-scale experiment with six state-of-the-art language models across six popular web-agent benchmarks available in the ecosystem. The authors highlight that Claude-3.5-Sonnet led on most of those benchmarks, while GPT-4o was stronger on vision-related tasks.[1]
BrowserGym is maintained as an open-source repository by ServiceNow. The GitHub project describes it succinctly as a gym environment for web task automation and links supporting traces and benchmark resources used in the paper.[2]
BrowserGym matters because web-agent results are hard to compare across papers when every group uses a different browser wrapper, observation format, or task API. BrowserGym's contribution is to make that layer consistent, so model comparisons and agent ablations become easier to reproduce.[1][2]