OSWorld
Last reviewed
May 17, 2026
Sources
18 citations
Review status
Source-backed
Revision
v3 · 6,425 words
OSWorld is a benchmark for evaluating multimodal AI agents on open-ended tasks performed inside real computer environments. Published at NeurIPS 2024 by Tianbao Xie, Tao Yu, and colleagues from the University of Hong Kong, Salesforce Research, Carnegie Mellon University, and the University of Waterloo, OSWorld was the first benchmark to evaluate agents systematically across full desktop operating systems rather than sandboxed web browsers or narrow application domains. The benchmark consists of 369 tasks spanning Ubuntu, Windows, and macOS, each grounded in authentic computer workflows and assessed through execution-based evaluation scripts rather than heuristic matching or human judges. Within roughly two years of release, OSWorld became the standard yardstick that nearly every major lab cites in product announcements covering computer-use AI agents.[1][2]
At the time of publication, the gap between human and machine performance on OSWorld was striking: human evaluators unfamiliar with the specific software completed 72.36% of tasks, while the best-performing model reached only 12.24%. That gap narrowed sharply as computer-use agents improved, and the OSWorld leaderboard has become one of the primary public scoreboards for tracking progress in GUI-driven AI automation. Claude Opus 4.6 first crossed the human baseline in February 2026, twenty-two months after the benchmark's release, and by April 2026 Claude Opus 4.7 had reached 78.0% on the verified leaderboard. The trajectory from 12.24% in April 2024 to scores near 80% in May 2026 represents one of the fastest documented capability improvements in agent evaluation, and OSWorld now sits alongside SWE-bench as one of the two most cited benchmarks in frontier model release notes.[1][3][4]
The ability of AI systems to operate computers was long confined to narrow demonstrations. Early work in GUI automation focused on scripted macros and rigid rule-based systems. When large language models began to show reasoning and planning capabilities, researchers attempted to adapt them to browser-based tasks, producing benchmarks like WebArena (2023) that evaluated agents navigating simulated websites. These benchmarks were useful but limited: they covered only web interfaces, excluded desktop applications, and could not test workflows that crossed application boundaries.
The practical challenge of deploying AI as a general-purpose computer assistant is far broader. Real computer work involves spreadsheets, image editors, code editors, email clients, media players, terminal commands, and multi-step workflows that often move data from one application to another. No benchmark available before 2024 evaluated agents across this full range. WebVoyager, released in early 2024, brought live website evaluation but remained browser-only. GAIA, released in 2023 by researchers at Meta and Hugging Face, evaluated general assistant capability with a focus on tool use and multi-step reasoning, but did not test direct GUI manipulation.
Four structural gaps motivated the creation of OSWorld. First, existing benchmarks lacked controllable, executable environments that faithfully represented real desktop sessions. Second, the task sets were narrow, typically covering one application domain or one operating system. Third, evaluation functions were often shallow, checking only surface-level outputs rather than verifying that the agent had actually completed the underlying computational task. Fourth, no benchmark tested cross-application workflows in which an agent must orchestrate multiple programs in sequence.
OSWorld addressed all four gaps by building on virtual machine infrastructure, constructing tasks from real user workflows across nine application categories, writing 134 custom evaluation functions to verify task completion, and including a dedicated multi-application workflow category. The benchmark was conceived inside the XLANG NLP Lab at HKU, which had already produced research on text-to-SQL grounding and tool use, and the methodological choice to evaluate agents through real virtual machines rather than HTML scrapes traces back to that lab's emphasis on grounded execution rather than surface evaluation.[1][2]
The paper introducing OSWorld, titled "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments," was submitted to arXiv in April 2024 (arXiv:2404.07972) and accepted to the Datasets and Benchmarks Track at NeurIPS 2024. It appeared in the NeurIPS 2024 proceedings and is accessible through the conference proceedings and OpenReview.[1][12]
The lead author, Tianbao Xie, is a PhD student at the University of Hong Kong supervised by Tao Yu, a faculty member who also leads the XLANG NLP Lab. Additional authors include Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, and Shuyan Zhou. Senior authors include Silvio Savarese and Caiming Xiong from Salesforce Research, Victor Zhong from the University of Waterloo, and Tao Yu from HKU. Shuyan Zhou, also one of the co-creators of WebArena, connected the OSWorld project to earlier work on web-only agent evaluation.
The paper's main contributions were:
The paper was widely cited within months of release. By the time Anthropic launched Anthropic Computer Use in October 2024, OSWorld had already become the standard reference for computer-use evaluation, and Anthropic's announcement reported a 14.9% headline score on the screenshot-only configuration alongside a comparison to the next-best published system at 7.8%.[10]
OSWorld contains 369 tasks, with an additional 43 tasks constructed specifically for Windows analysis, for a total evaluation corpus of 412 tasks across all platforms. The core set of 369 tasks was developed for Ubuntu, with 72.6% (268 tasks) targeting single-application scenarios and 27.4% (101 tasks) targeting multi-application workflows.
The dataset also includes 30 infeasible tasks, representing approximately 8.1% of the 369-task core set. Infeasible tasks are designed to test whether agents can correctly recognize when a requested operation cannot be completed given the current system state, rather than attempting to hallucinate a path to completion. This is a small but important design choice: many real computer workflows include requests that turn out to be impossible given missing files, network outages, or permission constraints, and a useful agent needs to be able to say so rather than fabricating a result.
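The composition figures quoted above follow directly from the task counts. The short check below recomputes them from the counts given in the text; the variable and function names are illustrative only.

```python
# Task counts as quoted in the text above.
SINGLE_APP = 268
MULTI_APP = 101
WINDOWS_EXTRA = 43
INFEASIBLE = 30

core = SINGLE_APP + MULTI_APP      # Ubuntu core set
total = core + WINDOWS_EXTRA       # all platforms


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


composition = {
    "core_tasks": core,
    "total_tasks": total,
    "single_app_pct": share(SINGLE_APP, core),
    "multi_app_pct": share(MULTI_APP, core),
    "infeasible_pct": share(INFEASIBLE, core),
}
print(composition)
```

Note that the 8.1% infeasible share only works out against the 369-task core set; measured against all 412 tasks it would be closer to 7.3%.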
Each task is accompanied by a natural language instruction, a detailed initial state specification that defines the exact starting configuration of the virtual machine, and an execution-based evaluation script. Task instructions were written to reflect how a real user would phrase a request, including occasional ambiguity that requires the agent to infer intent. A typical instruction might be "set this image to 200 KB and save it as JPEG," leaving the agent to decide which compression settings to use rather than spelling out every parameter.
A defining feature of OSWorld is its use of real operating system environments rather than emulated or sandboxed interfaces. Each task runs inside a virtual machine with a fully functional desktop session. The infrastructure supports VMware and VirtualBox for local evaluation on laptops and workstations, Docker with KVM support for server environments, and AWS for cloud-based parallel evaluation.
AWS support, added in the OSWorld-Verified update in July 2025, enables up to 50 simultaneous environments and compresses total evaluation time from more than ten hours to roughly twenty minutes for a full benchmark run. This is a practical requirement for organizations evaluating many agent runs and for leaderboard submissions that must run agents on the official infrastructure.[5][6]
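A rough back-of-the-envelope model shows why 50 parallel environments bring a ten-hour run down to the order of tens of minutes. The sketch assumes, purely for illustration, that a serial run takes exactly 600 minutes and that every task takes the same time; neither assumption comes from the benchmark documentation.

```python
import math


def ideal_wall_clock_minutes(n_tasks: int, minutes_per_task: float, workers: int) -> float:
    """Idealized wall-clock time when tasks run in equal-length batches."""
    batches = math.ceil(n_tasks / workers)
    return batches * minutes_per_task


N_TASKS = 369
SERIAL_MINUTES = 10 * 60                 # "more than ten hours", taken as 600 min
per_task = SERIAL_MINUTES / N_TASKS      # assumed uniform task duration

serial = ideal_wall_clock_minutes(N_TASKS, per_task, workers=1)
parallel = ideal_wall_clock_minutes(N_TASKS, per_task, workers=50)
print(f"serial ~{serial:.0f} min, 50 workers ~{parallel:.0f} min")
```

Under these assumptions the idealized figure is about 13 minutes. The gap to the reported twenty minutes is plausibly explained by VM provisioning and uneven task lengths, which this model ignores.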
The benchmark captures four types of agent observation: screenshots of the current desktop state, accessibility trees (structured representations of GUI elements), combined screenshot plus accessibility tree, and Set-of-Marks, an augmented screenshot technique that labels interactive elements with numeric identifiers to improve spatial grounding. Agents interact with the environment through PyAutoGUI-compatible actions including mouse movement, clicking, keyboard input, and scrolling. Most production deployments of computer-use models, including Anthropic Computer Use and OpenAI Operator, have settled on a screenshot-only or screenshot-plus-accessibility configuration, since accessibility trees are not always reliably populated outside controlled benchmark settings.
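To make the action space concrete, the sketch below renders a few agent actions as PyAutoGUI calls. The `Action` record is hypothetical and is not the benchmark's actual schema; the generated lines use the standard PyAutoGUI functions `moveTo`, `click`, `write`, and `scroll`, and are returned as strings rather than executed so the example runs without a display.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; not the benchmark's actual schema."""
    kind: str                      # "move", "click", "type", or "scroll"
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None
    clicks: Optional[int] = None   # scroll amount; positive scrolls up


def to_pyautogui(action: Action) -> str:
    """Render an action as a line of PyAutoGUI code (returned, not executed)."""
    if action.kind == "move":
        return f"pyautogui.moveTo({action.x}, {action.y})"
    if action.kind == "click":
        return f"pyautogui.click({action.x}, {action.y})"
    if action.kind == "type":
        return f"pyautogui.write({action.text!r})"
    if action.kind == "scroll":
        return f"pyautogui.scroll({action.clicks})"
    raise ValueError(f"unsupported action kind: {action.kind}")


plan = [
    Action("click", x=640, y=360),
    Action("type", text="report.xlsx"),
    Action("scroll", clicks=-3),
]
script = "\n".join(to_pyautogui(a) for a in plan)
print(script)
```

Keeping actions as data until the last moment makes it easy to log, replay, or validate a trajectory before anything touches the desktop.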
OSWorld uses 134 custom execution-based evaluation functions to assess task completion. These functions go beyond surface checks: rather than asking whether the agent claimed to have completed a task or whether a particular button was clicked, they verify the actual computational state of the system after the agent finishes. A task that asks the agent to set a file to 200 KB is scored by reading the file size from the filesystem, not by inspecting the model's textual output.
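The file-size case can be sketched as a state-based check. This is a minimal illustration rather than one of the benchmark's own evaluation functions: the function name and the 5 KB tolerance are assumptions made for the example.

```python
import os
import tempfile


def file_size_within(path: str, target_kb: float, tolerance_kb: float = 5.0) -> bool:
    """Return True if the file exists and its on-disk size is near the target."""
    if not os.path.isfile(path):
        return False
    size_kb = os.path.getsize(path) / 1024
    return abs(size_kb - target_kb) <= tolerance_kb


# Usage with a stand-in file; a real check would point at the agent's output.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "image.jpg")
    with open(out, "wb") as fh:
        fh.write(b"\0" * (200 * 1024))   # exactly 200 KB of filler bytes
    ok = file_size_within(out, target_kb=200)
    missing = file_size_within(os.path.join(tmp, "absent.jpg"), target_kb=200)
print(ok, missing)
```

The check never looks at what the agent said it did; a missing or wrongly sized file fails regardless of the agent's final message.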
Evaluation functions are built on getter utilities that retrieve ground-truth information from the environment: reading file contents, extracting accessibility tree states, querying browser cookies, parsing spreadsheet values, checking system configuration registers, and inspecting application state through APIs where available. Logical evaluators then combine these getters with support for alternative correct solutions, so that tasks with multiple valid completion paths are not incorrectly penalized. A task that asks the agent to extract a value from a spreadsheet, for example, can be solved either through the spreadsheet GUI or by parsing the file directly, and both routes are accepted if the resulting state is correct.
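The getter-plus-logical-evaluator pattern described above can be sketched as small composable functions. Everything here is hypothetical: the state is a plain dictionary standing in for the virtual machine, and `expect`, `any_of`, and the two getters are illustrative names, not the benchmark's API.

```python
from typing import Any, Callable, Dict, Iterable

State = Dict[str, Any]                 # stand-in for the real VM state
Getter = Callable[[State], Any]
Check = Callable[[State], bool]


def expect(getter: Getter, expected: Any) -> Check:
    """Build a check that compares a getter's result with an expected value."""
    return lambda state: getter(state) == expected


def any_of(checks: Iterable[Check]) -> Check:
    """Accept the task if at least one alternative solution path matches."""
    checks = list(checks)
    return lambda state: any(check(state) for check in checks)


def get_cell_b2(state: State) -> Any:
    """Hypothetical getter: read cell B2 from a spreadsheet."""
    return state.get("sheet", {}).get("B2")


def get_export(state: State) -> Any:
    """Hypothetical getter: read the contents of an exported text file."""
    return state.get("files", {}).get("total.txt")


evaluate = any_of([expect(get_cell_b2, 42), expect(get_export, "42")])

via_gui = {"sheet": {"B2": 42}}
via_file = {"files": {"total.txt": "42"}}
untouched = {"sheet": {"B2": None}}
print(evaluate(via_gui), evaluate(via_file), evaluate(untouched))
```

Because the evaluator only inspects the resulting state, the GUI route and the file route are both accepted, while a run that changes nothing is rejected.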
The 302 distinct initial states ensure that task setups are reproducible: every evaluation run for a given task starts the virtual machine in an identical initial condition, eliminating variability from prior agent actions.
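The paper evaluated agents under four input configurations: screenshot only, accessibility tree only, screenshot combined with accessibility tree, and Set-of-Marks.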
Results varied substantially across modalities and models. Accessibility tree input generally helped text-focused models, while screenshot input was necessary for tasks requiring visual recognition of content. Set-of-Marks improved performance on some models but not consistently across all.[1]
The 369-task core set is organized into nine application categories plus a multi-application workflow category. The table below summarizes each category, the type of work it tests, and the share of the benchmark it accounts for.
| Category | Application | What it tests | Share of benchmark |
|---|---|---|---|
| OS | Terminal, file manager, image and PDF viewers | File system operations, installation, configuration, cleanup | ~13% |
| LibreOffice Writer | Word processor | Document creation, formatting, mail merge, table editing | ~8% |
| LibreOffice Calc | Spreadsheet | Formulas, charts, conditional formatting, pivot tables | ~13% |
| LibreOffice Impress | Presentation | Slide layout, animation, formatting | ~6% |
| Chrome | Web browser | Form filling, bookmarks, cookies, extension config, multi-page flows | ~13% |
| VLC | Media player | Playlist management, subtitles, audio and video settings | ~5% |
| Thunderbird | Email client | Composition, folder management, filters, calendar | ~6% |
| VS Code | Code editor | Settings, keyboard shortcuts, extensions, terminal integration | ~6% |
| GIMP | Image editor | Cropping, color adjustment, layers, file export | ~6% |
| Multi-app workflow | Multiple | Cross-application data transfer and orchestration | ~27% |
Single-application tasks represent 72.6% of the benchmark and cover the full breadth of what a typical office or professional user does in a given program. In LibreOffice Calc, tasks range from entering formulas to building pivot tables and applying conditional formatting. Calc and multi-application tasks consistently required the most steps per task, with some tasks involving dozens of sequential operations.
In Chrome, tasks include managing cookies, configuring extensions, interacting with web forms, and completing tasks that require navigating multi-page web flows. Chrome tasks overlap with the domain covered by web-only benchmarks like WebArena, but within the OSWorld framing these exist alongside desktop tasks rather than in isolation.
GIMP tasks test spatial precision, requiring the agent to select regions, apply effects, adjust color histograms, and export files in specific formats. These tasks expose the limitations of screenshot-only agents that lack the pixel-level precision available to humans using a physical mouse. The 1:1 vision pixel mapping introduced in Claude Opus 4.7 was an explicit response to this category of failure.
VS Code tasks test both text editing accuracy and knowledge of IDE-specific operations: configuring settings JSON files, managing keyboard shortcuts, installing extensions, and using integrated terminal commands. The VS Code category sits at an interesting boundary because effective agents can sometimes drop down into the integrated terminal and complete the task with shell commands rather than GUI clicks, a substitution that the benchmark generally accepts.
Thunderbird tasks reveal a particular weakness in screenshot-only agents: managing nested folder hierarchies and configuring per-folder filters requires the agent to keep track of context across collapse and expand events that change which UI elements are visible. Many early agents lost state between screenshots and had to redo work as a result.
The multi-application category (27.4% of tasks) is the most demanding and most representative of real computer work. Example workflows include:
These tasks require the agent to maintain context across application boundaries, handle intermediate file formats, and sequence actions correctly across multiple programs. The drop in performance from single-application to multi-application tasks in the original paper was severe: the best model in the original paper achieved 6.57% on workflow tasks even as it reached 12.24% on the overall benchmark, reflecting the compounding difficulty of cross-application coordination.
At the time of the NeurIPS 2024 paper, the best model completed only 12.24% of tasks successfully, compared to the human baseline of 72.36%. The human baseline used evaluators who were not expert users of the specific software applications, making it a realistic lower bound on human capability rather than an expert ceiling.
The paper evaluated both open-source and closed-source models. GPT-4o and GPT-4V with screenshot input reached between 5.26% and 5.80% success depending on configuration. Gemini Pro Vision performed similarly on screenshot-based tasks. Claude 3 Opus showed varied performance across input configurations. Open-source models including CogAgent and Mixtral variants performed below 5%. No model exceeded 12.24% under any configuration in the original evaluation.
Workflow tasks (multi-application) were the hardest category, with the best model achieving only 6.57%. Single-application tasks were easier but still far below human performance.
Two primary failure modes dominated: GUI grounding failures, where the agent identified the correct action but clicked in the wrong location or on the wrong element, and operational knowledge failures, where the agent did not know the correct sequence of steps to accomplish the task in a specific application. More than 75% of failures involved mouse click inaccuracies despite otherwise correct reasoning. This finding shaped the next generation of computer-use models. Anthropic's October 2024 launch of Anthropic Computer Use and OpenAI's January 2025 launch of OpenAI Operator both invested heavily in fine-tuning and reinforcement learning specifically for spatial click accuracy, citing OSWorld failure analysis as motivation.[10][11]
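A grounding failure can be made concrete as a click that lands outside the bounding box of the element the agent intended to hit. The sketch below is purely illustrative: the `Box` type, the labels, and the coordinates are assumptions, not part of the paper's failure analysis tooling.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box of a GUI element, in screen pixels."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def classify_click(target: Box, x: int, y: int) -> str:
    """Label a click as a hit or as a grounding miss against the intended element."""
    return "hit" if target.contains(x, y) else "grounding_miss"


save_button = Box(left=900, top=40, width=80, height=24)
print(classify_click(save_button, 940, 52))   # inside the button
print(classify_click(save_button, 940, 70))   # a few pixels below the button
```

In the second call the agent has chosen the right element and the right action, yet the step still fails, which is the pattern the paper describes as correct reasoning undone by click inaccuracy.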
OSWorld-Verified is an enhanced version of the benchmark developed by the XLANG Lab and released on 28 July 2025. The original benchmark, while groundbreaking, accumulated more than 300 pieces of community feedback over fifteen months of use. These fell into several categories: HTML parsing functions that broke when websites updated their structure, anti-crawling mechanisms such as CAPTCHA challenges and IP blocking that made web-related tasks intermittently impossible, timing issues where asynchronous operations caused evaluation flakiness, instruction ambiguities that allowed agents to take paths the evaluation function did not recognize as correct, and file hosting problems from Google Drive throttling.[5][6]
The OSWorld-Verified initiative involved approximately two months of work from a ten-person team. A design principle of the effort was to minimize invasive changes: the team primarily modified evaluator scripts rather than rewriting task instructions, the goal being to preserve score continuity with the original benchmark wherever possible. The key changes were:
Infrastructure migration. The benchmark moved from VMware and Docker to AWS cloud infrastructure, supporting up to 50 simultaneous environments. A full benchmark run that had previously required more than ten hours on a single machine now completes in roughly twenty minutes when fully parallelized.
Task quality fixes. The team addressed the 300+ feedback items. Web structure changes were handled by updating HTML parsing, URL validation, and selector logic. Anti-crawling failures were mitigated by adjusting task setups so they no longer depended on bot-detection-prone access patterns. Timing issues were resolved by converting asynchronous operations to blocking calls. Instruction ambiguities were resolved by expanding the set of acceptable solution paths in evaluators rather than narrowing the instructions themselves.
Evaluation robustness. Scoring functions were enhanced with fuzzy matching for documents, improved image similarity algorithms that handle minor rendering differences, and tolerance for valid formatting variations. System stability fixes addressed transient errors in the VM lifecycle that had caused legitimate agent runs to be scored as failures.
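The idea behind fuzzy document matching can be illustrated with Python's standard `difflib`. The source does not say which algorithm OSWorld-Verified uses, so this is a stand-in technique; the 0.9 threshold and the normalization step are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace and case so formatting variations do not count as errors."""
    return " ".join(text.split()).lower()


def fuzzy_match(actual: str, expected: str, threshold: float = 0.9) -> bool:
    """Return True if the normalized texts are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalize(actual), normalize(expected)).ratio()
    return ratio >= threshold


expected = "Quarterly revenue rose 12% year over year."
print(fuzzy_match("Quarterly  revenue rose 12% year over year.\n", expected))
print(fuzzy_match("Meeting notes for Tuesday.", expected))
```

The first call differs from the reference only in whitespace and is accepted; the second is unrelated text and is rejected. A stricter or looser threshold trades false failures against false passes.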
File hosting migration. Task assets were moved from Google Drive to Hugging Face to eliminate throttling and download failures that had affected reproducibility.
At the time of the OSWorld-Verified release, the strongest reported system was CoACT-1, an academic multi-agent framework, which scored 60.76% on the verified set. Within five months that number had been overtaken by frontier production systems, with Claude Opus 4.5 reaching 66.3% and Claude Opus 4.6 crossing the 72% human baseline in February 2026.
The OSWorld-Verified leaderboard operates under stricter submission rules than the main benchmark: models are evaluated either by running agents directly on the AWS infrastructure or by allowing maintainers to execute submitted agent code, ensuring that all entries reflect genuine agent performance rather than cherry-picked runs. Anthropic, OpenAI, and Alibaba have all submitted models to the verified track since its release. The XLANG team has continued to refine the verified set since July 2025, with an additional roughly 10% of task instructions modified in subsequent point releases as new edge cases surfaced.
From roughly mid-2025 onward, model release announcements have switched from citing the original OSWorld score to citing OSWorld-Verified. Claude Sonnet 4.6 and Claude Opus 4.7 both report verified figures only, GPT-5.4 reports OSWorld-Verified rather than the original benchmark, and GPT-5.5 followed the same convention at its April 2026 launch.[7][8][14][15][16]
The headline number that nearly every computer-use product cites is OSWorld success rate. The trajectory across major model launches is striking when laid out chronologically. Where Anthropic, OpenAI, or Alibaba published distinct figures for the original OSWorld benchmark and for OSWorld-Verified, both are listed.
| Date | Model | Lab | OSWorld | OSWorld-Verified | Notes |
|---|---|---|---|---|---|
| April 2024 | GPT-4V baseline | OpenAI | 5.26 to 5.80% | n/a | Reported in original OSWorld paper |
| April 2024 | Best paper model | Multiple | 12.24% | n/a | Highest score in original paper |
| October 2024 | Claude 3.5 Sonnet (new) | Anthropic | 14.9% (screenshot only); 22.0% (extra steps) | n/a | Anthropic Computer Use launch |
| January 2025 | OpenAI CUA / Operator | OpenAI | 38.1% (50-step config) | n/a | OpenAI Operator launch |
| May 2025 | Claude Sonnet 4 | Anthropic | 42.2% | n/a | Claude 4 family launch |
| July 2025 | CoACT-1 multi-agent framework | Academic | n/a | 60.76% | Top entry at OSWorld-Verified release |
| August 2025 | Claude Opus 4.1 | Anthropic | 42.2% | n/a | Coding upgrade; OSWorld unchanged |
| September 2025 | Claude Sonnet 4.5 | Anthropic | 61.4% | n/a | First Sonnet over 60% on computer use |
| October 2025 | Claude Haiku 4.5 | Anthropic | 50.7% | n/a | Smallest Claude with native computer use |
| November 2025 | Claude Opus 4.5 | Anthropic | 66.3% | n/a | First model above 66% on main leaderboard |
| February 2026 | Claude Opus 4.6 | Anthropic | 72.7% | 72.7% | First model above the 72.36% human baseline |
| February 2026 | Claude Sonnet 4.6 | Anthropic | 72.5% | 72.5% | Effectively tied with Opus 4.6 at lower cost |
| March 2026 | GPT-5.4 Thinking | OpenAI | n/a | 75.0% | Above human baseline; OpenAI's first frontier OSWorld lead |
| April 2026 | Claude Opus 4.7 | Anthropic | n/a | 78.0% | Highest production Claude at time of launch |
| April 2026 | Claude Mythos Preview | Anthropic | n/a | 79.6% | Invitation-only research model |
| April 2026 | GPT-5.5 | OpenAI | n/a | 78.7% | Retook second place behind Mythos a week after launch |
The progression from the original 12.24% ceiling in April 2024 to scores near 80% by May 2026 represents roughly a sixfold improvement in two years. Three observations help interpret the curve.[17][18]
First, the steepest jumps coincide with launches that explicitly trained models on computer-use trajectories rather than relying on general multimodal capability. The October 2024 jump from 7.8% to 14.9% came from Anthropic's first dedicated computer-use training run. The January 2025 jump to 38.1% came from OpenAI's CUA, which used reinforcement learning specifically for GUI tasks. The September 2025 jump to 61.4% in Claude Sonnet 4.5 was widely characterized in the developer press as the first "production-grade" computer-use Claude. The April 2026 cluster of scores near 80% from Mythos, GPT-5.5, and Opus 4.7 reflects a different dynamic: with the human baseline already crossed, labs have stopped advertising OSWorld as a difficulty measure and started using it as a saturation marker for retiring the benchmark.
Second, the dominance of Anthropic models in the upper rankings reflects the company's sustained investment in computer use as a product capability. Claude's computer-use feature, introduced in October 2024 with Claude 3.5 Sonnet, trained the model directly on GUI interaction tasks, and subsequent Claude generations continued that specialization. Claude Opus 4.5 was the first model to exceed 66% on the main OSWorld benchmark.
Third, the trajectory crossed the 72.36% human baseline in February 2026 when Claude Opus 4.6 reached 72.7%. By April 2026, both Anthropic's Opus 4.7 and OpenAI's GPT-5.4 Thinking sat above the human baseline on OSWorld-Verified, and Anthropic's Claude Mythos Preview reached 79.6%. The benchmark is unusual among AI evaluations in that the human baseline was a relatively early target rather than a long-term aspiration, and the rapid saturation has prompted discussion within the XLANG Lab about a successor evaluation with longer-horizon tasks.
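The crossing point and the overall improvement factor can be recomputed from the figures in the table above. The sketch mixes original OSWorld and OSWorld-Verified scores in one series, as the article's own timeline does, so it inherits the version-comparability caveats that apply to the table; the list is a hand-picked subset of rows.

```python
HUMAN_BASELINE = 72.36

# (label, best reported score in %) in chronological order, from the table above.
timeline = [
    ("April 2024, best paper model", 12.24),
    ("October 2024, Claude 3.5 Sonnet (new)", 14.9),
    ("January 2025, OpenAI CUA", 38.1),
    ("September 2025, Claude Sonnet 4.5", 61.4),
    ("November 2025, Claude Opus 4.5", 66.3),
    ("February 2026, Claude Opus 4.6", 72.7),
    ("April 2026, Claude Mythos Preview", 79.6),
]

first_above = next(label for label, score in timeline if score > HUMAN_BASELINE)
fold = timeline[-1][1] / timeline[0][1]
print(first_above)
print(f"{fold:.1f}x from first to last entry")
```

The ratio comes out at about 6.5, which is the basis for the "roughly sixfold" description.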
The table below summarizes the top entries on the OSWorld-Verified leaderboard as of mid-May 2026.[7][8]
| Rank | Model | Organization | Score |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 79.6% |
| 2 | GPT-5.5 | OpenAI | 78.7% |
| 3 | Claude Opus 4.7 | Anthropic | 78.0% |
| 4 | GPT-5.4 Thinking | OpenAI | 75.0% |
| 5 | Kimi K2.6 | Moonshot AI | 73.1% |
| 6 | Claude Opus 4.6 | Anthropic | 72.7% |
| 7 | Claude Sonnet 4.6 | Anthropic | 72.5% |
| 8 | GPT-5.4 mini | OpenAI | 72.1% |
| 9 | GPT-5.3 Codex | OpenAI | 64.7% |
| 10 | Qwen3.6 Plus | Alibaba Cloud | 62.5% |
The release of GPT-5.5 in late April 2026 broke a brief window in which two Anthropic models, Claude Mythos Preview and Claude Opus 4.7, had held the top two positions on the verified leaderboard. As of the 13 May 2026 leaderboard refresh, Mythos, GPT-5.5, and Opus 4.7 sat within 1.6 percentage points of each other, a spread that observers have described as effectively a three-way tie within run-to-run variance on the benchmark. The CoACT-1 multi-agent framework that had topped the leaderboard at OSWorld-Verified's July 2025 release fell out of the top ten as frontier single-model systems caught up.
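The spread among the top three entries, and each entry's margin over the human baseline, can be checked directly from the leaderboard figures. The names below are illustrative and the scores are copied from the table above.

```python
HUMAN_BASELINE = 72.36

top_three = {
    "Claude Mythos Preview": 79.6,
    "GPT-5.5": 78.7,
    "Claude Opus 4.7": 78.0,
}

spread = round(max(top_three.values()) - min(top_three.values()), 1)
margins = {name: round(score - HUMAN_BASELINE, 2) for name, score in top_three.items()}
print(f"spread: {spread} points")
print(margins)
```

The margins over the human baseline range from about 5.6 to about 7.2 points, so each margin is several times larger than the 1.6-point spread between the three systems.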
OSWorld arrived at a moment when major AI laboratories were beginning to invest seriously in computer-use capabilities. The benchmark provided a standardized evaluation surface that allowed direct comparison of different approaches.
Anthropic cited OSWorld in its October 2024 announcement of computer use for Claude 3.5 Sonnet. The model achieved 14.9% on the screenshot-only category at launch, which was notably better than the next-best published system at 7.8%. Anthropic highlighted this result as evidence of Claude's relative strength in GUI-based computer operation. Subsequent Claude generations continued to improve, and by 2025 Anthropic's models consistently held the top positions on the main leaderboard. By February 2026, OSWorld scores were the second metric Anthropic mentioned in launch posts, after SWE-bench Verified.[10]
OpenAI's Computer-Using Agent (CUA), which powered OpenAI Operator, was evaluated on OSWorld at its January 2025 launch and reached 38.1% success on the benchmark's 50-step task configuration. OpenAI described this as a significant advance over the prior state-of-the-art and cited it as evidence that CUA represented a new generation of computer-use capability. CUA also scored 58.1% on WebArena and 87.0% on WebVoyager, framing OSWorld as the desktop-equivalent benchmark in the same announcement.[11] OpenAI's later models continued to use OSWorld as the headline computer-use metric: GPT-5.2 Thinking reported 47.3%, GPT-5.3-Codex reported 64.7%, GPT-5.4 Thinking reported 75.0% on OSWorld-Verified, and GPT-5.5 reported 78.7% at its April 2026 launch.
Google's Project Mariner, launched in December 2024, did not consistently report OSWorld figures because the product is browser-only rather than full-desktop, and the team focused on WebVoyager as its primary metric. This underscored a structural fact about OSWorld: it tests capabilities that are visible only when an agent is allowed to operate desktop applications, not just web pages, which explained why Anthropic's Computer Use product (which can operate terminals, IDEs, and native software) outperformed browser-only competitors on the benchmark even when those competitors had higher WebVoyager scores.
Alibaba Cloud's Qwen series and Zhipu AI's GLM series both began submitting to OSWorld in 2025 and have held mid-leaderboard positions. The Qwen3 VL line has been the strongest open-weights performer on the main OSWorld benchmark, peaking around 66.7% with Qwen3 VL 235B A22B Instruct. Moonshot AI's Kimi K2.6 reached 73.1% on OSWorld-Verified, putting it in the top five at the time of release.
The XLANG Lab at HKU formalized community engagement through the OSWorld-Verified leaderboard, which requires submissions to be run on the official AWS infrastructure or executed by the benchmark maintainers from submitted agent code. This introduced a level of verification absent from the main benchmark, where organizations could submit self-reported scores.
OSWorld has also generated a line of derivative work. The OSWorld-Human paper (2025) extended the evaluation to measure task completion efficiency rather than just binary success, asking how many actions agents took relative to expert humans. The OSWorld-MCP variant explored whether agents equipped with Model Context Protocol servers could use structured tool access to improve performance. The OS-HARM project built a safety evaluation layer on top of OSWorld infrastructure to measure whether computer-use agents could be induced to take harmful actions.
Third-party frameworks for running and evaluating computer-use agents, including the UK AI Safety Institute's inspect_evals library and the open-source CUA framework, have incorporated OSWorld as a standard evaluation target, further cementing its role as the primary public benchmark for desktop AI agent capability.[13] The benchmark's influence also extended to training pipelines: several computer-use models have explicitly cited OSWorld task distributions when describing their data collection and reinforcement learning fine-tuning strategies, making the benchmark both an evaluation target and a specification of what computer-use competence means in practice.
One significant limitation of OSWorld is that task instructions and evaluation functions have been updated over time in response to community feedback. The July 2025 OSWorld-Verified update modified more than 10% of tasks, and subsequent point releases adjusted roughly another 10% of task instructions. In addition, approximately 10% of tasks rely on live internet data, which means task difficulty can shift as websites change. This makes strict longitudinal comparisons of model scores across time unreliable without knowing exactly which version of the benchmark was used.
The Epoch AI analysis published in 2025 framed this concern bluntly: the practice of correcting errors in a static benchmark over time is highly atypical for non-live evaluations, and to the extent that the corrected errors had been rendering tasks unnecessarily difficult or impossible, the corrections give a spurious appearance of improved model capability when reported figures rise without controlling for the benchmark version. The XLANG team has been transparent about this and explicitly designed OSWorld-Verified to favor evaluator-side fixes over instruction rewrites in order to preserve score continuity with the original benchmark, but the fundamental tension between maintenance and stability is unavoidable.
The OSWorld-Verified initiative partially addresses this problem by versioning the task set and providing a stable snapshot for leaderboard comparisons, but the main benchmark continues to be updated. Anthropic's October 2024 launch number for Claude 3.5 Sonnet (14.9%) and the May 2025 launch number for Claude Sonnet 4 (42.2%) were measured on different versions of the benchmark than the February 2026 launch number for Claude Sonnet 4.6 (72.5%), and reasonable people disagree about how much of the apparent improvement is real progress versus benchmark drift.
A notable characteristic of OSWorld, analyzed by Epoch AI in 2025, is that not all tasks genuinely require GUI interaction. Approximately 15% of tasks can be completed using only a terminal, and roughly 30% allow models to substitute terminal commands or Python scripting for the intended GUI operations. This means that a model scoring well on OSWorld may be demonstrating command-line proficiency rather than genuine GUI manipulation skill.[9]
The benchmark developers acknowledge this property. OSWorld is designed to test agents that operate like computer users, which includes terminal use as a legitimate modality. However, it means that headline OSWorld scores do not cleanly decompose into "GUI skill" versus "shell scripting skill," and researchers interpreting results should be aware of this. The Epoch analysis estimated that adjusting for the substitution effect would lower headline scores by roughly five to ten percentage points, which is a meaningful correction at the current frontier.
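A small worked example shows why a five-to-ten-point correction matters at the frontier. The sketch simply subtracts the quoted adjustment range from two headline scores taken from the leaderboard; treating the adjustment as a flat subtraction is an assumption made for illustration, not a description of how the Epoch analysis computed it.

```python
from typing import Tuple

HUMAN_BASELINE = 72.36
ADJUSTMENT_RANGE = (5.0, 10.0)   # percentage points, per the estimate quoted above


def adjusted_range(headline: float) -> Tuple[float, float]:
    """Return (low, high) after subtracting the largest and smallest adjustment."""
    small, large = ADJUSTMENT_RANGE
    return (round(headline - large, 2), round(headline - small, 2))


for name, score in [("Claude Opus 4.7", 78.0), ("Claude Mythos Preview", 79.6)]:
    low, high = adjusted_range(score)
    straddles = low < HUMAN_BASELINE < high
    print(f"{name}: {low} to {high}, straddles human baseline: {straddles}")
```

For both systems the adjusted range straddles 72.36%, so under this simplification whether the frontier has truly passed the human baseline on GUI skill alone depends on where in the range the correction falls.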
Some task instructions were written with intentional ambiguity reflecting how real users phrase requests. While this is realistic, it introduces evaluation variance because different valid interpretations of an instruction may lead to different outcomes, some of which the evaluation script may not recognize as correct. The OSWorld-Verified fixes expanded the set of accepted solution paths for many tasks, but residual ambiguity remains.
The core benchmark is primarily an Ubuntu Linux evaluation. Although a subset of tasks has been adapted for Windows and macOS is listed as a supported platform, the majority of tasks and the primary leaderboard reflect Linux workflows. Many enterprise environments run Windows, and the Ubuntu-centric composition means that Linux-specific CLI idioms appear more frequently than they would in a representative sample of global computer use. Most third-party evaluators consider the Windows subset of 43 tasks too small for meaningful platform comparisons.
Most OSWorld tasks require fewer than ten atomic actions. The median task requires approximately six steps. Only 12% of tasks demand more than 20 steps. This means the benchmark primarily measures performance on short to medium-length computer tasks and does not fully characterize agent performance on extended autonomous workflows that might require hundreds of sequential decisions. OSWorld-Human, a subsequent benchmark from 2025, addresses task complexity from an efficiency perspective by measuring how many actions agents take relative to humans. As frontier models approach and clear the 72% human baseline, attention has shifted to longer-horizon evaluations that test sustained focus over hours rather than minutes, and the XLANG Lab has signalled that a successor benchmark with extended-horizon tasks is in development.
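The efficiency framing mentioned above, in which an agent's action count is compared with a human's on the same task, can be sketched as a simple ratio. This is a minimal illustration under stated assumptions: the helper `step_overhead` and the trajectory data are hypothetical, and the actual OSWorld-Human metric may be defined differently.

```python
from statistics import median


def step_overhead(agent_steps, human_steps):
    """Ratio of agent actions to human actions for one task (1.0 = parity)."""
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return agent_steps / human_steps


# Hypothetical (agent_steps, human_steps) pairs for four completed tasks.
trajectories = [(9, 6), (12, 6), (30, 20), (6, 6)]

ratios = [step_overhead(agent, human) for agent, human in trajectories]
median_ratio = median(ratios)

print("per-task overhead:", ratios)
print("median overhead:", median_ratio)
```

A ratio above 1.0 means the agent needed more actions than the human reference. Reporting the median rather than the mean keeps a single very long trajectory from dominating the summary.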
By May 2026, three frontier systems (Claude Mythos Preview, GPT-5.5, and Claude Opus 4.7) sat within two percentage points of each other near 80% on OSWorld-Verified, roughly six to seven points above the human baseline and within striking distance of the plausible 85% to 90% ceiling that residual task ambiguity and noise allow on the current task set. Saturation discussions, similar to those that accompanied the late stages of GLUE and SuperGLUE in natural language understanding, have grown more frequent in the computer-use community. Several major labs now report that they treat OSWorld-Verified primarily as a regression test rather than a frontier metric, with newer evaluations such as long-horizon agent benchmarks and live-task suites supplanting it as the headline measure for new releases.
For much of 2024 and early 2025, OSWorld's reliance on locally hosted virtual machines made independent reproduction difficult. Different evaluators using slightly different VM configurations could produce different scores for the same agent on the same task. The introduction of AWS-backed parallel evaluation in OSWorld-Verified largely solved this for verified-track submissions, but private evaluations on the original infrastructure still produce noticeable run-to-run variance. Anthropic's Sonnet 4.6 launch post acknowledged this by reporting harmonized public OSWorld-Verified numbers rather than the spread of internal runs.
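Run-to-run variance of the kind described above is usually summarized as a mean and a standard deviation over repeated runs. The sketch below uses only Python's standard `statistics` module. The scores are hypothetical and are not taken from any published OSWorld evaluation.

```python
from statistics import mean, stdev

# Hypothetical success rates (%) from five repeated runs of the same agent
# on the same task set, each with a slightly different VM configuration.
run_scores = [61.2, 63.0, 59.8, 62.4, 60.6]

avg = mean(run_scores)
spread = stdev(run_scores)  # sample standard deviation (n - 1 denominator)
score_range = max(run_scores) - min(run_scores)

print(f"mean = {avg:.1f}%  std = {spread:.2f} points  range = {score_range:.1f} points")
```

Reporting the spread alongside the mean makes it clear whether a difference of a point or two between two agents is larger than the noise of the evaluation itself.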
The main OSWorld leaderboard accepts self-reported submissions. As of May 2026, only the OSWorld-Verified leaderboard requires evaluation on the official AWS infrastructure or on agent code submitted to the maintainers. This split has produced occasional confusion when laboratories report slightly different scores on the two leaderboards for the same model, and the practical advice from XLANG is to cite OSWorld-Verified figures whenever they are available.
OSWorld sits within a growing ecosystem of agent evaluation benchmarks. The table below compares its key properties against WebArena, WebVoyager, and GAIA.
| Dimension | OSWorld | WebArena | WebVoyager | GAIA |
|---|---|---|---|---|
| Year | 2024 (NeurIPS) | 2023 (ICLR 2024) | 2024 | 2023 |
| Environment | Full desktop OS (Ubuntu, Windows, macOS) | Sandboxed websites (4 domains) | Live websites (15 popular sites) | Mixed (web search, file handling, tool use) |
| Task count | 369 (+ 43 Windows) | 812 templated tasks from 241 templates | 643 tasks | 466 tasks |
| Interaction modality | GUI (mouse, keyboard) plus optional terminal | Web browser actions | Real browser actions | Mixed (browsing, file I/O, calculator) |
| Cross-app workflows | Yes (101 tasks) | No | No | Implicit (multi-step tool use) |
| Evaluation method | Execution-based (134 custom scripts) | Execution-based plus web state | Auto-evaluation with GPT-4V judge | Automated answer matching |
| Live data dependency | Partial (~10% tasks) | Sandboxed (no live internet) | Yes (live websites) | Partial |
| Human baseline | 72.36% | 78.24% | n/a | 92% |
| State of the art (May 2026) | 79.6% (Mythos Preview) | ~67% (GPT-5.4) | mid-80s on top systems | high-80s on top systems |
| Original best model | 12.24% (April 2024) | ~14.4% (GPT-4 era) | 65 to 87% (early 2025) | 14% (GPT-4 era) |
WebArena and OSWorld share the execution-based evaluation philosophy but target different scopes. WebArena's 812 tasks are generated from templates, which enables larger scale but also means many tasks are structural variations on the same underlying operations. OSWorld's 369 tasks are individually crafted from real workflows, which reduces scale but increases diversity. WebArena tests only browser-based tasks; OSWorld tests full desktop operation including local applications with no web component.
WebVoyager occupies a related but distinct niche. It evaluates agents on live websites rather than sandboxed copies, which captures the real volatility and variety of the modern web but also introduces reproducibility challenges, since a website that worked at the time of evaluation may have changed by the time someone tries to reproduce a score. WebVoyager is the benchmark where browser-only systems like Operator and Project Mariner shine, scoring in the 80s, while OSWorld is where full-desktop systems like Anthropic Computer Use lead.
GAIA, released in late 2023 by researchers at Meta and Hugging Face, evaluates general assistant capability with an emphasis on reasoning across mixed tools rather than direct GUI manipulation. GAIA tasks often test whether an agent can correctly use a calculator, follow a citation chain, or extract a specific value from a long document. It overlaps with OSWorld at the level of task complexity but differs in modality: OSWorld asks whether the agent can drive a desktop, while GAIA asks whether the agent can reason its way through a multi-step information problem. AgentBench, AssistantBench, and Online-Mind2Web round out the broader landscape of agent evaluations cited alongside OSWorld in 2025 and 2026 model release notes.
A further distinction is that OSWorld was the first benchmark to include cross-application workflow tasks as a dedicated category, testing a capability that is absent from WebArena and irrelevant to most other agent evaluations. This focus on multi-application orchestration is part of why OSWorld scores improved more slowly than browser-only benchmarks: each application boundary an agent crosses is another opportunity to lose state.