Windows Agent Arena
Last reviewed
Jun 2, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,499 words
Windows Agent Arena (WAA) is a reproducible benchmark and evaluation environment created by Microsoft for testing multimodal computer-use agents on real Windows tasks at scale. It places an AI agent inside an actual Windows 11 operating system, where it can open the same applications, browsers, and tools a person would use, and then scores the agent on a suite of 154 hand-authored tasks spanning everyday workloads such as document editing, web browsing, coding, media playback, and system configuration. A central design goal is speed: by packaging the operating system in a container and distributing the work across Microsoft Azure, a full run of the benchmark can finish in roughly 20 minutes rather than the hours or days that sequential, multi-step desktop evaluation usually demands. The project was introduced in the paper "Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale," posted to arXiv in September 2024.[1][2][3]
Windows Agent Arena targets the problem of measuring how well a software agent can drive a real desktop computer toward a goal. Earlier agent benchmarks tended to constrain the problem to a single modality or a narrow domain, for example text-only question answering, web-page navigation, or code generation, which made it hard to judge an agent that has to perceive a screen, plan over many steps, and operate arbitrary applications the way a human assistant would. WAA instead runs agents against a live Windows installation so that the action space, the rendering, and the application behavior match what a real user encounters.[1][4]
The benchmark ships 154 tasks distributed across representative Windows domains, each paired with a programmatic checker that inspects the final system state to decide whether the task succeeded. To demonstrate the platform, the authors built a multimodal agent called Navi, which reached a 19.5% task success rate, well below the 74.5% an unassisted human achieved on the same tasks. That gap is the headline finding: capable general-purpose vision-language models are still far from human-level competence at operating a desktop, and WAA was built to make progress on that gap measurable and cheap to iterate on. The code is released by Microsoft under the MIT license and the original benchmark task framework is adapted from OSWorld.[1][2][5]
A computer-use agent is a system that perceives a graphical interface, usually from screenshots, and acts on it through synthesized mouse and keyboard input in order to complete a user's request. Interest in such agents grew sharply in 2024 as multimodal foundation models became able to read screens and reason about interface elements, and as vendors began shipping products in the same space. Evaluating these agents is difficult for reasons WAA explicitly set out to address.[1][4]
First, desktop tasks are long-horizon and sequential: a single task can require many dependent actions, and an early mistake can derail everything that follows, so each task takes real wall-clock time to run. Second, faithful evaluation requires a real environment, because agents that look strong in a simplified or mocked interface may fail against the quirks of production software. Third, reproducibility is hard when an agent touches live web services or a mutable machine. WAA's answer is a deterministic, containerized Windows environment with state-based task verification, combined with cloud parallelization so that the wall-clock time of a full run is bounded by its slowest tasks rather than by the sum of all task durations.[1][2][4]
WAA runs a full Windows 11 image inside a virtual machine that is itself wrapped in a Docker container, which makes the whole environment portable and reproducible. The agent observes the screen through screenshots and can additionally consume the Windows UI Automation (UIA) accessibility tree, a structured description of on-screen elements exposed by the operating system. It acts by issuing programmatic mouse and keyboard commands, the same channels a human uses, so no application needs a special API to be exercised.[1][2][5]
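The observe-then-act cycle described above can be sketched in a few lines of Python. This is an illustration only, not WAA's actual interface: `Observation`, `Action`, `FakeDesktop`, `run_episode`, and `scripted_policy` are hypothetical names, the fake environment stands in for the containerized Windows VM, and the 15-step budget is an arbitrary placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Observation:
    screenshot: bytes            # raw pixels of the current screen
    uia_tree: Optional[str]      # optional accessibility-tree dump


@dataclass
class Action:
    kind: str                    # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for the Windows VM; it only records input events."""
    events: List[Action] = field(default_factory=list)

    def observe(self) -> Observation:
        return Observation(screenshot=b"", uia_tree="<Window name='Notepad'/>")

    def apply(self, action: Action) -> None:
        self.events.append(action)


def run_episode(env: FakeDesktop,
                policy: Callable[[Observation, int], Action],
                max_steps: int = 15) -> int:
    """Alternate observe/act until the policy says "done" or the budget runs out.

    Returns the number of actions sent to the environment.
    """
    for step in range(max_steps):
        action = policy(env.observe(), step)
        if action.kind == "done":
            return step
        env.apply(action)
    return max_steps


def scripted_policy(obs: Observation, step: int) -> Action:
    # Toy policy: click once, type once, then stop.
    script = [Action("click", x=100, y=200), Action("type", text="hello")]
    return script[step] if step < len(script) else Action("done")


env = FakeDesktop()
steps_used = run_episode(env, scripted_policy)
```

In a real agent the scripted policy would be replaced by a model call that reads the screenshot and, optionally, the accessibility tree before choosing the next mouse or keyboard action.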
The 154 tasks are spread across application domains chosen to reflect common Windows usage. Reported coverage spans office productivity in LibreOffice Calc and Writer, web browsing in Microsoft Edge and Google Chrome, Windows system operations in File Explorer and Settings, software development in Visual Studio Code, media playback in VLC, and small utilities such as Notepad, Clock, and Paint. Each task pairs a natural-language instruction with an automatic checker that examines the resulting files, application state, or system configuration, so success is judged by outcome rather than by matching a fixed click sequence.[2][3][6]
| Task domain | Example applications | Representative work |
|---|---|---|
| Office productivity | LibreOffice Calc, LibreOffice Writer | Editing spreadsheets and documents [2][6] |
| Web browsing | Microsoft Edge, Google Chrome | Navigating sites and gathering information [2][6] |
| Windows system | File Explorer, Settings | Managing files and changing settings [2][6] |
| Coding | Visual Studio Code | Editing and running code [2][6] |
| Media and video | VLC Player | Playing and controlling video [2][6] |
| Utilities | Notepad, Clock, Paint | Small everyday tasks [2][6] |
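
Outcome-based checking of the kind described above can be illustrated with a toy verifier. The task ("save a note containing the word hello as notes.txt") and the function `check_note_saved` are hypothetical and are not taken from the WAA task suite; the point is only that the checker inspects the final state and ignores how the agent got there.

```python
import tempfile
from pathlib import Path


def check_note_saved(workdir: Path) -> bool:
    """Hypothetical checker: pass only if notes.txt exists with the expected text."""
    target = workdir / "notes.txt"
    return target.is_file() and target.read_text(encoding="utf-8").strip() == "hello"


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    before = check_note_saved(workdir)          # nothing saved yet
    # Simulate the agent finishing the task by whatever sequence of clicks and keys.
    (workdir / "notes.txt").write_text("hello\n", encoding="utf-8")
    after = check_note_saved(workdir)           # final state now matches
```

Because the check depends only on the resulting file, two agents that reach the same state through different click sequences receive the same verdict.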
The most distinctive engineering feature of WAA is how quickly it runs. Because every task executes in its own self-contained Windows container, tasks are embarrassingly parallel, and the benchmark can be fanned out across many machines at once using Azure Machine Learning. The authors report that the infrastructure can be scaled out to as many workers as there are tasks, so that a full evaluation of all 154 tasks finishes in as little as 20 minutes instead of the multiple hours to over a day it would take on a single machine.[1][2][7]
In practice the deployment uploads a prepared Windows 11 "golden image" and the Docker images to Azure Blob Storage, then launches parallel jobs on Azure ML compute (the repository references Standard_D8_v3 virtual machines), with each job running one or more tasks and reporting results back. Microsoft frames this as the ability to "deploy hundreds of agents in parallel, accelerating results to a matter of minutes, not days," which lowers the cost of a development loop enough that researchers can re-run the full benchmark frequently while iterating on an agent.[5][7]
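The effect of fanning tasks out across workers can be illustrated with a small scheduling calculation. The sketch below is not the Azure ML deployment code: `makespan` is a hypothetical helper, round-robin assignment is an assumed scheduling policy, and the per-task durations (5 to 20 minutes) are invented placeholders rather than measured values.

```python
from typing import List


def makespan(durations: List[float], workers: int) -> float:
    """Wall-clock time when tasks are dealt round-robin to `workers` parallel workers."""
    if workers < 1:
        raise ValueError("need at least one worker")
    loads = [0.0] * workers
    for i, d in enumerate(durations):
        loads[i % workers] += d
    return max(loads)


# Placeholder durations in minutes for 154 tasks (illustrative, not measured).
durations = [5 + (i % 4) * 5 for i in range(154)]

sequential_minutes = makespan(durations, workers=1)      # one machine runs everything
parallel_minutes = makespan(durations, workers=154)      # one worker per task
```

With one worker per task the run takes as long as the single slowest task, while a lone machine pays the sum of all durations, which is the gap between minutes and many hours that the parallel design exploits.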
To show that the platform is usable end to end, the WAA team introduced Navi, a multimodal agent that serves as the reference baseline. Navi takes a screenshot, builds a structured representation of the interactive elements on screen, and feeds that to a vision-language model that decides the next action. The element representation uses a Set-of-Marks approach, in which candidate UI elements are detected and overlaid with numbered marks so the model can refer to a target by its index rather than by raw pixel coordinates.[1][2][5]
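The Set-of-Marks idea can be sketched as follows. This is a simplified illustration, not Navi's implementation: the `Element` type, the helper names, the reading-order numbering rule, and the sample bounding boxes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Element:
    label: str
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


def assign_marks(elements: List[Element]) -> Dict[int, Element]:
    """Number elements from 1 in reading order (top-to-bottom, then left-to-right)."""
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    return {i: e for i, e in enumerate(ordered, start=1)}


def click_point(marks: Dict[int, Element], index: int) -> Tuple[int, int]:
    """Resolve a mark index chosen by the model to the center of its box."""
    left, top, right, bottom = marks[index].box
    return ((left + right) // 2, (top + bottom) // 2)


# Made-up detections standing in for the output of a screen parser.
detected = [
    Element("Save", (200, 10, 260, 40)),
    Element("File", (10, 10, 60, 40)),
    Element("OK", (300, 400, 360, 430)),
]
marks = assign_marks(detected)
prompt_lines = [f"[{i}] {e.label}" for i, e in marks.items()]
target = click_point(marks, 2)   # the model answered "click mark 2"
```

The model only has to name an index from `prompt_lines`; translating that index into pixel coordinates is handled outside the model, which is why inaccurate bounding boxes translate directly into misplaced clicks.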
Navi can construct those marks from several signals, and the quality of the marks turned out to matter a great deal. The strongest configuration combines pixel-based element detection from OmniParser, a screen-parsing model from Microsoft, with information from the Windows UIA accessibility tree. Weaker configurations rely only on optical character recognition and icon or object detection. Navi is model-agnostic and was evaluated with several backends, including GPT-4V, GPT-4o, GPT-4o-mini, and the smaller Phi-3-V; an open-source-only variant swaps OmniParser for Pytesseract plus Grounding DINO. Beyond the desktop benchmark, Navi was also tested on the web-navigation benchmark Mind2Web, where it performed competitively against prior methods, evidence that the same agent design generalizes beyond Windows.[1][2][8]
WAA reports a single primary metric, the success rate, defined as the fraction of tasks for which the post-execution checker confirms the goal was achieved, without partial credit. Each task has a deterministic verifier so that scoring does not depend on human judgment, and a step or time budget bounds how long an agent may work on a task before it is marked failed. To anchor the difficulty scale, the authors collected a human baseline by having people attempt the same tasks without external assistance, yielding the 74.5% reference figure.[1][2][4]
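The metric itself is simple enough to write down directly. The following is a minimal sketch in which `success_rate` is a hypothetical helper and the four outcomes are toy values, not benchmark results.

```python
from typing import Iterable


def success_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of tasks whose checker returned True; no partial credit."""
    results = list(outcomes)
    if not results:
        raise ValueError("no task outcomes to score")
    return sum(results) / len(results)


# Toy example: three of four tasks pass their checker.
rate = success_rate([True, False, True, True])
```

Each task contributes either 0 or 1, so a task that gets nearly all of the way to the goal counts exactly the same as one that never starts.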
The methodology emphasizes apples-to-apples comparison of agent designs. Because the environment, the task set, and the verifier are fixed and reproducible, the main variables a researcher changes are the underlying model and the agent's perception pipeline, which lets WAA isolate effects such as how much the accessibility tree contributes on top of vision alone. The benchmark's speed reinforces this: a cheap, fast full run encourages reporting results over the entire 154-task suite rather than a hand-picked subset.[1][2][7]
Across configurations, the best Navi setup on WAA reached 19.5% overall, using GPT-4V (the 1106 vision-preview model) with OmniParser plus UIA. Performance depended strongly on both the model and the perception signals: adding the UIA accessibility tree to the OmniParser configuration raised its success rate from 12.5% to 19.5%, a relative improvement the paper describes as about 57%, and the open-source perception stack showed a similar gain, from 8.6% to 13.1%. Success also varied sharply by domain: browser, system, coding, and media tasks succeeded between roughly 27% and 33% of the time, Windows utilities reached only 8.3%, and Office tasks, which lean heavily on keyboard shortcuts and dense toolbars, did not succeed at all in the best configuration.[1][6]
| Result | Value | Configuration / notes | Source |
|---|---|---|---|
| Human baseline (no assistance) | 74.5% | People attempting the 154 tasks | [1][2] |
| Navi, best overall | 19.5% | GPT-4V-1106 + OmniParser + UIA | [1][6] |
| Navi with GPT-4o | 8.6% | OmniParser + UIA | [6] |
| Navi with GPT-4o-mini | 4.2% | OmniParser + UIA | [6] |
| Navi with Phi-3-V | 3.5% | OmniParser + UIA | [6] |
| OmniParser, no UIA tree | 12.5% | GPT-4V-1106; UIA adds ~57% relative | [1][6] |
| Open-source perception, no UIA | 8.6% | Pytesseract + Grounding DINO + GPT-4V-1106 | [6] |
| Open-source perception, with UIA | 13.1% | Same stack plus UIA tree | [6] |
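
The relative gains quoted for the accessibility tree can be rechecked from the rounded success rates in the table above. The helper `relative_gain` is a hypothetical name introduced for this check; the inputs are the published percentages.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


uia_gain = relative_gain(12.5, 19.5)    # OmniParser alone vs. OmniParser + UIA
oss_gain = relative_gain(8.6, 13.1)     # open-source stack without vs. with UIA
human_gap = 74.5 - 19.5                 # percentage points between human and best Navi
```

From the rounded figures the OmniParser gain works out to 56%, close to the roughly 57% the paper reports; the small difference presumably comes from rounding in the published rates. The open-source stack gains about 52%, and the best agent still trails the human baseline by 55 percentage points.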

| WAA domain (best config) | Navi success rate | Source |
|---|---|---|
| Office | 0.0% | [6] |
| Web browser | 27.3% | [6] |
| Windows system | 33.3% | [6] |
| Coding | 27.3% | [6] |
| Media and video | 30.3% | [6] |
| Windows utilities | 8.3% | [6] |
On the web-navigation side, Navi using GPT-4o reported 47.3% element accuracy, an 85.8% operation F1, and a 45.2% step success rate on Mind2Web, ahead of the SeeAct (GPT-4V) baseline at 44.3% element accuracy, 71.8% operation F1, and 38.3% step success. These figures supported the claim that Navi is a competitive general agent and not merely tuned to Windows.[1][8]
WAA is built directly on the task framework of OSWorld, an earlier scalable environment for benchmarking multimodal computer-use agents that is primarily oriented toward Ubuntu Linux (and also covers some Windows scenarios). Microsoft describes Windows Agent Arena as extending the OSWorld platform toward a broad range of tasks on the Windows operating system, and the WAA repository credits OSWorld for the original benchmark task framework. In effect, WAA carries over OSWorld's idea of state-checked tasks in a real OS while specializing the applications, perception signals, and infrastructure for Windows.[1][2][5]
The clearest point of difference is infrastructure and turnaround time. OSWorld popularized realistic OS-level evaluation, but a full run remains slow because tasks are long and sequential; WAA's contribution is the Azure-backed parallelization that compresses a complete 154-task run into roughly 20 minutes, which changes the economics of iterating on an agent. WAA also leans on Windows-specific perception, in particular the UIA accessibility tree and Microsoft's OmniParser, as first-class inputs to the agent.[1][7]
| Aspect | OSWorld | Windows Agent Arena |
|---|---|---|
| Primary operating system | Ubuntu Linux (plus some Windows) | Windows 11 |
| Task framework | Original state-checked task design | Adapted from OSWorld |
| Cloud parallelization | Not the central focus | Azure ML, full run in ~20 min |
| Reference agent | Various baselines | Navi (Set-of-Marks + VLM) |
| Released by | Academic group (Xie et al.) | Microsoft and collaborators |
Windows Agent Arena gave the field a realistic, reproducible, and notably fast way to measure desktop agents on the world's most widely used PC operating system, on which most office and consumer software actually runs. By exposing how far a strong baseline like Navi (19.5%) sits below human performance (74.5%), it quantified the headroom in computer-use agents and provided a shared yardstick for tracking progress. The work was produced by researchers at Microsoft together with collaborators at Carnegie Mellon University and Columbia University, and it attracted broad coverage as an indicator of where Microsoft sees Windows-native AI assistants heading.[1][3][9]
Its parallelization model is part of why it mattered: turning a multi-day evaluation into a roughly 20-minute job makes it practical to treat agent development like ordinary software development, with a full benchmark run inside a normal iteration loop. The release of the environment, the Navi agent, and the task suite under a permissive license lowered the barrier for other groups to evaluate and extend computer-use agents on Windows.[1][5][7]
The authors are explicit that WAA measures a hard, unsolved capability and that current agents are weak at it. They highlight failure modes such as misalignment between an agent's textual reasoning and its visual understanding, and imprecise Set-of-Marks bounding boxes that cause the agent to click the wrong element. Performance also skews toward text-dominant interfaces and away from tasks that rely on keyboard shortcuts and small icons, which is why Office tasks performed so poorly relative to browser or system tasks.[1][6]
Safety is a recurring concern. An agent that can freely operate a real machine can also take harmful or irreversible actions, so the authors argue users should be able to understand, direct, diagnose, and override what an agent does, and they position human-in-the-loop oversight and more specialized agent architectures as directions for improvement. Outside commentary echoed these security and reliability worries about granting agents keyboard-and-mouse access to a live PC. Finally, the benchmark is bounded by its 154 curated tasks and its specific application set, so a high score on WAA does not by itself certify general competence across the full breadth of real-world desktop work.[1][9][10]