WindowsWorld
Last reviewed
Jun 3, 2026
Sources
2 citations
Review status
Source-backed
Revision
v1 · 1,361 words
WindowsWorld is a process-centric benchmark for evaluating autonomous GUI agents on professional, cross-application desktop workflows. Introduced in an April 2026 paper from the Harbin Institute of Technology, Shenzhen, it contains 181 tasks that span 17 common Windows applications, with an average of 5.0 sub-goals per task and 78% of tasks requiring coordination across more than one application. The headline result is sobering: every frontier computer-use agent the authors tested scored under 21% on the benchmark, far below their performance on the simpler single-application tasks that earlier benchmarks emphasize. WindowsWorld was accepted to ACL 2026, and its code and data are released on GitHub.[1][2]
The benchmark targets a gap that its authors argue prior work mostly ignored. Most desktop agent benchmarks measure whether a model can finish one self-contained chore inside a single program. Real office work rarely looks like that. A person pulling numbers out of an email, dropping them into a spreadsheet, generating a chart, and pasting it into a slide deck is moving across four applications and carrying state between them the whole way. WindowsWorld is built around that kind of stitched-together workflow, and it scores agents on the intermediate steps rather than only on the final outcome.
The central idea is process-centric evaluation. Instead of grading a task as a binary pass or fail based on the end state, WindowsWorld decomposes each task into explicit intermediate checkpoints, averaging 4.97 per task. An agent gets partial credit for reaching milestones along the way, which lets the benchmark distinguish an agent that stalled on the first sub-goal from one that completed most of a workflow but tripped on the final step. The authors argue this granularity matters because end-state-only scoring throws away almost all the signal on long tasks, where nearly everything fails outright.[1]
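The contrast between milestone scoring and end-state scoring can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual scoring code: the `TaskResult` fields, the function names, and the equal weighting of checkpoints are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Outcome of one task run. All field names here are hypothetical."""
    task_id: str
    checkpoints_passed: List[bool]  # one flag per intermediate checkpoint, in order
    final_goal_met: bool            # result of the end-state check


def checkpoint_score(result: TaskResult) -> float:
    """Fraction of checkpoints reached (equal weighting is an assumption)."""
    if not result.checkpoints_passed:
        return 0.0
    return sum(result.checkpoints_passed) / len(result.checkpoints_passed)


def completion_score(result: TaskResult) -> float:
    """Binary end-state score: 1.0 only if the final goal was met."""
    return 1.0 if result.final_goal_met else 0.0


def benchmark_scores(results: List[TaskResult]) -> Dict[str, float]:
    """Average both metrics over all tasks and report them as percentages."""
    n = len(results)
    if n == 0:
        return {"checkpoint_pct": 0.0, "completion_pct": 0.0}
    return {
        "checkpoint_pct": 100 * sum(checkpoint_score(r) for r in results) / n,
        "completion_pct": 100 * sum(completion_score(r) for r in results) / n,
    }


# Two failed runs that a pass-or-fail metric cannot tell apart.
stalled = TaskResult("demo-stalled", [False] * 5, final_goal_met=False)
near_miss = TaskResult("demo-near-miss", [True, True, True, True, False], final_goal_met=False)
```

Both demo runs score zero on completion, but only the second earns checkpoint credit (four of five milestones), which is the distinction the benchmark is designed to surface.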
Tasks run on real Windows virtual machines (Windows 10 and 11, plus Windows Server 2022 and 2025) rather than in a simplified or mocked interface, so agents interact with genuine application UIs through screenshots and actions. The 17 applications cover the categories an office worker actually touches: productivity software such as Word, Excel, PowerPoint, and a PDF reader; communication and browsing tools including Chrome and the Thunderbird email client; system utilities like File Explorer and Calculator; media tools such as GIMP and Paint; and developer tools including VS Code and PowerShell.[1][2]
WindowsWorld sorts its tasks into four difficulty levels, each with its own step budget:
| Level | Description | Share of tasks | Step budget |
|---|---|---|---|
| L1 | Single-application atomic actions | 21.5% | 15 |
| L2 | Multi-application linear workflows (2 to 3 apps) | 44.2% | 25 |
| L3 | Dynamic reasoning across 3 or more apps with conditional logic | 27.6% | 40 |
| L4 | Infeasible tasks the agent should refuse | 6.6% | 20 |
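As a quick consistency check, the percentage shares in the table can be converted back into task counts. The counts below are not quoted from the paper; they are inferred by rounding, on the assumption that each share was rounded from a whole number of tasks out of 181.

```python
TOTAL_TASKS = 181

# Share of tasks per level, copied from the table above.
SHARE_PCT = {"L1": 21.5, "L2": 44.2, "L3": 27.6, "L4": 6.6}


def implied_task_counts(total: int = TOTAL_TASKS) -> dict:
    """Round each percentage share to the nearest whole number of tasks."""
    return {level: round(total * pct / 100) for level, pct in SHARE_PCT.items()}


counts = implied_task_counts()
print(counts)                # {'L1': 39, 'L2': 80, 'L3': 50, 'L4': 12}
print(sum(counts.values()))  # 181
```

The rounded counts add up to exactly 181, so the published shares are internally consistent even though they sum to 99.9% rather than 100%.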
The L4 tier is a notable design choice. These tasks cannot be completed because of something like an invalid URL, a missing file, or a required login the agent does not have. A capable agent is supposed to recognize the obstacle and report that the task is infeasible rather than hallucinate a success. This directly probes a common failure mode where agents claim to have finished work they never did.[1]
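One way to picture how step budgets and infeasible tasks interact is a small grading function. The rule below is an assumption for illustration only: the `Outcome` names and the `grade_outcome` function are hypothetical, and the paper's harness may credit refusals or budget overruns differently.

```python
from enum import Enum


class Outcome(Enum):
    """How an agent ended its episode (hypothetical labels)."""
    COMPLETED = "completed"                      # agent claims the task is done
    REPORTED_INFEASIBLE = "reported_infeasible"  # agent says the task cannot be done
    BUDGET_EXHAUSTED = "budget_exhausted"        # agent ran out of steps


# Step budgets per level, copied from the table above.
STEP_BUDGET = {"L1": 15, "L2": 25, "L3": 40, "L4": 20}


def grade_outcome(level: str, outcome: Outcome, steps_used: int,
                  goal_verified: bool = False) -> bool:
    """Return True if the episode counts as a success under the assumed rule."""
    if steps_used > STEP_BUDGET[level]:
        return False  # over budget never counts
    if level == "L4":
        # Infeasible task: the only correct ending is an explicit refusal.
        return outcome is Outcome.REPORTED_INFEASIBLE
    # Feasible task: a claim of completion must be backed by the end-state check.
    return outcome is Outcome.COMPLETED and goal_verified
```

Under this rule, an agent that answers "done" on an L4 task fails even if it stayed within budget, which mirrors the hallucinated-success failure mode the tier is meant to catch.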
Authoring realistic multi-application tasks by hand is slow, so the team used a human-in-the-loop multi-agent pipeline. The generation process is steered by 16 occupations drawn from five professional domains, which anchors tasks in plausible job activities rather than arbitrary UI exercises. The pipeline drafts candidate tasks at each of the four difficulty levels, inserts intermediate inspection points, and then hands the output to human reviewers who refine and validate each task before it is executed in the virtual machine environment. The result is meant to combine the breadth of automated generation with the reliability of human curation.[1]
To gauge how much work each task genuinely requires, the authors recorded minimum expert action counts for each level: an average of 9.67 actions for L1 tasks, 18.13 for L2, and 27.81 for L3. These figures double as an efficiency yardstick, since the paper found that agents routinely exceed human step counts without ever reaching the goal.[1]
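The expert action counts can be turned into two simple ratios: how much slack each level's step budget leaves over the expert minimum, and how far a given agent run overshoots that minimum. The ratios are derived arithmetic rather than figures reported in the paper, and the function names are hypothetical.

```python
# Minimum expert action counts and step budgets for the feasible levels,
# copied from the figures quoted in this article.
EXPERT_MIN_ACTIONS = {"L1": 9.67, "L2": 18.13, "L3": 27.81}
STEP_BUDGET = {"L1": 15, "L2": 25, "L3": 40}


def budget_headroom(level: str) -> float:
    """Step budget divided by the expert minimum for that level."""
    return STEP_BUDGET[level] / EXPERT_MIN_ACTIONS[level]


def step_overhead(level: str, agent_steps: int) -> float:
    """Agent steps divided by the expert minimum (1.0 means expert-level efficiency)."""
    return agent_steps / EXPERT_MIN_ACTIONS[level]


for level in EXPERT_MIN_ACTIONS:
    print(level, round(budget_headroom(level), 2))
# L1 1.55
# L2 1.38
# L3 1.44
```

On these numbers, each budget allows roughly 1.4 to 1.6 times the expert minimum, so an agent that fails after exhausting its budget has already spent well over the steps an expert would need.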
The benchmark is hard. Across leading models and agent scaffolds, no system cleared the 21% mark on overall task completion, and the gap between intermediate progress and final success is striking. The strongest performer was an agent built on Gemini 3 Flash, which reached 50.32% on the intermediate checkpoint score but only 20.44% on final task completion. Other frontier models fared considerably worse: a GPT-5.2-based agent scored 6.62% on intermediate checkpoints and 1.12% on final completion, while a Claude Sonnet 4.5-based agent scored 5.84% on intermediate checkpoints and 0.55% on final completion.[1]
A few patterns stand out across the leaderboard. Agents do reasonably well on the simple single-application tasks but collapse as soon as a workflow spans multiple programs, which is exactly where most of the benchmark lives. They are especially weak on tasks that demand conditional judgment and reasoning across three or more applications, often stalling at an early sub-goal and never recovering. And efficiency is poor: tasks frequently fail even after the agent has taken far more steps than a human expert would need, which suggests the models drift, repeat themselves, or lose track of state during long sequences.[1]
The large spread between checkpoint scores and completion scores is itself a finding. It shows that frontier agents can often start a complex workflow and make early progress, but they rarely carry it all the way through. Under a conventional pass-or-fail benchmark, almost all of this activity would register as a flat zero, which is part of why the authors favor milestone scoring.[1]
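The size of that spread can be computed directly from the three results quoted in this article. The gap figure itself is derived here for illustration and is not a metric named in the paper; `progress_gap` is a hypothetical helper.

```python
# (checkpoint %, final completion %) for the agents quoted in this article.
REPORTED = {
    "Gemini 3 Flash": (50.32, 20.44),
    "GPT-5.2": (6.62, 1.12),
    "Claude Sonnet 4.5": (5.84, 0.55),
}


def progress_gap(checkpoint_pct: float, completion_pct: float) -> float:
    """Percentage points of partial progress that never became a finished task."""
    return round(checkpoint_pct - completion_pct, 2)


gaps = {model: progress_gap(*scores) for model, scores in REPORTED.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:.2f} points")
# Gemini 3 Flash: 29.88 points
# GPT-5.2: 5.50 points
# Claude Sonnet 4.5: 5.29 points
```

For the strongest agent, nearly 30 percentage points of measured progress would be invisible under a pass-or-fail metric.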
WindowsWorld positions itself against two well-known predecessors. OSWorld is a broad computer-use benchmark covering real desktop tasks across operating systems, and Windows Agent Arena focuses specifically on agents operating a Windows environment. The authors' argument is not that these benchmarks are wrong, but that they lean heavily on isolated, single-application tasks and grade mostly on end states.
Two differences define WindowsWorld. The first is the multi-application emphasis: about 78% of its tasks inherently require more than one program, compared with a much smaller multi-app share in OSWorld's Windows tasks (the paper cites roughly 27.4%). The second is process-aware evaluation, the intermediate checkpoint scoring that Windows Agent Arena and similar benchmarks generally lack. Together these make WindowsWorld a test of sustained, stateful coordination rather than one-shot competence, which is why scores on it are so much lower than on the tasks earlier benchmarks foreground.[1][2]
The value of WindowsWorld is that it relocates the difficulty. Headline numbers on single-app benchmarks had started to suggest desktop agents were approaching practical usefulness, but those numbers said little about the multi-step, multi-program work that fills an actual workday. By concentrating on cross-application workflows and scoring partial progress, WindowsWorld exposes how far current agents are from reliably completing real office tasks, and it gives researchers a more diagnostic signal than a binary success rate. The L4 refusal tasks add a useful honesty check, since an agent that fabricates success is arguably worse than one that admits it is stuck.
There are limits worth keeping in mind. With 181 tasks the benchmark is modest in size, the reported results come from the paper's own evaluation harness, and the absence of a full human success-rate baseline (beyond minimum expert action counts) makes the human ceiling harder to pin down. Even so, the under-21% result is a clear marker. It suggests that the hard part of desktop automation is not operating any single application but holding a goal together across many of them, and that closing this gap is one of the open problems for the next generation of computer-use agents.[1][2]