# Computer-use agent

> Source: https://aiwiki.ai/wiki/computer-use_agent
> Updated: 2026-06-23
> Categories: AI Agents, Artificial Intelligence, Computer Science, Computer Vision, Machine Learning, Multimodal AI, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **computer-use agent** (**CUA**) is a category of [AI agent](/wiki/ai_agent) in [artificial intelligence](/wiki/artificial_intelligence) that performs tasks by directly operating a general-purpose computer's graphical user interface (GUI) the way a human does, by "seeing" the screen, moving a cursor, clicking, typing, and interacting with windows and applications.[1] Unlike tool-calling approaches that rely on predefined APIs, computer-use agents aim to generalize across arbitrary software by treating the computer itself as the universal interface.[2] When Anthropic introduced the first major commercial implementation in October 2024, it described the capability this way: "developers can direct Claude to use computers the way people do, by looking at a screen, moving a cursor, clicking buttons, and typing text."[1] CUAs combine a [large language model](/wiki/large_language_model) (typically multimodal) with [computer vision](/wiki/computer_vision) and an action executor (a virtual machine, container, or browser sandbox), enabling end-to-end perception, reasoning, and control loops.[3][4]

The best computer-use agents reached the OSWorld human baseline of 72.36% by early 2026, climbing from 14.9% at the October 2024 launch to 72.7% for Claude Opus 4.6 in February 2026, a roughly fivefold gain in 16 months.[1][5][6] This article describes the broader category across all vendors and architectures, including [Anthropic](/wiki/anthropic) Claude, [OpenAI](/wiki/openai) [Operator](/wiki/openai_operator) and ChatGPT agent, Google Project Mariner and Gemini computer use, [Microsoft](/wiki/microsoft) Copilot Vision, and a growing open-source ecosystem. The narrower [computer use](/wiki/computer_use) article focuses on Anthropic's Claude Computer Use feature specifically.

## What are the different names for a computer-use agent?

The field has accumulated a small thicket of overlapping names. They mostly mean the same thing, with vendor-specific shading.

| Term | Origin | Notes |
| --- | --- | --- |
| Computer-use agent (CUA) | Generic | The umbrella term used in this article and most academic work |
| GUI agent | Academic | Emphasizes graphical interface interaction, used in papers like SeeAct[7] |
| Desktop agent | Industry | Focuses on full desktop environments versus browser-only systems |
| Self-operating computer (SOC) | OthersideAI | Open-source framework with pixel observation and mouse/keyboard actions[8] |
| Computer-Using [Agent](/wiki/agent) (CUA) | OpenAI | The model behind [Operator](/wiki/openai_operator), now in the Responses API[2] |
| [Browser-use agent](/wiki/browser-use_agent) (BUA) | Industry | A CUA restricted to a web browser, sometimes using DOM alongside vision[9] |
| Web agent | Mixed | Interchangeable with browser-use agent, especially in benchmarks like [WebArena](/wiki/webarena)[10] |

[Robotic process automation](/wiki/robotic_process_automation) (RPA) tools are sometimes lumped in, but classic RPA is rule-based and brittle. The point of CUAs is that the model can read an unfamiliar interface without hand-coded selectors.

## How does a computer-use agent differ from an API or browser-automation agent?

Vendors blur these lines on purpose, so being precise matters. Computer-use agents see screenshots (or a video stream of the screen) and emit mouse and keyboard actions, requiring no cooperation from the target application. API-driven agents call HTTP endpoints, gRPC services, or [function calling](/wiki/function_calling) tools; they are faster and more reliable, but work only where the developer has wired up an integration. Browser automation tools like Playwright and Selenium drive a browser through DOM and accessibility trees, not pixels; they are reliable on standard sites but fragile on heavy JavaScript apps. [Tool-use](/wiki/tool_use) agents in the broad sense include CUAs as one specialization, alongside agents that call calculators, search APIs, or [Model Context Protocol](/wiki/model_context_protocol) servers.

In practice, modern systems mix these. The [ChatGPT agent](/wiki/openai_operator) released in July 2025 combines a visual browser, a text browser, a terminal, and direct connectors to services like Gmail, all coordinated by a single model. Anthropic's March 2026 Mac computer use feature tries connectors first, then browser navigation, and uses direct screen control only as a last resort. The hybrid pattern is now dominant.

## Core characteristics

A computer-use agent operates any on-screen software within OS and permission limits, rather than only API-integrated tools.[2] It runs a perception-action loop that reads screen state (image, sometimes plus accessibility data) and emits actions like `click(x, y)` and `type("text")` until the goal is reached or a stop condition fires.[3] Text-based planning is fused with visual grounding so the agent can locate UI elements that have no semantic anchor in code.[11] Sandboxes, virtual machines, and containerized desktops improve safety and reproducibility, especially when the agent is touching the real internet.[4] A modern CUA can chain steps across applications, but most production deployments insert confirmation gates for irreversible actions.[1][2]

## How does a computer-use agent work?

A typical computer-use agent has five conceptual layers, even when they collapse into a single neural network. **Perception** captures screen frames plus optional auxiliary signals (window hierarchy, accessibility tree, OCR output); most production CUAs work primarily from screenshots, with the visual model interpreting buttons, text fields, and menus.[3] **Reasoning and planning** uses a multimodal [LLM](/wiki/llm), often with [chain-of-thought](/wiki/chain_of_thought) or extended thinking; some implementations apply [reinforcement learning](/wiki/reinforcement_learning) fine-tuning on top of a base [foundation model](/wiki/foundation_model) to teach action sequences.[11] **Grounding** maps abstract plan tokens like "click the search button" to pixel coordinates or element IDs; this is where many failures originate.[12] **Action execution** sends clicks, keystrokes, scrolls, and window commands.[3] **Monitoring and recovery** detects pop-ups, navigation drift, or repeated clicks and triggers retries, backtracking, or human handoff.[4]
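The grounding layer can be illustrated with a minimal sketch. It assumes the perception layer has already produced a list of labeled bounding boxes (for example from an accessibility tree or OCR), and it swaps in plain fuzzy string matching where a production CUA would use its vision model. `UIElement` and `ground` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class UIElement:
    """A labeled on-screen region (hypothetical perception output)."""
    label: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


def ground(target: str, elements: list[UIElement],
           min_score: float = 0.6) -> Optional[tuple[int, int]]:
    """Return click coordinates for the element whose label best matches target.

    Returns None when nothing matches well enough, so the caller can
    re-observe or hand off instead of clicking a guessed location.
    """
    best, best_score = None, 0.0
    for el in elements:
        score = SequenceMatcher(None, target.lower(), el.label.lower()).ratio()
        if score > best_score:
            best, best_score = el, score
    if best is None or best_score < min_score:
        return None
    return best.center()


elements = [UIElement("Search", 100, 20, 80, 30), UIElement("Cancel", 200, 20, 80, 30)]
print(ground("search", elements))
```

Returning `None` below a similarity threshold, rather than the closest guess, is one way to avoid clicking where a button is merely expected to be.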

### Implementation approaches

| Approach | Description | Advantages | Limitations |
| --- | --- | --- | --- |
| Pure vision | Relies solely on visual interpretation of screen pixels | Platform-agnostic, works with any GUI | Struggles with dense, custom, or rapidly changing layouts |
| DOM-enhanced | Combines vision with web page structure analysis | Higher accuracy and lower token cost for web tasks | Limited to browser environments |
| Hybrid systems | Integrates multiple signals including OS APIs and accessibility trees | More accurate and reliable | Platform-specific, more engineering work |
| Container-based | Runs in isolated VMs or Docker containers | Stronger security and easier scaling | Adds infrastructure overhead and latency |
| Connector-first | Tries direct integrations before falling back to screen control | Fastest and most reliable when integrations exist | Coverage limited by available connectors |

The connector-first pattern, exemplified by Anthropic's [Claude](/wiki/claude) Cowork on macOS, is increasingly seen as the production-ready architecture, with pure pixel control as a fallback rather than a default.[13]
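A connector-first hierarchy amounts to an ordered fallback chain. The sketch below is illustrative only: the three handlers are hypothetical stand-ins, and nothing here reflects how any vendor actually implements its tiers.

```python
from typing import Callable, Optional

# Each handler returns a result string, or None if it cannot handle the task.
Handler = Callable[[str], Optional[str]]


def run_with_fallback(task: str, tiers: list[tuple[str, Handler]]) -> tuple[str, str]:
    """Try each tier in order and return (tier_name, result) from the first success."""
    for name, handler in tiers:
        try:
            result = handler(task)
        except Exception:
            result = None  # treat a failing tier as unavailable and fall through
        if result is not None:
            return name, result
    raise RuntimeError(f"no tier could handle task: {task!r}")


# Hypothetical stand-ins for real integrations.
def connector(task: str) -> Optional[str]:
    return "sent via mail connector" if "email" in task else None


def browser(task: str) -> Optional[str]:
    return "done in browser" if "website" in task else None


def screen(task: str) -> Optional[str]:
    return "done via screenshots and clicks"  # last resort: always attempts


TIERS = [("connector", connector), ("browser", browser), ("screen", screen)]

print(run_with_fallback("send an email", TIERS))
print(run_with_fallback("rename a layer in a design app", TIERS))
```

Keeping the tiers in a plain ordered list makes the preference explicit and easy to audit: screen control is reached only when every cheaper route has declined.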

## History

### Research roots (2017 to 2023)

Researchers had been chasing GUI-controlling agents for years before the modern wave. World of Bits (2017) had agents play through web micro-tasks; MiniWoB and MiniWoB++ provided controlled environments for reinforcement learning on synthetic web pages. The 2023 work that mattered most was SeeAct from Ohio State, which formalized the perception-grounding-action triple for [GPT-4](/wiki/gpt-4)V on real web pages, alongside Mind2Web, the same group's dataset of 2,350 cross-website tasks.[7][14]

[Adept](/wiki/adept) opened the commercial story with its **[ACT-1](/wiki/act_1)** demo on September 14, 2022, showing a transformer that drove a Chrome extension to fill forms and navigate Salesforce.[15] ACT-1 never shipped as a product, and Adept eventually pivoted, but the demo established the visual outline of a GUI agent.

### Modern era (2024 onward)

| Date | Milestone |
| --- | --- |
| November 2023 | OthersideAI's open-source Self-Operating Computer framework demonstrates a multimodal model controlling a desktop using only pixels and synthetic input.[8] |
| April 2024 | OSWorld is published on arXiv (Xie et al., NeurIPS 2024), establishing the standard 369-task benchmark across Ubuntu, Windows, and macOS.[12] |
| October 22, 2024 | [Anthropic](/wiki/anthropic) releases [computer use](/wiki/computer_use) in public beta with [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet), the first major commercial implementation. Initial OSWorld: 14.9% (22.0% with extra steps).[1] |
| December 11, 2024 | [Google DeepMind](/wiki/google_deepmind) unveils Project Mariner, a Chrome-extension prototype on [Gemini](/wiki/gemini) 2.0 Flash, for trusted testers.[16] |
| January 23, 2025 | [OpenAI](/wiki/openai) launches [Operator](/wiki/openai_operator) for ChatGPT Pro subscribers, powered by the [Computer-Using Agent (CUA)](/wiki/openai_cua) model on [GPT-4o](/wiki/gpt-4o) with reinforcement learning. OSWorld 38.1%, WebVoyager 87%.[2] |
| February 24, 2025 | Anthropic releases [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) with extended thinking, lifting OSWorld scores into the 30s.[17] |
| March 11, 2025 | OpenAI exposes CUA in the Responses API at $3/$12 per million input/output tokens.[18] |
| March 2025 | Microsoft documents "Computer Use (preview)" in Azure AI Foundry.[19] Simular AI ships Agent S2 at 34.5% OSWorld.[20] |
| May 22, 2025 | Anthropic launches the [Claude 4](/wiki/claude_opus_4) family with interleaved extended thinking and tool use.[21] |
| May 23, 2025 | OpenAI moves CUA to an [o3](/wiki/openai_o-series)-based variant ("Operator o3"), with large GAIA gains.[22] |
| August 26, 2025 | Anthropic ships Claude for Chrome for Max subscribers.[23] |
| September 2025 | Anthropic releases [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5), reaching 61.4% on OSWorld and 77.2% on SWE-bench Verified.[24] |
| October 7, 2025 | Google launches the Gemini 2.5 Computer Use model via Gemini API and Vertex AI.[25] |
| October 21, 2025 | OpenAI announces ChatGPT Atlas, a Chromium browser with built-in agent mode and "browser memories."[26] |
| December 2025 | Simular's Agent S becomes the first published system to exceed the OSWorld human baseline at 72.6%.[27] |
| February 5, 2026 | Anthropic releases [Claude Opus 4.6](/wiki/claude_opus_4_7) and [Claude Sonnet 4.6](/wiki/claude_sonnet_4_6); Opus 4.6 reaches 72.7% on OSWorld, effectively at the human baseline. [GPT-5](/wiki/gpt-5).2 is reported at 38.2% on the same harness.[28] |
| March 17, 2026 | Anthropic launches Dispatch for phone-to-Mac task assignment.[13] |
| March 23, 2026 | Anthropic introduces native Mac computer use in Claude Cowork and Claude Code, with a connector-first tooling hierarchy.[13] |

## How does the perception-action loop function?

A computer-use agent runs an iterative loop: observe (a screenshot, sometimes plus an accessibility snapshot or DOM excerpt), plan, act (`click(x, y)`, `type("text")`, `key("enter")`, `scroll(direction, amount)`, drag, or higher-level macros), observe again, and repeat until the task is complete, the agent gives up, or a confirmation is required.[3] The shape of the loop is similar across vendors; the differences live in how the perception is constructed and how the model is allowed to think between actions.
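The loop can be sketched in a few lines of Python. This is a toy under stated assumptions: `FakeDesktop` replaces a real sandbox, the "screenshot" is a dictionary rather than pixels, and `scripted_policy` stands in for the model call. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    kind: str                 # "click", "type", or "done"
    args: dict = field(default_factory=dict)


class FakeDesktop:
    """Toy stand-in for a sandboxed desktop: one text field and a Submit button."""

    def __init__(self) -> None:
        self.focused = False
        self.text = ""
        self.submitted = False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; a dict keeps the sketch runnable.
        return {"focused": self.focused, "text": self.text, "submitted": self.submitted}

    def execute(self, action: Action) -> None:
        if action.kind == "click" and action.args.get("target") == "field":
            self.focused = True
        elif action.kind == "click" and action.args.get("target") == "submit":
            self.submitted = bool(self.text)
        elif action.kind == "type" and self.focused:
            self.text += action.args["text"]


def scripted_policy(goal: str, observation: dict) -> Action:
    """Placeholder for the model call: picks the next action from the observation."""
    if observation["submitted"]:
        return Action("done")
    if not observation["focused"]:
        return Action("click", {"target": "field"})
    if observation["text"] != goal:
        return Action("type", {"text": goal})
    return Action("click", {"target": "submit"})


def run_agent(goal: str, env: FakeDesktop,
              policy: Callable[[str, dict], Action], max_steps: int = 10) -> bool:
    """Observe, plan, act, and repeat until done or the step budget runs out."""
    for _ in range(max_steps):
        observation = env.screenshot()        # observe
        action = policy(goal, observation)    # plan
        if action.kind == "done":
            return True
        env.execute(action)                   # act
    return False  # stop condition: step budget exhausted


print(run_agent("hello", FakeDesktop(), scripted_policy))
```

The step budget is the simplest stop condition; a production loop would also stop for confirmation requests and for repeated identical actions.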

Most public SDKs document the primitive set explicitly. Anthropic's tool versions started with `screenshot`, `left_click`, `type`, `key`, and `mouse_move`, then added `right_click`, `double_click`, `triple_click`, `left_click_drag`, `scroll`, `hold_key`, and `wait`.[29] OpenAI's Responses API tool exposes a similar action set in a different schema.
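Whatever the schema, an executor usually validates each emitted action before sending it to the desktop. The sketch below uses the action names listed above, but the required-field mapping and the `action` key are illustrative assumptions, not any vendor's actual schema.

```python
# Action names are those listed above; the required-field mapping is an
# illustrative assumption, not any vendor's actual schema.
REQUIRED_FIELDS = {
    "screenshot": set(),
    "left_click": {"coordinate"},
    "right_click": {"coordinate"},
    "double_click": {"coordinate"},
    "triple_click": {"coordinate"},
    "mouse_move": {"coordinate"},
    "left_click_drag": {"start_coordinate", "coordinate"},
    "scroll": {"coordinate", "direction", "amount"},
    "type": {"text"},
    "key": {"text"},
    "hold_key": {"text", "duration"},
    "wait": {"duration"},
}


def validate_action(action: dict, screen_size: tuple[int, int]) -> list[str]:
    """Return a list of problems; an empty list means the action may be executed."""
    name = action.get("action")
    if name not in REQUIRED_FIELDS:
        return [f"unknown action: {name!r}"]
    missing = REQUIRED_FIELDS[name] - set(action)
    problems = [f"missing field: {f}" for f in sorted(missing)]
    width, height = screen_size
    for key in ("coordinate", "start_coordinate"):
        if key in action:
            x, y = action[key]
            if not (0 <= x < width and 0 <= y < height):
                problems.append(f"{key} {(x, y)} is outside the {width}x{height} screen")
    return problems


print(validate_action({"action": "left_click", "coordinate": (2000, 20)}, (1280, 800)))
```

Rejecting out-of-bounds coordinates before execution gives the loop a cheap signal that grounding went wrong, without waiting for the next screenshot.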

The "flipbook" character of this loop matters. The agent does not see between screenshots, so transient elements like loading spinners, brief notifications, drag-and-drop ghosts, and focus rings can be invisible to it. Vendors are slowly moving toward video-stream perception, but discrete screenshots remain the norm for production systems in early 2026.
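The effect of discrete sampling can be made concrete with a small model. It assumes screenshots arrive at a fixed interval starting at t = 0 and that a transient element appears at a uniformly random moment; both are simplifications, and the function names are hypothetical.

```python
import math


def captured(start: float, duration: float, interval: float) -> bool:
    """True if a screenshot taken at t = 0, interval, 2*interval, ... falls inside
    the window [start, start + duration] during which the element is visible."""
    first_shot_at_or_after_start = math.ceil(start / interval) * interval
    return first_shot_at_or_after_start <= start + duration


def miss_probability(duration: float, interval: float) -> float:
    """Chance of missing an element that appears at a uniformly random time
    relative to the screenshot schedule."""
    return max(0.0, 1.0 - duration / interval)


# A 0.5 s toast notification against one screenshot every 2 s.
print(captured(start=1.0, duration=0.5, interval=2.0))
print(miss_probability(duration=0.5, interval=2.0))
```

Under this model, anything visible for less time than the gap between screenshots has a real chance of never being seen, which is why short-lived toasts and tooltips are a recurring failure case.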

## Which companies make computer-use agents?

The major commercial computer-use agents in 2026 differ in scope (browser-only or full desktop), execution model (cloud sandbox or user's machine), and target audience (developer API or consumer subscription).

| Vendor | Product | First release | Scope | Execution model | Headline benchmark |
| --- | --- | --- | --- | --- | --- |
| [Anthropic](/wiki/anthropic) | [Claude](/wiki/claude) Computer Use API | October 22, 2024 | Full desktop (developer VM) | Customer-hosted Docker or VM | OSWorld 72.7% (Opus 4.6)[28] |
| Anthropic | Claude Cowork (Mac) | March 23, 2026 | Native macOS, connector-first | Runs on user's Mac | Same as API[13] |
| Anthropic | Claude for Chrome | August 26, 2025 | Browser only | Chrome extension, user's machine | Not published[23] |
| [OpenAI](/wiki/openai) | [Operator](/wiki/openai_operator) / ChatGPT agent | Jan 23, 2025; merged Jul 17, 2025; standalone ended Aug 31, 2025 | Browser; agent mode adds terminal, files, connectors | Cloud-hosted virtual computer | OSWorld 38.1%; WebVoyager 87%[2] |
| OpenAI | Computer use (Responses API) | March 11, 2025 | Developer-defined | Developer-hosted | $3/$12 per million tokens[18] |
| OpenAI | ChatGPT Atlas | October 21, 2025 | Browser, sidebar plus agent mode | Native Chromium browser | Inherits CUA[26] |
| [Google](/wiki/google_deepmind) | Project Mariner | December 11, 2024 | Browser only | Cloud VMs, parallel tasks | WebVoyager ~83.5%[16] |
| Google | Gemini 2.5 Computer Use | October 7, 2025 | Browser only via API | Developer-hosted | Positioned as fastest in class[25] |
| [Microsoft](/wiki/microsoft) | Copilot Vision | 2025 | Sees user's screen, scoped automations | Local view, cloud reasoning | Not a full agent |
| Microsoft | Computer Use (Azure AI Foundry) | March 2025 | Windows 365, Azure Virtual Desktop | Cloud VMs | Inherits CUA[19] |

The coarse split: Anthropic gives developers a low-level desktop-control primitive plus a consumer experience through Claude Cowork on Mac. OpenAI ships polished consumer products (Operator, ChatGPT agent, Atlas) and exposes the same model through an API afterward. Google has bifurcated into the consumer-oriented Project Mariner inside Google AI Ultra and the API-oriented Gemini 2.5 Computer Use for developers. Microsoft piggybacks on OpenAI's CUA inside Azure and keeps Copilot Vision narrower, focused on "see what I see" rather than "do this for me."

### Anthropic Claude Computer Use

Anthropic released computer use as a public beta on October 22, 2024, alongside [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet).[1] The implementation exposes three tools: `computer` for screenshots and mouse/keyboard actions, `bash` for shell commands, and `text_editor` for file edits, with a Docker-based Streamlit reference UI. Training was deliberately narrow, covering a few simple programs (such as a calculator and a text editor) with no internet access, yet the skill generalized broadly. Teaching the model to count pixels accurately was a key technical hurdle for coordinate-level clicking.[1] At launch Claude 3.5 Sonnet scored 14.9% on OSWorld in the screenshot-only setting, well ahead of the next-best system at 7.8% but far below the 72.36% human baseline.[1]

The OSWorld progression over 16 months: 14.9% (Claude 3.5 Sonnet at launch), 22.0% (with extra steps), into the 30s ([Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) with extended thinking), 61.4% (Claude Sonnet 4.5, September 2025), and 72.7% (Claude Opus 4.6, February 2026), essentially matching the 72.36% human baseline.[28] The Mac computer use feature in March 2026 runs Claude on the user's actual Mac (not a VM) with per-application permissions, default-blocked sensitive apps (investment platforms, crypto tools), approval prompts for destructive actions, and prompt-injection scanning on screen contents. Pricing is bundled with Claude Pro and Max subscriptions.[13]

### OpenAI Operator and CUA

OpenAI's [Operator](/wiki/openai_operator) launched on January 23, 2025, as a research preview for ChatGPT Pro subscribers, with a ChatGPT-style chat window paired to an embedded browser. Safety levels included confirmation mode for irreversible actions, take-over mode for credentials and CAPTCHAs, watch mode for sensitive sites, and a hard-restricted task list (stock trades, banking transfers).[30] The underlying [Computer-Using Agent (CUA)](/wiki/openai_cua) model combined GPT-4o vision with reinforcement learning, scoring 38.1% on OSWorld, 58.1% on WebArena, and 87% on WebVoyager at launch.[2] In May 2025 OpenAI moved CUA onto the [o3](/wiki/openai_o-series) reasoning model, with big GAIA gains (62.2 versus 12.3).[22]

Operator was deprecated on August 31, 2025; its capabilities now live in three places. Inside ChatGPT, agent mode adds a terminal, a text browser, file outputs, and connectors to Gmail, Drive, GitHub, Slack, and others. Inside the OpenAI Responses API, the computer use tool lets developers pipe screenshots from any environment. Inside ChatGPT Atlas, the loop is integrated into a Chromium browser with sidebar and agent mode plus browser memories.[26] Pricing: standalone Operator required the $200/month Pro plan; ChatGPT agent allows 40 messages per month on Plus ($20), 40 on Team ($25/user), and 400 on Pro ($200); the API is metered at $3/$12 per million input/output tokens.[18]

### Google Project Mariner and Gemini computer use

**Project Mariner**, unveiled on December 11, 2024, started as a Chrome extension built on [Gemini](/wiki/gemini) 2.0 Flash, accessible only to trusted testers, and posted a state-of-the-art single-agent WebVoyager score of 83.5% at announcement.[16] By Google I/O 2025 it moved to cloud VMs, gained a "Teach and Repeat" feature for demonstrating workflows, and supported up to 10 parallel tasks. Access was tied to the Google AI Ultra plan at $249.99 per month.[16] The **Gemini 2.5 Computer Use** model launched on October 7, 2025, as a developer-facing capability through the Gemini API and Vertex AI, positioned as the fastest browser-control model in its class. The model is browser-focused; full desktop control was on the roadmap but not initially shipped.[25]

### Microsoft Copilot Vision and Recall

Microsoft's posture is more cautious. **Copilot Vision** lets the assistant see the user's current screen and answer questions about it, with action-taking handled through narrower scoped automations rather than a generalized browser agent. Inside Azure AI Foundry, Microsoft documented "Computer Use (preview)" in March 2025, mostly as a host for OpenAI's CUA model running against Windows 365 and Azure Virtual Desktop sessions for enterprise customers.[19][31] **Recall** is an adjacent product: a continuous screenshot indexer for Windows that lets the user search what they have done on their machine. Recall is not a CUA, but privacy researchers' concerns at the original 2024 announcement shaped Microsoft's broader caution. The **Mu** model, introduced in 2025, is a small on-device model embedded in Windows for tasks like answering questions about the Settings app, closer to a constrained on-device CUA than to a chatbot.

## Open-source frameworks

The open-source ecosystem has been lively, partly because researchers can iterate on the same OSWorld and WebArena suites the vendors publish against.

| Framework | First release | Notable features |
| --- | --- | --- |
| Self-Operating Computer | November 2023 | OthersideAI's pixel-only desktop controller, multi-model backend[8] |
| OpenInterpreter | 2024 | Killian Lucas's local code-running assistant with computer use mode[32] |
| LaVague | 2024 | Web agent framework with modular driver/world-model split[33] |
| SkyVern | 2024 | Browser workflow automation, HTML extraction, GPT-4 Vision[34] |
| Browser-Use | 2024 | Python library, 21,000+ GitHub stars, 89.1% on WebVoyager[35] |
| Multi-On | 2024 | Personal browser agent, Chrome extension |
| Agent S2 | March 2025 | Simular AI, hierarchical planning, 34.5% on OSWorld[20] |
| Agent S | December 2025 | Simular AI; first to exceed OSWorld human baseline at 72.6%[27] |
| Anthropic computer-use-demo | October 2024 | Reference Docker image with Streamlit UI[1] |
| openai/openai-cua-sample-app | March 2025 | OpenAI's reference app for the Responses API[18] |
| Cua Framework | 2025 | Docker-style packaging of OS images[36] |
| UFO Agents | 2025 | Microsoft's Windows-specific framework |
| BrowserGym | 2024 | Unified academic ecosystem, 5,000+ tasks[37] |

The gravitational center has shifted. In 2024 most projects tried to replicate vendor demos; by late 2025 and early 2026, the strongest open-source systems (Agent S, Browser-Use) were competitive with frontier proprietary models on specific benchmarks by combining stronger planners with better grounding modules and clever multi-call orchestration. Smaller commercial efforts like [Manus](/wiki/manus) layer scaffolding on top of frontier APIs. [Devin](/wiki/devin) from Cognition is the best-known startup agent focused on software engineering rather than general computer use. The [Rabbit R1](/wiki/rabbit_r1) launched its "Large Action Model" pitch in early 2024; the product was more brittle than the marketing implied, and the company shifted toward a conventional agent stack during 2025.

## How well do computer-use agents perform on benchmarks?

The interactive benchmarks for computer-use agents are not interchangeable: a model can score well on browser tasks and poorly on full desktop tasks, and vice versa.

### OSWorld

[OSWorld](/wiki/osworld) is the dominant benchmark for full computer-use evaluation. Created by Tianbao Xie and collaborators and accepted at NeurIPS 2024, it provides 369 real desktop tasks across Ubuntu, Windows, and macOS with execution-based evaluation; at publication the best model reached only 12.24% against a 72.36% human baseline.[12] OSWorld-Verified (July 2025) fixed 300+ issues in the original suite and added AWS-based parallelization for 50x faster evaluation.[38] OSWorld-Human adds annotated optimal trajectories, exposing how many extra steps agents take compared with humans.[39]

| Date | Model / Agent | OSWorld score | Notes |
| --- | --- | --- | --- |
| October 2024 | Claude 3.5 Sonnet (Anthropic) | 14.9% | Initial release; 22.0% with extra steps[1] |
| October 2024 | Previous best | ~7.8% | Pre-CUA frontier[1] |
| January 2025 | OpenAI CUA | 38.1% | Operator launch[2] |
| March 2025 | Agent S2 (Simular) | 34.5% | Open-source; 50-step config[20] |
| September 2025 | Claude Sonnet 4.5 | 61.4% | State-of-the-art at release[24] |
| December 2025 | Agent S (Simular) | 72.6% | First system to exceed human baseline[27] |
| February 2026 | Claude Sonnet 4.6 | 72.5% | Effectively at human baseline[28] |
| February 2026 | Claude Opus 4.6 | 72.7% | Highest published score[28] |
| February 2026 | GPT-5.2 | 38.2% | Reported on the same evaluation harness[28] |
| Reference | Human baseline | 72.36% | Established by OSWorld authors[12] |
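The headline figures in the table reduce to simple arithmetic. The short script below only recomputes numbers already given above (the launch score, later scores, and the human baseline); the helper names are hypothetical.

```python
HUMAN_BASELINE = 72.36  # percent, from the table above

scores = {
    "Claude 3.5 Sonnet (Oct 2024)": 14.9,
    "OpenAI CUA (Jan 2025)": 38.1,
    "Claude Sonnet 4.5 (Sep 2025)": 61.4,
    "Claude Opus 4.6 (Feb 2026)": 72.7,
}


def gap_to_human(score: float) -> float:
    """Percentage points below (negative) or above (positive) the human baseline."""
    return round(score - HUMAN_BASELINE, 2)


def fold_gain(old: float, new: float) -> float:
    """Ratio of the new score to the old one, rounded to two decimals."""
    return round(new / old, 2)


for name, score in scores.items():
    print(f"{name}: {score:.1f}% ({gap_to_human(score):+.2f} points vs. human)")

print("Gain from launch to Opus 4.6:", fold_gain(14.9, 72.7), "x")
```

The ratio comes out just under five, which is where the "roughly fivefold" description comes from, and the top score sits about a third of a percentage point above the human baseline.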

### WebArena and VisualWebArena

[WebArena](/wiki/webarena), from Carnegie Mellon, evaluates web agents on 812 tasks across self-hosted replicas of e-commerce, content management, social, and project tracking sites.[10] **VisualWebArena** extends the suite with 910 image-heavy tasks where visual reasoning is required to succeed.[40]

| Agent | WebArena | Notes |
| --- | --- | --- |
| OpenAI CUA | 58.1% | January 2025[2] |
| Google Project Mariner | ~52.4% | Reported 2025[9] |
| Browser-Use (open source) | ~51.2% | 2025[9] |
| Anthropic Claude (Sonnet 3.5 era) | ~45% | October 2024[9] |
| Human reference | ~78% | Author estimate[10] |

### WebVoyager

WebVoyager tests live websites including Amazon, GitHub, and Google Maps. The benchmark contains 643 tasks spanning 15 popular sites, with end-to-end success rates as the headline metric.[41]

| Agent | WebVoyager | Notes |
| --- | --- | --- |
| Browser-Use (open source) | 89.1% | Late 2024[35] |
| OpenAI CUA / ChatGPT agent | 87.0% | January 2025[2] |
| Google Project Mariner | 83.5% | December 2024 announcement[16] |
| Claude Sonnet 4.5 | ~56% | Sonnet 4.5 era[6] |
| Human reference | ~90% | Approximate[41] |

### Mind2Web and other suites

[Mind2Web](/wiki/mind2web) from Ohio State and the Allen Institute provides 2,350 tasks across 137 real websites, with action-sequence labels for offline evaluation.[7] WindowsAgentArena is Microsoft's benchmark for Windows-specific tasks. AndroidWorld evaluates mobile GUI agents (Agent S2 hits 50%). GAIA measures general assistant tasks; OpenAI's o3-based Operator scored 62.2 on GAIA in May 2025. macOSWorld introduced the first macOS benchmark with 202+ multilingual interactive tasks.[42] [AgentBench](/wiki/agentbench) rolls these into a unified evaluation framework.

The practical takeaway: on live-site suites such as WebVoyager the best systems now score in the high 80s, close to the roughly 90% human reference, while full desktop tasks have only just reached human parity at the top of the model lineup. Smaller and older models trail by 30 or more percentage points.

## What can computer-use agents do, and what are their limits?

As of early 2026, computer-use agents can navigate websites and complete multi-step checkouts when not blocked by bot detection, operate productivity software in unfamiliar configurations, adapt to layout changes where rule-based RPA would break, coordinate workflows across web tools and spreadsheets and mailers, compile research reports by visiting many sites, and serve as automated testers.

Persistent failure modes:

- **Latency.** A single screenshot-action-observation cycle takes a few seconds; tasks a human finishes in 30 seconds can take a CUA several minutes, and supervision often costs more time than doing the task manually.
- **Precision and grounding.** Small interface elements, drag-and-drop, custom calendar widgets, sliders, and rich-text editors remain failure-prone. OSWorld-Human showed even the best agents take 1.4x to 2.7x the human step count.[39]
- **OCR and rendering.** Text in images, low-contrast UI, and dense data tables get misread.
- **Transient elements.** Tooltips, popovers, and short-lived notifications fall between screenshots.
- **CAPTCHAs and anti-bot measures** on booking and ticketing platforms can block agents entirely.
- **Long-horizon coherence.** Cascading errors compound over hundreds of steps; self-correction is inconsistent.
- **Cost.** A complex 100+ step task with high-resolution screenshots can cost several dollars on token-metered APIs.
- **Hallucinated UI elements** sometimes lead the model to click where it expects a button to be.
- **Reliability gaps.** Anthropic, OpenAI, and Google all describe their consumer products as research previews. The technology suits supervised automation of routine tasks but not high-stakes unsupervised work.
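The cost item above can be made concrete with a back-of-the-envelope estimator. The default prices and the per-step tool overhead are the Sonnet 4.6 figures from this article's pricing table; the screenshot token count, the output tokens per step, and the policy of resending only the last few screenshots are assumptions chosen for illustration, and accumulated text history is ignored, so real bills will tend to be higher.

```python
def estimate_cost(steps: int,
                  screenshot_tokens: int,
                  overhead_tokens: int = 480,          # within the ~466-499 range cited
                  output_tokens_per_step: int = 100,   # assumption
                  keep_last_screenshots: int = 3,      # assumption: older frames are pruned
                  usd_per_m_input: float = 3.0,
                  usd_per_m_output: float = 15.0) -> float:
    """Rough USD cost of a run in which each request resends recent screenshots."""
    input_tokens = 0
    for step in range(1, steps + 1):
        screenshots_in_context = min(step, keep_last_screenshots)
        input_tokens += overhead_tokens + screenshots_in_context * screenshot_tokens
    output_tokens = steps * output_tokens_per_step
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Hypothetical 100-step task with 1,500-token screenshots.
print(f"${estimate_cost(steps=100, screenshot_tokens=1500):.2f}")
# Same task if every earlier screenshot stays in context.
print(f"${estimate_cost(steps=100, screenshot_tokens=1500, keep_last_screenshots=100):.2f}")
```

Comparing the two calls shows why context pruning matters: when every screenshot is kept, input tokens grow roughly with the square of the step count.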

## Are computer-use agents safe to use?

Computer-use agents introduce a new class of risk that traditional chatbots do not have: the model takes real actions on real systems, often with access to authenticated sessions.

### Prompt injection

[Prompt injection](/wiki/prompt_injection) through on-screen content is the most discussed and least solved attack. Because the model reads everything visible on screen, malicious text on a webpage, in an email, or in a document can hijack the agent. In a widely cited February 2025 demonstration, security researcher Johann Rehberger hid instructions in a GitHub issue. When [Operator](/wiki/openai_operator) visited the issue as part of an unrelated task, the embedded instructions caused the agent to navigate to an authenticated session, scrape personal data, and exfiltrate it through a URL parameter.[43] Similar demonstrations followed for Anthropic's Claude for Chrome shortly after its August 2025 launch and for OpenAI's Atlas browser within days of its October 2025 launch. The attack class is unsolved at the model layer. Defenses work at the wrapper layer: classifiers that scan screenshots for injection patterns, confirmation prompts for off-goal navigation, and watch modes that pause on sensitive sites. None is a complete fix.

### Other failure modes

Prompt injection is one instance of a broader problem: any content the agent reads can change its behavior, including ads, A/B test variants, hidden DOM nodes, fake error messages, and screen overlays. An agent driving a logged-in browser holds the keys to whatever sites the user has authenticated; successful injection can lead to data exfiltration, unauthorized purchases, or social-engineering attacks. Agents running on the user's own machine (Anthropic's Mac computer use, Claude for Chrome, OpenAI Atlas) raise the stakes further by exposing the local file system.

Researchers have shown that even hard-restricted task lists (no stock trades, no banking transfers) can be softened through role-play prompts or chained instructions similar to chat-based [jailbreaks](/wiki/jailbreak). Wrapper-layer enforcement catches most attempts in production, but the underlying model's willingness to take an action can be raised by context manipulation. Even without an attacker, an agent that misinterprets a task can submit the wrong form, send the wrong email, delete the wrong file, or modify settings in ways that are difficult to undo. Cascading errors over long horizons make this worse: a small grounding mistake at step 12 can produce a wildly off-course agent at step 40.

### Mitigation strategies

| Strategy | Description | Effectiveness |
| --- | --- | --- |
| Containerization | Run the agent in an isolated VM, container, or cloud sandbox | High; standard in OpenAI Operator and Project Mariner |
| Connector-first hierarchy | Prefer API connectors, then browser, then screen as fallback | High; Anthropic's March 2026 model[13] |
| Per-application permissions | Require user approval before each application | High; Anthropic Mac computer use[13] |
| Default app restrictions | Block sensitive apps (banking, brokerage, crypto) by default | High for catastrophic-loss prevention[13] |
| Confirmation gates | Pause for user approval before irreversible actions | High when user is attentive; OpenAI[30] |
| Take-over mode | Hand control to the user for credentials and CAPTCHAs | High; OpenAI Operator and ChatGPT agent[30] |
| Watch mode | Auto-pause on sensitive sites when user becomes inactive | High for high-risk sites; OpenAI[30] |
| Hard-restricted tasks | Refuse certain task classes regardless of user request | High; both OpenAI and Anthropic publish lists[30] |
| Prompt-injection classifiers | Flag screenshots that look like injection attempts | Medium; useful but bypassable[1] |
| Allowlist navigation | Limit the agent to approved domains | High for known workflows |
| Action telemetry and audit logs | Log every action with screenshots for review | High for incident response |
| Credential isolation | Never let the model see credentials | High when implemented; OpenAI take-over mode[30] |
| Filesystem isolation | Recommend a dedicated working directory | Medium; user discipline required[13] |

No single defense is sufficient. Production deployments use defense in depth, and even then prompt injection remains an open problem. The International AI Safety Report 2026 specifically called out computer-use agents as a category requiring careful governance.[44]
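Several of the wrapper-layer strategies in the table compose naturally. The sketch below combines allowlist navigation, a confirmation gate, and an audit log in one guard object; the domain list, the set of irreversible action names, and the `Guard` class are all hypothetical, and a real deployment would need far more than this.

```python
from dataclasses import dataclass, field
from typing import Callable
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "intranet.example.com"}        # hypothetical allowlist
IRREVERSIBLE = {"submit_payment", "send_email", "delete_file"}   # hypothetical action names


@dataclass
class Guard:
    confirm: Callable[[str], bool]            # asks the human; returns True to proceed
    audit_log: list = field(default_factory=list)

    def allow(self, action: str, url: str = "") -> bool:
        """Decide whether an action may run, and record the decision."""
        decision, reason = True, "ok"
        host = urlparse(url).hostname if url else None
        if url and host not in ALLOWED_DOMAINS:
            # Fails closed: unparseable URLs have no hostname and are blocked too.
            decision, reason = False, f"domain not on allowlist: {host}"
        elif action in IRREVERSIBLE and not self.confirm(action):
            decision, reason = False, "user declined confirmation"
        self.audit_log.append({"action": action, "url": url,
                               "allowed": decision, "reason": reason})
        return decision


guard = Guard(confirm=lambda action: False)   # a user who declines everything
print(guard.allow("navigate", "https://example.com/docs"))
print(guard.allow("navigate", "https://evil.test/?q=secret"))
print(guard.allow("send_email"))
print(guard.audit_log[-1]["reason"])
```

Because the guard sits outside the model, an injected instruction can change what the agent wants to do but not what the wrapper permits, which is the sense in which these defenses work at the wrapper layer rather than the model layer.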

## What are computer-use agents used for?

The genuinely useful applications in production are narrower than the marketing suggests but real.

**Web automation and form filling** is the bread-and-butter case: forms, multi-step processes on government and enterprise portals, and pulling routine information from sites that lack APIs. It works well on tested sites and poorly on the long tail of sites with obscure custom JavaScript. **Research and competitive analysis** plays to the strengths of agents that combine browsing with file output, such as ChatGPT agent's mix of visual browser, text browser, and terminal, or Claude Cowork's combination of local files and web sources. **Booking and shopping** drove the choice of launch partners for [Operator](/wiki/openai_operator) (OpenTable, Instacart, Uber, Booking.com); it works when the site is mainstream and the user is willing to confirm before checkout.

**Software QA and testing** uses the agent as an automated tester catching visual regressions and end-to-end flow breakage; [Replit](/wiki/replit) is the canonical example.[1] **Data entry and migration** between systems that lack integration is the natural successor to traditional [robotic process automation](/wiki/robotic_process_automation) and probably the largest enterprise use case by dollar value. **Software development and code review** is handled by engineering-focused agents like [Devin](/wiki/devin), ChatGPT agent's terminal, and Anthropic's [Claude Code](/wiki/claude_code). **Accessibility** has obvious value for users with motor disabilities but no flagship product yet because reliability has not been good enough. **Personal productivity** is the framing behind Anthropic's Dispatch (March 2026): assign tasks from a phone, run them on your Mac, retrieve the output later. The economic argument holds for batchable, recurring tasks where supervision can be amortized.

## How much do computer-use agents cost?

| Vendor | Product | Pricing model | Cost |
| --- | --- | --- | --- |
| Anthropic | Claude API computer use | Per-token | Sonnet 4.6: $3/$15 per million input/output tokens. Opus 4.6: $5/$25. Tool overhead ~466-499 tokens per step plus screenshot tokens[29] |
| Anthropic | Claude Cowork (Mac) | Subscription | Bundled with Claude Pro ($20/month) and Claude Max[13] |
| Anthropic | Claude for Chrome | Subscription | Max plan only at launch; expanded to Pro, Team, Enterprise by December 2025[23] |
| OpenAI | ChatGPT agent | Subscription with quota | Plus: 40/month at $20. Team: 40 at $25/user. Pro: 400 at $200[26] |
| OpenAI | Computer use tool (Responses API) | Per-token | $3/$12 per million input/output tokens[18] |
| OpenAI | Atlas browser | Freemium | Free sidebar; agent mode requires Plus or Pro[26] |
| Google | Project Mariner | Subscription | Bundled with Google AI Ultra at $249.99/month |
| Google | Gemini 2.5 Computer Use | Per-token | Gemini API and Vertex AI metered pricing[25] |
| Microsoft | Computer Use in Azure AI Foundry | Per-token | Inherits OpenAI CUA pricing through Azure[19] |

Vendors with consumer products charge subscriptions; vendors with API-first products charge tokens; everyone with both does both.
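
For the per-token rows, a rough cost estimate multiplies the tokens consumed per step by the number of steps. The sketch below is a back-of-the-envelope calculation under stated assumptions: the per-step token counts are hypothetical placeholders, and it ignores the growth of conversation history across steps, so real bills will usually be higher.

```python
def estimate_run_cost(
    steps: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated cost in dollars for one agent run.

    Prices are in dollars per million tokens, as in the table above.
    Treats every step as costing the same number of tokens, which is
    a simplifying assumption.
    """
    input_cost = steps * input_tokens_per_step * input_price_per_mtok / 1_000_000
    output_cost = steps * output_tokens_per_step * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical 50-step task at $3/$15 per million tokens, assuming
    # 2,000 input tokens per step (tool overhead plus one screenshot)
    # and 100 output tokens per step.
    cost = estimate_run_cost(50, 2_000, 100, 3.0, 15.0)
    print(f"${cost:.3f}")
```

With these placeholder numbers, input tokens account for most of the estimate, so screenshot size and step count tend to matter more than the length of the model's replies.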

## Future direction

Native API integrations will be preferred over screen scraping wherever they exist. Anthropic's connector-first hierarchy is the explicit version; ChatGPT agent's connector tab is the implicit one. Pure pixel control is becoming the fallback, not the default. Hybrid agents combining HTTP APIs, [Model Context Protocol](/wiki/mcp) servers, browser automation, and screen control will be the norm; single-modality agents are already losing on cost and latency. Video-stream perception will gradually replace static screenshots, addressing the flipbook problem.
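
The connector-first ordering can be pictured as a simple fallback chain. The sketch below is illustrative only and does not reflect any vendor's routing logic: the `Channel` enum, the `choose_channel` function, and the example service names are hypothetical.

```python
from enum import Enum


class Channel(Enum):
    CONNECTOR = "connector"  # native API or MCP-style integration
    BROWSER = "browser"      # browser automation
    SCREEN = "screen"        # screenshot-and-click control, the last resort


def choose_channel(service: str, connectors: set, web_apps: set) -> Channel:
    """Pick the cheapest channel that can reach `service`.

    `connectors` and `web_apps` are hypothetical registries of the
    services reachable through each channel.
    """
    if service in connectors:
        return Channel.CONNECTOR
    if service in web_apps:
        return Channel.BROWSER
    return Channel.SCREEN


if __name__ == "__main__":
    connectors = {"calendar", "email"}
    web_apps = {"email", "benefits-portal"}
    for service in ("email", "benefits-portal", "legacy-desktop-erp"):
        print(service, "->", choose_channel(service, connectors, web_apps).value)
```

Checking the channels in this order means screen control is used only when nothing more structured is available.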

Enterprise deployment will accelerate as agents pass the OSWorld human baseline, but adoption will lag the benchmark curve because of compliance, audit, and integration concerns. Sectors with heavy legacy GUI software (healthcare, insurance, utilities) are the obvious early targets. The prompt-injection problem will get worse before it gets better, because agents will gain more autonomy and connect to more sensitive systems. Mitigations will remain wrapper-layer fixes until model-level solutions emerge. Mobile-to-desktop delegation, exemplified by Dispatch, is likely to expand: "assign a task on the phone, run it on the desktop, retrieve the output" matches how people actually work. The "agentic coworker" framing is the dominant marketing pitch, with mixed empirical support so far; the genuine productivity wins are in narrow, batchable tasks, while broad personal-assistant scenarios remain aspirational.

## Impact and implications

Organizations that have deployed CUAs in narrow, well-supervised contexts report 30% to 50% reductions in manual workload on the targeted tasks, faster data processing, and lower error rates compared with manual entry.[45] Whether these wins generalize beyond pilot programs is the open question: most published case studies come from vendor partnerships, a sample that selects for the cases that worked.

CUAs are reshaping workflows by automating routine digital tasks. Displacement risk is concentrated in high-volume, repetitive GUI work like data entry, basic customer service, and simple QA testing. New roles in agent supervision, prompt engineering, and audit trail review have emerged. As with most AI deployments, the question is how quickly capability translates into adoption.

Deployment raises privacy questions when agents read screen contents that include other people's information, accountability questions when an agent takes an unintended action, and consent questions about the boundary between agent autonomy and user oversight. The same arguments are being relitigated case by case: Microsoft's Recall in 2024 raised continuous-screenshot privacy concerns, Operator in 2025 raised credential exposure concerns, and Mac computer use in 2026 raised local file access concerns. There is also a [cybersecurity](/wiki/cybersecurity) concern about CUAs as attack tools, where adversaries use the same primitives to automate fraud or intrusion at scale.

## Relationship to related articles

The [computer use](/wiki/computer_use) article is specifically about Anthropic's Claude Computer Use feature, including the API tool, the reference implementation, the Mac integration, and the OSWorld progression of Claude models. This article treats Anthropic's product as one implementation in a broader category that also includes [OpenAI Operator](/wiki/openai_operator) and [CUA](/wiki/openai_cua), Project Mariner, Gemini computer use, Microsoft Copilot Vision, and the open-source ecosystem. The [browser-use agent](/wiki/browser-use_agent) article covers CUAs restricted to a web browser, including the open-source Browser-Use library. The [OpenAI Operator](/wiki/openai_operator) article covers the standalone Operator product (now deprecated) and the ChatGPT agent and Atlas products that succeeded it.

## See also

- [Hermes Agent](/wiki/hermes_agent)
- [Computer use](/wiki/computer_use)
- [Browser-use agent](/wiki/browser-use_agent)
- [OpenAI Operator](/wiki/openai_operator)
- [Computer-Using Agent](/wiki/openai_cua)
- [OSWorld](/wiki/osworld)
- [WebArena](/wiki/webarena)
- [Mind2Web](/wiki/mind2web)
- [Model Context Protocol](/wiki/model_context_protocol)
- [Prompt injection](/wiki/prompt_injection)
- [Devin](/wiki/devin)
- [Manus](/wiki/manus)
- [Robotic process automation](/wiki/robotic_process_automation)
- [AI agent](/wiki/ai_agent)
- [ACT-1](/wiki/act_1)

## References

1. Anthropic. "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." October 22, 2024. https://www.anthropic.com/news/3-5-models-and-computer-use
2. OpenAI. "Computer-Using Agent." January 23, 2025. https://openai.com/index/computer-using-agent/
3. OpenAI. "Introducing Operator." January 23, 2025. https://openai.com/index/introducing-operator/
4. Microsoft. "Computer Use (preview) in Azure AI Foundry." March 2025. https://learn.microsoft.com/azure/ai-foundry/
5. Xie, T. et al. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." arXiv:2404.07972, NeurIPS 2024. https://arxiv.org/abs/2404.07972
6. OSWorld Leaderboard. https://os-world.github.io/
7. Deng, X. et al. "Mind2Web: Towards a Generalist Agent for the Web." NeurIPS 2023. https://github.com/OSU-NLP-Group/Mind2Web
8. OthersideAI. "Self-Operating Computer Framework." GitHub, November 2023. https://github.com/OthersideAI/self-operating-computer
9. WorkOS. "Anthropic's Computer Use versus OpenAI's Computer Using Agent (CUA)." 2025. https://workos.com/blog/anthropics-computer-use-versus-openais-computer-using-agent-cua
10. Zhou, S. et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024. https://webarena.dev/
11. Anthropic. "Developing a computer use model." October 22, 2024. https://www.anthropic.com/news/developing-computer-use
12. OSWorld project page. https://os-world.github.io/
13. Anthropic. "Put Claude to work on your computer." Claude Blog, March 2026. https://claude.com/blog/dispatch-and-computer-use
14. Zheng, B. et al. "GPT-4V(ision) is a Generalist Web Agent, if Grounded." SeeAct, 2023. https://osu-nlp-group.github.io/SeeAct/
15. Adept AI. "ACT-1: Transformer for Actions." September 14, 2022. https://www.adept.ai/blog/act-1
16. Google DeepMind. "Project Mariner." December 11, 2024. https://deepmind.google/models/project-mariner/
17. Anthropic. "Claude 3.7 Sonnet and Claude Code." February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet
18. OpenAI. "New tools for building agents." March 11, 2025. https://openai.com/index/new-tools-for-building-agents/
19. Microsoft Learn. "Computer Use in Azure AI Foundry." March 2025. https://learn.microsoft.com/azure/ai-foundry/
20. Simular AI. "Agent S2." March 2025. https://www.simular.ai/agent-s2
21. Anthropic. "Introducing Claude 4." May 22, 2025. https://www.anthropic.com/news/claude-4
22. OpenAI. "Addendum to OpenAI o3 and o4-mini system card: OpenAI o3 Operator." May 23, 2025. https://openai.com/index/o3-o4-mini-system-card-addendum-operator-o3/
23. TechCrunch. "Anthropic launches a Claude AI agent that lives in Chrome." August 26, 2025. https://techcrunch.com/2025/08/26/anthropic-launches-a-claude-ai-agent-that-lives-in-chrome/
24. Anthropic. "Introducing Claude Sonnet 4.5." September 2025. https://www.anthropic.com/news/claude-sonnet-4-5
25. Google. "Gemini 2.5 Computer Use." October 7, 2025. https://blog.google/technology/google-deepmind/gemini-computer-use/
26. OpenAI. "Introducing ChatGPT Atlas." October 21, 2025. https://openai.com/index/introducing-chatgpt-atlas/
27. Simular. "Simular's computer use agent outperforms humans." December 2025. https://www.simular.ai/articles/simulars-computer-use-agent-outperforms-humans
28. Anthropic. "Claude Opus 4.6." February 5, 2026. https://www.anthropic.com/claude/opus
29. Anthropic. "Computer use tool documentation." Claude API. https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool
30. OpenAI. "Operator System Card." January 23, 2025. https://cdn.openai.com/operator_system_card.pdf
31. Microsoft. "Copilot Vision overview." 2025. https://copilot.microsoft.com/
32. Open Interpreter. GitHub repository. https://github.com/OpenInterpreter/open-interpreter
33. LaVague. GitHub repository. https://github.com/lavague-ai/LaVague
34. Skyvern. GitHub repository. https://github.com/Skyvern-AI/skyvern
35. Browser-Use. GitHub repository. https://github.com/browser-use/browser-use
36. Cua Framework. GitHub repository. https://github.com/trycua/cua
37. ServiceNow Research. "BrowserGym." 2024. https://github.com/ServiceNow/BrowserGym
38. XLang AI. "Introducing OSWorld-Verified." July 2025. https://xlang.ai/blog/osworld-verified
39. OSWorld-Human. GitHub repository. https://github.com/xlang-ai/OSWorld
40. Koh, J. et al. "VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks." 2024. https://jykoh.com/vwa
41. He, H. et al. "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models." 2024. https://github.com/MinorJerry/WebVoyager
42. macOSWorld benchmark. 2025.
43. Johann Rehberger / Embrace The Red. "Prompt injection exploits in ChatGPT Operator." February 2025. https://embracethered.com/blog/posts/2025/chatgpt-operator-prompt-injection-exploits/
44. International AI Safety Report 2026. https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026
45. Anthropic. Computer use launch partner case studies. https://www.anthropic.com/news/3-5-models-and-computer-use