Computer-use agent
Last reviewed
May 8, 2026
Sources
45 citations
Review status
Source-backed
Revision
v4 · 5,986 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 8, 2026
Sources
45 citations
Review status
Source-backed
Revision
v4 · 5,986 words
Add missing citations, update stale details, or suggest a clearer explanation.
A computer-use agent (CUA) is a category of AI agent in artificial intelligence that performs tasks by directly operating a general-purpose computer's graphical user interface (GUI) the way a human does, by "seeing" the screen, moving a cursor, clicking, typing, and interacting with windows and applications.[1] Unlike tool-calling approaches that rely on predefined APIs, computer-use agents aim to generalize across arbitrary software by treating the computer itself as the universal interface.[2] They combine a large language model (typically multimodal) with computer vision and an action executor (a virtual machine, container, or browser sandbox), enabling end-to-end perception, reasoning, and control loops.[3][4]
This article describes the broader category across all vendors and architectures, including Anthropic Claude, OpenAI Operator and ChatGPT agent, Google Project Mariner and Gemini computer use, Microsoft Copilot Vision, and a growing open-source ecosystem. The narrower computer use article focuses on Anthropic's Claude Computer Use feature specifically. The best models reached the OSWorld human baseline of 72.36% by early 2026.[5][6]
The field has accumulated a small thicket of overlapping names. They mostly mean the same thing, with vendor-specific shading.
| Term | Origin | Notes |
|---|---|---|
| Computer-use agent (CUA) | Generic | The umbrella term used in this article and most academic work |
| GUI agent | Academic | Emphasizes graphical interface interaction, used in papers like SeeAct[7] |
| Desktop agent | Industry | Focuses on full desktop environments versus browser-only systems |
| Self-operating computer (SOC) | OthersideAI | Open-source framework with pixel observation and mouse/keyboard actions[8] |
| Computer-Using Agent (CUA) | OpenAI | The model behind Operator, now in the Responses API[2] |
| Browser-use agent (BUA) | Industry | A CUA restricted to a web browser, sometimes using DOM alongside vision[9] |
| Web agent | Mixed | Interchangeable with browser-use agent, especially in benchmarks like WebArena[10] |
Robotic process automation (RPA) tools are sometimes lumped in, but classic RPA is rule-based and brittle. The point of CUAs is that the model can read an unfamiliar interface without hand-coded selectors.
Vendors blur these lines on purpose, so being precise matters. Computer-use agents see screenshots (or a video stream of the screen) and emit mouse and keyboard actions, requiring no cooperation from the target application. API-driven agents call HTTP endpoints, gRPC services, or function calling tools; faster and more reliable, but only where the developer has wired up an integration. Browser automation tools like Playwright and Selenium drive a browser through DOM and accessibility trees, not pixels; reliable on standard sites, fragile on heavy JavaScript apps. Tool-use agents in the broad sense include CUAs as one specialization, alongside agents that call calculators, search APIs, or Model Context Protocol servers.
In practice, modern systems mix these. The ChatGPT agent released in July 2025 combines a visual browser, a text browser, a terminal, and direct connectors to services like Gmail, all coordinated by a single model. Anthropic's March 2026 Mac computer use feature prefers connectors over browser navigation over screen interaction, with screen control as the last resort. The hybrid pattern is now dominant.
A computer-use agent operates any on-screen software within OS and permission limits, rather than only API-integrated tools.[2] It runs a perception-action loop that reads screen state (image, sometimes plus accessibility data) and emits actions like click(x, y) and type("text") until the goal is reached or a stop condition fires.[3] Text-based planning is fused with visual grounding so the agent can locate UI elements that have no semantic anchor in code.[11] Sandboxes, virtual machines, and containerized desktops improve safety and reproducibility, especially when the agent is touching the real internet.[4] A modern CUA can chain steps across applications, but most production deployments insert confirmation gates for irreversible actions.[1][2]
A typical computer-use agent has five conceptual layers, even when they collapse into a single neural network. Perception captures screen frames plus optional auxiliary signals (window hierarchy, accessibility tree, OCR output); most production CUAs work primarily from screenshots, with the visual model interpreting buttons, text fields, and menus.[3] Reasoning and planning uses a multimodal LLM, often with chain-of-thought or extended thinking; some implementations apply reinforcement learning fine-tuning on top of a base foundation model to teach action sequences.[11] Grounding maps abstract plan tokens like "click the search button" to pixel coordinates or element IDs; this is where many failures originate.[12] Action execution sends clicks, keystrokes, scrolls, and window commands.[3] Monitoring and recovery detects pop-ups, navigation drift, or repeated clicks and triggers retries, backtracking, or human handoff.[4]
| Approach | Description | Advantages | Limitations |
|---|---|---|---|
| Pure vision | Relies solely on visual interpretation of screen pixels | Platform-agnostic, works with any GUI | Struggles with dense, custom, or rapidly changing layouts |
| DOM-enhanced | Combines vision with web page structure analysis | Higher accuracy and lower token cost for web tasks | Limited to browser environments |
| Hybrid systems | Integrates multiple signals including OS APIs and accessibility trees | More accurate and reliable | Platform-specific, more engineering work |
| Container-based | Runs in isolated VMs or Docker containers | Stronger security and easier scaling | Adds infrastructure overhead and latency |
| Connector-first | Tries direct integrations before falling back to screen control | Fastest and most reliable when integrations exist | Coverage limited by available connectors |
The connector-first pattern, exemplified by Anthropic's Claude Cowork on macOS, is increasingly seen as the production-ready architecture, with pure pixel control as a fallback rather than a default.[13]
Researchers had been chasing GUI-controlling agents for years before the modern wave. World of Bits (2017) had agents play through web micro-tasks; MiniWoB and MiniWoB++ provided controlled environments for reinforcement learning on synthetic web pages. The 2023 work that mattered most was SeeAct from Ohio State, which formalized the perception-grounding-action triple for GPT-4V on real web pages, alongside Mind2Web, the same group's dataset of 2,350 cross-website tasks.[7][14]
Adept opened the commercial story with its ACT-1 demo on September 14, 2022, showing a transformer that drove a Chrome extension to fill forms and navigate Salesforce.[15] ACT-1 never shipped as a product, and Adept eventually pivoted, but the demo established the visual outline of a GUI agent.
| Date | Milestone |
|---|---|
| November 2023 | OthersideAI's open-source Self-Operating Computer framework demonstrates a multimodal model controlling a desktop using only pixels and synthetic input.[8] |
| April 2024 | OSWorld is published on arXiv (Xie et al., NeurIPS 2024), establishing the standard 369-task benchmark across Ubuntu, Windows, and macOS.[12] |
| October 22, 2024 | Anthropic releases computer use in public beta with Claude 3.5 Sonnet, the first major commercial implementation. Initial OSWorld: 14.9% (22.0% with extra steps).[1] |
| December 11, 2024 | Google DeepMind unveils Project Mariner, a Chrome-extension prototype on Gemini 2.0 Flash, for trusted testers.[16] |
| January 23, 2025 | OpenAI launches Operator for ChatGPT Pro subscribers, powered by the Computer-Using Agent (CUA) model on GPT-4o with reinforcement learning. OSWorld 38.1%, WebVoyager 87%.[2] |
| February 24, 2025 | Anthropic releases Claude 3.7 Sonnet with extended thinking, lifting OSWorld scores into the 30s.[17] |
| March 11, 2025 | OpenAI exposes CUA in the Responses API at $3/$12 per million input/output tokens.[18] |
| March 2025 | Microsoft documents "Computer Use (preview)" in Azure AI Foundry.[19] Simular AI ships Agent S2 at 34.5% OSWorld.[20] |
| May 22, 2025 | Anthropic launches the Claude 4 family with interleaved extended thinking and tool use.[21] |
| May 23, 2025 | OpenAI moves CUA to an o3-based variant ("Operator o3"), with large GAIA gains.[22] |
| August 26, 2025 | Anthropic ships Claude for Chrome for Max subscribers.[23] |
| September 2025 | Anthropic releases Claude Sonnet 4.5, reaching 61.4% on OSWorld and 77.2% on SWE-bench Verified.[24] |
| October 7, 2025 | Google launches the Gemini 2.5 Computer Use model via Gemini API and Vertex AI.[25] |
| October 21, 2025 | OpenAI announces ChatGPT Atlas, a Chromium browser with built-in agent mode and "browser memories."[26] |
| December 2025 | Simular's Agent S becomes the first published system to exceed the OSWorld human baseline at 72.6%.[27] |
| February 5, 2026 | Anthropic releases Claude Opus 4.6 and Claude Sonnet 4.6; Opus 4.6 reaches 72.7% on OSWorld, effectively at the human baseline. GPT-5.2 is reported at 38.2% on the same harness.[28] |
| March 17, 2026 | Anthropic launches Dispatch for phone-to-Mac task assignment.[13] |
| March 23, 2026 | Anthropic introduces native Mac computer use in Claude Cowork and Claude Code, with a connector-first tooling hierarchy.[13] |
A computer-use agent runs an iterative loop: observe (a screenshot, sometimes plus an accessibility snapshot or DOM excerpt), plan, act (click(x, y), type("text"), key("enter"), scroll(direction, amount), drag, or higher-level macros), observe again, and repeat until the task is complete, the agent gives up, or a confirmation is required.[3] The shape of the loop is similar across vendors; the differences live in how the perception is constructed and how the model is allowed to think between actions.
Most public SDKs document the primitive set explicitly. Anthropic's tool versions started with screenshot, left_click, type, key, and mouse_move, then added right_click, double_click, triple_click, left_click_drag, scroll, hold_key, and wait.[29] OpenAI's Responses API tool exposes a similar action set in a different schema.
The "flipbook" character of this loop matters. The agent does not see between screenshots, so transient elements like loading spinners, brief notifications, drag-and-drop ghosts, and focus rings can be invisible to it. Vendors are slowly moving toward video-stream perception, but discrete screenshots remain the norm for production systems in early 2026.
The major commercial computer-use agents in 2026 differ in scope (browser-only or full desktop), execution model (cloud sandbox or user's machine), and target audience (developer API or consumer subscription).
| Vendor | Product | First release | Scope | Execution model | Headline benchmark |
|---|---|---|---|---|---|
| Anthropic | Claude Computer Use API | October 22, 2024 | Full desktop (developer VM) | Customer-hosted Docker or VM | OSWorld 72.7% (Opus 4.6)[28] |
| Anthropic | Claude Cowork (Mac) | March 23, 2026 | Native macOS, connector-first | Runs on user's Mac | Same as API[13] |
| Anthropic | Claude for Chrome | August 26, 2025 | Browser only | Chrome extension, user's machine | Not published[23] |
| OpenAI | Operator / ChatGPT agent | Jan 23, 2025; merged Jul 17, 2025; standalone end Aug 31, 2025 | Browser; agent mode adds terminal, files, connectors | Cloud-hosted virtual computer | OSWorld 38.1%; WebVoyager 87%[2] |
| OpenAI | Computer use (Responses API) | March 11, 2025 | Developer-defined | Developer-hosted | $3/$12 per million tokens[18] |
| OpenAI | ChatGPT Atlas | October 21, 2025 | Browser, sidebar plus agent mode | Native Chromium browser | Inherits CUA[26] |
| Project Mariner | December 11, 2024 | Browser only | Cloud VMs, parallel tasks | WebVoyager ~83.5%[16] | |
| Gemini 2.5 Computer Use | October 7, 2025 | Browser only via API | Developer-hosted | Positioned as fastest in class[25] | |
| Microsoft | Copilot Vision | 2025 | Sees user's screen, scoped automations | Local view, cloud reasoning | Not a full agent |
| Microsoft | Computer Use (Azure AI Foundry) | March 2025 | Windows 365, Azure Virtual Desktop | Cloud VMs | Inherits CUA[19] |
The coarse split: Anthropic gives developers a low-level desktop-control primitive plus a consumer experience through Claude Cowork on Mac. OpenAI ships polished consumer products (Operator, ChatGPT agent, Atlas) and exposes the same model through an API afterward. Google has bifurcated into the consumer-oriented Project Mariner inside Google AI Ultra and the API-oriented Gemini 2.5 Computer Use for developers. Microsoft piggybacks on OpenAI's CUA inside Azure and keeps Copilot Vision narrower, focused on "see what I see" rather than "do this for me."
Anthropic released computer use as a public beta on October 22, 2024, alongside Claude 3.5 Sonnet.[1] The implementation exposes three tools: computer for screenshots and mouse/keyboard actions, bash for shell commands, and text_editor for file edits, with a Docker-based Streamlit reference UI. Training was deliberately narrow: a few simple programs (calculator, text editor) with no internet, generalizing broadly. Teaching the model to count pixels accurately was a key technical hurdle for coordinate-level clicking.[1]
The OSWorld progression over 16 months: 14.9% (Claude 3.5 Sonnet at launch), 22.0% (with extra steps), into the 30s (Claude 3.7 Sonnet with extended thinking), 61.4% (Claude Sonnet 4.5, September 2025), and 72.7% (Claude Opus 4.6, February 2026), essentially matching the 72.36% human baseline.[28] The Mac computer use feature in March 2026 runs Claude on the user's actual Mac (not a VM) with per-application permissions, default-blocked sensitive apps (investment platforms, crypto tools), approval prompts for destructive actions, and prompt-injection scanning on screen contents. Pricing is bundled with Claude Pro and Max subscriptions.[13]
OpenAI's Operator launched on January 23, 2025, as a research preview for ChatGPT Pro subscribers, with a ChatGPT-style chat window paired to an embedded browser. Safety levels included confirmation mode for irreversible actions, take-over mode for credentials and CAPTCHAs, watch mode for sensitive sites, and a hard-restricted task list (stock trades, banking transfers).[30] The underlying Computer-Using Agent (CUA) model combined GPT-4o vision with reinforcement learning, scoring 38.1% on OSWorld, 58.1% on WebArena, and 87% on WebVoyager at launch.[2] In May 2025 OpenAI moved CUA onto the o3 reasoning model, with big GAIA gains (62.2 versus 12.3).[22]
Operator was deprecated on August 31, 2025; its capabilities now live in three places. Inside ChatGPT, agent mode adds a terminal, a text browser, file outputs, and connectors to Gmail, Drive, GitHub, Slack, and others. Inside the OpenAI Responses API, the computer use tool lets developers pipe screenshots from any environment. Inside ChatGPT Atlas, the loop is integrated into a Chromium browser with sidebar and agent mode plus browser memories.[26] Pricing: standalone Operator was $200/month (Pro); ChatGPT agent gives Plus 40 messages/month at $20, Team 40 at $25/user, Pro 400 at $200; the API meter is $3/$12 per million input/output tokens.[18]
Project Mariner, unveiled on December 11, 2024, started as a Chrome extension built on Gemini 2.0 Flash, accessible only to trusted testers. By Google I/O 2025 it moved to cloud VMs, gained a "Teach and Repeat" feature for demonstrating workflows, and supported up to 10 parallel tasks. Access was tied to the Google AI Ultra plan at $249.99 per month.[16] The Gemini 2.5 Computer Use model launched on October 7, 2025, as a developer-facing capability through the Gemini API and Vertex AI, positioned as the fastest browser-control model in its class. The model is browser-focused; full desktop control was on the roadmap but not initially shipped.[25]
Microsoft's posture is more cautious. Copilot Vision lets the assistant see the user's current screen and answer questions about it, with action-taking handled through narrower scoped automations rather than a generalized browser agent. Inside Azure AI Foundry, Microsoft documented "Computer Use (preview)" in March 2025, mostly as a host for OpenAI's CUA model running against Windows 365 and Azure Virtual Desktop sessions for enterprise customers.[19][31] Recall is an adjacent product: a continuous screenshot indexer for Windows that lets the user search what they have done on their machine. Recall is not a CUA, but privacy researchers' concerns at the original 2024 announcement shaped Microsoft's broader caution. The Mu model, introduced in 2025, is a small on-device model embedded in Windows for tasks like answering questions about the Settings app, closer to a constrained on-device CUA than to a chatbot.
The open-source ecosystem has been lively, partly because researchers can iterate on the same OSWorld and WebArena suites the vendors publish against.
| Framework | First release | Notable features |
|---|---|---|
| Self-Operating Computer | November 2023 | OthersideAI's pixel-only desktop controller, multi-model backend[8] |
| OpenInterpreter | 2024 | Killian Lucas's local code-running assistant with computer use mode[32] |
| LaVague | 2024 | Web agent framework with modular driver/world-model split[33] |
| SkyVern | 2024 | Browser workflow automation, HTML extraction, GPT-4 Vision[34] |
| Browser-Use | 2024 | Python library, 21,000+ GitHub stars, 89.1% on WebVoyager[35] |
| Multi-On | 2024 | Personal browser agent, Chrome extension |
| Agent S2 | March 2025 | Simular AI, hierarchical planning, 34.5% on OSWorld[20] |
| Agent S | December 2025 | Simular AI; first to exceed OSWorld human baseline at 72.6%[27] |
| Anthropic computer-use-demo | October 2024 | Reference Docker image with Streamlit UI[1] |
| openai/openai-cua-sample-app | March 2025 | OpenAI's reference app for the Responses API[18] |
| Cua Framework | 2025 | Docker-style packaging of OS images[36] |
| UFO Agents | 2025 | Microsoft's Windows-specific framework |
| BrowserGym | 2024 | Unified academic ecosystem, 5,000+ tasks[37] |
The gravitational center has shifted. In 2024 most projects tried to replicate vendor demos; by late 2025 and early 2026, the strongest open-source systems (Agent S, Browser-Use) were competitive with frontier proprietary models on specific benchmarks by combining stronger planners with better grounding modules and clever multi-call orchestration. Smaller commercial efforts like Manus layer scaffolding on top of frontier APIs. Devin from Cognition is the best-known startup focused on software engineering rather than general computer use. The Rabbit R1 launched its "Large Action Model" pitch in early 2024; the product was more brittle than the marketing implied, and the company shifted toward a conventional agent stack during 2025.
The interactive benchmarks for computer-use agents are not interchangeable: a model can score well on browser tasks and poorly on full desktop tasks, and vice versa.
OSWorld is the dominant benchmark for full computer-use evaluation. Created by researchers at the University of Hong Kong, Carnegie Mellon, and other institutions and accepted at NeurIPS 2024, it provides 369 real desktop tasks across Ubuntu, Windows, and macOS with execution-based evaluation.[12] OSWorld-Verified (July 2025) fixed 300+ issues in the original suite and added AWS-based parallelization for 50x faster evaluation.[38] OSWorld-Human adds annotated optimal trajectories, exposing how many extra steps agents take compared with humans.[39]
| Date | Model / Agent | OSWorld score | Notes |
|---|---|---|---|
| October 2024 | Claude 3.5 Sonnet (Anthropic) | 14.9% | Initial release; 22.0% with extra steps[1] |
| October 2024 | Previous best | ~7.8% | Pre-CUA frontier[1] |
| January 2025 | OpenAI CUA | 38.1% | Operator launch[2] |
| March 2025 | Agent S2 (Simular) | 34.5% | Open-source; 50-step config[20] |
| September 2025 | Claude Sonnet 4.5 | 61.4% | State-of-the-art at release[24] |
| December 2025 | Agent S (Simular) | 72.6% | First system to exceed human baseline[27] |
| February 2026 | Claude Sonnet 4.6 | 72.5% | Effectively at human baseline[28] |
| February 2026 | Claude Opus 4.6 | 72.7% | Highest published score[28] |
| February 2026 | GPT-5.2 | 38.2% | Reported on the same evaluation harness[28] |
| Reference | Human baseline | 72.36% | Established by OSWorld authors[12] |
WebArena, from Carnegie Mellon, evaluates web agents on 812 tasks across self-hosted replicas of e-commerce, content management, social, and project tracking sites.[10] VisualWebArena extends the suite with 910 image-heavy tasks where visual reasoning is required to succeed.[40]
| Agent | WebArena | Notes |
|---|---|---|
| OpenAI CUA | 58.1% | January 2025[2] |
| Google Project Mariner | ~52.4% | Reported 2025[9] |
| Browser-Use (open source) | ~51.2% | 2025[9] |
| Anthropic Claude (Sonnet 3.5 era) | ~45% | October 2024[9] |
| Human reference | ~78% | Author estimate[10] |
WebVoyager tests live websites including Amazon, GitHub, and Google Maps. The benchmark contains 643 tasks spanning 15 popular sites, with end-to-end success rates as the headline metric.[41]
| Agent | WebVoyager | Notes |
|---|---|---|
| Browser-Use (open source) | 89.1% | Late 2024[35] |
| OpenAI CUA / ChatGPT agent | 87.0% | January 2025[2] |
| Google Project Mariner | 83.5% | 2025[16] |
| Claude Sonnet 4.5 | ~56% | Sonnet 4.5 era[6] |
| Human reference | ~90% | Approximate[41] |
Mind2Web from Ohio State and the Allen Institute provides 2,350 tasks across 137 real websites, with action-sequence labels for offline evaluation.[7] WindowsAgentArena is Microsoft's benchmark for Windows-specific tasks. AndroidWorld evaluates mobile GUI agents (Agent S2 hits 50%). GAIA measures general assistant tasks; OpenAI's o3-based Operator scored 62.2 on GAIA in May 2025. macOSWorld introduced the first macOS benchmark with 202+ multilingual interactive tasks.[42] AgentBench rolls these into a unified evaluation framework.
The practical takeaway: web tasks are now in the high 80s and 90s for the best systems, while full desktop tasks have only just reached human parity at the top of the model lineup. Smaller and older models are 30+ percentage points behind.
As of early 2026, computer-use agents can navigate websites and complete multi-step checkouts when not blocked by bot detection, operate productivity software in unfamiliar configurations, adapt to layout changes where rule-based RPA would break, coordinate workflows across web tools and spreadsheets and mailers, compile research reports by visiting many sites, and serve as automated testers.
Persistent failure modes:
Computer-use agents introduce a new class of risk that traditional chatbots do not have: the model takes real actions on real systems, often with access to authenticated sessions.
Prompt injection through on-screen content is the most discussed and least solved attack. Because the model reads everything visible on screen, malicious text on a webpage, in an email, or in a document can hijack the agent. In a widely cited February 2025 demonstration, security researcher Johann Rehberger hid instructions in a GitHub issue. When Operator visited the issue as part of an unrelated task, the embedded instructions caused the agent to navigate to an authenticated session, scrape personal data, and exfiltrate it through a URL parameter.[43] Similar demonstrations followed for Anthropic's Claude for Chrome shortly after its August 2025 launch and for OpenAI's Atlas browser within days of its October 2025 launch. The attack class is unsolved at the model layer. Defenses work at the wrapper layer: classifiers that scan screenshots for injection patterns, confirmation prompts for off-goal navigation, and watch modes that pause on sensitive sites. None is a complete fix.
A broader version of prompt injection is that any content the agent reads can change its behavior, including ads, A/B test variants, hidden DOM nodes, fake error messages, and screen overlays. An agent driving a logged-in browser holds the keys to whatever sites the user has authenticated; successful injection can lead to data exfiltration, unauthorized purchases, or social-engineering attacks. Agents running on the user's own machine (Anthropic's Mac computer use, Claude for Chrome, OpenAI Atlas) raise the stakes further by exposing the local file system.
Researchers have shown that even hard-restricted task lists (no stock trades, no banking transfers) can be softened through role-play prompts or chained instructions similar to chat-based jailbreaks. Wrapper-layer enforcement catches most attempts in production, but the underlying model's willingness to take an action can be raised by context manipulation. Even without an attacker, an agent that misinterprets a task can submit the wrong form, send the wrong email, delete the wrong file, or modify settings in ways that are difficult to undo. Cascading errors over long horizons make this worse: a small grounding mistake at step 12 can produce a wildly off-course agent at step 40.
| Strategy | Description | Effectiveness |
|---|---|---|
| Containerization | Run the agent in an isolated VM, container, or cloud sandbox | High; standard in OpenAI Operator and Project Mariner |
| Connector-first hierarchy | Prefer API connectors, then browser, then screen as fallback | High; Anthropic's March 2026 model[13] |
| Per-application permissions | Require user approval before each application | High; Anthropic Mac computer use[13] |
| Default app restrictions | Block sensitive apps (banking, brokerage, crypto) by default | High for catastrophic-loss prevention[13] |
| Confirmation gates | Pause for user approval before irreversible actions | High when user is attentive; OpenAI[30] |
| Take-over mode | Hand control to the user for credentials and CAPTCHAs | High; OpenAI Operator and ChatGPT agent[30] |
| Watch mode | Auto-pause on sensitive sites when user becomes inactive | High for high-risk sites; OpenAI[30] |
| Hard-restricted tasks | Refuse certain task classes regardless of user request | High; both OpenAI and Anthropic publish lists[30] |
| Prompt-injection classifiers | Flag screenshots that look like injection attempts | Medium; useful but bypassable[1] |
| Allowlist navigation | Limit the agent to approved domains | High for known workflows |
| Action telemetry and audit logs | Log every action with screenshots for review | High for incident response |
| Credential isolation | Never let the model see credentials | High when implemented; OpenAI take-over mode[30] |
| Filesystem isolation | Recommend a dedicated working directory | Medium; user discipline required[13] |
No single defense is sufficient. Production deployments use defense in depth, and even then prompt injection remains an open problem. The International AI Safety Report 2026 specifically called out computer-use agents as a category requiring careful governance.[44]
The genuinely useful applications in production are narrower than the marketing suggests but real.
Web automation and form filling is the bread-and-butter case: forms, multi-step processes on government and enterprise portals, and pulling routine information from sites that lack APIs. Works well on tested sites, poorly on the long tail of obscure custom JavaScript. Research and competitive analysis plays to the strengths of agents that combine browsing with file output, like ChatGPT agent's visual browser, text browser, and terminal mix, or Claude Cowork's local files plus web sources. Booking and shopping drove the launch partners for Operator (OpenTable, Instacart, Uber, Booking.com); works when the site is mainstream and the user is willing to confirm before checkout.
Software QA and testing uses the agent as an automated tester catching visual regressions and end-to-end flow breakage; Replit is the canonical example.[1] Data entry and migration between systems that lack integration is the natural successor to traditional robotic process automation and probably the largest enterprise use case by dollar value. Software development and code review is handled by engineering-focused agents like Devin, ChatGPT agent's terminal, and Anthropic's Claude Code. Accessibility has obvious value for users with motor disabilities but no flagship product yet because reliability has not been good enough. Personal productivity is the framing behind Anthropic's Dispatch (March 2026): assign tasks from a phone, run them on your Mac, retrieve the output later. The economic argument holds for batchable, recurring tasks where supervision can be amortized.
| Vendor | Product | Pricing model | Cost |
|---|---|---|---|
| Anthropic | Claude API computer use | Per-token | Sonnet 4.6: $3/$15 per million input/output tokens. Opus 4.6: $5/$25. Tool overhead ~466-499 tokens per step plus screenshot tokens[29] |
| Anthropic | Claude Cowork (Mac) | Subscription | Bundled with Claude Pro ($20/month) and Claude Max[13] |
| Anthropic | Claude for Chrome | Subscription | Max plan only at launch; expanded to Pro, Team, Enterprise by December 2025[23] |
| OpenAI | ChatGPT agent | Subscription with quota | Plus: 40/month at $20. Team: 40 at $25/user. Pro: 400 at $200[26] |
| OpenAI | Computer use tool (Responses API) | Per-token | $3/$12 per million input/output tokens[18] |
| OpenAI | Atlas browser | Freemium | Free sidebar; agent mode requires Plus or Pro[26] |
| Project Mariner | Subscription | Bundled with Google AI Ultra at $249.99/month | |
| Gemini 2.5 Computer Use | Per-token | Gemini API and Vertex AI metered pricing[25] | |
| Microsoft | Computer Use in Azure AI Foundry | Per-token | Inherits OpenAI CUA pricing through Azure[19] |
Vendors with consumer products charge subscriptions; vendors with API-first products charge tokens; everyone with both does both.
Native API integrations will be preferred over screen scraping wherever they exist. Anthropic's connector-first hierarchy is the explicit version; ChatGPT agent's connector tab is the implicit one. Pure pixel control is becoming the fallback, not the default. Hybrid agents combining HTTP APIs, Model Context Protocol servers, browser automation, and screen control will be the norm; single-modality agents are already losing on cost and latency. Video-stream perception will gradually replace static screenshots, addressing the flipbook problem.
Enterprise deployment will accelerate as the OSWorld ceiling is crossed, but adoption will lag the benchmark curve because of compliance, audit, and integration concerns. Sectors with heavy legacy GUI software (healthcare, insurance, utilities) are the obvious early targets. The prompt-injection problem will get worse before it gets better, because agents will gain more autonomy and connect to more sensitive systems. Mitigations will remain wrapper-layer fixes until model-level solutions emerge. Mobile-to-desktop delegation, exemplified by Dispatch, is likely to expand: "assign a task on the phone, run it on the desktop, retrieve the output" matches how people actually work. The "agentic coworker" framing is the dominant marketing pitch, with mixed empirical support so far; the genuine productivity wins are in narrow, batchable tasks, while broad personal-assistant scenarios remain aspirational.
Organizations that have deployed CUAs in narrow, well-supervised contexts report 30 to 50% reductions in manual workload on the targeted tasks, faster data processing, and lower error rates compared with manual entry.[45] Whether these wins generalize beyond pilot programs is the open question; most published case studies come from vendor partnerships, which selects for the cases that worked.
CUAs are reshaping workflows by automating routine digital tasks. Displacement risk is concentrated in high-volume, repetitive GUI work like data entry, basic customer service, and simple QA testing. New roles in agent supervision, prompt engineering, and audit trail review have emerged. As with most AI deployments, the question is how quickly capability translates into adoption.
Deployment raises privacy questions when agents read screen contents that include other people's information, accountability questions when an agent takes an unintended action, and consent questions about the boundary between agent autonomy and user oversight. The same arguments are being relitigated case by case: Recall in 2024 raised continuous-screenshot privacy concerns, Operator in 2025 raised credential exposure concerns, and Mac computer use in 2026 raised local file access concerns. There is also a cybersecurity concern about CUAs as attack tools, where adversaries use the same primitives to automate fraud or intrusion at scale.
The computer use article is specifically about Anthropic's Claude Computer Use feature, including the API tool, the reference implementation, the Mac integration, and the OSWorld progression of Claude models. This article treats Anthropic's product as one implementation in a broader category that also includes OpenAI Operator and CUA, Project Mariner, Gemini computer use, Microsoft Copilot Vision, and the open-source ecosystem. The browser-use agent article covers CUAs restricted to a web browser, including the open-source Browser-Use library. The openai_operator article covers the standalone Operator product (now deprecated) and the ChatGPT agent and Atlas products that succeeded it.