A computer-use agent (CUA) is a type of software agent in artificial intelligence that performs tasks by directly operating a general-purpose computer's graphical user interface (GUI) the way a human does: "seeing" the screen, moving a cursor, clicking, typing, and interacting with windows and applications.[1] Unlike tool-calling approaches that rely on predefined APIs, CUAs aim to generalize across arbitrary software by treating the computer itself as the universal interface.[2] They represent an advancement in AI agent technology, combining computer vision, natural language processing, and reinforcement learning to handle open-ended tasks.[3]
CUAs typically combine a large language model (LLM) with computer vision and an action executor (for example, a virtual machine or a remote desktop session), enabling end-to-end perception, reasoning, and control loops.[4][5] Early deployments are experimental and can be error-prone, but rapid progress since 2022 has made "computer use" a central paradigm in building autonomous agents for software workflows.[1][6]
Typical CUA designs follow one of several architectural approaches:
| Approach | Description | Advantages | Limitations |
|---|---|---|---|
| Pure Vision | Relies solely on visual interpretation of screen pixels | Platform-agnostic, works with any GUI | May struggle with complex layouts |
| DOM-Enhanced | Combines vision with web page structure analysis | Higher accuracy for web tasks | Limited to browser environments |
| Hybrid Systems | Integrates multiple data sources including OS APIs | Most accurate and reliable | Platform-specific implementations required |
| Container-Based | Runs in isolated virtual environments | Enhanced security and scalability | Additional infrastructure overhead |
| Year | Milestone |
|---|---|
| 2022 | Adept introduces ACT-1, a transformer trained to use digital tools via a Chrome extension, an early demonstration of end-to-end GUI action from model outputs[10] |
| November 2023 | The open-source Self-Operating Computer framework by OthersideAI shows a multimodal model operating a desktop using the same inputs/outputs as a human (pixels and mouse/keyboard)[7] |
| 2024 | Frameworks like LaVague and Skyvern emerged, combining LLMs with vision for web agent automation[11] |
| October 22, 2024 | Anthropic publicly announces "computer use" in beta for Claude 3.5 Sonnet, enabling on-screen control (look, move cursor, click, type) via API, marking the first major commercial implementation[1] |
| January 23, 2025 | OpenAI publishes a formal description of a Computer-Using Agent and provides a documented Computer Use tool that runs a continuous observe-plan-act loop, introduced as part of "Operator" research preview[2][4] |
| February 24, 2025 | Anthropic releases Claude 3.7 Sonnet with improved computer use capabilities and extended thinking mode[12] |
| March 2025 | Azure OpenAI documents "Computer Use (preview)" for building agents that interact with computer UIs; major cloud providers publish prescriptive guidance patterns[6][5] |
| March 2025 | Simular AI releases Agent S2, an open-source modular framework outperforming proprietary CUAs on benchmarks like OSWorld[13] |
| September 2025 | Anthropic releases Claude Sonnet 4.5, achieving state-of-the-art 61.4% success rate on OSWorld benchmark and 77.2% on SWE-bench Verified[14] |
CUAs operate through an iterative loop of perception, reasoning, and action. Many implementations expose a loop in which the agent captures a screenshot, reasons about the next step toward the goal, and emits a low-level action such as click(x,y) or type("text"), repeating until the goal is reached or a stop condition fires.[4]
Public SDKs document low-level actions such as click(x,y), type(text), and clipboard/file operations, executed by a host process controlling a VM or remote session. This loop allows CUAs to handle tasks requiring dozens of steps, such as form filling or software testing.[16] Because perception is screenshot-based rather than continuous video, scrolling, zooming, and short-lived UI elements remain challenging.[15]
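The observe-plan-act loop described above can be sketched in a few lines. `Model` and `Executor` here are hypothetical stand-ins for an LLM client and a VM or remote-session controller; real SDKs expose analogous primitives.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_agent(model, executor, goal, max_steps=50):
    """Iterate observe -> plan -> act until the model signals completion
    or the step budget (a stop condition) is exhausted."""
    for _ in range(max_steps):
        screenshot = executor.screenshot()      # observe: capture pixels
        action = model.plan(goal, screenshot)   # plan: model proposes next action
        if action.kind == "done":               # stop condition fired
            return True
        elif action.kind == "click":
            executor.click(action.x, action.y)  # act: low-level click(x, y)
        elif action.kind == "type":
            executor.type(action.text)          # act: low-level type(text)
    return False                                # step budget exhausted
```

The `max_steps` cap is one common stop condition; production agents also stop on errors, user interrupts, or model refusals.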
Anthropic released computer use capabilities in beta with Claude 3.5 Sonnet in October 2024, allowing developers to direct Claude to use computers through the Anthropic API.[1] The implementation includes specialized tools: a computer tool for screenshots and mouse/keyboard control, a text-editor tool, and a bash (shell) tool.
Training focused on simple software like calculators and text editors, with restricted internet access for safety. Anthropic's research emphasized pixel counting for accuracy, with generalization from limited examples.[15] Early adopters included companies like Asana, Canva, and DoorDash, using it for multi-step automation.[1]
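A request using these tools can be sketched as follows; the snippet only constructs the payload, since actually sending it requires the `anthropic` SDK and an API key. Tool type strings follow the October 2024 beta naming; the user task is invented for illustration.

```python
# Sketch of a computer-use request for the October 2024 beta.
# Sending it would look roughly like:
#   client.beta.messages.create(**request, betas=["computer-use-2024-10-22"])

request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "tools": [
        {
            "type": "computer_20241022",       # screenshots, mouse, keyboard
            "name": "computer",
            "display_width_px": 1024,          # virtual display resolution
            "display_height_px": 768,
        },
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"},
    ],
    "messages": [
        {"role": "user", "content": "Open the calculator and compute 2 + 2."}
    ],
}
```

The host process then executes each tool call the model emits (e.g. a screenshot or a click) and returns the result, continuing the loop.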
Claude Sonnet 4.5, released in September 2025, represents the current state-of-the-art with a 61.4% success rate on OSWorld benchmark, a significant improvement from the 14.9% achieved by the October 2024 version.[14]
OpenAI introduced the Computer-Using Agent (CUA) in January 2025 as part of its "Operator" research preview, built on GPT-4o's vision capabilities with advanced reasoning.[2][18] At release, the CUA model achieved a 38.1% success rate on the OSWorld benchmark, along with strong results on the web-navigation benchmarks WebArena and WebVoyager.
The implementation uses reinforcement learning for reasoning and handles GUI interactions via screenshots. It is integrated into Operator and requires user confirmation for sensitive actions.[18]
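The confirmation requirement for sensitive actions can be illustrated with a simple gating wrapper. All names here are hypothetical; commercial implementations surface the confirmation in the product UI rather than a callback.

```python
# Hypothetical illustration of a user-confirmation gate: sensitive
# actions run only after an explicit approval callback returns True.

SENSITIVE = {"submit_payment", "delete_file", "send_email"}

def execute_with_gate(action, params, do_action, confirm):
    """Run `do_action` directly for routine actions; require `confirm`
    to approve anything in the SENSITIVE set first."""
    if action in SENSITIVE and not confirm(action, params):
        return "blocked"
    return do_action(action, params)
```

Routine actions (clicks, typing) pass through untouched, so the gate adds friction only where the risk is.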
Microsoft announced Computer-Using Agent capabilities in Azure AI Foundry in March 2025, integrated with the Responses API. The implementation focuses on enterprise integration with Windows 365 and Azure Virtual Desktop.[6]
| Framework | Description | Release Date | Key Features |
|---|---|---|---|
| Self-Operating Computer | Vision-based computer control | November 2023 | Screenshot analysis, basic automation, multimodal control[7] |
| OpenInterpreter | General-purpose control with Python | 2024 | Extensible, LLM integration[11] |
| Agent S2 | Modular framework for GUIs | March 2025 | Hierarchical planning, 34.5% OSWorld score[13] |
| LaVague | Web agent framework | 2024 | Modular architecture, vision + LLMs[11] |
| Skyvern | Browser workflow automation | 2024 | HTML extraction, task automation[11] |
| Cua Framework | Containerized environments for CUAs | 2025 | Docker-like deployment, OS virtualization[19] |
| Browser-Use | Web-specific agent | 2025 | 89.1% WebVoyager success rate, DOM + vision[20] |
| UFO Agents | Windows-specific control | 2025 | Windows API integration, enhanced accuracy[21] |
| AutoGen | Distributed agent framework | 2024 | Multi-agent coordination[11] |
| NatBot | Browser-specific automation | 2024 | GPT-4 Vision integration[11] |
Researchers have proposed interactive benchmarks to evaluate CUAs in realistic settings.
OSWorld is a comprehensive benchmark for evaluating multimodal agents in real computer environments across Ubuntu, Windows, and macOS. It includes 369 tasks involving real web and desktop applications, file I/O operations, and cross-application workflows.[9]
| Model | Success Rate | Multi-step Score | Notes |
|---|---|---|---|
| Human Performance | 72.4% | N/A | Baseline human capability |
| Claude Sonnet 4.5 | 61.4% | N/A | Current state-of-the-art (September 2025)[14] |
| OpenAI CUA | 38.1% | N/A | January 2025 release[2] |
| Agent S2 | 34.5% | N/A | 50-step configuration[13] |
| Claude 3.5 Sonnet | 14.9% | 22.0% | October 2024 version[1] |
| Previous Best (2024) | 12.0% | N/A | Prior to CUA models |
OSWorld-Human provides annotated trajectories with human optimal steps. Across 16 agents tested, even the best took 1.4–2.7× the human step count on average, indicating significant efficiency gaps.[22]
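The efficiency gap OSWorld-Human reports is a ratio of agent steps to human-optimal steps; computing it is straightforward. The trajectories below are illustrative, not benchmark data.

```python
def step_efficiency(agent_steps, human_steps):
    """Ratio of an agent's step count to the human-optimal count.
    1.0 means human-level efficiency; OSWorld-Human reports averages
    of 1.4-2.7x even for the best agents."""
    if human_steps <= 0:
        raise ValueError("human step count must be positive")
    return agent_steps / human_steps

# Illustrative per-task step counts (not benchmark data).
agent = [14, 27, 9]
human = [10, 10, 6]
ratios = [step_efficiency(a, h) for a, h in zip(agent, human)]
avg = sum(ratios) / len(ratios)   # mean inefficiency across tasks
```

Averaging per-task ratios (rather than dividing total steps) keeps short and long tasks equally weighted, which is one plausible reading of the reported averages.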
WebArena evaluates web browsing agents using self-hosted open-source websites that simulate real-world scenarios in e-commerce, content management systems, and social platforms. It tests complex, multi-step web interactions offline.[23]
WebVoyager tests agent performance on live websites including Amazon, GitHub, and Google Maps, evaluating real-world web navigation and task completion capabilities. The benchmark includes 586 diverse web tasks.[24]
macOSWorld introduces the first comprehensive macOS benchmark with 202+ multilingual interactive tasks. It reports distinct performance tiers with >30% success for some proprietary CUAs in its evaluations.[25]
AndroidWorld extends evaluation to mobile GUIs, testing agents on interactive tasks in a live Android environment.
CUAs automate repetitive tasks in domains such as data entry, form filling, software testing, and customer-support workflows.
Companies like DoorDash use CUAs for internal processes requiring hundreds of steps, while Replit uses Anthropic's tool for code evaluation.[1]
CUAs are susceptible to prompt injection attacks where malicious instructions embedded in content can override intended behavior. This vulnerability is particularly concerning as CUAs can execute actions on behalf of users.[17]
| Strategy | Description | Effectiveness |
|---|---|---|
| Containerization | Run CUAs in isolated virtual machines or Docker containers | High for system isolation |
| Least Privilege | Restrict CUA access to minimum necessary resources | Medium-High for damage limitation |
| Human Oversight | Require approval for sensitive operations | High for critical actions |
| Input Validation | Filter and sanitize user inputs and external content | Medium, not foolproof |
| Monitoring | Track CUA actions and detect anomalous behavior | High for incident response |
| Classifiers | Detect harmful content and restrict actions | Medium-High for known threats |
| Blocklists | Prevent access to sensitive domains/applications | High for defined restrictions |
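Two of the mitigations above, blocklists and least privilege, compose naturally into a single policy check before any action executes. The domains and action names below are hypothetical.

```python
from urllib.parse import urlparse

# Sketch of two mitigations from the table above (names hypothetical):
# a least-privilege action allowlist and a domain blocklist.

BLOCKED_DOMAINS = {"bank.example.com", "mail.example.com"}
ALLOWED_ACTIONS = {"screenshot", "click", "type", "scroll"}

def check_action(action, url=None):
    """Return True only if the action passes both policy checks."""
    if action not in ALLOWED_ACTIONS:          # least privilege
        return False
    if url is not None:
        host = urlparse(url).hostname or ""
        if host in BLOCKED_DOMAINS:            # blocklist
            return False
    return True
```

Allowlisting actions (deny by default) limits damage even when a prompt injection succeeds, since the injected instructions still cannot invoke shell access or other unlisted capabilities.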
Organizations deploying CUAs should implement layered mitigations such as those above, combining isolation, least privilege, and human oversight.[29][15]
Anthropic implements classifiers to detect harm, restrictions on election-related tasks, and ASL-2 compliance.[15] OpenAI includes refusals for harmful tasks, blocklists, user confirmations, and evaluations against frontier risks like autonomous replication.[2]
Independent evaluations and benchmark studies report that state-of-the-art CUAs still struggle with robust GUI grounding, long-horizon plans, and operational knowledge of unfamiliar applications.[9][25][22]
Users may also find CUA interfaces confusing and see no clear benefit over traditional tools.[31]
Industry leaders have outlined several areas for further advancement.
The llms.txt proposal suggests a standardized format for websites to provide AI-readable information, potentially improving CUA reliability while maintaining human usability.[21] This would allow websites to expose structured data specifically for AI consumption.
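A minimal llms.txt illustrates the proposed structure: an H1 site name, a blockquote summary, and H2 sections listing links with short descriptions. The site and links below are invented for illustration.

```markdown
# Example Store

> An online store selling widgets; this summary and the links below give
> AI agents a structured entry point to the site.

## Docs

- [Checkout flow](https://example.com/docs/checkout.md): steps to place an order
- [Returns policy](https://example.com/docs/returns.md): how refunds work
```

The file is served at the site root (`/llms.txt`), letting agents fetch a concise, markdown-readable map instead of parsing rendered pages.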
Anticipated developments include greater modularity and scalability: open-source efforts like Agent S2 emphasize modular design,[13] and by mid-2025 CUAs were widely described as foundational for "agentic coworkers."[3]
Organizations implementing CUAs report operational improvements from automating multi-step workflows. At the same time, the technology is reshaping workplace dynamics and raises ethical questions, including how automated GUI control affects jobs, who is accountable for agent errors, and whether it should be disclosed when an agent rather than a human is operating software.