A computer-use model is a specialized type of artificial intelligence model that enables autonomous agents to interact with graphical user interfaces (GUIs) by perceiving screen content and executing actions like clicking, typing, and scrolling, similar to how humans use computers.[1][2] These models represent a significant advancement in AI agents, allowing them to control computers through visual understanding rather than programmatic APIs, making them capable of automating complex digital tasks across various applications and operating systems.[3]
Overview
Computer-use models combine vision-language models (VLMs) with reinforcement learning capabilities to understand and interact with computer screens through pixel-level visual processing.[2] Unlike traditional automation approaches that require specific APIs or scripting for each application, computer-use models can control any software that has a graphical interface, using the same visual cues and input methods that humans use.[3] This universal approach makes them particularly valuable for tasks that span multiple applications or require interaction with legacy systems that lack modern APIs.[4]
While many agent systems integrate through structured APIs, a large portion of digital work still happens in GUIs, including form filling, dashboards, and behind-login workflows. Computer-use models address this gap by powering agents that operate like human users, navigating web pages and applications by clicking, typing, and scrolling.[3]
The fundamental innovation of computer-use models is their ability to translate high-level instructions into low-level computer actions by:
Perceiving screen content through screenshot analysis
Understanding the spatial layout and purpose of UI elements
Generating appropriate mouse and keyboard actions
Adapting to dynamic changes in the interface
Learning from feedback to improve performance over time[5]
History
The concept of computer-use models emerged as part of the broader development of multimodal large language models (LLMs) capable of processing visual inputs. Early research focused on visual question answering and image captioning, but by 2024, advancements allowed models to actively control UIs.
The first public beta of a computer-use model was introduced by Anthropic on October 22, 2024, with an upgraded Claude 3.5 Sonnet model featuring "computer use" capabilities. This allowed Claude to perceive screens and perform actions like cursor movement and typing.[1][6]
In July 2025, OpenAI released a preview of its Computer Use tool via Azure OpenAI, enabling models to interact with browsers, desktops, and applications across operating systems like Windows, macOS, and Ubuntu.[7]
On October 7, 2025, Google DeepMind announced the Gemini 2.5 Computer Use model, built on Gemini 2.5 Pro, optimized primarily for web browsers and mobile UIs. The model became available through the Gemini API via Google AI Studio and Vertex AI.[3][8]
Technical Architecture
Core Components
Computer-use models typically consist of several integrated components working in an iterative loop:[3][9]
Visual Perception Module: Processes screenshots using convolutional neural networks or vision transformers to understand screen content
Language Understanding Module: Interprets user instructions and maintains context using large language models
Action Planning Module: Uses chain-of-thought reasoning to decompose tasks into executable steps[2]
Action Execution Module: Translates high-level decisions into specific UI actions (clicks, keystrokes, scrolls)
Feedback Processing Module: Evaluates action results and adjusts strategy based on observed changes
Agent Loop
At a high level, agents using computer-use models follow a repeated loop:[9]
Send Request: The application invokes the Computer Use tool with the user's goal, a screenshot of the current GUI, the current URL, and optionally recent action history and constraints (for example excluded actions)
Receive Model Response: The model analyzes these inputs and generates a response, typically containing one or more function calls representing UI actions (for example open browser, click, type) and may include a safety decision (for example "requires confirmation")
Execute Actions: Client-side code executes the allowed actions, prompting the end user for confirmation when required
Capture New State: After the actions execute, the client captures a new screenshot of the GUI and the current URL
Send Function Response: The new state is returned to the model as function responses, and the loop repeats from step 2
This process continues until the task is complete, an error occurs, or the loop terminates due to a safety decision or user intervention. The loop is conceptually similar to function calling with tools, but specialized for GUI manipulation.[9]
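The loop described above can be sketched in Python. The `model` and `executor` objects here are hypothetical stand-ins for a real model SDK and a real GUI-automation layer; vendor APIs differ in naming but follow the same cycle:

```python
def run_agent(model, executor, goal, max_steps=20):
    """Drive a computer-use model until it signals completion or stops."""
    screenshot, url = executor.capture_state()   # initial GUI state
    history = []
    response = None
    for _ in range(max_steps):
        # Steps 1-2: send goal + current state, receive proposed actions
        response = model.step(goal=goal, screenshot=screenshot,
                              url=url, history=history)
        if response.done:                        # task complete or blocked
            return response
        for action in response.actions:
            # Step 3: honor per-action safety decisions before executing
            if action.safety == "requires_confirmation":
                if not executor.confirm_with_user(action):
                    return response              # user declined; stop
            executor.execute(action)
            history.append(action)
        # Steps 4-5: capture fresh state and repeat
        screenshot, url = executor.capture_state()
    return response
```

In practice the client also enforces timeouts and logging around `executor.execute`, since the model has no direct visibility into execution failures except through the next screenshot.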
Coordinate System
Most computer-use models employ a normalized coordinate system where screen positions are represented on a 1000x1000 grid regardless of actual screen resolution.[9] This approach ensures consistency across different display configurations. The model outputs normalized coordinates that are then converted to actual pixel values by the client implementation:
X coordinates: 0-999 (left to right)
Y coordinates: 0-999 (top to bottom)
Actual pixel position = (normalized_coordinate / 1000) × screen_dimension
The recommended screen size for use with computer-use models is 1440×900 pixels, though models work with any resolution.[9]
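The conversion can be expressed as a small client-side helper. This is a sketch; the clamping to the documented 0-999 range is an assumption about defensive client code, not part of any vendor specification:

```python
def denormalize(x_norm: int, y_norm: int, width: int, height: int) -> tuple:
    """Map model coordinates on the 1000x1000 grid to real screen pixels."""
    # Clamp to the documented 0-999 range, then scale by screen size.
    x_norm = max(0, min(999, x_norm))
    y_norm = max(0, min(999, y_norm))
    return (x_norm * width // 1000, y_norm * height // 1000)

# e.g. the screen center on the recommended 1440x900 display:
denormalize(500, 500, 1440, 900)  # -> (720, 450)
```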
Training Methodology
Computer-use models are typically trained using a combination of:[10]
Supervised Fine-tuning (SFT): Initial training on human demonstrations of UI interactions
Reinforcement Learning (RL): Optimization through trial-and-error with reward signals
Reinforcement Learning from Human Feedback (RLHF): Refinement based on human preferences and corrections[11]
Imitation Learning: Learning from recorded sequences of expert human interactions
Major Implementations
Google Gemini Computer Use
Released on October 7, 2025, Google DeepMind's Gemini 2.5 Computer Use model is a specialized variant of Gemini 2.5 Pro optimized for browser control.[3] Key features include:
Model code: `gemini-2.5-computer-use-preview-10-2025`
Specialized for web browser automation, with promising early performance on mobile UIs
Built-in safety monitoring with per-step safety service
Performance: 70.3% on Online-Mind2Web benchmark, 34.7% on WebVoyager, 70.9% on AndroidWorld[12]
Powers Project Mariner, Firebase Testing Agent, and some agentic capabilities in AI Mode in Search[3]
Available via Google AI Studio and Vertex AI
Early testers report significant results:
Poke.com (AI assistant): "50% faster and better than the next best solutions"[3]
Autotab (AI agent): "18% performance increase on hardest evals"[3]
Google payments platform: "Successfully rehabilitates over 60% of executions" for failed UI tests[3]
OpenAI Computer-Using Agent (CUA)
OpenAI's Computer-Using Agent (CUA) powers the Operator product and combines GPT-4o's vision capabilities with reinforcement learning.[2] Released in July 2025 via Azure OpenAI, it achieves 38.1% on OSWorld, 58.1% on WebArena, and 87% on WebVoyager.[2]
Anthropic Claude 3.5 Sonnet
Anthropic's Claude 3.5 Sonnet was the first frontier AI model to offer computer use capabilities in public beta (October 22, 2024).[1] Features include:
Pixel-counting method for precise cursor positioning
14.9% score on OSWorld (screenshot-only)
22.0% score on OSWorld (with additional steps)
Available through API for developer integration
Early adopters include Asana, DoorDash, and Replit for multi-step automation[1]
Benchmarks and Evaluation
Performance Comparison on Major Benchmarks
| Model | OSWorld | WebArena | WebVoyager | Online-Mind2Web | AndroidWorld |
|---|---|---|---|---|---|
| OpenAI CUA | 38.1% | 58.1% | 87% | - | - |
| Gemini 2.5 Computer Use | - | - | 34.7% | 70.3% | 70.9% |
| Claude 3.5 Sonnet | 14.9-22% | - | - | - | - |
| Human performance | 72.36% | - | - | - | - |
OSWorld
OSWorld is a comprehensive benchmark for evaluating multimodal agents on open-ended computer tasks across Ubuntu, Windows, and macOS.[14] The benchmark consists of 369 tasks involving real web and desktop applications, operating-system file operations, and workflows that span multiple applications.
WebArena
WebArena evaluates web browsing agents using self-hosted open-source websites that simulate real-world scenarios in e-commerce, content management systems, and social platforms.[2] It tests abilities including form filling, multi-step navigation, information extraction, and transaction completion.
WebVoyager
WebVoyager tests model performance on live websites including Amazon, GitHub, and Google Maps, evaluating real-world web interaction capabilities.[2]
In a joint evaluation with Google DeepMind, Browserbase reported Gemini 2.5 Computer Use leading in accuracy, speed, and cost under matched constraints, publishing evaluation traces across thousands of human-verified runs.[16]
Supported Actions
Computer-use models typically support a standardized set of UI actions. Developers must implement the execution logic for these actions on their client-side application:[9]
Common UI Actions Supported by Computer-Use Models
| Action | Description | Parameters |
|---|---|---|
| `open_web_browser` | Opens the web browser | None |
| `click_at` | Click at specific coordinates | x, y coordinates |
| `type_text_at` | Type text at a location | x, y, text, clear_before_typing, press_enter |
| `scroll_document` | Scroll the entire page | direction (up/down/left/right) |
| `scroll_at` | Scroll a specific element or region | x, y, direction, magnitude |
| `drag_and_drop` | Drag an element to a new location | start x, y; destination x, y |
| `key_combination` | Press keyboard shortcuts | keys (for example "Control+C") |
| `hover_at` | Hover the mouse at a location | x, y coordinates |
| `navigate` | Go to a URL | url |
| `wait_5_seconds` | Pause execution | None |
| `go_back` / `go_forward` | Navigate browser history | None |
| `search` | Go to the default search engine | None |
Developers can also add custom user-defined functions (for example `open_app`, `long_press_at` for mobile) and exclude specific predefined functions to constrain behavior.[9]
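Client-side execution logic for these actions is commonly organized as a dispatch table from action names to handlers. The sketch below assumes a hypothetical `backend` object exposing `click`, `type_text`, `scroll`, `goto`, and `wait` methods (not any specific automation library), and wires up only a subset of the actions above:

```python
def make_dispatcher(backend):
    """Map the model's function calls onto a GUI-automation backend."""
    handlers = {
        "click_at":        lambda a: backend.click(a["x"], a["y"]),
        "type_text_at":    lambda a: backend.type_text(a["x"], a["y"], a["text"]),
        "scroll_document": lambda a: backend.scroll(a["direction"]),
        "navigate":        lambda a: backend.goto(a["url"]),
        "wait_5_seconds":  lambda a: backend.wait(5),
    }

    def execute(name, args):
        # Rejecting unknown names doubles as a crude action allowlist.
        if name not in handlers:
            raise ValueError(f"unsupported action: {name}")
        return handlers[name](args)

    return execute
```

Excluding a predefined function, as described above, then amounts to omitting its entry from the table so the dispatcher rejects it.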
Applications
Computer-use models have numerous practical applications across industries:[3][5]
Business Automation
Data entry and form processing across multiple websites
Cross-application workflow automation
Report generation from multiple sources
Customer service automation
Invoice and document processing
Software Development
UI testing and quality assurance (Google's payments team recovered 60% of failed tests)[3]
Automated debugging and cross-browser compatibility testing
Research and Data Collection
Web scraping and data collection (for example gathering product information, prices, and reviews)
Competitive intelligence gathering
Market research automation
Academic research assistance
Content aggregation
Personal Productivity
Email management and calendar scheduling
File organization
Online shopping assistance (for example finding "highly rated smart fridges with touchscreen")
Social media management
Personal assistant applications (Poke.com reports 50% speed improvement)[3]
Safety and Security
Computer-use models introduce unique risks including intentional misuse, unexpected model behavior, and vulnerability to prompt injections and scams. To address these, implementations use layered safety approaches:[3][9]
Built-in Safety
Per-step safety service: An out-of-model, inference-time safety service assesses each action before execution
Safety decisions: Actions classified as regular/allowed, requires_confirmation, or blocked
Training-level safety: Features trained directly into models to avoid harmful actions
Prompt Injection Attacks
Prompt injection represents one of the most significant security risks for computer-use models.[17] These attacks can occur through:
Direct injection: Malicious instructions embedded in user input[18]
Indirect injection: Hidden commands in external content (web pages, documents)[19]
Stored injection: Persistent malicious prompts in training data or memory[20]
Mitigation Strategies
Organizations implementing computer-use models should employ multiple layers of security:[21]
Sandboxed Execution: Run agents in isolated virtual machines or containers[1]
Human-in-the-Loop (HITL): Require human confirmation for sensitive actions (for example purchases, CAPTCHA interactions)[9]
System Instructions: Custom safety policies to block or require confirmation for high-stakes actions
Access Control: Implement strict permission boundaries and authentication
Content Filtering: Use guardrails to detect and block malicious inputs[7]
Monitoring and Logging: Track all agent actions for audit and forensics
Rate Limiting: Prevent abuse through action frequency restrictions
Allowlists/Blocklists: Control which websites agents can access
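As an illustration of the allowlist strategy, a client can gate every `navigate` action on an approved-host check before execution. This sketch uses only the Python standard library; the domains are placeholders:

```python
from urllib.parse import urlparse

# Illustrative allowlist; a real deployment would load this from policy config.
ALLOWED_HOSTS = {"example.com", "docs.example.com"}

def is_allowed(url: str) -> bool:
    """Permit navigation only to hosts on the allowlist (exact match)."""
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_HOSTS
```

Exact-host matching is deliberately strict here; matching subdomains or URL paths widens the attack surface and should be a conscious policy decision.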
Ethical Considerations
The deployment of computer-use models raises several ethical concerns:
Privacy implications of screen content analysis
Potential for unauthorized data access or exfiltration
Risk of perpetuating biases in automated decisions
Impact on employment in data entry and similar fields