Computer use
Last reviewed: May 18, 2026
Sources: No citations yet
Review status: Needs citations
Revision: v6 · 7,099 words
Computer use refers to the capability of artificial intelligence models to control computers by viewing screens and performing mouse and keyboard actions, much like a human user would. Rather than interacting through APIs or command-line interfaces, computer use agents perceive the graphical user interface (GUI) through screenshots, reason about what they see, and execute actions such as clicking buttons, typing text, scrolling, and navigating between applications.
The concept emerged as a practical product in late 2024, when Anthropic released computer use capabilities for Claude 3.5 Sonnet on October 22, 2024, making it the first major AI model to offer autonomous desktop control through a public API.[1] Google introduced Project Mariner as a browser-based agent in December 2024,[10] and OpenAI followed with Operator on January 23, 2025.[9] By mid-2026, computer use has become a competitive frontier in AI development, with multiple agents exceeding the human baseline on standardized benchmarks. Open-weight specialist models such as H Company's Holo3 now hold the top spot on the OSWorld-Verified leaderboard at 82.6%, while general-purpose frontier models like Claude Opus 4.7 (78.0%) and GPT-5.5 (78.7%) sit within a few points of the open-weight leader.[23][24]
Before the current generation of computer-use agents, researchers explored various ways to have AI interact with graphical interfaces. Early approaches included screen scraping (extracting text from rendered interfaces), accessibility API integration (using OS-level accessibility trees to understand interface structure), and Selenium-style browser automation (programmatically controlling web browsers through DOM manipulation). These methods were brittle, requiring specific knowledge of each application's internal structure, and broke whenever interfaces changed.
The vision capabilities of modern multimodal language models made a fundamentally different approach possible. Instead of relying on structured access to interface elements, a vision-capable model could simply look at a screenshot and understand what was on screen, the same way a person does.
Anthropic launched computer use as a public beta on October 22, 2024, alongside the updated Claude 3.5 Sonnet model.[1] This was the first time a major AI company offered a production-grade API for autonomous desktop control. The announcement described the capability as allowing developers to "direct Claude to use computers the way people do, by looking at a screen, moving a cursor, clicking buttons, and typing text."[1] The feature was available through the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI from day one.
At launch, Anthropic was candid about the limitations. The company noted that computer use was "at times cumbersome and error-prone" and that Claude could miss short-lived notifications, struggle with scrolling, and have difficulty with precise cursor placement.[1] Despite these limitations, the release generated significant interest from developers building automation workflows.
Six organizations were highlighted as early adopters of Claude's computer use at launch:
| Company | Use Case |
|---|---|
| Asana | Automating project management workflows |
| Canva | Testing and automating design tool interactions |
| Cognition | Integrating desktop control into AI development agents |
| DoorDash | Automating internal operational processes |
| Replit | Evaluating apps as they are built within Replit Agent, using Claude's UI navigation capabilities to test applications during the development process |
| The Browser Company | Automating web-based workflows; the company noted that Claude 3.5 Sonnet outperformed every model they had previously tested for this purpose |
These companies were executing tasks requiring "dozens, and sometimes even hundreds, of steps," demonstrating the potential for complex multi-step automation even in the beta's early state.[1]
Anthropic published a companion blog post titled "Developing a computer use model" alongside the October 2024 launch, detailing the training approach.[2] The company's previous work on tool use and multimodality provided the foundation. Combining these abilities, Anthropic trained Claude to interpret what was happening on a screen and then use available software tools to carry out tasks.
The training itself was deliberately constrained for safety reasons. Claude was trained on only a few pieces of simple software, such as a calculator and a text editor. Internet access was intentionally prohibited during training. Despite this narrow training scope, the team was surprised by how rapidly Claude generalized to handle diverse applications it had never seen during training. This suggested that the model developed generalizable spatial reasoning rather than memorizing specific interface layouts.
A key technical challenge was teaching Claude to count pixels accurately. Without this ability to calculate cursor movement distances, the model struggled with mouse commands. This parallels how large language models often have difficulty with seemingly simple tasks like counting letter occurrences in words.
The model also demonstrated self-correction behavior, retrying tasks when it encountered obstacles rather than simply failing. Anthropic described Claude's perception mechanism as a "flipbook" approach: taking screenshots, analyzing what is visible, then issuing sequential commands. This differs from continuous video observation, which creates limitations in detecting brief notifications or rapidly changing UI elements.
OpenAI launched Operator on January 23, 2025, powered by the Computer-Using Agent (CUA) model.[9] CUA combined GPT-4o's vision capabilities with advanced reasoning trained through reinforcement learning.
Unlike Anthropic's approach, which provided a low-level API for developers to build their own computer use applications, Operator was a consumer-facing product. It ran in a secure virtual browser environment hosted by OpenAI, meaning users did not need to set up their own sandboxed environments. Operator was initially available to ChatGPT Pro subscribers (the $200-per-month tier) in the United States.
CUA uses a three-phase cycle: perception (screenshots are added to the model's context), reasoning (the model uses chain-of-thought to evaluate observations, track intermediate steps, and adapt dynamically), and action (the model performs clicks, scrolling, and typing until the task is completed or user input is needed). CUA is trained to interact with GUIs directly, without relying on OS-specific or web-specific APIs. However, because OpenAI hosts the execution environment in a sandboxed virtual browser, CUA is limited to browser-based tasks, unlike Anthropic's approach, which supports full desktop control.
On July 17, 2025, OpenAI folded Operator into a unified product called ChatGPT agent, available directly from the ChatGPT composer's tools dropdown as "agent mode."[25] The new product combined three previously separate capabilities: Operator's action-taking remote browser, deep research's web synthesis, and ChatGPT's conversational interface. OpenAI's stated rationale was that "many queries users attempted with Operator were actually better suited for deep research," so the strengths were brought together. ChatGPT agent's built-in virtual browser carried over Operator's browser automation and added a code interpreter, terminal access, and the ability to deliver editable slideshows and spreadsheets summarizing its findings.
The standalone Operator product at operator.chatgpt.com was deprecated within weeks of the announcement and the experience shut down on August 31, 2025.[26] ChatGPT agent went live for ChatGPT Pro, Team, and Plus users, with Pro receiving 400 queries per month and Team and Plus receiving 40 per month. At launch, ChatGPT agent posted state-of-the-art browser-task results on WebArena and a score around 38.1% on OSWorld, inheriting the CUA backbone.
Google DeepMind unveiled Project Mariner on December 11, 2024, as a research prototype for AI-powered web browsing.[10] Powered by Gemini 2.0, Project Mariner could navigate websites, click buttons, fill forms, conduct searches, and complete online tasks autonomously.
The initial release was limited to a select group of testers. At Google I/O 2025, Google expanded access and announced several upgrades. Project Mariner was updated to run on virtual machines in the cloud (similar to OpenAI's approach), allowing users to continue their own work while the agent handled up to 10 tasks simultaneously in the background. A "Teach and Repeat" feature let users demonstrate workflows that the agent could then replicate.
Access to Project Mariner was tied to Google's AI Ultra plan, priced at $249.99 per month. Google also made Mariner's capabilities available through the Gemini API and Vertex AI for developers building their own applications. Like CUA, Project Mariner was limited to browser-based interactions and did not provide full desktop control.
Google shut down Project Mariner as a standalone product on May 4, 2026, ending an experiment that ran for roughly 17 months.[27] Rather than discarding the work, Google folded Mariner's browser-agent technology into Gemini Agent, the task-automation layer of the Gemini app, and into Google AI Mode in Search. Gemini Agent runs on Gemini 3 and inherits Mariner's multi-step planning, persistent cloud sessions, and the ability to coordinate up to 10 parallel tasks.[28] It also gains tighter integration with Deep Research, Canvas, and Workspace apps such as Gmail and Calendar.
Gemini Agent is available on web to Google AI Ultra subscribers in the United States with their language set to English. Google plans to extend the same agent capabilities to Android phones, watches, cars, and glasses starting summer 2026, beginning with select Samsung Galaxy and Pixel devices. Separately, on January 29, 2026, Google launched Chrome Auto Browse, a Gemini 3-powered feature inside the Chrome browser that handles shopping, form-filling, and other multi-step tasks autonomously.[29] Together, Gemini Agent and Chrome Auto Browse replace the standalone Mariner offering with embedded agent capabilities in products that already have hundreds of millions of users.
On October 21, 2025, OpenAI launched ChatGPT Atlas, a Chromium-based web browser with ChatGPT integrated as the primary interface.[30] Atlas includes a sidebar assistant that can answer questions about the current page, summarize content, and rewrite selected text. Its centerpiece for computer use is Agent Mode, which lets ChatGPT browse and act on the open browser window on the user's behalf. Agent Mode was offered as a preview to ChatGPT Plus, Pro, and Business subscribers at launch and is available initially on macOS, with Windows, iOS, and Android versions promised. Atlas brought OpenAI's computer use into the browser layer itself rather than running an agent in a separate remote sandbox.
On March 23, 2026, Anthropic introduced native computer use for macOS, available in both Claude Cowork and Claude Code.[4] This release marked a shift from the API-only approach of the 2024 beta to an integrated consumer product where Claude can directly control a user's Mac desktop.
The Mac computer use feature follows a prioritized tooling hierarchy:
Anthropic acknowledged that screen interaction is slower than using connectors. When Claude works through the screen instead of a direct integration, tasks take longer. Complex tasks sometimes need a second attempt, and the company described the feature as a research preview rather than a production-ready tool.
The Mac computer use feature is paired with Dispatch, a companion capability that lets users assign tasks to Claude from a mobile device. Dispatch was first released on March 17, 2026, a few days before Mac computer use, with Claude Max subscribers receiving access first, followed by Pro subscribers within days.[22] Users can scan a QR code from the Claude mobile app to link their phone to Claude Desktop, then assign tasks while away from the computer. Claude executes the work on the desktop and delivers the finished output when the user returns. The computer must remain powered on, as Dispatch functions as a remote control rather than a cloud computing service.
Mac computer use is initially available to Claude Pro and Claude Max subscribers on macOS.
Anthropic launched Claude for Chrome (also referred to as "Claude in Chrome") in August 2025 as a research preview limited to 1,000 trusted users on the Claude Max plan.[31] The browser extension lets users converse with Claude in a sidebar while the model navigates websites, fills forms, extracts data, and runs multi-step workflows on the user's behalf. Access expanded to all Max subscribers by November 2025 and to every paid plan, including Claude Pro, by December 2025. Claude for Chrome integrates with Claude Code, Cowork, and Claude Desktop, and it added a workflow-recording feature so users can teach the agent a sequence of steps to replay later. Anthropic reported that adding autonomy-mode safety mitigations reduced prompt injection attack success rates from 23.6% to 11.2%, while emphasizing that the risk has not been eliminated.
On April 16, 2026, Anthropic released Claude Opus 4.7, the first Claude model with native high-resolution image support and a substantial computer use uplift over the previous Opus 4.6 generation.[7] Maximum image resolution rose from 1,568 pixels per long edge (1.15 megapixels) to 2,576 pixels (3.75 megapixels), roughly 3.3 times the previous pixel budget.[11] Crucially, the model's coordinate space now maps 1:1 with actual pixel coordinates, removing the scale-factor math that previously complicated screenshot and computer use workflows. Anthropic specifically called out that the resolution change "is particularly important for computer use and screenshot/artifact/document understanding workflows."[11]
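To make the scale-factor math concrete, the following is a rough sketch of what a harness might have had to do under the older limits. It assumes a resize rule that honors both the long-edge cap and the megapixel cap quoted above; the actual resizing logic used by the API may differ, and the function names `fit_image_limits` and `to_screen_coords` are hypothetical.

```python
import math


def fit_image_limits(width: int, height: int,
                     max_long_edge: int, max_pixels: int) -> tuple[int, int, float]:
    """Downscale (width, height) to respect both limits.

    Returns the scaled width, scaled height, and the scale factor applied.
    Assumption: both caps apply and the aspect ratio is preserved.
    """
    scale = min(
        1.0,
        max_long_edge / max(width, height),
        math.sqrt(max_pixels / (width * height)),
    )
    return round(width * scale), round(height * scale), scale


def to_screen_coords(x: int, y: int, scale: float) -> tuple[int, int]:
    """Map a point in the downscaled screenshot back to real screen pixels."""
    return round(x / scale), round(y / scale)


# A 2560x1440 display under the older limits (1,568 px long edge, 1.15 MP).
w, h, scale = fit_image_limits(2560, 1440, 1568, 1_150_000)

# The same display under the newer limits (2,576 px long edge, 3.75 MP)
# fits without resizing, so model coordinates are screen coordinates.
w_new, h_new, scale_new = fit_image_limits(2560, 1440, 2576, 3_750_000)

# Ratio of the two megapixel caps quoted in the text.
pixel_budget_ratio = 3_750_000 / 1_150_000
```

With the older limits, every coordinate the model returns has to pass through `to_screen_coords` before a click is issued, and rounding can shift the target by a pixel or two. When the scale factor is 1.0, that step disappears.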
Opus 4.7 also added a new xhigh effort level for agentic workloads, task budgets that give the model an advisory token allowance across a full agentic loop, and a new tokenizer. The model uses adaptive thinking as the only supported thinking-on mode (extended thinking budgets are no longer accepted) and runs at $5 per million input tokens and $25 per million output tokens at standard API pricing, with no long-context premium for the 1M token context window.
On the OSWorld-Verified benchmark, Opus 4.7 scored 78.0%, up from 72.7% for Opus 4.6, placing it third on the public leaderboard behind H Company's Holo3-35B-A3B (82.6%) and Claude Mythos Preview (79.6%).[23][24]
On April 1, 2026, the Paris-based startup H Company released Holo3, a family of vision-language models trained specifically for computer use.[32] Holo3 is composed of two variants: Holo3-35B-A3B, a 35-billion parameter mixture-of-experts model with 3B active parameters released under the Apache 2.0 license and available on Hugging Face, and Holo3-122B-A10B, a larger API-only variant. The smaller model can be run locally on a single GPU with roughly 24GB of VRAM; the larger model is offered at $0.40 per million input tokens and $3.00 per million output tokens, an order of magnitude cheaper than Opus 4.7.
Holo3-35B-A3B topped the OSWorld-Verified leaderboard at 82.6% as of April 29, 2026, the first open-weight model to do so and well above general-purpose frontier models. Holo3-122B-A10B followed at 78.85%.[32][24] H Company described Holo3 as a "computer-use specialist" rather than a general assistant, and the release demonstrated that vertical specialization with dedicated training data can outperform much larger general-purpose models on GUI tasks.
Anthropic previewed Claude Mythos Preview on April 7, 2026, a model that sits one capability tier above Opus 4.7.[33] Mythos is offered only through Project Glasswing, an invitation-only program for 12 founding organizations and roughly 40 vetted critical-infrastructure operators. Partners pay $25 per million input tokens and $125 per million output tokens, five times Opus 4.7 pricing, and access is gated for cybersecurity reasons. Mythos posted 79.6% on OSWorld-Verified, the highest score for a general-purpose model and second only to Holo3-35B-A3B.[24] Anthropic has not announced a general availability date.
Mistral AI added a Work mode preview to its Le Chat assistant in April 2026, powered by Mistral Medium 3.5.[34] Work mode runs multi-step agentic workflows across connected tools such as email, calendar, messaging, GitHub, Jira, and Slack. The agent can draft messages, create issues, and generate reports while exposing every tool call and intermediate step to the user; sensitive operations require explicit approval. Sessions persist across multiple steps so the agent can iterate until a task completes. Mistral also introduced remote coding agents in its Vibe product line that can be spawned from the CLI or Le Chat and run asynchronously in the cloud; a local CLI session can be "teleported" up to the cloud to continue in the background.
All major computer use implementations follow a similar perception-reasoning-action loop, though the specific implementations differ.
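The core mechanism is a cycle that repeats until the task is complete: the agent captures a screenshot of the current screen (perception), decides what to do next based on what it sees and the task at hand (reasoning), and executes a mouse or keyboard action (action).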
This loop continues until the model determines that the task is complete or that it needs human input to proceed.
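The loop can be sketched in a few lines. This is an illustrative skeleton rather than any vendor's implementation: `Action`, `run_agent_loop`, the `"done"` and `"ask_user"` action kinds, and the `max_steps` safeguard are all hypothetical names, and the model call is abstracted behind a `choose_action` callback.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    kind: str                                  # e.g. "left_click", "type", "done"
    args: dict = field(default_factory=dict)   # e.g. {"x": 10, "y": 20}


def run_agent_loop(
    task: str,
    take_screenshot: Callable[[], bytes],
    choose_action: Callable[[str, list], Action],
    execute: Callable[[Action], None],
    max_steps: int = 50,
) -> str:
    """Repeat perception -> reasoning -> action until the model stops."""
    history: list = []
    for _ in range(max_steps):
        # Perception: capture the current screen and add it to the context.
        history.append(("screenshot", take_screenshot()))
        # Reasoning: the model picks the next action from task + history.
        action = choose_action(task, history)
        # The model signals completion or a need for human input.
        if action.kind in ("done", "ask_user"):
            return action.kind
        # Action: perform the click, keystroke, scroll, and so on.
        execute(action)
        history.append(("action", action))
    return "step_limit"


# Usage with a scripted stand-in for the model.
scripted = iter([
    Action("left_click", {"x": 10, "y": 20}),
    Action("type", {"text": "hello"}),
    Action("done"),
])
performed: list = []
result = run_agent_loop(
    "demo task",
    take_screenshot=lambda: b"",
    choose_action=lambda task, history: next(scripted),
    execute=performed.append,
)
```

The step cap is a design choice for the sketch: without it, a model that never emits a terminal action would loop indefinitely.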
Anthropic's computer use tool is exposed through the Claude API as a special tool type.[3] Developers define a computer use tool with a specified screen resolution and pass it to the model alongside their prompt. Claude then requests tool calls that the developer's application executes in the host environment. The tool is schema-less, meaning the input schema is built into Claude's model and cannot be modified by the developer.
The computer use tool remains in beta and requires a specific beta header in API requests. Different Claude model generations use different tool versions:
| Model | Tool Version | Beta Header |
|---|---|---|
| Claude Opus 4.7 | computer_20251124 (with 1:1 coordinate mapping) | computer-use-2025-11-24 |
| Claude Opus 4.6, Claude Sonnet 4.6, Claude Opus 4.5 | computer_20251124 | computer-use-2025-11-24 |
| Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.1, Claude Sonnet 4, Claude Opus 4, Claude Sonnet 3.7 | computer_20250124 | computer-use-2025-01-24 |
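As a sketch of how a harness might pick the right tool version, the lookup below transcribes a few rows of the table and builds a tool definition without making any network call. The dictionary keys are display names taken from the table, not API model identifiers, and `build_computer_tool` is a hypothetical helper. The field names `display_width_px` and `display_height_px` follow Anthropic's published examples as best recalled and should be checked against the current documentation.

```python
# Display name -> (tool version, beta header), transcribed from the table above.
TOOL_VERSIONS = {
    "Claude Opus 4.7": ("computer_20251124", "computer-use-2025-11-24"),
    "Claude Opus 4.6": ("computer_20251124", "computer-use-2025-11-24"),
    "Claude Sonnet 4.6": ("computer_20251124", "computer-use-2025-11-24"),
    "Claude Sonnet 4.5": ("computer_20250124", "computer-use-2025-01-24"),
    "Claude Haiku 4.5": ("computer_20250124", "computer-use-2025-01-24"),
}


def build_computer_tool(model_name: str, width_px: int, height_px: int) -> tuple[dict, str]:
    """Return (tool definition, beta header) for a model listed in the table."""
    tool_type, beta_header = TOOL_VERSIONS[model_name]
    tool = {
        "type": tool_type,
        "name": "computer",
        # Screen resolution the model should assume for its coordinates.
        "display_width_px": width_px,
        "display_height_px": height_px,
    }
    return tool, beta_header


tool, beta = build_computer_tool("Claude Opus 4.7", 1280, 800)
legacy_tool, legacy_beta = build_computer_tool("Claude Sonnet 4.5", 1024, 768)
```

The returned dictionary would go in the request's tool list and the header string in the request's beta headers. Because the tool is schema-less, no input schema is supplied.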
The available actions have expanded across tool versions:
| Action Category | Actions | Availability |
|---|---|---|
| Basic actions | screenshot, left_click, type, key, mouse_move | All versions |
| Enhanced mouse actions | right_click, middle_click, double_click, triple_click, left_click_drag | computer_20250124 and later |
| Fine-grained control | left_mouse_down, left_mouse_up, hold_key, wait | computer_20250124 and later |
| Scrolling | scroll (with direction and amount control) | computer_20250124 and later |
| Detailed inspection | zoom (view a specific screen region at full resolution) | computer_20251124 only |
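A harness that supports several tool versions may want to reject actions the active version does not offer. The sketch below encodes the table as a minimum version date per action. The helper names are hypothetical, and treating "computer_20251124 only" as a minimum date is a simplification that holds only while no later version exists.

```python
# Earliest tool-version date (YYYYMMDD) offering each action, per the table.
# 0 stands for "all versions".
_ACTION_GROUPS = {
    0: ["screenshot", "left_click", "type", "key", "mouse_move"],
    20250124: [
        "right_click", "middle_click", "double_click", "triple_click",
        "left_click_drag", "left_mouse_down", "left_mouse_up",
        "hold_key", "wait", "scroll",
    ],
    20251124: ["zoom"],
}
ACTION_MIN_DATE = {
    action: date for date, actions in _ACTION_GROUPS.items() for action in actions
}


def tool_version_date(tool_version: str) -> int:
    """Extract the date suffix, e.g. 'computer_20250124' -> 20250124."""
    return int(tool_version.rsplit("_", 1)[1])


def is_action_supported(action: str, tool_version: str) -> bool:
    """True if the table lists the action as available for this version."""
    min_date = ACTION_MIN_DATE.get(action)
    if min_date is None:
        return False  # unknown action name
    return tool_version_date(tool_version) >= min_date
```

For example, `is_action_supported("zoom", "computer_20250124")` is false, while the same call with `"computer_20251124"` is true.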
Anthropic provides a reference implementation with a Docker container, web interface, and example tool implementations hosted on GitHub. The developer is responsible for setting up the execution environment, which typically involves a virtual machine or container with a virtual X11 display server (Xvfb) that renders the desktop interface Claude sees through screenshots. The computing environment includes a lightweight Linux desktop with a window manager (Mutter), a panel (Tint2), and pre-installed applications like Firefox, LibreOffice, and text editors.
When the developer provides the computer use tool, Anthropic's API generates a specialized system prompt that tells Claude it has access to a sandboxed computing environment. The developer's own system prompt is incorporated alongside this generated prompt.
OpenAI's Computer-Using Agent follows the same general perception-reasoning-action pattern but is architecturally different. The original Operator product ran entirely in a secure virtual browser hosted by OpenAI, so users did not need to manage their own sandboxing infrastructure. This made setup easier, particularly for non-technical users, but restricted CUA to browser-based tasks. Anthropic's approach supports full desktop control, including terminal commands, native applications, and file system operations.
With ChatGPT agent in July 2025, OpenAI expanded the environment to include a code interpreter and a terminal alongside the virtual browser, broadening the range of tasks the agent could complete. ChatGPT Atlas, launched in October 2025, took a third path: rather than running in a remote sandbox, Agent Mode in Atlas acts directly on the user's local browser session, so the agent sees the same logged-in pages the user does.
Project Mariner ran in cloud-hosted virtual machines that the user provisioned through Google's AI Ultra plan, with the agent capable of handling up to 10 parallel tasks. After Mariner's May 2026 shutdown, the same underlying capabilities became available through Gemini Agent in the Gemini app and through Chrome Auto Browse inside the Chrome browser. Gemini Agent runs on Gemini 3 and adds tighter integration with Workspace apps and Deep Research; Chrome Auto Browse operates inside the user's browser session rather than in a remote sandbox.
The major computer use platforms differ in scope, architecture, and target audience.
| Feature | Anthropic Claude | OpenAI ChatGPT agent / Atlas | Google Gemini Agent | H Company Holo3 |
|---|---|---|---|---|
| Launch date | October 22, 2024 | January 23, 2025 (Operator); July 17, 2025 (ChatGPT agent); October 21, 2025 (Atlas) | December 11, 2024 (Mariner); May 4, 2026 (Gemini Agent) | April 1, 2026 |
| Environment | Full desktop (user-managed VM/container), native Mac via Cowork, browser via Claude for Chrome | Cloud-hosted virtual browser (ChatGPT agent), local browser (Atlas Agent Mode) | Cloud-hosted VM (Gemini Agent), local browser (Chrome Auto Browse) | User-managed VM or local desktop |
| Scope | Desktop apps, terminal, browser, file system | Browser, code interpreter, terminal | Browser, Workspace apps | Desktop apps, browser |
| Access model | API; Cowork/Code for consumers; Claude for Chrome extension | ChatGPT subscription (Plus, Pro, Team, Business); Atlas client | AI Ultra plan ($249.99/month); also Gemini API | API for 122B variant; open-weight 35B model on Hugging Face |
| Multi-task | Single task per session (Cowork); multiple via API | Single task (agent mode) | Up to 10 parallel tasks (Gemini Agent) | Depends on harness |
| Setup complexity | Developer-oriented (API); simple for Cowork/Code/Chrome | Minimal (cloud-hosted, browser extension) | Minimal (cloud-hosted) | Developer-oriented (open weights) or API |
| OSWorld-Verified score (Apr 29, 2026) | 78.0% (Opus 4.7), 79.6% (Mythos Preview, restricted) | 78.7% (GPT-5.5); 75.0% (GPT-5.4); 38.1% (CUA, Jan 2025) | Not published for Gemini Agent on OSWorld-Verified | 82.6% (35B-A3B); 78.85% (122B-A10B) |
OSWorld is the primary benchmark for evaluating computer use agents. Created by researchers at the University of Hong Kong, Carnegie Mellon University, Salesforce Research, and the University of Waterloo, it was first published in April 2024 and accepted as a paper at NeurIPS 2024.[12]
OSWorld provides a real computer environment (not a simulation) for testing multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS. The benchmark includes 369 computer tasks involving real web and desktop applications, OS file operations, and workflows that span multiple applications.
Unlike earlier benchmarks that tested agents on simplified or simulated interfaces, OSWorld requires agents to interact with actual operating systems and real software. Tasks range from simple file management to complex multi-application workflows.
OSWorld has become the standard yardstick for measuring progress in computer use. The improvement in scores over a short period illustrates how rapidly the field is advancing.
| Date | Score | Agent/Model | Notes |
|---|---|---|---|
| Mid-2024 | ~12% | GPT-4V-based agents | Early attempts with vision models |
| October 2024 | 14.9% | Claude 3.5 Sonnet | Anthropic's initial computer use release; next-best system scored 7.8% |
| October 2024 | 22.0% | Claude 3.5 Sonnet (with extra steps) | Improved score when given additional steps |
| January 2025 | 38.1% | OpenAI CUA / Operator | Major jump with Operator launch |
| Mid-2025 | ~42.2% | Claude Sonnet 3.6 | Continued Anthropic improvements |
| Late 2025 | 61.4% | Claude Sonnet 4.5 | Significant generational leap |
| December 2025 | 72.6% | Agent S (Simular) | First system to exceed human baseline of 72.36% |
| February 2026 | 72.5% | Claude Sonnet 4.6 | Effectively tied with flagship model |
| February 2026 | 72.7% | Claude Opus 4.6 | Highest published Anthropic score at that time |
| April 1, 2026 | 82.6% | Holo3-35B-A3B (H Company) | First open-weight model to lead the OSWorld-Verified leaderboard |
| April 7, 2026 | 79.6% | Claude Mythos Preview | Restricted-access Anthropic model above Opus tier |
| April 16, 2026 | 78.0% | Claude Opus 4.7 | Highest score for a general-availability Anthropic model |
| April 2026 | 78.7% | GPT-5.5 | Top OpenAI general-purpose model |
| April 2026 | 75.0% | GPT-5.4 | Above human baseline |
| April 2026 | 73.1% | Kimi K2.6 (Moonshot AI) | Leading open-weight general model |
The human baseline on OSWorld is 72.36%, established through testing with human participants completing the same tasks.[12] As of April 2026, multiple models clear the human baseline, with the OSWorld-Verified leaderboard led by Holo3-35B-A3B at 82.6%, followed by Claude Mythos Preview (79.6%), GPT-5.5 (78.7%), and Claude Opus 4.7 (78.0%).[24]
The community has developed several variants of the benchmark:
Beyond OSWorld, several other benchmarks evaluate computer use capabilities:
| Benchmark | Focus | Notable Scores |
|---|---|---|
| WebArena | Autonomous web navigation on real websites | CUA: 58.1%; Claude achieves state-of-the-art among single-agent systems |
| WebVoyager | Real-world web task completion | CUA: 87%; Project Mariner: 83.5% |
| Online-Mind2Web | 300 tasks across 136 live high-traffic websites | Operator: 61.3% human-evaluated success (71.8% WebJudge); most other agents 28-30% |
| ScreenSpot Pro | High-resolution GUI element grounding | Opus 4.7 gained +4.4 points over Opus 4.6 from resolution upgrade |
| Mind2Web | Web task generalization across sites | Used for cross-site transfer evaluation |
| InfographicQA | Dense-document and chart understanding | Opus 4.7 gained +2.4 points from high-resolution image support |
Computer use introduces safety risks that go beyond those of traditional chatbot interactions, because the model is taking real actions in a real environment.
One of the most serious risks is prompt injection through on-screen content. Because the model reads and interprets everything visible on screen, malicious content on a website or in a document could instruct the agent to take unintended actions. For example, hidden text on a web page could instruct the agent to navigate to a different site and enter sensitive information.
Anthropic has addressed this with automatic classifiers that run on prompts to flag potential prompt injection in screenshots. When the classifiers detect a potential injection, they steer the model to ask for user confirmation before proceeding. However, this defense is not perfect, and Anthropic recommends additional precautions. For the Claude for Chrome browser extension, Anthropic reported that adding autonomy-mode safety mitigations reduced prompt injection attack success rates from 23.6% to 11.2%, demonstrating both the scale of the threat and the limits of current defenses.
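The two reported success rates can be turned into a relative reduction with simple arithmetic. The helper name `relative_reduction` is illustrative only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate fell, relative to its starting value."""
    return (before - after) / before


# Reported attack success rates, in percent.
reduction = relative_reduction(23.6, 11.2)
reduction_percent = round(reduction * 100, 1)  # 52.5
```

In other words, the mitigations cut the attack success rate roughly in half, but about one attack in nine still succeeded in the reported evaluation.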
Computer use agents can take actions with real consequences: making purchases, sending emails, deleting files, or modifying settings. If an agent misinterprets a task or encounters an error, the consequences can be difficult to reverse. This is compounded by the fact that agents act autonomously, making it harder for humans to intervene before failures cause harm.
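One way to limit hard-to-reverse mistakes is to require human confirmation before consequential actions. The following is a minimal sketch of such a gate, assuming the harness can label each high-level action with a category. In practice that labeling is the hard part, and the category names and the `gated_execute` helper here are hypothetical.

```python
from typing import Callable

# Hypothetical categories mirroring the examples in the text.
SENSITIVE_KINDS = {"purchase", "send_email", "delete_file", "change_settings"}


def gated_execute(
    kind: str,
    execute: Callable[[str], None],
    confirm: Callable[[str], bool],
) -> bool:
    """Run the action, asking a human first if it is in a sensitive category.

    Returns True if the action ran, False if the human declined.
    """
    if kind in SENSITIVE_KINDS and not confirm(kind):
        return False
    execute(kind)
    return True


# Usage: the human approves nothing, so only the harmless action runs.
log: list = []
ran_scroll = gated_execute("scroll", log.append, confirm=lambda kind: False)
ran_delete = gated_execute("delete_file", log.append, confirm=lambda kind: False)
```

The gate trades some autonomy for a chance to intervene before a failure causes harm.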
The International AI Safety Report (2026) specifically highlighted computer use agents as a category requiring careful governance, noting that "advances in how developers combine AI models with tools have enabled the development of increasingly powerful AI agents given access to tools such as memory, a computer interface, and web browsers, helping them autonomously interact with the world."[20]
Anthropic's documentation recommends several safety measures for computer use deployments through the API:[3]
OpenAI's Operator addressed some of these concerns architecturally by running in a sandboxed virtual browser rather than on the user's actual computer, limiting the potential damage from errors or prompt injection. ChatGPT Atlas, which runs Agent Mode against the user's local browser session, partially walks this back: the agent has access to logged-in sessions but does not have file-system or terminal access.
The March 2026 Mac computer use release in Claude Cowork introduced a layered safety model specific to consumer desktop use:[6]
Cross-application effects remain a challenge. If Claude clicks a link in one application, that link will open in the default browser even if the user has not explicitly granted Claude permission to use that browser.
Computer use agents in their current state (mid-2026) are capable of:
Despite rapid progress, computer use agents face several persistent limitations:
Computer use costs vary depending on the provider and the complexity of the task.
| Provider | Model/Product | Pricing Model | Approximate Cost |
|---|---|---|---|
| Anthropic | Claude API (computer use) | Per-token API pricing | Claude Sonnet 4.6: $3 input / $15 output per million tokens; Claude Opus 4.7: $5 input / $25 output per million tokens; high-resolution screenshots increase token usage |
| Anthropic | Claude Cowork (Mac computer use) | Subscription | Included with Claude Pro ($20/month) and Claude Max plans |
| Anthropic | Agent SDK (programmatic) | Separate credit pool (effective June 15, 2026) | Pro $20 credit, Max 5x $100, Max 20x $200, Team $100/seat, Enterprise $200/seat per month at API list rates[36] |
| OpenAI | ChatGPT agent / Atlas Agent Mode | Subscription | ChatGPT Plus ($20/month) 40 queries; Team ($25-30/seat) 40; Pro ($200/month) 400 |
| Google | Gemini Agent | Subscription | Included with Google AI Ultra plan ($249.99/month); Chrome Auto Browse with AI Pro and AI Ultra |
| H Company | Holo3-122B-A10B | Per-token API | $0.40 input / $3.00 output per million tokens |
| H Company | Holo3-35B-A3B | Open weights | Apache 2.0; free to self-host on a 24GB GPU |
| Mistral | Le Chat Work mode | Subscription | Le Chat Pro/Team tier; Mistral Medium 3.5 backbone |
For Anthropic's API-based approach, the cost per task depends on the number of steps required and the resolution of screenshots sent to the model. A simple task requiring 10 steps might cost a few cents, while a complex multi-application workflow with 100+ steps could cost several dollars. The Batch API offers a 50% discount on both input and output tokens for asynchronous processing, and prompt caching reduces the cost of repeated context by 90%. Anthropic announced on May 14, 2026, that programmatic computer use (Agent SDK, claude -p, GitHub Actions, third-party harnesses such as OpenClaw) will move off subscription usage pools and onto a separate monthly credit pool, ranging from $20 for Pro to $200 for Max 20x and per Enterprise seat, effective June 15, 2026, while interactive Claude Code, Cowork, and chat usage remain on existing subscription limits.[36]
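A back-of-the-envelope estimate shows how step count drives cost. The per-token prices come from the table above. The tokens-per-step figures are assumptions chosen for illustration, and the sketch ignores context that accumulates across steps and prompt caching, so real costs may differ noticeably. The function name `estimate_task_cost` is hypothetical.

```python
# USD per million tokens (input, output), from the pricing table.
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def estimate_task_cost(
    model: str,
    steps: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    batch: bool = False,
) -> float:
    """Rough task cost in USD, assuming a fixed token count per step."""
    in_price, out_price = PRICES[model]
    per_step = input_tokens_per_step * in_price + output_tokens_per_step * out_price
    cost = steps * per_step / 1_000_000
    if batch:
        cost *= 0.5  # Batch API: 50% off input and output tokens
    return cost


# Assumed: 2,000 input and 100 output tokens per step.
simple = estimate_task_cost("Claude Sonnet 4.6", 10, 2_000, 100)       # about $0.075
# Assumed: 6,000 input and 150 output tokens per step.
complex_run = estimate_task_cost("Claude Opus 4.7", 100, 6_000, 150)   # about $3.375
batched = estimate_task_cost("Claude Opus 4.7", 100, 6_000, 150, batch=True)
```

Under these assumptions the 10-step task lands at a few cents and the 100-step task at a few dollars, in line with the ranges described above.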
The most common application of computer use is automating repetitive web tasks: filling forms, navigating multi-step processes, extracting information from websites, and performing routine online transactions. This is particularly valuable when websites do not offer APIs or when the task requires interacting with multiple sites. Browser-native agents like Claude for Chrome, ChatGPT Atlas, and Chrome Auto Browse target this category directly by acting on the user's logged-in browser session.
Computer use agents can serve as automated testers, navigating through application interfaces to verify that features work correctly. Because they interact with the GUI the same way users do, they can catch visual bugs and usability issues that unit tests and API tests miss. Replit's early adoption of Claude computer use for evaluating apps during the build process is one example of this approach.
Organizations use computer use agents to transfer data between systems that lack integration, especially legacy systems that only offer GUI access. An agent can read data from one application, navigate to another, and enter the data, handling the tedious work that would otherwise require manual effort.
Computer use technology has potential applications in accessibility, helping users with motor disabilities interact with computer interfaces through natural language commands rather than precise mouse and keyboard actions.
With Mac computer use in Cowork, Claude can compile competitive analyses, gather data from multiple local files and web sources, populate spreadsheets, and produce reports while the user focuses on other work. The Dispatch feature extends this by allowing users to assign such tasks from a phone and retrieve finished work later. Mistral Le Chat Work mode, ChatGPT agent, and Gemini Agent target the same knowledge-work segment from different angles: connector-first integrations, cloud-hosted virtual environments, and Workspace-integrated agents respectively.
IT teams are exploring computer use agents for routine system administration tasks: configuring software, running diagnostics, and following standard operating procedures.
Anthropic's restricted Claude Mythos Preview is being used by critical-infrastructure operators in Project Glasswing to autonomously find zero-day vulnerabilities, including 27-year-old and 17-year-old remote-code-execution bugs in OpenBSD and FreeBSD discovered during pre-launch testing.[33] These workloads combine deep code reasoning with computer use to drive penetration-testing tools through their GUIs.
As of May 2026, computer use is an active and rapidly advancing area of AI development. The field has moved from research prototypes to consumer products in roughly 19 months.
On the OSWorld-Verified benchmark, the open-weight Holo3-35B-A3B from H Company leads at 82.6%, the first open-weight model to top the leaderboard. Claude Mythos Preview (79.6%) is the highest-scoring general-purpose model but is restricted to invitation-only access. Among generally available frontier models, GPT-5.5 (78.7%) and Claude Opus 4.7 (78.0%) sit within a point of each other. Both are above the human baseline of 72.36%. Browser-specific benchmarks like WebVoyager still favor OpenAI's agent mode with approximately 87% success rates. The harder Online-Mind2Web benchmark, which tests 300 tasks on 136 live high-traffic websites, shows much lower success rates (Operator at 61.3% human-evaluated; most others 28-30%), highlighting that older benchmarks may overstate real-world reliability.
The competitive landscape is intensifying. Anthropic, OpenAI, Google, Mistral, and specialized startups like H Company and Simular are all investing heavily in computer use capabilities. Open-weight models are now competitive with closed frontier models on GUI tasks, a development that mirrors the earlier rise of open-weight coding and reasoning models. Each new model generation continues to improve computer use performance significantly.
Key trends for 2026 include:
The technology is not yet reliable enough for fully unsupervised use in high-stakes scenarios, but it is already practical for supervised automation of routine tasks. The gap between current capabilities and full reliability is expected to continue narrowing as models improve and safety tooling matures.