Computer use refers to the capability of artificial intelligence models to control computers by viewing screens and performing mouse and keyboard actions, much like a human user would. Rather than interacting through APIs or command-line interfaces, computer use agents perceive the graphical user interface (GUI) through screenshots, reason about what they see, and execute actions such as clicking buttons, typing text, scrolling, and navigating between applications.
The concept emerged as a practical product in late 2024, when Anthropic released computer use capabilities for Claude 3.5 Sonnet on October 22, 2024, making it the first major AI model to offer autonomous desktop control through a public API. OpenAI followed with Operator in January 2025, and Google introduced Project Mariner as a browser-based agent in December 2024. By early 2026, computer use has become a competitive frontier in AI development, with agents reaching and occasionally exceeding human performance on standardized benchmarks.
Before the current generation of computer-use agents, researchers explored various ways to have AI interact with graphical interfaces. Early approaches included screen scraping (extracting text from rendered interfaces), accessibility API integration (using OS-level accessibility trees to understand interface structure), and Selenium-style browser automation (programmatically controlling web browsers through DOM manipulation). These methods were brittle, requiring specific knowledge of each application's internal structure, and broke whenever interfaces changed.
The vision capabilities of modern multimodal language models made a fundamentally different approach possible. Instead of relying on structured access to interface elements, a vision-capable model could simply look at a screenshot and understand what was on screen, the same way a person does.
Anthropic launched computer use as a public beta on October 22, 2024, alongside the updated Claude 3.5 Sonnet model. This was the first time a major AI company offered a production-grade API for autonomous desktop control. The announcement described the capability as allowing developers to "direct Claude to use computers the way people do, by looking at a screen, moving a cursor, clicking buttons, and typing text." The feature was available through the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI from day one.
At launch, Anthropic was candid about the limitations. The company noted that computer use was "at times cumbersome and error-prone" and that Claude could miss short-lived notifications, struggle with scrolling, and have difficulty with precise cursor placement. Despite these limitations, the release generated significant interest from developers building automation workflows.
Six organizations were highlighted as early adopters of Claude's computer use at launch:
| Company | Use Case |
|---|---|
| Asana | Automating project management workflows |
| Canva | Testing and automating design tool interactions |
| Cognition | Integrating desktop control into AI development agents |
| DoorDash | Automating internal operational processes |
| Replit | Evaluating apps as they are built within Replit Agent, using Claude's UI navigation capabilities to test applications during the development process |
| The Browser Company | Automating web-based workflows; the company noted that Claude 3.5 Sonnet outperformed every model they had previously tested for this purpose |
These companies were executing tasks requiring "dozens, and sometimes even hundreds, of steps," demonstrating the potential for complex multi-step automation even in the beta's early state.
Anthropic published a companion blog post titled "Developing a computer use model" alongside the October 2024 launch, detailing the training approach. The company's previous work on tool use and multimodality provided the foundation. Combining these abilities, Anthropic trained Claude to interpret what was happening on a screen and then use available software tools to carry out tasks.
The training itself was deliberately constrained for safety reasons. Claude was trained on only a few pieces of simple software, such as a calculator and a text editor. Internet access was intentionally prohibited during training. Despite this narrow training scope, the team was surprised by how rapidly Claude generalized to handle diverse applications it had never seen during training. This suggested that the model developed generalizable spatial reasoning rather than memorizing specific interface layouts.
A key technical challenge was teaching Claude to count pixels accurately. Without this ability to calculate cursor movement distances, the model struggled with mouse commands. This parallels how large language models often have difficulty with seemingly simple tasks like counting letter occurrences in words.
The model also demonstrated self-correction behavior, retrying tasks when it encountered obstacles rather than simply failing. Anthropic described Claude's perception mechanism as a "flipbook" approach: taking screenshots, analyzing what is visible, then issuing sequential commands. This differs from continuous video observation, which creates limitations in detecting brief notifications or rapidly changing UI elements.
OpenAI launched Operator on January 23, 2025, powered by the Computer-Using Agent (CUA) model. CUA combined GPT-4o's vision capabilities with advanced reasoning trained through reinforcement learning.
Unlike Anthropic's approach, which provided a low-level API for developers to build their own computer use applications, Operator was a consumer-facing product. It ran in a secure virtual browser environment hosted by OpenAI, meaning users did not need to set up their own sandboxed environments. Operator was initially available to ChatGPT Pro subscribers (the $200-per-month tier) in the United States.
CUA uses a three-phase cycle: perception (screenshots are added to the model's context), reasoning (the model uses chain-of-thought to evaluate observations, track intermediate steps, and adapt dynamically), and action (the model performs clicks, scrolling, and typing until the task is completed or user input is needed). CUA is trained to interact with GUIs directly, without relying on OS-specific or web-specific APIs. However, because OpenAI hosts the execution environment in a sandboxed virtual browser, CUA is limited to browser-based tasks, unlike Anthropic's approach which supports full desktop control.
By July 2025, OpenAI integrated Operator's capabilities directly into ChatGPT as "agent mode," and the standalone Operator product was deprecated in August 2025. This move made computer use accessible to a broader audience through the familiar ChatGPT interface.
Google DeepMind unveiled Project Mariner on December 11, 2024, as a research prototype for AI-powered web browsing. Powered by Gemini 2.0, Project Mariner could navigate websites, click buttons, fill forms, conduct searches, and complete online tasks autonomously.
The initial release was limited to a select group of testers. At Google I/O 2025, Google expanded access and announced several upgrades. Project Mariner was updated to run on virtual machines in the cloud (similar to OpenAI's approach), allowing users to continue their own work while the agent handled up to 10 tasks simultaneously in the background. A "Teach and Repeat" feature let users demonstrate workflows that the agent could then replicate.
Access to Project Mariner was tied to Google's AI Ultra plan, priced at $249.99 per month. Google also made Mariner's capabilities available through the Gemini API and Vertex AI for developers building their own applications. Like CUA, Project Mariner is limited to browser-based interactions and does not provide full desktop control.
On March 23, 2026, Anthropic introduced native computer use for macOS, available in both Claude Cowork and Claude Code. This release marked a shift from the API-only approach of the 2024 beta to an integrated consumer product where Claude can directly control a user's Mac desktop.
The Mac computer use feature follows a prioritized tooling hierarchy:
Anthropic acknowledged that screen interaction is slower than using connectors. When Claude works through the screen instead of a direct integration, tasks take longer. Complex tasks sometimes need a second attempt, and the company described the feature as a research preview rather than a production-ready tool.
The Mac computer use feature launched alongside Dispatch, a companion capability that lets users assign tasks to Claude from a mobile device. Dispatch was first released on March 17, 2026, with Claude Max subscribers receiving access first, followed by Pro subscribers within days. Users can scan a QR code from the Claude mobile app to link their phone to Claude Desktop, then assign tasks while away from the computer. Claude executes the work on the desktop and delivers the finished output when the user returns. The computer must remain powered on, as Dispatch functions as a remote control rather than a cloud computing service.
Mac computer use is initially available to Claude Pro and Claude Max subscribers on macOS.
All major computer use implementations follow a similar perception-reasoning-action loop, though the specific implementations differ.
The core mechanism is a cycle that repeats until the task is complete:
This loop continues until the model determines that the task is complete or that it needs human input to proceed.
Anthropic's computer use tool is exposed through the Claude API as a special tool type. Developers define a computer use tool with a specified screen resolution and pass it to the model alongside their prompt. Claude then requests tool calls that the developer's application executes in the host environment. The tool is schema-less, meaning the input schema is built into Claude's model and cannot be modified by the developer.
The computer use tool remains in beta and requires a specific beta header in API requests. Different Claude model generations use different tool versions:
| Model | Tool Version | Beta Header |
|---|---|---|
| Claude Opus 4.6, Claude Sonnet 4.6, Claude Opus 4.5 | computer_20251124 | computer-use-2025-11-24 |
| Sonnet 4.5, Haiku 4.5, Opus 4.1, Sonnet 4, Opus 4, Sonnet 3.7 | computer_20250124 | computer-use-2025-01-24 |
The available actions have expanded across tool versions:
| Action Category | Actions | Availability |
|---|---|---|
| Basic actions | screenshot, left_click, type, key, mouse_move | All versions |
| Enhanced mouse actions | right_click, middle_click, double_click, triple_click, left_click_drag | computer_20250124 and later |
| Fine-grained control | left_mouse_down, left_mouse_up, hold_key, wait | computer_20250124 and later |
| Scrolling | scroll (with direction and amount control) | computer_20250124 and later |
| Detailed inspection | zoom (view a specific screen region at full resolution) | computer_20251124 only |
Anthropic provides a reference implementation with a Docker container, web interface, and example tool implementations hosted on GitHub. The developer is responsible for setting up the execution environment, which typically involves a virtual machine or container with a virtual X11 display server (Xvfb) that renders the desktop interface Claude sees through screenshots. The computing environment includes a lightweight Linux desktop with a window manager (Mutter), a panel (Tint2), and pre-installed applications like Firefox, LibreOffice, and text editors.
When the developer provides the computer use tool, Anthropic's API generates a specialized system prompt that tells Claude it has access to a sandboxed computing environment. The developer's own system prompt is incorporated alongside this generated prompt.
OpenAI's Computer-Using Agent follows the same general perception-reasoning-action pattern but is architecturally different. CUA runs entirely in a secure virtual browser hosted by OpenAI, so users do not need to manage their own sandboxing infrastructure. This makes setup easier, particularly for non-technical users, but restricts CUA to browser-based tasks. Anthropic's approach supports full desktop control, including terminal commands, native applications, and file system operations.
The three major computer use platforms differ in scope, architecture, and target audience.
| Feature | Anthropic Claude | OpenAI Operator/CUA | Google Project Mariner |
|---|---|---|---|
| Launch date | October 22, 2024 | January 23, 2025 | December 11, 2024 |
| Environment | Full desktop (user-managed VM/container), or native Mac via Cowork | Cloud-hosted virtual browser | Cloud-hosted virtual browser |
| Scope | Desktop apps, terminal, browser, file system | Browser-only | Browser-only |
| Access model | API for developers; Cowork/Code for consumers | ChatGPT subscription (formerly Pro, now Plus and above) | AI Ultra subscription ($249.99/month); also Gemini API |
| Multi-task | Single task per session (Cowork), multiple via API | Single task (agent mode) | Up to 10 tasks in parallel |
| Setup complexity | Developer-oriented (API); simple for Cowork/Code | Minimal (cloud-hosted) | Minimal (cloud-hosted) |
| WebVoyager score | ~56% (Sonnet 4.5 era); improving with newer models | 87% | 83.5% |
| OSWorld score | 72.7% (Opus 4.6, early 2026) | 38.1% (CUA, January 2025) | Not published |
OSWorld is the primary benchmark for evaluating computer use agents. Created by researchers at Carnegie Mellon University and other institutions, it was first published in April 2024 and accepted as a paper at NeurIPS 2024.
OSWorld provides a real computer environment (not a simulation) for testing multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS. The benchmark includes 369 computer tasks involving real web and desktop applications, OS file operations, and workflows that span multiple applications.
Unlike earlier benchmarks that tested agents on simplified or simulated interfaces, OSWorld requires agents to interact with actual operating systems and real software. Tasks range from simple file management to complex multi-application workflows.
OSWorld has become the standard yardstick for measuring progress in computer use. The improvement in scores over a short period illustrates how rapidly the field is advancing.
| Date | Score | Agent/Model | Notes |
|---|---|---|---|
| Mid-2024 | ~12% | GPT-4V-based agents | Early attempts with vision models |
| October 2024 | 14.9% | Claude 3.5 Sonnet | Anthropic's initial computer use release; next-best system scored 7.8% |
| October 2024 | 22.0% | Claude 3.5 Sonnet (with extra steps) | Improved score when given additional steps to complete tasks |
| January 2025 | 38.1% | OpenAI CUA | Major jump with Operator launch |
| Mid-2025 | ~42.2% | Claude Sonnet 3.6 | Continued Anthropic improvements |
| Late 2025 | 61.4% | Claude Sonnet 4.5 | Significant generational leap |
| December 2025 | 72.6% | Agent S (Simular) | First system to exceed human baseline of 72.36% |
| February 2026 | 72.5% | Claude Sonnet 4.6 | Effectively tied with flagship model |
| February 2026 | 72.7% | Claude Opus 4.6 | Highest published score; GPT-5.2 scored 38.2% on the same benchmark |
The human baseline on OSWorld is 72.36%, established through testing with human participants completing the same tasks. Claude Opus 4.6 (released February 5, 2026) achieved 72.7%, making it the top-performing model on the benchmark and one of the first to match human-level performance.
The community has developed several variants of the benchmark:
Beyond OSWorld, several other benchmarks evaluate computer use capabilities:
| Benchmark | Focus | Notable Scores |
|---|---|---|
| WebArena | Autonomous web navigation on real websites | CUA: 58.1%; Claude achieves state-of-the-art among single-agent systems |
| WebVoyager | Real-world web task completion | CUA: 87%; Project Mariner: 83.5% |
| ScreenSpot | GUI element identification and grounding | Various models tested |
| Mind2Web | Web task generalization across sites | Used for cross-site transfer evaluation |
Computer use introduces safety risks that go beyond those of traditional chatbot interactions, because the model is taking real actions in a real environment.
One of the most serious risks is prompt injection through on-screen content. Because the model reads and interprets everything visible on screen, malicious content on a website or in a document could instruct the agent to take unintended actions. For example, hidden text on a web page could instruct the agent to navigate to a different site and enter sensitive information.
Anthropic has addressed this with automatic classifiers that run on prompts to flag potential prompt injection in screenshots. When the classifiers detect a potential injection, they steer the model to ask for user confirmation before proceeding. However, this defense is not perfect, and Anthropic recommends additional precautions.
Computer use agents can take actions with real consequences: making purchases, sending emails, deleting files, or modifying settings. If an agent misinterprets a task or encounters an error, the consequences can be difficult to reverse. This is compounded by the fact that agents act autonomously, making it harder for humans to intervene before failures cause harm.
The International AI Safety Report (2026) specifically highlighted computer use agents as a category requiring careful governance, noting that "advances in how developers combine AI models with tools have enabled the development of increasingly powerful AI agents given access to tools such as memory, a computer interface, and web browsers, helping them autonomously interact with the world."
Anthropic's documentation recommends several safety measures for computer use deployments through the API:
OpenAI's Operator addressed some of these concerns architecturally by running in a sandboxed virtual browser rather than on the user's actual computer, limiting the potential damage from errors or prompt injection.
The March 2026 Mac computer use release in Claude Cowork introduced a layered safety model specific to consumer desktop use:
Cross-application effects remain a challenge. If Claude clicks a link in one application, that link will open in the default browser even if the user has not explicitly granted Claude permission to use that browser.
Computer use agents in their current state (early 2026) are capable of:
Despite rapid progress, computer use agents face several persistent limitations:
Computer use costs vary depending on the provider and the complexity of the task.
| Provider | Model/Product | Pricing Model | Approximate Cost |
|---|---|---|---|
| Anthropic | Claude API (computer use) | Per-token API pricing | Claude Sonnet 4.6: $3 input / $15 output per million tokens; Claude Opus 4.6: $5 input / $25 output per million tokens; each step adds tool overhead plus screenshot tokens |
| Anthropic | Claude Cowork (Mac computer use) | Subscription | Included with Claude Pro ($20/month) and Claude Max plans |
| OpenAI | ChatGPT agent mode | Subscription | Included with ChatGPT Plus ($20/month) and higher tiers |
| Project Mariner | Subscription | Included with AI Ultra plan ($249.99/month); also available through Gemini API |
For Anthropic's API-based approach, the cost per task depends on the number of steps required and the resolution of screenshots sent to the model. A simple task requiring 10 steps might cost a few cents, while a complex multi-application workflow with 100+ steps could cost several dollars. The Batch API offers a 50% discount on both input and output tokens for asynchronous processing, and prompt caching reduces the cost of repeated context by 90%.
The most common application of computer use is automating repetitive web tasks: filling forms, navigating multi-step processes, extracting information from websites, and performing routine online transactions. This is particularly valuable when websites do not offer APIs or when the task requires interacting with multiple sites.
Computer use agents can serve as automated testers, navigating through application interfaces to verify that features work correctly. Because they interact with the GUI the same way users do, they can catch visual bugs and usability issues that unit tests and API tests miss. Replit's early adoption of Claude computer use for evaluating apps during the build process is one example of this approach.
Organizations use computer use agents to transfer data between systems that lack integration, especially legacy systems that only offer GUI access. An agent can read data from one application, navigate to another, and enter the data, handling the tedious work that would otherwise require manual effort.
Computer use technology has potential applications in accessibility, helping users with motor disabilities interact with computer interfaces through natural language commands rather than precise mouse and keyboard actions.
With Mac computer use in Cowork, Claude can compile competitive analyses, gather data from multiple local files and web sources, populate spreadsheets, and produce reports while the user focuses on other work. The Dispatch feature extends this by allowing users to assign such tasks from a phone and retrieve finished work later.
IT teams are exploring computer use agents for routine system administration tasks: configuring software, running diagnostics, and following standard operating procedures.
As of March 2026, computer use is an active and rapidly advancing area of AI development. The field has moved from research prototypes to consumer products in roughly 18 months.
On the OSWorld benchmark, Claude Opus 4.6 leads with a score of 72.7%, effectively matching the human baseline of 72.36%. Claude Sonnet 4.6 achieves 72.5%, putting both models at roughly human-level performance on desktop automation tasks. By comparison, GPT-5.2 scored 38.2% on the same benchmark. On browser-specific benchmarks like WebVoyager, OpenAI's agent mode leads with approximately 87% success rates.
The competitive landscape is intensifying. Anthropic, OpenAI, and Google are all investing heavily in computer use capabilities, and smaller companies like Simular are pushing state-of-the-art performance on benchmarks (Agent S reached 72.6% on OSWorld in December 2025). The release of new model generations continues to improve computer use performance significantly with each iteration.
Key trends for 2026 include the shift toward native desktop integration (Anthropic's Mac computer use in Cowork), cloud-hosted execution environments becoming the norm for browser-based agents (OpenAI, Google), integration of computer use into mainstream AI products (ChatGPT agent mode, Claude Cowork), multi-task parallelism (Project Mariner handling 10 simultaneous tasks), mobile-to-desktop task delegation (Claude Dispatch), and the development of standardized benchmarks and safety frameworks.
The technology is not yet reliable enough for fully unsupervised use in high-stakes scenarios, but it is already practical for supervised automation of routine tasks. The gap between current capabilities and full reliability is expected to continue narrowing as models improve and safety tooling matures.