A computer-use agent (CUA) is a type of software agent in artificial intelligence that performs tasks by directly operating a general-purpose computer's graphical user interface (GUI) the way a human does: "seeing" the screen, moving a cursor, clicking, typing, and interacting with windows and applications.[1] Unlike tool-calling approaches that rely on predefined APIs, CUAs aim to generalize across arbitrary software by treating the computer itself as the universal interface.[2] They represent an advancement in AI agent technology, combining computer vision, natural language processing, and reinforcement learning to handle open-ended tasks.[3]
CUAs typically combine a large language model (LLM) with computer vision and an action executor (for example, a virtual machine or a remote desktop session), enabling end-to-end perception, reasoning, and control loops.[4][5] Early deployments are experimental and can be error-prone, but rapid progress since 2022 has made "computer use" a central paradigm in building autonomous agents for software workflows.[1][6]
Typical CUA designs follow one of several architectural approaches:
| Approach | Description | Advantages | Limitations |
|---|---|---|---|
| Pure Vision | Relies solely on visual interpretation of screen pixels | Platform-agnostic, works with any GUI | May struggle with complex layouts |
| DOM-Enhanced | Combines vision with web page structure analysis | Higher accuracy for web tasks | Limited to browser environments |
| Hybrid Systems | Integrates multiple data sources including OS APIs | Most accurate and reliable | Platform-specific implementations required |
| Container-Based | Runs in isolated virtual environments | Enhanced security and scalability | Additional infrastructure overhead |
| Year | Milestone |
|---|---|
| 2022 | Adept introduces ACT-1, a transformer trained to use digital tools via a Chrome extension, an early demonstration of end-to-end GUI action from model outputs[10] |
| November 2023 | The open-source Self-Operating Computer framework by OthersideAI shows a multimodal model operating a desktop using the same inputs/outputs as a human (pixels and mouse/keyboard)[7] |
| 2024 | Frameworks like LaVague and Skyvern emerged, combining LLMs with vision for web agent automation[11] |
| October 22, 2024 | Anthropic publicly announces "computer use" in beta for Claude 3.5 Sonnet, enabling on-screen control (look, move cursor, click, type) via API, marking the first major commercial implementation[1] |
| January 23, 2025 | OpenAI publishes a formal description of a Computer-Using Agent and provides a documented Computer Use tool that runs a continuous observe-plan-act loop, introduced as part of "Operator" research preview[2][4] |
| February 24, 2025 | Anthropic releases Claude 3.7 Sonnet with improved computer use capabilities and extended thinking mode[12] |
| March 2025 | Azure OpenAI documents "Computer Use (preview)" for building agents that interact with computer UIs; major cloud providers publish prescriptive guidance patterns[6][5] |
| March 2025 | Simular AI releases Agent S2, an open-source modular framework outperforming proprietary CUAs on benchmarks like OSWorld[13] |
| September 2025 | Anthropic releases Claude Sonnet 4.5, achieving state-of-the-art 61.4% success rate on OSWorld benchmark and 77.2% on SWE-bench Verified[14] |
CUAs operate through an iterative loop of perception, reasoning, and action. Many implementations expose a loop in which the agent captures a screenshot, reasons about the next step toward the goal, and emits a low-level action such as click(x,y) or type("text"), repeating until the goal is reached or a stop condition fires.[4]
Public SDKs document low-level actions such as click(x,y), type(text), and clipboard/file operations, executed by a host process controlling a VM or remote session. This loop allows CUAs to handle tasks requiring dozens of steps, such as form filling or software testing.[16] Because perception is screenshot-based rather than continuous video, scrolling, zooming, and short-lived UI elements remain challenging.[15]
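The observe-plan-act loop described above can be sketched in a few lines. `Model` and `Executor` here are hypothetical stand-ins for an LLM client and a VM or remote-session controller; real SDKs expose analogous primitives.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def run_agent(model, executor, goal, max_steps=50):
    """Iterate observe -> plan -> act until the model signals completion
    or the step budget (a stop condition) is exhausted."""
    for _ in range(max_steps):
        screenshot = executor.screenshot()      # observe: capture pixels
        action = model.plan(goal, screenshot)   # plan: model proposes next action
        if action.kind == "done":               # stop condition fired
            return True
        elif action.kind == "click":
            executor.click(action.x, action.y)  # act: low-level click(x, y)
        elif action.kind == "type":
            executor.type(action.text)          # act: low-level type(text)
    return False                                # step budget exhausted
```

The `max_steps` cap is one common stop condition; production agents also stop on errors, user interrupts, or model refusals.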
Anthropic released computer use capabilities in beta with Claude 3.5 Sonnet in October 2024, allowing developers to direct Claude to use computers through the Anthropic API.[1] The implementation includes specialized tools: a computer tool for screenshots and mouse/keyboard control, a text-editor tool, and a bash (shell) tool.
Training focused on simple software like calculators and text editors, with restricted internet access for safety. Anthropic's research emphasized pixel counting for accuracy, with generalization from limited examples.[15] Early adopters included companies like Asana, Canva, and DoorDash, using it for multi-step automation.[1]
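A request using these tools can be sketched as follows; the snippet only constructs the payload, since actually sending it requires the `anthropic` SDK and an API key. Tool type strings follow the October 2024 beta naming; the user task is invented for illustration.

```python
# Sketch of a computer-use request for the October 2024 beta.
# Sending it would look roughly like:
#   client.beta.messages.create(**request, betas=["computer-use-2024-10-22"])

request = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "tools": [
        {
            "type": "computer_20241022",       # screenshots, mouse, keyboard
            "name": "computer",
            "display_width_px": 1024,          # virtual display resolution
            "display_height_px": 768,
        },
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"},
    ],
    "messages": [
        {"role": "user", "content": "Open the calculator and compute 2 + 2."}
    ],
}
```

The host process then executes each tool call the model emits (e.g. a screenshot or a click) and returns the result, continuing the loop.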
Claude Sonnet 4.5, released in September 2025, represents the current state-of-the-art with a 61.4% success rate on OSWorld benchmark, a significant improvement from the 14.9% achieved by the October 2024 version.[14]
OpenAI introduced the Computer-Using Agent (CUA) in January 2025 as part of its "Operator" research preview, built on GPT-4o's vision capabilities with advanced reasoning.[2][18] At release, the CUA model achieved a 38.1% success rate on the OSWorld benchmark, along with strong results on the web-navigation benchmarks WebArena and WebVoyager.
The implementation uses reinforcement learning for reasoning and handles GUI interactions via screenshots. It is integrated into Operator and requires user confirmation for sensitive actions.[18]
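The confirmation requirement for sensitive actions can be illustrated with a simple gating wrapper. All names here are hypothetical; commercial implementations surface the confirmation in the product UI rather than a callback.

```python
# Hypothetical illustration of a user-confirmation gate: sensitive
# actions run only after an explicit approval callback returns True.

SENSITIVE = {"submit_payment", "delete_file", "send_email"}

def execute_with_gate(action, params, do_action, confirm):
    """Run `do_action` directly for routine actions; require `confirm`
    to approve anything in the SENSITIVE set first."""
    if action in SENSITIVE and not confirm(action, params):
        return "blocked"
    return do_action(action, params)
```

Routine actions (clicks, typing) pass through untouched, so the gate adds friction only where the risk is.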
Microsoft announced Computer-Using Agent capabilities in Azure AI Foundry in March 2025, integrated with the Responses API. The implementation focuses on enterprise integration with Windows 365 and Azure Virtual Desktop.[6]
| Framework | Description | Release Date | Key Features |
|---|---|---|---|
| Self-Operating Computer | Vision-based computer control | November 2023 | Screenshot analysis, basic automation, multimodal control[7] |
| OpenInterpreter | General-purpose control with Python | 2024 | Extensible, LLM integration[11] |
| Agent S2 | Modular framework for GUIs | March 2025 | Hierarchical planning, 34.5% OSWorld score[13] |
| LaVague | Web agent framework | 2024 | Modular architecture, vision + LLMs[11] |
| Skyvern | Browser workflow automation | 2024 | HTML extraction, task automation[11] |
| Cua Framework | Containerized environments for CUAs | 2025 | Docker-like deployment, OS virtualization[19] |
| Browser-Use | Web-specific agent | 2025 | 89.1% WebVoyager success rate, DOM + vision[20] |
| UFO Agents | Windows-specific control | 2025 | Windows API integration, enhanced accuracy[21] |
| AutoGen | Distributed agent framework | 2024 | Multi-agent coordination[11] |
| NatBot | Browser-specific automation | 2024 | GPT-4 Vision integration[11] |
Researchers have proposed interactive benchmarks to evaluate CUAs in realistic settings.
OSWorld is a comprehensive benchmark for evaluating multimodal agents in real computer environments across Ubuntu, Windows, and macOS. It includes 369 tasks involving real web and desktop applications, file I/O operations, and cross-application workflows.[9]
| Model | Success Rate | Multi-step Score | Notes |
|---|---|---|---|
| Human Performance | 72.4% | N/A | Baseline human capability |
| Claude Sonnet 4.5 | 61.4% | N/A | Current state-of-the-art (September 2025)[14] |
| OpenAI CUA | 38.1% | N/A | January 2025 release[2] |
| Agent S2 | 34.5% | N/A | 50-step configuration[13] |
| Claude 3.5 Sonnet | 14.9% | 22.0% | October 2024 version[1] |
| Previous Best (2024) | 12.0% | N/A | Prior to CUA models |
OSWorld-Human provides annotated trajectories with human optimal steps. Across 16 agents tested, even the best took 1.4–2.7× the human step count on average, indicating significant efficiency gaps.[22]
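The efficiency gap OSWorld-Human reports is a ratio of agent steps to human-optimal steps; computing it is straightforward. The trajectories below are illustrative, not benchmark data.

```python
def step_efficiency(agent_steps, human_steps):
    """Ratio of an agent's step count to the human-optimal count.
    1.0 means human-level efficiency; OSWorld-Human reports averages
    of 1.4-2.7x even for the best agents."""
    if human_steps <= 0:
        raise ValueError("human step count must be positive")
    return agent_steps / human_steps

# Illustrative per-task step counts (not benchmark data).
agent = [14, 27, 9]
human = [10, 10, 6]
ratios = [step_efficiency(a, h) for a, h in zip(agent, human)]
avg = sum(ratios) / len(ratios)   # mean inefficiency across tasks
```

Averaging per-task ratios (rather than dividing total steps) keeps short and long tasks equally weighted, which is one plausible reading of the reported averages.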
WebArena evaluates web browsing agents using self-hosted open-source websites that simulate real-world scenarios in e-commerce, content management systems, and social platforms. It tests complex, multi-step web interactions offline.[23]
WebVoyager tests agent performance on live websites including Amazon, GitHub, and Google Maps, evaluating real-world web navigation and task completion capabilities. The benchmark includes 586 diverse web tasks.[24]
macOSWorld introduces the first comprehensive macOS benchmark with 202+ multilingual interactive tasks. It reports distinct performance tiers with >30% success for some proprietary CUAs in its evaluations.[25]
AndroidWorld extends evaluation to mobile GUIs, testing agents on interactive tasks in a live Android environment.
CUAs automate repetitive tasks in domains such as data entry, form filling, software testing, and customer-support workflows.
Companies like DoorDash use CUAs for internal processes requiring hundreds of steps, while Replit uses Anthropic's tool for code evaluation.[1]
CUAs are susceptible to prompt injection attacks where malicious instructions embedded in content can override intended behavior. This vulnerability is particularly concerning as CUAs can execute actions on behalf of users.[17]
| Strategy | Description | Effectiveness |
|---|---|---|
| Containerization | Run CUAs in isolated virtual machines or Docker containers | High for system isolation |
| Least Privilege | Restrict CUA access to minimum necessary resources | Medium-High for damage limitation |
| Human Oversight | Require approval for sensitive operations | High for critical actions |
| Input Validation | Filter and sanitize user inputs and external content | Medium, not foolproof |
| Monitoring | Track CUA actions and detect anomalous behavior | High for incident response |
| Classifiers | Detect harmful content and restrict actions | Medium-High for known threats |
| Blocklists | Prevent access to sensitive domains/applications | High for defined restrictions |
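Two of the mitigations above, blocklists and least privilege, compose naturally into a single policy check before any action executes. The domains and action names below are hypothetical.

```python
from urllib.parse import urlparse

# Sketch of two mitigations from the table above (names hypothetical):
# a least-privilege action allowlist and a domain blocklist.

BLOCKED_DOMAINS = {"bank.example.com", "mail.example.com"}
ALLOWED_ACTIONS = {"screenshot", "click", "type", "scroll"}

def check_action(action, url=None):
    """Return True only if the action passes both policy checks."""
    if action not in ALLOWED_ACTIONS:          # least privilege
        return False
    if url is not None:
        host = urlparse(url).hostname or ""
        if host in BLOCKED_DOMAINS:            # blocklist
            return False
    return True
```

Allowlisting actions (deny by default) limits damage even when a prompt injection succeeds, since the injected instructions still cannot invoke shell access or other unlisted capabilities.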
Organizations deploying CUAs should implement layered mitigations such as those above, combining isolation, least privilege, and human oversight.[29][15]
Anthropic implements classifiers to detect harm, restrictions on election-related tasks, and ASL-2 compliance.[15] OpenAI includes refusals for harmful tasks, blocklists, user confirmations, and evaluations against frontier risks like autonomous replication.[2]
Independent evaluations and benchmark studies report that state-of-the-art CUAs still struggle with robust GUI grounding, long-horizon plans, and operational knowledge of unfamiliar applications.[9][25][22]
Users may also find CUA interfaces confusing and see no clear benefit over traditional tools.[31]
Industry leaders have outlined several areas for further advancement.
The llms.txt proposal suggests a standardized format for websites to provide AI-readable information, potentially improving CUA reliability while maintaining human usability.[21] This would allow websites to expose structured data specifically for AI consumption.
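A minimal llms.txt illustrates the proposed structure: an H1 site name, a blockquote summary, and H2 sections listing links with short descriptions. The site and links below are invented for illustration.

```markdown
# Example Store

> An online store selling widgets; this summary and the links below give
> AI agents a structured entry point to the site.

## Docs

- [Checkout flow](https://example.com/docs/checkout.md): steps to place an order
- [Returns policy](https://example.com/docs/returns.md): how refunds work
```

The file is served at the site root (`/llms.txt`), letting agents fetch a concise, markdown-readable map instead of parsing rendered pages.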
Anticipated developments include greater modularity and scalability: open-source efforts like Agent S2 emphasize modular design,[13] and by mid-2025 CUAs were widely described as foundational for "agentic coworkers."[3]
Organizations implementing CUAs report operational improvements from automating multi-step workflows. At the same time, the technology is reshaping workplace dynamics and raises ethical questions, including how automated GUI control affects jobs, who is accountable for agent errors, and whether it should be disclosed when an agent rather than a human is operating software.