A computer-use model is a specialized type of artificial intelligence model that enables autonomous agents to interact with graphical user interfaces (GUIs) by perceiving screen content and executing actions like clicking, typing, and scrolling, similar to how humans use computers.[1][2] These models represent a significant advancement in AI agents, allowing them to control computers through visual understanding rather than programmatic APIs, making them capable of automating complex digital tasks across various applications and operating systems.[3]
Overview
Computer-use models combine vision-language models (VLMs) with reinforcement learning capabilities to understand and interact with computer screens through pixel-level visual processing.[2] Unlike traditional automation approaches that require specific APIs or scripting for each application, computer-use models can control any software that has a graphical interface, using the same visual cues and input methods that humans use.[3] This universal approach makes them particularly valuable for tasks that span multiple applications or require interaction with legacy systems that lack modern APIs.[4]
While many agent systems integrate through structured APIs, a large portion of digital work still happens in GUIs, including form filling, dashboards, and behind-login workflows. Computer-use models address this gap by powering agents that operate like human users, navigating web pages and applications by clicking, typing, and scrolling.[3]
The fundamental innovation of computer-use models is their ability to translate high-level instructions into low-level computer actions by:
Perceiving screen content through screenshot analysis
Understanding the spatial layout and purpose of UI elements
Generating appropriate mouse and keyboard actions
Adapting to dynamic changes in the interface
Learning from feedback to improve performance over time[5]
History
The concept of computer-use models emerged as part of the broader development of multimodal large language models (LLMs) capable of processing visual inputs. Early research focused on visual question answering and image captioning, but by 2024, advancements allowed models to actively control UIs.
The first public beta of a computer-use model was introduced by Anthropic on October 22, 2024, with an upgraded Claude 3.5 Sonnet model featuring "computer use" capabilities. This allowed Claude to perceive screens and perform actions like cursor movement and typing.[1][6]
In July 2025, OpenAI released a preview of its Computer Use tool via Azure OpenAI, enabling models to interact with browsers, desktops, and applications across operating systems like Windows, macOS, and Ubuntu.[7]
On October 7, 2025, Google DeepMind announced the Gemini 2.5 Computer Use model, built on Gemini 2.5 Pro, optimized primarily for web browsers and mobile UIs. The model became available through the Gemini API via Google AI Studio and Vertex AI.[3][8]
Technical Architecture
Core Components
Computer-use models typically consist of several integrated components working in an iterative loop:[3][9]
Visual Perception Module: Processes screenshots using convolutional neural networks or vision transformers to understand screen content
Language Understanding Module: Interprets user instructions and maintains context using large language models
Action Planning Module: Uses chain-of-thought reasoning to decompose tasks into executable steps[2]
Action Execution Module: Translates high-level decisions into specific UI actions (clicks, keystrokes, scrolls)
Feedback Processing Module: Evaluates action results and adjusts strategy based on observed changes
Agent Loop
At a high level, agents using computer-use models follow a repeated loop:[9]
Send Request: The application invokes the Computer Use tool with the user's goal, a screenshot of the current GUI, the current URL, and optionally recent action history and constraints (for example excluded actions)
Receive Model Response: The model analyzes these inputs and generates a response, typically containing one or more function calls representing UI actions (for example open browser, click, type) and may include a safety decision (for example "requires confirmation")
Execute Actions: Client-side code executes the allowed actions, prompting the end user for confirmation when required
Capture New State: After the actions execute, the client captures a new screenshot of the GUI and the current URL
Send Function Response: The new state is returned to the model as function responses, and the loop repeats from step 2
This process continues until the task is complete, an error occurs, or the loop terminates due to a safety decision or user intervention. The loop is conceptually similar to function calling with tools, but specialized for GUI manipulation.[9]
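The loop described above can be sketched in Python. The `model` and `executor` objects here are hypothetical stand-ins for a real model SDK and a real GUI-automation layer; vendor APIs differ in naming but follow the same cycle:

```python
def run_agent(model, executor, goal, max_steps=20):
    """Drive a computer-use model until it signals completion or stops."""
    screenshot, url = executor.capture_state()   # initial GUI state
    history = []
    response = None
    for _ in range(max_steps):
        # Steps 1-2: send goal + current state, receive proposed actions
        response = model.step(goal=goal, screenshot=screenshot,
                              url=url, history=history)
        if response.done:                        # task complete or blocked
            return response
        for action in response.actions:
            # Step 3: honor per-action safety decisions before executing
            if action.safety == "requires_confirmation":
                if not executor.confirm_with_user(action):
                    return response              # user declined; stop
            executor.execute(action)
            history.append(action)
        # Steps 4-5: capture fresh state and repeat
        screenshot, url = executor.capture_state()
    return response
```

In practice the client also enforces timeouts and logging around `executor.execute`, since the model has no direct visibility into execution failures except through the next screenshot.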
Coordinate System
Most computer-use models employ a normalized coordinate system where screen positions are represented on a 1000x1000 grid regardless of actual screen resolution.[9] This approach ensures consistency across different display configurations. The model outputs normalized coordinates that are then converted to actual pixel values by the client implementation:
X coordinates: 0-999 (left to right)
Y coordinates: 0-999 (top to bottom)
Actual pixel position = (normalized_coordinate / 1000) × screen_dimension
The recommended screen size for use with computer-use models is 1440×900 pixels, though models work with any resolution.[9]
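The conversion can be expressed as a small client-side helper. This is a sketch; the clamping to the documented 0-999 range is an assumption about defensive client code, not part of any vendor specification:

```python
def denormalize(x_norm: int, y_norm: int, width: int, height: int) -> tuple:
    """Map model coordinates on the 1000x1000 grid to real screen pixels."""
    # Clamp to the documented 0-999 range, then scale by screen size.
    x_norm = max(0, min(999, x_norm))
    y_norm = max(0, min(999, y_norm))
    return (x_norm * width // 1000, y_norm * height // 1000)

# e.g. the screen center on the recommended 1440x900 display:
denormalize(500, 500, 1440, 900)  # -> (720, 450)
```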
Training Methodology
Computer-use models are typically trained using a combination of:[10]
Supervised Fine-tuning (SFT): Initial training on human demonstrations of UI interactions
Reinforcement Learning (RL): Optimization through trial-and-error with reward signals
Reinforcement Learning from Human Feedback (RLHF): Refinement based on human preferences and corrections[11]
Imitation Learning: Learning from recorded sequences of expert human interactions
Major Implementations
Google Gemini Computer Use
Released on October 7, 2025, Google DeepMind's Gemini 2.5 Computer Use model is a specialized variant of Gemini 2.5 Pro optimized for browser control.[3] Key features include:
Model code: `gemini-2.5-computer-use-preview-10-2025`
Specialized for web browser automation, with promising early performance on mobile UIs
Built-in safety monitoring with per-step safety service
Performance: 70.3% on Online-Mind2Web benchmark, 34.7% on WebVoyager, 70.9% on AndroidWorld[12]
Powers Project Mariner, Firebase Testing Agent, and some agentic capabilities in AI Mode in Search[3]
Available via Google AI Studio and Vertex AI
Early testers report significant results:
Poke.com (AI assistant): "50% faster and better than the next best solutions"[3]
Autotab (AI agent): "18% performance increase on hardest evals"[3]
Google payments platform: "Successfully rehabilitates over 60% of executions" for failed UI tests[3]
OpenAI Computer-Using Agent (CUA)
OpenAI's Computer-Using Agent (CUA) powers the Operator product and combines GPT-4o's vision capabilities with reinforcement learning.[2] Released in July 2025 via Azure OpenAI, it achieves 38.1% on OSWorld, 58.1% on WebArena, and 87% on WebVoyager.[2]
Anthropic Claude 3.5 Sonnet
Anthropic's Claude 3.5 Sonnet was the first frontier AI model to offer computer use capabilities in public beta (October 22, 2024).[1] Features include:
Pixel-counting method for precise cursor positioning
14.9% score on OSWorld (screenshot-only)
22.0% score on OSWorld (with additional steps)
Available through API for developer integration
Early adopters include Asana, DoorDash, and Replit for multi-step automation[1]
Benchmarks and Evaluation
Performance Comparison on Major Benchmarks
| Model | OSWorld | WebArena | WebVoyager | Online-Mind2Web | AndroidWorld |
|---|---|---|---|---|---|
| OpenAI CUA | 38.1% | 58.1% | 87% | - | - |
| Gemini 2.5 Computer Use | - | - | 34.7% | 70.3% | 70.9% |
| Claude 3.5 Sonnet | 14.9-22% | - | - | - | - |
| Human performance | 72.36% | - | - | - | - |
OSWorld
OSWorld is a comprehensive benchmark for evaluating multimodal agents on open-ended computer tasks across Ubuntu, Windows, and macOS.[14] The benchmark consists of 369 tasks involving real web and desktop applications, operating-system file operations, and workflows that span multiple applications.
WebArena
WebArena evaluates web browsing agents using self-hosted open-source websites that simulate real-world scenarios in e-commerce, content management systems, and social platforms.[2] It tests abilities including form filling, multi-step navigation, information extraction, and transaction completion.
WebVoyager
WebVoyager tests model performance on live websites including Amazon, GitHub, and Google Maps, evaluating real-world web interaction capabilities.[2]
In a joint evaluation with Google DeepMind, Browserbase reported Gemini 2.5 Computer Use leading in accuracy, speed, and cost under matched constraints, publishing evaluation traces across thousands of human-verified runs.[16]
Supported Actions
Computer-use models typically support a standardized set of UI actions. Developers must implement the execution logic for these actions on their client-side application:[9]
Common UI Actions Supported by Computer-Use Models
| Action | Description | Parameters |
|---|---|---|
| `open_web_browser` | Opens the web browser | None |
| `click_at` | Click at specific coordinates | x, y coordinates |
| `type_text_at` | Type text at a location | x, y, text, clear_before_typing, press_enter |
| `scroll_document` | Scroll the entire page | direction (up/down/left/right) |
| `scroll_at` | Scroll a specific element or region | x, y, direction, magnitude |
| `drag_and_drop` | Drag an element to a new location | start x, y; destination x, y |
| `key_combination` | Press keyboard shortcuts | keys (for example "Control+C") |
| `hover_at` | Hover the mouse at a location | x, y coordinates |
| `navigate` | Go to a URL | url |
| `wait_5_seconds` | Pause execution | None |
| `go_back` / `go_forward` | Navigate browser history | None |
| `search` | Go to the default search engine | None |
Developers can also add custom user-defined functions (for example `open_app`, `long_press_at` for mobile) and exclude specific predefined functions to constrain behavior.[9]
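Client-side execution logic for these actions is commonly organized as a dispatch table from action names to handlers. The sketch below assumes a hypothetical `backend` object exposing `click`, `type_text`, `scroll`, `goto`, and `wait` methods (not any specific automation library), and wires up only a subset of the actions above:

```python
def make_dispatcher(backend):
    """Map the model's function calls onto a GUI-automation backend."""
    handlers = {
        "click_at":        lambda a: backend.click(a["x"], a["y"]),
        "type_text_at":    lambda a: backend.type_text(a["x"], a["y"], a["text"]),
        "scroll_document": lambda a: backend.scroll(a["direction"]),
        "navigate":        lambda a: backend.goto(a["url"]),
        "wait_5_seconds":  lambda a: backend.wait(5),
    }

    def execute(name, args):
        # Rejecting unknown names doubles as a crude action allowlist.
        if name not in handlers:
            raise ValueError(f"unsupported action: {name}")
        return handlers[name](args)

    return execute
```

Excluding a predefined function, as described above, then amounts to omitting its entry from the table so the dispatcher rejects it.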
Applications
Computer-use models have numerous practical applications across industries:[3][5]
Business Automation
Data entry and form processing across multiple websites
Cross-application workflow automation
Report generation from multiple sources
Customer service automation
Invoice and document processing
Software Development
UI testing and quality assurance (Google's payments team recovered 60% of failed tests)[3]
Automated debugging and cross-browser compatibility testing
Research and Data Collection
Web scraping and data collection (for example gathering product information, prices, and reviews)
Competitive intelligence gathering
Market research automation
Academic research assistance
Content aggregation
Personal Productivity
Email management and calendar scheduling
File organization
Online shopping assistance (for example finding "highly rated smart fridges with touchscreen")
Social media management
Personal assistant applications (Poke.com reports 50% speed improvement)[3]
Safety and Security
Computer-use models introduce unique risks including intentional misuse, unexpected model behavior, and vulnerability to prompt injections and scams. To address these, implementations use layered safety approaches:[3][9]
Built-in Safety
Per-step safety service: An out-of-model, inference-time safety service assesses each action before execution
Safety decisions: Actions classified as regular/allowed, requires_confirmation, or blocked
Training-level safety: Features trained directly into models to avoid harmful actions
Prompt Injection Attacks
Prompt injection represents one of the most significant security risks for computer-use models.[17] These attacks can occur through:
Direct injection: Malicious instructions embedded in user input[18]
Indirect injection: Hidden commands in external content (web pages, documents)[19]
Stored injection: Persistent malicious prompts in training data or memory[20]
Mitigation Strategies
Organizations implementing computer-use models should employ multiple layers of security:[21]
Sandboxed Execution: Run agents in isolated virtual machines or containers[1]
Human-in-the-Loop (HITL): Require human confirmation for sensitive actions (for example purchases, CAPTCHA interactions)[9]
System Instructions: Custom safety policies to block or require confirmation for high-stakes actions
Access Control: Implement strict permission boundaries and authentication
Content Filtering: Use guardrails to detect and block malicious inputs[7]
Monitoring and Logging: Track all agent actions for audit and forensics
Rate Limiting: Prevent abuse through action frequency restrictions
Allowlists/Blocklists: Control which websites agents can access
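As an illustration of the allowlist strategy, a client can gate every `navigate` action on an approved-host check before execution. This sketch uses only the Python standard library; the domains are placeholders:

```python
from urllib.parse import urlparse

# Illustrative allowlist; a real deployment would load this from policy config.
ALLOWED_HOSTS = {"example.com", "docs.example.com"}

def is_allowed(url: str) -> bool:
    """Permit navigation only to hosts on the allowlist (exact match)."""
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_HOSTS
```

Exact-host matching is deliberately strict here; matching subdomains or URL paths widens the attack surface and should be a conscious policy decision.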
Ethical Considerations
The deployment of computer-use models raises several ethical concerns:
Privacy implications of screen content analysis
Potential for unauthorized data access or exfiltration
Risk of perpetuating biases in automated decisions
Impact on employment in data entry and similar fields