UI-TARS
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,390 words
UI-TARS is a native graphical user interface (GUI) agent model developed by ByteDance through its Seed research team. It was introduced in the paper "UI-TARS: Pioneering Automated GUI Interaction with Native Agents" (arXiv:2501.12326), submitted on 21 January 2025 [1]. The model is a single vision-language model trained end to end to read screenshots and directly emit human-like actions, such as mouse moves, clicks, scrolls, and keyboard input, in order to operate computers, browsers, and phones. The project also ships a desktop application, UI-TARS Desktop, and has been followed by an updated release, UI-TARS-1.5.
Most early approaches to computer-use automation are agentic frameworks: they wrap a closed, general-purpose model such as GPT-4o behind expert-written prompts, scripted workflows, and external modules for screen parsing, planning, and grounding. UI-TARS takes the opposite approach. It is a native GUI agent, meaning perception, grounding, reasoning, action, and short-term memory are all folded into the weights of one model rather than being split across hand-built components [1]. The model receives only screenshots (it does not depend on accessibility trees or HTML), reasons about what to do, and outputs the next action in a unified action format. Because the policy is learned rather than scripted, the authors argue it generalizes better to unfamiliar interfaces and can be improved continually from its own interaction data instead of through manual prompt tuning. The central empirical claim of the paper is that this end-to-end model outperforms framework-style systems that rely on heavily wrapped commercial models [1].
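The screenshot-in, action-out loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: `Step`, `Policy`, `run_episode`, the `capture` and `execute` callbacks, and the `"finished()"` stop signal are all hypothetical names, and the policy is a placeholder for a call to the model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    thought: str  # the model's intermediate reasoning for this step
    action: str   # hypothetical action string, e.g. "click(x=10, y=20)"


# Hypothetical policy signature: instruction, screenshots so far, and prior
# steps go in; the next step comes out. A real model call would sit here.
Policy = Callable[[str, List[bytes], List[Step]], Step]


def run_episode(
    instruction: str,
    capture: Callable[[], bytes],
    execute: Callable[[str], None],
    policy: Policy,
    max_steps: int = 15,
) -> List[Step]:
    """Repeat screenshot -> policy -> action until the policy stops or the budget ends."""
    screenshots: List[bytes] = []
    history: List[Step] = []
    for _ in range(max_steps):
        screenshots.append(capture())      # perception: pixels only, no HTML or a11y tree
        step = policy(instruction, screenshots, history)
        history.append(step)               # short-term memory of earlier steps
        if step.action == "finished()":    # assumed stop signal
            break
        execute(step.action)               # act on the GUI
    return history
```

The only inputs to the policy are the instruction, the screenshots, and its own earlier steps, which is the sense in which the agent is "native": nothing in the loop parses the page or plans on the model's behalf.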
UI-TARS is built on the Qwen2-VL vision-language backbone. The team continually trained the 7B and 72B Qwen2-VL checkpoints on a large GUI-focused corpus to produce UI-TARS-7B and UI-TARS-72B, and also released a 2B model [1][2]. The paper describes roughly 50 billion tokens of continual pre-training drawn from a curated mix of GUI screenshots, element descriptions, dense captions, state-transition captions, question answering, and Set-of-Mark style annotations, alongside grounding data covering tens of millions of UI elements and around 6 million GUI tutorials [1].
The design rests on a few pillars. Enhanced perception teaches the model to describe screen elements and understand layout precisely. Unified action modeling defines a single cross-platform action space (for example click, type, scroll, drag, hotkey, and wait) with coordinate-level grounding, so the same vocabulary works across desktop, web, and mobile. System-2 reasoning lets the model produce explicit intermediate "thoughts" covering task decomposition, milestone tracking, and reflection before committing to an action, rather than reacting to each frame in isolation [1].
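To make the unified action space and the thought-before-action pattern concrete, the sketch below parses one model turn into a structured action. The `Thought: ... Action: name(key=value)` text layout, the argument names, and the helper names are assumptions chosen for illustration; the paper defines its own output format, which is not reproduced here. Only the six action names come from the description above.

```python
import re
from dataclasses import dataclass, field
from typing import Dict

# Action names taken from the article's examples of the shared action space.
ACTION_TYPES = {"click", "type", "scroll", "drag", "hotkey", "wait"}


@dataclass
class Action:
    kind: str
    args: Dict[str, str] = field(default_factory=dict)  # values kept as strings


@dataclass
class ModelTurn:
    thought: str
    action: Action


# Assumed layout: free-form thought, then a single call-like action.
_TURN_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<kind>\w+)\((?P<args>.*)\)\s*$",
    re.DOTALL,
)
# key=value pairs; a value is either single-quoted or a bare token.
_ARG_RE = re.compile(r"(\w+)\s*=\s*(?:'([^']*)'|([^,\s]+))")


def parse_turn(text: str) -> ModelTurn:
    match = _TURN_RE.match(text.strip())
    if match is None:
        raise ValueError("expected 'Thought: ... Action: name(args)'")
    kind = match.group("kind")
    if kind not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {kind}")
    args = {
        name: quoted or bare
        for name, quoted, bare in _ARG_RE.findall(match.group("args"))
    }
    return ModelTurn(match.group("thought"), Action(kind, args))
```

For example, `parse_turn("Thought: The search box is at the top.\nAction: click(x=512, y=40)")` yields a `click` action with `x` and `y` as string arguments. Because every platform shares the same small vocabulary, one parser and one executor interface can serve desktop, web, and mobile targets.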
The authors frame training as moving from system-1 reactive behavior toward deliberate system-2 control, then refining the policy through iterative self-improvement. After the perception and grounding pre-training, the model is trained on multi-step trajectories assembled from annotated data and standardized public datasets including Multimodal Mind2Web, GUIAct, AITW, AITZ, AndroidControl, GUI-Odyssey, and AMEX [1].
The most distinctive stage is iterative training with reflective online traces. UI-TARS runs in real GUI environments to collect new trajectories at scale; human annotators and automated filters then mark errors and construct corrected, post-reflection pairs. These pairs are used for reflection tuning and for Direct Preference Optimization (DPO), which yields the DPO variants of the released models. This loop lets the agent learn from its own mistakes and adapt without expert-crafted prompts, and it is why the public checkpoints come in both supervised fine-tuned (SFT) and DPO forms [1][2].
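For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair, where the "chosen" response would be the corrected step and the "rejected" response the original mistaken one. It assumes sequence log-probabilities are already available from the policy being trained and from a frozen reference model; the function names and `beta=0.1` are illustrative choices, not values taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative strength of the implicit KL penalty
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

When the policy matches the reference, the loss equals log 2; it falls as the policy shifts probability toward the corrected action and away from the erroneous one.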
UI-TARS was released openly on Hugging Face under the ByteDance-Seed organization in 2B, 7B, and 72B parameter scales, with SFT checkpoints for all three and DPO checkpoints for the 7B and 72B models [2][3]. All checkpoints use Qwen2-VL as the base and are distributed under the Apache 2.0 license [3].
| Checkpoint | Parameters | Variant | Base model |
|---|---|---|---|
| UI-TARS-2B-SFT | 2B | SFT | Qwen2-VL |
| UI-TARS-7B-SFT | 7B | SFT | Qwen2-VL |
| UI-TARS-7B-DPO | 7B | DPO (recommended) | Qwen2-VL |
| UI-TARS-72B-SFT | 72B | SFT | Qwen2-VL |
| UI-TARS-72B-DPO | 72B | DPO (recommended) | Qwen2-VL |
The paper reports state-of-the-art results across more than ten GUI benchmarks spanning grounding (locating the right pixel to act on) and full interactive task execution [1]. On the OSWorld benchmark of real desktop tasks, UI-TARS-72B scored 24.6 with a 50-step budget and 22.7 with a 15-step budget, ahead of Claude's computer-use agent at 22.0 and 14.9 respectively. On AndroidWorld it reached 46.6, above GPT-4o at 34.5 [1].
| Interactive benchmark | UI-TARS-72B | Comparison |
|---|---|---|
| OSWorld (50 steps) | 24.6 | Claude 22.0 |
| OSWorld (15 steps) | 22.7 | Claude 14.9 |
| AndroidWorld | 46.6 | GPT-4o 34.5 |
The ScreenSpot family measures pure grounding accuracy. On the original ScreenSpot, the 7B model led at 89.5 average; on ScreenSpot-V2 it reached 91.6; and on the harder ScreenSpot-Pro (high-resolution professional software), the 72B model set the top score at 38.1, far above GPT-4o (0.8) and Claude's computer-use agent (17.1) [1][3].
| ScreenSpot grounding (average) | UI-TARS-2B | UI-TARS-7B | UI-TARS-72B |
|---|---|---|---|
| ScreenSpot | 82.3 | 89.5 | 88.4 |
| ScreenSpot-V2 | 84.7 | 91.6 | 90.3 |
| ScreenSpot-Pro | 27.7 | 35.7 | 38.1 |
For reference, GPT-4o scored 0.8 and Claude's computer-use agent 17.1 on ScreenSpot-Pro, so UI-TARS-72B (38.1) led the stronger of the two baselines by 21.0 points on dense professional interfaces [3].
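Grounding scores of this kind are typically click accuracies. The sketch below assumes the common point-in-box criterion, in which a prediction counts as correct when the predicted click point falls inside the target element's bounding box; the type aliases and function names are hypothetical, and the exact evaluation scripts for each benchmark may differ in detail.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # (x, y) in pixels
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """True when the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Percentage of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return 100.0 * hits / len(targets)
```

Under this criterion, small targets on high-resolution screens leave little room for error, which is one plausible reason scores on ScreenSpot-Pro sit well below those on the original ScreenSpot.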
UI-TARS Desktop is an open-source application that turns the model into a usable agent for a local machine. Its repository describes it as a native GUI agent for the user's local computer, driven by the UI-TARS and Seed vision-language models, and it supports Windows, macOS, and browser targets. It takes natural-language instructions, captures screenshots, and issues mouse and keyboard actions with real-time feedback during execution [4]. The same repository also hosts Agent TARS, a broader multimodal agent stack with a command-line interface and web UI that can connect to external tools through Model Context Protocol servers [4].
On 17 April 2025, ByteDance Seed released UI-TARS-1.5 and open-sourced a 7B checkpoint, UI-TARS-1.5-7B, under Apache 2.0 [5][6]. The 1.5 update adds reinforcement learning on top of the original recipe and emphasizes an explicit think-then-act strategy, which is intended to strengthen long-horizon reasoning and generalization in unfamiliar environments. ByteDance also reported, for the first time in this line of work, results in game-playing scenarios (a set of Poki browser games and Minecraft tasks) in addition to the standard GUI suite [5][6]. On the standard suite, the reported scores exceed the previous best on every benchmark listed except WebVoyager, where UI-TARS-1.5 (84.8) trails OpenAI CUA (87).
| UI-TARS-1.5 benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous best |
|---|---|---|---|---|
| OSWorld (100 steps) | 42.5 | 36.4 | 28 | 38.1 |
| Windows Agent Arena (50 steps) | 42.1 | N/A | N/A | 29.8 |
| WebVoyager | 84.8 | 87 | 84.1 | 87 |
| Online-Mind2Web | 75.8 | 71 | 62.9 | 71 |
| AndroidWorld | 64.2 | N/A | N/A | 59.5 |
| ScreenSpot-V2 | 94.2 | 87.9 | 87.6 | 91.6 |
| ScreenSpot-Pro | 61.6 | 23.4 | 27.7 | 43.6 |
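The size of each gain can be read off the table with simple arithmetic. The sketch below copies the UI-TARS-1.5 and "Previous best" columns from the table above and computes the difference in points; the dictionary layout and function names are illustrative.

```python
# (UI-TARS-1.5 score, previous best score), copied from the table above.
RESULTS = {
    "OSWorld (100 steps)": (42.5, 38.1),
    "Windows Agent Arena (50 steps)": (42.1, 29.8),
    "WebVoyager": (84.8, 87.0),
    "Online-Mind2Web": (75.8, 71.0),
    "AndroidWorld": (64.2, 59.5),
    "ScreenSpot-V2": (94.2, 91.6),
    "ScreenSpot-Pro": (61.6, 43.6),
}


def gains(results):
    """Point difference between the new score and the previous best, per benchmark."""
    return {name: round(new - prev, 1) for name, (new, prev) in results.items()}


def regressions(results):
    """Benchmarks where the new score is below the previous best."""
    return [name for name, delta in gains(results).items() if delta < 0]


if __name__ == "__main__":
    for name, delta in gains(RESULTS).items():
        print(f"{name}: {delta:+.1f}")
```

The largest improvements are on ScreenSpot-Pro (18.0 points) and Windows Agent Arena (12.3 points), while WebVoyager is the only benchmark in the table where the new score falls below the previous best (by 2.2 points).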
The 1.5 release lifted the OSWorld score to 42.5 over a 100-step budget, ahead of OpenAI's computer-using agent (36.4) and Claude 3.7 (28), and pushed ScreenSpot-Pro grounding to 61.6 [5][6]. ByteDance subsequently announced a further iteration, UI-TARS-2, on 4 September 2025 [2]. The work fits into ByteDance's wider agent and model efforts, which also include the Doubao consumer assistant and the company's family of foundation models.
The UI-TARS model checkpoints, the original 2B/7B/72B releases and UI-TARS-1.5-7B alike, are published under the Apache 2.0 license, as are the UI-TARS and UI-TARS Desktop code repositories [3][4][6]. This permits commercial use, modification, and redistribution, and it places UI-TARS among the more permissively licensed open computer-use models, in contrast to the proprietary computer-use agents from OpenAI and Anthropic against which it is benchmarked.