UI-TARS

Updated Jul 16, 2026

UI-TARS is a native graphical user interface (GUI) agent model developed by ByteDance through its Seed research team. It was introduced in the paper "UI-TARS: Pioneering Automated GUI Interaction with Native Agents" (arXiv:2501.12326), submitted on 21 January 2025 ^[1]. The model is a single vision-language model trained end to end to read screenshots and directly emit human-like actions, such as mouse moves, clicks, scrolls, and keyboard input, in order to operate computers, browsers, and phones. The project also ships a desktop application, UI-TARS Desktop, and has been followed by an updated release, UI-TARS-1.5.

Native GUI agent paradigm

Most early approaches to computer-use automation are agentic frameworks: they wrap a closed, general-purpose model such as GPT-4o behind expert-written prompts, scripted workflows, and external modules for screen parsing, planning, and grounding. UI-TARS takes the opposite approach. It is a native GUI agent, meaning perception, grounding, reasoning, action, and short-term memory are all folded into the weights of one model rather than being split across hand-built components ^[1]. The model receives only screenshots (it does not depend on accessibility trees or HTML), reasons about what to do, and outputs the next action in a unified action format. Because the policy is learned rather than scripted, the authors argue it generalizes better to unfamiliar interfaces and can be improved continually from its own interaction data instead of through manual prompt tuning. The central empirical claim of the paper is that this end-to-end model outperforms framework-style systems that rely on heavily wrapped commercial models ^[1].
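The screenshot-in, action-out loop described above can be sketched in a few lines. This is a minimal illustration, not the UI-TARS implementation: `NativeAgent`, `Step`, the action strings, and the scripted policy standing in for the model are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    thought: str  # intermediate reasoning emitted before the action
    action: str   # hypothetical action string, e.g. "click(412, 88)"


@dataclass
class NativeAgent:
    # policy stands in for the single model: it sees the instruction,
    # the previous steps (short-term memory), and the current screenshot.
    policy: Callable[[str, List[Step], bytes], Step]
    history: List[Step] = field(default_factory=list)

    def run(self, instruction, capture, execute, max_steps=15):
        """Loop: capture screenshot -> ask policy -> execute, until finished."""
        for _ in range(max_steps):
            screenshot = capture()
            step = self.policy(instruction, self.history, screenshot)
            self.history.append(step)
            if step.action.startswith("finished"):
                return True
            execute(step.action)
        return False  # step budget exhausted


# A scripted stand-in for the model so the sketch runs without any weights.
def scripted_policy(instruction, history, screenshot):
    plan = [
        Step("Open the search box", "click(412, 88)"),
        Step("Enter the query", "type('weather')"),
        Step("The task is complete", "finished()"),
    ]
    return plan[len(history)]


executed = []
agent = NativeAgent(policy=scripted_policy)
done = agent.run("search for weather", capture=lambda: b"", execute=executed.append)
```

Note that the loop contains no screen parser, planner, or grounding module: in the native design, everything between `capture` and `execute` is the model call.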

Architecture and perception

UI-TARS is built on the Qwen2-VL vision-language backbone. The team continually trained the 7B and 72B Qwen2-VL checkpoints on a large GUI-focused corpus to produce UI-TARS-7B and UI-TARS-72B, and also released a 2B model ^[1]^[2]. The paper describes roughly 50 billion tokens of continual pre-training drawn from a curated mix of GUI screenshots, element descriptions, dense captions, state-transition captions, question answering, and Set-of-Mark style annotations, alongside grounding data covering tens of millions of UI elements and around 6 million GUI tutorials ^[1].

The design rests on three model-side pillars. Enhanced perception teaches the model to describe screen elements and understand layout precisely. Unified action modeling defines a single cross-platform action space (for example click, type, scroll, drag, hotkey, and wait) with coordinate-level grounding, so the same vocabulary works across desktop, web, and mobile. System-2 reasoning lets the model produce explicit intermediate "thoughts" covering task decomposition, milestone tracking, and reflection before committing to an action, rather than reacting to each frame in isolation ^[1].
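To make the idea of a single action vocabulary concrete, the sketch below parses action strings into one structure that any platform executor could consume. The call-style grammar (`click(x=412, y=88)`) and the names `Action` and `parse_action` are assumptions for illustration; the model's actual output format is defined in the paper and repository, not here.

```python
import re
from dataclasses import dataclass
from typing import Dict

# Action names taken from the examples in the text above.
ACTION_TYPES = {"click", "type", "scroll", "drag", "hotkey", "wait"}


@dataclass
class Action:
    kind: str
    args: Dict[str, str]


# Hypothetical grammar: name(key=value, key='quoted value')
_CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
_ARG = re.compile(r"(\w+)\s*=\s*('(?:[^'\\]|\\.)*'|[^,]+)")


def parse_action(text: str) -> Action:
    """Parse one action string; reject anything outside the shared vocabulary."""
    match = _CALL.match(text)
    if match is None:
        raise ValueError(f"not an action call: {text!r}")
    kind, body = match.groups()
    if kind not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {kind!r}")
    args = {key: value.strip().strip("'") for key, value in _ARG.findall(body)}
    return Action(kind, args)


click = parse_action("click(x=412, y=88)")
typed = parse_action("type(content='hello, world')")
```

Because every platform shares the same `Action` structure, only the final executor (desktop, browser, or phone) needs platform-specific code.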

Training

The authors frame training as moving from system-1 reactive behavior toward deliberate system-2 control, then refining the policy through iterative self-improvement. After the perception and grounding pre-training, the model is trained on multi-step trajectories assembled from annotated data and standardized public datasets including Multimodal Mind2Web, GUIAct, AITW, AITZ, AndroidControl, GUI-Odyssey, and AMEX ^[1].

The most distinctive stage is iterative training with reflective online traces. UI-TARS runs in real GUI environments to collect new trajectories at scale; human annotators and automated filters then mark errors and construct corrected, post-reflection pairs. These pairs are used for reflection tuning and for Direct Preference Optimization (DPO), which yields the DPO variants of the released models. This loop lets the agent learn from its own mistakes and adapt without expert-crafted prompts, and it is why the public checkpoints come in both supervised fine-tuned (SFT) and DPO forms ^[1]^[2].
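For readers unfamiliar with DPO, the following sketch computes the standard DPO loss for one preference pair, where the corrected (post-reflection) step is the chosen sample and the erroneous step is the rejected one. It shows the general objective only; the `beta` value and the function name are assumptions, and the paper's own hyperparameters are not reproduced here.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained and
    under a frozen reference model. beta=0.1 is an illustrative default.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # loss = -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy already prefers the corrected action more than the reference does.
good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# Policy prefers the erroneous action; the loss is higher.
bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss falls as the policy raises the likelihood of the corrected action relative to the erroneous one, measured against the reference model.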

Model sizes

UI-TARS was released openly on Hugging Face under the ByteDance-Seed organization in 2B, 7B, and 72B parameter scales, with SFT checkpoints for all three and DPO checkpoints for the 7B and 72B models ^[2]^[3]. All checkpoints use Qwen2-VL as the base and are distributed under the Apache 2.0 license ^[3].

Checkpoint	Parameters	Variant	Base model
UI-TARS-2B-SFT	2B	SFT	Qwen2-VL
UI-TARS-7B-SFT	7B	SFT	Qwen2-VL
UI-TARS-7B-DPO	7B	DPO (recommended)	Qwen2-VL
UI-TARS-72B-SFT	72B	SFT	Qwen2-VL
UI-TARS-72B-DPO	72B	DPO (recommended)	Qwen2-VL
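The table can be turned into a small lookup that picks the recommended variant for a given size. This is a sketch: `pick_checkpoint` is a hypothetical helper, and only the `ByteDance-Seed/UI-TARS-72B-DPO` repository ID is confirmed by the references; the other IDs are assumed to follow the same naming pattern.

```python
HF_ORG = "ByteDance-Seed"  # organization named in the text above

# Rows copied from the checkpoint table.
CHECKPOINTS = [
    {"name": "UI-TARS-2B-SFT", "size": "2B", "variant": "SFT"},
    {"name": "UI-TARS-7B-SFT", "size": "7B", "variant": "SFT"},
    {"name": "UI-TARS-7B-DPO", "size": "7B", "variant": "DPO"},
    {"name": "UI-TARS-72B-SFT", "size": "72B", "variant": "SFT"},
    {"name": "UI-TARS-72B-DPO", "size": "72B", "variant": "DPO"},
]


def pick_checkpoint(size: str) -> str:
    """Return an assumed repository ID, preferring DPO where one exists."""
    candidates = [c for c in CHECKPOINTS if c["size"] == size]
    if not candidates:
        raise ValueError(f"no checkpoint of size {size!r}")
    best = max(candidates, key=lambda c: c["variant"] == "DPO")
    return f"{HF_ORG}/{best['name']}"


small = pick_checkpoint("2B")
large = pick_checkpoint("72B")
```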

Benchmarks

The paper reports state-of-the-art results on more than ten GUI benchmarks spanning grounding (locating the right pixel to act on) and full interactive task execution ^[1]. On the OSWorld benchmark of real desktop tasks, UI-TARS-72B scored 24.6 with a 50-step budget and 22.7 with a 15-step budget, ahead of Claude's computer-use agent at 22.0 and 14.9. On AndroidWorld it reached 46.6, above GPT-4o at 34.5 ^[1].

Interactive benchmark	UI-TARS-72B	Comparison
OSWorld (50 steps)	24.6	Claude 22.0
OSWorld (15 steps)	22.7	Claude 14.9
AndroidWorld	46.6	GPT-4o 34.5
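The two OSWorld rows differ only in the step budget. The sketch below shows, under simplifying assumptions, why a larger budget can only raise a success rate computed this way. The episode data is made up for illustration and is not OSWorld output, and `success_rate` is a hypothetical helper.

```python
from typing import Optional, Sequence


def success_rate(solved_at: Sequence[Optional[int]], budget: int) -> float:
    """Percentage of tasks solved within `budget` steps.

    solved_at[i] is the step at which task i was solved, or None if the
    agent never solved it. Real agents may behave differently under
    different budgets; this fixed-trajectory view is a simplification.
    """
    solved = sum(1 for step in solved_at if step is not None and step <= budget)
    return 100.0 * solved / len(solved_at)


# Hypothetical episodes: step at which each task was solved, None = failed.
episodes = [8, 12, None, 40, 15, None, 22, None]

rate_15 = success_rate(episodes, 15)
rate_50 = success_rate(episodes, 50)
```

Tasks that need more than 15 steps count as failures under the smaller budget, which is why scores should only be compared at matching budgets.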

The ScreenSpot family measures pure grounding accuracy. On the original ScreenSpot, the 7B model led at 89.5 average; on ScreenSpot-V2 it reached 91.6; and on the harder ScreenSpot-Pro (high-resolution professional software), the 72B model set the top score at 38.1, far above GPT-4o (0.8) and Claude's computer-use agent (17.1) ^[1]^[3].

ScreenSpot grounding (average)	UI-TARS-2B	UI-TARS-7B	UI-TARS-72B
ScreenSpot	82.3	89.5	88.4
ScreenSpot-V2	84.7	91.6	90.3
ScreenSpot-Pro	27.7	35.7	38.1

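Against the ScreenSpot-Pro baselines (GPT-4o at 0.8, Claude's computer-use agent at 17.1), even the smallest model, UI-TARS-2B, leads Claude by 10.6 points, and the 72B model leads it by 21.0 points, so the GUI-specialized UI-TARS models opened a wide margin on dense professional interfaces ^[3].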
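Grounding benchmarks of this kind are commonly scored by checking whether the predicted click point falls inside the target element's bounding box. The sketch below assumes that protocol and uses made-up coordinates; `grounding_accuracy` is a hypothetical helper, and the exact scoring rules are defined by each benchmark.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]               # (x, y) in pixels
Box = Tuple[float, float, float, float]   # (left, top, right, bottom) in pixels


def grounding_accuracy(predictions: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted points that land inside their target box."""
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    hits = sum(
        1
        for (x, y), (left, top, right, bottom) in zip(predictions, boxes)
        if left <= x <= right and top <= y <= bottom
    )
    return 100.0 * hits / len(boxes)


# Made-up example: two of the four predicted points hit their target.
predictions = [(10, 10), (50, 50), (5, 5), (100, 200)]
boxes = [(0, 0, 20, 20), (0, 0, 20, 20), (0, 0, 4, 4), (90, 190, 110, 210)]
accuracy = grounding_accuracy(predictions, boxes)
```

Under this metric, small targets on high-resolution screens are harder to hit, which is consistent with the much lower scores on ScreenSpot-Pro.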

UI-TARS Desktop and later versions

UI-TARS Desktop is an open-source application that turns the model into a usable agent for a local machine. Its repository describes it as a native GUI agent for the user's local computer, driven by the UI-TARS and Seed vision-language models, with support for Windows, macOS, and browser targets. It takes natural-language instructions, captures screenshots, and issues mouse and keyboard actions with real-time feedback during execution ^[4]. The same repository also hosts Agent TARS, a broader multimodal agent stack with a command-line interface and web UI that can connect to external tools through Model Context Protocol servers ^[4].

On 17 April 2025, ByteDance Seed released UI-TARS-1.5 and open-sourced a 7B checkpoint, UI-TARS-1.5-7B, under Apache 2.0 ^[5]^[6]. The 1.5 update adds reinforcement learning on top of the original recipe and emphasizes an explicit think-then-act strategy, which is intended to strengthen long-horizon reasoning and generalization in unfamiliar environments. ByteDance also reported, for the first time in this line of work, results in game-playing scenarios (a set of Poki browser games and Minecraft tasks) in addition to the standard GUI suite ^[5]^[6]. On the GUI benchmarks in the table below, UI-TARS-1.5 posts the top reported score in every row except WebVoyager, where its 84.8 trails the 87 reported for OpenAI's agent.

UI-TARS-1.5 benchmark	UI-TARS-1.5	OpenAI CUA	Claude 3.7	Previous best
OSWorld (100 steps)	42.5	36.4	28	38.1
Windows Agent Arena (50 steps)	42.1	N/A	N/A	29.8
WebVoyager	84.8	87	84.1	87
Online-Mind2Web	75.8	71	62.9	71
AndroidWorld	64.2	N/A	N/A	59.5
ScreenSpot-V2	94.2	87.9	87.6	91.6
ScreenSpot-Pro	61.6	23.4	27.7	43.6

The 1.5 release lifted the OSWorld score to 42.5 over a 100-step budget, ahead of OpenAI's computer-using agent (36.4) and Claude 3.7 (28), and pushed ScreenSpot-Pro grounding to 61.6 ^[5]^[6]. ByteDance subsequently announced a further iteration, UI-TARS-2, on 4 September 2025 ^[2]. The work fits into ByteDance's wider agent and model efforts, which also include the Doubao consumer assistant and the company's family of foundation models.
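The UI-TARS-1.5 comparison table can be checked mechanically. The sketch below copies the table's values (with N/A as `None`) and lists the benchmarks where UI-TARS-1.5 does not hold the top score; the helper name `leaders` is hypothetical.

```python
# Values copied from the UI-TARS-1.5 benchmark table; None marks N/A.
TABLE = {
    "OSWorld (100 steps)": {"UI-TARS-1.5": 42.5, "OpenAI CUA": 36.4, "Claude 3.7": 28.0, "Previous best": 38.1},
    "Windows Agent Arena (50 steps)": {"UI-TARS-1.5": 42.1, "OpenAI CUA": None, "Claude 3.7": None, "Previous best": 29.8},
    "WebVoyager": {"UI-TARS-1.5": 84.8, "OpenAI CUA": 87.0, "Claude 3.7": 84.1, "Previous best": 87.0},
    "Online-Mind2Web": {"UI-TARS-1.5": 75.8, "OpenAI CUA": 71.0, "Claude 3.7": 62.9, "Previous best": 71.0},
    "AndroidWorld": {"UI-TARS-1.5": 64.2, "OpenAI CUA": None, "Claude 3.7": None, "Previous best": 59.5},
    "ScreenSpot-V2": {"UI-TARS-1.5": 94.2, "OpenAI CUA": 87.9, "Claude 3.7": 87.6, "Previous best": 91.6},
    "ScreenSpot-Pro": {"UI-TARS-1.5": 61.6, "OpenAI CUA": 23.4, "Claude 3.7": 27.7, "Previous best": 43.6},
}


def leaders(row):
    """Return the systems sharing the highest reported score in a row."""
    scored = {name: score for name, score in row.items() if score is not None}
    top = max(scored.values())
    return sorted(name for name, score in scored.items() if score == top)


not_led = [bench for bench, row in TABLE.items() if "UI-TARS-1.5" not in leaders(row)]
```

On these figures the only exception is WebVoyager, where OpenAI's agent and the previous best are tied at 87.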

Licensing

The UI-TARS model checkpoints (both the original 2B, 7B, and 72B releases and UI-TARS-1.5-7B) are published under the Apache 2.0 license, as are the UI-TARS and UI-TARS Desktop code repositories ^[3]^[4]^[6]. The license permits commercial use, modification, and redistribution, which places UI-TARS among the more permissively licensed open computer-use models, in contrast to the proprietary computer-use agents from OpenAI and Anthropic against which it is benchmarked.

References

1. UI-TARS: Pioneering Automated GUI Interaction with Native Agents (arXiv:2501.12326). https://arxiv.org/abs/2501.12326
2. bytedance/UI-TARS, GitHub repository. https://github.com/bytedance/UI-TARS
3. ByteDance-Seed/UI-TARS-72B-DPO, Hugging Face model card. https://huggingface.co/ByteDance-Seed/UI-TARS-72B-DPO
4. bytedance/UI-TARS-desktop, GitHub repository. https://github.com/bytedance/UI-TARS-desktop
5. ByteDance Seed Agent Model UI-TARS-1.5 Open Source, ByteDance Seed blog. https://seed.bytedance.com/en/blog/bytedance-seed-agent-model-ui-tars-1-5-open-source-achieving-sota-performance-in-various-benchmarks
6. ByteDance-Seed/UI-TARS-1.5-7B, Hugging Face model card. https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B
