CogAgent
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,278 words
CogAgent is an open visual language model built to act as a graphical user interface (GUI) agent: given a screenshot and a natural-language goal, it predicts the next on-screen action, such as where to click or what text to type. It was developed by the Knowledge Engineering Group (KEG) at Tsinghua University together with Zhipu AI, and was introduced in the paper "CogAgent: A Visual Language Model for GUI Agents" (arXiv:2312.08914), accepted at CVPR 2024 as a Highlight. [1][2] The original model, CogAgent-18B, was released in December 2023 and built on the CogVLM visual language model. A later checkpoint, CogAgent-9B-20241220, was released on 20 December 2024 and is built on the GLM-4V-9B base. [3][4]
The distinguishing feature of CogAgent is a high-resolution cross-module that lets the model read very small interface text and icons from full screenshots while keeping the compute cost of high-resolution input manageable. The project positions GUI control as a vision problem solved directly from pixels, in contrast to agents that depend on parsed HTML, accessibility trees, or document object model text. [1]
Large language models such as those behind ChatGPT operate on text, so they interact poorly with software whose state is conveyed visually. Web pages, desktop applications, and mobile apps present icons, rendered widgets, charts, and laid-out text that have no clean textual representation, and HTML or accessibility metadata is often verbose, incomplete, or unavailable. CogAgent was designed to remove that dependency by taking the rendered screenshot as the primary input and producing a concrete action plus a target location. [1]
CogAgent extends a general-purpose vision language model rather than starting from scratch. The 18B version inherits the architecture of CogVLM, which couples a frozen language model with a trainable "visual expert" module inserted into each transformer layer so that image and text tokens are processed by separate weights within a shared attention computation. This lets the model gain strong visual grounding without degrading the language backbone. [1]
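The routing idea behind the visual expert can be illustrated with a toy single-head attention in plain Python. This is a minimal sketch under stated assumptions, not the CogVLM implementation: the function name `expert_attention`, the weight dictionaries, and the 2-dimensional toy tokens are all hypothetical, and details such as multiple heads, causal masking, and feed-forward layers are left out.

```python
import math

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expert_attention(tokens, is_image, text_w, image_w):
    """Toy single-head attention with per-modality Q/K/V weights.

    Each token is projected with the image weights or the text weights
    depending on its modality flag; all projected tokens then take part
    in one shared attention computation.
    """
    queries, keys, values = [], [], []
    for x, img in zip(tokens, is_image):
        w = image_w if img else text_w  # route by modality
        queries.append(matvec(w["q"], x))
        keys.append(matvec(w["k"], x))
        values.append(matvec(w["v"], x))

    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        probs = softmax(scores)
        outputs.append([sum(p * v[c] for p, v in zip(probs, values)) for c in range(dim)])
    return outputs

# Hypothetical toy weights: identity for text, scaled identity for images.
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
text_w = {"q": identity, "k": identity, "v": identity}
image_w = {"q": doubled, "k": doubled, "v": doubled}

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
is_image = [True, True, False]  # two image tokens followed by one text token
mixed = expert_attention(tokens, is_image, text_w, image_w)
```

In this sketch the text weights stand in for the frozen language model and the image weights for the trainable expert; only the latter would be updated during training.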
Reading a UI requires far more visual detail than typical image captioning. Small fonts, toolbar icons, and dense menus are illegible at the 224 by 224 pixel resolution common to CLIP-style encoders, yet feeding a large image directly into a transformer is expensive: the number of visual tokens grows with the square of the image side length, and the cost of self-attention grows with the square of the token count. CogAgent addresses this with a two-branch design. [1]
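The scaling problem can be made concrete with a few lines of arithmetic. The sketch below assumes a ViT-style encoder that cuts the image into non-overlapping 14 by 14 pixel patches; the patch size and the helper names are assumptions for illustration and are not taken from the text above.

```python
def num_patch_tokens(side_px: int, patch_px: int = 14) -> int:
    """Number of non-overlapping square patches in a square image."""
    if side_px % patch_px != 0:
        raise ValueError("side length must be a multiple of the patch size")
    per_side = side_px // patch_px
    return per_side * per_side

def self_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens

for side in (224, 448, 896):
    tokens = num_patch_tokens(side)
    print(f"{side}px -> {tokens} tokens, {self_attention_pairs(tokens)} attention pairs")
```

Under these assumptions, doubling the side length multiplies the token count by 4 and the number of attention pairs by 16.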
A low-resolution branch encodes the image at 224 by 224 pixels using a large EVA2-CLIP-E encoder, providing global semantic context. A separate lightweight high-resolution cross-module encodes the same image at 1120 by 1120 pixels using a smaller EVA2-CLIP-L encoder (roughly 0.30B parameters). The high-resolution features are injected into the language decoder through cross-attention layers (hidden size 1024, with 32 attention heads) added at each decoder layer, so the model can attend to fine-grained detail on demand without expanding the main visual token sequence. Because the high-resolution encoder is small and its features are consumed through narrow cross-attention, supporting 1120 by 1120 input adds only a modest amount of computation relative to processing such a large image natively. [1]
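A rough back-of-the-envelope comparison shows why routing high-resolution features through narrow cross-attention is cheaper than placing them in the main token sequence. This is a sketch under stated assumptions, not the paper's own cost analysis: the patch size of 14, the decoder hidden size of 4096, and the prompt length of 128 tokens are assumed values, and the count covers only attention score computation in one layer, ignoring projections, feed-forward layers, and the image encoders themselves.

```python
PATCH_PX = 14          # assumed ViT patch size (not stated in the text above)
DECODER_HIDDEN = 4096  # assumed hidden size for a 7B-class decoder
CROSS_HIDDEN = 1024    # cross-attention hidden size quoted in the text
TEXT_TOKENS = 128      # hypothetical prompt length

def patch_tokens(side_px: int) -> int:
    """Visual tokens produced from a square image of the given side length."""
    return (side_px // PATCH_PX) ** 2

def score_cost(n_queries: int, n_keys: int, hidden: int) -> int:
    """Multiply-adds needed to score every query against every key."""
    return n_queries * n_keys * hidden

def naive_cost(side_px: int) -> int:
    """Self-attention with all image tokens placed in the decoder sequence."""
    n = patch_tokens(side_px) + TEXT_TOKENS
    return score_cost(n, n, DECODER_HIDDEN)

def two_branch_cost(low_px: int, high_px: int) -> int:
    """Self-attention over the short sequence plus cross-attention to high-res features."""
    main = patch_tokens(low_px) + TEXT_TOKENS
    self_attn = score_cost(main, main, DECODER_HIDDEN)
    cross_attn = score_cost(main, patch_tokens(high_px), CROSS_HIDDEN)
    return self_attn + cross_attn

naive = naive_cost(1120)
two_branch = two_branch_cost(224, 1120)
print(f"naive / two-branch ratio: {naive / two_branch:.1f}")
```

The exact ratio depends heavily on the assumed values, so it is best read as an order-of-magnitude illustration of the design choice.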
In total CogAgent-18B contains about 18 billion parameters, of which roughly 11 billion are visual and 7 billion are language parameters; the language backbone is Vicuna-1.5-7B. [1][5]
CogAgent is a multimodal model that accepts a screenshot plus a textual instruction and returns a plan that includes the next action and, where relevant, a target region expressed as coordinates. It handles both PC and smartphone interfaces, covering tasks such as navigating web pages, operating desktop applications, and completing multi-step flows on Android. Beyond action prediction, the model can perform optical character recognition, answer questions about chart and document images, and carry out visual grounding (pointing to a described element). [1][5]
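To act on a predicted target region, a caller has to turn the returned coordinates into a screen position. The sketch below assumes the box is given as `(x0, y0, x1, y1)` on a grid normalized to 1000 units per axis; that convention, the function name, and the example values are assumptions for illustration and should be checked against the model documentation.

```python
def box_to_pixel_center(box, width_px, height_px, scale=1000):
    """Map a normalized (x0, y0, x1, y1) box to the pixel at its center.

    Integer arithmetic is used so the result is exact and reproducible.
    """
    x0, y0, x1, y1 = box
    center_x = (x0 + x1) * width_px // (2 * scale)
    center_y = (y0 + y1) * height_px // (2 * scale)
    return center_x, center_y

# Hypothetical box on a 1920x1080 screenshot.
click_point = box_to_pixel_center((250, 500, 750, 600), 1920, 1080)
print(click_point)  # (960, 594)
```

A click or tap would then be dispatched at `click_point` by whatever automation layer drives the device.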
To train these abilities, the authors assembled GUI-focused pre-training data including rendered text, recognized screen elements, and grounding annotations, alongside conventional image-text data, so that the model learns both general visual understanding and the specific skill of locating and acting on interface widgets. [1]
The original 18B release came in two checkpoints with different strengths. cogagent-chat is oriented toward GUI agent use, visual grounding, and multi-turn visual dialogue, while cogagent-vqa is tuned for single-turn visual question answering and scores higher on standard VQA benchmarks. [5]
CogAgent reported state-of-the-art results on GUI navigation benchmarks using only screenshots as input, outperforming methods of the time that consumed extracted HTML text. On Mind2Web, a benchmark for generalist web agents, it improved step success rate over a LLaMA2-70B HTML-based baseline across all three generalization splits. [1]
| Mind2Web split | Step success rate |
|---|---|
| Cross-task | 62.3% |
| Cross-website | 54.0% |
| Cross-domain | 59.4% |
| Overall | 58.2% |
On AITW (Android in the Wild), which evaluates smartphone GUI control, CogAgent reached an overall score of 76.88%, above the Auto-UI unified baseline. [1]
| AITW subset | Score |
|---|---|
| General | 65.38% |
| Install | 78.86% |
| GoogleApps | 74.95% |
| Single | 93.49% |
| WebShopping | 71.73% |
| Overall | 76.88% |
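The overall AITW figure in the table is numerically consistent with an unweighted mean of the five subset scores, as the short check below shows. This is an observation about the numbers as listed here, not a statement of how the authors computed the score.

```python
aitw_scores = {
    "General": 65.38,
    "Install": 78.86,
    "GoogleApps": 74.95,
    "Single": 93.49,
    "WebShopping": 71.73,
}

# Unweighted mean across the five subsets.
mean_score = sum(aitw_scores.values()) / len(aitw_scores)
print(f"{mean_score:.2f}")  # 76.88
```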
As a generalist model, CogAgent also reported strong results on text-rich and general visual question answering. Reported figures include VQAv2 83.7%, OK-VQA 61.2%, TextVQA 76.1%, ST-VQA 80.5%, ChartQA 68.4%, InfographicVQA 44.5%, DocVQA 81.6%, MM-Vet 52.8, and POPE (adversarial) 85.9%. [1]
CogAgent-9B-20241220 is a newer checkpoint released on 20 December 2024, built on the GLM-4V-9B bilingual vision language model (part of the GLM-4 family) rather than the older CogVLM base. According to its developers it advances on the original across GUI perception, prediction accuracy, the completeness of its action space, and task generalization, and it accepts both Chinese and English instructions. [3][4]
Unlike a chat model, CogAgent-9B is an agent-execution model: it does not hold open-ended conversations but instead consumes the current screenshot together with an execution history and emits the next action in a structured format. Its action vocabulary includes operations such as clicking, typing, and scrolling, each annotated with a bounding box and a short description of the target element; a click, for example, is emitted as a CLICK action carrying the target's coordinates and element information. It is documented as supporting Windows, macOS, and Android-style mobile interfaces. The 9B checkpoint underlies Zhipu AI's GLM-PC product, which uses the model to read a computer screen and operate applications on the user's behalf. The team reported evaluations on GUI grounding and agent benchmarks including ScreenSpot, OmniAct, OSWorld, and a Chinese step benchmark, comparing against systems such as GPT-4o, Claude 3.5 Sonnet, and Qwen2-VL. [3][4]
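Client code that drives such a model typically needs to turn the emitted action string into a structured object before executing it. The sketch below parses a hypothetical action string of the form `CLICK(box=[[x0,y0,x1,y1]], element_info='...')`; the exact output syntax, the `GuiAction` class, and the `parse_action` helper are assumptions for illustration, so the patterns would need adjusting to the format documented in the model repository.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GuiAction:
    """One predicted GUI step (hypothetical container type)."""
    operation: str
    box: Optional[Tuple[int, ...]]
    element_info: Optional[str]

# Patterns for the assumed action syntax.
_ACTION_RE = re.compile(r"^(?P<op>[A-Z_]+)\((?P<args>.*)\)$")
_BOX_RE = re.compile(r"box=\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]")
_INFO_RE = re.compile(r"element_info='([^']*)'")

def parse_action(text: str) -> GuiAction:
    """Parse an action string such as CLICK(box=[[...]], element_info='...')."""
    match = _ACTION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized action: {text!r}")
    args = match.group("args")
    box_match = _BOX_RE.search(args)
    info_match = _INFO_RE.search(args)
    box = tuple(int(g) for g in box_match.groups()) if box_match else None
    info = info_match.group(1) if info_match else None
    return GuiAction(match.group("op"), box, info)

# Execution history kept as a list of parsed steps (hypothetical usage).
history = []
action = parse_action("CLICK(box=[[120,340,260,380]], element_info='Search')")
history.append(action)
print(action)
```

Keeping the parsed steps in a list mirrors the execution history that the model consumes alongside each new screenshot.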
For the original CogAgent-18B, the source code is released under the Apache 2.0 license, while the model weights are governed by a separate model license; commercial use requires free registration with Zhipu AI through its open platform. [5] The CogAgent-9B-20241220 repository likewise publishes its code under Apache 2.0, with the weights distributed under a custom model license. [3][4] The models, code, and documentation are hosted on GitHub (originally under THUDM, later the zai-org organization) and on Hugging Face. [3][5]
CogAgent is part of a broader line of open multimodal and language work from Tsinghua's KEG lab and Zhipu AI that also includes CogVLM, the ChatGLM and GLM-4 large language model families, and the GLM-4V vision models on which the 9B agent is based. [3]