CogVLM
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,953 words
CogVLM is an open vision language model developed by Zhipu AI and the Knowledge Engineering Group (KEG) at Tsinghua University. It was introduced in the paper "CogVLM: Visual Expert for Pretrained Language Models" (arXiv:2311.03079), first posted in November 2023; the original 17-billion-parameter checkpoint had already been released on 5 October 2023. The paper was later accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). [1][2]
The central idea of CogVLM is a trainable "visual expert" module added to every layer of an otherwise frozen pretrained large language model. The result is a deep fusion of visual and language features rather than the shallow alignment used by earlier methods, so the model keeps its original language ability while gaining strong image understanding. CogVLM-17B combines an EVA2-CLIP vision encoder with a Vicuna language backbone, and the authors reported state-of-the-art or near state-of-the-art results across image captioning, visual question answering, and visual grounding benchmarks. The same approach later carried over to CogAgent, a graphical user interface agent built on CogVLM, and to CogVLM2, a second generation built on Llama 3. [1]
Most early vision language models followed a "shallow alignment" recipe. A frozen vision encoder produces image features, a small trainable component (a linear layer, a multilayer perceptron, or a Q-Former as in BLIP-2) maps those features into the word-embedding space of a frozen language model, and the projected image tokens are simply prepended to the text tokens. The CogVLM authors argue that this recipe is structurally limited. The language model's weights were trained on text, so feeding image features through layers that expect text representations creates a mismatch, and the projected visual tokens cannot fully exploit the model's attention heads. The authors compare the situation to efficient fine-tuning methods such as LoRA or p-tuning: a small adapter on a frozen backbone can imitate behavior but tends to underperform full fine-tuning, and shallow visual adapters behave the same way. [1]
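The shallow alignment recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from any of the cited models: the array sizes, the single linear projection, and the name `shallow_align` are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
num_patches, vision_dim = 4, 8   # features from a frozen vision encoder
num_text, model_dim = 3, 6       # text tokens in the word-embedding space

image_features = rng.normal(size=(num_patches, vision_dim))
text_embeddings = rng.normal(size=(num_text, model_dim))

# The only trainable piece in this recipe: a map into the word-embedding space.
# A linear layer is assumed here; an MLP or Q-Former would play the same role.
projection = rng.normal(size=(vision_dim, model_dim))


def shallow_align(image_features, text_embeddings, projection):
    """Project image features, then prepend them to the text embeddings."""
    image_tokens = image_features @ projection
    return np.concatenate([image_tokens, text_embeddings], axis=0)


sequence = shallow_align(image_features, text_embeddings, projection)
print(sequence.shape)  # (7, 6)
```

From this point on, the frozen language model treats all seven rows as if they were text embeddings, which is the mismatch the CogVLM authors point to.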
CogVLM instead processes image and text tokens with separate weights inside a shared attention computation, so visual features are transformed at every layer to align with each attention head. The authors call this deep fusion and present it as a shift in the training paradigm for vision language models from shallow alignment to deep fusion. [1]
The visual expert is the defining mechanism of CogVLM. The model takes a pretrained language model and adds, to each transformer layer, a parallel set of weights that is applied only to the image tokens. Concretely, the visual expert in each layer consists of its own QKV (query, key, value) matrix and its own feedforward (FFN) block, both initialized from the corresponding weights of the pretrained language model. [1]
Within a layer, the hidden state sequence is split into image positions and text positions. In the attention step, the image tokens are projected by the visual expert's QKV matrix while the text tokens are projected by the original language model's QKV matrix; the two streams then attend to each other in a single attention operation, so the model still mixes visual and textual information rather than keeping them in separate towers. The FFN step works the same way: image tokens pass through the visual expert's FFN and text tokens through the original FFN. The original language model weights stay frozen during pretraining, which is what preserves the model's text ability. The expert roughly doubles the parameter count, but each token is still processed by only one branch, so the floating point operations per token stay the same as in the underlying language model. [1]
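The routing described above can be sketched as a toy single-head layer in NumPy. This is an illustrative sketch, not the released implementation: layer normalization, multiple heads, and rotary embeddings are omitted, the FFN is a plain ReLU block, the causal mask is assumed because the backbone is a decoder-only language model, and every name (`route`, `visual_expert_layer`, the weight keys) is hypothetical.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax that tolerates -inf entries from the causal mask."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def route(hidden, is_image, w_text, w_image):
    """Image rows use w_image, text rows use w_text: one branch per token."""
    out = np.empty((hidden.shape[0], w_text.shape[1]))
    out[is_image] = hidden[is_image] @ w_image
    out[~is_image] = hidden[~is_image] @ w_text
    return out


def ffn(hidden, w_in, w_out):
    """Toy feedforward block with a ReLU (a simplification)."""
    return np.maximum(hidden @ w_in, 0.0) @ w_out


def visual_expert_layer(hidden, is_image, lm, expert):
    seq_len, dim = hidden.shape
    # Separate QKV projections per token type ...
    q = route(hidden, is_image, lm["wq"], expert["wq"])
    k = route(hidden, is_image, lm["wk"], expert["wk"])
    v = route(hidden, is_image, lm["wv"], expert["wv"])
    # ... followed by a single attention over the whole mixed sequence.
    scores = q @ k.T / np.sqrt(dim)
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    hidden = hidden + softmax(np.where(causal, scores, -np.inf)) @ v
    # The FFN step is routed the same way.
    out = np.empty_like(hidden)
    out[is_image] = ffn(hidden[is_image], expert["w_in"], expert["w_out"])
    out[~is_image] = ffn(hidden[~is_image], lm["w_in"], lm["w_out"])
    return hidden + out


rng = np.random.default_rng(0)
dim, ffn_dim = 4, 8
shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
          "w_in": (dim, ffn_dim), "w_out": (ffn_dim, dim)}
lm = {name: rng.normal(size=shape) for name, shape in shapes.items()}
# The expert starts as a copy of the language model's weights.
expert = {name: w.copy() for name, w in lm.items()}

hidden = rng.normal(size=(5, dim))
is_image = np.array([True, True, True, False, False])
out = visual_expert_layer(hidden, is_image, lm, expert)
print(out.shape)  # (5, 4)
```

Because the expert is initialized from the language model's weights, the layer initially behaves exactly like the original one; the outputs diverge only once the expert weights are trained. Text positions are still affected by that training, since they attend to keys and values produced by the expert.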
For position information, all visual tokens share a single position id in the language model's rotary position embedding (RoPE). The authors note that an image can occupy hundreds to thousands of tokens, and giving each visual token a distinct position id would let RoPE's remote attenuation suppress attention between distant parts of the same image, so a shared position id avoids that. [1]
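One way to picture the shared position id is a small helper that assigns ids to a mixed sequence. The function below is a hypothetical sketch: the name `build_position_ids` and the handling of text tokens around the image are assumptions, and the released code may treat image boundary tokens differently.

```python
def build_position_ids(is_image):
    """Give each run of image tokens one shared id; text tokens advance by one."""
    position_ids = []
    next_id = 0
    previous_was_image = False
    for token_is_image in is_image:
        if token_is_image and previous_was_image:
            # Still inside the same image: reuse the id already assigned to it.
            position_ids.append(next_id - 1)
        else:
            position_ids.append(next_id)
            next_id += 1
        previous_was_image = token_is_image
    return position_ids


# One text token, three image tokens, two text tokens.
position_ids = build_position_ids([False, True, True, True, False, False])
print(position_ids)  # [0, 1, 1, 1, 2, 3]
```

With this scheme the relative distance between any two image tokens is zero, so the rotary embedding applies no distance-based decay inside the image.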
CogVLM-17B has four components: a ViT vision encoder, an MLP adapter, the pretrained language model, and the visual expert module described above. [1]
The vision encoder is EVA2-CLIP-E, the 4.4-billion-parameter variant of the EVA-CLIP family. Its final layer is removed because that layer specializes in aggregating the [CLS] token for contrastive learning, which is not what is needed for per-patch features. The image features are then passed through an MLP adapter, a two-layer MLP with a SwiGLU activation, which maps them into the same space as the text features so the language model can consume them. The language backbone is Vicuna-1.5-7B, an instruction-tuned model derived from Llama 2. [1]
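As a rough illustration of a SwiGLU adapter, the sketch below uses the standard SwiGLU definition: a SiLU-gated product of two linear projections, followed by a projection into the language model's width. How the paper's "two-layer" description maps onto these matrices is an assumption, as are the toy dimensions and all names.

```python
import numpy as np


def silu(x):
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def swiglu_adapter(image_features, w_gate, w_up, w_down):
    """Gate one projection with the SiLU of another, then project to model width."""
    gated = silu(image_features @ w_gate) * (image_features @ w_up)
    return gated @ w_down


rng = np.random.default_rng(0)
# Hypothetical toy sizes; the real encoder and language model are far wider.
num_patches, vision_dim, hidden_dim, model_dim = 4, 8, 16, 6

w_gate = rng.normal(size=(vision_dim, hidden_dim))
w_up = rng.normal(size=(vision_dim, hidden_dim))
w_down = rng.normal(size=(hidden_dim, model_dim))

patch_features = rng.normal(size=(num_patches, vision_dim))
image_tokens = swiglu_adapter(patch_features, w_gate, w_up, w_down)
print(image_tokens.shape)  # (4, 6)
```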
The "17B" name reflects the total: roughly 10 billion vision parameters (the EVA2-CLIP-E encoder plus the visual expert weights) and 7 billion language parameters (the Vicuna backbone). The number of trainable parameters is 6.5 billion, since the language backbone is frozen during pretraining. The released base checkpoints operate at 224 by 224 and 490 by 490 pixel resolution. [1]
| Property | Value |
|---|---|
| Total parameters | About 17 billion (about 10B vision, 7B language) |
| Trainable parameters | 6.5 billion |
| Vision encoder | EVA2-CLIP-E (4.4B), final layer removed |
| Adapter | Two-layer MLP with SwiGLU |
| Language backbone | Vicuna-1.5-7B |
| Visual expert | Per-layer QKV matrix plus FFN, applied to image tokens |
| Image resolution | 224 by 224, then 490 by 490 |
Pretraining used a corpus of about 1.5 billion English image-text pairs, filtered from larger public datasets (LAION-2B and COYO-700M) by removing broken links, images with an extreme aspect ratio (greater than 6 or less than 1/6), images smaller than a size threshold, and images flagged for political bias. The first stage trained on these 1.5 billion pairs with an image-captioning objective (next-token prediction over the caption) for 120,000 iterations at a batch size of 8,192. The second stage mixed image captioning with a referring expression comprehension task and ran for 60,000 iterations; after the first 30,000 of these, the input resolution was raised from 224 by 224 to 490 by 490. [1]
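The geometric filters and the schedule above can be written down as a short sketch. The function names are hypothetical, the size threshold is left as a required argument because its value is not given here, iterations are assumed to be counted from zero, and the broken-link and political-bias filters are not modeled.

```python
def keep_image(width, height, min_side):
    """Apply the size and aspect-ratio filters; min_side is a caller-supplied threshold."""
    if min(width, height) < min_side:
        return False
    aspect_ratio = width / height
    return 1 / 6 <= aspect_ratio <= 6


def stage2_resolution(iteration):
    """Input resolution during the 60,000-iteration second stage."""
    return 224 if iteration < 30_000 else 490


# Image-text pairs consumed by stage one: iterations times batch size.
STAGE1_PAIRS_SEEN = 120_000 * 8_192

print(keep_image(700, 100, min_side=50))  # False: aspect ratio 7 exceeds 6
print(stage2_resolution(45_000))          # 490
print(STAGE1_PAIRS_SEEN)                  # 983040000
```

Under these numbers stage one consumes roughly 0.98 billion pairs, which is less than one full pass over the 1.5 billion pair corpus.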
On top of the pretrained base, the authors released two fine-tuned models. CogVLM-Chat is an instruction-following model produced by supervised fine-tuning on a mix of visual question answering and dialogue data; this stage ran for 6,000 iterations at a learning rate of 1e-5 and a batch size of 1,024, and it unfroze the visual encoder, training it at one-tenth of the main learning rate to keep training stable. CogVLM-Grounding is tuned on grounding data covering grounded captioning, referring expression generation, referring expression comprehension, and grounded visual question answering, drawn from sources such as Flickr30K Entities, RefCOCO, Visual7W, Visual Genome, and Grounded CoT-VQA. [1]
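The two learning rates used for CogVLM-Chat fine-tuning amount to a per-parameter-group rule, sketched below. The parameter names and the `vision_encoder.` prefix are hypothetical and stand in for whatever naming a real checkpoint uses.

```python
BASE_LR = 1e-5            # learning rate for most trainable parameters
ENCODER_LR_SCALE = 0.1    # the visual encoder trains at one-tenth of that


def learning_rate_for(parameter_name):
    """Return the learning rate for a parameter, based on a hypothetical name prefix."""
    if parameter_name.startswith("vision_encoder."):
        return BASE_LR * ENCODER_LR_SCALE
    return BASE_LR


# Hypothetical parameter names, for illustration only.
parameter_names = [
    "vision_encoder.blocks.0.attn.qkv",
    "visual_expert.layers.0.qkv",
    "adapter.w_gate",
]
learning_rates = {name: learning_rate_for(name) for name in parameter_names}
for name, rate in learning_rates.items():
    print(f"{name}: {rate:.1e}")
```

In a training framework this rule would typically be expressed as two optimizer parameter groups, one per learning rate.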
On image captioning, the pretrained CogVLM base set state-of-the-art or competitive scores. The table below lists CIDEr scores from the paper. NoCaps and Flickr are zero-shot; COCO and TextCaps are after fine-tuning. On Flickr30K the model reached 94.9, 9.1 points ahead of the concurrently released Qwen-VL. It was also pretrained on far less data than GIT2: 1.5 billion image-text pairs versus 12.9 billion, or a little over 10 percent. [1]
| Captioning benchmark | CogVLM score (CIDEr) |
|---|---|
| NoCaps val (out-of-domain) | 132.6 |
| NoCaps val (overall) | 128.3 |
| NoCaps test (overall) | 126.4 |
| Flickr30K (zero-shot) | 94.9 |
| COCO Karpathy | 148.7 |
| TextCaps test | 144.9 |
On visual question answering and broader large-vision-language-model (LVLM) benchmarks, the instruction-tuned CogVLM-Chat (Vicuna-7B) led models of comparable scale and reported state-of-the-art results across all seven LVLM benchmarks the authors tested. The table below reports the figures for the Vicuna-7B chat model: the first three rows are VQA datasets and the remaining seven are the LVLM benchmarks. The paper marks with an asterisk the datasets seen during supervised fine-tuning; those marks are not reproduced here. [1]
| VQA / LVLM benchmark | CogVLM-Chat (Vicuna-7B) |
|---|---|
| VQAv2 | 82.3 |
| OKVQA | 64.8 |
| ScienceQA | 91.2 |
| MM-Vet | 51.1 |
| SEED-Bench | 72.5 |
| MMBench | 77.6 |
| LLaVA-Bench | 77.8 |
| POPE | 87.9 |
| MMMU | 41.1 |
| MathVista | 34.5 |
On visual grounding, CogVLM-Grounding reported the strongest generalist numbers in the paper, matching or beating specialist detection models on several splits. Scores are accuracy on referring expression comprehension. [1]
| Grounding benchmark | val | test-A | test-B |
|---|---|---|---|
| RefCOCO | 92.76 | 94.75 | 88.99 |
| RefCOCO+ | 88.68 | 92.91 | 83.39 |
| RefCOCOg | 89.75 | 90.79 (test) | N/A |
| Visual7W | 91.05 (test) | N/A | N/A |
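Referring expression comprehension accuracy is conventionally computed by counting a predicted box as correct when its intersection over union (IoU) with the ground-truth box reaches 0.5. The sketch below assumes that convention, which is not stated in this article; the function names and the example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def rec_accuracy(predictions, targets, threshold=0.5):
    """Percentage of predicted boxes whose IoU with the target meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return 100.0 * hits / len(targets)


# Made-up boxes: one exact match, one box shifted by half its width.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 0, 15, 10)]
print(rec_accuracy(predictions, targets))  # 50.0
```

The shifted box overlaps its target with an IoU of one third, so it counts as a miss under the assumed threshold.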
CogAgent is a vision language model for graphical user interface agents that is built on CogVLM. CogAgent-18B inherits the visual expert architecture and adds a high-resolution cross-module so it can read small interface text and icons from full screenshots at 1120 by 1120 pixels. It was introduced in a separate paper (arXiv:2312.08914) and released in December 2023. See the dedicated CogAgent article for details. [3]
CogVLM2 is the second generation, released on 20 May 2024 and described in the paper "CogVLM2: Visual Language Models for Image and Video Understanding" (arXiv:2408.16500, August 2024). It keeps the visual expert idea but swaps the language backbone to Meta-Llama-3-8B-Instruct. Compared with the first generation it supports image resolution up to 1344 by 1344, a context length of 8K tokens, and a bilingual Chinese and English variant. The image models are named with a "19B" suffix reflecting their total parameter count. [4][5]
| CogVLM2 model | Base model | Languages | Resolution | Context |
|---|---|---|---|---|
| cogvlm2-llama3-chat-19B | Meta-Llama-3-8B-Instruct | English | 1344 by 1344 | 8K |
| cogvlm2-llama3-chinese-chat-19B | Meta-Llama-3-8B-Instruct | Chinese, English | 1344 by 1344 | 8K |
| cogvlm2-video-llama3-chat | Meta-Llama-3-8B-Instruct | English | 224 by 224 (video) | 2K |
CogVLM2 improved markedly on text-heavy tasks relative to the first generation. The Llama-3 chat model reported 84.2 on TextVQA, 92.3 on DocVQA, and 756 on OCRBench, with the Chinese variant reaching 85.0 on TextVQA, all evaluated without external OCR tools. Zhipu reported the family as competitive with, or better than, GPT-4V on most of the benchmarks it tested. In July 2024 the project added CogVLM2-Video, which interprets short video (up to about one minute) by sampling keyframes. A related model, GLM-4V-9B, released in June 2024, used the same data and training recipe as CogVLM2 but with a GLM-9B backbone and removed the visual experts to shrink the model to 13B; that recipe later fed into the multimodal capabilities of the broader GLM line. CogVLM and the GLM family both come from the same KEG and Zhipu lineage that produced ChatGLM. [4][5]
The CogVLM source code is released under the Apache 2.0 license. The model weights are governed by a separate custom Model License: they are free for academic research, while commercial use requires completing a registration form. Because CogVLM-17B is built on a Vicuna (Llama 2) backbone and the EVA-CLIP encoder, the relevant Llama 2 and EVA (MIT) license terms also apply to the weights. CogVLM2 is released under its own CogVLM2 License, and because it is built on Meta-Llama-3, users must additionally comply with the Llama 3 license. [2][4][6]