Qwen-VL
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,474 words
Qwen-VL is the first series of vision-language (multimodal) models from the Qwen team at Alibaba Cloud. It extends the text-only Qwen-7B language model with a visual encoder and a lightweight adapter so the model can read images alongside text. Alibaba released the open-weight models on ModelScope and Hugging Face on 22 August 2023, and the accompanying paper, "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond" (arXiv:2308.12966), was posted to arXiv on 24 August 2023.[1][2] Two open variants shipped: Qwen-VL, the pretrained base model, and Qwen-VL-Chat, an instruction-tuned chat model. Beyond the captioning and visual-question-answering tasks common to vision-language model work at the time, Qwen-VL added fine-grained capabilities such as bounding-box grounding and bilingual (Chinese and English) text reading directly from images.[1]
The series is named after Alibaba's Tongyi Qianwen assistant (通义千问-VL), and it sits within the broader Qwen family of open models. Its direct successors, Qwen2-VL, Qwen2.5-VL, and Qwen3-VL, each have their own articles.
Qwen-VL connects three components: a large language model, a visual encoder, and a position-aware vision-language adapter that links them.[1]
The language model is initialized from Qwen-7B, the 7-billion-parameter text model from the same team, which contributes about 7.7B parameters. The visual encoder is a Vision Transformer (ViT) initialized from OpenCLIP's ViT-bigG, accounting for roughly 1.9B parameters. Images are split into patches with a stride of 14 before being processed by the encoder.[1][3]
Running a ViT over a high-resolution image produces a long sequence of patch features, which would be expensive to feed directly into the language model. To avoid this, Qwen-VL inserts a vision-language adapter: a single, randomly initialized cross-attention layer that compresses the visual features to a fixed length of 256 tokens. The adapter holds 256 trainable query vectors, which act as the attention queries, while the image features from the ViT serve as keys and values. Because compressing to a fixed length can discard spatial detail, 2D absolute positional encodings are incorporated into the query-key pairs of the cross-attention so the model retains a sense of where each feature sits in the image. The adapter itself is small, about 0.08B parameters, bringing the full Qwen-VL model to roughly 9.6B parameters.[1]
| Component | Initialization | Approx. parameters |
|---|---|---|
| Visual encoder (ViT) | OpenCLIP ViT-bigG | 1.9B |
| Vision-language adapter | Random (single-layer cross-attention) | 0.08B |
| Language model | Qwen-7B | 7.7B |
| Total | N/A | 9.6B |
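The compression step described above can be made concrete with a small sketch. It is not the released implementation: it uses a single attention head, a toy feature dimension, random values in place of learned weights, and it omits the positional encodings and any projection matrices. The function names `patch_count` and `cross_attention_compress` are hypothetical. What it shows is that the output length is set by the number of query vectors, not by the number of image patches.

```python
import math
import random


def patch_count(image_size: int, patch_size: int = 14) -> int:
    """Number of ViT patches for a square image split with the given stride."""
    per_side = image_size // patch_size
    return per_side * per_side


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_compress(queries, features):
    """Single-head cross-attention in which queries attend to image features.

    queries:  n_query vectors (stand-ins for the trainable query vectors)
    features: n_patch vectors (stand-ins for ViT outputs), used as both
              keys and values here for brevity (an assumption of this sketch)
    Returns n_query output vectors, whatever n_patch is.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in features]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, features))
            for i in range(dim)
        ])
    return outputs


def random_vectors(count, dim, rng):
    """Random placeholder vectors; a real model would use learned values."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
n_patches = patch_count(448)                   # 32 x 32 patches at 448x448
queries = random_vectors(256, 4, rng)          # 256 queries, toy dimension 4
features = random_vectors(n_patches, 4, rng)   # one vector per patch
compressed = cross_attention_compress(queries, features)
print(n_patches, len(compressed))
```

With a stride of 14, a 224x224 input already yields 256 patches, so the fourfold reduction in sequence length applies at the 448x448 resolution, where the encoder produces 1,024 patches.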
To express where things are in an image, Qwen-VL uses special tokens in its text stream. Image content is wrapped between <img> and </img> markers, a region of interest is named with <ref>...</ref>, and a bounding box is written between <box> and </box> as a pair of corner coordinates, (x_topleft, y_topleft),(x_bottomright, y_bottomright), normalized to the range [0, 1000). This token scheme lets a single autoregressive model output detection boxes, grounded captions, and grounded answers in the same way it outputs ordinary text.[1]
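As an illustration of this token scheme, the sketch below converts a pixel-space box into the textual form and parses it back. The helper names, the floor-and-clamp rounding rule, and the choice to write the `<ref>` span immediately before its `<box>` span are assumptions made for the example; the source only specifies the markers, the corner order, and the [0, 1000) range.

```python
import re

# Matches one "<ref>label</ref><box>(x1,y1),(x2,y2)</box>" span.
BOX_PATTERN = re.compile(
    r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
)


def normalize(value: float, extent: float) -> int:
    """Map a pixel coordinate to an integer in [0, 1000).

    Flooring and clamping to 999 is an assumed rounding rule.
    """
    scaled = int(value / extent * 1000)
    return min(max(scaled, 0), 999)


def format_box(label, box, width, height):
    """Render a labelled pixel box (x1, y1, x2, y2) as marker text."""
    x1, y1, x2, y2 = box
    return (
        f"<ref>{label}</ref>"
        f"<box>({normalize(x1, width)},{normalize(y1, height)}),"
        f"({normalize(x2, width)},{normalize(y2, height)})</box>"
    )


def parse_boxes(text):
    """Extract (label, (x1, y1, x2, y2)) pairs in normalized coordinates."""
    return [
        (label, (int(x1), int(y1), int(x2), int(y2)))
        for label, x1, y1, x2, y2 in BOX_PATTERN.findall(text)
    ]


encoded = format_box("the dog", (112, 56, 336, 448), width=448, height=448)
print(encoded)
print(parse_boxes(encoded))
```

Because the coordinates are plain digits inside ordinary text, the same decoding loop that produces a caption can also produce a box, and a regular expression is enough to recover it afterwards.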
Qwen-VL is trained in three stages, with different components frozen or trainable at each step.[1]
The first stage is low-resolution pretraining at 224x224. The language model is frozen, and only the visual encoder and the adapter are trained on a large, weakly supervised corpus of image-text pairs. After cleaning, the team used about 1.4 billion pairs, roughly 77.3% English and 22.7% Chinese, and trained for 50,000 steps. The goal of this stage is to align the visual features with the language model's representation space.[1]
The second stage is multi-task pretraining at the higher 448x448 resolution, with the whole model (encoder, adapter, and language model) trainable for 19,000 steps. The model is trained jointly on seven task families: image captioning, visual question answering, grounding, reference grounding, grounded captioning, OCR, and pure text generation. The higher resolution matters for reading small text and locating small objects, which is why Qwen-VL emphasized 448x448 inputs while several contemporaries still used 224x224.[1][3]
The third stage is supervised fine-tuning to produce Qwen-VL-Chat. The visual encoder is frozen again, and the language model and adapter are tuned on roughly 350,000 instruction-following examples, including multi-image and multi-turn dialogue data, for about 8,000 steps. This stage is what gives the chat model its interactive, instruction-following behavior.[1]
| Stage | Resolution | Trainable parts | Data | Steps |
|---|---|---|---|---|
| 1. Pretraining | 224x224 | Visual encoder + adapter | ~1.4B image-text pairs | 50,000 |
| 2. Multi-task pretraining | 448x448 | Full model | 7 task families (captioning, VQA, grounding, OCR, text, etc.) | 19,000 |
| 3. Supervised fine-tuning | 448x448 | Language model + adapter | ~350K instruction samples | 8,000 |
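The schedule in the table can be written down as a small data structure, which makes the frozen and trainable parts of each stage easy to query. This is a sketch for illustration only: the `Stage` class, the component names, and `frozen_parts` are hypothetical and do not come from the released training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    resolution: int        # square input size in pixels
    trainable: frozenset   # components updated during this stage
    steps: int


# Hypothetical component names for the three parts of the model.
COMPONENTS = frozenset({"visual_encoder", "adapter", "language_model"})

STAGES = (
    Stage("pretraining", 224,
          frozenset({"visual_encoder", "adapter"}), 50_000),
    Stage("multi_task_pretraining", 448, COMPONENTS, 19_000),
    Stage("supervised_fine_tuning", 448,
          frozenset({"adapter", "language_model"}), 8_000),
)


def frozen_parts(stage: Stage) -> frozenset:
    """Components that receive no updates in the given stage."""
    return COMPONENTS - stage.trainable


for stage in STAGES:
    print(stage.name, stage.resolution, sorted(frozen_parts(stage)))

print("total steps:", sum(stage.steps for stage in STAGES))
```

The adapter is the only component trained in all three stages, which fits its role as the bridge between the two pretrained halves. In a framework such as PyTorch, a schedule like this would typically be applied by toggling `requires_grad` on each component's parameters before a stage begins.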
Qwen-VL handles the standard multimodal tasks of image captioning and visual question answering, and adds several abilities that were less common in open models at its release.[1]
It supports multi-image, interleaved dialogue, so a conversation can reference several pictures and the surrounding text together. It performs visual grounding: given a referring expression in Chinese or English, it can return a bounding box for the described object, and conversely it can describe what is inside a given box. Because OCR-style data and grounding boxes are mixed into pretraining, the model also does end-to-end recognition of bilingual text inside images, which underlies its document, chart, and scene-text question answering. The models are bilingual in English and Chinese throughout.[1][3]
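A multi-image, interleaved prompt can be pictured as a flat string in which each image reference is wrapped in `<img>` markers and text is placed around it. The sketch below builds such a string. The `build_prompt` helper, the file names, and the `Picture N:` numbering prefix are assumptions made for readability; only the `<img>...</img>` wrapping comes from the description above.

```python
def build_prompt(items):
    """Join (kind, value) pairs into one interleaved prompt string.

    kind is "image" (value is a path or URL) or "text" (value is a string).
    """
    parts = []
    image_index = 0
    for kind, value in items:
        if kind == "image":
            image_index += 1
            # Numbering lets later text refer to a specific picture.
            parts.append(f"Picture {image_index}: <img>{value}</img>\n")
        elif kind == "text":
            parts.append(value)
        else:
            raise ValueError(f"unknown item kind: {kind!r}")
    return "".join(parts)


prompt = build_prompt([
    ("image", "street.jpg"),   # hypothetical file names
    ("image", "menu.jpg"),
    ("text", "Which picture contains Chinese text?"),
])
print(prompt)
```

Keeping images and text in one sequence is what allows a question to refer back to "the first picture" or to compare several pictures within a single turn.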
On the paper's evaluation suite, the Qwen-VL base model was competitive with or ahead of other open vision-language models of similar size across captioning, general VQA, text-oriented VQA, and grounding. All figures below, including the grounding numbers, are the base-model results reported by the Qwen-VL paper.[1]
| Task | Benchmark | Qwen-VL score |
|---|---|---|
| Image captioning | Flickr30K (CIDEr, zero-shot) | 85.8 |
| Image captioning | NoCaps (CIDEr, zero-shot) | 121.4 |
| General VQA | VQAv2 | 79.5 |
| General VQA | OK-VQA | 58.6 |
| General VQA | GQA | 59.3 |
| General VQA | ScienceQA (image) | 67.1 |
| Text VQA | TextVQA | 63.8 |
| Text VQA | DocVQA | 65.1 |
| Text VQA | OCR-VQA | 75.7 |
| Grounding | RefCOCO (val) | 89.36 |
| Grounding | RefCOCO+ (val) | 83.12 |
| Grounding | RefCOCOg (val) | 85.58 |
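Referring-expression benchmarks such as RefCOCO are conventionally scored as accuracy, counting a predicted box as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The sketch below assumes that convention rather than reproducing the paper's evaluation code; the function names and the example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predictions, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(
        1 for pred, target in zip(predictions, targets)
        if iou(pred, target) >= threshold
    )
    return hits / len(targets)


# Made-up boxes in the normalized [0, 1000) coordinate space.
predictions = [(100, 100, 500, 500), (0, 0, 200, 200)]
targets = [(100, 100, 500, 400), (600, 600, 900, 900)]
print(grounding_accuracy(predictions, targets))
```

Here the first prediction overlaps its target with an IoU of 0.75 and counts as a hit, while the second does not overlap at all, so the accuracy is 0.5.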
For the chat model, the team reported instruction-following and perception scores on benchmarks designed for multimodal assistants: TouchStone (645.2 in English, 401.2 in Chinese), SEED-Bench (58.2 overall), and MME (1487.58 perception, 360.71 cognition).[1]
The Qwen-VL and Qwen-VL-Chat weights, along with the inference, training, and fine-tuning code, are released under the Tongyi Qianwen License Agreement rather than a standard permissive license. Research and most commercial use are permitted free of charge, but the license requires any deployment whose product or service exceeds 100 million monthly active users to request a separate license from Alibaba.[1][4] An Int4-quantized Qwen-VL-Chat was released on 31 August 2023, and on 12 September 2023 the team added fine-tuning support covering full-parameter, LoRA, and Q-LoRA methods. (Later open Qwen models, including the Qwen2.5-VL line, moved to the Apache 2.0 license, but Qwen-VL itself uses the Tongyi Qianwen terms.)[2][5]
Alongside the open 9.6B models, Alibaba shipped two larger, API-only members of the series. Qwen-VL-Plus, announced on 28 November 2023, raised the supported input resolution to over one million pixels and posted strong document-understanding results. Qwen-VL-Max, introduced on 18 January 2024, was Alibaba's flagship vision model at the time; the company reported that it performed on par with Gemini Ultra and GPT-4V across several text-image tasks, led on Chinese question answering and Chinese text comprehension, and reached competitive scores on DocVQA, MMMU, and MathVista. Neither Plus nor Max was released as open weights; both are served through Alibaba Cloud's API and web interface.[2][6]
Qwen-VL was the foundation for a rapidly evolving line of Alibaba multimodal models. Qwen2-VL, released on 30 August 2024 in 2B, 7B, and 72B sizes, replaced the fixed 256-token adapter with "naive dynamic resolution," which maps each image to a variable number of visual tokens depending on its size, and added video understanding. Qwen2.5-VL, released in January 2025 in 3B, 7B, 32B, and 72B sizes, further improved document parsing, long-video understanding, and agentic use of computer interfaces. Qwen3-VL is the most recent generation in the family. These successors carried forward the core idea introduced by Qwen-VL: a strong Qwen language model coupled to a vision encoder, trained to read, locate, and reason over images and text together.[7][8]