Qwen-VL

Updated Jun 27, 2026

Qwen-VL is the first family of open vision-language (multimodal) models from the Qwen team at Alibaba Cloud, able to take images, text, and bounding boxes as input and produce text and bounding boxes as output. First released on 22 August 2023, Qwen-VL extends the text-only Qwen-7B language model with a visual encoder so a single model can caption images, answer questions about them, read bilingual (Chinese and English) text inside them, and localize objects with bounding-box coordinates.^[1]^[2] Its developers describe the line as "a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images."^[1]

The original release shipped two open-weight variants on ModelScope and Hugging Face: the pretrained base model Qwen-VL and the instruction-tuned chat model Qwen-VL-Chat. The accompanying paper, "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond" (arXiv:2308.12966), was posted to arXiv on 24 August 2023.^[1]^[2] Beyond the captioning and visual-question-answering tasks common to vision-language model work at the time, Qwen-VL added fine-grained abilities such as bounding-box grounding and reading text directly from images.^[1] The series takes its name from Alibaba's Tongyi Qianwen assistant (its Chinese name is 通义千问-VL) and sits within the broader Qwen family of open models. Its direct successors, Qwen2-VL, Qwen2.5-VL, and Qwen3-VL, each have their own articles.

What is Qwen-VL?

Qwen-VL is a multimodal model series that couples a Qwen language model to a vision encoder so it can reason jointly over images and text. The design goal set out in the paper was a generalist model that goes beyond description and question answering to handle precise localization and text reading, as the title's "Understanding, Localization, Text Reading, and Beyond" signals.^[1] At launch the open models had roughly 9.6 billion total parameters and supported both English and Chinese.^[1]^[3]

What is Qwen-VL's architecture?

Qwen-VL connects three components: a large language model, a visual encoder, and a position-aware vision-language adapter that links them.^[1]

The language model is initialized from Qwen-7B, the 7-billion-parameter text model from the same team, which contributes about 7.7B parameters. The visual encoder is a Vision Transformer (ViT) initialized from OpenCLIP's ViT-bigG, accounting for roughly 1.9B parameters. Images are split into patches with a stride of 14 before being processed by the encoder.^[1]^[3]

Running a ViT over a high-resolution image produces a long sequence of patch features, which would be expensive to feed directly into the language model. To avoid this, Qwen-VL inserts a vision-language adapter, a single randomly initialized cross-attention layer that compresses the visual features to a fixed length of 256 tokens. The adapter uses a set of 256 trainable query vectors that attend (as queries) to the image features from the ViT (as keys and values). To keep spatial information that pure compression would otherwise discard, 2D absolute positional encodings are added into the cross-attention so the model retains a sense of where each feature sits in the image. The adapter itself is small, about 0.08B parameters, bringing the full Qwen-VL model to roughly 9.6B parameters.^[1]
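The compression step can be illustrated with a minimal sketch of single-head cross-attention in plain Python. It is not the Qwen-VL implementation: it leaves out the learned projection matrices and the 2D positional encodings, uses random vectors in place of trained queries and ViT features, and shrinks the sizes (4 queries instead of 256, 8 dimensions) so it runs instantly. The names `cross_attention`, `random_vectors`, `NUM_QUERIES`, and `DIM` are hypothetical. The point it shows is that the output length is set by the number of queries, not by the number of patch features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Each query attends over all features (used as both keys and values)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every feature.
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        # Weighted average of the feature vectors.
        outputs.append(
            [sum(w * f[c] for w, f in zip(weights, features)) for c in range(dim)]
        )
    return outputs


def random_vectors(count, dim, rng):
    """Stand-in for trained queries or ViT patch features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8          # toy feature width (assumption)
NUM_QUERIES = 4  # stands in for Qwen-VL's 256 trainable queries
queries = random_vectors(NUM_QUERIES, DIM, rng)

for num_patches in (16, 64):
    features = random_vectors(num_patches, DIM, rng)
    compressed = cross_attention(queries, features)
    print(num_patches, "patch features ->", len(compressed), "output tokens")
```

Both input lengths produce the same number of output tokens, which is why the language model sees a fixed-length visual sequence regardless of how many patches the encoder emits.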

Component	Initialization	Approx. parameters
Visual encoder (ViT)	OpenCLIP ViT-bigG	1.9B
Vision-language adapter	Random (single-layer cross-attention)	0.08B
Language model	Qwen-7B	7.7B
Total	N/A	9.6B
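As a quick check on the table, the short script below adds the rounded per-component figures and prints each component's share. The dictionary name `PARAMS_B` is hypothetical, and the inputs are the approximate values quoted above, so the sum (9.68B) lands slightly above the 9.6B total the article reports.

```python
# Approximate parameter counts in billions, copied from the table above.
PARAMS_B = {
    "visual encoder": 1.9,
    "vision-language adapter": 0.08,
    "language model": 7.7,
}

total = sum(PARAMS_B.values())

for name, size in PARAMS_B.items():
    # Share of the summed total taken by each component.
    print(f"{name}: {size}B ({size / total:.1%})")

print(f"sum of rounded parts: {total:.2f}B")
```

The language model accounts for roughly four fifths of the parameters, and the adapter for under one percent.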

To express where things are in an image, Qwen-VL uses special tokens in its text stream. Image content is wrapped between <img> and </img> markers, a region of interest is named with <ref>...</ref>, and a bounding box is written between <box> and </box> as a pair of corner coordinates, (x_topleft, y_topleft),(x_bottomright, y_bottomright), normalized to the range [0, 1000). This token scheme lets a single autoregressive model output detection boxes, grounded captions, and grounded answers in the same way it outputs ordinary text.^[1]
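A small sketch of how such box strings could be produced and read back is shown below. The helper names (`to_grid`, `format_box`, `parse_boxes`) and the regular expression are hypothetical, and the pixel conversion (scale by image size, floor, clamp to 999) is an assumption chosen to satisfy the [0, 1000) range; the source states only the tag names, the corner order, and the normalization range.

```python
import re

# Matches an optional <ref>label</ref> followed by <box>(x1,y1),(x2,y2)</box>.
BOX_PATTERN = re.compile(
    r"(?:<ref>(?P<label>.*?)</ref>)?"
    r"<box>\((?P<x1>\d+),\s*(?P<y1>\d+)\),\s*\((?P<x2>\d+),\s*(?P<y2>\d+)\)</box>"
)


def to_grid(value, size):
    """Map a pixel coordinate to an integer in [0, 1000) (assumed rounding)."""
    return min(int(value * 1000 // size), 999)


def format_box(label, box, width, height):
    """Render a pixel-space box (x1, y1, x2, y2) in the tagged text format."""
    x1, y1, x2, y2 = box
    return (
        f"<ref>{label}</ref>"
        f"<box>({to_grid(x1, width)},{to_grid(y1, height)}),"
        f"({to_grid(x2, width)},{to_grid(y2, height)})</box>"
    )


def parse_boxes(text, width, height):
    """Extract (label, pixel_box) pairs from model output text."""
    results = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(match[k]) for k in ("x1", "y1", "x2", "y2"))
        pixel_box = (
            x1 * width / 1000,
            y1 * height / 1000,
            x2 * width / 1000,
            y2 * height / 1000,
        )
        results.append((match["label"], pixel_box))
    return results


encoded = format_box("dog", (64, 48, 320, 240), width=640, height=480)
print(encoded)
print(parse_boxes(encoded, width=640, height=480))
```

Because the coordinates are plain digits inside ordinary text, the same decoding loop that produces captions can also produce boxes; only the post-processing step above is needed to recover pixel positions.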

How was Qwen-VL trained?

Qwen-VL is trained in three stages, with different components frozen or trainable at each step.^[1]

The first stage is low-resolution pretraining at 224x224. The language model is frozen, and only the visual encoder and the adapter are trained on a large, weakly supervised corpus of image-text pairs. After cleaning, the team used about 1.4 billion pairs, roughly 77.3% English and 22.7% Chinese, and trained for 50,000 steps. The goal of this stage is to align the visual features with the language model's representation space.^[1]

The second stage is multi-task pretraining at the higher 448x448 resolution, with the whole model (encoder, adapter, and language model) unfrozen. Here the model is trained jointly on seven task families at once: image captioning, visual question answering, grounding, reference grounding, grounded captioning, OCR, and pure text generation. The higher resolution matters for reading small text and locating small objects, which is why Qwen-VL emphasized 448x448 inputs while several contemporaries still used 224x224.^[1]^[3]
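The effect of the resolution change on sequence length can be worked out directly. The sketch below assumes non-overlapping 14-pixel patches on a square input and ignores any extra class token; `patch_count`, `PATCH_STRIDE`, and `ADAPTER_TOKENS` are hypothetical names, while the stride of 14 and the 256-token adapter output come from the architecture description above.

```python
PATCH_STRIDE = 14     # patch stride quoted in the architecture section
ADAPTER_TOKENS = 256  # fixed adapter output length


def patch_count(resolution, stride=PATCH_STRIDE):
    """Number of patches for a square image, assuming non-overlapping patches."""
    if resolution % stride:
        raise ValueError("resolution must be a multiple of the patch stride")
    per_side = resolution // stride
    return per_side * per_side


for resolution in (224, 448):
    patches = patch_count(resolution)
    ratio = patches // ADAPTER_TOKENS
    print(f"{resolution}x{resolution}: {patches} patches -> "
          f"{ADAPTER_TOKENS} tokens ({ratio}x compression)")
```

Under these assumptions a 224x224 image yields 256 patches and a 448x448 image yields 1,024, so doubling the side length quadruples the encoder's sequence while the language model still receives 256 visual tokens.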

The third stage is supervised fine-tuning to produce Qwen-VL-Chat. The visual encoder is frozen again, and the language model and adapter are tuned on roughly 350,000 instruction-following examples, including multi-image and multi-turn dialogue data, for about 8,000 steps. This stage is what gives the chat model its interactive, instruction-following behavior.^[1]

Stage	Resolution	Trainable parts	Data	Steps
1. Pretraining	224x224	Visual encoder + adapter	~1.4B image-text pairs	50,000
2. Multi-task pretraining	448x448	Full model	7 task families (captioning, VQA, grounding, OCR, text, etc.)	19,000
3. Supervised fine-tuning	448x448	Language model + adapter	~350K instruction samples	8,000
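The schedule in the table can be written down as a small configuration structure, which makes the frozen component at each stage explicit. This is an illustrative sketch, not the project's actual training configuration: `Stage`, `COMPONENTS`, and `STAGES` are hypothetical names, and the values are copied from the table above.

```python
from dataclasses import dataclass

COMPONENTS = frozenset({"visual_encoder", "adapter", "language_model"})


@dataclass(frozen=True)
class Stage:
    name: str
    resolution: int       # square input side length in pixels
    trainable: frozenset  # components updated during this stage
    steps: int

    @property
    def frozen(self):
        """Components whose weights are held fixed in this stage."""
        return COMPONENTS - self.trainable


STAGES = [
    Stage("pretraining", 224, frozenset({"visual_encoder", "adapter"}), 50_000),
    Stage("multi-task pretraining", 448, COMPONENTS, 19_000),
    Stage("supervised fine-tuning", 448, frozenset({"language_model", "adapter"}), 8_000),
]

for stage in STAGES:
    frozen = sorted(stage.frozen) or ["nothing"]
    print(f"{stage.name} @ {stage.resolution}px: frozen = {', '.join(frozen)}")
```

Read this way, the adapter is the only component trained in all three stages, while the language model and the visual encoder are each frozen once.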

What can Qwen-VL do?

Qwen-VL handles the standard multimodal tasks of image captioning and visual question answering, and adds several abilities that were less common in open models at its release.^[1]

It supports multi-image, interleaved dialogue, so a conversation can reference several pictures and the surrounding text together. It performs visual grounding: given a referring expression in Chinese or English, it can return a bounding box for the described object, and conversely it can describe what is inside a given box. Because OCR-style data and grounding boxes are mixed into pretraining, the model also does end-to-end recognition of bilingual text inside images, which underlies its document, chart, and scene-text question answering.^[1]^[3]

How does Qwen-VL perform on benchmarks?

On the paper's evaluation suite, the Qwen-VL base model was competitive with or ahead of other open vision-language models of similar size across captioning, general VQA, text-oriented VQA, and grounding. All figures below, including the grounding results, are base-model scores reported by the Qwen-VL paper.^[1]

Task	Benchmark	Qwen-VL score
Image captioning	Flickr30K (CIDEr, zero-shot)	85.8
Image captioning	NoCaps (CIDEr, zero-shot)	121.4
General VQA	VQAv2	79.5
General VQA	OK-VQA	58.6
General VQA	GQA	59.3
General VQA	ScienceQA (image)	67.1
Text VQA	TextVQA	63.8
Text VQA	DocVQA	65.1
Text VQA	OCR-VQA	75.7
Grounding	RefCOCO (val)	89.36
Grounding	RefCOCO+ (val)	83.12
Grounding	RefCOCOg (val)	85.58

For the chat model, the team reported instruction-following and perception scores on benchmarks designed for multimodal assistants, including TouchStone (645.2 in English, 401.2 in Chinese), SEED-Bench (58.2 on images), and MME (1487.58 perception, 360.71 cognition).^[1]

Is Qwen-VL open source?

The Qwen-VL and Qwen-VL-Chat weights, along with the inference, training, and fine-tuning code, are released under the Tongyi Qianwen License Agreement rather than a standard permissive license. Research and most commercial use are permitted free of charge, but the license requires any deployment whose product or service exceeds 100 million monthly active users to request a separate license from Alibaba.^[1]^[4] An Int4-quantized Qwen-VL-Chat was released on 31 August 2023, and on 12 September 2023 the team added fine-tuning support covering full-parameter, LoRA, and Q-LoRA methods.^[2]^[5] Later open Qwen multimodal models moved to the more permissive Apache 2.0 license: the smaller Qwen2-VL (2B and 7B) and the Qwen2.5-VL line (3B, 7B, and the later 32B) are Apache 2.0, while Qwen-VL itself uses the Tongyi Qianwen terms.^[6]^[7]

What are Qwen-VL's closed API variants?

Alongside the open 9.6B models, Alibaba shipped two larger, API-only members of the series. Qwen-VL-Plus, announced on 28 November 2023, raised the supported input resolution to over one million pixels and posted strong document-understanding results. Qwen-VL-Max, introduced on 18 January 2024, was Alibaba's flagship vision model at the time; the company reported that it performed on par with Gemini Ultra and GPT-4V across several text-image tasks, led on Chinese question answering and Chinese text comprehension, and reached competitive scores on DocVQA, MMMU, and MathVista. Neither Plus nor Max was released as open weights; both are served through Alibaba Cloud's API and web interface.^[2]^[6]

How has Qwen-VL evolved?

Qwen-VL was the foundation for a rapidly evolving line of Alibaba multimodal models, each carrying forward the core idea of a strong Qwen language model coupled to a vision encoder, trained to read, locate, and reason over images and text together.^[7]^[8]

Qwen2-VL, released on 30 August 2024 in 2B, 7B, and 72B sizes, replaced the fixed 256-token adapter with "Naive Dynamic Resolution" so the model can "handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens," and added video understanding for clips "over 20 minutes" long via Multimodal Rotary Position Embedding (M-RoPE).^[7] Qwen2.5-VL, released in January 2025 (with a 32B variant added on 24 March 2025) in 3B, 7B, 32B, and 72B sizes, further improved document parsing and structured extraction from invoices, forms, and tables, comprehension of videos over one hour with event localization, and agentic use of computer and phone interfaces as a visual agent.^[8] Qwen3-VL is the most recent generation in the family.^[7]^[8]

Generation	Released	Open sizes	License	Key additions over predecessor
Qwen-VL / Qwen-VL-Chat	22 Aug 2023	9.6B	Tongyi Qianwen	Grounding, bilingual in-image text reading, 448x448 inputs
Qwen2-VL	30 Aug 2024	2B, 7B (72B API)	Apache 2.0 (2B/7B)	Dynamic resolution, video over 20 min, M-RoPE
Qwen2.5-VL	Jan 2025 (32B Mar 2025)	3B, 7B, 32B, 72B	Apache 2.0	Document parsing, hour-long video, agentic UI grounding

ELI5

Imagine a smart reading-and-looking helper. Older AI chatbots could only read and write words. Qwen-VL also has eyes: you can show it a photo, a screenshot, or a page of text, and it can tell you what is in the picture, read the words inside it (in both English and Chinese), and even point to exactly where something is by drawing a box around it. Alibaba gave away the smaller versions for free so anyone can use them, and each newer version (Qwen2-VL, then Qwen2.5-VL) got better at watching long videos and reading messy documents.

References

[1] Bai, Jinze; Bai, Shuai; Yang, Shusheng; et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv:2308.12966. https://arxiv.org/abs/2308.12966
[2] QwenLM/Qwen-VL GitHub repository (official, Alibaba Cloud). https://github.com/QwenLM/Qwen-VL
[3] "Large-scale Vision Language Models (LVLMs): Qwen-VL and Qwen-VL-Chat." Encord blog. https://encord.com/blog/qwen-vl-large-scale-vision-language-models/
[4] "Qwen-VL/LICENSE (Tongyi Qianwen License Agreement)." QwenLM/Qwen-VL GitHub. https://github.com/QwenLM/Qwen-VL/blob/master/LICENSE
[5] "Qwen." Wikipedia. https://en.wikipedia.org/wiki/Qwen
[6] "Introducing Qwen-VL." Qwen team blog. https://qwenlm.github.io/blog/qwen-vl/
[7] "Qwen2-VL: To See the World More Clearly." Qwen team blog. https://qwenlm.github.io/blog/qwen2-vl/
[8] "Qwen2.5-VL Technical Report." arXiv:2502.13923. https://arxiv.org/abs/2502.13923
