Qwen2-VL
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,573 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,573 words
Add missing citations, update stale details, or suggest a clearer explanation.
Qwen2-VL is a family of open-weight vision-language models released by the Qwen team at Alibaba Cloud between August and September 2024. The series, comprising 2B, 7B, and 72B Instruct variants, introduced two architectural innovations: a Naive Dynamic Resolution mechanism that lets the vision encoder process images of arbitrary size into variable-length visual token sequences, and Multimodal Rotary Position Embedding (M-RoPE) that unifies positional information across text, images, and video.[1][2] The accompanying technical report, "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution" by Peng Wang, Shuai Bai and colleagues, claims the 72B flagship reaches parity with proprietary frontier models such as GPT-4o and Claude 3.5 Sonnet on several multimodal benchmarks.[1] Qwen2-VL succeeded the original Qwen-VL of August 2023 and was itself superseded in January 2025 by Qwen2.5-VL.[3][4]
| Attribute | Value |
|---|---|
| Developer | Qwen team, Alibaba Cloud |
| Series | Qwen2-VL-2B-Instruct, Qwen2-VL-7B-Instruct, Qwen2-VL-72B-Instruct |
| Initial release | 29 August 2024 (2B and 7B) |
| Flagship release | 18 September 2024 (72B Instruct) |
| Architecture | Vision Transformer encoder (about 675M parameters) plus Qwen2 LLM decoder |
| Training data | Approximately 1.4 trillion tokens across two pretraining stages |
| Knowledge cutoff | June 2023 (image data) |
| License (2B, 7B) | Apache 2.0 |
| License (72B) | Tongyi Qianwen License (custom) |
| Paper | arXiv:2409.12191, 18 September 2024 |
| Repository | github.com/QwenLM/Qwen2-VL |
The Qwen vision-language line began with Qwen-VL, posted to arXiv on 24 August 2023 by Jinze Bai, Shuai Bai, and collaborators. The first paper, titled "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond," paired the Qwen language backbone with a visual receptor and introduced a three-stage training pipeline together with image-caption-box alignment data to enable grounding and OCR-style text reading.[3] Qwen-VL and its Qwen-VL-Chat instruction-tuned variant established the design ethos that later iterations would inherit: dense visual feature extraction, autoregressive decoding, and emphasis on multilingual coverage. The first generation used a fixed input resolution and a learned re-sampler to compress visual tokens to a constant length, an approach Qwen2-VL would explicitly abandon.[3] The Qwen language stack itself moved to the Qwen2 generation in mid-2024, and a vision-language follow-up built on that backbone became the natural next step.[5]
By mid-2024, open-weight vision-language models were proliferating across labs in China, the United States, and Europe. The pre-Qwen2-VL landscape included LLaVA-NeXT, InternVL 1.5, MiniCPM-V 2.5, CogVLM2, and the Idefics line from Hugging Face, all of which competed on a shifting frontier of OCR fidelity, document understanding, and instruction following. Most of these systems still relied on tiled or windowed handling of high-resolution inputs, a workaround whose limitations the Qwen team explicitly used to motivate their dynamic-resolution design.[1][7]
Qwen2-VL was announced on 29 August 2024, with the 2B and 7B Instruct weights published immediately on Hugging Face and ModelScope. Alibaba's preview noted that the 72B model would follow as an open release after additional safety work, with API access available earlier through Alibaba Cloud's Model Studio under the model name qwen-vl-max.[2] The 72B-Instruct weights and quantized AWQ and GPTQ Int4 and Int8 variants were published on 18 September 2024, the same day the technical report appeared on arXiv as version 1.[1][6] A revised version 2 of the paper was posted on 3 October 2024.[1] The blog post accompanying the August release framed the launch as a step toward "seeing the world more clearly," positioning visual perception alongside the company's existing investments in voice and text models in the Tongyi family.[2]
The named lead authors on the technical report are Peng Wang and Shuai Bai. Co-authors include Sinan Tan, Shijie Wang, and roughly fifteen further collaborators from the Qwen team; the work credits collective contributions across the Alibaba group's multimodal and Tongyi Lab efforts.[1] Shuai Bai had also been a lead author on the original Qwen-VL paper, providing personnel continuity across the two generations.[3] Five months later, the team released Qwen2.5-VL (26 January 2025), which kept the dynamic-resolution and M-RoPE design but added window attention in the vision encoder and absolute time encoding for video, while extending the size lineup to 3B, 7B, and 72B variants.[4]
Qwen2-VL keeps the same broad shape as most modern vision-language models: a Vision Transformer encoder produces a sequence of visual tokens that is concatenated with the text token sequence and fed into an autoregressive language model. Two design choices distinguish it from contemporaries.[1]
Traditional vision encoders are trained at a fixed input size (commonly 224 or 336 pixels square) and either resize, pad, or tile larger inputs. Qwen2-VL instead trains its encoder to process inputs at the resolution they arrive in, mapping the image to a number of visual tokens proportional to its area. The mechanism is called Naive Dynamic Resolution because no learned re-sampler or Q-Former is involved; the ViT (Vision Transformer) operates on 14 by 14 patches and emits one token per patch, modulo a 2 by 2 token-merging step that reduces the sequence by a factor of four before it reaches the LLM.[1][7]
To make this work, the ViT was retrofitted with 2D rotary position embeddings rather than the fixed learned absolute position embeddings used in the original Vision Transformer (ViT). The 2D RoPE applies separate rotary frequencies to the height and width axes, so the same encoder can interpret patch positions at any resolution it encounters at inference. The Qwen team contrasts this with the alternative of tiling a large image into many fixed-size crops, an approach that fragments long-range visual context and inflates the visual-token budget by duplicating overlap regions.[1][7]
The result is that a small thumbnail might generate only a handful of visual tokens while a high-resolution document image may produce thousands. This makes the model behave more like human perception (devoting representation budget to the available detail) and lets it handle extreme aspect ratios and very dense documents without the lossy fixed-window tiling used by some earlier systems.[1][2] The shared encoder has roughly 675 million parameters and is reused unchanged across the 2B, 7B, and 72B configurations so the vision compute does not scale with the language model.[7] This shared-encoder design also keeps the cost of running the small variant low: a 2B language decoder is paired with the same 675M ViT used by the 72B, which is significant since the encoder dominates inference cost for small models on high-resolution inputs.
The second key contribution is M-RoPE, a generalization of Rotary position embedding (RoPE) that decomposes positional information into three components: a temporal index, a height index, and a width index. Text tokens use identical values across the three components, behaving like the standard 1D RoPE. Image tokens carry a fixed temporal index but distinct spatial coordinates. Video tokens increment the temporal index with each frame while preserving spatial layout. The decomposition lets a single embedding scheme address 1D text, 2D images, and 3D video sequences inside the same context window without ad hoc switching.[1][7]
Concretely, when an image is inserted into a chat context, every patch token shares the same temporal index as the preceding text token (so the image does not advance time), while its spatial indices vary across the height and width of the image. The next text token after the image picks up an incremented temporal index but resets the spatial indices to a shared value. For video, the temporal index increments with each sampled frame, providing a natural ordering signal that the model uses to answer questions about event order in long clips. The Qwen team argues that this scheme generalizes to longer videos than methods which encode spatial and temporal position in a single flattened axis, because each axis can be scaled independently.[1][7]
The report describes a three-stage training procedure. In the first stage, only the ViT is trained on image-text pairs to align visual features with the LLM's token space; this stage focuses on coarse image-text relationships, OCR pretraining, and visual grounding with bounding-box supervision.[7] The second stage unfreezes all parameters and trains on a large mixed-modality corpus that includes image-text pairs, OCR data, interleaved documents, and visual question answering tasks. The final stage freezes the ViT and instruction-tunes the language model on multimodal chat data using the ChatML format. Pretraining used approximately 600 billion tokens in the first phase and around 800 billion in the second, for roughly 1.4 trillion tokens of multimodal data overall.[7]
The post-training pipeline for the Instruct variants combined supervised fine-tuning with a preference-based stage to align responses on multimodal tasks; the report notes use of Direct Preference Optimization (DPO)-style optimization on a curated set of multimodal preference pairs covering chart reading, math reasoning, document QA, and refusal behaviors.[7] Instruction Tuning data spanned multiple languages and included long-context examples to support the 32K-token context window claimed for the released models.[9]
Each variant pairs the 675M ViT with a Qwen2 language model: the 2B and 7B use the respective dense Qwen2 backbones, while the 72B variant pairs with Qwen2-72B.[1] The 72B-Instruct release lists 73 billion total parameters and BF16 precision on its Hugging Face model card.[6] Quantization variants released alongside the 72B used 4-bit and 8-bit AWQ and GPTQ to reduce VRAM requirements from roughly 145 GB at BF16 to less than 50 GB at Int4, making the flagship runnable on a single 80 GB GPU once quantized.[6]
| Variant | LLM backbone | Vision encoder | Release | License |
|---|---|---|---|---|
| Qwen2-VL-2B-Instruct | Qwen2-1.5B class | 675M ViT | 29 Aug 2024 | Apache 2.0[8] |
| Qwen2-VL-7B-Instruct | Qwen2-7B | 675M ViT | 29 Aug 2024 | Apache 2.0[9] |
| Qwen2-VL-72B-Instruct | Qwen2-72B | 675M ViT | 18 Sep 2024 | Tongyi Qianwen License[6] |
| Qwen2-VL-72B-Instruct-AWQ / GPTQ-Int4 / GPTQ-Int8 | Quantized 72B | 675M ViT | 18 Sep 2024 | Tongyi Qianwen License[6] |
All weights are mirrored on Hugging Face under the Qwen/Qwen2-VL-* namespace and on ModelScope, the Alibaba Cloud model hub. The 2B-Instruct card alone reported more than 3.7 million monthly downloads on Hugging Face, indicating the small variant became a popular building block for on-device and research use.[8] The license terms differ between the small and flagship models: the 2B and 7B weights are distributed under the permissive Apache 2.0 license, while the 72B uses Alibaba's custom Tongyi Qianwen License that imposes commercial-use thresholds and other conditions.[2][6][8][9]
The technical report and the official Qwen2-VL blog publish per-variant scores across image, video, and agent benchmarks. The headline figures for the Instruct variants are reproduced below.[2][6][7]
The reported evaluation followed the standard practice of reporting on widely used multimodal suites. The team selected six image benchmarks (MathVista, RealWorldQA, MMBench-EN, MTVQA, OCRBench, DocVQA) and four video benchmarks (Video-MME, MVBench, EgoSchema, PerceptionTest) as headline results, alongside the agent suites discussed below. Comparisons in the technical report span GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, InternVL 2, and LLaVA-OneVision.[7]
| Benchmark | Qwen2-VL-2B | Qwen2-VL-7B | Qwen2-VL-72B |
|---|---|---|---|
| MathVista (testmini) | 43.0 | 58.2 | 70.5 |
| RealWorldQA | 62.9 | 70.1 | 77.8 |
| MMBench-EN (test) | 74.9 | 83.0 | 86.5 |
| MTVQA | 18.1 | 25.6 | 30.9 |
| OCRBench | 794 | 845 | 877 |
| DocVQA (test) | 90.1 | 94.5 | 96.5 |
| Benchmark | Qwen2-VL-2B | Qwen2-VL-7B | Qwen2-VL-72B |
|---|---|---|---|
| Video-MME (with subtitles) | 60.4 | 69.0 | 77.8 |
| MVBench | 63.2 | 67.0 | 73.6 |
| EgoSchema | 54.9 | 66.7 | 77.9 |
| PerceptionTest | 53.9 | 62.3 | 68.0 |
The team reports that Qwen2-VL-72B set a state-of-the-art DocVQA score of 96.5 and an OCRBench score of 877 at release, ahead of contemporaneous proprietary systems, while reaching results within a few points of GPT-4o on most general benchmarks and surpassing it on multilingual OCR for most languages except Arabic.[2][7] On MathVista the 72B scored 70.5 against GPT-4o's reported 63.8 in the same evaluation table, and on RealWorldQA the 77.8 matched the proprietary baseline within noise.[7] The 2B and 7B Instruct variants also became reference points for the open-weights community: at release, Qwen2-VL-2B's 90.1 DocVQA score was higher than several 7B-class models from earlier in 2024, illustrating how the Naive Dynamic Resolution design lifted small-model performance on text-heavy inputs.[8]
Qwen2-VL was one of the first openly released vision-language models to publish strong results on visual-agent tasks. The 72B model recorded a 93.1 type-match and 53.2 exact-match score on the team's internal function calling benchmark (FnCall), 89.6 task-match and 72.1 exact-match on an Android UI control benchmark, and reported additional results on AITZ, ALFRED, card games, and vision-language navigation in the technical report.[6][7] The Qwen2-VL blog frames the model as a "visual agent" capable of operating mobile phones, robots, and other devices through screen understanding plus tool calls.[2]
The agent evaluation in the paper is notable because it spans both UI control (which requires reading screenshots and emitting structured action tokens such as click coordinates and keystrokes) and embodied tasks like ALFRED. The Android control numbers, in particular, demonstrated that an open-weight model could approach the kind of GUI automation that frontier closed models such as Claude with computer use were beginning to advertise around the same time.[7] Independent third-party agent benchmarks would later use Qwen2-VL as a baseline, including the AndroidControl-Curated suite that examined whether reported agent scores transferred to harder real-world tasks.[7]
Beyond raw benchmark numbers, the released variants support several concrete capabilities highlighted by Alibaba and by independent secondary coverage:
The combination of these capabilities lets a single model serve as the perception layer for a broad range of pipelines: an enterprise might use the 7B variant to parse a stream of invoices, summarize a security-camera feed, and answer end-user chat questions on documentation screenshots, all without swapping in specialized models. Several of these use cases appeared in the Qwen Cookbook and in third-party showcase notebooks shortly after release.[11]
The official QwenLM/Qwen2-VL repository on GitHub provides example inference code, evaluation scripts, and a qwen-vl-utils helper package used by the Hugging Face Transformers integration. The model class Qwen2VLForConditionalGeneration and matching AutoProcessor were merged into the Hugging Face Transformers library shortly after release, enabling a one-line .from_pretrained() workflow that closely mirrors text-only Qwen2 usage.[9] Within weeks of launch, Qwen2-VL was supported by vLLM for high-throughput serving, by AutoGPTQ and AutoAWQ for low-bit quantization, by Llama-Factory for fine-tuning, and by ONNX-based inference stacks for edge deployment.[2][11]
The 2B variant became especially popular in edge and research settings. Its small footprint (around 4 GB at BF16) and Apache 2.0 license made it a common starting point for downstream fine-tuning on niche document-understanding and OCR tasks, including chart parsing, receipt recognition, and form extraction. Walkthroughs and recipes published by third-party tutorial sites such as DebuggerCafe and the LearnOpenCV blog demonstrated low-rank adaptation (LoRA) fine-tuning on consumer GPUs.[11] The 7B Instruct variant served as the default mid-size baseline for downstream agentic and document tasks during late 2024.[9]
Independent secondary coverage credited the release with raising the bar for open multimodal models. SiliconANGLE described the launch as bringing video reasoning of "advanced" caliber into open weights, while emphasizing Alibaba's claim that the 72B variant could match or exceed GPT-4o and Claude 3.5 Sonnet on selected benchmarks.[12] The hub statistics for the 2B and 7B variants (millions of downloads per month) made Qwen2-VL one of the most-used open-weight multimodal models in the second half of 2024.[8] Within Alibaba's own ecosystem, the model was offered through Model Studio as a managed API and integrated into the Tongyi mobile and enterprise products.[2]
The model cards for all three variants enumerate the same set of caveats, which match the discussion in the technical report:[6][7][9]
Reviewers also noted that the Apache 2.0 grant for the small variants is more permissive than the bespoke license attached to the 72B model, which complicates downstream redistribution and commercial deployment when the flagship is required.[2][6] The split license stance was unusual in the open-weights space at the time: most contemporaries either released everything under Apache 2.0 (as DeepSeek did) or applied a single bespoke license across the entire family (as Meta did with its Llama license). The Qwen team partially closed this gap with the next release, distributing all Qwen2.5-VL sizes uniformly under Apache 2.0.[4]
Qwen2-VL sits within a wave of late-2024 open-weight vision-language releases. Direct contemporaries include InternVL from Shanghai AI Lab, LLaVA (Large Language and Vision Assistant) derivatives such as LLaVA-OneVision, DeepSeek Janus from DeepSeek, and Llama 3.2's 11B and 90B vision variants from Meta.[1] Among Chinese labs, DeepSeek-VL2 followed shortly after, and the broader Multimodal Models field consolidated around the dynamic-resolution plus M-RoPE recipe that Qwen2-VL helped popularize.[4]
In terms of design philosophy, Qwen2-VL contrasts with two adjacent approaches. Models like LLaVA-NeXT and InternVL 1.5 relied on tiling high-resolution inputs into multiple fixed-size crops, while models such as Idefics-2 used learned re-samplers (Perceiver-style attention) to compress visual tokens to a fixed budget. Qwen2-VL's argument is that doing neither, and instead training the encoder directly on variable-length inputs, simplifies the architecture, improves long-context fidelity on dense visuals, and avoids quality losses from resampling.[1] The recipe was widely adopted: InternVL 3, the Qwen3-VL series, and several other follow-on papers in 2025 explicitly cite Qwen2-VL's Naive Dynamic Resolution and M-RoPE as inspirations.[13]
The direct successor, Qwen2.5-VL, was released on 26 January 2025 with 3B, 7B, and 72B variants under Apache 2.0 across the board, an updated vision encoder using window attention, and absolute-time encoding for video to enable second-level temporal grounding.[4] Qwen2.5-VL extended the agent narrative further with explicit GUI-control prompting and structured "QwenVL HTML" document outputs, and Alibaba reported that the 3B successor exceeded Qwen2-VL-7B on most image benchmarks.[4] Subsequent releases in the family include Qwen3-VL (2B through 235B variants released across late 2025) which extended the architecture further with native long-context support and additional agentic capabilities, including a separate "Thinking" track that interleaves reasoning chains with visual observations.[13]