Qwen3-VL
Last reviewed
May 31, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 ยท 2,460 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 31, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 ยท 2,460 words
Add missing citations, update stale details, or suggest a clearer explanation.
Qwen3-VL is a family of open-weight vision-language models built by the Qwen team at Alibaba Cloud, first released in September 2025. It is the multimodal branch of the Qwen3 generation, pairing a vision encoder with a Qwen3 language backbone so that a single model can read text, look at images, and watch video. The series spans small dense models that run on a single GPU up to a 235-billion-parameter mixture-of-experts flagship, and every size ships in two flavors, an Instruct version for direct use and a Thinking version tuned for step-by-step reasoning. All released checkpoints carry an Apache 2.0 license. [1][2]
According to the Qwen team, Qwen3-VL is the most capable vision-language model the group has produced, and its top configuration trades blows with leading closed systems such as Gemini 2.5 Pro and GPT-5 on a wide set of multimodal benchmarks. [2][3] The headline additions over the previous generation are a stronger visual agent that can drive software interfaces, OCR that now covers many more languages, and long-video understanding measured in hours rather than minutes. [1][4]
Qwen3-VL sits at the end of a fairly long line of multimodal work from Alibaba. The first Qwen-VL models appeared in 2023, followed by Qwen2-VL and then Qwen2.5-VL, which became a widely used open VLM through 2025. Qwen3-VL is a separate, newer series and not a re-release of Qwen2.5-VL. It rebuilds the visual stack on top of the Qwen3 text models and inherits the split into Instruct and Thinking variants that the broader Qwen3 lineup introduced. The team frames it as one of the multimodal members of the Qwen3 family, alongside text-only models and code-focused efforts like Qwen3-Coder. [1][2]
The naming follows the rest of Qwen3. A plain size such as 8B refers to a dense model with that many parameters. A name with an "A" component, such as 235B-A22B, refers to a mixture-of-experts model where the first number is the total parameter count and the second is the number of parameters actually activated for each token. [2]
The family has six base configurations. Four are dense models at 2B, 4B, 8B, and 32B parameters. Two are mixture-of-experts models, a 30B-A3B model with about 30 billion total parameters and roughly 3 billion active per token, and the flagship 235B-A22B with about 235 billion total parameters and roughly 22 billion active. [1][2] The MoE design lets the large models keep inference cost closer to a much smaller dense model, since only a slice of the experts fires for any given token. That is what makes a 235B model practical to serve at speeds comparable to far smaller systems. [2]
Every configuration comes in an Instruct edition and a Thinking edition. The Instruct edition answers directly and suits perception, captioning, OCR, and everyday agent tasks. The Thinking edition is post-trained to produce explicit chains of reasoning before its final answer, which helps on STEM problems, visual math, and multi-step puzzles. The Qwen team trained both from the same pretraining base and then split post-training into the two modes. [2] Quantized FP8 builds of several sizes were also published to lower memory needs for deployment. [1]
The staged rollout ran across late 2025. The flagship 235B-A22B Instruct and Thinking weights opened first on 23 September 2025. The 30B-A3B variants and FP8 builds followed in early October, then the 4B and 8B dense models in mid-October, and the 2B and 32B dense models on 21 October. A technical report covering the full series was published in late November 2025. [1][3]
Qwen3-VL keeps the three-part shape common to recent VLMs, a vision encoder, a vision-language merger, and a language model decoder. The vision side uses the SigLIP-2 architecture as its encoder, with a SigLIP2-SO-400M variant for most sizes and a smaller SigLIP2-Large at roughly 300M parameters for the 2B and 4B models. An MLP-based merger compresses the visual features into tokens that line up with the language model's hidden dimension. [2][5] For background, SigLIP is a contrastively trained image-text encoder, and the language decoder is a Qwen3 large language model. [2]
The report calls out three changes that separate Qwen3-VL from its predecessor. The first is an enhanced interleaved multimodal rotary position embedding, written as interleaved-MRoPE, which spreads positional frequencies across time, height, and width so the model tracks layout and motion more reliably over long image sequences and long video. [2] The second is DeepStack, a cross-layer fusion idea where visual tokens are not taken only from the final encoder layer. Instead the model pulls features from three different levels of the vision transformer and routes them into corresponding layers of the language model, which sharpens fine detail and tightens the alignment between pixels and words. [2][6] The third change targets video timing. Where Qwen2.5-VL encoded absolute time through positional encoding, Qwen3-VL writes explicit timestamp tokens, prefixing each group of video frames with a formatted string such as a marker reading three seconds, which gives the model a direct and readable sense of when something happens. [2]
Context length is a defining feature. Every model natively supports interleaved sequences of up to 256K tokens covering mixed text, images, and video, and that window extends to roughly 1 million tokens using YaRN-based positional scaling. [1][2] Pretraining was carried out at progressively larger sequence lengths up to 256K tokens, and the team applied a square-root reweighting scheme to balance text-only and multimodal objectives so that adding vision did not erode pure-text quality. [2]
The most talked-about new skill is the visual agent. Qwen3-VL can look at a PC or mobile screen, recognize interface elements such as buttons and menus, reason about what each control does, and then call tools or take actions to finish a task. [1][2] The Qwen team positions this as a bridge between perception, reasoning, and action, and points to it as groundwork for embodied agents that operate digital interfaces and, eventually, robotic systems. [1][2]
Optical character recognition saw a large expansion. Qwen2.5-VL handled 10 languages beyond Chinese and English. Qwen3-VL adds 29 more for a total of 39 supported languages, and the report states it keeps over 70 percent accuracy on 32 of those 39 in its own evaluation. The team also reports better reading in low light, blur, and tilt, plus stronger handling of rare or archaic characters and the structure of long documents. [2][4]
Long-video understanding is a second focus. With the 1 million token window, the model can ingest video up to about two hours long and answer questions about specific moments. In a video needle-in-a-haystack test, where a single informative frame is hidden somewhere in a long clip, the 235B-A22B Instruct model scored a perfect 100 percent accuracy on videos up to 30 minutes, which corresponds to a 256K token context, and held 99.5 percent accuracy when extrapolated to roughly two hours of video at 1 million tokens. [2][4]
Spatial perception and grounding round out the visual skills. The model supports both box-based and point-based grounding, can judge object positions, viewpoints, and occlusion, and adds 3D and spatial reasoning aimed at embodied use, including estimating object affordances and supporting action planning. [1][2] The team also reports broad visual recognition across categories such as landmarks, products, plants, and animals, a visual coding mode that turns an image or sketch into Draw.io, HTML, CSS, or JavaScript, and text understanding meant to stay on par with text-only LLMs of similar size. [1][2]
The technical report compares Qwen3-VL against strong proprietary models. The table below lists scores reported by the Qwen team for the flagship Qwen3-VL-235B-A22B in both Thinking and Instruct modes, alongside the corresponding figures the report cites for Gemini 2.5 Pro and GPT-5 in their stronger settings. Higher is better on every row except OCRBench, which is scored out of 1000, and all values are read directly from the report's tables. [2]
| Benchmark | Qwen3-VL 235B Thinking | Qwen3-VL 235B Instruct | Gemini 2.5 Pro | GPT-5 (high) |
|---|---|---|---|---|
| MMMU | 80.6 | 78.7 | 81.7 | 84.2 |
| MMMU-Pro | 69.3 | 68.1 | 68.8 | 78.4 |
| MathVista (mini) | 85.8 | 84.9 | 82.7 | 81.3 |
| MathVision | 74.6 | 66.5 | 73.3 | 70.9 |
| MMStar | 78.7 | 78.4 | 77.5 | 76.4 |
| MMBench-EN | 88.8 | 89.3 | 90.1 | 83.8 |
| RealWorldQA | 81.3 | 79.2 | 78.0 | 82.8 |
| DocVQA (test) | 96.5 | 97.1 | 92.6 | 91.5 |
| InfoVQA (test) | 89.5 | 89.2 | 84.2 | 79.0 |
| ChartQA (test) | 90.3 | 90.3 | 83.3 | 59.7 |
| AI2D (with mask) | 89.2 | 89.7 | 90.9 | 89.7 |
| OCRBench | 875 | 920 | 866 | 810 |
The pattern in the report is that the flagship leads on document and OCR tasks like DocVQA, InfoVQA, ChartQA, and OCRBench, and on several visual-math measures such as MathVista and MathVision, while the closed models stay ahead on parts of the broad knowledge benchmark MMMU and its harder MMMU-Pro split. [2] The Qwen team summarizes the Instruct version as matching or beating Gemini 2.5 Pro on key visual perception benchmarks. [3] Independent coverage of the technical report repeated several of these figures, including 85.8 on MathVista, 96.5 on DocVQA, and 875 on OCRBench. [4]
The report also benchmarks the medium models. It places Qwen3-VL-32B and the 30B-A3B MoE against systems such as Gemini 2.5 Flash and GPT-5 mini, and notes that scores climb steadily with size, for example MMBench-EN in Thinking mode rising from 79.9 on the 2B model to 85.3 on the 8B model. [2] As with most vendor-reported numbers, the comparison columns are drawn from the Qwen team's own runs and cited technical reports rather than a neutral third party, so they are best read as the developer's framing.
| Property | Detail |
|---|---|
| Developer | Qwen team, Alibaba Cloud |
| First release | September 2025 (flagship), full series through late 2025 |
| Dense sizes | 2B, 4B, 8B, 32B |
| MoE sizes | 30B-A3B (about 3B active), 235B-A22B (about 22B active) |
| Variants per size | Instruct and Thinking |
| Vision encoder | SigLIP-2 (SO-400M default; Large 300M for 2B and 4B) |
| Vision-language merger | MLP-based, with DeepStack multi-level fusion |
| Position encoding | Interleaved-MRoPE; explicit video timestamp tokens |
| Native context | 256K tokens (text, image, and video interleaved) |
| Extended context | About 1M tokens via YaRN scaling |
| OCR languages | 39 |
| License | Apache 2.0 |
| Distribution | Hugging Face, ModelScope, GitHub |
All Qwen3-VL open weights are released under the Apache 2.0 license, which permits commercial use, modification, and redistribution with attribution. [1][2] The models are distributed through Hugging Face and ModelScope, with code, usage examples, and a cookbook in the QwenLM/Qwen3-VL GitHub repository. [1] Alibaba Cloud also offers hosted access to the series through its Model Studio API. [2]
Because Qwen3-VL uses standard transformer components, support arrived quickly across the open ecosystem. The Hugging Face Transformers library exposes the models through dedicated classes, and serving stacks and fine-tuning toolkits added Qwen3-VL recipes, including community guides for running and tuning the smaller sizes on modest hardware. [1][7][8] FP8 builds and the range of dense sizes from 2B upward make the family usable from edge-class devices to multi-GPU servers. [1]
Against its direct predecessor, Qwen3-VL changes several things at once. The language backbone moves to Qwen3, the vision encoder is SigLIP-2 with DeepStack fusion rather than the earlier single-layer feature path, and video timing switches from positional encoding to explicit timestamp tokens. [2] Practical reach grows as well. Context extends to a native 256K tokens with a 1 million token ceiling, OCR coverage rises from 10 to 39 languages beyond Chinese and English, and video understanding stretches to roughly two hours. [1][2][4] The Thinking and Instruct split is also new relative to the single-mode Qwen2.5-VL releases. [2]
Against closed competitors, the report positions the 235B-A22B flagship as roughly competitive with Gemini 2.5 Pro and GPT-5 across the multimodal suite, leading on document parsing, chart reading, and OCR while trailing on parts of the broad knowledge benchmark MMMU. [2] It also competes with Claude Opus 4.1, which appears as a comparison point in the report's tables. [2] The main draw over those systems is openness. Qwen3-VL ships full weights under Apache 2.0 across a wide size range, which the closed models do not, while those closed models may still hold an edge on certain reasoning and knowledge benchmarks. [2][3] Within Alibaba's own lineup, Qwen3-VL is the vision-capable counterpart to the text-only Qwen3 models, separate from products like Qwen3-Max. [1]
Several cautions apply. The benchmark comparisons come from the Qwen team's own measurements and cited reports, not an independent evaluator, so the head-to-head figures reflect the developer's testing choices. [2] The closed models still lead on parts of MMMU and MMMU-Pro, which suggests room to grow on broad multimodal knowledge and the hardest reasoning. [2] The 1 million token window depends on YaRN extrapolation beyond the 256K native length, and the near-perfect long-video result is shown mainly on a controlled needle-in-a-haystack setup rather than across every long-video task. [2] OCR breadth is real but uneven, with the over 70 percent accuracy figure holding on 32 of the 39 supported languages rather than all of them. [2] Running the largest models still demands serious hardware, with the 235B flagship weighing in around 471 GB at full precision even with FP8 builds available to ease deployment. [4] As with any system that can drive interfaces and act on screens, the visual agent features warrant careful sandboxing and human oversight in production. [2]