Qwen3-VL

Chinese AI Large Language Models Multimodal AI Open Source AI

14 min read

Updated Jul 7, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 7, 2026

Fact-checked

In review queue

Sources

15 citations

Revision

v2 · 2,807 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Qwen3-VL is a family of open-weight vision-language models built by the Qwen team at Alibaba Cloud, first released in September 2025. It is the multimodal branch of the Qwen3 generation, pairing a vision encoder with a Qwen3 language backbone so that a single model can read text, look at images, and watch video. The series spans small dense models that run on a single GPU up to a 235-billion-parameter mixture-of-experts flagship, and every size ships in two flavors, an Instruct version for direct use and a Thinking version tuned for step-by-step reasoning. All released checkpoints carry an Apache 2.0 license. ^[1]^[2]

The technical report calls Qwen3-VL "the most capable vision-language model in the Qwen series to date," and its top configuration trades blows with leading closed systems such as Gemini 2.5 Pro and GPT-5 on a wide set of multimodal benchmarks. ^[2]^[3] The headline additions over the previous generation are a stronger visual agent that can drive software interfaces, OCR that now covers many more languages, and long-video understanding measured in hours rather than minutes. ^[1]^[4] In February 2026 the Qwen team moved past the standalone VL branch with Qwen3.5, a natively multimodal model that folds vision into the main line and that Alibaba says outperforms Qwen3-VL at similar scales. ^[14]^[15]

Where does Qwen3-VL fit in the Qwen lineage?

Qwen3-VL sits at the end of a fairly long line of multimodal work from Alibaba. The first Qwen-VL models appeared in 2023, followed by Qwen2-VL and then Qwen2.5-VL, which became a widely used open VLM through 2025. Qwen3-VL is a separate, newer series and not a re-release of Qwen2.5-VL. It rebuilds the visual stack on top of the Qwen3 text models and inherits the split into Instruct and Thinking variants that the broader Qwen3 lineup introduced. The team frames it as one of the multimodal members of the Qwen3 family, alongside text-only models and code-focused efforts like Qwen3-Coder. ^[1]^[2]

The naming follows the rest of Qwen3. A plain size such as 8B refers to a dense model with that many parameters. A name with an "A" component, such as 235B-A22B, refers to a mixture-of-experts model where the first number is the total parameter count and the second is the number of parameters actually activated for each token. ^[2]

What sizes and variants does Qwen3-VL come in?

The family has six base configurations. Four are dense models at 2B, 4B, 8B, and 32B parameters. Two are mixture-of-experts models, a 30B-A3B model with about 30 billion total parameters and roughly 3 billion active per token, and the flagship 235B-A22B with about 235 billion total parameters and roughly 22 billion active. ^[1]^[2] The MoE design lets the large models keep inference cost closer to a much smaller dense model, since only a slice of the experts fires for any given token. That is what makes a 235B model practical to serve at speeds comparable to far smaller systems. ^[2]

Every configuration comes in an Instruct edition and a Thinking edition. The Instruct edition answers directly and suits perception, captioning, OCR, and everyday agent tasks. The Thinking edition is post-trained to produce explicit chains of reasoning before its final answer, which helps on STEM problems, visual math, and multi-step puzzles. The Qwen team trained both from the same pretraining base and then split post-training into the two modes. ^[2] Quantized FP8 builds of several sizes were also published to lower memory needs for deployment. ^[1]

When was Qwen3-VL released?

The staged rollout ran across late 2025. The flagship 235B-A22B Instruct and Thinking weights opened first on 23 September 2025, and the smaller dense and MoE sizes followed over the next month. ^[1]^[3]^[13] A technical report covering the full series was published on arXiv in late November 2025. ^[2] The table below lists the public release dates from the QwenLM/Qwen3-VL repository. ^[1]

Variant	Type	Release date
235B-A22B Instruct and Thinking	MoE	23 September 2025
30B-A3B Instruct and Thinking (and FP8 builds)	MoE	4 October 2025
4B and 8B Instruct and Thinking	Dense	15 October 2025
2B and 32B Instruct and Thinking	Dense	21 October 2025
Qwen3-VL Technical Report (arXiv:2511.21631)	Paper	26 November 2025

How is Qwen3-VL built?

Qwen3-VL keeps the three-part shape common to recent VLMs, a vision encoder, a vision-language merger, and a language model decoder. The vision side uses the SigLIP-2 architecture as its encoder, with a SigLIP2-SO-400M variant for most sizes and a smaller SigLIP2-Large at roughly 300M parameters for the 2B and 4B models. An MLP-based merger compresses the visual features into tokens that line up with the language model's hidden dimension. ^[2]^[5] For background, SigLIP is a contrastively trained image-text encoder, and the language decoder is a Qwen3 large language model. ^[2]

The report calls out three changes that separate Qwen3-VL from its predecessor. The first is an enhanced interleaved multimodal rotary position embedding, written as interleaved-MRoPE, which spreads positional frequencies across time, height, and width so the model tracks layout and motion more reliably over long image sequences and long video. ^[2] The second is DeepStack, a cross-layer fusion idea where visual tokens are not taken only from the final encoder layer. Instead the model pulls features from three different levels of the vision transformer and routes them into corresponding layers of the language model, which sharpens fine detail and tightens the alignment between pixels and words. ^[2]^[6] The third change targets video timing. Where Qwen2.5-VL encoded absolute time through positional encoding, Qwen3-VL writes explicit timestamp tokens, prefixing each group of video frames with a formatted string such as a marker reading three seconds, which gives the model a direct and readable sense of when something happens. ^[2]

Context length is a defining feature. Every model natively supports interleaved sequences of up to 256K tokens covering mixed text, images, and video, and that window extends to roughly 1 million tokens using YaRN-based positional scaling. ^[1]^[2] Pretraining was carried out at progressively larger sequence lengths up to 256K tokens, and the team applied a square-root reweighting scheme to balance text-only and multimodal objectives so that adding vision did not erode pure-text quality. ^[2]

What can Qwen3-VL do?

The most talked-about new skill is the visual agent. Qwen3-VL can look at a PC or mobile screen, recognize interface elements such as buttons and menus, reason about what each control does, and then call tools or take actions to finish a task. ^[1]^[2]^[13] The Qwen team positions this as a bridge between perception, reasoning, and action, and points to it as groundwork for embodied agents that operate digital interfaces and, eventually, robotic systems. ^[1]^[2]

Optical character recognition saw a large expansion. Qwen2.5-VL handled 10 languages beyond Chinese and English. Qwen3-VL adds 29 more for a total of 39 supported languages, and the report states it keeps over 70 percent accuracy on 32 of those 39 in its own evaluation. The team also reports better reading in low light, blur, and tilt, plus stronger handling of rare or archaic characters and the structure of long documents. ^[2]^[4]

Long-video understanding is a second focus. With the 1 million token window, the model can ingest video up to about two hours long and answer questions about specific moments. In a video needle-in-a-haystack test, where a single informative frame is hidden somewhere in a long clip, the 235B-A22B Instruct model scored a perfect 100 percent accuracy on videos up to 30 minutes, which corresponds to a 256K token context, and held 99.5 percent accuracy when extrapolated to roughly two hours of video at 1 million tokens. ^[2]^[4]

Spatial perception and grounding round out the visual skills. The model supports both box-based and point-based grounding, can judge object positions, viewpoints, and occlusion, and adds 3D and spatial reasoning aimed at embodied use, including estimating object affordances and supporting action planning. ^[1]^[2] The team also reports broad visual recognition across categories such as landmarks, products, plants, and animals, a visual coding mode that turns an image or sketch into Draw.io, HTML, CSS, or JavaScript, and text understanding meant to stay on par with text-only LLMs of similar size. ^[1]^[2]

How does Qwen3-VL perform on benchmarks?

The technical report compares Qwen3-VL against strong proprietary models. The table below lists scores reported by the Qwen team for the flagship Qwen3-VL-235B-A22B in both Thinking and Instruct modes, alongside the corresponding figures the report cites for Gemini 2.5 Pro and GPT-5 in their stronger settings. Higher is better on every row except OCRBench, which is scored out of 1000, and all values are read directly from the report's tables. ^[2]

Benchmark	Qwen3-VL 235B Thinking	Qwen3-VL 235B Instruct	Gemini 2.5 Pro	GPT-5 (high)
MMMU	80.6	78.7	81.7	84.2
MMMU-Pro	69.3	68.1	68.8	78.4
MathVista (mini)	85.8	84.9	82.7	81.3
MathVision	74.6	66.5	73.3	70.9
MMStar	78.7	78.4	77.5	76.4
MMBench-EN	88.8	89.3	90.1	83.8
RealWorldQA	81.3	79.2	78.0	82.8
DocVQA (test)	96.5	97.1	92.6	91.5
InfoVQA (test)	89.5	89.2	84.2	79.0
ChartQA (test)	90.3	90.3	83.3	59.7
AI2D (with mask)	89.2	89.7	90.9	89.7
OCRBench	875	920	866	810

The pattern in the report is that the flagship leads on document and OCR tasks like DocVQA, InfoVQA, ChartQA, and OCRBench, and on several visual-math measures such as MathVista and MathVision, while the closed models stay ahead on parts of the broad knowledge benchmark MMMU and its harder MMMU-Pro split. ^[2] The Qwen team summarizes the Instruct version as matching or beating Gemini 2.5 Pro on key visual perception benchmarks. ^[3] Independent coverage of the technical report repeated several of these figures, including 85.8 on MathVista, 96.5 on DocVQA, and 875 on OCRBench. ^[4]

The report also benchmarks the medium models. It places Qwen3-VL-32B and the 30B-A3B MoE against systems such as Gemini 2.5 Flash and GPT-5 mini, and notes that scores climb steadily with size, for example MMBench-EN in Thinking mode rising from 79.9 on the 2B model to 85.3 on the 8B model. ^[2] As with most vendor-reported numbers, the comparison columns are drawn from the Qwen team's own runs and cited technical reports rather than a neutral third party, so they are best read as the developer's framing.

What are Qwen3-VL's key specifications?

Property	Detail
Developer	Qwen team, Alibaba Cloud
First release	September 2025 (flagship), full series through late 2025
Dense sizes	2B, 4B, 8B, 32B
MoE sizes	30B-A3B (about 3B active), 235B-A22B (about 22B active)
Variants per size	Instruct and Thinking
Vision encoder	SigLIP-2 (SO-400M default; Large 300M for 2B and 4B)
Vision-language merger	MLP-based, with DeepStack multi-level fusion
Position encoding	Interleaved-MRoPE; explicit video timestamp tokens
Native context	256K tokens (text, image, and video interleaved)
Extended context	About 1M tokens via YaRN scaling
OCR languages	39
License	Apache 2.0
Distribution	Hugging Face, ModelScope, GitHub

Is Qwen3-VL open source?

Yes. All Qwen3-VL open weights are released under the Apache 2.0 license, which permits commercial use, modification, and redistribution with attribution. ^[1]^[2] The models are distributed through Hugging Face and ModelScope, with code, usage examples, and a cookbook in the QwenLM/Qwen3-VL GitHub repository. ^[1] Alibaba Cloud also offers hosted access to the series through its Model Studio API. ^[2]

Because Qwen3-VL uses standard transformer components, support arrived quickly across the open ecosystem. The Hugging Face Transformers library exposes the models through dedicated classes, and serving stacks and fine-tuning toolkits added Qwen3-VL recipes, including community guides for running and tuning the smaller sizes on modest hardware. ^[1]^[7]^[8] FP8 builds and the range of dense sizes from 2B upward make the family usable from edge-class devices to multi-GPU servers. ^[1]

How does Qwen3-VL compare with Qwen2.5-VL and closed models?

Against its direct predecessor, Qwen3-VL changes several things at once. The language backbone moves to Qwen3, the vision encoder is SigLIP-2 with DeepStack fusion rather than the earlier single-layer feature path, and video timing switches from positional encoding to explicit timestamp tokens. ^[2] Practical reach grows as well. Context extends to a native 256K tokens with a 1 million token ceiling, OCR coverage rises from 10 to 39 languages beyond Chinese and English, and video understanding stretches to roughly two hours. ^[1]^[2]^[4] The Thinking and Instruct split is also new relative to the single-mode Qwen2.5-VL releases. ^[2]

Against closed competitors, the report positions the 235B-A22B flagship as roughly competitive with Gemini 2.5 Pro and GPT-5 across the multimodal suite, leading on document parsing, chart reading, and OCR while trailing on parts of the broad knowledge benchmark MMMU. ^[2] It also competes with Claude Opus 4.1, which appears as a comparison point in the report's tables. ^[2] The main draw over those systems is openness. Qwen3-VL ships full weights under Apache 2.0 across a wide size range, which the closed models do not, while those closed models may still hold an edge on certain reasoning and knowledge benchmarks. ^[2]^[3] Within Alibaba's own lineup, Qwen3-VL is the vision-capable counterpart to the text-only Qwen3 models, separate from products like Qwen3-Max. ^[1]

What came after Qwen3-VL?

Qwen3-VL was the last standalone vision-language series in this lineage before Alibaba shifted to native multimodality. In mid-February 2026 the Qwen team released Qwen3.5, which the Alibaba Cloud team describes as "natively multimodal via early text-vision fusion" and trains on text, images, and video together rather than bolting a vision encoder onto a text model. ^[14]^[15] The flagship Qwen3.5-397B-A17B is a 397-billion-parameter mixture-of-experts model that activates about 17 billion parameters per token, and Alibaba reports it "outperforming Qwen3-VL at similar scales." ^[14] The generation rolled out from that 397B flagship down to compact sizes for edge and mobile use, with independent coverage dating the first release to 17 February 2026. ^[15] Qwen3-VL remains available as an Apache 2.0 download and through the Model Studio API, and its checkpoints stay in wide use, but for new multimodal work the Qwen team now points to the unified Qwen3.5 line rather than a separate VL model. ^[1]^[14]

Limitations

Several cautions apply. The benchmark comparisons come from the Qwen team's own measurements and cited reports, not an independent evaluator, so the head-to-head figures reflect the developer's testing choices. ^[2] The closed models still lead on parts of MMMU and MMMU-Pro, which suggests room to grow on broad multimodal knowledge and the hardest reasoning. ^[2] The 1 million token window depends on YaRN extrapolation beyond the 256K native length, and the near-perfect long-video result is shown mainly on a controlled needle-in-a-haystack setup rather than across every long-video task. ^[2] OCR breadth is real but uneven, with the over 70 percent accuracy figure holding on 32 of the 39 supported languages rather than all of them. ^[2] Running the largest models still demands serious hardware, with the 235B flagship weighing in around 471 GB at full precision even with FP8 builds available to ease deployment. ^[4] As with any system that can drive interfaces and act on screens, the visual agent features warrant careful sandboxing and human oversight in production. ^[2]

References

Qwen team. "Qwen3-VL." GitHub repository, QwenLM/Qwen3-VL. https://github.com/QwenLM/Qwen3-VL ↩
Qwen team. "Qwen3-VL Technical Report." arXiv:2511.21631, November 2025. https://arxiv.org/abs/2511.21631 ↩
Qwen (@Alibaba_Qwen). "We're thrilled to unveil Qwen3-VL." X post announcing the flagship release, 23 September 2025. https://x.com/Alibaba_Qwen/status/1970594923503391182 ↩
Unite.AI. "Alibaba Releases Qwen3-VL Technical Report Detailing Two-Hour Video Analysis." 2025. https://www.unite.ai/alibaba-releases-qwen3-vl-technical-report-detailing-two-hour-video-analysis/ ↩
Tschannen, M., et al. "SigLIP 2: Multilingual Vision-Language Encoders." 2025. (vision encoder used by Qwen3-VL) ↩
Meng, L., et al. "DeepStack: Deeply Stacking Visual Tokens." 2024. (cross-layer fusion adapted in Qwen3-VL) ↩
Hugging Face. "Qwen3-VL-235B-A22B-Instruct model card." https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct ↩
Hugging Face. "Qwen/Qwen3-VL-8B-Instruct model card." https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct ↩
WinBuzzer. "Alibaba Releases Qwen3-VL Open-Source Vision Language AI Model Series." 24 September 2025. https://winbuzzer.com/2025/09/24/alibaba-releases-qwen3-vl-open-source-vision-language-ai-model-series-xcxwbn/
Unsloth. "Qwen3-VL: How to Run and Fine-tune." Unsloth Documentation. https://unsloth.ai/docs/models/tutorials/qwen3-how-to-run-and-fine-tune/qwen3-vl-how-to-run-and-fine-tune
LM Studio. "qwen3-vl model listing." https://lmstudio.ai/models/qwen3-vl
Vercel AI Gateway. "Qwen3-VL 235B A22B Thinking." Model specifications. https://vercel.com/ai-gateway/models/qwen3-vl-thinking
Qwen team. "Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action." Official announcement blog, qwen.ai, September 2025. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef ↩
Qwen team / Alibaba Cloud. "Qwen3.5: Towards Native Multimodal Agents." Alibaba Cloud blog, February 2026. https://www.alibabacloud.com/blog/qwen3-5-towards-native-multimodal-agents_602894 ↩
Willison, Simon. "Qwen3.5: Towards Native Multimodal Agents." simonwillison.net, 17 February 2026. https://simonwillison.net/2026/Feb/17/qwen35/ ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Qwen-VL Qwen3.5 Qwen3.6 ZebraLogic

Where does Qwen3-VL fit in the Qwen lineage?

What sizes and variants does Qwen3-VL come in?

When was Qwen3-VL released?

How is Qwen3-VL built?

What can Qwen3-VL do?

How does Qwen3-VL perform on benchmarks?

What are Qwen3-VL's key specifications?

Is Qwen3-VL open source?

How does Qwen3-VL compare with Qwen2.5-VL and closed models?

What came after Qwen3-VL?

Limitations

References

Improve this article

Related Articles

Qwen3-Omni

ERNIE 4.5

DeepSeek-OCR

InternVL

Qwen2.5-VL

Qwen2-VL

What links here

Related Articles

Qwen3-Omni

ERNIE 4.5

DeepSeek-OCR

InternVL

Qwen2.5-VL

Qwen2-VL

What links here