# Qwen2.5-VL

> Source: https://aiwiki.ai/wiki/qwen2_5_vl
> Updated: 2026-06-22
> Categories: Chinese AI, Multimodal AI, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Qwen2.5-VL is a series of open-weight [vision-language models](/wiki/vision_language_model) released on 26 January 2025 by the [Qwen](/wiki/qwen) team at [Alibaba](/wiki/alibaba) ([Alibaba Cloud](/wiki/alibaba_cloud)), succeeding the earlier Qwen2-VL family.[^1][^2] The series initially launched in 3B, 7B, and 72B parameter sizes, with a 32B variant added on 24 March 2025.[^2][^3] The Qwen team describes it as "the new flagship vision-language model of Qwen and also a significant leap from the previous Qwen2-VL."[^2] All models pair a redesigned Vision Transformer ([ViT](/wiki/vision_transformer_vit)) encoder with a [Qwen](/wiki/qwen)2.5 language backbone, introducing window attention in the visual encoder, dynamic frames-per-second (FPS) sampling, absolute-time-aligned multimodal Rotary Position Embedding (MRoPE), and native support for variable-resolution images, hour-long videos, and structured outputs such as bounding boxes and HTML document layouts.[^1][^2] The technical report (Bai et al., arXiv:2502.13923, 19 February 2025) reports that the flagship Qwen2.5-VL-72B matches or surpasses [GPT-4o](/wiki/gpt_4o) and [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) on document, chart, math, and long-video benchmarks while preserving the language capabilities of the underlying [Qwen](/wiki/qwen)2.5 LLM.[^1] Smaller variants (3B/7B/32B) ship under the [Apache 2.0](/wiki/apache_license) license; the 72B model is released under the Qwen license with commercial use restrictions only above a 100-million monthly-active-user threshold.[^4][^3]

## Infobox

| Field | Value |
|---|---|
| Developer | [Qwen](/wiki/qwen) team, [Alibaba Cloud](/wiki/alibaba_cloud) |
| Family | Qwen2.5-VL (successor to Qwen2-VL) |
| Initial release | 26 January 2025 (3B, 7B, 72B)[^2] |
| 32B release | 24 March 2025[^3] |
| Sizes | 3B, 7B, 32B, 72B (base and Instruct)[^2][^3] |
| Architecture | Qwen2.5 LLM backbone plus redesigned ViT with window attention |
| Vision encoder | ViT, hidden size 1280, 32 layers, 16 heads, patch size 14[^5] |
| Context length | 32,768 tokens, extendable with [YaRN](/wiki/yarn)[^4] |
| Pretraining data | ~4.1 trillion tokens (versus ~1.2T for Qwen2-VL)[^1] |
| Modalities | Image, video (up to multi-hour), interleaved text[^2] |
| Distribution | [Hugging Face](/wiki/hugging_face), [ModelScope](/wiki/modelscope), Qwen Chat[^6][^2] |
| License | Apache 2.0 (3B/7B/32B); Qwen License (72B)[^3][^4] |
| Technical report | Bai et al., arXiv:2502.13923 (Feb 2025)[^1] |

## Background

### Predecessors in the Qwen-VL line

The [Qwen](/wiki/qwen) vision-language line began with the original Qwen-VL in August 2023, an early open-weight multimodal model paired with a 7B language backbone. Qwen-VL Chat introduced grounded visual conversation and basic OCR. Qwen2-VL, released in September 2024, was the first major redesign: it replaced fixed-resolution patching with a native dynamic-resolution ViT, increased the language backbone to the Qwen2 family, and introduced 3D Multimodal Rotary Position Embedding (M-RoPE) to encode temporal, height, and width axes for video understanding.[^7] Qwen2-VL shipped at 2B, 7B, and 72B scales (arXiv:2409.12191) and was the first Qwen vision-language model to be benchmarked against frontier closed systems with credible parity on document and chart understanding.[^7] However, Qwen2-VL used full attention throughout its vision encoder. Because self-[attention](/wiki/attention) scales quadratically with sequence length, processing high-resolution images and long videos became expensive, with practical compute budgets bounding the resolution and frame count that could be served at production scale.[^7][^1] Qwen2-VL also encoded video time as a frame index rather than absolute seconds, which meant fine-grained timestamp queries depended on a fixed sampling rate and were not directly supported.[^2]

### Open-source vision-language landscape in late 2024

By late 2024 the open-source [vision-language model](/wiki/vision_language_model) ecosystem had matured into several distinct lineages. [LLaVA](/wiki/llava) (Liu et al., 2023) had popularised the visual-instruction-tuning recipe with a [CLIP](/wiki/clip) vision encoder bolted onto an instruction-tuned [LLaMA](/wiki/llama) backbone, and the LLaVA-1.5 and LLaVA-NeXT updates extended this to higher resolutions and stronger reasoning.[^16] [InternVL](/wiki/internvl) (OpenGVLab) pursued a different scaling axis with a 6B-parameter vision encoder and progressive multi-stage training, reaching the 78B parameter scale and rivalling closed models on many benchmarks.[^1] The Qwen-VL line, [DeepSeek](/wiki/deepseek)-VL, and the MiniCPM-V series rounded out the open offerings; on the closed side, [GPT-4o](/wiki/gpt_4o), [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet), and Gemini 2.0 Flash defined the frontier.[^11][^8] New benchmarks reshaped the evaluation landscape: [MMMU](/wiki/mmmu) and [MMMU-Pro](/wiki/mmmu-pro) for college-level multimodal reasoning, [MathVista](/wiki/mathvista) for visual math, OCRBench and DocVQA for document understanding, [Video-MME](/wiki/video-mme) and LVBench for long-form video, and ScreenSpot / Android Control for GUI agents.[^1][^11]

### When was Qwen2.5-VL released?

Qwen2.5-VL was announced on the official Qwen blog on 26 January 2025 with the slogan "Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!" and a launch of three sizes (3B, 7B, 72B), each in both base and Instruct variants.[^2] The blog post highlighted four headline capabilities: visual recognition of objects and text, agentic computer and phone use, hour-long video understanding with second-level event localization, and structured outputs for invoices and forms.[^2] [Alibaba Cloud](/wiki/alibaba_cloud) followed with corporate coverage on 6 February 2025, framing the release alongside the company's broader Qwen2.5 ecosystem and emphasising vision-agent use cases for finance and commerce.[^6] The accompanying technical report (27 authors led by Shuai Bai) was uploaded to arXiv on 19 February 2025 as arXiv:2502.13923.[^1]

A 32B variant, Qwen2.5-VL-32B-Instruct, was released on 24 March 2025, billed as "smarter and lighter" and tuned with reinforcement learning for stronger mathematical and human-preference performance under the [Apache 2.0](/wiki/apache_license) license.[^3] The 32B model was widely covered in independent reporting as evidence that the Qwen team had closed the gap between the 32B-tier open-weight models and the 72B flagship through better post-training rather than scale alone.[^16][^3] Together with the [DeepSeek](/wiki/deepseek)-R1 release a month earlier, Qwen2.5-VL helped cement the perception that Chinese open-weight foundation models had reached the frontier in early 2025.[^11]

## How does Qwen2.5-VL work?

Qwen2.5-VL preserves the three-part structure pioneered in Qwen-VL: a Vision Transformer encoder, an MLP-based vision-language merger that projects visual tokens into the LLM's embedding space, and a [Qwen](/wiki/qwen)2.5 language backbone.[^5][^2] Several components were redesigned relative to Qwen2-VL.

### Redesigned Vision Transformer

The visual encoder is a ViT trained from scratch (not initialised from CLIP) with hidden size 1280, 32 transformer layers, 16 attention heads, and a patch size of 14 pixels.[^5] Image height and width are required to be multiples of 28 (two patches) to align with the merger.[^5] In contrast to Qwen2-VL, in which every ViT layer used full self-attention, Qwen2.5-VL applies window attention to 28 of the 32 layers, reserving full attention for only 4 strategically placed layers.[^5][^1] Windowed layers scale linearly with the number of patches rather than quadratically, enabling efficient processing of very high-resolution images and many-frame videos without prohibitive compute.[^9][^5]

To bring the visual encoder closer to the LLM in design conventions, Qwen2.5-VL replaces older normalization and activation choices with [RMSNorm](/wiki/rmsnorm) and the [SwiGLU](/wiki/swiglu) activation found in the [Qwen](/wiki/qwen)2.5 LLM.[^2][^5] This aligns vision and language stacks, simplifies the kernel set for inference, and is reported by the authors to improve throughput.[^2]

### Native dynamic resolution and 2D-RoPE

Like Qwen2-VL, Qwen2.5-VL accepts images at their native resolution. Patches are emitted at stride 14, and a token-merging step joins 2x2 spatial blocks before the projection layer, yielding a flexible token budget that scales with image area rather than a fixed grid.[^5][^2] Within the ViT, spatial positions are encoded using 2D [Rotary Position Embedding (RoPE)](/wiki/rope) over height and width axes.[^5] This removes any need for ad-hoc resizing or letter-boxing and preserves the geometric relationships needed for precise spatial reasoning, including [bounding box](/wiki/bounding_box) regression and point localization.[^1][^2]

### M-RoPE with absolute-time alignment

For video and interleaved input, Qwen2.5-VL extends [RoPE](/wiki/rope) into a 3D Multimodal Rotary Position Embedding (M-RoPE) over (temporal, height, width).[^1][^2] The key change over Qwen2-VL is that the temporal axis is aligned to absolute time rather than to frame index. Qwen2-VL's M-RoPE encoded frame order but did not carry information about real elapsed seconds; Qwen2.5-VL associates each temporal position id with absolute time and combines this with dynamic FPS training so that the model learns to relate timestamps to events directly.[^2][^1] This makes second-level event localization in long videos a native capability rather than an emergent one.[^1]

### Vision-language merger

A two-layer MLP merges spatial token groups (2x2) and projects the resulting vectors into the LLM hidden size, yielding roughly a 4x token reduction at the merge boundary.[^5][^2] This compresses the visual stream while keeping the underlying ViT operating on native-resolution patches. The merger is light enough that it adds negligible parameter count relative to the ViT and the language backbone, but it is critical for controlling the token cost of high-resolution and long-video inputs.[^5][^9]

### Language backbone

The language model is initialised from the corresponding [Qwen](/wiki/qwen)2.5 LLM (3B, 7B, 32B, or 72B), inheriting its tokenizer, [SwiGLU](/wiki/swiglu) feed-forward layers, [RMSNorm](/wiki/rmsnorm) normalization, [RoPE](/wiki/rope) positional encoding, and grouped-query [attention](/wiki/attention) structure.[^2][^5] Native [context window](/wiki/context_window) is 32,768 tokens, with [YaRN](/wiki/yarn) available to extend further at inference time for long-document and multi-hour-video workloads.[^4] By coupling vision and language stacks that share the same normalization, activation, and positional encoding conventions, the Qwen team simplifies the operator set required for high-throughput inference and aligns the model with the broader [Qwen](/wiki/qwen)2.5 family.[^2][^5]

### Window attention details

The window attention pattern within the ViT is reported as a primary lever of efficiency. Of the 32 transformer layers in the encoder, 28 use windowed attention while only 4 retain global attention.[^5] In windowed layers, each patch attends only to other patches within its local window, decoupling the cost from total patch count. The four global-attention layers are interleaved to allow long-range visual information flow without paying the quadratic cost everywhere. This pattern echoes design choices from the [Swin Transformer](/wiki/swin_transformer) family and from windowed [attention](/wiki/attention) variants in language models, adapted to dynamic-resolution multimodal use.[^5][^9] The Qwen team reports that this change, combined with the [RMSNorm](/wiki/rmsnorm) and [SwiGLU](/wiki/swiglu) swaps, substantially improves both training and inference throughput compared with Qwen2-VL.[^2]

## How was Qwen2.5-VL trained?

### Three-stage pretraining

The technical report describes a three-stage pretraining schedule followed by post-training, with a total budget of roughly 4.1 trillion tokens compared with the 1.2 trillion used for Qwen2-VL.[^10][^9][^1]

| Stage | Focus | Tokens | Modules trained |
|---|---|---|---|
| 1 | Visual pretraining (captioning, knowledge recognition, OCR) | 1.5T | ViT only |
| 2 | Multimodal pretraining (text, interleaved image-text, VQA, video grounding, agent data) | 2.0T | ViT + LLM |
| 3 | Long-context pretraining (long videos, transcripts, documents at up to 32,768 tokens) | 0.6T | ViT + LLM |

In stage 1, only the Vision Transformer is updated; the LLM is frozen at its [Qwen](/wiki/qwen)2.5 initialization. This warms up the visual encoder on captioning, world-knowledge image data such as celebrities, landmarks, animals and plants, and multilingual OCR.[^10] In stage 2, the full model is trained on interleaved image-text data, visual question answering, video grounding data, and agent demonstrations, expanding the vocabulary of tasks the model can perform.[^10][^9] In stage 3, training sequences are extended to the full 32,768 token native context with long videos, document chains, and transcripts, teaching the model to attend across long temporal and textual spans.[^10][^1] The Qwen team emphasises that the temporal axis of M-RoPE is aligned to absolute seconds in stages 2 and 3, supplying the model with explicit temporal cues during training rather than only at inference.[^1]

### Data composition

Pretraining data sources include image-caption pairs, visual knowledge data spanning over ten thousand object categories, OCR data in ten languages (English, French, German, Italian, Spanish, Portuguese, Arabic, Russian, Japanese, Korean and Vietnamese are documented in community reporting), document-parsing data with structured layout labels and HTML-format rendering, video data sampled at variable frame rates, grounding data with absolute bounding-box coordinates and point annotations, and GUI agent demonstrations from desktop and mobile screenshots.[^9][^10][^15] A large synthetic document corpus was constructed in which tables, charts, equations, music sheets, chemical formulas, and figures were uniformly rendered in QwenVL HTML format, enabling the model to learn a single unified parsing target.[^9][^15]

### Post-training

Post-training applies supervised fine-tuning ([SFT](/wiki/supervised_fine-tuning)) on the order of one million multimodal instruction-response examples to produce the Instruct checkpoints, using a standard next-token language-modeling loss with prompt masking.[^10] The 32B Instruct release, in addition, was tuned with reinforcement learning to better align with human preferences and to improve mathematical reasoning, producing answers reported as more detailed and better-formatted than its larger 72B sibling on subjective benchmarks such as MM-MT-Bench.[^3] The Qwen team has not publicly released the full SFT data mix or the alignment recipe, although community projects such as Open-Qwen2VL have attempted to reproduce comparable training regimes from public datasets.[^10]

## What can Qwen2.5-VL do?

### Document parsing and structured outputs

Qwen2.5-VL targets document parsing as a first-class capability rather than an emergent OCR side-effect.[^2][^1] The team introduced a "QwenVL HTML" format in which tables, charts, equations, music sheets, chemical formulas, and figures are uniformly represented as HTML-like markup with absolute layout coordinates.[^9][^2] A large synthetic and curated corpus of documents was rendered in this format during pretraining, allowing the model to emit structured layout reconstructions directly from a page image.[^9] Models are also trained to produce JSON outputs for invoices, forms, and receipts, supporting downstream business workflows without bespoke parsers.[^6][^2] The same machinery handles charts and diagrams: the model can extract underlying data from a plotted figure or summarise a flow chart into structured text.[^4][^9]

Compared with [DeepSeek-OCR](/wiki/deepseek-ocr) and dedicated OCR systems such as [Mistral OCR 3](/wiki/mistral_ocr_3), Qwen2.5-VL combines [OCR](/wiki/ocr_models) with general-purpose multimodal reasoning in a single model. This [structured output](/wiki/structured_output) capability is one of the headline differences relative to Qwen2-VL, which had basic OCR but no native HTML-formatted document parsing.[^2][^15] On OCRBench-V2, the 72B model achieves 61.5 (English) and 63.7 (Chinese), figures that the Qwen blog describes as state-of-the-art among open vision-language models at the time of release.[^4][^15]

### Object localization

The model can return [bounding boxes](/wiki/bounding_box) and point coordinates for queried objects in absolute image-pixel units.[^1][^2] Localization data uses absolute coordinates rather than normalized 0-1 ranges, which keeps the geometry consistent across the wide range of input resolutions allowed by the dynamic-resolution ViT.[^1][^9] Grounding outputs are emitted as structured tokens (for example, JSON or specially delimited strings) so they can be consumed programmatically.[^2] The model supports two output modes: bounding-box rectangles for general object detection and point coordinates for fine-grained references such as parts of an interface or pixels in a chart.[^2][^9] These outputs feed naturally into downstream tools, including the agent loop described below.[^11]

### Long-video understanding

Combining absolute-time M-RoPE with dynamic FPS sampling, Qwen2.5-VL is designed for videos that span minutes to hours. According to the Qwen team, the model "can comprehend videos of over 1 hour, and this time it has a new ability of capturing event by pinpointing the relevant video segments."[^2][^1] Frames are sampled at variable rates depending on duration and target token budget, and timestamps are encoded so the model can answer questions about events at a particular second.[^2][^1] The Qwen team reports examples in which the model can pinpoint "the exact second" of a specific event in hour-long footage.[^6][^11] In practice this enables use cases such as identifying when a specific topic begins in a lecture recording, locating a goal in a sports broadcast, or finding the moment a defect appears in a manufacturing video.[^2]

The dynamic FPS approach is significant: instead of resampling to a fixed rate, the system can sample more densely in regions of interest and sparsely elsewhere, then use the absolute-time-aligned M-RoPE so the model still knows where in time each frame sits.[^1][^2] This avoids both the information loss of aggressive downsampling and the compute blow-up of uniform high-rate sampling on long videos.

### Visual agent (computer and phone use)

Qwen2.5-VL is positioned as a vision agent that, in the Qwen team's words, "directly plays as a visual agent that can reason and dynamically direct tools, which is capable of computer use and phone use."[^2] It can read GUI screenshots, plan actions, and emit grounded clicks or commands to operate desktop and mobile applications.[^2][^11] The Qwen blog demonstrates the model booking flights in an airline app, using a browser to find weather forecasts, editing an image to increase colour vibrancy, and installing a VS Code extension through a tool-use loop.[^2] [Anthropic Computer Use](/wiki/anthropic_computer_use) (Claude) and the [OpenAI Operator](/wiki/openai_operator) preview shipped concurrent capabilities, but Qwen2.5-VL was among the first open-weight models with comparable native grounding-plus-planning.[^11] The model is integrated with the Qwen-Agent framework for tool-call orchestration, and benchmarks such as ScreenSpot, ScreenSpot Pro, AITZ, and Android Control are used to evaluate it.[^4][^1] On Android Control, the 72B Instruct model reaches 93.7 Low-Exact-Match and 67.36 High-Exact-Match, indicating strong performance on simple per-screen actions and moderate performance on longer multi-step trajectories.[^4]

TechCrunch's coverage at launch positioned the model as "similar to the model powering OpenAI's recently launched Operator," while noting that the model still scored "poorly on OSWorld," a benchmark that simulates a real desktop environment with many windows and applications.[^11] This is consistent with the broader pattern in late-2024 and early-2025 GUI agents: strong single-screen grounding but weaker performance on full-OS multi-step workflows.

### Multilingual OCR

Qwen2.5-VL upgrades OCR to support multiple scenarios, multiple languages, and multiple text orientations, including handwriting and stylised fonts.[^15][^9] Trained languages documented in community reporting include English, Chinese, French, German, Italian, Spanish, Portuguese, Arabic, Russian, Japanese, Korean and Vietnamese.[^15] OCR is integrated with grounding so the model can not only read text but also report where on the page it appears, supporting precise localization for downstream retrieval or redaction tasks.[^2][^15]

## How does Qwen2.5-VL compare to GPT-4o and Claude on benchmarks?

The official Hugging Face model card for Qwen2.5-VL-72B-Instruct reports the following selected scores:[^4]

| Benchmark | Qwen2.5-VL-72B-Instruct | Notes |
|---|---|---|
| MMMU (val) | 70.2 | versus 70.3 reported for [GPT-4o](/wiki/gpt_4o) on the same card |
| [MMMU-Pro](/wiki/mmmu-pro) | 51.1 | |
| [MathVista](/wiki/mathvista) | 74.8 | |
| MMBench-EN | 88.6 | |
| DocVQA (val) | 96.4 | |
| ChartQA (test) | 89.5 | |
| OCRBench-V2 (en/zh) | 61.5 / 63.7 | |
| [Video-MME](/wiki/video-mme) (no subs.) | 73.3 | |
| [Video-MME](/wiki/video-mme) (with subs.) | 79.1 | |
| LVBench | 47.3 | |
| PerceptionTest (test) | 73.2 | |
| ScreenSpot Pro | 43.6 | GUI grounding |
| AITZ (EM) | 83.2 | Android task evaluation |
| Android Control (High EM) | 67.36 | |
| Android Control (Low EM) | 93.7 | |

The technical report frames these results as matching or exceeding contemporary closed models on document and diagram understanding, while remaining competitive on general visual question answering and college-level reasoning ([MMMU](/wiki/mmmu)).[^1] Independent secondary reporting on the launch echoed the comparison, noting Qwen's claim that the 72B model beats [GPT-4o](/wiki/gpt_4o), [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet), and Gemini 2.0 Flash on a range of video, math, and document tasks.[^11]

The 32B Instruct release, despite its smaller scale, surpasses the 72B model on several reasoning-heavy benchmarks including [MMMU](/wiki/mmmu), [MMMU-Pro](/wiki/mmmu-pro), and [MathVista](/wiki/mathvista); the Qwen team attributes this to targeted reinforcement-learning post-training rather than architectural changes.[^3] The blog also notes that the 32B model beats the prior generation's Qwen2-VL-72B-Instruct on MM-MT-Bench, a multi-turn vision-language conversation benchmark.[^3] The 7B model is reported to "outperform [GPT-4o](/wiki/gpt_4o)-mini in a number of tasks," and the 3B model in turn outperforms the 7B variant of its predecessor on multiple benchmarks.[^2]

## What are the Qwen2.5-VL model sizes?

| Variant | Parameters | Release date | License | Notes |
|---|---|---|---|---|
| Qwen2.5-VL-3B (base / Instruct) | 3B | 26 Jan 2025[^2] | Apache 2.0[^12] | Compact deployment, edge use |
| Qwen2.5-VL-7B (base / Instruct) | 7B | 26 Jan 2025[^2] | Apache 2.0[^13] | Default open-weight tier |
| Qwen2.5-VL-32B-Instruct | 32B | 24 Mar 2025[^3] | Apache 2.0[^3] | Reasoning, math, human-preference tuning |
| Qwen2.5-VL-72B (base / Instruct) | 72B | 26 Jan 2025[^2] | Qwen License[^4] | Flagship; matches GPT-4o on several tasks |

Quantized AWQ versions of the 3B, 7B, and 72B Instruct checkpoints are also published on [Hugging Face](/wiki/hugging_face) for lower-memory inference.[^4] The full series is mirrored on [ModelScope](/wiki/modelscope) (Alibaba's open-source community) and the flagship is accessible via Qwen Chat.[^6][^2]

## Implementations and ecosystem

### Open-source serving stacks

Qwen2.5-VL is supported by mainstream open-source serving and runtime stacks. The Hugging Face [Transformers](/wiki/transformers_library) library exposes the `Qwen2_5_VLForConditionalGeneration` class and matching `AutoProcessor`, and the Qwen team distributes a `qwen-vl-utils` helper for image and video pre-processing including frame extraction, resolution clamping, and message preparation.[^4] [vLLM](/wiki/vllm) and [SGLang](/wiki/sglang) both ship Qwen2.5-VL kernels for high-throughput batched inference, and the Hugging Face card lists both as primary inference backends.[^4] Local-runtime ecosystems including [llama.cpp](/wiki/llama_cpp), [Ollama](/wiki/ollama), and LM Studio provide GGUF-quantized builds via community conversions, with the Qwen team noting 32 separate quantization variants available across these runtimes.[^14][^4] AWQ 4-bit quantizations of the 3B, 7B, and 72B Instruct checkpoints are published officially by the Qwen team on [Hugging Face](/wiki/hugging_face) for lower-memory inference.[^4]

### Hosted services

The model is exposed via [Alibaba Cloud](/wiki/alibaba_cloud)'s hosted Model Studio API and via the Qwen Chat web interface for end-user experimentation.[^6] Third-party API providers including OpenRouter expose the 32B model for free or low-cost inference, broadening access for developers without GPU infrastructure.[^3][^4] Within Alibaba's own product surface, Qwen2.5-VL underpins multimodal features inside the [Tongyi Qianwen](/wiki/tongyi_qianwen) consumer assistant.[^6]

### Adoption signals

The 7B Instruct checkpoint alone reports more than 200,000 downloads per month on Hugging Face as of mid-2025, making it one of the most-downloaded open vision-language models of the period.[^4] The 3B variant has been widely adopted for on-device and edge deployments where its sub-10GB memory footprint is critical, and the 32B Instruct release has become a popular alternative to the 72B for users seeking a balance between capability and inference cost.[^3][^4]

The model has been integrated into popular agent frameworks for GUI automation, including the Qwen-Agent library maintained by the Qwen team itself. Open-source projects such as Open-Qwen2VL have published derivative training recipes using publicly available datasets to approximate Qwen2.5-VL's behaviour on academic compute budgets, providing a reproducible reference for the broader research community.[^10]

## What is Qwen2.5-VL used for?

Documented and reported applications include:

* Document and form parsing for finance, insurance, and commerce, including direct JSON or HTML extraction from invoices, receipts, and tables.[^6][^2]
* Chart and diagram understanding, including conversion of plots and infographics into structured data or natural-language summaries.[^4][^9]
* Long-form video question answering and segment localization, for example for surveillance, lecture recordings, and broadcast content.[^2][^1]
* GUI agents for desktop and mobile automation, including booking flows, application configuration, and content editing.[^11][^2]
* Multilingual OCR across ten or more languages, with support for varied orientations and document types.[^9][^15]
* Educational and STEM tools that combine [MathVista](/wiki/mathvista)-style visual math reasoning with document parsing, exploited especially by the 32B Instruct variant.[^3]
* Backbone for derived research models such as the open-data Open-Qwen2VL recipe and a variety of fine-tuning experiments published on Hugging Face.[^10]

## Limitations

The technical report and secondary coverage acknowledge several limitations.[^1][^11] Performance on agent benchmarks that simulate full operating-system environments such as OSWorld remained low at launch, with TechCrunch noting that Qwen2.5-VL "scor[es] poorly on OSWorld."[^11] Long-video performance, while strong on Video-MME and LVBench, still trails closed frontier systems on the hardest benchmarks; LVBench scores hover in the high 40s.[^4] The 72B model is released under the Qwen License, which imposes use restrictions for downstream services exceeding 100 million monthly active users, distinguishing it from the [Apache 2.0](/wiki/apache_license) terms of the smaller siblings and from fully permissive releases.[^11][^3]

The model's structured-output formats (QwenVL HTML, JSON layouts) are proprietary to the Qwen ecosystem and require downstream code to parse, and the Qwen team has not released the full training data mix used in the three pretraining stages, limiting reproducibility for outside researchers.[^10][^1] Hallucination on unfamiliar document layouts and on rare languages is reported in community evaluations, and the 3B model in particular trades capability for footprint.[^4]

## Comparison with related work

Qwen2.5-VL sits in a dense family of open-weight vision-language systems.

| System | Developer | Sizes | Highlights |
|---|---|---|---|
| Qwen2-VL | [Qwen](/wiki/qwen) team, [Alibaba Cloud](/wiki/alibaba_cloud) | 2B / 7B / 72B | Direct predecessor; full-attention ViT; introduced M-RoPE[^7] |
| Qwen2.5-VL | [Qwen](/wiki/qwen) team | 3B / 7B / 32B / 72B | Window-attention ViT, absolute-time M-RoPE, doc parsing[^1] |
| [InternVL](/wiki/internvl) | OpenGVLab | 1B-78B | Strong open multimodal lineage; broadly comparable benchmarks[^1] |
| [LLaVA](/wiki/llava) | Liu et al., academic / community | 7B-34B | Influential earlier instruction-tuned VLM line[^16] |
| [Florence-2](/wiki/florence_2) | Microsoft | 0.23B / 0.77B | Compact multitask vision model with grounded outputs |
| [GPT-4o](/wiki/gpt_4o) | [OpenAI](/wiki/openai) | proprietary | Closed-weight competitor on most benchmarks |
| [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) | [Anthropic](/wiki/anthropic) | proprietary | Closed competitor, also ships [Computer Use](/wiki/anthropic_computer_use) |
| Mistral OCR / Mistral Small 3.1 | [Mistral AI](/wiki/mistral) | proprietary / open | Competing 24B-tier open VLM[^3] |

The Qwen team subsequently released [Qwen3](/wiki/qwen_3)-VL in autumn 2025, which extended the architecture with Interleaved M-RoPE, DeepStack multi-level ViT feature fusion, and Text-Timestamp Alignment, and shipped at 2B through 235B (MoE) scales, but Qwen2.5-VL remained the most widely deployed Qwen vision-language family through 2025 due to its earlier release, broad ecosystem support, and 32B sweet-spot variant.

## Significance

Qwen2.5-VL is notable for several reasons.[^1][^11][^3]

It demonstrated that an open-weight model could match closed frontier systems on document understanding and several reasoning benchmarks rather than only on narrower tasks. On DocVQA the 72B model reaches 96.4, a level that, prior to early 2025, had been the exclusive domain of closed-weight systems.[^4][^1] Independent reporting noted that Alibaba's own benchmarking showed the model "beats [GPT-4o](/wiki/gpt_4o), [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet), and Google's Gemini 2.0 Flash on a range of video understanding, math, document analysis, and question-answering evaluations."[^11]

It made hour-scale video understanding with native temporal grounding accessible to the open-source community for the first time at this performance level. The absolute-time M-RoPE design has since influenced subsequent open vision-language work, including the [Qwen3](/wiki/qwen_3)-VL successor's Text-Timestamp Alignment.[^1]

It packaged GUI agent capabilities, document parsing, and visual question answering into a single foundation model that downstream developers could fine-tune or deploy directly. Rather than requiring separate models for OCR, layout analysis, object detection, and agent planning, Qwen2.5-VL exposed all of these via a single instruction-tuned API.[^2][^9]

It advanced the trend, accelerated by [DeepSeek](/wiki/deepseek) and Qwen in early 2025, of high-quality Chinese open-weight foundation models setting the open-source frontier.[^11][^3] Together with [DeepSeek-R1](/wiki/deepseek_r1) released a few days earlier, Qwen2.5-VL became part of a broader narrative shift in which several frontier-grade open-weight releases came from non-US labs within a single quarter.[^11]

Finally, the release demonstrated that incremental architectural improvements (window attention, absolute-time M-RoPE, [RMSNorm](/wiki/rmsnorm) and [SwiGLU](/wiki/swiglu) in the ViT) combined with a substantially expanded data mix (4.1T versus 1.2T tokens) could yield large improvements over a generation. The Qwen team's choice to release the 32B variant under [Apache 2.0](/wiki/apache_license) two months after the initial drop also signalled an evolving disposition toward more permissive licensing for mid-scale open models, even as the 72B flagship retained the Qwen-specific license with high-MAU restrictions.[^3][^11]

## See also

* [Qwen](/wiki/qwen)
* [Qwen3](/wiki/qwen_3)
* [Alibaba](/wiki/alibaba)
* [Alibaba Cloud](/wiki/alibaba_cloud)
* [Tongyi Qianwen](/wiki/tongyi_qianwen)
* [Vision language model](/wiki/vision_language_model)
* [Multimodal AI](/wiki/multimodal_ai)
* [Vision Transformer (ViT)](/wiki/vision_transformer_vit)
* [Rotary position embedding (RoPE)](/wiki/rope)
* [RMSNorm](/wiki/rmsnorm)
* [SwiGLU](/wiki/swiglu)
* [YaRN](/wiki/yarn)
* [LLaVA](/wiki/llava)
* [InternVL](/wiki/internvl)
* [Florence-2](/wiki/florence_2)
* [GPT-4o](/wiki/gpt_4o)
* [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet)
* [Anthropic Computer Use](/wiki/anthropic_computer_use)
* [OpenAI Operator](/wiki/openai_operator)
* [Computer-use agent](/wiki/computer-use_agent)
* [Bounding Box](/wiki/bounding_box)
* [Video-MME](/wiki/video-mme)
* [MMMU](/wiki/mmmu)
* [MMMU-Pro](/wiki/mmmu-pro)
* [MathVista](/wiki/mathvista)
* [Structured output](/wiki/structured_output)
* [Hugging Face](/wiki/hugging_face)
* [ModelScope](/wiki/modelscope)
* [vLLM](/wiki/vllm)
* [SGLang](/wiki/sglang)
* [Ollama](/wiki/ollama)
* [llama.cpp](/wiki/llama_cpp)
* [Hugging Face Transformers](/wiki/transformers_library)
* [Supervised fine-tuning](/wiki/supervised_fine-tuning)
* [Long-context language models](/wiki/long_context)

## References

[^1]: Bai, Shuai et al. "Qwen2.5-VL Technical Report." arXiv:2502.13923, 2025-02-19. https://arxiv.org/abs/2502.13923. Accessed 2026-05-20.

[^2]: Qwen Team. "Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!" Qwen Blog, 2025-01-26. https://qwenlm.github.io/blog/qwen2.5-vl/. Accessed 2026-05-20.

[^3]: Qwen Team. "Qwen2.5-VL-32B: Smarter and Lighter." Qwen Blog, 2025-03-24. https://qwenlm.github.io/blog/qwen2.5-vl-32b/. Accessed 2026-05-20.

[^4]: Qwen Team. "Qwen/Qwen2.5-VL-72B-Instruct (model card)." Hugging Face, 2025. https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct. Accessed 2026-05-20.

[^5]: Qwen Team. "Qwen2.5-VL Technical Report (architecture section)." arXiv:2502.13923v1, 2025-02-19. https://arxiv.org/abs/2502.13923v1. Accessed 2026-05-20.

[^6]: Alibaba Cloud. "Alibaba Cloud Releases Latest AI Models For Enhanced Visual Understanding and Long Context Inputs." Alibaba Cloud Community, 2025-02-06. https://www.alibabacloud.com/blog/alibaba-cloud-releases-latest-ai-models-for-enhanced-visual-understanding-and-long-context-inputs_601963. Accessed 2026-05-20.

[^7]: Wang, Peng et al. "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution." arXiv:2409.12191, 2024-09-18. https://arxiv.org/abs/2409.12191. Accessed 2026-05-20.

[^8]: Qwen Team. "Qwen2.5: A Party of Foundation Models!" Qwen Blog, 2024-09-19. https://qwenlm.github.io/blog/qwen2.5/. Accessed 2026-05-20.

[^9]: tangbasky. "Qwen2.5-VL: A hands on code walkthrough." Towards AI, 2025. https://towardsai.net/p/machine-learning/qwen2-5-vl-a-hands-on-code-walkthrough. Accessed 2026-05-20.

[^10]: emergentmind. "Qwen2.5-VL: Advanced Vision-Language Model." EmergentMind topic page, 2025. https://www.emergentmind.com/topics/qwen2-5-vl-model. Accessed 2026-05-20.

[^11]: Wiggers, Kyle. "Alibaba's Qwen team releases AI models that can control PCs and phones." TechCrunch, 2025-01-27. https://techcrunch.com/2025/01/27/alibabas-qwen-team-releases-ai-models-that-can-control-pcs-and-phones/. Accessed 2026-05-20.

[^12]: Qwen Team. "Qwen/Qwen2.5-VL-3B-Instruct (model card)." Hugging Face, 2025. https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct. Accessed 2026-05-20.

[^13]: Qwen Team. "Qwen/Qwen2.5-VL-7B-Instruct (model card)." Hugging Face, 2025. https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct. Accessed 2026-05-20.

[^14]: Ollama. "qwen2.5vl (library entry)." Ollama, 2025. https://ollama.com/library/qwen2.5vl. Accessed 2026-05-20.

[^15]: deepwiki. "OCR and Text Recognition (Qwen2.5-VL)." DeepWiki, 2025. https://deepwiki.com/elsawhs/qwen2.5-vl/7.1-ocr-and-text-recognition. Accessed 2026-05-20.

[^16]: Sharma, Sana. "Qwen Releases the Qwen2.5-VL-32B-Instruct." MarkTechPost, 2025-03-24. https://www.marktechpost.com/2025/03/24/qwen-releases-the-qwen2-5-vl-32b-instruct-a-32b-parameter-vlm-that-surpasses-qwen2-5-vl-72b-and-other-models-like-gpt-4o-mini/. Accessed 2026-05-20.