Qwen2.5
Last reviewed
Jun 3, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,788 words
Qwen2.5 is a family of large language models released by the Qwen team at Alibaba Cloud in September 2024 [1][2]. It is the successor to Qwen2 and the predecessor of Qwen3 in the Qwen series, which is also marketed in China under the name Tongyi Qianwen. The release covered seven dense base models and their instruction-tuned counterparts, ranging from 0.5 billion to 72 billion parameters, alongside hosted mixture-of-experts models offered only through the Alibaba Cloud API. Compared with Qwen2, the main gains were in coding, mathematics, instruction following, long-context handling, and generation of structured output such as JSON [1].
The open-weight models were published on Hugging Face and ModelScope on 19 September 2024 [1]. A detailed technical report (arXiv:2412.15115) followed in December 2024 [3]. Qwen2.5 became one of the most widely used open-weight model families of its generation and served as the base for a large number of community fine-tunes and downstream systems.
The dense Qwen2.5 lineup spans seven sizes, each released as a base (pretrained) model and an instruction-tuned model. All are decoder-only transformers that use grouped-query attention (GQA) for efficient key-value caching, rotary positional embeddings (RoPE), the SwiGLU activation, RMSNorm, and a bias term in the attention QKV projection [4][5]. The tokenizer is a byte-level byte-pair-encoding tokenizer with a vocabulary of 151,643 regular tokens, plus 22 control tokens [6].
| Model | Total params | Non-embedding params | Layers | Q / KV heads | Context | Generation |
|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | 0.49B | 0.36B | 24 | 14 / 2 | 32K | 8K |
| Qwen2.5-1.5B | 1.54B | 1.31B | 28 | 12 / 2 | 32K | 8K |
| Qwen2.5-3B | 3.09B | 2.77B | 36 | 16 / 2 | 32K | 8K |
| Qwen2.5-7B | 7.61B | 6.53B | 28 | 28 / 4 | 128K | 8K |
| Qwen2.5-14B | 14.7B | 13.1B | 48 | 40 / 8 | 128K | 8K |
| Qwen2.5-32B | 32.5B | 31.0B | 64 | 40 / 8 | 128K | 8K |
| Qwen2.5-72B | 72.7B | 70.0B | 80 | 64 / 8 | 128K | 8K |
The 7B and larger models advertise a 128K (131,072) token context window, while the three smallest models (0.5B, 1.5B, 3B) are capped at 32K [4][5]. Every size can generate up to 8,192 tokens. The architecture choices are consistent across the family, so the smaller models are essentially scaled-down versions of the same recipe rather than separate designs.
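The head counts in the table translate directly into key-value cache savings. The following Python sketch computes the GQA group size (query heads per key-value head) from the table and estimates the cache size per token. The head dimension of 128 and the 2-byte (16-bit) cache values are illustrative assumptions that the table does not state and that may not hold for every size; the function names are hypothetical.

```python
# Layer and head counts copied from the table above.
MODELS = {
    "Qwen2.5-0.5B": {"layers": 24, "q_heads": 14, "kv_heads": 2},
    "Qwen2.5-1.5B": {"layers": 28, "q_heads": 12, "kv_heads": 2},
    "Qwen2.5-3B": {"layers": 36, "q_heads": 16, "kv_heads": 2},
    "Qwen2.5-7B": {"layers": 28, "q_heads": 28, "kv_heads": 4},
    "Qwen2.5-14B": {"layers": 48, "q_heads": 40, "kv_heads": 8},
    "Qwen2.5-32B": {"layers": 64, "q_heads": 40, "kv_heads": 8},
    "Qwen2.5-72B": {"layers": 80, "q_heads": 64, "kv_heads": 8},
}


def gqa_group_size(q_heads: int, kv_heads: int) -> int:
    """Number of query heads that share one key/value head."""
    if q_heads % kv_heads != 0:
        raise ValueError("query heads must be a multiple of key/value heads")
    return q_heads // kv_heads


def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int = 128,
                             bytes_per_value: int = 2) -> int:
    """Estimated cache size for one token.

    head_dim and bytes_per_value are assumptions, not values from the table.
    The factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    for name, cfg in MODELS.items():
        group = gqa_group_size(cfg["q_heads"], cfg["kv_heads"])
        per_token = kv_cache_bytes_per_token(cfg["layers"], cfg["kv_heads"])
        print(f"{name}: {group} query heads per KV head, "
              f"{per_token / 1024:.0f} KiB of cache per token")
```

Under these assumptions the 72B model needs 327,680 bytes (320 KiB) of cache per token, or about 40 GiB for a full 131,072-token context. With one key-value head per query head, the figure would be eight times larger.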
In addition to the open-weight dense models, the technical report describes two proprietary hosted models, Qwen2.5-Turbo and Qwen2.5-Plus, which use a mixture-of-experts (MoE) architecture and are available only through the Alibaba Cloud Model Studio API [3]. There is no published parameter count for these hosted models.
Qwen2.5 was pretrained on a corpus scaled to roughly 18 trillion tokens, up from the 7 trillion tokens used for Qwen2 [1][3]. The Qwen team attributed much of the improvement to better data filtering and to a deliberate increase in the share of knowledge-rich, coding, and mathematics data. Some of the higher-quality coding and mathematics data came from the data pipelines built for the specialist sibling models, Qwen2.5-Coder and Qwen2.5-Math.
Pretraining used a two-stage context schedule: the bulk of training was done at a 4,096-token context length, after which training continued at 32,768 tokens to extend the usable context [5]. For the larger models, the 128K window is reached at inference time using YaRN (a RoPE-based length-extrapolation technique) and the Dual Chunk Attention scheme, so the full 128K context is enabled through a configuration change rather than being trained end to end at that length [4].
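The configuration change can be sketched as follows. This is a minimal illustration, not the official procedure: the `rope_scaling` key names follow the convention used in Hugging Face-style `config.json` files and are an assumption here, as is the scaling factor of 4 (chosen because 4 × 32,768 = 131,072). The helper names are hypothetical, and the inference framework in use may expect a different format.

```python
import copy
from typing import Optional


def enable_yarn(config: dict, factor: float = 4.0,
                trained_context: int = 32768) -> dict:
    """Return a copy of a model config with a YaRN rope-scaling entry added.

    The key names are assumed, not taken from the sources cited above.
    """
    patched = copy.deepcopy(config)
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": trained_context,
    }
    return patched


def extended_context(config: dict) -> Optional[int]:
    """Context length implied by the rope-scaling entry, if one is present."""
    scaling = config.get("rope_scaling")
    if not scaling:
        return None
    return int(scaling["factor"] * scaling["original_max_position_embeddings"])


if __name__ == "__main__":
    base = {"max_position_embeddings": 32768}
    patched = enable_yarn(base)
    print(extended_context(patched))  # 131072
```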
Post-training combined supervised fine-tuning on over one million examples with a multi-stage reinforcement learning procedure, which the report credits for the large jumps on instruction-following and human-preference benchmarks relative to Qwen2 [3]. The supervised data targeted long-text generation (over 8K tokens), structured data understanding such as tables, and structured output generation including JSON [1].
The headline improvements over Qwen2 are in coding, mathematics, and instruction following. Coding and mathematics both benefited from the larger and more specialized pretraining mix: relative to Qwen2-72B-Instruct, the instruction-tuned 72B model raised its LiveCodeBench score from 32.2 to 55.5 and its MATH score from 69.0 to 83.1 [7]. Instruction following improved sharply, with the Arena-Hard score rising from 48.1 (Qwen2-72B-Instruct) to 81.2 (Qwen2.5-72B-Instruct) [7].
The models were also tuned to be more reliable at producing structured output. Qwen2.5 can follow system prompts more consistently, generate long-form text beyond 8K tokens, read tabular data, and emit well-formed JSON, which makes the instruct models more practical as components in tool-using and agentic pipelines [1]. Multilingual coverage spans more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic [1][6].
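In a pipeline that relies on JSON output, the consuming code typically still validates each reply before acting on it. The sketch below shows one defensive way to do that with Python's standard `json` module. The helper name `parse_json_reply` and the handling of a Markdown code fence around the reply are illustrative assumptions, not behavior documented for Qwen2.5.

```python
import json
from typing import Any, Optional

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_reply(reply: str) -> Optional[Any]:
    """Parse a model reply as JSON, returning None if it is not valid JSON.

    Tolerates a reply wrapped in a Markdown code fence, which some chat
    models add around structured output (an assumption for illustration).
    """
    text = reply.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    print(parse_json_reply('{"tool": "search", "query": "Qwen2.5"}'))
    print(parse_json_reply("Sorry, I cannot help with that."))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or fall back to another path.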
The open-weight 7B through 72B models accept inputs of up to 128K tokens, with YaRN scaling enabled beyond the 32K training length, and generate up to 8K tokens [4]. To serve much longer inputs, the Qwen team released a long-context version of the hosted Qwen2.5-Turbo model on 15 November 2024, extending its context length from 128K to 1 million tokens [8]. One million tokens corresponds to roughly ten full-length novels or 30,000 lines of code.
Qwen2.5-Turbo uses a sparse attention scheme to keep inference affordable at that length. The team reported that it reduces the time to first token for a 1M-token context from 4.9 minutes to 68 seconds, a 4.3x speedup, while keeping the price at ¥0.3 per million tokens [8]. On long-context evaluations, Qwen2.5-Turbo reached 100% accuracy on a 1M-length passkey-retrieval test and scored 93.1 on the RULER benchmark, ahead of GPT-4's 91.6 reported in the same comparison [8]. In January 2025 the team additionally open-sourced 7B and 14B "Qwen2.5-1M" variants that support a 1M-token context for local deployment [9], distinct from the hosted Turbo model.
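The reported speedup can be checked with simple arithmetic. The short sketch below uses only the figures quoted above; the function names are hypothetical.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of the old latency to the new latency."""
    return before_seconds / after_seconds


def cost_yuan(tokens: int, price_per_million: float = 0.3) -> float:
    """Cost in yuan of processing `tokens` at the quoted per-million price."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    ttft_before = 4.9 * 60  # 4.9 minutes expressed in seconds
    ttft_after = 68
    print(round(speedup(ttft_before, ttft_after), 1))  # about 4.3
    print(cost_yuan(1_000_000))
```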
The flagship open-weight model, Qwen2.5-72B-Instruct, was positioned against the much larger Llama-3.1-405B-Instruct (more than five times the parameter count) and several proprietary models. The following instruction-tuned scores are from the official release [7].
| Benchmark | Qwen2.5-72B-Instruct | Qwen2-72B-Instruct | Llama-3.1-70B-Instruct |
|---|---|---|---|
| MMLU-Pro | 71.1 | 49.0 | 66.4 |
| MMLU-redux | 86.8 | 80.3 | 83.0 |
| GPQA | 49.0 | 34.3 | 41.4 |
| MATH | 83.1 | 69.0 | 68.0 |
| GSM8K | 95.8 | 91.1 | 95.1 |
| HumanEval | 86.6 | 86.0 | 80.5 |
| MBPP | 88.2 | 80.2 | 84.2 |
| LiveCodeBench | 55.5 | 32.2 | 46.6 |
| Arena-Hard | 81.2 | 48.1 | 55.7 |
| MT-Bench | 9.35 | 9.12 | 8.79 |
| IFEval | 84.1 | 77.6 | 83.6 |
The Qwen2.5-72B base model scored 86.1 on MMLU, 62.1 on MATH, and 91.5 on GSM8K, ahead of Qwen2-72B on each and competitive with the Llama-3.1-405B base model on several tasks despite the large size difference [7]. The technical report notes that the hosted Qwen2.5-Turbo and Qwen2.5-Plus perform competitively against GPT-4o-mini and GPT-4o respectively, at substantially lower cost [3].
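To put the generation-over-generation change in concrete terms, the following sketch computes the absolute and relative gains for three rows of the instruction-tuned table above. The scores are copied from that table; the helper name `gain` is hypothetical.

```python
# (Qwen2-72B-Instruct, Qwen2.5-72B-Instruct) scores from the table above.
SCORES = {
    "MATH": (69.0, 83.1),
    "LiveCodeBench": (32.2, 55.5),
    "Arena-Hard": (48.1, 81.2),
}


def gain(old: float, new: float) -> tuple:
    """Return (absolute gain in points, ratio of new score to old score)."""
    return round(new - old, 1), round(new / old, 2)


if __name__ == "__main__":
    for name, (old, new) in SCORES.items():
        delta, ratio = gain(old, new)
        print(f"{name}: +{delta} points, {ratio}x the Qwen2 score")
```

The largest relative change among these three is on LiveCodeBench, at roughly 1.7 times the Qwen2 score.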
Smaller instruct models retained strong reasoning scores for their size. Selected figures [7]:
| Model | MATH | GSM8K | HumanEval | MMLU-Pro |
|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 83.1 | N/A | N/A | 69.0 |
| Qwen2.5-14B-Instruct | 80.0 | N/A | N/A | 63.7 |
| Qwen2.5-7B-Instruct | 75.5 | N/A | 84.8 | 56.3 |
| Qwen2.5-3B-Instruct | 65.9 | 86.7 | 74.4 | N/A |
| Qwen2.5-1.5B-Instruct | 55.2 | 73.2 | 61.6 | N/A |
| Qwen2.5-0.5B-Instruct | 34.4 | 49.6 | 35.4 | N/A |
Most Qwen2.5 models are released under the Apache 2.0 license, which permits commercial use. The two exceptions are the 3B and 72B models: Qwen2.5-3B is covered by the more restrictive Qwen Research License, and Qwen2.5-72B is covered by the Qwen License, a custom community license [1][2]. The 72B Qwen License is broadly permissive but adds conditions for very large-scale commercial deployment.
The specialist sibling lines follow a similar pattern. In the Qwen2.5-Coder series, the 0.5B, 1.5B, 7B, 14B, and 32B models are Apache 2.0 while the 3B model uses the Qwen Research License [10]. In both lines the 3B tier is the one reserved for research-only terms, while every other size up to 32B is released under Apache 2.0.
Alongside the general-purpose models, the Qwen team shipped task-specialized siblings that share the Qwen2.5 base architecture and naming. Qwen2.5-Coder is a code-focused line, initially released at 1.5B and 7B and later expanded with 0.5B, 3B, 14B, and a flagship 32B that the team described as competitive with proprietary code models [10]. The math-focused line continued the work begun with Qwen2-Math, offering Qwen2.5-Math at 1.5B, 7B, and 72B with support for chain-of-thought and tool-integrated reasoning.
A vision-language extension, Qwen2.5-VL, was released in early 2025 and adds image and video understanding on top of the Qwen2.5 language backbone. These specialist models are documented separately; this article covers only the text-only dense and hosted models.
Qwen2.5 was widely adopted in the open-weight ecosystem because it combined permissive licensing for most sizes, a broad range of parameter counts, and strong benchmark results that were competitive with much larger models. The 7B, 14B, and 32B instruct models in particular became common choices for local deployment and fine-tuning, and the family was frequently used as a starting point for distillation and reinforcement-learning experiments by other groups.
The Qwen2.5-Max model, a large-scale MoE system trained on over 20 trillion tokens, was announced separately in January 2025 and positioned as Alibaba's frontier offering of that period [11]. The dense-model line was then superseded by Qwen3 in 2025, which introduced hybrid reasoning ("thinking" and "non-thinking" modes) and a renewed mixture-of-experts lineup. Despite the newer releases, the Qwen2.5 base checkpoints remained in active use because of their stable architecture and permissive licenses.