Qwen2.5

Updated Jun 23, 2026

Qwen2.5 is a family of open-weight large language models that Alibaba Cloud's Qwen team released on 19 September 2024, spanning seven dense sizes from 0.5 billion to 72 billion parameters, pretrained on roughly 18 trillion tokens, and supporting context windows up to 128K tokens ^[1]^[2]. It is the successor to Qwen2 and the predecessor of Qwen3 in the Qwen series, which is also marketed in China under the name Tongyi Qianwen. Most Qwen2.5 models are released under the permissive Apache 2.0 license, and the family is accompanied by code-specialized Qwen2.5-Coder and math-specialized Qwen2.5-Math sibling lines.

The release covered seven dense base models and their instruction-tuned counterparts, alongside hosted mixture-of-experts models offered only through the Alibaba Cloud API. Compared with Qwen2, the main gains were in coding, mathematics, instruction following, long-context handling, and generation of structured output such as JSON ^[1]. The Qwen team summarized the jump bluntly: "Compared to Qwen2, Qwen2.5 has acquired significantly more knowledge (MMLU: 85+) and has greatly improved capabilities in coding (HumanEval 85+) and mathematics (MATH 80+)" ^[1].

The open-weight models were published on Hugging Face and ModelScope on 19 September 2024 ^[1]. A detailed technical report (arXiv:2412.15115) followed in December 2024 ^[3]. Qwen2.5 became one of the most widely used open-weight model families of its generation and served as the base for a large number of community fine-tunes and downstream systems.

What models are in the Qwen2.5 lineup?

The dense Qwen2.5 lineup spans seven sizes, each released as a base (pretrained) model and an instruction-tuned model. All are decoder-only transformers that use grouped-query attention (GQA) for efficient key-value caching, rotary positional embeddings (RoPE), the SwiGLU activation, RMSNorm, and a bias term in the attention QKV projection ^[4]^[5]. The tokenizer is a byte-level byte-pair-encoding tokenizer with a vocabulary of 151,646 tokens ^[6].

Model	Total params	Non-embedding params	Layers	Q / KV heads	Context (tokens)	Max generation (tokens)
Qwen2.5-0.5B	0.49B	0.36B	24	14 / 2	32K	8K
Qwen2.5-1.5B	1.54B	1.31B	28	12 / 2	32K	8K
Qwen2.5-3B	3.09B	2.77B	36	16 / 2	32K	8K
Qwen2.5-7B	7.61B	6.53B	28	28 / 4	128K	8K
Qwen2.5-14B	14.7B	13.1B	48	40 / 8	128K	8K
Qwen2.5-32B	32.5B	31.0B	64	40 / 8	128K	8K
Qwen2.5-72B	72.7B	70.0B	80	64 / 8	128K	8K

The 7B and larger models advertise a 128K (131,072) token context window, while the three smallest models (0.5B, 1.5B, 3B) are capped at 32K ^[4]^[5]. Every size can generate up to 8,192 tokens. The architecture choices are consistent across the family, so the smaller models are essentially scaled-down versions of the same recipe rather than separate designs.
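
The effect of grouped-query attention on the key-value cache can be estimated from the table's head counts. The following is a minimal sketch, not an official sizing tool: the function names are illustrative, and the head dimension of 128 and 16-bit cache entries used in the last step are assumptions that do not come from this article and should be checked against each model's published configuration.

```python
# (layers, query heads, key/value heads), copied from the lineup table.
LINEUP = {
    "Qwen2.5-0.5B": (24, 14, 2),
    "Qwen2.5-1.5B": (28, 12, 2),
    "Qwen2.5-3B": (36, 16, 2),
    "Qwen2.5-7B": (28, 28, 4),
    "Qwen2.5-14B": (48, 40, 8),
    "Qwen2.5-32B": (64, 40, 8),
    "Qwen2.5-72B": (80, 64, 8),
}


def gqa_group_size(q_heads: int, kv_heads: int) -> int:
    """Return how many query heads share each key/value head."""
    if kv_heads <= 0 or q_heads % kv_heads != 0:
        raise ValueError("query heads must be a positive multiple of KV heads")
    return q_heads // kv_heads


def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Cache bytes per token: one key and one value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    for name, (layers, q_heads, kv_heads) in LINEUP.items():
        print(f"{name}: {gqa_group_size(q_heads, kv_heads)} query heads per KV head")

    # Assumed head_dim=128 and 16-bit entries (hypothetical values) for the 72B model.
    layers, q_heads, kv_heads = LINEUP["Qwen2.5-72B"]
    with_gqa = kv_cache_bytes_per_token(layers, kv_heads, head_dim=128)
    without_gqa = kv_cache_bytes_per_token(layers, q_heads, head_dim=128)
    print(f"GQA: {with_gqa} bytes/token, one KV head per query head: {without_gqa}")
```

Under those assumptions the cache shrinks by the same factor as the group size, which is why the Q / KV head ratio matters most at long context lengths.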

In addition to the open-weight dense models, the technical report describes two proprietary hosted models, Qwen2.5-Turbo and Qwen2.5-Plus, which use a mixture-of-experts (MoE) architecture and are available only through the Alibaba Cloud Model Studio API ^[3]. There is no published parameter count for these hosted models.

How was Qwen2.5 trained?

Qwen2.5 was pretrained on a corpus scaled to roughly 18 trillion tokens, up from the 7 trillion tokens used for Qwen2 ^[1]^[3]. The Qwen team attributed much of the improvement to better data filtering and to a deliberate increase in the share of knowledge-rich, coding, and mathematics data. Some of the higher-quality coding and mathematics data came from the data pipelines built for the specialist Qwen2.5-Coder and Qwen2.5-Math lines.

Pretraining used a two-stage context schedule: the bulk of training was done at a 4,096-token context length, after which training continued at 32,768 tokens to extend the usable context ^[5]. For the larger models, the 128K window is reached at inference time using YaRN (a RoPE-based length-extrapolation technique) and the Dual Chunk Attention scheme, so the full 128K context is enabled through a configuration change rather than being trained end to end at that length ^[4].
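
The configuration change described above can be sketched as a small helper that derives the YaRN scaling factor from the two context lengths. This is an illustration under stated assumptions: the `rope_scaling` key names follow the pattern documented on the Hugging Face model cards ^[4] and should be verified there before use, and the helper name `yarn_rope_scaling` is hypothetical.

```python
import json

TRAINED_CONTEXT = 32_768   # long-context pretraining length from the text
TARGET_CONTEXT = 131_072   # advertised 128K window


def yarn_rope_scaling(trained: int, target: int) -> dict:
    """Build a rope_scaling fragment that stretches `trained` positions to `target`."""
    if trained <= 0 or target <= trained:
        raise ValueError("target context must exceed the trained context")
    return {
        "type": "yarn",
        "factor": target / trained,
        "original_max_position_embeddings": trained,
    }


# Fragment intended to be merged into a model's config.json (assumed layout).
fragment = {"rope_scaling": yarn_rope_scaling(TRAINED_CONTEXT, TARGET_CONTEXT)}

if __name__ == "__main__":
    print(json.dumps(fragment, indent=2))
```

With the lengths given in the text, the factor works out to 4.0, that is, 32,768 positions stretched to 131,072.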

Post-training combined supervised fine-tuning on over one million examples with a multi-stage reinforcement learning procedure, which the report credits for the large jumps on instruction-following and human-preference benchmarks relative to Qwen2 ^[3]. The supervised data targeted long-text generation (over 8K tokens), structured data understanding such as tables, and structured output generation including JSON ^[1].

What can Qwen2.5 do?

The headline improvements over Qwen2 fall into three areas: coding and mathematics, instruction following, and structured output. Coding and mathematics both benefited from the larger and more specialized pretraining mix: relative to Qwen2-72B-Instruct, the instruction-tuned 72B model raised its LiveCodeBench score from 32.2 to 55.5 and its MATH score from 69.0 to 83.1 ^[7]. Instruction following improved sharply, with the Arena-Hard score rising from 48.1 (Qwen2-72B-Instruct) to 81.2 ^[7].

The models were also tuned to be more reliable at producing structured output. As the Qwen team put it, the new models "achieve significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON" ^[1]. Qwen2.5 can follow system prompts more consistently, generate long-form text beyond 8K tokens, read tabular data, and emit well-formed JSON, which makes the instruct models more practical as components in tool-using and agentic pipelines ^[1]. Multilingual coverage spans more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic ^[1]^[6].
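
In a tool-using pipeline, a model's JSON reply is usually validated before anything downstream consumes it. The sketch below shows one way to do that with the standard library only. It is not part of any Qwen tooling: the function name `parse_json_reply`, the fence-stripping behaviour, and the sample reply are all hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this listing clean
_FENCED = re.compile(
    r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n" + FENCE + r"\s*$", re.DOTALL
)


def parse_json_reply(reply: str, required_keys=()) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence."""
    text = reply.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing required keys: {missing}")
    return data


if __name__ == "__main__":
    # Hypothetical reply text, as a chat model might return it.
    sample = FENCE + 'json\n{"name": "Qwen2.5", "sizes": 7}\n' + FENCE
    print(parse_json_reply(sample, required_keys=("name", "sizes")))
```

Rejecting malformed or incomplete replies at this boundary lets the caller retry the request instead of passing bad data to a tool.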

How long is the Qwen2.5 context window?

The open-weight 7B through 72B models accept inputs of up to 128K tokens once YaRN scaling is enabled, with up to 8K tokens of generation ^[4]. To serve much longer inputs, the Qwen team released a separate hosted model, Qwen2.5-Turbo, on 15 November 2024 that extends the context length from 128K to 1 million tokens ^[8]. One million tokens corresponds to roughly ten full-length novels or 30,000 lines of code.

Qwen2.5-Turbo uses a sparse attention scheme to keep inference affordable at that length. The team reported that it reduces the time to first token for a 1M-token context from 4.9 minutes to 68 seconds, a 4.3x speedup, while keeping the price at ¥0.3 per million tokens ^[8]. On long-context evaluations, Qwen2.5-Turbo reached 100% accuracy on a 1M-length passkey-retrieval test and scored 93.1 on the RULER benchmark, ahead of GPT-4's 91.6 reported in the same comparison ^[8]. In January 2025 the team additionally open-sourced 7B and 14B "Qwen2.5-1M" variants that support a 1M-token context for local deployment ^[9], distinct from the hosted Turbo model.
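
The reported speedup and price can be checked with simple arithmetic. The snippet below is a minimal sketch that uses only the figures quoted above. The function names are illustrative, and the price is treated as a flat per-token rate, which is an assumption.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of the old time to first token over the new one."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


def cost_yuan(tokens: int, price_per_million: float = 0.3) -> float:
    """Cost in yuan at a flat price per million tokens (assumed flat rate)."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    ratio = speedup(4.9 * 60, 68)  # 4.9 minutes is 294 seconds
    print(f"time-to-first-token speedup: {ratio:.1f}x")
    print(f"cost of a 1M-token prompt: ¥{cost_yuan(1_000_000):.2f}")
```

Dividing 294 seconds by 68 seconds gives about 4.3, which matches the figure the team reported.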

How does Qwen2.5 perform on benchmarks?

The flagship open-weight model, Qwen2.5-72B-Instruct, was positioned against the much larger Llama-3.1-405B-Instruct (more than five times the parameter count) and several proprietary models. The following instruction-tuned scores are from the official release ^[7].

Benchmark	Qwen2.5-72B-Instruct	Qwen2-72B-Instruct	Llama-3.1-70B-Instruct
MMLU-Pro	71.1	49.0	66.4
MMLU-redux	86.8	80.3	83.0
GPQA	49.0	34.3	41.4
MATH	83.1	69.0	68.0
GSM8K	95.8	91.1	95.1
HumanEval	86.6	86.0	80.5
MBPP	88.2	80.2	84.2
LiveCodeBench	55.5	32.2	46.6
Arena-Hard	81.2	48.1	55.7
MT-Bench	9.35	9.12	8.79
IFEval	84.1	77.6	83.6
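
The size of the generation-over-generation gains is easier to compare when expressed both as points and as ratios. The sketch below does this for four rows copied from the table above. The selection of rows and the helper name `gain` are illustrative choices, not part of the official release.

```python
# (Qwen2.5-72B-Instruct, Qwen2-72B-Instruct) scores copied from the table.
SCORES = {
    "MATH": (83.1, 69.0),
    "LiveCodeBench": (55.5, 32.2),
    "Arena-Hard": (81.2, 48.1),
    "IFEval": (84.1, 77.6),
}


def gain(new: float, old: float) -> tuple[float, float]:
    """Return (absolute gain in points, new/old ratio)."""
    if old <= 0:
        raise ValueError("old score must be positive")
    return new - old, new / old


if __name__ == "__main__":
    for name, (new, old) in SCORES.items():
        delta, ratio = gain(new, old)
        print(f"{name}: +{delta:.1f} points, {ratio:.2f}x")
```

Read this way, the largest relative improvements are on LiveCodeBench and Arena-Hard, each at roughly 1.7 times the Qwen2 score.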

The Qwen2.5-72B base model scored 86.1 on MMLU, 62.1 on MATH, and 91.5 on GSM8K, ahead of Qwen2-72B on each and competitive with the Llama-3.1-405B base model on several tasks despite the large size difference ^[7]. The technical report notes that the hosted Qwen2.5-Turbo and Qwen2.5-Plus perform competitively against GPT-4o-mini and GPT-4o respectively, at substantially lower cost ^[3].

Smaller instruct models retained strong reasoning scores for their size. Selected figures ^[7]:

Model	MATH	GSM8K	HumanEval	MMLU-Pro
Qwen2.5-32B-Instruct	83.1	N/A	N/A	69.0
Qwen2.5-14B-Instruct	80.0	N/A	N/A	63.7
Qwen2.5-7B-Instruct	75.5	N/A	84.8	56.3
Qwen2.5-3B-Instruct	65.9	86.7	74.4	N/A
Qwen2.5-1.5B-Instruct	55.2	73.2	61.6	N/A
Qwen2.5-0.5B-Instruct	34.4	49.6	35.4	N/A

Is Qwen2.5 open source?

Most Qwen2.5 models are released under the Apache 2.0 license, which permits commercial use. The two exceptions are the 3B and 72B models: Qwen2.5-3B is covered by the more restrictive Qwen Research License, and Qwen2.5-72B is covered by the Qwen License, a custom community license ^[1]^[2]. The 72B Qwen License is broadly permissive but requires a separate license from Alibaba Cloud for products or services with more than 100 million monthly active users.

The specialist sibling lines follow a similar pattern. In the Qwen2.5-Coder series, the 0.5B, 1.5B, 7B, 14B, and 32B models are Apache 2.0 while the 3B model uses the Qwen Research License ^[10]. The split reflects the team's general practice of opening the small and large workhorse sizes under Apache 2.0 while reserving research-only terms for the 3B tier.
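
For tooling that selects a checkpoint automatically, the licensing split described above can be encoded as a small lookup. This is a sketch based only on the licenses named in this section. The names `license_for`, `SIZES`, and `EXCEPTIONS` are hypothetical, and the actual license file shipped with each checkpoint remains the authority.

```python
APACHE = "Apache 2.0"

# Sizes per model line, and the sizes that do not use Apache 2.0.
SIZES = {
    "general": ("0.5B", "1.5B", "3B", "7B", "14B", "32B", "72B"),
    "coder": ("0.5B", "1.5B", "3B", "7B", "14B", "32B"),
}
EXCEPTIONS = {
    "general": {"3B": "Qwen Research License", "72B": "Qwen License"},
    "coder": {"3B": "Qwen Research License"},
}


def license_for(size: str, line: str = "general") -> str:
    """Return the license name for a given size in the given model line."""
    if line not in SIZES:
        raise ValueError(f"unknown model line: {line}")
    if size not in SIZES[line]:
        raise ValueError(f"{line} line has no {size} model")
    return EXCEPTIONS[line].get(size, APACHE)


if __name__ == "__main__":
    for size in SIZES["general"]:
        print(f"Qwen2.5-{size}: {license_for(size)}")
```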

What are the Qwen2.5 specialized variants?

Alongside the general-purpose models, the Qwen team shipped task-specialized siblings that share the Qwen2.5 base architecture and naming. Qwen2.5-Coder is a code-focused line, further pretrained on about 5.5 trillion tokens of code, initially released at 1.5B and 7B and later expanded with 0.5B, 3B, 14B, and a flagship 32B that the team described as competitive with proprietary code models ^[10]. The math-focused line continued the work begun with Qwen2-Math, offering Qwen2.5-Math at 1.5B, 7B, and 72B with support for chain-of-thought and tool-integrated reasoning.

A vision-language extension, Qwen2.5-VL, was released in early 2025 and adds image and video understanding on top of the Qwen2.5 language backbone. These specialist models are documented separately; this article covers only the text-only dense and hosted models.

How does Qwen2.5 relate to Qwen3?

Qwen2.5 was widely adopted in the open-weight ecosystem because it combined permissive licensing for most sizes, a broad range of parameter counts, and strong benchmark results that were competitive with much larger models. The 7B, 14B, and 32B instruct models in particular became common choices for local deployment and fine-tuning, and the family was frequently used as a starting point for distillation and reinforcement-learning experiments by other groups.

The Qwen2.5-Max model, a large-scale MoE system trained on over 20 trillion tokens, was announced separately in January 2025 and positioned as Alibaba's frontier offering of that period ^[11]. The dense-model line was then superseded by Qwen3 in 2025, which introduced hybrid reasoning ("thinking" and "non-thinking" modes) and a renewed mixture-of-experts lineup. Despite the newer releases, the Qwen2.5 base checkpoints remained in active use because of their stable architecture and permissive licenses.

References

1. Qwen Team, "Qwen2.5: A Party of Foundation Models!", Qwen blog, 19 September 2024. https://qwenlm.github.io/blog/qwen2.5/
2. "Qwen2.5: A Party of Foundation Models!", Alibaba Cloud Community. https://www.alibabacloud.com/blog/qwen2-5-a-party-of-foundation-models_601782
3. Qwen Team, "Qwen2.5 Technical Report", arXiv:2412.15115. https://arxiv.org/abs/2412.15115
4. "Qwen/Qwen2.5-72B-Instruct", Hugging Face model card. https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
5. "Qwen/Qwen2.5-7B", Hugging Face model card. https://huggingface.co/Qwen/Qwen2.5-7B
6. "Key Concepts", Qwen documentation (tokenizer and language support). https://qwen.readthedocs.io/en/latest/getting_started/concepts.html
7. Qwen Team, "Qwen2.5-LLM: Extending the boundary of LLMs", Qwen blog. https://qwenlm.github.io/blog/qwen2.5-llm/
8. Qwen Team, "Extending the Context Length to 1M Tokens!", Qwen blog, 15 November 2024. https://qwenlm.github.io/blog/qwen2.5-turbo/
9. Qwen Team, "Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens", Qwen blog. https://qwenlm.github.io/blog/qwen2.5-1m/
10. Qwen Team, "Qwen2.5-Coder Series: Powerful, Diverse, Practical", Qwen blog. https://qwenlm.github.io/blog/qwen2.5-coder-family/
11. Qwen Team, "Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model", Qwen blog. https://qwenlm.github.io/blog/qwen2.5-max/
