Qwen2
Last reviewed: Jun 3, 2026
Sources: 5 citations
Review status: Source-backed
Revision: v1 · 1,366 words
Qwen2 is the second major generation of the Qwen family of open large language models developed by the Qwen team at Alibaba Cloud, the cloud computing division of Alibaba. It was released on 6 June 2024 as the successor to the original Qwen series (launched in 2023) and was itself superseded by Qwen2.5 in September 2024. [1][2] The release comprised five model sizes, each shipped as a base (pretrained) model and an instruction-tuned (chat) model, ranging from 0.5 billion to 72 billion parameters and including one mixture-of-experts model. The accompanying Qwen2 Technical Report was posted to arXiv on 15 July 2024. [3]
Qwen is also marketed under the brand name Tongyi Qianwen in China. The Qwen2 weights were distributed through Hugging Face and Alibaba's own ModelScope platform, and most sizes were released under the permissive Apache 2.0 license, which contributed to the family's wide adoption for fine-tuning and downstream research.
Qwen2 was published in five sizes. Four are conventional dense transformers; the 57B-A14B variant is a mixture-of-experts (MoE) model, meaning it has roughly 57 billion total parameters but activates only about 14 billion of them for any given token (the "A14B" suffix denotes 14 billion activated parameters). [1][3]
| Model | Type | Total parameters | Activated parameters | Layers | Hidden size | Query heads | KV heads |
|---|---|---|---|---|---|---|---|
| Qwen2-0.5B | Dense | 0.5B | 0.5B | 24 | 896 | 14 | 2 |
| Qwen2-1.5B | Dense | 1.5B | 1.5B | 28 | 1,536 | 12 | 2 |
| Qwen2-7B | Dense | 7B | 7B | 28 | 3,584 | 28 | 4 |
| Qwen2-57B-A14B | MoE | 57B | 14B | 28 | 3,584 | 28 | 4 |
| Qwen2-72B | Dense | 72B | 72B | 80 | 8,192 | 64 | 8 |
The MoE model uses 64 experts and routes each token through 8 of them. Rather than being trained from scratch, Qwen2-57B-A14B was "upcycled" from the dense Qwen2-7B, reusing its weights to initialize the expert layers, which lowered the training cost. [3] Each size was released in two variants: a base model for further pretraining or fine-tuning, and an "-Instruct" model aligned for chat and instruction following. [3]
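The routing step described above can be illustrated with a minimal sketch. It assumes a plain softmax router whose top-8 probabilities are renormalized to sum to 1; the function names, the random logits, and the renormalization are hypothetical choices made for illustration and are not taken from the Qwen2 implementation.

```python
import math
import random

NUM_EXPERTS = 64  # experts in the MoE layer, per the article
TOP_K = 8         # experts each token is routed through, per the article


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts.

    Assumption: weights are the softmax probabilities of the chosen
    experts, renormalized to sum to 1. Real routers may differ.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


# Hypothetical router output for a single token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)
```

Only the selected experts run for that token, which is why the activated parameter count (about 14 billion) is much smaller than the total (about 57 billion).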
All five Qwen2 models are decoder-only transformers that share a common design. Every size uses grouped-query attention (GQA) in place of standard multi-head attention, which reduces the size of the key-value cache and speeds up inference; this is reflected in the small number of KV heads relative to query heads in the table above. The models use SwiGLU activations, rotary position embeddings (RoPE) for positional information, RMSNorm with pre-normalization, and a bias term on the attention QKV projections. [3]
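The effect of GQA on the key-value cache can be estimated with simple arithmetic. The sketch below uses the Qwen2-72B figures from the table above and assumes that each head has `hidden_size / query_heads` dimensions and that cached values are stored in 16-bit precision; both are assumptions made for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    layers: int
    hidden_size: int
    query_heads: int
    kv_heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: head dimension is hidden_size / query_heads.
        return self.hidden_size // self.query_heads


def kv_cache_bytes_per_token(shape: AttnShape, kv_heads: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of keys plus values cached per token across all layers."""
    return 2 * shape.layers * kv_heads * shape.head_dim * bytes_per_value


# Figures for Qwen2-72B taken from the table above.
QWEN2_72B = AttnShape(layers=80, hidden_size=8192, query_heads=64, kv_heads=8)

gqa = kv_cache_bytes_per_token(QWEN2_72B, QWEN2_72B.kv_heads)
# Hypothetical multi-head baseline: one KV head per query head.
mha = kv_cache_bytes_per_token(QWEN2_72B, QWEN2_72B.query_heads)

print(gqa, mha, mha // gqa)  # prints: 327680 2621440 8
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, a factor of 8 for Qwen2-72B.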
The two smallest models, Qwen2-0.5B and Qwen2-1.5B, tie their input and output embedding matrices to save parameters, while the larger models keep them separate. Qwen2 retains the byte-level byte-pair-encoding tokenizer introduced with the first Qwen generation, with a vocabulary of 151,643 ordinary tokens plus 3 control tokens. [3]
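As a rough illustration of why tying matters at small scale, the sketch below estimates the size of one embedding matrix from the vocabulary and hidden sizes given in this article. It assumes the matrix has exactly one row per token; released checkpoints may pad the vocabulary to a larger size, so the result is an estimate only, and the helper name is hypothetical.

```python
# Vocabulary size from the article: ordinary tokens plus control tokens.
VOCAB_SIZE = 151_643 + 3

# Hidden sizes of the two models that tie their embeddings (from the table).
HIDDEN_SIZES = {"Qwen2-0.5B": 896, "Qwen2-1.5B": 1536}


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a single vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


# Tying input and output embeddings avoids storing a second such matrix.
saved = {name: embedding_params(VOCAB_SIZE, hidden)
         for name, hidden in HIDDEN_SIZES.items()}

for name, count in saved.items():
    print(f"{name}: about {count / 1e6:.0f}M parameters saved by tying")
```

Under this assumption the saving is roughly 136 million parameters for Qwen2-0.5B and 233 million for Qwen2-1.5B, a sizeable share of models this small.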
Pretraining corpus sizes varied by model. According to the technical report, Qwen2-72B, Qwen2-7B, and Qwen2-1.5B were each trained on 7 trillion tokens, Qwen2-0.5B on 12 trillion tokens, and the Qwen2-57B-A14B MoE model on 4.5 trillion tokens. Post-training combined supervised fine-tuning with direct preference optimization (DPO) for alignment. [3]
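For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss, not the Qwen team's training code. The inputs are summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the model being trained and under a frozen reference model; the function name and the default `beta` of 0.1 are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin is how much more the policy favors the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m)), evaluated without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy has moved toward the chosen answer.
improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# A policy identical to the reference gives the baseline loss, log(2).
baseline = dpo_loss(-12.0, -12.0, -12.0, -12.0)
```

The loss falls below the `log(2)` baseline as the policy assigns relatively more probability to the preferred response than the reference model does.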
A stated focus of Qwen2 over its predecessor was broader language coverage. Beyond English and Chinese, the pretraining data was expanded to include 27 additional languages, bringing the total to 29. These span major Western European, Eastern European, Middle Eastern, and East and Southeast Asian languages such as Spanish, French, German, Russian, Arabic, Korean, Japanese, Thai, and Vietnamese. [1][3] The technical report rounds this figure to "approximately 30 languages." Qwen2 also addressed code-switching, a common failure mode in which multilingual models inappropriately mix languages within a single response. [1]
Qwen2 models were pretrained at a context length of 4,096 tokens, which was extended to 32,768 tokens during a later pretraining phase. For the instruction-tuned models, context was extended further at inference time using YaRN (a RoPE-scaling method) together with Dual Chunk Attention, allowing the larger models to process sequences of up to 131,072 tokens (128K). [1][3] The maximum supported context length differs by size:
| Model (Instruct) | Maximum context |
|---|---|
| Qwen2-0.5B-Instruct | 32K tokens |
| Qwen2-1.5B-Instruct | 32K tokens |
| Qwen2-7B-Instruct | 128K tokens |
| Qwen2-57B-A14B-Instruct | 64K tokens |
| Qwen2-72B-Instruct | 128K tokens |
Only Qwen2-7B-Instruct and Qwen2-72B-Instruct were marketed with the full 128K context window. Their official model cards configure the YaRN scaling factor relative to the 32,768-token training length. [4]
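The scaling factor follows directly from the two context lengths given above. The sketch below computes it and assembles a settings dictionary; the helper name is hypothetical, and the dictionary keys are modeled on the `rope_scaling` entry shown in the model cards but should be treated as an assumption and checked against the card for a given checkpoint.

```python
TRAINED_CONTEXT = 32_768   # context length reached during pretraining
TARGET_CONTEXT = 131_072   # 128K-token window of the largest instruct models


def yarn_factor(target_len: int, trained_len: int) -> float:
    """Ratio by which the position range is stretched.

    Returns 1.0 when the target fits inside the trained length,
    meaning no scaling is needed.
    """
    if target_len <= trained_len:
        return 1.0
    return target_len / trained_len


# Illustrative settings; key names are an assumption, not a verified schema.
rope_scaling = {
    "type": "yarn",
    "factor": yarn_factor(TARGET_CONTEXT, TRAINED_CONTEXT),
    "original_max_position_embeddings": TRAINED_CONTEXT,
}

print(rope_scaling["factor"])  # prints: 4.0
```

The same calculation gives a factor of 2.0 for the 64K window of Qwen2-57B-A14B-Instruct, and 1.0 (no scaling) for the 32K models.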
At release, Qwen2-72B-Instruct posted strong scores across general-knowledge, coding, mathematics, and Chinese-language benchmarks, and Alibaba positioned it as competitive with, and on several measures ahead of, Meta's contemporaneous Llama-3-70B-Instruct. The table below lists figures reported in the official launch materials and the Qwen2-72B-Instruct model card. [1][4]
| Benchmark | Qwen2-72B-Instruct | Llama-3-70B-Instruct |
|---|---|---|
| MMLU | 82.3 | 82.0 |
| MMLU-Pro | 64.4 | 56.2 |
| GPQA | 42.4 | 41.9 |
| HumanEval (code) | 86.0 | 81.7 |
| MBPP (code) | 80.2 | 82.3 |
| GSM8K (math) | 91.1 | 93.0 |
| MATH | 59.7 | 50.4 |
On additional evaluations the Qwen2-72B-Instruct card reports an MT-Bench score of 9.12, Arena-Hard of 48.1, MultiPL-E of 69.2, LiveCodeBench of 35.7, and the Chinese benchmarks C-Eval at 83.8 and AlignBench at 8.27. [4] The Qwen2-72B base model scored 84.2 on MMLU, 64.6 on HumanEval, 89.5 on GSM8K, and 51.1 on MATH. [1] As with all self-reported benchmark numbers, these figures came from the developer and reflect the evaluation setups chosen by the Qwen team.
Qwen2 used a split licensing scheme. Four of the five sizes, Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, and the Qwen2-57B-A14B MoE model (along with their instruct variants), were released under the Apache 2.0 license, which permits commercial use, modification, and redistribution with minimal restrictions. The largest model, Qwen2-72B (and Qwen2-72B-Instruct), was released under the Tongyi Qianwen license, a custom Alibaba license that imposes additional terms, including a requirement that very large-scale commercial deployments seek a separate agreement. [1][4] This was a change from the first Qwen generation, where the smaller dense checkpoints had not all been openly licensed, and it reflected a broader move by Alibaba toward open-weight releases.
With Qwen2.5, released in September 2024, Alibaba moved most of the lineup (with the exception of the 3B and 72B sizes) to Apache 2.0, continuing the trend Qwen2 began. [2]
Qwen2 was well received as one of the strongest open-weight model families available in mid-2024, and the permissive licensing of most sizes made it a popular base for fine-tuning and quantization. The 0.5B and 1.5B models in particular found use in resource-constrained and on-device settings, while the 72B model competed with the largest contemporary open models. Hugging Face reported that the small instruction-tuned Qwen models were among the most-downloaded open models on its hub in 2024. Over the following year the broader Qwen lineage (Qwen2 and its successors) grew into one of the most widely downloaded and most frequently derived open-model families, and it was eventually cited as overtaking Meta's Llama series in cumulative downloads. [5]
Qwen2.5, announced in September 2024, was the direct successor to Qwen2 and substantially expanded the family. It offered seven dense sizes (0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B) and was trained on a much larger corpus, reported at 18 trillion tokens. [2] The Qwen lineage continued with Qwen3 in 2025. Specialized derivatives built on the Qwen2 architecture were also released around the same period, including the Qwen2-VL vision-language models, Qwen2-Audio, and Qwen2-Math, which extended the base text models to additional modalities and domains.