Qwen2

Updated Jun 27, 2026

Qwen2 is the second major generation of open large language models developed by the Qwen team at Alibaba Cloud, released on 6 June 2024. ^[1]^[3] It shipped in five sizes ranging from 0.5 billion to 72 billion parameters, one of which (57B-A14B) is a mixture-of-experts model. The larger instruction-tuned variants support context windows of up to 128K (131,072) tokens, and four of the five sizes were distributed under the permissive Apache 2.0 license. ^[1]^[3] Qwen2 sits between Qwen1.5 and Qwen2.5 in the Qwen lineage and was positioned by Alibaba as competitive with, and on several benchmarks ahead of, Meta's Llama-3-70B. ^[1]

The Qwen team announced the release on its official blog, writing: "After months of efforts, we are pleased to announce the evolution from Qwen1.5 to Qwen2." ^[1] Alibaba Cloud is the cloud computing division of Alibaba. Its Qwen lineage began with the original Qwen series in 2023; Qwen2 directly succeeded Qwen1.5 and was itself superseded by Qwen2.5 in September 2024. ^[1]^[2] Each of the five sizes was released as a base (pretrained) model and an instruction-tuned (chat) model. The accompanying Qwen2 Technical Report was posted to arXiv on 15 July 2024. ^[3]

Qwen is also marketed under the brand name Tongyi Qianwen in China. The Qwen2 weights were distributed through Hugging Face and Alibaba's own ModelScope platform, and most sizes were released under the permissive Apache 2.0 license, which contributed to the family's wide adoption for fine-tuning and downstream research. ^[1]

What is Qwen2?

Qwen2 is a family of open-weight, decoder-only transformer language models, the second generation of Alibaba Cloud's Qwen series. The release comprised five model sizes, each shipped as a base (pretrained) model and an instruction-tuned (chat) model, ranging from 0.5 billion to 72 billion parameters and including one mixture-of-experts model. ^[1]^[3] The Qwen team summarized the headline improvements over Qwen1.5 as: "State-of-the-art performance in a large number of benchmark evaluations," along with "significantly improved performance in coding and mathematics" and broader multilingual coverage. ^[1]

What sizes does Qwen2 come in?

Qwen2 was published in five sizes. Four are conventional dense transformers; the 57B-A14B variant is a mixture-of-experts (MoE) model, meaning it has roughly 57 billion total parameters but activates only about 14 billion of them for any given token (the "A14B" suffix denotes 14 billion activated parameters). ^[1]^[3]

Model	Type	Total parameters	Activated parameters	Layers	Hidden size	Query heads	KV heads
Qwen2-0.5B	Dense	0.5B	0.5B	24	896	14	2
Qwen2-1.5B	Dense	1.5B	1.5B	28	1,536	12	2
Qwen2-7B	Dense	7B	7B	28	3,584	28	4
Qwen2-57B-A14B	MoE	57B	14B	28	3,584	28	4
Qwen2-72B	Dense	72B	72B	80	8,192	64	8

The MoE model uses 64 experts and routes each token through 8 of them. Rather than being trained from scratch, Qwen2-57B-A14B was "upcycled" from the dense Qwen2-7B, reusing its weights to initialize the expert layers, which lowered the training cost. ^[3] Each size was released in two variants: a base model for further pretraining or fine-tuning, and an "-Instruct" model aligned for chat and instruction following. ^[3]
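The routing step described above can be illustrated with a minimal top-k router. This is a generic sketch, not the Qwen2 implementation: the function name `route_token`, the random stand-in logits, and the choice to normalize weights with a softmax over the selected experts are all assumptions made for illustration.

```python
import math
import random

NUM_EXPERTS = 64   # experts per MoE layer, as stated above
TOP_K = 8          # experts each token is routed through, as stated above


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This normalization is an illustrative choice, not a documented Qwen2 detail.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
assignment = route_token(logits)

print(len(assignment), "of", NUM_EXPERTS, "experts selected")
print("activated share of total parameters:", round(14 / 57, 3))
```

The activated share (14B of 57B) is larger than 8/64 because non-expert weights such as attention and embeddings are used by every token.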

How is Qwen2 built?

All five Qwen2 models are decoder-only transformers that share a common design. Every size uses grouped-query attention (GQA) in place of standard multi-head attention, which reduces the size of the key-value cache and speeds up inference; this is reflected in the small number of KV heads relative to query heads in the table above. The models use SwiGLU activations, rotary position embeddings (RoPE) for positional information, RMSNorm with pre-normalization, and a bias term on the attention QKV projections. ^[3]
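The effect of GQA on the key-value cache can be estimated from the figures in the size table. The sketch below rests on two assumptions not stated in the article: that each head's dimension equals the hidden size divided by the number of query heads, and that cached values are stored in a 16-bit format. The multi-head baseline is hypothetical and included only for comparison.

```python
# (layers, hidden_size, query_heads, kv_heads) copied from the size table above
CONFIGS = {
    "Qwen2-0.5B": (24, 896, 14, 2),
    "Qwen2-1.5B": (28, 1536, 12, 2),
    "Qwen2-7B": (28, 3584, 28, 4),
    "Qwen2-57B-A14B": (28, 3584, 28, 4),
    "Qwen2-72B": (80, 8192, 64, 8),
}

BYTES_PER_VALUE = 2  # assumption: cache stored in a 16-bit format


def kv_cache_bytes_per_token(layers, hidden_size, query_heads, kv_heads):
    head_dim = hidden_size // query_heads  # assumption: head_dim = hidden / query heads
    # One key vector and one value vector per KV head, per layer.
    return 2 * layers * kv_heads * head_dim * BYTES_PER_VALUE


for name, (layers, hidden, q_heads, kv_heads) in CONFIGS.items():
    gqa = kv_cache_bytes_per_token(layers, hidden, q_heads, kv_heads)
    # Hypothetical baseline: every query head keeps its own key/value pair.
    mha = kv_cache_bytes_per_token(layers, hidden, q_heads, q_heads)
    print(f"{name}: {gqa / 1024:.0f} KiB per token with GQA, "
          f"{mha / gqa:.0f}x that with full multi-head attention")
```

Under these assumptions Qwen2-72B needs 320 KiB of cache per token, or about 40 GiB for a single 131,072-token sequence, against eight times that amount if every query head had its own key and value heads.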

The two smallest models, Qwen2-0.5B and Qwen2-1.5B, tie their input and output embedding matrices to save parameters, while the larger models keep them separate. Qwen2 retains the byte-level byte-pair-encoding tokenizer introduced with the first Qwen generation, with a vocabulary of 151,643 ordinary tokens plus 3 control tokens. ^[3]
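A rough sense of what embedding tying saves follows from the vocabulary size and the hidden sizes in the table. The estimate below assumes the embedding matrix has exactly one row per tokenizer entry; any padding of the matrix in the released checkpoints is ignored, so the real figures may differ slightly.

```python
VOCAB_SIZE = 151_643 + 3  # ordinary tokens plus control tokens, as stated above

# Hidden sizes of the two models that tie input and output embeddings.
HIDDEN_SIZE = {"Qwen2-0.5B": 896, "Qwen2-1.5B": 1536}


def embedding_parameters(vocab_size, hidden_size):
    """Parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


for name, hidden in HIDDEN_SIZE.items():
    saved = embedding_parameters(VOCAB_SIZE, hidden)
    print(f"{name}: tying avoids a second matrix of about {saved / 1e6:.0f}M parameters")
```

For the 0.5B model the avoided matrix is on the order of a quarter of the nominal parameter count, which is why tying matters most at the small end of the range.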

Pretraining corpus sizes varied by model. According to the technical report, Qwen2-72B, Qwen2-7B, and Qwen2-1.5B were each trained on 7 trillion tokens, Qwen2-0.5B on 12 trillion tokens, and the Qwen2-57B-A14B MoE model on 4.5 trillion tokens. Post-training combined supervised fine-tuning with direct preference optimization (DPO) for alignment. ^[3]
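The corpus figures are easier to compare as tokens per parameter. The sketch below uses the nominal (rounded) parameter counts from the model names, so the ratios are approximate. The MoE model is left out because it was initialized from Qwen2-7B rather than trained from scratch, which makes a single ratio misleading.

```python
# (nominal parameters, pretraining tokens) from the paragraph above
DENSE_MODELS = {
    "Qwen2-0.5B": (0.5e9, 12e12),
    "Qwen2-1.5B": (1.5e9, 7e12),
    "Qwen2-7B": (7e9, 7e12),
    "Qwen2-72B": (72e9, 7e12),
}


def tokens_per_parameter(parameters, tokens):
    """Pretraining tokens seen per model parameter."""
    return tokens / parameters


for name, (params, tokens) in DENSE_MODELS.items():
    ratio = tokens_per_parameter(params, tokens)
    print(f"{name}: about {ratio:,.0f} tokens per parameter")
```

The smallest model stands out: it saw the largest corpus while having the fewest parameters.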

How many languages does Qwen2 support?

A stated focus of Qwen2 over its predecessor was broader language coverage. Beyond English and Chinese, the pretraining data was expanded to include 27 additional languages, for a total of 29, spanning major Western European, Eastern European, Middle Eastern, and East and Southeast Asian languages such as Spanish, French, German, Russian, Arabic, Korean, Japanese, Thai, and Vietnamese. ^[1]^[3] The technical report rounds this figure to "approximately 30 languages." The Qwen team also reported work on reducing code-switching, a common failure mode in which multilingual models inappropriately mix languages within a single response. ^[1]

How long is Qwen2's context window?

Qwen2 models were pretrained at a context length of 4,096 tokens, which was extended to 32,768 tokens during a later pretraining phase. For the instruction-tuned models, context was extended further at inference time using YARN (a RoPE-scaling method) together with Dual Chunk Attention, allowing the larger models to process sequences of up to 131,072 tokens (128K). ^[1]^[3] The maximum supported context length differs by size:

Model (Instruct)	Maximum context
Qwen2-0.5B-Instruct	32K tokens
Qwen2-1.5B-Instruct	32K tokens
Qwen2-7B-Instruct	128K tokens
Qwen2-57B-A14B-Instruct	64K tokens
Qwen2-72B-Instruct	128K tokens

Only Qwen2-7B-Instruct and Qwen2-72B-Instruct were advertised with the full 128K context window. The official model cards configure the YARN scaling factor relative to the 32,768-token training length. ^[4]
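The relationship between the training length and the extended window reduces to a single ratio. The sketch below computes it and places it in a configuration fragment. The key names in `rope_scaling` are an assumption modeled on the style of Hugging Face model configurations and should be checked against the model card before use.

```python
TRAINED_CONTEXT = 32_768    # tokens, context length at the end of pretraining
TARGET_CONTEXT = 131_072    # tokens, maximum for the 7B and 72B instruct models


def yarn_factor(target_context, trained_context):
    """Scaling factor needed to stretch the trained window to the target window."""
    return target_context / trained_context


# Assumption: key names mirror a typical rope_scaling entry; verify against
# the official model card rather than relying on this fragment.
rope_scaling = {
    "type": "yarn",
    "factor": yarn_factor(TARGET_CONTEXT, TRAINED_CONTEXT),
    "original_max_position_embeddings": TRAINED_CONTEXT,
}

print(rope_scaling)
```

A factor of 4 stretches the 32,768-token window to 131,072 tokens. By the same ratio, the 64K limit listed for Qwen2-57B-A14B-Instruct would correspond to a factor of 2.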

How does Qwen2 perform on benchmarks?

At release, Qwen2-72B-Instruct posted strong scores across general-knowledge, coding, mathematics, and Chinese-language benchmarks, and Alibaba positioned it as competitive with, and on several measures ahead of, Meta's contemporaneous Llama-3-70B-Instruct. The Qwen team stated that "Qwen2-72B exhibits superior performance compared to leading models such as Llama-3-70B," and that "Qwen2-72B-Instruct significantly surpasses Qwen1.5-72B-Chat across all benchmarks, and also reaches competitive performance compared with Llama-3-70B-Instruct." ^[1] The table below lists figures reported in the official launch materials and the Qwen2-72B-Instruct model card. ^[1]^[4]

Benchmark	Qwen2-72B-Instruct	Llama-3-70B-Instruct
MMLU	82.3	82.0
MMLU-Pro	64.4	56.2
GPQA	42.4	41.9
HumanEval (code)	86.0	81.7
MBPP (code)	80.2	82.3
GSM8K (math)	91.1	93.0
MATH	59.7	50.4

On additional evaluations the Qwen2-72B-Instruct card reports an MT-Bench score of 9.12, Arena-Hard of 48.1, MultiPL-E of 69.2, LiveCodeBench of 35.7, and the Chinese benchmarks C-Eval at 83.8 and AlignBench at 8.27. ^[4] The Qwen2-72B base model scored 84.2 on MMLU, 64.6 on HumanEval, 89.5 on GSM8K, and 51.1 on MATH. ^[1] As with all self-reported benchmark numbers, these figures came from the developer and reflect the evaluation setups chosen by the Qwen team.
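The head-to-head comparison table can be summarized as per-benchmark differences. The sketch below only restates the self-reported scores from that table; it makes no claim about statistical significance or about how the two models were evaluated.

```python
# (Qwen2-72B-Instruct, Llama-3-70B-Instruct) scores copied from the table above
SCORES = {
    "MMLU": (82.3, 82.0),
    "MMLU-Pro": (64.4, 56.2),
    "GPQA": (42.4, 41.9),
    "HumanEval": (86.0, 81.7),
    "MBPP": (80.2, 82.3),
    "GSM8K": (91.1, 93.0),
    "MATH": (59.7, 50.4),
}

# Round to one decimal place to avoid floating-point noise in the differences.
deltas = {name: round(qwen - llama, 1) for name, (qwen, llama) in SCORES.items()}
qwen_ahead = [name for name, delta in deltas.items() if delta > 0]
llama_ahead = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
print("Qwen2 ahead on", len(qwen_ahead), "of", len(SCORES), "benchmarks")
print("Llama 3 ahead on:", ", ".join(llama_ahead))
```

The largest gaps in Qwen2's favor are on MATH and MMLU-Pro, while Llama-3-70B-Instruct leads by about two points on MBPP and GSM8K.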

Is Qwen2 open source?

Qwen2 used a split licensing scheme. Four of the five sizes, Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, and the Qwen2-57B-A14B MoE model (along with their instruct variants), were released under the Apache 2.0 license, which permits commercial use, modification, and redistribution with minimal restrictions. The largest model, Qwen2-72B (and Qwen2-72B-Instruct), was released under the Tongyi Qianwen license, a custom Alibaba license that imposes additional terms, including a requirement that very large-scale commercial deployments seek a separate agreement. ^[1]^[4] This was a change from the first Qwen generation, where the smaller dense checkpoints had not all been openly licensed, and it reflected a broader move by Alibaba toward open-weight releases. The Qwen team framed the licensing shift as deliberate, writing: "We believe that the enhanced openness of our models to the community can accelerate the applications and commercial usages of Qwen2 all around the world." ^[1]

With Qwen2.5, released in September 2024, Alibaba moved most of the lineup (with the exception of the 3B and 72B sizes) to Apache 2.0, continuing the trend Qwen2 began. ^[2]

How was Qwen2 received?

Qwen2 was well received as one of the strongest open-weight model families available in mid-2024, and the permissive licensing of most sizes made it a popular base for fine-tuning and quantization. The 0.5B and 1.5B models in particular found use in resource-constrained and on-device settings, while the 72B model competed with the largest contemporary open models. Hugging Face reported that across 2024 the small instruction-tuned Qwen models were among the most-downloaded open models on its hub. Over the following year the broader Qwen lineage (Qwen2 and its successors) grew into one of the most widely downloaded and most frequently derived open-model families, and it was eventually cited as overtaking Meta's Llama series by cumulative downloads. ^[5]

What came after Qwen2?

Qwen2.5, announced in September 2024, was the direct successor to Qwen2 and substantially expanded the family. It widened the dense lineup to seven sizes (0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B), of which 3B, 14B, and 32B had no Qwen2 counterpart, and it was trained on a much larger corpus reported at 18 trillion tokens. ^[2] The Qwen lineage continued with Qwen3 in 2025. Specialized derivatives built on the Qwen2 architecture were also released around the same period, including the Qwen2-VL vision-language models, Qwen2-Audio, and Qwen2-Math, which extended the base text models to additional modalities and domains.

References

1. "Hello Qwen2." Qwen Team, Alibaba Cloud (official blog), 6 June 2024. https://qwenlm.github.io/blog/qwen2/
2. "Qwen2.5: A Party of Foundation Models!" Qwen Team, Alibaba Cloud (official blog), 19 September 2024. https://qwenlm.github.io/blog/qwen2.5/
3. An Yang et al. "Qwen2 Technical Report." arXiv:2407.10671, 15 July 2024. https://arxiv.org/abs/2407.10671
4. "Qwen/Qwen2-72B-Instruct." Hugging Face model card. https://huggingface.co/Qwen/Qwen2-72B-Instruct
5. "Qwen." Wikipedia. https://en.wikipedia.org/wiki/Qwen
