Qwen2.5-Coder

Updated Jul 16, 2026

Qwen2.5-Coder is the code-specialized series within the Qwen2.5 generation of large language models developed by the Qwen team at Alibaba. It is the successor to CodeQwen1.5 and consists of six open-weight sizes, from 0.5 billion to 32 billion parameters, each released in a base and an instruction-tuned variant. The models are built by continuing the pretraining of the general Qwen2.5 base models on a code-heavy corpus of more than 5.5 trillion tokens, which the team reports gives the smaller sizes coding performance competitive with much larger general models ^[1]^[2]. The flagship Qwen2.5-Coder-32B-Instruct was positioned at release as the strongest open-weight code model then available, with results on several coding benchmarks that match or come close to OpenAI's GPT-4o ^[2]^[3].

The 1.5B and 7B sizes appeared first in September 2024, and the remaining four sizes (0.5B, 3B, 14B, and 32B) followed on 12 November 2024, completing the lineup ^[2]^[4]. The accompanying technical report, arXiv:2409.12186, was first submitted on 18 September 2024 and updated alongside the November launch ^[1].

Background

The Qwen team had shipped a dedicated code model once before. CodeQwen1.5-7B, released in April 2024, was built on the Qwen1.5 base, pretrained on roughly 3 trillion tokens of code-related data, supported 92 programming languages, and handled context lengths up to 64K tokens ^[5]. Qwen2.5-Coder is described in the technical report as a significant upgrade from that predecessor, with gains concentrated in code generation, code reasoning, and code repair ^[1].

Rather than train from scratch, Qwen2.5-Coder reuses the Qwen2.5 base checkpoints as a starting point and continues pretraining them on code. This lets the series inherit Qwen2.5's general language and mathematics ability while specializing for software tasks, and it explains why the code models share Qwen2.5's tokenizer (a 151,646-token vocabulary) and transformer design ^[1].

Model lineup

The series spans six sizes, each shipped as a base model (for downstream fine-tuning) and an Instruct model (aligned for direct chat and instruction following). All six are trained on the same 5.5-trillion-token code corpus. The three smaller sizes (0.5B, 1.5B, 3B) tie their input and output embeddings and support a 32K-token context, while the three larger sizes (7B, 14B, 32B) use untied embeddings and support 128K tokens through context-length extension ^[1]^[2].

Size	Layers	Hidden size	Query / KV heads	Tied embeddings	Context	License
0.5B	24	896	14 / 2	Yes	32K	Apache 2.0
1.5B	28	1,536	12 / 2	Yes	32K	Apache 2.0
3B	36	2,048	16 / 2	Yes	32K	Qwen Research
7B	28	3,584	28 / 4	No	128K	Apache 2.0
14B	48	5,120	40 / 8	No	128K	Apache 2.0
32B	64	5,120	40 / 8	No	128K	Apache 2.0

All sizes use a decoder-only transformer with rotary position embeddings, SwiGLU activations, RMSNorm, grouped-query attention, and a bias term on the attention QKV projection. The attention head dimension is 128 across the lineup ^[1]^[3].
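The table and the figures above are enough for two back-of-envelope calculations: how many parameters sit in the embedding matrices, and how much key-value cache grouped-query attention needs per token. The sketch below is illustrative arithmetic only. It assumes the 151,646-token vocabulary and the head dimension of 128 quoted in this article, and a 2-byte (FP16/BF16) cache; released checkpoints may pad the embedding matrix to a larger size, so the results are approximate. The names `Spec`, `embedding_params`, and `kv_cache_bytes_per_token` are hypothetical helpers, not part of any Qwen tooling.

```python
from dataclasses import dataclass

VOCAB_SIZE = 151_646  # tokenizer vocabulary quoted in this article (assumed unpadded)
HEAD_DIM = 128        # attention head dimension quoted in this article


@dataclass(frozen=True)
class Spec:
    layers: int
    hidden: int
    kv_heads: int
    tied: bool  # True when input and output embeddings share one matrix


# Values copied from the model lineup table.
SPECS = {
    "0.5B": Spec(24, 896, 2, True),
    "1.5B": Spec(28, 1536, 2, True),
    "3B": Spec(36, 2048, 2, True),
    "7B": Spec(28, 3584, 4, False),
    "14B": Spec(48, 5120, 8, False),
    "32B": Spec(64, 5120, 8, False),
}


def embedding_params(spec: Spec) -> int:
    """Parameters in the embedding matrices (one matrix if tied, two if not)."""
    matrices = 1 if spec.tied else 2
    return matrices * VOCAB_SIZE * spec.hidden


def kv_cache_bytes_per_token(spec: Spec, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * spec.layers * spec.kv_heads * HEAD_DIM * bytes_per_value


if __name__ == "__main__":
    for name, spec in SPECS.items():
        print(
            f"{name:>5}: embeddings ~{embedding_params(spec) / 1e9:.2f}B params, "
            f"KV cache {kv_cache_bytes_per_token(spec) / 1024:.0f} KiB per token"
        )
```

Under these assumptions the 32B model has about 1.55 billion embedding parameters and needs 256 KiB of cache per token, so a full 131,072-token window would occupy 32 GiB per sequence before the weights are counted. With 40 key-value heads instead of 8, the cache would be five times larger, which is the practical motivation for grouped-query attention.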

Architecture and long context

The 32B model is the largest in the series, with 32.5 billion parameters (31.0 billion excluding embeddings), 64 layers, and grouped-query attention using 40 query heads and 8 key-value heads ^[3]. The behaviors that most distinguish Qwen2.5-Coder from a general chat model come from its training objectives rather than its layout. The base models are trained with both next-token prediction and a fill-in-the-middle (FIM) objective, so they can complete code given both a prefix and a suffix, which is the pattern editor autocomplete relies on ^[1].
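A fill-in-the-middle request is usually expressed as a single prompt string in which special tokens mark the prefix, the suffix, and the point where generation should begin. The sketch below shows one way to assemble such a prompt. The token strings `<|fim_prefix|>`, `<|fim_suffix|>`, and `<|fim_middle|>` are taken from the Qwen2.5-Coder documentation as an assumption and do not appear in this article's sources as quoted here, so they should be checked against the tokenizer of the checkpoint in use. `build_fim_prompt` is a hypothetical helper name.

```python
# Special tokens assumed from the Qwen2.5-Coder documentation; verify against
# the tokenizer of the checkpoint you actually load.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix so the model generates the missing middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    # Code before and after the cursor position in an editor.
    before_cursor = "def add(a, b):\n    return "
    after_cursor = "\n\nprint(add(1, 2))\n"
    print(build_fim_prompt(before_cursor, after_cursor))
```

The prompt is sent to a base model as plain text, and whatever the model generates after the final marker is inserted at the cursor position.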

Long-context support is added during pretraining. The models are first trained at the file level with a maximum sequence length of 8,192 tokens. A repo-level stage then extends the context from 8,192 to 32,768 tokens, and the YaRN extrapolation mechanism is applied to reach the advertised 131,072-token (128K) window on the 7B, 14B, and 32B models ^[1]^[3]. The repo-level stage trains across concatenated files from the same repository so the model learns cross-file dependencies, which underpins the series' results on repository-level completion benchmarks such as RepoEval and CrossCodeEval ^[1]^[2].
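The step from 32,768 to 131,072 tokens corresponds to a YaRN scaling factor of 4. A minimal sketch of how that factor could be derived and packaged as a configuration fragment follows. The key names (`type`, `factor`, `original_max_position_embeddings`) follow the `rope_scaling` convention used in Hugging Face model configurations and are an assumption here; the exact schema depends on the serving framework and its version. `yarn_rope_scaling` is a hypothetical helper.

```python
import json

NATIVE_CONTEXT = 32_768  # context length reached by the repo-level training stage


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Build a rope_scaling-style fragment for extending the context with YaRN."""
    if target_context < native_context:
        raise ValueError("target context must be at least the native context")
    return {
        "type": "yarn",
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


if __name__ == "__main__":
    # 131,072 / 32,768 gives a factor of 4.0 for the advertised 128K window.
    print(json.dumps({"rope_scaling": yarn_rope_scaling(131_072)}, indent=2))
```

Scaling of this kind is typically enabled only when inputs actually exceed the native window, since a fixed factor can affect behavior on short inputs in some frameworks.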

Training data

The continued-pretraining corpus is built from five components and totals more than 5.5 trillion tokens ^[1]:

Source code pulled from public GitHub repositories, spanning 92 programming languages.
Text-code grounding data, where natural-language text is paired with code, filtered from Common Crawl.
Synthetic data generated using CodeQwen1.5 and filtered by an executor so that only code that runs is kept.
Mathematics data drawn from the Qwen2.5-Math corpus, included to preserve reasoning ability.
General text from the Qwen2.5 corpus with code segments removed, to avoid double counting.
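The executor filter applied to the synthetic data can be pictured as running each generated snippet and keeping only those that exit cleanly. The sketch below illustrates that idea for Python snippets using the standard library. It is not the team's pipeline: the function names and the 5-second timeout are assumptions, and a production system would run untrusted code inside a sandbox rather than a plain subprocess.

```python
import subprocess
import sys


def runs_cleanly(snippet: str, timeout_s: float = 5.0) -> bool:
    """Return True if the snippet runs to completion with exit code 0."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,  # keep the snippet's output off our console
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat hangs and infinite loops as failures
    return result.returncode == 0


def filter_executable(snippets: list[str]) -> list[str]:
    """Keep only the snippets that execute without error."""
    return [s for s in snippets if runs_cleanly(s)]


if __name__ == "__main__":
    candidates = [
        "print(1 + 1)",        # runs
        "raise SystemExit(1)", # exits with a non-zero status
        "def broken(:",        # syntax error
    ]
    print(filter_executable(candidates))
```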

The technical report describes an ablation over the mixing ratio and reports that a blend of roughly 70 percent code, 20 percent general text, and 10 percent mathematics performed best, which is the ratio used for the released models ^[1]. The team frames this as a deliberate balance: enough code to specialize, but enough text and math to keep the model usable for reasoning and instruction following rather than purely for autocomplete.
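To make the ratio concrete, the following sketch splits a token budget according to the reported 70/20/10 blend. The 5.5 trillion total is the lower bound quoted above and the percentages are described as approximate, so the resulting figures are rough estimates rather than published numbers. `token_budget` is a hypothetical helper.

```python
TOTAL_TOKENS = 5_500_000_000_000  # "more than 5.5 trillion", used as a lower bound
MIX_PERCENT = {"code": 70, "text": 20, "math": 10}  # approximate reported blend


def token_budget(total: int, mix_percent: dict[str, int]) -> dict[str, int]:
    """Split a token budget across sources using integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


if __name__ == "__main__":
    for source, tokens in token_budget(TOTAL_TOKENS, MIX_PERCENT).items():
        print(f"{source:>4}: {tokens / 1e12:.2f} trillion tokens")
```

Under these assumptions the blend works out to about 3.85 trillion tokens of code, 1.10 trillion of general text, and 0.55 trillion of mathematics.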

Post-training of the Instruct models is staged. The report describes a coarse-to-fine supervised fine-tuning pass (starting from large, lower-quality instruction samples and refining toward high-quality, rejection-sampled examples), a mixed-tuning stage that combines standard SFT with FIM-format instruction samples so the aligned model keeps its infilling ability, and a direct preference optimization stage for code that derives preference labels from code-execution feedback together with an LLM-as-judge ^[1].
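The execution-feedback half of the preference stage can be sketched as follows: run several candidate solutions against test cases, then pair a candidate that passes more tests (chosen) with one that passes fewer (rejected). This is a simplified illustration under stated assumptions, not the procedure from the report. The LLM-as-judge signal is omitted, the helper names are hypothetical, and `exec` is used only for brevity where a real pipeline would sandbox execution.

```python
from itertools import combinations


def count_passed(source: str, func_name: str, tests: list[tuple[tuple, object]]) -> int:
    """Execute candidate source and count how many test cases it passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only; sandbox this in practice
    except Exception:
        return 0  # code that fails to load passes nothing
    func = namespace.get(func_name)
    if not callable(func):
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed


def preference_pairs(candidates: list[str], func_name: str, tests) -> list[dict]:
    """Pair candidates so the one passing more tests is marked as chosen."""
    scored = [(count_passed(src, func_name, tests), src) for src in candidates]
    pairs = []
    for (score_a, a), (score_b, b) in combinations(scored, 2):
        if score_a > score_b:
            pairs.append({"chosen": a, "rejected": b})
        elif score_b > score_a:
            pairs.append({"chosen": b, "rejected": a})
    return pairs  # ties produce no pair because they carry no preference signal


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b\n"
    bad = "def add(a, b):\n    return a - b\n"
    tests = [((1, 2), 3), ((0, 0), 0)]
    print(preference_pairs([good, bad], "add", tests))
```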

Capabilities

Qwen2.5-Coder targets the full range of code tasks rather than generation alone. The base models do code completion and fill-in-the-middle infilling, including the repository-level case where context comes from other files. The Instruct models add code generation from natural language, code reasoning, bug fixing and repair, and code-grounded chat. The series also handles text-to-SQL, where a question and a database schema are turned into a SQL query ^[1]^[2].
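For the repository-level case, context from other files is typically concatenated into one prompt, with markers separating the repository name and each file. The sketch below assembles such a prompt. The marker strings `<|repo_name|>` and `<|file_sep|>` are assumed from the Qwen2.5-Coder documentation and should be verified against the tokenizer in use; `build_repo_prompt` and the example file contents are hypothetical.

```python
# Marker tokens assumed from the Qwen2.5-Coder documentation; verify before use.
REPO_NAME = "<|repo_name|>"
FILE_SEP = "<|file_sep|>"


def build_repo_prompt(
    repo_name: str,
    context_files: list[tuple[str, str]],
    target_path: str,
    target_prefix: str,
) -> str:
    """Concatenate context files, then the unfinished target file to complete."""
    parts = [f"{REPO_NAME}{repo_name}"]
    for path, content in context_files:
        # Trailing newlines are trimmed so each file is followed by one line break.
        parts.append(f"{FILE_SEP}{path}\n" + content.rstrip("\n"))
    # The target file goes last so generation continues from its prefix.
    parts.append(f"{FILE_SEP}{target_path}\n{target_prefix}")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_repo_prompt(
        "demo-repo",
        [("mathutils.py", "def square(x):\n    return x * x\n")],
        "main.py",
        "from mathutils import square\n\nprint(",
    )
    print(prompt)
```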

Across these tasks the team emphasizes breadth of language coverage (the 92 languages carried over from the training data, with strong results reported on the multilingual McEval and MdEval suites that cover dozens of languages) and the practicality of the smaller checkpoints, which are small enough to run as local coding assistants ^[2]. The 7B and 14B sizes in particular became common defaults for self-hosted code completion because they fit on a single consumer GPU while retaining most of the larger model's accuracy on standard benchmarks.

Benchmarks

The headline claim for the release is that Qwen2.5-Coder-32B-Instruct matches GPT-4o on several coding benchmarks and is the best open-weight code model on EvalPlus, LiveCodeBench, and BigCodeBench ^[2]. The technical report evaluates it across more than ten benchmarks covering generation, completion, reasoning, and repair ^[1]. The table below gives the figures reported in the technical report and the Qwen blog for the 32B-Instruct model alongside GPT-4o and DeepSeek-Coder-V2-Instruct (DS-Coder-V2-Instruct), the two strongest comparison points at the time.

Benchmark	Qwen2.5-Coder-32B-Instruct	GPT-4o	DS-Coder-V2-Instruct
HumanEval	92.7	92.1	85.4
HumanEval+	87.2	86.0	82.3
MBPP	90.2	86.8	89.4
MBPP+	75.1	72.5	75.1
BigCodeBench (full)	49.6	50.1	48.2
BigCodeBench (hard)	27.0	25.0	24.3
LiveCodeBench (Pass@1)	31.4	34.6	27.9
Aider (Pass@2)	73.7	73.7 (comparable)	73.7
McEval	65.9	N/A	N/A
MdEval	75.2	N/A	N/A

On the Aider code-repair benchmark the 32B-Instruct model scored 60.9 at Pass@1 and 73.7 at Pass@2, which the team describes as comparable to GPT-4o ^[1]^[2]. On the multilingual evaluations it reports 65.9 on McEval and 75.2 on MdEval, both presented as first place among open-source models ^[2]. For text-to-SQL the report states only that Qwen2.5-Coder outperforms other code models of the same size on Spider and BIRD, without publishing a single headline accuracy number in the table; the underlying Spider results have been reproduced by third parties in the mid-to-high 80s depending on the prompting setup ^[1]^[6].
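Several rows in the table are labeled Pass@k. As background, the sketch below implements the widely used unbiased pass@k estimator introduced alongside HumanEval: given n sampled solutions of which c are correct, it estimates the probability that at least one of k randomly drawn samples is correct. This is offered as a general reference point only. The individual benchmarks above may compute their scores differently (for example with a single greedy sample or with retry-based attempts), and the sample counts in the usage lines are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every draw of k samples must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 10 samples generated, 3 of them pass the tests.
    print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
    print(f"pass@2 = {pass_at_k(10, 3, 2):.3f}")
```

With one sample per problem the estimator reduces to the plain fraction of problems solved, which is how a Pass@1 figure is commonly reported.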

The base models are strong for their size as well. Qwen2.5-Coder-32B base scores 65.9 on HumanEval, and the report stresses that the smaller base models consistently beat larger general models on code benchmarks, which is the practical argument for the series ^[1].

Licensing

Five of the six sizes (0.5B, 1.5B, 7B, 14B, and 32B) are released under the Apache License 2.0, which permits commercial use, modification, and redistribution. The 3B size is the exception and is released under the more restrictive Qwen Research license rather than Apache 2.0 ^[2]. Weights for every size are distributed on Hugging Face in base and Instruct variants, with quantized GGUF, AWQ, and GPTQ distributions published by the team and the community ^[4].

Predecessor and successors

The direct predecessor is CodeQwen1.5, the April 2024 7B code model on which Qwen2.5-Coder improves in generation, reasoning, and repair ^[1]^[5]. Within the same generation, the code series sits alongside the general Qwen2.5 models and the broader Qwen family.

The successor is Qwen3-Coder, announced in July 2025 as part of the Qwen3 generation. Where Qwen2.5-Coder centered on code completion, infilling, and single-shot generation, Qwen3-Coder shifts toward agentic coding: long-horizon tasks where the model invokes tools, runs commands, and edits files across a repository. Its flagship, Qwen3-Coder-480B-A35B-Instruct, is a mixture-of-experts model with 480 billion total parameters and 35 billion active, a native 256K-token context, and a focus on benchmarks such as SWE-bench Verified rather than the HumanEval-style suites that defined the Qwen2.5-Coder era ^[7]. Even so, the Qwen2.5-Coder checkpoints, and the 7B, 14B, and 32B sizes in particular, remained widely used for local code completion well after their successor shipped, because they are dense, permissively licensed, and small enough to self-host.

References

1. Hui, Binyuan; Yang, Jian; et al. "Qwen2.5-Coder Technical Report." arXiv:2409.12186. https://arxiv.org/abs/2409.12186
2. Qwen Team. "Qwen2.5-Coder Series: Powerful, Diverse, Practical." Qwen blog, 12 November 2024. https://qwenlm.github.io/blog/qwen2.5-coder-family/
3. "Qwen/Qwen2.5-Coder-32B-Instruct." Hugging Face model card. https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
4. Qwen Team. "Qwen2.5-Coder: Code More, Learn More!" Qwen blog, 19 September 2024. https://qwenlm.github.io/blog/qwen2.5-coder/
5. Qwen Team. "Code with CodeQwen1.5." Qwen blog, April 2024. https://qwenlm.github.io/blog/codeqwen1.5/
6. "Qwen2.5-Coder Technical Report (HTML)." arXiv, v3. https://arxiv.org/html/2409.12186v3
7. Qwen Team. "Qwen3-Coder: Agentic Coding in the World." Qwen blog, 22 July 2025. https://qwenlm.github.io/blog/qwen3-coder/
