Qwen2.5-Coder
Last reviewed
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,741 words
Qwen2.5-Coder is the code-specialized series within the Qwen2.5 generation of large language models developed by the Qwen team at Alibaba. It is the successor to CodeQwen1.5 and consists of six open-weight sizes, from 0.5 billion to 32 billion parameters, each released in a base and an instruction-tuned variant. The models are continued-pretrained on top of the general Qwen2.5 base models using a code-heavy corpus of more than 5.5 trillion tokens, which the team reports gives the smaller sizes coding performance competitive with much larger general models [1][2]. The flagship Qwen2.5-Coder-32B-Instruct was positioned at release as the strongest open-weight code model then available, with results on several coding benchmarks that match or come close to OpenAI's GPT-4o [2][3].
The 1.5B and 7B sizes appeared first in September 2024, and the remaining four sizes (0.5B, 3B, 14B, and 32B) followed on 12 November 2024, completing the lineup [2][4]. The accompanying technical report, arXiv:2409.12186, was first submitted on 18 September 2024 and updated alongside the November launch [1].
The Qwen team had shipped a dedicated code model once before. CodeQwen1.5-7B, released in April 2024, was built on the Qwen1.5 base, pretrained on roughly 3 trillion tokens of code-related data, supported 92 programming languages, and handled context lengths up to 64K tokens [5]. Qwen2.5-Coder is described in the technical report as a significant upgrade from that predecessor, with gains concentrated in code generation, code reasoning, and code repair [1].
Rather than train from scratch, Qwen2.5-Coder reuses the Qwen2.5 base checkpoints as a starting point and continues pretraining them on code. This lets the series inherit Qwen2.5's general language and mathematics ability while specializing for software tasks, and it explains why the code models share Qwen2.5's tokenizer (a 151,646-token vocabulary) and transformer design [1].
The series spans six sizes, each shipped as a base model (for downstream fine-tuning) and an Instruct model (aligned for direct chat and instruction following). All six are trained on the same 5.5-trillion-token code corpus. The three smaller sizes (0.5B, 1.5B, 3B) tie their input and output embeddings and support a 32K-token context, while the three larger sizes (7B, 14B, 32B) untie embeddings and support 128K tokens through context-length extension [1][2].
| Size | Layers | Hidden size | Query / KV heads | Tied embeddings | Context | License |
|---|---|---|---|---|---|---|
| 0.5B | 24 | 896 | 14 / 2 | Yes | 32K | Apache 2.0 |
| 1.5B | 28 | 1,536 | 12 / 2 | Yes | 32K | Apache 2.0 |
| 3B | 36 | 2,048 | 16 / 2 | Yes | 32K | Qwen Research |
| 7B | 28 | 3,584 | 28 / 4 | No | 128K | Apache 2.0 |
| 14B | 48 | 5,120 | 40 / 8 | No | 128K | Apache 2.0 |
| 32B | 64 | 5,120 | 40 / 8 | No | 128K | Apache 2.0 |
All sizes use a decoder-only transformer with rotary position embeddings, SwiGLU activations, RMSNorm, grouped-query attention, and a bias term on the attention QKV projection. The attention head dimension is 128 across the lineup [1][3].
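As a rough illustration of what grouped-query attention saves at inference time, the sketch below estimates the key-value cache per token from the layer and head counts in the table above and the 128-dimension head. The 2-byte element size (fp16 or bf16), the function name, and the comparison against a hypothetical layout with one KV head per query head are assumptions for illustration. Real inference stacks add their own overheads.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int,
                             head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache stored per token: one key and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# (layers, KV heads, query heads) copied from the table above.
MODELS = {"7B": (28, 4, 28), "14B": (48, 8, 40), "32B": (64, 8, 40)}
FULL_CONTEXT = 131_072  # the 128K window of the larger sizes

for name, (layers, kv_heads, q_heads) in MODELS.items():
    gqa = kv_cache_bytes_per_token(layers, kv_heads)
    # Hypothetical baseline: every query head keeps its own key/value head.
    mha = kv_cache_bytes_per_token(layers, q_heads)
    full_gib = gqa * FULL_CONTEXT / 2**30
    print(f"{name}: {gqa / 1024:.0f} KiB/token with GQA, "
          f"{mha / gqa:.0f}x less than one KV head per query head, "
          f"{full_gib:.0f} GiB at full context")
```

The estimate covers only the cache, not the weights or activations, so it is a lower bound on the memory a long-context request needs.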
The 32B model is the largest in the series, with 32.5 billion parameters (31.0 billion excluding embeddings), 64 layers, and grouped-query attention using 40 query heads and 8 key-value heads [3]. The behaviors that most distinguish Qwen2.5-Coder from a general chat model come from its training objectives rather than its layout: the base models are trained with both next-token prediction and a fill-in-the-middle (FIM) objective, so they can complete code given both a prefix and a suffix, which is the pattern editor autocomplete relies on [1].
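A fill-in-the-middle request can be sketched as a prompt that places the prefix and suffix around special marker tokens and asks the model to generate the missing middle. The token strings below follow the public Qwen2.5-Coder model documentation, not this article, so treat them as an assumption and check the tokenizer of the checkpoint in use. The helper name `build_fim_prompt` is hypothetical.

```python
# Marker tokens as given in the public model documentation (assumption here).
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out prefix and suffix so generation continues with the missing middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

# Code before and after the cursor, as an editor plugin might capture it.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"

prompt = build_fim_prompt(prefix, suffix)
print(prompt)
```

The prompt is sent to a base (not Instruct) checkpoint as plain text, and whatever the model generates after the final marker is inserted at the cursor position.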
Long-context support is added during pretraining. File-level pretraining first uses a maximum sequence length of 8,192 tokens, a repo-level stage then extends the context from 8,192 to 32,768 tokens, and the YaRN extrapolation mechanism is applied to reach the advertised 131,072-token (128K) window on the 7B, 14B, and 32B models [1][3]. The repo-level stage trains across concatenated files from the same repository so the model learns cross-file dependencies, which underpins the series' results on repository-level completion benchmarks such as RepoEval and CrossCodeEval [1][2].
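The relationship between the 32,768-token trained context and the 131,072-token window can be sketched as a YaRN scaling factor. The key names in the dictionary below follow the convention used in Hugging Face-style configuration files for Qwen2.5 models. That convention and the helper name are assumptions rather than something this article specifies, so confirm them against the model card before relying on them.

```python
import json

def yarn_rope_scaling(trained_ctx: int, target_ctx: int) -> dict:
    """Build a YaRN scaling entry that stretches trained_ctx to target_ctx."""
    if target_ctx < trained_ctx:
        raise ValueError("target context must not be shorter than the trained context")
    return {
        "type": "yarn",
        "factor": target_ctx / trained_ctx,               # how far positions are extrapolated
        "original_max_position_embeddings": trained_ctx,  # context reached in pretraining
    }

rope_scaling = yarn_rope_scaling(trained_ctx=32_768, target_ctx=131_072)
print(json.dumps({"rope_scaling": rope_scaling}, indent=2))
```

A factor of 4 follows directly from the two context lengths quoted above. Whether a given serving stack applies the scaling statically or only to long inputs depends on the stack.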
The continued-pretraining corpus totals more than 5.5 trillion tokens and is built from five components: source code, text-code grounding data, synthetic data, math data, and general text data [1].
The technical report describes an ablation over the mixing ratio and reports that a blend of roughly 70 percent code, 20 percent general text, and 10 percent mathematics performed best, which is the ratio used for the released models [1]. The team frames this as a deliberate balance: enough code to specialize, but enough text and math to keep the model usable for reasoning and instruction following rather than purely for autocomplete.
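To make the mixing ratio concrete, the sketch below draws training samples from three sources in the reported 70/20/10 proportions. It is a generic weighted sampler, not the team's data pipeline. The source names, sample count, and seed are illustrative assumptions.

```python
import random
from collections import Counter

# Reported blend: roughly 70% code, 20% general text, 10% mathematics.
MIX = {"code": 0.7, "text": 0.2, "math": 0.1}

def sample_sources(n: int, seed: int = 0) -> Counter:
    """Pick a source for each of n samples in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))

N = 100_000
counts = sample_sources(N)
for name in MIX:
    print(f"{name}: {counts[name] / N:.3f}")
```

Changing the weights in `MIX` is the kind of knob an ablation over mixing ratios would vary.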
Post-training of the Instruct models is staged. The report describes a coarse-to-fine supervised fine-tuning pass (starting from large, lower-quality instruction samples and refining toward high-quality, rejection-sampled examples), a mixed-tuning stage that combines standard SFT with FIM-format instruction samples so the aligned model keeps its infilling ability, and a direct preference optimization stage for code that derives preference labels from code-execution feedback together with an LLM-as-judge [1].
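The execution-feedback half of that preference stage can be sketched as follows: run each candidate solution against test cases, then pair every passing candidate with every failing one. This is a minimal illustration under stated assumptions, not the report's pipeline. The function names, the toy task, and the record layout are hypothetical, the LLM-as-judge signal is omitted, and the code runs candidates without a sandbox, which a real system would not do.

```python
from itertools import product

def passes_tests(candidate_src: str, tests, func_name: str) -> bool:
    """Execute candidate code in a scratch namespace and check every test case."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # unsandboxed: acceptable only for a toy example
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False

def build_preference_pairs(prompt: str, candidates, tests, func_name: str):
    """Pair each passing candidate (chosen) with each failing one (rejected)."""
    results = [(c, passes_tests(c, tests, func_name)) for c in candidates]
    passed = [c for c, ok in results if ok]
    failed = [c for c, ok in results if not ok]
    return [{"prompt": prompt, "chosen": good, "rejected": bad}
            for good, bad in product(passed, failed)]

candidates = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong result
    "def add(a, b)\n    return a + b\n",    # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

pairs = build_preference_pairs("Write add(a, b).", candidates, tests, "add")
print(len(pairs), "preference pairs")
```

Records of this shape (prompt, chosen, rejected) are the usual input to a direct preference optimization trainer.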
Qwen2.5-Coder targets the full range of code tasks rather than generation alone. The base models do code completion and fill-in-the-middle infilling, including the repository-level case where context comes from other files. The Instruct models add code generation from natural language, code reasoning, bug fixing and repair, and code-grounded chat. The series also handles text-to-SQL, where a question and a database schema are turned into a SQL query [1][2].
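A text-to-SQL request can be sketched as a prompt that combines the schema with the question, plus a cheap check that a returned query at least compiles against that schema. The prompt wording, the example schema, and both helper names are illustrative assumptions. The candidate query is hand-written and stands in for model output, since no model is called here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department TEXT NOT NULL,
    salary REAL NOT NULL
);
"""

def build_sql_prompt(schema: str, question: str) -> str:
    """Combine a database schema and a natural-language question into one prompt."""
    return (
        "Given the database schema below, write one SQLite query "
        "that answers the question.\n\n"
        f"Schema:\n{schema.strip()}\n\n"
        f"Question: {question}\n"
        "SQL:"
    )

def is_valid_sql(schema: str, query: str) -> bool:
    """Check that a query compiles against the schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        conn.execute(f"EXPLAIN {query}")  # plans the query without reading any rows
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

sql_prompt = build_sql_prompt(SCHEMA, "What is the average salary per department?")
candidate = "SELECT department, AVG(salary) FROM employees GROUP BY department"

print(sql_prompt)
print("candidate compiles:", is_valid_sql(SCHEMA, candidate))
print("bad column compiles:", is_valid_sql(SCHEMA, "SELECT team FROM employees"))
```

Compiling a query only shows that it is well-formed for the schema. Benchmarks such as Spider and BIRD score whether the result it returns is correct.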
Across these tasks the team emphasizes breadth of language coverage (the 92 languages carried over from the training data, with strong results reported on the multilingual McEval and MdEval suites that cover dozens of languages) and the practicality of the smaller checkpoints, which are small enough to run as local coding assistants [2]. The 7B and 14B sizes in particular became common defaults for self-hosted code completion because they fit on a single consumer GPU while retaining most of the larger model's accuracy on standard benchmarks.
The headline claim for the release is that Qwen2.5-Coder-32B-Instruct matches GPT-4o on several coding benchmarks and is the best open-weight code model on EvalPlus, LiveCodeBench, and BigCodeBench [2]. The technical report evaluates it across more than ten benchmarks covering generation, completion, reasoning, and repair [1]. The table below gives the figures reported in the technical report and the Qwen blog for the 32B-Instruct model alongside GPT-4o and DeepSeek-Coder-V2-Instruct (DS-Coder-V2-Instruct), the two strongest comparison points at the time.
| Benchmark | Qwen2.5-Coder-32B-Instruct | GPT-4o | DS-Coder-V2-Instruct |
|---|---|---|---|
| HumanEval | 92.7 | 92.1 | 85.4 |
| HumanEval+ | 87.2 | 86.0 | 82.3 |
| MBPP | 90.2 | 86.8 | 89.4 |
| MBPP+ | 75.1 | 72.5 | 75.1 |
| BigCodeBench (full) | 49.6 | 50.1 | 48.2 |
| BigCodeBench (hard) | 27.0 | 25.0 | 24.3 |
| LiveCodeBench (Pass@1) | 31.4 | 34.6 | 27.9 |
| Aider (Pass@2) | 73.7 | 73.7 (comparable) | 73.7 |
| McEval | 65.9 | N/A | N/A |
| MdEval | 75.2 | N/A | N/A |
On the Aider code-repair benchmark the 32B-Instruct model scored 60.9 at Pass@1 and 73.7 at Pass@2, which the team describes as comparable to GPT-4o [1][2]. On the multilingual evaluations it reports 65.9 on McEval and 75.2 on MdEval, both presented as first place among open-source models [2]. For text-to-SQL the report states only that Qwen2.5-Coder outperforms other code models of the same size on Spider and BIRD, without publishing a single headline accuracy number in the table; the underlying Spider results have been reproduced by third parties in the mid-to-high 80s depending on the prompting setup [1][6].
The base models are strong for their size as well. Qwen2.5-Coder-32B base scores 65.9 on HumanEval, and the report stresses that the smaller base models consistently beat larger general models on code benchmarks, which is the practical argument for the series [1].
Five of the six sizes (0.5B, 1.5B, 7B, 14B, and 32B) are released under the Apache License 2.0, which permits commercial use, modification, and redistribution. The 3B size is the exception and is released under the more restrictive Qwen Research license rather than Apache 2.0 [2]. Weights for every size are distributed on Hugging Face in base and Instruct variants, with quantized GGUF, AWQ, and GPTQ distributions published by the team and the community [4].
The direct predecessor is CodeQwen1.5, the April 2024 7B code model on which Qwen2.5-Coder improves in generation, reasoning, and repair [1][5]. Within the same generation, the code series sits alongside the general Qwen2.5 models and the broader Qwen family.
The successor is Qwen3-Coder, announced in July 2025 as part of the Qwen3 generation. Where Qwen2.5-Coder centered on code completion, infilling, and single-shot generation, Qwen3-Coder shifts toward agentic coding: long-horizon tasks where the model invokes tools, runs commands, and edits files across a repository. Its flagship, Qwen3-Coder-480B-A35B-Instruct, is a mixture-of-experts model with 480 billion total parameters and 35 billion active, a native 256K-token context, and a focus on benchmarks such as SWE-bench Verified rather than the HumanEval-style suites that defined the Qwen2.5-Coder era [7]. Even so, the Qwen2.5-Coder checkpoints, and the 7B, 14B, and 32B sizes in particular, remained widely used for local code completion well after their successor shipped, because they are dense, permissively licensed, and small enough to self-host.