DeepSeek-Coder
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,863 words
DeepSeek-Coder is a family of open-weight code language models developed by DeepSeek-AI, the Hangzhou-based laboratory affiliated with the hedge fund High-Flyer. The family spans three generations: the original DeepSeek-Coder V1 (1.3B, 6.7B, and 33B dense models trained from scratch on two trillion tokens), DeepSeek-Coder-V2 (a Mixture-of-Experts model released in two sizes, 16B Lite and 236B, supporting 338 programming languages and 128K context), and the subsequent absorption of code capability into the unified flagship checkpoints DeepSeek-V2.5 and DeepSeek V3.[1][2][3] Across its public benchmarks, DeepSeek-Coder repeatedly produced the strongest open-weight numbers of its time on HumanEval, MBPP, and LiveCodeBench, in several cases approaching or matching the proprietary frontier (GPT-4 Turbo, Claude 3 Opus, Gemini 1.5 Pro).[2][3] The weights are distributed under a permissive custom DeepSeek License that allows commercial use, with the accompanying code under the MIT License.[4][5]
| Field | Value |
|---|---|
| Developer | DeepSeek-AI (and Peking University collaborators) |
| First release | 2 November 2023 (V1)[6] |
| Major successor | DeepSeek-Coder-V2, 17 June 2024[7] |
| V1 sizes | 1.3B, 5.7B, 6.7B, 33B (Base and Instruct)[1][5] |
| V2 sizes | 16B Lite (2.4B active), 236B (21B active)[2][3] |
| V1 pretraining | 2T tokens, 87% code, 13% natural language[1] |
| V2 additional pretraining | 6T tokens on top of DeepSeek-V2 checkpoint[2] |
| Programming languages | 87 (V1), 338 (V2)[1][2] |
| Context window | 16K (V1), 128K (V2)[1][2] |
| Architecture | Dense decoder Transformer (V1); DeepSeekMoE with MLA (V2)[1][2] |
| Primary paper (V1) | arXiv:2401.14196, 25 January 2024[1] |
| Primary paper (V2) | arXiv:2406.11931, 17 June 2024[2] |
| Code license | MIT[4] |
| Model license | DeepSeek License (commercial use permitted)[5] |
DeepSeek-AI began releasing code-specialised models in late 2023, shortly before releasing its first general chat model. The initial DeepSeek-Coder family was made public on 2 November 2023, predating the accompanying arXiv preprint by nearly three months.[6][8] The technical report "DeepSeek-Coder: When the Large Language Model Meets Programming, The Rise of Code Intelligence" was submitted to arXiv on 25 January 2024 (revised on 26 January 2024) by Daya Guo, Qihao Zhu, Dejian Yang and collaborators from DeepSeek-AI and the Key Lab of HCST at Peking University.[1]
DeepSeek positioned the first release as an attempt to close the gap between open-source code models, then anchored by Meta's Code Llama and BigCode's StarCoder, and proprietary systems such as OpenAI's Codex and GPT-3.5.[1] The paper reported that the 33B instruction-tuned variant outperformed GPT-3.5-Turbo on HumanEval and matched it on MBPP, and that the 6.7B base model already exceeded CodeLlama-34B on multilingual HumanEval.[1][6]
DeepSeek-Coder-V2, with the subtitle "Breaking the Barrier of Closed-Source Models in Code Intelligence", was released and posted to arXiv on 17 June 2024.[2][7] Unlike V1, which was trained from scratch, V2 began from an intermediate checkpoint of the broader DeepSeek-V2 base model and added six trillion additional tokens of code- and math-rich data. The release also re-platformed the family onto a Mixture-of-Experts backbone using DeepSeekMoE and Multi-head Latent Attention (MLA) introduced in DeepSeek-V2.[2][9]
A subsequent checkpoint, DeepSeek-Coder-V2-Instruct-0724, replaced the original V2 on the DeepSeek API platform in late July 2024, and the official deepseek-coder endpoint was thereafter routed to the unified DeepSeek-V2.5 model released on 5 September 2024.[10] V2.5 explicitly merged DeepSeek-V2-0628 (chat) with DeepSeek-Coder-V2-0724 (code), and the subsequent flagship DeepSeek V3 (released 26 December 2024) absorbed code capabilities into a single 671B-parameter / 37B-active MoE checkpoint that no longer ships as a separate "Coder" branch.[10][11]
The V1 corpus totalled roughly 2 trillion tokens. The composition was 87% source code, 10% English code-related natural language (issues, pull requests, GitHub Markdown, StackExchange), and 3% Chinese natural language.[1] After filtering, the raw code dataset contained approximately 798 GB and 603 million files spanning 87 programming languages.[6]
The data pipeline applied repository-level deduplication. Rather than removing duplicate files independently, files within a GitHub repository were concatenated and MinHash-LSH was run over whole-project representations, preserving cross-file dependency structure that DeepSeek's authors found important for project-level reasoning.[1][6] Files were then ordered using a topological sort over their import dependencies before sequence packing, producing project-coherent training documents.
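The dependency-ordering step can be illustrated with a minimal sketch. It assumes the import graph has already been extracted into a dictionary (the `deps` input and the function name are hypothetical), and it uses Kahn's algorithm, which is one common way to compute a topological order rather than necessarily the procedure the authors used.

```python
from collections import defaultdict, deque


def order_by_dependencies(deps):
    """Return files so that each file appears after the files it imports.

    `deps` maps a file name to the list of files it imports. This input is
    hypothetical; a real pipeline would parse import/include statements.
    """
    files = set(deps)
    for imported in deps.values():
        files.update(imported)

    indegree = {f: 0 for f in files}
    dependents = defaultdict(list)
    for f, imported in deps.items():
        for dep in set(imported):
            dependents[dep].append(f)
            indegree[f] += 1

    # Kahn's algorithm; sorted() keeps the output deterministic.
    ready = deque(sorted(f for f in files if indegree[f] == 0))
    ordered = []
    while ready:
        current = ready.popleft()
        ordered.append(current)
        for nxt in sorted(dependents[current]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Files left over sit on an import cycle. Appending them in name order
    # instead of dropping them is an assumption of this sketch.
    placed = set(ordered)
    leftover = sorted(f for f in files if f not in placed)
    return ordered + leftover


repo = {
    "app.py": ["models.py", "utils.py"],
    "models.py": ["utils.py"],
    "utils.py": [],
}
print(order_by_dependencies(repo))  # ['utils.py', 'models.py', 'app.py']
```

Concatenating the files in this order places each definition before its first use, which is the property that makes the packed sequence read as one coherent project.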
DeepSeek-Coder V1 was trained with a 16K context window. Pretraining used the standard causal language modelling loss together with a Fill-in-the-Middle (FIM) objective at the document level, applied with a 50% Prefix-Suffix-Middle (PSM) probability and a fallback Suffix-Prefix-Middle (SPM) ordering.[1] The FIM ratio of 0.5 was chosen as a compromise between vanilla left-to-right loss and high-FIM regimes that improve infilling but slightly hurt completion quality.[1]
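A rough sketch of the PSM transformation is shown here. The sentinel strings `<PRE>`, `<SUF>` and `<MID>` are hypothetical placeholders (the released tokenizer defines its own special tokens), the split is done on characters rather than tokens, and only the PSM ordering is implemented; an SPM variant would emit the suffix before the prefix.

```python
import random

# Hypothetical sentinel strings; the real tokenizer uses its own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_fim(document, rng, fim_rate=0.5):
    """Rewrite `document` into PSM order with probability `fim_rate`."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document  # plain left-to-right sample
    # Two cut points split the text into prefix / middle / suffix.
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # PSM: prefix, then suffix, then the middle the model must generate.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def from_fim(sample):
    """Invert to_fim for a PSM sample (a sanity check, not part of training)."""
    if not sample.startswith(PRE):
        return sample
    body = sample[len(PRE):]
    prefix, rest = body.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


rng = random.Random(0)
doc = "def add(a, b):\n    return a + b\n"
print(to_fim(doc, rng, fim_rate=1.0))
```

Because the middle span is moved to the end, the ordinary next-token loss teaches the model to generate it while conditioning on both the prefix and the suffix.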
Instruction tuning produced the -Instruct variants. The instruction set used Alpaca-style formatting on roughly 2 billion tokens of curated demonstrations, including code explanation, code completion, and software engineering tasks.[1][5]
V2 inherited the DeepSeekMoE architecture from DeepSeek-V2, replacing the dense transformer of V1 with a Mixture-of-Experts decoder.[2][9] Two sizes were released:
| Variant | Total parameters | Active parameters per token | Context |
|---|---|---|---|
| DeepSeek-Coder-V2-Lite | 16B | 2.4B | 128K |
| DeepSeek-Coder-V2 | 236B | 21B | 128K |
Both variants used Multi-head Latent Attention (MLA) for KV-cache compression and a fine-grained expert layout with shared and routed experts.[2][9] Continued pretraining added 6 trillion tokens to the DeepSeek-V2 base, with the new corpus weighted toward code and mathematics; the documented mix was 60% code, 10% mathematics, and 30% natural language tokens.[2] Programming-language coverage expanded from the original 87 to 338 languages, and the context window was extended from 16K to 128K using YaRN, a method that rescales Rotary position embedding (RoPE) frequencies.[2]
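The gap between total and active parameters comes from routing: each token passes through the shared experts plus only a few routed experts. The toy sketch below illustrates that idea with scalar "experts". The function name, the softmax gate, and `top_k=2` are assumptions made for illustration; it does not reproduce the actual DeepSeekMoE gating, normalisation, or load-balancing details.

```python
import math


def moe_layer(x, gate_scores, routed_experts, shared_experts, top_k=2):
    """Combine always-on shared experts with the top_k highest-scoring routed experts."""
    # Softmax over the router logits gives one affinity per routed expert.
    peak = max(gate_scores)
    exps = [math.exp(s - peak) for s in gate_scores]
    total = sum(exps)
    affinities = [e / total for e in exps]

    # Only the top_k routed experts run for this token.
    chosen = sorted(
        range(len(affinities)), key=lambda i: affinities[i], reverse=True
    )[:top_k]

    out = x  # residual connection
    out += sum(expert(x) for expert in shared_experts)
    out += sum(affinities[i] * routed_experts[i](x) for i in chosen)
    return out, sorted(chosen)


# Toy experts: each one just scales its input (purely illustrative).
routed = [lambda x, w=w: w * x for w in (1.0, 2.0, 3.0, 4.0)]
shared = [lambda x: 0.5 * x]

output, used = moe_layer(1.0, [0.0, 0.0, 5.0, 5.0], routed, shared)
print(used)  # [2, 3]
```

In this toy case two of the four routed experts are evaluated, so the per-token compute is a fraction of what the full parameter count would suggest, while every expert still has to be stored.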
Alignment combined supervised fine-tuning with reinforcement learning. The reinforcement-learning stage used GRPO (Group Relative Policy Optimization), the critic-free policy-gradient variant introduced earlier in DeepSeek's mathematical reasoning work; rewards came from a combination of compile-and-test feedback for code and a learned reward model for general instruction following.[2][12]
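The "group relative" part of GRPO can be sketched in a few lines: several completions are sampled for the same prompt, and each reward is normalised against the group's mean and spread, so no separate value network is needed. The reward values below are hypothetical, and the use of the population standard deviation and the `eps` term are assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 if a sampled solution passed its tests, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Here the passing samples receive an advantage close to +1 and the failing ones close to -1. When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.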
| Property | V1 (released Nov 2023, paper Jan 2024) | V2 (Jun 2024) |
|---|---|---|
| Architecture | Dense Transformer | DeepSeekMoE with MLA |
| Sizes | 1.3B, 6.7B, 33B | 16B (2.4B active), 236B (21B active) |
| Pretraining tokens | 2T from scratch | DeepSeek-V2 base + 6T |
| Languages | 87 | 338 |
| Context | 16K | 128K |
| FIM | PSM (50% rate) + SPM | Inherited; FIM evaluated on DS-FIM |
| RL stage | None reported | GRPO |
| Paper | arXiv:2401.14196 | arXiv:2406.11931 |
All figures from the respective technical reports.[1][2]
The V1 paper reported pass@1 scores for the Base and Instruct variants on standard code benchmarks. Selected numbers:[1][6]
| Model | HumanEval (Py) | MBPP | DS-1000 |
|---|---|---|---|
| DeepSeek-Coder-Base-1.3B | 34.8 | 46.2 | (not reported) |
| DeepSeek-Coder-Base-6.7B | 49.4 | 60.6 | (not reported) |
| DeepSeek-Coder-Base-33B | 56.1 | 66.0 | 40.2 |
| DeepSeek-Coder-Instruct-1.3B | 65.2 | 49.4 | (not reported) |
| DeepSeek-Coder-Instruct-6.7B | 78.6 | 65.4 | (not reported) |
| DeepSeek-Coder-Instruct-33B | 79.3 | 70.0 | (not reported) |
The 33B base model exceeded Code Llama 34B by 7.9 points on HumanEval (Python), 9.3 on multilingual HumanEval, 10.8 on MBPP and 5.9 on DS-1000 in the paper's tabulations.[1][6] Crucially, the 6.7B base model already matched or surpassed CodeLlama-34B on HumanEval, foreshadowing the parameter-efficiency claims the lab would extend in V2.[1]
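For readers unfamiliar with the metric, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k randomly chosen samples passes. The sketch below shows that estimator; this article does not state how many samples per problem were drawn for the figures in the table above.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(10, 3, 1))  # 3 of 10 samples pass: pass@1 is c / n
print(pass_at_k(10, 3, 5))
```

For k = 1 the estimator reduces to the fraction of passing samples, and the benchmark score is the average of this value over all problems.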
The V2 release introduced the 236B MoE model and the 16B Lite model. Selected pass@1 numbers from the paper and the GitHub README:[2][3]
| Benchmark | DeepSeek-Coder-V2-Instruct (236B) | GPT-4o-0513 (proprietary) |
|---|---|---|
| HumanEval | 90.2 | 91.0 |
| MBPP+ | 76.2 | 73.5 |
| LiveCodeBench | 43.4 | 43.4 |
| MATH | 75.7 | 76.6 |
| SWE-bench (as reported in the paper) | 12.7 | 26.7 |
The paper concludes that DeepSeek-Coder-V2 was the first open-weight model to come within striking distance of the closed-source frontier on most code-specific benchmarks while remaining clearly weaker on agentic SWE-bench tasks.[2] The V2.5 merge later reported HumanEval at roughly 89% and a LiveCodeBench score rising from 39.7 to 41.8; these figures were measured on later evaluation windows and are not directly comparable with the scores in the table above.[10]
BigCodeBench was evaluated separately by community leaderboards; DeepSeek-Coder-V2-Instruct ranked among the top open models at release, though precise leaderboard positions shifted with subsequent updates and contamination filtering.[2]
The official Hugging Face hub under the deepseek-ai/ namespace hosts the full V1 ladder (deepseek-coder-1.3b-base, -1.3b-instruct, -5.7bmqa-base, -6.7b-base, -6.7b-instruct, -33b-base, -33b-instruct) and the V2 family (DeepSeek-Coder-V2-Lite-Base, -Lite-Instruct, DeepSeek-Coder-V2-Base, DeepSeek-Coder-V2-Instruct, plus the dated -0724 checkpoint).[5][7][13] The 5.7B "MQA" base variant uses multi-query attention and was added after the initial release for memory-constrained deployment.[5]
Downstream, the V1 6.7B and 33B Instruct variants accumulated tens of thousands of derivative fine-tunes, quantizations (GGUF, AWQ, GPTQ, EXL2) and merges across the Hugging Face ecosystem, making them among the most-downloaded open-weight code models of 2024 alongside StarCoder and Code Llama.[5][13]
The DeepSeek-Coder repositories ship under a dual-license model: the source code (training, evaluation, and inference scripts) under the MIT License, and the model weights under a custom DeepSeek License Agreement (Version 1.0, 23 October 2023).[4][14] The DeepSeek License grants a perpetual, worldwide, royalty-free copyright license including the right to host the model behind an API, distribute derivative weights, and charge fees, subject to use-based restrictions covering military use, harm to minors, fraud, harassment, illegal discrimination, and a handful of analogous categories drawn from the OpenRAIL family of model licenses.[14] Compared to Meta's Llama community license, the DeepSeek License imposes no monthly active user threshold and no separate redistribution clause; in that sense it is closer to a true Apache-style permissive license, though the use-based restrictions distinguish it from OSI-approved open source licensing in the strict sense.[14][15]
DeepSeek-Coder-V2 retained the same dual-license structure, with the LICENSE-MODEL file updated only to reflect the new model identifiers.[4][7]
DeepSeek-Coder occupies the same niche as Code Llama, StarCoder, Codestral and Qwen3-Coder: an open-weight base for code completion, repository-level infilling, programmer chat, and downstream fine-tuning into coding agents.[1][2][16] The combination of FIM training, a 16K (later 128K) context, and a permissive commercial license made V1 a popular base for company-internal copilots and self-hosted code assistants in 2024, particularly among teams unwilling to send proprietary source to OpenAI or Anthropic.[1][5] The V2-Lite 16B Mixture-of-Experts (with 2.4B active) became a frequently quantised target for desktop deployment: all 16B parameters must be held in memory, which quantisation brings within reach of consumer GPUs, while the 2.4B active parameters keep per-token compute low and the 128K context supports whole-file edits.[2][3]
Beyond direct use, DeepSeek-Coder served as a research benchmark in its own right. Papers proposing new code agents, fill-in-the-middle objectives, or repository-level reasoning routinely cite DeepSeek-Coder as the open baseline, and the V2 architecture's combination of MoE plus MLA influenced subsequent open releases.[2][9]
The V2 release helped establish DeepSeek-AI's later positioning as a credible peer of US frontier labs. When DeepSeek-R1 reasoning models were released in 2025, the lab built on the GRPO recipe, introduced in its earlier mathematical reasoning work, and on the code/math data pipelines documented in the DeepSeek-Coder-V2 paper.[12][17]
Several limitations of V1 have been documented. Coverage of niche or low-resource languages was thinner than that of StarCoder2, whose 619-language Stack v2 corpus included many languages absent from DeepSeek's 87-language list.[16] On low-resource languages such as D, Julia, Lua and Perl, StarCoder2-15B matched or surpassed DeepSeek-Coder-33B in independent evaluations.[16]
V2's SWE-bench result (12.7%) trailed GPT-4o (26.7%) by a wide margin, indicating that pure code-completion strength did not translate into agentic, multi-step issue-resolution performance.[2] The model also occasionally regressed on natural-language tasks compared with the underlying DeepSeek-V2 chat checkpoint, motivating the subsequent V2.5 merge.[10]
On benchmark interpretation, observers have noted that HumanEval scores above ~90% are saturated and no longer distinguish frontier models; LiveCodeBench (with its rolling fresh problems) and SWE-bench are now the more informative coding signals.[18][19] The DeepSeek-Coder family was instrumental in establishing this consensus, since its HumanEval numbers crossed the 90% threshold a full year before most independent leaderboards updated their weighting.
A separate practical limitation is the size of the full 236B V2 model. Only 21B parameters are active per token, but all 236B must be resident in memory, so the weights alone occupy about 472 GB in bf16 and hosting requires roughly eight 80GB GPUs. That is several times the footprint of Llama 3 70B (about 140 GB of bf16 weights) and far larger than Codestral 22B or Qwen2.5-Coder-32B.[2][3][20]
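The hosting estimate can be checked with back-of-the-envelope arithmetic. The sketch below counts weight memory only, at 2 bytes per parameter for bf16, and treats 1 GB as 10^9 bytes; the helper names are hypothetical, and KV cache, activations, and framework overhead are deliberately ignored.

```python
import math


def weight_memory_gb(total_params_billion, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bytes_per_param


def min_gpus(total_params_billion, bytes_per_param, gpu_gb=80):
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(total_params_billion, bytes_per_param) / gpu_gb)


print(weight_memory_gb(236, 2))   # 472 GB for the 236B model in bf16
print(min_gpus(236, 2))           # 6 GPUs for the weights alone
print(weight_memory_gb(16, 2))    # 32 GB for the 16B Lite model in bf16
print(weight_memory_gb(16, 0.5))  # 8.0 GB for the Lite model at 4 bits per weight
```

Six 80GB cards would hold the weights with almost no headroom, which is why practical deployments round up to eight once the KV cache and activations are included. The active-parameter count reduces compute per token but does not reduce this memory requirement.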
| Model | Released | Params | Open weights | License allows commercial | HumanEval pass@1 (reported) |
|---|---|---|---|---|---|
| Code Llama 34B Instruct | Aug 2023 | 34B dense | Yes | Yes (Llama 2 community) | ~48.2 (base 34B)[1] |
| StarCoder 2 15B | Feb 2024 | 15B dense | Yes | Yes (BigCode OpenRAIL-M) | ~46.3[16] |
| DeepSeek-Coder 33B Instruct | Nov 2023 | 33B dense | Yes | Yes (DeepSeek License) | 79.3[1] |
| DeepSeek-Coder-V2 236B Instruct | Jun 2024 | 236B / 21B active | Yes | Yes (DeepSeek License) | 90.2[2] |
| Codestral 22B | May 2024 | 22B dense | Yes | No (Mistral Non-Production) | ~81.1 (Mistral release notes)[20] |
| Qwen2.5-Coder 32B Instruct | Nov 2024 | 32B dense | Yes | Yes (Apache 2.0) | 87.0[21] |
| GPT-4 / GPT-4-Turbo | 2023-2024 | proprietary | No | N/A | ~85-90 across versions[2] |
| Claude 3 Opus | Mar 2024 | proprietary | No | N/A | 84.9 (reported by V2 paper)[2] |
Three observations follow. First, DeepSeek-Coder-V2 was the first MoE coder to clearly out-score every contemporaneous dense open model on HumanEval and MBPP+ at release, although Qwen2.5-Coder-32B (November 2024) and later Qwen3-Coder variants subsequently closed or reversed that gap on several benchmarks, particularly with smaller models trained on more code tokens.[2][21] Second, Codestral's "non-production" license made DeepSeek-Coder the obvious commercial-friendly choice for self-hosted production deployments during mid-2024.[20] Third, on absolute pass@1 numbers, the V2 family was the first open model series to bring GPT-4-class HumanEval results to weights anyone could download.[2][7]
DeepSeek-Coder sits inside a broader DeepSeek lineage: the general-purpose DeepSeek V3 and DeepSeek V3.1 LLMs, the DeepSeek-R1 reasoning models built on top of DeepSeek-V3, and specialised siblings like DeepSeek Janus (multimodal generation), DeepSeek-VL2 (vision-language), and DeepSeek-OCR.[11][17] The lineage shares a common architectural toolkit (DeepSeekMoE, MLA, GRPO) and a common evaluation philosophy that emphasises open weights, transparent data composition, and aggressive cost-efficiency.[2][9][12]
Adjacent code-LLM efforts include StarCoder and StarCoder2 from the BigCode collaboration, which trained on The Stack dataset, the broader Code Llama family, Codestral and Codestral Mamba from Mistral, the Qwen3-Coder line from Alibaba, and the closed AlphaCode systems from Google DeepMind.[16][20][22] Many of these systems share design ideas with DeepSeek-Coder, including repository-level corpora, fill-in-the-middle objectives, and Python-centric benchmark suites such as HumanEval, MBPP, LiveCodeBench, and BigCodeBench.[16][21]