DeepSeek-V2
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,539 words
DeepSeek-V2 is a mixture-of-experts (MoE) large language model released in May 2024 by DeepSeek, the Chinese artificial intelligence lab that was spun out of the quantitative hedge fund High-Flyer and is led by Liang Wenfeng. The model has 236 billion total parameters, of which about 21 billion are activated for each token, and supports a context window of 128,000 tokens. [1][2] It was the first DeepSeek model to combine two architectural ideas the lab would carry forward into DeepSeek-V3 and DeepSeek-R1: Multi-head Latent Attention (MLA), which compresses the key-value cache to cut inference memory, and the DeepSeekMoE feed-forward design that mixes many fine-grained routed experts with a small number of always-on shared experts. [1]
DeepSeek-V2 also drew attention for its price. When the lab opened the model through its API at roughly 1 yuan per million input tokens and 2 yuan per million output tokens, it undercut every major domestic competitor and is widely credited with setting off a 2024 price war among Chinese model providers. [3][4]
DeepSeek-V2 is built on the standard Transformer decoder, but it replaces both the attention block and the feed-forward block with custom designs. The base model has 60 layers and a hidden dimension of 5,120. Attention uses 128 heads with a per-head dimension of 128. [1] Each token is routed through a sparse MoE feed-forward network rather than a dense one, so although the model holds 236 billion parameters in total, only about 21 billion are used to process any given token. This sparsity is the main reason DeepSeek reported that V2 cost 42.5 percent less to train than the earlier dense DeepSeek 67B model while scoring higher on benchmarks. [1]
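The sparsity described above can be put as a single ratio. The short calculation below is only a sketch using the two parameter counts quoted in the text; the function name is illustrative.

```python
# Parameter counts as stated in the text, in billions.
TOTAL_PARAMS_B = 236
ACTIVATED_PARAMS_B = 21

def activated_fraction(activated_b: float, total_b: float) -> float:
    """Share of the model's parameters that take part in processing one token."""
    return activated_b / total_b

fraction = activated_fraction(ACTIVATED_PARAMS_B, TOTAL_PARAMS_B)
print(f"activated share per token: {fraction:.1%}")
```

Under these figures, a little under one parameter in eleven is used for any given token.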
The paper frames the two innovations around a single goal: keep the strong performance of a large MoE model while making both training and inference cheap enough to serve at scale. MLA targets inference memory, and DeepSeekMoE targets training and serving compute. [1]
The most-cited contribution of DeepSeek-V2 is Multi-head Latent Attention. In a conventional Transformer, serving long contexts is expensive because the model must cache a separate key and value vector for every attention head at every position, and that key-value (KV) cache grows linearly with sequence length. MLA instead projects the keys and values down into a single low-rank latent vector that is shared across heads, caches only that compressed latent, and reconstructs the per-head keys and values on the fly. [1]
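The compress-then-reconstruct idea can be illustrated with a toy example. The sketch below is not DeepSeek's implementation: the sizes are deliberately tiny, the weights are random, the matrix names (`W_down`, `W_up_k`, `W_up_v`) are made up for illustration, and position encoding is left out entirely. It only shows that the cache stores one small latent per position and that per-head keys and values are rebuilt from it on demand.

```python
import random

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes for illustration only; these are NOT the DeepSeek-V2 dimensions.
D_MODEL, N_HEADS, D_HEAD, D_LATENT = 16, 4, 4, 3

rng = random.Random(0)
W_down = random_matrix(D_LATENT, D_MODEL, rng)           # hidden state -> shared latent
W_up_k = random_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # latent -> keys for all heads
W_up_v = random_matrix(N_HEADS * D_HEAD, D_LATENT, rng)  # latent -> values for all heads

kv_cache = []  # holds one compressed latent per position, nothing else

def append_token(hidden):
    """Compress the token's hidden state and cache only the latent."""
    kv_cache.append(matvec(W_down, hidden))

def split_heads(flat):
    return [flat[i * D_HEAD:(i + 1) * D_HEAD] for i in range(N_HEADS)]

def keys_and_values(position):
    """Rebuild per-head keys and values from the cached latent."""
    latent = kv_cache[position]
    return split_heads(matvec(W_up_k, latent)), split_heads(matvec(W_up_v, latent))

for _ in range(5):
    append_token([rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)])

keys, values = keys_and_values(2)
cached_per_token = len(kv_cache[0])
uncompressed_per_token = 2 * N_HEADS * D_HEAD
print(cached_per_token, "cached numbers per token instead of", uncompressed_per_token)
```

The trade is extra matrix multiplications at read time in exchange for a much smaller cache, which matters most when the context is long.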
DeepSeek used a KV compression dimension of 512 and a query compression dimension of 1,536. [1] A complication is rotary position embedding (RoPE): because RoPE is position-dependent, it cannot be folded cleanly into the low-rank compression. The authors solved this with a "decoupled" RoPE scheme that carries position information on a small set of extra query dimensions and a shared key, using a per-head decoupled dimension of 64. [1] The reported result is a 93.3 percent reduction in the KV cache relative to DeepSeek 67B and a maximum generation throughput up to 5.76 times higher, which is what made it economical to offer a 128K context at low prices. [1][2]
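The dimensions above allow a rough per-token cache count. The sketch below rests on several assumptions: that the MLA cache holds the 512-dimensional latent plus the 64-dimensional shared RoPE key in each layer, that cache entries are 16 bits wide, and that "128K" means 131,072 tokens. The baseline is a hypothetical standard multi-head attention model with DeepSeek-V2's own layer and head counts, not DeepSeek 67B, so the ratio it prints is not the 93.3 percent figure reported in the paper.

```python
# Dimensions as stated in the text above.
N_LAYERS = 60
N_HEADS = 128
D_HEAD = 128
D_KV_LATENT = 512   # KV compression dimension
D_ROPE_KEY = 64     # decoupled RoPE dimension of the shared key

BYTES_PER_ELEMENT = 2        # assumption: 16-bit cache entries
CONTEXT_TOKENS = 128 * 1024  # assumption: "128K" read as 131,072 tokens

def mha_elements_per_token(n_layers, n_heads, d_head):
    """Standard attention: one key and one value vector per head, per layer."""
    return n_layers * 2 * n_heads * d_head

def mla_elements_per_token(n_layers, d_latent, d_rope_key):
    """MLA (assumed): one shared latent plus one shared RoPE key per layer."""
    return n_layers * (d_latent + d_rope_key)

def cache_gib(elements_per_token, tokens, bytes_per_element):
    """Total cache size in GiB for a single sequence."""
    return elements_per_token * tokens * bytes_per_element / 2**30

mha = mha_elements_per_token(N_LAYERS, N_HEADS, D_HEAD)
mla = mla_elements_per_token(N_LAYERS, D_KV_LATENT, D_ROPE_KEY)

print(f"MHA baseline: {mha:,} elements per token")
print(f"MLA:          {mla:,} elements per token")
print(f"MLA / MHA:    {mla / mha:.2%}")
print(f"MHA cache at full context: {cache_gib(mha, CONTEXT_TOKENS, BYTES_PER_ELEMENT):.1f} GiB")
print(f"MLA cache at full context: {cache_gib(mla, CONTEXT_TOKENS, BYTES_PER_ELEMENT):.1f} GiB")
```

The absolute sizes depend on the assumed precision and on serving details such as cache quantization, so they should be read as orders of magnitude rather than measured numbers.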
The feed-forward layers use DeepSeekMoE, a design the lab had introduced earlier and refined for V2. Instead of a handful of large experts, DeepSeekMoE splits each expert into smaller "fine-grained" pieces so that the router can combine knowledge more flexibly, and it sets aside a few "shared" experts that every token always uses, on the theory that shared experts can absorb common knowledge and leave the routed experts free to specialize. [1]
In DeepSeek-V2 each MoE layer has 160 routed experts plus 2 shared experts, and each token activates 6 of the routed experts in addition to the shared ones. [1] To keep training stable and avoid routing collapse, the model uses auxiliary load-balancing losses at the expert, device, and communication levels, along with a device-limited routing scheme that caps how many devices a token's experts can be spread across. [1]
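The routing step can be sketched as top-k selection with a device cap. The expert counts below come from the text; everything else is an assumption made for illustration: the number of devices (8), the cap of 3 devices per token, the contiguous placement of experts on devices, the softmax affinity scores, and the rule of ranking devices by their single best expert. Load-balancing losses are not modeled.

```python
import math
import random

N_ROUTED = 160    # routed experts per MoE layer (from the text)
N_SHARED = 2      # shared experts, always active (from the text)
TOP_K = 6         # routed experts activated per token (from the text)
N_DEVICES = 8     # illustrative assumption
MAX_DEVICES = 3   # illustrative assumption: device cap per token

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def device_of(expert_id):
    """Hypothetical placement: contiguous blocks of experts on each device."""
    return expert_id // (N_ROUTED // N_DEVICES)

def route(affinity_logits):
    """Return {expert_id: gate_weight} for the routed experts chosen for one token."""
    affinities = softmax(affinity_logits)
    # Step 1: rank devices by the best expert affinity they host; keep the top few.
    best_on_device = [0.0] * N_DEVICES
    for expert_id, score in enumerate(affinities):
        dev = device_of(expert_id)
        best_on_device[dev] = max(best_on_device[dev], score)
    ranked = sorted(range(N_DEVICES), key=lambda d: best_on_device[d], reverse=True)
    allowed = set(ranked[:MAX_DEVICES])
    # Step 2: ordinary top-k selection, restricted to experts on allowed devices.
    candidates = [e for e in range(N_ROUTED) if device_of(e) in allowed]
    chosen = sorted(candidates, key=lambda e: affinities[e], reverse=True)[:TOP_K]
    return {e: affinities[e] for e in chosen}

rng = random.Random(0)
gates = route([rng.gauss(0.0, 1.0) for _ in range(N_ROUTED)])
active_experts = N_SHARED + len(gates)
devices_used = {device_of(e) for e in gates}
print(f"{active_experts} experts active, routed experts on {len(devices_used)} device(s)")
```

Capping the device count bounds the cross-device traffic a token can generate, at the cost of sometimes skipping a high-scoring expert that sits on an excluded device.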
DeepSeek-V2 was pretrained on a corpus of 8.1 trillion tokens, a multilingual mixture weighted toward English and Chinese. [1][2] Pretraining used a 4K-token sequence length, after which the context was extended to 128K. The extension used YaRN applied to the decoupled shared key, with a scaling factor of 40, and about 1,000 additional training steps at a 32K sequence length; the model was then validated for long-context retrieval. [1]
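A quick check shows how the scaling factor relates to the two sequence lengths. The calculation below assumes that "K" means 1,024 tokens and treats the YaRN scaling factor as a simple multiplier on the pretraining length; it says nothing about how YaRN rescales individual rotary frequencies.

```python
PRETRAIN_SEQ_LEN = 4 * 1024    # 4K pretraining length (from the text)
YARN_SCALE = 40                # scaling factor (from the text)
TARGET_CONTEXT = 128 * 1024    # 128K advertised context (from the text)

# Assumption: the scale factor is read as extended length / original length.
nominal_reach = PRETRAIN_SEQ_LEN * YARN_SCALE
headroom = nominal_reach / TARGET_CONTEXT

print(f"nominal reach: {nominal_reach:,} tokens")
print(f"headroom over 128K: {headroom:.2f}x")
```

Read this way, a factor of 40 leaves a margin above the 128K window rather than stretching exactly to it.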
The released chat models were produced in two post-training stages. First came supervised fine-tuning (SFT) on roughly 1.5 million conversational examples spanning helpfulness and safety. The lab then ran reinforcement learning using Group Relative Policy Optimization (GRPO), the same RL algorithm DeepSeek used in its math work and later in R1, to align the model with human preferences. [1] DeepSeek released both an SFT-only chat model and the RL-tuned DeepSeek-V2 Chat. [1]
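What distinguishes GRPO, as it is commonly described, is that it scores each sampled response against the other responses drawn for the same prompt instead of against a learned value model. The sketch below shows only that group-relative advantage step, under stated assumptions: the reward values are invented, the population standard deviation is one of several possible normalization choices, and the policy update, clipping, and KL penalty are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward-model scores for four responses to the same prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward {reward:.1f} -> advantage {advantage:+.2f}")
```

Responses that beat their group's average receive a positive advantage and are reinforced; those below it are discouraged, without the cost of training a separate critic.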
On standard evaluations the DeepSeek-V2 base model was competitive with the strongest open-weight models available in mid-2024, including LLaMA 3 70B and Mixtral 8x22B, despite activating only 21 billion parameters per token. [1] Selected base-model scores reported in the paper:
| Benchmark | DeepSeek-V2 Base |
|---|---|
| MMLU | 78.5 |
| BBH | 78.9 |
| C-Eval | 81.7 |
| CMMLU | 84.0 |
| GSM8K | 79.2 |
| MATH | 43.6 |
| HumanEval | 48.8 |
The chat model was tuned for open-ended dialogue and alignment. DeepSeek-V2 Chat (RL) reached a length-controlled win rate of 38.9 on AlpacaEval 2.0 and an overall score of 8.97 on MT-Bench. [1] On AlignBench, a Chinese alignment benchmark, it scored 7.91 overall, which the paper reports as the best among open-source models at the time and close to GPT-4-class systems. [1]
DeepSeek shipped several models around the V2 base, and the lineage is easy to confuse because the chat and code branches were later merged.
| Model | Total params | Activated params | Context | Notes |
|---|---|---|---|---|
| DeepSeek-V2 | 236B | 21B | 128K | Base and Chat (SFT and RL) |
| DeepSeek-V2-Lite | 16B | 2.4B | 32K | Smaller model for research and local use |
| DeepSeek-Coder-V2 | 236B | 21B | 128K | Code-specialized, continued pretraining |
| DeepSeek-Coder-V2-Lite | 16B | 2.4B | 128K | Smaller code variant |
| DeepSeek-V2.5 | 236B | 21B | 128K | Merge of the V2 chat and Coder-V2 lines |
DeepSeek-V2-Lite is a much smaller MoE with 16 billion total and 2.4 billion activated parameters, intended to be runnable on a single GPU; it uses the same MLA and DeepSeekMoE ideas but with 27 layers and a 32K context. [2]
DeepSeek-Coder-V2, released in June 2024, was not trained from scratch. It continued pretraining from an intermediate DeepSeek-V2 checkpoint on an additional 6 trillion tokens of code-heavy data, expanding programming-language coverage from 86 to 338 languages and extending the code context from 16K to 128K. DeepSeek reported that Coder-V2 matched closed models such as GPT-4 Turbo on code-specific tasks. [5] Techniques from the lab's math-focused DeepSeek-Math work, including GRPO, also fed into this family, as they had into the V2 chat models.
In September 2024 the lab consolidated the two branches into DeepSeek-V2.5, which merged DeepSeek-V2-0628 (the upgraded general chat model) and DeepSeek-Coder-V2-0724 into a single model that handled both general dialogue and coding, with improvements in writing, instruction following, and safety. DeepSeek-V2.5 was served through the same deepseek-chat and deepseek-coder API endpoints. [6]
DeepSeek priced the V2 API far below the prevailing rates for Chinese large models, at roughly 1 yuan (about 0.14 US dollars) per million input tokens and 2 yuan per million output tokens. [3][4] At the time, comparable access to GPT-4-class models cost on the order of tens of dollars per million tokens, so the gap was about two orders of magnitude. The low price was possible largely because MLA and the sparse MoE design cut the memory and compute needed to serve each request.
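The listed rates translate into per-request costs as follows. This is a sketch only: the request sizes are hypothetical, the exchange rate is the approximate one implied by the text, and the function name is illustrative rather than part of any DeepSeek API.

```python
INPUT_YUAN_PER_MILLION = 1.0    # from the text, approximate
OUTPUT_YUAN_PER_MILLION = 2.0   # from the text, approximate
USD_PER_YUAN = 0.14             # approximate rate implied by the text

def request_cost_yuan(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in yuan at the listed per-million-token rates."""
    return (input_tokens * INPUT_YUAN_PER_MILLION
            + output_tokens * OUTPUT_YUAN_PER_MILLION) / 1_000_000

# Hypothetical request: a long 100,000-token prompt with a 2,000-token answer.
cost = request_cost_yuan(100_000, 2_000)
print(f"{cost:.3f} yuan, about {cost * USD_PER_YUAN:.4f} USD")
```

At these rates even a prompt that fills most of the context window costs a fraction of a yuan.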
The launch is widely described as the trigger for a 2024 price war among Chinese AI vendors. In the weeks after V2 appeared, large competitors including Alibaba and Baidu cut their model prices steeply, with reported reductions exceeding 95 percent on some models, and other providers followed. [3][4] DeepSeek itself later said it had not expected the market to be so price-sensitive, and the episode helped establish the lab's reputation for aggressive, efficiency-driven pricing that continued through V3 and R1.
DeepSeek released the V2 code under the MIT License and the model weights under its own "DeepSeek License," which permits commercial use subject to a use-based restrictions schedule. [2] The weights for the base, chat, and Lite variants were published on Hugging Face, and the inference code and documentation were released on GitHub. [2]