# DeepSeek-V2

> Source: https://aiwiki.ai/wiki/deepseek_v2
> Updated: 2026-06-27
> Categories: Chinese AI, Large Language Models, Mixture of Experts
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

DeepSeek-V2 is a 236-billion-parameter [mixture-of-experts](/wiki/mixture_of_experts) (MoE) [large language model](/wiki/large_language_model) released in May 2024 by [DeepSeek](/wiki/deepseek), the Chinese AI lab spun out of the quantitative hedge fund High-Flyer and led by [Liang Wenfeng](/wiki/liang_wenfeng). It activates only about 21 billion of its 236 billion parameters per token, supports a 128,000-token context window, and introduced two architectural ideas, Multi-head Latent Attention (MLA) and DeepSeekMoE, that made it both cheap to train and cheap to serve. [1][2] Its aggressive API pricing is widely credited with setting off a 2024 price war among Chinese model providers. [3][4]

The technical report summarizes the design in one sentence: "DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference." [1] The same two innovations, MLA and DeepSeekMoE, were carried forward into [DeepSeek-V3](/wiki/deepseek_v3) and [DeepSeek-R1](/wiki/deepseek_r1), making V2 the architectural foundation of the models that later brought DeepSeek to global attention.

## What is DeepSeek-V2?

DeepSeek-V2 is an open-weight MoE language model with 236 billion total parameters, of which about 21 billion are activated for each token, and a context window of 128,000 tokens. [1][2] It was the first DeepSeek model to combine Multi-head Latent Attention (MLA), which compresses the key-value cache to cut inference memory, with the DeepSeekMoE feed-forward design that mixes many fine-grained routed experts with a small number of always-on shared experts. [1] Because only a fraction of its parameters fire on any given token, DeepSeek reported that V2 cost 42.5 percent less to train than the earlier dense DeepSeek 67B model while scoring higher on benchmarks. [1]

## How is DeepSeek-V2 built?

DeepSeek-V2 is built on the standard [Transformer](/wiki/transformer) decoder, but it replaces both the attention block and the feed-forward block with custom designs. The base model has 60 layers and a hidden dimension of 5,120. Attention uses 128 heads with a per-head dimension of 128. [1] Each token is routed through a sparse MoE feed-forward network rather than a dense one, so although the model holds 236 billion parameters in total, only about 21 billion participate in any single forward pass. This sparsity is the main reason for the 42.5 percent training-cost saving over the dense DeepSeek 67B model that DeepSeek reported. [1]
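The headline figures in this section reduce to a couple of ratios. The short sketch below only restates numbers quoted in the text; the variable and function names are illustrative and do not come from the DeepSeek codebase.

```python
# Figures quoted in the text above.
TOTAL_PARAMS_B = 236    # total parameters, in billions
ACTIVE_PARAMS_B = 21    # approximate activated parameters per token, in billions
NUM_HEADS = 128
HEAD_DIM = 128
HIDDEN_DIM = 5120


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that participate in one token's forward pass."""
    return active_b / total_b


fraction = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)

# Combined width of all attention heads, compared with the hidden dimension.
attention_width = NUM_HEADS * HEAD_DIM

print(f"active share per token: {fraction:.1%}")
print(f"attention width: {attention_width} ({attention_width / HIDDEN_DIM:.1f}x hidden dim)")
```

Under these figures, fewer than one in ten parameters is used for any given token, and the combined attention width is several times the hidden dimension.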

The paper frames the two innovations around a single goal: keep the strong performance of a large MoE model while making both training and inference cheap enough to serve at scale. As the report puts it, "MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation." [1] MLA targets inference memory, and DeepSeekMoE targets training and serving compute. [1]

## What is Multi-head Latent Attention?

The most-cited contribution of DeepSeek-V2 is Multi-head Latent Attention. In a conventional Transformer, serving long contexts is expensive because the model must cache a separate key and value vector for every attention head at every position, and that key-value (KV) cache grows linearly with sequence length. MLA instead projects the keys and values down into a single low-rank latent vector that is shared across heads, caches only that compressed latent, and reconstructs the per-head keys and values on the fly. [1] The official model card describes MLA as a mechanism that "utilizes low-rank key-value union compression to eliminate the bottleneck of inference-time key-value cache, thus supporting efficient inference." [2]
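The caching idea can be illustrated with a toy sketch in plain Python. It is not DeepSeek's implementation: the sizes are tiny and hypothetical, the weights are random, and it leaves out query compression and position embeddings. It only shows that one shared latent per position is stored, and that per-head keys and values are rebuilt from it on demand.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen for illustration only (far smaller than the real model).
HIDDEN, LATENT, HEADS, HEAD_DIM = 16, 4, 2, 8

rng = random.Random(0)
w_down = random_matrix(LATENT, HIDDEN, rng)  # shared down-projection
w_up_k = [random_matrix(HEAD_DIM, LATENT, rng) for _ in range(HEADS)]  # per-head key up-projection
w_up_v = [random_matrix(HEAD_DIM, LATENT, rng) for _ in range(HEADS)]  # per-head value up-projection

latent_cache = []  # one LATENT-sized vector per position, shared by all heads


def append_token(hidden_state):
    """Compress a token's hidden state and cache only the latent vector."""
    latent_cache.append(matvec(w_down, hidden_state))


def keys_and_values(head):
    """Rebuild one head's keys and values from the cached latents."""
    keys = [matvec(w_up_k[head], latent) for latent in latent_cache]
    values = [matvec(w_up_v[head], latent) for latent in latent_cache]
    return keys, values


for _ in range(3):
    append_token([rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)])

keys, values = keys_and_values(head=0)
cached_floats = len(latent_cache) * LATENT
full_floats = len(latent_cache) * 2 * HEADS * HEAD_DIM  # separate K and V per head
print(f"cached {cached_floats} floats instead of {full_floats}")
```

The saving comes from `LATENT` being much smaller than `2 * HEADS * HEAD_DIM`; the price is the extra up-projection work whenever keys and values are needed.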

DeepSeek used a KV compression dimension of 512 and a query compression dimension of 1,536. [1] Rotary position embedding (RoPE) complicates the scheme: because RoPE rotates each key by a position-dependent matrix, applying it to keys reconstructed from the latent would force the model to recompute the keys of all earlier tokens at every generation step, defeating the purpose of the compression. The authors solved this with a "decoupled" RoPE scheme that carries position information on a small set of extra query dimensions and a shared key, using a per-head decoupled dimension of 64. [1] The reported result is a 93.3 percent reduction in the KV cache relative to DeepSeek 67B and a maximum generation throughput 5.76 times that of DeepSeek 67B, which is what made it economical to offer a 128K context at low prices. [1][2]
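A back-of-the-envelope calculation shows the scale of the saving. The sketch below assumes that MLA caches the 512-dimensional latent plus the 64-dimensional decoupled shared key per token per layer, and compares that with a hypothetical standard multi-head attention layer of the same shape (128 heads of dimension 128). The 2-byte element size is also an assumption. Because the baseline here is not DeepSeek 67B, the result is not expected to reproduce the 93.3 percent figure.

```python
# Figures quoted in the text above.
KV_LATENT_DIM = 512       # KV compression dimension
ROPE_KEY_DIM = 64         # decoupled RoPE key dimension, shared across heads
NUM_HEADS = 128
HEAD_DIM = 128
NUM_LAYERS = 60
CONTEXT_TOKENS = 128_000

# Assumption for illustration: 2 bytes per cached element (16-bit storage).
BYTES_PER_ELEMENT = 2


def cache_bytes(elements_per_token_per_layer: int) -> int:
    """Total KV cache size for one full-length sequence."""
    return elements_per_token_per_layer * NUM_LAYERS * CONTEXT_TOKENS * BYTES_PER_ELEMENT


mla_elements = KV_LATENT_DIM + ROPE_KEY_DIM   # cached per token per layer with MLA
mha_elements = 2 * NUM_HEADS * HEAD_DIM       # separate K and V for every head
ratio = mla_elements / mha_elements

print(f"MLA: {mla_elements} elements, {cache_bytes(mla_elements) / 2**30:.2f} GiB")
print(f"MHA: {mha_elements} elements, {cache_bytes(mha_elements) / 2**30:.2f} GiB")
print(f"MLA cache is {ratio:.2%} of the same-shape MHA cache")
```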

## What is DeepSeekMoE?

The feed-forward layers use DeepSeekMoE, a design the lab had introduced earlier and refined for V2. The model card calls it "a high-performance MoE architecture that enables training stronger models at lower costs." [2] Instead of a handful of large experts, DeepSeekMoE splits each expert into smaller "fine-grained" pieces so that the router can combine knowledge more flexibly, and it sets aside a few "shared" experts that every token always uses, on the theory that shared experts can absorb common knowledge and leave the routed experts free to specialize. [1]

In DeepSeek-V2 each MoE layer has 160 routed experts plus 2 shared experts, and each token activates 6 of the routed experts in addition to the shared ones. [1] To keep training stable and avoid routing collapse, the model uses auxiliary load-balancing losses at the expert, device, and communication levels, along with a device-limited routing scheme that caps how many devices a token's experts can be spread across. [1]
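The routing step described above can be sketched in a few lines. This is a minimal illustration, not the model's router: the logits are random stand-ins for learned token-to-expert affinities, the use of unnormalized softmax scores as gate weights is an assumption, and the load-balancing losses and device-limited routing are omitted.

```python
import math
import random

# Expert counts quoted in the text above.
NUM_ROUTED, NUM_SHARED, TOP_K = 160, 2, 6


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, gate_weight) pairs for the top_k routed experts."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]


rng = random.Random(0)
# Hypothetical router output for one token: one affinity score per routed expert.
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

selected = route(logits)
experts_used = NUM_SHARED + len(selected)  # shared experts are always active

print(f"routed experts chosen: {[index for index, _ in selected]}")
print(f"experts applied to this token: {experts_used} of {NUM_ROUTED + NUM_SHARED}")
```

Each token therefore touches 8 of the 162 experts in a layer, which is where the gap between total and activated parameters comes from.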

## How was DeepSeek-V2 trained?

DeepSeek-V2 was pretrained on a corpus of 8.1 trillion tokens, a multilingual mixture weighted toward English and Chinese. [1][2] Pretraining used a 4K-token sequence length, after which the context was extended to 128K. The extension used YaRN applied to the decoupled shared key, with a scaling factor of 40, and about 1,000 additional training steps at a 32K sequence length. Although the extension was trained only on 32K sequences, the model was then validated with Needle-in-a-Haystack retrieval tests at context lengths up to 128K. [1]

The released chat models were produced in two post-training stages. First came supervised fine-tuning (SFT) on roughly 1.5 million conversational examples spanning helpfulness and safety. The lab then ran reinforcement learning using Group Relative Policy Optimization (GRPO), the same RL algorithm DeepSeek used in its math work and later in R1, to align the model with human preferences. [1] DeepSeek released both an SFT-only chat model and the RL-tuned DeepSeek-V2 Chat. [1]
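GRPO's defining feature is that it scores each sampled response relative to the other responses drawn for the same prompt, rather than against a separately learned value model. The sketch below shows only that group-relative normalization, as it is commonly described. The reward values are made up, and the use of the population standard deviation and the small `eps` term are assumptions; the full algorithm also includes a clipped policy-gradient objective and a KL penalty that are not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four responses to one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward {reward:.1f} -> advantage {advantage:+.2f}")
```

Responses that score above the group average receive a positive advantage and are reinforced; those below it are discouraged.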

## How well does DeepSeek-V2 perform on benchmarks?

On standard evaluations the DeepSeek-V2 base model was competitive with the strongest open-weight models available in mid-2024, including LLaMA 3 70B and Mixtral 8x22B, despite activating only 21 billion parameters per token. [1] As the report states, "even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models." [1] Selected base-model scores reported in the paper:

| Benchmark | DeepSeek-V2 Base |
| --- | --- |
| MMLU | 78.5 |
| BBH | 78.9 |
| C-Eval | 81.7 |
| CMMLU | 84.0 |
| GSM8K | 79.2 |
| MATH | 43.6 |
| HumanEval | 48.8 |

The chat model was tuned for open-ended dialogue and alignment. DeepSeek-V2 Chat (RL) reached a length-controlled win rate of 38.9 on AlpacaEval 2.0 and an overall score of 8.97 on MT-Bench. [1] On AlignBench, a Chinese alignment benchmark, it scored 7.91 overall, which the paper reports as the best among open-source models at the time and close to GPT-4-class systems. [1]

## What variants of DeepSeek-V2 exist?

DeepSeek shipped several models around the V2 base, and the lineage is easy to confuse because the chat and code branches were later merged.

| Model | Total params | Activated params | Context | Notes |
| --- | --- | --- | --- | --- |
| DeepSeek-V2 | 236B | 21B | 128K | Base and Chat (SFT and RL) |
| DeepSeek-V2-Lite | 16B | 2.4B | 32K | Smaller model for research and local use |
| DeepSeek-Coder-V2 | 236B | 21B | 128K | Code-specialized, continued pretraining |
| DeepSeek-Coder-V2-Lite | 16B | 2.4B | 128K | Smaller code variant |
| DeepSeek-V2.5 | 236B | 21B | 128K | Merge of the V2 chat and Coder-V2 lines |

DeepSeek-V2-Lite is a much smaller MoE with 16 billion total and 2.4 billion activated parameters, intended to be runnable on a single GPU; it uses the same MLA and DeepSeekMoE ideas but with 27 layers and a 32K context. [2]

[DeepSeek-Coder-V2](/wiki/deepseek_coder), released in June 2024, was not trained from scratch. It continued pretraining from an intermediate DeepSeek-V2 checkpoint on an additional 6 trillion tokens of code-heavy data, expanding programming-language coverage from 86 to 338 languages and extending the code context from 16K to 128K. DeepSeek reported that Coder-V2 matched closed models such as GPT-4 Turbo on code-specific tasks. [5] The math-focused [DeepSeek-Math](/wiki/deepseek_math) effort also contributed techniques to this family, including GRPO, which the V2 chat models use as well.

In September 2024 the lab consolidated the two branches into DeepSeek-V2.5, which merged DeepSeek-V2-0628 (the upgraded general chat model) and DeepSeek-Coder-V2-0724 into a single model that handled both general dialogue and coding, with improvements in writing, instruction following, and safety. DeepSeek-V2.5 was served through both of the existing API endpoints, deepseek-chat and deepseek-coder. [6]

## Why was DeepSeek-V2 so cheap?

DeepSeek priced the V2 API far below the prevailing rates for Chinese large models, at roughly 1 yuan (about 0.14 US dollars) per million input tokens and 2 yuan per million output tokens. [3][4] At the time, comparable access to GPT-4-class models cost on the order of tens of dollars per million tokens, so the gap was about two orders of magnitude. The low price was possible largely because MLA and the sparse MoE design cut the memory and compute needed to serve each request: MLA reduced the KV cache by 93.3 percent and lifted maximum generation throughput to 5.76 times that of DeepSeek 67B. [1]
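To make the prices concrete, the sketch below computes the cost of a single request from the per-million-token rates quoted above. The request size is a made-up example, the exchange rate is the approximate one given in the text, and the function name is illustrative.

```python
# Approximate launch prices quoted in the text above.
INPUT_YUAN_PER_MILLION = 1.0
OUTPUT_YUAN_PER_MILLION = 2.0
USD_PER_YUAN = 0.14  # approximate rate implied by the text


def request_cost_yuan(input_tokens: int, output_tokens: int) -> float:
    """Cost in yuan of one request at the quoted per-million-token rates."""
    return (
        input_tokens * INPUT_YUAN_PER_MILLION
        + output_tokens * OUTPUT_YUAN_PER_MILLION
    ) / 1_000_000


# Hypothetical request: a 100,000-token prompt with a 2,000-token reply.
cost_yuan = request_cost_yuan(100_000, 2_000)
cost_usd = cost_yuan * USD_PER_YUAN

print(f"cost: {cost_yuan:.3f} yuan (about {cost_usd:.4f} USD)")
```

At these rates, even a request that fills most of the 128K context window costs a fraction of one yuan.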

The launch is widely described as the trigger for a 2024 price war among Chinese AI vendors. In the weeks after V2 appeared, large competitors including Alibaba and Baidu cut their model prices steeply, with reported reductions exceeding 95 percent on some models, and other providers followed. [3][4] DeepSeek itself later said it had not expected the market to be so price-sensitive, and the episode helped establish the lab's reputation for aggressive, efficiency-driven pricing that continued through V3 and R1.

## Is DeepSeek-V2 open source?

DeepSeek released the V2 code under the MIT License and the model weights under its own "DeepSeek License," which permits commercial use subject to a schedule of use-based restrictions. [2] The weights for the base, chat, and Lite variants were published on Hugging Face, and the inference code and documentation were released on GitHub. [2]

## References

1. DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model," arXiv:2405.04434. https://arxiv.org/abs/2405.04434
2. "deepseek-ai/DeepSeek-V2," GitHub repository and Hugging Face model card. https://github.com/deepseek-ai/DeepSeek-V2
3. "DeepSeek," Wikipedia. https://en.wikipedia.org/wiki/DeepSeek
4. "DeepSeek History: From Hedge Fund to V4," DeepSeek AI Guide. https://deepseekai.guide/guides/deepseek-history/
5. DeepSeek-AI, "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence," arXiv:2406.11931. https://arxiv.org/abs/2406.11931
6. "DeepSeek-V2.5: A New Open-Source Model Combining General and Coding Capabilities," DeepSeek API Docs. https://api-docs.deepseek.com/news/news0905