DeepSeek V3.2
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,825 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,825 words
Add missing citations, update stale details, or suggest a clearer explanation.
DeepSeek-V3.2 is an open-weight Mixture of Experts large language model family developed by DeepSeek, the Hangzhou research lab founded by quantitative trading firm High-Flyer. The family is built around an architectural change called DeepSeek Sparse Attention (DSA), a fine-grained token-level sparse attention mechanism that reduces attention complexity from quadratic to near-linear in sequence length without measurable accuracy regression. DeepSeek released the experimental precursor, DeepSeek-V3.2-Exp, on September 29, 2025, alongside a more than 50 percent reduction in API pricing.[^1] The official V3.2 release followed on December 1, 2025, together with a high-compute reasoning variant, V3.2-Speciale, which attained gold-medal-level performance on the 2025 International Mathematical Olympiad, International Olympiad in Informatics, ICPC World Finals, and Chinese Mathematical Olympiad.[^2][^3]
V3.2 sits between DeepSeek V3.1, the hybrid reasoning model released in August 2025, and DeepSeek V4, the ground-up redesigned family released in preview on April 24, 2026.[^17] Unlike V4, which abandoned Multi-head Latent Attention entirely in favor of Compressed Sparse Attention (CSA), V3.2 is positioned as a continuation of the V3 line that swaps in DSA while preserving the underlying MoE architecture and parameter count.
DeepSeek, led by founder Liang Wenfeng, spent 2024 and 2025 iterating rapidly on its V3 family. The base model, DeepSeek V3, launched in December 2024 with 671 billion total parameters and 37 billion active per token. Successive updates, DeepSeek-V3-0324 in March 2025 and the reasoning-trained DeepSeek-R1 in January 2025, pushed performance on math and code benchmarks while keeping training costs an order of magnitude below those of comparable Western frontier labs.
DeepSeek V3.1, released in August 2025, merged thinking and non-thinking inference into a single 671-billion-parameter checkpoint and extended the context window from 64K to 128K tokens. A maintenance update, DeepSeek-V3.1-Terminus, followed on September 22, 2025, with improvements to agentic tool use and resolution of language-consistency issues. V3.1-Terminus was the direct precursor to V3.2: every V3.2 release uses Terminus as its checkpoint base and adds DSA through a continued-training stage.[^1][^4]
The motivation for DSA was straightforward. Dense attention scales quadratically with sequence length, and the per-query KV-cache lookup dominates both compute and memory at long contexts. Existing alternatives such as sliding-window attention and block-sparse attention either trade off retrieval fidelity or require coarse-grained selection that is mismatched to the way Transformer heads actually attend to text. DeepSeek's research goal with V3.2-Exp was to validate whether a learned, token-level sparse attention pattern could be inserted into the existing V3-style architecture without retraining from scratch and without measurable performance loss.[^4][^10]
The V3.2 family was rolled out in two phases, separated by approximately two months.
DeepSeek-V3.2-Exp launched on September 29, 2025 as an explicitly experimental release.[^1] The announcement framed the model as a validation of the DSA approach rather than a successor product, and DeepSeek kept V3.1-Terminus available through its API until October 15, 2025, 15:59 UTC, to allow A-B comparison.[^1] The accompanying API pricing cut, more than 50 percent on input and output tokens, was tied directly to the inference-side efficiency gains of DSA at long contexts.[^1][^8]
An important correction was issued on November 17, 2025. DeepSeek announced that the RoPE (rotary positional embedding) implementation in the indexer module had been using an interleaved layout inherited from MLA, whereas the indexer requires a non-interleaved layout for correct behavior.[^4] Affected downloads were updated and external implementations in vLLM, SGLang, and the DeepSeek inference repository were patched.
On December 1, 2025, DeepSeek announced the official V3.2 release, accompanied by a technical report on arXiv (2512.02556) authored by approximately 264 contributors.[^3] Three checkpoints were published on Hugging Face:[^5][^6]
| Checkpoint | Description | Intended use |
|---|---|---|
| DeepSeek-V3.2-Base | Pretrained base, MIT license | Continued training, research |
| DeepSeek-V3.2 | Production chat and reasoning model | Default API and chat endpoint |
| DeepSeek-V3.2-Speciale | High-compute reasoning variant | Olympiad-style reasoning, research |
V3.2-Speciale was released alongside the main model and used substantially more post-training compute. DeepSeek described it as a research artifact rather than a default product, and it was offered via a dedicated temporary API endpoint (api.deepseek.com/v3.2_speciale_expires_on_20251215) that was retired on December 15, 2025, 15:59 UTC.[^2] Tool calling is disabled in Speciale; the variant is positioned exclusively as a pure reasoning engine.[^2][^6]
Both V3.2 and V3.2-Speciale were made available through Microsoft Foundry in public preview on December 15, 2025, with Azure providing managed deployment alongside Foundry's evaluation, routing, and observability stack.[^16] Hugging Face downloads of the DeepSeek-V3.2 weights exceeded four million in the month following release.[^5]
The DeepSeek arXiv paper consolidates the V3.2-Exp and V3.2 releases as a single line of work; the official December release supersedes V3.2-Exp on the API but the underlying architecture is largely the same, with refinements in the post-training pipeline.[^3]
DeepSeek-V3.2 retains the V3.1-Terminus base architecture: a 671-billion-parameter MoE transformer with 37 billion active parameters per token, Multi-head Latent Attention (MLA) for the dense attention path, and the auxiliary-loss-free expert routing introduced with V3. The published model cards on Hugging Face list 685 billion total parameters for both V3.2-Exp and V3.2, with the additional weights coming from the new DSA components rather than from changes to the base MoE blocks.[^4][^5] The context window is 128K tokens, identical to V3.1.
The only architectural change relative to V3.1-Terminus is the insertion of DSA in place of dense MLA attention. Everything else, including the MoE layer count, expert count, expert sizes, and routing topology, is unchanged.[^4][^9]
DSA replaces the dense attention computation with a two-stage selection-then-attention pipeline. For each query token, only a small subset of preceding keys participates in the main MLA computation. The two stages are:
Lightning Indexer. A lightweight scoring module computes relevance between the current query and every preceding token. The indexer uses 64 attention heads (versus 128 for the main MLA path) at FP8 precision, with parameters distinct from the main MLA heads.[^9][^10] Its output is a scalar relevance score per (query, key) pair, computed as a weighted sum of ReLU activations over query-key dot products. ReLU was chosen over softmax for throughput, since negative scores collapse to zero and the activation is cheaper to compute in low precision.[^3][^10] Because the indexer is small and runs in FP8, it can sweep the full sequence at a small fraction of the cost of full attention.
Token Selector. Given the indexer scores, the selector retrieves the top-k previous tokens for each query, where k is a fixed hyperparameter. For the released V3.2 checkpoints, k is set to 2,048; with a 128K-token context, this means each query attends to roughly 1.6 percent of the available keys.[^3][^10] The selected key-value pairs are then passed to the main MLA path, which runs at full precision.
The net result is that overall attention cost scales as O(Lk) rather than O(L²), where L is the sequence length and k is the fixed selection budget. At 128K tokens with k=2,048, this is approximately a 64-fold reduction in attention FLOPs relative to dense attention over the same context. An independent Tensor Economics analysis measured that at 131K tokens DSA loads roughly five times less data per decode step than dense MLA, with indexer overhead of about 132 bytes per token (FP8) versus 656 bytes per token for the main MLA cache.[^10]
The vLLM implementation stores the MLA cache as 512 bytes of FP8 NoPE (no-positional-embedding) keys, 16 bytes of FP32 scale factors, and 128 bytes of BF16 RoPE embeddings per token, with the indexer K cache held in separate FP8 blocks aligned to FlashMLA's block size of 64.[^11][^21] DSA is instantiated within MLA's MQA (Multi-Query Attention) mode so that the underlying kernels can be implemented efficiently.
DeepSeek did not retrain V3.2 from scratch. Instead, the V3.1-Terminus checkpoint was extended with DSA through two stages of continued training:[^3][^9]
| Stage | Steps | Sequences per step | Tokens per sequence | Total tokens | Description |
|---|---|---|---|---|---|
| Dense warm-up | 1,000 | 16 | 128,000 | 2.1 billion | Indexer is trained to mimic the attention distribution of V3.1-Terminus's dense MLA; the rest of the model is frozen. |
| Sparse training | 15,000 | 480 | 128,000 | 943.7 billion | All parameters are unfrozen and the model is trained end-to-end with sparse attention active. |
The warm-up stage is the engineering core of DSA's success: the indexer learns its scoring function by imitating the dense attention pattern of a frozen teacher (V3.1-Terminus itself), so that when sparse attention is switched on, the selector approximates the same key set that dense attention would have weighted heavily. This is similar in spirit to distillation but operates on attention patterns rather than output distributions.[^3][^9]
The December 2025 release introduced a substantially expanded post-training pipeline relative to V3.1. The DeepSeek technical report describes three components.
V3.2's reinforcement learning stage scales post-training compute well beyond what V3.1 used and applies several stabilizing techniques designed for off-policy and large-batch GRPO training:[^3][^9]
Other refinements relative to V3.1's RL pipeline include the removal of explicit format rewards (replaced by length penalties for agentic tasks), the adoption of generative reward models for non-verifiable domains, and the integration of self-verification techniques inherited from DeepSeekMath V2 for mathematical reasoning.[^9]
DeepSeek built a data pipeline that synthesizes training data for agent settings programmatically rather than collecting it from human trajectories. The pipeline generates more than 1,800 distinct sandboxed environments and 85,000 task prompts.[^2][^3] The breakdown is:[^3]
| Agent type | Task count |
|---|---|
| Search agent | 50,275 |
| Code agent | 24,667 |
| Code interpreter | 5,908 |
| General agent | 4,417 |
Each environment exposes a small set of programmable tools, and tasks are generated by composing tool primitives into goals of varying difficulty.
The December release introduced what DeepSeek calls "thinking with tools": the model is the first publicly released system to natively interleave chain-of-thought reasoning with tool calls inside a single response, in both thinking and non-thinking modes.[^2][^13] A strategy called thinking context management retains reasoning traces across consecutive tool calls within a single turn but clears them on a new user message; tool calls and tool results remain in context even when intermediate reasoning text is trimmed to stay within a budget. This avoids the standard agentic-loop pattern in which the model must reconstruct its reasoning state on every new tool result.
The Speciale variant uses the same base checkpoint as V3.2 but with additional RL compute concentrated on long-form mathematical and competitive-programming reasoning. Speciale post-training drew on synthetic olympiad-style problems and verifier-graded trajectories from a code-execution sandbox.[^3][^6] DeepSeek did not publish the exact compute budget for Speciale, but the technical report describes it as substantially exceeding the base V3.2 RL budget. Tool-calling capability is intentionally removed from Speciale so that all post-training compute serves pure reasoning.[^6][^16]
The primary claim of V3.2-Exp at its September launch was performance parity with V3.1-Terminus despite the architectural simplification.[^1][^4] The December V3.2 release pushed several benchmarks higher, and V3.2-Speciale produced an additional tier of competition-grade results.
From the DeepSeek-V3.2-Exp model card:[^4]
| Benchmark | V3.1-Terminus | V3.2-Exp |
|---|---|---|
| MMLU-Pro | 85.0 | 85.0 |
| GPQA-Diamond | 80.7 | 79.9 |
| LiveCodeBench | 74.9 | 74.1 |
| AIME 2025 | 88.4 | 89.3 |
| Codeforces | 2046 | 2121 |
| BrowseComp | 38.5 | 40.1 |
| SimpleQA | 96.8 | 97.1 |
| SWE-bench Multilingual | 57.8 | 57.9 |
| HLE | 21.7 | 19.8 |
The pattern is essentially flat on knowledge benchmarks, slightly positive on coding and AIME, and slightly positive on agentic browse and QA tasks. The roughly two-point drop on Humanity's Last Exam was flagged by external reviewers as the only meaningful regression in the V3.2-Exp suite, with speculation that the fixed top-k=2,048 selector occasionally misses a long-range global connection needed for the hardest items.[^14][^18]
The December V3.2 release improved on V3.2-Exp through expanded RL. Reported headline scores include:[^3][^5]
| Benchmark | V3.2 |
|---|---|
| MMLU-Pro | 85.0 |
| GPQA-Diamond | 82.4 |
| AIME 2025 | 93.1 |
| HMMT Feb 2025 | 92.5 |
| LiveCodeBench | 83.3 |
| Codeforces (rating) | 2,386 |
| SWE-bench Verified | 73.1 |
| SWE-bench Pro | 15.6 |
| MathArena AIME 2026 | 94.17 |
| BrowseComp (with context mgmt) | 67.6 |
| MCP-Mark | 38.0 |
| HLE | 25 |
DeepSeek's technical report positions V3.2 as comparable to GPT-5-high on reasoning benchmarks while remaining slightly behind Gemini 3 Pro on the same suite.[^3]
The Speciale variant reports substantially higher reasoning scores than the base V3.2 model:[^3][^6]
| Benchmark | V3.2-Speciale |
|---|---|
| AIME 2025 | 96.0 |
| HMMT Feb 2025 | 99.2 |
| Codeforces (rating) | 2,701 |
| HLE | 30 |
DeepSeek's paper describes Speciale as surpassing GPT-5 on these benchmarks while remaining on par with Gemini-3.0-Pro.[^3]
Speciale was evaluated on four competitive reasoning events held during 2025. All scores were reported under contest conditions with no internet access and no external tools beyond a code-execution sandbox.[^2][^3][^6]
| Event | V3.2-Speciale result | Medal |
|---|---|---|
| IMO 2025 | 35 / 42 points | Gold |
| CMO 2025 | Gold-level score | Gold |
| IOI 2025 | 492 / 600 points, 10th place | Gold |
| ICPC World Finals 2025 | 2nd place | Gold |
The IMO score of 35 out of 42 placed Speciale ahead of every other publicly known model evaluation in the 2025 cycle and at parity with strong human gold medalists. The IOI ranking of 10th overall and ICPC second-place finish were independently described as the first machine results to clear gold-medal thresholds across all four flagship olympiads in a single calendar year, and the first such sweep produced by a general-purpose chat model rather than a specialized prover.[^13]
The V3.2 release was tied to one of the most aggressive single-step price reductions in the commercial LLM market in 2025. On September 29, 2025, DeepSeek announced a unified pricing schedule applicable to V3.2-Exp:[^1][^8]
| Token type | V3.1-Terminus | V3.2-Exp |
|---|---|---|
| Input (cache hit) | $0.07 / 1M | $0.028 / 1M |
| Input (cache miss) | $0.56 / 1M | $0.28 / 1M |
| Output | $1.68 / 1M | $0.42 / 1M |
The headline reduction was roughly 50 percent on input and 75 percent on output tokens. DeepSeek attributed the change directly to the reduced per-token compute and KV-cache footprint enabled by DSA, particularly at long contexts.[^1] Independent analyses of the prices, by VentureBeat and the cost-tracking site CostGoat, characterized V3.2 as the cheapest frontier-class API offering available at the time of launch.[^8][^15]
A further reduction took effect on April 26, 2026, 12:15 UTC: the input cache-hit price was cut to one-tenth of its launch price, bringing it below $0.003 per million tokens.[^15] This step coincided with the V4 launch and was applied retroactively to all live DeepSeek API models. After the May 2026 stacked adjustments, the V4-Flash cache-hit price fell to roughly $0.0029 per million tokens, with V3.2 retained on the API as a cheaper alternative for non-V4 workloads.[^15][^17]
The September pricing cut had immediate effects on the commercial market. OpenAI announced a price reduction on its own GPT-4-class API a few weeks later, and Anthropic published lower batch-API pricing in November 2025 that closed part of the gap; both adjustments were widely attributed in trade press to competitive pressure from DeepSeek's V3.2 schedule.[^8]
The V3.2 release was supported on day zero by major open-source inference stacks. vLLM, SGLang, and Hugging Face transformers all shipped patches by early October 2025 that handled the indexer module, the top-k selection step, and the patched RoPE layout that became required after the November correction.[^4][^11][^21] Red Hat AI published a detailed integration guide for vLLM, including measurements indicating that DSA delivers roughly 3.5x throughput improvement at 128K context against dense MLA on the same NVIDIA Hopper and Blackwell hardware.[^11]
For on-premise deployment, DeepSeek released two open-source kernel libraries alongside the model:[^1][^7]
A notable secondary release was the broad use of TileLang, DeepSeek's open-source ML compiler that lowers Python kernel descriptions to GPU-specific machine code. DeepSeek published TileLang versions of the DSA kernels alongside the hand-written CUDA variants, demonstrating that roughly 80 lines of Python could reach 95 percent of FlashMLA's CUDA performance.[^20] This made DSA unusually portable across non-NVIDIA accelerators.
Several Chinese accelerator vendors completed adaptation on the same day as the V3.2-Exp release:[^20]
This was the first DeepSeek release in which all three major Chinese accelerator stacks reached production parity with NVIDIA on the launch day, and it was widely interpreted as evidence that the TileLang toolchain was central to DeepSeek's hardware-portability strategy.[^20]
The model weights are distributed under the MIT license. Quantizations to FP8, INT8, and 4-bit were published by the community within the first two weeks of the V3.2-Exp release, with several variants reducing aggregate VRAM requirements to under 700 GB for production inference.[^4][^5]
V3.2 was received as a focused, architecturally narrow but commercially significant release. Sebastian Raschka's December 2025 technical retrospective on the V3 line characterized DSA as the most consequential efficiency improvement DeepSeek had shipped since the introduction of MLA in V3 itself: a pure inference-time optimization that propagates linearly into operational cost.[^9] The Tensor Economics blog described DSA as the first published deployment of fine-grained token-level sparse attention at frontier scale, and noted that the lightning-indexer pattern was likely to be adopted broadly.[^10]
The V3.2-Speciale olympiad results drew comparisons to the AlphaProof and AlphaGeometry results that Google DeepMind had published in 2024 and 2025, with the distinction that Speciale is a general-purpose chat model rather than a specialized prover. The fact that an MIT-licensed open-weight model produced gold-medal-level olympiad results in a single calendar year was widely characterized as a reset of expectations about the gap between open-weight and proprietary frontier systems.[^13][^19]
The September pricing schedule was the first time a frontier-class API offering had been priced below $0.50 per million output tokens. Industry commentators noted that while Claude Sonnet 4.5 and GPT-5 retained pricing premiums on the basis of brand and ecosystem, V3.2 had become the default reference point for cost-sensitive long-context applications such as coding assistants and document analysis pipelines.[^10][^15]
V3.2 is structurally a member of the V3 family. It inherits the 671-billion-parameter MoE backbone, the MLA dense attention path (now wrapped in DSA selection), the auxiliary-loss-free routing, and the 128K context window. The continuity is intentional: DeepSeek's stated goal with V3.2 was to validate DSA as a drop-in efficiency improvement that could be applied to an existing checkpoint through continued training, without committing to a new pretraining run.[^3]
DeepSeek V4, released in preview on April 24, 2026, is by contrast a ground-up architectural redesign. V4 replaces MLA entirely with a hybrid combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), extends the context window to one million tokens, and ships in two parameter scales: a 1.6-trillion-parameter Pro and a 284-billion-parameter Flash.[^17] DSA, in V4, becomes one component of a larger compression strategy rather than the primary mechanism; the V4 paper explicitly cites the V3.2 lightning-indexer design as a building block for CSA. At 1M-token context, V4-Pro reportedly achieves 27 percent of the single-token FLOPs and 10 percent of the KV cache size of V3.2, with V4-Flash reaching 10 percent of the FLOPs and 7 percent of the KV cache.[^17]
In terms of capabilities, V3.2 occupies a band slightly above V3.1-Terminus on coding and tool use and substantially above on math when running in Speciale mode. V4-Pro then leapfrogs V3.2-Speciale on most benchmarks while running at a fraction of the inference cost per long-context token. V3.2 remained the default DeepSeek API model from December 2025 through April 2026, when V4 entered preview as a separate endpoint, and continued to be offered alongside V4 thereafter as a cheaper option for workloads that did not need V4's million-token context.[^17]