DeepSeek V3.2

Updated Jun 24, 2026

DeepSeek-V3.2 is an open-weight Mixture of Experts large language model family developed by DeepSeek that introduces DeepSeek Sparse Attention (DSA), a fine-grained token-level attention mechanism that cuts attention cost from quadratic to near-linear in sequence length while preserving accuracy. DeepSeek released the experimental precursor, DeepSeek-V3.2-Exp, on September 29, 2025, alongside a more than 50 percent reduction in API pricing.^[1] DeepSeek's own description is that DSA "achieves fine-grained sparse attention for the first time, delivering substantial improvements in long-context training and inference efficiency," and it framed V3.2-Exp as "an intermediate step toward our next-generation architecture."^[4] The model is built around the 671-billion-parameter V3 backbone (37 billion active per token) with a 128K-token context window and is distributed under the MIT license on Hugging Face.^[4]^[5]

The official V3.2 release followed on December 1, 2025, together with a high-compute reasoning variant, V3.2-Speciale, which attained gold-medal-level performance on the 2025 International Mathematical Olympiad, International Olympiad in Informatics, ICPC World Finals, and Chinese Mathematical Olympiad.^[2]^[3] V3.2 sits between DeepSeek V3.1, the hybrid reasoning model released in August 2025, and DeepSeek V4, the ground-up redesigned family released in preview on April 24, 2026.^[17] Unlike V4, which abandoned Multi-head Latent Attention entirely in favor of a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), V3.2 is positioned as a continuation of the V3 line that swaps in DSA while preserving the underlying MoE architecture and parameter count.

Background

DeepSeek, led by founder Liang Wenfeng, spent 2024 and 2025 iterating rapidly on its V3 family. The base model, DeepSeek V3, launched in December 2024 with 671 billion total parameters and 37 billion active per token. Subsequent releases, the reasoning-trained DeepSeek-R1 in January 2025 and the DeepSeek-V3-0324 update in March 2025, pushed performance on math and code benchmarks while keeping training costs an order of magnitude below those of comparable Western frontier labs.

DeepSeek V3.1, released in August 2025, merged thinking and non-thinking inference into a single 671-billion-parameter checkpoint and extended the context window from 64K to 128K tokens. A maintenance update, DeepSeek-V3.1-Terminus, followed on September 22, 2025, with improvements to agentic tool use and resolution of language-consistency issues. V3.1-Terminus was the direct precursor to V3.2: every V3.2 release uses Terminus as its checkpoint base and adds DSA through a continued-training stage.^[1]^[4]

The motivation for DSA was straightforward. Dense attention scales quadratically with sequence length, and the per-query KV-cache lookup dominates both compute and memory at long contexts. Existing alternatives such as sliding-window attention and block-sparse attention either trade off retrieval fidelity or require coarse-grained selection that is mismatched to the way Transformer heads actually attend to text. DeepSeek's research goal with V3.2-Exp was to validate whether a learned, token-level sparse attention pattern could be inserted into the existing V3-style architecture without retraining from scratch and without measurable performance loss.^[4]^[10]

When was DeepSeek V3.2 released?

The V3.2 family was rolled out in two phases, separated by approximately two months.

V3.2-Exp (September 2025)

DeepSeek-V3.2-Exp launched on September 29, 2025 as an explicitly experimental release.^[1] The announcement framed the model as a validation of the DSA approach rather than a successor product, and DeepSeek kept V3.1-Terminus available through its API until October 15, 2025, 15:59 UTC, to allow A/B comparison.^[1] The accompanying API price cut, more than 50 percent on both input and output tokens, was tied directly to the inference-side efficiency gains of DSA at long contexts.^[1]^[8]

An important correction was issued on November 17, 2025. DeepSeek announced that the RoPE (rotary positional embedding) implementation in the indexer module had been using an interleaved layout inherited from MLA, whereas the indexer requires a non-interleaved layout for correct behavior.^[4] Affected downloads were updated and external implementations in vLLM, SGLang, and the DeepSeek inference repository were patched.
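The difference between the two layouts can be illustrated with a minimal NumPy sketch. The function names, the base of 10000, and the 8-dimensional toy vector are illustrative assumptions, not DeepSeek's indexer code; the sketch only shows that the two conventions rotate different pairs of feature dimensions, so weights trained under one layout give different results under the other.

```python
import numpy as np

def rope_angles(pos: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """One rotation angle per feature pair (dim // 2 angles in total)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq

def rope_interleaved(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate adjacent pairs: (x[0], x[1]), (x[2], x[3]), ..."""
    theta = rope_angles(pos, x.shape[-1])
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def rope_non_interleaved(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate split-half pairs: (x[i], x[i + dim // 2])."""
    half = x.shape[-1] // 2
    theta = rope_angles(pos, x.shape[-1])
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

# Toy vector: the same input at the same position gives different outputs.
x = np.arange(8, dtype=np.float64)
print(np.allclose(rope_interleaved(x, 3), rope_non_interleaved(x, 3)))
```

The two conventions are related by a fixed permutation of the feature dimensions (even indices first, then odd), which is why applying the wrong one silently degrades results instead of raising an error.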

V3.2 official and V3.2-Speciale (December 2025)

On December 1, 2025, DeepSeek announced the official V3.2 release, accompanied by a technical report on arXiv (2512.02556) authored by approximately 264 contributors.^[3] Three checkpoints were published on Hugging Face:^[5]^[6]

Checkpoint	Description	Intended use
DeepSeek-V3.2-Base	Pretrained base, MIT license	Continued training, research
DeepSeek-V3.2	Production chat and reasoning model	Default API and chat endpoint
DeepSeek-V3.2-Speciale	High-compute reasoning variant	Olympiad-style reasoning, research

V3.2-Speciale was released alongside the main model and used substantially more post-training compute. DeepSeek described it as a research artifact rather than a default product, and it was offered via a dedicated temporary API endpoint (api.deepseek.com/v3.2_speciale_expires_on_20251215) that was retired on December 15, 2025, 15:59 UTC.^[2] Tool calling is disabled in Speciale; the variant is positioned exclusively as a pure reasoning engine.^[2]^[6]

Both V3.2 and V3.2-Speciale were made available through Microsoft Foundry in public preview on December 15, 2025, with Azure providing managed deployment alongside Foundry's evaluation, routing, and observability stack.^[16] Hugging Face downloads of the DeepSeek-V3.2 weights exceeded four million in the month following release.^[5]

The DeepSeek arXiv paper consolidates the V3.2-Exp and V3.2 releases as a single line of work; the official December release supersedes V3.2-Exp on the API but the underlying architecture is largely the same, with refinements in the post-training pipeline.^[3]

Architecture

DeepSeek-V3.2 retains the V3.1-Terminus base architecture: a 671-billion-parameter MoE transformer with 37 billion active parameters per token, Multi-head Latent Attention (MLA) for the dense attention path, and the auxiliary-loss-free expert routing introduced with V3. The published model cards on Hugging Face list 685 billion total parameters for both V3.2-Exp and V3.2, with the additional weights coming from the new DSA components rather than from changes to the base MoE blocks.^[4]^[5] The context window is 128K tokens, identical to V3.1.

The only architectural change relative to V3.1-Terminus is the insertion of DSA in place of dense MLA attention. Everything else, including the MoE layer count, expert count, expert sizes, and routing topology, is unchanged.^[4]^[9] To isolate the effect of the new attention mechanism, DeepSeek states that it "deliberately aligned the training configurations of DeepSeek-V3.2-Exp with V3.1-Terminus," so that any benchmark difference reflects DSA rather than a change in data or training recipe.^[4]

What is DeepSeek Sparse Attention?

DSA replaces the dense attention computation with a two-stage selection-then-attention pipeline. For each query token, only a small subset of preceding keys participates in the main MLA computation. The two stages are:

Lightning Indexer. A lightweight scoring module computes relevance between the current query and every preceding token. The indexer uses 64 attention heads (versus 128 for the main MLA path) at FP8 precision, with parameters distinct from the main MLA heads.^[9]^[10] Its output is a scalar relevance score per (query, key) pair, computed as a weighted sum of ReLU activations over query-key dot products. ReLU was chosen over softmax for throughput, since negative scores collapse to zero and the activation is cheaper to compute in low precision.^[3]^[10] Because the indexer is small and runs in FP8, it can sweep the full sequence at a small fraction of the cost of full attention.

Token Selector. Given the indexer scores, the selector retrieves the top-k previous tokens for each query, where k is a fixed hyperparameter. For the released V3.2 checkpoints, k is set to 2,048; with a 128K-token context, this means each query attends to roughly 1.6 percent of the available keys.^[3]^[10] The selected key-value pairs are then passed to the main MLA path, which runs at full precision.
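The two stages can be sketched for a single query token as follows. This is a minimal NumPy illustration under stated assumptions: the array shapes, the function names, and the plain softmax attention used in place of MLA are hypothetical simplifications, and FP8 arithmetic is not modeled. Only the scoring rule (a weighted sum of ReLU activations over query-key dot products) and the top-k selection follow the description above.

```python
import numpy as np

def indexer_scores(q_heads: np.ndarray, head_weights: np.ndarray,
                   keys: np.ndarray) -> np.ndarray:
    """Score every preceding key for one query token.

    q_heads:      (H, d) indexer query vectors, one per indexer head
    head_weights: (H,)   per-head weights for this query
    keys:         (S, d) one indexer key per preceding token
    returns:      (S,)   scalar relevance score per key
    """
    dots = q_heads @ keys.T                       # (H, S) query-key dot products
    return head_weights @ np.maximum(dots, 0.0)   # weighted sum of ReLU outputs

def select_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Sorted indices of the k highest-scoring keys (all keys if fewer exist)."""
    k = min(k, scores.shape[0])
    if k == 0:
        return np.empty(0, dtype=np.int64)
    return np.sort(np.argpartition(scores, -k)[-k:])

def sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray,
                     idx: np.ndarray) -> np.ndarray:
    """Softmax attention restricted to the selected keys (stand-in for MLA)."""
    logits = K[idx] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx]
```

In the released checkpoints k would be 2,048; the sketch works for any k, and returns every preceding token when fewer than k are available.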

The net result is that the main attention cost scales as O(Lk) rather than O(L²), where L is the sequence length and k is the fixed selection budget. The lightning indexer still scores every preceding token, so it remains O(L²) in principle, but with a much smaller constant than full attention. At 128K tokens with k=2,048, a query at the end of the context attends to 64 times fewer keys than under dense attention (131,072 / 2,048 = 64). An independent Tensor Economics analysis measured that at 131K tokens DSA loads roughly five times less data per decode step than dense MLA, with indexer overhead of about 132 bytes per token (FP8) versus 656 bytes per token for the main MLA cache.^[10]
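The scaling argument can be checked with a short calculation. The sketch below counts query-key pairs in the main attention path only, under the assumption of causal attention in which token t can see t keys; the indexer's own cost is deliberately left out, and the function names are illustrative.

```python
def dense_pairs(L: int) -> int:
    """Query-key pairs under causal dense attention: token t sees t keys."""
    return L * (L + 1) // 2

def sparse_pairs(L: int, k: int) -> int:
    """Query-key pairs when each token attends to at most k selected keys."""
    if L <= k:
        return dense_pairs(L)
    # The first k tokens see everything; the remaining L - k see exactly k keys.
    return dense_pairs(k) + (L - k) * k

L, k = 131_072, 2_048
last_token_ratio = L / k                               # decoding at full context
whole_sequence_ratio = dense_pairs(L) / sparse_pairs(L, k)
print(last_token_ratio, whole_sequence_ratio)
```

Under these assumptions the saving is 64-fold for a token decoded at the end of a full context and about 32-fold when averaged over a whole 128K-token prefill, because early tokens have fewer than k predecessors to choose from.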

The vLLM implementation stores the MLA cache as 512 bytes of FP8 NoPE (no-positional-embedding) keys, 16 bytes of FP32 scale factors, and 128 bytes of BF16 RoPE embeddings per token, with the indexer K cache held in separate FP8 blocks aligned to FlashMLA's block size of 64.^[11]^[21] DSA is instantiated within MLA's MQA (Multi-Query Attention) mode so that the underlying kernels can be implemented efficiently.
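The per-token figures quoted above can be combined into a rough estimate of the data read per decode step. The sketch assumes, as a simplification, that a decode step reads the entire indexer cache plus only the k selected MLA entries, and that dense MLA reads every MLA entry; per-layer multipliers and kernel-level details are ignored, and the names are illustrative.

```python
# Per-token cache sizes in bytes, taken from the figures quoted in the text.
MLA_BYTES_PER_TOKEN = 512 + 16 + 128   # FP8 NoPE keys + FP32 scales + BF16 RoPE
INDEXER_BYTES_PER_TOKEN = 132          # FP8 indexer key cache

def dense_bytes(context: int) -> int:
    """Bytes read per decode step when every cached token is attended to."""
    return context * MLA_BYTES_PER_TOKEN

def sparse_bytes(context: int, k: int) -> int:
    """Bytes read per decode step: full indexer scan plus k selected entries."""
    return context * INDEXER_BYTES_PER_TOKEN + min(k, context) * MLA_BYTES_PER_TOKEN

context, k = 131_072, 2_048
ratio = dense_bytes(context) / sparse_bytes(context, k)
print(MLA_BYTES_PER_TOKEN, ratio)
```

The three components sum to the 656 bytes per token cited for the MLA cache, and the resulting ratio of about 4.6 is consistent with the reported figure of roughly five times less data per decode step.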

Training pipeline for DSA

DeepSeek did not retrain V3.2 from scratch. Instead, the V3.1-Terminus checkpoint was extended with DSA through two stages of continued training:^[3]^[9]

Stage	Steps	Sequences per step	Tokens per sequence	Total tokens	Description
Dense warm-up	1,000	16	128K (131,072)	2.1 billion	Indexer is trained to mimic the attention distribution of V3.1-Terminus's dense MLA; the rest of the model is frozen.
Sparse training	15,000	480	128K (131,072)	943.7 billion	All parameters are unfrozen and the model is trained end-to-end with sparse attention active.
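The token totals in the table follow from steps × sequences per step × tokens per sequence. The quick check below assumes that a 128K sequence means 131,072 tokens, since that is the value that reproduces the stated totals; with exactly 128,000 tokens the products would come out at 2.05 billion and 921.6 billion instead.

```python
TOKENS_PER_SEQUENCE = 131_072  # assumption: "128K" means 128 * 1024 tokens

def stage_tokens(steps: int, sequences_per_step: int,
                 tokens_per_sequence: int = TOKENS_PER_SEQUENCE) -> int:
    """Total tokens consumed by one continued-training stage."""
    return steps * sequences_per_step * tokens_per_sequence

warmup = stage_tokens(1_000, 16)
sparse = stage_tokens(15_000, 480)
print(f"warm-up: {warmup / 1e9:.1f}B tokens, sparse: {sparse / 1e9:.1f}B tokens")
```

The sparse stage therefore consumes about 450 times as many tokens as the warm-up, which matches the framing of the warm-up as a brief indexer initialization.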

The warm-up stage is the engineering core of DSA's success: the indexer learns its scoring function by imitating the dense attention pattern of a frozen teacher (V3.1-Terminus itself), so that when sparse attention is switched on, the selector approximates the same key set that dense attention would have weighted heavily. This is similar in spirit to distillation but operates on attention patterns rather than output distributions.^[3]^[9]
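One way to express such an imitation objective is a KL divergence between the teacher's aggregated attention distribution and a softmax over the indexer scores. The sketch below is an illustration under that assumption, for a single query token; the aggregation by summing heads, the function names, and the epsilon term are hypothetical choices, not details confirmed by the text above.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def warmup_kl_loss(dense_attn: np.ndarray, index_scores: np.ndarray,
                   eps: float = 1e-12) -> float:
    """KL(target || indexer) for one query token.

    dense_attn:   (H, S) attention probabilities from the frozen dense heads
    index_scores: (S,)   raw indexer scores for the same query
    """
    # Assumption: the target is the head-summed attention, renormalized to 1.
    target = dense_attn.sum(axis=0)
    target = target / target.sum()
    pred = softmax(index_scores)
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))
```

The loss is zero when the indexer's softmax matches the teacher's aggregate distribution and grows as the indexer puts mass on keys the dense heads ignore, which is the property the selector relies on once top-k selection is switched on.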

Post-training

The December 2025 release introduced a substantially expanded post-training pipeline relative to V3.1. The DeepSeek technical report describes three components.

Reinforcement learning scaling

V3.2's reinforcement learning stage scales post-training compute well beyond what V3.1 used and applies several stabilizing techniques designed for off-policy and large-batch GRPO training:^[3]^[9]

Unbiased KL estimation corrects bias in off-policy updates by reformulating the KL penalty term so it remains an unbiased estimator under importance sampling.
Off-policy sequence masking stabilizes training when the behavior policy and the policy being updated diverge significantly, by masking out token positions where the divergence exceeds a threshold δ.
Keep routing preserves the MoE expert-routing decisions made during rollout sampling and reuses them in the training forward pass, so that gradient updates do not fight against routing-induced variance.
Keep sampling mask maintains a consistent action space between top-p sampling at rollout time and the loss computation, avoiding distributional shift at the action edge.
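The masking idea can be sketched in a few lines. This is one plausible reading rather than the report's exact rule: it assumes the divergence is estimated per sequence as the mean log-probability gap between the behavior policy and the current policy on the sampled tokens, and that every token of a sequence over the threshold δ is dropped from the loss. All names are hypothetical.

```python
import numpy as np

def sequence_mask(logp_behavior: np.ndarray, logp_current: np.ndarray,
                  delta: float) -> np.ndarray:
    """Return a (B, T) 0/1 mask that drops sequences that drifted too far.

    logp_behavior, logp_current: (B, T) log-probabilities of the sampled tokens
    under the rollout policy and the policy being updated.
    """
    # Sample-based estimate of KL(behavior || current), one value per sequence.
    divergence = (logp_behavior - logp_current).mean(axis=1)
    keep = divergence <= delta
    return np.repeat(keep[:, None], logp_behavior.shape[1], axis=1).astype(np.float64)

def masked_mean(per_token_loss: np.ndarray, mask: np.ndarray) -> float:
    """Average the loss over unmasked tokens only."""
    return float((per_token_loss * mask).sum() / max(mask.sum(), 1.0))
```

Masked sequences contribute no gradient, so a batch in which the policy has moved far from the rollout distribution degrades gracefully instead of producing large, noisy importance-weighted updates.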

Other refinements relative to V3.1's RL pipeline include the removal of explicit format rewards (replaced by length penalties for agentic tasks), the adoption of generative reward models for non-verifiable domains, and the integration of self-verification techniques inherited from DeepSeekMath V2 for mathematical reasoning.^[9]

Agentic task synthesis

DeepSeek built a data pipeline that synthesizes training data for agent settings programmatically rather than collecting it from human trajectories. The pipeline generates more than 1,800 distinct sandboxed environments and 85,000 task prompts.^[2]^[3] The breakdown is:^[3]

Agent type	Task count
Search agent	50,275
Code agent	24,667
Code interpreter	5,908
General agent	4,417

Each environment exposes a small set of programmable tools, and tasks are generated by composing tool primitives into goals of varying difficulty.

The December release introduced what DeepSeek calls "thinking with tools": DeepSeek describes V3.2 as its first model to integrate chain-of-thought reasoning directly with tool calls inside a single response, with tool use supported in both thinking and non-thinking modes.^[2]^[13] A strategy called thinking context management retains reasoning traces across consecutive tool calls within a single turn but clears them on a new user message; tool calls and tool results remain in context even when intermediate reasoning text is trimmed to stay within a budget. This avoids the standard agentic-loop pattern in which the model must reconstruct its reasoning state on every new tool result.
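The retention rule can be sketched as a small history-management function. The `Message` type, its field names, and `append_message` are hypothetical and do not correspond to DeepSeek's API schema; the sketch only encodes the rule described above, in which reasoning survives across tool calls within a turn and is cleared when a new user message arrives, while tool calls and results are always kept. Budget-based trimming is not modeled.

```python
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class Message:
    role: str            # "user", "assistant", or "tool"
    content: str         # visible text, tool call, or tool result
    reasoning: str = ""  # chain-of-thought attached to an assistant message

def append_message(history: List[Message], msg: Message) -> List[Message]:
    """Return a new history with msg appended.

    A new user message starts a new turn, so earlier reasoning traces are
    cleared; message content (including tool calls and results) is untouched.
    """
    if msg.role == "user":
        history = [replace(m, reasoning="") for m in history]
    return history + [msg]

# One turn with a tool call, followed by a second user message.
h: List[Message] = []
h = append_message(h, Message("user", "What changed in release X?"))
h = append_message(h, Message("assistant", "search('release X')", reasoning="need sources"))
h = append_message(h, Message("tool", "3 results"))
h = append_message(h, Message("assistant", "Summary ...", reasoning="results agree"))
h = append_message(h, Message("user", "And release Y?"))
```

After the final call, every message is still present but all `reasoning` fields are empty, so the next turn starts from the visible conversation and tool outputs alone.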

V3.2-Speciale post-training

The Speciale variant uses the same base checkpoint as V3.2 but with additional RL compute concentrated on long-form mathematical and competitive-programming reasoning. Speciale post-training drew on synthetic olympiad-style problems and verifier-graded trajectories from a code-execution sandbox.^[3]^[6] DeepSeek did not publish the exact compute budget for Speciale, but the technical report describes it as substantially exceeding the base V3.2 RL budget. Tool-calling capability is intentionally removed from Speciale so that all post-training compute serves pure reasoning.^[6]^[16]

Benchmarks

The primary claim of V3.2-Exp at its September launch was performance parity with V3.1-Terminus despite the switch from dense to sparse attention.^[1]^[4] The December V3.2 release pushed several benchmarks higher, and V3.2-Speciale produced an additional tier of competition-grade results.

How does DeepSeek V3.2-Exp compare to V3.1-Terminus?

From the DeepSeek-V3.2-Exp model card:^[4]

Benchmark	V3.1-Terminus	V3.2-Exp
MMLU-Pro	85.0	85.0
GPQA-Diamond	80.7	79.9
LiveCodeBench	74.9	74.1
AIME 2025	88.4	89.3
Codeforces	2046	2121
BrowseComp	38.5	40.1
SimpleQA	96.8	97.1
SWE-bench Multilingual	57.8	57.9
HLE	21.7	19.8

The pattern is essentially flat on knowledge benchmarks (MMLU-Pro unchanged, GPQA-Diamond down 0.8), mixed on coding (Codeforces up 75 rating points, LiveCodeBench down 0.8), slightly positive on AIME, and slightly positive on agentic browse and QA tasks. The roughly two-point drop on Humanity's Last Exam was flagged by external reviewers as the only meaningful regression in the V3.2-Exp suite, with speculation that the fixed top-k=2,048 selector occasionally misses a long-range global connection needed for the hardest items.^[14]^[18]
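The per-benchmark differences can be read off the table with a short script. The values are copied from the model-card table above; the variable names are illustrative, and Codeforces is a rating rather than a percentage, so its delta is on a different scale from the others.

```python
# (V3.1-Terminus, V3.2-Exp) scores from the model-card comparison table.
SCORES = {
    "MMLU-Pro": (85.0, 85.0),
    "GPQA-Diamond": (80.7, 79.9),
    "LiveCodeBench": (74.9, 74.1),
    "AIME 2025": (88.4, 89.3),
    "Codeforces": (2046, 2121),
    "BrowseComp": (38.5, 40.1),
    "SimpleQA": (96.8, 97.1),
    "SWE-bench Multilingual": (57.8, 57.9),
    "HLE": (21.7, 19.8),
}

# Positive delta means V3.2-Exp scored higher than V3.1-Terminus.
deltas = {name: round(new - old, 1) for name, (old, new) in SCORES.items()}
regressions = sorted(name for name, d in deltas.items() if d < 0)

for name, d in deltas.items():
    print(f"{name:24s} {d:+.1f}")
```

Three benchmarks move down (GPQA-Diamond, LiveCodeBench, and HLE), with HLE showing the largest drop among the percentage-scored benchmarks.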

V3.2 official versus contemporaries

The December V3.2 release improved on V3.2-Exp through expanded RL. Reported headline scores include:^[3]^[5]

Benchmark	V3.2
MMLU-Pro	85.0
GPQA-Diamond	82.4
AIME 2025	93.1
HMMT Feb 2025	92.5
LiveCodeBench	83.3
Codeforces (rating)	2,386
SWE-bench Verified	73.1
SWE-bench Pro	15.6
MathArena AIME 2026	94.17
BrowseComp (with context mgmt)	67.6
MCP-Mark	38.0
HLE	25

DeepSeek's technical report states that with a robust reinforcement learning protocol and scaled post-training compute, "DeepSeek-V3.2 performs comparably to GPT-5," while remaining slightly behind Gemini 3 Pro on the same suite.^[3]

V3.2-Speciale reasoning results

The Speciale variant reports substantially higher reasoning scores than the base V3.2 model:^[3]^[6]

Benchmark	V3.2-Speciale
AIME 2025	96.0
HMMT Feb 2025	99.2
Codeforces (rating)	2,701
HLE	30

DeepSeek's paper describes Speciale as surpassing GPT-5 on these benchmarks while remaining on par with Gemini-3.0-Pro.^[3]

V3.2-Speciale olympiad results

Speciale was evaluated on four competitive reasoning events held during 2025. All scores were reported under contest conditions with no internet access and no external tools beyond a code-execution sandbox.^[2]^[3]^[6]

Event	V3.2-Speciale result	Medal
IMO 2025	35 / 42 points	Gold
CMO 2025	Gold-level score	Gold
IOI 2025	492 / 600 points, 10th place	Gold
ICPC World Finals 2025	2nd place	Gold

The IMO score of 35 out of 42 matched the best publicly reported model results on the 2025 problems and put Speciale at parity with strong human gold medalists. The IOI ranking of 10th overall and ICPC second-place finish were independently described as the first machine results to clear gold-medal thresholds across all four flagship olympiads in a single calendar year, and the first such sweep produced by a general-purpose chat model rather than a specialized prover.^[13]

How much does the DeepSeek V3.2 API cost?

The V3.2 release was tied to one of the most aggressive single-step price reductions in the commercial LLM market in 2025. On September 29, 2025, DeepSeek announced a unified pricing schedule applicable to V3.2-Exp, telling developers that "DeepSeek API prices drop 50%+, effective immediately":^[1]^[8]

Token type	V3.1-Terminus	V3.2-Exp
Input (cache hit)	$0.07 / 1M	$0.028 / 1M
Input (cache miss)	$0.56 / 1M	$0.28 / 1M
Output	$1.68 / 1M	$0.42 / 1M

The headline reduction was 50 to 60 percent on input tokens (50 percent on cache misses, 60 percent on cache hits) and 75 percent on output tokens. DeepSeek attributed the change directly to the reduced per-token compute and KV-cache footprint enabled by DSA, particularly at long contexts.^[1] Independent analyses of the prices, by VentureBeat and the cost-tracking site CostGoat, characterized V3.2 as the cheapest frontier-class API offering available at the time of launch.^[8]^[15]
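The reductions and their effect on a single request can be worked out from the price table. The sketch below uses the September 2025 prices listed above; the function names and the example token counts are illustrative, and later price changes are not reflected.

```python
# Prices in US dollars per one million tokens, from the table above.
PRICES = {
    "V3.1-Terminus": {"cache_hit": 0.07, "cache_miss": 0.56, "output": 1.68},
    "V3.2-Exp": {"cache_hit": 0.028, "cache_miss": 0.28, "output": 0.42},
}

def reduction_pct(kind: str) -> int:
    """Percentage price cut from V3.1-Terminus to V3.2-Exp for one token type."""
    old = PRICES["V3.1-Terminus"][kind]
    new = PRICES["V3.2-Exp"][kind]
    return round(100 * (old - new) / old)

def request_cost(model: str, hit_tokens: int, miss_tokens: int,
                 output_tokens: int) -> float:
    """Dollar cost of one request given its token counts."""
    p = PRICES[model]
    total = (hit_tokens * p["cache_hit"]
             + miss_tokens * p["cache_miss"]
             + output_tokens * p["output"])
    return total / 1_000_000

for kind in ("cache_hit", "cache_miss", "output"):
    print(kind, reduction_pct(kind), "%")

# Example: a 100K-token uncached prompt with a 2K-token answer.
for model in PRICES:
    print(model, request_cost(model, 0, 100_000, 2_000))
```

For long prompts the input price dominates, so the saving on a request like the example is close to the 50 percent cache-miss reduction rather than the 75 percent output reduction.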

A further reduction took effect on April 26, 2026, 12:15 UTC: the input cache-hit price was cut to one-tenth of its launch price, bringing it below $0.003 per million tokens.^[15] This step coincided with the V4 launch and was applied retroactively to all live DeepSeek API models. After the May 2026 stacked adjustments, the V4-Flash cache-hit price fell to roughly $0.0029 per million tokens, with V3.2 retained on the API as a cheaper alternative for non-V4 workloads.^[15]^[17]

The September pricing cut had immediate effects on the commercial market. OpenAI announced a price reduction on its own GPT-4-class API a few weeks later, and Anthropic published lower batch-API pricing in November 2025 that closed part of the gap; both adjustments were widely attributed in trade press to competitive pressure from DeepSeek's V3.2 schedule.^[8]

Inference and tooling

The V3.2-Exp release was supported on day zero by major open-source inference stacks. vLLM, SGLang, and Hugging Face transformers all shipped patches by early October 2025 that handled the indexer module and the top-k selection step, and later updates adopted the corrected RoPE layout that became required after the November correction.^[4]^[11]^[21] Red Hat AI published a detailed integration guide for vLLM, including measurements indicating that DSA delivers roughly 3.5x throughput improvement at 128K context against dense MLA on the same NVIDIA Hopper and Blackwell hardware.^[11]

For on-premise deployment, DeepSeek released two open-source kernel libraries alongside the model:^[1]^[7]

FlashMLA provides sparse attention kernels for MLA, including paged variants for batched inference.
DeepGEMM includes high-performance CUDA kernels for the lightning indexer's FP8 logit computation.

A notable secondary release was the broad use of TileLang, DeepSeek's open-source ML compiler that lowers Python kernel descriptions to GPU-specific machine code. DeepSeek published TileLang versions of the DSA kernels alongside the hand-written CUDA variants, demonstrating that roughly 80 lines of Python could reach 95 percent of FlashMLA's CUDA performance.^[20] This made DSA unusually portable across non-NVIDIA accelerators.

Day-zero Chinese chip adoption

Several Chinese accelerator vendors completed adaptation on the same day as the V3.2-Exp release:^[20]

Huawei Ascend shipped a vLLM-Ascend integration with custom operator implementations for the indexer and the top-k selector, with all inference code open-sourced.
Cambricon released vLLM-MLU, an inference engine for its MLU accelerators that natively executes DSA layers.
Hygon completed kernel adaptation on its DCU accelerators in parallel with the Ascend and Cambricon work.

This was the first DeepSeek release in which all three major Chinese accelerator stacks reached production parity with NVIDIA on the launch day, and it was widely interpreted as evidence that the TileLang toolchain was central to DeepSeek's hardware-portability strategy.^[20]

Is DeepSeek V3.2 open source?

The model weights are distributed under the MIT license, one of the most permissive open-source licenses, which allows unrestricted commercial use, modification, and redistribution.^[4]^[5] All three December checkpoints (Base, V3.2, and Speciale) are published openly on Hugging Face.^[5]^[6] Quantizations to FP8, INT8, and 4-bit were published by the community within the first two weeks of the V3.2-Exp release, with several variants reducing aggregate VRAM requirements to under 700 GB for production inference.^[4]^[5] DeepSeek also open-sourced the supporting kernels (FlashMLA, DeepGEMM) and the TileLang reference implementations, so the full DSA stack is reproducible outside DeepSeek's own infrastructure.^[7]^[20]
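The memory figure can be sanity-checked with a weights-only estimate. The sketch below assumes the 685 billion parameter count listed on the model cards and a uniform number of bytes per parameter for each format; KV cache, activations, quantization scale factors, and mixed-precision layers are ignored, so real deployments need more than these numbers.

```python
TOTAL_PARAMS = 685e9  # total parameters listed on the Hugging Face model cards

# Assumption: every parameter is stored at the same width in each format.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "4-bit": 0.5}

def weight_gb(fmt: str) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes) for a format."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:6s} {weight_gb(fmt):8.1f} GB")
```

At one byte per parameter the weights alone come to 685 GB, which is where the figure of under 700 GB for 8-bit variants comes from; 4-bit variants roughly halve that again.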

Reception and significance

V3.2 was received as a focused, architecturally narrow but commercially significant release. Sebastian Raschka's December 2025 technical retrospective on the V3 line characterized DSA as the most consequential efficiency improvement DeepSeek had shipped since the introduction of MLA in DeepSeek-V2: a pure inference-time optimization that propagates linearly into operational cost.^[9] The Tensor Economics blog described DSA as the first published deployment of fine-grained token-level sparse attention at frontier scale, and noted that the lightning-indexer pattern was likely to be adopted broadly.^[10]

The V3.2-Speciale olympiad results drew comparisons to the AlphaProof and AlphaGeometry results that Google DeepMind had published in 2024 and 2025, with the distinction that Speciale is a general-purpose chat model rather than a specialized prover. The fact that an MIT-licensed open-weight model produced gold-medal-level olympiad results in a single calendar year was widely characterized as a reset of expectations about the gap between open-weight and proprietary frontier systems.^[13]^[19]

The September pricing schedule was the first time a frontier-class API offering had been priced below $0.50 per million output tokens. Industry commentators noted that while Claude Sonnet 4.5 and GPT-5 retained pricing premiums on the basis of brand and ecosystem, V3.2 had become the default reference point for cost-sensitive long-context applications such as coding assistants and document analysis pipelines.^[10]^[15]

How does DeepSeek V3.2 differ from V3 and V4?

V3.2 is structurally a member of the V3 family. It inherits the 671-billion-parameter MoE backbone, the MLA dense attention path (now wrapped in DSA selection), the auxiliary-loss-free routing, and the 128K context window. The continuity is intentional: DeepSeek's stated goal with V3.2 was to validate DSA as a drop-in efficiency improvement that could be applied to an existing checkpoint through continued training, without committing to a new pretraining run.^[3]

DeepSeek V4, released in preview on April 24, 2026, is by contrast a ground-up architectural redesign. V4 replaces MLA entirely with a hybrid combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), extends the context window to one million tokens, and ships in two parameter scales: a 1.6-trillion-parameter Pro and a 284-billion-parameter Flash.^[17] DSA, in V4, becomes one component of a larger compression strategy rather than the primary mechanism; the V4 paper explicitly cites the V3.2 lightning-indexer design as a building block for CSA. At 1M-token context, V4-Pro reportedly achieves 27 percent of the single-token FLOPs and 10 percent of the KV cache size of V3.2, with V4-Flash reaching 10 percent of the FLOPs and 7 percent of the KV cache.^[17]

In terms of capabilities, V3.2 occupies a band slightly above V3.1-Terminus on coding and tool use and substantially above on math when running in Speciale mode. V4-Pro then leapfrogs V3.2-Speciale on most benchmarks while running at a fraction of the inference cost per long-context token. V3.2 remained the default DeepSeek API model from December 2025 through April 2026, when V4 entered preview as a separate endpoint, and continued to be offered alongside V4 thereafter as a cheaper option for workloads that did not need V4's million-token context.^[17]
