DeepSeek Sparse Attention (DSA)
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 2,142 words
DeepSeek Sparse Attention (DSA) is a trainable, fine-grained sparse attention mechanism introduced by the Chinese AI company DeepSeek in its experimental model DeepSeek-V3.2-Exp, released on September 29, 2025. DSA pairs a lightweight scoring module that DeepSeek calls the "lightning indexer" with a fine-grained token-selection step, so that each query attends to only a small, fixed number of the most relevant earlier tokens instead of to the entire preceding sequence. This cuts the dominant cost of the core attention computation from quadratic, O(L^2), to roughly O(L*k) in the sequence length L, where k is the number of selected tokens (2,048 by default, and far smaller than L in long contexts). DeepSeek reported that the technique preserves output quality, closely matching the benchmark scores of its predecessor DeepSeek-V3.1-Terminus, while reducing long-context training and inference cost enough that the company lowered its published API prices by more than half on the same day. [1][2][5]
DSA is DeepSeek's production response to the long-context attention bottleneck, and it is the engineering successor to the company's earlier research method Native Sparse Attention (NSA). Where NSA was validated on a model trained from scratch, DSA was added to an already-trained flagship through continued training, which let DeepSeek measure the cost of sparsity against a near-identical dense baseline. The company described DeepSeek-V3.2-Exp as an experimental, intermediate step toward a next-generation architecture rather than a finished product, and it kept the model's training configuration deliberately aligned with V3.1-Terminus so that any change in benchmark numbers could be attributed to the attention mechanism itself. [1][2]
The model that carries DSA is built on the same mixture-of-experts (MoE) and multi-head latent attention (MLA) foundation as DeepSeek-V3, a 671-billion-parameter MoE with roughly 37 billion parameters active per token. DSA is layered on top of MLA rather than replacing it: the indexer reuses MLA's compressed latent representations to score tokens cheaply, and the selected tokens are then read through the ordinary MLA attention path. [2][3]
In a standard transformer, every query token attends to every preceding key token, so producing the output for a sequence of length L costs on the order of L^2 query-key interactions. For the long context windows that modern large language models target, this quadratic term dominates both the compute of training and the memory traffic of decoding, where each newly generated token must read the entire stored key-value (KV) cache. [3]
DeepSeek had already attacked the memory side of this problem with MLA, which compresses the per-layer KV cache into a small shared low-rank latent so that far less data is stored and moved during decoding. MLA reduces the size of what attention reads, but it does not by itself reduce the number of tokens each query must consider, so the attention computation is still effectively quadratic in context length. DSA closes that gap by making the set of attended tokens sparse. Because it operates on MLA's latent representations and selects whole latent KV entries that are shared across all query heads, the two mechanisms compose cleanly: MLA shrinks each entry, and DSA reduces how many entries are touched. [2][3]
DeepSeek-V3.2-Exp itself is a continued-training derivative of DeepSeek-V3.1-Terminus. DeepSeek froze the training recipe of the earlier model and changed essentially one thing, the attention mechanism, in order to isolate the effect of DSA. [1][2]
DSA has two components: a lightning indexer that produces a cheap relevance score for every preceding token, and a fine-grained selector that keeps only the highest-scoring tokens for full attention. [1][3]
For a query at position t, the indexer assigns each earlier position s an index score by combining a small number of lightweight per-head dot products. In the notation of DeepSeek's technical report the score is
I(t,s) = sum over j of w(t,j) * ReLU( q(t,j) . k(s) ),
where q(t,j) is the indexer query for head j, k(s) is the indexer key at position s, w(t,j) is a per-head weight that a learned projection produces from the query token, and the sum runs over the indexer's heads. Three design choices make this scan affordable. The indexer uses half as many heads as the main attention (64 indexer heads against the model's 128 attention heads), it runs in FP8 precision, and it uses a ReLU nonlinearity instead of a softmax, which is cheaper to evaluate at high throughput. The indexer keys are derived from MLA's compressed token representations, so the scoring does not require re-reading full-dimensional vectors. The indexer's scoring is itself formally quadratic in L, but because each per-token operation is so light, its wall-clock and floating-point cost are minor relative to dense attention, and the expensive part of the layer is what follows. [3][7][10]
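The score formula above can be sketched in a few lines of NumPy. This is a minimal illustration for a single query position, not DeepSeek's kernel: the function name `index_scores`, the array layout, and the toy sizes are assumptions chosen for readability, and FP8 arithmetic is not modeled.

```python
import numpy as np

def index_scores(q: np.ndarray, w: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Return I(t, s) for one query position t and every earlier position s.

    q: (H, d) indexer queries q(t, j), one row per indexer head j
    w: (H,)   per-head weights w(t, j)
    k: (S, d) indexer keys k(s), one row per earlier position s
    """
    logits = k @ q.T                 # (S, H): q(t, j) . k(s) for every s and j
    gated = np.maximum(logits, 0.0)  # ReLU in place of a softmax
    return gated @ w                 # (S,): weighted sum over indexer heads

# Toy sizes for illustration only: 2 heads, dimension 2, 3 earlier positions.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.array([1.0, 2.0])
k = np.array([[1.0, 1.0], [-1.0, 2.0], [0.0, -3.0]])
scores = index_scores(q, w, k)       # array([3., 4., 0.])
```

The third position scores zero because both of its dot products are negative and the ReLU clips them, which shows how the nonlinearity lets a head abstain instead of subtracting from the total.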
Given the index scores, the selector keeps only the top-k positions for each query, with k set to 2,048 by default, and builds a sparse attention mask that excludes all remaining tokens. The main MLA attention is then computed over just those k selected entries. Selection is fine-grained, meaning it operates at the level of individual tokens rather than over fixed contiguous blocks. This is the central technical distinction between DSA and the block-based schemes that preceded it, and DeepSeek describes V3.2-Exp as the first model to achieve fine-grained sparse attention at this scale. Because MLA in its decoding form behaves like multi-query attention, with one latent KV entry shared across all query heads, the same 2,048 selected entries serve every head of a token, so the selection is computed once per query rather than once per head. With k held constant, the core attention cost grows only linearly with context length, giving the O(L*k) behavior that motivates the design. [3][7][10]
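The selection step can be illustrated as follows. This is a hedged sketch for one query: ordinary single-head scaled dot-product attention stands in for the MLA path, the index scores are random placeholders, and every name (`select_top_k`, `sparse_attention`) is hypothetical.

```python
import numpy as np

def select_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring earlier positions, in sequence order."""
    k = min(k, scores.shape[0])              # short contexts keep every token
    top = np.argpartition(scores, -k)[-k:]   # unordered top-k, O(S) on average
    return np.sort(top)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())                  # subtract the max for stability
    return e / e.sum()

def sparse_attention(query, keys, values, scores, k=2048):
    """Attend only to the entries chosen by the index scores."""
    idx = select_top_k(scores, k)
    logits = keys[idx] @ query / np.sqrt(query.shape[0])
    return softmax(logits) @ values[idx], idx

# Toy data: 16 earlier positions, dimension 8 (placeholder values).
rng = np.random.default_rng(0)
S, d = 16, 8
query = rng.normal(size=d)
keys = rng.normal(size=(S, d))
values = rng.normal(size=(S, d))
scores = rng.normal(size=S)

out_sparse, kept = sparse_attention(query, keys, values, scores, k=4)
out_full, _ = sparse_attention(query, keys, values, scores, k=S)
out_dense = softmax(keys @ query / np.sqrt(d)) @ values
```

When k is at least the context length, the sketch reduces to dense attention (`out_full` matches `out_dense`), so sparsity only takes effect once the context exceeds k.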
DSA was introduced through continued training of DeepSeek-V3.1-Terminus in two phases. In a dense warm-up stage of about 2.1 billion tokens (roughly 1,000 steps), the main model is frozen and only the lightning indexer is trained. The indexer learns to imitate the dense model's own attention by minimizing a Kullback-Leibler divergence between its predicted token importances and the attention weights the dense model actually produces, so the indexer learns to point at the tokens the model would have attended to anyway. In the second, sparse stage of about 943.7 billion tokens (roughly 15,000 steps at a main-model learning rate near 7.3e-6), the top-k selector is switched on and the whole model, indexer included, is trained with sparse attention active so that the network adapts to operating under sparsity. DeepSeek then applied a post-training pipeline that distilled several specialist models (covering domains such as mathematics, competitive programming, and agentic search) into the final checkpoint and used reinforcement learning with Group Relative Policy Optimization (GRPO), merging what had previously been separate reinforcement-learning stages into a single one. [4][7]
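The warm-up objective can be sketched for a single query position. Several details here are assumptions rather than statements from the text above: the dense attention weights are summed over heads and renormalized to form the target, the index scores are turned into a distribution with a softmax, and the divergence is taken from target to prediction. The function name `indexer_kl_loss` is hypothetical.

```python
import numpy as np

def indexer_kl_loss(attn_weights: np.ndarray, index_scores: np.ndarray) -> float:
    """KL(target || predicted) for one query position.

    attn_weights: (H, S) dense attention probabilities, one row per head
    index_scores: (S,)   raw indexer scores for the same S positions
    """
    # Assumption: collapse heads by summing, then renormalize to sum to 1.
    target = attn_weights.sum(axis=0)
    target = target / target.sum()

    # Assumption: a softmax turns index scores into a distribution.
    shifted = index_scores - index_scores.max()
    log_pred = shifted - np.log(np.exp(shifted).sum())

    mask = target > 0                        # 0 * log(0) is treated as 0
    return float(np.sum(target[mask] * (np.log(target[mask]) - log_pred[mask])))

# Two heads over three positions (illustrative numbers).
attn = np.array([[0.7, 0.2, 0.1],
                 [0.5, 0.4, 0.1]])
loss_matched = indexer_kl_loss(attn, np.log(np.array([0.6, 0.3, 0.1])))
loss_uniform = indexer_kl_loss(attn, np.zeros(3))
```

An indexer whose scores reproduce the head-averaged attention gives a loss of essentially zero, while a flat indexer is penalized. In the warm-up stage only the indexer's parameters would receive gradients from this loss, since the main model is frozen.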
The headline efficiency claim is the reduction of the core attention term from O(L^2) to O(L*k). Because the indexer is engineered to be far cheaper per token than the main attention, the overall long-context cost is dominated by the sparse attention over k tokens rather than by the full sequence. DeepSeek packaged high-performance kernels for the mechanism, releasing sparse-attention kernels in FlashMLA, indexer logit kernels in DeepGEMM, and additional kernels in TileLang. [2][10]
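A rough count makes the scaling concrete. The sketch below tallies query-key pairs in the main attention only, under causal masking, and deliberately ignores the indexer's own scan over all earlier tokens. The function names and the chosen context lengths are illustrative assumptions.

```python
def dense_pairs(L: int) -> int:
    """Causal dense attention: the query at position t reads t positions."""
    return L * (L + 1) // 2

def sparse_pairs(L: int, k: int = 2048) -> int:
    """Top-k sparse attention: the query at position t reads min(t, k)."""
    if L <= k:
        return dense_pairs(L)
    return dense_pairs(k) + (L - k) * k

for L in (4_096, 32_768, 131_072):
    ratio = dense_pairs(L) / sparse_pairs(L)
    print(f"L={L:>7,}  dense={dense_pairs(L):>13,}  "
          f"sparse={sparse_pairs(L):>11,}  ratio={ratio:5.1f}x")
```

With k fixed at 2,048, the ratio keeps growing as L increases, and at 131,072 tokens the main attention touches roughly 32 times fewer pairs than the dense version. The end-to-end saving is smaller than this figure, because the indexer and the non-attention layers still have to run.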
On quality, DeepSeek aligned the configurations of V3.2-Exp and V3.1-Terminus and published a side-by-side comparison showing near parity across a broad benchmark suite. The individual differences are small and run in both directions: V3.2-Exp is slightly ahead on several reasoning and agentic-search tasks and slightly behind on a few knowledge and coding tasks, which DeepSeek attributed to ordinary run-to-run variance rather than a systematic loss from sparsity. [1][2][10]
| Benchmark | DeepSeek-V3.1-Terminus | DeepSeek-V3.2-Exp |
|---|---|---|
| MMLU-Pro | 85.0 | 85.0 |
| GPQA-Diamond | 80.7 | 79.9 |
| Humanity's Last Exam | 21.7 | 19.8 |
| LiveCodeBench | 74.9 | 74.1 |
| AIME 2025 | 88.4 | 89.3 |
| HMMT 2025 | 86.1 | 83.6 |
| Codeforces (rating) | 2046 | 2121 |
| Aider-Polyglot | 76.1 | 74.5 |
| BrowseComp | 38.5 | 40.1 |
| BrowseComp-zh | 45.0 | 47.9 |
| SimpleQA | 96.8 | 97.1 |
| SWE Verified | 68.4 | 67.8 |
| SWE-bench Multilingual | 57.8 | 57.9 |
| Terminal-bench | 36.7 | 37.7 |
The efficiency gain showed up most directly in price. On release day DeepSeek cut its API prices by more than 50 percent: the cache-miss input price fell to 0.28 US dollars per million tokens (from 0.56), the cache-hit input price to 0.028 dollars per million, and the output price to 0.42 dollars per million. The saving grows with context length, since the cost reduction comes from replacing a quadratic term with a near-linear one. An analysis summarized by DeepLearning.AI estimated that long-context inference became roughly 6 to 7 times cheaper than on V3.1-Terminus, citing illustrative per-million-token costs of about 0.10 dollars versus 0.60 dollars at 32,000 tokens of input and about 0.30 dollars versus 2.30 dollars at 128,000 tokens. [4][5][6]
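The arithmetic behind these figures is simple to reproduce. The sketch below uses only the prices and illustrative costs quoted above. The request shape (a 100,000-token uncached prompt with a 2,000-token reply) and the helper name `request_cost` are made-up examples.

```python
# Published V3.2-Exp prices in US dollars per million tokens, as quoted above.
PRICES = {"input_cache_miss": 0.28, "input_cache_hit": 0.028, "output": 0.42}

def request_cost(miss_tokens: int, hit_tokens: int, output_tokens: int) -> float:
    """Cost in US dollars of one API request at the prices above."""
    return (miss_tokens * PRICES["input_cache_miss"]
            + hit_tokens * PRICES["input_cache_hit"]
            + output_tokens * PRICES["output"]) / 1_000_000

# Hypothetical request: 100,000 uncached prompt tokens, 2,000 output tokens.
example_cost = request_cost(100_000, 0, 2_000)

# Ratios implied by the illustrative long-context costs cited above.
ratio_32k = 0.60 / 0.10
ratio_128k = 2.30 / 0.30
print(f"example request: ${example_cost:.5f}")
print(f"32K ratio: {ratio_32k:.1f}x, 128K ratio: {ratio_128k:.1f}x")
```

The two ratios come out at about 6 and about 7.7, consistent with a saving that widens as the context grows.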
DeepSeek released the model weights on Hugging Face under the MIT License, which made DSA immediately reproducible. Serving frameworks including vLLM and SGLang added support on the day of release. On November 17, 2025, DeepSeek noted a correction to a Rotary Position Embedding (RoPE) discrepancy in the indexer module of its earlier inference code. [2][6]
DSA is the direct descendant of NSA, the trainable sparse-attention method that DeepSeek published in February 2025 and that shared a Best Paper Award at ACL 2025. NSA runs three parallel branches over the history (a coarse compression branch, a fine-grained selection branch, and a local sliding window) and blends them with a learned gate, all built around hardware-aligned block selection. DSA keeps NSA's goals of trainable, fine-grained sparsity and hardware-aware kernels but simplifies the design considerably: it drops the multi-branch gate in favor of a single lightning indexer plus a top-k selector, it selects individual tokens rather than contiguous blocks, and it is added to MLA through continued training rather than being built in from scratch. [2][3][8]
DSA also belongs to the same wave as Mixture of Block Attention (MoBA), released within days of NSA in February 2025 by Moonshot AI and deployed in its Kimi assistant. MoBA carries the mixture-of-experts routing idea into attention, partitioning the context into blocks and routing each query to its top-k blocks by a dot product against each block's mean-pooled key. NSA, MoBA, and DSA together mark a shift away from inference-only sparsification, in which a dense model is pruned after the fact, and toward sparsity that is a native, trained-in property of the model. DSA's distinguishing feature within that group is its token-level granularity: rather than choosing blocks, its indexer scores individual key-value entries and its selector keeps them. [3][8][9]
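For contrast with token-level selection, the block routing described for MoBA can be sketched as below. This is a simplified illustration based only on the description above: causal masking and any special handling of the query's own block are left out, and the function name `route_to_blocks` and the toy data are assumptions.

```python
import numpy as np

def route_to_blocks(query, keys, block_size, top_k):
    """Pick the top_k blocks whose mean-pooled key best matches the query.

    Returns the chosen block indices and the token indices they cover.
    """
    n_tokens = keys.shape[0]
    n_blocks = -(-n_tokens // block_size)            # ceiling division
    pooled = np.stack([keys[b * block_size:(b + 1) * block_size].mean(axis=0)
                       for b in range(n_blocks)])    # (n_blocks, d)
    block_scores = pooled @ query                    # one score per block
    top_k = min(top_k, n_blocks)
    chosen = np.sort(np.argsort(block_scores)[-top_k:])
    tokens = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n_tokens))
        for b in chosen])
    return chosen, tokens

# Eight tokens in four blocks of two; the first coordinate drives the score.
keys = np.array([[0, 0], [0, 0], [5, 0], [5, 0],
                 [1, 0], [1, 0], [-1, 0], [-1, 0]], dtype=float)
query = np.array([1.0, 0.0])
blocks, tokens = route_to_blocks(query, keys, block_size=2, top_k=2)
```

Here the query is routed to blocks 1 and 2, so it attends to tokens 2 through 5 as whole blocks. A token-level selector like DSA's could instead keep any subset of individual positions, regardless of block boundaries.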
DSA is notable less as a new idea than as a demonstration that fine-grained, learned sparse attention can be deployed at frontier scale with no meaningful loss of quality and with an immediate, large reduction in serving cost. By aligning V3.2-Exp's recipe with V3.1-Terminus and open-sourcing the weights, the kernels, and the benchmark comparison, DeepSeek gave the community a clean controlled study of sparse attention on a 671-billion-parameter model, something earlier sparse-attention papers could only approximate at smaller scale. The mechanism was subsequently carried forward into DeepSeek's broader V3.2 line, with a later technical report, "DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models," presenting the architecture beyond its initial experimental framing. For long-context applications, where the quadratic cost of attention had been the binding constraint, DSA reframed the problem as one of cheaply selecting which few thousand tokens actually matter for each query. [1][2][11]