DeepSeek Sparse Attention (DSA)

Updated Jun 28, 2026

DeepSeek Sparse Attention (DSA) is a trainable, fine-grained sparse attention mechanism introduced by the Chinese AI company DeepSeek in its experimental model DeepSeek-V3.2-Exp, released on September 29, 2025. DSA pairs a lightweight scoring module that DeepSeek calls the "lightning indexer" with a fine-grained token-selection step, so that each query attends only to a small, fixed number of the most relevant earlier tokens (2,048 by default) instead of the entire preceding sequence. This cuts the dominant cost of the core attention computation from quadratic, O(L^2), to roughly O(L*k) in the sequence length L, where k is the number of selected tokens, which is far smaller than L in long contexts. DeepSeek reported that the technique preserves output quality, closely matching the benchmark scores of its predecessor DeepSeek-V3.1-Terminus, while reducing long-context training and inference cost enough that the company lowered its published API prices by 50 percent or more on the same day. ^[1]^[2]^[5]

DeepSeek summarized the mechanism in its release announcement: "DeepSeek Sparse Attention (DSA) achieves fine-grained sparse attention with minimal impact on output quality, boosting long-context performance & reducing compute cost." ^[1]

What is DeepSeek Sparse Attention?

DSA is DeepSeek's production response to the long-context attention bottleneck, and it is the engineering successor to the company's earlier research method Native Sparse Attention (NSA). Where NSA was validated on a model trained from scratch, DSA was added to an already-trained flagship through continued training, which let DeepSeek measure the cost of sparsity against a near-identical dense baseline. The company described DeepSeek-V3.2-Exp as an experimental, intermediate step toward a next-generation architecture rather than a finished product, and it kept the model's training configuration deliberately aligned with V3.1-Terminus so that any change in benchmark numbers could be attributed to the attention mechanism itself. ^[1]^[2]

The model that carries DSA is built on the same mixture-of-experts (MoE) and multi-head latent attention (MLA) foundation as DeepSeek-V3, a 671-billion-parameter MoE with roughly 37 billion parameters active per token. DSA is layered on top of MLA rather than replacing it: the indexer reuses MLA's compressed latent representations to score tokens cheaply, and the selected tokens are then read through the ordinary MLA attention path. ^[2]^[3]

Why is sparse attention needed? Background on DeepSeek-V3.2-Exp and MLA

In a standard transformer, every query token attends to every preceding key token, so producing the output for a sequence of length L costs on the order of L^2 query-key interactions. For the long context windows that modern large language models target, this quadratic term dominates both the compute of training and the memory traffic of decoding, where each newly generated token must read the entire stored key-value (KV) cache. ^[3]

DeepSeek had already attacked the memory side of this problem with MLA, which compresses the per-layer KV cache into a small shared low-rank latent so that far less data is stored and moved during decoding. MLA reduces the size of what attention reads, but it does not by itself reduce the number of tokens each query must consider, so the attention computation is still effectively quadratic in context length. DSA closes that gap by making the set of attended tokens sparse. Because it operates on MLA's latent representations and selects whole latent KV entries that are shared across all query heads, the two mechanisms compose cleanly: MLA shrinks each entry, and DSA reduces how many entries are touched. ^[2]^[3]
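The way the two mechanisms compose can be illustrated with a back-of-the-envelope count of how much cached data one layer reads to decode a single token. This is a minimal sketch, not DeepSeek's accounting: the function name `kv_bytes_read` and all byte sizes are hypothetical, and the count ignores the indexer's own reads over the full context.

```python
def kv_bytes_read(context_len, entry_bytes, top_k=None):
    """Bytes of cached KV data one layer reads to decode a single token.

    top_k=None models dense attention (every cached entry is read);
    otherwise at most top_k entries are read.
    """
    entries = context_len if top_k is None else min(context_len, top_k)
    return entries * entry_bytes


# All sizes below are hypothetical, chosen only to show the two effects.
CONTEXT_LEN = 131_072       # tokens already in the cache
FULL_ENTRY_BYTES = 32_768   # assumed size of an uncompressed per-token entry
LATENT_ENTRY_BYTES = 1_024  # assumed size of a compressed latent entry

dense_full = kv_bytes_read(CONTEXT_LEN, FULL_ENTRY_BYTES)
dense_latent = kv_bytes_read(CONTEXT_LEN, LATENT_ENTRY_BYTES)          # MLA only
sparse_latent = kv_bytes_read(CONTEXT_LEN, LATENT_ENTRY_BYTES, 2_048)  # MLA + DSA

print(dense_full // dense_latent)     # factor gained by shrinking each entry
print(dense_latent // sparse_latent)  # factor gained by touching fewer entries
```

The two factors multiply, which is the sense in which compression and sparsity are complementary rather than competing.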

DeepSeek-V3.2-Exp itself is a continued-training derivative of DeepSeek-V3.1-Terminus. DeepSeek froze the training recipe of the earlier model and changed essentially one thing, the attention mechanism, in order to isolate the effect of DSA. ^[1]^[2]

How does DeepSeek Sparse Attention work?

DSA has two components: a lightning indexer that produces a cheap relevance score for every preceding token, and a fine-grained selector that keeps only the highest-scoring tokens for full attention. ^[1]^[3]

What is the lightning indexer?

For a query at position t, the indexer assigns each earlier position s an index score by combining a small number of lightweight per-head dot products. In the notation of DeepSeek's technical report the score is

I(t,s) = sum over j of w(t,j) * ReLU( q(t,j) . k(s) ),

where q(t,j) is the indexer query for head j, k(s) is the indexer key at position s, w(t,j) is a learned per-head weight, and the sum runs over the indexer's heads. Three design choices make this scan affordable. The indexer uses half as many heads as the main attention (64 indexer heads against the model's 128 attention heads), it runs in FP8 precision, and it uses a ReLU nonlinearity instead of a softmax, which is cheaper to evaluate at high throughput. The indexer keys are derived from MLA's compressed token representations, so the scoring does not require re-reading full-dimensional vectors. The indexer's scoring is still formally quadratic in L, but because each per-token operation is so light, its wall-clock and floating-point costs are minor relative to dense attention; the expensive part of the layer is the attention that follows. ^[3]^[7]^[10]
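The index score can be written out directly in NumPy. The following is a minimal sketch under stated assumptions, not DeepSeek's kernel: it runs in float64 rather than FP8, the function name `lightning_index_scores` and the tensor shapes are illustrative, and it treats the current position as scoreable alongside earlier ones (s <= t).

```python
import numpy as np


def lightning_index_scores(q, k, w):
    """Compute I[t, s] = sum_j w[t, j] * ReLU(q[t, j] . k[s]).

    q: (T, H, d) indexer queries, one per position and indexer head
    k: (T, d)    indexer keys, one per position (no head axis)
    w: (T, H)    per-head weights
    Returns a (T, T) array; positions s > t are set to -inf.
    """
    dots = np.einsum("thd,sd->ths", q, k)       # q[t, j] . k[s]
    activated = np.maximum(dots, 0.0)           # ReLU
    scores = np.einsum("th,ths->ts", w, activated)
    T = k.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    return np.where(causal, scores, -np.inf)


rng = np.random.default_rng(0)
q = rng.standard_normal((5, 4, 8))
k = rng.standard_normal((5, 8))
w = rng.standard_normal((5, 4))
index_scores = lightning_index_scores(q, k, w)
print(index_scores.shape)
```

Because the key has no head axis, the only per-head work is one small dot product and one multiply, which is what keeps the full scan over the context cheap.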

How does fine-grained token selection work?

Given the index scores, the selector keeps only the top-k positions for each query, with k set to 2,048 by default, and builds a sparse attention mask that hides all remaining tokens. The main MLA attention is then computed over just those k selected entries. Selection is fine-grained, meaning it operates at the level of individual tokens rather than over fixed contiguous blocks. This is the central technical distinction between DSA and the block-based schemes that preceded it, and DeepSeek describes V3.2-Exp as the first model to achieve fine-grained sparse attention at this scale. Because MLA in its decoding form behaves like multi-query attention, with one latent KV entry shared across all query heads, the same 2,048 selected entries serve every head of a token, so the selection is computed once per query rather than once per head. With k held constant, the core attention cost grows only linearly with context length, giving the O(L*k) behavior that motivates the design. ^[3]^[7]^[10]
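Top-k selection followed by attention over the surviving entries can be sketched as follows. This is an illustrative single-head version with made-up shapes and random stand-in index scores, and the helper names `select_top_k` and `sparse_attention` are hypothetical. For clarity it masks a dense logit matrix; a real kernel would gather only the k selected entries, which is where the savings come from.

```python
import numpy as np


def select_top_k(index_scores, k):
    """Boolean (T, T) mask keeping, per query row, the k highest-scoring
    positions. index_scores must hold -inf at non-causal positions."""
    T = index_scores.shape[1]
    k = min(k, T)
    # argpartition places the k largest entries of each row in the last k slots.
    top = np.argpartition(index_scores, T - k, axis=1)[:, T - k:]
    mask = np.zeros(index_scores.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    # Early queries have fewer than k valid positions; drop any -inf picks.
    return mask & np.isfinite(index_scores)


def sparse_attention(q, k_cache, v_cache, mask):
    """Softmax attention restricted to the positions where mask is True."""
    logits = q @ k_cache.T / np.sqrt(q.shape[-1])
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v_cache


rng = np.random.default_rng(0)
T, d = 6, 4
causal = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal, rng.standard_normal((T, T)), -np.inf)  # stand-in scores
mask = select_top_k(scores, k=2)
q = rng.standard_normal((T, d))
kc = rng.standard_normal((T, d))
vc = rng.standard_normal((T, d))
out = sparse_attention(q, kc, vc, mask)
print(mask.sum(axis=1))  # number of attended positions per query
```

The mask depends only on the query position, not on an attention head, which mirrors the point above that one selection serves every head.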

How is DSA trained into the model?

DSA was instilled through continued training of DeepSeek-V3.1-Terminus in two phases. In a dense warm-up stage of about 2.1 billion tokens (roughly 1,000 steps), the main model is frozen and only the lightning indexer is trained. The indexer learns to imitate the dense model's own attention by minimizing a Kullback-Leibler divergence between its predicted token importances and the attention weights the dense model actually produces, so the indexer learns to point at the tokens the model would have attended to anyway. In the second, sparse stage of about 943.7 billion tokens (roughly 15,000 steps at a main-model learning rate near 7.3e-6), the top-k selector is switched on and the whole model, indexer included, is trained with sparse attention active so that the network adapts to operating under sparsity. DeepSeek then applied a post-training pipeline that distilled several specialist models (covering domains such as mathematics, competitive programming, and agentic search) into the final checkpoint and used reinforcement learning with Group Relative Policy Optimization (GRPO) to consolidate the stages. ^[4]^[7]
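The warm-up objective can be illustrated with a small NumPy sketch. Several details here are assumptions made for illustration and are not stated above: the target distribution is formed by summing the dense attention weights over heads and renormalizing each row, the divergence is taken as KL(target || softmax of index scores), and the per-query losses are averaged. The function names are hypothetical.

```python
import numpy as np


def causal_softmax(logits):
    """Row-wise softmax over positions s <= t; future positions get 0."""
    T = logits.shape[-1]
    causal = np.tril(np.ones((T, T), dtype=bool))
    masked = np.where(causal, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def indexer_kl_loss(index_scores, dense_attn):
    """KL divergence between the dense model's attention and the indexer.

    index_scores: (T, T) raw indexer scores
    dense_attn:   (H, T, T) attention probabilities from the frozen model
    """
    target = dense_attn.sum(axis=0)                        # aggregate heads
    target = target / target.sum(axis=-1, keepdims=True)   # renormalize rows
    pred = causal_softmax(index_scores)
    support = target > 0                                   # 0 * log 0 counts as 0
    safe_target = np.where(support, target, 1.0)
    safe_pred = np.where(support, pred, 1.0)
    kl = np.where(support, target * np.log(safe_target / safe_pred), 0.0)
    return kl.sum(axis=-1).mean()


rng = np.random.default_rng(0)
H, T = 4, 6
dense_attn = causal_softmax(rng.standard_normal((H, T, T)))
loss = indexer_kl_loss(rng.standard_normal((T, T)), dense_attn)
print(loss)
```

In this stage only the indexer's parameters would receive gradients from such a loss, since the main model is frozen.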

How much more efficient is DSA, and does it hurt quality?

The headline efficiency claim is the reduction of the core attention term from O(L^2) to O(L*k). Because the indexer is engineered to be far cheaper per token than the main attention, the overall long-context cost is dominated by the sparse attention over k tokens rather than by the full sequence. DeepSeek packaged high-performance kernels for the mechanism, releasing sparse-attention kernels in FlashMLA, indexer logit kernels in DeepGEMM, and additional kernels in TileLang. ^[2]^[10]
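The gap between the two growth rates can be made concrete by counting query-key pairs in the main attention under causal masking. This is a simple counting sketch with hypothetical function names; it deliberately excludes the indexer, which still scores every earlier token but at a much lower cost per pair.

```python
def dense_pairs(seq_len):
    """Query-key pairs in causal dense attention: query t sees t tokens."""
    return seq_len * (seq_len + 1) // 2


def sparse_pairs(seq_len, k=2048):
    """Query-key pairs when each query attends to at most k tokens."""
    if seq_len <= k:
        return dense_pairs(seq_len)
    # The first k queries see everything available; the rest see exactly k.
    return dense_pairs(k) + (seq_len - k) * k


for seq_len in (2_048, 32_768, 131_072):
    ratio = dense_pairs(seq_len) / sparse_pairs(seq_len)
    print(seq_len, dense_pairs(seq_len), sparse_pairs(seq_len), round(ratio, 2))
```

Below k tokens the two counts coincide, so the benefit appears only once the context exceeds the selection budget and then widens as the context grows.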

On quality, DeepSeek aligned the configurations of V3.2-Exp and V3.1-Terminus and published a side-by-side comparison showing near parity across a broad benchmark suite. The differences are small and run in both directions: V3.2-Exp is slightly ahead on several reasoning and agentic-search tasks and slightly behind on a few knowledge and coding tasks, which DeepSeek attributed to ordinary run-to-run variance rather than a systematic loss from sparsity. ^[1]^[2]^[10]

Benchmark	DeepSeek-V3.1-Terminus	DeepSeek-V3.2-Exp
MMLU-Pro	85.0	85.0
GPQA-Diamond	80.7	79.9
Humanity's Last Exam	21.7	19.8
LiveCodeBench	74.9	74.1
AIME 2025	88.4	89.3
HMMT 2025	86.1	83.6
Codeforces (rating)	2046	2121
Aider-Polyglot	76.1	74.5
BrowseComp	38.5	40.1
BrowseComp-zh	45.0	47.9
SimpleQA	96.8	97.1
SWE Verified	68.4	67.8
SWE-bench Multilingual	57.8	57.9
Terminal-bench	36.7	37.7

The efficiency gain showed up most directly in price. On release day DeepSeek cut its API prices by 50 percent or more: the cache-miss input price fell to 0.28 US dollars per million tokens (from 0.56), the cache-hit input price to 0.028 US dollars per million tokens, and the output price to 0.42 US dollars per million tokens. The saving grows with context length, since the cost reduction comes from replacing a quadratic term with a near-linear one. An analysis summarized by DeepLearning.AI estimated that long-context inference became roughly 6 to 7 times cheaper than on V3.1-Terminus, citing illustrative per-million-token costs of about 0.10 US dollars versus 0.60 US dollars at 32,000 input tokens and about 0.30 US dollars versus 2.30 US dollars at 128,000 input tokens. ^[4]^[5]^[6]
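The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted in this section; the variable and function names are illustrative.

```python
def fractional_cut(old_price, new_price):
    """Fraction by which a price fell, e.g. 0.5 for a 50 percent cut."""
    return 1 - new_price / old_price


# Cache-miss input price, US dollars per million tokens (from the text).
cache_miss_cut = fractional_cut(0.56, 0.28)

# Illustrative long-context costs: (V3.1-Terminus, V3.2-Exp) per million tokens.
long_context_costs = {32_000: (0.60, 0.10), 128_000: (2.30, 0.30)}
cost_ratios = {
    tokens: old / new for tokens, (old, new) in long_context_costs.items()
}

print(f"cache-miss input cut: {cache_miss_cut:.0%}")
for tokens, ratio in cost_ratios.items():
    print(f"{tokens} input tokens: {ratio:.1f}x cheaper")
```

On the quoted figures the ratio is about 6 at 32,000 tokens and about 7.7 at 128,000 tokens, consistent with the saving growing as the context lengthens.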

DeepSeek released the model weights on Hugging Face under the MIT License, and the open release made DSA immediately reproducible. Serving frameworks including vLLM and SGLang added support on the day of release. On November 17, 2025, DeepSeek noted a correction to a Rotary Position Embedding (RoPE) discrepancy in the indexer module of its earlier inference code. ^[2]^[6]

How does DSA differ from Native Sparse Attention and other sparse attention?

DSA is the direct descendant of NSA, the trainable sparse-attention method that DeepSeek published in February 2025 and that shared a Best Paper Award at ACL 2025. NSA runs three parallel branches over the history (a coarse compression branch, a fine-grained selection branch, and a local sliding window) and blends them with a learned gate, all built around hardware-aligned block selection. DSA keeps NSA's goals of trainable, fine-grained sparsity and hardware-aware kernels but simplifies the design considerably: it drops the multi-branch gate in favor of a single lightning indexer plus a top-k selector, it selects individual tokens rather than contiguous blocks, and it is added to MLA through continued training rather than being trained in from scratch. ^[2]^[3]^[8]

DSA also belongs to the same wave as Mixture of Block Attention (MoBA), released within days of NSA in February 2025 by Moonshot AI and deployed in its Kimi assistant. MoBA carries the mixture-of-experts routing idea into attention, partitioning the context into blocks and routing each query to its top-k blocks by a dot product against each block's mean-pooled key. NSA, MoBA, and DSA together mark a shift away from inference-only sparsification, in which a dense model is pruned after the fact, and toward sparsity that is a native, trained-in property of the model. DSA's distinguishing feature within that group is its token-level granularity: rather than choosing blocks, its indexer scores and its selector keeps individual key-value entries. ^[3]^[8]^[9]
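The difference in granularity can be shown with a toy comparison of block routing against token selection for a single query. This is a simplified sketch, not either project's code: it ignores causality and any rule that forces particular blocks to be attended, the token scores are supplied by hand rather than produced by an indexer, and the function names are hypothetical.

```python
import numpy as np


def block_top_k(query, keys, block_size, k_blocks):
    """Block routing: score each block by the dot product of the query with
    the block's mean-pooled key, then return every token index inside the
    k_blocks best blocks."""
    n_blocks = keys.shape[0] // block_size
    blocks = keys[: n_blocks * block_size].reshape(n_blocks, block_size, -1)
    block_scores = blocks.mean(axis=1) @ query
    chosen = np.sort(np.argsort(block_scores)[-k_blocks:])
    return np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in chosen]
    )


def token_top_k(token_scores, k_tokens):
    """Token selection: keep the k_tokens individually best positions."""
    return np.sort(np.argsort(token_scores)[-k_tokens:])


keys = np.zeros((8, 2))
keys[4:6] = [1.0, 0.0]                 # only block 2 (tokens 4-5) matches
query = np.array([1.0, 0.0])
token_scores = np.array([0.1, 0.9, 0.0, 0.0, 0.2, 0.8, 0.0, 0.0])

print(block_top_k(query, keys, block_size=2, k_blocks=1))
print(token_top_k(token_scores, k_tokens=2))
```

Block routing always returns contiguous runs, whereas token selection can pick two relevant positions that sit far apart without also paying for their neighbours.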

Why does DeepSeek Sparse Attention matter?

DSA is notable less as a new idea than as a demonstration that fine-grained, learned sparse attention can be deployed at frontier scale with no meaningful loss of quality and with an immediate, large reduction in serving cost. By aligning V3.2-Exp's recipe with V3.1-Terminus and open-sourcing the weights, the kernels, and the benchmark comparison, DeepSeek gave the community a clean controlled study of sparse attention on a 671-billion-parameter model, something earlier sparse-attention papers could only approximate at smaller scale. The mechanism was subsequently carried forward into DeepSeek's broader V3.2 line, with a later technical report, "DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models," presenting the architecture beyond its initial experimental framing. For long-context applications, where the quadratic cost of attention had been the binding constraint, DSA reframed the problem as one of cheaply selecting which few thousand tokens actually matter for each query. ^[1]^[2]^[11]

References

[1] DeepSeek-AI. "Introducing DeepSeek-V3.2-Exp." DeepSeek API Docs, September 29, 2025. https://api-docs.deepseek.com/news/news250929
[2] DeepSeek-AI. "DeepSeek-V3.2-Exp" (model release, technical report, and kernels). GitHub, September 29, 2025. https://github.com/deepseek-ai/DeepSeek-V3.2-Exp
[3] Raschka, Sebastian. "A Technical Tour of the DeepSeek Models from V3 to V3.2." Ahead of AI, 2025. https://magazine.sebastianraschka.com/p/technical-deepseek
[4] DeepLearning.AI. "DeepSeek-V3.2-Exp Streamlines Processing Using a 'Lightning Indexer,' Boosting Efficiency." The Batch, October 2025. https://www.deeplearning.ai/the-batch/deepseek-v3-2-exp-streamlines-processing-using-a-lightning-indexer-boosting-efficiency
[5] TechCrunch. "DeepSeek releases 'sparse attention' model that cuts API costs in half." September 29, 2025. https://techcrunch.com/2025/09/29/deepseek-releases-sparse-attention-model-that-cuts-api-costs-in-half/
[6] VentureBeat. "DeepSeek's new V3.2-Exp model cuts API pricing in half to less than 3 cents per 1M input tokens." September 29, 2025. https://venturebeat.com/ai/deepseeks-new-v3-2-exp-model-cuts-api-pricing-in-half-to-less-than-3-cents
[7] Golge, Eren. "Model check: DeepSeek-V3.2-Exp, Fine-Grained Sparse Attention for Efficient Long-Context LLMs." Substack, 2025. https://erogol.substack.com/p/model-check-deepseek-v32-exp-fine
[8] Yuan, Jingyang, et al. "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention." arXiv:2502.11089, February 16, 2025. https://arxiv.org/abs/2502.11089
[9] Lu, Enzhe, et al. "MoBA: Mixture of Block Attention for Long-Context LLMs." arXiv:2502.13189, February 18, 2025. https://arxiv.org/abs/2502.13189
[10] MarkTechPost. "DeepSeek V3.2-Exp Cuts Long-Context Costs with DeepSeek Sparse Attention (DSA) While Maintaining Benchmark Parity." September 30, 2025. https://www.marktechpost.com/2025/09/30/deepseek-v3-2-exp-cuts-long-context-costs-with-deepseek-sparse-attention-dsa-while-maintaining-benchmark-parity/
[11] DeepSeek-AI. "DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models." arXiv:2512.02556, December 2025. https://arxiv.org/abs/2512.02556
