Native Sparse Attention (NSA)

Updated Jun 28, 2026

Native Sparse Attention (NSA) is a hardware-aligned, natively trainable sparse attention mechanism introduced in February 2025 by DeepSeek, in collaboration with researchers at Peking University and the University of Washington, to make long-context language model training and inference dramatically cheaper while matching the accuracy of dense (full) attention. It replaces standard all-to-all attention with a dynamic, hierarchical strategy that runs three parallel branches over the key-value sequence (a coarse-grained compression branch, a fine-grained block-selection branch, and a local sliding window) and blends their outputs with a learned gate. On 64,000-token sequences NSA reports up to 9.0 times faster forward propagation, 6.0 times faster backward propagation, and 11.6 times faster decoding than a FlashAttention full-attention baseline, while maintaining or exceeding that baseline's accuracy. The method is described in the paper "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" by Jingyang Yuan and colleagues, which won a Best Paper Award at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) in Vienna. ^[1]^[2]^[3]

What is Native Sparse Attention?

The core problem NSA targets is that the self-attention operation in a transformer has computational cost that grows quadratically, O(L^2), with sequence length L. For modern context windows of tens or hundreds of thousands of tokens, attention comes to dominate both the compute of the forward and backward passes during training and the memory traffic of the key-value (KV) cache during decoding. Sparse attention, which computes only a subset of the query-key interactions, is the obvious remedy, but two practical obstacles had limited earlier methods. First, many sparse schemes are applied only at inference time on top of a model that was pretrained with full attention, so the model never learns to operate with the sparsity and a train-inference mismatch degrades quality. Second, theoretical reductions in floating point operations frequently fail to produce real wall-clock speedups because the sparse access pattern is irregular and does not map well to GPU memory hierarchies or to optimized kernels such as FlashAttention. ^[1]

NSA's contribution is to address both obstacles at once through two stated design principles: it is hardware-aligned, meaning its sparse pattern is organized into contiguous blocks and its kernels are written to keep the GPU's arithmetic units busy and to cooperate with grouped-query attention; and it is natively trainable, meaning the sparsity is present and learned from the start of pretraining rather than bolted on afterward. The paper's abstract states that NSA "employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision." ^[1] The authors report that an NSA model pretrained end-to-end matches or exceeds an equivalent full-attention model on general, long-context, and reasoning benchmarks, while delivering large speedups on 64,000-token sequences. ^[1]

Why is long-context attention so expensive?

In a standard attention layer, every query token attends to every preceding key token, so producing the output for a sequence of length L requires on the order of L^2 query-key dot products and the same number of weighted value sums. Doubling the context roughly quadruples the attention work. During autoregressive decoding the bottleneck shifts from arithmetic to memory: each newly generated token must read the entire stored KV cache, so the per-token cost grows linearly with the context already produced and the KV cache becomes a dominant consumer of memory bandwidth. ^[1]
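The scaling described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes causal attention in which each token attends to itself and to every earlier token.

```python
def causal_pairs(seq_len: int) -> int:
    """Query-key pairs when token i attends to tokens 0..i (itself included)."""
    return seq_len * (seq_len + 1) // 2


def decode_reads(context_len: int) -> int:
    """Cached KV positions read to generate one new token with a full cache."""
    return context_len


for length in (8_000, 16_000, 32_000, 64_000):
    growth = causal_pairs(2 * length) / causal_pairs(length)
    # Doubling the context multiplies the pair count by a factor just under 4.
    print(length, causal_pairs(length), round(growth, 4), decode_reads(length))
```

Training cost follows the pair count, which grows quadratically, while the per-token decoding cost follows the cache length, which grows linearly but must be paid again for every generated token.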

A long line of work has tried to make attention sub-quadratic, including fixed sparse patterns, clustering or hashing of keys, low-rank and kernelized approximations, and KV-cache eviction policies that keep only the tokens judged most important. DeepSeek's own multi-head latent attention attacks the memory side by compressing the cache into a low-rank latent. The NSA authors observe that many such techniques share two weaknesses they set out to avoid. Inference-only sparsification, for example evicting tokens from a model that was pretrained with full attention, forces the model to operate in a regime it was never trained for. And methods whose sparsity is token-by-token rather than block-by-block tend to scatter memory reads, so even when they cut FLOPs they stall the hardware and lose the speedup in practice. NSA is explicitly engineered so that the pattern it learns is also a pattern the hardware can execute efficiently. ^[1]

How does NSA work?

For each query position, NSA constructs three separate sets of keys and values from the preceding context, attends to each set independently, and then combines the three resulting outputs with learned gates. The branches capture context at different granularities. ^[1]

Token compression

The compression branch gives the query a cheap, global view of the whole history. Consecutive keys, and separately values, are grouped into blocks, and each block is mapped to a single compressed key or value by a small learnable multilayer perceptron that also encodes the within-block positions. In the paper's configuration the compression block length is 32 tokens with a sliding stride of 16. Because a long sequence collapses to a much shorter sequence of compressed tokens, attending to them is inexpensive, and the resulting attention scores double as a coarse map of which regions of the context matter most. ^[1]
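A minimal NumPy sketch of the compression step follows, using the block length and stride quoted above. It is not the paper's implementation: mean pooling stands in for the learnable MLP with positional encoding, and the function name, shapes, and random inputs are assumptions made for illustration.

```python
import numpy as np


def compress_blocks(x: np.ndarray, block_len: int = 32, stride: int = 16) -> np.ndarray:
    """Map overlapping blocks of rows of x (shape [t, d]) to one row per block.

    Assumption: mean pooling replaces the learnable MLP used by NSA.
    """
    t, d = x.shape
    if t < block_len:
        return np.empty((0, d), dtype=x.dtype)
    starts = range(0, t - block_len + 1, stride)
    return np.stack([x[s:s + block_len].mean(axis=0) for s in starts])


rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))   # 1,024 preceding tokens, head dimension 64
query = rng.standard_normal(64)

k_cmp = compress_blocks(keys)            # one compressed key per block

# Softmax attention over the compressed keys; these probabilities are the
# "coarse map" of which regions of the context matter to this query.
scores = k_cmp @ query / np.sqrt(query.shape[0])
p_cmp = np.exp(scores - scores.max())
p_cmp /= p_cmp.sum()
print(k_cmp.shape, p_cmp.shape)
```

With a stride of 16, a context of t tokens yields roughly t/16 compressed tokens, which is why this branch stays cheap even for long sequences.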

Token selection

Coarse compression alone loses fine detail, so the selection branch restores it by attending at full resolution to a small number of important blocks. NSA reuses the attention scores already computed against the compressed tokens as importance scores, aggregates them onto a grid of fixed selection blocks (block size 64 in the paper), and keeps only the top-scoring blocks (the top 16). Selecting whole contiguous blocks rather than scattered individual tokens is a deliberate hardware choice, because it produces coalesced memory reads. Crucially, when the model uses grouped-query attention, all query heads in a group share one set of selected blocks, computed by summing their importance scores, so the expensive KV cache is loaded once per group rather than once per head. ^[1]
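The selection logic can be sketched in NumPy as below. This is a simplified illustration rather than the paper's exact procedure: each compressed block's score is credited to every selection block it overlaps, which is an assumption about the aggregation rule, and the function name and inputs are hypothetical. Here `seq_len` is the number of tokens preceding the query.

```python
import numpy as np


def select_blocks(p_cmp, seq_len, cmp_len=32, cmp_stride=16, sel_len=64, top_k=16):
    """Pick the top-k selection blocks shared by all heads of one GQA group.

    p_cmp: [heads_in_group, n_cmp] attention probabilities over compressed tokens.
    """
    n_sel = -(-seq_len // sel_len)            # ceiling division
    importance = np.zeros(n_sel)
    shared = p_cmp.sum(axis=0)                # sum importance across the group's heads
    for i, score in enumerate(shared):
        start = i * cmp_stride                # first token covered by compressed block i
        first = start // sel_len
        last = (start + cmp_len - 1) // sel_len
        importance[first:last + 1] += score   # credit every overlapping selection block
    k = min(top_k, n_sel)
    top = np.argsort(-importance, kind="stable")[:k]
    return np.sort(top)


rng = np.random.default_rng(0)
seq_len, heads = 4096, 4
n_cmp = (seq_len - 32) // 16 + 1
logits = rng.standard_normal((heads, n_cmp))
p_cmp = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

blocks = select_blocks(p_cmp, seq_len)
# Contiguous token ranges for the chosen blocks: one gather serves the whole group.
token_idx = np.concatenate(
    [np.arange(b * 64, min((b + 1) * 64, seq_len)) for b in blocks]
)
print(blocks, token_idx.size)
```

Because the result is a short list of block indices rather than a per-token mask, the keys and values for each chosen block can be read as one contiguous slice.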

Sliding window

The third branch is a plain local window over the most recent tokens, set to 512 in the paper, which handles fluent local continuation. Giving local context its own branch is a subtle but important detail: local patterns are easy to learn, and if the compression and selection branches had to model them too, gradients would be dominated by the local signal and the long-range machinery might never train properly. Isolating the window lets the other two branches specialize in genuine long-range retrieval. ^[1]
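As a small illustration, the window branch amounts to ordinary softmax attention restricted to the last 512 cached positions. The helper names and the single-query, single-head shapes below are assumptions made to keep the sketch short.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Scaled dot-product attention for one query vector."""
    scores = k @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v


def sliding_window_attention(q, keys, values, window=512):
    """Attend only to the most recent `window` positions."""
    return softmax_attention(q, keys[-window:], values[-window:])


rng = np.random.default_rng(0)
keys = rng.standard_normal((2048, 64))
values = rng.standard_normal((2048, 64))
q = rng.standard_normal(64)

out = sliding_window_attention(q, keys, values)
print(out.shape)
```

Anything older than the window has no influence on this branch's output, so long-range information can only arrive through the compression and selection branches.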

Gated combination

The outputs of the three branches are merged into the layer output by a gate. A small MLP followed by a sigmoid reads the query's features and produces a weight in the range 0 to 1 for each branch, and the final output is the gated sum of the three branch attentions. Because every component, the compression MLP, the importance-based selection, and the gate, participates in the forward pass on which the model is trained, gradients flow through them and the model learns how to compress, what to select, and how much to trust each branch. ^[1]
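The gated sum can be written in a few lines. In this sketch a single linear layer stands in for the small MLP, and the parameter names (`w_gate`, `b_gate`), dimensions, and random inputs are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_combine(x, branch_outputs, w_gate, b_gate):
    """Blend three branch outputs with independent sigmoid gates.

    x:              [d_model] features for the current query position
    branch_outputs: [3, d_head] outputs of compression, selection, window
    w_gate, b_gate: [d_model, 3] and [3]; a linear layer standing in for the MLP
    """
    gates = sigmoid(x @ w_gate + b_gate)      # one weight in (0, 1) per branch
    return gates @ branch_outputs, gates


rng = np.random.default_rng(0)
d_model, d_head = 128, 64
x = rng.standard_normal(d_model)
branch_outputs = rng.standard_normal((3, d_head))
w_gate = rng.standard_normal((d_model, 3)) * 0.1
b_gate = np.zeros(3)

out, gates = gated_combine(x, branch_outputs, w_gate, b_gate)
print(gates, out.shape)
```

Because each gate is a separate sigmoid rather than one entry of a softmax, the three weights need not sum to 1: the model can turn several branches up or down at the same time.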

What makes NSA hardware-aligned and natively trainable?

The hardware-aligned principle shows up in NSA's custom kernels, written in Triton and built on the same IO-aware, blockwise philosophy as FlashAttention. For the selection branch, which is the part with a data-dependent pattern, the kernel loops over query groups on the outer grid, loads the small set of selected KV blocks for each group into fast on-chip SRAM on the inner loop, and performs the attention there. Sharing the selected blocks across all heads of a grouped-query group balances the arithmetic intensity, the ratio of compute to memory traffic, so that the GPU's tensor cores stay busy instead of waiting on memory. This is how the theoretical sparsity is converted into real speed. ^[1]

The natively trainable principle is the conceptual heart of the paper. Rather than train a dense model and then prune its attention at inference, NSA is part of the architecture during pretraining, so the model adapts its representations to the sparse pattern and there is no train-test gap to pay at deployment. Training with sparse attention from the start also reduces the cost of pretraining itself, not just inference. The one genuinely discrete operation, choosing the top blocks, is made effectively learnable by deriving the importance scores from the differentiable compression branch and by letting the gate learn how heavily to weight the selected output. ^[1]

NSA component	Role	Paper setting
Compression block / stride	coarse global summary	length 32, stride 16
Selection block / count	fine-grained important regions	block 64, keep top 16
Sliding window	local recent context	512 tokens
Gate	combine the three branches	MLP plus sigmoid per branch

How accurate and how fast is NSA?

NSA was validated by pretraining a 27-billion-parameter mixture-of-experts model (about 3 billion parameters active per token, 30 layers, with 72 routed plus 2 shared experts) on 270 billion tokens, alongside an identical full-attention baseline. On a suite of general benchmarks spanning knowledge, reasoning, and coding, the NSA model's average score slightly exceeded the full-attention model's. On the LongBench long-context suite it scored higher as well, and on a 64,000-token needle-in-a-haystack retrieval test it achieved perfect accuracy at every depth. After both models were given chain-of-thought reasoning ability distilled from DeepSeek-R1, the NSA variant outperformed the full-attention variant on AIME mathematics competition problems, indicating that the learned sparsity does not harm and may even help long-form reasoning. As the abstract summarizes, "the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning." ^[1]

The efficiency gains are the headline result. Measured against a full-attention baseline implemented with FlashAttention on 64,000-token sequences, NSA reported up to 9.0 times faster forward propagation, 6.0 times faster backward propagation, and an 11.6 times speedup in decoding, the last reflecting how much less of the KV cache must be read for each generated token. These accelerations grow with context length: the selection and sliding-window branches touch a fixed number of tokens per query, and only the much shorter compressed sequence grows with the context. ^[1]
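A back-of-envelope count of the positions read per decoded token shows why the decoding figure is plausible. The sketch uses the paper settings listed in the table above and makes two simplifying assumptions: a compressed token costs about as much to read as a raw one, and overlap between the branches is ignored. It is an estimate, not the paper's measurement, and the function name is hypothetical.

```python
def nsa_tokens_read(context_len, cmp_len=32, cmp_stride=16,
                    sel_len=64, top_k=16, window=512):
    """Approximate KV positions read per decoded token across the three branches."""
    n_cmp = max(0, (context_len - cmp_len) // cmp_stride + 1)   # compressed tokens
    n_sel = min(top_k * sel_len, context_len)                   # selected blocks
    n_win = min(window, context_len)                            # sliding window
    return n_cmp + n_sel + n_win


for context_len in (8_192, 16_384, 32_768, 65_536):
    read = nsa_tokens_read(context_len)
    print(context_len, read, round(context_len / read, 2))
```

At 65,536 tokens the estimate is 5,631 positions, about 11.6 times fewer than the full cache. The ratio keeps rising with context length because the selection and window terms are fixed while the compressed term grows at one sixteenth of the rate of the raw context.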

How does NSA compare to other sparse-attention methods?

NSA builds on the lineage of FlashAttention: it does not change attention's mathematics so much as supply blockwise, SRAM-tiling kernels that make its specific sparse pattern fast, and its full-attention baseline is itself a FlashAttention implementation. Where FlashAttention computes dense attention exactly but efficiently, NSA computes a learned sparse subset efficiently. ^[1]^[6]

NSA appeared within days of Mixture of Block Attention (MoBA), released in February 2025 by Moonshot AI and deployed in its Kimi assistant. MoBA also makes block-sparse attention trainable, but it does so by applying the mixture-of-experts routing idea directly to attention: the context is partitioned into blocks and each query is routed to its top-k blocks according to the dot product between the query and each block's mean-pooled key. Compared with MoBA's single learned routing mechanism, NSA adds the explicit compression and sliding-window branches and the three-way gate, and it places heavier emphasis on grouped-query-aware kernels. Both works demonstrated that sparsity learned during pretraining can match full attention, a notable convergence from two independent laboratories in the same month. ^[1]^[4]
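For comparison, the MoBA-style routing rule described above (score each block by the dot product between the query and the block's mean-pooled key, then keep the top-k) can be sketched as follows. The block size, the value of k, and the function name are hypothetical, and the sketch omits causal masking and any other details of the released implementation.

```python
import numpy as np


def moba_route(q, keys, block_size, top_k):
    """Return the indices of the top-k blocks for one query.

    Each block is scored by the dot product of q with the block's mean key.
    """
    n_blocks = keys.shape[0] // block_size
    usable = keys[:n_blocks * block_size]
    pooled = usable.reshape(n_blocks, block_size, -1).mean(axis=1)
    scores = pooled @ q
    k = min(top_k, n_blocks)
    return np.sort(np.argsort(-scores, kind="stable")[:k])


rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))
q = rng.standard_normal(64)

routes = moba_route(q, keys, block_size=128, top_k=3)
print(routes)
```

The contrast with NSA is visible in the scoring step: here the block summary is a fixed mean of the keys, whereas NSA derives block importance from attention over learned compressed tokens and adds separate compression and window branches.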

NSA is also the research predecessor of DeepSeek Sparse Attention (DSA), the mechanism DeepSeek shipped in its experimental DeepSeek-V3.2-Exp model on September 29, 2025. DSA keeps NSA's goal of trainable, fine-grained sparsity but simplifies the design: a lightweight "lightning indexer" scores the relevance of preceding tokens to the current query, and a fine-grained selector then keeps only the highest-scoring tokens for full attention, layered on top of multi-head latent attention. Whereas NSA combines three granularities through a gate, DSA concentrates on efficient fine-grained token selection, and DeepSeek used it to cut the model's published API prices by roughly half. Together, NSA, MoBA, and DSA mark a shift away from inference-only sparse attention toward sparsity that is a native, trained-in property of the model. ^[1]^[5]

References

1. Yuan, Jingyang; Gao, Huazuo; Dai, Damai; Luo, Junyu; Zhao, Liang; Zhang, Zhengyan; Xie, Zhenda; Wei, Y. X.; Wang, Lean; Xiao, Zhiping; Wang, Yuqing; Ruan, Chong; Zhang, Ming; Liang, Wenfeng; Zeng, Wangding. "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention." arXiv:2502.11089, February 16, 2025. https://arxiv.org/abs/2502.11089
2. Yuan, Jingyang, et al. "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Volume 1: Long Papers, pages 23078 to 23097, Vienna, Austria, July 2025. https://aclanthology.org/2025.acl-long.1126/
3. South China Morning Post. "DeepSeek founder shares best paper award at top global AI research conference." July 2025. https://www.scmp.com/tech/big-tech/article/3320255/deepseek-founder-shares-best-paper-award-top-global-ai-research-conference
4. Lu, Enzhe, et al. "MoBA: Mixture of Block Attention for Long-Context LLMs." arXiv:2502.13189, February 18, 2025. https://arxiv.org/abs/2502.13189
5. DeepSeek-AI. "DeepSeek-V3.2-Exp" (model release introducing DeepSeek Sparse Attention). September 29, 2025. https://github.com/deepseek-ai/DeepSeek-V3.2-Exp
6. Dao, Tri; Fu, Daniel Y.; Ermon, Stefano; Rudra, Atri; Ré, Christopher. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." arXiv:2205.14135, May 2022. https://arxiv.org/abs/2205.14135
