Native Sparse Attention (NSA)
Last reviewed
Sources
6 citations
Review status
Source-backed
Revision
v1 · 2,052 words
Native Sparse Attention (NSA) is a trainable sparse attention mechanism introduced in 2025 by DeepSeek in collaboration with researchers at Peking University and the University of Washington. It is designed to make long-context language model training and inference substantially cheaper while matching the accuracy of dense (full) attention. NSA replaces the standard all-to-all attention computation with a dynamic, hierarchical strategy that runs three parallel branches over the key-value sequence (a coarse-grained compression branch, a fine-grained block-selection branch, and a local sliding window) and blends their outputs with a learned gate. The method is described in the paper "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" by Jingyang Yuan and colleagues, which received a Best Paper Award at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) in Vienna. [1][2][3]
The core problem NSA targets is that the self-attention operation in a transformer has computational cost that grows quadratically, O(L^2), with sequence length L. For modern context windows of tens or hundreds of thousands of tokens, attention comes to dominate both the compute of the forward and backward passes during training and the memory traffic of the key-value (KV) cache during decoding. Sparse attention, which computes only a subset of the query-key interactions, is the obvious remedy, but two practical obstacles had limited earlier methods. First, many sparse schemes are applied only at inference time on top of a model that was pretrained with full attention, so the model never learns to operate with the sparsity and a train-inference mismatch degrades quality. Second, theoretical reductions in floating point operations frequently fail to produce real wall-clock speedups because the sparse access pattern is irregular and does not map well to GPU memory hierarchies or to optimized kernels such as FlashAttention. [1]
NSA's contribution is to address both obstacles at once through two stated design principles. It is hardware-aligned, meaning its sparse pattern is organized into contiguous blocks and its kernels are written to keep the GPU's arithmetic units busy and to cooperate with grouped-query attention. It is also natively trainable, meaning the sparsity is present and learned from the start of pretraining rather than bolted on afterward. The authors report that an NSA model pretrained end-to-end matches or exceeds an equivalent full-attention model on general, long-context, and reasoning benchmarks, while running roughly 6 to 11.6 times faster, depending on the stage, on 64,000-token sequences. [1]
In a standard attention layer, every query token attends to every preceding key token, so producing the output for a sequence of length L requires on the order of L^2 query-key dot products and the same number of weighted value sums. Doubling the context roughly quadruples the attention work. During autoregressive decoding the bottleneck shifts from arithmetic to memory: each newly generated token must read the entire stored KV cache, so the per-token cost grows linearly with the context already produced and the KV cache becomes a dominant consumer of memory bandwidth. [1]
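The scaling described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it counts one dot product per causal query-key pair, ignoring heads, layers, and head dimension.

```python
def causal_pairs(seq_len: int) -> int:
    """Query-key dot products under causal full attention.

    Query i (0-based) attends to keys 0..i, so the total is 1 + 2 + ... + L,
    which grows on the order of L^2.
    """
    return seq_len * (seq_len + 1) // 2


def decode_reads(context_len: int) -> int:
    """KV-cache entries read to produce one new token (linear in context)."""
    return context_len


# Doubling the context roughly quadruples the prefill work,
# while the per-token decoding read grows only linearly.
for length in (8_000, 16_000, 32_000, 64_000):
    print(length, causal_pairs(length), decode_reads(length))
```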
A long line of work has tried to make attention sub-quadratic, including fixed sparse patterns, clustering or hashing of keys, low-rank and kernelized approximations, and KV-cache eviction policies that keep only the tokens judged most important. DeepSeek's own multi-head latent attention attacks the memory side by compressing the cache into a low-rank latent. The NSA authors observe that many such techniques share two weaknesses they set out to avoid. Inference-only sparsification, for example evicting tokens after a full-attention pretrain, forces the model to operate in a regime it was never trained for. And methods whose sparsity is token-by-token rather than block-by-block tend to scatter memory reads, so even when they cut FLOPs they stall the hardware and lose the speedup in practice. NSA is explicitly engineered so that the pattern it learns is also a pattern the hardware can execute efficiently. [1]
For each query position, NSA constructs three separate sets of keys and values from the preceding context, attends to each set independently, and then combines the three resulting outputs with learned gates. The branches capture context at different granularities. [1]
The compression branch gives the query a cheap, global view of the whole history. Consecutive keys, and separately values, are grouped into blocks, and each block is mapped to a single compressed key or value by a small learnable multilayer perceptron that also encodes the within-block positions. In the paper's configuration the compression block length is 32 tokens with a sliding stride of 16. Because a long sequence collapses to a much shorter sequence of compressed tokens, attending to them is inexpensive, and the resulting attention scores double as a coarse map of which regions of the context matter most. [1]
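A minimal sketch of the blocking step follows, using the block length and stride quoted above. It makes one loud assumption: mean pooling stands in for the learnable MLP with within-block position encoding, which is not reproduced here. The helper names `block_starts` and `compress` are hypothetical.

```python
def block_starts(seq_len: int, block_len: int = 32, stride: int = 16) -> list[int]:
    """Start offsets of every complete block that fits in the sequence."""
    if seq_len < block_len:
        return []
    return list(range(0, seq_len - block_len + 1, stride))


def compress(vectors: list[list[float]], block_len: int = 32,
             stride: int = 16) -> list[list[float]]:
    """Collapse each overlapping block of keys (or values) into one vector.

    ASSUMPTION: mean pooling replaces the learnable MLP described in the
    article; only the blocking and striding are illustrated.
    """
    dim = len(vectors[0]) if vectors else 0
    compressed = []
    for start in block_starts(len(vectors), block_len, stride):
        block = vectors[start:start + block_len]
        compressed.append([sum(v[j] for v in block) / block_len for j in range(dim)])
    return compressed


keys = [[float(i)] for i in range(64)]   # toy 1-dimensional keys
print(len(compress(keys)))               # blocks start at 0, 16, and 32
```

Because the stride is half the block length, adjacent blocks overlap by 16 tokens, and a sequence of L tokens yields roughly L/16 compressed tokens under this counting.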
Coarse compression alone loses fine detail, so the selection branch restores it by attending at full resolution to a small number of important blocks. NSA reuses the attention scores already computed against the compressed tokens as importance scores, aggregates them onto a grid of fixed selection blocks (block size 64 in the paper), and keeps only the top-scoring blocks (the top 16). Selecting whole contiguous blocks rather than scattered individual tokens is a deliberate hardware choice, because it produces coalesced memory reads. Crucially, when the model uses grouped-query attention, all query heads in a group share one set of selected blocks, computed by summing their importance scores, so the expensive KV cache is loaded once per group rather than once per head. [1]
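The selection step can be sketched as follows. This is a simplified illustration, not the paper's exact formula: it assumes each compressed block's score is credited to every selection block it overlaps, and the function name `select_blocks` and its parameters are invented for this example. The sum over heads reflects the shared selection within a grouped-query group.

```python
def select_blocks(cmp_scores_per_head: list[list[float]], seq_len: int,
                  cmp_len: int = 32, stride: int = 16,
                  sel_len: int = 64, top_n: int = 16) -> list[int]:
    """Pick the selection blocks one query group attends to at full resolution.

    cmp_scores_per_head[h][i] is head h's attention score on compressed
    block i (the block covering tokens i*stride .. i*stride + cmp_len - 1).
    """
    num_sel = (seq_len + sel_len - 1) // sel_len
    importance = [0.0] * num_sel
    for head_scores in cmp_scores_per_head:        # sum over heads in the group
        for idx, score in enumerate(head_scores):
            first_tok = idx * stride
            last_tok = first_tok + cmp_len - 1
            lo = first_tok // sel_len
            hi = min(last_tok // sel_len, num_sel - 1)
            for blk in range(lo, hi + 1):          # ASSUMPTION: credit every overlap
                importance[blk] += score
    ranked = sorted(range(num_sel), key=lambda b: importance[b], reverse=True)
    return sorted(ranked[:top_n])                  # ascending order, contiguous reads


# Two heads in one group, 256 tokens, so 15 compressed and 4 selection blocks.
head_a = [0.0] * 15
head_a[0] = 1.0      # head A cares about the start of the context
head_b = [0.0] * 15
head_b[13] = 1.0     # head B cares about tokens 208..239
print(select_blocks([head_a, head_b], seq_len=256, top_n=2))
```

Both heads end up reading the same two blocks, which is what lets the KV data be fetched once for the whole group.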
The third branch is a plain local window over the most recent tokens, set to 512 in the paper, which handles fluent local continuation. Giving local context its own branch is a subtle but important detail: local patterns are easy to learn, and if the compression and selection branches had to model them too, gradients would be dominated by the local signal and the long-range machinery might never train properly. Isolating the window lets the other two branches specialize in genuine long-range retrieval. [1]
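For concreteness, the key positions visible to the window branch can be written as a simple range. This sketch assumes the window is causal and counts the query's own position as part of the 512 tokens; the paper's exact boundary convention may differ, and `window_range` is a hypothetical helper.

```python
def window_range(query_pos: int, window: int = 512) -> range:
    """Key positions the sliding-window branch sees for one query.

    ASSUMPTION: the window is causal and includes the query position itself.
    """
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


print(len(window_range(10)))      # early queries see everything so far
print(len(window_range(5000)))    # later queries see only the last `window` tokens
```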
The outputs of the three branches are merged into the layer output by a learned gate. A small MLP followed by a sigmoid reads the query's features and produces a weight in the range 0 to 1 for each branch, and the final output is the gated sum of the three branch attentions. Every component (the compression MLP, the importance-based selection, and the gate) participates in the forward pass on which the model is trained, so gradients flow through them and the model learns how to compress, what to select, and how much to trust each branch. [1]
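The gated sum can be illustrated in a few lines. In this sketch the gate logits are passed in directly as a stand-in for the output of the gate MLP, which is not modeled; the branch names and the `gated_output` function are assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_output(branch_outputs: dict[str, list[float]],
                 gate_logits: dict[str, float]) -> list[float]:
    """Blend per-branch attention outputs with independent sigmoid gates.

    ASSUMPTION: gate_logits stands in for the gate MLP's output for this query.
    """
    dim = len(next(iter(branch_outputs.values())))
    out = [0.0] * dim
    for name, vec in branch_outputs.items():
        gate = sigmoid(gate_logits[name])      # each gate lies in (0, 1)
        for j in range(dim):
            out[j] += gate * vec[j]
    return out


branches = {"compress": [2.0, 0.0], "select": [0.0, 4.0], "window": [2.0, 2.0]}
logits = {"compress": 0.0, "select": 0.0, "window": 0.0}   # every gate equals 0.5
print(gated_output(branches, logits))
```

Because each branch has its own sigmoid rather than sharing a softmax, the three weights are independent and need not sum to 1.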
The hardware-aligned principle shows up in NSA's custom kernels, written in Triton and built on the same IO-aware, blockwise philosophy as FlashAttention. For the selection branch, which is the part with a data-dependent pattern, the kernel loops over query groups on the outer grid, loads the small set of selected KV blocks for each group into fast on-chip SRAM on the inner loop, and performs the attention there. Sharing the selected blocks across all heads of a grouped-query group balances the arithmetic intensity, the ratio of compute to memory traffic, so that the GPU's tensor cores stay busy instead of waiting on memory. This is how the theoretical sparsity is converted into real speed. [1]
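The effect of sharing selected blocks across a group can be seen by counting KV block fetches in a plain Python loop nest. This is a counting model only, not Triton code, and the head counts in the usage lines are illustrative values rather than the paper's configuration; `kv_block_loads` is a hypothetical name.

```python
def kv_block_loads(num_query_heads: int, heads_per_group: int,
                   blocks_selected: int, share_within_group: bool) -> int:
    """Count KV block fetches for one query position."""
    loads = 0
    num_groups = num_query_heads // heads_per_group
    for _group in range(num_groups):                 # outer loop: one program per query group
        if share_within_group:
            for _block in range(blocks_selected):    # inner loop: fetch each block once
                loads += 1                           # every head in the group reuses it
        else:
            for _head in range(heads_per_group):     # per-head selection: no reuse
                for _block in range(blocks_selected):
                    loads += 1
    return loads


print(kv_block_loads(64, 16, 16, share_within_group=True))
print(kv_block_loads(64, 16, 16, share_within_group=False))
```

With sharing, the same amount of arithmetic is performed on a fraction of the memory traffic, which is the sense in which arithmetic intensity improves.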
The natively trainable principle is the conceptual heart of the paper. Rather than training a dense model and then discarding part of the attention computation at inference, NSA is part of the architecture during pretraining, so the model adapts its representations to the sparse pattern and there is no train-inference mismatch to pay for at deployment. Training with sparse attention from the start also reduces the cost of pretraining itself, not just inference. The one genuinely discrete operation, choosing the top blocks, is made effectively learnable by deriving the importance scores from the differentiable compression branch and by letting the gate learn how heavily to weight the selected output. [1]
| NSA component | Role | Paper setting |
|---|---|---|
| Compression block / stride | coarse global summary | length 32, stride 16 |
| Selection block / count | fine-grained important regions | block 64, keep top 16 |
| Sliding window | local recent context | 512 tokens |
| Gate | combine the three branches | MLP plus sigmoid per branch |
NSA was validated by pretraining a 27-billion-parameter mixture-of-experts model (about 3 billion parameters active per token, 30 layers, with 72 routed plus 2 shared experts) on 270 billion tokens, alongside an identical full-attention baseline. On a suite of general benchmarks spanning knowledge, reasoning, and coding, the NSA model's average score slightly exceeded the full-attention model's. On the LongBench long-context suite it scored higher as well, and on a 64,000-token needle-in-a-haystack retrieval test it achieved perfect accuracy at every depth. After both models were given chain-of-thought reasoning ability distilled from DeepSeek-R1, the NSA variant outperformed the full-attention variant on AIME mathematics competition problems, indicating that the learned sparsity does not harm and may even help long-form reasoning. [1]
The efficiency gains are the headline result. Measured against a full-attention baseline implemented with FlashAttention on 64,000-token sequences, the authors reported roughly 9 times faster forward propagation, 6 times faster backward propagation, and an 11.6 times speedup in decoding, the last reflecting how much less of the KV cache must be read for each generated token. These speedups grow with context length, since they come from replacing a quadratic cost with one closer to linear. [1]
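A back-of-the-envelope estimate shows why the decoding speedup is plausible. The sketch below counts how many key-value entries one query touches under the settings in the table above. It is not the paper's measurement: it counts each branch separately, ignores overlap between branches, and assumes a particular boundary convention for the number of compressed blocks. The function name is hypothetical.

```python
def nsa_tokens_per_query(context_len: int, cmp_len: int = 32, stride: int = 16,
                         sel_len: int = 64, top_n: int = 16,
                         window: int = 512) -> int:
    """Approximate KV entries read per decoded token across the three branches."""
    # ASSUMPTION: only complete compression blocks are counted.
    compressed = (context_len - cmp_len) // stride + 1 if context_len >= cmp_len else 0
    selected = min(top_n * sel_len, context_len)
    local = min(window, context_len)
    return compressed + selected + local


context = 65_536
sparse = nsa_tokens_per_query(context)
print(sparse, context / sparse)
```

At a 65,536-token context this gives about 5,600 entries per query instead of 65,536, a reduction factor between 11 and 12, in line with the reported decoding speedup.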
NSA builds on the lineage of FlashAttention: its kernels apply the same blockwise, SRAM-tiling approach to NSA's specific sparse pattern, and its full-attention baseline is itself a FlashAttention implementation. Where FlashAttention computes dense attention exactly but efficiently, NSA computes a learned sparse subset efficiently. [1][6]
NSA appeared within days of Mixture of Block Attention (MoBA), released in February 2025 by Moonshot AI and deployed in its Kimi assistant. MoBA also makes block-sparse attention trainable, but it does so by applying the mixture-of-experts routing idea directly to attention: the context is partitioned into blocks and each query is routed to its top-k blocks according to the dot product between the query and each block's mean-pooled key. Compared with MoBA's single learned routing mechanism, NSA adds the explicit compression and sliding-window branches and the three-way gate, and it places heavier emphasis on grouped-query-aware kernels. Both works demonstrated that sparsity learned during pretraining can match full attention, a notable convergence from two independent laboratories in the same month. [1][4]
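The MoBA routing rule described above can be sketched in the same style. This is a toy illustration of scoring blocks by the dot product between the query and each block's mean-pooled key; it ignores causal masking and every other detail of the released implementation, and `moba_route` is a hypothetical name.

```python
def moba_route(query: list[float], keys: list[list[float]],
               block_size: int, top_k: int) -> list[int]:
    """Return the indices of the top-k key blocks for one query."""
    num_blocks = len(keys) // block_size           # complete blocks only
    scores = []
    for b in range(num_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(k[j] for k in block) / block_size for j in range(len(query))]
        scores.append(sum(q * m for q, m in zip(query, mean_key)))
    ranked = sorted(range(num_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


keys = [[0.0, 1.0], [0.0, 1.0],     # block 0: orthogonal to the query
        [1.0, 0.0], [1.0, 0.0],     # block 1: aligned with the query
        [0.5, 0.0], [0.5, 0.0]]     # block 2: weakly aligned
print(moba_route([1.0, 0.0], keys, block_size=2, top_k=2))
```

Unlike NSA's selection branch, the score here comes from a fixed pooling of the keys rather than from a learned compression branch.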
NSA is also the research predecessor of DeepSeek Sparse Attention (DSA), the mechanism DeepSeek shipped in its experimental DeepSeek-V3.2-Exp model on September 29, 2025. DSA keeps NSA's goal of trainable, fine-grained sparsity but simplifies the design: a lightweight "lightning indexer" scores the relevance of preceding tokens to the current query, and a fine-grained selector then keeps only the highest-scoring tokens for full attention, layered on top of multi-head latent attention. Whereas NSA combines three granularities through a gate, DSA concentrates on efficient fine-grained token selection, and DeepSeek used it to cut the model's published API prices by roughly half. Together, NSA, MoBA, and DSA mark a shift away from inference-only sparse attention toward sparsity that is a native, trained-in property of the model. [1][5]