Sparse attention is a family of techniques that reduce the computational and memory cost of the attention mechanism in transformer models by restricting each token to attend to only a subset of other tokens in the sequence, rather than every token. In the standard self-attention operation, every token computes attention scores with every other token, resulting in O(n²) time and memory complexity for a sequence of length n. As sequence lengths grow into the thousands or millions, this quadratic cost becomes the primary bottleneck for both training and inference. Sparse attention addresses this by defining structured or learned sparsity patterns in the attention matrix so that each token attends only to a carefully selected set of positions, reducing complexity to O(n), O(n log n), or O(n√n) depending on the method.
The core operation of a transformer is scaled dot-product attention, defined as:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where Q (queries), K (keys), and V (values) are matrices derived from the input sequence, and d_k is the dimension of the key vectors. The matrix product QK^T produces an n x n attention matrix, where n is the sequence length. Computing, storing, and backpropagating through this matrix requires O(n²) floating-point operations and O(n²) memory.
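For reference, dense attention can be sketched in a few lines of NumPy. The `dense_attention` helper and the toy sizes below are illustrative, not taken from any of the papers discussed; note that the full n x n weight matrix is materialized, which is exactly the cost sparse attention avoids:

```python
import numpy as np

def dense_attention(Q, K, V):
    """Standard scaled dot-product attention; materializes the full n x n matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d_v) output

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = dense_attention(Q, K, V)
```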
For a sequence of 512 tokens (a common length in early BERT models), the attention matrix contains roughly 262,000 entries per head, which is manageable. However, for a sequence of 16,384 tokens, the matrix contains over 268 million entries per head, and for 100,000 tokens, the count reaches 10 billion. This quadratic scaling means that doubling the sequence length quadruples the compute and memory requirements, making it impractical to process long documents, books, codebases, or high-resolution images with standard dense attention.
Sparse attention methods aim to preserve the expressiveness of the attention mechanism while replacing the dense n x n matrix with a sparse one that has far fewer non-zero entries. The sparsity pattern determines which token pairs are included in the computation, and the design of this pattern is what distinguishes different sparse attention approaches.
The Sparse Transformer, introduced by Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever at OpenAI in April 2019, was one of the first major works to propose factorized sparse attention patterns for generative modeling. The paper, titled "Generating Long Sequences with Sparse Transformers," demonstrated that carefully designed sparsity patterns could reduce the attention complexity from O(n²) to O(n√n) while achieving state-of-the-art results on density estimation benchmarks including Enwik8, CIFAR-10, and ImageNet-64.
In strided attention, the full attention computation is factorized into two separate attention heads with complementary patterns. One head attends to the previous l tokens in a contiguous local window, while the other head attends to every l-th token (i.e., at a fixed stride). The stride l is chosen to be approximately √n, so each head attends to roughly √n positions. Since both heads attend to O(√n) positions per query, and the sequence has n positions, the total cost is O(n√n).
Strided attention works well for data that has a natural two-dimensional structure, such as images. When image pixels are arranged in raster scan order, attending at a stride of √n is equivalent to attending along the column dimension, while the local window covers the row dimension. The combination of row and column attention gives each token indirect access to the entire image within two layers.
For data without a clear two-dimensional layout (such as text), the Sparse Transformer uses a "fixed" factorized pattern. In this pattern, the sequence is divided into blocks of a fixed size (for example, 128 tokens). One attention head attends to all tokens within the same block, while the other head attends to the last element of each preceding block. The last element of each block acts as a summary token that aggregates information from its block and relays it to future tokens. This pattern also achieves O(n√n) complexity.
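Both factorized patterns can be expressed as boolean masks over the attention matrix. The sketch below builds causal versions of the strided and fixed patterns; the helper names and block sizes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def strided_mask(n, stride):
    """Causal strided pattern: local window of `stride` tokens (head 1)
    plus every stride-th earlier token (head 2)."""
    i = np.arange(n)[:, None]           # query positions
    j = np.arange(n)[None, :]           # key positions
    causal = j <= i
    local = (i - j) < stride            # head 1: previous `stride` tokens
    strided = ((i - j) % stride) == 0   # head 2: fixed-stride positions
    return causal & (local | strided)

def fixed_mask(n, block):
    """Causal fixed pattern: own block (head 1), plus the last token of
    each block, which acts as a summary token (head 2)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    same_block = (i // block) == (j // block)   # head 1: within-block
    summary = (j % block) == (block - 1)        # head 2: block summary tokens
    return causal & (same_block | summary)

m = strided_mask(16, 4)
```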
Beyond the attention pattern, the paper introduced several additional contributions: gradient checkpointing (recomputation) of attention matrices to reduce memory usage, efficient GPU kernels for block-sparse matrix operations, and architectural modifications including pre-activation residual connections and careful initialization to enable training of deeper networks (up to hundreds of layers).
The Longformer, developed by Iz Beltagy, Matthew E. Peters, and Arman Cohan at the Allen Institute for AI (AI2), was presented in April 2020 and introduced a combination of local windowed attention and task-specific global attention that scales linearly with sequence length, achieving O(n) complexity.
The primary component of Longformer's attention is a sliding window of fixed size w centered on each token. Each token attends only to w/2 tokens on each side, resulting in O(n * w) total computation. Since w is a fixed constant (not dependent on n), the complexity is O(n). The window size can vary across layers: the Longformer uses smaller windows (e.g., 32 tokens) in lower layers to capture local syntactic patterns, and larger windows (e.g., 512 tokens) in higher layers to build broader semantic representations.
To extend the receptive field without increasing the window size, the Longformer supports dilated sliding windows, analogous to dilated convolutions in convolutional neural networks. With a dilation factor of d, the window skips every d-1 positions, so a window of size w covers a span of w * d tokens. In practice, the Longformer uses a small amount of dilation on a subset of attention heads in higher layers, while lower layers use standard (non-dilated) windows to capture local context.
For tasks that require a holistic understanding of the entire sequence, the Longformer designates a small number of tokens as "global" attention tokens. These global tokens attend to every token in the sequence, and every token attends to the global tokens. The identity of global tokens is task-dependent: for classification, the [CLS] token uses global attention; for question answering, all question tokens use global attention. Since the number of global tokens is small (typically O(1) or proportional to the question length, not the document length), the addition of global attention does not change the overall O(n) complexity.
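The combination of sliding-window, dilated, and global attention can be illustrated with a mask-construction sketch. The `longformer_mask` helper and its parameter convention (`window` is the half-width, i.e., w/2 positions on each side) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def longformer_mask(n, window, dilation=1, global_idx=()):
    """Sliding-window mask with optional dilation and symmetric global attention.
    `window` is the number of attended positions on each side of a token."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    dist = np.abs(i - j)
    # dilated window: offsets 0, d, 2d, ... out to window * d on each side
    mask = (dist <= window * dilation) & (dist % dilation == 0)
    for g in global_idx:
        mask[g, :] = True   # global token attends to every position
        mask[:, g] = True   # every position attends to the global token
    return mask

m = longformer_mask(12, window=2, global_idx=(0,))
```

With `dilation=2`, the same window size covers twice the span while skipping alternate positions, mirroring dilated convolutions.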
The Longformer set state-of-the-art results on text8 and enwik8 for character-level language modeling, and the pretrained Longformer consistently outperformed RoBERTa on long document tasks. The authors also introduced Longformer-Encoder-Decoder (LED) for long-document summarization.
The BigBird model, proposed by Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed at Google Research, was published in July 2020 and presented at NeurIPS 2020. BigBird combines three types of sparse attention to achieve O(n) complexity while preserving important theoretical properties of full attention.
BigBird's sparse attention mechanism consists of three components:
| Component | Description | Tokens attended per query |
|---|---|---|
| Window (local) attention | Each token attends to w neighboring tokens on each side | O(w), where w is the window size |
| Global attention | A set of g designated global tokens attend to all tokens, and all tokens attend to them | O(g) per token (g is a small constant) |
| Random attention | Each token attends to r randomly selected tokens from the sequence | O(r), where r is a small constant |
The combination of these three patterns ensures that every pair of tokens is connected through a short path (typically two or three hops), which allows information to propagate across the entire sequence over multiple layers.
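A mask-level sketch of the three combined components might look as follows; the helper name, the default sizes, and the choice of the first `n_global` positions as global tokens are illustrative assumptions rather than BigBird's actual block-level implementation:

```python
import numpy as np

def bigbird_mask(n, window=1, n_global=1, n_random=2, seed=0):
    """Window + global + random attention, combined as in BigBird (sketch)."""
    rng = np.random.default_rng(seed)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window      # local window around each token
    mask[:n_global, :] = True           # global tokens attend to all tokens
    mask[:, :n_global] = True           # all tokens attend to global tokens
    for q in range(n):                  # r randomly chosen keys per query
        mask[q, rng.choice(n, size=n_random, replace=False)] = True
    return mask

m = bigbird_mask(16)
```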
A notable contribution of the BigBird paper is its theoretical analysis. The authors proved that the BigBird attention mechanism is a universal approximator of sequence-to-sequence functions, meaning it can approximate any continuous function mapping sequences to sequences. They further showed that BigBird is Turing complete, preserving this property from the full attention transformer. These results demonstrate that the sparse attention pattern does not fundamentally limit the model's representational capacity. A key theoretical finding was that the presence of O(1) global tokens is sufficient for maintaining these properties.
BigBird could handle sequences up to 8 times longer than what was previously possible on similar hardware, supporting sequence lengths up to 4,096 tokens in their experiments. The model achieved strong results on NLP tasks including question answering (Natural Questions, TriviaQA), document summarization, and genomics applications where long-range dependencies in DNA sequences are important.
The Routing Transformer, developed by Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier, was published in the Transactions of the Association for Computational Linguistics (TACL) in 2021. Unlike the Sparse Transformer, Longformer, and BigBird, which use fixed (predetermined) sparsity patterns, the Routing Transformer learns dynamic, content-dependent sparse attention patterns.
The central idea is to use online k-means clustering to group queries and keys that are likely to have high attention scores. Queries and keys are assigned to their nearest cluster centroid, and each query attends only to the keys assigned to the same cluster. The centroids themselves are maintained online during training (updated as exponential moving averages of the vectors assigned to them), so the routing adapts to the content of the data rather than following a fixed pattern.
The clustering reduces the attention complexity from O(n²d) to O(n^{1.5}d), where d is the hidden dimension. The model avoids the rigidity of fixed patterns (which may miss important long-range dependencies) while still being substantially more efficient than dense attention.
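A minimal single-head sketch of content-based routing is shown below, with fixed centroids passed in as arguments rather than maintained by the paper's online k-means updates; the function name and toy sizes are illustrative:

```python
import numpy as np

def routing_attention(Q, K, V, centroids):
    """Each query attends only to keys assigned to the same (nearest) centroid.
    In the Routing Transformer the centroids are updated online during training;
    here they are fixed inputs (illustrative sketch)."""
    n, d = Q.shape
    q_cluster = np.argmin(((Q[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    k_cluster = np.argmin(((K[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    out = np.zeros_like(V)
    for q in range(n):
        keys = np.where(k_cluster == q_cluster[q])[0]   # keys in q's cluster
        if keys.size == 0:
            continue
        scores = Q[q] @ K[keys].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[q] = (w / w.sum()) @ V[keys]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 16, 4))
centroids = rng.normal(size=(4, 4))    # 4 clusters over a 4-dim head
out = routing_attention(Q, K, V, centroids)
```

With a single centroid, every key lands in every query's cluster and the sketch reduces to dense attention, which makes the approximation easy to sanity-check.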
The Routing Transformer achieved strong results on language modeling benchmarks, outperforming comparable sparse attention models on Wikitext-103 (15.8 vs. 18.3 perplexity) and on image generation on ImageNet-64 (3.43 vs. 3.44 bits/dim) while using fewer self-attention layers. It also set a new state-of-the-art on the PG-19 dataset, achieving a test perplexity of 33.2 with a 22-layer model trained on sequences of length 8,192.
Sliding window attention has seen widespread adoption in modern large language models (LLMs), particularly after the release of Mistral 7B in September 2023.
Mistral 7B uses a sliding window attention (SWA) mechanism with a window size of 4,096 tokens. Each layer attends only to the previous 4,096 hidden states, yielding O(n * w) computation per layer where w = 4,096. Because the transformer stacks multiple layers, the effective receptive field expands with depth. A hidden state at position i in layer k has indirect access to information from tokens up to w * k positions back in the input. With a window of 4,096 tokens and 32 layers, the theoretical attention span extends to approximately 131,072 tokens.
Mistral 7B pairs sliding window attention with grouped-query attention (GQA) for further efficiency gains during inference. The fixed window size also enables the use of a rolling (rotating) buffer cache of size w, which caps the KV cache memory at a constant regardless of the total sequence length. This design matched or exceeded the performance of LLaMA 2 13B (a model with nearly twice as many parameters) on most benchmarks.
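The rolling buffer idea can be sketched as a fixed-size cache that overwrites its oldest slot; this is an illustrative reconstruction of the concept, not Mistral's implementation:

```python
import numpy as np

class RollingKVCache:
    """Fixed-size rolling buffer for keys/values under sliding window attention.
    Memory stays O(w) no matter how long generation runs (conceptual sketch)."""
    def __init__(self, window, d):
        self.window = window
        self.k = np.zeros((window, d))
        self.v = np.zeros((window, d))
        self.pos = 0                          # total tokens seen so far

    def append(self, k_t, v_t):
        slot = self.pos % self.window         # overwrite the oldest entry
        self.k[slot] = k_t
        self.v[slot] = v_t
        self.pos += 1

    def contents(self):
        """Keys/values currently visible to the newest query."""
        n = min(self.pos, self.window)
        return self.k[:n], self.v[:n]

cache = RollingKVCache(window=4, d=2)
for t in range(6):                            # generate 6 tokens with window 4
    cache.append(np.full(2, t), np.full(2, t))
k, v = cache.contents()                       # only the last 4 tokens survive
```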
Sliding window attention has been adopted in several other production models. Many recent long-context models use it in some or all of their layers, often interleaving sliding-window layers with a small number of full-attention layers to maintain global information flow.
Block-sparse attention is an implementation strategy that organizes the sparse attention computation into dense blocks, which can be executed efficiently on modern GPU hardware. Rather than operating on individual token pairs, block-sparse methods divide the sequence into contiguous blocks (for example, blocks of 64 or 128 tokens) and determine which block pairs should participate in the attention computation. Within each selected block pair, the attention is computed densely.
Modern GPUs achieve peak throughput when performing dense matrix multiplications on regularly shaped tiles. Block-sparse attention exploits this by ensuring that all active computations are dense block multiplications, avoiding the irregular memory access patterns that would arise from truly unstructured sparsity. The sparsity is only at the block level: entire blocks are either included or excluded from the computation.
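The block-level computation can be sketched as follows, with a boolean `layout` matrix selecting which block pairs are active; the helper name and sizes are illustrative, and a real kernel would fuse these steps rather than loop in Python:

```python
import numpy as np

def block_sparse_attention(Q, K, V, block, layout):
    """`layout[qb, kb]` is True if query-block qb attends to key-block kb.
    Each active block pair is computed as a dense matmul (sketch)."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for qb in range(n // block):
        kbs = np.where(layout[qb])[0]         # active key blocks for this query block
        cols = np.concatenate([np.arange(kb * block, (kb + 1) * block) for kb in kbs])
        q = Q[qb * block:(qb + 1) * block]
        scores = q @ K[cols].T / np.sqrt(d)   # dense tile over gathered keys
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[qb * block:(qb + 1) * block] = w @ V[cols]
    return out

rng = np.random.default_rng(0)
n, d, block = 8, 4, 2
Q, K, V = rng.normal(size=(3, n, d))
layout = np.eye(n // block, dtype=bool)       # block-diagonal: local attention only
out = block_sparse_attention(Q, K, V, block, layout)
```

With an all-True layout the result matches dense attention exactly, which is a useful correctness check when implementing custom layouts.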
Several frameworks provide block-sparse attention kernels:
| Framework | Key features |
|---|---|
| OpenAI blocksparse | TensorFlow ops and GPU kernels for block-sparse matrix multiplication; used in the original Sparse Transformer |
| DeepSpeed Sparse Attention | Supports configurable block sizes and multiple sparsity patterns (Fixed, BigBird, BSLongformer); Triton-based kernels; up to 6x speedup over dense attention |
| Block-Sparse FlashAttention | Extends FlashAttention with a block-sparsity mask; skips zero blocks to reduce IO by the sparsity factor |
| MIT-HAN-Lab Block-Sparse-Attention | Supports streaming attention and block-sparse patterns; optimized for Hopper (H100) and Blackwell (B200) GPUs |
In February 2025, DeepSeek introduced Native Sparse Attention (NSA), a hardware-aligned sparse attention mechanism designed to be natively trainable. NSA uses a dynamic hierarchical sparse strategy with three parallel branches: compressed attention for coarse-grained context, selected attention for important token blocks, and sliding attention for local context. The design is optimized for modern GPU memory hierarchies, processing key-value sequences in spatially continuous blocks. NSA was presented at ACL 2025 and demonstrated substantial speedups over full attention on sequences of 64,000 tokens while maintaining or exceeding the performance of dense attention models.
Ring Attention, proposed by Hao Liu, Matei Zaharia, and Pieter Abbeel at UC Berkeley in October 2023 and published at ICLR 2024, addresses the problem of fitting very long sequences across multiple devices rather than reducing the per-device computation through sparsity in the attention pattern.
In Ring Attention, the sequence is partitioned across multiple host devices arranged in a logical ring. Each device holds a segment of the query, key, and value tensors. During the attention computation, key-value blocks are passed around the ring: each device sends its current key-value block to the next device while simultaneously receiving a block from the previous device. At each step, a device computes blockwise attention between its local queries and the received key-value block. After p - 1 communication steps, where p is the number of devices, every query has attended to every key-value pair.
The critical insight is that the communication of key-value blocks can be fully overlapped with the computation of blockwise attention, so the communication overhead is effectively hidden. This allows the approach to scale the context length linearly with the number of devices without approximations, additional communication overhead, or changes to the attention output.
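The ring schedule can be simulated in a single process: each "device" is a shard, a send/receive step becomes an index rotation, and exactness is preserved by accumulating partial results with an online softmax. This is an illustrative sketch, not the authors' distributed implementation:

```python
import numpy as np

def ring_attention(Q_parts, K_parts, V_parts):
    """Single-process simulation of Ring Attention: KV shards rotate around
    the ring while each device accumulates exact attention over its queries."""
    p = len(Q_parts)
    outs = []
    for dev in range(p):
        q = Q_parts[dev]
        d = q.shape[-1]
        m = np.full(q.shape[0], -np.inf)   # running row max (for stability)
        l = np.zeros(q.shape[0])           # running softmax denominator
        acc = np.zeros_like(q)             # running weighted sum of values
        for step in range(p):              # visit every shard once around the ring
            k = K_parts[(dev + step) % p]  # "receive" the next KV shard
            v = V_parts[(dev + step) % p]
            s = q @ k.T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)      # rescale earlier partial results
            w = np.exp(s - m_new[:, None])
            l = l * scale + w.sum(axis=-1)
            acc = acc * scale[:, None] + w @ v
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
Qp, Kp, Vp = (np.split(X, 4) for X in (Q, K, V))   # 4 simulated devices
out = ring_attention(Qp, Kp, Vp)
```

Because the accumulation is exact, the result is bit-for-bit the same attention output as the dense computation, confirming that Ring Attention is a distribution strategy rather than an approximation.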
Ring Attention enables training and inference on sequences that are up to (device count) times longer than what a single device can handle. Combined with memory-efficient attention implementations like FlashAttention, this has enabled context windows of millions of tokens. Experiments demonstrated effectiveness on both language modeling and reinforcement learning tasks.
It is worth noting that Ring Attention is not a sparse attention method in the strict sense, since it computes exact dense attention. Instead, it is a distributed computation strategy that complements sparse attention techniques: one can use Ring Attention to distribute the sequence across devices while also applying sparse attention within each device to further reduce computation.
The following table summarizes the key properties of the major sparse attention methods discussed above.
| Method | Year | Authors / Organization | Attention pattern | Complexity | Key innovation |
|---|---|---|---|---|---|
| Sparse Transformer | 2019 | Child et al. / OpenAI | Strided + fixed factorization | O(n√n) | First factorized sparse attention; efficient GPU kernels |
| Longformer | 2020 | Beltagy et al. / AI2 | Sliding window + dilated + global | O(n) | Linear scaling; task-specific global tokens |
| BigBird | 2020 | Zaheer et al. / Google | Window + global + random | O(n) | Turing completeness proof; universal approximation |
| Routing Transformer | 2021 | Roy et al. / Google | Learned clusters (online k-means) | O(n^{1.5}) | Content-based dynamic sparsity |
| Mistral 7B SWA | 2023 | Mistral AI | Sliding window (w=4096) | O(n * w) | Rolling KV cache; practical LLM deployment |
| Ring Attention | 2023 | Liu et al. / UC Berkeley | Distributed dense (not sparse) | O(n²/devices) | Device-linear context scaling; overlapped communication |
| NSA | 2025 | DeepSeek | Hierarchical: compressed + selected + sliding | O(n) | Hardware-aligned; natively trainable; three-branch design |
FlashAttention, introduced by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré at Stanford University and published at NeurIPS 2022, is often discussed alongside sparse attention but addresses a fundamentally different problem. Understanding the distinction is important.
FlashAttention computes exact dense attention (the same mathematical result as standard attention) but does so with dramatically fewer memory accesses by exploiting the GPU memory hierarchy. The algorithm uses tiling: it partitions the Q, K, and V matrices into blocks that fit in GPU on-chip SRAM, computes partial attention results in fast SRAM, and writes only the final output back to slow GPU HBM (high-bandwidth memory). The large n x n attention matrix is never fully materialized in HBM. This achieves up to 7.6x speedup over standard attention and uses O(n) HBM rather than O(n²), but the total number of floating-point operations remains O(n²).
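The tiling strategy can be sketched in NumPy: the online-softmax recurrence below produces exact attention while only ever materializing block-sized score tiles. This is an illustrative sketch of the idea; the real kernel fuses these steps in on-chip SRAM rather than looping in Python:

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Exact attention computed tile by tile; only (block x block) score tiles
    are materialized, mirroring FlashAttention's IO-aware strategy."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    for i in range(0, n, block):               # one query tile at a time
        q = Q[i:i + block]
        m = np.full(q.shape[0], -np.inf)       # running row max
        l = np.zeros(q.shape[0])               # running softmax denominator
        acc = np.zeros_like(q)
        for j in range(0, n, block):           # stream over key/value tiles
            s = q @ K[j:j + block].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)          # rescale previous partial results
            w = np.exp(s - m_new[:, None])
            l = l * scale + w.sum(axis=-1)
            acc = acc * scale[:, None] + w @ V[j:j + block]
            m = m_new
        out[i:i + block] = acc / l[:, None]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
out = tiled_attention(Q, K, V)
```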
Sparse attention methods reduce the actual number of operations by skipping attention computations between token pairs that are masked out by the sparsity pattern. This means sparse attention computes an approximate result (compared to dense attention) because some token interactions are ignored entirely. The benefit is a genuine reduction in FLOPs, not just better memory access patterns.
FlashAttention and sparse attention are complementary rather than competing. Block-sparse FlashAttention combines both ideas: it uses the tiling and IO-aware computation strategy of FlashAttention while skipping entire blocks that are zeroed out by a block-sparsity mask. This yields both the IO efficiency of FlashAttention and the FLOP reduction of sparse attention. In practice, many modern systems layer these techniques together with other optimizations such as KV cache quantization and grouped-query attention.
| Aspect | FlashAttention | Sparse attention |
|---|---|---|
| Attention result | Exact (dense) | Approximate (subset of interactions) |
| FLOP reduction | No (same O(n²) FLOPs) | Yes (reduced to O(n), O(n√n), etc.) |
| Memory optimization | Yes (O(n) HBM via tiling) | Yes (fewer entries computed and stored) |
| Primary mechanism | IO-aware tiling to reduce HBM reads/writes | Structured or learned sparsity masks |
| Can be combined | Yes, with block-sparse masks | Yes, with FlashAttention kernels |
Sparse attention techniques have become essential building blocks in the development of long-context language models. Several trends highlight their importance:
Modern LLMs have rapidly expanded their context windows from the 512 tokens of early BERT models to 128,000 tokens (GPT-4 Turbo), 200,000 tokens (the Claude 3 model family), and even 1,000,000 tokens (Gemini 1.5 Pro). Achieving these context lengths requires a combination of sparse attention patterns, efficient attention kernels (FlashAttention), distributed strategies (Ring Attention), and architectural innovations like sliding window layers interleaved with full-attention layers.
Many production models do not use a single attention strategy across all layers. Instead, they combine sparse and dense attention in a hybrid fashion. For example, a model might use sliding window attention in most layers for efficient local processing, while reserving a few layers for full (dense) attention to capture global dependencies. This hybrid approach balances computational efficiency with the ability to reason over the entire context.
Sparse attention models like the Longformer and BigBird have found particular success in tasks involving long documents, such as question answering over scientific papers, legal document analysis, and summarization of book-length texts. The ability to process thousands of tokens at once, rather than truncating inputs to short fixed lengths, has led to significant improvements on benchmarks like TriviaQA, Natural Questions, and WikiHop.
BigBird demonstrated that sparse attention is valuable outside of natural language processing. In genomics, DNA sequences can be hundreds of thousands of base pairs long, and the ability to model long-range interactions between distant base pairs is critical for tasks such as promoter region classification and chromatin-profile prediction. Sparse attention makes it feasible to apply transformer-based models to these sequences.
Long-context models with sparse attention are increasingly used for code analysis, where entire repositories or large codebases must be processed simultaneously. The ability to handle tens of thousands of lines of code enables applications like cross-file code completion, bug detection across modules, and repository-level code summarization.
Choosing or designing a sparse attention pattern involves several trade-offs:
| Consideration | Description |
|---|---|
| Sparsity level | Higher sparsity reduces compute and memory but may miss important long-range dependencies |
| Pattern structure | Fixed patterns (window, strided) are simple and hardware-friendly; learned patterns (routing) are more flexible but add overhead |
| Hardware alignment | Block-aligned sparsity patterns map well to GPU SIMT architectures; irregular sparsity can underperform despite fewer FLOPs |
| Task requirements | Classification tasks may need only global summary tokens; generation tasks need causal (triangular) sparsity; retrieval tasks need broad coverage |
| Layer-wise variation | Using different patterns at different depths (small windows early, large windows or global attention late) can improve expressiveness |
| Training vs. inference | Some methods (like the Routing Transformer's k-means routing) add overhead during training that may not be present at inference |
The idea of restricting attention to a subset of positions predates the transformer architecture itself. Local attention mechanisms were explored in early sequence-to-sequence models. However, the Sparse Transformer (2019) was the first work to systematically design and evaluate factorized sparse patterns for transformers and demonstrate their effectiveness at scale.
The period from 2020 to 2021 saw rapid development, with the Longformer, BigBird, Routing Transformer, Performer (which uses random feature maps for linear attention), and Linformer (which uses linear projections to approximate the attention matrix) all proposing different strategies to address the quadratic bottleneck. Each made different trade-offs between computational complexity, implementation simplicity, theoretical guarantees, and empirical performance.
From 2022 onward, the focus shifted toward combining sparse attention with hardware-aware implementations (FlashAttention), distributed computation (Ring Attention), and hybrid architectures that mix sparse and dense layers. The release of Mistral 7B in 2023 demonstrated that sliding window attention could be used effectively in a production-grade open-weight language model, and DeepSeek's NSA in 2025 showed that natively trainable sparse attention with hardware-aligned kernels could match full attention quality at substantially lower cost.