Sparse attention is a family of techniques that reduce the computational and memory cost of the attention mechanism in transformer models by restricting each token to attend to only a subset of other tokens in the sequence, rather than every token. In the standard self-attention operation, every token computes attention scores with every other token, resulting in O(n²) time and memory complexity for a sequence of length n. As sequence lengths grow into the thousands or millions, this quadratic cost becomes the primary bottleneck for both training and inference. Sparse attention addresses this by defining structured or learned sparsity patterns in the attention matrix so that each token attends only to a carefully selected set of positions, reducing complexity to O(n), O(n log n), or O(n√n) depending on the method.
The core operation of a transformer is scaled dot-product attention, defined as:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where Q (queries), K (keys), and V (values) are matrices derived from the input sequence, and d_k is the dimension of the key vectors. The matrix product QK^T produces an n x n attention matrix, where n is the sequence length. Computing, storing, and backpropagating through this matrix requires O(n²) floating-point operations and O(n²) memory.
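For reference, dense attention can be sketched in a few lines of NumPy. The `dense_attention` helper and the toy sizes below are illustrative, not taken from any of the papers discussed; note that the full n x n weight matrix is materialized, which is exactly the cost sparse attention avoids:

```python
import numpy as np

def dense_attention(Q, K, V):
    """Standard scaled dot-product attention; materializes the full n x n matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n, d_v) output

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = dense_attention(Q, K, V)
```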
For a sequence of 512 tokens (a common length in early BERT models), the attention matrix contains roughly 262,000 entries per head, which is manageable. However, for a sequence of 16,384 tokens, the matrix contains over 268 million entries per head, and for 100,000 tokens, the count reaches 10 billion. This quadratic scaling means that doubling the sequence length quadruples the compute and memory requirements, making it impractical to process long documents, books, codebases, or high-resolution images with standard dense attention.
Sparse attention methods aim to preserve the expressiveness of the attention mechanism while replacing the dense n x n matrix with a sparse one that has far fewer non-zero entries. The sparsity pattern determines which token pairs are included in the computation, and the design of this pattern is what distinguishes different sparse attention approaches.
The Sparse Transformer, introduced by Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever at OpenAI in April 2019, was one of the first major works to propose factorized sparse attention patterns for generative modeling. The paper, titled "Generating Long Sequences with Sparse Transformers," demonstrated that carefully designed sparsity patterns could reduce the attention complexity from O(n²) to O(n√n) while achieving state-of-the-art results on density estimation benchmarks including Enwik8, CIFAR-10, and ImageNet-64.
In strided attention, the full attention computation is factorized into two separate attention heads with complementary patterns. One head attends to the previous l tokens in a contiguous local window, while the other head attends to every l-th token (i.e., at a fixed stride). The stride l is chosen to be approximately √n, so each head attends to roughly √n positions. Since both heads attend to O(√n) positions per query, and the sequence has n positions, the total cost is O(n√n).
Strided attention works well for data that has a natural two-dimensional structure, such as images. When image pixels are arranged in raster scan order, attending at a stride of √n is equivalent to attending along the column dimension, while the local window covers the row dimension. The combination of row and column attention gives each token indirect access to the entire image within two layers.
For data without a clear two-dimensional layout (such as text), the Sparse Transformer uses a "fixed" factorized pattern. In this pattern, the sequence is divided into blocks of a fixed size (for example, 128 tokens). One attention head attends to all tokens within the same block, while the other head attends to the last element of each preceding block. The last element of each block acts as a summary token that aggregates information from its block and relays it to future tokens. This pattern also achieves O(n√n) complexity.
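Both factorized patterns can be expressed as boolean masks over the attention matrix. The sketch below builds causal versions of the strided and fixed patterns; the helper names and block sizes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def strided_mask(n, stride):
    """Causal strided pattern: local window of `stride` tokens (head 1)
    plus every stride-th earlier token (head 2)."""
    i = np.arange(n)[:, None]           # query positions
    j = np.arange(n)[None, :]           # key positions
    causal = j <= i
    local = (i - j) < stride            # head 1: previous `stride` tokens
    strided = ((i - j) % stride) == 0   # head 2: fixed-stride positions
    return causal & (local | strided)

def fixed_mask(n, block):
    """Causal fixed pattern: own block (head 1), plus the last token of
    each block, which acts as a summary token (head 2)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    same_block = (i // block) == (j // block)   # head 1: within-block
    summary = (j % block) == (block - 1)        # head 2: block summary tokens
    return causal & (same_block | summary)

m = strided_mask(16, 4)
```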
Beyond the attention pattern, the paper introduced several additional contributions: gradient checkpointing (recomputation) of attention matrices to reduce memory usage, efficient GPU kernels for block-sparse matrix operations, and architectural modifications including pre-activation residual connections and careful initialization to enable training of deeper networks (up to hundreds of layers).
The Longformer, developed by Iz Beltagy, Matthew E. Peters, and Arman Cohan at the Allen Institute for AI (AI2), was presented in April 2020 and introduced a combination of local windowed attention and task-specific global attention that scales linearly with sequence length, achieving O(n) complexity.
The primary component of Longformer's attention is a sliding window of fixed size w centered on each token. Each token attends only to w/2 tokens on each side, resulting in O(n * w) total computation. Since w is a fixed constant (not dependent on n), the complexity is O(n). The window size can vary across layers: the Longformer uses smaller windows (e.g., 32 tokens) in lower layers to capture local syntactic patterns, and larger windows (e.g., 512 tokens) in higher layers to build broader semantic representations.
To extend the receptive field without increasing the window size, the Longformer supports dilated sliding windows, analogous to dilated convolutions in convolutional neural networks. With a dilation factor of d, the window skips every d-1 positions, so a window of size w covers a span of w * d tokens. In practice, the Longformer uses a small amount of dilation on a subset of attention heads in higher layers, while lower layers use standard (non-dilated) windows to capture local context.
For tasks that require a holistic understanding of the entire sequence, the Longformer designates a small number of tokens as "global" attention tokens. These global tokens attend to every token in the sequence, and every token attends to the global tokens. The identity of global tokens is task-dependent: for classification, the [CLS] token uses global attention; for question answering, all question tokens use global attention. Since the number of global tokens is small (typically O(1) or proportional to the question length, not the document length), the addition of global attention does not change the overall O(n) complexity.
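The combination of sliding-window, dilated, and global attention can be illustrated with a mask-construction sketch. The `longformer_mask` helper and its parameter convention (`window` is the half-width, i.e., w/2 positions on each side) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def longformer_mask(n, window, dilation=1, global_idx=()):
    """Sliding-window mask with optional dilation and symmetric global attention.
    `window` is the number of attended positions on each side of a token."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    dist = np.abs(i - j)
    # dilated window: offsets 0, d, 2d, ... out to window * d on each side
    mask = (dist <= window * dilation) & (dist % dilation == 0)
    for g in global_idx:
        mask[g, :] = True   # global token attends to every position
        mask[:, g] = True   # every position attends to the global token
    return mask

m = longformer_mask(12, window=2, global_idx=(0,))
```

With `dilation=2`, the same window size covers twice the span while skipping alternate positions, mirroring dilated convolutions.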
The Longformer set state-of-the-art results on text8 and enwik8 for character-level language modeling, and the pretrained Longformer consistently outperformed RoBERTa on long document tasks. The authors also introduced Longformer-Encoder-Decoder (LED) for long-document summarization.
The BigBird model, proposed by Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed at Google Research, was published in July 2020 and presented at NeurIPS 2020. BigBird combines three types of sparse attention to achieve O(n) complexity while preserving important theoretical properties of full attention.
BigBird's sparse attention mechanism consists of three components:
| Component | Description | Tokens attended per query |
|---|---|---|
| Window (local) attention | Each token attends to w neighboring tokens on each side | O(w), where w is the window size |
| Global attention | A set of g designated global tokens attend to all tokens, and all tokens attend to them | O(g) per token (g is a small constant) |
| Random attention | Each token attends to r randomly selected tokens from the sequence | O(r), where r is a small constant |
The combination of these three patterns ensures that every pair of tokens is connected through a short path (typically two or three hops), which allows information to propagate across the entire sequence over multiple layers.
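A mask-level sketch of the three combined components might look as follows; the helper name, the default sizes, and the choice of the first `n_global` positions as global tokens are illustrative assumptions rather than BigBird's actual block-level implementation:

```python
import numpy as np

def bigbird_mask(n, window=1, n_global=1, n_random=2, seed=0):
    """Window + global + random attention, combined as in BigBird (sketch)."""
    rng = np.random.default_rng(seed)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window      # local window around each token
    mask[:n_global, :] = True           # global tokens attend to all tokens
    mask[:, :n_global] = True           # all tokens attend to global tokens
    for q in range(n):                  # r randomly chosen keys per query
        mask[q, rng.choice(n, size=n_random, replace=False)] = True
    return mask

m = bigbird_mask(16)
```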
A notable contribution of the BigBird paper is its theoretical analysis. The authors proved that the BigBird attention mechanism is a universal approximator of sequence-to-sequence functions, meaning it can approximate any continuous function mapping sequences to sequences. They further showed that BigBird is Turing complete, preserving this property from the full attention transformer. These results demonstrate that the sparse attention pattern does not fundamentally limit the model's representational capacity. A key theoretical finding was that the presence of O(1) global tokens is sufficient for maintaining these properties.
BigBird could handle sequences up to 8 times longer than what was previously possible on similar hardware, supporting sequence lengths up to 4,096 tokens in their experiments. The model achieved strong results on NLP tasks including question answering (Natural Questions, TriviaQA), document summarization, and genomics applications where long-range dependencies in DNA sequences are important.
The Routing Transformer, developed by Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier, was published in the Transactions of the Association for Computational Linguistics (TACL) in 2021. Unlike the Sparse Transformer, Longformer, and BigBird, which use fixed (predetermined) sparsity patterns, the Routing Transformer learns dynamic, content-dependent sparse attention patterns.
The central idea is to use online k-means clustering to group queries and keys that are likely to have high attention scores. Queries and keys are assigned to their nearest cluster centroid, and each query attends only to the keys assigned to the same cluster. The centroids themselves are maintained online during training (updated as exponential moving averages of the vectors assigned to them), so the routing adapts to the content of the data rather than following a fixed pattern.
The clustering reduces the attention complexity from O(n²d) to O(n^{1.5}d), where d is the hidden dimension. The model avoids the rigidity of fixed patterns (which may miss important long-range dependencies) while still being substantially more efficient than dense attention.
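A minimal single-head sketch of content-based routing is shown below, with fixed centroids passed in as arguments rather than maintained by the paper's online k-means updates; the function name and toy sizes are illustrative:

```python
import numpy as np

def routing_attention(Q, K, V, centroids):
    """Each query attends only to keys assigned to the same (nearest) centroid.
    In the Routing Transformer the centroids are updated online during training;
    here they are fixed inputs (illustrative sketch)."""
    n, d = Q.shape
    q_cluster = np.argmin(((Q[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    k_cluster = np.argmin(((K[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    out = np.zeros_like(V)
    for q in range(n):
        keys = np.where(k_cluster == q_cluster[q])[0]   # keys in q's cluster
        if keys.size == 0:
            continue
        scores = Q[q] @ K[keys].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[q] = (w / w.sum()) @ V[keys]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 16, 4))
centroids = rng.normal(size=(4, 4))    # 4 clusters over a 4-dim head
out = routing_attention(Q, K, V, centroids)
```

With a single centroid, every key lands in every query's cluster and the sketch reduces to dense attention, which makes the approximation easy to sanity-check.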
The Routing Transformer achieved strong results on language modeling benchmarks, outperforming comparable sparse attention models on Wikitext-103 (15.8 vs. 18.3 perplexity) and on image generation on ImageNet-64 (3.43 vs. 3.44 bits/dim) while using fewer self-attention layers. It also set a new state-of-the-art on the PG-19 dataset, achieving a test perplexity of 33.2 with a 22-layer model trained on sequences of length 8,192.
Sliding window attention has seen widespread adoption in modern large language models (LLMs), particularly after the release of Mistral 7B in September 2023.
Mistral 7B uses a sliding window attention (SWA) mechanism with a window size of 4,096 tokens. Each layer attends only to the previous 4,096 hidden states, yielding O(n * w) computation per layer where w = 4,096. Because the transformer stacks multiple layers, the effective receptive field expands with depth. A hidden state at position i in layer k has indirect access to information from tokens up to w * k positions back in the input. With a window of 4,096 tokens and 32 layers, the theoretical attention span extends to approximately 131,072 tokens.
Mistral 7B pairs sliding window attention with grouped-query attention (GQA) for further efficiency gains during inference. The fixed window size also enables the use of a rolling (rotating) buffer cache of size w, which caps the KV cache memory at a constant regardless of the total sequence length. This design matched or exceeded the performance of LLaMA 2 13B (a model with nearly twice as many parameters) on most benchmarks.
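The rolling buffer idea can be sketched as a fixed-size cache that overwrites its oldest slot; this is an illustrative reconstruction of the concept, not Mistral's implementation:

```python
import numpy as np

class RollingKVCache:
    """Fixed-size rolling buffer for keys/values under sliding window attention.
    Memory stays O(w) no matter how long generation runs (conceptual sketch)."""
    def __init__(self, window, d):
        self.window = window
        self.k = np.zeros((window, d))
        self.v = np.zeros((window, d))
        self.pos = 0                          # total tokens seen so far

    def append(self, k_t, v_t):
        slot = self.pos % self.window         # overwrite the oldest entry
        self.k[slot] = k_t
        self.v[slot] = v_t
        self.pos += 1

    def contents(self):
        """Keys/values currently visible to the newest query."""
        n = min(self.pos, self.window)
        return self.k[:n], self.v[:n]

cache = RollingKVCache(window=4, d=2)
for t in range(6):                            # generate 6 tokens with window 4
    cache.append(np.full(2, t), np.full(2, t))
k, v = cache.contents()                       # only the last 4 tokens survive
```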
Sliding window attention has been adopted in several other production models. Many recent long-context models use it in some or all of their layers, often interleaving sliding-window layers with a small number of full-attention layers to maintain global information flow.
Block-sparse attention is an implementation strategy that organizes the sparse attention computation into dense blocks, which can be executed efficiently on modern GPU hardware. Rather than operating on individual token pairs, block-sparse methods divide the sequence into contiguous blocks (for example, blocks of 64 or 128 tokens) and determine which block pairs should participate in the attention computation. Within each selected block pair, the attention is computed densely.
Modern GPUs achieve peak throughput when performing dense matrix multiplications on regularly shaped tiles. Block-sparse attention exploits this by ensuring that all active computations are dense block multiplications, avoiding the irregular memory access patterns that would arise from truly unstructured sparsity. The sparsity is only at the block level: entire blocks are either included or excluded from the computation.
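The block-level computation can be sketched as follows, with a boolean `layout` matrix selecting which block pairs are active; the helper name and sizes are illustrative, and a real kernel would fuse these steps rather than loop in Python:

```python
import numpy as np

def block_sparse_attention(Q, K, V, block, layout):
    """`layout[qb, kb]` is True if query-block qb attends to key-block kb.
    Each active block pair is computed as a dense matmul (sketch)."""
    n, d = Q.shape
    out = np.zeros_like(V)
    for qb in range(n // block):
        kbs = np.where(layout[qb])[0]         # active key blocks for this query block
        cols = np.concatenate([np.arange(kb * block, (kb + 1) * block) for kb in kbs])
        q = Q[qb * block:(qb + 1) * block]
        scores = q @ K[cols].T / np.sqrt(d)   # dense tile over gathered keys
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[qb * block:(qb + 1) * block] = w @ V[cols]
    return out

rng = np.random.default_rng(0)
n, d, block = 8, 4, 2
Q, K, V = rng.normal(size=(3, n, d))
layout = np.eye(n // block, dtype=bool)       # block-diagonal: local attention only
out = block_sparse_attention(Q, K, V, block, layout)
```

With an all-True layout the result matches dense attention exactly, which is a useful correctness check when implementing custom layouts.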
Several frameworks provide block-sparse attention kernels:
| Framework | Key features |
|---|---|
| OpenAI blocksparse | TensorFlow ops and GPU kernels for block-sparse matrix multiplication; used in the original Sparse Transformer |
| DeepSpeed Sparse Attention | Supports configurable block sizes and multiple sparsity patterns (Fixed, BigBird, BSLongformer); Triton-based kernels; up to 6x speedup over dense attention |
| Block-Sparse FlashAttention | Extends FlashAttention with a block-sparsity mask; skips zero blocks to reduce IO by the sparsity factor |
| MIT-HAN-Lab Block-Sparse-Attention | Supports streaming attention and block-sparse patterns; optimized for Hopper (H100) and Blackwell (B200) GPUs |
In February 2025, DeepSeek introduced Native Sparse Attention (NSA), a hardware-aligned sparse attention mechanism designed to be natively trainable. NSA uses a dynamic hierarchical sparse strategy with three parallel branches: compressed attention for coarse-grained context, selected attention for important token blocks, and sliding attention for local context. The design is optimized for modern GPU memory hierarchies, processing key-value sequences in spatially continuous blocks. NSA was presented at ACL 2025 and demonstrated substantial speedups over full attention on sequences of 64,000 tokens while maintaining or exceeding the performance of dense attention models.
Ring Attention, proposed by Hao Liu, Matei Zaharia, and Pieter Abbeel at UC Berkeley in October 2023 and published at ICLR 2024, addresses the problem of fitting very long sequences across multiple devices rather than reducing the per-device computation through sparsity in the attention pattern.
In Ring Attention, the sequence is partitioned across multiple host devices arranged in a logical ring. Each device holds a segment of the query, key, and value tensors. During the attention computation, key-value blocks are passed around the ring: each device sends its current key-value block to the next device while simultaneously receiving a block from the previous device. At each step, a device computes blockwise attention between its local queries and the received key-value block. After p - 1 communication steps, where p is the number of devices, every query has attended to every key-value pair.
The critical insight is that the communication of key-value blocks can be fully overlapped with the computation of blockwise attention, so the communication overhead is effectively hidden. This allows the approach to scale the context length linearly with the number of devices without approximations, additional communication overhead, or changes to the attention output.
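The ring schedule can be simulated in a single process: each "device" is a shard, a send/receive step becomes an index rotation, and exactness is preserved by accumulating partial results with an online softmax. This is an illustrative sketch, not the authors' distributed implementation:

```python
import numpy as np

def ring_attention(Q_parts, K_parts, V_parts):
    """Single-process simulation of Ring Attention: KV shards rotate around
    the ring while each device accumulates exact attention over its queries."""
    p = len(Q_parts)
    outs = []
    for dev in range(p):
        q = Q_parts[dev]
        d = q.shape[-1]
        m = np.full(q.shape[0], -np.inf)   # running row max (for stability)
        l = np.zeros(q.shape[0])           # running softmax denominator
        acc = np.zeros_like(q)             # running weighted sum of values
        for step in range(p):              # visit every shard once around the ring
            k = K_parts[(dev + step) % p]  # "receive" the next KV shard
            v = V_parts[(dev + step) % p]
            s = q @ k.T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)      # rescale earlier partial results
            w = np.exp(s - m_new[:, None])
            l = l * scale + w.sum(axis=-1)
            acc = acc * scale[:, None] + w @ v
            m = m_new
        outs.append(acc / l[:, None])
    return np.concatenate(outs)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
Qp, Kp, Vp = (np.split(X, 4) for X in (Q, K, V))   # 4 simulated devices
out = ring_attention(Qp, Kp, Vp)
```

Because the accumulation is exact, the result is bit-for-bit the same attention output as the dense computation, confirming that Ring Attention is a distribution strategy rather than an approximation.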
Ring Attention enables training and inference on sequences that are up to (device count) times longer than what a single device can handle. Combined with memory-efficient attention implementations like FlashAttention, this has enabled context windows of millions of tokens. Experiments demonstrated effectiveness on both language modeling and reinforcement learning tasks.
It is worth noting that Ring Attention is not a sparse attention method in the strict sense, since it computes exact dense attention. Instead, it is a distributed computation strategy that complements sparse attention techniques: one can use Ring Attention to distribute the sequence across devices while also applying sparse attention within each device to further reduce computation.
The following table summarizes the key properties of the major sparse attention methods discussed above.
| Method | Year | Authors / Organization | Attention pattern | Complexity | Key innovation |
|---|---|---|---|---|---|
| Sparse Transformer | 2019 | Child et al. / OpenAI | Strided + fixed factorization | O(n√n) | First factorized sparse attention; efficient GPU kernels |
| Longformer | 2020 | Beltagy et al. / AI2 | Sliding window + dilated + global | O(n) | Linear scaling; task-specific global tokens |
| BigBird | 2020 | Zaheer et al. / Google | Window + global + random | O(n) | Turing completeness proof; universal approximation |
| Routing Transformer | 2021 | Roy et al. / Google | Learned clusters (online k-means) | O(n^{1.5}) | Content-based dynamic sparsity |
| Mistral 7B SWA | 2023 | Mistral AI | Sliding window (w=4096) | O(n * w) | Rolling KV cache; practical LLM deployment |
| Ring Attention | 2023 | Liu et al. / UC Berkeley | Distributed dense (not sparse) | O(n²/devices) | Device-linear context scaling; overlapped communication |
| NSA | 2025 | DeepSeek | Hierarchical: compressed + selected + sliding | O(n) | Hardware-aligned; natively trainable; three-branch design |
FlashAttention, introduced by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré at Stanford University and published at NeurIPS 2022, is often discussed alongside sparse attention but addresses a fundamentally different problem. Understanding the distinction is important.
FlashAttention computes exact dense attention (the same mathematical result as standard attention) but does so with dramatically fewer memory accesses by exploiting the GPU memory hierarchy. The algorithm uses tiling: it partitions the Q, K, and V matrices into blocks that fit in GPU on-chip SRAM, computes partial attention results in fast SRAM, and writes only the final output back to slow GPU HBM (high-bandwidth memory). The large n x n attention matrix is never fully materialized in HBM. This achieves up to 7.6x speedup over standard attention and uses O(n) HBM rather than O(n²), but the total number of floating-point operations remains O(n²).
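The tiling strategy can be sketched in NumPy: the online-softmax recurrence below produces exact attention while only ever materializing block-sized score tiles. This is an illustrative sketch of the idea; the real kernel fuses these steps in on-chip SRAM rather than looping in Python:

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Exact attention computed tile by tile; only (block x block) score tiles
    are materialized, mirroring FlashAttention's IO-aware strategy."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    for i in range(0, n, block):               # one query tile at a time
        q = Q[i:i + block]
        m = np.full(q.shape[0], -np.inf)       # running row max
        l = np.zeros(q.shape[0])               # running softmax denominator
        acc = np.zeros_like(q)
        for j in range(0, n, block):           # stream over key/value tiles
            s = q @ K[j:j + block].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)          # rescale previous partial results
            w = np.exp(s - m_new[:, None])
            l = l * scale + w.sum(axis=-1)
            acc = acc * scale[:, None] + w @ V[j:j + block]
            m = m_new
        out[i:i + block] = acc / l[:, None]
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
out = tiled_attention(Q, K, V)
```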
Sparse attention methods reduce the actual number of operations by skipping attention computations between token pairs that are masked out by the sparsity pattern. This means sparse attention computes an approximate result (compared to dense attention) because some token interactions are ignored entirely. The benefit is a genuine reduction in FLOPs, not just better memory access patterns.
FlashAttention and sparse attention are complementary rather than competing. Block-sparse FlashAttention combines both ideas: it uses the tiling and IO-aware computation strategy of FlashAttention while skipping entire blocks that are zeroed out by a block-sparsity mask. This yields both the IO efficiency of FlashAttention and the FLOP reduction of sparse attention. In practice, many modern systems layer these techniques together with other optimizations such as KV cache quantization and grouped-query attention.
| Aspect | FlashAttention | Sparse attention |
|---|---|---|
| Attention result | Exact (dense) | Approximate (subset of interactions) |
| FLOP reduction | No (same O(n²) FLOPs) | Yes (reduced to O(n), O(n√n), etc.) |
| Memory optimization | Yes (O(n) HBM via tiling) | Yes (fewer entries computed and stored) |
| Primary mechanism | IO-aware tiling to reduce HBM reads/writes | Structured or learned sparsity masks |
| Can be combined | Yes, with block-sparse masks | Yes, with FlashAttention kernels |
Sparse attention techniques have become essential building blocks in the development of long-context language models. Several trends highlight their importance:
Modern LLMs have rapidly expanded their context windows from the 512 tokens of early BERT models to 128,000 tokens (GPT-4 Turbo), 200,000 tokens (the Claude 3 model family), and even 1,000,000 tokens (Gemini 1.5 Pro). Achieving these context lengths requires a combination of sparse attention patterns, efficient attention kernels (FlashAttention), distributed strategies (Ring Attention), and architectural innovations like sliding window layers interleaved with full-attention layers.
Many production models do not use a single attention strategy across all layers. Instead, they combine sparse and dense attention in a hybrid fashion. For example, a model might use sliding window attention in most layers for efficient local processing, while reserving a few layers for full (dense) attention to capture global dependencies. This hybrid approach balances computational efficiency with the ability to reason over the entire context.
Sparse attention models like the Longformer and BigBird have found particular success in tasks involving long documents, such as question answering over scientific papers, legal document analysis, and summarization of book-length texts. The ability to process thousands of tokens at once, rather than truncating inputs to short fixed lengths, has led to significant improvements on benchmarks like TriviaQA, Natural Questions, and WikiHop.
BigBird demonstrated that sparse attention is valuable outside of natural language processing. In genomics, DNA sequences can be hundreds of thousands of base pairs long, and the ability to model long-range interactions between distant base pairs is critical for tasks such as promoter region classification and chromatin-profile prediction. Sparse attention makes it feasible to apply transformer-based models to these sequences.
Long-context models with sparse attention are increasingly used for code analysis, where entire repositories or large codebases must be processed simultaneously. The ability to handle tens of thousands of lines of code enables applications like cross-file code completion, bug detection across modules, and repository-level code summarization.
Choosing or designing a sparse attention pattern involves several trade-offs:
| Consideration | Description |
|---|---|
| Sparsity level | Higher sparsity reduces compute and memory but may miss important long-range dependencies |
| Pattern structure | Fixed patterns (window, strided) are simple and hardware-friendly; learned patterns (routing) are more flexible but add overhead |
| Hardware alignment | Block-aligned sparsity patterns map well to GPU SIMT architectures; irregular sparsity can underperform despite fewer FLOPs |
| Task requirements | Classification tasks may need only global summary tokens; generation tasks need causal (triangular) sparsity; retrieval tasks need broad coverage |
| Layer-wise variation | Using different patterns at different depths (small windows early, large windows or global attention late) can improve expressiveness |
| Training vs. inference | Some methods (like the Routing Transformer's k-means routing) add overhead during training that may not be present at inference |
The idea of restricting attention to a subset of positions predates the transformer architecture itself. Local attention mechanisms were explored in early sequence-to-sequence models. However, the Sparse Transformer (2019) was the first work to systematically design and evaluate factorized sparse patterns for transformers and demonstrate their effectiveness at scale.
The period from 2020 to 2021 saw rapid development, with the Longformer, BigBird, Routing Transformer, Performer (which uses random feature maps for linear attention), and Linformer (which uses linear projections to approximate the attention matrix) all proposing different strategies to address the quadratic bottleneck. Each made different trade-offs between computational complexity, implementation simplicity, theoretical guarantees, and empirical performance.
From 2022 onward, the focus shifted toward combining sparse attention with hardware-aware implementations (FlashAttention), distributed computation (Ring Attention), and hybrid architectures that mix sparse and dense layers. The release of Mistral 7B in 2023 demonstrated that sliding window attention could be used effectively in a production-grade open-weight language model, and DeepSeek's NSA in 2025 showed that natively trainable sparse attention with hardware-aligned kernels could match full attention quality at substantially lower cost.