Sliding window attention
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,167 words
Sliding window attention (SWA) is a sparse attention pattern that restricts each query token to attend only to a fixed-size window of nearby tokens, instead of attending to every preceding (or every other) token in the sequence. By replacing the dense, quadratic attention matrix of the standard Transformer with a banded matrix whose width is the window size, sliding window attention reduces both the time and memory complexity of self-attention from O(n^2) to O(n*w), where n is the sequence length and w is the window width.[1][2]
The technique was popularized for natural language processing by the Longformer model of Beltagy, Peters, and Cohan (2020), which combined sliding window attention with a small number of "global" tokens to handle long documents.[1] It was subsequently adopted in models such as BigBird (Zaheer et al., 2020),[3] and later in the decoder-only language model Mistral 7B, released in September 2023 and described in a technical report published in October 2023, where SWA was paired with a rolling buffer cache to bound the KV cache memory used during inference.[2]
This article surveys the motivation for sliding window attention, its core mechanics and variants, its use in Mistral 7B and related models, and how it compares to other approaches to long-context modeling such as ring attention, FlashAttention, rotary position embedding (RoPE), and YaRN.
The self-attention operation introduced in the original Transformer of Vaswani et al. (2017) computes a similarity score between every pair of tokens in a sequence.[4] For a sequence of length n with hidden dimension d, the matrix product QK^T that produces the attention scores has shape n by n, so both the compute and the memory required by attention scale quadratically with n.
The Longformer paper states this limitation directly: "The original Transformer model has a self-attention component with O(n^2) time and memory complexity where n is the input sequence length."[1] The Mistral 7B technical report makes the same observation: "The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability."[2]
This quadratic scaling has several practical consequences:
A long line of work has tried to break this scaling bottleneck. One natural idea is to recognize that most useful linguistic structure is local: in many texts, the meaning of a token depends most strongly on tokens in its immediate neighborhood. Sliding window attention is the simplest realization of this intuition.
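In sliding window attention, each token at position i attends only to a fixed-size window of tokens, either centered on position i (the bidirectional case) or ending at position i (the causal case). Tokens outside that window are masked out of the attention computation. The window size is written w when discussing Longformer and W when discussing Mistral 7B, following the notation of each paper.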
In the bidirectional (encoder) formulation introduced by Longformer, each token attends to (w/2) tokens on each side, for a total window of size w. The Longformer authors describe the pattern as: "Given the importance of local context, our attention pattern employs a fixed-size window attention surrounding each token. Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input, similar to CNNs."[1]
The computational complexity of this pattern is O(n*w), which is linear in n when w is fixed.[1] In Longformer-base, the window size used for masked language modeling fine-tuning is w = 512, matching the original RoBERTa context length while supporting sequences up to 4,096 tokens long.[1]
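The banded pattern and its linear growth can be illustrated with a small mask builder. This is a minimal sketch, not Longformer's implementation: the function names are hypothetical, and it assumes a window of w means w/2 neighbors on each side plus the token itself.

```python
def band_mask(n: int, w: int) -> list[list[bool]]:
    """Bidirectional sliding window mask: True where query i may attend to key j."""
    half = w // 2  # assumed convention: w/2 neighbors on each side
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


def allowed_pairs(mask: list[list[bool]]) -> int:
    """Number of query-key pairs that survive the mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    mask = band_mask(n=8, w=4)
    for row in mask:
        print("".join("x" if ok else "." for ok in row))
    print(allowed_pairs(mask), "of", 8 * 8, "pairs kept")
```

Each row keeps at most w + 1 entries, so the number of surviving pairs is bounded by n * (w + 1) and grows linearly with n when w is fixed.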
In the decoder-only causal formulation used by Mistral 7B, each token at position i in layer k attends only to tokens at positions between i - W and i in the previous layer. The Mistral technical report describes this as: "The hidden state in position i of the layer k, h_i, attends to all hidden states from the previous layer with positions between i - W and i."[2]
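A single-head version of the causal pattern can be sketched in plain Python. The function name is hypothetical, and the sketch assumes the window of size `window` includes the current token; the paper's wording ("between i - W and i") could also be read as one position wider. It favors readability over speed and is not how optimized kernels compute it.

```python
import math


def sliding_window_attention(q, k, v, window):
    """Causal attention where token i sees keys max(0, i - window + 1) .. i.

    q, k, v are lists of equal length holding per-token vectors (lists of floats).
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        lo = max(0, i - window + 1)
        positions = range(lo, i + 1)
        # Scaled dot-product scores against the keys inside the window only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in positions]
        # Numerically stable softmax over the window.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(wt * v[j][c] for wt, j in zip(weights, positions)) / total
                    for c in range(len(v[0]))])
    return out
```

With `window=1` every token attends only to itself and the output equals `v`; with `window` at least as large as the sequence, the function reduces to ordinary causal attention.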
A single sliding window layer can only access W tokens, which by itself would seem insufficient for long-range reasoning. The crucial observation, made by Longformer and later by Mistral, is that information can propagate further when many windowed-attention layers are stacked: at each layer information moves forward by up to W positions, so after k stacked layers a token can be influenced by tokens up to k * W positions away.
Longformer puts it as follows: "In a transformer with L layers, the receptive field size at the top layer is Lw (assuming w is fixed for all layers)."[1] The Mistral 7B paper makes the same observation: "Recursively, h_i can access tokens from the input layer at a distance of up to Wk tokens."[2]
For Mistral 7B, which has 32 layers and a window size W = 4,096, this yields a theoretical attention span of approximately 131,000 tokens (32 * 4,096 = 131,072), as reported in the paper: "At the last layer, using a window size of W = 4096, we have a theoretical attention span of approximately 131K tokens."[2] In practice, the effective context used during pre-training is bounded by the position encoding scheme and the training sequence length.
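The layer-by-layer growth of the receptive field can be checked on a toy sequence. The helper below is a hypothetical illustration: it tracks, for every position, the earliest input position that can influence it, assuming each layer lets position i read positions i - window through i of the previous layer.

```python
def earliest_visible(n: int, window: int, num_layers: int) -> list[int]:
    """Earliest input position that can influence each token after num_layers layers."""
    earliest = list(range(n))  # before any attention layer, a token sees only itself
    for _ in range(num_layers):
        # Token i reads positions i - window .. i of the previous layer and
        # inherits the furthest-back input any of them could already see.
        earliest = [min(earliest[max(0, i - window): i + 1]) for i in range(n)]
    return earliest


if __name__ == "__main__":
    reach = earliest_visible(n=20, window=3, num_layers=4)
    print("token 19 can be influenced back to position", reach[19])
    print("theoretical span for 32 layers, W = 4096:", 32 * 4096)
```

For the toy case the last token reaches back 4 * 3 = 12 positions, matching the k * W rule; the same arithmetic gives 131,072 for 32 layers and W = 4,096.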
Several variants of the basic sliding window pattern have been proposed, primarily to mitigate the loss of long-range dependencies that pure local attention can suffer.
The Longformer paper notes that pure windowed attention is "not flexible enough to learn task-specific representations" for downstream tasks such as classification or question answering, where some tokens (a [CLS] token for classification, or question tokens for QA) need to interact with the entire sequence.[1] To address this, Longformer adds "global attention" on a small number of pre-selected positions: tokens with global attention can attend to every other token, and every other token can attend to them.
Because the number of global tokens is small relative to n, the overall complexity of combined local and global attention remains O(n). The Longformer authors found this design "is critical for best performance on downstream tasks."[1]
Longformer also uses separate projection matrices Q_s, K_s, V_s for sliding window attention and Q_g, K_g, V_g for global attention, giving the model flexibility to learn different similarity structures for the two types of interaction.[1]
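The combined local and global pattern can be expressed as a single boolean mask. This sketch covers only the connectivity, not the separate projection matrices; the function name and the choice of global positions are hypothetical.

```python
def local_global_mask(n: int, w: int, global_positions) -> list[list[bool]]:
    """True where query i may attend to key j under window + global attention."""
    half = w // 2
    is_global = set(global_positions)
    return [
        [
            abs(i - j) <= half      # local sliding window
            or i in is_global       # a global token attends to every token
            or j in is_global       # every token attends to a global token
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Treat position 0 as a [CLS]-style global token (an assumption for the demo).
    for row in local_global_mask(n=10, w=2, global_positions=[0]):
        print("".join("x" if ok else "." for ok in row))
```

Each global token adds one full row and one full column, so with g global tokens the number of allowed pairs is at most n * (w + 1) + 2 * g * n, which stays linear in n when g is a small constant.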
Longformer additionally introduces a "dilated sliding window" variant analogous to dilated convolutions: instead of attending to w contiguous neighbors, the window has gaps of size d, so the receptive field becomes L * d * w with no extra compute. The authors observe: "To further increase the receptive field without increasing computation, the sliding window can be 'dilated'... Assuming a fixed d and w for all layers, the receptive field is Ldw, which can reach tens of thousands of tokens even for small values of d."[1]
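One way to picture dilation is to list the key positions a single query attends to. The helper below is a hypothetical sketch that assumes dilation d means attended positions are spaced d apart, so d = 1 recovers the contiguous window; the paper's exact indexing may differ.

```python
def dilated_window_positions(i: int, n: int, w: int, d: int) -> list[int]:
    """Key positions attended by query i with window w and dilation d."""
    half = w // 2
    candidates = (i + t * d for t in range(-half, half + 1))
    return [j for j in candidates if 0 <= j < n]  # drop positions off either end


if __name__ == "__main__":
    print(dilated_window_positions(i=10, n=100, w=4, d=1))
    print(dilated_window_positions(i=10, n=100, w=4, d=3))
```

Both calls return the same number of positions, so the per-token cost is unchanged, while the span covered by one layer grows by a factor of d.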
For autoregressive language modeling on text8 and enwik8, Longformer uses different window sizes across layers (small windows in early layers, larger windows in higher layers), and applies dilation to only two attention heads, leaving the rest non-dilated.[1] An ablation showed that increasing window sizes from bottom to top layers outperformed fixed or decreasing schedules, and adding dilation to two heads improved performance further.[1]
BigBird, proposed by Zaheer et al. at NeurIPS 2020, generalizes the Longformer design by combining three sparse attention components: a sliding window of local neighbors, a small set of global tokens, and a small set of randomly chosen "long-range" connections per token.[3] The paper reports that this construction "reduces this quadratic dependency to linear," and that BigBird is "a universal approximator of sequence functions and is Turing complete," preserving the expressive properties of full attention.[3] BigBird enabled "sequences of length up to 8x of what was previously possible using similar hardware."[3]
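The three components can be combined into one token-level mask as a rough illustration. This is a simplified sketch under stated assumptions: the function name, the choice of the first positions as global tokens, and the per-token random sampling are all hypothetical, and the published model works on blocks of tokens rather than individual tokens.

```python
import random


def bigbird_style_mask(n, w, num_global, num_random, seed=0):
    """Window + global + random connectivity, True where i may attend to j."""
    rng = random.Random(seed)  # fixed seed so the pattern is reproducible
    half = w // 2
    mask = [
        [abs(i - j) <= half or i < num_global or j < num_global for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        # Each query additionally attends to a few randomly chosen keys.
        for j in rng.sample(range(n), num_random):
            mask[i][j] = True
    return mask


if __name__ == "__main__":
    for row in bigbird_style_mask(n=16, w=2, num_global=1, num_random=2):
        print("".join("x" if ok else "." for ok in row))
```

Every non-global row holds at most (w + 1) + num_global + num_random allowed keys, which is independent of n, so the total work remains linear in the sequence length.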
The earliest widely-cited sparse attention pattern in this lineage is the Sparse Transformer of Child, Gray, Radford, and Sutskever (2019), which "introduces sparse factorizations of the attention matrix which reduce this to O(n*sqrt(n))."[5] Sparse Transformer uses strided and fixed attention patterns rather than a literal sliding window, but it is the conceptual ancestor of later windowed-attention designs.
The Longformer authors explicitly position their work as similar to Sparse Transformer: "The model with the most similar attention pattern to ours is Sparse Transformer (Child et al., 2019), which uses a form of dilated sliding window of blocks of size 8x8 provided by BlockSparse (Gray et al., 2017)."[1]
Mistral 7B was released by Mistral AI in September 2023, and the technical report "Mistral 7B" by Jiang et al. followed in October 2023.[2] The model is a 7-billion-parameter decoder-only Transformer that combines two efficiency mechanisms: grouped-query attention (GQA) and sliding window attention.
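According to Table 1 of the Mistral 7B paper, the model has the following architectural parameters:[2]
dim: 4096
n_layers: 32
head_dim: 128
hidden_dim: 14336
n_heads: 32
n_kv_heads: 8
window_size: 4096
context_len: 8192
vocab_size: 32000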
The sliding window of 4,096 tokens is half the 8,192-token context length used during training, so even at the maximum trained context length each token attends directly to a substantial fraction of the sequence, while the attention cost per token and the cache size stay bounded by the window rather than growing with the sequence.
A key practical benefit of sliding window attention in autoregressive decoding is that the KV cache can be bounded by the window size. Mistral 7B implements this with a "rolling buffer cache" of fixed size W: "The cache has a fixed size of W, and the keys and values for the timestep i are stored in position i mod W of the cache. As a result, when the position i is larger than W, past values in the cache are overwritten, and the size of the cache stops increasing."[2]
The paper reports a concrete memory saving: "On a sequence length of 32k tokens, this reduces the cache memory usage by 8x, without impacting the model quality."[2] This is one of the most often-cited practical advantages of SWA: cache memory becomes bounded by W rather than growing without limit with sequence length.
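The "i mod W" indexing rule can be sketched as a small class. The class and method names are hypothetical, and keys and values are stored as opaque Python objects rather than tensors; this illustrates the indexing only and is not Mistral's implementation.

```python
class RollingKVCache:
    """Fixed-size key/value store that overwrites the oldest entry."""

    def __init__(self, window: int):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.tokens_seen = 0

    def append(self, key, value) -> None:
        slot = self.tokens_seen % self.window  # position i is stored in slot i mod W
        self.keys[slot] = key
        self.values[slot] = value
        self.tokens_seen += 1

    def positions(self) -> list[int]:
        """Absolute positions still held in the cache, oldest first."""
        start = max(0, self.tokens_seen - self.window)
        return list(range(start, self.tokens_seen))

    def ordered(self) -> list[tuple]:
        """(key, value) pairs in chronological order."""
        return [(self.keys[p % self.window], self.values[p % self.window])
                for p in self.positions()]


if __name__ == "__main__":
    cache = RollingKVCache(window=4)
    for i in range(10):
        cache.append(f"k{i}", f"v{i}")
    print(cache.positions(), cache.ordered())
```

The cache never holds more than W entries. With W = 4,096, an unbounded cache for a 32,768-token sequence would need 32,768 / 4,096 = 8 times as many slots, which is consistent with the factor quoted above.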
For long prompts, Mistral 7B pre-fills the cache with the prompt before token-by-token generation begins. The paper describes a chunking scheme in which the prompt is split into chunks of size equal to the window, and each chunk's attention is computed against (a) the previous chunk in the cache and (b) the current chunk under a causal mask. Tokens outside the sliding window are excluded from both the cache and the attention mask.[2]
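The chunking scheme can be laid out as a plan of index ranges. The sketch below uses hypothetical function names and assumes the window counts the current token; it shows which cached span each chunk needs, not how the attention itself is computed.

```python
def chunked_prefill_plan(prompt_len: int, window: int) -> list[dict]:
    """Split a prompt into window-sized chunks and record the cache span each needs."""
    plan = []
    for start in range(0, prompt_len, window):
        end = min(start + window, prompt_len)
        cache_start = max(0, start - window)  # at most one window of earlier tokens
        plan.append({"chunk": (start, end), "cache": (cache_start, start)})
    return plan


def visible_keys(i: int, window: int) -> range:
    """Key positions token i may attend to under the causal sliding window."""
    return range(max(0, i - window + 1), i + 1)


if __name__ == "__main__":
    for step in chunked_prefill_plan(prompt_len=10, window=4):
        print(step)
```

Because a token never looks back further than one window, every key it needs lies either in the current chunk or in the cached span directly before it, which is why one window of cache is enough during pre-fill.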
To get the wall-clock benefit of sliding window attention, Mistral relies on optimized kernels from FlashAttention and Meta's xFormers library. The paper states: "In practice, for a sequence length of 16K and W = 4096, changes made to FlashAttention and xFormers yield a 2x speed improvement over a vanilla attention baseline."[2] The acknowledgements thank Tri Dao and Daniel Haziza for these kernel changes.[2]
Mistral AI's Mixtral mixture-of-experts models (Mixtral 8x7B and Mixtral 8x22B) inherit the dense Transformer block design of Mistral 7B, including the sliding window attention pattern at the block level, while replacing the feed-forward layer with a sparse mixture of experts. As of the date of access, the architecture details listed on Mistral AI's documentation pages for these models reflect this design.
Sliding window attention buys efficiency at the cost of expressive power inside a single layer. The trade-offs are well documented in the source papers.
Sliding window attention is one of several families of techniques for extending Transformer context. Alternatives differ in whether they change the attention pattern, the position encoding, the computational kernel, or the hardware parallelization strategy.
FlashAttention, introduced by Dao et al., is an exact, IO-aware implementation of standard dense attention that fuses the softmax with the matrix multiplication and tiles the computation so that the full matrix of intermediate scores never has to be written to the GPU's high-bandwidth memory (HBM), which is large but slower than on-chip SRAM.[6] FlashAttention does not reduce the asymptotic O(n^2) work of attention, but in practice it makes full dense attention far faster and more memory-efficient than naive implementations, narrowing the gap with sparse patterns. The Mistral 7B paper specifically credits FlashAttention as part of the kernel work that makes its SWA implementation 2x faster than a vanilla baseline.[2] FlashAttention is one reason modern open-weight models have often chosen full attention over sliding window attention at moderate context lengths.
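The tiling idea rests on a streaming (or "online") softmax that processes keys block by block while keeping only running statistics. The sketch below is a scalar-valued illustration with hypothetical function names; it shows the rescaling trick only and says nothing about the GPU memory management that gives FlashAttention its speed.

```python
import math


def reference_softmax_average(scores, values):
    """Softmax-weighted average computed with all scores in memory at once."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def streaming_softmax_average(scores, values, block_size):
    """Same result, visiting the scores one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0  # softmax denominator, relative to running_max
    running_out = 0.0  # weighted sum of values, relative to running_max
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated so far to the new maximum
        # (the factor is 0.0 on the first block, when nothing is accumulated).
        scale = math.exp(running_max - new_max)
        blk_weights = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * scale + sum(blk_weights)
        running_out = running_out * scale + sum(w * v for w, v in zip(blk_weights, v_blk))
        running_max = new_max
    return running_out / running_sum
```

The streaming version agrees with the reference up to floating-point rounding for any block size, which is why a tiled kernel can remain exact without materializing the full score matrix.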
Ring attention parallelizes full attention across many devices by streaming key-value blocks around a ring of accelerators while overlapping communication with computation. Like FlashAttention it does not change the attention pattern itself; it changes how the computation is distributed. Ring attention is complementary to sparse patterns such as sliding window attention: a model could in principle combine ring-style parallelism with a banded attention mask.
A different family of approaches extends the usable context of a model without changing the attention pattern, by modifying the position encoding. Rotary position embedding (RoPE) injects positional information into the queries and keys via rotation matrices indexed by absolute position. YaRN (Yet another RoPE extensioN) interpolates and rescales RoPE so that a model originally trained at one context length can extrapolate or be fine-tuned to a longer one. These methods are orthogonal to sliding window attention: a model can use both, although in practice modern long-context models often rely on RoPE-based scaling plus FlashAttention with full attention, rather than sparse windowed patterns.
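The reason absolute-position rotations still yield relative-position behavior can be checked numerically. The sketch below rotates adjacent pairs of dimensions, which is one common convention (some implementations pair dimensions differently); the function names and the base of 10,000 are assumptions made for the example.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to position."""
    d = len(vec)  # assumed even
    out = []
    for p in range(0, d, 2):
        theta = position * base ** (-p / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[p], vec[p + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 2.0]
    k = [1.1, 0.4, -0.5, 0.9]
    # Same offset of 2 between query and key, at two different absolute positions.
    print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
    print(dot(rope_rotate(q, 105), rope_rotate(k, 103)))
```

The two printed scores agree up to rounding because the query-key dot product depends only on the difference between the two positions. Context-extension methods such as YaRN work by adjusting the rotation frequencies rather than the attention pattern.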
As discussed above, BigBird and Longformer are themselves SWA-based; they extend pure sliding window attention with global tokens (both) and random connections (BigBird only).[1][3] Both are theoretical and empirical demonstrations that linear-time sparse attention can match full attention on many tasks, but neither has displaced full attention as the default for modern decoder-only large language models.
After Mistral 7B introduced sliding window attention to a high-profile decoder-only language model, much community discussion focused on whether SWA would remain part of Mistral's architecture in successor models or whether full attention (helped by FlashAttention and improved position scaling) would supplant it.
In Hugging Face's transformers library, the MistralAttention implementation has been the subject of open discussion regarding how and when the sliding window is actually applied at inference time, and several versions of model configurations distributed on the Hugging Face Hub set the sliding_window parameter to null for newer Mistral models, indicating that the SWA mask is not used at runtime.[7][8] Because this article restricts itself to verifiable claims, the precise attention pattern of each post-Mistral-7B model should be confirmed against the specific model's official documentation or configuration file rather than assumed to follow the Mistral 7B paper.
What is clear from the published record is: