Sliding window attention

Updated Jul 13, 2026

Sliding window attention (SWA) is a sparse attention pattern in which each query token attends only to a fixed-size window of nearby tokens instead of to every preceding (or every other) token in the sequence. By replacing the dense, quadratic attention matrix of the standard Transformer with a banded matrix whose width equals the window size, sliding window attention reduces the time and memory complexity of self-attention from $O(n^2)$ to $O(nw)$, where $n$ is the sequence length and $w$ is the window width.^[1]^[2] When many windowed layers are stacked, a token's theoretical receptive field grows to $Lw$ tokens across $L$ layers, so Mistral 7B (32 layers, window 4,096) reaches a theoretical attention span of $32 \times 4{,}096 = 131{,}072$ tokens while keeping its decoding cache bounded by the window size.^[2]

The technique was popularized for natural language processing by the Longformer model of Beltagy, Peters, and Cohan (2020), which combined sliding window attention with a small number of "global" tokens to handle long documents.^[1] It was subsequently adopted in models such as BigBird (Zaheer et al., 2020),^[3] and later in the decoder-only language model Mistral 7B, released in October 2023, where SWA was paired with a rolling buffer cache to bound the KV cache memory used during inference.^[2]

This article surveys the motivation for sliding window attention, its core mechanics and variants, its use in Mistral 7B and related models, and how it compares to alternative approaches to long-context modeling such as ring attention, FlashAttention, Rotary position embedding (RoPE), and YaRN.

Why does attention scale quadratically?

The self-attention operation introduced in the original Transformer of Vaswani et al. (2017) computes a similarity score between every pair of tokens in a sequence.^[4] For a sequence of length n with hidden dimension d, the matrix product $QK^\top$ that produces the attention scores has shape n by n, so both the compute and the memory required by attention scale quadratically with n.
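
To make the scaling concrete, the sketch below counts the entries of the $n \times n$ score matrix for a single head and the memory they would occupy. The function name and the choice of 2 bytes per score (16-bit floats) are illustrative assumptions, not figures from the cited papers.

```python
def dense_score_cost(n: int, bytes_per_entry: int = 2) -> tuple[int, int]:
    """Return (entries, bytes) of the n-by-n score matrix QK^T for one head."""
    entries = n * n
    return entries, entries * bytes_per_entry


# Quadrupling the sequence length multiplies the score matrix by sixteen.
for n in (1_024, 4_096, 16_384):
    entries, size = dense_score_cost(n)
    print(f"n={n:>6}: {entries:>12,} scores, {size / 2**20:,.0f} MiB per head")
```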

The Longformer paper states this limitation directly: "The original Transformer model has a self-attention component with $O(n^2)$ time and memory complexity where n is the input sequence length."^[1] The Mistral 7B technical report makes the same observation: "The number of operations in vanilla attention is quadratic in the sequence length, and the memory increases linearly with the number of tokens. At inference time, this incurs higher latency and smaller throughput due to reduced cache availability."^[2]

This quadratic scaling has several practical consequences:

Long documents (books, legal contracts, long codebases, long chat histories) cannot fit into a single attention pass on commodity GPUs.
Even when they fit, inference latency grows rapidly because every new token must attend to every previous token.
The KV cache that stores keys and values for autoregressive decoding grows linearly with the context length, and every decoding step must read the entire cache, so the total attention work over a sequence grows quadratically and memory bandwidth becomes a bottleneck.

A long line of work has tried to break this scaling bottleneck. One natural idea is to recognize that most useful linguistic structure is local: in many texts, the meaning of a token depends most strongly on tokens in its immediate neighborhood. Sliding window attention is the simplest realization of this intuition.

How does sliding window attention work?

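In sliding window attention, each token at position i attends only to a fixed-size window of tokens that is either centered on position i (bidirectional models) or ends at position i (causal models). Tokens outside that window are masked out of the attention computation. The window size is written w in the Longformer paper and W in the Mistral 7B report; the sections below follow each source's notation.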

Bidirectional formulation (Longformer)

In the bidirectional (encoder) formulation introduced by Longformer, each token attends to $(w/2)$ tokens on each side, for a total window of size w. The Longformer authors describe the pattern as: "Given the importance of local context, our attention pattern employs a fixed-size window attention surrounding each token. Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input, similar to CNNs."^[1]

The computational complexity of this pattern is $O(nw)$, which is linear in n when w is fixed.^[1] In Longformer-base, the window size used for masked language modeling pretraining is $w = 512$, matching the original RoBERTa context length while supporting sequences up to 4,096 tokens long (8 times longer than BERT).^[1]
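
The bidirectional pattern can be sketched as a boolean mask. This is a minimal illustration rather than Longformer's implementation: the function names are hypothetical, and the mask is built densely for readability even though a real kernel would never materialize it. Each token sees $w/2$ neighbors on each side plus itself.

```python
def bidirectional_window_mask(n: int, w: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j (|i - j| <= w // 2)."""
    half = w // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


def allowed_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs the mask keeps."""
    return sum(sum(row) for row in mask)


mask = bidirectional_window_mask(n=8, w=4)
for row in mask:
    print("".join("x" if ok else "." for ok in row))  # banded pattern
print(allowed_pairs(mask), "of", 8 * 8, "pairs kept")
```

The number of kept pairs is at most $n(w + 1)$, which is where the linear dependence on n comes from.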

Causal (left-only) formulation (Mistral 7B)

In the decoder-only causal formulation used by Mistral 7B, each token at position i in layer k attends only to tokens at positions between $i - W$ and $i$ in the previous layer. The Mistral technical report describes this as: "The hidden state in position i of the layer k, $h_i$, attends to all hidden states from the previous layer with positions between $i - W$ and $i$."^[2]
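
A minimal sketch of the causal mask follows. It reads the quoted sentence literally, so both $i - W$ and $i$ are included; this boundary choice is an assumption, and an implementation may instead keep exactly W keys per query. The function name is hypothetical.

```python
def causal_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when i - window <= j <= i (no future tokens)."""
    return [[i - window <= j <= i for j in range(n)] for i in range(n)]


mask = causal_window_mask(n=6, window=2)
for row in mask:
    # Each row shows which earlier positions one query can see.
    print("".join("x" if ok else "." for ok in row))
```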

How does the receptive field grow with depth?

A single sliding window layer can only access W tokens, which by itself would seem insufficient for long-range reasoning. The crucial observation, made by Longformer and later by Mistral, is that information can propagate further when many windowed-attention layers are stacked: at each layer information moves forward by up to W positions, so after k stacked layers a token can be influenced by tokens up to $kW$ positions away.

Longformer puts it as follows: "In a transformer with L layers, the receptive field size at the top layer is $Lw$ (assuming w is fixed for all layers)."^[1] The Mistral 7B paper makes the same observation: "Recursively, $h_i$ can access tokens from the input layer at a distance of up to $Wk$ tokens."^[2]

For Mistral 7B, which has 32 layers and a window size $W = 4{,}096$, this yields a theoretical attention span of approximately 131,000 tokens ($32 \times 4{,}096 = 131{,}072$), as reported in the paper: "At the last layer, using a window size of $W = 4096$, we have a theoretical attention span of approximately 131K tokens."^[2] In practice, the effective context used during pre-training is bounded by the position encoding scheme and the training sequence length.
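
The arithmetic can be checked with a short sketch that walks backwards one layer at a time. It assumes each layer lets a token read up to `window` positions behind it, as in the quoted statements; the function names are illustrative.

```python
def theoretical_span(window: int, layers: int) -> int:
    """Upper bound on how far back information can travel: W * k."""
    return window * layers


def earliest_visible(position: int, window: int, layers: int) -> int:
    """Earliest input position that can influence `position` after `layers` layers."""
    earliest = position
    for _ in range(layers):
        # One layer extends the reach by at most `window` positions.
        earliest = max(0, earliest - window)
    return earliest


print(theoretical_span(window=4096, layers=32))
print(earliest_visible(position=200_000, window=4096, layers=32))
```

This is only the upper bound on reach; it says nothing about how much signal survives each hop.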

What are the main variants of sliding window attention?

Several variants of the basic sliding window pattern have been proposed, primarily to mitigate the loss of long-range dependencies that pure local attention can suffer.

Global attention (Longformer)

The Longformer paper notes that pure windowed attention is "not flexible enough to learn task-specific representations" for downstream tasks such as classification or question answering, where some tokens (a [CLS] token for classification, or question tokens for QA) need to interact with the entire sequence.^[1] To address this, Longformer adds "global attention" on a small number of pre-selected positions: tokens with global attention can attend to every other token, and every other token can attend to them.

Because the number of global tokens is small relative to n, the overall complexity of combined local and global attention remains $O(n)$. The Longformer authors found this design "is critical for best performance on downstream tasks."^[1]
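
The linear growth can be checked by counting allowed query-key pairs. The sketch below assumes, purely for illustration, that the global tokens are the first `num_global` positions; Longformer chooses them per task. The counting loop is deliberately naive.

```python
def local_global_pairs(n: int, w: int, num_global: int) -> int:
    """Count (query, key) pairs allowed by a window of half-width w // 2
    plus symmetric global attention on positions 0 .. num_global - 1."""
    half = w // 2
    count = 0
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= half or i < num_global or j < num_global:
                count += 1
    return count


for n in (256, 512, 1024):
    kept = local_global_pairs(n, w=8, num_global=2)
    print(f"n={n:>5}: {kept:>6} pairs kept of {n * n}")
```

Doubling n roughly doubles the number of kept pairs, whereas the dense count quadruples.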

Longformer also uses separate projection matrices $Q_s$, $K_s$, $V_s$ for sliding window attention and $Q_g$, $K_g$, $V_g$ for global attention, giving the model flexibility to learn different similarity structures for the two types of interaction.^[1]

Dilated sliding windows

Longformer additionally introduces a "dilated sliding window" variant analogous to dilated convolutions: instead of attending to w contiguous neighbors, the window has gaps of size d, so the receptive field becomes $Ldw$ with no extra compute. The authors observe: "To further increase the receptive field without increasing computation, the sliding window can be 'dilated'... Assuming a fixed d and w for all layers, the receptive field is $Ldw$, which can reach tens of thousands of tokens even for small values of d."^[1]

For autoregressive language modeling on text8 and enwik8, Longformer uses different window sizes across layers (small windows in early layers, larger windows in higher layers), and applies dilation to only two attention heads, leaving the rest non-dilated.^[1] An ablation showed that increasing window sizes from bottom to top layers outperformed fixed or decreasing schedules, and adding dilation to two heads improved performance further.^[1]
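
A dilated window can be sketched as a mask that keeps every d-th neighbor. The convention used here (d = 1 means no gaps, as with dilated convolutions) and the function name are assumptions for illustration; the paper's exact indexing may differ.

```python
def dilated_window_mask(n: int, w: int, d: int) -> list[list[bool]]:
    """mask[i][j] is True when j is one of the w // 2 neighbors on either side
    of i, spaced d positions apart (d = 1 gives the ordinary window)."""
    half = w // 2
    return [
        [(j - i) % d == 0 and abs(j - i) <= half * d for j in range(n)]
        for i in range(n)
    ]


plain = dilated_window_mask(n=9, w=4, d=1)
dilated = dilated_window_mask(n=9, w=4, d=2)
# Same number of attended keys for the middle token, but twice the reach.
print([j for j, ok in enumerate(plain[4]) if ok])
print([j for j, ok in enumerate(dilated[4]) if ok])
```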

Sliding window plus global plus random (BigBird)

BigBird, proposed by Zaheer et al. at NeurIPS 2020, generalizes the Longformer design by combining three sparse attention components: a sliding window of local neighbors, a small set of global tokens, and a small set of randomly chosen "long-range" connections per token.^[3] The paper reports that this construction "reduces this quadratic dependency to linear," and that BigBird is "a universal approximator of sequence functions and is Turing complete," preserving the expressive properties of full attention.^[3] BigBird enabled "sequences of length up to 8x of what was previously possible using similar hardware."^[3]
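
The three components can be combined in one token-level mask, sketched below. This is a simplification under stated assumptions: the function name, the seed parameter, and the per-token random sampling are illustrative and do not reproduce the construction in the BigBird paper.

```python
import random


def bigbird_style_mask(
    n: int, w: int, global_positions: list[int], num_random: int, seed: int = 0
) -> list[list[bool]]:
    """Union of a sliding window, symmetric global tokens, and random keys per query."""
    rng = random.Random(seed)
    half = w // 2
    is_global = set(global_positions)
    mask = [
        [abs(i - j) <= half or i in is_global or j in is_global for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        # Random "long-range" keys; they may coincide with entries already allowed.
        for j in rng.sample(range(n), num_random):
            mask[i][j] = True
    return mask


mask = bigbird_style_mask(n=16, w=2, global_positions=[0], num_random=2)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```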

Earlier sparse patterns: Sparse Transformer

The earliest widely-cited sparse attention pattern in this lineage is the Sparse Transformer of Child, Gray, Radford, and Sutskever (2019), which "introduces sparse factorizations of the attention matrix which reduce this to $O(n\sqrt{n})$."^[5] Sparse Transformer uses strided and fixed attention patterns rather than a literal sliding window, but it is the conceptual ancestor of later windowed-attention designs.

The Longformer authors explicitly position their work as similar to Sparse Transformer: "The model with the most similar attention pattern to ours is Sparse Transformer (Child et al., 2019), which uses a form of dilated sliding window of blocks of size 8x8 provided by BlockSparse (Gray et al., 2017)."^[1]

How does Mistral 7B use sliding window attention?

Mistral 7B was released by Mistral AI in October 2023 with the technical report "Mistral 7B" by Jiang et al.^[2] The model is a 7-billion-parameter decoder-only Transformer that combines two efficiency mechanisms: grouped-query attention (GQA) and sliding window attention.

Architecture parameters

According to Table 1 of the Mistral 7B paper, the model has the following architectural parameters:^[2]

Parameter	Value
dim (hidden size)	4,096
n_layers	32
head_dim	128
hidden_dim (feed-forward)	14,336
n_heads	32
n_kv_heads (grouped-query attention)	8
window_size	4,096
context_len	8,192
vocab_size	32,000

The sliding window size of 4,096 tokens is half the context length used during training, which means that even at the maximum trained context length each token can attend to a substantial fraction of the sequence directly, while still benefiting from the linear scaling of attention against the cache.
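
A few derived quantities follow directly from the table. The sketch below stores the listed values in a hypothetical `Mistral7BParams` dataclass and estimates key-value cache size; the 2 bytes per value (16-bit storage) is an assumption, not a figure from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mistral7BParams:
    """Values copied from the parameter table; the class itself is illustrative."""
    dim: int = 4096
    n_layers: int = 32
    head_dim: int = 128
    hidden_dim: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8
    window_size: int = 4096
    context_len: int = 8192
    vocab_size: int = 32000


def kv_cache_bytes(cfg: Mistral7BParams, cached_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys and values (2 tensors) for every layer and every KV head."""
    per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim
    return per_token * cached_tokens * bytes_per_value


cfg = Mistral7BParams()
print("query heads per KV head:", cfg.n_heads // cfg.n_kv_heads)
print("cache at window size (MiB):", kv_cache_bytes(cfg, cfg.window_size) / 2**20)
print("cache at context length (MiB):", kv_cache_bytes(cfg, cfg.context_len) / 2**20)
```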

Rolling buffer cache

A key practical benefit of sliding window attention in autoregressive decoding is that the KV cache can be bounded by the window size. Mistral 7B implements this with a "rolling buffer cache" of fixed size W: "The cache has a fixed size of W, and the keys and values for the timestep i are stored in position $i \bmod W$ of the cache. As a result, when the position i is larger than W, past values in the cache are overwritten, and the size of the cache stops increasing."^[2]

The paper reports a concrete memory saving: "On a sequence length of 32k tokens, this reduces the cache memory usage by 8x, without impacting the model quality."^[2] This is one of the most often-cited practical advantages of SWA: cache memory becomes bounded by W rather than growing without limit with sequence length.
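
The $i \bmod W$ indexing can be sketched in a few lines. The class below is a toy illustration with hypothetical names that stores arbitrary Python objects in place of key and value tensors; it is not Mistral's implementation.

```python
class RollingBufferCache:
    """Fixed-size cache: the entry for position i lives in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window
        self.next_position = 0

    def append(self, kv) -> None:
        # Once next_position >= window, this overwrites the oldest entry.
        self.slots[self.next_position % self.window] = (self.next_position, kv)
        self.next_position += 1

    def visible(self) -> list:
        """Cached (position, kv) pairs in chronological order."""
        filled = [entry for entry in self.slots if entry is not None]
        return sorted(filled, key=lambda entry: entry[0])


cache = RollingBufferCache(window=4)
for i in range(10):
    cache.append(f"kv{i}")
print([position for position, _ in cache.visible()])
```

With a 32k-token sequence and $W = 4096$, an unbounded cache would hold 8 times as many entries, which matches the 8x saving quoted above.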

Pre-fill and chunking

For long prompts, Mistral 7B pre-fills the cache with the prompt before token-by-token generation begins. The paper describes a chunking scheme in which the prompt is split into chunks of size equal to the window, and each chunk's attention is computed against (a) the previous chunk in the cache and (b) the current chunk under a causal mask. Tokens outside the sliding window are excluded from both the cache and the attention mask.^[2]
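
The chunking scheme can be sketched by tracking which key positions each query sees. The code below assumes a chunk size equal to the window and the inclusive range $i - W$ to $i$; the function name is hypothetical and tensors are replaced by position indices.

```python
def chunked_prefill_keys(prompt_len: int, window: int) -> dict[int, list[int]]:
    """Map each query position to the key positions it attends to when the
    prompt is processed in chunks of `window` tokens."""
    attended = {}
    for start in range(0, prompt_len, window):
        chunk = range(start, min(start + window, prompt_len))
        cache = range(max(0, start - window), start)  # (a) previous chunk only
        for i in chunk:
            in_chunk = [j for j in chunk if j <= i]   # (b) causal mask in the chunk
            candidates = list(cache) + in_chunk
            attended[i] = [j for j in candidates if j >= i - window]  # window mask
    return attended


keys = chunked_prefill_keys(prompt_len=10, window=4)
print(keys[5])
print(keys[9])
```

Under these assumptions the chunked pass attends to the same keys as a single unchunked sliding-window pass, because no query needs a key older than the previous chunk.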

Implementation: FlashAttention and xFormers

To get the wall-clock benefit of sliding window attention, Mistral relies on optimized kernels from FlashAttention and Meta's xFormers library. The paper states: "In practice, for a sequence length of 16K and $W = 4096$ , changes made to FlashAttention and xFormers yield a 2x speed improvement over a vanilla attention baseline."^[2] The acknowledgements thank Tri Dao and Daniel Haziza for these kernel changes.^[2]

Mixtral

Mistral AI's Mixtral mixture-of-experts models (Mixtral and Mixtral 8x22B) build on the Transformer block design of Mistral 7B and replace the feed-forward layer with a sparse mixture of experts. Whether the sliding window mask is actually applied differs between releases, so the attention pattern of a given Mixtral checkpoint should be confirmed against Mistral AI's documentation or the model's configuration file.

What are the trade-offs of sliding window attention?

Sliding window attention buys efficiency at the cost of expressive power inside a single layer. The trade-offs are well documented in the source papers.

Pros

Linear time and memory in n (when w is fixed), enabling much longer sequences for a given hardware budget.^[1]^[2]
Bounded KV cache size during decoding (window W rather than full context).^[2]
Empirically strong results on long-document tasks: Longformer reports state-of-the-art on text8 and enwik8 character-level language modeling and improvements over RoBERTa on WikiHop, TriviaQA, HotpotQA, OntoNotes, IMDB, and Hyperpartisan classification.^[1]
Mistral 7B, which combines sliding window attention with grouped-query attention, outperforms Llama 2 13B on the benchmarks reported in the paper and surpasses Llama 1 34B in reasoning, mathematics, and code generation, showing that a windowed model can remain competitive.^[2]

Cons

Inside any single layer, tokens more than W positions away from each other cannot communicate at all. Long-range dependencies must travel through many layers, and information that requires precise long-range copy or retrieval (for example, retrieving an exact name introduced thousands of tokens earlier) is harder to learn than with full attention.
The "theoretical receptive field" of $LW$ tokens is an upper bound; the effective receptive field can be substantially smaller because each hop attenuates the signal.
Bidirectional variants such as Longformer's require task-specific global tokens for good performance on downstream tasks such as classification or question answering, adding a hyperparameter (which positions to make global).^[1]
The custom banded attention pattern is not natively supported by the standard matrix multiplications in deep learning libraries, so dedicated kernels (custom CUDA, Triton, FlashAttention, xFormers) are required to actually realize the theoretical speedups.^[1]^[2]

How does sliding window attention compare to other long-context methods?

Sliding window attention is one of several families of techniques for extending Transformer context. Alternatives differ in whether they change the attention pattern, the position encoding, the computational kernel, or the hardware parallelization strategy.

FlashAttention

FlashAttention, introduced by Dao et al., is an exact, IO-aware implementation of standard dense attention that fuses the softmax with the matrix multiplication and tiles the computation so that intermediate scores never have to be written to high-bandwidth memory.^[6] FlashAttention does not reduce the asymptotic $O(n^2)$ work of attention, but in practice it makes full dense attention far faster and more memory-efficient than naive implementations, narrowing the gap with sparse patterns. The Mistral 7B paper specifically credits FlashAttention as part of the kernel work that makes its SWA implementation 2x faster than a vanilla baseline.^[2] FlashAttention is one reason modern open-weight models have often chosen full attention over sliding window attention at moderate context lengths.
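
The idea that tiling can be exact rests on a running (online) softmax. The sketch below shows it for a single query in plain Python. It is a simplified illustration, not the FlashAttention kernel: it omits the $1/\sqrt{d}$ scaling, query tiling, and any GPU memory management, and all names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q, keys, values):
    """Reference: materialize all scores, then apply softmax."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))]


def attention_tiled(q, keys, values, block=2):
    """Process keys block by block, keeping only a running max, sum, and output."""
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.9], [2.0, 0.1], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 0.0]]
print(attention_naive(q, keys, values))
print(attention_tiled(q, keys, values))
```

Both functions perform the same $O(n)$ work per query; the tiled version only changes how much intermediate state is held at once.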

Ring attention

Ring attention parallelizes full attention across many devices by streaming key-value blocks around a ring of accelerators while overlapping communication with computation. Like FlashAttention it does not change the attention pattern itself; it changes how the computation is distributed. Ring attention is complementary to sparse patterns such as sliding window attention: a model could in principle combine ring-style parallelism with a banded attention mask.

Position-encoding extensions: RoPE and YaRN

A different family of approaches extends the usable context of a model without changing the attention pattern, by modifying the position encoding. Rotary position embedding (RoPE) injects positional information into the queries and keys via rotation matrices indexed by absolute position. YaRN (Yet another RoPE extensioN) interpolates and rescales RoPE so that a model originally trained at one context length can extrapolate or be fine-tuned to a longer one. These methods are orthogonal to sliding window attention: a model can use both, although in practice modern long-context models often rely on RoPE-based scaling plus FlashAttention with full attention, rather than sparse windowed patterns.
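
The rotation at the heart of RoPE can be sketched directly. The code below rotates adjacent pairs of dimensions with the commonly used base of 10,000; the pairing convention and the function names are assumptions, and implementations differ in how they pair dimensions. YaRN's rescaling is not shown.

```python
import math


def rope_rotate(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each consecutive pair of `vec` by an angle proportional to `position`."""
    dim = len(vec)
    out = []
    for p in range(0, dim, 2):
        theta = position * base ** (-p / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[p], vec[p + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Although the rotation is indexed by absolute position, the query-key score
# depends only on the offset between the two positions (here 2 in both cases).
print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
print(dot(rope_rotate(q, 12), rope_rotate(k, 10)))
```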

BigBird and Longformer

As discussed above, BigBird and Longformer are themselves SWA-based; they extend pure sliding window attention with global tokens (both) and random connections (BigBird only).^[1]^[3] Both are theoretical and empirical demonstrations that linear-time sparse attention can match full attention on many tasks, but neither has displaced full attention as the default for modern decoder-only large language models.

Is sliding window attention still used in newer Mistral models?

After Mistral 7B introduced sliding window attention to a high-profile decoder-only language model, much community discussion focused on whether SWA would remain part of Mistral's architecture in successor models or whether full attention (helped by FlashAttention and improved position scaling) would supplant it.

In Hugging Face's transformers library, the MistralAttention implementation has been the subject of open discussion regarding how and when the sliding window is actually applied at inference time, and several versions of model configurations distributed on the Hugging Face Hub set the sliding_window parameter to null for newer Mistral models, indicating that the SWA mask is not used at runtime.^[7]^[8] Because this article restricts itself to verifiable claims, the precise attention pattern of each post-Mistral-7B model should be confirmed against the specific model's official documentation or configuration file rather than assumed to follow the Mistral 7B paper.

What is clear from the published record is:

Sliding window attention is a well-defined, well-motivated sparse attention pattern with $O(nw)$ complexity and bounded KV cache.^[1]^[2]
It is empirically effective in both encoder-style (Longformer, BigBird) and decoder-style (Mistral 7B) settings.^[1]^[2]^[3]
It is one of several long-context strategies, and the popularity of any given approach is heavily influenced by the availability of fast kernels (such as FlashAttention) for the alternatives, and by position-encoding tricks (such as YaRN on top of RoPE) that let models trained at one context length be used at another.

References

[1] Beltagy, I., Peters, M. E., & Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150. https://arxiv.org/abs/2004.05150 (Accessed 2026-06-22).
[2] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). "Mistral 7B." arXiv:2310.06825. https://arxiv.org/abs/2310.06825 (Accessed 2026-06-22).
[3] Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2020). "Big Bird: Transformers for Longer Sequences." Advances in Neural Information Processing Systems 33 (NeurIPS 2020). arXiv:2007.14062. https://arxiv.org/abs/2007.14062 (Accessed 2026-06-22).
[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30. arXiv:1706.03762. https://arxiv.org/abs/1706.03762 (Accessed 2026-06-22).
[5] Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509. https://arxiv.org/abs/1904.10509 (Accessed 2026-06-22).
[6] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Advances in Neural Information Processing Systems 35. arXiv:2205.14135. https://arxiv.org/abs/2205.14135 (Accessed 2026-06-22).
[7] Hugging Face. "Mistral" model documentation. https://huggingface.co/docs/transformers/en/model_doc/mistral (Accessed 2026-06-22).
[8] Hugging Face Transformers Issue #29777, "MistralAttention: where is the sliding window." https://github.com/huggingface/transformers/issues/29777 (Accessed 2026-06-22).
