LongNet
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,781 words
LongNet is a transformer variant introduced by Microsoft Research in July 2023 that is designed to scale attention to sequences exceeding one billion tokens while preserving performance on shorter inputs. The architecture is built around dilated attention, a sparse attention pattern in which each attention head attends only to query, key, and value positions selected at a fixed dilation rate inside a fixed segment length; stacking heads with geometrically growing rates produces a receptive field that covers the entire sequence with linear total work. The paper, "LongNet: Scaling Transformers to 1,000,000,000 Tokens" by Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei, was released on arXiv on 5 July 2023 with a revised v2 on 19 July 2023.[^1] A reference implementation was added to the open-source microsoft/torchscale library in December 2023, alongside a vision variant called LongViT.[^2] LongNet is frequently cited as a baseline or reference point in the broader effort to make long-context language models tractable, sitting alongside Ring Attention, Mamba, Hyena, and Infini-Attention as a representative long-context approach.
| Property | Value |
|---|---|
| Original paper | LongNet: Scaling Transformers to 1,000,000,000 Tokens |
| arXiv ID | 2307.02486 |
| First posted | 5 July 2023 (v1); revised 19 July 2023 (v2) |
| Authors | Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei |
| Affiliations | Microsoft Research; Xi'an Jiaotong University (Nanning Zheng) |
| Core mechanism | Dilated attention with geometric segment length / dilation rate progression |
| Asymptotic attention cost | O(N d) per layer |
| Token-pair path length | O(log N) |
| Implementation | microsoft/torchscale, MIT license |
| Experimental sequence length | Up to 32,768 tokens (perplexity); up to 1B tokens for latency/scaling tests |
| Status | "Work in progress" (per arXiv header) |
The paper itself states that it can "successfully scale the sequence length to 1 billion tokens, while maintaining the performance on shorter sequences," but the language-modeling perplexity results in the paper are reported only up to 32K tokens; the billion-token regime is established by runtime, throughput, and scaling-argument analyses rather than by a billion-token trained checkpoint.[^1][^3]
Standard self-attention in a transformer block computes, for a sequence of length N and hidden dimension d, a softmax-weighted interaction between every pair of query and key vectors. The dominant term in this computation is the N x N attention matrix, giving the layer a compute cost of O(N^2 d) and, when the matrix is materialised, an activation-memory cost of O(N^2). This quadratic scaling has been the central obstacle to extending context windows in language models: doubling the sequence length roughly quadruples both wall-clock time and activation memory for the attention layer, and the problem grows worse as practitioners push toward million- and billion-token sequences for use cases such as whole-codebase analysis, multi-document reasoning, long video, and genome-scale modeling.[^1][^4]
Two broad strategies have been pursued to control this cost. The first replaces the dense softmax attention with a sparsified or compressed variant that lowers the asymptotic complexity. Examples include sparse / strided attention as introduced in Sparse Transformer, the local + global pattern of Longformer, the random + window + global pattern of BigBird, and kernel- or low-rank approximations such as Linformer and Performer.[^1][^5] More recent entrants such as Mamba (a structured state-space model), Hyena (implicit long convolutions parameterised by neural networks), and Mega (moving-average equipped gated attention) abandon softmax attention entirely in favour of recurrent or convolutional surrogates that scale linearly or quasi-linearly with sequence length.[^4][^6] The second strategy preserves exact attention but distributes the computation across many devices, exemplified by Ring Attention, which partitions the sequence across GPUs in a ring topology and overlaps key/value transmission with local block-wise attention so that exact softmax attention can run at million-token scale.[^7]
LongNet sits between these two strategies. Like sparse attention methods it changes the attention pattern to break the O(N^2) wall, but it preserves exact softmax over the selected positions rather than approximating it. Like Ring Attention, it is explicitly co-designed with sequence parallelism: the dilated pattern is chosen so that the per-segment computation can be sharded across devices with bounded communication cost. The motivating premise, articulated in the abstract, is that scaling sequence length is "as critical as scaling model size and training tokens" and that future foundation models should treat "a whole corpus or even the entire Internet as a sequence."[^1]
The first version of the LongNet paper appeared on arXiv on 5 July 2023, about three months before Ring Attention (October 2023) and about five months before Mamba (December 2023). It was authored by a Microsoft Research team whose related long-sequence and architectural work includes MAGNETO (a foundation-model normalisation scheme), xPos (an extrapolatable variant of rotary position embedding), and Retentive Network (RetNet). The same group maintained the open-source torchscale library, where they consolidated these architectures.[^2][^8] A short v2 revision was posted on 19 July 2023 and remains the current public version; the paper carries a "Work in progress" notice and has not been published at a peer-reviewed venue at the time of writing.[^3]
A reference implementation followed in December 2023, when LongNet (along with a vision adaptation called LongViT) was added to microsoft/torchscale under the MIT license. The implementation exposes the dilated pattern through two configuration lists, segment_length and dilated_ratio, and depends on Flash Attention for efficient attention on the selected positions.[^2] An unofficial PyTorch port by Frank Odom (fkodom/dilated-attention-pytorch) and a third-party "plug in and play" repository by Kye Gomez (kyegomez/LongNet) appeared before the official release and helped spread the design to practitioners outside Microsoft, although these projects are not endorsed by the paper's authors.[^9][^10]
Subsequent long-context work has frequently cited LongNet either as a baseline or as an exemplar of a particular design decision (sparse exact attention plus sequence parallelism). Inside Microsoft, LongNet is one of several architectural building blocks consolidated in torchscale alongside RetNet and the MAGNETO/xPos modifications, and the dilated-attention idea has been generalised by later papers such as PowerAttention, which use exponentially expanding receptive fields with denser local coverage.[^11]
For a single attention head, let Q, K, V in R^{N x d} be the per-token query, key, and value projections. Vanilla self-attention computes O = softmax(Q K^T / sqrt(d)) V, with cost dominated by the N x N matrix Q K^T. The N^2 term is what LongNet seeks to remove.[^1]
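As a point of reference, the dense computation above can be written in a few lines of NumPy. This is a minimal, unmasked sketch (single head, no causal mask, no batching); the function name `softmax_attention` and the toy sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Dense single-head attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (N, N) matrix: the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)   # stabilise the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
N, d = 16, 4  # toy sizes, chosen only for illustration
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
out = softmax_attention(q, k, v)
print(out.shape)
```

The `scores` array has N x N entries regardless of d, which is the term that dilated attention avoids materialising.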
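Dilated attention introduces two hyperparameters per head: a segment length w (how many consecutive tokens form an attention block) and a dilation rate r (the stride within that block). The procedure for a single dilated head is: split Q, K, and V into N/w segments of length w; within each segment, keep every r-th row; run softmax attention within each sparsified segment, in parallel across segments; and scatter the per-segment outputs back to their original positions, concatenating them into the output O.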
Because each segment now performs an attention over only w/r tokens, the per-segment cost is O((w/r)^2 d). Summed over N/w segments, the total cost is O(N w d / r^2), which for fixed w and r is linear in N. The exact form of the sparsified queries given in the paper is Q-tilde_i = [Q_{i w}, Q_{i w + r}, Q_{i w + 2 r}, ...].[^1][^12]
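The segment, sparsify, attend, and scatter procedure for one (w, r) pair can be sketched in NumPy as follows. This is an unmasked, single-head illustration under stated assumptions: `dilated_attention` and its `offset` argument are hypothetical names, the loop over segments stands in for what a real implementation would batch, and positions that the pattern skips are simply left at zero.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Dense attention over whatever rows are passed in."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def dilated_attention(q, k, v, w, r, offset=0):
    """One dilated pattern with segment length w and dilation rate r.

    Rows not selected by the pattern keep a zero output; in a full layer
    other heads or patterns would cover them.
    """
    n = q.shape[0]
    assert n % w == 0 and w % r == 0 and 0 <= offset < r
    out = np.zeros_like(v)
    for start in range(0, n, w):
        # Every r-th position inside this segment (w / r positions in total).
        idx = np.arange(start + offset, start + w, r)
        # Attend only among the selected rows, then scatter back in place.
        out[idx] = softmax_attention(q[idx], k[idx], v[idx])
    return out

rng = np.random.default_rng(0)
N, d = 16, 4
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))

dense = softmax_attention(q, k, v)
full = dilated_attention(q, k, v, w=N, r=1)     # one segment, no dilation
sparse = dilated_attention(q, k, v, w=8, r=2)   # two segments, stride 2
```

With w = N and r = 1 the sketch reduces to dense attention, while w = 8 and r = 2 computes two 4 x 4 score matrices instead of one 16 x 16 matrix.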
A single (w, r) head only attends to a periodic subset of the sequence and would miss many token-pair interactions. LongNet therefore composes multiple dilated heads with growing segment lengths and dilation rates, so that small (w, r) heads provide dense local context and large (w, r) heads provide sparse global context. The default LongNet recipe doubles the segment length at each step, w = {2048, 4096, 8192, 16384, 32768}, and pairs it with dilation rates r = {1, 2, 4, 6, 12}, which grow roughly (not exactly) geometrically; all patterns run in parallel within a single multi-head layer.[^12] The outputs are combined as a weighted sum whose weights are the normalised softmax denominators,
O = sum_{i=1..k} alpha_i O|_{r_i, w_i}, with alpha_i = s_i / sum_j s_j,
where s_i is the denominator of the softmax in head i. Using the softmax denominator as the mixing weight (rather than a learnable scalar) makes the combined output behave like an exact softmax over the union of the per-head attended positions, which the paper reports to be empirically superior to learnable gating.[^12]
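The effect of using softmax denominators as mixing weights can be checked numerically. The sketch below is an illustration for one query and two hypothetical heads that see disjoint key sets, which is the case where the combination equals a single softmax over the union exactly; when the position sets overlap, shared positions are counted once per head, so the equivalence is only approximate. The helper name `attend` is an assumption.

```python
import numpy as np

def attend(q, k, v):
    """Attention output for one query vector, plus its softmax denominator."""
    scores = k @ q / np.sqrt(q.shape[0])
    e = np.exp(scores)
    s = e.sum()                      # softmax denominator s_i
    return (e / s) @ v, s

rng = np.random.default_rng(0)
d = 4
q = rng.standard_normal(d)
k = rng.standard_normal((12, d))
v = rng.standard_normal((12, d))

# Two hypothetical heads attending to disjoint position sets.
set_a, set_b = np.arange(0, 6), np.arange(6, 12)
o_a, s_a = attend(q, k[set_a], v[set_a])
o_b, s_b = attend(q, k[set_b], v[set_b])

# alpha_i = s_i / sum_j s_j, as in the combination rule above.
alpha_a = s_a / (s_a + s_b)
alpha_b = s_b / (s_a + s_b)
combined = alpha_a * o_a + alpha_b * o_b

# Reference: a single softmax over the union of both sets.
o_union, _ = attend(q, k, v)
print(np.allclose(combined, o_union))
```

The identity holds because each head's output is its own exponent-weighted sum divided by s_i, so multiplying by s_i and dividing by the total restores the shared normaliser.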
The geometric progression has two important consequences. First, even though each head touches only a sparse subset of positions, every token can reach every other token in at most O(log N) hops across heads, because the segment length doubles (or more) per head. This is the source of the paper's "logarithm dependency between any two tokens" claim.[^1] Second, the total cost of all heads at a layer remains linear in N because the per-head cost, which is proportional to w / r^2, decreases geometrically when w and r grow at the same geometric rate, so the sum is bounded by a constant multiple of the most expensive head (the one with the smallest w and r).
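The linear-cost claim can be made concrete with the (w, r) values quoted above. The following is a back-of-the-envelope sketch: it counts only the N w d / r^2 term per pattern, sums the patterns as if each ran at full width (it ignores how patterns are divided among heads), and the function names are illustrative.

```python
segment_lengths = [2048, 4096, 8192, 16384, 32768]
dilation_rates = [1, 2, 4, 6, 12]

def dilated_cost(n, d, w, r):
    """Score-matrix work for one pattern: (n / w) segments of (w / r)^2 * d."""
    return n * w * d / r**2

def dense_cost(n, d):
    """Score-matrix work for dense attention."""
    return n * n * d

n, d = 32768, 768
per_head = [dilated_cost(n, d, w, r)
            for w, r in zip(segment_lengths, dilation_rates)]
total = sum(per_head)
ratio = dense_cost(n, d) / total

for (w, r), cost in zip(zip(segment_lengths, dilation_rates), per_head):
    print(f"w={w:6d} r={r:2d} cost relative to first pattern: {cost / per_head[0]:.3f}")
print(f"dense / dilated at n={n}: {ratio:.2f}")
```

Each successive pattern is cheaper than the one before it, so the total stays within a small constant factor of the first (densest) pattern, and doubling n doubles the total.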
Inside a head group, LongNet further diversifies coverage by having each head choose a different offset into the dilation pattern. Two heads with the same (w, r) but offsets 0 and 1 will sparsify on a different subgrid of positions, so the union of the heads' attended positions in a segment is denser than any single head. This is analogous to the multi-pattern decomposition used in BigBird and Longformer but expressed compactly as a stride/offset rather than as window plus global plus random patterns.[^1][^12]
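A small illustration of the offset idea, assuming the offset of head j is j mod r (the helper `selected_positions` and the toy values of w and r are hypothetical):

```python
def selected_positions(w, r, offset):
    """Positions inside one segment of length w kept by stride r and a given offset."""
    return set(range(offset, w, r))

w, r = 16, 4
num_heads = 4

single = selected_positions(w, r, offset=0)
union = set()
for j in range(num_heads):
    union |= selected_positions(w, r, offset=j % r)

print(sorted(single))
print(len(union), "of", w, "positions covered by", num_heads, "heads")
```

With r heads using distinct offsets, the union of the selected positions covers the whole segment even though each head still attends over only w / r positions.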
The formal complexity argument in the paper proceeds in three steps:
This argument covers constant factors as well as asymptotics: LongNet is not merely O(N d) in theory, it is also implementable as a sequence of small FlashAttention calls on contiguous segments, which is the property that makes the architecture practical on real GPUs.
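The bound can be written out as a short derivation. This is a sketch under stated assumptions rather than a transcription of the paper: the factor of 2 counts the two matrix products in attention, and the segment lengths and dilation rates are assumed to follow an exact geometric progression w_i = \alpha^i w_0, r_i = \alpha^i with \alpha > 1.

```latex
% One (w, r) pattern: N/w segments, each attending over w/r positions.
\mathrm{FLOPs}(w, r) = \frac{N}{w} \cdot 2 \left(\frac{w}{r}\right)^{2} d
                     = \frac{2 N w d}{r^{2}}

% k patterns with w_i = \alpha^{i} w_0 and r_i = \alpha^{i}, for i = 0, \dots, k-1:
\sum_{i=0}^{k-1} \frac{2 N w_i d}{r_i^{2}}
  = 2 w_0 N d \sum_{i=0}^{k-1} \alpha^{-i}
  \le \frac{2\alpha}{\alpha - 1}\, w_0 N d
  = O(N d) \quad \text{for fixed } w_0 \text{ and } \alpha .
```

The geometric series is dominated by its first term, which is why the total is a constant multiple of the cost of the densest pattern.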
The second half of the dilated-attention design is its compatibility with sequence parallelism, the technique of partitioning a single sequence across multiple devices along the sequence dimension rather than along the model or batch dimension. This is distinct from data parallelism, model parallelism, tensor parallelism, or pipeline parallelism and is necessary once a single sequence is too large to fit in one GPU's memory. The paper describes the distributed algorithm as orthogonal to those other parallelism axes: a LongNet training run can stack sequence parallelism for the attention with tensor parallelism for the feed-forward block and data parallelism across replicas.[^1][^12]
The algorithm has three components:
The key claim is that the communication cost of this scheme is constant with respect to N, because the per-device gather payload depends on w/r and on the number of devices, not on the total sequence length. As a result, the per-device compute and the per-device communication both remain bounded as the total sequence is scaled by adding more devices. This is what enables the headline 1B-token configuration: the paper reports near-constant per-token training latency as the global sequence length grows from 8K through 1B, with the number of GPUs scaled proportionally.[^1][^12]
The distributed scheme is also what differentiates LongNet from naive sparse attention. Many earlier sparse patterns (such as random sparsity) require gather-scatter operations whose communication cost scales poorly under sequence parallelism. The dilated pattern is regular, fixed, and aligned to the segment boundaries that are themselves aligned to device shards, so the communication can be expressed as a single all-gather per segment per head and scheduled to overlap with computation.[^12]
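The sparsify-then-gather idea can be simulated in a single process. The sketch below is an illustration only: it uses array slicing in place of real collectives, is unmasked, handles only the case where a segment spans a whole number of device shards, and all function names are hypothetical. It checks that the sharded result matches the single-device result and that the gathered key/value payload per segment is w / r positions regardless of N.

```python
import numpy as np

def softmax_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def dilated_single_device(q, k, v, w, r):
    out = np.zeros_like(v)
    for start in range(0, q.shape[0], w):
        idx = np.arange(start, start + w, r)
        out[idx] = softmax_attention(q[idx], k[idx], v[idx])
    return out

def dilated_sequence_parallel(q, k, v, w, r, num_devices):
    n = q.shape[0]
    local = n // num_devices
    # This sketch assumes segments cover whole shards and strides align with shard starts.
    assert n % num_devices == 0 and w % local == 0 and local % r == 0 and n % w == 0
    shards = [np.arange(p * local, (p + 1) * local) for p in range(num_devices)]
    per_segment = w // local
    out = np.zeros_like(v)
    payloads = []
    for g in range(0, num_devices, per_segment):
        group = shards[g:g + per_segment]
        sparse_idx = [s[::r] for s in group]   # each device sparsifies its own shard
        gathered = np.concatenate(sparse_idx)  # simulated all-gather within the segment
        k_g, v_g = k[gathered], v[gathered]
        payloads.append(len(gathered))
        for idx in sparse_idx:                 # local sparse queries vs gathered K/V
            out[idx] = softmax_attention(q[idx], k_g, v_g)
    return out, payloads

rng = np.random.default_rng(0)
d, w, r = 4, 16, 2
q, k, v = (rng.standard_normal((32, d)) for _ in range(3))
out_sp, payload_32 = dilated_sequence_parallel(q, k, v, w, r, num_devices=4)
reference = dilated_single_device(q, k, v, w, r)

q2, k2, v2 = (rng.standard_normal((64, d)) for _ in range(3))
_, payload_64 = dilated_sequence_parallel(q2, k2, v2, w, r, num_devices=8)
print(np.allclose(out_sp, reference), set(payload_32), set(payload_64))
```

Doubling both the sequence length and the device count leaves the per-segment payload unchanged, which is the property the constant-communication claim relies on.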
The published LongNet experiments use a transformer decoder with the MAGNETO normalisation scheme and xPos positional embeddings (the same backbone used by RetNet and other Microsoft Research models in torchscale).[^2][^12] The principal model has 12 layers, hidden dimension 768, and 12 heads, totalling roughly 125M parameters, although the paper also includes a scaling-law sweep up to 2.7B parameters. The tokenizer is cl100k_base, the byte-pair encoding used by OpenAI's GPT-3.5 and GPT-4. The training corpus is The Stack, a large permissively licensed source-code dataset.[^12] The optimiser, learning-rate schedule, and batch composition follow the conventions of the torchscale recipes, with 300K training steps and roughly 500K tokens per batch.[^12]
The dilated-attention hyperparameters used in the language-modelling experiments are segment sizes w = {2048, 4096, 8192, 16384, 32768} with dilation rates r = {1, 2, 4, 6, 12}, executed in parallel within each layer. With these settings the maximum supported context window for the language-modelling runs is 32,768 tokens, the same as the largest segment. This is the experimental ceiling for the reported perplexity numbers, as opposed to the theoretical 1B-token ceiling supported by the architecture and distributed algorithm.[^12]
The headline language-modelling table compares three configurations: a vanilla Transformer trained at 2K context, a Sparse Transformer trained at 32K context, and LongNet trained at 32K context. All are evaluated on The Stack at evaluation contexts of 2K, 8K, and 32K tokens. The reported test perplexities are:[^12]
| Model | Train length | Eval 2K | Eval 8K | Eval 32K |
|---|---|---|---|---|
| Vanilla Transformer | 2,048 | 4.24 | 5.07 | 11.29 |
| Sparse Transformer | 32,768 | 5.15 | 4.00 | 3.64 |
| LongNet | 32,768 | 4.37 | 3.33 | 3.01 |
LongNet's perplexity drops monotonically as the evaluation context grows, which the paper interprets as evidence that the model uses the additional context productively rather than ignoring it. The vanilla Transformer, by contrast, degrades sharply past its training context because xPos extrapolation does not fully compensate for unseen positional ranges. Compared with the Sparse Transformer baseline at matched training length, LongNet has both lower perplexity at every evaluation length and lower wall-clock training cost per step.[^12]
A second experiment scales the model size from 125M to 2.7B parameters at fixed context length and fits a power-law of validation loss as a function of compute, in the style of the Chinchilla scaling laws. LongNet's scaling curve closely tracks a vanilla Transformer baseline trained at shorter context, while reaching lower absolute loss because of its longer context. The paper interprets this as evidence that dilated attention does not degrade the "loss versus compute" curve relative to dense attention.[^1][^12]
The runtime / throughput experiments are the basis of the billion-token headline. Holding the per-device sequence length constant and adding GPUs, the paper reports near-flat per-token latency as the global sequence length is scaled through 8K, 32K, 128K, 1M, and on up to 1B tokens. The runtime of LongNet at 1B tokens is reported as roughly comparable to that of a vanilla Transformer at 32K tokens, while the vanilla Transformer's runtime grows quadratically and quickly exceeds the test budget.[^1][^12] These experiments are described as forward-pass / training-step measurements rather than full pretraining runs, and the paper does not claim to have trained a converged language model at 1B-token context.
The reference implementation lives in the torchscale repository under the torchscale/architecture directory and is exposed as a LongNet model class with encoder and decoder variants. The dilated attention is implemented by precomputing the sparsified position indices for each head and dispatching to Flash Attention to perform the actual softmax. Users specify the dilation pattern through two lists in the config, for example segment_length='[2048,4096]' and dilated_ratio='[1,2]'. The repository is published under the MIT license and was extended in December 2023 with both LongNet and LongViT modules.[^2]
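The two list-valued strings pair up position by position, one (segment length, dilation rate) pair per pattern. The helper below is a hypothetical sketch of how such strings could be parsed and sanity-checked; it is not torchscale's own parsing code, and the function name and the validation rules are assumptions.

```python
import json

def parse_dilation_config(segment_length: str, dilated_ratio: str):
    """Parse two JSON-style list strings into (w, r) pairs and check them."""
    ws = json.loads(segment_length)
    rs = json.loads(dilated_ratio)
    if len(ws) != len(rs):
        raise ValueError("segment_length and dilated_ratio must have the same length")
    if any(w <= 0 or r <= 0 for w, r in zip(ws, rs)):
        raise ValueError("segment lengths and dilation rates must be positive")
    return list(zip(ws, rs))

pairs = parse_dilation_config('[2048,4096]', '[1,2]')
print(pairs)
```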
LongViT applies the dilated-attention block to image tokens rather than text tokens. The motivation is the same as LongNet's: high-resolution images and large gigapixel-style inputs produce token sequences whose length is dominated by the quadratic cost of attention. LongViT replaces the standard ViT attention block with dilated attention so that the receptive field still covers the full image with linear cost. The implementation also lives in torchscale.[^2]
Two community PyTorch ports are widely cited. fkodom/dilated-attention-pytorch provides an unofficial implementation of the dilated-attention block compatible with vanilla PyTorch and Flash Attention.[^9] kyegomez/LongNet provides a "plug in and play" PyTorch module that bundles the attention block with a transformer wrapper.[^10] Neither is affiliated with Microsoft, and both predate the official torchscale implementation; users should be aware that minor details (in particular the mixed-density combination rule and offset handling per head) may differ from the canonical implementation.
The closest conceptual ancestors of LongNet are sparse attention patterns. Sparse Transformer introduced strided and fixed attention masks with O(N sqrt(N)) cost; Longformer combined a sliding window with a small set of global tokens; BigBird added random sparsity on top of window and global. LongNet differs from these in two ways. First, the coverage of the dilated pattern grows exponentially rather than polynomially, which is what makes the per-pair path length O(log N), rather than O(sqrt N) or an O(1) path that holds only for window-local tokens. Second, LongNet emphasises co-design with sequence parallelism, whereas earlier sparse patterns were primarily designed for single-device efficiency.[^1][^4][^5]
Sliding window attention and its more recent variants (used in Mistral and several other open-weights LLMs) trade global reach for a fixed-cost local window plus, in some implementations, "attention sinks" or recurrent state. LongNet's local heads behave similarly to a sliding window, but the additional large-w/large-r heads supply the global coverage that a pure window lacks.[^4]
Mamba is a structured state-space model (S6) that achieves linear scaling without any attention, using a selective recurrence that conditions its state-transition parameters on the input. Hyena uses long, implicitly parameterised convolutions to act as a global mixing operator with sub-quadratic cost. Mega blends gated attention with an exponential moving average to recover long-range mixing without quadratic attention. Relative to these, LongNet keeps explicit softmax attention (with all the resulting interpretability and exact-recall properties) at the cost of needing the dilated pattern to be regular enough to fit a sequence-parallel schedule. Empirical comparisons typically find Mamba and Mamba-style hybrids more competitive in tokens-per-second and memory, while attention-based models such as LongNet tend to retain stronger associative recall on long-range tasks.[^4][^6] Comparative studies through 2024 and 2025, including the LongMamba analysis of Mamba's long-context limits, framed the choice as a trade-off between Mamba's efficiency and the recall fidelity of attention.[^6]
Ring Attention preserves exact dense softmax attention and instead shards the sequence across GPUs in a ring, overlapping the cyclic shift of K/V blocks with local block-wise attention. This keeps the per-pair complexity of dense attention but converts the memory bottleneck into a communication bottleneck, enabling million-token training with bounded per-device memory. LongNet and Ring Attention are therefore complementary: Ring Attention preserves the attention pattern at high communication cost; LongNet changes the pattern to make communication asymptotically cheap.[^7]
Infini-Attention takes a different route, retaining vanilla attention over a local window and adding a compressive long-term memory that is updated recurrently. This trades exactness on long-range positions for constant memory cost regardless of sequence length. LongNet retains exact attention over its (sparsified) positions, so its global heads can produce recall over individual long-range tokens that a compressed memory cannot.[^4]
| Approach | Per-layer cost | Token-pair path | Attention exactness | Default sharding |
|---|---|---|---|---|
| Vanilla Transformer | O(N^2 d) | O(1) | Exact softmax | Data / tensor / pipeline |
| Sparse Transformer | O(N sqrt(N) d) | O(1) within mask | Exact over mask | Data / tensor |
| Longformer | O(N w d) | O(N/w) | Exact over window + global | Data / tensor |
| BigBird | O(N (w + g + r) d) | O(log N) (random) | Exact over mask | Data / tensor |
| LongNet (dilated) | O(N d) | O(log N) | Exact over selected positions | Sequence + data + tensor |
| Ring Attention | O(N^2 d) globally; O((N/P)^2 d) per device | O(1) | Exact softmax | Sequence (ring) |
| Mamba | O(N d) | Recurrent (no explicit path) | Not attention | Sequence / data |
| Hyena | O(N log N d) | Convolutional | Not attention | Data / tensor |
| Infini-Attention | O(N w d) + compressive | Local exact + lossy global | Local exact, compressed global | Data / tensor |
Numbers in this table are taken from each method's primary paper as cited above; N is sequence length, d is hidden dimension, w is window or segment length, g is the number of global tokens, r is the number of random tokens per query, and P is the number of sequence-parallel devices.[^1][^4][^5][^6][^7]
LongNet's significance is twofold. First, it offers a concrete architecture that combines a fixed-pattern sparse attention with sequence parallelism, demonstrating that the two design choices can be combined into a single block whose cost is provably linear in sequence length and whose per-device communication cost is bounded. This makes it one of the cleaner case studies in the long-context literature for what an "attention block with linear cost" looks like when the design has to satisfy both per-device efficiency and distributed scalability.[^1][^12]
Second, the architecture motivates and supports the case that increasing context length is a third axis of scaling, alongside model size and training tokens. The paper frames this as the ability to treat "a whole corpus or even the entire Internet as a sequence," a framing that has been picked up in subsequent long-context work.[^1] Whether or not 1B-token training is currently economically useful, the design pattern (segment + dilation + sequence parallelism) has been generalised by later models, most prominently in PowerAttention and other exponentially expanding receptive field schemes.[^11]
In code modelling specifically, the 32K-token LongNet language-model results on The Stack show monotonically decreasing perplexity with increasing evaluation context, which is the property practitioners care about for whole-file or whole-repository code intelligence. The vision variant LongViT extends the same idea to image tokens, which is relevant for high-resolution and gigapixel-image classification.[^2][^12]
Several caveats are routinely raised about LongNet in secondary discussion:
LongNet is related to a cluster of architectures and systems that aim at the same long-context bottleneck from different angles: