LongNet

Updated Jul 11, 2026

LongNet is a transformer variant introduced by Microsoft Research in July 2023 that is designed to scale attention to sequences exceeding one billion tokens while preserving performance on shorter inputs. The architecture is built around dilated attention, a sparse attention pattern in which each attention head attends only to query, key, and value positions selected at a fixed dilation rate inside a fixed segment length; stacking heads with geometrically growing rates produces a receptive field that covers the entire sequence with linear total work. The paper, "LongNet: Scaling Transformers to 1,000,000,000 Tokens" by Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei, was released on arXiv on 5 July 2023 with a revised v2 on 19 July 2023.^[1] A reference implementation was added to the open-source microsoft/torchscale library in December 2023, alongside a vision variant called LongViT.^[2] LongNet is frequently cited as a benchmark in the broader effort to make long-context language models tractable, sitting alongside Ring Attention, Mamba, Hyena, and Infini-Attention as a representative subquadratic approach.

Overview

Property	Value
Original paper	LongNet: Scaling Transformers to 1,000,000,000 Tokens
arXiv ID	2307.02486
First posted	5 July 2023 (v1); revised 19 July 2023 (v2)
Authors	Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei
Affiliations	Microsoft Research; Xi'an Jiaotong University (Nanning Zheng)
Core mechanism	Dilated attention with geometric segment length / dilation rate progression
Asymptotic attention cost	$O(N d)$ per layer
Token-pair path length	$O(\log N)$
Implementation	microsoft/torchscale, MIT license
Experimental sequence length	Up to 32,768 tokens (perplexity); up to 1B tokens for latency/scaling tests
Status	"Work in progress" (per arXiv header)

The paper itself states that it can "successfully scale the sequence length to 1 billion tokens, while maintaining the performance on shorter sequences," but the language-modeling perplexity results in the paper are reported only up to 32K tokens; the billion-token regime is established by runtime, throughput, and scaling-argument analyses rather than by a billion-token trained checkpoint.^[1]^[3]

Background

Standard self-attention in a transformer block computes, for a sequence of length $N$ and hidden dimension $d$, a softmax-weighted interaction between every pair of query and key vectors. The dominant term in this computation is the $N \times N$ attention matrix, giving each layer a compute cost of $O(N^2 d)$ and, when the matrix is materialised, an $O(N^2)$ memory cost. This quadratic scaling has been the central obstacle to extending context windows in language models: doubling the sequence length roughly quadruples the attention layer's compute and the size of its score matrix, and the problem grows worse as practitioners push toward million- and billion-token sequences for use cases such as whole-codebase analysis, multi-document reasoning, long video, and genome-scale modeling.^[1]^[4]
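To make the quadratic growth concrete, the following minimal sketch computes the size of one materialised $N \times N$ score matrix. The function name and the choice of 2 bytes per element (half precision) are illustrative assumptions, and IO-aware kernels such as FlashAttention avoid materialising this matrix at all, so the figures describe the naive case only.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one N x N attention score matrix for a single head.

    Assumes the matrix is fully materialised in memory (the naive case).
    """
    return seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for n in (2_048, 32_768, 1_000_000):
        gib = attention_matrix_bytes(n) / 2**30
        print(f"N={n:>9,}: {gib:,.3f} GiB per head")
```

Doubling `seq_len` multiplies the result by four, which is the scaling the paragraph above describes.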

Two broad strategies have been pursued to control this cost. The first replaces the dense softmax attention with a sparsified or compressed variant that lowers the asymptotic complexity. Examples include sparse / strided attention as introduced in Sparse Transformer, the local + global pattern of Longformer, the random + window + global pattern of BigBird, and kernel- or low-rank approximations such as Linformer and Performer.^[1]^[5] More recent entrants such as Mamba (a structured state-space model), Hyena (implicit long convolutions parameterised by neural networks), and Mega (moving-average equipped gated attention) abandon softmax attention entirely in favour of recurrent or convolutional surrogates that scale linearly or quasi-linearly with sequence length.^[4]^[6] The second strategy preserves exact attention but distributes the computation across many devices, exemplified by Ring Attention, which partitions the sequence across GPUs in a ring topology and overlaps key/value transmission with local block-wise attention so that exact softmax attention can run at million-token scale.^[7]

LongNet sits between these two strategies. Like sparse attention methods it changes the attention pattern to break the $O(N^2)$ wall, but it preserves exact softmax over the selected positions rather than approximating it. Like Ring Attention, it is explicitly co-designed with sequence parallelism: the dilated pattern is chosen so that the per-segment computation can be sharded across devices with bounded communication cost. The motivating premise, articulated in the abstract, is that scaling sequence length is "as critical as scaling model size and training tokens" and that future foundation models should treat "a whole corpus or even the entire Internet as a sequence."^[1]

History

The first version of the LongNet paper appeared on arXiv on 5 July 2023, about three months before Ring Attention (October 2023)^[7] and about five months before Mamba (December 2023). It was authored by a Microsoft Research team whose other long-sequence and architectural work includes MAGNETO (a foundation-model normalisation scheme), xPos (an extrapolatable variant of rotary position embedding), and Retentive Network (RetNet). The same group maintained the open-source torchscale library, where they consolidated these architectures.^[2]^[8] A short v2 revision was posted on 19 July 2023 and remains the current public version; the paper carries a "Work in progress" notice and has not been published at a peer-reviewed venue at the time of writing.^[3]

A reference implementation followed in December 2023, when LongNet (along with a vision adaptation called LongViT) was added to microsoft/torchscale under the MIT license. The implementation exposes the dilated pattern through two configuration lists, segment_length and dilated_ratio, and depends on Flash Attention for efficient masked attention on selected positions.^[2] An unofficial PyTorch port by Frank Odom (fkodom/dilated-attention-pytorch) and a third-party "plug in and play" repository by Kye Gomez (kyegomez/LongNet) appeared during the same period and helped spread the design to practitioners outside Microsoft, although these projects are not endorsed by the paper's authors.^[9]^[10]

Subsequent long-context work has frequently cited LongNet either as a baseline or as an exemplar of a particular design decision (sparse exact attention plus sequence parallelism). Inside Microsoft, LongNet is one of several architectural building blocks consolidated in torchscale alongside RetNet and the MAGNETO/xPos modifications, and the dilated-attention idea has been generalised by later papers such as PowerAttention, which use exponentially expanding receptive fields with denser local coverage.^[11]

Dilated attention

Standard attention recap

For a single attention head, let $Q, K, V \in \mathbb{R}^{N \times d}$ be the per-token query, key, and value projections. Vanilla self-attention computes $O = \mathrm{softmax}(Q K^\top / \sqrt{d}) V$, with cost dominated by the $N \times N$ matrix $Q K^\top$. The $N^2$ term is what LongNet seeks to remove.^[1]
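The formula above can be sketched in a few lines of NumPy. This is an illustrative, non-causal, single-head version; the function name `softmax_attention` is chosen here for the example and is not taken from any LongNet code.

```python
import numpy as np


def softmax_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Dense scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (N_q, N_k) score matrix
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v


rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
O = softmax_attention(Q, K, V)  # shape (N, d)
```

The `scores` array is the $N \times N$ object whose size and cost grow quadratically with the sequence length.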

Segment and dilation hyperparameters

Dilated attention introduces two hyperparameters per head: a segment length $w$ (how many consecutive tokens form an attention block) and a dilation rate $r$ (the stride within that block). The procedure for a single dilated head is:

Partition the input sequence of length $N$ into $N/w$ consecutive segments of length $w$.
Inside each segment, keep only every $r$-th row of $Q$, $K$, and $V$ to produce sparsified tensors $\tilde{Q}, \tilde{K}, \tilde{V}$ of length $w/r$ per segment.
Run standard scaled-dot-product attention on the sparsified segment: $O_{\text{segment}} = \mathrm{softmax}(\tilde{Q} \tilde{K}^\top / \sqrt{d}) \tilde{V}$.
Scatter the result back to the original positions.

Because each segment now performs an attention over only $w/r$ tokens, the per-segment cost is $O((w/r)^2 d)$. Summed over $N/w$ segments, the total cost is $O(N w d / r^2)$, which for fixed $w$ and $r$ is linear in $N$. The exact form of the sparsified queries given in the paper is $\tilde{Q}_i = [Q_{iw}, Q_{iw + r}, Q_{iw + 2r}, \ldots]$.^[1]^[12]
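The four steps (partition, sparsify, attend, scatter) can be sketched as follows. This is a simplified, non-causal, single-head illustration under the assumption that $N$ is a multiple of $w$; the function names are hypothetical, and positions that the pattern does not select are simply left at zero here.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Dense scaled dot-product attention on small arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def dilated_attention_head(q, k, v, w: int, r: int) -> np.ndarray:
    """One dilated pattern with segment length w and dilation rate r."""
    n = q.shape[0]
    assert n % w == 0, "sketch assumes N is a multiple of the segment length"
    out = np.zeros_like(v)
    for start in range(0, n, w):                  # step 1: partition into segments
        idx = np.arange(start, start + w, r)      # step 2: keep every r-th row
        out[idx] = softmax_attention(q[idx], k[idx], v[idx])  # steps 3 and 4
    return out


rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
full = dilated_attention_head(Q, K, V, w=N, r=1)    # reduces to dense attention
sparse = dilated_attention_head(Q, K, V, w=4, r=2)  # two segments, stride 2
```

With `w=N, r=1` the sketch is identical to dense attention; with `w=4, r=2` each segment attends over only $w/r = 2$ tokens, which is where the $O((w/r)^2 d)$ per-segment cost comes from.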

Mixed densities across heads

A single $(w, r)$ head only attends to a periodic subset of the sequence and would miss many token-pair interactions. LongNet therefore composes multiple dilated heads with growing segment lengths and dilation rates, so that small $(w, r)$ heads provide dense local context and large $(w, r)$ heads provide sparse global context. The default LongNet recipe doubles the segment length at each level, $w = \{2048, 4096, 8192, 16384, 32768\}$, and pairs it with dilation rates $r = \{1, 2, 4, 6, 12\}$ that grow roughly, though not exactly, geometrically; all patterns run in parallel within a single multi-head layer.^[12] The outputs are combined as a sum weighted by each pattern's softmax denominator,

$$O = \sum_{i=1}^{k} \alpha_i\, O\vert_{r_i, w_i}, \quad \text{with} \quad \alpha_i = \frac{s_i}{\sum_j s_j}$$

where $s_i$ is the denominator of the attention softmax for the $i$-th pattern $(r_i, w_i)$. Using the softmax denominator as the mixing weight (rather than a learnable scalar) makes the combined output equal to an exact softmax over the union of the per-head attended positions whenever those position sets do not overlap, and the paper reports it to be empirically superior to learnable weights.^[12]
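The effect of weighting by softmax denominators can be checked numerically for a single query. The sketch below is an illustration under two stated assumptions: the two key sets are disjoint, and the denominators are computed without the usual max-subtraction so that they are directly comparable. The helper names `attend` and `combine` are hypothetical.

```python
import numpy as np


def attend(q, k, v):
    """Attention for one query vector; returns (output, softmax denominator)."""
    scores = k @ q / np.sqrt(q.shape[0])
    e = np.exp(scores)   # unnormalised weights (no max-shift, for clarity)
    s = e.sum()          # the denominator s_i used as the mixing weight
    return (e / s) @ v, s


def combine(outputs, denominators):
    """Mix per-pattern outputs with alpha_i = s_i / sum_j s_j."""
    s = np.asarray(denominators)
    alpha = s / s.sum()
    return sum(a * o for a, o in zip(alpha, outputs))


rng = np.random.default_rng(0)
d = 4
q = rng.standard_normal(d)
K = rng.standard_normal((6, d))
V = rng.standard_normal((6, d))

set_a, set_b = [0, 1, 2], [3, 4, 5]          # disjoint attended positions
o_a, s_a = attend(q, K[set_a], V[set_a])
o_b, s_b = attend(q, K[set_b], V[set_b])

mixed = combine([o_a, o_b], [s_a, s_b])
joint, _ = attend(q, K, V)                    # one softmax over the union
```

Here `mixed` and `joint` agree to floating-point precision. If the two sets overlapped, the shared positions would be counted once per pattern, so the equality would no longer be exact.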

The geometric progression has two important consequences. First, even though each head touches only a sparse subset of positions, every token can reach every other token in at most $O(\log N)$ hops across heads, because the segment length doubles (or more) per head. This is the source of the paper's "logarithm dependency between any two tokens" claim.^[1] Second, the total cost of all heads at a layer remains linear in $N$ because the per-head cost $N w d / r^2$ decreases geometrically when $w$ and $r$ grow at the same rate, so the sum is bounded by a constant multiple of the cost of the densest (smallest $w$, smallest $r$) head.

Multi-head allocation

Inside a head group, LongNet further diversifies coverage by having each head choose a different offset into the dilation pattern. Two heads with the same (w, r) but offsets 0 and 1 will sparsify on a different subgrid of positions, so the union of the heads' attended positions in a segment is denser than any single head. This is analogous to the multi-pattern decomposition used in BigBird and Longformer but expressed compactly as a stride/offset rather than as window plus global plus random patterns.^[1]^[12]
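The offset idea can be illustrated with plain index arithmetic. The sketch assumes that head $j$ uses offset $j \bmod r$, which is how this example reads the paper's description; the function name `dilated_indices` is hypothetical.

```python
def dilated_indices(segment_start: int, w: int, r: int, head_index: int) -> list[int]:
    """Positions selected inside one segment by a given head.

    Assumption for this sketch: head j shifts the dilation grid by j mod r.
    """
    offset = head_index % r
    return list(range(segment_start + offset, segment_start + w, r))


w, r = 8, 4
per_head = [dilated_indices(0, w, r, j) for j in range(4)]
covered = sorted(set().union(*per_head))

for j, idx in enumerate(per_head):
    print(f"head {j}: {idx}")
print("union:", covered)
```

With $r = 4$ and four heads, each head sees only two of the eight positions, but the union of the four subgrids covers the whole segment.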

Complexity argument

The formal complexity argument in the paper proceeds in three steps:

For each head, the cost per layer is $O(N w d / r^2)$. With the geometric progression $r_i = 2^{i-1}$ and $w_i = w_0 \cdot 2^{i-1}$, the per-head costs $N w_0 d / 2^{i-1}$ form a geometric series whose sum is less than twice the cost of the first (densest) head, leaving an overall $O(N d)$ per layer for fixed $w_0$.
The largest segment length $w_k$ must grow to cover the maximum sequence length to be supported, and because segment lengths grow geometrically, the number of levels $k$ grows only logarithmically with $N$. The shortest path between any two tokens through the head graph is therefore $O(\log N)$.
Because the attention pattern is data-independent (it is a fixed function of position), masks can be precomputed and the attention can use FlashAttention kernels, preserving the IO-aware memory efficiency of dense kernels.^[1]^[12]

This argument covers constant factors as well as asymptotics: LongNet is not merely $O(N d)$ in theory, it can also be implemented as a sequence of small FlashAttention calls over the sparsified contents of contiguous segments, which is the property that makes the architecture practical on real GPUs.
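The geometric-series step of the argument can be checked with a few lines of arithmetic. The sketch below uses the idealised doubling progression from the first step ($r_i = 2^{i-1}$, $w_i = w_0 \cdot 2^{i-1}$) and counts only the dominant term $N w d / r^2$; the function names, the value of `d`, and the rule for choosing the number of levels are assumptions made for this example.

```python
def head_cost(n: int, w: int, r: int, d: int) -> float:
    """Dominant-term cost N * w * d / r^2 of one dilated pattern."""
    return n * w * d / r**2


def total_cost(n: int, d: int, w0: int, levels: int) -> float:
    """Sum of pattern costs with w_i = w0 * 2^i and r_i = 2^i, i = 0..levels-1."""
    return sum(head_cost(n, w0 * 2**i, 2**i, d) for i in range(levels))


d, w0 = 64, 2048
for n in (2**15, 2**20, 2**25):
    levels = (n // w0).bit_length()      # enough doublings for w to reach n
    ratio = total_cost(n, d, w0, levels) / head_cost(n, w0, 1, d)
    print(f"N=2^{n.bit_length() - 1}: levels={levels}, total / densest head = {ratio:.4f}")
```

The ratio stays below 2 however many levels are added, so the total grows linearly with $N$ while the number of levels grows only as $\log_2(N / w_0) + 1$.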

Distributed training and sequence parallelism

The second half of the dilated-attention design is its compatibility with sequence parallelism, the technique of partitioning a single sequence across multiple devices along the sequence dimension rather than along the model or batch dimension. This is distinct from data parallelism, model parallelism, tensor parallelism, or pipeline parallelism and is necessary once a single sequence is too large to fit in one GPU's memory. The paper describes the distributed algorithm as orthogonal to those other parallelism axes: a LongNet training run can stack sequence parallelism for the attention with tensor parallelism for the feed-forward block and data parallelism across replicas.^[1]^[12]

The algorithm has three components:

Sequence sharding. Each device receives a contiguous slice of the input tokens and computes Q, K, V projections locally. For dilation rate r and segment length w that fit inside a single shard, the dilated attention is a purely local operation: each device runs FlashAttention on its own sparsified segments without inter-device communication.
Cross-device gather for long segments. When a head's segment length w exceeds the per-device shard, the device performs an all-gather of the sparsified K and V tensors from peer devices so it can compute attention over the longer segment. Because dilation reduces the per-segment K/V size by a factor of r, the all-gather payload is small compared with a dense attention pattern of the same segment length.
Scatter and combine. The mixed-density combination (the softmax-weighted sum across heads) is performed locally per device after each head's output has been computed.^[12]

The key claim is that the communication cost of this scheme is constant with respect to N, because the per-device gather payload depends on w/r and on the number of devices, not on the total sequence length. As a result, the per-device compute and the per-device communication both remain bounded as the total sequence is scaled by adding more devices. This is what enables the headline 1B-token configuration: the paper reports near-constant per-token training latency as the global sequence length grows from 8K through 1B, with the number of GPUs scaled proportionally.^[1]^[12]

The distributed scheme is also what differentiates LongNet from naive sparse attention. Many earlier sparse patterns (such as random sparsity) require gather-scatter operations whose communication cost scales poorly under sequence parallelism. The dilated pattern is regular, fixed, and aligned to the segment boundaries that are themselves aligned to device shards, so the communication can be expressed as a single all-gather per segment per head and scheduled to overlap with computation.^[12]
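The gather step can be simulated on a single machine to show why only the sparsified keys and values need to cross device boundaries. This is a toy, non-causal sketch for one pattern whose segment spans the whole sequence ($w = N$); "devices" are just array slices, the all-gather is a concatenation, and every name is hypothetical. It assumes the shard length is a multiple of $r$.

```python
import numpy as np


def softmax_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def sequence_parallel_dilated(q, k, v, num_devices: int, r: int):
    """Simulate one dilated pattern (w = N) sharded over num_devices."""
    n = q.shape[0]
    shard = n // num_devices
    assert n % num_devices == 0 and shard % r == 0
    # Each device sparsifies its own contiguous shard locally.
    local_idx = [np.arange(p * shard, (p + 1) * shard, r) for p in range(num_devices)]
    # Simulated all-gather: only the sparsified K and V rows are exchanged.
    k_gathered = np.concatenate([k[idx] for idx in local_idx])
    v_gathered = np.concatenate([v[idx] for idx in local_idx])
    out = np.zeros_like(v)
    for idx in local_idx:   # each device attends with its own sparsified queries
        out[idx] = softmax_attention(q[idx], k_gathered, v_gathered)
    return out, k_gathered.shape[0]


rng = np.random.default_rng(0)
N, d, P, r = 16, 4, 4, 2
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out, payload_rows = sequence_parallel_dilated(Q, K, V, num_devices=P, r=r)

# Single-device reference for the same pattern.
idx = np.arange(0, N, r)
reference = np.zeros_like(V)
reference[idx] = softmax_attention(Q[idx], K[idx], V[idx])
```

The sharded result matches the single-device reference, and the gathered payload has $w/r$ rows rather than $w$, which is the reduction by a factor of $r$ described above.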

Training setup and language-modelling results

Model and data

The published LongNet experiments use a transformer decoder with the MAGNETO normalisation scheme and xPos positional embeddings (the same backbone used by RetNet and other Microsoft Research models in torchscale).^[2]^[12] The principal model has 12 layers, hidden dimension 768, and 12 heads, totalling roughly 125M parameters, although the paper also includes a scaling-law sweep up to 2.7B parameters. The tokenizer is cl100k_base, the byte-pair encoding used by OpenAI's GPT-3.5 and GPT-4. The training corpus is The Stack, a large permissively licensed source-code dataset.^[12] The optimiser, learning-rate schedule, and batch composition follow the conventions of the torchscale recipes, with 300K training steps and roughly 500K tokens per batch.^[12]

The dilated-attention hyperparameters used in the language-modelling experiments are segment sizes $w = \{2048, 4096, 8192, 16384, 32768\}$ with dilation rates $r = \{1, 2, 4, 6, 12\}$, executed in parallel within each layer. With these settings the maximum supported context window for the language-modelling runs is 32,768 tokens, the same as the largest segment. This is the experimental ceiling for the reported perplexity numbers, as opposed to the theoretical 1B-token ceiling supported by the architecture and distributed algorithm.^[12]
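As a worked example, the reported configuration can be plugged into the $w/r$ and $w/r^2$ expressions from the complexity discussion. The sketch only does arithmetic on the published lists; it assumes positions $0, r, 2r, \ldots$ are kept in each segment (hence the ceiling division) and says nothing about how the reference implementation handles segment lengths that are not divisible by $r$.

```python
segment_lengths = [2048, 4096, 8192, 16384, 32768]
dilation_rates = [1, 2, 4, 6, 12]

# Positions kept per segment: indices 0, r, 2r, ... below w (ceiling division).
kept_per_segment = [-(-w // r) for w, r in zip(segment_lengths, dilation_rates)]

# Relative cost per token, proportional to w / r^2 (the N and d factors are shared).
relative_cost = [w / r**2 for w, r in zip(segment_lengths, dilation_rates)]

for w, r, kept, cost in zip(segment_lengths, dilation_rates, kept_per_segment, relative_cost):
    print(f"w={w:>6}, r={r:>2}: keeps {kept} positions, relative cost {cost:8.1f}")

dense_cost_at_32k = 32768 / 1**2   # dense attention over one 32K segment
print(f"sum of patterns / dense 32K = {sum(relative_cost) / dense_cost_at_32k:.3f}")
```

Every pattern keeps between 2,048 and 2,731 positions per segment, so the per-segment attention stays roughly the same size while the segment itself grows sixteenfold.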

Perplexity at varying context lengths

The headline language-modelling table compares three configurations: a vanilla Transformer trained at 2K context, a Sparse Transformer trained at 32K context, and LongNet trained at 32K context. All are evaluated on The Stack at evaluation contexts of 2K, 8K, and 32K tokens. The reported test perplexities are:^[12]

Model	Train length	Eval 2K	Eval 8K	Eval 32K
Vanilla Transformer	2,048	4.24	5.07	11.29
Sparse Transformer	32,768	5.15	4.00	3.64
LongNet	32,768	4.37	3.33	3.01

LongNet's perplexity drops monotonically as the evaluation context grows, which the paper interprets as evidence that the model uses the additional context productively rather than ignoring it. The vanilla Transformer, by contrast, degrades sharply past its training context because xPos extrapolation does not fully compensate for unseen positional ranges. Compared with the Sparse Transformer baseline at matched training length, LongNet has both lower perplexity at every evaluation length and lower wall-clock training cost per step.^[12]

A second experiment scales the model size from 125M to 2.7B parameters at fixed context length and fits a power-law of validation loss as a function of compute, in the style of the Chinchilla scaling laws. LongNet's scaling curve closely tracks a vanilla Transformer baseline trained at shorter context, while reaching lower absolute loss because of its longer context. The paper interprets this as evidence that dilated attention does not degrade the "loss versus compute" curve relative to dense attention.^[1]^[12]

Throughput and the 1B-token claim

The runtime / throughput experiments are the basis of the billion-token headline. Holding the per-device sequence length constant and adding GPUs, the paper reports near-flat per-token latency as the global sequence length is scaled through 8K, 32K, 128K, 1M, and on up to 1B tokens. The runtime of LongNet at 1B tokens is reported as roughly comparable to that of a vanilla Transformer at 32K tokens, while the vanilla Transformer's runtime grows quadratically and quickly exceeds the test budget.^[1]^[12] These experiments are described as forward-pass / training-step measurements rather than full pretraining runs, and the paper does not claim to have trained a converged language model at 1B-token context.

Implementations

microsoft/torchscale

The reference implementation lives in the torchscale repository under the torchscale/architecture directory and is exposed as a LongNet model class with encoder and decoder variants. The dilated attention is implemented by precomputing the sparsified position indices for each head and dispatching to Flash Attention to perform the actual softmax. Users specify the dilation pattern through two lists in the config, for example segment_length='[2048,4096]' and dilated_ratio='[1,2]'. The repository is published under the MIT license and was extended in December 2023 with both LongNet and LongViT modules.^[2]

LongViT

LongViT applies the dilated-attention block to image tokens rather than text tokens. The motivation is the same as LongNet's: high-resolution images and large gigapixel-style inputs produce token sequences whose length is dominated by the quadratic cost of attention. LongViT replaces the standard ViT attention block with dilated attention so that the receptive field still covers the full image with linear cost. The implementation also lives in torchscale.^[2]

Third-party ports

Two community PyTorch ports are widely cited. fkodom/dilated-attention-pytorch provides an unofficial implementation of the dilated-attention block compatible with vanilla PyTorch and Flash Attention.^[9] kyegomez/LongNet provides a "plug in and play" PyTorch module that bundles the attention block with a transformer wrapper.^[10] Neither is affiliated with Microsoft, and both predate the official torchscale implementation; users should be aware that minor details (in particular the mixed-density combination rule and offset handling per head) may differ from the canonical implementation.

Comparison with related approaches

Sparse attention families

The closest conceptual ancestors of LongNet are sparse attention patterns. Sparse Transformer introduced strided and fixed attention masks with $O(N \sqrt{N})$ cost; Longformer combined a sliding window with a small set of global tokens; BigBird added random sparsity on top of window and global. LongNet differs from these in two ways. First, the dilated pattern expands its coverage exponentially rather than polynomially, which is what gives a per-pair path length of $O(\log N)$ for all token pairs, rather than $O(\sqrt{N})$, or $O(1)$ only for tokens inside the same window. Second, LongNet emphasises co-design with sequence parallelism, whereas earlier sparse patterns were primarily designed for single-device efficiency.^[1]^[4]^[5]

Sliding window attention and its more recent variants (used in Mistral and several other open-weights LLMs) trade global reach for a fixed-cost local window plus, in some implementations, "attention sinks" or recurrent state. LongNet's local heads behave similarly to a sliding window, but the additional large-w/large-r heads supply the global coverage that a pure window lacks.^[4]

State-space and convolutional models

Mamba is a structured state-space model (S6) that achieves linear scaling without any attention, using a selective recurrence that conditions its state-transition parameters on the input. Hyena uses long, implicitly parameterised convolutions to act as a global mixing operator with sub-quadratic cost. Mega blends gated attention with an exponential moving average to recover long-range mixing without quadratic attention. Relative to these, LongNet keeps explicit softmax attention (with all the resulting interpretability and exact-recall properties) at the cost of needing the dilated pattern to be regular enough to fit a sequence-parallel schedule. Empirical comparisons typically find Mamba and Mamba-style hybrids more competitive in tokens-per-second and memory, while attention-based models such as LongNet tend to retain stronger associative recall on long-range tasks.^[4]^[6] Comparative studies through 2024 and 2025, including the LongMamba analysis of Mamba's long-context limits, framed the choice as a trade-off between Mamba's efficiency and the recall fidelity of attention.^[6]

Distributed exact attention

Ring Attention preserves exact dense softmax attention and instead shards the sequence across GPUs in a ring, overlapping the cyclic shift of K/V blocks with local block-wise attention. This keeps the per-pair complexity of dense attention but converts the memory bottleneck into a communication bottleneck, enabling million-token training with bounded per-device memory. LongNet and Ring Attention are therefore complementary: Ring Attention preserves the attention pattern at high communication cost; LongNet changes the pattern to make communication asymptotically cheap.^[7]
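For contrast with the dilated pattern, the ring schedule can be simulated on one machine. The sketch below is a sequential, non-causal toy: each "device" keeps its query block, visits every key/value block in ring order, and merges the partial results with a running (online) softmax so that the final output equals dense attention. It does not model the overlap of communication and computation, and all names are hypothetical rather than taken from the Ring Attention code.

```python
import numpy as np


def dense_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def ring_attention(q, k, v, num_devices: int) -> np.ndarray:
    """Exact attention computed block by block in ring order."""
    n, d = q.shape
    assert n % num_devices == 0
    q_blocks = np.split(q, num_devices)
    kv_blocks = list(zip(np.split(k, num_devices), np.split(v, num_devices)))
    outputs = []
    for p in range(num_devices):
        qb = q_blocks[p]
        running_max = np.full((qb.shape[0], 1), -np.inf)
        denom = np.zeros((qb.shape[0], 1))
        acc = np.zeros((qb.shape[0], v.shape[1]))
        for step in range(num_devices):
            kb, vb = kv_blocks[(p + step) % num_devices]  # block held at this step
            scores = qb @ kb.T / np.sqrt(d)
            new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
            scale = np.exp(running_max - new_max)   # rescale earlier partial sums
            weights = np.exp(scores - new_max)
            denom = denom * scale + weights.sum(axis=-1, keepdims=True)
            acc = acc * scale + weights @ vb
            running_max = new_max
        outputs.append(acc / denom)
    return np.concatenate(outputs)


rng = np.random.default_rng(0)
N, d, P = 16, 4, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
ring_out = ring_attention(Q, K, V, num_devices=P)
```

Every device still processes all $P$ key/value blocks, so the pattern and the total work are unchanged; only the memory held at any one time is bounded by the block size.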

Recurrent and memory-augmented attention

Infini-Attention takes a different route, retaining vanilla attention over a local window and adding a compressive long-term memory that is updated recurrently. This trades exactness on long-range positions for constant memory cost regardless of sequence length. LongNet retains exact attention over its (sparsified) positions, so its global heads can produce recall over individual long-range tokens that a compressed memory cannot.^[4]

Comparison table

Approach	Per-layer cost	Token-pair path	Attention exactness	Default sharding
Vanilla Transformer	$O(N^2 d)$	$O(1)$	Exact softmax	Data / tensor / pipeline
Sparse Transformer	$O(N \sqrt{N} d)$	$O(1)$ within mask	Exact over mask	Data / tensor
Longformer	$O(N w d)$	$O(N/w)$	Exact over window + global	Data / tensor
BigBird	$O(N (w + g + r) d)$	$O(\log N)$ (random)	Exact over mask	Data / tensor
LongNet (dilated)	$O(N d)$	$O(\log N)$	Exact over selected positions	Sequence + data + tensor
Ring Attention	$O(N^2 d)$ globally; $O(N^2 d / P)$ per device	$O(1)$	Exact softmax	Sequence (ring)
Mamba	$O(N d)$	Recurrent (no explicit path)	Not attention	Sequence / data
Hyena	$O(N \log N d)$	Convolutional	Not attention	Data / tensor
Infini-Attention	$O(N w d)$ + compressive	Local exact + lossy global	Local exact, compressed global	Data / tensor

Complexity expressions in this table are taken from each method's primary paper as cited above; $N$ is sequence length, $d$ is hidden dimension, $w$ is window or segment length, $g$ is the number of global tokens, $r$ (in the BigBird row only) is the number of random tokens per query rather than a dilation rate, and $P$ is the number of sequence-parallel devices.^[1]^[4]^[5]^[6]^[7]

Applications and significance

LongNet's significance is twofold. First, it offers a concrete architecture that combines a fixed-pattern sparse attention with sequence parallelism, demonstrating that the two design choices can be combined into a single block whose cost is provably linear in sequence length and whose per-device communication cost is bounded. This makes it one of the cleaner case studies in the long-context literature for what an "attention block with linear cost" looks like when the design has to satisfy both per-device efficiency and distributed scalability.^[1]^[12]

Second, the architecture motivates and supports the case that increasing context length is a third axis of scaling, alongside model size and training tokens. The paper frames this as the ability to treat "a whole corpus or even the entire Internet as a sequence," a framing that has been picked up in subsequent long-context work.^[1] Whether or not 1B-token training is currently economically useful, the design pattern (segment + dilation + sequence parallelism) has been generalised by later models, most prominently in PowerAttention and other exponentially expanding receptive field schemes.^[11]

In code modelling specifically, the 32K-token LongNet language-model results on The Stack show monotonically decreasing perplexity with increasing evaluation context, which is the property practitioners care about for whole-file or whole-repository code intelligence. The vision variant LongViT extends the same idea to image tokens, which is relevant for high-resolution and gigapixel-image classification.^[2]^[12]

Limitations and criticisms

Several caveats are routinely raised about LongNet in secondary discussion:

Headline length not empirically trained. The paper's title and abstract claim 1B-token scaling, but the published perplexity results stop at 32K tokens; the 1B-token figures are runtime and throughput measurements, not a converged language model. The paper itself notes that the authors' compute environment limited the language-model experiments to 32K context. Whether dilated attention's perplexity claims hold at million- or billion-token context is an open empirical question.^[1]^[12]^[13]
Fixed, data-independent pattern. Like other fixed-mask sparse attention methods, LongNet cannot adapt its attended positions to the input. Tasks that require long-range exact recall of arbitrary tokens, such as needle-in-a-haystack (NIAH) variants, may suffer because the relevant token may not lie on the dilated subgrid of any head. The mixed-density / multi-offset design mitigates but does not eliminate this issue.^[4]
No peer-reviewed venue. As of the v2 release, the paper is marked "Work in progress" on arXiv and has not been accepted at a peer-reviewed venue, which limits the depth of independent scrutiny relative to fully reviewed works such as FlashAttention or Mamba.^[3]
Implementation maturity. The official torchscale implementation depends on Flash Attention and on a specific dilation/segment configuration; users who wish to use other backbones or non-Flash kernels need to derive their own dilation-aware kernels. The third-party ports vary in fidelity to the official combination rule.^[2]^[9]^[10]
Comparison to non-attention alternatives. Subsequent state-space-model work, particularly the Mamba and Mamba-2 families and their long-context successors, has argued that linear-time recurrence achieves comparable language-modelling quality at lower per-token cost than attention-based long-context variants, at least in throughput-bound regimes. LongNet has not, to date, been directly compared against modern Mamba variants at million-token-scale language modelling in a published peer-reviewed study.^[6]

Related work

LongNet is related to a cluster of architectures and systems that aim at the same long-context bottleneck from different angles:

Ring Attention: keeps exact dense attention, distributes the sequence in a ring of devices, and overlaps key/value rotation with attention compute.
Mamba and Mamba 2: structured state-space models with linear-time recurrence.
Hyena: implicit long convolutions as a sub-quadratic global mixing operator.
Infini-Attention: local windowed attention plus a compressive long-term memory.
Sliding window attention: bounded-window attention used in many production LLMs.
Sparse attention: the umbrella family from which dilated attention can be viewed as a specific structured instance.
Rotary position embedding (RoPE) and the related xPos scheme: positional encodings used by LongNet to support extrapolation across long contexts.
YaRN: a post-hoc rotary-embedding adjustment used to extend pretrained context windows.
FlashAttention: the IO-aware exact-attention kernel that LongNet's official implementation depends on.
Retentive Network (RetNet): a sibling Microsoft Research architecture (also in torchscale) that uses a retention mechanism instead of softmax attention.
State space model (deep learning): the broader family that includes S4, S5, and Mamba.

References

1. Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei, "LongNet: Scaling Transformers to 1,000,000,000 Tokens", arXiv:2307.02486, 2023-07-05 (v1) / 2023-07-19 (v2). https://arxiv.org/abs/2307.02486. Accessed 2026-05-20.
2. Microsoft Research, "torchscale: Transformers at any scale", GitHub repository, 2023-12. https://github.com/microsoft/torchscale. Accessed 2026-05-20.
3. arXiv listing for 2307.02486 (version history and "work in progress" status), arXiv, 2023-07-19. https://arxiv.org/abs/2307.02486v2. Accessed 2026-05-20.
4. Devansh, "Transformers vs Mamba vs Linear Attention: Who Wins Long Context?", Machine Learning Made Simple (Medium), 2024. https://machine-learning-made-simple.medium.com/transformers-vs-mamba-vs-linear-attention-who-wins-long-context-f1dc8ceb5ede. Accessed 2026-05-20.
5. AI Papers Academy, "LongNet: Scaling Transformers to 1B Tokens with Dilated Attention", aipapersacademy.com, 2023-07. https://aipapersacademy.com/longnet/. Accessed 2026-05-20.
6. Anonymous authors, "LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement", arXiv:2504.16053, 2025-04. https://arxiv.org/abs/2504.16053. Accessed 2026-05-20.
7. Hao Liu, Matei Zaharia, Pieter Abbeel, "Ring Attention with Blockwise Transformers for Near-Infinite Context", arXiv:2310.01889, 2023-10-03. https://arxiv.org/abs/2310.01889. Accessed 2026-05-20.
8. Microsoft Research, "LongNet: Scaling Transformers to 1,000,000,000 Tokens" (publication page), microsoft.com/en-us/research, 2023-07. https://www.microsoft.com/en-us/research/publication/longnet-scaling-transformers-to-1000000000-tokens/. Accessed 2026-05-20.
9. Frank Odom, "dilated-attention-pytorch: (Unofficial) implementation of dilated attention from LongNet", GitHub, 2023. https://github.com/fkodom/dilated-attention-pytorch. Accessed 2026-05-20.
10. Kye Gomez, "LongNet: Implementation of plug in and play Attention from LongNet", GitHub, 2023. https://github.com/kyegomez/LongNet. Accessed 2026-05-20.
11. Anonymous authors, "PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention", arXiv:2503.03588, 2025-03. https://arxiv.org/abs/2503.03588. Accessed 2026-05-20.
12. Jiayu Ding et al., "LongNet: Scaling Transformers to 1,000,000,000 Tokens" (HTML rendering of v2 with full equations, algorithm, and tables), ar5iv.labs.arxiv.org, 2023-07-19. https://ar5iv.labs.arxiv.org/html/2307.02486. Accessed 2026-05-20.
13. Storrs Hoen, "Paper Walkthrough: LongNet: Scaling Transformers to 1,000,000,000 Tokens", storrs.io, 2023-07. https://storrs.io/paper-walkthrough-longnet-scaling-transformers-to-1-000-000-000-tokens/. Accessed 2026-05-20.
