Ring Attention
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,627 words
Ring Attention, formally Ring Attention with Blockwise Transformers, is a distributed algorithm for computing the self-attention operation of transformer neural networks across a ring of compute devices. Introduced in October 2023 by Hao Liu, Matei Zaharia, and Pieter Abbeel at the University of California, Berkeley, the technique enables training and inference on sequences whose length scales linearly with the number of available devices, without resorting to attention approximations or sparsity, and without incurring net communication overhead.[1][2] By splitting queries, keys, and values along the sequence dimension and rotating key-value blocks around a logical ring while each device performs blockwise FlashAttention-style computation, Ring Attention overlaps inter-device communication with on-device arithmetic. The paper, posted to arXiv as report 2310.01889 on 3 October 2023, was accepted to ICLR 2024 and has become a foundational primitive in the broader category of "context parallelism" used to train models with context windows reaching the multi-million-token regime.[1][3][4]
The algorithm is the system-level counterpart to FlashAttention: where FlashAttention reduces the single-device memory cost of attention from quadratic to linear in sequence length by tiling the softmax computation, Ring Attention applies the same online-softmax mathematics across an arbitrary number of devices arranged in a ring topology. Each accelerator holds only one slice of the sequence in high-bandwidth memory; key-value blocks circulate via peer-to-peer transfers so that, by the end of a P-step rotation across P devices, each device has applied attention to the complete key/value tensor without ever materialising it locally.[5][6] Ring Attention has been implemented in JAX by the original authors and re-implemented in PyTorch by the open-source community. It has been adopted, under various names and refinements, inside production frameworks such as NVIDIA Megatron-LM ("Context Parallelism") and Microsoft DeepSpeed (in hybrid configurations with Ulysses), as well as in Berkeley's open-source Large World Model (LWM), and has been widely credited as the algorithmic basis for the million-token context windows that arrived during 2024 and beyond.[4][7][8][9]
The dot-product self-attention operation at the heart of the transformer architecture computes, for input length N and embedding dimension d, a matrix of pairwise similarities Q Kᵀ of shape N × N. Both the activations and (during training) the gradient through the softmax must be stored, giving naïve attention a memory footprint of O(N²) and a compute cost of O(N²d).[1][10] At realistic embedding dimensions of several thousand, this quadratic term begins to dominate well before sequence length reaches a million tokens; for N = 1,048,576 the attention matrix in 16-bit precision alone occupies roughly 2 TB, vastly exceeding the high-bandwidth memory of any single GPU or TPU chip available in 2023.[1][5]
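The 2 TB figure can be checked with a few lines of arithmetic. The sketch below assumes the figure refers to a single attention head and a single batch element, with 2 bytes per score; the function name is illustrative only.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialise one N x N attention score matrix."""
    return seq_len * seq_len * bytes_per_element


n = 1_048_576  # 2**20 tokens
size_bytes = attention_matrix_bytes(n)  # assumes one head, one batch element
size_tib = size_bytes / 2**40
print(f"{size_tib:.1f} TiB")
```

Multiplying by the number of heads and the batch size only widens the gap to single-device memory.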
Three lines of work attempted to relax this bottleneck around the time of Ring Attention. Approximate-attention methods such as Longformer, BigBird, Linformer, Performer, and sparse attention variants reduced the asymptotic cost at the expense of an approximation that, in practice, degraded model quality on retrieval-sensitive tasks. State-space alternatives such as S4 and, shortly after Ring Attention appeared, Mamba replaced softmax attention with a linear-time recurrence but required either training from scratch or substantial architectural surgery.[11] The third line, memory-efficient exact attention, kept the softmax but reorganised the computation to never materialise the N × N matrix at once. The decisive paper in this family was FlashAttention (Tri Dao and colleagues, 2022), which used IO-aware tiling and an "online softmax" recurrence to reduce single-device memory to O(N) while remaining numerically equivalent to standard attention.[10][12]
The immediate predecessor of Ring Attention was the Blockwise Parallel Transformer (BPT), released in May 2023 by Hao Liu and Pieter Abbeel.[13] BPT extended FlashAttention's online-softmax tiling beyond the attention layer to fuse the subsequent feed-forward network into the same blockwise pass, eliminating the temporary activation buffer that the FFN otherwise required between matmuls. BPT enabled training of sequences up to 32× longer than vanilla transformers and roughly 4× longer than the previous memory-efficient state of the art, on a single device.[13] However, like FlashAttention, BPT remained fundamentally constrained by the per-device memory capacity: even with linear-in-N activations, the input embeddings, parameters, and optimizer states still had to fit on one accelerator.
Ring Attention with Blockwise Transformers is precisely the multi-device generalisation of BPT. By distributing the blockwise computation that BPT had restricted to a single device across an entire ring of devices, Liu, Zaharia, and Abbeel converted the per-device linear memory bound of BPT into a cluster-level linear bound, multiplied by the number of devices.[1][14]
Sharding activations along the sequence dimension was not itself novel in late 2023. NVIDIA's Megatron-LM had introduced "Sequence Parallelism" (SP) as a complement to its tensor parallelism (TP), but the original SP variant split only the layer-normalisation and dropout activations along the sequence axis; the attention operation itself was still gathered to full sequence length on each tensor-parallel rank before computation.[15] Microsoft's DeepSpeed-Ulysses, posted in September 2023 just weeks before Ring Attention, took a different approach: it scattered the head dimension rather than the sequence dimension during attention, using all-to-all collectives to transpose the layout before and after the local softmax.[7] Ulysses's parallelism degree, however, could not exceed the number of attention heads, a hard ceiling at perhaps 64 heads for typical configurations, and was poorly suited to grouped-query and multi-query attention models where head counts were deliberately reduced.[7][16]
Ring Attention occupied a complementary niche: its parallelism degree was bounded only by the number of available devices and by the ratio between per-block compute throughput and ring bandwidth, not by any model-architectural parameter. It also did not require all-to-all collectives. Instead it used point-to-point ring permutes (jax.lax.ppermute in JAX) for communication, which mapped well onto the NVLink/NVSwitch topologies of contemporary GPU pods and onto TPU ICI interconnects.[1][17]
Ring Attention is most easily understood as two nested loops, an outer loop over query blocks pinned to each device and an inner loop over key-value blocks that rotate around the ring, fused with the FlashAttention online-softmax recurrence so that no intermediate attention matrix is ever materialised.[5][6]
Given input tokens 1, …, N and P devices, the sequence is partitioned into P contiguous slices of length N/P. Device i receives slice i and computes its local query, key, and value projections Q_i, K_i, V_i, each of shape (N/P) × d_model. Inside each device, these are further tiled into sub-blocks of length c (the FlashAttention block size) so that the BPT/FlashAttention online-softmax recurrence can run entirely in on-chip SRAM.[1][5]
Each device's job is to compute Attn(Q_i, K, V), the attention of its query slice against the full concatenated K and V across all devices. The local share of K and V (namely K_i and V_i) starts on device i; the remaining P−1 slices must visit the device over time.[5][6]
The P devices are logically arranged in a ring such that device i receives K and V from device i−1 (mod P) and sends its current K, V buffer to device i+1 (mod P). The algorithm proceeds for P iterations. At iteration t, each device:

- computes blockwise attention between its local queries Q_i and the key-value block K^t, V^t it currently holds, folding the result into the running output O and the softmax statistics (running maximum and denominator ℓ) through the online-softmax recurrence;
- issues a Send/Recv ring-permute that transmits the current K^t, V^t to the next device while simultaneously receiving K^{t+1}, V^{t+1} from the previous device.

After P iterations every query block has been attended against every key-value pair exactly once, and the final partial output O has been correctly normalised by the running denominator ℓ. The mathematical result is equivalent (up to floating-point reduction order) to standard softmax attention on the full N × N matrix.[1][5]
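The rotation can be simulated on a single machine to see why the result matches ordinary attention. The following is a minimal single-head sketch in pure Python, not the authors' implementation: the "devices" are list entries, the ring permute is a list rotation, no causal mask is applied, and all function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(Q, K, V):
    """Standard softmax attention that materialises every score row."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [dot(q, k) * scale for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out


def block_update(state, Q_blk, K_blk, V_blk):
    """Fold one key-value block into the running (max, denominator, output)."""
    scale = 1.0 / math.sqrt(len(Q_blk[0]))
    for r, q in enumerate(Q_blk):
        m_old, l_old, o_old = state[r]
        scores = [dot(q, k) * scale for k in K_blk]
        m_new = max(m_old, max(scores))
        corr = math.exp(m_old - m_new)  # rescales everything accumulated so far
        w = [math.exp(s - m_new) for s in scores]
        l_new = l_old * corr + sum(w)
        o_new = [o * corr + sum(wi * v[j] for wi, v in zip(w, V_blk))
                 for j, o in enumerate(o_old)]
        state[r] = (m_new, l_new, o_new)


def ring_attention(Q, K, V, num_devices):
    """Simulate P devices that each keep one query shard and rotate K, V."""
    n, d_v = len(Q), len(V[0])
    shard = n // num_devices
    assert shard * num_devices == n, "sequence must divide evenly across devices"
    slices = [slice(i * shard, (i + 1) * shard) for i in range(num_devices)]
    q_local = [Q[s] for s in slices]
    kv_buf = [(K[s], V[s]) for s in slices]
    states = [[(-math.inf, 0.0, [0.0] * d_v) for _ in range(shard)]
              for _ in range(num_devices)]
    for _ in range(num_devices):
        for i in range(num_devices):
            block_update(states[i], q_local[i], *kv_buf[i])
        # Ring permute: device i receives the buffer held by device i - 1.
        kv_buf = [kv_buf[(i - 1) % num_devices] for i in range(num_devices)]
    # Final normalisation by the running denominator.
    return [[o / l for o in o_acc] for st in states for (_, l, o_acc) in st]


random.seed(0)
n_tokens, d_model = 8, 4
Q, K, V = ([[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
           for _ in range(3))
ring_out = ring_attention(Q, K, V, num_devices=4)
ref_out = reference_attention(Q, K, V)
max_err = max(abs(a - b) for ra, rb in zip(ring_out, ref_out) for a, b in zip(ra, rb))
print(max_err)
```

In a real deployment the inner loop over devices runs in parallel and the rotation is an asynchronous transfer; the simulation only demonstrates that the accumulated result does not depend on the order in which blocks arrive.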
The decisive property of the algorithm is that the per-iteration ring-permute of one (K, V) block can be hidden behind the per-iteration blockwise attention compute, provided the block size is large enough. The original paper shows that with block size c, per-device peak FLOPS F, and per-link bandwidth B, the overlap condition is satisfied when c ≥ F/B; the hidden dimension and batch size appear on both sides and cancel (roughly: blocks must be large enough that arithmetic for one block takes at least as long as transferring the next).[1][5] On an A100 with NVLink (≈ 300 GB/s per link, ≈ 300 TFLOPS BF16) this requires block sizes of around 1,000 tokens, well within practical regimes. As long as this inequality holds, Ring Attention has the same wall-clock cost per token as the equivalent single-device blockwise attention, and the parallelism is essentially free.[1][6]
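The overlap condition can be sketched numerically. The example below assumes the usual cost model, which is an assumption here rather than a quotation from the paper: 4·d·c² FLOPs per block pair (two matrix products at 2 FLOPs per multiply-add) and one K block plus one V block of 16-bit values per transfer. Function names are illustrative.

```python
import math


def min_block_size(peak_flops: float, link_bandwidth: float) -> float:
    """Smallest block size c (in tokens) satisfying c >= F / B."""
    return peak_flops / link_bandwidth


def step_times(c: int, d: int, peak_flops: float, link_bandwidth: float,
               bytes_per_element: int = 2):
    """Return (compute seconds, transfer seconds) for one ring step."""
    compute = 4 * d * c * c / peak_flops                      # QK^T plus weighted sum over V
    transfer = 2 * c * d * bytes_per_element / link_bandwidth  # one K block and one V block
    return compute, transfer


F = 300e12  # approximate peak BF16 FLOPS quoted in the text
B = 300e9   # approximate per-link bandwidth in bytes per second
threshold = min_block_size(F, B)
print(threshold)

for c in (256, 1000, 2048):
    compute, transfer = step_times(c, d=128, peak_flops=F, link_bandwidth=B)
    print(c, "hidden" if compute >= transfer or math.isclose(compute, transfer) else "exposed")
```

Under this model the head dimension d scales both terms equally, which is why only the ratio F/B matters.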
The original paper's pseudocode initialises a buffer that accumulates the running attention output, then in each ring step calls jax.lax.ppermute (a non-blocking circular shift) on the K and V buffers while issuing the FlashAttention kernel on the current buffer; both operations live on the same XLA computation stream, allowing the XLA compiler to overlap them.[1][17]
Training requires a backward pass through the same operation. The authors define a custom JAX vector-Jacobian product via jax.custom_vjp, rematerialising the same ring rotation in reverse during the backward computation to produce gradients with respect to Q, K, and V while preserving the linear-memory invariant. Activations are checkpointed at the ring-permute boundaries so that no quadratic-memory intermediate ever needs to be saved across the full backward pass.[1][17]
The complexity properties of Ring Attention are central to its appeal and are worth stating precisely.
Memory cost per device. Each device stores only its slice of Q, K, V (size O(N/P · d)) plus the running partial output and softmax statistics (size O(c · d)) and any incoming K, V buffer (size O(N/P · d)). The total per-device memory is therefore O(N/P) in the sequence length: linear in the local shard, and independent of total sequence length when P scales with N. The original paper reports the per-layer activation memory as 6 b c h bytes (batch b, block size c, hidden size h), constant in total sequence length.[5][14] For comparison, in the same notation vanilla attention's per-layer activation memory is 2 b h N², and even Megatron-style memory-efficient attention requires roughly 8 b N h.[5][14]
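To make the two linear-memory formulas concrete, the sketch below evaluates them for one set of sample values. The values b = 1, c = 4,096, h = 4,096 and N = 2²⁰ are hypothetical choices for illustration and do not come from the paper; only the formulas 6bch and 8bNh are taken from the text.

```python
def ring_activation_bytes(b: int, c: int, h: int) -> int:
    """Per-layer activation memory quoted for Ring Attention: 6 * b * c * h."""
    return 6 * b * c * h


def memory_efficient_activation_bytes(b: int, n: int, h: int) -> int:
    """Per-layer figure quoted for memory-efficient attention: 8 * b * N * h."""
    return 8 * b * n * h


b, c, h = 1, 4096, 4096   # hypothetical batch, block size, hidden size
n = 2**20                 # hypothetical total sequence length

ring_mib = ring_activation_bytes(b, c, h) / 2**20
baseline_gib = memory_efficient_activation_bytes(b, n, h) / 2**30
print(f"ring: {ring_mib:.0f} MiB per layer, baseline: {baseline_gib:.0f} GiB per layer")
```

The ring figure does not change if N grows, because only the block size c enters the formula.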
Communication cost per device. Over the P iterations each device sends a total of (P−1) × (N/P) × 2 × d_head × b key-value elements (the factor 2 covers both K and V), giving an aggregate communication volume of O(N · d) per layer per device, linear in N. When normalised by the O(N²d / P) on-device compute, this gives a per-device communication-to-compute ratio of O(P / N), which shrinks as sequences grow longer, explaining why Ring Attention's relative overhead decreases with scale.[5][6]
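The ratio can be written out as a short worked expression. This is a sketch: the factor of 4 in the denominator (two matrix products at 2 FLOPs per multiply-add) is a standard counting convention assumed here, and d denotes the per-head dimension.

$$
\frac{\text{elements sent}}{\text{FLOPs}}
\approx \frac{2\,b\,d\,(P-1)\,\frac{N}{P}}{4\,b\,d\,\frac{N}{P}\,N}
= \frac{P-1}{2N}
= O\!\left(\frac{P}{N}\right)
$$

At a fixed device count P, doubling the sequence length N halves the ratio.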
Net compute overhead. Because the algorithm performs the same FLOPs as standard attention, the FLOP count remains N²d on each layer when summed across the ring, distributed evenly as (N/P)·N·d on each device. There is no algorithmic compute overhead, only the IO of the ring-permute, which the overlap analysis above shows can be hidden.[1][5]
The cumulative effect is that Ring Attention removes the per-device memory wall: provided communication can be hidden under compute, doubling the device count doubles the trainable sequence length at the same wall-clock speed per token, indefinitely, with no accuracy loss.[1][5]
The ICLR 2024 paper reported a series of scaling experiments that established the practical envelope of the algorithm.
On reinforcement-learning benchmarks (ExoRL), Ring Attention let an "Action Transformer" condition on 128 simultaneous trajectories, a scale previously infeasible, while maintaining a cumulative-return score of 113.66 versus 111.13 for the BPT baseline.[5] On long-context language-model fine-tuning, a LLaMA-13B model adapted with Ring Attention to a 512K-token context outperformed GPT-3.5-turbo-16K and Vicuna-16K on line-retrieval probes well beyond their advertised context windows.[5]
Several refinements have been proposed since the original Ring Attention paper, all preserving the core ring-rotation structure but improving load balance, communication topology, or kernel integration.
Posted on arXiv on 15 November 2023, six weeks after Ring Attention, by William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley, Striped Attention addresses a load-balancing flaw in Ring Attention's behaviour under causal (autoregressive) attention masking.[18][19] In causal attention, query token i attends only to keys ≤ i; with contiguous sequence shards, the device holding the latest tokens performs full attention at almost every step while the device holding the earliest tokens does almost nothing, producing a triangular workload imbalance in which device utilisation approaches roughly 50 % as the ring grows.[18]
Striped Attention's fix is conceptually minimal: instead of giving each device a contiguous slice of the sequence, give it a striped (every-Pth-token) interleaved subset. This redistributes the causal triangle into stripes whose work is balanced across the ring at every rotation step. The authors report end-to-end throughput improvements of up to 1.45× on 8× A100 GPUs at 256 K context and 1.65× on 16-chip TPU v4 at 786 K context, with no change to the underlying numerics.[18][19] The lucidrains PyTorch implementation includes striped attention as an option, and the zhuzilin implementation exposes a "zigzag" variant that achieves the same load-balancing goal with a slightly different permutation.[8][20]
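The effect of the token layout can be illustrated with a toy cost model. The sketch below counts unmasked (query, key) pairs per device at each ring step and assumes every device waits for the slowest one; real kernels skip work at tile rather than element granularity, so the numbers are indicative only. The zigzag layout shown (device i takes chunk i and chunk 2P−1−i out of 2P chunks) is an assumption about that variant, and all function names are hypothetical.

```python
def causal_pairs(q_tokens, k_tokens):
    """Count (query, key) pairs that survive the causal mask (key <= query)."""
    return sum(1 for q in q_tokens for k in k_tokens if k <= q)


def ring_cost(shards):
    """Return (useful pairs, critical path) for one full pass around the ring."""
    p = len(shards)
    useful = critical_path = 0
    for t in range(p):
        # At step t, device i holds the key-value shard that started on device i - t.
        work = [causal_pairs(shards[i], shards[(i - t) % p]) for i in range(p)]
        useful += sum(work)
        critical_path += max(work)  # every device waits for the slowest one
    return useful, critical_path


def contiguous(n, p):
    size = n // p
    return [list(range(i * size, (i + 1) * size)) for i in range(p)]


def striped(n, p):
    return [list(range(i, n, p)) for i in range(p)]


def zigzag(n, p):
    size = n // (2 * p)
    chunks = [list(range(c * size, (c + 1) * size)) for c in range(2 * p)]
    return [chunks[i] + chunks[2 * p - 1 - i] for i in range(p)]


def efficiency(shards):
    useful, critical_path = ring_cost(shards)
    return useful / (len(shards) * critical_path)


n_tokens, n_devices = 16, 4
results = {layout.__name__: round(efficiency(layout(n_tokens, n_devices)), 3)
           for layout in (contiguous, striped, zigzag)}
print(results)
```

For 16 tokens on 4 devices this toy model gives an efficiency of about 0.59 for contiguous shards, 0.85 for striped shards and 1.0 for zigzag shards; the useful work is identical in all three cases, only its distribution across devices changes.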
Tree Attention, by Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, and Beren Millidge (Zyphra and EleutherAI; arXiv 2408.04093, 7 August 2024), is the most significant theoretical competitor to Ring Attention.[21][22] Where Ring Attention executes a linear-time circular rotation, Tree Attention notes that the cross-device reduction inside attention is mathematically associative (the running maximum and running exponential-sum are both associative operations on log-sum-exp triples) and can therefore be performed as a tree reduction in O(log P) communication rounds rather than O(P).[21][22]
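The associativity argument can be checked directly. The sketch below, for a single decoding query, represents each device's contribution as a (max, sum-of-exponentials, weighted-value-sum) triple and merges the triples either left to right (as a ring would) or pairwise in a tree. It is an illustration of the reduction only, not the Tree Attention authors' code; the names are hypothetical and the 1/√d scaling is omitted for brevity.

```python
import math
import random
from functools import reduce


def partial_state(q, K, V):
    """Summarise one shard of keys and values for a single query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    o = [sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))]
    return m, sum(w), o


def combine(a, b):
    """Associative merge of two (max, denominator, numerator) triples."""
    m = max(a[0], b[0])
    ca, cb = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * ca + b[1] * cb, [x * ca + y * cb for x, y in zip(a[2], b[2])]


def tree_reduce(states):
    """Merge neighbours pairwise; returns the final triple and the round count."""
    rounds = 0
    while len(states) > 1:
        merged = [combine(states[i], states[i + 1])
                  for i in range(0, len(states) - 1, 2)]
        if len(states) % 2:
            merged.append(states[-1])
        states = merged
        rounds += 1
    return states[0], rounds


random.seed(1)
d, shard_len, num_devices = 4, 2, 8
q = [random.gauss(0, 1) for _ in range(d)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(shard_len * num_devices)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(shard_len * num_devices)]
shards = [partial_state(q, K[i:i + shard_len], V[i:i + shard_len])
          for i in range(0, len(K), shard_len)]

(_, l_tree, o_tree), rounds = tree_reduce(shards)
_, l_seq, o_seq = reduce(combine, shards)
tree_out = [o / l_tree for o in o_tree]
seq_out = [o / l_seq for o in o_seq]
print(rounds, max(abs(a - b) for a, b in zip(tree_out, seq_out)))
```

With eight shards the tree finishes in three merge rounds, whereas the sequential merge needs seven, and both give the same attention output up to rounding.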
The authors of Tree Attention derive their algorithm from a scalar energy-function formulation of attention with a Bayesian interpretation connecting it to Hopfield networks; the gradient of this energy function is the attention output, and an efficient forward computation (achieved via tree reduction over logsumexp) immediately gives an efficient backward.[21][22] Reported speedups versus Ring Attention reach up to 8× faster decoding at 1 M sequence length with 2× less peak memory and 2× less communication volume.[21][22] Tree Attention applies most cleanly to the decoding setting (one query against the full K, V cache); during prefill, the asymptotic advantage is preserved but the per-iteration constants narrow the gap.[21] The lucidrains ring-attention-pytorch repository includes a "Tree Attention Decoding" implementation as a complementary, not exclusive, technique.[8]
NVIDIA's Megatron-LM library, starting with version 0.5, introduced Context Parallelism (CP) as a first-class parallelism dimension alongside data, tensor, and pipeline parallelism.[4][9] The official documentation describes CP as "similar to Ring Attention" but with several engineering refinements: it integrates the latest cuDNN FlashAttention kernels directly, applies a balanced causal-mask scheme inspired by Striped Attention, and converts the necessary all-gather/reduce-scatter collectives into peer-to-peer ring communications to match the ring topology of NVLink/NVSwitch fabrics.[4][9] CP is enabled by setting context_parallel_size to the desired ring degree and is the recommended NVIDIA-blessed long-context training path inside Megatron-Core and NVIDIA NeMo.[4][9]
Microsoft's DeepSpeed framework integrates Ulysses sequence parallelism (which scatters the head dimension via all-to-all collectives) with Ring Attention into a 2-D hybrid scheme. Each ring step is itself wrapped by Ulysses all-to-alls; the inner Ulysses degree absorbs heads while the outer Ring degree absorbs sequence length. This hybrid avoids both Ulysses's hard ceiling at the number of heads and Ring Attention's per-link bandwidth ceiling, and has been adopted in the SWIFT training framework with a partition rule requiring the sequence length to divide by world_size × 2 for zigzag-style load balancing.[7][16] Reported memory reduction on Qwen2.5-3B across 8× A100s is approximately 4.2× (from 75.4 GiB to 17.9 GiB per device) when running with ulysses=2, ring=4.[7]
The literature has continued to multiply. RingFormer (arXiv 2501.01182, January 2025) applies ring attention specifically to neural vocoders. TokenRing (arXiv 2412.20501, December 2024) augments the ring with bidirectional communication to halve communication latency.[23] RingX (SC '25 / ACM 3712285) targets HPC-scale long-context training, integrating ring attention with collective-communication primitives optimised for InfiniBand-coupled clusters.[24] A line of unified sequence-parallelism work by Fang et al. (USP, arXiv 2405.07719) systematises the design space spanned by Ulysses, Ring Attention, and their hybrids.[16]
The most widely used open-source implementations are:

- haoliuhl/ringattention: the official JAX implementation by Hao Liu, released alongside the paper. It exports ringattention and blockwise_feedforward functions designed to compose with JAX's shard_map primitive for multi-device dispatch, supports causal-block masking, inference KV caching via cache_idx, and configurable rematerialisation policies. Licensed under Apache 2.0 and used as the attention kernel inside the Large World Model release.[25]
- lucidrains/ring-attention-pytorch: an independent PyTorch re-implementation by Phil Wang ("lucidrains") that splits along the sequence dimension and applies ring-reduce to the attention tiles, integrating Tri Dao's FlashAttention CUDA kernels and a custom forward Triton kernel. Supports striped attention, grouped-query attention to reduce per-step communication, tree-attention decoding, rotary embeddings, and variable per-rank sequence lengths. Released under the MIT license; active development continued into 2025.[8][26]
- zhuzilin/ring-flash-attention: a PyTorch library that exposes ring_flash_attn_qkvpacked_func and ring_flash_attn_varlen_func mirroring the FlashAttention APIs, plus a zigzag_ring_flash_attn_func for load-balancing and a llama3_flash_attn variant matching the configuration described in the Llama 3 tech report. Includes a Hugging Face model adapter; reports ~85–90 % of single-GPU throughput on 8× H800. MIT-licensed and used as the upstream for several downstream long-context fine-tuning projects.[20]
- gpu-mode/ring-attention: the GPU MODE community's pedagogical implementation, used as the basis for CUDA MODE Lecture 13 on ring attention.[6]

NVIDIA's Megatron-Core Context Parallel module is the principal closed-system implementation, integrated with cuDNN FlashAttention kernels and supporting MHA, MQA, and grouped-query attention variants from version 0.5 onward.[4][9]
Public information about the use of Ring Attention in proprietary frontier models is uneven, and the article distinguishes carefully between confirmed and rumoured deployments.
The Large World Model (LWM) project, released in February 2024 by the same Berkeley group (Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel), is the most thoroughly documented public deployment of Ring Attention. The accompanying paper "World Model on Million-Length Video And Language With Blockwise RingAttention" (arXiv 2402.08268) describes a 7 B-parameter autoregressive transformer trained progressively from 4 K to 1 M token contexts using Ring Attention as the core attention kernel.[27][28] The model can answer questions about hour-long YouTube videos and reports >99 % retrieval accuracy on million-token needle-in-a-haystack tasks, outperforming GPT-4V and Gemini Pro on the same probe at the time of release. The entire training code, attention implementation, and 7 B-parameter weights are public on Hugging Face under the LargeWorldModel organisation.[27][28]
Google DeepMind's Gemini 1.5 Pro, announced in February 2024 with a one-million-token context window (extendable to ten million in research demonstrations), is widely speculated in the technical press and engineering communities to use Ring Attention or a close derivative as part of its long-context architecture.[29] However, the published Gemini 1.5 technical report (arXiv 2403.05530) neither cites Ring Attention nor describes its attention sharding scheme in detail.[30] The report attributes long-context capabilities to a combination of Mixture-of-Experts sparse architecture, custom efficient attention, and infrastructural improvements, without specifying the algorithm. The attribution to Ring Attention should therefore be regarded as plausible but unconfirmed by Google.[30]
Magic's LTM-2-mini model, announced in August 2024 as the first commercial model with a 100-million-token context window, explicitly contrasts its architecture with attention-based mechanisms.[31] Magic's public technical note states that "for each decoded token, LTM-2-mini's sequence-dimension algorithm is roughly 1000× cheaper than the attention mechanism in Llama 3.1 405B for a 100 M-token context window," indicating a non-attention recurrence rather than a Ring-Attention-style exact-attention scheme.[31] Magic has not disclosed the specific algorithm but explicitly distinguishes it from the family of approaches descended from Ring Attention.[31]
There is no public statement from Anthropic identifying Ring Attention as the attention algorithm behind Claude's 200 K and 1 M context windows. Anthropic's public engineering writing emphasises long-context prompting practice and benchmarking, not the underlying parallelism strategy. Claims linking Ring Attention to Claude appearing in third-party blog posts are speculative.[32]
The strongest "production" claim that can be made is at the framework level rather than the model level: every major open-source training stack, including NVIDIA Megatron-LM, NVIDIA NeMo, Microsoft DeepSpeed (in hybrid Ulysses+Ring mode), and PyTorch Lightning's Fabric long-context recipe, ships Ring-Attention-style context parallelism as a supported and recommended parallelism dimension for long-context model training, and many publicly available long-context fine-tunes (including LWM and several open community efforts) explicitly cite Ring Attention as the algorithmic basis.[4][7][9][25]
Ring Attention received its formal recognition at ICLR 2024 (Vienna, May 2024), where it was published in the main proceedings as paper 1119587863e78451f080da2a768c4935.[3] In the year following publication it became one of the most heavily cited papers in the distributed-systems-for-ML subfield, accumulating implementations in JAX, PyTorch, Triton, and CUDA inside both academic and industry codebases.[8][20][25][26]
Commentary in the engineering community has frequently described Ring Attention as the "missing primitive" that made the multi-million-token context windows of 2024 economically viable, comparable in significance to FlashAttention's role at the single-device tier. The GPU MODE / CUDA MODE educational community devoted an entire lecture (Lecture 13, February 2024) to the algorithm, and several open-source large-context training efforts (including LWM and follow-on releases) have cited it as enabling technology.[6][27]
Critical commentary has focused on three limitations. The first, causal-mask load imbalance (addressed by Striped Attention and zigzag variants), was the most prominent first-generation concern.[18][19] The second is communication latency: on links slower than NVLink (PCIe Gen 5, Ethernet, AWS EFA), the overlap condition can break at smaller block sizes; the recommended remedy is grouped-query attention to reduce per-step (K, V) volume, or hybrid Ulysses+Ring 2-D schemes.[7][16] The third is logarithmic versus linear scaling: Tree Attention has shown that the linear-time ring rotation is asymptotically suboptimal for decoding, where its O(P) communication rounds are outperformed by the O(log P) rounds of a tree reduction.[21][22]
Nonetheless, the original Liu/Zaharia/Abbeel formulation remains the canonical reference. As of mid-2025, the term "Ring Attention" appears in most major ML-systems courses and surveys as the standard name for sequence-dimension parallelism of exact attention. The algorithm's central insight, that the online-softmax accumulation is order-independent and so key-value blocks can be folded in whatever order a ring permute delivers them, has become a foundational result of the long-context era.[4][6][16]
| Property | Vanilla attention | FlashAttention | Sequence Parallelism (Megatron, original) | DeepSpeed-Ulysses | Ring Attention | Tree Attention |
|---|---|---|---|---|---|---|
| Memory per device | O(N²) | O(N) | O(N²) (gathered for attention) | O(N) | O(N/P) | O(N/P) |
| Communication rounds | n/a (single-device) | n/a (single-device) | none for attention itself | 2× all-to-all per layer | P−1 ring permutes per layer | log₂ P tree reductions per decoding step |
| Parallelism ceiling | 1 device | 1 device | number of attention heads (via TP) | number of heads | number of devices | number of devices |
| Net compute overhead | 0 | 0 | 0 | 0 (with sufficient heads) | 0 (with overlap condition) | 0 (asymptotically smaller constants) |
| Causal masking | trivial | trivial | trivial | trivial | imbalanced (fixed by Striped/zigzag) | trivial |
| Approximation? | no | no | no | no | no | no |
The principal trade-off is between Ulysses's lower per-layer communication latency (one round-trip of all-to-all) and Ring Attention's freedom from the head-count ceiling. Production frameworks increasingly combine the two as a 2-D hybrid.[4][7][16]