Ring Attention

Updated Jul 13, 2026

Ring Attention, formally Ring Attention with Blockwise Transformers, is a distributed algorithm for computing the self-attention operation of transformer neural networks across a ring of compute devices. Introduced in October 2023 by Hao Liu, Matei Zaharia, and Pieter Abbeel at the University of California, Berkeley, the technique enables exact attention, for both training and inference, over sequences whose length scales linearly with the number of available devices, without resorting to attention approximations or sparsity and, when communication is fully overlapped with computation, without net communication overhead.^[1]^[2] The paper's own summary states that Ring Attention "leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention," yielding "training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers."^[1] Concretely, queries, keys, and values are split along the sequence dimension, and key-value blocks rotate around a logical ring while each device performs blockwise FlashAttention-style computation on the block it currently holds. The paper, posted to arXiv as report 2310.01889 on 3 October 2023, was accepted to ICLR 2024 and has become a foundational primitive in the broader category of "context parallelism" used to train long-context language models with context windows reaching the multi-million-token regime.^[1]^[3]^[4]

The algorithm is the system-level counterpart to FlashAttention: where FlashAttention reduces the single-device memory cost of attention from quadratic to linear in sequence length by tiling the softmax computation, Ring Attention applies the same online-softmax mathematics across an arbitrary number of devices arranged in a ring topology. Each accelerator holds only one slice of the sequence in high-bandwidth memory; key-value blocks circulate via peer-to-peer transfers so that, by the end of a P-step rotation across P devices, each device has applied attention to the complete key/value tensor without ever materialising it locally.^[5]^[6] Ring Attention has been implemented in JAX by the original authors and independently re-implemented in PyTorch by the open-source community. It has been adopted (under various names and refinements) inside production frameworks such as NVIDIA Megatron-LM "Context Parallelism" and Microsoft DeepSpeed-Ulysses (in hybrid configurations), as well as in the open-source Large World Model (LWM) at Berkeley, and has been widely credited as the algorithmic basis enabling the million-token context windows that arrived during 2024 and beyond.^[4]^[7]^[8]^[9]

What problem does Ring Attention solve?

The quadratic-memory problem of attention

The dot-product self-attention operation at the heart of the transformer architecture computes, for input length N and embedding dimension d, a matrix of pairwise similarities $Q K^\top$ of shape $N \times N$ . Both the activations and (during training) the gradient through the softmax must be stored, giving naïve attention a memory footprint of $O(N^2)$ and a compute cost of $O(N^2 d)$ .^[1]^[10] At realistic embedding dimensions of several thousand, this quadratic term begins to dominate well before sequence length reaches a million tokens; for $N = 1{,}048{,}576$ the attention matrix in 16-bit precision alone occupies roughly 2 TB, vastly exceeding the high-bandwidth memory of any single GPU or TPU chip available in 2023.^[1]^[5]
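The 2 TB figure can be checked with a few lines of arithmetic. The sketch below assumes a single $N \times N$ score matrix (one head, one sequence) stored at 2 bytes per element; the function name is illustrative only.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_element


n = 1_048_576  # 2**20 tokens
size_bytes = attention_matrix_bytes(n)
print(size_bytes / 2**40)  # size in TiB -> 2.0
```

Multiple heads or a batch dimension multiply this figure further, which is why the matrix is never materialised in memory-efficient schemes.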

Three lines of work attempted to relax this bottleneck before Ring Attention. Approximate-attention methods such as Longformer, BigBird, Linformer, Performer, and sparse attention variants reduced the asymptotic cost at the expense of an approximation that, in practice, degraded model quality on retrieval-sensitive tasks. State-space alternatives such as Mamba replaced softmax attention with a linear-time recurrence but required either training from scratch or substantial architectural surgery.^[11] The third line, memory-efficient exact attention, kept the softmax but reorganised the computation to never materialise the N × N matrix at once. The decisive paper in this family was Tri Dao's FlashAttention (2022), which used IO-aware tiling and an "online softmax" recurrence to reduce single-device memory to O(N) while remaining numerically equivalent to standard attention.^[10]^[12]
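The "online softmax" recurrence mentioned above can be illustrated on a single query row. The following is a minimal sketch, not FlashAttention's actual kernel: it uses scalar values instead of value vectors, plain Python lists instead of GPU tiles, and hypothetical function names. It processes the scores one block at a time while carrying only a running maximum, a running denominator, and a running unnormalised output.

```python
import math


def online_softmax_weighted_sum(scores, values, block_size):
    """Compute sum_j softmax(scores)_j * values_j one block at a time."""
    m = float("-inf")  # running maximum logit
    l = 0.0            # running sum of exp(score - m)
    o = 0.0            # running unnormalised output
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescale old statistics to the new maximum
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        o = o * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return o / l


def reference_softmax_weighted_sum(scores, values):
    """Same quantity computed with the whole row in memory at once."""
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)


scores = [0.5, -1.0, 3.0, 2.0, 0.0, 1.5, -2.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
blockwise = online_softmax_weighted_sum(scores, values, block_size=3)
full = reference_softmax_weighted_sum(scores, values)
print(abs(blockwise - full) < 1e-12)  # True
```

Because only `m`, `l`, and `o` survive between blocks, the working memory per query row is independent of the number of keys.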

Blockwise Parallel Transformer (BPT)

The immediate predecessor of Ring Attention was the Blockwise Parallel Transformer (BPT), released in May 2023 by Hao Liu and Pieter Abbeel.^[13] BPT extended FlashAttention's online-softmax tiling beyond the attention layer to fuse the subsequent feed-forward network into the same blockwise pass, eliminating the temporary activation buffer that the FFN otherwise required between matmuls. BPT enabled training of sequences up to 32× longer than vanilla transformers and roughly 4× longer than the previous memory-efficient state of the art, on a single device.^[13] However, like FlashAttention, BPT remained fundamentally constrained by the per-device memory capacity: even with linear-in-N activations, the input embeddings, parameters, and optimizer states still had to fit on one accelerator.

Ring Attention with Blockwise Transformers is precisely the multi-device generalisation of BPT. By distributing the blockwise computation that BPT had restricted to a single device across an entire ring of devices, Liu, Zaharia, and Abbeel converted the per-device linear memory bound of BPT into a cluster-level linear bound, multiplied by the number of devices.^[1]^[14]

Sequence parallelism: prior approaches

Sharding activations along the sequence dimension was not itself novel in late 2023. NVIDIA's Megatron-LM had introduced "Sequence Parallelism" (SP) as a complement to its tensor parallelism (TP), but the original SP variant split only the layer-normalisation and dropout activations along the sequence axis; the attention operation itself was still gathered to full sequence length on each tensor-parallel rank before computation.^[15] Microsoft's DeepSpeed-Ulysses, posted in September 2023 just weeks before Ring Attention, took a different approach: it scattered the head dimension rather than the sequence dimension during attention, using all-to-all collectives to transpose the layout before and after the local softmax.^[7] Ulysses's parallelism degree, however, could not exceed the number of attention heads, a hard ceiling at perhaps 64 heads for typical configurations, and was poorly suited to grouped-query and multi-query attention models where head counts were deliberately reduced.^[7]^[16]

Ring Attention occupied a complementary niche: its parallelism degree was bounded only by the number of available devices and by the ratio between block FLOPS and ring bandwidth, not by any model-architectural parameter. It also did not require all-to-all collectives, instead using a point-to-point ring permute (jax.lax.ppermute in the JAX implementation) for communication, which mapped well onto the systolic NVLink/NVSwitch topologies of contemporary GPU pods and onto TPU ICI interconnects.^[1]^[17]

How does Ring Attention work?

Ring Attention is most easily understood as two nested loops, an outer loop over query blocks pinned to each device and an inner loop over key-value blocks that rotate around the ring, fused with the FlashAttention online-softmax recurrence so that no intermediate attention matrix is ever materialised.^[5]^[6]

Setup

Given input tokens 1, …, N and P devices, the sequence is partitioned into P contiguous slices of length $N/P$ . Device i receives slice i and computes its local query, key, and value projections $Q_i, K_i, V_i$ , each of shape $(N/P) \times d_{\mathrm{model}}$ . Inside each device, these are further tiled into sub-blocks of length c (the FlashAttention block size) so that the BPT/FlashAttention online-softmax recurrence can run entirely in on-chip SRAM.^[1]^[5]

Each device's job is to compute $\mathrm{Attn}(Q_i, K, V)$ , the attention of its query slice against the full concatenated K and V across all devices. The local share of K and V (namely $K_i$ and $V_i$ ) starts on device i; the remaining $P-1$ slices must visit the device over time.^[5]^[6]

The ring rotation

The P devices are logically arranged in a ring such that device i receives K and V from device $i - 1 \pmod{P}$ and sends its current K, V buffer to device $i + 1 \pmod{P}$ . The algorithm proceeds for P iterations. At iteration t:

Device i holds query slice $Q_i$ (which never moves) and a current key-value pair $(K^t, V^t)$ that was originally produced on device $(i - t) \bmod P$ .
Device i fires off an asynchronous Send/Recv ring-permute that transmits the current $K^t, V^t$ to the next device while simultaneously receiving $K^{t+1}, V^{t+1}$ from the previous device.
Concurrently, device i runs blockwise FlashAttention of $Q_i$ against the $K^t, V^t$ it currently holds, updating two running statistics for each query block: the running maximum logit $m$ and the running exponential-sum denominator $\ell$ (the standard online-softmax invariants), along with a running partial output $O$ .^[5]^[6]^[12]

After P iterations every query block has been attended against every key-value pair exactly once, and the final partial output $O$ has been correctly normalised by the running denominator $\ell$ . The result is numerically equivalent (up to floating-point reduction order) to standard softmax attention on the full $N \times N$ matrix.^[1]^[5]
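The rotation described above can be simulated on a single machine. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: the P "devices" are list slots visited sequentially (so nothing is actually overlapped), attention is non-causal, the sequence length is assumed divisible by P, and all function names are hypothetical. It checks that the ring result matches ordinary full attention.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V with all keys in one place."""
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        s = [dot(q, k) / math.sqrt(d) for k in K]
        mx = max(s)
        w = [math.exp(x - mx) for x in s]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z for c in range(dv)])
    return out


def ring_attention(Q, K, V, P):
    """Simulate P devices; device i keeps Q_i and sees one K/V shard per step."""
    n, d, dv = len(Q), len(Q[0]), len(V[0])
    shard = n // P
    split = lambda X: [X[i * shard:(i + 1) * shard] for i in range(P)]
    Qs, Ks, Vs = split(Q), split(K), split(V)
    # Per-device online-softmax statistics, one entry per local query row.
    m = [[float("-inf")] * shard for _ in range(P)]
    l = [[0.0] * shard for _ in range(P)]
    o = [[[0.0] * dv for _ in range(shard)] for _ in range(P)]
    kv = list(zip(Ks, Vs))  # kv[i] is the buffer currently held by device i
    for t in range(P):
        for i in range(P):  # on real hardware these run in parallel
            k_blk, v_blk = kv[i]  # produced on device (i - t) mod P
            for r, q in enumerate(Qs[i]):
                s = [dot(q, k) / math.sqrt(d) for k in k_blk]
                m_new = max(m[i][r], max(s))
                scale = math.exp(m[i][r] - m_new)
                p = [math.exp(x - m_new) for x in s]
                l[i][r] = l[i][r] * scale + sum(p)
                o[i][r] = [o[i][r][c] * scale
                           + sum(pj * v[c] for pj, v in zip(p, v_blk))
                           for c in range(dv)]
                m[i][r] = m_new
        if t < P - 1:
            # Ring permute: device i receives the buffer of device i - 1.
            kv = [kv[(i - 1) % P] for i in range(P)]
    return [[x / l[i][r] for x in o[i][r]] for i in range(P) for r in range(shard)]


random.seed(0)
n, d, P = 8, 4, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ring = ring_attention(Q, K, V, P)
ref = full_attention(Q, K, V)
max_err = max(abs(a - b) for ra, rb in zip(ring, ref) for a, b in zip(ra, rb))
print(max_err < 1e-9)  # True
```

Only P - 1 permutes are issued, since the buffer held in the final iteration does not need to be forwarded.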

How is communication overlapped with computation?

The decisive property of the algorithm is that the per-iteration ring-permute of one (K, V) block can be hidden behind the per-iteration blockwise attention compute, provided the block size is large enough. The original paper shows that with block size c, hidden dimension h, batch b, per-device peak FLOPS F, and per-link bandwidth B, the overlap condition is satisfied when $c \geq F/B$ (roughly: blocks must be large enough that arithmetic for one block takes at least as long as transferring the next).^[1]^[5] On an A100 with NVLink (≈ 300 GB/s per link, ≈ 300 TFLOPS BF16) the ratio $F/B$ is about 1,000, so the condition calls for block sizes of roughly 1,000 tokens, well within practical regimes. As long as this inequality holds, Ring Attention has the same wall-clock cost per token as the equivalent single-device blockwise attention, and the parallelism is essentially free.^[1]^[6]
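As a rough check of the overlap condition $c \geq F/B$, the sketch below plugs in the rounded A100 figures quoted above. The helper name is hypothetical, and the inputs are the approximate numbers from the text, taken at face value rather than measured values.

```python
def min_block_size(peak_flops: float, link_bandwidth: float) -> float:
    """Smallest block size c that satisfies the overlap condition c >= F / B."""
    return peak_flops / link_bandwidth


# Rounded figures from the text: ~300 TFLOPS BF16, ~300 GB/s per NVLink link.
c_min = min_block_size(peak_flops=300e12, link_bandwidth=300e9)
print(c_min)  # 1000.0
```

A slower interconnect lowers B and raises the minimum block size proportionally, which is why the condition becomes harder to meet on links slower than NVLink.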

The original paper's pseudocode initialises a buffer that accumulates the running attention output, then in each ring step calls jax.lax.ppermute (a non-blocking circular shift) on the K and V buffers while issuing the FlashAttention kernel on the current buffer; both operations live on the same XLA computation stream, allowing the XLA compiler to overlap them.^[1]^[17]

Backward pass

Training requires a backward pass through the same operation. The authors define a custom JAX vector-Jacobian product via jax.custom_vjp, rematerialising the same ring rotation in reverse during the backward computation to produce gradients with respect to Q, K, and V while preserving the linear-memory invariant. Activations are checkpointed at the ring-permute boundaries so that no quadratic-memory intermediate ever needs to be saved across the full backward pass.^[1]^[17]

What are the memory and communication costs?

The complexity properties of Ring Attention are central to its appeal and are worth stating precisely.

Memory cost per device. Each device stores only its slice of Q, K, V (size $O(N/P \cdot d)$ ) plus the running partial output and softmax statistics (size $O(c \cdot d)$ ) and any incoming K, V buffer (size $O(N/P \cdot d)$ ). The total per-device memory is therefore $O(N/P)$ in the sequence length, linear in the local shard and independent of total sequence length when P scales with N. The original paper reports the per-layer activation memory as exactly $6 b c h$ bytes (batch b, block size c, hidden h), constant in total sequence length.^[5]^[14] For comparison, vanilla attention's per-layer activation memory is $2 b h N^2$ , and even Megatron-style memory-efficient attention requires roughly $8 b N h$ .^[5]^[14]
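To make the three per-layer activation formulas concrete, the sketch below evaluates them for one illustrative configuration. The chosen values of b, c, h, and N are assumptions picked for round numbers, and the function names are hypothetical; N denotes the total sequence length.

```python
def ring_attention_bytes(b: int, c: int, h: int) -> int:
    """Per-layer activation memory reported for Ring Attention: 6 b c h."""
    return 6 * b * c * h


def vanilla_attention_bytes(b: int, h: int, n: int) -> int:
    """Per-layer activation memory for vanilla attention: 2 b h N^2."""
    return 2 * b * h * n * n


def memory_efficient_bytes(b: int, h: int, n: int) -> int:
    """Per-layer activation memory for memory-efficient attention: 8 b N h."""
    return 8 * b * n * h


b, c, h, n = 1, 4096, 4096, 1_048_576  # illustrative values only
print(ring_attention_bytes(b, c, h) / 2**20)    # MiB -> 96.0
print(memory_efficient_bytes(b, h, n) / 2**30)  # GiB -> 32.0
print(vanilla_attention_bytes(b, h, n) / 2**50) # PiB -> 8.0
```

The Ring Attention figure has no dependence on N at all, which is the sense in which its activation memory is constant in total sequence length.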

Communication cost per device. Over the P iterations each device sends a total of $(P-1) \times (N/P) \times 2 \times d_{\mathrm{head}} \times b$ key-value elements (the factor 2 covers both K and V), giving an aggregate communication volume of $O(N \cdot d)$ per layer per device, linear in N. When normalised by the $O(N^2 d / P)$ on-device compute, this gives a per-device communication-to-compute ratio of $O(P / N)$ , which shrinks as sequences grow longer, explaining why Ring Attention's relative overhead decreases with scale.^[5]^[6]
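The scaling of the communication-to-compute ratio can be sketched numerically. The helpers below are hypothetical and count elements and multiply-accumulate work only up to constant factors, using a single generic dimension d; they are meant to show the $O(P/N)$ trend, not to predict real timings.

```python
def kv_elements_sent(n: int, p: int, d: int, b: int = 1) -> int:
    """Elements one device sends per layer: (P - 1) * (N / P) * 2 * d * b."""
    return (p - 1) * (n // p) * 2 * d * b


def comm_to_compute_ratio(n: int, p: int, d: int, b: int = 1) -> float:
    """Sent elements divided by per-device attention work (N / P) * N * d * b."""
    compute = (n // p) * n * d * b
    return kv_elements_sent(n, p, d, b) / compute


p, d = 8, 128
short = comm_to_compute_ratio(n=2**17, p=p, d=d)
long = comm_to_compute_ratio(n=2**20, p=p, d=d)
print(short / long)  # 8.0: an 8x longer sequence gives an 8x smaller ratio
```

For a fixed device count the ratio falls in proportion to 1/N, so longer sequences leave more compute per transferred element.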

Net compute overhead. Because the algorithm performs the same FLOPs as standard attention, the FLOP count remains $N^2 d$ on each layer when summed across the ring, distributed evenly as $(N/P) \cdot N \cdot d$ on each device. There is no algorithmic compute overhead, only the IO of the ring-permute, which the overlap analysis above shows can be hidden.^[1]^[5]

The cumulative effect is that Ring Attention removes the per-device memory wall: provided communication can be hidden under compute, doubling the device count doubles the trainable sequence length at the same wall-clock speed per token, indefinitely, with no accuracy loss.^[1]^[5]

What did the original paper demonstrate?

The ICLR 2024 paper reported a series of scaling experiments that established the practical envelope of the algorithm:

8× A100 GPUs (7 B model). Prior memory-efficient methods topped out at 32 K tokens; Ring Attention enabled training up to 256 K, an 8× improvement, exactly matching the device count.^[5]
32× A100 GPUs (7 B model). Maximum trainable context reached 4,096 K (≈ 4.1 M) tokens, 32× larger than the prior baseline at the same hardware.^[5]
TPU v4-1024 pod (7 B model). Trainable context reached 8,192 K (≈ 8 M) tokens, a roughly 512× improvement over baseline.^[5]
TPU v4-1024 pod (30 B model). The largest reported configuration enabled training of sequences exceeding 100 million tokens, at the time the longest exact-attention sequence ever trained.^[1]^[5]

On reinforcement-learning benchmarks (ExoRL), Ring Attention let an "Action Transformer" condition on 128 simultaneous trajectories, a scale previously infeasible, while maintaining a cumulative-return score of 113.66 versus 111.13 for the BPT baseline.^[5] On long-context language-model fine-tuning, a LLaMA-13 B model adapted with Ring Attention to a 512 K-token context outperformed GPT-3.5-turbo-16K and Vicuna-16K on line-retrieval probes well beyond their advertised context windows.^[5]

Several refinements have been proposed since the original Ring Attention paper, all preserving the core ring-rotation structure but improving load balance, communication topology, or kernel integration.

Striped Attention

Posted on arXiv on 15 November 2023, six weeks after Ring Attention, by William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley, Striped Attention addresses a load-balancing flaw in Ring Attention's behaviour under causal (autoregressive) attention masking.^[18]^[19] In causal attention, query token $i$ attends only to keys $\leq i$ ; with contiguous sequence shards, the device holding the latest tokens performs full attention while the device holding the earliest tokens does almost nothing, producing a triangular workload imbalance whose worst-case efficiency is bounded by 50 %.^[18]

Striped Attention's fix is conceptually minimal: instead of giving each device a contiguous slice of the sequence, give it a striped (every-Pth-token) interleaved subset. This redistributes the causal triangle into stripes whose work is balanced across the ring at every rotation step. The authors report that "we are able to achieve up to 1.45x end-to-end throughput improvements over the original Ring Attention algorithm on causal transformer training at a sequence length of 256k" on A100 GPUs, and "on 16 TPUv4 chips, we were able to achieve 1.65x speedups at sequence lengths of 786k," with no change to the underlying numerics.^[18]^[19] The lucidrains PyTorch implementation includes striped attention as an option, and the zhuzilin implementation exposes a "zigzag" variant that achieves the same load-balancing goal with a slightly different permutation.^[8]^[20]
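The load-balancing effect can be seen by counting unmasked (query, key) pairs per device at each ring step. The sketch below is a simplified model, not the paper's benchmark: it treats the number of causal pairs as a proxy for work, assumes every device must wait for the slowest one at each step, and uses hypothetical function names.

```python
def shard_positions(n: int, p: int, striped: bool):
    """Token positions held by each of p devices."""
    if striped:
        return [list(range(i, n, p)) for i in range(p)]  # every p-th token
    size = n // p
    return [list(range(i * size, (i + 1) * size)) for i in range(p)]


def causal_work(q_pos, k_pos) -> int:
    """Number of (query, key) pairs allowed by the causal mask."""
    return sum(1 for q in q_pos for k in k_pos if k <= q)


def ring_utilisation(n: int, p: int, striped: bool) -> float:
    """Useful work divided by p times the sum of per-step maxima."""
    shards = shard_positions(n, p, striped)
    total, critical_path = 0, 0
    for t in range(p):
        # At step t, device i holds the K/V shard produced on device (i - t) mod p.
        step = [causal_work(shards[i], shards[(i - t) % p]) for i in range(p)]
        total += sum(step)
        critical_path += max(step)
    return total / (p * critical_path)


contiguous = ring_utilisation(n=64, p=4, striped=False)
striped = ring_utilisation(n=64, p=4, striped=True)
print(round(contiguous, 3), round(striped, 3))  # 0.575 0.956
```

In this toy setting the contiguous layout leaves some devices with fully masked blocks at most steps, while the striped layout gives every device a nearly equal share at every step.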

Tree Attention

Tree Attention, by Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, and Beren Millidge (Zyphra and EleutherAI; arXiv 2408.04093, 7 August 2024), is the most significant theoretical competitor to Ring Attention.^[21]^[22] Where Ring Attention executes a linear-time circular rotation, Tree Attention notes that the cross-device reduction inside attention is mathematically associative (the running maximum and running exponential-sum are both associative operations on log-sum-exp triples) and can therefore be performed as a tree reduction in $O(\log P)$ communication rounds rather than $O(P)$ .^[21]^[22]

The authors of Tree Attention derive their algorithm from a scalar energy-function formulation of attention with a Bayesian interpretation connecting it to Hopfield networks; the gradient of this energy function is the attention output, and an efficient forward computation (achieved via tree reduction over logsumexp) immediately gives an efficient backward.^[21]^[22] Reported speedups versus Ring Attention reach up to 8× faster decoding at 1 M sequence length with 2× less peak memory and 2× less communication volume.^[21]^[22] Tree Attention applies most cleanly to the decoding setting (one query against the full K, V cache); during prefill, the asymptotic advantage is preserved but the per-iteration constants narrow the gap.^[21] The lucidrains ring-attention-pytorch repository includes a "Tree Attention Decoding" implementation as a complementary, not exclusive, technique.^[8]
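The associativity argument can be sketched for the decoding case of one query against sharded keys. The code below is an illustration with scalar values and hypothetical function names, not the Tree Attention implementation: each shard produces a (max, denominator, unnormalised output) triple, and the triples are merged either sequentially, as a ring would, or pairwise in a tree.

```python
import math
from functools import reduce


def partial_state(scores, values):
    """Local (m, l, o) triple for one shard of keys and scalar values."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    return (m, sum(p), sum(pi * vi for pi, vi in zip(p, values)))


def combine(a, b):
    """Merge two triples; the operation is associative, so any grouping works."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return (m, l_a * s_a + l_b * s_b, o_a * s_a + o_b * s_b)


def tree_reduce(parts):
    """Pairwise reduction; returns the merged triple and the number of rounds."""
    rounds = 0
    while len(parts) > 1:
        merged = [combine(parts[i], parts[i + 1]) for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            merged.append(parts[-1])
        parts = merged
        rounds += 1
    return parts[0], rounds


scores = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 4.0, 0.9]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
parts = [partial_state([s], [v]) for s, v in zip(scores, values)]  # 8 shards
(m_t, l_t, o_t), rounds = tree_reduce(parts)
m_s, l_s, o_s = reduce(combine, parts)  # sequential, ring-like order
print(rounds)                                  # 3
print(abs(o_t / l_t - o_s / l_s) < 1e-12)      # True
```

With eight shards the tree needs three rounds where a sequential pass needs seven merges, matching the $O(\log P)$ versus $O(P)$ contrast in the text.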

Context Parallelism (NVIDIA Megatron-LM)

NVIDIA's Megatron-LM library, starting with version 0.5, introduced Context Parallelism (CP) as a first-class parallelism dimension alongside data, tensor, and pipeline parallelism.^[4]^[9] The official documentation describes CP as "similar to Ring Attention" but with several engineering refinements: it integrates the latest cuDNN FlashAttention kernels directly, applies a balanced causal-mask scheme inspired by Striped Attention, and converts the necessary all-gather/reduce-scatter collectives into peer-to-peer ring communications to match the ring topology of NVLink/NVSwitch fabrics.^[4]^[9] CP is enabled by setting context_parallel_size to the desired ring degree and is the recommended NVIDIA-blessed long-context training path inside Megatron-Core and NVIDIA NeMo.^[4]^[9]

DeepSpeed-Ulysses / hybrid 2D parallelism

Microsoft's DeepSpeed framework integrates Ulysses sequence parallelism (which scatters the head dimension via all-to-all collectives) with Ring Attention into a 2-D hybrid scheme. Each ring step is itself wrapped by Ulysses all-to-alls; the inner Ulysses degree absorbs heads while the outer Ring degree absorbs sequence length. This hybrid avoids both Ulysses's hard ceiling at the number of heads and Ring Attention's per-link bandwidth ceiling, and has been adopted in the SWIFT training framework with a partition rule requiring the sequence length to divide by world_size × 2 for zigzag-style load balancing.^[7]^[16] Reported memory reduction on Qwen2.5-3B across 8× A100s is approximately 4.2× (from 75.4 GiB to 17.9 GiB per device) when running with ulysses=2, ring=4.^[7]
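The divisibility rule for zigzag-style load balancing follows from how the sequence is cut. The sketch below shows one common zigzag assignment, in which the sequence is split into twice as many chunks as ranks and each rank takes one early and one late chunk. Whether SWIFT uses exactly this layout is an assumption, and the function name is hypothetical.

```python
def zigzag_shards(seq_len: int, num_ranks: int):
    """Assign rank i the chunks i and (2 * num_ranks - 1 - i) of the sequence."""
    num_chunks = 2 * num_ranks
    if seq_len % num_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * num_ranks")
    size = seq_len // num_chunks
    chunks = [list(range(c * size, (c + 1) * size)) for c in range(num_chunks)]
    return [chunks[i] + chunks[num_chunks - 1 - i] for i in range(num_ranks)]


for rank, positions in enumerate(zigzag_shards(seq_len=16, num_ranks=4)):
    print(rank, positions)
# 0 [0, 1, 14, 15]
# 1 [2, 3, 12, 13]
# 2 [4, 5, 10, 11]
# 3 [6, 7, 8, 9]
```

Pairing an early chunk with a late one gives each rank a similar amount of causal work, which is the same goal Striped Attention reaches with interleaving.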

Other variants

The literature has continued to multiply. RingFormer (arXiv 2501.01182, January 2025) applies ring attention specifically to neural vocoders. TokenRing (arXiv 2412.20501, December 2024) augments the ring with bidirectional communication to halve communication latency.^[23] RingX (SC '25 / ACM 3712285) targets HPC-scale long-context training, integrating ring attention with collective-communication primitives optimised for InfiniBand-coupled clusters.^[24] A line of unified sequence-parallelism work by Fang et al. (USP, arXiv 2405.07719) systematises the design space spanned by Ulysses, Ring Attention, and their hybrids.^[16]

What are the main open-source implementations?

The most widely used open-source implementations are:

haoliuhl/ringattention: the official JAX implementation by Hao Liu, released alongside the paper. It exports ringattention and blockwise_feedforward functions designed to compose with JAX's shard_map primitive for multi-device dispatch, supports causal-block masking, inference KV caching via cache_idx, and configurable rematerialisation policies. Licensed under Apache 2.0 and used as the attention kernel inside the Large World Model release.^[25]
lucidrains/ring-attention-pytorch: an independent PyTorch re-implementation by Phil Wang ("lucidrains") that splits along the sequence dimension and applies ring-reduce to the attention tiles, integrating Tri Dao's FlashAttention CUDA kernels and a custom forward Triton kernel. Supports striped attention, grouped-query attention to reduce per-step communication, tree-attention decoding, rotary embeddings, and variable per-rank sequence lengths. Released under the MIT license; active development continued into 2025.^[8]^[26]
zhuzilin/ring-flash-attention: a PyTorch library that exposes ring_flash_attn_qkvpacked_func and ring_flash_attn_varlen_func mirroring the FlashAttention APIs, plus a zigzag_ring_flash_attn_func for load-balancing and a llama3_flash_attn variant matching the configuration described in the Llama 3 tech report. Includes a Hugging Face model adapter; reports ~85-90 % of single-GPU throughput on 8× H800. MIT-licensed and used as the upstream for several downstream long-context fine-tuning projects.^[20]
gpu-mode/ring-attention: the GPU MODE community's pedagogical implementation, used as the basis for CUDA MODE Lecture 13 on ring attention.^[6]

NVIDIA's Megatron-Core Context Parallel module is the principal vendor-maintained implementation, integrated with cuDNN FlashAttention kernels and supporting MHA, MQA, and grouped-query attention variants from version 0.5 onward.^[4]^[9]

What models and frameworks use Ring Attention?

Public information about the use of Ring Attention in proprietary frontier models is uneven, and this article distinguishes carefully between confirmed and rumoured deployments.

Confirmed: Large World Model (UC Berkeley)

The Large World Model (LWM) project, released in February 2024 by the same Berkeley group (Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel), is the most thoroughly documented public deployment of Ring Attention. The accompanying paper "World Model on Million-Length Video And Language With Blockwise RingAttention" (arXiv 2402.08268) describes a 7 B-parameter autoregressive transformer trained progressively from 4 K to 1 M token contexts using Ring Attention as the core attention kernel.^[27]^[28] The model can answer questions about hour-long YouTube videos and reports near-perfect retrieval accuracy on million-token needle-in-a-haystack tasks, outperforming GPT-4V and Gemini Pro on the same probe at the time of release; in the harder multi-needle setting LWM-Text reports 0.97 accuracy retrieving one needle from four at 32 K context and maintains strong performance out to 1 M tokens.^[27]^[28] The entire training code, attention implementation, and 7 B-parameter weights are public on Hugging Face under the LargeWorldModel organisation.^[27]^[28]

Is Ring Attention used in Gemini 1.5 Pro?

Google DeepMind's Gemini 1.5 Pro, announced in February 2024 with a one-million-token context window (extendable to ten million in research demonstrations), is widely speculated in the technical-press and engineering communities to use Ring Attention or a close derivative as part of its long-context architecture.^[29] However, the published Gemini 1.5 technical report (arXiv 2403.05530) neither cites Ring Attention nor describes its attention sharding scheme in detail.^[30] The report describes Gemini 1.5 Pro as a "sparse mixture-of-expert (MoE) Transformer-based model" that achieves "near-perfect recall (>99%)" on retrieval up to at least 10 million tokens, "a generational leap" over Claude 2.1 (200K) and GPT-4 Turbo (128K), attributing the capability to a combination of Mixture-of-Experts sparse architecture, custom efficient attention, and infrastructural improvements without specifying the algorithm.^[30] The attribution to Ring Attention should therefore be regarded as plausible but unconfirmed by Google.^[30]

Reportedly not used: Magic.dev LTM-2

Magic's LTM-2-mini model, announced in August 2024 as the first commercial model with a 100-million-token context window, explicitly contrasts its architecture with attention-based mechanisms.^[31] Magic's public technical note states that "for each decoded token, LTM-2-mini's sequence-dimension algorithm is roughly 1000x cheaper than the attention mechanism in Llama 3.1 405B for a 100M token context window," indicating a non-attention recurrence rather than a Ring-Attention-style exact-attention scheme.^[31] Magic has not disclosed the specific algorithm but explicitly distinguishes it from the family of approaches descended from Ring Attention.^[31]

Unconfirmed: Anthropic Claude

There is no public statement from Anthropic identifying Ring Attention as the attention algorithm behind Claude's 200 K and 1 M context windows. Anthropic's public engineering writing emphasises long-context prompting practice and benchmarking, not the underlying parallelism strategy. Claims linking Ring Attention to Claude appearing in third-party blog posts are speculative.^[32]

Confirmed (framework-level): Megatron-LM / NeMo / DeepSpeed

The strongest "production" claim that can be made is at the framework level rather than the model level: every major open-source training stack, including NVIDIA Megatron-LM, NVIDIA NeMo, Microsoft DeepSpeed (in hybrid Ulysses+Ring mode), and PyTorch Lightning's Fabric long-context recipe, ships Ring-Attention-style context parallelism as a supported and recommended parallelism dimension for long-context model training, and many publicly available long-context fine-tunes (including LWM and several open community efforts) explicitly cite Ring Attention as the algorithmic basis.^[4]^[7]^[9]^[25]

Why is Ring Attention significant?

Ring Attention received its formal recognition at ICLR 2024 (Vienna, May 2024), where it was published in the main proceedings as paper 1119587863e78451f080da2a768c4935.^[3] In the year following publication it became one of the most heavily cited papers in the distributed-systems-for-ML subfield, accumulating implementations in JAX, PyTorch, Triton, and CUDA inside both academic and industry codebases.^[8]^[20]^[25]^[26]

Commentary in the engineering community has frequently described Ring Attention as the "missing primitive" that made the multi-million-token context windows of 2024 economically viable, comparable in significance to FlashAttention's role at the single-device tier. The GPU MODE / CUDA MODE educational community devoted an entire lecture (Lecture 13, February 2024) to the algorithm, and several open-source large-context training efforts (including LWM and follow-on releases) have cited it as enabling technology.^[6]^[27]

Critical commentary has focused on three limitations. The first, causal-mask load imbalance (addressed by Striped Attention and zigzag variants), was the most prominent first-generation concern.^[18]^[19] The second is communication latency: on links slower than NVLink (PCIe Gen 5, Ethernet, AWS EFA), transfers can break the overlap condition at smaller block sizes; the recommended remedy is grouped-query attention to reduce per-step (K, V) volume, or hybrid Ulysses+Ring 2-D schemes.^[7]^[16] The third is logarithmic versus linear scaling: Tree Attention has shown that the linear-time ring rotation is asymptotically suboptimal for decoding, where its O(P) communication rounds compare unfavourably with the O(log P) rounds of a tree reduction.^[21]^[22]

Nonetheless, the original Liu/Zaharia/Abbeel formulation remains the canonical reference. As of mid-2025, the term "Ring Attention" appears in most major ML-systems courses and surveys as the standard name for sequence-dimension parallelism of exact attention, and the algorithm's central insight, that the online-softmax recurrence is associative enough to commute with arbitrary ring-permute communication patterns, has become a foundational result of the long-context era.^[4]^[6]^[16]

Property	Vanilla attention	FlashAttention	Sequence Parallelism (Megatron, original)	DeepSpeed-Ulysses	Ring Attention	Tree Attention
Memory per device	$O(N^2)$	$O(N)$	$O(N^2)$ (gathered for attention)	$O(N)$	$O(N/P)$	$O(N/P)$
Communication rounds	n/a (single-device)	n/a (single-device)	none for attention itself	2× all-to-all per layer	$P-1$ ring permutes per layer	$\log_2 P$ tree reductions per decoding step
Parallelism ceiling	1 device	1 device	number of attention heads (via TP)	number of heads	number of devices	number of devices
Net compute overhead	0	0	0	0 (with sufficient heads)	0 (with overlap condition)	0 (asymptotically smaller constants)
Causal masking	trivial	trivial	trivial	trivial	imbalanced (fixed by Striped/zigzag)	trivial
Approximation?	no	no	no	no	no	no

The principal trade-off is between Ulysses's lower per-layer communication latency (one round-trip of all-to-all) and Ring Attention's freedom from the head-count ceiling. Production frameworks increasingly combine the two as a 2-D hybrid.^[4]^[7]^[16]

References

1. Liu, Hao; Zaharia, Matei; Abbeel, Pieter. "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv:2310.01889, submitted 3 October 2023. https://arxiv.org/abs/2310.01889
2. arXiv listing for 2310.01889, including version history v1 (Oct 3, 2023) through v4 (Nov 27, 2023). https://arxiv.org/abs/2310.01889v3
3. International Conference on Learning Representations (ICLR) 2024 proceedings: "RingAttention with Blockwise Transformers for Near-Infinite Context." https://proceedings.iclr.cc/paper_files/paper/2024/file/1119587863e78451f080da2a768c4935-Paper-Conference.pdf
4. NVIDIA Megatron-Core developer documentation, "context_parallel package." https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/api-guide/context_parallel.html
5. HTML rendering of arXiv 2310.01889: algorithm pseudocode, communication and memory analysis, experimental tables. https://arxiv.org/html/2310.01889
6. Christian Mills, "GPU MODE Lecture 13: Ring Attention" (notes on the CUDA MODE community lecture). https://christianjmills.com/posts/cuda-mode-notes/lecture-013/
7. Hugging Face Blog, "Ultra-Long Sequence Parallelism: Ulysses + Ring-Attention Technical Principles and Implementation." https://huggingface.co/blog/exploding-gradients/ulysses-ring-attention
8. lucidrains/ring-attention-pytorch GitHub repository. https://github.com/lucidrains/ring-attention-pytorch
9. NVIDIA Megatron-Core developer documentation, current revision, "context_parallel package." https://docs.nvidia.com/megatron-core/developer-guide/0.16.0/user-guide/features/context_parallel.html
10. Dao, Tri et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022, arXiv 2205.14135.
11. Coconut Mode, "Ring Attention Explained." https://coconut-mode.com/posts/ring-attention/
12. Online-softmax derivation in FlashAttention course notes (Zihao Ye, "From Online Softmax to FlashAttention"). https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf
13. Liu, Hao; Abbeel, Pieter. "Blockwise Parallel Transformer for Large Context Models." arXiv:2305.19370, submitted 30 May 2023. https://arxiv.org/abs/2305.19370
14. Aussie AI research review, "Ring Attention." https://www.aussieai.com/research/ring-attention
15. Korthikanti, V. et al. "Reducing Activation Recomputation in Large Transformer Models." NVIDIA / Megatron-LM Sequence Parallelism.
16. Fang, Jiarui et al. "USP: A Unified Sequence Parallelism Approach for Long Context Generative AI." arXiv 2405.07719. https://arxiv.org/html/2405.07719v3
17. haoliuhl/ringattention GitHub README. https://github.com/haoliuhl/ringattention
18. Brandon, William; Nrusimha, Aniruddha; Qian, Kevin; Ankner, Zachary; Jin, Tian; Song, Zhiye; Ragan-Kelley, Jonathan. "Striped Attention: Faster Ring Attention for Causal Transformers." arXiv:2311.09431, submitted 15 November 2023. https://arxiv.org/abs/2311.09431
19. Semantic Scholar record for Striped Attention. https://www.semanticscholar.org/paper/Striped-Attention:-Faster-Ring-Attention-for-Causal-Brandon-Nrusimha/ade22704be8a0fc3730d320cc7934b2ccbcd97e4
20. zhuzilin/ring-flash-attention GitHub repository. https://github.com/zhuzilin/ring-flash-attention
21. Shyam, Vasudev; Pilault, Jonathan; Shepperd, Emily; Anthony, Quentin; Millidge, Beren. "Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters." arXiv:2408.04093, submitted 7 August 2024. https://arxiv.org/abs/2408.04093
22. Zyphra blog post, "Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters." https://www.zyphra.com/post/tree-attention-topology-aware-decoding-for-long-context-attention-on-gpu-clusters
23. TokenRing: arXiv 2412.20501. https://arxiv.org/html/2412.20501v1
24. RingX, Proceedings of SC '25. https://dl.acm.org/doi/10.1145/3712285.3759859
25. haoliuhl/ringattention repository: official JAX implementation. https://github.com/haoliuhl/ringattention
26. Releases page of lucidrains/ring-attention-pytorch. https://github.com/lucidrains/ring-attention-pytorch/releases
27. Liu, Hao; Yan, Wilson; Zaharia, Matei; Abbeel, Pieter. "World Model on Million-Length Video And Language With Blockwise RingAttention." arXiv 2402.08268, submitted 13 February 2024. https://arxiv.org/abs/2402.08268
28. Large World Models project page, UC Berkeley. https://largeworldmodel.github.io/lwm/
29. Google blog, "Introducing Gemini 1.5, Google's next-generation AI model" (15 February 2024). https://blog.google/innovation-and-ai/products/google-gemini-next-generation-model-february-2024/
30. Gemini Team, Google. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv 2403.05530. https://arxiv.org/abs/2403.05530
31. Magic.dev blog, "100M Token Context Windows" (August 2024). https://magic.dev/blog/100m-token-context-windows
32. Anthropic blog, "Introducing 100K Context Windows." https://www.anthropic.com/news/100k-context-windows


