Sequence Parallelism
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,276 words
Sequence parallelism (SP) is a family of distributed training techniques for transformer-based neural networks that partitions activations along the sequence (token) dimension across multiple accelerators, reducing per-device activation memory and unlocking longer context lengths than would otherwise fit on a single device.[^1] Sequence parallelism is typically composed with tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) to form multi-dimensional parallel training topologies for large language models. Two principal modern variants exist: the LayerNorm/dropout sharding scheme introduced for Megatron-LM by Korthikanti et al. in 2022,[^1] and the all-to-all attention scheme of DeepSpeed-Ulysses introduced by Jacobs et al. in 2023.[^2] A closely related approach, Ring Attention by Liu, Zaharia, and Abbeel,[^3] distributes self-attention itself across devices using blockwise computation and ring communication, and forms the basis of NVIDIA's "Context Parallelism" feature in current Megatron-LM releases.[^4]
Training large transformer models is bottlenecked not only by parameter memory but by activation memory: the intermediate tensors stored between the forward and backward passes for gradient computation. For a single transformer layer with sequence length $s$, microbatch size $b$, hidden dimension $h$, and $a$ attention heads, the activation footprint is dominated by terms proportional to $sbh$ (for LayerNorm, residual, projection outputs) and $s^2ab$ (for the attention probability matrix), so memory grows linearly in $h$ and quadratically in $s$ in the worst case.[^1] As models scaled from GPT-3's 175 B parameters and 2 048-token context to multi-hundred-billion-parameter models trained at 8 K, 32 K, 128 K and beyond, activation memory became the binding constraint, often forcing practitioners to invoke full activation recomputation (gradient checkpointing) and pay an extra forward pass per step.[^1]
Classical tensor parallelism, introduced in the 2019 Megatron-LM paper, splits the weight matrices of attention and MLP blocks across $t$ tensor-parallel ranks along the hidden dimension.[^5] This reduces parameter and activation memory inside the attention/MLP regions by a factor of $t$, but it leaves LayerNorm, dropout, and the residual stream replicated across all $t$ ranks: those layers operate on $sbh$-shaped activations that are not sharded by hidden-dim TP.[^1] Sequence parallelism arose to close that gap.
The phrase "sequence parallelism" was first used by Li, Xue, Baranwal, Li, and You in a 2021 paper from the Colossal-AI group at NUS, which split sequences across GPUs and used a ring-style self-attention exchange to compute attention without ever materialising the full sequence on any one device.[^6] Their experiments scaled to over 114 K tokens on 64 P100 GPUs, demonstrating that splitting on the sequence axis was a viable alternative to splitting on the hidden axis.[^6] The term was subsequently reused, with different mechanics, by the NVIDIA Megatron group in 2022.
The Megatron-LM variant of sequence parallelism is defined in Reducing Activation Recomputation in Large Transformer Models by Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro, posted to arXiv on 10 May 2022.[^1] It is a refinement that sits inside tensor parallelism rather than replacing it.
A standard tensor-parallel transformer block has two TP regions: the attention block (column-parallel QKV projection, row-parallel output projection) and the MLP block (column-parallel first linear, row-parallel second linear). These regions are entered with an f operator (identity in forward, all-reduce in backward) and exited with a g operator (all-reduce in forward, identity in backward) in the original Megatron-LM design.[^5] Between TP regions sit LayerNorm and dropout, which operate token-wise and were therefore left unsharded, meaning every TP rank stored a redundant copy of the full $sbh$ activation.[^1]
Sequence parallelism modifies this by sharding the inputs and outputs of LayerNorm and dropout along the sequence dimension. In the forward pass, the g operator (TP region exit) becomes a reduce-scatter along the sequence axis instead of an all-reduce, and the f operator (TP region entry) becomes an all-gather along the sequence axis instead of an identity; the backward pass uses the conjugate collectives.[^1] Because an all-reduce equals a reduce-scatter followed by an all-gather, the aggregate communication volume per step is unchanged versus baseline TP, but the activation tensor entering and leaving each LayerNorm/dropout has shape $(s/t, b, h)$ instead of $(s, b, h)$, so every TP rank stores only $1/t$ of those activations.[^1]
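The identity behind the unchanged communication volume can be checked with a small single-process simulation. This is a minimal sketch, not a real collective library: the functions `all_reduce`, `reduce_scatter`, and `all_gather` below are hypothetical stand-ins that operate on plain Python lists, with one inner list per simulated rank.

```python
def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' vectors."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Rank r ends up with only the r-th contiguous shard of the element-wise sum."""
    t = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    shard = len(total) // t
    return [total[r * shard:(r + 1) * shard] for r in range(t)]


def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


# Assumed toy sizes: t = 4 simulated TP ranks, sequence length s = 8.
t, s = 4, 8
partial = [[float(rank * 10 + pos) for pos in range(s)] for rank in range(t)]

sharded = reduce_scatter(partial)          # what LayerNorm/dropout would see under SP
regathered = all_gather(sharded)           # what the next TP region would see

print(len(sharded[0]), "of", s, "positions held per rank between TP regions")
print(regathered == all_reduce(partial))
```

Between the two collectives each simulated rank holds only `s // t` positions, which is the memory saving described above; the composition still reproduces the all-reduce result.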
Korthikanti et al. derive activation-memory expressions per transformer layer. For pure TP with parallel size $t$, the activations per layer (in bytes) scale as $sbh \cdot (10 + 24/t) + 5 \cdot abs^2/t$, with the leading $10sbh$ term coming from LayerNorm, dropout, and residual paths that are not sharded by TP.[^1] With sequence parallelism applied, the unsharded $10sbh$ term is also divided by $t$, yielding $sbh \cdot 34/t + 5 \cdot abs^2/t$, an essentially uniform $1/t$ sharding of activation memory inside the TP region.[^1]
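The two expressions can be evaluated side by side. The following is a minimal sketch that transcribes the formulas quoted above; the function name and the example configuration (sequence length, hidden size, head count, TP degree) are illustrative assumptions, not values taken from the paper.

```python
def activation_bytes_per_layer(s, b, h, a, t, sequence_parallel):
    """Per-layer activation memory in bytes, using the expressions quoted above.

    s: sequence length, b: microbatch size, h: hidden dimension,
    a: attention heads, t: tensor-parallel size.
    """
    attention_term = 5 * a * b * s * s / t
    if sequence_parallel:
        linear_term = s * b * h * 34 / t
    else:
        linear_term = s * b * h * (10 + 24 / t)
    return linear_term + attention_term


# Assumed example configuration, for illustration only.
s, b, h, a, t = 2048, 1, 12288, 96, 8

tp_only = activation_bytes_per_layer(s, b, h, a, t, sequence_parallel=False)
tp_sp = activation_bytes_per_layer(s, b, h, a, t, sequence_parallel=True)

print(f"TP only : {tp_only / 2**20:8.1f} MiB per layer")
print(f"TP + SP : {tp_sp / 2**20:8.1f} MiB per layer")
print(f"saved   : {(tp_only - tp_sp) / 2**20:8.1f} MiB per layer")
```

The difference between the two results is exactly $10sbh(1 - 1/t)$, the portion of the LayerNorm, dropout, and residual activations that plain TP leaves replicated.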
Combined with their second contribution, selective activation recomputation (recomputing only the cheap-to-recompute attention softmax/dropout activations, while storing the rest), the Megatron paper reports a 5x reduction in activation memory and an over-90 % reduction in the time overhead from activation recomputation.[^1] Training a 530 B-parameter GPT-3-style model on 2 240 NVIDIA A100 GPUs reached 54.2 % Model FLOPs Utilization (MFU), versus 42.1 % with full recomputation, a 29 % speedup.[^1]
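The 29 % figure is consistent with the ratio of the two utilization numbers, under the simplifying assumption that step throughput on fixed hardware and a fixed model is proportional to MFU:

$$
\frac{54.2}{42.1} \approx 1.287, \qquad \text{i.e. roughly a } 29\,\% \text{ speedup.}
$$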
Megatron-style SP has three important properties:
sequence_parallel=True flag on tensor_model_parallel_size > 1.[^7]
DeepSpeed-Ulysses, presented in DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models by Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He (arXiv 2309.14509, submitted 25 September 2023), takes a different design philosophy.[^2] Rather than refining hidden-dimension TP, Ulysses keeps the sequence partitioned across devices for most of the computation and only briefly reshuffles for attention.
Let $P$ be the number of sequence-parallel devices and $h$ be the number of attention heads. Outside attention, every device holds an $(N/P) \times d$ slice of the activation, where $N$ is the sequence length and $d$ is the hidden dimension. Linear projections, LayerNorm, MLP, and residuals all operate on this sequence-sharded representation.[^2]
Right before attention, Ulysses applies an all-to-all collective on the projected Q, K, V tensors. After the all-to-all, each device holds the full sequence (length $N$) but only a non-overlapping subset of attention heads ($h/P$ heads per device).[^2] This requires $P \mid h$, i.e. the number of heads is divisible by the SP degree. Each device then computes ordinary head-parallel attention with any backend, including FlashAttention v2, on its head subset over the full sequence.[^2] A second all-to-all redistributes the per-head outputs back into the sequence-sharded layout for the output projection, MLP, and downstream layers.[^2]
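The layout change performed by the two all-to-all collectives can be illustrated with labelled entries instead of real tensors. This is a single-process sketch under stated assumptions: the function names are hypothetical, each activation entry is replaced by a `(token, head)` label, and no real communication library is involved.

```python
def ulysses_all_to_all(seq_sharded, num_heads):
    """Sequence-sharded -> head-sharded layout.

    seq_sharded[p] holds the token rows owned by device p; each row lists
    one entry per attention head.
    """
    P = len(seq_sharded)
    if num_heads % P != 0:
        raise ValueError("number of heads must be divisible by the SP degree")
    heads_per_dev = num_heads // P
    head_sharded = []
    for q in range(P):                       # destination device
        lo, hi = q * heads_per_dev, (q + 1) * heads_per_dev
        rows = []
        for p in range(P):                   # source devices, in sequence order
            rows.extend(row[lo:hi] for row in seq_sharded[p])
        head_sharded.append(rows)
    return head_sharded


def ulysses_all_to_all_inverse(head_sharded):
    """Head-sharded -> sequence-sharded layout (the second all-to-all)."""
    P = len(head_sharded)
    tokens_per_dev = len(head_sharded[0]) // P
    seq_sharded = []
    for p in range(P):
        rows = []
        for n in range(p * tokens_per_dev, (p + 1) * tokens_per_dev):
            rows.append([e for q in range(P) for e in head_sharded[q][n]])
        seq_sharded.append(rows)
    return seq_sharded


# Assumed toy sizes: N = 8 tokens, 4 heads, P = 2 devices.
N, H, P = 8, 4, 2
seq_sharded = [
    [[(n, hd) for hd in range(H)] for n in range(p * N // P, (p + 1) * N // P)]
    for p in range(P)
]
head_sharded = ulysses_all_to_all(seq_sharded, H)

print(len(seq_sharded[0]), "tokens x", len(seq_sharded[0][0]), "heads before")
print(len(head_sharded[0]), "tokens x", len(head_sharded[0][0]), "heads after")
```

After the first exchange every simulated device sees all `N` tokens but only `H // P` heads, so an unmodified attention kernel can run locally; the inverse restores the sequence-sharded layout.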
Because an all-to-all over $P$ devices moves $M/P$ of an aggregate message of size $M$ over each link, Ulysses achieves a per-link attention communication volume of $4Nd/P$ ($3Nd$ for the Q, K, and V projections plus $Nd$ for the attention output, spread over $P$ links), i.e. $O(N/P)$ per device.[^2] In contrast, the Megatron-LM SP approach scales as $O(N)$ per link because it relies on all-gather along the sequence dimension to assemble inputs for attention.[^2] As long as $N$ and $P$ scale proportionally (a common regime in long-context training), Ulysses keeps per-link communication volume constant, a property the authors highlight as a key scaling advantage.[^2]
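The constant-volume property is easy to check numerically. The sketch below assumes the per-link volume is counted in tensor elements and that the width factor is the hidden dimension $d$ defined earlier; the function name and the example sizes are illustrative assumptions.

```python
def ulysses_elements_per_link(N, d, P):
    """Per-link all-to-all volume for Q, K, V and the attention output (in elements)."""
    return 4 * N * d / P


# Assumed hidden dimension; sequence length and SP degree are doubled together.
d = 4096
configs = [(32_768, 8), (65_536, 16), (131_072, 32)]

volumes = [ulysses_elements_per_link(N, d, P) for N, P in configs]
for (N, P), v in zip(configs, volumes):
    print(f"N={N:>7}  P={P:>2}  per-link elements={v:,.0f}")
```

Doubling $N$ and $P$ together leaves the per-link figure unchanged, whereas a volume proportional to $N$ alone would double at each step.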
The Ulysses paper reports training with sequence lengths over a million tokens, supporting sequences "4x longer than existing systems" while improving training throughput by up to 2.5x.[^2] The DeepSpeed blog elaborates that the design is attention-implementation agnostic (it works with dense, sparse, and FlashAttention-2 kernels), composes with DeepSpeed ZeRO-3 sharding, and exposes a DistributedAttention wrapper that requires minimal model code changes.[^8]
The principal limitation is the head-divisibility constraint: SP degree cannot exceed the number of attention heads, which caps Ulysses parallelism. For models with grouped-query attention or few heads this can be restrictive, and it creates friction when composing Ulysses with hidden-dim TP, because both methods consume the same head count.[^9]
A complementary approach is Ring Attention with Blockwise Transformers for Near-Infinite Context by Hao Liu, Matei Zaharia, and Pieter Abbeel, posted to arXiv on 3 October 2023 (2310.01889).[^3] Ring Attention attacks attention's quadratic memory cost head-on by distributing it across the devices that hold the sequence shards.
Ring Attention organises $P$ devices into a logical ring. Each device $i$ holds the local sequence chunk $X_i$ of length $N/P$ and its corresponding queries $Q_i$, keys $K_i$, and values $V_i$. Attention is computed blockwise (in the style of FlashAttention) on a per-chunk basis using running softmax statistics.[^3] While each device computes attention between its $Q_i$ and the currently held $(K_j, V_j)$ block, it simultaneously sends $(K_j, V_j)$ to the next device in the ring and receives $(K_{j-1}, V_{j-1})$ from the previous one. After $P$ steps the ring has rotated all KV blocks past every $Q_i$ and the global attention output has been accumulated.[^3] The crucial system property is that the cost of communicating a KV block can be fully overlapped by the cost of computing one block of attention, so when the per-block compute exceeds the per-block transfer, communication is hidden.[^3]
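The rotation and the running-softmax accumulation can be simulated in a single process and compared against ordinary attention. This is a minimal sketch under several assumptions: it is bidirectional (no causal mask), omits the usual $1/\sqrt{d}$ score scaling, replaces real send/receive with an index rotation, and uses hypothetical function names. It demonstrates exactness, not the communication overlap.

```python
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def full_attention(Q, K, V):
    """Reference softmax(Q K^T) V on one device."""
    out = []
    for q in Q:
        scores = [dot(q, k) for k in K]
        m = max(scores)
        w = [math.exp(x - m) for x in scores]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z for c in range(len(V[0]))])
    return out


def ring_attention(Q_chunks, K_chunks, V_chunks):
    """Each simulated device keeps its Q chunk; KV blocks rotate around the ring."""
    P = len(Q_chunks)
    dim = len(V_chunks[0][0])
    run_max = [[-math.inf] * len(Qc) for Qc in Q_chunks]   # running row maxima
    run_sum = [[0.0] * len(Qc) for Qc in Q_chunks]         # running softmax denominators
    run_out = [[[0.0] * dim for _ in Qc] for Qc in Q_chunks]
    held = list(range(P))        # held[i] = index of the KV block currently on device i
    for _ in range(P):
        for i in range(P):
            K, V = K_chunks[held[i]], V_chunks[held[i]]
            for r, q in enumerate(Q_chunks[i]):
                scores = [dot(q, k) for k in K]
                new_max = max(run_max[i][r], max(scores))
                rescale = math.exp(run_max[i][r] - new_max)   # 0.0 on the first block
                w = [math.exp(x - new_max) for x in scores]
                run_sum[i][r] = run_sum[i][r] * rescale + sum(w)
                run_out[i][r] = [o * rescale + sum(wj * v[c] for wj, v in zip(w, V))
                                 for c, o in enumerate(run_out[i][r])]
                run_max[i][r] = new_max
        # Device i passes its block to device i + 1, so it receives from i - 1.
        held = [held[(i - 1) % P] for i in range(P)]
    return [[o / run_sum[i][r] for o in run_out[i][r]]
            for i in range(P) for r in range(len(Q_chunks[i]))]


# Assumed toy sizes: N = 8 tokens, P = 4 devices, head dimension 3.
rng = random.Random(0)
N, P, dim = 8, 4, 3
Q, K, V = ([[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(N)] for _ in range(3))
chunk = N // P
split = lambda X: [X[i * chunk:(i + 1) * chunk] for i in range(P)]

ring = ring_attention(split(Q), split(K), split(V))
ref = full_attention(Q, K, V)
max_err = max(abs(x - y) for a, b in zip(ring, ref) for x, y in zip(a, b))
print("matches full attention:", max_err < 1e-9)
```

No simulated device ever holds more than one KV block of length `N // P` at a time, yet the accumulated output agrees with the single-device reference up to floating-point rounding.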
Because no device ever materialises the full $N \times N$ attention matrix or the full KV cache, per-device memory scales with the local chunk length $N/P$ rather than with the total sequence length $N$. Liu et al. demonstrate training sequences "device count times longer" than what blockwise-only baselines like Blockwise Parallel Transformers (BPT) could handle, scaling to millions of tokens.[^3] Ring Attention is exact, not an approximation: no tokens are dropped, no attention pattern is restricted.[^3]
Ring Attention is sometimes described as the distributed-memory analogue of FlashAttention.[^9] FlashAttention shards attention across a single GPU's SRAM tiles, using the online-softmax trick to fold the full attention into a streaming computation; Ring Attention applies the same blockwise pattern across a multi-GPU memory hierarchy, with the inter-GPU ring exchange playing the role that HBM-to-SRAM streaming plays inside a single device.[^9] In practice the two compose: Ring Attention dispatches local blocks to FlashAttention kernels.[^4]
For decoder-only language models the attention mask is lower-triangular. If the sequence is split into $P$ equal contiguous chunks and laid out 0 to $P-1$ along the ring, then rank 0 has the fewest tokens to attend over (only its own past) while rank $P-1$ has the most, yielding poor load balance.[^9] Liu et al. and subsequent work proposed reordering ("Striped Attention") and chunk interleaving to equalise per-rank work; the canonical Megatron-LM Context Parallelism implementation includes this optimisation by default.[^4][^9]
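The imbalance, and the effect of reordering, can be quantified by counting the causal query-key pairs each rank is responsible for. The sketch below is illustrative only: total pair count is used as a rough proxy for work, the function names are hypothetical, and the "zigzag" layout is one assumed interleaving (each rank takes one early and one late chunk) that is not claimed to match any particular implementation.

```python
def causal_work(tokens):
    """Number of (query, key) pairs with key <= query for the queries a rank owns."""
    return sum(n + 1 for n in tokens)


def contiguous_layout(N, P):
    c = N // P
    return [list(range(r * c, (r + 1) * c)) for r in range(P)]


def striped_layout(N, P):
    """Round-robin assignment: token n goes to rank n % P."""
    return [list(range(r, N, P)) for r in range(P)]


def zigzag_layout(N, P):
    """Split into 2P chunks; rank r takes chunk r and chunk 2P - 1 - r."""
    c = N // (2 * P)
    chunks = [list(range(i * c, (i + 1) * c)) for i in range(2 * P)]
    return [chunks[r] + chunks[2 * P - 1 - r] for r in range(P)]


# Assumed toy sizes: N = 16 tokens, P = 4 ranks.
N, P = 16, 4
work = {
    name: [causal_work(tokens) for tokens in layout(N, P)]
    for name, layout in [("contiguous", contiguous_layout),
                         ("striped", striped_layout),
                         ("zigzag", zigzag_layout)]
}
for name, per_rank in work.items():
    print(f"{name:<10} {per_rank}")
# contiguous [10, 26, 42, 58]
# striped    [28, 32, 36, 40]
# zigzag     [34, 34, 34, 34]
```

With contiguous chunks the last rank does almost six times the work of the first in this toy case; both reorderings keep the same total (136 pairs) while narrowing or removing the spread.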
In modern NVIDIA training stacks (Megatron Core, NeMo Framework, Megatron-Bridge), the production sequence-sharding feature is called Context Parallelism (CP) and is distinct from the original Megatron SP.[^4][^7] CP combines ring attention with classical Korthikanti-style SP and is the recommended path for long-context training above roughly 32 K tokens.[^7]
CP partitions network inputs and all activations along the sequence dimension, not only LayerNorm/dropout.[^4] Each GPU stores only its sequence chunk of every layer's activations and KV cache. For attention, CP uses a ring-style exchange: each rank gathers KV chunks from peers as needed and pipelines the gather with the local attention computation.[^4] The Megatron Core documentation notes that "all-gather and reduce-scatter communications are transformed to point-to-point communications in ring topology under the hood," with a configurable cp_comm_type parameter that accepts p2p, all_gather, a2a, or a2a+p2p; the p2p mode is implemented as ring-exchange send/receive operations hard-coded to overlap with the attention compute of sequence chunks.[^4]
NVIDIA positions CP as an improvement over the original Ring Attention paper on two axes: it leverages the current OSS and cuDNN FlashAttention kernels for the per-chunk compute, and it eliminates the wasted lower-triangular work and load imbalance from causal masking by reordering chunks along the ring.[^4]
CP is orthogonal to TP, PP, DP, and Expert Parallelism: the total GPU count satisfies $\text{world size} = \text{TP} \times \text{CP} \times \text{PP} \times \text{DP}$.[^4] Korthikanti-style SP within the TP region is typically kept on whenever TP > 1, so a typical 128 K-token training job for a 70 B model might run with TP = 8 (+ SP inside TP), CP = 8, PP = 4, DP = 4 on 1 024 GPUs.[^10]
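The arithmetic of such a configuration is simple enough to check in a few lines. This is a sketch with hypothetical helper names; it assumes "128 K" means 131 072 tokens and that the sequence divides evenly across CP ranks.

```python
def world_size(tp, cp, pp, dp):
    """Total GPU count for a TP x CP x PP x DP topology."""
    return tp * cp * pp * dp


def tokens_per_cp_rank(seq_len, cp):
    """Sequence chunk length held by each context-parallel rank."""
    if seq_len % cp != 0:
        raise ValueError("sequence length must be divisible by the CP degree")
    return seq_len // cp


# The example job described above.
job = dict(tp=8, cp=8, pp=4, dp=4)
gpus = world_size(**job)
local_tokens = tokens_per_cp_rank(128 * 1024, job["cp"])

print(gpus)          # 1024
print(local_tokens)  # 16384
```

Raising the CP degree shrinks the per-rank chunk without touching the other axes, but the product of all four degrees must still equal the number of GPUs available.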
The Llama 3 herd of models paper from Meta documents using Context Parallelism extensively for long-context phases of pre-training: when extending Llama 3 to 128 K, CP = 16 lets each rank still see only 8 K tokens, matching the activation footprint of the short-context base training and re-using the existing 3D parallel topology.[^10] NVIDIA's developer blog further reports that on B200 hardware CP delivers more than 2x speedup at long sequences, and that CP becomes "mandatory" at sequence lengths approaching one million tokens.[^11]
| Variant | Year | Shards along | Attention strategy | Communication for attention | Composes with TP? | Bound by head count? |
|---|---|---|---|---|---|---|
| Colossal-AI SP (Li et al.) | 2021[^6] | Sequence | Ring self-attention | $O(N)$ per ring step | yes | no |
| Megatron SP (Korthikanti et al.) | 2022[^1] | Sequence (only at LN/dropout) | Standard TP attention | none extra | required (TP > 1) | no |
| DeepSpeed-Ulysses | 2023[^2] | Sequence | Head-parallel after all-to-all | $O(N/P)$ per link | with friction (heads shared) | yes ($P \mid h$) |
| Ring Attention (Liu et al.) | 2023[^3] | Sequence | Blockwise + ring KV rotation | $O(N/P)$ per ring step, overlapped | yes | no |
| Megatron Context Parallelism | 2023 to present[^4] | Sequence (all activations) | Ring + FlashAttention, causal-aware | $O(N/P)$, overlapped | yes (orthogonal axis) | no |
In every variant the activation memory at attention scales as $O(N/P)$ when SP is engaged across $P$ ranks, but the achievable $P$ and the communication overhead differ.
A 2024 paper by Fang and Zhao, USP: A Unified Sequence Parallelism Approach for Long-Context Generative AI (arXiv 2405.07719), proposes hybridising Ulysses and Ring Attention into a unified hierarchical scheme that can run Ulysses over a smaller dimension (e.g. within a node, where all-to-all is cheap) and Ring over a larger dimension (e.g. across nodes, where overlapped point-to-point is preferable).[^9] USP reports 47 % MFU and 208 K-token training on LLaMA3-8B over two 8x A800 nodes.[^9]
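One way to picture the hybrid is as a two-dimensional grid of ranks, with Ulysses running along one axis and Ring along the other. The sketch below is an assumption-laden illustration rather than the USP implementation: the function name is hypothetical, and it assumes consecutive rank numbers share a node so that each Ulysses group stays inside one node.

```python
def usp_groups(world, ulysses_degree):
    """Split `world` sequence-parallel ranks into Ulysses groups and Ring groups."""
    if world % ulysses_degree != 0:
        raise ValueError("world size must be divisible by the Ulysses degree")
    ring_degree = world // ulysses_degree
    # Consecutive ranks (assumed to share a node) exchange via all-to-all.
    ulysses = [list(range(g * ulysses_degree, (g + 1) * ulysses_degree))
               for g in range(ring_degree)]
    # Ranks with the same position in each node form a ring across nodes.
    ring = [list(range(k, world, ulysses_degree)) for k in range(ulysses_degree)]
    return ulysses, ring


# Two nodes of eight GPUs each, as in the setup reported above.
ulysses, ring = usp_groups(world=16, ulysses_degree=8)
print("Ulysses groups:", ulysses)
print("Ring groups   :", ring)
```

Every rank belongs to exactly one group of each kind, and the total sequence-parallel degree is the product of the two group sizes, so the head-divisibility constraint applies only to the Ulysses degree.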
Sequence parallelism is one axis of multi-dimensional parallelism. Its interactions with the others are as follows.[^7][^10]
The principal applications of sequence parallelism are:
Sequence parallelism is not free.
Sequence parallelism sits between three closely related lines of work. Activation-memory-reducing methods such as gradient checkpointing reduce memory at the cost of recomputation. Memory-efficient attention kernels such as FlashAttention reduce attention's intra-device memory without sharding the sequence across devices. And model-parallel methods such as tensor parallelism and pipeline parallelism shard parameters and layers but leave per-rank sequence length unchanged. Sequence parallelism complements all three, and modern long-context training pipelines combine them all.