Sequence Parallelism

Sequence parallelism (SP) is a family of distributed training techniques for transformer-based neural networks that partitions activations along the sequence (token) dimension across multiple accelerators, reducing per-device activation memory and unlocking longer context lengths than would otherwise fit on a single device.[^1] Sequence parallelism is typically composed with tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) to form multi-dimensional parallel training topologies for large language models. Two principal modern variants exist: the LayerNorm/dropout sharding scheme introduced for Megatron-LM by Korthikanti et al. in 2022,[^1] and the all-to-all attention scheme of DeepSpeed-Ulysses introduced by Jacobs et al. in 2023.[^2] A closely related approach, Ring Attention by Liu, Zaharia, and Abbeel,[^3] distributes self-attention itself across devices using blockwise computation and ring communication, and forms the basis of NVIDIA's "Context Parallelism" feature in current Megatron-LM releases.[^4]

Background

Training large transformer models is bottlenecked not only by parameter memory but by activation memory: the intermediate tensors stored between the forward and backward passes for gradient computation. For a single transformer layer with sequence length $s$, microbatch size $b$, hidden dimension $h$, and $a$ attention heads, the activation footprint is dominated by terms proportional to $sbh$ (for LayerNorm, residual, projection outputs) and $s^2ab$ (for the attention probability matrix), so memory grows linearly in $h$ and quadratically in $s$ in the worst case.[^1] As models scaled from GPT-3's 175 B parameters and 2 048-token context to multi-hundred-billion-parameter models trained at 8 K, 32 K, 128 K and beyond, activation memory became the binding constraint, often forcing practitioners to invoke full activation recomputation (gradient checkpointing) and pay an extra forward pass per step.[^1]

Classical tensor parallelism, introduced in the 2019 Megatron-LM paper, splits the weight matrices of attention and MLP blocks across $t$ tensor-parallel ranks along the hidden dimension.[^5] This reduces parameter and activation memory inside the attention/MLP regions by a factor of $t$, but it leaves LayerNorm, dropout, and the residual stream replicated across all $t$ ranks: those layers operate on $sbh$-shaped activations that are not sharded by hidden-dim TP.[^1] Sequence parallelism arose to close that gap.

The phrase "sequence parallelism" was first used by Li, Xue, Baranwal, Li, and You in a 2021 paper from the Colossal-AI group at NUS, which split sequences across GPUs and used a ring-style self-attention exchange to compute attention without ever materialising the full sequence on any one device.[^6] Their experiments scaled to over 114 K tokens on 64 P100 GPUs, demonstrating that splitting on the sequence axis was a viable alternative to splitting on the hidden axis.[^6] The term was subsequently reused, with different mechanics, by the NVIDIA Megatron group in 2022.

Megatron-LM Sequence Parallelism (Korthikanti et al., 2022)

The Megatron-LM variant of sequence parallelism is defined in Reducing Activation Recomputation in Large Transformer Models by Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro, posted to arXiv on 10 May 2022.[^1] It is a refinement that sits inside tensor parallelism rather than replacing it.

Mechanism

A standard tensor-parallel transformer block has two TP regions: the attention block (column-parallel QKV projection, row-parallel output projection) and the MLP block (column-parallel first linear, row-parallel second linear). These regions are entered with an f operator (identity in forward, all-reduce in backward) and exited with a g operator (all-reduce in forward, identity in backward) in the original Megatron-LM design.[^5] Between TP regions sit LayerNorm and dropout, which operate token-wise and were therefore left unsharded, meaning every TP rank stored a redundant copy of the full $sbh$ activation.[^1]

Sequence parallelism modifies this by sharding the inputs and outputs of LayerNorm and dropout along the sequence dimension. The g operator (TP region exit) becomes a reduce-scatter along the sequence axis instead of an all-reduce, and the f operator (TP region entry) becomes an all-gather along the sequence axis instead of a no-op.[^1] Because an all-reduce equals a reduce-scatter followed by an all-gather, the aggregate communication volume per step is unchanged versus baseline TP, but the activation tensor entering and leaving each LayerNorm/dropout has shape $(s/t, b, h)$ instead of $(s, b, h)$, so every TP rank stores only $1/t$ of those activations.[^1]
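
The identity behind the "unchanged communication volume" claim can be checked with a toy, single-process simulation. The sketch below is not Megatron code: the three collectives are plain-Python stand-ins (hypothetical helper names) operating on one list per simulated rank, and the tensor is flattened to one number per token.

```python
# Toy collectives over a list of per-rank vectors (no real communication).

def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def reduce_scatter(per_rank):
    """Rank r ends up with only the r-th contiguous chunk of the sum."""
    t = len(per_rank)
    n = len(per_rank[0])
    assert n % t == 0, "sequence length must be divisible by the TP size"
    chunk = n // t
    total = [sum(vals) for vals in zip(*per_rank)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(t)]

def all_gather(per_rank_chunks):
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = [x for chunk in per_rank_chunks for x in chunk]
    return [list(full) for _ in per_rank_chunks]

t = 4   # tensor-parallel ranks (illustrative)
s = 8   # sequence length (illustrative)
# partial[r][i] stands for rank r's partial result for token i at the TP exit.
partial = [[(r + 1) * 10 + i for i in range(s)] for r in range(t)]

baseline = all_reduce(partial)      # plain TP: full (s,) result on every rank
sharded = reduce_scatter(partial)   # SP: each rank keeps s/t tokens for LayerNorm/dropout
restored = all_gather(sharded)      # SP: full sequence rebuilt at the next TP entry

print(restored == baseline)         # True
print([len(c) for c in sharded])    # [2, 2, 2, 2]
```

Between the reduce-scatter and the all-gather each simulated rank holds only `s/t` tokens, which is where the activation saving comes from.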

Memory savings

Korthikanti et al. derive activation-memory expressions per transformer layer. For pure TP with parallel size $t$, the activations per layer (in bytes) scale as $sbh \cdot (10 + 24/t) + 5 \cdot abs^2/t$, with the leading $10sbh$ term coming from LayerNorm, dropout, and residual paths that are not sharded by TP.[^1] With sequence parallelism applied, the unsharded $10sbh$ term is also divided by $t$, yielding $sbh \cdot 34/t + 5 \cdot abs^2/t$, an essentially uniform $1/t$ sharding of activation memory inside the TP region.[^1]
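
As a rough illustration, the two expressions above can be evaluated directly. The function names are hypothetical, and the shape used (a GPT-3-like $h = 12288$, $a = 96$, $s = 2048$, $b = 1$, $t = 8$) is an assumed example rather than a configuration taken from the paper's tables.

```python
import math

def activation_bytes_tp(s, b, h, a, t):
    """Per-layer activation bytes with tensor parallelism only."""
    return s * b * h * (10 + 24 / t) + 5 * a * b * s * s / t

def activation_bytes_tp_sp(s, b, h, a, t):
    """Per-layer activation bytes with tensor + sequence parallelism."""
    return s * b * h * 34 / t + 5 * a * b * s * s / t

# Assumed, illustrative shape.
s, b, h, a, t = 2048, 1, 12288, 96, 8

tp_only = activation_bytes_tp(s, b, h, a, t)
tp_sp = activation_bytes_tp_sp(s, b, h, a, t)
saved = tp_only - tp_sp

# The saving is exactly the formerly replicated 10*sbh term, minus its 1/t share.
assert math.isclose(saved, 10 * s * b * h * (1 - 1 / t))

print(f"TP only : {tp_only / 2**20:.1f} MiB per layer")
print(f"TP + SP : {tp_sp / 2**20:.1f} MiB per layer")
```

With $t = 1$ both functions reduce to the same unsharded value, $34\,sbh + 5\,abs^2$, which is a quick sanity check on the coefficients.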

Combined with their second contribution, selective activation recomputation (recomputing only the cheap-to-recompute attention softmax/dropout activations, while storing the rest), the Megatron paper reports a 5x reduction in activation memory and an over-90 % reduction in the time overhead from activation recomputation.[^1] Training a 530 B-parameter GPT-3-style model on 2 240 NVIDIA A100 GPUs reached 54.2 % Model FLOPs Utilization (MFU), versus 42.1 % with full recomputation, a 29 % speedup.[^1]

Properties

Megatron-style SP has three important properties:

It requires TP > 1. SP is defined relative to a TP region; with $t = 1$ there is nothing to all-gather. NeMo and Megatron Core gate the sequence_parallel=True flag on tensor_model_parallel_size > 1.[^7]
Total communication is unchanged. An all-reduce decomposes exactly into a reduce-scatter followed by an all-gather, so SP reduces activation memory at no extra cost in communication volume.[^1]
It does not extend to attention computation itself. Attention is computed inside the TP region with the full sequence present (in a head-parallel fashion across TP ranks), so SP alone does not lift the per-device $O(s)$ activation floor at attention; it lifts only the LayerNorm/dropout/residual floor.[^1] Lifting the attention floor is what later motivated DeepSpeed-Ulysses and Ring/Context Parallelism.

DeepSpeed-Ulysses (Jacobs et al., 2023)

DeepSpeed-Ulysses, presented in DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models by Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He (arXiv 2309.14509, submitted 25 September 2023), takes a different design philosophy.[^2] Rather than refining hidden-dimension TP, Ulysses keeps the sequence partitioned across devices for most of the computation and only briefly reshuffles for attention.

Mechanism

Let $P$ be the number of sequence-parallel devices and $h$ be the number of attention heads. Outside attention, every device holds an $(N/P) \times d$ slice of the activation, where $N$ is the sequence length and $d$ is the hidden dimension. Linear projections, LayerNorm, MLP, and residuals all operate on this sequence-sharded representation.[^2]

Right before attention, Ulysses applies an all-to-all collective on the projected Q, K, V tensors. After the all-to-all, each device holds the full sequence (length $N$) but only a non-overlapping subset of attention heads ($h/P$ heads per device).[^2] This requires $P \mid h$, i.e. the number of heads is divisible by the SP degree. Each device then computes ordinary head-parallel attention with any backend, including FlashAttention v2, on its head subset over the full sequence.[^2] A second all-to-all redistributes the per-head outputs back into the sequence-sharded layout for the output projection, MLP, and downstream layers.[^2]
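
The layout change performed by the two all-to-alls can be traced with a small bookkeeping sketch. It moves no tensor data and is not the DeepSpeed implementation: each simulated device just holds a list of `(token, head)` index pairs, and the function names and the uppercase `H` for the head count are assumptions made for this example.

```python
def seq_sharded_layout(N, H, P):
    """Device p holds tokens p*N/P .. (p+1)*N/P - 1, for all H heads."""
    n = N // P
    return [[(tok, head) for tok in range(p * n, (p + 1) * n) for head in range(H)]
            for p in range(P)]

def all_to_all_seq_to_head(layout, H, P):
    """Before attention: every device sends head group q to device q."""
    assert H % P == 0, "head count must be divisible by the SP degree"
    heads_per_dev = H // P
    received = [[] for _ in range(P)]
    for device in layout:
        for tok, head in device:
            received[head // heads_per_dev].append((tok, head))
    return [sorted(r) for r in received]

def all_to_all_head_to_seq(layout, N, P):
    """After attention: every device sends token chunk q back to device q."""
    n = N // P
    received = [[] for _ in range(P)]
    for device in layout:
        for tok, head in device:
            received[tok // n].append((tok, head))
    return [sorted(r) for r in received]

N, H, P = 8, 4, 2                      # illustrative sizes
before = seq_sharded_layout(N, H, P)
during = all_to_all_seq_to_head(before, H, P)
after = all_to_all_head_to_seq(during, N, P)

for p, dev in enumerate(during):
    tokens = sorted({tok for tok, _ in dev})
    heads = sorted({head for _, head in dev})
    print(p, tokens, heads)
# 0 [0, 1, 2, 3, 4, 5, 6, 7] [0, 1]
# 1 [0, 1, 2, 3, 4, 5, 6, 7] [2, 3]
print(after == before)                 # True
```

During attention every device sees all `N` tokens but only `H/P` heads, and the second all-to-all restores the original sequence-sharded layout.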

Communication analysis

Because an all-to-all over $P$ devices moves $M/P$ bytes per link per step (where $M$ is the aggregate message size), Ulysses achieves a per-link attention communication volume of $4Nd/P$, i.e. $O(N/P)$ per device, with $d$ the hidden dimension defined above: the all-to-all before attention carries an aggregate of $3Nd$ for Q, K, and V, and the all-to-all after attention carries $Nd$ for the output.[^2] In contrast, the Megatron-LM SP approach scales as $O(N)$ per link because it relies on all-gather along the sequence dimension to assemble inputs for attention.[^2] As long as $N$ and $P$ scale proportionally (a common regime in long-context training), Ulysses keeps per-link communication volume constant, a property the authors highlight as a key scaling advantage.[^2]
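
A short calculation makes the constant-volume property concrete. The sketch writes the hidden-size factor as `d`, counts volume in tensor elements (multiply by bytes per element for bytes), and uses an assumed hidden size of 4096; the function name is hypothetical.

```python
def ulysses_per_link_volume(N, d, P):
    """Per-link volume of the two all-to-alls around attention, in elements."""
    aggregate = 3 * N * d + N * d   # Q, K, V going in, plus the attention output coming back
    return aggregate / P            # an all-to-all moves M/P per link

d = 4096  # assumed hidden size, for illustration only

base = ulysses_per_link_volume(N=32 * 1024, d=d, P=8)
scaled = ulysses_per_link_volume(N=256 * 1024, d=d, P=64)   # N and P both grow 8x
fixed_p = ulysses_per_link_volume(N=256 * 1024, d=d, P=8)   # only N grows 8x

print(base == scaled)    # True
print(fixed_p / base)    # 8.0
```

Growing the sequence without adding devices raises per-link traffic linearly, while growing both together leaves it unchanged.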

Results and limits

The Ulysses paper reports training with sequence lengths over a million tokens, supporting sequences "4x longer than existing systems" while improving training throughput by up to 2.5x.[^2] The DeepSpeed blog elaborates that the design is attention-implementation agnostic (it works with dense, sparse, and FlashAttention-2 kernels), composes with DeepSpeed ZeRO-3 sharding, and exposes a DistributedAttention wrapper that requires minimal model code changes.[^8]

The principal limitation is the head-divisibility constraint: SP degree cannot exceed the number of attention heads, which caps Ulysses parallelism. For models with grouped-query attention or few heads this can be restrictive, and it creates friction when composing Ulysses with hidden-dim TP, because both methods consume the same head count.[^9]

Ring Attention (Liu, Zaharia, Abbeel, 2023)

A complementary approach is Ring Attention with Blockwise Transformers for Near-Infinite Context by Hao Liu, Matei Zaharia, and Pieter Abbeel, posted to arXiv on 3 October 2023 (2310.01889).[^3] Ring Attention attacks attention's quadratic memory cost head-on by distributing it across the devices that hold the sequence shards.

Mechanism

Ring Attention organises $P$ devices into a logical ring. Each device $i$ holds the local sequence chunk $X_i$ of length $N/P$ and its corresponding queries $Q_i$, keys $K_i$, and values $V_i$. Attention is computed blockwise (in the style of FlashAttention) on a per-chunk basis using running softmax statistics.[^3] While each device computes attention between its $Q_i$ and the currently held $(K_j, V_j)$ block, it simultaneously sends $(K_j, V_j)$ to the next device in the ring and receives $(K_{j-1}, V_{j-1})$ from the previous one. After $P$ steps the ring has rotated all KV blocks past every $Q_i$ and the global attention output has been accumulated.[^3] The crucial system property is that the cost of communicating a KV block can be fully overlapped by the cost of computing one block of attention, so when the per-block compute exceeds the per-block transfer, communication is hidden.[^3]

Because no device ever materialises the full $N \times N$ attention matrix or the full KV cache, the per-device memory is proportional to $N/P$ regardless of total sequence length. Liu et al. demonstrate training sequences "device count times longer" than what blockwise-only baselines like Blockwise Parallel Transformers (BPT) could handle, scaling to millions of tokens.[^3] Ring Attention is exact, not an approximation: no tokens are dropped, no attention pattern is restricted.[^3]
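
The exactness claim can be checked numerically with a single-process sketch. This is not the authors' implementation: it handles one head with no mask, runs the simulated devices one after another (so it shows the arithmetic, not the compute/communication overlap), and all function names are hypothetical.

```python
import math
import random

def full_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V computed in one place."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out

def ring_attention(Q_chunks, K_chunks, V_chunks):
    """Each simulated device keeps its Q chunk; KV blocks rotate around the ring."""
    P = len(Q_chunks)
    d = len(Q_chunks[0][0])
    dv = len(V_chunks[0][0])
    # Running softmax statistics, one entry per local query row.
    run_max = [[-math.inf] * len(Q) for Q in Q_chunks]
    run_sum = [[0.0] * len(Q) for Q in Q_chunks]
    run_out = [[[0.0] * dv for _ in Q] for Q in Q_chunks]
    held = list(zip(K_chunks, V_chunks))   # held[i] = KV block currently on device i
    for _ in range(P):
        for i in range(P):
            K, V = held[i]
            for r, q in enumerate(Q_chunks[i]):
                scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(d) for k in K]
                new_max = max(run_max[i][r], max(scores))
                scale = math.exp(run_max[i][r] - new_max)   # rescale old partial sums
                w = [math.exp(s - new_max) for s in scores]
                run_sum[i][r] = run_sum[i][r] * scale + sum(w)
                run_out[i][r] = [o * scale + sum(wj * v[c] for wj, v in zip(w, V))
                                 for c, o in enumerate(run_out[i][r])]
                run_max[i][r] = new_max
        # Device i receives the block previously held by device i - 1.
        held = [held[(i - 1) % P] for i in range(P)]
    return [[[o / run_sum[i][r] for o in run_out[i][r]]
             for r in range(len(Q_chunks[i]))] for i in range(P)]

def split(X, P):
    n = len(X) // P
    return [X[i * n:(i + 1) * n] for i in range(P)]

rng = random.Random(0)
N, d, P = 8, 4, 4                          # illustrative sizes
Q = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
K = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]

ring_out = ring_attention(split(Q, P), split(K, P), split(V, P))
flat = [row for chunk in ring_out for row in chunk]
ref = full_attention(Q, K, V)
max_err = max(abs(x - y) for ra, rb in zip(flat, ref) for x, y in zip(ra, rb))
print(max_err < 1e-12)
```

Each simulated device only ever touches one `N/P`-sized KV block at a time, yet the accumulated result agrees with the single-device reference up to floating-point rounding.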

Relationship to FlashAttention

Ring Attention is sometimes described as the distributed-memory analogue of FlashAttention.[^9] FlashAttention shards attention across a single GPU's SRAM tiles, using the online-softmax trick to fold the full attention into a streaming computation; Ring Attention applies the same blockwise pattern across a multi-GPU memory hierarchy, with the inter-GPU ring exchange playing the role that HBM-to-SRAM streaming plays inside a single device.[^9] In practice the two compose: Ring Attention dispatches local blocks to FlashAttention kernels.[^4]

Load-balance issue with causal masks

For decoder-only language models the attention mask is lower-triangular. If the sequence is split into $P$ equal contiguous chunks and laid out 0 to $P-1$ along the ring, then rank 0 has the fewest keys to attend over (only its own past) while rank $P-1$ has the most, yielding poor load balance.[^9] Subsequent work proposed token reordering ("Striped Attention") and chunk interleaving to equalise per-rank work; the canonical Megatron-LM Context Parallelism implementation includes this optimisation by default.[^4][^9]
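
The imbalance, and the effect of interleaving, can be quantified by counting the (query, key) pairs each rank must score under a causal mask. The pairing used below (split the sequence into $2P$ chunks and give rank $r$ chunks $r$ and $2P-1-r$) is one possible interleaving chosen for illustration; it is an assumption of this sketch, not a description of any particular library, and the function names are hypothetical.

```python
def causal_work(token_ids):
    """Pairs scored for these queries: query t attends to keys 0..t."""
    return sum(t + 1 for t in token_ids)

def contiguous_assignment(N, P):
    """Rank r owns one contiguous block of N/P tokens."""
    n = N // P
    return [list(range(r * n, (r + 1) * n)) for r in range(P)]

def interleaved_assignment(N, P):
    """Rank r owns an early chunk r and the mirrored late chunk 2P-1-r."""
    c = N // (2 * P)
    chunks = [list(range(k * c, (k + 1) * c)) for k in range(2 * P)]
    return [chunks[r] + chunks[2 * P - 1 - r] for r in range(P)]

N, P = 16, 4   # illustrative sizes
print([causal_work(ids) for ids in contiguous_assignment(N, P)])   # [10, 26, 42, 58]
print([causal_work(ids) for ids in interleaved_assignment(N, P)])  # [34, 34, 34, 34]
```

Pairing a cheap early chunk with an expensive late chunk gives every rank the same total, which is the intuition behind the reordering schemes mentioned above.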

Context Parallelism in NVIDIA Megatron-LM

In modern NVIDIA training stacks (Megatron Core, NeMo Framework, Megatron-Bridge), the production sequence-sharding feature is called Context Parallelism (CP) and is distinct from the original Megatron SP.[^4][^7] CP combines ring attention with classical Korthikanti-style SP and is the recommended path for long-context training above roughly 32 K tokens.[^7]

Mechanism

CP partitions network inputs and all activations along the sequence dimension, not only LayerNorm/dropout.[^4] Each GPU stores only its sequence chunk of every layer's activations and KV cache. For attention, CP uses a ring-style exchange: each rank gathers KV chunks from peers as needed and pipelines the gather with the local attention computation.[^4] The Megatron Core documentation notes that "all-gather and reduce-scatter communications are transformed to point-to-point communications in ring topology under the hood," with a configurable cp_comm_type parameter that accepts p2p, all_gather, a2a, or a2a+p2p; the p2p mode is implemented as ring-exchange send/receive operations hard-coded to overlap with the attention compute of sequence chunks.[^4]

NVIDIA positions CP as an improvement over the original Ring Attention paper on two axes: it uses the current open-source and cuDNN FlashAttention kernels for the per-chunk compute, and it removes the computation wasted on positions excluded by the lower-triangular causal mask, along with the resulting load imbalance, by reordering chunks along the ring.[^4]

Composing with other parallelism axes

CP is orthogonal to TP, PP, DP, and Expert Parallelism: the total GPU count satisfies $\text{world size} = \text{TP} \times \text{CP} \times \text{PP} \times \text{DP}$.[^4] Korthikanti-style SP within the TP region is typically kept on whenever TP > 1, so a typical 128 K-token training job for a 70 B model might run with TP = 8 (+ SP inside TP), CP = 8, PP = 4, DP = 4 on 1 024 GPUs.[^10]

The Llama 3 herd of models paper from Meta documents using Context Parallelism extensively for long-context phases of pre-training: when extending Llama 3 to 128 K, CP = 16 lets each rank still see only 8 K tokens, matching the activation footprint of the short-context base training and re-using the existing 3D parallel topology.[^10] NVIDIA's developer blog further reports that on B200 hardware CP delivers more than 2x speedup at long sequences, and that CP becomes "mandatory" at sequence lengths approaching one million tokens.[^11]
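
The arithmetic in the two examples above can be reproduced with a few lines. `Layout` is a hypothetical helper written for this sketch, not a Megatron Core or NeMo class; in the last line PP and DP are set to 1 only because they do not affect the per-rank token count.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layout:
    """Hypothetical container for parallel degrees (illustration only)."""
    tp: int
    cp: int
    pp: int
    dp: int
    sequence_parallel: bool = False

    def world_size(self) -> int:
        # world size = TP x CP x PP x DP
        return self.tp * self.cp * self.pp * self.dp

    def tokens_per_cp_rank(self, seq_len: int) -> int:
        if seq_len % self.cp != 0:
            raise ValueError("sequence length must be divisible by the CP degree")
        return seq_len // self.cp

    def validate(self) -> None:
        # Megatron-style SP lives inside the TP region, so it needs TP > 1.
        if self.sequence_parallel and self.tp <= 1:
            raise ValueError("sequence_parallel requires tp > 1")

job = Layout(tp=8, cp=8, pp=4, dp=4, sequence_parallel=True)
job.validate()
print(job.world_size())                     # 1024
print(job.tokens_per_cp_rank(128 * 1024))   # 16384

long_context = Layout(tp=8, cp=16, pp=1, dp=1)
print(long_context.tokens_per_cp_rank(128 * 1024))   # 8192
```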

Comparison of variants

Variant	Year	Shards along	Attention strategy	Communication for attention	Composes with TP?	Bound by head count?
Colossal-AI SP (Li et al.)	2021[^6]	Sequence	Ring self-attention	$O(N)$ per ring step	yes	no
Megatron SP (Korthikanti et al.)	2022[^1]	Sequence (only at LN/dropout)	Standard TP attention	none extra	required (TP > 1)	no
DeepSpeed-Ulysses	2023[^2]	Sequence	Head-parallel after all-to-all	$O(N/P)$ per link	with friction (heads shared)	yes ($P \mid h$)
Ring Attention (Liu et al.)	2023[^3]	Sequence	Blockwise + ring KV rotation	$O(N/P)$ per ring step, overlapped	yes	no
Megatron Context Parallelism	2023 to present[^4]	Sequence (all activations)	Ring + FlashAttention, causal-aware	$O(N/P)$, overlapped	yes (orthogonal axis)	no

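In every variant, per-device activation memory at attention falls by roughly a factor of the parallel degree ($P$, or $t$ for Megatron SP), either because the sequence is sharded (the ring-based schemes) or because the heads are (Ulysses, and Megatron TP with SP); what differs is the achievable degree and the communication overhead.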

A 2024 paper by Fang and Zhao, USP: A Unified Sequence Parallelism Approach for Long-Context Generative AI (arXiv 2405.07719), proposes hybridising Ulysses and Ring Attention into a unified hierarchical scheme that can run Ulysses over a smaller dimension (e.g. within a node, where all-to-all is cheap) and Ring over a larger dimension (e.g. across nodes, where overlapped point-to-point is preferable).[^9] USP reports 47 % MFU and 208 K-token training on LLaMA3-8B over two 8x A800 nodes.[^9]

Interaction with other parallelism axes

Sequence parallelism is one axis of multi-dimensional parallelism. Its interactions with the others are as follows.[^7][^10]

Tensor parallelism (TP). Korthikanti SP requires TP > 1 by construction. CP is orthogonal to TP and is normally combined with it: TP shards along hidden, CP shards along sequence. Ulysses conflicts with TP because both want to subdivide attention heads.
Pipeline parallelism (PP). SP and CP are layer-local and compose freely with PP.
Data parallelism (DP). All SP variants compose with DP (including ZeRO-1/2/3 / FSDP) on the global-batch axis. Ulysses in particular was designed to combine with DeepSpeed ZeRO-3 for combined sequence + parameter sharding.[^2]
Expert parallelism (EP) / Mixture-of-Experts. MoE layers add an all-to-all over experts; CP (for long sequences) and EP (for MoE layers) are independent axes that can be used together, for example when training Mixtral-style models at long context.[^10]
Activation recomputation. SP reduces but does not eliminate per-device activation pressure. The Korthikanti paper explicitly pairs SP with selective activation recomputation; Megatron and DeepSpeed retain full or selective checkpointing as an optional knob even when CP/Ulysses is on.[^1]

Applications

The principal applications of sequence parallelism are:

Long-context pre-training. Llama 3 pre-training used CP to extend context from 8 K to 128 K tokens without per-rank memory blow-up.[^10]
Long-context continued training. Open-weights long-context recipes for Llama and Mistral-class models typically combine SP at LayerNorm with CP at attention to reach context lengths between 256 K and 1 M tokens.[^11]
Inference for long inputs. NVIDIA and Meta have both published million-token inference recipes that reuse the CP machinery on the KV cache.[^11]
Multimodal long sequences. Vision-language and video-language models with very long token sequences (e.g. dense video tokenisation) use CP to keep activation memory tractable.[^11]

Limitations and trade-offs

Sequence parallelism is not free.

Communication. Even though Megatron SP's total volume equals plain TP, the all-gather/reduce-scatter pair on the sequence axis is latency-sensitive on slower interconnects; SP is most cost-effective when TP runs within a high-bandwidth NVLink island.[^1]
Head-divisibility (Ulysses). Ulysses cannot use more SP ranks than attention heads, capping its parallelism. Grouped-query attention models with few KV heads exacerbate this.[^9]
Causal-mask load imbalance (Ring). Naive ring attention wastes about half its FLOPs on positions that the lower-triangular causal mask discards, and loads ranks unevenly, unless chunk interleaving is used; Megatron CP handles this explicitly but the engineering is non-trivial.[^4][^9]
Code intrusiveness. Both Ulysses and CP require attention kernels that expose the right hooks (KV exchange, head all-to-all). Naive PyTorch attention has to be replaced by a sequence-aware wrapper.[^4][^8]
Composability. Composing Ulysses with TP, or combining several of (TP, SP, CP, EP, PP, DP), requires care to avoid collective conflicts and over-partitioning.[^9]

Related Work

Sequence parallelism sits between three closely related lines of work. Activation-memory-reducing methods such as gradient checkpointing reduce memory at the cost of recomputation. Memory-efficient attention kernels such as FlashAttention reduce attention's intra-device memory without sharding the sequence across devices. And model-parallel methods such as tensor parallelism and pipeline parallelism shard parameters and layers but leave per-rank sequence length unchanged. Sequence parallelism complements all three, and modern long-context training pipelines combine them all.

References
