# Sequence Parallelism

> Source: https://aiwiki.ai/wiki/sequence_parallelism
> Updated: 2026-06-09
> Categories: AI Infrastructure, Deep Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Sequence parallelism (SP)** is a family of distributed training techniques for [transformer](/wiki/transformer)-based neural networks that partitions activations along the sequence (token) dimension across multiple accelerators, reducing per-device activation memory and unlocking longer context lengths than would otherwise fit on a single device.[^1] Sequence parallelism is typically composed with [tensor parallelism](/wiki/tensor_parallelism) (TP), [pipeline parallelism](/wiki/pipeline_parallelism) (PP), and [data parallelism](/wiki/data_parallelism) (DP) to form multi-dimensional parallel training topologies for large language models. Two principal modern variants exist: the LayerNorm/dropout sharding scheme introduced for [Megatron-LM](/wiki/megatron_lm) by Korthikanti et al. in 2022,[^1] and the all-to-all attention scheme of [DeepSpeed](/wiki/deepspeed)-Ulysses introduced by Jacobs et al. in 2023.[^2] A closely related approach, [Ring Attention](/wiki/ring_attention) by Liu, Zaharia, and Abbeel,[^3] distributes self-attention itself across devices using blockwise computation and ring communication, and forms the basis of NVIDIA's "Context Parallelism" feature in current Megatron-LM releases.[^4]

## Background

Training large transformer models is bottlenecked not only by parameter memory but by *activation* memory: the intermediate tensors stored between the forward and backward passes for gradient computation. For a single transformer layer with sequence length $s$, microbatch size $b$, hidden dimension $h$, and $a$ attention heads, the activation footprint is dominated by terms proportional to $sbh$ (for LayerNorm, residual, projection outputs) and $s^2ab$ (for the attention probability matrix), so memory grows linearly in $h$ and quadratically in $s$ in the worst case.[^1] As models scaled from GPT-3's 175 B parameters and 2 048-token context to multi-hundred-billion-parameter models trained at 8 K, 32 K, 128 K and beyond, activation memory became the binding constraint, often forcing practitioners to invoke full activation recomputation (gradient checkpointing) and pay an extra forward pass per step.[^1]

Classical tensor parallelism, introduced in the 2019 Megatron-LM paper, splits the weight matrices of attention and MLP blocks across $t$ tensor-parallel ranks along the hidden dimension.[^5] This reduces parameter and activation memory inside the attention/MLP regions by a factor of $t$, but it leaves LayerNorm, dropout, and the residual stream replicated across all $t$ ranks: those layers operate on $sbh$-shaped activations that are not sharded by hidden-dim TP.[^1] Sequence parallelism arose to close that gap.

The phrase "sequence parallelism" was first used by Li, Xue, Baranwal, Li, and You in a 2021 paper from the Colossal-AI group at NUS, which split sequences across GPUs and used a ring-style self-attention exchange to compute attention without ever materialising the full sequence on any one device.[^6] Their experiments scaled to over 114 K tokens on 64 P100 GPUs, demonstrating that splitting on the sequence axis was a viable alternative to splitting on the hidden axis.[^6] The term was subsequently reused, with different mechanics, by the NVIDIA Megatron group in 2022.

## Megatron-LM Sequence Parallelism (Korthikanti et al., 2022)

The Megatron-LM variant of sequence parallelism is defined in *Reducing Activation Recomputation in Large Transformer Models* by Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and [Bryan Catanzaro](/wiki/bryan_catanzaro), posted to arXiv on 10 May 2022.[^1] It is a refinement that sits inside tensor parallelism rather than replacing it.

### Mechanism

A standard tensor-parallel transformer block has two TP regions: the attention block (column-parallel QKV projection, row-parallel output projection) and the MLP block (column-parallel first linear, row-parallel second linear). These regions are entered with an `f` operator (identity in forward, all-reduce in backward) and exited with a `g` operator (all-reduce in forward, identity in backward) in the original Megatron-LM design.[^5] Between TP regions sit LayerNorm and dropout, which operate token-wise and were therefore left *unsharded*, meaning every TP rank stored a redundant copy of the full $sbh$ activation.[^1]

Sequence parallelism modifies this by sharding the inputs and outputs of LayerNorm and dropout along the sequence dimension. The `g` operator (TP region exit) becomes a reduce-scatter along the sequence axis instead of an all-reduce, and the `f` operator (TP region entry) becomes an all-gather along the sequence axis instead of a no-op.[^1] Because an all-reduce equals a reduce-scatter followed by an all-gather, the aggregate communication volume per step is unchanged versus baseline TP, but the activation tensor entering and leaving each LayerNorm/dropout has shape $(s/t, b, h)$ instead of $(s, b, h)$, so every TP rank stores only $1/t$ of those activations.[^1]
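The equivalence that makes this rewrite communication-neutral can be checked with a small single-process simulation. The sketch below is illustrative only: the three helper functions are hypothetical stand-ins that mimic the semantics of the collectives on plain Python lists, not calls into any communication library.

```python
from typing import List

Vector = List[float]


def all_reduce(per_rank: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank: List[Vector]) -> List[Vector]:
    """Rank r ends up with only the r-th contiguous shard of the sum."""
    t = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    shard = len(total) // t  # assumes the length divides evenly by t
    return [total[r * shard:(r + 1) * shard] for r in range(t)]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


# Four TP ranks, each holding a partial sum over 8 sequence positions.
partials = [[float(r * 10 + i) for i in range(8)] for r in range(4)]

sharded = reduce_scatter(partials)  # what LayerNorm/dropout operate on under SP
restored = all_gather(sharded)      # what the next TP region receives

assert restored == all_reduce(partials)
assert all(len(s) == 2 for s in sharded)  # each rank keeps s/t = 8/4 positions
```

Between the two collectives each rank holds only its $s/t$ slice, which is where the memory saving comes from.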

### Memory savings

Korthikanti et al. derive activation-memory expressions per transformer layer. For pure TP with parallel size $t$, the activations per layer (in bytes) scale as $sbh \cdot (10 + 24/t) + 5 \cdot abs^2/t$, with the leading $10sbh$ term coming from LayerNorm, dropout, and residual paths that are not sharded by TP.[^1] With sequence parallelism applied, the unsharded $10sbh$ term is also divided by $t$, yielding $sbh \cdot 34/t + 5 \cdot abs^2/t$, an essentially uniform $1/t$ sharding of activation memory inside the TP region.[^1]
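To get a feel for the two expressions, they can be evaluated directly. The following is a minimal sketch that transcribes the formulas quoted above; the function names and the model shape used in the example (a GPT-3-like layer with $s = 2048$, $b = 1$, $h = 12288$, $a = 96$, $t = 8$) are illustrative assumptions, not figures taken from the paper's tables.

```python
def activation_bytes_tp(s: int, b: int, h: int, a: int, t: int) -> float:
    """Per-layer activation bytes with tensor parallelism only."""
    return s * b * h * (10 + 24 / t) + 5 * a * b * s * s / t


def activation_bytes_tp_sp(s: int, b: int, h: int, a: int, t: int) -> float:
    """Per-layer activation bytes with tensor plus sequence parallelism."""
    return s * b * h * 34 / t + 5 * a * b * s * s / t


# Assumed example shape (GPT-3-like layer), chosen only for illustration.
s, b, h, a, t = 2048, 1, 12288, 96, 8

tp_only = activation_bytes_tp(s, b, h, a, t)
tp_sp = activation_bytes_tp_sp(s, b, h, a, t)

print(f"TP only : {tp_only / 2**20:.1f} MiB per layer")
print(f"TP + SP : {tp_sp / 2**20:.1f} MiB per layer")

# The saving is exactly the formerly replicated 10*sbh term, now divided by t.
assert tp_only - tp_sp == 10 * s * b * h * (1 - 1 / t)
```

With $t = 1$ the two functions agree, which matches the observation that Megatron-style SP only has an effect inside a TP group.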

Combined with their second contribution, *selective activation recomputation* (recomputing only the cheap-to-recompute attention softmax/dropout activations, while storing the rest), the Megatron paper reports a 5x reduction in activation memory and an over-90 % reduction in the time overhead from activation recomputation.[^1] Training a 530 B-parameter GPT-3-style model on 2 240 [NVIDIA A100](/wiki/nvidia_a100) GPUs reached 54.2 % Model FLOPs Utilization (MFU), versus 42.1 % with full recomputation, a 29 % speedup.[^1]

### Properties

Megatron-style SP has three important properties:

1. **It requires TP > 1.** SP is defined relative to a TP region; with $t = 1$ there is nothing to all-gather. NeMo and Megatron Core gate the `sequence_parallel=True` flag on `tensor_model_parallel_size > 1`.[^7]
2. **Total communication is unchanged.** An all-reduce decomposes exactly into a reduce-scatter followed by an all-gather, so SP reduces activation memory without increasing the communication volume relative to plain TP.[^1]
3. **It does not extend to attention computation itself.** Attention is computed inside the TP region with the full sequence present (in a head-parallel fashion across TP ranks), so SP alone does not lift the per-device $O(s)$ activation floor at attention; it lifts only the LayerNorm/dropout/residual floor.[^1] Lifting the attention floor is what later motivated DeepSpeed-Ulysses and Ring/Context Parallelism.

## DeepSpeed-Ulysses (Jacobs et al., 2023)

DeepSpeed-Ulysses, presented in *DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models* by Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He (arXiv 2309.14509, submitted 25 September 2023), takes a different design philosophy.[^2] Rather than refining hidden-dimension TP, Ulysses keeps the sequence partitioned across devices for *most* of the computation and only briefly reshuffles for attention.

### Mechanism

Let $P$ be the number of sequence-parallel devices and $h$ be the number of attention heads. Outside attention, every device holds an $(N/P) \times d$ slice of the activation, where $N$ is the sequence length and $d$ is the hidden dimension. Linear projections, LayerNorm, MLP, and residuals all operate on this sequence-sharded representation.[^2]

Right before attention, Ulysses applies an all-to-all collective on the projected Q, K, V tensors. After the all-to-all, each device holds the *full* sequence (length $N$) but only a non-overlapping subset of attention heads ($h/P$ heads per device).[^2] This requires $P \mid h$, i.e. the number of heads is divisible by the SP degree. Each device then computes ordinary head-parallel attention with any backend, including [FlashAttention](/wiki/flashattention) v2, on its head subset over the full sequence.[^2] A second all-to-all redistributes the per-head outputs back into the sequence-sharded layout for the output projection, MLP, and downstream layers.[^2]
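The layout change performed by the two all-to-alls can be illustrated with a single-process simulation. This is a sketch under stated assumptions: the two functions are hypothetical helpers that only rearrange labelled entries between per-rank lists, and they are not the DeepSpeed API. Each entry is a string naming its token and head so the resulting layout can be inspected.

```python
from typing import List

# shard[token][head] holds a label such as "t3h1".
Shard = List[List[str]]


def seq_to_head_all_to_all(seq_shards: List[Shard]) -> List[Shard]:
    """[N/P tokens][h heads] per rank -> [N tokens][h/P heads] per rank."""
    p = len(seq_shards)
    heads = len(seq_shards[0][0])
    if heads % p != 0:
        raise ValueError("head count must be divisible by the SP degree")
    hp = heads // p
    out = []
    for dst in range(p):
        rows = []
        for src in range(p):  # visiting sources in rank order keeps token order
            for token_row in seq_shards[src]:
                rows.append(token_row[dst * hp:(dst + 1) * hp])
        out.append(rows)
    return out


def head_to_seq_all_to_all(head_shards: List[Shard]) -> List[Shard]:
    """Inverse: [N tokens][h/P heads] per rank -> [N/P tokens][h heads] per rank."""
    p = len(head_shards)
    chunk = len(head_shards[0]) // p
    out = []
    for dst in range(p):
        rows = []
        for tok in range(dst * chunk, (dst + 1) * chunk):
            row = []
            for src in range(p):
                row.extend(head_shards[src][tok])
            rows.append(row)
        out.append(rows)
    return out


N, H, P = 8, 4, 2
seq_shards = [
    [[f"t{tok}h{head}" for head in range(H)]
     for tok in range(r * N // P, (r + 1) * N // P)]
    for r in range(P)
]

head_shards = seq_to_head_all_to_all(seq_shards)
assert len(head_shards[0]) == N              # every rank now sees the full sequence
assert head_shards[1][0] == ["t0h2", "t0h3"]  # but only its own h/P heads
assert head_to_seq_all_to_all(head_shards) == seq_shards
```

Between the two calls, each rank could run any single-device attention kernel on its head subset, which is why the scheme is agnostic to the attention implementation.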

### Communication analysis

An all-to-all over $P$ devices with aggregate message size $M$ moves $M/P$ per link. Ulysses therefore has a per-link attention communication volume of $4Nd/P$ per layer (three all-to-alls for Q, K, and V with aggregate size $3Nd$, plus one for the attention output with aggregate size $Nd$), i.e. $O(N/P)$ per device.[^2] In contrast, the Megatron-LM SP approach scales as $O(N)$ per link, independent of $P$, because it relies on all-gather along the sequence dimension to assemble inputs for attention.[^2] As long as $N$ and $P$ scale proportionally (a common regime in long-context training), Ulysses keeps per-link communication volume *constant*, a property the authors highlight as a key scaling advantage.[^2]
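The constant-volume property is easy to verify numerically. The helper below is a hypothetical one-liner that evaluates the per-link expression from the paragraph above, with the hidden dimension written as `d` as in the mechanism section; the sequence lengths and hidden size are arbitrary example values.

```python
def ulysses_volume_per_link(n: int, d: int, p: int) -> float:
    """Elements per link per layer: four all-to-alls of total size 4*N*d over P devices."""
    return 4 * n * d / p


d = 4096  # assumed hidden dimension, for illustration only

# Doubling the sequence length and the device count together leaves
# the per-link volume unchanged.
base = ulysses_volume_per_link(n=32_768, d=d, p=8)
scaled = ulysses_volume_per_link(n=65_536, d=d, p=16)
assert base == scaled

# Growing the sequence at a fixed device count grows the volume linearly.
assert ulysses_volume_per_link(n=65_536, d=d, p=8) == 2 * base
```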

### Results and limits

The Ulysses paper reports training with sequence lengths over a million tokens, supporting sequences "4x longer than existing systems" while improving training throughput by up to 2.5x.[^2] The DeepSpeed blog elaborates that the design is attention-implementation agnostic (it works with dense, sparse, and FlashAttention-2 kernels), composes with DeepSpeed ZeRO-3 sharding, and exposes a `DistributedAttention` wrapper that requires minimal model code changes.[^8]

The principal limitation is the head-divisibility constraint: SP degree cannot exceed the number of attention heads, which caps Ulysses parallelism. For models with grouped-query attention or few heads this can be restrictive, and it creates friction when composing Ulysses with hidden-dim TP, because both methods consume the same head count.[^9]

## Ring Attention (Liu, Zaharia, Abbeel, 2023)

A complementary approach is **Ring Attention with Blockwise Transformers for Near-Infinite Context** by Hao Liu, Matei Zaharia, and [Pieter Abbeel](/wiki/pieter_abbeel), posted to arXiv on 3 October 2023 (2310.01889).[^3] Ring Attention attacks attention's quadratic memory cost head-on by distributing it across the devices that hold the sequence shards.

### Mechanism

Ring Attention organises $P$ devices into a logical ring. Each device $i$ holds the local sequence chunk $X_i$ of length $N/P$ and its corresponding queries $Q_i$, keys $K_i$, and values $V_i$. Attention is computed blockwise (in the style of FlashAttention) on a per-chunk basis using running softmax statistics.[^3] While each device computes attention between its $Q_i$ and the currently held $(K_j, V_j)$ block, it simultaneously sends $(K_j, V_j)$ to the next device in the ring and receives $(K_{j-1}, V_{j-1})$ from the previous one. After $P$ steps the ring has rotated all KV blocks past every $Q_i$ and the global attention output has been accumulated.[^3] The crucial system property is that the cost of communicating a KV block can be fully overlapped by the cost of computing one block of attention, so when the per-block compute exceeds the per-block transfer, communication is hidden.[^3]

Because no device ever materialises the full $N \times N$ attention matrix or holds the keys and values of the whole sequence at once, per-device memory scales with the local block length $N/P$ rather than with the total sequence length $N$. Liu et al. demonstrate training sequences "device count times longer" than what blockwise-only baselines like Blockwise Parallel Transformers (BPT) could handle, scaling to millions of tokens.[^3] Ring Attention is exact, not an approximation: no tokens are dropped, no attention pattern is restricted.[^3]
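The rotation and the running softmax statistics can be sketched in a few dozen lines of plain Python. This is a single-process simulation under simplifying assumptions: it handles one head, uses no mask, and executes the ranks one after another, so it shows the exactness of the result but not the overlap of communication with compute. All function names are illustrative and do not come from the paper's code.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def full_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Reference softmax(Q K^T / sqrt(d)) V computed on one device."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [dot(qi, kj) * scale for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([dot(weights, [vj[c] for vj in v]) / norm
                    for c in range(len(v[0]))])
    return out


def ring_attention(q_shards: List[Matrix], k_shards: List[Matrix],
                   v_shards: List[Matrix]) -> List[Matrix]:
    """Each rank keeps its Q block; KV blocks rotate around the ring."""
    p = len(q_shards)
    dv = len(v_shards[0][0])
    scale = 1.0 / math.sqrt(len(q_shards[0][0]))
    # Online-softmax state per query row: running max, normaliser, weighted sum.
    run_max = [[-math.inf] * len(qs) for qs in q_shards]
    run_sum = [[0.0] * len(qs) for qs in q_shards]
    acc = [[[0.0] * dv for _ in qs] for qs in q_shards]
    held = list(range(p))  # held[i] = index of the KV block currently on rank i
    for _ in range(p):
        for rank in range(p):
            kb, vb = k_shards[held[rank]], v_shards[held[rank]]
            for i, qi in enumerate(q_shards[rank]):
                scores = [dot(qi, kj) * scale for kj in kb]
                new_max = max(run_max[rank][i], max(scores))
                corr = math.exp(run_max[rank][i] - new_max)  # rescale old state
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[rank][i] = run_sum[rank][i] * corr + sum(weights)
                acc[rank][i] = [
                    a * corr + dot(weights, [vj[c] for vj in vb])
                    for c, a in enumerate(acc[rank][i])
                ]
                run_max[rank][i] = new_max
        # Rank i sends its block to rank i+1 and receives the block of rank i-1.
        held = [held[(rank - 1) % p] for rank in range(p)]
    return [
        [[a / run_sum[r][i] for a in acc[r][i]] for i in range(len(q_shards[r]))]
        for r in range(p)
    ]


def random_matrix(rows: int, cols: int) -> Matrix:
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def split(m: Matrix, p: int) -> List[Matrix]:
    size = len(m) // p
    return [m[r * size:(r + 1) * size] for r in range(p)]


random.seed(0)
N, D, P = 8, 4, 4
q, k, v = random_matrix(N, D), random_matrix(N, D), random_matrix(N, D)

ring_out = [row for block in ring_attention(split(q, P), split(k, P), split(v, P))
            for row in block]
reference = full_attention(q, k, v)
assert all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)
           for a, b in zip(ring_out, reference) for x, y in zip(a, b))
```

At no point does a rank hold more than one KV block of length $N/P$, yet the final output matches the single-device reference to floating-point tolerance.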

### Relationship to FlashAttention

Ring Attention is sometimes described as the distributed-memory analogue of FlashAttention.[^9] FlashAttention shards attention across a single GPU's SRAM tiles, using the online-softmax trick to fold the full attention into a streaming computation; Ring Attention applies the same blockwise pattern across a multi-GPU memory hierarchy, with the inter-GPU ring exchange playing the role that HBM-to-SRAM streaming plays inside a single device.[^9] In practice the two compose: Ring Attention dispatches local blocks to FlashAttention kernels.[^4]

### Load-balance issue with causal masks

For [decoder-only language models](/wiki/generative_pre-trained_transformer) the attention mask is lower-triangular. If the sequence is split into $P$ equal contiguous chunks and laid out 0 to $P-1$ along the ring, then rank 0 has the fewest tokens to attend over (only its own past) while rank $P-1$ has the most, yielding poor load balance.[^9] Subsequent work proposed token reordering (for example "Striped Attention") and chunk interleaving to equalise per-rank work; the canonical Megatron-LM Context Parallelism implementation includes this optimisation by default.[^4][^9]
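The imbalance, and the effect of interleaving, can be quantified by counting unmasked query-key pairs per rank. The sketch below compares a contiguous split with one possible interleaving, assumed here for illustration: the sequence is cut into $2P$ chunks and rank $r$ receives chunk $r$ together with chunk $2P-1-r$. The function names are hypothetical, and the count covers total work per rank rather than the work at each individual ring step.

```python
from typing import List


def causal_work(positions: List[int]) -> int:
    """Unmasked (query, key) pairs for these query positions under a causal mask."""
    return sum(pos + 1 for pos in positions)


def contiguous_split(n: int, p: int) -> List[List[int]]:
    """Rank r gets the r-th contiguous chunk of the sequence."""
    c = n // p
    return [list(range(r * c, (r + 1) * c)) for r in range(p)]


def zigzag_split(n: int, p: int) -> List[List[int]]:
    """Cut into 2P chunks; rank r takes chunk r and chunk 2P-1-r."""
    c = n // (2 * p)
    chunks = [list(range(i * c, (i + 1) * c)) for i in range(2 * p)]
    return [chunks[r] + chunks[2 * p - 1 - r] for r in range(p)]


N, P = 16, 4
print([causal_work(pos) for pos in contiguous_split(N, P)])  # [10, 26, 42, 58]
print([causal_work(pos) for pos in zigzag_split(N, P)])      # [34, 34, 34, 34]
```

Pairing an early chunk with a late one gives every rank the same total, whereas the contiguous layout leaves the last rank with nearly six times the work of the first in this small example.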

## Context Parallelism in NVIDIA Megatron-LM

In modern NVIDIA training stacks (Megatron Core, NeMo Framework, Megatron-Bridge), the production sequence-sharding feature is called **Context Parallelism (CP)** and is distinct from the original Megatron SP.[^4][^7] CP combines ring attention with classical Korthikanti-style SP and is the recommended path for long-context training above roughly 32 K tokens.[^7]

### Mechanism

CP partitions network *inputs and all activations* along the sequence dimension, not only LayerNorm/dropout.[^4] Each GPU stores only its sequence chunk of every layer's activations and KV cache. For attention, CP uses a ring-style exchange: each rank gathers KV chunks from peers as needed and pipelines the gather with the local attention computation.[^4] The Megatron Core documentation notes that "all-gather and reduce-scatter communications are transformed to point-to-point communications in ring topology under the hood," with a configurable `cp_comm_type` parameter that accepts `p2p`, `all_gather`, `a2a`, or `a2a+p2p`; the `p2p` mode is implemented as ring-exchange send/receive operations hard-coded to overlap with the attention compute of sequence chunks.[^4]

NVIDIA positions CP as an improvement over the original Ring Attention paper on two axes: it leverages the current OSS and cuDNN FlashAttention kernels for the per-chunk compute, and it removes the unnecessary computation on masked-out positions that a lower-triangular causal mask creates, together with the resulting load imbalance, by reordering chunks along the ring.[^4]

### Composing with other parallelism axes

CP is orthogonal to TP, PP, DP, and Expert Parallelism: the total GPU count satisfies $\text{world size} = \text{TP} \times \text{CP} \times \text{PP} \times \text{DP}$.[^4] Korthikanti-style SP within the TP region is typically kept on whenever TP > 1, so a typical 128 K-token training job for a 70 B model might run with TP = 8 (+ SP inside TP), CP = 8, PP = 4, DP = 4 on 1 024 GPUs.[^10]

The Llama 3 herd of models paper from Meta documents using Context Parallelism extensively for long-context phases of pre-training: when extending Llama 3 to 128 K, CP = 16 lets each rank still see only 8 K tokens, matching the activation footprint of the short-context base training and re-using the existing 3D parallel topology.[^10] NVIDIA's developer blog further reports that on [B200](/wiki/nvidia_b200) hardware CP delivers more than 2x speedup at long sequences, and that CP becomes "mandatory" at sequence lengths approaching one million tokens.[^11]
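The arithmetic behind these layouts is simple enough to encode as a sanity check. The helpers below are hypothetical and only restate the relations given in this section; they are not part of Megatron Core or NeMo.

```python
def world_size(tp: int, cp: int, pp: int, dp: int) -> int:
    """Total GPU count for a TP x CP x PP x DP layout."""
    return tp * cp * pp * dp


def tokens_per_cp_rank(seq_len: int, cp: int) -> int:
    """Tokens each context-parallel rank holds for one sequence."""
    if seq_len % cp != 0:
        raise ValueError("sequence length must divide evenly across CP ranks")
    return seq_len // cp


# The example layout from the text: TP = 8, CP = 8, PP = 4, DP = 4.
assert world_size(tp=8, cp=8, pp=4, dp=4) == 1024

# A 128 K-token sequence split over CP = 16 leaves 8 K tokens per rank.
assert tokens_per_cp_rank(128 * 1024, cp=16) == 8 * 1024
```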

## Comparison of variants

| Variant | Year | Shards along | Attention strategy | Communication for attention | Composes with TP? | Bound by head count? |
| --- | --- | --- | --- | --- | --- | --- |
| Colossal-AI SP (Li et al.) | 2021[^6] | Sequence | Ring self-attention | $O(N)$ per ring step | yes | no |
| Megatron SP (Korthikanti et al.) | 2022[^1] | Sequence (only at LN/dropout) | Standard TP attention | none extra | required (TP > 1) | no |
| DeepSpeed-Ulysses | 2023[^2] | Sequence | Head-parallel after all-to-all | $O(N/P)$ per link | with friction (heads shared) | yes ($P \mid h$) |
| Ring Attention (Liu et al.) | 2023[^3] | Sequence | Blockwise + ring KV rotation | $O(N/P)$ per ring step, overlapped | yes | no |
| Megatron Context Parallelism | 2023 to present[^4] | Sequence (all activations) | Ring + FlashAttention, causal-aware | $O(N/P)$, overlapped | yes (orthogonal axis) | no |

In every variant the activation memory at attention scales as $O(N/P)$ when SP is engaged across $P$ ranks, but the achievable $P$ and the communication overhead differ.

A 2024 paper by Fang and Zhao, *USP: A Unified Sequence Parallelism Approach for Long-Context Generative AI* (arXiv 2405.07719), proposes hybridising Ulysses and Ring Attention into a unified hierarchical scheme that can run Ulysses over a smaller dimension (e.g. within a node, where all-to-all is cheap) and Ring over a larger dimension (e.g. across nodes, where overlapped point-to-point is preferable).[^9] USP reports 47 % MFU and 208 K-token training on LLaMA3-8B over two 8x A800 nodes.[^9]
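One way to picture the hybrid is as a two-dimensional grid of ranks. The sketch below builds the two kinds of process groups for such a grid. It is an illustration under stated assumptions, not the USP implementation: `usp_groups` is a hypothetical helper, and it assumes consecutive ranks share a node so that the Ulysses groups stay node-local.

```python
from typing import List, Tuple


def usp_groups(world: int, ulysses_degree: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (ulysses_groups, ring_groups) for a ring_degree x ulysses_degree grid."""
    if world % ulysses_degree != 0:
        raise ValueError("world size must be divisible by the Ulysses degree")
    ring_degree = world // ulysses_degree
    # Consecutive ranks (assumed to be on one node) exchange heads via all-to-all.
    ulysses_groups = [
        list(range(g * ulysses_degree, (g + 1) * ulysses_degree))
        for g in range(ring_degree)
    ]
    # Ranks with the same local index on different nodes form a KV ring.
    ring_groups = [list(range(i, world, ulysses_degree)) for i in range(ulysses_degree)]
    return ulysses_groups, ring_groups


# Two nodes of eight GPUs: Ulysses inside each node, Ring across the two nodes.
ulysses_groups, ring_groups = usp_groups(world=16, ulysses_degree=8)
assert ulysses_groups == [list(range(0, 8)), list(range(8, 16))]
assert ring_groups[0] == [0, 8] and ring_groups[7] == [7, 15]
```

The total sequence-parallel degree is the product of the two group sizes, so the head-divisibility constraint applies only to the Ulysses degree.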

## Interaction with other parallelism axes

Sequence parallelism is one axis of multi-dimensional parallelism. Its interactions with the others are as follows.[^7][^10]

* **Tensor parallelism (TP).** Korthikanti SP requires TP > 1 by construction. CP is orthogonal to TP and is normally combined with it: TP shards along hidden, CP shards along sequence. Ulysses *conflicts* with TP because both want to subdivide attention heads.
* **Pipeline parallelism (PP).** SP and CP are layer-local and compose freely with PP.
* **Data parallelism (DP).** All SP variants compose with DP (including ZeRO-1/2/3 / [FSDP](/wiki/fsdp)) on the global-batch axis. Ulysses in particular was designed to combine with DeepSpeed ZeRO-3 for combined sequence + parameter sharding.[^2]
* **Expert parallelism (EP) / Mixture-of-Experts.** [MoE](/wiki/mixture_of_experts) layers add an additional all-to-all over experts. CP (for long sequences) and EP (for MoE layers) are independent axes that can be used together, for example when training MoE models such as [Mixtral](/wiki/mixtral) at long context.[^10]
* **Activation recomputation.** SP reduces but does not eliminate per-device activation pressure. The Korthikanti paper explicitly pairs SP with selective activation recomputation; Megatron and DeepSpeed retain full or selective checkpointing as an optional knob even when CP/Ulysses is on.[^1]

## Applications

The principal applications of sequence parallelism are:

* **Long-context pre-training.** Llama 3 pre-training used CP to extend context from 8 K to 128 K tokens without per-rank memory blow-up.[^10]
* **Long-context continued training.** Open-weights long-context recipes for Llama and Mistral-class models typically combine SP at LayerNorm with CP at attention to scale to 256 K to 1 M tokens.[^11]
* **Inference for long inputs.** NVIDIA and Meta have both published million-token *inference* recipes that reuse the CP machinery on the KV cache.[^11]
* **Multimodal long sequences.** Vision-language and video-language models with very long token sequences (e.g. dense video tokenisation) use CP to keep activation memory tractable.[^11]

## Limitations and trade-offs

Sequence parallelism is not free.

* **Communication.** Even though Megatron SP's total volume equals plain TP, the all-gather/reduce-scatter pair on the sequence axis is latency-sensitive on slower interconnects; SP is most cost-effective when TP runs within a high-bandwidth NVLink island.[^1]
* **Head-divisibility (Ulysses).** Ulysses cannot use more SP ranks than attention heads, capping its parallelism. Grouped-query attention models with few KV heads exacerbate this.[^9]
* **Causal-mask load imbalance (Ring).** With a lower-triangular causal mask, naive ring attention spends about half of its attention FLOPs on masked-out positions and leaves ranks unevenly loaded unless chunk interleaving is used; Megatron CP handles this explicitly but the engineering is non-trivial.[^4][^9]
* **Code intrusiveness.** Both Ulysses and CP require attention kernels that expose the right hooks (KV exchange, head all-to-all). Naive PyTorch attention has to be replaced by a sequence-aware wrapper.[^4][^8]
* **Composability.** Composing Ulysses with TP, or composing more than one of (TP, SP, CP, EP, PP, DP) requires care to avoid collective conflicts and over-partitioning.[^9]

## Related Work

Sequence parallelism sits between three closely related lines of work. Activation-memory-reducing methods such as gradient checkpointing reduce memory at the cost of recomputation. Memory-efficient attention kernels such as FlashAttention reduce attention's intra-device memory without sharding the sequence across devices. And model-parallel methods such as tensor parallelism and pipeline parallelism shard parameters and layers but leave per-rank sequence length unchanged. Sequence parallelism complements all three, and modern long-context training pipelines combine them all.

## See also

* [Attention Is All You Need (Transformer)](/wiki/attention_is_all_you_need_transformer)
* [Llama 3.3](/wiki/llama_3_3)
* [Mistral AI](/wiki/mistral)
* [Graphics processing unit](/wiki/gpu)
* [GPT-3](/wiki/gpt-3)

## References

[^1]: Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, "Reducing Activation Recomputation in Large Transformer Models", arXiv:2205.05198, 2022-05-10. https://arxiv.org/abs/2205.05198. Accessed 2026-05-21.
[^2]: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He, "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models", arXiv:2309.14509, 2023-09-25. https://arxiv.org/abs/2309.14509. Accessed 2026-05-21.
[^3]: Hao Liu, Matei Zaharia, Pieter Abbeel, "Ring Attention with Blockwise Transformers for Near-Infinite Context", arXiv:2310.01889, 2023-10-03. https://arxiv.org/abs/2310.01889. Accessed 2026-05-21.
[^4]: NVIDIA, "context_parallel package, Megatron-LM developer guide", NVIDIA Corporation, 2024-10-01. https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/api-guide/context_parallel.html. Accessed 2026-05-21.
[^5]: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism", arXiv:1909.08053, 2019-09-17. https://arxiv.org/abs/1909.08053. Accessed 2026-05-21.
[^6]: Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You, "Sequence Parallelism: Long Sequence Training from System Perspective", arXiv:2105.13120, 2021-05-26. https://arxiv.org/abs/2105.13120. Accessed 2026-05-21.
[^7]: NVIDIA, "Parallelisms, NeMo Framework User Guide", NVIDIA Corporation, 2025-02-01. https://docs.nvidia.com/nemo-framework/user-guide/25.02/nemotoolkit/features/parallelisms.html. Accessed 2026-05-21.
[^8]: DeepSpeed Team, "DeepSpeed-Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (blog)", Microsoft / DeepSpeed Project, 2023-09-25. https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ulysses. Accessed 2026-05-21.
[^9]: Jiarui Fang, Shangchun Zhao, "USP: A Unified Sequence Parallelism Approach for Long Context Generative AI", arXiv:2405.07719, 2024-05-13. https://arxiv.org/abs/2405.07719. Accessed 2026-05-21.
[^10]: Llama Team, "The Llama 3 Herd of Models", arXiv:2407.21783, 2024-07-31. https://arxiv.org/abs/2407.21783. Accessed 2026-05-21.
[^11]: NVIDIA, "Scaling to Millions of Tokens with Efficient Long-Context LLM Training", NVIDIA Developer Blog, 2024-11-15. https://developer.nvidia.com/blog/scaling-to-millions-of-tokens-with-efficient-long-context-llm-training/. Accessed 2026-05-21.

