# Sequence Parallelism

> Source: https://aiwiki.ai/wiki/sequence_parallelism
> Updated: 2026-07-07
> Categories: AI Infrastructure, Deep Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Sequence parallelism (SP)** is a family of distributed training techniques for [transformer](/wiki/transformer)-based neural networks that partitions activations along the sequence (token) dimension across multiple accelerators, reducing per-device activation memory and unlocking longer context lengths than would otherwise fit on a single device.[^1] By splitting a sequence of length N across P devices, sequence parallelism drives per-device activation and attention memory toward O(N/P), which is what lets modern systems train transformers on context windows of over a million tokens.[^3][^8] Sequence parallelism is typically composed with [tensor parallelism](/wiki/tensor_parallelism) (TP), [pipeline parallelism](/wiki/pipeline_parallelism) (PP), and [data parallelism](/wiki/data_parallelism) (DP) to form multi-dimensional parallel training topologies for [large language models](/wiki/large_language_model). Two principal modern variants exist: the LayerNorm/dropout sharding scheme introduced for [Megatron-LM](/wiki/megatron_lm) by Korthikanti et al. in 2022,[^1] and the all-to-all attention scheme of [DeepSpeed](/wiki/deepspeed)-Ulysses introduced by Jacobs et al. in 2023.[^2] A closely related approach, [Ring Attention](/wiki/ring_attention) by Liu, Zaharia, and Abbeel,[^3] distributes self-attention itself across devices using blockwise computation and ring communication, and forms the basis of NVIDIA's "Context Parallelism" feature in current Megatron-LM releases.[^4]

## Background

Training large transformer models is bottlenecked not only by parameter memory but by *activation* memory: the intermediate tensors stored between the forward and backward passes for gradient computation. For a single transformer layer with sequence length $s$, microbatch size $b$, hidden dimension $h$, and $a$ attention heads, the activation footprint is dominated by terms proportional to $sbh$ (for LayerNorm, residual, projection outputs) and $s^2ab$ (for the attention probability matrix), so memory grows linearly in $h$ and quadratically in $s$ in the worst case.[^1] As models scaled from GPT-3's 175 B parameters and 2 048-token context to multi-hundred-billion-parameter models trained at 8 K, 32 K, 128 K and beyond, activation memory became the binding constraint, often forcing practitioners to invoke full activation recomputation (gradient checkpointing) and pay an extra forward pass per step.[^1]

Classical tensor parallelism, introduced in the 2019 Megatron-LM paper, splits the weight matrices of attention and MLP blocks across $t$ tensor-parallel ranks along the hidden dimension.[^5] This reduces parameter and activation memory inside the attention/MLP regions by a factor of $t$, but it leaves LayerNorm, dropout, and the residual stream replicated across all $t$ ranks: those layers operate on $sbh$-shaped activations that are not sharded by hidden-dim TP.[^1] Sequence parallelism arose to close that gap.

The phrase "sequence parallelism" was first used by Li, Xue, Baranwal, Li, and You in a 2021 paper from the Colossal-AI group at NUS. The paper split sequences across GPUs and computed attention through a ring-style exchange, which the authors named Ring Self-Attention (RSA), so that the full sequence is never materialised on any one device.[^6] Their experiments scaled to over 114 K tokens on 64 P100 GPUs, which the authors report is "over 27x longer than existing sparse attention works," and reached "13.7x and 3.0x maximum batch size and sequence length respectively" versus tensor parallelism at that scale.[^6] This demonstrated that splitting on the sequence axis was a viable alternative to splitting on the hidden axis. The term was subsequently reused, with different mechanics, by the NVIDIA Megatron group in 2022.

## How does sequence parallelism differ from tensor, data, and pipeline parallelism?

The four axes of large-model parallelism partition different dimensions of the training problem, and long-context systems combine them rather than choosing among them.[^1][^7]

| Parallelism | Shards along | What it reduces | Cross-device communication |
| --- | --- | --- | --- |
| [Data parallelism](/wiki/data_parallelism) (DP) | The batch (samples) | Per-replica optimizer and gradient work | All-reduce of gradients each step |
| [Tensor parallelism](/wiki/tensor_parallelism) (TP) | The hidden dimension (weight matrices) | Parameter and in-block activation memory | All-reduce inside every attention/MLP block |
| [Pipeline parallelism](/wiki/pipeline_parallelism) (PP) | The layer stack (depth) | Parameter memory across stages | Point-to-point activations between stages |
| Sequence / context parallelism (SP/CP) | The sequence (tokens) | Activation and attention memory | All-gather/reduce-scatter or ring exchange along the sequence |

Data parallelism replicates the full model and splits the batch, so it does nothing for a single long sequence that will not fit on one device. Tensor and pipeline parallelism shard the model itself but leave every device processing the full sequence length, so per-device activation memory still grows with context. Sequence parallelism is the axis that shrinks the per-device token count, which is why it is layered on top of the other three for long-context training rather than replacing them.[^1][^7]

## Megatron-LM Sequence Parallelism (Korthikanti et al., 2022)

The Megatron-LM variant of sequence parallelism is defined in *Reducing Activation Recomputation in Large Transformer Models* by Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and [Bryan Catanzaro](/wiki/bryan_catanzaro), posted to arXiv on 10 May 2022.[^1] It is a refinement that sits inside tensor parallelism rather than replacing it.

### Mechanism

A standard tensor-parallel transformer block has two TP regions: the attention block (column-parallel QKV projection, row-parallel output projection) and the MLP block (column-parallel first linear, row-parallel second linear). These regions are entered with an `f` operator (identity in forward, all-reduce in backward) and exited with a `g` operator (all-reduce in forward, identity in backward) in the original Megatron-LM design.[^5] Between TP regions sit LayerNorm and dropout, which operate token-wise and were therefore left *unsharded*, meaning every TP rank stored a redundant copy of the full $sbh$ activation.[^1]

Sequence parallelism modifies this by sharding the inputs and outputs of LayerNorm and dropout along the sequence dimension. The `g` operator (TP region exit) becomes a reduce-scatter along the sequence axis instead of an all-reduce, and the `f` operator (TP region entry) becomes an all-gather along the sequence axis instead of a no-op.[^1] Because an all-reduce equals a reduce-scatter followed by an all-gather, the aggregate communication volume per step is unchanged versus baseline TP, but the activation tensor entering and leaving each LayerNorm/dropout has shape $(s/t, b, h)$ instead of $(s, b, h)$, so every TP rank stores only $1/t$ of those activations.[^1]
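The equivalence this argument relies on can be checked with a toy simulation. The sketch below is plain Python with no real communication library; the function names and the one-scalar-per-token "tensors" are assumptions made for illustration, not Megatron code.

```python
def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum over all ranks."""
    summed = [sum(vals) for vals in zip(*per_rank)]
    return [list(summed) for _ in per_rank]

def reduce_scatter(per_rank):
    """Sum across ranks, then rank r keeps only the r-th sequence chunk."""
    t = len(per_rank)
    s = len(per_rank[0])
    assert s % t == 0, "sequence length must divide evenly across ranks"
    chunk = s // t
    summed = [sum(vals) for vals in zip(*per_rank)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(t)]

def all_gather(shards):
    """Every rank receives the concatenation of all ranks' chunks."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

# Partial outputs of a row-parallel linear layer on t = 2 ranks,
# sequence length 4, one scalar per token to keep the toy readable.
partials = [[1, 2, 3, 4], [10, 20, 30, 40]]

sharded = reduce_scatter(partials)  # what LayerNorm/dropout hold under SP
restored = all_gather(sharded)      # what the next TP region receives

print(sharded)   # [[11, 22], [33, 44]]
print(restored == all_reduce(partials))  # True
```

Between the two collectives each rank holds only $s/t$ tokens, which is where the memory saving comes from; the data that crosses the network is the same as for a single all-reduce.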

### Memory savings

Korthikanti et al. derive activation-memory expressions per transformer layer. For pure TP with parallel size $t$, the activations per layer (in bytes) scale as $sbh \cdot (10 + 24/t) + 5 \cdot abs^2/t$, with the leading $10sbh$ term coming from LayerNorm, dropout, and residual paths that are not sharded by TP.[^1] With sequence parallelism applied, the unsharded $10sbh$ term is also divided by $t$, yielding $sbh \cdot 34/t + 5 \cdot abs^2/t$, an essentially uniform $1/t$ sharding of activation memory inside the TP region.[^1]
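As a rough illustration, the two expressions can be evaluated directly. The sketch below simply transcribes the formulas quoted above; the function names are hypothetical and the shape values are illustrative assumptions chosen at GPT-3-like scale, not figures from the paper.

```python
def activation_bytes_tp(s, b, h, a, t):
    """Per-layer activation bytes with tensor parallelism only."""
    return s * b * h * (10 + 24 / t) + 5 * a * b * s * s / t

def activation_bytes_tp_sp(s, b, h, a, t):
    """Per-layer activation bytes with tensor + sequence parallelism."""
    return s * b * h * 34 / t + 5 * a * b * s * s / t

# Illustrative shape (an assumption for this example).
s, b, h, a, t = 2048, 1, 12288, 96, 8

tp_only = activation_bytes_tp(s, b, h, a, t)
tp_sp = activation_bytes_tp_sp(s, b, h, a, t)

print(f"TP only : {tp_only / 2**20:.1f} MiB per layer")
print(f"TP + SP : {tp_sp / 2**20:.1f} MiB per layer")
print(f"saved   : {(tp_only - tp_sp) / 2**20:.1f} MiB per layer")
```

The difference between the two is exactly $10\,sbh\,(1 - 1/t)$, the part of the LayerNorm, dropout, and residual activations that plain TP leaves replicated.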

Combined with their second contribution, *selective activation recomputation* (recomputing only the cheap-to-recompute attention softmax/dropout activations, while storing the rest), the Megatron paper reports a 5x reduction in activation memory and an over-90 % reduction in the time overhead from activation recomputation.[^1] Training a 530 B-parameter GPT-3-style model on 2 240 [NVIDIA A100](/wiki/nvidia_a100) GPUs reached 54.2 % Model FLOPs Utilization (MFU), versus 42.1 % with full recomputation, a 29 % speedup.[^1]

### Properties

Megatron-style SP has three important properties:

1. **It requires TP > 1.** SP is defined relative to a TP region; with $t = 1$ there is nothing to all-gather. NeMo and Megatron Core gate the `sequence_parallel=True` flag on `tensor_model_parallel_size > 1`.[^7]
2. **Total communication is unchanged.** All-reduce decomposes losslessly into reduce-scatter + all-gather, so SP reduces activation memory without adding to the bandwidth budget.[^1]
3. **It does not extend to attention computation itself.** Attention is computed inside the TP region with the full sequence present (in a head-parallel fashion across TP ranks), so SP alone does not lift the per-device $O(s)$ activation floor at attention; it lifts only the LayerNorm/dropout/residual floor.[^1] Lifting the attention floor is what later motivated DeepSpeed-Ulysses and Ring/Context Parallelism.

## DeepSpeed-Ulysses (Jacobs et al., 2023)

DeepSpeed-Ulysses, presented in *DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models* by Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He (arXiv 2309.14509, submitted 25 September 2023), takes a different design philosophy.[^2] Rather than refining hidden-dimension TP, Ulysses keeps the sequence partitioned across devices for *most* of the computation and only briefly reshuffles for attention.

### Mechanism

Let $P$ be the number of sequence-parallel devices and $h$ be the number of attention heads. Outside attention, every device holds an $(N/P) \times d$ slice of the activation, where $N$ is the sequence length and $d$ is the hidden dimension. Linear projections, LayerNorm, MLP, and residuals all operate on this sequence-sharded representation.[^2]

Right before attention, Ulysses applies an all-to-all collective on the projected Q, K, V tensors. After the all-to-all, each device holds the *full* sequence (length $N$) but only a non-overlapping subset of attention heads ($h/P$ heads per device).[^2] This requires $P \mid h$, i.e. the number of heads is divisible by the SP degree. Each device then computes ordinary head-parallel attention with any backend, including [FlashAttention](/wiki/flashattention) v2, on its head subset over the full sequence.[^2] A second all-to-all redistributes the per-head outputs back into the sequence-sharded layout for the output projection, MLP, and downstream layers.[^2]
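The layout change performed by the two all-to-alls can be traced with a toy model. The sketch below tracks only which `(token, head)` entries each device holds, not the values; it is an illustrative assumption about the data movement, written in plain Python with hypothetical function names, and is not DeepSpeed code.

```python
def seq_sharded_layout(n_tokens, n_heads, p):
    """Device r holds a contiguous token slice with every head."""
    assert n_tokens % p == 0, "sequence length must divide by P"
    chunk = n_tokens // p
    return [[(tok, head)
             for tok in range(r * chunk, (r + 1) * chunk)
             for head in range(n_heads)]
            for r in range(p)]

def all_to_all_seq_to_head(layout, n_heads):
    """Before attention: route each entry to the device owning its head."""
    p = len(layout)
    assert n_heads % p == 0, "Ulysses needs the head count divisible by P"
    heads_per_dev = n_heads // p
    received = [[] for _ in range(p)]
    for src in range(p):
        for tok, head in layout[src]:
            received[head // heads_per_dev].append((tok, head))
    return [sorted(items) for items in received]

def all_to_all_head_to_seq(layout, n_tokens):
    """After attention: route each entry back to the device owning its token."""
    p = len(layout)
    chunk = n_tokens // p
    received = [[] for _ in range(p)]
    for src in range(p):
        for tok, head in layout[src]:
            received[tok // chunk].append((tok, head))
    return [sorted(items) for items in received]

N_TOKENS, N_HEADS, P = 8, 4, 2
before = seq_sharded_layout(N_TOKENS, N_HEADS, P)
during = all_to_all_seq_to_head(before, N_HEADS)
after = all_to_all_head_to_seq(during, N_TOKENS)

for dev, items in enumerate(during):
    tokens = sorted({tok for tok, _ in items})
    heads = sorted({head for _, head in items})
    print(f"device {dev}: tokens {tokens}, heads {heads}")
print(after == before)  # True
```

During attention every device sees all eight tokens but only its own two heads, so an unmodified single-device attention kernel can run on each head subset.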

### Communication analysis

An all-to-all over $P$ devices moves $M/P$ per link, where $M$ is the aggregate message size. With hidden dimension $d$, Ulysses sends an aggregate of $3Nd$ for the Q, K, V projections before attention and $Nd$ for the attention output afterwards, giving a per-link volume of $4Nd/P$ per transformer layer, i.e. $O(N/P)$ per device.[^2] In contrast, the Megatron-LM SP approach scales as $O(N)$ per link because it relies on all-gather along the sequence dimension to assemble inputs for attention.[^2] As long as $N$ and $P$ scale proportionally (a common regime in long-context training), Ulysses keeps per-link communication volume *constant*, a property the authors highlight as a key scaling advantage.[^2]
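A back-of-the-envelope comparison makes the scaling argument concrete. The sketch below counts elements rather than bytes and writes the hidden size as `hidden`; the all-to-all term follows the $M/P$ rule stated above, while the gather/scatter term uses the standard ring cost model of $M(P-1)/P$ per link, which is an assumption of this example rather than a figure from the text. Function names and the sample sizes are hypothetical.

```python
def ulysses_per_link(n_tokens, hidden, p):
    """All-to-all: aggregate message of 4 * N * hidden elements, M / P per link."""
    return 4 * n_tokens * hidden / p

def gather_scatter_per_link(n_tokens, hidden, p):
    """Assumed ring all-gather/reduce-scatter model: M * (P - 1) / P per link."""
    return 4 * n_tokens * hidden * (p - 1) / p

HIDDEN = 4096  # illustrative hidden size

# Grow the sequence length and the SP degree together.
for n_tokens, p in [(65_536, 8), (131_072, 16), (262_144, 32)]:
    a2a = ulysses_per_link(n_tokens, HIDDEN, p)
    ag_rs = gather_scatter_per_link(n_tokens, HIDDEN, p)
    print(f"N={n_tokens:>7} P={p:>2}  all-to-all={a2a:.3e}  gather/scatter={ag_rs:.3e}")
```

Under these assumptions the all-to-all column stays fixed as $N$ and $P$ double together, while the gather/scatter column roughly doubles at each step.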

### Results and limits

The Ulysses paper reports that the method "trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline."[^2] The accompanying DeepSpeed blog reports enabling training with sequences of "over a million tokens" at a "sustained throughput of over 175 TFlops/GPU (over 54% of hardware peak)," and elaborates that the design is attention-implementation agnostic (it works with dense, sparse, and FlashAttention-2 kernels), composes with DeepSpeed ZeRO-3 sharding, and exposes a `DistributedAttention` wrapper that requires minimal model code changes.[^8]

The principal limitation is the head-divisibility constraint: SP degree cannot exceed the number of attention heads, which caps Ulysses parallelism. For models with grouped-query attention or few heads this can be restrictive, and it creates friction when composing Ulysses with hidden-dim TP, because both methods consume the same head count.[^9]

## Ring Attention (Liu, Zaharia, Abbeel, 2023)

A complementary approach is **Ring Attention with Blockwise Transformers for Near-Infinite Context** by Hao Liu, Matei Zaharia, and [Pieter Abbeel](/wiki/pieter_abbeel), posted to arXiv on 3 October 2023 (2310.01889).[^3] Ring Attention attacks attention's quadratic memory cost head-on by distributing it across the devices that hold the sequence shards.

### Mechanism

Ring Attention organises $P$ devices into a logical ring. Each device $i$ holds the local sequence chunk $X_i$ of length $N/P$ and its corresponding queries $Q_i$, keys $K_i$, and values $V_i$. Attention is computed blockwise (in the style of FlashAttention) on a per-chunk basis using running softmax statistics.[^3] While each device computes attention between its $Q_i$ and the currently held $(K_j, V_j)$ block, it simultaneously sends $(K_j, V_j)$ to the next device in the ring and receives $(K_{j-1}, V_{j-1})$ from the previous one. After $P$ steps the ring has rotated all KV blocks past every $Q_i$ and the global attention output has been accumulated.[^3] The crucial system property is that the cost of communicating a KV block can be fully overlapped by the cost of computing one block of attention, so when the per-block compute exceeds the per-block transfer, communication is hidden.[^3]

Because no device ever materialises the full $N \times N$ attention matrix or the full KV cache, the per-device memory is proportional to $N/P$ regardless of total sequence length. Liu et al. demonstrate training sequences "device count times longer" than what blockwise-only baselines like Blockwise Parallel Transformers (BPT) could handle, scaling to millions of tokens.[^3] The method reaches this, in the authors' words, "without resorting to approximations or incurring additional communication and computation overheads": Ring Attention is exact, so no tokens are dropped and no attention pattern is restricted.[^3]
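The rotation and the running-softmax accumulation can be simulated on one machine. The following is a minimal single-head, non-causal sketch in plain Python; the sequential loop stands in for devices that would run concurrently, no communication/compute overlap is modelled, and all names are hypothetical rather than taken from the authors' implementation.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def full_attention(q, k, v):
    """Reference softmax(q k^T / sqrt(d)) v computed in one place."""
    d, dv = len(q[0]), len(v[0])
    out = []
    for qi in q:
        scores = [dot(qi, kj) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z for c in range(dv)])
    return out

def ring_attention(q_chunks, k_chunks, v_chunks):
    """Device i keeps Q_i and processes one KV block per ring step."""
    p = len(q_chunks)
    d, dv = len(q_chunks[0][0]), len(v_chunks[0][0])
    # Per-query running state: (max score, softmax denominator, weighted sum).
    state = [[(-math.inf, 0.0, [0.0] * dv) for _ in qc] for qc in q_chunks]
    held = list(range(p))  # held[i] = index of the KV block now on device i
    for _ in range(p):
        for i in range(p):
            k_blk, v_blk = k_chunks[held[i]], v_chunks[held[i]]
            for r, qi in enumerate(q_chunks[i]):
                m_old, z_old, acc_old = state[i][r]
                scores = [dot(qi, kj) / math.sqrt(d) for kj in k_blk]
                m_new = max(m_old, max(scores))
                scale = math.exp(m_old - m_new)  # rescales earlier blocks
                w = [math.exp(s - m_new) for s in scores]
                z_new = z_old * scale + sum(w)
                acc_new = [acc_old[c] * scale
                           + sum(wj * vj[c] for wj, vj in zip(w, v_blk))
                           for c in range(dv)]
                state[i][r] = (m_new, z_new, acc_new)
        # Rotate: device i receives the block device i-1 was holding.
        held = [held[(i - 1) % p] for i in range(p)]
    return [[[x / z for x in acc] for (_, z, acc) in dev] for dev in state]

rng = random.Random(0)
N, D, P = 8, 4, 4
q, k, v = ([[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
           for _ in range(3))

def split(x):
    return [x[i * N // P:(i + 1) * N // P] for i in range(P)]

ring_out = [row for dev in ring_attention(split(q), split(k), split(v)) for row in dev]
ref_out = full_attention(q, k, v)
max_err = max(abs(a - b) for ra, rb in zip(ring_out, ref_out) for a, b in zip(ra, rb))
print(f"max abs difference vs. full attention: {max_err:.2e}")
```

Each simulated device only ever touches an $(N/P) \times (N/P)$ block of scores at a time, yet the result agrees with full attention up to floating-point rounding, which is the sense in which the method is exact.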

### Relationship to FlashAttention

Ring Attention is sometimes described as the distributed-memory analogue of FlashAttention.[^9] FlashAttention shards attention across a single GPU's SRAM tiles, using the online-softmax trick to fold the full attention into a streaming computation; Ring Attention applies the same blockwise pattern across a multi-GPU memory hierarchy, with the inter-GPU ring exchange playing the role that HBM-to-SRAM streaming plays inside a single device.[^9] In practice the two compose: Ring Attention dispatches local blocks to FlashAttention kernels.[^4]

### Load-balance issue with causal masks

For [decoder-only language models](/wiki/generative_pre-trained_transformer) the attention mask is lower-triangular. If the sequence is split into $P$ equal contiguous chunks and laid out 0 to $P-1$ along the ring, then rank 0 has the fewest keys to attend over (only its own chunk) while rank $P-1$ has the most (the entire sequence), yielding poor load balance.[^9] Subsequent work proposed token reordering ("Striped Attention") and chunk interleaving to equalise per-rank work; the canonical Megatron-LM Context Parallelism implementation includes this optimisation by default.[^4][^9]

## Context Parallelism in NVIDIA Megatron-LM

In modern NVIDIA training stacks (Megatron Core, NeMo Framework, Megatron-Bridge), the production sequence-sharding feature is called **Context Parallelism (CP)** and is distinct from the original Megatron SP.[^4][^7] CP combines ring attention with classical Korthikanti-style SP and is the recommended path for long-context training above roughly 32 K tokens.[^7]

### Mechanism

CP partitions network *inputs and all activations* along the sequence dimension, not only LayerNorm/dropout.[^4] Each GPU stores only its sequence chunk of every layer's activations and KV cache. For attention, CP uses a ring-style exchange: each rank gathers KV chunks from peers as needed and pipelines the gather with the local attention computation.[^4] The Megatron Core documentation notes that "the all-gather and reduce-scatter are transformed to point-to-point communications in ring topology under the hood."[^4]

Megatron Core exposes a configurable `cp_comm_type` parameter that selects among four implementations of the attention communication, and the modes can even be interleaved layer by layer.[^13]

| `cp_comm_type` | Communication for attention | Overlaps with compute? | Notes |
| --- | --- | --- | --- |
| `p2p` | Exchange KV chunks with point-to-point sends in a ring | Yes (async) | The ring-attention style; hard-coded to overlap with the attention compute of sequence chunks |
| `all_gather` | All-gather the full KV sequence before attention | No | Simplest scheme; the approach Llama 3 adopted for its CP attention |
| `a2a` | All-to-all scatter of attention heads across the CP group | Partial | The DeepSpeed-Ulysses style, brought inside Megatron CP |
| `a2a+p2p` | Hierarchical: all-to-all within a low-level group, point-to-point across a high-level group | Yes | All-to-all over NVLink islands, point-to-point over InfiniBand between nodes |

That the `a2a` mode is documented as being "like DeepSpeed Ulysses" shows that Megatron Core's Context Parallelism has converged the ring and all-to-all lineages into a single configurable feature.[^13]

NVIDIA positions CP as an improvement over the original Ring Attention paper on two axes: "(1) leveraging the latest OSS and cuDNN flash attention kernels" and "(2) removing unnecessary computation resulted from low-triangle causal masking and achieving optimal load balance among GPUs" by reordering chunks along the ring.[^4]

### Composing with other parallelism axes

CP is orthogonal to TP, PP, DP, and Expert Parallelism: the total GPU count satisfies $\text{world size} = \text{TP} \times \text{CP} \times \text{PP} \times \text{DP}$.[^4] Korthikanti-style SP within the TP region is typically kept on whenever TP > 1, and CP is layered on top for the sequence axis.
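The factorisation is easy to sanity-check when planning a job. The helper names below are hypothetical, and the numbers are the Llama 3 405B configurations reported in this section; the sketch only performs the arithmetic and says nothing about how any framework builds its process groups.

```python
def data_parallel_size(world_size, tp, cp, pp):
    """Derive DP from world size = TP x CP x PP x DP."""
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not a multiple of TP*CP*PP = {model_parallel}"
        )
    return world_size // model_parallel

def tokens_per_cp_rank(seq_len, cp):
    """Tokens of one sequence that each CP rank holds."""
    if seq_len % cp != 0:
        raise ValueError("sequence length must be divisible by the CP degree")
    return seq_len // cp

# Short-context phase: 8 192-token sequences, no context parallelism.
dp_short = data_parallel_size(16_384, tp=8, cp=1, pp=16)
# Long-context phase: 131 072-token sequences, CP = 16.
dp_long = data_parallel_size(16_384, tp=8, cp=16, pp=16)

print(dp_short, tokens_per_cp_rank(8_192, 1))     # 128 8192
print(dp_long, tokens_per_cp_rank(131_072, 16))   # 8 8192
```

Raising CP by a factor of 16 is paid for by an equal reduction in DP, while the number of tokens each rank holds stays the same.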

Meta's engineering account of Llama 3 pre-training documents the pattern precisely. Llama 3 405B was trained on 16 384 [H100](/wiki/nvidia_h100) GPUs with a 16 M-token global batch using four-dimensional parallelism (FSDP, TP, PP, and CP).[^12] The short-context base phase at 8 192 tokens used TP = 8, CP = 1, PP = 16, and DP = 128; when the context window was extended to 131 072 (128 K) tokens, the configuration became TP = 8, CP = 16, PP = 16, and DP = 8, so that each rank still processed only 8 K tokens, matching the activation footprint of the base training.[^12] Notably, Meta chose an "all-gather-based CP attention" rather than the ring-based scheme of the original Ring Attention paper, for two reasons: it flexibly supports Llama 3's irregular document-mask attention, and because attention communication grows only linearly with sequence length while attention computation grows quadratically, the all-gather overhead becomes a smaller fraction of each step as sequences lengthen.[^12] To balance the load created by the causal mask, Llama 3's CP splits the input into $2 \times cp$ chunks and assigns each rank $i$ both its $i$-th and $(2 \times cp - i - 1)$-th chunk.[^12]
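The effect of the $2 \times cp$ chunk pairing can be seen by counting causal query-key pairs per rank. This is a toy sketch with hypothetical names: it assumes each rank evaluates, for every query token it owns, all keys at or before that position, and it ignores document masks and kernel-level details.

```python
def causal_work(token_ids):
    """Query-key pairs a rank evaluates: query at position t sees keys 0..t."""
    return sum(t + 1 for t in token_ids)

def contiguous_assignment(seq_len, cp):
    """Rank r owns one contiguous chunk of seq_len / cp tokens."""
    assert seq_len % cp == 0
    chunk = seq_len // cp
    return [list(range(r * chunk, (r + 1) * chunk)) for r in range(cp)]

def paired_assignment(seq_len, cp):
    """Split into 2*cp chunks; rank i owns chunk i and chunk 2*cp - i - 1."""
    n_chunks = 2 * cp
    assert seq_len % n_chunks == 0
    chunk = seq_len // n_chunks

    def tokens(c):
        return list(range(c * chunk, (c + 1) * chunk))

    return [tokens(i) + tokens(n_chunks - i - 1) for i in range(cp)]

SEQ_LEN, CP = 16, 4
contiguous = [causal_work(ids) for ids in contiguous_assignment(SEQ_LEN, CP)]
paired = [causal_work(ids) for ids in paired_assignment(SEQ_LEN, CP)]

print(contiguous)  # [10, 26, 42, 58]
print(paired)      # [34, 34, 34, 34]
```

Pairing an early chunk with a late one gives every rank the same amount of unmasked work, whereas the contiguous split leaves the last rank with nearly six times the work of the first in this toy case.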

NVIDIA's developer blog reports that on [B200](/wiki/nvidia_b200) hardware CP delivers "more than 2x speedup for Llama 3 8B with sequences ranging from 16K to 1 million tokens," that "starting from 32K sequence length and beyond" CP yields higher teraflops, and that "at a sequence length of 1 million, using CP is mandatory to get models running."[^11]

## How do the sequence parallelism variants compare?

| Variant | Year | Shards along | Attention strategy | Communication for attention | Composes with TP? | Bound by head count? |
| --- | --- | --- | --- | --- | --- | --- |
| Colossal-AI SP (Li et al.) | 2021[^6] | Sequence | Ring self-attention | $O(N)$ per ring step | yes | no |
| Megatron SP (Korthikanti et al.) | 2022[^1] | Sequence (only at LN/dropout) | Standard TP attention | none extra | required (TP > 1) | no |
| DeepSpeed-Ulysses | 2023[^2] | Sequence | Head-parallel after all-to-all | $O(N/P)$ per link | with friction (heads shared) | yes ($P \mid h$) |
| Ring Attention (Liu et al.) | 2023[^3] | Sequence | Blockwise + ring KV rotation | $O(N/P)$ per ring step, overlapped | yes | no |
| Megatron Context Parallelism | 2023 to present[^4] | Sequence (all activations) | Ring + FlashAttention, causal-aware | $O(N/P)$, overlapped | yes (orthogonal axis) | no |

In every variant that shards attention itself (all but Megatron SP, which leaves attention to head-parallel TP), the activation memory at attention scales as $O(N/P)$ when SP is engaged across $P$ ranks; the achievable $P$ and the communication overhead differ.

A 2024 paper by Fang and Zhao, *USP: A Unified Sequence Parallelism Approach for Long-Context Generative AI* (arXiv 2405.07719), proposes hybridising Ulysses and Ring Attention into a unified hierarchical scheme that can run Ulysses over a smaller dimension (e.g. within a node, where all-to-all is cheap) and Ring over a larger dimension (e.g. across nodes, where overlapped point-to-point is preferable).[^9] USP reports 47 % MFU and 208 K-token training on LLaMA3-8B over two 8x A800 nodes.[^9]
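One way to picture such a hybrid is as a two-dimensional grid of ranks. The sketch below is an assumption made for illustration, not the layout prescribed by the paper: it places consecutive ranks (taken here to share a node) in the same Ulysses group and strides across nodes to form the ring groups. The function name is hypothetical.

```python
def usp_groups(world_size, ulysses_degree):
    """Return (ulysses_groups, ring_groups) for a 2-D sequence-parallel grid."""
    if world_size % ulysses_degree != 0:
        raise ValueError("world size must be a multiple of the Ulysses degree")
    ring_degree = world_size // ulysses_degree
    # Consecutive ranks, assumed to share a node, run the all-to-all.
    ulysses = [list(range(g * ulysses_degree, (g + 1) * ulysses_degree))
               for g in range(ring_degree)]
    # Ranks with the same local index on each node form a ring.
    ring = [list(range(offset, world_size, ulysses_degree))
            for offset in range(ulysses_degree)]
    return ulysses, ring

# Two nodes of eight GPUs each, as in the configuration quoted above.
ulysses_groups, ring_groups = usp_groups(world_size=16, ulysses_degree=8)
print(ulysses_groups[0])  # [0, 1, 2, 3, 4, 5, 6, 7]
print(ring_groups[0])     # [0, 8]
```

With this grid the total sequence-parallel degree is the product of the two group sizes, and the head-divisibility constraint applies only to the Ulysses degree.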

## Interaction with other parallelism axes

Sequence parallelism is one axis of multi-dimensional parallelism. Its interactions with the others are as follows.[^7][^10]

* **Tensor parallelism (TP).** Korthikanti SP requires TP > 1 by construction. CP is orthogonal to TP and is normally combined with it: TP shards along hidden, CP shards along sequence. Ulysses *conflicts* with TP because both want to subdivide attention heads.
* **Pipeline parallelism (PP).** SP and CP are layer-local and compose freely with PP.
* **Data parallelism (DP).** All SP variants compose with DP (including ZeRO-1/2/3 / [FSDP](/wiki/fsdp)) on the global-batch axis. Ulysses in particular was designed to combine with DeepSpeed ZeRO-3 for combined sequence + parameter sharding.[^2]
* **Expert parallelism (EP) / Mixture-of-Experts.** [MoE](/wiki/mixture_of_experts) layers add an additional all-to-all over experts; CP (for long sequences) and EP (for expert routing) are independent, composable axes in Megatron Core and NeMo, so a long-context mixture-of-experts model can engage both at once.[^7]
* **Activation recomputation.** SP reduces but does not eliminate per-device activation pressure. The Korthikanti paper explicitly pairs SP with selective activation recomputation; Megatron and DeepSpeed retain full or selective checkpointing as an optional knob even when CP/Ulysses is on.[^1]

## What is sequence parallelism used for?

The principal applications of sequence parallelism are:

* **Long-context pre-training.** Llama 3 pre-training used CP to extend context from 8 K to 128 K tokens without per-rank memory blow-up.[^10][^12]
* **Long-context continued training.** Open-weights long-context recipes for Llama and Mistral-class models typically combine SP at LayerNorm with CP at attention to scale to 256 K to 1 M tokens.[^11]
* **Inference for long inputs.** NVIDIA and Meta have both published million-token *inference* recipes that reuse the CP machinery on the KV cache.[^11]
* **Multimodal long sequences.** Vision-language and video-language models with very long token sequences (e.g. dense video tokenisation) use CP to keep activation memory tractable.[^11]

## What are the limitations and trade-offs?

Sequence parallelism is not free.

* **Communication.** Even though Megatron SP's total volume equals that of plain TP, the all-gather/reduce-scatter pair on the sequence axis is latency-sensitive on slower interconnects; SP is most cost-effective when TP runs within a high-bandwidth NVLink island.[^1]
* **Head-divisibility (Ulysses).** Ulysses cannot use more SP ranks than attention heads, capping its parallelism. Grouped-query attention models with few KV heads exacerbate this.[^9]
* **Causal-mask load imbalance (Ring).** Under a causal mask, naive ring attention spends about half its FLOPs on score blocks that are masked out unless chunk interleaving is used; Megatron CP handles this explicitly but the engineering is non-trivial.[^4][^9]
* **Code intrusiveness.** Both Ulysses and CP require attention kernels that expose the right hooks (KV exchange, head all-to-all). Naive PyTorch attention has to be replaced by a sequence-aware wrapper.[^4][^8]
* **Composability.** Composing Ulysses with TP, or combining several of the axes (TP, SP, CP, EP, PP, DP), requires care to avoid conflicting collectives and over-partitioning.[^9]

## Related Work

Sequence parallelism sits between three closely related lines of work. Activation-memory-reducing methods such as gradient checkpointing reduce memory at the cost of recomputation. Memory-efficient attention kernels such as FlashAttention reduce attention's intra-device memory without sharding the sequence across devices. And model-parallel methods such as tensor parallelism and pipeline parallelism shard parameters and layers but leave per-rank sequence length unchanged. Sequence parallelism complements all three, and modern long-context training pipelines combine them all.

## See also

* [Attention Is All You Need (Transformer)](/wiki/attention_is_all_you_need_transformer)
* [Llama 3.3](/wiki/llama_3_3)
* [Mistral AI](/wiki/mistral)
* [Graphics processing unit](/wiki/gpu)
* [GPT-3](/wiki/gpt-3)

## References

[^1]: Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, "Reducing Activation Recomputation in Large Transformer Models", arXiv:2205.05198, 2022-05-10. https://arxiv.org/abs/2205.05198. Accessed 2026-05-21.
[^2]: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He, "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models", arXiv:2309.14509, 2023-09-25. https://arxiv.org/abs/2309.14509. Accessed 2026-05-21.
[^3]: Hao Liu, Matei Zaharia, Pieter Abbeel, "Ring Attention with Blockwise Transformers for Near-Infinite Context", arXiv:2310.01889, 2023-10-03. https://arxiv.org/abs/2310.01889. Accessed 2026-05-21.
[^4]: NVIDIA, "context_parallel package, Megatron-LM developer guide", NVIDIA Corporation, 2024-10-01. https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/api-guide/context_parallel.html. Accessed 2026-05-21.
[^5]: Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism", arXiv:1909.08053, 2019-09-17. https://arxiv.org/abs/1909.08053. Accessed 2026-05-21.
[^6]: Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You, "Sequence Parallelism: Long Sequence Training from System Perspective", arXiv:2105.13120, 2021-05-26. https://arxiv.org/abs/2105.13120. Accessed 2026-05-21.
[^7]: NVIDIA, "Parallelisms, NeMo Framework User Guide", NVIDIA Corporation, 2025-02-01. https://docs.nvidia.com/nemo-framework/user-guide/25.02/nemotoolkit/features/parallelisms.html. Accessed 2026-05-21.
[^8]: DeepSpeed Team, "DeepSpeed-Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (blog)", Microsoft / DeepSpeed Project, 2023-09-25. https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ulysses. Accessed 2026-05-21.
[^9]: Jiarui Fang, Shangchun Zhao, "USP: A Unified Sequence Parallelism Approach for Long Context Generative AI", arXiv:2405.07719, 2024-05-13. https://arxiv.org/abs/2405.07719. Accessed 2026-05-21.
[^10]: Llama Team, "The Llama 3 Herd of Models", arXiv:2407.21783, 2024-07-31. https://arxiv.org/abs/2407.21783. Accessed 2026-05-21.
[^11]: NVIDIA, "Scaling to Millions of Tokens with Efficient Long-Context LLM Training", NVIDIA Developer Blog, 2024-11-15. https://developer.nvidia.com/blog/scaling-to-millions-of-tokens-with-efficient-long-context-llm-training/. Accessed 2026-05-21.
[^12]: Weiwei Chu, Xinfeng Xie, Jiecao Yu, Jie Wang, Amar Phanishayee, Chunqiang Tang, Yuchen Hao, Jianyu Huang, et al. (Meta Platforms), "Scaling Llama 3 Training with Efficient Parallelism Strategies", Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA '25), 2025-06. https://doi.org/10.1145/3695053.3731410. Accessed 2026-07-08.
[^13]: NVIDIA, "core.model_parallel_config (cp_comm_type), Megatron Core developer guide", NVIDIA Corporation, 2025. https://docs.nvidia.com/megatron-core/developer-guide/0.17.0/apidocs/core/core.model_parallel_config.html. Accessed 2026-07-08.