Context Parallelism
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,310 words
Context Parallelism (CP) is a distributed training strategy that partitions a transformer's input along the sequence dimension across multiple accelerators and uses ring-style point-to-point communication to exchange key and value tensors during attention computation.[^1] The technique allows transformer models to be trained on sequences far longer than would fit in the activation memory of a single GPU, with each device computing attention only for its local query chunk while keys and values circulate around the participating devices.[^1][^2] The name "context parallelism" was popularized by NVIDIA when the algorithm was added to Megatron-LM and Megatron-Core in 2024; the underlying attention algorithm was first described in the Ring Attention paper by Hao Liu, Matei Zaharia, and Pieter Abbeel of UC Berkeley in October 2023.[^2][^3] Context parallelism became a load-bearing component of the training stack for Llama 3.1 (128K tokens), and equivalent sequence-dimension partitioning is reported for other frontier models with very long context windows, such as Gemini 2.5 Pro (1M tokens).[^4][^5]
The memory footprint of activations in a transformer layer grows linearly with the sequence length S, and the attention score matrix grows quadratically with S.[^6] For a model trained at 128K tokens, the activations for a single microbatch easily exceed the high-bandwidth memory (HBM) of an NVIDIA H100 GPU even when Flash Attention is used to avoid materializing the full attention score matrix.[^7] Three earlier mitigations were widely deployed before context parallelism emerged. First, activation recomputation (also called gradient checkpointing) discards intermediate activations on the forward pass and recomputes them on the backward pass, at the cost of roughly 30 percent additional compute per training step.[^7] Second, tensor parallelism shards weights and activations across the hidden dimension, but adding more tensor-parallel ranks shrinks the per-rank compute so much that it can no longer overlap with the all-reduce communication, hurting throughput.[^7] Third, sequence parallelism in the Megatron-LM 2 paper of 2022 shards only the activations of LayerNorm and Dropout layers along the sequence dimension; it does not shard the attention computation itself along the sequence dimension and therefore does not solve the long-context activation problem.[^8]
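The quadratic growth of the score matrix can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes 2-byte (bf16 or fp16) elements and counts one attention head of one layer.

```python
def attention_scores_bytes(seq_len: int, num_heads: int = 1, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize a full S x S attention score matrix."""
    return seq_len * seq_len * num_heads * bytes_per_element


GIB = 1024 ** 3

# Quadrupling the sequence length multiplies the score-matrix size by 16.
for s in (8 * 1024, 32 * 1024, 128 * 1024):
    print(f"S={s:>7}: {attention_scores_bytes(s) / GIB:.3f} GiB per head")
```

Under these assumptions a single head at 128K tokens would need 32 GiB for the score matrix alone, which is why kernels such as Flash Attention avoid materializing it.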
Two more aggressive ideas appeared in fall 2023. DeepSpeed-Ulysses (arXiv 2309.14509, September 2023) partitions inputs along the sequence dimension and then uses an all-to-all collective immediately before attention so that each rank receives the full sequence but only for a non-overlapping subset of attention heads.[^9] Ring Attention with Blockwise Transformers (arXiv 2310.01889, October 2023) keeps the sequence sharded throughout the attention block and circulates key and value blocks around a logical ring of devices, overlapping the block transfers with the computation of partial attention scores.[^2] NVIDIA adopted the Ring Attention algorithm as the core of its production "context parallelism" feature in Megatron-LM, and Meta used the same family of algorithms for the 128K-token training and inference of Llama 3.1.[^1][^4]
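The layout change performed by the Ulysses-style all-to-all can be sketched on a single host with NumPy. This is a simulation of the data movement only, not DeepSpeed code; the function name and the [sequence, heads, head_dim] tensor layout are assumptions made for illustration.

```python
import numpy as np


def ulysses_all_to_all(seq_shards):
    """Simulate the pre-attention all-to-all on one host.

    seq_shards: list of P arrays, each [S/P, H, D] (sequence-sharded).
    Returns a list of P arrays, each [S, H/P, D] (head-sharded, full sequence).
    """
    p = len(seq_shards)
    num_heads = seq_shards[0].shape[1]
    if num_heads % p != 0:
        raise ValueError("head count must be divisible by the group size")
    heads_per_rank = num_heads // p
    out = []
    for r in range(p):
        lo, hi = r * heads_per_rank, (r + 1) * heads_per_rank
        # Rank r collects its head slice from every sequence shard, in sequence order.
        out.append(np.concatenate([shard[:, lo:hi, :] for shard in seq_shards], axis=0))
    return out


# Toy sizes: S=8 tokens, H=4 heads, D=2, group of P=2 ranks.
full = np.arange(8 * 4 * 2, dtype=np.float64).reshape(8, 4, 2)
shards = np.split(full, 2, axis=0)
head_sharded = ulysses_all_to_all(shards)
print([x.shape for x in head_sharded])
```

The divisibility check reflects the constraint that the degree of parallelism in this scheme cannot exceed the number of attention heads.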
A context-parallel group of size CP partitions an input sequence of length S into CP non-overlapping chunks along the sequence axis.[^1] Each device holds the queries Q_i, keys K_i, and values V_i for its local chunk, together with a full copy of the model weights, which are replicated across the CP group (weights are still sharded across the tensor-parallel and pipeline-parallel dimensions if those are also in use).[^1] All operations outside attention (linear projections, MLP, LayerNorm, residual connections, embedding) operate independently on each chunk because they act on each token separately along the sequence dimension.[^1] Only the attention operator requires cross-device communication because each query token must, in principle, attend to every key and value token.
The ring attention algorithm proceeds in CP iterations.[^2] In iteration t, device i computes the partial softmax statistics and partial output of its local Q_i against the K and V slice it currently holds, while in parallel posting an asynchronous SendRecv that ships that same K and V slice to neighbor (i+1) mod CP and receives the next slice from neighbor (i-1) mod CP.[^2] After CP iterations every Q_i has been multiplied against every K_j and V_j, and the partial outputs are merged using the log-sum-exp accumulation trick from Flash Attention so that the final result is mathematically equivalent to dense attention computed on a single device, up to floating-point rounding.[^2][^10] Because the SendRecv runs in parallel with the block attention compute, the communication can be fully hidden whenever the per-block compute time exceeds the per-block transfer time, which is the regime that holds for sequences in the tens of thousands of tokens on modern interconnects.[^2]
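The iteration described above can be sketched as a single-process NumPy simulation. This is a minimal illustration under stated assumptions, not the Megatron-Core or Transformer Engine implementation: devices are simulated sequentially, the "SendRecv" is a list rotation, there is one head, and no causal mask is applied. All function names are hypothetical.

```python
import numpy as np


def dense_attention(q, k, v):
    """Reference single-device softmax attention (no mask)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def ring_attention(q_chunks, k_chunks, v_chunks):
    """Simulate CP ring iterations with an online-softmax (log-sum-exp) merge."""
    cp = len(q_chunks)
    scale = 1.0 / np.sqrt(q_chunks[0].shape[-1])
    # Per-device running state: row max m, softmax denominator l, unnormalized output o.
    m = [np.full(q.shape[0], -np.inf) for q in q_chunks]
    l = [np.zeros(q.shape[0]) for q in q_chunks]
    o = [np.zeros((q.shape[0], v_chunks[0].shape[-1])) for q in q_chunks]
    held = list(zip(k_chunks, v_chunks))  # K/V block currently held by each device
    for _ in range(cp):
        for i in range(cp):
            k, v = held[i]
            s = q_chunks[i] @ k.T * scale
            m_new = np.maximum(m[i], s.max(axis=-1))
            alpha = np.exp(m[i] - m_new)  # rescales previously accumulated partials
            p = np.exp(s - m_new[:, None])
            l[i] = alpha * l[i] + p.sum(axis=-1)
            o[i] = alpha[:, None] * o[i] + p @ v
            m[i] = m_new
        # Ring step: device i receives the block previously held by device (i-1) mod cp.
        held = [held[(i - 1) % cp] for i in range(cp)]
    return [o[i] / l[i][:, None] for i in range(cp)]


rng = np.random.default_rng(0)
S, D, CP = 16, 8, 4
q, k, v = (rng.standard_normal((S, D)) for _ in range(3))
ring_out = np.concatenate(ring_attention(np.split(q, CP), np.split(k, CP), np.split(v, CP)))
print(np.allclose(ring_out, dense_attention(q, k, v)))
```

In a real implementation the rotation would be an asynchronous point-to-point transfer overlapped with the block computation; the simulation only demonstrates that the merged result matches dense attention to within rounding.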
NVIDIA Megatron-Core exposes the communication strategy via the cp_comm_type configuration option.[^11] The default p2p mode uses pairwise SendRecv around the ring exactly as in the original Ring Attention paper.[^11] An a2a (all-to-all) mode reproduces the DeepSpeed-Ulysses head-partitioning strategy, and a hierarchical a2a+p2p mode combines the two by using all-to-all within a node and point-to-point across nodes, which is the pattern explored in academic work on Unified Sequence Parallelism.[^12] In Megatron-Core the hierarchical_context_parallel_sizes parameter lets users specify the inner all-to-all size and the outer point-to-point size as a two-element list.[^11]
Megatron-LM exposes the CP size as the command-line argument --context-parallel-size, with a default value of 1 (CP disabled).[^11] The companion argument --cp-comm-type selects among p2p, a2a, and a2a+p2p strategies.[^11] In the Python API the same fields appear on ModelParallelConfig as context_parallel_size, cp_comm_type, and hierarchical_context_parallel_sizes.[^11] An additional max_seqlen_per_dp_cp_rank parameter bounds the per-rank sequence length to avoid OOM during variable-length training.[^11]
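As a small illustration of how these options fit together, the sketch below assembles the two command-line arguments named above. The helper `build_cp_args` is hypothetical and is not part of Megatron-LM; the set of accepted values mirrors only the strategies listed in this article, and the real argument parser may accept others.

```python
def build_cp_args(context_parallel_size: int = 1, cp_comm_type: str = "p2p") -> list:
    """Return a Megatron-LM style argument fragment for context parallelism.

    Hypothetical helper: it only formats the flags named in the text above.
    """
    allowed = {"p2p", "a2a", "a2a+p2p"}  # strategies listed in this article (assumption)
    if context_parallel_size < 1:
        raise ValueError("context_parallel_size must be >= 1 (1 disables CP)")
    if cp_comm_type not in allowed:
        raise ValueError(f"unknown cp_comm_type: {cp_comm_type!r}")
    return [
        "--context-parallel-size", str(context_parallel_size),
        "--cp-comm-type", cp_comm_type,
    ]


print(" ".join(build_cp_args(context_parallel_size=2, cp_comm_type="p2p")))
```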
With CP of degree N, each device holds 1/N of the activations along the sequence axis, which reduces per-device activation memory by a factor of roughly N.[^1] Per-device compute also falls by a factor of roughly N because each device computes only its local rows of the attention output, although the total system compute is unchanged.[^1] Communication volume per device per layer scales as O(S) because the key and value blocks traverse the ring once, which is far smaller than the O(S^2) attention score matrix, which is never exchanged between devices.[^2]
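These scaling relations can be checked with back-of-the-envelope arithmetic. The sketch below is a rough cost model, not a measurement: the function name is hypothetical, and it assumes multi-head attention with key and value widths equal to the hidden size, 2-byte elements, a single layer, and a batch of one.

```python
def cp_per_device_costs(seq_len: int, cp_size: int, hidden_size: int,
                        bytes_per_element: int = 2) -> dict:
    """Rough per-device, per-layer sizes under context parallelism."""
    local_tokens = seq_len // cp_size
    # One [local_tokens, hidden] activation tensor held by this device.
    activation_bytes = local_tokens * hidden_size * bytes_per_element
    # One K block plus one V block for the local chunk.
    kv_block_bytes = 2 * local_tokens * hidden_size * bytes_per_element
    # Each device forwards a K/V block on cp_size - 1 ring steps.
    kv_sent_bytes = (cp_size - 1) * kv_block_bytes
    return {
        "local_tokens": local_tokens,
        "activation_bytes": activation_bytes,
        "kv_sent_bytes": kv_sent_bytes,
    }


MIB = 1024 ** 2
for cp in (1, 2, 4, 8):
    c = cp_per_device_costs(seq_len=128 * 1024, cp_size=cp, hidden_size=4096)
    print(cp, c["activation_bytes"] // MIB, "MiB activations,",
          c["kv_sent_bytes"] // MIB, "MiB K/V sent")
```

In this model the per-device activation tensor shrinks as 1/N, while the K/V traffic per device approaches a constant proportional to S as N grows, matching the O(S) behavior described above.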
Context parallelism in Megatron-Core is, in its default p2p form, an implementation of the Ring Attention algorithm of Liu, Zaharia, and Abbeel.[^2][^11] The paper introduced blockwise computation of self-attention and feedforward layers combined with a ring topology that fully overlaps key-value block communication with attention computation, enabling sequences up to "device count times longer" than prior memory-efficient transformers.[^2] NVIDIA's contribution was to integrate the algorithm into a production parallelism library, expose it through command-line and Python configuration, and combine it with the other parallelism axes of Megatron-LM.[^1][^11]
DeepSpeed-Ulysses (Jacobs et al., 2023) takes a different communication approach: it partitions inputs along the sequence dimension but performs an all-to-all collective just before attention so that each device receives the full sequence for a subset of the attention heads.[^9] The DeepSpeed-Ulysses authors report a 2.5x training speedup with 4x longer sequences over baselines and a constant communication volume when devices are scaled proportionally to sequence length.[^9] The trade-off is that the degree of head-parallelism is bounded by the number of attention heads, which is typically small (for example 32 to 128 in current frontier models) and conflicts with tensor parallelism that also wants to shard heads.[^13] Megatron-Core's a2a and a2a+p2p modes let users combine the two algorithms: all-to-all within a node where head sharding is cheap and point-to-point across nodes where head sharding would exhaust available heads.[^11][^12]
The Unified Sequence Parallelism (USP) paper of Fang and Zhao (arXiv 2405.07719, May 2024) formalizes the hybrid as 2D sequence parallelism and reports 47 percent model-FLOP utilization on two 8x A800 nodes training Llama-3-8B at sequence length 208K.[^12] USP's open-source implementation was upstreamed into NVIDIA Transformer Engine's AttnFuncWithCPAndKVP2P, which is the kernel that Megatron-Core dispatches into when context parallelism is enabled.[^12][^14]
Context parallelism was introduced into NVIDIA Megatron-Core in early 2024 and is documented in the Megatron-Core developer guide.[^11] Using CP requires Megatron-Core 0.5.0 or later and Transformer Engine 1.1 or later.[^15] The context_parallel package in Megatron-Core 0.15 documents the public API for ring attention dispatch.[^15] NeMo, NVIDIA's higher-level training framework, exposes the same feature via the context_parallel_size field of the MegatronStrategy configuration object.[^16]
NVIDIA's Megatron-Bridge documentation recommends CP=2 in the standard Llama-3 long-context recipe at 8K sequence length and notes that CP becomes mandatory at 1M-token sequence length, where activation recomputation alone is insufficient.[^11] A January 2026 NVIDIA Technical Blog post on Dynamic Context Parallelism reports that adaptive per-microbatch selection of CP size delivers a 1.48x speedup on a GitHub-pretraining workload and a 1.25x speedup on CommonCrawl, with end-to-end gains above 35 percent in multi-thousand-GPU industrial deployments.[^17] Dynamic CP works by pre-constructing CP process groups at multiple sizes (powers of two) at initialization time and selecting per microbatch based on the longest packed sequence, using a token-head-dimension (THD) layout to avoid padding short samples up to the longest length in the batch.[^17]
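The per-microbatch selection step of Dynamic CP can be illustrated with a simple heuristic. The sketch below is an assumption-laden illustration, not NVIDIA's implementation: the function name, the `max_tokens_per_rank` budget, and the rule "pick the smallest pre-built size whose per-rank share fits the budget" are all hypothetical.

```python
def select_cp_size(longest_seq_len: int, max_tokens_per_rank: int,
                   available_sizes=(1, 2, 4, 8)) -> int:
    """Pick the smallest pre-built CP group size that fits the longest sequence."""
    for size in sorted(available_sizes):
        tokens_per_rank = -(-longest_seq_len // size)  # ceiling division
        if tokens_per_rank <= max_tokens_per_rank:
            return size
    raise ValueError("sequence too long for the largest available CP group")


# Short microbatches stay on one rank; long ones use larger CP groups.
for longest in (4096, 30000, 65536):
    print(longest, "->", select_cp_size(longest, max_tokens_per_rank=8192))
```

Choosing the smallest size that fits keeps short microbatches from paying ring-communication overhead they do not need, which is the intuition behind the reported speedups.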
A November 2025 NVIDIA blog on accelerating long-context training in JAX and XLA reports that integrating NVSHMEM with the XLA compiler for CP communication yields up to 36 percent speedup over NCCL on Llama-3 8B at 256K tokens.[^18]
Naive ring attention with causal masks produces severe load imbalance. If the sequence [0, 1, ..., S-1] is split into CP contiguous chunks, the last chunk attends to all earlier tokens and the first chunk attends to nothing earlier, so the device holding the last chunk does roughly CP times more attention work than the device holding the first chunk.[^4][^19] Two reordering tricks restore balance.
The Llama 3 paper describes a zigzag-style split as follows: tokens are split evenly into 2 x CP chunks (rather than CP), and rank i is assigned both chunk i and chunk (2 x CP - i - 1).[^4] This pairs an early-position chunk with a late-position chunk on every rank, equalizing the amount of causal attention computation across ranks.[^4] The Llama 3 authors explicitly state that this "sharding strategy ensures a balanced computation workload among CP ranks".[^4]
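The effect of this assignment can be checked by counting causal query-key pairs per rank. The sketch below compares a contiguous split with the 2 x CP chunk pairing described above; the function names are hypothetical and the count ignores kernel-level details such as block granularity.

```python
def causal_work(positions):
    """Query-key pairs under a causal mask: the query at position p sees p + 1 keys."""
    return sum(p + 1 for p in positions)


def contiguous_split(seq_len: int, cp: int):
    chunk = seq_len // cp
    return [list(range(r * chunk, (r + 1) * chunk)) for r in range(cp)]


def zigzag_split(seq_len: int, cp: int):
    """Rank i gets chunk i and chunk (2*cp - i - 1) out of 2*cp equal chunks."""
    chunk = seq_len // (2 * cp)

    def block(c):
        return list(range(c * chunk, (c + 1) * chunk))

    return [block(r) + block(2 * cp - r - 1) for r in range(cp)]


seq_len, cp = 16, 4
print("contiguous:", [causal_work(p) for p in contiguous_split(seq_len, cp)])
print("zigzag:    ", [causal_work(p) for p in zigzag_split(seq_len, cp)])
```

For this toy case the contiguous split gives per-rank work of 10, 26, 42, and 58 pairs, while the zigzag split gives 34 pairs on every rank.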
The striped variant of Brandon et al. exploits the fact that absolute positions inside the attention computation can be permuted without changing the attended-set per query, as long as position information is restored via the position embeddings (such as RoPE).[^19] Tokens are interleaved across devices, so GPU 0 holds positions {0, CP, 2 x CP, ...}, GPU 1 holds {1, CP+1, 2 x CP+1, ...}, and so on.[^19] Striped achieves nearly perfect load balance under causal masks and is supported in several open-source CP implementations.[^19] Comparative benchmarks find that zigzag and striped both restore load balance, with zigzag being slightly faster than striped at sequence lengths below a few hundred thousand and the gap closing at longer lengths.[^19]
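The interleaved assignment is easy to express directly. The sketch below builds the striped layout and counts causal query-key pairs per rank to show how close to balanced it is; the function names are hypothetical and the count is a simplification of real kernel cost.

```python
def striped_split(seq_len: int, cp: int):
    """Rank r holds positions r, r + cp, r + 2*cp, ..."""
    return [list(range(r, seq_len, cp)) for r in range(cp)]


def causal_work(positions):
    """Query-key pairs under a causal mask: the query at position p sees p + 1 keys."""
    return sum(p + 1 for p in positions)


print(striped_split(16, 4)[1])  # positions held by rank 1

work = [causal_work(p) for p in striped_split(4096, 4)]
print(work, "max/min =", max(work) / min(work))
```

The residual imbalance comes from each rank's positions being offset by at most CP - 1 tokens, so it shrinks as the sequence grows relative to the CP size.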
For training, the Llama 3 paper notes that Meta used an "all-gather based pass-KV algorithm" in which keys and values are all-gathered upfront and then the local attention is computed against the full KV.[^4] This trades extra memory and bandwidth at the start of the attention block for simpler programming and avoids the ring-loop dependency chain.[^4] The companion inference paper (arXiv 2411.01783, Yang et al., November 2024) describes pass-KV and pass-Q ring variants that selectively rotate the smaller of the two tensors during prefill and decode, achieving 93 percent parallelization efficiency on 128 H100 GPUs for 1M-context prefill of Llama-3 405B in 77 seconds.[^20]
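The all-gather approach can be sketched in a few lines of NumPy. This is a single-host simulation under stated assumptions (one head, contiguous chunks, the "all-gather" modeled as a concatenation) and is not Meta's implementation; the function names are hypothetical.

```python
import numpy as np


def causal_attention(q, k, v, q_pos, k_pos):
    """Softmax attention where a query at position p sees only keys at positions <= p."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(k_pos[None, :] > q_pos[:, None], -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def allgather_cp_attention(q_chunks, k_chunks, v_chunks, pos_chunks):
    """Each rank gathers the full K/V once, then attends with its local queries."""
    k_full = np.concatenate(k_chunks)      # stands in for an all-gather of K
    v_full = np.concatenate(v_chunks)      # stands in for an all-gather of V
    pos_full = np.concatenate(pos_chunks)
    return [causal_attention(q, k_full, v_full, pos, pos_full)
            for q, pos in zip(q_chunks, pos_chunks)]


rng = np.random.default_rng(0)
S, D, CP = 16, 8, 4
q, k, v = (rng.standard_normal((S, D)) for _ in range(3))
pos = np.arange(S)
cp_out = np.concatenate(allgather_cp_attention(
    np.split(q, CP), np.split(k, CP), np.split(v, CP), np.split(pos, CP)))
print(np.allclose(cp_out, causal_attention(q, k, v, pos, pos)))
```

Because the mask is driven by explicit token positions rather than chunk order, the same function would work with a reordered (for example zigzag) chunk assignment, at the cost of holding the full K and V on every rank.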
Context parallelism is one dimension of a multi-dimensional partitioning of a transformer training job. The Llama 3 paper describes a 4D parallelism with groups ordered [TP, CP, PP, DP], where DP is implemented as FSDP.[^4] In this layout the innermost (highest-bandwidth) group is tensor parallelism, then context parallelism, then pipeline parallelism, then FSDP, which minimizes the volume of cross-host all-reduces.[^4]
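One way to picture the [TP, CP, PP, DP] ordering is as a mixed-radix decomposition of the global rank, with TP as the fastest-varying index. The sketch below assumes exactly that convention for illustration; the function name is hypothetical, and the actual rank layout used by Meta or by Megatron-Core may differ in detail.

```python
def rank_to_coords(rank: int, tp: int, cp: int, pp: int, dp: int) -> dict:
    """Decompose a global rank into (tp, cp, pp, dp) coordinates, TP innermost."""
    world_size = tp * cp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    return {
        "tp": rank % tp,
        "cp": (rank // tp) % cp,
        "pp": (rank // (tp * cp)) % pp,
        "dp": rank // (tp * cp * pp),
    }


# Example: TP=8 fills one 8-GPU node, so CP neighbors sit on adjacent nodes.
for r in (0, 7, 8, 16, 32, 63):
    print(r, rank_to_coords(r, tp=8, cp=2, pp=2, dp=2))
```

Under this convention, ranks that differ only in their TP coordinate are contiguous, which places the most communication-intensive group on the fastest links.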
The interaction with other axes is constrained.
| Axis | Interaction with CP |
|---|---|
| Tensor Parallelism | Composable; CP and TP shard orthogonal dimensions (sequence vs. hidden). Activations are sharded both ways simultaneously.[^1][^11] |
| Pipeline Parallelism | Composable; CP is local to each pipeline stage and ring communication happens within the CP group of that stage.[^1] |
| FSDP / Data Parallelism | Composable; CP partitions sequence while DP partitions batch. Llama 3 uses CP inside FSDP groups.[^4] |
| Expert Parallelism | Composable; CP shards sequences and EP shards experts. NeMo and Megatron-Core support combined CP+EP+TP for MoE long-context.[^16] |
| Flash Attention | Required dependency; each ring iteration runs a Flash Attention kernel on the local Q against the rotating K, V chunks, and the LSE merge from Flash Attention is what makes ring attention mathematically equivalent to dense attention.[^2][^10] |
| Sequence parallelism (Megatron-LM 2) | Composable but orthogonal; sequence parallelism shards LayerNorm and Dropout activations along the sequence axis only inside a TP group, while CP shards everything along the sequence axis across the CP group.[^1][^8] |
A consequence of this layout is that the CP communicator typically runs over NVLink within a node (for cheap p2p) or over InfiniBand across nodes when CP exceeds 8.[^4][^11]
The Llama 3 paper (Meta, July 2024) is the most detailed public account of context parallelism in a large training run.[^4] Meta used CP to extend Llama-3.1-405B's context window from 8K (the dense pretraining length) to 128K through six progressive continued-pretraining stages.[^4] The 4D parallelism order TP-CP-PP-DP is the same one exposed in NVIDIA Megatron-Core, and the all-gather pass-KV variant Meta used has been re-implemented in OSS context parallelism libraries.[^4][^12]
Google's Gemini family of models extends context to 1M and 2M tokens; multiple secondary sources state that Gemini's training stack uses sequence-dimension partitioning equivalent to context parallelism, although Google has not published the algorithmic details.[^21] The 360-LLaMA-Factory project (arXiv 2505.22296, May 2025) added a plug-and-play sequence-parallel post-training implementation built on the same ring algorithm.[^22] Microsoft DeepSpeed exposes a related family of sequence-parallel attention variants under the Ulysses name, and Snowflake's Arctic Long Sequence Training adopts DeepSpeed-Ulysses partitioning for fine-tuning Hugging Face models at million-token contexts.[^9][^13]
Context parallelism was the missing piece that unlocked routine training on 128K-token and longer contexts in 2024.[^4][^7] Before CP, the dominant strategies were activation recomputation (slow), aggressive TP (bandwidth-bound), or restricted attention patterns such as sliding-window or sparse attention (lossy).[^7] CP is the first technique that reduces per-device activation memory in proportion to the number of devices in the CP group while remaining mathematically equivalent to dense attention.[^1][^2] This combination is what made Llama 3.1's 128K context, Gemini 2.5 Pro's 1M-token context, and similar systems feasible to train at scale without resorting to approximations.[^4][^21]
CP has also reshaped inference. Meta's million-token inference paper reports near-linear scaling of prefill latency up to 128 H100 GPUs for Llama-3 405B by adopting CP with pass-KV and pass-Q variants, attaining 93 percent parallelization efficiency on a 1M-token prompt.[^20] The technique works on both NVLink-rich and TCP-only datacenters, indicating that the communication cost is well within the budget of modern interconnects.[^20]
Context parallelism has several documented limitations.