# Context Parallelism

> Source: https://aiwiki.ai/wiki/context_parallelism
> Updated: 2026-06-07
> Categories: AI Infrastructure, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Context Parallelism (CP)** is a distributed training strategy that partitions the input sequence dimension of a [transformer](/wiki/transformer) across multiple accelerators and uses ring-style point-to-point communication to exchange key and value tensors during attention computation.[^1] The technique allows transformer models to be trained on sequences far longer than would fit in the activation memory of a single GPU, with each device computing attention only for its local query chunk while keys and values circulate around the participating devices.[^1][^2] The name "context parallelism" was popularized by NVIDIA when the algorithm was added to Megatron-LM and Megatron-Core in 2024; the underlying attention algorithm was first described in the Ring Attention paper by Hao Liu, Matei Zaharia, and Pieter Abbeel of UC Berkeley in October 2023.[^2][^3] Context parallelism became a core component of the training stack for [Llama 3.1](/wiki/llama_3_1) (128K tokens), and comparable sequence-dimension partitioning is reported for other frontier models with very long context windows, such as [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) (1M tokens).[^4][^5]

## Background

The memory footprint of activations in a transformer layer grows linearly with the sequence length S, and the attention matrix grows quadratically with S.[^6] For a model trained at 128K tokens, the activations for a single microbatch easily exceed the high-bandwidth memory (HBM) of an [NVIDIA H100](/wiki/nvidia_h100) GPU even when [Flash Attention](/wiki/flash_attention) is used to avoid materializing the full attention score matrix.[^7] Three earlier mitigations were widely deployed before context parallelism emerged. First, activation recomputation (also called gradient checkpointing) discards intermediate activations on the forward pass and recomputes them on the backward pass, at the cost of roughly 30 percent additional compute per training step.[^7] Second, [tensor parallelism](/wiki/tensor_parallelism) shards weights and activations across the hidden dimension, but adding more tensor-parallel ranks shrinks the per-rank compute so much that it can no longer overlap with the all-reduce communication, hurting throughput.[^7] Third, sequence parallelism in the Megatron-LM 2 paper of 2022 shards only the activations of LayerNorm and Dropout layers along the sequence dimension; it leaves the attention block fully replicated and therefore does not solve the long-context activation problem.[^8]

Two more aggressive ideas appeared in fall 2023. DeepSpeed-Ulysses (arXiv 2309.14509, September 2023) partitions inputs along the sequence dimension and then uses an all-to-all collective immediately before attention so that each rank receives the full sequence but only for a non-overlapping subset of attention heads.[^9] Ring Attention with Blockwise Transformers (arXiv 2310.01889, October 2023) keeps the sequence sharded throughout the attention block and circulates key and value blocks around a logical ring of devices, overlapping the block transfers with the computation of partial attention scores.[^2] NVIDIA adopted the Ring Attention algorithm as the core of its production "context parallelism" feature in Megatron-LM, and Meta used the same family of algorithms for the 128K-token training and inference of Llama 3.1.[^1][^4]

## How it works

### Sequence sharding

A context-parallel group of size CP partitions an input sequence of length S into CP non-overlapping chunks along the sequence axis.[^1] Each device holds the queries Q_i, keys K_i, and values V_i for its local chunk, plus the full slice of model weights replicated across the CP group (weights are sharded across the tensor-parallel and pipeline-parallel dimensions if those are also in use).[^1] All operations outside attention (linear projections, MLP, LayerNorm, residual connections, embedding) operate independently on each chunk because they are pointwise or per-token in the sequence dimension.[^1] Only the attention operator requires cross-device communication because each query token must, in principle, attend to every key and value token.
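The sharding step can be illustrated with a minimal sketch in plain Python. The helper names `shard_sequence` and `per_token_op` are hypothetical and stand in for the real tensor slicing and per-token layers; the point is only that per-token operations give the same result whether they run on the full sequence or shard by shard.

```python
def shard_sequence(tokens, cp_size):
    """Split `tokens` into `cp_size` contiguous, equal-length chunks."""
    if len(tokens) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size (pad first)")
    chunk = len(tokens) // cp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]

def per_token_op(chunk):
    """Stand-in for a per-token layer (MLP, LayerNorm): no cross-token dependency."""
    return [2 * x + 1 for x in chunk]

tokens = list(range(16))
shards = shard_sequence(tokens, cp_size=4)

# Running the per-token op on each shard and concatenating the results
# matches running it on the unsharded sequence.
local_out = [y for shard in shards for y in per_token_op(shard)]
assert local_out == per_token_op(tokens)
```

Attention is the one operator for which this shard-local evaluation is not sufficient, which is what the ring exchange addresses.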

### Ring rotation of K and V

The ring attention algorithm proceeds in CP iterations.[^2] In iteration t, device i computes a partial softmax-and-output of its local Q_i against the K and V slice it currently holds, while in parallel posting an asynchronous SendRecv that ships that same K and V slice to neighbor (i+1) mod CP and receives the next K and V slice from neighbor (i-1) mod CP.[^2] After CP iterations every Q_i has been multiplied against every K_j and V_j, and the partial outputs are merged using the log-sum-exp accumulation trick from [Flash Attention](/wiki/flash_attention) so that the final result is mathematically equivalent to dense attention computed on a single device, up to floating-point rounding from the different summation order.[^2][^10] Because the SendRecv runs in parallel with the block attention compute, the communication can be fully hidden whenever the per-block compute time exceeds the per-block transfer time, which is the regime that holds for sequences in the tens of thousands of tokens on modern interconnects.[^2]
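The rotation and merge can be simulated on a single process. The sketch below is a simplified illustration and not the Megatron-Core or Transformer Engine implementation: it uses plain Python lists, omits the causal mask and the 1/sqrt(d) scaling, and runs the devices sequentially rather than overlapping communication with compute. All function names are hypothetical.

```python
import math
import random

def block_attention(q_blk, k_blk, v_blk):
    """Score one Q block against one K/V block.

    Returns one (row_max, denominator, unnormalized_output) triple per query row."""
    stats = []
    for q in q_blk:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        o = [sum(wj * v[c] for wj, v in zip(w, v_blk)) for c in range(len(v_blk[0]))]
        stats.append((m, sum(w), o))
    return stats

def merge(acc, new):
    """Combine two partial results for the same query rows by rescaling both
    to a shared row maximum (the log-sum-exp accumulation)."""
    merged = []
    for (m1, l1, o1), (m2, l2, o2) in zip(acc, new):
        m = max(m1, m2)
        a1, a2 = math.exp(m1 - m), math.exp(m2 - m)
        merged.append((m, a1 * l1 + a2 * l2, [a1 * x + a2 * y for x, y in zip(o1, o2)]))
    return merged

def normalize(stats):
    return [[x / l for x in o] for (_, l, o) in stats]

def ring_attention(q_shards, k_shards, v_shards):
    cp = len(q_shards)
    held = list(zip(k_shards, v_shards))  # held[i]: K/V block currently on device i
    acc = [None] * cp
    for _ in range(cp):
        for i in range(cp):  # these run concurrently on real hardware
            part = block_attention(q_shards[i], *held[i])
            acc[i] = part if acc[i] is None else merge(acc[i], part)
        # Device i sends its block to (i + 1) % cp and receives from (i - 1) % cp.
        held = [held[(i - 1) % cp] for i in range(cp)]
    return [normalize(dev) for dev in acc]

def dense_attention(q, k, v):
    return normalize(block_attention(q, k, v))

random.seed(0)
S, D, CP = 8, 4, 4
q, k, v = ([[random.gauss(0, 1) for _ in range(D)] for _ in range(S)] for _ in range(3))
shard = lambda x: [x[r * S // CP:(r + 1) * S // CP] for r in range(CP)]

ring = [row for dev in ring_attention(shard(q), shard(k), shard(v)) for row in dev]
dense = dense_attention(q, k, v)
assert all(math.isclose(a, b, abs_tol=1e-9)
           for ring_row, dense_row in zip(ring, dense)
           for a, b in zip(ring_row, dense_row))
```

The final assertion checks agreement within a small tolerance rather than exact equality, because the merged result sums the same terms in a different order than the dense computation.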

### Communication topology

NVIDIA Megatron-Core exposes the communication strategy via the `cp_comm_type` configuration option.[^11] The default `p2p` mode uses pairwise SendRecv around the ring exactly as in the original Ring Attention paper.[^11] An `a2a` (all-to-all) mode reproduces the DeepSpeed-Ulysses head-partitioning strategy, and a hierarchical mode combines the two by using all-to-all within a node and point-to-point across nodes, which is the pattern explored in academic work on Unified Sequence Parallelism.[^12] In Megatron-Core the `hierarchical_context_parallel_sizes` parameter lets users specify the inner all-to-all size and the outer point-to-point size as a two-element list.[^11]

### Argument names in Megatron-LM

Megatron-LM exposes the CP size as the command-line argument `--context-parallel-size`, with a default value of 1 (CP disabled).[^11] The companion argument `--cp-comm-type` selects among `p2p`, `a2a`, and `a2a+p2p` strategies.[^11] In the Python API the same fields appear on `ModelParallelConfig` as `context_parallel_size`, `cp_comm_type`, and `hierarchical_context_parallel_sizes`.[^11] An additional `max_seqlen_per_dp_cp_rank` parameter bounds the per-rank sequence length to avoid OOM during variable-length training.[^11]
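As a hedged illustration of how these options fit together, the sketch below builds the two command-line flags named above after some basic sanity checks. The helper `cp_cli_args` is hypothetical, and the rule that the two hierarchical sizes must multiply to the CP size is an assumption made for this example rather than a statement of Megatron-LM's own validation logic.

```python
from math import prod

VALID_COMM_TYPES = ("p2p", "a2a", "a2a+p2p")  # the three values named in the text

def cp_cli_args(context_parallel_size=1, cp_comm_type="p2p", hierarchical_sizes=None):
    """Return the CP-related command-line flags as a list of strings."""
    if context_parallel_size < 1:
        raise ValueError("context_parallel_size must be at least 1")
    if cp_comm_type not in VALID_COMM_TYPES:
        raise ValueError(f"cp_comm_type must be one of {VALID_COMM_TYPES}")
    if cp_comm_type == "a2a+p2p":
        # Assumption: [inner all-to-all size, outer point-to-point size]
        # should multiply to the total CP size.
        if (not hierarchical_sizes or len(hierarchical_sizes) != 2
                or prod(hierarchical_sizes) != context_parallel_size):
            raise ValueError("hierarchical sizes must be two values whose product is the CP size")
    return ["--context-parallel-size", str(context_parallel_size),
            "--cp-comm-type", cp_comm_type]

print(" ".join(cp_cli_args(4, "p2p")))
```

The hierarchical sizes are validated but not emitted, because the text above names that setting only as a Python configuration field.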

### Memory and compute scaling

With CP of degree N, each device holds 1/N of the activations along the sequence axis, which reduces per-device activation memory by roughly a factor of N.[^1] The per-device compute also decreases by roughly a factor of N because each device computes only its local rows of the attention output, although the total system compute is unchanged.[^1] Communication volume per device per layer scales as O(S) because the key and value blocks traverse the ring once, which is far less than the O(S^2) volume that exchanging attention scores between devices would require.[^2]
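A back-of-the-envelope calculation makes the scaling concrete. The sketch below assumes hypothetical model dimensions (hidden size 4096, a key/value width of 1024 as in a grouped-query layout, 2-byte elements) and counts only a single `[tokens, hidden]` activation tensor and the K/V blocks sent around the ring for one layer; real per-layer activation footprints are a multiple of this.

```python
def cp_per_device_costs(seq_len, cp_size, hidden, kv_hidden, bytes_per_elem=2):
    """Rough per-device, per-layer sizes under context parallelism."""
    local_tokens = seq_len // cp_size
    # One activation tensor of shape [local_tokens, hidden].
    activation_bytes = local_tokens * hidden * bytes_per_elem
    # Each ring step ships one K block and one V block of the local chunk size.
    kv_block_bytes = 2 * local_tokens * kv_hidden * bytes_per_elem
    # A block returns to its owner after cp_size hops, so cp_size - 1 hops move data.
    ring_bytes_sent = (cp_size - 1) * kv_block_bytes
    return {"local_tokens": local_tokens,
            "activation_bytes": activation_bytes,
            "ring_bytes_sent": ring_bytes_sent}

base = cp_per_device_costs(seq_len=131072, cp_size=8, hidden=4096, kv_hidden=1024)
print(base)
# {'local_tokens': 16384, 'activation_bytes': 134217728, 'ring_bytes_sent': 469762048}

# Doubling the sequence length doubles the bytes each device sends: linear in S.
longer = cp_per_device_costs(seq_len=262144, cp_size=8, hidden=4096, kv_hidden=1024)
assert longer["ring_bytes_sent"] == 2 * base["ring_bytes_sent"]
```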

## Relationship to Ring Attention and DeepSpeed-Ulysses

Context parallelism in Megatron-Core is, in its default p2p form, an implementation of the Ring Attention algorithm of Liu, Zaharia, and Abbeel.[^2][^11] The paper introduced blockwise computation of self-attention and feedforward layers combined with a ring topology that fully overlaps key-value block communication with attention computation, enabling sequences up to "device count times longer" than prior memory-efficient transformers.[^2] NVIDIA's contribution was to integrate the algorithm into a production parallelism library, expose it through command-line and Python configuration, and combine it with the other parallelism axes of Megatron-LM.[^1][^11]

DeepSpeed-Ulysses (Jacobs et al., 2023) takes a different communication approach: it partitions inputs along the sequence dimension but performs an all-to-all collective just before attention so that each device receives the full sequence for a subset of the attention heads.[^9] The DeepSpeed-Ulysses authors report a 2.5x training speedup with 4x longer sequences over baselines and a constant communication volume when devices are scaled proportionally to sequence length.[^9] The trade-off is that the degree of head-parallelism is bounded by the number of attention heads, which is typically small (for example 32 to 128 in current frontier models) and conflicts with tensor parallelism that also wants to shard heads.[^13] Megatron-Core's `a2a` and `a2a+p2p` modes let users combine the two algorithms: all-to-all within a node where head sharding is cheap and point-to-point across nodes where head sharding would exhaust available heads.[^11][^12]
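The layout change performed by the all-to-all can be sketched by tracking which (token, head) pairs each rank holds. This is an index-level illustration with hypothetical helper names, not DeepSpeed code; it also shows why the parallel degree cannot exceed the number of heads.

```python
def sequence_sharded(seq_len, num_heads, ranks):
    """Before attention: rank r holds a contiguous token slice for every head."""
    chunk = seq_len // ranks
    return [[(t, h) for t in range(r * chunk, (r + 1) * chunk) for h in range(num_heads)]
            for r in range(ranks)]

def ulysses_all_to_all(local_shards, num_heads):
    """After the all-to-all: rank r holds every token for its own subset of heads."""
    ranks = len(local_shards)
    if num_heads % ranks != 0:
        raise ValueError("number of heads must be divisible by the parallel degree")
    heads_per_rank = num_heads // ranks
    out = [[] for _ in range(ranks)]
    for shard in local_shards:
        for token, head in shard:
            out[head // heads_per_rank].append((token, head))
    return [sorted(s) for s in out]

before = sequence_sharded(seq_len=8, num_heads=4, ranks=4)
after = ulysses_all_to_all(before, num_heads=4)

# Rank 1 started with tokens 2..3 for all heads and ends with all tokens for head 1.
assert before[1] == [(2, 0), (2, 1), (2, 2), (2, 3), (3, 0), (3, 1), (3, 2), (3, 3)]
assert after[1] == [(t, 1) for t in range(8)]
```

Calling `ulysses_all_to_all` with eight ranks and four heads raises `ValueError`, which mirrors the head-count bound described above.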

The Unified Sequence Parallelism (USP) paper of Fang and Zhao (arXiv 2405.07719, May 2024) formalizes the hybrid as 2D sequence parallelism and reports 47 percent model-FLOP utilization on two 8x A800 nodes training Llama-3-8B at sequence length 208K.[^12] USP's open-source implementation was upstreamed into NVIDIA Transformer Engine's `AttnFuncWithCPAndKVP2P`, which is the kernel that Megatron-Core dispatches into when context parallelism is enabled.[^12][^14]

## Adoption in Megatron-LM and Megatron-Core

Context parallelism was introduced into NVIDIA Megatron-Core in early 2024 and is documented in the Megatron-Core developer guide.[^11] Using CP requires Megatron-Core 0.5.0 or later and Transformer Engine 1.1 or later.[^15] The `context_parallel` package in Megatron-Core 0.15 documents the public API for ring attention dispatch.[^3][^15] NeMo, NVIDIA's higher-level training framework, exposes the same feature via the `context_parallel_size` field of the MegatronStrategy configuration object.[^16]

NVIDIA's Megatron-Bridge documentation recommends CP=2 in the standard Llama-3 long-context recipe at 8K sequence length and notes that CP becomes mandatory at 1M-token sequence length, where activation recomputation alone is insufficient.[^11] A January 2026 NVIDIA Technical Blog post on Dynamic Context Parallelism reports that adaptive per-microbatch selection of CP size delivers a 1.48x speedup on a GitHub-pretraining workload and a 1.25x speedup on CommonCrawl, with end-to-end gains above 35 percent in multi-thousand-GPU industrial deployments.[^17] Dynamic CP works by pre-constructing CP process groups at multiple sizes (powers of two) at initialization time and selecting per microbatch based on the longest packed sequence, using a token-head-dimension (THD) layout to avoid padding short samples up to the longest length in the batch.[^17]
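The per-microbatch selection can be sketched as follows. The rule used here, picking the smallest pre-built power-of-two CP size that keeps the longest packed sequence under a per-rank token budget, is an assumption made for illustration; the function name and the budget parameter are hypothetical and the actual Megatron-Core heuristic may differ.

```python
def pick_cp_size(longest_seq_len, max_tokens_per_rank, available_cp_sizes=(1, 2, 4, 8)):
    """Return the smallest available CP size whose per-rank share fits the budget."""
    for cp in sorted(available_cp_sizes):
        tokens_per_rank = -(-longest_seq_len // cp)  # ceiling division
        if tokens_per_rank <= max_tokens_per_rank:
            return cp
    raise ValueError("sequence does not fit even at the largest available CP size")

# Short microbatches stay on one rank; long ones spread across more ranks.
for longest in (3_000, 30_000, 65_536):
    print(longest, "->", pick_cp_size(longest, max_tokens_per_rank=8_192))
# 3000 -> 1
# 30000 -> 4
# 65536 -> 8
```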

A November 2025 NVIDIA blog on accelerating long-context training in JAX and XLA reports that integrating NVSHMEM with the XLA compiler for CP communication yields up to 36 percent speedup over NCCL on Llama-3 8B at 256K tokens.[^18]

## Variants and load balancing

### Load imbalance under causal masks

Naive ring attention with causal masks produces severe load imbalance. If the sequence [0, 1, ..., S-1] is split into CP contiguous chunks, the last chunk attends to all earlier tokens while the first chunk attends only to tokens within itself, so the device holding the last chunk does on the order of CP times more attention work than the device holding the first chunk (about 2 x CP - 1 times as many unmasked query-key pairs when chunks are long).[^4][^19] Two reordering tricks restore balance.
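The imbalance can be quantified by counting unmasked query-key pairs per rank. The sketch below uses a toy sequence of 16 tokens and hypothetical helper names; it counts pairs only and ignores any kernel-level effects.

```python
def causal_pairs(query_positions):
    """Query-key pairs a rank must score under a causal mask:
    the query at position p attends to keys 0..p, i.e. p + 1 keys."""
    return sum(p + 1 for p in query_positions)

def contiguous_split(seq_len, cp_size):
    chunk = seq_len // cp_size
    return [list(range(r * chunk, (r + 1) * chunk)) for r in range(cp_size)]

work = [causal_pairs(pos) for pos in contiguous_split(seq_len=16, cp_size=4)]
print(work)                # [10, 26, 42, 58]
print(work[-1] / work[0])  # 5.8
```

With four ranks the last rank already scores 5.8 times as many pairs as the first, and the ratio grows as the chunks get longer.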

### Zigzag splitting

The Llama 3 paper describes a zigzag-style split as follows: tokens are split evenly into 2 x CP chunks (rather than CP), and rank i is assigned both chunk i and chunk (2 x CP - i - 1).[^4] This pairs an early-position chunk with a late-position chunk on every rank, equalizing the amount of causal attention computation across ranks.[^4] The Llama 3 authors explicitly state that this "sharding strategy ensures a balanced computation workload among CP ranks".[^4]
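The chunk assignment described above can be written out directly. This is a position-level sketch with hypothetical helper names; it counts unmasked query-key pairs per rank to show the balancing effect on a toy sequence.

```python
def causal_pairs(query_positions):
    """The query at position p attends to keys 0..p under a causal mask."""
    return sum(p + 1 for p in query_positions)

def zigzag_split(seq_len, cp_size):
    """Split into 2 * cp_size chunks; rank i gets chunk i and chunk 2 * cp_size - i - 1."""
    n_chunks = 2 * cp_size
    if seq_len % n_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_size")
    c = seq_len // n_chunks
    chunks = [list(range(j * c, (j + 1) * c)) for j in range(n_chunks)]
    return [chunks[i] + chunks[n_chunks - i - 1] for i in range(cp_size)]

assignment = zigzag_split(seq_len=16, cp_size=4)
print(assignment[0])  # [0, 1, 14, 15]
print(assignment[3])  # [6, 7, 8, 9]

work = [causal_pairs(pos) for pos in assignment]
print(work)           # [34, 34, 34, 34]
```

Each rank scores the same number of pairs because every early chunk is paired with the late chunk that mirrors it.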

### Striped attention

The striped variant of Brandon et al. exploits the fact that absolute positions inside the attention computation can be permuted without changing the attended-set per query, as long as position information is restored via the position embeddings (such as [RoPE](/wiki/rope)).[^19] Tokens are interleaved across devices, so GPU 0 holds positions {0, CP, 2 x CP, ...}, GPU 1 holds {1, CP+1, 2 x CP+1, ...}, and so on.[^19] Striped achieves nearly perfect load balance under causal masks and is supported in several open-source CP implementations.[^19] Comparative benchmarks find that zigzag and striped both restore load balance, with zigzag being slightly faster than striped at sequence lengths below a few hundred thousand and the gap closing at longer lengths.[^19]
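The interleaved assignment can be sketched the same way. The helper names are hypothetical, and the work metric is again the count of unmasked query-key pairs per rank.

```python
def causal_pairs(query_positions):
    """The query at position p attends to keys 0..p under a causal mask."""
    return sum(p + 1 for p in query_positions)

def striped_split(seq_len, cp_size):
    """Rank r holds positions r, r + cp_size, r + 2 * cp_size, ..."""
    return [list(range(r, seq_len, cp_size)) for r in range(cp_size)]

small = striped_split(seq_len=16, cp_size=4)
print(small[0])                          # [0, 4, 8, 12]
print([causal_pairs(p) for p in small])  # [28, 32, 36, 40]

# The residual imbalance is one pair per held token, so it shrinks relative
# to the total work as the sequence gets longer.
large = [causal_pairs(p) for p in striped_split(seq_len=1024, cp_size=4)]
assert max(large) / min(large) < 1.01
```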

### All-gather pass-KV (Llama-3 training)

For training, the Llama 3 paper notes that Meta used an "all-gather based pass-KV algorithm" in which keys and values are all-gathered upfront and then the local attention is computed against the full KV.[^4] This trades extra memory and bandwidth at the start of the attention block for simpler programming and avoids the ring-loop dependency chain.[^4] The companion inference paper (arXiv 2411.01783, Yang et al., November 2024) describes pass-KV and pass-Q ring variants that selectively rotate the smaller of the two tensors during prefill and decode, achieving 93 percent parallelization efficiency on 128 H100 GPUs for 1M-context prefill of Llama-3 405B in 77 seconds.[^20]
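A single-process sketch of the all-gather variant follows. It is an illustration in plain Python lists with hypothetical function names, not Meta's implementation: every rank reconstructs the full K and V and then runs ordinary causal attention for its local queries, using global positions for the mask.

```python
import math
import random

def causal_attention(q_rows, q_pos, keys, values):
    """Dense softmax attention; the query at global position p sees keys 0..p.
    `keys` and `values` must be in global position order."""
    out = []
    for q, p in zip(q_rows, q_pos):
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys[:p + 1]]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, values)) / z
                    for c in range(len(values[0]))])
    return out

def allgather_pass_kv(q_shards, k_shards, v_shards, pos_shards):
    # "All-gather": every rank ends up with the full K and V, in position order.
    gathered = sorted(
        (p, k, v)
        for ks, vs, ps in zip(k_shards, v_shards, pos_shards)
        for p, k, v in zip(ps, ks, vs)
    )
    full_k = [k for _, k, _ in gathered]
    full_v = [v for _, _, v in gathered]
    # Each rank then computes causal attention for its local queries only.
    return [causal_attention(qs, ps, full_k, full_v)
            for qs, ps in zip(q_shards, pos_shards)]

random.seed(0)
S, D, CP = 8, 4, 4
q, k, v = ([[random.gauss(0, 1) for _ in range(D)] for _ in range(S)] for _ in range(3))
pos = [list(range(r * S // CP, (r + 1) * S // CP)) for r in range(CP)]
take = lambda x: [[x[p] for p in ps] for ps in pos]

sharded = [row for dev in allgather_pass_kv(take(q), take(k), take(v), pos) for row in dev]
dense = causal_attention(q, list(range(S)), k, v)
assert all(math.isclose(a, b, abs_tol=1e-12)
           for rs, rd in zip(sharded, dense) for a, b in zip(rs, rd))
```

There is no loop over ring steps here, which is the simplification the text refers to; the cost is that each rank holds the full K and V during attention.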

## Integration with other parallelism axes

Context parallelism is one dimension of a multi-dimensional partitioning of a transformer training job. The Llama 3 paper describes a 4D parallelism with groups ordered `[TP, CP, PP, DP]`, where DP is implemented as [FSDP](/wiki/fsdp).[^4] In this layout the innermost (highest-bandwidth) group is tensor parallelism, then context parallelism, then pipeline parallelism, then FSDP, which minimizes the volume of cross-host all-reduces.[^4]
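One way to read the `[TP, CP, PP, DP]` ordering is as a mixed-radix decomposition of the global rank in which TP varies fastest. The sketch below assumes that simple row-major layout for illustration; the function names are hypothetical and real frameworks may build their process groups differently.

```python
def rank_to_coords(rank, tp, cp, pp, dp):
    """Decompose a global rank with TP varying fastest, then CP, then PP, then DP."""
    if not 0 <= rank < tp * cp * pp * dp:
        raise ValueError("rank out of range")
    return {
        "tp": rank % tp,
        "cp": (rank // tp) % cp,
        "pp": (rank // (tp * cp)) % pp,
        "dp": rank // (tp * cp * pp),
    }

def cp_group(rank, tp, cp, pp, dp):
    """Global ranks that share this rank's TP, PP and DP coordinates."""
    c = rank_to_coords(rank, tp, cp, pp, dp)
    base = c["tp"] + tp * cp * (c["pp"] + pp * c["dp"])
    return [base + tp * j for j in range(cp)]

# 64 GPUs: TP=8 fills one 8-GPU node, so CP=2 peers sit on neighboring nodes.
print(rank_to_coords(37, tp=8, cp=2, pp=2, dp=2))  # {'tp': 5, 'cp': 0, 'pp': 0, 'dp': 1}
print(cp_group(37, tp=8, cp=2, pp=2, dp=2))        # [37, 45]
```

Under this layout the members of a CP group are `tp` ranks apart, so they stay close in rank order while TP keeps the tightest grouping.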

The interaction with other axes is constrained.

| Axis | Interaction with CP |
|---|---|
| [Tensor Parallelism](/wiki/tensor_parallelism) | Composable; CP and TP shard orthogonal dimensions (sequence vs. hidden). Activations are sharded both ways simultaneously.[^1][^11] |
| [Pipeline Parallelism](/wiki/pipeline_parallelism) | Composable; CP is local to each pipeline stage and ring communication happens within the CP group of that stage.[^1] |
| [FSDP](/wiki/fsdp) / [Data Parallelism](/wiki/data_parallelism) | Composable; CP partitions sequence while DP partitions batch. Llama 3 uses CP inside FSDP groups.[^4] |
| [Expert Parallelism](/wiki/mixture_of_experts) | Composable; CP shards sequences and EP shards experts. NeMo and Megatron-Core support combined CP+EP+TP for MoE long-context.[^16] |
| [Flash Attention](/wiki/flash_attention) | Required dependency; each ring iteration runs a Flash Attention kernel on the local Q against the rotating K, V chunks, and the LSE merge from Flash Attention is what makes ring attention mathematically equivalent to dense attention.[^2][^10] |
| Sequence parallelism (Megatron-LM 2) | Composable but orthogonal; sequence parallelism shards LayerNorm and Dropout activations along the sequence axis only inside a TP group, while CP shards everything along the sequence axis across the CP group.[^1][^8] |

A consequence of this layout is that the CP communicator typically runs over [NVLink](/wiki/nvlink) within a node (for cheap p2p) or over InfiniBand across nodes when CP exceeds 8.[^4][^11]

## Adoption in frontier model training

The Llama 3 paper (Meta, July 2024) is the most detailed public account of context parallelism in a large training run.[^4] Meta used CP to extend Llama-3.1-405B's context window from 8K (the dense pretraining length) to 128K through six progressive continued-pretraining stages.[^4] The 4D parallelism order TP-CP-PP-DP is the same one exposed in NVIDIA Megatron-Core, and the all-gather pass-KV variant Meta used has been re-implemented in OSS context parallelism libraries.[^4][^12]

Google's [Gemini](/wiki/gemini) family of models extends context to 1M and 2M tokens; multiple secondary sources state that Gemini's training stack uses sequence-dimension partitioning equivalent to context parallelism, although Google has not published the algorithmic details.[^21] The 360-LLaMA-Factory project (arXiv 2505.22296, May 2025) added a plug-and-play sequence-parallel post-training implementation built on the same ring algorithm.[^22] Microsoft DeepSpeed exposes a related family of sequence-parallel attention variants under the Ulysses name, and Snowflake's Arctic Long Sequence Training adopts DeepSpeed-Ulysses partitioning for fine-tuning at million-token contexts.[^9][^13]

## Significance

Context parallelism was the missing piece that unlocked routine training on 128K-token and longer contexts in 2024.[^4][^7] Before CP, the dominant strategies were activation recomputation (slow), aggressive TP (bandwidth-bound), or restricted attention patterns such as sliding-window or sparse attention (lossy).[^7] CP is the first technique that reduces per-device activation memory in proportion to the number of devices in the CP group while remaining mathematically equivalent to dense attention.[^1][^2] This combination is what made Llama 3.1's 128K context, [Gemini 2.5 Pro](/wiki/gemini_2_5_pro)'s 1M-token context, and similar systems feasible to train at scale without resorting to approximations.[^4][^21]

CP has also reshaped inference. Meta's million-token inference paper reports near-linear scaling of prefill latency up to 128 H100 GPUs for Llama-3 405B by adopting CP with pass-KV and pass-Q variants, attaining 93 percent parallelization efficiency on a 1M-token prompt.[^20] The technique works on both NVLink-rich and TCP-only datacenters, indicating that the communication cost is well within the budget of modern interconnects.[^20]

## Limitations

Context parallelism has several documented limitations.

- **Load imbalance under causal masks.** Without zigzag or striped reordering, the device holding the latest portion of the sequence does on the order of CP times more attention compute than the device holding the earliest portion, since causal masks prevent earlier tokens from attending to later ones.[^4][^19] Reordering fixes the average case but introduces some implementation complexity, especially for attention masks with arbitrary patterns (sliding window plus prefix, document boundaries inside packed sequences).[^19]
- **Diminishing returns at short sequences.** NVIDIA benchmarks show CP improves teraflops per device starting at sequence length 32K on Llama-3 8B, but at shorter sequences the ring communication overhead is not amortized and CP can hurt throughput.[^7]
- **Variable-length training overhead.** Real-world pretraining data has wildly varying sequence lengths; a static CP size sized for the longest sample wastes work on short samples. Dynamic Context Parallelism addresses this with per-microbatch group selection, but introduces additional bookkeeping and pre-constructed multi-size CP groups.[^17]
- **Decode inefficiency.** During autoregressive decode each generated token requires a fresh all-gather or ring pass over the KV cache, and the per-token computation is too small to hide the communication. Meta's million-token inference paper reports that time-to-incremental-token (TTIT) actually degrades with more CP nodes during decode and recommends decoupling parallelism for prefill versus decode.[^20]
- **Conflict with head-sharding TP.** The all-to-all CP variant (Ulysses-style) shards attention heads, which collides with tensor parallelism that also wants to shard heads. In Megatron-Core, hierarchical CP keeps a2a within a node and uses p2p across nodes to limit the conflict, but the configuration space is non-trivial.[^11][^12]

## Related work

- [Ring Attention](/wiki/ring_attention): the underlying algorithm.[^2]
- [Flash Attention](/wiki/flash_attention): kernel-level building block whose LSE accumulation enables the ring composition to be exact.[^10]
- [Tensor Parallelism](/wiki/tensor_parallelism): orthogonal sharding of weights along the hidden axis.[^1]
- [Pipeline Parallelism](/wiki/pipeline_parallelism): orthogonal sharding of layers across devices.[^1]
- [FSDP](/wiki/fsdp): orthogonal sharding of optimizer state and gradients across data-parallel ranks.[^4]
- [Mixture of Experts](/wiki/mixture_of_experts): routed expert parallelism that composes with CP.[^16]
- [DeepSpeed](/wiki/deepspeed): Microsoft's training library hosting the Ulysses sequence-parallel attention variant.[^9]
- [Megatron-LM](/wiki/megatron_lm): NVIDIA's reference training stack where context parallelism is exposed as a first-class parallelism axis.[^11]

## See also

- [Megatron-LM](/wiki/megatron_lm)
- [Ring Attention](/wiki/ring_attention)
- [Flash Attention](/wiki/flash_attention)
- [Tensor Parallelism](/wiki/tensor_parallelism)
- [Pipeline Parallelism](/wiki/pipeline_parallelism)
- [Data Parallelism](/wiki/data_parallelism)
- [Fully Sharded Data Parallel (FSDP)](/wiki/fsdp)
- [Mixture of Experts](/wiki/mixture_of_experts)
- [DeepSpeed](/wiki/deepspeed)
- [KV Cache](/wiki/kv_cache)
- [Rotary position embedding (RoPE)](/wiki/rope)
- [Gemini 2.5 Pro](/wiki/gemini_2_5_pro)
- [Llama 3.1](/wiki/llama_3_1)
- [Pieter Abbeel](/wiki/pieter_abbeel)
- [NVLink](/wiki/nvlink)
- [NCCL](/wiki/nccl)

## References

[^1]: NVIDIA, "Context Parallelism", NVIDIA NeMo Framework User Guide (24.09), 2024-09. https://docs.nvidia.com/nemo-framework/user-guide/24.09/longcontext/contextparallel.html. Accessed 2026-05-21.
[^2]: Hao Liu, Matei Zaharia, Pieter Abbeel, "Ring Attention with Blockwise Transformers for Near-Infinite Context", arXiv:2310.01889, 2023-10-03. https://arxiv.org/abs/2310.01889. Accessed 2026-05-21.
[^3]: NVIDIA, "context_parallel package", Megatron-LM Developer Guide 0.15.0, 2024. https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/api-guide/context_parallel.html. Accessed 2026-05-21.
[^4]: Llama Team, AI @ Meta, "The Llama 3 Herd of Models", arXiv:2407.21783, 2024-07-31. https://arxiv.org/abs/2407.21783. Accessed 2026-05-21.
[^5]: Google DeepMind, "Long context", Gemini API documentation, 2025. https://ai.google.dev/gemini-api/docs/long-context. Accessed 2026-05-21.
[^6]: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Re, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", arXiv:2205.14135, 2022-05-27. https://arxiv.org/abs/2205.14135. Accessed 2026-05-21.
[^7]: NVIDIA, "Scaling to Millions of Tokens with Efficient Long-Context LLM Training", NVIDIA Technical Blog, 2024. https://developer.nvidia.com/blog/scaling-to-millions-of-tokens-with-efficient-long-context-llm-training/. Accessed 2026-05-21.
[^8]: Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro, "Reducing Activation Recomputation in Large Transformer Models", arXiv:2205.05198, 2022-05-10. https://arxiv.org/abs/2205.05198. Accessed 2026-05-21.
[^9]: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He, "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models", arXiv:2309.14509, 2023-09-25. https://arxiv.org/abs/2309.14509. Accessed 2026-05-21.
[^10]: Tri Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning", arXiv:2307.08691, 2023-07-17. https://arxiv.org/abs/2307.08691. Accessed 2026-05-21.
[^11]: NVIDIA, "Parallelism Strategies Guide", Megatron-LM Developer Guide (latest), 2024-2026. https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/parallelism-guide.html. Accessed 2026-05-21.
[^12]: Jiarui Fang, Shangchun Zhao, "USP: A Unified Sequence Parallelism Approach for Long Context Generative AI", arXiv:2405.07719, 2024-05-13. https://arxiv.org/abs/2405.07719. Accessed 2026-05-21.
[^13]: Microsoft DeepSpeed, "DeepSpeed-Ulysses", DeepSpeed Blog, 2023. https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ulysses/README.md. Accessed 2026-05-21.
[^14]: Insu Jang, "Introducing Context Parallelism", Better Tomorrow with Computer Science blog, 2024-09-20. https://insujang.github.io/2024-09-20/introducing-context-parallelism/. Accessed 2026-05-21.
[^15]: NVIDIA, "core.model_parallel_config", Megatron-Core 0.17.0 API Reference, 2024-2025. https://docs.nvidia.com/megatron-core/developer-guide/0.17.0/apidocs/core/core.model_parallel_config.html. Accessed 2026-05-21.
[^16]: NVIDIA, "Parallelisms Guide", Megatron Bridge documentation, 2025-2026. https://docs.nvidia.com/nemo/megatron-bridge/latest/parallelisms.html. Accessed 2026-05-21.
[^17]: NVIDIA, "Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core", NVIDIA Technical Blog, 2026-01. https://developer.nvidia.com/blog/speeding-up-variable-length-training-with-dynamic-context-parallelism-and-nvidia-megatron-core/. Accessed 2026-05-21.
[^18]: NVIDIA, "Accelerating Long-Context Model Training in JAX and XLA", NVIDIA Technical Blog, 2025-2026. https://developer.nvidia.com/blog/accelerating-long-context-model-training-in-jax-and-xla/. Accessed 2026-05-21.
[^19]: William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, "Striped Attention: Faster Ring Attention for Causal Transformers", arXiv:2311.09431, 2023-11-15. https://arxiv.org/abs/2311.09431. Accessed 2026-05-21.
[^20]: Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang, "Context Parallelism for Scalable Million-Token Inference", arXiv:2411.01783, 2024-11-04. https://arxiv.org/abs/2411.01783. Accessed 2026-05-21.
[^21]: Exxact, "Context Parallelism & Ring Attention: Reaching 1M Token Context", Exxact Blog, 2024-2025. https://www.exxactcorp.com/blog/deep-learning/how-llms-reach-large-token-context-windows. Accessed 2026-05-21.
[^22]: Haosheng Zou et al., "360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training", arXiv:2505.22296, 2025-05-28. https://arxiv.org/abs/2505.22296. Accessed 2026-05-21.

