Megatron-LM
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,012 words
Megatron-LM is an open-source training framework developed by NVIDIA for training very large transformer language models on GPU clusters. First announced in August 2019 and released alongside the paper "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" by Mohammad Shoeybi and colleagues, the project introduced a practical, PyTorch-native form of tensor parallelism that split the linear layers inside transformer attention and feed-forward blocks across multiple devices.[1][2] Over successive papers in 2021 and 2022 the project added pipeline parallelism with an interleaved 1F1B schedule, sequence parallelism, and selective activation recomputation, eventually scaling training experiments to a one-trillion-parameter GPT model on 3,072 A100 GPUs.[3][4] The core scheduling, communication, and parallelism machinery has since been refactored into a separately versioned library called Megatron-Core, which underpins NVIDIA's NeMo framework and is widely used in community projects such as Microsoft's Megatron-DeepSpeed fork and Hugging Face checkpoint converters.[5][6][7]
| Field | Value |
|---|---|
| Original release | August 12, 2019[8] |
| Developer | NVIDIA Applied Deep Learning Research |
| Source paper | Shoeybi et al., arXiv:1909.08053[1] |
| Latest core library | Megatron-Core 0.16.1 (March 20, 2026)[2] |
| Primary language | Python (PyTorch) |
| License | BSD 3-Clause for NVIDIA code; bundled third-party code under Apache 2.0 / MIT[9] |
| Repository | github.com/NVIDIA/Megatron-LM[2] |
| Reported GitHub stars | 16,400 (Mar 2026 snapshot)[2] |
| Largest demonstrated model | 1T parameters, 3,072 A100 GPUs, 502 PFLOP/s[3] |
The Megatron-LM project originated inside NVIDIA's Applied Deep Learning Research group (ADLR), led at the time by Bryan Catanzaro. In a blog post on August 12, 2019, the group announced that it had trained an 8.3-billion-parameter GPT-2-style model, "the largest transformer ever trained," using 8-way intra-layer model parallelism and 64-way data parallelism across 512 NVIDIA V100 GPUs.[8] The 8.3B model used 72 transformer layers, a hidden dimension of 3,072, and 24 attention heads, and was trained on an aggregate 174 GB corpus combining Wikipedia, OpenWebText, RealNews, and CC-Stories totalling roughly 40 million deduplicated documents.[8] The accompanying paper, posted on arXiv on September 17, 2019 as 1909.08053, formalized the technique as "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism."[1]
That first paper made two main claims. First, by splitting the weight matrices of self-attention and the MLP block column-wise and row-wise across GPUs (the column-then-row pattern explained below), it was possible to train a multi-billion-parameter transformer without writing a new compiler, modifying PyTorch, or expressing the model in a Mesh-TensorFlow-style partitioning language, as previous large-scale work had required.[1] Second, the approach scaled efficiently: the authors reported 15.1 PFLOP/s sustained across 512 GPUs, a 76% scaling efficiency relative to a single-GPU baseline that sustained 39 TFLOP/s.[1] Their 8.3B GPT-2 model set new state-of-the-art results on WikiText-103 perplexity (10.8, down from 15.8) and LAMBADA accuracy (66.5%, up from 63.2%), and their 3.9B BERT-style model improved RACE accuracy to 90.9%.[1]
At the time of the 2019 release, the dominant alternative for training models larger than what fit on a single accelerator was Mesh-TensorFlow, which required users to express the model in a special DSL that the compiler could then partition. Megatron-LM showed that the same effect could be achieved by inserting roughly a dozen lines of PyTorch communication code around the GEMMs of the existing transformer block. This pragmatic, framework-native approach is the reason the technique spread quickly to other labs.[1]
The second major paper, "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" by Narayanan, Shoeybi, Casper et al., was posted to arXiv on April 9, 2021 (and accepted at SC '21).[3] Tensor parallelism alone could not scale models much beyond a few tens of billions of parameters: once the tensor-parallel group outgrows a single node, its all-reduce traffic has to cross inter-node links that are far slower than intra-node NVLink. The natural complement was pipeline parallelism, where successive layers are placed on different devices and microbatches flow through them in stages.[3]
Megatron-2 made two innovations. It folded an existing one-forward-one-backward (1F1B) schedule into a memory-aware pipeline and then introduced an interleaved 1F1B schedule in which each device hosts more than one non-contiguous chunk of the model, reducing the size of the pipeline "bubble" by a factor roughly proportional to the number of chunks per device.[3] It also showed how to compose tensor parallelism within a node, pipeline parallelism across nodes, and conventional data parallelism across replicas; this composition has come to be known as 3D parallelism.[3] On NVIDIA's Selene DGX A100 supercomputer the authors reported 502 PFLOP/s aggregate throughput for a one-trillion-parameter GPT-style model on 3,072 A100 GPUs, or 163 TFLOP/s per GPU (about 52% of theoretical peak), with a projected end-to-end training time of roughly 84 days.[3][10]
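The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only; the 312 TFLOP/s value is an assumption (the commonly quoted dense FP16/BF16 Tensor Core peak of an A100) and is not stated in this article. Because the per-GPU figure is rounded, the recomputed aggregate lands slightly below the reported 502 PFLOP/s.

```python
# Figures quoted in the text for the one-trillion-parameter run.
NUM_GPUS = 3072
PER_GPU_TFLOPS = 163.0

# Assumption: dense FP16/BF16 Tensor Core peak of one A100, in TFLOP/s.
A100_PEAK_TFLOPS = 312.0


def aggregate_pflops(num_gpus: int, per_gpu_tflops: float) -> float:
    """Aggregate throughput in PFLOP/s from a per-GPU figure in TFLOP/s."""
    return num_gpus * per_gpu_tflops / 1000.0


def fraction_of_peak(per_gpu_tflops: float, peak_tflops: float) -> float:
    """Achieved throughput as a fraction of the hardware peak."""
    return per_gpu_tflops / peak_tflops


if __name__ == "__main__":
    print(f"aggregate: {aggregate_pflops(NUM_GPUS, PER_GPU_TFLOPS):.1f} PFLOP/s")
    print(f"of peak:   {fraction_of_peak(PER_GPU_TFLOPS, A100_PEAK_TFLOPS):.1%}")
```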
The 2021 paper also introduced a scatter-gather optimization for the eight InfiniBand network interface cards present on each DGX A100 node. Instead of routing pipeline-parallel send/receive traffic through a single NIC, the framework chunks the activations across NICs, ships them in parallel across the InfiniBand fabric, and reassembles them on the destination side using NVLink. NVIDIA reported up to 11% throughput improvement from this single optimization on the trillion-parameter run, illustrating how tightly Megatron-LM's published numbers are coupled to the specific topology of the underlying NVIDIA DGX systems.[10]
The next significant paper, "Reducing Activation Recomputation in Large Transformer Models" by Korthikanti, Casper et al., was posted to arXiv on May 10, 2022.[4] Activation memory had become the binding constraint for large models, and the standard workaround, full activation checkpointing, paid a roughly 30% compute tax. The paper introduced two complementary techniques.
Sequence parallelism observes that the LayerNorm and dropout operations in a transformer block, which plain tensor parallelism replicates on every tensor-parallel rank, act independently on each sequence position and can therefore be sliced along the sequence dimension once the all-reduce in the row-parallel projection is rewritten as a reduce-scatter (followed later by an all-gather). This effectively pushes the sequence dimension through the non-GEMM parts of the block, reducing activation memory roughly in proportion to the tensor-parallel size.[4] Selective activation recomputation observes that a small set of activations (the softmax inputs and outputs of the attention block) accounts for a disproportionate share of memory while being cheap to recompute, so the framework recomputes only those rather than the entire block.[4]
Together the two techniques cut activation memory by about 5x and reduced the runtime overhead of recomputation by more than 90%. On a 530-billion-parameter GPT-style model trained on 2,240 A100 GPUs, Megatron-3 reached 54.2% model FLOPs utilization (MFU), versus 42.1% with traditional full recomputation, a 29% wall-clock speedup.[4]
As the codebase grew, NVIDIA increasingly used the same components inside its own product stack, particularly the NeMo framework, while external users like Microsoft's DeepSpeed team and the BigScience consortium maintained heavyweight forks.[11][7] To stabilize the API for these downstream consumers, NVIDIA refactored the parallelism, communication, and transformer building blocks into a separately versioned library called Megatron-Core (megatron-core on PyPI), with the original Megatron-LM repository becoming a thin reference training driver on top of it.[5][6] The Megatron-Core 0.16.x series, released in early 2026, adds context parallelism variants, FP8 and FP4 mixed precision, MoE with expert parallelism, multimodal recipes, distributed checkpointing, fault detection, and a custom Fully Sharded Data Parallel (FSDP) implementation.[2][6]
The split is the practical analogue of how PyTorch itself separated out torch.distributed from the rest of the framework: Megatron-LM stays as a research code base where new techniques are first tried, while Megatron-Core is a versioned, semver-stable interface that downstream training stacks (NeMo, Megatron-Bridge, vendor forks) can pin to. The library is published to PyPI as megatron-core, with a numbered release roughly every two to four months and changelog entries that name the originating paper for each major feature.[5][6][15]
In the 2024 to 2026 window the project's headline additions have been the FP8 path for Hopper (using the Transformer Engine library), context parallelism for long-context training, MoE-specific parallelism (expert parallelism and the "MoE parallel folding" technique for heterogeneous mappings), distributed checkpointing that reduces checkpoint overhead by up to 42x relative to native PyTorch save/load, and a custom FSDP implementation that interoperates with Megatron-Core's tensor- and pipeline-parallel groups.[17][6][2] Megatron-Core 0.7 (July 2024) added LLaVA-style multimodal training; 0.16 (early 2026) extends this to diffusion-model collections and to Mamba and other state-space hybrids.[17][2]
The signature contribution of Megatron-LM is a specific way of splitting a transformer block across GPUs so that only two all-reduce operations are needed in each of the forward and backward passes per block. Consider the standard MLP Y = GeLU(X * A) * B. Megatron's recipe is:
- Column-parallel first layer A. The output dimension of A is split across the tensor-parallel group. Each rank holds a vertical slice A_i and computes X * A_i independently. Because GeLU is element-wise, the nonlinearity applies locally without communication.[1][12]
- Row-parallel second layer B. The input dimension of B is split, matching the partition produced by the previous step. Each rank computes GeLU(X * A_i) * B_i, and a single all-reduce sums the partial results to give the final output.[1][12]

Self-attention uses an analogous pattern: the Q, K, V projection is column-parallel along the head dimension (so each rank holds a subset of attention heads), the attention computation is fully local within each head, and the output projection is row-parallel, again terminating in one all-reduce.[1] In the forward pass each block thus performs two all-reduces (one in the MLP, one in attention); the backward pass requires two more.[1] Because these all-reduces are dense and bandwidth-bound, Megatron-LM is normally restricted to tensor-parallel sizes that fit inside a single NVLink/NVSwitch domain (typically 8 GPUs within a DGX node).[3]
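The column-then-row split of the MLP can be checked numerically without any GPUs. The following is a minimal single-process sketch in plain Python, not Megatron-LM code: the helper names (`split_columns`, `split_rows`, `all_reduce_sum`, `mlp_tensor_parallel`) are hypothetical, and the "all-reduce" is simulated by summing the per-rank partial outputs.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Exact (erf-based) GeLU applied element-wise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal vertical slices (column-parallel)."""
    assert len(m[0]) % parts == 0, "column count must divide evenly"
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal horizontal slices (row-parallel)."""
    assert len(m) % parts == 0, "row count must divide evenly"
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: element-wise sum of per-rank matrices."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[r][c] for p in partials) for c in range(cols)] for r in range(rows)]


def mlp_reference(x, a, b):
    """Unsharded MLP: GeLU(X * A) * B."""
    return matmul(gelu(matmul(x, a)), b)


def mlp_tensor_parallel(x, a, b, tp):
    """Each simulated rank computes GeLU(X * A_i) * B_i; one sum combines them."""
    a_shards = split_columns(a, tp)
    b_shards = split_rows(b, tp)
    partials = [matmul(gelu(matmul(x, a_i)), b_i) for a_i, b_i in zip(a_shards, b_shards)]
    return all_reduce_sum(partials)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    x, a, b = random_matrix(3, 4, rng), random_matrix(4, 8, rng), random_matrix(8, 4, rng)
    ref, out = mlp_reference(x, a, b), mlp_tensor_parallel(x, a, b, tp=4)
    print(max(abs(o - r) for ro, rr in zip(out, ref) for o, r in zip(ro, rr)))
```

The sharded result matches the reference up to floating-point rounding because GeLU is applied to complete columns of X * A on each rank, so no communication is needed until the final sum.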
When the model is too large for a single node, layers are partitioned into pipeline stages, each living on a different group of devices. A microbatch flows forward through the stages, then backward in reverse. Naive (GPipe-style) scheduling fills the pipeline with forward passes for the entire batch before performing any backward passes, which inflates activation memory in proportion to the number of microbatches.[3]
Megatron-LM uses the PipeDream-Flush / 1F1B schedule, where after the pipeline is filled each device alternates one forward and one backward microbatch, keeping the number of in-flight activations bounded by the pipeline depth rather than the global batch size.[3] The interleaved 1F1B schedule then splits each pipeline stage into multiple non-contiguous "chunks" so that a device, rather than owning layers [4..7], might own [4, 5] and [12, 13]. This shrinks the relative size of the pipeline bubble by the number of chunks at the cost of a proportional increase in pipeline communication. Narayanan et al. report a 10% or greater end-to-end throughput improvement at trillion-parameter scale.[3]
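The effect of interleaving on the bubble can be made concrete with the idealized expression usually quoted from Narayanan et al.: with p pipeline stages, m microbatches, and v chunks per device, the bubble is (p - 1) / (v * m) of the ideal compute time. The sketch below assumes that formula and uses illustrative values for p and m that do not come from this article.

```python
def bubble_fraction(pipeline_stages: int, microbatches: int, chunks_per_device: int = 1) -> float:
    """Idealized pipeline bubble as a fraction of ideal compute time.

    chunks_per_device=1 corresponds to the non-interleaved 1F1B schedule;
    larger values model the interleaved schedule.
    """
    if min(pipeline_stages, microbatches, chunks_per_device) < 1:
        raise ValueError("all arguments must be positive integers")
    return (pipeline_stages - 1) / (chunks_per_device * microbatches)


if __name__ == "__main__":
    # Illustrative values only: 8 stages, 64 microbatches.
    for v in (1, 2, 4):
        print(f"chunks per device = {v}: bubble = {bubble_fraction(8, 64, v):.4f}")
```

The model ignores the extra point-to-point communication that interleaving introduces, which is the cost side of the trade-off described above.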
Sequence parallelism is a refinement layered on top of tensor parallelism, not a replacement. In the standard tensor-parallel block the (b, s, h) activations entering and leaving the all-reduce are replicated across tensor-parallel ranks (where b, s, h are batch, sequence, and hidden dimensions). Sequence parallelism rewrites the communication around the LayerNorm and dropout regions so that, in those segments, activations are split along the s dimension. The all-reduce at the boundary becomes a reduce-scatter (entering the sequence-parallel region) and an all-gather (leaving it).[4] The total communication volume is unchanged because a reduce-scatter plus an all-gather equals one all-reduce in bytes; the benefit is that activation memory in the LayerNorm/dropout regions drops by the tensor-parallel factor t.[4]
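The equivalence between one all-reduce and a reduce-scatter followed by an all-gather can be illustrated with simulated ranks. This is a single-process sketch using Python lists, not NCCL or torch.distributed; the function names simply mirror the collective names.

```python
def all_reduce(rank_buffers):
    """Every rank ends up with the element-wise sum of all buffers."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]


def reduce_scatter(rank_buffers):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    n = len(rank_buffers)
    assert len(rank_buffers[0]) % n == 0, "buffer must split evenly across ranks"
    chunk = len(rank_buffers[0]) // n
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_gather(rank_chunks):
    """Every rank ends up with the concatenation of all chunks."""
    full = [v for chunk in rank_chunks for v in chunk]
    return [list(full) for _ in rank_chunks]


if __name__ == "__main__":
    buffers = [[r * 10 + i for i in range(8)] for r in range(4)]  # 4 simulated ranks
    scattered = reduce_scatter(buffers)
    print("elements held per rank between the two collectives:", len(scattered[0]))
    print("same result as all-reduce:", all_gather(scattered) == all_reduce(buffers))
```

Between the two collectives each rank holds only 1/t of the summed activations, which is where the memory saving in the LayerNorm and dropout regions comes from.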
Within a transformer block, the attention softmax and dropout regions dominate activation memory while being inexpensive to recompute (a few non-GEMM elementwise operations). Selective activation recomputation drops only those activations and recomputes them during backward, retaining the more expensive GEMM activations. Combined with sequence parallelism this almost eliminates the need for full block-level checkpointing.[4]
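A rough per-layer accounting shows how the two techniques combine. The sketch below uses the per-layer activation-memory expressions as they are commonly quoted from Korthikanti et al. (in bytes, for 16-bit activations): s·b·h·(34 + 5·a·s/h) with no parallelism, with the tensor-parallel, sequence-parallel, and selective-recompute variants derived from it. Both the formulas and the example configuration (s = 2048, h = 20480, a = 128, t = 8, intended to resemble a 530B-class model) are assumptions that do not appear in this article and should be checked against the paper.

```python
def activation_bytes_per_layer(s, b, h, a, t=1,
                               sequence_parallel=False,
                               selective_recompute=False):
    """Approximate activation bytes stored per transformer layer.

    s: sequence length, b: microbatch size, h: hidden size,
    a: attention heads, t: tensor-parallel size.
    """
    # The 5*a*s/h term covers the attention softmax/dropout activations,
    # which selective recomputation discards and rebuilds in backward.
    attention = 0.0 if selective_recompute else 5.0 * a * s / h
    if sequence_parallel:
        # Everything, including LayerNorm/dropout regions, is divided by t.
        return s * b * h * (34.0 + attention) / t
    # Without sequence parallelism, 10*s*b*h stays replicated on every rank.
    return s * b * h * (10.0 + (24.0 + attention) / t)


if __name__ == "__main__":
    cfg = dict(s=2048, b=1, h=20480, a=128, t=8)  # assumed configuration
    base = activation_bytes_per_layer(**cfg)
    both = activation_bytes_per_layer(**cfg, sequence_parallel=True, selective_recompute=True)
    print(f"tensor parallel only:      {base / 2**20:.0f} MiB per layer")
    print(f"+ sequence + selective:    {both / 2**20:.0f} MiB per layer")
    print(f"reduction factor:          {base / both:.2f}x")
```

Under these assumptions the reduction factor comes out just under 5, in line with the roughly 5x figure reported for the combined techniques.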
Context parallelism (CP) generalizes the sequence-splitting idea to the attention computation itself. The input, keys, and values are sharded along the sequence dimension, and the attention computation is performed by circulating K/V chunks in a ring among the context-parallel ranks, similar to Liu et al.'s ring attention formulation.[13] NVIDIA's CP implementation differs from the academic ring-attention in that it leverages cuDNN/Flash-style fused attention kernels and handles causal masking by rebalancing work across ranks; the docs cite a 1.48x speedup for variable-length sequences from "Dynamic Context Parallelism" in recent Megatron-Core releases.[13][2]
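The ring idea can be sketched in a single process: each simulated rank keeps its own query block, processes whichever K/V block it currently holds using a running ("online") softmax, and then hands that K/V block to its neighbor. This is a simplified illustration in plain Python and not NVIDIA's implementation; it is non-causal, uses one attention head, performs no load balancing, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v over the whole sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * dot(qi, kj) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / norm
                    for c in range(len(v[0]))])
    return out


def split_blocks(rows, parts):
    """Shard a sequence (list of row vectors) into equal contiguous blocks."""
    size = len(rows) // parts
    return [rows[i * size:(i + 1) * size] for i in range(parts)]


def ring_attention(q_blocks, k_blocks, v_blocks):
    ranks = len(q_blocks)
    scale = 1.0 / math.sqrt(len(q_blocks[0][0]))
    width = len(v_blocks[0][0])
    # Running softmax state per query row: [max score, denominator, weighted sum].
    state = [[[-math.inf, 0.0, [0.0] * width] for _ in blk] for blk in q_blocks]
    held = list(zip(k_blocks, v_blocks))  # held[r] is the K/V block on rank r
    for _ in range(ranks):
        for r in range(ranks):
            k_blk, v_blk = held[r]
            for qi, st in zip(q_blocks[r], state[r]):
                for kj, vj in zip(k_blk, v_blk):
                    score = scale * dot(qi, kj)
                    new_max = max(st[0], score)
                    rescale = math.exp(st[0] - new_max)  # rescales earlier terms
                    weight = math.exp(score - new_max)
                    st[1] = st[1] * rescale + weight
                    st[2] = [acc * rescale + weight * x for acc, x in zip(st[2], vj)]
                    st[0] = new_max
        held = held[-1:] + held[:-1]  # each rank passes its K/V block to the next
    return [[[acc / st[1] for acc in st[2]] for st in rank_state] for rank_state in state]
```

After as many steps as there are ranks, every query block has seen every K/V block, so the concatenated per-rank outputs should match `full_attention` up to rounding, while no rank ever holds more than one K/V block at a time.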
Megatron-Core's distributed optimizer shards optimizer states (and master FP32 parameters in mixed precision) across the data-parallel group, in the spirit of DeepSpeed ZeRO Stage 1 but integrated with the tensor- and pipeline-parallel topology. It divides optimizer-state memory per rank by the data-parallel size without increasing per-step communication volume relative to vanilla data-parallel training.[5][6]
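A back-of-the-envelope estimate shows the scale of the saving. The sketch below assumes Adam with FP32 master weights, so 12 bytes of optimizer state per parameter (4 for the master copy, 4 each for the two moments); that figure is the usual ZeRO-style accounting and is an assumption here, not a number taken from Megatron-Core's documentation. The function names are hypothetical.

```python
def optimizer_state_bytes_per_param(dp_size: int, distributed: bool,
                                    state_bytes: float = 12.0) -> float:
    """Optimizer-state bytes each rank holds per parameter it trains."""
    if dp_size < 1:
        raise ValueError("dp_size must be a positive integer")
    return state_bytes / dp_size if distributed else state_bytes


def optimizer_state_gib(params_on_rank: int, dp_size: int, distributed: bool,
                        state_bytes: float = 12.0) -> float:
    """Optimizer-state memory in GiB for the parameters owned by one rank."""
    per_param = optimizer_state_bytes_per_param(dp_size, distributed, state_bytes)
    return params_on_rank * per_param / 2**30


if __name__ == "__main__":
    # Illustrative: 2 billion parameters per model-parallel rank, DP = 8.
    for distributed in (False, True):
        gib = optimizer_state_gib(2_000_000_000, 8, distributed)
        print(f"distributed={distributed}: {gib:.1f} GiB of optimizer state per rank")
```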
On NVIDIA Hopper (H100) and later, Megatron-Core integrates with the Transformer Engine library to perform GEMMs in FP8 with per-tensor or delayed scaling, using either E4M3 or E5M2 formats depending on the operation. Megatron-Core 0.16 added support for FP8 parameter all-gather under per-tensor scaling and, on NVIDIA Blackwell, FP4 compute paths for both pre-training and post-training.[2][6]
In a typical large run the parallelism dimensions stack: tensor parallelism within a node (usually 4 or 8), context parallelism along a second axis (often 2 or more, only when sequences are long), pipeline parallelism across nodes, data parallelism across the remaining replicas, and (for sparse models) expert parallelism splitting MoE experts among groups. The product of the tensor, context, pipeline, and data dimensions equals the total number of devices; expert parallelism is typically carved out of those existing ranks rather than adding a further factor. Megatron-Core's parallel_state module is the canonical reference implementation for laying out these groups onto a cluster and routing the corresponding NCCL communicators.[14][6]
A typical large run on a 1,024-GPU H100 cluster might use TP=8, CP=2, PP=8, DP=8, which multiplies to 1,024. Choosing the right factorization for a given model is one of the central engineering exercises: tensor parallelism is bandwidth-bound and must stay inside a single NVLink domain; pipeline parallelism trades the pipeline bubble against per-stage memory; context parallelism only helps when sequences are long enough that activations along the sequence dimension dominate memory; data parallelism is cheapest per step but multiplies the gradient all-reduce volume.[3][6]
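The factorization constraint is easy to encode. The sketch below derives the data-parallel size from a world size and the model-parallel dimensions, and maps a global rank to per-dimension coordinates. The `Layout` class and both functions are hypothetical, and the rank ordering (tensor-parallel index varying fastest, then context, data, and pipeline) is an illustrative assumption rather than a statement about how parallel_state orders its groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    tp: int  # tensor-parallel size
    cp: int  # context-parallel size
    pp: int  # pipeline-parallel size
    dp: int  # data-parallel size

    @property
    def world_size(self) -> int:
        return self.tp * self.cp * self.pp * self.dp


def infer_dp(world_size: int, tp: int, cp: int, pp: int) -> int:
    """Data-parallel size implied by the world size and model-parallel dims."""
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(f"{world_size} GPUs not divisible by TP*CP*PP={model_parallel}")
    return world_size // model_parallel


def rank_to_coords(rank: int, layout: Layout) -> dict:
    """Map a global rank to (tp, cp, dp, pp) indices under the assumed ordering."""
    if not 0 <= rank < layout.world_size:
        raise ValueError("rank out of range")
    tp_idx, rest = rank % layout.tp, rank // layout.tp
    cp_idx, rest = rest % layout.cp, rest // layout.cp
    dp_idx, pp_idx = rest % layout.dp, rest // layout.dp
    return {"tp": tp_idx, "cp": cp_idx, "dp": dp_idx, "pp": pp_idx}


if __name__ == "__main__":
    layout = Layout(tp=8, cp=2, pp=8, dp=infer_dp(1024, tp=8, cp=2, pp=8))
    print(layout, "world size:", layout.world_size)
    print("rank 0: ", rank_to_coords(0, layout))
    print("rank 7: ", rank_to_coords(7, layout))
    print("rank 8: ", rank_to_coords(8, layout))
```

With the tensor-parallel index varying fastest, ranks 0 through 7 form one tensor-parallel group, which keeps that group inside a single 8-GPU node if ranks are assigned to nodes contiguously.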
A side benefit of Megatron-Core's parallelism model is that the same parallel-group abstractions can be reused for distributed checkpointing. Each rank writes only the parameter shards it owns, with a small index file describing how to reassemble the global state. Megatron-Core 0.7 reported up to a 42x reduction in checkpointing overhead compared to a naive torch.save of a gathered state, and the format is portable across different parallel configurations, so a model saved at TP=8/PP=4 can be resumed at TP=4/PP=8 without explicit conversion.[17][6]
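Why a sharded checkpoint can be reloaded under a different tensor-parallel size is easiest to see on a single column-parallel weight: the shards are slices of one global matrix, so they can be reassembled and re-sliced. The sketch below is a conceptual illustration in plain Python with hypothetical helper names; it does not reproduce Megatron-Core's checkpoint format or API, which would also avoid materializing the full tensor on one rank.

```python
def shard_columns(weight, tp):
    """Slice a global weight (list of rows) into `tp` column shards."""
    cols = len(weight[0])
    if cols % tp != 0:
        raise ValueError("column count must be divisible by the tensor-parallel size")
    width = cols // tp
    return [[row[i * width:(i + 1) * width] for row in weight] for i in range(tp)]


def merge_columns(shards):
    """Concatenate column shards back into the global weight."""
    rows = len(shards[0])
    return [[v for shard in shards for v in shard[r]] for r in range(rows)]


def reshard_columns(shards, new_tp):
    """Re-slice shards saved at one tensor-parallel size for another."""
    return shard_columns(merge_columns(shards), new_tp)


if __name__ == "__main__":
    weight = [[r * 16 + c for c in range(16)] for r in range(2)]  # toy 2 x 16 weight
    saved = shard_columns(weight, tp=8)      # written by a TP=8 run
    loaded = reshard_columns(saved, new_tp=4)  # read by a TP=4 run
    print("shard width at TP=8:", len(saved[0][0]), "at TP=4:", len(loaded[0][0]))
    print("round-trips to the same global weight:", merge_columns(loaded) == weight)
```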
NVIDIA's NeMo stack uses Megatron-Core as its training kernel. NeMo Megatron originally bundled a fork of the Megatron-LM scripts with PyTorch Lightning wrappers; the current direction (as of 2026) is NeMo Megatron-Bridge, a PyTorch-native training loop that imports Megatron-Core directly and provides bidirectional checkpoint conversion with Hugging Face Transformers.[15][16] Megatron-Core powered the training of Nemotron-4 340B on more than 6,000 H100 GPUs.[17]
The Microsoft DeepSpeed team maintains a fork called Megatron-DeepSpeed that combines Megatron's tensor and pipeline parallel implementations with DeepSpeed's ZeRO sharding, optimizer offload, and universal checkpointing. This fork was used to train both Megatron-Turing NLG 530B and BLOOM 176B.[11][7] The BigScience consortium maintained a further fork of Megatron-DeepSpeed specifically for BLOOM.[7]
A long tail of academic and corporate forks (EPFL's Megatron-LLM, Alibaba's Megatron-LLaMA, the Arcee Megatron-LM-Llama-70B fork, Yandex's YaLM-100B distribution) extends the Megatron-LM base for specific model families or hardware platforms. Many of these focus on adding LLaMA-style architectural details (RoPE positional encoding, RMSNorm, SwiGLU MLP, grouped-query attention) to the original GPT recipe.[18][19]
While Megatron-LM is engineered for NVIDIA GPUs, AMD has published a ROCm port (used in the AMD ROCm 7.0 documentation for Llama 3 pre-training benchmarks) and several research groups have ported subsets of the library to Intel Habana and other accelerators.[26] These ports typically inherit the column-parallel/row-parallel/1F1B logic unchanged while substituting collective communication backends (RCCL for ROCm) and rewriting the FP8 paths that depend on Hopper-specific Tensor Core instructions. Performance on non-NVIDIA hardware is generally lower than on the reference DGX systems but the same algorithmic scaling holds.[26]
The BigCode project's StarCoder 15.5B model was trained on 512 A100 GPUs using Megatron-LM orchestration over a 24-day run, consuming roughly 320,000 GPU-hours and 1 trillion tokens drawn from The Stack v1.2 across more than 80 programming languages.[27] The training combined tensor and pipeline parallelism with multi-query attention and an 8,192-token context, demonstrating that the same Megatron-LM stack used for natural-language pre-training translates directly to code modeling.[27]
| Model | Year | Parameters | Hardware | Framework used | Citation |
|---|---|---|---|---|---|
| Megatron 8.3B | 2019 | 8.3B | 512 V100 | Megatron-LM v0 | [1][8] |
| Megatron-Turing NLG (MT-NLG) | 2022 | 530B | 2,240 A100 (later 4,480) | Megatron-DeepSpeed (3D parallel) | [11] |
| BLOOM | 2022 | 176B | 384 A100 80GB | Megatron-DeepSpeed (BigScience fork) | [7] |
| OPT-175B | 2022 | 175B | 992 A100 80GB | FSDP + Megatron-LM tensor parallel | [20] |
| Nemotron-4 340B | 2024 | 340B | 6,000+ H100 | Megatron-Core | [17] |
| Llama 3 herd | 2024 | up to 405B | 16,000 H100 | Meta internal stack, Megatron-style 4D parallel | [21] |
| Phi-3 | 2024 | up to 14B | NVIDIA H100 | DeepSpeed (incl. Megatron components) + UCP | [22] |
Adoption beyond this list extends to most large open-weight model releases that include parallelism details. The 2019 Megatron-LM paper is one of the most cited systems papers in modern deep learning; as of 2026 Semantic Scholar and Google Scholar both list well over 2,000 citations for arXiv:1909.08053.[1][23]
The Megatron-Turing Natural Language Generation 530B (MT-NLG) model, described by Smith, Patwary et al. in arXiv:2201.11990 (January 2022), is the canonical demonstration of Megatron-style training at scale. The model is a 105-layer decoder transformer with 530 billion parameters, trained jointly by NVIDIA and Microsoft using a combination of NVIDIA's Megatron-LM and Microsoft's DeepSpeed frameworks.[11]
Training ran on NVIDIA's Selene supercomputer (560 DGX A100 servers, 4,480 A100 GPUs, HDR InfiniBand interconnect) and on Microsoft Azure NDv4 cloud nodes following the same reference architecture.[11] The training stack used 3D parallelism: tensor parallelism within an 8-GPU NVLink domain, pipeline parallelism across nodes, and ZeRO-1 sharded data parallelism for the remaining dimension. At its time MT-NLG was the largest monolithic (non-MoE) language model trained, roughly 3x the parameter count of GPT-3.[11] The Megatron-3 paper subsequently reused this model as its 530B benchmark.[4]
The BigScience workshop's BLOOM model, trained between March and July 2022, used a fork of Megatron-DeepSpeed and was the first publicly released multilingual model at the 176B scale.[7] The configuration used tensor parallelism of size 4 (kept within a single node), pipeline parallelism of size 12, and data parallelism of size 8, for a total of 4 × 12 × 8 = 384 A100 80GB GPUs over 48 DGX-like nodes; each 48-GPU model replica therefore spanned six 8-GPU nodes.[7] Training ran for 117 days and consumed approximately one million GPU-hours, processing roughly 350 billion tokens of multilingual training data and reaching up to roughly 150 TFLOP/s per GPU; the team explicitly noted this as the highest reported throughput for that A100 generation.[7]
The BLOOM configuration also illustrates the model-architectural choices that interact with Megatron-LM's parallelism. The team used BF16 mixed precision (chosen over FP16 specifically to avoid the loss-scaling overflows that had plagued earlier large runs), ZeRO Stage 1 (optimizer-state sharding only, since gradient and parameter sharding would have conflicted with the pipeline-parallel decomposition), and ALiBi positional encoding instead of learned positional embeddings to enable extrapolation to longer sequences. Custom fused CUDA kernels for LayerNorm, attention, and GeLU rounded out the throughput optimizations.[7]
The BLOOM run also documented the operational reality of Megatron-style training at scale: roughly one to two GPU failures per week, automatic checkpointing every three hours to bound rollback, and a 24/7 on-call rota across the global team.[7]
| Approach | Primary memory strategy | Communication structure | Notes |
|---|---|---|---|
| Megatron-LM tensor parallel | Shard layer parameters and activations within block | All-reduce per block (4 total per layer) | Optimized for NVLink/NVSwitch; small TP within a node |
| DeepSpeed ZeRO | Shard optimizer state, gradients, parameters across DP ranks | All-gather/reduce-scatter per step | Drop-in for data-parallel; no model code changes |
| PyTorch FSDP | Shard parameters across DP ranks; reconstruct per layer | All-gather/reduce-scatter per layer | Native PyTorch; complements Megatron TP/PP |
| GSPMD / JAX pjit | Compiler-driven sharding via XLA | Compiler emits collectives | Used in Google TPU stacks; no special module code |
| Ring-attention class methods | Shard along sequence dimension | Ring exchange of K/V blocks (point-to-point) | Megatron-Core's context parallelism is a closely related variant |
The widely cited summary of the difference between DeepSpeed and Megatron-LM is one of philosophy: ZeRO leaves the model code unchanged and reconstructs parameters on demand using high-bandwidth collectives, paying communication for simplicity; Megatron-LM keeps activations local but rewrites the model into column-parallel/row-parallel layers, paying complexity for lower per-step communication.[24] In practice the two are usually combined, with tensor and pipeline parallelism coming from Megatron-LM and the data-parallel dimension using ZeRO-style sharding.[24][11]
Empirically, the relative performance of the two approaches depends sharply on cluster topology and model size. For models at the tens-of-billions scale, ZeRO-2 and ZeRO-3 are often competitive with or faster than pure Megatron-LM tensor parallelism, because the all-reduces of intra-node tensor parallelism dominate when the model fits in fewer GPUs. For models at 100B+ parameters trained across many nodes, Megatron-LM's tensor and pipeline parallelism wins on aggregate throughput because ZeRO-3's parameter all-gathers cross InfiniBand, while tensor-parallel all-reduces stay on NVLink. The MT-NLG paper reports that for the 530B model, pure ZeRO-3 was substantially slower than the 3D-parallel Megatron-DeepSpeed configuration; that pattern is the empirical justification for the now-standard combination of Megatron tensor parallelism with ZeRO Stage 1 optimizer sharding.[11][24]
FSDP occupies a middle ground: like ZeRO it shards parameters across the data-parallel group, but it is implemented as a module wrapper in PyTorch with no custom kernel code. Megatron-Core's recent "custom FSDP" implementation provides an FSDP-compatible variant tuned for the same parallel groups that Megatron tensor and pipeline parallelism use.[2][6] GSPMD, the compiler-based partitioner used in Google's JAX/TPU stacks, derives sharding annotations from user hints and emits collectives automatically; the trade-off is that it relies on the XLA compiler rather than hand-written PyTorch modules.[25]
Megatron-LM is primarily a training framework: it does not provide inference serving (Triton, vLLM, and TensorRT-LLM cover that), although Megatron-Core does include exporters to TensorRT-LLM and inference engine adapters.[2] The systems built on top of Megatron-Core cover three broad domains.
Pre-training of large dense decoder LLMs. This is the original use case, and the one most often cited. Many large models released since 2021 either used Megatron-LM directly or adopted its tensor-parallel scheme, including GPT-3 reproductions, MT-NLG, BLOOM, OPT, Nemotron, Llama 3, and many corporate-internal models.[21][11][7]
Mixture of Experts training. Megatron-Core implements expert parallelism (EP), grouped MoE expert dispatch, token-dropless and token-dropping variants, and the MoE-aware variants of all the other parallelism dimensions; this was used at scale for Nemotron-4 340B's MoE training experiments and for community MoE models that fork the Megatron codebase.[17][6]
Multimodal and hybrid architectures. Megatron-Core 0.7+ added training recipes for vision-language models (e.g., LLaVA-style), and 0.16 adds Mamba/state-space hybrids and diffusion model collections.[6][2] The same parallelism primitives carry over because the underlying linear layers and attention are unchanged.
Megatron-LM's significance is twofold. As a research artifact, the 2019 paper provided the first widely reproducible recipe for splitting transformer linear layers across GPUs in pure PyTorch; almost every subsequent large-model framework, including Megatron-DeepSpeed and its BigScience fork, EleutherAI's GPT-NeoX, and Meta's internal Llama 3 stack, either uses Megatron tensor parallelism directly or implements the same column-parallel/row-parallel pattern under another name.[7][11][21] The 2021 and 2022 papers similarly seeded the now-ubiquitous techniques of interleaved 1F1B pipelining and sequence-parallel activation reduction. The arXiv paper 1909.08053 is cited several thousand times on Semantic Scholar as of 2026, placing it among the most-cited systems contributions in the deep-learning literature of the last decade.[23]
As a production artifact, Megatron-Core is the training kernel inside NVIDIA's NeMo product, which in turn underpins commercial LLM training services offered by major cloud providers; the same kernel is used in the public Nemotron releases.[17][15] In effect, Megatron-Core has become a de-facto reference implementation for "what large-scale transformer training looks like on NVIDIA hardware," and the parallelism dimensions it exposes (TP, PP, CP, EP, DP) have become the shared vocabulary across the broader ecosystem.[6]
Megatron-LM has well-known weaknesses. Tensor parallelism's all-reduce traffic is bounded by the slowest link in the tensor-parallel group, which in practice limits useful tensor-parallel sizes to a single NVLink-connected node (typically 4 or 8 GPUs).[3] Scaling the model beyond that requires pipeline parallelism, which introduces the pipeline bubble and (for non-interleaved schedules) a fundamental ceiling on hardware utilization.[3]
The model-code intrusiveness is non-trivial: a vanilla PyTorch transformer has to be rewritten in terms of Megatron's ColumnParallelLinear, RowParallelLinear, and parallel-attention modules, and changes to the architecture have to respect tensor-parallel constraints (for instance, the number of attention heads must be divisible by the tensor-parallel size). This is the trade-off that ZeRO-style sharding deliberately avoids; it is one reason researchers without access to large clusters frequently prefer FSDP or DeepSpeed.[24]
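Constraints of this kind are usually checked before a job is launched. The sketch below is a hypothetical pre-flight check, not part of Megatron-LM: the head-count rule comes from the text above, while the hidden-size and feed-forward-size rules are assumptions that follow from splitting the column-parallel and row-parallel weights evenly.

```python
def tensor_parallel_problems(num_heads: int, hidden_size: int,
                             ffn_hidden_size: int, tp: int) -> list:
    """Return human-readable reasons a config cannot be split `tp` ways."""
    if tp < 1:
        raise ValueError("tp must be a positive integer")
    problems = []
    if num_heads % tp != 0:
        # Each rank must own a whole number of attention heads.
        problems.append(f"num_heads={num_heads} is not divisible by tp={tp}")
    if hidden_size % tp != 0:
        # Assumed: row-parallel output projections split the hidden dimension.
        problems.append(f"hidden_size={hidden_size} is not divisible by tp={tp}")
    if ffn_hidden_size % tp != 0:
        # Assumed: the MLP's inner dimension is split by columns, then rows.
        problems.append(f"ffn_hidden_size={ffn_hidden_size} is not divisible by tp={tp}")
    return problems


if __name__ == "__main__":
    # Illustrative configurations only.
    print(tensor_parallel_problems(num_heads=32, hidden_size=4096, ffn_hidden_size=16384, tp=8))
    print(tensor_parallel_problems(num_heads=20, hidden_size=2560, ffn_hidden_size=10240, tp=8))
```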
Operationally, large Megatron-LM runs suffer from the failure modes that all multi-thousand-GPU training jobs share: silent data corruption, hardware faults, NCCL hangs, NVLink errors, and tail-latency from straggler hosts. The MT-NLG paper and the BLOOM training diary are unusually candid about the engineering overhead this incurs, including manual restarts in the tens to hundreds and rolling host replacement.[11][7][20]
Finally, the framework is tightly coupled to NVIDIA hardware. While AMD's ROCm port supports a subset of features and some forks target Habana or Intel GPUs, the canonical performance numbers are reported on NVIDIA NVLink/NVSwitch fabrics, with FP8 paths assuming Hopper or Blackwell Tensor Cores.[26][2]
The Megatron lineage sits inside a broader space of large-model training systems. The most directly comparable frameworks are DeepSpeed (Microsoft, also PyTorch, optimized around the ZeRO family) and PyTorch FSDP (now upstream, focused on parameter sharding rather than tensor splitting).[24] In the JAX world, GSPMD and its successors (Pathways, PaxML, MaxText) cover similar ground using XLA-based sharding annotations rather than hand-rewritten layers.[25]
Megatron-LM's tensor-parallel split is itself one entry in a family of "model-parallel" techniques that began with Mesh-TensorFlow (Shazeer et al., 2018) and continued through GShard (Lepikhin et al., 2020), Alpa (Zheng et al., 2022), and Colossal-AI. Megatron differs in that it is a focused, hand-tuned PyTorch implementation rather than a compiler-driven partitioner.[1][3]