Megatron-LM

Updated Jun 22, 2026

Megatron-LM is NVIDIA's open-source framework for training very large transformer language models across GPU clusters, and also the name of the tensor-parallelism technique it introduced in 2019 for splitting a transformer's linear layers across multiple devices. It first gained prominence when NVIDIA used it to train an 8.3-billion-parameter GPT-2-style model on 512 V100 GPUs, at the time "the largest transformer ever trained."^[8]^[1] The accompanying paper by Mohammad Shoeybi and colleagues, "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (arXiv:1909.08053), described a practical, PyTorch-native form of tensor parallelism that partitions the weight matrices inside the transformer attention and feed-forward blocks.^[1]^[2] Successive papers in 2021 and 2022 added pipeline parallelism with an interleaved 1F1B schedule, sequence parallelism, and selective activation recomputation; the 2021 paper scaled training experiments to a one-trillion-parameter GPT model on 3,072 A100 GPUs at 502 PFLOP/s.^[3]^[4] The core scheduling, communication, and parallelism machinery has since been refactored into a separately versioned library called Megatron-Core, which underpins NVIDIA's NeMo framework, is reused by community forks such as Microsoft's Megatron-DeepSpeed, and is targeted by Hugging Face checkpoint converters.^[5]^[6]^[7]

Infobox

Field	Value
Original release	August 12, 2019^[8]
Developer	NVIDIA Applied Deep Learning Research
Source paper	Shoeybi et al., arXiv:1909.08053^[1]
Latest core library	Megatron-Core 0.16.1 (March 20, 2026)^[2]
Primary language	Python (PyTorch)
License	BSD 3-Clause for NVIDIA code; bundled third-party code under Apache 2.0 / MIT^[9]
Repository	github.com/NVIDIA/Megatron-LM^[2]
Reported GitHub stars	16,400 (Mar 2026 snapshot)^[2]
Largest demonstrated model	1T parameters, 3,072 A100 GPUs, 502 PFLOP/s^[3]

What is Megatron-LM used for?

Megatron-LM is primarily a training framework: it does not provide inference serving (Triton, vLLM, and TensorRT-LLM cover that), although Megatron-Core does include exporters to TensorRT-LLM and inference engine adapters.^[2] It is used to pre-train and post-train large transformer models on multi-GPU and multi-node clusters by composing several forms of parallelism. The systems built on top of Megatron-Core cover three broad domains.

Pre-training of large dense decoder LLMs. This is the original use case, and the one most often cited. Many large-model efforts from 2021 onward either used Megatron-LM directly or adopted its tensor-parallel scheme, including GPT-3 reproductions, MT-NLG, BLOOM, OPT, Nemotron, Llama 3, and many corporate-internal models.^[21]^[11]^[7]

Mixture of Experts training. Megatron-Core implements expert parallelism (EP), grouped MoE expert dispatch, token-dropless and token-dropping variants, and the MoE-aware variants of all the other parallelism dimensions; this was used at scale for Nemotron-4 340B's MoE training experiments and for community MoE models that fork the Megatron codebase.^[17]^[6]

Multimodal and hybrid architectures. Megatron-Core 0.7+ added training recipes for vision-language models (e.g., LLaVA-style), and 0.16 adds Mamba/state-space hybrids and diffusion model collections.^[6]^[2] The same parallelism primitives carry over because the underlying linear layers and attention are unchanged.

History

When was Megatron-LM released? Origins (2019) and the 8.3B parameter demo

The Megatron-LM project originated inside NVIDIA's Applied Deep Learning Research group (ADLR), led at the time by Bryan Catanzaro. In a blog post on August 12, 2019, the group announced that it had trained an 8.3-billion-parameter GPT-2-style model, "the largest transformer ever trained," using 8-way intra-layer model parallelism and 64-way data parallelism across 512 NVIDIA V100 GPUs.^[8] The 8.3B model used 72 transformer layers, a hidden dimension of 3,072, and 24 attention heads, and was trained on an aggregate 174 GB corpus combining Wikipedia, OpenWebText, RealNews, and CC-Stories totalling roughly 40 million deduplicated documents.^[8] The accompanying paper, posted on arXiv on September 17, 2019 as 1909.08053, formalized the technique as "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism."^[1]

That first paper made two main claims. First, by splitting the weight matrices of self-attention and the MLP block column-wise and row-wise across GPUs (the column-then-row pattern explained below), it was possible to train a transformer of multi-billion-parameter scale without writing a new compiler, modifying PyTorch, or relying on the Mesh-TensorFlow style of compiler-driven partitioning that had been used for previous large-scale work.^[1] As the authors put it, the approach "can be fully implemented with the insertion of a few communication operations in native PyTorch."^[1] Second, the approach scaled efficiently: the authors reported 15.1 PFLOP/s sustained across 512 GPUs, a scaling efficiency of 76% relative to a single-GPU baseline that sustained 39 TFLOP/s.^[1] Their 8.3B GPT-2 model set new state-of-the-art results on WikiText-103 perplexity (10.81, down from 15.8) and LAMBADA accuracy (66.51%, up from 63.2%), and their 3.9B BERT-style model improved RACE accuracy to 90.9%.^[1]
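The 76% figure is a scaling efficiency, that is, achieved throughput divided by what perfect linear scaling of the single-GPU baseline would give. A minimal sketch of that arithmetic, using only the numbers quoted above (the function name is illustrative):

```python
# Figures quoted in the text (Shoeybi et al., 2019).
SINGLE_GPU_TFLOPS = 39.0   # sustained throughput of the one-GPU baseline
NUM_GPUS = 512
ACHIEVED_PFLOPS = 15.1     # sustained throughput of the 512-GPU job


def scaling_efficiency(achieved_pflops, num_gpus, single_gpu_tflops):
    """Achieved throughput as a fraction of perfect linear scaling."""
    ideal_pflops = num_gpus * single_gpu_tflops / 1000.0  # TFLOP/s -> PFLOP/s
    return achieved_pflops / ideal_pflops


efficiency = scaling_efficiency(ACHIEVED_PFLOPS, NUM_GPUS, SINGLE_GPU_TFLOPS)
print(f"scaling efficiency: {efficiency:.1%}")
```

Perfect scaling would give 512 x 39 TFLOP/s, just under 20 PFLOP/s, so 15.1 PFLOP/s corresponds to roughly three quarters of the ideal.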

At the time of the 2019 release, the dominant alternative for training models larger than what fit on a single accelerator was Mesh-TensorFlow, which required users to express the model in a special DSL that the compiler could then partition. Megatron-LM showed that the same effect could be achieved by inserting roughly a dozen lines of PyTorch communication code around the GEMMs of the existing transformer block. This pragmatic, framework-native approach is the reason the technique spread quickly to other labs.^[1]

Megatron-2 (2021): pipeline parallelism and 3D parallelism

The second major paper, "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" by Narayanan, Shoeybi, Casper et al., was posted to arXiv on April 9, 2021 (and accepted at SC '21).^[3] Tensor parallelism alone could not scale models much beyond a few tens of billions of parameters: its all-reduces are bandwidth-bound, so a tensor-parallel group larger than one node would have to communicate over slower inter-node links instead of NVLink. The obvious complement was pipeline parallelism, where successive layers are placed on different devices and microbatches flow through them in stages.^[3]

Megatron-2 made two innovations. It folded an existing one-forward-one-backward (1F1B) schedule into a memory-aware pipeline and then introduced an interleaved 1F1B schedule in which each device hosts more than one non-contiguous chunk of the model, reducing the size of the pipeline "bubble" by a factor roughly proportional to the number of chunks per device.^[3] It also showed how to compose tensor parallelism within a node, pipeline parallelism across nodes, and conventional data parallelism across replicas; this composition has come to be known as 3D parallelism.^[3] On NVIDIA's Selene DGX A100 supercomputer the authors reported 502 PFLOP/s aggregate throughput for a one-trillion-parameter GPT-style model on 3,072 A100 GPUs, or 163 TFLOP/s per GPU (about 52% of theoretical peak), with a projected end-to-end training time of roughly 84 days.^[3]^[10]
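The per-GPU and fraction-of-peak figures follow directly from the aggregate number. The sketch below redoes that arithmetic; the 312 TFLOP/s peak is an assumption taken from NVIDIA's published A100 dense FP16/BF16 Tensor Core specification, not from the text above.

```python
AGGREGATE_PFLOPS = 502.0    # reported for the one-trillion-parameter run
NUM_GPUS = 3072
A100_PEAK_TFLOPS = 312.0    # assumption: A100 dense FP16/BF16 Tensor Core peak

# Convert PFLOP/s to TFLOP/s, then divide evenly across the GPUs.
per_gpu_tflops = AGGREGATE_PFLOPS * 1000.0 / NUM_GPUS
fraction_of_peak = per_gpu_tflops / A100_PEAK_TFLOPS

print(f"{per_gpu_tflops:.0f} TFLOP/s per GPU, {fraction_of_peak:.0%} of peak")
```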

The 2021 paper also introduced a scatter-gather optimization for the eight InfiniBand network interface cards present on each DGX A100 node. Instead of routing pipeline-parallel send/receive traffic through a single NIC, the framework chunks the activations across NICs, ships them in parallel across the InfiniBand fabric, and reassembles them on the destination side using NVLink. NVIDIA reported up to 11% throughput improvement from this single optimization on the trillion-parameter run, illustrating how tightly Megatron-LM's published numbers are coupled to the specific topology of the underlying NVIDIA DGX systems.^[10]

Megatron-3 (2022): sequence parallelism and selective recomputation

The next significant paper, "Reducing Activation Recomputation in Large Transformer Models" by Korthikanti, Casper et al., was posted to arXiv on May 10, 2022.^[4] Activation memory had become the binding constraint for large models, and the standard workaround, full activation checkpointing, paid a roughly 30% compute tax. The paper introduced two complementary techniques.

Sequence parallelism observes that the LayerNorm and dropout operations in a transformer block, which standard tensor parallelism replicates on every tensor-parallel rank, are independent along the sequence dimension and can therefore be sliced along it once the all-reduce in the row-parallel projection is rewritten as a reduce-scatter (followed later by an all-gather). This effectively pushes the sequence split through the non-GEMM parts of the block, reducing their activation memory roughly in proportion to the tensor-parallel size.^[4] Selective activation recomputation observes that a small set of activations (the softmax inputs and outputs of the attention block) accounts for a disproportionate share of memory while being cheap to recompute, so the framework recomputes only those rather than the entire block.^[4]

Together the two techniques cut activation memory by about 5x and reduced the runtime overhead of recomputation by more than 90%. On a 530-billion-parameter GPT-style model trained on 2,240 A100 GPUs, Megatron-3 reached 54.2% model FLOPs utilization (MFU), versus 42.1% with traditional full recomputation, a 29% wall-clock speedup.^[4]

The split into Megatron-Core (2023 onward)

As the codebase grew, NVIDIA increasingly used the same components inside its own product stack, particularly the NeMo framework, while external users like Microsoft's DeepSpeed team and the BigScience consortium maintained heavyweight forks.^[11]^[7] To stabilize the API for these downstream consumers, NVIDIA refactored the parallelism, communication, and transformer building blocks into a separately versioned library called Megatron-Core (megatron-core on PyPI), with the original Megatron-LM repository becoming a thin reference training driver on top of it.^[5]^[6] The Megatron-Core 0.16.x series, released in early 2026, adds context parallelism variants, FP8 and FP4 mixed precision, MoE with expert parallelism, multimodal recipes, distributed checkpointing, fault detection, and a custom Fully Sharded Data Parallel (FSDP) implementation.^[2]^[6]

The split is the practical analogue of how PyTorch itself separated out torch.distributed from the rest of the framework: Megatron-LM stays as a research code base where new techniques are first tried, while Megatron-Core is a versioned, semver-stable interface that downstream training stacks (NeMo, Megatron-Bridge, vendor forks) can pin to. The library is published to PyPI as megatron-core, with a numbered release roughly every two to four months and changelog entries that name the originating paper for each major feature.^[5]^[6]^[15]

Recent milestones (2024 to 2026)

In the 2024 to 2026 window the project's headline additions have been the FP8 path for Hopper (using the Transformer Engine library), context parallelism for long-context training, MoE-specific parallelism (expert parallelism and the "MoE parallel folding" technique for heterogeneous mappings), distributed checkpointing that reduces checkpoint overhead by up to 42x relative to native PyTorch save/load, and a custom FSDP implementation that interoperates with Megatron-Core's tensor- and pipeline-parallel groups.^[17]^[6]^[2] Megatron-Core 0.7 (July 2024) added LLaVA-style multimodal training; 0.16 (early 2026) extends this to diffusion-model collections and to Mamba and other state-space hybrids.^[17]^[2] Through 2026 the library also gained initial DeepSeek-V4 support and support for emerging optimizers such as Muon via NVIDIA's Emerging-Optimizers library.^[2]

How does Megatron-LM tensor parallelism work?

Tensor parallelism for the transformer

The signature contribution of Megatron-LM is a specific way of splitting a transformer block across GPUs so that only two all-reduce operations are needed in each of the forward and backward passes per block. Consider the standard MLP Z = GeLU(X * A) * B, with intermediate activation Y = GeLU(X * A). Megatron's recipe is:

Column-parallel linear A. The output dimension of A is split across the tensor-parallel group. Each rank holds a vertical slice A_i and computes Y_i = GeLU(X * A_i) independently. Because GeLU is element-wise, the nonlinearity applies locally without communication.^[1]^[12]
Row-parallel linear B. The input dimension of B is split, matching the partition produced by the previous step. Each rank computes Y_i * B_i, and a single all-reduce sums the partial results to give the final output Z.^[1]^[12]
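The column-then-row split can be checked numerically without any GPUs. The following is a minimal single-process sketch in pure Python, not Megatron-LM's implementation: the tensor-parallel ranks are simulated by a loop, the all-reduce by an element-wise sum, and all function names are illustrative.

```python
import math
import random


def gelu(x):
    """Exact (erf-based) GeLU for one scalar."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def mlp_reference(x, a, b):
    """Unsharded MLP: GeLU(X * A) * B."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, a)]
    return matmul(hidden, b)


def mlp_tensor_parallel(x, a, b, tp_size):
    """Same MLP with A split by columns and B split by rows over tp_size ranks."""
    ffn_dim = len(a[0])
    if ffn_dim % tp_size != 0:
        raise ValueError("FFN dimension must be divisible by tp_size")
    shard = ffn_dim // tp_size
    partial_outputs = []
    for rank in range(tp_size):  # each iteration plays the role of one rank
        cols = slice(rank * shard, (rank + 1) * shard)
        a_i = [row[cols] for row in a]  # column slice of A
        b_i = b[cols]                   # matching row slice of B
        y_i = [[gelu(v) for v in row] for row in matmul(x, a_i)]  # local, no comms
        partial_outputs.append(matmul(y_i, b_i))
    # The single all-reduce: sum the partial outputs element-wise.
    rows, width = len(x), len(b[0])
    return [[sum(p[r][c] for p in partial_outputs) for c in range(width)]
            for r in range(rows)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = random_matrix(3, 4, rng)   # 3 tokens, hidden size 4
a = random_matrix(4, 8, rng)   # hidden -> FFN
b = random_matrix(8, 4, rng)   # FFN -> hidden
reference = mlp_reference(x, a, b)
sharded = mlp_tensor_parallel(x, a, b, tp_size=4)
max_error = max(abs(u - v) for ru, rv in zip(reference, sharded) for u, v in zip(ru, rv))
print(f"max abs difference: {max_error:.2e}")
```

The two results agree up to floating-point rounding because the element-wise GeLU commutes with the column split, so the only cross-rank dependency is the final sum.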

Self-attention uses an analogous pattern: the Q, K, V projection is column-parallel along the head dimension (so each rank holds a subset of attention heads), the attention computation is fully local within each head, and the output projection is row-parallel, again terminating in one all-reduce.^[1] In the forward pass each block thus performs two all-reduces (one in the MLP, one in attention); the backward pass requires two more.^[1] Because these all-reduces are dense and bandwidth-bound, Megatron-LM is normally restricted to tensor-parallel sizes that fit inside a single NVLink/NVSwitch domain (typically 8 GPUs within a DGX node).^[3]
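As a small illustration of the head split and the all-reduce count described above, the sketch below assigns attention heads to ranks and tallies collectives per training step. The contiguous head assignment and both function names are assumptions made for this example, not Megatron-LM's API.

```python
def heads_for_rank(num_heads, tp_size, rank):
    """Attention heads owned by one tensor-parallel rank (contiguous split assumed)."""
    if num_heads % tp_size != 0:
        raise ValueError(f"{num_heads} heads cannot be split evenly across {tp_size} ranks")
    if not 0 <= rank < tp_size:
        raise ValueError("rank out of range")
    per_rank = num_heads // tp_size
    return range(rank * per_rank, (rank + 1) * per_rank)


FORWARD_ALL_REDUCES = 2    # one after attention, one after the MLP
BACKWARD_ALL_REDUCES = 2   # the corresponding collectives in the backward pass


def all_reduces_per_step(num_layers):
    """Tensor-parallel all-reduces for one forward plus backward pass."""
    return num_layers * (FORWARD_ALL_REDUCES + BACKWARD_ALL_REDUCES)


# The 8.3B configuration from the text: 24 heads, 72 layers, 8-way tensor parallel.
print(list(heads_for_rank(24, 8, rank=0)))
print(all_reduces_per_step(72))
```

With 24 heads and 8 ranks each rank owns three heads; a head count that is not a multiple of the tensor-parallel size is rejected.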

Pipeline parallelism and the interleaved 1F1B schedule

When the model is too deep for a single node, layers are partitioned into pipeline stages, each living on a different group of devices. A microbatch flows forward through stages, then backward in reverse. Naive (GPipe-style) scheduling fills the pipeline with forward passes for the entire batch before performing any backwards, which inflates activation memory proportional to the number of microbatches.^[3]

Megatron-LM uses the PipeDream-Flush / 1F1B schedule, where after the pipeline is filled each device alternates one forward and one backward microbatch, keeping the number of in-flight activations bounded by the pipeline depth rather than the global batch size.^[3] The interleaved 1F1B schedule then splits each pipeline stage into multiple non-contiguous "chunks" so that a device, rather than owning layers [4..7], might own [4, 5] and [12, 13]. This shrinks the relative size of the pipeline bubble by the number of chunks at the cost of a proportional increase in pipeline communication. Narayanan et al. report a 10% or greater end-to-end throughput improvement at trillion-parameter scale.^[3]
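The bubble reduction can be made concrete with the expression used in Narayanan et al.: for p pipeline stages and m microbatches the bubble is (p - 1) / m of the ideal compute time, and with v chunks per device it becomes (p - 1) / (v * m). The sketch below evaluates that expression; the function name and the example sizes are illustrative.

```python
def bubble_fraction(pipeline_stages, microbatches, chunks_per_device=1):
    """Pipeline bubble time as a fraction of ideal compute time.

    chunks_per_device=1 corresponds to the plain 1F1B schedule; larger values
    model the interleaved schedule.
    """
    if min(pipeline_stages, microbatches, chunks_per_device) < 1:
        raise ValueError("all arguments must be positive")
    return (pipeline_stages - 1) / (chunks_per_device * microbatches)


plain = bubble_fraction(pipeline_stages=8, microbatches=64)
interleaved = bubble_fraction(pipeline_stages=8, microbatches=64, chunks_per_device=2)
print(f"plain 1F1B: {plain:.4f}, interleaved with 2 chunks: {interleaved:.4f}")
```

Doubling the chunks per device halves the bubble, which matches the "by the number of chunks" statement above; the extra pipeline communication is not modeled here.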

Sequence parallelism

Sequence parallelism is a refinement layered on top of tensor parallelism, not a replacement. In the standard tensor-parallel block the (b, s, h) activations entering and leaving the all-reduce are replicated across tensor-parallel ranks (where b, s, h are batch, sequence, and hidden dimensions). Sequence parallelism rewrites the communication around the LayerNorm and dropout regions so that, in those segments, activations are split along the s dimension. The all-reduce at the boundary becomes a reduce-scatter (entering the sequence-parallel region) and an all-gather (leaving it).^[4] The total communication volume is unchanged because a reduce-scatter plus an all-gather equals one all-reduce in bytes; the benefit is that activation memory in the LayerNorm/dropout regions drops by the tensor-parallel factor t.^[4]
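The claim that a reduce-scatter followed by an all-gather is equivalent to one all-reduce can be checked with a toy simulation. This is a pure-Python sketch in which each "rank" is a list entry; it does not use NCCL or torch.distributed, and the helper names are illustrative.

```python
def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world = len(per_rank)
    length = len(per_rank[0])
    if length % world != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = length // world
    return [[sum(buf[j] for buf in per_rank) for j in range(r * chunk, (r + 1) * chunk)]
            for r in range(world)]


def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [value for shard in shards for value in shard]
    return [list(full) for _ in shards]


# Four ranks, each holding a partial sum over a flattened activation of length 8.
partials = [[rank * 10 + i for i in range(8)] for rank in range(4)]
scattered = reduce_scatter(partials)   # what each rank holds inside the LayerNorm/dropout region
regathered = all_gather(scattered)     # restored before the next tensor-parallel GEMM
print(len(scattered[0]), "of", len(partials[0]), "elements held per rank in the sharded region")
```

Between the two collectives each rank holds only 1/t of the activation, which is where the memory saving in the LayerNorm and dropout regions comes from.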

Selective activation recomputation

Within a transformer block, the attention softmax and dropout regions dominate activation memory while being inexpensive to recompute (a few non-GEMM elementwise operations). Selective activation recomputation drops only those activations and recomputes them during backward, retaining the more expensive GEMM activations. Combined with sequence parallelism this almost eliminates the need for full block-level checkpointing.^[4]
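To see why those activations matter, the per-layer activation-memory expressions from Korthikanti et al. can be evaluated directly. The constants below (34, 10, 24, and the 5as/h attention term) are reproduced from that paper as recalled and should be checked against it before being relied on; the GPT-3-like shape and the function name are illustrative assumptions.

```python
def activation_bytes_per_layer(s, b, h, a, t=1, sequence_parallel=False,
                               selective_recompute=False):
    """Approximate activation bytes stored per transformer layer.

    s: sequence length, b: microbatch size, h: hidden size,
    a: attention heads, t: tensor-parallel size.
    """
    # Softmax and attention-dropout activations; dropped under selective recomputation.
    attention_term = 0.0 if selective_recompute else 5.0 * a * s / h
    if sequence_parallel:
        # Everything, including the LayerNorm/dropout regions, is divided by t.
        return s * b * h * (34.0 + attention_term) / t
    # Without sequence parallelism a 10*s*b*h portion stays replicated on every rank.
    return s * b * h * (10.0 + (24.0 + attention_term) / t)


shape = dict(s=2048, b=1, h=12288, a=96)  # assumed GPT-3-like layer shape
tp_only = activation_bytes_per_layer(**shape, t=8)
tp_sp = activation_bytes_per_layer(**shape, t=8, sequence_parallel=True)
both = activation_bytes_per_layer(**shape, t=8, sequence_parallel=True,
                                  selective_recompute=True)
print(f"TP only / (SP + selective) = {tp_only / both:.2f}x")
```

For this shape the attention term 5as/h equals 80 against a constant of 34, so dropping it removes about 70% of the per-layer activations, and the combined reduction relative to tensor parallelism alone comes out slightly above 5x.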

Context parallelism

Context parallelism (CP) generalizes the sequence-splitting idea to the attention computation itself. The input, keys, and values are sharded along the sequence dimension, and the attention computation is performed by circulating K/V chunks in a ring among context-parallel ranks, similar to Liu et al.'s ring attention formulation.^[13] NVIDIA's CP implementation differs from the academic ring-attention in that it leverages cuDNN/Flash-style fused attention kernels and handles causal masking by rebalancing work across ranks; the docs cite a 1.48x speedup for variable-length sequences from "Dynamic Context Parallelism" in recent Megatron-Core releases.^[13]^[2]
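The numerical idea behind ring-style attention is that each rank can fold in one K/V chunk at a time while maintaining a running softmax maximum and normalizer. The sketch below simulates that for a single head in pure Python and compares it with ordinary attention. It is an illustration of the general technique only: causal masking, load rebalancing, and fused kernels are omitted, the ring is simulated by indexing, and the function names are assumptions.

```python
import math
import random


def full_attention(q, k, v):
    """Reference softmax(q k^T / sqrt(d)) v over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(scores)
        weights = [math.exp(sc - top) for sc in scores]
        norm = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / norm
                    for c in range(len(v[0]))])
    return out


def ring_attention(q, k, v, cp_size):
    """Each simulated rank keeps its Q shard and visits K/V shards one ring step at a time."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    if n % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    chunk = n // cp_size
    shard = lambda x, r: x[r * chunk:(r + 1) * chunk]
    out = []
    for rank in range(cp_size):
        run_max = [-math.inf] * chunk            # running softmax maximum per query
        run_sum = [0.0] * chunk                  # running softmax normalizer per query
        run_out = [[0.0] * dv for _ in range(chunk)]
        for step in range(cp_size):
            src = (rank - step) % cp_size        # K/V shard present after `step` ring hops
            k_blk, v_blk = shard(k, src), shard(v, src)
            for i, qi in enumerate(shard(q, rank)):
                scores = [sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d) for kj in k_blk]
                new_max = max(run_max[i], max(scores))
                rescale = math.exp(run_max[i] - new_max)  # 0.0 on the first step
                weights = [math.exp(sc - new_max) for sc in scores]
                run_sum[i] = run_sum[i] * rescale + sum(weights)
                run_out[i] = [o * rescale + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                              for c, o in enumerate(run_out[i])]
                run_max[i] = new_max
        out.extend([[o / run_sum[i] for o in run_out[i]] for i in range(chunk)])
    return out


rng = random.Random(0)
make = lambda rows, cols: [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
q, k, v = make(8, 4), make(8, 4), make(8, 4)
ring = ring_attention(q, k, v, cp_size=4)
ref = full_attention(q, k, v)
max_error = max(abs(x - y) for rx, ry in zip(ring, ref) for x, y in zip(rx, ry))
print(f"max abs difference: {max_error:.2e}")
```

Because each rank only ever holds one K/V chunk plus its running statistics, attention memory per rank scales with the local sequence length rather than the full one.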

Distributed optimizer

Megatron-Core's distributed optimizer shards optimizer states (and master FP32 parameters in mixed precision) across the data-parallel group, in the spirit of DeepSpeed ZeRO Stage 1 but integrated with the tensor- and pipeline-parallel topology. It reduces optimizer-state memory by a factor equal to the data-parallel size without increasing the per-step communication volume relative to vanilla data-parallel training.^[5]^[6]
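A back-of-the-envelope view of the saving, using the mixed-precision Adam accounting from the ZeRO paper (2 bytes of weights, 2 bytes of gradients, and 12 bytes of FP32 optimizer state per parameter). This is an assumption chosen for illustration; Megatron-Core's actual buffers, for example FP32 gradient accumulation, can change the constants.

```python
WEIGHT_BYTES = 2           # 16-bit model weights
GRAD_BYTES = 2             # 16-bit gradients (assumed; FP32 gradients would be 4)
OPTIMIZER_BYTES = 4 * 3    # FP32 master weights + Adam momentum + Adam variance


def bytes_per_parameter(dp_size, shard_optimizer):
    """Model-state bytes per parameter on one data-parallel rank."""
    if dp_size < 1:
        raise ValueError("dp_size must be positive")
    optimizer = OPTIMIZER_BYTES / dp_size if shard_optimizer else OPTIMIZER_BYTES
    return WEIGHT_BYTES + GRAD_BYTES + optimizer


replicated = bytes_per_parameter(dp_size=64, shard_optimizer=False)
sharded = bytes_per_parameter(dp_size=64, shard_optimizer=True)
print(f"replicated: {replicated} B/param, sharded over 64 ranks: {sharded} B/param")
```

Under these assumptions the per-rank footprint falls from 16 bytes to just over 4 bytes per parameter at a data-parallel size of 64, since only the 12-byte optimizer portion is divided.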

FP8 and FP4 mixed precision

On NVIDIA Hopper (H100) and later, Megatron-Core integrates with the Transformer Engine library to perform GEMMs in FP8 with per-tensor or delayed scaling, using either E4M3 or E5M2 formats depending on the operation. Megatron-Core 0.16 added support for FP8 parameter all-gather under per-tensor scaling and, on NVIDIA Blackwell, FP4 compute paths for both pre-training and post-training.^[2]^[6]

3D parallelism

In a typical large run the parallelism dimensions stack: tensor parallelism within a node (usually 4 or 8), context parallelism along a second axis (often 2 or more, only when sequences are long), pipeline parallelism across nodes, data parallelism across the remaining replicas, and (for sparse models) expert parallelism splitting MoE experts among groups. The product of the tensor-, context-, pipeline-, and data-parallel sizes equals the total number of devices; expert parallelism reuses ranks from those dimensions rather than multiplying the device count. Megatron-Core's parallel_state module is the canonical reference implementation for laying out these groups onto a cluster and routing the corresponding NCCL communicators.^[14]^[6]

A typical large run on a 1,024-GPU H100 cluster might use TP=8, CP=2, PP=8, DP=8, which multiplies to 1,024. Choosing the right factorization for a given model is one of the central engineering exercises: tensor parallelism is bandwidth-bound and must stay inside a single NVLink domain; pipeline parallelism trades the pipeline bubble against per-stage memory; context parallelism only helps when sequences are long enough that activations along the sequence dimension dominate memory; data parallelism is cheapest per step but multiplies the gradient all-reduce volume.^[3]^[6]
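The factorization constraint is easy to encode. The sketch below derives the data-parallel size from the other dimensions and flags layouts that would break the rules of thumb in the paragraph above. The class, its field names, and the checks are illustrative assumptions and are unrelated to the real parallel_state API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical description of a TP x CP x PP x DP layout."""
    world_size: int
    tp: int
    pp: int
    cp: int = 1

    @property
    def dp(self):
        """Data-parallel size implied by the other dimensions."""
        model_parallel = self.tp * self.cp * self.pp
        if self.world_size % model_parallel != 0:
            raise ValueError(
                f"world size {self.world_size} is not divisible by TP*CP*PP = {model_parallel}"
            )
        return self.world_size // model_parallel


def layout_warnings(layout, gpus_per_node=8, num_layers=None):
    """Return human-readable warnings for a layout."""
    warnings = []
    if layout.tp > gpus_per_node:
        warnings.append("tensor-parallel group would span more than one node")
    if num_layers is not None and num_layers % layout.pp != 0:
        warnings.append("layers do not divide evenly into pipeline stages")
    return warnings


example = ParallelLayout(world_size=1024, tp=8, cp=2, pp=8)
print(example.dp, layout_warnings(example, num_layers=96))
```

For the 1,024-GPU example this yields DP=8 with no warnings; asking for TP=16 on 8-GPU nodes would be flagged because the tensor-parallel all-reduces would leave the NVLink domain.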

Distributed checkpointing

A side benefit of Megatron-Core's parallelism model is that the same parallel-group abstractions can be reused for distributed checkpointing. Each rank writes only the parameter shards it owns, with a small index file describing how to reassemble the global state. Megatron-Core 0.7 reported up to a 42x reduction in checkpointing overhead compared to a naive torch.save of a gathered state, and the format is portable across different parallel configurations, so a model saved at TP=8/PP=4 can be resumed at TP=4/PP=8 without explicit conversion.^[17]^[6]
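The reason a sharded checkpoint can be portable across layouts is that each shard is just a known slice of a global tensor. The toy sketch below shows the idea for one column-split weight, with in-memory lists standing in for shard files; it is not the Megatron-Core checkpoint format or API, and all function names are hypothetical.

```python
def split_columns(weight, tp_size):
    """Split a list-of-rows matrix into tp_size column shards."""
    width = len(weight[0])
    if width % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    shard = width // tp_size
    return [[row[r * shard:(r + 1) * shard] for row in weight] for r in range(tp_size)]


def merge_columns(shards):
    """Reassemble the global matrix from column shards given in rank order."""
    rows = len(shards[0])
    return [[value for shard in shards for value in shard[r]] for r in range(rows)]


def reshard(shards, new_tp_size):
    """Load shards written at one tensor-parallel size and re-split for another."""
    return split_columns(merge_columns(shards), new_tp_size)


weight = [[r * 8 + c for c in range(8)] for r in range(2)]  # 2 x 8 global tensor
saved_tp8 = split_columns(weight, 8)    # what eight ranks would each write
resumed_tp4 = reshard(saved_tp8, 4)     # what four ranks would each read back
print(resumed_tp4[0])
```

A real implementation would avoid materializing the full tensor on one rank and would read only the byte ranges each new rank needs; the sketch merges in memory for clarity.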

What software is built on Megatron-LM? Implementations and variants

Megatron-Core inside NeMo

NVIDIA's NeMo stack uses Megatron-Core as its training backend. NeMo Megatron originally bundled a fork of the Megatron-LM scripts with PyTorch Lightning wrappers; the current direction (as of 2026) is NeMo Megatron-Bridge, a PyTorch-native training loop that imports Megatron-Core directly and provides bidirectional checkpoint conversion with Hugging Face Transformers.^[15]^[16] Megatron-Core powered the training of Nemotron-4 340B on more than 6,000 H100 GPUs.^[17]

Megatron-DeepSpeed

The Microsoft DeepSpeed team maintains a fork called Megatron-DeepSpeed that combines Megatron's tensor and pipeline parallel implementations with DeepSpeed's ZeRO sharding, optimizer offload, and universal checkpointing. This fork was used to train both Megatron-Turing NLG 530B and BLOOM 176B.^[11]^[7] The BigScience consortium maintained a further fork of Megatron-DeepSpeed specifically for BLOOM.^[7]

Community forks and adapters

A long tail of academic and corporate forks (EPFL's Megatron-LLM, Alibaba's Megatron-LLaMA, the Arcee Megatron-LM-Llama-70B fork, Yandex's YaLM-100B distribution) extends the Megatron-LM base for specific model families or hardware platforms. Many of these focus on adding LLaMA-style architectural details (RoPE positional encoding, RMSNorm, SwiGLU MLP, grouped-query attention) to the original GPT recipe.^[18]^[19]

Does Megatron-LM run on non-NVIDIA hardware?

While Megatron-LM is engineered for NVIDIA GPUs, AMD has published a ROCm port (used in the AMD ROCm 7.0 documentation for Llama 3 pre-training benchmarks) and several research groups have ported subsets of the library to Intel Habana and other accelerators.^[26] These ports typically inherit the column-parallel/row-parallel/1F1B logic unchanged while substituting collective communication backends (RCCL for ROCm) and rewriting the FP8 paths that depend on Hopper-specific Tensor Core instructions. Performance on non-NVIDIA hardware is generally lower than on the reference DGX systems but the same algorithmic scaling holds.^[26]

StarCoder and other code models

The BigCode project's StarCoder 15.5B model was trained on 512 A100 GPUs using Megatron-LM orchestration over a 24-day run, consuming roughly 320,000 GPU-hours and 1 trillion tokens drawn from The Stack v1.2 across more than 80 programming languages.^[27] The training combined tensor and pipeline parallelism with multi-query attention and an 8,192-token context, demonstrating that the same Megatron-LM stack used for natural-language pre-training translates directly to code modeling.^[27]

Which models were trained with Megatron-style parallelism?

Model	Year	Parameters	Hardware	Framework used	Citation
Megatron 8.3B	2019	8.3B	512 V100	Megatron-LM v0	^[1]^[8]
Megatron-Turing NLG (MT-NLG)	2022	530B	2,240 A100 (later 4,480)	Megatron-DeepSpeed (3D parallel)	^[11]
BLOOM	2022	176B	384 A100 80GB	Megatron-DeepSpeed (BigScience fork)	^[7]
OPT-175B	2022	175B	992 A100 80GB	FSDP + Megatron-LM tensor parallel	^[20]
Nemotron-4 340B	2024	340B	6,000+ H100	Megatron-Core	^[17]
Llama 3 herd	2024	up to 405B	16,000 H100	Meta internal stack, Megatron-style 4D parallel	^[21]
Phi-3	2024	up to 14B	NVIDIA H100	DeepSpeed (incl. Megatron components) + UCP	^[22]

Adoption beyond this list extends to most large open-weight model releases that include parallelism details. The 2019 Megatron-LM paper is one of the most cited systems papers in modern deep learning; as of 2026 Semantic Scholar and Google Scholar both list well over 2,000 citations for the original arXiv:1909.08053 paper.^[1]^[23]

MT-NLG 530B: a representative case study

The Megatron-Turing Natural Language Generation 530B (MT-NLG) model, described by Smith, Patwary et al. in arXiv:2201.11990 (January 2022), is the canonical demonstration of Megatron-style training at scale. The model is a 105-layer decoder transformer with 530 billion parameters, trained jointly by NVIDIA and Microsoft using a combination of NVIDIA's Megatron-LM and Microsoft's DeepSpeed frameworks.^[11]

Training ran on NVIDIA's Selene supercomputer (560 DGX A100 servers, 4,480 A100 GPUs, HDR InfiniBand interconnect) and on Microsoft Azure NDv4 cloud nodes following the same reference architecture.^[11] The training stack used 3D parallelism: tensor parallelism within an 8-GPU NVLink domain, pipeline parallelism across nodes, and ZeRO-1 sharded data parallelism for the remaining dimension. At its time MT-NLG was the largest monolithic (non-MoE) language model trained, roughly 3x the parameter count of GPT-3.^[11] The Megatron-3 paper subsequently reused this model as its 530B benchmark.^[4]

BLOOM 176B: the open-science alternative

The BigScience workshop's BLOOM model, trained between March and July 2022, used a fork of Megatron-DeepSpeed and was the first publicly released multilingual model at the 176B scale.^[7] The configuration used tensor parallelism of size 4 (kept within a single node), pipeline parallelism of size 12, and data parallelism of size 8, for a total of 4 x 12 x 8 = 384 A100 80GB GPUs over 48 DGX-like nodes; each model replica therefore spanned 48 GPUs, or six 8-GPU nodes.^[7] Training ran for 117 days and consumed approximately one million GPU-hours, processing roughly 350 billion tokens of multilingual training data and reaching up to roughly 150 TFLOP/s per GPU; the team explicitly noted this as the highest reported throughput for that A100 generation.^[7]

The BLOOM configuration also illustrates the model-architectural choices that interact with Megatron-LM's parallelism. The team used BF16 mixed precision (chosen over FP16 specifically to avoid the loss-scaling overflows that had plagued earlier large runs), ZeRO Stage 1 (optimizer-state sharding only, since gradient and parameter sharding would have conflicted with the pipeline-parallel decomposition), and ALiBi positional encoding instead of learned positional embeddings to enable extrapolation to longer sequences. Custom fused CUDA kernels for LayerNorm, attention, and GeLU rounded out the throughput optimizations.^[7]

The BLOOM run also documented the operational reality of Megatron-style training at scale: roughly one to two GPU failures per week, automatic checkpointing every three hours to bound rollback, and a 24/7 on-call rota across the global team.^[7]

How does Megatron-LM differ from DeepSpeed and FSDP?

Approach	Primary memory strategy	Communication structure	Notes
Megatron-LM tensor parallel	Shard layer parameters and activations within block	All-reduce per block (4 total per layer)	Optimized for NVLink/NVSwitch; small TP within a node
DeepSpeed ZeRO	Shard optimizer state, gradients, parameters across DP ranks	All-gather/reduce-scatter per step	Drop-in for data-parallel; no model code changes
PyTorch FSDP	Shard parameters across DP ranks; reconstruct per layer	All-gather/reduce-scatter per layer	Native PyTorch; complements Megatron TP/PP
GSPMD / JAX pjit	Compiler-driven sharding via XLA	Compiler emits collectives	Used in Google TPU stacks; no special module code
Ring-attention class methods	Shard along sequence dimension	Ring exchange of K/V blocks between neighboring ranks	Megatron-Core's context parallelism is a closely related variant

The widely cited summary of the difference between DeepSpeed and Megatron-LM is one of philosophy: ZeRO leaves the model code unchanged and reconstructs parameters on demand using high-bandwidth collectives, paying communication for simplicity; Megatron-LM keeps activations local but rewrites the model into column-parallel/row-parallel layers, paying complexity for lower per-step communication.^[24] In practice the two are usually combined, with tensor and pipeline parallelism coming from Megatron-LM and the data-parallel dimension using ZeRO-style sharding.^[24]^[11]

Empirically, the relative performance of the two approaches depends sharply on cluster topology and model size. For models at the tens-of-billions scale, ZeRO-2 and ZeRO-3 are often competitive or faster than pure Megatron-LM tensor parallelism, because intra-node tensor parallelism's all-reduces dominate when the model fits in fewer GPUs. For models at 100B+ parameters trained across many nodes, Megatron-LM's tensor and pipeline parallelism wins on aggregate throughput because ZeRO-3's parameter all-gathers cross InfiniBand, while tensor-parallel all-reduces stay on NVLink. The MT-NLG paper reports that for the 530B model, pure ZeRO-3 was substantially slower than the 3D-parallel Megatron-DeepSpeed configuration; that pattern is the empirical justification for the now-standard combination of Megatron tensor parallelism with ZeRO Stage 1 optimizer sharding.^[11]^[24]

FSDP occupies a middle ground: like ZeRO it shards parameters across the data-parallel group, but it is implemented as a module wrapper in PyTorch with no custom kernel code. Megatron-Core's recent "custom FSDP" implementation provides an FSDP-compatible variant tuned for the same parallel groups that Megatron tensor and pipeline parallelism use.^[2]^[6] GSPMD, the compiler-based partitioner used in Google's JAX/TPU stacks, derives sharding annotations from user hints and emits collectives automatically; the trade-off is that it relies on the XLA compiler rather than hand-written PyTorch modules.^[25]

Why is Megatron-LM significant?

Megatron-LM's significance is twofold. As a research artifact, the 2019 paper provided the first widely reproducible recipe for splitting transformer linear layers across GPUs in pure PyTorch; almost every subsequent large-model framework, including Microsoft's Megatron-DeepSpeed, the BigScience fork of it, EleutherAI's GPT-NeoX, and Meta's internal Llama 3 stack, either uses Megatron tensor parallelism directly or implements the same column-parallel/row-parallel pattern under another name.^[7]^[11]^[21] The 2021 and 2022 papers similarly seeded the now-ubiquitous techniques of interleaved 1F1B pipelining and sequence-parallel activation reduction. The arXiv paper 1909.08053 has been cited several thousand times according to Semantic Scholar as of 2026, placing it among the most-cited systems contributions in the deep-learning literature of the last decade.^[23]

As a production artifact, Megatron-Core is the training kernel inside NVIDIA's NeMo product, which in turn underpins commercial LLM training services offered by major cloud providers; the same kernel is used in the public Nemotron releases.^[17]^[15] In effect, Megatron-Core has become a de-facto reference implementation for "what large-scale transformer training looks like on NVIDIA hardware," and the parallelism dimensions it exposes (TP, PP, CP, EP, DP) have become the shared vocabulary across the broader ecosystem.^[6]

What are the limitations of Megatron-LM?

Megatron-LM has well-known weaknesses. Tensor parallelism's all-reduce traffic is bounded by the slowest link in the tensor-parallel group, which in practice limits useful tensor-parallel sizes to a single NVLink-connected node (typically 4 or 8 GPUs).^[3] Pushing the model wider than that requires pipeline parallelism, which introduces the pipeline bubble and (for non-interleaved schedules) a fundamental floor on hardware utilization.^[3]

The model-code intrusiveness is non-trivial: a vanilla PyTorch transformer has to be rewritten in terms of Megatron's ColumnParallelLinear, RowParallelLinear, and parallel-attention modules, and changes to the architecture have to respect tensor-parallel constraints (for instance, the number of attention heads must be divisible by the tensor-parallel size). This is the trade-off that ZeRO-style sharding deliberately avoids; it is one reason researchers without access to large clusters frequently prefer FSDP or DeepSpeed.^[24]

Operationally, large Megatron-LM runs suffer from the failure modes that all multi-thousand-GPU training jobs share: silent data corruption, hardware faults, NCCL hangs, NVLink errors, and tail-latency from straggler hosts. The MT-NLG paper and the BLOOM training diary are unusually candid about the engineering overhead this incurs, including manual restarts in the tens to hundreds and rolling host replacement.^[11]^[7]^[20]

Finally, the framework is tightly coupled to NVIDIA hardware. While AMD's ROCm port supports a subset of features and some forks target Habana or Intel GPUs, the canonical performance numbers are reported on NVIDIA NVLink/NVSwitch fabrics, with FP8 paths assuming Hopper or Blackwell Tensor Cores.^[26]^[2]

The Megatron lineage sits inside a broader space of large-model training systems. The most directly comparable frameworks are DeepSpeed (Microsoft, also PyTorch, optimized around the ZeRO family) and PyTorch FSDP (now upstream, focused on parameter sharding rather than tensor splitting).^[24] In the JAX world, GSPMD and its successors (Pathways, PaxML, MaxText) cover similar ground using XLA-based sharding annotations rather than hand-rewritten layers.^[25]

Megatron-LM's tensor-parallel split is itself one entry in a family of model parallelism techniques (a core part of modern distributed training) that began with Mesh-TensorFlow (Shazeer et al., 2018) and continued through GShard (Lepikhin et al., 2020), Alpa (Zheng et al., 2022), and Colossal-AI. Megatron differs in that it is a focused, hand-tuned PyTorch implementation rather than a compiler-driven partitioner.^[1]^[3]

References

1. Shoeybi, Mohammad et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism", arXiv, 2019-09-17 (v1; v4 2020-03-13). https://arxiv.org/abs/1909.08053. Accessed 2026-05-20.
2. NVIDIA, "Megatron-LM (GitHub repository)", NVIDIA, 2026 (latest release 0.16.1, 2026-03-20). https://github.com/NVIDIA/Megatron-LM. Accessed 2026-05-20.
3. Narayanan, Deepak et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM", arXiv, 2021-04-09 (v1; v5 2021-08-23). https://arxiv.org/abs/2104.04473. Accessed 2026-05-20.
4. Korthikanti, Vijay et al., "Reducing Activation Recomputation in Large Transformer Models", arXiv, 2022-05-10. https://arxiv.org/abs/2205.05198. Accessed 2026-05-20.
5. NVIDIA, "Megatron-Core product page", NVIDIA Developer, 2026. https://developer.nvidia.com/megatron-core. Accessed 2026-05-20.
6. NVIDIA, "Megatron Core developer guide and parallelism guide", NVIDIA Docs, 2026. https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/parallelism-guide.html. Accessed 2026-05-20.
7. Bekman, Stas et al., "The Technology Behind BLOOM Training", Hugging Face Blog, 2022-07-14. https://huggingface.co/blog/bloom-megatron-deepspeed. Accessed 2026-05-20.
8. NVIDIA Applied Deep Learning Research, "MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism", NVIDIA ADLR Blog, 2019-08-12. https://nv-adlr.github.io/MegatronLM. Accessed 2026-05-20.
9. NVIDIA, "Megatron-LM LICENSE file", GitHub, 2026 (current revision). https://github.com/NVIDIA/Megatron-LM/blob/main/LICENSE. Accessed 2026-05-20.
10. NVIDIA, "Scaling Language Model Training to a Trillion Parameters Using Megatron", NVIDIA Technical Blog, 2021-04-12. https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/. Accessed 2026-05-20.
11. Smith, Shaden, Patwary, Mostofa et al., "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model", arXiv, 2022-01-28 (v3 2022-02-04). https://arxiv.org/abs/2201.11990. Accessed 2026-05-20.
12. NVIDIA, "Megatron-LM/megatron/core/tensor_parallel/layers.py", GitHub, 2026 (current). https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/tensor_parallel/layers.py. Accessed 2026-05-20.
13. NVIDIA, "context_parallel package, Megatron-Core API documentation", NVIDIA Docs, 2026. https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html. Accessed 2026-05-20.
14. NVIDIA, "Megatron-LM/megatron/core/parallel_state.py", GitHub, 2026. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/parallel_state.py. Accessed 2026-05-20.
15. NVIDIA, "NeMo Megatron-Bridge", NVIDIA Docs, 2026. https://docs.nvidia.com/nemo/megatron-bridge/latest/. Accessed 2026-05-20.
16. NVIDIA, "Megatron-Bridge (GitHub repository)", NVIDIA-NeMo organization, 2026. https://github.com/NVIDIA-NeMo/Megatron-Bridge. Accessed 2026-05-20.
17. NVIDIA, "Train Generative AI Models More Efficiently with New NVIDIA Megatron-Core Functionalities", NVIDIA Technical Blog, 2024-07-12. https://developer.nvidia.com/blog/train-generative-ai-models-more-efficiently-with-new-nvidia-megatron-core-functionalities/. Accessed 2026-05-20.
18. NVIDIA, "Llama, Mistral and other Llama-like model support in Megatron-LM (docs)", NVIDIA Docs, 2026. https://docs.nvidia.com/megatron-core/developer-guide/latest/llama_mistral.html. Accessed 2026-05-20.
19. Alibaba, "Megatron-LLaMA: Best practice for training LLaMA models in Megatron-LM", GitHub, accessed 2026. https://github.com/alibaba/Megatron-LLaMA. Accessed 2026-05-20.
20. Zhang, Susan et al., "OPT: Open Pre-trained Transformer Language Models (and training logbook)", arXiv 2205.01068 / Meta AI, 2022-05-02. https://arxiv.org/abs/2205.01068. Accessed 2026-05-20.
21. Meta AI, "The Llama 3 Herd of Models", Meta AI / arXiv 2407.21783, 2024-07-23. https://arxiv.org/abs/2407.21783. Accessed 2026-05-20.
22. Lian, Xiaoxia et al., "Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training", arXiv, 2024-06-27. https://arxiv.org/abs/2406.18820. Accessed 2026-05-20.
23. Semantic Scholar, "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (citation page)", Semantic Scholar, accessed 2026. https://www.semanticscholar.org/paper/Megatron-LM:-Training-Multi-Billion-Parameter-Using-Shoeybi-Patwary/8323c591e119eb09b28b29fd6c7bc76bd889df7a. Accessed 2026-05-20.
24. Microsoft DeepSpeed Team, "DeepSpeed training overview and features", DeepSpeed Documentation, accessed 2026. https://www.deepspeed.ai/training/. Accessed 2026-05-20.
25. Xu, Yuanzhong et al., "GSPMD: General and Scalable Parallelization for ML Computation Graphs", arXiv, 2021-05-10. https://arxiv.org/abs/2105.04663. Accessed 2026-05-20.
26. AMD, "Benchmark Llama 3 pre-training with Megatron-LM (ROCm 7.0)", AMD ROCm Documentation, 2026. https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/training-megatron-lm-llama-3.html. Accessed 2026-05-20.
27. Li, Raymond et al., "StarCoder: may the source be with you!", arXiv, 2023-05-09. https://arxiv.org/abs/2305.06161. Accessed 2026-05-20.



