ZeRO (Zero Redundancy Optimizer)
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,006 words
ZeRO (Zero Redundancy Optimizer) is a family of memory-optimization techniques for training large neural networks introduced by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He at Microsoft Research in the 2019 paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models".[1] The method partitions optimizer states, gradients, and model parameters across data-parallel processes rather than replicating them on every device, eliminating the memory redundancy inherent in conventional all-reduce data parallelism.[1] ZeRO is the central memory optimization in Microsoft's DeepSpeed library and has powered training of models including Turing-NLG 17B, Megatron-Turing NLG 530B, and BLOOM 176B.[2][3][4] Extensions of the original work include ZeRO-Offload (2021), ZeRO-Infinity (2021), and ZeRO++ (2023).[5][6][7]
Training large transformer models with billions of parameters confronts a fundamental memory constraint: GPU device memory on a single accelerator cannot hold the model state required by gradient-based optimization at full precision.[1] Mixed-precision training of a model with Ψ parameters using the Adam optimizer requires storing an fp16 copy of the parameters and the fp16 gradients (2Ψ + 2Ψ = 4Ψ bytes) along with the optimizer states held in fp32: an fp32 parameter copy, the fp32 first-moment (momentum) buffer, and the fp32 second-moment (variance) buffer (4Ψ + 4Ψ + 4Ψ = 12Ψ bytes), for a total of 16Ψ bytes, or 16 bytes per parameter.[1] A model with one billion parameters therefore needs roughly 16 GB just for these "model states" before any activations or workspace buffers are allocated.[1]
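The byte accounting above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `adam_mixed_precision_state_bytes` is illustrative and not part of any library.

```python
def adam_mixed_precision_state_bytes(num_params: int) -> dict:
    """Bytes of model state for mixed-precision Adam, using the 2 + 2 + 12 breakdown."""
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    # fp32 master parameters + fp32 momentum + fp32 variance
    fp32_optimizer = (4 + 4 + 4) * num_params
    return {
        "fp16_params": fp16_params,
        "fp16_grads": fp16_grads,
        "fp32_optimizer": fp32_optimizer,
        "total": fp16_params + fp16_grads + fp32_optimizer,
    }


state = adam_mixed_precision_state_bytes(1_000_000_000)
print(f"{state['total'] / 1e9:.0f} GB")  # prints "16 GB"
```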
Conventional data-parallel training based on all-reduce, as implemented in PyTorch DistributedDataParallel and similar systems, replicates the full model, gradient buffers, and optimizer state on every GPU and synchronizes gradients with an all-reduce collective at each step.[1] This replication scales well in throughput but uses memory inefficiently: 64 data-parallel ranks holding a 1 B-parameter model carry 64 identical copies of the 16Ψ-byte model state, for an aggregate of roughly 1 TB across the cluster, even though only one copy's worth of that state (about 16 GB) is distinct.[1] Model-parallel approaches such as Megatron-LM-style tensor parallelism and pipeline parallelism partition the model across devices, but they introduce code rewrites, harm computational granularity, and are typically restricted to a single node because of bandwidth requirements.[1][4] ZeRO was designed to obtain the memory benefits of model parallelism while retaining the simplicity and computational granularity of data parallelism.[1]
The ZeRO paper appeared on arXiv on 4 October 2019 and was published at the ACM/IEEE SC20 conference, where it received a Best Paper finalist nomination.[1][8] Microsoft released the DeepSpeed library implementing ZeRO Stage 1 on 10 February 2020 alongside a blog post announcing the 17 B-parameter Turing-NLG model.[2][9] DeepSpeed is open source under the Apache 2.0 license and is developed by an engineering team at Microsoft Research that includes the original ZeRO authors.[15]
The ZeRO paper formalizes three progressive levels of partitioning, traditionally labelled ZeRO-1, ZeRO-2, and ZeRO-3 (also called P_os, P_os+g, and P_os+g+p in the original notation).[1] Each stage shards one more component of the model state across the N data-parallel ranks, further reducing per-GPU memory.[1] The memory and communication trade-offs are summarised below.
| Stage | What is partitioned | Per-GPU model-state memory (for Ψ parameters, N ranks) | Communication volume vs DDP baseline |
|---|---|---|---|
| Baseline DDP | Nothing partitioned; all states replicated | 16Ψ | 1.0x (one all-reduce of Ψ gradients) |
| ZeRO-1 (P_os) | Optimizer states only (12Ψ portion) | 4Ψ + 12Ψ/N | 1.0x (same as DDP) |
| ZeRO-2 (P_os+g) | Optimizer states + gradients | 2Ψ + 14Ψ/N | 1.0x (same as DDP) |
| ZeRO-3 (P_os+g+p) | Optimizer states + gradients + parameters | 16Ψ/N | 1.5x DDP (reduce-scatter + two all-gathers) |
Source: ZeRO paper, Tables 1 and 2, and Microsoft Research blog posts.[1][9][10]
For a one-trillion-parameter model with 1,024 GPUs, the paper estimates the per-device model-state footprint at roughly 16 GB under ZeRO-3, which fits within high-bandwidth memory on data-centre accelerators.[9]
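The per-stage formulas in the table can be evaluated directly. The sketch below assumes mixed-precision Adam (2Ψ + 2Ψ + 12Ψ bytes); the function name and the `COMM_FACTOR` lookup are illustrative, with the factors copied from the table rather than derived.

```python
def per_gpu_model_state_bytes(num_params: int, num_ranks: int, stage: int) -> float:
    """Per-GPU model-state bytes under a given ZeRO stage (0 means baseline DDP)."""
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage == 0:
        return params + grads + optim
    if stage == 1:
        return params + grads + optim / num_ranks
    if stage == 2:
        return params + (grads + optim) / num_ranks
    if stage == 3:
        return (params + grads + optim) / num_ranks
    raise ValueError(f"unknown ZeRO stage: {stage}")


# Per-step communication volume relative to the DDP all-reduce, as listed in the table.
COMM_FACTOR = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.5}

# One trillion parameters on 1,024 ranks, the example used in the text.
for stage in range(4):
    gigabytes = per_gpu_model_state_bytes(10**12, 1024, stage) / 1e9
    print(f"stage {stage}: {gigabytes:,.1f} GB per GPU, {COMM_FACTOR[stage]}x communication")
```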
ZeRO-1 partitions only the optimizer states across data-parallel ranks. For Adam in fp16+fp32 mixed precision, that 12Ψ-byte block (fp32 master parameters plus fp32 momentum and variance) is split into N equal chunks; each rank holds and updates only its 1/N slice.[1] The fp16 parameter and gradient copies remain replicated on every device, so the forward and backward passes are unchanged.[1] After the backward pass, the fp16 gradients are averaged across ranks, then each rank applies the optimizer update to its local 1/N slice of fp32 master parameters and converts the result back to fp16 for its local partition.[1] An all-gather then reconstructs the updated fp16 parameters across all ranks for the next step.[1] The paper analyses the communication cost and shows that the gradient reduction and the parameter all-gather together move 2Ψ elements per step (a reduce-scatter plus an all-gather), the same total volume as a single all-reduce.[1] Memory consumption drops from 16Ψ to 4Ψ + 12Ψ/N bytes, a 4x reduction in the large-N limit.[1][9] DeepSpeed's tutorial uses a 1.5 B-parameter example in which the Adam optimizer states consume 18 GB on a single GPU but only 2.25 GB per device when partitioned across eight ranks.[10] Because ZeRO-1 does not modify the gradient or parameter buffers, it composes cleanly with pipeline parallelism and tensor parallelism, which is why it is the stage of choice in 3D-parallel configurations such as those used by Megatron-DeepSpeed for BLOOM and Megatron-Turing NLG.[3][18]
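The update flow described above can be imitated in a single process. The following is a toy simulation under simplifying assumptions, not DeepSpeed's implementation: ranks are plain Python lists, everything is kept in one floating-point precision (no fp16 copy), and the names `adam_step` and `zero1_step` are hypothetical. It checks that updating each 1/N slice with its own optimizer state gives the same parameters as an unpartitioned Adam step.

```python
import math
import random


def adam_step(params, grads, m, v, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """In-place Adam update on equal-length lists (one slice or the full vector)."""
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** step)
        v_hat = v[i] / (1 - b2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def zero1_step(params, per_rank_grads, opt_slices, step):
    """One simulated ZeRO-1 step; returns the all-gathered parameters."""
    n = len(per_rank_grads)
    chunk = len(params) // n  # assumes the parameter count divides evenly
    updated = []
    for rank in range(n):
        lo, hi = rank * chunk, (rank + 1) * chunk
        # Reduce-scatter: this rank receives only the averaged gradients it owns.
        grad_slice = [sum(g[i] for g in per_rank_grads) / n for i in range(lo, hi)]
        # Local update: optimizer state exists only for the owned slice.
        param_slice = params[lo:hi]
        adam_step(param_slice, grad_slice, opt_slices[rank]["m"], opt_slices[rank]["v"], step)
        updated.append(param_slice)
    # All-gather: every rank ends up with the concatenation of all slices.
    return [p for param_slice in updated for p in param_slice]


random.seed(0)
N, P = 4, 8
params = [random.uniform(-1, 1) for _ in range(P)]
grads = [[random.uniform(-1, 1) for _ in range(P)] for _ in range(N)]

# Reference: one unpartitioned Adam step on the averaged gradients.
ref = list(params)
avg = [sum(g[i] for g in grads) / N for i in range(P)]
adam_step(ref, avg, [0.0] * P, [0.0] * P, step=1)

slices = [{"m": [0.0] * (P // N), "v": [0.0] * (P // N)} for _ in range(N)]
sharded = zero1_step(params, grads, slices, step=1)
assert all(math.isclose(a, b) for a, b in zip(ref, sharded))
```

Each simulated rank stores momentum and variance for only P/N entries, which is where the 12Ψ/N term in the memory formula comes from.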
ZeRO-2 additionally partitions the gradients. After each layer's backward pass, the system performs a reduce-scatter on that layer's gradients so that rank r retains only the gradient slice corresponding to its 1/N partition of the optimizer states; the remaining gradient memory can be released.[1] The all-reduce of the baseline is therefore replaced by a reduce-scatter plus a parameter all-gather later in the step, yielding the same aggregate communication volume as ZeRO-1.[1] Per-GPU memory drops further to 2Ψ + 14Ψ/N, an 8x reduction over baseline DDP in the large-N limit.[1][11] Microsoft's ZeRO-2 announcement on 19 May 2020 reported training models up to 170 billion parameters with this stage and pretraining BERT-large in 44 minutes on 1,024 V100 GPUs.[11]
ZeRO-3 partitions the fp16 parameters themselves. At any given moment each rank holds only its 1/N slice of model weights; the full parameter set for a given layer is materialised on the fly via an all-gather just before that layer's forward computation, used for the forward pass, freed, and gathered again for the backward pass.[1] Gradients are reduce-scattered as in ZeRO-2, and optimizer updates are applied locally on the rank that owns the corresponding parameter slice.[1] Per-GPU model-state memory drops to 16Ψ/N, yielding linear memory savings with the data-parallel degree: 64 ranks cut model-state memory 64-fold, and 1,024 ranks cut it roughly 1,000-fold.[1][9] The cost is increased communication volume. Each training step requires a parameter all-gather in the forward pass, a parameter all-gather in the backward pass, and a gradient reduce-scatter, totalling 3Ψ elements of communication versus 2Ψ for baseline all-reduce DDP, an increase of 1.5x.[1] In return, ZeRO-3 makes it feasible to train models far larger than any single GPU could hold without any code rewrite of the model itself.[1]
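The gather, use, and free cycle can be illustrated with a toy forward pass. This is a single-process sketch with made-up helper names (`shard`, `all_gather`, `sharded_forward`) and an elementwise "layer" chosen only to keep the arithmetic obvious; it does not model the backward pass or real collectives.

```python
def shard(vector, n):
    """Split a weight vector into n equal contiguous slices (assumes even division)."""
    chunk = len(vector) // n
    return [vector[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_gather(shards):
    """Concatenate every rank's slice into the full weight vector."""
    return [w for rank_slice in shards for w in rank_slice]


def sharded_forward(x, sharded_layers):
    """Run the toy model while materialising one layer's weights at a time."""
    peak_live = 0
    for layer_shards in sharded_layers:
        weights = all_gather(layer_shards)           # gather just before use
        peak_live = max(peak_live, len(weights))
        x = [xi * wi for xi, wi in zip(x, weights)]  # toy elementwise layer
        del weights                                  # free the gathered copy
    return x, peak_live


layers = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 2.0, 2.0], [2.0, 1.0, 1.0, 0.5]]
n_ranks = 2
sharded_layers = [shard(w, n_ranks) for w in layers]

x0 = [1.0, 1.0, 1.0, 1.0]
out, peak_live = sharded_forward(x0, sharded_layers)

# Reference: the same model with fully replicated weights.
ref = list(x0)
for w in layers:
    ref = [xi * wi for xi, wi in zip(ref, w)]

assert out == ref
print(out, peak_live)  # prints "[1.0, 1.0, 6.0, 4.0] 4"
```

In this toy setup each rank permanently stores 6 of the 12 weights, and at most one layer (4 weights) is ever fully materialised.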
A trillion-parameter dense model with Ψ = 10^12 has an fp16-plus-Adam state of roughly 16 TB at full replication. Under ZeRO-3 on 1,024 ranks the per-GPU model-state footprint falls to roughly 16 GB, which fits within the high-bandwidth memory of an NVIDIA A100 40 GB or 80 GB device and leaves the remainder for activation memory and workspace buffers.[9] Microsoft used this calculation in the original ZeRO blog post to motivate the "trillion parameter" framing of the paper title.[9]
The ZeRO paper also defines ZeRO-R, a set of orthogonal techniques targeting "residual" memory (activations, temporary buffers, and memory fragmentation) rather than model states.[1] ZeRO-R partitions activation memory across tensor-parallel ranks, optionally offloading partitioned activations to CPU memory, and uses a constant-size buffer for fused communications plus a memory defragmentation routine.[1] Combined with ZeRO-DP (the optimizer-state, gradient, and parameter partitioning above) and gradient checkpointing, ZeRO-R further reduces peak GPU memory.[1]
ZeRO's communication overhead is masked in practice by overlapping collectives with computation. In ZeRO-3 the parameter all-gather for layer L+1 is launched asynchronously while the forward pass for layer L is still running, so that the layer L+1 weights are present when its forward kernel begins.[10] The gradient reduce-scatter during the backward pass is similarly pipelined with backward computation.[10] DeepSpeed exposes tunables such as reduce_bucket_size and allgather_bucket_size that batch multiple parameter or gradient slices into a single NCCL call, trading memory for collective efficiency.[10] At very large data-parallel scales the wall-clock cost of the additional all-gathers becomes the dominant factor and motivates the ZeRO++ optimizations.[7][13]
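A configuration fragment using the bucket tunables named above might look like the following sketch. It is written as a Python dictionary of the kind that can be serialised to a DeepSpeed JSON config; the numeric values are placeholders chosen for illustration, not recommendations, and `overlap_comm` is included on the assumption that overlap is wanted.

```python
import json

# Illustrative values only: suitable bucket sizes depend on the model and interconnect.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,                  # overlap collectives with backward compute
        "reduce_bucket_size": 500_000_000,     # elements batched per gradient reduction
        "allgather_bucket_size": 500_000_000,  # elements batched per parameter all-gather
    },
}

print(json.dumps(ds_config, indent=2))
```

Larger buckets mean fewer, bigger collective calls at the cost of more temporary buffer memory, which is the trade-off the paragraph describes.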
ZeRO-Offload, presented by Jie Ren and co-authors (including Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He) in a paper posted to arXiv on 18 January 2021 and published at USENIX ATC 2021, extends ZeRO-2 by offloading optimizer states and gradients from GPU memory to host CPU memory, and by offloading the optimizer update step from the GPU to the CPU.[5] The system places fp16 parameters and the forward and backward passes on the GPU while keeping the fp32 master parameters, fp32 momentum and variance buffers, and fp16 gradients in CPU DRAM; gradients are streamed from GPU to CPU after the backward pass, the Adam update is computed by an optimized CPU kernel, and the resulting fp16 parameters are streamed back to the GPU before the next forward pass.[5] To prevent the CPU optimizer update from becoming a bottleneck, the authors implemented a SIMD-vectorized CPU-Adam achieving roughly an order-of-magnitude speedup over PyTorch's default Adam.[5]
ZeRO-Offload reports training a 13 B-parameter model on a single NVIDIA V100 GPU, a 10x increase in model size over the PyTorch baseline, and sustains 40 TFLOPS on a single V100 for a 10 B-parameter model (versus 30 TFLOPS for native PyTorch on a 1.4 B-parameter model).[5] On 128 GPUs the system delivers near-linear scaling and, combined with model parallelism, enables training models exceeding 70 B parameters on a single DGX-2 node.[5] The technique requires no model code changes from the data scientist beyond enabling the offload configuration in DeepSpeed.[5]
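As a rough illustration of "enabling the offload configuration", the fragment below sketches a ZeRO-2 setup with the optimizer state placed in CPU memory. The batch size and the choice to pin memory are assumptions made for the example; consult the DeepSpeed configuration reference for the options supported by a given release.

```python
import json

# A minimal sketch of a ZeRO-Offload style configuration (values are illustrative).
offload_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer state in host DRAM and run the update on the CPU.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

print(json.dumps(offload_config, indent=2))
```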
ZeRO-Infinity, by Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He, was posted to arXiv on 16 April 2021 and published at SC21.[6][12] It generalises ZeRO-3 and ZeRO-Offload by treating GPU HBM, CPU DRAM, and node-local NVMe SSDs as a unified hierarchical memory, partitioning model states across all three tiers.[6] The system introduces several components.
Infinity offload engine. An I/O subsystem manages overlapping bandwidth-efficient transfers between GPU, CPU, and NVMe. Microsoft Research reports that modern clusters carry roughly 2 to 3 times more aggregate CPU memory than GPU memory and roughly 50 times more aggregate NVMe capacity, so offloading to NVMe unlocks a memory pool large enough for trillion-scale models.[12]
Memory-centric tiling. Individual operators that produce activation tensors too large to fit in GPU memory are decomposed into a sequence of smaller tiles executed sequentially on the same GPU, removing the need for tensor parallelism solely for memory reasons.[6] This permits very wide layers (such as the very large embedding matrices used in models with extended vocabularies) without forced model partitioning.[6]
Bandwidth-centric partitioning. Parameter all-gathers are restructured so that data is pulled in parallel from many CPU and NVMe sources, aggregating their per-link bandwidth.[6]
Activation checkpointing with CPU offload. Activations stashed for backpropagation can be offloaded to CPU memory, freeing GPU memory for parameters during the forward pass.[6]
Overlap-centric design. ZeRO-Infinity prefetches parameters across the memory hierarchy: NVMe-to-CPU and CPU-to-GPU transfers for the parameters of upcoming layers are issued asynchronously while the current layer is computing, hiding the latency of slow storage behind useful work.[6] An analogous reverse path streams gradients and parameter updates back to NVMe.[6]
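The memory-centric tiling item in the list above can be illustrated with a toy linear operator. In this sketch a matrix-vector product is computed a few output rows at a time, so only one tile of the weight matrix has to be resident at once; the helper names are hypothetical and the tiling axis (output rows) is an assumption made for simplicity.

```python
def matvec(rows, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def tiled_matvec(rows, x, tile_rows):
    """Compute the same product by processing `tile_rows` output rows at a time."""
    out = []
    for start in range(0, len(rows), tile_rows):
        tile = rows[start:start + tile_rows]  # only this tile needs to be resident
        out.extend(matvec(tile, x))
    return out


weights = [[float(r * 3 + c) for c in range(3)] for r in range(6)]
x = [1.0, 2.0, 3.0]

assert tiled_matvec(weights, x, tile_rows=2) == matvec(weights, x)
```

The result is identical to the untiled product; what changes is the peak working set, which shrinks from the whole operator to a single tile.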
The paper reports training models with over 30 trillion parameters on 512 NVIDIA V100 GPUs, sustained throughput exceeding 25 petaflops, and the ability to fine-tune trillion-parameter models on a single DGX-2 node with 16 GPUs.[6][12] Microsoft estimates that with 100 DGX-2 nodes the technique could in principle scale to over 100 trillion parameters.[12] ZeRO-Infinity is integrated with DeepSpeed and accessible through PyTorch Lightning and Hugging Face Transformers configuration files without model changes.[12]
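A configuration along these lines might be sketched as follows. The fragment selects ZeRO-3 and directs both parameters and optimizer state to NVMe; the path `/local_nvme` is a hypothetical mount point, and the remaining values are illustrative assumptions rather than tuned settings.

```python
import json

NVME_PATH = "/local_nvme"  # hypothetical mount point for a node-local SSD

# A minimal sketch of a ZeRO-Infinity style configuration (values are illustrative).
infinity_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": NVME_PATH, "pin_memory": True},
        "offload_optimizer": {"device": "nvme", "nvme_path": NVME_PATH, "pin_memory": True},
    },
}

print(json.dumps(infinity_config, indent=2))
```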
ZeRO++ ("Extremely Efficient Collective Communication for Giant Model Training") by Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, and Yuxiong He was posted to arXiv on 16 June 2023 and announced on the Microsoft Research blog on 22 June 2023.[7][13] It targets the communication overhead of ZeRO-3, which becomes the dominant bottleneck when per-GPU batch sizes are small (as during reinforcement-learning fine-tuning) or when cluster interconnects are bandwidth-limited.[7][13] ZeRO++ bundles three independent optimizations.
qwZ (quantized weight communication). Block-based fp16-to-int8 quantization is applied to the parameter tensors before each forward-pass all-gather, halving the forward parameter communication volume from M to 0.5M, where M denotes the model size; dequantization happens at the receiver before the matrix multiplication.[7][13] To preserve accuracy the system quantizes in small blocks, each with its own scaling factor, rather than applying a single scale to the whole tensor.[7]
hpZ (hierarchical partition). In addition to the primary copy of the parameters, which is partitioned across all GPUs as in ZeRO-3, ZeRO++ maintains a secondary full copy of the model parameters within each node. This trades a modest amount of intra-node GPU memory for elimination of the backward-pass parameter all-gather across the slow inter-node fabric, reducing inter-node weight communication during backward from M to zero.[7][13]
qgZ (quantized gradient communication). The standard reduce-scatter on gradients is replaced by an all-to-all-based quantized gradient averaging. Gradients are quantized to int4, exchanged with a hierarchical all-to-all collective, dequantized, summed in higher precision, and re-quantized for any subsequent exchange. The scheme reduces gradient communication from M to 0.25M bytes while preserving accuracy through careful dequantization timing.[7][13] The all-to-all replacement also avoids the ring-based reduction pattern of NCCL reduce-scatter, which has poor latency characteristics when each message is small.[7]
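The reason block-based quantization helps can be seen in a small experiment. The sketch below uses symmetric int8 quantization with one scale per block; it is a simplified stand-in for the idea, not the ZeRO++ kernels, and the function names and sample values are made up for the example.

```python
def quantize_blocks(values, block_size):
    """Symmetric int8 quantization with one scale per block of `block_size` values."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 127 or 1.0  # fall back to 1.0 for all-zero blocks
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in block)
    return codes, scales


def dequantize_blocks(codes, scales, block_size):
    """Map int8 codes back to floats using each block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def max_abs_error(values, block_size):
    codes, scales = quantize_blocks(values, block_size)
    restored = dequantize_blocks(codes, scales, block_size)
    return max(abs(a - b) for a, b in zip(values, restored))


# Small-magnitude weights stored next to large-magnitude ones.
weights = [0.01, -0.02, 0.015, 0.005, 5.0, -5.0, 5.0, -5.0]

whole_tensor_error = max_abs_error(weights, block_size=len(weights))
blockwise_error = max_abs_error(weights, block_size=4)
assert blockwise_error < whole_tensor_error
```

With a single scale the small weights are rounded against a step size set by the largest value; giving each block its own scale keeps their error proportionate to their own magnitude.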
Together the three optimizations cut total per-step communication volume by 4x relative to ZeRO-3.[7][13] Microsoft reports throughput improvements of 28 to 36 percent over ZeRO-3 on a high-bandwidth (400 Gbps) cluster with small batch sizes and roughly 2x speedup on low-bandwidth (100 Gbps) clusters, evaluated on an 18 B-parameter GPT-style model trained on four-node V100 clusters.[7][13] On RLHF token-generation workloads the paper reports 2.25x speedup over ZeRO-3 and 1.26x on the training phase.[13]
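The 4x figure follows from adding up the three contributions. A short sketch of that arithmetic, with M normalised to 1 and function names chosen for illustration:

```python
def zero3_volume(m: float) -> float:
    """Forward all-gather + backward all-gather + gradient reduce-scatter."""
    return m + m + m


def zeropp_volume(m: float) -> float:
    forward_allgather = 0.5 * m   # qwZ: fp16 weights sent as int8
    backward_allgather = 0.0      # hpZ: served from the copy held inside the node
    gradient_exchange = 0.25 * m  # qgZ: fp16 gradients sent as int4
    return forward_allgather + backward_allgather + gradient_exchange


reduction = zero3_volume(1.0) / zeropp_volume(1.0)
print(reduction)  # prints "4.0"
```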
The DeepSpeed implementation exposes three configuration flags (zero_quantized_weights, zero_hpz_partition_size, zero_quantized_gradients) that activate the techniques independently or jointly.[14]
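A configuration fragment that enables all three flags might look like the sketch below. The partition size of 8 assumes eight GPUs per node, which is an assumption of this example rather than a value from the source.

```python
import json

GPUS_PER_NODE = 8  # assumption for this example

# A minimal sketch of a ZeRO++ configuration on top of ZeRO-3.
zeropp_config = {
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,            # qwZ
        "zero_hpz_partition_size": GPUS_PER_NODE,  # hpZ
        "zero_quantized_gradients": True,          # qgZ
    },
}

print(json.dumps(zeropp_config, indent=2))
```

Because the flags are independent, any one of them can be dropped from the dictionary to evaluate the others in isolation.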
PyTorch's native FullyShardedDataParallel (FSDP) implements the same core idea as ZeRO-3.[15] FSDP originated in Meta's FairScale library in early 2021, was upstreamed into the PyTorch core in version 1.11 (released March 2022), and was described in a paper by Yanli Zhao and colleagues at PyTorch posted to arXiv on 21 April 2023.[15][16][17] The PyTorch blog post introducing FSDP cites DeepSpeed ZeRO directly as an inspiration, and the FSDP paper acknowledges that the implementation shares the core principle of "sharding model parameters, gradients and optimizer states across data parallel workers".[16][17]
Differences in practice are mainly in API and integration. FSDP is part of PyTorch core and integrates with the PyTorch dispatcher, tensor subclasses, and CUDA caching allocator; DeepSpeed ZeRO is a separate library with its own runtime, configuration JSON, and CPU/NVMe offload engines.[17] FSDP offers a "FULL_SHARD" mode functionally equivalent to ZeRO-3, a "SHARD_GRAD_OP" mode equivalent to ZeRO-2, and a "NO_SHARD" mode equivalent to standard DDP.[16][17] The PyTorch announcement post notes that FSDP scales to 1 trillion parameters on AWS clusters using NVIDIA A100 GPUs, delivering 159 TFLOPS per A100 on GPT-175B and 84 TFLOPS per A100 on GPT-1T.[16]
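The correspondence between the two naming schemes can be written down as a small lookup. The sketch below uses only the standard library; the strings on the right are the mode names quoted above, the helper `fsdp_strategy_name` is hypothetical, and ZeRO-1 is deliberately left out because the text gives no FSDP equivalent for it.

```python
# ZeRO stage -> FSDP sharding mode name, as described in the paragraph above.
ZERO_STAGE_TO_FSDP = {
    0: "NO_SHARD",       # plain data parallelism
    2: "SHARD_GRAD_OP",  # shard gradients and optimizer state
    3: "FULL_SHARD",     # shard parameters as well
}


def fsdp_strategy_name(zero_stage: int) -> str:
    """Return the FSDP mode name for a ZeRO stage, or raise if none is listed."""
    try:
        return ZERO_STAGE_TO_FSDP[zero_stage]
    except KeyError:
        raise ValueError(f"no FSDP mode listed for ZeRO stage {zero_stage}") from None


print(fsdp_strategy_name(3))  # prints "FULL_SHARD"
```

With PyTorch installed, the returned name can be resolved with `getattr(torch.distributed.fsdp.ShardingStrategy, name)` and passed as the `sharding_strategy` argument of the FSDP wrapper.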
Practitioners typically choose between the two on the basis of ecosystem rather than capability. Codebases tightly coupled to DeepSpeed (such as the Megatron-DeepSpeed stack used for BLOOM and Megatron-Turing NLG) use ZeRO directly, while Hugging Face Transformers and PyTorch Lightning-centric training pipelines often default to FSDP.[4][16][18] Both libraries support similar offload modes: FSDP offers CPU offload through a CPUOffload(offload_params=True) flag passed to the FullyShardedDataParallel wrapper, mirroring DeepSpeed's zero_optimization.offload_param configuration.[16] DeepSpeed retains a richer ecosystem of NVMe offload, optimizer kernels, and the ZeRO++ communication reductions; FSDP benefits from being part of the PyTorch core and from tight integration with PyTorch profiling, compilation, and tensor parallelism tooling.[15][16][17]
ZeRO and DeepSpeed have been used in the training of several of the largest publicly disclosed dense language models.
Turing-NLG (17 B parameters). Announced by Microsoft on 13 February 2020 alongside DeepSpeed, this 17 B-parameter transformer was at the time the largest disclosed dense language model, trained using ZeRO Stage 1 to remove optimizer-state redundancy.[2][9] The Turing-NLG announcement framed ZeRO as the key enabler, noting that without optimizer-state partitioning the model could not have been fit into the available cluster memory budget.[2]
Megatron-Turing NLG 530B. A 530 B-parameter dense transformer jointly trained by Microsoft and NVIDIA, described in a paper by Shaden Smith, Mostofa Patwary, and colleagues posted to arXiv on 28 January 2022.[3] The model was trained with the Megatron-DeepSpeed stack combining Megatron-LM tensor parallelism, DeepSpeed pipeline parallelism, and ZeRO data parallelism in a 3D-parallel configuration.[3] Smith and colleagues describe the topology in detail and document how the three parallelism axes interlock: tensor parallelism splits individual matrix multiplications within a single transformer block across GPUs in the same node, pipeline parallelism splits the layer sequence across nodes, and ZeRO data parallelism replicates the resulting tensor-and-pipeline-parallel "slice" across data-parallel groups while sharding optimizer state.[3]
BLOOM (176 B parameters). The 176 B-parameter multilingual model from the BigScience workshop, posted to arXiv on 9 November 2022, was trained on 384 NVIDIA A100 80GB GPUs across 48 nodes of the French Jean Zay supercomputer using the Megatron-DeepSpeed framework.[4][18] The training combined ZeRO Stage 1 (sharding only optimizer states) with tensor parallelism and pipeline parallelism over a roughly three-and-a-half-month period from March to July 2022.[18] The BLOOM team reported that ZeRO Stage 1 was preferred over Stages 2 and 3 when combined with pipeline parallelism because higher stages require additional reduce-scatter collectives per micro-batch.[18] The full BF16-plus-FP32 checkpoint reached 2.3 TB; BF16 weights alone were 329 GB.[18] To survive frequent hardware failures (one to two GPU failures per week, several multi-hour outages over the run), the team wrote checkpoints every three hours (every 100 training iterations) and built automated recovery scripts on top of the DeepSpeed ZeRO checkpoint format.[18]
Other adoptions. DeepSpeed has been used in training pipelines for many open-source fine-tuning workloads on the Hugging Face Hub, and the library has been integrated with Hugging Face Transformers via the Trainer and Accelerate APIs and with PyTorch Lightning as a training strategy.[15][19] The DeepSpeed GitHub repository reports more than 110 tagged releases and tens of thousands of stars, indicating broad community uptake.[15] DeepSpeed's hardware support has expanded over time from the original NVIDIA-only stack to include AMD Instinct MI-series accelerators, Intel Gaudi accelerators, and CPU-only fall-back paths.[15]
ZeRO-Offload made it feasible for individual researchers to fine-tune billion-scale models on a single consumer or workstation GPU, and the ZeRO-Offload paper specifically frames its goal as "democratizing" billion-scale training.[5] Subsequent libraries built on top of DeepSpeed have used ZeRO as their default sharding backend for instruction tuning, Adam-based pretraining, and reinforcement-learning fine-tuning workloads.[5][7][13]
ZeRO's design trades GPU memory savings for communication. ZeRO-3 in particular raises per-step communication volume to 1.5x that of baseline data parallelism, and the parameter all-gathers it requires for every layer are difficult to overlap fully with computation when per-GPU batch sizes are small.[1][7] These overheads become severe on clusters with limited inter-node bandwidth (the motivating workload for ZeRO++), and they limit the speed of reinforcement-learning-from-human-feedback fine-tuning that uses short per-step generations.[7][13]
ZeRO-Offload and ZeRO-Infinity require PCIe and NVMe links with sufficient bandwidth to feed the GPU, and the optimizer update time on the CPU can dominate the step time on hardware with weak CPUs.[5][6] The ZeRO-Offload paper notes that without the SIMD-vectorized CPU-Adam kernel the CPU update would become a serial bottleneck.[5]
Integration with pipeline parallelism is also constrained: the BLOOM training report explicitly used only ZeRO Stage 1 in its 3D-parallel configuration because Stage 2's per-micro-batch reduce-scatter and Stage 3's per-layer all-gather both interact poorly with pipelined micro-batches.[18]
Finally, the original ZeRO analysis treats only the model states (parameters, gradients, optimizer states). Activation memory, which can dominate at long sequence lengths, is addressed by separate techniques (ZeRO-R activation partitioning, gradient checkpointing, tensor parallelism) rather than by ZeRO-DP alone.[1]
Convergence is unaffected by ZeRO Stages 1 and 2, which are mathematically identical to baseline data parallelism: only the layout of state in device memory changes.[1] ZeRO-3 is likewise numerically equivalent in fp32 master weights, although fp16 round-trips during all-gather may introduce slightly different rounding compared with a single-replica baseline.[1] ZeRO++ in contrast introduces quantization on weights and gradients; the ZeRO++ paper reports no measurable accuracy regression on the workloads it evaluates, but practitioners using qwZ or qgZ are advised to validate convergence on their own workload.[7][13]
ZeRO partitioning interacts with checkpoint formats. A ZeRO-3 checkpoint saves only each rank's parameter slice and optimizer slice, so consolidating the checkpoint into a single dense file (for inference, sharing, or migration to a different sharding strategy) requires an explicit "zero-to-fp32" or "consolidated state-dict" step provided by the DeepSpeed and FSDP runtimes.[10][16] Loading a ZeRO checkpoint at a different data-parallel degree similarly requires re-partitioning logic.[10]
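The consolidation and re-partitioning steps can be illustrated on a flat list of values. This is a conceptual sketch only: it does not read or write the DeepSpeed or FSDP checkpoint formats, the helper names are hypothetical, and zero-padding the final slice is an assumption made so that every rank holds the same number of elements.

```python
def partition(flat, n):
    """Split a flat state vector into n equal slices, zero-padding the tail."""
    chunk = -(-len(flat) // n)  # ceiling division
    padded = flat + [0.0] * (chunk * n - len(flat))
    return [padded[r * chunk:(r + 1) * chunk] for r in range(n)]


def consolidate(shards, true_length):
    """Concatenate per-rank slices and drop the padding."""
    flat = [v for rank_slice in shards for v in rank_slice]
    return flat[:true_length]


def repartition(shards, true_length, new_n):
    """Re-shard a checkpoint saved at one data-parallel degree for another."""
    return partition(consolidate(shards, true_length), new_n)


full = [float(i) for i in range(10)]
saved = partition(full, 4)                    # four ranks, three values each
assert consolidate(saved, len(full)) == full  # dense copy for inference or sharing

resharded = repartition(saved, len(full), 3)  # resume on three ranks
assert consolidate(resharded, len(full)) == full
```

The true (unpadded) length has to be stored alongside the shards, since it cannot be recovered from the padded slices alone.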