DeepSpeed is an open-source deep learning optimization library developed by Microsoft that makes distributed training and inference of large models easy, efficient, and effective. Built on top of PyTorch, DeepSpeed provides a suite of system optimizations including the Zero Redundancy Optimizer (ZeRO), mixed precision training, pipeline parallelism, and inference acceleration. Since its release in February 2020, DeepSpeed has become one of the most widely used libraries for training models that exceed the memory capacity of a single GPU, and it has been used to train landmark models including BLOOM (176 billion parameters) and Megatron-Turing NLG (530 billion parameters) [1].
DeepSpeed emerged from Microsoft's "AI at Scale" initiative, which aimed to develop the infrastructure needed to train the largest AI models. The project was led by researchers at Microsoft Research, with Samyam Rajbhandari as a primary architect. The first public release came in February 2020, coinciding with the publication of the foundational ZeRO paper [2].
The motivation for DeepSpeed was straightforward: as model sizes grew from millions to billions of parameters, standard data-parallel training became insufficient. A model with billions of parameters cannot fit in the memory of a single GPU, and existing approaches to model parallelism (tensor parallelism and pipeline parallelism) required significant code modifications and were difficult to use. DeepSpeed's goal was to enable training of arbitrarily large models with minimal code changes [1].
The foundational research behind DeepSpeed is the ZeRO (Zero Redundancy Optimizer) paper by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He, first published as an arXiv preprint in October 2019 and later presented at SC20 (the International Conference for High Performance Computing). The paper identified a critical inefficiency in standard data-parallel training: every GPU maintains a complete copy of the model states (parameters, gradients, and optimizer states), resulting in massive memory redundancy [2].
For a model with P parameters trained with the Adam optimizer in mixed precision, the memory required per GPU for model states alone is approximately 2P (FP16 parameters) + 2P (FP16 gradients) + (4P + 4P + 4P) (FP32 momentum, variance, and master weights) = 16P bytes. For a 7.5-billion parameter model, this amounts to roughly 120 GB per GPU, far exceeding the memory of any single GPU available at the time. The ZeRO paper proposed partitioning these redundant states across data-parallel processes, dramatically reducing per-GPU memory consumption without sacrificing computational efficiency [2].
The ZeRO optimizer is DeepSpeed's most important contribution. It provides three progressive stages of optimization, each partitioning additional model states across the data-parallel group and yielding greater memory savings at the cost of increased communication.
| Stage | What Is Partitioned | Memory Reduction | Communication Overhead |
|---|---|---|---|
| ZeRO Stage 1 (ZeRO-OS) | Optimizer states (momentum, variance in Adam) | Up to 4x | None (same as standard data parallelism) |
| ZeRO Stage 2 | Optimizer states + gradients | Up to 8x | None (same as standard data parallelism) |
| ZeRO Stage 3 | Optimizer states + gradients + parameters | Linear with number of GPUs | 1.5x increase (all-gather for forward/backward) |
In standard data-parallel training, each GPU maintains the full optimizer state. For Adam, this includes the first-moment estimate (momentum) and second-moment estimate (variance), each requiring 4 bytes per parameter in FP32, plus the FP32 master copy of the weights (4 bytes per parameter). That totals 12 bytes per parameter just for optimizer states.
ZeRO Stage 1 partitions only the optimizer states across data-parallel processes. Each GPU stores only 1/N of the optimizer states (where N is the number of GPUs), while still maintaining the full FP16 parameters and gradients. After the backward pass, gradients are reduced normally, but each GPU updates only its partition of the optimizer states and parameters. The updated parameters are then collected on every GPU via an all-gather. This yields up to a 4x memory reduction with no additional communication overhead compared to standard data parallelism [2].
ZeRO Stage 2 adds gradient partitioning on top of Stage 1. After the backward pass, instead of performing an all-reduce on the full gradient tensor, each GPU reduces and retains only the gradients corresponding to its partition of the optimizer states. This is implemented using a reduce-scatter operation, which is communication-equivalent to the all-reduce used in standard data parallelism but results in each GPU holding only 1/N of the gradients. This provides up to an 8x memory reduction [2].
ZeRO Stage 3 takes the partitioning to its logical conclusion by also distributing the FP16 model parameters across GPUs. Each GPU stores only 1/N of the parameters. During the forward and backward passes, the needed parameters are collected on demand via all-gather operations, used for computation, and then discarded. This achieves memory reduction that scales linearly with the number of GPUs: with N GPUs, each GPU needs to store only 1/N of the total model states [2].
The tradeoff is a 1.5x increase in communication volume compared to standard data parallelism, due to the additional all-gather operations in the forward and backward passes. However, this communication can be overlapped with computation, and in practice the throughput impact is often modest.
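The 1.5x figure follows directly from counting data movement per step, as in the ZeRO paper's communication analysis. A quick sanity check (an illustration, not DeepSpeed code), measuring volume in units of model-size elements:

```python
# Communication volume per training step, in units of model size M.
def standard_dp_volume(M):
    # Standard data parallelism: one all-reduce of the gradients,
    # implemented as reduce-scatter (M) + all-gather (M).
    return 2 * M

def zero3_volume(M):
    # ZeRO Stage 3: all-gather parameters for the forward pass (M),
    # all-gather again for the backward pass (M), then reduce-scatter
    # the gradients (M).
    return 3 * M

M = 7.5e9
print(zero3_volume(M) / standard_dp_volume(M))  # prints 1.5
```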
To illustrate the memory impact, consider a 7.5-billion parameter model trained with Adam in mixed precision on 64 GPUs:
| Configuration | Memory per GPU |
|---|---|
| Standard data parallelism | ~120 GB |
| ZeRO Stage 1 | ~31.4 GB |
| ZeRO Stage 2 | ~16.6 GB |
| ZeRO Stage 3 | ~1.9 GB |
These reductions make it possible to train models with billions of parameters on clusters of GPUs that individually have only 16-80 GB of memory [2].
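The table's figures can be reproduced from the 16P accounting above. The following sketch (an illustration, not DeepSpeed code) computes per-GPU model-state memory for each ZeRO stage:

```python
def model_state_memory_gb(params, n_gpus, stage):
    """Per-GPU memory (GB) for model states with Adam in mixed precision.

    Per parameter: 2 bytes FP16 weights, 2 bytes FP16 gradients, and
    12 bytes FP32 optimizer states (momentum, variance, master weights),
    following the ZeRO paper's accounting. Stage 0 means standard data
    parallelism with everything replicated.
    """
    p_fp16, g_fp16, opt = 2 * params, 2 * params, 12 * params
    if stage == 0:
        total = p_fp16 + g_fp16 + opt                 # fully replicated
    elif stage == 1:
        total = p_fp16 + g_fp16 + opt / n_gpus        # optimizer states sharded
    elif stage == 2:
        total = p_fp16 + (g_fp16 + opt) / n_gpus      # + gradients sharded
    else:
        total = (p_fp16 + g_fp16 + opt) / n_gpus      # everything sharded
    return total / 1e9

for s in (0, 1, 2, 3):
    print(f"Stage {s}: {model_state_memory_gb(7.5e9, 64, s):.1f} GB")
```

Running this for a 7.5B-parameter model on 64 GPUs reproduces the ~120, ~31.4, ~16.6, and ~1.9 GB figures in the table.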
ZeRO-Offload extends ZeRO Stage 2 by offloading optimizer states and gradient computation to CPU memory and CPU compute. This enables training models with up to 13 billion parameters on a single GPU by leveraging the much larger capacity of system RAM (typically 256 GB or more) compared to GPU memory (typically 16-80 GB) [3].
The design of ZeRO-Offload carefully balances computation and communication between CPU and GPU. The compute-intensive forward and backward passes remain on the GPU, while the optimizer step, which is memory-intensive but far less compute-demanding, runs on the CPU. To keep this step fast, DeepSpeed provides a CPU Adam implementation heavily optimized with SIMD vector instructions.
ZeRO-Offload achieves near-GPU-only training throughput for large models because the optimizer step computation on the CPU can be overlapped with the next forward pass on the GPU. For smaller models where GPU utilization is already high, the overhead of CPU-GPU data transfers becomes more noticeable [3].
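Enabling this in practice is a configuration change rather than a code change. A minimal sketch of a DeepSpeed JSON config for ZeRO-Offload; the key names follow the DeepSpeed configuration schema, and the values are illustrative:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "fp16": { "enabled": true }
}
```

Setting `pin_memory` uses page-locked host memory, which speeds up CPU-GPU transfers at the cost of reducing the RAM available to other processes.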
ZeRO-Infinity, introduced in 2021, extends the offloading concept to NVMe storage (SSDs), enabling training of models with trillions of parameters. Built on ZeRO Stage 3, ZeRO-Infinity can offload all model states (parameters, gradients, and optimizer states) to CPU memory and NVMe storage [4].
The key technical innovation is a dynamic prefetcher that traces forward and backward computation, constructing an internal map of operator sequences. Using this map, ZeRO-Infinity overlaps NVMe-to-CPU transfers with CPU-to-GPU transfers and GPU-to-GPU all-gather operations, effectively pipelining all three communication stages with computation. This design achieves bandwidth utilization close to the theoretical peak of the NVMe subsystem [4].
| Feature | ZeRO-Offload | ZeRO-Infinity |
|---|---|---|
| Built on | ZeRO Stage 2 | ZeRO Stage 3 |
| Offload target | CPU memory | CPU memory + NVMe storage |
| Offloaded states | Optimizer states + gradients | All model states (params, grads, optimizer) |
| Max model scale | ~13B params on single GPU | Trillions of parameters |
| Bandwidth optimization | CPU-GPU overlap | Dynamic prefetching across NVMe, CPU, and GPU |
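Like ZeRO-Offload, ZeRO-Infinity is driven by configuration. A minimal sketch of a config that offloads both parameters and optimizer states to NVMe; the key names follow the DeepSpeed configuration schema, and `/local_nvme` is a placeholder path:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```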
ZeRO++, released in 2023, addresses the communication overhead that becomes a bottleneck in ZeRO Stage 3, particularly when training across nodes connected by relatively slow network links. ZeRO++ introduces three complementary communication optimization techniques that together reduce communication volume by up to 4x [5].
qwZ applies block-based quantization to the parameter all-gather, converting weights from FP16 to INT8 and halving the data transferred. Quantizing independent blocks of parameters, rather than whole tensors, achieves 3x better accuracy than naive quantization, and highly optimized CUDA kernels make it 5x faster [5].
hpZ eliminates cross-node all-gather communication during the backward pass through a hierarchical data remapping strategy. Instead of collecting parameters from all GPUs across all nodes, hpZ maintains a full copy of the parameters within each node (distributed across the node's GPUs) and only performs intra-node all-gather operations. This reduces cross-node communication volume from M/Z per GPU to M/(Z*N), where M is the model size, Z is the total number of GPUs, and N is the number of GPUs per node [5].
qgZ replaces the standard gradient all-reduce with a communication-efficient all-to-all based quantized gradient averaging scheme, further reducing communication volume during the backward pass [5].
| Component | Technique | Communication Reduction |
|---|---|---|
| qwZ | Block-based weight quantization (FP16 to INT8) | 2x for parameter all-gather |
| hpZ | Hierarchical weight partitioning | Eliminates cross-node backward all-gather |
| qgZ | Quantized gradient averaging | Reduces gradient communication volume |
| Combined | All three together | Up to 4x total reduction |
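The three techniques are enabled through the ZeRO configuration. A sketch based on the key names in the ZeRO++ tutorial (values illustrative; `zero_hpz_partition_size` is typically set to the number of GPUs per node):

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_quantized_weights": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_gradients": true
  }
}
```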
DeepSpeed-Chat, introduced in April 2023, provides an end-to-end training pipeline for reinforcement learning from human feedback (RLHF), the technique used to align large language models like ChatGPT with human preferences. The system addresses the significant engineering complexity of RLHF training, which requires coordinating multiple models (actor, critic, reward model, and reference model) simultaneously [6].
DeepSpeed-Chat implements the InstructGPT training methodology in three stages: (1) supervised fine-tuning of a pre-trained model on demonstration data, (2) reward model training on human preference comparisons, and (3) RLHF fine-tuning with PPO, using the reward model to optimize the policy.
A single script can take a pre-trained Hugging Face model and run it through all three stages.
The key technical contribution of DeepSpeed-Chat is the DeepSpeed Hybrid Engine (DeepSpeed-HE), which seamlessly switches between ZeRO-based training mode and inference mode during RLHF training. During the experience generation phase (where the model generates text), DeepSpeed-HE uses inference optimizations including tensor parallelism and optimized kernels. During the training phase, it switches to ZeRO-powered data parallelism. This hybrid approach yields over 15x speedup compared to existing RLHF systems [6].
DeepSpeed-Chat enabled training an OPT-13B model via RLHF in 9 hours and an OPT-30B model in 18 hours on Azure Cloud, at costs under $300 and $600 respectively. Compared to other systems like Colossal-AI and Hugging Face DDP, DeepSpeed-Chat achieved up to 19x higher throughput for RLHF training, and 10x faster performance on a single GPU [6].
DeepSpeed-FastGen, released in late 2023 and expanded in 2024, is an inference serving framework for large language models. Its core innovation is the Dynamic SplitFuse technique, which handles variable-length prompts and generation steps more efficiently than traditional continuous batching approaches [7].
Dynamic SplitFuse decomposes long prompts into smaller chunks and composes short prompts together, creating uniform token budgets across iterations. This approach addresses the performance cliffs that occur in traditional systems when long prompts cause batch sizes to drop, and it provides more consistent latency compared to systems like vLLM [7].
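The core scheduling idea, filling a fixed token budget each iteration by splitting long prompts and fusing short ones, can be sketched in a few lines. This is a toy illustration of the concept, not DeepSpeed's implementation:

```python
def schedule_tokens(pending, budget):
    """Illustrative token-budget scheduler in the spirit of Dynamic SplitFuse.

    `pending` maps request ids to the number of prompt tokens still to be
    processed. Each iteration fills a fixed token budget: long prompts are
    split into chunks, and several short prompts are fused into one batch,
    so every iteration does a similar amount of work.
    """
    batch, remaining = [], budget
    for req_id, tokens in list(pending.items()):
        if remaining == 0:
            break
        take = min(tokens, remaining)     # split a long prompt if needed
        batch.append((req_id, take))
        remaining -= take
        if tokens - take == 0:
            del pending[req_id]           # prompt fully processed
        else:
            pending[req_id] = tokens - take
    return batch

pending = {"a": 3000, "b": 200, "c": 150}
print(schedule_tokens(pending, 2048))  # [('a', 2048)] -- long prompt split
print(schedule_tokens(pending, 2048))  # remainder of 'a' fused with 'b' and 'c'
```

Because every iteration processes close to the same number of tokens, latency stays consistent even when request lengths vary widely.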
In 2024, DeepSpeed-FastGen added support for Mixture of Experts (MoE) architectures, including the Mixtral model family. A custom MoE module with inference-optimized kernels was developed, achieving 2.4x higher throughput for the Mixtral model compared to baseline implementations at a prompt length of 1,200 tokens and 60 generation steps. Support for Falcon and Phi-2 model families was also added [7].
Beyond MoE inference, DeepSpeed provides comprehensive support for training Mixture of Experts models. DeepSpeed-MoE includes expert parallelism that can be combined with data, tensor, and ZeRO-based parallelism, the Pyramid-Residual MoE (PR-MoE) architecture for reducing model size, Mixture-of-Students (MoS) distillation for producing smaller dense models, and optimized kernels for MoE inference.
DeepSpeed-MoE has been used to train large MoE models efficiently, enabling researchers to explore the MoE paradigm at scale without building custom distributed training infrastructure [8].
DeepSpeed integrates tightly with the Hugging Face Transformers and Accelerate libraries, making it accessible to the broad community of Hugging Face users. The integration allows users to enable DeepSpeed optimizations by simply passing a configuration file to the Hugging Face Trainer:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    deepspeed="ds_config.json",
    # ... other training arguments
)
```
The DeepSpeed configuration file specifies which ZeRO stage to use, whether to enable offloading, mixed precision settings, and other optimization parameters. Hugging Face's documentation provides detailed guides for each ZeRO stage and common configurations [9].
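A minimal sketch of such a `ds_config.json` for the Trainer integration; the key names follow the DeepSpeed configuration schema, and `"auto"` values are filled in by the Hugging Face integration from the `TrainingArguments`:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  },
  "fp16": { "enabled": "auto" },
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto"
}
```

Using `"auto"` avoids specifying the same hyperparameters twice and keeps the Trainer and DeepSpeed configurations consistent.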
The Hugging Face Accelerate library also supports DeepSpeed as a backend, providing an even simpler interface for distributed training. This integration was instrumental in making DeepSpeed accessible beyond its core audience of distributed systems researchers.
DeepSpeed has been used to train several of the largest and most notable language models.
Megatron-Turing NLG (MT-NLG), a collaboration between Microsoft and NVIDIA announced in October 2021, was a 530-billion parameter autoregressive language model and at the time the largest dense transformer model ever trained. The training system combined three forms of parallelism: tensor parallelism from NVIDIA's Megatron-LM (for intra-node scaling), pipeline parallelism from DeepSpeed (for inter-node scaling), and data parallelism with ZeRO Stage 1 (for scaling across pipeline replicas). This "3D parallelism" approach became a template for subsequent large model training efforts [10].
BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), a 176-billion parameter multilingual model released in 2022, was trained by a collaboration of over 1,000 researchers using the Megatron-DeepSpeed framework. The training combined ZeRO sharding and pipeline parallelism from DeepSpeed with tensor parallelism from Megatron-LM. BLOOM was trained on 384 NVIDIA A100 80GB GPUs at the Jean Zay supercomputer in France [11].
DeepSpeed has also been used in training various other large models, and its ZeRO optimizer is commonly used for fine-tuning large models in research labs and companies worldwide. The Hugging Face integration means that any model available on the Hugging Face Hub can be fine-tuned with DeepSpeed optimizations with minimal code changes.
PyTorch's Fully Sharded Data Parallel (FSDP) is the most direct competitor to DeepSpeed's ZeRO optimizer. FSDP was inspired by ZeRO and implements similar parameter, gradient, and optimizer state sharding natively within PyTorch.
| Feature | DeepSpeed ZeRO | PyTorch FSDP |
|---|---|---|
| Sharding stages | 3 explicit stages (1, 2, 3) | FULL_SHARD, SHARD_GRAD_OP, NO_SHARD |
| CPU offloading | ZeRO-Offload, ZeRO-Infinity (CPU + NVMe) | CPU offloading supported |
| NVMe offloading | Yes (ZeRO-Infinity) | No |
| Communication optimization | ZeRO++ (quantized, hierarchical) | Standard collective operations |
| Pipeline parallelism | Built-in | Requires separate implementation |
| RLHF support | DeepSpeed-Chat (end-to-end) | Manual implementation |
| Inference optimization | DeepSpeed-FastGen, DeepSpeed-Inference | Separate tools needed |
| Integration | Hugging Face, standalone | Native PyTorch |
| Configuration | JSON config file | Python API |
| Raw throughput (FULL_SHARD equivalent) | Competitive | Sometimes faster (up to 5x reported) |
| Memory efficiency at extreme scale | Superior (NVMe offloading, ZeRO++) | Less optimized |
In benchmarks, FSDP's FULL_SHARD mode has shown up to 5x faster per-iteration throughput than DeepSpeed ZeRO Stage 3 in certain configurations. However, DeepSpeed becomes competitive and often preferable when CPU/NVMe offloading, communication optimization (ZeRO++), or extreme memory savings are required. The choice often depends on the specific use case: FSDP is attractive for teams that want to stay within the native PyTorch ecosystem, while DeepSpeed offers more features for extreme-scale training scenarios [12].
DeepSpeed continues to evolve with new capabilities addressing emerging needs in the AI community.
Universal Checkpointing provides efficient and flexible checkpointing for large-scale distributed training. It enables saving and loading checkpoints across different parallelism configurations (for example, saving a checkpoint with 3D parallelism on 256 GPUs and resuming on 128 GPUs with a different parallelism layout), which simplifies cluster management and fault recovery [13].
ALST (Arctic Long Sequence Training) enables scalable and efficient training with multi-million token sequence lengths. As context windows for large language models have grown from thousands to millions of tokens, the memory and computation requirements for processing long sequences have become a major bottleneck. ALST addresses this through specialized memory management and communication strategies optimized for very long sequences [13].
ZenFlow enables stall-free offloading training via asynchronous updates. Traditional offloading approaches (like ZeRO-Offload) can introduce stalls when the CPU is not fast enough to complete the optimizer step before the GPU needs the updated parameters. ZenFlow eliminates these stalls by overlapping computation and communication more aggressively through an asynchronous update mechanism [13].
SuperOffload, announced for 2026, targets large-scale LLM training on superchips (systems like NVIDIA Grace Hopper that combine CPU and GPU on a single module with high-bandwidth unified memory). It aims to exploit the unique memory architecture of these systems for more efficient offloading [13].
| Version | Date | Key Features |
|---|---|---|
| 0.3.x | Feb 2020 | Initial release, ZeRO Stage 1 and 2 |
| 0.4.x | 2021 | ZeRO Stage 3, ZeRO-Infinity |
| 0.7.x | 2022 | DeepSpeed-MoE, performance improvements |
| 0.9.x-0.10.x | 2023 | DeepSpeed-Chat, ZeRO++, DeepSpeed-FastGen |
| 0.14.x-0.15.x | 2024 | Universal Checkpointing, MoE inference, Mixtral support |
| 0.16.x-0.18.x | 2025 | ALST, ZenFlow, expanded hardware support |
DeepSpeed is designed as a drop-in replacement for PyTorch's standard training loop. Users wrap their model and optimizer with DeepSpeed's initialization function:
```python
import deepspeed

# deepspeed.initialize wraps the model and optimizer in a DeepSpeed engine
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

for batch in dataloader:
    loss = model(batch)
    model.backward(loss)   # replaces loss.backward()
    model.step()           # replaces optimizer.step()
```
The ds_config dictionary (or JSON file) controls all optimization settings, including ZeRO stage, offloading, mixed precision, gradient accumulation, and learning rate scheduling. This configuration-driven approach allows users to experiment with different optimization strategies without changing their training code [1].
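An illustrative `ds_config` dictionary covering the settings named above; the key names follow the DeepSpeed configuration schema, while the values here are arbitrary examples:

```python
# Example DeepSpeed configuration: ZeRO Stage 2 with CPU offloading,
# FP16 mixed precision, gradient accumulation, and a warmup LR schedule.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_num_steps": 1000},
    },
}
```

Switching from, say, ZeRO Stage 2 to Stage 3 with NVMe offloading is then a matter of editing this dictionary, with no changes to the training loop itself.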
As of early 2026, DeepSpeed remains one of the essential tools in the large-scale model training ecosystem. The library is at version 0.18.x, actively maintained by the DeepSpeed team (which has moved its GitHub organization from microsoft/DeepSpeed to deepspeedai/DeepSpeed).
Several trends define DeepSpeed's current trajectory:
Complementary role with FSDP. Rather than a zero-sum competition, DeepSpeed and PyTorch FSDP increasingly serve complementary roles. FSDP is the default choice for straightforward distributed training within the PyTorch ecosystem, while DeepSpeed is preferred for scenarios requiring advanced offloading, extreme memory optimization, or features like DeepSpeed-Chat.
Focus on LLM workflows. DeepSpeed's recent features (ALST, ZenFlow, DeepSpeed-Chat, DeepSpeed-FastGen) reflect its focus on the full lifecycle of large language model development, from pre-training through alignment to inference serving.
Hardware adaptation. With the emergence of new hardware architectures like NVIDIA Grace Hopper superchips and AMD MI300X GPUs, DeepSpeed is adapting its offloading and communication strategies. SuperOffload specifically targets the unified memory architecture of superchips.
Continued Hugging Face integration. The tight integration with Hugging Face Transformers and Accelerate ensures that DeepSpeed's optimizations remain accessible to the broad ML community, not just distributed systems specialists.
DeepSpeed's contribution to making large-scale model training accessible cannot be overstated. By reducing the memory barriers through ZeRO and providing turnkey solutions for RLHF training and inference, it has democratized capabilities that were once available only to organizations with massive engineering teams and hardware budgets.