# DeepSpeed

> Source: https://aiwiki.ai/wiki/deepspeed
> Updated: 2026-06-21
> Categories: AI Infrastructure, Deep Learning, Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

DeepSpeed is an open-source [deep learning](/wiki/deep-learning) optimization library developed by Microsoft that makes [distributed training](/wiki/data-parallelism) and inference of large models efficient, easy to use, and cost-effective.[8] Built on top of [PyTorch](/wiki/pytorch), DeepSpeed provides system-level optimizations that allow researchers and engineers to train models with billions or even trillions of parameters on commodity GPU clusters.[8] Its core innovation, the Zero Redundancy Optimizer (ZeRO), partitions model states across [data-parallel](/wiki/data-parallelism) workers instead of replicating them, reducing memory consumption without the code changes that [model parallelism](/wiki/model-parallelism) requires and without sacrificing computational throughput.[1] In the original 2020 evaluation, ZeRO trained models of over 100 billion parameters with super-linear speedup on 400 GPUs, achieving a throughput of 15 petaFLOPS, which the authors described as an 8x increase in model size and a 10x increase in achievable performance over the prior state of the art.[1]

DeepSpeed was first released by Microsoft Research in February 2020 and has since become one of the most widely adopted libraries for large-scale model training.[8] It has been used to train landmark models including [BLOOM](/wiki/bloom) (176 billion parameters) and Megatron-Turing NLG (530 billion parameters).[9][7] In February 2025, DeepSpeed was contributed to the Linux Foundation AI & Data as an incubation project, marking its transition to community-driven governance.[13]

## ELI5 (explain like I'm 5)

Imagine you have a really big jigsaw puzzle, so big that it does not fit on your table. Normally, you would need a huge table (a really expensive computer) to put the whole puzzle together. DeepSpeed is like a clever way of splitting the puzzle pieces across several smaller tables (regular computers). Each table only holds the pieces it needs right now, and when it needs a piece from another table, it just asks for it quickly. This way, you can solve even the biggest puzzle in the world using a bunch of regular-sized tables working together.

## What is DeepSpeed used for?

DeepSpeed is used to train and serve very large neural networks that would otherwise not fit in the memory of a single accelerator. Its primary applications are pre-training [large language models](/wiki/large-language-model) and other massive [neural networks](/wiki/neural-network), aligning them with [Reinforcement Learning from Human Feedback](/wiki/rlhf) (via DeepSpeed-Chat), and serving them efficiently at inference time (via DeepSpeed-Inference and DeepSpeed-FastGen). The official project tagline summarizes its scope: DeepSpeed is "a deep learning optimization library that makes distributed training and inference easy, efficient, and effective."[8] In practice, teams reach for DeepSpeed when they need extreme memory savings, CPU or NVMe offloading, communication-bandwidth reductions, or built-in support for [Mixture of Experts](/wiki/mixture-of-experts) and RLHF workflows.

## History and Development

### Origins at Microsoft Research

DeepSpeed emerged from Microsoft's "AI at Scale" initiative, which aimed to develop the infrastructure needed to train the largest AI models. The project was led by researchers at Microsoft Research, with Samyam Rajbhandari as a primary architect.[1] The first public release came in February 2020, coinciding with the publication of the foundational ZeRO paper.[1]

The motivation for DeepSpeed was straightforward: as model sizes grew from millions to billions of parameters, standard data-parallel training became insufficient. A model with billions of parameters cannot fit in the memory of a single [GPU](/wiki/gpu), and existing approaches to model parallelism (tensor parallelism and pipeline parallelism) required significant code modifications and were difficult to use.[1] DeepSpeed's goal was to enable training of arbitrarily large models with minimal code changes.[8]

### The ZeRO Paper

The foundational research behind DeepSpeed is the ZeRO (Zero Redundancy Optimizer) paper by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He, first published as an arXiv preprint in October 2019 and later presented at SC20 (the International Conference for High Performance Computing).[1] The paper identified a critical inefficiency in standard data-parallel training: every GPU maintains a complete copy of the model states (parameters, gradients, and optimizer states), resulting in massive memory redundancy.[1]

The paper's central claim was that eliminating this redundancy would let model size grow in proportion to the number of devices. As the authors put it, ZeRO "eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency."[1] Their analysis concluded that ZeRO "has the potential to scale beyond 1 Trillion parameters using today's hardware."[1]

For a model with P parameters trained with the Adam optimizer in [mixed precision](/wiki/mixed_precision_training), the memory required per GPU for model states alone is approximately 2P + 2P + (4P + 4P + 4P) = 16P bytes. For a 7.5-billion parameter model, this amounts to roughly 120 GB per GPU, far exceeding the memory of any single GPU available at the time.[1] The ZeRO paper proposed partitioning these redundant states across data-parallel processes, dramatically reducing per-GPU memory consumption without sacrificing computational efficiency.[1] In practice, the system trained models of up to 13 billion parameters without any model parallelism, which the authors noted is "harder for scientists to apply," and the same breakthroughs were used to build Turing-NLG (17 billion parameters), at the time the world's largest language model.[1]

## Background and motivation

Training [large language models](/wiki/large-language-model) and other massive [neural networks](/wiki/neural-network) requires storing three kinds of data in GPU memory during training:

1. **Model states**: the model parameters, [gradients](/wiki/gradient), and [optimizer](/wiki/optimizer) states (such as momentum and variance for [Adam](/wiki/optimizer))
2. **Activation memory**: intermediate outputs stored for the backward pass during [backpropagation](/wiki/backpropagation)
3. **Temporary buffers and fragmentation**: short-lived working buffers and memory that is unusable because it is fragmented

The ZeRO paper refers to the second and third kinds collectively as residual states.[1]

For mixed-precision training with the Adam optimizer, the memory required per parameter breaks down as follows:

| Component | Precision | Bytes per parameter |
|---|---|---|
| Parameters | FP16 | 2 |
| Gradients | FP16 | 2 |
| Parameters (optimizer copy) | FP32 | 4 |
| Momentum (Adam) | FP32 | 4 |
| Variance (Adam) | FP32 | 4 |
| **Total** | | **16** |

For a model with 1.5 billion parameters (such as [GPT-2](/wiki/gpt2)), this amounts to 24 GB of memory for model states alone, exceeding the capacity of most individual GPUs before even accounting for activations and temporary buffers.[1]

Standard [data parallelism](/wiki/data-parallelism) replicates all model states across every GPU, which wastes memory. [Model parallelism](/wiki/model-parallelism) splits the model across GPUs but introduces communication overhead and often requires significant code changes.[1] DeepSpeed was created to address this fundamental tension between memory efficiency and computational efficiency.[1]

## Zero Redundancy Optimizer (ZeRO)

ZeRO is the core technology behind DeepSpeed. Introduced by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He in their 2020 paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models," ZeRO eliminates memory redundancy in data-parallel training by partitioning model states across data-parallel processes instead of replicating them.[1]

ZeRO operates in three progressive stages, each partitioning an additional component of the model states.[1]

### ZeRO stage 1: optimizer state partitioning

In standard data-parallel training, each GPU maintains the full optimizer state. For Adam, this includes the first-moment estimate (momentum) and second-moment estimate (variance), each requiring 4 bytes per parameter in FP32, plus the FP32 master copy of the weights (4 bytes per parameter). That totals 12 bytes per parameter just for optimizer states.[1]

In stage 1 (also known as ZeRO-OS), the optimizer states (for Adam, the FP32 master copy of parameters, momentum, and variance) are partitioned across all data-parallel processes. Each process stores and updates only its assigned partition of the optimizer states, while still maintaining the full FP16 parameters and gradients. After the backward pass, gradients are reduced as usual, but each GPU updates only its own partition of the optimizer states and the corresponding parameters. The updated parameters are then shared with all GPUs through an all-gather.[1]

Stage 1 reduces optimizer state memory by a factor of N (where N is the data-parallel degree) while maintaining the same communication volume as standard data parallelism. For a typical Adam setup, this provides up to a 4x memory reduction.[1]

### ZeRO stage 2: gradient partitioning

Stage 2 adds gradient partitioning on top of stage 1. After the backward pass, instead of performing an all-reduce on the full gradient tensor, each process reduces and retains only the gradients corresponding to its partition of optimizer states. This is implemented using a reduce-scatter operation, which leaves each GPU holding only 1/N of the reduced gradients. Together with the all-gather that redistributes the updated parameters, it moves the same total volume as the all-reduce used in standard data parallelism. Once a gradient has been reduced and used for the parameter update, it is discarded, freeing the memory. This provides up to an 8x memory reduction with the same communication volume as standard data parallelism.[1]
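The relationship between an all-reduce and a reduce-scatter followed by an all-gather can be checked with a small single-process simulation. This is an illustrative sketch only: the helper names are hypothetical, plain lists stand in for per-GPU tensors, and no real collective library is involved.

```python
# Toy single-process simulation; each inner list is one worker's gradient vector.

def all_reduce(grads):
    """Every worker ends up with the element-wise sum over all workers."""
    total = [sum(column) for column in zip(*grads)]
    return [list(total) for _ in grads]


def reduce_scatter(grads):
    """Worker r ends up with only the r-th chunk of the element-wise sum."""
    n = len(grads)
    chunk = len(grads[0]) // n  # assumes the length divides evenly by n
    return [
        [sum(g[i] for g in grads) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(n)
    ]


def all_gather(shards):
    """Every worker ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two workers, four gradient elements
shards = reduce_scatter(grads)            # each worker keeps 1/N of the result
print(shards)
print(all_gather(shards) == all_reduce(grads))
```

In ZeRO the all-gather step carries the updated parameters rather than the gradients themselves, but the data volume is the same, which is why the total communication matches standard data parallelism.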

### ZeRO stage 3: parameter partitioning

Stage 3 takes the partitioning to its logical conclusion by distributing the FP16 model parameters themselves across data-parallel processes. Each process stores only a shard of the full parameter set (1/N of the parameters). During forward and backward passes, ZeRO-3 dynamically gathers the parameters needed for each layer via all-gather operations, uses them for computation, then discards non-local parameters after use.[1]

Stage 3 achieves memory reduction that scales linearly with the number of data-parallel processes. On 64 GPUs, this yields a 64x reduction in per-GPU memory for model states. The trade-off is a 50% (1.5x) increase in communication volume compared to standard data parallelism (due to the additional all-gather operations during the forward pass). However, this communication can be overlapped with computation, and in practice the throughput impact is often modest.[1]
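The 1.5x figure follows from counting collectives. The sketch below assumes the accounting used in the ZeRO paper, where each reduce-scatter or all-gather over the full model moves a volume equal to the number of parameters; the function name is hypothetical.

```python
def comm_volume(num_params, stage):
    """Elements communicated per GPU per training step, by ZeRO stage."""
    reduce_scatter = num_params  # gradients are reduced once
    all_gather = num_params      # one full pass over the parameters
    if stage in (0, 1, 2):
        # An all-reduce is equivalent to a reduce-scatter plus an all-gather.
        return reduce_scatter + all_gather
    if stage == 3:
        # Parameters are gathered for the forward pass and again for backward.
        return reduce_scatter + 2 * all_gather
    raise ValueError("stage must be 0, 1, 2, or 3")


ratio = comm_volume(7_500_000_000, 3) / comm_volume(7_500_000_000, 0)
print(ratio)
```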

### ZeRO stages comparison

| Feature | Stage 1 (ZeRO-OS) | Stage 2 | Stage 3 |
|---|---|---|---|
| Optimizer states partitioned | Yes | Yes | Yes |
| Gradients partitioned | No | Yes | Yes |
| Parameters partitioned | No | No | Yes |
| Memory reduction (vs. baseline) | Up to 4x | Up to 8x | Linear with N GPUs |
| Communication overhead vs. data parallelism | Same | Same | 1.5x |
| Code changes required | None | None | None |
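Because the stage is chosen in configuration rather than in model code, switching stages is a one-value change. The sketch below builds a minimal configuration as a Python dictionary; `zero_optimization.stage`, `train_micro_batch_size_per_gpu`, and `fp16.enabled` are DeepSpeed configuration keys, while the helper name and the batch size are placeholders chosen for illustration.

```python
import json


def zero_config(stage, micro_batch_size=4):
    """Build a minimal DeepSpeed-style config dict for the given ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,  # placeholder value
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": stage},
    }


# Moving from stage 2 to stage 3 changes a single value.
print(json.dumps(zero_config(2), indent=2))
print(zero_config(3)["zero_optimization"])
```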

### Memory savings example

To illustrate the memory impact, consider a 7.5-billion parameter model trained with Adam in mixed precision on 64 GPUs:

| Configuration | Memory per GPU |
|---|---|
| Standard data parallelism | ~120 GB |
| ZeRO Stage 1 | ~31.4 GB |
| ZeRO Stage 2 | ~16.6 GB |
| ZeRO Stage 3 | ~1.9 GB |

These reductions make it possible to train models with billions of parameters on clusters of GPUs that individually have only 16 to 80 GB of memory.[1]
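The figures in the table can be reproduced from the 16-bytes-per-parameter breakdown. The following is a minimal sketch, assuming 1 GB = 10^9 bytes and counting model states only (no activations or buffers); the function name is hypothetical.

```python
def model_state_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB for mixed-precision Adam training."""
    params = 2 * num_params   # FP16 parameters
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 master weights, momentum, and variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus     # stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also partitions parameters
    return (params + grads + optim) / 1e9


for stage in range(4):
    print(stage, round(model_state_gb(7.5e9, 64, stage), 1))
```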

## ZeRO-Offload

ZeRO-Offload, introduced by Jie Ren et al. in 2021, extends ZeRO stage 2 by offloading gradients and optimizer states to CPU memory and running the optimizer update on the CPU. This allows training of models with up to 13 billion parameters on a single NVIDIA V100 GPU (32 GB), roughly 10x larger than standard PyTorch can fit, by leveraging the much larger capacity of system RAM (typically 256 GB or more) compared to GPU memory (typically 16 to 80 GB).[2] The paper framed the goal directly: ZeRO-Offload "changes the large model training landscape by making large model training accessible to nearly everyone," training models with over 13 billion parameters on a single GPU "without requiring any model change from the data scientists or sacrificing computational efficiency."[2]

The key insight behind ZeRO-Offload is a careful partitioning strategy between CPU and GPU: gradients, optimizer states, and the optimizer computation step are offloaded to the CPU, while parameters and the forward/backward computation remain on the GPU. This minimizes data movement across the PCIe bus while maximizing memory savings. A CPU Adam optimizer implementation is provided that is highly optimized with SIMD instructions.[2]

ZeRO-Offload works symbiotically with ZeRO stage 2 and scales to multiple GPUs when available. On a single V100 GPU, it achieves 40 TFLOPS for a 10-billion-parameter model, compared to 30 TFLOPS for a 1.4-billion-parameter model using PyTorch alone.[2] ZeRO-Offload achieves near-GPU-only training throughput for large models because the optimizer step computation on the CPU can be overlapped with the next forward pass on the GPU. For smaller models where GPU utilization is already high, the overhead of CPU-GPU data transfers becomes more noticeable. It also supports near-linear scaling on up to 128 GPUs.[2]

## ZeRO-Infinity

ZeRO-Infinity, published by Rajbhandari et al. in 2021, extends the offloading concept further by leveraging not only CPU memory but also NVMe (SSD) storage. Built on ZeRO stage 3, ZeRO-Infinity can offload all model states (parameters, gradients, and optimizer states) to CPU memory and NVMe storage, enabling training of models with tens or even hundreds of trillions of parameters on current-generation GPU clusters.[3] The paper opens by quantifying the memory wall it set out to break: "In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB)."[3]

ZeRO-Infinity introduces several innovations:

- **Infinity offload engine**: a system for efficiently moving data between GPU, CPU DRAM, and NVMe storage, with overlap of computation and data transfer
- **Memory-centric tiling**: breaks large operators into smaller tiles that can be processed sequentially, allowing individual layers with parameters exceeding GPU memory
- **Bandwidth-centric partitioning**: maps data to the storage tier (GPU, CPU, or NVMe) that best matches the required bandwidth for each training phase
- **Dynamic prefetcher**: traces forward and backward computation, constructing an internal map of operator sequences. Using this map, ZeRO-Infinity overlaps NVMe-to-CPU transfers with CPU-to-GPU transfers and GPU-to-GPU all-gather operations, effectively pipelining all three communication stages with computation. This achieves bandwidth utilization close to the theoretical peak of the NVMe subsystem.[3]

On 512 NVIDIA V100 GPUs, ZeRO-Infinity sustains over 25 petaFLOPS (40% of peak) and demonstrates superlinear scaling. It can also fine-tune trillion-parameter models on a single DGX-2 node, making such models accessible to smaller research labs.[3]

| Feature | ZeRO-Offload | ZeRO-Infinity |
|---|---|---|
| Built on | ZeRO Stage 2 | ZeRO Stage 3 |
| Offload target | CPU memory | CPU memory + NVMe storage |
| Offloaded states | Optimizer states + gradients | All model states (params, grads, optimizer) |
| Max model scale | ~13B params on single GPU | Trillions of parameters |
| Bandwidth optimization | CPU-GPU overlap | Dynamic prefetching across NVMe, CPU, and GPU |
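In configuration terms, the two approaches differ mainly in the ZeRO stage and the offload targets. The sketch below shows both as Python dictionaries; `offload_optimizer`, `offload_param`, `device`, `nvme_path`, and `pin_memory` are DeepSpeed configuration keys, while the NVMe mount point is a placeholder and the variable names are hypothetical.

```python
NVME_PATH = "/local_nvme"  # placeholder mount point, adjust for the actual machine

# ZeRO-Offload style: stage 2, optimizer states (and the update step) on the CPU.
zero_offload_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# ZeRO-Infinity style: stage 3, parameters and optimizer states on NVMe.
zero_infinity_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": NVME_PATH, "pin_memory": True},
        "offload_optimizer": {"device": "nvme", "nvme_path": NVME_PATH, "pin_memory": True},
    },
}

for name, config in [("offload", zero_offload_config), ("infinity", zero_infinity_config)]:
    zero = config["zero_optimization"]
    targets = {key: value["device"] for key, value in zero.items() if key.startswith("offload_")}
    print(name, zero["stage"], targets)
```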

## ZeRO++ (2023)

ZeRO++, released in 2023, addresses the communication overhead that becomes a bottleneck in ZeRO Stage 3, particularly when training across nodes connected by relatively slow network links. ZeRO++ introduces three complementary communication optimization techniques that together "reduce total communication volume by 4x compared with ZeRO, without impacting model quality."[14]

### Quantized Weights (qwZ)

qwZ applies block-based quantization to reduce the communication volume of the parameter all-gather from FP16 to INT8, halving the data transferred. Block-based quantization conducts independent quantization on subsets of model parameters, achieving 3x better accuracy and 5x faster execution compared to naive quantization through highly optimized [CUDA](/wiki/cuda) kernels.[14]
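The benefit of quantizing in blocks is easiest to see when a tensor mixes small and large values. The sketch below is a plain-Python illustration of symmetric INT8 quantization with one scale per block. It is not the DeepSpeed CUDA kernel, and the function names and sample weights are made up for the example.

```python
def quantize_block(values):
    """Symmetric INT8 quantization of one block: returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]  # each code lies in [-127, 127]
    return codes, scale


def quantize(values, block_size):
    """Quantize each block of `block_size` values with its own scale."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate floating-point values from (codes, scale) pairs."""
    return [code * scale for codes, scale in blocks for code in codes]


weights = [0.01, -0.02, 0.015, 0.005, 8.0, -6.0, 7.5, 3.0]
per_tensor = dequantize(quantize(weights, block_size=len(weights)))  # one scale
per_block = dequantize(quantize(weights, block_size=4))              # two scales


def max_error(restored, start, stop):
    """Largest absolute reconstruction error over weights[start:stop]."""
    return max(abs(w - r) for w, r in zip(weights[start:stop], restored[start:stop]))


# Error on the four small weights under each scheme.
print(max_error(per_tensor, 0, 4))
print(max_error(per_block, 0, 4))
```

With a single scale, the large values dominate and the small weights all collapse to zero. With one scale per block, the small weights keep most of their precision, at the cost of transmitting one extra scale per block.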

### Hierarchical Partitioning (hpZ)

hpZ eliminates cross-node all-gather communication during the backward pass through a hierarchical data remapping strategy. Instead of collecting parameters from all GPUs across all nodes, hpZ maintains a full copy of the parameters within each node (distributed across the node's GPUs) and only performs intra-node all-gather operations. Cross-node traffic for the backward all-gather therefore drops to zero. The trade-off is additional GPU memory, because each GPU holds a node-local shard of the parameters in addition to its global ZeRO-3 shard.[14]

### Quantized Gradients (qgZ)

qgZ replaces the gradient reduce-scatter collective used by ZeRO with a communication-efficient, all-to-all based quantized gradient averaging scheme, further reducing communication volume during the backward pass.[14]

| Component | Technique | Communication Reduction |
|---|---|---|
| qwZ | Block-based weight quantization (FP16 to INT8) | 2x for parameter all-gather |
| hpZ | Hierarchical weight partitioning | Eliminates cross-node backward all-gather |
| qgZ | Quantized gradient averaging | Reduces gradient communication volume |
| Combined | All three together | Up to 4x total reduction |
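The 4x figure can be sanity-checked with simple arithmetic. The sketch below assumes that baseline ZeRO stage 3 moves a volume of M for each of its three collectives (forward all-gather, backward all-gather, and gradient reduction), which is consistent with the 1.5x overhead described for stage 3. The share left for gradient communication is derived here from the quoted 4x total rather than taken from the text.

```python
M = 1.0  # model size, normalized to 1

baseline = {
    "forward all-gather": M,
    "backward all-gather": M,
    "gradient reduction": M,
}

zero_pp = {
    "forward all-gather": M / 2,   # qwZ: FP16 to INT8 halves the volume
    "backward all-gather": 0.0,    # hpZ: no cross-node traffic for this step
}

# Assumption: the remainder is whatever makes the total 4x smaller than baseline.
target_total = sum(baseline.values()) / 4
zero_pp["gradient reduction"] = target_total - sum(zero_pp.values())

print(sum(baseline.values()), sum(zero_pp.values()))
print(zero_pp["gradient reduction"])
```

Under these assumptions the gradient step would need to shrink to about a quarter of its original volume for the three techniques to reach the combined 4x reduction.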

## 3D parallelism

DeepSpeed supports combining three parallelism strategies simultaneously, a technique referred to as 3D parallelism:

1. **Data parallelism** (powered by ZeRO): replicates or partitions model states across processes, with each process handling a different subset of the training data
2. **[Pipeline parallelism](/wiki/pipeline)**: splits the model layers into stages across GPUs, with micro-batches flowing through the pipeline to maximize utilization
3. **[Tensor parallelism](/wiki/model-parallelism)**: splits individual layers (such as large matrix multiplications) across multiple GPUs

3D parallelism simultaneously addresses both memory efficiency and computational efficiency, enabling DeepSpeed to train models with over one trillion parameters.[11] The Megatron-Turing NLG 530B model, a collaboration between NVIDIA and Microsoft, was trained using this approach.[7]
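One way to picture 3D parallelism is as a three-dimensional grid of GPUs whose size is the product of the three parallel degrees. The sketch below maps a global rank to grid coordinates. The ordering (tensor-parallel index varying fastest, so that tensor-parallel peers are adjacent ranks) is an assumption made for illustration, not necessarily the topology DeepSpeed uses, and the function name is hypothetical.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (dp_index, pp_index, tp_index) in a tp*pp*dp grid."""
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tp_index = rank % tp            # varies fastest: neighbours share a layer
    pp_index = (rank // tp) % pp    # pipeline stage within one model replica
    dp_index = rank // (tp * pp)    # which model replica this rank belongs to
    return dp_index, pp_index, tp_index


# Example: 2-way tensor, 2-way pipeline, 2-way data parallelism on 8 GPUs.
for rank in range(8):
    print(rank, rank_to_coords(rank, tp=2, pp=2, dp=2))
```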

DeepSpeed's pipeline parallelism relies on gradient accumulation: each training batch is divided into micro-batches that flow through the pipeline stages, so that different stages work on different micro-batches at the same time. Because only activations and their gradients cross stage boundaries, pipeline parallelism reduces communication volume, and DeepSpeed reports 2 to 7x faster training on clusters with limited network bandwidth.[11]
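The number of micro-batches determines how much of the pipeline sits idle while it fills and drains. The sketch below uses the standard estimate for a simple fill-and-drain schedule, assuming every stage takes the same time per micro-batch. It is a back-of-the-envelope model with a hypothetical function name, not a description of DeepSpeed's exact schedule.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of time a stage is idle in a fill-and-drain pipeline schedule."""
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("both arguments must be at least 1")
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


# More micro-batches per batch shrink the idle fraction for a 4-stage pipeline.
for micro_batches in (1, 4, 16, 64):
    print(micro_batches, round(bubble_fraction(4, micro_batches), 3))
```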

## Mixed-precision training

DeepSpeed supports mixed-precision training using FP16 and BF16 (bfloat16) formats. During mixed-precision training, the forward and backward passes are computed in half precision (16-bit), while the optimizer maintains a full-precision (FP32) master copy of the parameters for numerical stability.[11]

DeepSpeed handles [loss](/wiki/loss) scaling automatically to prevent gradient underflow in FP16 training. The library supports both dynamic loss scaling (which adjusts the scale factor based on whether overflow is detected) and static loss scaling.[11]
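Dynamic loss scaling can be summarized as: shrink the scale and skip the step when an overflow is detected, and grow the scale again after a run of clean steps. The class below is a simplified sketch of that idea with hypothetical names and default values. It omits refinements such as hysteresis and is not DeepSpeed's implementation.

```python
class DynamicLossScaler:
    """Minimal dynamic loss scaler: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, window=1000, min_scale=1.0):
        self.scale = init_scale
        self.factor = factor
        self.window = window        # clean steps required before growing the scale
        self.min_scale = min_scale
        self.clean_steps = 0

    def update(self, overflow):
        """Record one step; return True if the optimizer step should be applied."""
        if overflow:
            self.scale = max(self.scale / self.factor, self.min_scale)
            self.clean_steps = 0
            return False            # gradients are invalid, skip this step
        self.clean_steps += 1
        if self.clean_steps % self.window == 0:
            self.scale *= self.factor
        return True


scaler = DynamicLossScaler(init_scale=8.0, window=2)
history = []
for overflow in [True, False, False, False]:
    applied = scaler.update(overflow)
    history.append((applied, scaler.scale))
print(history)
```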

## Gradient checkpointing

DeepSpeed provides an activation checkpointing API that reduces activation memory at the cost of recomputing activations during the backward pass. Its implementation includes several optimizations beyond the standard approach:[11]

- **Activation partitioning**: when model parallelism is used, partitions stored activation checkpoints across the model-parallel GPUs instead of replicating them
- **CPU checkpointing**: offloads stored activations to CPU memory, reducing GPU memory usage further
- **Contiguous memory optimization**: consolidates activation checkpoints into contiguous memory buffers to reduce fragmentation
- **Layerwise profiling**: profiles memory usage at the layer level to help identify bottlenecks
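The options listed above are switched on in the `activation_checkpointing` section of the DeepSpeed configuration. The sketch below shows that section as a Python dictionary. The keys are DeepSpeed configuration keys, while the specific values (including the number of checkpoints) are placeholders chosen for illustration.

```python
import json

activation_checkpointing_config = {
    "activation_checkpointing": {
        "partition_activations": True,           # activation partitioning
        "cpu_checkpointing": True,               # offload checkpoints to CPU memory
        "contiguous_memory_optimization": True,  # pack checkpoints contiguously
        "number_checkpoints": 4,                 # placeholder value
        "profile": True,                         # log per-checkpoint profiling data
    }
}

print(json.dumps(activation_checkpointing_config, indent=2))
```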

## Communication optimizations

DeepSpeed includes several techniques for reducing the communication overhead of distributed training.

### 1-bit Adam

1-bit Adam compresses gradient communication by quantizing the momentum term to a single bit per element. This reduces communication volume by up to 5x while maintaining the same convergence speed as uncompressed Adam.[6] The approach works in two phases: a warmup phase using standard Adam (typically 15 to 20% of total training steps), followed by a compression phase where the variance term is frozen and the momentum is compressed.[6]

Experiments on up to 256 GPUs show that 1-bit Adam achieves up to 3.3x higher throughput for [BERT](/wiki/bert) pre-training and up to 2.9x higher throughput for SQuAD fine-tuning.[6]
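The compression phase can be illustrated with error-compensated sign compression: each element is sent as a sign plus one shared scale, and the part lost to compression is carried over to the next step. The sketch below is a simplified single-worker illustration with hypothetical names. It uses the mean absolute value as the scale, which is one possible choice and not necessarily the one used in DeepSpeed.

```python
def one_bit_compress(momentum, error):
    """Compress momentum to signs plus one scale, with error feedback."""
    corrected = [m + e for m, e in zip(momentum, error)]   # add carried-over error
    scale = sum(abs(c) for c in corrected) / len(corrected)
    signs = [1 if c >= 0 else -1 for c in corrected]       # one bit per element
    restored = [scale * s for s in signs]                  # what the receiver sees
    new_error = [c - r for c, r in zip(corrected, restored)]
    return signs, scale, new_error


momentum = [0.5, -1.5, 1.0]
signs, scale, error = one_bit_compress(momentum, [0.0, 0.0, 0.0])
print(signs, scale, error)
```

The residual returned as `new_error` is added back before the next compression, so information lost in one step is not discarded permanently.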

### 0/1 Adam

0/1 Adam extends 1-bit Adam by further reducing communication. It freezes the variance term adaptively and combines 1-bit compression with local steps, so that workers skip synchronization entirely on some steps and exchange 1-bit compressed updates on the others. This can achieve up to 26x communication volume savings in favorable conditions.

### Communication overlapping

DeepSpeed overlaps gradient reduction operations with backward pass computation. As gradients become available during the backward pass, they are immediately reduced (averaged) across processes rather than waiting for all gradients to be computed first.[11]

## Sparse attention

DeepSpeed provides sparse attention kernels that support input sequences an order of magnitude longer than standard dense [attention](/wiki/attention) mechanisms. These kernels achieve up to 6x faster execution than dense attention with comparable accuracy and 1.5 to 3x faster execution than other sparse implementations.[11]

The library supports several sparse attention patterns, including Fixed (from OpenAI's Sparse Transformer), BigBird (from Google), and BSLongformer (a block-sparse implementation of Longformer). DeepSpeed also provides a template for defining custom sparse attention patterns.[11]

## DeepSpeed-MoE

DeepSpeed-MoE, presented at ICML 2022 by Rajbhandari et al., provides system support for training and running inference on [Mixture of Experts](/wiki/mixture-of-experts) (MoE) models. MoE models achieve quality comparable to dense models while requiring significantly less training compute, as only a subset of "expert" sub-networks is activated for each input.[4]

DeepSpeed-MoE includes:

- Novel MoE architecture designs that reduce model size by up to 3.7x through compression
- Optimized inference kernels that provide 7.3x better latency and cost compared to existing MoE inference solutions
- End-to-end training support with up to 5x training cost savings for auto-regressive language models
- Up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models
- Flexible expert parallelism across GPUs
- Load balancing mechanisms to ensure even utilization of experts
- Integration with ZeRO for memory-efficient MoE training
- Support for top-k routing with auxiliary load balancing losses[4]
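Top-k routing, mentioned in the last item above, selects a few experts per token from the router's scores. The sketch below is a minimal plain-Python illustration for a single token. Renormalizing the weights over the selected experts is one common choice and an assumption here, and the function name is hypothetical.

```python
import math


def top_k_gate(logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]  # weights sum to 1 over chosen


# Router scores for one token over four experts; only two experts are activated.
gates = top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
print(gates)
```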

DeepSpeed-MoE has been used to train large MoE models efficiently, enabling researchers to explore the MoE paradigm at scale without building custom distributed training infrastructure.[4]

## DeepSpeed-Chat

DeepSpeed-Chat, introduced in April 2023, is an end-to-end system for training [ChatGPT](/wiki/chatgpt)-style models using [Reinforcement Learning from Human Feedback](/wiki/rlhf) ([RLHF](/wiki/rlhf)), the technique used to align [large language models](/wiki/large_language_model) with human preferences. The system addresses the significant engineering complexity of RLHF training, which requires coordinating multiple models (actor, critic, reward model, and reference model) simultaneously.[5]

### Three-Stage Pipeline

DeepSpeed-Chat implements the three-step InstructGPT training pipeline:

1. **Supervised fine-tuning (SFT)**: training on human-written demonstration data
2. **Reward model training**: learning a reward function from human preference comparisons
3. **RLHF with PPO**: optimizing the policy model against the learned reward model using Proximal Policy Optimization

A single script can take a pre-trained Hugging Face model and run it through all three stages.[5]

### DeepSpeed Hybrid Engine (DeepSpeed-HE)

The system introduces the DeepSpeed Hybrid Engine (DeepSpeed-HE), which unifies training and inference optimizations into a single engine. During RLHF, the model alternates between generating responses (inference) and updating parameters (training). DeepSpeed-HE applies inference optimizations (such as kernel fusion, tensor parallelism, and [KV caching](/wiki/kv-cache)) during the generation phase and training optimizations (such as ZeRO and gradient checkpointing) during the update phase. This hybrid approach yields over 15x speedup compared to existing RLHF systems.[5]

### How much does it cost to train an RLHF model with DeepSpeed-Chat?

DeepSpeed-Chat enabled training an OPT-13B model via RLHF in 9 hours and an OPT-30B model in 18 hours on Azure Cloud, at costs under $300 and $600 respectively.[5] Compared to other systems like Colossal-AI and Hugging Face DDP, DeepSpeed-Chat achieved up to 19x higher throughput for RLHF training, and 10x faster performance on a single GPU. The system can handle training of models with over 200 billion parameters.[5]

## DeepSpeed-Inference

DeepSpeed-Inference provides optimized inference for [transformer](/wiki/transformer)-based models. It includes custom CUDA kernels for operations like attention, layer normalization, and bias-add-residual, reducing kernel launch overhead and improving GPU utilization. It also supports multi-GPU inference with tensor parallelism and pipeline parallelism for models that do not fit in the memory of a single GPU.[11]

## DeepSpeed-FastGen: Inference Optimization

DeepSpeed-FastGen, released in late 2023 and expanded in 2024, is an inference serving framework for large language models. Its core innovation is the Dynamic SplitFuse technique, which handles variable-length prompts and generation steps more efficiently than traditional continuous batching approaches.[15]

### Dynamic SplitFuse

Dynamic SplitFuse decomposes long prompts into smaller chunks and composes short prompts together, creating uniform token budgets across iterations. This approach addresses the performance cliffs that occur in traditional systems when long prompts cause batch sizes to drop, and it provides more consistent latency compared to systems like [vLLM](/wiki/vllm).[15]
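The splitting and fusing idea can be sketched as a scheduler that fills a fixed token budget on every forward pass. The code below is a deliberately simplified illustration that handles prompt tokens only and ignores generation steps. The function name and the greedy packing order are assumptions, not the DeepSpeed-FastGen scheduler.

```python
def splitfuse_schedule(prompt_lengths, token_budget):
    """Pack prompt tokens into iterations of at most `token_budget` tokens each."""
    if token_budget < 1:
        raise ValueError("token_budget must be positive")
    remaining = list(prompt_lengths)
    iterations = []
    while any(remaining):
        budget = token_budget
        batch = []
        for prompt_id, left in enumerate(remaining):
            if left == 0 or budget == 0:
                continue
            take = min(left, budget)      # split long prompts, fuse short ones
            batch.append((prompt_id, take))
            remaining[prompt_id] -= take
            budget -= take
        iterations.append(batch)
    return iterations


schedule = splitfuse_schedule([1000, 50, 30], token_budget=256)
for step, batch in enumerate(schedule):
    print(step, batch, sum(tokens for _, tokens in batch))
```

Every iteration except the last processes exactly the token budget, which is the property that keeps per-iteration latency uniform.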

### MoE Support

In 2024, DeepSpeed-FastGen added support for [Mixture of Experts](/wiki/moe) (MoE) architectures, including the [Mixtral](/wiki/mixtral) model family. A custom MoE module with inference-optimized kernels was developed, achieving 2.4x higher throughput for the Mixtral model compared to baseline implementations at a prompt length of 1,200 tokens and 60 generation steps. Support for [Falcon](/wiki/falcon) and Phi-2 model families was also added.[15]

## Additional features

| Feature | Description |
|---|---|
| Autotuning | Automatically finds optimal DeepSpeed configuration (ZeRO stage, batch size, etc.) for a given model and hardware |
| Curriculum learning | Data efficiency library that orders training samples from easy to hard |
| Progressive layer dropping | Speeds up training by randomly skipping layers during forward/backward passes with increasing probability |
| FLOPs profiler | Measures model computational cost and identifies bottlenecks |
| Monitoring | Integration with TensorBoard, Weights & Biases, and CSV logging |
| Elastic training | Support for dynamic scaling of training jobs (adding or removing workers) |
| Fused optimizers | Custom CUDA kernels for Adam and other optimizers that fuse multiple operations into a single kernel launch |
| CPU-Adam | AVX SIMD-optimized Adam implementation for efficient CPU-side parameter updates during offloading |

## How does DeepSpeed differ from PyTorch FSDP?

[PyTorch](/wiki/pytorch) Fully Sharded Data Parallel (FSDP) is a native PyTorch framework inspired by ZeRO stage 3. Both libraries address the same fundamental problem (reducing memory redundancy in distributed training), but they differ in several ways.[12]

| Aspect | DeepSpeed (ZeRO) | PyTorch FSDP |
|---|---|---|
| Sharding stages | 3 explicit stages (1, 2, 3) | FULL_SHARD (comparable to ZeRO stage 3), SHARD_GRAD_OP (comparable to stage 2), NO_SHARD (no sharding, plain data parallelism) |
| CPU offloading | ZeRO-Offload, ZeRO-Infinity; can offload parameters and optimizer separately | CPU offloading supported; all-or-nothing (parameters, gradients, and optimizer together) |
| NVMe offloading | Yes (ZeRO-Infinity) | No |
| Communication optimization | ZeRO++ (quantized, hierarchical), 1-bit compression, custom backends | Standard PyTorch collective operations |
| Pipeline parallelism | Built-in | Requires separate implementation |
| RLHF support | DeepSpeed-Chat (end-to-end) | Manual implementation |
| Inference optimization | DeepSpeed-FastGen, DeepSpeed-Inference | Separate tools needed |
| Precision handling | Forces upcasting to FP32 for optimizer states | Allows low-precision optimizer operation (more flexible) |
| Configuration | JSON config file, mostly transparent to user | Python API; requires explicit wrapping policy for sharding decisions |
| PyTorch integration | External library, requires `deepspeed.initialize()` | Native PyTorch, integrates with `torch.compile` and PyTorch 2.x features |
| Ecosystem | Broader feature set (MoE, inference, sparse attention) | Tighter integration with PyTorch ecosystem |
| Raw throughput (FULL_SHARD equivalent) | Competitive | Sometimes faster (up to 5x reported in certain configurations) |
| Memory efficiency at extreme scale | Superior (NVMe offloading, ZeRO++) | Less optimized |

In benchmarks, FSDP's FULL_SHARD mode has shown up to 5x faster per-iteration throughput than DeepSpeed ZeRO Stage 3 in certain configurations.[18] FSDP tends to be faster and simpler for straightforward training scenarios, especially at smaller scales. DeepSpeed becomes more competitive and often preferable when CPU/NVMe offloading, communication optimization (ZeRO++), extreme memory savings, or specialized features like MoE support are required.[12] The choice often depends on the specific use case: FSDP is attractive for teams that want to stay within the native PyTorch ecosystem, while DeepSpeed offers more features for extreme-scale training scenarios.[12]

## Framework integration

DeepSpeed integrates with several popular frameworks and tools:

- **[Hugging Face](/wiki/hugging_face) Transformers**: the Trainer class provides built-in DeepSpeed support. Users supply a JSON configuration file, and the Trainer handles initialization, gradient accumulation, and checkpointing automatically. DeepSpeed can also be used without the Trainer via the `HfDeepSpeedConfig` class.[16]
- **PyTorch Lightning**: supports DeepSpeed through the Lightning Trainer's strategy parameter (for example, `strategy="deepspeed_stage_3"`).
- **Megatron-LM**: the Megatron-DeepSpeed repository combines NVIDIA's Megatron-LM with DeepSpeed for maximum-scale training.[17]
- **Hugging Face Accelerate**: provides a unified interface for switching between DeepSpeed and FSDP with minimal code changes.[12]
- **Microsoft Olive**: uses DeepSpeed for model optimization pipelines.

The Hugging Face integration allows users to enable DeepSpeed optimizations by simply passing a configuration file to the Hugging Face Trainer:[16]

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    deepspeed="ds_config.json",
    # ... other training arguments
)
```

The DeepSpeed configuration file specifies which ZeRO stage to use, whether to enable offloading, mixed precision settings, and other optimization parameters. This integration was instrumental in making DeepSpeed accessible beyond its core audience of distributed systems researchers.[16]
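
As a rough illustration of what such a file might contain, the sketch below builds a ZeRO Stage 3 configuration with CPU offloading and bf16 mixed precision as a Python dictionary and serializes it to JSON. The values are illustrative assumptions, not recommended settings. The `"auto"` placeholders are a Hugging Face convention: the Trainer fills them in from its own `TrainingArguments`.

```python
import json

# Illustrative values only; tune for your own model and hardware.
ds_config = {
    # "auto" asks the Hugging Face Trainer to derive these from TrainingArguments.
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    # Mixed precision: bf16 instead of fp16, so no loss scaling is needed.
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # shard optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
    },
}

# Serialize to the JSON text that would be saved as ds_config.json.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

Saving `config_text` to `ds_config.json` gives a file that can be passed through the `deepspeed` argument shown above.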

## Adoption and notable models

DeepSpeed has been used to train several of the largest publicly documented language models:

| Model | Organization | Parameters | Year |
|---|---|---|---|
| Megatron-Turing NLG | NVIDIA and Microsoft | 530 billion | 2021 |
| [BLOOM](/wiki/bloom) | BigScience (Hugging Face) | 176 billion | 2022 |
| GLM-130B | Tsinghua University | 130 billion | 2022 |
| YaLM-100B | Yandex | 100 billion | 2022 |
| GPT-NeoX-20B | [EleutherAI](/wiki/eleutherai) | 20 billion | 2022 |
| AlexaTM 20B | Amazon | 20 billion | 2022 |
| Turing-NLG | Microsoft | 17 billion | 2020 |

DeepSpeed is also widely used at academic and government research labs, including Oak Ridge National Lab, Carnegie Mellon University, the University of Tokyo, and Korea University.[19]

### Megatron-Turing NLG 530B

Megatron-Turing NLG (MT-NLG), a collaboration between Microsoft and NVIDIA announced in October 2021, was a 530-billion-parameter autoregressive language model and, at the time, the largest dense transformer model ever trained.[7] The training system combined three forms of parallelism: tensor parallelism from NVIDIA's Megatron-LM (for intra-node scaling), pipeline parallelism from DeepSpeed (for inter-node scaling), and data parallelism with ZeRO Stage 1 (for scaling across pipeline replicas).[7] This "3D parallelism" approach became a template for subsequent large model training efforts.
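
To make the 3D layout concrete, the sketch below maps a flat GPU rank onto a (data, pipeline, tensor) grid. It is a minimal illustration, not the Megatron-DeepSpeed implementation: the function name, the axis ordering, and the grid sizes are all assumptions chosen for readability. Tensor-parallel ranks are placed next to each other so that the most communication-heavy group would share a node.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int    # index of the model replica (data-parallel group)
    pipe: int    # index of the pipeline stage within the replica
    tensor: int  # index of the tensor-parallel shard within the stage


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> Coord:
    """Map a flat rank to grid coordinates, tensor axis varying fastest.

    The ordering is an assumption for illustration; real launchers may differ.
    """
    if not 0 <= rank < tp * pp * dp:
        raise ValueError(f"rank {rank} is outside a {dp}x{pp}x{tp} grid")
    tensor = rank % tp
    pipe = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data, pipe, tensor)


# Hypothetical sizes: 8-way tensor, 4-way pipeline, 2-way data = 64 GPUs.
TP, PP, DP = 8, 4, 2
coords = [rank_to_coord(r, TP, PP, DP) for r in range(TP * PP * DP)]

# Ranks 0-7 form one tensor-parallel group; rank 8 starts the next pipeline
# stage; rank 32 starts the second model replica.
for rank in (0, 7, 8, 32):
    print(rank, coords[rank])
```

The product of the three degrees equals the number of GPUs, so each rank gets exactly one coordinate.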

### BLOOM 176B

BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), a 176-billion-parameter multilingual model released in 2022, was trained by a collaboration of over 1,000 researchers using the Megatron-DeepSpeed framework.[9] The training combined ZeRO sharding and pipeline parallelism from DeepSpeed with tensor parallelism from Megatron-LM.[17] BLOOM was trained on 384 NVIDIA A100 80GB GPUs at the Jean Zay supercomputer in France.[17]

### Other Notable Models

Beyond these flagship training runs, DeepSpeed's ZeRO optimizer is commonly used for [fine-tuning](/wiki/fine_tuning) large models in research labs and companies. Because of the Hugging Face integration, models on the Hugging Face Hub that work with the Trainer can be fine-tuned with DeepSpeed optimizations with minimal code changes.[16]

## Architecture and Usage

DeepSpeed is designed to slot into an existing PyTorch training loop with few changes. Users wrap their model and optimizer with DeepSpeed's initialization function, which returns an engine that handles the forward pass, backward pass, and optimizer step:[11]

```python
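import deepspeed

# `model`, `optimizer`, `ds_config`, and `dataloader` are assumed to be defined elsewhere.
# initialize() returns (engine, optimizer, dataloader, lr_scheduler).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

for batch in dataloader:
    loss = model_engine(batch)   # forward pass; the model is assumed to return the loss
    model_engine.backward(loss)  # backward pass, including loss scaling if enabled
    model_engine.step()          # optimizer step (and LR scheduler step if configured)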
```

The `ds_config` dictionary (or JSON file) controls all optimization settings, including ZeRO stage, offloading, mixed precision, gradient accumulation, and learning rate scheduling. This configuration-driven approach allows users to experiment with different optimization strategies without changing their training code.[11]

## Configuration example

DeepSpeed is configured through a JSON file that specifies training options. A typical configuration for ZeRO stage 2 with mixed-precision training looks like this:[11]

```json
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "overlap_comm": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8
    }
  }
}
```

Training is launched with the `deepspeed` command:

```bash
deepspeed --num_gpus=8 train.py --deepspeed ds_config.json
```
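
The configuration and the launch command are linked by a simple identity that DeepSpeed checks at startup: `train_batch_size` must equal the per-GPU micro-batch size times `gradient_accumulation_steps` times the number of data-parallel GPUs. The sketch below solves that identity for the micro-batch size; the helper name is hypothetical and is not part of DeepSpeed.

```python
def micro_batch_per_gpu(train_batch_size: int, grad_accum_steps: int, world_size: int) -> int:
    """Solve train_batch_size = micro_batch * grad_accum_steps * world_size.

    `world_size` is the number of data-parallel GPUs (hypothetical helper).
    """
    denom = grad_accum_steps * world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"grad_accum_steps * world_size = {denom}"
        )
    return train_batch_size // denom


# Values from the example above: batch size 32, no accumulation, 8 GPUs.
print(micro_batch_per_gpu(32, 1, 8))  # 4 samples per GPU per step
```

With the example configuration, each of the 8 GPUs processes 4 samples per step. Changing the GPU count without adjusting the batch size or accumulation steps can make the identity unsatisfiable.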

## Recent Developments (2024-2026)

DeepSpeed continues to evolve with new capabilities addressing emerging needs in the AI community.

### Universal Checkpointing (2024)

Universal Checkpointing provides efficient and flexible checkpointing for large-scale distributed training. It enables saving and loading checkpoints across different parallelism configurations (for example, saving a checkpoint with 3D parallelism on 256 GPUs and resuming on 128 GPUs with a different parallelism layout), which simplifies cluster management and fault recovery.[19]
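
The core idea of resuming under a different layout can be sketched with a toy example: shards saved by one number of ranks are merged into a layout-independent form and then re-partitioned for a new number of ranks. This is a conceptual illustration only. It does not use DeepSpeed's checkpoint format or API, all function names are hypothetical, and it covers a single sharding axis rather than full 3D parallelism.

```python
from typing import List


def partition(flat: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter list into contiguous, near-equal shards."""
    base, extra = divmod(len(flat), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)  # early ranks absorb the remainder
        shards.append(flat[start:start + size])
        start += size
    return shards


def consolidate(shards: List[List[float]]) -> List[float]:
    """Merge per-rank shards back into one layout-independent list."""
    return [value for shard in shards for value in shard]


def reshard(shards: List[List[float]], new_world_size: int) -> List[List[float]]:
    """Convert a checkpoint saved by N ranks into one loadable by M ranks."""
    return partition(consolidate(shards), new_world_size)


params = [float(i) for i in range(10)]
saved = partition(params, 4)  # checkpoint written by 4 ranks
resumed = reshard(saved, 2)   # reloaded by 2 ranks

print([len(s) for s in saved])    # [3, 3, 2, 2]
print([len(s) for s in resumed])  # [5, 5]
```

No parameter values are lost or reordered by the round trip, which is the property a layout-independent checkpoint needs.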

### Arctic Long Sequence Training (ALST, 2025)

ALST enables scalable and efficient training with multi-million token sequence lengths. As context windows for [large language models](/wiki/large_language_model) have grown from thousands to millions of tokens, the memory and computation requirements for processing long sequences have become a major bottleneck. ALST addresses this through specialized memory management and communication strategies optimized for very long sequences.[19]

### ZenFlow (2025)

ZenFlow enables stall-free offloading training via asynchronous updates. Traditional offloading approaches (like ZeRO-Offload) can introduce stalls when the CPU is not fast enough to complete the optimizer step before the GPU needs the updated parameters. ZenFlow eliminates these stalls by overlapping computation and communication more aggressively through an asynchronous update mechanism.[19]
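
A toy timing model shows why overlap matters. It is not ZenFlow's actual algorithm, and the step times are made-up assumptions. In the synchronous case the GPU waits for every CPU optimizer step; in the overlapped case the CPU update for one step runs while the GPU computes the next, so the steady-state cost per step is the slower of the two.

```python
def synchronous_time(n_steps: int, gpu_s: float, cpu_s: float) -> float:
    """GPU stalls during every CPU optimizer step, so the costs add up."""
    return n_steps * (gpu_s + cpu_s)


def overlapped_time(n_steps: int, gpu_s: float, cpu_s: float) -> float:
    """CPU update of step i overlaps GPU compute of step i+1 (toy pipeline model)."""
    if n_steps == 0:
        return 0.0
    # First GPU pass, then n-1 overlapped stages, then the final CPU update drains.
    return gpu_s + (n_steps - 1) * max(gpu_s, cpu_s) + cpu_s


# Hypothetical timings: 1.0 s of GPU work and 0.5 s of CPU optimizer work per step.
sync = synchronous_time(100, 1.0, 0.5)
overlap = overlapped_time(100, 1.0, 0.5)
print(sync, overlap)  # 150.0 100.5
```

The model ignores data transfer time and the fact that overlapped updates reach the GPU later than synchronous ones, which any real asynchronous scheme has to account for.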

### SuperOffload (2026)

SuperOffload, announced in 2026, targets large-scale LLM training on superchips (systems like NVIDIA Grace Hopper that combine CPU and GPU on a single module with high-bandwidth unified memory). It aims to exploit the memory architecture of these systems for more efficient offloading.[19]

### DeepSpeed Version History

| Version | Date | Key Features |
|---|---|---|
| 0.3.x | Feb 2020 | Initial release, ZeRO Stage 1 and 2 |
| 0.4.x | 2021 | ZeRO Stage 3, ZeRO-Infinity |
| 0.7.x | 2022 | DeepSpeed-MoE, performance improvements |
| 0.9.x-0.10.x | 2023 | DeepSpeed-Chat, ZeRO++, DeepSpeed-FastGen |
| 0.14.x-0.15.x | 2024 | Universal Checkpointing, MoE inference, Mixtral support |
| 0.16.x-0.18.x | 2025 | ALST, ZenFlow, expanded hardware support |

## Timeline

| Date | Event |
|---|---|
| October 2019 | ZeRO paper submitted to arXiv |
| February 2020 | DeepSpeed open-sourced by Microsoft |
| September 2020 | 1-bit Adam released |
| November 2020 | ZeRO paper presented at SC20 |
| January 2021 | ZeRO-Offload paper published |
| March 2021 | ZeRO stage 3 with offloading released |
| April 2021 | ZeRO-Infinity paper published |
| October 2021 | Megatron-Turing NLG 530B announced |
| January 2022 | DeepSpeed-MoE paper released |
| April 2023 | DeepSpeed-Chat released for RLHF training |
| 2023 | ZeRO++ released |
| Late 2023 | DeepSpeed-FastGen released |
| 2024 | Universal Checkpointing, Mixtral MoE inference support |
| August 2024 | Native Windows support for single-GPU training |
| February 2025 | DeepSpeed contributed to Linux Foundation AI & Data |
| 2025 | ZenFlow released |
| June 2025 | Arctic Long Sequence Training (ALST) introduced |
| 2026 | SuperOffload announced |

## Is DeepSpeed open source?

Yes. DeepSpeed has been open source since its first release in February 2020 and is distributed under the MIT License.[8] On February 3, 2025, the project was contributed to the Linux Foundation AI & Data as an incubation project, a step the foundation said "strengthens the foundation's mission to foster innovation in AI and data technologies" and gives DeepSpeed "access to a thriving ecosystem of open-source projects, a global network of contributors, and robust technical and operational resources."[13] As part of this transition, the project's GitHub repository moved from `microsoft/DeepSpeed` to `deepspeedai/DeepSpeed`, under the community-governed `deepspeedai` organization.[19]

## Current State (2025-2026)

As of early 2026, DeepSpeed remains a widely used tool in the large-scale model training ecosystem. The library is at version 0.18.x and is actively maintained by the DeepSpeed team in the `deepspeedai/DeepSpeed` repository (formerly `microsoft/DeepSpeed`).[19]

Several trends define DeepSpeed's current trajectory:

**Complementary role with FSDP.** Rather than competing head to head, DeepSpeed and PyTorch FSDP increasingly serve complementary roles. FSDP is the default choice for straightforward distributed training within the PyTorch ecosystem, while DeepSpeed is preferred for scenarios requiring advanced offloading, extreme memory optimization, or features like DeepSpeed-Chat.[12]

**Focus on LLM workflows.** DeepSpeed's recent features (ALST, ZenFlow, DeepSpeed-Chat, DeepSpeed-FastGen) reflect its focus on the full lifecycle of [large language model](/wiki/large_language_model) development, from pre-training through alignment to inference serving.[19]

**Hardware adaptation.** With the emergence of new hardware architectures like NVIDIA Grace Hopper superchips and AMD MI300X GPUs, DeepSpeed is adapting its offloading and communication strategies. SuperOffload specifically targets the unified memory architecture of superchips.[19]

**Continued Hugging Face integration.** The tight integration with [Hugging Face](/wiki/hugging_face) Transformers and Accelerate ensures that DeepSpeed's optimizations remain accessible to the broad ML community, not just distributed systems specialists.[16]

DeepSpeed has substantially lowered the barrier to large-scale model training. By reducing memory requirements through ZeRO and providing turnkey tooling for RLHF training and inference, it has made capabilities that were once limited to organizations with large engineering teams and hardware budgets far more widely available.

## References

1. Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20)*. [arXiv:1910.02054](https://arxiv.org/abs/1910.02054)

2. Ren, J., Rajbhandari, S., Aminabadi, R. Y., Ruwase, O., Yang, S., Zhang, M., Li, D., & He, Y. (2021). "ZeRO-Offload: Democratizing Billion-Scale Model Training." *USENIX Annual Technical Conference (ATC)*. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840)

3. Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., & He, Y. (2021). "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning." *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21)*. [arXiv:2104.07857](https://arxiv.org/abs/2104.07857)

4. Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., & He, Y. (2022). "DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale." *Proceedings of the 39th International Conference on Machine Learning (ICML 2022)*. [arXiv:2201.05596](https://arxiv.org/abs/2201.05596)

5. Yao, Z., Aminabadi, R. Y., Rajbhandari, S., Ruwase, O., Wu, X., Awan, A. A., Rasley, J., Zhang, M., Li, C., & He, Y. (2023). "DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales." [arXiv:2308.01320](https://arxiv.org/abs/2308.01320)

6. Tang, H., Gan, S., Awan, A. A., Rajbhandari, S., Li, C., Lian, X., Liu, J., Zhang, C., & He, Y. (2021). "1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed." *Proceedings of the 38th International Conference on Machine Learning (ICML 2021)*. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888)

7. Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., Zhang, E., Child, R., Aminabadi, R. Y., Bernauer, J., Song, X., Shoeybi, M., He, Y., Houston, M., Tiwary, S., & Catanzaro, B. (2022). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model." [arXiv:2201.11990](https://arxiv.org/abs/2201.11990)

8. Rasley, J., Rajbhandari, S., Ruwase, O., & He, Y. (2020). "DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters." *Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*. https://dl.acm.org/doi/10.1145/3394486.3406703

9. Scao, T. L., et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." [arXiv:2211.05100](https://arxiv.org/abs/2211.05100)

10. Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., & Weinbach, S. (2022). "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." [arXiv:2204.06745](https://arxiv.org/abs/2204.06745)

11. DeepSpeed documentation. "Training Overview and Features." https://www.deepspeed.ai/training/

12. Hugging Face. "FSDP vs DeepSpeed." https://huggingface.co/docs/accelerate/en/concept_guides/fsdp_and_deepspeed

13. Linux Foundation AI & Data. (2025). "LF AI & Data Welcomes DeepSpeed: Advancing Deep Learning Optimization." https://lfaidata.foundation/blog/2025/02/03/lf-ai-data-welcomes-deepspeed-advancing-deep-learning-optimization/

14. Microsoft Research. (2023). "DeepSpeed ZeRO++: A leap in speed for LLM and chat model training with 4X less communication." https://www.microsoft.com/en-us/research/blog/deepspeed-zero-a-leap-in-speed-for-llm-and-chat-model-training-with-4x-less-communication/

15. Holmes, C., et al. (2024). "DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference." [arXiv:2401.08671](https://arxiv.org/abs/2401.08671)

16. Hugging Face documentation. "DeepSpeed Integration." https://huggingface.co/docs/transformers/en/deepspeed

17. BigScience Workshop. (2022). "The Technology Behind BLOOM Training." https://huggingface.co/blog/bloom-megatron-deepspeed

18. Kienzler, R. (2024). "FSDP vs DeepSpeed." https://romeokienzler.medium.com/fsdp-vs-deepspeed-9df47ee5ccbb

19. DeepSpeed. "Latest News." https://www.deepspeed.ai/

20. DeepSpeed Tutorials. "ZeRO-Offload." https://www.deepspeed.ai/tutorials/zero-offload/
