DeepSpeed is an open-source deep learning optimization library developed by Microsoft that makes distributed training and inference of large models efficient, easy to use, and cost-effective. Built on top of PyTorch, DeepSpeed provides system-level optimizations that allow researchers and engineers to train models with billions or even trillions of parameters on commodity GPU clusters. Its core innovation, the Zero Redundancy Optimizer (ZeRO), fundamentally rethinks how model parallelism and data parallelism interact to reduce memory consumption without sacrificing computational throughput.
DeepSpeed was first released by Microsoft Research in February 2020 and has since become one of the most widely adopted libraries for large-scale model training. It has been used to train landmark models including BLOOM (176 billion parameters) and Megatron-Turing NLG (530 billion parameters). In February 2025, DeepSpeed was contributed to the Linux Foundation AI & Data as an incubation project, marking its transition to community-driven governance.
Imagine you have a really big jigsaw puzzle, so big that it does not fit on your table. Normally, you would need a huge table (a really expensive computer) to put the whole puzzle together. DeepSpeed is like a clever way of splitting the puzzle pieces across several smaller tables (regular computers). Each table only holds the pieces it needs right now, and when it needs a piece from another table, it just asks for it quickly. This way, you can solve even the biggest puzzle in the world using a bunch of regular-sized tables working together.
DeepSpeed emerged from Microsoft's "AI at Scale" initiative, which aimed to develop the infrastructure needed to train the largest AI models. The project was led by researchers at Microsoft Research, with Samyam Rajbhandari as a primary architect. The first public release came in February 2020, coinciding with the publication of the foundational ZeRO paper.
The motivation for DeepSpeed was straightforward: as model sizes grew from millions to billions of parameters, standard data-parallel training became insufficient. A model with billions of parameters cannot fit in the memory of a single GPU, and existing approaches to model parallelism (tensor parallelism and pipeline parallelism) required significant code modifications and were difficult to use. DeepSpeed's goal was to enable training of arbitrarily large models with minimal code changes.
The foundational research behind DeepSpeed is the ZeRO (Zero Redundancy Optimizer) paper by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He, first published as an arXiv preprint in October 2019 and later presented at SC20 (the International Conference for High Performance Computing). The paper identified a critical inefficiency in standard data-parallel training: every GPU maintains a complete copy of the model states (parameters, gradients, and optimizer states), resulting in massive memory redundancy.
For a model with P parameters trained with the Adam optimizer in mixed precision, the memory required per GPU for model states alone is approximately 2P + 2P + (4P + 4P + 4P) = 16P bytes. For a 7.5-billion parameter model, this amounts to roughly 120 GB per GPU, far exceeding the memory of any single GPU available at the time. The ZeRO paper proposed partitioning these redundant states across data-parallel processes, dramatically reducing per-GPU memory consumption without sacrificing computational efficiency.
Training large language models and other massive neural networks requires storing three categories of model state in GPU memory during training: the parameters themselves, the gradients computed during the backward pass, and the optimizer states (for Adam, the momentum and variance estimates plus a full-precision copy of the parameters).
For mixed-precision training with the Adam optimizer, the memory required per parameter breaks down as follows:
| Component | Precision | Bytes per parameter |
|---|---|---|
| Parameters | FP16 | 2 |
| Gradients | FP16 | 2 |
| Parameters (optimizer copy) | FP32 | 4 |
| Momentum (Adam) | FP32 | 4 |
| Variance (Adam) | FP32 | 4 |
| Total | | 16 |
For a model with 1.5 billion parameters (such as GPT-2), this amounts to 24 GB of memory for model states alone, exceeding the capacity of most individual GPUs before even accounting for activations and temporary buffers.
Standard data parallelism replicates all model states across every GPU, which wastes memory. Model parallelism splits the model across GPUs but introduces communication overhead and often requires significant code changes. DeepSpeed was created to address this fundamental tension between memory efficiency and computational efficiency.
ZeRO is the core technology behind DeepSpeed. Introduced by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He in their 2020 paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models," ZeRO eliminates memory redundancy in data-parallel training by partitioning model states across data-parallel processes instead of replicating them.
ZeRO operates in three progressive stages, each partitioning an additional component of the model states.
In standard data-parallel training, each GPU maintains the full optimizer state. For Adam, this includes the first-moment estimate (momentum) and second-moment estimate (variance), each requiring 4 bytes per parameter in FP32, plus the FP32 master copy of the weights (4 bytes per parameter). That totals 12 bytes per parameter just for optimizer states.
In stage 1 (also known as ZeRO-OS), the optimizer states (for Adam, this includes the FP32 master copy of parameters, momentum, and variance) are partitioned across all data-parallel processes. Each process stores and updates only its assigned partition of the optimizer states, while still maintaining the full FP16 parameters and gradients. After the backward pass, gradients are reduced normally, but each GPU updates only its partition of the optimizer states and parameters. The updated parameters are then broadcast (via all-gather) to all GPUs.
Stage 1 reduces optimizer state memory by a factor of N (where N is the data-parallel degree) while maintaining the same communication volume as standard data parallelism. For a typical Adam setup, this provides up to a 4x memory reduction.
Stage 2 adds gradient partitioning on top of stage 1. After the backward pass, instead of performing an all-reduce on the full gradient tensor, each process reduces and retains only the gradients corresponding to its partition of optimizer states. This is implemented using a reduce-scatter operation, which is communication-equivalent to the all-reduce used in standard data parallelism but results in each GPU holding only 1/N of the gradients. Once a gradient is reduced and used for the parameter update, it is discarded, freeing the memory. This provides up to an 8x memory reduction with the same communication volume as standard data parallelism.
Stage 3 takes the partitioning to its logical conclusion by distributing the FP16 model parameters themselves across data-parallel processes. Each process stores only a shard of the full parameter set (1/N of the parameters). During forward and backward passes, ZeRO-3 dynamically gathers the parameters needed for each layer via all-gather operations, uses them for computation, then discards non-local parameters after use.
Stage 3 achieves memory reduction that scales linearly with the number of data-parallel processes: on 64 GPUs, per-GPU memory for model states drops by a factor of 64. The trade-off is a 50% (1.5x) increase in communication volume compared to standard data parallelism: the parameters must be all-gathered once for the forward pass and again for the backward pass, and the gradients reduce-scattered, giving a total volume of roughly 3M per GPU versus the 2M of a standard all-reduce (where M is the model size). However, this communication can be overlapped with computation, and in practice the throughput impact is often modest.
| Feature | Stage 1 (ZeRO-OS) | Stage 2 | Stage 3 |
|---|---|---|---|
| Optimizer states partitioned | Yes | Yes | Yes |
| Gradients partitioned | No | Yes | Yes |
| Parameters partitioned | No | No | Yes |
| Memory reduction (vs. baseline) | Up to 4x | Up to 8x | Linear with N GPUs |
| Communication overhead vs. data parallelism | Same | Same | 1.5x |
| Code changes required | None | None | None |
To illustrate the memory impact, consider a 7.5-billion parameter model trained with Adam in mixed precision on 64 GPUs:
| Configuration | Memory per GPU |
|---|---|
| Standard data parallelism | ~120 GB |
| ZeRO Stage 1 | ~31.4 GB |
| ZeRO Stage 2 | ~16.6 GB |
| ZeRO Stage 3 | ~1.9 GB |
These reductions make it possible to train models with billions of parameters on clusters of GPUs that individually have only 16 to 80 GB of memory.
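These figures follow directly from the partitioning arithmetic: 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for FP32 optimizer states, with each partitioned component divided by the data-parallel degree N. The short sketch below reproduces the table's numbers; the function is purely illustrative and not part of the DeepSpeed API.

```python
def zero_model_state_memory_gb(params_billion, n_gpus, stage):
    """Approximate per-GPU memory (GB) for model states under ZeRO.

    Assumes mixed-precision Adam: 2 bytes/param for FP16 weights,
    2 bytes/param for FP16 gradients, 12 bytes/param for optimizer states.
    With the parameter count given in billions, bytes per parameter
    translate directly into gigabytes.
    """
    weights = 2 * params_billion
    grads = 2 * params_billion
    optim = 12 * params_billion
    if stage == 0:    # standard data parallelism: everything replicated
        return weights + grads + optim
    if stage == 1:    # optimizer states partitioned
        return weights + grads + optim / n_gpus
    if stage == 2:    # gradients also partitioned
        return weights + (grads + optim) / n_gpus
    if stage == 3:    # parameters also partitioned
        return (weights + grads + optim) / n_gpus
    raise ValueError("stage must be 0, 1, 2, or 3")

for stage in range(4):
    print(stage, round(zero_model_state_memory_gb(7.5, 64, stage), 1))
# Prints approximately: 0 120.0, 1 31.4, 2 16.6, 3 1.9
```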
ZeRO-Offload, introduced by Jie Ren et al. in 2021, extends ZeRO stage 2 by offloading optimizer states and gradient computation to CPU memory and CPU compute. This allows training of models with up to 13 billion parameters on a single NVIDIA V100 GPU (32 GB), a 10x improvement over standard PyTorch, by leveraging the much larger capacity of system RAM (typically 256 GB or more) compared to GPU memory (typically 16 to 80 GB).
The key insight behind ZeRO-Offload is a careful partitioning strategy between CPU and GPU: gradients, optimizer states, and the optimizer computation step are offloaded to the CPU, while parameters and the forward/backward computation remain on the GPU. This minimizes data movement across the PCIe bus while maximizing memory savings. A CPU Adam optimizer implementation is provided that is highly optimized with SIMD instructions.
ZeRO-Offload works symbiotically with ZeRO stage 2 and scales to multiple GPUs when available. On a single V100 GPU, it achieves 40 TFLOPS for a 10-billion-parameter model, compared to 30 TFLOPS for a 1.4-billion-parameter model using PyTorch alone. ZeRO-Offload achieves near-GPU-only training throughput for large models because the optimizer step computation on the CPU can be overlapped with the next forward pass on the GPU. For smaller models where GPU utilization is already high, the overhead of CPU-GPU data transfers becomes more noticeable. It also supports near-linear scaling on up to 128 GPUs.
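In practice, ZeRO-Offload is enabled through the configuration file rather than code changes. A minimal sketch of the relevant zero_optimization block is shown below; option names should be checked against the documentation for the installed DeepSpeed version.

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```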
ZeRO-Infinity, published by Rajbhandari et al. in 2021, extends the offloading concept further by leveraging not only CPU memory but also NVMe (SSD) storage. Built on ZeRO stage 3, ZeRO-Infinity can offload all model states (parameters, gradients, and optimizer states) to CPU memory and NVMe storage, enabling training of models with tens or even hundreds of trillions of parameters on current-generation GPU clusters.
ZeRO-Infinity introduces several innovations, including an offload engine that coordinates data movement across GPU, CPU, and NVMe memory; memory-centric tiling, which breaks large individual operators into smaller tiles so that even a single very large layer can be processed within GPU memory; bandwidth-centric partitioning, which parallelizes transfers across all available devices; and an overlap-centric design that hides data movement behind computation.
On 512 NVIDIA V100 GPUs, ZeRO-Infinity sustains over 25 petaFLOPS (40% of peak) and demonstrates superlinear scaling. It can also fine-tune trillion-parameter models on a single DGX-2 node, making such models accessible to smaller research labs.
| Feature | ZeRO-Offload | ZeRO-Infinity |
|---|---|---|
| Built on | ZeRO Stage 2 | ZeRO Stage 3 |
| Offload target | CPU memory | CPU memory + NVMe storage |
| Offloaded states | Optimizer states + gradients | All model states (params, grads, optimizer) |
| Max model scale | ~13B params on single GPU | Trillions of parameters |
| Bandwidth optimization | CPU-GPU overlap | Dynamic prefetching across NVMe, CPU, and GPU |
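ZeRO-Infinity is configured similarly, building on ZeRO stage 3. The sketch below offloads parameters and optimizer states to NVMe; the nvme_path value is a placeholder, and additional tuning options (such as asynchronous I/O settings) are omitted and should be taken from the documentation for the installed version.

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```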
ZeRO++, released in 2023, addresses the communication overhead that becomes a bottleneck in ZeRO Stage 3, particularly when training across nodes connected by relatively slow network links. ZeRO++ introduces three complementary communication optimization techniques that together reduce communication volume by up to 4x.
qwZ applies block-based quantization to reduce the communication volume of the parameter all-gather from FP16 to INT8, halving the data transferred. Block-based quantization conducts independent quantization on subsets of model parameters, achieving 3x better accuracy and 5x faster execution compared to naive quantization through highly optimized CUDA kernels.
hpZ eliminates cross-node all-gather communication during the backward pass through a hierarchical data remapping strategy. Instead of collecting parameters from all GPUs across all nodes, hpZ maintains a full copy of the parameters within each node (distributed across the node's GPUs) and only performs intra-node all-gather operations. This reduces cross-node communication volume from M/Z per GPU to M/(Z*N), where M is the model size, Z is the total number of GPUs, and N is the number of GPUs per node.
qgZ replaces the standard gradient all-reduce with a communication-efficient all-to-all based quantized gradient averaging scheme, further reducing communication volume during the backward pass.
| Component | Technique | Communication Reduction |
|---|---|---|
| qwZ | Block-based weight quantization (FP16 to INT8) | 2x for parameter all-gather |
| hpZ | Hierarchical weight partitioning | Eliminates cross-node backward all-gather |
| qgZ | Quantized gradient averaging | Reduces gradient communication volume |
| Combined | All three together | Up to 4x total reduction |
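In recent DeepSpeed releases the ZeRO++ techniques are exposed as flags within the zero_optimization section. The sketch below follows the option names used in the ZeRO++ tutorial (the hpZ partition size is typically set to the number of GPUs per node); the exact names should be verified against the installed version.

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_quantized_weights": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_gradients": true
  }
}
```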
DeepSpeed supports combining three parallelism strategies simultaneously, a technique referred to as 3D parallelism: ZeRO-powered data parallelism, pipeline parallelism, and tensor (intra-layer model) parallelism.
3D parallelism simultaneously addresses both memory efficiency and computational efficiency, enabling DeepSpeed to train models with over one trillion parameters. The Megatron-Turing NLG 530B model, a collaboration between NVIDIA and Microsoft, was trained using this approach.
DeepSpeed's pipeline parallelism implementation divides each training batch into micro-batches that flow through the pipeline stages concurrently, with gradients accumulated across micro-batches before each optimizer step. This reduces communication volume by 2 to 7x compared to standard approaches, making it particularly effective on clusters with limited network bandwidth.
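As a rough illustration, a sequential model can be expressed as a DeepSpeed pipeline by listing its layers and choosing a number of stages. The sketch below uses placeholder layer sizes, an arbitrary stage count, and a placeholder configuration file name, and omits the distributed launcher setup.

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# Express the model as a flat list of layers so DeepSpeed can partition it
# into pipeline stages (4 stages here, chosen arbitrarily for illustration).
layers = [nn.Linear(1024, 4096), nn.ReLU(),
          nn.Linear(4096, 1024), nn.ReLU(),
          nn.Linear(1024, 10)]
net = PipelineModule(layers=layers, num_stages=4,
                     loss_fn=nn.CrossEntropyLoss())

engine, _, _, _ = deepspeed.initialize(model=net,
                                        model_parameters=net.parameters(),
                                        config="ds_config.json")
# engine.train_batch(data_iter=...) then pulls micro-batches from an
# iterator and schedules forward/backward passes across the stages.
```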
DeepSpeed supports mixed-precision training using FP16 and BF16 (bfloat16) formats. During mixed-precision training, the forward and backward passes are computed in half precision (16-bit), while the optimizer maintains a full-precision (FP32) master copy of the parameters for numerical stability.
DeepSpeed handles loss scaling automatically to prevent gradient underflow in FP16 training. The library supports both dynamic loss scaling (which adjusts the scale factor based on whether overflow is detected) and static loss scaling.
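Both formats are selected through the configuration file. For example, a minimal sketch enabling BF16, which does not require loss scaling because it retains FP32's exponent range:

```json
{
  "bf16": {
    "enabled": true
  }
}
```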
DeepSpeed provides an activation checkpointing API that reduces activation memory at the cost of recomputing activations during the backward pass. Its implementation includes several optimizations beyond the standard approach, such as partitioning checkpointed activations across model-parallel GPUs, optionally offloading them to CPU memory, and using contiguous memory buffers to reduce fragmentation.
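These behaviors are controlled through the activation_checkpointing section of the configuration file. The sketch below shows commonly documented options; names and defaults should be verified against the installed version.

```json
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "synchronize_checkpoint_boundary": false
  }
}
```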
DeepSpeed includes several techniques for reducing the communication overhead of distributed training.
1-bit Adam compresses gradient communication by quantizing the momentum term to a single bit per element. This reduces communication volume by up to 5x while maintaining the same convergence speed as uncompressed Adam. The approach works in two phases: a warmup phase using standard Adam (typically 15 to 20% of total training steps), followed by a compression phase where the variance term is frozen and the momentum is compressed.
Experiments on up to 256 GPUs show that 1-bit Adam achieves up to 3.3x higher throughput for BERT pre-training and up to 2.9x higher throughput for SQuAD fine-tuning.
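1-bit Adam is selected through the optimizer section of the configuration file. The sketch below follows the option names from the DeepSpeed 1-bit Adam tutorial; the learning rate and freeze_step value (the length of the warmup phase in steps) are placeholders.

```json
{
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 2e-4,
      "freeze_step": 23000,
      "cuda_aware": false,
      "comm_backend_name": "nccl"
    }
  }
}
```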
0/1 Adam extends 1-bit Adam by further reducing communication. It uses an adaptive compression strategy that sends zero bits when a gradient partition has not changed significantly, and one bit when it has. This can achieve up to 26x communication volume savings in favorable conditions.
DeepSpeed overlaps gradient reduction operations with backward pass computation. As gradients become available during the backward pass, they are immediately reduced (averaged) across processes rather than waiting for all gradients to be computed first.
DeepSpeed provides sparse attention kernels that support input sequences an order of magnitude longer than standard dense attention mechanisms. These kernels achieve up to 6x faster execution than dense attention with comparable accuracy and 1.5 to 3x faster execution than other sparse implementations.
The library supports several sparse attention patterns, including Fixed (from OpenAI's Sparse Transformer), BigBird (from Google), and BSLongformer (a block-sparse implementation of Longformer). DeepSpeed also provides a template for defining custom sparse attention patterns.
DeepSpeed-MoE, presented at ICML 2022 by Rajbhandari et al., provides system support for training and running inference on Mixture of Experts (MoE) models. MoE models achieve quality comparable to dense models while requiring significantly less training compute, as only a subset of "expert" sub-networks is activated for each input.
DeepSpeed-MoE includes flexible combinations of expert, data, and model parallelism for training large MoE models, the Pyramid-Residual MoE (PR-MoE) architecture that reduces model size with little loss in quality, and highly optimized parallelism strategies and kernels for MoE inference.
DeepSpeed-MoE has been used to train large MoE models efficiently, enabling researchers to explore the MoE paradigm at scale without building custom distributed training infrastructure.
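At the API level, DeepSpeed exposes an MoE layer that wraps an ordinary expert sub-network. The sketch below is illustrative only; the hidden size, expert definition, and expert count are placeholders.

```python
import torch.nn as nn
from deepspeed.moe.layer import MoE

hidden = 1024  # placeholder hidden size

# The expert is an ordinary sub-network; DeepSpeed instantiates num_experts
# copies and routes each token to its top-k experts (k=1 here).
expert = nn.Sequential(nn.Linear(hidden, 4 * hidden),
                       nn.GELU(),
                       nn.Linear(4 * hidden, hidden))

moe_layer = MoE(hidden_size=hidden, expert=expert, num_experts=8, k=1)
# The forward pass returns the layer output along with auxiliary values
# such as the load-balancing loss used during training.
```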
DeepSpeed-Chat, introduced in April 2023, is an end-to-end system for training ChatGPT-style models using Reinforcement Learning from Human Feedback (RLHF), the technique used to align large language models with human preferences. The system addresses the significant engineering complexity of RLHF training, which requires coordinating multiple models (actor, critic, reward model, and reference model) simultaneously.
DeepSpeed-Chat implements the three-step InstructGPT training pipeline: (1) supervised fine-tuning of a pre-trained language model on demonstration data, (2) reward model fine-tuning on human preference comparisons, and (3) RLHF training with Proximal Policy Optimization (PPO) guided by the reward model.
A single script can take a pre-trained Hugging Face model and run it through all three stages.
The system introduces the DeepSpeed Hybrid Engine (DeepSpeed-HE), which unifies training and inference optimizations into a single engine. During RLHF, the model alternates between generating responses (inference) and updating parameters (training). DeepSpeed-HE applies inference optimizations (such as kernel fusion, tensor parallelism, and KV caching) during the generation phase and training optimizations (such as ZeRO and gradient checkpointing) during the update phase. This hybrid approach yields over 15x speedup compared to existing RLHF systems.
DeepSpeed-Chat enabled training an OPT-13B model via RLHF in 9 hours and an OPT-30B model in 18 hours on Azure Cloud, at costs under $300 and $600 respectively. Compared to other systems like Colossal-AI and Hugging Face DDP, DeepSpeed-Chat achieved up to 19x higher throughput for RLHF training, and 10x faster performance on a single GPU. The system can handle training of models with over 200 billion parameters.
DeepSpeed-Inference provides optimized inference for transformer-based models. It includes custom CUDA kernels for operations like attention, layer normalization, and bias-add-residual, reducing kernel launch overhead and improving GPU utilization. It also supports multi-GPU inference with tensor parallelism and pipeline parallelism for models that do not fit in the memory of a single GPU.
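The entry point is deepspeed.init_inference, which replaces supported modules with the optimized kernels and sets up tensor-parallel execution. A minimal sketch is shown below; the model choice and parallelism degree are placeholders, and argument names have changed across releases, so the installed version's documentation applies.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Half-precision inference with kernel injection, split across 2 GPUs.
engine = deepspeed.init_inference(model,
                                  mp_size=2,
                                  dtype=torch.half,
                                  replace_with_kernel_inject=True)
# The returned engine is used like the original model for generation.
```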
DeepSpeed-FastGen, released in late 2023 and expanded in 2024, is an inference serving framework for large language models. Its core innovation is the Dynamic SplitFuse technique, which handles variable-length prompts and generation steps more efficiently than traditional continuous batching approaches.
Dynamic SplitFuse decomposes long prompts into smaller chunks and composes short prompts together, creating uniform token budgets across iterations. This approach addresses the performance cliffs that occur in traditional systems when long prompts cause batch sizes to drop, and it provides more consistent latency compared to systems like vLLM.
In 2024, DeepSpeed-FastGen added support for Mixture of Experts (MoE) architectures, including the Mixtral model family. A custom MoE module with inference-optimized kernels was developed, achieving 2.4x higher throughput for the Mixtral model compared to baseline implementations at a prompt length of 1,200 tokens and 60 generation steps. Support for Falcon and Phi-2 model families was also added.
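DeepSpeed-FastGen is surfaced through the DeepSpeed-MII package, which offers both a persistent serving deployment and a lightweight non-persistent pipeline. A minimal sketch of the latter is shown below; the model name is a placeholder.

```python
import mii

# Load a model behind the FastGen engine; Dynamic SplitFuse scheduling of
# prompt chunks and generated tokens is handled internally.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")  # placeholder model name

responses = pipe(["DeepSpeed is"], max_new_tokens=64)
print(responses)
```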
| Feature | Description |
|---|---|
| Autotuning | Automatically finds optimal DeepSpeed configuration (ZeRO stage, batch size, etc.) for a given model and hardware |
| Curriculum learning | Data efficiency library that orders training samples from easy to hard |
| Progressive layer dropping | Compresses training by randomly skipping layers during forward/backward passes with increasing probability |
| FLOPs profiler | Measures model computational cost and identifies bottlenecks |
| Monitoring | Integration with TensorBoard, Weights & Biases, and CSV logging |
| Elastic training | Support for dynamic scaling of training jobs (adding or removing workers) |
| Fused optimizers | Custom CUDA kernels for Adam and other optimizers that fuse multiple operations into a single kernel launch |
| CPU-Adam | AVX SIMD-optimized Adam implementation for efficient CPU-side parameter updates during offloading |
PyTorch Fully Sharded Data Parallel (FSDP) is a native PyTorch framework inspired by ZeRO stage 3. Both libraries address the same fundamental problem (reducing memory redundancy in distributed training), but they differ in several ways.
| Aspect | DeepSpeed (ZeRO) | PyTorch FSDP |
|---|---|---|
| Sharding stages | 3 explicit stages (1, 2, 3) | FULL_SHARD (≈ ZeRO stage 3), SHARD_GRAD_OP (≈ ZeRO stage 2), NO_SHARD (≈ standard DDP) |
| CPU offloading | ZeRO-Offload, ZeRO-Infinity; can offload parameters and optimizer separately | CPU offloading supported; all-or-nothing (parameters, gradients, and optimizer together) |
| NVMe offloading | Yes (ZeRO-Infinity) | No |
| Communication optimization | ZeRO++ (quantized, hierarchical), 1-bit compression, custom backends | Standard PyTorch collective operations |
| Pipeline parallelism | Built-in | Requires separate implementation |
| RLHF support | DeepSpeed-Chat (end-to-end) | Manual implementation |
| Inference optimization | DeepSpeed-FastGen, DeepSpeed-Inference | Separate tools needed |
| Precision handling | Forces upcasting to FP32 for optimizer states | Allows low-precision optimizer operation (more flexible) |
| Configuration | JSON config file, mostly transparent to user | Python API; requires explicit wrapping policy for sharding decisions |
| PyTorch integration | External library, requires deepspeed.initialize() | Native PyTorch, integrates with torch.compile and PyTorch 2.x features |
| Ecosystem | Broader feature set (MoE, inference, sparse attention) | Tighter integration with PyTorch ecosystem |
| Raw throughput (FULL_SHARD equivalent) | Competitive | Sometimes faster (up to 5x reported in certain configurations) |
| Memory efficiency at extreme scale | Superior (NVMe offloading, ZeRO++) | Less optimized |
In benchmarks, FSDP's FULL_SHARD mode has shown up to 5x faster per-iteration throughput than DeepSpeed ZeRO Stage 3 in certain configurations. FSDP tends to be faster and simpler for straightforward training scenarios, especially at smaller scales. DeepSpeed becomes more competitive and often preferable when CPU/NVMe offloading, communication optimization (ZeRO++), extreme memory savings, or specialized features like MoE support are required. The choice often depends on the specific use case: FSDP is attractive for teams that want to stay within the native PyTorch ecosystem, while DeepSpeed offers more features for extreme-scale training scenarios.
DeepSpeed integrates with several popular frameworks and tools:

- Hugging Face Transformers: the Trainer accepts a DeepSpeed configuration file directly, and integration outside the Trainer is available through the HfDeepSpeedConfig class.
- Hugging Face Accelerate: DeepSpeed can be enabled through Accelerate's configuration and launcher.
- PyTorch Lightning: DeepSpeed is exposed as a training strategy (for example, strategy="deepspeed_stage_3").

The Hugging Face integration allows users to enable DeepSpeed optimizations by simply passing a configuration file to the Hugging Face Trainer:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    deepspeed="ds_config.json",  # path to the DeepSpeed configuration file
    # ... other training arguments
)
```
The DeepSpeed configuration file specifies which ZeRO stage to use, whether to enable offloading, mixed precision settings, and other optimization parameters. This integration was instrumental in making DeepSpeed accessible beyond its core audience of distributed systems researchers.
DeepSpeed has been used to train many of the largest models in existence:
| Model | Organization | Parameters | Year |
|---|---|---|---|
| Megatron-Turing NLG | NVIDIA and Microsoft | 530 billion | 2021 |
| BLOOM | BigScience (Hugging Face) | 176 billion | 2022 |
| GLM-130B | Tsinghua University | 130 billion | 2022 |
| YaLM-100B | Yandex | 100 billion | 2022 |
| GPT-NeoX-20B | EleutherAI | 20 billion | 2022 |
| AlexaTM 20B | Amazon | 20 billion | 2022 |
| Turing-NLG | Microsoft | 17 billion | 2020 |
DeepSpeed is also widely used at academic and government research labs, including Oak Ridge National Lab, Carnegie Mellon University, the University of Tokyo, and Korea University.
Megatron-Turing NLG (MT-NLG), a collaboration between Microsoft and NVIDIA announced in October 2021, was a 530-billion parameter autoregressive language model and at the time the largest dense transformer model ever trained. The training system combined three forms of parallelism: tensor parallelism from NVIDIA's Megatron-LM (for intra-node scaling), pipeline parallelism from DeepSpeed (for inter-node scaling), and data parallelism with ZeRO Stage 1 (for scaling across pipeline replicas). This "3D parallelism" approach became a template for subsequent large model training efforts.
BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), a 176-billion parameter multilingual model released in 2022, was trained by a collaboration of over 1,000 researchers using the Megatron-DeepSpeed framework. The training combined ZeRO sharding and pipeline parallelism from DeepSpeed with tensor parallelism from Megatron-LM. BLOOM was trained on 384 NVIDIA A100 80GB GPUs at the Jean Zay supercomputer in France.
DeepSpeed has also been used in training various other large models, and its ZeRO optimizer is commonly used for fine-tuning large models in research labs and companies worldwide. The Hugging Face integration means that any model available on the Hugging Face Hub can be fine-tuned with DeepSpeed optimizations with minimal code changes.
DeepSpeed is designed as a drop-in replacement for PyTorch's standard training loop. Users wrap their model and optimizer with DeepSpeed's initialization function:
```python
import deepspeed

# Wrap the model and optimizer; the returned engine replaces the model
# in the training loop and manages ZeRO, mixed precision, and scheduling.
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

for batch in dataloader:
    loss = model(batch)       # forward pass
    model.backward(loss)      # backward pass (DeepSpeed handles loss scaling)
    model.step()              # optimizer step, gradient clipping, LR schedule
```
The ds_config dictionary (or JSON file) controls all optimization settings, including ZeRO stage, offloading, mixed precision, gradient accumulation, and learning rate scheduling. This configuration-driven approach allows users to experiment with different optimization strategies without changing their training code.
DeepSpeed is configured through a JSON file that specifies training options. A typical configuration for ZeRO stage 2 with mixed-precision training looks like this:
```json
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "reduce_scatter": true,
    "overlap_comm": true
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8
    }
  }
}
```
Training is launched with the deepspeed command:
```bash
deepspeed --num_gpus=8 train.py --deepspeed ds_config.json
```
DeepSpeed continues to evolve with new capabilities addressing emerging needs in the AI community.
Universal Checkpointing provides efficient and flexible checkpointing for large-scale distributed training. It enables saving and loading checkpoints across different parallelism configurations (for example, saving a checkpoint with 3D parallelism on 256 GPUs and resuming on 128 GPUs with a different parallelism layout), which simplifies cluster management and fault recovery.
Arctic Long Sequence Training (ALST) enables scalable and efficient training with multi-million-token sequence lengths. As context windows for large language models have grown from thousands to millions of tokens, the memory and computation requirements for processing long sequences have become a major bottleneck. ALST addresses this through specialized memory management and communication strategies optimized for very long sequences.
ZenFlow enables stall-free offloading training via asynchronous updates. Traditional offloading approaches (like ZeRO-Offload) can introduce stalls when the CPU is not fast enough to complete the optimizer step before the GPU needs the updated parameters. ZenFlow eliminates these stalls by overlapping computation and communication more aggressively through an asynchronous update mechanism.
SuperOffload, announced for 2026, targets large-scale LLM training on superchips (systems like NVIDIA Grace Hopper that combine CPU and GPU on a single module with high-bandwidth unified memory). It aims to exploit the unique memory architecture of these systems for more efficient offloading.
| Version | Date | Key Features |
|---|---|---|
| 0.3.x | Feb 2020 | Initial release, ZeRO Stage 1 and 2 |
| 0.4.x | 2021 | ZeRO Stage 3, ZeRO-Infinity |
| 0.7.x | 2022 | DeepSpeed-MoE, performance improvements |
| 0.9.x-0.10.x | 2023 | DeepSpeed-Chat, ZeRO++, DeepSpeed-FastGen |
| 0.14.x-0.15.x | 2024 | Universal Checkpointing, MoE inference, Mixtral support |
| 0.16.x-0.18.x | 2025 | ALST, ZenFlow, expanded hardware support |
| Date | Event |
|---|---|
| October 2019 | ZeRO paper submitted to arXiv |
| February 2020 | DeepSpeed open-sourced by Microsoft |
| November 2020 | ZeRO paper presented at SC20 |
| September 2020 | 1-bit Adam released |
| January 2021 | ZeRO-Offload paper published |
| March 2021 | ZeRO stage 3 with offloading released |
| April 2021 | ZeRO-Infinity paper published |
| October 2021 | Megatron-Turing NLG 530B announced |
| January 2022 | DeepSpeed-MoE paper released |
| April 2023 | DeepSpeed-Chat released for RLHF training |
| 2023 | ZeRO++ released |
| Late 2023 | DeepSpeed-FastGen released |
| 2024 | Universal Checkpointing, Mixtral MoE inference support |
| August 2024 | Native Windows support for single-GPU training |
| February 2025 | DeepSpeed contributed to Linux Foundation AI & Data |
| 2025 | ZenFlow released |
| June 2025 | Arctic Long Sequence Training (ALST) introduced |
| 2026 | SuperOffload announced |
As of early 2026, DeepSpeed remains one of the essential tools in the large-scale model training ecosystem. The library is at version 0.18.x, actively maintained by the DeepSpeed team (which has moved its GitHub organization from microsoft/DeepSpeed to deepspeedai/DeepSpeed).
Several trends define DeepSpeed's current trajectory:
Complementary role with FSDP. Rather than a zero-sum competition, DeepSpeed and PyTorch FSDP increasingly serve complementary roles. FSDP is the default choice for straightforward distributed training within the PyTorch ecosystem, while DeepSpeed is preferred for scenarios requiring advanced offloading, extreme memory optimization, or features like DeepSpeed-Chat.
Focus on LLM workflows. DeepSpeed's recent features (ALST, ZenFlow, DeepSpeed-Chat, DeepSpeed-FastGen) reflect its focus on the full lifecycle of large language model development, from pre-training through alignment to inference serving.
Hardware adaptation. With the emergence of new hardware architectures like NVIDIA Grace Hopper superchips and AMD MI300X GPUs, DeepSpeed is adapting its offloading and communication strategies. SuperOffload specifically targets the unified memory architecture of superchips.
Continued Hugging Face integration. The tight integration with Hugging Face Transformers and Accelerate ensures that DeepSpeed's optimizations remain accessible to the broad ML community, not just distributed systems specialists.
DeepSpeed's contribution to making large-scale model training accessible cannot be overstated. By reducing the memory barriers through ZeRO and providing turnkey solutions for RLHF training and inference, it has democratized capabilities that were once available only to organizations with massive engineering teams and hardware budgets.