DeepSpeed is an open-source deep learning optimization library developed by Microsoft that makes distributed training and inference of large models easy, efficient, and effective. Built on top of PyTorch, DeepSpeed provides a suite of system optimizations including the Zero Redundancy Optimizer (ZeRO), mixed precision training, pipeline parallelism, and inference acceleration. Since its release in February 2020, DeepSpeed has become one of the most widely used libraries for training models that exceed the memory capacity of a single GPU, and it has been used to train landmark models including BLOOM (176 billion parameters) and Megatron-Turing NLG (530 billion parameters) [1].
DeepSpeed emerged from Microsoft's "AI at Scale" initiative, which aimed to develop the infrastructure needed to train the largest AI models. The project was led by researchers at Microsoft Research, with Samyam Rajbhandari as a primary architect. The first public release came in February 2020, coinciding with the publication of the foundational ZeRO paper [2].
The motivation for DeepSpeed was straightforward: as model sizes grew from millions to billions of parameters, standard data-parallel training became insufficient. A model with billions of parameters cannot fit in the memory of a single GPU, and existing approaches to model parallelism (tensor parallelism and pipeline parallelism) required significant code modifications and were difficult to use. DeepSpeed's goal was to enable training of arbitrarily large models with minimal code changes [1].
The foundational research behind DeepSpeed is the ZeRO (Zero Redundancy Optimizer) paper by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He, first published as an arXiv preprint in October 2019 and later presented at SC20 (the International Conference for High Performance Computing). The paper identified a critical inefficiency in standard data-parallel training: every GPU maintains a complete copy of the model states (parameters, gradients, and optimizer states), resulting in massive memory redundancy [2].
For a model with P parameters trained with the Adam optimizer in mixed precision, the memory required per GPU for model states alone is approximately 2P (FP16 parameters) + 2P (FP16 gradients) + (4P + 4P + 4P) (FP32 momentum, variance, and master weights) = 16P bytes. For a 7.5-billion parameter model, this amounts to roughly 120 GB per GPU, far exceeding the memory of any single GPU available at the time. The ZeRO paper proposed partitioning these redundant states across data-parallel processes, dramatically reducing per-GPU memory consumption without sacrificing computational efficiency [2].
The ZeRO optimizer is DeepSpeed's most important contribution. It provides three progressive stages of optimization, each partitioning additional model states across the data-parallel group and yielding greater memory savings at the cost of increased communication.
| Stage | What Is Partitioned | Memory Reduction | Communication Overhead |
|---|---|---|---|
| ZeRO Stage 1 (ZeRO-OS) | Optimizer states (momentum, variance in Adam) | Up to 4x | None (same as standard data parallelism) |
| ZeRO Stage 2 | Optimizer states + gradients | Up to 8x | None (same as standard data parallelism) |
| ZeRO Stage 3 | Optimizer states + gradients + parameters | Linear with number of GPUs | 1.5x increase (all-gather for forward/backward) |
In standard data-parallel training, each GPU maintains the full optimizer state. For Adam, this includes the first-moment estimate (momentum) and second-moment estimate (variance), each requiring 4 bytes per parameter in FP32, plus the FP32 master copy of the weights (4 bytes per parameter). That totals 12 bytes per parameter just for optimizer states.
ZeRO Stage 1 partitions only the optimizer states across data-parallel processes. Each GPU stores only 1/N of the optimizer states (where N is the number of GPUs), while still maintaining the full FP16 parameters and gradients. After the backward pass, gradients are reduced normally, but each GPU updates only its partition of the optimizer states and parameters. The updated parameters are then collected on every GPU via an all-gather. This yields up to a 4x memory reduction with no additional communication overhead compared to standard data parallelism [2].
ZeRO Stage 2 adds gradient partitioning on top of Stage 1. After the backward pass, instead of performing an all-reduce on the full gradient tensor, each GPU reduces and retains only the gradients corresponding to its partition of the optimizer states. This is implemented using a reduce-scatter operation, which is communication-equivalent to the all-reduce used in standard data parallelism but results in each GPU holding only 1/N of the gradients. This provides up to an 8x memory reduction [2].
ZeRO Stage 3 takes the partitioning to its logical conclusion by also distributing the FP16 model parameters across GPUs. Each GPU stores only 1/N of the parameters. During the forward and backward passes, the needed parameters are collected on demand via all-gather operations, used for computation, and then discarded. This achieves memory reduction that scales linearly with the number of GPUs: with N GPUs, each GPU needs to store only 1/N of the total model states [2].
The tradeoff is a 1.5x increase in communication volume compared to standard data parallelism, due to the additional all-gather operations in the forward and backward passes. However, this communication can be overlapped with computation, and in practice the throughput impact is often modest.
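The 1.5x figure follows directly from counting data movement per step, as in the ZeRO paper's communication analysis. A quick sanity check (an illustration, not DeepSpeed code), measuring volume in units of model-size elements:

```python
# Communication volume per training step, in units of model size M.
def standard_dp_volume(M):
    # Standard data parallelism: one all-reduce of the gradients,
    # implemented as reduce-scatter (M) + all-gather (M).
    return 2 * M

def zero3_volume(M):
    # ZeRO Stage 3: all-gather parameters for the forward pass (M),
    # all-gather again for the backward pass (M), then reduce-scatter
    # the gradients (M).
    return 3 * M

M = 7.5e9
print(zero3_volume(M) / standard_dp_volume(M))  # prints 1.5
```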
To illustrate the memory impact, consider a 7.5-billion parameter model trained with Adam in mixed precision on 64 GPUs:
| Configuration | Memory per GPU |
|---|---|
| Standard data parallelism | ~120 GB |
| ZeRO Stage 1 | ~31.4 GB |
| ZeRO Stage 2 | ~16.6 GB |
| ZeRO Stage 3 | ~1.9 GB |
These reductions make it possible to train models with billions of parameters on clusters of GPUs that individually have only 16-80 GB of memory [2].
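The table's figures can be reproduced from the 16P accounting above. The following sketch (an illustration, not DeepSpeed code) computes per-GPU model-state memory for each ZeRO stage:

```python
def model_state_memory_gb(params, n_gpus, stage):
    """Per-GPU memory (GB) for model states with Adam in mixed precision.

    Per parameter: 2 bytes FP16 weights, 2 bytes FP16 gradients, and
    12 bytes FP32 optimizer states (momentum, variance, master weights),
    following the ZeRO paper's accounting. Stage 0 means standard data
    parallelism with everything replicated.
    """
    p_fp16, g_fp16, opt = 2 * params, 2 * params, 12 * params
    if stage == 0:
        total = p_fp16 + g_fp16 + opt                 # fully replicated
    elif stage == 1:
        total = p_fp16 + g_fp16 + opt / n_gpus        # optimizer states sharded
    elif stage == 2:
        total = p_fp16 + (g_fp16 + opt) / n_gpus      # + gradients sharded
    else:
        total = (p_fp16 + g_fp16 + opt) / n_gpus      # everything sharded
    return total / 1e9

for s in (0, 1, 2, 3):
    print(f"Stage {s}: {model_state_memory_gb(7.5e9, 64, s):.1f} GB")
```

Running this for a 7.5B-parameter model on 64 GPUs reproduces the ~120, ~31.4, ~16.6, and ~1.9 GB figures in the table.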
ZeRO-Offload extends ZeRO Stage 2 by offloading optimizer states and gradient computation to CPU memory and CPU compute. This enables training models with up to 13 billion parameters on a single GPU by leveraging the much larger capacity of system RAM (typically 256 GB or more) compared to GPU memory (typically 16-80 GB) [3].
The design of ZeRO-Offload carefully balances computation and communication between CPU and GPU. The compute-intensive forward and backward passes remain on the GPU, while the optimizer step, which is memory-intensive but far less compute-demanding, runs on the CPU. To keep this step fast, DeepSpeed provides a CPU Adam implementation heavily optimized with SIMD vector instructions.
ZeRO-Offload achieves near-GPU-only training throughput for large models because the optimizer step computation on the CPU can be overlapped with the next forward pass on the GPU. For smaller models where GPU utilization is already high, the overhead of CPU-GPU data transfers becomes more noticeable [3].
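Enabling this in practice is a configuration change rather than a code change. A minimal sketch of a DeepSpeed JSON config for ZeRO-Offload; the key names follow the DeepSpeed configuration schema, and the values are illustrative:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "fp16": { "enabled": true }
}
```

Setting `pin_memory` uses page-locked host memory, which speeds up CPU-GPU transfers at the cost of reducing the RAM available to other processes.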
ZeRO-Infinity, introduced in 2021, extends the offloading concept to NVMe storage (SSDs), enabling training of models with trillions of parameters. Built on ZeRO Stage 3, ZeRO-Infinity can offload all model states (parameters, gradients, and optimizer states) to CPU memory and NVMe storage [4].
The key technical innovation is a dynamic prefetcher that traces forward and backward computation, constructing an internal map of operator sequences. Using this map, ZeRO-Infinity overlaps NVMe-to-CPU transfers with CPU-to-GPU transfers and GPU-to-GPU all-gather operations, effectively pipelining all three communication stages with computation. This design achieves bandwidth utilization close to the theoretical peak of the NVMe subsystem [4].
| Feature | ZeRO-Offload | ZeRO-Infinity |
|---|---|---|
| Built on | ZeRO Stage 2 | ZeRO Stage 3 |
| Offload target | CPU memory | CPU memory + NVMe storage |
| Offloaded states | Optimizer states + gradients | All model states (params, grads, optimizer) |
| Max model scale | ~13B params on single GPU | Trillions of parameters |
| Bandwidth optimization | CPU-GPU overlap | Dynamic prefetching across NVMe, CPU, and GPU |
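Like ZeRO-Offload, ZeRO-Infinity is driven by configuration. A minimal sketch of a config that offloads both parameters and optimizer states to NVMe; the key names follow the DeepSpeed configuration schema, and `/local_nvme` is a placeholder path:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```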
ZeRO++, released in 2023, addresses the communication overhead that becomes a bottleneck in ZeRO Stage 3, particularly when training across nodes connected by relatively slow network links. ZeRO++ introduces three complementary communication optimization techniques that together reduce communication volume by up to 4x [5].
qwZ applies block-based quantization to the parameter all-gather, converting weights from FP16 to INT8 and halving the data transferred. Quantizing independent blocks of parameters, rather than whole tensors, achieves 3x better accuracy than naive quantization, and highly optimized CUDA kernels make it 5x faster [5].
hpZ eliminates cross-node all-gather communication during the backward pass through a hierarchical data remapping strategy. Instead of collecting parameters from all GPUs across all nodes, hpZ maintains a full copy of the parameters within each node (distributed across the node's GPUs) and only performs intra-node all-gather operations. This reduces cross-node communication volume from M/Z per GPU to M/(Z*N), where M is the model size, Z is the total number of GPUs, and N is the number of GPUs per node [5].
qgZ replaces the standard gradient all-reduce with a communication-efficient all-to-all based quantized gradient averaging scheme, further reducing communication volume during the backward pass [5].
| Component | Technique | Communication Reduction |
|---|---|---|
| qwZ | Block-based weight quantization (FP16 to INT8) | 2x for parameter all-gather |
| hpZ | Hierarchical weight partitioning | Eliminates cross-node backward all-gather |
| qgZ | Quantized gradient averaging | Reduces gradient communication volume |
| Combined | All three together | Up to 4x total reduction |
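The three techniques are enabled through the ZeRO configuration. A sketch based on the key names in the ZeRO++ tutorial (values illustrative; `zero_hpz_partition_size` is typically set to the number of GPUs per node):

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_quantized_weights": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_gradients": true
  }
}
```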
DeepSpeed-Chat, introduced in April 2023, provides an end-to-end training pipeline for reinforcement learning from human feedback (RLHF), the technique used to align large language models like ChatGPT with human preferences. The system addresses the significant engineering complexity of RLHF training, which requires coordinating multiple models (actor, critic, reward model, and reference model) simultaneously [6].
DeepSpeed-Chat implements the InstructGPT training methodology in three stages: (1) supervised fine-tuning of a pre-trained model on demonstration data, (2) reward model training on human preference comparisons, and (3) RLHF fine-tuning with PPO, using the reward model to optimize the policy.
A single script can take a pre-trained Hugging Face model and run it through all three stages.
The key technical contribution of DeepSpeed-Chat is the DeepSpeed Hybrid Engine (DeepSpeed-HE), which seamlessly switches between ZeRO-based training mode and inference mode during RLHF training. During the experience generation phase (where the model generates text), DeepSpeed-HE uses inference optimizations including tensor parallelism and optimized kernels. During the training phase, it switches to ZeRO-powered data parallelism. This hybrid approach yields over 15x speedup compared to existing RLHF systems [6].
DeepSpeed-Chat enabled training an OPT-13B model via RLHF in 9 hours and an OPT-30B model in 18 hours on Azure Cloud, at costs under $300 and $600 respectively. Compared to other systems like Colossal-AI and Hugging Face DDP, DeepSpeed-Chat achieved up to 19x higher throughput for RLHF training, and 10x faster performance on a single GPU [6].
DeepSpeed-FastGen, released in late 2023 and expanded in 2024, is an inference serving framework for large language models. Its core innovation is the Dynamic SplitFuse technique, which handles variable-length prompts and generation steps more efficiently than traditional continuous batching approaches [7].
Dynamic SplitFuse decomposes long prompts into smaller chunks and composes short prompts together, creating uniform token budgets across iterations. This approach addresses the performance cliffs that occur in traditional systems when long prompts cause batch sizes to drop, and it provides more consistent latency compared to systems like vLLM [7].
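The core scheduling idea, filling a fixed token budget each iteration by splitting long prompts and fusing short ones, can be sketched in a few lines. This is a toy illustration of the concept, not DeepSpeed's implementation:

```python
def schedule_tokens(pending, budget):
    """Illustrative token-budget scheduler in the spirit of Dynamic SplitFuse.

    `pending` maps request ids to the number of prompt tokens still to be
    processed. Each iteration fills a fixed token budget: long prompts are
    split into chunks, and several short prompts are fused into one batch,
    so every iteration does a similar amount of work.
    """
    batch, remaining = [], budget
    for req_id, tokens in list(pending.items()):
        if remaining == 0:
            break
        take = min(tokens, remaining)     # split a long prompt if needed
        batch.append((req_id, take))
        remaining -= take
        if tokens - take == 0:
            del pending[req_id]           # prompt fully processed
        else:
            pending[req_id] = tokens - take
    return batch

pending = {"a": 3000, "b": 200, "c": 150}
print(schedule_tokens(pending, 2048))  # [('a', 2048)] -- long prompt split
print(schedule_tokens(pending, 2048))  # remainder of 'a' fused with 'b' and 'c'
```

Because every iteration processes close to the same number of tokens, latency stays consistent even when request lengths vary widely.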
In 2024, DeepSpeed-FastGen added support for Mixture of Experts (MoE) architectures, including the Mixtral model family. A custom MoE module with inference-optimized kernels was developed, achieving 2.4x higher throughput for the Mixtral model compared to baseline implementations at a prompt length of 1,200 tokens and 60 generation steps. Support for Falcon and Phi-2 model families was also added [7].
Beyond MoE inference, DeepSpeed provides comprehensive support for training Mixture of Experts models. DeepSpeed-MoE includes expert parallelism that can be combined with data, tensor, and ZeRO-based parallelism, the Pyramid-Residual MoE (PR-MoE) architecture for reducing model size, Mixture-of-Students (MoS) distillation for producing smaller dense models, and optimized kernels for MoE inference.
DeepSpeed-MoE has been used to train large MoE models efficiently, enabling researchers to explore the MoE paradigm at scale without building custom distributed training infrastructure [8].
DeepSpeed integrates tightly with the Hugging Face Transformers and Accelerate libraries, making it accessible to the broad community of Hugging Face users. The integration allows users to enable DeepSpeed optimizations by simply passing a configuration file to the Hugging Face Trainer:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    deepspeed="ds_config.json",
    # ... other training arguments
)
```
The DeepSpeed configuration file specifies which ZeRO stage to use, whether to enable offloading, mixed precision settings, and other optimization parameters. Hugging Face's documentation provides detailed guides for each ZeRO stage and common configurations [9].
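A minimal sketch of such a `ds_config.json` for the Trainer integration; the key names follow the DeepSpeed configuration schema, and `"auto"` values are filled in by the Hugging Face integration from the `TrainingArguments`:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  },
  "fp16": { "enabled": "auto" },
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto"
}
```

Using `"auto"` avoids specifying the same hyperparameters twice and keeps the Trainer and DeepSpeed configurations consistent.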
The Hugging Face Accelerate library also supports DeepSpeed as a backend, providing an even simpler interface for distributed training. This integration was instrumental in making DeepSpeed accessible beyond its core audience of distributed systems researchers.
DeepSpeed has been used to train several of the largest and most notable language models.
Megatron-Turing NLG (MT-NLG), a collaboration between Microsoft and NVIDIA announced in October 2021, was a 530-billion parameter autoregressive language model and at the time the largest dense transformer model ever trained. The training system combined three forms of parallelism: tensor parallelism from NVIDIA's Megatron-LM (for intra-node scaling), pipeline parallelism from DeepSpeed (for inter-node scaling), and data parallelism with ZeRO Stage 1 (for scaling across pipeline replicas). This "3D parallelism" approach became a template for subsequent large model training efforts [10].
BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), a 176-billion parameter multilingual model released in 2022, was trained by a collaboration of over 1,000 researchers using the Megatron-DeepSpeed framework. The training combined ZeRO sharding and pipeline parallelism from DeepSpeed with tensor parallelism from Megatron-LM. BLOOM was trained on 384 NVIDIA A100 80GB GPUs at the Jean Zay supercomputer in France [11].
DeepSpeed has also been used in training various other large models, and its ZeRO optimizer is commonly used for fine-tuning large models in research labs and companies worldwide. The Hugging Face integration means that any model available on the Hugging Face Hub can be fine-tuned with DeepSpeed optimizations with minimal code changes.
PyTorch's Fully Sharded Data Parallel (FSDP) is the most direct competitor to DeepSpeed's ZeRO optimizer. FSDP was inspired by ZeRO and implements similar parameter, gradient, and optimizer state sharding natively within PyTorch.
| Feature | DeepSpeed ZeRO | PyTorch FSDP |
|---|---|---|
| Sharding stages | 3 explicit stages (1, 2, 3) | FULL_SHARD, SHARD_GRAD_OP, NO_SHARD |
| CPU offloading | ZeRO-Offload, ZeRO-Infinity (CPU + NVMe) | CPU offloading supported |
| NVMe offloading | Yes (ZeRO-Infinity) | No |
| Communication optimization | ZeRO++ (quantized, hierarchical) | Standard collective operations |
| Pipeline parallelism | Built-in | Requires separate implementation |
| RLHF support | DeepSpeed-Chat (end-to-end) | Manual implementation |
| Inference optimization | DeepSpeed-FastGen, DeepSpeed-Inference | Separate tools needed |
| Integration | Hugging Face, standalone | Native PyTorch |
| Configuration | JSON config file | Python API |
| Raw throughput (FULL_SHARD equivalent) | Competitive | Sometimes faster (up to 5x reported) |
| Memory efficiency at extreme scale | Superior (NVMe offloading, ZeRO++) | Less optimized |
In benchmarks, FSDP's FULL_SHARD mode has shown up to 5x faster per-iteration throughput than DeepSpeed ZeRO Stage 3 in certain configurations. However, DeepSpeed becomes competitive and often preferable when CPU/NVMe offloading, communication optimization (ZeRO++), or extreme memory savings are required. The choice often depends on the specific use case: FSDP is attractive for teams that want to stay within the native PyTorch ecosystem, while DeepSpeed offers more features for extreme-scale training scenarios [12].
DeepSpeed continues to evolve with new capabilities addressing emerging needs in the AI community.
Universal Checkpointing provides efficient and flexible checkpointing for large-scale distributed training. It enables saving and loading checkpoints across different parallelism configurations (for example, saving a checkpoint with 3D parallelism on 256 GPUs and resuming on 128 GPUs with a different parallelism layout), which simplifies cluster management and fault recovery [13].
ALST (Arctic Long Sequence Training) enables scalable and efficient training with multi-million token sequence lengths. As context windows for large language models have grown from thousands to millions of tokens, the memory and computation requirements for processing long sequences have become a major bottleneck. ALST addresses this through specialized memory management and communication strategies optimized for very long sequences [13].
ZenFlow enables stall-free offloading training via asynchronous updates. Traditional offloading approaches (like ZeRO-Offload) can introduce stalls when the CPU is not fast enough to complete the optimizer step before the GPU needs the updated parameters. ZenFlow eliminates these stalls by overlapping computation and communication more aggressively through an asynchronous update mechanism [13].
SuperOffload, announced for 2026, targets large-scale LLM training on superchips (systems like NVIDIA Grace Hopper that combine CPU and GPU on a single module with high-bandwidth unified memory). It aims to exploit the unique memory architecture of these systems for more efficient offloading [13].
| Version | Date | Key Features |
|---|---|---|
| 0.3.x | Feb 2020 | Initial release, ZeRO Stage 1 and 2 |
| 0.4.x | 2021 | ZeRO Stage 3, ZeRO-Infinity |
| 0.7.x | 2022 | DeepSpeed-MoE, performance improvements |
| 0.9.x-0.10.x | 2023 | DeepSpeed-Chat, ZeRO++, DeepSpeed-FastGen |
| 0.14.x-0.15.x | 2024 | Universal Checkpointing, MoE inference, Mixtral support |
| 0.16.x-0.18.x | 2025 | ALST, ZenFlow, expanded hardware support |
DeepSpeed is designed as a drop-in replacement for PyTorch's standard training loop. Users wrap their model and optimizer with DeepSpeed's initialization function:
```python
import deepspeed

# deepspeed.initialize wraps the model and optimizer in a DeepSpeed engine
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

for batch in dataloader:
    loss = model(batch)
    model.backward(loss)   # replaces loss.backward()
    model.step()           # replaces optimizer.step()
```
The ds_config dictionary (or JSON file) controls all optimization settings, including ZeRO stage, offloading, mixed precision, gradient accumulation, and learning rate scheduling. This configuration-driven approach allows users to experiment with different optimization strategies without changing their training code [1].
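An illustrative `ds_config` dictionary covering the settings named above; the key names follow the DeepSpeed configuration schema, while the values here are arbitrary examples:

```python
# Example DeepSpeed configuration: ZeRO Stage 2 with CPU offloading,
# FP16 mixed precision, gradient accumulation, and a warmup LR schedule.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_num_steps": 1000},
    },
}
```

Switching from, say, ZeRO Stage 2 to Stage 3 with NVMe offloading is then a matter of editing this dictionary, with no changes to the training loop itself.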
As of early 2026, DeepSpeed remains one of the essential tools in the large-scale model training ecosystem. The library is at version 0.18.x, actively maintained by the DeepSpeed team (which has moved its GitHub organization from microsoft/DeepSpeed to deepspeedai/DeepSpeed).
Several trends define DeepSpeed's current trajectory:
Complementary role with FSDP. Rather than a zero-sum competition, DeepSpeed and PyTorch FSDP increasingly serve complementary roles. FSDP is the default choice for straightforward distributed training within the PyTorch ecosystem, while DeepSpeed is preferred for scenarios requiring advanced offloading, extreme memory optimization, or features like DeepSpeed-Chat.
Focus on LLM workflows. DeepSpeed's recent features (ALST, ZenFlow, DeepSpeed-Chat, DeepSpeed-FastGen) reflect its focus on the full lifecycle of large language model development, from pre-training through alignment to inference serving.
Hardware adaptation. With the emergence of new hardware architectures like NVIDIA Grace Hopper superchips and AMD MI300X GPUs, DeepSpeed is adapting its offloading and communication strategies. SuperOffload specifically targets the unified memory architecture of superchips.
Continued Hugging Face integration. The tight integration with Hugging Face Transformers and Accelerate ensures that DeepSpeed's optimizations remain accessible to the broad ML community, not just distributed systems specialists.
DeepSpeed's contribution to making large-scale model training accessible cannot be overstated. By reducing the memory barriers through ZeRO and providing turnkey solutions for RLHF training and inference, it has democratized capabilities that were once available only to organizations with massive engineering teams and hardware budgets.