# Fully Sharded Data Parallel (FSDP)

> Source: https://aiwiki.ai/wiki/fsdp
> Updated: 2026-06-21
> Categories: AI Infrastructure, Deep Learning, Developer Tools, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Fully Sharded Data Parallel (FSDP)** is a distributed training technique implemented in [PyTorch](/wiki/pytorch) that shards a model's parameters, gradients, and optimizer states across data-parallel workers, allowing models with billions or trillions of parameters to be trained on commodity GPU clusters without resorting to complex tensor or pipeline parallelism. FSDP was first developed at [Meta AI](/wiki/meta_ai) (then Facebook AI Research), initially as part of the FairScale library in 2021, and was integrated into core PyTorch as a beta feature in PyTorch 1.11, released on March 10, 2022 [1][2][3]. In published scaling experiments on AWS, FSDP trained a dense 175-billion-parameter GPT model at a maximum of 159 teraFLOP/s per [NVIDIA A100](/wiki/nvidia_a100) GPU (about 51 percent of the A100's 312 teraFLOP/s peak) and a dense 1-trillion-parameter model at 84 teraFLOP/s per GPU [9][26].

FSDP is the PyTorch counterpart to Microsoft [DeepSpeed](/wiki/deepspeed)'s [ZeRO](/wiki/zero_optimizer) (Zero Redundancy Optimizer) Stage 3, sharing the same memory-saving idea of partitioning model state across data-parallel ranks while preserving the data-parallel programming model [4]. The PyTorch FSDP paper presented at [VLDB](/wiki/vldb) 2023 reports that "FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS" [9]. Since its release, FSDP has become the de-facto choice for large-scale training of foundation models in PyTorch and is used for [large language models](/wiki/large_language_model) such as the [Llama](/wiki/llama) and [Mistral](/wiki/mistral) families and diffusion models like [Stable Diffusion](/wiki/stable_diffusion). It is integrated into higher-level frameworks including [Hugging Face](/wiki/hugging_face) [Accelerate](/wiki/accelerate), [PyTorch Lightning](/wiki/pytorch_lightning), and [Mosaic Composer](/wiki/mosaic_composer) [5][6][7]. In July 2024, PyTorch 2.4 introduced **FSDP2**, a redesigned API at `torch.distributed._composable.fsdp.fully_shard` that uses per-parameter sharding via [DTensor](/wiki/dtensor) instead of the original FlatParameter design, providing better composability with [tensor parallelism](/wiki/tensor_parallelism) and clearer state semantics [8].

## Background and motivation

Training ever-larger neural networks on multiple [GPUs](/wiki/gpu) has historically relied on three families of parallelism, each with distinct trade-offs.

[Data parallelism](/wiki/data_parallelism), exemplified by PyTorch's [DistributedDataParallel](/wiki/ddp) (DDP), replicates the entire model on each worker and synchronizes gradients via an all-reduce after every backward pass. DDP is simple, but every rank holds a full copy of parameters, gradients, and optimizer states. For an Adam-style optimizer in mixed precision the per-parameter footprint is roughly 16 bytes (2 bytes for fp16 parameters, 2 bytes for fp16 gradients, 4 bytes for the fp32 master copy, and 8 bytes for the two Adam moments), so a 7-billion-parameter model already needs more than 100 GB just for state, before activations [4][9]. DDP alone cannot train modern foundation models even on H100-class hardware.
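The 16-byte figure and the 7-billion-parameter example above work out as follows (decimal gigabytes, model state only, activations excluded):

$$
\underbrace{2}_{\text{fp16 params}} + \underbrace{2}_{\text{fp16 grads}} + \underbrace{4}_{\text{fp32 master}} + \underbrace{8}_{\text{Adam moments}} = 16\ \text{bytes per parameter}
$$

$$
7 \times 10^{9}\ \text{parameters} \times 16\ \text{bytes} = 1.12 \times 10^{11}\ \text{bytes} = 112\ \text{GB}
$$

That is more than the 80 GB of memory on a single A100 80GB or H100 SXM GPU, before any activation is stored.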

[Model parallelism](/wiki/model_parallelism) splits individual layers across GPUs. Its modern incarnation, tensor parallelism as popularized by [Megatron-LM](/wiki/megatron_lm), partitions matrix multiplications along an input or output dimension and requires all-reduce operations inside the forward and backward of each linear layer. Tensor parallelism scales well intra-node where [NVLink](/wiki/nvlink) bandwidth is high, but it is fragile across nodes and demands rewrites of attention and feed-forward layers. [Pipeline parallelism](/wiki/pipeline_parallelism) splits the layer stack into stages, each placed on a different device, and feeds micro-batches through the pipeline; it avoids the bandwidth requirements of tensor parallelism but introduces pipeline bubbles and complicates gradient accumulation.

In 2019 and 2020, researchers at [Microsoft](/wiki/microsoft) led by Samyam Rajbhandari proposed ZeRO (Zero Redundancy Optimizer) as a way to recover the memory savings of model parallelism while keeping the data-parallel programming model. ZeRO observed that DDP's full replication of parameters, gradients, and optimizer states is wasteful: in principle each rank only needs the slice it is updating during the optimizer step. By sharding optimizer states (Stage 1), gradients (Stage 2), and parameters themselves (Stage 3) across the data-parallel group, ZeRO reduces per-rank memory by a factor proportional to the world size at the cost of extra collective communication [4]. ZeRO-Infinity and ZeRO-Offload extended this idea by spilling sharded state to CPU memory or NVMe storage [10].
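The effect of the three ZeRO stages on per-rank model state can be sketched with a small calculator. This is a minimal illustration rather than DeepSpeed's or PyTorch's own accounting: it assumes the 2 + 2 + 12 bytes-per-parameter split for mixed-precision Adam used in the ZeRO paper, and the function name `zero_state_bytes_per_rank` is hypothetical. Activations, buffers, and fragmentation are ignored.

```python
def zero_state_bytes_per_rank(num_params: int, world_size: int, stage: int) -> float:
    """Approximate model-state bytes per rank for mixed-precision Adam.

    stage 0 is plain data parallelism (full replication);
    stages 1, 2, 3 shard optimizer states, then gradients, then parameters.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * num_params   # fp16 parameters
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 master copy (4) + two Adam moments (8)
    if stage >= 1:
        optim /= world_size     # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= world_size     # Stage 2: also shard gradients
    if stage >= 3:
        params /= world_size    # Stage 3: also shard parameters
    return params + grads + optim


if __name__ == "__main__":
    for stage in range(4):
        gb = zero_state_bytes_per_rank(7_000_000_000, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.2f} GB per rank")
```

Only Stage 3 divides the whole 16 bytes per parameter by the world size; Stages 1 and 2 leave the replicated parameter (and, for Stage 1, gradient) terms untouched.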

In July 2021, Meta's FairScale team published an initial PyTorch implementation of ZeRO-3 called Fully Sharded Data Parallel, written by Myle Ott and colleagues for use in fairseq's training of large language models [1]. Meta's announcement framed the goal plainly: "With FSDP, it is now possible to more efficiently train models that are orders of magnitude larger using fewer GPUs" [1]. That work was ported into core PyTorch and shipped as `torch.distributed.fsdp.FullyShardedDataParallel` in the 1.11 release on March 10, 2022 [2][3]. A 2023 VLDB paper by Zhao et al. documents the design choices and lessons learned from scaling FSDP to large foundation-model workloads at Meta [9].

## How FSDP works

FSDP organizes a model into a tree of FSDP units, where each unit is a subtree of the `nn.Module` graph wrapped with `FullyShardedDataParallel` (FSDP1) or registered via the `fully_shard` API (FSDP2). Each unit's parameters are flattened, concatenated, and split into equal shards across the data-parallel ranks. At the start of training, every rank holds only its 1/N slice of every unit's parameters.

During the **forward pass**, FSDP walks the model unit by unit. Just before each unit's `forward` is invoked, FSDP issues an `all_gather` collective so that every rank temporarily reconstructs the full parameter tensor for that unit. The forward computation runs on the full parameters as if the model were replicated. Once the forward returns, FSDP frees the gathered parameters and each rank again holds only its shard. Peak parameter memory at any moment is therefore the size of the largest single unit's full parameters, plus the sharded copies of all other units, rather than the full model.

During the **backward pass**, FSDP walks the units in reverse. Before each unit's backward, FSDP again all-gathers the unit's parameters because the backward formula for many layers needs the weights. After the backward computes the gradient with respect to those parameters, FSDP issues a `reduce_scatter` collective: this both averages the gradient across ranks (the equivalent of DDP's all-reduce) and scatters the result so that each rank ends up holding only the gradient slice corresponding to its parameter shard. The full gradient is never materialized on any single rank. The gathered parameters are then freed.

During the **optimizer step**, each rank updates only its local shard of parameters using its local shard of gradients and optimizer states. No cross-rank communication is required for the step itself. Optimizer states are effectively sharded as a side effect of sharding parameters and gradients: each rank only ever instantiates the moments for the slice it owns.
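The three phases above can be traced in a toy, single-process simulation. This is a sketch under stated assumptions, not PyTorch's implementation: ranks are modeled as Python lists, the "model" is a single dot product with a squared-error loss, the optimizer is plain SGD, and all function names are hypothetical. It shows that all-gather, reduce-scatter, and a purely local update produce the same weights as a replicated (DDP-style) step.

```python
def all_gather(shards):
    """Every rank receives the concatenation of all ranks' shards."""
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in shards]


def reduce_scatter_mean(per_rank_full, world_size):
    """Average element-wise across ranks, then give rank r only slice r."""
    n = len(per_rank_full[0])
    mean = [sum(g[i] for g in per_rank_full) / world_size for i in range(n)]
    size = n // world_size
    return [mean[r * size:(r + 1) * size] for r in range(world_size)]


def local_grad(weights, x, target):
    """Gradient of 0.5 * (w . x - target)^2 with respect to w."""
    err = sum(w * xi for w, xi in zip(weights, x)) - target
    return [err * xi for xi in x]


def fsdp_step(shards, batches, lr):
    """One training step where each rank stores only its parameter shard."""
    world_size = len(shards)
    gathered = all_gather(shards)  # unshard for forward/backward
    grads = [local_grad(gathered[r], *batches[r]) for r in range(world_size)]
    grad_shards = reduce_scatter_mean(grads, world_size)  # average + shard
    # Optimizer step: each rank touches only its own slice, no communication.
    return [[w - lr * g for w, g in zip(shards[r], grad_shards[r])]
            for r in range(world_size)]


def ddp_step(weights, batches, lr):
    """Reference step with fully replicated weights and averaged gradients."""
    grads = [local_grad(weights, x, t) for x, t in batches]
    mean = [sum(g[i] for g in grads) / len(batches) for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, mean)]


if __name__ == "__main__":
    shards = [[0.5, -1.0], [2.0, 0.25]]  # two ranks, two elements each
    batches = [([1.0, 2.0, 0.0, 1.0], 1.0), ([0.0, 1.0, 1.0, 2.0], 0.5)]
    print(fsdp_step(shards, batches, lr=0.1))
    print(ddp_step([0.5, -1.0, 2.0, 0.25], batches, lr=0.1))
```

In the real system the gathered parameters are freed again after each unit's forward and backward; the simulation keeps a single unit for brevity.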

A core performance technique is overlapping these collectives with computation. FSDP supports prefetching the all-gather for the next unit while the current unit's forward or backward is still running on the GPU, controlled by the `forward_prefetch` and `backward_prefetch` arguments [11]. With suitable wrapping granularity and prefetching, the all-gather and reduce-scatter time can be largely hidden behind compute, so FSDP achieves throughput comparable to DDP at much lower memory.
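A simplified timeline model illustrates why prefetching helps. It is only a sketch: it assumes one unit of lookahead, communication and compute streams that do not slow each other down, and caller-supplied per-unit times in arbitrary units. The function names are hypothetical and nothing here measures a real cluster.

```python
def step_time_no_overlap(compute, comm):
    """Each unit waits for its own all-gather, then computes."""
    return sum(compute) + sum(comm)


def step_time_with_prefetch(compute, comm):
    """The all-gather for unit i+1 is issued as soon as unit i starts computing."""
    params_ready = comm[0]  # time at which unit 0's parameters are gathered
    compute_free = 0        # time at which the compute stream becomes idle
    for i, cost in enumerate(compute):
        start = max(compute_free, params_ready)  # stall if params not ready
        if i + 1 < len(comm):
            params_ready = start + comm[i + 1]   # prefetch the next unit
        compute_free = start + cost
    return compute_free


if __name__ == "__main__":
    # Compute-bound: only the first all-gather is exposed.
    print(step_time_no_overlap([4, 4, 4], [1, 1, 1]),
          step_time_with_prefetch([4, 4, 4], [1, 1, 1]))
    # Communication-bound: prefetching cannot hide the collectives.
    print(step_time_no_overlap([1, 1, 1], [3, 3, 3]),
          step_time_with_prefetch([1, 1, 1], [3, 3, 3]))
```

When each unit's compute time exceeds the next unit's all-gather time, only the first collective stays on the critical path; when communication dominates, the step time approaches the sum of the collectives regardless of prefetching.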

## Sharding strategies

FSDP exposes a `ShardingStrategy` enum that lets users dial back the degree of sharding, mirroring the ZeRO stages. The most common strategies are summarized below [11][12].

| Strategy | Sharded state | ZeRO equivalent | Typical use |
| --- | --- | --- | --- |
| `FULL_SHARD` | Parameters, gradients, optimizer states | ZeRO-3 | Default. Maximum memory savings, used for large models. |
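| `SHARD_GRAD_OP` | Gradients and optimizer states (parameters are gathered before forward, kept unsharded through backward, and resharded only afterwards) | ZeRO-2 | When full parameters fit on each GPU during computation but optimizer state does not. Less collective overhead than full shard. |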
| `NO_SHARD` | Nothing (full replication) | DDP | Equivalent to DDP. Useful for debugging or comparison runs. |
| `HYBRID_SHARD` | Sharded inside each node, replicated across nodes | ZeRO-3 within node | Reduces inter-node bandwidth. Good fit for clusters where intra-node [NVLink](/wiki/nvlink) is much faster than [InfiniBand](/wiki/infiniband). |
| `_HYBRID_SHARD_ZERO2` | Like `HYBRID_SHARD` with ZeRO-2 inside each node | ZeRO-2 within node | Tuning bandwidth vs. memory. |

`HYBRID_SHARD` is important at large scale. With it, FSDP shards parameters across GPUs inside a single node and replicates the sharded view across nodes, so most heavy collectives stay on intra-node NVLink while a single all-reduce handles cross-node synchronization [9].
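The two kinds of process group implied by `HYBRID_SHARD` can be sketched as follows. This is an illustration only: it assumes global ranks are numbered contiguously node by node, and `hybrid_shard_groups` is a hypothetical helper, not a PyTorch function (PyTorch constructs the actual groups itself).

```python
def hybrid_shard_groups(world_size: int, gpus_per_node: int):
    """Return (shard_groups, replicate_groups) as lists of global ranks."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node
    # Ranks on the same node shard parameters among themselves
    # (all-gather and reduce-scatter stay on the intra-node links).
    shard_groups = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]
    # Ranks with the same local index hold the same shard on different nodes
    # and all-reduce their gradient shards across nodes.
    replicate_groups = [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]
    return shard_groups, replicate_groups


if __name__ == "__main__":
    shard, replicate = hybrid_shard_groups(world_size=16, gpus_per_node=8)
    print("shard groups:    ", shard)
    print("replicate groups:", replicate)
```

Each rank belongs to exactly one group of each kind, so the per-rank shard is 1/`gpus_per_node` of the model rather than 1/`world_size`.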

## Wrapping policies

FSDP only saves memory if the model is split into multiple units. If the entire model is wrapped as a single unit, the full parameter tensor is gathered every forward and the only savings come from sharded optimizer state (essentially ZeRO-1 behavior) [11]. Choosing the right wrapping granularity is therefore one of the most important tuning decisions when adopting FSDP. FSDP1 ships several auto-wrap policies in `torch.distributed.fsdp.wrap`:

* `size_based_auto_wrap_policy` traverses the module tree and wraps any submodule whose parameter count exceeds a `min_num_params` threshold (typically tens or hundreds of millions). It is a sensible default for unfamiliar architectures.
* `transformer_auto_wrap_policy` wraps each instance of a specified set of transformer-block classes (such as `LlamaDecoderLayer`, `GPT2Block`, or `T5Block`). This produces one FSDP unit per transformer layer, the standard layout for large language models because every block has roughly the same parameter count and compute time, so all-gather overlap is uniform across the model.
* `lambda_auto_wrap_policy` accepts a user-defined callable for fine-grained control over which submodules become units.

Users can also wrap submodules manually by calling `FSDP(submodule, ...)` directly. Manual wrapping is common when combining FSDP with [tensor parallelism](/wiki/tensor_parallelism), where certain layers need a specific wrapping order. FSDP2's `fully_shard` API replaces wrapping with a function call that registers a module as a sharded unit, avoiding the FlatParameter machinery that FSDP1 used to glue many `nn.Parameter` objects into a single contiguous shard [8].
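The effect of wrapping granularity on peak parameter memory can be estimated with a few lines of arithmetic. This is a rough sketch that follows the accounting used earlier in this article (the unit being computed is held in full, every other unit as a 1/N shard); it ignores prefetching, padding, gradients, and activations, and `peak_param_elements` is a hypothetical name.

```python
def peak_param_elements(unit_sizes, world_size):
    """Peak per-rank parameter elements when units are gathered one at a time."""
    total = sum(unit_sizes)
    return max(u + (total - u) / world_size for u in unit_sizes)


if __name__ == "__main__":
    blocks = [200_000_000] * 32  # 32 equal transformer blocks, 6.4B parameters
    world_size = 8
    one_unit = peak_param_elements([sum(blocks)], world_size)
    per_block = peak_param_elements(blocks, world_size)
    print(f"whole model as one unit: {one_unit / 1e9:.3f}B elements per rank")
    print(f"one unit per block:      {per_block / 1e9:.3f}B elements per rank")
```

With a single unit the peak equals the full model; with one unit per block it falls to one full block plus the shards of the rest, which is the reason per-layer wrapping is the usual starting point.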

## Mixed precision

FSDP integrates [mixed precision training](/wiki/mixed_precision_training) through the `MixedPrecision` configuration object, which exposes three independent dtypes [11][13]:

* `param_dtype` is used when parameters are gathered for forward and backward computation. Setting this to [bfloat16](/wiki/bfloat16) or [float16](/wiki/float16) cuts the all-gather payload in half versus fp32 and lets matrix multiplications run on tensor cores at higher throughput.
* `reduce_dtype` is used during the reduce-scatter of gradients. Many large-model recipes set this to `float32` even when `param_dtype` is bf16, because reductions over many ranks are sensitive to numeric error.
* `buffer_dtype` controls the dtype of non-parameter buffers (such as batch-norm running statistics).

A common recipe for [large language models](/wiki/large_language_model) is to gather and compute with parameters in `bfloat16` and reduce gradients in `float32`, while the optimizer holds master parameters and Adam moments in `float32`. This delivers nearly the throughput of pure bf16 with stability comparable to full fp32. FSDP performs the dtype casts internally during all-gather and reduce-scatter, so users do not need to manually cast tensors in the model code.
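Why the reduction dtype matters can be seen by emulating bfloat16 rounding in plain Python. This is an illustration of precision only, under loud assumptions: contributions are summed one after another, whereas real collectives use ring or tree orders, and `to_bf16` is a hypothetical helper that rounds a float32 bit pattern to its top 16 bits with round-to-nearest-even.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000  # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sequential_sum(values, round_fn=lambda v: v):
    """Add values one by one, rounding the accumulator after every addition."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


if __name__ == "__main__":
    contributions = [1.0] * 1000  # e.g. one equal gradient contribution per rank
    print("bf16 accumulator:", sequential_sum(contributions, to_bf16))
    print("wide accumulator:", sequential_sum(contributions))
```

Once the accumulator reaches 256, the spacing between adjacent bfloat16 values is 2, so adding 1 no longer changes it. Keeping `reduce_dtype` in `float32` avoids this kind of stall at the cost of a larger reduce-scatter payload.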

## Activation checkpointing

Sharding parameters, gradients, and optimizer states removes much of the model-state memory, but for transformers the activations stored for backward can still dwarf parameter memory at sequence lengths of 4k or more. [Activation checkpointing](/wiki/activation_checkpointing), also called gradient checkpointing, recomputes activations during the backward pass instead of storing them, trading roughly one extra forward pass per step (about one third more compute overall) for a large activation-memory reduction [14].
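The compute overhead can be estimated with a common rule of thumb, stated here as an assumption rather than a figure from the cited paper: a backward pass costs about twice a forward pass. Writing $F$ for the forward cost and $B \approx 2F$ for the backward cost, recomputing every checkpointed block adds one extra forward pass to each step:

$$
\frac{T_{\text{checkpointed}}}{T_{\text{baseline}}} = \frac{F + (F + B)}{F + B} \approx \frac{F + F + 2F}{F + 2F} = \frac{4}{3}
$$

That is, roughly one third more compute per training step when every block is recomputed.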

FSDP composes with activation checkpointing through `torch.distributed.algorithms._checkpoint.checkpoint_wrapper`. The standard recipe wraps each transformer block first with `checkpoint_wrapper` and then with FSDP, so that the same block is both sharded and recomputed. Recent PyTorch releases also support selective activation checkpointing, which keeps cheap-to-store activations and recomputes only the expensive ones [9]. Combined with `FULL_SHARD` and bf16 mixed precision, activation checkpointing is what makes it possible to train 70-billion-parameter models on clusters of 64 to 512 [NVIDIA H100](/wiki/nvidia_h100) GPUs without resorting to tensor or pipeline parallelism.

## CPU offloading

For extreme cases where even a sharded model does not fit on the GPU, FSDP can keep sharded parameters and gradients in CPU memory whenever they are not needed for computation, mirroring DeepSpeed ZeRO-Offload [10]. This is enabled with `cpu_offload=CPUOffload(offload_params=True)` [11]. The optimizer step then runs on the CPU, and shards are copied back to the GPU for the next iteration's all-gather. The cost is significant: every iteration moves each parameter shard to the GPU and each gradient shard back to the CPU over PCIe, and CPU optimizer steps are slow compared to GPU. CPU offload is most useful for fine-tuning very large models on a small number of [NVIDIA A100](/wiki/nvidia_a100) or H100 GPUs, where the alternative would be no training at all.

## State dict and checkpointing

Serializing an FSDP-trained model requires special care because no rank holds the full parameter tensor at rest. PyTorch supports several state-dict modes via `FSDP.state_dict_type` [11][15]:

* `FULL_STATE_DICT` materializes the unsharded parameter tensors. By default every rank receives the full state dict; setting `rank0_only=True` (usually together with `offload_to_cpu=True`) in `FullStateDictConfig` restricts it to rank 0. Convenient for inference checkpoints loaded outside the training cluster, but for very large models the full tensors may not fit in a single rank's CPU memory.
* `SHARDED_STATE_DICT` writes each rank's shard separately, producing a checkpoint shaped like the runtime layout. This avoids the all-gather and rank-0 memory pressure of the full mode and is recommended for distributed checkpointing of foundation models.
* `LOCAL_STATE_DICT` exposes the raw FlatParameter shards; deprecated in favor of `SHARDED_STATE_DICT`.

Since PyTorch 2.0, the recommended path is the **Distributed Checkpoint API (DCP)**, exposed as `torch.distributed.checkpoint`. DCP saves and loads sharded state dicts in a format that decouples the save layout from the load layout: a checkpoint saved on 32 GPUs can be reloaded on 64 GPUs with a different parallelism strategy without manual re-sharding [16]. DCP works with both FSDP1 and FSDP2 state dicts in modern PyTorch.
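The idea of decoupling the save layout from the load layout can be sketched with plain lists. This is a conceptual toy, not DCP's on-disk format or API: real DCP records per-tensor metadata and reads only the slices each rank needs, while the hypothetical helpers below simply rebuild the logical tensor and split it again.

```python
def shard_flat(values, world_size, pad=0.0):
    """Split a flat list into world_size equal shards, padding the tail."""
    shard_len = -(-len(values) // world_size)  # ceiling division
    padded = list(values) + [pad] * (shard_len * world_size - len(values))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def reshard(shards, numel, new_world_size):
    """Rebuild the logical tensor from saved shards and re-split it."""
    flat = [v for shard in shards for v in shard][:numel]  # drop old padding
    return shard_flat(flat, new_world_size)


if __name__ == "__main__":
    weights = [float(i) for i in range(10)]
    saved = shard_flat(weights, world_size=4)               # saved on 4 ranks
    loaded = reshard(saved, numel=10, new_world_size=3)     # loaded on 3 ranks
    print(saved)
    print(loaded)
```

The logical tensor and its true element count are what the checkpoint has to preserve; the number of shards is a property of the run that wrote or reads it.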

## What is the difference between FSDP1 and FSDP2?

In July 2024, PyTorch 2.4 introduced a redesigned FSDP API called **FSDP2**, exposed as `torch.distributed._composable.fsdp.fully_shard` [8]. The most fundamental change is the move from FlatParameter to per-parameter sharding via [DTensor](/wiki/dtensor). FSDP1 concatenated all parameters of a unit into a one-dimensional FlatParameter and split that flat buffer evenly across ranks. The flat-buffer approach was communication-efficient but caused several pain points: parameters smaller than the world size could not be cleanly sharded, introspection required reasoning about the flat layout, and integration with tensor parallelism was clumsy because parameters lost their original shape until unflattened.

FSDP2 instead represents each parameter as a `DTensor` carrying an explicit sharding spec, sharded along its leading dimension across the data-parallel mesh, and parameters retain their original logical shape throughout training. This makes FSDP2 compose naturally with tensor parallelism, also implemented on DTensor: a parameter can be tensor-parallel-sharded along one mesh dimension and FSDP-sharded along another in the same spec. This 2D parallelism, exemplified by TorchTitan, is the modern PyTorch recipe for training models above 100 billion parameters across many nodes [8]. Other FSDP2 improvements include lazy initialization, clearer mixed-precision semantics, and a more explicit lifecycle for the all-gather buffers. FSDP2 is the recommended choice for new projects; FSDP1 remains supported for backward compatibility.
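The difference between the two layouts can be made concrete with a small index calculation. This is a sketch with hypothetical helper names and toy 2D shapes; the dim-0 split assumes `torch.chunk`-style ceiling chunks, and neither function models the padding the real implementations add for communication.

```python
def flat_shard_owners(shapes, world_size):
    """FSDP1-style: per rank, the (param, start, end) element ranges it owns
    after all parameters are flattened and concatenated."""
    sizes = {name: rows * cols for name, (rows, cols) in shapes.items()}
    shard_len = -(-sum(sizes.values()) // world_size)  # ceiling division
    owners = [[] for _ in range(world_size)]
    offset = 0
    for name, size in sizes.items():
        start, end = offset, offset + size
        for r in range(world_size):
            lo = max(start, r * shard_len)
            hi = min(end, (r + 1) * shard_len)
            if lo < hi:
                owners[r].append((name, lo - start, hi - start))
        offset = end
    return owners


def dim0_shard_rows(shapes, world_size):
    """FSDP2-style: per rank, the (row_start, row_end) of every parameter."""
    out = [{} for _ in range(world_size)]
    for name, (rows, _cols) in shapes.items():
        chunk = -(-rows // world_size)
        for r in range(world_size):
            out[r][name] = (min(r * chunk, rows), min((r + 1) * chunk, rows))
    return out


if __name__ == "__main__":
    shapes = {"wq": (4, 3), "wo": (3, 4), "norm": (1, 3)}
    print(flat_shard_owners(shapes, 2))
    print(dim0_shard_rows(shapes, 2))
```

In the flat layout rank 0 owns all of `wq` plus the first two elements of `wo`, a cut that falls in the middle of a row. In the per-parameter layout every rank holds whole rows of every parameter, and a parameter with fewer rows than ranks simply leaves some ranks with an empty slice.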

## How does FSDP compare with other parallelism techniques?

FSDP sits in a family of distributed-training approaches that solve overlapping but distinct problems [4][9][17][18][19].

| Technique | Primary memory savings | Programming model | Typical scale |
| --- | --- | --- | --- |
| [DDP](/wiki/ddp) | None (full replication) | Data parallel | Models that fit on one GPU. |
| FSDP / ZeRO-3 | Parameters, gradients, optimizer states | Data parallel | 10M to 100B+ parameters. |
| [DeepSpeed](/wiki/deepspeed) | Same as FSDP plus ZeRO-Infinity, ZeRO-Offload, MoE | Custom engine | Same as FSDP, plus extreme offload. |
| [Megatron-LM](/wiki/megatron_lm) | Tensor and pipeline parallelism | Custom framework | 1B to 1T parameter dense LLMs. |
| Megatron-DeepSpeed | Tensor + pipeline + ZeRO | Custom hybrid | Trillion-parameter dense and MoE. |
| [Colossal-AI](/wiki/colossal_ai) | ZeRO + tensor + pipeline | Custom framework | Open-source large-model training. |
| [JAX](/wiki/jax) `pjit` / `shard_map` | Per-tensor SPMD sharding | Functional SPMD | TPU-first foundation models. |

FSDP shards storage but keeps compute data-parallel, while tensor and pipeline parallelism shard compute itself. FSDP can be combined with both: 2D and 3D parallelism layouts where FSDP shards across one mesh dimension and tensor or pipeline parallelism handles another are now standard at trillion-parameter scale. The most direct comparison is to DeepSpeed ZeRO Stage 3, which shards the same three categories of state; the practical difference is that FSDP ships inside PyTorch with native `nn.Module` integration, whereas DeepSpeed wraps the model in its own engine and adds offload and mixture-of-experts features that FSDP leaves to other libraries [4][9].

## Real-world usage

FSDP has been adopted across academia and industry as the dominant PyTorch path to large-model training.

Meta has reported using FSDP variants for parts of the Llama training pipeline. The Llama 2 paper (July 2023) describes training on Meta's Research Super Cluster and internal production clusters with PyTorch-based infrastructure that includes FSDP for the data-parallel dimension [20]. The Llama 3 herd-of-models paper (2024) describes a 4D parallelism stack (tensor, context, pipeline, data) for the 405B model, where FSDP-style data-parallel sharding remains a component [21]. The PyTorch FSDP VLDB 2023 paper itself reports scaling experiments at Meta with models up to 1 trillion parameters [9].

The published AWS scaling study is the most concrete public benchmark of FSDP throughput. On clusters of A100 GPUs, FSDP reached a maximum of 159 teraFLOP/s per GPU on a dense GPT-175B model (51 percent of the A100's 312 teraFLOP/s theoretical peak) and 84 teraFLOP/s per GPU on a dense 1-trillion-parameter model, the latter achieved with a batch size of 4 on 128 GPUs [26]. The VLDB paper reports near-linear TFLOPS scaling as GPU count grows, with the 175B model sustaining throughput from 128 to 512 GPUs [9].

The Hugging Face ecosystem integrates FSDP through [Accelerate](/wiki/accelerate), which exposes FSDP configuration through a YAML file or CLI prompts and handles wrapping policy, mixed precision, and state-dict serialization automatically. Hugging Face also documents FSDP support in the Transformers `Trainer` class for fine-tuning models like Llama, Mistral, Falcon, and Mixtral on multi-GPU nodes [22]. [PyTorch Lightning](/wiki/pytorch_lightning) ships an `FSDPStrategy` that selects FSDP as the distributed backend with one keyword argument [23]. [Mosaic Composer](/wiki/mosaic_composer), the training library behind MosaicML's MPT and Databricks's DBRX, defaults to FSDP for large-model runs and was instrumental in popularizing the `HYBRID_SHARD` strategy for cost-efficient training [24].

Research labs including Stanford CRFM, the Allen Institute for AI (AI2), and EleutherAI use FSDP in their open-source training stacks. AI2's OLMo training code, released in 2024, is built on FSDP with `transformer_auto_wrap_policy` and DCP checkpointing [25]. TorchTitan, a Meta-led reference repository released in 2024, demonstrates FSDP2 plus tensor parallelism plus pipeline parallelism on the Llama architecture as a canonical example of modern PyTorch large-model training. FSDP is also used outside language modeling: Stable Diffusion XL and Stable Diffusion 3 fine-tuners routinely use FSDP through Hugging Face's `diffusers` and Accelerate for full-parameter fine-tuning of the larger U-Net or DiT backbones.

## Limitations and gotchas

FSDP is significantly more complex than DDP, and several pitfalls trip up new users [9][11].

All-gather overhead can dominate at small batch sizes or when wrapping granularity is too fine. If every layer is its own unit, the per-step number of collectives explodes; if too few units are used, peak parameter memory rises because more parameters are held in fully gathered form simultaneously. Tuning the wrapping policy along with `forward_prefetch` and `backward_prefetch` is often the difference between FSDP being faster or slower than DDP at a given memory budget.

[Gradient accumulation](/wiki/gradient_accumulation) produces the same result with or without the `no_sync()` context manager, but the two differ in cost. Without it, FSDP reduce-scatters gradients on every micro-batch backward, so communication grows with the number of micro-batches. Inside `no_sync()`, FSDP keeps gradients local and unsharded, and the reduce-scatter happens in the first backward pass run outside the context, typically the final micro-batch. The trade-off is that during `no_sync()` each rank holds unsharded gradients for every unit that has already run backward, raising peak memory.
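The communication-versus-memory trade-off of `no_sync()` can be put into numbers with a rough cost model. This is a sketch with a hypothetical function name; it assumes equally sized units, counts one reduce-scatter per unit per synchronized backward, and ignores parameters, activations, and prefetching.

```python
def accumulation_cost(num_units, micro_batches, unit_grad_elems, world_size, use_no_sync):
    """Return (reduce_scatter_calls, peak_grad_elems_per_rank) for one optimizer step."""
    if use_no_sync:
        # Communicate only on the last micro-batch; until then every unit's
        # gradient stays unsharded on each rank.
        calls = num_units
        peak = num_units * unit_grad_elems
    else:
        # Every micro-batch reduce-scatters each unit; gradients are stored
        # sharded, with one unit's full gradient alive transiently.
        calls = num_units * micro_batches
        peak = num_units * unit_grad_elems / world_size + unit_grad_elems
    return calls, peak


if __name__ == "__main__":
    for flag in (False, True):
        calls, peak = accumulation_cost(32, 4, 1000, 8, use_no_sync=flag)
        print(f"no_sync={flag}: {calls} reduce-scatters, peak {peak:.0f} grad elements")
```

Whether the saved collectives are worth the extra gradient memory depends on how close the run already is to the memory limit.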

Mixed precision configuration is subtle. Setting `reduce_dtype` to fp16 (rather than bf16 or fp32) is a common cause of training divergence in large runs, because fp16 has too little dynamic range for averaged gradients across many ranks. Buffers (especially batch-norm statistics) often need to be left in fp32. State-dict serialization is also non-trivial: `FULL_STATE_DICT` requires assembling a full unsharded copy on rank 0, which can run that rank out of CPU memory at trillion-parameter scale. `SHARDED_STATE_DICT` plus DCP is the recommended path, but it produces a directory of files rather than a single `.pt` file, sometimes complicating downstream tooling.

FSDP1 has long-standing FlatParameter quirks when a unit's flattened parameters do not divide evenly across ranks: the flat buffer is padded to be divisible by the world size, and that padding leaks into introspection and mixed-precision casts. FSDP2 addresses this by sharding parameters individually as DTensors, though not every third-party library has migrated.

Cross-node communication can be the bottleneck even with the best wrapping policy. On clusters where nodes are connected by 200 Gbps [InfiniBand](/wiki/infiniband) but each node has eight GPUs on NVLink at terabits per second, inter-node `FULL_SHARD` collectives can stall the step. `HYBRID_SHARD` mitigates this at the cost of higher per-rank parameter memory. The choice depends on model size relative to per-node GPU memory and on network topology, and is one of the most consequential knobs in large FSDP runs.

Finally, FSDP composes with `torch.compile`, but the integration was not always seamless: early PyTorch 2.x releases had graph breaks at FSDP boundaries that undid much of the compile gain. As of PyTorch 2.4 and FSDP2 this has improved, but FSDP1 plus `torch.compile` should still be profiled carefully.

## See also

* [PyTorch](/wiki/pytorch), [DDP](/wiki/ddp), [DTensor](/wiki/dtensor)
* [DeepSpeed](/wiki/deepspeed) and [ZeRO](/wiki/zero_optimizer)
* [Megatron-LM](/wiki/megatron_lm), [Colossal-AI](/wiki/colossal_ai)
* [Tensor parallelism](/wiki/tensor_parallelism), [Pipeline parallelism](/wiki/pipeline_parallelism), [Data parallelism](/wiki/data_parallelism)
* [Mixed precision training](/wiki/mixed_precision_training), [Activation checkpointing](/wiki/activation_checkpointing)
* [Accelerate](/wiki/accelerate), [PyTorch Lightning](/wiki/pytorch_lightning), [Mosaic Composer](/wiki/mosaic_composer)

## References

1. Ott, M., Shleifer, S., Xu, M., Goyal, P., Duval, Q., and Caggiano, V. (July 15, 2021). "Fully Sharded Data Parallel: faster AI training with fewer GPUs." Engineering at Meta blog. https://engineering.fb.com/2021/07/15/open-source/fsdp/
2. PyTorch Team. (March 10, 2022). "PyTorch 1.11, TorchData, and functorch are now available." PyTorch blog. https://pytorch.org/blog/pytorch-1.11-released/
3. PyTorch Team. "FullyShardedDataParallel API documentation." PyTorch documentation. https://pytorch.org/docs/stable/fsdp.html
4. Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." In Proceedings of SC '20. arXiv:1910.02054. https://arxiv.org/abs/1910.02054
5. Hugging Face. "Fully Sharded Data Parallel." Accelerate documentation. https://huggingface.co/docs/accelerate/usage_guides/fsdp
6. Lightning AI. "FSDP Strategy." PyTorch Lightning documentation. https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.FSDPStrategy.html
7. Mosaic Composer Team. "FSDP in Composer." MosaicML Composer documentation. https://docs.mosaicml.com/projects/composer/en/stable/notes/distributed_training.html
8. PyTorch Team. (July 24, 2024). "PyTorch 2.4: Python 3.12, AOTInductor freezing, libuv backend." PyTorch blog (introduces FSDP2 / `fully_shard`). https://pytorch.org/blog/pytorch2-4/
9. Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao, Y., and Li, S. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." Proceedings of [VLDB](/wiki/vldb) 2023, vol. 16, pp. 3848-3860. arXiv:2304.11277. https://arxiv.org/abs/2304.11277
10. Ren, J., Rajbhandari, S., Aminabadi, R. Y., Ruwase, O., Yang, S., Zhang, M., Li, D., and He, Y. (2021). "ZeRO-Offload: Democratizing Billion-Scale Model Training." USENIX ATC '21. https://www.usenix.org/conference/atc21/presentation/ren-jie
11. PyTorch Team. "FSDP API: ShardingStrategy, MixedPrecision, CPUOffload, BackwardPrefetch." PyTorch documentation. https://pytorch.org/docs/stable/fsdp.html
12. PyTorch Team. "Getting Started with Fully Sharded Data Parallel (FSDP)." PyTorch tutorials. https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
13. Wright, L., et al. "Training Tips for Mixed Precision and FSDP." PyTorch blog. https://pytorch.org/blog/efficient-large-scale-training-with-pytorch/
14. Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016). "Training Deep Nets with Sublinear Memory Cost." arXiv:1604.06174. https://arxiv.org/abs/1604.06174
15. PyTorch Team. "Advanced Model Training with Fully Sharded Data Parallel (FSDP)." PyTorch tutorials. https://pytorch.org/tutorials/intermediate/FSDP_advanced_tutorial.html
16. PyTorch Team. "Distributed Checkpoint (DCP)." PyTorch documentation. https://pytorch.org/docs/stable/distributed.checkpoint.html
17. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." arXiv:1909.08053. https://arxiv.org/abs/1909.08053
18. Microsoft. "DeepSpeed: Extreme-scale model training for everyone." DeepSpeed documentation. https://www.deepspeed.ai/
19. HPC-AI Lab. "Colossal-AI: A unified deep learning system for large-scale parallel training." Colossal-AI documentation. https://colossalai.org/
20. Touvron, H., et al. (July 18, 2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. https://arxiv.org/abs/2307.09288
21. Llama Team, AI @ Meta. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. https://arxiv.org/abs/2407.21783
22. Hugging Face. "FSDP integration in Transformers Trainer." Transformers documentation. https://huggingface.co/docs/transformers/main/en/fsdp
23. Lightning AI. "Train models with billions of parameters using FSDP." PyTorch Lightning documentation. https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html
24. MosaicML. (May 5, 2023). "Introducing MPT-7B: A New Standard for Open-Source Commercially Usable LLMs." Databricks / MosaicML blog. https://www.databricks.com/blog/mpt-7b
25. Allen Institute for AI. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838. https://arxiv.org/abs/2402.00838
26. PyTorch. "Training a 1 Trillion Parameter Model With PyTorch Fully Sharded Data Parallel on AWS." PyTorch on Medium. https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff
