See also: Machine learning terms, Model parallelism, Pipeline parallelism, GPU computing, Deep learning, Distributed computing
Data parallelism is a distributed training technique in machine learning in which the input data is split across multiple processing units (typically GPUs), each holding a complete replica of the neural network model. Each device independently computes forward and backward passes on its local data partition, and the resulting gradients are aggregated across all devices before updating the model weights. Because every replica begins and ends each training step with identical parameters, data parallelism preserves the mathematical equivalence of single-device training while dramatically reducing wall-clock time.
Data parallelism is particularly useful when dealing with large-scale datasets and computationally intensive models, such as deep neural networks and other complex machine learning architectures. By distributing the workload across multiple resources, data parallelism accelerates training and enables the construction of more accurate and robust models.
Data parallelism is the most widely used form of parallelism in deep learning. Its conceptual simplicity, strong scaling properties, and broad framework support make it the default starting point for any multi-GPU training workload. It has been used at every scale, from two-GPU desktop setups to clusters of thousands of accelerators training large language models such as GPT-4 and LLaMA. Frameworks such as PyTorch, TensorFlow, and JAX all provide built-in primitives for data-parallel training.
The training loop under data parallelism proceeds through the following stages in each iteration:

1. Partition: the global mini-batch is split into equal shards, one per device.
2. Forward pass: each device runs the forward pass on its local shard using its own replica of the model.
3. Backward pass: each device computes gradients of the loss with respect to the model parameters on its local shard.
4. Gradient aggregation: the local gradients are averaged across all devices, typically via an all-reduce operation.
5. Weight update: every device applies the same averaged gradient, so all replicas remain identical.
This cycle repeats for every mini-batch until training completes. The net effect is that the effective batch size scales linearly with the number of devices: training on 8 GPUs with a per-GPU batch of 32 yields an effective batch of 256.
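To make the loop concrete, the sketch below implements one such step with raw `torch.distributed` collectives rather than a framework wrapper. It assumes the process group is already initialized (for example via `torchrun`); `model`, `optimizer`, `loss_fn`, and the input tensors are placeholders.

```python
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    # Forward and backward passes run on this worker's local batch shard.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Average gradients across all workers with a blocking all-reduce,
    # so every replica sees the gradient of the full effective batch.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    # Every worker applies the identical update, keeping replicas in sync.
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Production implementations (see PyTorch DDP below) fuse these per-parameter all-reduces into buckets and overlap them with the backward pass; this version favors clarity over speed.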
Imagine a teacher needs to grade 1,000 exams. Instead of grading them all alone, the teacher makes 8 exact copies of the answer key and hands 125 exams to each of 8 assistants. Every assistant grades their stack independently. When they finish, they compare notes to make sure they all agree on the grading criteria. Then they repeat with the next 1,000 exams. That is data parallelism: the "answer key" is the model, the exams are the training data, and "comparing notes" is the gradient aggregation step.
Another way to picture it is a group of friends working together to complete a large puzzle. Each person takes a portion of the puzzle pieces and works on their section at the same time, making the whole process faster. Sometimes the friends stop and share their progress with each other, so everyone knows what the others are doing (synchronous approach). Other times they work independently and only occasionally check in with each other (asynchronous approach). Data parallelism helps to solve big machine learning problems more quickly by dividing the work among many "friends" (computational resources).
The gradient aggregation step can be performed in two fundamentally different ways.
In synchronous training, all workers must finish computing their local gradients before any worker can proceed to the weight update. The gradients are averaged using a blocking all-reduce operation, and every worker applies the same update at the same time. In synchronous stochastic gradient descent (Sync-SGD), because no worker ever uses stale parameters, the optimization trajectory is mathematically equivalent to training on a single device with the full global batch size.
Advantages:

- Gradients are always current, so convergence is predictable and reproducible.
- The optimization trajectory matches large-batch single-device SGD, which makes results easy to reason about and debug.
- Hyperparameter tuning carries over directly from single-device training.
- Mature, well-supported implementations exist (PyTorch DDP, Horovod).
Disadvantages:

- Every step is gated by the slowest worker (the straggler problem), so hardware utilization suffers when workers are unevenly loaded.
- The blocking all-reduce adds communication overhead to every iteration.
- A single failed worker stalls the whole job unless elastic-training machinery is in place.
Synchronous data parallelism is the dominant paradigm today. High-speed interconnects like NVLink and InfiniBand have made communication overhead tolerable even at large scale, and the predictable convergence behavior makes synchronous training more reliable for production workloads.
In asynchronous training, workers do not wait for one another. Each worker reads the current parameters from a central parameter server, computes gradients on its local data, and pushes those gradients back to the server, which applies the update immediately. There is no global synchronization barrier.
Advantages:

- No worker ever waits for another, so hardware utilization stays near its maximum.
- Stragglers and failed workers do not stall the rest of the cluster, giving natural fault tolerance.
- Suits large heterogeneous clusters where worker speeds vary widely.
Disadvantages:

- A worker may compute gradients from parameters that other workers have already updated; these stale gradients add noise and can slow or destabilize convergence.
- Convergence behavior is harder to tune and to reproduce.
- The central parameter server can become a bandwidth bottleneck as the number of workers grows.
Researchers have proposed several remedies for gradient staleness, including dividing gradients by their staleness factor, using delay-compensated updates, and employing bounded-staleness protocols that allow limited asynchrony while capping the maximum staleness.
Asynchronous SGD was explored extensively in early work (Dean et al., 2012; Recht et al., 2011), but synchronous methods have largely superseded it for state-of-the-art training because the practical throughput gains rarely compensate for the statistical efficiency loss.
Some systems use a hybrid strategy: workers are organized into groups that synchronize internally (using all-reduce within the group) while groups communicate asynchronously with one another. For instance, workers within the same node may synchronize over NVLink, while nodes communicate asynchronously over a slower Ethernet network. This balances the consistency of synchronous training with some of the fault tolerance and throughput benefits of asynchronous methods.
| Property | Synchronous | Asynchronous | Hybrid |
|---|---|---|---|
| Gradient freshness | Always current | May be stale | Current within group |
| Convergence behavior | Predictable, reproducible; equivalent to large-batch single-device SGD | Noisier, harder to tune; guarantees weaker and dependent on staleness bounds | Intermediate |
| Hardware utilization | Bottlenecked by slowest worker | Near-maximum | High |
| Communication pattern | All-reduce (blocking) | Parameter server (non-blocking) | Mixed |
| Implementation complexity | Moderate (all-reduce collectives) | Higher (parameter server, staleness management) | Variable |
| Hyperparameter tuning | Straightforward | Requires tuning for staleness | Intermediate |
| Dominant use today | Yes (PyTorch DDP, Horovod) | Rare; large heterogeneous clusters | Occasional |
The all-reduce operation is the backbone of synchronous data parallelism. It takes a tensor from each of N workers, applies a reduction (typically summation), and distributes the result back so that every worker holds the identical reduced tensor. Several algorithms exist, each with different trade-offs.
Ring all-reduce, popularized in the deep learning community by Baidu's 2017 work, arranges workers in a logical ring. The algorithm proceeds in two phases:

1. Reduce-scatter: each worker's tensor is divided into N chunks. Over N-1 steps, every worker sends one chunk to its neighbor and adds the chunk it receives to its own copy, so that each worker finishes holding one fully reduced chunk.
2. All-gather: over another N-1 steps, the fully reduced chunks circulate around the ring until every worker holds the complete reduced tensor.
The total data transmitted per worker is 2(N-1)/N times the tensor size, which approaches 2x the tensor size for large N. Crucially, this per-worker communication volume is essentially independent of the number of workers, making ring all-reduce bandwidth-optimal; throughput is limited only by the slowest link in the ring. However, the algorithm requires 2(N-1) sequential steps, so its latency grows linearly with N. For clusters with many nodes, this latency can become a bottleneck.
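The following toy simulation, a NumPy sketch rather than a real communication library, runs both phases and checks the result against a direct sum. The `ring_all_reduce` function and its index arithmetic are illustrative only.

```python
import numpy as np

def ring_all_reduce(worker_data):
    """Simulate ring all-reduce: every worker ends with the element-wise sum."""
    n = len(worker_data)
    # Each worker's tensor is split into n chunks.
    chunks = [np.array_split(np.asarray(d, dtype=float), n) for d in worker_data]

    # Phase 1: reduce-scatter. At step s, worker i sends chunk (i - s) % n to
    # its right neighbor, which adds it. After n-1 steps, worker i holds the
    # fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, idx, payload in sends:
            chunks[(i + 1) % n][idx] = chunks[(i + 1) % n][idx] + payload

    # Phase 2: all-gather. At step s, worker i forwards chunk (i + 1 - s) % n;
    # the receiver overwrites its stale copy. After n-1 steps every worker
    # holds the complete reduced tensor.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, idx, payload in sends:
            chunks[(i + 1) % n][idx] = payload

    return [np.concatenate(c) for c in chunks]

# Verify against a direct sum: 4 workers, 12-element gradient vectors.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(12) for _ in range(4)]
for result in ring_all_reduce(grads):
    assert np.allclose(result, np.sum(grads, axis=0))
```

Each of the 2(N-1) steps moves one chunk (1/N of the tensor) per worker, which is where the 2(N-1)/N total volume comes from.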
Tree all-reduce organizes workers in a binary (or k-ary) tree. The reduce phase flows from leaves to root, and the broadcast phase flows from root to leaves. Tree all-reduce has lower latency (O(log N) steps) but can create bandwidth bottlenecks at the root. It is more suitable when latency dominates over bandwidth, for example with small tensors or very large worker counts.
Recursive halving-doubling combines the strengths of both approaches by using a halving phase for reduce-scatter and a doubling phase for all-gather, achieving both bandwidth optimality and O(log N) latency for power-of-two worker counts.
In multi-node clusters, NCCL and other communication libraries often use a hierarchical approach: intra-node communication runs over NVLink or NVSwitch (high bandwidth, low latency), while inter-node communication runs over InfiniBand or RoCE. The reduce is performed first within each node and then across nodes, minimizing expensive cross-network traffic.
NVIDIA's Collective Communications Library (NCCL) provides highly optimized implementations of all-reduce and other collectives for multi-GPU and multi-node setups. NCCL automatically selects the best algorithm and communication topology based on the hardware configuration. Most modern deep learning frameworks, including PyTorch, use NCCL as the default communication backend for GPU-based distributed training.
| Algorithm | Bandwidth cost per worker | Latency (steps) | Best suited for |
|---|---|---|---|
| Ring all-reduce | 2(N-1)/N * tensor size | 2(N-1) | Large tensors, moderate worker count |
| Tree all-reduce | ~2 * tensor size | 2 log N | Small tensors, large worker count |
| Recursive halving-doubling | ~2 * tensor size | 2 log N | Power-of-two worker counts |
| Hierarchical | Varies (two-level) | Varies | Multi-node clusters with mixed interconnects |
Several software frameworks provide production-ready implementations of data parallelism. The table below summarizes the most prominent ones.
| Framework | Developer | Key Mechanism | Supported Backends | Notes |
|---|---|---|---|---|
| PyTorch DDP | Meta (Facebook) | All-reduce with gradient bucketing | NCCL, Gloo, MPI | Default choice for PyTorch users |
| PyTorch FSDP | Meta (Facebook) | Sharded parameters, all-gather + reduce-scatter | NCCL | Enables training models that do not fit in a single GPU's memory |
| Horovod | Uber (now LF AI & Data) | Ring all-reduce via MPI/NCCL | NCCL, MPI, Gloo | Framework-agnostic; supports TensorFlow, PyTorch, MXNet |
| DeepSpeed ZeRO | Microsoft | Sharded optimizer states, gradients, and parameters | NCCL | Three progressive sharding stages |
| tf.distribute | Google | MirroredStrategy (all-reduce) and others | NCCL, RING | Native TensorFlow distribution |
| JAX pmap/pjit | Google | XLA-compiled SPMD parallelism | XLA collectives | Functional transformation-based API |
PyTorch DDP is the standard API for data-parallel training in PyTorch. It wraps a model and transparently handles gradient synchronization during backpropagation.
How DDP works internally:

- At construction time, DDP broadcasts the model's parameters and buffers from the rank-0 process to all other processes via `ProcessGroup::broadcast()`, so every replica starts from identical weights.
- During the backward pass, gradients are grouped into buckets whose size is controlled by the `bucket_cap_mb` parameter (default 25 MB). Bucketed all-reduce begins as soon as all gradients in a bucket are ready, overlapping communication with the remaining backward computation.
- Unlike the older `DataParallel` module (which broadcasts parameters from a single GPU on every forward pass), DDP keeps parameters synchronized through gradient all-reduce alone, avoiding a major bottleneck.

DDP requires that all processes call all-reduce operations in exactly the same order. This is enforced by using the bucket index order rather than the order in which buckets become ready. Mismatched ordering can lead to incorrect results or deadlocks.
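A minimal DDP script might look like the sketch below (the model class, dataset, and epoch count are hypothetical placeholders); it would be launched with, for example, `torchrun --nproc_per_node=8 train.py`.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])  # MyModel: placeholder
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # DistributedSampler hands each process a disjoint shard of the data.
    sampler = DistributedSampler(train_dataset)                       # placeholder dataset
    loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

    for epoch in range(90):
        sampler.set_epoch(epoch)  # reshuffle the sharding each epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(local_rank)
            targets = targets.cuda(local_rank)
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()   # bucketed all-reduce overlaps with this call
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```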
tf.distribute.MirroredStrategy is TensorFlow's equivalent for synchronous data-parallel training on a single machine with multiple GPUs. Each variable in the model is mirrored across all replicas, and gradients are aggregated using NCCL-based all-reduce by default.
tf.distribute.MultiWorkerMirroredStrategy extends this to multi-node training. It requires a TF_CONFIG environment variable that specifies the cluster topology (worker addresses and roles). The API is designed so that code written for single-machine MirroredStrategy can be extended to multiple nodes with minimal changes.
Both strategies use a strategy.scope() context manager to distribute variables and computation, and they integrate directly with model.fit() as well as custom training loops.
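A sketch of the single-machine case, assuming a hypothetical `build_model()` helper and a prepared `tf.data` dataset:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables created inside the scope are mirrored on every GPU.
with strategy.scope():
    model = build_model()  # placeholder: returns an uncompiled tf.keras.Model
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# The global batch is split across replicas; gradients are aggregated
# with all-reduce before each synchronized update.
model.fit(train_dataset, epochs=10)  # train_dataset: placeholder tf.data.Dataset
```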
Horovod, developed at Uber and published by Sergeev and Del Balso (2018), is a framework-agnostic distributed training library that supports PyTorch, TensorFlow, Keras, and Apache MXNet. It uses ring all-reduce via NCCL and was designed to require minimal code changes. Typically, adapting a single-GPU training script to use Horovod requires adding only a few lines of code. Its API centers on wrapping the optimizer to inject all-reduce calls during backpropagation.
Early versions relied on Baidu's open-source ring-allreduce implementation, but later releases switched to NCCL, which provides a highly optimized implementation. Horovod introduced Tensor Fusion, an algorithm that batches small gradient tensors together before performing the all-reduce, reducing the number of communication operations and improving throughput.
Horovod uses MPI (Message Passing Interface) for process management, making it familiar to the high-performance computing community. The horovodrun launcher simplifies starting distributed jobs across multiple nodes. Benchmarks in the original paper demonstrated near-linear scaling on up to 512 GPUs for training Inception-v3 and ResNet-101 on ImageNet. Horovod was widely used before DDP matured but remains popular in environments that need multi-framework support.
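The typical set of additions to a single-GPU PyTorch script looks like this sketch (the model is a placeholder); the job would be started with, for example, `horovodrun -np 8 python train.py`.

```python
import horovod.torch as hvd
import torch

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = MyModel().cuda()  # placeholder model
# Scale the learning rate by the number of workers (linear scaling rule).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size())

# Wrap the optimizer so ring all-reduce runs during backpropagation.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Broadcast initial state from rank 0 so all replicas start identical.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```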
Ideally, training on N GPUs should be N times faster than training on a single GPU. In practice, communication overhead and other factors reduce this ratio. The scaling efficiency is defined as:
Scaling efficiency = (Time on 1 GPU) / (N * Time on N GPUs)
A scaling efficiency of 1.0 (or 100%) represents perfect linear scaling. Real-world systems typically achieve 85-95% efficiency on fast interconnects for large models.
Goyal et al. (2017) established a practical guideline for large-batch training in their paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." The linear scaling rule states:
When the mini-batch size is multiplied by k, multiply the learning rate by k.
The intuition is that with k times more data per step, the gradient estimate is k times more accurate (lower variance), and a proportionally larger step can be taken without increasing noise. All other hyperparameters (weight decay, momentum coefficient, etc.) are kept unchanged. Using this rule, the authors trained ResNet-50 on ImageNet with a batch size of 8,192 across 256 GPUs in one hour, matching the accuracy of small-batch baselines. They reported approximately 90% scaling efficiency when moving from 8 to 256 GPUs.
The linear scaling rule breaks down in the early stages of training, when the network parameters are changing rapidly and the loss landscape is highly non-linear. Applying a very large learning rate from the start can cause training to diverge.
Goyal et al. also introduced a gradual warmup scheme: the learning rate starts small and is linearly increased to the target value over the first few epochs (typically 5 epochs in their experiments). Warmup prevents the large learning rate from causing divergence early in training, when the model parameters are far from any reasonable optimum and the loss landscape is poorly conditioned. During warmup, the network moves to a region of the loss landscape where the linear scaling assumption holds more reliably.
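A sketch of both ideas, using Goyal et al.'s ResNet-50 reference numbers (base learning rate 0.1 at batch size 256, five warmup epochs); the model and the per-step scheduling are placeholders:

```python
import torch

base_lr, base_batch = 0.1, 256
global_batch = 8192                               # e.g. 256 GPUs x 32 per GPU
scaled_lr = base_lr * global_batch / base_batch   # linear scaling rule

steps_per_epoch = 1_281_167 // global_batch       # ImageNet training images
warmup_steps = 5 * steps_per_epoch                # 5-epoch gradual warmup

optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)  # model: placeholder

def warmup_factor(step):
    # Ramp linearly from ~0 to 1 over the warmup period, then hold at 1.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)
# Call scheduler.step() after every optimizer step during training.
```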
Gradual warmup has since become a standard practice in large-scale training and is used in the pre-training of models such as BERT, GPT, and Vision Transformers.
While data parallelism enables training with very large effective batch sizes, this introduces its own set of difficulties. The linear scaling rule has practical limits. Beyond a certain batch size (which varies by task and model), increasing the batch size no longer reduces training time proportionally and may degrade final accuracy. This phenomenon is sometimes called the critical batch size.
Multiple studies have observed that models trained with very large batch sizes can suffer a "generalization gap," achieving the same training loss as small-batch models but worse test accuracy. The leading explanation is that large-batch optimization tends to converge to sharp minima in the loss landscape, where small perturbations to the parameters cause large increases in loss. In contrast, the inherent noise of small-batch gradient descent acts as implicit regularization that favors flatter minima, which generalize better. Techniques like learning rate warmup, LARS, LAMB, and carefully tuned learning rate schedules help mitigate this effect.
To push batch sizes even further, researchers have developed specialized optimizers:

- LARS (Layer-wise Adaptive Rate Scaling; You et al., 2017) scales each layer's learning rate by the ratio of its weight norm to its gradient norm, enabling ResNet-50 training at a batch size of 32,768.
- LAMB (You et al., 2020) applies the same layer-wise adaptation to Adam, enabling BERT pre-training with batch sizes up to 65,536.
LARS and LAMB address the observation that different layers can have very different gradient-to-weight ratios, and a single global learning rate (even when linearly scaled) may be too large for some layers and too small for others. By normalizing the update for each layer independently, these optimizers maintain stable training at batch sizes of 32K or more.
| Optimizer / Technique | Authors | Key Idea | Notable Result / Typical Batch Size |
|---|---|---|---|
| Linear scaling rule | Goyal et al., 2017 | Scale learning rate proportionally to batch size | Up to ~8,192 |
| Learning rate warmup | Goyal et al., 2017 | Prevent divergence at training start | All large-batch training |
| LARS | You et al., 2017 | Per-layer learning rate scaling based on weight and gradient norms | ResNet-50 with batch size 32,768 |
| LAMB | You et al., 2020 | Adam with layer-wise adaptive learning rates | BERT pre-training in 76 minutes (batch 65,536) |
Gradient accumulation is a technique that simulates a large effective batch size on hardware that cannot fit the full batch in memory. Instead of processing the entire batch at once, the batch is divided into smaller micro-batches. The model performs a forward and backward pass on each micro-batch and accumulates (sums or averages) the gradients. The optimizer step is applied only after all micro-batches have been processed. If the accumulation factor is A and the micro-batch size is B, the effective batch size is A * B.
For example, if the target effective batch size is 1,024 but only a batch of 256 fits in GPU memory, the training loop processes four micro-batches of 256 and accumulates their gradients before performing a single weight update. The resulting parameter update is mathematically identical to processing the full batch of 1,024 at once.
Gradient accumulation is sometimes called "poor man's data parallelism" because it achieves the same effective batch size increase as adding more GPUs, but at the cost of proportionally longer training time (since the micro-batches are processed sequentially on the same device). It can also be combined with multi-GPU data parallelism. For example, with 8 GPUs, a micro-batch size of 4, and an accumulation factor of 4, the effective global batch size is 8 * 4 * 4 = 128. This allows practitioners to reach batch sizes that would otherwise require more GPUs than are available.
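A minimal sketch of the accumulation loop (`model`, `optimizer`, `loss_fn`, and `loader` are placeholders):

```python
accum_steps = 4  # effective batch = accum_steps x micro-batch size

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets)
    # Divide by the accumulation factor so the summed gradients match
    # what one pass over the full batch would produce.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one update per accum_steps micro-batches
        optimizer.zero_grad()
```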
Key considerations:

- If gradients are summed rather than averaged, the loss of each micro-batch must be divided by the accumulation factor so that the final update matches a single full-batch pass.
- Batch normalization statistics are still computed per micro-batch, so results can differ slightly from true large-batch training.
- PyTorch provides a `no_sync()` context manager for DDP that skips gradient synchronization during intermediate accumulation steps, reducing communication overhead.

Standard data parallelism requires every device to store a full copy of the model parameters, gradients, and optimizer states. For a model with P parameters trained in mixed precision with Adam, the memory footprint per device is approximately 16P bytes (2P for fp16 parameters, 2P for fp16 gradients, 4P for fp32 master weights, 4P for first moments, 4P for second moments). This redundancy limits the maximum model size that data parallelism can handle.
ZeRO (Zero Redundancy Optimizer), proposed by Rajbhandari et al. (2020) at Microsoft Research and implemented in the DeepSpeed library, eliminates this redundancy by partitioning the model states across data-parallel workers rather than replicating them. ZeRO has three progressive stages:
Stage 1 (optimizer state partitioning, P_os): Each worker stores only 1/N of the optimizer states (for Adam, these include the 32-bit master weights, first moment estimates, and second moment estimates). After gradients are reduced, each worker updates only its partition of the parameters and then broadcasts the updated parameters. This reduces optimizer state memory by a factor of N, yielding up to 4x total memory reduction in mixed-precision training with no additional communication overhead.
Stage 2 (gradient partitioning, P_os+g): In addition to optimizer states, gradients are also partitioned. Each worker retains only the gradient shard that corresponds to its optimizer state partition. Gradients for other partitions are reduced and discarded immediately after the reduce-scatter operation. This further reduces memory consumption, up to 8x compared to baseline DDP.
Stage 3 (parameter partitioning, P_os+g+p): The 16-bit model parameters themselves are partitioned. Each worker stores only 1/N of the parameters. Before the forward and backward passes, parameters are gathered on demand using all-gather operations and freed immediately after use. This achieves memory reduction proportional to the degree of data parallelism, enabling training of models that are far larger than what would fit on a single device, at the cost of 1.5x communication volume relative to DDP.
| ZeRO stage | What is partitioned | Memory per device (approx.) | Memory savings | Communication overhead |
|---|---|---|---|---|
| Stage 0 (baseline DDP) | Nothing (full replication) | 16P bytes | 1x | 1x (all-reduce) |
| Stage 1 (P_os) | Optimizer states | ~4P + 12P/N bytes | Up to 4x | 1x |
| Stage 2 (P_os+g) | Optimizer states + gradients | ~2P + 14P/N bytes | Up to 8x | 1x |
| Stage 3 (P_os+g+p) | Optimizer states + gradients + parameters | ~16P/N bytes | Linear with worker count | 1.5x (additional all-gather) |
ZeRO has been instrumental in enabling the training of very large models. The original paper demonstrated the feasibility of training models with over 100 billion parameters. ZeRO also supports offloading optimizer states and parameters to CPU memory or NVMe storage (ZeRO-Offload and ZeRO-Infinity), enabling training of trillion-parameter models on limited GPU hardware at the cost of slower throughput.
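A sketch of enabling ZeRO through a DeepSpeed configuration; the key names follow DeepSpeed's JSON config schema, while the specific values and the `model` object are illustrative assumptions:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # partition optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},  # optional ZeRO-Offload
    },
}

# deepspeed.initialize wraps the model with the ZeRO-enabled engine.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,                         # placeholder model
    model_parameters=model.parameters(),
    config=ds_config,
)
```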
PyTorch FSDP is PyTorch's native implementation of ZeRO Stage 3 concepts. While standard DDP requires each GPU to hold a complete copy of the model parameters, gradients, and optimizer states, FSDP shards all three across data-parallel workers, dramatically reducing per-GPU memory consumption.
How FSDP works:

1. At initialization, each layer's parameters are flattened and sharded across workers, so each worker persistently stores only its own 1/N shard.
2. Before the forward pass of an FSDP unit, an all-gather reconstructs that unit's full parameters; they are freed again as soon as the unit's forward computation finishes.
3. During the backward pass, the unit's parameters are all-gathered once more, and the resulting gradients are reduce-scattered so each worker keeps only the gradient shard matching its parameter shard.
4. Each worker updates only its local shard of the parameters and optimizer states.

FSDP wraps model layers in a nested hierarchy, so only one layer at a time needs its full parameters materialized. This means peak GPU memory is determined by the size of the largest single FSDP unit plus the sharded states of all other units.
FSDP offers several sharding strategies:
| Sharding Strategy | What Is Sharded | Communication Overhead vs. DDP |
|---|---|---|
| FULL_SHARD | Optimizer states, gradients, and parameters | ~1.5x |
| SHARD_GRAD_OP (ZeRO Stage 2) | Optimizer states and gradients | ~1.0x |
| NO_SHARD | Nothing (equivalent to DDP) | 1.0x |
| HYBRID_SHARD | Full shard within a node, replicate across nodes | Between 1.0x and 1.5x |
The trade-off is clear: full sharding offers the lowest memory footprint per GPU but incurs the most communication. HYBRID_SHARD is often a practical middle ground for multi-node training, as intra-node communication over NVLink is fast while inter-node communication over InfiniBand is slower.
FSDP supports implicit prefetching (automatically issuing all-gather for the next layer while the current layer is computing) and optional CPU offloading. PyTorch FSDP2, introduced in more recent PyTorch releases, simplifies the API and improves composability with other parallelism strategies and with PyTorch features like activation checkpointing and mixed precision training.
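A minimal wrapping sketch (process group initialization and the model class are assumed, as in the DDP example above):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = FSDP(
    MyModel().cuda(),                                # placeholder model
    sharding_strategy=ShardingStrategy.FULL_SHARD,   # ZeRO Stage 3 behavior
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# The training loop itself is unchanged: the forward and backward passes
# trigger the all-gather and reduce-scatter operations automatically.
```

In practice an `auto_wrap_policy` is usually supplied so that individual transformer blocks become separate FSDP units, keeping the peak memory of any single all-gather small.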
Mixed precision training, introduced by Micikevicius et al. in their ICLR 2018 paper, is a technique that combines half-precision (FP16) and single-precision (FP32) floating-point arithmetic to accelerate training while maintaining model accuracy. It is frequently used alongside data parallelism to further improve throughput and reduce memory consumption.
Core components of mixed precision training:

- FP16 storage and arithmetic: most weights, activations, and gradients are stored and computed in half precision, halving memory traffic and enabling Tensor Core acceleration.
- FP32 master weights: a single-precision copy of the weights accumulates the updates, preventing small updates from being lost to FP16 rounding.
- Loss scaling: the loss is multiplied by a scale factor before backpropagation so that small gradient values do not underflow in FP16; gradients are unscaled before the optimizer step.
Benefits for data-parallel training:

- Gradient all-reduce transfers FP16 values, halving communication volume relative to FP32.
- Lower memory use per replica allows larger per-GPU batch sizes, improving the computation-to-communication ratio.
- Tensor Cores deliver substantially higher throughput for FP16 matrix operations, shortening the compute portion of each step.
PyTorch provides torch.cuda.amp (automatic mixed precision) with GradScaler and autocast context managers that integrate seamlessly with DDP and FSDP. DeepSpeed and Horovod also provide built-in support for mixed precision training.
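A sketch of the standard AMP pattern with these utilities (the surrounding DDP setup, `loader`, and `optimizer` are assumed):

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad()
    with autocast():  # run the forward pass in FP16 where it is safe
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)          # unscales gradients, then steps
    scaler.update()                 # adjust the scale factor dynamically
```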
More recently, BF16 (bfloat16) has gained popularity as an alternative to FP16. BF16 uses the same number of exponent bits as FP32, giving it a much wider dynamic range, which eliminates the need for loss scaling in most cases. Google's TPU hardware natively supports BF16, and NVIDIA's Ampere and later GPU architectures also include BF16 Tensor Core support.
Data parallelism is one of three primary parallelism strategies used in distributed deep learning. The choice depends on model size, available hardware, and communication bandwidth.
| Aspect | Data Parallelism | Model Parallelism (tensor) | Pipeline Parallelism |
|---|---|---|---|
| What is replicated | Model (full copy on each worker) | Data (same batch on all workers) | Data (micro-batches flow through stages) |
| What is split | Data (each worker gets a shard) | Model (layers or tensors split across workers) | Model (consecutive layers assigned to stages) |
| Communication pattern | Gradient all-reduce (once per step) | Point-to-point activation/gradient transfers (every layer boundary) | Point-to-point between pipeline stages |
| Communication volume | Moderate | High (per-layer) | Lower than tensor parallelism |
| Memory per worker | Full model + optimizer states | Fraction of model + associated optimizer states | Fraction of model + micro-batch activations |
| When to use | Model fits in one GPU; dataset is large | Model does not fit in one GPU; very large layers (e.g., attention heads) | Model does not fit in one GPU; very deep models; want to overlap computation |
| Scaling bottleneck | Communication overhead at high worker counts (unless combined with ZeRO/FSDP) | Activation transfer latency; load imbalance; high communication overhead | Pipeline bubble (idle time); micro-batch scheduling |
| Implementation complexity | Low | High (requires manual partitioning or automated tools) | Medium to High (micro-batch scheduling, pipeline flushes) |
| Effective batch size | Increases with workers | Unchanged | Unchanged (but micro-batches increase throughput) |
In practice, large-scale training systems combine multiple strategies. This is often called 3D parallelism: data parallelism across groups of nodes, pipeline parallelism across nodes within a group, and tensor (model) parallelism across GPUs within a node. For example, the Megatron-LM framework from NVIDIA uses all three strategies simultaneously to train models with hundreds of billions of parameters. State-of-the-art training systems for very large models (such as GPT-4, PaLM, and LLaMA) use this hybrid approach, matching each parallelism dimension to the appropriate level of the hardware topology.
Scaling data parallelism beyond a single machine introduces network communication as a critical bottleneck. The hardware and software stack for multi-node training includes several key components.
NCCL (pronounced "nickel") is NVIDIA's library for collective communication primitives (all-reduce, all-gather, reduce-scatter, broadcast) optimized for NVIDIA GPUs. NCCL automatically detects the topology of the system, including PCIe, NVLink, NVSwitch, InfiniBand, and RoCE connections, and selects the optimal communication algorithm. It is the default communication backend for both PyTorch DDP and TensorFlow's distributed strategies when running on NVIDIA hardware.
NVLink is NVIDIA's high-bandwidth, low-latency point-to-point interconnect for GPUs within the same node. NVLink 4.0 (used in H100 GPUs) provides 900 GB/s of bidirectional bandwidth per GPU. NVSwitch enables all-to-all NVLink connectivity among all GPUs in a node, eliminating the topology constraints of point-to-point links. These interconnects make intra-node all-reduce extremely fast.
For inter-node communication, InfiniBand (IB) and RDMA over Converged Ethernet (RoCE) provide high-bandwidth, low-latency networking with remote direct memory access (RDMA). RDMA enables data to move directly between GPU memory on different nodes without involving the CPU, minimizing latency. NVIDIA's ConnectX network adapters and Quantum InfiniBand switches are commonly used in large AI training clusters. HDR InfiniBand provides 200 Gb/s per port, and NDR provides 400 Gb/s.
| Interconnect | Scope | Bandwidth (bidirectional) | Typical use |
|---|---|---|---|
| NVLink 4.0 | Intra-node (GPU to GPU) | 900 GB/s per GPU | All-reduce within a node |
| NVSwitch | Intra-node (all-to-all) | Full bisection bandwidth | All-to-all within a node |
| InfiniBand NDR | Inter-node | 400 Gb/s per port | Gradient sync across nodes |
| RoCE | Inter-node | Up to 400 Gb/s | Alternative to InfiniBand |
| PCIe Gen5 | Intra-node (CPU-GPU) | 128 GB/s per x16 slot | CPU offloading, data loading |
Communication overhead is the primary bottleneck in data-parallel training. Every iteration requires synchronizing gradients across all workers, and the time spent in communication directly reduces the fraction of time spent on useful computation. The communication cost of data-parallel training is determined by three factors:

1. Gradient volume: the number of bytes to synchronize grows linearly with the model's parameter count.
2. Interconnect performance: available bandwidth and per-message latency set how quickly that volume can move.
3. Synchronization frequency: gradients are exchanged once per training step, so the less computation each step contains, the less work there is to hide the communication behind.
This analysis reveals why data parallelism scales better for large models (more computation per step) and on fast interconnects (less time per all-reduce). The ratio of computation time to communication time (the computation-to-communication ratio) determines scaling efficiency. Models with a high parameter-to-FLOP ratio (such as large transformer models) tend to be more communication-bound.
| Factor | Impact |
|---|---|
| Model size | Larger models produce more gradient data to synchronize |
| Number of workers | Ring all-reduce latency grows linearly with worker count |
| Interconnect bandwidth | Higher bandwidth (NVLink, InfiniBand) reduces transfer time |
| Network topology | Intra-node communication (NVLink) is much faster than inter-node (Ethernet) |
| Gradient compression | Techniques like quantization and sparsification reduce the volume of data transferred |
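To make this concrete, the back-of-envelope sketch below estimates all-reduce time and scaling efficiency from the ring all-reduce volume formula; every number in the example (model size, worker count, link speed, compute time) is an illustrative assumption.

```python
def allreduce_seconds(n_params, n_workers, bandwidth_gbps, bytes_per_grad=2):
    """Ring all-reduce moves ~2(N-1)/N of the gradient bytes per worker."""
    grad_bytes = n_params * bytes_per_grad            # e.g. FP16 gradients
    volume = 2 * (n_workers - 1) / n_workers * grad_bytes
    return volume / (bandwidth_gbps * 1e9 / 8)        # Gb/s -> bytes/s

# Example: 1B-parameter model, 64 workers, 400 Gb/s (InfiniBand NDR).
t_comm = allreduce_seconds(1_000_000_000, 64, 400)    # ~0.079 s
t_compute = 0.5                                       # assumed compute per step
print(f"all-reduce {t_comm:.3f}s, "
      f"efficiency {t_compute / (t_compute + t_comm):.1%}")  # ~86%
```

The estimate ignores latency and the overlap of communication with backward computation, both of which matter in real systems; it is meant only to show how the computation-to-communication ratio drives efficiency.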
Several approaches have been developed to mitigate communication costs:

- Overlapping communication with computation: launching all-reduce for early gradient buckets while later gradients are still being computed, as in DDP's bucketing.
- Gradient compression: quantization (communicating gradients in fewer bits) and sparsification (transmitting only the largest gradient components) reduce the volume of data transferred.
- Less frequent synchronization: gradient accumulation and periodic-averaging schemes exchange gradients every few steps instead of every step.
- Hierarchical, topology-aware collectives: reducing within a node over NVLink before communicating across nodes over the slower network.
While data parallelism can significantly accelerate the training process in machine learning, it also presents several challenges and trade-offs.
As the number of computational resources increases, the communication overhead between resources may also increase, potentially negating the benefits of data parallelism. This issue is particularly pronounced in synchronous data parallelism, where resources must frequently synchronize their updates.
Uneven distribution of data or computational resources can lead to load balancing issues, where some resources may be idle while others are still processing their data subsets. This can reduce the overall efficiency of the data parallelism approach and may require careful partitioning and resource allocation strategies. Dynamic batching, heterogeneous-aware schedulers, and straggler mitigation techniques can help address load imbalance.
Synchronous and asynchronous data parallelism offer different trade-offs between consistency and speed. Synchronous approaches provide better consistency but may be slower, while asynchronous approaches can be faster but may lead to less accurate models due to gradient staleness.
Training with very large batch sizes can sometimes lead to models that generalize less well to unseen data, even when training loss is comparable to small-batch baselines. This has been attributed to large-batch training converging to sharp minima rather than flat minima.
As a practical debugging aid, setting the environment variable `NCCL_DEBUG=INFO` makes NCCL log its detected topology and chosen algorithms, which helps identify communication bottlenecks.

Data parallelism has been a core idea in parallel computing since the 1980s, with roots in SIMD (Single Instruction, Multiple Data) architectures and early data-parallel languages like CM Fortran for the Connection Machine. In the machine learning context, data-parallel SGD became practical with the rise of GPU computing in the early 2010s.
Key milestones include:

- 2011: Recht et al. propose Hogwild!, lock-free asynchronous SGD on shared-memory systems.
- 2012: Dean et al. describe Google's DistBelief, popularizing parameter-server-based asynchronous training at scale.
- 2017: Baidu adapts ring all-reduce for deep learning, and Goyal et al. train ResNet-50 on ImageNet in one hour on 256 GPUs using the linear scaling rule and warmup.
- 2018: Sergeev and Del Balso release Horovod, bringing ring all-reduce to multiple frameworks.
- 2020: Rajbhandari et al. introduce ZeRO, removing the memory redundancy of data parallelism and enabling models with over 100 billion parameters.
- 2020s: PyTorch FSDP and 3D-parallel systems such as Megatron-LM combine sharded data parallelism with tensor and pipeline parallelism to train state-of-the-art large language models.