DiLoCo
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,220 words
DiLoCo (Distributed Low-Communication training) is a distributed optimization algorithm for neural networks introduced by Google DeepMind in November 2023 to train large language models across loosely connected "islands" of compute. Each island performs many local optimization steps with AdamW on its own data shard, and only periodically averages parameter deltas across workers, after which an outer optimizer (Nesterov momentum on the averaged delta) updates the global weights. In the original paper, DiLoCo with eight workers and 500 inner steps between communications matched the validation perplexity of fully synchronous data parallel training on the C4 dataset while exchanging roughly 500 times less data.[^1] The method has become the foundation for a line of follow up work on geographically distributed training, including Prime Intellect's OpenDiLoCo, the INTELLECT-1 and INTELLECT-2 community trained models, and Streaming DiLoCo, which extends the communication efficiency another order of magnitude.[^2][^3][^4][^5]
| Property | Value |
|---|---|
| Original paper | "DiLoCo: Distributed Low-Communication Training of Language Models" (arXiv:2311.08105) |
| Originating lab | Google DeepMind |
| First public version | 14 November 2023 |
| Lead authors | Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen |
| Algorithm class | Local SGD / federated averaging variant with outer Nesterov momentum |
| Inner optimizer | AdamW |
| Outer optimizer | Nesterov momentum |
| Headline result | Matches synchronous data parallel quality on C4 with roughly 500x less communication |
| Notable open implementation | OpenDiLoCo (Prime Intellect, Apache 2.0) |
| Notable production runs | INTELLECT-1 (10B, November 2024); INTELLECT-2 (32B post training, May 2025) |
| Successor | Streaming DiLoCo (DeepMind, January 2025) |
Training frontier language models with conventional data parallelism requires that every accelerator exchange a copy of every gradient at every step. In practice this is done with an all-reduce collective over high speed interconnects such as NVLink or InfiniBand, and the bandwidth requirement grows with both model size and the number of workers. Fully Sharded Data Parallel (FSDP) and DeepSpeed reduce the per device memory footprint by sharding parameters, gradients, and optimizer state, but they do not reduce the per step communication volume; if anything, they increase the number of collective operations per step.[^6]
The result is that synchronous data parallelism essentially forces large training runs into a single tightly coupled cluster, because a few hundred microseconds of added latency per all-reduce can dominate iteration time. As model sizes grew through 2022 and 2023, several research groups began asking whether language models could instead be trained across many smaller clusters, perhaps connected only by ordinary internet links of tens or hundreds of megabits per second. DiLoCo is one of the most influential answers to that question.[^1]
The algorithm has roots in two older traditions. The first is local stochastic gradient descent and federated averaging, introduced by McMahan and colleagues for federated learning in 2017, where edge devices perform several local SGD steps between rounds of communication.[^7] The second is the Lookahead optimizer of Zhang and colleagues (2019), which interpolates the fast inner weights with a slow outer copy after k inner steps, and the SlowMo / FedMom family of methods, which apply outer momentum to averaged gradients in a federated setting.[^1] DiLoCo is best understood as a fusion of these ideas, tailored to language model pretraining rather than to phone style federated learning.
A useful framing is to count the bytes exchanged per training step. In a vanilla data parallel run with k workers and a model of P parameters, each all-reduce moves roughly 2P bytes per worker per step when gradients are exchanged in FP16. For a 1 billion parameter model trained for 100,000 steps, that adds up to roughly 200 terabytes of cumulative communication per worker, and 64 times that in total across a 64 worker run. Over a datacenter interconnect of 100 Gbit/s or more the per step cost is acceptable; over a 1 Gbit/s wide area link it is not.
DiLoCo's core observation is that the per step granularity is mostly unnecessary. If workers communicate only every H steps and then exchange one averaged delta of size P, the cumulative communication is reduced by a factor of approximately H, with no change to the total training compute. The challenge is showing that this infrequent synchronization schedule does not damage final model quality, which is what the DiLoCo paper set out to demonstrate.[^1]
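The arithmetic behind this reduction can be checked with a short sketch. It uses the same simplification as the text (2 bytes per parameter per exchange, ignoring the extra traffic a specific all-reduce algorithm adds); the function names are illustrative and not taken from any DiLoCo codebase.

```python
def data_parallel_bytes(params: int, steps: int, bytes_per_value: int = 2) -> int:
    """Cumulative bytes one worker exchanges when it synchronizes every step."""
    return params * bytes_per_value * steps


def diloco_bytes(params: int, steps: int, inner_steps: int, bytes_per_value: int = 2) -> int:
    """Cumulative bytes one worker exchanges when it synchronizes every `inner_steps` steps."""
    rounds = steps // inner_steps
    return params * bytes_per_value * rounds


P = 1_000_000_000  # 1 billion parameters
STEPS = 100_000    # total training steps
H = 500            # inner steps between synchronizations

dp = data_parallel_bytes(P, STEPS)
dl = diloco_bytes(P, STEPS, H)

print(f"data parallel: {dp / 1e12:.1f} TB per worker")
print(f"DiLoCo (H={H}): {dl / 1e12:.1f} TB per worker")
print(f"reduction: {dp // dl}x")
```

Under these assumptions the per worker total drops from 200 TB to 0.4 TB, a factor equal to H.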
The DiLoCo algorithm consists of two nested loops, an inner loop run independently on each of k workers and an outer loop that synchronizes them. The inner loop is ordinary stochastic gradient descent style optimization (AdamW in the published configuration), and the outer loop is a Nesterov momentum step on the averaged delta.[^1]
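Let theta denote the model parameters, k the number of workers (islands), and H the number of inner steps between communications. For each outer step t from 1 to T:
1. Each worker i copies the current global parameters theta^(t-1).
2. Each worker runs H inner optimizer steps on its own data shard, producing local parameters theta_i^(t).
3. Each worker computes its pseudo gradient delta_i = theta^(t-1) - theta_i^(t).
4. The pseudo gradients are averaged across the k workers.
5. The outer optimizer applies the averaged pseudo gradient to theta^(t-1) to obtain the new global parameters theta^(t).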
In the published configuration the inner optimizer is AdamW with learning rate roughly 4e-4 and a cosine schedule, while the outer optimizer is Nesterov SGD with learning rate 0.7 and momentum 0.9. The number of inner steps H is typically 500, meaning each communication round corresponds to 500 sequential mini batches per worker.[^1]
The phrase "pseudo gradient" is borrowed from the federated learning literature and emphasizes that delta_i is not the gradient of any single batch; it is the negative of the parameter displacement produced by H steps of an arbitrary inner optimizer. The outer step therefore averages displacements rather than instantaneous gradients, which is the property that allows the inner optimizer to make rapid progress on local data without being held back by every other worker.[^1]
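The two nested loops can be illustrated with a minimal NumPy sketch on a toy least-squares problem. Several things here are assumptions made for brevity rather than details of the paper: plain gradient descent stands in for AdamW as the inner optimizer, the workers are simulated in a single process instead of communicating through an all-reduce, the inner learning rate and step counts are arbitrary, and the function names are hypothetical. The outer update uses the Nesterov formulation found in common deep learning libraries.

```python
import numpy as np


def inner_sgd(theta, data, targets, steps, lr):
    """Run `steps` plain gradient steps on a least-squares loss (stand-in for AdamW)."""
    theta = theta.copy()
    for _ in range(steps):
        grad = data.T @ (data @ theta - targets) / len(targets)
        theta -= lr * grad
    return theta


def diloco(shards, theta0, rounds, inner_steps, inner_lr=0.05, outer_lr=0.7, momentum=0.9):
    """Toy DiLoCo loop: local steps per shard, then a Nesterov step on the averaged delta."""
    theta = theta0.copy()
    velocity = np.zeros_like(theta)
    for _ in range(rounds):
        # Each simulated worker starts from the shared parameters and trains on its own shard.
        local = [inner_sgd(theta, x, y, inner_steps, inner_lr) for x, y in shards]
        # Averaged pseudo gradient: shared parameters minus each worker's result.
        delta = np.mean([theta - theta_i for theta_i in local], axis=0)
        # Nesterov momentum step on the pseudo gradient.
        velocity = momentum * velocity + delta
        theta = theta - outer_lr * (delta + momentum * velocity)
    return theta


rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
shards = []
for _ in range(4):  # four simulated workers, each with its own data shard
    x = rng.normal(size=(64, 5))
    shards.append((x, x @ w_true))

theta = diloco(shards, np.zeros(5), rounds=100, inner_steps=20)
print("max abs error:", float(np.max(np.abs(theta - w_true))))
```

On this noiseless toy problem the shared parameters converge to the generating weights even though the workers exchange information only once every 20 inner steps.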
The table below summarizes the canonical DiLoCo configuration on C4 from the v1 paper.[^1]
| Hyperparameter | Value |
|---|---|
| Inner optimizer | AdamW |
| Inner learning rate | 4e-4 (cosine) |
| Outer optimizer | Nesterov momentum |
| Outer learning rate | 0.7 |
| Outer momentum | 0.9 |
| Inner steps per round (H) | 500 |
| Workers (k) | 8 (main); 4, 16, 64 in ablation |
| Sequence length | 1,024 tokens |
| Batch size per worker | 512 |
| Model sizes | 60M, 150M, 400M parameters |
| Dataset | C4 (Common Crawl) |
The outer learning rate of 0.7 looks unusually large, but it scales an averaged parameter displacement rather than a raw gradient: without momentum, an outer learning rate of 1.0 would simply replace the global weights with the average of the worker weights. In practice the outer optimizer behaves like a slow consensus mechanism: the workers can drift apart by hundreds of inner steps, and the outer step pulls them back to a single shared trajectory once per round.[^1]
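To see why an outer learning rate near 1 is a natural scale, consider the outer step without momentum. The symbol eta for the outer learning rate is introduced here for illustration; the other symbols follow the definitions used above.

```latex
\begin{align}
\Delta^{(t)} &= \frac{1}{k}\sum_{i=1}^{k}\left(\theta^{(t-1)} - \theta_i^{(t)}\right) \\
\theta^{(t)} &= \theta^{(t-1)} - \eta\,\Delta^{(t)}
             = (1-\eta)\,\theta^{(t-1)} + \eta\,\frac{1}{k}\sum_{i=1}^{k}\theta_i^{(t)}
\end{align}
```

With eta equal to 1 the update reduces to plain parameter averaging, and with eta equal to 0.7 it interpolates 70 percent of the way from the previous global weights toward that average. The Nesterov momentum term in DiLoCo then adds an extrapolation along the accumulated direction on top of this interpolation.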
A useful way to think about DiLoCo is that it averages weights at low frequency rather than gradients at every step. The Lookahead optimizer (Zhang et al., 2019) does something similar with a single replica, treating the inner sequence of AdamW updates as exploration and the outer interpolation as a stabilizing pullback toward a slow moving anchor.[^1] DiLoCo generalizes this to k replicas: each worker plays the role of an inner trajectory, and the outer step is the consensus pullback. SlowMo and FedMom similarly apply outer momentum to averaged updates, but they were studied at the smaller scales typical of federated learning rather than in the language model pretraining regime that DiLoCo targets.[^1]
This view also explains why DiLoCo is robust to non identically distributed data shards: because the inner optimizer is allowed to make significant local progress before averaging, each worker can specialize to its shard during the inner loop, and the outer step folds those specializations together. The original paper reports that DiLoCo tolerates data heterogeneity and dynamic resource availability that would be challenging for standard data parallelism.[^1]
The choice of Nesterov momentum on the outer loop is not incidental. In ordinary stochastic optimization, momentum accelerates convergence by accumulating consistent gradient directions over time and damping oscillations across high curvature directions. In the DiLoCo outer loop, the "gradient" being smoothed is a pseudo gradient produced by H steps of inner AdamW, so the directions it represents are already averaged over hundreds of mini batches and may include a mix of consistent low frequency drift and round to round noise from inner optimizer randomness. The DeepMind authors find that a moderate to large outer momentum coefficient (0.9 in the published configuration) is essential to stabilizing this signal and that disabling outer momentum leads to noticeably worse perplexity at the same number of outer rounds.[^1]
A subtler observation, raised in subsequent analysis work such as the Step K Nesterov Outer Optimizer (SNOO) study, is that DiLoCo's strong performance can be attributed substantially to the Lookahead style component with Nesterov momentum applied to the pseudo gradient, rather than purely to data parallel averaging across workers. SNOO finds that even in a single worker setting, several inner steps followed by a Nesterov outer update on the cumulative displacement can outperform plain inner optimization at matched compute, which recasts DiLoCo as a combined optimization and distribution innovation rather than a pure distribution trick.[^16]
A separate predecessor is the Elastic Averaging SGD (EASGD) framework, in which each worker maintains a local copy of the model parameters that is softly pulled toward a shared center variable via a spring like coupling. The pull strength is governed by an elastic constant rho, and the center variable itself moves slowly under the same coupling. DiLoCo can be viewed as a discretized, hard pull variant of elastic averaging: instead of a continuous spring, workers periodically snap back to the consensus, and the outer Nesterov step plays the role of the slow moving center variable's update law. This connection is invoked in the original DiLoCo paper to justify the algorithm's robustness, although the practical hyperparameters of DiLoCo differ from typical EASGD configurations by orders of magnitude in H.[^1]
The DiLoCo paper reports experiments on three model sizes, 60M, 150M, and 400M parameters, all decoder only Transformers with Chinchilla like architectures, trained on the C4 subset of Common Crawl.[^1] The primary result is that for a 150M parameter model trained for 88,000 steps, DiLoCo with eight workers and H equal to 500 achieves a validation perplexity of roughly 15.0, lower than a fully synchronous baseline using either a single worker or an equivalent eight times larger batch, while communicating roughly 500 times less often than the synchronous case.[^1]
The paper also reports ablations on the number of workers (k in {4, 8, 16, 64}), on the inner step count H, and on the inner and outer optimizers, finding that the algorithm is robust across a wide range of these settings, that increasing k generally improves quality at fixed token budget, and that the Nesterov outer step is critical: SGD or no outer momentum yields visibly worse perplexity.[^1] In a follow up scaling study from DeepMind, the authors find that DiLoCo's optimal batch size is larger than that of vanilla data parallel training, that downstream task generalization improves with scale, and that the method outperforms data parallel training even at small model sizes when properly tuned.[^8]
DiLoCo is sometimes confused with classical distributed training paradigms, but it occupies a distinct niche. The comparison below summarizes the differences with the parallelism dimensions most relevant to LLM training.[^1][^6]
| Strategy | What is partitioned | Communication frequency | Bandwidth requirement | Typical network |
|---|---|---|---|---|
| Data parallel (DDP) | data batches; full parameters replicated | every step (all-reduce of gradients) | High | NVLink / InfiniBand |
| FSDP | data batches; parameters sharded | every step (gather + reduce-scatter) | Highest | NVLink / InfiniBand |
| Tensor parallel | individual tensor dimensions | every layer | Very high (intra step) | NVLink within a node |
| Pipeline parallel | stages of the network | once per micro batch boundary | Medium | InfiniBand across nodes |
| DiLoCo | data shards across islands | every H inner steps (500 in paper) | Low | ordinary internet, 100 Mbit/s class |
Within an island the workers can still use any combination of FSDP, tensor parallelism, and pipeline parallelism for intra island scaling; DiLoCo is orthogonal to these techniques and sits above them in the parallelism stack.[^1][^2]
DiLoCo is sometimes described as "federated learning for pretraining," and the analogy is close enough to be illuminating, but there are important differences in regime and tuning. In a typical federated setting such as FedAvg deployed on mobile phones, the model is small (often less than 10M parameters), data is non IID by construction, the number of clients per round is large but only a tiny subset participate per round, and the inner step count is small (often a single epoch over local data).[^7] Communication is dominated by upload bandwidth from the client to the server, and privacy is a primary motivation.
DiLoCo flips most of these parameters. Models range from 60M to multi billion parameters; data is generally IID because each worker is given a random shard of a centralized dataset; all workers participate in every round (in the synchronous variant); the inner step count is large (500 in the canonical configuration); and privacy is not a goal. The optimization recipe is also tuned for pretraining loss surfaces rather than the convex or near convex problems often analyzed in classical federated learning theory. As a result, DiLoCo's hyperparameters and empirical guarantees do not transfer directly from FedAvg style methods, even though the algorithm is mathematically a member of the same family.[^1][^7]
DiLoCo was originally a research artifact at DeepMind with no public reference implementation, but its open source reproductions and extensions have since become a small ecosystem in their own right.
Prime Intellect released OpenDiLoCo on 11 July 2024, an Apache 2.0 reproduction of the DiLoCo algorithm built on PyTorch FSDP for intra island scaling and the Hivemind distributed hash table library for inter island communication. Authored by Sami Jaghouar, Jack Min Ong, and Johannes Hagemann, the accompanying paper (arXiv:2407.07852) reproduced the DiLoCo numbers on a 150M model and scaled the implementation to a 1.1 billion parameter run, nearly three times larger than the original paper's largest setting.[^2]
The headline OpenDiLoCo demonstration ran four DiLoCo workers, each with eight NVIDIA H100 GPUs, distributed across three countries on two continents (Canada via Hyperstack, Finland via DataCrunch, and the United States via Voltage Park and RunPod) with inter island bandwidth between 127 and 935 Mbit/s. The authors report a 90 to 95 percent compute utilization in the globally distributed setting, with the all-reduce step occupying only about 6.9 percent of training time.[^2] An ablation in the paper showed that the pseudo gradient all-reduce could be performed in FP16 without measurable degradation.[^2]
OpenDiLoCo is no longer actively maintained: the GitHub repository now redirects users to Prime Intellect's successor framework, prime (and the prime-rl variant used for INTELLECT-2), which the team describes as offering better fault tolerance and bandwidth utilization for production decentralized training.[^9][^4]
INTELLECT-1 is a 10 billion parameter decoder only language model trained by Prime Intellect using a hybrid DiLoCo / FSDP2 stack across globally distributed nodes contributed by 30 independent compute providers. The model was released on 29 November 2024, with a technical report (arXiv:2412.01152) published on 2 December 2024.[^3][^10]
The Llama 3 style architecture has 42 layers, hidden dimension 4096, 32 attention heads, and an 8,192 token context window; it was pretrained on roughly 1 trillion tokens drawn primarily from FineWeb-Edu (55 percent), Stack v2 (20 percent), FineWeb (10 percent), DCLM baseline (10 percent), and OpenWebMath (5 percent).[^3][^10] Training ran across 14 concurrent nodes spanning five countries and three continents, using up to 112 H100 GPUs simultaneously.
The Prime Intellect team reports an overall 83 percent compute utilization across continents and 96 percent when training was restricted to nodes within the United States, together with a 400x reduction in communication bandwidth versus standard data parallel training. This figure combines the H step DiLoCo reduction with an additional custom int8 quantized all-reduce over the pseudo gradients.[^3][^10] The supporting framework, prime, introduces ElasticDeviceMesh for dynamic process group management and live checkpoint recovery to handle the high failure rate of community contributed nodes.[^10]
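The contribution of int8 quantization to this figure can be illustrated with a small sketch. It is not the prime framework's implementation: it assumes the simplest possible scheme, symmetric quantization with one scale per tensor, and the function names are hypothetical. It shows only that an int8 payload is a quarter the size of an FP32 one and that the rounding error is bounded by half a quantization step.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Map float values to int8 using one shared scale (symmetric, per tensor)."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reproduces it exactly
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 payload."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
# Synthetic stand-in for a pseudo gradient tensor.
delta = rng.normal(scale=1e-3, size=10_000).astype(np.float32)

q, scale = quantize_int8(delta)
restored = dequantize_int8(q, scale)

print("bytes as fp32:", delta.nbytes, "bytes as int8:", q.nbytes)
print("max abs error:", float(np.max(np.abs(restored - delta))))
print("half a quantization step:", scale / 2)
```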
INTELLECT-2 is a 32 billion parameter reasoning model based on Alibaba's QwQ-32B, post trained via fully asynchronous reinforcement learning across a permissionless swarm of compute contributors. It was released on 11 May 2025 under Apache 2.0, with a technical report at arXiv:2505.07291.[^4][^11]
INTELLECT-2 uses a different recipe from INTELLECT-1: instead of pretraining with DiLoCo as the core synchronization scheme, it applies an asynchronous variant of GRPO called prime-rl, together with auxiliary components TOPLOC (a locality sensitive hashing system that verifies inference rollouts to detect tampering or hardware noise) and SHARDCAST (an HTTP based tree topology for distributing updated policy weights). The training data is roughly 285,000 verifiable math and coding tasks from NuminaMath-1.5, DeepScaleR, and SYNTHETIC-1, with two sided token probability ratio clipping and aggressive gradient clipping added to GRPO for stability.[^4][^11] DiLoCo itself is listed in the report as a candidate future technique for fusing independently trained checkpoints into a single model.[^11]
Streaming DiLoCo, posted to arXiv on 30 January 2025 by Douillard and colleagues at Google DeepMind, addresses two practical limitations of the original DiLoCo: the peak bandwidth spike at the moment all workers exchange P parameters, and the wall clock cost of pausing training during that exchange.[^5]
It contributes three changes. First, the network is partitioned into fragments (in the published configuration, roughly three transformer layers per fragment), and at each synchronization only one fragment is exchanged rather than all parameters at once; this reduces the peak bandwidth by a factor of roughly the total number of layers divided by the fragment size, about 8x in the canonical setup. Second, communication of each fragment overlaps with continued inner optimization for tau steps, so workers do not stall while the all-reduce is in flight. Third, the exchanged pseudo gradients are compressed to a 4 bit floating point format (E3M0) with FP32 accumulation, while random dropout based compression was found to be ineffective.[^5]
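The fragment idea can be sketched as a simple schedule. This is an illustration under stated assumptions, not the paper's exact schedule: it assumes a 24 layer model (chosen so that three layers per fragment gives eight fragments), equally sized layers, and synchronization offsets spread evenly across a round of H inner steps. The function names are hypothetical.

```python
def make_fragments(num_layers: int, layers_per_fragment: int) -> list[list[int]]:
    """Group consecutive layer indices into fragments."""
    return [
        list(range(start, min(start + layers_per_fragment, num_layers)))
        for start in range(0, num_layers, layers_per_fragment)
    ]


def fragments_due(step: int, num_fragments: int, inner_steps: int) -> list[int]:
    """Return the fragments whose synchronization starts at this inner step.

    Every fragment is synchronized once per `inner_steps` steps, each at its
    own offset, so the transfers are staggered rather than simultaneous.
    """
    stride = inner_steps // num_fragments
    position = step % inner_steps
    return [f for f in range(num_fragments) if position == (f + 1) * stride]


fragments = make_fragments(num_layers=24, layers_per_fragment=3)
H = 500

for step in range(H):
    for f in fragments_due(step, len(fragments), H):
        print(f"step {step}: synchronize fragment {f} (layers {fragments[f]})")

# With equally sized layers, each transfer carries 1/len(fragments) of the model.
print("peak transfer reduction:", len(fragments))
```

Because at most one fragment is in flight at any step, the largest single transfer shrinks by the number of fragments, while the total volume per round is unchanged.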
Combining these three techniques on a 1 billion parameter model trained on Dolma yields a roughly 400x reduction in total bits exchanged versus a vanilla data parallel baseline while matching evaluation loss and downstream task accuracy on HellaSwag, PIQA, and ARC Easy. Streaming DiLoCo experiments span 35M to 4B parameter models and 25B to 250B token budgets, all using Chinchilla style architectures.[^5] The authors describe the result as a step toward "a distributed free lunch," in the sense that the bandwidth tax of distributed training can be made very small without observable quality loss.[^5]
The DeepMind team also published "Asynchronous Local-SGD Training for Language Modeling" (arXiv:2401.09135) on 17 January 2024, which addresses the case where workers cannot synchronize on a fixed schedule, for example because they are heterogeneous in compute speed.[^12] A naive asynchronous version of DiLoCo, in which each worker updates the global parameters as soon as it finishes H inner steps, underperforms the synchronous variant because the outer Nesterov momentum amplifies stale pseudo gradients. The paper proposes two corrections: a Delayed Nesterov scheme that postpones the momentum update relative to the gradient injection, and Dynamic Local Updates, in which each worker's H is adjusted on the fly to reflect its measured throughput.[^12] With these fixes, async DiLoCo matches the synchronous variant on per update perplexity while improving wall clock time on heterogeneous hardware. The accompanying code is published as google-deepmind/asyncdiloco.[^12]
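The idea behind Dynamic Local Updates can be sketched in a few lines. The proportional rule below is an assumption made for illustration (each worker's step count is scaled by its speed relative to the fastest worker); the paper's exact rule may differ, and the function name and throughput figures are hypothetical.

```python
def dynamic_local_steps(throughputs: dict[str, float], base_steps: int) -> dict[str, int]:
    """Scale each worker's inner step count by its speed relative to the fastest worker."""
    fastest = max(throughputs.values())
    return {
        name: max(1, round(base_steps * speed / fastest))
        for name, speed in throughputs.items()
    }


# Hypothetical measured throughputs in inner steps per minute.
steps = dynamic_local_steps({"a": 100.0, "b": 50.0, "c": 80.0}, base_steps=500)
print(steps)  # {'a': 500, 'b': 250, 'c': 400}
```

Under this rule all three workers finish their inner loops in about the same wall clock time, so slower devices no longer deliver their pseudo gradients much later than faster ones.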
A second DeepMind paper, "Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo" (arXiv:2503.09799, March 2025), studies how DiLoCo behaves under a fixed compute budget across model sizes.[^8] The headline findings are that DiLoCo scales as a power law in model size with a slightly more favorable exponent than data parallel training, that its compute optimal batch size is larger than that of AdamW data parallel training, and that the gap between DiLoCo and data parallel actually closes (or inverts in DiLoCo's favor) as models grow. The authors interpret this as evidence that DiLoCo is not only a low bandwidth fallback but a genuinely competitive optimization recipe at scale.[^8]
A separate but related effort from DeepMind, DiPaCo (Distributed Path Composition, arXiv:2403.10616, March 2024), combines DiLoCo style outer optimization with a modular architecture in which each training example flows through one of many "paths" of shared modules. At inference time, only a single path is executed per input, sidestepping the need to load the full model. The authors report that on C4, a DiPaCo model using 256 paths of 150M parameters outperforms a 1 billion parameter dense baseline in the same number of training steps, while using less wall clock time thanks to DiLoCo style low frequency communication.[^13] DiPaCo is mainly of interest as an architectural co design rather than a drop in replacement for DiLoCo, but it illustrates how the low communication outer loop can be repurposed for Mixture of Experts like designs.
Beyond the named papers above, the DiLoCo idea has spawned a family of variant algorithms that treat its outer loop as a generic recipe to be combined with different inner optimizers and compression schemes. MuLoCo studies whether the recently popularized Muon optimizer can serve as a drop in replacement for AdamW as the inner optimizer in DiLoCo, finding that Muon's structured updates compose cleanly with the DiLoCo outer loop.[^14] Communication efficient pretraining methods such as SparseLoCo apply gradient sparsification to the pseudo gradients exchanged in each outer round, trading some quality for additional bandwidth reduction.[^15] None of these has yet displaced the canonical DiLoCo recipe in production use, but they illustrate that the algorithm is being treated as a portable component rather than a fixed pipeline.
DiLoCo and its successors are widely cited as the technical basis for the recent wave of decentralized pretraining efforts pursued by Prime Intellect, Together AI, and others, as well as for multi datacenter training within hyperscalers. Before DiLoCo, training a state of the art model across two datacenters tens of milliseconds apart was widely considered impractical because synchronous all-reduce penalties scale poorly with link latency; with DiLoCo, the latency tax is amortized over hundreds of inner steps, and inter datacenter bandwidth becomes the binding constraint rather than per step jitter.[^1][^2]
The most visible demonstration of this shift is INTELLECT-1, the first publicly trained 10B parameter language model whose pretraining was distributed across multiple continents and multiple independent organizations, an outcome that would not have been feasible with vanilla data parallelism.[^3][^10] The 400x communication reduction reported for INTELLECT-1 and the comparable figure for Streaming DiLoCo are at the heart of arguments that frontier model training does not need to remain locked to a small number of hyperscale clusters.[^3][^5]
There are also more pragmatic implications inside individual organizations. DiLoCo style outer loops allow training jobs to be split across multiple datacenters of the same provider that are connected by ordinary backbone fiber rather than dedicated low latency lanes. This is sometimes referred to in industry as "multi DC pretraining," and several large labs have reported using DiLoCo derivatives or close cousins to operate above the single datacenter scale, although precise architectural details remain proprietary.
A second order effect of low communication training algorithms is that the unit of compute that can profitably participate in a large pretraining run becomes smaller. With pure synchronous data parallelism, a useful contribution to a frontier training run requires colocating hundreds of GPUs on a single low latency fabric. With DiLoCo, a small cluster of eight H100 GPUs sitting on a 1 Gbit/s link can be a productive participant, because it needs to synchronize only once every H steps rather than at every step. This dramatically lowers the barrier to entry for non hyperscale operators, university clusters, and individual contributors, which in turn underpins the commercial proposition of decentralized training platforms such as Prime Intellect's permissionless network. The INTELLECT-1 and INTELLECT-2 demonstrations are the strongest existence proofs to date that this market structure can actually train competitive models, although the specific economic terms of compute contribution and reward attribution remain in flux.[^3][^4][^11]
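A back of the envelope sketch makes the 1 Gbit/s claim concrete. The inputs are assumptions chosen for illustration: a 1 billion parameter model exchanged in FP16, a compute time of one second per inner step, no overlap between computation and communication, and a single transfer of P values per synchronization (ignoring all-reduce topology). The function name is hypothetical.

```python
def communication_fraction(
    params: float,
    bytes_per_value: int,
    link_bits_per_s: float,
    step_seconds: float,
    steps_between_syncs: int,
) -> float:
    """Fraction of wall clock time spent waiting on synchronization."""
    sync_seconds = params * bytes_per_value * 8 / link_bits_per_s
    compute_seconds = step_seconds * steps_between_syncs
    return sync_seconds / (compute_seconds + sync_seconds)


# Assumed setup: 1B parameters, FP16 payload, 1 Gbit/s link, 1 s per inner step.
every_step = communication_fraction(1e9, 2, 1e9, 1.0, steps_between_syncs=1)
every_500 = communication_fraction(1e9, 2, 1e9, 1.0, steps_between_syncs=500)

print(f"synchronizing every step: {every_step:.1%} of time spent communicating")
print(f"synchronizing every 500 steps: {every_500:.1%} of time spent communicating")
```

Under these assumptions each synchronization takes 16 seconds, which swamps a one second step but is a small overhead when it happens once per 500 steps.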
Several caveats limit how broadly DiLoCo can be applied today.
First, DiLoCo reduces communication frequency, not the per round communication volume. In the canonical algorithm, every outer round still exchanges roughly P parameters across all workers, just less often. This is acceptable when the link can support an occasional bulk transfer but becomes a problem when peak bandwidth is itself the binding constraint; Streaming DiLoCo's fragment based synchronization is the most direct response to this issue.[^5]
Second, the algorithm is sensitive to outer hyperparameters. The original paper observes that the Nesterov outer momentum is critical, that the outer learning rate of 0.7 is unusually high, and that removing the outer optimizer entirely degrades performance substantially.[^1] This means a naive implementation that swaps in SGD or skips the Nesterov term can silently produce a much worse model.
Third, while DiLoCo scales well in the published 60M to 4B regime, very few public results extend it cleanly to the hundreds of billions of parameters typical of frontier dense models. Scaling laws for DiLoCo show favorable trends through 4B parameters,[^8] and INTELLECT-1 demonstrates a working 10B pretraining,[^3] but a fully open 70B or larger DiLoCo training run has yet to appear.
Fourth, the synchronization step is still a barrier in the basic algorithm: all workers must complete H steps before the global update can be applied. In practice some workers will be slower (heterogeneous hardware, contention, network jitter), which forces faster workers to idle. The asynchronous DiLoCo line of work addresses this with Delayed Nesterov momentum and Dynamic Local Updates, but a fully asynchronous DiLoCo without quality penalty is still an active area of research.[^12]
Fifth, fault tolerance and Byzantine robustness in fully open, permissionless training settings remain hard. INTELLECT-1 added ElasticDeviceMesh and live checkpoint recovery to handle node failures,[^10] and INTELLECT-2 added TOPLOC inference verification to detect tampering or hardware noise,[^11] but these are bolted on top of DiLoCo rather than baked into its math. A formal treatment of how much malicious or noisy participation DiLoCo can tolerate while still converging is an open problem.
DiLoCo sits in a broader family of communication efficient distributed training methods that includes local SGD (Stich, 2019), Lookahead (Zhang et al., 2019), SlowMo and FedMom (Wang et al., 2020), and federated averaging (McMahan et al., 2017).[^1][^7] Each of these methods reduces synchronization frequency in some way, but DiLoCo is distinguished by the combination of a large inner step count (500), a strong inner optimizer (AdamW), and a Nesterov outer step, tuned specifically for language model pretraining rather than for federated learning on edge devices.
At the systems level DiLoCo is complementary to Fully Sharded Data Parallel (FSDP), tensor parallelism, and pipeline parallelism, which can run within each DiLoCo island; to DeepSpeed style memory efficient training stacks; and to elastic federation systems like Hivemind, which OpenDiLoCo uses for inter island communication.[^2] Mixture of experts approaches such as MoE models or DiPaCo provide an orthogonal axis of compute partitioning that can be combined with DiLoCo's low frequency outer loop.[^13]