DiLoCo

Updated Jul 11, 2026

DiLoCo (Distributed Low-Communication training) is a distributed optimization algorithm for neural networks introduced by Google DeepMind in November 2023 to train large language models across loosely connected "islands" of compute. Each island performs many local optimization steps with AdamW on its own data shard, and only periodically averages parameter deltas across workers, after which an outer optimizer (Nesterov momentum on the averaged delta) updates the global weights. In the original paper, DiLoCo with eight workers and 500 inner steps between communications matched the validation perplexity of fully synchronous data parallel training on the C4 dataset while exchanging roughly 500 times less data.^[1] The method has become the foundation for a line of follow up work on geographically distributed training, including Prime Intellect's OpenDiLoCo, the INTELLECT-1 and INTELLECT-2 community trained models, and Streaming DiLoCo, which extends the communication efficiency another order of magnitude.^[2]^[3]^[4]^[5]

Property	Value
Original paper	"DiLoCo: Distributed Low-Communication Training of Language Models" (arXiv:2311.08105)
Originating lab	Google DeepMind
First public version	14 November 2023
Lead authors	Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen
Algorithm class	Local SGD / federated averaging variant with outer Nesterov momentum
Inner optimizer	AdamW
Outer optimizer	Nesterov momentum
Headline result	Matches synchronous data parallel quality on C4 with roughly 500x less communication
Notable open implementation	OpenDiLoCo (Prime Intellect, Apache 2.0)
Notable production runs	INTELLECT-1 (10B, November 2024); INTELLECT-2 (32B post training, May 2025)
Successor	Streaming DiLoCo (DeepMind, January 2025)

Background

Training frontier language models with conventional data parallelism requires that every accelerator exchange a copy of every gradient at every step. In practice this is done with an all-reduce collective over high speed interconnects such as NVLink or InfiniBand, and the bandwidth requirement grows with both model size and the number of workers. Fully Sharded Data Parallel (FSDP) and DeepSpeed reduce the per device memory footprint by sharding parameters, gradients, and optimizer state, but they do not reduce the per step communication volume; if anything, they increase the number of collective operations per step.^[6]

The result is that synchronous data parallelism essentially forces large training runs into a single tightly coupled cluster, because a few hundred microseconds of added latency per all-reduce can dominate iteration time. As model sizes grew through 2022 and 2023, several research groups began asking whether language models could instead be trained across many smaller clusters, perhaps connected only by ordinary internet links of tens or hundreds of megabits per second. DiLoCo is one of the most influential answers to that question.^[1]

The algorithm has roots in two older traditions. The first is local stochastic gradient descent and federated averaging, introduced by McMahan and colleagues for federated learning in 2017, where edge devices perform several local SGD steps between rounds of communication.^[7] The second is the Lookahead optimizer of Zhang and colleagues (2019), which interpolates the fast inner weights with a slow outer copy after k inner steps, and the SlowMo / FedMom family of methods, which apply outer momentum to averaged gradients in a federated setting.^[1] DiLoCo is best understood as a fusion of these ideas, tailored to language model pretraining rather than to phone style federated learning.

The communication bottleneck in large scale pretraining

A useful framing is to count the bytes exchanged per training step. In a vanilla data parallel run with k workers and a model of P parameters, each all-reduce carries a gradient payload of roughly 2P bytes per worker per step when gradients are exchanged in FP16. For a 1 billion parameter model trained for 100,000 steps, that adds up to roughly 200 terabytes of cumulative gradient traffic per worker, and correspondingly more in aggregate across 64 workers. With 100 Gbit/s class datacenter interconnects the per step cost is acceptable; with a 1 Gbit/s wide area link it is not.

DiLoCo's core observation is that the per step granularity is mostly unnecessary. If workers communicate only every H steps and then exchange one averaged delta of size P, the cumulative communication is reduced by a factor of approximately H, with no change to the amount of training compute. The challenge is showing that this infrequent synchronization schedule does not damage final model quality, which is what the DiLoCo paper set out to demonstrate.^[1]
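
The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a measurement: it counts only the payload each worker contributes, ignores the overhead of the all-reduce algorithm itself, and assumes that DiLoCo's averaged delta is also sent at 2 bytes per value. The function names are illustrative.

```python
def ddp_bytes_per_worker(num_params: int, steps: int, bytes_per_value: int = 2) -> int:
    """Gradient payload one worker contributes when syncing at every step."""
    return num_params * bytes_per_value * steps


def diloco_bytes_per_worker(num_params: int, steps: int, inner_steps: int,
                            bytes_per_value: int = 2) -> int:
    """Delta payload one worker contributes when syncing every `inner_steps` steps."""
    rounds = steps // inner_steps
    return num_params * bytes_per_value * rounds


P = 1_000_000_000  # 1 billion parameters
STEPS = 100_000    # total training steps
H = 500            # inner steps between communications

ddp = ddp_bytes_per_worker(P, STEPS)
diloco = diloco_bytes_per_worker(P, STEPS, H)

print(f"data parallel:  {ddp / 1e12:.1f} TB per worker")
print(f"DiLoCo (H={H}): {diloco / 1e12:.1f} TB per worker")
print(f"reduction:      {ddp // diloco}x")
```

Under these assumptions the reduction factor is exactly H, because the per round payload is the same size as a single per step gradient exchange.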

The DiLoCo algorithm

The DiLoCo algorithm consists of two nested loops, an inner loop run independently on each of k workers and an outer loop that synchronizes them. The inner loop is ordinary stochastic gradient descent style optimization (AdamW in the published configuration), and the outer loop is a Nesterov momentum step on the averaged delta.^[1]

Algorithm

Let $\theta$ denote the model parameters, k the number of workers (islands), and H the number of inner steps between communications. For each outer step t from 1 to T:

Every worker i initializes its local parameter copy to the most recent global state, $\theta_i = \theta(t-1)$.
For H inner steps, each worker performs InnerOpt updates on its own data shard, producing a final local parameter $\theta_i(t)$.
Each worker computes a pseudo gradient $\delta_i = \theta(t-1) - \theta_i(t)$, namely the cumulative drift from the global state over H inner steps.
The pseudo gradients are averaged across workers to produce $\delta(t) = \frac{1}{k} \sum_i \delta_i$.
The outer optimizer applies a Nesterov momentum update using $\delta(t)$ as the gradient, yielding $\theta(t)$.
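
The steps above can be sketched in NumPy on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: each "worker" minimizes a simple quadratic with its own center, the AdamW moments are reset at the start of every round for brevity (a real run would normally keep inner optimizer state on each worker), the Nesterov update follows the formulation used by PyTorch's SGD, and the toy inner learning rate and H are chosen for the toy problem rather than taken from the paper. The helper names `adamw_steps` and `diloco_round` are hypothetical.

```python
import numpy as np


def adamw_steps(theta, grad_fn, steps, lr, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """Run `steps` AdamW updates starting from `theta` and return the result."""
    theta = theta.copy()
    m = np.zeros_like(theta)  # first moment estimate
    v = np.zeros_like(theta)  # second moment estimate
    for s in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** s)
        v_hat = v / (1 - b2 ** s)
        theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta


def diloco_round(theta, buf, grad_fns, inner_steps, inner_lr=4e-4,
                 outer_lr=0.7, outer_mu=0.9):
    """One outer step: local training, pseudo gradient averaging, Nesterov update."""
    # Each worker starts from the global state and trains independently.
    local = [adamw_steps(theta, fn, inner_steps, inner_lr) for fn in grad_fns]
    # Pseudo gradient: global state minus the worker's final local state.
    deltas = [theta - th for th in local]
    # Average the pseudo gradients across workers.
    delta = np.mean(deltas, axis=0)
    # Nesterov momentum step, treating `delta` as the gradient.
    buf = outer_mu * buf + delta
    theta = theta - outer_lr * (delta + outer_mu * buf)
    return theta, buf


# Toy problem (an assumption for illustration): worker i minimizes
# 0.5 * ||theta - c_i||^2, so the shared optimum is the mean of the centers.
centers = np.array([[2.0, 2.0], [0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
grad_fns = [lambda th, c=c: th - c for c in centers]
target = centers.mean(axis=0)

theta = np.array([-1.0, -1.0])
buf = np.zeros_like(theta)
for _ in range(20):
    theta, buf = diloco_round(theta, buf, grad_fns, inner_steps=100, inner_lr=0.05)

print("final parameters:", theta, "target:", target)
```

In this toy setting every worker drifts toward its own center during the inner loop, and the outer step pulls the global parameters toward the point where those drifts cancel.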

In the published configuration the inner optimizer is AdamW with learning rate roughly 4e-4 and a cosine schedule, while the outer optimizer is Nesterov SGD with learning rate 0.7 and momentum 0.9. The number of inner steps H is typically 500, meaning each communication round corresponds to 500 sequential mini batches per worker.^[1]

The phrase "pseudo gradient" is borrowed from the federated learning literature and emphasizes that $\delta_i$ is not the gradient of any single batch; it is the negative of the parameter displacement produced by H steps of an arbitrary inner optimizer. The outer step therefore averages displacements rather than instantaneous gradients, which is the property that allows the inner optimizer to make rapid progress on local data without being held back by every other worker.^[1]

Hyperparameters from the original paper

The table below summarizes the canonical DiLoCo configuration on C4 from the v1 paper.^[1]

Hyperparameter	Value
Inner optimizer	AdamW
Inner learning rate	4e-4 (cosine)
Outer optimizer	Nesterov momentum
Outer learning rate	0.7
Outer momentum	0.9
Inner steps per round (H)	500
Workers (k)	8 (main); 4, 16, 64 in ablation
Sequence length	1,024 tokens
Batch size per worker	512
Model sizes	60M, 150M, 400M parameters
Dataset	C4 (Common Crawl)
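
For readers who want the table in machine readable form, the values can be collected into a small configuration object. This is only a convenience sketch: the class and field names are hypothetical, and the derived `tokens_per_round` property assumes that the per worker batch size counts sequences of the listed length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiLoCoConfig:
    """Values copied from the hyperparameter table; field names are illustrative."""
    inner_optimizer: str = "adamw"
    inner_lr: float = 4e-4
    inner_lr_schedule: str = "cosine"
    outer_optimizer: str = "nesterov"
    outer_lr: float = 0.7
    outer_momentum: float = 0.9
    inner_steps: int = 500           # H
    num_workers: int = 8             # k
    seq_len: int = 1024              # tokens per sequence
    batch_size_per_worker: int = 512

    @property
    def tokens_per_round(self) -> int:
        """Tokens consumed by all workers between two synchronizations."""
        return (self.num_workers * self.inner_steps
                * self.batch_size_per_worker * self.seq_len)


cfg = DiLoCoConfig()
print(cfg)
print("tokens per outer round:", cfg.tokens_per_round)
```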

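The outer learning rate of 0.7 looks unusually large next to the inner learning rate, but the two are not directly comparable: the outer step multiplies an averaged parameter displacement that already represents H inner steps of work, not a raw gradient. An outer learning rate of 1.0 with no momentum would reduce to plain parameter averaging, so 0.7 amounts to a damped version of that average before momentum is applied. In practice the outer optimizer behaves like a slow consensus mechanism: the workers can drift apart by hundreds of inner steps, and the outer step pulls them back to a single shared trajectory once per round.^[1]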
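
A short numerical check makes the scale of the outer step concrete. The sketch below assumes the PyTorch-style Nesterov formulation and uses random stand-in parameters; `outer_step` is a hypothetical helper. It shows that an outer learning rate of 1.0 with zero momentum reproduces plain averaging of the local parameters, and that the first round with the published settings (momentum buffer still zero) moves by 0.7 × (1 + 0.9) = 1.33 times the averaged displacement.

```python
import numpy as np


def outer_step(theta, delta, buf, lr, mu):
    """Nesterov momentum update with the averaged pseudo gradient as gradient."""
    buf = mu * buf + delta
    return theta - lr * (delta + mu * buf), buf


rng = np.random.default_rng(0)
theta_global = rng.normal(size=5)                           # stand-in global state
theta_local = theta_global + 0.1 * rng.normal(size=(8, 5))  # stand-in local states

# Averaged pseudo gradient across the 8 stand-in workers.
delta = np.mean(theta_global - theta_local, axis=0)

averaged, _ = outer_step(theta_global, delta, np.zeros(5), lr=1.0, mu=0.0)
first_round, _ = outer_step(theta_global, delta, np.zeros(5), lr=0.7, mu=0.9)

print("lr=1, mu=0 equals parameter averaging:",
      np.allclose(averaged, theta_local.mean(axis=0)))
print("first-round step relative to delta:",
      (theta_global - first_round) / delta)
```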

Intuition: averaging weights, not gradients

A useful way to think about DiLoCo is that it averages weights at low frequency rather than gradients at every step. The Lookahead optimizer (Zhang et al., 2019) does something similar with a single replica, treating the inner sequence of AdamW updates as exploration and the outer interpolation as a stabilizing pullback toward a slow moving anchor.^[1] DiLoCo generalizes this to k replicas: each worker plays the role of an inner trajectory, and the outer step is the consensus pullback. SlowMo and FedMom similarly apply outer momentum to averaged updates, but they were studied mainly at the smaller scales typical of federated learning rather than in language model pretraining.^[1]

This view also explains why DiLoCo is robust to non identically distributed data shards: because the inner optimizer is allowed to make significant local progress before averaging, each worker can specialize to its shard during the inner loop, and the outer step folds those specializations together. The original paper reports that DiLoCo tolerates data heterogeneity and dynamic resource availability that would be challenging for standard data parallelism.^[1]

Why Nesterov momentum on the pseudo gradient

The choice of Nesterov momentum on the outer loop is not incidental. In ordinary stochastic optimization, momentum accelerates convergence by accumulating consistent gradient directions over time and damping oscillations across high curvature directions. In the DiLoCo outer loop, the "gradient" being smoothed is a pseudo gradient produced by H steps of inner AdamW, so the directions it represents are already averaged over hundreds of mini batches and may include a mix of consistent low frequency drift and round to round noise from inner optimizer randomness. The DeepMind authors find that a moderate to large outer momentum coefficient (0.9 in the published configuration) is essential to stabilizing this signal and that disabling outer momentum leads to noticeably worse perplexity at the same number of outer rounds.^[1]

A subtler observation, raised in subsequent analysis work such as the Step K Nesterov Outer Optimizer (SNOO) study, is that DiLoCo's strong performance can be attributed substantially to the Lookahead style component with Nesterov momentum applied to the pseudo gradient, rather than purely to data parallel averaging across workers. SNOO finds that even in a single worker setting, several inner steps followed by a Nesterov outer update on the cumulative displacement can outperform plain inner optimization at matched compute, which recasts DiLoCo as a combined optimization and distribution innovation rather than a pure distribution trick.^[16]

Connection to elastic averaging

A separate predecessor is the Elastic Averaging SGD (EASGD) framework, in which each worker maintains a local copy of the model parameters that is softly pulled toward a shared center variable via a spring like coupling. The pull strength is governed by an elastic constant rho, and the center variable itself moves slowly under the same coupling. DiLoCo can be viewed as a discretized, hard pull variant of elastic averaging: instead of a continuous spring, workers periodically snap back to the consensus, and the outer Nesterov step plays the role of the slow moving center variable's update law. This connection is invoked in the original DiLoCo paper to justify the algorithm's robustness, although the practical hyperparameters of DiLoCo differ from typical EASGD configurations by orders of magnitude in H.^[1]
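
For comparison, the spring like coupling can be sketched as follows. This assumes the synchronous update rule as it is commonly written for EASGD, in which each worker takes a gradient step plus a pull toward the center and the center moves toward the workers with the same strength; `easgd_step` is a hypothetical helper and the numbers are arbitrary.

```python
import numpy as np


def easgd_step(workers, center, grads, lr, rho):
    """One synchronous elastic averaging step; all updates use the old values.

    workers: array of shape (k, d), one row per worker
    center:  array of shape (d,), the shared center variable
    grads:   array of shape (k, d), each worker's local gradient
    """
    diffs = workers - center                        # spring extensions
    new_workers = workers - lr * (grads + rho * diffs)
    new_center = center + lr * rho * diffs.sum(axis=0)
    return new_workers, new_center


workers = np.array([[1.0, 0.0], [-1.0, 2.0], [3.0, 1.0]])
center = np.zeros(2)
no_grads = np.zeros_like(workers)  # isolate the coupling term

w_next, c_next = easgd_step(workers, center, no_grads, lr=0.1, rho=0.5)
print("workers after one step:\n", w_next)
print("center after one step:", c_next)
```

With gradients switched off, each worker moves a fraction lr × rho of the way toward the center at every step, whereas a DiLoCo worker is reset to the global state once per round.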

Empirical results in the original paper

The DiLoCo paper reports experiments on three model sizes, 60M, 150M, and 400M parameters, all decoder only Transformers with Chinchilla like architectures, trained on the C4 subset of Common Crawl.^[1] The primary result is that for the 150M parameter model trained for 88,000 steps, DiLoCo with eight workers and H equal to 500 achieves a validation perplexity of roughly 15.0, lower than a fully synchronous baseline using either a single worker or an equivalent eight times larger batch, while communicating roughly 500 times less often than the synchronous eight times larger batch baseline.^[1]

The paper also reports ablations on the number of workers (k in {4, 8, 16, 64}), on the inner step count H, and on the inner and outer optimizers, finding that the algorithm is robust across a wide range of these settings, that increasing k generally improves quality at fixed token budget, and that the Nesterov outer step is critical: SGD or no outer momentum yields visibly worse perplexity.^[1] In a follow up scaling study from DeepMind, the authors find that DiLoCo's optimal batch size is larger than that of vanilla data parallel training, that downstream task generalization improves with scale, and that the method outperforms data parallel training even at small model sizes when properly tuned.^[8]

Comparison with other parallelism strategies

DiLoCo is sometimes confused with classical distributed training paradigms, but it occupies a distinct niche. The comparison below summarizes the differences with the parallelism dimensions most relevant to LLM training.^[1]^[6]

Strategy	What is partitioned	Communication frequency	Bandwidth requirement	Typical network
Data parallel (DDP)	data batches; full parameters replicated	every step (all-reduce of gradients)	High	NVLink / InfiniBand
FSDP	data batches; parameters sharded	every step (gather + reduce-scatter)	Highest	NVLink / InfiniBand
Tensor parallel	individual tensor dimensions	every layer	Very high (intra step)	NVLink within a node
Pipeline parallel	stages of the network	once per micro batch boundary	Medium	InfiniBand across nodes
DiLoCo	data shards across islands	every H inner steps (500 in paper)	Low	ordinary internet, 100 Mbit/s class

Within an island the workers can still use any combination of FSDP, tensor parallelism, and pipeline parallelism for intra island scaling; DiLoCo is orthogonal to these techniques and sits above them in the parallelism stack.^[1]^[2]

How DiLoCo differs from standard federated learning

DiLoCo is sometimes described as "federated learning for pretraining," and the analogy is close enough to be illuminating, but there are important differences in regime and tuning. In a typical federated setting such as FedAvg deployed on mobile phones, the model is small (often less than 10M parameters), data is non IID by construction, the total number of clients is large but only a tiny subset participates in each round, and the inner step count is small (often a single epoch over local data).^[7] Communication is dominated by upload bandwidth from the client to the server, and privacy is a primary motivation.

DiLoCo flips most of these parameters. Models range from 60M to multi billion parameters; data is generally IID because each worker is given a random shard of a centralized dataset; all workers participate in every round (in the synchronous variant); the inner step count is large (500 in the canonical configuration); and privacy is not a goal. The optimization recipe is also tuned for pretraining loss surfaces rather than the convex or near convex problems often analyzed in classical federated learning theory. As a result, DiLoCo's hyperparameters and empirical guarantees do not transfer directly from FedAvg style methods, even though the algorithm is mathematically a member of the same family.^[1]^[7]

Implementations and follow up work

DiLoCo was originally a research artifact at DeepMind with no public reference implementation, but its open source reproductions and extensions have since become a small ecosystem in their own right.

OpenDiLoCo (Prime Intellect, July 2024)

Prime Intellect released OpenDiLoCo on 11 July 2024, an Apache 2.0 reproduction of the DiLoCo algorithm built on PyTorch FSDP for intra island scaling and the Hivemind library, which coordinates peers through a distributed hash table, for inter island communication. Authored by Sami Jaghouar, Jack Min Ong, and Johannes Hagemann, the accompanying paper (arXiv:2407.07852) reproduced the DiLoCo numbers on a 150M model and scaled the implementation to a 1.1 billion parameter run, nearly three times larger than the original paper's largest setting.^[2]

The headline OpenDiLoCo demonstration ran four DiLoCo workers, each with eight NVIDIA H100 GPUs, distributed across three countries on two continents (Canada via Hyperstack, Finland via DataCrunch, and the United States via Voltage Park and RunPod) with inter island bandwidth between 127 and 935 Mbit/s. The authors report a 90 to 95 percent compute utilization in the globally distributed setting, with the all-reduce step occupying only about 6.9 percent of training time.^[2] An ablation in the paper showed that the pseudo gradient all-reduce could be performed in FP16 without measurable degradation.^[2]

OpenDiLoCo is no longer actively maintained: the GitHub repository now redirects users to Prime Intellect's successor framework, prime (and the prime-rl variant used for INTELLECT-2), which the team describes as offering better fault tolerance and bandwidth utilization for production decentralized training.^[9]^[4]

INTELLECT-1 (Prime Intellect, November 2024)

INTELLECT-1 is a 10 billion parameter decoder only language model trained by Prime Intellect using a hybrid DiLoCo / FSDP2 stack across globally distributed nodes contributed by 30 independent compute providers. The model was released on 29 November 2024, with a technical report (arXiv:2412.01152) published on 2 December 2024.^[3]^[10]

The Llama 3 style architecture has 42 layers, hidden dimension 4096, 32 attention heads, and an 8,192 token context window; it was pretrained on roughly 1 trillion tokens drawn primarily from FineWeb-Edu (55 percent), Stack v2 (20 percent), FineWeb (10 percent), DCLM baseline (10 percent), and OpenWebMath (5 percent).^[3]^[10] Training ran across 14 concurrent nodes spanning five countries and three continents, using up to 112 H100 GPUs simultaneously.

The Prime Intellect team reports an overall 83 percent compute utilization across continents and 96 percent when training was restricted to nodes within the United States, together with a 400x reduction in communication bandwidth versus standard data parallel training. This figure combines the H step DiLoCo reduction with an additional custom int8 quantized all-reduce over the pseudo gradients.^[3]^[10] The supporting framework, prime, introduces ElasticDeviceMesh for dynamic process group management and live checkpoint recovery to handle the high failure rate of community contributed nodes.^[10]
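
To illustrate why int8 helps, the sketch below quantizes a stand-in pseudo gradient with a single per tensor scale and restores it. This is a generic symmetric quantizer written for illustration, not the scheme used in the prime framework; the function names are hypothetical, and a real quantized all-reduce also has to handle accumulation across workers, which is omitted here.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per tensor quantization to int8; returns values and scale."""
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q, scale):
    """Map int8 values back to float32 using the stored scale."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
pseudo_grad = rng.normal(size=1024).astype(np.float32)  # stand-in data

q, scale = quantize_int8(pseudo_grad)
restored = dequantize_int8(q, scale)

print("bytes as float32:", pseudo_grad.nbytes, "bytes as int8:", q.nbytes)
print("max abs error:", float(np.max(np.abs(pseudo_grad - restored))))
```

The payload shrinks by a factor of four relative to FP32, at the cost of a rounding error of at most half a quantization step per value.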

INTELLECT-2 (Prime Intellect, May 2025)

INTELLECT-2 is a 32 billion parameter reasoning model based on Alibaba's QwQ-32B, post trained via fully asynchronous reinforcement learning across a permissionless swarm of compute contributors. It was released on 11 May 2025 under Apache 2.0, with a technical report at arXiv:2505.07291.^[4]^[11]

INTELLECT-2 uses a different recipe from INTELLECT-1: instead of pretraining with DiLoCo as the core synchronization scheme, it applies an asynchronous variant of GRPO implemented in the prime-rl framework, together with auxiliary components TOPLOC (a locality sensitive hashing system that verifies inference rollouts to detect tampering or hardware noise) and SHARDCAST (an HTTP based tree topology for distributing updated policy weights). The training data is roughly 285,000 verifiable math and coding tasks from NuminaMath-1.5, DeepScaleR, and SYNTHETIC-1, with two sided token probability ratio clipping and aggressive gradient clipping added to GRPO for stability.^[4]^[11] DiLoCo itself is listed in the report as a candidate future technique for fusing independently trained checkpoints into a single model.^[11]

Streaming DiLoCo (DeepMind, January 2025)

Streaming DiLoCo, posted to arXiv on 30 January 2025 by Douillard and colleagues at Google DeepMind, addresses two practical limitations of the original DiLoCo: the peak bandwidth spike at the moment all workers exchange P parameters, and the wall clock cost of pausing training during that exchange.^[5]

It contributes three changes. First, the network is partitioned into fragments (in the published configuration, roughly three transformer layers per fragment), and at each outer step only one fragment is synchronized rather than all parameters at once; this reduces the peak bandwidth by a factor roughly equal to the number of fragments, about 8x in the canonical setup. Second, communication of each fragment overlaps with continued inner optimization for tau steps, so workers do not stall while the all-reduce is in flight. Third, the exchanged pseudo gradients are compressed to a 4 bit floating point format (E3M0) with FP32 accumulation, while random dropout based compression was found to be ineffective.^[5]
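
The fragment idea can be sketched as a simple schedule. The code below is an illustration only and is not the paper's exact schedule: it assumes a hypothetical 24 layer model so that three layers per fragment gives eight fragments, and it staggers the fragments evenly over a window of H inner steps so that at most one fragment is in flight at a time. The function names are hypothetical.

```python
def make_fragments(num_layers: int, layers_per_fragment: int) -> list[list[int]]:
    """Group consecutive layer indices into fragments."""
    return [list(range(start, min(start + layers_per_fragment, num_layers)))
            for start in range(0, num_layers, layers_per_fragment)]


def fragments_due(step: int, num_fragments: int, inner_steps: int) -> list[int]:
    """Fragments whose staggered synchronization point falls on this inner step."""
    due = []
    for p in range(num_fragments):
        offset = (p * inner_steps) // num_fragments  # evenly spaced offsets
        if step > 0 and (step - offset) % inner_steps == 0:
            due.append(p)
    return due


NUM_LAYERS = 24  # hypothetical model depth
H = 100          # hypothetical window of inner steps

fragments = make_fragments(NUM_LAYERS, layers_per_fragment=3)
print("number of fragments:", len(fragments))
for step in range(1, H + 1):
    for p in fragments_due(step, len(fragments), H):
        print(f"step {step:3d}: sync fragment {p} (layers {fragments[p]})")
```

Because only one of the eight fragments is exchanged at any synchronization point, the largest single transfer in this sketch is one eighth of the layer stack, which is where a peak bandwidth reduction of that order comes from.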

Combining these three techniques on a 1 billion parameter model trained on Dolma yields a roughly 400x reduction in total bits exchanged versus a vanilla data parallel baseline while matching evaluation loss and downstream task accuracy on HellaSwag, PIQA, and ARC Easy. Streaming DiLoCo experiments span 35M to 4B parameter models and 25B to 250B token budgets, all using Chinchilla style architectures.^[5] The authors describe the result as a step toward "a distributed free lunch," in the sense that the bandwidth tax of distributed training can be made very small without observable quality loss.^[5]

Asynchronous DiLoCo

The DeepMind team also published "Asynchronous Local-SGD Training for Language Modeling" (arXiv:2401.09135) on 17 January 2024, which addresses the case where workers cannot synchronize on a fixed schedule, for example because they are heterogeneous in compute speed.^[12] A naive asynchronous version of DiLoCo, in which each worker updates the global parameters as soon as it finishes H inner steps, underperforms the synchronous variant because the outer Nesterov momentum amplifies stale pseudo gradients. The paper proposes two corrections: a Delayed Nesterov scheme that postpones the momentum update relative to the gradient injection, and Dynamic Local Updates, in which each worker's H is adjusted on the fly to reflect its measured throughput.^[12] With these fixes, async DiLoCo matches the synchronous variant on per update perplexity while improving wall clock time on heterogeneous hardware. The accompanying code is published as google-deepmind/asyncdiloco.^[12]
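
The intuition behind Dynamic Local Updates can be sketched with a proportional rule: slower workers are given fewer inner steps so that all workers finish a round at about the same time. The rule below is an assumption made for illustration and may differ from the one in the paper; `dynamic_local_steps` is a hypothetical helper, and throughput is taken to be measured in inner steps per second.

```python
def dynamic_local_steps(base_steps: int, throughputs: list[float]) -> list[int]:
    """Scale each worker's inner step count by its speed relative to the fastest."""
    fastest = max(throughputs)
    return [max(1, round(base_steps * t / fastest)) for t in throughputs]


# Three workers: one at full speed, one at half speed, one at quarter speed.
steps = dynamic_local_steps(500, [1.0, 0.5, 0.25])
print(steps)

# Time each worker needs for its round, in arbitrary units.
print([s / t for s, t in zip(steps, [1.0, 0.5, 0.25])])
```

With step counts proportional to throughput, the time per round is equal across workers, so faster workers no longer idle while waiting for slower ones.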

Scaling laws for DiLoCo

A second DeepMind paper, "Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo" (arXiv:2503.09799, March 2025), studies how DiLoCo behaves under a fixed compute budget across model sizes.^[8] The headline findings are that DiLoCo scales as a power law in model size with a slightly more favorable exponent than data parallel training, that its compute optimal batch size is larger than that of AdamW data parallel training, and that the gap between DiLoCo and data parallel actually closes (or inverts in DiLoCo's favor) as models grow. The authors interpret this as evidence that DiLoCo is not only a low bandwidth fallback but a genuinely competitive optimization recipe at scale.^[8]

DiPaCo

A separate but related effort from DeepMind, DiPaCo (Distributed Path Composition, arXiv:2403.10616, March 2024), combines DiLoCo style outer optimization with a modular architecture in which each training example flows through one of many "paths" of shared modules. At inference time, only a single path is executed per input, sidestepping the need to load the full model. The authors report that on C4, a DiPaCo model using 256 paths of 150M parameters outperforms a 1 billion parameter dense baseline in the same number of training steps, while using less wall clock time thanks to DiLoCo style low frequency communication.^[13] DiPaCo is mainly of interest as an architectural co design rather than a drop in replacement for DiLoCo, but it illustrates how the low communication outer loop can be repurposed for Mixture of Experts like designs.

Other follow up directions

Beyond the named papers above, the DiLoCo idea has spawned a family of variant algorithms that treat its outer loop as a generic recipe to be combined with different inner optimizers and compression schemes. MuLoCo studies whether the recently popularized Muon optimizer can serve as a drop in replacement for AdamW as the inner optimizer in DiLoCo, finding that Muon's structured updates compose cleanly with the DiLoCo outer loop.^[14] Communication efficient pretraining methods such as SparseLoCo apply gradient sparsification to the pseudo gradients exchanged in each outer round, trading some quality for additional bandwidth reduction.^[15] None of these has yet displaced the canonical DiLoCo recipe in production use, but they illustrate that the algorithm is being treated as a portable component rather than a fixed pipeline.

Significance

DiLoCo and its successors are widely cited as the technical basis for the recent wave of decentralized pretraining efforts pursued by Prime Intellect, Together AI, and others, as well as for multi datacenter training within hyperscalers. Before DiLoCo, training a state of the art model across two datacenters tens of milliseconds apart was widely considered impractical because synchronous all-reduce penalties scale poorly with link latency; with DiLoCo, the latency tax is amortized over hundreds of inner steps, and inter datacenter bandwidth becomes the binding constraint rather than per step jitter.^[1]^[2]

The most visible demonstration of this shift is INTELLECT-1, the first publicly trained 10B parameter language model whose pretraining was distributed across multiple continents and multiple independent organizations, an outcome that would not have been feasible with vanilla data parallelism.^[3]^[10] The 400x communication reduction reported for INTELLECT-1 and the comparable figure for Streaming DiLoCo are at the heart of arguments that frontier model training does not need to remain locked to a small number of hyperscale clusters.^[3]^[5]

There are also more pragmatic implications inside individual organizations. DiLoCo style outer loops allow training jobs to be split across multiple datacenters of the same provider that are connected by ordinary backbone fiber rather than dedicated low latency lanes. This is sometimes referred to in industry as "multi DC pretraining," and several large labs have reported using DiLoCo derivatives or close cousins to operate above the single datacenter scale, although precise architectural details remain proprietary.

Economic implications for decentralized compute

A second order effect of low communication training algorithms is that the unit of compute that can profitably participate in a large pretraining run becomes smaller. With pure synchronous data parallelism, a useful contribution to a frontier training run requires colocating hundreds of GPUs on a single low latency fabric. With DiLoCo, a small cluster of eight H100 GPUs sitting on a 1 Gbit/s link can be a productive participant, because the algorithm's bandwidth budget is set by one exchange every H steps rather than by per step synchronization. This dramatically lowers the barrier to entry for non hyperscale operators, university clusters, and individual contributors, which in turn underpins the commercial proposition of decentralized training platforms such as Prime Intellect's permissionless network. The INTELLECT-1 and INTELLECT-2 demonstrations are the strongest existence proofs to date that this market structure can actually train competitive models, although the specific economic terms of compute contribution and reward attribution remain in flux.^[3]^[4]^[11]

Limitations and open questions

Several caveats limit how broadly DiLoCo can be applied today.

First, DiLoCo reduces communication frequency, not the per round communication volume. In the canonical algorithm, every outer round still exchanges roughly P parameters across all workers, just less often. This is acceptable when the link can support an occasional bulk transfer but becomes a problem when peak bandwidth is itself the binding constraint; Streaming DiLoCo's fragment based synchronization is the most direct response to this issue.^[5]

Second, the algorithm is sensitive to outer hyperparameters. The original paper observes that the Nesterov outer momentum is critical, that the outer learning rate of 0.7 is unusually high, and that removing the outer optimizer entirely degrades performance substantially.^[1] This means a naive implementation that swaps in SGD or skips the Nesterov term can silently produce a much worse model.

Third, while DiLoCo scales well in the published 60M to 4B regime, very few public results extend it cleanly to the hundreds of billions of parameters typical of frontier dense models. Scaling laws for DiLoCo show favorable trends through 4B parameters,^[8] and INTELLECT-1 demonstrates a working 10B pretraining,^[3] but a fully open 70B or larger DiLoCo training run has yet to appear.

Fourth, the synchronization step is still a barrier in the basic algorithm: all workers must complete H steps before the global update can be applied. In practice some workers will be slower (heterogeneous hardware, contention, network jitter), which forces faster workers to idle. The asynchronous DiLoCo line of work addresses this with Delayed Nesterov momentum and Dynamic Local Updates, but a fully asynchronous DiLoCo without quality penalty is still an active area of research.^[12]

Fifth, fault tolerance and Byzantine robustness in fully open, permissionless training settings remain hard. INTELLECT-1 added ElasticDeviceMesh and live checkpoint recovery to handle node failures,^[10] and INTELLECT-2 added TOPLOC inference verification to detect tampering or hardware noise,^[11] but these are bolted on top of DiLoCo rather than baked into its math. A formal treatment of how much malicious or noisy participation DiLoCo can tolerate while still converging is an open problem.

DiLoCo sits in a broader family of communication efficient distributed training methods that includes local SGD (Stich, 2019), Lookahead (Zhang et al., 2019), SlowMo and FedMom (Wang et al., 2020), and federated averaging (McMahan et al., 2017).^[1]^[7] Each of these methods reduces synchronization frequency in some way, but DiLoCo is distinguished by the combination of a large inner step count (500), a strong inner optimizer (AdamW), and a Nesterov outer step, tuned specifically for language model pretraining rather than for federated learning on edge devices.

At the systems level DiLoCo is complementary to Fully Sharded Data Parallel (FSDP), tensor parallelism, and pipeline parallelism, which can run within each DiLoCo island; to DeepSpeed style memory efficient training stacks; and to decentralized training libraries like Hivemind, which OpenDiLoCo uses for inter island communication.^[2] Mixture of Experts approaches, including modular designs such as DiPaCo, provide an orthogonal axis of compute partitioning that can be combined with DiLoCo's low frequency outer loop.^[13]

References

1. Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen, "DiLoCo: Distributed Low-Communication Training of Language Models", arXiv, 2023-11-14 (v1; v3 revised 2024-09-23). https://arxiv.org/abs/2311.08105. Accessed 2026-05-20.
2. Sami Jaghouar, Jack Min Ong, Johannes Hagemann, "OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training", arXiv, 2024-07-10. https://arxiv.org/abs/2407.07852. Accessed 2026-05-20.
3. Prime Intellect, "INTELLECT-1 Release: The First Globally Trained 10B Parameter Model", Prime Intellect Blog, 2024-11-29. https://www.primeintellect.ai/blog/intellect-1-release. Accessed 2026-05-20.
4. Prime Intellect, "INTELLECT-2 Release: The First 32B Parameter Model Trained Through Globally Distributed Reinforcement Learning", Prime Intellect Blog, 2025-05-11. https://www.primeintellect.ai/blog/intellect-2-release. Accessed 2026-05-20.
5. Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham, "Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch", arXiv, 2025-01-30. https://arxiv.org/abs/2501.18512. Accessed 2026-05-20.
6. Yanli Zhao et al., "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel", arXiv, 2023-04-21. https://arxiv.org/abs/2304.11277. Accessed 2026-05-20.
7. H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Aguera y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data", arXiv, 2017-02-17 (v3). https://arxiv.org/abs/1602.05629. Accessed 2026-05-20.
8. Zachary Charles, Gabriel Teston, Lucio Dery, Keith Rush, Nova Fallen, Zachary Garrett, Arthur Szlam, Arthur Douillard, "Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo", arXiv, 2025-03-12. https://arxiv.org/abs/2503.09799. Accessed 2026-05-20.
9. Prime Intellect, "OpenDiLoCo GitHub repository", GitHub, 2024-07-11. https://github.com/PrimeIntellect-ai/OpenDiloco. Accessed 2026-05-20.
10. Sami Jaghouar, Jack Min Ong, Manveer Basra, Fares Obeid, Jannik Straube, Michael Keiblinger, Elie Bakouch, Lucas Atkins, Maziyar Panahi, Charles Goddard, Max Ryabinin, Johannes Hagemann, "INTELLECT-1 Technical Report", arXiv, 2024-12-02. https://arxiv.org/abs/2412.01152. Accessed 2026-05-20.
11. Prime Intellect team, "INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning", arXiv, 2025-05-12. https://arxiv.org/abs/2505.07291. Accessed 2026-05-20.
12. Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato, "Asynchronous Local-SGD Training for Language Modeling", arXiv, 2024-01-17. https://arxiv.org/abs/2401.09135. Accessed 2026-05-20.
13. Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam, "DiPaCo: Distributed Path Composition", arXiv, 2024-03-15. https://arxiv.org/abs/2403.10616. Accessed 2026-05-20.
14. Benjamin Thérien, Xiaolong Huang, Aaron Defazio, Irina Rish, Eugene Belilovsky, "MuLoCo: Muon is a practical inner optimizer for DiLoCo", arXiv, 2025-05-29. https://arxiv.org/abs/2505.23725. Accessed 2026-05-20.
15. Amir Sarfi, Benjamin Thérien, Joel Lidin, Eugene Belilovsky, "Communication Efficient LLM Pre-training with SparseLoCo", arXiv, 2025-08-21. https://arxiv.org/abs/2508.15706. Accessed 2026-05-20.
16. Dominik Kallusky, Vinay Rao, Vishal Nandavanam, Hao-Jun Michael Shi, "SNOO: Step-K Nesterov Outer Optimizer. The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients", arXiv, 2025-10-17. https://arxiv.org/abs/2510.15830. Accessed 2026-05-20.
