Parameter Server (PS)

Updated Jul 12, 2026

The Parameter Server (PS) is a distributed system architecture for training large machine learning models in which one set of machines, the server nodes, holds the global model parameters as a partitioned key-value store, while another set of machines, the worker nodes, processes data shards and exchanges parameter reads (pull) and gradient updates (push) with the servers. The pattern was popularized by Mu Li and collaborators in their 2014 OSDI paper Scaling Distributed Machine Learning with the Parameter Server ^[1], building on earlier work by Alex Smola and Shravan Narayanamurthy at VLDB 2010 ^[2] and on Google's DistBelief system described by Jeffrey Dean et al. at NeurIPS 2012 ^[3]. For roughly half a decade the parameter server was the default architecture for large-scale distributed machine learning, then ceded the deep-learning training market to AllReduce-based collective communication systems such as Horovod ^[4] and PyTorch DistributedDataParallel starting around 2017. It nevertheless remains the dominant architecture for industrial recommender systems, click-through rate (CTR) prediction, and any setting where the model is dominated by sparse embedding tables too large to replicate on every worker.

The parameter server is significant for two reasons beyond the systems it enabled. First, it formalized a separation of concerns that almost every later distributed-training framework adopted in some form: stateful, addressable parameter storage on one side, and stateless, mostly numeric workers on the other. Second, the parameter server papers were the first place where the design space of synchronous versus asynchronous SGD, bounded staleness, and elastic fault tolerance was laid out cleanly enough to be implemented as a general-purpose system rather than an algorithm-specific hack. The Stale Synchronous Parallel (SSP) consistency model from Qirong Ho and colleagues at NeurIPS 2013 ^[5] grew out of this design space and remains the cleanest articulation of the trade-off between convergence speed and parallel efficiency in stochastic gradient descent.

Background and motivation

By the late 2000s, machine learning datasets had outgrown what fit on a single machine. Search ranking, contextual ads, topic models for behavioral targeting, and the first commercial deep learning systems were all routinely working with billions of features or tens of millions of training examples. Several attempts to scale SGD horizontally already existed:

The classical synchronous Bulk Synchronous Parallel (BSP) approach, used by the MapReduce family and Apache Hadoop, required every worker to finish each iteration before the next could begin, which left fast workers idle whenever any worker straggled.
Lock-protected shared-memory updates were limited to a single multi-core machine.
Pure asynchronous schemes such as HOGWILD! by Feng Niu, Benjamin Recht, Christopher Ré, and Stephen Wright at NeurIPS 2011 ^[6] showed that lock-free updates from many threads could converge for sparse problems, but the analysis was specific to shared-memory systems and did not address how to scale across a cluster.

The parameter server architecture took the lock-free, mostly asynchronous philosophy of HOGWILD! and pushed it across the network. Rather than every worker reading and writing a single shared memory region, a logical key-value store was sharded across many server machines, and workers issued remote pull and push operations against it. This split allowed the parameters to be much larger than any single worker's memory and let the worker and server layers scale independently of each other.

The three generations

Mu Li's OSDI 2014 paper organized the history of the architecture into three generations ^[1]:

Generation	Representative system	Year	Key idea
First	Distributed Latent Dirichlet Allocation sampler of Smola and Narayanamurthy ^[2]	2010	Distributed (key, value) store for sampler state across a cluster, used to scale topic models to hundreds of millions of documents
Second	Google's DistBelief ^[3], YahooLDA, Microsoft's Project Adam ^[7]	2012 to 2014	Application-specific parameter servers for one model family (deep nets, LDA, ad CTR). Introduced Downpour SGD and Sandblaster L-BFGS ^[3]
Third	Mu Li et al. ps-lite and the Parameter Server ^[1], Petuum ^[8]	2014 to 2015	General-purpose parameter server with arbitrary user-defined update functions, vector clocks, range push and pull, and configurable consistency models including SSP

The third generation is what the term parameter server almost always refers to in modern usage, and what most production systems still in service trace back to.

Architecture

A parameter server deployment partitions the cluster into three logical roles. In small deployments, the scheduler often coexists with the first server.

Role	Stateful	Job
Scheduler	Yes (membership only)	Assigns key ranges to servers, monitors heartbeats, manages elastic join and leave, drives recovery
Server nodes	Yes (model parameters)	Hold partitioned parameters, apply user-defined update functions, serve push and pull requests, replicate for fault tolerance
Worker nodes	Mostly stateless	Hold a data shard, compute local gradients on minibatches, push gradients and pull current parameters, never communicate with each other

The defining property is that workers do not talk to each other directly. All inter-worker information flow goes through the parameter servers, which makes communication topologically a bipartite graph rather than the all-to-all of classical message-passing systems. The Mu Li paper formalized the worker-to-server interface as two operations ^[1]:

Push(keys, values) sends a sparse vector of updates (typically gradients or aggregated gradients) to the appropriate servers based on the key partitioning.
Pull(keys) -> values reads a sparse vector of parameters from the appropriate servers.

A range push and range pull form lets workers operate on contiguous key ranges in one round-trip, which is essential when the model is dense within a key range (such as a layer weight matrix). The user provides the update function that the server runs when a push arrives, which means the server can apply Adam, AdaGrad, FTRL, or any custom rule rather than just doing a sum.
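The push and pull interface with a user-supplied update rule can be sketched in a few lines of Python. This is a single-process toy, not the ps-lite API: the `ToyParameterServer` class, the `sgd_update` helper, and the default value of 0.0 for unseen keys are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence

# An update function maps (current value, pushed value) -> new value.
UpdateFn = Callable[[float, float], float]


def sgd_update(lr: float) -> UpdateFn:
    """Hypothetical helper: plain SGD used as the server-side update rule."""
    def apply(weight: float, grad: float) -> float:
        return weight - lr * grad
    return apply


class ToyParameterServer:
    """Single-process stand-in for one server shard (illustrative only)."""

    def __init__(self, update_fn: UpdateFn, default: float = 0.0) -> None:
        self._store: Dict[int, float] = {}
        self._update_fn = update_fn
        self._default = default

    def push(self, keys: Sequence[int], values: Sequence[float]) -> None:
        # Apply the user-defined rule key by key; untouched keys are never visited.
        for key, value in zip(keys, values):
            current = self._store.get(key, self._default)
            self._store[key] = self._update_fn(current, value)

    def pull(self, keys: Sequence[int]) -> List[float]:
        # Return the current value for each requested key.
        return [self._store.get(key, self._default) for key in keys]


server = ToyParameterServer(sgd_update(lr=0.1))
server.push(keys=[3, 7], values=[1.0, -2.0])  # sparse gradient for two keys
weights = server.pull(keys=[3, 7, 9])         # key 9 was never pushed
```

Any other rule with the same two-argument signature could be passed to the constructor in place of `sgd_update`; the server code itself would not change.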

Key-value abstraction

The parameter store is logically a function from keys (typically 64-bit integers) to values (typically dense vectors of fp32 or bf16 numbers). Each server is responsible for a contiguous range of keys. The Mu Li paper used consistent hashing for elasticity, similar to how distributed in-memory caches such as memcached handle membership changes ^[1].
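A minimal sketch of consistent hashing shows why it helps with elasticity: when a server joins, only the keys that land on the new server's arcs change owner. The `HashRing` class, the use of MD5 for ring positions, and the 64 virtual nodes per server are assumptions chosen for illustration, not details taken from the paper.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _point(label: str) -> int:
    """Map a string to a position on a 64-bit ring."""
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers: Iterable[str], vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (position, server) pairs
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        # Each server is placed at several positions to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_point(f"{server}#{i}"), server))

    def owner(self, key: int) -> str:
        # First ring position clockwise from the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_point(f"key:{key}"), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["server-0", "server-1", "server-2", "server-3"])
keys = range(10_000)
before = {k: ring.owner(k) for k in keys}
ring.add("server-4")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch every key in `moved` ends up on `server-4`, and the other keys keep their original owner, so a membership change only requires shipping the reassigned slice of the parameter store.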

This key-value framing is the reason parameter servers excel at sparse models. Each push or pull only touches the keys that appear in the current minibatch. For a click model with billions of features but only thousands active per example, this is an enormous reduction over schemes that broadcast or reduce the full parameter vector every step.

Fault tolerance and elasticity

Servers replicate parameters across multiple machines using chain replication or primary-backup, allowing a server failure to be handled by promoting a replica without halting the workers. The Mu Li paper introduced range-based vector clocks, which record per-node timestamps for a whole key range rather than for every individual key, supporting both range-based replication and incremental recovery ^[1]. Because workers are stateless, a worker failure costs at most the minibatch in flight, and a fresh worker recovers by pulling current parameters from the servers. The system supports adding or removing workers and servers during training without restarting, which proved essential for production deployments where machine failures are routine.

Synchronization models

The single most consequential design decision in any parameter server is the consistency model that governs when a worker may proceed with the next iteration relative to other workers. The three canonical points on the spectrum are:

Model	Worker behavior	Convergence guarantees	Hardware utilization	Source
Bulk Synchronous Parallel (BSP)	All workers compute their gradients on iteration t, push to the servers, then wait until the servers have aggregated everything before pulling the new parameters	Identical to single-machine mini-batch SGD	Limited by the slowest worker (straggler problem)	Valiant 1990
Asynchronous Parallel (ASP)	Each worker pushes and pulls independently. Updates are applied to whatever parameters are currently on the server, with no waiting	No formal guarantees in the general non-convex case. May diverge if staleness is unbounded	Maximum: workers never wait	DistBelief Downpour SGD ^[3], HOGWILD! ^[6]
Stale Synchronous Parallel (SSP)	A worker at iteration c may proceed if no other worker is more than s iterations behind it (s is the staleness bound)	Bounded staleness regret bounds proven by Ho et al. ^[5]. Becomes BSP when $s = 0$, ASP when $s = \infty$	Tunable: trade convergence quality for utilization by choosing s	Ho et al. NeurIPS 2013 ^[5]

Bulk Synchronous Parallel

BSP is the safest mode and the easiest to reason about. The aggregated gradient on each round is mathematically equivalent to a mini-batch of size (worker batch size) times (number of workers), which makes the model identical to single-machine SGD with a larger batch. The cost is the straggler effect: any worker that is slower (slow disk, GC pause, hot CPU, network blip) holds up the entire round. In a 100-worker cluster, even occasional 2x slowdowns on individual workers can destroy throughput.
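The equivalence between BSP aggregation and a single larger mini-batch can be checked numerically. The sketch below assumes a scalar least-squares model, equal-sized shards, and a server that averages the pushed gradients; the data values and the `grad_mse` helper are made up for illustration.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (feature x, label y)


def grad_mse(w: float, batch: List[Example]) -> float:
    """Gradient of the mean of 0.5 * (w * x - y)^2 over the batch, w.r.t. w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


data: List[Example] = [
    (1.0, 2.0), (2.0, 3.9), (3.0, 6.1),
    (4.0, 8.2), (5.0, 9.8), (6.0, 12.1),
]
shards = [data[0:2], data[2:4], data[4:6]]  # three workers, two examples each
w = 0.5

# Each worker computes a gradient on its own shard and pushes it.
worker_grads = [grad_mse(w, shard) for shard in shards]
# The server averages the pushed gradients once every worker has reported.
aggregated = sum(worker_grads) / len(worker_grads)
# Single-machine gradient on the combined batch of 3 * 2 = 6 examples.
full_batch = grad_mse(w, data)
```

With equal shard sizes the two quantities agree up to floating-point rounding. If shards differ in size, the server would need to weight each pushed gradient by its shard size to keep the equivalence.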

Asynchronous SGD and the staleness problem

Asynchronous SGD removes the wait entirely. The DistBelief paper called this Downpour SGD ^[3]: each worker independently fetches parameters, computes a gradient, and pushes the gradient to the server, which applies the update to whatever the current parameter value is. Because the parameter the worker used to compute its gradient may already be several updates out of date, the gradient is a stale gradient with respect to the current parameters.

The HOGWILD! analysis showed that for sparse convex problems with bounded gradient delays, asynchronous SGD converges at almost the same rate as serial SGD ^[6]. The intuition is that if two workers update disjoint coordinates, their updates do not collide, and sparsity makes collisions rare. For dense, non-convex deep learning the picture is messier: empirically Downpour SGD often works, but practitioners report that aggressive asynchrony can hurt final accuracy and destabilize training.
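The effect of staleness on stability can be seen on a toy problem. The sketch below minimizes $f(w) = \frac{1}{2} w^2$ with a gradient that is always evaluated at the parameter value from `delay` steps ago. The function name, the quadratic objective, and the chosen learning rates are assumptions for illustration; real training dynamics are considerably more complicated.

```python
from typing import List


def delayed_sgd(lr: float, delay: int, steps: int = 200, w0: float = 1.0) -> List[float]:
    """Run SGD on f(w) = 0.5 * w^2 using gradients that are `delay` steps stale."""
    history = [w0] * (delay + 1)           # pretend w was constant before step 0
    for _ in range(steps):
        stale_w = history[-(delay + 1)]    # the value the worker originally pulled
        grad = stale_w                     # gradient of 0.5 * w^2 at the stale point
        history.append(history[-1] - lr * grad)  # applied to the current value
    return history


fresh = delayed_sgd(lr=0.9, delay=0)       # no staleness: shrinks by 10x per step
stale = delayed_sgd(lr=0.9, delay=3)       # same step size, stale gradients: blows up
stale_small_lr = delayed_sgd(lr=0.2, delay=3)  # smaller step size restores stability
```

On this toy objective the same learning rate that converges with fresh gradients diverges once gradients are three steps stale, and lowering the learning rate brings convergence back. This mirrors the practical advice that asynchronous training usually needs more conservative step sizes.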

Stale Synchronous Parallel

Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, and Eric P. Xing introduced Stale Synchronous Parallel at NeurIPS 2013 as an explicit middle ground ^[5]. The system tracks each worker's clock (iteration count) and lets a worker advance to iteration $c+1$ only if no worker has a clock smaller than $(c+1) - s$, where s is a configurable staleness threshold. When a worker reads a parameter, it sees all updates from iterations 0 through $c - s - 1$, and it may or may not see updates from the s most recent iterations.

Ho et al. proved a regret bound of order $O(\sqrt{(s+1)PT})$ for SSP applied to convex stochastic optimization, where T is the total number of iterations and P is the number of workers ^[5]. The bound matches the BSP case at $s = 0$ and degrades gracefully as s grows. Empirically, SSP delivered most of the throughput of pure asynchronous SGD while avoiding its convergence pathologies, and it was adopted as the default consistency model in Petuum ^[8].
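The SSP admission rule reduces to a one-line check on the clock table. The sketch below assumes that a worker's clock counts the iterations it has finished, and that the check runs after the worker ticks its clock; the function names and the dictionary representation are hypothetical.

```python
from typing import Dict


def finish_iteration(clocks: Dict[str, int], worker: str) -> None:
    """The worker commits its updates and ticks its clock."""
    clocks[worker] += 1


def may_proceed(clocks: Dict[str, int], worker: str, staleness: int) -> bool:
    """True if no worker's clock is smaller than this worker's clock minus s."""
    return min(clocks.values()) >= clocks[worker] - staleness


clocks = {"a": 5, "b": 3, "c": 4}
fast_ok_s2 = may_proceed(clocks, "a", staleness=2)  # slowest is exactly 2 behind
fast_ok_s1 = may_proceed(clocks, "a", staleness=1)  # slowest is too far behind
slow_ok_s0 = may_proceed(clocks, "b", staleness=0)  # the slowest worker never waits

# With staleness 0 the rule behaves like a BSP barrier.
bsp = {"a": 0, "b": 0}
finish_iteration(bsp, "a")
a_blocked = not may_proceed(bsp, "a", staleness=0)  # "a" waits for "b"
finish_iteration(bsp, "b")
a_released = may_proceed(bsp, "a", staleness=0)     # barrier opens
```

A very large `staleness` value makes the check always pass, which corresponds to the fully asynchronous end of the spectrum.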

Eventual versus sequential consistency

The Mu Li OSDI paper made these consistency choices a first-class abstraction in the parameter server itself: the framework lets the user pick Sequential, Eventual, or Bounded Delay consistency for each key range, rather than baking the choice into the algorithm ^[1]. This was a key part of why the third-generation parameter server became a general-purpose system rather than a one-algorithm system.

History

First generation: distributed sampler state (2010)

The first paper to use the term parameter server in roughly its modern sense was Smola and Narayanamurthy's An Architecture for Parallel Topic Models at VLDB 2010 ^[2]. They scaled Latent Dirichlet Allocation Gibbs sampling to hundreds of millions of documents by storing the topic count tables in a distributed key-value store accessed by sampler workers. The system was specialized to LDA and used memcached as the underlying store. It established that the bottleneck for scaling sampling-based ML was synchronization of the shared state, not raw compute.

Second generation: DistBelief and friends (2012 to 2014)

The second generation built on the same idea but customized to one model family per system.

DistBelief at Google, described in the NeurIPS 2012 paper Large Scale Distributed Deep Networks by Dean, Corrado, Monga, Chen, Devin, Mao, Ranzato, Senior, Tucker, Yang, Le, and Ng ^[3], scaled deep neural networks across thousands of CPU cores. The framework introduced two algorithms that ran on top of a sharded parameter server:

Downpour SGD ran many model replicas in parallel, each pulling parameters and pushing gradients to the server asynchronously. The asynchrony allowed scaling to many machines despite the slow individual hardware (CPUs, not GPUs).
Sandblaster L-BFGS ran a synchronous batch L-BFGS optimizer across the parameter server, with a coordinator process directing the workers.

DistBelief was used to train the famous cat neuron unsupervised model, Inception precursors, and the speech recognition models that became the production Google Voice Search backend. It was the direct predecessor of TensorFlow, and many of TensorFlow's distributed primitives are direct descendants of DistBelief's parameter server.

Microsoft Research's Project Adam, presented by Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman at OSDI 2014 ^[7], was a parameter-server-based deep learning system that trained a 2 billion connection model on the ImageNet 22,000 class problem to roughly 2x the accuracy of previous distributed systems while using 30x fewer machines. Adam's contribution was whole-system co-design: the workers and parameter servers were tuned together for asynchronous overlap, sparse communication, and locality. (Project Adam shares its name with the Adam optimizer but is unrelated.)

YahooLDA, by Smola, Ahmed, and others, was a production LDA system at Yahoo built on the same parameter-server ideas as the VLDB 2010 paper but at much larger scale.

Third generation: Mu Li and Petuum (2014 to 2015)

Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su published Scaling Distributed Machine Learning with the Parameter Server at OSDI 2014 ^[1]. The paper made three contributions that defined the modern architecture:

A general-purpose abstraction (push, pull, range operations, user-defined update functions) that made the framework algorithm-agnostic. Previous systems had each been built around a single algorithm or model family.
A flexible consistency model that exposed BSP, ASP, and bounded-delay (essentially SSP) as first-class options, configurable per key range.
Production-grade fault tolerance with replicated parameters, vector clocks, and elasticity for both workers and servers.

The paper reported scaling sparse logistic regression to a 636 TB ad-click dataset with 170 billion examples and 65 billion unique features across 1000 machines, and LDA to 5 billion documents on a similar cluster ^[1]. The open-source implementation (called ps-lite) became the parameter server backend for the DMLC ecosystem and for MXNet ^[9].

Petuum, by Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu, published in IEEE Transactions on Big Data in 2015 (with earlier conference presentations in 2013) ^[8], was a general-purpose parallel ML framework built around an SSP parameter server. Petuum exposed the bounded-staleness consistency model directly to applications and added a dynamic scheduler for model-parallel updates. It became the platform of choice for academic distributed ML research throughout the mid-2010s.

Frameworks

A non-exhaustive timeline of frameworks that ship a parameter server, organized by year of first public release.

Year	Framework	Origin	Status (2026)	Notes
2010	YahooLDA	Yahoo Research (Smola, Ahmed, Narayanamurthy) ^[2]	Discontinued	First-generation, LDA-specialized
2012	DistBelief	Google (Dean et al.) ^[3]	Superseded by TensorFlow	Second-generation, Downpour SGD
2014	Project Adam	Microsoft Research ^[7]	Discontinued	Image classification specialist
2014	ps-lite / Parameter Server	CMU and Google (Mu Li et al.) ^[1]	Maintained as part of MXNet	First true general-purpose PS, OSDI 2014
2015	Petuum	CMU (Xing et al.) ^[8]	Spun out as commercial company	Native SSP, scheduler-based model-parallelism
2015	TensorFlow `tf.train.replica_device_setter`	Google	Replaced by `ParameterServerStrategy` in TF2	First-class PS in TensorFlow 1.x
2015	MXNet KVStore	DMLC (Mu Li et al.) ^[9]	Apache project, low activity	Built on ps-lite. `dist_sync` and `dist_async` modes
2018	TensorFlow `ParameterServerStrategy`	Google ^[10]	Maintained	TF2 successor, integrates with `tf.distribute` API
2018	Alibaba XDL	Alibaba ^[11]	Production at Alibaba	Optimized for CTR and recommendation
2019	BytePS	ByteDance ^[12]	Maintained	Unifies AllReduce and PS as special cases. Up to 84 percent throughput improvement over Horovod on BERT
2019	PyTorch RPC parameter server	PyTorch	Maintained but not recommended for the common case	Available via `torch.distributed.rpc` for custom topologies

The MXNet KVStore deserves special mention because it inherited the ps-lite codebase directly. The same Mu Li who wrote the OSDI paper led MXNet's distributed training design, and MXNet's `kv = mx.kv.create('dist_sync')` style of distributed training was for several years the easiest way to use a parameter server in a real ML pipeline ^[9].

TensorFlow ParameterServerStrategy

In TensorFlow 2, `tf.distribute.ParameterServerStrategy` is the modern API for parameter server training ^[10]. It scales to thousands of workers paired with a smaller number of parameter servers, integrates with the Keras `Model.fit` API and with custom training loops via `tf.distribute.coordinator.ClusterCoordinator`, and supports asynchronous updates by default. Google retains it as a first-class strategy primarily because internal recommendation and ranking workloads still benefit from it.

BytePS

BytePS, released by ByteDance in 2019 ^[12], is interesting because it is not a pure parameter server. The authors proved that classical AllReduce and classical parameter server are both special cases of a more general communication primitive, and built a system in which the user can blend the two at runtime. By placing aggregation responsibility on lightweight CPU-only "PS" nodes that exist alongside GPU workers, BytePS reportedly outperforms Horovod plus NCCL by up to 84 percent on BERT-large and outperforms classical PS by up to 245 percent on representative DNN training jobs ^[12].

Comparison to AllReduce

The other dominant distributed training architecture, AllReduce, takes a fundamentally different approach: every worker holds a full copy of the parameters and synchronizes through a collective communication step that aggregates gradients across all workers. The most common implementation is Ring AllReduce, popularized in machine learning by Andrew Gibiansky's 2017 Baidu blog post and operationalized at production scale by Horovod (originally from Uber) and by NCCL (NVIDIA Collective Communications Library) ^[4].

Property	Parameter Server	AllReduce (Ring)
Topology	Bipartite: workers <-> servers	Ring or tree across workers; no servers
Workers per parameter copy	One copy partitioned across servers	One full copy per worker
Bandwidth per worker per step	Proportional to size of pushed and pulled keys	$\frac{2(N-1)}{N} \times \text{model size}$ , asymptotically constant in $N$
Communication pattern	All-to-many (workers to servers)	All-to-all but ring-structured
Synchronous mode	Optional (BSP)	Default
Asynchronous mode	Native (Downpour, SSP)	Hard to support cleanly
Sparse model support	Native, only touched keys are communicated	Requires gradient compression or sparse AllReduce
Heterogeneous worker speed	Tolerated naturally with ASP or SSP	Rigid: stragglers stall the ring
Fault tolerance	Designed in via server replication	Relies on framework checkpointing
Bandwidth optimality	No: hot servers can bottleneck	Yes: ring is provably bandwidth-optimal for dense AllReduce
Deployment complexity	Higher: two roles to provision	Lower: single homogeneous cluster
Best fit	Sparse models, recsys, CPU clusters, heterogeneous hardware	Dense models, GPU clusters, Transformer and CNN training

A useful intuition for why AllReduce won the deep-learning training market: when every layer is dense and the parameter count fits comfortably on every GPU, a parameter server spends bandwidth pushing the entire gradient to the servers only to pull the entire updated parameter vector back. Ring AllReduce performs the equivalent reduction with no central bottleneck, and its per-worker traffic stays close to twice the model size no matter how many workers join the ring ^[4]. For LLM pretraining, where parameters are dense and every step touches every parameter, this is dispositive.

When the model is sparse in the parameters touched per step (recommendation models, large vocabulary embeddings, ad CTR), the calculus inverts. AllReduce still pushes the full gradient including all the zeros, while a parameter server only communicates the keys that the minibatch actually touched. For a model with 100 billion embedding rows but only 10,000 active per step, a parameter server is many orders of magnitude cheaper. This is why parameter servers remain dominant in industrial recsys despite being out of favor in deep-learning research.
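The size of that gap can be worked out with back-of-the-envelope arithmetic. The row counts below come from the example above; the embedding width of 64 and the 16 workers are assumptions picked only to make the numbers concrete, and the helper names are hypothetical.

```python
def dense_allreduce_values(num_rows: int, dim: int, workers: int) -> float:
    """Values each worker sends in a dense Ring AllReduce over the full table."""
    return 2 * (workers - 1) / workers * num_rows * dim


def sparse_ps_values(active_rows: int, dim: int) -> int:
    """Values each worker moves with a parameter server: one pull plus one push."""
    return 2 * active_rows * dim


NUM_ROWS = 100_000_000_000   # 100 billion embedding rows (from the text)
ACTIVE_ROWS = 10_000         # rows touched per step (from the text)
DIM = 64                     # assumed embedding width
WORKERS = 16                 # assumed worker count

dense = dense_allreduce_values(NUM_ROWS, DIM, WORKERS)
sparse = sparse_ps_values(ACTIVE_ROWS, DIM)
ratio = dense / sparse
```

Under these assumptions the dense exchange moves roughly ten million times more values per step than the sparse one. The ratio is driven almost entirely by the fraction of rows that are active, so the choice of embedding width cancels out.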

Hybrid systems

Many modern frameworks combine the two. The dominant pattern in production recommender systems is:

Sparse part (embedding tables for users, items, hashed feature crosses): parameter server, often with the HOGWILD! update style for the sparse rows.
Dense part (the multi-layer perceptron on top of the embeddings): data-parallel AllReduce.

Meta's DLRM follows exactly this pattern in production CPU training, with embedding parameters partitioned across parameter servers and synchronized via Elastic Averaging SGD, while the dense MLP is data-parallel with allreduce-style synchronization ^[13]. Alibaba's XDL ^[11] and ByteDance's BytePS ^[12] codify the hybrid as the default system architecture.
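The hybrid pattern can be sketched as a single synchronous training step. This is a toy under stated assumptions, not the implementation used by DLRM, XDL, or BytePS: the model is a dot product between one embedding row and a dense weight vector, the loss is squared error, and `hybrid_step` and all variable names are hypothetical.

```python
from typing import Dict, List, Tuple


def hybrid_step(
    table: Dict[int, List[float]],        # embedding rows held on the parameter server
    dense_replicas: List[List[float]],    # one dense weight copy per worker
    batches: List[Tuple[int, float]],     # per worker: (active feature id, label)
    lr: float = 0.1,
) -> None:
    sparse_pushes = []
    dense_grads = []
    for dense_w, (fid, label) in zip(dense_replicas, batches):
        row = table[fid]                                   # pull only the active row
        err = sum(e * w for e, w in zip(row, dense_w)) - label
        sparse_pushes.append((fid, [err * w for w in dense_w]))  # gradient w.r.t. row
        dense_grads.append([err * e for e in row])               # gradient w.r.t. dense_w

    # Sparse part: the server applies each pushed row gradient to the touched row.
    for fid, grad in sparse_pushes:
        table[fid] = [e - lr * g for e, g in zip(table[fid], grad)]

    # Dense part: allreduce-style averaging, then the same update on every replica.
    n = len(dense_grads)
    avg = [sum(g[j] for g in dense_grads) / n for j in range(len(dense_grads[0]))]
    for dense_w in dense_replicas:
        for j, g in enumerate(avg):
            dense_w[j] -= lr * g


table = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
dense_replicas = [[1.0, 1.0], [1.0, 1.0]]
hybrid_step(table, dense_replicas, batches=[(1, 1.0), (3, 0.0)])
```

After the step, row 2 of the table is untouched because no worker's minibatch referenced it, while the dense replicas remain identical to each other because every worker applied the same averaged gradient.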

Modern usage

Recommender systems and CTR prediction

The single largest production workload of parameter servers in 2026 is industrial recommendation and ad ranking. Models in this domain share a common shape: very wide embedding tables (often tens of billions of rows for user and item IDs and hashed feature crosses), a relatively small MLP on top, and gradient updates that are extremely sparse (only the keys for the active features in the current minibatch). Parameter servers were designed for exactly this access pattern.

The Google Wide and Deep Learning for Recommender Systems paper (Cheng et al., 2016) used FTRL-Proximal on the wide linear part and AdaGrad on the deep part, both running on a parameter server. Alibaba XDL is built around a parameter server tuned for hundreds of billions of features ^[11]. Meta DLRM uses parameter servers for the sparse embedding parameters and AllReduce for the dense MLP ^[13]. ByteDance uses BytePS, which is a generalization of the parameter server ^[12]. The industry has converged on the same broad architecture across companies.

Heterogeneous and CPU-only clusters

When workers are CPUs rather than GPUs (most ad and search ranking pipelines), the parameter server's tolerance for stragglers matters more, because CPU performance is more variable than GPU performance. Parameter servers also fit naturally into shared clusters where workers may be preempted by other jobs.

Online and continual learning

The push and pull abstraction maps well onto online learning, where streaming data updates a model continuously. Several production click prediction systems use parameter servers as the backbone for hot model updates served back to inference fleets, with the ability to ingest new training examples and propagate the updated parameters to inference workers within seconds.

Model parallelism

When a model is too large to fit on a single worker's memory, parameter servers offer a natural form of model parallelism: each server owns one slice of the model and the workers fetch only the slice they need for the current minibatch. Modern LLM training rarely uses parameter servers for this purpose because tensor parallelism, pipeline parallelism, and ZeRO sharding give better throughput on GPU clusters, but for sparse-first models with massive embeddings the parameter server remains the model-parallelism mechanism of choice.

Limitations

Communication bottleneck for dense models

For dense deep neural networks, every worker reads and writes the full set of parameters every step. Parameter servers funnel this traffic through the server fleet, which becomes a bottleneck once the model size and worker count grow. Ring AllReduce avoids this by spreading the reduction across the workers themselves ^[4]. This is the single biggest reason the deep-learning research community moved off parameter servers around 2017.

Choosing the worker-to-server ratio

A persistent operational pain point: too few servers and individual servers become hot; too many servers and aggregation latency rises. The right ratio depends on the model sparsity, the network topology, and the fraction of time spent in communication versus compute. Production deployments often re-tune this ratio per model.

Convergence under asynchrony

Asynchronous SGD with parameter servers has reliable convergence on convex sparse problems (HOGWILD!) but can hurt accuracy on dense non-convex deep learning. Practitioners working with parameter servers in deep learning have to carefully tune learning rates, staleness bounds, and warmup to avoid divergence. The empirical finding that synchronous large-batch SGD often matches or beats asynchronous SGD in final accuracy ^[14] further reduced the appeal of parameter servers for state-of-the-art deep learning.

Operational complexity

A parameter server deployment requires provisioning two distinct machine roles, configuring fault tolerance, handling consistent hashing under membership changes, and debugging asymmetric failure modes. AllReduce systems require only one role and use simpler collective primitives. This pushes researchers and small teams toward AllReduce by default.

Unfair comparison in benchmarks

Many academic comparisons of parameter server versus AllReduce use small dense models (ResNet-50, BERT-base) where AllReduce is unambiguously better. Comparisons that include real industrial recommendation models with billion-scale embeddings tend to favor parameter servers or hybrid systems. The phrase "parameter servers are obsolete" is therefore correct for LLM pretraining but misleading for the bulk of industrial ML compute. BytePS's authors explicitly argued this in their 2020 OSDI paper ^[12], demonstrating that PS-style aggregation can match or beat AllReduce on dense models too once spare CPU bandwidth is exploited.

Related architectures

Architecture	Where the gradients meet	Where parameters live	Typical use case
Parameter Server	At dedicated server nodes	Sharded across servers	Sparse models, recsys, CTR
AllReduce	At every worker	Replicated on every worker	Dense deep learning
Federated Learning	At an aggregator (sometimes hierarchical)	Replicated on each device with periodic global sync	Privacy-sensitive on-device training
Gossip / decentralized SGD	Between random worker pairs	Replicated, slowly converging	Research, peer-to-peer deployments
Pipeline parallelism	Between adjacent pipeline stages	Sharded by layer	Very large model training
Tensor parallelism	Within each layer via collective ops	Sharded along tensor dimensions	LLM training

The parameter server should be understood as one point in this design space, optimized for sparse stateful aggregation across a large fleet, rather than as a universally good or bad architecture.

Explain like I'm 5 (ELI5)

Imagine a class of students writing a giant recipe book together. The recipe book is too big for any one student to carry, so it lives on a few shelves at the front of the room (the parameter servers). Each student has their own pile of ingredients to test and notes to make (the worker nodes).

When a student wants to update a recipe, they walk up to the shelves, find the page (key), make their change (push), and walk back. When they want to know how the latest version of a recipe reads, they walk up, look at the page (pull), copy it down, and walk back to keep working.

Three teachers run the room with different rules for how strict to be:

The strict teacher (BSP) says everyone has to finish their notes for the day, line up at the shelves, and update their pages all at once. If anyone is slow, everyone waits.
The chaotic teacher (ASP) says do whatever you want whenever you want. If two students rewrite the same paragraph at the same time, oh well.
The reasonable teacher (SSP) says you can work ahead, but you cannot get more than three pages ahead of the slowest student. That way nobody is too far out of sync, but the fast students do not have to wait for every slow one.

The parameter server is the shelves plus the rule the teacher picks. It works really well when each student only writes on a handful of pages each time (sparse updates, like in a recommendation model). It works less well when every student needs to rewrite the entire book every day, because then the line at the shelves gets really long. That is when teachers switch to a different system called AllReduce, where the students hand notes to each other in a circle instead of going to the shelves.

References

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. (2014). "Scaling Distributed Machine Learning with the Parameter Server." *11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14)*, pages 583-598. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu
Smola, A. and Narayanamurthy, S. (2010). "An Architecture for Parallel Topic Models." *Proceedings of the VLDB Endowment*, 3(1-2), 703-710. https://alex.smola.org/papers/2010/SmoNar10.pdf
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M. A., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. (2012). "Large Scale Distributed Deep Networks." *Advances in Neural Information Processing Systems 25 (NeurIPS 2012)*. https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf
Sergeev, A. and Del Balso, M. (2018). "Horovod: fast and easy distributed deep learning in TensorFlow." *arXiv preprint arXiv:1802.05799*. https://arxiv.org/abs/1802.05799
Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P. (2013). "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server." *Advances in Neural Information Processing Systems 26 (NeurIPS 2013)*. https://www.cs.cmu.edu/~seunghak/SSPTable_NIPS2013.pdf
Niu, F., Recht, B., Re, C., and Wright, S. J. (2011). "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent." *Advances in Neural Information Processing Systems 24 (NeurIPS 2011)*. https://proceedings.neurips.cc/paper/2011/file/218a0aefd1d1a4be65601cc6ddc1520e-Paper.pdf
Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). "Project Adam: Building an Efficient and Scalable Deep Learning Training System." *11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14)*, pages 571-582. https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf
Xing, E. P., Ho, Q., Dai, W., Kim, J. K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., and Yu, Y. (2015). "Petuum: A New Platform for Distributed Machine Learning on Big Data." *IEEE Transactions on Big Data*, 1(2), 49-67. https://www.cs.cmu.edu/~yaoliang/mypapers/petuum.pdf
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). "MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems." *Workshop on Machine Learning Systems (LearningSys), NeurIPS 2015*. https://arxiv.org/abs/1512.01274
TensorFlow Documentation. "Parameter server training with ParameterServerStrategy." Last updated April 2024. https://www.tensorflow.org/tutorials/distribute/parameter_server_training
Jiang, B., Deng, C., Yi, H., Hu, Z., Zhou, G., Zheng, Y., Huang, S., Guo, X., Wang, D., Song, Y., et al. (2019). "XDL: An Industrial Deep Learning Framework for High-dimensional Sparse Data." *Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data*. https://dl.acm.org/doi/10.1145/3326937.3341255
Jiang, Y., Zhu, Y., Lan, C., Yi, B., Cui, Y., and Guo, C. (2020). "A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters." *14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20)*. https://www.usenix.org/conference/osdi20/presentation/jiang
Naumov, M., Mudigere, D., Shi, H.-J. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.-J., Azzolini, A. G., et al. (2019). "Deep Learning Recommendation Model for Personalization and Recommendation Systems." *arXiv preprint arXiv:1906.00091*. https://arxiv.org/abs/1906.00091
Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." *arXiv preprint arXiv:1706.02677*. https://arxiv.org/abs/1706.02677
Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems*. https://arxiv.org/abs/1606.07792
Cipar, J., Ho, Q., Kim, J. K., Lee, S., Ganger, G. R., Gibson, G., Keeton, K., and Xing, E. (2013). "Solving the Straggler Problem with Bounded Staleness." *Proceedings of the 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS XIV)*. http://www.cs.cmu.edu/~epxing/papers/2013/Cipar_etal_HotOS13.pdf
Apache MXNet Documentation. "Distributed Training in MXNet." https://mxnet.apache.org/versions/master/api/faq/distributed_training
PyTorch Tutorials. "Implementing a Parameter Server Using Distributed RPC Framework." https://docs.pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html
Mu Li OSDI 2014 talk slides. https://www.cs.cmu.edu/~muli/file/osdi14_talk.pdf
ByteDance BytePS GitHub repository. https://github.com/bytedance/byteps
