# Parameter Server (PS)

> Source: https://aiwiki.ai/wiki/parameter_server_ps
> Updated: 2026-07-12
> Categories: MLOps, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Distributed training](/wiki/distributed_training), [Machine learning systems](/wiki/machine_learning_systems)*

The **Parameter Server (PS)** is a distributed system architecture for training large [machine learning](/wiki/machine_learning) models in which one set of machines, the **server nodes**, holds the global model parameters as a partitioned [key-value store](/wiki/key_value_store), while another set of machines, the **worker nodes**, processes data shards and exchanges parameter reads (pull) and gradient updates (push) with the servers. The pattern was popularized by [Mu Li](/wiki/mu_li) and collaborators in their 2014 OSDI paper *Scaling Distributed Machine Learning with the Parameter Server* [1], building on earlier work by [Alex Smola](/wiki/alex_smola) and Shravan Narayanamurthy at VLDB 2010 [2] and on [Google](/wiki/google)'s [DistBelief](/wiki/distbelief) system described by [Jeffrey Dean](/wiki/jeffrey_dean) et al. at NeurIPS 2012 [3]. For roughly half a decade the parameter server was the default architecture for large-scale [distributed machine learning](/wiki/distributed_machine_learning), then ceded the deep-learning training market to [AllReduce](/wiki/allreduce)-based collective communication systems such as [Horovod](/wiki/horovod) [4] and [PyTorch](/wiki/pytorch) [DistributedDataParallel](/wiki/distributeddataparallel) starting around 2017. It nevertheless remains the dominant architecture for industrial [recommender systems](/wiki/recommender_systems), [click-through rate](/wiki/click_through_rate) (CTR) prediction, and any setting where the model is dominated by sparse [embedding tables](/wiki/embedding_table) too large to replicate on every worker.

The parameter server is significant for two reasons beyond the systems it enabled. First, it formalized a separation of concerns that almost every later distributed-training framework adopted in some form: stateful, addressable parameter storage on one side, and stateless, mostly numeric workers on the other. Second, the parameter server papers were the first place where the design space of [synchronous](/wiki/synchronous_sgd) versus [asynchronous SGD](/wiki/asynchronous_sgd), [bounded staleness](/wiki/bounded_staleness), and elastic fault tolerance was laid out cleanly enough to be implemented as a general-purpose system rather than an algorithm-specific hack. The [Stale Synchronous Parallel](/wiki/stale_synchronous_parallel) (SSP) consistency model from Qirong Ho and colleagues at NeurIPS 2013 [5] grew out of this design space and remains the cleanest articulation of the trade-off between [convergence](/wiki/convergence) speed and parallel efficiency in [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd).

## Background and motivation

By the late 2000s, machine learning datasets had outgrown what fit on a single machine. Search ranking, contextual ads, [topic models](/wiki/topic_model) for behavioral targeting, and the first commercial [deep learning](/wiki/deep_learning) systems were all routinely working with billions of features or tens of millions of training examples. Several attempts to scale [SGD](/wiki/stochastic_gradient_descent_sgd) horizontally already existed:

- The classical synchronous **Bulk Synchronous Parallel** (BSP) approach, used by the [MapReduce](/wiki/mapreduce) family and [Apache Hadoop](/wiki/apache_hadoop), required every worker to finish each iteration before the next could begin, which left fast workers idle whenever any worker straggled.
- Lock-protected shared-memory updates were limited to a single multi-core machine.
- Pure asynchronous schemes such as **HOGWILD!** by Feng Niu, [Benjamin Recht](/wiki/benjamin_recht), Christopher Re, and Stephen Wright at NeurIPS 2011 [6] showed that lock-free updates from many threads could converge for sparse problems, but the analysis was specific to shared-memory systems and did not address how to scale across a cluster.

The parameter server architecture took the lock-free, mostly asynchronous philosophy of HOGWILD! and pushed it across the network. Rather than every worker reading and writing a single shared memory region, a logical key-value store was sharded across many server machines, and workers issued remote pull and push operations against it. This split allowed the parameters to be much larger than any single worker's memory and let the worker and server tiers be scaled independently.

### The three generations

Mu Li's OSDI 2014 paper organized the history of the architecture into three generations [1]:

| Generation | Representative system | Year | Key idea |
|---|---|---|---|
| First | Distributed [Latent Dirichlet Allocation](/wiki/latent_dirichlet_allocation) sampler of Smola and Narayanamurthy [2] | 2010 | Distributed (key, value) store for sampler state across a cluster, used to scale [topic models](/wiki/topic_model) to hundreds of millions of documents |
| Second | Google's [DistBelief](/wiki/distbelief) [3], YahooLDA, Microsoft's [Project Adam](/wiki/project_adam) [7] | 2012 to 2014 | Application-specific parameter servers for one model family (deep nets, LDA, ad CTR). Introduced [Downpour SGD](/wiki/downpour_sgd) and [Sandblaster L-BFGS](/wiki/sandblaster_l_bfgs) [3] |
| Third | Mu Li et al. ps-lite and the Parameter Server [1], [Petuum](/wiki/petuum) [8] | 2014 to 2015 | General-purpose parameter server with arbitrary user-defined update functions, vector clocks, range push and pull, and configurable consistency models including SSP |

The third generation is what the term *parameter server* almost always refers to in modern usage, and most production systems still in service trace back to it.

## Architecture

A parameter server deployment partitions the cluster into three logical roles. In small deployments, the scheduler often coexists with the first server.

| Role | Stateful | Job |
|---|---|---|
| **Scheduler** | Yes (membership only) | Assigns key ranges to servers, monitors heartbeats, manages elastic join and leave, drives recovery |
| **Server nodes** | Yes (model parameters) | Hold partitioned parameters, apply user-defined update functions, serve push and pull requests, replicate for fault tolerance |
| **Worker nodes** | Mostly stateless | Hold a data shard, compute local [gradients](/wiki/gradient) on minibatches, push gradients and pull current parameters, never communicate with each other |

The defining property is that workers do **not** talk to each other directly. All inter-worker information flow goes through the parameter servers, which makes communication topologically a bipartite graph rather than the all-to-all of classical message-passing systems. The Mu Li paper formalized the worker-to-server interface as two operations [1]:

- `Push(keys, values)` sends a sparse vector of updates (typically gradients or aggregated gradients) to the appropriate servers based on the key partitioning.
- `Pull(keys) -> values` reads a sparse vector of parameters from the appropriate servers.

The `range push` and `range pull` forms let workers operate on contiguous key ranges in one round-trip, which is essential when the model is dense within a key range (such as a layer weight matrix). The user provides the *update function* that the server runs when a push arrives, which means the server can apply [Adam](/wiki/adam_optimizer), [AdaGrad](/wiki/adagrad), [FTRL](/wiki/ftrl), or any custom rule rather than simply summing gradients.
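The push and pull interface with a user-defined update function can be sketched in a few lines. This is a minimal single-process illustration, not the ps-lite API: the class `ToyParameterServer`, the `adagrad` helper, and its hyperparameters are hypothetical names chosen for the example, and a real server would shard keys across machines and handle requests over the network.

```python
from collections import defaultdict
import math


class ToyParameterServer:
    """In-process stand-in for one server shard (hypothetical, not ps-lite)."""

    def __init__(self, update_fn, init_value=0.0):
        self.update_fn = update_fn        # user-defined rule run on every push
        self.init_value = init_value
        self.params = {}                  # key -> current parameter value
        self.state = defaultdict(float)   # key -> optimizer state (AdaGrad accumulator here)

    def push(self, keys, grads):
        """Apply the update function to each pushed (key, gradient) pair."""
        for k, g in zip(keys, grads):
            w = self.params.get(k, self.init_value)
            self.params[k], self.state[k] = self.update_fn(w, g, self.state[k])

    def pull(self, keys):
        """Return current values for the requested keys only."""
        return [self.params.get(k, self.init_value) for k in keys]


def adagrad(w, g, acc, lr=0.1, eps=1e-8):
    """AdaGrad step: scale the learning rate by the accumulated squared gradient."""
    acc = acc + g * g
    return w - lr * g / (math.sqrt(acc) + eps), acc


server = ToyParameterServer(update_fn=adagrad)
server.push(keys=[7, 42], grads=[1.0, -2.0])   # a worker pushes a sparse gradient
pulled = server.pull([7, 42, 99])              # key 99 was never touched
```

Swapping `adagrad` for another function with the same `(w, g, state) -> (w, state)` shape changes the optimizer without touching the worker code, which is the point of letting the server run a user-supplied rule.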

### Key-value abstraction

The parameter store is logically a function from keys (typically 64-bit integers) to values (typically dense vectors of fp32 or bf16 numbers). Each server is responsible for a contiguous range of keys. The Mu Li paper used consistent hashing for elasticity, similar to how distributed in-memory caches such as [memcached](/wiki/memcached) handle membership changes [1].
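A rough sketch of how contiguous key ranges and consistent hashing fit together: servers are placed at points on a ring of 64-bit keys, and each server owns the arc from its own point up to the next server's point. The `KeyRing` class, the SHA-256 placement rule, and the server names are assumptions made for this example rather than details taken from the paper.

```python
import bisect
import hashlib

KEY_SPACE = 2 ** 64  # keys are treated as 64-bit integers on a ring


def ring_position(name):
    """Map a server name to a point on the ring (hypothetical placement rule)."""
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], "big")


class KeyRing:
    def __init__(self, servers):
        self.points = sorted((ring_position(s), s) for s in servers)

    def add_server(self, name):
        bisect.insort(self.points, (ring_position(name), name))

    def owner(self, key):
        """Each server owns the arc from its own point up to the next server's point."""
        positions = [p for p, _ in self.points]
        idx = bisect.bisect_right(positions, key % KEY_SPACE) - 1
        return self.points[idx][1]   # idx == -1 wraps around to the last server


ring = KeyRing(["server-0", "server-1", "server-2"])
keys = [(i * 0x9E3779B97F4A7C15) % KEY_SPACE for i in range(1000)]  # spread-out sample keys
before = {k: ring.owner(k) for k in keys}
ring.add_server("server-3")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `server-3`; keys owned by the other servers stay where they were, which is what keeps membership changes cheap.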

This key-value framing is the reason parameter servers excel at sparse models. Each push or pull only touches the keys that appear in the current minibatch. For a click model with billions of features but only thousands active per example, this is an enormous reduction over schemes that broadcast or reduce the full parameter vector every step.

### Fault tolerance and elasticity

Servers replicate parameters across multiple machines using chain replication or primary-backup, allowing a server failure to be handled by promoting a replica without halting the workers. The Mu Li paper introduced range-based **vector clocks**, which record update timestamps per key range rather than per key, supporting both range-based replication and incremental recovery at low metadata cost [1]. Because workers are stateless, a worker failure costs at most the in-flight minibatch and is recovered by a fresh worker pulling current parameters from the servers. The system supports adding or removing workers and servers during training without restarting, which proved essential for production deployments where machines fail constantly.
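The primary-backup idea can be illustrated with a toy shard that copies every update to a backup before acknowledging it. This is only a sketch under simplifying assumptions: both replicas live in one process, `ReplicatedShard` and its methods are hypothetical names, and the update is a plain sum rather than a user-defined function.

```python
class ReplicatedShard:
    """Toy primary-backup shard; both replicas live in one process for illustration."""

    def __init__(self):
        self.primary = {}
        self.backup = {}

    def push(self, key, delta):
        # Apply on the primary, then copy to the backup before acknowledging,
        # so an acknowledged update is never lost together with the primary.
        self.primary[key] = self.primary.get(key, 0.0) + delta
        self.backup[key] = self.primary[key]
        return "ack"

    def pull(self, key):
        return self.primary.get(key, 0.0)

    def fail_primary(self):
        # Promote the backup; a real system would also recruit a new backup.
        self.primary, self.backup = self.backup, {}


shard = ReplicatedShard()
shard.push(1, 0.5)
shard.push(1, 0.25)
shard.fail_primary()          # simulate losing the primary replica
recovered = shard.pull(1)     # workers keep pulling from the promoted replica
```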

## Synchronization models

The single most consequential design decision in any parameter server is the consistency model that governs when a worker may proceed with the next iteration relative to other workers. The three canonical points on the spectrum are:

| Model | Worker behavior | Convergence guarantees | Hardware utilization | Source |
|---|---|---|---|---|
| **Bulk Synchronous Parallel (BSP)** | All workers compute their gradients on iteration t, push to the servers, then wait until the servers have aggregated everything before pulling the new parameters | Identical to single-machine [mini-batch SGD](/wiki/mini_batch_sgd) | Limited by the slowest worker (straggler problem) | Valiant 1990 |
| **Asynchronous Parallel (ASP)** | Each worker pushes and pulls independently. Updates are applied to whatever parameters are currently on the server, with no waiting | No formal guarantees in the general non-convex case. May diverge if staleness is unbounded | Maximum: workers never wait | DistBelief Downpour SGD [3], HOGWILD! [6] |
| **Stale Synchronous Parallel (SSP)** | A worker at iteration c may proceed if no other worker is more than s iterations behind it (s is the staleness bound) | Bounded staleness regret bounds proven by Ho et al. [5]. Becomes BSP when $$s = 0$$, ASP when $$s = \infty$$ | Tunable: trade convergence quality for utilization by choosing s | Ho et al. NeurIPS 2013 [5] |

### Bulk Synchronous Parallel

BSP is the safest mode and the easiest to reason about. The aggregated gradient on each round is mathematically equivalent to a [mini-batch](/wiki/minibatch) of size (worker batch size) times (number of workers), which makes the model identical to single-machine SGD with a larger batch. The cost is the **straggler effect**: any worker that is slowed down (slow disk, GC pause, hot CPU, network blip) holds up the entire round. In a 100-worker cluster, even occasional 2x slowdowns on individual workers can cut throughput sharply, because the chance that at least one worker is slow in a given round grows quickly with the number of workers.
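A back-of-the-envelope calculation shows how quickly stragglers add up. The model below is an assumption made for illustration, not a measurement: each worker is independently slow with probability `p_slow` in a given round, a slow worker takes `slowdown` times as long, and a BSP round lasts as long as its slowest worker.

```python
def bsp_expected_step_time(num_workers, p_slow, slowdown=2.0, base_time=1.0):
    """Expected BSP step time when each worker is independently slow with probability p_slow."""
    p_any_slow = 1.0 - (1.0 - p_slow) ** num_workers   # at least one straggler this round
    return base_time * (1.0 + (slowdown - 1.0) * p_any_slow)


single = bsp_expected_step_time(num_workers=1, p_slow=0.01)
cluster = bsp_expected_step_time(num_workers=100, p_slow=0.01)
```

Under these assumed numbers, a 1 percent chance of a 2x slowdown costs a single worker about 1 percent of its time, but a 100-worker BSP round hits at least one straggler roughly 63 percent of the time, so the expected step time rises to about 1.63 times the ideal.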

### Asynchronous SGD and the staleness problem

Asynchronous SGD removes the wait entirely. The DistBelief paper called this **Downpour SGD** [3]: each worker independently fetches parameters, computes a gradient, and pushes the gradient to the server, which applies the update to whatever the current parameter value is. Because the parameters the worker used to compute its gradient may already be several updates out of date by the time the gradient arrives, the gradient is a *stale gradient* with respect to the current parameters.

The HOGWILD! analysis showed that for sparse convex problems with bounded gradient delays, asynchronous SGD converges at almost the same rate as serial SGD [6]. The intuition is that if two workers update disjoint coordinates, their updates do not collide, and sparsity makes collisions rare. For dense, non-convex deep learning the picture is messier: empirically Downpour SGD often works, but practitioners report that aggressive asynchrony can hurt final accuracy and destabilize training.
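The effect of staleness can be seen on a deliberately simple problem. The sketch below is not Downpour SGD or HOGWILD!: it minimizes the one-dimensional quadratic f(w) = 0.5 w^2 with a gradient that is always evaluated on parameters a fixed number of updates old. The function name, the learning rate of 0.5, and the fixed delay are all assumptions chosen to make the behavior easy to see.

```python
def run_delayed_sgd(delay, lr=0.5, steps=200, w0=1.0):
    """Gradient descent on f(w) = 0.5 * w**2 using gradients that are `delay` updates stale."""
    history = [w0] * (delay + 1)            # pretend the first reads all saw w0
    for _ in range(steps):
        stale_w = history[-(delay + 1)]     # parameters as they were `delay` updates ago
        grad = stale_w                      # gradient of 0.5 * w**2 at the stale point
        history.append(history[-1] - lr * grad)
    return history


fresh = run_delayed_sgd(delay=0)   # ordinary gradient descent
mild = run_delayed_sgd(delay=1)    # slightly stale gradients
stale = run_delayed_sgd(delay=5)   # heavily stale gradients, same learning rate
```

With fresh or slightly stale gradients the iterate shrinks toward zero, while with a delay of five updates the same learning rate produces oscillations of growing amplitude. Lowering the learning rate restores convergence in this toy setting, which mirrors the tuning burden practitioners report for aggressive asynchrony.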

### Stale Synchronous Parallel

Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, and [Eric P. Xing](/wiki/eric_xing) introduced Stale Synchronous Parallel at NeurIPS 2013 as an explicit middle ground [5]. The system tracks each worker's clock (iteration count) and lets a worker whose clock is $$c$$ start its next iteration only if no worker has a clock smaller than $$c - s$$, where s is a configurable staleness threshold. When that worker reads a parameter, it is guaranteed to see all updates from iterations 0 through $$c - s - 1$$, and it may or may not see updates from more recent iterations.

Ho et al. proved a regret bound of order $$O(\sqrt{(s+1)PT})$$ for SSP applied to convex stochastic optimization, where T is the total number of iterations and P is the number of workers [5]. The bound reduces to the BSP case at $$s = 0$$ and degrades gracefully as s grows. Empirically, SSP delivered most of the throughput of pure asynchronous SGD while avoiding its convergence pathologies, and it was adopted as the default consistency model in [Petuum](/wiki/petuum) [8].
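The SSP admission rule reduces to a one-line check. The sketch below takes a worker's clock to be the number of iterations it has completed and lets it start another iteration only while the slowest worker is at most `staleness` clocks behind; the function name and the list-of-clocks representation are assumptions made for the example.

```python
import math


def ssp_may_proceed(clocks, worker, staleness):
    """True if `worker` may start its next iteration under staleness bound `staleness`."""
    return clocks[worker] - min(clocks) <= staleness


# s = 0 behaves like BSP: a worker that finishes early must wait for the rest.
bsp_all_level = [ssp_may_proceed([3, 3, 3], w, 0) for w in range(3)]
bsp_one_ahead = [ssp_may_proceed([4, 3, 3], w, 0) for w in range(3)]

# s = 2 lets a fast worker run up to two clocks ahead of the slowest one.
ssp_within_bound = ssp_may_proceed([5, 3, 4], 0, 2)
ssp_beyond_bound = ssp_may_proceed([6, 3, 4], 0, 2)

# An unbounded s never blocks, which is the fully asynchronous case.
asp_never_blocks = ssp_may_proceed([1000, 0, 7], 0, math.inf)
```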

### Eventual versus sequential consistency

The Mu Li OSDI paper made these consistency choices a first-class abstraction in the parameter server itself: the framework lets the user pick *Sequential*, *Eventual*, or *Bounded Delay* consistency for each key range, rather than baking the choice into the algorithm [1]. Roughly, Sequential corresponds to BSP, Eventual to ASP, and Bounded Delay to SSP-style bounded staleness. This was a key part of why the third-generation parameter server became a general-purpose system rather than a one-algorithm system.

## History

### First generation: distributed sampler state (2010)

The first paper to use the term *parameter server* in roughly its modern sense was Smola and Narayanamurthy's *An Architecture for Parallel Topic Models* at VLDB 2010 [2]. They scaled [Latent Dirichlet Allocation](/wiki/latent_dirichlet_allocation) [Gibbs sampling](/wiki/markov_chain_monte_carlo) to hundreds of millions of documents by storing the topic count tables in a distributed key-value store accessed by sampler workers. The system was specialized to LDA and used [memcached](/wiki/memcached) as the underlying store. It established that the bottleneck for scaling sampling-based ML was synchronization of the shared state, not raw compute.

### Second generation: DistBelief and friends (2012 to 2014)

The second generation built on the same idea, but each system was customized to one model family.

**DistBelief** at Google, described in the NeurIPS 2012 paper *Large Scale Distributed Deep Networks* by Dean, Corrado, Monga, Chen, Devin, Mao, Ranzato, Senior, Tucker, Yang, Le, and Ng [3], scaled deep neural networks across thousands of CPU cores. The framework introduced two algorithms that ran on top of a sharded parameter server:

- **Downpour SGD** ran many model replicas in parallel, each pulling parameters and pushing gradients to the server asynchronously. The asynchrony allowed scaling to many machines despite the slow individual hardware (CPUs, not GPUs).
- **Sandblaster L-BFGS** ran a synchronous batch [L-BFGS](/wiki/l_bfgs) optimizer across the parameter server, with a coordinator process directing the workers.

DistBelief was used to train the famous *cat neuron* unsupervised model, [Inception](/wiki/inception) precursors, and the speech recognition models that became the production [Google Voice Search](/wiki/google_voice_search) backend. It was the direct predecessor of [TensorFlow](/wiki/tensorflow), and many of TensorFlow's distributed primitives are direct descendants of DistBelief's parameter server.

**Microsoft Research's Project Adam**, presented by Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman at OSDI 2014 [7], was a parameter-server-based deep learning system that trained a 2 billion connection model on the [ImageNet](/wiki/imagenet) 22,000 class problem to roughly 2x the accuracy of previous distributed systems while using 30x fewer machines. Adam's contribution was whole-system co-design: the workers and parameter servers were tuned together for asynchronous overlap, sparse communication, and locality. (Project Adam shares its name with the [Adam optimizer](/wiki/adam_optimizer) but is unrelated.)

**YahooLDA**, by Smola, Ahmed, and others, was a production LDA system at Yahoo built on the same parameter-server ideas as the VLDB 2010 paper but at much larger scale.

### Third generation: Mu Li and Petuum (2014 to 2015)

Mu Li, David G. Andersen, Jun Woo Park, [Alexander J. Smola](/wiki/alex_smola), Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su published *Scaling Distributed Machine Learning with the Parameter Server* at OSDI 2014 [1]. The paper made three contributions that defined the modern architecture:

1. A general-purpose abstraction (push, pull, range operations, user-defined update functions) that made the framework algorithm-agnostic. Previous systems had each been built around a single algorithm or model family.
2. A flexible consistency model that exposed BSP, ASP, and bounded-delay (essentially SSP) as first-class options, configurable per key range.
3. Production-grade fault tolerance with replicated parameters, vector clocks, and elasticity for both workers and servers.

The paper reported scaling [Sparse Logistic Regression](/wiki/logistic_regression) to a 0.5 petabyte ad-click dataset with 17 billion examples and 100 billion parameters across 1000 machines, and LDA to 5 billion documents on a similar cluster [1]. The open-source implementation (called **ps-lite**) became the parameter server backend for the [DMLC](/wiki/dmlc) ecosystem and for [MXNet](/wiki/mxnet) [9].

**Petuum**, by Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu at IEEE Big Data 2015 (with earlier conference presentations in 2013) [8], was a parallel general-purpose ML framework built around an SSP parameter server. Petuum exposed the bounded-staleness consistency model directly to applications and added a dynamic scheduler for model-parallel updates. It became the platform of choice for academic distributed ML research throughout the mid-2010s.

## Frameworks

A non-exhaustive timeline of frameworks that ship a parameter server, organized by year of first public release.

| Year | Framework | Origin | Status (2026) | Notes |
|---|---|---|---|---|
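| 2010 | Smola and Narayanamurthy LDA sampler (later YahooLDA) | [Yahoo Research](/wiki/yahoo_research) (Smola, Ahmed, Narayanamurthy) [2] | Discontinued | LDA-specialized; first-generation design that YahooLDA extended into a second-generation production system |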
| 2012 | DistBelief | [Google](/wiki/google) (Dean et al.) [3] | Superseded by [TensorFlow](/wiki/tensorflow) | Second-generation, Downpour SGD |
| 2014 | Project Adam | [Microsoft Research](/wiki/microsoft_research) [7] | Discontinued | Image classification specialist |
| 2014 | ps-lite / Parameter Server | CMU and Google (Mu Li et al.) [1] | Maintained as part of [MXNet](/wiki/mxnet) | First true general-purpose PS, OSDI 2014 |
| 2015 | [Petuum](/wiki/petuum) | CMU (Xing et al.) [8] | Spun out as commercial company | Native SSP, scheduler-based model-parallelism |
| 2015 | TensorFlow `tf.train.replica_device_setter` | [Google](/wiki/google) | Replaced by `ParameterServerStrategy` in TF2 | First-class PS in TensorFlow 1.x |
| 2015 | [MXNet](/wiki/mxnet) KVStore | [DMLC](/wiki/dmlc) (Mu Li et al.) [9] | Apache project, low activity | Built on ps-lite. `dist_sync` and `dist_async` modes |
| 2018 | TensorFlow `ParameterServerStrategy` | [Google](/wiki/google) [10] | Maintained | TF2 successor, integrates with `tf.distribute` API |
| 2018 | Alibaba [XDL](/wiki/xdl) | Alibaba [11] | Production at Alibaba | Optimized for [CTR](/wiki/click_through_rate) and recommendation |
| 2019 | [BytePS](/wiki/byteps) | ByteDance [12] | Maintained | Unifies AllReduce and PS as special cases. Up to 84 percent throughput improvement over Horovod on BERT |
| 2019 | PyTorch RPC parameter server | [PyTorch](/wiki/pytorch) | Maintained but not recommended for the common case | Available via `torch.distributed.rpc` for custom topologies |

The MXNet KVStore deserves special mention because it inherited the ps-lite codebase directly. The same Mu Li who wrote the OSDI paper led MXNet's distributed training design, and MXNet's `kv = mx.kv.create('dist_sync')` style of distributed training was for several years the easiest way to use a parameter server in a real ML pipeline [9].

### TensorFlow ParameterServerStrategy

In TensorFlow 2, `tf.distribute.ParameterServerStrategy` is the modern API for parameter server training [10]. It scales to thousands of workers paired with a smaller number of parameter servers, integrates with the `Model.fit` Keras API and with custom training loops via `tf.distribute.coordinator.ClusterCoordinator`, and supports asynchronous updates by default. Google retains it as a first-class strategy primarily because internal recommendation and ranking workloads still benefit from it.

### BytePS

[BytePS](/wiki/byteps), released by ByteDance in 2019 [12], is interesting because it is *not* a pure parameter server. The authors proved that classical AllReduce and classical parameter server are both special cases of a more general communication primitive, and built a system in which the user can blend the two at runtime. By placing aggregation responsibility on lightweight CPU-only "PS" nodes that exist alongside GPU workers, BytePS reportedly outperforms [Horovod](/wiki/horovod) plus [NCCL](/wiki/nccl) by up to 84 percent on [BERT](/wiki/bert)-large and outperforms classical PS by up to 245 percent on representative DNN training jobs [12].

## Comparison to AllReduce

The other dominant distributed training architecture, **[AllReduce](/wiki/allreduce)**, takes a fundamentally different approach: every worker holds a full copy of the parameters and synchronizes through a collective communication step that aggregates gradients across all workers. The most common implementation is **[Ring AllReduce](/wiki/ring_allreduce)**, popularized in machine learning by Andrew Gibiansky's 2017 Baidu blog post and operationalized at production scale by [Horovod](/wiki/horovod) (originally from [Uber](/wiki/uber)) and by [NCCL](/wiki/nccl) (NVIDIA Collective Communications Library) [4].

| Property | Parameter Server | AllReduce (Ring) |
|---|---|---|
| Topology | Bipartite: workers <-> servers | Ring or tree across workers; no servers |
| Parameter placement | One copy partitioned across servers | One full copy per worker |
| Bandwidth per worker per step | Proportional to size of pushed and pulled keys | $$\frac{2(N-1)}{N} \times \text{model size}$$, asymptotically constant in $$N$$ |
| Communication pattern | Many-to-many between workers and servers | Neighbor-to-neighbor around the ring |
| Synchronous mode | Optional (BSP) | Default |
| Asynchronous mode | Native (Downpour, SSP) | Hard to support cleanly |
| Sparse model support | Native, only touched keys are communicated | Requires gradient compression or sparse AllReduce |
| Heterogeneous worker speed | Tolerated naturally with ASP or SSP | Rigid: stragglers stall the ring |
| Fault tolerance | Designed in via server replication | Relies on framework checkpointing |
| Bandwidth optimality | No: hot servers can bottleneck | Yes: ring is provably bandwidth-optimal for dense AllReduce |
| Deployment complexity | Higher: two roles to provision | Lower: single homogeneous cluster |
| Best fit | Sparse models, recsys, CPU clusters, heterogeneous hardware | Dense models, GPU clusters, [Transformer](/wiki/transformer) and [CNN](/wiki/convolutional_neural_network) training |

A useful intuition for why AllReduce won the deep-learning training market: when every layer is dense and the parameter count fits comfortably on every GPU, a parameter server adds a round trip through the server tier, with each worker pushing its entire gradient to the servers and then pulling the entire parameter vector back. Ring AllReduce performs the equivalent reduction with no central bottleneck, and its per-worker traffic stays close to twice the model size no matter how many workers participate [4]. For [LLM](/wiki/llm) pretraining, where parameters are dense and every step touches every parameter, this is decisive.
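The bandwidth argument can be made concrete with two small formulas. The first is the per-worker Ring AllReduce traffic from the comparison table. The second is a simplifying assumption for this example: a dense model sharded evenly over the servers, with every worker pushing a full gradient and pulling a full parameter copy each step. Both function names are hypothetical.

```python
def ring_allreduce_traffic(model_bytes, num_workers):
    """Bytes each worker sends in one ring AllReduce (reduce-scatter plus all-gather)."""
    n = num_workers
    return 2 * (n - 1) / n * model_bytes


def ps_server_traffic(model_bytes, num_workers, num_servers):
    """Bytes each server receives plus sends per step for a dense model, evenly sharded."""
    return 2 * num_workers * model_bytes / num_servers


one_gb = 1e9
ring_small = ring_allreduce_traffic(one_gb, 4)       # 1.5 GB per worker
ring_large = ring_allreduce_traffic(one_gb, 1000)    # just under 2 GB per worker
ps_single = ps_server_traffic(one_gb, 8, 1)          # 16 GB through one server
ps_matched = ps_server_traffic(one_gb, 8, 8)         # 2 GB per server with 8 servers
```

In this simplified model a parameter server needs roughly as many servers as workers to bring per-server traffic down to what each ring participant handles, which is one way to see why dense GPU training moved to AllReduce.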

When the model is **sparse** in the parameters touched per step (recommendation models, large vocabulary embeddings, ad CTR), the calculus inverts. AllReduce still pushes the full gradient including all the zeros, while a parameter server only communicates the keys that the minibatch actually touched. For a model with 100 billion embedding rows but only 10,000 active per step, a parameter server is many orders of magnitude cheaper. This is why parameter servers remain dominant in industrial recsys despite being out of favor in deep-learning research.
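As a rough worked version of that example, assume every embedding row has the same width and ignore constant factors and per-key overhead. The ratio of rows a dense reduction must cover to rows a parameter server actually moves is then

$$\frac{\text{rows in the table}}{\text{rows active per step}} = \frac{10^{11}}{10^{4}} = 10^{7}$$

which is seven orders of magnitude less traffic per step for the sparse push and pull under these assumptions.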

### Hybrid systems

Many modern frameworks combine the two. The dominant pattern in production [recommender systems](/wiki/recommender_systems) is:

- **Sparse part** (embedding tables for users, items, hashed feature crosses): parameter server, often with the [HOGWILD!](/wiki/hogwild) update style for the sparse rows.
- **Dense part** (the multi-layer perceptron on top of the embeddings): data-parallel AllReduce.

[Meta](/wiki/meta)'s [DLRM](/wiki/dlrm) follows exactly this pattern in production CPU training, with embedding parameters partitioned across parameter servers and synchronized via Elastic Averaging SGD, while the dense MLP is data-parallel with allreduce-style synchronization [13]. [Alibaba](/wiki/alibaba_cloud)'s XDL [11] and ByteDance's BytePS [12] codify the hybrid as the default system architecture.

## Modern usage

### Recommender systems and CTR prediction

The single largest production workload of parameter servers in 2026 is industrial recommendation and ad ranking. Models in this domain share a common shape: very wide embedding tables (often tens of billions of rows for user and item IDs and hashed feature crosses), a relatively small [MLP](/wiki/perceptron) on top, and gradient updates that are extremely sparse (only the keys for the active features in the current minibatch). Parameter servers were designed for exactly this access pattern.

The Google *Wide and Deep Learning for Recommender Systems* paper (Cheng et al., 2016) used [FTRL](/wiki/ftrl)-Proximal on the wide linear part and AdaGrad on the deep part, both running on a parameter server. Alibaba [XDL](/wiki/xdl) is built around a parameter server tuned for hundreds of billions of features [11]. Meta DLRM uses parameter servers for the sparse embedding parameters and AllReduce for the dense MLP [13]. ByteDance uses BytePS, which is a generalization of the parameter server [12]. The industry has converged on the same broad architecture across companies.

### Heterogeneous and CPU-only clusters

When workers are CPUs rather than GPUs (most ad and search ranking pipelines), the parameter server's tolerance for stragglers matters more, because CPU performance is more variable than GPU performance. Parameter servers also fit naturally into shared clusters where workers may be preempted by other jobs.

### Online and continual learning

The push and pull abstraction maps well onto online learning, where streaming data updates a model continuously. Several production click prediction systems use parameter servers as the backbone for hot model updates served back to inference fleets, with the ability to ingest new training examples and propagate the updated parameters to inference workers within seconds.

### Model parallelism

When a model is too large to fit in a single worker's memory, parameter servers offer a natural form of model parallelism: each server owns one slice of the model and the workers fetch only the slice they need for the current minibatch. Modern [LLM](/wiki/llm) training rarely uses parameter servers for this purpose because [tensor parallelism](/wiki/tensor_parallelism), [pipeline parallelism](/wiki/pipeline_parallelism), and [ZeRO](/wiki/zero) sharding give better throughput on GPU clusters, but for sparse-first models with massive embeddings the parameter server remains the model-parallelism mechanism of choice.

## Limitations

### Communication bottleneck for dense models

For dense deep neural networks, every worker reads and writes the full set of parameters every step. Parameter servers funnel this traffic through the server fleet, which becomes a bottleneck once the model size and worker count grow. Ring AllReduce avoids this by spreading the reduction across the workers themselves [4]. This is the single biggest reason the deep-learning research community moved off parameter servers around 2017.

### Choosing the worker-to-server ratio

A persistent operational pain point: too few servers and individual servers become hot; too many servers and aggregation latency rises. The right ratio depends on the model sparsity, the network topology, and the fraction of time spent in communication versus compute. Production deployments often re-tune this ratio per model.

### Convergence under asynchrony

Asynchronous SGD with parameter servers has reliable convergence on convex sparse problems (HOGWILD!) but can hurt accuracy on dense non-convex deep learning. Practitioners working with parameter servers in deep learning have to carefully tune learning rates, staleness bounds, and warmup to avoid divergence. The empirical finding that synchronous large-batch SGD often matches or beats asynchronous SGD in final accuracy [14] further reduced the appeal of parameter servers for state-of-the-art deep learning.

### Operational complexity

A parameter server deployment requires provisioning two distinct machine roles, configuring fault tolerance, handling consistent hashing under membership changes, and debugging asymmetric failure modes. AllReduce systems require only one role and use simpler collective primitives. This pushes researchers and small teams toward AllReduce by default.

### Unfair comparison in benchmarks

Many academic comparisons of parameter server versus AllReduce use small dense models (ResNet-50, BERT-base) where AllReduce is unambiguously better. Comparisons that include real industrial recommendation models with billion-scale embeddings tend to favor parameter servers or hybrid systems. The phrase "parameter servers are obsolete" is therefore correct for [LLM](/wiki/llm) pretraining but misleading for the bulk of industrial ML compute. BytePS's authors explicitly argued this in their 2020 OSDI paper [12], demonstrating that PS-style aggregation can match or beat AllReduce on dense models too once spare CPU bandwidth is exploited.

## Related architectures

| Architecture | Where the gradients meet | Where parameters live | Typical use case |
|---|---|---|---|
| Parameter Server | At dedicated server nodes | Sharded across servers | Sparse models, recsys, CTR |
| AllReduce | At every worker | Replicated on every worker | Dense deep learning |
| Federated Learning | At an aggregator (sometimes hierarchical) | Replicated on each device with periodic global sync | Privacy-sensitive on-device training |
| Gossip / decentralized SGD | Between random worker pairs | Replicated, slowly converging | Research, peer-to-peer deployments |
| Pipeline parallelism | Between adjacent pipeline stages | Sharded by layer | Very large model training |
| Tensor parallelism | Within each layer via collective ops | Sharded along tensor dimensions | LLM training |

The parameter server should be understood as one point in this design space, optimized for sparse stateful aggregation across a large fleet, rather than as a universally good or bad architecture.

## Explain like I'm 5 (ELI5)

Imagine a class of students writing a giant recipe book together. The recipe book is too big for any one student to carry, so it lives on a few shelves at the front of the room (the parameter servers). Each student has their own pile of ingredients to test and notes to make (the worker nodes).

When a student wants to update a recipe, they walk up to the shelves, find the page (key), make their change (push), and walk back. When they want to know how the latest version of a recipe reads, they walk up, look at the page (pull), copy it down, and walk back to keep working.

The room can be run by one of three teachers, each with a different rule for how strict to be:

- **The strict teacher (BSP)** says everyone has to finish their notes for the day, line up at the shelves, and update their pages all at once. If anyone is slow, everyone waits.
- **The chaotic teacher (ASP)** says do whatever you want whenever you want. If two students rewrite the same paragraph at the same time, oh well.
- **The reasonable teacher (SSP)** says you can work ahead, but you cannot get more than, say, three days ahead of the slowest student. That way nobody is too far out of sync, but the fast students do not have to wait for every slow one.

The parameter server is the shelves plus the rule the teacher picks. It works really well when each student only writes on a handful of pages each time (sparse updates, like in a recommendation model). It works less well when every student needs to rewrite the entire book every day, because then the line at the shelves gets really long. That is when teachers switch to a different system called AllReduce, where the students hand notes to each other in a circle instead of going to the shelves.
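Stepping outside the analogy for a moment, the three teachers' rules can be written as one check with a different staleness bound. This is a minimal sketch, not any framework's API: `may_advance` and its arguments are hypothetical names, and a "clock" here means the number of iterations a worker has completed.

```python
import math
from typing import Sequence


def may_advance(worker_clock: int, all_clocks: Sequence[int], staleness: float) -> bool:
    """Return True if a worker may start its next iteration without waiting.

    The worker is allowed to proceed only while it is at most `staleness`
    completed iterations ahead of the slowest worker.
    """
    return worker_clock - min(all_clocks) <= staleness


BSP = 0         # strict teacher: nobody runs ahead of the slowest worker
SSP = 3         # reasonable teacher: bounded lead (the bound is a tunable choice)
ASP = math.inf  # chaotic teacher: no bound at all

if __name__ == "__main__":
    clocks = [5, 3, 2]  # completed iterations for three hypothetical workers
    for name, bound in [("BSP", BSP), ("SSP", SSP), ("ASP", ASP)]:
        decisions = [may_advance(c, clocks, bound) for c in clocks]
        print(name, decisions)
```

With these clocks the fastest worker is three iterations ahead of the slowest, so it must wait under BSP but may continue under SSP with a bound of three and under ASP. The slowest worker is never blocked under any of the rules.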

## See also

- [Distributed training](/wiki/distributed_training)
- [AllReduce](/wiki/allreduce)
- [Ring AllReduce](/wiki/ring_allreduce)
- [Horovod](/wiki/horovod)
- [DistBelief](/wiki/distbelief)
- [HOGWILD!](/wiki/hogwild)
- [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd)
- [Asynchronous SGD](/wiki/asynchronous_sgd)
- [Synchronous SGD](/wiki/synchronous_sgd)
- [Stale Synchronous Parallel](/wiki/stale_synchronous_parallel)
- [Petuum](/wiki/petuum)
- [Project Adam](/wiki/project_adam)
- [TensorFlow](/wiki/tensorflow)
- [MXNet](/wiki/mxnet)
- [BytePS](/wiki/byteps)
- [DLRM](/wiki/dlrm)
- [Recommender systems](/wiki/recommender_systems)

## References

1. Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. (2014). "Scaling Distributed Machine Learning with the Parameter Server." *11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14)*, pages 583-598. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu
2. Smola, A. and Narayanamurthy, S. (2010). "An Architecture for Parallel Topic Models." *Proceedings of the VLDB Endowment*, 3(1-2), 703-710. https://alex.smola.org/papers/2010/SmoNar10.pdf
3. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M. A., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. (2012). "Large Scale Distributed Deep Networks." *Advances in Neural Information Processing Systems 25 (NeurIPS 2012)*. https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf
4. Sergeev, A. and Del Balso, M. (2018). "Horovod: fast and easy distributed deep learning in TensorFlow." *arXiv preprint arXiv:1802.05799*. https://arxiv.org/abs/1802.05799
5. Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P. (2013). "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server." *Advances in Neural Information Processing Systems 26 (NeurIPS 2013)*. https://www.cs.cmu.edu/~seunghak/SSPTable_NIPS2013.pdf
6. Niu, F., Recht, B., Ré, C., and Wright, S. J. (2011). "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent." *Advances in Neural Information Processing Systems 24 (NeurIPS 2011)*. https://proceedings.neurips.cc/paper/2011/file/218a0aefd1d1a4be65601cc6ddc1520e-Paper.pdf
7. Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). "Project Adam: Building an Efficient and Scalable Deep Learning Training System." *11th USENIX Symposium on Operating Systems Design and Implementation (OSDI '14)*, pages 571-582. https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf
8. Xing, E. P., Ho, Q., Dai, W., Kim, J. K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., and Yu, Y. (2015). "Petuum: A New Platform for Distributed Machine Learning on Big Data." *IEEE Transactions on Big Data*, 1(2), 49-67. https://www.cs.cmu.edu/~yaoliang/mypapers/petuum.pdf
9. Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). "MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems." *Workshop on Machine Learning Systems (LearningSys), NeurIPS 2015*. https://arxiv.org/abs/1512.01274
10. TensorFlow Documentation. "Parameter server training with ParameterServerStrategy." Last updated April 2024. https://www.tensorflow.org/tutorials/distribute/parameter_server_training
11. Jiang, B., Deng, C., Yi, H., Hu, Z., Zhou, G., Zheng, Y., Huang, S., Guo, X., Wang, D., Song, Y., et al. (2019). "XDL: An Industrial Deep Learning Framework for High-dimensional Sparse Data." *Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data*. https://dl.acm.org/doi/10.1145/3326937.3341255
12. Jiang, Y., Zhu, Y., Lan, C., Yi, B., Cui, Y., and Guo, C. (2020). "A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters." *14th USENIX Symposium on Operating Systems Design and Implementation (OSDI '20)*. https://www.usenix.org/conference/osdi20/presentation/jiang
13. Naumov, M., Mudigere, D., Shi, H.-J. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.-J., Azzolini, A. G., et al. (2019). "Deep Learning Recommendation Model for Personalization and Recommendation Systems." *arXiv preprint arXiv:1906.00091*. https://arxiv.org/abs/1906.00091
14. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." *arXiv preprint arXiv:1706.02677*. https://arxiv.org/abs/1706.02677
15. Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems*. https://arxiv.org/abs/1606.07792
16. Cipar, J., Ho, Q., Kim, J. K., Lee, S., Ganger, G. R., Gibson, G., Keeton, K., and Xing, E. (2013). "Solving the Straggler Problem with Bounded Staleness." *Proceedings of the 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS XIV)*. http://www.cs.cmu.edu/~epxing/papers/2013/Cipar_etal_HotOS13.pdf
17. Apache MXNet Documentation. "Distributed Training in MXNet." https://mxnet.apache.org/versions/master/api/faq/distributed_training
18. PyTorch Tutorials. "Implementing a Parameter Server Using Distributed RPC Framework." https://docs.pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html
19. Mu Li OSDI 2014 talk slides. https://www.cs.cmu.edu/~muli/file/osdi14_talk.pdf
20. ByteDance BytePS GitHub repository. https://github.com/bytedance/byteps