Horovod
Last reviewed
May 2, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 4,065 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 2, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 4,065 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Distributed training, Data parallelism, Ring-allreduce, TensorFlow, PyTorch, NVIDIA NCCL
Horovod is an open-source distributed training framework for deep learning models, originally developed at Uber and released as open source in October 2017. It lets users scale a single-GPU training script written in TensorFlow, Keras, PyTorch, or Apache MXNet across many GPUs and many machines by adding only a handful of lines of code, and it relies on the ring-allreduce algorithm rather than a parameter-server architecture for averaging gradients between workers. Uber contributed Horovod to the LF AI Foundation (now LF AI & Data) in December 2018, and the project graduated to full foundation status in 2020. The framework is named after a traditional Russian folk dance in which performers move with linked arms in a circle, mirroring how its workers exchange tensors.
Horovod was one of the first projects to make collective communication patterns from high-performance computing widely accessible to deep-learning practitioners, and for several years it was the default way to train models on tens or hundreds of GPUs. The project's relevance has narrowed over time as both TensorFlow and PyTorch absorbed similar ideas into their native distributed APIs, and Horovod's release cadence slowed sharply after 2023; the most recent tagged release on GitHub is v0.28.1, dated 12 June 2023. The repository is still hosted under the LF AI & Data Foundation and the source code is widely deployed inside HPC clusters, container images, and managed ML platforms, but newer projects more often build on PyTorch's DistributedDataParallel, DeepSpeed, or framework-native distributed strategies.
| Field | Value |
|---|---|
| Type | Distributed deep-learning training framework |
| Original developer | Uber Technologies (Michelangelo platform team) |
| Initial public release | August 2017 (v0.9.0 on GitHub) |
| Open-source announcement | 17 October 2017 (Uber Engineering blog) |
| Latest release | v0.28.1, 12 June 2023 |
| License | Apache License 2.0 |
| Repository | github.com/horovod/horovod |
| Languages | Python, C++, CUDA |
| Supported frameworks | TensorFlow, Keras, PyTorch, Apache MXNet |
| Communication backends | NVIDIA NCCL, MPI (Open MPI, MPICH), Facebook Gloo, Intel oneCCL |
| Algorithm | Synchronous data-parallel SGD using ring-allreduce |
| Governance | LF AI & Data Foundation (graduated project, September 2020) |
| Original paper | Sergeev & Del Balso, Horovod: fast and easy distributed deep learning in TensorFlow (arXiv:1802.05799, February 2018) |
Imagine eight kids each have a deck of flashcards and they all want to learn the same multiplication table. Working alone, each kid might take an hour. If they split the cards eight ways and study at the same time, they can finish much faster, but at the end they have to sit in a circle and share what they learned so everybody ends up with the same knowledge. That sharing step is the slow part. If one kid talks to every other kid one by one, you spend a lot of time just passing notes around.
The ring-allreduce trick is to have the kids sit in a circle, and on each turn each kid passes one piece of paper to the kid on their right and gets one from the kid on their left, adding up the numbers as they go. After two trips around the circle everyone has the full sum and nobody had to wait on a single popular kid in the middle. Horovod is a piece of software that does exactly this for the gradients of a neural network during training: it lets eight, or sixty-four, or a thousand GPUs share what they learned each step almost as efficiently as if they were one big GPU. The clever part is that it does the trick from inside almost any deep-learning library you already use, so you only have to add about four lines of code to a normal training script to make it run distributed.
Horovod grew out of Uber's Michelangelo machine-learning platform, which the company introduced in September 2017 as an internal end-to-end ML-as-a-service stack. Michelangelo handled feature management, training, model serving, and monitoring across hundreds of production ML use cases at Uber, from estimated time of arrival to fraud detection. As the company began to train larger deep-learning models, the team ran into a wall: the standard distributed training APIs in TensorFlow at the time, which were based on parameter servers and tf.train.SyncReplicasOptimizer, were both hard to use correctly and inefficient at scale. Uber engineers reported that they were unable to leverage roughly half of their GPU resources when scaling distributed TensorFlow training to 128 GPUs.
In February 2017, Baidu Research published a blog post by Andrew Gibiansky titled Bringing HPC Techniques to Deep Learning, along with a draft TensorFlow implementation, showing that the ring-allreduce algorithm long used in MPI-based scientific computing could be adapted to gradient averaging in deep learning. The algorithm itself comes from a 2009 paper by Pitch Patarasuk and Xin Yuan, Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations, which proved that ring-allreduce achieves bandwidth-optimal communication independent of the number of workers. Baidu showed that, on convolutional networks, ring-allreduce dramatically outperformed the parameter-server pattern at scale.
Uber's deep-learning team picked up Baidu's draft, refactored it into a stand-alone Python library that did not require a custom TensorFlow build, hooked it into the framework as a regular Python optimizer, and added performance work like Tensor Fusion. The result was Horovod. Internally, Uber benchmarks showed scaling efficiency rising from roughly 50% on standard distributed TensorFlow to about 88% on Horovod for the same hardware and models.
Uber tagged the first public release, v0.9.0, on GitHub in August 2017 under the Apache 2.0 license. The official open-source announcement came on 17 October 2017 in a blog post by Alexander Sergeev titled Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow. The accompanying paper, Horovod: fast and easy distributed deep learning in TensorFlow by Alexander Sergeev and Mike Del Balso, was posted to arXiv on 15 February 2018 (arXiv:1802.05799).
The choice of name was deliberate: a horovod (Russian: хоровод) is a circle dance in which participants link arms and move together as a group, which mirrors the topology of the ring-allreduce algorithm. Sergeev's blog post explicitly drew the analogy.
In December 2018, Uber donated Horovod to the LF Deep Learning Foundation, a Linux Foundation project that was later folded into the LF AI Foundation and then renamed LF AI & Data. Horovod entered as an incubation-stage project. It graduated to full foundation status in September 2020, the highest project maturity tier in the foundation's lifecycle. Governance is handled by an elected technical steering committee under the foundation's open-source charter.
Through 2018-2022 Horovod expanded well beyond its original TensorFlow focus. PyTorch and Apache MXNet support arrived in 2018, Keras model integration was added, the Gloo controller landed so that users could run Horovod without an MPI installation, and elastic training was introduced in version 0.20 (mid-2020). Integrations with Apache Spark and Ray made it possible to launch Horovod jobs from inside data-processing pipelines.
Release activity slowed sharply after 2023. The most recent tagged release, v0.28.1, was published on 12 June 2023; no new versions have appeared on PyPI in the 18 months that followed. The repository continues to receive occasional commits and bug-fix work, but several large platforms have moved away from it as their default distributed-training tool. Databricks announced on 26 September 2024 that Horovod and HorovodRunner are deprecated and that releases after Databricks Runtime 15.4 LTS ML will not include the package pre-installed; Databricks now recommends TorchDistributor for PyTorch and the tf.distribute.Strategy API for TensorFlow.
Horovod implements synchronous data-parallel stochastic gradient descent. Each worker holds a full copy of the model, processes a different shard of each minibatch, computes local gradients, and then participates in a collective communication step that averages those gradients across all workers before the next parameter update. Because every worker has the same averaged gradient at the end of each step, every worker's parameters stay in sync, which means there is no parameter server, no asynchronous staleness, and no need for special checkpointing logic.
The core operation Horovod performs is allreduce: a collective in which N workers each contribute a tensor and each ends up with the sum (or mean) of all N tensors. Naive implementations either route everything through a central parameter server (bandwidth bottleneck on the server) or have every worker broadcast to every other worker (O(N²) traffic). Ring-allreduce avoids both problems.
In ring-allreduce the N workers are arranged in a logical ring. The tensor to be reduced is split into N chunks. The algorithm then runs in two phases:
Worker A --> Worker B --> Worker C --> Worker D
^ |
+--------------------------------------------+
The total amount of data each worker sends and receives is 2(N-1)/N times the size of the tensor, which approaches 2 as N grows. Critically, the per-worker bandwidth cost is independent of N. This is why Patarasuk and Yuan called the algorithm bandwidth optimal: when the network is the bottleneck, no allreduce algorithm can do better in terms of bytes-per-worker. Latency does scale with N (you need 2(N−1) communication steps), but for the tensor sizes typical in deep learning, bandwidth dominates.
Real neural networks generate many small gradient tensors per step, one for each parameter group. Running ring-allreduce on each tensor independently wastes network bandwidth because the per-message overhead becomes significant for small payloads. Horovod's Tensor Fusion feature batches small ready-to-reduce tensors into a single fused buffer (default fusion threshold of 128 MB) and runs one allreduce on the buffer instead of many. The Uber team reported that on models with many small layers running over an unoptimized TCP network, Tensor Fusion delivered improvements of up to 65 percent. The fusion threshold is configurable through the --fusion-threshold-mb flag of horovodrun; setting it to zero disables fusion.
The TensorFlow and PyTorch integrations both arrange for the allreduce on each gradient to be issued as soon as that gradient is computed during the backward pass, so that communication for early layers proceeds in parallel with backward-pass computation for later layers. This overlap is what makes the synchronous design competitive with asynchronous parameter-server schemes.
Horovod is structured as a thin C++ communication core wrapped in framework-specific Python bindings.
| Layer | Role |
|---|---|
| Framework adapter | A Python module per framework: horovod.tensorflow, horovod.keras, horovod.torch, horovod.mxnet. Each provides a DistributedOptimizer (or equivalent), a BroadcastGlobalVariablesHook, callbacks for Keras, and helpers for assigning each process to a local GPU. |
| Common Python layer | Initialization (hvd.init()), rank discovery, and the public allreduce / allgather / broadcast APIs. |
| C++ controller | Coordinates which tensors are ready to reduce across workers and decides when to fuse them. Two implementations exist: an MPI controller (uses the MPI runtime for both control plane and data plane by default) and a Gloo controller (no MPI dependency, required for elastic training). |
| Communication backend | Performs the actual allreduce on each fused buffer. NCCL is used on NVIDIA GPUs, MPI on CPUs and on systems where GPU-aware MPI is configured, Gloo as a portable fallback, and Intel oneCCL for CPU clusters tuned for Intel hardware. |
| Launcher | horovodrun wraps mpirun (for the MPI backend) or its own Gloo-based launcher to start one process per worker, set up rendezvous, and forward signals. Direct invocation through mpirun, Slurm, Kubernetes, or Spark is also supported. |
| Backend | Maintainer | Best fit | Notes |
|---|---|---|---|
| NCCL | NVIDIA | Multi-GPU and multi-node NVIDIA GPU clusters | Highly tuned ring and tree algorithms; supports GPUDirect RDMA over InfiniBand and RoCE; default backend for GPU allreduce in modern Horovod builds. |
| MPI | Open MPI / MPICH community | HPC clusters, mixed CPU/GPU systems | Required for the MPI controller; provides launcher, rendezvous, and collective operations; supports CUDA-aware builds. |
| Gloo | Sites without an MPI install, elastic training | Pure-Python rendezvous, TCP transport, no system dependencies; required for the elastic Horovod controller. | |
| Intel oneCCL | Intel | CPU clusters with Intel interconnects | Tuned for Intel Xeon and the Xe GPU line; integrated through the C++ controller. |
Later versions of Horovod added process sets, which let a job split its workers into independent groups that can run separate collective operations concurrently. This is useful for hybrid model-parallel and data-parallel training, where different subgroups of workers participate in different collectives.
Elastic training, introduced in version 0.20 (June 2020), allows the worker count to change at runtime. Workers can join or leave the job, for example because a spot instance is reclaimed or a new node is autoscaled in, without restarting from scratch or reloading from disk-based checkpoints. The training script wraps its main loop in an @hvd.elastic.run decorator, registers a state object that knows how to broadcast itself, and provides a host-discovery script. When the worker membership changes, Horovod reinitializes the ring, restores the last committed state, and resumes training. The elastic controller requires the Gloo backend.
Horovod ships as a pip package that compiles the C++ extensions for the user's chosen framework and backend at install time. A typical installation on a machine with CUDA and NCCL is:
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow,pytorch]
For builds without MPI, users add HOROVOD_WITH_GLOO=1 and omit HOROVOD_WITH_MPI=1. Pre-built containers from NVIDIA NGC and the official Horovod project also exist.
The original paper highlighted that converting a single-GPU TensorFlow script to distributed training with Horovod required only four key code changes. A minimal v1-style example looks like this:
import horovod.tensorflow as hvd
import tensorflow as tf
# 1. Initialize the Horovod runtime.
hvd.init()
# 2. Pin one GPU per worker by local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# 3. Scale the learning rate by the number of workers, then
# wrap the optimizer so it averages gradients across workers.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
# 4. Broadcast the initial variables from rank 0 to all others.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
while not sess.should_stop():
sess.run(train_op)
A TensorFlow 2 / Keras script wraps the optimizer and adds Horovod callbacks; a PyTorch script wraps the optimizer with hvd.DistributedOptimizer and broadcasts state with hvd.broadcast_parameters.
The horovodrun launcher hides the MPI or Gloo plumbing. To run on four GPUs of a single host:
horovodrun -np 4 -H localhost:4 python train.py
For sixteen GPUs across four hosts:
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
Horovod can also be launched directly with mpirun, with Kubernetes operators (such as Kubeflow's MPIJob), with Spark via horovod.spark, with Ray via the RayExecutor API, and from inside HPC schedulers like Slurm.
The original Sergeev & Del Balso paper reported strong scaling experiments on three convolutional models, Inception V3, ResNet-101, and VGG-16, trained with synthetic ImageNet data on 16, 32, 64, and 128 NVIDIA Pascal GPUs (Pascal P100s) connected over 25 GbE. The headline numbers from that paper are:
| Model | Standard distributed TensorFlow | Horovod | Notes |
|---|---|---|---|
| Inception V3 | ~50% scaling efficiency at 128 GPUs | ~88% | Roughly 2x throughput vs the parameter-server baseline. |
| ResNet-101 | Comparable to Inception V3 | ~88% | Similar improvement profile. |
| VGG-16 | Substantially worse, communication bound | Lower than Inception/ResNet on TCP, +30% on RDMA networks | VGG-16 has many parameters per layer; bandwidth dominates. |
The paper also reported that enabling RDMA-capable networking (InfiniBand or RoCE) on top of NCCL gave only a 3-4% improvement for Inception V3 and ResNet-101 (which are communication-light relative to their compute) but a roughly 30% improvement for VGG-16, whose parameter-heavy fully connected layers shifted the bottleneck onto the network. Tensor Fusion contributed up to a 65% throughput improvement on TCP for models with many small tensors such as ResNet-101.
Third-party comparisons in the years since have generally confirmed that Horovod's per-step overhead is competitive with framework-native distributed implementations, and that the gap to PyTorch's DistributedDataParallel on small-to-medium models is small. On very large models, more recent systems like DeepSpeed outperform Horovod because they combine data parallelism with model parallelism and ZeRO-style optimizer-state partitioning, which Horovod does not implement directly.
Horovod was the standard answer to how do I train this on more than one GPU? for several years, but the landscape now has several mature options.
| Aspect | Horovod | PyTorch DistributedDataParallel | TensorFlow MultiWorkerMirroredStrategy | DeepSpeed |
|---|---|---|---|---|
| Frameworks | TensorFlow, Keras, PyTorch, MXNet | PyTorch only | TensorFlow only | PyTorch only |
| Algorithm | Ring-allreduce via NCCL/MPI/Gloo/oneCCL | Ring-allreduce via NCCL/Gloo/MPI | All-reduce via NCCL/RING/CollectiveOps | ZeRO-1/2/3, pipeline, tensor parallelism |
| Multi-node | Yes, MPI or Gloo | Yes, NCCL or Gloo | Yes, gRPC + NCCL | Yes |
| Code changes | Wrap optimizer, init, broadcast (~4 lines) | Wrap module, init process group | Wrap model in strategy scope | Wrap model and optimizer |
| Launcher | horovodrun, mpirun | torchrun (formerly torch.distributed.launch) | tf.distribute.cluster_resolver + TF_CONFIG | deepspeed launcher |
| Elastic training | Yes (since v0.20, Gloo only) | Yes via torchelastic (now torchrun) | Limited | Yes |
| Model-parallel/sharding | No (data-parallel only) | FSDP for sharded data-parallel | ParameterServerStrategy and others | First-class ZeRO and pipeline |
| Active development | Slow since 2023 | Actively developed | Actively developed | Actively developed |
| Primary maintainer | LF AI & Data Foundation | Meta + community | Google + community | Microsoft + community |
| License | Apache 2.0 | BSD-3 | Apache 2.0 | MIT |
The practical effect is that new PyTorch projects almost universally start with DistributedDataParallel or FSDP, new TensorFlow projects start with the tf.distribute API, and very large transformer training runs reach for DeepSpeed or Megatron-LM. Horovod remains useful when a project needs to run the same training code under both TensorFlow and PyTorch, when it has to integrate with an existing MPI-based HPC environment, or when elastic training on heterogeneous hardware is a priority.
Uber published Horovod from inside its production Michelangelo platform; the Sergeev & Del Balso paper credits Uber's deep-learning workloads as the original benchmarks. The framework has subsequently been used or distributed by:
tf.distribute.Uber donated Horovod to the Linux Foundation's LF Deep Learning Foundation in December 2018 as an incubation-stage project. The foundation merged with the LF AI Foundation in 2019, becoming the LF AI Foundation, and was renamed LF AI & Data in 2020 when LF Data joined. Horovod graduated to full project status in September 2020. Day-to-day decisions are made by a technical steering committee of project maintainers operating under the foundation's open-source governance model. The trademark, neutral repository, and CI infrastructure are owned by the foundation rather than by Uber, which limits any single company's ability to change direction unilaterally.
The project's release cadence has slowed considerably. As of mid-2026 the latest release on GitHub remains v0.28.1 from 12 June 2023, no new versions have been published to PyPI in more than 18 months, and the Snyk package-health analysis classifies the package as inactive based on release cadence and other signals. The repository continues to receive commits and bug reports are still being triaged, but the project no longer ships major new features at the rate it did between 2018 and 2022.
The practical reasons for the slowdown are widely agreed on in the community: PyTorch's DistributedDataParallel and FSDP, TensorFlow's tf.distribute strategies, NVIDIA's NCCL improvements, DeepSpeed, and Megatron-LM together cover most of what users originally went to Horovod for, often with tighter integration into their respective frameworks. Several large platforms have migrated away. Databricks deprecated Horovod and HorovodRunner in September 2024, and Databricks Runtime ML versions after 15.4 LTS no longer pre-install the package; Databricks now points users at TorchDistributor for PyTorch and the tf.distribute.Strategy API for TensorFlow.
Horovod still has real strengths that keep it in production: it remains the easiest way to get the same training script running in both TensorFlow and PyTorch, it has first-class MPI support that fits HPC environments, the Gloo backend lets it run without system-wide MPI, the elastic feature is useful on preemptible cloud instances, and the integrations with Apache Spark and Ray are still in active use. For new projects targeting a single framework, the framework-native distributed APIs are now the more natural starting point.