Horovod

AI Infrastructure Developer Tools Open Source AI

22 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

20 citations

Revision

v2 · 4,474 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Horovod is an open-source distributed training framework for deep learning that lets a single-GPU training script scale across many GPUs and many machines by adding only a few lines of code. ^[1] It was created at Uber by Alexander Sergeev and released as open source in 2017, and it averages gradients between workers using the ring-allreduce algorithm (built on MPI and NVIDIA's NCCL) instead of a parameter-server architecture. ^[1]^[2] Horovod supports TensorFlow, Keras, PyTorch, and Apache MXNet, and Uber donated it to the LF Deep Learning Foundation (a Linux Foundation project, now the LF AI Foundation / LF AI & Data) on 13 December 2018. ^[8] The name comes from a traditional Slavic folk dance: a horovod (Russian: хоровод) is a circle dance in which performers move with linked arms, mirroring how its workers exchange tensors around a ring. ^[2]

The foundational paper, Horovod: fast and easy distributed deep learning in TensorFlow by Alexander Sergeev and Mike Del Balso (arXiv:1802.05799, 15 February 2018), describes the goal directly: "an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow." ^[1] On Uber's own benchmarks the approach "achieved an 88 percent efficiency mark" scaling Inception V3 and ResNet-101 across many GPUs, roughly double the throughput of the standard parameter-server baseline. ^[2]

Horovod was one of the first projects to make collective communication patterns from high-performance computing widely accessible to deep-learning practitioners, and for several years it was the default way to train models on tens or hundreds of GPUs. The project's relevance has narrowed over time as both TensorFlow and PyTorch absorbed similar ideas into their native distributed APIs, and Horovod's release cadence slowed sharply after 2023; the most recent tagged release on GitHub is v0.28.1, dated 12 June 2023. ^[13] The repository is still hosted under the LF AI & Data Foundation and the source code is widely deployed inside HPC clusters, container images, and managed ML platforms, but newer projects more often build on PyTorch's DistributedDataParallel, DeepSpeed, or framework-native distributed strategies.

Facts

Field	Value
Type	Distributed deep-learning training framework
Original developer	Uber Technologies (Michelangelo platform team)
Lead author	Alexander Sergeev
Initial public release	August 2017 (v0.9.0 on GitHub)
Open-source announcement	17 October 2017 (Uber Engineering blog)
Latest release	v0.28.1, 12 June 2023
License	Apache License 2.0
Repository	github.com/horovod/horovod
Languages	Python, C++, CUDA
Supported frameworks	TensorFlow, Keras, PyTorch, Apache MXNet
Communication backends	NVIDIA NCCL, MPI (Open MPI, MPICH), Facebook Gloo, Intel oneCCL
Algorithm	Synchronous data-parallel SGD using ring-allreduce
Governance	LF AI & Data Foundation (graduated project, September 2020)
Original paper	Sergeev & Del Balso, Horovod: fast and easy distributed deep learning in TensorFlow (arXiv:1802.05799, February 2018)

ELI5: explain like I'm 5

Imagine eight kids each have a deck of flashcards and they all want to learn the same multiplication table. Working alone, each kid might take an hour. If they split the cards eight ways and study at the same time, they can finish much faster, but at the end they have to sit in a circle and share what they learned so everybody ends up with the same knowledge. That sharing step is the slow part. If one kid talks to every other kid one by one, you spend a lot of time just passing notes around.

The ring-allreduce trick is to have the kids sit in a circle, and on each turn each kid passes one piece of paper to the kid on their right and gets one from the kid on their left, adding up the numbers as they go. After two trips around the circle everyone has the full sum and nobody had to wait on a single popular kid in the middle. Horovod is a piece of software that does exactly this for the gradients of a neural network during training: it lets eight, or sixty-four, or a thousand GPUs share what they learned each step almost as efficiently as if they were one big GPU. The clever part is that it does the trick from inside almost any deep-learning library you already use, so you only have to add about four lines of code to a normal training script to make it run distributed.

What is Horovod and who created it?

Horovod is a distributed deep-learning training framework that turns a normal single-GPU script into a synchronous data-parallel job spread across many GPUs and machines, with only a handful of added lines. ^[1] It was built at Uber by Alexander Sergeev (the paper's lead author) on the company's Michelangelo machine-learning platform, and Uber released it as open source under the Apache 2.0 license in 2017. ^[2] Governance later moved to the Linux Foundation: Uber donated the project to the LF Deep Learning Foundation on 13 December 2018, and it now sits inside LF AI & Data. ^[8]

Origins at Uber and Michelangelo

Horovod grew out of Uber's Michelangelo machine-learning platform, which the company introduced in September 2017 as an internal end-to-end ML-as-a-service stack. Michelangelo handled feature management, training, model serving, and monitoring across hundreds of production ML use cases at Uber, from estimated time of arrival to fraud detection. As the company began to train larger deep-learning models, the team ran into a wall: the standard distributed training APIs in TensorFlow at the time, which were based on parameter servers and tf.train.SyncReplicasOptimizer, were both hard to use correctly and inefficient at scale. Uber engineers reported that they were unable to leverage roughly half of their GPU resources when scaling distributed TensorFlow training to 128 GPUs, noting after the switch that "we were no longer wasting half of the GPU resources." ^[2]

Where did the ring-allreduce idea come from?

In February 2017, Baidu Research published a blog post by Andrew Gibiansky titled Bringing HPC Techniques to Deep Learning, along with a draft TensorFlow implementation, showing that the ring-allreduce algorithm long used in MPI-based scientific computing could be adapted to gradient averaging in deep learning. ^[4] The algorithm itself comes from a 2009 paper by Pitch Patarasuk and Xin Yuan, Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations, which proved that ring-allreduce achieves bandwidth-optimal communication independent of the number of workers. ^[3] Baidu showed that, on convolutional networks, ring-allreduce dramatically outperformed the parameter-server pattern at scale.

Uber's deep-learning team picked up Baidu's draft, refactored it into a stand-alone Python library that did not require a custom TensorFlow build, hooked it into the framework as a regular Python optimizer, and added performance work like Tensor Fusion. The result was Horovod. Internally, Uber benchmarks showed scaling efficiency rising from roughly 50% on standard distributed TensorFlow to about 88% on Horovod for the same hardware and models. ^[2]

When was Horovod released as open source?

Uber tagged the first public release, v0.9.0, on GitHub in August 2017 under the Apache 2.0 license. The official open-source announcement came on 17 October 2017 in a blog post by Alexander Sergeev titled Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow. ^[2] The accompanying paper, Horovod: fast and easy distributed deep learning in TensorFlow by Alexander Sergeev and Mike Del Balso, was posted to arXiv on 15 February 2018 (arXiv:1802.05799). ^[1]

The choice of name was deliberate. As Sergeev's blog post put it, Horovod is "named after a traditional Russian folk dance in which performers dance with linked arms in a circle, much like how distributed TensorFlow processes use Horovod to communicate with each other." ^[2] A horovod (Russian: хоровод) is a circle dance in which participants link arms and move together as a group, which mirrors the topology of the ring-allreduce algorithm.

Move to the LF AI Foundation

In December 2018, Uber donated Horovod to the LF Deep Learning Foundation, a Linux Foundation project that was later folded into the LF AI Foundation and then renamed LF AI & Data. ^[8] Horovod entered as an incubation-stage project. It graduated to full foundation status in September 2020, the highest project maturity tier in the foundation's lifecycle. ^[7] Governance is handled by an elected technical steering committee under the foundation's open-source charter.

Maturity and slowdown

Through 2018-2022 Horovod expanded well beyond its original TensorFlow focus. PyTorch and Apache MXNet support arrived in 2018, Keras model integration was added, the Gloo controller landed so that users could run Horovod without an MPI installation, and elastic training was introduced in version 0.20 (mid-2020). ^[10] Integrations with Apache Spark and Ray made it possible to launch Horovod jobs from inside data-processing pipelines. ^[11]^[12]

Release activity slowed sharply after 2023. The most recent tagged release, v0.28.1, was published on 12 June 2023; no new versions have appeared on PyPI in the 18 months that followed. ^[13]^[14] The repository continues to receive occasional commits and bug-fix work, but several large platforms have moved away from it as their default distributed-training tool. Databricks announced on 26 September 2024 that Horovod and HorovodRunner are deprecated and that releases after Databricks Runtime 15.4 LTS ML will not include the package pre-installed; Databricks now recommends TorchDistributor for PyTorch and the tf.distribute.Strategy API for TensorFlow. ^[15]

How does Horovod work?

Horovod implements synchronous data-parallel stochastic gradient descent. Each worker holds a full copy of the model, processes a different shard of each minibatch, computes local gradients, and then participates in a collective communication step that averages those gradients across all workers before the next parameter update. ^[1] Because every worker has the same averaged gradient at the end of each step, every worker's parameters stay in sync, which means there is no parameter server, no asynchronous staleness, and no need for special checkpointing logic.

How does ring-allreduce work?

The core operation Horovod performs is allreduce: a collective in which N workers each contribute a tensor and each ends up with the sum (or mean) of all N tensors. Naive implementations either route everything through a central parameter server (bandwidth bottleneck on the server) or have every worker broadcast to every other worker (O(N²) traffic). Ring-allreduce avoids both problems.

In ring-allreduce the N workers are arranged in a logical ring. The tensor to be reduced is split into N chunks. The algorithm then runs in two phases:

Scatter-reduce. For N-1 steps, each worker simultaneously sends one chunk to its right neighbor and receives one chunk from its left neighbor, adding the received chunk into its own buffer. After N-1 steps, each worker holds the fully reduced version of one of the N chunks.
All-gather. For another N-1 steps, each worker forwards its fully reduced chunk around the ring so that, by the end, every worker has the complete reduced tensor.

  Worker A  -->  Worker B  -->  Worker C  -->  Worker D
      ^                                            |
      +--------------------------------------------+

The total amount of data each worker sends and receives is 2(N-1)/N times the size of the tensor, which approaches 2 as N grows. Critically, the per-worker bandwidth cost is independent of N. This is why Patarasuk and Yuan called the algorithm bandwidth optimal: when the network is the bottleneck, no allreduce algorithm can do better in terms of bytes-per-worker. ^[3] Latency does scale with N (you need 2(N-1) communication steps), but for the tensor sizes typical in deep learning, bandwidth dominates.

Tensor Fusion

Real neural networks generate many small gradient tensors per step, one for each parameter group. Running ring-allreduce on each tensor independently wastes network bandwidth because the per-message overhead becomes significant for small payloads. Horovod's Tensor Fusion feature batches small ready-to-reduce tensors into a single fused buffer (default fusion threshold of 128 MB) and runs one allreduce on the buffer instead of many. The Uber team reported that on models with many small layers running over an unoptimized TCP network, Tensor Fusion delivered improvements of up to 65 percent. ^[9] The fusion threshold is configurable through the --fusion-threshold-mb flag of horovodrun; setting it to zero disables fusion.

Overlapping computation and communication

The TensorFlow and PyTorch integrations both arrange for the allreduce on each gradient to be issued as soon as that gradient is computed during the backward pass, so that communication for early layers proceeds in parallel with backward-pass computation for later layers. This overlap is what makes the synchronous design competitive with asynchronous parameter-server schemes.

Architecture and components

Horovod is structured as a thin C++ communication core wrapped in framework-specific Python bindings.

Layer	Role
Framework adapter	A Python module per framework: `horovod.tensorflow`, `horovod.keras`, `horovod.torch`, `horovod.mxnet`. Each provides a `DistributedOptimizer` (or equivalent), a `BroadcastGlobalVariablesHook`, callbacks for Keras, and helpers for assigning each process to a local GPU.
Common Python layer	Initialization (`hvd.init()`), rank discovery, and the public allreduce / allgather / broadcast APIs.
C++ controller	Coordinates which tensors are ready to reduce across workers and decides when to fuse them. Two implementations exist: an MPI controller (uses the MPI runtime for both control plane and data plane by default) and a Gloo controller (no MPI dependency, required for elastic training).
Communication backend	Performs the actual allreduce on each fused buffer. NCCL is used on NVIDIA GPUs, MPI on CPUs and on systems where GPU-aware MPI is configured, Gloo as a portable fallback, and Intel oneCCL for CPU clusters tuned for Intel hardware.
Launcher	`horovodrun` wraps `mpirun` (for the MPI backend) or its own Gloo-based launcher to start one process per worker, set up rendezvous, and forward signals. Direct invocation through `mpirun`, Slurm, Kubernetes, or Spark is also supported.

Which communication backends does Horovod support?

Backend	Maintainer	Best fit	Notes
NCCL	NVIDIA	Multi-GPU and multi-node NVIDIA GPU clusters	Highly tuned ring and tree algorithms; supports GPUDirect RDMA over InfiniBand and RoCE; default backend for GPU allreduce in modern Horovod builds.
MPI	Open MPI / MPICH community	HPC clusters, mixed CPU/GPU systems	Required for the MPI controller; provides launcher, rendezvous, and collective operations; supports CUDA-aware builds.
Gloo	Facebook	Sites without an MPI install, elastic training	Pure-Python rendezvous, TCP transport, no system dependencies; required for the elastic Horovod controller.
Intel oneCCL	Intel	CPU clusters with Intel interconnects	Tuned for Intel Xeon and the Xe GPU line; integrated through the C++ controller.

Process Sets

Later versions of Horovod added process sets, which let a job split its workers into independent groups that can run separate collective operations concurrently. This is useful for hybrid model-parallel and data-parallel training, where different subgroups of workers participate in different collectives.

Elastic Horovod

Elastic training, introduced in version 0.20 (June 2020), allows the worker count to change at runtime. ^[10] Workers can join or leave the job, for example because a spot instance is reclaimed or a new node is autoscaled in, without restarting from scratch or reloading from disk-based checkpoints. The training script wraps its main loop in an @hvd.elastic.run decorator, registers a state object that knows how to broadcast itself, and provides a host-discovery script. When the worker membership changes, Horovod reinitializes the ring, restores the last committed state, and resumes training. The elastic controller requires the Gloo backend.

Installation and usage

Horovod ships as a pip package that compiles the C++ extensions for the user's chosen framework and backend at install time. A typical installation on a machine with CUDA and NCCL is:

HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow,pytorch]

For builds without MPI, users add HOROVOD_WITH_GLOO=1 and omit HOROVOD_WITH_MPI=1. Pre-built containers from NVIDIA NGC and the official Horovod project also exist.

How many lines of code does Horovod need?

The original paper highlighted that converting a single-GPU TensorFlow script to distributed training with Horovod required only four key code changes, and that this is a deliberate design goal: "requires only a few lines of modification to user code." ^[1] A minimal v1-style example looks like this:

import horovod.tensorflow as hvd
import tensorflow as tf

# 1. Initialize the Horovod runtime.
hvd.init()

# 2. Pin one GPU per worker by local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# 3. Scale the learning rate by the number of workers, then
#    wrap the optimizer so it averages gradients across workers.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# 4. Broadcast the initial variables from rank 0 to all others.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)

A TensorFlow 2 / Keras script wraps the optimizer and adds Horovod callbacks; a PyTorch script wraps the optimizer with hvd.DistributedOptimizer and broadcasts state with hvd.broadcast_parameters.

Launching a distributed job

The horovodrun launcher hides the MPI or Gloo plumbing. To run on four GPUs of a single host:

horovodrun -np 4 -H localhost:4 python train.py

For sixteen GPUs across four hosts:

horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py

Horovod can also be launched directly with mpirun, with Kubernetes operators (such as Kubeflow's MPIJob), with Spark via horovod.spark, with Ray via the RayExecutor API, and from inside HPC schedulers like Slurm.

Performance characteristics

The original Sergeev & Del Balso paper reported strong scaling experiments on three convolutional models, Inception V3, ResNet-101, and VGG-16, trained with synthetic ImageNet data on 16, 32, 64, and 128 NVIDIA Pascal GPUs (Pascal P100s) connected over 25 GbE. ^[1] The headline numbers from that paper are:

Model	Standard distributed TensorFlow	Horovod	Notes
Inception V3	~50% scaling efficiency at 128 GPUs	~88%	Roughly 2x throughput vs the parameter-server baseline.
ResNet-101	Comparable to Inception V3	~88%	Similar improvement profile.
VGG-16	Substantially worse, communication bound	Lower than Inception/ResNet on TCP, +30% on RDMA networks	VGG-16 has many parameters per layer; bandwidth dominates.

The paper also reported that enabling RDMA-capable networking (InfiniBand or RoCE) on top of NCCL gave only a 3-4% improvement for Inception V3 and ResNet-101 (which are communication-light relative to their compute) but a roughly 30% improvement for VGG-16, whose parameter-heavy fully connected layers shifted the bottleneck onto the network. ^[1] Tensor Fusion contributed up to a 65% throughput improvement on TCP for models with many small tensors such as ResNet-101. ^[9]

Third-party comparisons in the years since have generally confirmed that Horovod's per-step overhead is competitive with framework-native distributed implementations, and that the gap to PyTorch's DistributedDataParallel on small-to-medium models is small. On very large models, more recent systems like DeepSpeed outperform Horovod because they combine data parallelism with model parallelism and ZeRO-style optimizer-state partitioning, which Horovod does not implement directly.

Which frameworks does Horovod support?

Horovod supports four deep-learning frameworks through dedicated Python adapters: TensorFlow, Keras, PyTorch, and Apache MXNet. ^[5] TensorFlow was the original target (the 2017 release and the arXiv paper are TensorFlow-specific), while PyTorch and Apache MXNet support both landed in 2018. The same ring-allreduce core and horovodrun launcher serve all four, which is one of Horovod's distinctive properties: the same distributed training code can run under either TensorFlow or PyTorch with only the import line changed.

Comparison with alternatives

Horovod was the standard answer to how do I train this on more than one GPU? for several years, but the landscape now has several mature options.

Aspect	Horovod	PyTorch `DistributedDataParallel`	TensorFlow `MultiWorkerMirroredStrategy`	DeepSpeed
Frameworks	TensorFlow, Keras, PyTorch, MXNet	PyTorch only	TensorFlow only	PyTorch only
Algorithm	Ring-allreduce via NCCL/MPI/Gloo/oneCCL	Ring-allreduce via NCCL/Gloo/MPI	All-reduce via NCCL/RING/CollectiveOps	ZeRO-1/2/3, pipeline, tensor parallelism
Multi-node	Yes, MPI or Gloo	Yes, NCCL or Gloo	Yes, gRPC + NCCL	Yes
Code changes	Wrap optimizer, init, broadcast (~4 lines)	Wrap module, init process group	Wrap model in strategy scope	Wrap model and optimizer
Launcher	`horovodrun`, `mpirun`	`torchrun` (formerly `torch.distributed.launch`)	`tf.distribute.cluster_resolver` + `TF_CONFIG`	`deepspeed` launcher
Elastic training	Yes (since v0.20, Gloo only)	Yes via `torchelastic` (now `torchrun`)	Limited	Yes
Model-parallel/sharding	No (data-parallel only)	FSDP for sharded data-parallel	ParameterServerStrategy and others	First-class ZeRO and pipeline
Active development	Slow since 2023	Actively developed	Actively developed	Actively developed
Primary maintainer	LF AI & Data Foundation	Meta + community	Google + community	Microsoft + community
License	Apache 2.0	BSD-3	Apache 2.0	MIT

The practical effect is that new PyTorch projects almost universally start with DistributedDataParallel or FSDP, new TensorFlow projects start with the tf.distribute API, and very large transformer training runs reach for DeepSpeed or Megatron-LM. Horovod remains useful when a project needs to run the same training code under both TensorFlow and PyTorch, when it has to integrate with an existing MPI-based HPC environment, or when elastic training on heterogeneous hardware is a priority.

Adoption and known users

Uber published Horovod from inside its production Michelangelo platform; the Sergeev & Del Balso paper credits Uber's deep-learning workloads as the original benchmarks. ^[1] The framework has subsequently been used or distributed by:

NVIDIA, which packages Horovod inside its NGC container images for TensorFlow and PyTorch and recommends it in the NVIDIA Deep Learning Examples and NVIDIA AI Enterprise materials for multi-node training on DGX systems. ^[16]
Amazon Web Services, which has built tutorials around Horovod on Amazon SageMaker and Amazon EC2 spot instances (the latter using elastic Horovod for fault-tolerant training).
Microsoft Azure, which integrated Horovod with Azure Machine Learning and Azure Databricks.
Databricks, which shipped HorovodRunner inside Databricks Runtime ML from 2018 through 15.4 LTS ML, and which deprecated the package in September 2024 in favor of TorchDistributor and tf.distribute. ^[15]
NASA Advanced Supercomputing, which documents Horovod as a recommended distributed training tool for users on its HECC clusters. ^[17]
Alibaba, which contributed PyTorch and Apache MXNet bindings and used Horovod inside its PAI machine-learning platform.
Determined AI (acquired by Hewlett Packard Enterprise in 2021), which maintains a fork and integrates Horovod into the Determined training platform.
Academic and HPC sites including NERSC, Oak Ridge National Laboratory, and several university clusters that use it through their MPI-based scheduler stacks.

Project governance

Uber donated Horovod to the Linux Foundation's LF Deep Learning Foundation in December 2018 as an incubation-stage project. ^[8] The foundation merged with the LF AI Foundation in 2019, becoming the LF AI Foundation, and was renamed LF AI & Data in 2020 when LF Data joined. Horovod graduated to full project status in September 2020. ^[7] Day-to-day decisions are made by a technical steering committee of project maintainers operating under the foundation's open-source governance model. The trademark, neutral repository, and CI infrastructure are owned by the foundation rather than by Uber, which limits any single company's ability to change direction unilaterally.

Current status and maintenance

The project's release cadence has slowed considerably. As of mid-2026 the latest release on GitHub remains v0.28.1 from 12 June 2023, no new versions have been published to PyPI in more than 18 months, and the Snyk package-health analysis classifies the package as inactive based on release cadence and other signals. ^[13]^[14] The repository continues to receive commits and bug reports are still being triaged, but the project no longer ships major new features at the rate it did between 2018 and 2022.

The practical reasons for the slowdown are widely agreed on in the community: PyTorch's DistributedDataParallel and FSDP, TensorFlow's tf.distribute strategies, NVIDIA's NCCL improvements, DeepSpeed, and Megatron-LM together cover most of what users originally went to Horovod for, often with tighter integration into their respective frameworks. Several large platforms have migrated away. Databricks deprecated Horovod and HorovodRunner in September 2024, and Databricks Runtime ML versions after 15.4 LTS no longer pre-install the package; Databricks now points users at TorchDistributor for PyTorch and the tf.distribute.Strategy API for TensorFlow. ^[15]

Horovod still has real strengths that keep it in production: it remains the easiest way to get the same training script running in both TensorFlow and PyTorch, it has first-class MPI support that fits HPC environments, the Gloo backend lets it run without system-wide MPI, the elastic feature is useful on preemptible cloud instances, and the integrations with Apache Spark and Ray are still in active use. For new projects targeting a single framework, the framework-native distributed APIs are now the more natural starting point.

References

Sergeev, A., & Del Balso, M. (2018). *Horovod: fast and easy distributed deep learning in TensorFlow*. arXiv preprint arXiv:1802.05799. https://arxiv.org/abs/1802.05799 ↩
Sergeev, A. (17 October 2017). *Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow*. Uber Engineering. https://www.uber.com/blog/horovod/ ↩
Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations. *Journal of Parallel and Distributed Computing*, 69(2), 117-124. ↩
Gibiansky, A. (21 February 2017). *Bringing HPC Techniques to Deep Learning*. Baidu Research. http://research.baidu.com/bringing-hpc-techniques-deep-learning/ ↩
Horovod project. *horovod/horovod GitHub repository*. https://github.com/horovod/horovod ↩
Horovod project. *Horovod documentation*. https://horovod.readthedocs.io/
LF AI & Data Foundation. *Horovod project page*. https://lfaidata.foundation/projects/horovod/ ↩
Linux Foundation (13 December 2018). *LF Deep Learning Welcomes Horovod Distributed Training Framework as Newest Project*. https://www.linuxfoundation.org/press/press-release/lf-deep-learning-welcomes-horovod-distributed-training-framework-as-newest-project ↩
Horovod project. *Tensor Fusion documentation*. https://horovod.readthedocs.io/en/stable/tensor-fusion_include.html ↩
Horovod project. *Elastic Horovod documentation*. https://horovod.readthedocs.io/en/latest/elastic_include.html ↩
Horovod project. *Horovod on Spark*. https://horovod.readthedocs.io/en/stable/spark_include.html ↩
Horovod project. *Horovod on Ray*. https://horovod.readthedocs.io/en/stable/ray_include.html ↩
Horovod project. *Releases page*. https://github.com/horovod/horovod/releases ↩
Snyk Advisor. *horovod package health analysis*. https://snyk.io/advisor/python/horovod ↩
Databricks (26 September 2024). *HorovodRunner: distributed deep learning with Horovod (deprecation notice)*. https://docs.databricks.com/aws/en/archive/machine-learning/train-model/horovod-runner ↩
NVIDIA. *Multi-Node Training with Tanzu Overview*. https://docs.nvidia.com/launchpad/ai/tanzu-multinode/latest/tanzu-multinode-training-overview.html ↩
NASA Advanced Supercomputing Division. *Using Horovod for Distributed Training*. https://www.nas.nasa.gov/hecc/support/kb/using-horovod-for-distributed-training_667.html ↩
Wikipedia. *Horovod (machine learning)*. https://en.wikipedia.org/wiki/Horovod_(machine_learning)
Hashemi, S. H., Abdu Jyothi, S., & Campbell, R. H. (2019). TicTac: accelerating distributed deep learning with communication scheduling. *Proceedings of Machine Learning and Systems*.
Bao, Y., Peng, Y., Chen, Y., & Wu, C. (2020). Preemptive all-reduce scheduling for expediting distributed DNN training. *IEEE INFOCOM*.

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Data Parallelism Kubeflow Parameter Server (PS)Uber

Facts

ELI5: explain like I'm 5

What is Horovod and who created it?

Origins at Uber and Michelangelo

Where did the ring-allreduce idea come from?

When was Horovod released as open source?

Move to the LF AI Foundation

Maturity and slowdown

How does Horovod work?

How does ring-allreduce work?

Tensor Fusion

Overlapping computation and communication

Architecture and components

Which communication backends does Horovod support?

Process Sets

Elastic Horovod

Installation and usage

How many lines of code does Horovod need?

Launching a distributed job

Performance characteristics

Which frameworks does Horovod support?

Comparison with alternatives

Adoption and known users

Project governance

Current status and maintenance

See also

References

Improve this article

Related Articles

Ray (framework)

XLA (Accelerated Linear Algebra)

Supabase

Apache MXNet

LanceDB

NVIDIA Dynamo

What links here

Related Articles

Ray (framework)

XLA (Accelerated Linear Algebra)

Supabase

Apache MXNet

LanceDB

NVIDIA Dynamo

What links here