# Horovod

> Source: https://aiwiki.ai/wiki/horovod
> Updated: 2026-05-02
> Categories: AI Infrastructure, Developer Tools, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Distributed training](/wiki/distributed_training), [Data parallelism](/wiki/data_parallelism), [Ring-allreduce](/wiki/ring_allreduce), [TensorFlow](/wiki/tensorflow), [PyTorch](/wiki/pytorch), [NVIDIA NCCL](/wiki/nccl)*

**Horovod** is an open-source [distributed training](/wiki/distributed_training) framework for [deep learning](/wiki/deep_learning) models, originally developed at [Uber](/wiki/uber) and released as open source in October 2017. It lets users scale a single-GPU training script written in [TensorFlow](/wiki/tensorflow), [Keras](/wiki/keras), [PyTorch](/wiki/pytorch), or [Apache MXNet](/wiki/mxnet) across many GPUs and many machines by adding only a handful of lines of code, and it relies on the [ring-allreduce](/wiki/ring_allreduce) algorithm rather than a parameter-server architecture for averaging gradients between workers. Uber contributed Horovod to the [LF AI Foundation](/wiki/lf_ai_foundation) (now LF AI & Data) in December 2018, and the project graduated to full foundation status in 2020. The framework is named after a traditional Russian folk dance in which performers move with linked arms in a circle, mirroring how its workers exchange tensors.

Horovod was one of the first projects to make collective communication patterns from high-performance computing widely accessible to deep-learning practitioners, and for several years it was the default way to train models on tens or hundreds of GPUs. The project's relevance has narrowed over time as both [TensorFlow](/wiki/tensorflow) and [PyTorch](/wiki/pytorch) absorbed similar ideas into their native distributed APIs, and Horovod's release cadence slowed sharply after 2023; the most recent tagged release on GitHub is v0.28.1, dated 12 June 2023. The repository is still hosted under the LF AI & Data Foundation and the source code is widely deployed inside HPC clusters, container images, and managed ML platforms, but newer projects more often build on PyTorch's `DistributedDataParallel`, [DeepSpeed](/wiki/deepspeed), or framework-native distributed strategies.

## Facts

| Field | Value |
|-------|-------|
| Type | Distributed deep-learning training framework |
| Original developer | Uber Technologies (Michelangelo platform team) |
| Initial public release | August 2017 (v0.9.0 on GitHub) |
| Open-source announcement | 17 October 2017 (Uber Engineering blog) |
| Latest release | v0.28.1, 12 June 2023 |
| License | Apache License 2.0 |
| Repository | github.com/horovod/horovod |
| Languages | Python, C++, CUDA |
| Supported frameworks | TensorFlow, Keras, PyTorch, Apache MXNet |
| Communication backends | NVIDIA NCCL, MPI (Open MPI, MPICH), Facebook Gloo, Intel oneCCL |
| Algorithm | Synchronous data-parallel SGD using ring-allreduce |
| Governance | LF AI & Data Foundation (graduated project, September 2020) |
| Original paper | Sergeev & Del Balso, *Horovod: fast and easy distributed deep learning in TensorFlow* (arXiv:1802.05799, February 2018) |

## ELI5: explain like I'm 5

Imagine eight kids each have a deck of flashcards and they all want to learn the same multiplication table. Working alone, each kid might take an hour. If they split the cards eight ways and study at the same time, they can finish much faster, but at the end they have to sit in a circle and share what they learned so everybody ends up with the same knowledge. That sharing step is the slow part. If one kid talks to every other kid one by one, you spend a lot of time just passing notes around.

The ring-allreduce trick is to have the kids sit in a circle, and on each turn each kid passes one piece of paper to the kid on their right and gets one from the kid on their left, adding up the numbers as they go. After two trips around the circle everyone has the full sum and nobody had to wait on a single popular kid in the middle. Horovod is a piece of software that does exactly this for the gradients of a neural network during training: it lets eight, or sixty-four, or a thousand GPUs share what they learned each step almost as efficiently as if they were one big GPU. The clever part is that it does the trick from inside almost any deep-learning library you already use, so you only have to add about four lines of code to a normal training script to make it run distributed.

## History

### Origins at Uber and Michelangelo

Horovod grew out of Uber's [Michelangelo](/wiki/michelangelo) machine-learning platform, which the company introduced in September 2017 as an internal end-to-end ML-as-a-service stack. Michelangelo handled feature management, training, model serving, and monitoring across hundreds of production ML use cases at Uber, from estimated time of arrival to fraud detection. As the company began to train larger deep-learning models, the team ran into a wall: the standard distributed training APIs in [TensorFlow](/wiki/tensorflow) at the time, which were based on parameter servers and `tf.train.SyncReplicasOptimizer`, were both hard to use correctly and inefficient at scale. Uber engineers reported that they were unable to leverage roughly half of their GPU resources when scaling distributed TensorFlow training to 128 GPUs.

### The ring-allreduce idea from Baidu

In February 2017, [Baidu](/wiki/baidu) Research published a blog post by Andrew Gibiansky titled *Bringing HPC Techniques to Deep Learning*, along with a draft TensorFlow implementation, showing that the ring-allreduce algorithm long used in [MPI](/wiki/mpi)-based scientific computing could be adapted to gradient averaging in deep learning. The algorithm itself comes from a 2009 paper by Pitch Patarasuk and Xin Yuan, *Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations*, which proved that ring-allreduce achieves bandwidth-optimal communication independent of the number of workers. Baidu showed that, on convolutional networks, ring-allreduce dramatically outperformed the parameter-server pattern at scale.

Uber's deep-learning team picked up Baidu's draft, refactored it into a stand-alone Python library that did not require a custom TensorFlow build, hooked it into the framework as a regular Python optimizer, and added performance work like Tensor Fusion. The result was Horovod. Internally, Uber benchmarks showed scaling efficiency rising from roughly 50% on standard distributed TensorFlow to about 88% on Horovod for the same hardware and models.

### Open-source release

Uber tagged the first public release, v0.9.0, on GitHub in August 2017 under the Apache 2.0 license. The official open-source announcement came on 17 October 2017 in a blog post by Alexander Sergeev titled *Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow*. The accompanying paper, *Horovod: fast and easy distributed deep learning in TensorFlow* by Alexander Sergeev and Mike Del Balso, was posted to arXiv on 15 February 2018 (arXiv:1802.05799).

The choice of name was deliberate: a *horovod* (Russian: хоровод) is a circle dance in which participants link arms and move together as a group, which mirrors the topology of the ring-allreduce algorithm. Sergeev's blog post explicitly drew the analogy.

### Move to the LF AI Foundation

In December 2018, Uber donated Horovod to the LF Deep Learning Foundation, a Linux Foundation project that was later folded into the [LF AI Foundation](/wiki/lf_ai_foundation) and then renamed LF AI & Data. Horovod entered as an incubation-stage project. It graduated to full foundation status in September 2020, the highest project maturity tier in the foundation's lifecycle. Governance is handled by an elected technical steering committee under the foundation's open-source charter.

### Maturity and slowdown

Through 2018-2022 Horovod expanded well beyond its original TensorFlow focus. PyTorch and Apache MXNet support arrived in 2018, Keras model integration was added, the Gloo controller landed so that users could run Horovod without an MPI installation, and elastic training was introduced in version 0.20 (mid-2020). Integrations with [Apache Spark](/wiki/apache_spark) and [Ray](/wiki/ray) made it possible to launch Horovod jobs from inside data-processing pipelines.

Release activity slowed sharply after 2023. The most recent tagged release, v0.28.1, was published on 12 June 2023; no new versions have appeared on PyPI in the 18 months that followed. The repository continues to receive occasional commits and bug-fix work, but several large platforms have moved away from it as their default distributed-training tool. Databricks announced on 26 September 2024 that Horovod and HorovodRunner are deprecated and that releases after Databricks Runtime 15.4 LTS ML will not include the package pre-installed; Databricks now recommends TorchDistributor for [PyTorch](/wiki/pytorch) and the `tf.distribute.Strategy` API for [TensorFlow](/wiki/tensorflow).

## How it works

Horovod implements *synchronous data-parallel* stochastic gradient descent. Each worker holds a full copy of the model, processes a different shard of each minibatch, computes local gradients, and then participates in a collective communication step that averages those gradients across all workers before the next parameter update. Because every worker has the same averaged gradient at the end of each step, every worker's parameters stay in sync, which means there is no parameter server, no asynchronous staleness, and no need for special checkpointing logic.

### The ring-allreduce algorithm

The core operation Horovod performs is *allreduce*: a collective in which N workers each contribute a tensor and each ends up with the sum (or mean) of all N tensors. Naive implementations either route everything through a central parameter server (bandwidth bottleneck on the server) or have every worker broadcast to every other worker (O(N²) traffic). Ring-allreduce avoids both problems.

In ring-allreduce the N workers are arranged in a logical ring. The tensor to be reduced is split into N chunks. The algorithm then runs in two phases:

1. **Scatter-reduce.** For N−1 steps, each worker simultaneously sends one chunk to its right neighbor and receives one chunk from its left neighbor, adding the received chunk into its own buffer. After N−1 steps, each worker holds the fully reduced version of one of the N chunks.
2. **All-gather.** For another N−1 steps, each worker forwards its fully reduced chunk around the ring so that, by the end, every worker has the complete reduced tensor.

```
  Worker A  -->  Worker B  -->  Worker C  -->  Worker D
      ^                                            |
      +--------------------------------------------+
```

The total amount of data each worker sends and receives is `2(N-1)/N` times the size of the tensor, which approaches 2 as N grows. Critically, the per-worker bandwidth cost is independent of N. This is why Patarasuk and Yuan called the algorithm *bandwidth optimal*: when the network is the bottleneck, no allreduce algorithm can do better in terms of bytes-per-worker. Latency does scale with N (you need 2(N−1) communication steps), but for the tensor sizes typical in deep learning, bandwidth dominates.

### Tensor Fusion

Real neural networks generate many small gradient tensors per step, one for each parameter group. Running ring-allreduce on each tensor independently wastes network bandwidth because the per-message overhead becomes significant for small payloads. Horovod's *Tensor Fusion* feature batches small ready-to-reduce tensors into a single fused buffer (default fusion threshold of 128 MB) and runs one allreduce on the buffer instead of many. The Uber team reported that on models with many small layers running over an unoptimized TCP network, Tensor Fusion delivered improvements of up to 65 percent. The fusion threshold is configurable through the `--fusion-threshold-mb` flag of `horovodrun`; setting it to zero disables fusion.

### Overlapping computation and communication

The TensorFlow and PyTorch integrations both arrange for the allreduce on each gradient to be issued as soon as that gradient is computed during the backward pass, so that communication for early layers proceeds in parallel with backward-pass computation for later layers. This overlap is what makes the synchronous design competitive with asynchronous parameter-server schemes.

## Architecture and components

Horovod is structured as a thin C++ communication core wrapped in framework-specific Python bindings.

| Layer | Role |
|-------|------|
| Framework adapter | A Python module per framework: `horovod.tensorflow`, `horovod.keras`, `horovod.torch`, `horovod.mxnet`. Each provides a `DistributedOptimizer` (or equivalent), a `BroadcastGlobalVariablesHook`, callbacks for [Keras](/wiki/keras), and helpers for assigning each process to a local GPU. |
| Common Python layer | Initialization (`hvd.init()`), rank discovery, and the public allreduce / allgather / broadcast APIs. |
| C++ controller | Coordinates which tensors are ready to reduce across workers and decides when to fuse them. Two implementations exist: an MPI controller (uses the MPI runtime for both control plane and data plane by default) and a Gloo controller (no MPI dependency, required for elastic training). |
| Communication backend | Performs the actual allreduce on each fused buffer. NCCL is used on NVIDIA GPUs, MPI on CPUs and on systems where GPU-aware MPI is configured, Gloo as a portable fallback, and Intel oneCCL for CPU clusters tuned for Intel hardware. |
| Launcher | `horovodrun` wraps `mpirun` (for the MPI backend) or its own Gloo-based launcher to start one process per worker, set up rendezvous, and forward signals. Direct invocation through `mpirun`, Slurm, Kubernetes, or Spark is also supported. |

### Communication backends in detail

| Backend | Maintainer | Best fit | Notes |
|---------|------------|----------|-------|
| [NCCL](/wiki/nccl) | NVIDIA | Multi-GPU and multi-node NVIDIA GPU clusters | Highly tuned ring and tree algorithms; supports GPUDirect RDMA over InfiniBand and RoCE; default backend for GPU allreduce in modern Horovod builds. |
| [MPI](/wiki/mpi) | Open MPI / MPICH community | HPC clusters, mixed CPU/GPU systems | Required for the MPI controller; provides launcher, rendezvous, and collective operations; supports CUDA-aware builds. |
| Gloo | Facebook | Sites without an MPI install, elastic training | Pure-Python rendezvous, TCP transport, no system dependencies; required for the elastic Horovod controller. |
| Intel oneCCL | Intel | CPU clusters with Intel interconnects | Tuned for Intel Xeon and the Xe GPU line; integrated through the C++ controller. |

### Process Sets

Later versions of Horovod added *process sets*, which let a job split its workers into independent groups that can run separate collective operations concurrently. This is useful for hybrid model-parallel and data-parallel training, where different subgroups of workers participate in different collectives.

### Elastic Horovod

Elastic training, introduced in version 0.20 (June 2020), allows the worker count to change at runtime. Workers can join or leave the job, for example because a spot instance is reclaimed or a new node is autoscaled in, without restarting from scratch or reloading from disk-based checkpoints. The training script wraps its main loop in an `@hvd.elastic.run` decorator, registers a state object that knows how to broadcast itself, and provides a host-discovery script. When the worker membership changes, Horovod reinitializes the ring, restores the last committed state, and resumes training. The elastic controller requires the Gloo backend.

## Installation and usage

Horovod ships as a `pip` package that compiles the C++ extensions for the user's chosen framework and backend at install time. A typical installation on a machine with CUDA and NCCL is:

```
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow,pytorch]
```

For builds without MPI, users add `HOROVOD_WITH_GLOO=1` and omit `HOROVOD_WITH_MPI=1`. Pre-built containers from NVIDIA NGC and the official Horovod project also exist.

### Adding Horovod to a TensorFlow script

The original paper highlighted that converting a single-GPU TensorFlow script to distributed training with Horovod required only four key code changes. A minimal v1-style example looks like this:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

# 1. Initialize the Horovod runtime.
hvd.init()

# 2. Pin one GPU per worker by local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# 3. Scale the learning rate by the number of workers, then
#    wrap the optimizer so it averages gradients across workers.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# 4. Broadcast the initial variables from rank 0 to all others.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

A TensorFlow 2 / Keras script wraps the optimizer and adds Horovod callbacks; a PyTorch script wraps the optimizer with `hvd.DistributedOptimizer` and broadcasts state with `hvd.broadcast_parameters`.

### Launching a distributed job

The `horovodrun` launcher hides the MPI or Gloo plumbing. To run on four GPUs of a single host:

```
horovodrun -np 4 -H localhost:4 python train.py
```

For sixteen GPUs across four hosts:

```
horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
```

Horovod can also be launched directly with `mpirun`, with Kubernetes operators (such as Kubeflow's MPIJob), with Spark via `horovod.spark`, with Ray via the `RayExecutor` API, and from inside HPC schedulers like Slurm.

## Performance characteristics

The original Sergeev & Del Balso paper reported strong scaling experiments on three convolutional models, [Inception V3](/wiki/inception), [ResNet-101](/wiki/resnet), and [VGG-16](/wiki/vgg), trained with synthetic ImageNet data on 16, 32, 64, and 128 NVIDIA Pascal GPUs (Pascal P100s) connected over 25 GbE. The headline numbers from that paper are:

| Model | Standard distributed TensorFlow | Horovod | Notes |
|-------|---------------------------------|---------|-------|
| Inception V3 | ~50% scaling efficiency at 128 GPUs | ~88% | Roughly 2x throughput vs the parameter-server baseline. |
| ResNet-101 | Comparable to Inception V3 | ~88% | Similar improvement profile. |
| VGG-16 | Substantially worse, communication bound | Lower than Inception/ResNet on TCP, +30% on RDMA networks | VGG-16 has many parameters per layer; bandwidth dominates. |

The paper also reported that enabling RDMA-capable networking (InfiniBand or RoCE) on top of NCCL gave only a 3-4% improvement for Inception V3 and ResNet-101 (which are communication-light relative to their compute) but a roughly 30% improvement for VGG-16, whose parameter-heavy fully connected layers shifted the bottleneck onto the network. Tensor Fusion contributed up to a 65% throughput improvement on TCP for models with many small tensors such as ResNet-101.

Third-party comparisons in the years since have generally confirmed that Horovod's per-step overhead is competitive with framework-native distributed implementations, and that the gap to PyTorch's `DistributedDataParallel` on small-to-medium models is small. On very large models, more recent systems like [DeepSpeed](/wiki/deepspeed) outperform Horovod because they combine data parallelism with model parallelism and ZeRO-style optimizer-state partitioning, which Horovod does not implement directly.

## Comparison with alternatives

Horovod was the standard answer to *how do I train this on more than one GPU?* for several years, but the landscape now has several mature options.

| Aspect | Horovod | PyTorch `DistributedDataParallel` | TensorFlow `MultiWorkerMirroredStrategy` | DeepSpeed |
|--------|---------|------------------------------------|------------------------------------------|-----------|
| Frameworks | TensorFlow, Keras, PyTorch, MXNet | PyTorch only | TensorFlow only | PyTorch only |
| Algorithm | Ring-allreduce via NCCL/MPI/Gloo/oneCCL | Ring-allreduce via NCCL/Gloo/MPI | All-reduce via NCCL/RING/CollectiveOps | ZeRO-1/2/3, pipeline, tensor parallelism |
| Multi-node | Yes, MPI or Gloo | Yes, NCCL or Gloo | Yes, gRPC + NCCL | Yes |
| Code changes | Wrap optimizer, init, broadcast (~4 lines) | Wrap module, init process group | Wrap model in strategy scope | Wrap model and optimizer |
| Launcher | `horovodrun`, `mpirun` | `torchrun` (formerly `torch.distributed.launch`) | `tf.distribute.cluster_resolver` + `TF_CONFIG` | `deepspeed` launcher |
| Elastic training | Yes (since v0.20, Gloo only) | Yes via `torchelastic` (now `torchrun`) | Limited | Yes |
| Model-parallel/sharding | No (data-parallel only) | FSDP for sharded data-parallel | ParameterServerStrategy and others | First-class ZeRO and pipeline |
| Active development | Slow since 2023 | Actively developed | Actively developed | Actively developed |
| Primary maintainer | LF AI & Data Foundation | Meta + community | Google + community | Microsoft + community |
| License | Apache 2.0 | BSD-3 | Apache 2.0 | MIT |

The practical effect is that new PyTorch projects almost universally start with `DistributedDataParallel` or FSDP, new TensorFlow projects start with the `tf.distribute` API, and very large transformer training runs reach for DeepSpeed or [Megatron-LM](/wiki/megatron). Horovod remains useful when a project needs to run the same training code under both TensorFlow and PyTorch, when it has to integrate with an existing MPI-based HPC environment, or when elastic training on heterogeneous hardware is a priority.

## Adoption and known users

Uber published Horovod from inside its production Michelangelo platform; the Sergeev & Del Balso paper credits Uber's deep-learning workloads as the original benchmarks. The framework has subsequently been used or distributed by:

- **NVIDIA**, which packages Horovod inside its NGC container images for TensorFlow and PyTorch and recommends it in the NVIDIA Deep Learning Examples and NVIDIA AI Enterprise materials for multi-node training on DGX systems.
- **Amazon Web Services**, which has built tutorials around Horovod on Amazon SageMaker and Amazon EC2 spot instances (the latter using elastic Horovod for fault-tolerant training).
- **Microsoft Azure**, which integrated Horovod with Azure Machine Learning and Azure Databricks.
- **Databricks**, which shipped HorovodRunner inside Databricks Runtime ML from 2018 through 15.4 LTS ML, and which deprecated the package in September 2024 in favor of TorchDistributor and `tf.distribute`.
- **NASA Advanced Supercomputing**, which documents Horovod as a recommended distributed training tool for users on its HECC clusters.
- **Alibaba**, which contributed PyTorch and Apache MXNet bindings and used Horovod inside its PAI machine-learning platform.
- **Determined AI** (acquired by Hewlett Packard Enterprise in 2021), which maintains a fork and integrates Horovod into the Determined training platform.
- Academic and HPC sites including NERSC, Oak Ridge National Laboratory, and several university clusters that use it through their MPI-based scheduler stacks.

## Project governance

Uber donated Horovod to the Linux Foundation's LF Deep Learning Foundation in December 2018 as an incubation-stage project. The foundation merged with the LF AI Foundation in 2019, becoming the [LF AI Foundation](/wiki/lf_ai_foundation), and was renamed LF AI & Data in 2020 when LF Data joined. Horovod graduated to full project status in September 2020. Day-to-day decisions are made by a technical steering committee of project maintainers operating under the foundation's open-source governance model. The trademark, neutral repository, and CI infrastructure are owned by the foundation rather than by Uber, which limits any single company's ability to change direction unilaterally.

## Current status and maintenance

The project's release cadence has slowed considerably. As of mid-2026 the latest release on GitHub remains v0.28.1 from 12 June 2023, no new versions have been published to PyPI in more than 18 months, and the Snyk package-health analysis classifies the package as inactive based on release cadence and other signals. The repository continues to receive commits and bug reports are still being triaged, but the project no longer ships major new features at the rate it did between 2018 and 2022.

The practical reasons for the slowdown are widely agreed on in the community: PyTorch's `DistributedDataParallel` and FSDP, TensorFlow's `tf.distribute` strategies, NVIDIA's NCCL improvements, DeepSpeed, and Megatron-LM together cover most of what users originally went to Horovod for, often with tighter integration into their respective frameworks. Several large platforms have migrated away. Databricks deprecated Horovod and HorovodRunner in September 2024, and Databricks Runtime ML versions after 15.4 LTS no longer pre-install the package; Databricks now points users at TorchDistributor for PyTorch and the `tf.distribute.Strategy` API for TensorFlow.

Horovod still has real strengths that keep it in production: it remains the easiest way to get the same training script running in both TensorFlow and PyTorch, it has first-class MPI support that fits HPC environments, the Gloo backend lets it run without system-wide MPI, the elastic feature is useful on preemptible cloud instances, and the integrations with [Apache Spark](/wiki/apache_spark) and [Ray](/wiki/ray) are still in active use. For new projects targeting a single framework, the framework-native distributed APIs are now the more natural starting point.

## See also

- [Distributed training](/wiki/distributed_training)
- [Data parallelism](/wiki/data_parallelism)
- [Model parallelism](/wiki/model_parallelism)
- [Ring-allreduce](/wiki/ring_allreduce)
- [Allreduce](/wiki/allreduce)
- [TensorFlow](/wiki/tensorflow)
- [PyTorch](/wiki/pytorch)
- [Keras](/wiki/keras)
- [Apache MXNet](/wiki/mxnet)
- [NVIDIA NCCL](/wiki/nccl)
- [MPI](/wiki/mpi)
- [Gloo](/wiki/gloo)
- [Apache Spark](/wiki/apache_spark)
- [Ray](/wiki/ray)
- [DeepSpeed](/wiki/deepspeed)
- [Megatron-LM](/wiki/megatron)
- [Uber](/wiki/uber)
- [Michelangelo](/wiki/michelangelo)
- [Baidu](/wiki/baidu)
- [LF AI Foundation](/wiki/lf_ai_foundation)
- [Linux Foundation](/wiki/linux_foundation)
- [Stochastic gradient descent](/wiki/sgd)
- [Deep learning](/wiki/deep_learning)
- [GPU computing](/wiki/gpu)
- [InfiniBand](/wiki/infiniband)
- [DistributedDataParallel](/wiki/distributeddataparallel)

## References

1. Sergeev, A., & Del Balso, M. (2018). *Horovod: fast and easy distributed deep learning in TensorFlow*. arXiv preprint arXiv:1802.05799. https://arxiv.org/abs/1802.05799
2. Sergeev, A. (17 October 2017). *Meet Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow*. Uber Engineering. https://www.uber.com/blog/horovod/
3. Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations. *Journal of Parallel and Distributed Computing*, 69(2), 117-124.
4. Gibiansky, A. (21 February 2017). *Bringing HPC Techniques to Deep Learning*. Baidu Research. http://research.baidu.com/bringing-hpc-techniques-deep-learning/
5. Horovod project. *horovod/horovod GitHub repository*. https://github.com/horovod/horovod
6. Horovod project. *Horovod documentation*. https://horovod.readthedocs.io/
7. LF AI & Data Foundation. *Horovod project page*. https://lfaidata.foundation/projects/horovod/
8. Linux Foundation (3 December 2018). *Horovod Joins the LF Deep Learning Foundation as its Newest Project*. https://www.uber.com/blog/horovod-deep-learning-foundation/
9. Horovod project. *Tensor Fusion documentation*. https://horovod.readthedocs.io/en/stable/tensor-fusion_include.html
10. Horovod project. *Elastic Horovod documentation*. https://horovod.readthedocs.io/en/latest/elastic_include.html
11. Horovod project. *Horovod on Spark*. https://horovod.readthedocs.io/en/stable/spark_include.html
12. Horovod project. *Horovod on Ray*. https://horovod.readthedocs.io/en/stable/ray_include.html
13. Horovod project. *Releases page*. https://github.com/horovod/horovod/releases
14. Snyk Advisor. *horovod package health analysis*. https://snyk.io/advisor/python/horovod
15. Databricks (26 September 2024). *HorovodRunner: distributed deep learning with Horovod (deprecation notice)*. https://docs.databricks.com/aws/en/archive/machine-learning/train-model/horovod-runner
16. NVIDIA. *Multi-Node Training with Tanzu Overview*. https://docs.nvidia.com/launchpad/ai/tanzu-multinode/latest/tanzu-multinode-training-overview.html
17. NASA Advanced Supercomputing Division. *Using Horovod for Distributed Training*. https://www.nas.nasa.gov/hecc/support/kb/using-horovod-for-distributed-training_667.html
18. Wikipedia. *Horovod (machine learning)*. https://en.wikipedia.org/wiki/Horovod_(machine_learning)
19. Hashemi, S. H., Abdu Jyothi, S., & Campbell, R. H. (2019). TicTac: accelerating distributed deep learning with communication scheduling. *Proceedings of Machine Learning and Systems*.
20. Bao, Y., Peng, Y., Chen, Y., & Wu, C. (2020). Preemptive all-reduce scheduling for expediting distributed DNN training. *IEEE INFOCOM*.

