NCCL (NVIDIA Collective Communications Library)
NCCL (pronounced "Nickel") is an open-source library from NVIDIA that implements multi-GPU and multi-node collective communication primitives optimised for NVIDIA hardware. It is the de facto communication backbone for synchronous distributed training on NVIDIA GPUs and underlies almost every modern deep learning framework that scales beyond a single device, including PyTorch DistributedDataParallel, Fully Sharded Data Parallel (FSDP), DeepSpeed, Megatron-LM, and JAX.
NCCL provides routines such as AllReduce, AllGather, ReduceScatter, Broadcast, Reduce, AlltoAll, Gather, Scatter, and point-to-point Send/Recv. The implementation is topology aware: it discovers the system's PCIe tree, NVLink and NVSwitch fabric, and InfiniBand HCAs or Ethernet NICs, then chooses an algorithm and communication channels tuned for the message size and the underlying hardware. Each collective is fused into a single CUDA kernel that handles both the arithmetic and the data movement, which avoids extra kernel launches and lets NCCL saturate links such as NVLink at hundreds of gigabytes per second.
Overview
Deep learning at scale is bottlenecked less often by raw compute than by the cost of moving tensors between devices. Synchronous data parallelism, the most common form of distributed training, requires every worker to average its gradients with every other worker after each backward pass. Done naively over PCIe or TCP this collapses the throughput of even fast accelerators. NCCL exists to push those collective operations as close to the bandwidth ceiling of the underlying interconnect as possible, whether that interconnect is NVLink inside a single DGX node or InfiniBand across thousands of nodes in a supercomputer.
The API was deliberately modelled on MPI, the long-standing standard for collective communication in HPC, so engineers familiar with MPI can adopt NCCL with minimal friction. Unlike MPI, NCCL is GPU-centric: buffers live in device memory, calls are issued from the host but executed on a CUDA stream, and there is no implicit copy through the host. The library is licensed under BSD-3-Clause and is developed in the open at github.com/NVIDIA/nccl.
History and version timeline
NCCL began as an open-source research project at NVIDIA. Early commits in the public repository date from 2015, and the first widely cited NVIDIA Developer blog post introducing the library, "Fast Multi-GPU Collectives with NCCL," appeared on April 7, 2016. NCCL 1.x was a single-node, intra-server library that exchanged data between GPUs using direct peer-to-peer load/store across PCIe and NVLink. It implemented five collectives: AllReduce, AllGather, Broadcast, Reduce, and ReduceScatter, built on three internal kernels (Copy, Reduce, and ReduceAndCopy).
NCCL 2.0 (2017) added multi-node support. Inter-node transfers used GPUDirect RDMA over InfiniBand where available, with a CPU-side proxy thread driving the NIC. From that point onward NCCL became the standard collective library for distributed deep learning on NVIDIA hardware. Subsequent releases steadily added algorithms, transports, and resilience features.
| Version | Year | Notable additions |
|---|---|---|
| 1.x | 2015 to 2016 | Single-node multi-GPU; ring AllReduce; 5 collectives in fused kernels |
| 2.0 | 2017 | Multi-node support via GPUDirect RDMA over InfiniBand |
| 2.4 | February 2019 | Double binary tree algorithm for low-latency AllReduce |
| 2.7 | 2020 | Point-to-point ncclSend/ncclRecv (enables pipeline parallel and AlltoAll) |
| 2.11 | 2021 | NVLS (NVLink SHARP) and tighter NVSwitch integration |
| 2.14 | 2022 | Non-blocking communicator init via ncclConfig_t.blocking = 0 and ncclCommGetAsyncError polling |
| 2.17 to 2.19 | 2023 | User buffer registration; CUDA Graphs improvements; NVLink SHARP for AllReduce |
| 2.27 | July 14, 2025 | Symmetric memory low-latency kernels; SHARP on NVLink and InfiniBand; ncclCommShrink for fault tolerance; Direct NIC support on Grace Blackwell |
| 2.29 | December 2025 to February 2026 | Host-side one-sided APIs (ncclPutSignal, ncclWaitSignal); ncclCommGrow for elastic topology; multi-context GIN; nccl4py Python bindings |
| 2.30 | April 2026 | Tensor Memory Accelerator support; Dynamic Direct Path for multipath networking; one-sided RMA with CUDA graph capture |
Collective primitives
NCCL implements the standard MPI-style collective set, plus point-to-point send and receive. The current overview lists eight collectives.
| Primitive | What it does | Common use in deep learning |
|---|---|---|
| ncclAllReduce | Reduces (sum, prod, min, max, avg) one buffer per rank into an identical result on every rank | Averaging gradients across data parallelism workers; the single most-called NCCL routine in training |
| ncclAllGather | Concatenates one buffer per rank into a full buffer on every rank | Gathering sharded parameters in FSDP and ZeRO-3 forward/backward |
| ncclReduceScatter | Reduces across ranks and scatters the result so each rank holds one shard | Sharded gradient reduction in FSDP/ZeRO-3 (the dual of AllGather) |
| ncclBroadcast | Sends one rank's buffer to all others | Distributing initial weights, RNG seeds |
| ncclReduce | Reduces across ranks into a single root | Collecting metrics, validation loss |
| ncclAlltoAll | Each rank sends a distinct chunk to every other rank | Mixture of Experts routing, expert parallelism |
| ncclGather / ncclScatter | Many-to-one and one-to-many reorganisations | Less common in training; used in some pipeline schedules |
| ncclSend / ncclRecv | Point-to-point primitives | Pipeline-parallel activation/gradient passing in Megatron-LM and DeepSpeed |
A call is enqueued onto a CUDA stream and runs as a single fused kernel that interleaves transfers with the reduction arithmetic. Operations can be in-place (the receive buffer overlaps the send buffer) or out-of-place. Multiple calls can be batched into a single launch using ncclGroupStart() and ncclGroupEnd(), which both reduces launch overhead and allows NCCL to pipeline several small collectives, an important optimisation in pipeline parallelism and FSDP where many small communications overlap with compute.
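The relationship between ReduceScatter, AllGather, and AllReduce in the table can be made concrete with a host-only reference model. The sketch below is an illustration on plain arrays, not NCCL code: the function names ref_reduce_scatter_sum and ref_all_gather are hypothetical, and each "rank" is simply one array in the same process.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics of ReduceScatter with a sum reduction.
 * send[r] holds n_ranks * shard floats; recv[r] holds shard floats.
 * Afterwards rank r owns shard r of the element-wise sum over all ranks. */
void ref_reduce_scatter_sum(float **send, float **recv, int n_ranks, size_t shard)
{
    assert(send != NULL && recv != NULL && n_ranks > 0);
    for (int r = 0; r < n_ranks; r++) {
        for (size_t i = 0; i < shard; i++) {
            float acc = 0.0f;
            for (int src = 0; src < n_ranks; src++)
                acc += send[src][(size_t)r * shard + i];
            recv[r][i] = acc;
        }
    }
}

/* Reference semantics of AllGather.
 * send[r] holds shard floats; recv[r] holds n_ranks * shard floats.
 * Afterwards every rank holds all shards concatenated in rank order. */
void ref_all_gather(float **send, float **recv, int n_ranks, size_t shard)
{
    assert(send != NULL && recv != NULL && n_ranks > 0);
    for (int r = 0; r < n_ranks; r++)
        for (int src = 0; src < n_ranks; src++)
            for (size_t i = 0; i < shard; i++)
                recv[r][(size_t)src * shard + i] = send[src][i];
}
```

Calling ref_reduce_scatter_sum and then ref_all_gather on its output leaves every rank with the full element-wise sum, which is what an AllReduce produces. Sharded schemes such as FSDP keep the two halves separate so that each rank only stores its own shard between them.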
Algorithms
NCCL chooses the algorithm and protocol per call based on message size, communicator size, and topology. The two foundational algorithms are the ring and the double binary tree, with several variants layered on top for switch-assisted hardware.
Ring algorithm
The ring AllReduce arranges ranks in a logical ring and proceeds in two phases: a reduce-scatter where each rank ends up with one chunk of the fully reduced result, then an all-gather where the chunks are circulated until every rank has the full result. With N ranks, each rank sends and receives 2(N-1)/N times the buffer size in total, which stays below twice the buffer size however large N grows, so per-rank traffic is essentially independent of the number of ranks. Pitch Patarasuk and Xin Yuan proved this is bandwidth-optimal in their 2009 paper "Bandwidth optimal all-reduce algorithms for clusters of workstations," published in the Journal of Parallel and Distributed Computing. Their construction is the basis for the ring AllReduce in NCCL, in Baidu's original Allreduce work, and in Horovod.
The ring's weakness is latency. Each rank waits for its predecessor before forwarding, so total time scales linearly with the number of ranks. For small messages on large clusters this becomes the dominant cost.
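The two phases of the ring can be traced with a small host-side simulation. This is a sketch of the data flow only, under the assumption that the buffer length divides evenly by the number of ranks. It is not how NCCL's kernels are written, and ring_allreduce_sum is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simulates a ring AllReduce (sum) over n_ranks host buffers.
 * buf[r] is rank r's buffer of `count` floats; count must be a multiple
 * of n_ranks. Returns the number of elements each rank sent in total. */
size_t ring_allreduce_sum(float **buf, int n_ranks, size_t count)
{
    assert(buf != NULL && n_ranks > 0 && count % (size_t)n_ranks == 0);
    size_t chunk = count / (size_t)n_ranks;

    /* Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod N
     * to rank r + 1, which accumulates it. A rank never sends the chunk it
     * is receiving in the same step, so sequential simulation is safe. */
    for (int s = 0; s < n_ranks - 1; s++) {
        for (int r = 0; r < n_ranks; r++) {
            int dst = (r + 1) % n_ranks;
            size_t c = (size_t)(((r - s) % n_ranks + n_ranks) % n_ranks);
            for (size_t i = 0; i < chunk; i++)
                buf[dst][c * chunk + i] += buf[r][c * chunk + i];
        }
    }

    /* Phase 2, all-gather: rank r now owns the fully reduced chunk
     * (r + 1) mod N. In step s it forwards chunk (r + 1 - s) mod N,
     * and the receiver overwrites its stale copy. */
    for (int s = 0; s < n_ranks - 1; s++) {
        for (int r = 0; r < n_ranks; r++) {
            int dst = (r + 1) % n_ranks;
            size_t c = (size_t)(((r + 1 - s) % n_ranks + n_ranks) % n_ranks);
            for (size_t i = 0; i < chunk; i++)
                buf[dst][c * chunk + i] = buf[r][c * chunk + i];
        }
    }

    /* 2 * (N - 1) steps of one chunk each: 2(N-1)/N times the buffer. */
    return 2 * (size_t)(n_ranks - 1) * chunk;
}
```

The return value makes the traffic formula visible: with 3 ranks and 3 elements each rank sends 4 elements, that is 2(N-1)/N = 4/3 of its buffer. The two outer loops also show the 2(N-1) sequential steps behind the ring's latency cost.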
Tree algorithm and double binary trees
NCCL 2.4, released February 4, 2019, introduced the double binary tree, an algorithm originally described by Peter Sanders, Jochen Speck, and Jesper Larsson Träff in their 2009 paper "Two-tree algorithms for full bandwidth broadcast, reduction and scan." The construction builds two complementary binary trees over the ranks such that no rank is a non-leaf in both trees, and at most one rank is a leaf in both. Half the data flows through each tree, and the trees are walked simultaneously in two phases: a reduction up from leaves to root, then a broadcast down. Latency is O(log N) instead of O(N), while keeping the same effective bandwidth as a ring (each rank sends and receives half the data twice).
NVIDIA reported up to a 180x latency improvement at 24,576 GPUs on the Summit supercomputer using double binary trees instead of rings for small AllReduce calls. NCCL automatically falls back to rings when message sizes are large enough that the ring's higher bandwidth dominates.
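The difference between O(N) and O(log N) latency is easy to underestimate at cluster scale. The following sketch counts only sequential communication hops under a deliberately simplified model: a ring needs N-1 hops per phase, and a tree is assumed to be balanced with depth ceil(log2 N), walked once up and once down. It ignores pipelining, chunking, and per-hop cost, so it does not reproduce the measured 180x figure. Both function names are hypothetical.

```c
#include <assert.h>

/* Sequential hops in a ring AllReduce: N-1 for the reduce-scatter
 * phase plus N-1 for the all-gather phase. */
long ring_allreduce_steps(long n_ranks)
{
    assert(n_ranks >= 1);
    return 2 * (n_ranks - 1);
}

/* Sequential hops in a tree AllReduce under the simplified model:
 * ceil(log2 N) hops for the reduction up, the same for the broadcast down. */
int tree_allreduce_steps(long n_ranks)
{
    assert(n_ranks >= 1);
    int depth = 0;
    for (long span = 1; span < n_ranks; span *= 2)
        depth++;
    return 2 * depth;
}
```

For the 24,576 ranks of the Summit measurement this model gives 49,150 hops for the ring against 30 for the tree, which illustrates why small messages favour the tree even though both algorithms move a similar amount of data.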
Algorithm selection
| Regime | Preferred algorithm | Reason |
|---|---|---|
| Small messages, many ranks | Tree (double binary) | Latency scales as log N |
| Medium messages | Tree or Ring depending on tuning model | Crossover point |
| Large messages | Ring | Bandwidth-optimal as N grows |
| Single-node with NVSwitch and SHARP | NVLS / CollNet | In-fabric reduction offload |
| Multi-node with InfiniBand SHARP | CollNet (SHARP) | In-network reduction in the switch |
| Pre-NVSwitch hardware needing CollNet | CollNetChain | Tree-like pattern without SHARP |
The selectable set is exposed via the NCCL_ALGO environment variable, whose default value is Tree,Ring,CollnetDirect,CollnetChain (NVLS variants are added on hardware that supports them). The protocol used inside a chosen algorithm (LL, LL128, or Simple) is selected automatically and is exposed for diagnostic use as NCCL_PROTO.
CollNet, NVLS, and SHARP
NCCL also supports algorithms that offload part of the reduction into the network fabric itself. SHARP, the Scalable Hierarchical Aggregation and Reduction Protocol developed by Mellanox (now part of NVIDIA), is implemented in the silicon of Quantum InfiniBand switches and Spectrum Ethernet switches. When SHARP is available, the switch performs the reduction on data in flight, removing one full pass of traffic between endpoints. The CollNet algorithm in NCCL is the integration point: with the NCCL-SHARP plugin (distributed through HPC-X) the switch becomes a participant in the collective.
NVLink SHARP (NVLS) is the analogous capability inside an NVLink/NVSwitch domain, where the third-generation NVSwitch can reduce data as it passes through the fabric. NCCL 2.27, released July 14, 2025, extended SHARP support to AllGather and ReduceScatter on both NVLink and InfiniBand fabrics and reduced SM occupancy from 16 to 6 streaming multiprocessors during these collectives, freeing more compute for the model itself.
Topology awareness
When a communicator is created, NCCL probes the system: it walks the PCIe hierarchy through /sys, queries NVML for NVLink and NVSwitch connectivity, and inspects the InfiniBand subsystem for HCAs and their NUMA affinity. The result is a graph used to enumerate possible communication channels and to assign GPUs to NICs in a way that avoids crossing NUMA boundaries unnecessarily. Setting NCCL_TOPO_DUMP_FILE=/tmp/topo.xml writes the discovered topology to disk, which is invaluable when debugging slow runs or diagnosing why a particular NIC is not being used.
Topology awareness is the reason NCCL works well across a wide hardware range. The same code that runs across a DGX H100 (NVLink + NVSwitch + ConnectX-7 InfiniBand) will pick a different set of channels on an older server with only PCIe peer-to-peer, and a different one again on a cloud VM with no GPUDirect RDMA support.
Network transports
| Transport | When used |
|---|---|
| Shared host memory | Two GPUs on the same node without P2P, or as fallback |
| GPUDirect P2P over PCIe | Same-node GPUs with P2P enabled, no NVLink |
| NVLink / NVSwitch | Same-node GPUs with NVLink fabric (DGX, HGX, GB200) |
| GPUDirect RDMA over InfiniBand | Multi-node clusters with Mellanox/NVIDIA HCAs |
| GPUDirect RDMA over RoCE | Multi-node Ethernet with RDMA-capable NICs |
| TCP/IP sockets | Fallback when no RDMA path is available |
GPUDirect RDMA keeps host memory out of the data path on both sides: the NIC reads and writes GPU memory directly, while a CPU-side proxy thread only posts the transfers. This is what lets NCCL on a well-tuned cluster reach the line rate of 200 or 400 Gb/s InfiniBand. NCCL 2.27 added "Direct NIC" support that connects NICs to GPUs over PCIe Gen6 x16 to unlock 800 Gb/s on Grace Blackwell platforms.
Programming model
The core object is a communicator, ncclComm_t, which represents a group of GPUs that participate in collective operations together. A typical multi-process setup has one process per GPU, each holding one communicator created with ncclCommInitRank (or, in modern code, ncclCommInitRankConfig for tunables) and a shared ncclUniqueId distributed out-of-band. NCCL also offers ncclCommInitAll for the convenience case of one process driving multiple GPUs in the same node.
Every NCCL call takes a CUDA stream argument and is enqueued asynchronously: the call returns immediately on the host while the kernel runs on the stream. Synchronisation with subsequent CUDA work happens through normal stream semantics. This design lets NCCL be composed naturally with other CUDA kernels and with CUDA Graphs, which NCCL has supported since 2.9 with progressively wider coverage.
Group calls fuse multiple operations into one launch:
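/* Fuse one in-place AllReduce per layer into a single launch. */
ncclGroupStart();
for (int i = 0; i < num_layers; i++) {
    /* Same pointer for send and receive makes the call in-place;
       size[i] is an element count, not a byte count. */
    ncclAllReduce(grad[i], grad[i], size[i], ncclFloat, ncclSum, comm, stream);
}
ncclGroupEnd();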
This pattern is what lets PyTorch DDP overlap dozens of small per-bucket AllReduces with the backward pass without paying a launch cost for each one.
Non-blocking initialisation and asynchronous errors
NCCL 2.14 introduced non-blocking communicator initialisation. Setting config.blocking = 0 in ncclConfig_t and calling ncclCommInitRankConfig makes the call return immediately with the communicator in the ncclInProgress state. Applications then poll ncclCommGetAsyncError until the state is ncclSuccess, allowing several communicators to be brought up in parallel and avoiding deadlocks when initialisation crosses with other CUDA work. The same ncclCommGetAsyncError mechanism reports network errors that occur after init: an operation that hits a disconnected peer will not progress, and the application is expected to discover the failure asynchronously, then call ncclCommAbort and re-establish state.
Integration with deep learning frameworks
Most users never call NCCL directly. Frameworks wrap it behind familiar APIs.
| Framework | NCCL usage |
|---|---|
| PyTorch DDP | torch.distributed.init_process_group(backend="nccl") is the default for GPU training; the autograd engine fires AllReduce on each gradient bucket as it becomes ready in the backward pass, overlapped with computation |
| PyTorch FSDP | Uses AllGather to materialise sharded parameters and ReduceScatter to shard gradients; the entire ZeRO-3-style scheme runs on NCCL collectives |
| DeepSpeed | ZeRO stages 1, 2, and 3 are built on NCCL; pipeline parallelism uses ncclSend/ncclRecv |
| Megatron-LM | Tensor parallelism uses AllReduce after column-parallel and row-parallel matmuls; pipeline parallelism uses point-to-point send/recv across stages |
| Horovod | Originally used MPI; later added a NCCL backend after NCCL 2 enabled multi-node ring-allreduce. The 2018 paper by Alexander Sergeev and Mike Del Balso of Uber benchmarked the NCCL backend on up to 512 GPUs |
| JAX | jax.distributed and jax.lax.psum lower to NCCL collectives via XLA on NVIDIA hardware |
| TensorFlow | tf.distribute.MirroredStrategy and MultiWorkerMirroredStrategy use NCCL via the nccl_all_reduce op |
For a typical PyTorch training script the NCCL dependency is implicit: install a CUDA-capable PyTorch wheel and the bundled NCCL is used automatically.
Performance and benchmarking
NVIDIA maintains a separate benchmark suite, nccl-tests (github.com/NVIDIA/nccl-tests), with one binary per collective (all_reduce_perf, all_gather_perf, reduce_scatter_perf, and so on). The standard reported metric is bus bandwidth, which normalises algorithm bandwidth by 2(N-1)/N for AllReduce so that results are comparable across rank counts and can be directly compared with the hardware peak.
Real-world numbers depend heavily on the platform. On an 8x H100 NVLink-connected node (HGX H100), AllReduce bus bandwidth in the hundreds of GB/s is typical; users on the NCCL issue tracker have reported around 250 GB/s in some configurations and approaching 360 GB/s in others, against an NVLink Gen4 unidirectional ceiling of 450 GB/s per GPU. Across nodes the practical limit is set by the NIC: NDR InfiniBand at 400 Gb/s carries at most 50 GB/s per HCA before protocol overhead, and NCCL aggregates several rails. Anyone building or operating a GPU cluster runs nccl-tests early and often, since deviations from expected bus bandwidth are usually the first signal that something in the topology, BIOS, or NIC firmware is misconfigured.
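The bus bandwidth normalisation and the Gb/s to GB/s conversion used above are short enough to write out. The helpers below are a sketch with hypothetical names. They take the algorithm bandwidth to be message bytes divided by elapsed time and apply the 2(N-1)/N factor described for AllReduce.

```c
#include <assert.h>

/* Bus bandwidth for AllReduce in bytes per second:
 * algorithm bandwidth (bytes / seconds) scaled by 2(N-1)/N. */
double allreduce_bus_bandwidth(double bytes, double seconds, int n_ranks)
{
    assert(bytes >= 0.0 && seconds > 0.0 && n_ranks >= 1);
    double algbw = bytes / seconds;
    return algbw * 2.0 * (n_ranks - 1) / n_ranks;
}

/* Converts a link rate in gigabits per second to gigabytes per second. */
double gbit_to_gbyte_per_s(double gbit_per_s)
{
    assert(gbit_per_s >= 0.0);
    return gbit_per_s / 8.0;
}
```

With 8 ranks the factor is 14/8 = 1.75, so an AllReduce of 1 GB that completes in 5 ms has an algorithm bandwidth of 200 GB/s and a bus bandwidth of 350 GB/s, a figure that can be compared directly with the per-GPU link ceiling.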
Tuning environment variables
NCCL exposes dozens of environment variables. A handful come up constantly in production tuning and debugging.
| Variable | Purpose |
|---|---|
| NCCL_DEBUG=INFO | Prints version, topology, and per-call info; first thing to set when something is slow or broken |
| NCCL_DEBUG_SUBSYS | Filters debug output by subsystem (INIT, COLL, NET, etc.) |
| NCCL_TOPO_DUMP_FILE | Dumps detected topology to XML for inspection |
| NCCL_SOCKET_IFNAME | Restricts the OOB socket bootstrap to specific interfaces (e.g., eth0 or ^docker) |
| NCCL_IB_HCA | Selects which IB HCAs to use (e.g., mlx5_0:1,mlx5_1:1) |
| NCCL_IB_GID_INDEX | Picks the RoCE GID for Ethernet RDMA |
| NCCL_NET_GDR_LEVEL | Sets the maximum allowed GPU-NIC distance for GPUDirect RDMA (LOC, PIX, PXB, PHB, SYS) |
| NCCL_P2P_DISABLE=1 | Forces NCCL off NVLink/PCIe P2P (used to confirm a P2P bug) |
| NCCL_ALGO | Restricts the algorithm set (e.g., Tree,Ring) |
| NCCL_PROTO | Restricts the protocol set (LL, LL128, Simple) |
| NCCL_BLOCKING_WAIT | Used by frameworks (notably PyTorch) to control whether wait calls are blocking |
| NCCL_ASYNC_ERROR_HANDLING | PyTorch-side knob that enables async error reporting through NCCL |
NVIDIA's documentation cautions that variables in the debugging section should not be left set in production, since they can suppress optimisations or paper over a real configuration bug.
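For a one-off debugging session, a launcher can set the debug variables in its own process instead of exporting them cluster-wide. The sketch below uses POSIX setenv and assumes the variables must be in place before the first communicator is created, since that is when the library is expected to read them. The function name enable_nccl_debug_env and the INIT,NET filter are illustrative choices, not a recommended configuration.

```c
#include <assert.h>
#include <stdlib.h>

/* Sets NCCL debug variables for the current process only.
 * The last argument of setenv is 0, so values the user has already
 * exported are left untouched. Returns 0 on success, -1 on failure. */
int enable_nccl_debug_env(const char *topo_dump_path)
{
    assert(topo_dump_path != NULL);
    if (setenv("NCCL_DEBUG", "INFO", 0) != 0)
        return -1;
    if (setenv("NCCL_DEBUG_SUBSYS", "INIT,NET", 0) != 0)
        return -1;
    if (setenv("NCCL_TOPO_DUMP_FILE", topo_dump_path, 0) != 0)
        return -1;
    return 0;
}
```

Calling enable_nccl_debug_env("/tmp/topo.xml") at the top of main, before any NCCL initialisation, confines the settings to that run, which fits the caution about not leaving debugging variables set in production.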
Comparisons and alternatives
NCCL is one of several collective communication libraries used in HPC and AI. Its closest peers either target other accelerators or layer custom DSLs on top.
| Library | Vendor | Relationship to NCCL |
|---|---|---|
| RCCL | AMD (ROCm) | The ROCm Communication Collectives Library is ABI-compatible with NCCL: code that links against libnccl.so can run on AMD GPUs by linking librccl.so. Implements the same ncclAllReduce, ncclAllGather, etc. |
| oneCCL | Intel | Targets Intel CPUs and GPUs (Habana, Ponte Vecchio); part of oneAPI |
| MSCCL / MSCCL++ | Microsoft | Built on NCCL/RCCL plumbing but adds a DSL for custom collective algorithms; MSCCL++ rethinks the abstractions and reports up to 3.8x speedup over RCCL on small messages |
| OpenMPI / MPICH (CUDA-aware) | OpenMPI / Argonne | General-purpose MPI with CUDA-aware extensions; less tuned for dense GPU clusters but more flexible |
| UCX / UCC | UCF consortium | Lower-level transport (UCX) and collective (UCC) libraries; some MPI stacks build on them and can use NCCL underneath for GPU collectives |
For an organisation training large language models on NVIDIA hardware the choice is largely settled: PyTorch defaults to NCCL, every major distributed training framework targets it, and competing libraries either wrap or imitate its API.
License
NCCL is distributed under the permissive BSD-3-Clause license. The source tree is hosted at github.com/NVIDIA/nccl, and binary builds are also available as standalone packages from NVIDIA, in the NVIDIA HPC SDK, and in the NGC containers.
Common pitfalls
A few categories of failure recur often enough to be worth flagging:
- No NVLink, only PCIe: Cloud VMs that expose multiple GPUs over PCIe without NVSwitch will see AllReduce bandwidth in the tens of GB/s rather than hundreds. This is a hardware limit, not an NCCL bug.
- Mismatched NCCL versions across nodes: Different NCCL versions on different ranks can interoperate at the protocol level, but their tuning models drift apart and some newer features are silently disabled. Pin one version per cluster.
- Mismatched CUDA major versions: NCCL is built against a specific CUDA major version. Mixing a NCCL built for CUDA 12 with a CUDA 11 driver tends to produce confusing initialisation failures.
- Firewall blocking the OOB bootstrap: NCCL uses a TCP socket on each rank to exchange the unique ID and to coordinate setup. If a firewall blocks ephemeral ports between nodes, NCCL INFO logs will hang at the bootstrap phase. Set NCCL_SOCKET_IFNAME to a known-open interface and verify connectivity with nc.
- NCCL_P2P_DISABLE left set: A diagnostic flag that someone enabled once to work around a driver bug and then forgot. Always check the environment when bandwidth looks low.
- Mixing NVLink topologies: On heterogeneous clusters where some nodes have NVSwitch and others have only NVLink bridges, NCCL will pick the lowest common denominator. Splitting the job into homogeneous communicators avoids the penalty.
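The advice to check the environment when bandwidth looks low can be automated at job start. The following sketch scans for a few of the diagnostic variables named in this article and reports any that are set. The list and the function name count_leftover_nccl_vars are illustrative assumptions, and a real cluster would extend the list to match its own conventions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Debugging variables from this article that are easy to leave behind. */
static const char *const k_diag_vars[] = {
    "NCCL_P2P_DISABLE", "NCCL_ALGO", "NCCL_PROTO", "NCCL_DEBUG"
};

/* Writes one warning line per variable that is set in the environment
 * and returns how many were found. */
int count_leftover_nccl_vars(FILE *out)
{
    assert(out != NULL);
    int found = 0;
    size_t n = sizeof k_diag_vars / sizeof k_diag_vars[0];
    for (size_t i = 0; i < n; i++) {
        const char *value = getenv(k_diag_vars[i]);
        if (value != NULL) {
            fprintf(out, "warning: %s=%s is set\n", k_diag_vars[i], value);
            found++;
        }
    }
    return found;
}
```

A launcher could call count_leftover_nccl_vars(stderr) before initialising NCCL and refuse to start a production job when the result is non-zero.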