NCCL (NVIDIA Collective Communications Library)

Updated Jun 21, 2026

NCCL (the NVIDIA Collective Communications Library, pronounced "Nickel") is an open-source library from NVIDIA that provides topology-aware, multi-GPU and multi-node collective communication primitives such as AllReduce, AllGather, ReduceScatter, and Broadcast, optimized for NVIDIA hardware.^[1] It is the de facto communication backbone for synchronous distributed training on NVIDIA GPUs and underlies almost every modern deep learning framework that scales beyond a single device, including PyTorch DistributedDataParallel, Fully Sharded Data Parallel (FSDP), DeepSpeed, Megatron-LM, and JAX.^[1] NVIDIA describes it as delivering "high-performance, topology-aware collective operations" that are "optimized for NVIDIA GPUs and a variety of interconnects."^[8]

NCCL provides routines such as AllReduce, AllGather, ReduceScatter, Broadcast, Reduce, AlltoAll, Gather, Scatter, and point-to-point Send/Recv.^[2] The implementation is topology aware: it discovers the system's PCIe tree, NVLink and NVSwitch fabric, and InfiniBand or Ethernet HCAs, then chooses an algorithm and communication channels tuned for the message size and the underlying hardware.^[1] Each collective is fused into a single CUDA kernel that handles both the arithmetic and the data movement, which avoids extra kernel launches and lets NCCL saturate links such as NVLink at hundreds of gigabytes per second.^[16]

What is NCCL used for?

Deep learning at scale is bottlenecked less often by raw compute than by the cost of moving tensors between devices. Synchronous data parallelism, the most common form of distributed training, requires every worker to average its gradients with every other worker after each backward pass. Done naively over PCIe or TCP, this collapses the throughput of even fast accelerators. NCCL exists to push those collective operations as close to the bandwidth ceiling of the underlying interconnect as possible, whether that interconnect is NVLink inside a single DGX node or InfiniBand across thousands of nodes in a supercomputer.^[1]

The API was deliberately modelled on MPI, the long-standing standard for collective communication in HPC, so engineers familiar with MPI can adopt NCCL with minimal friction.^[2] Unlike MPI, NCCL is GPU-centric: buffers live in device memory, calls are issued from the host but executed on a CUDA stream, and there is no implicit copy through the host.^[2] The library is licensed under BSD-3-Clause and is developed in the open at github.com/NVIDIA/nccl.^[4]

When was NCCL released? History and version timeline

NCCL began as an open-source research project at NVIDIA. Early commits in the public repository date from 2015, and the first widely cited NVIDIA Developer blog post introducing the library, "Fast Multi-GPU Collectives with NCCL," appeared on April 7, 2016.^[6] NCCL 1.x was a single-node, intra-server library that exchanged data between GPUs using direct peer-to-peer load/store across PCIe and NVLink. It implemented five collectives: AllReduce, AllGather, Broadcast, Reduce, and ReduceScatter, built on three internal kernels (Copy, Reduce, and ReduceAndCopy).^[6]

NCCL 2.0 (2017) added multi-node support. Inter-node transfers used GPUDirect RDMA over InfiniBand where available, with a CPU-side proxy thread driving the NIC.^[7] From that point onward NCCL became the standard collective library for distributed deep learning on NVIDIA hardware. Subsequent releases steadily added algorithms, transports, and resilience features.

Version	Year	Notable additions
1.x	2015 to 2016	Single-node multi-GPU; ring AllReduce; 5 collectives in fused kernels
2.0	2017	Multi-node support via GPUDirect RDMA over InfiniBand
2.4	February 2019	Double binary tree algorithm for low-latency AllReduce
2.7	2020	Point-to-point `ncclSend`/`ncclRecv` (enables pipeline parallel and AlltoAll)
2.11	2021	Custom pre-multiplied-sum reduction operators via `ncclRedOpCreatePreMulSum`; CollNet and CUDA Graphs improvements
2.14	2022	Non-blocking communicator init via `ncclConfig_t.blocking = 0` and `ncclCommGetAsyncError` polling
2.17 to 2.19	2023	User buffer registration; CUDA Graphs improvements; NVLink SHARP for AllReduce
2.27	July 14, 2025	Symmetric memory low-latency kernels; SHARP on NVLink and InfiniBand; `ncclCommShrink` for fault tolerance; Direct NIC support on Grace Blackwell
2.29	December 2025 to February 2026	Host-side one-sided APIs (`ncclPutSignal`, `ncclWaitSignal`); `ncclCommGrow` for elastic topology; multi-context GIN; nccl4py Python bindings
2.30	April 2026	Tensor Memory Accelerator support; Dynamic Direct Path for multipath networking; one-sided RMA with CUDA graph capture

What collective primitives does NCCL provide?

NCCL implements the standard MPI-style collective set, plus point-to-point send and receive. The current overview lists eight collectives.^[2]

Primitive	What it does	Common use in deep learning
`ncclAllReduce`	Reduces (sum, prod, min, max, avg) one buffer per rank into an identical result on every rank	Averaging gradients across data parallelism workers; the single most-called NCCL routine in training
`ncclAllGather`	Concatenates one buffer per rank into a full buffer on every rank	Gathering sharded parameters in FSDP and ZeRO-3 forward/backward
`ncclReduceScatter`	Reduces across ranks and scatters the result so each rank holds one shard	Sharded gradient reduction in FSDP/ZeRO-3 (the dual of AllGather)
`ncclBroadcast`	Sends one rank's buffer to all others	Distributing initial weights, RNG seeds
`ncclReduce`	Reduces across ranks into a single root	Collecting metrics, validation loss
`ncclAlltoAll`	Each rank sends a distinct chunk to every other rank	Mixture of Experts routing, expert parallelism
`ncclGather` / `ncclScatter`	Many-to-one and one-to-many reorganisations	Less common in training; used in some pipeline schedules
`ncclSend` / `ncclRecv`	Point-to-point primitives	Pipeline-parallel activation/gradient passing in Megatron-LM and DeepSpeed

A call is enqueued onto a CUDA stream and runs as a single fused kernel that interleaves transfers with the reduction arithmetic. Operations can be in-place (the receive buffer overlaps the send buffer) or out-of-place. Multiple calls can be batched into a single launch using ncclGroupStart() and ncclGroupEnd(), which both reduces launch overhead and allows NCCL to pipeline several small collectives, an important optimisation in pipeline parallelism and FSDP where many small communications overlap with compute.^[1]
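The semantics in the table can be pinned down with a small pure-Python reference model. This is only a sketch of what each collective computes, with one Python list standing in for each rank's device buffer; the function names are illustrative and are not NCCL's API, and nothing here models streams, kernels, or transports.

```python
from typing import List

Vector = List[float]


def all_reduce_sum(bufs: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum of all input buffers."""
    total = [sum(column) for column in zip(*bufs)]
    return [list(total) for _ in bufs]


def all_gather(bufs: List[Vector]) -> List[Vector]:
    """Every rank ends up with all buffers concatenated in rank order."""
    full = [x for buf in bufs for x in buf]
    return [list(full) for _ in bufs]


def reduce_scatter_sum(bufs: List[Vector]) -> List[Vector]:
    """Rank r ends up with the r-th equal-sized shard of the element-wise sum."""
    n = len(bufs)
    total = [sum(column) for column in zip(*bufs)]
    shard = len(total) // n  # assumes the buffer length is divisible by n
    return [total[r * shard:(r + 1) * shard] for r in range(n)]


def broadcast(bufs: List[Vector], root: int) -> List[Vector]:
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(bufs[root]) for _ in bufs]


if __name__ == "__main__":
    ranks = [[1.0, 2.0], [10.0, 20.0]]
    print(all_reduce_sum(ranks))
    print(all_gather(reduce_scatter_sum(ranks)))
```

In this model a ReduceScatter followed by an AllGather produces the same result as an AllReduce, which is the relationship FSDP-style sharding relies on.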

How does NCCL choose an algorithm?

NCCL chooses the algorithm and protocol per call based on message size, communicator size, and topology. The two foundational algorithms are the ring and the double binary tree, with several variants layered on top for switch-assisted hardware.^[16]

Ring algorithm

The ring AllReduce arranges ranks in a logical ring and proceeds in two phases: a reduce-scatter where each rank ends up with one chunk of the fully reduced result, then an all-gather where the chunks are circulated until every rank has the full result. With N ranks, each rank sends and receives 2(N-1)/N times the buffer size in total. That volume stays below twice the buffer size however large N grows, so per-rank traffic is essentially independent of the number of ranks. Pitch Patarasuk and Xin Yuan proved this is bandwidth-optimal in their 2009 paper "Bandwidth optimal all-reduce algorithms for clusters of workstations," published in the Journal of Parallel and Distributed Computing.^[9] Their construction is the basis for the ring AllReduce in NCCL, in Baidu's original Allreduce work, and in Horovod.^[9]

The ring's weakness is latency. Each rank waits for its predecessor before forwarding, so total time scales linearly with the number of ranks. For small messages on large clusters this becomes the dominant cost.
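The two phases can be traced with a small simulation. The following is a minimal sketch, assuming one Python list per rank and a buffer that divides evenly into N chunks; it models only the data movement of a textbook ring, not NCCL's channels, protocols, or kernels, and `ring_all_reduce` is a name chosen for illustration.

```python
from typing import List


def ring_all_reduce(bufs: List[List[float]]) -> int:
    """Sum-AllReduce `bufs` in place; return the elements sent by each rank."""
    n = len(bufs)
    size = len(bufs[0])
    assert size % n == 0, "sketch assumes the buffer splits into n equal chunks"
    chunk = size // n
    sent_per_rank = 0

    def span(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At each step rank r forwards one chunk to
    # rank r+1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot outgoing data first: all ranks send simultaneously.
        outgoing = [list(bufs[r][span((r - step) % n)]) for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            s = span((r - step) % n)
            bufs[dst][s] = [a + b for a, b in zip(bufs[dst][s], outgoing[r])]
        sent_per_rank += chunk

    # Phase 2: all-gather. Rank r now owns the fully reduced chunk (r+1) % n
    # and the finished chunks circulate around the ring, overwriting stale data.
    for step in range(n - 1):
        outgoing = [list(bufs[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            bufs[dst][span((r + 1 - step) % n)] = outgoing[r]
        sent_per_rank += chunk

    return sent_per_rank


if __name__ == "__main__":
    ranks = [[float(r * 10 + i) for i in range(8)] for r in range(4)]
    sent = ring_all_reduce(ranks)
    print(ranks[0], sent)
```

The returned count equals 2(N-1)/N times the buffer length, matching the traffic figure quoted above, and the 2(N-1) sequential steps are the source of the latency cost.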

Tree algorithm and double binary trees

NCCL 2.4, released February 4, 2019, introduced the double binary tree, an algorithm originally described by Peter Sanders, Jochen Speck, and Jesper Larsson Träff in their 2009 paper "Two-tree algorithms for full bandwidth broadcast, reduction and scan."^[7]^[10] The construction builds two complementary binary trees over the ranks such that no rank is a non-leaf in both trees, and at most one rank is a leaf in both. Half the data flows through each tree, and the trees are walked simultaneously in two phases: a reduction up from leaves to root, then a broadcast down. Latency is O(log N) instead of O(N), while keeping the same effective bandwidth as a ring (each rank sends and receives half the data twice).^[10]

NVIDIA reported up to a 180x latency improvement at 24,576 GPUs on the Summit supercomputer using double binary trees instead of rings for small AllReduce calls.^[7] NCCL automatically falls back to rings when message sizes are large enough that the ring's higher bandwidth dominates.^[7]
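The gap between O(N) and O(log N) is easy to underestimate at this scale. The sketch below counts sequential communication steps under a deliberately simplified model: 2(N-1) for a ring and roughly 2·ceil(log2 N) for a reduce-up, broadcast-down pass over a binary tree. These are assumptions for illustration, not NCCL's tuning model, and step counts are not a prediction of measured latency.

```python
def ring_steps(n: int) -> int:
    """Sequential steps in a ring AllReduce: (N-1) reduce-scatter + (N-1) all-gather."""
    return 2 * (n - 1)


def tree_steps(n: int) -> int:
    """Approximate steps for reduce-up plus broadcast-down in a binary tree."""
    depth = (n - 1).bit_length()  # integer ceil(log2(n)) for n >= 1
    return 2 * depth


if __name__ == "__main__":
    for n in (8, 1024, 24576):
        print(n, ring_steps(n), tree_steps(n))
```

For small messages each step costs about one network latency, so the step count dominates; for large messages the per-step payload matters more, which is why rings remain competitive there.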

Algorithm selection

Regime	Preferred algorithm	Reason
Small messages, many ranks	Tree (double binary)	Latency scales as log N
Medium messages	Tree or Ring depending on tuning model	Crossover point
Large messages	Ring	Bandwidth-optimal as N grows
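Single node or NVLink domain with NVSwitch and NVLink SHARP	NVLS	In-fabric reduction offload
Multi-node with InfiniBand SHARP	CollNetDirect	In-network reduction in the switch; GPUs within a node exchange data directly with one another
Multi-node with InfiniBand SHARP, chained intra-node layout	CollNetChain	GPUs within a node form a chain that feeds the in-network reduction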

The selectable set is exposed via the NCCL_ALGO environment variable, whose default value is Tree,Ring,CollnetDirect,CollnetChain (NVLS variants are added on hardware that supports them). The protocol used inside a chosen algorithm (LL, LL128, or Simple) is selected automatically and is exposed for diagnostic use as NCCL_PROTO.^[3]

CollNet, NVLS, and SHARP

NCCL also supports algorithms that offload part of the reduction into the network fabric itself. SHARP, the Scalable Hierarchical Aggregation and Reduction Protocol developed by Mellanox (now part of NVIDIA), is implemented in the silicon of Quantum InfiniBand switches and Spectrum Ethernet switches.^[13] When SHARP is available, the switch performs the reduction on data in flight, removing one full pass of traffic between endpoints. The CollNet algorithm in NCCL is the integration point: with the NCCL-SHARP plugin (distributed through HPC-X) the switch becomes a participant in the collective.^[14]

NVLink SHARP (NVLS) is the analogous capability inside an NVLink/NVSwitch domain, where the third-generation NVSwitch can reduce data as it passes through the fabric. NCCL 2.27, released July 14, 2025, extended SHARP support to AllGather and ReduceScatter on both NVLink and InfiniBand fabrics. By moving the reduction into the fabric, these collectives cut the streaming multiprocessors (SMs) used per GPU from the 16 that a ring-based implementation consumes to 6 or fewer, a 2.7x reduction that frees more compute for the model itself.^[8] The same release added symmetric memory low-latency kernels that NVIDIA reports deliver "up to 9x reduction in latency for small message sizes" when buffers share identical virtual addresses across GPUs.^[8]

How does NCCL discover the hardware topology?

When a communicator is created, NCCL probes the system: it walks the PCIe hierarchy through /sys, queries NVML for NVLink and NVSwitch connectivity, and inspects the InfiniBand subsystem for HCAs and their NUMA affinity.^[1] The result is a graph used to enumerate possible communication channels and to assign GPUs to NICs in a way that avoids crossing NUMA boundaries unnecessarily. Setting NCCL_TOPO_DUMP_FILE=/tmp/topo.xml writes the discovered topology to disk, which is invaluable when debugging slow runs or diagnosing why a particular NIC is not being used.^[3]

Topology awareness is the reason NCCL works well across a wide hardware range. The same code that runs across a DGX H100 (NVLink + NVSwitch + ConnectX-7 InfiniBand) will pick a different set of channels on an older server with only PCIe peer-to-peer, and a different one again on a cloud VM with no GPUDirect RDMA support.

What interconnects and transports does NCCL use?

Transport	When used
Shared host memory	Two GPUs on the same node without P2P, or as fallback
GPUDirect P2P over PCIe	Same-node GPUs with P2P enabled, no NVLink
NVLink / NVSwitch	Same-node GPUs with NVLink fabric (DGX, HGX, GB200)
GPUDirect RDMA over InfiniBand	Multi-node clusters with Mellanox/NVIDIA HCAs
GPUDirect RDMA over RoCE	Multi-node Ethernet with RDMA-capable NICs
TCP/IP sockets	Fallback when no RDMA path is available

GPUDirect RDMA keeps host memory out of the data path on both sides: the NIC reads and writes GPU memory directly, and the CPU-side proxy thread only drives the NIC. This is what lets NCCL on a well-tuned cluster approach the line rate of 200 or 400 Gb/s InfiniBand. NCCL 2.27 added "Direct NIC" support that connects NICs to GPUs over PCIe Gen6 x16 to unlock the full 800 Gb/s of network bandwidth on Grace Blackwell platforms.^[8]

Programming model

The core object is a communicator, ncclComm_t, which represents a group of GPUs that participate in collective operations together. A typical multi-process setup has one process per GPU, each holding one communicator created with ncclCommInitRank (or, in modern code, ncclCommInitRankConfig for tunables) and a shared ncclUniqueId distributed out-of-band. NCCL also offers ncclCommInitAll for the convenience case of one process driving multiple GPUs in the same node.^[1]

Every NCCL call takes a CUDA stream argument and is enqueued asynchronously: the call returns immediately on the host while the kernel runs on the stream. Synchronisation with subsequent CUDA work happens through normal stream semantics. This design lets NCCL be composed naturally with other CUDA kernels and with CUDA Graphs, which have been able to capture NCCL operations since 2.9, with progressively wider coverage in later releases.^[1]

Group calls fuse multiple operations into one launch:

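// Fragment: assumes comm and stream are already initialised, and that
// grad[i] / size[i] describe per-layer device buffers of float elements.
ncclGroupStart();
for (int i = 0; i < num_layers; i++) {
    // In-place sum: the send and receive buffers are the same device pointer.
    ncclAllReduce(grad[i], grad[i], size[i], ncclFloat, ncclSum, comm, stream);
}
ncclGroupEnd();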

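Batching amortises launch cost across many small operations. PyTorch DDP applies the same principle at a different layer: it packs gradients into buckets so that each AllReduce covers many tensors, and it overlaps those per-bucket AllReduces with the backward pass.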

Non-blocking initialisation and asynchronous errors

NCCL 2.14 introduced non-blocking communicator initialisation. Setting config.blocking = 0 in ncclConfig_t and calling ncclCommInitRankConfig makes the call return immediately with the communicator in the ncclInProgress state. Applications then poll ncclCommGetAsyncError until the state is ncclSuccess, allowing several communicators to be brought up in parallel and avoiding deadlocks when initialisation crosses with other CUDA work.^[1] The same ncclCommGetAsyncError mechanism reports network errors that occur after init: an operation that hits a disconnected peer will not progress, and the application is expected to discover the failure asynchronously, then call ncclCommAbort and re-establish state.^[1]
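The polling pattern can be sketched independently of the C API. The code below is a hypothetical stand-in written in Python: `wait_until_ready` and `FakeComm` are invented for illustration, and the callable passed in plays the role of `ncclCommGetAsyncError`. Only the state names `ncclInProgress` and `ncclSuccess` come from the description above; the timeout policy is an assumption about what an application might reasonably do.

```python
import time
from typing import Callable

IN_PROGRESS = "ncclInProgress"
SUCCESS = "ncclSuccess"


def wait_until_ready(get_async_error: Callable[[], str],
                     timeout_s: float = 30.0,
                     poll_interval_s: float = 0.01) -> None:
    """Poll a communicator's async state until it succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_async_error()
        if state == SUCCESS:
            return
        if state != IN_PROGRESS:
            # Any other state is an error; the caller should abort the communicator.
            raise RuntimeError(f"communicator failed: {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError("communicator still in progress; abort and retry")
        time.sleep(poll_interval_s)


class FakeComm:
    """Test double: reports in-progress a fixed number of times, then success."""

    def __init__(self, polls_until_ready: int) -> None:
        self.remaining = polls_until_ready

    def get_async_error(self) -> str:
        if self.remaining > 0:
            self.remaining -= 1
            return IN_PROGRESS
        return SUCCESS


if __name__ == "__main__":
    comm = FakeComm(polls_until_ready=3)
    wait_until_ready(comm.get_async_error)
    print("communicator ready")
```

Because the wait is an ordinary loop on the host, an application can interleave polls of several communicators, or give up after a deadline instead of blocking forever inside an init call.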

How does NCCL integrate with PyTorch and other frameworks?

Most users never call NCCL directly. Frameworks wrap it behind familiar APIs.

Framework	NCCL usage
PyTorch DDP	`torch.distributed.init_process_group(backend="nccl")` is the default for GPU training; the autograd engine fires AllReduce on each gradient bucket as it becomes ready in the backward pass, overlapped with computation
PyTorch FSDP	Uses AllGather to materialise sharded parameters and ReduceScatter to shard gradients; the entire ZeRO-3-style scheme runs on NCCL collectives
DeepSpeed	ZeRO stages 1, 2, and 3 are built on NCCL; pipeline parallelism uses `ncclSend`/`ncclRecv`
Megatron-LM	Tensor parallelism uses AllReduce after column-parallel and row-parallel matmuls; pipeline parallelism uses point-to-point send/recv across stages
Horovod	Originally used MPI; later added a NCCL backend after NCCL 2 enabled multi-node ring-allreduce. The 2018 paper by Alexander Sergeev and Mike Del Balso of Uber benchmarked the NCCL backend on up to 512 GPUs
JAX	`jax.distributed` and `jax.lax.psum` lower to NCCL collectives via XLA on NVIDIA hardware
TensorFlow	`tf.distribute.MirroredStrategy` and `MultiWorkerMirroredStrategy` use NCCL via the `nccl_all_reduce` op

For a typical PyTorch training script the NCCL dependency is implicit: install a CUDA-capable PyTorch wheel and the bundled NCCL is used automatically.

How is NCCL performance measured?

NVIDIA maintains a separate benchmark suite, nccl-tests (github.com/NVIDIA/nccl-tests), with one binary per collective (all_reduce_perf, all_gather_perf, reduce_scatter_perf, and so on).^[5] The standard reported metric is bus bandwidth, which multiplies algorithm bandwidth (buffer size divided by elapsed time) by 2(N-1)/N for AllReduce so that results are comparable across rank counts and can be directly compared with the hardware peak.^[5]

Real-world numbers depend heavily on the platform. On an 8x H100 NVLink-connected node (HGX H100), AllReduce bus bandwidth in the hundreds of GB/s is typical; users on the NCCL issue tracker have reported around 250 GB/s in some configurations and approaching 360 GB/s in others, against an NVLink Gen4 unidirectional ceiling of 450 GB/s per GPU. Across nodes the practical limit is set by the NIC: NDR InfiniBand at 400 Gb/s corresponds to 50 GB/s per HCA before protocol overhead, and NCCL aggregates several rails. Anyone building or operating a GPU cluster runs nccl-tests early and often, since deviations from expected bus bandwidth are usually the first signal that something in the topology, BIOS, or NIC firmware is misconfigured.^[5]
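The bus bandwidth arithmetic is simple enough to reproduce by hand. The sketch below applies the 2(N-1)/N AllReduce factor described above; the factors for the other collectives reflect my reading of the nccl-tests performance notes and should be treated as assumptions to check against that repository. The timing in the example is made up for illustration, not a measurement.

```python
# Correction factors applied to algorithm bandwidth to obtain bus bandwidth.
# Only the all_reduce entry is taken from the text above; the rest are assumed.
BUSBW_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def algorithm_bandwidth(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth in GB/s: buffer size divided by elapsed time."""
    return size_bytes / seconds / 1e9


def bus_bandwidth(collective: str, n_ranks: int, algbw_gbs: float) -> float:
    """Bus bandwidth in GB/s for a given collective and rank count."""
    return algbw_gbs * BUSBW_FACTOR[collective](n_ranks)


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate in Gb/s to GB/s (8 bits per byte, no overhead)."""
    return gbit_per_s / 8


if __name__ == "__main__":
    # Hypothetical run: a 1 GB AllReduce across 8 ranks finishing in 5 ms.
    algbw = algorithm_bandwidth(1e9, 0.005)
    print(algbw, bus_bandwidth("all_reduce", 8, algbw), gbit_to_gbyte(400))
```

Comparing the resulting figure with the link ceiling (450 GB/s per direction on the NVLink example above, or the per-HCA rate across nodes) shows how far a run is from the hardware limit.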

Tuning environment variables

NCCL exposes dozens of environment variables. A handful come up constantly in production tuning and debugging.^[3]

Variable	Purpose
`NCCL_DEBUG=INFO`	Prints version, topology, and per-call info; first thing to set when something is slow or broken
`NCCL_DEBUG_SUBSYS`	Filters debug output by subsystem (INIT, COLL, NET, etc.)
`NCCL_TOPO_DUMP_FILE`	Dumps detected topology to XML for inspection
`NCCL_SOCKET_IFNAME`	Restricts the OOB socket bootstrap to specific interfaces (e.g., `eth0` or `^docker`)
`NCCL_IB_HCA`	Selects which IB HCAs to use (e.g., `mlx5_0:1,mlx5_1:1`)
`NCCL_IB_GID_INDEX`	Picks the RoCE GID for Ethernet RDMA
`NCCL_NET_GDR_LEVEL`	Sets the maximum allowed GPU-NIC distance for GPUDirect RDMA (`LOC`, `PIX`, `PXB`, `PHB`, `SYS`)
`NCCL_P2P_DISABLE=1`	Forces NCCL off NVLink/PCIe P2P (used to confirm a P2P bug)
`NCCL_ALGO`	Restricts the algorithm set (e.g., `Tree,Ring`)
`NCCL_PROTO`	Restricts the protocol set (`LL`, `LL128`, `Simple`)
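`NCCL_BLOCKING_WAIT`	PyTorch-side variable, not read by NCCL itself (named `TORCH_NCCL_BLOCKING_WAIT` in recent PyTorch releases), that controls whether wait calls are blocking
`NCCL_ASYNC_ERROR_HANDLING`	PyTorch-side variable (named `TORCH_NCCL_ASYNC_ERROR_HANDLING` in recent PyTorch releases) that enables async error reporting through NCCL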

NVIDIA's documentation cautions that variables in the debugging section should not be left set in production, since they can suppress optimisations or paper over a real configuration bug.^[3]
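One way to keep debugging variables out of production is to apply them only to the environment of a single diagnostic launch. The helper below is a minimal sketch: `nccl_debug_env` is a hypothetical name, the subsystem list and dump path are example values taken from the table and the topology section, and the function never modifies the calling process's own environment.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_debug_env(base: Optional[Mapping[str, str]] = None,
                   subsystems: str = "INIT,NET",
                   topo_dump: str = "/tmp/topo.xml") -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with NCCL debug variables set."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"                # version, topology, per-call info
    env["NCCL_DEBUG_SUBSYS"] = subsystems     # limit output to chosen subsystems
    env["NCCL_TOPO_DUMP_FILE"] = topo_dump    # write detected topology as XML
    return env


if __name__ == "__main__":
    env = nccl_debug_env({"PATH": "/usr/bin"})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the `env=` argument of `subprocess.run` when starting a one-off diagnostic job, so the settings disappear when that job exits.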

How does NCCL compare to RCCL, oneCCL, and MPI?

NCCL is one of several collective communication libraries used in HPC and AI. Its closest peers either target other accelerators or layer custom DSLs on top.

Library	Vendor	Relationship to NCCL
RCCL	AMD (ROCm)	The ROCm Communication Collectives Library is API-compatible with NCCL: it implements the same `ncclAllReduce`, `ncclAllGather`, etc., so code written against NCCL can be rebuilt for AMD GPUs and linked against `librccl.so` instead of `libnccl.so`.
oneCCL	Intel	Targets Intel CPUs and GPUs (such as Ponte Vecchio); part of oneAPI
MSCCL / MSCCL++	Microsoft	Built on NCCL/RCCL plumbing but adds a DSL for custom collective algorithms; MSCCL++ rethinks the abstractions and reports up to 3.8x speedup over RCCL on small messages
OpenMPI / MPICH (CUDA-aware)	OpenMPI / Argonne	General-purpose MPI with CUDA-aware extensions; less tuned for dense GPU clusters but more flexible
UCX / UCC	UCF consortium	Lower-level transport (UCX) and collective (UCC) libraries; some MPI stacks build on them and can use NCCL underneath for GPU collectives

For an organisation training large language models on NVIDIA hardware the choice is largely settled: PyTorch defaults to NCCL, every major distributed training framework targets it, and competing libraries either wrap or imitate its API.

Is NCCL open source?

Yes. NCCL is distributed under the permissive BSD-3-Clause license.^[4] The source tree is hosted at github.com/NVIDIA/nccl, and binary builds are also available as separate packages from NVIDIA and bundled in the NGC containers.^[4]

Common pitfalls

A few categories of failure recur often enough to be worth flagging:

No NVLink, only PCIe: Cloud VMs that expose multiple GPUs over PCIe without NVSwitch will see AllReduce bandwidth in the tens of GB/s rather than hundreds. This is a hardware limit, not an NCCL bug.
Mismatched NCCL versions across nodes: Different NCCL versions on different ranks can interoperate at the protocol level but tuning models drift, and some new features will silently disable. Pin one version per cluster.
Mismatched CUDA major versions: NCCL is built against a specific CUDA major version. Mixing a NCCL built for CUDA 12 with a CUDA 11 driver tends to produce confusing initialisation failures.
Firewall blocking the OOB bootstrap: NCCL uses a TCP socket on each rank to exchange the unique ID and to coordinate setup. If a firewall blocks ephemeral ports between nodes, initialisation hangs at the bootstrap phase, which is visible in the NCCL_DEBUG=INFO logs. Set NCCL_SOCKET_IFNAME to a known-open interface and verify connectivity with nc.
NCCL_P2P_DISABLE left set: A diagnostic flag that someone enabled once to work around a driver bug and then forgot. Always check the environment when bandwidth looks low.
Mixing NVLink topologies: On heterogeneous clusters where some nodes have NVSwitch and others have only NVLink bridges, NCCL will pick the lowest common denominator. Splitting the job into homogeneous communicators avoids the penalty.
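The "forgotten diagnostic flag" pitfall lends itself to an automated pre-flight check. The sketch below scans an environment mapping for variables that restrict or disable NCCL behaviour; the watch list is an illustrative choice drawn from the tuning table above rather than an official list, and `leftover_diagnostics` is a hypothetical helper name.

```python
import os
from typing import Dict, Mapping

# Variables that are useful while debugging but suspicious in a production job.
# Illustrative selection; extend it to match local policy.
DIAGNOSTIC_VARS = ("NCCL_P2P_DISABLE", "NCCL_ALGO", "NCCL_PROTO", "NCCL_DEBUG")


def leftover_diagnostics(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the watched variables that are set in `env`, with their values."""
    return {name: env[name] for name in DIAGNOSTIC_VARS if name in env}


if __name__ == "__main__":
    found = leftover_diagnostics(os.environ)
    for name, value in found.items():
        print(f"warning: {name}={value} is set; confirm this is intentional")
```

Running a check like this at job start, before the communicator is created, turns a silent bandwidth regression into an explicit warning in the job log.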

References

NVIDIA, "NVIDIA Collective Communication Library (NCCL) Documentation," https://docs.nvidia.com/deeplearning/nccl/
NVIDIA, "Overview of NCCL," https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html
NVIDIA, "NCCL Environment Variables," https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NVIDIA/nccl GitHub repository, https://github.com/NVIDIA/nccl
NVIDIA/nccl-tests GitHub repository, https://github.com/NVIDIA/nccl-tests
N. Luehr, "Fast Multi-GPU Collectives with NCCL," NVIDIA Developer Blog, April 7, 2016, https://developer.nvidia.com/blog/fast-multi-gpu-collectives-nccl/
S. Jeaugey, "Massively Scale Your Deep Learning Training with NCCL 2.4," NVIDIA Developer Blog, February 4, 2019, https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/
NVIDIA, "Enabling Fast Inference and Resilient Training with NCCL 2.27," NVIDIA Developer Blog, July 14, 2025, https://developer.nvidia.com/blog/enabling-fast-inference-and-resilient-training-with-nccl-2-27/
P. Patarasuk and X. Yuan, "Bandwidth optimal all-reduce algorithms for clusters of workstations," Journal of Parallel and Distributed Computing, vol. 69, no. 2, 2009, https://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf
P. Sanders, J. Speck, and J. L. Träff, "Two-tree algorithms for full bandwidth broadcast, reduction and scan," Parallel Computing, 2009.
A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," arXiv:1802.05799, 2018, https://arxiv.org/abs/1802.05799
Y. Zhao et al., "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel," arXiv:2304.11277, 2023, https://arxiv.org/abs/2304.11277
NVIDIA, "Using NVIDIA SHARP with NVIDIA NCCL," https://docs.nvidia.com/networking/display/sharpv3103/Using+NVIDIA+SHARP+with+NVIDIA+NCCL
Mellanox/nccl-rdma-sharp-plugins repository, https://github.com/Mellanox/nccl-rdma-sharp-plugins
AMD, "RCCL Documentation," https://rocm.docs.amd.com/projects/rccl/
Z. Hu et al., "Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms," arXiv:2507.04786, 2025, https://arxiv.org/abs/2507.04786
PyTorch, "FullyShardedDataParallel," https://docs.pytorch.org/docs/stable/fsdp.html
