NVLink
Last reviewed
Apr 30, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,388 words
NVLink is a proprietary high-bandwidth, low-latency, cache-coherent point-to-point interconnect developed by Nvidia for connecting GPUs to other GPUs and, in some configurations, to host CPUs. It was introduced in 2014 and first shipped in 2016 with the Tesla P100 GPU based on the Pascal architecture. NVLink was created to overcome the bandwidth and latency limitations of PCI Express in multi-GPU systems, and over five generations it has become the dominant scale-up fabric inside modern AI servers, supercomputers, and rack-scale systems such as the DGX GB200 NVL72.
NVLink is the physical interconnect. The related NVSwitch chip is the crossbar that connects many NVLink endpoints together inside a node or a rack, and NVLink-C2C ("chip-to-chip") is a die-to-die variant used to bond Nvidia's Grace CPU to Hopper or Blackwell GPUs into a single coherent superchip. Collectives from the Nvidia Collective Communications Library (NCCL), such as all-reduce, all-gather, and reduce-scatter, run directly on top of NVLink whenever the GPUs in a distributed training job share an NVLink fabric, which is why nearly every published frontier LLM training run depends on it.
By the early 2010s, the dominant host interconnect for accelerators was PCI Express. PCIe Gen3 x16, the standard interface for GPUs through the Maxwell generation, delivered roughly 16 GB/s per direction (about 32 GB/s aggregate). For a single accelerator that was acceptable, but for the kinds of multi-GPU servers that deep learning was beginning to demand, PCIe became the bottleneck. Gradient exchange, parameter synchronization, and activation transfer between GPUs all traveled across the same PCIe root complex and, often, across the host CPU. As models grew, the time spent shuffling tensors across PCIe began to dominate the time spent on compute.
Nvidia announced NVLink in March 2014 as an answer to this problem. The design goals were straightforward: much higher bandwidth than PCIe, lower latency, support for direct GPU-to-GPU memory access, and a cache-coherent memory model so that multiple GPUs could share data without the software overhead of explicit copies. The first commercial implementation arrived two years later inside the Tesla P100 and the original DGX-1 system.
Five generations of NVLink have shipped to date, each tied to a specific Nvidia GPU computing microarchitecture. The table below summarizes the per-link signaling rate, the data rate per direction per link, the number of links exposed by the flagship data-center GPU of that generation, and the resulting aggregate bidirectional bandwidth per GPU.
| Generation | Year | Microarchitecture | Flagship GPU | Per-link signaling | Data rate per direction per link | Links per GPU | Aggregate bandwidth per GPU |
|---|---|---|---|---|---|---|---|
| NVLink 1.0 | 2016 | Pascal | Tesla P100 | 20 GT/s | 20 GB/s | 4 | 160 GB/s |
| NVLink 2.0 | 2017 | Volta | Tesla V100 | 25 GT/s | 25 GB/s | 6 | 300 GB/s |
| NVLink 3.0 | 2020 | Ampere | A100 | 50 GT/s | 25 GB/s | 12 | 600 GB/s |
| NVLink 4.0 | 2022 | Hopper | H100 | 50 GT/s | 25 GB/s | 18 | 900 GB/s |
| NVLink 5.0 | 2024 | Blackwell | B100 / B200 / GB200 | 100 GT/s | 50 GB/s | 18 | 1,800 GB/s |
A few conventions are worth flagging because the marketing numbers can be confusing. Nvidia generally quotes "aggregate bandwidth" as the bidirectional sum across all links, so a 900 GB/s H100 is really 450 GB/s in each direction. Per-link bandwidth is also typically quoted as the bidirectional figure: a single NVLink 4.0 link carries 25 GB/s in each direction, often written as 50 GB/s bidirectional. Generations 3.0 and 4.0 are listed with the same 50 GT/s per-pair symbol rate and the same 25 GB/s per-direction-per-link figure. The improvement from A100 to H100 came from raising the link count from 12 to 18 and from cutting the number of differential pairs per link in half. NVLink 4.0 holds per-link bandwidth constant with fewer pairs by using PAM4 signaling, which carries two bits per symbol, and the smaller pair count let Nvidia pack more links into roughly the same die area.
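The conventions above reduce to one line of arithmetic: aggregate bandwidth is the per-direction rate of one link, times the link count, times two for both directions. A minimal Python sketch that uses only figures from the generation table (the `GENERATIONS` dictionary and the function name are illustrative, not part of any Nvidia tool):

```python
# Per-direction data rate per link (GB/s) and links per flagship GPU,
# copied from the generation table above.
GENERATIONS = {
    "NVLink 1.0": (20, 4),
    "NVLink 2.0": (25, 6),
    "NVLink 3.0": (25, 12),
    "NVLink 4.0": (25, 18),
    "NVLink 5.0": (50, 18),
}


def aggregate_gb_per_s(per_direction_per_link: int, links: int) -> int:
    """Bidirectional sum across all links, the figure Nvidia usually quotes."""
    return 2 * per_direction_per_link * links


for name, (rate, links) in GENERATIONS.items():
    total = aggregate_gb_per_s(rate, links)
    print(f"{name}: {total // 2} GB/s per direction, {total} GB/s aggregate")
```

Running it reproduces the last column of the table, which is a quick way to check which convention a quoted number follows.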
NVLink 1.0 debuted on the GP100 die inside the Tesla P100. Each link carried 20 GB/s in each direction, and each P100 exposed four links, giving 80 GB/s per direction or 160 GB/s aggregate per GPU. That was already roughly five times the bandwidth of PCIe Gen3 x16 and made it possible to build the first DGX-1 with eight P100s wired together in a hybrid cube-mesh topology. NVLink 1.0 was also the version IBM integrated into POWER8+ CPUs, which is how the Summit supercomputer's predecessors first put NVLink directly between a CPU and a GPU instead of going through PCIe.
Volta's V100 stepped the per-link signaling from 20 GT/s to 25 GT/s and increased the link count from four to six. The result was 25 GB/s per direction per link and 300 GB/s aggregate per GPU. NVLink 2.0 also introduced cache coherence with IBM POWER9, which is the configuration Oak Ridge's Summit and Lawrence Livermore's Sierra supercomputers used. Each Summit node had two POWER9 CPUs and six V100 GPUs (Sierra nodes had four V100s), all connected with NVLink 2.0, giving the CPU and GPU a shared coherent address space.
Ampere's A100 doubled the per-pair signaling rate to 50 Gbit/s while halving the number of pairs per link, then doubled the link count from six to twelve. Each NVLink 3.0 link still delivered 25 GB/s per direction, but with twelve of them per A100 the aggregate bandwidth reached 600 GB/s. The A100 generation also introduced second-generation NVSwitch, which Nvidia used inside the DGX A100 to connect eight A100s in an all-to-all topology where any GPU can reach any other GPU at the full 600 GB/s.
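The trade between pair count and signaling rate can be checked with simple arithmetic. The sketch below assumes eight differential pairs per direction per link for NVLink 2.0 and four for NVLink 3.0; those lane counts are not stated in the text above and are an assumption here, and encoding overhead is ignored.

```python
def link_gb_per_s(pairs_per_direction: int, gbit_per_pair: float) -> float:
    """Per-direction bandwidth of one link in GB/s (8 bits per byte)."""
    return pairs_per_direction * gbit_per_pair / 8


# Assumed lane counts: 8 pairs per direction (NVLink 2.0), 4 (NVLink 3.0).
v100_link = link_gb_per_s(8, 25)  # V100: more pairs, slower signaling
a100_link = link_gb_per_s(4, 50)  # A100: half the pairs, double the rate

# Both links land on the same per-direction figure.
print(v100_link, a100_link)

# Aggregate per GPU: 2 directions x link count x per-direction rate.
print(2 * 6 * v100_link, 2 * 12 * a100_link)
```

Under these assumptions both generations come out at 25 GB/s per direction per link, so the doubling of aggregate bandwidth comes entirely from the doubled link count.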
Hopper's H100 kept the same 50 GT/s per-pair signaling but pushed the link count from twelve to eighteen. Each H100 SXM5 module exposes 18 NVLink 4.0 links for 900 GB/s of aggregate bandwidth, which Nvidia consistently advertises as roughly seven times the bandwidth of PCIe Gen5 x16 (128 GB/s). The H100 NVL variant, which is a PCIe-form-factor product designed for inference of large language models, exposes 600 GB/s of NVLink instead of 900 GB/s. Hopper paired with third-generation NVSwitch, whose total switching capacity rose to 13.6 Tbit/s from 7.2 Tbit/s in the prior generation, and added in-network reduction (the SHARP accelerator) so that all-reduce can finish inside the switch fabric rather than touching every endpoint.
Blackwell's B100 and B200 GPUs, and the GB200 superchip that pairs two Blackwell GPUs with a Grace CPU, use fifth-generation NVLink. The signaling rate doubles to 100 GT/s, and the per-direction bandwidth per link doubles correspondingly to 50 GB/s. With 18 links per GPU, aggregate bandwidth reaches 1.8 TB/s, twice that of an H100 and roughly fourteen times PCIe Gen5 x16. NVLink 5.0 is the foundation of the GB200 NVL72 rack, which is the largest single NVLink domain Nvidia has ever shipped.
NVLink by itself is a point-to-point link. To connect more than a handful of GPUs in a fully connected, high-bandwidth fashion you need a switch, and that switch is NVSwitch. NVSwitch first appeared in 2018 inside the DGX-2, where 16 V100 GPUs were tied together by twelve NVSwitch chips, giving every GPU the full 300 GB/s of NVLink bandwidth to any other GPU in the system. That was the first time a 16-GPU server behaved, from the application's perspective, as a single shared-memory accelerator.
NVSwitch generations track NVLink generations but are numbered separately. First-generation NVSwitch shipped with V100 in DGX-2. Second-generation NVSwitch shipped with A100 in DGX A100, and used six switch chips per system rather than twelve. Third-generation NVSwitch shipped with H100 in DGX H100 and HGX H100 baseboards, with 13.6 Tbit/s per chip and the SHARP in-network reduction engine. Fourth-generation NVSwitch shipped with Blackwell in 2024, with each switch chip exposing 72 NVLink 5.0 ports.
The most important thing NVSwitch enables is non-blocking all-to-all bandwidth. Without a switch, an N-GPU system has to share its NVLink budget across every pair, which limits both the topology and the bisection bandwidth. With NVSwitch, every GPU can talk to every other GPU at full link bandwidth simultaneously, which is exactly what NCCL collectives like all-reduce and all-gather want.
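A simplified model makes the difference concrete. The sketch below assumes a switchless full mesh in which each GPU dedicates whole links evenly to every peer, and a non-blocking switch through which any single pair can use all of a GPU's links. Real switchless topologies such as the hybrid cube mesh distribute links unevenly, so treat this as an illustration and not a description of any shipping system. Function names are illustrative.

```python
def direct_pair_gb_per_s(gpus: int, links: int, link_rate: float) -> float:
    """Per-direction bandwidth between one pair in a switchless full mesh.

    Each GPU must dedicate whole links to each of its (gpus - 1) peers.
    Returns 0.0 when there are too few links to reach every peer directly.
    """
    links_per_peer = links // (gpus - 1)
    return links_per_peer * link_rate


def switched_pair_gb_per_s(links: int, link_rate: float) -> float:
    """Per-direction bandwidth between one pair through a non-blocking switch."""
    return links * link_rate


# Eight GPUs with twelve 25 GB/s links each (A100-class figures).
print(direct_pair_gb_per_s(8, 12, 25.0))   # one link per peer
print(switched_pair_gb_per_s(12, 25.0))    # all twelve links to any peer

# Eight GPUs with only four links each cannot form a full mesh at all.
print(direct_pair_gb_per_s(8, 4, 20.0))
```

In this model the switch raises pairwise bandwidth from one link's worth to the GPU's whole link budget, which is the property that bandwidth-bound collectives benefit from.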
Nvidia's DGX line has been the reference design for NVLink-based servers from the beginning, and the DGX generations roughly map onto the NVLink generations.
| System | Year | GPUs | NVLink generation | Topology | Notable detail |
|---|---|---|---|---|---|
| DGX-1 (P100) | 2016 | 8 x P100 | 1.0 | Hybrid cube mesh | First commercial NVLink server |
| DGX-1 (V100) | 2017 | 8 x V100 | 2.0 | Hybrid cube mesh | First V100 system, no NVSwitch |
| DGX-2 | 2018 | 16 x V100 | 2.0 | Full NVSwitch fabric | First NVSwitch system, 2.4 TB/s bisection |
| DGX A100 | 2020 | 8 x A100 | 3.0 | Full NVSwitch fabric | Six 2nd-gen NVSwitch chips |
| DGX H100 | 2022 | 8 x H100 | 4.0 | Full NVSwitch fabric | 3.6 TB/s bisection, SHARP reductions |
| DGX GB200 NVL72 | 2024 | 72 x B200 | 5.0 | Rack-scale NVSwitch | 130 TB/s NVLink domain bandwidth |
The DGX-1 with eight P100s used a hybrid cube mesh because the GPUs had only four NVLinks each, so a fully connected topology was not possible without a switch. The DGX-2 was the first system to behave as a single 16-GPU device, and it produced 2.4 TB/s of bisection bandwidth and 75 GB/s of all-reduce bandwidth. The DGX A100 dropped back to eight GPUs but used the much larger A100 NVLink budget to give every GPU the full 600 GB/s to any peer. The DGX H100 raised bisection bandwidth to 3.6 TB/s and all-reduce bandwidth to 450 GB/s.
The big jump arrived with the GB200 NVL72. Instead of stopping at eight GPUs in a single chassis, Nvidia used fifth-generation NVLink and fourth-generation NVSwitch to extend the NVLink domain across an entire liquid-cooled rack. Seventy-two Blackwell GPUs and 36 Grace CPUs sit in eighteen compute trays, with nine NVLink Switch trays providing the all-to-all fabric. The result is a single NVLink domain with 130 TB/s of aggregate GPU communication bandwidth, which Nvidia treats as the basic unit of an "AI factory" rack. Before NVL72, the typical NVLink domain was eight GPUs (DGX A100 and DGX H100), with the 16-GPU DGX-2 as the main earlier exception. Going from eight to seventy-two in a single coherent fabric is the largest single-generation jump in scale-up topology Nvidia has ever made.
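The system-level figures quoted above can be reproduced from per-GPU numbers. The sketch below is one reading that happens to match them, not a documented Nvidia formula: bisection bandwidth is taken as half the GPUs each running at their aggregate NVLink rate, and the NVL72 domain figure as the sum over all 72 GPUs. Function names are illustrative.

```python
def bisection_tb_per_s(gpus: int, per_gpu_aggregate_gb: int) -> float:
    """Half the GPUs exchanging data with the other half at full rate."""
    return (gpus // 2) * per_gpu_aggregate_gb / 1000


def domain_total_tb_per_s(gpus: int, per_gpu_aggregate_gb: int) -> float:
    """Sum of every GPU's aggregate NVLink bandwidth."""
    return gpus * per_gpu_aggregate_gb / 1000


print(bisection_tb_per_s(16, 300))      # DGX-2: 16 x V100 at 300 GB/s
print(bisection_tb_per_s(8, 900))       # DGX H100: 8 x H100 at 900 GB/s
print(domain_total_tb_per_s(72, 1800))  # GB200 NVL72: 72 GPUs at 1.8 TB/s
```

The three results are 2.4, 3.6, and 129.6 TB/s. The last rounds to the 130 TB/s quoted for the rack, and note that it is a sum over all GPUs, not a bisection figure.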
NVLink-C2C is a die-to-die variant of NVLink built specifically to bond a Grace CPU to a Hopper or Blackwell GPU on the same package or board. It first appeared in the Grace Hopper Superchip (GH200) and now anchors the GB200 superchip as well. NVLink-C2C delivers 900 GB/s of bidirectional bandwidth between the CPU and GPU, which Nvidia describes as roughly seven times the bandwidth of x16 PCIe Gen5. It is also dramatically more energy-efficient: 1.3 picojoules per bit transferred, which Nvidia cites as more than five times better than PCIe Gen5.
The more important property is coherence. NVLink-C2C is memory-coherent, meaning the GPU and the CPU share a single, hardware-managed address space. CPU threads and GPU threads can both access either CPU-resident or GPU-resident memory transparently, and atomic operations cross the boundary natively. For model parallelism and large-model inference this matters because the GPU can spill embedding tables, KV caches, and other oversized data structures into the CPU's much larger LPDDR5X memory pool without paying PCIe transfer costs every time it needs them.
In 2025 Nvidia announced NVLink Fusion, which licenses NVLink-C2C technology to third parties so they can integrate NVLink directly into their own silicon. Initial partners include Arm, SiFive, MediaTek, and Marvell, and the goal is to let custom CPUs and accelerators participate in NVLink-based clusters alongside Nvidia GPUs.
NVLink is more than a fast pipe. From the second generation onward, links can carry coherent memory traffic, which means that two GPUs (or a CPU and a GPU) connected by NVLink behave like nodes in a coherent NUMA system rather than two separate devices that happen to be wired together. A GPU can issue a load to a memory address that physically lives in another GPU's HBM, and the cache fabric will resolve it without explicit DMA. CUDA programs see this as Unified Virtual Addressing and, with newer GPUs, as full Unified Memory across the NVLink domain.
This matters for two reasons. First, it lets the CUDA runtime move data lazily, only fetching cache lines when they are actually touched, which is much more efficient than pre-staging entire tensors. Second, it lets compilers and frameworks treat the NVLink domain as one big memory pool. Tensor parallelism, in particular, depends on this: when a single matrix multiply is sharded across eight GPUs, each GPU has to read partial results from every other GPU during the all-reduce that follows, and the cost of that all-reduce is dominated by NVLink bandwidth.
PCIe is still the standard host interconnect, and every Nvidia data-center GPU also exposes a PCIe interface for talking to the host CPU and to NICs. But for GPU-to-GPU communication inside a server, NVLink wins by an enormous margin.
| Interconnect | Per-direction bandwidth | Aggregate bandwidth | Notes |
|---|---|---|---|
| PCIe Gen3 x16 | ~16 GB/s | ~32 GB/s | P100-era host link |
| PCIe Gen4 x16 | ~32 GB/s | ~64 GB/s | A100-era host link |
| PCIe Gen5 x16 | ~64 GB/s | ~128 GB/s | H100 / Blackwell host link |
| NVLink 4.0 (H100) | 450 GB/s | 900 GB/s | About 7x PCIe Gen5 x16 |
| NVLink 5.0 (B200) | 900 GB/s | 1,800 GB/s | About 14x PCIe Gen5 x16 |
| NVLink-C2C (GH200) | 450 GB/s | 900 GB/s | CPU-to-GPU, coherent |
The gap matters most for collectives. An all-reduce of a 100 GB gradient buffer over PCIe Gen5 x16 takes more than a second of pure transfer time per GPU pair: 100 GB at about 64 GB/s per direction is roughly 1.6 seconds. Over NVLink 5.0, at 900 GB/s per direction, the same transfer takes around 110 milliseconds. When that collective happens after every microbatch in a training step, the difference between PCIe and NVLink can be the difference between a model that trains in three months and a model that never finishes.
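Pure transfer time is just payload size divided by bandwidth. The sketch below evaluates it at both the per-direction and the aggregate rate from the comparison table, because the answer differs by a factor of two depending on which convention is used. It is a lower bound that ignores latency, protocol overhead, and the collective algorithm; the names are illustrative.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on transfer time: payload size divided by link bandwidth."""
    return size_gb / bandwidth_gb_per_s


BUFFER_GB = 100  # the gradient buffer size used in the text

# (per-direction GB/s, aggregate GB/s), taken from the comparison table.
LINKS = {
    "PCIe Gen5 x16": (64, 128),
    "NVLink 5.0 (B200)": (900, 1800),
}

for name, (per_direction, aggregate) in LINKS.items():
    one_way = transfer_seconds(BUFFER_GB, per_direction)
    both_ways = transfer_seconds(BUFFER_GB, aggregate)
    print(f"{name}: {one_way * 1000:.0f} ms per direction, "
          f"{both_ways * 1000:.0f} ms at the aggregate rate")
```

Whichever convention is chosen, the ratio between the two interconnects stays at about 14x, matching the table.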
The headline reason NVLink exists today is large-model training. Modern frontier language models are too big to fit on a single GPU, so training (and increasingly inference) is sharded across many GPUs using a mix of data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism. The communication patterns these strategies generate are very different in their bandwidth requirements.
Tensor parallelism is the most NVLink-hungry. It splits a single matrix multiply across N GPUs and requires an all-reduce after every transformer layer's attention and MLP blocks. The size of the activations being reduced is large (gigabytes per layer for big models) and the operation has to complete before the next layer can start. In practice this means tensor-parallel groups must live inside a single NVLink domain. For H100-based clusters that means tensor parallelism is bounded at eight (one DGX H100 node); for GB200 NVL72 it can go up to seventy-two without leaving the NVLink fabric. That is the single biggest reason hyperscalers care about NVL72: it lets them use much larger tensor-parallel degrees, which in turn lets them serve trillion-parameter models with low latency.
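To see why a sharded matrix multiply needs an all-reduce, here is a small pure-Python simulation. It assumes one common sharding scheme (splitting the inner dimension, often called row-parallel); the text does not say which scheme a given framework uses, and the function names are hypothetical. The element-wise sum at the end stands in for the all-reduce that real GPUs would run over NVLink.

```python
def matmul(a, b):
    """Plain list-of-lists matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def row_parallel_matmul(x, w, shards):
    """Split the inner dimension across `shards` simulated GPUs.

    Each shard multiplies its slice of x's columns by the matching slice of
    w's rows. Every shard's output has the full result shape but holds only
    a partial sum, so the partials must be added together (an all-reduce).
    """
    inner = len(w)
    assert inner % shards == 0, "inner dimension must divide evenly"
    step = inner // shards
    partials = []
    for s in range(shards):
        lo, hi = s * step, (s + 1) * step
        x_slice = [row[lo:hi] for row in x]
        w_slice = w[lo:hi]
        partials.append(matmul(x_slice, w_slice))
    # Simulated all-reduce: sum the partial outputs element by element.
    rows, cols = len(x), len(w[0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [2, 1], [1, 2]]

# The sharded result matches the unsharded one for any valid shard count.
assert row_parallel_matmul(x, w, shards=2) == matmul(x, w)
assert row_parallel_matmul(x, w, shards=4) == matmul(x, w)
print(row_parallel_matmul(x, w, shards=2))
```

Because every shard produces a full-sized partial output, the data exchanged grows with the activation size and not with the shard size, which is why the reduction is so sensitive to interconnect bandwidth.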
Pipeline parallelism is gentler on the network because the only thing that crosses GPUs is the activation tensor between pipeline stages, but it still benefits from NVLink because pipeline bubbles get smaller as inter-stage latency drops. Data parallelism uses all-reduce on gradients once per step, and FSDP uses all-gather on parameters before each forward pass; both are friendlier to InfiniBand-class inter-node networks but still benefit from NVLink for the intra-node portion of the collective.
NCCL, Nvidia's collective communication library, is what actually drives the wires. NCCL automatically discovers the topology, picks the best available transport (NVLink, NVSwitch, PCIe, InfiniBand, or RoCE), and chooses an algorithm (ring, tree, or SHARP) for each collective. When NCCL detects an NVLink fabric it prefers it over everything else, and on NVL72 it will try to route as much of the all-reduce as possible through the in-switch SHARP engine.
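The ring algorithm mentioned above can be simulated in a few lines. This is a pure-Python sketch of the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase), not NCCL code, and it says nothing about how NCCL schedules transfers internally. The function name and return shape are illustrative.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over equal-length per-rank lists.

    Returns (reduced_buffers, elements_sent_per_rank).
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must be divisible by rank count"
    chunk = size // n
    data = [list(b) for b in buffers]
    sent_per_rank = 0

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n - 1 steps, rank r holds the complete
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks so that all ranks send simultaneously.
        out = [[data[r][i] for i in span((r - step) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r - step) % n)):
                data[dst][i] += out[r][k]
        sent_per_rank += chunk

    # Phase 2, all-gather: each rank forwards the finished chunk it holds
    # to its neighbor, which overwrites its own copy.
    for step in range(n - 1):
        out = [[data[r][i] for i in span((r + 1 - step) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r + 1 - step) % n)):
                data[dst][i] = out[r][k]
        sent_per_rank += chunk

    return data, sent_per_rank


ranks = [[r * 10 + i for i in range(8)] for r in range(4)]
reduced, sent = ring_all_reduce(ranks)
expected = [sum(col) for col in zip(*ranks)]
assert all(buf == expected for buf in reduced)
print(sent, "elements sent per rank for a buffer of", len(expected))
```

Each rank sends 2(N-1)/N times the buffer size regardless of how many ranks participate, so the time for a large all-reduce is set almost entirely by per-GPU link bandwidth.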
The effect on training is large enough to be visible in published numbers. The Megatron-LM scaling paper from Nvidia, Stanford, and Microsoft reports scaling GPT-style models from roughly 1.7 billion to 1 trillion parameters, and from 32 to 3,072 A100 GPUs, while sustaining roughly half of peak FLOPS (52 percent at the largest scale), which is only possible because NVLink keeps the all-reduce time small relative to compute. Frontier-class runs (GPT-4, Claude, Gemini, Llama 3 405B, and beyond) have not published their parallelism configurations in detail, but public descriptions of how such models are trained consistently treat NVLink-connected nodes as the unit of tensor parallelism.
NVLink is proprietary, and that has been both its greatest strength and the main complaint against it. Because Nvidia controls the spec, it can iterate quickly: each generation has roughly doubled aggregate bandwidth, and the addition of NVSwitch and SHARP has happened on Nvidia's own schedule rather than a standards-body schedule. The cost is that there is no second source. If you want NVLink, you have to buy Nvidia GPUs, Nvidia switches, and (increasingly) Nvidia networking. The H100 PCIe NVL variant, which exposes only 600 GB/s instead of 900 GB/s, is a reminder that even Nvidia's own product segmentation can hide NVLink behind extra cost.
Competitors have responded with their own scale-up fabrics. AMD's Infinity Fabric connects its Instinct MI300X and MI325X GPUs at up to 896 GB/s aggregate per GPU on Infinity Fabric 4. Intel's Xe Link does the same job for the Ponte Vecchio GPUs, while its Gaudi accelerators scale up over integrated Ethernet ports instead, and Intel has shifted resources around in this area more than once. None of these is interchangeable with NVLink at the silicon level, and none of them currently has anything comparable to the NVL72 rack-scale topology.
The more interesting development is UALink (Ultra Accelerator Link), an open standard announced in May 2024 by AMD, Intel, Broadcom, Cisco, Google, HPE, Meta, and Microsoft. UALink 1.0, published in April 2025, defines an open scale-up fabric that can connect up to 1,024 accelerators with direct load, store, and atomic semantics, and it borrows the AMD Infinity Fabric protocol for the upper layers. The explicit goal is to be an open alternative to NVLink and NVSwitch. Whether UALink ships in volume, and whether it can match Nvidia's bandwidth and software ecosystem, is one of the open questions in AI infrastructure for the next several years.
One practical limitation is worth noting: Nvidia removed NVLink from its consumer Ada Lovelace GeForce cards in 2022, and Blackwell consumer cards (RTX 50 series) likewise do not expose NVLink. NVLink is now effectively a data-center technology. Hobbyists who want to pool VRAM across multiple GPUs at home cannot do it the way they could with the GeForce RTX 3090, earlier Titan cards, and NVLink-equipped Quadro cards.