NVSwitch
Last reviewed
Jun 3, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 1,963 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 1,963 words
Add missing citations, update stale details, or suggest a clearer explanation.
NVSwitch is a family of switch chips (ASICs) designed by Nvidia to interconnect graphics processing units over NVLink, the company's proprietary high-bandwidth GPU-to-GPU link. Where a single NVLink connects two devices point to point, an NVSwitch acts as a non-blocking crossbar that lets every GPU attached to it talk to every other GPU at full link speed, creating an "all-to-all" fabric inside a server and, in later generations, across an entire rack. NVSwitch is the component that turns a tray of discrete accelerators into what Nvidia markets as a single, giant GPU with a unified shared memory space. It first shipped in 2018 in the DGX-2 supercomputer and has since become a defining feature of Nvidia's data-center platforms for large-scale AI training and inference, including the Hopper HGX H100 baseboard and the rack-scale GB200 NVL72 built on Blackwell.[1][2]
By the mid-2010s, training deep neural networks increasingly relied on splitting work across multiple GPUs, and the bottleneck shifted from raw compute to the bandwidth available between accelerators. Nvidia introduced NVLink in 2016 with the Pascal P100 to provide an interconnect roughly an order of magnitude faster than PCI Express, initially wiring eight GPUs together in a point-to-point "hybrid cube mesh" topology.[1][3] That arrangement worked well for traffic between directly linked neighbors but degraded for genuine all-to-all communication: GPU pairs that were not directly connected had to relay data through intermediate hops or fall back to the much slower PCIe path. It also capped a coherent NVLink island at eight GPUs.[1]
NVSwitch was Nvidia's answer. Rather than hard-wiring GPUs to one another, each GPU connects its NVLinks into a bank of switch chips, and the switches route packets between any pair of ports. This removes the topology penalty (every GPU is one or two switch traversals from every other GPU), scales past eight GPUs in a node, and presents software with a simpler single-node programming model that hides the underlying wiring.[1][3]
NVSwitch has advanced in step with the NVLink generation it carries, and Nvidia's product materials count the on-node switch as first generation (Volta), second generation (Ampere), third generation (Hopper), and fourth generation (Blackwell).[2][4] The table below summarizes the per-chip characteristics that Nvidia and its disclosures at Hot Chips have published; second-generation transistor and die figures were never published in detail and are omitted.
| Generation | Year | GPU / platform | NVLink gen | Ports per chip | Per-chip bandwidth | Process | Notable additions |
|---|---|---|---|---|---|---|---|
| 1st | 2018 | V100 / DGX-2, HGX-2 | NVLink 2 | 18 | 900 GB/s | TSMC 12 nm FFN | 18x18 crossbar, Fabric Manager |
| 2nd | 2020 | A100 / DGX A100 | NVLink 3 | 36 | 1,800 GB/s | TSMC 7 nm | doubled links per GPU |
| 3rd | 2022 | H100 / HGX H100 | NVLink 4 | 64 | 3,200 GB/s | TSMC 4N | NVLink SHARP, NVLink Network |
| 4th | 2024 | Blackwell / GB200 NVL72 | NVLink 5 | 72 | 7,200 GB/s | TSMC 4NP | rack-scale switch trays, SHARP FP8 |
All quoted bandwidths are full-duplex totals (the sum of both directions), following Nvidia's convention.[3]
The original NVSwitch is an NVLink-2 switch chip with 18 ports arranged as an 18-by-18 fully connected internal crossbar. Each port runs at 50 GB/s (25 GB/s in each direction), for 900 GB/s of aggregate switch bandwidth, and the crossbar is non-blocking so any port can reach any other at full link rate. The die holds about 2 billion transistors on TSMC's 12 nm FinFET FFN process and draws roughly 100 W.[4][5]
In the DGX-2 and the equivalent HGX-2 baseboard, six NVSwitch chips sit on each eight-GPU board. Every V100 GPU connects one of its six NVLinks to each of the six switches, so any two GPUs on the same baseboard communicate at the full 300 GB/s with a single switch traversal. Two baseboards are joined through eight ports on each switch to build the full 16-GPU system, giving a board-to-board bisection bandwidth of 2.4 TB/s. The DGX-2 used 16 of the 18 available ports per switch, leaving the remaining two reserved.[4] Nvidia described the result as 16 GPUs behaving as one accelerator with a half-terabyte of unified memory and roughly two petaFLOPS of deep-learning throughput.[4] Reliability features included cyclic redundancy checking (CRC) on the NVLinks with replay on error, error-correcting codes (ECC) on internal datapaths and routing structures, and a software Fabric Manager that programs the routing tables and confines each application to its assigned ranges.[4]
The second-generation NVSwitch accompanied the A100 and the DGX A100. It paired with third-generation NVLink, which kept the 25 GB/s per-direction signaling but increased the number of links per GPU to twelve, raising per-GPU bandwidth to 600 GB/s. The switch itself grew to 36 NVLink ports. A DGX A100 uses six of these switches; each A100 drives two NVLinks to every switch, preserving full any-to-any connectivity among the system's eight GPUs while doubling both communication and in-network reduction bandwidth relative to the V100 generation.[2][6]
The third-generation NVSwitch, built for the Hopper H100, was a substantial redesign. Disclosed at Hot Chips 2022, it is fabricated on TSMC's 4N process with 25.1 billion transistors on a 294 mm^2 die in a 50 mm by 50 mm package with 2,645 solder balls. It carries 64 NVLink-4 ports (two physical lanes per NVLink) and delivers 3.2 TB/s of full-duplex bandwidth, equivalent to 25.6 terabits per second, using 50 Gbaud PAM4 signaling that reaches 100 Gbps per differential pair.[3][7] Four of these switches sit on an eight-GPU HGX H100 or DGX H100 baseboard, providing 3.6 TB/s of bisection bandwidth.[2][3]
This generation introduced two architecturally important capabilities. The first is in-network compute through SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), which Nvidia brands as NVLink SHARP. Embedded arithmetic-logic units inside the switch let the NVSwitch perform collective operations such as all-reduce on behalf of the GPUs, rather than shuttling every partial result back and forth between GPUs. A SHARP-enabled all-reduce reads the partial values once, sums them inside the switch, and multicasts the single reduced result back to all participants. This roughly halves the traffic each GPU interface must move during communication-intensive phases of training, approximately doubling effective NVLink bandwidth for those collectives.[3] Each Hopper NVSwitch provides up to 400 GFLOPS of FP32 SHARP throughput, its ALUs support logical, min/max and add operators across signed and unsigned integers as well as FP16, FP32, FP64 and BF16, and a SHARP controller can manage up to 128 SHARP groups in parallel.[3] These collectives are exposed to applications through libraries such as NCCL and NVSHMEM.[3]
The second addition is the NVLink Switch System, sometimes called the NVLink Network, which moves NVLink beyond a single chassis. The Hopper NVSwitch ports are PHY-compatible with 400G Ethernet and InfiniBand electricals and support OSFP cages carrying four NVLinks each, with extra forward-error-correction modes for optical cabling. Using external switches, up to 256 H100 GPUs across 32 nodes can be joined into one NVLink domain delivering 57.6 TB/s of all-to-all bandwidth.[3] Unlike on-node NVLink, the NVLink Network treats each endpoint as an independent address space connected at runtime through a software API call, adds a link-level translation buffer (TLB) so a destination GPU validates and maps incoming requests, and includes a security processor and partitioning so subsets of ports can be isolated into separate networks.[3]
The fourth-generation NVSwitch, paired with fifth-generation NVLink on the Blackwell architecture, is fabricated on TSMC's 4NP process and carries 72 NVLink-5 ports, providing 7.2 TB/s of full-duplex bandwidth per chip.[8][9] Fifth-generation NVLink doubles per-GPU bandwidth to 1.8 TB/s, built from 18 links of 100 GB/s each, roughly fourteen times the bandwidth of PCIe Gen5.[8][10] NVLink SHARP advances to support FP8 and is credited with up to a fourfold improvement in collective bandwidth efficiency.[8]
The most visible change is physical: the switch silicon leaves the GPU baseboard and moves into dedicated NVLink Switch trays. Each tray packages two NVSwitch ASICs and exposes 144 NVLink ports at 100 GB/s apiece, for 14.4 TB/s of non-blocking switching per tray.[9][10] This rack-level design lets a single NVLink domain span up to 576 fully connected GPUs.[8]
The flagship deployment of the fourth-generation switch is the GB200 NVL72, a liquid-cooled rack that links 36 Grace CPUs and 72 Blackwell GPUs into one NVLink domain so that all 72 GPUs behave as a single accelerator. The rack holds 18 compute trays (each with two Grace CPUs and four Blackwell GPUs) and nine NVLink Switch trays. With two NVSwitch ASICs per switch tray, eighteen fourth-generation NVSwitch chips form the spine that delivers 130 TB/s of aggregate all-to-all NVLink bandwidth across the rack.[10][11][12] The system exposes 13.4 TB of unified HBM3e memory and is rated at up to 1.44 exaFLOPS of FP4 tensor compute.[11] Nvidia positions this single-domain, in-rack fabric as the key to serving trillion-parameter models, because a fast 72-GPU NVLink domain keeps the heavy all-to-all and expert-parallel traffic of large mixture-of-experts models on NVLink rather than on slower scale-out networking.[11]
The same NVSwitch fabric underpins Nvidia's scale-up roadmap: the NVLink Fusion program licenses NVLink and the switch ecosystem so partners can connect custom CPUs and accelerators into NVLink rack-scale systems, and successor racks built on the Rubin platform are slated to use a sixth-generation NVLink switch.[10]
NVSwitch occupies a distinct tier of the data-center interconnect hierarchy. It is a scale-up fabric: extremely high bandwidth and low latency, tightly coupled to the GPU memory system, and used to bind a relatively modest number of GPUs (8, 16, 72, up to 576) into a single coherent domain. This complements, rather than replaces, scale-out networking such as InfiniBand or Ethernet, which links thousands of such domains into a full cluster but at lower per-GPU bandwidth. Nvidia's own materials note that a Hopper NVLink Network offers several times the bisection bandwidth of an equivalent InfiniBand fabric for tightly coupled workloads.[3] The scale-out interconnects largely descend from technology Nvidia acquired with Mellanox in 2019, and the Hopper-generation NVSwitch deliberately borrows from that world, adopting InfiniBand-style PHYs, OSFP optics and SHARP collectives.[3]
Two technical themes define NVSwitch across its generations. The first is bandwidth scaling roughly in lockstep with NVLink, from 900 GB/s on the first chip to 7.2 TB/s on the fourth. The second is the steady migration of network functionality into the switch: in-network reductions via SHARP, security and isolation engines, link-level address translation, and finally the move from an on-board component to a rack-scale switch tray. Together these let Nvidia present ever-larger pools of GPUs to software as one machine, which has made the NVSwitch fabric central to how frontier-scale AI models are trained and served on Nvidia hardware.[2][3][8]