NVSwitch

Updated Jun 23, 2026

NVSwitch is a family of switch chips (ASICs) designed by Nvidia that fully connect multiple GPUs over NVLink into a single shared-memory fabric, letting every GPU talk to every other GPU at full link speed. Acting as a non-blocking crossbar, an NVSwitch turns a tray of discrete accelerators into what Nvidia markets as one giant GPU with a unified memory space. In Nvidia's words, the 72-GPU domain of the GB200 NVL72 "acts as a single, massive GPU and delivers 30x faster real-time trillion-parameter large language model (LLM) inference."^[11] NVSwitch first shipped in 2018 in the DGX-2 supercomputer, where twelve switch chips (six per baseboard) wired 16 V100 GPUs into one fabric. It has since become a defining feature of Nvidia's data-center platforms for large-scale AI training and inference, from the Hopper HGX H100 baseboard to the rack-scale GB200 NVL72 built on Blackwell.^[1]^[2]

What problem does NVSwitch solve?

By the mid-2010s, training deep neural networks increasingly relied on splitting work across multiple GPUs, and the bottleneck shifted from raw compute to the bandwidth available between accelerators. Nvidia introduced NVLink in 2016 with the Pascal P100 to provide an interconnect several times faster than PCI Express, initially wiring eight GPUs together in a point-to-point "hybrid cube mesh" topology.^[1]^[3] That arrangement worked well for traffic between directly linked neighbors but degraded for genuine all-to-all communication: GPU pairs that were not directly connected had to relay data through intermediate hops or fall back to the much slower PCIe path. It also capped a coherent NVLink island at eight GPUs.^[1]

NVSwitch was Nvidia's answer. Rather than hard-wiring GPUs to one another, each GPU connects its NVLinks into a bank of switch chips, and the switches route packets between any pair of ports. This removes the topology penalty (every GPU is one or two switch traversals from every other GPU), scales past eight GPUs in a node, and presents software with a simpler single-node programming model that hides the underlying wiring.^[1]^[3] Nvidia describes the result plainly: "With the NVSwitch, every NVIDIA Hopper GPU in a server can communicate at 900 GB/s with any other NVIDIA Hopper GPU simultaneously," and because "the peak rate does not depend on the number of GPUs that are communicating ... the NVSwitch is non-blocking."^[2]

What are the NVSwitch generations?

NVSwitch has advanced in step with the NVLink generation it carries. Nvidia's product materials count the on-node switch as first generation (Volta), second generation (Ampere), third generation (Hopper), and fourth generation (Blackwell).^[2]^[4] The table below summarizes the per-chip characteristics published by Nvidia, including its Hot Chips disclosures. Transistor counts and die sizes appear in the per-generation sections where they were published; detailed second-generation figures were never released.

Generation	Year	GPU / platform	NVLink gen	Ports per chip	Per-chip bandwidth	Process	Notable additions
1st	2018	V100 / DGX-2, HGX-2	NVLink 2	18	900 GB/s	TSMC 12 nm FFN	18x18 crossbar, Fabric Manager
2nd	2020	A100 / DGX A100	NVLink 3	36	1,800 GB/s	TSMC 7 nm	doubled links per GPU
3rd	2022	H100 / HGX H100	NVLink 4	64	3,200 GB/s	TSMC 4N	NVLink SHARP, NVLink Network
4th	2024	Blackwell / GB200 NVL72	NVLink 5	72	7,200 GB/s	TSMC 4NP	rack-scale switch trays, SHARP FP8

All quoted bandwidths are full-duplex totals (the sum of both directions), following Nvidia's convention.^[3]
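
The per-chip figures in the table follow from multiplying the port count by the full-duplex rate of one NVLink port. The short Python sketch below reproduces that arithmetic. The per-port rates are taken from the per-generation sections of this article (50 GB/s for NVLink 2 through 4, 100 GB/s for NVLink 5); the list layout and function name are illustrative only.

```python
# Per-chip NVSwitch bandwidth as ports x full-duplex rate of one NVLink port.
# Port counts come from the table; per-port rates (GB/s, both directions
# summed) come from the per-generation sections of this article.
GENERATIONS = [
    ("1st, NVLink 2", 18, 50),
    ("2nd, NVLink 3", 36, 50),
    ("3rd, NVLink 4", 64, 50),
    ("4th, NVLink 5", 72, 100),
]


def chip_bandwidth_gb_s(ports: int, port_gb_s: int) -> int:
    """Aggregate full-duplex switch bandwidth in GB/s."""
    return ports * port_gb_s


for name, ports, port_gb_s in GENERATIONS:
    total = chip_bandwidth_gb_s(ports, port_gb_s)
    print(f"{name}: {ports} ports x {port_gb_s} GB/s = {total:,} GB/s")
```

Each printed total matches the "Per-chip bandwidth" column, which suggests the growth from the second to the third generation came from port count alone, while the fourth generation also doubled the per-port rate.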

First generation (Volta, 2018)

The original NVSwitch is an NVLink-2 switch chip with 18 ports arranged as an 18-by-18 fully connected internal crossbar. Each port runs at 50 GB/s (25 GB/s in each direction), for 900 GB/s of aggregate switch bandwidth, and the crossbar is non-blocking, so any port can reach any other at full link rate. The die holds about 2 billion transistors on TSMC's 12 nm FFN process and draws roughly 100 W.^[4]^[5]

In the DGX-2 and the equivalent HGX-2 baseboard, six NVSwitch chips sit on each eight-GPU board. Every V100 GPU connects one of its six NVLinks to each of the six switches, so any two GPUs on the same baseboard communicate at the full 300 GB/s with a single switch traversal. Two baseboards are joined through eight ports on each switch to build the full 16-GPU system, giving a board-to-board bisection bandwidth of 2.4 TB/s. The DGX-2 used 16 of the 18 available ports per switch, leaving the remaining two reserved.^[4] Nvidia described the result as 16 GPUs behaving as one accelerator with a half-terabyte of unified memory and roughly two petaFLOPS of deep-learning throughput.^[4] Reliability features included cyclic redundancy checking (CRC) on the NVLinks with replay on error, error-correcting codes (ECC) on internal datapaths and routing structures, and a software Fabric Manager that programs the routing tables and confines each application to its assigned address ranges.^[4]
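
The DGX-2 figures above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in this section; the constant names are illustrative.

```python
# DGX-2 wiring arithmetic, using only figures quoted in the text above.
PORT_GB_S = 50            # one NVLink 2 port, full duplex (25 GB/s each way)
PORTS_PER_SWITCH = 18
SWITCHES_PER_BOARD = 6
GPUS_PER_BOARD = 8
LINKS_PER_GPU = 6         # one link to each of the six switches
INTER_BOARD_PORTS = 8     # ports per switch that cross to the other baseboard

# Bandwidth one GPU can drive into the fabric.
per_gpu_gb_s = LINKS_PER_GPU * PORT_GB_S

# Ports consumed on each switch: one per local GPU plus the inter-board ports.
links_per_gpu_per_switch = LINKS_PER_GPU // SWITCHES_PER_BOARD
ports_used = GPUS_PER_BOARD * links_per_gpu_per_switch + INTER_BOARD_PORTS
ports_spare = PORTS_PER_SWITCH - ports_used

# Board-to-board bisection: every switch contributes its inter-board ports.
bisection_gb_s = SWITCHES_PER_BOARD * INTER_BOARD_PORTS * PORT_GB_S

print(f"per-GPU bandwidth:  {per_gpu_gb_s} GB/s")
print(f"ports used / spare: {ports_used} / {ports_spare}")
print(f"bisection:          {bisection_gb_s / 1000} TB/s")
```

The results (300 GB/s per GPU, 16 ports used with 2 spare, 2.4 TB/s of bisection) agree with the figures in the paragraph.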

Second generation (Ampere, 2020)

The second-generation NVSwitch accompanied the A100 and the DGX A100. It paired with third-generation NVLink, which kept the 25 GB/s per-direction link rate but increased the number of links per GPU to twelve, raising per-GPU bandwidth to 600 GB/s. The switch itself grew to 36 NVLink ports. A DGX A100 uses six of these switches; each A100 drives two NVLinks to every switch, preserving full any-to-any connectivity among the system's eight GPUs while doubling both point-to-point communication and all-reduce bandwidth relative to the V100 generation.^[2]^[6]
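
The any-to-any property described above can be illustrated with a small model of the wiring. This is a hedged sketch, not Nvidia's routing logic: `build_fabric` and `pair_bandwidth_gb_s` are hypothetical helpers, and the model assumes each switch is an ideal non-blocking crossbar and that only one GPU pair is communicating.

```python
from itertools import combinations


def build_fabric(num_gpus: int, num_switches: int, links_per_gpu: int) -> dict:
    """Spread each GPU's links evenly across all switches.

    Returns {gpu: {switch: link_count}}.
    """
    if links_per_gpu % num_switches != 0:
        raise ValueError("links do not divide evenly across switches")
    per_switch = links_per_gpu // num_switches
    return {
        gpu: {sw: per_switch for sw in range(num_switches)}
        for gpu in range(num_gpus)
    }


def pair_bandwidth_gb_s(fabric: dict, a: int, b: int, port_gb_s: int = 50) -> int:
    """Peak full-duplex bandwidth between two GPUs via one switch traversal.

    Assumes each switch is a non-blocking crossbar, so the usable paths are
    the links both GPUs have into the same switch.
    """
    shared = fabric[a].keys() & fabric[b].keys()
    return sum(min(fabric[a][sw], fabric[b][sw]) for sw in shared) * port_gb_s


# DGX A100 as described above: 8 GPUs, 6 switches, 12 links per GPU.
dgx_a100 = build_fabric(num_gpus=8, num_switches=6, links_per_gpu=12)
rates = {pair_bandwidth_gb_s(dgx_a100, a, b) for a, b in combinations(dgx_a100, 2)}
print(rates)  # {600}: every pair sees the same peak rate
```

Calling `build_fabric(8, 6, 6)` models one V100 baseboard from the previous generation, where the same calculation gives 300 GB/s per pair.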

Third generation (Hopper, 2022)

The third-generation NVSwitch, built for the Hopper H100, was a substantial redesign. Disclosed at Hot Chips 2022, it is fabricated on TSMC's 4N process with 25.1 billion transistors on a 294 mm^2 die in a 50 mm by 50 mm package with 2,645 solder balls. It carries 64 NVLink-4 ports (two physical lanes per NVLink) and delivers 3.2 TB/s of full-duplex bandwidth, equivalent to 25.6 terabits per second, using 50 Gbaud PAM4 signaling that reaches 100 Gbps per differential pair.^[3]^[7] Each H100 drives 18 fourth-generation NVLinks for 900 GB/s of total per-GPU bandwidth, about seven times the bandwidth of PCIe Gen5.^[2]^[13] Four of these switches sit on an eight-GPU HGX H100 or DGX H100 baseboard, providing 3.6 TB/s of bisection bandwidth.^[2]^[3]
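
The signaling and bandwidth figures in this paragraph are mutually consistent, as the sketch below shows. It assumes that "two physical lanes per NVLink" means two differential pairs in each direction, and it takes bisection bandwidth to be half the GPUs each driving their full NVLink rate; both readings are assumptions made to reproduce the quoted numbers.

```python
# Hopper NVSwitch signaling arithmetic, using the figures quoted above.
GBAUD = 50                 # symbol rate per differential pair
BITS_PER_SYMBOL = 2        # PAM4 encodes two bits per symbol
PAIRS_PER_DIRECTION = 2    # assumption: "two physical lanes" = 2 pairs each way
PORTS_PER_SWITCH = 64
LINKS_PER_GPU = 18
GPUS_PER_BOARD = 8

pair_gbit_s = GBAUD * BITS_PER_SYMBOL                       # 100 Gb/s per pair
link_one_way_gb_s = pair_gbit_s * PAIRS_PER_DIRECTION // 8  # 25 GB/s
link_duplex_gb_s = 2 * link_one_way_gb_s                    # 50 GB/s

switch_gb_s = PORTS_PER_SWITCH * link_duplex_gb_s           # 3,200 GB/s
switch_tbit_s = switch_gb_s * 8 / 1000                      # 25.6 Tb/s
gpu_gb_s = LINKS_PER_GPU * link_duplex_gb_s                 # 900 GB/s

# Assumed bisection model: split the board in half and let each GPU in one
# half drive its full NVLink bandwidth toward the other half.
bisection_gb_s = (GPUS_PER_BOARD // 2) * gpu_gb_s           # 3,600 GB/s

print(f"per link:  {link_duplex_gb_s} GB/s full duplex")
print(f"per chip:  {switch_gb_s} GB/s = {switch_tbit_s} Tb/s")
print(f"per GPU:   {gpu_gb_s} GB/s")
print(f"bisection: {bisection_gb_s / 1000} TB/s")
```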

This generation introduced two architecturally important capabilities. The first is in-network compute through SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), which Nvidia brands as NVLink SHARP. Embedded arithmetic-logic units inside the switch let the NVSwitch perform collective operations such as all-reduce on behalf of the GPUs, rather than shuttling every partial result back and forth between GPUs. A SHARP-enabled all-reduce reads the partial values once, sums them inside the switch, and multicasts the single reduced result back to all participants. This roughly halves the traffic each GPU interface must move during communication-intensive phases of training, approximately doubling effective NVLink bandwidth for those collectives.^[3] Each Hopper NVSwitch provides up to 400 GFLOPS of FP32 SHARP throughput, its ALUs support logical, min/max and add operators across signed and unsigned integers as well as FP16, FP32, FP64 and BF16, and a SHARP controller can manage up to 128 SHARP groups in parallel.^[3] These collectives are exposed to applications through libraries such as NCCL and NVSHMEM.^[3]
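
To make the traffic saving concrete, the sketch below models an in-switch all-reduce functionally and compares the bytes each GPU must send against a ring all-reduce. The ring algorithm is a common software baseline chosen here for illustration; the source does not say which baseline Nvidia's "roughly halves" figure is measured against. All function names are hypothetical, and this is a toy model rather than the NVLink SHARP protocol itself.

```python
def switch_allreduce(partials):
    """Functional model of an in-switch all-reduce.

    The switch reads each GPU's partial vector once, sums element-wise, and
    multicasts the single reduced vector back to every participant.
    """
    reduced = [sum(column) for column in zip(*partials)]
    return [list(reduced) for _ in partials]


def ring_sent_per_gpu(num_gpus: int, payload_bytes: int) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (num_gpus - 1) * payload_bytes / num_gpus


def switch_sent_per_gpu(payload_bytes: int) -> int:
    """Bytes each GPU contributes when the switch does the reduction."""
    return payload_bytes


# Four GPUs, each holding a three-element partial result.
partials = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
print(switch_allreduce(partials)[0])   # [1111, 2222, 3333]

payload = 1 << 30  # hypothetical 1 GiB of gradients per GPU
ratio = ring_sent_per_gpu(8, payload) / switch_sent_per_gpu(payload)
print(f"ring / in-switch traffic per GPU: {ratio:.2f}x")  # 1.75x
```

For eight GPUs the ring baseline sends 1.75 times as much data per GPU, and the ratio approaches 2 as the GPU count grows, in line with the "roughly halves" description above.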

The second addition is the NVLink Switch System, sometimes called the NVLink Network, which moves NVLink beyond a single chassis. The Hopper NVSwitch ports are PHY-compatible with 400G Ethernet and InfiniBand electricals and support OSFP cages carrying four NVLinks each, with extra forward-error-correction modes for optical cabling. Using external switches, up to 256 H100 GPUs across 32 nodes can be joined into one NVLink domain delivering 57.6 TB/s of all-to-all bandwidth.^[3] Unlike on-node NVLink, the NVLink Network treats each endpoint as an independent address space that is connected at runtime through a software API call. It adds a link-level translation lookaside buffer (TLB) so a destination GPU validates and maps incoming requests, and it includes a security processor and partitioning so subsets of ports can be isolated into separate networks.^[3]

Fourth generation (Blackwell, 2024)

The fourth-generation NVSwitch, paired with fifth-generation NVLink on the Blackwell architecture, is fabricated on TSMC's 4NP process and carries 72 NVLink-5 ports, providing 7.2 TB/s of full-duplex bandwidth per chip.^[8]^[9] Fifth-generation NVLink doubles per-GPU bandwidth to 1.8 TB/s, built from 18 links of 100 GB/s each, roughly fourteen times the bandwidth of PCIe Gen5.^[8]^[10] NVLink SHARP advances to support FP8 and is credited with up to a fourfold improvement in collective bandwidth efficiency.^[8]

The most visible change is physical: the switch silicon leaves the GPU baseboard and moves into dedicated NVLink Switch trays. Each tray packages two NVSwitch ASICs and exposes 144 NVLink ports at 100 GB/s apiece, for 14.4 TB/s of non-blocking switching per tray.^[9]^[10] This rack-level design lets a single NVLink domain span up to 576 fully connected GPUs.^[8]

How does NVSwitch enable rack-scale systems?

The flagship deployment of the fourth-generation switch is the GB200 NVL72, a liquid-cooled rack that links 36 Grace CPUs and 72 Blackwell GPUs into one NVLink domain so that all 72 GPUs behave as a single accelerator. The rack holds 18 compute trays (each with two Grace CPUs and four Blackwell GPUs) and nine NVLink Switch trays. With two NVSwitch ASICs per switch tray, eighteen fourth-generation NVSwitch chips form the spine that delivers 130 TB/s of aggregate all-to-all NVLink bandwidth across the rack, what Nvidia calls "the largest NVIDIA NVLink domain ever offered."^[10]^[11]^[12] The system exposes 13.4 TB of unified HBM3e memory and is rated at up to 1.44 exaFLOPS (1,440 petaFLOPS) of FP4 tensor compute.^[11] Nvidia positions this single-domain, in-rack fabric as the key to serving trillion-parameter models, because a fast 72-GPU NVLink domain keeps the heavy all-to-all and expert-parallel traffic of large mixture-of-experts models on NVLink rather than on slower scale-out networking, which the company credits for 30x faster real-time trillion-parameter LLM inference versus the prior generation.^[11]
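
The rack-level numbers can be derived from the tray counts. The sketch below uses the tray figures from this paragraph together with the per-chip port count, links per GPU and per-link rate given in the fourth-generation section; the constant names are illustrative.

```python
# GB200 NVL72 rack arithmetic from the figures quoted in this article.
COMPUTE_TRAYS = 18
GPUS_PER_COMPUTE_TRAY = 4
CPUS_PER_COMPUTE_TRAY = 2
SWITCH_TRAYS = 9
ASICS_PER_SWITCH_TRAY = 2
PORTS_PER_ASIC = 72       # fourth-generation NVSwitch
LINKS_PER_GPU = 18        # fifth-generation NVLink
LINK_GB_S = 100           # per link, full duplex

gpus = COMPUTE_TRAYS * GPUS_PER_COMPUTE_TRAY            # 72
cpus = COMPUTE_TRAYS * CPUS_PER_COMPUTE_TRAY            # 36
switch_chips = SWITCH_TRAYS * ASICS_PER_SWITCH_TRAY     # 18

# Every GPU link needs a switch port to land on.
switch_ports = switch_chips * PORTS_PER_ASIC            # 1,296
gpu_links = gpus * LINKS_PER_GPU                        # 1,296

tray_tb_s = ASICS_PER_SWITCH_TRAY * PORTS_PER_ASIC * LINK_GB_S / 1000  # 14.4
aggregate_tb_s = gpu_links * LINK_GB_S / 1000                          # 129.6

print(f"{gpus} GPUs, {cpus} CPUs, {switch_chips} switch chips")
print(f"switch ports {switch_ports} vs GPU links {gpu_links}")
print(f"per switch tray {tray_tb_s} TB/s, rack aggregate {aggregate_tb_s} TB/s")
```

In this simple count the 1,296 switch ports match the 1,296 GPU links exactly, and the 129.6 TB/s aggregate rounds to the 130 TB/s that Nvidia quotes.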

The same NVSwitch fabric underpins Nvidia's scale-up roadmap: the NVLink Fusion program licenses NVLink and the switch ecosystem so partners can connect custom CPUs and accelerators into NVLink rack-scale systems, and successor racks built on the Rubin platform are slated to use a sixth-generation NVLink switch.^[10]

How does NVSwitch differ from PCIe and InfiniBand?

NVSwitch occupies a distinct tier of the data-center interconnect hierarchy. It is a scale-up fabric: extremely high bandwidth and low latency, tightly coupled to the GPU memory system, and used to bind a relatively modest number of GPUs (8, 16, 72, up to 576) into a single coherent domain. This complements, rather than replaces, scale-out networking such as InfiniBand or Ethernet, which links thousands of such domains into a full cluster but at lower per-GPU bandwidth. On Blackwell the gap is stark: a single GPU's 1.8 TB/s of NVLink bandwidth is roughly fourteen times that of PCIe Gen5, and the on-node NVSwitch fabric carries any-to-any traffic that PCIe cannot.^[8]^[10] Nvidia's own materials note that a Hopper NVLink Network offers several times the bisection bandwidth of an equivalent InfiniBand fabric for tightly coupled workloads.^[3] The scale-out interconnects largely descend from technology Nvidia gained through its acquisition of Mellanox, announced in 2019 and completed in 2020. The Hopper-generation NVSwitch deliberately borrows from that world, adopting InfiniBand-style PHYs, OSFP optics and SHARP collectives.^[3]

The contrast is summarized below.

Property	PCIe Gen5 (x16)	InfiniBand NDR (per port)	NVLink 5 / NVSwitch
Role	Host-to-device I/O	Scale-out cluster fabric	Scale-up GPU fabric
Per-GPU or per-port bandwidth	~128 GB/s	400 Gb/s (~50 GB/s)	1.8 TB/s per GPU^[8]
Topology to GPUs	Tree to host	Switched, multi-hop	Non-blocking all-to-all^[2]
Memory model	Separate address spaces	Separate address spaces	Shared NVLink domain^[11]
Typical scale	1 host	thousands of nodes	8 to 576 GPUs^[8]
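
As a rough illustration of what the bandwidth row means in practice, the sketch below estimates how long an idealized transfer would take over each interconnect. It uses the table's figures exactly as quoted, which mix conventions (the NVLink number is a full-duplex total, while the InfiniBand number is a per-port line rate), and it ignores latency and protocol overhead, so the results are lower bounds rather than measurements. The payload size is hypothetical.

```python
# Bandwidth figures exactly as quoted in the comparison table (GB/s).
TABLE_GB_S = {
    "PCIe Gen5 x16": 128,
    "InfiniBand NDR port": 50,
    "NVLink 5 per GPU": 1800,
}


def ideal_transfer_ms(payload_gb: float, link_gb_s: float) -> float:
    """Lower-bound transfer time in ms, ignoring latency and protocol overhead."""
    return payload_gb * 1000 / link_gb_s


PAYLOAD_GB = 18  # hypothetical payload, chosen only for round numbers
for name, rate in TABLE_GB_S.items():
    speedup = TABLE_GB_S["NVLink 5 per GPU"] / rate
    print(f"{name}: {ideal_transfer_ms(PAYLOAD_GB, rate):.1f} ms "
          f"(NVLink 5 is {speedup:.1f}x)")
```

The NVLink-to-PCIe ratio comes out at about 14, matching the "roughly fourteen times" figure cited above.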

What makes NVSwitch significant for AI?

Two technical themes define NVSwitch across its generations. The first is bandwidth scaling roughly in lockstep with NVLink, from 900 GB/s on the first chip to 7.2 TB/s on the fourth. The second is the steady migration of network functionality into the switch: in-network reductions via SHARP, security and isolation engines, link-level address translation, and finally the move from an on-board component to a rack-scale switch tray. Together these let Nvidia present ever-larger pools of GPUs to software as one machine. On the official NVLink page Nvidia describes the goal as a "seamless, high-bandwidth, multi-node GPU cluster, effectively forming a data-center-sized GPU," which has made the NVSwitch fabric central to how frontier-scale AI models are trained and served on Nvidia hardware.^[2]^[3]^[8]^[10]

References

1. Nvidia. "NVSwitch: The World's Highest-Bandwidth On-Node Switch" (Technical Overview, 2018). https://images.nvidia.com/content/pdf/nvswitch-technical-overview.pdf
2. Nvidia Technical Blog. "NVIDIA NVLink and NVIDIA NVSwitch Supercharge Large Language Model Inference." https://developer.nvidia.com/blog/nvidia-nvlink-and-nvidia-nvswitch-supercharge-large-language-model-inference/
3. A. Ishii and R. Wells, Nvidia. "The NVLink-Network Switch: Nvidia's Switch Chip for High Communication-Bandwidth SuperPODs" (Hot Chips 34, 2022). https://hc34.hotchips.org/assets/program/conference/day2/Network%20and%20Switches/NVSwitch%20HotChips%202022%20r5.pdf
4. Nvidia. "NVSwitch: The World's Highest-Bandwidth On-Node Switch" (Technical Overview), pp. 3-4. https://images.nvidia.com/content/pdf/nvswitch-technical-overview.pdf
5. WikiChip. "NVSwitch - Nvidia." https://en.wikichip.org/wiki/nvidia/nvswitch
6. Nvidia Technical Blog. "NVIDIA Ampere Architecture In-Depth." https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
7. Nvidia Technical Blog. "Upgrading Multi-GPU Interconnectivity with the Third-Generation NVIDIA NVSwitch." https://developer.nvidia.com/blog/upgrading-multi-gpu-interconnectivity-with-the-third-generation-nvidia-nvswitch/
8. Nvidia. "NVIDIA Blackwell Architecture." https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
9. FiberMall. "Understanding Nvidia's NVLink and NVSwitch Evolution: Topology and Rates." https://www.fibermall.com/blog/nvidia-nvlink-and-nvswitch-evolution.htm
10. Nvidia. "NVLink and NVLink Switch." https://www.nvidia.com/en-us/data-center/nvlink/
11. Nvidia. "GB200 NVL72." https://www.nvidia.com/en-us/data-center/gb200-nvl72/
12. Nvidia. "System Hardware and Components, NVL72 AI Factory Reference Architecture." https://docs.nvidia.com/enterprise-reference-architectures/nvl72-ai-factory/latest/components.html
13. Nvidia Technical Blog. "NVIDIA Hopper Architecture In-Depth." https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
