NVIDIA Vera Rubin is the next generation of NVIDIA's data center AI computing platform, succeeding the NVIDIA Blackwell architecture. The platform combines the Vera CPU with the Rubin GPU into an integrated superchip designed for agentic AI, large-scale inference, and trillion-parameter model training. It was first publicly unveiled at Computex 2024, fleshed out at GTC March 2025, formally announced as a six-chip platform at CES January 2026, and expanded to a seven-chip platform at GTC March 2026 with the addition of the Groq 3 LPU. NVIDIA shipped its first Vera Rubin samples to customers in February 2026 and has stated that production shipments will commence in the second half of 2026.
The platform is named after American astronomer Vera Rubin, whose observations of galaxy rotation curves provided the first compelling evidence for dark matter. The naming continues NVIDIA's tradition of naming its data center GPU architectures after scientists, following Hopper (Grace Hopper), Ada Lovelace, and Blackwell (David Blackwell).
NVIDIA's data center GPU roadmap has followed an annual or near-annual cadence since the Hopper generation. The NVIDIA H100 (Hopper, 2022) became the standard platform for the transformer training and inference workloads that have come to define modern AI computing. Its successor, the Blackwell architecture launched in 2024, introduced a dual-die GPU design; the GB200 superchip paired two B200 GPUs with a single Grace CPU.
NVIDIA Blackwell Ultra (GB300 series) followed in 2025 as a mid-generation refresh. Blackwell Ultra upgraded the HBM3E memory to 288 GB per GPU package, raised the TDP to 1,400 W, and pushed FP4 inference performance to approximately 15 PetaFLOPS per chip. The NVIDIA GB300 NVL72 rack configuration became the standard deployment unit for the largest hyperscaler clusters, housing 72 GPU packages and delivering roughly 1 ExaFLOP of FP4 compute.
Blackwell Ultra also upgraded the MUFU softmax unit, raising attention-calculation throughput 2.5x over standard Blackwell and easing a latency bottleneck that had limited inference throughput at long sequence lengths.
The Vera Rubin disclosures unfolded across four separate keynote events, each adding more architectural detail:
| Event | Date | Disclosure |
|---|---|---|
| Computex 2024 (Taipei) | June 2, 2024 | First public reveal of "Rubin" GPU and "Vera" CPU as the post-Blackwell roadmap, alongside Blackwell Ultra. NVLink 6, CX9 SuperNIC, and the X1600 converged switch named for the first time. HBM4 confirmed. |
| GTC March 2025 (San Jose) | March 18, 2025 | Vera Rubin Superchip shown publicly for the first time. Rubin Ultra (NVL576, 2027) and the Feynman successor architecture (2028) added to the roadmap. Initial NVL144 die-based naming proposed. |
| CES 2026 (Las Vegas) | January 5, 2026 | NVIDIA announced the platform name "Vera Rubin," confirmed all chips were back from the fab, declared full production, and renamed the rack from VR200 NVL144 (die-based) back to VR200 NVL72 (package-based). The platform was presented as comprising six chips. |
| GTC March 2026 (San Jose) | March 16, 2026 | Seven-chip configuration announced with the addition of the Groq 3 LPU, alongside the broader DGX SuperPOD and the five rack-scale system family. |
This staged rollout, with each event adding more concrete specifications, gave hyperscalers an unusually long lead time on infrastructure planning. NVIDIA framed the Rubin generation as purpose-built for the agentic AI era, where models reason across long context windows, chain tool calls, and operate autonomously over extended compute sessions. These workloads place different demands on hardware than the batch inference or fine-tuning tasks that dominated earlier generations.
The Vera Rubin platform takes its name from Vera Florence Cooper Rubin (1928 to 2016), an American astronomer who spent much of her career at the Carnegie Institution of Washington. Rubin's most significant contribution came from her meticulous measurements of galactic rotation curves throughout the 1970s, carried out in collaboration with astronomer Kent Ford.
In a spinning galaxy held together only by visible mass, stars near the outer edges should orbit more slowly than those closer to the center, just as outer planets in the solar system move more slowly than inner ones. Rubin and Ford found the opposite: stars at the periphery of galaxies moved at nearly the same speed as those near the center. The only explanation consistent with Newtonian gravity was the presence of a large quantity of invisible mass distributed throughout the galaxy. This was the first direct observational evidence for what physicists call dark matter.
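In Newtonian terms the argument is compact. For a star in a circular orbit of radius r, the orbital speed depends only on the mass M(r) enclosed within that radius:

```latex
v(r) = \sqrt{\frac{G\,M(r)}{r}}
```

If the visible disk contained all the mass, M(r) would level off beyond the disk's edge and v(r) would fall as 1/sqrt(r), the Keplerian decline seen among the planets. The flat curves Rubin and Ford measured instead require M(r) to grow roughly linearly with radius, well past where the starlight ends.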
Rubin's findings transformed cosmology. Dark matter is now understood to constitute roughly 27 percent of the energy content of the universe, compared to about 5 percent for ordinary matter. Despite this foundational contribution, Rubin never received the Nobel Prize in Physics, an omission widely regarded as one of the most consequential oversights in Nobel history. She did receive the National Medal of Science in 1993, the Gold Medal of the Royal Astronomical Society in 1996 (the first woman so honored since 1828), and election to the National Academy of Sciences.
The Vera C. Rubin Observatory in Chile, a major ground-based telescope facility dedicated to the Legacy Survey of Space and Time, was named in her honor in 2020.
NVIDIA's choice of name reflects an effort to commemorate scientists who advanced human knowledge through careful, data-intensive observation. The pairing of the CPU name (Vera) and GPU name (Rubin) as a single honorific preserves the full name in the product line rather than splitting it across generations.
The Vera Rubin platform is organized around a superchip that integrates one Vera CPU with two Rubin GPU dies through NVLink-C2C, a high-bandwidth chip-to-chip interconnect. NVIDIA describes its approach with Vera Rubin as treating the data center as the unit of compute rather than the individual chip. The NVL72 rack, with 72 GPU packages unified by NVLink 6 into a single high-bandwidth domain, functions more like one large accelerator than a cluster of discrete devices. This architecture lets collective operations such as all-reduce passes during training proceed at memory speeds rather than network speeds.
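To see what "memory speeds rather than network speeds" means in practice, here is a back-of-envelope ring all-reduce model using the per-GPU bandwidth figures quoted in this article; real collectives add per-hop latency and overlap with compute, so treat this as a first-order sketch:

```python
# Back-of-envelope: ring all-reduce of a 1 GB gradient bucket across
# 72 GPUs, comparing the scale-up NVLink 6 fabric with per-GPU
# scale-out networking. Bandwidths are the figures quoted in this
# article; real collectives add latency terms and overlap with compute.

def ring_allreduce_seconds(bucket_bytes, n_gpus, bw_bytes_per_s):
    # Each GPU transfers ~2*(n-1)/n of the bucket in a ring all-reduce.
    return 2 * (n_gpus - 1) / n_gpus * bucket_bytes / bw_bytes_per_s

BUCKET = 1e9                      # 1 GB of gradients
NVLINK6 = 3.6e12                  # 3.6 TB/s per GPU (scale-up)
CX9 = 1.6e12 / 8                  # 1.6 Tb/s per GPU -> 200 GB/s (scale-out)

for name, bw in [("NVLink 6", NVLINK6), ("ConnectX-9", CX9)]:
    print(f"{name}: {ring_allreduce_seconds(BUCKET, 72, bw) * 1e3:.2f} ms")
# NVLink 6 finishes the same collective roughly 18x sooner, which is
# why the NVL72 domain behaves like one accelerator for collectives.
```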
By GTC 2026, the full deployment configuration involved seven chip families, with Rubin CPX counted as a variant of the Rubin GPU rather than an eighth chip (the table below lists it separately for clarity):
| Chip | Function |
|---|---|
| Vera CPU | General computation, reinforcement learning environments, agentic orchestration |
| Rubin GPU | AI training and generation-phase inference |
| Rubin CPX GPU | Long-context prefill inference acceleration |
| NVLink 6 Switch | Scale-up GPU interconnect within a rack |
| ConnectX-9 SuperNIC | 1.6 Tb/s scale-out networking per GPU |
| BlueField-4 DPU | Network offload, storage disaggregation, security |
| Spectrum-6 Ethernet Switch | Data center scale-out networking |
| Groq 3 LPU (added at GTC 2026) | High-throughput decode-phase inference for trillion-parameter models |
At CES 2026, NVIDIA's official press release described the platform as "six chips" because Rubin CPX was treated as a Rubin GPU variant rather than a separate chip, and the Groq integration had not yet been announced. NVIDIA's GTC March 2026 press release expanded the count to seven by listing the Groq 3 LPU as a distinct platform component, even as some technical documentation continued to describe the core platform as six chips with Groq 3 LPU as an additional partnership.
The Vera CPU is NVIDIA's second fully custom Arm-architecture processor for data centers, following the Grace CPU introduced with the Hopper generation. Where Grace used Arm Neoverse V2 cores licensed from Arm Holdings, Vera uses entirely custom microarchitecture cores designed by NVIDIA, called Olympus cores.
The Olympus core implements the Armv9.2 instruction set architecture but departs from the Neoverse V2 design in its internal pipeline. Each Vera CPU die contains 88 Olympus cores. The cores support NVIDIA Spatial Multithreading, a two-way multithreading scheme that partitions core resources between threads rather than dynamically sharing them as conventional simultaneous multithreading does, allowing up to 176 threads to execute concurrently per socket (two threads per core).
Per-core resources include 2 MB of L2 cache, with 164 MB of unified L3 cache shared across the socket. NVIDIA describes Olympus as the first general-purpose CPU to support FP8 precision natively in the integer and floating-point pipelines, a feature aimed at small-model serving and reinforcement learning environments where mixed precision is useful even on the host side.
The Scalable Coherency Fabric, now in its second generation, connects all 88 cores to a shared L3 cache and the memory subsystem with 3.4 TB/s of bisection bandwidth.
Vera CPU supports up to 1.5 TB of LPDDR5X memory per socket, a threefold increase over Grace's 480 GB maximum. The memory subsystem sustains up to 1.2 TB/s of bandwidth. NVIDIA reports that the CPU sustains over 90 percent of peak memory bandwidth under realistic workloads, a figure that compares favorably with typical x86 server processors that commonly sustain 60 to 70 percent of rated bandwidth.
The GPU-facing interface uses the second generation of NVLink-C2C at 1.8 TB/s of coherent bandwidth per socket. In the Vera Rubin superchip, the Vera CPU and two Rubin GPU dies share a unified memory address space through this interconnect. CPU and GPU threads can access each other's memory without explicit copy operations, which simplifies programming models for heterogeneous workloads.
NVIDIA's marketing materials cite approximately 2x performance over the Grace generation at comparable power, and approximately 50 percent faster reinforcement-learning evaluation cycles versus competing server CPU platforms. These are NVIDIA-published figures and have not yet been independently benchmarked at scale.
GTC 2026 introduced a CPU-only rack variant that houses 256 liquid-cooled Vera CPUs in a single rack, intended for reinforcement-learning rollouts where many environment instances must run in parallel without requiring GPU compute. NVIDIA cited up to 6x throughput on RL workloads versus a comparably configured Grace-based rack.
The Rubin GPU (internally designated VR200) is a dual-die design built on TSMC's N3P process node, a 3 nm-class technology. Each of the two compute dies is reticle-sized, approaching the largest area a single lithography exposure can pattern. The two dies are assembled on a CoWoS-L (Chip on Wafer on Substrate, Large) interposer alongside two smaller I/O tiles that handle the SerDes interfaces for NVLink, PCIe, and NVLink-C2C connections.
The combined Rubin GPU package contains 336 billion transistors across its two compute dies, compared with approximately 208 billion in the NVIDIA B200. This 1.6x increase in transistor count reflects both the move from TSMC's 4NP process (used in Blackwell) to N3P and the larger total silicon area the CoWoS-L assembly accommodates.
The CoWoS-L interposer places the two compute dies side by side, with the two smaller I/O tiles arranged at the periphery handling the off-package SerDes for NVLink 6, PCIe Gen 6, and NVLink-C2C. Decoupling the I/O from the compute dies lets NVIDIA hold the I/O tiles on a more mature node while pushing the compute dies to N3P, which optimizes both yield and cost. This is a deeper version of the partitioning approach that AMD has used for several generations on its Instinct line and that Intel adopted on Ponte Vecchio, and it has become the dominant packaging strategy for reticle-pushing accelerators across the industry.
The HBM4 stacks sit at the edges of the interposer, four stacks per side, with a 2048-bit interface per stack. Total HBM signal count across all eight stacks is in the tens of thousands of pins, and the substrate routing required to fan all of those signals out to the compute dies is one of the most demanding parts of the design. NVIDIA worked closely with TSMC on substrate routing rule changes to enable the Rubin layout, with reporting in 2025 that early CoWoS-L test substrates carrying the full Rubin signal count had higher than usual rework rates that gradually fell as TSMC's process matured.
The Rubin GPU contains 224 Streaming Multiprocessors, each equipped with fifth-generation Tensor Cores. The fifth-generation design adds a hardware-accelerated adaptive compression unit and the third-generation NVIDIA Transformer Engine, which dynamically selects numerical formats for each layer of a transformer model to preserve accuracy while maximizing throughput. The primary compute format is NVFP4, a 4-bit floating point type introduced with Blackwell Ultra and now the standard for inference at scale on NVIDIA hardware.
NVFP4 uses a two-level scaling scheme paired with hardware-accelerated quantization. NVIDIA reports that NVFP4 delivers near-FP8 accuracy (typically within 1 percent on representative inference benchmarks), reduces memory footprint by approximately 1.8x relative to FP8, and approximately 3.5x relative to FP16.
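The sketch below simulates the commonly described NVFP4 layout (4-bit E2M1 elements, one scale per 16-element block, one per-tensor scale) to show where the accuracy retention comes from. The block size and scale placement here follow public descriptions rather than a normative spec, so treat this as illustrative:

```python
import numpy as np

# Simulation of a two-level block-scaled 4-bit format in the spirit of
# NVFP4. Assumed layout (per public descriptions, not a normative spec):
# 4-bit E2M1 elements, one scale per 16-element block, one per-tensor
# scale. Real hardware keeps block scales in FP8 and fuses the math
# into the Tensor Core path; this just shows the round-trip error.

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])   # all signed representable values

def quantize_fp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Round-trip x through the two-level-scaled 4-bit grid."""
    tensor_scale = max(float(np.abs(x).max()) / 6.0, 1e-12)   # tensor max -> 6.0
    xs = (x / tensor_scale).reshape(-1, block)                # x.size % block == 0
    block_scale = np.maximum(np.abs(xs).max(axis=1, keepdims=True) / 6.0, 1e-12)
    nearest = np.abs((xs / block_scale)[..., None] - GRID).argmin(axis=-1)
    return (GRID[nearest] * block_scale).reshape(x.shape) * tensor_scale

w = np.random.randn(4096).astype(np.float32)
err = np.abs(quantize_fp4(w) - w).mean() / np.abs(w).mean()
print(f"mean relative round-trip error: {err:.1%}")
```

The roughly 1.8x footprint advantage over FP8 falls out of this layout: 4 bits per element plus an 8-bit block scale shared across 16 elements averages 4.5 bits, versus 8 bits per element for FP8.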
Peak performance per Rubin GPU package, as published by NVIDIA:
| Precision | Performance |
|---|---|
| NVFP4 (inference) | 50 PetaFLOPS |
| NVFP4 (training) | 35 PetaFLOPS |
| FP8 | approximately 17 PetaFLOPS |
| FP16 / BF16 | approximately 8 PetaFLOPS |
The Tensor Core path is tightly coupled with expanded special-function units and execution pipelines designed for the attention, activation, and sparse-compute paths common in modern reasoning models.
Each Rubin GPU package contains 8 stacks of HBM4 (High Bandwidth Memory, 4th generation), each stack in a 12-Hi (12-layer) configuration, for a total capacity of 288 GB per package. Each stack presents a 2048-bit interface (16,384 data pins across the package), and the package delivers approximately 22 TB/s of aggregate memory bandwidth, a figure that implies pin rates well above the 6.4 GT/s JEDEC baseline: at 6.4 GT/s the same interface would yield only about 13 TB/s. NVIDIA's decision to push HBM4 pin speeds above the JEDEC reference is discussed below.
This compares with the HBM3E memory in Blackwell Ultra, which delivered approximately 8 TB/s per package. The roughly 2.8x increase in memory bandwidth is one of the largest generational jumps in NVIDIA's roadmap history.
At the rack level, the 72-GPU NVL72 configuration aggregates 20.7 TB of HBM4 and 1.6 PB/s of HBM4 bandwidth.
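A quick consistency check of the package and rack figures, using only numbers quoted in this article (the roughly 10.7 GT/s pin rate is the value implied by the published 22 TB/s, not an NVIDIA-published number):

```python
# Aggregate HBM bandwidth = stacks x (bus width in bytes) x pin rate.
stacks, bus_bits = 8, 2048

def hbm_bw_tb_s(pin_gt_s: float) -> float:
    return stacks * (bus_bits / 8) * pin_gt_s * 1e9 / 1e12   # bytes/s -> TB/s

print(f"{hbm_bw_tb_s(6.4):.1f} TB/s at the 6.4 GT/s JEDEC base rate")    # 13.1
print(f"{hbm_bw_tb_s(10.7):.1f} TB/s at ~10.7 GT/s (matches ~22 TB/s)")  # 21.9
print(f"{72 * 288 / 1000:.1f} TB HBM4 per NVL72 rack")                   # 20.7
print(f"{72 * 22 / 1000:.2f} PB/s HBM4 bandwidth per rack")              # ~1.6
```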
Alongside the primary Rubin GPU, NVIDIA introduced a variant called the Rubin CPX (Context Phase eXtreme). While the standard Rubin GPU is optimized for both training and generation-phase inference, the CPX is purpose-built for the context or prefill phase of inference, where the model processes a large input prompt before generating any output tokens.
The CPX uses 128 GB of GDDR7 memory rather than HBM4, trading HBM-class bandwidth for lower cost per gigabyte, a fit for the compute-bound prefill phase. It delivers 30 PetaFLOPS of NVFP4 compute and provides up to 3x faster attention calculation compared with the GB300 NVL72 baseline, according to NVIDIA's published comparisons. The lower cost of GDDR7 relative to HBM4 makes the CPX economical for the context workload, which consumes large amounts of compute but does not need the same memory bandwidth as the generation phase.
NVIDIA frames Rubin CPX as the answer to million-token context windows used in coding agents, video generation models, and long-form reasoning tasks where the prefill pass dominates total inference cost.
In the Vera Rubin NVL144 CPX rack, 144 Rubin CPX GPUs handle context processing while 144 standard Rubin GPUs handle token generation, alongside 36 Vera CPUs. The combined rack delivers 8 ExaFLOPS of NVFP4 compute, 7.5x more than a GB300 NVL72, with 100 TB of total memory and approximately 1.7 PB/s of memory bandwidth. This is a different SKU from the standard Vera Rubin NVL72 GPU rack, which uses 72 standard Rubin GPU packages without CPX.
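The 8 ExaFLOPS rack figure is consistent with the per-chip numbers quoted earlier if the "144 standard Rubin GPUs" are counted die-wise (144 dies across 72 packages), matching the die-based NVL144 naming this SKU retained. A quick check under that assumption:

```python
# Assumes the "144 standard Rubin GPUs" in the NVL144 CPX rack are the
# 144 dies of 72 packages; per-package and per-CPX figures are the ones
# quoted in this article.
cpx_pf = 144 * 30      # 144 Rubin CPX at 30 PF NVFP4 each -> 4,320 PF
rubin_pf = 72 * 50     # 72 Rubin packages (144 dies) at 50 PF each -> 3,600 PF
print(f"{(cpx_pf + rubin_pf) / 1000:.1f} ExaFLOPS NVFP4")  # ~7.9, vs the ~8 EF claim
```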
NVLink 6 is the sixth generation of NVIDIA's proprietary high-speed GPU interconnect, introduced with the Vera Rubin platform. It doubles the per-GPU bandwidth compared with NVLink 5 used in Blackwell, from 1.8 TB/s to 3.6 TB/s bidirectional.
NVLink 6 uses 400 Gbps SerDes lanes, compared with the 200 Gbps SerDes in NVLink 5. The NVLink 6 Switch chip delivers 28.8 TB/s of aggregate switch bandwidth per tray and provides 260 TB/s of total scale-up connectivity when 72 Rubin GPUs are unified in the NVL72 configuration.
All 72 GPUs in the NVL72 rack communicate through an all-to-all NVLink topology managed by NVSwitch 6 blades. Each GPU can reach every other GPU at full NVLink bandwidth without traversing the slower PCIe or Ethernet fabric. This flat topology eliminates the bandwidth degradation that multi-hop network paths introduce and is what allows NVIDIA to describe the NVL72 rack as a single performance domain.
NVLink 6 Switch includes hardware support for SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) in-network compute. The switch can perform floating-point reduction operations directly on data in flight, delivering 14.4 TFLOPS of FP8 in-network compute per switch tray. This reduces the volume of data that must traverse the network during all-reduce passes in distributed training and cuts collective-operation latency.
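The latency benefit is easiest to see as a step count: a ring all-reduce serializes 2*(n-1) phases around the ring, while in-network reduction needs one trip up to the switch and one result trip back down. A rough model, ignoring chunk pipelining and overlap:

```python
# Serialized phase counts for an all-reduce over n GPUs: classic ring
# vs. switch-based in-network (SHARP-style) reduction. Real NCCL/SHARP
# pipelines chunk and overlap transfers; this is just the skeleton.

def ring_phases(n: int) -> int:
    return 2 * (n - 1)    # (n-1) reduce-scatter + (n-1) all-gather steps

def sharp_phases(n: int) -> int:
    return 2              # contributions up to the switch, result back down

n = 72
print(f"ring: {ring_phases(n)} phases; in-network: {sharp_phases(n)}")
# For small, latency-bound collectives the phase-count gap dominates;
# the 14.4 TFLOPS of FP8 per tray bounds how fast the switch itself
# can perform the summation.
```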
NVLink-C2C (Chip-to-Chip), in its second generation, provides coherent interconnect between the Vera CPU and the Rubin GPU dies within the superchip at 1.8 TB/s. The coherence protocol means CPU and GPU can share data structures without software-managed copies, which is particularly useful for agentic AI frameworks where orchestration logic running on the CPU needs low-latency access to KV caches resident on GPU memory.
The Vera Rubin platform pairs the scale-up NVLink 6 fabric with a refreshed scale-out networking stack.
ConnectX-9 SuperNICs deliver 1.6 Tb/s of per-GPU scale-out bandwidth using either Ethernet (Spectrum-X) or InfiniBand (Quantum-X800). The card supports programmable RDMA for low-latency, GPU-direct networking. Each Vera Rubin compute blade typically provisions one ConnectX-9 per Rubin GPU.
BlueField-4 integrates a 64-core Grace-based control CPU, hardware-accelerated offload engines for storage and security, and an 800 Gb/s network interface. BlueField-4 anchors the storage tier of the platform through a dedicated STX storage rack, which exposes large pools of NVMe storage to the compute racks at line rate.
Each Spectrum-6 ASIC delivers 102.4 Tb/s of switch bandwidth and supports lossless Ethernet with adaptive routing tuned for collective operations. Spectrum-6 deploys in a dedicated SPX Ethernet rack that aggregates traffic across an entire AI factory.
The Vera Rubin development timeline followed NVIDIA's now-standard public roadmap pattern of progressive disclosure across keynotes:
| Date | Event |
|---|---|
| June 2024 | Computex 2024: First public reveal of Rubin GPU, Vera CPU, NVLink 6, CX9 SuperNIC, X1600 switch concept |
| March 2025 | GTC 2025: Vera Rubin Superchip shown for the first time; Rubin Ultra (2027) and Feynman (2028) added to roadmap |
| Late 2025 | CFO Colette Kress, on the Q3 FY26 earnings call (November 2025), reported $500B of Blackwell-plus-Rubin revenue visibility through end of calendar 2026 |
| January 2026 | CES 2026: Vera Rubin platform officially launched as a six-chip platform; full production declared; rack rebranded from VR200 NVL144 to VR200 NVL72 |
| February 2026 | First Vera Rubin samples shipped to early customers (per CFO commentary on Q4 FY26 earnings call) |
| March 2026 | GTC 2026: Seven-chip platform announced with Groq 3 LPU; full DGX SuperPOD and five rack-system family detailed |
| H2 2026 | Production shipments to early customers commence |
NVIDIA confirmed that manufacturing uses TSMC's N3P process for the Rubin GPU compute dies, with CoWoS-L packaging. At CES in January 2026, Jensen Huang stated that all necessary chips were back from the fab, indicating silicon validation was complete. Full production was confirmed during the same keynote.
CFO Colette Kress, on the Q4 FY26 earnings call in February 2026, told investors that NVIDIA had "shipped our first Vera Rubin samples to customers earlier this week" and that the company was on track to commence production shipments in the second half of 2026. The same call provided the first concrete signal that hyperscalers had silicon in hand.
The Vera Rubin NVL72 is the primary rack-scale deployment unit for the platform. It continues the architecture established with the GB200 NVL72 and GB300 NVL72, using the same Oberon rack chassis with cooling modifications.
NVIDIA initially planned to call this rack VR200 NVL144 to count GPU dies (72 packages times 2 dies per package). In late December 2025, NVIDIA reverted to the package-based NVL72 naming for marketing consistency with the prior Blackwell generation. References published before that date often use NVL144; references published from CES 2026 onwards use NVL72.
| Parameter | Value |
|---|---|
| Rubin GPU packages | 72 |
| Rubin GPU compute dies (total) | 144 |
| Vera CPUs | 36 |
| HBM4 memory (total) | 20.7 TB |
| HBM4 bandwidth | 1.6 PB/s |
| LPDDR5X memory | 54 TB |
| NVSwitch 6 blades | 9 |
| Compute blades | 18 |
| Total transistors | approximately 220 trillion |
| NVFP4 inference performance | 3.6 ExaFLOPS |
| NVFP4 training performance | 2.5 ExaFLOPS |
| Scale-up bandwidth | 260 TB/s |
The rack is fully liquid cooled and accepts coolant at up to 45 degrees Celsius supply temperature, which keeps it compatible with facility cooling systems that do not chill water below data center ambient.
A meaningful operational improvement over Blackwell: NVIDIA reports that the NVL72 rack can be assembled in approximately 6 minutes, compared with around 100 minutes for the GB200 NVL72. This is achieved through pre-integrated cable management and a revised blade design.
The DGX Vera Rubin SuperPOD groups 14 NVL72 racks together with a shared high-speed network fabric, totaling 1,008 Rubin GPUs and 504 Vera CPUs. NVIDIA cites 50.4 ExaFLOPS of NVFP4 compute and 1,046 TB of fast memory at this scale.
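The SuperPOD figures are straight multiples of the NVL72 rack numbers given earlier; note that the 1,046 TB "fast memory" figure is the per-rack sum of HBM4 and LPDDR5X:

```python
# Consistency check against the NVL72 rack table above.
racks = 14
print(racks * 72, "Rubin GPUs")                 # 1,008
print(racks * 36, "Vera CPUs")                  # 504
print(racks * 3.6, "ExaFLOPS NVFP4")            # 50.4
print(racks * (20.7 + 54), "TB fast memory")    # 1,045.8 ~= 1,046 (HBM4 + LPDDR5X)
```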
A separate DGX Rubin NVL8 SuperPOD configuration aggregates 64 NVL8 systems (8 Rubin GPUs each) for a total of 512 Rubin GPUs, suitable for workloads that need fewer GPUs unified at NVLink speed.
Rubin Ultra is the planned mid-generation refresh of the Rubin platform, targeted for the second half of 2027. It follows the pattern established by Blackwell Ultra as a substantially upgraded variant on the same platform foundations. NVIDIA disclosed Rubin Ultra at GTC March 2025 alongside the Vera Rubin announcement.
The Rubin Ultra GPU (VR300) moves from two to four reticle-sized compute chiplets per package. NVIDIA confirmed at GTC 2025 that Rubin Ultra would carry 1 TB of HBM4E across 16 stacks per package, with packaging on TSMC's CoWoS-L. TrendForce reporting in April 2026 indicated that NVIDIA decided to retain the multi-die assembly approach rather than attempt a monolithic super-die for the Ultra refresh.
Key Rubin Ultra GPU specifications, as published or projected:
| Parameter | Rubin (VR200) | Rubin Ultra (VR300) |
|---|---|---|
| Compute dies | 2 | 4 |
| FP4 inference (per package) | 50 PetaFLOPS | approximately 100 PetaFLOPS |
| Memory | 288 GB HBM4 | 1 TB HBM4E |
| Memory stacks | 8 | 16 |
| Memory bandwidth (per package) | 22 TB/s | approximately 32 TB/s |
| TDP (per package) | approximately 1,800 W | approximately 3,600 W |
| NVLink generation | NVLink 6 | NVLink 7 |
The memory upgrade to HBM4E (an enhanced version of HBM4 with higher pin speeds) and 16 stacks per package raises total memory per GPU from 288 GB to approximately 1 TB.
The Rubin Ultra platform introduces a new rack architecture called the Kyber rack, which departs significantly from the Oberon design used for Blackwell, Blackwell Ultra, and standard Rubin.
In the Kyber design, compute blades rotate 90 degrees into a vertical blade form factor for higher density. The rack holds four canisters (sometimes called pods), each containing 18 compute blades, for a total of 576 GPU dies in the NVL576 configuration. A PCB backplane replaces the copper cable interconnects used in Oberon, a change made necessary by the higher density and reduced available routing space.
The Kyber rack has an estimated total power draw of 600 kW, making it the highest-power computing platform NVIDIA has designed for commercial deployment. Facilities deploying Rubin Ultra NVL576 racks must be engineered for per-rack draws roughly equivalent to the average demand of several hundred residential homes.
NVIDIA cites a target of approximately 14x higher inference performance than the GB300 NVL72 for the Rubin Ultra NVL576 system, reflecting both the per-GPU performance improvement and the higher GPU count per rack. The company has published a target of approximately 15 ExaFLOPS of dense FP4 inference and 5 ExaFLOPS of FP8 training per Rubin Ultra NVL576 rack, with 365 TB of fast memory and 4.6 PB/s of HBM4E bandwidth.
Feynman is the planned successor to the Rubin generation, targeting a 2028 launch. It is named after theoretical physicist Richard Feynman. NVIDIA announced Feynman's existence at GTC March 2025 alongside the Rubin announcement.
At GTC 2025, Jensen Huang stated only that "our next generation will be named after Feynman" without providing detailed specifications. By GTC 2026, NVIDIA had added several headline elements to the Feynman roadmap: 3D die stacking (NVIDIA's first use of stacked GPU dies), custom HBM memory co-designed with the GPU, and fabrication on TSMC's A16 (1.6 nm-class) node.
Feynman is expected to be paired with the Rosa CPU, a successor to Vera in the same way Vera succeeded Grace. The networking stack is projected to advance to NVLink 8, NVSwitch 8, ConnectX-10, BlueField-5, and Spectrum-7 Ethernet. These details remain forward-looking and may evolve as Feynman moves through tape-out and validation.
NVIDIA frames its generational comparisons at the system level (NVL72 or equivalent rack) to reflect real-world deployment configurations; the per-package figures below are drawn from those disclosures. They are NVIDIA-published marketing numbers and should be read as such.
| Metric | Blackwell Ultra (GB300) | Rubin (VR200) | Improvement |
|---|---|---|---|
| FP4 inference | 15 PetaFLOPS | 50 PetaFLOPS | 3.3x |
| FP4 training | 10 PetaFLOPS | 35 PetaFLOPS | 3.5x |
| Memory capacity | 288 GB HBM3E | 288 GB HBM4 | Same |
| Memory bandwidth | 8 TB/s | 22 TB/s | 2.8x |
| NVLink bandwidth | 1.8 TB/s | 3.6 TB/s | 2x |
| TDP (typical) | 1,400 W | approximately 1,800 W | 1.3x |
The 3.3x FP4 inference figure reflects per-GPU compute with comparable TDP growth, derived from SemiAnalysis analysis of NVIDIA disclosures. NVIDIA's own marketing materials cite a 5x inference improvement at the rack level, measured by comparing the VR200 NVL72 against a GB200 NVL72 (rather than against Blackwell Ultra), and accounting for the disaggregated CPX configuration where applicable.
| Configuration | NVFP4 Inference | Memory | Memory BW |
|---|---|---|---|
| GB200 NVL72 (Blackwell) | approximately 720 PetaFLOPS | 13.5 TB HBM3E | approximately 570 TB/s |
| GB300 NVL72 (Blackwell Ultra) | approximately 1 ExaFLOP | 20.7 TB HBM3E | approximately 580 TB/s |
| VR200 NVL72 (Rubin) | 3.6 ExaFLOPS | 20.7 TB HBM4 | 1.6 PB/s |
| VR200 NVL144 CPX (Rubin + CPX) | 8 ExaFLOPS | 100 TB | approximately 1.7 PB/s |
| VR300 NVL576 (Rubin Ultra, 2027 target) | approximately 15 ExaFLOPS | approximately 365 TB | approximately 4.6 PB/s |
NVIDIA claims approximately 10x higher inference throughput per watt for the Vera Rubin NVL72 versus the GB200 NVL72 generation. Cost per token at scale is projected by NVIDIA at approximately one-tenth of Blackwell costs, a figure that combines the throughput gains with anticipated reductions in HBM bandwidth costs as HBM4 volume production matures. With the addition of the Groq 3 LPX rack at GTC 2026, NVIDIA claimed up to 35x higher tokens-per-megawatt versus the Blackwell NVL72 alone, at a target cost point of approximately $45 per million tokens for trillion-parameter models. Both of these are vendor-published numbers and have not yet been independently validated.
The Rubin GPU TDP is approximately 1,800 W per package in its standard configuration. Reports in early 2026 indicated that NVIDIA raised some performance targets to widen the gap with AMD's Instinct MI400 series, lifting boost clocks and memory bandwidth on certain SKUs at a cost of an additional roughly 500 W per accelerator, bringing those configurations to 2,300 W per GPU.
At the rack level, the NVL72 with 2,300 W GPUs has a TDP of up to 220 kW. The standard 1,800 W configuration draws approximately 160 to 170 kW per rack.
For Rubin Ultra, the higher TDP of 3,600 W per package scales the NVL576 Kyber rack to an estimated 600 kW total. Infrastructure partners including CoreWeave, Lambda, Nebius, Oracle Cloud Infrastructure, and Together AI have announced plans to engineer facilities around 800-volt power distribution to accommodate these loads. NVIDIA has also developed DSX Max-Q power management software that allows up to 30 percent more infrastructure deployment within fixed power budgets by dynamically managing GPU power states.
The DSX Flex capability extends the platform to grid-flexible operation, allowing AI factories to modulate compute load in response to grid availability signals. NVIDIA describes this as a way to access what it calls 100 gigawatts of stranded grid capacity.
NVIDIA announced customer commitments for Vera Rubin from a broad set of AI labs, cloud providers, and system manufacturers. The most concrete commitments came in the November 2025 to March 2026 window.
OpenAI's Stargate program, originally announced in early 2025 as a multi-hundred-billion-dollar AI infrastructure buildout, was a heavy intended consumer of Rubin-class hardware, though the program's scope and structure shifted materially through 2026.
These shifts did not eliminate OpenAI's Vera Rubin commitments; OpenAI remains on the published Vera Rubin customer list. They did, however, indicate that the original Stargate scale projection was elastic and that hyperscaler-led deployments (Microsoft, AWS, Google Cloud) are doing more of the actual buildout than originally pitched.
AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure have announced plans to offer Vera Rubin instances. CoreWeave, Crusoe, Lambda, Nebius, Nscale, and Together AI are among the earliest neocloud deployments. Nebius has confirmed Vera Rubin NVL72 availability in the United States and Europe from H2 2026.
OEM partners building Vera Rubin systems include Cisco, Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro. Contract manufacturers ASUS, Foxconn, GIGABYTE, Inventec, Pegatron, QCT, Wistron, and Wiwynn are producing boards and systems for cloud providers and hyperscalers.
Vera Rubin sits at the center of NVIDIA's near-term revenue trajectory. On the Q3 FY26 earnings call (November 19, 2025), CFO Colette Kress stated that NVIDIA had visibility into approximately $500 billion of combined Blackwell and Rubin revenue from the start of calendar 2025 through the end of calendar 2026, of which approximately $150 billion had already shipped. By the Q4 FY26 call (February 25, 2026), the company reported quarterly data center revenue of approximately $50 billion and reiterated that Rubin samples were now in customer hands.
The Rubin generation also reshapes how NVIDIA reports revenue mix. Where Hopper and early Blackwell were dominated by individual GPU and HGX baseboard sales, Rubin is sold in much larger units: the standard SKU is a full NVL72 rack, and NVIDIA's revenue per Rubin shipment carries far more value from the rack-integrated networking, switches, and software than from the GPU silicon alone. This has lengthened sales cycles but raised the average selling price per design win.
TSMC has expanded CoWoS-L capacity through 2027 to accommodate Rubin and Rubin Ultra orders, which dominate advanced packaging allocation at TSMC. The transition from Blackwell's CoWoS-L footprint to Rubin's larger 4x-reticle layout consumes more wafer-equivalent capacity per package, so the same wafer-out throughput at TSMC produces fewer Rubin packages than Blackwell packages. This made Rubin Ultra's four-die assembly a significant point of supply-chain analysis: whether four reticle-sized dies could fit on a single CoWoS-L substrate, or whether NVIDIA would split them across two paired packages, remained unresolved through early 2026, until TrendForce reported in April 2026 that Rubin Ultra retains a single multi-die package rather than splitting into paired packages.
HBM4 supply, which had been a concern in mid-2025, was reported on track for Rubin's H2 2026 ramp by the end of 2025, with NVIDIA repeatedly denying earlier delay reports. Samsung, SK Hynix, and Micron all qualified HBM4 production for Rubin, with SK Hynix taking the largest share of initial volume. A reported NVIDIA decision in early 2026 to push HBM4 pin speeds higher (above the JEDEC reference 6.4 GT/s for some SKUs) introduced additional qualification work but ultimately did not delay the H2 2026 ramp.
NVIDIA also disclosed in December 2025 that it had signed an approximately $20 billion licensing and talent agreement with Groq, the inference startup, to integrate Groq's LPU technology into the Vera Rubin platform. The Groq 3 LPU manufactured on Samsung 4nm is the first chip to emerge from that agreement, and the Groq 3 LPX rack pairs with Vera Rubin NVL72 inside the same AI factory. This is an unusual arrangement: NVIDIA has historically positioned its accelerator as the complete answer for inference, and integrating a competitor's silicon directly into its roadmap is a notable strategic shift.
The Vera Rubin platform ships with a refreshed CUDA and inference software stack tuned for agentic AI workloads.
CUDA 13 is the baseline release for Rubin. It introduces native programming-model support for the third-generation Transformer Engine, the new NVFP4 quantization helpers, and the disaggregated CPX scheduling primitives that the Dynamo orchestrator builds on. The runtime extends the unified memory model first introduced with Grace Hopper to a true tri-tier addressing scheme that spans LPDDR5X (Vera CPU), HBM4 (standard Rubin GPU), and GDDR7 (Rubin CPX) without requiring application-side copies. Migration of pages between tiers is handled by hardware coherence protocols on NVLink-C2C and by software prefetch hints on NVLink 6.
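As an illustration of what tri-tier placement means operationally, the sketch below encodes a plausible placement heuristic. Everything here is hypothetical, including every name; CUDA 13's actual placement APIs are not described at this level of detail in public material. The point is the decision shape: bandwidth-bound working sets go to HBM4, prefill-phase working sets to CPX's GDDR7, and large latency-tolerant state spills to LPDDR5X.

```python
# Hypothetical placement policy for a tri-tier unified address space.
# The tier figures are the ones quoted in this article; the policy and
# names are illustrative, not any real CUDA 13 API.
from enum import Enum

class Tier(Enum):
    HBM4 = "rubin-hbm4"        # ~22 TB/s, 288 GB per package
    GDDR7 = "cpx-gddr7"        # cheaper, 128 GB per CPX
    LPDDR5X = "vera-lpddr5x"   # ~1.2 TB/s, up to 1.5 TB per socket

def place(buffer_gb: float, bandwidth_bound: bool, prefill_phase: bool) -> Tier:
    if prefill_phase:
        return Tier.GDDR7          # CPX handles compute-bound prefill
    if bandwidth_bound and buffer_gb <= 288:
        return Tier.HBM4           # decode-phase weights and hot KV cache
    return Tier.LPDDR5X            # spill tier: optimizer state, cold KV

print(place(180, bandwidth_bound=True, prefill_phase=False))   # Tier.HBM4
```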
CUDA 13 also raises the per-thread-block resource limits on Rubin to take advantage of the larger shared-memory partitions in the fifth-generation SM design. Code paths written for Blackwell remain binary-compatible through CUDA 13's forward-compatible PTX layer, so existing Blackwell deployments can move to Rubin without recompilation, though specific kernels often need retuning to capture the full memory-bandwidth uplift.
The third-generation Transformer Engine extends the FP8 and FP4 mixed-precision flow with hardware-accelerated adaptive compression, and ties more tightly into the attention kernels that dominate inference time on long contexts. Per-tensor scaling has been refined so that the format selection runs at the granularity of individual transformer layers, with a fallback to BF16 on layers that show numerical sensitivity during calibration. The Transformer Engine library also exposes a programming interface for sparse mixture-of-experts attention, which has become more relevant as model architectures shift toward MoE designs at the trillion-parameter scale.
NVIDIA Dynamo is the inference orchestrator built on top of CUDA 13. It is designed specifically to manage Rubin plus Rubin CPX disaggregated serving. Dynamo's Smart Router routes prefill traffic to CPX nodes and decode traffic to standard Rubin nodes, while the Dynamo GPU Planner autoscales the CPX-to-Rubin ratio in response to incoming traffic shape. KV cache transfer between prefill and decode nodes uses GPU-direct RDMA over ConnectX-9, with a software path that explicitly sequences the cache layer-by-layer to overlap with the first decode steps. NVIDIA reports that this overlapped transfer reduces the effective time-to-first-token tax of disaggregation to under 5 percent of total request latency on representative long-context workloads.
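A sketch of the routing decision described above. The types and thresholds are hypothetical illustrations, not Dynamo's actual API; they show the shape of the policy: long-prompt requests get a CPX prefill node, decode lands on the least-loaded HBM-backed Rubin node, and short prompts skip disaggregation entirely.

```python
# Illustrative disaggregated-serving router in the spirit of Dynamo's
# Smart Router. All names and the 8K-token threshold are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int

def route(req: Request, cpx_pool: list, rubin_pool: list,
          prefill_threshold: int = 8192):
    """Return (prefill_node, decode_node) for one request."""
    decode_node = min(rubin_pool, key=lambda n: n["kv_bytes_used"])
    if req.prompt_tokens >= prefill_threshold and cpx_pool:
        prefill_node = min(cpx_pool, key=lambda n: n["queue_depth"])
    else:
        prefill_node = decode_node   # short prompts: no disaggregation tax
    return prefill_node, decode_node

cpx = [{"name": "cpx-0", "queue_depth": 3}, {"name": "cpx-1", "queue_depth": 1}]
rubin = [{"name": "r-0", "kv_bytes_used": 6e10}, {"name": "r-1", "kv_bytes_used": 2e10}]
print(route(Request(prompt_tokens=120_000, max_new_tokens=2048), cpx, rubin))
```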
DSX Max-Q and DSX Flex provide rack and facility-level power management for AI factories built on Vera Rubin. DSX Max-Q applies dynamic frequency and voltage scaling across an entire NVL72 rack, with the goal of fitting more deployed GPUs within a fixed facility power envelope. NVIDIA cites up to 30 percent more deployed compute under fixed power budgets, though the exact figure depends on workload shape and how much headroom an operator was carrying. DSX Flex extends this to coordinated grid-flexible operation, where racks can throttle in response to grid demand-response signals, with NVIDIA arguing that this unlocks access to so-called stranded grid capacity for AI factories sited near constrained interconnect points.
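The "up to 30 percent more deployed compute" claim is at bottom a power-packing argument: cap each rack below its worst-case draw and fit more racks into a fixed facility envelope. Illustrative numbers, using the rack TDPs quoted earlier:

```python
# Power-packing arithmetic behind a Max-Q-style deployment claim.
# Facility size and the specific cap level are illustrative.
facility_kw = 100_000                    # 100 MW facility budget
uncapped_rack_kw = 220                   # NVL72 worst case (2,300 W GPU SKUs)
capped_rack_kw = 170                     # capped near typical draw

print(facility_kw // uncapped_rack_kw)   # 454 racks provisioned for worst case
print(facility_kw // capped_rack_kw)     # 588 racks under the cap (~29% more)
```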
NVIDIA Mission Control, the cluster-management plane introduced with Blackwell, has been extended with Rubin-aware health monitoring, optical-link telemetry from the Spectrum-6 fabric, and unified observability across the new disaggregated CPX-and-Rubin topology. Mission Control includes scheduler hooks that surface CPX availability as a separate pool to upper-layer schedulers like Slurm, Run:ai, and Kubernetes-based AI platforms.
The introduction of the Rubin CPX accelerates a broader architectural trend in large language model serving: disaggregation of the prefill (context) and decode (generation) phases. This pattern was popularized by research papers including Splitwise (Microsoft) and DistServe (UC San Diego), and was already in use at major labs running Blackwell. Vera Rubin is the first NVIDIA platform to ship dedicated silicon for the prefill side rather than asking operators to repurpose a single SKU.
In traditional inference serving, a single GPU handles both the prefill pass (processing the user's input prompt) and the decode loop (generating output tokens one at a time). These two phases have very different compute profiles. Prefill is compute-intensive and parallelizes well across many GPU cores. Decode is memory-bandwidth-intensive, as it reads the full KV cache at each generation step.
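A first-order roofline model makes the split concrete. The numbers below use the per-package figures quoted in this article for a hypothetical 1-trillion-parameter dense model in NVFP4, ignoring attention FLOPs, KV-cache reads, and batching, all of which shift the constants but not the conclusion:

```python
# Why the two phases want different silicon, in roofline terms.
# Illustrative: 1T-parameter dense model, NVFP4 weights (0.5 bytes/param).
params = 1e12
flops_per_token = 2 * params            # ~2 FLOPs per parameter per token

# Prefill: all prompt tokens processed in parallel -> compute-bound.
prompt = 100_000
prefill_s = prompt * flops_per_token / 50e15    # 50 PF NVFP4 per package
print(f"prefill: ~{prefill_s:.0f} s of pure compute")          # ~4 s

# Decode: every token must stream the weights -> bandwidth-bound.
weight_bytes = params * 0.5
decode_tok_s = 22e12 / weight_bytes             # 22 TB/s HBM4 per package
print(f"decode: ~{decode_tok_s:.0f} tokens/s per package at batch 1")  # ~44
```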
With the Rubin platform, NVIDIA's Dynamo framework can route prefill requests to Rubin CPX nodes (optimized for compute-dense prefill on GDDR7 memory) and decode requests to standard Rubin nodes (optimized for memory bandwidth on token generation with HBM4). The Groq 3 LPX rack, added at GTC 2026, takes this further by providing an even more decode-specialized accelerator with 512 MB of on-chip SRAM per die and 150 TB/s of memory bandwidth, sitting alongside Rubin GPUs in the same platform.
This three-tier disaggregation (CPX for prefill, Rubin for general inference and training, Groq 3 LPX for decode at extreme throughput) is the operational shape NVIDIA expects for trillion-parameter inference at scale. It also represents an unusual choice: NVIDIA integrating a third-party accelerator (Groq) into its own platform rather than blocking competition. The licensing arrangement is reportedly worth approximately $20 billion across the term of the partnership.
| Feature | Blackwell Ultra (GB300) | Vera Rubin (VR200) |
|---|---|---|
| Architecture | Blackwell | Rubin |
| CPU | 72-core Grace (Arm Neoverse V2) | 88-core Vera (Olympus, custom Arm) |
| GPU dies per package | 2 | 2 |
| Process node (GPU) | TSMC 4NP | TSMC N3P |
| HBM generation | HBM3E | HBM4 |
| Memory per package | 288 GB | 288 GB |
| Memory bandwidth | approximately 8 TB/s | 22 TB/s |
| FP4 inference | 15 PetaFLOPS | 50 PetaFLOPS |
| FP4 training | 10 PetaFLOPS | 35 PetaFLOPS |
| NVLink generation | NVLink 5 | NVLink 6 |
| NVLink BW per GPU | 1.8 TB/s | 3.6 TB/s |
| Transistor count | approximately 208 billion | 336 billion |
| GPU TDP | 1,400 W | approximately 1,800 to 2,300 W |
| NVL rack GPU count | 72 | 72 |
| NVL72 NVFP4 inference | approximately 1 ExaFLOP | 3.6 ExaFLOPS |
| Production | 2025 | H2 2026 |