NVIDIA Rubin CPX
Last reviewed
Jun 2, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,805 words
NVIDIA Rubin CPX is a class of GPU announced by NVIDIA on September 9, 2025, purpose-built to accelerate the compute-heavy "context" phase of large-model inference. [1][2] It is part of the company's Vera Rubin platform and is designed to work alongside standard Rubin GPUs in a disaggregated serving architecture, where Rubin CPX handles input processing (prefill) and the standard Rubin GPUs handle token generation (decode). NVIDIA states that a single Rubin CPX delivers up to 30 petaFLOPS of NVFP4 compute, carries 128 GB of GDDR7 memory, and runs attention three times faster than the company's GB300 NVL72 systems. [1][3] NVIDIA is targeting availability at the end of 2026. [1][4]
Rubin CPX is positioned as a specialized accelerator for "massive-context" workloads, meaning inference over very long input sequences such as million-token software codebases and long-form generative video. [1] Rather than serving an entire inference request on one type of GPU, NVIDIA splits the request across two hardware classes optimized for the two distinct computational regimes of large language model inference. Rubin CPX is the part tuned for the first regime, the ingestion and analysis of large input contexts, which is dominated by dense matrix math rather than by memory traffic. [2]
The "CPX" suffix denotes this context-processing role within the Rubin family. The design departs from NVIDIA's flagship accelerators in two notable ways: it is built on a single monolithic die rather than the dual-die package used by the main Rubin GPU, and it pairs that die with relatively inexpensive GDDR7 memory instead of the high-bandwidth memory (HBM) used on NVIDIA's top data center parts. [3][5] These choices reflect a bet that the prefill phase needs abundant compute but comparatively modest memory bandwidth, so capacity-oriented GDDR7 is a better economic fit than HBM for that specific job. [5]
NVIDIA unveiled Rubin CPX at its AI Infra Summit on September 9, 2025. [1] The chip belongs to the Vera Rubin generation, NVIDIA's successor architecture to Blackwell, which pairs the Rubin GPU with the Arm-based Vera CPU. Rubin CPX is offered in configurations including the rack-scale Vera Rubin NVL144 CPX, which combines Vera CPUs, standard Rubin GPUs, and Rubin CPX GPUs in a single system. [1]
In its announcement NVIDIA attributed a statement to chief executive Jensen Huang: "Just as RTX revolutionized graphics and physical AI, Rubin CPX is the first CUDA GPU purpose-built for massive-context AI." [1][6] The reference to CUDA underscores that Rubin CPX runs the same software stack as the rest of the NVIDIA accelerator line. Press coverage noted that the part had taped out at TSMC around the time of the announcement, consistent with the targeted end-of-2026 ship date. [4]
The motivation for Rubin CPX is the observation that LLM inference comprises two phases with very different hardware demands. [2]
The first is the context phase, also called prefill. Here the model ingests the entire input prompt, including any long document or codebase, and processes it to produce the first output token. NVIDIA describes this phase as compute-bound: it requires high-throughput arithmetic to analyze large volumes of input data, and the attention computation in particular grows expensive as sequences get longer. [2]
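A rough way to see why attention grows expensive with sequence length is to count the work in its two large matrix products. The sketch below uses the common approximation of about 4·n²·d FLOPs per layer for n tokens and model width d. The function name, model width, and layer count are hypothetical placeholders chosen for illustration, not figures for any model or chip named in the sources.

```python
def attention_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """Approximate FLOPs for the attention matrix products during prefill.

    Counts Q @ K^T and scores @ V, each roughly 2 * n^2 * d FLOPs per layer.
    Ignores the Q/K/V/output projections, softmax, masking, and MLP blocks.
    """
    per_layer = 4 * seq_len * seq_len * d_model
    return per_layer * n_layers


# Hypothetical model shape, used only to show the scaling trend.
D_MODEL = 8192
N_LAYERS = 80

for n in (8_000, 128_000, 1_000_000):
    flops = attention_flops(n, D_MODEL, N_LAYERS)
    print(f"{n:>9,} tokens: {flops:.2e} attention FLOPs")
```

Because the count is quadratic in `seq_len`, doubling the prompt length quadruples this part of the work, which is the growth that long-context prefill hardware has to absorb.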
The second is the generation phase, also called decode. Here the model emits output tokens one at a time. NVIDIA describes this phase as memory-bandwidth-bound: throughput depends on moving model weights and the key-value cache quickly, which relies on fast memory and high-speed interconnects such as NVLink. [2]
Running both phases on the same homogeneous GPU forces a compromise, because hardware sized for one phase is mismatched for the other. Disaggregated inference instead processes the phases independently on separate hardware pools, allowing each to be sized and optimized on its own. [2] NVIDIA's framing is that this raises throughput, lowers latency, and improves resource utilization, and that it lets generation GPUs stay focused on their bandwidth needs while a cheaper, compute-dense part such as Rubin CPX absorbs the prefill work. [2][5] The general technique of splitting prefill and decode across distinct GPU pools predates Rubin CPX in the research and systems literature; Rubin CPX is NVIDIA's purpose-built silicon for the prefill side of that split. [5]
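The split described above can be illustrated with a toy two-stage pipeline. This is a minimal sketch under stated assumptions, not NVIDIA's Dynamo API or any real scheduler: the `Request`, `Pool`, and `serve` names are hypothetical, each pool is modeled as a plain FIFO queue, and the key-value cache transfer is represented only by a comment.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: int
    max_new_tokens: int


@dataclass
class Pool:
    """A named group of identical workers, modeled as one FIFO queue."""
    name: str
    queue: deque = field(default_factory=deque)


def serve(requests, prefill_pool: Pool, decode_pool: Pool):
    """Run each request through prefill, then hand it to decode.

    Returns a log of (pool name, request id) in processing order.
    """
    log = []
    # Stage 1: every prompt is ingested on the prefill (context) pool.
    prefill_pool.queue.extend(requests)
    while prefill_pool.queue:
        req = prefill_pool.queue.popleft()
        log.append((prefill_pool.name, req.request_id))
        # A real system would transfer the key-value cache at this point.
        decode_pool.queue.append(req)
    # Stage 2: token generation runs on the separate decode pool.
    while decode_pool.queue:
        req = decode_pool.queue.popleft()
        log.append((decode_pool.name, req.request_id))
    return log


reqs = [Request(1, 900_000, 256), Request(2, 4_000, 1024)]
log = serve(reqs, Pool("prefill"), Pool("decode"))
print(log)
```

Keeping the two pools as separate objects is the point of the sketch: each can be sized independently, which a single shared queue on homogeneous hardware would not allow.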
NVIDIA describes Rubin CPX as a single-die GPU with hardware acceleration for both attention and video processing. [1][3] The headline figures NVIDIA reports for the chip are summarized below.
| Specification | Reported figure | Notes |
|---|---|---|
| Compute (NVFP4) | Up to 30 petaFLOPS | NVFP4 precision; one report notes it is unclear whether the figure assumes sparsity [1][7] |
| Memory | 128 GB GDDR7 | GDDR7 rather than HBM; NVIDIA's first GPU at this memory capacity [1][8] |
| Memory technology | GDDR7 | Reported at roughly one-fifth the cost per gigabyte of HBM4, on a standard substrate rather than a silicon interposer [5] |
| Attention performance | 3x faster than GB300 NVL72 | Hardware attention acceleration, per NVIDIA [1][3] |
| Die configuration | Single monolithic die | Contrasts with the dual-die package of the standard Rubin GPU [5][7] |
| Video engines | 4 NVENC and 4 NVDEC units | Hardware encode and decode for long-format video workflows [3][7] |
NVIDIA's choice of GDDR7 over HBM is the defining engineering trade. GDDR7 offers far less bandwidth per device than HBM, but it is substantially cheaper and uses conventional packaging. [5][7] The Register observed that GDDR7-based NVIDIA parts top out near 1.6 to 1.7 TB/s of bandwidth, well below the multiple-terabyte-per-second figures of HBM-based accelerators, and noted that this is an acceptable trade for prefill work whose bottleneck is compute rather than data movement. [7] NVFP4, a 4-bit low-precision floating-point format, is the numerical format in which NVIDIA reports the chip's peak throughput. The four NVENC and four NVDEC engines reflect the part's intended use in generative video pipelines, where input video must be decoded and output video encoded at scale. [3]
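The compute-versus-bandwidth trade can be made concrete with a simple roofline estimate. The sketch below uses NVIDIA's reported 30 petaFLOPS NVFP4 peak and, as an assumption, a bandwidth of 1.7 TB/s taken from the upper end of the range The Register cites for GDDR7-based parts; this article gives no bandwidth figure for Rubin CPX itself, so the result is illustrative only. The function names are hypothetical.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP per byte) above which work is compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Classic roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * bandwidth)


PEAK_NVFP4 = 30e15   # 30 petaFLOPS, NVIDIA's reported peak
ASSUMED_BW = 1.7e12  # 1.7 TB/s, an assumption (not a stated Rubin CPX spec)

ridge = ridge_point(PEAK_NVFP4, ASSUMED_BW)
print(f"ridge point: {ridge:,.0f} FLOP/byte")
```

Under these assumed inputs, a kernel needs a very high ratio of arithmetic to memory traffic to reach the compute roof, which fits the article's description of prefill as dense matrix math and explains why the same part would be a poor fit for bandwidth-bound decode.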
The flagship system built around the chip is the Vera Rubin NVL144 CPX, a single rack that combines all three Vera Rubin components. NVIDIA reports the following aggregate figures for the rack. [1][2]
| Rack metric | Reported figure |
|---|---|
| Rubin CPX GPUs | 144 |
| Standard Rubin GPUs | 144 |
| Vera CPUs | 36 |
| NVFP4 compute | 8 exaFLOPS |
| Fast memory | 100 TB |
| Memory bandwidth | 1.7 PB/s |
| Relative AI performance | 7.5x a GB300 NVL72, per NVIDIA |
The 144 Rubin CPX GPUs handle the context phase while the 144 standard Rubin GPUs handle the generation phase, all within one rack. [1][2] The Register reported the physical layout as 288 GPUs and 36 CPU sockets per rack, with each compute tray holding eight standard Rubin (HBM) GPUs and eight Rubin CPX (GDDR7) GPUs. [7] For scale-out across multiple racks, NVIDIA offers the configuration with either Quantum-X800 InfiniBand or Spectrum-X Ethernet networking paired with ConnectX-9 SuperNICs, and the workload is coordinated by NVIDIA's Dynamo inference-serving platform, which routes and balances the prefill and decode pools. [1][2]
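The rack figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers reported above; the tray count, the CPUs per tray, and the split of rack compute between the two GPU types are derived values, and the compute split additionally assumes that per-chip peak figures add linearly, which the sources do not confirm.

```python
CPX_GPUS = 144
RUBIN_GPUS = 144
VERA_CPUS = 36
GPUS_PER_TRAY_EACH = 8   # eight of each GPU type per tray, per The Register
CPX_PFLOPS = 30          # per-chip NVFP4 peak reported by NVIDIA
RACK_EXAFLOPS = 8        # rack-level NVFP4 figure reported by NVIDIA

total_gpus = CPX_GPUS + RUBIN_GPUS            # matches the 288 GPUs reported
trays = CPX_GPUS // GPUS_PER_TRAY_EACH        # derived, not stated in the sources
cpus_per_tray = VERA_CPUS // trays            # derived

# Share of rack compute attributable to Rubin CPX, assuming peaks simply add.
cpx_exaflops = CPX_GPUS * CPX_PFLOPS / 1000
remaining_exaflops = RACK_EXAFLOPS - cpx_exaflops

print(f"GPUs per rack:        {total_gpus}")
print(f"Compute trays:        {trays}")
print(f"Vera CPUs per tray:   {cpus_per_tray}")
print(f"Rubin CPX share:      {cpx_exaflops:.2f} EF")
print(f"Left for Rubin GPUs:  {remaining_exaflops:.2f} EF")
```

On these assumptions, a little over half of the rack's quoted NVFP4 compute would come from the Rubin CPX parts, with the rest attributable to the standard Rubin GPUs.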
NVIDIA's performance claims for Rubin CPX are stated relative to its current-generation GB300 NVL72 systems. At the chip level the company cites a 3x speedup in attention; at the rack level it cites 7.5x the AI compute of a GB300 NVL72, alongside the 8 exaFLOPS, 100 TB, and 1.7 PB/s aggregate figures. [1][2] These are vendor figures and had not been independently benchmarked at announcement, since the hardware was not yet shipping.
NVIDIA also frames Rubin CPX in terms of return on investment for inference operators. The company states that the platform can deliver a 30x to 50x return on investment, which it expresses as up to 5 billion US dollars in token revenue for every 100 million US dollars of capital expenditure. [1][2] The argument behind the figure is that offloading compute-intensive context processing onto cheaper Rubin CPX silicon improves the cost structure of long-context serving, lowering total cost of ownership compared with running prefill on more expensive HBM-based GPUs. [2][5] As with the performance numbers, this is a projection supplied by NVIDIA rather than a measured result.
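The relationship between the dollar figures and the quoted multiples is simple division, sketched below. The function name is hypothetical, and the low-end revenue is derived from the 30x figure rather than stated in the sources.

```python
def revenue_multiple(token_revenue_usd: float, capex_usd: float) -> float:
    """Token revenue divided by capital expenditure."""
    return token_revenue_usd / capex_usd


CAPEX_USD = 100e6          # 100 million dollars of capital expenditure
TOP_REVENUE_USD = 5e9      # up to 5 billion dollars in token revenue

top_multiple = revenue_multiple(TOP_REVENUE_USD, CAPEX_USD)

# Revenue implied by the low end of the quoted 30x to 50x range (derived).
low_revenue_usd = 30 * CAPEX_USD

print(f"top multiple:        {top_multiple:.0f}x")
print(f"low-end revenue:     ${low_revenue_usd / 1e9:.0f}B")
```

The 50x upper bound corresponds to revenue divided by capital expenditure. A conventional return-on-investment formula that subtracts the outlay first, (revenue − cost) / cost, would give 49 for the same inputs, so the quoted range is best read as a revenue multiple.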
NVIDIA has stated that Rubin CPX and the Vera Rubin NVL144 CPX racks are expected to be available at the end of 2026. [1][4] Hardware press reported that the part had taped out at TSMC around the time of the September 2025 announcement, with shipments penciled in for late 2026. [4] Because the announcement preceded production silicon by more than a year, the published specifications are NVIDIA's stated design targets rather than figures verified on shipping hardware.
Rubin CPX is significant as one of the first mainstream data center GPUs designed for a single phase of inference rather than for general-purpose training and serving. It reflects a broader industry shift toward disaggregated inference, in which long-context workloads such as agentic coding over large repositories and long-form video generation strain conventional homogeneous GPU deployments. [1][2][5] By building a compute-dense part with high memory capacity but deliberately modest bandwidth, NVIDIA is matching silicon to the specific economics of the prefill phase, where attention over long sequences dominates cost. [5][7] If the approach holds up in production, it points toward inference fleets composed of heterogeneous accelerators tuned per phase rather than uniform racks of identical chips.
Within the Vera Rubin lineup, Rubin CPX is a complement to the standard Rubin GPU rather than a replacement. The standard Rubin part, with HBM and a dual-die package, remains the engine for the bandwidth-bound generation phase and for training, while Rubin CPX is the specialist for the compute-bound context phase. [1][5][7] Relative to the prior Blackwell generation and its GB300 NVL72 systems, NVIDIA presents the Vera Rubin NVL144 CPX as a large step up specifically for massive-context inference economics, rather than as a uniform across-the-board successor. [1][2] The closest conceptual neighbor on the wiki is the broader NVIDIA Vera Rubin platform, of which Rubin CPX is the context-processing member.