NVIDIA Groq LPX Rack
Last reviewed: Jun 3, 2026
Sources: 9 citations
Review status: Source-backed
Revision: v1 · 1,461 words
NVIDIA Groq 3 LPX is a rack-scale inference accelerator that NVIDIA introduced at GTC 2026, built around 256 Groq Language Processing Units (LPUs) and designed to sit beside Vera Rubin NVL72 racks as a dedicated decode engine.[1][2] Often shortened to the "LPX rack," the system packs hundreds of small, SRAM-heavy chips that generate output tokens at very low latency while the Rubin GPUs handle the heavier prompt-processing work. It is the first non-GPU accelerator NVIDIA has offered as part of its data center platform, and it is the most visible product to come out of the company's roughly $20 billion deal for Groq's inference technology in December 2025.[3][4]
The chip itself is branded the Groq 3 LPU, the seventh processor in the Vera Rubin lineup alongside the Rubin GPU, Vera CPU, NVLink 6 switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet switch.[1][5] NVIDIA describes the LPX as "co-designed with NVIDIA Vera Rubin NVL72," and it is meant to be deployed alongside that system rather than sold as a standalone box.[1] General availability is targeted for the second half of 2026, lining up with the broader Vera Rubin rollout.[1][2]
The LPX traces directly back to NVIDIA's agreement with Groq, the inference-chip startup founded by former Google engineer Jonathan Ross. On December 24, 2025, CNBC reported that NVIDIA had struck a deal worth about $20 billion, which would make it NVIDIA's largest transaction to date, well past the $6.9 billion Mellanox acquisition in 2020.[3][4] Rather than buying the company outright, NVIDIA structured the arrangement as a non-exclusive license for Groq's inference technology plus an acquihire of its leadership, with Ross, president Sunny Madra, and other senior staff moving to NVIDIA while Groq continued as a nominally independent entity.[4] Groq had been valued at $6.9 billion after a $750 million funding round in September 2025, so the reported figure represented a steep premium.[4]
Several outlets framed the LPX as the productization of that deal: customers who once might have rented Groq's cloud could now buy comparable silicon directly from NVIDIA, folded into the CUDA and Vera Rubin stack.[2][6] NVIDIA's own materials do not dwell on the dollar figure, but they confirm that the Groq 3 LPU is "newly integrated" into the platform after the technology came over from Groq.[1]
The design philosophy behind the LPU is the same one Groq built its business on: keep the model weights and activations in fast on-chip SRAM instead of slower off-chip HBM, and use a deterministic, compiler-orchestrated execution model so there is no dynamic hardware scheduling.[2][6] Each Groq 3 LPU carries 500 MB of on-chip SRAM with 150 TB/s of bandwidth, versus the roughly 22 TB/s of HBM4 bandwidth on a Rubin GPU.[6][7] The tradeoff is capacity: 500 MB holds only a small fraction of a large model, so NVIDIA gangs many LPUs together to fit one.[7]
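The capacity tradeoff can be made concrete with a rough sizing sketch. It assumes one byte per parameter (FP8 weights), counts weights only (no activations, KV cache, or compiler overhead), and uses hypothetical model sizes; the function names are illustrative and not part of any NVIDIA or Groq tool.

```python
import math

LPU_SRAM_BYTES = 500 * 10**6   # 500 MB of SRAM per Groq 3 LPU (from the article)
LPUS_PER_RACK = 256            # LPUs in one LPX rack (from the article)

def lpus_needed(num_params: float, bytes_per_param: float = 1.0) -> int:
    """Minimum number of LPUs whose combined SRAM can hold the weights.

    Assumption: weights only, evenly sharded, with all SRAM usable.
    """
    weight_bytes = num_params * bytes_per_param
    return math.ceil(weight_bytes / LPU_SRAM_BYTES)

def racks_needed(num_params: float, bytes_per_param: float = 1.0) -> int:
    """Minimum number of full LPX racks for the same weights."""
    return math.ceil(lpus_needed(num_params, bytes_per_param) / LPUS_PER_RACK)

# Hypothetical model sizes, chosen only for illustration.
for params in (70e9, 1e12):
    print(f"{params:.0e} params: {lpus_needed(params)} LPUs, "
          f"{racks_needed(params)} rack(s)")
```

Under these simplified assumptions a 70-billion-parameter model needs 140 LPUs and fits in one rack, while a trillion-parameter model needs 2,000 LPUs, or eight racks. Real deployments would need headroom for activations and would differ accordingly.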
A full LPX rack contains 256 LPUs arranged as 32 liquid-cooled 1U compute trays, each holding 8 chips.[2][5] At the rack level that adds up to 128 GB of total SRAM, about 40 PB/s of aggregate SRAM bandwidth, and 315 PFLOPS of FP8 compute, plus 12 TB of DDR5 for spillover.[1][8] The chips talk to each other over Groq's RealScale chip-to-chip fabric, with 96 links per chip at 112 Gbps, giving 2.5 TB/s of bidirectional bandwidth per LPU and 640 TB/s of scale-up bandwidth across the rack.[2][8] NVIDIA uses a cableless backplane spine of paired copper connections to wire the trays together.[6]
| Specification | Per LPU | Per tray (8 LPUs) | Per rack (256 LPUs) |
|---|---|---|---|
| On-chip SRAM | 500 MB | 4 GB | 128 GB |
| SRAM bandwidth | 150 TB/s | 1.2 PB/s | ~40 PB/s |
| FP8 compute | 1.2 PFLOPS | 9.6 PFLOPS | 315 PFLOPS |
| Scale-up bandwidth | 2.5 TB/s | 20 TB/s | 640 TB/s |
| Form factor | single chip | 1U, liquid-cooled | 32 trays, MGX rack |
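The tray and rack columns follow from multiplying the per-LPU figures by 8 and 256. The sketch below redoes that arithmetic as a sanity check. It also converts the quoted link count and line rate to bytes per second, treating 112 Gbps as a raw one-way rate per link, which is an assumption; the source does not say how the 2.5 TB/s figure was derived.

```python
LPUS_PER_TRAY = 8
TRAYS_PER_RACK = 32
LPUS_PER_RACK = LPUS_PER_TRAY * TRAYS_PER_RACK  # 256

# Per-LPU figures as quoted in the table.
PER_LPU = {
    "sram_gb": 0.5,
    "sram_bw_tbps": 150.0,
    "fp8_pflops": 1.2,
    "scale_up_tbps": 2.5,
}

def scale(per_lpu: dict, count: int) -> dict:
    """Multiply every per-LPU figure by the number of LPUs."""
    return {name: value * count for name, value in per_lpu.items()}

tray = scale(PER_LPU, LPUS_PER_TRAY)
rack = scale(PER_LPU, LPUS_PER_RACK)

# Raw link arithmetic: 96 links x 112 Gbps, converted from bits to bytes.
LINKS_PER_LPU = 96
GBPS_PER_LINK = 112
raw_one_way_tbytes = LINKS_PER_LPU * GBPS_PER_LINK / 8 / 1000  # TB/s
raw_two_way_tbytes = 2 * raw_one_way_tbytes

print("tray:", tray)
print("rack:", rack)
print("raw link bandwidth per LPU (TB/s):", raw_one_way_tbytes, raw_two_way_tbytes)
```

Straight multiplication gives 38.4 PB/s of SRAM bandwidth (quoted as about 40 PB/s) and 307.2 PFLOPS of FP8 compute (quoted as 315 PFLOPS), which suggests the per-LPU figures are rounded. The raw link rate works out to about 1.34 TB/s each way, or 2.69 TB/s in both directions, slightly above the quoted 2.5 TB/s.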
The whole system is built on NVIDIA's MGX modular rack architecture and is fully liquid-cooled, the same plumbing approach used for the Vera Rubin NVL72 it pairs with.[1][2]
The LPX is not a replacement for the GPU. It is a specialist that takes over one slice of the inference pipeline. Large language model serving splits into two phases: a compute-heavy prefill stage that processes the input prompt, and a memory-bound decode stage that emits one token at a time. Decode is where latency hurts most, and where the LPU's bandwidth advantage pays off.[2][6]
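Why decode is memory-bound can be sketched with a simple bandwidth floor: if each generated token requires reading a fixed number of bytes of weights, the time per token cannot fall below bytes divided by memory bandwidth. The example below is a back-of-the-envelope estimate only. The 100 GB workload is hypothetical, and the model ignores batching, KV-cache traffic, compute time, and interconnect latency.

```python
TB_PER_S = 1e12  # bytes per second in 1 TB/s (decimal units)

HBM4_BW_PER_GPU = 22 * TB_PER_S    # "roughly 22 TB/s" per Rubin GPU (from the article)
SRAM_BW_PER_LPU = 150 * TB_PER_S   # 150 TB/s per Groq 3 LPU (from the article)

def decode_floor_seconds(bytes_read_per_token: float,
                         num_devices: int,
                         bw_per_device: float) -> float:
    """Lower bound on time per token when every byte is read once per token
    and the reads are spread evenly across the devices."""
    return bytes_read_per_token / (num_devices * bw_per_device)

# Per-device bandwidth advantage of LPU SRAM over GPU HBM4.
bandwidth_ratio = SRAM_BW_PER_LPU / HBM4_BW_PER_GPU

# Hypothetical workload: 100 GB of weights touched per generated token.
weights_bytes = 100e9
gpu_floor = decode_floor_seconds(weights_bytes, 1, HBM4_BW_PER_GPU)
lpu_rack_floor = decode_floor_seconds(weights_bytes, 256, SRAM_BW_PER_LPU)

print(f"per-device bandwidth ratio: {bandwidth_ratio:.1f}x")
print(f"one GPU floor:      {gpu_floor * 1e3:.2f} ms/token")
print(f"one LPX rack floor: {lpu_rack_floor * 1e6:.2f} us/token")
```

The comparison of one GPU against a full rack of 256 LPUs is not like-for-like; the per-device ratio of roughly 6.8x is the fairer single number. The floors are theoretical limits, not performance predictions.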
NVIDIA's scheme for splitting the work is called attention-FFN disaggregation, or AFD. Under AFD the Rubin GPUs run the attention layers over the KV cache during both prefill and decode, while the LPUs execute the feed-forward and Mixture of Experts layers that dominate the per-token math.[1][2] NVIDIA's product page puts it more simply: "Rubin GPUs and LPUs boost decode by jointly computing every layer of the AI model for every output token."[8] The handoff is coordinated by NVIDIA Dynamo, the inference serving framework that routes prefill to GPU workers and then orchestrates the AFD loop, passing intermediate activations to the LPUs for the FFN and MoE work.[2]
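The control flow of AFD during decode can be illustrated with a toy loop. This is a conceptual sketch only: the class names, method names, and the stand-in math are all hypothetical and do not reflect the NVIDIA Dynamo API or any real kernel. It shows only the per-layer handoff, with attention (and the KV cache) on one worker and the feed-forward step on another.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]

@dataclass
class GpuAttentionWorker:
    """Hypothetical stand-in for a Rubin GPU: owns the KV cache, runs attention."""
    kv_cache: List[List[Vector]]  # one list of cached vectors per layer

    def attend(self, layer: int, x: Vector) -> Vector:
        # Toy attention: cache the new vector, then average over the cache.
        self.kv_cache[layer].append(x)
        cache = self.kv_cache[layer]
        return [sum(v[i] for v in cache) / len(cache) for i in range(len(x))]

class LpuFfnWorker:
    """Hypothetical stand-in for the LPUs: runs the FFN / MoE part of a layer."""

    def feed_forward(self, layer: int, x: Vector) -> Vector:
        # Toy FFN: ReLU with a residual connection.
        return [xi + max(0.0, xi) for xi in x]

def decode_step(x: Vector, gpu: GpuAttentionWorker, lpu: LpuFfnWorker,
                num_layers: int, trace: List[Tuple[str, int]]) -> Vector:
    """One output token: every layer bounces activations GPU -> LPU."""
    for layer in range(num_layers):
        x = gpu.attend(layer, x)        # attention over the KV cache on the GPU
        trace.append(("gpu", layer))
        x = lpu.feed_forward(layer, x)  # FFN / MoE work on the LPU
        trace.append(("lpu", layer))
    return x

num_layers = 2
gpu = GpuAttentionWorker(kv_cache=[[] for _ in range(num_layers)])
lpu = LpuFfnWorker()
trace: List[Tuple[str, int]] = []
out = decode_step([1.0, -2.0], gpu, lpu, num_layers, trace)
print(out)
print(trace)
```

The trace alternates between the two workers once per layer, which is why the interconnect between the racks sits on the critical path of every token in this scheme.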
Physically, an LPX rack is meant to stand next to a Vera Rubin NVL72 rack and connect to it over a Spectrum-X interconnect, so the two systems behave like one disaggregated inference machine.[7] Multiple LPX racks can be ganged together to serve larger models or higher concurrency.[7] Because the LPU operates as an accelerator under the existing CUDA stack, NVIDIA says computation is offloaded to it transparently on a per-token basis, without developers having to rewrite their serving code.[2]
NVIDIA's headline claim is that an LPX-plus-Vera-Rubin configuration delivers up to 35 times higher inference throughput per megawatt and up to 10 times more revenue opportunity for trillion-parameter models, measured against a GB200 NVL72 Blackwell system.[1][2] The pitch is aimed squarely at agentic workloads and million-token context windows, where models reason over long inputs and users expect fast, steady token streams.[1][8]
The underlying argument is about token rates and stability. Groq's deterministic execution avoids the scheduling jitter that variable-latency systems suffer under load, so per-token latency stays predictable even at high concurrency.[1] In practice that translates to token-generation rates in the thousands of tokens per second.[7] The Register noted that, depending on configuration, the economics could land around $45 per million tokens generated, though it also cautioned that software support is likely to be limited at launch.[7]
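The relationship between token rate, per-token latency, and cost is simple arithmetic, sketched below. The $45 per million tokens figure is the one attributed to The Register above; the 2,000 tokens per second rate and the 10-million-token workload are hypothetical values chosen only to illustrate the conversion.

```python
USD_PER_MILLION_TOKENS = 45.0  # figure attributed to The Register in the article

def ms_per_token(tokens_per_second: float) -> float:
    """Average time between tokens for a steady stream at the given rate."""
    return 1000.0 / tokens_per_second

def generation_cost_usd(num_tokens: int,
                        usd_per_million: float = USD_PER_MILLION_TOKENS) -> float:
    """Cost of generating num_tokens at a flat per-million-token price."""
    return num_tokens / 1_000_000 * usd_per_million

# Hypothetical rate and workload.
print(ms_per_token(2000.0), "ms per token")
print(generation_cost_usd(10_000_000), "USD for 10 million tokens")
```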
Coverage of GTC 2026 generally read the LPX as a strategic move rather than just a new SKU. Several analysts pointed out that by absorbing Groq's technology and reselling it inside its own platform, NVIDIA both neutralized a fast-rising inference competitor and extended its reach into low-latency token generation, the part of the market where dedicated inference chips were starting to look like a genuine threat to GPUs.[2][6] The Decoder described it as NVIDIA adding "a dedicated inference pipeline for the first time," and tied that directly to the company's quasi-acquisition of Groq.[2]
It is worth keeping the two threads distinct. The December 2025 licensing-and-acquihire deal and the March 2026 LPX product are connected, but they are separate events: the deal moved the technology and people, and the LPX is the first system NVIDIA has built from them. Many of the finer specifications, including the per-token economics and some networking details, come from technical journalism and NVIDIA's developer documentation rather than from a single press release, and the hardware is not expected in customers' hands until the second half of 2026.[1][2][7]