# NVIDIA Groq LPX Rack

> Source: https://aiwiki.ai/wiki/nvidia_groq_lpx_rack
> Updated: 2026-06-03
> Categories: AI Hardware, AI Inference, NVIDIA
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**NVIDIA Groq 3 LPX** is a rack-scale inference accelerator that [NVIDIA](/wiki/nvidia) introduced at GTC 2026, built around 256 Groq Language Processing Units (LPUs) and designed to sit beside [Vera Rubin NVL72](/wiki/nvidia_vera_rubin) racks as a dedicated decode engine.[1][2] Often shortened to the "LPX rack," the system packs hundreds of small, [SRAM](/wiki/sram)-heavy chips that generate output tokens at very low latency while the Rubin GPUs handle the heavier prompt-processing work. It is the first non-GPU accelerator NVIDIA has added to its data center platform, and it is the most visible product to come out of the company's roughly $20 billion deal for [Groq](/wiki/groq_hardware)'s inference technology in December 2025.[3][4]

The chip itself is branded the Groq 3 LPU, the seventh processor in the Vera Rubin lineup alongside the Rubin GPU, Vera CPU, NVLink 6 switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet switch.[1][5] NVIDIA describes the LPX as "co-designed with NVIDIA Vera Rubin NVL72," and the two systems are meant to be deployed together rather than sold as a standalone box.[1] General availability is targeted for the second half of 2026, lining up with the broader Vera Rubin rollout.[1][2]

## Background: the Groq deal

The LPX traces directly back to NVIDIA's agreement with Groq, the inference-chip startup founded by former Google engineer Jonathan Ross. On December 24, 2025, CNBC reported that NVIDIA had struck a deal worth about $20 billion, which would make it NVIDIA's largest transaction to date, well past the $6.9 billion Mellanox acquisition in 2020.[3][4] Rather than buying the company outright, NVIDIA structured the arrangement as a non-exclusive license for Groq's inference technology plus an acquihire of its leadership, with Ross, president Sunny Madra, and other senior staff moving to NVIDIA while Groq continued as a nominally independent entity.[4] Groq had been valued at $6.9 billion after a $750 million funding round in September 2025, so the reported figure represented a steep premium.[4]

Several outlets framed the LPX as the productization of that deal: customers who once might have rented Groq's cloud could now buy comparable silicon directly from NVIDIA, folded into the CUDA and Vera Rubin stack.[2][6] NVIDIA's own materials do not dwell on the dollar figure, but they confirm that the Groq 3 LPU is "newly integrated" into the platform after the technology came over from Groq.[1]

## Architecture

The design philosophy behind the LPU is the same one Groq built its business on: keep the model weights and activations in fast on-chip SRAM instead of slower off-chip memory such as [HBM4](/wiki/hbm4), and use a deterministic, compiler-orchestrated execution model so there is no dynamic hardware scheduling.[2][6] Each Groq 3 LPU carries 500 MB of on-chip SRAM with 150 TB/s of bandwidth, versus the roughly 22 TB/s of HBM4 bandwidth on a Rubin GPU.[6][7] The tradeoff is capacity: a single LPU holds very little memory, so NVIDIA gangs many of them together to fit a model.[7]
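
To make the capacity tradeoff concrete, the following is a minimal sketch that estimates how many LPUs are needed just to hold a model's weights in SRAM. It assumes decimal megabytes and one byte per parameter (FP8) by default, and it ignores activations, the KV cache, DDR5 spillover, and any split of the model between GPUs and LPUs. The function names and the example model sizes are illustrative assumptions, not NVIDIA figures.

```python
import math

SRAM_PER_LPU_BYTES = 500 * 10**6  # 500 MB per Groq 3 LPU, as quoted in the article
LPUS_PER_RACK = 256               # LPUs in one LPX rack, as quoted in the article


def lpus_needed(num_params: float, bytes_per_param: float = 1.0) -> int:
    """Minimum number of LPUs whose combined SRAM holds the weights alone."""
    weight_bytes = num_params * bytes_per_param
    return math.ceil(weight_bytes / SRAM_PER_LPU_BYTES)


def racks_needed(num_params: float, bytes_per_param: float = 1.0) -> int:
    """Minimum number of full LPX racks for the same weights-only estimate."""
    return math.ceil(lpus_needed(num_params, bytes_per_param) / LPUS_PER_RACK)


# Hypothetical model sizes, chosen only to illustrate the arithmetic.
print(lpus_needed(70 * 10**9))   # 70B parameters at 1 byte each -> 140
print(racks_needed(10**12))      # 1T parameters at 1 byte each -> 8
```

Under these assumptions a 70-billion-parameter model at FP8 already spans 140 chips, which is why the LPU only makes sense as part of a large, tightly connected group.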

A full LPX rack contains 256 LPUs arranged as 32 liquid-cooled 1U compute trays, each holding 8 chips.[2][5] At the rack level that adds up to 128 GB of total SRAM, about 40 PB/s of aggregate SRAM bandwidth, and 315 PFLOPS of FP8 compute, plus 12 TB of DDR5 for spillover.[1][8] The chips talk to each other over Groq's RealScale chip-to-chip fabric, with 96 links per chip at 112 Gbps, giving 2.5 TB/s of bidirectional bandwidth per LPU and 640 TB/s of scale-up bandwidth across the rack.[2][8] NVIDIA uses a cableless backplane spine of paired copper connections to wire the trays together.[6]

| Specification | Per LPU | Per tray (8 LPUs) | Per rack (256 LPUs) |
| --- | --- | --- | --- |
| On-chip SRAM | 500 MB | 4 GB | 128 GB |
| SRAM bandwidth | 150 TB/s | 1.2 PB/s | ~40 PB/s |
| FP8 compute | ~1.2 PFLOPS | ~9.6 PFLOPS | 315 PFLOPS |
| Scale-up bandwidth | 2.5 TB/s | 20 TB/s | 640 TB/s |
| Form factor | single chip | 1U, liquid-cooled | 32 trays, MGX rack |

The whole system is built on NVIDIA's MGX modular rack architecture and is fully liquid-cooled, the same plumbing approach used for the Vera Rubin NVL72 it pairs with.[1][2]
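
The tray and rack columns in the table follow from multiplying the per-chip figures by the chip count. A short sketch of that arithmetic is given here, assuming decimal units and simple linear scaling with no overheads; the `LpuSpec` class and `aggregate` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

LPUS_PER_TRAY = 8
TRAYS_PER_RACK = 32


@dataclass(frozen=True)
class LpuSpec:
    """Per-chip figures as quoted in the article."""
    sram_mb: float = 500.0
    sram_tb_per_s: float = 150.0
    fp8_pflops: float = 1.2
    scale_up_tb_per_s: float = 2.5


def aggregate(spec: LpuSpec, num_lpus: int) -> dict:
    """Linearly scale per-chip figures to a group of LPUs (no overheads modeled)."""
    return {
        "sram_gb": spec.sram_mb * num_lpus / 1000,
        "sram_pb_per_s": spec.sram_tb_per_s * num_lpus / 1000,
        "fp8_pflops": spec.fp8_pflops * num_lpus,
        "scale_up_tb_per_s": spec.scale_up_tb_per_s * num_lpus,
    }


spec = LpuSpec()
tray = aggregate(spec, LPUS_PER_TRAY)
rack = aggregate(spec, LPUS_PER_TRAY * TRAYS_PER_RACK)
print("tray:", tray)
print("rack:", rack)
```

Straight multiplication gives 128 GB of SRAM, 38.4 PB/s of SRAM bandwidth, and 640 TB/s of scale-up bandwidth for the rack, in line with the quoted values. It gives 307.2 PFLOPS of FP8 compute, slightly below the quoted 315 PFLOPS, which suggests the per-chip 1.2 PFLOPS figure is rounded.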

## How it works with Vera Rubin

The LPX is not a replacement for the GPU. It is a specialist that takes over one slice of the [inference](/wiki/ai_inference) pipeline. Large language model serving splits into two phases: a compute-heavy prefill stage that processes the input prompt, and a memory-bound decode stage that emits one token at a time. Decode is where latency hurts most, and where the LPU's bandwidth advantage pays off.[2][6]
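
One way to see why memory bandwidth governs decode is a simple lower bound: if generating a token requires reading a block of weights once, the step cannot finish faster than the time needed to stream those bytes from memory. The sketch below applies that bound to the per-device bandwidth figures quoted in the Architecture section. It assumes a purely bandwidth-bound step and ignores compute, interconnect hops, batching, and the KV cache; the 500 MB shard size and the function name are illustrative assumptions.

```python
TB = 10**12  # decimal terabyte, in bytes

HBM4_BYTES_PER_S = 22 * TB       # roughly 22 TB/s per Rubin GPU, per the article
LPU_SRAM_BYTES_PER_S = 150 * TB  # 150 TB/s per Groq 3 LPU, per the article


def min_seconds_to_stream(num_bytes: float, bytes_per_second: float) -> float:
    """Lower bound on the time to read num_bytes once at a given bandwidth."""
    if bytes_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / bytes_per_second


# Hypothetical weight shard sized to fill one LPU's SRAM.
shard_bytes = 500 * 10**6
t_hbm = min_seconds_to_stream(shard_bytes, HBM4_BYTES_PER_S)
t_sram = min_seconds_to_stream(shard_bytes, LPU_SRAM_BYTES_PER_S)
print(f"HBM4: {t_hbm * 1e6:.1f} us  SRAM: {t_sram * 1e6:.1f} us  "
      f"ratio: {t_hbm / t_sram:.1f}x")
```

The ratio is simply the ratio of the two bandwidths, a little under 7x per device. Real decode latency also depends on how the model is partitioned across chips, which this bound does not capture.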

NVIDIA's scheme for splitting the work is called attention-FFN disaggregation, or AFD. Under AFD the Rubin GPUs run the attention layers over the [KV cache](/wiki/kv_cache) during both prefill and decode, while the LPUs execute the feed-forward and [Mixture of Experts](/wiki/mixture_of_experts) layers that dominate the per-token math.[1][2] NVIDIA's product page puts it more simply: "Rubin GPUs and LPUs boost decode by jointly computing every layer of the AI model for every output token."[8] The handoff is coordinated by [NVIDIA Dynamo](/wiki/nvidia_dynamo), the inference serving framework that routes prefill to GPU workers and then orchestrates the AFD loop, passing intermediate activations to the LPUs for the FFN and MoE work.[2]
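
The division of labor under AFD can be pictured with a toy decode loop in which one worker owns the KV cache and runs attention, and a second worker runs the feed-forward step on the activations it is handed. This is a conceptual sketch only: the class names, the averaging "attention," and the ReLU "FFN" are stand-ins invented for illustration, and nothing here reflects the actual Dynamo API or NVIDIA's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Vector = List[float]


@dataclass
class GpuAttentionWorker:
    """Hypothetical stand-in for the GPU side: owns the KV cache, runs attention."""
    kv_cache: Dict[int, List[Vector]] = field(default_factory=dict)
    calls: int = 0

    def attend(self, layer: int, hidden: Vector) -> Vector:
        entries = self.kv_cache.setdefault(layer, [])
        entries.append(hidden)  # toy: cache the hidden state itself as the KV entry
        self.calls += 1
        # Toy attention: uniform average over everything cached for this layer.
        return [sum(column) / len(entries) for column in zip(*entries)]


@dataclass
class LpuFfnWorker:
    """Hypothetical stand-in for the LPU side: stateless feed-forward math."""
    calls: int = 0

    def feed_forward(self, activations: Vector) -> Vector:
        self.calls += 1
        return [max(0.0, 2.0 * a - 1.0) for a in activations]  # toy ReLU(2a - 1)


def decode_token(hidden: Vector, gpu: GpuAttentionWorker,
                 lpu: LpuFfnWorker, num_layers: int) -> Vector:
    """Run one token through every layer, alternating GPU and LPU work."""
    for layer in range(num_layers):
        attn_out = gpu.attend(layer, hidden)   # attention over the KV cache (GPU)
        hidden = lpu.feed_forward(attn_out)    # activations handed off for FFN (LPU)
    return hidden


gpu, lpu = GpuAttentionWorker(), LpuFfnWorker()
for token_embedding in ([1.0, 2.0], [0.5, 0.0], [2.0, 1.0]):
    out = decode_token(token_embedding, gpu, lpu, num_layers=4)
print(gpu.calls, lpu.calls, len(gpu.kv_cache[0]))  # prints: 12 12 3
```

The point of the sketch is the shape of the loop: every layer of every output token involves one handoff in each direction, so the state that grows with context (the KV cache) stays on the GPU side while the FFN worker remains stateless.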

Physically, an LPX rack is meant to stand next to a Vera Rubin NVL72 rack and connect to it over a [Spectrum-X](/wiki/nvidia_spectrum_x) interconnect, so the two systems behave like one disaggregated inference machine.[7] Multiple LPX racks can be ganged together to serve larger models or higher concurrency.[7] Because the LPU operates as an accelerator under the existing CUDA stack, NVIDIA says computation is offloaded to it transparently on a per-token basis, without developers having to rewrite their serving code.[2]

## Performance and economics

NVIDIA's headline claim is that an LPX-plus-Vera-Rubin configuration delivers up to 35 times higher inference throughput per megawatt and up to 10 times more revenue opportunity for trillion-parameter models, measured against a [GB200 NVL72](/wiki/nvidia_gb200_nvl72) [Blackwell](/wiki/blackwell) system.[1][2] The pitch is aimed squarely at agentic workloads and million-token context windows, where models reason over long inputs and users expect fast, steady token streams.[1][8]

The underlying argument is about token rates and stability. Groq's deterministic execution avoids the scheduling jitter that variable-latency systems suffer under load, so per-token latency stays predictable even at high concurrency.[1] In practice that translates to token-generation rates in the thousands of tokens per second.[7] The Register noted that, depending on configuration, the economics could land around $45 per million tokens generated, though it also cautioned that software support is likely to be limited at launch.[7]
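
For a sense of scale, a per-million-token figure can be converted into a dollar value per hour for a given token rate. The sketch below does only that arithmetic. The 1,000 tokens-per-second rate is a hypothetical input chosen for illustration, since the sources do not tie the $45 figure to a specific rate, and the function names are likewise assumptions.

```python
def tokens_per_hour(tokens_per_second: float) -> float:
    """Tokens generated in one hour at a steady rate."""
    return tokens_per_second * 3600


def usd_per_hour(tokens_per_second: float,
                 usd_per_million_tokens: float = 45.0) -> float:
    """Dollar value of one hour of generated tokens at a per-million-token rate."""
    return tokens_per_hour(tokens_per_second) / 1_000_000 * usd_per_million_tokens


# Hypothetical steady stream of 1,000 tokens per second.
print(f"{tokens_per_hour(1000):,.0f} tokens/hour")
print(f"${usd_per_hour(1000):,.2f} per hour at $45 per million tokens")
```

At that assumed rate, one hour of output is 3.6 million tokens, or about $162 at $45 per million tokens.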

## Reception and context

Coverage of GTC 2026 generally read the LPX as a strategic move rather than just a new SKU. Several analysts pointed out that by absorbing Groq's technology and reselling it inside its own platform, NVIDIA both neutralized a fast-rising inference competitor and extended its reach into low-latency token generation, the part of the market where dedicated inference chips were starting to look like a genuine threat to GPUs.[2][6] The Decoder described it as NVIDIA adding "a dedicated inference pipeline for the first time," and tied that directly to the company's quasi-acquisition of Groq.[6]

It is worth keeping the two threads distinct. The December 2025 licensing-and-acquihire deal and the March 2026 LPX product are connected, but they are separate events: the deal moved the technology and people, and the LPX is the first system NVIDIA has built from them. Many of the finer specifications, including the per-token economics and some networking details, come from technical journalism and NVIDIA's developer documentation rather than from a single press release, and the hardware is not expected in customers' hands until the second half of 2026.[1][2][7]

## References

[1] NVIDIA Newsroom, "NVIDIA Vera Rubin Opens Agentic AI Frontier," March 16, 2026. https://nvidianews.nvidia.com/news/nvidia-vera-rubin-platform

[2] StorageReview.com, "NVIDIA Groq 3 LPX: Everything we know," March 18, 2026. https://www.storagereview.com/news/nvidia-groq-3-lpx-everything-we-know

[3] CNBC, "Nvidia buying AI chip startup Groq's assets for about $20 billion in its largest deal on record," December 24, 2025. https://www.cnbc.com/2025/12/24/nvidia-buying-ai-chip-startup-groq-for-about-20-billion-biggest-deal.html

[4] The Motley Fool, "Nvidia's 'Acqui-Hire' of Groq Eliminates a Potential Competitor and Marks Its Entrance Into the Non-GPU, AI Inference Chip Space," December 28, 2025. https://www.fool.com/investing/2025/12/28/nvidia-groq-deal-acquisition-ai-inference-lpu/

[5] Tom's Hardware, "Nvidia Groq 3 LPU and Groq LPX racks join Rubin platform at GTC," March 16, 2026. https://www.tomshardware.com/pc-components/gpus/nvidia-groq-3-lpu-and-groq-lpx-racks-join-rubin-platform-at-gtc-sram-packed-accelerator-boosts-every-layer-of-the-ai-model-on-every-token

[6] The Decoder, "GTC 2026: With Groq 3 LPX, Nvidia adds dedicated inference hardware to its platform for the first time," March 17, 2026. https://the-decoder.com/gtc-2026-with-groq-3-lpx-nvidia-adds-dedicated-inference-hardware-to-its-platform-for-the-first-time/

[7] The Register, "Nvidia slaps Groq into new LPX racks for faster AI response," March 16, 2026. https://www.theregister.com/2026/03/16/nvidia_lpx_groq_3/

[8] NVIDIA, "NVIDIA Groq 3 LPX: Inference Accelerator for Agentic AI" (product page). https://www.nvidia.com/en-us/data-center/lpx/

[9] NVIDIA Technical Blog, "Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform," March 16, 2026. https://developer.nvidia.com/blog/inside-nvidia-groq-3-lpx-the-low-latency-inference-accelerator-for-the-nvidia-vera-rubin-platform/