Groq LPU

The Groq LPU (Language Processing Unit) is a custom silicon inference accelerator designed by Groq, an American AI chip company founded in 2016. The LPU is purpose-built for running large language model (LLM) inference workloads with deterministic, low-latency execution. Unlike conventional GPUs, the LPU eliminates hardware-level non-determinism by delegating all scheduling to a compiler, which arranges every instruction and data movement down to individual clock cycles before execution begins.[^1][^9] This approach allows the chip to sustain extremely high memory bandwidth from on-chip SRAM, delivering output token rates that substantially exceed GPU-based inference systems on single-request latency.[^2][^12]

This article focuses on the LPU as a chip product: its silicon architecture, the underlying Tensor Streaming Processor design, the GroqRack and LPX scaling fabrics, and quantitative comparisons with rival accelerators. For coverage of Groq Inc. as a company, including its founding history, leadership, GroqCloud business, funding rounds, the Saudi Arabia partnership, and the December 2025 NVIDIA licensing agreement, see Groq.

Groq's first-generation chip, originally called the Tensor Streaming Processor (TSP), was fabricated on a 14 nm process node by GlobalFoundries.[^8][^9] The second-generation chip, known as Groq 3 or LP30, is fabricated on Samsung Foundry's 4 nm node. NVIDIA unveiled it at GTC 2026 in March, three months after the announcement on December 24, 2025 of Groq's roughly $20 billion non-exclusive technology licensing agreement with NVIDIA.[^13][^15][^18] The deal, frequently described in trade press as an acqui-hire, transferred founder Jonathan Ross and most of Groq's engineering team to NVIDIA while leaving the Groq corporate entity intact under new CEO Simon Edwards.[^14] As of May 2026, the LP30 is in commercial ramp at Samsung Foundry on the SF4X variant of its 4 nm node, and the first LPX racks are targeted to ship into NVIDIA's Vera Rubin reference systems in the third quarter of 2026.[^16][^18][^19]

Naming and origins

The architecture was originally introduced as the Tensor Streaming Processor (TSP), with the first silicon codenamed Alan. It was initially marketed to high-performance computing (HPC) and financial services customers under the TSP name, where its deterministic timing was useful for low-latency analytics and risk-modelling workloads at sites such as Argonne National Laboratory and at a handful of quantitative trading desks.[^6][^9] The breakthrough of ChatGPT in 2022 shifted the product focus toward LLM inference, and Groq rebranded the processor as a Language Processing Unit to reflect the dominant use case.[^1] The hardware itself remained the TSP; LPU is the productized name for the same underlying engine.

The Tensor Streaming Processor concept itself draws on work that founder Jonathan Ross and several other Groq engineers had done at Google on the original Tensor Processing Unit.[^24] The TSP can be read as a deliberate inversion of the TPU's design priorities: where the TPU emphasizes systolic-array efficiency for training-style matrix multiplies and pairs the array with HBM, the TSP discards HBM altogether and rebuilds the chip around a software-managed dataflow whose entire schedule is fixed at compile time. The first published academic description of the architecture appeared in the ISCA 2020 paper "Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads," with first silicon taped out in July 2019 at GlobalFoundries.[^8]

Tensor Streaming Processor design

The LPU is a functionally sliced processor that organizes compute units in vertical columns rather than the conventional core-based layout found in CPUs and GPUs.[^4][^8] Each TSP contains four types of functional slices arranged horizontally across the chip:

Slice	Function
MXM	Matrix multiply operations (320x320 multiply-accumulate arrays)
SXM	Shift and rotate vector operations for tensor reshaping
VXM	Vector arithmetic for activation functions and element-wise ops
MEM	Memory read and write across globally shared SRAM

A horizontal instruction control unit (ICU) dispatches instructions across all slices in parallel. Each slice is divided into 20 tiles, and each tile processes 16 vector elements simultaneously, giving a total of 320 SIMD lanes per TSP.[^8] The first-generation chip measures 25 by 29 millimeters and operates at a nominal clock frequency of 900 MHz, delivering more than one teraoperation per second per square millimeter of silicon on 14 nm process technology.[^8][^24] Groq engineers refer to the resulting layout as a "superlane" structure, where each superlane is a horizontal row that crosses all four functional slice types. The compiler can route a tensor along a superlane such that data flows through MEM, then VXM, then MXM, then SXM in a strictly forward direction, with each functional unit consuming the previous one's output one clock cycle later.[^4][^9]
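The lane count and the compute-density claim above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only figures quoted in this article; pairing the 750 TOPS peak INT8 figure with the 25 by 29 mm die to obtain a density is an assumption about how the "one teraoperation per square millimeter" claim was derived.

```python
# Figures quoted in the article for the first-generation chip.
TILES_PER_SLICE = 20
ELEMENTS_PER_TILE = 16
DIE_WIDTH_MM = 25
DIE_HEIGHT_MM = 29
PEAK_INT8_TOPS = 750  # assumption: this is the figure behind the density claim


def simd_lanes(tiles: int, elements_per_tile: int) -> int:
    """Total SIMD lanes across one functional slice."""
    return tiles * elements_per_tile


def ops_density(tops: float, width_mm: float, height_mm: float) -> float:
    """Peak tera-operations per second per square millimeter of die."""
    return tops / (width_mm * height_mm)


lanes = simd_lanes(TILES_PER_SLICE, ELEMENTS_PER_TILE)
density = ops_density(PEAK_INT8_TOPS, DIE_WIDTH_MM, DIE_HEIGHT_MM)

print(lanes)              # 320
print(round(density, 2))  # 1.03
```

Under that assumption the density comes out just above 1 TOPS per square millimeter, consistent with the claim in the text.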

The second-generation Groq 3 chip is fabricated on Samsung's 4 nm node, improving density and power efficiency.[^16][^18] Groq has reported 188 TFLOPS of FP16 throughput and 750 TOPS of INT8 throughput from the first-generation LPU, with the Groq 3 raising on-chip SRAM to 500 MB (some sources cite 512 MB) and bandwidth to 150 TB/s.[^16][^18][^19] The LP30 die contains roughly 98 billion transistors and integrates 96 high-speed SerDes lanes running at 112 Gbps each, yielding 2.5 TB/s of bidirectional off-chip bandwidth for rack-scale fabrics.[^16][^19] FP8 peak compute is reported at approximately 1.23 PFLOPS per die, or about 315 PFLOPS aggregated across the 256-die LPX rack that NVIDIA showcased at GTC 2026.[^16][^18][^19]
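The SerDes and rack-level FP8 figures can be sanity-checked the same way. The sketch below assumes that the 112 Gbps rate applies per lane and per direction and that lanes are full duplex; neither detail is stated in the sources cited here, so the raw totals are only an approximation of the quoted 2.5 TB/s.

```python
LANES = 96
GBPS_PER_LANE = 112          # assumption: per lane, per direction
FP8_PFLOPS_PER_DIE = 1.23
DIES_PER_LPX_RACK = 256

# Convert gigabits per second to terabytes per second (decimal units).
raw_tbytes_one_way = LANES * GBPS_PER_LANE / 8 / 1000
raw_tbytes_both_ways = 2 * raw_tbytes_one_way

rack_fp8_pflops = FP8_PFLOPS_PER_DIE * DIES_PER_LPX_RACK

print(round(raw_tbytes_one_way, 3))    # 1.344
print(round(raw_tbytes_both_ways, 3))  # 2.688
print(round(rack_fp8_pflops, 2))       # 314.88
```

The raw bidirectional total of about 2.7 TB/s sits slightly above the quoted 2.5 TB/s, which would be consistent with line-coding or error-correction overhead, although the sources do not say so. The rack-level FP8 figure rounds to the quoted 315 PFLOPS.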

Single-core layout

Unlike GPUs, which contain thousands of small cores, or Google TPUs, which use a systolic array architecture with multiple compute tiles, the LPU is fundamentally a single-core processor.[^4][^9] The chip presents one programmable execution context to the compiler, eliminating the need for inter-core communication, cache coherence, or work-stealing schedulers. This is consistent with the streaming dataflow model: data moves through the chip from one functional slice to the next, rather than circulating between independent cores. The absence of cores in the traditional sense also means there is no notion of a "thread," no warp scheduler, and no per-block local memory. From the compiler's point of view, the entire die is a single very wide datapath whose every cycle can be reasoned about as a state of 320 SIMD lanes plus the contents of SRAM.[^4][^8]

Dataflow versus von Neumann

Groq positions the TSP as a dataflow processor rather than a von Neumann processor.[^1][^19] Conventional CPUs and most GPUs operate on a fetch-decode-execute-writeback cycle in which instructions are pulled from memory and routed through pipelines whose ordering can be adjusted at runtime. A dataflow processor instead streams operands through fixed compute resources in a predetermined order, with the schedule resolved before the program runs. The LPU realizes this model in hardware by making the instruction stream and the data stream walk through the chip in lockstep: instructions flow northward from the ICU up through the tiles of each functional slice, while tensors stream east and west across the slices.[^4][^9] The intersection of an instruction wavefront and a data wavefront is the operation that fires at that clock cycle.

Deterministic execution

The defining property of the TSP is the elimination of hardware-managed non-determinism.[^1][^6][^9] Conventional processors and GPUs contain dynamic schedulers, cache controllers, branch predictors, reorder buffers, and network arbitration logic that make runtime decisions about instruction ordering, memory access, and data routing. These mechanisms add flexibility but introduce timing variability that makes worst-case latency difficult to predict or guarantee.

The TSP removes this hardware-level decision-making and moves it entirely into the compiler. Before a model runs, Groq's compiler analyzes the full computational graph of the neural network, including every matrix multiplication, activation function, and attention operation, and produces a static schedule that assigns each instruction to a specific time slot and each data movement to a specific clock cycle.[^4][^6] The hardware executes this schedule verbatim, with no runtime arbitration.

The compiler performs what Groq engineers describe as two-dimensional scheduling: it schedules instructions in both time (which clock cycle each operation fires) and space (which functional unit executes it and which memory location stores the result).[^4][^6] Because the compiler has complete knowledge of hardware latencies, it can arrange operations so that data arrives at each functional unit exactly when that unit is ready to process it, eliminating stall cycles and cache misses.
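The idea of scheduling in both time and space can be illustrated with a toy static scheduler. This is a sketch under stated assumptions, not Groq's compiler: the operation names, the latencies, and the rule that a unit handles one operation at a time are all hypothetical, and the real toolchain works at a much finer granularity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # functional slice that executes the op (space)
    latency: int       # fixed latency in cycles, known at compile time
    deps: tuple = ()   # names of ops whose results this op consumes


def static_schedule(ops):
    """Assign each op a start cycle; ops must be listed in dependency order."""
    finish = {}      # op name -> cycle at which its result is available
    unit_free = {}   # unit name -> first cycle at which the unit is idle
    schedule = {}
    for op in ops:
        operands_ready = max((finish[d] for d in op.deps), default=0)
        start = max(operands_ready, unit_free.get(op.unit, 0))
        schedule[op.name] = (start, op.unit)   # (time slot, functional unit)
        finish[op.name] = start + op.latency
        unit_free[op.unit] = start + op.latency
    return schedule


# Hypothetical program: load two operands, multiply, activate, store.
program = [
    Op("load_w", "MEM", 2),
    Op("load_x", "MEM", 2),
    Op("matmul", "MXM", 4, ("load_w", "load_x")),
    Op("gelu", "VXM", 1, ("matmul",)),
    Op("store", "MEM", 2, ("gelu",)),
]

schedule = static_schedule(program)
for name, (cycle, unit) in schedule.items():
    print(f"cycle {cycle:2d}  {unit}  {name}")
```

Because every latency is known in advance, the schedule is a pure function of the program: running the scheduler twice yields the same cycle assignments, which is the property the hardware relies on.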

This determinism has a practical consequence for inference: since each forward pass through the model takes exactly the same number of clock cycles every time, time to first token (TTFT) and per-token output rate are both highly predictable. At the chip level, tail latency equals median latency, an unusual property for production AI infrastructure.[^9] Groq has cited time-to-first-token values in the 0.2 to 0.3 second range for 70-billion-parameter models. Measured through a public API, where network and queueing delays sit on top of chip execution, independent benchmarking by Artificial Analysis recorded median TTFT for Groq's Llama 3.3 70B endpoint at roughly 120 milliseconds with a P95 of 280 milliseconds in April 2026, the narrowest TTFT distribution of any major LLM API tracked by the firm.[^11][^25]

Compiler-as-orchestrator

The compiler is responsible for far more than instruction selection. It owns memory layout, register allocation, instruction issue timing, and data movement across the chip's east-west and north-south buses.[^4] Because tensors must arrive at each functional slice at a specific clock cycle, the compiler frequently inserts no-op cycles or staged transposes to delay an operand by the exact number of cycles needed to align with a downstream consumer. The result is a binary that resembles a long sequence of explicit time slots, each carrying a fixed instruction word that fans out across all 20 tiles in a slice.
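The operand-alignment step described above amounts to padding each early operand with enough idle cycles to match the latest one. The sketch below shows that calculation in isolation; the operand names and arrival cycles are hypothetical, and the real compiler may realize the delay with staged transposes instead of plain no-ops.

```python
def alignment_delays(arrival_cycles: dict[str, int]) -> dict[str, int]:
    """Cycles of delay to add per operand so all reach the consumer together."""
    latest = max(arrival_cycles.values())
    return {name: latest - cycle for name, cycle in arrival_cycles.items()}


# Hypothetical arrival times at a downstream functional slice.
arrivals = {"weights": 12, "activations": 17, "bias": 9}
delays = alignment_delays(arrivals)

print(delays)  # {'weights': 5, 'activations': 0, 'bias': 8}
```

After padding, every operand reaches the consumer at cycle 17, so the consuming instruction can be placed in exactly one time slot.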

A second consequence is that the chip itself is unusually small in terms of control logic. Groq engineers have reported that more than 90 percent of the die area on GroqChip1 is devoted to compute and SRAM, with control structures occupying less than 10 percent.[^4][^8] In contrast, modern out-of-order CPUs commonly spend 30 to 40 percent of their die on reorder buffers, branch predictors, and rename tables. A comparable Hopper-class GPU dedicates significant area to warp schedulers, L2 cache, and tensor memory accelerator (TMA) units. By offloading scheduling to the compiler, the LPU recovers that area for compute and on-die memory.

Debuggability and reproducibility

Deterministic execution also gives operators a property that is rare for AI hardware: bit-exact reproducibility of inference runs given the same model, prompt, and decoding parameters.[^6] Two LPU systems executing the same compiled binary will produce identical output token streams down to the floating-point bit pattern, since no part of the execution depends on dynamic state outside the compiled schedule. This is a useful feature for regulated industries that need audit trails over LLM outputs and for engineers debugging numerical issues, both of which are notoriously difficult on GPUs where kernel ordering can vary between runs.
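The reason operation ordering matters for bit-exactness is that floating-point addition is not associative. The short illustration below is generic Python and is not specific to any accelerator: it sums the same three values in two different orders, standing in for a reduction whose order varies between runs.

```python
from functools import reduce


def sum_in_order(values, order):
    """Add `values` in the index order given by `order`."""
    return reduce(lambda acc, i: acc + values[i], order, 0.0)


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values, [0, 1, 2])
reverse = sum_in_order(values, [2, 1, 0])

print(forward)             # 0.6000000000000001
print(reverse)             # 0.6
print(forward == reverse)  # False
```

A schedule that fixes the order of every accumulation removes this source of run-to-run variation, which is what makes bit-identical outputs possible.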

On-chip SRAM and memory bandwidth

The most unusual feature of the LPU's memory system is its complete reliance on on-chip SRAM for model weight storage, with no external DRAM or High Bandwidth Memory (HBM).[^1][^2] A single first-generation TSP contains approximately 220 to 230 MB of globally shared SRAM.[^4][^8] There is no cache hierarchy; every byte of memory is accessible at the same latency and bandwidth from any compute unit on the chip.

The bandwidth available from this on-chip SRAM exceeds 80 terabytes per second on the GroqChip1, and rises to 150 TB/s on the second-generation Groq 3.[^8][^16][^18] For comparison, the NVIDIA H100's HBM3 memory delivers approximately 3.35 terabytes per second, and the Vera Rubin GPU that ships alongside the LP30 in NVIDIA's reference rack reaches about 22 TB/s of HBM4 bandwidth per package.[^19] The LP30's on-chip bandwidth advantage over Rubin is therefore about a factor of seven on a per-die basis, which directly translates to faster weight loading during token generation.

LLM inference is constrained by memory bandwidth rather than raw compute in the single-sequence (low-batch) regime. Each new output token requires loading billions of model weights from memory to perform one forward pass. A processor with higher memory bandwidth can load those weights faster and produce tokens faster. This is the fundamental reason the LPU outperforms H100-based systems on single-request latency benchmarks.[^4][^12]
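A rough ceiling on single-request decode speed follows from this argument: if every token requires streaming all weights once, tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that bound to figures quoted in this article. It ignores KV-cache traffic, compute time, interconnect hops, and batching, and it treats "effective bandwidth" as a single number; how SRAM bandwidth aggregates across a sharded LPU deployment depends on the parallelism strategy and is not modeled here.

```python
def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s: float,
                                weight_bytes: float) -> float:
    """Upper bound on tokens/s when each token streams all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


WEIGHTS_70B_INT8 = 70e9   # about 70 GB at 8 bits per parameter
H100_HBM3 = 3.35e12       # bytes/s
GROQCHIP1_SRAM = 80e12    # bytes/s, per-chip on-chip bandwidth

h100_ceiling = decode_ceiling_tokens_per_s(H100_HBM3, WEIGHTS_70B_INT8)
sram_ceiling = decode_ceiling_tokens_per_s(GROQCHIP1_SRAM, WEIGHTS_70B_INT8)

print(round(h100_ceiling, 1))   # 47.9
print(round(sram_ceiling, 1))   # 1142.9
```

The single-H100 ceiling of roughly 48 tokens per second shows why low-batch GPU decode is bandwidth-limited, while the SRAM figure is only an order-of-magnitude indication of the headroom an SRAM-resident model has.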

The trade-off is capacity. At 230 MB per chip, a single LPU holds only a small fraction of a large language model's weights. A Llama 3 70B model stored in 8-bit quantization requires approximately 70 GB of memory. Running that model on first-generation LPUs therefore requires distributing the weights across hundreds of chips, each holding a shard of the model. Groq addresses this through the GroqRack system and, on the second-generation hardware, the higher-density LPX rack.[^16][^19]
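The capacity constraint can be turned into a minimum chip count by dividing model size by per-chip SRAM. The sketch below uses decimal megabytes and the figures quoted in this article. It counts weights only, so activations, KV cache, and instruction storage are left out; that omission is one plausible reason deployed chip counts are higher than this lower bound.

```python
def min_chips_for_weights(model_mb: int, sram_mb_per_chip: int) -> int:
    """Smallest number of chips whose combined SRAM can hold the weights."""
    return -(-model_mb // sram_mb_per_chip)  # ceiling division on integers


MODEL_70B_INT8_MB = 70_000   # about 70 GB at 8 bits per parameter

gen1_min = min_chips_for_weights(MODEL_70B_INT8_MB, 230)  # GroqChip1
gen2_min = min_chips_for_weights(MODEL_70B_INT8_MB, 500)  # LP30

print(gen1_min)  # 305
print(gen2_min)  # 140
```

On these numbers, a 70B model at 8 bits needs at least 305 first-generation chips for weights alone, and at least 140 LP30 dies, which fits within one 256-die LPX rack.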

Memory comparison

Memory characteristic	GPU (HBM-based)	Groq LPU (SRAM-based)
Memory technology	HBM2e/HBM3/HBM3e/HBM4	On-chip SRAM
Bandwidth per chip	3 to 22 TB/s	80 TB/s (Gen 1), 150 TB/s (Gen 2)
Latency	100 to 400 ns	1 to 5 ns
Capacity per chip	80 to 288 GB	230 MB (GroqChip1), 500 MB (LP30)
Access pattern	Variable (cache-dependent)	Fixed (compiler-determined)
Power per access	Higher (off-chip)	Lower (on-chip)
Failure modes	DRAM bit flips, refresh stalls	SRAM single-event upset only

Why no HBM

The decision to omit HBM is technical as much as economic. HBM stacks impose a packaging cost (interposer, microbumps, base die) and a power cost (PHY interfaces at the controller side) that scale linearly with the number of stacks. They also introduce variable latency, because DRAM rows must be opened, refreshed, and arbitrated between bank groups. By keeping all weights on-die, the LPU avoids these costs and achieves a bandwidth-per-watt ratio that, on a per-token basis, is several times better than HBM-based GPUs.[^4][^21] The cost is a hard upper bound on how much state any single chip can hold, which is why the rack-level fabric and compiler-orchestrated sharding are central to the LPU value proposition.

Key specifications

The two generations of the LPU have the following headline specifications:

Specification	GroqChip1	LP30 (Groq 3, 2026)
INT8 performance	Up to 750 TOPS	Not separately disclosed
FP16 performance	188 TFLOPS (at 900 MHz)	Not separately disclosed
FP8 performance	Not natively supported	~1.23 PFLOPS per die; ~315 PFLOPS per 256-die LPX rack
On-chip SRAM	230 MB	500 to 512 MB
Memory bandwidth	80 TB/s	150 TB/s
External HBM	None	None
Fabrication	GlobalFoundries 14 nm	Samsung Foundry 4 nm (SF4X)
Die size	25 x 29 mm (725 mm^2)	Not disclosed
Transistor count	~26.8 billion	~98 billion
Off-die SerDes	16 lanes per chip	96 lanes at 112 Gbps; 2.5 TB/s bidirectional
Nominal clock	900 MHz	Not disclosed
Reference rack	GroqRack (72 chips, 9 nodes)	LPX rack (256 chips, 32 trays of 8 LPUs)

Neither generation uses high-bandwidth memory. Both rely entirely on on-chip SRAM, which provides extremely high bandwidth but limits total memory capacity per chip. The LP30 more than doubles SRAM per die and nearly doubles bandwidth per die relative to the GroqChip1, courtesy of the move from a 14 nm process to a leading-edge 4 nm process.[^16][^18]

GroqRack and multi-chip scaling

GroqRack is Groq's first-generation rack-level hardware product for on-premises and colocation deployments. The system connects multiple TSP chips in a high-radix mesh topology that extends the deterministic scheduling model across a distributed fabric.[^6][^9] The compiler that schedules a single chip also schedules an entire rack, or several racks, as a single logical system. The Gen 2 successor is the LPX rack, which densifies the same fabric concept around the LP30's 96-lane SerDes complement and is the basis for NVIDIA's Vera Rubin reference racks.[^16][^17][^19]

Topology

From node to cluster, interconnection is organized as follows:

Layer	Composition
Node (Gen 1)	8 TSPs, each connected to 7 local neighbors plus 4 global links to adjacent nodes
Rack (Gen 1 GroqRack)	9 nodes; 72 TSPs total
Compute tray (Gen 2 LPX)	8 liquid-cooled LP30 LPUs with 4 GB of aggregate SRAM
Rack (Gen 2 LPX)	32 compute trays; 256 LP30 LPUs; 128 GB aggregate on-rack SRAM; ~40 PB/s aggregate SRAM bandwidth
Largest cluster	10,440 TSPs across 145 racks; max 5 network hops between any two chips

The Gen 2 LPX rack abandons the 8-chip node structure of GroqRack in favor of a denser direct mesh between trays, taking advantage of the LP30's 96-lane SerDes complement.[^16][^19] Each LP30 can communicate with several dozen peers at full SerDes bandwidth without traversing a switch, which both reduces the hop count for collective operations and simplifies the compiler's view of the rack as a flat fabric. NVIDIA reports that the rack-wide control plane is anchored by an FPGA paired with an x86 host CPU, which handles scheduling hand-off to the LPU binaries but plays no role in per-cycle execution.[^16]
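The composition figures in the topology table follow from a handful of multiplications, shown here as a quick cross-check. All inputs are taken from this article; the only assumptions are the use of 500 MB (not 512 MB) per LP30 and decimal unit conversions.

```python
TSPS_PER_NODE = 8
NODES_PER_GROQRACK = 9
RACKS_IN_LARGEST_CLUSTER = 145

LPUS_PER_TRAY = 8
TRAYS_PER_LPX_RACK = 32
LP30_SRAM_MB = 500          # assumption: the lower of the two quoted values
LP30_SRAM_TB_PER_S = 150

groqrack_tsps = TSPS_PER_NODE * NODES_PER_GROQRACK
cluster_tsps = groqrack_tsps * RACKS_IN_LARGEST_CLUSTER

lpx_lpus = LPUS_PER_TRAY * TRAYS_PER_LPX_RACK
lpx_sram_gb = lpx_lpus * LP30_SRAM_MB // 1000
lpx_sram_pb_per_s = lpx_lpus * LP30_SRAM_TB_PER_S / 1000

print(groqrack_tsps, cluster_tsps)               # 72 10440
print(lpx_lpus, lpx_sram_gb, lpx_sram_pb_per_s)  # 256 128 38.4
```

The last figure, 38.4 PB/s, is the sum of per-die SRAM bandwidth across a rack and is close to the roughly 40 PB/s aggregate quoted in the table.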

Plesiosynchronous coordination

For workloads that span multiple LPUs, Groq uses a plesiosynchronous chip-to-chip protocol to cancel natural clock drift and align hundreds of LPUs to act as a single logical core.[^9] Periodic software synchronization adjusts for crystal-based clock drift, enabling not just compute scheduling but also network scheduling across the entire system. Cross-chip communication is handled by the same compiler-driven approach used for single-chip execution: the compiler determines exactly when each chip will need data from a neighbor and schedules data injection to arrive at the destination precisely when the receiving chip needs it. This eliminates routing tables, congestion control, and network back-pressure mechanisms.

To maintain synchronization across chips, the system uses a Hardware Aligned Counter (HAC) that provides a global time reference, supplemented by software-aligned counters and deskew instructions that realign execution at epoch boundaries after multi-hop data transfers.[^9] The HAC is essentially a free-running counter whose value the compiler can read and act on, allowing the binary to insert deskew waits at chip boundaries without relying on hardware barriers. In practice, this means that an all-reduce or pipeline-parallel collective in a Groq cluster is implemented as a precisely choreographed exchange of packets rather than as a software collective on top of a switched fabric.
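The effect of periodic deskew can be sketched with a toy model of drifting counters. Everything in this example is hypothetical: the drift values, the epoch length, and the rule that faster chips idle until the slowest one catches up are illustrative assumptions, and the actual HAC protocol is more involved than this.

```python
def local_counter(reference_cycles: int, drift_ppm: int) -> int:
    """Local counter value after `reference_cycles` of an ideal clock."""
    return reference_cycles + reference_cycles * drift_ppm // 1_000_000


def deskew_waits(counters: dict[str, int]) -> dict[str, int]:
    """Cycles each chip idles so that all chips resume on the same tick."""
    slowest = min(counters.values())
    return {chip: value - slowest for chip, value in counters.items()}


EPOCH_CYCLES = 1_000_000                                # hypothetical epoch
DRIFT_PPM = {"chip0": 0, "chip1": 50, "chip2": -20}     # hypothetical drift

counters = {chip: local_counter(EPOCH_CYCLES, ppm)
            for chip, ppm in DRIFT_PPM.items()}
waits = deskew_waits(counters)

print(counters)  # {'chip0': 1000000, 'chip1': 1000050, 'chip2': 999980}
print(waits)     # {'chip0': 20, 'chip1': 70, 'chip2': 0}
```

In this model a drift of tens of parts per million accumulates to tens of cycles per million-cycle epoch, which is why realignment has to happen regularly for a cycle-accurate schedule to stay valid across chips.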

Forward error correction

For error handling, the system uses forward error correction rather than packet retransmission.[^9] Retransmission would break the static schedule by introducing unpredictable latency; correcting errors in transit preserves timing guarantees. Each rack also includes spare TSP nodes for failover, which the compiler can route around at recompile time if a chip is taken out of service.
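The timing argument for forward error correction can be made concrete with a small example. The sources cited here do not specify which code Groq uses, so the sketch below swaps in a textbook Hamming(7,4) code purely for illustration: it corrects any single-bit error in a 7-bit word in one fixed-cost decode step, with no round trip to the sender.

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error, or 0
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
sent = hamming74_encode(data)
received = list(sent)
received[4] ^= 1                      # flip one bit in transit
recovered = hamming74_decode(received)

print(sent)       # [0, 1, 1, 0, 0, 1, 1]
print(recovered)  # [1, 0, 1, 1]
```

The decode path does the same amount of work whether or not an error occurred, so the receiver's timing is unaffected, which is the property a static schedule needs.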

Multi-chip deployment for large models

Running a 70B-parameter model at 8-bit precision requires approximately 576 LPU chips across eight first-generation GroqRacks.[^4] The newer LPX rack format, introduced with the second-generation hardware, packs 256 LPUs into a single rack and provides 128 GB of aggregate on-rack SRAM, comfortably holding the full Llama 3.3 70B model's weights at FP8 within a single rack.[^16][^19] Larger or mixture-of-experts models, such as the Llama 4 Maverick variant (17B active parameters across 128 experts) or 400B-parameter DeepSeek-class deployments, can exceed a full LPX rack's aggregate SRAM and require multi-rack configurations with more complex sharding strategies. Groq's stated approach for such models is to combine tensor parallelism across superlanes inside a chip, pipeline parallelism across chips inside a rack, and data parallelism across racks, with the compiler emitting a single binary that spans the entire deployment.[^4][^19]

By the end of the first quarter of 2025, Groq had deployed more than 108,000 first-generation LPUs across its data centers (including a 19,000-LPU cluster in Dammam, Saudi Arabia, operated under the HUMAIN partnership), making it the largest non-hyperscaler AI inference deployment publicly disclosed.[^26][^27]

Performance benchmarks

The LPU has consistently ranked first among public cloud providers on output token throughput benchmarks, particularly for medium-scale models in the 7B to 70B parameter range.[^10][^11][^12]

Model	Decode mode	Tokens per second	Source
Llama 3 8B	Standard	1,300+	Early 2024 single-request tests[^12]
Llama 3 70B	Standard	800+	VentureBeat (April 2024)[^12]
Llama 3.3 70B	Standard	276	Artificial Analysis (late 2024)[^10][^11]
Llama 3.3 70B	Speculative decoding	1,665	Artificial Analysis (late 2024)[^10][^11]
Llama 3.3 70B	Median throughput (2026)	~330	Artificial Analysis (April 2026)[^25]
Llama 4 Scout	Standard	460-625	GroqCloud / R&D World (April-October 2025)[^28][^29]
Llama 4 Maverick	Production tuned	1,200+	Voiceflow / NeuraPulse benchmarks (early 2026)[^30]
Mixtral 8x7B	Standard	500-727	Groq early demos / NeuraPulse (Feb 2024-2026)[^31]
Qwen-3 32B	Standard	~662	GroqCloud (2026)[^23]
DeepSeek R1 70B distill	Standard	800+	Groq newsroom (2025)

The April 2024 VentureBeat report noted that Groq was serving Meta's Llama 3 at over 800 tokens per second, which attracted broad public attention because consumer-grade GPU servers typically produce 20 to 60 tokens per second for similar model sizes.[^12] With speculative decoding enabled on the updated first-generation chip, Artificial Analysis recorded a roughly six-times increase over the non-speculative baseline.[^10][^11] By early 2026, third-party benchmarking sites had recorded Groq running Llama 3.3 70B at roughly five times the throughput of the same model on a comparable cloud H100 deployment, and roughly four to five times faster than reference DeepSeek R1 70B distill inference on commodity GPUs.[^23]

Customer benchmarks and adoption

The LPU has been adopted as the primary inference backend for several notable open-weight and partner deployments. Groq publicly hosts both DeepSeek R1 distilled variants and the Qwen-3 family on GroqCloud, with DeepSeek R1 70B running at over 800 tokens per second per request, several times the typical GPU baseline. Llama 4 Scout was launched day-zero on GroqCloud on April 5, 2025, at over 460 tokens per second; on April 29, 2025, Meta and Groq announced a collaboration to power Meta's official Llama 4 API with the LPU, citing throughput up to 625 tokens per second.[^28][^29] Saudi Arabia's HUMAIN sovereign AI initiative selected the LPU as the basis for a multi-hundred-megawatt inference build-out anchored on a Groq-operated data center in Dammam; the initial facility was brought online in eight days in December 2024, and a follow-on $1.5 billion commitment was announced at LEAP 2025 in February 2025.[^26][^27] NVIDIA's GTC 2026 keynote highlighted the LPU's deterministic latency as the basis for the LP30's role in the Vera Rubin platform.[^16][^17]

Latency characteristics

Metric	LPU value
Time-to-first-token (70B models, single request)	0.2 to 0.3 seconds (steady state)
Median TTFT (Llama 3.3 70B, Artificial Analysis April 2026)	~120 ms
P95 TTFT (Llama 3.3 70B, Artificial Analysis April 2026)	~280 ms
Per-token generation latency	Roughly 0.6 to 3.6 ms on Llama 3.3 70B (1,665 to 276 tokens/s), predictable
Throughput consistency	No chip-level variance between requests
Tail latency vs median latency	Identical (single-tenant)

With GPU-based inference, tail latency (the worst-case response time) can be several times higher than median latency due to cache misses, memory contention, and scheduling delays. With Groq's LPU, the tail latency equals the median latency because execution is fully deterministic. In multi-tenant cloud deployments, queueing at the orchestration layer adds variability that is independent of the chip itself, which is why public API benchmarks (such as Artificial Analysis) show a small TTFT spread even on Groq endpoints.[^25]
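The distinction between chip-level determinism and API-level spread can be shown with a toy latency model. All numbers below are hypothetical: a fixed on-chip time per request plus a made-up set of queueing delays. The percentile helper uses the nearest-rank definition.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile (integer p in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


CHIP_TIME_MS = 100  # hypothetical fixed execution time on the chip
# Hypothetical queueing delays added by a shared orchestration layer.
QUEUE_DELAYS_MS = [0, 0, 5, 5, 10, 10, 20, 20, 20, 30,
                   30, 40, 40, 60, 60, 80, 100, 120, 150, 180]

chip_only = [CHIP_TIME_MS for _ in QUEUE_DELAYS_MS]
end_to_end = [CHIP_TIME_MS + q for q in QUEUE_DELAYS_MS]

print(percentile(chip_only, 50), percentile(chip_only, 95))    # 100 100
print(percentile(end_to_end, 50), percentile(end_to_end, 95))  # 130 250
```

With a constant chip time, median and P95 coincide; any gap between them in the end-to-end numbers comes entirely from the queueing term.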

Energy efficiency

Energy consumption per token is reported at 1 to 3 joules for LPU-based inference, against 10 to 30 joules on H100-based systems, reflecting the bandwidth efficiency of avoiding off-chip memory accesses.[^21] The elimination of caches, branch predictors, and reorder buffers from the LPU design reduces transistor count dedicated to control logic. In a traditional GPU, these reactive components can consume 30 to 40 percent of the chip's power budget. The LPU redirects that silicon area and power toward compute and SRAM, improving the ratio of useful computation to total power consumption. NVIDIA's GTC 2026 presentation described the combined LP30 plus Vera Rubin reference rack as delivering roughly 35 times more useful inference tokens per megawatt than Blackwell NVL72-based inference, at a target price point in the neighborhood of $45 per million tokens, although the comparison covers a system-level metric rather than chip-versus-chip.[^18][^19]
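The joules-per-token ranges quoted above translate directly into tokens per kilowatt-hour, which is often the easier number to reason about for capacity planning. The sketch below performs only that unit conversion; it says nothing about cooling, networking, or host overheads.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6 MJ


def tokens_per_kwh(joules_per_token: float) -> float:
    """Tokens generated per kilowatt-hour at a given energy per token."""
    return JOULES_PER_KWH / joules_per_token


# Ranges quoted in the article: (best case, worst case) joules per token.
lpu_range = (tokens_per_kwh(3), tokens_per_kwh(1))
h100_range = (tokens_per_kwh(30), tokens_per_kwh(10))

print(lpu_range)   # (1200000.0, 3600000.0)
print(h100_range)  # (120000.0, 360000.0)
```

On the quoted ranges, the LPU figures work out to about ten times more tokens per kilowatt-hour than the H100 figures at matching ends of each range.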

Comparison with NVIDIA H100 and Cerebras WSE-3

Three architectures are most commonly compared in the AI inference accelerator market: the NVIDIA H100 GPU, the Cerebras WSE-3, and the Groq LPU. Each reflects a different set of engineering tradeoffs.

Dimension	Groq LPU (Gen 1)	NVIDIA H100	Cerebras WSE-3
Process node	14 nm (Gen 1), 4 nm (Gen 2)	4 nm (TSMC)	5 nm (TSMC)
On-chip memory	~230 MB SRAM per chip	~50 MB L2 cache	44 GB SRAM
Off-chip memory	None (model sharded across chips)	80 GB HBM3	None
Memory bandwidth	80 TB/s (on-chip)	3.35 TB/s (HBM3)	~21 PB/s (on-chip)
Compute (FP16)	188 TFLOPS per chip	989 TFLOPS	~125 PFLOPS peak (wafer-scale)
Execution model	Deterministic, compiler-scheduled	Dynamic, hardware-scheduled	Dataflow, compiler-scheduled
Chip/wafer area	25 x 29 mm	~814 square mm	Full 300 mm wafer
Model capacity	Requires multi-chip for 7B+	80 GB per card	44 GB on-wafer SRAM
Token throughput (70B)	276 tokens/s standard, 1,665 with speculative	60 to 200 tokens/s	Comparable to Groq
Best use case	Low-latency, single-request inference	Training, high-batch inference, general workloads	Extremely large model inference, research

The NVIDIA H100 is the dominant data center GPU for both training and inference. Its 80 GB of HBM3 memory allows it to hold large models on a single card without sharding, simplifying deployment. It supports high batch sizes well, making it more efficient for throughput-optimized inference serving many simultaneous users. Its primary disadvantage relative to the LPU is memory bandwidth: at 3.35 TB/s, it loads model weights much more slowly per token than the LPU's on-chip SRAM, resulting in higher per-token decode latency in low-batch scenarios.

The Cerebras WSE-3 takes the on-chip memory concept further than Groq by using a full 300 mm silicon wafer as a single processor. The WSE-3 integrates 44 GB of on-chip SRAM across 4 trillion transistors on a TSMC 5 nm process. This allows Cerebras to hold extremely large models in on-chip memory without any multi-chip sharding, and to process them at the full bandwidth of SRAM. The trade-off is manufacturing complexity: wafer-scale fabrication has lower yields than standard chip fabrication, and the physical scale of the chip complicates system integration and cooling.[^22]

SambaNova Systems takes a third architectural approach with its Reconfigurable Dataflow Unit (RDU), which uses a three-tiered memory hierarchy of SRAM, HBM, and DRAM. This allows SambaNova's chips to hold larger models per physical chip count than Groq, though with lower per-chip bandwidth than a pure SRAM design.

Comparison with Google TPU

Because Groq's founder came from the Google TPU team, the TPU is the most natural architectural comparison.

Feature	Google TPU	Groq LPU
Primary focus	Training and inference	Inference only
Architecture	Systolic array	Functionally sliced streaming
Memory	HBM-based	SRAM only
Execution model	Dynamic scheduling	Fully deterministic
Availability	Google Cloud only	GroqCloud and on-premises
Core design	Multi-core	Single-core
Compiler role	Standard runtime scheduling	Central compile-time scheduling
Tail latency	Variable	Equal to median

The TPU optimizes for flexibility across both training and inference, while the LPU sacrifices training capability entirely to achieve superior inference latency and determinism.

Software stack

The LPU's hardware design only makes sense in conjunction with the Groq Compiler, which converts higher-level model representations into the static schedule the chip executes.[^4][^32] The toolchain is anchored on three components and an open-source flow runner.

Front-end and GroqFlow

The Groq Compiler accepts neural network graphs from standard ML frameworks. As of early 2026, the supported entry points include PyTorch (via TorchScript and an ONNX export path), TensorFlow SavedModel, and the company's own Groq Runtime Format (GRF), which is the canonical input for production-grade compiles. The front-end normalizes operators, performs constant folding, and rewrites attention and feed-forward layers into shapes the back-end can schedule efficiently.

Groq publishes the front-end orchestration layer as GroqFlow, an open-source Python tool that decomposes PyTorch and ONNX models into torch-mlir or ONNX-MLIR dialects and then lowers them into Groq's internal GTen operation set before invoking the Groq Compiler proper.[^32] GroqFlow is the public-facing entry point for developers compiling custom models against the LPU.

Middle layer: tiling and slicing

The middle layer is responsible for mapping operators to functional slices and assigning data layouts that minimize cross-slice traffic. A matrix multiplication, for example, is decomposed into chunks that fit within the 320-element SIMD width and that align with the MXM's 320x320 multiply-accumulate array. Activation functions are mapped to the VXM; transposes and broadcasts are mapped to the SXM; and weight loads from SRAM are issued from the MEM slice in a strict producer-consumer order with the consuming functional unit.[^4][^9]
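The tiling step can be illustrated by counting how many 320x320 blocks a weight matrix occupies and how much of the padded area is useful. This is a simplified sketch: the 8192x8192 matrix is an illustrative size typical of a 70B-class hidden dimension, and the real compiler's decomposition may differ.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def mxm_tiling(rows: int, cols: int, tile: int = 320):
    """Return (number of tile x tile blocks, fraction of padded area used)."""
    row_tiles = ceil_div(rows, tile)
    col_tiles = ceil_div(cols, tile)
    padded_elements = (row_tiles * tile) * (col_tiles * tile)
    return row_tiles * col_tiles, rows * cols / padded_elements


# Illustrative weight matrix; the dimensions are an assumption.
tiles, utilization = mxm_tiling(8192, 8192)

print(tiles)                  # 676
print(round(utilization, 3))  # 0.969
```

Because 8192 is not a multiple of 320, each dimension is padded to 26 tiles (8320 elements), so about 3 percent of the multiply-accumulate slots carry padding in this example.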

Back-end: cycle-accurate scheduling

The back-end performs the cycle-accurate scheduling that gives the chip its determinism. It produces a long binary that resembles a sequence of explicit time slots, each containing a vector instruction word per slice. Because the schedule covers everything the chip will do for the model, it is large: a 70B-parameter Llama-class model compiled for a full LPX rack produces a binary on the order of tens of gigabytes, which is loaded into rack-level SRAM at startup.[^4] The trade-off is that loading a new model is a non-trivial operation that the orchestration layer must plan around.

GroqCloud and orchestration

For the public cloud product, Groq runs a fleet of LPX (and previously GroqRack) systems behind a unified API that exposes pre-compiled models. New models go through an internal compile pipeline before being placed into rotation. Customers running custom or fine-tuned weights submit them to the compiler service, which produces a binary for the target rack configuration. Recompiles for new sequence lengths, batch sizes, or quantization formats are required, in contrast to a GPU stack where these can often be adjusted at runtime.

Limitations

The LPU's SRAM-only memory architecture, while responsible for its bandwidth advantages, also imposes constraints that GPU-based systems do not face.

The most significant constraint is per-chip capacity. At 230 MB per chip on the first-generation hardware, even a 7-billion-parameter model stored in 8-bit quantization (roughly 7 GB) cannot fit on a single LPU. Running any practically useful LLM requires distributing weights across dozens to hundreds of chips. For a 70B model, this means 576 chips on Gen 1 hardware or 256 chips in a single Gen 2 LPX rack.[^4][^19] This chip count requirement drives up infrastructure cost and physical footprint compared to a single H100 or pair of H100s that can hold the same model entirely in on-card HBM.
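The capacity argument reduces to a ceiling division. The following sketch computes a weights-only lower bound on chip count, assuming 1 MB = 10^6 bytes; `chips_for_weights` is a hypothetical helper name.

```python
def chips_for_weights(params: int, bits_per_param: int, sram_mb_per_chip: int) -> int:
    """Minimum chips whose combined SRAM holds the weights (1 MB = 10**6 bytes)."""
    weight_bytes = params * bits_per_param // 8
    sram_bytes = sram_mb_per_chip * 10**6
    return -(-weight_bytes // sram_bytes)  # ceiling division


GEN1_SRAM_MB = 230  # first-generation per-chip SRAM quoted in the text

small = chips_for_weights(7 * 10**9, 8, GEN1_SRAM_MB)   # 7B model, 8-bit weights
large = chips_for_weights(70 * 10**9, 8, GEN1_SRAM_MB)  # 70B model, 8-bit weights
print(small, large)
```

This counts weights only. A real deployment also has to hold activations and KV-cache state in the same SRAM, so actual chip counts are higher than this bound.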

The deterministic scheduling model, while powerful for inference, creates challenges for workloads with dynamic or irregular computation patterns. Tasks like training, where gradient shapes and batch sizes vary, do not map cleanly onto the compiler-scheduled execution model. Groq has publicly positioned the LPU as an inference-only product, with no stated plans to support training workloads.

Very large models, such as 405-billion-parameter or mixture-of-experts variants that exceed even a full LPX rack's 128 GB of aggregate SRAM, require multi-rack deployments with more complex sharding strategies. The communication overhead across rack boundaries is lower than off-chip memory access but still higher than within-rack transfers, which can reduce efficiency for models at the largest scale.[^16]

Because compilation is fixed ahead of time, a new model cannot serve requests until a compilation pass has completed. For standard public models this is a one-time cost, but fine-tuned or custom models require a recompile, adding time to deployment. Variable-length attention, in particular, requires the compiler to emit a specialized binary for each supported context-length window; serving requests across many sequence lengths typically involves preparing several binaries and routing at the orchestration layer rather than at the chip.
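Routing at the orchestration layer can be sketched as a lookup for the smallest precompiled window that fits a request. The window sizes and the `pick_window` function are assumptions made for illustration, not documented GroqCloud behavior.

```python
from bisect import bisect_left
from typing import Optional, Sequence

# Hypothetical context-length windows, one precompiled binary per window.
# Must be sorted in ascending order for bisect to work.
WINDOWS = (2048, 8192, 32768)


def pick_window(prompt_tokens: int, max_new_tokens: int,
                windows: Sequence[int] = WINDOWS) -> Optional[int]:
    """Return the smallest window that fits the whole request, or None if none does."""
    needed = prompt_tokens + max_new_tokens
    idx = bisect_left(windows, needed)
    return windows[idx] if idx < len(windows) else None
```

A request of 2,000 prompt tokens plus 48 generated tokens fits the 2,048 window exactly, while one more generated token moves it to the 8,192 binary. Requests that exceed the largest window return `None`, and a real router would have to reject or truncate them.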

Finally, the lack of HBM means that mixture-of-experts models, where only a subset of weights is activated per token, cannot benefit from the kind of dynamic weight loading that an HBM-based system can perform. Either all experts must be resident in SRAM (which limits how many experts a deployment can host) or expert weights must be sharded across more chips, increasing the network traffic per token. NVIDIA's GTC 2026 messaging acknowledged this constraint and described the LP30 as a specialized decode-phase co-processor rather than a general-purpose inference engine, with the parallel Vera Rubin GPU complement handling prefill, mixture-of-experts routing, and any workload that needs HBM-scale capacity.[^17][^19]

Groq 3 LPU and the NVIDIA partnership

The Groq 3 LPU (designated LP30) was unveiled by NVIDIA at GTC 2026 in March, three months after the December 2025 NVIDIA-Groq licensing agreement.[^15][^16][^17] The Groq 3 is fabricated on Samsung's 4 nm node (SF4X) and increases on-chip SRAM to 500 to 512 MB and bandwidth to 150 TB/s per chip.[^16][^18][^20] It is paired with NVIDIA's Vera Rubin GPU platform as a dedicated decode-phase co-processor, with the LPX rack format integrating 256 LPUs across 32 trays of 8 liquid-cooled LP30 dies each.[^16][^19]

The LPX rack, when paired with NVIDIA's Vera Rubin CPU-GPU super-rack, has been described as offering a 35x throughput-per-megawatt improvement over Blackwell NVL72-based inference for trillion-parameter models, with NVIDIA quoting a target inference price of approximately $45 per million tokens.[^18][^19] Industry analysts expect future NVIDIA platforms to integrate the deterministic LPU IP more tightly with NVIDIA's GPU silicon; NVIDIA's GTC 2026 roadmap calls for the LP35 (with NVFP4 floating-point support to match the Rubin Ultra GPU's data formats) in the second half of 2027, and the LP40 in 2028 alongside the Rosa CPU and Feynman GPU, with the LP40 expected to use NVLink natively for tighter GPU-LPU integration.[^33][^34] The business and corporate aspects of the NVIDIA agreement are covered in Groq.

Deal structure and IP licensing

The December 24, 2025 agreement is structured as a non-exclusive technology licensing deal worth approximately $20 billion in cash and equity, not a traditional acquisition.[^13][^14][^15] NVIDIA paid for access to Groq's LPU intellectual property and the right to integrate LP30-class engines into its own systems, while leaving the Groq corporate entity intact. Groq founder Jonathan Ross, president Sunny Madra, and most of the chip and compiler engineering teams transferred to NVIDIA, while Simon Edwards took over as Groq's CEO and the remaining staff continued to operate GroqCloud and existing sovereign deployments.[^14] Trade press has characterized the structure as a way to obtain the people and IP of an acquisition without triggering the antitrust review a full takeover would invite; U.S. senators including Elizabeth Warren and Richard Blumenthal publicly questioned whether the structure constitutes a de facto acquisition designed to avoid regulatory scrutiny.[^14]

LPX rack and Vera Rubin integration

The LPX is a single-rack reference configuration that packs 256 LP30 dies into one chassis across 32 liquid-cooled compute trays of 8 LPUs each, with each tray contributing 4 GB of on-tray SRAM toward a 128 GB rack-wide total. The full rack delivers roughly 315 PFLOPS of FP8 compute and approximately 40 petabytes per second of aggregate fabric bandwidth, with a chassis-level FPGA paired with an x86 host CPU providing the control plane that hands compiled binaries off to the LPUs.[^16][^19] In NVIDIA's Vera Rubin reference design, an LPX rack sits beside a Vera Rubin CPU-GPU super-rack and acts as a decode-phase co-processor. The intended division of labor is that the GPU rack handles prefill, mixture-of-experts routing, and any large-batch workloads that benefit from HBM capacity, while the LPX rack handles single-stream or low-batch decoding where its deterministic latency dominates.[^17][^19] Both racks share a common control plane and present as one logical inference cluster to client software. First LPX shipments to NVIDIA reference customers are targeted for the third quarter of 2026.[^16][^18]
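The rack totals follow from the per-tray figures, as this short check shows. It assumes the upper end of the 500 to 512 MB per-chip SRAM range quoted for the LP30 and takes 1 GB = 1024 MB; the constant names are illustrative.

```python
TRAYS_PER_RACK = 32
LPUS_PER_TRAY = 8
SRAM_MB_PER_LPU = 512  # assumed: upper end of the quoted 500 to 512 MB range

lpus_per_rack = TRAYS_PER_RACK * LPUS_PER_TRAY
sram_gb_per_tray = LPUS_PER_TRAY * SRAM_MB_PER_LPU / 1024  # 1 GB = 1024 MB
sram_gb_per_rack = TRAYS_PER_RACK * sram_gb_per_tray

print(lpus_per_rack, sram_gb_per_tray, sram_gb_per_rack)
```

With 500 MB per chip instead, the same arithmetic gives a slightly smaller rack total, so the 128 GB figure corresponds to the 512 MB end of the range.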

Strategic context

For NVIDIA, the deal closes a long-standing gap in its inference story. Hopper- and Blackwell-class GPUs are dominant in training and high-batch inference, but they have repeatedly trailed Groq, Cerebras, and SambaNova on the single-request latency benchmarks that matter to consumer-facing LLM products.[^21] Licensing the LPU IP gives NVIDIA a credible answer to that workload without abandoning its GPU roadmap. For Groq, the deal monetizes years of architectural research while leaving the company with a continuing operating business, although it inevitably re-orients the LPU around NVIDIA's product cadence rather than Groq's own.[^14][^33]

References
