The Groq LPU (Language Processing Unit) is a custom silicon inference accelerator designed by Groq, an American AI chip company founded in 2016. The LPU is purpose-built for running large language model (LLM) inference workloads with deterministic, low-latency execution. Unlike conventional GPUs, the LPU eliminates hardware-level non-determinism by delegating all scheduling to a compiler, which arranges every instruction and data movement down to individual clock cycles before execution begins. This approach allows the chip to sustain extremely high memory bandwidth from on-chip SRAM, delivering output token rates in single-request, low-batch serving that substantially exceed those of GPU-based inference systems.
This article focuses on the LPU as a chip product: its silicon architecture, the underlying Tensor Streaming Processor design, the GroqRack scaling fabric, and quantitative comparisons with rival accelerators. For coverage of Groq Inc. as a company, including its founding history, leadership, GroqCloud business, funding rounds, the Saudi Arabia partnership, and the December 2025 NVIDIA licensing agreement, see Groq.
Groq's first-generation chip, originally called the Tensor Streaming Processor (TSP), was fabricated on a 14 nm process node by GlobalFoundries. The second-generation chip, known as Groq 3 or LP30, is fabricated on Samsung's 4 nm node and was unveiled by NVIDIA at GTC 2026 following the $20 billion technology licensing agreement between the two companies.
The architecture was originally introduced as the Tensor Streaming Processor (TSP), with the first silicon codenamed Alan. It was initially marketed to high-performance computing (HPC) and financial services customers under the TSP name, where its deterministic timing was useful for low-latency analytics. The breakthrough of ChatGPT in 2022 shifted the product focus toward LLM inference, and Groq rebranded the processor as a Language Processing Unit to reflect the dominant use case. The hardware itself remained the TSP; LPU is the productized name for the same underlying engine.
The LPU is a functionally sliced processor: instead of the conventional core-based layout found in CPUs and GPUs, compute units of the same type are grouped into full-height vertical columns, or slices, laid out side by side across the chip. Each TSP contains four types of functional slices:
| Slice | Function |
|---|---|
| MXM | Matrix multiply operations |
| SXM | Shift and rotate vector operations |
| VXM | Vector arithmetic |
| MEM | Memory read and write |
A horizontal instruction control unit (ICU) dispatches instructions across all slices in parallel. Each slice is divided into 20 tiles, and each tile processes 16 vector elements simultaneously, giving a total of 320 SIMD lanes per TSP. The first-generation chip measures 25 by 29 millimeters and operates at a nominal clock frequency of 900 MHz, delivering more than one teraoperation per second per square millimeter of silicon on 14 nm process technology.
The second-generation Groq 3 chip is fabricated on Samsung's 4 nm node, improving density and power efficiency. Groq has reported 188 TFLOPS of FP16 throughput and 750 TOPS of INT8 throughput from the first-generation LPU, with the Groq 3 raising on-chip SRAM to 500 MB and bandwidth to 150 TB/s.
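As a rough consistency check, the lane count and compute density quoted above can be reproduced from the stated figures. The following sketch (Python, purely illustrative) performs the arithmetic for the first-generation chip.

```python
# Worked arithmetic from the GroqChip1 figures quoted above.
tiles_per_slice = 20            # tiles per functional slice
lanes_per_tile = 16             # vector elements processed per tile per cycle
print(tiles_per_slice * lanes_per_tile)   # 320 SIMD lanes per TSP

die_area_mm2 = 25 * 29          # 725 mm^2 die (25 mm x 29 mm)
int8_tops = 750                 # reported INT8 throughput
print(int8_tops / die_area_mm2) # ~1.03 TOPS per mm^2, i.e. just over one
                                # teraoperation per second per square millimeter
```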
Unlike GPUs, which contain thousands of small cores, or Google TPUs, which use a systolic array architecture with multiple compute tiles, the LPU is fundamentally a single-core processor. The chip presents one programmable execution context to the compiler, eliminating the need for inter-core communication, cache coherence, or work-stealing schedulers. This is consistent with the streaming dataflow model: data moves through the chip from one functional slice to the next, rather than circulating between independent cores.
The defining property of the TSP is the elimination of hardware-managed non-determinism. Conventional processors and GPUs contain dynamic schedulers, cache controllers, branch predictors, reorder buffers, and network arbitration logic that make runtime decisions about instruction ordering, memory access, and data routing. These mechanisms add flexibility but introduce timing variability that makes worst-case latency difficult to predict or guarantee.
The TSP removes this hardware-level decision-making and moves it entirely into the compiler. Before a model runs, Groq's compiler analyzes the full computational graph of the neural network, including every matrix multiplication, activation function, and attention operation, and produces a static schedule that assigns each instruction to a specific time slot and each data movement to a specific clock cycle. The hardware executes this schedule verbatim, with no runtime arbitration.
The compiler performs what Groq engineers describe as two-dimensional scheduling: it schedules instructions in both time (which clock cycle each operation fires) and space (which functional unit executes it and which memory location stores the result). Because the compiler has complete knowledge of hardware latencies, it can arrange operations so that data arrives at each functional unit exactly when that unit is ready to process it, eliminating stall cycles and cache misses.
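The following sketch illustrates the idea of two-dimensional compile-time scheduling in miniature. It is not Groq's compiler: the slice names follow the table above, but the latency values, the one-operation-at-a-time-per-slice restriction, and the dependency-ordered input are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed fixed latencies, in cycles, for each functional slice type.
LATENCY: Dict[str, int] = {"MEM": 3, "MXM": 8, "VXM": 2, "SXM": 1}

@dataclass
class Op:
    name: str
    slice_type: str                                # where the op runs (space)
    deps: List[str] = field(default_factory=list)  # producer ops

def schedule(ops: List[Op]) -> Dict[str, Tuple[int, str]]:
    """Return {op name: (start cycle, slice)}. Every op is pinned to a cycle
    and a slice before execution; because all latencies are fixed, the
    schedule is identical on every run -- there is no runtime arbitration."""
    by_name = {op.name: op for op in ops}
    placed: Dict[str, Tuple[int, str]] = {}
    slice_free = {s: 0 for s in LATENCY}           # next free cycle per slice
    for op in ops:                                 # assumed dependency-ordered
        deps_done = max(
            (placed[d][0] + LATENCY[by_name[d].slice_type] for d in op.deps),
            default=0,
        )
        start = max(deps_done, slice_free[op.slice_type])
        placed[op.name] = (start, op.slice_type)
        slice_free[op.slice_type] = start + LATENCY[op.slice_type]
    return placed

# A tiny transformer-like fragment: load weights, multiply, activate, store.
program = [
    Op("load_w", "MEM"),
    Op("matmul", "MXM", deps=["load_w"]),
    Op("gelu",   "VXM", deps=["matmul"]),
    Op("store",  "MEM", deps=["gelu"]),
]
print(schedule(program))
# {'load_w': (0, 'MEM'), 'matmul': (3, 'MXM'), 'gelu': (11, 'VXM'), 'store': (13, 'MEM')}
```

Because every quantity in the schedule is known before execution, running the same program twice yields exactly the same cycle assignments, which is the property described above.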
This determinism has a practical consequence for inference: since each forward pass through the model takes exactly the same number of clock cycles every time, the first token latency (time to first token) and per-token output rate are both highly predictable. Tail latency equals median latency, an unusual property for production AI infrastructure. Groq has cited time-to-first-token values as low as 0.2 to 0.3 seconds for 70-billion-parameter models in production deployments.
The most unusual feature of the LPU's memory system is its complete reliance on on-chip SRAM for model weight storage, with no external DRAM or High Bandwidth Memory (HBM). A single first-generation TSP contains approximately 220 to 230 MB of globally shared SRAM. There is no cache hierarchy; every byte of memory is accessible at the same latency and bandwidth from any compute unit on the chip.
The bandwidth available from this on-chip SRAM exceeds 80 terabytes per second on the GroqChip1, and rises to 150 TB/s on the second-generation Groq 3. For comparison, the NVIDIA H100's HBM3 memory delivers approximately 3.35 terabytes per second. First-generation hardware therefore offers roughly 24 times the memory bandwidth of an H100, which directly translates to faster weight loading during token generation.
LLM inference is constrained by memory bandwidth rather than raw compute in the single-sequence (low-batch) regime. Each new output token requires loading billions of model weights from memory to perform one forward pass. A processor with higher memory bandwidth can load those weights faster and produce tokens faster. This is the fundamental reason the LPU outperforms H100-based systems on single-request latency benchmarks.
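A back-of-the-envelope calculation makes this concrete. The sketch below (illustrative Python) estimates the bandwidth-imposed ceiling on decode-phase token rate; it counts weight traffic only, ignores KV-cache and activation movement, and treats each system's headline bandwidth as a single number even though the LPU figure is per chip and the model is in practice sharded across many chips.

```python
# Decode-phase ceiling: each output token requires streaming the weights once,
# so tokens/s <= memory bandwidth / weight bytes.
def max_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return bandwidth_bytes_per_s / weight_bytes

WEIGHTS_70B_FP8 = 70e9     # ~70 GB of weights at 8 bits each
H100_HBM3 = 3.35e12        # 3.35 TB/s
LPU_GEN1_SRAM = 80e12      # 80 TB/s on-chip

print(round(max_tokens_per_second(WEIGHTS_70B_FP8, H100_HBM3)))      # ~48 tokens/s
print(round(max_tokens_per_second(WEIGHTS_70B_FP8, LPU_GEN1_SRAM)))  # ~1,143 tokens/s
```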
The trade-off is capacity. At 230 MB per chip, a single LPU holds only a small fraction of a large language model's weights. A Llama 3 70B model stored in 8-bit quantization requires approximately 70 GB of memory. Running that model on LPUs therefore requires distributing the weights across hundreds of chips, each holding a shard of the model. Groq addresses this through the GroqRack system.
| Memory characteristic | GPU (HBM-based) | Groq LPU (SRAM-based) |
|---|---|---|
| Memory technology | HBM2e/HBM3/HBM3e | On-chip SRAM |
| Bandwidth per chip | 3 to 8 TB/s | 80 TB/s (Gen 1), 150 TB/s (Gen 2) |
| Latency | 100 to 400 ns | 1 to 5 ns |
| Capacity per chip | 80 to 288 GB | 230 MB (GroqChip1), 500 MB (LP30) |
| Access pattern | Variable (cache-dependent) | Fixed (compiler-determined) |
| Power per access | Higher (off-chip) | Lower (on-chip) |
The two generations of the LPU have the following headline specifications:
| Specification | GroqChip1 | LP30 (Groq 3, 2026) |
|---|---|---|
| INT8 performance | Up to 750 TOPS | TBD |
| FP16 performance | 188 TFLOPS (at 900 MHz) | TBD |
| FP8 performance | - | 315 PFLOPS per LPX rack reported at unveil |
| On-chip SRAM | 230 MB | 500 MB |
| Memory bandwidth | 80 TB/s | 150 TB/s |
| External HBM | None | None |
| Fabrication | GlobalFoundries 14 nm | Samsung 4 nm |
| Die size | 25 x 29 mm | TBD |
| Nominal clock | 900 MHz | TBD |
As the table indicates, neither generation uses high-bandwidth memory: all model weights reside in on-chip SRAM, which provides extremely high bandwidth but limits total memory capacity per chip.
GroqRack is Groq's rack-level hardware product for on-premises and colocation deployments. The system connects multiple TSP chips in a high-radix mesh topology that extends the deterministic scheduling model across a distributed fabric. The compiler that schedules a single chip also schedules the entire rack, and across racks, as a single logical system.
The interconnect is organized hierarchically as follows:
| Layer | Composition |
|---|---|
| Node | 8 TSPs, each connected to 7 local neighbors plus 4 global links to adjacent nodes |
| Rack (Gen 1 GroqRack) | 9 nodes; 72 TSPs total |
| Rack (Gen 2 LPX) | 128 LPUs (single rack format) |
| Largest cluster | 10,440 TSPs across 145 racks; max 5 network hops between any two chips |
For workloads that span multiple LPUs, Groq uses a plesiochronous chip-to-chip protocol to cancel natural clock drift and align hundreds of LPUs to act as a single logical core. Periodic software synchronization adjusts for crystal-based clock drift, enabling not just compute scheduling but also network scheduling across the entire system. Cross-chip communication is handled by the same compiler-driven approach used for single-chip execution: the compiler determines exactly when each chip will need data from a neighbor and schedules data injection to arrive at the destination precisely when the receiving chip needs it. This eliminates routing tables, congestion control, and network back-pressure mechanisms.
To maintain synchronization across chips, the system uses a Hardware Aligned Counter (HAC) that provides a global time reference, supplemented by software-aligned counters and deskew instructions that realign execution at epoch boundaries after multi-hop data transfers.
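A minimal sketch of the epoch-boundary realignment idea follows, assuming each chip exposes its hardware-aligned counter value and that chips which have run ahead stall until the slowest chip catches up; the counter values and the stall-based mechanism are illustrative, not Groq's actual protocol.

```python
# Toy epoch-boundary deskew: realign every chip to the slowest counter.
from typing import List

def deskew(hac_counts: List[int]) -> List[int]:
    """Return per-chip stall cycles so all chips re-enter the next epoch aligned."""
    reference = min(hac_counts)
    return [count - reference for count in hac_counts]

# Three chips whose counters drifted apart by a few cycles over one epoch.
print(deskew([1_000_002, 1_000_000, 1_000_005]))   # [2, 0, 5]
```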
For error handling, the system uses forward error correction rather than packet retransmission. Retransmission would break the static schedule by introducing unpredictable latency; correcting errors in transit preserves timing guarantees. Each rack also includes spare TSP nodes for failover.
Running a 70B-parameter model at FP8 requires approximately 576 LPU chips across eight first-generation GroqRacks. The newer LPX rack format, introduced with the second-generation hardware, packs 128 LPUs into a single rack and (in Groq's reported configurations) fits the full 70B model's weights within a single rack's aggregate SRAM. Even larger or mixture-of-experts models, such as 405B-parameter variants, require multi-rack deployments with more complex sharding strategies.
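These chip counts can be sanity-checked against a weights-only lower bound; the gap between that bound and the deployed figures reflects SRAM reserved for activations, KV cache, and scheduling slack, along with topology constraints. The sketch below is illustrative Python using the capacities stated in this article.

```python
import math

def min_chips(weight_bytes: float, sram_bytes_per_chip: float) -> int:
    """Weights-only lower bound on the chips needed to shard a model."""
    return math.ceil(weight_bytes / sram_bytes_per_chip)

WEIGHTS_70B_FP8 = 70e9                       # ~70 GB of weights at 8 bits each

gen1 = min_chips(WEIGHTS_70B_FP8, 230e6)     # GroqChip1: 230 MB SRAM per chip
gen2 = min_chips(WEIGHTS_70B_FP8, 500e6)     # LP30: 500 MB SRAM per chip

print(gen1, math.ceil(gen1 / 72))    # 305 chips -> at least 5 Gen 1 racks
print(gen2)                          # 140 chips on Gen 2 hardware
```

The deployed Gen 1 figure of 576 chips (eight full racks) is roughly double this weights-only minimum.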
The LPU has consistently ranked first among public cloud providers on output token throughput benchmarks, particularly for medium-scale models in the 7B to 70B parameter range.
| Model | Decode mode | Tokens per second | Source |
|---|---|---|---|
| Llama 3 8B | Standard | 1,300+ | Early 2024 single-request tests |
| Llama 3 70B | Standard | 800+ | VentureBeat (April 2024) |
| Llama 3.3 70B | Standard | 276 | Artificial Analysis (late 2024) |
| Llama 3.3 70B | Speculative decoding | 1,665 | Artificial Analysis (late 2024) |
| Mixtral 8x7B | Standard | 500+ | Groq early demos (Feb 2024) |
The April 2024 VentureBeat report noted that Groq was serving Meta's Llama 3 at over 800 tokens per second, which attracted broad public attention because consumer-grade GPU servers typically produce 20 to 60 tokens per second for similar model sizes. With speculative decoding enabled on the updated first-generation chip, Artificial Analysis recorded a roughly sixfold increase over the non-speculative baseline.
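Groq has not published the details of its speculative decoding configuration, but the standard analysis of the technique, in which a small draft model proposes several tokens that the large model verifies in a single forward pass, shows how multi-fold gains arise. The sketch below applies the usual expected-tokens-per-pass formula with assumed acceptance rates and draft lengths; none of these values are Groq's.

```python
# Expected tokens accepted per verification pass under the common i.i.d.
# acceptance assumption: E = (1 - a**(k + 1)) / (1 - a), where `a` is the
# per-token acceptance probability and `k` the number of drafted tokens.
# Real speedups also depend on the draft model's own cost, ignored here.
def expected_tokens_per_pass(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for a, k in [(0.7, 4), (0.8, 6), (0.9, 8)]:
    print(a, k, round(expected_tokens_per_pass(a, k), 2))
# 0.7 4 2.77   0.8 6 3.95   0.9 8 6.13
```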
| Metric | LPU value |
|---|---|
| Time-to-first-token (70B models) | 0.2 to 0.3 seconds |
| Per-token generation latency | Sub-millisecond, predictable |
| Throughput consistency | No variance between requests |
| Tail latency vs median latency | Identical |
With GPU-based inference, tail latency (the worst-case response time) can be several times higher than median latency due to cache misses, memory contention, and scheduling delays. With Groq's LPU, the tail latency equals the median latency because execution is fully deterministic.
Energy consumption per token is reported at 1 to 3 joules for LPU-based inference, against 10 to 30 joules on H100-based systems, reflecting the energy efficiency of avoiding off-chip memory accesses. The elimination of caches, branch predictors, and reorder buffers from the LPU design reduces the transistor count dedicated to control logic; in conventional processors, such reactive control structures can consume 30 to 40 percent of the chip's power budget. The LPU redirects that silicon area and power toward compute and SRAM, improving the ratio of useful computation to total power consumption.
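To put the energy figures in operational terms, the conversion below (illustrative) turns joules per token into the token rate sustainable within a fixed 1 kW power envelope.

```python
# tokens/s sustainable in a fixed power budget = watts / (joules per token).
def tokens_per_second(joules_per_token: float, watts: float = 1000.0) -> float:
    return watts / joules_per_token

for j in (1, 3, 10, 30):                       # J/token values cited above
    print(j, round(tokens_per_second(j)))
# 1 J -> 1000, 3 J -> 333, 10 J -> 100, 30 J -> 33 tokens/s per kilowatt
```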
Three architectures are most commonly compared in the AI inference accelerator market: the NVIDIA H100 GPU, the Cerebras WSE-3, and the Groq LPU. Each reflects a different set of engineering tradeoffs.
| Dimension | Groq LPU (Gen 1) | NVIDIA H100 | Cerebras WSE-3 |
|---|---|---|---|
| Process node | 14 nm (GlobalFoundries) | 4 nm (TSMC) | 5 nm (TSMC) |
| On-chip memory | ~230 MB SRAM per chip | ~50 MB L2 cache | 44 GB SRAM |
| Off-chip memory | None (model sharded across chips) | 80 GB HBM3 | None |
| Memory bandwidth | 80 TB/s (on-chip) | 3.35 TB/s (HBM3) | ~21 TB/s (on-chip) |
| Compute (FP16) | 188 TFLOPS per chip | 989 TFLOPS | ~125 PFLOPS (full wafer) |
| Execution model | Deterministic, compiler-scheduled | Dynamic, hardware-scheduled | Dataflow, compiler-scheduled |
| Chip/wafer area | 25 x 29 mm | ~814 square mm | Full 300 mm wafer |
| Model capacity | Requires multi-chip for 7B+ | 80 GB per card | Tens of billions of parameters in on-wafer SRAM |
| Token throughput (70B) | 276 tokens/s standard, 1,665 with speculative | 60 to 200 tokens/s | Comparable to Groq |
| Best use case | Low-latency, single-request inference | Training, high-batch inference, general workloads | Extremely large model inference, research |
The NVIDIA H100 is the dominant data center GPU for both training and inference. Its 80 GB of HBM3 memory allows it to hold large models on a single card without sharding, simplifying deployment. It supports high batch sizes well, making it more efficient for throughput-optimized inference serving many simultaneous users. Its primary disadvantage relative to the LPU is memory bandwidth: at 3.35 TB/s, it loads model weights much more slowly per token than the LPU's on-chip SRAM, resulting in slower per-token generation and lower single-request throughput in low-batch scenarios.
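The batching advantage can be illustrated with the same weights-streaming model used earlier: one decode step streams the weights once but yields one token per sequence in the batch, so aggregate throughput grows with batch size until compute or KV-cache traffic becomes the limit. The figures below are assumptions for illustration.

```python
def aggregate_tokens_per_s(weight_bytes: float, bandwidth: float, batch: int) -> float:
    """Aggregate tokens/s when each decode step streams the weights once."""
    step_seconds = weight_bytes / bandwidth
    return batch / step_seconds

W = 70e9                                                # 70B model at FP8
print(round(aggregate_tokens_per_s(W, 3.35e12, 1)))     # H100, batch 1:  ~48
print(round(aggregate_tokens_per_s(W, 3.35e12, 64)))    # H100, batch 64: ~3,063
```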
The Cerebras WSE-3 takes the on-chip memory concept further than Groq by using a full 300 mm silicon wafer as a single processor. The WSE-3 integrates 44 GB of on-chip SRAM across 4 trillion transistors on a TSMC 5 nm process. This allows Cerebras to hold much larger models than a single LPU in on-chip memory without any multi-chip sharding, and to process them at the full bandwidth of SRAM. The trade-off is manufacturing complexity: wafer-scale fabrication has lower yields than standard chip fabrication, and the physical scale of the chip complicates system integration and cooling.
SambaNova Systems takes a third architectural approach with its Reconfigurable Dataflow Unit (RDU), which uses a three-tiered memory hierarchy of SRAM, HBM, and DRAM. This allows SambaNova's chips to hold larger models per physical chip count than Groq, though with lower per-chip bandwidth than a pure SRAM design.
Because Groq's founder came from the Google TPU team, the TPU is the most natural architectural comparison.
| Feature | Google TPU | Groq LPU |
|---|---|---|
| Primary focus | Training and inference | Inference only |
| Architecture | Systolic array | Functionally sliced streaming |
| Memory | HBM-based | SRAM only |
| Execution model | Dynamic scheduling | Fully deterministic |
| Availability | Google Cloud only | GroqCloud and on-premises |
| Core design | Multi-core | Single-core |
| Compiler role | Standard runtime scheduling | Central compile-time scheduling |
| Tail latency | Variable | Equal to median |
The TPU optimizes for flexibility across both training and inference, while the LPU sacrifices training capability entirely to achieve superior inference latency and determinism.
The LPU's SRAM-only memory architecture, while responsible for its bandwidth advantages, also imposes constraints that GPU-based systems do not face.
The most significant constraint is per-chip capacity. At 230 MB per chip, even a 7-billion-parameter model stored in 8-bit quantization (roughly 7 GB) cannot fit on a single LPU. Running any practically useful LLM requires distributing weights across dozens to hundreds of chips. For a 70B model, this means 576 chips across eight racks on Gen 1 hardware or 256 chips (two 128-LPU LPX racks) on Gen 2 hardware. This chip count requirement drives up infrastructure cost and physical footprint compared to a single H100 or pair of H100s that can hold the same model entirely in on-card HBM.
The deterministic scheduling model, while powerful for inference, creates challenges for workloads with dynamic or irregular computation patterns. Tasks like training, where gradient shapes and batch sizes vary, do not map cleanly onto the compiler-scheduled execution model. Groq has publicly positioned the LPU as an inference-only product, with no stated plans to support training workloads.
Very large models, such as 405-billion-parameter or mixture-of-experts variants that exceed even a full GroqRack's aggregate SRAM, require multi-rack deployments with more complex sharding strategies. Cross-rack links still avoid off-chip DRAM, but their latency and bandwidth are worse than within-rack transfers, which can reduce efficiency for models at the largest scale.
The fixed compilation model means that a new model must complete a full compilation pass before it can serve requests. For standard public models this is a one-time cost, but fine-tuned or custom models require a recompile, adding time to deployment.
The Groq 3 LPU (designated LP30) was unveiled by NVIDIA at GTC 2026 in March, three months after the December 2025 NVIDIA-Groq licensing agreement. The Groq 3 is fabricated on Samsung's 4 nm node and increases on-chip SRAM to 500 MB and bandwidth to 150 TB/s per chip. It is paired with NVIDIA's Vera Rubin GPU platform as a dedicated decode-phase co-processor, with the LPX rack format integrating 128 LPUs.
The LPX rack, when paired with NVIDIA's Vera Rubin CPU-GPU super-rack, has been described as offering a 35x throughput-per-megawatt improvement over previous-generation inference solutions. Industry analysts expect future NVIDIA platforms to integrate the deterministic LPU IP directly into a hybrid die that combines GPU-style parallel compute with a dedicated SRAM-based inference engine. The business and corporate aspects of the NVIDIA agreement are covered in Groq.