Groq is an American artificial intelligence hardware company that designs and manufactures the Language Processing Unit (LPU), a custom ASIC built specifically for AI inference. Founded in 2016 by Jonathan Ross, a former Google engineer who helped design the original Tensor Processing Unit (TPU), Groq has differentiated itself through a deterministic computing architecture that delivers ultra-low-latency, predictable performance for large language model inference. The company gained widespread attention in early 2024 when public demos of its inference speed went viral, and it has since grown into a significant player in the AI infrastructure market.
Groq was founded in 2016 by Jonathan Ross along with several other former Google engineers. Ross had been one of the key architects behind Google's Tensor Processing Unit (TPU), the custom AI accelerator that Google developed internally to handle the computational demands of its machine learning workloads. The experience of building the TPU gave Ross insight into the limitations of existing processor architectures for AI workloads, particularly for inference, where latency and predictability matter more than raw training throughput.
Ross founded Groq with the thesis that inference workloads required a fundamentally different architectural approach than what GPUs or even Google's TPUs provided. While GPUs excel at parallel computation for training, their complex memory hierarchies, caches, and scheduling mechanisms introduce unpredictable latency during inference. Ross wanted to build a chip where execution time could be determined at compile time, not at runtime.
The company's name, Groq, is unrelated to Elon Musk's AI chatbot Grok (developed by xAI), which launched later. The similarity in names has been a source of occasional confusion, though the two companies operate in entirely different segments of the AI market.
The Language Processing Unit is Groq's custom-designed processor, originally called the Tensor Streaming Processor (TSP) before being rebranded to reflect its particular strengths in language model inference. The LPU represents a fundamental departure from both GPU and TPU architectures.
The LPU is built on the Tensor Streaming Processor (TSP) architecture, internally codenamed "Alan." The TSP was designed from scratch to eliminate the sources of latency variability found in conventional processors. Rather than optimizing the same general-purpose computing paradigm used by CPUs and GPUs, the TSP introduces a streaming execution model where data flows through the chip in a single direction, passing through computation units in sequence without backtracking to a central memory pool.
The TSP architecture achieves this through three core design principles: deterministic execution, SRAM-only on-chip memory, and a functionally sliced, single-core streaming design.
The most distinctive feature of the LPU is its deterministic architecture. Traditional processors use a variety of reactive hardware components to manage the unpredictability of program execution: branch predictors guess which code path will be taken, caches store frequently accessed data to hide memory latency, reordering buffers rearrange instructions for efficiency, and arbiters manage contention for shared resources. All of these components introduce variability in execution time.
The LPU eliminates all of these components. Instead, the compiler handles all scheduling decisions at compile time, producing a fully deterministic execution plan. Every memory access, every computation, and every data movement is predetermined before the program runs. This means that the execution time of any given workload is known exactly before it begins, enabling guaranteed latency and predictable throughput.
Because the hardware is software-controlled, the compiler knows exactly when and where each operation will occur and how long it will take. This determinism extends beyond individual chips: the compiler pre-computes the entire execution graph, including inter-chip communication patterns, down to individual clock cycles.
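The contrast with runtime scheduling can be made concrete with a toy sketch (this is an illustration of the idea, not Groq's actual compiler): when every functional unit has a fixed, data-independent latency and there are no caches or arbiters, a compile-time pass can assign every operation an exact start cycle, so total execution time is known before the program runs.

```python
# Hypothetical per-operation latencies in clock cycles (illustrative only).
OP_CYCLES = {"load_weights": 4, "matmul": 12, "activation": 2, "store": 3}

def compile_schedule(program):
    """Assign each op a fixed start cycle; total latency is known statically."""
    schedule, cycle = [], 0
    for op in program:
        schedule.append((op, cycle))   # this op starts exactly at `cycle`
        cycle += OP_CYCLES[op]         # fixed latency, never data-dependent
    return schedule, cycle             # `cycle` is the guaranteed total time

program = ["load_weights", "matmul", "activation", "store"]
schedule, total = compile_schedule(program)
print(schedule)  # every start cycle is fixed before execution begins
print(total)     # 21 cycles, identical on every invocation
```

On real deterministic hardware the same property holds at full scale: the schedule, not the runtime, decides when each operation fires, which is why tail latency cannot diverge from median latency.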
A defining characteristic of the LPU is its exclusive use of on-chip SRAM as primary memory. Unlike GPUs, which rely on off-chip High Bandwidth Memory (HBM) stacks, or CPUs, which use DRAM with multi-level cache hierarchies, the LPU integrates hundreds of megabytes of SRAM directly alongside its compute units. This SRAM serves as the primary storage for model weights and activations, not as a cache.
The architectural implications are significant:
| Memory characteristic | GPU (HBM-based) | Groq LPU (SRAM-based) |
|---|---|---|
| Memory technology | HBM2e/HBM3/HBM3e | On-chip SRAM |
| Bandwidth per chip | ~3-8 TB/s | 80 TB/s |
| Latency | ~100-400 ns | ~1-5 ns |
| Capacity per chip | 80-288 GB | 230 MB (GroqChip1) / 500 MB (LP30) |
| Access pattern | Variable (cache-dependent) | Fixed (compiler-determined) |
| Power per access | Higher (off-chip) | Lower (on-chip) |
The 80 TB/s of on-chip SRAM bandwidth is roughly 24x the bandwidth of an NVIDIA H100's HBM3 (3.35 TB/s). This bandwidth advantage is the fundamental driver of the LPU's inference speed: during autoregressive token generation, the bottleneck is typically reading model weights from memory for each token, and SRAM delivers those weights to the compute units well over an order of magnitude faster than HBM.
The trade-off is capacity. At 230 MB per GroqChip1, a single chip cannot hold the weights of even a small language model. For large models, hundreds of LPUs are connected together, with model weights distributed across the SRAM of many chips. This is why Groq deploys racks of LPUs working in concert.
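Both sides of this trade-off can be put in rough numbers. The sketch below uses the classic memory-bandwidth roofline for decoding (each generated token must stream all model weights past the compute units once) together with a ceiling division for chip count; the byte counts assume 8-bit weights and are illustrative, not vendor figures.

```python
def roofline_tokens_per_s(model_bytes, bandwidth_bytes_per_s):
    # Memory-bound decode: each token requires reading every weight once.
    return bandwidth_bytes_per_s / model_bytes

def chips_needed(model_bytes, sram_per_chip):
    # Ceiling division: chips required just to hold the weights in SRAM.
    return -(-model_bytes // sram_per_chip)

MB, GB, TB = 10**6, 10**9, 10**12
model = 70 * GB   # a 70B-parameter model at 8 bits per weight

# Single GPU bound: 70 GB streamed at 3.35 TB/s HBM3 bandwidth.
print(round(roofline_tokens_per_s(model, 3.35 * TB)))  # ~48 tokens/s

# SRAM capacity side: 230 MB per GroqChip1.
print(chips_needed(model, 230 * MB))  # ~305 chips to hold the weights
```

The arithmetic illustrates why the design needs racks: a single chip cannot hold the model, but once weights are sharded across hundreds of chips, each shard streams out of local SRAM at far higher bandwidth than any off-chip memory could provide.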
The LPU uses a functionally sliced microarchitecture where memory units are interleaved with vector and matrix computation units across the chip. This design exploits the dataflow locality inherent in AI compute graphs. Data flows through the chip in a streaming fashion, moving from one functional unit to the next without needing to be written back to a central memory and re-fetched. This eliminates the memory bandwidth bottleneck that limits GPU-based inference.
Unlike GPUs, which contain thousands of small cores, or TPUs, which use a systolic array architecture, the LPU is fundamentally a single-core processor. This simplifies programming and eliminates the need for complex inter-core communication and synchronization, further contributing to deterministic execution.
For workloads that span multiple LPUs, Groq uses a plesiochronous chip-to-chip protocol to cancel natural clock drift and align hundreds of LPUs to act as a single logical core. Periodic software synchronization adjusts for crystal-based clock drift, enabling not just compute scheduling but also network scheduling across the entire system. The compiler can predict exactly when data will arrive at each chip, allowing developers to reason about timing across the full system.
Groq's first-generation chip, the GroqChip1, provides the following capabilities:
| Specification | GroqChip1 | LP30 (Groq 3, 2026) |
|---|---|---|
| INT8 performance | Up to 750 TOPS | TBD |
| FP16 performance | 188 TFLOPS (at 900 MHz) | TBD |
| On-chip SRAM | 230 MB | 500 MB |
| Memory bandwidth | Up to 80 TB/s | 150 TB/s |
| External HBM | None (SRAM only) | None (SRAM only) |
| Fabrication | GlobalFoundries | Samsung 4nm |
A notable aspect of the GroqChip1 is that it uses no high-bandwidth memory at all: it relies entirely on on-chip SRAM, which provides extremely high bandwidth but limits total memory capacity per chip.
Groq gained massive public attention in February 2024 when demonstrations of its inference speed went viral on social media. Users reported receiving responses from large language models at speeds that felt instantaneous, with tokens appearing faster than they could be read.
Groq has published and demonstrated inference speeds across multiple popular open-source models:
| Model | Tokens/second | Date | Notes |
|---|---|---|---|
| Llama 2 70B Chat | 241 | Feb 2024 | Early viral demos |
| Mixtral 8x7B | 500+ | Feb 2024 | Mixture-of-experts model |
| Llama 3 8B | 800+ | Apr 2024 | Day-zero launch support |
| Llama 3 70B | 300+ | Apr 2024 | Standard decoding |
| Llama 3 70B (speculative) | 1,660+ | Late 2024 | With speculative decoding |
| Llama 3.3 70B | Record-setting | Jan 2025 | New speed benchmark |
Groq topped the first independent LLM inference benchmark conducted by ArtificialAnalysis.ai, outperforming both GPU-based and competing ASIC-based providers on throughput and latency metrics.
The LPU's deterministic architecture provides several advantages for inference:
| Metric | LPU Advantage |
|---|---|
| Token generation latency | Predictable, sub-millisecond per token |
| Time-to-first-token | Near-instantaneous |
| Throughput consistency | No variance between requests |
| Tail latency | Identical to median latency |
The consistent latency is particularly important for production AI systems. With GPU-based inference, tail latency (the worst-case response time) can be several times higher than median latency due to cache misses, memory contention, and scheduling delays. With Groq's LPU, the tail latency equals the median latency because execution is fully deterministic.
GroqCloud is Groq's cloud-based API platform that provides developers with access to LPU-powered inference. The platform supports a range of open-source models and offers an API that is compatible with OpenAI's API format for ease of integration.
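Because the API follows the OpenAI chat-completions format, calling it requires only standard HTTP. The sketch below builds such a request with the Python standard library; the base URL and model name are assumptions for illustration, so check GroqCloud's documentation for current values.

```python
import json
from urllib import request

# Assumed OpenAI-compatible endpoint; verify against GroqCloud's docs.
BASE_URL = "https://api.groq.com/openai/v1"

def build_chat_request(api_key, model, user_message):
    """Construct a POST request in the OpenAI chat-completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("YOUR_API_KEY", "llama-3.3-70b-versatile", "Hello!")
print(req.full_url)  # https://api.groq.com/openai/v1/chat/completions
# To send: json.load(request.urlopen(req))["choices"][0]["message"]["content"]
```

Existing OpenAI SDK clients can typically be pointed at the same endpoint by overriding the base URL, which is what makes migration low-friction.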
As of early 2026, GroqCloud supports the following model families:
| Model Family | Variants | Context Length |
|---|---|---|
| Llama 3.x | 8B, 70B (Llama 3.1, 3.3) | Up to 128K |
| OpenAI gpt-oss | 20B, 120B | 128K |
| DeepSeek | Various | Model-dependent |
| Qwen 3 | Various | Model-dependent |
| Mistral | Various | Model-dependent |
| Whisper (speech-to-text) | large-v3, large-v3-turbo | Audio input |
GroqCloud uses a pay-as-you-go pricing model with three tiers: Free, Developer, and Enterprise.
| Model | Input price (per M tokens) | Output price (per M tokens) |
|---|---|---|
| gpt-oss-120B | $0.15 | $0.75 |
| gpt-oss-20B | $0.10 | $0.50 |
| Whisper large-v3 | $0.111 per hour of audio | - |
| Whisper large-v3-turbo | $0.04 per hour of audio | - |
GroqCloud also offers batch processing at 50% lower cost for asynchronous workloads, and prompt caching provides an additional 50% discount on cached input tokens. The platform serves over two million developers and multiple Fortune 500 companies.
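The interaction of the listed per-token prices with the batch and caching discounts can be sketched as a small cost calculator. How the two discounts stack is an assumption here (caching halves the price of cached input tokens; batch halves the final total), so treat the numbers as illustrative.

```python
# Per-million-token prices from the table above: (input, output) in USD.
PRICES = {"gpt-oss-120B": (0.15, 0.75), "gpt-oss-20B": (0.10, 0.50)}

def cost_usd(model, input_tokens, output_tokens,
             cached_fraction=0.0, batch=False):
    """Estimate request cost; discount stacking is an assumption."""
    in_price, out_price = PRICES[model]
    m = 1_000_000
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    total = (uncached * in_price + cached * in_price * 0.5) / m \
            + output_tokens * out_price / m
    return total * (0.5 if batch else 1.0)

# 10M input tokens (half cached) + 2M output tokens on gpt-oss-120B:
print(cost_usd("gpt-oss-120B", 10 * 10**6, 2 * 10**6, cached_fraction=0.5))
# ≈ 2.625: $0.75 uncached input + $0.375 cached input + $1.50 output
```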
In 2025, Groq launched Compound, its first agent and compound AI system, on GroqCloud. Compound integrates agentic AI capabilities with server-side tool use, allowing developers to build systems that can conduct research, execute code, control browsers, and navigate the web. All tool calls run server-side on Groq's inference fleet, keeping latency low. The orchestration layer determines which tools (web search, Wolfram Alpha, code execution, browsers) are needed and manages iterative reasoning loops where the model consumes tool outputs and refines its responses.
Compound moved to general availability on October 1, 2025, delivering approximately 25% higher accuracy and roughly 50% fewer errors across evaluation benchmarks compared to its preview version.
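The iterative reasoning loop described above, where an orchestrator dispatches tool calls and feeds results back to the model until it produces an answer, can be sketched with stubbed components. Compound runs this orchestration server-side; everything below (the stub model, the tool registry, the step budget) is a schematic, not Groq's implementation.

```python
def stub_model(prompt, observations):
    """Pretend model: request a tool until it has evidence, then answer."""
    if not observations:
        return {"tool": "web_search", "args": "Groq LPU"}
    return {"answer": f"Based on {len(observations)} tool result(s): done"}

# Hypothetical server-side tool registry.
TOOLS = {"web_search": lambda q: f"search results for {q!r}"}

def agent_loop(prompt, max_steps=5):
    observations = []
    for _ in range(max_steps):
        step = stub_model(prompt, observations)
        if "answer" in step:                     # model is done reasoning
            return step["answer"]
        tool = TOOLS[step["tool"]]               # orchestrator picks the tool
        observations.append(tool(step["args"]))  # feed result back to model
    return "step budget exhausted"

print(agent_loop("What is an LPU?"))
# → Based on 1 tool result(s): done
```

Keeping every iteration of this loop on the inference fleet, rather than round-tripping tool results through the client, is what keeps end-to-end latency low.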
Groq planned to deploy over 108,000 LPUs manufactured by GlobalFoundries by the end of Q1 2025, which would represent the largest AI inference compute deployment by any non-hyperscaler. The company has built data centers across North America, Europe, and the Middle East.
In February 2025, Groq announced that it had secured a $1.5 billion commitment from the Kingdom of Saudi Arabia (through HUMAIN) to expand its LPU-based AI inference infrastructure, including a new GroqCloud data center in Dammam, Saudi Arabia. This partnership reflected the growing interest from Middle Eastern sovereign wealth entities in building domestic AI compute capacity.
In December 2025, NVIDIA and Groq announced a landmark agreement reportedly valued at approximately $20 billion. The deal involved NVIDIA licensing Groq's AI inference technology through a non-exclusive licensing agreement signed on December 24, 2025. The agreement was structured to deliver $17 billion in cash payments across three installments by the end of 2026, with several senior Groq executives, including founder Jonathan Ross and president Sunny Madra, transferring to NVIDIA as part of the arrangement.
The deal was widely interpreted as an acknowledgment from NVIDIA that Groq's deterministic inference architecture offered capabilities that NVIDIA's GPU-based approach could not easily replicate. For Groq, the deal provided substantial capital while allowing the company to continue operating independently and licensing its technology non-exclusively.
The first tangible result of the NVIDIA partnership emerged at GTC 2026 in March, just three months after the licensing agreement. NVIDIA unveiled the Groq 3 LPU (designated LP30), along with the LPX server node:
| Specification | GroqChip1 | Groq 3 (LP30) |
|---|---|---|
| On-chip SRAM | 230 MB | 500 MB |
| SRAM bandwidth | 80 TB/s | 150 TB/s |
| Fabrication | GlobalFoundries | Samsung 4nm |
| Integration | Standalone | Pairs with Vera Rubin GPU platform |
The Groq 3 LPX server rack packs 128 LPUs and, when paired with NVIDIA's Vera Rubin CPU-GPU super-rack, promises 35x higher throughput per megawatt than previous-generation inference solutions. Industry analysts expect NVIDIA to integrate Groq's deterministic inference logic into its upcoming Vera Rubin architecture, creating a hybrid chip that combines the massive parallel processing of a traditional GPU with a dedicated inference engine powered by Groq's SRAM-based IP.
Groq has raised significant capital across multiple funding rounds:
| Round | Date | Amount | Valuation | Key Investors |
|---|---|---|---|---|
| Series A | 2017 | $10M | - | Social Capital |
| Series B | 2018 | $52M | - | Social Capital, D1 Capital |
| Series C | 2021 | $300M | ~$1B | Tiger Global, D1 Capital |
| Series D | August 2024 | $640M | $2.8B | BlackRock Private Equity Partners |
| Series E | September 2025 | $750M | $6.9B | Disruptive, BlackRock, Neuberger Berman, DTCP |
The rapid growth in valuation from $2.8 billion in August 2024 to $6.9 billion by September 2025 reflected the surging demand for inference infrastructure and investor confidence in Groq's differentiated technology. Including the NVIDIA licensing deal, Groq's total capital base grew dramatically through 2025 and 2026.
Groq competes in the AI inference accelerator market against several players, each with different architectural approaches:
| Competitor | Architecture | Focus | Key Differentiator |
|---|---|---|---|
| NVIDIA | GPU (H100, Blackwell) | Training and inference | Ecosystem breadth, CUDA |
| Cerebras | Wafer-scale engine | Training and inference | On-chip SRAM bandwidth |
| Google | TPU | Training and inference | Vertical integration |
| AMD | GPU (MI300X) | Training and inference | Price-performance ratio |
| Amazon | Inferentia/Trainium | Cloud inference | AWS integration |
| SambaNova | Reconfigurable dataflow | Enterprise AI | Dataflow architecture |
Groq's primary differentiator is its focus on inference-only workloads. While competitors like NVIDIA and Google design chips that handle both training and inference, Groq has optimized its architecture exclusively for inference, betting that the inference market will grow substantially larger than the training market as deployed AI models serve billions of users. The company's deterministic latency guarantee is particularly valuable for real-time applications and agentic AI systems that require predictable response times.
Groq's strategic bet rests on the observation that while training a model happens once (or a few times), inference happens billions of times as the model serves users. As AI moves from the research and training phase into mass deployment, the ratio of inference compute to training compute is expected to shift dramatically in favor of inference. Groq estimates that inference will eventually consume 10x or more compute than training, making inference-specialized hardware increasingly valuable.
The LPU architecture offers distinct power efficiency characteristics compared to GPU-based inference:
| Metric | Groq LPU Rack | NVIDIA H100 8-GPU Node |
|---|---|---|
| Inference throughput (Llama 70B) | Higher per-token throughput | Lower per-token throughput |
| Power per token | Lower (SRAM is more power-efficient than HBM) | Higher (HBM access dominates power budget) |
| Utilization predictability | Near-100% (deterministic scheduling) | Variable (depends on batching efficiency) |
| Idle power waste | Minimal (no speculative execution hardware) | Higher (caches, predictors consume power when idle) |
The elimination of caches, branch predictors, and reorder buffers from the LPU design reduces transistor count dedicated to control logic. In a traditional GPU, these reactive components can consume 30-40% of the chip's power budget. The LPU redirects that silicon area and power toward compute and SRAM, improving the ratio of useful computation to total power consumption.
For organizations running inference at scale, the total cost of ownership calculation favors the LPU for latency-sensitive workloads. The deterministic performance means capacity planning is straightforward: operators can predict exactly how many tokens per second a given number of LPUs will deliver, without the variability that makes GPU-based capacity planning more complex.
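That capacity-planning claim reduces to a single ceiling division: if a pod of LPUs delivers a fixed, guaranteed token rate, the pod count for a target load follows directly. The pod rate below is a hypothetical figure for illustration, not a published specification.

```python
def pods_needed(target_tokens_per_s, tokens_per_s_per_pod):
    """Ceiling division: pods required to meet an aggregate token rate."""
    return -(-target_tokens_per_s // tokens_per_s_per_pod)

# Suppose each pod sustains a guaranteed 300 tokens/s and the service
# must deliver an aggregate 10,000 tokens/s:
print(pods_needed(10_000, 300))  # 34 pods, with no variance to buffer for
```

With GPU-based serving, the same calculation would need headroom for batching efficiency and tail-latency variance; determinism removes that safety margin from the equation.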
Groq's low-latency inference is particularly well-suited to latency-sensitive application categories such as real-time conversational AI and agentic systems.
Given that Groq's founder came from the Google TPU team, the comparison between the LPU and TPU is frequently drawn. While both are custom ASICs designed for AI workloads, they differ fundamentally:
| Feature | Google TPU | Groq LPU |
|---|---|---|
| Primary focus | Training and inference | Inference only |
| Architecture | Systolic array | Functionally sliced streaming |
| Memory | HBM-based | SRAM only |
| Execution model | Dynamic scheduling | Fully deterministic |
| Availability | Google Cloud only | GroqCloud and on-premises |
| Core design | Multi-core | Single-core |
| Compiler role | Standard (runtime scheduling) | Central (compile-time scheduling) |
| Tail latency | Variable | Equal to median |
The TPU optimizes for flexibility across both training and inference, while the LPU sacrifices training capability entirely to achieve superior inference latency and determinism.
As of early 2026, Groq has established itself as a leading inference infrastructure provider. The company powers over two million developers, operates data centers on three continents, and has secured partnerships with major sovereign entities and technology companies. The NVIDIA licensing deal validated the value of Groq's deterministic architecture, while continued funding rounds have provided capital for expansion.
The unveiling of the Groq 3 LPU at GTC 2026 marks a new chapter for the technology, with NVIDIA's manufacturing and distribution capabilities potentially bringing Groq's inference architecture to a far wider audience than the company could reach independently. GroqCloud continues to add model support and features, with the Compound AI system enabling more sophisticated agentic applications.
Groq's bet on inference as the dominant AI compute workload appears to be paying off, as the industry shifts from a training-focused phase to a deployment and scaling phase where inference costs and latency become the primary concerns.