Groq
Last reviewed
May 8, 2026
Sources
16 citations
Review status
Source-backed
Revision
v4 ยท 2,647 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 8, 2026
Sources
16 citations
Review status
Source-backed
Revision
v4 ยท 2,647 words
Add missing citations, update stale details, or suggest a clearer explanation.
Groq is an American artificial intelligence hardware company that designs custom silicon for AI inference. Founded in 2016 by Jonathan Ross, a former Google engineer who helped design the original Tensor Processing Unit (TPU), Groq has differentiated itself through a deterministic computing architecture that delivers ultra-low-latency, predictable performance for large language model inference. The company gained widespread attention in early 2024 when public demos of its inference speed went viral, and it has since grown into a significant player in the AI infrastructure market.
This article covers Groq Inc., the company, its history, products, and business operations. For an in-depth technical description of the chip itself, including the Tensor Streaming Processor design, deterministic compiler scheduling, on-chip SRAM, and rack-level architecture, see Groq LPU.
Groq was founded in 2016 by Jonathan Ross along with several other former Google engineers. Ross had been one of the key architects behind Google's Tensor Processing Unit (TPU), the custom AI accelerator that Google developed internally to handle the computational demands of its machine learning workloads. The experience of building the TPU gave Ross insight into the limitations of existing processor architectures for AI workloads, particularly for inference, where latency and predictability matter more than raw training throughput.
Ross founded Groq with the thesis that inference workloads required a fundamentally different architectural approach than what GPUs or even Google's TPUs provided. While GPUs excel at parallel computation for training, their complex memory hierarchies, caches, and scheduling mechanisms introduce unpredictable latency during inference. Ross wanted to build a chip where execution time could be determined at compile time, not at runtime.
The early years focused on chip design and compiler development, reflecting the company's philosophy that hardware and software must be co-designed. Ross has described this approach in public talks as treating the compiler as a first-class citizen of the architecture rather than an afterthought layered on top of existing hardware.
The company's name, Groq, is unrelated to Elon Musk's AI chatbot Grok (developed by xAI), which launched later. The similarity in names has been a source of occasional confusion, though the two companies operate in entirely different segments of the AI market.
Groq has expanded its capabilities through two acquisitions:
| Year | Target | Purpose |
|---|---|---|
| March 2022 | Maxeler Technologies | UK-based dataflow computing firm; added HPC and financial services expertise |
| March 2024 | Definitive Intelligence | Accelerated development of GroqCloud platform |
Groq's product portfolio centers on its custom inference silicon and the cloud platform that exposes it to developers.
The Language Processing Unit is Groq's custom-designed processor, originally called the Tensor Streaming Processor (TSP) before being rebranded to reflect its strengths in language model inference. The LPU represents a fundamental departure from both GPU and TPU architectures.
Key design choices:
The GroqChip1 was fabricated by GlobalFoundries on a 14 nm process. The second-generation Groq 3 LPU (LP30) is fabricated by Samsung on a 4 nm process, with 500 MB of SRAM and 150 TB/s of memory bandwidth. Detailed coverage of the architecture, the GroqRack system, the multi-chip deterministic fabric, and benchmark results lives in the Groq LPU article.
GroqCloud is Groq's cloud-based API platform that provides developers with access to LPU-powered inference. The platform supports a range of open-source models and offers an API that is compatible with OpenAI's API format for ease of integration.
As of early 2026, GroqCloud supports the following model families:
| Model Family | Variants | Context Length |
|---|---|---|
| Llama 3.x | 8B, 70B (Llama 3.1, 3.3) | Up to 128K |
| OpenAI gpt-oss | 20B, 120B | 128K |
| DeepSeek | Various | Model-dependent |
| Qwen 3 | Various | Model-dependent |
| Mistral | Various | Model-dependent |
| Whisper (speech-to-text) | large-v3, large-v3-turbo | Audio input |
GroqCloud uses a pay-as-you-go pricing model with three tiers: Free, Developer, and Enterprise.
| Model | Input price (per M tokens) | Output price (per M tokens) |
|---|---|---|
| gpt-oss-120B | $0.15 | $0.75 |
| gpt-oss-20B | $0.10 | $0.50 |
| Whisper large-v3 | $0.111/hour | - |
| Whisper large-v3-turbo | $0.04/hour | - |
GroqCloud also offers batch processing at 50% lower cost for asynchronous workloads, and prompt caching provides an additional 50% discount on cached input tokens. The platform serves over two million developers and multiple Fortune 500 companies.
In 2025, Groq launched Compound, its first agent and compound AI system, on GroqCloud. Compound integrates agentic AI capabilities with server-side tool use, allowing developers to build systems that can conduct research, execute code, control browsers, and navigate the web. All tool calls run server-side on Groq's inference fleet, keeping latency low. The orchestration layer determines which tools (web search, Wolfram Alpha, code execution, browsers) are needed and manages iterative reasoning loops where the model consumes tool outputs and refines its responses.
Compound moved to general availability on October 1, 2025, delivering approximately 25 percent higher accuracy and roughly 50 percent fewer errors across evaluation benchmarks compared to its preview version.
Groq gained massive public attention in February 2024 when demonstrations of its inference speed went viral on social media. Users reported receiving responses from large language models at speeds that felt instantaneous, with tokens appearing faster than they could be read.
Groq has published and demonstrated inference speeds across multiple popular open-source models:
| Model | Tokens/second | Date | Notes |
|---|---|---|---|
| Llama 2 70B Chat | 241 | Feb 2024 | Early viral demos |
| Mixtral 8x7B | 500+ | Feb 2024 | Mixture-of-experts model |
| Llama 3 8B | 800+ | Apr 2024 | Day-zero launch support |
| Llama 3 70B | 300+ | Apr 2024 | Standard decoding |
| Llama 3 70B (speculative) | 1,660+ | Late 2024 | With speculative decoding |
| Llama 3.3 70B | Record-setting | Jan 2025 | New speed benchmark |
Groq led the first independent LLM benchmark for inference speed conducted by ArtificialAnalysis.ai, outperforming all GPU-based and competing ASIC-based providers on throughput and latency metrics. The deterministic architecture also provides identical median and tail latency, an unusual property for production AI infrastructure.
Groq planned to deploy over 108,000 LPUs manufactured by GlobalFoundries by the end of Q1 2025, which would represent the largest AI inference compute deployment by any non-hyperscaler. The company has built data centers across North America, Europe, and the Middle East.
In February 2025, at the LEAP 2025 technology conference in Riyadh, the Kingdom of Saudi Arabia announced a $1.5 billion commitment (through HUMAIN) to expand Groq's LPU-based AI inference infrastructure within the country. The commitment was made jointly by Groq CEO Jonathan Ross and representatives from Saudi Aramco. The investment is tied to Saudi Arabia's Vision 2030 economic diversification program. Groq's Dammam data center, which went live in December 2024, hosts inference workloads for Saudi Aramco's Norous generative AI assistant and supports development of ALLaM, a large language model developed by the Saudi Data and Artificial Intelligence Authority (SDAIA).
In December 2025, NVIDIA and Groq announced a landmark agreement reportedly valued at approximately $20 billion. The deal involved NVIDIA licensing Groq's AI inference technology through a non-exclusive licensing agreement signed on December 24, 2025. The agreement was structured to deliver $17 billion in cash payments across three installments by the end of 2026, with several senior Groq executives, including founder Jonathan Ross and president Sunny Madra, transferring to NVIDIA as part of the arrangement. CFO Simon Edwards stepped up as CEO of the continuing independent Groq entity.
The deal was widely interpreted as an acknowledgment from NVIDIA that Groq's deterministic inference architecture offered capabilities that NVIDIA's GPU-based approach could not easily replicate. Analysts noted that structuring the agreement as a non-exclusive license rather than an outright acquisition was likely intended to reduce antitrust scrutiny, given NVIDIA's roughly 85 to 90 percent share of the AI accelerator market. For Groq, the deal provided substantial capital while allowing the company to continue operating independently and licensing its technology non-exclusively.
The first tangible result of the NVIDIA partnership emerged at GTC 2026 in March, just three months after the licensing agreement. NVIDIA unveiled the Groq 3 LPU (designated LP30), along with the LPX server node:
| Specification | GroqChip1 | Groq 3 (LP30) |
|---|---|---|
| On-chip SRAM | 230 MB | 500 MB |
| SRAM bandwidth | 80 TB/s | 150 TB/s |
| Fabrication | GlobalFoundries 14 nm | Samsung 4 nm |
| Integration | Standalone | Pairs with Vera Rubin GPU platform |
The Groq 3 LPX server rack packs 128 LPUs and, when paired with NVIDIA's Vera Rubin CPU-GPU super-rack, promises 35x higher throughput per megawatt than previous-generation inference solutions. Industry analysts expect NVIDIA to integrate Groq's deterministic inference logic into its upcoming Vera Rubin architecture, creating a hybrid chip that combines the massive parallel processing of a traditional GPU with a dedicated inference engine powered by Groq's SRAM-based IP.
Groq has raised significant capital across multiple funding rounds:
| Round | Date | Amount | Valuation | Key Investors |
|---|---|---|---|---|
| Seed | 2017 | $10M | - | Social Capital |
| Series B | 2018 | $52M | - | Social Capital, D1 Capital |
| Series C | April 2021 | $300M | ~$1B | Tiger Global, D1 Capital |
| Series D | August 2024 | $640M | $2.8B | BlackRock Private Equity Partners |
| Growth round | September 2024 | $750M | $6.9B | Disruptive, BlackRock, Neuberger Berman, DTCP |
The rapid growth in valuation from $2.8 billion in August 2024 to $6.9 billion by September 2024 reflected the surging demand for inference infrastructure and investor confidence in Groq's differentiated technology. Including the $20 billion NVIDIA licensing payment, Groq's total capital base grew dramatically through 2025 and 2026.
Groq competes in the AI inference accelerator market against several players, each with different architectural approaches:
| Competitor | Architecture | Focus | Key Differentiator |
|---|---|---|---|
| NVIDIA | GPU (H100, Blackwell) | Training and inference | Ecosystem breadth, CUDA |
| Cerebras | Wafer-scale engine | Training and inference | On-chip SRAM bandwidth |
| TPU | Training and inference | Vertical integration | |
| AMD | GPU (MI300X) | Training and inference | Price-performance ratio |
| Amazon | Inferentia/Trainium | Cloud inference | AWS integration |
| SambaNova | Reconfigurable dataflow | Enterprise AI | Dataflow architecture |
Groq's primary differentiator is its focus on inference-only workloads. While competitors like NVIDIA and Google design chips that handle both training and inference, Groq has optimized its architecture exclusively for inference, betting that the inference market will grow substantially larger than the training market as deployed AI models serve billions of users. The company's deterministic latency guarantee is particularly valuable for real-time applications and agentic AI systems that require predictable response times.
Groq's strategic bet rests on the observation that while training a model happens once (or a few times), inference happens billions of times as the model serves users. As AI moves from the research and training phase into mass deployment, the ratio of inference compute to training compute is expected to shift dramatically in favor of inference. Groq estimates that inference will eventually consume 10x or more compute than training, making inference-specialized hardware increasingly valuable.
Groq's low-latency inference is particularly well-suited to several application categories:
As of early 2026, Groq has established itself as a leading inference infrastructure provider. The company powers over two million developers, operates data centers on three continents, and has secured partnerships with major sovereign entities and technology companies. The NVIDIA licensing deal validated the value of Groq's deterministic architecture, while continued funding rounds have provided capital for expansion.
The unveiling of the Groq 3 LPU at GTC 2026 marks a new chapter for the technology, with NVIDIA's manufacturing and distribution capabilities potentially bringing Groq's inference architecture to a far wider audience than the company could reach independently. GroqCloud continues to add model support and features, with the Compound AI system enabling more sophisticated agentic applications.
Groq's bet on inference as the dominant AI compute workload appears to be paying off, as the industry shifts from a training-focused phase to a deployment and scaling phase where inference costs and latency become the primary concerns.