Cerebras Systems is an American artificial intelligence hardware company that designs and manufactures wafer-scale processors for AI training and inference. Founded in 2016 and headquartered in Sunnyvale, California, Cerebras is best known for producing the Wafer-Scale Engine (WSE), the largest chip ever made, which integrates an entire silicon wafer into a single processor. The company has positioned itself as a leading challenger to NVIDIA in the AI accelerator market, with a particular focus on high-speed inference for large language models.
Cerebras Systems was founded in 2016 by Andrew Feldman, Gary Lauterbach, Michael James, Sean Lie, and Jean-Philippe Fricker. The founding team had deep experience working together; Feldman and Lauterbach had previously co-founded SeaMicro, a server company that designed energy-efficient microservers and was acquired by AMD in 2012 for $334 million. The core insight behind Cerebras was that AI workloads, particularly deep learning training, could benefit from a processor built at a radically larger scale than conventional chips.
Rather than designing small chips and connecting many of them together (the approach taken by GPU-based systems), Cerebras chose to build a single chip spanning an entire 300mm silicon wafer. This approach eliminates the inter-chip communication bottlenecks that slow down distributed AI systems and allows an entire model to reside on-chip during computation.
The company operated in stealth mode for its first three years before unveiling the original Wafer-Scale Engine (WSE) at the Hot Chips conference in August 2019.
The defining innovation of Cerebras is the Wafer-Scale Engine, a processor that occupies an entire silicon wafer rather than being cut into individual dies like a traditional chip. This approach required Cerebras to solve several engineering challenges that had prevented wafer-scale integration for decades, including managing defects, distributing power evenly, and handling thermal dissipation across such a large area.
Building a chip the size of an entire wafer presented five major technical challenges that Cerebras had to overcome:
Defect tolerance. In traditional chip manufacturing, defective dies are simply discarded. On a wafer-scale chip, defects cannot be avoided but must be managed. Cerebras designed the WSE with approximately 100x the defect tolerance of a conventional GPU. Each AI core on the WSE-3 occupies roughly 0.05 mm2, about 1% the size of an NVIDIA H100 streaming multiprocessor (SM) core. When a defect hits a WSE core, it disables only 0.05 mm2 of silicon, whereas the same defect on an H100 would disable approximately 6 mm2. The WSE-3 contains 970,000 physical cores, with 900,000 active in the shipping product, meaning Cerebras achieves 93% silicon utilization, which is higher than leading GPUs despite building the world's largest chip [10].
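As a quick sanity check, the arithmetic below reproduces the defect-area ratio and core-utilization figures directly from the numbers quoted above:

```python
# Sanity check of the defect-tolerance figures quoted above.
wse3_core_area_mm2 = 0.05   # approximate area of one WSE-3 AI core
h100_sm_area_mm2 = 6.0      # approximate area disabled by a defect on an H100 SM

physical_cores = 970_000
active_cores = 900_000

# Silicon disabled by a single defect: WSE-3 core vs. H100 SM
print(h100_sm_area_mm2 / wse3_core_area_mm2)  # 120.0 -> on the order of 100x better defect tolerance

# Fraction of physical cores shipped as active
print(active_cores / physical_cores)          # 0.9278... -> ~93% utilization
```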
Cerebras also developed a sophisticated routing architecture that allows dynamic reconfiguration of connections between cores. When a defect is detected during manufacturing test, the system automatically routes around the disabled core using redundant communication pathways, maintaining the fabric's full bandwidth and connectivity.
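Cerebras has not published the details of this routing logic; the sketch below is only a generic illustration of the idea, detouring around a disabled node in a 2D mesh with a breadth-first search, and is not the WSE's actual algorithm.

```python
from collections import deque

def route_around_defects(width, height, disabled, src, dst):
    """Shortest path between two mesh nodes that avoids disabled cores.

    Generic 2D-mesh illustration only; not Cerebras's actual routing logic.
    Nodes are (x, y) tuples; `disabled` is the set of cores mapped out at test time.
    """
    frontier = deque([(src, [src])])
    seen = {src}
    while frontier:
        node, path = frontier.popleft()
        if node == dst:
            return path
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in disabled and nxt not in seen):
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None  # unreachable only if defects fully sever the fabric

# Detour around a single defective core at (1, 1)
print(route_around_defects(4, 4, {(1, 1)}, (0, 1), (2, 1)))
```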
Power delivery. Delivering tens of kilowatts of power uniformly across a wafer-sized chip cannot be accomplished with traditional edge-of-die power connections. Cerebras designed a custom "engine block" that delivers power directly into the face of the wafer, achieving the power density required for hundreds of thousands of active cores. The custom connector and PCB design ensures uniform voltage across the entire wafer surface [11].
Thermal management. With overall power delivery in the mid-teen kilowatt range, the WSE generates substantial heat that must be removed uniformly. Traditional heat sink attachment techniques could not be used because of the thermal expansion mismatch between silicon and copper. Cerebras invented a new material and connector design that allows the wafer to expand and contract while remaining in thermal contact with a copper heat exchanger. Water flows through micro-fins on the backside of the heat exchanger, and the wafer slides against the polished front surface, maintaining thermal coupling despite differing coefficients of thermal expansion [11].
Cross-reticle connectivity. A standard photolithographic reticle can expose only a portion of a wafer at a time. To create a chip that spans the entire wafer, Cerebras had to connect circuits across reticle boundaries with high bandwidth and low latency, a problem unique to wafer-scale integration.
Die-to-die communication. The fabric interconnect that links all 900,000 cores must provide enormous aggregate bandwidth while consuming minimal power and area. The WSE-3's on-chip fabric provides 214 Pbit/s of bandwidth, enabling data to flow between any two cores on the wafer with predictable, low latency.
The first-generation Wafer-Scale Engine was announced in August 2019. It featured 400,000 AI-optimized processing cores, 1.2 trillion transistors, and 18 gigabytes of on-chip SRAM. The CS-1 system, which housed the WSE-1, included twelve 100 Gigabit Ethernet connections for data transfer. At the time, it was by far the largest chip ever fabricated, with a die area of 46,225 square millimeters, roughly 56 times larger than the biggest GPU available.
In April 2021, Cerebras announced the second-generation Wafer-Scale Engine (WSE-2), manufactured using TSMC's 7nm process. The WSE-2 represented a major leap in specifications:
| Specification | WSE-1 | WSE-2 |
|---|---|---|
| Transistors | 1.2 trillion | 2.6 trillion |
| AI cores | 400,000 | 850,000 |
| On-chip SRAM | 18 GB | 40 GB |
| Memory bandwidth | 9.6 PB/s | 20 PB/s |
| Fabric bandwidth | 100 Pb/s | 220 Pb/s |
| Process node | 16nm | 7nm |
The CS-2 system built around the WSE-2 became the primary commercial product for Cerebras, used by research institutions and enterprises for AI training. Notably, the 40 GB of on-chip SRAM eliminated the need for external high-bandwidth memory (HBM), allowing the entire working set of many AI models to reside directly on the processor.
The third-generation Wafer-Scale Engine (WSE-3) was announced in March 2024 at a dedicated launch event and later presented in detail at the Hot Chips 2024 conference. Manufactured on TSMC's 5nm process, the WSE-3 is the most powerful AI chip built to date, containing approximately 4 trillion transistors across its 46,225 mm2 die area.
| Specification | WSE-2 | WSE-3 |
|---|---|---|
| Transistors | 2.6 trillion | 4 trillion |
| AI cores (active) | 850,000 | 900,000 |
| Physical cores | ~900,000 | 970,000 |
| On-chip SRAM | 40 GB | 44 GB |
| Memory bandwidth | 20 PB/s | 21 PB/s |
| Fabric bandwidth | 220 Pb/s | 214 Pb/s |
| Peak AI performance | N/A | 125 PFLOPS |
| Process node | 7nm | 5nm |
| Manufacturer | TSMC | TSMC |
The WSE-3 delivers 125 petaflops of peak AI performance, which Cerebras claims is roughly double the performance of the WSE-2 at the same power consumption and cost. The memory bandwidth of 21 PB/s is approximately 7,000 times greater than the NVIDIA H100's off-chip HBM bandwidth, which is the fundamental advantage that enables Cerebras's inference speed records. Because the WSE stores model weights in on-chip SRAM distributed alongside the compute cores, data travels only fractions of a millimeter from memory to compute, rather than crossing package boundaries as it does in GPU architectures [3].
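A rough check of that bandwidth ratio; the H100 figure used below (~3.35 TB/s of HBM bandwidth for the SXM part) is an outside assumption rather than a number from this article:

```python
# Rough check of the ~7,000x memory-bandwidth claim. The H100 HBM figure
# (~3.35 TB/s for the SXM variant) is an assumption, not a number from this article.
wse3_sram_bw = 21e15        # 21 PB/s on-chip SRAM bandwidth
h100_hbm_bw = 3.35e12       # ~3.35 TB/s off-chip HBM bandwidth (assumed)

print(wse3_sram_bw / h100_hbm_bw)  # ~6,300x; rounding HBM down to ~3 TB/s gives the ~7,000x figure
```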
The Cerebras CS-2 and CS-3 are complete computing systems designed to house the WSE-2 and WSE-3 chips, respectively. Each system integrates the wafer-scale processor with all necessary power delivery, cooling, and I/O connectivity in a single rack-mountable unit. The systems are designed to be straightforward to deploy, requiring only standard datacenter power and cooling.
The CS-3, powered by the WSE-3, delivers 125 petaflops of AI compute while consuming 23 kW of power per system. Key system-level specifications include:
| Specification | CS-3 |
|---|---|
| Processor | WSE-3 (4T transistors, 900K cores) |
| Peak AI compute | 125 PFLOPS |
| On-chip memory | 44 GB SRAM |
| System memory (with MemoryX) | Up to 1.2 PB |
| Power consumption | 23 kW |
| Cooling | Direct liquid cooling |
| Max systems in cluster | 2,048 (via SwarmX) |
| Max cluster compute | ~0.25 zettaFLOPS |
One of the key advantages of the CS systems is that they eliminate the need for complex multi-node distributed computing setups. A single CS-3, for instance, can train models that would otherwise require clusters of hundreds of GPUs, dramatically simplifying the software stack and reducing the engineering effort required to scale AI training.
Cerebras developed two complementary technologies to extend the CS systems beyond single-chip limitations:
MemoryX is an external memory system that extends the available memory for the WSE beyond the on-chip SRAM. With up to 1.2 petabytes of capacity, MemoryX enables the CS-3 to handle models with trillions of parameters by streaming weight data to the processor at high bandwidth. This is large enough to store models with 24 trillion parameters in a single logical memory space without partitioning or refactoring, enabling training of next-generation frontier models 10x larger than GPT-4 and Gemini [3].
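A rough capacity check of the 24-trillion-parameter claim; the bytes-per-parameter values below are generic assumptions (FP16 weights plus FP32 Adam-style optimizer state), not Cerebras-published figures:

```python
# Rough capacity check of the 24-trillion-parameter claim. Bytes-per-parameter
# values are generic assumptions (FP16 weights, FP32 Adam-style optimizer state),
# not Cerebras-published figures.
params = 24e12
weights_bytes = params * 2            # FP16 weights only: ~48 TB
training_bytes = params * 16          # weights + gradients + two optimizer moments: ~384 TB

memoryx_bytes = 1.2e15                # 1.2 PB of MemoryX capacity
print(weights_bytes / 1e12, "TB of weights")
print(training_bytes / 1e12, "TB including optimizer state")
print(training_bytes < memoryx_bytes) # True: fits within MemoryX with headroom
```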
SwarmX is a high-bandwidth fabric that connects multiple CS systems together. Up to 2,048 CS-3 systems can be linked via SwarmX to build hyperscale AI supercomputers delivering up to a quarter of a zettaFLOP. SwarmX maintains near-linear scaling efficiency, meaning that doubling the number of CS-3 systems nearly doubles aggregate performance.
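The quarter-zettaFLOP figure follows directly from the per-system numbers, assuming the near-linear scaling Cerebras describes:

```python
# Cluster-scale arithmetic behind the quarter-zettaFLOP figure,
# assuming the near-linear scaling described above.
per_system_pflops = 125
max_systems = 2_048

peak_zettaflops = per_system_pflops * max_systems / 1e6  # 1 zettaFLOP = 1e6 petaFLOPS
print(peak_zettaflops)  # 0.256 -> roughly a quarter of a zettaFLOP
```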
Cerebras Inference launched as a cloud-based inference service that leverages the unique architecture of the WSE to deliver what the company describes as the fastest LLM inference available. Because the WSE keeps the entire model on-chip (or streams it efficiently via MemoryX), it avoids the memory bandwidth bottleneck that limits GPU-based inference.
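Assuming an OpenAI-compatible endpoint, which Cerebras advertises for this service, a minimal call with the standard openai Python client might look like the sketch below; the base URL, model identifier, and environment-variable name are illustrative assumptions to verify against current Cerebras documentation:

```python
# Minimal sketch of calling Cerebras Inference through an OpenAI-compatible client.
# The base URL, model identifier, and environment-variable name are assumptions
# for illustration; consult the current Cerebras documentation for exact values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",      # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],     # assumed environment variable
)

response = client.chat.completions.create(
    model="llama3.1-70b",                       # assumed model identifier
    messages=[{"role": "user", "content": "Explain wafer-scale integration in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```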
The CS-3 has repeatedly set tokens-per-second inference records across a range of models:
| Model | Tokens/second | Notes |
|---|---|---|
| Llama 3.1 8B | 1,800 | Single-user latency |
| Llama 3.1 70B | 2,100 | Per-user, roughly 8x faster than H200 |
| Llama 3.1 405B | 969 | Largest open-source model at launch |
| Llama 4 Scout | 2,600+ | Announced at launch |
| gpt-oss-120B | 2,700+ | Via Core42 partnership |
| K2 Think | 2,000 | Reasoning-optimized model |
In head-to-head comparisons, Cerebras claims the CS-3 achieves over 21x faster inference than NVIDIA's flagship Blackwell B200 GPU on the Llama 3 70B model in a reasoning scenario with 1,024 input tokens and 4,096 output tokens. On the gpt-oss-120B model, the CS-3 delivered 2,700+ tokens/second compared to approximately 900 tokens/second on Blackwell B200 [4].
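Put in concrete terms, the per-user throughput figures quoted in this section translate directly into response times for the 4,096-token output of that reasoning scenario:

```python
# Response-time arithmetic for the 4,096-token reasoning output described above,
# using only the per-user throughput figures quoted in this section.
output_tokens = 4_096

cs3_tok_per_s = 2_100    # Llama 3.1 70B on CS-3, per user
dgx_tok_per_s = 250      # per-user figure from the comparison table below

print(output_tokens / cs3_tok_per_s)  # ~2 seconds on the CS-3
print(output_tokens / dgx_tok_per_s)  # ~16 seconds on the GPU system
```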
The service supports popular open-source models including Llama 3.1 (8B, 70B, and 405B variants), Llama 4 Scout, Mistral models, and others. In partnership with Mistral, Cerebras powered the Le Chat assistant, which claimed records for inference speed. Cerebras and Core42 (a subsidiary of G42) also launched global access to OpenAI's gpt-oss-120B model, serving it at approximately 3,000 tokens per second.
The speed advantage of Cerebras Inference has made it particularly attractive for real-time applications, agentic AI workflows, and scenarios where low latency directly impacts user experience. Cerebras planned to increase its inference capacity from 2 million to over 40 million tokens per second by Q4 2025, distributed across eight data centers.
Cerebras has also contributed open-source models and research to the AI community, most notably the Cerebras-GPT family of openly licensed GPT-style models.
One of Cerebras's most significant strategic partnerships has been with G42, the Abu Dhabi-based AI technology holding company. Together, they built the Condor Galaxy constellation of AI supercomputers:
| System | Location | Compute | AI Cores | WSE Generation |
|---|---|---|---|---|
| CG-1 | Santa Clara, CA | 4 exaFLOPS | 54 million | WSE-2 |
| CG-2 | Undisclosed | 4 exaFLOPS | 54 million | WSE-2 |
| CG-3 | Dallas, TX | 8 exaFLOPS | 58 million | WSE-3 |
| Full constellation (planned) | Global | 36 exaFLOPS | - | Mixed |
The full Condor Galaxy constellation of nine interconnected AI supercomputers is designed to deliver 36 exaFLOPS of AI compute by the end of 2026, making it one of the largest collections of interconnected AI supercomputers in the world.
The G42 partnership provided Cerebras with both a major customer and a development platform for demonstrating the capabilities of its wafer-scale technology at datacenter scale. However, the relationship also created complications for Cerebras's IPO plans due to U.S. regulatory scrutiny of G42's ties to China and the UAE. By early 2026, Cerebras had restructured its investor base so that G42 no longer ranked among its primary stakeholders, satisfying U.S. regulators.
In March 2026, Amazon Web Services (AWS) signed a multiyear deal with Cerebras to make the WSE-3 wafer-scale chip available to cloud customers through Amazon Bedrock. AWS became the first major cloud provider to offer Cerebras's disaggregated inference solution, with plans to launch Cerebras hardware on Amazon Bedrock in the coming months and add open-source LLM support later in 2026. This partnership represented a significant validation of Cerebras's technology by one of the world's largest cloud providers.
In January 2026, Cerebras agreed to provide 750 megawatts of computing power to OpenAI through 2028, in a deal estimated to be worth more than $10 billion. This agreement underscored the growing demand for alternatives to NVIDIA GPUs for large-scale AI workloads and positioned Cerebras as a key infrastructure provider for one of the world's leading AI research organizations.
The architectural differences between Cerebras's wafer-scale approach and traditional GPU clusters lead to fundamentally different performance profiles:
| Metric | Cerebras CS-3 | NVIDIA DGX B200 (8x B200) | Advantage |
|---|---|---|---|
| Per-user inference speed (Llama 3.1 70B) | ~2,100 tok/s | ~250 tok/s | CS-3 ~8x faster |
| Memory bandwidth (on-chip SRAM vs. HBM) | 21 PB/s | ~64 TB/s (8 GPUs combined) | CS-3 ~300x higher |
| Programming model | Single device | Distributed (tensor/pipeline parallelism) | CS-3 simpler |
| Power consumption | 23 kW (system) | ~14 kW (full system) | DGX lower total power |
| Training flexibility | Optimized via MemoryX/SwarmX | Industry-standard frameworks | GPUs more flexible |
| Software ecosystem | Cerebras SDK | CUDA/PyTorch/TensorFlow | GPUs much broader |
The CS-3's primary advantage is inference latency, driven by the massive on-chip memory bandwidth that eliminates the memory wall problem. For training, GPU clusters retain advantages in software ecosystem maturity and flexibility, though Cerebras has made significant progress in training support through its compiler and MemoryX/SwarmX infrastructure.
Cerebras has raised substantial capital through multiple funding rounds:
| Round | Date | Amount | Lead Investors |
|---|---|---|---|
| Series A | 2016 | $25M | Benchmark Capital |
| Series B | 2017 | $60M | Benchmark Capital |
| Series C | 2018 | $112M | Benchmark Capital |
| Series D | 2019 | $272M | Koch Disruptive Technologies |
| Series E | 2021 | $250M | Alpha Wave Ventures |
| Series F | 2021 | $720M | Alpha Wave Ventures, Abu Dhabi Growth Fund |
| Series G | Late 2025 | $1.1B | Fidelity, Atreides Management |
Following the Series G round, analysts projected a listing valuation exceeding $15 billion.
Cerebras filed for an IPO in late 2024, selecting the Nasdaq under the reserved ticker symbol CBRS. However, the initial filing faced delays due to regulatory concerns related to the company's partnership with G42 and the broader geopolitical scrutiny of AI technology transfers. After restructuring its investor base, Cerebras rekindled its IPO plans in late 2025, targeting a listing in the second quarter of 2026.
The anticipated IPO has drawn significant attention from investors, as Cerebras represents one of the few pure-play AI hardware companies challenging NVIDIA's dominance in the accelerator market.
Cerebras competes in the AI accelerator market against several established and emerging players:
| Competitor | Approach | Key Products |
|---|---|---|
| NVIDIA | GPU-based accelerators | H100, H200, B200, B300 |
| AMD | GPU-based accelerators | MI300X, MI325X, MI350 |
| Google | Custom ASICs (TPUs) | TPU v6 (Trillium), TPU v7 (Ironwood) |
| Groq | LPU inference accelerators | GroqChip1, LPU v2 |
| Intel | Gaudi accelerators (discontinued) | Gaudi 3 |
| Amazon | Custom ASICs | Trainium2, Trainium3 |
Cerebras differentiates itself through the sheer scale of its wafer-scale approach, the simplicity of programming a single massive chip versus a distributed cluster, and its focus on both training and inference performance. The company's inference speed records have been a particularly effective marketing tool, demonstrating tangible performance advantages over GPU-based solutions.
Cerebras's wafer-scale approach offers several distinct advantages: enormous on-chip memory bandwidth that sidesteps the memory wall, the simplicity of programming a single device rather than a distributed GPU cluster, and record-setting inference throughput. Its main limitations are a software ecosystem far smaller than NVIDIA's CUDA stack, dependence on MemoryX and SwarmX for models and clusters that exceed a single system, and the specialized packaging, power delivery, and cooling that wafer-scale integration requires.
As of early 2026, Cerebras is in a strong position. The company has secured major partnerships with AWS and OpenAI, raised over $2.4 billion in total funding, and is preparing for its anticipated Nasdaq IPO in Q2 2026. The WSE-3 and CS-3 systems continue to set inference speed records, and the company is expanding its cloud-based inference service to reach a broader market of AI developers and enterprises. The Condor Galaxy constellation continues to grow, with CG-3 now operational and the full 36 exaFLOP network on track for completion by end of 2026. With the AI inference market growing rapidly as deployed models scale to serve billions of users, Cerebras's focus on inference speed positions it well for the next phase of AI infrastructure buildout.