The Cerebras WSE-3 (Wafer-Scale Engine 3) is the third-generation wafer-scale AI processor developed by Cerebras Systems, announced on March 13, 2024. Built on TSMC's 5nm process node, it integrates 4 trillion transistors and 900,000 AI-optimized compute cores onto a single die spanning 46,225 square millimeters, equivalent to nearly an entire 300mm silicon wafer. It is the largest semiconductor chip ever produced, carrying 54% more transistors than its predecessor, the WSE-2. The WSE-3 powers the CS-3 system, Cerebras' third-generation AI supercomputer appliance, and underpins the Cerebras Inference cloud service launched in August 2024.
Cerebras Systems was founded in 2015 by Andrew Feldman, Gary Lauterbach, Michael James, Sean Lie, and Jean-Philippe Fricker with the explicit goal of redesigning the processor from first principles for AI workloads. In August 2019, at Hot Chips, the company unveiled the original Wafer-Scale Engine (WSE-1): the first and only trillion-transistor processor at that time. Manufactured on TSMC's 16nm process, it packed 1.2 trillion transistors, 400,000 cores, and 18GB of on-chip SRAM onto 46,225mm² of silicon.
The fundamental design premise departed from conventional chip design by treating the entire 300mm wafer as a single processor rather than dicing it into dozens of smaller chips. Conventional wafer yields benefit from smaller die sizes because any defect affects only one chip among many. Cerebras addressed yield by building redundant circuitry into the wafer, so defective cores are routed around rather than discarded.
In April 2021, Cerebras unveiled the second-generation WSE-2, produced on TSMC's 7nm process. The WSE-2 raised the transistor count to 2.6 trillion, increased the core count to 850,000 AI-optimized cores, expanded on-chip SRAM to 40GB, and more than doubled memory bandwidth to 20 petabytes per second. Total fabric bandwidth reached 220 petabits per second across the on-wafer interconnect mesh.
The CS-2 system based on the WSE-2 could support clusters of up to 192 nodes through Cerebras' weight streaming architecture, which separates parameter storage from the compute wafer. In this model, weights reside in an external MemoryX cluster and stream onto the wafer during computation, while the SwarmX fabric coordinates gradient reduction across nodes. CS-2 pricing was estimated at approximately $2.5 million per node.
By early 2024, the WSE-2 and CS-2 were deployed at research institutions, pharmaceutical companies, and large technology groups, and had trained notable models including the Jais-30B Arabic language model and the Med42 medical language model via the Condor Galaxy supercomputer network.
Across the three WSE generations, each step shrinks the process node and substantially raises the transistor count while the die area stays fixed:
| Specification | WSE-1 (2019) | WSE-2 (2021) | WSE-3 (2024) |
|---|---|---|---|
| Process node | TSMC 16nm | TSMC 7nm | TSMC 5nm |
| Transistors | 1.2 trillion | 2.6 trillion | 4 trillion |
| AI cores | 400,000 | 850,000 | 900,000 |
| On-chip SRAM | 18 GB | 40 GB | 44 GB |
| Memory bandwidth | 9 PB/s | 20 PB/s | 21 PB/s |
| Die area | 46,225 mm² | 46,225 mm² | 46,225 mm² |
| System | CS-1 | CS-2 | CS-3 |
| Max cluster size | N/A | 192 nodes | 2,048 nodes |
The area has remained fixed across all three generations by design, as Cerebras sized its architecture to one full 300mm wafer and has not needed to change that baseline. The improvements come entirely from shrinking the process node and redesigning the compute cores.
The WSE-3 is fabricated on TSMC's 5nm process (N5), down from the 7nm used for the WSE-2. The chip measures 21.5 centimeters on each side and covers 46,225mm², the same physical footprint as all prior WSE generations, since the company designed its architecture around a fixed 300mm wafer platform. Despite keeping the same area, the move to 5nm raised transistor density enough to reach 4 trillion transistors, a 54% increase over the 2.6 trillion of the WSE-2.
The core count rose modestly from 850,000 to 900,000 AI-optimized cores, reflecting a design choice to widen each core's math pipeline rather than simply multiply core count. The WSE-3 SIMD units are eight-wide for FP16 operations, doubled from the four-wide units in the WSE-2, which accounts for the bulk of the 2x performance improvement at the same power envelope.
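The arithmetic behind the claim is straightforward (a back-of-the-envelope sketch; the clock rate below is a placeholder assumption, since Cerebras does not publish one, and it cancels out of the ratio):

```python
# Illustrative peak-FP16 scaling from SIMD widening. The clock rate is
# a placeholder assumption, not a published figure; it cancels out of
# the generation-over-generation ratio.
CLOCK_HZ = 1.0e9      # assumed core clock (placeholder)
FLOPS_PER_LANE = 2    # one fused multiply-add = 2 FLOPs

def peak_fp16(cores: int, simd_width: int) -> float:
    return cores * simd_width * FLOPS_PER_LANE * CLOCK_HZ

wse2 = peak_fp16(850_000, 4)   # WSE-2: 4-wide FP16 SIMD
wse3 = peak_fp16(900_000, 8)   # WSE-3: 8-wide FP16 SIMD
print(f"WSE-3 / WSE-2 peak ratio: {wse3 / wse2:.2f}x")  # ~2.12x
```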
On-chip SRAM increased to 44GB from 40GB, a 10% gain. Memory bandwidth improved to 21 petabytes per second from 20 petabytes per second. The on-wafer network fabric bandwidth is 214 petabits per second, broadly similar to the WSE-2's 220 petabits per second. These relatively modest memory gains reflect that SRAM density scales far more slowly than logic density at 5nm, and both remain bounded by the fixed wafer area.
The WSE-3 does not include HBM or any form of stacked off-die memory integrated into the chip package. All 44GB of memory is SRAM distributed physically alongside the compute cores across the wafer surface. This co-location eliminates the off-chip memory bus bottleneck that limits GPU memory bandwidth at scale, but it also makes 44GB a hard ceiling for models that cannot be weight-streamed from external storage.
Cerebras connects cores through a 2D mesh fabric etched onto the wafer itself. Because all wires are on-wafer, latencies are in the range of nanoseconds, and the aggregate bandwidth of 214 petabits per second across the mesh exceeds the interconnect bandwidth of much larger GPU clusters by orders of magnitude. Cerebras claims the CS-3's on-wafer fabric provides more than 200 times the interconnect bandwidth of an NVIDIA NVL72 rack containing 72 B200 GPUs.
One manufacturing challenge specific to wafer-scale design is that the silicon wafer is divided by narrow "scribe lines" used in conventional chip fabrication to separate individual dies before dicing. Cerebras worked with TSMC to repurpose these scribe lines as conductive paths, creating die-to-die connections across the wafer without adding a separate packaging layer.
Because 44GB of on-chip SRAM cannot hold the parameters of a large language model (Llama 3 70B alone requires roughly 140GB), Cerebras employs a weight streaming execution model. In this approach, model weights are stored in an external MemoryX appliance rather than on the wafer. During a forward or backward pass, weights for each layer stream from MemoryX to the wafer compute cores layer by layer. After computation, updated gradients flow back. The wafer's 21 petabytes per second of SRAM-to-core bandwidth allows streaming to proceed fast enough that the external memory latency does not become the bottleneck for most workloads.
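In pseudocode terms, the execution model resembles a layer-by-layer streaming loop. The sketch below is a minimal illustration; the `memoryx` and `wafer` objects and their methods are hypothetical stand-ins, not the Cerebras SDK API:

```python
# Minimal sketch of weight-streaming execution. All object and method
# names here are hypothetical illustrations, not the Cerebras SDK.

def forward_pass(num_layers, activations, memoryx, wafer):
    """Stream weights layer by layer from external MemoryX to the wafer."""
    for layer_id in range(num_layers):
        weights = memoryx.fetch(layer_id)        # stream weights in
        activations = wafer.compute(activations, weights)
        # Weights for layer N+1 can be prefetched while layer N computes,
        # hiding external-memory latency behind the 21 PB/s on-wafer
        # SRAM bandwidth.
    return activations

def backward_pass(grads, memoryx, wafer, num_layers):
    """Gradients flow back to MemoryX, where weights are updated off-wafer."""
    for layer_id in reversed(range(num_layers)):
        weights = memoryx.fetch(layer_id)
        grads, weight_grads = wafer.compute_grads(grads, weights)
        memoryx.apply_update(layer_id, weight_grads)
    return grads
```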
MemoryX configurations range from 1.5TB to 1.2 petabytes depending on deployment scale. A 1.5TB configuration supports models up to roughly 75 billion parameters, while a 1.2PB configuration supports models up to 24 trillion parameters.
The SwarmX fabric is a separate network switch designed specifically for the weight streaming architecture. It sits between MemoryX and the CS-3 compute nodes, broadcasting weights to all compute nodes simultaneously and reducing gradients back to MemoryX after each layer. SwarmX uses a tree topology that enables near-linear performance scaling: ten CS-3 nodes complete a job approximately ten times faster than a single CS-3 node, because each additional node receives its own copy of the weights from the same broadcast.
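The broadcast-and-reduce pattern can be expressed as a tree (a hypothetical sketch of the general technique, not the actual SwarmX implementation):

```python
# Sketch of SwarmX-style tree broadcast and gradient reduction over N
# nodes. Hypothetical illustration; not the actual SwarmX implementation.

def broadcast(weights, nodes):
    """Every node receives its own copy of the layer weights."""
    return {node: weights for node in nodes}

def reduce_gradients(per_node_grads):
    """Sum gradients pairwise up the tree. Tree depth is O(log N), so
    the reduction cost grows slowly as nodes are added -- the basis of
    the near-linear scaling claim."""
    grads = list(per_node_grads)
    while len(grads) > 1:
        merged = [a + b for a, b in zip(grads[0::2], grads[1::2])]
        if len(grads) % 2:            # odd node carries over unpaired
            merged.append(grads[-1])
        grads = merged
    return grads[0]

# Example: four nodes each contribute a scalar "gradient"
print(reduce_gradients([1.0, 2.0, 3.0, 4.0]))  # 10.0
```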
Producing a chip that spans an entire 300mm wafer creates a yield problem that does not exist in conventional chip manufacturing. In standard chip production, a single defect ruins only the small die containing it, leaving all other dies on the wafer usable. On a wafer-scale chip, one defect anywhere could theoretically disable the entire processor.
Cerebras addressed this with a dual approach. First, each AI core on the WSE-3 is extremely small, approximately 0.05mm² per core. By comparison, a single SM (streaming multiprocessor) on an NVIDIA H100 covers roughly 6mm². Because a random manufacturing defect disables whatever silicon area it occupies, a defect on the WSE-3 knocks out 0.05mm² of compute, while the same defect on an H100 would disable 6mm². Cerebras argues this makes the wafer-scale design roughly 100x more fault tolerant per defect at the silicon level.
Second, Cerebras incorporates approximately 1% spare cores distributed across the wafer, along with redundant mesh routing links. Distributed autonomous repair logic runs during chip initialization to detect defective cores and remap data paths around them. The on-wafer 2D mesh topology supports multiple paths between any two cores, so routing around a cluster of adjacent defects does not require significant detours. Cerebras reports 93% silicon utilization on the WSE-3 after accounting for defective cores and spare overhead.
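A simplified model of the repair step: scan the core grid for defects and remap each defective coordinate to a nearby spare. This is a sketch of the general technique only; the actual on-wafer repair logic is proprietary:

```python
# Simplified sketch of defect remapping on a 2D core mesh. The real
# on-wafer repair logic is proprietary; this only illustrates the
# remap-to-spare idea described above.

def build_remap(defective: set[tuple[int, int]],
                spares: list[tuple[int, int]]) -> dict:
    """Map each defective core's coordinates to a spare core."""
    if len(defective) > len(spares):
        raise RuntimeError("not enough spare cores to repair wafer")
    pool = sorted(spares)
    remap = {}
    for core in sorted(defective):
        # Pick the nearest spare (Manhattan distance) to keep the
        # mesh detour short.
        best = min(pool, key=lambda s: abs(s[0] - core[0]) + abs(s[1] - core[1]))
        pool.remove(best)
        remap[core] = best
    return remap

# Example: two defects, three spare cores on a small grid
print(build_remap({(0, 1), (2, 2)}, [(0, 5), (3, 2), (5, 5)]))
# {(0, 1): (0, 5), (2, 2): (3, 2)}
```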
The WSE-3 includes hardware accelerators for both dynamic and unstructured sparsity. Most AI accelerators, including NVIDIA H100 and B200, support only structured sparsity (2:4 sparsity), which requires that exactly two of every four weights be zero. Unstructured sparsity, where any weight can be zero, is more flexible and can achieve higher compression ratios but is computationally harder to accelerate in hardware.
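The difference between the two constraint regimes can be made concrete (a minimal sketch using NumPy):

```python
import numpy as np

# Structured 2:4 sparsity: every contiguous group of 4 weights must
# contain exactly two zeros. Unstructured sparsity has no such rule.

def is_2_to_4_sparse(weights: np.ndarray) -> bool:
    groups = weights.reshape(-1, 4)
    return bool(np.all((groups == 0).sum(axis=1) == 2))

structured   = np.array([0.5, 0.0, -1.2, 0.0,   0.0, 0.3, 0.0, 0.9])
unstructured = np.array([0.5, 0.0,  0.0, 0.0,  -1.2, 0.3, 0.7, 0.9])

print(is_2_to_4_sparse(structured))    # True  -> H100/B200 can accelerate
print(is_2_to_4_sparse(unstructured))  # False -> 50% sparse overall, but
                                       # needs unstructured-sparsity
                                       # hardware like the WSE-3
```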
Cerebras claims up to 8x speedup for models compressed with unstructured sparsity on the WSE-3. This capability underpins the company's partnership with Qualcomm: models trained on the CS-3 with sparsity can be deployed on Qualcomm AI 100 Ultra inference accelerators at a claimed 10x reduction in inference cost compared to dense models on conventional hardware.
The CS-3 is Cerebras' third-generation system appliance built around the WSE-3. A single CS-3 occupies 15U of rack space and consumes 23 kilowatts, the same thermal envelope as the CS-2 despite delivering twice the compute performance.
The CS-3 uses liquid cooling with redundant pumps to manage the heat load from a 23kW compute element in 15U. External connectivity is via fiber. The system supports three memory configurations through its MemoryX attachment:
| MemoryX configuration | Parameter capacity | Suitable deployment |
|---|---|---|
| 1.5 TB | Up to ~75B parameters | Single-model inference |
| 12 TB | Up to ~600B parameters | Large model training |
| 1.2 PB | Up to 24T parameters | Hyperscale training |
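The published tiers imply a per-parameter memory budget well above the 2 bytes of a bare FP16 weight (a back-of-the-envelope check; attributing the overhead to gradients and optimizer state is an assumption, not a published Cerebras breakdown):

```python
# Implied bytes-per-parameter across the published MemoryX tiers.
# Attributing the overhead to gradients and optimizer state is an
# assumption, not a published Cerebras breakdown.
TB, PB = 1e12, 1e15

tiers = [(1.5 * TB, 75e9), (12 * TB, 600e9), (1.2 * PB, 24e12)]
for capacity, params in tiers:
    print(f"{capacity / TB:8.1f} TB -> {capacity / params:.0f} bytes/param")
#    1.5 TB -> 20 bytes/param
#   12.0 TB -> 20 bytes/param
# 1200.0 TB -> 50 bytes/param
# The two smaller tiers work out to ~20 bytes/param (FP16 weights plus
# training state); the top tier implies additional headroom.
```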
Clusters scale up to 2,048 CS-3 systems through the SwarmX fabric. A 2,048-node cluster delivers 256 exaflops of AI compute at FP16 precision and contains a total of 1.84 billion compute cores. By comparison, the CS-2 supported clusters of up to 192 systems, so the CS-3 increased the maximum cluster size by a factor of roughly 10.7.
Cerebras priced the CS-3 at parity with the CS-2 at approximately $2.5 to $3.1 million per node. A full 2,048-node cluster would cost in the range of $5 to $6 billion.
Software support includes native PyTorch 2.0 integration. Cerebras states that a GPT-3 training script requires 565 lines of code on the CS-3, compared to roughly 20,000 lines for an equivalent GPU cluster implementation, because the CS-3 abstracts away the distributed communication primitives that GPU users must manage manually. The system also supports pure data-parallel training for models from 1 billion to 24 trillion parameters without requiring model parallelism or tensor parallelism strategies.
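The line-count claim refers to loop structure like the following: a plain single-device-style PyTorch loop with no `torch.distributed` setup and no tensor- or pipeline-parallel sharding. This is a generic sketch, not Cerebras's actual GPT-3 script, and how the Cerebras compiler maps it onto CS-3 hardware is not shown:

```python
# Generic single-device-style PyTorch training loop of the kind the
# line-count comparison refers to: no distributed initialization, no
# explicit all-reduce, no model sharding.
import torch
import torch.nn as nn

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for step in range(100):
    batch = torch.randn(8, 128, 512)    # stand-in input batch
    target = torch.randn(8, 128, 512)   # stand-in target
    optimizer.zero_grad()
    loss = loss_fn(model(batch), target)
    loss.backward()                      # no gradient-sync code needed
    optimizer.step()
```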
Condor Galaxy is a series of AI supercomputers built jointly by Cerebras and G42, the Abu Dhabi-based technology holding company. The network is designed to eventually comprise nine interconnected supercomputers totaling 36 exaflops of AI compute.
| Supercomputer | Compute | Underlying hardware | Location | Status |
|---|---|---|---|---|
| Condor Galaxy 1 (CG-1) | 4 exaFLOPS | 64 CS-2 nodes | Santa Clara, California | Operational |
| Condor Galaxy 2 (CG-2) | 4 exaFLOPS | 64 CS-2 nodes | Dallas, Texas | Operational |
| Condor Galaxy 3 (CG-3) | 8 exaFLOPS | 64 CS-3 nodes | Dallas, Texas | Under construction (announced March 2024) |
| CG-4 through CG-9 | 20 exaFLOPS combined | CS-3 nodes | Multiple locations | Planned |
CG-1, launched in 2023, was among the largest AI supercomputers in the world at the time of its announcement, delivering 4 exaFLOPS and hosting 54 million compute cores. CG-1 and CG-2 used Cerebras CS-2 systems. CG-3, announced on March 13, 2024 (the same day as the WSE-3 announcement), is the first Condor Galaxy system built on CS-3 hardware. CG-3 uses 64 CS-3 nodes and delivers 8 exaFLOPS, double the capacity of each of its predecessors.
The Condor Galaxy network has trained several publicly released models: Jais-30B (an Arabic foundation model), Med42 (a medical domain model), Crystal-Coder-7B (a code generation model), and BTLM-3B-8K (a general-purpose language model with an 8,192-token context window). G42 accounted for 83% of Cerebras' reported revenue for 2023 and 97% of hardware sales in the first half of 2024, reflecting how central the Condor Galaxy partnership was to Cerebras' commercial operations during this period.
On August 27, 2024, Cerebras launched Cerebras Inference, a commercial cloud API offering LLM inference at throughputs it claimed were the fastest publicly available at the time.
At launch, the service delivered:
| Model | Tokens per second | Comparison to GPU clouds |
|---|---|---|
| Llama 3.1 8B | 1,800 | ~20x faster than NVIDIA GPU-based hyperscale clouds |
| Llama 3.1 70B | 450 | ~20x faster than NVIDIA GPU-based hyperscale clouds |
Cerebras positioned the service against Groq LPU-based inference, claiming 2.4x higher throughput for the 8B model. Pricing started at $0.10 per million tokens. Developers received one million free tokens per day at launch. Cerebras also advertised that the service uses native 16-bit weights rather than the 8-bit quantization common among inference providers, noting that 16-bit models score up to 5% higher than 8-bit counterparts on standard benchmarks.
The speed advantage comes from the WSE-3's architecture. For smaller models, all KV cache and activation state fits within the 44GB of SRAM during single-batch inference; larger models are handled through pipeline parallelism across a small number of CS-3 nodes. Either way, memory access latency per token is substantially lower than on GPU systems, which shuttle data between HBM and compute units on every step.
For Llama 3.1 70B, which requires approximately 140GB of memory, Cerebras distributes the model's 80 layers across four CS-3 systems using pipeline parallelism, with each system handling 20 layers. Tokens flow through the pipeline sequentially, and the system's fast per-layer processing keeps latency low.
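The even split described above is simple to express (a sketch of the stage boundaries only; scheduling and inter-system transfers are not modeled):

```python
# Even pipeline split of Llama 3.1 70B's 80 layers across 4 CS-3
# systems, as described above. Stage boundaries only.
NUM_LAYERS, NUM_SYSTEMS = 80, 4
per_stage = NUM_LAYERS // NUM_SYSTEMS   # 20 layers per system

stages = [range(i * per_stage, (i + 1) * per_stage)
          for i in range(NUM_SYSTEMS)]
for sys_id, layers in enumerate(stages):
    print(f"CS-3 #{sys_id}: layers {layers.start}-{layers.stop - 1}")
# CS-3 #0: layers 0-19 ... CS-3 #3: layers 60-79
```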
In November 2024, Cerebras raised the Llama 3.1 70B throughput to 2,100 tokens per second, a 4.7x improvement over the August launch figure, attributed to software optimizations including speculative decoding. Around the same time, Cerebras demonstrated Llama 3.1 405B inference at 969 tokens per second, another claimed record for that model size.
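Speculative decoding, the cited optimization, pairs a small draft model with the large target model: the draft proposes several tokens cheaply, and the target verifies them all in a single forward pass. The sketch below is a generic greedy version of the technique; the `draft_model` and `target_model` interfaces are hypothetical, and this is not Cerebras's implementation:

```python
# Generic greedy speculative decoding loop. The model interfaces are
# hypothetical illustrations, not Cerebras's implementation.

def speculative_decode(draft_model, target_model, prompt, k=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model.next_token(tokens + draft))
        # 2. Large model verifies all k proposals in ONE forward pass
        #    and returns the accepted prefix of the draft.
        verified = target_model.verify(tokens, draft)
        tokens.extend(verified)
        if len(verified) < len(draft):
            # First rejected position: substitute the target model's
            # own token, then start a new draft round.
            tokens.append(target_model.next_token(tokens))
    return tokens
```

When most draft tokens are accepted, each expensive target-model pass yields several tokens instead of one, which is how throughput rises without changing the model's output distribution under greedy decoding.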
The rapid inference throughput numbers Cerebras published in 2024 came amid a broader competition among specialized inference hardware vendors. The main contenders for the fastest-LLM-inference title in 2024 were Cerebras (WSE-3), Groq LPU, and SambaNova.
Groq's Language Processing Unit (LPU) is built around a different philosophy than the WSE-3: it uses a large array of relatively small chips, each with a modest amount of SRAM, interconnected through a synchronous fabric. Groq was the first to publicize sub-second response latencies for 7B and 13B models at scale and attracted early developer attention.
SambaNova uses a reconfigurable dataflow architecture; its SN40L chip, released in late 2023, combines on-chip SRAM with HBM to allow larger models per system than either Groq or a single CS-3. In mid-2024, SambaNova claimed to have exceeded 1,000 tokens per second on the 8B model and was reporting around 580 tokens per second for 70B models when Cerebras launched its inference service.
At the August 2024 Cerebras Inference launch, Cerebras placed the competitive benchmark for Llama 3.1 8B at 1,800 tokens per second, ahead of SambaNova's reported 1,084 and Groq's reported 750 for the same model. On 70B models Cerebras initially trailed, reporting 450 tokens per second against SambaNova's 580 and Groq's 544. By November 2024, after software updates, Cerebras pushed its 70B throughput to 2,100 tokens per second, well ahead of both competitors.
These comparisons are vendor-reported and have not been independently verified through a standardized benchmark suite such as MLPerf Inference. The methodologies differ: vendors vary batch sizes, context lengths, and precision settings, making direct comparison difficult. SambaNova also noted that its system uses full 16-bit precision on 405B models while Cerebras' 405B inference required distributing the model across multiple CS-3 nodes.
The WSE-3 and the NVIDIA H100 and B200 serve overlapping but distinct purposes. The H100 and B200 are general-purpose data center GPUs used across training, inference, and simulation workloads at a wide range of model sizes. The WSE-3 is optimized specifically for AI training and inference workloads that fit within the weight streaming architecture.
| Specification | Cerebras CS-3 (WSE-3) | NVIDIA H100 SXM | NVIDIA B200 SXM |
|---|---|---|---|
| Process node | TSMC 5nm | TSMC 4N | TSMC 4NP |
| Transistors | 4 trillion | 80 billion | ~208 billion |
| Die area | 46,225 mm² | 814 mm² | ~904 mm² |
| On-chip memory | 44 GB SRAM | 80 GB HBM3 | 192 GB HBM3e |
| Memory bandwidth | 21 PB/s (on-wafer) | 3.35 TB/s | 8.0 TB/s |
| Peak AI compute (FP16) | 125 petaFLOPS | 1.979 petaFLOPS | ~4.5 petaFLOPS |
| Interconnect | 214 Pb/s on-wafer | NVLink 900 GB/s | NVLink 1.8 TB/s |
| System power | 23 kW (CS-3) | 700 W (per GPU) | 1,000 W (per GPU) |
| Form factor | 15U appliance | SXM module (server-integrated) | SXM module (server-integrated) |
The transistor and die area comparisons illustrate the wafer-scale approach: the WSE-3 integrates roughly 50 times more transistors than the H100 and covers approximately 57 times more silicon area. The comparison is not transistor-for-transistor, however: the WSE-3 devotes its transistors to compute logic, SRAM, and the on-wafer fabric, with no HBM or off-die memory interface circuitry at all.
For peak AI compute, a single CS-3 (125 petaFLOPS FP16) outperforms a single H100 (1.979 petaFLOPS FP16) by approximately 63x and exceeds a single B200 (approximately 4.5 petaFLOPS FP16) by approximately 28x. However, NVIDIA systems are deployed in racks: an NVL72 rack containing 72 B200 GPUs delivers roughly 324 petaFLOPS at FP16, exceeding the CS-3 at the rack level while consuming approximately 132 kW compared to the CS-3's 23 kW. The CS-3 has a substantially better performance-per-watt ratio on paper.
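The performance-per-watt claim follows directly from the quoted figures:

```python
# Performance-per-watt from the FP16 figures quoted above.
cs3_pflops, cs3_kw = 125, 23          # one CS-3 system
nvl72_pflops, nvl72_kw = 324, 132     # one NVL72 rack (72x B200)

print(f"CS-3:  {cs3_pflops / cs3_kw:.2f} PFLOPS/kW")     # ~5.43
print(f"NVL72: {nvl72_pflops / nvl72_kw:.2f} PFLOPS/kW")  # ~2.45
# On paper, the CS-3 delivers roughly 2.2x the FP16 compute per watt.
```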
A March 2025 arXiv paper from national-laboratory researchers directly compared Cerebras wafer-scale technology with NVIDIA GPU-based systems and found that the CS-3 achieved approximately 3.5x higher throughput at FP8 precision and 7x higher throughput at FP16 precision than a single H100-based system for the specific workloads tested.
The interconnect comparison is where the architectural gap is widest. The WSE-3's on-wafer fabric at 214 petabits per second vastly exceeds NVLink's 900 GB/s (7.2 terabits per second) per H100 or 1.8 TB/s (14.4 terabits per second) per B200. Cerebras states the CS-3 provides over 200 times the interconnect bandwidth of the full NVL72 rack. This matters most for training jobs where gradient synchronization across many units becomes the bottleneck.
The critical limitation is memory capacity. The WSE-3's 44GB of SRAM is much smaller than the 80GB per H100 or 192GB per B200, and smaller still than the 5.76TB to 13.8TB of total HBM across a full 72-GPU NVL72 rack. Large models that do not fit neatly within Cerebras' weight streaming pipeline suffer performance degradation, whereas on GPU clusters they can often be tensor-parallelized or pipeline-parallelized with fewer architectural constraints. Models beyond roughly 24 trillion parameters exceed even the largest available MemoryX configuration.
G42 is Cerebras' largest customer and a strategic investor. The Condor Galaxy partnership, under which G42 committed to purchase approximately $1.43 billion worth of Cerebras computing systems and services, accounts for the dominant share of Cerebras' hardware revenue. G42 uses the Condor Galaxy network to train large Arabic and multilingual language models and to develop enterprise AI products for the Middle East and Africa markets. Jais-30B, a 30-billion parameter Arabic foundation model, was trained on Condor Galaxy 1 and released in 2023.
Mayo Clinic is a Cerebras customer for healthcare AI research. The partnership, announced in 2022 and deepened at the January 2024 J.P. Morgan Healthcare Conference, focuses on training genomic foundation models on data from more than 100,000 patients. In 2024, Cerebras and Mayo Clinic jointly unveiled a genomic model designed to identify patient response to therapy and shorten time to effective treatment selection. Mayo Clinic also uses Cerebras systems as part of a three-way collaboration with Microsoft Research to develop multimodal models trained on radiology images including CT scans and MRIs.
AstraZeneca has used Cerebras hardware for drug discovery modeling. TotalEnergies deployed Cerebras systems for climate modeling and carbon capture simulations; a Cerebras case study reported a 210x speedup over an NVIDIA H100 for a specific carbon capture fluid dynamics simulation. Lawrence Livermore National Laboratory and other U.S. Department of Energy research sites have used Cerebras hardware for scientific computing, representing the company's government and scientific-computing business beyond commercial AI training. Other reported customers include financial services firms and defense research organizations.
By the time Cerebras filed its revised IPO prospectus in April 2026, OpenAI had emerged as a significant customer. The filing disclosed a $10 billion commitment from OpenAI for Cerebras compute resources, representing a major shift from the company's earlier near-total dependence on G42. The OpenAI relationship reflected broader demand for high-throughput inference capacity as OpenAI scaled its API and product workloads.
Cerebras filed an S-1 registration statement with the Securities and Exchange Commission on September 30, 2024, disclosing its intention to raise funds through an initial public offering. The filing revealed $78.7 million in revenue for fiscal year 2023, $24.6 million for 2022, and $136.4 million for the first half of 2024.
The filing also disclosed that G42 accounted for 83% of Cerebras' 2023 revenue and 97% of hardware revenue for the first half of 2024. This concentration attracted scrutiny from the Committee on Foreign Investment in the United States (CFIUS), which opened a national security review of the relationship given G42's ties to the UAE government and its prior investments involving Chinese technology companies. The CFIUS review prompted Cerebras to withdraw the 2024 IPO in late October 2024.
Days before withdrawing, Cerebras announced it had raised a $1.1 billion funding round at an $8.1 billion valuation. The company subsequently filed a new S-1 in April 2026, disclosing $510 million in revenue for 2025 and a $23 billion valuation, and targeted raising $3.5 billion. The 2026 filing reflected a substantially larger customer base including OpenAI as a major customer.
The WSE-3 and the broader Cerebras architecture have several practical constraints.
Memory capacity is the most cited limitation. The 44GB of on-chip SRAM is small relative to HBM-equipped GPUs. Although weight streaming mitigates this for training workloads by keeping weights off-chip, inference workloads must also accommodate KV cache growth as context length increases. At high batch sizes and long contexts, the KV cache can consume a substantial portion of the 44GB, reducing effective model capacity. Running inference on a 1-trillion parameter model requires networking approximately 45 CS-3 systems together, which at $2.5 to $3.1 million per node implies a hardware cost of over $100 million for that configuration.
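The 45-system figure follows from capacity arithmetic (assuming 16-bit weights, consistent with the service's advertised native FP16; KV cache and activations are ignored here):

```python
# Capacity arithmetic behind the ~45-system figure for 1T-parameter
# inference, assuming 16-bit (2-byte) weights.
params = 1e12
bytes_per_param = 2          # FP16
sram_per_cs3 = 44e9          # 44 GB of on-wafer SRAM per system

systems = params * bytes_per_param / sram_per_cs3
print(f"~{systems:.0f} CS-3 systems")                         # ~45
print(f"~${systems * 2.5e6 / 1e6:.0f}M to "
      f"${systems * 3.1e6 / 1e6:.0f}M in hardware")           # ~$114M-$141M
# KV cache and activations consume SRAM too, so the practical
# system count is higher still.
```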
The chip does not support FP64 double-precision computation, which limits its usefulness for scientific computing workloads that require numerical precision beyond what AI training needs. General-purpose computing tasks that do not map well onto the 2D mesh topology also perform poorly.
The wafer-scale manufacturing approach involves tradeoffs in yield and cost. Although Cerebras' redundancy scheme routes around defective cores, producing usable wafers at sufficient volume to meet enterprise demand requires a tight manufacturing partnership with TSMC, and each wafer yields a single processor rather than dozens, reducing the economies of scale from which conventional chipmakers benefit.
The software ecosystem is narrower than that surrounding NVIDIA GPUs. While the CS-3 supports PyTorch 2.0, it does not support the full CUDA ecosystem, and workloads written for NVIDIA hardware typically require porting. Cerebras provides its own SDK and compiler, and the company claims that porting reduces code complexity, but the initial migration effort is a barrier for organizations with existing GPU-optimized codebases.