Cerebras WSE-3
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 6,542 words
The Cerebras WSE-3 (Wafer-Scale Engine 3) is the third-generation wafer-scale AI processor developed by Cerebras Systems, announced on March 13, 2024.[^1] Built on TSMC's 5nm process node, it integrates 4 trillion transistors and 900,000 AI-optimized compute cores onto a single die spanning 46,225 square millimeters, equivalent to nearly an entire 300mm silicon wafer.[^1][^2] It is the largest semiconductor chip ever produced and carries 54% more transistors than its predecessor, the WSE-2.[^1][^3] The WSE-3 powers the CS-3 system, Cerebras' third-generation AI supercomputer appliance, and underpins the Cerebras Inference cloud service launched in August 2024,[^8][^9] which by 2025 had become the platform behind Meta's official Llama API,[^21] Mistral's Le Chat, and AI workloads at IBM, Notion, AWS, and the U.S. Department of Defense. The WSE-3 was also the chip behind Cerebras' May 14, 2026 Nasdaq IPO under ticker CBRS, which raised $5.55 billion at a $39.8 billion basic valuation (approximately $95 billion fully diluted) and made the offering the largest pure-play AI hardware IPO to date.[^27][^28][^29]
Cerebras Systems was founded in 2015 by Andrew Feldman, Gary Lauterbach, Michael James, Sean Lie, and Jean-Philippe Fricker with the explicit goal of redesigning the processor from first principles for AI workloads. In August 2019, at Hot Chips, the company unveiled the original Wafer-Scale Engine (WSE-1): the first and only trillion-transistor processor at that time. Manufactured on TSMC's 16nm process, it packed 1.2 trillion transistors, 400,000 cores, and 18GB of on-chip SRAM onto 46,225mm² of silicon.
The fundamental design premise departed from conventional chip design by treating the entire 300mm wafer as a single processor rather than dicing it into dozens of smaller chips. Conventional wafer yields benefit from smaller die sizes because any defect affects only one chip among many. Cerebras addressed yield by building redundant circuitry into the wafer, so defective cores are routed around rather than discarded.
In April 2021, Cerebras unveiled the second-generation WSE-2, produced on TSMC's 7nm process. The WSE-2 raised the transistor count to 2.6 trillion, increased the core count to 850,000 AI-optimized cores, expanded on-chip SRAM to 40GB, and increased memory bandwidth to 20 petabytes per second. Total fabric bandwidth reached 220 petabits per second across the on-wafer interconnect mesh.
The CS-2 system based on the WSE-2 could support clusters of up to 192 nodes through Cerebras' weight streaming architecture, which separates parameter storage from the compute wafer. In this model, weights reside in an external MemoryX cluster and stream onto the wafer during computation, while the SwarmX fabric coordinates gradient reduction across nodes. CS-2 pricing was estimated at approximately $2.5 million per node.
By early 2024, the WSE-2 and CS-2 were deployed at research institutions, pharmaceutical companies, and large technology groups, and had trained notable models including the Jais-30B Arabic language model and the Med42 medical language model via the Condor Galaxy supercomputer network.
Across the three WSE generations, the process node shrank from 16nm to 5nm and the transistor count more than tripled, while the die area stayed fixed:
| Specification | WSE-1 (2019) | WSE-2 (2021) | WSE-3 (2024) |
|---|---|---|---|
| Process node | TSMC 16nm | TSMC 7nm | TSMC 5nm |
| Transistors | 1.2 trillion | 2.6 trillion | 4 trillion |
| AI cores | 400,000 | 850,000 | 900,000 |
| On-chip SRAM | 18 GB | 40 GB | 44 GB |
| Memory bandwidth | 9 PB/s | 20 PB/s | 21 PB/s |
| Die area | 46,225 mm² | 46,225 mm² | 46,225 mm² |
| System | CS-1 | CS-2 | CS-3 |
| Max cluster size | N/A | 192 nodes | 2,048 nodes |
The area has remained fixed across all three generations by design, as Cerebras sized its architecture to one full 300mm wafer and has not needed to change that baseline. The improvements come entirely from shrinking the process node and redesigning the compute cores.
The WSE-3 is fabricated on TSMC's 5nm process (N5), a transition down from the 7nm used for the WSE-2.[^1][^2] The chip measures 21.5 centimeters on each side and covers 46,225mm², the same physical footprint as all prior WSE generations since the company designed its wafer processes around a fixed 300mm platform. Despite keeping the same area, the move to 5nm allowed Cerebras to raise transistor density enough to reach 4 trillion transistors, a 54% increase over the 2.6 trillion of the WSE-2.[^1]
The core count rose modestly from 850,000 to 900,000 AI-optimized cores, reflecting a design choice to widen each core's math pipeline rather than simply multiply core count.[^7][^17] The WSE-3 SIMD units are eight-wide for FP16 operations, doubled from the four-wide units in the WSE-2, which accounts for the bulk of the 2x performance improvement at the same power envelope.[^7][^17] At peak, a single WSE-3 delivers 125 petaFLOPS of FP16 compute,[^1][^3] which Cerebras has summarized as approximately equivalent to 62 NVIDIA H100 GPUs in raw matrix throughput, though the comparison ignores precision and clustering differences.[^3]
On-chip SRAM increased to 44GB from 40GB, a 10% gain.[^1] Memory bandwidth improved to 21 petabytes per second from 20 petabytes per second.[^1] The on-wafer network fabric bandwidth is 214 petabits per second, broadly similar to the WSE-2's 220 petabits per second.[^17] These relatively modest gains in memory metrics reflect that bandwidth and capacity improvements at 5nm over 7nm are constrained by the same physical wafer area.
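The generation-over-generation percentages quoted above can be checked with a few lines of arithmetic. The sketch below uses only figures from the generation table; the dictionary layout and function names are illustrative, not part of any Cerebras tooling.

```python
# Figures are copied from the generation table above; the layout of this
# dictionary is only an illustration.
GENERATIONS = {
    "WSE-1": {"transistors": 1.2e12, "cores": 400_000, "sram_gb": 18},
    "WSE-2": {"transistors": 2.6e12, "cores": 850_000, "sram_gb": 40},
    "WSE-3": {"transistors": 4.0e12, "cores": 900_000, "sram_gb": 44},
}
DIE_SIDE_MM = 215                          # 21.5 cm per side
DIE_AREA_MM2 = DIE_SIDE_MM * DIE_SIDE_MM   # same area for all three generations


def pct_gain(old, new):
    """Percentage increase going from old to new."""
    return (new - old) / old * 100.0


def transistors_per_mm2(generation):
    """Average transistor density over the whole die area."""
    return GENERATIONS[generation]["transistors"] / DIE_AREA_MM2


for metric in ("transistors", "cores", "sram_gb"):
    gain = pct_gain(GENERATIONS["WSE-2"][metric], GENERATIONS["WSE-3"][metric])
    print(f"WSE-2 -> WSE-3 {metric}: +{gain:.0f}%")
```

With a fixed die area, the transistor gain is purely a density gain, which is why the density function divides by the same constant for every generation.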
The WSE-3 does not include HBM or any form of stacked off-die memory integrated into the chip package. All 44GB of memory is SRAM distributed physically alongside the compute cores across the wafer surface. This co-location eliminates the off-chip memory bus bottleneck that limits GPU memory bandwidth at scale, but it also means the 44GB ceiling is hard for models that cannot be weight-streamed from external storage. Cerebras has often summarized the advantage by citing roughly 7,000 times more memory bandwidth than an NVIDIA H100, which it identifies as the underlying enabler of its inference throughput claims.[^9]
Cerebras connects cores through a 2D mesh fabric etched onto the wafer itself. Because all wires are on-wafer, latencies are in the range of nanoseconds, and the aggregate bandwidth of 214 petabits per second across the mesh exceeds the interconnect bandwidth of much larger GPU clusters by orders of magnitude. Cerebras claims the CS-3's on-wafer fabric provides more than 200 times the interconnect bandwidth of an NVIDIA NVL72 rack containing 72 B200 GPUs.[^20]
One manufacturing challenge specific to wafer-scale design is that the silicon wafer is divided by narrow "scribe lines" used in conventional chip fabrication to separate individual dies before dicing. Cerebras worked with TSMC to repurpose these scribe lines as conductive paths, creating die-to-die connections across the wafer without adding a separate packaging layer.[^18]
Because 44GB of on-chip SRAM cannot hold the parameters of a large language model (Llama 3 70B alone requires roughly 140GB), Cerebras employs a weight streaming execution model. In this approach, model weights are stored in an external MemoryX appliance rather than on the wafer. During a forward or backward pass, weights for each layer stream from MemoryX to the wafer compute cores layer by layer. After computation, updated gradients flow back. The wafer's 21 petabytes per second of SRAM-to-core bandwidth allows streaming to proceed fast enough that the external memory latency does not become the bottleneck for most workloads.
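The layer-by-layer idea can be illustrated with a toy forward pass in which only one layer's weights are resident at a time. This is a minimal sketch under stated assumptions: `ExternalWeightStore` is a hypothetical stand-in for an off-wafer parameter store, the layers are tiny dense matrices, and nothing here reflects the actual MemoryX protocol or Cerebras software.

```python
def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu(x):
    return [max(0.0, v) for v in x]


class ExternalWeightStore:
    """Hypothetical off-wafer parameter store holding every layer."""

    def __init__(self, layers):
        self._layers = layers

    def num_layers(self):
        return len(self._layers)

    def fetch(self, index):
        """Return the weights of one layer (the 'stream in' step)."""
        return self._layers[index]


def streamed_forward(store, x):
    """Forward pass that holds a single layer's weights at a time.

    Returns the output vector and the peak number of weight values
    that were resident at once.
    """
    peak_resident = 0
    for i in range(store.num_layers()):
        weight = store.fetch(i)
        peak_resident = max(peak_resident, sum(len(row) for row in weight))
        x = relu(matvec(weight, x))
        del weight  # drop the reference before fetching the next layer
    return x, peak_resident


store = ExternalWeightStore([
    [[1, 0], [0, 1]],
    [[2, 0], [0, 2]],
    [[1, 1], [1, 1]],
])
output, peak = streamed_forward(store, [1.0, 2.0])
print(output, peak)
```

The point of the sketch is the memory profile: the model has twelve weight values in total, but the compute side never holds more than four.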
MemoryX configurations range from 1.5TB to 1.2 petabytes depending on deployment scale.[^1][^6] A 1.5TB MemoryX configuration supports models up to roughly 24 billion parameters at 16-bit precision, while a 1.2PB configuration supports models up to 24 trillion parameters.[^1]
The SwarmX fabric is a separate network switch designed specifically for the weight streaming architecture. It sits between MemoryX and the CS-3 compute nodes, broadcasting weights to all compute nodes simultaneously and reducing gradients back to MemoryX after each layer. SwarmX uses a tree topology that enables near-linear performance scaling: ten CS-3 nodes complete a job approximately ten times faster than a single CS-3 node, because each additional node receives its own copy of the weights from the same broadcast.[^6][^7]
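The broadcast-then-reduce pattern described above can be sketched as a data-parallel step with a pairwise tree reduction. The function names and the toy gradient function are assumptions for illustration; the real SwarmX fabric performs this in network hardware, and its internal details are not described in the source.

```python
def tree_reduce(vectors):
    """Sum equal-length vectors pairwise, level by level.

    Returns (summed_vector, number_of_tree_levels).
    """
    if not vectors:
        raise ValueError("need at least one vector")
    level = [list(v) for v in vectors]
    depth = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append([a + b for a, b in zip(level[i], level[i + 1])])
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired node passes through unchanged
        level = nxt
        depth += 1
    return level[0], depth


def data_parallel_step(weights, shards, grad_fn):
    """Give every node the same weights, then tree-reduce their gradients."""
    grads = [grad_fn(weights, shard) for shard in shards]  # one per node
    total, depth = tree_reduce(grads)
    mean = [g / len(shards) for g in total]
    return mean, depth


# Toy gradient: scale the weights by the shard value (illustrative only).
toy_grad = lambda w, shard: [wi * shard for wi in w]
mean_grad, levels = data_parallel_step([1.0, 2.0], [1, 2, 3, 4], toy_grad)
print(mean_grad, levels)
```

Because the reduction depth grows with the logarithm of the node count rather than linearly, adding nodes adds little synchronization cost in this idealized model, which is the intuition behind the near-linear scaling claim.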
Producing a chip that spans an entire 300mm wafer creates a yield problem that does not exist in conventional chip manufacturing. In standard chip production, a single defect ruins only the small die containing it, leaving all other dies on the wafer usable. On a wafer-scale chip, one defect anywhere could theoretically disable the entire processor.
Cerebras addressed this with a dual approach. First, each AI core on the WSE-3 is extremely small, approximately 0.05mm² per core. By comparison, a single SM (streaming multiprocessor) unit on an NVIDIA H100 covers roughly 6mm². Because a random manufacturing defect disables whatever silicon area it occupies, a defect on the WSE-3 knocks out 0.05mm² of compute, while the same defect on an H100 would disable 6mm² of compute. Cerebras argues this makes the wafer-scale design approximately 100x more fault tolerant per defect, at the silicon area level.[^17]
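The area argument is simple division, and the sketch below reproduces it from the figures in the text. The constant and function names are illustrative. Note that the quoted areas give a ratio of about 120, which Cerebras rounds to "approximately 100x".

```python
WAFER_AREA_MM2 = 46_225
CORE_COUNT = 900_000
WSE3_CORE_AREA_MM2 = 0.05   # per-core figure quoted in the text
H100_SM_AREA_MM2 = 6.0      # per-SM figure quoted in the text


def area_lost_ratio(large_unit_mm2, small_unit_mm2):
    """How many times more silicon one defect disables in the larger unit."""
    return large_unit_mm2 / small_unit_mm2


def mean_area_per_core(wafer_mm2, cores):
    """Wafer area divided evenly across cores (includes memory and fabric)."""
    return wafer_mm2 / cores


print(round(area_lost_ratio(H100_SM_AREA_MM2, WSE3_CORE_AREA_MM2)))
print(round(mean_area_per_core(WAFER_AREA_MM2, CORE_COUNT), 3))
```

The second figure is a consistency check: dividing the full wafer area by the core count gives roughly 0.051mm² per core, in line with the quoted 0.05mm².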
Second, Cerebras incorporates approximately 1% spare cores distributed across the wafer, along with redundant mesh routing links. Distributed autonomous repair logic runs during chip initialization to detect defective cores and remap data paths around them. The on-wafer 2D mesh topology supports multiple paths between any two cores, so routing around a cluster of adjacent defects does not require significant detours. Cerebras reports 93% silicon utilization on the WSE-3 after accounting for defective cores and spare overhead.[^17]
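The remapping idea can be shown with a deliberately simplified one-dimensional model: logical core IDs are assigned to healthy physical cores in order, and defective cores are skipped. This is a sketch only; the function name is hypothetical, and the actual repair logic operates on a 2D mesh with redundant links rather than a flat list.

```python
def build_core_map(physical_cores, defective, logical_cores):
    """Map logical core ids onto healthy physical cores, skipping defects.

    physical_cores: total cores fabricated, including spares
    defective:      set of physical core ids found bad at initialization
    logical_cores:  number of cores the software expects to see
    """
    healthy = [p for p in range(physical_cores) if p not in defective]
    if len(healthy) < logical_cores:
        raise ValueError("not enough healthy cores to cover the logical array")
    return {logical: healthy[logical] for logical in range(logical_cores)}


# Ten physical cores, two of them spare, two found defective.
core_map = build_core_map(physical_cores=10, defective={2, 5}, logical_cores=8)
print(core_map)
```

Software addressing logical core 2 is transparently served by physical core 3. If defects outnumber spares, the function raises instead of returning a partial map, mirroring the fact that spare capacity bounds how many defects can be absorbed.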
The WSE-3 includes hardware accelerators for both dynamic and unstructured sparsity. Most AI accelerators, including NVIDIA H100 and B200, support only structured sparsity (2:4 sparsity), which requires that at least two of every four consecutive weights be zero. Unstructured sparsity, where any weight can be zero, is more flexible and can achieve higher compression ratios but is computationally harder to accelerate in hardware.
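The difference between the two sparsity patterns is easy to see on a small weight list. The sketch below checks the 2:4 constraint (at least two zeros in each aligned group of four) and shows a simple magnitude-based pruning rule that produces it. The function names and the pruning heuristic are illustrative assumptions, not a description of any vendor's tooling.

```python
def satisfies_2_4(weights):
    """True if every aligned group of four weights has at least two zeros."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    return all(
        sum(1 for w in weights[i:i + 4] if w == 0) >= 2
        for i in range(0, len(weights), 4)
    )


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


def prune_2_4(weights):
    """Zero the two smallest-magnitude weights in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(group[j] if j in keep else 0 for j in range(4))
    return out


structured = [1, 0, 0, 2, 0, 3, 4, 0]
unstructured = [0, 0, 0, 0, 1, 2, 3, 0]
print(satisfies_2_4(structured), sparsity(structured))
print(satisfies_2_4(unstructured), sparsity(unstructured))
```

The unstructured example is sparser overall (62.5% zeros against 50%) yet fails the 2:4 check, because its zeros are not evenly spread across groups. Hardware limited to 2:4 patterns cannot exploit that extra sparsity.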
Cerebras claims up to 8x speedup for models compressed with unstructured sparsity on the WSE-3.[^17] This capability underpins the company's partnership with Qualcomm: models trained on the CS-3 with sparsity can be deployed on Qualcomm AI 100 Ultra inference accelerators at a claimed 10x reduction in inference cost compared to dense models on conventional hardware.
The CS-3 is Cerebras' third-generation system appliance built around the WSE-3. A single CS-3 occupies 15U of rack space and draws 23 kilowatts, the same power envelope as the CS-2 despite delivering twice the compute performance.[^1][^6]
The CS-3 uses liquid cooling with redundant pumps to manage the heat load from a 23kW compute element in 15U. External connectivity is via fiber. The system supports three memory configurations through its MemoryX attachment:
| MemoryX configuration | Parameter capacity | Suitable deployment |
|---|---|---|
| 1.5 TB | Up to ~75B parameters | Single-model inference |
| 12 TB | Up to ~600B parameters | Large model training |
| 1.2 PB | Up to 24T parameters | Hyperscale training |
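Dividing each capacity in the table by its parameter count gives the memory budget per parameter that the row implies. The sketch below assumes decimal units (1 TB = 10^12 bytes); the names are illustrative. The source does not say what the extra bytes hold, so any explanation (optimizer state, gradients, headroom) is an assumption.

```python
FP16_BYTES_PER_PARAM = 2

# (label, capacity in bytes, parameter capacity) taken from the table above.
MEMORYX_CONFIGS = [
    ("1.5 TB", 1.5e12, 75e9),
    ("12 TB", 12e12, 600e9),
    ("1.2 PB", 1.2e15, 24e12),
]


def bytes_per_parameter(capacity_bytes, parameters):
    """Memory budget per parameter implied by a capacity/parameter pair."""
    return capacity_bytes / parameters


for label, capacity, params in MEMORYX_CONFIGS:
    budget = bytes_per_parameter(capacity, params)
    print(f"{label}: {budget:.0f} bytes per parameter "
          f"({budget / FP16_BYTES_PER_PARAM:.0f}x the FP16 weight size)")
```

All three rows budget well over the 2 bytes that a 16-bit weight alone needs, and the largest configuration implies a different per-parameter budget from the two smaller ones, so the rows should not be read as a single linear scaling rule.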
Clusters scale up to 2,048 CS-3 systems through the SwarmX fabric.[^1] A 2,048-node cluster delivers 256 exaflops of AI compute at FP16 precision and contains a total of 1.84 billion compute cores.[^1] By comparison, the CS-2 supported clusters of up to 192 systems, so the CS-3 increased the maximum cluster size by a factor of roughly 10.7.
Cerebras priced the CS-3 at parity with the CS-2 at approximately $2.5 to $3.1 million per node.[^7] At those per-node prices, a full 2,048-node cluster would cost roughly $5.1 to $6.3 billion.
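The cluster-level figures follow directly from the per-node numbers. The sketch below reproduces them; constant and function names are illustrative, and the cost range simply multiplies the quoted per-node price estimates with no volume discount assumed.

```python
PFLOPS_PER_CS3 = 125            # peak FP16 petaFLOPS per system
CORES_PER_WSE3 = 900_000
MAX_CS3_NODES = 2_048
MAX_CS2_NODES = 192
PRICE_RANGE_USD = (2.5e6, 3.1e6)  # estimated per-node price range


def cluster_exaflops(nodes, pflops_per_node=PFLOPS_PER_CS3):
    """Aggregate peak compute in exaFLOPS (1 exaFLOPS = 1,000 petaFLOPS)."""
    return nodes * pflops_per_node / 1_000


def cluster_cores(nodes, cores_per_node=CORES_PER_WSE3):
    return nodes * cores_per_node


def cluster_cost_range(nodes, price_range=PRICE_RANGE_USD):
    """Low and high cost estimates in US dollars."""
    return tuple(nodes * price for price in price_range)


print(cluster_exaflops(MAX_CS3_NODES))
print(cluster_cores(MAX_CS3_NODES))
print(round(MAX_CS3_NODES / MAX_CS2_NODES, 1))
print([round(c / 1e9, 2) for c in cluster_cost_range(MAX_CS3_NODES)])
```

The same function applied to 64 nodes gives 8 exaFLOPS, matching the figure quoted for a 64-node CS-3 installation.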
Software support includes native PyTorch 2.0 integration. Cerebras states that a GPT-3 training script requires 565 lines of code on the CS-3, compared to roughly 20,000 lines for an equivalent GPU cluster implementation, because the CS-3 abstracts away the distributed communication primitives that GPU users must manage manually. The system also supports pure data-parallel training for models from 1 billion to 24 trillion parameters without requiring model parallelism or tensor parallelism strategies.
Condor Galaxy is a series of AI supercomputers built jointly by Cerebras and G42, the Abu Dhabi-based technology holding company. The network is designed to eventually comprise nine interconnected supercomputers totaling 36 exaflops of AI compute.
| Supercomputer | Compute | Underlying hardware | Location | Status |
|---|---|---|---|---|
| Condor Galaxy 1 (CG-1) | 4 exaFLOPS | 64 CS-2 nodes | Santa Clara, California | Operational |
| Condor Galaxy 2 (CG-2) | 4 exaFLOPS | 64 CS-2 nodes | Dallas, Texas | Operational |
| Condor Galaxy 3 (CG-3) | 8 exaFLOPS | 64 CS-3 nodes | Dallas, Texas | Under construction (announced March 2024) |
| CG-4 through CG-9 | 20 exaFLOPS combined | CS-3 nodes | Multiple locations | Planned |
CG-1, launched in 2023, was among the largest AI supercomputers in the world at the time of its announcement, delivering 4 exaFLOPS and hosting 54 million compute cores.[^13] CG-1 and CG-2 used Cerebras CS-2 systems. CG-3, announced on March 13, 2024 (the same day as the WSE-3 announcement), is the first Condor Galaxy system built on CS-3 hardware.[^12] CG-3 uses 64 CS-3 nodes and delivers 8 exaFLOPS, double the capacity of each of its predecessors.
The Condor Galaxy network has trained several publicly released models: Jais-30B (an Arabic foundation model), Med42 (a medical domain model), Crystal-Coder-7B (a code generation model), and BTLM-3B-8K (a general-purpose language model with 8,192 context length). G42 accounted for 83% of Cerebras' reported revenue for 2023 and 97% of hardware sales in the first half of 2024,[^15] reflecting how central the Condor Galaxy partnership was to Cerebras' commercial operations during this period.
On August 27, 2024, Cerebras launched Cerebras Inference, a commercial cloud API offering LLM inference at throughputs it claimed were the fastest publicly available at the time.[^8][^9]
At launch, the service delivered:
| Model | Tokens per second | Comparison to GPU clouds |
|---|---|---|
| Llama 3.1 8B | 1,800 | ~20x faster than NVIDIA GPU-based hyperscale clouds |
| Llama 3.1 70B | 450 | ~20x faster than NVIDIA GPU-based hyperscale clouds |
Cerebras positioned the service against Groq LPU-based inference, claiming 2.4x higher throughput for the 8B model. Pricing started at $0.10 per million tokens for Llama 3.1 8B and $0.60 per million tokens for Llama 3.1 70B.[^9] Developers received one million free tokens per day at launch. Cerebras also advertised that the service uses native 16-bit weights rather than the 8-bit quantization common among inference providers, noting that 16-bit models score up to 5% higher than 8-bit counterparts on standard benchmarks.[^8] Artificial Analysis independently verified launch throughput at over 1,800 output tokens per second for Llama 3.1 8B and over 446 output tokens per second for Llama 3.1 70B.[^9]
The speed advantage comes from the WSE-3's architecture: because all KV cache and activation state fit within 44GB of SRAM during single-batch inference (for smaller models) or are handled through pipeline parallelism across a small number of CS-3 nodes, memory access latency per token is substantially lower than on GPU systems that move data between HBM and compute units on each step.
For Llama 3.1 70B, which requires approximately 140GB of memory, Cerebras distributes the model's 80 layers across four CS-3 systems using pipeline parallelism, with each system handling 20 layers. Tokens flow through the pipeline sequentially, and the system's fast per-layer processing keeps latency low.
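The layer split described above is a contiguous partition of layer indices into pipeline stages. The sketch below shows one way to compute it, along with a rough per-stage weight footprint. It assumes weights are spread evenly across layers and ignores embeddings, KV cache, and activations; the function names are illustrative.

```python
def partition_layers(num_layers, num_stages):
    """Split layer indices into contiguous, near-equal pipeline stages."""
    if num_stages <= 0 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # spread any remainder
        stages.append(range(start, start + size))
        start += size
    return stages


def weight_gb_per_stage(total_gb, num_layers, stage):
    """Weight footprint of one stage, assuming equal-sized layers."""
    return total_gb * len(stage) / num_layers


stages = partition_layers(num_layers=80, num_stages=4)
for i, stage in enumerate(stages):
    gb = weight_gb_per_stage(140, 80, stage)
    print(f"system {i}: layers {stage[0]}-{stage[-1]}, ~{gb:.0f} GB of weights")
```

Under these simplifying assumptions each system holds about 35 GB of weights, which fits within a single wafer's 44 GB of SRAM and is consistent with the choice of four systems.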
In November 2024, Cerebras raised the Llama 3.1 70B throughput to 2,100 tokens per second, a 4.7x improvement over the August launch figure, attributed to software optimizations including speculative decoding.[^10] Around the same time, Cerebras demonstrated Llama 3.1 405B inference at 969 tokens per second, another claimed record for that model size.[^11]
Inference performance continued to scale through 2025 as Cerebras refined the speculative decoding stack, multi-CS-3 pipeline schedulers, and an internal compiler pass that fuses adjacent attention and MLP layers within a single wafer pass. In early 2025, the company reported tripling its industry-leading inference performance, surpassing 2,500 tokens per second on the 70B model and reaching new records on Llama 3.3 variants.[^23] The third-party benchmarking firm Artificial Analysis ranked the Cerebras endpoint as the fastest publicly available LLM API in successive monthly reports through the first half of 2025.
In April 2025, alongside Meta's release of Llama 4 (the Scout and Maverick mixture-of-experts variants), Cerebras launched same-day inference support and published Artificial Analysis-verified numbers of over 2,600 tokens per second on Llama 4 Scout, which the firm described as 19x faster than the fastest GPU solutions tested for the same model (which Artificial Analysis measured at 137 tokens per second).[^22] On the 400-billion-parameter Llama 4 Maverick model, Artificial Analysis benchmarked Cerebras at 2,522 tokens per second compared to 1,038 tokens per second on NVIDIA Blackwell, an explicit head-to-head result on a frontier model.[^30] In the same Maverick test, Artificial Analysis recorded SambaNova at 794 t/s, Groq at 549 t/s, Amazon at 290 t/s, Google at 125 t/s, and Microsoft Azure at 54 t/s, placing Cerebras roughly 3x ahead of the next-fastest specialty silicon and 2.4x ahead of Blackwell.[^30]
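The multiples quoted above can be recomputed from the raw tokens-per-second figures. The sketch below uses only numbers from the text; the dictionary and function names are illustrative.

```python
# Llama 4 Maverick output tokens per second, as reported in the text.
MAVERICK_TOKENS_PER_SEC = {
    "Cerebras": 2522,
    "NVIDIA Blackwell": 1038,
    "SambaNova": 794,
    "Groq": 549,
    "Amazon": 290,
    "Google": 125,
    "Microsoft Azure": 54,
}


def speedups(results, baseline="Cerebras"):
    """Ratio of the baseline's throughput to every other entry."""
    ref = results[baseline]
    return {name: ref / tps for name, tps in results.items() if name != baseline}


for name, ratio in speedups(MAVERICK_TOKENS_PER_SEC).items():
    print(f"Cerebras vs {name}: {ratio:.1f}x")

# Llama 4 Scout: 2,600 tokens/s against the 137 tokens/s GPU figure.
print(f"Scout: {2600 / 137:.0f}x")
```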
These numbers fall outside the MLPerf Inference suite, which Cerebras has not formally entered. The company has argued that MLPerf's batch-oriented methodology under-rewards architectures optimized for single-user latency, and instead points customers to Artificial Analysis's independent measurements. Critics note that Cerebras' absence from MLPerf makes apples-to-apples comparison with NVIDIA, Intel, and AMD submissions impossible, while Cerebras counters that customer-facing latency under realistic chat and agentic loads is the metric the inference market actually pays for.
On April 29, 2025, Meta and Cerebras jointly announced that Cerebras would serve as a launch inference partner for Meta's official Llama API, the company's first hosted endpoint for its open-weight Llama family.[^21] Under the agreement, Cerebras provides the high-throughput tier of the Llama API across its U.S.-based data centers, while Meta handles model curation, safety filtering, and developer authentication. The deal made the WSE-3 the default acceleration target for any developer who selects the fast inference option on Meta's official API.
Earlier, in February 2025, the French foundation-model startup Mistral AI announced that Le Chat, its consumer chat product, would run on Cerebras Inference and cited a speed record for its underlying Mistral Large model. The Mistral integration was the first major European foundation-model deployment on wafer-scale hardware.
Together, the Meta and Mistral partnerships marked an inflection in Cerebras' commercial profile. Where 2023 and the first half of 2024 had been dominated by G42 hardware sales, by mid-2025 the company's revenue mix had broadened to include direct inference cloud usage by Meta, IBM, Mistral, AlphaSense, Cognition (maker of the Devin software agent), and Notion, plus indirect distribution through AWS Marketplace. Cerebras' S-1/A filings disclosed that AI cloud and inference revenue, which had been close to zero in 2023, accounted for the bulk of growth toward the $510 million 2025 revenue figure.[^29]
In March 2025, Cerebras announced plans to open six new AI data centers to house CS-3 systems dedicated to inference workloads.[^24] The sites span Dallas, Minneapolis, Oklahoma City, Montreal, New York, and a location in France. Together with the existing Santa Clara and pre-existing Dallas footprints, the build-out was projected to raise total Cerebras inference capacity roughly twenty-fold, into the range of tens of millions of tokens per second at peak. The geographic spread also positioned Cerebras to offer regional endpoints to enterprise customers with data residency requirements, particularly in Canada and the European Union.
The capacity expansion was funded in part by a $1.1 billion Series G round announced in late September 2025 at an $8.1 billion valuation,[^25] with proceeds earmarked for U.S. manufacturing partnerships with TSMC and for build-out of the inference network. The data center plan also informed the eventual $20 billion-plus OpenAI commitment that began with a $10 billion deal disclosed in January 2026[^26] and expanded into the multi-year master agreement summarized in the OpenAI section below.
The rapid inference throughput numbers Cerebras published in 2024 came amid a broader competition among specialized inference hardware vendors. The main contenders for the fastest-LLM-inference title in 2024 were Cerebras (WSE-3), Groq LPU, and SambaNova.
Groq's Language Processing Unit (LPU) is built around a different philosophy than the WSE-3: it uses a large array of relatively small chips, each with a modest amount of SRAM, interconnected through a synchronous fabric. Groq was the first to publicize sub-second response latencies for 7B and 13B models at scale and attracted early developer attention.
SambaNova uses a reconfigurable dataflow architecture and its SN40L chip, released in late 2023, combined on-chip SRAM with HBM to allow larger models per system than either Groq or a single CS-3. In mid-2024, SambaNova claimed to have exceeded 1,000 tokens per second for the 8B model, and was reporting around 580 tokens per second for 70B models at the time Cerebras launched its inference service.
At the August 2024 Cerebras Inference launch, Cerebras placed the competitive benchmark for Llama 3.1 8B at 1,800 tokens per second, ahead of SambaNova's reported 1,084 and Groq's reported 750 for the same model. On 70B models the competition was tighter: Cerebras reported 450 tokens per second versus SambaNova's 580 and Groq's 544. By November 2024, after software updates, Cerebras pushed its 70B throughput to 2,100 tokens per second, well ahead of both competitors.[^10] Through 2025, the competitive picture broadened to include NVIDIA Blackwell B200 GPU clouds on the high-throughput end, with Cerebras' Llama 4 Maverick result of 2,522 tokens per second per user roughly 2.4x ahead of Blackwell in published Artificial Analysis figures.[^30]
These comparisons are vendor-reported or come from a single independent benchmarker, and they have not been independently verified through a standardized benchmark suite such as MLPerf Inference. The methodologies differ: vendors vary batch sizes, context lengths, and precision settings, making direct comparison difficult. SambaNova also noted that its system uses full 16-bit precision on 405B models while Cerebras' 405B inference required distributing the model across multiple CS-3 nodes.
The WSE-3 and NVIDIA's H100 and B200 GPUs serve overlapping but distinct purposes. The H100 and B200 are general-purpose data center GPUs used across training, inference, and simulation workloads at a wide range of model sizes. The WSE-3 is optimized specifically for AI training and inference workloads where models fit within the weight streaming architecture.
| Specification | Cerebras CS-3 (WSE-3) | NVIDIA H100 SXM | NVIDIA B200 SXM |
|---|---|---|---|
| Process node | TSMC 5nm | TSMC 4N | TSMC 4NP |
| Transistors | 4 trillion | 80 billion | ~208 billion |
| Die area | 46,225 mm² | 814 mm² | ~904 mm² |
| On-chip memory | 44 GB SRAM | 80 GB HBM3 | 192 GB HBM3e |
| Memory bandwidth | 21 PB/s (on-wafer) | 3.35 TB/s | 8.0 TB/s |
| Peak AI compute (FP16) | 125 petaFLOPS | 1.979 petaFLOPS | ~4.5 petaFLOPS |
| Interconnect | 214 Pb/s on-wafer | NVLink 900 GB/s | NVLink 1.8 TB/s |
| System power | 23 kW (CS-3) | 700 W (per GPU) | 1,000 W (per GPU) |
| Form factor | 15U appliance | SXM module (server-integrated) | SXM module (server-integrated) |
The transistor and die area comparisons illustrate the wafer-scale approach: the WSE-3 integrates roughly 50 times more transistors than the H100 and covers approximately 57 times more silicon area.[^4] However, the majority of the WSE-3's transistors go to compute logic and SRAM, not to HBM interface circuitry, since the WSE-3 does not use HBM at all.
For peak AI compute, a single CS-3 (125 petaFLOPS FP16) outperforms a single H100 (1.979 petaFLOPS FP16) by approximately 63x and exceeds a single B200 (approximately 4.5 petaFLOPS FP16) by approximately 28x.[^3] However, NVIDIA systems are deployed in racks: an NVL72 rack containing 72 B200 GPUs delivers roughly 324 petaFLOPS at FP16, exceeding the CS-3 at the rack level while consuming approximately 132 kW compared to the CS-3's 23 kW. On these peak figures the CS-3 delivers about 5.4 petaFLOPS per kilowatt against roughly 2.5 for the NVL72 rack, a substantially better performance-per-watt ratio on paper.
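The peak-compute and power comparison reduces to a few divisions. The sketch below uses the peak figures from the comparison table; names are illustrative, and peak ratings say nothing about sustained utilization on real workloads.

```python
H100_PFLOPS = 1.979
B200_PFLOPS = 4.5
CS3 = {"peak_pflops": 125.0, "power_kw": 23.0}
NVL72 = {"peak_pflops": 72 * B200_PFLOPS, "power_kw": 132.0}


def pflops_per_kw(system):
    """Peak FP16 petaFLOPS delivered per kilowatt of system power."""
    return system["peak_pflops"] / system["power_kw"]


print(f"CS-3 vs one H100: {CS3['peak_pflops'] / H100_PFLOPS:.0f}x")
print(f"CS-3 vs one B200: {CS3['peak_pflops'] / B200_PFLOPS:.0f}x")
print(f"NVL72 rack peak:  {NVL72['peak_pflops']:.0f} petaFLOPS")
print(f"CS-3:  {pflops_per_kw(CS3):.2f} petaFLOPS/kW")
print(f"NVL72: {pflops_per_kw(NVL72):.2f} petaFLOPS/kW")
```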
A March 2025 paper from researchers at national laboratories published on arXiv directly compared Cerebras wafer-scale technology with NVIDIA GPU-based systems and found that the CS-3 achieved approximately 3.5x higher throughput at FP8 precision and 7x higher throughput at FP16 precision compared to a single H100-based system for the specific workloads tested.[^16]
The interconnect comparison is where the architectural gap is widest. The WSE-3's on-wafer fabric at 214 petabits per second vastly exceeds NVLink's 900 GB/s (7.2 terabits per second) per H100 or 1.8 TB/s (14.4 terabits per second) per B200. Cerebras states the CS-3 provides over 200 times the interconnect bandwidth of the full NVL72 rack.[^20] This matters most for training jobs where gradient synchronization across many units becomes the bottleneck.
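The unit conversions in this comparison are easy to get wrong, since the on-wafer figure is quoted in bits and NVLink in bytes. The sketch below converts everything to terabits per second. The rack aggregate is computed as 72 times the per-GPU NVLink figure, which is a simplifying assumption rather than a published rack specification.

```python
ON_WAFER_TBITS = 214_000   # 214 petabits per second
NVLINK_H100_GB = 900       # gigabytes per second, per GPU
NVLINK_B200_GB = 1_800     # gigabytes per second, per GPU
NVL72_GPUS = 72


def gigabytes_to_terabits(gb_per_s):
    """Convert GB/s to Tb/s (8 bits per byte, 1,000 Gb per Tb)."""
    return gb_per_s * 8 / 1_000


h100_tbits = gigabytes_to_terabits(NVLINK_H100_GB)
b200_tbits = gigabytes_to_terabits(NVLINK_B200_GB)
rack_tbits = NVL72_GPUS * b200_tbits  # assumed aggregate for the rack

print(f"H100 NVLink: {h100_tbits:.1f} Tb/s")
print(f"B200 NVLink: {b200_tbits:.1f} Tb/s")
print(f"On-wafer fabric vs assumed NVL72 aggregate: "
      f"{ON_WAFER_TBITS / rack_tbits:.0f}x")
```

Under that assumption the ratio comes out at a little over 200, in line with the figure Cerebras cites.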
The critical limitation is memory capacity. The WSE-3's 44GB of SRAM is much smaller than the 80GB per H100 or 192GB per B200, and smaller still than the aggregate HBM of a 72-GPU deployment: 5.76TB across 72 H100s, or 13.8TB across a full NVL72 rack of B200s. Large models that do not fit neatly within Cerebras' weight streaming pipeline suffer performance degradation, whereas on GPU clusters they can often be tensor-parallelized or pipeline-parallelized with less architectural constraint. Models with more than roughly 24 trillion parameters exceed the largest available MemoryX configuration.
G42 is Cerebras' largest historical customer and a strategic investor. The Condor Galaxy partnership, under which G42 committed to purchase approximately $1.43 billion worth of Cerebras computing systems and services, accounts for the dominant share of Cerebras' early hardware revenue. G42 uses the Condor Galaxy network to train large Arabic and multilingual language models and to develop enterprise AI products for the Middle East and Africa markets. Jais-30B, a 30-billion parameter Arabic foundation model, was trained on Condor Galaxy 1 and released in 2023. G42-affiliated entities accounted for approximately 86% of Cerebras' 2025 revenue per the company's pre-IPO disclosures, indicating that customer concentration remained elevated even after the broader inference cloud build-out.[^32]
Mayo Clinic is a Cerebras customer for healthcare AI research. The partnership, announced in 2022 and deepened at the January 2024 J.P. Morgan Healthcare Conference, focuses on training genomic foundation models on data from more than 100,000 patients. In 2024, Cerebras and Mayo Clinic jointly unveiled a genomic model designed to identify patient response to therapy and shorten time to effective treatment selection.[^14] Mayo Clinic also uses Cerebras systems as part of a three-way collaboration with Microsoft Research to develop multimodal models trained on radiology images including CT scans and MRIs.
Beginning in early 2025, the customer profile shifted toward large-scale inference users. Meta selected Cerebras as a launch inference partner for the Llama API in April 2025,[^21] making the WSE-3 the default fast-tier accelerator for the Llama 4 family on Meta's official endpoint. Mistral migrated Le Chat onto Cerebras in February 2025. IBM signed a multi-year inference agreement covering watsonx workloads, and Notion adopted Cerebras to power its real-time enterprise search and AI features at scale. AlphaSense, Cognition (the developer of the Devin coding agent), and several U.S. government customers including the Department of Defense and the Department of Energy joined the list during 2025.
AstraZeneca has used Cerebras hardware for drug discovery modeling. TotalEnergies deployed Cerebras systems for climate modeling and carbon capture simulations; a Cerebras case study reported a 210x speedup over NVIDIA H100 for a specific carbon capture fluid dynamics simulation. GlaxoSmithKline adopted Cerebras for protein and small-molecule modeling work disclosed in 2025. Lawrence Livermore National Laboratory and other U.S. Department of Energy research sites have used Cerebras hardware for scientific computing, representing the carve-out the company made for government use cases beyond pure commercial AI training. Other reported customers include financial services firms and defense research organizations.
OpenAI emerged as a major Cerebras customer in 2026. On January 14, 2026, Cerebras and OpenAI disclosed an initial inference deal worth more than $10 billion under which OpenAI would deploy 750 megawatts of Cerebras CS-3 capacity to power inference workloads behind ChatGPT and the OpenAI API.[^26] Ahead of the May 2026 IPO, the relationship expanded into a multi-year master relationship agreement valued at more than $20 billion, with the same 750 MW deployed through 2028 and options for OpenAI to acquire up to an additional 1.25 gigawatts of capacity between 2029 and 2030, potentially bringing total committed capacity to approximately 2 gigawatts.[^28] As part of the arrangement, OpenAI advanced Cerebras a $1 billion working capital facility at 6% interest secured by warrants exercisable for up to 33.4 million Cerebras shares at a near-zero strike price, a structure that could give OpenAI roughly an 11% equity stake in Cerebras if all warrants vested.[^28][^33] The OpenAI commitment effectively replaced G42 as the single largest source of forward revenue and was the most-cited factor in the upsized May 2026 IPO pricing.[^27]
Cerebras filed an S-1 registration statement with the Securities and Exchange Commission on September 30, 2024, disclosing its intention to raise funds through an initial public offering.[^15] The filing revealed $78.7 million in revenue for fiscal year 2023, $24.6 million for 2022, and $136.4 million for the first half of 2024.
The filing also disclosed that G42 accounted for 83% of Cerebras' 2023 revenue and 97% of hardware revenue for the first half of 2024.[^15] This concentration attracted scrutiny from the Committee on Foreign Investment in the United States (CFIUS), which opened a national security review of the relationship given G42's ties to the UAE government and its prior investments involving Chinese technology companies. Following the CFIUS review, Cerebras withdrew the 2024 IPO filing in October 2025.
Days before withdrawing, Cerebras announced it had raised a $1.1 billion funding round at an $8.1 billion valuation.[^25] The company subsequently filed a new S-1 in April 2026, disclosing $510 million in revenue for 2025 (up 76% from 2024) and net income of $88 million versus a loss of $481.6 million the prior year, with the swing to profitability driven by the inference cloud build-out, scaling of the OpenAI relationship, and the maturation of CS-3 deployments at Meta, IBM, Mistral, and other customers.[^27][^29]
On May 13, 2026, Cerebras priced its IPO at $185 per share, above the indicated $150 to $170 range, selling 30 million Class A common shares for gross proceeds of approximately $5.55 billion. Underwriters held a 30-day option to buy an additional 4.5 million shares, which would lift gross proceeds to roughly $6.4 billion if exercised in full.[^29] Morgan Stanley, Citigroup, Barclays, and UBS were lead book-running managers.[^29] Trading opened on the Nasdaq Global Select Market under the ticker CBRS on May 14, 2026 at $350 per share and closed the first session at $311.07, a 68% gain over the offer price that lifted the basic market capitalization to approximately $39.8 billion and the fully diluted valuation (including OpenAI warrants and employee equity) to roughly $95 billion.[^27][^28] By that measure Cerebras was the largest pure-play AI hardware IPO ever, eclipsing prior records held by traditional semiconductor offerings. Cerebras itself characterized the listing as ending the longest U.S. drought in major tech IPOs since 2021 and as validation of the wafer-scale design thesis the company had pursued since 2015.[^28]
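The pricing arithmetic in the paragraph above can be checked with a short sketch. The inputs are the figures quoted in the text; the variable and function names are illustrative only and do not come from any filing.

```python
# Figures quoted in the article text; names here are illustrative.
OFFER_PRICE = 185.00          # IPO price per share, USD
BASE_SHARES = 30_000_000      # Class A shares sold in the offering
OPTION_SHARES = 4_500_000     # underwriters' 30-day option
FIRST_CLOSE = 311.07          # first-session closing price, USD


def gross_proceeds(shares: int, price: float) -> float:
    """Gross proceeds before underwriting discounts and expenses."""
    return shares * price


base_proceeds = gross_proceeds(BASE_SHARES, OFFER_PRICE)
full_proceeds = gross_proceeds(BASE_SHARES + OPTION_SHARES, OFFER_PRICE)
first_day_gain = FIRST_CLOSE / OFFER_PRICE - 1  # return relative to offer price

print(f"Base offering:      ${base_proceeds / 1e9:.2f}B")
print(f"With full option:   ${full_proceeds / 1e9:.2f}B")
print(f"First-session gain: {first_day_gain:.1%}")
```

The three results line up with the rounded figures in the text: about $5.55 billion, roughly $6.4 billion, and a 68% gain.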
Post-IPO, customer concentration remained the most visible risk identified in analyst commentary: G42-affiliated entities still represented roughly 86% of fiscal 2025 revenue at the time of pricing, and OpenAI's expected ramp through 2028 would heavily skew forward revenue toward a small number of large buyers.[^32] Cerebras stated that the IPO proceeds, combined with the OpenAI working capital facility, would fund the six-data-center build-out, additional U.S. wafer manufacturing capacity with TSMC, and accelerated work on the WSE-4 successor.
Cerebras has not formally announced a fourth-generation Wafer-Scale Engine as of mid-2026, but industry analysts and the company's own filings outline the expected shape of WSE-4 and CS-4. Trade press reporting in late 2025 and early 2026 indicated that WSE-4 engines and CS-4 systems were targeted for launch later in 2026,[^31] fabricated on a more advanced TSMC node and using 3D-stacked SRAM to substantially increase on-wafer memory capacity and bandwidth. Analyst speculation also pointed to co-packaged optical interconnects between CS-4 nodes, both to relieve the cluster-bandwidth bottleneck on very large training jobs and to make MemoryX an optically attached pool rather than a switched fabric.[^31] The expected performance leap is concentrated in lower-precision data types, especially FP8 and FP4, in line with the broader industry shift away from FP16 for inference.
A more disruptive successor, sometimes referred to as WSE-5 in analyst writing, would integrate High Bandwidth Flash (HBF), a non-volatile memory technology expected to begin sampling in the second half of 2026. HBF is intended to deliver per-stack capacities of 512 GB in its first generation and 1 TB in its second, which would allow a single wafer-scale processor to host an entire frontier-scale model without external MemoryX streaming. Whether HBF would make it into the first CS-4 generation or wait for CS-5 remained unclear at the time the 2026 IPO priced. Cerebras itself has been deliberately quiet about specific WSE-4 specifications, preferring to keep the WSE-3 and its expanding cloud footprint at the center of its public messaging through the listing.[^31]
The WSE-3 and the broader Cerebras architecture have several practical constraints.
Memory capacity is the most frequently cited limitation. The 44GB of on-chip SRAM is small relative to the memory of HBM-equipped GPUs. Although weight streaming mitigates this for training workloads by keeping weights off-chip, inference workloads must also accommodate KV cache growth as context length increases. At high batch sizes and long contexts, the KV cache can consume a substantial portion of the 44GB, reducing effective model capacity. Running inference on a 1-trillion-parameter model requires networking approximately 45 CS-3 systems together, which at $2.5 to $3.1 million per node implies a hardware cost of roughly $113 million to $140 million for that configuration.
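A rough sizing sketch makes the memory pressure concrete. It uses the standard transformer KV cache formula (keys plus values for every layer, head, and token) and the 44GB and per-node price figures from the text. The model shape below is hypothetical, and the 16-bit weight and cache precision is an assumption: the text does not say what precision lies behind the "approximately 45 systems" estimate.

```python
import math

SRAM_BYTES = 44e9  # 44 GB of on-chip SRAM per WSE-3 (decimal GB, from the text)


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def systems_for_weights(params, bytes_per_param=2.0, sram_bytes=SRAM_BYTES):
    """Wafers needed to hold the weights alone in SRAM (ignores KV cache
    and activations). Assumes all weights stay resident on-chip."""
    return math.ceil(params * bytes_per_param / sram_bytes)


def hardware_cost_range(systems, low_per_node=2.5e6, high_per_node=3.1e6):
    """Cost range in USD using the per-node price range quoted in the text."""
    return systems * low_per_node, systems * high_per_node


# Hypothetical model shape (not taken from the article): 80 layers,
# 8 KV heads of dimension 128, 8,192-token context, batch of 16, 16-bit cache.
cache = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192, batch=16)
print(f"KV cache: {cache / 1e9:.1f} GB of {SRAM_BYTES / 1e9:.0f} GB SRAM")

# Assumption: 16-bit weights for a 1-trillion-parameter model.
print("Systems for weights alone:", systems_for_weights(1e12))

low, high = hardware_cost_range(45)
print(f"45 nodes: ${low / 1e6:.1f}M to ${high / 1e6:.1f}M")
```

Under these assumptions the cache for the hypothetical model comes to about 43 GB, nearly the whole of one wafer's SRAM, and the weights-only estimate for a 1-trillion-parameter model lands at 46 systems, close to the figure of approximately 45 cited above. Lower-precision weights would reduce the system count proportionally.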
The chip does not support FP64 double-precision computation, which limits its applicability to scientific computing workloads outside AI training that require high numerical precision. General-purpose computing tasks that do not map well onto the 2D mesh topology also perform poorly.
The wafer-scale manufacturing approach involves tradeoffs in yield and cost. Although Cerebras' redundancy approach routes around defective cores, producing usable wafers at sufficient volume to meet enterprise demand requires a tight manufacturing partnership with TSMC. Each wafer also yields a single processor rather than dozens, which reduces the economies of scale that conventional chipmakers enjoy.
The software ecosystem is narrower than that surrounding NVIDIA GPUs. While the CS-3 supports PyTorch 2.0, it does not support the full CUDA ecosystem, and workloads written for NVIDIA hardware typically require porting. Cerebras provides its own SDK and compiler, and the company claims that porting reduces code complexity, but the initial migration effort is a barrier for organizations with existing GPU-optimized codebases.
Independent benchmarking remains a contested topic. Cerebras has chosen not to submit to MLPerf Inference, the most widely cited industry benchmark, and instead relies on its own measurements and on third-party measurements from Artificial Analysis. This makes direct comparisons with NVIDIA, Intel, and AMD inference submissions difficult, and some prospective enterprise buyers cite the absence of MLPerf participation as a procurement obstacle.
Customer concentration is a financial rather than technical limitation but figures heavily in post-IPO analyst commentary. G42 still represented approximately 86% of fiscal 2025 revenue at the time of the May 2026 listing,[^32] and the $20 billion-plus OpenAI commitment compounds rather than diversifies that concentration profile over the 2026 to 2030 horizon. Cerebras has argued that the Meta Llama API, IBM, Mistral, and AWS Marketplace channels will dilute concentration as the inference cloud scales, but those buyers were small relative to the OpenAI ramp in the financials disclosed at IPO.