Huawei CloudMatrix 384
Last reviewed
Jun 7, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 2,189 words
The Huawei CloudMatrix 384 (commonly shortened to CM384) is a rack-scale artificial intelligence computing system that Huawei markets as an "AI supernode." It wires together 384 Huawei Ascend 910C accelerators, plus 192 Huawei Kunpeng server CPUs, into a single tightly coupled machine using an all-optical, fully meshed interconnect. Built and operated by Huawei Cloud, the CM384 is widely described as China's answer to Nvidia's flagship rack-scale product, the GB200 NVL72. Its central design idea is to compensate for weaker individual AI chips by lashing together far more of them at the system level, trading energy efficiency for raw, deployable scale at a time when US export controls bar Nvidia's most capable accelerators from the Chinese market. The system was first detailed publicly by the analysis firm SemiAnalysis on April 16, 2025, showcased at the World Artificial Intelligence Conference in Shanghai in late July 2025, and put into commercial operation on Huawei Cloud's Ascend Cloud service during 2025.
The CM384 is not a single chip or a single server but a full data center building block spanning 16 physical racks. Where Nvidia packs 72 Blackwell GPUs into one NVL72 rack joined by copper NVLink, Huawei spreads 384 Ascend 910C NPUs across 12 compute racks and ties them together through 4 central networking racks using only optics. According to SemiAnalysis, this brute-force approach lets Huawei field a system that beats the NVL72 on aggregate compute, memory capacity, and memory bandwidth, despite each Ascend 910C being roughly one-third as capable as a single Nvidia Blackwell GPU. The penalty is power and density: by SemiAnalysis's estimate the CM384 draws roughly four times the power of an NVL72 (about 559 kW versus about 145 kW) and occupies 16 racks instead of one. Because Huawei has not released a complete official specification sheet, most headline figures cited below come from SemiAnalysis's independent teardown and modeling, cross-checked against Huawei's own technical paper and reporting from Tom's Hardware, QSFPTEK, and others. The specifications should therefore be read as well-sourced estimates rather than vendor-certified numbers.
Each Ascend 910C is a dual-die package: Huawei pairs two compute dies in one module, an approach conceptually similar to Nvidia's two-die Blackwell B200. SemiAnalysis pegs a single 910C package at roughly 780 BF16 TFLOPS of dense compute, about 128 GB of high-bandwidth memory, and about 3.2 TB/s of memory bandwidth. Multiplied across 384 packages, the supernode aggregates to roughly 300 PFLOPS of dense BF16 compute, about 49.2 TB of HBM, and about 1,229 TB/s of memory bandwidth. The 910C is fabricated on a 7nm-class process. The Chinese foundry SMIC produces 910C wafers, but teardowns by TechInsights found that many shipped 910C units still use 7nm dies fabricated earlier by TSMC and stockpiled before sanctions tightened.
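The aggregate figures follow directly from the per-package estimates. A minimal sketch of that multiplication, using only the SemiAnalysis numbers quoted above (the variable and function names are illustrative, and decimal units are assumed, so 1 TB = 1,000 GB):

```python
# Per-package figures are the SemiAnalysis estimates quoted in the text.
NUM_PACKAGES = 384
BF16_TFLOPS_PER_PACKAGE = 780.0   # dense BF16, no sparsity
HBM_GB_PER_PACKAGE = 128.0
HBM_TBPS_PER_PACKAGE = 3.2


def aggregate(per_package: float, count: int = NUM_PACKAGES) -> float:
    """Scale a per-package figure to the whole supernode."""
    return per_package * count


# Assumption: decimal prefixes (1 PFLOPS = 1,000 TFLOPS, 1 TB = 1,000 GB).
total_pflops = aggregate(BF16_TFLOPS_PER_PACKAGE) / 1000.0
total_hbm_tb = aggregate(HBM_GB_PER_PACKAGE) / 1000.0
total_bw_tbps = aggregate(HBM_TBPS_PER_PACKAGE)

print(f"dense BF16 compute: {total_pflops:.1f} PFLOPS")
print(f"HBM capacity:       {total_hbm_tb:.1f} TB")
print(f"HBM bandwidth:      {total_bw_tbps:.0f} TB/s")
```

The results round to the roughly 300 PFLOPS, 49.2 TB, and 1,229 TB/s cited for the full system.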
The defining feature of the CM384 is its interconnect. Rather than a hierarchy of copper links and top-of-rack switches, every NPU and CPU in the supernode is joined by Huawei's UnifiedBus (UB), an ultra-high-bandwidth scale-up fabric that provides direct, non-blocking, all-to-all communication across all 384 NPUs and 192 Kunpeng CPUs. The fabric is entirely optical: SemiAnalysis counts roughly 6,912 linear-pluggable optics (LPO) transceivers stitching the racks together, each rated at 400G in the firm's accounting (some outlets report the modules as 800G). Using optics rather than copper is what lets Huawei push a coherent fabric across 16 racks instead of one, since copper cannot carry the needed bandwidth over those distances. The tradeoff is that thousands of optical transceivers add cost, power, and failure points; optics specialists have flagged reliability and yield as a concern for LPO at this scale.
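For a sense of scale, the transceiver count can be turned into two rough numbers: a naive average per NPU and a summed line rate under each reported module speed. This is back-of-the-envelope arithmetic on the figures above, not a description of the actual topology. Links need a transceiver at each end and some modules sit in the switch racks, so neither number is usable per-NPU bandwidth.

```python
NUM_TRANSCEIVERS = 6912   # SemiAnalysis count quoted in the text
NUM_NPUS = 384

# Naive average only; it says nothing about where the modules are installed.
transceivers_per_npu = NUM_TRANSCEIVERS / NUM_NPUS


def summed_line_rate_tbps(rate_gbps: float) -> float:
    """Sum of nominal module line rates, in terabits per second."""
    return NUM_TRANSCEIVERS * rate_gbps / 1000.0


# The text notes a disagreement between 400G and 800G, so compute both.
for rate in (400, 800):
    print(f"{rate}G modules: {summed_line_rate_tbps(rate):.1f} Tb/s summed line rate")
print(f"average transceivers per NPU: {transceivers_per_npu:.0f}")
```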
Huawei's own engineering paper, "Serving Large Language Models on Huawei CloudMatrix384" (arXiv, June 2025), describes the system as 384 Ascend NPUs and 192 Kunpeng CPUs interconnected by the UB network to enable "direct all-to-all communication and dynamic pooling of resources." That topology is tuned for communication-heavy workloads such as large mixture-of-experts models, where tokens must be routed among many expert networks. In the paper, Huawei reports serving DeepSeek-R1 with expert parallelism as wide as EP320, achieving a prefill throughput of 6,688 tokens per second per NPU and a decode throughput of 1,943 tokens per second per NPU under a sub-50 ms time-per-output-token target, and sustaining 538 tokens per second per NPU even under a stringent 15 ms latency constraint. These are vendor-reported figures from Huawei.
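The routing pattern that makes mixture-of-experts serving communication-heavy can be illustrated with a toy sketch. Every parameter here is a hypothetical placeholder rather than a value from Huawei's paper, and the random router stands in for a learned gating network. The point is only that each token fans out to several experts on different devices, so traffic tends toward all-to-all.

```python
import random
from collections import Counter

# Hypothetical, illustrative parameters (not taken from the paper).
NUM_EXPERTS = 256
TOP_K = 8            # experts activated per token
NUM_DEVICES = 256    # assumed placement: one expert per device
NUM_TOKENS = 1024


def route_token(rng: random.Random) -> list[int]:
    """Pick TOP_K distinct experts; a random stand-in for a learned router."""
    return rng.sample(range(NUM_EXPERTS), TOP_K)


def device_of(expert_id: int) -> int:
    """Map an expert to the device assumed to host it."""
    return expert_id % NUM_DEVICES


rng = random.Random(0)  # fixed seed so the sketch is repeatable
load = Counter()
for _ in range(NUM_TOKENS):
    for expert in route_token(rng):
        load[device_of(expert)] += 1

total_dispatches = sum(load.values())
print(f"token-to-device dispatches: {total_dispatches}")
print(f"devices receiving traffic:  {len(load)} of {NUM_DEVICES}")
print(f"busiest device load:        {max(load.values())} tokens")
```

Because every token is sent to `TOP_K` devices and the choice varies per token, nearly every device exchanges data with every other within a batch. That is the traffic pattern a non-blocking all-to-all fabric is meant to serve.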
On an apples-to-apples dense BF16 basis, with no sparsity assumed, SemiAnalysis puts the CM384 at about 300 PFLOPS against roughly 180 PFLOPS for the GB200 NVL72, or about 1.7 times the Nvidia rack's compute, which the firm characterizes as "nearly double" the NVL72. Precision matters here: Nvidia's marketed Tensor Core numbers often include a 2x sparsity factor and lower-precision formats such as FP8 and FP4, so headline Nvidia PFLOPS figures are much larger than the 180 PFLOPS dense-BF16 baseline used for this comparison. The gap widens on memory: the CM384 carries about 3.6 times the aggregate HBM capacity (about 49.2 TB versus about 13.8 TB) and about 2.1 times the memory bandwidth (about 1,229 TB/s versus about 576 TB/s) of the NVL72. SemiAnalysis also estimates the CM384 provides roughly 2.1 times the scale-up bandwidth within the 384-NPU domain and about 5.3 times the scale-out bandwidth for linking multiple supernodes.
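The quoted multiples can be checked from the aggregate figures. A small sketch, using only the system-level estimates cited above (the dictionary keys are illustrative names):

```python
# Aggregate SemiAnalysis estimates quoted in the text.
cm384 = {"dense_bf16_pflops": 300.0, "hbm_tb": 49.2, "hbm_bw_tbps": 1229.0}
nvl72 = {"dense_bf16_pflops": 180.0, "hbm_tb": 13.8, "hbm_bw_tbps": 576.0}

# Ratio of CM384 to NVL72 for each metric.
ratios = {metric: cm384[metric] / nvl72[metric] for metric in cm384}

for metric, ratio in ratios.items():
    print(f"{metric}: CM384 is {ratio:.2f}x the NVL72")
```

The ratios round to about 1.7, 3.6, and 2.1, matching the multiples given for compute, HBM capacity, and memory bandwidth.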
Huawei gets there through quantity. With about 5.3 times as many accelerators as the NVL72 (384 versus 72), the CM384 more than offsets each Ascend being only about one-third as fast as a Blackwell. The competitive point made by analysts is that Huawei's real advantage is at the system level, in networking, optics, and software co-design, not at the level of any single chip, where Huawei remains behind.
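The trade of quantity for per-chip speed can be written as a product of two ratios. In this sketch the per-Blackwell figure is derived by dividing the NVL72's roughly 180 PFLOPS evenly across its 72 GPUs, which is a simplifying assumption. The other inputs are the estimates already cited.

```python
CM384_CHIPS, NVL72_CHIPS = 384, 72
ASCEND_TFLOPS = 780.0                        # dense BF16 per 910C package
BLACKWELL_TFLOPS = 180_000.0 / NVL72_CHIPS   # assumed even split of ~180 PFLOPS

count_ratio = CM384_CHIPS / NVL72_CHIPS            # how many more chips
per_chip_ratio = ASCEND_TFLOPS / BLACKWELL_TFLOPS  # how much slower each one is
system_ratio = count_ratio * per_chip_ratio        # net system-level advantage

print(f"chip count ratio:    {count_ratio:.2f}x")
print(f"per-chip ratio:      {per_chip_ratio:.2f}x")
print(f"system compute ratio: {system_ratio:.2f}x")
```

A chip-count advantage above five, multiplied by a per-chip ratio near one-third, leaves the system at roughly 1.7 times the NVL72's dense BF16 compute.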
The CM384's strength is also its weakness. SemiAnalysis estimates total system power, including networking and storage, at around 559 kW, against roughly 145 kW for a GB200 NVL72. That is on the order of 3.9 to 4.1 times the power for about 1.7 times the dense compute. SemiAnalysis breaks the inefficiency down as roughly 2.5 times worse power per FLOP, about 1.9 times worse power per TB/s of memory bandwidth, and about 1.2 times worse power per TB of HBM capacity. Tom's Hardware summarized the same analysis with slightly lower figures of about 2.3 times worse per FLOP and 1.8 times worse per unit of bandwidth. Either way, the conclusion is the same: the CM384 consumes far more electricity per unit of useful work than Nvidia's rack.
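The efficiency multiples are ratios of ratios: power per unit of a metric on the CM384, divided by the same quantity on the NVL72. A sketch under stated assumptions, using the rounded system-level estimates cited in this article (559 kW and 145 kW; 300 and 180 PFLOPS; 1,229 and 576 TB/s; 49.2 and 13.8 TB). The function name is illustrative.

```python
def power_per_unit_ratio(cm_power_kw: float, cm_metric: float,
                         nv_power_kw: float, nv_metric: float) -> float:
    """How many times more power the CM384 spends per unit of a metric."""
    return (cm_power_kw / cm_metric) / (nv_power_kw / nv_metric)


CM_KW, NV_KW = 559.0, 145.0  # rounded system power estimates

power_ratio = CM_KW / NV_KW
per_flop = power_per_unit_ratio(CM_KW, 300.0, NV_KW, 180.0)      # per PFLOPS
per_bandwidth = power_per_unit_ratio(CM_KW, 1229.0, NV_KW, 576.0)  # per TB/s
per_capacity = power_per_unit_ratio(CM_KW, 49.2, NV_KW, 13.8)    # per TB of HBM

print(f"system power:        {power_ratio:.2f}x")
print(f"power per FLOP:      {per_flop:.2f}x worse")
print(f"power per TB/s:      {per_bandwidth:.2f}x worse")
print(f"power per TB of HBM: {per_capacity:.2f}x worse")
```

On these rounded inputs the multiples come out near 3.9, 2.3, 1.8, and 1.1, closer to the lower set of published figures. The higher set presumably rests on slightly different inputs, which this sketch does not attempt to reconstruct.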
That tradeoff is deliberate and, in the Chinese context, defensible. As multiple analyses note, China has comparatively abundant and cheap electricity and ample land for data centers, while it cannot freely buy Nvidia's most efficient accelerators. Under those constraints, the binding limit is access to compute, not power or floor space, so spending extra megawatts to assemble competitive aggregate performance from domestically available chips is a rational substitution. SemiAnalysis estimated a CM384 system price on the order of 8 million US dollars, reflecting both the large chip count and the thousands of optical transceivers.
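To put "extra megawatts" in energy terms, the power estimates can be converted to annual consumption. The sketch assumes continuous operation at the estimated system power, which is an idealization. It deliberately leaves out electricity prices, since none are given here.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_mwh(power_kw: float, utilization: float = 1.0) -> float:
    """Energy used in a year at a constant draw, in megawatt-hours.

    `utilization` is a hypothetical knob for the fraction of time at full
    power; the default of 1.0 assumes the system never idles.
    """
    return power_kw * HOURS_PER_YEAR * utilization / 1000.0


cm384_mwh = annual_energy_mwh(559.0)   # SemiAnalysis power estimate
nvl72_mwh = annual_energy_mwh(145.0)

print(f"CM384: {cm384_mwh:,.0f} MWh per year")
print(f"NVL72: {nvl72_mwh:,.0f} MWh per year")
print(f"extra: {cm384_mwh - nvl72_mwh:,.0f} MWh per year")
```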
| Specification (dense BF16 basis) | CloudMatrix 384 (CM384) | Nvidia GB200 NVL72 | Source / note |
|---|---|---|---|
| Accelerators | 384 Ascend 910C NPUs | 72 Blackwell B200 GPUs | Huawei; Nvidia |
| Host CPUs | 192 Huawei Kunpeng | 36 Nvidia Grace | Huawei; Nvidia |
| Physical footprint | 16 racks (12 compute + 4 networking) | 1 rack | SemiAnalysis |
| Interconnect | All-optical UnifiedBus mesh, ~6,912 LPO transceivers | Copper NVLink spine | SemiAnalysis |
| Aggregate dense BF16 compute | ~300 PFLOPS | ~180 PFLOPS (~1.7x lower) | SemiAnalysis estimate |
| Per-accelerator dense BF16 | ~780 TFLOPS (~1/3 of a B200) | ~2,500 TFLOPS | SemiAnalysis estimate |
| Total HBM capacity | ~49.2 TB | ~13.8 TB (~3.6x lower) | SemiAnalysis estimate |
| Aggregate memory bandwidth | ~1,229 TB/s | ~576 TB/s (~2.1x lower) | SemiAnalysis estimate |
| Total system power | ~559 kW | ~145 kW (~4x lower) | SemiAnalysis estimate |
| Relative energy efficiency | ~2.5x worse power per FLOP | baseline | SemiAnalysis estimate |
| Estimated system price | ~8 million USD | varies | SemiAnalysis estimate |
The CM384 exists because of US export controls. The GB200 NVL72 and other top-tier Nvidia products cannot legally be sold into China, and the semiconductor restrictions extend to advanced AI accelerators and the HBM that feeds them. With Nvidia's best parts off the table, Chinese AI developers needed a domestic alternative for training and large-scale inference, and the CM384 is Huawei's answer. The controls also shape the supply chain behind the supernode. Investigators including TechInsights found foreign-made 7nm dies inside Ascend parts, and US authorities determined that roughly 2.9 million TSMC 7nm dies reached Huawei through the intermediary Sophgo, a finding that led to a reported 1 billion US dollar penalty against TSMC and the addition of Sophgo-linked entities to the US Entity List in early 2025. SemiAnalysis has reported that this stockpiled die bank, together with stockpiled HBM, let Huawei ship meaningful Ascend volumes through 2024 and 2025 while SMIC's domestic capacity ramped.
The surrounding policy environment was unusually volatile in 2025. On April 15, 2025, Washington restricted sales of Nvidia's China-specific H20 chip, forcing Nvidia to take a charge of about 5.5 billion US dollars on unsellable inventory; the move, as the IEEE Communications Society and others noted, pushed Chinese buyers toward Huawei's 910C. In a reversal around mid-July 2025, the administration said it would grant licenses allowing H20 sales to resume, with officials arguing that keeping a degraded Nvidia product in China was preferable to ceding the entire market to Huawei. Later, on December 8, 2025, President Trump said the United States would permit Nvidia to sell the more capable H200 to China, reportedly in exchange for a revenue share, a decision that drew sharp criticism from export-control hawks. Analysts at SemiAnalysis, Stratechery, CSIS, and the Council on Foreign Relations have used the CM384 as a focal point in the broader debate over whether chip controls are slowing or inadvertently accelerating China's domestic AI hardware ecosystem.
The CloudMatrix 384 is significant less as a chip and more as a proof of concept for system-level competition. It demonstrates that a company cut off from leading-edge process technology can still field a rack-scale machine that rivals or exceeds the aggregate performance of the Western state of the art by innovating in packaging, optics, fabric design, and software, and by accepting a large efficiency penalty that China's energy abundance can absorb. For Nvidia, it validates the worry that export controls create a captive market in which a domestic champion can mature. For policymakers, it is a live test of whether controls can hold back capability when the target can substitute scale and power for per-chip excellence. And for the global AI buildout, it signals that the unit of competition is shifting from the individual accelerator to the integrated supernode, where networking and co-designed software increasingly decide who wins. Whether Huawei can sustain the effort depends heavily on its post-stockpile supply chain, namely SMIC wafer yields and domestically produced HBM, which remain the binding constraints on how many CM384-class systems China can actually build.