Wormhole (Tenstorrent)
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,127 words
Wormhole is the second-generation AI accelerator application-specific integrated circuit (ASIC) designed by Tenstorrent, a Toronto-based hardware startup led by chip architect Jim Keller. Announced in 2021 and made commercially available to developers in mid-2024, Wormhole is sold on two PCIe add-in cards: the n150 (single ASIC, 12 GB GDDR6) and the n300 (dual ASIC, 24 GB GDDR6). The chip is notable as one of the first commercially shipping AI accelerators built around a many-core grid of in-house Tensix tiles managed by an array of small RISC-V control cores, and for its emphasis on a fully open-source software stack including TT-Metalium and TT-Buda.
Wormhole sits between Tenstorrent's first-generation Grayskull part and its third-generation Blackhole chip in the company's product line. While its raw FP8 throughput per card is well below contemporary flagship parts from NVIDIA and AMD, Wormhole was designed as a building block for scale-out systems, with each ASIC exposing 16 ports of 100 Gigabit Ethernet for direct chip-to-chip communication. The 32-chip Galaxy server assembles these links into a single mesh delivering 9.32 PetaFLOPS of FP8 compute behind a unified 384 GB GDDR6 memory pool.
Tenstorrent was founded in 2016 by Ljubisa Bajic, Ivan Hamer, and Milos Trajkovic. The company set out to design a chip architecture that would scale across a wide range of AI deployments, from a single PCIe card in a developer workstation to large multi-rack inference clusters. The first-generation product, code-named Grayskull, was a 120-core chip fabricated on GlobalFoundries' 12 nanometer process and targeted as a proof-of-concept developer card.
Wormhole is the follow-on architecture and the first Tenstorrent design to include native high-speed Ethernet for direct chip-to-chip scale-out. The chip was first publicly analyzed in detail by SemiAnalysis in June 2021, which described it as a 670 square millimeter die on the same GlobalFoundries 12 nanometer process as Grayskull, but with the addition of sixteen 100 Gigabit Ethernet ports along the perimeter and an upgraded Tensix core.
Jim Keller, the architect behind AMD's Zen CPU family, Apple's A4 and A5 application processors, and the original Tesla Full Self-Driving chip, joined Tenstorrent as Chief Technology Officer in early 2021 and was promoted to Chief Executive Officer in January 2023. His arrival raised the company's profile considerably and signaled an intent to build not just an inference accelerator but a long-term competitor to NVIDIA in both AI silicon and licensable RISC-V CPU intellectual property.
Tenstorrent quietly shipped Wormhole-based development boards to selected partners and research groups beginning in 2022, and the parts were available through the Tenstorrent DevCloud remote access service. Wide commercial availability of the n150 and n300 PCIe cards arrived on 20 July 2024 with an online store launch.
The Wormhole die measures approximately 670 square millimeters and is manufactured on GlobalFoundries' 12 nanometer FinFET process. This is the same node as the prior Grayskull part, a deliberate choice that traded peak transistor density for lower mask costs and faster time to silicon. The Wormhole package exposes a 192-bit GDDR6 memory interface that is wired to six external GDDR6 memory devices on the carrier board. It also brings out 80 SerDes lanes that the on-package controllers configure as sixteen 100 Gigabit Ethernet ports, giving each chip 1.6 terabits per second of aggregate off-chip bandwidth dedicated to direct neighbor links.
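The bandwidth figures above can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 12 GT/s transfer rate is taken from the product specification table rather than from this paragraph.

```python
def gddr6_bandwidth_gb_per_s(bus_width_bits: int, transfer_rate_gt_per_s: float) -> float:
    """Peak memory bandwidth in GB/s: bits moved per transfer, times
    transfers per second (in billions), divided by 8 bits per byte."""
    return bus_width_bits * transfer_rate_gt_per_s / 8


def ethernet_aggregate_tb_per_s(ports: int, gbit_per_port: int) -> float:
    """Aggregate Ethernet bandwidth in Tb/s, counting one direction per port."""
    return ports * gbit_per_port / 1000


# 192-bit interface at 12 GT/s (rate assumed from the specification table).
memory_bw = gddr6_bandwidth_gb_per_s(192, 12)

# Sixteen 100 Gigabit Ethernet ports per ASIC.
ethernet_bw = ethernet_aggregate_tb_per_s(16, 100)

# Six GDDR6 devices sharing the 192-bit interface.
bits_per_device = 192 // 6

print(f"Memory bandwidth:   {memory_bw:.0f} GB/s")
print(f"Ethernet aggregate: {ethernet_bw:.1f} Tb/s")
print(f"Bus width per GDDR6 device: {bits_per_device} bits")
```

The result of 288 GB/s matches the per-ASIC memory bandwidth listed for the n150, and the 32-bit width per device is what the six-device layout implies.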
Each Wormhole ASIC contains a grid of Tensix cores. In the n150 product configuration, 72 Tensix cores are enabled (with additional disabled cores on the die for yield), while the full mesh described in some internal Tenstorrent documentation is reported as 80 tiles. Each Tensix tile is a self-contained processing element that bundles together five small "baby" RISC-V control cores, matrix and vector compute units, and a local SRAM scratchpad.
The split into five baby RISC-V cores per tile is a distinctive feature: two of the cores are dedicated to moving data between external GDDR6, neighbor tiles, and the local SRAM scratchpad, while the remaining three orchestrate the matrix and vector compute units. This decoupled data-movement model is meant to let the compiler explicitly schedule tensor traffic across the mesh rather than relying on a hardware-managed cache hierarchy, which Tenstorrent argues yields more predictable performance for repetitive AI workloads.
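The decoupled data-movement model can be pictured as a three-stage pipeline connected by small bounded buffers. The sketch below is a host-side analogy in plain Python, not TT-Metalium code: the function names, the use of threads, and the `buffer_depth` parameter are all assumptions made for illustration, with bounded queues standing in for buffers in the local SRAM scratchpad.

```python
import queue
import threading


def reader(tiles, cb_in):
    """Stand-in for a data-movement core: streams input tiles into local SRAM."""
    for tile in tiles:
        cb_in.put(tile)   # blocks when the buffer is full (back-pressure)
    cb_in.put(None)       # sentinel: no more input


def compute(cb_in, cb_out):
    """Stand-in for the compute cores: consumes one tile, produces one result."""
    while (tile := cb_in.get()) is not None:
        cb_out.put(sum(x * x for x in tile))  # placeholder for matrix/vector math
    cb_out.put(None)


def writer(cb_out, results):
    """Stand-in for the second data-movement core: drains results to memory."""
    while (value := cb_out.get()) is not None:
        results.append(value)


def run_pipeline(tiles, buffer_depth=2):
    """Run the three stages concurrently and return results in input order."""
    cb_in = queue.Queue(maxsize=buffer_depth)
    cb_out = queue.Queue(maxsize=buffer_depth)
    results = []
    threads = [
        threading.Thread(target=reader, args=(tiles, cb_in)),
        threading.Thread(target=compute, args=(cb_in, cb_out)),
        threading.Thread(target=writer, args=(cb_out, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pipeline_results = run_pipeline([[1, 2], [3, 4], [5, 6]])
print(pipeline_results)
```

Because each stage only blocks on its own buffer, data movement and compute overlap without any stage needing to know how the others are scheduled, which is the property the explicit model is meant to give the compiler.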
The Tensix tiles are arranged in a 2D mesh and stitched together by a bi-directional NoC. At the chip edge, the NoC fabric extends outward through the 16 Ethernet ports, allowing data to traverse from any tile on one chip to any tile on a neighboring chip without going back through the host CPU or PCIe complex. Tenstorrent refers to this property as a scale-out architecture, with the chip-to-chip Ethernet links treated by software as an extension of the on-die NoC.
Each Ethernet link is a standard 100 Gigabit Ethernet PHY, and pairs of links are exposed externally through QSFP-DD cages on the n150 and n300 carrier boards. This choice means Wormhole-based systems can be cabled together using off-the-shelf datacenter optics or direct-attach copper, with no proprietary cabling standard analogous to NVIDIA's NVLink or AMD's Infinity Fabric.
Wormhole's matrix engines natively support a wide range of numeric formats, including the FP8 format used for the headline throughput figures in this article.
The combination of dense low precision math, on-tile SRAM, and software-managed data movement is the basis for Tenstorrent's claim that Wormhole achieves a high fraction of its peak FLOPS on real transformer workloads even at small batch sizes.
Wormhole is sold to developers and small deployments on two three-quarter length PCIe Gen 4 x16 add-in cards. The n150 carries a single Wormhole ASIC and 12 gigabytes of GDDR6; the n300 carries two Wormhole ASICs on a single board for 24 gigabytes of GDDR6 and double the headline compute. Each variant is offered in two cooling configurations: the d suffix indicates a desktop active cooler with integrated fan, while the s suffix indicates a passive server cooler designed for chassis with strong directed airflow.
| Specification | Wormhole n150 (n150d / n150s) | Wormhole n300 (n300d / n300s) |
|---|---|---|
| Wormhole ASICs per card | 1 | 2 |
| Tensix cores | 72 | 128 (64 per ASIC) |
| On-chip SRAM | 108 MB | 192 MB (96 MB per ASIC) |
| GDDR6 memory | 12 GB | 24 GB |
| Memory speed | 12 GT/sec | 12 GT/sec |
| Memory bandwidth | 288 GB/sec | 576 GB/sec |
| FP8 peak throughput | 262 TFLOPS | 466 TFLOPS |
| AI clock | 1.0 GHz | 1.0 GHz |
| Board power (TDP) | 160 W | 300 W |
| QSFP-DD scale-out ports | 2 x 200 G active | 2 x 200 G active |
| Host interface | PCIe Gen 4 x16 | PCIe Gen 4 x16 |
| Form factor | 3/4 length, dual slot | 3/4 length, 2.5 slot (active) |
A few details in the table merit explanation. The n300's Tensix core count of 128 is less than twice the n150's 72, because Tenstorrent fuses off additional tiles on the n300 to land at a clean 64 active tiles per ASIC for symmetric mesh routing across the dual-chip card. Similarly, the n300's SRAM of 96 megabytes per ASIC is slightly below the n150's full 108 megabytes for the same reason.
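One way to see that the two cards differ only in how many tiles are enabled is to divide the table's SRAM and FP8 figures by the tile counts. The sketch below does this; the dictionary layout and function name are illustrative, and the per-tile values are derived here rather than quoted from Tenstorrent.

```python
# Figures copied from the specification table above.
cards = {
    "n150": {"tensix_cores": 72, "sram_mb": 108, "fp8_tflops": 262},
    "n300": {"tensix_cores": 128, "sram_mb": 192, "fp8_tflops": 466},
}


def per_tile(card: str) -> dict:
    """Return SRAM (MB) and peak FP8 throughput (TFLOPS) per enabled tile."""
    spec = cards[card]
    return {
        "sram_mb": spec["sram_mb"] / spec["tensix_cores"],
        "fp8_tflops": spec["fp8_tflops"] / spec["tensix_cores"],
    }


for name in cards:
    tile = per_tile(name)
    print(f"{name}: {tile['sram_mb']:.2f} MB SRAM, "
          f"{tile['fp8_tflops']:.2f} FP8 TFLOPS per tile")
```

Both cards come out at 1.5 MB of SRAM and about 3.64 FP8 TFLOPS per tile, which is consistent with the lower n300 totals being a result of the reduced tile count alone.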
The two QSFP-DD ports on each card carry the chip's high-speed Ethernet links, and the n300 internally also wires its two ASICs together over a chip-to-chip link so that the pair appears as a single tightly coupled compute domain. Multiple n150 or n300 cards in the same chassis can be cabled together through the QSFP-DD ports to form larger meshes without involving the PCIe bus for data plane traffic.
The direct-sale prices at the July 2024 launch were:
| Product | Configuration | Launch price (USD) |
|---|---|---|
| Wormhole n150s | Single ASIC, passive cooler | $999 |
| Wormhole n300s | Dual ASIC, passive cooler | $1,399 |
| TT-LoudBox workstation | 4 x n300s (8 ASICs), tower | $12,000 |
| TT-QuietBox workstation | 4 x n300s (8 ASICs), liquid cooled | $15,000 |
The $999 and $1,399 retail prices placed Wormhole well below the per-card prices of contemporary datacenter AI accelerators (an NVIDIA H100 SXM, for example, generally sold for more than $20,000) and made it one of the few server-class AI parts that an individual developer could realistically buy. The pricing strategy was explicitly aimed at building a software ecosystem around the chip, in the same way that NVIDIA's consumer GeForce cards seeded CUDA adoption.
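The price gap can be expressed as dollars per peak FP8 TFLOPS. The sketch below rests on several assumptions: that the passively cooled cards have the same peak throughput as the n150 and n300 entries in the specification table, that $20,000 is a lower bound for the H100 SXM as stated above, and that the H100's dense FP8 peak is the roughly 1,979 TFLOPS published by NVIDIA. Peak figures say nothing about achieved throughput, so this is a rough comparison only.

```python
def dollars_per_tflops(price_usd: float, peak_fp8_tflops: float) -> float:
    """Launch price divided by peak FP8 throughput."""
    return price_usd / peak_fp8_tflops


# (price in USD, peak FP8 TFLOPS); see the assumptions in the text above.
parts = {
    "Wormhole n150s": (999, 262),
    "Wormhole n300s": (1399, 466),
    "H100 SXM (at $20,000 floor)": (20000, 1979),
}

cost = {name: dollars_per_tflops(price, tflops)
        for name, (price, tflops) in parts.items()}

for name, value in cost.items():
    print(f"{name}: ${value:.2f} per peak FP8 TFLOPS")
```

On these assumptions the Wormhole cards cost roughly $3 to $4 per peak FP8 TFLOPS, against roughly $10 or more for the H100.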
The Wormhole Galaxy server is Tenstorrent's reference design for rack-scale deployment of the chip. The system is a custom 6U chassis that houses 32 Wormhole Tensix Processors interconnected through their Ethernet links, with an integrated x86 head node for host duties. Galaxy is the first product to fully exploit the chip's scale-out fabric: the 32 ASICs form a single mesh that the software stack presents as one logical accelerator with a pooled memory and combined compute budget.
| Specification | Wormhole Galaxy |
|---|---|
| Wormhole ASICs per chassis | 32 |
| Chassis form factor | Custom 6U |
| Aggregate FP8 compute | 9.32 PetaFLOPS |
| Aggregate on-die SRAM | ~3.8 GB |
| Aggregate GDDR6 memory | 384 GB (globally accessible) |
| Per-chip scale-out bandwidth | 3.2 Tbps Ethernet (16 x 100 Gb ports, counted in both directions) |
| Integrated head node | Yes (x86) |
| Cabling | Standard Ethernet (200 G QSFP-DD) |
Because each ASIC contributes 12 gigabytes of GDDR6 to the pool, the full Galaxy presents 384 gigabytes of memory addressable from any compute tile in the mesh through the NoC and Ethernet fabric. This is a meaningfully larger working set than a single H100 SXM (80 GB HBM3) and twice the per-GPU memory of AMD's MI300X (192 GB HBM3), though Galaxy's GDDR6 delivers less bandwidth per byte of capacity than HBM3.
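The Galaxy totals can be approximately reproduced from the single-card figures. The sketch below assumes that each Galaxy ASIC runs with all 80 tiles enabled; that tile count is an inference from the full-mesh figure mentioned earlier, not a published Galaxy specification, and the per-tile values are derived from the n150 row of the card table.

```python
ASICS_PER_GALAXY = 32
GDDR6_GB_PER_ASIC = 12

# n150 figures from the card specification table.
N150_TILES, N150_SRAM_MB, N150_FP8_TFLOPS = 72, 108, 262

# Assumption: Galaxy ASICs enable the full 80-tile mesh.
ASSUMED_GALAXY_TILES_PER_ASIC = 80

sram_mb_per_tile = N150_SRAM_MB / N150_TILES        # 1.5 MB
tflops_per_tile = N150_FP8_TFLOPS / N150_TILES      # about 3.64 TFLOPS

total_tiles = ASICS_PER_GALAXY * ASSUMED_GALAXY_TILES_PER_ASIC

galaxy_gddr6_gb = ASICS_PER_GALAXY * GDDR6_GB_PER_ASIC
galaxy_sram_gb = total_tiles * sram_mb_per_tile / 1000
galaxy_fp8_pflops = total_tiles * tflops_per_tile / 1000

print(f"GDDR6: {galaxy_gddr6_gb} GB")
print(f"SRAM:  {galaxy_sram_gb:.2f} GB")
print(f"FP8:   {galaxy_fp8_pflops:.2f} PFLOPS")
```

Under the 80-tile assumption the derived values line up with the table: 384 GB of GDDR6, about 3.8 GB of SRAM, and about 9.32 PFLOPS of FP8 compute.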
Multiple Galaxy chassis can themselves be cabled together through their QSFP-DD ports, because the underlying transport uses standard 100 Gigabit Ethernet PHYs. Tenstorrent positions this property as the principal architectural advantage of Wormhole: scale-out beyond a single chassis does not require proprietary switches.
A distinguishing feature of Wormhole is that its entire programming environment is open source, with code published on GitHub and developed in the open. The stack is layered to expose progressively lower-level control of the hardware.
TT-Buda is the high-level inference and training framework. It accepts model graphs from PyTorch, TensorFlow, ONNX, and Hugging Face Transformers, lowers them through an internal intermediate representation, and emits kernels for the Tensix mesh. TT-Buda is the recommended entry point for users porting an existing trained model to Wormhole hardware.
TT-Metalium (often abbreviated TT-Metal) is the low-level C++ programming environment, analogous in spirit to NVIDIA's CUDA Driver API or AMD's HIP. It exposes the Tensix grid, the on-chip NoC, the per-tile SRAM, and the baby RISC-V control cores as first-class entities. Kernels written in TT-Metalium are programs running on the baby RISC-V cores that explicitly orchestrate data movement and matrix engine invocations. This degree of explicit control is intended to let library authors hand-tune critical kernels for transformers, convolutions, and attention.
llama-tt is Tenstorrent's open source reference implementation of the Llama 2 and Llama 3 model families on Wormhole. It is built on TT-Metalium and is the canonical performance demonstration for the chip on large language model inference, used for the throughput numbers Tenstorrent quotes against competing accelerators.
For developers who want to evaluate Wormhole without buying hardware, Tenstorrent operates a remote access service called DevCloud that hosts n150 cards, n300 cards, and Galaxy systems behind a queueing system. DevCloud has been used by academic groups and prospective customers as the primary on-ramp for evaluating the TT-Buda and TT-Metalium stacks.
Wormhole sits in an unusual position in the AI accelerator landscape: its per-card throughput is well below the flagship datacenter parts, but its open software stack, low entry price, and built-in scale-out Ethernet give it a different design point. The table below summarizes how a single n300 compares to a small selection of contemporary 2023-2024 AI accelerators at a high level.
| Accelerator | Process | Peak FP8 (per package) | Memory | Memory bandwidth | TDP | Notable interconnect |
|---|---|---|---|---|---|---|
| Tenstorrent Wormhole n300 | GF 12 nm | 466 TFLOPS | 24 GB GDDR6 | 576 GB/s | 300 W | 2 x 200 G Ethernet (QSFP-DD) |
| NVIDIA H100 SXM5 | TSMC 4N | ~1,979 TFLOPS (dense) | 80 GB HBM3 | 3.35 TB/s | 700 W | NVLink 4 (900 GB/s) |
| AMD Instinct MI300X | TSMC N5/N6 | ~2,615 TFLOPS (dense) | 192 GB HBM3 | 5.3 TB/s | 750 W | Infinity Fabric (896 GB/s) |
| Groq LPU (v1) | GF 14 nm | N/A (INT8 750 TOPS) | 230 MB SRAM (no DRAM) | ~80 TB/s on-die | 275 W | Proprietary chip-to-chip |
| Cerebras WSE-3 | TSMC 5 nm | 125 PFLOPS (sparse FP16) | 44 GB on-wafer SRAM | 21 PB/s on-wafer | ~23 kW (system) | Wafer-scale, no external |
On raw arithmetic throughput per package, the Wormhole n300 trails the H100 by a factor of roughly four and the MI300X by a factor of roughly five to six. Wormhole's memory subsystem uses GDDR6 rather than HBM3, and the n300's 576 GB/s is approximately one sixth to one ninth of the bandwidth of the HBM3 parts on H100 and MI300X. Tenstorrent's architectural rebuttal is that Wormhole is designed to be deployed at scale: in this view, the headline performance number a user should care about is the throughput of a 32-chip Galaxy mesh rather than a single board, and the throughput per dollar at the rack level is closer to parity than the per-card comparison would suggest.
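The ratios behind this comparison can be recomputed directly from the table. The sketch below uses the peak FP8 and memory bandwidth columns as listed; the dictionary layout and function name are illustrative.

```python
# Peak FP8 (TFLOPS, dense) and memory bandwidth (GB/s) from the comparison table.
accelerators = {
    "Wormhole n300": {"fp8_tflops": 466, "mem_bw_gb_s": 576},
    "H100 SXM5": {"fp8_tflops": 1979, "mem_bw_gb_s": 3350},
    "MI300X": {"fp8_tflops": 2615, "mem_bw_gb_s": 5300},
}


def ratio_vs_n300(name: str, key: str) -> float:
    """How many times larger a part's figure is than the n300's."""
    return accelerators[name][key] / accelerators["Wormhole n300"][key]


for part in ("H100 SXM5", "MI300X"):
    compute_ratio = ratio_vs_n300(part, "fp8_tflops")
    bandwidth_ratio = ratio_vs_n300(part, "mem_bw_gb_s")
    print(f"{part}: {compute_ratio:.1f}x the FP8 throughput, "
          f"{bandwidth_ratio:.1f}x the memory bandwidth of one n300")
```

These are peak-to-peak ratios only; they do not capture utilization, batch size, or multi-card scaling, which is where Tenstorrent's rack-level argument applies.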
Wormhole's competitive position with respect to inference-only accelerators such as Groq's LPU is more nuanced. Groq's first-generation parts dispense with external DRAM entirely and rely on hundreds of LPUs ganged together to provide model capacity, which yields extremely low latency but constrains practical model sizes. Wormhole's GDDR6 gives it a much larger per-card working set, at the cost of a more conventional memory hierarchy.
The technical press reaction to Wormhole at its July 2024 launch was generally positive on the architecture and software openness, but cautious on real-world performance. Reviewers and analysts highlighted four points consistently:
Independent benchmarks published in 2025 and 2026 by parties such as SemiAnalysis and the Spheron Network blog generally found that Wormhole achieved competitive throughput per dollar on transformer inference at the rack level, particularly for medium-sized language models that fit comfortably in the Galaxy's 384 GB pooled memory, while CUDA remained the dominant software environment for training and for production serving with heterogeneous workloads.
Tenstorrent announced the Wormhole successor, Blackhole, in August 2024 and launched developer products at the Tenstorrent Dev Day event in San Francisco on 3 April 2025. Blackhole moves the design to a 6 nanometer process, increases the number of integrated general purpose RISC-V cores, raises the on-die NoC bandwidth, and increases memory density. The headline Blackhole p150 PCIe card is rated at up to 774 FP8 TFLOPS, roughly 1.7 times the per-board throughput of the Wormhole n300, which it broadly matches in form factor.
Blackhole's developer cards launched at the same $999 (p100) and $1,399 (p150) price points that Wormhole established, and the TT-QuietBox workstation built around four Blackhole processors launched at $11,999. The Galaxy Blackhole rack-scale server reached general availability in April 2026 with broader Hugging Face model coverage than the Wormhole generation. Wormhole itself remained in the catalog as the lower-priced option in the Tenstorrent product range.