Blackhole (Tenstorrent)
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,488 words
Blackhole is the third-generation AI accelerator architecture from Tenstorrent, the Toronto and Santa Clara based fabless semiconductor company led by CEO Jim Keller. The Blackhole architecture succeeds the Grayskull and Wormhole generations and is the company's first design to integrate a substantial general purpose CPU complex on die, in the form of sixteen SiFive Intelligence X280 RISC-V cores capable of booting Linux directly on the accelerator. Each Blackhole chip pairs that CPU island with 140 Tensix++ compute cores, 32 GB of GDDR6 memory, a Gen5 PCI Express interface and ten 400 Gbps Ethernet links, packaging the result as a self contained AI computer rather than a coprocessor dependent on a host system.
The Blackhole silicon was first detailed at Hot Chips 2024 in August 2024, and Tenstorrent began selling Blackhole based developer products at its inaugural Tenstorrent Dev Day on April 3, 2025 in San Francisco. The launch lineup consisted of the entry level p100a card at 999 US dollars, the networked p150a and passive p150b cards at 1,399 US dollars, and the TT-QuietBox liquid cooled workstation containing four Blackhole processors at 11,999 US dollars. A two chip flagship card known as the p300 was placed on the roadmap but had not shipped as of early 2026. The family is sold alongside Tenstorrent's fully open source software stack, which includes the low level TT-Metalium runtime, the TT-NN neural network library and the TT-Forge compiler frontend.
Blackhole combines a relatively conventional packaging approach, using GDDR6 memory rather than the high bandwidth memory adopted by Nvidia Blackwell and AMD's Instinct line, with an unusually heterogeneous on die compute fabric and an aggressively open business model that ships full board schematics, kernel level access and source code for the entire software stack. Tenstorrent has positioned the architecture as a building block for what it calls scale out AI compute, with the on die Ethernet fabric intended to let multiple Blackhole processors federate into larger meshes without proprietary switches.
Blackhole is the third generation of accelerator from Tenstorrent, following Grayskull, a 12 nanometer design that shipped to early developers in 2020 and 2021, and Wormhole, which introduced GDDR6 memory and Ethernet based scaling. The naming is unrelated to Nvidia's Blackwell architecture, which is named for the mathematician David Blackwell, though the overlap has caused occasional confusion in coverage because both were prominent AI silicon stories of 2024 and 2025.
The project began under Tenstorrent's previous leadership and was carried through to tapeout after Jim Keller, who had joined the company in 2021 as chief technology officer, became chief executive in early 2023. Keller, previously known for leading CPU programs at AMD, Apple, Tesla and Intel, has positioned Blackhole as the first generation of Tenstorrent silicon to embody the company's longer term thesis that AI accelerators should be programmable in plain C++ against an exposed network of small cores rather than hidden behind a proprietary tensor compiler.
A single Blackhole die contains three principal classes of compute, all built around the open RISC-V instruction set. At the center of the die sit 140 Tensix++ cores arranged in a two dimensional mesh and connected by a fast Network on Chip, often abbreviated NoC. Around the periphery of the die sit dedicated controllers for GDDR6 memory, PCI Express, Ethernet and other system functions. Each of these controllers is itself driven by one or more small RISC-V cores. Finally, a discrete CPU island contains sixteen large RISC-V cores from SiFive that can run a general purpose operating system. Counting every RISC-V core on the die, including the ones embedded inside the Tensix cores and the controllers, Tenstorrent reports a total of 752 small ("baby") RISC-V cores plus the 16 large SiFive cores, making the chip fundamentally a RISC-V multiprocessor with specialized matrix engines attached.
The Blackhole die is manufactured on a 6 nanometer process at TSMC. Tenstorrent has not published full die area or transistor count figures.
Each Tensix++ core is itself a small heterogeneous compute cluster. A single core contains five baby RISC-V cores, a tile based math engine, a vector math engine, a pair of NoC routers and a block of L1 SRAM that functions as locally addressable scratchpad memory. The baby RISC-V cores act as orchestrators, issuing matrix and vector instructions to the math engines and managing data movement to and from the L1 scratchpad and the surrounding mesh.
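The core counts quoted above can be cross checked with a few lines of arithmetic. The sketch below uses only figures stated in this article. How the remaining baby cores are divided among the peripheral controllers is not given in the source, so only their total is derived.

```python
# Figures as stated in this article (Tenstorrent's reported counts).
TENSIX_CORES = 140      # physical Tensix++ cores on the die
BABY_PER_TENSIX = 5     # baby RISC-V cores inside each Tensix++ core
TOTAL_BABY = 752        # reported total of baby RISC-V cores on the die
BIG_CORES = 16          # SiFive X280 cores in the CPU island

# Baby cores that live inside the Tensix++ mesh.
baby_in_tensix = TENSIX_CORES * BABY_PER_TENSIX

# Whatever is left must sit in the memory, PCIe, Ethernet and other controllers.
baby_outside_tensix = TOTAL_BABY - baby_in_tensix

# Every RISC-V core on the die, small and large.
total_riscv = TOTAL_BABY + BIG_CORES

print(f"baby cores inside Tensix++ cores: {baby_in_tensix}")
print(f"baby cores in controllers and other blocks: {baby_outside_tensix}")
print(f"total RISC-V cores on die: {total_riscv}")
```

On these numbers, 700 of the 752 baby cores belong to the Tensix++ mesh, which leaves 52 for the peripheral controllers and other system blocks.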
The tile math engine supports a wide spread of data types including INT8, TF32, BF16, FP16, FP8 and several block floating point formats from two to eight bits per element. The vector math engine focuses on FP32, INT16 and INT32 operations. Across the full die, Blackhole reports 745 teraFLOPS of FP8 throughput and 372 teraFLOPS of FP16 throughput at the silicon level, though commercial cards have shipped with reduced configurations as discussed below.
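The block floating point formats mentioned above store one shared exponent per block of values plus a short mantissa per element. The following is a minimal generic sketch of that idea. The function names, the sample block and the rounding and clamping choices are illustrative assumptions and do not reproduce Tenstorrent's actual bit layout or block size.

```python
import math

def bfp_quantize(block, mantissa_bits):
    """Encode floats as one shared exponent plus signed integer mantissas.

    Generic illustration only; not Tenstorrent's hardware format.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so every |x| <= 2**e.
    _, shared_exp = math.frexp(max_abs)
    scale = 2 ** (mantissa_bits - 1)   # one bit is reserved for the sign
    limit = scale - 1                  # largest representable magnitude
    mantissas = [
        max(-limit, min(limit, round(x / 2.0 ** shared_exp * scale)))
        for x in block
    ]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2 ** (mantissa_bits - 1)
    return [m * 2.0 ** shared_exp / scale for m in mantissas]

block = [1.0, 0.5, -0.25, 0.0]
for bits in (8, 4, 2):
    exp, mant = bfp_quantize(block, bits)
    print(bits, exp, mant, bfp_dequantize(exp, mant, bits))
```

With eight bits per element this particular block round trips exactly, while at two bits only the largest value survives. This illustrates why narrower block formats trade accuracy for memory footprint and bandwidth.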
A distinguishing feature of the Tensix++ architecture compared to the Tensix+ cores in Wormhole is a substantial increase in L1 SRAM per tile. Across the chip, Blackhole provides 180 MB of total SRAM, which is a meaningful uplift over Wormhole and allows larger working sets to remain on chip during a kernel. The Network on Chip in Blackhole is also wider and faster than the equivalent fabric in Wormhole, which reduces contention for shared bandwidth across the mesh.
The most visible architectural change between Wormhole and Blackhole is the addition of a CPU island built from sixteen SiFive Intelligence X280 cores. The X280 is a 64 bit, dual issue, in order RISC-V core with a vector unit that targets edge AI and signal processing workloads. In Blackhole, the cores are organized into four clusters of four cores each and are intended to run a full Linux distribution directly on the accelerator. Tenstorrent has published a demonstration of Linux booting on the X280 island in its public tt-bh-linux repository.
The practical purpose of the CPU island is to remove the host PC from the critical path for small or latency sensitive workloads. On Wormhole, small batch inference often spent significant time waiting on the host CPU for orchestration, since the on chip baby RISC-V cores were not suitable for general purpose code paths. With the X280 cluster, Blackhole can run scheduling, runtime services and even small portions of model code without leaving the card, an approach Tenstorrent describes as making Blackhole a standalone AI computer.
Each Blackhole chip is paired with 32 GB of GDDR6 memory, organized across eight memory controllers of 4 GB each. Peak GDDR6 bandwidth is 512 GB per second. This compares to 6 controllers and 12 GB of GDDR6 per chip on Wormhole, with peak bandwidth of approximately 336 GB per second. The choice of GDDR6 rather than HBM is a deliberate one, intended to hit aggressive price points and avoid the supply constraints associated with HBM3 and HBM3e.
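The memory figures follow from simple per controller arithmetic, sketched below. The even split of bandwidth across controllers is an assumption, as is the seven controller configuration: neither is stated in the source, but a seven controller setup reproduces the 28 GB and 448 GB per second listed for the p100a.

```python
# Figures stated in this article for a full Blackhole chip.
CONTROLLERS_FULL = 8
GB_PER_CONTROLLER = 4
PEAK_GBYTES_PER_S_FULL = 512

# Assumption: bandwidth is divided evenly across the GDDR6 controllers.
gbytes_per_s_per_controller = PEAK_GBYTES_PER_S_FULL / CONTROLLERS_FULL

def memory_config(active_controllers):
    """Return (capacity in GB, peak bandwidth in GB/s) for a controller count."""
    capacity_gb = active_controllers * GB_PER_CONTROLLER
    bandwidth = active_controllers * gbytes_per_s_per_controller
    return capacity_gb, bandwidth

full = memory_config(8)      # all eight controllers active
reduced = memory_config(7)   # hypothetical seven controller configuration

print(f"per controller bandwidth: {gbytes_per_s_per_controller} GB/s")
print(f"8 controllers: {full[0]} GB, {full[1]} GB/s")
print(f"7 controllers: {reduced[0]} GB, {reduced[1]} GB/s")
```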
The Network on Chip carries traffic between the Tensix tiles, the memory controllers, the PCIe block, the Ethernet block and the CPU island. Tenstorrent has emphasized that the NoC is designed to expose its addressing and routing model to user kernels, so that programmers can explicitly choreograph data movement across the chip rather than relying on opaque caches.
Blackhole exposes its scaling fabric as standard Ethernet rather than a proprietary interconnect. The chip provides ten Ethernet links at 400 Gbps each, which is 4 Tbps (500 GB per second) in each direction, or approximately 1 TB per second of off chip bandwidth when both directions are counted. On a p150 card these links are surfaced as four QSFP-DD 800 Gbps ports on the card bracket, which can be cabled directly to other Blackhole cards or to standard Ethernet switches. This contrasts with proprietary interconnects such as Nvidia's NVLink, which relies on dedicated NVSwitch silicon to scale beyond a handful of devices, and AMD's Infinity Fabric. Tenstorrent has used the same fabric to build its Blackhole Galaxy reference system, a 32 chip 4 by 8 mesh with around 23.8 petaFLOPS of FP8 throughput and 1 TB of pooled memory.
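The Ethernet bandwidth figures can be reproduced with a short unit conversion. This is a sketch using only the link and port counts given above. How individual chip links map onto the QSFP-DD ports is not stated in the source, so only aggregate rates are computed.

```python
# Figures stated in this article.
LINKS = 10             # Ethernet links per chip
LINK_GBPS = 400        # gigabits per second per link, per direction
QSFP_PORTS = 4         # QSFP-DD ports on a p150 card bracket
QSFP_PORT_GBPS = 800   # gigabits per second per port

def gbps_to_gbytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8

chip_gbps_per_direction = LINKS * LINK_GBPS
chip_gbytes_per_direction = gbps_to_gbytes_per_s(chip_gbps_per_direction)
chip_gbytes_bidirectional = 2 * chip_gbytes_per_direction

bracket_gbps = QSFP_PORTS * QSFP_PORT_GBPS

print(f"chip, one direction: {chip_gbps_per_direction} Gbps "
      f"= {chip_gbytes_per_direction} GB/s")
print(f"chip, both directions: {chip_gbytes_bidirectional} GB/s")
print(f"card bracket: {bracket_gbps} Gbps across {QSFP_PORTS} ports")
```

The roughly 1 TB per second headline figure is reached only when both directions are counted. The four bracket ports together carry 3.2 Tbps, which is less than the 4 Tbps the chip's ten links provide.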
At Dev Day in April 2025, Tenstorrent launched three Blackhole PCIe cards together with the QuietBox workstation. A fourth card, the dual chip p300, was placed on the roadmap. All commercial Blackhole cards ship with 120 Tensix cores enabled rather than the full 140 cores present on the die, a configuration discussed in more detail in the section on the firmware reduction below.
| Card | Configuration | Memory | Memory bandwidth | TBP | Cooling | Ethernet | Price (USD) |
|---|---|---|---|---|---|---|---|
| p100a | 1 Blackhole, 120 Tensix cores | 28 GB GDDR6 | 448 GB/s | 300 W | Active axial | None | 999 |
| p150a | 1 Blackhole, 120 Tensix cores | 32 GB GDDR6 | 512 GB/s | 300 W | Active axial | 4 x QSFP-DD 800G | 1,399 |
| p150b | 1 Blackhole, 120 Tensix cores | 32 GB GDDR6 | 512 GB/s | 300 W | Passive | 4 x QSFP-DD 800G | 1,399 |
| p300 (roadmap) | 2 Blackhole, dual ASIC | 64 GB GDDR6 | ~1 TB/s aggregate | TBD | TBD | TBD | TBD |
All three shipping cards use a PCI Express 5.0 x16 host interface and require a single 12+4 pin 12V-2x6 power connector fed from an ATX 3.1 certified or better power supply. The recommended host configuration is 64 GB of system memory, Ubuntu 22.04 and at least 100 GB of free storage, with 2 TB or more recommended for serious experimentation. The p100a is targeted at desktop developers who want a single card without networking, the p150a targets workstation users who plan to cable two or more cards together, and the p150b is the passively cooled variant intended for server chassis with chassis level airflow.
Original documentation and early marketing for Blackhole referenced a configuration of 140 Tensix cores per chip, matching the physical core count on the die. After early p150 units shipped, Tenstorrent disclosed that it would reduce the enabled core count to 120 via a mandatory firmware update, and that all subsequent cards would ship with 120 cores enabled. The company characterized the change as a yield and reliability adjustment and stated that the expected real world performance impact was on the order of one to two percent. The decision generated significant discussion on enthusiast forums and was covered in detail by Tom's Hardware and other outlets, partly because firmware downgrades that permanently disable already paid for compute are unusual in the PC accelerator market.
The published specification page for the Blackhole cards now lists 120 Tensix cores across the p100a, p150a and p150b. Reported chip level throughput numbers of 745 teraFLOPS FP8 refer to the underlying silicon. The shipped commercial figures are 664 BLOCKFP8 teraFLOPS per card at the 120 core configuration.
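A quick calculation shows how the 140 and 120 core figures relate to the two published throughput numbers. This is a sketch under the assumption that peak throughput scales linearly with enabled cores. The source does not say whether the 745 and 664 teraFLOPS figures share the same clock and number format, and the mismatch below suggests they may not.

```python
# Figures stated in this article.
PHYSICAL_CORES = 140
ENABLED_CORES = 120
SILICON_FP8_TFLOPS = 745.0        # chip level figure for the full die
SHIPPED_BLOCKFP8_TFLOPS = 664.0   # per card figure at 120 cores

# Fraction of the die's Tensix++ cores left enabled after the firmware update.
core_ratio = ENABLED_CORES / PHYSICAL_CORES

# Assumption: peak throughput scales linearly with enabled core count.
naive_scaled_tflops = SILICON_FP8_TFLOPS * core_ratio

# Implied per core rates behind each published figure.
per_core_silicon = SILICON_FP8_TFLOPS / PHYSICAL_CORES
per_core_shipped = SHIPPED_BLOCKFP8_TFLOPS / ENABLED_CORES

print(f"enabled fraction of cores: {core_ratio:.1%}")
print(f"745 TFLOPS scaled to 120 cores: {naive_scaled_tflops:.1f} TFLOPS")
print(f"implied per core rate, silicon figure: {per_core_silicon:.2f} TFLOPS")
print(f"implied per core rate, shipped figure: {per_core_shipped:.2f} TFLOPS")
```

The shipped figure implies a slightly higher per core rate than the silicon figure, so the two numbers should be read as separate specifications rather than one derived from the other.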
Tenstorrent ships an end to end software stack for Blackhole that is fully open source on GitHub and is shared with the earlier Wormhole generation. The stack is organized as a layered set of components, each of which is usable independently.
| Layer | Component | Purpose |
|---|---|---|
| Low level kernel API | TT-LLK | C++ kernel intrinsics for the Tensix math engines |
| Runtime and host API | TT-Metalium | Plain C++ programming model that exposes the Network on Chip, cores and L1 SRAM directly |
| Neural network library | TT-NN | Operator library implementing common deep learning kernels on top of TT-Metalium |
| Compiler frontend | TT-Forge | MLIR based compiler that ingests PyTorch, ONNX, JAX, TensorFlow and vLLM workloads |
TT-Metalium is positioned as the analogue of CUDA at the C++ level. It exposes the chip's mesh of Tensix cores, allows the developer to allocate L1 scratchpad memory, dispatch kernels to specific tiles, and orchestrate data movement over the NoC. TT-NN sits above this and provides a more familiar operator library for those who want to build models without writing kernels by hand. TT-Forge, an MLIR based compiler frontend, is intended to take models from standard frameworks and lower them onto the underlying stack, broadly comparable in role to OpenAI Triton or to AMD's ROCm graph compilers. Linux on the X280 island uses standard upstream RISC-V kernels.
Unusually for the AI accelerator market, Tenstorrent also publishes board schematics, full register level documentation and source code for its firmware. The company has framed this transparency as a deliberate counterweight to the closed nature of competing accelerators, and it is one of the principal reasons that Blackhole has attracted attention from academic and open source projects despite shipping in much smaller volumes than products from Nvidia and AMD.
Blackhole is a substantial step up from Wormhole on most axes, though the two generations share many architectural ideas. The Tensix tile concept, the baby RISC-V orchestrators, the use of GDDR6 and the Ethernet based scale out fabric all carry over from Wormhole. The major differences lie in compute density, on chip memory, network bandwidth and the new on die CPU island.
| Attribute | Wormhole (n150 / n300) | Blackhole (p100a / p150) |
|---|---|---|
| Process node | TSMC 12 nm | TSMC 6 nm |
| Tensix cores per chip | 80 (Tensix+) | 140 physical, 120 enabled (Tensix++) |
| Big RISC-V CPU cores | None on die | 16 SiFive X280 |
| GDDR6 memory per chip | 12 GB | 32 GB |
| GDDR6 controllers per chip | 6 x 2 GB | 8 x 4 GB |
| GDDR6 bandwidth per chip | ~336 GB/s | 512 GB/s |
| Ethernet | 16 x 100 Gbps | 10 x 400 Gbps |
| PCIe | Gen4 x16 | Gen5 x16 |
| FP8 throughput per chip | ~328 TFLOPS | 745 TFLOPS silicon, 664 TFLOPS shipped |
| Card power | 160 W (n150d) to 300 W (n300d) | 300 W |
| Software stack | TT-Buda, TT-Metalium | TT-Forge, TT-NN, TT-Metalium |
The combined effect is that a single p150 card carries roughly twice the FP8 throughput (664 versus about 328 teraFLOPS), more than two and a half times the memory (32 GB versus 12 GB) and a faster, simpler Ethernet fabric compared to a Wormhole n150. The addition of the X280 island substantially reduces the host CPU bottleneck that affected small batch workloads on Wormhole, since orchestration can now run directly on the card.
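The generational ratios can be computed directly from the comparison table. The sketch below uses the table's per chip values, taking the shipped 664 teraFLOPS figure for Blackhole. The dictionary keys are illustrative names only.

```python
# Per chip values taken from the comparison table in this article.
wormhole = {
    "fp8_tflops": 328.0,
    "memory_gb": 12,
    "memory_bw_gbytes_per_s": 336.0,
    "ethernet_gbps": 16 * 100,
}
blackhole = {
    "fp8_tflops": 664.0,   # shipped 120 core configuration
    "memory_gb": 32,
    "memory_bw_gbytes_per_s": 512.0,
    "ethernet_gbps": 10 * 400,
}

# Blackhole value divided by the Wormhole value for each attribute.
ratios = {key: blackhole[key] / wormhole[key] for key in wormhole}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

On these figures the uplift is about 2.0x in FP8 throughput, 2.7x in memory capacity, 1.5x in memory bandwidth and 2.5x in aggregate Ethernet rate.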
The AI accelerator market in 2025 and 2026 is dominated by Nvidia's Blackwell generation, with AMD and a number of specialist start ups including Groq and Cerebras competing on a mix of price, software and architectural niches. Blackhole occupies a distinctive position in this landscape. It is a PCIe card with a list price comparable to a consumer GPU, but it ships with substantially more memory than mainstream consumer Nvidia and AMD cards and with a fully open software stack. It does not attempt to match the absolute throughput of HBM equipped data center parts.
| Accelerator | Memory | Memory bandwidth | FP8 throughput | Interconnect | Indicative price (USD) |
|---|---|---|---|---|---|
| Tenstorrent Blackhole p150 | 32 GB GDDR6 | 512 GB/s | 664 TFLOPS | 10 x 400G Ethernet | 1,399 |
| Nvidia B200 (Blackwell) | 192 GB HBM3e | ~8 TB/s | ~4,500 TFLOPS dense | NVLink 5 | tens of thousands, OEM |
| AMD Instinct MI325X | 256 GB HBM3e | ~6 TB/s | ~2,600 TFLOPS dense | Infinity Fabric | tens of thousands, OEM |
| Groq LPU | 230 MB SRAM | on chip only | inference focused | proprietary | not sold standalone |
Direct comparison is difficult because the products target very different market segments. Nvidia's B200 and AMD's MI325X are HBM equipped data center accelerators sold in eight way OAM trays at OEM prices that are typically more than an order of magnitude higher than a Blackhole p150. Groq's LPU is an inference focused architecture that is not sold as a discrete card and is instead offered as a cloud service. Blackhole's nearest competitors at the developer card price point are arguably high end Nvidia consumer GPUs such as the RTX 5090, which has been compared head to head against the p150 in several enthusiast reviews. Compared to most consumer GPUs, the p150 offers more on board memory (the RTX 5090 matches it at 32 GB), native datacenter style Ethernet for scale out, full open documentation and a software stack that does not rely on Nvidia's proprietary CUDA, but it has much narrower model and operator coverage and significantly less mature performance tuning.
In addition to the discrete cards, Tenstorrent has shipped or previewed two larger Blackhole systems. The TT-QuietBox is a desktop workstation containing four Blackhole processors with a liquid cooling loop, priced at 11,999 US dollars at launch. It is intended as a developer station for engineers who want to prototype multi card workloads without building their own server. The QuietBox was reviewed publicly in late 2025 and received generally favorable coverage for its acoustics and software experience.
The Blackhole Galaxy is a rack scale reference design that combines 32 Blackhole accelerators in a 4 by 8 mesh. Tenstorrent reports the Galaxy as delivering around 23.8 petaFLOPS of FP8 throughput and 11.9 petaFLOPS of FP16 throughput, with 1 TB of pooled memory and an aggregate 16 TB per second of memory bandwidth. Because each Blackhole chip carries its own Ethernet endpoints, the Galaxy can also be configured as a flexible AI switch with up to 11.2 TB per second of aggregate switching bandwidth, in addition to or instead of running as a single coherent compute domain.
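The Galaxy totals can be checked by multiplying the per chip figures quoted earlier in this article by 32. This sketch assumes simple linear aggregation across the mesh, with illustrative dictionary keys.

```python
# Mesh shape and per chip figures stated in this article.
MESH_ROWS, MESH_COLS = 4, 8
CHIPS = MESH_ROWS * MESH_COLS

per_chip = {
    "fp8_tflops": 745,              # full silicon figure
    "fp16_tflops": 372,
    "memory_gb": 32,
    "memory_bw_gbytes_per_s": 512,
}

# Assumption: system totals are the plain sum over all chips.
totals = {key: CHIPS * value for key, value in per_chip.items()}

print(f"chips: {CHIPS}")
print(f"FP8: {totals['fp8_tflops'] / 1000:.2f} petaFLOPS")
print(f"FP16: {totals['fp16_tflops'] / 1000:.2f} petaFLOPS")
print(f"memory: {totals['memory_gb']} GB")
print(f"memory bandwidth: {totals['memory_bw_gbytes_per_s'] / 1000:.1f} TB/s")
```

The reported 23.8 petaFLOPS FP8 total matches the 745 teraFLOPS silicon figure per chip rather than the 664 teraFLOPS figure quoted for shipping cards.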
Blackhole has been received with substantial interest from the open source and hobbyist AI communities, due primarily to the combination of 32 GB of memory at a sub 1,500 US dollar price point and a fully open software stack. Reviews from outlets including The Register and Hardware Corner have highlighted that the cards are practical for running local large language models that would not fit on typical consumer GPUs, with the trade off that performance is highly model dependent and operator coverage in TT-NN is still maturing.
Coverage of the firmware reduction from 140 to 120 cores has been more critical, with several outlets describing the move as unusual for the PC accelerator market and questioning the precedent of disabling already paid for hardware via a mandatory update. Tenstorrent's response has been that the long term performance impact is on the order of one to two percent and that the change is required for sustained reliability across the install base.
From an architectural standpoint, Blackhole is widely considered the most ambitious shipping example of the all RISC-V AI accelerator concept. Academic microbenchmark studies, including a 2025 paper titled Dissecting the Tenstorrent Blackhole Architecture via Microbenchmarking, have used the chip as a vehicle for characterizing exposed Network on Chip programming models and scratchpad based AI compute. The choice of GDDR6 over HBM, the use of standard Ethernet for scaling, and the full open hardware and software release have all been cited in broader discussions of how the AI accelerator industry might evolve beyond proprietary stacks.
Whether Blackhole succeeds commercially against entrenched competitors remains open as of early 2026. Tenstorrent does not publish unit shipment figures and is not believed to have penetrated the hyperscale training market that Nvidia dominates. The company's near term focus appears to be on enterprise inference, embedded AI for automotive and industrial customers, and licensing of its IP for AI chips integrated into other companies' systems on chip.