Blackhole (Tenstorrent)

Updated Jun 28, 2026

Blackhole is the third-generation AI accelerator architecture from Tenstorrent, the Toronto and Santa Clara based fabless semiconductor company led by CEO Jim Keller. Each Blackhole processor pairs an array of Tensix++ compute cores with a cluster of sixteen general purpose RISC-V CPU cores that can boot Linux directly on the chip, and it scales to other Blackhole processors over standard Ethernet, so Tenstorrent describes it as "a standalone AI computer" rather than a coprocessor that depends on a host system. ^[1]^[5]^[7] The architecture succeeds the earlier Grayskull and Wormhole generations, and Tenstorrent ships it with a fully open source software stack. ^[2]^[5]

A single Blackhole chip combines 140 Tensix++ compute cores (120 enabled on shipping cards), the sixteen SiFive Intelligence X280 RISC-V cores, 32 GB of GDDR6 memory, a Gen5 PCI Express interface and ten 400 Gbps Ethernet links. ^[1]^[5] Tenstorrent first detailed the silicon at Hot Chips 2024 in August 2024 and began selling Blackhole based developer products at its inaugural Tenstorrent Dev Day on April 3, 2025 in San Francisco. ^[5]^[2] The launch lineup consisted of the entry level p100 card at 999 US dollars, the networked p150 card at 1,399 US dollars, and the TT-QuietBox liquid cooled workstation containing four Blackhole processors at 11,999 US dollars. ^[2] The family ships alongside Tenstorrent's open source stack, which includes the low level TT-Metalium runtime, the TT-NN neural network library and the TT-Forge compiler frontend. ^[2]

Blackhole combines a relatively conventional packaging approach, using GDDR6 memory rather than the high bandwidth memory adopted by Nvidia Blackwell and AMD's Instinct line, with an unusually heterogeneous on die compute fabric and an aggressively open business model that ships full board schematics, kernel level access and source code for the entire software stack. ^[5] Tenstorrent has positioned the architecture as a building block for what it calls scale out AI compute, with the on die Ethernet fabric intended to let multiple Blackhole processors federate into larger meshes without proprietary switches. ^[5]^[7]

What is Tenstorrent Blackhole?

Blackhole is the third generation of accelerator from Tenstorrent, following Grayskull, a 12 nanometer design that shipped to early developers in 2020 and 2021, and Wormhole, which introduced GDDR6 memory and Ethernet based scaling. ^[5] The defining idea of the design is that it is, in Tenstorrent's framing, a "standalone AI computer based on Ethernet": the sixteen large RISC-V cores are powerful enough to run Linux and act as an on-device host, which removes the requirement for a separate host CPU, while the chip's ten 400 Gbps Ethernet links let many processors be wired together directly. ^[5]^[7]

The naming is unrelated to Nvidia's Blackwell architecture, which is named for the mathematician David Blackwell, though the overlap has caused occasional confusion in coverage because both were prominent AI silicon stories of 2024 and 2025. ^[5]

The project began under Tenstorrent's previous leadership and was carried through to tapeout after Jim Keller, who had joined the company in 2021 as chief technology officer, became chief executive in early 2023. Keller, previously known for leading CPU programs at AMD, Apple, Tesla and Intel, has positioned Blackhole as the first generation of Tenstorrent silicon to embody the company's longer term thesis that AI accelerators should be programmable in plain C++ against an exposed network of small cores rather than hidden behind a proprietary tensor compiler. Keller has summarized Tenstorrent's positioning against the market leader bluntly, telling EE Times: "Whatever Nvidia does, we'll do the opposite." ^[15]

What is the Blackhole architecture?

Chip overview

A single Blackhole die contains three principal classes of compute, all built around the open RISC-V instruction set. At the center of the die sit 140 Tensix++ cores arranged in a two dimensional mesh and connected by a fast on chip Network on Chip, often abbreviated NoC. ^[5]^[7] Around the periphery of the die sit dedicated controllers for GDDR6 memory, PCI Express, Ethernet and other system functions. Each of these controllers is itself driven by one or more small RISC-V cores. Finally, a discrete CPU island contains sixteen large RISC-V cores from SiFive that can run a general purpose operating system. ^[8] Counting every RISC-V core on the die, including the ones embedded inside the Tensix cores and the controllers, Tenstorrent reports a total of 752 small or baby RISC-V cores plus the 16 large SiFive cores, for 768 RISC-V cores in all, making the chip fundamentally a RISC-V multiprocessor with specialized matrix engines bolted on. ^[5]
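The totals above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes all 140 physical Tensix++ cores each contain five baby RISC-V cores, as described for the Tensix++ core; the count of small cores outside the Tensix array is inferred by subtraction here and is not a figure quoted from the sources.

```python
# Cross-check of the RISC-V core totals quoted in the text.
TENSIX_CORES = 140           # physical Tensix++ cores on the die
BABY_CORES_PER_TENSIX = 5    # baby RISC-V cores inside each Tensix++ core
BABY_CORES_TOTAL = 752       # small RISC-V cores reported for the whole die
BIG_CORES = 16               # SiFive X280 cores in the CPU island

baby_in_tensix = TENSIX_CORES * BABY_CORES_PER_TENSIX
# Inferred remainder (assumption): small cores that drive the peripheral controllers.
baby_elsewhere = BABY_CORES_TOTAL - baby_in_tensix
total_riscv = BABY_CORES_TOTAL + BIG_CORES

print(f"baby cores in Tensix array: {baby_in_tensix}")
print(f"baby cores elsewhere (inferred): {baby_elsewhere}")
print(f"total RISC-V cores: {total_riscv}")
```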

The Blackhole die is manufactured on a 6 nanometer process at TSMC. ^[2]^[5] Tenstorrent has not published full die area or transistor count figures.

Tensix++ cores

Each Tensix++ core is itself a small heterogeneous compute cluster. A single core contains five baby RISC-V cores, a tile based math engine, a vector math engine, a pair of NoC routers and a block of L1 SRAM that functions as locally addressable scratchpad memory. ^[5]^[7] The baby RISC-V cores act as orchestrators, issuing matrix and vector instructions to the math engines and managing data movement to and from the L1 scratchpad and the surrounding mesh.

The tile math engine supports a wide spread of data types including INT8, TF32, BF16, FP16, FP8 and several block floating point formats from two to eight bits per element. The vector math engine focuses on FP32, INT16 and INT32 operations. Across the full die, Blackhole reports 745 teraFLOPS of FP8 throughput and 372 teraFLOPS of FP16 throughput at the silicon level, though commercial cards have shipped with reduced configurations as discussed below. ^[1]^[5]
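Block floating point stores one shared exponent for a group of values and a short mantissa for each element, which is how formats of two to eight bits per element remain usable. The following is a generic sketch of that idea, not Tenstorrent's actual encoding: the function names, the block size, the rounding rule and the choice to count the sign inside `bits` are all assumptions made for illustration.

```python
import math

def bfp_quantize(block, bits):
    """Quantize floats to `bits`-bit signed mantissas that share one exponent."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # frexp gives max_abs = m * 2**exp with 0.5 <= m < 1, so every |x| < 2**exp.
    _, exp = math.frexp(max_abs)
    scale = 2.0 ** (exp - (bits - 1))   # value of one mantissa step
    limit = 2 ** (bits - 1) - 1         # largest representable mantissa
    mantissas = [max(-limit, min(limit, round(x / scale))) for x in block]
    return mantissas, exp

def bfp_dequantize(mantissas, exp, bits):
    """Rebuild approximate floats from the mantissas and the shared exponent."""
    scale = 2.0 ** (exp - (bits - 1))
    return [m * scale for m in mantissas]

block = [1.0, 0.5, -0.25, 0.0]           # hypothetical example values
m8, e8 = bfp_quantize(block, bits=8)
m2, e2 = bfp_quantize(block, bits=2)
print(bfp_dequantize(m8, e8, 8))
print(bfp_dequantize(m2, e2, 2))
```

With 8 bits the example block is reproduced exactly, while at 2 bits only the largest element survives, which illustrates why narrower formats trade accuracy for footprint.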

A distinguishing feature of the Tensix++ architecture compared to the Tensix+ cores in Wormhole is a substantial increase in L1 SRAM per tile. Across the chip, Blackhole provides 180 MB of total SRAM, which is a meaningful uplift over Wormhole and allows larger working sets to remain on chip during a kernel. The Network on Chip in Blackhole is also wider and faster than the equivalent fabric in Wormhole, which reduces contention for shared bandwidth across the mesh. ^[2]

Big RISC-V CPU island

The most visible architectural change between Wormhole and Blackhole is the addition of a CPU island built from sixteen SiFive Intelligence X280 cores. ^[8] The X280 is a 64 bit, dual issue, in order RISC-V core with a vector unit that targets edge AI and signal processing workloads. In Blackhole, the cores are organized into four clusters of four cores each and are intended to run a full Linux distribution directly on the accelerator. ^[5] Tenstorrent has published a demonstration of Linux booting on the X280 island in its public tt-bh-linux repository. ^[9]

The practical purpose of the CPU island is to remove the host PC from the critical path for small or latency sensitive workloads. On Wormhole, small batch inference often spent significant time waiting on the host CPU for orchestration, since the on chip baby RISC-V cores were not suitable for general purpose code paths. With the X280 cluster, Blackhole can run scheduling, runtime services and even small portions of model code without leaving the card, an approach Tenstorrent describes as making Blackhole a standalone AI computer. ^[5]^[7]

Memory and Network on Chip

Each Blackhole chip is paired with 32 GB of GDDR6 memory, organized across eight memory controllers of 4 GB each. Peak GDDR6 bandwidth is 512 GB per second. ^[1]^[5] This compares to 6 controllers and 12 GB of GDDR6 per chip on Wormhole, with peak bandwidth of approximately 336 GB per second. The choice of GDDR6 rather than HBM is a deliberate one, intended to hit aggressive price points and avoid the supply constraints associated with HBM3 and HBM3e.
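The capacity and bandwidth figures follow from the controller count. The sketch below assumes bandwidth is spread evenly across the eight controllers, and it infers (as an assumption, not a sourced fact) that the 28 GB, 448 GB/s p100a configuration corresponds to seven active controllers.

```python
# GDDR6 capacity and bandwidth derived from the controller layout in the text.
CONTROLLERS = 8
GB_PER_CONTROLLER = 4
PEAK_BANDWIDTH_GB_S = 512

capacity_gb = CONTROLLERS * GB_PER_CONTROLLER
# Assumption: every controller contributes an equal share of peak bandwidth.
bandwidth_per_controller = PEAK_BANDWIDTH_GB_S / CONTROLLERS

# Inferred: a 28 GB card would have 28 / 4 = 7 controllers populated.
p100a_controllers = 28 // GB_PER_CONTROLLER
p100a_bandwidth = p100a_controllers * bandwidth_per_controller

print(capacity_gb, bandwidth_per_controller, p100a_controllers, p100a_bandwidth)
```

Under these assumptions the result matches the 448 GB/s listed for the p100a.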

The Network on Chip carries traffic between the Tensix tiles, the memory controllers, the PCIe block, the Ethernet block and the CPU island. Tenstorrent has emphasized that the NoC is designed to expose its addressing and routing model to user kernels, so that programmers can explicitly choreograph data movement across the chip rather than relying on opaque caches. ^[7]

How does Blackhole scale over Ethernet?

Blackhole exposes its scaling fabric as standard Ethernet rather than a proprietary interconnect. The chip provides ten Ethernet links at 400 Gbps each, quoted as approximately 1 TB per second of off chip bandwidth, a figure that matches the 4 Tbps raw line rate (500 GB per second) counted in both directions. ^[1]^[5] On a p150 card these links are surfaced as four QSFP-DD 800 Gbps ports on the card bracket, which can be cabled directly to other Blackhole cards or to standard Ethernet switches. ^[2] This contrasts with proprietary interconnects such as Nvidia's NVLink, which relies on dedicated NVSwitch silicon to scale beyond small directly connected groups, and AMD's Infinity Fabric. At Hot Chips 2024, Tenstorrent engineer Davor Capalija described the design as composable from a single building block, saying, "You can make an entire training cluster just using this as a Lego." ^[5]
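The Ethernet figures can be reproduced from the link count and line rate. In this sketch the reading of the roughly 1 TB per second figure as a bidirectional total, and the conclusion that the four faceplate ports carry the equivalent of eight 400 Gbps links, are inferences from the arithmetic rather than statements from the sources.

```python
# Off-chip Ethernet bandwidth from the link count and line rate in the text.
LINKS = 10
LINK_GBPS = 400          # gigabits per second per link
QSFP_PORTS = 4           # QSFP-DD ports on a p150 bracket
PORT_GBPS = 800

total_gbps = LINKS * LINK_GBPS
per_direction_gb_s = total_gbps / 8          # bits to bytes
both_directions_gb_s = per_direction_gb_s * 2

faceplate_gbps = QSFP_PORTS * PORT_GBPS
# Inferred: how many 400 Gbps links the bracket ports can carry.
links_on_faceplate = faceplate_gbps // LINK_GBPS

print(total_gbps, per_direction_gb_s, both_directions_gb_s)
print(faceplate_gbps, links_on_faceplate)
```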

Tenstorrent has used the same fabric to build its Blackhole Galaxy reference system, a 32 chip 4 by 8 mesh with around 23.8 petaFLOPS of FP8 throughput and 1 TB of pooled memory. ^[5] Because each Blackhole chip carries its own Ethernet endpoints, the mesh does not need an external switch fabric to form a single coherent compute domain.
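To make the 4 by 8 arrangement concrete, the sketch below enumerates chip-to-chip neighbours in a plain two dimensional mesh. It assumes one link between adjacent chips and no wraparound at the edges; the actual Galaxy cabling, link multiplicity and any torus connections are not described in the sources, and the names used are illustrative.

```python
# Neighbour enumeration for a 4 x 8 mesh of chips (assumed: no wraparound links).
ROWS, COLS = 4, 8

def neighbors(row, col):
    """Return the in-bounds north, south, west and east neighbours of a chip."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < ROWS and 0 <= c < COLS]

chips = [(r, c) for r in range(ROWS) for c in range(COLS)]
# Each link is seen from both of its endpoints, so halve the degree sum.
link_count = sum(len(neighbors(r, c)) for r, c in chips) // 2
max_degree = max(len(neighbors(r, c)) for r, c in chips)
# Longest shortest path: corner to opposite corner.
diameter = (ROWS - 1) + (COLS - 1)

print(len(chips), link_count, max_degree, diameter)
```

In this simplified model no chip needs more than four neighbour connections, which fits within the ten Ethernet links each chip provides.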

What are the Blackhole specifications and card variants?

At Dev Day in April 2025, Tenstorrent launched three Blackhole PCIe cards together with the QuietBox workstation. ^[2] A fourth card, the dual chip p300, was placed on the roadmap. All commercial Blackhole cards ship with 120 Tensix cores enabled rather than the full 140 cores present on the die, a configuration discussed in more detail in the section on the firmware reduction below. ^[3]^[6]

Card	Configuration	Memory	Memory bandwidth	TBP	Cooling	Ethernet	Price (USD)
p100a	1 Blackhole, 120 Tensix cores	28 GB GDDR6	448 GB/s	300 W	Active axial	None	999
p150a	1 Blackhole, 120 Tensix cores	32 GB GDDR6	512 GB/s	300 W	Active axial	4 x QSFP-DD 800G	1,399
p150b	1 Blackhole, 120 Tensix cores	32 GB GDDR6	512 GB/s	300 W	Passive	4 x QSFP-DD 800G	1,399
p300 (roadmap)	2 Blackhole, dual ASIC	64 GB GDDR6	~1 TB/s aggregate	TBD	TBD	TBD	TBD

All three shipping cards use a PCI Express 5.0 x16 host interface and require a single 12+4 pin 12V-2x6 power connector fed from an ATX 3.1 certified or better power supply. ^[3] The recommended host configuration is 64 GB of system memory, Ubuntu 22.04 and at least 100 GB of free storage, with 2 TB or more recommended for serious experimentation. ^[3] The p100a is targeted at desktop developers who want a single card without networking, the p150a targets workstation users who plan to cable two or more cards together, and the p150b is the passively cooled variant intended for server chassis with chassis level airflow. ^[2]^[3]

Why did Blackhole drop from 140 to 120 cores?

Original documentation and early marketing for Blackhole referenced a configuration of 140 Tensix cores per chip, matching the physical core count on the die. After early p150 units shipped, Tenstorrent disclosed that it would reduce the enabled core count to 120 via a mandatory firmware update, and that all subsequent cards would ship with 120 cores enabled. ^[6] The company characterized the change as a yield and reliability adjustment and stated that the expected real world performance impact was on the order of one to two percent. ^[6] The decision generated significant discussion on enthusiast forums and was covered in detail by Tom's Hardware and other outlets, partly because firmware downgrades that permanently disable already paid for compute are unusual in the PC accelerator market. ^[6]

The published specification page for the Blackhole cards now lists 120 Tensix cores across the p100a, p150a and p150b. ^[3] The reported chip level throughput of 745 teraFLOPS FP8 refers to the full 140 core silicon. The published figure for shipping cards is 664 teraFLOPS of BLOCKFP8 per card at the 120 core configuration. ^[3]
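For scale, the firmware change disables 20 of 140 cores. The sketch below computes that share and what strictly linear scaling of the 745 teraFLOPS silicon figure would give. It assumes throughput scales with core count alone; the published 664 teraFLOPS figure is quoted for a different format (BLOCKFP8), so the two numbers are not directly comparable.

```python
# Share of cores disabled and a naive linear throughput estimate.
PHYSICAL_CORES = 140
ENABLED_CORES = 120
SILICON_FP8_TFLOPS = 745
PUBLISHED_CARD_TFLOPS = 664   # BLOCKFP8 figure from the specification page

disabled_pct = (PHYSICAL_CORES - ENABLED_CORES) / PHYSICAL_CORES * 100
# Assumption: peak throughput is proportional to the number of enabled cores.
linear_estimate = SILICON_FP8_TFLOPS * ENABLED_CORES / PHYSICAL_CORES

print(f"cores disabled: {disabled_pct:.1f}%")
print(f"linear estimate: {linear_estimate:.1f} TFLOPS")
print(f"published card figure: {PUBLISHED_CARD_TFLOPS} TFLOPS")
```

The gap between the share of cores disabled and the one to two percent impact Tenstorrent cites is consistent with real workloads being limited by factors other than peak math throughput, though the sources do not break this down.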

Is the Blackhole software stack open source?

Tenstorrent ships an end to end software stack for Blackhole that is fully open source on GitHub and is shared with the earlier Wormhole generation. ^[2] In its launch announcement, the company stated that the "Blackhole cards and TT-Quietbox are fully supported by Tenstorrent's open source TT-Forge, TT-NN, TT-Metalium, and TT-LLK software stacks." ^[2] The stack is organized as a layered set of components, each of which is usable independently.

Layer	Component	Purpose
Low level kernel API	TT-LLK	C++ kernel intrinsics for the Tensix math engines
Runtime and host API	TT-Metalium	Plain C++ programming model that exposes the Network on Chip, cores and L1 SRAM directly
Neural network library	TT-NN	Operator library implementing common deep learning kernels on top of TT-Metalium
Compiler frontend	TT-Forge	MLIR based compiler that ingests PyTorch, ONNX, JAX, TensorFlow and vLLM workloads

TT-Metalium is positioned as the analogue of CUDA at the C++ level. It exposes the chip's mesh of Tensix cores and lets the developer allocate L1 scratchpad memory, dispatch kernels to specific tiles, and orchestrate data movement over the NoC. ^[7] TT-NN sits above this and provides a more familiar operator library for those who want to build models without writing kernels by hand. TT-Forge, an MLIR based compiler frontend, is intended to take models from standard frameworks and lower them onto the underlying stack, broadly comparable in role to XLA or to AMD's ROCm graph compilers. Linux on the X280 island uses standard upstream RISC-V kernels. ^[9]

Unusually for the AI accelerator market, Tenstorrent also publishes board schematics, full register level documentation and source code for its firmware. ^[5] The company has framed this transparency as a deliberate counterweight to the closed nature of competing accelerators, and it is one of the principal reasons that Blackhole has attracted attention from academic and open source projects despite shipping in much smaller volumes than products from Nvidia and AMD.

How does Blackhole compare to Wormhole?

Blackhole is a substantial step up from Wormhole on most axes, though the two generations share many architectural ideas. The Tensix tile concept, the baby RISC-V orchestrators, the use of GDDR6 and the Ethernet based scale out fabric all carry over from Wormhole. The major differences lie in compute density, on chip memory, network bandwidth and the new on die CPU island.

Attribute	Wormhole (n150 / n300)	Blackhole (p100a / p150)
Process node	TSMC 12 nm	TSMC 6 nm
Tensix cores per chip	80 (Tensix+)	140 physical, 120 enabled (Tensix++)
Big RISC-V CPU cores	None on die	16 SiFive X280
GDDR6 memory per chip	12 GB	32 GB
GDDR6 controllers per chip	6 x 2 GB	8 x 4 GB
GDDR6 bandwidth per chip	~336 GB/s	512 GB/s
Ethernet	16 x 100 Gbps	10 x 400 Gbps
PCIe	Gen4 x16	Gen5 x16
FP8 throughput per chip	~328 TFLOPS	745 TFLOPS silicon, 664 TFLOPS shipped
Card power	160 W (n150d) to 300 W (n300d)	300 W
Software stack	TT-Buda, TT-Metalium	TT-Forge, TT-NN, TT-Metalium

The combined effect is that a single p150 card carries roughly twice the FP8 throughput (664 versus about 328 TFLOPS), more than two and a half times the memory (32 GB versus 12 GB) and a faster, simpler Ethernet fabric compared to a Wormhole n150. The addition of the X280 island substantially reduces the host CPU bottleneck that affected small batch workloads on Wormhole, since orchestration can now run directly on the card.
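The generation-over-generation ratios can be computed directly from the per chip figures in the comparison table. This is a small sketch using only those approximate numbers; the variable names are illustrative.

```python
# Blackhole-to-Wormhole ratios from the per-chip figures in the table.
wormhole = {"fp8_tflops": 328, "memory_gb": 12, "dram_gb_s": 336, "eth_gbps": 16 * 100}
blackhole = {"fp8_tflops": 664, "memory_gb": 32, "dram_gb_s": 512, "eth_gbps": 10 * 400}

ratios = {key: blackhole[key] / wormhole[key] for key in wormhole}
# The same throughput ratio using the full 140-core silicon figure instead.
silicon_fp8_ratio = 745 / wormhole["fp8_tflops"]

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
print(f"fp8_tflops (silicon): {silicon_fp8_ratio:.2f}x")
```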

How does Blackhole compare to peer accelerators?

The AI accelerator market in 2025 and 2026 is dominated by Nvidia's Blackwell generation, with AMD and a number of specialist start ups including Groq and Cerebras competing on a mix of price, software and architectural niches. Blackhole occupies a distinctive position in this landscape. It is a PCIe card with a list price comparable to a consumer GPU, but it ships with substantially more memory than mainstream consumer Nvidia and AMD cards and with a fully open software stack. It does not attempt to match the absolute throughput of HBM equipped data center parts.

Accelerator	Memory	Memory bandwidth	FP8 throughput	Interconnect	Indicative price (USD)
Tenstorrent Blackhole p150	32 GB GDDR6	512 GB/s	664 TFLOPS	10 x 400G Ethernet	1,399
Nvidia B200 (Blackwell)	192 GB HBM3e	~8 TB/s	~4,500 TFLOPS dense	NVLink 5	tens of thousands, OEM
AMD Instinct MI325X	256 GB HBM3e	~6 TB/s	~2,600 TFLOPS dense	Infinity Fabric	tens of thousands, OEM
Groq LPU	230 MB SRAM	on chip only	inference focused	proprietary	not sold standalone

Direct comparison is difficult because the products target very different market segments. Nvidia's B200 and AMD's MI325X are HBM equipped data center accelerators, typically sold in eight way SXM or OAM baseboards at OEM prices that are more than an order of magnitude higher than a Blackhole p150. Groq's LPU is an inference focused architecture that is not sold as a discrete card and is instead offered as a cloud service. Blackhole's nearest competitors at the developer card price point are arguably high end Nvidia consumer GPUs such as the RTX 5090, which has been compared head to head against the p150 in several enthusiast reviews. ^[12] Compared to most consumer GPUs, the p150 offers more on board memory (the RTX 5090 matches it at 32 GB), native datacenter style Ethernet for scale out, full open documentation and a software stack that does not rely on Nvidia's proprietary CUDA, but it has much narrower model and operator coverage and significantly less mature performance tuning. ^[12]
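One way to compare such different parts is the ratio of memory bandwidth to peak compute, which indicates how many bytes per second are available for each FLOP per second. The sketch uses only the approximate figures from the table above; the dictionary layout and names are illustrative, and Groq is omitted because the table gives no comparable numbers.

```python
# (memory bandwidth in GB/s, peak FP8 throughput in TFLOPS), from the table.
accelerators = {
    "Blackhole p150": (512, 664),
    "Nvidia B200": (8000, 4500),
    "AMD MI325X": (6000, 2600),
}

def bytes_per_flop(bandwidth_gb_s, tflops):
    """Bytes of memory traffic available per floating point operation at peak."""
    return (bandwidth_gb_s * 1e9) / (tflops * 1e12)

ratios = {name: bytes_per_flop(bw, tf) for name, (bw, tf) in accelerators.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio * 1000:.2f} bytes per kFLOP")
```

On these numbers the p150 has the lowest bandwidth per unit of peak compute, consistent with the GDDR6 versus HBM trade-off described earlier in the article.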

What is the Blackhole Galaxy system?

In addition to the discrete cards, Tenstorrent has shipped or previewed two larger Blackhole systems. The TT-QuietBox is a desktop workstation containing four Blackhole processors with a liquid cooling loop, priced at 11,999 US dollars at launch. ^[2] It is intended as a developer station for engineers who want to prototype multi card workloads without building their own server. The QuietBox was reviewed publicly in late 2025 and received generally favorable coverage for its acoustics and software experience. ^[11]

The Blackhole Galaxy is a rack scale reference design that combines 32 Blackhole accelerators in a 4 by 8 mesh. Tenstorrent reports the Galaxy as delivering around 23.8 petaFLOPS of FP8 throughput and 11.9 petaFLOPS of FP16 throughput, with 1 TB of pooled memory and an aggregate 16 TB per second of memory bandwidth. ^[5] Because each Blackhole chip carries its own Ethernet endpoints, the Galaxy can also be configured as a flexible AI switch with up to 11.2 TB per second of aggregate switching bandwidth, in addition to or instead of running as a single coherent compute domain. ^[5]
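The Galaxy aggregates are consistent with multiplying the per chip silicon figures by 32. The sketch below shows that arithmetic; it assumes the system totals are simple sums of per chip peaks, which is how the quoted numbers appear to be derived.

```python
# Galaxy totals as 32 x the per-chip silicon figures quoted in this article.
CHIPS = 4 * 8
FP8_TFLOPS_PER_CHIP = 745
FP16_TFLOPS_PER_CHIP = 372
MEMORY_GB_PER_CHIP = 32
DRAM_GB_S_PER_CHIP = 512

fp8_pflops = CHIPS * FP8_TFLOPS_PER_CHIP / 1000
fp16_pflops = CHIPS * FP16_TFLOPS_PER_CHIP / 1000
memory_gb = CHIPS * MEMORY_GB_PER_CHIP
dram_tb_s = CHIPS * DRAM_GB_S_PER_CHIP / 1000

print(f"FP8: {fp8_pflops:.1f} PFLOPS, FP16: {fp16_pflops:.1f} PFLOPS")
print(f"memory: {memory_gb} GB, DRAM bandwidth: {dram_tb_s:.1f} TB/s")
```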

How has Blackhole been received?

Blackhole has been received with substantial interest from the open source and hobbyist AI communities, due primarily to the combination of 32 GB of memory at a sub 1,500 US dollar price point and a fully open software stack. Reviews from outlets including The Register and Hardware Corner have highlighted that the cards are practical for running local large language models that would not fit on typical consumer GPUs, with the trade off that performance is highly model dependent and operator coverage in TT-NN is still maturing. ^[11]^[12]

Coverage of the firmware reduction from 140 to 120 cores has been more critical, with several outlets describing the move as unusual for the PC accelerator market and questioning the precedent of disabling already paid for hardware via a mandatory update. ^[6] Tenstorrent's response has been that the long term performance impact is on the order of one to two percent and that the change is required for sustained reliability across the install base. ^[6]

From an architectural standpoint, Blackhole is widely considered the most ambitious shipping example of the all RISC-V AI accelerator concept. Academic microbenchmark studies, including a 2025 paper titled Dissecting the Tenstorrent Blackhole Architecture via Microbenchmarking, have used the chip as a vehicle for characterizing exposed Network on Chip programming models and scratchpad based AI compute. ^[10] The choice of GDDR6 over HBM, the use of standard Ethernet for scaling, and the full open hardware and software release have all been cited in broader discussions of how the AI accelerator industry might evolve beyond proprietary stacks. ^[13]

In June 2026, multiple outlets, citing a Reuters report, said Qualcomm was in talks to acquire Tenstorrent in a deal valued at roughly 8 billion to 10 billion US dollars, which would fold Blackhole and the company's open RISC-V roadmap into a much larger semiconductor vendor. Both companies declined to comment, and the talks were reported as unconfirmed with no certainty of completion. ^[16]

Whether Blackhole succeeds commercially against entrenched competitors remains open as of mid 2026. Tenstorrent does not publish unit shipment figures and is not believed to have penetrated the hyperscale training market that Nvidia dominates. The company's near term focus appears to be on enterprise inference, embedded AI for automotive and industrial customers, and licensing of its IP for AI chips integrated into other companies' systems on chip. ^[13]

References

1. Tenstorrent. "Blackhole product page." tenstorrent.com/en/hardware/blackhole
2. Tenstorrent. "Tenstorrent Launches Blackhole Developer Products at Tenstorrent Dev Day." April 3, 2025. tenstorrent.com/en/newsroom/tenstorrent-launches-blackhole-developer-products-at-tenstorrent-dev-day
3. Tenstorrent. "Blackhole Specifications and Requirements." Tenstorrent Documentation, docs.tenstorrent.com/aibs/blackhole/specifications.html
4. Tenstorrent. "Cards." tenstorrent.com/en/hardware/cards
5. Tobias Mann. "Tenstorrent details its RISC-V packed Blackhole chips." The Register, August 27, 2024.
6. Anton Shilov. "Jim Keller's Tenstorrent is downgrading Blackhole p150 cards from 140 to 120 tensor cores via firmware update." Tom's Hardware, 2025.
7. Jasmina Vasiljevic and Davor Capalija. "Blackhole and TT-Metalium: The Standalone AI Computer and its Programming Model." Hot Chips 2024 presentation, Tenstorrent.
8. SiFive. "Tenstorrent Selects SiFive Intelligence X280 for Next-Generation AI Processors." SiFive Press Release.
9. Tenstorrent. "tt-bh-linux: Tenstorrent Blackhole P100/P150 card RISC-V Linux demo." GitHub repository, github.com/tenstorrent/tt-bh-linux
10. "Dissecting the Tenstorrent Blackhole Architecture via Microbenchmarking." ASPLOS workshop paper, 2025.
11. Tobias Mann. "Blackhole QuietBox, Tenstorrent's AI workstation reviewed." The Register, November 27, 2025.
12. Hardware Corner. "Running Local LLMs? This 32GB Card Might Be Better Than Your RTX 5090." hardware-corner.net
13. Spheron Blog. "Tenstorrent vs Nvidia: Open-Source AI Hardware Compared for Inference and Training." 2026.
14. TechForward. "From Closed Silicon to Community Hardware: Inside Tenstorrent's Developer Day." techforward.io
15. Sally Ward-Foxton. "Tenstorrent's Jim Keller: Whatever Nvidia Does, We'll Do The Opposite." EE Times.
16. Reuters. "Qualcomm in talks to acquire AI chip startup Tenstorrent for up to 10 billion dollars." June 2026.
