Tenstorrent Galaxy Blackhole
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,615 words
Tenstorrent Galaxy Blackhole is an AI inference server built by Tenstorrent, the fabless semiconductor company led by chief executive Jim Keller. The system packs 32 Blackhole accelerators into a 6U air-cooled chassis. It reached general availability on April 28, 2026, with a public launch event on May 1. Tenstorrent positions the Galaxy as a lower-cost alternative to Nvidia for running large language models and generative video, leaning on three deliberate departures from the GPU mainstream: GDDR6 memory instead of HBM, standard Ethernet instead of a proprietary interconnect, and a fully open-source, RISC-V-based software stack. A single server lists at $110,000, and four of them form a base "supercluster" priced at $440,000.[1][2][3]
The machine is the productized version of a reference design that Tenstorrent had been showing for more than a year. Earlier descriptions of a 32-chip Blackhole "Galaxy" mesh delivering roughly 23.8 petaFLOPS appeared alongside the chip's Hot Chips 2024 debut, but the product announced in April 2026 marks the first time the company has offered the system for sale with published specifications, named launch partners, and benchmark claims. The Register summarized the arrival bluntly, noting that Tenstorrent's "Galaxy Blackhole AI servers are finally out."[1][4]
Each Galaxy is a self-contained inference appliance rather than a host plus a tray of coprocessors. The 32 Blackhole ASICs sit in a single 6U air-cooled box and talk to one another over an on-board Ethernet mesh, so a buyer does not need a separate head node or a proprietary switch fabric to make the chips cooperate. That design choice matters because Blackhole itself was built to act as a standalone computer: every chip carries sixteen large SiFive Intelligence X280 RISC-V cores that can boot Linux directly, removing the host CPU from the critical path for orchestration.[1][5]
Tenstorrent's framing for the product is that inference does not need to be carved into the elaborate, multi-tier hardware stacks that have become common. Keller put it this way in the launch announcement: "Every company in the industry is pairing up to build the accelerator accelerator accelerator. CPUs run code. GPUs accelerate CPUs. TPUs accelerate GPUs. LPUs accelerate TPUs. And so on. This leads to complex solutions which are unlikely to be compatible with changes in AI models and uses. At Tenstorrent, we thought something more general and simpler would work."[2]
A related selling point is that the Galaxy runs both phases of LLM inference, the compute-heavy prefill and the memory-bound decode, on the same hardware. Much of the industry has moved toward disaggregated inference, where separate pools of machines handle prefill and decode. Tenstorrent argues that its on-chip data flow lets one system do both well, which simplifies deployment for teams that would rather not operate two specialized fleets.[2][6]
The Galaxy inherits everything that defines the Blackhole generation. A single Blackhole die is fundamentally a RISC-V multiprocessor with matrix engines attached. It contains 140 Tensix++ compute cores plus a large CPU island; counting every embedded controller core, Tenstorrent reports 752 small "baby" RISC-V cores alongside the 16 big SiFive cores. Each chip is manufactured on a 6-nanometer TSMC process, is paired with 32 GB of GDDR6, and exposes ten 400 Gbps Ethernet links for roughly 1 TB/s of off-chip bandwidth. At the chip level, Blackhole is rated at 745 teraFLOPS of FP8 throughput.[5]
Stacking 32 of those chips is what produces the Galaxy's headline numbers. The per-chip Ethernet becomes a dense in-chassis fabric, and the per-chip GDDR6 pools into a terabyte of memory addressable across the system. Because the scale-out fabric is ordinary Ethernet rather than NVLink or Infinity Fabric, the same cabling that links chips inside one box also links boxes together, which is how Tenstorrent extends a single server into a multi-rack cluster without specialized switch silicon.[1][3][5]
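The scaling from one chip to one chassis is plain multiplication, and a minimal Python sketch makes the arithmetic explicit. The per-chip inputs are the figures quoted in this article. The function name `galaxy_totals` is hypothetical, and treating the roughly 1 TB/s per-chip Ethernet figure as a both-directions total is an assumption made here, not something the sources state.

```python
# Per-chip figures quoted in this article for a single Blackhole ASIC.
CHIPS_PER_GALAXY = 32
FP8_TFLOPS_PER_CHIP = 745
GDDR6_GB_PER_CHIP = 32
ETH_LINKS_PER_CHIP = 10
ETH_GBPS_PER_LINK = 400


def galaxy_totals(chips: int = CHIPS_PER_GALAXY) -> dict:
    """Scale per-chip figures to a full chassis by simple multiplication."""
    # Ten 400 Gbps links give 4,000 Gbps per chip in one direction.
    link_gbps = ETH_LINKS_PER_CHIP * ETH_GBPS_PER_LINK
    # Assumption: the ~1 TB/s per-chip figure counts both directions,
    # so convert bits to bytes (divide by 8) and double.
    chip_tb_per_s = link_gbps / 8 / 1000 * 2
    return {
        "fp8_pflops": chips * FP8_TFLOPS_PER_CHIP / 1000,
        "gddr6_gb": chips * GDDR6_GB_PER_CHIP,
        "fabric_tb_per_s": chips * chip_tb_per_s,
    }


totals = galaxy_totals()
print(totals)
```

Under those assumptions the totals come to 23.84 PFLOPS of FP8, 1,024 GB of GDDR6, and 32 TB/s of aggregate Ethernet bandwidth, which lines up with the roughly 23.8 petaFLOPS and one terabyte described above.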
The figures below come from Tenstorrent's published Galaxy product page and the company's general availability announcement.[2][3]
| Specification | Tenstorrent Galaxy Blackhole |
|---|---|
| Form factor | 6U rackmount, air-cooled |
| Accelerators | 32 Blackhole ASICs |
| Compute | 23 PFLOPS Block FP8 |
| On-chip SRAM | 6.2 GB at 2.9 PB/s |
| DRAM | 1 TB GDDR6 at 16 TB/s |
| Chip-to-chip fabric | 10 x 400 GbE per ASIC, 32 TB/s aggregate |
| Scale-out networking | Up to 56 x 800 GbE QSFP-DD ports, 11.2 TB/s |
| Power | 8 to 10 kW average, 12 kW maximum |
| List price | $110,000 |
| Base supercluster | 4 Galaxy systems, from $440,000 |
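Some rows in the table can be cross-checked or broken down per accelerator with a few lines of Python. This is a rough sketch, not a vendor formula: the helper `ports_to_tb_per_s` is hypothetical, and the bidirectional accounting is an assumption chosen because it reproduces the published scale-out number. The per-accelerator price and power values are chassis-level ratios, not component prices or chip power ratings.

```python
ACCELERATORS = 32
LIST_PRICE_USD = 110_000
MAX_POWER_W = 12_000


def ports_to_tb_per_s(ports: int, gbps_per_port: int, bidirectional: bool = True) -> float:
    """Convert a port count and per-port line rate (Gbps) to TB/s."""
    one_way_tb_per_s = ports * gbps_per_port / 8 / 1000  # bits -> bytes, GB -> TB
    # Assumption: published totals count both directions of each link.
    return one_way_tb_per_s * 2 if bidirectional else one_way_tb_per_s


scale_out_tb_per_s = ports_to_tb_per_s(56, 800)
usd_per_accelerator = LIST_PRICE_USD / ACCELERATORS
max_watts_per_accelerator = MAX_POWER_W / ACCELERATORS

print(f"scale-out: {scale_out_tb_per_s:.1f} TB/s")
print(f"list price per accelerator: ${usd_per_accelerator:,.2f}")
print(f"max chassis power per accelerator: {max_watts_per_accelerator:.0f} W")
```

With both directions counted, 56 ports at 800 Gbps work out to 11.2 TB/s, matching the table. One direction alone would be 5.6 TB/s. The list price divides to $3,437.50 per accelerator and the 12 kW maximum to 375 W per accelerator, with fans, networking, and power conversion included in both ratios.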
The use of GDDR6 rather than the high-bandwidth memory found in Nvidia Blackwell parts is intentional. Analysts at Moor Insights & Strategy noted that Tenstorrent deliberately chose GDDR6 over HBM, standard Ethernet over proprietary fabrics, and air cooling over liquid cooling, all in service of lowering the cost and complexity of inference at scale rather than chasing peak training throughput.[6]
Tenstorrent ships the whole software stack as open source, from the compiler down to the kernel level. The toolchain centers on TT-Forge, the company's MLIR-based compiler, layered over the lower-level TT-Metalium runtime and the TT-NN neural network library. Tenstorrent claims that "ninety percent of models from Hugging Face just run on Tenstorrent," and the programming model exposes a Python interface for writing optimized kernels rather than hiding the hardware behind an opaque tensor compiler. Frontier models the company lists as in progress include Moonshot AI's Kimi K2.[1][2]
The openness extends to the hardware design itself. Blackhole boards ship with full schematics and kernel-level access, and the choice to build the entire compute fabric on the open RISC-V instruction set is the foundation of Tenstorrent's pitch that customers can avoid vendor lock-in. For buyers worried about being tied to a single proprietary software ecosystem, the open stack is arguably as important as the price tag.[2][5]
The performance numbers Tenstorrent cites for the Galaxy are vendor figures and should be read as such. On DeepSeek's DeepSeek-R1 0528 671B model, the company reports decode throughput of up to 350 tokens per second per user across batch sizes of 8 to 64 while supporting a 128k-token context. Independent reporting put the figure measured at launch closer to 308 tokens per second per user, with a software roadmap toward 500. On the prefill side, Tenstorrent says a four-node supercluster reaches a sub-four-second time to first token on a 100,000-token prompt, which it describes as roughly 166 pages of text processed in under four seconds. For generative video, the company claims it can produce 720p output faster than real time, citing an 81-frame 720p clip in about 2.4 seconds and describing the result as roughly ten times faster than leading GPU systems.[1][2][7]
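To put the quoted rates on a common footing, the sketch below converts them into aggregate and per-second figures. All inputs are the vendor or press numbers cited above. The function names are hypothetical, and the aggregate decode figure assumes every user in a batch sustains the quoted per-user rate, which the sources give only as an "up to" value.

```python
def aggregate_decode_tps(per_user_tps: float, batch_size: int) -> float:
    """Total decode tokens/s, assuming each user in the batch gets the per-user rate."""
    return per_user_tps * batch_size


def min_prefill_tps(prompt_tokens: int, ttft_seconds: float) -> float:
    """Lower bound on prefill tokens/s implied by a time-to-first-token ceiling."""
    return prompt_tokens / ttft_seconds


def generated_fps(frames: int, seconds: float) -> float:
    """Frames produced per second of wall-clock time."""
    return frames / seconds


for label, rate in (("claimed", 350), ("measured at launch", 308)):
    low = aggregate_decode_tps(rate, 8)
    high = aggregate_decode_tps(rate, 64)
    print(f"{label}: {low:,.0f} to {high:,.0f} tokens/s aggregate")

print(f"prefill: at least {min_prefill_tps(100_000, 4.0):,.0f} tokens/s")
print(f"video: {generated_fps(81, 2.4):.2f} frames/s")
```

At the claimed 350 tokens per second per user, batches of 8 to 64 imply 2,800 to 22,400 tokens per second in aggregate. A 100,000-token prompt processed in under four seconds implies a prefill rate above 25,000 tokens per second across the four-node supercluster, and 81 frames in 2.4 seconds is 33.75 frames per second.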
The economic argument is where Tenstorrent is most aggressive. The company says the Galaxy delivers output at about $6 per million tokens against roughly $30 for a comparable Nvidia GB300 setup, which is the basis for its claim of a fivefold total cost of ownership advantage. WCCFtech reported the company vowing to "crush everyone" on inference economics. These are Tenstorrent's own comparisons, and they apply to inference specifically rather than to the full training-plus-inference lifecycle.[7]
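The fivefold claim follows directly from the two per-token prices, as the short sketch below shows. The prices are Tenstorrent's own figures as reported. The `serving_cost_usd` helper and the one-billion-token workload are hypothetical illustrations, not numbers from the sources.

```python
GALAXY_USD_PER_M_TOKENS = 6    # vendor claim for Galaxy
GB300_USD_PER_M_TOKENS = 30    # vendor's comparison figure for an Nvidia GB300 setup


def serving_cost_usd(output_tokens: int, usd_per_million: float) -> float:
    """Cost of generating a given number of output tokens at a flat per-million rate."""
    return output_tokens / 1_000_000 * usd_per_million


advantage = GB300_USD_PER_M_TOKENS / GALAXY_USD_PER_M_TOKENS

# Hypothetical workload: one billion output tokens.
tokens = 1_000_000_000
galaxy_cost = serving_cost_usd(tokens, GALAXY_USD_PER_M_TOKENS)
gb300_cost = serving_cost_usd(tokens, GB300_USD_PER_M_TOKENS)

print(f"claimed advantage: {advantage:.0f}x")
print(f"1B tokens: ${galaxy_cost:,.0f} vs ${gb300_cost:,.0f}")
```

For that illustrative workload the claimed rates give $6,000 against $30,000. The ratio holds only as long as both per-token figures do, and both come from the vendor.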
Tenstorrent and most reviewers are careful to frame the Galaxy as an inference machine first. The Register noted that Nvidia's eight-way DGX boxes are "faster and higher capacity" than a single Galaxy but cost roughly three to five times as much, which is the gap Tenstorrent is trying to exploit. For organizations that need one platform for both large-scale training and inference, Nvidia remains the default. For teams whose problem is serving models cheaply and predictably, the open stack and the lower sticker price are the draw.[1][6]
That positioning showed up in the launch partners. Tenstorrent named cloud and infrastructure providers including Cirrascale and Equinix, Japan's ai&, and customers such as OrionVM, BetterBrain, Virtu Financial, Turiyam, and Prodia Labs as early adopters. Dave Driggers, chief executive of Cirrascale, said the company evaluates a lot of hardware and that "most of it is incremental," adding that "Tenstorrent Galaxy Blackhole is not" and that Tenstorrent "has taken a clean-sheet approach to AI infrastructure." Equinix's Justen Aguillon said the system lets enterprises "stay focused on building differentiated products, not managing infrastructure complexity."[2]
The Galaxy reached general availability on April 28, 2026 and lists at $110,000 for a single 6U server. Customers can deploy configurations ranging from 4 to 36 or more systems, and the company targets workloads including large-scale LLM inference, AI video generation, and private AI infrastructure. The base supercluster bundles four Galaxy systems for $440,000. Because the chips federate over standard Ethernet, the architecture scales beyond that to multi-rack deployments, and Tenstorrent highlights a "supercluster 36" configuration that links 36 boxes into a single system. Reporting around the launch noted the design can support on the order of 32 or more nodes and well over a thousand chips in total.[1][2][3]
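A small sizing helper shows how the per-server figures add up across the configurations mentioned above. This is a back-of-the-envelope sketch: the `ClusterEstimate` type and `size_cluster` function are hypothetical, the hardware cost is simply the list price times the number of servers, and external switching, cabling, facilities, and any volume pricing are ignored.

```python
from dataclasses import dataclass

CHIPS_PER_SYSTEM = 32
LIST_PRICE_USD = 110_000
GDDR6_TB_PER_SYSTEM = 1
MAX_POWER_KW_PER_SYSTEM = 12


@dataclass
class ClusterEstimate:
    systems: int
    chips: int
    list_price_usd: int
    gddr6_tb: int
    max_power_kw: int


def size_cluster(systems: int) -> ClusterEstimate:
    """Linear estimate: every figure is the per-server value times the server count."""
    return ClusterEstimate(
        systems=systems,
        chips=systems * CHIPS_PER_SYSTEM,
        list_price_usd=systems * LIST_PRICE_USD,  # assumption: no discounts or extras
        gddr6_tb=systems * GDDR6_TB_PER_SYSTEM,
        max_power_kw=systems * MAX_POWER_KW_PER_SYSTEM,
    )


for n in (1, 4, 36):
    print(size_cluster(n))
```

Four systems come to 128 chips and $440,000, matching the base supercluster price. Thirty-six systems come to 1,152 chips, consistent with the "well over a thousand chips" description, at a straight-line list price of $3,960,000 and a worst-case draw of 432 kW.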