Google TPU 8i
Last reviewed
Jun 3, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,701 words
Google TPU 8i is an eighth-generation Tensor Processing Unit from Google, built specifically for AI inference rather than model training. It was previewed at Google Cloud Next 2026 in Las Vegas on 22 April 2026, alongside a training-focused sibling, the TPU 8t. The pairing marks the first time Google has split a single TPU generation into two purpose-built chips: one tuned for the long, latency-sensitive decoding loops of serving and reasoning, the other for large-scale pretraining. Google says the TPU 8i delivers up to 80 percent better performance-per-dollar than its predecessor, Ironwood, for low-latency serving of large mixture-of-experts (MoE) models.[1][2][3]
The TPU 8i is an application-specific integrated circuit that accelerates the inference half of the AI workload: sampling, serving, and the autoregressive decoding that underpins chain-of-thought reasoning. Google positions it as the answer to a shift it calls the "agentic era," in which fleets of AI agents make many small, sequential model calls. At that scale, Google argues, even tiny per-step inefficiencies compound, so the chip is engineered around keeping its compute cores fed rather than around raw peak throughput.[1][2]
It is part of Google's AI Hypercomputer, the company's integrated stack of custom silicon, networking, storage, and software. On the software side the TPU 8i is supported by JAX, PyTorch, the XLA compiler, vLLM, and Google's Pathways runtime.[2]
Several widely cited details did not come from Google's own announcement. Multiple outlets reported a codename of "Zebrafish" for the TPU 8i (and "Sunfish" for the TPU 8t), and reported that MediaTek co-designed the inference chip while Broadcom co-designed the training chip. Google's published materials do not confirm the codenames or name the design partners, so those claims rest on trade-press reporting rather than primary documentation.[4][5][6]
For seven generations, a Google TPU was a single design asked to do both jobs. With the eighth generation, Google bifurcated the line. The reasoning is that training and inference now pull hardware in opposite directions. Training rewards enormous, tightly coupled pods that synchronize gradients across thousands of chips; inference rewards memory capacity, memory bandwidth, and low collective-operation latency so that a model can stream tokens to a user without stalling.[1][3][7]
The two chips reflect that divergence at the physical level. According to trade reporting, the training-oriented TPU 8t carries two compute dies, one I/O chiplet, and eight twelve-high stacks of HBM3e, while the inference-oriented TPU 8i uses a single compute die, one I/O die, and six stacks of HBM3e. The TPU 8t scales into superpods of 9,600 chips with about two petabytes of shared memory and 121 exaflops of FP4 compute; the TPU 8i is built around smaller, lower-latency serving domains.[5][6][8] Industry analysts described the move as Google stepping away from a single general-purpose accelerator toward workload-specialized silicon.[3]
The TPU 8i's design centers on memory and on minimizing the cost of communication between cores. Its headline figures are 288 GB of high-bandwidth memory and 384 MB of on-chip SRAM, the latter being three times the on-chip SRAM of the previous generation. The large pool of fast on-die memory is meant to stop the compute cores from sitting idle while they wait for data, a recurring bottleneck during long-context decoding.[1][2]
A defining feature is a dedicated Collectives Acceleration Engine (CAE). Collective operations, the reductions and synchronizations that combine partial results across cores, dominate the decode step in reasoning and MoE models. The CAE offloads these operations and, per Google, reduces their on-chip latency by up to a factor of five. In Google's description, each TPU 8i chip has two Tensor Cores on the compute die and one CAE on the chiplet die, and the engine aggregates results across cores with near-zero latency, specifically targeting the synchronization steps of autoregressive decoding.[1][2]
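As a rough illustration of what a collective operation computes, the sketch below simulates a sum all-reduce in plain Python using recursive doubling. It shows the kind of work the CAE is described as offloading, not how the CAE implements it, which Google's materials do not detail. The function name and the power-of-two restriction are assumptions made for this example.

```python
def all_reduce_sum(partials):
    """Recursive-doubling all-reduce over a power-of-two number of cores.

    Returns (per-core results, number of exchange rounds).
    Illustrative only; not a description of the CAE's internals.
    """
    n = len(partials)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch needs a power-of-two core count")
    values = list(partials)
    rounds = 0
    step = 1
    while step < n:
        # Each core i adds the running sum held by its partner, core i XOR step.
        values = [values[i] + values[i ^ step] for i in range(n)]
        step *= 2
        rounds += 1
    return values, rounds


results, rounds = all_reduce_sum([1.0, 2.0, 3.0, 4.0])
print(results, rounds)  # [10.0, 10.0, 10.0, 10.0] 2
```

Every core ends up holding the full sum, and the number of exchange rounds grows with the logarithm of the core count. Because each decode step has to wait for such rounds to finish, their latency adds directly to per-token time.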
The chip also moves a workload that normally spills into external memory onto the die itself. Google says the TPU 8i can host a larger key-value (KV) cache, which holds attention state during generation, entirely on silicon, reducing idle time during long-context decoding. Host duties are handled by Google's own Axion Arm-based CPUs, and the eighth generation doubles the number of physical CPU hosts per server compared with Ironwood.[1][2]
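To give a sense of scale, the sketch below applies the standard KV cache sizing formula (one key vector and one value vector per KV head, per layer, per token) to the 384 MB SRAM figure. The model shape is hypothetical, the SRAM size is read as decimal megabytes, and the calculation assumes the entire pool is available for the cache. None of these points is stated in the sources.

```python
SRAM_BYTES = 384 * 10**6  # 384 MB from the article, read as decimal megabytes (assumption)


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token adds: a key and a value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def tokens_that_fit(capacity_bytes, per_token_bytes):
    """Whole tokens of KV cache that fit in the given capacity."""
    return capacity_bytes // per_token_bytes


# Hypothetical model shape, chosen only for illustration (8-bit cache entries).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=1)
print(per_token)                               # 65536 bytes per token
print(tokens_that_fit(SRAM_BYTES, per_token))  # 5859 tokens
```

Under these assumptions, a few thousand tokens of context for one sequence would fit on die. Larger models, wider cache entries, or batched requests shrink that number quickly.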
| Specification | Google TPU 8i (reported figures) |
|---|---|
| Generation | 8th-generation TPU, inference-optimized [1] |
| HBM capacity | 288 GB [1][2] |
| HBM bandwidth | About 8.6 TB/s [8][2] |
| On-chip SRAM | 384 MB, roughly 3x the prior generation [1][2] |
| Inter-chip interconnect (ICI) | 19.2 Tb/s bidirectional per chip, double the previous generation [1][9] |
| Interconnect topology | Boardfly, cutting maximum network diameter by more than 50 percent [1][2] |
| Collectives engine | Collectives Acceleration Engine (CAE), up to 5x lower on-chip collective latency [1][2] |
| Peak compute (per chip) | About 10.1 PFLOPs FP4 [8] |
| Pod scale | Up to about 1,152 chips per pod [9] |
| Pod compute | About 11.6 exaflops FP8 [9] |
| Pod HBM | About 331.8 TB [9] |
| Host CPU | Google Axion (Arm) [1][2] |
| Reported design partner | MediaTek (per trade reporting) [4][5][6] |
| Reported codename | "Zebrafish" (per trade reporting) [4][6] |
To tie chips together, the TPU 8i uses a fabric Google calls Boardfly rather than the 3D torus used by the training chip. Boardfly is designed for the all-to-all traffic patterns of MoE inference, where tokens are routed to different expert sub-networks and partial results must be gathered quickly. Google says the topology cuts the maximum network diameter by more than half: any chip can reach any other in at most seven hops, against sixteen in the comparable torus arrangement. Inter-chip interconnect bandwidth is 19.2 Tb/s per chip, double that of Ironwood, which matters because MoE routing magnifies the cost of slow links.[1][2][9]
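The torus side of that comparison can be reproduced with a short breadth-first search. The sketch below measures the diameter of a wraparound 3D torus. The 8 x 12 x 12 shape is an assumption: it multiplies to 1,152 chips and happens to give sixteen hops, but the sources do not state the dimensions of the "comparable torus arrangement". The Boardfly figure of seven hops is taken from the text, not derived, since the Boardfly wiring is not publicly described.

```python
from collections import deque


def torus_diameter(dims):
    """Longest shortest path (in hops) in a wraparound torus, found by BFS.

    A torus looks the same from every node, so one BFS from the origin suffices.
    """
    start = tuple(0 for _ in dims)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for axis, size in enumerate(dims):
            for delta in (-1, 1):
                nxt = list(node)
                nxt[axis] = (nxt[axis] + delta) % size  # wraparound link
                nxt = tuple(nxt)
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
    return max(dist.values())


ASSUMED_TORUS_SHAPE = (8, 12, 12)  # hypothetical layout; 8 * 12 * 12 = 1,152 chips
BOARDFLY_MAX_HOPS = 7              # figure quoted in the article, not computed here

torus_hops = torus_diameter(ASSUMED_TORUS_SHAPE)
reduction = 1 - BOARDFLY_MAX_HOPS / torus_hops
print(torus_hops)          # 16
print(f"{reduction:.1%}")  # 56.2%
```

Going from sixteen hops to seven is a reduction of about 56 percent, consistent with the "more than 50 percent" claim.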
At the pod level, reporting from the announcement put the TPU 8i at up to about 1,152 chips per pod, delivering roughly 11.6 exaflops of FP8 compute and a total of about 331.8 TB of HBM, with 19.2 Tb/s of scale-up bandwidth per chip. These are smaller domains than the 9,600-chip training superpods, reflecting the inference focus on latency rather than on the largest possible synchronous job.[9]
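The pod-level numbers can be cross-checked against the per-chip figures in the specification table with simple multiplication, as in the sketch below. It assumes decimal units (1 TB = 1,000 GB, 1 exaflop = 1,000 petaflops). Note that the per-chip peak is reported as FP4 while the pod figure is reported as FP8. The product matches the pod figure numerically, but the sources do not reconcile the precision labels, so the check makes no claim about which label is correct.

```python
CHIPS_PER_POD = 1_152        # "up to about 1,152 chips per pod"
HBM_PER_CHIP_GB = 288        # per-chip HBM capacity
PEAK_PER_CHIP_PFLOPS = 10.1  # per-chip peak, labelled FP4 in the table
ICI_PER_CHIP_TBIT_S = 19.2   # inter-chip bandwidth in terabits per second

pod_hbm_tb = CHIPS_PER_POD * HBM_PER_CHIP_GB / 1_000              # GB -> TB
pod_compute_eflops = CHIPS_PER_POD * PEAK_PER_CHIP_PFLOPS / 1_000  # PFLOPs -> EFLOPs
ici_tbyte_s = ICI_PER_CHIP_TBIT_S / 8                              # bits -> bytes

print(f"{pod_hbm_tb:.1f} TB")              # 331.8 TB
print(f"{pod_compute_eflops:.1f} EFLOPs")  # 11.6 EFLOPs
print(f"{ici_tbyte_s:.1f} TB/s")           # 2.4 TB/s
```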
Trade coverage of the launch reported that Google brought in MediaTek to co-design the TPU 8i, while its longtime partner Broadcom handled the TPU 8t. By these accounts MediaTek had joined the eighth-generation program as a second silicon design partner in late 2025, splitting the work so that one partner concentrated on the inference die and the other on the training die. Bringing in MediaTek would diversify Google's design supply chain beyond Broadcom, which has handled prior TPU generations. Because Google did not name either partner in its own posts, these reports should be read as industry sourcing rather than as confirmation from Google.[4][5][6]
The chips are manufactured by TSMC, and reporting on HBM is consistent: both eighth-generation TPUs use HBM3e. The process node is less settled. Some outlets reported the chips are built on TSMC's 2-nanometre process, while others placed them on the N3 (3-nanometre) family. The disagreement is unresolved in public sourcing as of mid-2026, so the node should be treated as not definitively established.[6][8][10]
The TPU 8i succeeds Ironwood (TPU v7), Google's previous flagship, which reached general availability around the same Cloud Next event. Against Ironwood, Google's central claim for the TPU 8i is up to 80 percent better performance-per-dollar, concentrated at low-latency targets for large MoE models. Both eighth-generation chips also deliver up to twice the performance-per-watt of the previous generation, which Google frames as essential to scaling inference without a matching rise in energy use.[1][2][3]
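Ratio claims like these are easy to misread, so the sketch below converts them into the equivalent drop in cost or energy per unit of work. It reads "80 percent better performance-per-dollar" as a ratio of 1.8 and "twice the performance-per-watt" as a ratio of 2.0. Both are "up to" figures, so the results are best-case bounds, and the function name is an assumption made for this example.

```python
def relative_cost_per_unit(efficiency_ratio):
    """Cost (or energy) per unit of work relative to the baseline, given a
    performance-per-dollar (or per-watt) ratio versus that baseline."""
    if efficiency_ratio <= 0:
        raise ValueError("efficiency ratio must be positive")
    return 1.0 / efficiency_ratio


cost = relative_cost_per_unit(1.8)    # "up to 80 percent better" perf-per-dollar
energy = relative_cost_per_unit(2.0)  # "up to twice" the perf-per-watt

print(f"{1 - cost:.1%}")    # 44.4%
print(f"{1 - energy:.1%}")  # 50.0%
```

In other words, an 80 percent gain in performance-per-dollar corresponds to roughly 44 percent lower cost for the same work, not 80 percent lower.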
The gains come mostly from architecture rather than from a single headline throughput number. Tripling on-chip SRAM, doubling interconnect bandwidth, adding the CAE, and hosting the KV cache on silicon together attack the memory-bound, communication-bound nature of decoding. Independent commentary on the launch noted that Google was emphasizing system-level efficiency, making each token cheaper to generate, over the raw peak-flops comparisons that dominate training hardware.[7][3]
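A back-of-the-envelope bound shows why decoding is described as memory-bound. If a decode step has to stream a given number of bytes from HBM, the step cannot finish faster than those bytes divided by the memory bandwidth. The sketch below uses the reported 288 GB and roughly 8.6 TB/s figures. It ignores compute time, SRAM hits, batching, and interconnect traffic, so it is a simplified lower bound and not a performance prediction.

```python
HBM_CAPACITY_GB = 288     # from the specification table
HBM_BANDWIDTH_TB_S = 8.6  # "about 8.6 TB/s" from the specification table


def min_step_ms(bytes_read_gb, bandwidth_tb_s=HBM_BANDWIDTH_TB_S):
    """Lower bound, in milliseconds, on a decode step that streams
    bytes_read_gb from HBM (decimal units: 1 TB = 1,000 GB)."""
    seconds = bytes_read_gb / (bandwidth_tb_s * 1_000)
    return seconds * 1_000


# Worst case for one chip: every byte of HBM is read once per step.
full_sweep_ms = min_step_ms(HBM_CAPACITY_GB)
print(f"{full_sweep_ms:.1f} ms")  # 33.5 ms
```

Data that stays in on-chip SRAM never pays this cost, which is one way to read the emphasis on a larger SRAM pool and an on-die KV cache.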
Google said both eighth-generation TPUs, the 8t and the 8i, will become generally available on Google Cloud later in 2026, usable as part of the AI Hypercomputer. That timeline was stated in Google's own materials and echoed by most coverage of the launch. At least one outlet reported a later target of 2027, but Google's stated guidance and the bulk of reporting point to general availability within 2026.[1][2][11][9]
The launch was widely read as part of a broader effort by hyperscalers to reduce reliance on Nvidia GPUs for AI workloads by scaling their own accelerators, with inference, the larger and faster-growing share of production AI spend, as the main battleground. By shipping a chip aimed squarely at cheap, low-latency serving, Google signaled that it intends to compete on the economics of running models, not only on the cost of training them.[3][7]