Google TPU 8i


Updated Jun 3, 2026


Google TPU 8i is an eighth-generation Tensor Processing Unit from Google, built specifically for AI inference rather than model training. It was previewed at Google Cloud Next 2026 in Las Vegas on 22 April 2026, alongside a training-focused sibling, the TPU 8t. The pairing marks the first time Google has split a single TPU generation into two purpose-built chips: one tuned for the long, latency-sensitive decoding loops of serving and reasoning, the other for large-scale pretraining. Google says the TPU 8i delivers up to 80 percent better performance-per-dollar than its predecessor, Ironwood, for low-latency serving of large mixture-of-experts (MoE) models.^[1]^[2]^[3]

What it is

The TPU 8i is an application-specific integrated circuit that accelerates the inference half of the AI workload: sampling, serving, and the autoregressive decoding that underpins chain-of-thought reasoning. Google positions it as the answer to a shift it calls the "agentic era," in which fleets of AI agents make many small, sequential model calls. At that scale, Google argues, even tiny per-step inefficiencies compound, so the chip is engineered around keeping its compute cores fed rather than around raw peak throughput.^[1]^[2]

It is part of Google's AI Hypercomputer, the company's integrated stack of custom silicon, networking, storage, and software. On the software side the TPU 8i is supported by JAX, PyTorch, the XLA compiler, vLLM, and Google's Pathways runtime.^[2]

Several widely cited details did not come from Google's own announcement. Multiple outlets reported a codename of "Zebrafish" for the TPU 8i (and "Sunfish" for the TPU 8t), and reported that MediaTek co-designed the inference chip while Broadcom co-designed the training chip. Google's published materials do not confirm the codenames or name the design partners, so those claims rest on trade-press reporting rather than primary documentation.^[4]^[5]^[6]

Splitting training from inference

For seven generations, a Google TPU was a single design asked to do both jobs. With the eighth generation, Google bifurcated the line. The reasoning is that training and inference now pull hardware in opposite directions. Training rewards enormous, tightly coupled pods that synchronize gradients across thousands of chips; inference rewards memory capacity, memory bandwidth, and low collective-operation latency so that a model can stream tokens to a user without stalling.^[1]^[3]^[7]

The two chips reflect that divergence at the physical level. According to trade reporting, the training-oriented TPU 8t carries two compute dies, one I/O chiplet, and eight twelve-high stacks of HBM3e, while the inference-oriented TPU 8i uses a single compute die, one I/O die, and six stacks of HBM3e. The TPU 8t scales into superpods of 9,600 chips with about two petabytes of shared memory and 121 exaflops of FP4 compute; the TPU 8i is built around smaller, lower-latency serving domains.^[5]^[6]^[8] Industry analysts described the move as Google stepping away from a single general-purpose accelerator toward workload-specialized silicon.^[3]

Architecture

The TPU 8i's design centers on memory and on minimizing the cost of communication between cores. Its headline figures are 288 GB of high-bandwidth memory and 384 MB of on-chip SRAM, the latter being three times the on-chip SRAM of the previous generation. The large pool of fast on-die memory is meant to stop the compute cores from sitting idle while they wait for data, a recurring bottleneck during long-context decoding.^[1]^[2]
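The memory-bound argument can be made concrete with a rough bound. The sketch below assumes a simplified model in which every decode step must stream the active weights from HBM once, so bandwidth divided by bytes read caps the step rate. Only the 8.6 TB/s bandwidth is a reported figure; the 40 GB active-weight size and the function name are hypothetical placeholders, and real serving stacks batch requests and overlap transfers in ways this ignores.

```python
# Rough upper bound on decode rate when weight reads from HBM dominate.
# HBM_BANDWIDTH_TBPS is the reported figure; everything else is illustrative.

HBM_BANDWIDTH_TBPS = 8.6  # reported: about 8.6 TB/s


def max_decode_steps_per_second(active_weight_gb: float,
                                hbm_bandwidth_tbps: float = HBM_BANDWIDTH_TBPS) -> float:
    """Bandwidth-limited decode steps per second for a single chip."""
    bytes_per_step = active_weight_gb * 1e9       # bytes streamed per step
    bytes_per_second = hbm_bandwidth_tbps * 1e12  # bytes HBM can deliver per second
    return bytes_per_second / bytes_per_step


# Hypothetical model: 40 GB of weights touched per decode step.
steps = max_decode_steps_per_second(40.0)
print(f"{steps:.0f} decode steps/s (upper bound)")
```

Under this simplification, any data served from on-chip SRAM instead of HBM reduces the bytes streamed per step, which is the effect the larger SRAM pool is meant to exploit.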

A defining feature is a dedicated Collectives Acceleration Engine (CAE). Collective operations, the reductions and synchronizations that combine partial results across cores, dominate the decode step in reasoning and MoE models. The CAE offloads these operations and, per Google, cuts their on-chip latency by up to five times. In Google's description, each TPU 8i chip has two Tensor Cores on the compute die and one CAE on the chiplet die, and the engine aggregates results across cores with near-zero latency, specifically targeting the synchronization steps of autoregressive decoding.^[1]^[2]
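For readers unfamiliar with collectives, the following is a minimal sketch of what an all-reduce computes: each core starts with a partial result and every core finishes with the combined total. It illustrates the semantics only. The function name and data are hypothetical, and nothing here reflects how the CAE is implemented in hardware, which Google's materials do not detail.

```python
from typing import List


def all_reduce_sum(partials: List[List[float]]) -> List[List[float]]:
    """Simulate an all-reduce: every core receives the element-wise sum."""
    # Combine the partial results position by position.
    total = [sum(values) for values in zip(*partials)]
    # Each core gets its own copy of the combined result.
    return [list(total) for _ in partials]


# Two hypothetical cores, each holding partial sums for the same three outputs.
core_partials = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
reduced = all_reduce_sum(core_partials)
print(reduced[0])  # [11.0, 22.0, 33.0]
print(reduced[1])  # [11.0, 22.0, 33.0]
```

In autoregressive decoding a step like this sits on the critical path of every generated token, which is why its latency matters more than its bandwidth.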

The chip also moves onto the die a structure that normally spills into external memory. Google says the TPU 8i can host a larger key-value (KV) cache, which holds attention state during generation, entirely on silicon, reducing idle time during long-context decoding. Host duties are handled by Google's own Axion Arm-based CPUs, and the eighth generation doubles the number of physical CPU hosts per server compared with Ironwood.^[1]^[2]
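To give a sense of scale, the standard KV-cache size estimate can be compared with the 384 MB SRAM figure. This is a sketch under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128, one byte per value) is hypothetical, "on silicon" is assumed to mean the on-chip SRAM, and 384 MB is treated as 384 × 10^6 bytes. None of these choices come from Google's materials.

```python
SRAM_BYTES = 384 * 10**6  # reported 384 MB, assumed to be decimal megabytes


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 1, batch: int = 1) -> int:
    """Bytes needed to hold keys and values for every layer and token."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


# Hypothetical model shape; only the sequence length varies.
for seq_len in (4096, 8192):
    size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=seq_len)
    print(seq_len, size, "fits" if size <= SRAM_BYTES else "exceeds SRAM")
```

With these placeholder numbers a 4,096-token context needs about 268 MB and fits, while 8,192 tokens needs about 537 MB and does not, so the practical benefit depends heavily on model shape, precision, and context length.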

Specification	Google TPU 8i (reported figures)
Generation	8th-generation TPU, inference-optimized ^[1]
HBM capacity	288 GB ^[1]^[2]
HBM bandwidth	About 8.6 TB/s ^[8]^[2]
On-chip SRAM	384 MB, roughly 3x the prior generation ^[1]^[2]
Inter-chip interconnect (ICI)	19.2 Tb/s bidirectional per chip, double the previous generation ^[1]^[9]
Interconnect topology	Boardfly, cutting maximum network diameter by more than 50 percent ^[1]^[2]
Collectives engine	Collectives Acceleration Engine (CAE), up to 5x lower on-chip collective latency ^[1]^[2]
Peak compute (per chip)	About 10.1 PFLOPs FP4 ^[8]
Pod scale	Up to about 1,152 chips per pod ^[9]
Pod compute	About 11.6 exaflops FP8 ^[9]
Pod HBM	About 331.8 TB ^[9]
Host CPU	Google Axion (Arm) ^[1]^[2]
Reported design partner	MediaTek (per trade reporting) ^[4]^[5]^[6]
Reported codename	"Zebrafish" (per trade reporting) ^[4]^[6]

Interconnect and networking

To tie chips together, the TPU 8i uses a fabric Google calls Boardfly rather than the 3D torus used by the training chip. Boardfly is designed for the all-to-all traffic patterns of MoE inference, where tokens are routed to different expert sub-networks and partial results must be gathered quickly. Google says the topology cuts the maximum network diameter by more than half: any chip can reach any other in at most seven hops, against sixteen in the comparable torus arrangement. Inter-chip interconnect bandwidth is 19.2 Tb/s per chip, double that of Ironwood, which matters because MoE routing magnifies the cost of slow links.^[1]^[2]^[9]
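The hop-count comparison can be checked with a short calculation. In a torus with wraparound links, the longest shortest path is the sum of half of each dimension's size. The 8 × 8 × 16 shape below is a hypothetical choice that happens to give sixteen hops; the source does not state which torus dimensions Google used for its comparison, and the seven-hop Boardfly figure is simply taken from the text.

```python
from typing import Sequence


def torus_diameter(dims: Sequence[int]) -> int:
    """Maximum shortest-path hop count in a wraparound torus."""
    # With wraparound, the farthest node along an axis of size k is k // 2 hops away.
    return sum(k // 2 for k in dims)


HYPOTHETICAL_TORUS_DIMS = (8, 8, 16)  # illustrative only, not a confirmed layout
BOARDFLY_MAX_HOPS = 7                 # figure quoted in the text

torus_hops = torus_diameter(HYPOTHETICAL_TORUS_DIMS)
reduction = 1 - BOARDFLY_MAX_HOPS / torus_hops
print(torus_hops)            # 16
print(f"{reduction:.2%}")    # 56.25%
```

Seven hops against sixteen is a reduction of about 56 percent, consistent with the "more than 50 percent" claim.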

At the pod level, reporting from the announcement put the TPU 8i at up to about 1,152 chips per pod, delivering roughly 11.6 exaflops of FP8 compute and a total of about 331.8 TB of HBM, with 19.2 Tb/s of scale-up bandwidth per chip. These are smaller domains than the 9,600-chip training superpods, reflecting the inference focus on latency rather than on the largest possible synchronous job.^[9]
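The pod-level totals follow from the per-chip figures by simple multiplication, as the sketch below shows. All three inputs are the reported numbers; decimal units (1 TB = 1,000 GB, 1 exaflop = 1,000 petaflops) are an assumption. Note that the specification table labels the per-chip peak as FP4 and the pod figure as FP8. The sketch only confirms that the numbers multiply out and does not resolve that difference in labeling.

```python
CHIPS_PER_POD = 1152          # reported pod size
HBM_PER_CHIP_GB = 288         # reported HBM capacity per chip
PEAK_PER_CHIP_PFLOPS = 10.1   # reported peak compute per chip

# Assumes decimal units: 1000 GB per TB, 1000 PFLOPs per exaflop.
pod_hbm_tb = CHIPS_PER_POD * HBM_PER_CHIP_GB / 1000
pod_compute_eflops = CHIPS_PER_POD * PEAK_PER_CHIP_PFLOPS / 1000

print(f"Pod HBM: {pod_hbm_tb:.1f} TB")                    # Pod HBM: 331.8 TB
print(f"Pod compute: {pod_compute_eflops:.1f} exaflops")  # Pod compute: 11.6 exaflops
```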

The MediaTek partnership

Trade coverage of the launch reported that Google brought in MediaTek to co-design the TPU 8i, while its longtime partner Broadcom handled the TPU 8t. By these accounts, MediaTek joined the eighth-generation program as a second silicon design partner in late 2025, splitting the work so that one partner concentrated on the inference die and the other on the training die. Bringing in MediaTek would diversify Google's design supply chain beyond Broadcom, which has handled prior TPU generations. Because Google did not name either partner in its own posts, these reports should be read as industry sourcing rather than as confirmation from Google.^[4]^[5]^[6]

The chips are manufactured by TSMC, and reporting on HBM is consistent: both eighth-generation TPUs use HBM3e. The process node is less settled. Some outlets reported the chips are built on TSMC's 2-nanometre process, while others placed them on the N3 (3-nanometre) family. The disagreement is unresolved in public sourcing as of mid-2026, so the node should be treated as not definitively established.^[6]^[8]^[10]

Comparison to Ironwood

The TPU 8i succeeds Ironwood (TPU v7), Google's previous flagship, which reached general availability around the same Cloud Next event. Against Ironwood, Google's central claim for the TPU 8i is up to 80 percent better performance-per-dollar, concentrated at low-latency targets for large MoE models. Both eighth-generation chips also deliver up to twice the performance-per-watt of the previous generation, which Google frames as essential to scaling inference without a matching rise in energy use.^[1]^[2]^[3]
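A performance-per-dollar gain translates into a smaller cost reduction than the headline number suggests, because cost scales with the reciprocal of the gain. The sketch below works through that arithmetic for the two "up to" claims. It treats both as best-case bounds and assumes the same amount of work on both chips; the function name is illustrative.

```python
def relative_cost(gain: float) -> float:
    """Cost (or energy) for the same work relative to baseline.

    `gain` is the fractional improvement in work per unit of cost,
    e.g. 0.80 for "80 percent better performance-per-dollar".
    """
    return 1.0 / (1.0 + gain)


dollar_cost = relative_cost(0.80)  # up to 80% better performance-per-dollar
energy_cost = relative_cost(1.00)  # up to 2x performance-per-watt

print(f"Cost for same work: {dollar_cost:.1%} of Ironwood")    # 55.6%
print(f"Energy for same work: {energy_cost:.1%} of Ironwood")  # 50.0%
```

At the claimed upper bounds, serving the same traffic would cost roughly 44 percent less and use about half the energy.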

The gains come mostly from architecture rather than from a single headline throughput number. Tripling on-chip SRAM, doubling interconnect bandwidth, adding the CAE, and hosting the KV cache on silicon together attack the memory-bound, communication-bound nature of decoding. Independent commentary on the launch noted that Google was emphasizing system-level efficiency, making each token cheaper to generate, over the raw peak-flops comparisons that dominate training hardware.^[7]^[3]

Availability

Google said both eighth-generation TPUs, the 8t and the 8i, will become generally available on Google Cloud later in 2026, usable as part of the AI Hypercomputer. That timeline was stated in Google's own materials and echoed by most coverage of the launch. At least one outlet reported a later target of 2027, but Google's stated guidance and the bulk of reporting point to general availability within 2026.^[1]^[2]^[11]^[9]

The launch was widely read as part of a broader effort by hyperscalers to reduce reliance on Nvidia GPUs for AI workloads by scaling their own accelerators, with inference, the larger and faster-growing share of production AI spend, as the main battleground. By shipping a chip aimed squarely at cheap, low-latency serving, Google signaled that it intends to compete on the economics of running models, not only on the cost of training them.^[3]^[7]

References

1. Our eighth generation TPUs: two chips for the agentic era, The Keyword (Google), 22 April 2026.
2. TPU 8t and TPU 8i technical deep dive, Google Cloud Blog, April 2026.
3. Google Cloud Next 2026: Google Cloud Bifurcates the AI Future, Specialized TPU 8t and 8i Architectures Signal the End of General-Purpose Silicon, HyperFrame Research, 22 April 2026.
4. Google Splits TPUv8 Strategy Into Two Chips, Handing Broadcom Training and MediaTek Inference Duties, Wccftech, April 2026.
5. Google launches Ironwood TPU and previews eighth-gen split into training and inference chips, The Next Web, April 2026.
6. Google TPU 8t and TPU 8i: The Agentic-Era Chip Split, Nerd Level Tech, April 2026.
7. With TPU 8, Google Makes GenAI Systems Much Better, Not Just Bigger, The Next Platform, 24 April 2026.
8. Google dual tracks TPU 8 to conquer training and inference, The Register, 22 April 2026.
9. Two new TPUs to power the next wave of AI training and inference at Google, SiliconANGLE, 22 April 2026.
10. Inside Google's TPU V8 strategy, delivering two chips for two crucial tasks at incredible scale, Tom's Hardware, April 2026.
11. Google unveils eighth-generation TPUs, two dedicated training and inference chips, DataCenterDynamics, April 2026.
