TPU Ironwood (officially TPU v7 or TPU7x) is Google's seventh-generation Tensor Processing Unit, unveiled at Google Cloud Next 2025 in Las Vegas on April 9, 2025. It is the first TPU generation designed primarily for inference workloads, reflecting a broader industry shift from training-dominated compute to large-scale AI serving. Each chip delivers 4,614 teraFLOPS at FP8 precision, carries 192 GB of HBM3E memory with 7.37 TB/s of bandwidth, and can be assembled into superpods of up to 9,216 chips reaching 42.5 exaFLOPS of aggregate FP8 compute. Google has deployed Ironwood to power its Gemini 2.5 models and to run Google DeepMind's AlphaFold research workloads internally. External customers including Anthropic, Lightricks, and Essential AI were announced at launch as early adopters.
Google began designing its first custom AI accelerator in 2013, motivated by a projection that if users interacted with voice search using neural networks for only a few minutes per day, the company would need to double its entire data-center compute capacity to handle the inference load. The solution was an application-specific integrated circuit (ASIC) built around a systolic array optimized for matrix multiplication, rather than a general-purpose GPU.
TPU v1 entered production in Google data centers in 2015 and was publicly disclosed at Google I/O 2016. Fabricated on a 28 nm process, it contained a 256x256 array of 8-bit multiply-accumulate units delivering roughly 92 TOPS of 8-bit integer performance at 75 W. It was an inference-only accelerator connected to host servers over PCIe 3.0.
TPU v2, announced in 2017, was the first generation capable of training neural networks. It introduced bfloat16 arithmetic, a 16-bit floating-point format that Google popularized across the AI industry, and 16 GB of HBM memory providing 600 GB/s of bandwidth. Pods of 256 chips were assembled for training jobs.
TPU v3, announced in May 2018, doubled the per-chip compute of v2 and introduced liquid cooling, a thermal design that has persisted through every subsequent generation. Pods scaled to 1,024 chips.
TPU v4, disclosed in 2021, moved to a 7 nm process and introduced optical circuit switches (OCS) to interconnect chips in a pod without fixed wiring. The OCS fabric allowed any two chips in a pod to communicate through a dynamically reconfigurable optical path, dramatically simplifying large-scale distributed training. TPU v4 pods contained 4,096 chips.
TPU v5e and TPU v5p, launched in 2023, represented a split into cost-optimized (v5e) and performance-optimized (v5p) variants. TPU v5p carried 95 GB of HBM and delivered 459 teraFLOPS of BF16 compute (918 TOPS at INT8), positioning it against the NVIDIA H100.
TPU v6e (Trillium), announced at Google I/O in May 2024 and entering preview that October, was the largest architectural leap between consecutive generations since v2. Trillium enlarged the matrix multiply unit from a 128x128 to a 256x256 MAC array (quadrupling the MACs per MXU), raised clock speed, and delivered 4.7x the peak compute of v5e per chip. It carried 32 GB of HBM at 1,640 GB/s bandwidth, ran pods of up to 256 chips, and improved energy efficiency by 67% over v5e.
TPU v7 (Ironwood), announced April 2025, is the subject of this article. Its successor, the eighth-generation TPU, was announced in April 2026 as two separate chips: TPU 8t for training and TPU 8i for inference, each pushing into FP4 precision territory.
Ironwood is the first Tensor Processing Unit built with a multi-chiplet package. Each Ironwood chip contains two compute dies, called Ironwood compute chiplets, bonded together in a single package. Each chiplet is functionally self-contained: it includes one TensorCore, two SparseCores, and 96 GB of HBM3E memory across four stacks. The two chiplets are joined by a die-to-die (D2D) interface that runs at roughly six times the bandwidth of a single inter-chip interconnect (ICI) link, making intra-package communication far faster than any off-package link.
This gives the assembled chip two TensorCores, four SparseCores, 192 GB of HBM3E across eight stacks, and a combined peak FP8 throughput of 4,614 teraFLOPS.
The package also includes separate I/O dies responsible for the ICI links that connect chips to neighbors in a pod. Separating I/O logic from compute logic allows Google to optimize each die independently and improve yields.
The chiplet design was co-developed with Broadcom, which handles packaging and supply chain logistics. Fabrication uses TSMC's N3P 3 nm process for the compute dies.
Each TensorCore contains a Matrix Multiply Unit (MXU) and a Vector Processing Unit (VPU).
The MXU uses a 256x256 systolic array for matrix operations. For FP8 precision, the implementation maps two FP8 operations onto each FP16 data path, effectively creating a 512x512 logical MAC array and doubling FP8 throughput relative to BF16. FP8 support is new to Ironwood: it is the first TPU generation with hardware FP8 tensor calculations.
The VPU handles element-wise operations such as activations, layer normalizations, and scaling. It runs alongside the MXU to handle the non-matrix portions of typical transformer inference passes.
Operating frequency is approximately 1.1 GHz, lower than the 1.75 GHz of TPU v5p. The frequency reduction reflects a deliberate trade-off: the wider MXU and higher-bandwidth memory deliver more throughput per clock, while the lower frequency reduces power and heat per compute unit.
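A back-of-envelope check ties these figures together. The sketch below uses only numbers quoted in this article (the 4,614 TFLOPS FP8 peak and the approximate 1.1 GHz clock), so the implied array count is approximate; Google has not published a per-TensorCore MXU count for Ironwood.

```python
# Derive the implied MAC parallelism from the figures quoted above.
FP8_PEAK = 4.614e15     # FLOPS per chip (Google's announced figure)
CLOCK_HZ = 1.1e9        # approximate operating frequency

flops_per_cycle = FP8_PEAK / CLOCK_HZ           # ~4.19 million
macs_per_cycle = flops_per_cycle / 2            # multiply + add = 2 FLOPs
arrays = macs_per_cycle / (512 * 512)           # units of one 512x512 logical array

print(f"MACs per cycle per chip: {macs_per_cycle:,.0f}")         # ~2,097,000
print(f"Implied 512x512 logical arrays per chip: {arrays:.1f}")  # ~8.0
```

Roughly eight logical 512x512 arrays per chip, i.e., about four per TensorCore, is what makes the stated peak consistent with the stated clock.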
Each TensorCore is paired with two SparseCores (four total per chip). SparseCore is a specialized accelerator for large embedding table lookups, which dominate the compute profile of ranking and recommendation systems. For large language models and mixture-of-experts architectures, SparseCores handle the gating and routing operations that direct tokens to appropriate expert subnetworks.
SparseCore supports efficient data distribution across chips during embedding lookups, enabling recommendation workloads that would otherwise saturate conventional memory hierarchies. This makes Ironwood viable for advertising and content-recommendation serving workloads alongside LLM inference.
Google used its AlphaChip system, a reinforcement-learning-based chip floorplanning tool developed by Google DeepMind, to optimize the physical layout of Ironwood dies. AlphaChip was first applied to TPU v4 and has been used for every subsequent generation. The system produces floorplans that human engineers have difficulty matching within comparable time budgets.
| Specification | Value |
|---|---|
| Generation | 7th (TPU v7 / TPU7x) |
| Codename | Ironwood |
| Architecture | Dual-chiplet, multi-die package |
| TensorCores per chip | 2 |
| SparseCores per chip | 4 |
| MXU array size | 256x256 (logical 512x512 at FP8) |
| Peak compute (FP8) | 4,614 TFLOPS |
| Peak compute (BF16) | 2,307 TFLOPS |
| HBM type | HBM3E |
| HBM capacity | 192 GB (8 stacks, 24 GB per stack) |
| HBM bandwidth | 7,370 GB/s (approx. 7.37 TB/s) |
| ICI bandwidth | 1,200 GB/s bidirectional |
| Data center network | 100 Gbps per chip |
| Estimated TDP | ~700 W to 1 kW per chip |
| Process node | TSMC N3P (3 nm class) |
| Cooling | Third-generation liquid cooling |
| Host CPU | Google Axion (Arm-based) |
Google offers Ironwood in two standard configurations through Google Cloud.
The 256-chip pod targets moderate-scale workloads, including dedicated inference serving for individual model deployments and smaller training runs.
The 9,216-chip superpod is the largest Ironwood configuration. At 9,216 chips and 42.5 exaFLOPS of FP8 throughput, this superpod exceeds the theoretical peak of the world's fastest traditional supercomputers, though the comparison crosses precisions: Frontier at Oak Ridge National Laboratory reached 1.1 exaFLOPS and El Capitan at Lawrence Livermore approximately 1.7 exaFLOPS, both measured at FP64, a far more precise format than FP8. Google notes the full superpod holds 1.77 petabytes of directly addressable HBM across all chips.
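The headline pod figures follow directly from the per-chip specifications; a quick arithmetic check:

```python
# Aggregate superpod figures derived from per-chip specs.
chips = 9_216
fp8_per_chip = 4.614e15      # FLOPS
hbm_per_chip_gb = 192

print(f"FP8 peak: {chips * fp8_per_chip / 1e18:.1f} exaFLOPS")      # 42.5
print(f"Total HBM: {chips * hbm_per_chip_gb / 1e6:.2f} petabytes")  # 1.77
```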
The superpod draws approximately 10 megawatts of power under full load.
For deployments requiring more than 9,216 chips, multiple superpods can be interconnected through Google's data-center network (DCN) using the Jupiter fabric, scaling to hundreds of thousands of chips. Google's internal Gemini production clusters operate at this multi-superpod scale.
The VM configuration for Ironwood in Google Cloud uses the tpu7x-standard-4t machine type, which bundles 4 TPU chips with 224 vCPUs and 960 GB of RAM on a shared host. This grouping reflects the physical hardware layout: 4 Ironwood chips per liquid-cooled tray, 16 trays per rack (64 chips), paired with CPU host racks.
Within a pod, chips are arranged in a 3D torus topology. The smallest deployable slice is 2x2x1 (4 chips, one host). The largest single-pod slice is 8x16x16 (2,048 chips across 512 hosts). The full superpod is assembled by connecting multiple cubes through the OCS fabric.
The Inter-Chip Interconnect (ICI) is a proprietary high-speed serial link that connects Ironwood chips within a pod. Each chip has four ICI links providing 9.6 Tbps of aggregate bidirectional bandwidth (approximately 1.2 TB/s). This represents a 1.5x bandwidth increase over Trillium's ICI.
ICI uses custom Remote Direct Memory Access (RDMA) protocols that allow chips to read and write each other's HBM without involving the host CPU. This reduces latency for collective operations such as all-reduce and all-gather, which are ubiquitous in distributed training and tensor-parallel inference.
Within a cube of 64 chips, ICI links form a 3D torus mesh. In this topology, every chip connects to six neighbors (two in each of three spatial dimensions). The torus structure gives each chip multiple paths to every other chip, providing both bandwidth and fault tolerance. Three distinct parallelism axes exist, which map naturally to the tensor parallelism, pipeline parallelism, and data parallelism axes used in large model training and inference.
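As an illustration, the sketch below maps a hypothetical 4x4x4 cube onto three named JAX mesh axes and runs a tensor-parallel matrix multiply whose all-reduce travels along a single torus dimension. The 4x4x4 shape and the axis names are illustrative choices, not a documented Ironwood configuration.

```python
import jax
from jax.experimental import mesh_utils
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

# Illustrative: arrange 64 chips as a 4x4x4 mesh whose axes mirror the
# three torus dimensions, one per parallelism strategy.
devices = mesh_utils.create_device_mesh((4, 4, 4))
mesh = Mesh(devices, axis_names=("data", "pipeline", "tensor"))

def local_matmul(a, b):
    partial = a @ b                         # each chip multiplies its shard
    return jax.lax.psum(partial, "tensor")  # all-reduce along one torus axis

# a is split column-wise and b row-wise across the "tensor" axis; the
# psum combines partial products into a result replicated on every chip.
tensor_parallel_matmul = shard_map(
    local_matmul, mesh=mesh,
    in_specs=(P(None, "tensor"), P("tensor", None)),
    out_specs=P(None, None),
)
```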
Since TPU v4, Google has used Optical Circuit Switches (OCS) to interconnect chip cubes within a pod. Ironwood continues this approach. The OCS fabric manager can establish, reconfigure, and route around failed links or chips dynamically. When a chip or cube becomes unhealthy, the OCS fabric manager establishes new optical paths using designated spares, maintaining full torus connectivity without requiring a restart of the workload. This contributes to Google's reported ~99.999% fleet-wide TPU uptime since 2020.
For the largest Ironwood clusters, multiple superpods connect through Google's Jupiter data-center network, which uses electrical packet switching for inter-superpod traffic.
Each Ironwood chip carries eight stacks of HBM3E memory, providing 192 GB of capacity and approximately 7.37 TB/s of peak bandwidth. This is a dramatic increase over previous generations: Trillium (TPU v6e) carried 32 GB at 1,640 GB/s per chip, and TPU v5p carried 95 GB at 2,765 GB/s.
The 192 GB per chip was deliberately chosen to allow large language models to run with fewer chips. A 70-billion-parameter model in BF16 requires roughly 140 GB of weight memory alone. With Trillium, that model would need a minimum of 5 chips for weights; with Ironwood, the weights fit on a single chip, reducing communication overhead and improving latency for serving.
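The sizing arithmetic is straightforward (weights only; a serving deployment also needs KV-cache and activation memory on top of this):

```python
import math

params = 70e9
weight_gb = params * 2 / 1e9          # BF16 = 2 bytes/parameter -> ~140 GB

for name, hbm_gb in [("Trillium (32 GB)", 32), ("Ironwood (192 GB)", 192)]:
    print(name, "->", math.ceil(weight_gb / hbm_gb), "chip(s) for weights")
# Trillium (32 GB) -> 5 chip(s); Ironwood (192 GB) -> 1 chip(s)
```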
The 9,216-chip superpod presents a single addressable pool of 1.77 petabytes of HBM to the Pathways distributed runtime. XLA and the Pathways layer manage sharding of model weights and KV-cache tensors across this pool transparently.
Beyond HBM, each TensorCore has a block of fast on-chip SRAM (vector memory, or VMEM) used for tile-level data staging. This SRAM is explicitly managed through the Pallas kernel programming interface, allowing engineers to implement custom fused operations that overlap HBM-to-SRAM transfers with MXU computation.
Bandwidth improvements vs. prior generations:
| Generation | HBM capacity | HBM bandwidth | Ratio to Trillium (bandwidth) |
|---|---|---|---|
| TPU v5p | 95 GB | 2,765 GB/s | 1.7x |
| TPU v6e (Trillium) | 32 GB | 1,640 GB/s | 1.0x |
| TPU v7 (Ironwood) | 192 GB | 7,370 GB/s | 4.5x |
Ironwood uses Google's third-generation liquid cooling infrastructure. Liquid cooling was first introduced with TPU v3 in 2018, following thermal limits that made air cooling impractical for high-density AI accelerator racks.
The system uses cold plates mounted directly on the compute dies. A closed water loop circulates coolant through the cold plate and carries heat to facility-level cooling distribution units. Google's design keeps the water entering the cold plate chemically treated and filtered to prevent mineral deposits from blocking the narrow channels.
The thermal benefit is substantial: Google states that advanced liquid cooling supports roughly twice the sustained compute throughput of standard air cooling under continuous heavy workloads. For Ironwood, this matters because the chip's estimated per-chip TDP is in the 700 W to 1 kW range, a power density that air cooling cannot remove reliably at high rack density.
The full 9,216-chip superpod draws approximately 10 MW under load, requiring industrial-scale cooling plant infrastructure. Google designs this cooling infrastructure in-house and co-locates it with TPU production clusters.
The following table summarizes Ironwood's key performance figures as reported by Google at announcement:
| Metric | Value | Notes |
|---|---|---|
| FP8 peak (per chip) | 4,614 TFLOPS | First TPU generation with native FP8 |
| BF16 peak (per chip) | 2,307 TFLOPS | Standard training precision |
| FP8 peak (9,216-chip superpod) | 42.5 ExaFLOPS | Google's headline figure |
| BF16 peak (9,216-chip superpod) | 21.26 ExaFLOPS | |
| HBM bandwidth (per chip) | 7.37 TB/s | 4.5x Trillium |
| HBM capacity (per chip) | 192 GB | 6x Trillium |
| ICI bandwidth (per chip) | 1.2 TB/s bidirectional | 1.5x Trillium |
| Perf/watt vs. Trillium | 2x improvement | FP8 FLOPS per watt of TDP |
| Perf/watt vs. TPU v2 (2018) | ~30x improvement | Long-term efficiency trend |
| Peak vs. TPU v5p | 10x higher peak TFLOPS | |
| Per-chip perf vs. Trillium | 4x+ better | Training and inference |
Google also reports the 9,216-chip superpod has 1.77 petabytes of total addressable HBM and 88,473.6 Tbps of aggregate ICI bandwidth across all inter-chip links.
| Specification | Trillium (TPU v6e) | Ironwood (TPU v7) | Improvement |
|---|---|---|---|
| Peak 8-bit compute | 1,836 TOPS (INT8) | 4,614 TFLOPS (FP8) | 2.5x per chip |
| Peak compute (BF16) | 918 TFLOPS | 2,307 TFLOPS | 2.5x per chip |
| HBM capacity | 32 GB | 192 GB | 6x |
| HBM bandwidth | 1,640 GB/s | 7,370 GB/s | 4.5x |
| ICI bandwidth | ~800 GB/s | 1,200 GB/s | 1.5x |
| Max pod size | 256 chips | 9,216 chips | 36x |
| Perf/watt | 1x (baseline) | 2x | 2x |
| Native FP8 support | No (INT8 only) | Yes | New |
| Chiplet design | No | Yes (2 dies) | New |
| HBM type | HBM | HBM3E | Newer generation |
Google positions Trillium as the preferred choice for customers who need both training and inference capacity simultaneously. Ironwood is the preferred choice for pure inference serving and for training jobs that need very large per-chip memory.
NVIDIA's Blackwell architecture, released in 2024 and 2025 across the B100, B200, and GB200 products, is the primary commercial competitor to Ironwood for large-scale AI inference infrastructure.
| Specification | NVIDIA B200 | NVIDIA GB200 NVL72 | Ironwood (TPU v7) | Ironwood 9,216-chip pod |
|---|---|---|---|---|
| FP8 peak (per accelerator) | 4,500 TFLOPS | 5,000 TFLOPS | 4,614 TFLOPS | -- |
| HBM capacity | 192 GB | 192 GB | 192 GB | 1.77 PB |
| HBM bandwidth | 8.0 TB/s | 8.0 TB/s | 7.37 TB/s | -- |
| Max GPUs/chips per system | 8 (HGX baseboard) | 72 (NVL72) | 9,216 (superpod) | -- |
| Interconnect bandwidth | 14.4 Tbps (NVLink) | 14.4 Tbps | 9.6 Tbps (ICI) | -- |
| FP4 support | Yes (B200) | Yes | No | -- |
| Cooling | Liquid (NVL72) | Liquid | Liquid | -- |
| Third-party ecosystem | Very broad | Very broad | Limited (JAX, PyTorch XLA) | -- |
Key differences in competitive positioning:
At the per-chip level, Ironwood and NVIDIA Blackwell B200 deliver comparable FP8 throughput (4,614 vs. 4,500 TFLOPS) and carry identical HBM capacity (192 GB). NVIDIA's NVLink provides higher per-accelerator interconnect bandwidth (14.4 Tbps vs. 9.6 Tbps ICI), and NVIDIA's Blackwell parts support FP4 precision for compressed inference, which Ironwood does not.
At the pod or cluster level, Google's advantage is scale. NVIDIA's largest self-contained systems top out at 72 GPUs in a single NVL72 rack. Ironwood is purpose-built to operate as 9,216 chips in a single optical-fabric domain, presenting that entire pool as one coherent parallel processor. For workloads that benefit from all-to-all or all-reduce collectives across the entire cluster, the lower per-chip ICI bandwidth of Ironwood is offset by the ability to run those collectives within a single optical domain rather than across multiple compute islands.
On total cost of ownership, third-party estimates suggest Ironwood superpod costs approximately 30% less per hour than an equivalent GB200 configuration and approximately 41% less than GB300, when accounting for compute capacity and power costs.
All Ironwood workloads compile through XLA (Accelerated Linear Algebra), Google's domain-specific compiler for tensor programs. XLA ingests programs written in JAX or PyTorch, performs whole-program optimization including operator fusion, operation scheduling, and layout selection, then emits optimized TPU machine code. XLA's whole-program view lets it make decisions that per-operator compilers cannot, such as fusing a sequence of matrix multiplications, activations, and normalizations into a single kernel that avoids repeated round trips to HBM.
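A minimal sketch of the kind of fusion described here: under jax.jit, XLA can compile the matmul, bias, activation, and normalization below into fused kernels so intermediates stay on chip rather than round-tripping through HBM. The function and shapes are illustrative.

```python
import jax
import jax.numpy as jnp

@jax.jit
def block(x, w, b, gamma, beta):
    h = x @ w + b                         # MXU matrix work
    h = jax.nn.gelu(h)                    # element-wise VPU work, fused by XLA
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mean) * jax.lax.rsqrt(var + 1e-6) + beta
```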
JAX is the primary framework for Ironwood. JAX programs are Python expressions that describe tensor computations using NumPy-like syntax. The jit decorator triggers XLA compilation. The grad function computes exact gradients via reverse-mode automatic differentiation, enabling training without manually writing backward passes. The shard_map primitive allows explicit specification of how tensors are partitioned across a mesh of chips.
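A minimal training step showing jit and grad together (shard_map is sketched in the interconnect section above); the model and loss are placeholders:

```python
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    pred = jnp.tanh(x @ w)                # toy model
    return jnp.mean((pred - y) ** 2)

@jax.jit                                  # compile the whole step with XLA
def train_step(w, x, y, lr=1e-2):
    g = jax.grad(loss_fn)(w, x, y)        # reverse-mode autodiff
    return w - lr * g
```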
Google's production LLM training and serving codebase is written in JAX. MaxText, an open-source LLM framework from Google, supports pre-training, supervised fine-tuning, and reinforcement learning from human feedback on Ironwood. MaxText supports popular open-weight architectures including Gemma, DeepSeek, Qwen, and Mixtral-style mixture-of-experts models.
The JAX ecosystem for Ironwood includes several production-grade libraries; a short Optax usage sketch follows the table:
| Library | Purpose |
|---|---|
| Optax | Gradient processing and optimization algorithms (AdamW, Lion, etc.) |
| Orbax | Asynchronous distributed checkpointing for large model arrays |
| Qwix | Quantization-aware training and post-training quantization |
| Metrax | Distributed evaluation metric computation |
| Tunix | Post-training pipeline orchestration |
| Goodput | ML training efficiency measurement and monitoring |
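A minimal sketch of the Optax pattern, with illustrative parameter shapes: the optimizer is a pure (init, update) pair whose state threads through jit-compiled steps.

```python
import jax
import jax.numpy as jnp
import optax

params = {"w": jnp.zeros((1024, 1024))}        # illustrative
optimizer = optax.adamw(learning_rate=3e-4)
opt_state = optimizer.init(params)

@jax.jit
def apply_grads(params, opt_state, grads):
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state
```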
PyTorch users can run on Ironwood through PyTorch XLA, which translates PyTorch eager operations into XLA computations. PyTorch XLA on Ironwood supports full eager mode execution, torch.distributed for multi-chip parallelism, and torch.compile for graph-traced optimization. Google has invested specifically in making PyTorch feel idiomatic on TPUs to reduce the porting burden for teams with existing PyTorch codebases.
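A minimal sketch of the PyTorch XLA path (lazy-tensor style; the shapes are illustrative):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                   # an attached TPU core
x = torch.randn(4096, 4096, device=device)
y = (x @ x.T).sum()                        # operations are traced lazily
xm.mark_step()                             # flush the traced graph to XLA
print(y.item())
```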
vLLM, the popular open-source LLM serving framework, has a TPU backend that works with Ironwood. The tpu-inference plugin provides a unified JAX/PyTorch lowering path for inference use cases, supporting paged attention and continuous batching.
Pallas is a domain-specific language embedded in JAX for writing custom TPU kernels. It provides explicit control over the HBM-to-SRAM data movement pipeline and allows engineers to overlap memory transfers with MXU computation. Mosaic, Pallas's compiler backend, handles tiling, operator fusion, and software pipelining to produce optimized TPU machine code. Pallas is used for custom attention kernels, mixture-of-experts routing, and other operations where XLA's automatic optimization falls short of theoretical peak.
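A hedged sketch of the Pallas pattern: a fused bias-add and GELU that processes one (256, 256) tile per grid step, with Pallas staging each tile into on-chip memory before the kernel body runs. Tile sizes and shapes are illustrative.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def bias_gelu_kernel(x_ref, b_ref, o_ref):
    # x_ref/b_ref tiles have already been copied HBM -> SRAM by Pallas.
    o_ref[...] = jax.nn.gelu(x_ref[...] + b_ref[...])

@jax.jit
def bias_gelu(x, b):                       # x: (M, N), b: (1, N)
    grid = (x.shape[0] // 256, x.shape[1] // 256)
    return pl.pallas_call(
        bias_gelu_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        grid=grid,
        in_specs=[
            pl.BlockSpec((256, 256), lambda i, j: (i, j)),
            pl.BlockSpec((1, 256), lambda i, j: (0, j)),
        ],
        out_specs=pl.BlockSpec((256, 256), lambda i, j: (i, j)),
    )(x, b)
```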
Pathways is Google's distributed ML runtime that manages job execution across large TPU clusters. It presents a pool of TPU chips as a single machine to the programmer. Pathways handles fault recovery: if a chip or link fails during a training run, it checkpoints, reroutes around the failed hardware using the OCS fabric, and resumes from the checkpoint. This elastic execution model is critical for multi-week training jobs on tens of thousands of chips where hardware failures are statistically certain to occur.
TensorFlow is not supported on TPU7x. Google's documentation specifies JAX as the only officially supported framework for direct TPU7x use; PyTorch runs through the XLA bridge.
Vertex AI and Google Kubernetes Engine (GKE) are the primary access paths for Ironwood on Google Cloud.
Ironwood was announced at Google Cloud Next on April 9, 2025, with initial access for select customers immediately. General availability was announced on November 7, 2025. Access requires contacting a Google Cloud account team; quota is not available through the standard self-service console at launch.
TPU7x runs on the tpu7x-standard-4t VM shape: 4 TPU chips, 224 vCPUs, 960 GB RAM per VM. Storage uses Hyperdisk Balanced by default, with Hyperdisk ML also supported. Boot and attached disks use Google's standard Hyperdisk infrastructure.
GKE's Cluster Director feature provides topology-aware scheduling for Ironwood jobs, ensuring that chips assigned to a given workload are physically adjacent in the ICI fabric rather than scattered across different cubes. GKE's Inference Gateway reduces time-to-first-token latency for serving workloads by up to 96% and lowers serving costs by approximately 30% compared to naive round-robin load balancing, by routing prefill and decode phases to different chip allocations.
Vertex AI Model Garden and Vertex AI online prediction support Ironwood for hosted inference, abstracting the chip allocation and batching details from end users.
Google DeepMind has deployed Ironwood to serve Gemini 2.5 in production, as disclosed at the April 2025 launch. Gemini 2.5 is a reasoning-oriented model that uses extended chain-of-thought generation, making it particularly memory-bandwidth intensive at inference time. Ironwood's large per-chip HBM and high bandwidth reduce the number of chips needed to cache KV states for long context windows during generation.
Gemini 2.5 Pro, used for coding, reasoning, and multimodal tasks, is served from Ironwood superpods that keep the full model's KV-cache in on-chip HBM, eliminating DRAM offload latency that would otherwise degrade interactive latency.
Google DeepMind also uses Ironwood for AlphaFold workloads, including structure prediction and structure-guided drug discovery pipelines. AlphaFold 3's all-atom structure diffusion model requires large per-chip memory for attention over long protein sequences, making Ironwood's 192 GB HBM an operational fit.
Ironwood powers inference serving for other Google internal products, including Google Search's AI Overviews feature and Google Maps' AI-assisted navigation features, though Google has not publicly disclosed the full list of production applications.
Anthropic announced in October 2025 an expansion of its Google Cloud TPU agreement that covers access to up to 1 million TPUs, with the first phase covering approximately 400,000 Ironwood (TPU v7) chips. The deal is estimated at roughly $10 billion for the first phase of finished racks. Anthropic cited Ironwood's inference performance and price-performance improvements as the primary motivation, planning to use Ironwood both for training future Claude models and for serving Claude to users.
Lightricks, the maker of AI-based creative tools, used Ironwood to train its LTX-2 multimodal generative video model. Lightricks reported breakthrough training efficiency on Ironwood and was preparing inference workloads on the same hardware at the time of the April 2025 announcement.
Essential AI, a frontier model startup, adopted Ironwood for training large models, citing easy onboarding and immediate operational efficiency.
In late 2025, TrendForce reported that Meta was evaluating a large-scale TPU deployment for 2027, with Google developing native PyTorch support specifically to lower the barrier for Meta's PyTorch-heavy codebase.