See also: Tensor Processing Unit (TPU), Data parallelism, Model parallelism, Distributed training
A TPU Pod is a cluster of Google Tensor Processing Units (TPUs) connected through a proprietary high-speed Inter-Chip Interconnect (ICI) network, designed to function as a single large accelerator for machine learning training and inference workloads. Rather than treating each chip as a separate device, a TPU Pod allows software to view the entire cluster as one unified computational resource, enabling efficient distributed training of large language models and other compute-intensive AI systems.
Google introduced the TPU Pod concept with TPU v2 in 2017, when the company first connected multiple TPU chips via custom ICI links arranged in a 2D torus topology. Since then, each TPU generation has expanded Pod scale and interconnect sophistication. Early Pods contained 256 chips; the latest generation, Ironwood (TPU v7), scales to 9,216 chips per Pod and delivers 42.5 exaFLOPS of aggregate compute. TPU Pods have been used to train many of the most influential AI models of the past decade, including PaLM, Gemini, and BERT.
A TPU Pod differs from a conventional GPU cluster in a fundamental way: the chips communicate through a dedicated, low-latency torus network rather than through general-purpose data center switches. This design enables collective operations (such as all-reduce) to run with minimal overhead, which is particularly important when thousands of chips must synchronize gradient updates during training.
Imagine you have a giant jigsaw puzzle with millions of pieces. If you try to solve it by yourself, it could take weeks. But what if you got a whole classroom of friends to help? Each friend works on a section of the puzzle, and whenever they need a piece from someone else's section, they can pass it over quickly because they are all sitting at the same big table.
A TPU Pod works the same way. Each TPU chip is like one of those friends, and the special wires connecting them (called the "Inter-Chip Interconnect") are like the table that lets them pass puzzle pieces to each other really fast. Because the chips are all connected directly to their neighbors, they can share information almost instantly, so the whole group finishes the puzzle much faster than any single chip could on its own.
The first TPU was an inference-only chip with no inter-chip network. Each TPU v1 sat on its own PCIe card inside a server, operating independently. There was no concept of a "Pod" at this stage; the v1 was designed to accelerate neural network inference for production services such as Google Search, Google Translate, and the AlphaGo system.
Announced at Google I/O in May 2017, TPU v2 was the first generation to support both training and inference. It was also the first to introduce the Inter-Chip Interconnect (ICI), a custom high-speed bidirectional link connecting each chip directly to four neighbors in a 2D torus topology. Groups of four chips were packaged into modules delivering 180 TFLOPS. Sixty-four of these modules formed a 256-chip Pod with a peak throughput of approximately 11.5 petaFLOPS.
The 2D torus meant that each chip could communicate with its north, south, east, and west neighbors, and wraparound links connected chips on opposite edges of the grid. This was the architectural foundation on which all subsequent Pods were built.
TPU v3 retained the 2D torus ICI topology but increased per-link bandwidth and raised per-chip performance to 123 TFLOPS (bf16), up from 45 TFLOPS in v2. Pods scaled from 256 to 1,024 chips, and the aggregate Pod throughput exceeded 100 petaFLOPS. The higher power density of v3 chips required liquid cooling for the first time in Google's TPU program. Google submitted TPU v3 Pod configurations to the MLPerf v0.6 training benchmark, demonstrating the scalability of industry-standard ML models across 1,024 chips.
TPU v4 was a major architectural leap. The interconnect moved from a 2D torus to a 3D torus, where each chip connects to six neighbors along three axes instead of four neighbors along two. This cut the network diameter from growing with the square root of the chip count (for a 2D torus) to growing with its cube root (for a 3D torus), lowering worst-case communication latency.
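As a rough illustration of that scaling (a back-of-envelope sketch, not a Google tool), the diameter of a square or cubic torus is about half the side length per axis, summed over the axes:

```python
# Back-of-envelope: network diameter (worst-case hop count) of a torus.
def torus_diameter(side: int, dims: int) -> int:
    # Wraparound links cap the longest path along one axis at side // 2 hops;
    # the axes are independent, so their contributions add.
    return dims * (side // 2)

# The same 4,096-chip Pod laid out two ways:
print(torus_diameter(64, 2))  # 64 x 64 2D torus       -> 64 hops
print(torus_diameter(16, 3))  # 16 x 16 x 16 3D torus  -> 24 hops
```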
The most significant innovation was the introduction of Optical Circuit Switches (OCS), making TPU v4 the first supercomputer with a dynamically reconfigurable interconnect. Pods scaled to 4,096 chips and delivered 1.1 exaFLOPS (bf16). The TPU v4 system was described in a paper presented at ISCA 2023.
Google split the fifth generation into two products. TPU v5e was a cost-efficient chip using a 2D torus with Pods of up to 256 chips, aimed at inference and moderate-scale training. TPU v5p was the performance variant, scaling to 8,960 chips in a 3D torus and delivering approximately 4.1 exaFLOPS (bf16) across a full Pod.
Trillium doubled ICI bandwidth over v5e and achieved 918 TFLOPS per chip (bf16). Pods scale to 256 chips in a 2D torus. Google reported that 100,000 Trillium chips can be connected within a single Jupiter data center fabric with 13 petabits per second of bisection bandwidth.
Ironwood represents the largest TPU Pod to date. A single rack of hosts contains 64 chips arranged as a 4x4x4 "cube" connected in a 3D torus. Multiple cubes are linked through OCS connections to form a full Pod of 9,216 chips with 42.5 exaFLOPS of FP8 compute. ICI bandwidth reaches 1.2 TB/s bidirectional per chip.
The following table summarizes key Pod-level specifications across TPU generations:
| Generation | Year | Max chips per Pod | Topology | Per-chip bf16 TFLOPS | Per-chip HBM | ICI bandwidth per chip | Pod peak compute |
|---|---|---|---|---|---|---|---|
| TPU v2 | 2017 | 256 | 2D torus | 45 | 16 GB | N/A | 11.5 PFLOPS |
| TPU v3 | 2018 | 1,024 | 2D torus | 123 | 32 GB | N/A | ~100 PFLOPS |
| TPU v4 | 2021 | 4,096 | 3D torus + OCS | 275 | 32 GB | N/A | 1.1 EFLOPS |
| TPU v5e | 2023 | 256 | 2D torus | 197 | 16 GB | 400 GB/s | 50.6 PFLOPS |
| TPU v5p | 2023 | 8,960 | 3D torus | 459 | 95 GB | 1,200 GB/s | ~4.1 EFLOPS |
| TPU v6e (Trillium) | 2024 | 256 | 2D torus | 918 | 32 GB | 800 GB/s | 234.9 PFLOPS |
| TPU v7 (Ironwood) | 2025 | 9,216 | 3D torus + OCS | 2,307 (bf16) / 4,614 (FP8) | 192 GB | 1,200 GB/s | 42.5 EFLOPS (FP8) |
The ICI is a custom high-speed serial link that directly connects neighboring TPU chips. Unlike GPU clusters that route inter-accelerator traffic through PCIe switches, NVLink bridges, or InfiniBand fabrics, ICI provides a direct chip-to-chip path with microsecond-scale latency and terabit-per-second bandwidth. No host CPU is involved in ICI communication; the TPU hardware handles data movement autonomously.
Key characteristics of ICI:
- Direct chip-to-chip serial links between neighboring TPUs, with no switches, PCIe hops, or host CPUs on the path
- Microsecond-scale latency and terabit-per-second-class bandwidth per chip
- Data movement handled autonomously by the TPU hardware
- Links arranged in a 2D or 3D torus topology, depending on the TPU generation
TPU v2, v3, v5e, and v6e use a 2D torus. In this layout, chips are arranged in a rectangular grid with wraparound links connecting each edge to the opposite edge. Each chip connects to four neighbors: north, south, east, and west.
In the v5e and Trillium generations, 2D torus Pods top out at 16x16 (256 chips); TPU v3 scaled its 2D torus to 1,024 chips. This topology is simpler to program and sufficient for workloads that can be partitioned along two dimensions, such as data parallelism across one axis and model parallelism across the other.
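As a minimal JAX sketch of that mapping (hypothetical axis names; assumes a 256-device slice is attached):

```python
# Map a 16 x 16 2D torus slice onto two logical parallelism axes.
import numpy as np
import jax
from jax.sharding import Mesh

devices = np.array(jax.devices()).reshape(16, 16)   # assumes 256 devices
mesh = Mesh(devices, axis_names=("data", "model"))
# Batch dimensions are then sharded along "data" and weight dimensions along
# "model", mirroring the two physical axes of the torus.
```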
TPU v4, v5p, and v7 (Ironwood) use a 3D torus, adding a third axis with wraparound links. Each chip connects to six neighbors. Compared with a 2D torus of the same size, the three-dimensional layout provides several advantages:
- A smaller network diameter, so fewer worst-case hops between any two chips
- Higher bisection bandwidth, which speeds up collective operations such as all-reduce
- A third physical axis onto which an additional parallelism dimension can be mapped
The basic building block of a 3D torus Pod is a cube of 4x4x4 = 64 chips. Cubes are connected to each other via wraparound links, which in TPU v4 and Ironwood pass through Optical Circuit Switches.
TPU v4 introduced a twisted torus variant, in which the wraparound links along one axis are shifted: a chip on one edge of the grid connects not to the chip directly opposite it but to one offset by a fixed amount. The twisted torus increases bisection bandwidth by approximately 70% for certain slice shapes (for example, a 4x4x8 twisted torus versus a standard 4x4x8 torus). Because the twist is implemented through OCS routing tables rather than physical rewiring, users can choose between a standard torus and a twisted torus for any given workload.
OCS technology is one of the most distinctive features of TPU v4 and Ironwood Pods. An OCS uses arrays of tiny mirrors built with Micro-Electro-Mechanical Systems (MEMS) technology to steer optical signals between fiber-optic cables. Switching happens in milliseconds and requires no electrical-to-optical conversion, since the signals remain in the optical domain throughout.
The OCS layer sits between cubes. Within a cube, chips are connected via direct electrical ICI links in a 3D mesh. The OCS provides the wraparound links that turn this mesh into a torus, and it can dynamically reconfigure which cubes are connected to each other.
Benefits of OCS include:
| Benefit | Description |
|---|---|
| Fault tolerance | If a chip, cable, or OCS port fails, the fabric manager reconfigures optical paths to bypass the fault. Jobs continue on healthy hardware without manual intervention. |
| Flexible partitioning | A single physical Pod can be subdivided into multiple independent slices for different users or workloads. |
| Topology selection | Users can select standard torus, twisted torus, or other topologies through software configuration. |
| Low cost and power | OCS infrastructure accounts for less than 5% of total system cost and less than 3% of total system power, far cheaper and more efficient than electrical switching (e.g., InfiniBand). |
For workloads that span multiple Pods or multiple slices, TPU hosts communicate through Google's data center network (DCN). DCN bandwidth per chip is much lower than ICI bandwidth (roughly 6.25 GB/s per chip on v5p, compared to ~270 GB/s for ICI), so the XLA compiler and runtime schedule DCN communication carefully to overlap it with computation.
A slice is a contiguous set of TPU chips within a single Pod, all connected via ICI. Slices are the unit of allocation in Google Cloud: when a user requests TPU resources, they receive a slice of a specific topology (for example, a v4-128 slice is a 4x4x4 cube of 64 chips assigned as 128 TensorCores, since each v4 chip has two TensorCores).
Slice sizes vary by TPU generation. For TPU v4, slice configurations range from v4-8 (4 chips) to v4-8192 (the full 4,096-chip Pod). For TPU v5p, configurations range from 4 chips (2x2x1) up to 6,144 chips (16x16x24), which is the largest schedulable job size. The full v5p Pod of 8,960 chips contains additional spare cubes used for fault tolerance.
Cloud TPU Multislice is a scaling technology that allows a single training job to span multiple slices. Chips within each slice communicate via ICI as usual, while chips in different slices exchange data through DCN by routing traffic through host CPUs.
The XLA compiler automatically generates the inter-slice DCN communication code. Developers do not need to write explicit networking logic; the compiler inserts the necessary collective operations and overlaps them with computation to hide latency.
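In JAX, a Multislice job can be expressed by building a device mesh with separate ICI and DCN axes. The following is a hedged sketch using JAX's mesh_utils helper; the shapes and axis names are illustrative and assume four slices of 256 chips each:

```python
# Hypothetical Multislice mesh: one axis crosses slices over DCN, the other
# stays inside each slice where ICI bandwidth is highest.
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh

devices = mesh_utils.create_hybrid_device_mesh(
    (1, 256),    # within each slice (ICI): all 256 chips on one axis
    (4, 1),      # across slices (DCN): 4-way data parallelism
    devices=jax.devices(),
)
mesh = Mesh(devices, axis_names=("data", "model"))
# Gradient averaging crosses DCN once per step along "data", while
# model-parallel traffic stays on "model" and never leaves the slice.
```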
Multislice scaling has been demonstrated at very large scale. Google reported near-linear scaling across 50,944 TPU v5e chips (approximately 200 slices) in a real-world LLM training job, the largest distributed LLM training run publicly disclosed at the time.
| Feature | Single slice | Multislice |
|---|---|---|
| Communication | ICI only | ICI within slice, DCN between slices |
| Max scale (v5p) | 6,144 chips | 18,432+ chips |
| Max scale (v5e) | 256 chips | 50,944+ chips demonstrated |
| Latency | Microseconds (ICI) | Higher (DCN adds latency) |
| Programming model | Transparent (SPMD) | Transparent (XLA-managed) |
The primary compiler for TPU Pods is XLA (Accelerated Linear Algebra), a domain-specific compiler that translates high-level ML operations into optimized TPU machine code. XLA performs whole-program analysis, fusing operations, tiling computations to fit in on-chip memory, and scheduling data transfers to keep the hardware busy.
For distributed workloads on TPU Pods, XLA uses GSPMD (General-purpose Single Program Multiple Data), a partitioning pass that automatically shards a computation across all chips in the Pod. Developers annotate tensors with sharding specifications (for example, "shard this tensor's batch dimension across the first axis of the Pod"), and GSPMD transforms the single-device program into a distributed one, inserting the correct collective communication operations.
This approach means developers can write code as if it will run on a single large device. The compiler handles:
- Partitioning tensors and computation across all chips in the slice or Pod
- Inserting the required collective operations (such as all-reduce and all-gather)
- Scheduling communication so that it overlaps with computation
- Generating the inter-slice DCN communication needed for Multislice jobs
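A hedged sketch of this workflow in JAX, with illustrative mesh shape, tensor shapes, and axis names (it assumes an even number of attached devices and dimensions that divide evenly across the mesh):

```python
# Single-device-style code that XLA/GSPMD partitions across a TPU slice.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Logical 2D mesh over all attached devices (shape is illustrative).
devices = mesh_utils.create_device_mesh((jax.device_count() // 2, 2))
mesh = Mesh(devices, axis_names=("data", "model"))

@jax.jit
def layer(x, w):
    # Written as if for one big device; the compiler inserts the collectives.
    return jnp.dot(x, w)

# Sharding annotations: batch split along "data", weight columns along "model".
x = jax.device_put(jnp.ones((256, 512)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 1024)), NamedSharding(mesh, P(None, "model")))
y = layer(x, w)   # output sharding is inferred and propagated by the compiler
```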
TPU Pods are supported by three major ML frameworks:
| Framework | TPU Pod mechanism | Distributed training approach |
|---|---|---|
| JAX | Native XLA, jit + shmap/pjit | GSPMD sharding annotations; write for one device, compiler distributes |
| TensorFlow | tf.distribute.TPUStrategy | Data parallelism and model parallelism via distribution strategies |
| PyTorch | PyTorch/XLA with SPMD | XLA-based sharding; FSDP and tensor parallelism supported |
JAX is the most commonly used framework for TPU Pod workloads at Google. Its functional programming model aligns naturally with XLA's compilation requirements, and JAX transformations like jit, vmap, pmap, and shmap map directly to Pod topologies.
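As a small illustration of how a collective such as all-reduce appears in JAX code, here is a hedged data-parallel training step built on pmap and pmean (the toy model and shapes are hypothetical):

```python
import functools
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    # Toy linear model standing in for a real network.
    return jnp.mean((x @ w - y) ** 2)

# One copy of this SPMD program runs per device; "batch" names the device axis.
@functools.partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    # All-reduce: average gradients across every device over the interconnect.
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.01 * grads

n = jax.local_device_count()
w = jnp.zeros((n, 8, 1))   # identical weight replica per device
x = jnp.ones((n, 32, 8))   # leading axis indexes devices
y = jnp.ones((n, 32, 1))
w = train_step(w, x, y)
```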
TPU Pods support all standard distributed training strategies:
- Data parallelism, in which every chip holds a full model replica and gradients are synchronized with all-reduce
- Model (tensor) parallelism, in which individual layers or weight matrices are sharded across chips
- Pipeline parallelism, in which consecutive groups of layers are placed on different sets of chips
- Fully sharded data parallelism (FSDP), in which parameters, gradients, and optimizer state are sharded across the data-parallel axis
In practice, large-scale training on TPU Pods combines multiple strategies. For example, PaLM 540B used a combination of data parallelism and model parallelism across 6,144 TPU v4 chips spanning two Pods.
Operating thousands of chips continuously for days or weeks of training requires robust fault-handling mechanisms. At the scale of a TPU v4 Pod, hardware failures are not rare events but routine occurrences.
A study by Zu et al., presented at NSDI 2024, reported the following daily failure rates for Google's TPU v4 supercomputers:
| Component | Daily failure rate |
|---|---|
| TPU machines | 0.08% |
| ICI cables | 0.005% |
| Optical circuit switches | 0.04% |
At these rates, a 4,096-chip Pod experiences hardware failures on a roughly daily basis. Without automated recovery, such failures would cause frequent training interruptions.
Google's TPU infrastructure uses several techniques to maintain high availability:
- The OCS fabric manager reroutes optical paths around failed chips, cables, or switch ports, so jobs keep running on healthy hardware
- Full Pods include spare capacity (for example, the v5p Pod's 8,960 chips exceed its 6,144-chip maximum job size), letting failed cubes be swapped out
- Long-running jobs checkpoint periodically and are restarted automatically after an interruption
The Zu et al. study reported that TPU v4 supercomputers achieve 99.98% system availability through these automated mechanisms, with hardware outages affecting approximately 1% of training jobs.
The following table lists major models and systems trained on TPU Pods:
| Model | Year | TPU generation | Pod scale | Notes |
|---|---|---|---|---|
| BERT | 2018 | TPU v3 | 16 chips | Trained in 4 days; pre-training that transformed NLP |
| T5 | 2019 | TPU v3 | 1,024 chips | Text-to-text framework; explored scaling laws |
| AlphaFold 2 | 2020 | TPU v3 | 128 chips | Solved protein structure prediction; won CASP14 |
| LaMDA | 2021 | TPU v3 | 1,024 chips | Conversational model that powered early Google Bard |
| PaLM | 2022 | TPU v4 | 6,144 chips (2 Pods) | 540B parameters; first large-scale use of Pathways system; 57.8% hardware FLOPS utilization |
| Gemini | 2023 | TPU v4 / v5p | Multi-Pod | Google's flagship multimodal model family |
| Gemma | 2024 | TPU v5e | N/A | Open-weights model family |
PaLM is a particularly instructive example of TPU Pod usage. The 540B-parameter model was trained across two TPU v4 Pods, each with 3,072 chips (for a total of 6,144 chips), using the Pathways system to coordinate computation across Pods. The training achieved 57.8% hardware FLOPS utilization, the highest figure reported for LLM training at that scale at the time of publication.
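A back-of-envelope reading of that utilization figure, using the per-chip peak from the table above (illustrative arithmetic only):

```python
# Hardware FLOPS utilization: sustained throughput as a fraction of peak.
peak_per_chip_tflops = 275   # TPU v4 bf16 peak, per the table above
chips = 6144
hfu = 0.578                  # reported hardware FLOPS utilization
peak_pflops = peak_per_chip_tflops * chips / 1e3
sustained_pflops = hfu * peak_pflops
print(f"{sustained_pflops:.0f} PFLOPS sustained of {peak_pflops:.0f} PFLOPS peak")
# -> roughly 977 PFLOPS sustained out of about 1,690 PFLOPS peak
```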
The primary alternative to TPU Pods for large-scale ML training is NVIDIA GPU clusters connected via NVLink and InfiniBand. The two approaches differ in several respects:
| Aspect | TPU Pod | GPU cluster (NVIDIA) |
|---|---|---|
| Intra-node interconnect | ICI (custom torus, direct chip-to-chip) | NVLink + NVSwitch (within DGX node) |
| Inter-node interconnect | ICI continues across the Pod via OCS | InfiniBand / RoCE between nodes |
| Topology | 2D or 3D torus | Fat-tree (typically via InfiniBand switches) |
| Programming model | XLA/GSPMD (compiler-driven sharding) | CUDA / NCCL (explicit collectives) |
| Availability | Google Cloud only | Available for purchase; all major clouds |
| Max single-system scale | 9,216 chips (Ironwood Pod) | 72 GPUs per NVLink domain (e.g., GB200 NVL72); larger via InfiniBand |
| Software ecosystem | JAX, TensorFlow, PyTorch/XLA | CUDA ecosystem (broad third-party support) |
TPU Pods have a structural advantage in that ICI provides a uniform, high-bandwidth fabric across the entire Pod without the bandwidth bottleneck that occurs between nodes in GPU clusters. In a GPU cluster, NVLink provides very high bandwidth within a single multi-GPU node (e.g., 900 GB/s per GPU in DGX H100), but communication between nodes drops to InfiniBand speeds (typically 400 Gb/s per port). In a TPU Pod, every chip-to-chip link uses ICI regardless of physical distance within the Pod.
GPU clusters, on the other hand, benefit from a much larger and more mature software ecosystem, broader availability across cloud providers, and the ability to be purchased as on-premises hardware.
Cloud TPU Pods are billed on a per-chip-hour basis. Pricing varies by generation, region, and commitment level:
| TPU generation | On-demand price (approx.) | 1-year committed | 3-year committed |
|---|---|---|---|
| TPU v5e | ~$1.20/chip/hour | ~25-30% discount | ~40-45% discount |
| TPU v5p | ~$1.92/chip/hour | Discounted | Discounted |
| TPU v6e (Trillium) | ~$1.38/chip/hour | Discounted | As low as ~$0.39/chip/hour |
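A quick sketch of how per-chip-hour billing adds up, using a hypothetical helper and the approximate on-demand rates above (actual prices vary by region and commitment):

```python
# Estimate the cost of a TPU slice billed per chip-hour.
def slice_cost(chips: int, price_per_chip_hour: float, hours: float) -> float:
    return chips * price_per_chip_hour * hours

# e.g. a full 256-chip v5e Pod for a 24-hour training run at ~$1.20/chip/hour:
print(f"${slice_cost(256, 1.20, 24):,.2f}")   # -> $7,372.80
```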
Google also offers spot (preemptible) pricing at significant discounts for fault-tolerant workloads, and queued resources for users who can wait for availability.
TPU Pods are available in select Google Cloud regions. Primary regions (such as us-central1 and us-east1) typically offer the broadest selection of TPU types and the largest Pod configurations. Newer generations like Trillium and Ironwood are initially available in a limited set of regions before expanding. Access to larger Pod configurations and newer hardware often requires quota approval or an enterprise agreement with Google Cloud.
Google's TPU Research Cloud program provides free Cloud TPU access to academic researchers and open-source developers. Accepted participants receive temporary quota for TPU v4 and newer hardware. In exchange, researchers share their work through peer-reviewed publications, open-source code, or blog posts. The TRC has supported research across natural language processing, protein structure prediction, climate modeling, and many other fields.