# TPU Pod

> Source: https://aiwiki.ai/wiki/tpu_pod
> Updated: 2026-06-23
> Categories: AI Hardware, AI Infrastructure, Google, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Tensor Processing Unit (TPU)](/wiki/tpu), [Cloud TPU](/wiki/cloud_tpu), [Data parallelism](/wiki/data_parallelism), [Model parallelism](/wiki/model_parallelism), [Distributed training](/wiki/distributed_training)*

## Overview

A **TPU Pod** is a single Google supercomputer built from many [Tensor Processing Unit (TPU)](/wiki/tpu) chips wired directly to each other by a high-speed Inter-Chip Interconnect (ICI) fabric arranged as a 2D or 3D torus, so that software addresses the whole cluster as one accelerator for [machine learning](/wiki/machine_learning) training and [inference](/wiki/inference). Pod size grows with each TPU generation: a TPU v4 Pod holds 4,096 chips, a TPU v5p Pod holds 8,960 chips, and the seventh-generation [Ironwood (TPU v7)](/wiki/tpu_ironwood) Pod holds 9,216 chips delivering 42.5 exaFLOPS of FP8 compute, the largest TPU Pod Google has built to date.[2][9][12] TPU Pods are the machines on which Google trains its largest models, including [PaLM](/wiki/palm) and the [Gemini](/wiki/gemini) family.[4]

Google describes the top Ironwood configuration in concrete terms: "When scaled to 9,216 chips per pod for a total of 42.5 Exaflops, Ironwood supports more than 24x the compute power of the world's largest supercomputer, El Capitan."[12] Rather than treating each chip as a separate device, a TPU Pod lets the [XLA](/wiki/xla) compiler view the entire cluster as one unified computational resource, enabling efficient [distributed training](/wiki/distributed_training) of [large language models](/wiki/large_language_model) and other compute-intensive AI systems.

Google introduced the TPU Pod concept with [TPU](/wiki/tpu) v2 in 2017, when the company first connected multiple TPU chips via custom ICI links arranged in a 2D torus topology.[14] Each generation since has expanded Pod scale and interconnect sophistication, from 256 chips in the first Pods to 9,216 chips and 42.5 exaFLOPS of FP8 compute in [Ironwood](/wiki/tpu_ironwood) (TPU v7).[12] Models trained on TPU Pods include [PaLM](/wiki/palm), [Gemini](/wiki/gemini), and [BERT](/wiki/bert).[4]

A TPU Pod differs from a conventional [GPU](/wiki/gpu) cluster in a fundamental way: the chips communicate through a dedicated, low-latency torus network rather than through general-purpose data center switches. This design enables collective operations (such as [all-reduce](/wiki/all_reduce)) to run with minimal overhead, which is particularly important when thousands of chips must synchronize gradient updates during training.

## ELI5 (Explain like I'm 5)

Imagine you have a giant jigsaw puzzle with millions of pieces. If you try to solve it by yourself, it could take weeks. But what if you got a whole classroom of friends to help? Each friend works on a section of the puzzle, and whenever they need a piece from someone else's section, they can pass it over quickly because they are all sitting at the same big table.

A TPU Pod works the same way. Each TPU chip is like one of those friends, and the special wires connecting them (called the "Inter-Chip Interconnect") are like the table that lets them pass puzzle pieces to each other really fast. Because the chips are all connected directly to their neighbors, they can share information almost instantly, so the whole group finishes the puzzle much faster than any single chip could on its own.

## How did the TPU Pod evolve across generations?

### TPU v1: single-chip inference (2015)

The first [TPU](/wiki/tpu) was an inference-only chip with no inter-chip network. Each TPU v1 sat on its own PCIe card inside a server, operating independently.[1] There was no concept of a "Pod" at this stage; the v1 was designed to accelerate neural network inference for production services such as Google Search, Google Translate, and the [AlphaGo](/wiki/alphago) system.[1]

### TPU v2: the first Pods (2017)

Announced at Google I/O in May 2017, TPU v2 was the first generation to support both training and inference.[14] It was also the first to introduce the Inter-Chip Interconnect (ICI), a custom high-speed bidirectional link connecting each chip directly to four neighbors in a **2D torus** topology. Groups of four chips were packaged into modules delivering 180 TFLOPS. Sixty-four of these modules formed a 256-chip Pod with a peak throughput of approximately 11.5 petaFLOPS.[14]

The 2D torus meant that each chip could communicate with its north, south, east, and west neighbors, and wraparound links connected chips on opposite edges of the grid. This was the architectural foundation on which all subsequent Pods were built.

### TPU v3: scaling to 1,024 chips (2018)

TPU v3 retained the 2D torus ICI topology but increased per-link bandwidth and raised per-chip performance from 45 to 123 TFLOPS (bf16), more than double that of v2. Pods scaled from 256 to 1,024 chips, and the aggregate Pod throughput exceeded 100 petaFLOPS.[14] The higher power density of v3 chips required liquid cooling for the first time in Google's TPU program. Google submitted TPU v3 Pod configurations to the MLPerf v0.6 training benchmark, demonstrating the scalability of industry-standard ML models across 1,024 chips.[14]

### TPU v4: 3D torus and optical switches (2021)

TPU v4 was a major architectural leap. The interconnect moved from a 2D torus to a **3D torus**, where each chip connects to six neighbors along three axes instead of four neighbors along two. For N chips, this reduces the network diameter (the worst-case hop count) from roughly the square root of N in a square 2D torus to roughly 1.5 times the cube root of N in a cubic 3D torus, lowering worst-case communication latency.

The most significant innovation was the introduction of **Optical Circuit Switches (OCS)**, making TPU v4 the first supercomputer with a dynamically reconfigurable interconnect.[2] Pods scaled to 4,096 chips and delivered 1.1 exaFLOPS (bf16).[2] The TPU v4 system was described in a paper presented at ISCA 2023.[2]

### TPU v5e and v5p: cost and performance tiers (2023)

Google split the fifth generation into two products. TPU v5e was a cost-efficient chip using a 2D torus with Pods of up to 256 chips, aimed at inference and moderate-scale training. TPU v5p was the performance variant, scaling to 8,960 chips in a 3D torus and delivering approximately 4.1 exaFLOPS (bf16) across a full Pod, or 8,960 chips at 459 TFLOPS each.[9][11] Google states that each v5p Pod "composes together 8,960 chips over our highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology."[11]

### TPU v6e / Trillium (2024)

Trillium doubled ICI bandwidth over v5e and achieved 918 TFLOPS per chip (bf16). Pods scale to 256 chips in a 2D torus.[7] Google reported that 100,000 Trillium chips can be connected within a single Jupiter data center fabric with 13 petabits per second of bisection bandwidth.[7]

### TPU v7 / Ironwood (2025)

Ironwood represents the largest TPU Pod to date.[12] A single rack of hosts contains 64 chips arranged as a 4x4x4 "cube" connected in a 3D torus. Multiple cubes are linked through OCS connections to form a full Pod of 9,216 chips with 42.5 exaFLOPS of FP8 compute.[12][13] Each Ironwood chip carries 192 GB of HBM3E and an ICI bandwidth of 9.6 Tb/s (1.2 TB/s) per chip, and a full Pod exposes about 1.77 PB of directly accessible HBM.[12][13]

## What are the Pod specifications by generation?

The following table summarizes key Pod-level specifications across TPU generations:

| Generation | Year | Max chips per Pod | Topology | Per-chip bf16 TFLOPS | Per-chip HBM | ICI bandwidth per chip | Pod peak compute |
|---|---|---|---|---|---|---|---|
| TPU v2 | 2017 | 256 | 2D torus | 45 | 16 GB | N/A | 11.5 PFLOPS |
| TPU v3 | 2018 | 1,024 | 2D torus | 123 | 32 GB | N/A | ~100 PFLOPS |
| TPU v4 | 2021 | 4,096 | 3D torus + OCS | 275 | 32 GB | N/A | 1.1 EFLOPS |
| TPU v5e | 2023 | 256 | 2D torus | 197 | 16 GB | 400 GB/s | 50.6 PFLOPS |
| TPU v5p | 2023 | 8,960 | 3D torus | 459 | 95 GB | 1,200 GB/s | ~4.1 EFLOPS |
| TPU v6e (Trillium) | 2024 | 256 | 2D torus | 918 | 32 GB | 800 GB/s | 234.9 PFLOPS |
| TPU v7 (Ironwood) | 2025 | 9,216 | 3D torus + OCS | 2,307 (bf16) / 4,614 (FP8) | 192 GB | 1,200 GB/s | 42.5 EFLOPS (FP8) |
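
The Pod-level figures follow from multiplying the chip count by the per-chip value. The short Python sketch below reproduces a few of them from the table's own numbers; the function names are illustrative, and published figures can differ slightly from the nominal product because of rounding.

```python
def pod_peak_pflops(chips: int, tflops_per_chip: float) -> float:
    """Nominal Pod peak in petaFLOPS: chip count times per-chip TFLOPS."""
    return chips * tflops_per_chip / 1_000  # 1 PFLOPS = 1,000 TFLOPS


def pod_hbm_pb(chips: int, hbm_gb_per_chip: float) -> float:
    """Aggregate HBM in petabytes (decimal units: 1 PB = 1,000,000 GB)."""
    return chips * hbm_gb_per_chip / 1_000_000


# Values copied from the table above.
print(pod_peak_pflops(256, 45))       # TPU v2, bf16
print(pod_peak_pflops(4_096, 275))    # TPU v4, bf16
print(pod_peak_pflops(9_216, 4_614))  # Ironwood, FP8
print(pod_hbm_pb(9_216, 192))         # Ironwood aggregate HBM
```

The results come to about 11.5 PFLOPS, 1.13 EFLOPS, 42.5 EFLOPS, and 1.77 PB respectively.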

## How does the interconnect architecture work?

### Inter-Chip Interconnect (ICI)

The ICI is a custom high-speed serial link that directly connects neighboring TPU chips.[7] Unlike GPU clusters that route inter-accelerator traffic through PCIe switches, NVLink bridges, or InfiniBand fabrics, ICI provides a direct chip-to-chip path with microsecond-scale latency and terabit-per-second bandwidth.[15] No host CPU is involved in ICI communication; the TPU hardware handles data movement autonomously.[7]

Key characteristics of ICI:

- **Nearest-neighbor connectivity.** Each chip connects only to its immediate neighbors in the torus. In a 2D torus, each chip has 4 ICI links; in a 3D torus, each chip has 6.
- **Multi-hop routing.** Communication between non-adjacent chips is routed through intermediate chips. The torus wraparound links cut the maximum hop count in half compared to a simple mesh.
- **Bandwidth hierarchy.** On TPU v5p, for example, ICI bandwidth per axis is approximately 90 GB/s, for a total of roughly 270 GB/s across three axes. This is much lower than the chip's HBM bandwidth (2,765 GB/s) but significantly higher than the data center network bandwidth (~6.25 GB/s per chip).[15]
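
To make the hierarchy concrete, the sketch below estimates how long a fixed payload would take to move at each tier, using the approximate TPU v5p figures quoted above. It is an idealized calculation that ignores latency and protocol overhead, and the 1 GB payload is an arbitrary illustration.

```python
# Approximate per-chip bandwidths for TPU v5p quoted above, in GB/s.
BANDWIDTH_GB_PER_S = {
    "HBM": 2765.0,
    "ICI (3 axes)": 270.0,
    "DCN": 6.25,
}


def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time in milliseconds (bandwidth only, no latency)."""
    return payload_gb / bandwidth_gb_per_s * 1_000


for tier, bandwidth in BANDWIDTH_GB_PER_S.items():
    print(f"{tier:>13}: {transfer_time_ms(1.0, bandwidth):8.2f} ms per GB")
```

With these inputs, ICI is about 43 times faster than DCN and about 10 times slower than HBM, which is why collectives are kept on ICI whenever the topology allows.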

### 2D torus topology

TPU v2, v3, v5e, and v6e use a 2D torus. In this layout, chips are arranged in a rectangular grid with wraparound links connecting each edge to the opposite edge. Each chip connects to four neighbors: north, south, east, and west.

On TPU v5e and v6e, the largest 2D torus Pods are 16x16 (256 chips); the earlier TPU v3 Pod reached 1,024 chips. This topology is simpler to program and sufficient for workloads that can be partitioned along two dimensions, such as data parallelism across one axis and model parallelism across the other.
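
The wraparound rule is modular arithmetic on the grid coordinates. A minimal sketch follows; the coordinate convention and function name are assumptions made for illustration.

```python
def torus_neighbors_2d(x: int, y: int, width: int, height: int) -> list[tuple[int, int]]:
    """Return the four ICI neighbors of chip (x, y) in a width x height 2D torus."""
    return [
        (x, (y - 1) % height),  # north
        (x, (y + 1) % height),  # south
        ((x + 1) % width, y),   # east
        ((x - 1) % width, y),   # west
    ]


# Corner chip of a 16x16 grid: two of its links wrap to the opposite edges.
print(torus_neighbors_2d(0, 0, 16, 16))
```

For the corner chip this prints `[(0, 15), (0, 1), (1, 0), (15, 0)]`: the north and west neighbors sit on the far edges of the grid.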

### 3D torus topology

TPU v4, v5p, and v7 (Ironwood) use a 3D torus, adding a third axis with wraparound links. Each chip connects to six neighbors. The three-dimensional layout provides several advantages over a 2D torus:

- **Lower network diameter.** For a cubic 3D torus of N chips, the maximum hop count scales as about 1.5 times the cube root of N, compared to about the square root of N for a square 2D torus. At 4,096 chips, this reduces maximum hops from 64 (a 64x64 2D torus) to 24 (a 16x16x16 3D torus).
- **Higher bisection bandwidth.** The 3D torus has more links crossing any bisecting plane, providing greater bandwidth for collective operations like all-reduce.
- **Three axes for parallelism.** The three physical dimensions can be mapped to data parallelism, model (tensor) parallelism, and pipeline parallelism simultaneously.
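
Worst-case hop counts can be computed directly from the grid dimensions. The sketch below compares a grid without wraparound links (a mesh) with a torus of the same shape, for 4,096 chips laid out in 2D and in 3D. The function names are illustrative.

```python
def mesh_diameter(dims: tuple[int, ...]) -> int:
    """Worst-case hop count without wraparound links: sum of (k - 1) per axis."""
    return sum(k - 1 for k in dims)


def torus_diameter(dims: tuple[int, ...]) -> int:
    """Worst-case hop count with wraparound links: sum of floor(k / 2) per axis."""
    return sum(k // 2 for k in dims)


# 4,096 chips arranged as a 64x64 2D grid versus a 16x16x16 3D grid.
for dims in [(64, 64), (16, 16, 16)]:
    print(dims, "mesh:", mesh_diameter(dims), "torus:", torus_diameter(dims))
```

The 2D layout needs up to 126 hops as a mesh and 64 as a torus; the 3D layout needs 45 and 24. Wraparound roughly halves the diameter, and the third axis reduces it further.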

The basic building block of a 3D torus Pod is a **cube** of 4x4x4 = 64 chips. Cubes are connected to each other via wraparound links, which in TPU v4 and Ironwood pass through Optical Circuit Switches.[2]

### Twisted torus

TPU v4 introduced a **twisted torus** variant, in which the wraparound links are shifted: instead of connecting a chip on one face of the grid to the chip directly opposite it, a wraparound link lands on a chip offset by a fixed amount along another axis.[2] The twisted torus increases bisection bandwidth by approximately 70% for certain slice shapes (for example, a 4x4x8 twisted torus vs. a standard 4x4x8 torus).[2] Because the twist is implemented through OCS routing tables rather than physical rewiring, users can choose between a standard torus and a twisted torus for any given workload.
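
A small 2D toy can show why shifting the wraparound links helps on non-cubic shapes. The sketch below builds an 8x4 torus and a twisted variant whose short-axis wraparound is shifted by 4 positions along the long axis, then measures the worst-case hop count with breadth-first search. The dimensions and offset are illustrative assumptions rather than TPU v4's actual wiring, and the toy measures hop count, not bisection bandwidth.

```python
from collections import deque


def neighbors(x: int, y: int, width: int, height: int, twist: int) -> list[tuple[int, int]]:
    """Neighbors in a width x height torus whose y-wraparound shifts x by `twist`."""
    result = [((x + 1) % width, y), ((x - 1) % width, y)]
    # Moving up: crossing the top edge lands on row 0, shifted by +twist.
    result.append((x, y + 1) if y + 1 < height else ((x + twist) % width, 0))
    # Moving down: crossing the bottom edge lands on the top row, shifted by -twist.
    result.append((x, y - 1) if y > 0 else ((x - twist) % width, height - 1))
    return result


def diameter(width: int, height: int, twist: int = 0) -> int:
    """Worst-case shortest-path hop count, via breadth-first search from chip (0, 0).

    Both topologies look the same from every chip, so one source suffices.
    """
    dist = {(0, 0): 0}
    queue = deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        for nxt in neighbors(x, y, width, height, twist):
            if nxt not in dist:
                dist[nxt] = dist[(x, y)] + 1
                queue.append(nxt)
    return max(dist.values())


print("standard 8x4 torus:", diameter(8, 4))
print("twisted 8x4 torus: ", diameter(8, 4, twist=4))
```

For this example the search reports 6 hops for the standard torus and 4 for the twisted one, using exactly the same number of links.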

### Optical Circuit Switches (OCS)

OCS technology is one of the most distinctive features of TPU v4 and Ironwood Pods.[2] An OCS uses arrays of tiny mirrors built with **Micro-Electro-Mechanical Systems (MEMS)** technology to steer optical signals between fiber-optic cables.[6] Switching happens in milliseconds and requires no optical-to-electrical conversion inside the switch, since the signals remain in the optical domain throughout.[6]

The OCS layer sits between cubes. Within a cube, chips are connected via direct electrical ICI links in a 3D mesh. The OCS provides the wraparound links that turn this mesh into a torus, and it can dynamically reconfigure which cubes are connected to each other.

Benefits of OCS include:

| Benefit | Description |
|---|---|
| Fault tolerance | If a chip, cable, or OCS port fails, the fabric manager reconfigures optical paths to bypass the fault. Jobs continue on healthy hardware without manual intervention. |
| Flexible partitioning | A single physical Pod can be subdivided into multiple independent slices for different users or workloads. |
| Topology selection | Users can select standard torus, twisted torus, or other topologies through software configuration. |
| Low cost and power | OCS infrastructure accounts for less than 5% of total system cost and less than 3% of total system power, far cheaper and more efficient than electrical switching (e.g., InfiniBand).[2] |

### Data center network (DCN)

For workloads that span multiple Pods or multiple slices, TPU hosts communicate through Google's data center network.[7] DCN bandwidth per chip is much lower than ICI bandwidth (roughly 6.25 GB/s per chip on v5p, compared to ~270 GB/s for ICI), so the [XLA](/wiki/xla) compiler and runtime schedule DCN communication carefully to overlap it with computation.[15]

## What are slices and multislice?

### Slices

A **slice** is a contiguous set of TPU chips within a single Pod, all connected via ICI. Slices are the unit of allocation in [Cloud TPU](/wiki/cloud_tpu): when a user requests TPU resources, they receive a slice of a specific topology (for example, a v4-128 slice is a 4x4x4 cube of 64 chips assigned as 128 TensorCores, since each v4 chip has two TensorCores).[7][8]

Slice sizes vary by TPU generation. For TPU v4, slice configurations range from v4-8 (4 chips) up to the full 4,096-chip Pod, which is named v4-8192 because the suffix counts TensorCores rather than chips.[8] For TPU v5p, configurations range from 4 chips (2x2x1) up to 6,144 chips (16x16x24), which is the largest schedulable job size. The full v5p Pod of 8,960 chips contains additional spare cubes used for fault tolerance.[9]
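
The relationship between slice names, topologies, and chip counts is simple arithmetic. The helper functions below are a hypothetical sketch, not part of any Google Cloud SDK; they assume the v4 convention described above, where the name suffix counts TensorCores and each chip has two.

```python
def v4_slice_chips(slice_name: str) -> int:
    """Chips in a TPU v4 slice name such as 'v4-128' (suffix counts TensorCores)."""
    generation, cores = slice_name.split("-")
    if generation != "v4":
        raise ValueError(f"expected a v4 slice name, got {slice_name!r}")
    return int(cores) // 2  # each v4 chip has two TensorCores


def topology_chips(topology: str) -> int:
    """Chips in a topology string such as '4x4x4' (product of the dimensions)."""
    count = 1
    for dim in topology.split("x"):
        count *= int(dim)
    return count


print(v4_slice_chips("v4-8"), v4_slice_chips("v4-128"))     # 4 64
print(topology_chips("4x4x4"), topology_chips("16x16x24"))  # 64 6144
```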

### Multislice

Cloud TPU **Multislice** is a scaling technology that allows a single training job to span multiple slices. Chips within each slice communicate via ICI as usual, while chips in different slices exchange data through DCN by routing traffic through host CPUs.[10]

The [XLA](/wiki/xla) compiler automatically generates the inter-slice DCN communication code. Developers do not need to write explicit networking logic; the compiler inserts the necessary collective operations and overlaps them with computation to hide latency.[10]
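
One way to combine the two network tiers is a hierarchical all-reduce: sum within each slice first, exchange only the per-slice partial sums between slices, then distribute the result. The toy simulation below uses plain lists in place of chips. It illustrates the idea under that assumption and does not describe the code XLA actually emits.

```python
def hierarchical_all_reduce(slices: list[list[float]]) -> list[list[float]]:
    """Sum one value per chip across all slices in two stages.

    `slices[s][c]` is the value held by chip `c` of slice `s`.
    """
    # Stage 1 (ICI in a real system): reduce within each slice.
    slice_sums = [sum(chip_values) for chip_values in slices]
    # Stage 2 (DCN): combine one partial sum per slice, so only
    # len(slices) values cross the slower network.
    total = sum(slice_sums)
    # Stage 3 (ICI): every chip in every slice receives the global sum.
    return [[total for _ in chip_values] for chip_values in slices]


gradients = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
print(hierarchical_all_reduce(gradients))
```

Every chip ends up holding 110.0, and only two partial sums cross the slice boundary instead of eight per-chip values.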

Multislice scaling has been demonstrated at very large scale. Google reported near-linear scaling across 50,944 TPU v5e chips (199 v5e Pods of 256 chips each) while training a 128B-parameter LLM, the largest publicly disclosed distributed LLM training job at the time.[10] Google states that Multislice "can offer near-linear scaling performance from single slice to multiple slices with up to tens of thousands of chips."[10]

| Feature | Single slice | Multislice |
|---|---|---|
| Communication | ICI only | ICI within slice, DCN between slices |
| Max scale (v5p) | 6,144 chips | 18,432+ chips |
| Max scale (v5e) | 256 chips | 50,944+ chips demonstrated |
| Latency | Microseconds (ICI) | Higher (DCN adds latency) |
| Programming model | Transparent (SPMD) | Transparent (XLA-managed) |

## How are TPU Pods programmed?

### XLA and GSPMD

The primary compiler for TPU Pods is [XLA (Accelerated Linear Algebra)](/wiki/xla), a domain-specific compiler that translates high-level ML operations into optimized TPU machine code. XLA performs whole-program analysis, fusing operations, tiling computations to fit in on-chip memory, and scheduling data transfers to keep the hardware busy.

For distributed workloads on TPU Pods, XLA uses **GSPMD (General-purpose Single Program Multiple Data)**, a partitioning pass that automatically shards a computation across all chips in the Pod. Developers annotate tensors with sharding specifications (for example, "shard this tensor's batch dimension across the first axis of the Pod"), and GSPMD transforms the single-device program into a distributed one, inserting the correct collective communication operations.[5]

This approach means developers can write code as if it will run on a single large device. The compiler handles:

- Partitioning the model and data across chips
- Inserting [all-reduce](/wiki/all_reduce), all-gather, and reduce-scatter collectives
- Overlapping communication with computation
- Mapping logical tensor dimensions to physical Pod topology axes
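
A toy model can show what a sharding annotation means in practice. Given a global tensor shape, a device mesh with named axes, and a spec that maps each tensor dimension to a mesh axis (or `None` for replication), the sketch computes the shard each chip holds. The function and axis names are hypothetical, and this is not the GSPMD or JAX API.

```python
from typing import Optional


def shard_shape(
    global_shape: tuple[int, ...],
    mesh: dict[str, int],
    spec: tuple[Optional[str], ...],
) -> tuple[int, ...]:
    """Per-device shard shape for a tensor partitioned over a named device mesh.

    `spec[i]` names the mesh axis that splits tensor dimension i, or None
    if that dimension is replicated on every device.
    """
    if len(spec) != len(global_shape):
        raise ValueError("spec must have one entry per tensor dimension")
    local = []
    for size, axis in zip(global_shape, spec):
        parts = 1 if axis is None else mesh[axis]
        if size % parts != 0:
            raise ValueError(f"dimension of size {size} not divisible by {parts}")
        local.append(size // parts)
    return tuple(local)


mesh = {"data": 16, "model": 16}  # a 16x16 grid of 256 chips
# Activations: batch split over "data", features replicated.
print(shard_shape((4096, 8192), mesh, ("data", None)))    # (256, 8192)
# Weights: rows replicated, columns split over "model".
print(shard_shape((8192, 32768), mesh, (None, "model")))  # (8192, 2048)
```

The program is still written against the global shapes; the compiler's job is to derive the per-chip shapes and the collectives needed to keep the results consistent.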

### Framework support

TPU Pods are supported by three major ML frameworks:

| Framework | TPU Pod mechanism | Distributed training approach |
|---|---|---|
| [JAX](/wiki/jax) | Native XLA, `jit` + `shard_map`/`pjit` | GSPMD sharding annotations; write for one device, compiler distributes |
| [TensorFlow](/wiki/tensorflow) | `tf.distribute.TPUStrategy` | Data parallelism and model parallelism via distribution strategies |
| [PyTorch](/wiki/pytorch) | PyTorch/XLA with SPMD | XLA-based sharding; FSDP and tensor parallelism supported |

[JAX](/wiki/jax) is the most commonly used framework for TPU Pod workloads at Google. Its functional programming model aligns naturally with XLA's compilation requirements, and JAX transformations like `jit`, `vmap`, `pmap`, and `shard_map` map directly to Pod topologies.[5]

### Parallelism strategies

TPU Pods support all standard distributed training strategies:

- **[Data parallelism](/wiki/data_parallelism).** The model is replicated across chips, and each chip processes a different mini-batch. Gradients are synchronized via all-reduce over ICI. This is the simplest and most common strategy.
- **[Model parallelism](/wiki/model_parallelism) (tensor parallelism).** Individual layers are split across chips, with each chip holding a slice of the weight tensors. Activations are communicated between chips during the forward and backward passes.
- **Pipeline parallelism.** Different layers of the model are assigned to different groups of chips. Micro-batches flow through the pipeline, with each group processing a different micro-batch at any given time.
- **Expert parallelism.** Used in [Mixture of Experts (MoE)](/wiki/mixture_of_experts) architectures, where different expert sub-networks reside on different chips and a gating network routes tokens to the appropriate expert.

In practice, large-scale training on TPU Pods combines multiple strategies. For example, PaLM 540B used a combination of data parallelism and model parallelism across 6,144 TPU v4 chips spanning two Pods.[4]
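
The gradient synchronization step of data parallelism can be illustrated with a ring all-reduce, a well-known algorithm in which each worker exchanges data only with its neighbor. The simulation below is a sketch with one number per chunk. It is not a description of the collectives TPUs actually run, which are tuned to the torus.

```python
def ring_all_reduce(grads: list[list[float]]) -> list[list[float]]:
    """Simulate a ring all-reduce over n workers, each holding n chunks.

    grads[w][c] is worker w's value for chunk c. Returns the per-worker
    buffers after the reduction; every worker ends with the chunk-wise sums.
    """
    n = len(grads)
    buf = [list(g) for g in grads]

    # Reduce-scatter: in step s, worker w sends chunk (w - s) % n to its
    # right-hand neighbor, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(w, (w - s) % n, buf[w][(w - s) % n]) for w in range(n)]
        for w, c, value in sends:
            buf[(w + 1) % n][c] += value

    # All-gather: worker w now owns the finished chunk (w + 1) % n and
    # passes completed chunks around the ring, overwriting stale copies.
    for s in range(n - 1):
        sends = [(w, (w + 1 - s) % n, buf[w][(w + 1 - s) % n]) for w in range(n)]
        for w, c, value in sends:
            buf[(w + 1) % n][c] = value

    return buf


print(ring_all_reduce([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]))
```

Each of the three workers finishes with `[111.0, 222.0, 333.0]` after 2(n - 1) = 4 neighbor exchanges.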

## How do TPU Pods handle hardware failures?

Operating thousands of chips continuously for days or weeks of training requires robust fault-handling mechanisms. At the scale of a TPU v4 Pod, hardware failures are not rare events but routine occurrences.[3]

### Failure rates

A study by Zu et al., presented at NSDI 2024, reported the following daily failure rates for Google's TPU v4 supercomputers:[3]

| Component | Daily failure rate |
|---|---|
| TPU machines | 0.08% |
| ICI cables | 0.005% |
| Optical circuit switches | 0.04% |

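At Pod scale these rates make failures a routine event: applying the 0.08% daily rate to 4,096 units gives roughly 3 failures per day, or closer to one per day if the rate is counted per four-chip host machine.[3] Without automated recovery, such failures would cause frequent training interruptions.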

### Automated recovery

Google's TPU infrastructure uses several techniques to maintain high availability:

- **OCS reconfiguration.** When the fabric manager detects a failed chip or link, it reconfigures the optical switches to route around the fault, connecting healthy cubes into a new slice that excludes the defective hardware. This happens without human intervention.[3]
- **ICI resiliency.** For TPU v4, v5p, and Ironwood, ICI connections can be routed around OCS and optical ICI faults, maintaining slice availability with only temporary performance degradation.[3]
- **Spare cubes.** TPU v5p Pods contain spare cubes beyond the schedulable maximum. When a cube fails, a spare replaces it.
- **Checkpointing.** Training frameworks save model state to persistent storage at regular intervals (typically every few minutes). When a failure occurs, training restarts from the most recent checkpoint rather than from scratch.
- **In-memory redundancy.** For training Gemini, Google used redundant in-memory copies of model state distributed across replicas. On hardware failure, intact replicas provided the model state for recovery, avoiding the latency of reading checkpoints from storage.
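
The spare-cube mechanism can be sketched with the v5p numbers given earlier: a full Pod of 8,960 chips is 140 cubes of 64 chips, and the largest schedulable slice of 6,144 chips needs 96 of them. The first-fit selection policy below is a hypothetical simplification; a real fabric manager must also respect topology and OCS port constraints.

```python
CHIPS_PER_CUBE = 64  # a 4x4x4 cube


def pick_healthy_cubes(total_cubes: int, needed: int, failed: set[int]) -> list[int]:
    """Choose `needed` healthy cube ids out of `total_cubes`, skipping failed ones."""
    healthy = [cube for cube in range(total_cubes) if cube not in failed]
    if len(healthy) < needed:
        raise RuntimeError("not enough healthy cubes for the requested slice")
    return healthy[:needed]


total = 8_960 // CHIPS_PER_CUBE   # 140 cubes in a full v5p Pod
needed = 6_144 // CHIPS_PER_CUBE  # 96 cubes in the largest schedulable slice
print(total, needed, total - needed)  # 140 96 44

# If cubes 3 and 17 fail, the slice is rebuilt from the next healthy cubes.
chosen = pick_healthy_cubes(total, needed, failed={3, 17})
print(len(chosen), 3 in chosen, 17 in chosen)  # 96 False False
```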

The Zu et al. study reported that TPU v4 supercomputers achieve **99.98% system availability** through these automated mechanisms, with hardware outages affecting approximately 1% of training jobs.[3]

## What models are trained on TPU Pods?

The following table lists major models and systems trained on TPU Pods:

| Model | Year | TPU generation | Pod scale | Notes |
|---|---|---|---|---|
| [BERT](/wiki/bert) | 2018 | TPU v3 | 16 chips | Trained in 4 days; pre-training that transformed [NLP](/wiki/natural_language_processing) |
| [T5](/wiki/t5) | 2019 | TPU v3 | 1,024 chips | Text-to-text framework; explored [scaling laws](/wiki/scaling_laws) |
| [AlphaFold 2](/wiki/alphafold) | 2020 | TPU v3 | 128 chips | Solved protein structure prediction; won CASP14 |
| [LaMDA](/wiki/lamda) | 2021 | TPU v3 | 1,024 chips | Conversational model that powered early Google Bard |
| [PaLM](/wiki/palm) | 2022 | TPU v4 | 6,144 chips (2 Pods) | 540B parameters; first large-scale use of Pathways system; 57.8% hardware FLOPS utilization |
| [Gemini](/wiki/gemini) | 2023 | TPU v4 / v5p | Multi-Pod | Google's flagship multimodal model family |
| [Gemma](/wiki/gemma) | 2024 | TPU v5e | N/A | Open-weights model family |

PaLM is a particularly instructive example of TPU Pod usage. The 540B-parameter model was trained across two TPU v4 Pods, each with 3,072 chips (for a total of 6,144 chips), using the Pathways system to coordinate computation across Pods. The training achieved 57.8% hardware FLOPS utilization, the highest figure reported for LLM training at that scale at the time of publication.[4]

## How do TPU Pods differ from GPU clusters?

The primary alternative to TPU Pods for large-scale ML training is [NVIDIA](/wiki/nvidia) GPU clusters connected via NVLink and InfiniBand. The two approaches differ in several respects:

| Aspect | TPU Pod | [GPU](/wiki/gpu) cluster ([NVIDIA](/wiki/nvidia)) |
|---|---|---|
| Intra-node interconnect | ICI (custom torus, direct chip-to-chip) | NVLink + NVSwitch (within DGX node) |
| Inter-node interconnect | ICI continues across the Pod via OCS | InfiniBand / RoCE between nodes |
| Topology | 2D or 3D torus | Fat-tree (typically via InfiniBand switches) |
| Programming model | XLA/GSPMD (compiler-driven sharding) | CUDA / NCCL (explicit collectives) |
| Availability | Google Cloud only | Available for purchase; all major clouds |
| Max single-system scale | 9,216 chips (Ironwood Pod) | 72 GPUs per NVLink domain (e.g., GB200 NVL72); larger via InfiniBand |
| Software ecosystem | [JAX](/wiki/jax), [TensorFlow](/wiki/tensorflow), [PyTorch](/wiki/pytorch)/XLA | CUDA ecosystem (broad third-party support) |

TPU Pods have a structural advantage in that ICI provides a uniform, high-bandwidth fabric across the entire Pod without the bandwidth bottleneck that occurs between nodes in GPU clusters.[15] In a GPU cluster, NVLink provides very high bandwidth within a single multi-GPU node (e.g., 900 GB/s per GPU in DGX H100), but communication between nodes drops to InfiniBand speeds (typically 400 Gb/s per port). In a TPU Pod, every chip-to-chip link uses ICI regardless of physical distance within the Pod.
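
The size of that step is easier to see once both figures are in the same unit, since NVLink is quoted in gigabytes per second and InfiniBand in gigabits per second. A short sketch, comparing per-GPU NVLink bandwidth with a single InfiniBand port:

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


nvlink_gb_s = 900.0                           # per GPU, DGX H100
infiniband_gb_s = gbit_to_gbyte_per_s(400.0)  # one 400 Gb/s port
print(infiniband_gb_s)                # 50.0
print(nvlink_gb_s / infiniband_gb_s)  # 18.0
```

Under these assumptions, traffic that leaves the node sees about one-eighteenth of the intra-node bandwidth.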

GPU clusters, on the other hand, benefit from a much larger and more mature software ecosystem, broader availability across cloud providers, and the ability to be purchased as on-premises hardware.

## How do you access TPU Pods on Google Cloud?

### Pricing

[Cloud TPU](/wiki/cloud_tpu) Pods are billed on a per-chip-hour basis. Pricing varies by generation, region, and commitment level:

| TPU generation | On-demand price (approx.) | 1-year committed | 3-year committed |
|---|---|---|---|
| TPU v5e | ~$1.20/chip/hour | ~25-30% discount | ~40-45% discount |
| TPU v5p | ~$1.92/chip/hour | Discounted | Discounted |
| TPU v6e (Trillium) | ~$1.38/chip/hour | Discounted | As low as ~$0.39/chip/hour |

Google also offers **spot (preemptible)** pricing at significant discounts for fault-tolerant workloads, and **queued resources** for users who can wait for availability.[7]
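
Because billing is per chip-hour, a rough budget is the product of chips, hours, and the hourly rate. The sketch below uses the approximate v5e on-demand price from the table and the low end of its 1-year discount range; actual prices vary by region and change over time, so treat the inputs as placeholders.

```python
def slice_cost_usd(chips: int, hours: float, price_per_chip_hour: float,
                   discount: float = 0.0) -> float:
    """Estimated bill for a slice: chips x hours x hourly price, less any discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be a fraction in [0, 1)")
    return chips * hours * price_per_chip_hour * (1.0 - discount)


# A full 256-chip v5e Pod for 24 hours at the approximate on-demand price.
print(round(slice_cost_usd(256, 24, 1.20), 2))                 # 7372.8
# The same job with a 25% committed-use discount.
print(round(slice_cost_usd(256, 24, 1.20, discount=0.25), 2))  # 5529.6
```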

### Regional availability

TPU Pods are available in select Google Cloud regions. Primary regions (such as us-central1 and us-east1) typically offer the broadest selection of TPU types and the largest Pod configurations.[7] Newer generations like Trillium and Ironwood are initially available in a limited set of regions before expanding.[13] Access to larger Pod configurations and newer hardware often requires quota approval or an enterprise agreement with Google Cloud.

### TPU Research Cloud (TRC)

Google's **TPU Research Cloud** program provides free Cloud TPU access to academic researchers and open-source developers. Accepted participants receive temporary quota for TPU v4 and newer hardware. In exchange, researchers share their work through peer-reviewed publications, open-source code, or blog posts. The TRC has supported research across [natural language processing](/wiki/natural_language_processing), protein structure prediction, climate modeling, and many other fields.

## See also

- [Tensor Processing Unit (TPU)](/wiki/tpu)
- [Cloud TPU](/wiki/cloud_tpu)
- [TPU Ironwood](/wiki/tpu_ironwood)
- [TPU Board](/wiki/tpu_board)

## References

1. Jouppi, N. P., et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." *Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA)*, 2017. [arXiv:1704.04760](https://arxiv.org/abs/1704.04760)
2. Jouppi, N. P., et al. "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings." *Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA)*, 2023. [arXiv:2304.01433](https://arxiv.org/abs/2304.01433)
3. Zu, Y., et al. "Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer." *Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI)*, 2024. [usenix.org/conference/nsdi24/presentation/zu](https://www.usenix.org/conference/nsdi24/presentation/zu)
4. Chowdhery, A., et al. "PaLM: Scaling Language Modeling with Pathways." *arXiv preprint*, 2022. [arXiv:2204.02311](https://arxiv.org/abs/2204.02311)
5. Yoo, J., et al. "Scalable Training of Language Models using JAX pjit and TPUv4." *arXiv preprint*, 2022. [arXiv:2204.06514](https://arxiv.org/abs/2204.06514)
6. Patel, D., et al. "Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems." *Proceedings of the ACM SIGCOMM Conference*, 2023. [dl.acm.org/doi/10.1145/3603269.3604836](https://dl.acm.org/doi/10.1145/3603269.3604836)
7. Google Cloud. "TPU system architecture." *Cloud TPU Documentation*. [docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm](https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm)
8. Google Cloud. "TPU v4 documentation." *Cloud TPU Documentation*. [docs.cloud.google.com/tpu/docs/v4](https://docs.cloud.google.com/tpu/docs/v4)
9. Google Cloud. "TPU v5p documentation." *Cloud TPU Documentation*. [docs.cloud.google.com/tpu/docs/v5p](https://docs.cloud.google.com/tpu/docs/v5p)
10. Google Cloud. "Cloud TPU Multislice overview" and "The world's largest distributed LLM training job on TPU v5e." *Cloud TPU Documentation and Google Cloud Blog*. [docs.cloud.google.com/tpu/docs/multislice-introduction](https://docs.cloud.google.com/tpu/docs/multislice-introduction), [cloud.google.com/blog/products/compute/the-worlds-largest-distributed-llm-training-job-on-tpu-v5e](https://cloud.google.com/blog/products/compute/the-worlds-largest-distributed-llm-training-job-on-tpu-v5e)
11. Google Cloud. "Introducing Cloud TPU v5p and AI Hypercomputer." *Google Cloud Blog*, 2023. [cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-tpu-v5p-and-ai-hypercomputer](https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-tpu-v5p-and-ai-hypercomputer)
12. Google. "Ironwood: The first Google TPU for the age of inference." *The Keyword*, 2025. [blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/ironwood-tpu-age-of-inference](https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/ironwood-tpu-age-of-inference/)
13. Google Cloud. "TPU7x (Ironwood) documentation." *Cloud TPU Documentation*. [docs.cloud.google.com/tpu/docs/tpu7x](https://docs.cloud.google.com/tpu/docs/tpu7x)
14. Wikipedia. "Tensor Processing Unit." [en.wikipedia.org/wiki/Tensor_Processing_Unit](https://en.wikipedia.org/wiki/Tensor_Processing_Unit)
15. Scaling Book (JAX). "How to Think About TPUs." [jax-ml.github.io/scaling-book/tpus](https://jax-ml.github.io/scaling-book/tpus/)

