# TPU Board

> Source: https://aiwiki.ai/wiki/tpu_board
> Updated: 2026-04-26
> Categories: AI Hardware, Google, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **TPU board** (Tensor Processing Unit board) is a printed circuit board (PCB) that houses one or more [Tensor Processing Unit](/wiki/tensor_processing_unit_tpu) chips along with associated memory, power delivery, and interconnect components. Designed by [Google](/wiki/google), TPU boards serve as the physical substrate on which TPU [ASICs](/wiki/asic) are mounted, enabling their integration into server racks and datacenter infrastructure for accelerating [machine learning](/wiki/machine_learning) workloads such as [training](/wiki/training) and [inference](/wiki/inference).

Since the first TPU was deployed in Google's datacenters in 2015, the board designs have evolved considerably, progressing from simple PCIe add-in cards to complex, liquid-cooled multi-chip trays with optical interconnects. Google also produces a smaller Edge TPU board through its Coral product line, targeting on-device [inference](/wiki/inference) at the edge.

## ELI5 (Explain like I'm 5)

Imagine you have a special calculator chip that is really, really fast at doing one type of math (the kind computers need for learning things). A TPU board is like the flat card that holds that chip, connects it to electricity, and plugs it into a big computer. Think of it like a LEGO baseplate: the chip is the special brick, and the board is the flat piece that lets you snap it into the rest of your LEGO tower. Google makes these boards so their computers can learn and think faster.

## Background and history

Google began developing its first TPU in 2013, driven by projections that if every user spoke to their Android phone for just three minutes per day, the company would need to double its datacenter compute capacity to handle the [neural network](/wiki/neural_network) workloads required for [speech recognition](/wiki/speech_recognition). The engineering team designed, verified, built, and deployed the first TPU to production datacenters in only 15 months.

The first-generation TPU board was designed to fit into the same physical slot as a SATA hard disk drive within existing servers. This approach allowed Google to deploy TPUs without redesigning its server infrastructure. The TPU on its printed circuit card could be inserted into a server's hard drive bay, drawing power from the same connectors. Google originally planned to build fewer than 10,000 units but ended up manufacturing over 100,000 TPU v1 boards to support applications including Google Search, Google Translate, Google Maps, Google Photos, and [AlphaGo](/wiki/alphago).

Over the years, the board design has become increasingly sophisticated. TPU v3 introduced liquid cooling for the first time among TPU generations, and TPU v4 moved to a multi-chip tray design with optical circuit switching (OCS) for inter-chip communication across racks.

## Architecture of a TPU board

### Chip-level components

At the center of every TPU board sits the TPU chip itself, a custom [ASIC](/wiki/asic) built around one or more TensorCores. Each TensorCore contains several key functional units:

- **Matrix Multiply Unit (MXU):** A [systolic array](/wiki/systolic_array) of multiply-accumulate ALUs arranged in a grid (128x128 or 256x256, depending on the generation). The MXU performs the bulk of computation for [matrix multiplication](/wiki/matrix_multiplication) operations central to neural network workloads. A 128x128 MXU can execute 16,384 (roughly 16K) multiply-accumulate operations per clock cycle using [bfloat16](/wiki/bfloat16) inputs with FP32 accumulation.
- **Vector Processing Unit (VPU):** Handles element-wise operations, activations, and normalization across vectors.
- **Scalar Unit:** Processes control flow, address computation, and scalar arithmetic.
- **SparseCore (v4 and later):** A specialized dataflow processor for accelerating [embedding](/wiki/embeddings) lookups, providing 5x to 7x speedup for embedding-heavy models while consuming only about 5% of die area and power.
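
As a rough illustration of how MXU dimensions translate into peak throughput, the sketch below counts the multiply-accumulate units in a square array and converts them to operations per second. The function names are illustrative, and counting each multiply-accumulate as two operations is a common convention assumed here rather than an official Google formula. The TPU v1 inputs (256x256 array, 700 MHz) are the figures quoted elsewhere in this article.

```python
def macs_per_cycle(array_dim: int) -> int:
    """Multiply-accumulate units in a square array_dim x array_dim MXU."""
    return array_dim * array_dim


def peak_tera_ops(array_dim: int, clock_hz: float, num_mxus: int = 1) -> float:
    """Peak tera-operations per second.

    Assumption: each MAC counts as 2 operations (one multiply, one add)
    and every MAC unit is busy on every clock cycle.
    """
    ops_per_second = 2 * macs_per_cycle(array_dim) * clock_hz * num_mxus
    return ops_per_second / 1e12


print(macs_per_cycle(128))                  # 16384
print(round(peak_tera_ops(256, 700e6), 2))  # 91.75
```

Under these assumptions the TPU v1 figure comes out at about 92 tera-operations per second, which matches the INT8 peak listed for that generation.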

### Memory hierarchy

TPU boards incorporate a layered memory system:

- **High Bandwidth Memory (HBM):** Stacked DRAM packages mounted directly on the TPU package substrate, providing high-bandwidth off-chip memory. Capacity has grown from 16 GiB per chip (TPU v2) to 192 GB per chip (Ironwood/TPU v7).
- **Vector Memory (VMEM):** Software-managed on-chip scratchpad memory used by the vector unit, typically 32 MiB per TensorCore.
- **Common Memory (CMEM):** Shared memory accessible by multiple TensorCores on the same chip, introduced in TPU v4 at 128 MiB.
- **Scalar Memory (SMEM/spMEM):** Smaller on-chip memory for scalar operations, typically 5 to 10 MiB.
- **Accumulator buffers:** Dedicated on-chip buffers for storing intermediate results from the MXU. TPU v1 paired 4 MiB of 32-bit accumulators with a 24 MiB unified buffer, for 28 MiB of on-chip memory in total.

### Host connection

TPU boards connect to host CPUs via a [PCIe](/wiki/pcie) bus. In the first generation, this was a PCIe 3.0 interface carrying CISC instructions from the host processor to the TPU. In multi-chip tray designs (TPU v4 and later), each tray has four PCIe ports, one for each TPU chip, connecting to a dedicated CPU host.

### Inter-chip interconnect (ICI)

Chip-to-chip communication on TPU boards and across boards uses the Inter-Chip Interconnect (ICI), a proprietary high-speed link that provides significantly higher bandwidth than the PCIe host connection. ICI enables TPU chips to form torus network topologies for distributed [training](/wiki/training) across hundreds or thousands of chips. The bandwidth has grown from around 500 Gbps per chip in early generations to 9.6 Tbps aggregate bidirectional bandwidth in Ironwood.
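
ICI figures are quoted in bits per second in some places and bytes per second in others. The short sketch below converts between the two so the numbers can be compared directly; the function names are illustrative, and it assumes 8 bits per byte with no allowance for encoding or protocol overhead.

```python
BITS_PER_BYTE = 8


def tbps_to_tb_per_s(terabits_per_second: float) -> float:
    """Convert terabits per second to terabytes per second (no overhead assumed)."""
    return terabits_per_second / BITS_PER_BYTE


def gbps_to_gb_per_s(gigabits_per_second: float) -> float:
    """Convert gigabits per second to gigabytes per second (no overhead assumed)."""
    return gigabits_per_second / BITS_PER_BYTE


print(tbps_to_tb_per_s(9.6))  # 1.2
print(gbps_to_gb_per_s(500))  # 62.5
```

On this basis, 9.6 Tbps and 1.2 TB/s describe the same Ironwood link bandwidth.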

## Board form factors across generations

### TPU v1 board (2015)

The original TPU board was designed as a PCIe add-in card that fit into a standard server SATA disk slot. The board carried a single TPU chip fabricated on a 28 nm process with a die size of 331 mm² or less. It connected to the host via PCIe 3.0 and drew 28 to 40 watts of power, cooled passively or with the server's existing airflow. The board provided 8 GiB of DDR3 memory at 34 GB/s bandwidth.

### TPU v2 board (2017)

The TPU v2 board housed chips built on a 16 nm process. Each chip contained two TensorCores with dual 128x128 MXUs and was paired with 16 GiB of [HBM](/wiki/high_bandwidth_memory) providing 600 GB/s bandwidth. The board introduced ICI links to connect chips in a 2D torus topology, with each chip linking to four nearest neighbors. TPU v2 pods scaled to 256 chips with 11.5 petaFLOPS of aggregate compute.

### TPU v3 board (2018)

TPU v3 boards continued using 16 nm process technology but increased the die size to under 700 mm². Clock speed rose to 940 MHz, and HBM capacity doubled to 32 GiB per chip with 900 GB/s bandwidth. TPU v3 was the first generation to require liquid cooling due to higher power density. The 2D torus ICI topology was retained, and pods scaled up to 1,024 chips (2,048 TensorCores).

### TPU v4 tray (2021)

TPU v4 represented a major redesign of the physical board. Instead of individual PCIe cards, Google moved to a tray-based design. Each TPU v4 tray is a single PCB carrying four TPU chips (eight TensorCores total). The tray's front panel has four top-side PCIe connectors for host CPU connections and 16 bottom-side OSFP connectors for external ICI links to other trays.

The TPU v4 chip was fabricated on a 7 nm process with a die size under 400 mm², running at 1,050 MHz and consuming around 170 watts. Each chip package consists of the ASIC in the center surrounded by four HBM stacks, and the PCB carries four of these liquid-cooled packages. The ICI topology shifted from 2D to 3D torus, with each chip connecting to six neighbors instead of four.
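
The difference between four and six neighbors follows directly from the torus geometry. The sketch below is a minimal model, not a description of Google's routing: it treats each chip as a coordinate on a grid with wraparound links in every dimension, and the function name `torus_neighbors` is an illustrative choice.

```python
def torus_neighbors(coord, dims):
    """Return the set of coordinates directly linked to `coord` in a torus.

    Each axis contributes a +1 and a -1 neighbor, with wraparound at the
    edges (modulo the axis size). Axes of size 1 or 2 yield fewer distinct
    neighbors because the two directions coincide.
    """
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (shifted[axis] + step) % size
            neighbors.add(tuple(shifted))
    neighbors.discard(tuple(coord))
    return neighbors


print(len(torus_neighbors((0, 0), (16, 16))))      # 4
print(len(torus_neighbors((0, 0, 0), (4, 4, 4))))  # 6
```

Even a corner chip such as `(0, 0, 0)` has a full set of neighbors, because the wraparound links connect it to the opposite face of the torus.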

### TPU v5e and v5p boards (2023)

Google split the fifth generation into two product lines. TPU v5e was optimized for cost-efficient inference and smaller training jobs, featuring 16 GB HBM2e with 819 GB/s bandwidth and air cooling with a TDP of 120 to 200 watts. It used a 2D torus topology with pods up to 256 chips.

TPU v5p targeted maximum training performance, offering 95 GB HBM2e with 2,765 GB/s bandwidth and liquid cooling at 250 to 300 watts TDP. It used a 3D torus topology with OCS support and scaled to 8,960 chips per pod in a 16x20x28 superpod configuration, delivering approximately 4.1 exaFLOPS of peak bf16 compute (8,960 chips at 459 teraFLOPS each).

### TPU v6e Trillium (2024)

Trillium boards carry chips with enlarged 256x256 MXU arrays (quadrupled from the 128x128 arrays in v2 through v5), achieving 918 teraFLOPS in bfloat16 and 1,836 TOPS in INT8. Each chip contains one TensorCore with two MXUs, a vector unit, and a scalar unit. HBM capacity is 32 GB with 1,640 GB/s bandwidth. Trillium returned to air cooling and 2D torus topology with pods of up to 256 chips. It achieves 67% better energy efficiency than TPU v5e and includes a third-generation SparseCore.

### TPU v7 Ironwood (2025)

Ironwood is the first TPU generation designed primarily for inference at scale. Each chip contains two TensorCores and four SparseCores, with 192 GB of HBM3e memory at 7.37 TB/s bandwidth. The 256x256 MXU array handles bfloat16 operations, and Ironwood is the first TPU to support FP8 natively, mapping two FP8 MACs onto each FP16 data path to create an effective 512x512 array. Peak performance reaches 4,614 teraFLOPS in FP8. TDP is 600 watts with liquid cooling. Pods scale to 9,216 chips with 42.5 exaFLOPS aggregate FP8 compute. ICI bandwidth reaches 1.2 TB/s bidirectional per chip across a 3D torus topology.
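
The pod-level compute figures quoted for each generation are, to a first approximation, the per-chip peak multiplied by the chip count. The sketch below reproduces that arithmetic for TPU v2 and Ironwood using the per-chip numbers given in this article. It is a peak estimate only: the helper name is illustrative, and it assumes perfect scaling with no communication or utilization losses.

```python
def pod_peak_teraflops(chips: int, per_chip_teraflops: float) -> float:
    """Aggregate peak of a pod, assuming every chip runs at its peak rate."""
    return chips * per_chip_teraflops


# TPU v2: 256 chips at 45 teraFLOPS (bf16) each.
v2_petaflops = pod_peak_teraflops(256, 45) / 1e3
# Ironwood: 9,216 chips at 4,614 teraFLOPS (FP8) each.
ironwood_exaflops = pod_peak_teraflops(9216, 4614) / 1e6

print(f"{v2_petaflops:.1f} petaFLOPS")      # 11.5 petaFLOPS
print(f"{ironwood_exaflops:.1f} exaFLOPS")  # 42.5 exaFLOPS
```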

## Specifications comparison

| Specification | TPU v1 (2015) | TPU v2 (2017) | TPU v3 (2018) | TPU v4 (2021) | TPU v5e (2023) | TPU v5p (2023) | TPU v6e/Trillium (2024) | TPU v7/Ironwood (2025) |
|---|---|---|---|---|---|---|---|---|
| Process node | 28 nm | 16 nm | 16 nm | 7 nm | N/A | N/A | N/A | Advanced node |
| Die size | ≤331 mm² | <625 mm² | <700 mm² | <400 mm² | N/A | N/A | N/A | N/A |
| Clock speed | 700 MHz | 700 MHz | 940 MHz | 1,050 MHz | N/A | N/A | N/A | N/A |
| MXU array size | 256x256 (INT8) | 128x128 (bf16) | 128x128 | 128x128 | 128x128 | 128x128 | 256x256 | 256x256 (bf16), 512x512 (FP8) |
| Memory (capacity and type) | 8 GiB DDR3 | 16 GiB HBM | 32 GiB HBM | 32 GiB HBM | 16 GB HBM2e | 95 GB HBM2e | 32 GB HBM | 192 GB HBM3e |
| Memory bandwidth | 34 GB/s | 600 GB/s | 900 GB/s | 1,200 GB/s | 819 GB/s | 2,765 GB/s | 1,640 GB/s | 7,370 GB/s |
| Peak performance (bf16 unless noted) | 92 TOPS (INT8) | 45 TFLOPS | 123 TFLOPS | 275 TFLOPS | 197 TFLOPS | 459 TFLOPS | 918 TFLOPS | ~2,300 TFLOPS |
| TDP (watts) | 28-40 | N/A | N/A | 170 | 120-200 | 250-300 | 120-200 | 600 |
| Cooling | Air | Air | Liquid | Liquid | Air | Liquid | Air | Liquid |
| ICI topology | None (PCIe) | 2D torus | 2D torus | 3D torus | 2D torus | 3D torus | 2D torus | 3D torus |
| Max pod size | N/A | 256 chips | 1,024 chips | 4,096 chips | 256 chips | 8,960 chips | 256 chips | 9,216 chips |
| OCS support | No | No | No | Yes | No | Yes | No | Yes |
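
Two derived ratios can make the table easier to compare across generations: peak compute per byte of memory bandwidth, and peak compute per watt. The sketch below computes both from the per-chip values in the table. The compute-to-bandwidth ratio is the "ridge point" of the roofline model, which is an analysis convention introduced here and not something the table itself states. The function names are illustrative, and the results are peak-rate estimates rather than measured efficiency.

```python
def flops_per_byte(peak_teraflops: float, bandwidth_gb_per_s: float) -> float:
    """Peak FLOPs available per byte read from HBM (roofline ridge point)."""
    return (peak_teraflops * 1e12) / (bandwidth_gb_per_s * 1e9)


def teraflops_per_watt(peak_teraflops: float, tdp_watts: float) -> float:
    """Peak teraFLOPS per watt of chip TDP."""
    return peak_teraflops / tdp_watts


# (peak bf16 teraFLOPS, HBM bandwidth in GB/s), taken from the table above.
chips = {
    "TPU v2": (45, 600),
    "TPU v3": (123, 900),
    "TPU v4": (275, 1200),
    "TPU v5e": (197, 819),
    "TPU v5p": (459, 2765),
    "TPU v6e": (918, 1640),
}

for name, (tflops, bandwidth) in chips.items():
    print(f"{name}: {flops_per_byte(tflops, bandwidth):.0f} FLOPs per byte")

# TPU v4 is the only bf16 row with a single TDP value (170 W) in the table.
print(f"TPU v4: {teraflops_per_watt(275, 170):.2f} teraFLOPS per watt")
```

A workload that performs fewer operations per byte than a chip's ratio will tend to be limited by memory bandwidth rather than by the MXUs.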

## Physical hierarchy: from chip to pod

TPU boards are organized into a physical hierarchy that enables scaling from individual chips to warehouse-scale supercomputers.

### Chip and package

Each TPU chip is packaged with its HBM stacks in a single integrated package. In TPU v4 and later, the package consists of the ASIC die in the center surrounded by four HBM stacks, all mounted on a silicon interposer or organic substrate.

### Tray (board)

A tray is the PCB that holds multiple TPU packages. In the TPU v4 design, each tray contains four TPU chips (eight TensorCores). The tray provides PCIe connectivity to a dedicated CPU host and ICI links between the on-board chips. Within a tray, the four chips are connected via embedded ICI links in a 2x2 mesh configuration, with 16 external ICI links routed through OSFP connectors on the front panel for connections to other trays.

### Cube (rack unit)

Sixteen trays (64 chips) form a single 4x4x4 cube, which serves as the basic building block for pod construction. All ICI connections within a cube use direct-attached copper (DAC) cables because the physical distances are short (all chips reside in the same rack or adjacent racks).

### Pod

Multiple cubes are interconnected to form a pod. For TPU v4, a full pod consists of 4,096 chips (64 cubes). Inter-cube connections use optical fiber links managed by optical circuit switches (OCS). For Ironwood (TPU v7), pods scale to 9,216 chips.
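
The chip counts at each level of the TPU v4 hierarchy multiply out as follows. The sketch uses only the figures given in this section (four chips per tray, two TensorCores per chip, sixteen trays per cube, 64 cubes per pod); the constant names are illustrative.

```python
CHIPS_PER_TRAY = 4
TENSORCORES_PER_CHIP = 2
TRAYS_PER_CUBE = 16
CUBES_PER_V4_POD = 64

chips_per_cube = CHIPS_PER_TRAY * TRAYS_PER_CUBE
chips_per_pod = chips_per_cube * CUBES_PER_V4_POD
tensorcores_per_pod = chips_per_pod * TENSORCORES_PER_CHIP

# A 4x4x4 arrangement holds the same number of chips as sixteen trays.
assert chips_per_cube == 4 * 4 * 4

print(chips_per_cube)       # 64
print(chips_per_pod)        # 4096
print(tensorcores_per_pod)  # 8192
```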

### Multi-pod and datacenter network

Multiple pods communicate through Google's datacenter network (DCN), which operates at lower bandwidth than ICI but enables cross-pod distributed training for the largest models.

## Cooling systems

TPU board cooling has evolved significantly across generations:

- **TPU v1 and v2:** Air-cooled, relying on server chassis fans. The v1 board's 28-40 watt TDP was manageable with passive heatsinks and ambient airflow.
- **TPU v3:** First liquid-cooled TPU generation. The higher clock speed (940 MHz) and increased compute density required direct liquid cooling to dissipate heat effectively.
- **TPU v4:** Liquid-cooled packages with active coolant flow control via valves. This allows the cooling system to adjust flow rates based on each chip's workload, improving energy efficiency. Google had to retrofit datacenters to support liquid cooling infrastructure.
- **TPU v5e and v6e/Trillium:** Returned to air cooling for cost efficiency and simpler deployment.
- **TPU v5p and Ironwood:** Liquid-cooled, needed for their higher power envelopes (250-300 watts and 600 watts respectively).
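
To give a sense of the heat load the cooling system must remove, the sketch below multiplies per-chip TDP by chip count for a TPU v4 tray and for a full Ironwood pod. This is a lower-bound estimate under a stated assumption: it counts chip TDP only and ignores host CPUs, networking, power conversion losses, and the cooling equipment itself. The function name is illustrative.

```python
def chip_heat_load_watts(chips: int, tdp_watts: float) -> float:
    """Total chip TDP in watts (chips only; excludes hosts, network, losses)."""
    return chips * tdp_watts


v4_tray_watts = chip_heat_load_watts(4, 170)
ironwood_pod_megawatts = chip_heat_load_watts(9216, 600) / 1e6

print(v4_tray_watts)                     # 680
print(round(ironwood_pod_megawatts, 2))  # 5.53
```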

## Optical circuit switching

Starting with TPU v4, Google introduced [optical circuit switching](/wiki/optical_circuit_switching) (OCS) as a key innovation in TPU board interconnect architecture. OCS replaces fixed copper cabling between racks with programmable optical switches that can dynamically reconfigure which chips connect to which.

The OCS system uses micro-electro-mechanical systems (MEMS) mirrors to redirect light beams, enabling the network topology to be reconfigured in milliseconds. This provides several advantages:

- **Flexibility:** Users can select different 3D torus topologies (including twisted torus configurations) without physical rewiring.
- **Power efficiency:** Signals stay in the optical domain as they pass through the switch, with no conversion to electrical form and back, so an OCS draws far less power than a comparable electrical packet switch and adds minimal signal loss.
- **Fault tolerance:** Failed chips or links can be bypassed by reconfiguring the optical network around them.
- **Improved utilization:** The network can be partitioned into smaller, independent subclusters for different workloads.

For TPU v5p, 48 OCS units manage 13,824 optical ports across the pod. Ironwood continues to use OCS for its 9,216-chip pods.
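
The fault-tolerance and partitioning benefits can be pictured as a scheduling problem over 64-chip cubes: because the optical switches can connect any cubes to one another, a slice can be assembled from whichever cubes are healthy. The sketch below is a hypothetical allocator written for illustration only. The function names, the first-fit policy, and the cube numbering are assumptions and do not describe Google's actual scheduler.

```python
import math

CHIPS_PER_CUBE = 64  # one 4x4x4 cube, as described in the hierarchy section


def cubes_needed(slice_chips: int) -> int:
    """Number of whole cubes required to host a slice of the given size."""
    return math.ceil(slice_chips / CHIPS_PER_CUBE)


def allocate_slice(healthy_cubes, slice_chips):
    """Pick cubes for a slice from the healthy set (hypothetical first-fit policy).

    Raises ValueError if there are not enough healthy cubes.
    """
    need = cubes_needed(slice_chips)
    available = sorted(healthy_cubes)
    if need > len(available):
        raise ValueError(f"need {need} cubes, only {len(available)} healthy")
    return available[:need]


# A 64-cube pod in which cubes 3 and 17 are out of service.
healthy = set(range(64)) - {3, 17}
print(allocate_slice(healthy, 512))  # [0, 1, 2, 4, 5, 6, 7, 8]
```

The failed cubes are simply skipped, which is the behavior a fixed copper topology could not offer without physical rewiring.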

## Edge TPU boards (Google Coral)

In addition to datacenter TPU boards, Google produces Edge TPU boards through its Coral product line for on-device machine learning inference at the edge.

### Coral Dev Board

The Coral Dev Board is a single-board computer built around Google's Edge TPU coprocessor and an NXP i.MX 8M [SoC](/wiki/system_on_chip). The board is designed for prototyping edge ML applications.

| Specification | Details |
|---|---|
| Edge TPU performance | 4 TOPS at 2 TOPS per watt |
| Main processor | NXP i.MX 8M SoC, quad-core Cortex-A53 at 1.5 GHz plus Cortex-M4F |
| RAM | 1 GB or 4 GB LPDDR4 |
| Storage | 8 GB or 16 GB eMMC, microSD slot |
| Connectivity | Gigabit Ethernet, Wi-Fi 802.11ac (2x2 MIMO), Bluetooth 4.2 |
| Display | HDMI 2.0a (up to 1080p), MIPI DSI (4-lane) |
| Camera | MIPI CSI-2 (4-lane) |
| USB | USB 3.0 Type-C OTG, USB 3.0 Type-A host |
| GPIO | 40-pin I/O header |
| Dimensions | Baseboard approximately 88 mm x 60 mm; SoM approximately 48 mm x 40 mm |
| Power | 2-3 A at 5 V DC via USB Type-C |
| Operating temperature | 0 to 50 degrees Celsius |

The Edge TPU chip itself can execute models such as MobileNet v2 at nearly 400 frames per second while consuming only about 2 watts. It supports quantized (INT8) [TensorFlow Lite](/wiki/tensorflow_lite) models and is used in applications including smart cameras, robotics, industrial quality inspection, and IoT devices.
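
The throughput and power figures above imply a very small energy cost per inference. The sketch below works through that arithmetic using the approximate values quoted in this section (about 400 frames per second at about 2 watts, and 4 TOPS of peak throughput), so the results are order-of-magnitude estimates. The function names are illustrative.

```python
def energy_per_frame_millijoules(power_watts: float, frames_per_second: float) -> float:
    """Energy spent per inference, in millijoules (watts are joules per second)."""
    return power_watts / frames_per_second * 1000


def tops_per_watt(peak_tops: float, power_watts: float) -> float:
    """Peak tera-operations per second delivered per watt."""
    return peak_tops / power_watts


print(energy_per_frame_millijoules(2, 400))  # 5.0
print(tops_per_watt(4, 2))                   # 2.0
```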

### Coral USB Accelerator

For developers who already have a host computer or single-board computer such as a [Raspberry Pi](/wiki/raspberry_pi), the Coral USB Accelerator provides an Edge TPU in a USB dongle form factor. It connects via USB 3.0 and adds 4 TOPS of ML inference acceleration to any compatible system.

## Manufacturing and supply chain

[Broadcom](/wiki/broadcom) serves as Google's silicon implementation partner for TPU chip development. Broadcom translates Google's chip architecture and specifications into manufacturable silicon, providing proprietary technologies such as SerDes high-speed interfaces, overseeing ASIC design verification, and managing chip fabrication and packaging through [TSMC](/wiki/tsmc) (Taiwan Semiconductor Manufacturing Company).

Broadcom has been involved in every TPU generation since the program's inception. For TPU v7 Ironwood, TSMC manufactures the chips on an advanced 3 nm class process node (N3P). For the generation after Ironwood (expected in 2026), Google has also partnered with [MediaTek](/wiki/mediatek) to handle I/O design and manufacturing coordination with TSMC.

The PCB fabrication, assembly, and system integration are handled separately from the chip fabrication. TPU boards are assembled in facilities that mount the TPU packages, HBM stacks, power delivery components, PCIe connectors, and ICI link connectors onto multi-layer printed circuit boards.

## Software ecosystem

TPU boards are supported by a mature software stack that lets developers run ML workloads from several frameworks:

- **[TensorFlow](/wiki/tensorflow):** Google's original ML framework with native TPU support. Models can be compiled for TPU execution with minimal code changes.
- **[JAX](/wiki/jax):** A newer framework from Google built on a functional programming model. JAX uses the XLA compiler to target TPU hardware directly and is now the primary framework for large-scale TPU training at Google.
- **[XLA](/wiki/xla) (Accelerated Linear Algebra):** Google's domain-specific compiler that performs whole-program analysis to fuse operators and optimize memory layouts for TPU execution. XLA compiles high-level model code into efficient TPU machine code.
- **[PyTorch](/wiki/pytorch)/XLA:** An open-source package that enables running [PyTorch](/wiki/pytorch) models on TPUs via the XLA compiler backend. It traces PyTorch operations into computation graphs, either lazily or through `torch.compile` with an XLA backend, and compiles them into TPU-optimized code.
- **XProf:** A profiling tool deeply integrated into the JAX and TPU ecosystems, providing hardware-level visibility into workload execution with less than 1% overhead.

## Applications and deployment

TPU boards power a wide range of Google services and external applications through [Google Cloud](/wiki/google_cloud_terms) TPU:

- **Google Search:** TPUs run [RankBrain](/wiki/rankbrain) and other ranking models that determine search result relevancy.
- **Google Translate:** Neural machine translation models run on TPU infrastructure, handling billions of translation requests.
- **Google Photos:** Individual TPU chips can process over 100 million photos per day for image recognition and organization.
- **Google Maps and Street View:** TPUs accelerate navigation, map feature recognition, and visual processing.
- **[Gemini](/wiki/gemini):** Google's multimodal AI model family is trained and served on TPU pods.
- **[AlphaGo](/wiki/alphago) and AlphaFold:** Google's game-playing and protein-folding AI systems were trained on TPU infrastructure.
- **Cloud TPU:** Since 2018, Google has offered TPU access to external developers and enterprises through Google Cloud Platform. Users can provision TPU v4, v5e, v5p, v6e, and Ironwood resources for their own training and inference workloads.
- **On-premises deployment:** In 2025, Google began selling TPU hardware directly to enterprises for on-premises installation.

## TPU boards compared to GPU boards

TPU boards differ from [GPU](/wiki/gpu) accelerator boards (such as [NVIDIA](/wiki/nvidia) A100 or H100 boards) in several important ways:

| Aspect | TPU board | GPU board |
|---|---|---|
| Design philosophy | Purpose-built ASIC for tensor operations | General-purpose parallel processor repurposed for ML |
| Precision | Optimized for bfloat16, INT8, FP8 | Supports wide range (FP64, FP32, FP16, INT8, FP8) |
| Memory | HBM with high bandwidth, on-chip scratchpad | HBM with high bandwidth, L1/L2 cache hierarchy |
| Interconnect | Proprietary ICI with torus topology | NVLink, NVSwitch, InfiniBand |
| Availability | Google Cloud only (until 2025 on-prem offering) | Multiple cloud providers and on-premises |
| Software ecosystem | TensorFlow, JAX, PyTorch/XLA | CUDA, PyTorch, TensorFlow, and many others |
| Typical advantage | Higher efficiency for large-batch tensor workloads | Greater flexibility and broader framework support |
| Performance per watt | 30x to 80x higher TOPS per watt than contemporary CPUs and GPUs (TPU v1 benchmark) | Lower for pure tensor ops, higher for mixed workloads |

## See also

- [Tensor Processing Unit](/wiki/tensor_processing_unit_tpu)
- [Cloud TPU](/wiki/cloud_tpu)
- [TPU pod](/wiki/tpu_pod)
- [TPU chip](/wiki/tpu_chip)
- [GPU](/wiki/gpu)
- [ASIC](/wiki/asic)
- [Systolic array](/wiki/systolic_array)
- [Google Cloud](/wiki/google_cloud_terms)
- [Machine learning](/wiki/machine_learning)

## References

1. Jouppi, N.P., Young, C., Patil, N., et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12, 2017. https://arxiv.org/abs/1704.04760

2. Jouppi, N.P., Kurian, G., Li, S., et al. "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings." Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), 2023. https://arxiv.org/abs/2304.01433

3. Google Cloud. "An in-depth look at Google's first Tensor Processing Unit (TPU)." Google Cloud Blog, 2017. https://cloud.google.com/blog/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu

4. Google Cloud. "TPU system architecture." Google Cloud Documentation, 2024. https://cloud.google.com/tpu/docs/system-architecture-tpu-vm

5. Google Cloud. "TPU v5p documentation." Google Cloud Documentation, 2023. https://docs.cloud.google.com/tpu/docs/v5p

6. Google Cloud. "TPU v6e documentation." Google Cloud Documentation, 2024. https://docs.google.cloud.com/tpu/docs/v6e

7. Google. "Ironwood: The first Google TPU for the age of inference." Google Blog, 2025. https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/

8. Google Cloud. "Inside the Ironwood TPU codesigned AI stack." Google Cloud Blog, 2025. https://cloud.google.com/blog/products/compute/inside-the-ironwood-tpu-codesigned-ai-stack

9. Coral. "Dev Board datasheet." Google Coral Documentation, 2019. https://www.coral.ai/docs/dev-board/datasheet/

10. Google Cloud. "TPU transformation: A look back at 10 years of our AI-specialized chips." Google Cloud Blog, 2025. https://cloud.google.com/transform/ai-specialized-chips-tpu-history-gen-ai

11. Sanmartin, D. and Prohaska, V. "Exploring TPUs for AI Applications." arXiv preprint arXiv:2309.08918, 2023. https://arxiv.org/pdf/2309.08918

12. Google Cloud. "Building production AI on Cloud TPUs with JAX." Google Cloud Documentation, 2024. https://docs.google.cloud.com/tpu/docs/jax-ai-stack

13. PyTorch/XLA. "Enabling PyTorch on XLA Devices (e.g. Google TPU)." GitHub, 2024. https://github.com/pytorch/xla

14. Wikipedia contributors. "Tensor Processing Unit." Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Tensor_Processing_Unit
