A TPU board (Tensor Processing Unit board) is a printed circuit board (PCB) that houses one or more Tensor Processing Unit chips along with associated memory, power delivery, and interconnect components. Designed by Google, TPU boards serve as the physical substrate on which TPU ASICs are mounted, enabling their integration into server racks and datacenter infrastructure for accelerating machine learning workloads such as training and inference.
Since the first TPU was deployed in Google's datacenters in 2015, the board designs have evolved considerably, progressing from simple PCIe add-in cards to complex, liquid-cooled multi-chip trays with optical interconnects. Google also produces a smaller Edge TPU board through its Coral product line, targeting on-device inference at the edge.
Google began developing its first TPU in 2013, driven by projections that if every user spoke to their Android phone for just three minutes per day, the company would need to double its datacenter compute capacity to handle the neural network workloads required for speech recognition. The engineering team designed, verified, built, and deployed the first TPU to production datacenters in only 15 months.
The first-generation TPU board was designed to fit into the same physical slot as a SATA hard disk drive within existing servers. This approach allowed Google to deploy TPUs without redesigning its server infrastructure. The TPU on its printed circuit card could be inserted into a server's hard drive bay, drawing power from the same connectors. Google originally planned to build fewer than 10,000 units but ended up manufacturing over 100,000 TPU v1 boards to support applications including Google Search, Google Translate, Google Maps, Google Photos, and AlphaGo.
Over the years, the board design has become increasingly sophisticated. TPU v3 introduced liquid cooling for the first time among TPU generations, and TPU v4 moved to a multi-chip tray design with optical circuit switching (OCS) for inter-chip communication across racks.
At the center of every TPU board sits the TPU chip itself, a custom ASIC built around one or more TensorCores. Each TensorCore contains several key functional units:

- **Matrix multiply unit (MXU)**: a systolic array of multiply-accumulate cells (128x128 in v2 through v5p, 256x256 in Trillium and Ironwood) that performs the dense matrix multiplications at the heart of neural network workloads.
- **Vector unit**: handles elementwise operations such as activations, normalization, and reductions.
- **Scalar unit**: manages control flow and address generation, dispatching work to the vector and matrix units.
TPU boards incorporate a layered memory system:

- **Main memory**: high-bandwidth memory (HBM) stacks packaged alongside the ASIC (DDR3 in the first generation), holding model parameters and activations.
- **On-chip scratchpad memory**: software-managed buffers close to the compute units; rather than relying on a hardware cache hierarchy, the compiler explicitly stages data between HBM and these buffers.
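A roofline-style calculation illustrates why memory bandwidth matters as much as raw compute. The sketch below (illustrative only, not Google code) uses the TPU v4 figures quoted in this article to estimate the arithmetic intensity a kernel needs before compute, rather than HBM bandwidth, becomes the bottleneck:

```python
# Sketch: roofline "ridge point" -- the arithmetic intensity (FLOPs per
# byte of HBM traffic) at which a chip shifts from bandwidth-bound to
# compute-bound. Numbers are the TPU v4 figures quoted in this article.

def ridge_point(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """FLOPs per byte needed to keep the compute units fully busy."""
    return peak_flops / mem_bw_bytes_per_s

v4_peak = 275e12   # 275 TFLOPS (bf16)
v4_bw = 1.2e12     # 1,200 GB/s HBM bandwidth

ai = ridge_point(v4_peak, v4_bw)
print(f"TPU v4 ridge point: {ai:.0f} FLOPs/byte")  # ~229
```

On these numbers, kernels performing fewer than roughly 230 FLOPs per byte of HBM traffic are bandwidth-bound, which is why large matrix multiplications suit the MXU so well.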
TPU boards connect to host CPUs over a PCIe bus. In the first generation, this was a PCIe 3.0 interface carrying CISC-style instructions from the host processor to the TPU. In multi-chip tray designs (TPU v4 and later), each tray has four PCIe ports, one per TPU chip, connecting to a dedicated CPU host.
Chip-to-chip communication on TPU boards and across boards uses the Inter-Chip Interconnect (ICI), a proprietary high-speed link that provides significantly higher bandwidth than the PCIe host connection. ICI enables TPU chips to form torus network topologies for distributed training across hundreds or thousands of chips. Per-chip bandwidth has grown from around 500 Gbps in early generations to 9.6 Tbps of aggregate bidirectional bandwidth in Ironwood.
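The torus wiring ICI links form can be sketched in a few lines of Python. `torus_neighbors` below is a hypothetical helper for illustration, not part of any Google API:

```python
# Sketch: nearest-neighbor addresses in a torus topology. Each chip at
# coordinate (x, y, z) in an X*Y*Z 3D torus links to six neighbors,
# with wraparound at the edges; a 2D torus gives four neighbors.

def torus_neighbors(coord, dims):
    """All nearest neighbors of `coord` in a torus of shape `dims`."""
    neighbors = []
    for axis in range(len(dims)):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]  # wraparound link
            neighbors.append(tuple(n))
    return neighbors

# A chip in a 4x4x4 cube (the TPU v4 building block) has six neighbors:
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

The wraparound links are what distinguish a torus from a plain mesh: every chip has the same number of neighbors, so traffic patterns stay uniform regardless of position.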
The original TPU board was designed as a PCIe add-in card that fit into a standard server SATA disk slot. The board carried a single TPU chip fabricated on a 28 nm process with a die size of 331 mm² or less. It connected to the host via PCIe 3.0 and drew 28 to 40 watts of power, cooled passively or by the server's existing airflow. The board provided 8 GiB of DDR3 memory at 34 GB/s bandwidth.
The TPU v2 board housed chips built on a 16 nm process. Each chip contained two TensorCores with dual 128x128 MXUs and was paired with 16 GiB of HBM providing 600 GB/s bandwidth. The board introduced ICI links to connect chips in a 2D torus topology, with each chip linking to four nearest neighbors. TPU v2 pods scaled to 256 chips with 11.5 petaFLOPS of aggregate compute.
TPU v3 boards continued using 16 nm process technology but increased the die size to just under 700 mm². Clock speed rose to 940 MHz, and HBM capacity doubled to 32 GiB per chip with 900 GB/s bandwidth. TPU v3 was the first generation to require liquid cooling due to higher power density. The 2D torus ICI topology was retained, and pods scaled up to 1,024 chips (later expanded to 2,048 chips).
TPU v4 represented a major redesign of the physical board. Instead of individual PCIe cards, Google moved to a tray-based design. Each TPU v4 tray is a single PCB carrying four TPU chips (eight TensorCores total). The tray's front panel has four top-side PCIe connectors for host CPU connections and 16 bottom-side OSFP connectors for external ICI links to other trays.
The TPU v4 chip was fabricated on a 7 nm process with a die size under 400 mm², running at 1,050 MHz and consuming around 170 watts. Each chip package consists of the ASIC in the center surrounded by four HBM stacks, and the PCB carries four of these liquid-cooled packages. The ICI topology shifted from 2D to 3D torus, with each chip connecting to six neighbors instead of four.
Google split the fifth generation into two product lines. TPU v5e was optimized for cost-efficient inference and smaller training jobs, featuring 16 GB HBM2e with 819 GB/s bandwidth and air cooling with a TDP of 120 to 200 watts. It used a 2D torus topology with pods up to 256 chips.
TPU v5p targeted maximum training performance, offering 95 GB HBM2e with 2,765 GB/s bandwidth and liquid cooling at 250 to 300 watts TDP. It used a 3D torus topology with OCS support and scaled to 8,960 chips per pod in a 16x20x28 superpod configuration, delivering approximately 4.1 exaFLOPS of bf16 compute (8,960 chips × 459 teraFLOPS).
Trillium boards carry chips with enlarged 256x256 MXU arrays (four times as many multiply-accumulate units as the 128x128 arrays in v2 through v5), achieving 918 teraFLOPS in bfloat16 and 1,836 teraFLOPS in INT8. Each chip contains one TensorCore with two MXUs, a vector unit, and a scalar unit. HBM capacity is 32 GB with 1,640 GB/s bandwidth. Trillium returned to air cooling and 2D torus topology with pods of up to 256 chips. It achieves 67% better energy efficiency than TPU v5e and includes a third-generation SparseCore.
Ironwood is the first TPU generation designed primarily for inference at scale. Each chip contains two TensorCores and four SparseCores, with 192 GB of HBM3e memory at 7.37 TB/s bandwidth. The 256x256 MXU array handles bfloat16 operations, and Ironwood is the first TPU to support FP8 natively, mapping two FP8 MACs onto each FP16 data path to create an effective 512x512 array. Peak performance reaches 4,614 teraFLOPS in FP8. TDP is 600 watts with liquid cooling. Pods scale to 9,216 chips with 42.5 exaFLOPS aggregate FP8 compute. ICI bandwidth reaches 1.2 TB/s bidirectional per chip across a 3D torus topology.
| Specification | TPU v1 (2015) | TPU v2 (2017) | TPU v3 (2018) | TPU v4 (2021) | TPU v5e (2023) | TPU v5p (2023) | TPU v6e/Trillium (2024) | TPU v7/Ironwood (2025) |
|---|---|---|---|---|---|---|---|---|
| Process node | 28 nm | 16 nm | 16 nm | 7 nm | N/A | N/A | N/A | 3 nm (N3P) |
| Die size | ≤331 mm² | <625 mm² | <700 mm² | <400 mm² | N/A | N/A | N/A | N/A |
| Clock speed | 700 MHz | 700 MHz | 940 MHz | 1,050 MHz | N/A | N/A | N/A | N/A |
| MXU array size | 256x256 (INT8) | 128x128 (bf16) | 128x128 | 128x128 | 128x128 | 128x128 | 256x256 | 256x256 (bf16), 512x512 (FP8) |
| Memory type | 8 GiB DDR3 | 16 GiB HBM | 32 GiB HBM | 32 GiB HBM | 16 GB HBM2e | 95 GB HBM2e | 32 GB HBM | 192 GB HBM3e |
| Memory bandwidth | 34 GB/s | 600 GB/s | 900 GB/s | 1,200 GB/s | 819 GB/s | 2,765 GB/s | 1,640 GB/s | 7,370 GB/s |
| Peak perf (bf16) | 92 TOPS (INT8) | 45 TFLOPS | 123 TFLOPS | 275 TFLOPS | 197 TFLOPS | 459 TFLOPS | 918 TFLOPS | ~2,300 TFLOPS |
| TDP (watts) | 28-40 | N/A | N/A | 170 | 120-200 | 250-300 | 120-200 | 600 |
| Cooling | Air | Air | Liquid | Liquid | Air | Liquid | Air | Liquid |
| ICI topology | None (PCIe) | 2D torus | 2D torus | 3D torus | 2D torus | 3D torus | 2D torus | 3D torus |
| Max pod size | N/A | 256 chips | 2,048 chips | 4,096 chips | 256 chips | 8,960 chips | 256 chips | 9,216 chips |
| OCS support | No | No | No | Yes | No | Yes | No | Yes |
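The peak-throughput figures in the table follow from MXU geometry: a systolic array performs one multiply-accumulate (two operations) per cell per cycle when fully utilized. A quick sanity check of the TPU v1 figure (an illustrative sketch, not Google code):

```python
# Sketch: peak ops/s of a systolic matrix unit. Each MXU cell does one
# multiply-accumulate (2 ops) per clock cycle at full utilization.

def mxu_peak_ops(rows: int, cols: int, clock_hz: float, mxus: int = 1) -> float:
    """Peak operations per second for `mxus` arrays of rows x cols cells."""
    return 2 * rows * cols * clock_hz * mxus

# TPU v1: a single 256x256 INT8 MXU clocked at 700 MHz
v1 = mxu_peak_ops(256, 256, 700e6)
print(f"TPU v1 peak: {v1 / 1e12:.1f} TOPS")  # ~91.8, quoted as 92 TOPS
```

The same formula applied to later generations gives their headline numbers to within rounding, modulo the number of MXUs per chip.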
TPU boards are organized into a physical hierarchy that enables scaling from individual chips to warehouse-scale supercomputers.
Each TPU chip is packaged with its HBM stacks in a single integrated package. In TPU v4 and later, the package consists of the ASIC die in the center surrounded by four HBM stacks, all mounted on a silicon interposer or organic substrate.
A tray is the PCB that holds multiple TPU packages. In the TPU v4 design, each tray contains four TPU chips (eight TensorCores). The tray provides PCIe connectivity to a dedicated CPU host and ICI links between the on-board chips. Within a tray, the four chips are connected via embedded ICI links in a 2x2 mesh configuration, with 16 external ICI links routed through OSFP connectors on the front panel for connections to other trays.
Sixteen trays (64 chips) form a single 4x4x4 cube, which serves as the basic building block for pod construction. All ICI connections within a cube use direct-attached copper (DAC) cables because the physical distances are short (all chips reside in the same rack or adjacent racks).
Multiple cubes are interconnected to form a pod. For TPU v4, a full pod consists of 4,096 chips (64 cubes). Inter-cube connections use optical fiber links managed by optical circuit switches (OCS). For Ironwood (TPU v7), pods scale to 9,216 chips.
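The chip counts in this hierarchy can be cross-checked with simple arithmetic; all constants below come from this article's own figures:

```python
# Sketch: the tray -> cube -> pod hierarchy as arithmetic.
CHIPS_PER_TRAY = 4
TRAYS_PER_CUBE = 16
chips_per_cube = CHIPS_PER_TRAY * TRAYS_PER_CUBE   # 64 = one 4x4x4 cube

V4_POD_CHIPS = 4096
print(V4_POD_CHIPS // chips_per_cube, "cubes in a TPU v4 pod")  # 64

# Ironwood pod aggregate: 9,216 chips at 4,614 FP8 teraFLOPS each
ironwood_pod_flops = 9216 * 4614e12
print(f"{ironwood_pod_flops / 1e18:.1f} exaFLOPS")              # ~42.5
```

The 64-cube figure matches the article's statement that a TPU v4 pod is built from 4x4x4 cubes, and the Ironwood product reproduces the quoted 42.5 exaFLOPS aggregate.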
Multiple pods communicate through Google's datacenter network (DCN), which operates at lower bandwidth than ICI but enables cross-pod distributed training for the largest models.
TPU board cooling has evolved significantly across generations:

- **TPU v1 and v2**: air cooling within standard server airflow.
- **TPU v3**: introduced liquid cooling to handle increased power density.
- **TPU v4, v5p, and Ironwood**: retain liquid cooling for their high-TDP, densely packed trays.
- **TPU v5e and Trillium**: optimized for efficiency, returned to air cooling.
Starting with TPU v4, Google introduced optical circuit switching (OCS) as a key innovation in TPU board interconnect architecture. OCS replaces fixed copper cabling between racks with programmable optical switches that can dynamically reconfigure which chips connect to which.
The OCS system uses micro-electro-mechanical systems (MEMS) mirrors to redirect light beams, allowing the network topology to be reconfigured in place on millisecond timescales (the mirrors must physically move). This provides several advantages:

- **Fault tolerance**: failed chips, trays, or racks can be switched out of a job's topology and replaced with spares without any recabling.
- **Flexible slice shapes**: slices of different sizes and topologies (including twisted-torus variants) can be wired on demand to match job requirements.
- **Higher utilization**: jobs can be scheduled onto whatever healthy hardware is available rather than requiring fixed, contiguous blocks.
For TPU v5p, 48 OCS units manage 13,824 optical ports across the pod. Ironwood continues to use OCS for its 9,216-chip pods.
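Conceptually, an optical circuit switch is a programmable one-to-one mapping between input and output ports; reconfiguring the mirrors installs a new mapping. The toy model below is purely illustrative (class name, port counts, and API are invented for this sketch):

```python
# Toy model of an optical circuit switch: a programmable one-to-one
# port mapping. "Reconfiguring the MEMS mirrors" corresponds to
# installing a new mapping. Real OCS hardware differs in every detail.

class OpticalCircuitSwitch:
    def __init__(self, ports: int):
        self.ports = ports
        self.circuits: dict[int, int] = {}  # input port -> output port

    def configure(self, mapping: dict[int, int]) -> None:
        """Install a new set of circuits (must be one-to-one)."""
        assert len(set(mapping.values())) == len(mapping), "port conflict"
        self.circuits = dict(mapping)

    def route(self, in_port: int) -> int:
        return self.circuits[in_port]

ocs = OpticalCircuitSwitch(ports=8)
ocs.configure({0: 4, 1: 5})
ocs.configure({0: 5, 1: 4})  # reroute around a bad link, no recabling
print(ocs.route(0))          # 5
```

This is what distinguishes OCS from a packet switch: light passes straight through along a pre-established circuit, so the switch adds essentially no per-packet latency or protocol overhead.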
In addition to datacenter TPU boards, Google produces Edge TPU boards through its Coral product line for on-device machine learning inference at the edge.
The Coral Dev Board is a single-board computer built around Google's Edge TPU coprocessor and an NXP i.MX 8M SoC. The board is designed for prototyping edge ML applications.
| Specification | Details |
|---|---|
| Edge TPU performance | 4 TOPS at 2 TOPS per watt |
| Main processor | NXP i.MX 8M SoC, quad-core Cortex-A53 at 1.5 GHz plus Cortex-M4F |
| RAM | 1 GB or 4 GB LPDDR4 |
| Storage | 8 GB or 16 GB eMMC, microSD slot |
| Connectivity | Gigabit Ethernet, Wi-Fi 802.11ac (2x2 MIMO), Bluetooth 4.2 |
| Display | HDMI 2.0a (up to 1080p), MIPI DSI (4-lane) |
| Camera | MIPI CSI-2 (4-lane) |
| USB | USB 3.0 Type-C OTG, USB 3.0 Type-A host |
| GPIO | 40-pin I/O header |
| Dimensions | Approximately 88 mm x 60 mm x 24 mm (full board) |
| Power | 2-3 A at 5 V DC via USB Type-C |
| Operating temperature | 0 to 50 degrees Celsius |
The Edge TPU chip itself can execute models such as MobileNet v2 at nearly 400 frames per second while consuming only about 2 watts. It supports quantized (INT8) TensorFlow Lite models and is used in applications including smart cameras, robotics, industrial quality inspection, and IoT devices.
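The INT8 quantization the Edge TPU requires can be illustrated with a minimal sketch. This shows symmetric (zero-point-0) quantization with a single scale; real TensorFlow Lite converters choose scales per tensor or per channel:

```python
# Sketch: symmetric INT8 quantization of the kind used by quantized
# TensorFlow Lite models. Floats map to int8 via a scale factor so
# inference can run entirely in integer arithmetic.

def quantize(xs, scale):
    """Map floats to int8 values, clamping to the int8 range."""
    return [max(-128, min(127, round(x / scale))) for x in xs]

def dequantize(qs, scale):
    """Recover approximate float values from int8 codes."""
    return [q * scale for q in qs]

weights = [0.5, -1.2, 0.03]
scale = 1.2 / 127        # chosen so the largest |weight| maps to 127
q = quantize(weights, scale)
print(q)                 # [53, -127, 3]
```

The quantization error (here at most about half a scale step per value) is what the converter's calibration step works to minimize across a whole model.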
For developers who already have a host computer or single-board computer such as a Raspberry Pi, the Coral USB Accelerator provides an Edge TPU in a USB dongle form factor. It connects via USB 3.0 and adds 4 TOPS of ML inference acceleration to any compatible system.
Broadcom serves as Google's silicon implementation partner for TPU chip development. Broadcom translates Google's chip architecture and specifications into manufacturable silicon, providing proprietary technologies such as SerDes high-speed interfaces, overseeing ASIC design verification, and managing chip fabrication and packaging through TSMC (Taiwan Semiconductor Manufacturing Company).
Broadcom has been involved in every TPU generation since the program's inception. For TPU v7 Ironwood, TSMC manufactures the chips on an advanced 3 nm class process node (N3P). Starting with the next generation beyond Ironwood (expected 2026), Google has also partnered with MediaTek to handle I/O design and manufacturing coordination with TSMC.
The PCB fabrication, assembly, and system integration are handled separately from the chip fabrication. TPU boards are assembled in facilities that mount the TPU packages, HBM stacks, power delivery components, PCIe connectors, and ICI link connectors onto multi-layer printed circuit boards.
TPU boards are supported by a mature software stack that enables developers to run ML workloads across Google's frameworks:

- **TensorFlow**: Google's original deep learning framework, with first-class TPU support.
- **JAX**: a NumPy-style library of composable function transformations, widely used for large-scale TPU training.
- **PyTorch/XLA**: a bridge that lets PyTorch programs target TPUs.

All three paths compile through XLA (Accelerated Linear Algebra), which lowers framework-level computation graphs to TPU machine code.
TPU boards power a wide range of Google services and external applications through Google Cloud TPU:

- Internal Google services, including Search, Translate, Maps, and Photos (the original TPU v1 deployments).
- Research systems such as DeepMind's AlphaGo.
- External training and inference workloads run by Google Cloud customers on Cloud TPU.
TPU boards differ from GPU accelerator boards (such as NVIDIA A100 or H100 boards) in several important ways:
| Aspect | TPU board | GPU board |
|---|---|---|
| Design philosophy | Purpose-built ASIC for tensor operations | General-purpose parallel processor repurposed for ML |
| Precision | Optimized for bfloat16, INT8, FP8 | Supports wide range (FP64, FP32, FP16, INT8, FP8) |
| Memory | HBM with high bandwidth, on-chip scratchpad | HBM with high bandwidth, L1/L2 cache hierarchy |
| Interconnect | Proprietary ICI with torus topology | NVLink, NVSwitch, InfiniBand |
| Availability | Google Cloud only (on-premises offering announced in 2025) | Multiple cloud providers and on-premises |
| Software ecosystem | TensorFlow, JAX, PyTorch/XLA | CUDA, PyTorch, TensorFlow, and many others |
| Typical advantage | Higher efficiency for large-batch tensor workloads | Greater flexibility and broader framework support |
| Performance per watt | 15x to 30x higher than contemporary GPUs (v1 benchmark) | Lower for pure tensor ops, higher for mixed workloads |