A Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google specifically to accelerate machine learning workloads. First deployed internally in 2015 and publicly announced in 2016, TPUs power many of Google's core services and have been used to train and serve some of the largest neural networks ever built, including AlphaGo, AlphaFold, BERT, LaMDA, PaLM, and Gemini. As of 2025, Google has released seven generations of TPU chips, each bringing improvements in compute performance, memory capacity, interconnect bandwidth, and energy efficiency.
Unlike general-purpose GPUs, which were originally designed for graphics rendering and later adapted for parallel computation, TPUs are purpose-built for the matrix arithmetic that dominates deep learning. This specialization allows TPUs to deliver higher throughput per watt on machine learning tasks compared to general-purpose processors.
Imagine you have a regular calculator that can do all sorts of math problems, from addition to complicated algebra. That is like a GPU. Now imagine you have a special calculator that can only do one type of math (multiplying big grids of numbers), but it does that one thing incredibly fast. That is what a TPU is. Google built these special calculators because deep learning programs spend almost all their time multiplying big grids of numbers. By making a chip that only does that job, Google can train and run AI programs much faster while using less electricity.
The core compute engine inside every TPU is the matrix multiply unit (MXU), which is built as a systolic array. The name "systolic" comes from the analogy to a beating heart: data flows through the array in rhythmic, wave-like pulses. In a systolic array, multiply-accumulate (MAC) units are arranged in a two-dimensional grid. Weight values from one matrix are preloaded into the MAC units. Activation values from the other matrix enter from one edge and flow horizontally across the grid. Each MAC unit multiplies its stored weight by the incoming activation, adds the result to a partial sum arriving from the neighboring unit above, and passes both values onward. All intermediate results move directly between adjacent MAC units without returning to off-chip memory, which reduces power consumption and memory bandwidth requirements.
In TPU generations prior to v6e, each MXU is a 128 x 128 systolic array, giving 16,384 MAC operations per cycle. Starting with TPU v6e (Trillium) and continuing in TPU v7 (Ironwood), the MXU was enlarged to 256 x 256, quadrupling this to 65,536 MAC operations per cycle. MXU multiplications accept bfloat16 inputs, while accumulations are performed in FP32 to preserve numerical stability.
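The weight-stationary dataflow can be illustrated with a small cycle-level simulation. The sketch below is illustrative only (plain NumPy, no pipelining or hardware detail): weights are preloaded into the grid, activations enter the left edge with a one-cycle skew per row, and partial sums flow down each column until finished results drop out of the bottom edge.

```python
import numpy as np

def systolic_matmul(x, w):
    """Cycle-level sketch of a weight-stationary systolic array computing x @ w.

    MAC unit (k, n) permanently holds weight w[k, n]. Activations flow
    left-to-right along row k; partial sums flow top-to-bottom along
    column n; y[m, n] emerges from the bottom of column n.
    """
    M, K = x.shape
    K2, N = w.shape
    assert K == K2

    a_reg = np.zeros((K, N))   # activation register inside each MAC unit
    p_reg = np.zeros((K, N))   # partial-sum register inside each MAC unit
    y = np.zeros((M, N))

    for t in range(M + K + N):
        # Activation x[m, k] enters the left edge of row k at cycle t = m + k
        # (one cycle of skew per row keeps activations and partial sums aligned).
        a_in_left = np.zeros(K)
        for k in range(K):
            m = t - k
            if 0 <= m < M:
                a_in_left[k] = x[m, k]

        new_a = np.empty_like(a_reg)
        new_p = np.empty_like(p_reg)
        for k in range(K):
            for n in range(N):
                a_in = a_in_left[k] if n == 0 else a_reg[k, n - 1]  # from the left neighbor
                p_in = 0.0 if k == 0 else p_reg[k - 1, n]           # from the neighbor above
                new_a[k, n] = a_in                                  # pass the activation right
                new_p[k, n] = p_in + w[k, n] * a_in                 # multiply-accumulate, pass down
        a_reg, p_reg = new_a, new_p

        # After cycle t = m + (K - 1) + n, the bottom of column n holds y[m, n].
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                y[m, n] = p_reg[K - 1, n]
    return y

rng = np.random.default_rng(0)
a, b = rng.standard_normal((5, 8)), rng.standard_normal((8, 3))
assert np.allclose(systolic_matmul(a, b), a @ b)
```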
Each TPU chip contains one or more TensorCores. A TensorCore bundles together one or more MXUs, a vector processing unit (VPU), and a scalar processing unit. The VPU handles element-wise operations such as activations, normalization, and softmax, while the scalar unit manages control flow and address computation. The number of MXUs per TensorCore and the number of TensorCores per chip have increased with each TPU generation.
| TPU version | TensorCores per chip | MXUs per TensorCore | Total MXUs per chip |
|---|---|---|---|
| v1 | 1 | 1 | 1 |
| v2 | 2 | 1 | 2 |
| v3 | 2 | 2 | 4 |
| v4 | 2 | 4 | 8 |
| v5e | 1 | 4 | 4 |
| v5p | 2 | 4 | 8 |
| v6e (Trillium) | 1 | 2 | 2 |
| v7 (Ironwood) | 2 | (not disclosed) | (not disclosed) |
Starting with TPU v4, Google added SparseCores to the chip. SparseCores are specialized dataflow processors designed to accelerate embedding lookups, which are common in recommendation and ranking models. They accelerate embedding-heavy models by 5x to 7x while using only about 5% of the die area and power budget. TPU v6e includes a third-generation SparseCore, and TPU v7 (Ironwood) includes four SparseCores per chip.
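For context, an embedding lookup is essentially a sparse gather followed by a small reduction rather than a dense matrix multiply, which is why it maps poorly onto the MXU and benefits from a dedicated unit. A minimal illustration (hypothetical table size and feature IDs):

```python
import numpy as np

# Hypothetical 100,000-row embedding table with 128-dimensional rows.
table = np.random.default_rng(0).standard_normal((100_000, 128)).astype(np.float32)

# One training example activates a handful of sparse feature IDs.
feature_ids = np.array([17, 4_242, 99_999])

# The lookup is a gather of a few scattered rows followed by a pooling
# reduction -- irregular memory traffic, not a dense matmul.
pooled = table[feature_ids].sum(axis=0)
print(pooled.shape)   # (128,)
```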
TPUs use high-bandwidth memory (HBM) for off-chip storage, providing the large capacity and bandwidth needed for model weights, activations, and optimizer states. On-chip, each TensorCore has vector memory (VMEM), which serves as a high-speed scratchpad for data being actively processed. Some generations also include separate common memory (CMEM) and SparseCore memory (spMEM).
TPU chips within a pod communicate over a custom inter-chip interconnect (ICI). TPU v2, v3, v5e, and v6e use a 2D torus topology, where each chip connects to its four nearest neighbors. TPU v4, v5p, and v7 use a 3D torus topology, where each chip connects to six neighbors. The additional dimension reduces the network diameter (the maximum number of hops between any two chips) from on the order of the square root of N to on the order of the cube root of N, where N is the total number of chips. This lower diameter improves collective communication performance for large-scale distributed training.
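A back-of-the-envelope sketch of why the extra dimension helps (this models only hop counts on an ideal torus, not actual routing or link bandwidth):

```python
import math

def torus_diameter(sides):
    """Maximum hop count between any two chips in a torus with the given side lengths.

    Within each dimension the farthest pair is floor(side / 2) hops apart
    (links wrap around), and hops in different dimensions add.
    """
    return sum(side // 2 for side in sides)

n_chips = 4096
side_2d = math.isqrt(n_chips)          # 64 x 64       (2D torus)
side_3d = round(n_chips ** (1 / 3))    # 16 x 16 x 16  (3D torus)

print(torus_diameter([side_2d] * 2))   # 64 hops
print(torus_diameter([side_3d] * 3))   # 24 hops
```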
TPU v4 introduced optical circuit switches (OCSes), which allow the interconnect topology to be dynamically reconfigured. This feature improves cluster availability, utilization, and fault isolation, and it enables users to select twisted torus topologies that provide up to 70% higher bisection bandwidth compared to standard tori. OCSes account for less than 5% of system cost and less than 3% of system power.
Google has released seven generations of TPU hardware since 2015. The following table summarizes the key specifications of each generation.
| Specification | v1 (2015) | v2 (2017) | v3 (2018) | v4 (2021) | v5e (2023) | v5p (2023) | v6e / Trillium (2024) | v7 / Ironwood (2025) |
|---|---|---|---|---|---|---|---|---|
| Process node | 28 nm | 16 nm | 16 nm | 7 nm | Not disclosed | Not disclosed | Not disclosed | Not disclosed |
| Clock speed | 700 MHz | 700 MHz | 940 MHz | 1,050 MHz | Not disclosed | 1,750 MHz | Not disclosed | Not disclosed |
| TensorCores per chip | 1 | 2 | 2 | 2 | 1 | 2 | 1 | 2 |
| HBM capacity per chip | 8 GiB (DDR3) | 16 GiB HBM | 32 GiB HBM | 32 GiB HBM | 16 GiB HBM | 95 GiB HBM | 32 GiB HBM | 192 GiB HBM3e |
| HBM bandwidth per chip | 34 GB/s | 600 GB/s | 900 GB/s | 1,200 GB/s | 819 GB/s | 2,765 GB/s | 1,640 GB/s | 7,380 GB/s |
| Peak compute (BF16) | N/A (INT8 only) | 45 TFLOPS | 123 TFLOPS | 275 TFLOPS | 197 TFLOPS | 459 TFLOPS | 918 TFLOPS | 2,307 TFLOPS |
| Peak compute (INT8) | 92 TOPS | N/A | N/A | N/A | 393 TOPS | 918 TOPS | 1,836 TOPS | N/A |
| Peak compute (FP8) | N/A | N/A | N/A | N/A | N/A | 459 TFLOPS | N/A | 4,614 TFLOPS |
| TDP (per chip) | 28-40 W | Not disclosed | Not disclosed | 170 W | Not disclosed | Not disclosed | Not disclosed | ~1,000 W |
| ICI bandwidth per chip | N/A (PCIe) | Not disclosed | Not disclosed | Not disclosed | 400 GBps | 1,200 GBps | 800 GBps | 1,200 GBps |
| ICI topology | N/A | 2D torus | 2D torus | 3D torus | 2D torus | 3D torus | 2D torus | 3D torus |
| Max chips per pod | 1 (PCIe card) | 256 | 1,024 | 4,096 | 256 | 8,960 | 256 | 9,216 |
| Cooling | Air | Air | Liquid | Liquid | Air | Liquid | Not disclosed | Liquid |
| Primary use | Inference | Training and inference | Training and inference | Training and inference | Inference and fine-tuning | Large-scale training | Training and inference | Inference-optimized |
The first-generation TPU was designed exclusively for inference. It contained a single 256 x 256 systolic array of 8-bit integer multiply-accumulate units, delivering a peak throughput of 92 TOPS (INT8). The chip was fabricated on a 28 nm process, fit on a PCIe card, drew 28 to 40 W, and used 8 GiB of DDR3 SDRAM rather than HBM. Google deployed TPU v1 across its data centers starting in 2015 to accelerate inference for services such as Google Search (RankBrain), Google Translate, Google Photos, and Google Street View. The chip was publicly described in a 2017 ISCA paper by Jouppi et al., which showed that the TPU was 15x to 30x faster and 30x to 80x more energy-efficient than contemporary CPUs and GPUs on inference workloads [1].
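The quoted peak follows directly from the array size and clock rate, counting a multiply and an add as two operations:

```python
macs_per_cycle = 256 * 256      # one 256 x 256 grid of INT8 MAC units
ops_per_mac = 2                 # each MAC is one multiply plus one add
clock_hz = 700e6                # 700 MHz

peak_tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(peak_tops)                # ~91.8, matching the quoted 92 TOPS (INT8)
```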
TPU v2 was the first generation designed for both training and inference. The architecture was significantly restructured: the single 256 x 256 INT8 array was replaced by two TensorCores, each containing a 128 x 128 bfloat16 MXU. This was the first chip to use the bfloat16 floating-point format, which Google Brain developed to preserve the dynamic range of FP32 (by keeping 8 exponent bits) while halving the storage and bandwidth costs (by truncating the mantissa to 7 bits). Each chip delivered 45 TFLOPS in bfloat16 and had 16 GiB of HBM with 600 GB/s bandwidth. Up to 256 chips could be connected in a 2D torus topology to form a TPU v2 Pod, achieving 11.5 petaFLOPS of aggregate peak compute [2].
TPU v3 retained the two-TensorCore-per-chip design but doubled the number of MXUs per TensorCore from one to two, increased the clock speed from 700 MHz to 940 MHz, and doubled HBM capacity to 32 GiB per chip with 900 GB/s bandwidth. Peak per-chip performance rose to 123 TFLOPS in bfloat16, more than double that of v2. The higher power density required liquid cooling for the first time. A TPU v3 Pod contained up to 1,024 chips. TPU v3 was used to train AlphaFold, which predicted protein structures with atomic-level accuracy using 128 TPU v3 cores [3].
TPU v4 moved to a 7 nm process node and doubled the number of MXUs per TensorCore from two to four. It introduced a 3D torus interconnect topology, replacing the 2D torus of previous generations, and was the first TPU to deploy optical circuit switches (OCSes) for reconfigurable networking. A single TPU v4 Pod contained 4,096 chips. The chip also introduced SparseCores for embedding acceleration. TPU v4 delivered 275 TFLOPS per chip in bfloat16, consumed 170 W per chip, and was described in a 2023 ISCA paper as being 1.2x to 1.7x faster than the NVIDIA A100 while using 1.3x to 1.9x less power [4]. A v4i variant was also produced for inference-only workloads without liquid cooling.
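As a consistency check on the figures above (eight 128 x 128 MXUs per chip at the quoted 1,050 MHz clock):

```python
mxus_per_chip = 2 * 4                   # two TensorCores x four MXUs each
macs_per_mxu = 128 * 128
ops_per_mac = 2                         # multiply + accumulate
clock_hz = 1.05e9                       # 1,050 MHz

peak_tflops = mxus_per_chip * macs_per_mxu * ops_per_mac * clock_hz / 1e12
print(peak_tflops)                      # ~275 TFLOPS (BF16), matching the table
```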
TPU v5e was designed as a cost-efficient option optimized for inference and fine-tuning rather than maximum training performance. It has a single TensorCore with four MXUs, 16 GiB HBM with 819 GB/s bandwidth, and delivers 197 TFLOPS in bfloat16 or 393 TOPS in INT8 per chip. It returned to the 2D torus topology (sufficient for its smaller target pod sizes of up to 256 chips) and uses air cooling. The v5e provides a lower cost-per-inference than the v5p, making it popular for serving workloads on Google Cloud [5].
TPU v5p targeted maximum performance for large-scale training. Each chip has two TensorCores with four MXUs each (eight MXUs total), 95 GiB of HBM with 2,765 GB/s bandwidth, and delivers 459 TFLOPS per chip in bfloat16. It uses a 3D torus topology with 1,200 GBps of bidirectional ICI bandwidth per chip. A TPU v5p Pod contains 8,960 chips, with the largest schedulable job using 6,144 chips in a 3D torus configuration. Google described the v5p as competitive with the NVIDIA H100 [6].
TPU v6e, marketed as Trillium, was announced at Google I/O in May 2024 and became generally available in late 2024. It features an enlarged 256 x 256 MXU (up from 128 x 128 in prior generations), delivering 918 TFLOPS per chip in bfloat16, a 4.7x increase over TPU v5e. HBM capacity is 32 GiB per chip with 1,640 GB/s bandwidth. Each chip has 800 GBps of bidirectional ICI bandwidth over a 2D torus topology, with pods scaling to 256 chips. Trillium includes a third-generation SparseCore and is over 67% more energy-efficient than TPU v5e. In training benchmarks, Trillium delivered more than 4x the training performance of v5e for models such as Gemma 2-27B and Llama 2-70B, and a 3x increase in inference throughput for Stable Diffusion XL [7].
TPU v7, code-named Ironwood, was unveiled at Google Cloud Next in April 2025. Google described it as "the first TPU for the age of inference." Each chip contains two TensorCores and four SparseCores, fabricated as two chiplets, each with its own 96 GiB HBM3e partition (192 GiB total per chip with 7,380 GB/s bandwidth). Peak performance is 4,614 TFLOPS in FP8 and 2,307 TFLOPS in bfloat16 per chip. The chip uses a 3D torus topology with 1,200 GBps of bidirectional ICI bandwidth per chip and scales up to 9,216 chips in a single cluster, delivering a combined 42.5 exaFLOPS of FP8 compute. At approximately 1 kW per chip, the full 9,216-chip cluster requires nearly 10 MW and uses liquid cooling. Compared to Trillium, Ironwood delivers a 4x improvement in both training performance and inference throughput per chip [8].
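The pod-level numbers follow directly from the per-chip figures:

```python
chips = 9_216
fp8_tflops_per_chip = 4_614
watts_per_chip = 1_000                     # approximate per-chip power

print(chips * fp8_tflops_per_chip / 1e6)   # ~42.5 exaFLOPS of FP8 compute
print(chips * watts_per_chip / 1e6)        # ~9.2 MW for the full cluster
```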
TPU v2 introduced the bfloat16 (Brain Floating-Point 16) number format, which has since been adopted by other hardware vendors including NVIDIA, AMD, Intel, and Arm. The format uses 1 sign bit, 8 exponent bits, and 7 mantissa bits. By keeping the same 8-bit exponent as IEEE 754 FP32, bfloat16 preserves the same dynamic range (approximately 1.2 x 10^-38 to 3.4 x 10^38) while halving the storage and bandwidth requirements. Neural network training is much more sensitive to dynamic range than to precision, so the reduced mantissa has minimal impact on model accuracy. In TPU MXUs, bfloat16 inputs are multiplied and the results are accumulated in FP32, providing a mixed-precision pipeline that combines the bandwidth savings of 16-bit operands with the numerical stability of 32-bit accumulation [9].
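A small illustration of the format, converting by simple truncation of the FP32 bit pattern (hardware typically rounds to nearest even, but the dynamic-range behavior is the same):

```python
import struct

def fp32_to_bf16_bits(x):
    """Keep the top 16 bits of the FP32 encoding: sign, 8 exponent bits, 7 mantissa bits."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def bf16_bits_to_fp32(bits):
    """Re-expand a bfloat16 bit pattern to FP32 by zero-filling the dropped mantissa bits."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

for value in (3.141592653589793, 1.5e-38, 3.0e38):
    approx = bf16_bits_to_fp32(fp32_to_bf16_bits(value))
    print(f"{value:.6e} -> {approx:.6e}")
# pi loses precision (roughly 2-3 decimal digits survive), while the very small
# and very large values keep their magnitude because the 8-bit exponent is intact.
```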
TPUs are programmed through the XLA (Accelerated Linear Algebra) compiler, which takes high-level operations from machine learning frameworks and compiles them into optimized TPU machine code. XLA performs operation fusion, memory layout optimization, and scheduling to maximize hardware utilization.
Three major frameworks support TPUs: JAX, which targets XLA natively; TensorFlow, which lowers its computation graphs to TPU code through XLA; and PyTorch, which runs on TPUs through the PyTorch/XLA bridge.
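A minimal sketch of the programming model in JAX (hypothetical shapes; on a Cloud TPU VM the jitted function is compiled by XLA for the TPU backend, and the same program also runs on CPU or GPU):

```python
import jax
import jax.numpy as jnp

# A toy layer: matrix multiply followed by an element-wise activation.
# Under jit, XLA fuses and compiles the whole function; on a TPU the
# matmul is lowered onto the MXU and the GELU onto the vector unit.
@jax.jit
def layer(x, w):
    return jax.nn.gelu(jnp.dot(x, w))

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (1024, 512), dtype=jnp.bfloat16)
w = jax.random.normal(key, (512, 2048), dtype=jnp.bfloat16)

y = layer(x, w)                 # first call triggers XLA compilation
print(y.shape, y.dtype)         # (1024, 2048) bfloat16
print(jax.devices())            # lists TPU devices on a TPU VM
```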
A TPU Pod is a collection of TPU chips connected through high-bandwidth ICI links. Pods allow users to distribute training across hundreds or thousands of chips using data parallelism, model parallelism, or pipeline parallelism. TPU slice topologies are specified as tuples (for example, 4x4 for a 2D torus or 4x4x8 for a 3D torus), where each value represents the number of chips along one dimension.
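A sketch of how a slice topology shows up in user code, assuming a hypothetical 16-chip (4x4) slice; the axis names "data" and "model" are arbitrary labels chosen for this example:

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

# Arrange the 16 attached TPU devices as a 4x4 logical mesh.
devices = mesh_utils.create_device_mesh((4, 4))
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch across the "data" axis and the weight matrix across the
# "model" axis; XLA inserts the required ICI collectives automatically.
x = jax.device_put(jnp.ones((2048, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    return jnp.dot(x, w)        # output comes back sharded across both axes

y = forward(x, w)
print(y.sharding)
```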
For workloads that require more chips than a single pod provides, Google offers Multislice training, which connects multiple TPU slices over the data center network (DCN). Multislice training has been demonstrated with up to 18,432 TPU v5p chips across multiple slices.
TPUs and GPUs take fundamentally different approaches to accelerating machine learning.
| Aspect | TPU | GPU |
|---|---|---|
| Design philosophy | Purpose-built for tensor operations | General-purpose parallel processor adapted for ML |
| Core compute unit | Systolic array (MXU) | CUDA cores / Tensor Cores |
| Programming model | XLA compiler (JAX, TensorFlow, PyTorch/XLA) | CUDA, cuDNN, and broad ecosystem |
| Availability | Google Cloud only | Multiple cloud providers, on-premises, consumer hardware |
| Framework support | JAX (native), TensorFlow, PyTorch/XLA | PyTorch, TensorFlow, JAX, and many others |
| Interconnect | Custom ICI (2D/3D torus) | NVLink, NVSwitch, InfiniBand |
| Strengths | High throughput per watt on matrix operations; tightly integrated pods; cost-effective at scale | Broad ecosystem; flexible for diverse workloads; widely available |
| Limitations | Limited to Google Cloud; narrower framework ecosystem; less flexible for non-ML workloads | Higher power per FLOP on pure matrix work; less integrated multi-chip topology |
In addition to cloud TPUs, Google developed the Edge TPU for on-device inference at the network edge. The Edge TPU is a small, low-power ASIC capable of 4 TOPS while consuming only 2 W (2 TOPS per watt). It is available through the Google Coral product line, which includes USB accelerators, PCIe modules, and system-on-module boards. The Edge TPU runs TensorFlow Lite models compiled with the Edge TPU compiler and is designed for applications such as object detection, image classification, and keyword spotting on embedded devices. Google also integrated a custom Edge TPU variant called the Pixel Neural Core into certain Pixel smartphones for on-device camera processing [10].
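A minimal sketch of invoking an Edge TPU-compiled model with the TensorFlow Lite runtime (assumes the Coral libedgetpu library and the tflite_runtime package are installed; "model_edgetpu.tflite" is a hypothetical compiled model file):

```python
import numpy as np
from tflite_runtime import interpreter as tflite

# Load the compiled model and attach the Edge TPU delegate so that the
# supported operations run on the accelerator instead of the host CPU.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

# Feed a dummy tensor of the model's expected shape and dtype.
dummy = np.zeros(input_detail["shape"], dtype=input_detail["dtype"])
interpreter.set_tensor(input_detail["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_detail["index"]).shape)
```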
TPUs have been used to train and serve many well-known AI models:
| Model | TPU generation used | Year | Description |
|---|---|---|---|
| AlphaGo | v1 (inference), v2 (training) | 2016-2017 | Defeated world Go champion Lee Sedol |
| Transformer (original) | v2 | 2017 | Introduced the Transformer architecture, whose self-attention mechanism underlies modern LLMs |
| BERT | v3 | 2018 | Pre-trained bidirectional language representations |
| AlphaFold | v3 | 2020 | Predicted protein structures with atomic accuracy |
| LaMDA | v3/v4 | 2021 | Conversational language model |
| PaLM | v4 | 2022 | 540B-parameter language model trained on 6,144 TPU v4 chips |
| Gemini | v4/v5p | 2023 | Google's multimodal foundation model |
| Gemma | v5e/v5p | 2024 | Open-weight language models |