Cloud TPU (Tensor Processing Unit) is a family of custom application-specific integrated circuits (ASICs) developed by Google to accelerate machine learning workloads. First deployed internally at Google data centers in 2015 and publicly announced in May 2016, TPUs are designed from the ground up for neural network computation rather than general-purpose processing. Google offers TPUs to external users through its Google Cloud platform, where they are marketed as Cloud TPUs. As of 2025, Google has released seven generations of TPU hardware, each bringing substantial improvements in performance, memory capacity, and energy efficiency.
Imagine your brain is really good at lots of different things: reading, drawing, playing games, and doing math. That is like a regular computer chip (a CPU or GPU). Now imagine a special calculator that can only do one kind of math problem, but it does that one problem incredibly fast. That is what a TPU is. Google built this special calculator because training an AI model requires doing the same type of math (multiplying big grids of numbers) over and over, billions of times. By making a chip that only does this one job, Google made AI training and inference much faster and cheaper than using a regular chip that tries to do everything.
Google began developing TPUs around 2013 in response to internal projections showing that if every user spoke to their Android phone for just three minutes a day using voice search, the company would need to double its data center compute capacity. At the time, running deep learning inference on CPUs and GPUs was expensive in terms of both cost and power consumption. Google engineers, led by Norman Jouppi, designed a purpose-built chip that could handle neural network inference at scale with far better performance per watt than existing hardware.
The first TPU (v1) was deployed in Google data centers in 2015 and publicly disclosed at the Google I/O conference in May 2016. Jouppi and colleagues published the landmark paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" at the International Symposium on Computer Architecture (ISCA) in June 2017. The paper demonstrated that the TPU achieved 15 to 30 times higher performance and 30 to 80 times higher performance per watt compared to contemporary CPUs and GPUs for neural network inference workloads [1].
Google made TPUs available to external users through Google Cloud Platform starting with TPU v2 in 2017. Capabilities expanded from inference-only (v1) to both training and inference (v2 onward), and each subsequent generation has scaled up compute power, memory capacity and bandwidth, and interconnect speed.
At the core of every TPU is a systolic array, a grid of multiply-accumulate (MAC) units through which data flows in a rhythmic, pipelined fashion. In a systolic array, partial results move from one processing element to the next without returning to memory at each step. This design minimizes memory access overhead and maximizes throughput for matrix multiplication, which is the dominant operation in neural network training and inference.
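The dataflow can be made concrete with a small, illustrative simulation. The sketch below models an output-stationary systolic array in plain Python; the grid layout and timing scheme are simplifications for illustration, not a description of the actual MXU pipeline.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Each processing element (PE) at grid position (i, j) owns one output
    C[i, j]. Rows of A stream in from the left and columns of B stream in
    from the top, skewed so that A[i, k] meets B[k, j] at PE (i, j) on
    cycle i + j + k. The PE multiplies the pair and adds the product to its
    local accumulator; partial sums never travel back to memory, which is
    the point of the systolic design.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)       # one accumulator per PE
    last_cycle = (M - 1) + (N - 1) + (K - 1)     # cycle of the final multiply
    for cycle in range(last_cycle + 1):
        for i in range(M):
            for j in range(N):
                k = cycle - i - j                # operand pair reaching PE (i, j) now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]  # one multiply-accumulate per PE per cycle
    return C

A = np.random.rand(4, 3).astype(np.float32)
B = np.random.rand(3, 5).astype(np.float32)
assert np.allclose(systolic_matmul(A, B), A @ B, atol=1e-5)
```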
The original TPU v1 contained a single 256 x 256 systolic array of 8-bit multiply-accumulate units, providing 65,536 MACs that could perform up to 92 trillion operations per second (TOPS). Starting with TPU v2, the array was reorganized into 128 x 128 units operating on bfloat16 inputs with FP32 accumulation. TPU v6e and TPU v7 (Ironwood) expanded the MXU back to 256 x 256 multiply-accumulators, increasing per-cycle throughput.
Starting from TPU v2, each TPU chip contains one or more TensorCores. A TensorCore is a self-contained compute unit that includes:

- One or more matrix multiply units (MXUs), each built around a systolic array of multiply-accumulators
- A vector unit for element-wise operations such as activations and softmax
- A scalar unit for control flow, address generation, and other bookkeeping operations
Each TPU chip in v2 and v3 contains two TensorCores, as do the training-focused v4 and v5p chips, with each TensorCore housing four 128 x 128 MXUs. The efficiency-oriented v5e and v6e (Trillium) chips instead use a single TensorCore per chip; v6e and v7 move to the larger 256 x 256 MXUs noted above.
Starting with TPU v4, Google introduced SparseCores, specialized dataflow processors designed to accelerate models that rely heavily on sparse embedding lookups. Embedding-heavy models are common in recommendation systems and ranking workloads. TPU v4 includes four SparseCores per chip, each with dedicated scratchpad memory and optimized dataflow for sparse memory access patterns. Models with ultra-large embeddings have achieved 5 to 7 times speedups using SparseCores while consuming only about 5% of the total chip die area and power budget. TPU v5p features second-generation SparseCores, and TPU v6e includes third-generation SparseCores (two per chip).
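The access pattern that SparseCores target can be illustrated with a toy embedding lookup in JAX. The table size and feature ids below are hypothetical; on real hardware the irregular gather is the part SparseCores accelerate, while the dense math that follows runs on the MXUs.

```python
import jax.numpy as jnp

# Hypothetical embedding table: a 1M-row vocabulary of 128-dim embeddings.
vocab_size, dim = 1_000_000, 128
table = jnp.zeros((vocab_size, dim), dtype=jnp.bfloat16)

# A batch of sparse categorical features: each example touches a handful of
# rows scattered across the huge table (irregular, memory-bound access).
ids = jnp.array([[3, 17, 99_421],
                 [7, 7, 512_003]])

# Gather the referenced rows and pool them per example.
pooled = jnp.take(table, ids, axis=0).sum(axis=1)   # shape (2, 128)
```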
Google developed the bfloat16 (Brain Floating Point) number format specifically for TPU-based machine learning workloads. Bfloat16 is a 16-bit floating-point representation that uses one sign bit, eight exponent bits, and seven mantissa bits. Unlike IEEE FP16 (which trades exponent range for precision), bfloat16 preserves the same dynamic range as FP32 while halving memory usage. This design choice reflects the observation that neural networks are more sensitive to dynamic range than to precision during training.
On Cloud TPUs, matrix multiplications are performed with bfloat16 inputs and accumulated in FP32, providing a practical balance between computational speed and numerical accuracy. Because bfloat16 multipliers are roughly half the silicon area of FP16 multipliers and eight times smaller than FP32 multipliers, TPUs can pack more compute into the same die area [2].
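Both properties are easy to see in a short JAX sketch (the values and shapes are arbitrary):

```python
import jax.numpy as jnp

# bfloat16 keeps FP32's 8-bit exponent, so large magnitudes survive the cast,
# but only 7 mantissa bits remain, so roughly 3 significant decimal digits do.
print(jnp.float32(3.14159265).astype(jnp.bfloat16))   # ~3.140625
print(jnp.float32(1e38).astype(jnp.bfloat16))          # still finite; IEEE FP16 would overflow to inf

# The typical TPU matmul recipe: bfloat16 inputs, FP32 accumulation.
a = jnp.ones((128, 256), dtype=jnp.bfloat16)
b = jnp.ones((256, 128), dtype=jnp.bfloat16)
c = jnp.dot(a, b, preferred_element_type=jnp.float32)  # accumulate in FP32
print(c.dtype)  # float32
```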
TPU chips use High Bandwidth Memory (HBM) as their primary data store. HBM capacity and bandwidth have increased substantially with each generation, from 8 GB of DDR3 in TPU v1 to 192 GB of HBM per chip in TPU v7 (Ironwood).
TPU chips within a pod or slice communicate through high-speed Inter-Chip Interconnects (ICI). The network topology varies by generation: v2 and v3 pods use a 2D torus, the large training-oriented generations (v4, v5p, and v7) use a 3D torus, and from v4 onward optical circuit switches allow the topology to be reconfigured dynamically.
The following table summarizes the specifications of each TPU generation:
| Generation | Year | Process | Peak performance | HBM capacity | HBM bandwidth | Max pod size | Topology | Key feature |
|---|---|---|---|---|---|---|---|---|
| TPU v1 | 2015 | 28 nm | 92 TOPS (INT8) | 8 GB DDR3 | 34 GB/s | N/A (single board) | N/A | Inference only; 256x256 systolic array |
| TPU v2 | 2017 | 16 nm | 45 TFLOPS (bf16) | 16 GB HBM | 600 GB/s | 256 chips (11.5 PFLOPS) | 2D torus | First TPU for training; introduced bfloat16 |
| TPU v3 | 2018 | 16 nm | 123 TFLOPS (bf16) | 32 GB HBM | 900 GB/s | 1,024 chips | 2D torus | Liquid cooling; 2.7x perf over v2 |
| TPU v4 | 2021 | 7 nm | 275 TFLOPS (bf16) | 32 GB HBM | 1,200 GB/s | 4,096 chips | 3D torus | SparseCores; optical reconfigurable interconnect |
| TPU v5e | 2023 | N/A | 197 TFLOPS (bf16) | 16 GB HBM | 819 GB/s | 256 chips | 2D torus | Cost-efficient; training and inference |
| TPU v5p | 2023 | N/A | 459 TFLOPS (bf16) | 95 GB HBM | 2,765 GB/s | 8,960 chips | 3D torus | 2nd-gen SparseCores; competitive with H100 |
| TPU v6e (Trillium) | 2024 | N/A | 918 TFLOPS (bf16) | 32 GB HBM | 1,640 GB/s | 256 chips | 2D torus | 4.7x perf over v5e; 3rd-gen SparseCores |
| TPU v7 (Ironwood) | 2025 | N/A | 4,614 TFLOPS (FP8) | 192 GB HBM | 7,370 GB/s | 9,216 chips | 3D (ICI 9.6 Tb/s) | Inference-optimized; 2x perf/watt over v6e |
The first-generation TPU was designed exclusively for neural network inference. It featured a single 256 x 256 systolic array of 8-bit integer ALUs, 28 MiB of on-chip SRAM, and 8 GB of DDR3 memory. Operating at 700 MHz on a 28 nm process, it consumed only 28 to 40 watts while delivering 92 TOPS. TPU v1 was deployed as a coprocessor on the PCIe bus and was never offered as a standalone cloud product. It powered latency-sensitive Google services including Search ranking, Google Translate, Google Photos, and the inference engine for AlphaGo [1].
Announced at Google I/O in May 2017, TPU v2 was the first generation to support both training and inference. Each chip contained two TensorCores with 128 x 128 MXUs, 16 GB of HBM, and 600 GB/s memory bandwidth. TPU v2 introduced the bfloat16 number format and delivered 45 TFLOPS per chip. Pods of up to 256 chips provided 11.5 petaFLOPS of aggregate compute. TPU v2 was the first generation made available to external users through Google Cloud and the TensorFlow Research Cloud (TFRC) program [3].
Announced at Google I/O 2018, TPU v3 doubled the HBM capacity to 32 GB per chip and increased memory bandwidth to 900 GB/s. The clock speed rose from 700 MHz to 940 MHz, and peak performance reached 123 TFLOPS per chip. Pods scaled up to 1,024 chips, providing over 100 petaFLOPS of aggregate compute. TPU v3 was the first generation to require liquid cooling due to its higher power density [4].
Announced at Google I/O 2021 and made generally available in 2022, TPU v4 represented a major architectural leap. Built on a 7 nm process with a die size under 400 mm², it delivered 275 TFLOPS per chip. Each chip contained two TensorCores (four 128 x 128 MXUs each), four SparseCores, and 32 GB of HBM with 1,200 GB/s bandwidth.
TPU v4 introduced a 3D torus interconnect topology with optically reconfigurable circuit switches (OCS), allowing dynamic reconfiguration of the network topology to match workload requirements. A full v4 pod contained 4,096 chips with 10x the interconnect bandwidth per chip compared to previous generations. Google described the TPU v4 pod as an "optically reconfigurable supercomputer" in a 2023 paper [5].
Released in August 2023, TPU v5e was designed as a cost-efficient accelerator for both training and inference. It delivers 197 TFLOPS in bfloat16 and 393 TFLOPS in INT8, with 16 GB of HBM per chip. Pods support up to 256 chips in a 2D torus topology. Google positioned v5e as delivering the best price-performance ratio for mid-scale workloads, including large language model fine-tuning and serving [6].
Announced in December 2023 alongside the Gemini model, TPU v5p is Google's most powerful training-focused TPU prior to Trillium. Each chip delivers 459 TFLOPS in bfloat16 and 918 TFLOPS in INT8, with 95 GB of HBM and 2,765 GB/s bandwidth. A full v5p pod connects 8,960 chips in a 16 x 20 x 28 3D torus topology with 4,800 Gbps of ICI bandwidth per chip. TPU v5p features second-generation SparseCores that can train embedding-dense models 1.9x faster than TPU v4. Google stated that TPU v5p is competitive with the NVIDIA H100 for large model training [7].
Announced in mid-2024 and made generally available in late 2024, Trillium is Google's sixth-generation TPU. It achieves roughly 918 TFLOPS in bfloat16 per chip (approximately 4.7x the performance of TPU v5e) through larger 256 x 256 MXUs and a higher clock speed. HBM capacity doubled to 32 GB with doubled bandwidth (1,640 GB/s), and ICI bandwidth also doubled compared to v5e. Trillium includes third-generation SparseCores and is over 67% more energy efficient than TPU v5e.
Trillium pods scale up to 256 chips, and with Multislice technology and Titanium IPUs (Intelligence Processing Units), multiple pods can be connected into building-scale supercomputers with tens of thousands of chips. Google reported a 2.1x improvement in performance per dollar over v5e and 2.5x over v5p for dense LLM training on models such as Llama 2-70B and Llama 3.1-405B [8].
Unveiled at Google Cloud Next in April 2025, Ironwood is Google's seventh-generation TPU and the first generation explicitly designed for inference at scale. Each chip delivers 4,614 TFLOPS peak performance (FP8), a 10x improvement over TPU v5p per chip. Memory capacity jumps to 192 GB of HBM per chip with 7.37 TB/s bandwidth, six times the memory of Trillium.
Ironwood chips communicate via ICI at 9.6 Tb/s per chip. A full Ironwood superpod consists of 9,216 chips with access to 1.77 petabytes of aggregate HBM. Performance per watt is 2x that of Trillium, and Google states Ironwood is nearly 30x more power efficient than the first Cloud TPU offered in 2018. Each chip contains two TensorCores and four SparseCores [9].
TPU hardware is organized into a hierarchy of groupings:

- Chip: a single TPU package with its TensorCores, SparseCores (v4 onward), and HBM.
- Board (tray): four TPU chips mounted together and attached to a host CPU machine.
- Slice: a set of chips within one pod, all connected to each other over ICI; users provision slices in a range of shapes and sizes.
- Pod: the largest ICI-connected configuration of a generation, such as 8,960 chips for v5p or 9,216 chips for Ironwood.
Cloud TPU Multislice is a scaling technology that allows a single training job to span multiple TPU slices, even across different pods. Slices within a Multislice configuration communicate through data center networking (DCN), which has higher latency and lower bandwidth than ICI. Multislice supports data parallelism, Fully Sharded Data Parallelism (FSDP), model parallelism, and pipeline parallelism. Google demonstrated this capability by running the world's largest distributed LLM training job across 50,944 TPU v5e chips [10].
Cloud TPUs support three major machine learning frameworks:
| Framework | Integration method | Notes |
|---|---|---|
| JAX | Native via XLA | Primary framework for TPU development; developed by Google; compiles Python and NumPy-like code to XLA |
| TensorFlow | Native via XLA | Supported from TPU v2 onward; TPU v5e, v5p, and v6e support TensorFlow 2.15.0 and later via PJRT |
| PyTorch | Via PyTorch/XLA | Open-source library maintained by Google and the PyTorch community; uses XLA as the compiler backend |
JAX is a numerical computing library developed by Google that combines NumPy-like syntax with automatic differentiation and XLA (Accelerated Linear Algebra) compilation. JAX is the primary framework for TPU development at Google and is used for training large-scale models including Gemini. JAX's functional programming model maps naturally to TPU hardware, and its pjit and shard_map APIs provide fine-grained control over how computations and data are distributed across TPU chips [11].
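As a minimal sketch of this style of explicit sharding, the example below distributes a matmul over a hypothetical 8-chip slice (for example a single v5e host) using jax.jit with NamedSharding; the mesh shape and array sizes are illustrative and would need to match the devices actually available.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assume 8 TPU chips arranged as a 2 x 4 logical mesh.
devices = mesh_utils.create_device_mesh((2, 4))
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch across the "data" axis and the weight's output features
# across the "model" axis; XLA inserts the required collectives.
x = jax.device_put(jnp.ones((256, 512), jnp.bfloat16),
                   NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 1024), jnp.bfloat16),
                   NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # Each chip multiplies its local shards; the result stays sharded
    # along both the "data" and "model" axes.
    return jnp.dot(x, w, preferred_element_type=jnp.float32)

y = layer(x, w)
print(y.shape, y.sharding)
```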
TensorFlow was the original framework supported on Cloud TPUs. The TPU execution model in TensorFlow uses XLA compilation to translate TensorFlow graphs into optimized TPU machine code. Starting with TensorFlow 2.15.0, the PJRT runtime interface provides automatic device memory defragmentation and a simpler hardware integration path.
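A minimal sketch of the usual setup with tf.distribute.TPUStrategy follows, assuming the code runs on a Cloud TPU VM where the local TPU resolves with an empty address (the model itself is illustrative):

```python
import tensorflow as tf

# Connect to and initialize the attached TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Variables created under the strategy scope are replicated across TPU
# cores, and the training step is compiled to TPU code via XLA.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```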
PyTorch/XLA is an open-source library that enables PyTorch models to run on TPUs by converting PyTorch operations into XLA HLO (High Level Operations) graphs. The torchax library from Google further bridges PyTorch and JAX by wrapping JAX arrays as PyTorch tensor subclasses, enabling seamless interoperability. More recently, vLLM TPU (powered by tpu-inference) has unified JAX and PyTorch under a single lowering path for high-throughput LLM inference on TPUs.
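The lazy-execution flow can be sketched in a few lines, assuming the torch_xla package is installed on a TPU VM (tensor shapes are arbitrary):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()               # the TPU exposed as an XLA device
a = torch.randn(128, 256, device=device)
b = torch.randn(256, 64, device=device)
c = a @ b                              # recorded lazily into an XLA HLO graph
xm.mark_step()                         # compile and execute the pending graph on the TPU
print(c.shape)
```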
Google Cloud offers several ways to provision and use TPUs:

- Cloud TPU VMs: direct SSH access to the TPU host machines, suited to development and custom training loops.
- Google Kubernetes Engine (GKE): TPU slices exposed as node pools and scheduled as containerized workloads.
- Vertex AI: managed training and serving that runs on TPUs without requiring users to manage the underlying infrastructure.
Approximate pricing (as of 2025) varies by generation and committed use discount (CUD) level:
| TPU type | On-demand (per chip-hour) | 1-year CUD | 3-year CUD |
|---|---|---|---|
| TPU v5e | ~$1.20 | Discounted | Discounted |
| TPU v5p | ~$4.20 | Discounted | Discounted |
| TPU v6e (Trillium) | ~$2.70 | ~$1.89 | ~$1.22 |
TPU resources can be provisioned through the Google Cloud console, the gcloud CLI, or programmatically through Google Kubernetes Engine (GKE). GKE is the recommended orchestration layer for production TPU workloads, providing features such as job queueing with Kueue and Multislice job abstraction through the JobSet API.
TPUs and GPUs differ in their design philosophy and target workloads. The following table highlights the main differences:
| Aspect | Cloud TPU | GPU (e.g., NVIDIA H100/A100) |
|---|---|---|
| Design approach | Purpose-built ASIC for ML | General-purpose parallel processor |
| Precision formats | bfloat16, INT8, FP8 (v7), FP32 accum. | FP16, bfloat16, FP8, TF32, FP32, INT8 |
| Primary compute unit | Systolic array (MXU) | CUDA cores, Tensor Cores |
| Memory type | HBM (integrated) | HBM (integrated) |
| Interconnect | ICI (custom, in-pod) | NVLink, NVSwitch, InfiniBand |
| Software ecosystem | JAX, TensorFlow, PyTorch/XLA | CUDA, cuDNN, all major frameworks |
| Vendor lock-in | Google Cloud only | Multi-cloud, on-premises |
| Strengths | Large-batch training, LLM inference, cost per FLOP | Flexibility, broad framework support, general-purpose compute |
TPUs tend to offer better performance per dollar for large-scale, batch-oriented ML workloads, particularly for models that map well to matrix-heavy computation. Google has reported that TPU v6e provides up to 4x better performance per dollar compared to the NVIDIA H100 for LLM training and large-batch inference. However, GPUs offer broader software compatibility, support from multiple cloud providers, and the ability to handle diverse workloads beyond ML, including graphics rendering, simulation, and scientific computing [12].
The choice between TPUs and GPUs often depends on the specific workload, scale, framework preference, and whether vendor portability is a priority.
TPUs have powered many of Google's most notable AI systems and attracted major external customers:

- Internal workloads: Search ranking, Google Translate, Google Photos, AlphaGo, and the Gemini family of models have all been trained or served on TPUs.
- External customers: companies including Anthropic, Apple (which reported training its foundation models on TPU v4 and v5p), and Midjourney have run large-scale workloads on Cloud TPUs.
Despite their strong performance for ML workloads, Cloud TPUs have several limitations:

- Vendor lock-in: TPUs are available only through Google Cloud, so workloads cannot move to other providers or on-premises hardware without retargeting to GPUs.
- Software ecosystem: the XLA-based toolchain is narrower than the CUDA ecosystem, and many third-party libraries, custom kernels, and profiling tools assume NVIDIA GPUs.
- Workload fit: the architecture is optimized for large, dense matrix computation; models with dynamic shapes, heavy branching, or bespoke operations may map poorly to the systolic array.
- Availability: popular generations are subject to regional availability and quota constraints on Google Cloud.