The Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google to accelerate machine learning workloads. First deployed in Google's data centers in 2015, TPUs are purpose-built for high-throughput, low-latency tensor operations, particularly the matrix multiplications at the heart of neural network training and inference. Over seven generations, Google has scaled the TPU from an inference-only accelerator delivering 92 TOPS to the Ironwood (TPU v7) chip delivering 4,614 TFLOPS, with superpods reaching 42.5 exaflops of aggregate compute.
TPUs have powered some of the most widely known AI systems in the world, including AlphaGo, AlphaFold, BERT, and Gemini. Google makes TPUs available to external users through Google Cloud, the TPU Research Cloud program, and Google Colab.
In 2013, Google recognized that if every user spoke to their Android phone for just three minutes per day, the company would need to double its data center compute capacity to handle the inference load. This realization prompted an internal effort to build custom silicon optimized for neural network inference. Dr. Amir Salek was recruited to establish custom silicon capabilities, and engineer Jonathan Ross (who later founded Groq) was among the original TPU designers.
The TPU v1 was designed, verified, fabricated, and deployed to production data centers in just 15 months, an unusually fast timeline for a custom ASIC. Google began deploying TPU v1 chips in its data centers in early 2015, but the existence of the chip remained secret for more than a year.
On May 18, 2016, at the Google I/O conference, CEO Sundar Pichai revealed that Google had been running TPUs inside its data centers for over a year. He stated that TPUs delivered "an order of magnitude better performance per watt for machine learning" compared to existing processors. The announcement came shortly after AlphaGo defeated world Go champion Lee Sedol in March 2016, a match in which TPUs powered AlphaGo's inference computations.
The TPU v1 architecture was formally described in the paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" by Norman P. Jouppi et al., presented at the 44th International Symposium on Computer Architecture (ISCA) in June 2017. The paper reported that the TPU was 15 to 30 times faster and 30 to 80 times more energy-efficient than contemporary server-class CPUs and GPUs (an Intel Haswell CPU and an NVIDIA K80 GPU) on production neural network inference workloads.
Broadcom serves as the co-developer of TPUs, translating Google's architecture and specifications into manufacturable silicon. All TPU generations have been fabricated by TSMC.
The defining architectural feature of the TPU is its systolic array, a grid of multiply-accumulate (MAC) units through which data flows in a regular, wave-like pattern (the name "systolic" is an analogy to the rhythmic pumping of the heart). In TPU v1, the matrix multiply unit (MXU) consists of a 256 x 256 grid of 8-bit MAC units, totaling 65,536 ALUs.
During a matrix multiplication, weight values are preloaded into the array from above (the right-hand side, or RHS), while activation values enter from the left (the left-hand side, or LHS) and flow horizontally across the array. Each MAC unit multiplies its stored weight by the incoming activation, adds the result to a partial sum arriving from above, and passes both the activation (horizontally) and the updated partial sum (vertically) to neighboring units. Because all 65,536 ALUs pass intermediate results directly between spatially adjacent units without any memory access, power consumption is significantly reduced. The short, local wires connecting adjacent ALUs are also more energy-efficient than long global interconnects.
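This dataflow can be sketched in a few lines of Python. The toy emulation below is not cycle-accurate and ignores the pipelined skewing of real hardware; it only models the weight-stationary pattern for a vector-matrix product, where weights stay in place, activations sweep across rows, and partial sums accumulate down columns.

```python
import numpy as np

def systolic_matmul(x, W):
    """Toy weight-stationary emulation of x @ W.

    PE (i, j) holds W[i, j]; the activation x[i] flows left-to-right
    along row i, while partial sums flow top-to-bottom along column j.
    """
    n, m = W.shape
    assert x.shape == (n,)
    psum = np.zeros(m)                 # partial sums entering the top row are zero
    for i in range(n):                 # vertical position in the array
        act = x[i]                     # activation entering row i from the left edge
        for j in range(m):             # activation hops one PE to the right each step
            psum[j] += W[i, j] * act   # MAC: add to the partial sum arriving from above
    return psum                        # results drain out of the bottom edge

rng = np.random.default_rng(0)
x, W = rng.standard_normal(8), rng.standard_normal((8, 8))
assert np.allclose(systolic_matmul(x, W), x @ W)
```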
From TPU v2 onward, the MXU uses a 128 x 128 systolic array (16,384 multiply-accumulate units per MXU), with each chip containing two or more MXUs. The TPU v6e (Trillium) and TPU v7 (Ironwood) expanded to a 256 x 256 MXU, quadrupling the number of FLOPs per cycle compared to earlier generations.
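As a rough sanity check, peak throughput follows from the array dimensions, the clock rate, and the number of MXUs per chip (figures taken from the text and the specification table below), counting each multiply-accumulate as two operations:

```python
# peak ops/s = MAC units per MXU x 2 ops per MAC x clock x MXUs per chip
v1_tops   = 256 * 256 * 2 * 700e6 * 1 / 1e12   # one 256x256 INT8 MXU at 700 MHz
v2_tflops = 128 * 128 * 2 * 700e6 * 2 / 1e12   # two 128x128 bfloat16 MXUs at 700 MHz
print(f"TPU v1 ~{v1_tops:.0f} TOPS, TPU v2 ~{v2_tflops:.0f} TFLOPS")
# ~92 TOPS and ~46 TFLOPS, close to the quoted 92 TOPS and 45 TFLOPS
```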
TPU v1 uses 8 GB of DDR3 DRAM as off-chip memory, providing 34 GB/s of bandwidth. On-chip, the design includes 28 MiB of software-managed SRAM (the "Unified Buffer") and 4 MiB of accumulator storage. This simplified memory hierarchy, with no hardware-managed caches, reduces memory access latency and die area compared to general-purpose processors.
Starting with TPU v2, Google switched to High Bandwidth Memory (HBM), dramatically increasing both capacity and bandwidth. By TPU v7, each chip has 192 GB of HBM with 7.37 TB/s of bandwidth.
TPU v2 introduced the bfloat16 (Brain Floating Point) number format, a custom 16-bit floating-point representation conceived at Google Brain. Bfloat16 uses 1 sign bit, 8 exponent bits, and 7 mantissa bits. By retaining the same 8-bit exponent as IEEE 754 float32, bfloat16 preserves the same dynamic range (values up to approximately 3.4 x 10^38) while halving memory usage. This is in contrast to the IEEE 754 float16 (half-precision) format, which uses 5 exponent bits and 10 mantissa bits, giving it a narrower dynamic range.
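Because a bfloat16 value is simply the upper half of a float32 bit pattern, the relationship is easy to show directly. A small sketch using NumPy and JAX's bfloat16 scalar type:

```python
import numpy as np
import jax.numpy as jnp

x = np.array(3.14159265, dtype=np.float32)
bits32 = int(x.view(np.uint32))
bits16 = bits32 >> 16          # keep sign (1), exponent (8), and top 7 mantissa bits
print(f"float32 : {bits32:032b}")
print(f"bfloat16: {bits16:016b}")

# the shared 8-bit exponent preserves float32's dynamic range
print(jnp.bfloat16(3.0e38))    # still finite, roughly 3.0e38
print(np.float16(3.0e38))      # overflows to inf in IEEE half precision
```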
Inside the MXU, multiplications are performed in bfloat16 while accumulations use full float32 precision, a mixed-precision strategy that maintains model accuracy while doubling throughput relative to pure float32 computation. Bfloat16 has since been adopted by other hardware vendors, including Intel, AMD, and NVIDIA, and is supported across all major deep learning frameworks.
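In JAX, this mixed-precision contraction can be requested explicitly. The sketch below (shapes chosen arbitrarily) multiplies bfloat16 operands while asking XLA for float32 accumulation:

```python
import jax
import jax.numpy as jnp

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(key_a, (128, 128), dtype=jnp.bfloat16)
b = jax.random.normal(key_b, (128, 128), dtype=jnp.bfloat16)

# bfloat16 inputs, float32 accumulation and output
c = jax.lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)   # float32
```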
TPU v2 introduced the Inter-Chip Interconnect (ICI), a custom high-bandwidth, low-latency network that links multiple TPU chips into a single logical accelerator called a "pod" or "slice." TPU v2 and v3 use a 2D torus topology, in which each chip connects to its four nearest neighbors (north, south, east, west). TPU v4 and v5p upgraded to a 3D torus, where each chip connects to six neighbors, increasing bisection bandwidth.
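The torus wiring is straightforward to express: each coordinate wraps around, giving every chip exactly two neighbors per dimension (four in a 2D torus, six in 3D). A minimal illustration:

```python
def torus_neighbors(coord, shape):
    """Nearest neighbors of a chip at `coord` in a torus of the given shape."""
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, +1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size   # wraparound link closes the torus
            neighbors.append(tuple(n))
    return neighbors

print(torus_neighbors((0, 0), (4, 4)))              # 4 neighbors on a 2D torus
print(len(torus_neighbors((0, 0, 0), (4, 4, 4))))   # 6 neighbors on a 3D torus
```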
TPU v4 introduced optical circuit switches (OCSes) based on 3D Micro-Electro-Mechanical Systems (MEMS) mirrors that can dynamically reconfigure the interconnect topology. This allows the system to form "twisted" 3D torus topologies that provide up to 70% higher bisection bandwidth than a standard torus. The OCS hardware accounts for less than 5% of system cost and less than 3% of system power. Each TPU v4 pod connects 4,096 chips through 48 OCSes using Google's custom Palomar 136x136 OCS.
TPU v7 (Ironwood) scales the ICI to 9.6 Tb/s per chip, enabling superpods of up to 9,216 chips.
Starting with TPU v4, Google added SparseCores to each chip. SparseCores are specialized dataflow processors designed to accelerate models that rely on embedding lookups, a common operation in recommendation systems and large language models. SparseCores occupy only about 5% of die area and power but accelerate embedding-heavy workloads by 5 to 7 times. TPU v5p introduced second-generation SparseCores with further improvements, and TPU v7 contains four SparseCores per chip.
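The workloads SparseCores target are dominated by memory-bound gathers rather than dense matrix multiplies. A toy embedding lookup in JAX (table size and dimensions chosen only for illustration) shows the access pattern:

```python
import jax
import jax.numpy as jnp

vocab, dim = 100_000, 128
table = jax.random.normal(jax.random.PRNGKey(0), (vocab, dim))

ids = jnp.array([3, 17, 42, 99_999])     # sparse feature IDs for one example
vectors = jnp.take(table, ids, axis=0)   # gather a handful of rows from a large table
pooled = vectors.sum(axis=0)             # then reduce, e.g. sum-pooling
print(pooled.shape)                      # (128,)
```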
The table below summarizes the key specifications of each TPU generation.
| Generation | Release year | Process node | Clock (MHz) | Memory | Memory bandwidth | Peak compute | TDP (W) | Chips per pod | Training support |
|---|---|---|---|---|---|---|---|---|---|
| TPU v1 | 2015 | 28 nm | 700 | 8 GB DDR3 | 34 GB/s | 92 TOPS (INT8) | 75 | N/A (inference only) | No |
| TPU v2 | 2017 | 16 nm | 700 | 16 GB HBM | 600 GB/s | 45 TFLOPS (BF16) | 280 | 256 (11.5 PFLOPS) | Yes |
| TPU v3 | 2018 | 16 nm | 940 | 32 GB HBM | 900 GB/s | 123 TFLOPS (BF16) | 220 | 1,024 (>100 PFLOPS) | Yes |
| TPU v4 | 2021 | 7 nm | 1,050 | 32 GB HBM | 1,200 GB/s | 275 TFLOPS (BF16) | 170 | 4,096 (>1 EFLOPS) | Yes |
| TPU v5e | 2023 | Not disclosed | Not disclosed | 16 GB HBM | 819 GB/s | 197 TFLOPS (BF16) | Not disclosed | 256 | Yes |
| TPU v5p | 2023 | Not disclosed | 1,750 | 95 GB HBM | 2,765 GB/s | 459 TFLOPS (BF16) | Not disclosed | 8,960 (4.45 EFLOPS) | Yes |
| TPU v6e (Trillium) | 2024 | Not disclosed | Not disclosed | 32 GB HBM | 1,640 GB/s | 918 TFLOPS (BF16) | Not disclosed | 256 | Yes |
| TPU v7 (Ironwood) | 2025 | Not disclosed | Not disclosed | 192 GB HBM | 7,370 GB/s | 4,614 TFLOPS (FP8) | Not disclosed | 9,216 (42.5 EFLOPS) | Yes |
The first-generation TPU was designed exclusively for inference. It connects to its host server via a PCIe 3.0 bus and operates as a coprocessor, receiving instructions from the host CPU. The chip was fabricated on a 28 nm process, runs at 700 MHz, and consumes 75 W. Its 256 x 256 systolic array of 8-bit integer MAC units delivers 92 TOPS. Google deployed over 100,000 TPU v1 chips across its data centers to serve production workloads including RankBrain (search ranking), Google Street View text recognition, and Google Photos image processing. A single TPU v1 could process over 100 million photos per day for Google Photos.
Announced in May 2017, TPU v2 was the first generation to support both training and inference. It introduced HBM, bfloat16 arithmetic, and the ICI interconnect. Each chip contains two MXUs delivering a combined 45 TFLOPS in bfloat16. Four chips form a board, and 64 boards (256 chips) form a full pod delivering 11.5 petaflops. TPU v2 was the first TPU made available to external users through Google Cloud.
Announced on May 8, 2018, TPU v3 doubled per-chip performance relative to TPU v2, reaching 123 TFLOPS in bfloat16. The clock speed increased to 940 MHz. Pods scaled to 1,024 chips with over 100 petaflops of aggregate compute. TPU v3 required liquid cooling due to its higher power density.
Announced on May 18, 2021, and described in the 2023 ISCA paper "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" by Jouppi et al., TPU v4 moved to a 7 nm process. Each chip delivers 275 TFLOPS in bfloat16 with 32 GB of HBM at 1,200 GB/s. The chip introduced SparseCores for embedding acceleration and optical circuit switches for reconfigurable 3D torus interconnect topology. A full pod of 4,096 chips exceeds 1 exaflop. Google reported that a TPU v4 deployment uses approximately one-third the electricity and emits approximately one-twentieth the CO2 of a comparable on-premises GPU cluster performing the same training. On production ML benchmarks, TPU v4 was reported to be 5 to 87% faster than an NVIDIA A100 GPU.
The TPU v5e is a cost-optimized variant designed for both training and inference on models up to approximately 200 billion parameters. It prioritizes price-performance, achieving 2.3 times better price-performance than TPU v4. Each chip has 16 GB of HBM and delivers 197 TFLOPS in bfloat16 (or 393 TOPS in INT8). Google reports that 8 TPU v5e chips can generate approximately 2,175 tokens per second on Llama 2-70B inference.
Announced in December 2023, the TPU v5p is the high-performance variant of the fifth generation, intended for large-scale training. Each chip delivers 459 TFLOPS in bfloat16 with 95 GB of HBM at 2,765 GB/s. A full v5p pod composes 8,960 chips in a 3D torus with 4,800 Gbps of ICI bandwidth per chip, reaching approximately 4.45 exaflops. TPU v5p can train large language models 2.8 times faster than TPU v4, and its second-generation SparseCores train embedding-dense models 1.9 times faster than TPU v4. The physical layout of TPU v5p was designed with the assistance of deep reinforcement learning.
Announced at Google I/O in May 2024 and made generally available in late 2024, Trillium is Google's sixth-generation TPU. Each chip delivers 918 TFLOPS in bfloat16, a 4.7 times increase over TPU v5e. The MXU was expanded from 128 x 128 to 256 x 256. HBM capacity doubled to 32 GB with 1,640 GB/s bandwidth. Trillium is over 67% more energy-efficient than TPU v5e. Pods scale to 256 chips with up to 13 TB/s of ICI bandwidth per chip.
Unveiled at Google Cloud Next in April 2025, Ironwood is Google's seventh-generation TPU and the first since TPU v1 to be designed with inference as the primary target. Each chip delivers 4,614 TFLOPS in FP8 and contains 192 GB of HBM with 7.37 TB/s bandwidth. The chip uses a chiplet architecture: two chiplets, each containing one TensorCore, two SparseCores, and 96 GB of HBM. Superpods scale to 9,216 chips connected via a 3D torus ICI at 9.6 Tb/s per chip, delivering 42.5 exaflops of aggregate compute and 1.77 petabytes of shared HBM. Ironwood offers more than 4 times better performance per chip for both training and inference compared to the previous generation.
In addition to data center TPUs, Google developed the Edge TPU, a small ASIC designed for on-device inference in low-power environments. The Edge TPU delivers 4 TOPS of INT8 inference performance while consuming only 2 watts (2 TOPS per watt). It can run models such as MobileNet V2 at nearly 400 frames per second. The Edge TPU supports only forward-pass operations (inference, not training) and requires 8-bit quantized TensorFlow Lite models.
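Deploying to the Edge TPU therefore means converting a model to a fully integer-quantized TensorFlow Lite model. The following is a hedged sketch of the usual conversion flow, assuming a hypothetical SavedModel directory `my_model/` with a single 224x224x3 image input:

```python
import numpy as np
import tensorflow as tf

def representative_data_gen():
    # calibration samples for quantization; in practice drawn from real data
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # Edge TPU requires fully 8-bit models
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
# the quantized model is then compiled for the device with the
# `edgetpu_compiler model_int8.tflite` command-line tool
```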
Google sells Edge TPU hardware under the Coral brand in several form factors, including USB accelerators, PCI-e modules, development boards, and system-on-module packages.
TPUs are supported by three major deep learning frameworks:
| Framework | Integration method | Notes |
|---|---|---|
| TensorFlow | Native support via XLA compiler | TensorFlow was the first framework with TPU support; tight integration with Google's ecosystem |
| JAX | Native support via XLA compiler | JAX's functional programming model and GSPMD (General-purpose SPMD) partitioner allow automatic parallelization across TPU pods with minimal code changes |
| PyTorch | PyTorch/XLA library | Open-source package that translates PyTorch operations to XLA for execution on TPUs |
XLA (Accelerated Linear Algebra) is an open-source compiler for machine learning that takes computation graphs from TensorFlow, JAX, and PyTorch and optimizes them for high-performance execution on TPUs, GPUs, and CPUs. XLA performs whole-program optimization, including operator fusion, memory layout assignment, and tile-size selection, producing efficient machine code for the target hardware.
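From the user's perspective the compiler is largely invisible. In JAX, for example, a `jax.jit`-decorated function is traced, handed to XLA, and compiled into fused kernels for whichever backend (TPU, GPU, or CPU) is available; a minimal sketch:

```python
import jax
import jax.numpy as jnp

@jax.jit
def gelu_layer(x, w, b):
    # matmul, bias add, and GELU can be fused by XLA into a small number of kernels
    return jax.nn.gelu(x @ w + b)

x = jnp.ones((8, 512), dtype=jnp.bfloat16)
w = jnp.ones((512, 512), dtype=jnp.bfloat16)
b = jnp.zeros((512,), dtype=jnp.bfloat16)

print(gelu_layer(x, w, b).shape)   # compiled on first call, reused for same shapes
print(jax.devices())               # lists TPU cores when run on a TPU VM
```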
| Feature | CPU | GPU | TPU |
|---|---|---|---|
| Design purpose | General-purpose computing | Parallel computing; originally graphics rendering | Machine learning inference and training |
| Core architecture | Few complex cores with large caches | Thousands of smaller CUDA/stream cores | Systolic array of MAC units |
| Arithmetic precision | FP64, FP32, INT32, INT64 | FP64, FP32, FP16, BF16, INT8, FP8 | BF16, FP32, INT8, FP8 (varies by generation) |
| Memory hierarchy | Multi-level hardware caches (L1, L2, L3) | HBM with hardware caches | HBM with software-managed SRAM (no hardware caches in v1) |
| Interconnect for scaling | Ethernet, InfiniBand | NVLink, NVSwitch, InfiniBand | Custom ICI with optical circuit switches |
| Programming model | Any language/framework | CUDA, ROCm, OpenCL | XLA (via TensorFlow, JAX, or PyTorch/XLA) |
| Availability | Ubiquitous | Multiple vendors (NVIDIA, AMD, Intel) | Google Cloud only |
TPUs are optimized for workloads dominated by large matrix multiplications and convolutions, such as training and serving transformer models, convolutional neural networks, and recommendation systems. GPUs offer broader flexibility for workloads with irregular computation patterns, custom CUDA kernels, or non-ML parallel computing tasks. CPUs remain the best choice for workloads with complex branching logic, low parallelism, or tasks that require broad instruction set support.
TPUs have been used to train and serve many well-known AI systems:
| Model or system | Year | TPU generation used | Domain |
|---|---|---|---|
| AlphaGo | 2016 | TPU v1 | Game playing (Go) |
| RankBrain | 2015 | TPU v1 | Search ranking |
| Google Street View text processing | 2015 | TPU v1 | OCR |
| AlphaZero | 2017 | TPU v2 | Game playing (chess, Shogi, Go) |
| BERT | 2018 | TPU v3 | Natural language processing |
| AlphaFold | 2020 | TPU v3 | Protein structure prediction |
| LaMDA | 2021 | TPU v4 | Conversational AI |
| PaLM | 2022 | TPU v4 | Large language model |
| Gemini | 2023 | TPU v4, v5e, v5p | Multimodal AI |
| Gemma | 2024 | TPU v5e | Open-weight LLM |
Google also offers the open-weight Gemma model family, which shares technical infrastructure with Gemini and was trained on TPUs.
TPUs are available to external users exclusively through Google Cloud. Pricing is per chip-hour and varies by TPU generation and region.
| TPU version | On-demand price (per chip-hour, USD) | Committed use (1-year) discount |
|---|---|---|
| TPU v4 | $0.24 | ~25-30% |
| TPU v5e | $0.32 | ~25-30% |
| TPU v5p | $0.48 | ~25-30% |
| TPU v6e (Trillium) | Varies by region | Available |
| TPU v7 (Ironwood) | Varies by region | Available |
Google also provides free or subsidized TPU access through several programs, including the TPU Research Cloud for academic researchers and Google Colab, which offers TPU runtimes in its hosted notebooks.
As of 2026, TPU v7 (Ironwood) is generally available. Google has also been in discussions with cloud providers such as CoreWeave and Crusoe about deploying TPUs outside of Google's own infrastructure.
Imagine your brain is really good at all kinds of things: reading, talking, doing math, playing games. That is like a regular computer chip (a CPU). Now imagine a special calculator that can only do one thing, but it does that one thing incredibly fast: multiplying lots of numbers at once. That is what a TPU is. Google built this special calculator because artificial intelligence programs need to multiply millions of numbers together over and over again. By making a chip that only does multiplication really well, Google can run AI programs much faster while using much less electricity than a regular chip.