A Tensor Processing Unit (TPU) is an application-specific integrated circuit (ASIC) designed by Google to accelerate machine learning workloads. Unlike general-purpose processors such as CPUs or GPUs, TPUs are purpose-built for the mathematical operations that dominate neural network training and inference. Google first deployed TPUs internally in 2015 and publicly announced them in 2016. Since then, seven generations of TPUs have been released, each delivering significant improvements in compute performance, memory capacity, and energy efficiency. TPUs power many of Google's largest AI systems, including Gemini, AlphaFold, and Google Search.
Imagine you have a regular toolbox that can fix lots of different things around the house. That is like a normal computer chip (a CPU). Now imagine you have a special tool that is really, really good at one specific job, like tightening screws super fast. A TPU is like that special tool, but instead of screws, it is really good at doing the math that helps computers learn. Google built TPUs so that programs that recognize pictures, understand speech, and answer questions can do their math much faster and use less electricity than if they used regular chips.
Google's motivation for building custom silicon came from a projection made in 2013. Engineers estimated that if every user spoke to their Android phone for just three minutes per day using voice search, Google would need to double the number of data centers worldwide to handle the deep learning inference load. This projected cost was unacceptable, so Google began developing a domain-specific accelerator that could run neural network inference far more efficiently than CPUs or GPUs.
The first TPU (v1) entered Google's data centers in 2015 and was formally announced at Google I/O in May 2016. A landmark paper by Norman Jouppi and colleagues, presented at the International Symposium on Computer Architecture (ISCA) in June 2017, described the TPU v1 architecture and showed that it was 15 to 30 times faster than contemporary CPUs and GPUs on inference tasks, with 30 to 80 times better performance per watt.
TPU v2, announced in 2017, expanded the scope from inference only to both training and inference, and introduced the bfloat16 number format, which later became an industry standard adopted by other hardware vendors including Intel, AMD, and NVIDIA. Each subsequent generation has pushed performance and scale further, culminating in the seventh-generation Ironwood chip announced in April 2025.
The fundamental compute unit inside a TPU chip is the TensorCore. Each TPU chip contains one or more TensorCores (Ironwood uses a chiplet design with two TensorCores per chip). A TensorCore consists of three main processing elements:
| Component | Function | Details |
|---|---|---|
| Matrix multiply unit (MXU) | Performs dense matrix multiplications | 128x128 systolic array (v2 through v5p) or 256x256 systolic array (v6e and Ironwood). Inputs in bfloat16; accumulations in FP32. A 128x128 MXU performs 16,384 multiply-accumulate operations per cycle; a 256x256 MXU performs 65,536. |
| Vector unit | General-purpose computation | Handles activations, softmax, normalization, and element-wise operations. |
| Scalar unit | Control and addressing | Manages control flow, memory address calculations, and other maintenance operations. |
The MXU uses a systolic array architecture, named after systole, the rhythmic contraction of the heart, because data pulses through the chip in a similar rhythm. In a systolic array, each multiply-accumulator passes its result directly to the next one in the grid without writing intermediate values back to memory. This removes the intermediate memory reads and writes that bottleneck matrix throughput on conventional processors.
In the TPU v1, the systolic array was 256x256, performing 65,536 multiply-accumulate operations per clock cycle. Running at 700 MHz, this delivered 92 trillion 8-bit operations per second (92 TOPS) while consuming only 40 watts. More than 90% of the silicon area was devoted to useful computation, compared to roughly 30% in a typical GPU.
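As a back-of-the-envelope check on those figures, the sketch below (plain Python, using the array size and clock rate quoted above) recomputes the peak throughput, counting each multiply-accumulate as two operations.

```python
# Peak throughput of the TPU v1 MXU from the figures quoted above.
array_rows, array_cols = 256, 256          # systolic array dimensions
clock_hz = 700e6                           # 700 MHz clock
macs_per_cycle = array_rows * array_cols   # 65,536 multiply-accumulates per cycle
ops_per_second = macs_per_cycle * 2 * clock_hz   # each MAC = 1 multiply + 1 add
print(f"{ops_per_second / 1e12:.1f} TOPS")       # ~91.8 TOPS, matching the quoted 92 TOPS
```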
The engineering tradeoff behind the systolic array is deliberate: it sacrifices the general-purpose flexibility of a GPU's thousands of programmable CUDA cores in exchange for much higher operation density and energy efficiency on matrix workloads.
TPUs use high-bandwidth memory (HBM) as their primary off-chip memory. Data flows through a pipeline: the host streams data into an infeed queue, the TPU loads it from the infeed queue into HBM, computations are performed, and results are placed into an outfeed queue.
On-chip, TPUs have vector memory (VMEM) that feeds data to the MXU and vector unit. VMEM bandwidth is approximately 22 times higher than HBM bandwidth, so operations reading from VMEM need an arithmetic intensity of only about 10 to 20 FLOPs per byte to achieve peak FLOPS utilization. This layered memory system is designed to keep the MXU fed with data as continuously as possible.
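To make the arithmetic-intensity claim concrete, here is a minimal roofline-style estimate in Python. The TPU v5e figures from the generations table below (~197 TFLOPS, 819 GB/s HBM) are used purely as illustrative inputs; the 22x VMEM ratio is the one stated above.

```python
# Roofline-style estimate: FLOPs per byte needed to stay compute-bound.
peak_flops = 197e12          # ~197 bf16 TFLOPS per TPU v5e chip (illustrative)
hbm_bw = 819e9               # 819 GB/s HBM bandwidth
vmem_bw = 22 * hbm_bw        # VMEM is roughly 22x faster than HBM

# An operation reaches peak FLOPS only if it performs at least this many
# floating-point operations per byte it loads from the given memory level.
print(f"from HBM : {peak_flops / hbm_bw:5.0f} FLOPs/byte")   # ~241
print(f"from VMEM: {peak_flops / vmem_bw:5.0f} FLOPs/byte")  # ~11
```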
Starting with TPU v4, Google added a specialized processor called SparseCore to handle embedding operations. While the MXU excels at dense matrix multiplication, embedding lookups in recommendation systems, ranking models, and large language models involve irregular, data-dependent memory access patterns where the MXU provides no advantage.
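The sketch below shows what such a lookup looks like in JAX: a data-dependent gather from a large table rather than a dense matrix multiply. The table size and indices are arbitrary illustrative values.

```python
import jax.numpy as jnp

# An embedding lookup is a gather: which rows are read depends on the input
# IDs, so the access pattern is irregular and maps poorly onto the MXU.
vocab_size, dim = 100_000, 128
table = jnp.zeros((vocab_size, dim))        # embedding table
ids = jnp.array([3, 97_512, 42, 7])         # data-dependent row indices
vectors = jnp.take(table, ids, axis=0)      # gather rows from the table
print(vectors.shape)                        # (4, 128)
```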
TPU v4 featured four SparseCores per chip, each containing 16 compute tiles that operate in parallel on disjoint subsets of embedding operations. The SparseCore achieved 5 to 7 times speedups over previous approaches while using only 5% of total chip die area and power budget. It is 3 times faster than TPU v3 on recommendation models and 5 to 30 times faster than CPU-based systems.
TPU chips within a single slice communicate over a high-speed inter-chip interconnect (ICI). The ICI connects chips directly to their neighbors without any external switches, a design Google calls "glueless" networking. The networking logic is integrated directly into the chip itself.
Different TPU generations use different ICI topologies:
| TPU generation | ICI topology | Notes |
|---|---|---|
| TPU v2, v3 | 2D torus | Each chip connects to 4 neighbors |
| TPU v4 | 3D torus | Each chip connects to 6 neighbors; uses optical circuit switches for reconfigurability |
| TPU v5e, v6e | 2D torus | Each chip connects to 4 neighbors |
| TPU v5p | 3D torus | 4,800 Gbps per chip; 8,960 chips per pod |
| Ironwood (v7) | 3D torus | 1.2 TB/s bidirectional bandwidth per chip |
Google has released seven generations of TPU hardware, each with different design targets and capabilities.
| Generation | Year | Compute (per chip) | HBM capacity | HBM bandwidth | Key features |
|---|---|---|---|---|---|
| TPU v1 | 2015 (deployed) / 2016 (announced) | 92 TOPS (INT8) | N/A (28 MiB on-chip) | N/A | Inference only; 256x256 systolic array; 40W TDP |
| TPU v2 | 2017 | 180 TFLOPS (per 4-chip device) | 64 GB (per device; 16 GB per chip) | N/A | First training-capable TPU; introduced bfloat16 |
| TPU v3 | 2018 | 420 TFLOPS (per 4-chip device) | 128 GB (per device; 32 GB per chip) | N/A | Liquid cooling; 2x performance over v2 |
| TPU v4 | 2021 | 275 TFLOPS (bf16) | 32 GB HBM2e | N/A | SparseCore; optical circuit switches; 3D torus ICI |
| TPU v5e | 2023 | ~197 TFLOPS | 16 GB | 819 GB/s | Cost-optimized; 2.7x perf/dollar over v4; single core per chip |
| TPU v5p | 2023 | ~459 TFLOPS | 95 GB | 2.8 TB/s | Performance-optimized; 8,960-chip pods (~4.45 EFLOPS) |
| TPU v6e (Trillium) | 2024 | ~918 TFLOPS | 32 GB | N/A | 256x256 MXU; 4.7x peak compute over v5e; 2x memory and ICI bandwidth over v5e |
| TPU v7 (Ironwood) | 2025 | 4,614 TFLOPS | 192 GB HBM3e | 7.4 TB/s | First inference-focused TPU since v1; chiplet design (two TensorCores); FP8 support; 5nm process; ~100B transistors; 600W TDP |
TPU v1 was designed exclusively for inference. Its 256x256 systolic array and 28 MiB of on-chip memory were sufficient for running trained models but not for the larger memory and bidirectional data flow requirements of training. Google deployed TPU v1 across its data centers to accelerate services like Google Search, Google Photos, Google Translate, and Gmail.
TPU v2 was the first generation capable of both training and inference. It introduced the bfloat16 floating-point format, a 16-bit format with the same exponent range as 32-bit IEEE float but reduced mantissa precision (7 bits instead of 23). This design maintains numerical stability during training while halving memory usage compared to float32. TPU v2 was made available to external researchers through the TensorFlow Research Cloud program.
TPU v3 more than doubled compute performance over v2, to 420 TFLOPS per four-chip device, and doubled HBM capacity to 128 GB per device. It was the first TPU generation to use liquid cooling, which allowed higher clock speeds and denser chip packaging.
Announced at Google I/O 2021, TPU v4 introduced two major architectural changes. First, it added SparseCore for embedding-heavy workloads. Second, it replaced electrical inter-pod connections with optical circuit switches (OCS), enabling dynamic reconfiguration of the 3D torus topology. A 2023 paper in the proceedings of ISCA described the TPU v4 supercomputer as an "optically reconfigurable supercomputer" with 4,096 chips per pod. TPU v4 achieved more than 2x the performance of v3.
The fifth generation was split into two variants. TPU v5e, announced in August 2023, was designed for cost efficiency, reducing core count and clock speed to hit aggressive power and cost targets. It delivers 2.7 times higher performance per dollar than TPU v4. TPU v5p, announced in December 2023, was designed for maximum training performance, scaling to 8,960-chip pods delivering approximately 4.45 exaFLOPS. Google positioned TPU v5p as competitive with the NVIDIA H100.
Announced at Google I/O in May 2024 and available in preview from October 2024, Trillium expanded the MXU from 128x128 to 256x256 multiply-accumulators, quadrupling peak FLOPS per cycle. Combined with a higher clock speed, this delivers 4.7 times the peak compute performance of TPU v5e. HBM capacity and bandwidth also doubled compared to v5e.
Unveiled at Google Cloud Next in April 2025, Ironwood is the first TPU generation since the inference-only v1 to be designed primarily for inference. Each chip delivers 4,614 TFLOPS of peak compute and includes 192 GB of HBM3e memory with 7.4 TB/s bandwidth. Ironwood uses a chiplet design in which two TensorCores, each with its own SparseCore pair and 96 GB of HBM, are connected by a die-to-die (D2D) interface that is six times faster than a single ICI link. The chip is fabricated on a 5nm process with approximately 100 billion transistors.
Ironwood is offered in two pod configurations: 256 chips and 9,216 chips. The larger configuration delivers 42.5 exaFLOPS, more than 24 times the compute of the El Capitan supercomputer. It is also the first TPU to support FP8 calculations in its matrix math units.
All code that runs on TPUs must be compiled by the XLA (Accelerated Linear Algebra) compiler. XLA is a just-in-time compiler that takes the computational graph emitted by a machine learning framework and compiles it into TPU machine code. Its most important optimization is operator fusion, which merges multiple operations into a single kernel to reduce memory transfers. Since memory bandwidth is typically the scarcest resource on hardware accelerators, eliminating unnecessary memory operations is one of the most effective ways to improve performance.
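As a minimal illustration of fusion, the JAX sketch below compiles a small layer with jax.jit; under jit, XLA can fuse the bias add and ReLU with the matrix multiply so that intermediate results are not written back to HBM. The shapes and values are arbitrary.

```python
import jax
import jax.numpy as jnp

@jax.jit
def fused_layer(x, w, b):
    # Without fusion, x @ w and the bias add would each produce an
    # intermediate array in memory; XLA can merge them with the ReLU.
    return jnp.maximum(x @ w + b, 0.0)

x = jnp.ones((128, 512), dtype=jnp.bfloat16)
w = jnp.ones((512, 256), dtype=jnp.bfloat16)
b = jnp.zeros((256,), dtype=jnp.bfloat16)
print(fused_layer(x, w, b).shape)  # (128, 256)
```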
XLA is now part of the OpenXLA project, an open-source initiative that provides a common compilation stack for JAX, PyTorch, and TensorFlow. Google's MLPerf submissions demonstrated a seven-fold performance gain in training throughput for BERT using XLA-optimized compilation.
TPUs support three major machine learning frameworks:
| Framework | TPU integration | Notes |
|---|---|---|
| JAX | Native support | JAX is designed around XLA from the ground up; the recommended framework for TPU development |
| TensorFlow | Native support | Historically the primary TPU framework; supports TPUStrategy for distributed training |
| PyTorch | Via PyTorch/XLA | Open-source package that enables PyTorch to run on XLA devices; uses the PJRT runtime |
To run PyTorch on TPUs, users install the torch_xla package and obtain a TPU device handle via xm.xla_device(). The PyTorch/XLA project has migrated from the older XRT runtime to the PJRT runtime used by JAX, improving compatibility and performance.
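A minimal sketch of that workflow, assuming torch and torch_xla are installed on a Cloud TPU VM:

```python
import torch
import torch_xla.core.xla_model as xm

# Obtain a handle to the TPU through the PJRT runtime and run a computation
# on it. PyTorch/XLA records operations lazily; calling .item() forces the
# recorded graph to be compiled and executed on the device.
device = xm.xla_device()
x = torch.randn(128, 256, device=device)
y = (x @ x.T).sum()
print(y.item())
```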
Google Cloud provides TPU access through Cloud TPU VMs, which give users direct SSH access to a Linux virtual machine with root privileges and access to the underlying TPU hardware. This architecture replaced the earlier "TPU node" model, which required a separate host VM communicating with TPU workers over gRPC. Cloud TPU VMs simplify debugging by providing direct access to compiler and runtime logs.
TPUs natively support the bfloat16 floating-point format. Bfloat16 uses one sign bit, eight exponent bits, and seven mantissa bits. By retaining the same exponent range as float32, bfloat16 avoids the overflow and underflow issues that plague the IEEE float16 format during training. Unlike float16, bfloat16 does not require loss scaling, making it nearly a drop-in replacement for float32.
By default, TPUs perform matrix multiplications with bfloat16 inputs and accumulate results in float32. This mixed-precision approach delivers performance gains ranging from 4% to 47% (geometric mean of 13.9%) while using half the memory of full float32 training. Ironwood is the first TPU to also support FP8 calculations.
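The JAX snippet below illustrates both points: bfloat16 retains float32's exponent range (unlike IEEE float16), and a matrix multiply can take bfloat16 inputs while accumulating in float32 via preferred_element_type. This is a small illustrative sketch, not a TPU-specific API.

```python
import jax
import jax.numpy as jnp

# bfloat16 keeps float32's 8 exponent bits, so its largest finite value is on
# the order of 3.4e38; IEEE float16 overflows past 65,504.
print(jnp.finfo(jnp.bfloat16).max)
print(jnp.finfo(jnp.float16).max)

# Mixed precision: bfloat16 inputs, float32 accumulation.
a = jnp.ones((256, 256), dtype=jnp.bfloat16)
b = jnp.ones((256, 256), dtype=jnp.bfloat16)
c = jax.lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)  # float32
```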
A TPU pod is a contiguous set of TPU chips grouped together within a specialized network. A slice is a subset of chips within a pod, all connected by ICI. Users provision TPU resources in slices of various sizes (for example, v5e-8 refers to a slice of 8 TPU v5e chips).
Multislice is a scaling technology that extends TPU connectivity beyond the ICI network of a single slice. In a multislice configuration, chips within each slice communicate over ICI, while chips in different slices communicate through host CPUs over the data-center network (DCN). The XLA compiler automatically inserts hierarchical collective operations and optimizes compute-communication overlap across the hybrid DCN/ICI topology.
Multislice enables training jobs to use more than 4,096 chips in a single run with TPU v4, and even larger configurations with later generations. This technology uses standard data parallelism and requires minimal code changes from the user.
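Below is a minimal sketch of the data-parallel pattern within a single slice, using JAX's sharding API; the axis name "data" and the array shapes are illustrative, and multislice itself is configured when provisioning resources rather than in this code.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Within one slice, JAX exposes every TPU chip as a device. Here a batch is
# split along its leading dimension across all devices (data parallelism).
devices = np.array(jax.devices())            # e.g. 8 devices on a v5e-8 slice
mesh = Mesh(devices, axis_names=("data",))
batch = jnp.ones((1024, 512))
sharded = jax.device_put(batch, NamedSharding(mesh, P("data")))
print(sharded.sharding)
```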
TPUs differ from CPUs and GPUs in fundamental ways. CPUs are general-purpose processors optimized for sequential tasks with complex control flow. GPUs contain thousands of smaller programmable cores designed for parallel workloads, originally graphics rendering but now widely used for machine learning via the CUDA programming model. TPUs sacrifice general-purpose flexibility entirely, dedicating almost all silicon area to matrix arithmetic.
| Attribute | CPU | GPU | TPU |
|---|---|---|---|
| Architecture | Few powerful cores with large caches | Thousands of small programmable CUDA cores | Systolic array (MXU) plus vector and scalar units |
| Design target | General-purpose computing | Parallel workloads (graphics, ML, HPC) | Neural network training and inference |
| Programmability | Fully programmable | Programmable via CUDA, OpenCL, etc. | Programmable via XLA compiler only |
| Memory | System DRAM (DDR) | HBM (up to 80 GB on H100) | HBM (up to 192 GB on Ironwood) |
| Power per chip | 65 to 350W (typical) | 300 to 1,000W (high-end AI GPUs) | 40W (v1) to 600W (Ironwood) |
| Availability | Universal | Multi-vendor (NVIDIA, AMD, Intel) | Google Cloud only |
| Software ecosystem | All languages and frameworks | CUDA (dominant), ROCm, OpenCL | XLA, JAX, TensorFlow, PyTorch/XLA |
TPU v3 trained BERT models 8 times faster than NVIDIA V100 GPUs and delivered 1.7 to 2.4 times faster training for ResNet-50 and large language models. BERT training completes 2.8 times faster on TPUs than on A100 GPUs, and batch inference delivers 4 times higher throughput for transformer models. Single-query latency is 30% lower for models exceeding 10 billion parameters.
Google's Cloud TPU v6e (Trillium) delivers approximately 4 times better performance per dollar than NVIDIA H100 GPUs for large language model inference, according to Google's published benchmarks.
TPUs consume significantly less power than comparable GPU setups on supported workloads. Modern TPUs deliver 2 to 3 times better performance per watt than contemporary GPUs. Individual TPU chips typically consume 175 to 250 watts (prior to Ironwood), while high-end AI GPUs may use 700 to 1,000 watts. TPU-based systems can reduce overall power consumption by 60 to 65% compared to equivalent GPU deployments.
TPUs are used across a wide range of AI applications both within Google and by external Cloud customers.
All phases of Gemini model training run on TPU v5e and v6e pods without fallback to NVIDIA GPUs. Google Search, Google Translate, Gmail, Google Photos, and YouTube all use TPUs for inference workloads. The AlphaFold protein structure prediction system, whose creators shared the 2024 Nobel Prize in Chemistry, runs on TPUs.
AssemblyAI reports that Cloud TPU v5e delivers up to 4 times greater performance per dollar for speech-recognition inference compared to other solutions. Gridspace achieved 5 times training speedups and 6 times larger inference scale on TPUs for conversational AI models. AI21 Labs uses Trillium TPUs for its Mamba/Jamba language models.
In addition to the data-center TPU line, Google produces the Edge TPU, a small ASIC designed for machine learning inference on low-power edge devices. The Edge TPU delivers 4 trillion operations per second (4 TOPS) while consuming only 2 watts, achieving 2 TOPS per watt.
The Edge TPU uses an estimated 64x64 systolic array running at 480 MHz. It can execute MobileNet V2 at nearly 400 frames per second and runs inference 70 to 100 times faster than a CPU on supported models. It supports only TensorFlow Lite models that are fully 8-bit quantized and compiled for the Edge TPU.
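A hedged sketch of the standard post-training full-integer quantization recipe such a model goes through before Edge TPU compilation, assuming a TensorFlow 2.x environment; the toy model and random calibration data stand in for a real trained network, and the resulting file would still need to be processed by the Edge TPU compiler.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for a trained model; a real workflow would load trained weights.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def representative_data():
    # A few batches of representative inputs let the converter calibrate int8
    # quantization ranges; random data is used here only as a placeholder.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()  # fully int8-quantized model for Edge TPU compilation
```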
Google's Coral product line offers several hardware form factors containing the Edge TPU, including the Coral Dev Board (a single-board Linux computer), the Coral USB Accelerator (a USB-C dongle), and system-on-module variants for custom designs.
Despite their performance advantages on supported workloads, TPUs have several limitations. They are available only through Google Cloud rather than as hardware that can be purchased and deployed elsewhere. All code must pass through the XLA compiler, so there is no low-level programming model comparable to CUDA's. And their advantages are concentrated on dense matrix arithmetic: workloads dominated by irregular control flow or sparse, data-dependent memory access patterns benefit far less.
Google has published lifecycle analyses of TPU carbon efficiency. Over two generations (TPU v4 to Trillium), TPU hardware design improvements have led to a 3 times improvement in the carbon efficiency of AI workloads. Ironwood demonstrates an approximately 3.7 times improvement in Compute Carbon Intensity compared to TPU v5p and is 30 times more power-efficient than the first Cloud TPU released in 2018.
Operational electricity emissions account for more than 70% of a TPU's lifetime carbon footprint. Google's data centers operate at a fleet-wide average Power Usage Effectiveness (PUE) of 1.09, meaning that for every unit of energy delivered to the computing equipment, only about 9% more is spent on cooling and other facility overhead.