See also: AI chip, GPU, Edge TPU
A Tensor Processing Unit (TPU) is an application-specific integrated circuit (ASIC) designed by Google to accelerate machine learning workloads, especially the dense matrix multiplications that dominate neural network training and inference. Unlike a GPU, which is a general-purpose parallel processor adapted for deep learning, the TPU is purpose-built around a systolic array for matrix multiply-accumulate (MAC) operations and a memory hierarchy tuned for tensor traffic.
Google began designing the TPU around 2013, deployed the first generation internally in 2015, and publicly announced the chip in May 2016 at the Google I/O conference. Since then the company has shipped seven public generations: TPU v1 (inference only), v2, v3, v4, the fifth-generation v5e and v5p variants, the sixth-generation Trillium (v6e), and the seventh-generation Ironwood (v7). The newer chips power Google's own large-scale services and are also rented to outside customers through Google Cloud Platform.
TPUs sit behind many of the workloads people now associate with modern AI: AlphaGo and AlphaZero in 2016, AlphaFold protein structure prediction in 2020, the PaLM family of large language models, the Gemini family, and Google products such as Search, Photos, and Translate. Anthropic disclosed in 2025 that its Claude models also train and serve on TPU pods, with multi-gigawatt commitments stretching into 2026 and beyond.
The TPU project started inside Google Research and Google's hardware engineering group when leadership realized that running deep neural networks at Google scale on CPUs would force a doubling of the company's data center footprint. Norman Jouppi, who had previously worked on the MIPS R4000 and on cache memory designs, led the team. The first chip was an inference accelerator built on a 28 nm process, clocked at 700 MHz, and rated at 28 to 40 watts thermal design power. It used 8-bit integer multiplications inside a 256 by 256 systolic array of 65,536 MAC units, with 28 MiB of on-chip software-managed memory and 8 GiB of attached DDR3 SDRAM at 34 GB/s.
Google deployed TPU v1 across its datacenters in 2015 and kept the chip undisclosed for more than a year before the public reveal. The 2017 ISCA paper "In-Datacenter Performance Analysis of a Tensor Processing Unit," written by Jouppi and 75 co-authors, described how TPU v1 was 15 to 30 times faster than contemporary GPUs and CPUs on Google's production neural networks (multilayer perceptrons, convolutional networks, and LSTMs that together represented about 95% of inference demand at the time), with 30 to 80 times better performance per watt. The same paper noted that four of the six benchmarked applications were memory bandwidth limited, an observation that shaped every TPU generation since.
The v2 chip, announced at Google I/O 2017, was the first generation to support training as well as inference. It introduced bfloat16, a 16-bit format with the same exponent range as IEEE FP32, which Google argued was a better fit for neural network gradients than IEEE FP16. v2 also introduced the pod concept: 256 chips wired together with custom interconnect, presented to programmers as a single training target.
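A quick way to see the bfloat16 trade-off is to cast values near the edge of the FP32 range. The sketch below uses jax.numpy, which exposes the bfloat16 scalar type; any NumPy installation with the ml_dtypes package behaves the same way, and the specific values are chosen only for illustration.

```python
import jax.numpy as jnp

# bfloat16 keeps FP32's 8-bit exponent, so very large magnitudes survive
# the cast down from 32 bits...
x = jnp.float32(3.0e38)           # near the top of the FP32 range
print(jnp.bfloat16(x))            # still finite in bfloat16
print(jnp.float16(x))             # overflows to inf in IEEE FP16

# ...but it keeps far fewer mantissa bits than FP32, so small relative
# differences are rounded away.
print(jnp.bfloat16(1.0) + jnp.bfloat16(0.001))   # rounds back to 1.0
```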
Google moved to liquid cooling with v3, added optical circuit switching for inter-rack networking with v4, and introduced dedicated SparseCore accelerators for embedding lookups in the same generation. Each generation has roughly tracked the trajectory of transformer workloads: more memory per chip, more bandwidth, larger pods, and more aggressive low-precision arithmetic.
The table below lists each public TPU generation with the figures Google itself publishes in its Cloud TPU documentation, in the original ISCA papers, and in the launch blog posts. Numbers are quoted at the precision the source uses; peak compute figures are bfloat16 unless otherwise noted.
| Generation | Announced | Process | Peak compute per chip | HBM per chip | HBM bandwidth per chip | Max chips per pod | Topology | Cooling |
|---|---|---|---|---|---|---|---|---|
| TPU v1 | May 2016 | 28 nm | 92 TOPS (INT8) | 8 GiB DDR3 | 34 GB/s | n/a (inference only) | PCIe attached | Air |
| TPU v2 | May 2017 | 16 nm | 45 TFLOPS (bf16) | 16 GB HBM | 600 GB/s | 256 | 2D torus | Air |
| TPU v3 | May 2018 | 16 nm | 123 TFLOPS (bf16) | 32 GB HBM | 900 GB/s | 1,024 | 2D torus | Liquid |
| TPU v4 | May 2021 | 7 nm | 275 TFLOPS (bf16) | 32 GB HBM | 1,200 GB/s | 4,096 | 3D torus, OCS | Liquid |
| TPU v5e | August 2023 | n/a | 197 TFLOPS (bf16) | 16 GB HBM | 819 GB/s | 256 | 2D torus | Liquid |
| TPU v5p | December 2023 | n/a | 459 TFLOPS (bf16) | 95 GB HBM3 | 2,765 GB/s | 8,960 | 3D torus, OCS | Liquid |
| TPU v6e (Trillium) | May 2024, GA Dec 2024 | n/a | 918 TFLOPS (bf16), 1,836 TOPS (INT8) | 32 GB HBM3 | 1,638 GB/s | 256 | 2D torus | Liquid |
| TPU v7 (Ironwood) | April 2025, GA Nov 2025 | n/a | 4,614 TFLOPS (FP8) | 192 GB HBM3E | 7,370 GB/s (~7.37 TB/s) | 9,216 | 3D mesh, OCS | Liquid |
A few notes on the table. Google has not always disclosed the manufacturing process for newer TPUs, so some cells say n/a. The v1 figure of 92 TOPS comes from the Jouppi 2017 paper; Google's earlier marketing sometimes quoted 23 TOPS, which referred to a sustained measurement on real workloads rather than the peak. v5p uses HBM3 with the highest per-chip bandwidth of any pre-Ironwood TPU. Trillium raises peak compute per chip to roughly 4.7 times v5e, doubles HBM capacity, HBM bandwidth, and ICI bandwidth, and ships with the third-generation SparseCore for embedding-heavy workloads. Ironwood pushes to 192 GB of HBM3E per chip, six times Trillium's capacity, and scales pods to 9,216 chips connected at 1.2 TB/s bidirectional ICI per chip, for a total of about 42.5 FP8 ExaFLOPS per pod.
The heart of every TPU is the matrix multiply unit (MXU), a systolic array of multiply-accumulate cells. In v1 the array was 256 by 256 INT8 cells; from v2 through v5p Google used 128 by 128 bfloat16 cells, with two MXUs per TensorCore. Trillium and Ironwood return to a larger 256 by 256 MXU per TensorCore. Operands flow through the array in lockstep: weights stay resident while activations stream across, accumulating partial sums as they go. The design eliminates almost all register-file traffic, which is why TPUs hit such high utilization on dense GEMMs.
A systolic array is a poor fit for irregular workloads. It assumes the multiplication has a fixed shape large enough to fill the array, and it punishes sparse or branch-heavy code. This is one reason Google added separate hardware paths for embeddings (the SparseCore introduced in v4 and refreshed in Trillium) and for vector operations.
Each TensorCore also contains a vector processing unit (VPU) for elementwise operations such as activations, normalizations, and softmax, plus a scalar unit for control flow and address arithmetic. The VPU is wider than a typical CPU SIMD lane but narrower than the MXU, and the compiler is responsible for scheduling work among the three units.
TPUs rely heavily on high-bandwidth memory (HBM) stacked next to the die. v2 had 16 GB of first-generation HBM at 600 GB/s; Ironwood ships 192 GB of HBM3E at roughly 7.37 TB/s. On-chip there is a vector memory and a small unified buffer. Unlike most CPUs and GPUs, the on-chip memory is software managed: the XLA compiler decides when to stage tensors in and out, which removes the cost and unpredictability of hardware caches but pushes more work onto the toolchain.
A TPU pod is the unit Google sells as a single coherent training target. Inside a pod, chips are wired together with a proprietary inter-chip interconnect (ICI) that does not go through the host CPU. In v2 and v3 the topology was a 2D torus; v4 introduced a 3D torus with optical circuit switches (OCS) at the rack level, letting Google reconfigure the network on the fly to route around failed components and pick the right shape for a given job (the user can request a twisted torus if the model benefits from it). Ironwood expands the 3D mesh to 9,216 chips per pod and 1.2 TB/s bidirectional ICI per chip.
The optical circuit switch is one of the most distinctive parts of modern TPU systems. Per the v4 ISCA paper, OCS and its underlying optical components account for less than 5% of system cost and less than 5% of system power, and they let multi-week training runs survive component failures that would otherwise require a full restart.
TPU code is written against the XLA compiler (Accelerated Linear Algebra) regardless of the front-end framework. XLA takes a high-level computation graph, fuses operators, picks layouts, schedules collectives across the ICI, and emits TPU machine code.
| Framework | Status on TPU | Notes |
|---|---|---|
| TensorFlow | Native, since v1 | The original TPU front end. v7 (Ironwood) does not support TensorFlow per Google's docs. |
| JAX | Native | The dominant choice for new research at Google DeepMind and at outside labs using Cloud TPUs. |
| PyTorch/XLA | Supported | Bridges PyTorch's eager tensors to XLA HLO for compilation. Google introduced TorchTPU in 2025 to give native PyTorch performance on TPUs. |
| Keras | Supported via TF or JAX backends | Used in many tutorials and Kaggle notebooks. |
XLA itself was open-sourced and now lives under the OpenXLA project, which also targets GPUs and CPUs. The trade-off of the compiler-first model is real: anything that does not fit XLA's static-shape, fused-graph assumption (dynamic shapes, data-dependent control flow, heavy Python in the inner loop) usually runs slowly on TPUs without rewrites.
| Dimension | TPU | GPU |
|---|---|---|
| Primary target | Dense matmul for neural networks | General parallel compute, graphics, HPC |
| Core compute unit | Systolic array (MXU) | SIMT cores plus tensor cores |
| Numeric formats | bf16 (since v2), INT8, FP8 (Ironwood) | FP16, bf16, FP8, FP4, INT8, FP32, FP64 |
| Memory model | Software-managed on-chip buffers, HBM | Hardware caches, HBM or GDDR |
| Pod-scale interconnect | Proprietary ICI plus optical circuit switching | NVLink and InfiniBand or Ethernet |
| Software ecosystem | XLA, JAX, TensorFlow, PyTorch/XLA | CUDA, ROCm, broad CUDA library ecosystem |
| Procurement | Google Cloud only (rental) | Sold by Nvidia, AMD, Intel; available from many clouds and on-prem |
| Best for | Large batch training and serving of dense models | Anything from a laptop to a hyperscale cluster |
In practice the choice usually comes down to software compatibility and supply. CUDA's depth makes GPUs the default for researchers who want maximum library coverage; TPUs win on certain large training jobs where a JAX or TF model fits the systolic array cleanly and where Google's pod scale and OCS keep utilization high.
Google has used TPUs for almost every flagship machine learning system it has shipped since 2016. The list below covers the better-documented examples.
| System | TPU generation | Year | Note |
|---|---|---|---|
| AlphaGo (Lee Sedol match) | v1 | 2016 | First high-profile TPU workload disclosed publicly. |
| Google Translate (Neural Machine Translation) | v1 | 2016 | Inference acceleration in production. |
| AlphaZero | v1, v2 | 2017 | Self-play reinforcement learning for Go, chess, shogi. |
| AlphaFold 2 | v3 | 2020 | DeepMind trained the model on 128 TPU v3 cores; convergence took roughly two weeks. |
| MUM and LaMDA | v3, v4 | 2021 | Internal Google language models. |
| PaLM | v4 | 2022 | Trained on 6,144 TPU v4 chips (two pods) over 56 days, sustaining ~60% of peak FLOPs. |
| Gemini 1.0 / 1.5 | v4, v5 | 2023, 2024 | Google's flagship multimodal models. |
| Claude (Anthropic) | v5, v6, v7 | 2024 onward | Anthropic disclosed multi-gigawatt TPU commitments through 2026 and beyond. |
| Gemini 3 | v6, v7 | 2025 | Used Trillium for training and Ironwood for serving. |
Outside Google, customers such as Salesforce, Hugging Face, Snap, and various academic groups run TPUs through Google Cloud. The chips have also become a fixture of Kaggle competitions, where Google offers free TPU time to participants.
The Edge TPU, announced in 2018, is a separate product line meant for on-device inference rather than data center training. A single Edge TPU performs about 4 trillion operations per second (4 TOPS) at roughly 0.5 watts per TOPS, and it is restricted to 8-bit integer arithmetic and to TensorFlow Lite models that have been compiled with the Edge TPU Compiler.
Google ships the chip through the Coral brand in several form factors: a USB accelerator stick, a mini PCIe card, an M.2 module, a system-on-module, and a small single-board computer. Pixel phones include a related but distinct "Pixel Neural Core" or "Tensor" chip, designed in collaboration with Google Silicon, which is not the same silicon as the Coral Edge TPU.
Google has been a regular submitter to MLCommons' MLPerf training and inference rounds. In MLPerf Training v4.1 (late 2024), Google reported that Trillium delivered up to 1.8x better performance per dollar than TPU v5p on dense LLM training, and that scaling efficiency hit 99% on the GPT-3 175B benchmark when going from a single pod to thousands of chips. On the same benchmark with 2,048 chips, Trillium completed training about two minutes faster than v5p's 29.6 minutes.
MLPerf results should be read with caution. They cover a fixed list of model architectures (BERT-large, GPT-3 175B pretraining, Llama 2 70B fine-tuning, Stable Diffusion, recommendation, object detection, graph node classification) and they let vendors tune software stacks aggressively. They are still the most public, most reproducible head-to-head comparison between TPU and GPU systems.
TPUs are not sold as standalone chips. The only commercial path is Google Cloud, where TPUs are rented as VMs. Pricing varies by generation and region. Cloud TPU is available in a handful of regions, mostly in North America, Europe, and parts of Asia, and individual generations are not always available in every region. Customers who want guaranteed long-term capacity sign multi-year reservation contracts; Anthropic's 2025 deal for over a gigawatt of TPU capacity is a recent example.
For lighter workloads Google offers TPU access through Colab and through the TPU Research Cloud program, which gives free TPU time to academic projects.
The TPU model has clear trade-offs. Workloads that do not fit the XLA mental model run poorly: dynamic shapes, data-dependent loops, and heavy Python control flow all hurt. Models written for PyTorch usually need at least light porting before they run efficiently. Debugging is harder than on a GPU because tensors live on a remote accelerator and because the compiled-graph model hides intermediate values. Procurement is tied to a single vendor, so TPU users carry a single-supplier risk that mirrors CUDA users' dependence on Nvidia. Regional availability inside Google Cloud is limited compared to GPUs, and some generations (notably Ironwood as of late 2025) require an account-team conversation before access. Finally, on a per-chip basis TPUs sometimes lose to current GPUs on workloads dominated by sparse computation or by very small batch sizes.
| Paper | Year | Venue | Topic |
|---|---|---|---|
| Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" | 2017 | ISCA | First public deep dive into TPU v1 hardware and workloads. |
| Jouppi et al., "A Domain-Specific Supercomputer for Training Deep Neural Networks" | 2020 | Comm. ACM / IEEE Micro | Architecture of TPU v2 and v3. |
| Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" | 2023 | ISCA | TPU v4 system design, OCS, SparseCore. |
| Various Google Cloud blog posts | 2023 to 2025 | n/a | Per-generation announcements with peak FLOPs and pod sizes. |
A TPU is a calculator chip that Google built specifically for the kind of math that teaches computers to recognize pictures, translate languages, and chat. Regular computer chips can do that math, but it takes them a long time and uses a lot of electricity. The TPU does only one trick, but it does it very fast. A bunch of TPUs wired together into a "pod" act like one giant brain that helps train models like AlphaGo, AlphaFold, Gemini, and Claude.