A Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google specifically to accelerate machine learning workloads. First deployed internally in 2015 and publicly announced in 2016, TPUs power many of Google's core services and have been used to train and serve some of the largest neural networks ever built, including AlphaGo, AlphaFold, BERT, LaMDA, PaLM, and Gemini. As of 2025, Google has released seven generations of TPU chips, each bringing improvements in compute performance, memory capacity, interconnect bandwidth, and energy efficiency.
Unlike general-purpose GPUs, which were originally designed for graphics rendering and later adapted for parallel computation, TPUs are purpose-built for the matrix arithmetic that dominates deep learning. This specialization allows TPUs to deliver higher throughput per watt on machine learning tasks compared to general-purpose processors.
Imagine you have a regular calculator that can do all sorts of math problems, from addition to complicated algebra. That is like a GPU. Now imagine you have a special calculator that can only do one type of math (multiplying big grids of numbers), but it does that one thing incredibly fast. That is what a TPU is. Google built these special calculators because deep learning programs spend almost all their time multiplying big grids of numbers. By making a chip that only does that job, Google can train and run AI programs much faster while using less electricity.
The core compute engine inside every TPU is the matrix multiply unit (MXU), which is built as a systolic array. The name "systolic" comes from the analogy to a beating heart: data flows through the array in rhythmic, wave-like pulses. In a systolic array, multiply-accumulate (MAC) units are arranged in a two-dimensional grid. Weight values from one matrix are preloaded into the MAC units. Activation values from the other matrix enter from one edge and flow horizontally across the grid. Each MAC unit multiplies its stored weight by the incoming activation, adds the result to a partial sum arriving from the neighboring unit above, and passes both values onward. All intermediate results move directly between adjacent MAC units without returning to off-chip memory, which reduces power consumption and memory bandwidth requirements.
In TPU generations prior to v6e, each MXU is a 128 x 128 systolic array, giving 16,384 MAC operations per cycle. Starting with TPU v6e (Trillium) and continuing in TPU v7 (Ironwood), the MXU was enlarged to 256 x 256, quadrupling this to 65,536 MAC operations per cycle. MXU multiplications accept bfloat16 inputs, while accumulations are performed in FP32 to preserve numerical stability.
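The weight-stationary dataflow can be illustrated with a small cycle-level simulation. The sketch below is illustrative only (plain NumPy, no pipelining or hardware detail): weights are preloaded into the grid, activations enter the left edge with a one-cycle skew per row, and partial sums flow down each column until finished results drop out of the bottom edge.

```python
import numpy as np

def systolic_matmul(x, w):
    """Cycle-level sketch of a weight-stationary systolic array computing x @ w.

    MAC unit (k, n) permanently holds weight w[k, n]. Activations flow
    left-to-right along row k; partial sums flow top-to-bottom along
    column n; y[m, n] emerges from the bottom of column n.
    """
    M, K = x.shape
    K2, N = w.shape
    assert K == K2

    a_reg = np.zeros((K, N))   # activation register inside each MAC unit
    p_reg = np.zeros((K, N))   # partial-sum register inside each MAC unit
    y = np.zeros((M, N))

    for t in range(M + K + N):
        # Activation x[m, k] enters the left edge of row k at cycle t = m + k
        # (one cycle of skew per row keeps activations and partial sums aligned).
        a_in_left = np.zeros(K)
        for k in range(K):
            m = t - k
            if 0 <= m < M:
                a_in_left[k] = x[m, k]

        new_a = np.empty_like(a_reg)
        new_p = np.empty_like(p_reg)
        for k in range(K):
            for n in range(N):
                a_in = a_in_left[k] if n == 0 else a_reg[k, n - 1]  # from the left neighbor
                p_in = 0.0 if k == 0 else p_reg[k - 1, n]           # from the neighbor above
                new_a[k, n] = a_in                                  # pass the activation right
                new_p[k, n] = p_in + w[k, n] * a_in                 # multiply-accumulate, pass down
        a_reg, p_reg = new_a, new_p

        # After cycle t = m + (K - 1) + n, the bottom of column n holds y[m, n].
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                y[m, n] = p_reg[K - 1, n]
    return y

rng = np.random.default_rng(0)
a, b = rng.standard_normal((5, 8)), rng.standard_normal((8, 3))
assert np.allclose(systolic_matmul(a, b), a @ b)
```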
Each TPU chip contains one or more TensorCores. A TensorCore bundles together one or more MXUs, a vector processing unit (VPU), and a scalar processing unit. The VPU handles element-wise operations such as activations, normalization, and softmax, while the scalar unit manages control flow and address computation. The number of MXUs per TensorCore and the number of TensorCores per chip have increased with each TPU generation.
| TPU version | TensorCores per chip | MXUs per TensorCore | Total MXUs per chip |
|---|---|---|---|
| v1 | 1 | 1 | 1 |
| v2 | 2 | 1 | 2 |
| v3 | 2 | 2 | 4 |
| v4 | 2 | 4 | 8 |
| v5e | 1 | 4 | 4 |
| v5p | 2 | 4 | 8 |
| v6e (Trillium) | 1 | 2 | 2 |
| v7 (Ironwood) | 2 | (not disclosed) | (not disclosed) |
Starting with TPU v4, Google added SparseCores to the chip. SparseCores are specialized dataflow processors designed to accelerate embedding lookups, which are common in recommendation and ranking models. They accelerate embedding-heavy models by 5x to 7x while using only about 5% of the die area and power budget. TPU v6e includes a third-generation SparseCore, and TPU v7 (Ironwood) includes four SparseCores per chip.
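For context, an embedding lookup is essentially a sparse gather followed by a small reduction rather than a dense matrix multiply, which is why it maps poorly onto the MXU and benefits from a dedicated unit. A minimal illustration (hypothetical table size and feature IDs):

```python
import numpy as np

# Hypothetical 100,000-row embedding table with 128-dimensional rows.
table = np.random.default_rng(0).standard_normal((100_000, 128)).astype(np.float32)

# One training example activates a handful of sparse feature IDs.
feature_ids = np.array([17, 4_242, 99_999])

# The lookup is a gather of a few scattered rows followed by a pooling
# reduction -- irregular memory traffic, not a dense matmul.
pooled = table[feature_ids].sum(axis=0)
print(pooled.shape)   # (128,)
```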
TPUs use high-bandwidth memory (HBM) for off-chip storage, providing the large capacity and bandwidth needed for model weights, activations, and optimizer states. On-chip, each TensorCore has vector memory (VMEM), which serves as a high-speed scratchpad for data being actively processed. Some generations also include separate common memory (CMEM) and SparseCore memory (spMEM).
TPU chips within a pod communicate over a custom inter-chip interconnect (ICI). TPU v2, v3, v5e, and v6e use a 2D torus topology, where each chip connects to its four nearest neighbors. TPU v4, v5p, and v7 use a 3D torus topology, where each chip connects to six neighbors. The additional dimension reduces the network diameter (the maximum number of hops between any two chips) from on the order of the square root of N to on the order of the cube root of N, where N is the total number of chips. This lower diameter improves collective communication performance for large-scale distributed training.
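A back-of-the-envelope sketch of why the extra dimension helps (this models only hop counts on an ideal torus, not actual routing or link bandwidth):

```python
import math

def torus_diameter(sides):
    """Maximum hop count between any two chips in a torus with the given side lengths.

    Within each dimension the farthest pair is floor(side / 2) hops apart
    (links wrap around), and hops in different dimensions add.
    """
    return sum(side // 2 for side in sides)

n_chips = 4096
side_2d = math.isqrt(n_chips)          # 64 x 64       (2D torus)
side_3d = round(n_chips ** (1 / 3))    # 16 x 16 x 16  (3D torus)

print(torus_diameter([side_2d] * 2))   # 64 hops
print(torus_diameter([side_3d] * 3))   # 24 hops
```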
TPU v4 introduced optical circuit switches (OCSes), which allow the interconnect topology to be dynamically reconfigured. This feature improves cluster availability, utilization, and fault isolation, and it enables users to select twisted torus topologies that provide up to 70% higher bisection bandwidth compared to standard tori. OCSes account for less than 5% of system cost and less than 3% of system power.
Google has released seven generations of TPU hardware since 2015. The following table summarizes the key specifications of each generation.
| Specification | v1 (2015) | v2 (2017) | v3 (2018) | v4 (2021) | v5e (2023) | v5p (2023) | v6e / Trillium (2024) | v7 / Ironwood (2025) |
|---|---|---|---|---|---|---|---|---|
| Process node | 28 nm | 16 nm | 16 nm | 7 nm | Not disclosed | Not disclosed | Not disclosed | Not disclosed |
| Clock speed | 700 MHz | 700 MHz | 940 MHz | 1,050 MHz | Not disclosed | 1,750 MHz | Not disclosed | Not disclosed |
| TensorCores per chip | 1 | 2 | 2 | 2 | 1 | 2 | 1 | 2 |
| HBM capacity per chip | 8 GiB (DDR3) | 16 GiB HBM | 32 GiB HBM | 32 GiB HBM | 16 GiB HBM | 95 GiB HBM | 32 GiB HBM | 192 GiB HBM3e |
| HBM bandwidth per chip | 34 GB/s | 600 GB/s | 900 GB/s | 1,200 GB/s | 819 GB/s | 2,765 GB/s | 1,640 GB/s | 7,380 GB/s |
| Peak compute (BF16) | N/A (INT8 only) | 45 TFLOPS | 123 TFLOPS | 275 TFLOPS | 197 TFLOPS | 459 TFLOPS | 918 TFLOPS | 2,307 TFLOPS |
| Peak compute (INT8) | 92 TOPS | N/A | N/A | N/A | 393 TOPS | 918 TOPS | 1,836 TOPS | N/A |
| Peak compute (FP8) | N/A | N/A | N/A | N/A | N/A | 459 TFLOPS | N/A | 4,614 TFLOPS |
| TDP (per chip) | 28-40 W | Not disclosed | Not disclosed | 170 W | Not disclosed | Not disclosed | Not disclosed | ~1,000 W |
| ICI bandwidth per chip | N/A (PCIe) | Not disclosed | Not disclosed | Not disclosed | 400 GBps | 1,200 GBps | 800 GBps | 1,200 GBps |
| ICI topology | N/A | 2D torus | 2D torus | 3D torus | 2D torus | 3D torus | 2D torus | 3D torus |
| Max chips per pod | 1 (PCIe card) | 256 | 1,024 | 4,096 | 256 | 8,960 | 256 | 9,216 |
| Cooling | Air | Air | Liquid | Liquid | Air | Liquid | Not disclosed | Liquid |
| Primary use | Inference | Training and inference | Training and inference | Training and inference | Inference and fine-tuning | Large-scale training | Training and inference | Inference-optimized |
The first-generation TPU was designed exclusively for inference. It contained a single 256 x 256 systolic array of 8-bit integer multiply-accumulate units, delivering a peak throughput of 92 TOPS (INT8). The chip was fabricated on a 28 nm process, fit on a PCIe card, drew 28 to 40 W, and used 8 GiB of DDR3 SDRAM rather than HBM. Google deployed TPU v1 across its data centers starting in 2015 to accelerate inference for services such as Google Search (RankBrain), Google Translate, Google Photos, and Google Street View. The chip was publicly described in a 2017 ISCA paper by Jouppi et al., which showed that the TPU was 15x to 30x faster and 30x to 80x more energy-efficient than contemporary CPUs and GPUs on inference workloads [1].
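The quoted peak follows directly from the array size and clock rate, counting a multiply and an add as two operations:

```python
macs_per_cycle = 256 * 256      # one 256 x 256 grid of INT8 MAC units
ops_per_mac = 2                 # each MAC is one multiply plus one add
clock_hz = 700e6                # 700 MHz

peak_tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(peak_tops)                # ~91.8, matching the quoted 92 TOPS (INT8)
```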
TPU v2 was the first generation designed for both training and inference. The architecture was significantly restructured: the single 256 x 256 INT8 array was replaced by two TensorCores, each containing a 128 x 128 bfloat16 MXU. This was the first chip to use the bfloat16 floating-point format, which Google Brain developed to preserve the dynamic range of FP32 (by keeping 8 exponent bits) while halving the storage and bandwidth costs (by truncating the mantissa to 7 bits). Each chip delivered 45 TFLOPS in bfloat16 and had 16 GiB of HBM with 600 GB/s bandwidth. Up to 256 chips could be connected in a 2D torus topology to form a TPU v2 Pod, achieving 11.5 petaFLOPS of aggregate peak compute [2].
TPU v3 retained the two-TensorCore-per-chip design but doubled the number of MXUs per TensorCore from one to two, increased the clock speed from 700 MHz to 940 MHz, and doubled HBM capacity to 32 GiB per chip with 900 GB/s bandwidth. Peak per-chip performance rose to 123 TFLOPS in bfloat16, more than double that of v2. The higher power density required liquid cooling for the first time. A TPU v3 Pod contained up to 1,024 chips. TPU v3 was used to train AlphaFold, which predicted protein structures with atomic-level accuracy using 128 TPU v3 cores [3].
TPU v4 moved to a 7 nm process node and doubled the number of MXUs per TensorCore from two to four. It introduced a 3D torus interconnect topology, replacing the 2D torus of previous generations, and was the first TPU to deploy optical circuit switches (OCSes) for reconfigurable networking. A single TPU v4 Pod contained 4,096 chips. The chip also introduced SparseCores for embedding acceleration. TPU v4 delivered 275 TFLOPS per chip in bfloat16, consumed 170 W per chip, and was described in a 2023 ISCA paper as being 1.2x to 1.7x faster than the NVIDIA A100 while using 1.3x to 1.9x less power [4]. A v4i variant was also produced for inference-only workloads without liquid cooling.
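As a consistency check on the figures above (eight 128 x 128 MXUs per chip at the quoted 1,050 MHz clock):

```python
mxus_per_chip = 2 * 4                   # two TensorCores x four MXUs each
macs_per_mxu = 128 * 128
ops_per_mac = 2                         # multiply + accumulate
clock_hz = 1.05e9                       # 1,050 MHz

peak_tflops = mxus_per_chip * macs_per_mxu * ops_per_mac * clock_hz / 1e12
print(peak_tflops)                      # ~275 TFLOPS (BF16), matching the table
```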
TPU v5e was designed as a cost-efficient option optimized for inference and fine-tuning rather than maximum training performance. It has a single TensorCore with four MXUs, 16 GiB HBM with 819 GB/s bandwidth, and delivers 197 TFLOPS in bfloat16 or 393 TOPS in INT8 per chip. It returned to the 2D torus topology (sufficient for its smaller target pod sizes of up to 256 chips) and uses air cooling. The v5e provides a lower cost-per-inference than the v5p, making it popular for serving workloads on Google Cloud [5].
TPU v5p targeted maximum performance for large-scale training. Each chip has two TensorCores with four MXUs each (eight MXUs total), 95 GiB of HBM with 2,765 GB/s bandwidth, and delivers 459 TFLOPS per chip in bfloat16. It uses a 3D torus topology with 1,200 GBps of bidirectional ICI bandwidth per chip. A TPU v5p Pod contains 8,960 chips, with the largest schedulable job using 6,144 chips in a 3D torus configuration. Google described the v5p as competitive with the NVIDIA H100 [6].
TPU v6e, marketed as Trillium, was announced at Google I/O in May 2024 and became generally available in late 2024. It features an enlarged 256 x 256 MXU (up from 128 x 128 in prior generations), delivering 918 TFLOPS per chip in bfloat16, a 4.7x increase over TPU v5e. HBM capacity is 32 GiB per chip with 1,640 GB/s bandwidth. Each chip has 800 GBps of bidirectional ICI bandwidth over a 2D torus topology, with pods scaling to 256 chips. Trillium includes a third-generation SparseCore and is over 67% more energy-efficient than TPU v5e. In training benchmarks, Trillium delivered more than 4x the training performance of v5e for models such as Gemma 2-27B and Llama 2-70B, and a 3x increase in inference throughput for Stable Diffusion XL [7].
TPU v7, code-named Ironwood, was unveiled at Google Cloud Next in April 2025. Google described it as "the first TPU for the age of inference." Each chip contains two TensorCores and four SparseCores, fabricated as two chiplets, each with its own 96 GiB HBM3e partition (192 GiB total per chip with 7,380 GB/s bandwidth). Peak performance is 4,614 TFLOPS in FP8 and 2,307 TFLOPS in bfloat16 per chip. The chip uses a 3D torus topology with 1,200 GBps of bidirectional ICI bandwidth per chip and scales up to 9,216 chips in a single cluster, delivering a combined 42.5 exaFLOPS of FP8 compute. At approximately 1 kW per chip, the full 9,216-chip cluster requires nearly 10 MW and uses liquid cooling. Compared to Trillium, Ironwood delivers a 4x improvement in both training performance and inference throughput per chip [8].
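The pod-level numbers follow directly from the per-chip figures:

```python
chips = 9_216
fp8_tflops_per_chip = 4_614
watts_per_chip = 1_000                     # approximate per-chip power

print(chips * fp8_tflops_per_chip / 1e6)   # ~42.5 exaFLOPS of FP8 compute
print(chips * watts_per_chip / 1e6)        # ~9.2 MW for the full cluster
```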
TPU v2 introduced the bfloat16 (Brain Floating-Point 16) number format, which has since been adopted by other hardware vendors including NVIDIA, AMD, Intel, and Arm. The format uses 1 sign bit, 8 exponent bits, and 7 mantissa bits. By keeping the same 8-bit exponent as IEEE 754 FP32, bfloat16 preserves the same dynamic range (approximately 1.2 x 10^-38 to 3.4 x 10^38) while halving the storage and bandwidth requirements. Neural network training is much more sensitive to dynamic range than to precision, so the reduced mantissa has minimal impact on model accuracy. In TPU MXUs, bfloat16 inputs are multiplied and the results are accumulated in FP32, providing a mixed-precision pipeline that combines the bandwidth savings of 16-bit operands with the numerical stability of 32-bit accumulation [9].
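A small illustration of the format, converting by simple truncation of the FP32 bit pattern (hardware typically rounds to nearest even, but the dynamic-range behavior is the same):

```python
import struct

def fp32_to_bf16_bits(x):
    """Keep the top 16 bits of the FP32 encoding: sign, 8 exponent bits, 7 mantissa bits."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def bf16_bits_to_fp32(bits):
    """Re-expand a bfloat16 bit pattern to FP32 by zero-filling the dropped mantissa bits."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

for value in (3.141592653589793, 1.5e-38, 3.0e38):
    approx = bf16_bits_to_fp32(fp32_to_bf16_bits(value))
    print(f"{value:.6e} -> {approx:.6e}")
# pi loses precision (roughly 2-3 decimal digits survive), while the very small
# and very large values keep their magnitude because the 8-bit exponent is intact.
```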
TPUs are programmed through the XLA (Accelerated Linear Algebra) compiler, which takes high-level operations from machine learning frameworks and compiles them into optimized TPU machine code. XLA performs operation fusion, memory layout optimization, and scheduling to maximize hardware utilization.
Three major frameworks support TPUs: JAX, which targets XLA natively; TensorFlow, which lowers its computation graphs to TPU code through XLA; and PyTorch, which runs on TPUs through the PyTorch/XLA bridge.
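A minimal sketch of the programming model in JAX (hypothetical shapes; on a Cloud TPU VM the jitted function is compiled by XLA for the TPU backend, and the same program also runs on CPU or GPU):

```python
import jax
import jax.numpy as jnp

# A toy layer: matrix multiply followed by an element-wise activation.
# Under jit, XLA fuses and compiles the whole function; on a TPU the
# matmul is lowered onto the MXU and the GELU onto the vector unit.
@jax.jit
def layer(x, w):
    return jax.nn.gelu(jnp.dot(x, w))

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (1024, 512), dtype=jnp.bfloat16)
w = jax.random.normal(key, (512, 2048), dtype=jnp.bfloat16)

y = layer(x, w)                 # first call triggers XLA compilation
print(y.shape, y.dtype)         # (1024, 2048) bfloat16
print(jax.devices())            # lists TPU devices on a TPU VM
```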
A TPU Pod is a collection of TPU chips connected through high-bandwidth ICI links. Pods allow users to distribute training across hundreds or thousands of chips using data parallelism, model parallelism, or pipeline parallelism. TPU slice topologies are specified as tuples (for example, 4x4 for a 2D torus or 4x4x8 for a 3D torus), where each value represents the number of chips along one dimension.
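A sketch of how a slice topology shows up in user code, assuming a hypothetical 16-chip (4x4) slice; the axis names "data" and "model" are arbitrary labels chosen for this example:

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

# Arrange the 16 attached TPU devices as a 4x4 logical mesh.
devices = mesh_utils.create_device_mesh((4, 4))
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch across the "data" axis and the weight matrix across the
# "model" axis; XLA inserts the required ICI collectives automatically.
x = jax.device_put(jnp.ones((2048, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    return jnp.dot(x, w)        # output comes back sharded across both axes

y = forward(x, w)
print(y.sharding)
```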
For workloads that require more chips than a single pod provides, Google offers Multislice training, which connects multiple TPU slices over the data center network (DCN). Multislice training has been demonstrated with up to 18,432 TPU v5p chips across multiple slices.
TPUs and GPUs take fundamentally different approaches to accelerating machine learning.
| Aspect | TPU | GPU |
|---|---|---|
| Design philosophy | Purpose-built for tensor operations | General-purpose parallel processor adapted for ML |
| Core compute unit | Systolic array (MXU) | CUDA cores / Tensor Cores |
| Programming model | XLA compiler (JAX, TensorFlow, PyTorch/XLA) | CUDA, cuDNN, and broad ecosystem |
| Availability | Google Cloud only | Multiple cloud providers, on-premises, consumer hardware |
| Framework support | JAX (native), TensorFlow, PyTorch/XLA | PyTorch, TensorFlow, JAX, and many others |
| Interconnect | Custom ICI (2D/3D torus) | NVLink, NVSwitch, InfiniBand |
| Strengths | High throughput per watt on matrix operations; tightly integrated pods; cost-effective at scale | Broad ecosystem; flexible for diverse workloads; widely available |
| Limitations | Limited to Google Cloud; narrower framework ecosystem; less flexible for non-ML workloads | Higher power per FLOP on pure matrix work; less integrated multi-chip topology |
In addition to cloud TPUs, Google developed the Edge TPU for on-device inference at the network edge. The Edge TPU is a small, low-power ASIC capable of 4 TOPS while consuming only 2 W (2 TOPS per watt). It is available through the Google Coral product line, which includes USB accelerators, PCIe modules, and system-on-module boards. The Edge TPU runs TensorFlow Lite models compiled with the Edge TPU compiler and is designed for applications such as object detection, image classification, and keyword spotting on embedded devices. Google also integrated a custom Edge TPU variant called the Pixel Neural Core into certain Pixel smartphones for on-device camera processing [10].
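A minimal sketch of invoking an Edge TPU-compiled model with the TensorFlow Lite runtime (assumes the Coral libedgetpu library and the tflite_runtime package are installed; "model_edgetpu.tflite" is a hypothetical compiled model file):

```python
import numpy as np
from tflite_runtime import interpreter as tflite

# Load the compiled model and attach the Edge TPU delegate so that the
# supported operations run on the accelerator instead of the host CPU.
interpreter = tflite.Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

# Feed a dummy tensor of the model's expected shape and dtype.
dummy = np.zeros(input_detail["shape"], dtype=input_detail["dtype"])
interpreter.set_tensor(input_detail["index"], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_detail["index"]).shape)
```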
TPUs have been used to train and serve many well-known AI models:
| Model | TPU generation used | Year | Description |
|---|---|---|---|
| AlphaGo | v1 (inference), v2 (training) | 2016-2017 | Defeated world Go champion Lee Sedol |
| Transformer (original) | v2 | 2017 | Introduced the Transformer architecture, whose self-attention mechanism underlies modern LLMs |
| BERT | v3 | 2018 | Pre-trained bidirectional language representations |
| AlphaFold | v3 | 2020 | Predicted protein structures with atomic accuracy |
| LaMDA | v3/v4 | 2021 | Conversational language model |
| PaLM | v4 | 2022 | 540B-parameter language model trained on 6,144 TPU v4 chips |
| Gemini | v4/v5p | 2023 | Google's multimodal foundation model |
| Gemma | v5e/v5p | 2024 | Open-weight language models |