A Tensor Processing Unit (TPU) is an application-specific integrated circuit (ASIC) designed by Google to accelerate machine learning workloads. Unlike general-purpose processors such as CPUs or GPUs, TPUs are purpose-built for the mathematical operations that dominate neural network training and inference. Google first deployed TPUs internally in 2015 and publicly announced them in 2016. Since then, seven generations of TPUs have been released, each delivering significant improvements in compute performance, memory capacity, and energy efficiency. TPUs power many of Google's largest AI systems, including Gemini, AlphaFold, and Google Search.
Imagine you have a regular toolbox that can fix lots of different things around the house. That is like a normal computer chip (a CPU). Now imagine you have a special tool that is really, really good at one specific job, like tightening screws super fast. A TPU is like that special tool, but instead of screws, it is really good at doing the math that helps computers learn. Google built TPUs so that programs that recognize pictures, understand speech, and answer questions can do their math much faster and use less electricity than if they used regular chips.
Google's motivation for building custom silicon came from a projection made in 2013. Engineers estimated that if every user spoke to their Android phone for just three minutes per day using voice search, Google would need to double the number of data centers worldwide to handle the deep learning inference load. This projected cost was unacceptable, so Google began developing a domain-specific accelerator that could run neural network inference far more efficiently than CPUs or GPUs.
The first TPU (v1) entered Google's data centers in 2015 and was formally announced at Google I/O in May 2016. A landmark paper by Norman Jouppi and colleagues, presented at the International Symposium on Computer Architecture (ISCA) in June 2017, described the TPU v1 architecture and showed that it was 15 to 30 times faster than contemporary CPUs and GPUs on inference tasks, with 30 to 80 times better performance per watt.
TPU v2, announced in 2017, expanded the scope from inference only to both training and inference, and introduced the bfloat16 number format, which later became an industry standard adopted by other hardware vendors including Intel, AMD, and NVIDIA. Each subsequent generation has pushed performance and scale further, culminating in the seventh-generation Ironwood chip announced in April 2025.
The fundamental compute unit inside a TPU chip is the TensorCore. Each TPU chip contains one or more TensorCores (Ironwood uses a chiplet design with two TensorCores per chip). A TensorCore consists of three main processing elements:
| Component | Function | Details |
|---|---|---|
| Matrix multiply unit (MXU) | Performs dense matrix multiplications | 128x128 systolic array (v2 through v5p) or 256x256 systolic array (v6e and Ironwood). Inputs in bfloat16; accumulations in FP32. A 128x128 MXU performs 16,384 multiply-accumulate operations per cycle; a 256x256 MXU performs 65,536. |
| Vector unit | General-purpose computation | Handles activations, softmax, normalization, and element-wise operations. |
| Scalar unit | Control and addressing | Manages control flow, memory address calculations, and other maintenance operations. |
The MXU uses a systolic array architecture, named after systole, the rhythmic contraction of the heart, because data pulses through the chip in a similar rhythm. In a systolic array, each multiply-accumulator passes its result directly to the next one in the grid without writing intermediate values back to memory. This removes the intermediate memory reads and writes that bottleneck matrix throughput on conventional processors.
In the TPU v1, the systolic array was 256x256, performing 65,536 multiply-accumulate operations per clock cycle. Running at 700 MHz, this delivered 92 trillion 8-bit operations per second (92 TOPS) while consuming only 40 watts. More than 90% of the silicon area was devoted to useful computation, compared to roughly 30% in a typical GPU.
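As a back-of-the-envelope check on those figures, the sketch below (plain Python, using the array size and clock rate quoted above) recomputes the peak throughput, counting each multiply-accumulate as two operations.

```python
# Peak throughput of the TPU v1 MXU from the figures quoted above.
array_rows, array_cols = 256, 256          # systolic array dimensions
clock_hz = 700e6                           # 700 MHz clock
macs_per_cycle = array_rows * array_cols   # 65,536 multiply-accumulates per cycle
ops_per_second = macs_per_cycle * 2 * clock_hz   # each MAC = 1 multiply + 1 add
print(f"{ops_per_second / 1e12:.1f} TOPS")       # ~91.8 TOPS, matching the quoted 92 TOPS
```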
The engineering tradeoff behind the systolic array is deliberate: it sacrifices the general-purpose flexibility of a GPU's thousands of programmable CUDA cores in exchange for much higher operation density and energy efficiency on matrix workloads.
TPUs use high-bandwidth memory (HBM) as their primary off-chip memory. Data flows through a pipeline: the host streams data into an infeed queue, the TPU loads it from the infeed queue into HBM, computations are performed, and results are placed into an outfeed queue.
On-chip, TPUs have vector memory (VMEM) that feeds data to the MXU and vector unit. VMEM bandwidth is approximately 22 times higher than HBM bandwidth, so operations reading from VMEM need an arithmetic intensity of only about 10 to 20 FLOPs per byte to achieve peak FLOPS utilization. This layered memory system is designed to keep the MXU fed with data as continuously as possible.
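To make the arithmetic-intensity claim concrete, here is a minimal roofline-style estimate in Python. The TPU v5e figures from the generations table below (~197 TFLOPS, 819 GB/s HBM) are used purely as illustrative inputs; the 22x VMEM ratio is the one stated above.

```python
# Roofline-style estimate: FLOPs per byte needed to stay compute-bound.
peak_flops = 197e12          # ~197 bf16 TFLOPS per TPU v5e chip (illustrative)
hbm_bw = 819e9               # 819 GB/s HBM bandwidth
vmem_bw = 22 * hbm_bw        # VMEM is roughly 22x faster than HBM

# An operation reaches peak FLOPS only if it performs at least this many
# floating-point operations per byte it loads from the given memory level.
print(f"from HBM : {peak_flops / hbm_bw:5.0f} FLOPs/byte")   # ~241
print(f"from VMEM: {peak_flops / vmem_bw:5.0f} FLOPs/byte")  # ~11
```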
Starting with TPU v4, Google added a specialized processor called SparseCore to handle embedding operations. While the MXU excels at dense matrix multiplication, embedding lookups in recommendation systems, ranking models, and large language models involve irregular, data-dependent memory access patterns where the MXU provides no advantage.
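The sketch below shows what such a lookup looks like in JAX: a data-dependent gather from a large table rather than a dense matrix multiply. The table size and indices are arbitrary illustrative values.

```python
import jax.numpy as jnp

# An embedding lookup is a gather: which rows are read depends on the input
# IDs, so the access pattern is irregular and maps poorly onto the MXU.
vocab_size, dim = 100_000, 128
table = jnp.zeros((vocab_size, dim))        # embedding table
ids = jnp.array([3, 97_512, 42, 7])         # data-dependent row indices
vectors = jnp.take(table, ids, axis=0)      # gather rows from the table
print(vectors.shape)                        # (4, 128)
```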
TPU v4 featured four SparseCores per chip, each containing 16 compute tiles that operate in parallel on disjoint subsets of embedding operations. The SparseCore achieved 5 to 7 times speedups over previous approaches while using only 5% of total chip die area and power budget. It is 3 times faster than TPU v3 on recommendation models and 5 to 30 times faster than CPU-based systems.
TPU chips within a single slice communicate over a high-speed inter-chip interconnect (ICI). The ICI connects chips directly to their neighbors without any external switches, a design Google calls "glueless" networking. The networking logic is integrated directly into the chip itself.
Different TPU generations use different ICI topologies:
| TPU generation | ICI topology | Notes |
|---|---|---|
| TPU v2, v3 | 2D torus | Each chip connects to 4 neighbors |
| TPU v4 | 3D torus | Each chip connects to 6 neighbors; uses optical circuit switches for reconfigurability |
| TPU v5e, v6e | 2D torus | Each chip connects to 4 neighbors |
| TPU v5p | 3D torus | 4,800 Gbps per chip; 8,960 chips per pod |
| Ironwood (v7) | 3D torus | 1.2 TB/s bidirectional bandwidth per chip |
Google has released seven generations of TPU hardware, each with different design targets and capabilities.
| Generation | Year | Compute (per chip) | HBM capacity | HBM bandwidth | Key features |
|---|---|---|---|---|---|
| TPU v1 | 2015 (deployed) / 2016 (announced) | 92 TOPS (INT8) | N/A (28 MiB on-chip) | N/A | Inference only; 256x256 systolic array; 40W TDP |
| TPU v2 | 2017 | 180 TFLOPS (per 4-chip device) | 64 GB (per device; 16 GB per chip) | N/A | First training-capable TPU; introduced bfloat16 |
| TPU v3 | 2018 | 420 TFLOPS (per 4-chip device) | 128 GB (per device; 32 GB per chip) | N/A | Liquid cooling; 2x performance over v2 |
| TPU v4 | 2021 | 275 TFLOPS (bf16) | 32 GB HBM2e | N/A | SparseCore; optical circuit switches; 3D torus ICI |
| TPU v5e | 2023 | ~197 TFLOPS | 16 GB | 819 GB/s | Cost-optimized; 2.7x perf/dollar over v4; single core per chip |
| TPU v5p | 2023 | ~459 TFLOPS | 95 GB | 2.8 TB/s | Performance-optimized; 8,960-chip pods (~4.45 EFLOPS) |
| TPU v6e (Trillium) | 2024 | ~918 TFLOPS | 32 GB | N/A | 256x256 MXU; 4.7x peak compute over v5e; 2x memory and ICI bandwidth over v5e |
| TPU v7 (Ironwood) | 2025 | 4,614 TFLOPS | 192 GB HBM3e | 7.4 TB/s | First inference-focused TPU since v1; chiplet design (two TensorCores); FP8 support; 5nm process; ~100B transistors; 600W TDP |
TPU v1 was designed exclusively for inference. Its 256x256 systolic array and 28 MiB of on-chip memory were sufficient for running trained models but not for the larger memory and bidirectional data flow requirements of training. Google deployed TPU v1 across its data centers to accelerate services like Google Search, Google Photos, Google Translate, and Gmail.
TPU v2 was the first generation capable of both training and inference. It introduced the bfloat16 floating-point format, a 16-bit format with the same exponent range as 32-bit IEEE float but reduced mantissa precision (7 bits instead of 23). This design maintains numerical stability during training while halving memory usage compared to float32. TPU v2 was made available to external researchers through the TensorFlow Research Cloud program.
TPU v3 more than doubled compute performance over v2, to 420 TFLOPS per four-chip device, and doubled HBM capacity to 128 GB per device. It was the first TPU generation to use liquid cooling, which allowed higher clock speeds and denser chip packaging.
Announced at Google I/O 2021, TPU v4 introduced two major architectural changes. First, it added SparseCore for embedding-heavy workloads. Second, it replaced electrical inter-pod connections with optical circuit switches (OCS), enabling dynamic reconfiguration of the 3D torus topology. A 2023 paper in the proceedings of ISCA described the TPU v4 supercomputer as an "optically reconfigurable supercomputer" with 4,096 chips per pod. TPU v4 achieved more than 2x the performance of v3.
The fifth generation was split into two variants. TPU v5e, announced in August 2023, was designed for cost efficiency, reducing core count and clock speed to hit aggressive power and cost targets. It delivers 2.7 times higher performance per dollar than TPU v4. TPU v5p, announced in December 2023, was designed for maximum training performance, scaling to 8,960-chip pods delivering approximately 4.45 exaFLOPS. Google positioned TPU v5p as competitive with the NVIDIA H100.
Announced at Google I/O in May 2024 and available in preview from October 2024, Trillium expanded the MXU from 128x128 to 256x256 multiply-accumulators, quadrupling peak FLOPS per cycle. Combined with a higher clock speed, this delivers 4.7 times the peak compute performance of TPU v5e. HBM capacity and bandwidth also doubled compared to v5e.
Unveiled at Google Cloud Next in April 2025, Ironwood is the first TPU generation since the inference-only v1 to be designed primarily for inference. Each chip delivers 4,614 TFLOPS of peak compute and includes 192 GB of HBM3e memory with 7.4 TB/s bandwidth. Ironwood uses a chiplet design in which two TensorCores, each with its own SparseCore pair and 96 GB of HBM, are connected by a die-to-die (D2D) interface that is six times faster than a single ICI link. The chip is fabricated on a 5nm process with approximately 100 billion transistors.
Ironwood is offered in two pod configurations: 256 chips and 9,216 chips. The larger configuration delivers 42.5 exaFLOPS, more than 24 times the compute of the El Capitan supercomputer. It is also the first TPU to support FP8 calculations in its matrix math units.
All code that runs on TPUs must be compiled by the XLA (Accelerated Linear Algebra) compiler. XLA is a just-in-time compiler that takes the computational graph emitted by a machine learning framework and compiles it into TPU machine code. Its most important optimization is operator fusion, which merges multiple operations into a single kernel to reduce memory transfers. Since memory bandwidth is typically the scarcest resource on hardware accelerators, eliminating unnecessary memory operations is one of the most effective ways to improve performance.
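As a minimal illustration of fusion, the JAX sketch below compiles a small layer with jax.jit; under jit, XLA can fuse the bias add and ReLU with the matrix multiply so that intermediate results are not written back to HBM. The shapes and values are arbitrary.

```python
import jax
import jax.numpy as jnp

@jax.jit
def fused_layer(x, w, b):
    # Without fusion, x @ w and the bias add would each produce an
    # intermediate array in memory; XLA can merge them with the ReLU.
    return jnp.maximum(x @ w + b, 0.0)

x = jnp.ones((128, 512), dtype=jnp.bfloat16)
w = jnp.ones((512, 256), dtype=jnp.bfloat16)
b = jnp.zeros((256,), dtype=jnp.bfloat16)
print(fused_layer(x, w, b).shape)  # (128, 256)
```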
XLA is now part of the OpenXLA project, an open-source initiative that provides a common compilation stack for JAX, PyTorch, and TensorFlow. Google's MLPerf submissions demonstrated a seven-fold performance gain in training throughput for BERT using XLA-optimized compilation.
TPUs support three major machine learning frameworks:
| Framework | TPU integration | Notes |
|---|---|---|
| JAX | Native support | JAX is designed around XLA from the ground up; the recommended framework for TPU development |
| TensorFlow | Native support | Historically the primary TPU framework; supports TPUStrategy for distributed training |
| PyTorch | Via PyTorch/XLA | Open-source package that enables PyTorch to run on XLA devices; uses the PJRT runtime |
To run PyTorch on TPUs, users install the torch_xla package and obtain a TPU device handle via xm.xla_device(). The PyTorch/XLA project has migrated from the older XRT runtime to the PJRT runtime used by JAX, improving compatibility and performance.
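A minimal sketch of that workflow, assuming torch and torch_xla are installed on a Cloud TPU VM:

```python
import torch
import torch_xla.core.xla_model as xm

# Obtain a handle to the TPU through the PJRT runtime and run a computation
# on it. PyTorch/XLA records operations lazily; calling .item() forces the
# recorded graph to be compiled and executed on the device.
device = xm.xla_device()
x = torch.randn(128, 256, device=device)
y = (x @ x.T).sum()
print(y.item())
```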
Google Cloud provides TPU access through Cloud TPU VMs, which give users direct SSH access to a Linux virtual machine with root privileges and access to the underlying TPU hardware. This architecture replaced the earlier "TPU node" model, which required a separate host VM communicating with TPU workers over gRPC. Cloud TPU VMs simplify debugging by providing direct access to compiler and runtime logs.
TPUs natively support the bfloat16 floating-point format. Bfloat16 uses one sign bit, eight exponent bits, and seven mantissa bits. By retaining the same exponent range as float32, bfloat16 avoids the overflow and underflow issues that plague the IEEE float16 format during training. Unlike float16, bfloat16 does not require loss scaling, making it nearly a drop-in replacement for float32.
By default, TPUs perform matrix multiplications with bfloat16 inputs and accumulate results in float32. This mixed-precision approach delivers performance gains ranging from 4% to 47% (geometric mean of 13.9%) while using half the memory of full float32 training. Ironwood is the first TPU to also support FP8 calculations.
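The JAX snippet below illustrates both points: bfloat16 retains float32's exponent range (unlike IEEE float16), and a matrix multiply can take bfloat16 inputs while accumulating in float32 via preferred_element_type. This is a small illustrative sketch, not a TPU-specific API.

```python
import jax
import jax.numpy as jnp

# bfloat16 keeps float32's 8 exponent bits, so its largest finite value is on
# the order of 3.4e38; IEEE float16 overflows past 65,504.
print(jnp.finfo(jnp.bfloat16).max)
print(jnp.finfo(jnp.float16).max)

# Mixed precision: bfloat16 inputs, float32 accumulation.
a = jnp.ones((256, 256), dtype=jnp.bfloat16)
b = jnp.ones((256, 256), dtype=jnp.bfloat16)
c = jax.lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)  # float32
```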
A TPU pod is a contiguous set of TPU chips grouped together within a specialized network. A slice is a subset of chips within a pod, all connected by ICI. Users provision TPU resources in slices of various sizes (for example, v5e-8 refers to a slice of 8 TPU v5e chips).
Multislice is a scaling technology that extends TPU connectivity beyond the ICI network of a single slice. In a multislice configuration, chips within each slice communicate over ICI, while chips in different slices communicate through host CPUs over the data-center network (DCN). The XLA compiler automatically inserts hierarchical collective operations and optimizes compute-communication overlap across the hybrid DCN/ICI topology.
Multislice enables training jobs to use more than 4,096 chips in a single run with TPU v4, and even larger configurations with later generations. This technology uses standard data parallelism and requires minimal code changes from the user.
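Below is a minimal sketch of the data-parallel pattern within a single slice, using JAX's sharding API; the axis name "data" and the array shapes are illustrative, and multislice itself is configured when provisioning resources rather than in this code.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Within one slice, JAX exposes every TPU chip as a device. Here a batch is
# split along its leading dimension across all devices (data parallelism).
devices = np.array(jax.devices())            # e.g. 8 devices on a v5e-8 slice
mesh = Mesh(devices, axis_names=("data",))
batch = jnp.ones((1024, 512))
sharded = jax.device_put(batch, NamedSharding(mesh, P("data")))
print(sharded.sharding)
```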
TPUs differ from CPUs and GPUs in fundamental ways. CPUs are general-purpose processors optimized for sequential tasks with complex control flow. GPUs contain thousands of smaller programmable cores designed for parallel workloads, originally graphics rendering but now widely used for machine learning via the CUDA programming model. TPUs sacrifice general-purpose flexibility entirely, dedicating almost all silicon area to matrix arithmetic.
| Attribute | CPU | GPU | TPU |
|---|---|---|---|
| Architecture | Few powerful cores with large caches | Thousands of small programmable CUDA cores | Systolic array (MXU) plus vector and scalar units |
| Design target | General-purpose computing | Parallel workloads (graphics, ML, HPC) | Neural network training and inference |
| Programmability | Fully programmable | Programmable via CUDA, OpenCL, etc. | Programmable via XLA compiler only |
| Memory | System DRAM (DDR) | HBM (up to 80 GB on H100) | HBM (up to 192 GB on Ironwood) |
| Power per chip | 65 to 350W (typical) | 300 to 1,000W (high-end AI GPUs) | 40W (v1) to 600W (Ironwood) |
| Availability | Universal | Multi-vendor (NVIDIA, AMD, Intel) | Google Cloud only |
| Software ecosystem | All languages and frameworks | CUDA (dominant), ROCm, OpenCL | XLA, JAX, TensorFlow, PyTorch/XLA |
TPU v3 trained BERT models 8 times faster than NVIDIA V100 GPUs and delivered 1.7 to 2.4 times faster training for ResNet-50 and large language models. BERT training completes 2.8 times faster on TPUs than on A100 GPUs, and batch inference delivers 4 times higher throughput for transformer models. Single-query latency is 30% lower for models exceeding 10 billion parameters.
Google's Cloud TPU v6e (Trillium) delivers approximately 4 times better performance per dollar than NVIDIA H100 GPUs for large language model inference, according to Google's published benchmarks.
TPUs consume significantly less power than comparable GPU setups on supported workloads. Modern TPUs deliver 2 to 3 times better performance per watt than contemporary GPUs. Individual TPU chips typically consume 175 to 250 watts (prior to Ironwood), while high-end AI GPUs may use 700 to 1,000 watts. TPU-based systems can reduce overall power consumption by 60 to 65% compared to equivalent GPU deployments.
TPUs are used across a wide range of AI applications both within Google and by external Cloud customers.
All phases of Gemini model training run on TPU v5e and v6e pods without fallback to NVIDIA GPUs. Google Search, Google Translate, Gmail, Google Photos, and YouTube all use TPUs for inference workloads. The AlphaFold protein structure prediction system, whose creators shared the 2024 Nobel Prize in Chemistry, runs on TPUs.
AssemblyAI reports that Cloud TPU v5e delivers up to 4 times greater performance per dollar for speech-recognition inference compared to other solutions. Gridspace achieved 5 times training speedups and 6 times larger inference scale on TPUs for conversational AI models. AI21 Labs uses Trillium TPUs for its Mamba/Jamba language models.
In addition to the data-center TPU line, Google produces the Edge TPU, a small ASIC designed for machine learning inference on low-power edge devices. The Edge TPU delivers 4 trillion operations per second (4 TOPS) while consuming only 2 watts, achieving 2 TOPS per watt.
The Edge TPU uses an estimated 64x64 systolic array running at 480 MHz. It can execute MobileNet V2 at nearly 400 frames per second and runs inference 70 to 100 times faster than a CPU on supported models. It supports only TensorFlow Lite models that are fully 8-bit quantized and compiled for the Edge TPU.
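A hedged sketch of the standard post-training full-integer quantization recipe such a model goes through before Edge TPU compilation, assuming a TensorFlow 2.x environment; the toy model and random calibration data stand in for a real trained network, and the resulting file would still need to be processed by the Edge TPU compiler.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for a trained model; a real workflow would load trained weights.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

def representative_data():
    # A few batches of representative inputs let the converter calibrate int8
    # quantization ranges; random data is used here only as a placeholder.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()  # fully int8-quantized model for Edge TPU compilation
```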
Google's Coral product line offers several hardware form factors containing the Edge TPU, including the Coral Dev Board (a single-board Linux computer), the Coral USB Accelerator (a USB-C dongle), and system-on-module variants for custom designs.
Despite their performance advantages on supported workloads, TPUs have several limitations. They are available only through Google Cloud rather than as hardware that can be purchased and deployed elsewhere. All code must pass through the XLA compiler, so there is no low-level programming model comparable to CUDA's. And their advantages are concentrated on dense matrix arithmetic: workloads dominated by irregular control flow or sparse, data-dependent memory access patterns benefit far less.
Google has published lifecycle analyses of TPU carbon efficiency. Over two generations (TPU v4 to Trillium), TPU hardware design improvements have led to a 3 times improvement in the carbon efficiency of AI workloads. Ironwood demonstrates an approximately 3.7 times improvement in Compute Carbon Intensity compared to TPU v5p and is 30 times more power-efficient than the first Cloud TPU released in 2018.
Operational electricity emissions account for more than 70% of a TPU's lifetime carbon footprint. Google's data centers operate at a fleet-wide average Power Usage Effectiveness (PUE) of 1.09, meaning that for every unit of energy delivered to the computing equipment, only about 9% more is spent on cooling and other facility overhead.