Cloud TPU (Tensor Processing Unit) is a family of custom application-specific integrated circuits (ASICs) developed by Google to accelerate machine learning workloads. First deployed internally at Google data centers in 2015 and publicly announced in May 2016, TPUs are designed from the ground up for neural network computation rather than general-purpose processing. Google offers TPUs to external users through its Google Cloud platform, where they are marketed as Cloud TPUs. As of 2025, Google has released seven generations of TPU hardware, each bringing substantial improvements in performance, memory capacity, and energy efficiency.
Imagine your brain is really good at lots of different things: reading, drawing, playing games, and doing math. That is like a regular computer chip (a CPU or GPU). Now imagine a special calculator that can only do one kind of math problem, but it does that one problem incredibly fast. That is what a TPU is. Google built this special calculator because training an AI model requires doing the same type of math (multiplying big grids of numbers) over and over, billions of times. By making a chip that only does this one job, Google made AI training and inference much faster and cheaper than using a regular chip that tries to do everything.
Google began developing TPUs around 2013 in response to internal projections showing that if every user spoke to their Android phone for just three minutes a day using voice search, the company would need to double its data center compute capacity. At the time, running deep learning inference on CPUs and GPUs was expensive in terms of both cost and power consumption. Google engineers, led by Norman Jouppi, designed a purpose-built chip that could handle neural network inference at scale with far better performance per watt than existing hardware.
The first TPU (v1) was deployed in Google data centers in 2015 and publicly disclosed at the Google I/O conference in May 2016. Jouppi and colleagues published the landmark paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" at the International Symposium on Computer Architecture (ISCA) in June 2017. The paper demonstrated that the TPU achieved 15 to 30 times higher performance and 30 to 80 times higher performance per watt compared to contemporary CPUs and GPUs for neural network inference workloads [1].
Google made TPUs available to external users through Google Cloud Platform starting with TPU v2 in 2017. Capabilities expanded from inference-only (v1) to both training and inference (v2 onward), and each subsequent generation has scaled up compute power, memory capacity and bandwidth, and interconnect speed.
At the core of every TPU is a systolic array, a grid of multiply-accumulate (MAC) units through which data flows in a rhythmic, pipelined fashion. In a systolic array, partial results move from one processing element to the next without returning to memory at each step. This design minimizes memory access overhead and maximizes throughput for matrix multiplication, which is the dominant operation in neural network training and inference.
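The dataflow can be made concrete with a small, illustrative simulation. The sketch below models an output-stationary systolic array in plain Python; the grid layout and timing scheme are simplifications for illustration, not a description of the actual MXU pipeline.

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Each processing element (PE) at grid position (i, j) owns one output
    C[i, j]. Rows of A stream in from the left and columns of B stream in
    from the top, skewed so that A[i, k] meets B[k, j] at PE (i, j) on
    cycle i + j + k. The PE multiplies the pair and adds the product to its
    local accumulator; partial sums never travel back to memory, which is
    the point of the systolic design.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)       # one accumulator per PE
    last_cycle = (M - 1) + (N - 1) + (K - 1)     # cycle of the final multiply
    for cycle in range(last_cycle + 1):
        for i in range(M):
            for j in range(N):
                k = cycle - i - j                # operand pair reaching PE (i, j) now
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]  # one multiply-accumulate per PE per cycle
    return C

A = np.random.rand(4, 3).astype(np.float32)
B = np.random.rand(3, 5).astype(np.float32)
assert np.allclose(systolic_matmul(A, B), A @ B, atol=1e-5)
```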
The original TPU v1 contained a single 256 x 256 systolic array of 8-bit multiply-accumulate units, providing 65,536 MACs that could perform up to 92 trillion operations per second (TOPS). Starting with TPU v2, the array was reorganized into 128 x 128 units operating on bfloat16 inputs with FP32 accumulation. TPU v6e and TPU v7 (Ironwood) expanded the MXU back to 256 x 256 multiply-accumulators, increasing per-cycle throughput.
Starting from TPU v2, each TPU chip contains one or more TensorCores. A TensorCore is a self-contained compute unit that includes:

- One or more matrix multiply units (MXUs), each built around a systolic array of multiply-accumulators
- A vector unit for element-wise operations such as activations and softmax
- A scalar unit for control flow, address generation, and other bookkeeping operations
Each TPU chip in v2 and v3 contains two TensorCores, as do the training-focused v4 and v5p chips, with each TensorCore housing four 128 x 128 MXUs. The efficiency-oriented v5e and v6e (Trillium) chips instead use a single TensorCore per chip; v6e and v7 move to the larger 256 x 256 MXUs noted above.
Starting with TPU v4, Google introduced SparseCores, specialized dataflow processors designed to accelerate models that rely heavily on sparse embedding lookups. Embedding-heavy models are common in recommendation systems and ranking workloads. TPU v4 includes four SparseCores per chip, each with dedicated scratchpad memory and optimized dataflow for sparse memory access patterns. Models with ultra-large embeddings have achieved 5 to 7 times speedups using SparseCores while consuming only about 5% of the total chip die area and power budget. TPU v5p features second-generation SparseCores, and TPU v6e includes third-generation SparseCores (two per chip).
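The access pattern that SparseCores target can be illustrated with a toy embedding lookup in JAX. The table size and feature ids below are hypothetical; on real hardware the irregular gather is the part SparseCores accelerate, while the dense math that follows runs on the MXUs.

```python
import jax.numpy as jnp

# Hypothetical embedding table: a 1M-row vocabulary of 128-dim embeddings.
vocab_size, dim = 1_000_000, 128
table = jnp.zeros((vocab_size, dim), dtype=jnp.bfloat16)

# A batch of sparse categorical features: each example touches a handful of
# rows scattered across the huge table (irregular, memory-bound access).
ids = jnp.array([[3, 17, 99_421],
                 [7, 7, 512_003]])

# Gather the referenced rows and pool them per example.
pooled = jnp.take(table, ids, axis=0).sum(axis=1)   # shape (2, 128)
```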
Google developed the bfloat16 (Brain Floating Point) number format specifically for TPU-based machine learning workloads. Bfloat16 is a 16-bit floating-point representation that uses one sign bit, eight exponent bits, and seven mantissa bits. Unlike IEEE FP16 (which trades exponent range for precision), bfloat16 preserves the same dynamic range as FP32 while halving memory usage. This design choice reflects the observation that neural networks are more sensitive to dynamic range than to precision during training.
On Cloud TPUs, matrix multiplications are performed with bfloat16 inputs and accumulated in FP32, providing a practical balance between computational speed and numerical accuracy. Because bfloat16 multipliers are roughly half the silicon area of FP16 multipliers and eight times smaller than FP32 multipliers, TPUs can pack more compute into the same die area [2].
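Both properties are easy to see in a short JAX sketch (the values and shapes are arbitrary):

```python
import jax.numpy as jnp

# bfloat16 keeps FP32's 8-bit exponent, so large magnitudes survive the cast,
# but only 7 mantissa bits remain, so roughly 3 significant decimal digits do.
print(jnp.float32(3.14159265).astype(jnp.bfloat16))   # ~3.140625
print(jnp.float32(1e38).astype(jnp.bfloat16))          # still finite; IEEE FP16 would overflow to inf

# The typical TPU matmul recipe: bfloat16 inputs, FP32 accumulation.
a = jnp.ones((128, 256), dtype=jnp.bfloat16)
b = jnp.ones((256, 128), dtype=jnp.bfloat16)
c = jnp.dot(a, b, preferred_element_type=jnp.float32)  # accumulate in FP32
print(c.dtype)  # float32
```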
TPU chips use High Bandwidth Memory (HBM) as their primary data store. HBM capacity and bandwidth have increased substantially with each generation, from 8 GB of DDR3 in TPU v1 to 192 GB of HBM per chip in TPU v7 (Ironwood).
TPU chips within a pod or slice communicate through high-speed Inter-Chip Interconnects (ICI). The network topology varies by generation: v2 and v3 pods use a 2D torus, the large training-oriented generations (v4, v5p, and v7) use a 3D torus, and from v4 onward optical circuit switches allow the topology to be reconfigured dynamically.
The following table summarizes the specifications of each TPU generation:
| Generation | Year | Process | Peak performance | HBM capacity | HBM bandwidth | Max pod size | Topology | Key feature |
|---|---|---|---|---|---|---|---|---|
| TPU v1 | 2015 | 28 nm | 92 TOPS (INT8) | 8 GB DDR3 | 34 GB/s | N/A (single board) | N/A | Inference only; 256x256 systolic array |
| TPU v2 | 2017 | 16 nm | 45 TFLOPS (bf16) | 16 GB HBM | 600 GB/s | 256 chips (11.5 PFLOPS) | 2D torus | First TPU for training; introduced bfloat16 |
| TPU v3 | 2018 | 16 nm | 123 TFLOPS (bf16) | 32 GB HBM | 900 GB/s | 1,024 chips | 2D torus | Liquid cooling; 2.7x perf over v2 |
| TPU v4 | 2021 | 7 nm | 275 TFLOPS (bf16) | 32 GB HBM | 1,200 GB/s | 4,096 chips | 3D torus | SparseCores; optical reconfigurable interconnect |
| TPU v5e | 2023 | N/A | 197 TFLOPS (bf16) | 16 GB HBM | 819 GB/s | 256 chips | 2D torus | Cost-efficient; training and inference |
| TPU v5p | 2023 | N/A | 459 TFLOPS (bf16) | 95 GB HBM | 2,765 GB/s | 8,960 chips | 3D torus | 2nd-gen SparseCores; competitive with H100 |
| TPU v6e (Trillium) | 2024 | N/A | 918 TFLOPS (bf16) | 32 GB HBM | 1,640 GB/s | 256 chips | 2D torus | 4.7x perf over v5e; 3rd-gen SparseCores |
| TPU v7 (Ironwood) | 2025 | N/A | 4,614 TFLOPS (FP8) | 192 GB HBM | 7,370 GB/s | 9,216 chips | 3D (ICI 9.6 Tb/s) | Inference-optimized; 2x perf/watt over v6e |
The first-generation TPU was designed exclusively for neural network inference. It featured a single 256 x 256 systolic array of 8-bit integer ALUs, 28 MiB of on-chip SRAM, and 8 GB of DDR3 memory. Operating at 700 MHz on a 28 nm process, it consumed only 28 to 40 watts while delivering 92 TOPS. TPU v1 was deployed as a coprocessor on the PCIe bus and was never offered as a standalone cloud product. It powered latency-sensitive Google services including Search ranking, Google Translate, Google Photos, and the inference engine for AlphaGo [1].
Announced at Google I/O in May 2017, TPU v2 was the first generation to support both training and inference. Each chip contained two TensorCores with 128 x 128 MXUs, 16 GB of HBM, and 600 GB/s memory bandwidth. TPU v2 introduced the bfloat16 number format and delivered 45 TFLOPS per chip. Pods of up to 256 chips provided 11.5 petaFLOPS of aggregate compute. TPU v2 was the first generation made available to external users through Google Cloud and the TensorFlow Research Cloud (TFRC) program [3].
Announced at Google I/O 2018, TPU v3 doubled the HBM capacity to 32 GB per chip and increased memory bandwidth to 900 GB/s. The clock speed rose from 700 MHz to 940 MHz, and peak performance reached 123 TFLOPS per chip. Pods scaled up to 1,024 chips, providing over 100 petaFLOPS of aggregate compute. TPU v3 was the first generation to require liquid cooling due to its higher power density [4].
Announced at Google I/O 2021 and made generally available in 2022, TPU v4 represented a major architectural leap. Built on a 7 nm process with a die size under 400 mm², it delivered 275 TFLOPS per chip. Each chip contained two TensorCores (four 128 x 128 MXUs each), four SparseCores, and 32 GB of HBM with 1,200 GB/s bandwidth.
TPU v4 introduced a 3D torus interconnect topology with optically reconfigurable circuit switches (OCS), allowing dynamic reconfiguration of the network topology to match workload requirements. A full v4 pod contained 4,096 chips with 10x the interconnect bandwidth per chip compared to previous generations. Google described the TPU v4 pod as an "optically reconfigurable supercomputer" in a 2023 paper [5].
Released in August 2023, TPU v5e was designed as a cost-efficient accelerator for both training and inference. It delivers 197 TFLOPS in bfloat16 and 393 TFLOPS in INT8, with 16 GB of HBM per chip. Pods support up to 256 chips in a 2D torus topology. Google positioned v5e as delivering the best price-performance ratio for mid-scale workloads, including large language model fine-tuning and serving [6].
Announced in December 2023 alongside the Gemini model, TPU v5p is Google's most powerful training-focused TPU prior to Trillium. Each chip delivers 459 TFLOPS in bfloat16 and 918 TFLOPS in INT8, with 95 GB of HBM and 2,765 GB/s bandwidth. A full v5p pod connects 8,960 chips in a 16 x 20 x 28 3D torus topology with 4,800 Gbps of ICI bandwidth per chip. TPU v5p features second-generation SparseCores that can train embedding-dense models 1.9x faster than TPU v4. Google stated that TPU v5p is competitive with the NVIDIA H100 for large model training [7].
Announced in mid-2024 and made generally available in late 2024, Trillium is Google's sixth-generation TPU. It achieves roughly 918 TFLOPS in bfloat16 per chip (approximately 4.7x the performance of TPU v5e) through larger 256 x 256 MXUs and a higher clock speed. HBM capacity doubled to 32 GB with doubled bandwidth (1,640 GB/s), and ICI bandwidth also doubled compared to v5e. Trillium includes third-generation SparseCores and is over 67% more energy efficient than TPU v5e.
Trillium pods scale up to 256 chips, and with Multislice technology and Titanium IPUs (Intelligence Processing Units), multiple pods can be connected into building-scale supercomputers with tens of thousands of chips. Google reported a 2.1x improvement in performance per dollar over v5e and 2.5x over v5p for dense LLM training on models such as Llama 2-70B and Llama 3.1-405B [8].
Unveiled at Google Cloud Next in April 2025, Ironwood is Google's seventh-generation TPU and the first generation explicitly designed for inference at scale. Each chip delivers 4,614 TFLOPS peak performance (FP8), a 10x improvement over TPU v5p per chip. Memory capacity jumps to 192 GB of HBM per chip with 7.37 TB/s bandwidth, six times the memory of Trillium.
Ironwood chips communicate via ICI at 9.6 Tb/s per chip. A full Ironwood superpod consists of 9,216 chips with access to 1.77 petabytes of aggregate HBM. Performance per watt is 2x that of Trillium, and Google states Ironwood is nearly 30x more power efficient than the first Cloud TPU offered in 2018. Each chip contains two TensorCores and four SparseCores [9].
TPU hardware is organized into a hierarchy of groupings:

- Chip: a single TPU package with its TensorCores, SparseCores (v4 onward), and HBM.
- Board (tray): four TPU chips mounted together and attached to a host CPU machine.
- Slice: a set of chips within one pod, all connected to each other over ICI; users provision slices in a range of shapes and sizes.
- Pod: the largest ICI-connected configuration of a generation, such as 8,960 chips for v5p or 9,216 chips for Ironwood.
Cloud TPU Multislice is a scaling technology that allows a single training job to span multiple TPU slices, even across different pods. Slices within a Multislice configuration communicate through data center networking (DCN), which has higher latency and lower bandwidth than ICI. Multislice supports data parallelism, Fully Sharded Data Parallelism (FSDP), model parallelism, and pipeline parallelism. Google demonstrated this capability by running the world's largest distributed LLM training job across 50,944 TPU v5e chips [10].
Cloud TPUs support three major machine learning frameworks:
| Framework | Integration method | Notes |
|---|---|---|
| JAX | Native via XLA | Primary framework for TPU development; developed by Google; compiles Python and NumPy-like code to XLA |
| TensorFlow | Native via XLA | Supported from TPU v2 onward; TPU v5e, v5p, and v6e support TensorFlow 2.15.0 and later via PJRT |
| PyTorch | Via PyTorch/XLA | Open-source library maintained by Google and the PyTorch community; uses XLA as the compiler backend |
JAX is a numerical computing library developed by Google that combines NumPy-like syntax with automatic differentiation and XLA (Accelerated Linear Algebra) compilation. JAX is the primary framework for TPU development at Google and is used for training large-scale models including Gemini. JAX's functional programming model maps naturally to TPU hardware, and its pjit and shard_map APIs provide fine-grained control over how computations and data are distributed across TPU chips [11].
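As a minimal sketch of this style of explicit sharding, the example below distributes a matmul over a hypothetical 8-chip slice (for example a single v5e host) using jax.jit with NamedSharding; the mesh shape and array sizes are illustrative and would need to match the devices actually available.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assume 8 TPU chips arranged as a 2 x 4 logical mesh.
devices = mesh_utils.create_device_mesh((2, 4))
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch across the "data" axis and the weight's output features
# across the "model" axis; XLA inserts the required collectives.
x = jax.device_put(jnp.ones((256, 512), jnp.bfloat16),
                   NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((512, 1024), jnp.bfloat16),
                   NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # Each chip multiplies its local shards; the result stays sharded
    # along both the "data" and "model" axes.
    return jnp.dot(x, w, preferred_element_type=jnp.float32)

y = layer(x, w)
print(y.shape, y.sharding)
```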
TensorFlow was the original framework supported on Cloud TPUs. The TPU execution model in TensorFlow uses XLA compilation to translate TensorFlow graphs into optimized TPU machine code. Starting with TensorFlow 2.15.0, the PJRT runtime interface provides automatic device memory defragmentation and a simpler hardware integration path.
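A minimal sketch of the usual setup with tf.distribute.TPUStrategy follows, assuming the code runs on a Cloud TPU VM where the local TPU resolves with an empty address (the model itself is illustrative):

```python
import tensorflow as tf

# Connect to and initialize the attached TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Variables created under the strategy scope are replicated across TPU
# cores, and the training step is compiled to TPU code via XLA.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```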
PyTorch/XLA is an open-source library that enables PyTorch models to run on TPUs by converting PyTorch operations into XLA HLO (High Level Operations) graphs. The torchax library from Google further bridges PyTorch and JAX by wrapping JAX arrays as PyTorch tensor subclasses, enabling seamless interoperability. More recently, vLLM TPU (powered by tpu-inference) has unified JAX and PyTorch under a single lowering path for high-throughput LLM inference on TPUs.
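The lazy-execution flow can be sketched in a few lines, assuming the torch_xla package is installed on a TPU VM (tensor shapes are arbitrary):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()               # the TPU exposed as an XLA device
a = torch.randn(128, 256, device=device)
b = torch.randn(256, 64, device=device)
c = a @ b                              # recorded lazily into an XLA HLO graph
xm.mark_step()                         # compile and execute the pending graph on the TPU
print(c.shape)
```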
Google Cloud offers several ways to provision and use TPUs:

- Cloud TPU VMs: direct SSH access to the TPU host machines, suited to development and custom training loops.
- Google Kubernetes Engine (GKE): TPU slices exposed as node pools and scheduled as containerized workloads.
- Vertex AI: managed training and serving that runs on TPUs without requiring users to manage the underlying infrastructure.
Approximate pricing (as of 2025) varies by generation and committed use discount (CUD) level:
| TPU type | On-demand (per chip-hour) | 1-year CUD | 3-year CUD |
|---|---|---|---|
| TPU v5e | ~$1.20 | Discounted | Discounted |
| TPU v5p | ~$4.20 | Discounted | Discounted |
| TPU v6e (Trillium) | ~$2.70 | ~$1.89 | ~$1.22 |
TPU resources can be provisioned through the Google Cloud console, the gcloud CLI, or programmatically through Google Kubernetes Engine (GKE). GKE is the recommended orchestration layer for production TPU workloads, providing features such as job queueing with Kueue and Multislice job abstraction through the JobSet API.
TPUs and GPUs differ in their design philosophy and target workloads. The following table highlights the main differences:
| Aspect | Cloud TPU | GPU (e.g., NVIDIA H100/A100) |
|---|---|---|
| Design approach | Purpose-built ASIC for ML | General-purpose parallel processor |
| Precision formats | bfloat16, INT8, FP8 (v7), FP32 accum. | FP16, bfloat16, FP8, TF32, FP32, INT8 |
| Primary compute unit | Systolic array (MXU) | CUDA cores, Tensor Cores |
| Memory type | HBM (integrated) | HBM (integrated) |
| Interconnect | ICI (custom, in-pod) | NVLink, NVSwitch, InfiniBand |
| Software ecosystem | JAX, TensorFlow, PyTorch/XLA | CUDA, cuDNN, all major frameworks |
| Vendor lock-in | Google Cloud only | Multi-cloud, on-premises |
| Strengths | Large-batch training, LLM inference, cost per FLOP | Flexibility, broad framework support, general-purpose compute |
TPUs tend to offer better performance per dollar for large-scale, batch-oriented ML workloads, particularly for models that map well to matrix-heavy computation. Google has reported that TPU v6e provides up to 4x better performance per dollar compared to the NVIDIA H100 for LLM training and large-batch inference. However, GPUs offer broader software compatibility, support from multiple cloud providers, and the ability to handle diverse workloads beyond ML, including graphics rendering, simulation, and scientific computing [12].
The choice between TPUs and GPUs often depends on the specific workload, scale, framework preference, and whether vendor portability is a priority.
TPUs have powered many of Google's most notable AI systems and attracted major external customers:

- Internal workloads: Search ranking, Google Translate, Google Photos, AlphaGo, and the Gemini family of models have all been trained or served on TPUs.
- External customers: companies including Anthropic, Apple (which reported training its foundation models on TPU v4 and v5p), and Midjourney have run large-scale workloads on Cloud TPUs.
Despite their strong performance for ML workloads, Cloud TPUs have several limitations:

- Vendor lock-in: TPUs are available only through Google Cloud, so workloads cannot move to other providers or on-premises hardware without retargeting to GPUs.
- Software ecosystem: the XLA-based toolchain is narrower than the CUDA ecosystem, and many third-party libraries, custom kernels, and profiling tools assume NVIDIA GPUs.
- Workload fit: the architecture is optimized for large, dense matrix computation; models with dynamic shapes, heavy branching, or bespoke operations may map poorly to the systolic array.
- Availability: popular generations are subject to regional availability and quota constraints on Google Cloud.