The Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google to accelerate machine learning workloads. First deployed in Google's data centers in 2015, TPUs are purpose-built for high-throughput, low-latency tensor operations, particularly the matrix multiplications at the heart of neural network training and inference. Over seven generations, Google has scaled the TPU from an inference-only accelerator delivering 92 TOPS to the Ironwood (TPU v7) chip delivering 4,614 TFLOPS, with superpods reaching 42.5 exaflops of aggregate compute.
TPUs have powered some of the most widely known AI systems in the world, including AlphaGo, AlphaFold, BERT, and Gemini. Google makes TPUs available to external users through Google Cloud, the TPU Research Cloud program, and Google Colab.
In 2013, Google recognized that if every user spoke to their Android phone for just three minutes per day, the company would need to double its data center compute capacity to handle the inference load. This realization prompted an internal effort to build custom silicon optimized for neural network inference. Dr. Amir Salek was recruited to establish custom silicon capabilities, and engineer Jonathan Ross (who later founded Groq) was among the original TPU designers.
The TPU v1 was designed, verified, fabricated, and deployed to production data centers in just 15 months, an unusually fast timeline for a custom ASIC. Google began deploying TPU v1 chips in its data centers in early 2015, but the existence of the chip remained secret for more than a year.
On May 18, 2016, at the Google I/O conference, CEO Sundar Pichai revealed that Google had been running TPUs inside its data centers for over a year. He stated that TPUs delivered "an order of magnitude better performance per watt for machine learning" compared to existing processors. The announcement came shortly after AlphaGo defeated world Go champion Lee Sedol in March 2016, a match in which TPUs powered AlphaGo's inference computations.
The TPU v1 architecture was formally described in the paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" by Norman P. Jouppi et al., presented at the 44th International Symposium on Computer Architecture (ISCA) in June 2017. The paper reported that the TPU was 15 to 30 times faster and 30 to 80 times more energy-efficient than contemporary server-class CPUs and GPUs (an Intel Haswell CPU and an NVIDIA K80 GPU) on production neural network inference workloads.
Broadcom serves as the co-developer of TPUs, translating Google's architecture and specifications into manufacturable silicon. All TPU generations have been fabricated by TSMC.
The defining architectural feature of the TPU is its systolic array, a grid of multiply-accumulate (MAC) units through which data flows in a regular, wave-like pattern (the name "systolic" is an analogy to the rhythmic pumping of the heart). In TPU v1, the matrix multiply unit (MXU) consists of a 256 x 256 grid of 8-bit MAC units, totaling 65,536 ALUs.
During a matrix multiplication, weight values are preloaded into the array from above (the right-hand side, or RHS), while activation values enter from the left (the left-hand side, or LHS) and flow horizontally across the array. Each MAC unit multiplies its stored weight by the incoming activation, adds the result to a partial sum arriving from above, and passes both the activation (horizontally) and the updated partial sum (vertically) to neighboring units. Because all 65,536 ALUs pass intermediate results directly between spatially adjacent units without any memory access, power consumption is significantly reduced. The short, local wires connecting adjacent ALUs are also more energy-efficient than long global interconnects.
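This dataflow can be sketched in a few lines of Python. The toy emulation below is not cycle-accurate and ignores the pipelined skewing of real hardware; it only models the weight-stationary pattern for a vector-matrix product, where weights stay in place, activations sweep across rows, and partial sums accumulate down columns.

```python
import numpy as np

def systolic_matmul(x, W):
    """Toy weight-stationary emulation of x @ W.

    PE (i, j) holds W[i, j]; the activation x[i] flows left-to-right
    along row i, while partial sums flow top-to-bottom along column j.
    """
    n, m = W.shape
    assert x.shape == (n,)
    psum = np.zeros(m)                 # partial sums entering the top row are zero
    for i in range(n):                 # vertical position in the array
        act = x[i]                     # activation entering row i from the left edge
        for j in range(m):             # activation hops one PE to the right each step
            psum[j] += W[i, j] * act   # MAC: add to the partial sum arriving from above
    return psum                        # results drain out of the bottom edge

rng = np.random.default_rng(0)
x, W = rng.standard_normal(8), rng.standard_normal((8, 8))
assert np.allclose(systolic_matmul(x, W), x @ W)
```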
From TPU v2 onward, the MXU uses a 128 x 128 systolic array (16,384 multiply-accumulate units per MXU), with each chip containing two or more MXUs. The TPU v6e (Trillium) and TPU v7 (Ironwood) expanded to a 256 x 256 MXU, quadrupling the number of FLOPs per cycle compared to earlier generations.
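As a rough sanity check, peak throughput follows from the array dimensions, the clock rate, and the number of MXUs per chip (figures taken from the text and the specification table below), counting each multiply-accumulate as two operations:

```python
# peak ops/s = MAC units per MXU x 2 ops per MAC x clock x MXUs per chip
v1_tops   = 256 * 256 * 2 * 700e6 * 1 / 1e12   # one 256x256 INT8 MXU at 700 MHz
v2_tflops = 128 * 128 * 2 * 700e6 * 2 / 1e12   # two 128x128 bfloat16 MXUs at 700 MHz
print(f"TPU v1 ~{v1_tops:.0f} TOPS, TPU v2 ~{v2_tflops:.0f} TFLOPS")
# ~92 TOPS and ~46 TFLOPS, close to the quoted 92 TOPS and 45 TFLOPS
```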
TPU v1 uses 8 GB of DDR3 DRAM as off-chip memory, providing 34 GB/s of bandwidth. On-chip, the design includes 28 MiB of software-managed SRAM (the "Unified Buffer") and 4 MiB of accumulator storage. This simplified memory hierarchy, with no hardware-managed caches, reduces memory access latency and die area compared to general-purpose processors.
Starting with TPU v2, Google switched to High Bandwidth Memory (HBM), dramatically increasing both capacity and bandwidth. By TPU v7, each chip has 192 GB of HBM with 7.37 TB/s of bandwidth.
TPU v2 introduced the bfloat16 (Brain Floating Point) number format, a custom 16-bit floating-point representation conceived at Google Brain. Bfloat16 uses 1 sign bit, 8 exponent bits, and 7 mantissa bits. By retaining the same 8-bit exponent as IEEE 754 float32, bfloat16 preserves the same dynamic range (values up to approximately 3.4 x 10^38) while halving memory usage. This is in contrast to the IEEE 754 float16 (half-precision) format, which uses 5 exponent bits and 10 mantissa bits, giving it a narrower dynamic range.
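Because a bfloat16 value is simply the upper half of a float32 bit pattern, the relationship is easy to show directly. A small sketch using NumPy and JAX's bfloat16 scalar type:

```python
import numpy as np
import jax.numpy as jnp

x = np.array(3.14159265, dtype=np.float32)
bits32 = int(x.view(np.uint32))
bits16 = bits32 >> 16          # keep sign (1), exponent (8), and top 7 mantissa bits
print(f"float32 : {bits32:032b}")
print(f"bfloat16: {bits16:016b}")

# the shared 8-bit exponent preserves float32's dynamic range
print(jnp.bfloat16(3.0e38))    # still finite, roughly 3.0e38
print(np.float16(3.0e38))      # overflows to inf in IEEE half precision
```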
Inside the MXU, multiplications are performed in bfloat16 while accumulations use full float32 precision, a mixed-precision strategy that maintains model accuracy while doubling throughput relative to pure float32 computation. Bfloat16 has since been adopted by other hardware vendors, including Intel, AMD, and NVIDIA, and is supported across all major deep learning frameworks.
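In JAX, this mixed-precision contraction can be requested explicitly. The sketch below (shapes chosen arbitrarily) multiplies bfloat16 operands while asking XLA for float32 accumulation:

```python
import jax
import jax.numpy as jnp

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(key_a, (128, 128), dtype=jnp.bfloat16)
b = jax.random.normal(key_b, (128, 128), dtype=jnp.bfloat16)

# bfloat16 inputs, float32 accumulation and output
c = jax.lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)   # float32
```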
TPU v2 introduced the Inter-Chip Interconnect (ICI), a custom high-bandwidth, low-latency network that links multiple TPU chips into a single logical accelerator called a "pod" or "slice." TPU v2 and v3 use a 2D torus topology, in which each chip connects to its four nearest neighbors (north, south, east, west). TPU v4 and v5p upgraded to a 3D torus, where each chip connects to six neighbors, increasing bisection bandwidth.
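The torus wiring is straightforward to express: each coordinate wraps around, giving every chip exactly two neighbors per dimension (four in a 2D torus, six in 3D). A minimal illustration:

```python
def torus_neighbors(coord, shape):
    """Nearest neighbors of a chip at `coord` in a torus of the given shape."""
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, +1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size   # wraparound link closes the torus
            neighbors.append(tuple(n))
    return neighbors

print(torus_neighbors((0, 0), (4, 4)))              # 4 neighbors on a 2D torus
print(len(torus_neighbors((0, 0, 0), (4, 4, 4))))   # 6 neighbors on a 3D torus
```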
TPU v4 introduced optical circuit switches (OCSes) based on 3D Micro-Electro-Mechanical Systems (MEMS) mirrors that can dynamically reconfigure the interconnect topology. This allows the system to form "twisted" 3D torus topologies that provide up to 70% higher bisection bandwidth than a standard torus. The OCS hardware accounts for less than 5% of system cost and less than 3% of system power. Each TPU v4 pod connects 4,096 chips through 48 OCSes using Google's custom Palomar 136x136 OCS.
TPU v7 (Ironwood) scales the ICI to 9.6 Tb/s per chip, enabling superpods of up to 9,216 chips.
Starting with TPU v4, Google added SparseCores to each chip. SparseCores are specialized dataflow processors designed to accelerate models that rely on embedding lookups, a common operation in recommendation systems and large language models. SparseCores occupy only about 5% of die area and power but accelerate embedding-heavy workloads by 5 to 7 times. TPU v5p introduced second-generation SparseCores with further improvements, and TPU v7 contains four SparseCores per chip.
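The workloads SparseCores target are dominated by memory-bound gathers rather than dense matrix multiplies. A toy embedding lookup in JAX (table size and dimensions chosen only for illustration) shows the access pattern:

```python
import jax
import jax.numpy as jnp

vocab, dim = 100_000, 128
table = jax.random.normal(jax.random.PRNGKey(0), (vocab, dim))

ids = jnp.array([3, 17, 42, 99_999])     # sparse feature IDs for one example
vectors = jnp.take(table, ids, axis=0)   # gather a handful of rows from a large table
pooled = vectors.sum(axis=0)             # then reduce, e.g. sum-pooling
print(pooled.shape)                      # (128,)
```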
The table below summarizes the key specifications of each TPU generation.
| Generation | Release year | Process node | Clock (MHz) | Memory | Memory bandwidth | Peak compute | TDP (W) | Chips per pod | Training support |
|---|---|---|---|---|---|---|---|---|---|
| TPU v1 | 2015 | 28 nm | 700 | 8 GB DDR3 | 34 GB/s | 92 TOPS (INT8) | 75 | N/A (inference only) | No |
| TPU v2 | 2017 | 16 nm | 700 | 16 GB HBM | 600 GB/s | 45 TFLOPS (BF16) | 280 | 256 (11.5 PFLOPS) | Yes |
| TPU v3 | 2018 | 16 nm | 940 | 32 GB HBM | 900 GB/s | 123 TFLOPS (BF16) | 220 | 1,024 (>100 PFLOPS) | Yes |
| TPU v4 | 2021 | 7 nm | 1,050 | 32 GB HBM | 1,200 GB/s | 275 TFLOPS (BF16) | 170 | 4,096 (>1 EFLOPS) | Yes |
| TPU v5e | 2023 | Not disclosed | Not disclosed | 16 GB HBM | 819 GB/s | 197 TFLOPS (BF16) | Not disclosed | 256 | Yes |
| TPU v5p | 2023 | Not disclosed | 1,750 | 95 GB HBM | 2,765 GB/s | 459 TFLOPS (BF16) | Not disclosed | 8,960 (4.45 EFLOPS) | Yes |
| TPU v6e (Trillium) | 2024 | Not disclosed | Not disclosed | 32 GB HBM | 1,640 GB/s | 918 TFLOPS (BF16) | Not disclosed | 256 | Yes |
| TPU v7 (Ironwood) | 2025 | Not disclosed | Not disclosed | 192 GB HBM | 7,370 GB/s | 4,614 TFLOPS (FP8) | Not disclosed | 9,216 (42.5 EFLOPS) | Yes |
The first-generation TPU was designed exclusively for inference. It connects to its host server via a PCIe 3.0 bus and operates as a coprocessor, receiving instructions from the host CPU. The chip was fabricated on a 28 nm process, runs at 700 MHz, and consumes 75 W. Its 256 x 256 systolic array of 8-bit integer MAC units delivers 92 TOPS. Google deployed over 100,000 TPU v1 chips across its data centers to serve production workloads including RankBrain (search ranking), Google Street View text recognition, and Google Photos image processing. A single TPU v1 could process over 100 million photos per day for Google Photos.
Announced in May 2017, TPU v2 was the first generation to support both training and inference. It introduced HBM, bfloat16 arithmetic, and the ICI interconnect. Each chip contains two MXUs delivering a combined 45 TFLOPS in bfloat16. Four chips form a board, and 64 boards (256 chips) form a full pod delivering 11.5 petaflops. TPU v2 was the first TPU made available to external users through Google Cloud.
Announced on May 8, 2018, TPU v3 doubled per-chip performance relative to TPU v2, reaching 123 TFLOPS in bfloat16. The clock speed increased to 940 MHz. Pods scaled to 1,024 chips with over 100 petaflops of aggregate compute. TPU v3 required liquid cooling due to its higher power density.
Announced on May 18, 2021, and described in the 2023 ISCA paper "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" by Jouppi et al., TPU v4 moved to a 7 nm process. Each chip delivers 275 TFLOPS in bfloat16 with 32 GB of HBM at 1,200 GB/s. The chip introduced SparseCores for embedding acceleration and optical circuit switches for reconfigurable 3D torus interconnect topology. A full pod of 4,096 chips exceeds 1 exaflop. Google reported that a TPU v4 deployment uses approximately one-third the electricity and emits approximately one-twentieth the CO2 of a comparable on-premises GPU cluster performing the same training. On production ML benchmarks, TPU v4 was reported to be 5 to 87% faster than an NVIDIA A100 GPU.
The TPU v5e is a cost-optimized variant designed for both training and inference on models up to approximately 200 billion parameters. It prioritizes price-performance, achieving 2.3 times better price-performance than TPU v4. Each chip has 16 GB of HBM and delivers 197 TFLOPS in bfloat16 (or 393 TOPS in INT8). Google reports that 8 TPU v5e chips can generate approximately 2,175 tokens per second on Llama 2-70B inference.
Announced in December 2023, the TPU v5p is the high-performance variant of the fifth generation, intended for large-scale training. Each chip delivers 459 TFLOPS in bfloat16 with 95 GB of HBM at 2,765 GB/s. A full v5p pod composes 8,960 chips in a 3D torus with 4,800 Gbps of ICI bandwidth per chip, reaching approximately 4.45 exaflops. TPU v5p can train large language models 2.8 times faster than TPU v4, and its second-generation SparseCores train embedding-dense models 1.9 times faster than TPU v4. The physical layout of TPU v5p was designed with the assistance of deep reinforcement learning.
Announced at Google I/O in May 2024 and made generally available in late 2024, Trillium is Google's sixth-generation TPU. Each chip delivers 918 TFLOPS in bfloat16, a 4.7 times increase over TPU v5e. The MXU was expanded from 128 x 128 to 256 x 256. HBM capacity doubled to 32 GB with 1,640 GB/s bandwidth. Trillium is over 67% more energy-efficient than TPU v5e. Pods scale to 256 chips with up to 13 TB/s of ICI bandwidth per chip.
Unveiled at Google Cloud Next in April 2025, Ironwood is Google's seventh-generation TPU and the first since TPU v1 to be designed with inference as the primary target. Each chip delivers 4,614 TFLOPS in FP8 and contains 192 GB of HBM with 7.37 TB/s bandwidth. The chip uses a chiplet architecture: two chiplets, each containing one TensorCore, two SparseCores, and 96 GB of HBM. Superpods scale to 9,216 chips connected via a 3D torus ICI at 9.6 Tb/s per chip, delivering 42.5 exaflops of aggregate compute and 1.77 petabytes of shared HBM. Ironwood offers more than 4 times better performance per chip for both training and inference compared to the previous generation.
In addition to data center TPUs, Google developed the Edge TPU, a small ASIC designed for on-device inference in low-power environments. The Edge TPU delivers 4 TOPS of INT8 inference performance while consuming only 2 watts (2 TOPS per watt). It can run models such as MobileNet V2 at nearly 400 frames per second. The Edge TPU supports only forward-pass operations (inference, not training) and requires 8-bit quantized TensorFlow Lite models.
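Deploying to the Edge TPU therefore means converting a model to a fully integer-quantized TensorFlow Lite model. The following is a hedged sketch of the usual conversion flow, assuming a hypothetical SavedModel directory `my_model/` with a single 224x224x3 image input:

```python
import numpy as np
import tensorflow as tf

def representative_data_gen():
    # calibration samples for quantization; in practice drawn from real data
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # Edge TPU requires fully 8-bit models
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
# the quantized model is then compiled for the device with the
# `edgetpu_compiler model_int8.tflite` command-line tool
```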
Google sells Edge TPU hardware under the Coral brand in several form factors, including USB accelerators, PCI-e modules, development boards, and system-on-module packages.
TPUs are supported by three major deep learning frameworks:
| Framework | Integration method | Notes |
|---|---|---|
| TensorFlow | Native support via XLA compiler | TensorFlow was the first framework with TPU support; tight integration with Google's ecosystem |
| JAX | Native support via XLA compiler | JAX's functional programming model and GSPMD (General-purpose SPMD) partitioner allow automatic parallelization across TPU pods with minimal code changes |
| PyTorch | PyTorch/XLA library | Open-source package that translates PyTorch operations to XLA for execution on TPUs |
XLA (Accelerated Linear Algebra) is an open-source compiler for machine learning that takes computation graphs from TensorFlow, JAX, and PyTorch and optimizes them for high-performance execution on TPUs, GPUs, and CPUs. XLA performs whole-program optimization, including operator fusion, memory layout assignment, and tile-size selection, producing efficient machine code for the target hardware.
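From the user's perspective the compiler is largely invisible. In JAX, for example, a `jax.jit`-decorated function is traced, handed to XLA, and compiled into fused kernels for whichever backend (TPU, GPU, or CPU) is available; a minimal sketch:

```python
import jax
import jax.numpy as jnp

@jax.jit
def gelu_layer(x, w, b):
    # matmul, bias add, and GELU can be fused by XLA into a small number of kernels
    return jax.nn.gelu(x @ w + b)

x = jnp.ones((8, 512), dtype=jnp.bfloat16)
w = jnp.ones((512, 512), dtype=jnp.bfloat16)
b = jnp.zeros((512,), dtype=jnp.bfloat16)

print(gelu_layer(x, w, b).shape)   # compiled on first call, reused for same shapes
print(jax.devices())               # lists TPU cores when run on a TPU VM
```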
| Feature | CPU | GPU | TPU |
|---|---|---|---|
| Design purpose | General-purpose computing | Parallel computing; originally graphics rendering | Machine learning inference and training |
| Core architecture | Few complex cores with large caches | Thousands of smaller CUDA/stream cores | Systolic array of MAC units |
| Arithmetic precision | FP64, FP32, INT32, INT64 | FP64, FP32, FP16, BF16, INT8, FP8 | BF16, FP32, INT8, FP8 (varies by generation) |
| Memory hierarchy | Multi-level hardware caches (L1, L2, L3) | HBM with hardware caches | HBM with software-managed SRAM (no hardware caches in v1) |
| Interconnect for scaling | Ethernet, InfiniBand | NVLink, NVSwitch, InfiniBand | Custom ICI with optical circuit switches |
| Programming model | Any language/framework | CUDA, ROCm, OpenCL | XLA (via TensorFlow, JAX, or PyTorch/XLA) |
| Availability | Ubiquitous | Multiple vendors (NVIDIA, AMD, Intel) | Google Cloud only |
TPUs are optimized for workloads dominated by large matrix multiplications and convolutions, such as training and serving transformer models, convolutional neural networks, and recommendation systems. GPUs offer broader flexibility for workloads with irregular computation patterns, custom CUDA kernels, or non-ML parallel computing tasks. CPUs remain the best choice for workloads with complex branching logic, low parallelism, or tasks that require broad instruction set support.
TPUs have been used to train and serve many well-known AI systems:
| Model or system | Year | TPU generation used | Domain |
|---|---|---|---|
| AlphaGo | 2016 | TPU v1 | Game playing (Go) |
| RankBrain | 2015 | TPU v1 | Search ranking |
| Google Street View text processing | 2015 | TPU v1 | OCR |
| AlphaZero | 2017 | TPU v2 | Game playing (chess, Shogi, Go) |
| BERT | 2018 | TPU v3 | Natural language processing |
| AlphaFold | 2020 | TPU v3 | Protein structure prediction |
| LaMDA | 2021 | TPU v4 | Conversational AI |
| PaLM | 2022 | TPU v4 | Large language model |
| Gemini | 2023 | TPU v4, v5e, v5p | Multimodal AI |
| Gemma | 2024 | TPU v5e | Open-weight LLM |
Google also offers the open-weight Gemma model family, which shares technical infrastructure with Gemini and was trained on TPUs.
TPUs are available to external users exclusively through Google Cloud. Pricing is per chip-hour and varies by TPU generation and region.
| TPU version | On-demand price (per chip-hour, USD) | Committed use (1-year) discount |
|---|---|---|
| TPU v4 | $0.24 | ~25-30% |
| TPU v5e | $0.32 | ~25-30% |
| TPU v5p | $0.48 | ~25-30% |
| TPU v6e (Trillium) | Varies by region | Available |
| TPU v7 (Ironwood) | Varies by region | Available |
Google also provides free or subsidized TPU access through several programs, including the TPU Research Cloud for academic researchers and Google Colab, which offers TPU runtimes in its hosted notebooks.
As of 2026, TPU v7 (Ironwood) is generally available. Google has also been in discussions with cloud providers such as CoreWeave and Crusoe about deploying TPUs outside of Google's own infrastructure.
Imagine your brain is really good at all kinds of things: reading, talking, doing math, playing games. That is like a regular computer chip (a CPU). Now imagine a special calculator that can only do one thing, but it does that one thing incredibly fast: multiplying lots of numbers at once. That is what a TPU is. Google built this special calculator because artificial intelligence programs need to multiply millions of numbers together over and over again. By making a chip that only does multiplication really well, Google can run AI programs much faster while using much less electricity than a regular chip.