See also: Machine learning terms, GPU, Deep learning
A Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google specifically for accelerating machine learning workloads. Unlike general-purpose processors such as CPUs or even GPUs, TPUs are built from the ground up to handle the matrix multiplication and tensor operations that form the backbone of deep learning algorithms. By optimizing for these operations and trading away the flexibility of general-purpose hardware, TPUs achieve significantly higher throughput and better energy efficiency for neural network training and inference.
Google first deployed TPUs internally in its data centers in 2015 and publicly announced the chip at Google I/O in May 2016. Since then, the company has released seven generations of the hardware, each bringing substantial improvements in compute performance, memory capacity, interconnect bandwidth, and energy efficiency. TPUs power many of Google's most prominent AI services, including Google Search, Google Translate, Google Photos, YouTube recommendations, and flagship models like BERT, PaLM, and Gemini. Through Google Cloud, TPUs are also available to external researchers and enterprises.
The TPU project began inside Google around 2013, driven by a projected surge in computational demand from neural network inference across the company's services. The team was led by Norman Jouppi, a distinguished hardware engineer who had previously contributed to MIPS processor design and HP's memory systems research. Google's internal analysis suggested that if every user made just three minutes of voice queries per day using neural network-based speech recognition, the company would need to double its data center compute capacity. Building a custom ASIC tuned specifically for neural network math offered a more practical path than buying vast quantities of commodity CPUs or GPUs.
The first TPU was designed, verified, and built in just 15 months, an unusually fast timeline for a custom chip. Google began deploying TPU v1 in its data centers in 2015, using it to accelerate inference for services such as Google Search RankBrain, Google Street View text processing, and the AlphaGo system that defeated world champion Lee Sedol in March 2016.
The foundational paper describing the TPU, "In-Datacenter Performance Analysis of a Tensor Processing Unit," was authored by Jouppi and colleagues and presented at the 44th International Symposium on Computer Architecture (ISCA) in June 2017. The paper demonstrated that the TPU delivered 15 to 30 times higher performance and 30 to 80 times better performance per watt than contemporary CPUs and GPUs for neural network inference workloads. This publication established the TPU as a landmark in domain-specific accelerator design and helped popularize the concept of custom AI chips across the industry.
Google made TPUs available to external users through its Cloud TPU service starting in 2018. The company also launched the TPU Research Cloud (TRC) program, which provides free access to Cloud TPUs for academic researchers. The TRC program grants accepted applicants access to a cluster of over 1,000 Cloud TPU devices, with the expectation that participants share their findings through publications, open-source code, or blog posts.
The central computational engine inside every TPU is the Matrix Multiply Unit (MXU), which is built on a systolic array architecture. A systolic array is a grid of interconnected processing elements (PEs) where data flows rhythmically between neighbors, much like a heartbeat (hence the name "systolic," borrowed from the medical term for cardiac contraction). Each PE performs a small multiply-and-accumulate (MAC) operation and passes partial results to the next PE. This design minimizes data movement and maximizes parallelism, since thousands of multiplications happen simultaneously without each one needing to independently fetch data from memory.
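As a rough illustration of this dataflow, the Python/NumPy sketch below simulates a weight-stationary, systolic-style matrix multiply. It captures the multiply-accumulate-and-pass structure described above but none of the cycle-level timing or physical layout of a real MXU; the function name and dimensions are illustrative only.

```python
import numpy as np

def systolic_matmul(a, b):
    """Untimed sketch of a weight-stationary, systolic-style matrix multiply.

    Conceptually, the PE at grid position (k, j) holds the stationary weight
    b[k, j]; activations stream through the array one row of `a` at a time,
    and partial sums grow by one multiply-accumulate at each PE row they pass.
    """
    m, k_dim = a.shape
    k_dim2, n = b.shape
    assert k_dim == k_dim2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.float32)
    for i in range(m):                      # stream one activation row at a time
        partial = np.zeros(n, dtype=np.float32)
        for k in range(k_dim):              # PE row k multiplies the incoming activation
            partial += a[i, k] * b[k, :]    # by its stationary weights and accumulates
        out[i, :] = partial                 # partial sums exit the bottom of the array
    return out

a = np.random.rand(8, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(systolic_matmul(a, b), a @ b, atol=1e-3)
```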
In TPU generations prior to v6e, the MXU was arranged as a 128 x 128 systolic array, giving each unit 16,384 multiply-accumulators. Starting with TPU v6e (Trillium), Google expanded the MXU to 256 x 256, quadrupling the number of multiply-accumulators to 65,536 per unit. In the 128 x 128 configuration, each MXU performs one matrix multiply of the form bfloat16[8,128] x bfloat16[128,128], producing an fp32[8,128] result every 8 clock cycles; all multiplications are carried out in bfloat16 precision and all accumulations in full fp32 precision.
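To illustrate the precision convention in software, the minimal JAX sketch below multiplies matrices with the shapes quoted above and asks XLA to accumulate in fp32. It is an illustration of the numerics, not of any particular Google example, and runs on CPU as well as TPU.

```python
import jax
import jax.numpy as jnp

# The per-MXU operation described above: bfloat16 inputs, fp32 accumulation.
a = jnp.ones((8, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)

# preferred_element_type requests float32 accumulation, mirroring the MXU's accumulators.
c = jax.lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.shape, c.dtype)   # (8, 128) float32
```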
Each TPU chip contains one or more TensorCores, which serve as the primary compute units. A TensorCore includes the MXUs, a vector processing unit (VPU) for element-wise operations, a scalar unit, and on-chip memory called VMEM (Vector Memory). The memory hierarchy is designed to keep the MXUs fed with data.
Data flows from HBM into VMEM, and from VMEM into the MXU for computation. The results flow back out through the same path. Efficient use of this memory hierarchy is critical for achieving high utilization of the MXU, and the XLA compiler (discussed below) is responsible for orchestrating data movement to keep the systolic array busy.
Starting with TPU v4, Google introduced the SparseCore, a dedicated accelerator for processing sparse computations, particularly the large embedding table lookups common in recommendation systems and ranking models. Embedding tables are a key component of models used by services like YouTube, Google Ads, and Google Search. Standard dense matrix hardware handles these irregular, memory-bound lookups inefficiently, so the SparseCore provides a dataflow processor optimized specifically for this pattern.
The SparseCore uses only about 5% of the total die area and power budget but delivers 5 to 7 times faster embedding lookups compared to running them on the MXU. TPU v5p includes second-generation SparseCores, and TPU v6e introduced the third generation. TPU v7 (Ironwood) contains four SparseCores per chip.
Google Brain developed the bfloat16 (Brain Floating Point 16) number format specifically for use in TPUs and deep learning workloads. Bfloat16 is a 16-bit floating-point format consisting of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Unlike the IEEE 754 half-precision (fp16) format, which allocates 5 bits to the exponent and 10 to the mantissa, bfloat16 preserves the same exponent range as standard 32-bit floats (fp32) while reducing the mantissa precision.
The rationale behind this design is that neural networks are far more sensitive to the dynamic range of values (governed by the exponent) than to precision (governed by the mantissa). By maintaining the full fp32 exponent range, bfloat16 avoids the overflow and underflow issues that can plague fp16 training, while still cutting memory usage and bandwidth requirements in half compared to fp32. The bfloat16 format has since been adopted widely beyond TPUs, including by Nvidia GPUs, Intel Xeon processors, and AMD Instinct accelerators.
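To make the layout concrete, the small NumPy sketch below converts float32 values to the bfloat16 bit pattern by keeping only the top 16 bits (with round-to-nearest-even) and expands them back. It is an illustration of the format, not production conversion code, and the helper names are made up for this example.

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Truncate float32 values to their bfloat16 bit pattern (round-to-nearest-even).

    bfloat16 is simply the top 16 bits of an IEEE-754 float32:
    1 sign bit, 8 exponent bits, 7 mantissa bits -- the same dynamic range as fp32.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> 16) & 1)   # round-to-nearest-even bias
    return ((bits + bias) >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(b):
    """Re-expand stored bfloat16 bits to float32 by appending 16 zero bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159265, 1e-38, 1e38], dtype=np.float32)
roundtrip = bfloat16_bits_to_float32(float32_to_bfloat16_bits(x))
print(roundtrip)   # same magnitudes as x, but only ~3 decimal digits of precision
```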
Google has released seven generations of data center TPUs, each with significant improvements over its predecessor. The table below summarizes the key specifications:
| Generation | Year | Process | Clock | Peak Compute (per chip) | HBM Capacity | HBM Bandwidth | TDP | Max Pod Size | Interconnect |
|---|---|---|---|---|---|---|---|---|---|
| TPU v1 | 2015 | 28 nm | 700 MHz | 92 TOPS (int8) | 8 GB DDR3 | 34 GB/s | 75 W | 1 chip | N/A |
| TPU v2 | 2017 | 16 nm | 700 MHz | 45 TFLOPS | 16 GB HBM | 600 GB/s | 280 W | 256 chips | 2D torus ICI |
| TPU v3 | 2018 | 16 nm | 940 MHz | 123 TFLOPS | 32 GB HBM | 900 GB/s | 220 W | 1,024 chips | 2D torus ICI |
| TPU v4 | 2021 | 7 nm | 1,050 MHz | 275 TFLOPS | 32 GB HBM2e | 1,200 GB/s | ~200 W | 4,096 chips | 3D torus ICI + OCS |
| TPU v5e | 2023 | N/A | N/A | 197 TFLOPS | 16 GB HBM | 819 GB/s | N/A | 256 chips | 2D torus ICI |
| TPU v5p | 2023 | N/A | 1,750 MHz | 459 TFLOPS | 95 GB HBM | 2,765 GB/s | N/A | 8,960 chips | 3D torus ICI |
| TPU v6e (Trillium) | 2024 | N/A | N/A | 918 TFLOPS | 32 GB HBM | 1,640 GB/s | ~300 W | 256 chips | ICI |
| TPU v7 (Ironwood) | 2025 | N/A | N/A | 4,614 TFLOPS (FP8) | 192 GB HBM | 7,370 GB/s | N/A | 9,216 chips | ICI (1.2 TB/s bidir.) |
The first-generation TPU was designed exclusively for inference. It featured a 256 x 256 systolic array capable of 92 trillion 8-bit integer operations per second (92 TOPS). The chip used 28 nm process technology, ran at 700 MHz, and consumed just 75 watts. It was packaged to fit into existing hard drive bays in Google's servers, requiring no modifications to the data center infrastructure.
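The headline figure follows directly from the array size and clock rate: counting a multiply and an add as two operations, the quick check below reproduces it.

```python
# Back-of-the-envelope check of the 92 TOPS figure for TPU v1.
macs = 256 * 256                 # multiply-accumulators in the v1 systolic array
ops_per_sec = macs * 2 * 700e6   # 2 ops (multiply + add) per MAC per cycle at 700 MHz
print(ops_per_sec / 1e12)        # ~91.8 trillion int8 ops/s, i.e. ~92 TOPS
```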
TPU v1 used 8 GB of DDR3 memory with 34 GB/s of bandwidth, making it memory-bandwidth-limited for many workloads. Despite this constraint, the chip achieved 15 to 30 times better performance than contemporary Intel Haswell CPUs and Nvidia K80 GPUs on neural network inference benchmarks, as documented in the ISCA 2017 paper.
Announced at Google I/O in May 2017, TPU v2 was the first generation to support both training and inference. The shift to training required floating-point arithmetic, and TPU v2 introduced support for bfloat16 and fp32 computation, delivering 45 TFLOPS of peak bf16 performance. Memory was upgraded to 16 GB of HBM with 600 GB/s bandwidth.
Critically, TPU v2 also introduced the Inter-Chip Interconnect (ICI), a custom high-speed link that connected TPU chips directly to their neighbors in a 2D torus topology. This enabled the creation of TPU Pods, clusters of up to 256 chips that functioned as a single logical accelerator. A full TPU v2 Pod delivered approximately 11.5 petaFLOPS of peak throughput. TPU v2 was the first generation offered through the Cloud TPU service.
Announced at Google I/O in May 2018, TPU v3 doubled per-chip performance to 123 TFLOPS of bf16 compute and doubled HBM capacity to 32 GB per chip with 900 GB/s bandwidth. The increased power density required Google to introduce liquid cooling for the first time in its TPU hardware, replacing the air cooling used in previous generations.
TPU v3 Pods scaled to 1,024 chips using the same 2D torus ICI topology as v2 but with higher per-link bandwidth, delivering over 100 petaFLOPS per pod. Notable models trained on TPU v3 include BERT, which was trained on a TPU v3 Pod in just four days.
TPU v4, announced at Google I/O in May 2021, represented a major architectural leap. It moved to a 7 nm process node, delivered 275 TFLOPS of bf16 performance, and maintained 32 GB of HBM2e with 1,200 GB/s bandwidth. Mean chip power consumption was approximately 200 watts.
The most significant innovation in TPU v4 was the introduction of Optical Circuit Switches (OCS) in the interconnect fabric. While previous generations used fixed 2D torus topologies, TPU v4 adopted a 3D torus and added reconfigurable optical switches that could dynamically reroute interconnect links. This made the network topology programmable: if a chip or link failed, the OCS could reconfigure around the fault, improving availability and utilization. The OCS components accounted for less than 5% of total system cost and power.
TPU v4 Pods connected up to 4,096 chips, delivering exascale-class ML performance. A published study showed that PaLM 540B was trained across two TPU v4 Pods (6,144 chips total) and that training ran at approximately 60% of peak FLOPS utilization, a high figure for large-scale distributed training. TPU v4 was also the subject of a detailed paper published at ISCA 2023, "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings."
TPU v5e was designed as the cost-efficient variant in the fifth generation, optimized for the best price-performance ratio rather than raw peak performance. Each v5e chip contains a single TensorCore with four MXUs, delivering 197 TFLOPS (bf16) or 393 TOPS (int8). HBM capacity was 16 GB per chip with 819 GB/s bandwidth.
Google deliberately reduced the core count, memory, and clock speed of v5e compared to v5p to hit aggressive power and cost targets. The chip uses a 2D torus ICI topology and scales to 256-chip Pods. TPU v5e was positioned as 2.3 times better in price-performance than TPU v4, making it particularly attractive for inference workloads and training of models with up to 200 billion parameters.
Announced in December 2023 alongside the AI Hypercomputer initiative, TPU v5p was the performance-focused variant. Each chip delivers 459 TFLOPS (bf16) or 918 TOPS (int8), more than double the FLOPS of TPU v4, with 95 GB of HBM and 2,765 GB/s bandwidth (triple the HBM of v4).
TPU v5p Pods scale to 8,960 chips connected via a 3D torus ICI at 4,800 Gbps per chip, making them the largest TPU Pods at that time. Google reported that TPU v5p trains large language models 2.8 times faster than TPU v4, and its second-generation SparseCores train embedding-dense models 1.9 times faster. At 459 TFLOPS per chip, a full 8,960-chip TPU v5p Pod delivers roughly 4.1 exaFLOPS of aggregate bf16 compute.
The sixth-generation TPU, codenamed Trillium, reached general availability in late 2024. Trillium marked a significant architectural shift by expanding the MXU from 128 x 128 to 256 x 256 multiply-accumulators and increasing the clock speed. This combination delivers 918 TFLOPS of bf16 performance per chip, a 4.7x improvement over TPU v5e.
HBM capacity doubled to 32 GB per chip with 1,640 GB/s bandwidth, and ICI bandwidth also doubled compared to v5e. Trillium introduced the third-generation SparseCore and is over 67% more energy-efficient than TPU v5e. Pods scale to 256 chips, and Google reported that a single Trillium cluster can deliver 91 exaFLOPS of aggregate compute.
Announced at Google Cloud Next '25, Ironwood is Google's seventh-generation and most powerful TPU to date. It is described as the first TPU designed specifically for the "age of inference," reflecting the growing importance of serving large models at scale.
Each Ironwood chip is composed of two chiplets, with each chiplet containing one TensorCore, two SparseCores, and 96 GB of HBM, for a total of 192 GB per chip (a 6x increase over Trillium). Per-chip performance reaches 4,614 FP8 TFLOPS, more than 4 times Trillium and 10 times TPU v5p. HBM bandwidth is approximately 7.37 TB/s per chip, and ICI bandwidth reaches 1.2 TB/s bidirectional.
Ironwood scales to 9,216-chip clusters delivering 42.5 exaFLOPS of aggregate compute, which Google noted exceeds the performance of the world's largest publicly benchmarked supercomputer. Power efficiency is 2 times better than Trillium and nearly 30 times better than the original Cloud TPU v2 from 2018. Early adopters include Anthropic, which announced plans to use up to one million TPUs for scaling its Claude models.
A TPU Pod is a cluster of TPU chips connected by Google's proprietary Inter-Chip Interconnect (ICI), a custom high-speed network that allows the chips to communicate directly without going through a host CPU or external network switch. Pods function as a single, large accelerator for distributed training and inference workloads.
The interconnect topology has evolved over generations:
| Topology | Generations | Description |
|---|---|---|
| 2D torus | TPU v2, v3, v5e | Each chip connects to four neighbors (up, down, left, right) in a wraparound grid |
| 3D torus | TPU v4, v5p, v7 | Each chip connects to six neighbors along three axes, reducing network diameter |
| 3D torus + OCS | TPU v4 | Optical circuit switches enable dynamic reconfiguration of links |
In a torus topology, the wraparound connections reduce the maximum number of hops between any two chips. For a 3D torus, the maximum distance scales as roughly N/2 per dimension rather than N, which substantially lowers worst-case communication latency for collective operations such as all-reduce.
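For intuition, the small helper below computes the minimum hop count between two chips in a wraparound grid; the slice shape in the example is hypothetical.

```python
def torus_hops(src, dst, dims):
    """Minimum hop count between two chips in a wraparound (torus) grid.

    `src` and `dst` are coordinate tuples and `dims` is the torus size per axis.
    With wraparound links, the per-dimension distance is at most dims[i] // 2,
    versus up to dims[i] - 1 in a plain mesh.
    """
    return sum(min(abs(s - d), n - abs(s - d)) for s, d, n in zip(src, dst, dims))

# Example: opposite corners of a hypothetical 8 x 8 x 8 (512-chip) 3D torus slice.
print(torus_hops((0, 0, 0), (7, 7, 7), (8, 8, 8)))   # 3 hops, versus 21 in a mesh
```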
The Optical Circuit Switch (OCS) technology introduced in TPU v4 was a major innovation. OCSes use optical fiber and small mirrors to physically reconfigure which TPU chips are connected, without converting signals to electrical form. This enables:

- Routing around failed chips or links, improving availability and overall utilization
- Carving out slices of different shapes and sizes from the same physical hardware on demand
- Adjusting the logical torus topology to better match a given workload's communication pattern
For workloads that require more chips than a single Pod can provide, Google connects multiple Pods via its data center network (DCN). While DCN bandwidth is lower than ICI, careful placement and communication scheduling can still enable efficient multi-Pod training. The PaLM 540B model, for example, was trained across two TPU v4 Pods connected via DCN.
XLA (Accelerated Linear Algebra) is the open-source compiler that translates high-level ML framework operations into optimized machine code for TPUs. XLA takes a computation graph (a directed acyclic graph of tensor operations), fuses operations to reduce memory traffic, tiles computations to fit in on-chip VMEM, and schedules data movement to keep the MXUs maximally utilized.
XLA is the primary compilation path for both JAX and TensorFlow on TPUs. It is also available for PyTorch through the PyTorch/XLA project. In 2023, Google open-sourced XLA as part of the OpenXLA initiative, making the compiler available as a standalone project that supports multiple hardware backends including TPUs, GPUs, and CPUs.
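As a minimal sketch of this compilation path, the JAX snippet below defines a small layer and lets XLA fuse and tile it under jit; the function name and shapes are arbitrary, and the same code runs unmodified on CPU, GPU, or TPU backends.

```python
import jax
import jax.numpy as jnp

# A small computation graph: matmul, bias add, GELU activation.
# Under jax.jit, XLA fuses the element-wise ops with the matmul epilogue and
# tiles the computation so intermediates stay in on-chip memory where possible.
@jax.jit
def fused_layer(x, w, b):
    return jax.nn.gelu(x @ w + b)

x = jnp.ones((128, 512), dtype=jnp.bfloat16)
w = jnp.ones((512, 256), dtype=jnp.bfloat16)
b = jnp.zeros((256,), dtype=jnp.bfloat16)
print(fused_layer(x, w, b).shape)   # (128, 256)
```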
PJRT (commonly expanded as "Pretty much Just another RunTime") is a hardware-agnostic and framework-agnostic runtime interface that sits between ML frameworks and the XLA compiler. PJRT provides a uniform API for dispatching computations to different accelerators, abstracting away the details of each hardware platform. It is the primary runtime interface for TensorFlow and JAX on TPUs, and is fully supported for PyTorch as well.
TPUs are supported by the three major ML frameworks:
| Framework | TPU Support Mechanism | Notes |
|---|---|---|
| JAX | Native (XLA-based from inception) | The primary framework for TPU development at Google; designed around XLA's functional programming model |
| TensorFlow | Native (XLA compilation) | Long-standing TPU support; TensorFlow was originally the main framework for TPU usage |
| PyTorch | PyTorch/XLA library | Translates PyTorch's eager-mode operations into XLA graphs; actively maintained by Google |
JAX, developed by Google, has become the preferred framework for TPU workloads because its functional, pure-function design aligns naturally with XLA's compilation model. JAX's jit, vmap, pmap, and shmap transformations map cleanly to TPU Pod topologies, making it straightforward to write programs that scale across thousands of chips.
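A tiny, hedged example of this programming model: the sketch below uses pmap and an all-reduce (psum) to average a per-device value across whatever devices are attached. On a TPU slice the reduction travels over the ICI; the axis name and input values are illustrative.

```python
from functools import partial
import jax
import jax.numpy as jnp

n = jax.local_device_count()   # TPU cores on this host (falls back to 1 on CPU)

@partial(jax.pmap, axis_name="devices")
def parallel_mean(x):
    # Each device holds one shard of x; psum is an all-reduce across the axis.
    return jax.lax.psum(x, axis_name="devices") / n

shards = jnp.arange(n, dtype=jnp.float32)   # one scalar per device
print(parallel_mean(shards))                # every device ends up holding the mean
```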
In 2025, the popular open-source LLM serving framework vLLM added a unified TPU backend supporting both PyTorch and JAX. This allows users to serve large language models on TPUs using the same vLLM APIs they use on GPUs, lowering the barrier for organizations migrating inference workloads to TPU hardware.
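Serving through vLLM looks the same as it does on GPUs. The sketch below assumes a TPU VM with the vLLM TPU backend installed; the model name and sampling settings are illustrative, not recommendations.

```python
# Hedged serving sketch: same vLLM API as on GPUs; assumes the TPU backend is installed.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")   # example model, chosen only for illustration
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what a TPU is in one sentence."], params)
print(outputs[0].outputs[0].text)
```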
TPUs have been used to train many of the most influential AI models of the past decade. The following table highlights key examples:
| Model | Year | TPU Generation | Scale | Significance |
|---|---|---|---|---|
| AlphaGo | 2016 | TPU v1 | Inference only | Defeated world Go champion Lee Sedol; first major public demonstration of TPU capabilities |
| Transformer | 2017 | N/A (trained on GPUs) | Research scale | The "Attention Is All You Need" architecture originated at Google; the original models were trained on GPUs, and the design became the template for nearly every later TPU-trained model |
| BERT | 2018 | TPU v3 Pod | 16 TPU chips | Revolutionized NLP; trained in 4 days on a TPU v3 Pod |
| T5 | 2019 | TPU v3 | 1,024 chips | Text-to-Text Transfer Transformer; explored scaling laws for language models |
| AlphaFold 2 | 2020 | TPU v3 | 128 chips | Solved the protein structure prediction problem; won CASP14 |
| LaMDA | 2021 | TPU v3 | 1,024 chips | Conversational language model that powered early Google Bard |
| PaLM | 2022 | TPU v4 | 6,144 chips (2 Pods) | 540B parameter model; demonstrated scaling to thousands of TPU chips |
| Gemini | 2023 | TPU v4/v5p | Large-scale Pods | Google's flagship multimodal model family |
| Gemma | 2024 | TPU v5e | N/A | Open-weights model family released for the community |
Beyond Google's own models, external researchers and companies have used Cloud TPUs to train large models, facilitated by the TPU Research Cloud program and Cloud TPU's pay-as-you-go pricing.
TPUs and GPUs take fundamentally different approaches to accelerating computation:
| Aspect | TPU | GPU (Nvidia) |
|---|---|---|
| Design philosophy | Domain-specific (ML only) | General-purpose parallel compute |
| Core compute unit | Systolic array (MXU) | CUDA cores + Tensor Cores |
| Programming model | XLA graph compilation | CUDA / cuDNN / cuBLAS |
| Precision support | bf16, fp32, int8, fp8 (v7) | fp16, bf16, fp32, fp8, int8 |
| Memory | HBM (on-package) | HBM (on-package) |
| Interconnect | ICI (proprietary torus) | NVLink + NVSwitch + InfiniBand |
| Availability | Google Cloud only | Purchasable; all major clouds |
| Software ecosystem | JAX, TensorFlow, PyTorch/XLA | CUDA ecosystem (broad support) |
Benchmarks and real-world deployments have shown that TPUs and GPUs trade advantages depending on the workload:
TPU strengths:

- Very high throughput and performance per watt on the dense matrix math that dominates neural network training and large-batch inference
- Strong price-performance, particularly on the cost-optimized "e" variants (Google positioned v5e at 2.3 times the price-performance of v4)
- Pod-scale ICI interconnect that lets thousands of chips behave as a single accelerator without external network switches
- Tight integration with XLA and JAX, which handle compilation and sharding across large slices
TPU limitations:

- Available only through Google Cloud; the chips cannot be purchased and deployed on-premises
- A smaller software ecosystem than Nvidia's CUDA stack, with the best results requiring XLA-friendly, graph-compiled code
- Less flexibility than GPUs for custom kernels, highly dynamic workloads, and non-ML computation
Cloud TPU pricing follows a per-chip-hour model, with rates varying by TPU generation, region, and commitment level:
| TPU Generation | On-Demand (approx.) | Committed Use (3-year) |
|---|---|---|
| TPU v5e | ~$1.20/chip/hour | Discounted (varies) |
| TPU v6e (Trillium) | ~$1.38/chip/hour | As low as ~$0.39/chip/hour |
Google also offers spot (preemptible) pricing at significant discounts for workloads that can tolerate interruptions, such as research experiments and non-time-critical training runs.
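As a back-of-the-envelope illustration of the per-chip-hour model using the approximate on-demand rate above (slice size, duration, and rate are hypothetical and exclude committed-use or spot discounts):

```python
# Rough cost estimate for a hypothetical week-long run on a v6e slice.
chips = 64                     # hypothetical slice size
hours = 24 * 7                 # one week
rate_per_chip_hour = 1.38      # approximate v6e on-demand price in USD (see table above)
print(f"${chips * hours * rate_per_chip_hour:,.0f}")   # ≈ $14,838 before discounts
```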
The TPU Research Cloud is a program that provides free Cloud TPU access to academic researchers and open-source developers. Accepted participants receive temporary quota for Cloud TPUs at no charge, with access to TPU v4 and newer generations. In exchange, researchers are expected to share their work publicly through publications, code, or blog posts. The TRC has supported research in areas ranging from natural language processing to protein structure prediction and climate modeling.
Cloud TPUs are available in select Google Cloud regions, with availability varying by generation. TPU v4 and v5e are available in the broadest set of regions, while newer generations like v6e and v7 are initially offered in a smaller number of locations before expanding over time.
In addition to its data center TPUs, Google developed the Edge TPU, a small ASIC designed for running ML inference on edge devices with tight power and size constraints. The Edge TPU is marketed under the Google Coral brand.
The Edge TPU delivers 4 trillion operations per second (4 TOPS) of int8 inference performance while consuming only 2 watts of power, yielding an efficiency of 2 TOPS per watt. It can execute mobile computer vision models such as MobileNet V2 at nearly 400 frames per second. The chip supports convolutional neural networks, specifically deep feed-forward architectures compiled with the Edge TPU compiler.
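A hedged sketch of what on-device inference looks like, assuming the tflite_runtime package and the Edge TPU runtime (libedgetpu) are installed and the model has already been compiled with the Edge TPU compiler; the model filename is hypothetical.

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Load an Edge TPU-compiled model and attach the Edge TPU delegate.
interpreter = Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",                  # hypothetical file
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy input of the expected shape/dtype and run one inference.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```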
Google Coral offers the Edge TPU in several form factors:
| Product | Description |
|---|---|
| Coral USB Accelerator | USB dongle that adds Edge TPU inference to any Linux computer (including Raspberry Pi) |
| Coral Dev Board | Single-board computer with an on-board Edge TPU for prototyping |
| Coral M.2 / Mini PCIe Module | M.2 or mini PCIe cards for integration into custom hardware designs |
| Coral System-on-Module (SoM) | Production-ready module for embedded and IoT products |
The Edge TPU and Coral platform target applications that require real-time, on-device ML inference without cloud connectivity.
By processing data locally, the Edge TPU eliminates network latency, reduces bandwidth usage, and keeps sensitive data on the device for improved privacy.
The TPU played a pivotal role in demonstrating that purpose-built hardware for machine learning could deliver order-of-magnitude improvements over general-purpose processors. Before the TPU, the ML hardware landscape was dominated by Nvidia GPUs repurposed from their original graphics rendering role. Google's success with TPUs inspired a wave of custom AI chip development across the industry, including efforts from Apple (Neural Engine), Amazon (Inferentia, Trainium), Microsoft (Maia), Meta (MTIA), Tesla (Dojo), and numerous startups such as Cerebras, Graphcore, SambaNova, and Groq.
Because TPUs are tightly integrated with Google's research infrastructure, they have directly enabled many of the field's most important breakthroughs. The Transformer architecture, BERT, the T5 framework, AlphaFold, PaLM, and Gemini were all developed and trained on TPU hardware. The availability of large-scale TPU Pods has allowed Google researchers to explore scaling laws and train models at sizes that would be prohibitively expensive on commercially available hardware.
Cloud TPU has also influenced how organizations think about ML infrastructure. By offering TPUs as a cloud service with per-hour pricing, Google created a model where companies can access specialized AI hardware without capital expenditure on physical chips. This approach, combined with competitive pricing, has positioned Google Cloud as a credible alternative to Nvidia-centric infrastructure for large-scale ML workloads.
A Tensor Processing Unit, or TPU, is a special kind of computer chip made by Google. It is designed to help computers learn faster and be better at understanding things like pictures, sounds, and words. Regular computer chips (CPUs) are good at doing lots of different kinds of tasks, but they are slow at the specific math that AI needs. GPUs are faster at that math, but TPUs are built to do only that math, so they are even faster and use less electricity.
Think of it like kitchen tools. A CPU is like a Swiss Army knife: it can do many things, but none of them perfectly. A GPU is like a good chef's knife: great for chopping, decent for other tasks. A TPU is like a specialized pasta machine: it does one thing (make pasta) really, really well, and much faster than trying to do it by hand with a knife.
Google has made several versions of TPUs, each one faster and more capable than the one before. People use these chips to make computers do amazing things, like understand different languages, recognize pictures, predict how proteins fold, and even play games like a human.