Tensor Processing Unit (TPU)
Last reviewed
May 8, 2026
Sources
22 citations
Review status
Source-backed
Revision
v3 · 6,703 words
See also: Machine learning terms, GPU, Deep learning, Edge TPU, AI chip
A Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google specifically for accelerating machine learning workloads. Unlike general-purpose processors such as CPUs or even GPUs, TPUs are built from the ground up to handle the matrix multiplication and tensor operations that form the backbone of deep learning algorithms. By optimizing for these operations and trading away the flexibility of general-purpose hardware, TPUs achieve significantly higher throughput and better energy efficiency for neural network training and inference. The chip is purpose-built around a systolic array for matrix multiply-accumulate (MAC) operations and a memory hierarchy tuned for tensor traffic.
Google first deployed TPUs internally in its data centers in 2015 and publicly announced the chip at Google I/O in May 2016. Since then, the company has shipped seven generations spanning eight chips: TPU v1 (inference only), v2, v3, v4, the fifth generation in two variants (v5e and v5p), the sixth-generation Trillium (v6e), and the seventh-generation Ironwood (v7). Each generation brought substantial improvements in compute performance, memory capacity, interconnect bandwidth, and energy efficiency. TPUs power many of Google's most prominent AI services, including Google Search, Google Translate, Google Photos, YouTube recommendations, and flagship models like BERT, PaLM, and Gemini. Through Google Cloud, TPUs are also available to external researchers and enterprises.
TPUs sit behind many of the workloads people now associate with modern AI: AlphaGo and AlphaZero in 2016 and 2017, AlphaFold protein structure prediction in 2020, the PaLM family of large language models, the Gemini family, and Google products such as Search, Photos, and Translate. Anthropic disclosed in 2025 that its Claude models also train and serve on TPU pods, with multi-gigawatt commitments stretching into 2026 and beyond.
The TPU project began inside Google around 2013, driven by a projected surge in computational demand from neural network inference across the company's services. The team was led by Norman Jouppi, a distinguished hardware engineer who had previously contributed to MIPS R4000 processor design and to cache memory systems research at HP. Google's internal analysis suggested that if every user made just three minutes of voice queries per day using neural network-based speech recognition, the company would need to double its data center compute capacity. Building a custom ASIC tuned specifically for neural network math offered a more practical path than buying vast quantities of commodity CPUs or GPUs.
The first TPU was designed, verified, and built in just 15 months, an unusually fast timeline for a custom chip. Google began deploying TPU v1 in its data centers in 2015, using it to accelerate inference for services such as Google Search RankBrain, Google Street View text processing, Google Translate's Neural Machine Translation system, and the AlphaGo system that defeated world champion Lee Sedol in March 2016. The chip stayed quiet inside Google for more than a year before the public reveal at I/O 2016.
The foundational paper describing the TPU, "In-Datacenter Performance Analysis of a Tensor Processing Unit," was authored by Jouppi and 75 co-authors and presented at the 44th International Symposium on Computer Architecture (ISCA) in June 2017. The paper demonstrated that the TPU delivered 15 to 30 times higher performance and 30 to 80 times better performance per watt than contemporary CPUs and GPUs for neural network inference workloads. It described how TPU v1 served Google's production neural networks, namely multilayer perceptrons, convolutional networks, and LSTMs, which together represented about 95% of inference demand at the time. The paper also noted that four of the six benchmarked applications were memory bandwidth limited, an observation that has shaped every TPU generation since. This publication established the TPU as a landmark in domain-specific accelerator design and helped popularize the concept of custom AI chips across the industry.
A follow-up paper, "A Domain-Specific Supercomputer for Training Deep Neural Networks," appeared in Communications of the ACM and IEEE Micro in 2020 and described the architecture of TPU v2 and v3. The TPU v4 system was the subject of a detailed paper at ISCA 2023, "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings."
Google made TPUs available to external users through its Cloud TPU service starting in 2018. The company also launched the TPU Research Cloud (TRC) program, which provides free access to Cloud TPUs for academic researchers. The TRC program grants accepted applicants access to a cluster of over 1,000 Cloud TPU devices, with the expectation that participants share their findings through publications, open-source code, or blog posts.
The central computational engine inside every TPU is the Matrix Multiply Unit (MXU), which is built on a systolic array architecture. A systolic array is a grid of interconnected processing elements (PEs) where data flows rhythmically between neighbors, much like a heartbeat (hence the name "systolic," borrowed from the medical term for cardiac contraction). Each PE performs a small multiply-and-accumulate (MAC) operation and passes partial results to the next PE. This design minimizes data movement and maximizes parallelism, since thousands of multiplications happen simultaneously without each one needing to independently fetch data from memory.
In TPU v1, the array was a 256 by 256 grid of INT8 cells totaling 65,536 MAC units. From v2 through v5p Google used 128 by 128 bfloat16 cells (16,384 MAC units per MXU), with two MXUs per TensorCore. Starting with TPU v6e (Trillium), Google expanded the MXU back to 256 by 256, quadrupling the number of multiply-accumulators to 65,536 per unit. In the 128 by 128 configuration, each MXU performs one matrix multiply of the form bfloat16[8,128] x bfloat16[128,128], producing an fp32[8,128] result every 8 clock cycles, with all multiplications carried out in bfloat16 precision and all accumulations in full FP32 precision. Operands flow through the array in lockstep: weights stay resident while activations stream across, accumulating partial sums as they go. The design eliminates almost all register-file traffic, which is why TPUs hit such high utilization on dense GEMMs.
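The weight-stationary dataflow described above can be sketched as a small cycle-by-cycle simulation. This is a toy model, not Google's hardware design: the function name `systolic_matmul`, the one-register-per-PE layout, and the skewed injection schedule are all assumptions made for illustration. Each processing element holds one weight, activations move one PE to the right per cycle, and partial sums move one PE down per cycle.

```python
def systolic_matmul(x, w):
    """Multiply x (M x K) by w (K x N) on a simulated K x N weight-stationary array.

    Hypothetical toy model: PE (k, n) permanently holds w[k][n].
    Returns the product and the number of cycles the simulation ran.
    """
    m_rows, k_dim, n_cols = len(x), len(w), len(w[0])
    # a[k][n]: activation currently held by PE (k, n).
    # p[k][n]: partial sum PE (k, n) produced on the previous cycle.
    a = [[0] * n_cols for _ in range(k_dim)]
    p = [[0] * n_cols for _ in range(k_dim)]
    out = [[0] * n_cols for _ in range(m_rows)]
    cycles = m_rows + k_dim + n_cols - 2
    for t in range(cycles):
        new_a = [[0] * n_cols for _ in range(k_dim)]
        new_p = [[0] * n_cols for _ in range(k_dim)]
        for k in range(k_dim):
            for n in range(n_cols):
                if n == 0:
                    # Skewed injection: row k of the array starts k cycles late.
                    m = t - k
                    new_a[k][n] = x[m][k] if 0 <= m < m_rows else 0
                else:
                    # Activations shift one PE to the right each cycle.
                    new_a[k][n] = a[k][n - 1]
                # Partial sums arrive from the PE directly above.
                p_in = p[k - 1][n] if k > 0 else 0
                new_p[k][n] = p_in + new_a[k][n] * w[k][n]
        a, p = new_a, new_p
        # Finished sums leave the bottom row of the array.
        for n in range(n_cols):
            m = t - (k_dim - 1) - n
            if 0 <= m < m_rows:
                out[m][n] = p[k_dim - 1][n]
    return out, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

In this model no PE ever fetches an operand from memory during the computation; every value arrives from a neighbor, which is the property the paragraph above attributes to the real array.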
A systolic array is a poor fit for irregular workloads. It assumes the multiplication has a fixed shape large enough to fill the array, and it punishes sparse or branch-heavy code. This is one reason Google added separate hardware paths for embeddings (the SparseCore introduced in v4 and refreshed in Trillium) and for vector operations.
Each TPU chip contains one or more TensorCores, which serve as the primary compute units. A TensorCore includes the MXUs, a vector processing unit (VPU) for element-wise operations such as activations, normalizations, and softmax, a scalar unit for control flow and address arithmetic, and on-chip memory called VMEM (Vector Memory). The VPU is wider than a typical CPU SIMD lane but narrower than the MXU, and the compiler is responsible for scheduling work among the three compute units. The memory hierarchy is designed to keep the MXUs fed with data.
Unlike most CPUs and modern GPUs, the on-chip memory is software managed: the XLA compiler decides when to stage tensors in and out, which removes the cost and unpredictability of hardware caches but pushes more work onto the toolchain. Data flows from HBM into VMEM, and from VMEM into the MXU for computation. The results flow back out through the same path. Efficient use of this memory hierarchy is critical for achieving high utilization of the MXU.
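The staging pattern described above (HBM to on-chip buffer, then buffer to the matrix unit) can be illustrated with a blocked matrix multiply. This is a minimal sketch in plain Python, not XLA's actual tiling algorithm: the function name `tiled_matmul`, the `tile` parameter, and the `staged_elements` counter are hypothetical and exist only to show how tile size affects off-chip traffic.

```python
def tiled_matmul(a, b, tile=2):
    """Blocked matmul that stages small blocks before multiplying them.

    Hypothetical model: slicing a block stands in for an HBM -> VMEM copy,
    and the inner loops stand in for feeding the staged block to the MXU.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    staged_elements = 0  # elements copied into the on-chip buffer
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Stage one block of A and one block of B.
                a_blk = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_blk = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                staged_elements += sum(len(r) for r in a_blk)
                staged_elements += sum(len(r) for r in b_blk)
                # Multiply the staged blocks and accumulate into C.
                for i, a_row in enumerate(a_blk):
                    for j in range(len(b_blk[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[kk] * b_blk[kk][j] for kk in range(len(b_blk))
                        )
    return c, staged_elements


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
for tile in (1, 2, 4):
    _, staged = tiled_matmul(a, identity, tile)
    print(tile, staged)
```

Larger tiles reuse each staged element more times, so the same product needs fewer copies from off-chip memory. The limit in practice is the capacity of the on-chip buffer, which this sketch does not model.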
Starting with TPU v4, Google introduced the SparseCore, a dedicated accelerator for processing sparse computations, particularly the large embedding table lookups common in recommendation systems and ranking models. Embedding tables are a key component of models used by services like YouTube, Google Ads, and Google Search. Standard dense matrix hardware handles these irregular, memory-bound lookups inefficiently, so the SparseCore provides a dataflow processor optimized specifically for this pattern.
The SparseCore uses only about 5% of the total die area and power budget but delivers 5 to 7 times faster embedding lookups compared to running them on the MXU. TPU v5p includes second-generation SparseCores, and TPU v6e introduced the third generation. TPU v7 (Ironwood) contains four SparseCores per chip.
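To see why dense matrix hardware is a poor match for embedding lookups, compare two ways of computing the same result. The sketch below is illustrative only and does not model the SparseCore itself: `lookup_gather` and `lookup_dense` are hypothetical names, and the cost counters are simple element counts rather than measured performance.

```python
def lookup_gather(table, ids):
    """Sum the selected embedding rows directly (touches only len(ids) rows)."""
    dim = len(table[0])
    out = [0.0] * dim
    for i in ids:
        for d in range(dim):
            out[d] += table[i][d]
    return out, len(ids) * dim  # number of table elements read


def lookup_dense(table, ids):
    """Same result expressed as a multi-hot vector times the whole table."""
    vocab, dim = len(table), len(table[0])
    multi_hot = [0.0] * vocab
    for i in ids:
        multi_hot[i] += 1.0
    out = [
        sum(multi_hot[v] * table[v][d] for v in range(vocab))
        for d in range(dim)
    ]
    return out, vocab * dim  # number of multiply-accumulates


# Hypothetical table: 1,000 rows of dimension 2.
table = [[float(v), float(v * 10)] for v in range(1000)]
print(lookup_gather(table, [3, 7, 7]))
print(lookup_dense(table, [3, 7, 7]))
```

Both functions return the same vector, but the dense formulation does work proportional to the full table size while the gather does work proportional to the number of IDs. Production embedding tables are far larger than this example, which widens the gap further.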
Google Brain developed the bfloat16 (Brain Floating Point 16) number format specifically for use in TPUs and deep learning workloads. Bfloat16 is a 16-bit floating-point format consisting of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Unlike the IEEE 754 half-precision (fp16) format, which allocates 5 bits to the exponent and 10 to the mantissa, bfloat16 preserves the same exponent range as standard 32-bit floats (fp32) while reducing the mantissa precision.
The rationale behind this design is that neural networks are far more sensitive to the dynamic range of values (governed by the exponent) than to precision (governed by the mantissa). By maintaining the full fp32 exponent range, bfloat16 avoids the overflow and underflow issues that can plague fp16 training, while still cutting memory usage and bandwidth requirements in half compared to fp32. The bfloat16 format has since been adopted widely beyond TPUs, including by Nvidia GPUs, Intel Xeon processors, and AMD Instinct accelerators.
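The bit layout can be demonstrated by converting a 32-bit float to bfloat16 by hand. The sketch below uses only Python's `struct` module; the helper name `to_bfloat16` is hypothetical, and the round-to-nearest-even step is an assumption about rounding mode rather than a statement of what any particular TPU generation does.

```python
import struct


def to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value, returned as a float.

    The input must fit in float32; struct raises OverflowError otherwise.
    """
    if x != x:  # NaN passes through unchanged
        return x
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits about to be dropped.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    # Keep the top 16 bits: 1 sign bit, 8 exponent bits, 7 mantissa bits.
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

# Range: a value far beyond fp16's maximum is still finite in bfloat16.
print(to_bfloat16(3.0e38) > FP16_MAX)
# Precision: increments smaller than 2**-8 near 1.0 are rounded away.
print(to_bfloat16(1.0 + 2**-10), to_bfloat16(1.0 + 2**-7))
```

Because bfloat16 is simply the top half of a float32, the conversion is a rounding step plus a mask. Converting to fp16 instead requires re-biasing the exponent and handling values that fall outside its narrower range.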
Google has released seven generations of data center TPUs, each with significant improvements over its predecessor. The table below summarizes the key specifications:
| Generation | Announced | Process | Clock | Peak Compute (bf16 unless noted) | Memory Capacity | Memory Bandwidth | TDP | Max Pod Size | Topology | Cooling |
|---|---|---|---|---|---|---|---|---|---|---|
| TPU v1 | May 2016 | 28 nm | 700 MHz | 92 TOPS (INT8) | 8 GiB DDR3 | 34 GB/s | 75 W | 1 chip | PCIe attached | Air |
| TPU v2 | May 2017 | 16 nm | 700 MHz | 45 TFLOPS | 16 GB HBM | 600 GB/s | 280 W | 256 chips | 2D torus ICI | Air |
| TPU v3 | May 2018 | 16 nm | 940 MHz | 123 TFLOPS | 32 GB HBM | 900 GB/s | 220 W | 1,024 chips | 2D torus ICI | Liquid |
| TPU v4 | May 2021 | 7 nm | 1,050 MHz | 275 TFLOPS | 32 GB HBM2e | 1,200 GB/s | ~200 W | 4,096 chips | 3D torus + OCS | Liquid |
| TPU v5e | August 2023 | n/a | n/a | 197 TFLOPS | 16 GB HBM | 819 GB/s | n/a | 256 chips | 2D torus ICI | Liquid |
| TPU v5p | December 2023 | n/a | 1,750 MHz | 459 TFLOPS | 95 GB HBM3 | 2,765 GB/s | n/a | 8,960 chips | 3D torus + OCS | Liquid |
| TPU v6e (Trillium) | May 2024 (GA Dec 2024) | n/a | n/a | 918 TFLOPS | 32 GB HBM3 | 1,640 GB/s | ~300 W | 256 chips | 2D torus ICI | Liquid |
| TPU v7 (Ironwood) | April 2025 (GA Nov 2025) | n/a | n/a | 4,614 TFLOPS (FP8) | 192 GB HBM3E | 7,370 GB/s | n/a | 9,216 chips | 3D mesh + OCS | Liquid |
A few notes on the table. Google has not always disclosed the manufacturing process for newer TPUs, so some cells say n/a. The TPU v1 figure of 92 TOPS comes from the Jouppi 2017 paper; Google's earlier marketing sometimes quoted 23 TOPS, which referred to a sustained measurement on real workloads rather than the peak. TPU v5p uses HBM3 with the highest per-chip bandwidth of any pre-Ironwood TPU. Trillium raises peak FLOPs about 4.7 times over v5e (918 versus 197 TFLOPS), doubles HBM capacity, HBM bandwidth, and ICI bandwidth, and ships with the third-generation SparseCore for embedding-heavy workloads. Ironwood pushes 192 GB of HBM3E per chip, six times Trillium, and pods of up to 9,216 chips connected at 1.2 TB/s bidirectional ICI per chip, for a total of about 42.5 FP8 exaFLOPS per pod.
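Pod-level peak figures follow directly from the per-chip numbers and maximum pod sizes in the table. The short calculation below is a sanity check under that assumption; the dictionary name `GENERATIONS` and the helper `pod_peak_pflops` are hypothetical, and the results are theoretical peaks rather than sustained throughput.

```python
# (per-chip peak in TFLOPS, maximum pod size in chips), copied from the table.
# All values are bf16 except v7, which the table quotes in FP8.
GENERATIONS = {
    "v2": (45, 256),
    "v3": (123, 1024),
    "v4": (275, 4096),
    "v5e": (197, 256),
    "v5p": (459, 8960),
    "v6e": (918, 256),
    "v7": (4614, 9216),
}


def pod_peak_pflops(generation):
    """Peak pod throughput in petaFLOPS: per-chip peak times chip count."""
    per_chip_tflops, chips = GENERATIONS[generation]
    return per_chip_tflops * chips / 1000


for name in GENERATIONS:
    print(f"{name}: {pod_peak_pflops(name):,.1f} PFLOPS")
```

The v7 entry works out to about 42,500 petaFLOPS, which matches the 42.5 exaFLOPS figure quoted for a full Ironwood pod.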
The first-generation TPU was designed exclusively for inference. It featured a 256 by 256 systolic array of 65,536 MAC units capable of 92 trillion 8-bit integer operations per second (92 TOPS). The chip used 28 nm process technology, ran at 700 MHz, and was rated at 75 watts thermal design power. It was packaged to fit into existing hard drive bays in Google's servers, requiring no modifications to the data center infrastructure.
TPU v1 used 8 GiB of DDR3 SDRAM with 34 GB/s of bandwidth, plus 28 MiB of on-chip software-managed memory to stage tensors close to the MXU. The chip was memory-bandwidth-limited for many workloads. Despite this constraint, it achieved 15 to 30 times better performance than contemporary Intel Haswell CPUs and Nvidia K80 GPUs on neural network inference benchmarks, as documented in the ISCA 2017 paper.
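The 92 TOPS peak and the memory-bandwidth limit can both be checked with a few lines of arithmetic. The sketch below uses a standard roofline estimate and assumes each multiply-accumulate counts as two operations; the names `PEAK_OPS`, `RIDGE`, and `attainable_ops` are hypothetical, and on-chip buffering is ignored.

```python
MACS = 256 * 256             # systolic array size
CLOCK_HZ = 700e6             # 700 MHz
PEAK_OPS = MACS * 2 * CLOCK_HZ   # one MAC = one multiply + one add
BANDWIDTH = 34e9             # DDR3 bandwidth in bytes per second


def attainable_ops(intensity):
    """Roofline bound: ops/s achievable at a given ops-per-byte intensity."""
    return min(PEAK_OPS, intensity * BANDWIDTH)


# Operational intensity at which the chip stops being bandwidth bound.
RIDGE = PEAK_OPS / BANDWIDTH

print(f"peak: {PEAK_OPS / 1e12:.1f} TOPS")
print(f"ridge point: {RIDGE:.0f} ops per byte")
print(f"at 100 ops/byte: {attainable_ops(100) / 1e12:.1f} TOPS")
```

Under these assumptions a workload needs roughly 2,700 operations per byte fetched from DDR3 to reach peak, so any model that reuses each weight fewer times than that is limited by memory bandwidth rather than by the MAC array.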
Announced at Google I/O in May 2017, TPU v2 was the first generation to support both training and inference. The shift to training required floating-point arithmetic, and TPU v2 introduced support for bfloat16 and fp32 computation, delivering 45 TFLOPS of peak bf16 performance. Memory was upgraded to 16 GB of HBM with 600 GB/s bandwidth.
Critically, TPU v2 also introduced the Inter-Chip Interconnect (ICI), a custom high-speed link that connected TPU chips directly to their neighbors in a 2D torus topology. This enabled the creation of TPU Pods, clusters of up to 256 chips that functioned as a single logical accelerator. A full TPU v2 Pod delivered approximately 11.5 petaFLOPS of peak throughput. TPU v2 was the first generation offered through the Cloud TPU service.
Announced at Google I/O in May 2018, TPU v3 raised per-chip performance to 123 TFLOPS of bf16 compute (about 2.7 times TPU v2's 45 TFLOPS) and doubled HBM capacity to 32 GB per chip with 900 GB/s bandwidth. The increased power density required Google to introduce liquid cooling for the first time in its TPU hardware, replacing the air cooling used in previous generations.
TPU v3 Pods scaled to 1,024 chips using the same 2D torus ICI topology as v2 but with higher per-link bandwidth, delivering over 100 petaFLOPS per pod. Notable models trained on TPU v3 include BERT, which was trained on a TPU v3 Pod in just four days, and AlphaFold 2, which DeepMind trained on 128 TPU v3 cores with convergence taking roughly two weeks.
TPU v4, announced at Google I/O in May 2021, represented a major architectural leap. It moved to a 7 nm process node, delivered 275 TFLOPS of bf16 performance, and maintained 32 GB of HBM2e with 1,200 GB/s bandwidth. Mean chip power consumption was approximately 200 watts.
The most significant innovation in TPU v4 was the introduction of Optical Circuit Switches (OCS) in the interconnect fabric. While previous generations used fixed 2D torus topologies, TPU v4 adopted a 3D torus and added reconfigurable optical switches that could dynamically reroute interconnect links. This made the network topology programmable: if a chip or link failed, the OCS could reconfigure around the fault, improving availability and utilization. The OCS components accounted for less than 5% of total system cost and power.
TPU v4 Pods connected up to 4,096 chips, delivering exascale-class ML performance. A published study showed that PaLM 540B was trained across two TPU v4 Pods (6,144 chips total) over 56 days, sustaining approximately 60% of peak FLOPS utilization, a high figure for large-scale distributed training. TPU v4 was also the subject of a detailed paper published at ISCA 2023, "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings."
Announced in August 2023, TPU v5e was designed as the cost-efficient variant in the fifth generation, optimized for the best price-performance ratio rather than raw peak performance. Each v5e chip contains a single TensorCore with four MXUs, delivering 197 TFLOPS (bf16) or 393 TOPS (int8). HBM capacity was 16 GB per chip with 819 GB/s bandwidth.
Google deliberately reduced the core count, memory, and clock speed of v5e compared to v5p to hit aggressive power and cost targets. The chip uses a 2D torus ICI topology and scales to 256-chip Pods. TPU v5e was positioned as 2.3 times better in price-performance than TPU v4, making it particularly attractive for inference workloads and training of models with up to 200 billion parameters.
Announced in December 2023 alongside the AI Hypercomputer initiative, TPU v5p was the performance-focused variant. Each chip delivers 459 TFLOPS (bf16) or 918 TOPS (int8), about 1.7 times the bf16 throughput of TPU v4 (275 TFLOPS), with 95 GB of HBM3 and 2,765 GB/s bandwidth (roughly triple the HBM capacity of v4).
TPU v5p Pods scale to 8,960 chips connected via a 3D torus ICI at 4,800 Gbps per chip, making them the largest TPU Pods at that time. Google reported that TPU v5p trains large language models 2.8 times faster than TPU v4, and its second-generation SparseCores train embedding-dense models 1.9 times faster. At 459 TFLOPS per chip, a full 8,960-chip TPU v5p Pod delivers approximately 4.1 exaFLOPS of peak bf16 compute.
The sixth-generation TPU, codenamed Trillium, was announced in May 2024 and reached general availability in December 2024. Trillium marked a significant architectural shift by expanding the MXU from 128 by 128 to 256 by 256 multiply-accumulators and increasing the clock speed. This combination delivers 918 TFLOPS of bf16 performance per chip (1,836 TOPS at INT8), a 4.7x improvement over TPU v5e.
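HBM capacity doubled to 32 GB of HBM3 per chip with 1,640 GB/s bandwidth, and ICI bandwidth also doubled compared to v5e. Trillium introduced the third-generation SparseCore and is over 67% more energy-efficient than TPU v5e. Pods scale to 256 chips, which works out to about 235 petaFLOPS of peak bf16 compute per pod (256 × 918 TFLOPS). Google also reported that a single Trillium cluster can deliver 91 exaFLOPS of aggregate compute; at 918 TFLOPS per chip that figure implies roughly 100,000 chips, so it describes many pods linked together rather than one pod.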
Announced at Google Cloud Next '25 in April 2025 and reaching general availability in November 2025, Ironwood is Google's seventh-generation and most powerful TPU to date. It is described as the first TPU designed specifically for the "age of inference," reflecting the growing importance of serving large models at scale.
Each Ironwood chip is composed of two chiplets, with each chiplet containing one TensorCore, two SparseCores, and 96 GB of HBM3E, for a total of 192 GB per chip (a 6x increase over Trillium). Per-chip performance reaches 4,614 FP8 TFLOPS, about 5 times Trillium's 918 bf16 TFLOPS and 10 times TPU v5p's 459 bf16 TFLOPS, although these comparisons mix FP8 and bf16 precision. HBM bandwidth is approximately 7.37 TB/s per chip, and ICI bandwidth reaches 1.2 TB/s bidirectional.
Ironwood scales to 9,216-chip clusters delivering 42.5 exaFLOPS of aggregate compute, which Google noted exceeds the performance of the world's largest publicly benchmarked supercomputer. Power efficiency is 2 times better than Trillium and nearly 30 times better than the original Cloud TPU v2 from 2018. Early adopters include Anthropic, which announced plans to use up to one million TPUs for scaling its Claude models, with multi-gigawatt commitments stretching through 2026 and beyond.
A TPU Pod is a cluster of TPU chips connected by Google's proprietary Inter-Chip Interconnect (ICI), a custom high-speed network that allows the chips to communicate directly without going through a host CPU or external network switch. Pods function as a single, large accelerator for distributed training and inference workloads.
The interconnect topology has evolved over generations:
| Topology | Generations | Description |
|---|---|---|
| 2D torus | TPU v2, v3, v5e, v6e | Each chip connects to four neighbors (up, down, left, right) in a wraparound grid |
| 3D torus | TPU v4, v5p | Each chip connects to six neighbors along three axes, reducing network diameter |
| 3D torus + OCS | TPU v4, v5p | Optical circuit switches enable dynamic reconfiguration of links |
| 3D mesh + OCS | TPU v7 | Ironwood expands the optical mesh to 9,216 chips per pod |
In a torus topology, the wraparound connections reduce the maximum number of hops between any two chips. For a torus with N chips along a dimension, the maximum distance along that dimension is roughly N/2 hops rather than the N - 1 hops of a mesh without wraparound, which substantially lowers worst-case communication latency for collective operations such as all-reduce. Users on TPU v4 and later can request a twisted torus shape if a particular model layout benefits from it.
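The effect of wraparound links on hop count is easy to compute. The sketch below compares a torus with a plain mesh of the same shape; the function names are hypothetical, and the 16 x 16 x 16 arrangement is used only as an illustrative way to lay out 4,096 chips, not as a description of how any real pod is wired.

```python
def torus_hops(src, dst, dims):
    """Minimum hops between two chips when every dimension wraps around."""
    total = 0
    for s, d, size in zip(src, dst, dims):
        direct = abs(s - d)
        total += min(direct, size - direct)  # go whichever way is shorter
    return total


def torus_diameter(dims):
    """Worst-case hop count in a torus: half of each dimension, summed."""
    return sum(size // 2 for size in dims)


def mesh_diameter(dims):
    """Worst-case hop count without wraparound: corner to opposite corner."""
    return sum(size - 1 for size in dims)


dims = (16, 16, 16)  # illustrative 4,096-chip layout
print(torus_diameter(dims), mesh_diameter(dims))
print(torus_hops((0, 0, 0), (15, 8, 1), dims))
```

For this shape the torus diameter is 24 hops against 45 for the mesh, which is the roughly N/2 versus N - 1 relationship applied to each of the three dimensions.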
The Optical Circuit Switch (OCS) technology introduced in TPU v4 was a major innovation. OCSes use optical fiber and small mirrors to physically reconfigure which TPU chips are connected, without converting signals to electrical form. This enables:
For workloads that require more chips than a single Pod can provide, Google connects multiple Pods via its data center network (DCN). While DCN bandwidth is lower than ICI, careful placement and communication scheduling can still enable efficient multi-Pod training. The PaLM 540B model, for example, was trained across two TPU v4 Pods connected via DCN.
XLA (Accelerated Linear Algebra) is the open-source compiler that translates high-level ML framework operations into optimized machine code for TPUs. XLA takes a computation graph (a directed acyclic graph of tensor operations), fuses operations to reduce memory traffic, picks layouts, schedules collectives across the ICI, tiles computations to fit in on-chip VMEM, and schedules data movement to keep the MXUs maximally utilized.
XLA is the primary compilation path for both JAX and TensorFlow on TPUs. It is also available for PyTorch through the PyTorch/XLA project. In 2023, Google open-sourced XLA as part of the OpenXLA initiative, making the compiler available as a standalone project that supports multiple hardware backends including TPUs, GPUs, and CPUs.
PJRT (Pretty Just-in-time Runtime) is a hardware-agnostic and framework-agnostic runtime interface that sits between ML frameworks and the XLA compiler. PJRT provides a uniform API for dispatching computations to different accelerators, abstracting away the details of each hardware platform. It is the primary runtime interface for TensorFlow and JAX on TPUs, and is fully supported for PyTorch as well.
TPUs are supported by the major ML frameworks:
| Framework | TPU Support Mechanism | Notes |
|---|---|---|
| JAX | Native (XLA-based from inception) | The primary framework for TPU development at Google and the dominant choice for new research at Google DeepMind and outside labs using Cloud TPUs |
| TensorFlow | Native (XLA compilation) | Long-standing TPU support; the original TPU front end. TPU v7 (Ironwood) does not support TensorFlow per Google's documentation |
| PyTorch | PyTorch/XLA library | Translates PyTorch's eager-mode operations into XLA HLO graphs; actively maintained by Google. Google introduced TorchTPU in 2025 to give native PyTorch performance on TPUs |
| Keras | Supported via TF or JAX backends | Used in many tutorials and Kaggle notebooks |
JAX, developed by Google, has become the preferred framework for TPU workloads because its functional, pure-function design aligns naturally with XLA's compilation model. JAX's jit, vmap, pmap, and shard_map transformations map cleanly to TPU Pod topologies, making it straightforward to write programs that scale across thousands of chips. The trade-off of the compiler-first model is real: anything that does not fit XLA's static-shape, fused-graph assumption (dynamic shapes, data-dependent control flow, heavy Python in the inner loop) usually runs slowly on TPUs without rewrites.
In 2025, the popular open-source LLM serving framework vLLM added a unified TPU backend supporting both PyTorch and JAX. This allows users to serve large language models on TPUs using the same vLLM APIs they use on GPUs, lowering the barrier for organizations migrating inference workloads to TPU hardware.
TPUs have been used to train many of the most influential AI models of the past decade. The following table highlights key examples:
| Model | Year | TPU Generation | Scale | Significance |
|---|---|---|---|---|
| AlphaGo | 2016 | TPU v1 | Inference only | Defeated world Go champion Lee Sedol; first major public demonstration of TPU capabilities |
| Google Translate (NMT) | 2016 | TPU v1 | Production inference | Inference acceleration for the Neural Machine Translation system |
| AlphaZero | 2017 | TPU v1, v2 | Self-play RL | Self-play reinforcement learning for Go, chess, and shogi |
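| Transformer | 2017 | n/a (original paper used 8 Nvidia P100 GPUs) | Research scale | The "Attention Is All You Need" architecture was developed at Google; the original models were trained on GPUs, while later Transformer-based models such as BERT and T5 were trained on TPUs |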
| BERT | 2018 | TPU v3 Pod | 16 Cloud TPU devices (64 chips) | Revolutionized NLP; trained in 4 days on a TPU v3 Pod |
| T5 | 2019 | TPU v3 | 1,024 chips | Text-to-Text Transfer Transformer; explored scaling laws for language models |
| AlphaFold 2 | 2020 | TPU v3 | 128 cores | Solved the protein structure prediction problem; won CASP14; convergence took roughly two weeks |
| MUM and LaMDA | 2021 | TPU v3, v4 | Internal Google scale | Internal Google language models; LaMDA powered early Google Bard |
| PaLM | 2022 | TPU v4 | 6,144 chips (2 Pods) | 540B parameter model trained over 56 days, sustaining ~60% of peak FLOPs |
| Gemini 1.0 / 1.5 | 2023, 2024 | TPU v4, v5 | Large-scale Pods | Google's flagship multimodal model family |
| Gemma | 2024 | TPU v5e | n/a | Open-weights model family released for the community |
| Claude (Anthropic) | 2024 onward | TPU v5, v6, v7 | Multi-gigawatt | Anthropic disclosed multi-gigawatt TPU commitments through 2026 and beyond |
| Gemini 3 | 2025 | TPU v6, v7 | Multiple pods | Used Trillium for training and Ironwood for serving |
Beyond Google's own models, external researchers and companies have used Cloud TPUs to train large models, facilitated by the TPU Research Cloud program and Cloud TPU's pay-as-you-go pricing. Customers such as Salesforce, Hugging Face, and Snap run TPU workloads through Google Cloud, and the chips have become a fixture of Kaggle competitions, where Google offers free TPU time to participants.
TPUs and GPUs take fundamentally different approaches to accelerating computation:
| Aspect | TPU | GPU (Nvidia) |
|---|---|---|
| Design philosophy | Domain-specific (ML only) | General-purpose parallel compute, graphics, HPC |
| Core compute unit | Systolic array (MXU) | SIMT cores plus tensor cores |
| Programming model | XLA graph compilation | CUDA / cuDNN / cuBLAS |
| Precision support | bf16, fp32, int8, FP8 (Ironwood) | fp16, bf16, fp32, FP8, FP4, int8, fp64 |
| Memory model | Software-managed on-chip buffers, HBM | Hardware caches, HBM or GDDR |
| Pod-scale interconnect | Proprietary ICI plus optical circuit switching | NVLink + NVSwitch + InfiniBand or Ethernet |
| Software ecosystem | JAX, TensorFlow, PyTorch/XLA | CUDA ecosystem (broad library coverage) |
| Procurement | Google Cloud only (rental) | Sold by Nvidia, AMD, Intel; available from many clouds and on-prem |
In practice the choice usually comes down to software compatibility and supply. CUDA's depth makes GPUs the default for researchers who want maximum library coverage; TPUs win on certain large training jobs where a JAX or TensorFlow model fits the systolic array cleanly and where Google's pod scale and OCS keep utilization high.
Benchmarks and real-world deployments have shown that TPUs and GPUs trade advantages depending on the workload:
TPU strengths:
TPU limitations:
Google has been a regular submitter to MLCommons' MLPerf training and inference rounds. In MLPerf Training v4.1 (late 2024), Google reported that Trillium delivered up to 1.8 times better performance per dollar than TPU v5p on dense LLM training, and that scaling efficiency hit 99% on the GPT-3 175B benchmark when going from a single pod to thousands of chips. On the same benchmark with 2,048 chips, Trillium completed training about two minutes faster than v5p's 29.6 minutes.
MLPerf results should be read with caution. The benchmark suite covers a fixed list of model architectures (BERT-large, GPT-3 175B pretraining, Llama 2 70B fine-tuning, Stable Diffusion, recommendation, object detection, graph node classification) and lets vendors tune software stacks aggressively. The results are still the most public and most reproducible head-to-head comparison between TPU and GPU systems.
TPUs are not sold as standalone chips. The only commercial path is Google Cloud, where TPUs are rented as VMs. Cloud TPU pricing follows a per-chip-hour model, with rates varying by TPU generation, region, and commitment level:
| TPU Generation | On-Demand (approx.) | Committed Use (3-year) |
|---|---|---|
| TPU v5e | ~$1.20/chip/hour | Discounted (varies) |
| TPU v6e (Trillium) | ~$1.38/chip/hour | As low as ~$0.39/chip/hour |
Google also offers spot (preemptible) pricing at significant discounts for workloads that can tolerate interruptions, such as research experiments and non-time-critical training runs. Customers who want guaranteed long-term capacity sign multi-year reservation contracts; Anthropic's 2025 deal for over a gigawatt of TPU capacity is a recent example.
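Per-chip-hour pricing makes job cost a simple product of chips, hours, and rate. The helper below is a rough sketch that uses the approximate Trillium rates from the table above; the function name `job_cost` is hypothetical, and real bills depend on region, commitment terms, and current list prices.

```python
def job_cost(chips, hours, rate_per_chip_hour):
    """Estimated cost in dollars of running `chips` TPU chips for `hours`."""
    return chips * hours * rate_per_chip_hour


# One day on a full 256-chip Trillium pod at the approximate rates above.
on_demand = job_cost(256, 24, 1.38)
committed = job_cost(256, 24, 0.39)
print(f"on-demand: ${on_demand:,.2f}")
print(f"3-year committed: ${committed:,.2f}")
```

At these approximate rates the same day of pod time costs roughly 3.5 times more on demand than under the lowest committed-use price, which is why long-running training programs tend to favor reservations.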
The TPU Research Cloud is a program that provides free Cloud TPU access to academic researchers and open-source developers. Accepted participants receive temporary quota for Cloud TPUs at no charge, with access to TPU v4 and newer generations. In exchange, researchers are expected to share their work publicly through publications, code, or blog posts. The TRC has supported research in areas ranging from natural language processing to protein structure prediction and climate modeling. For lighter workloads Google also offers TPU access through Colab.
Cloud TPUs are available in select Google Cloud regions, mostly in North America, Europe, and parts of Asia, with availability varying by generation. TPU v4 and v5e are available in the broadest set of regions, while newer generations like v6e and v7 are initially offered in a smaller number of locations before expanding over time.
In addition to its data center TPUs, Google developed the Edge TPU, a small ASIC designed for running ML inference on edge devices with tight power and size constraints. Announced in 2018, the Edge TPU is marketed under the Google Coral brand and is a separate product line from the data center TPUs.
The Edge TPU delivers 4 trillion operations per second (4 TOPS) of int8 inference performance while consuming roughly 0.5 watts per TOPS (about 2 watts total), yielding an efficiency of 2 TOPS per watt. It can execute mobile computer vision models such as MobileNet V2 at nearly 400 frames per second. The chip supports convolutional neural networks, specifically deep feed-forward architectures compiled with the Edge TPU compiler. It supports TensorFlow Lite models only and is restricted to 8-bit integer arithmetic.
Google Coral offers the Edge TPU in several form factors:
| Product | Description |
|---|---|
| Coral USB Accelerator | USB dongle that adds Edge TPU inference to any Linux computer (including Raspberry Pi) |
| Coral Dev Board | Single-board computer with an on-board Edge TPU for prototyping |
| Coral M.2 / Mini PCIe Module | M.2 or mini PCIe cards for integration into custom hardware designs |
| Coral System-on-Module (SoM) | Production-ready module for embedded and IoT products |
Google's Pixel phones include a related but distinct chip family branded "Pixel Neural Core" or "Tensor," designed in collaboration with Google Silicon. These mobile SoCs share lineage with TPU work but are not the same silicon as the Coral Edge TPU.
The Edge TPU and Coral platform are used in applications that require real-time, on-device ML inference without cloud connectivity:
By processing data locally, the Edge TPU eliminates network latency, reduces bandwidth usage, and keeps sensitive data on the device for improved privacy.
| Paper | Year | Venue | Topic |
|---|---|---|---|
| Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" | 2017 | ISCA | First public deep dive into TPU v1 hardware and workloads |
| Jouppi et al., "A Domain-Specific Supercomputer for Training Deep Neural Networks" | 2020 | Comm. ACM / IEEE Micro | Architecture of TPU v2 and v3 |
| Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" | 2023 | ISCA | TPU v4 system design, OCS, SparseCore |
| Various Google Cloud blog posts | 2023 to 2025 | n/a | Per-generation announcements with peak FLOPs and pod sizes |
The TPU played a pivotal role in demonstrating that purpose-built hardware for machine learning could deliver order-of-magnitude improvements over general-purpose processors. Before the TPU, the ML hardware landscape was dominated by Nvidia GPUs repurposed from their original graphics rendering role. Google's success with TPUs inspired a wave of custom AI chip development across the industry, including efforts from Apple (Neural Engine), Amazon (Inferentia, Trainium), Microsoft (Maia), Meta (MTIA), Tesla (Dojo), and numerous startups such as Cerebras, Graphcore, SambaNova, and Groq.
Because TPUs are tightly integrated with Google's research infrastructure, they have directly enabled many of the field's most important breakthroughs. BERT, the T5 framework, AlphaFold 2, PaLM, and Gemini were all trained on TPU hardware. The availability of large-scale TPU Pods has allowed Google researchers to explore scaling laws and train models at sizes that would be prohibitively expensive on commercially available hardware.
Cloud TPU has also influenced how organizations think about ML infrastructure. By offering TPUs as a cloud service with per-hour pricing, Google created a model where companies can access specialized AI hardware without capital expenditure on physical chips. This approach, combined with competitive pricing, has positioned Google Cloud as a credible alternative to Nvidia-centric infrastructure for large-scale ML workloads.
A Tensor Processing Unit, or TPU, is a special kind of computer chip made by Google. It is designed to help computers learn faster and be better at understanding things like pictures, sounds, and words. Regular computer chips (CPUs) are good at doing lots of different kinds of tasks, but they are slow at the specific math that AI needs. GPUs are faster at that math, but TPUs are built to do only that math, so they are even faster and use less electricity.
Think of it like kitchen tools. A CPU is like a Swiss Army knife: it can do many things, but none of them perfectly. A GPU is like a good chef's knife: great for chopping, decent for other tasks. A TPU is like a specialized pasta machine: it does one thing (make pasta) really, really well, and much faster than trying to do it by hand with a knife.
Google has made several versions of TPUs, each one faster and more capable than the one before. A bunch of TPUs wired together into a "pod" act like one giant brain. People use these pods to make computers do amazing things, like understand different languages, recognize pictures, predict how proteins fold, and chat with humans through assistants like Gemini and Claude.