See also: Machine learning terms, GPU, Deep learning
A Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google specifically for accelerating machine learning workloads. Unlike general-purpose processors such as CPUs or even GPUs, TPUs are built from the ground up to handle the matrix multiplication and tensor operations that form the backbone of deep learning algorithms. By optimizing for these operations and trading away the flexibility of general-purpose hardware, TPUs achieve significantly higher throughput and better energy efficiency for neural network training and inference.
Google first deployed TPUs internally in its data centers in 2015 and publicly announced the chip at Google I/O in May 2016. Since then, the company has released seven generations of the hardware, each bringing substantial improvements in compute performance, memory capacity, interconnect bandwidth, and energy efficiency. TPUs power many of Google's most prominent AI services, including Google Search, Google Translate, Google Photos, YouTube recommendations, and flagship models like BERT, PaLM, and Gemini. Through Google Cloud, TPUs are also available to external researchers and enterprises.
The TPU project began inside Google around 2013, driven by a projected surge in computational demand from neural network inference across the company's services. The team was led by Norman Jouppi, a distinguished hardware engineer who had previously contributed to MIPS processor design and HP's memory systems research. Google's internal analysis suggested that if every user made just three minutes of voice queries per day using neural network-based speech recognition, the company would need to double its data center compute capacity. Building a custom ASIC tuned specifically for neural network math offered a more practical path than buying vast quantities of commodity CPUs or GPUs.
The first TPU was designed, verified, and built in just 15 months, an unusually fast timeline for a custom chip. Google began deploying TPU v1 in its data centers in 2015, using it to accelerate inference for services such as Google Search RankBrain, Google Street View text processing, and the AlphaGo system that defeated world champion Lee Sedol in March 2016.
The foundational paper describing the TPU, "In-Datacenter Performance Analysis of a Tensor Processing Unit," was authored by Jouppi and colleagues and presented at the 44th International Symposium on Computer Architecture (ISCA) in June 2017. The paper demonstrated that the TPU delivered 15 to 30 times higher performance and 30 to 80 times better performance per watt than contemporary CPUs and GPUs for neural network inference workloads. This publication established the TPU as a landmark in domain-specific accelerator design and helped popularize the concept of custom AI chips across the industry.
Google made TPUs available to external users through its Cloud TPU service starting in 2018. The company also launched the TPU Research Cloud (TRC) program, which provides free access to Cloud TPUs for academic researchers. The TRC program grants accepted applicants access to a cluster of over 1,000 Cloud TPU devices, with the expectation that participants share their findings through publications, open-source code, or blog posts.
The central computational engine inside every TPU is the Matrix Multiply Unit (MXU), which is built on a systolic array architecture. A systolic array is a grid of interconnected processing elements (PEs) where data flows rhythmically between neighbors, much like a heartbeat (hence the name "systolic," borrowed from the medical term for cardiac contraction). Each PE performs a small multiply-and-accumulate (MAC) operation and passes partial results to the next PE. This design minimizes data movement and maximizes parallelism, since thousands of multiplications happen simultaneously without each one needing to independently fetch data from memory.
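As a rough illustration of this dataflow, the Python/NumPy sketch below simulates a weight-stationary, systolic-style matrix multiply. It captures the multiply-accumulate-and-pass structure described above but none of the cycle-level timing or physical layout of a real MXU; the function name and dimensions are illustrative only.

```python
import numpy as np

def systolic_matmul(a, b):
    """Untimed sketch of a weight-stationary, systolic-style matrix multiply.

    Conceptually, the PE at grid position (k, j) holds the stationary weight
    b[k, j]; activations stream through the array one row of `a` at a time,
    and partial sums grow by one multiply-accumulate at each PE row they pass.
    """
    m, k_dim = a.shape
    k_dim2, n = b.shape
    assert k_dim == k_dim2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.float32)
    for i in range(m):                      # stream one activation row at a time
        partial = np.zeros(n, dtype=np.float32)
        for k in range(k_dim):              # PE row k multiplies the incoming activation
            partial += a[i, k] * b[k, :]    # by its stationary weights and accumulates
        out[i, :] = partial                 # partial sums exit the bottom of the array
    return out

a = np.random.rand(8, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(systolic_matmul(a, b), a @ b, atol=1e-3)
```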
In TPU generations prior to v6e, the MXU was arranged as a 128 x 128 systolic array, giving each unit 16,384 multiply-accumulators. Starting with TPU v6e (Trillium), Google expanded the MXU to 256 x 256, quadrupling the number of multiply-accumulators to 65,536 per unit. In the 128 x 128 configuration, each MXU performs one matrix multiply of the form bfloat16[8,128] x bfloat16[128,128], producing an fp32[8,128] result every 8 clock cycles; all multiplications are carried out in bfloat16 precision and all accumulations in full fp32 precision.
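To illustrate the precision convention in software, the minimal JAX sketch below multiplies matrices with the shapes quoted above and asks XLA to accumulate in fp32. It is an illustration of the numerics, not of any particular Google example, and runs on CPU as well as TPU.

```python
import jax
import jax.numpy as jnp

# The per-MXU operation described above: bfloat16 inputs, fp32 accumulation.
a = jnp.ones((8, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)

# preferred_element_type requests float32 accumulation, mirroring the MXU's accumulators.
c = jax.lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.shape, c.dtype)   # (8, 128) float32
```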
Each TPU chip contains one or more TensorCores, which serve as the primary compute units. A TensorCore includes the MXUs, a vector processing unit (VPU) for element-wise operations, a scalar unit, and on-chip memory called VMEM (Vector Memory). The memory hierarchy is designed to keep the MXUs fed with data.
Data flows from HBM into VMEM, and from VMEM into the MXU for computation. The results flow back out through the same path. Efficient use of this memory hierarchy is critical for achieving high utilization of the MXU, and the XLA compiler (discussed below) is responsible for orchestrating data movement to keep the systolic array busy.
Starting with TPU v4, Google introduced the SparseCore, a dedicated accelerator for processing sparse computations, particularly the large embedding table lookups common in recommendation systems and ranking models. Embedding tables are a key component of models used by services like YouTube, Google Ads, and Google Search. Standard dense matrix hardware handles these irregular, memory-bound lookups inefficiently, so the SparseCore provides a dataflow processor optimized specifically for this pattern.
The SparseCore uses only about 5% of the total die area and power budget but delivers 5 to 7 times faster embedding lookups compared to running them on the MXU. TPU v5p includes second-generation SparseCores, and TPU v6e introduced the third generation. TPU v7 (Ironwood) contains four SparseCores per chip.
Google Brain developed the bfloat16 (Brain Floating Point 16) number format specifically for use in TPUs and deep learning workloads. Bfloat16 is a 16-bit floating-point format consisting of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Unlike the IEEE 754 half-precision (fp16) format, which allocates 5 bits to the exponent and 10 to the mantissa, bfloat16 preserves the same exponent range as standard 32-bit floats (fp32) while reducing the mantissa precision.
The rationale behind this design is that neural networks are far more sensitive to the dynamic range of values (governed by the exponent) than to precision (governed by the mantissa). By maintaining the full fp32 exponent range, bfloat16 avoids the overflow and underflow issues that can plague fp16 training, while still cutting memory usage and bandwidth requirements in half compared to fp32. The bfloat16 format has since been adopted widely beyond TPUs, including by Nvidia GPUs, Intel Xeon processors, and AMD Instinct accelerators.
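To make the layout concrete, the small NumPy sketch below converts float32 values to the bfloat16 bit pattern by keeping only the top 16 bits (with round-to-nearest-even) and expands them back. It is an illustration of the format, not production conversion code, and the helper names are made up for this example.

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Truncate float32 values to their bfloat16 bit pattern (round-to-nearest-even).

    bfloat16 is simply the top 16 bits of an IEEE-754 float32:
    1 sign bit, 8 exponent bits, 7 mantissa bits -- the same dynamic range as fp32.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> 16) & 1)   # round-to-nearest-even bias
    return ((bits + bias) >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(b):
    """Re-expand stored bfloat16 bits to float32 by appending 16 zero bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159265, 1e-38, 1e38], dtype=np.float32)
roundtrip = bfloat16_bits_to_float32(float32_to_bfloat16_bits(x))
print(roundtrip)   # same magnitudes as x, but only ~3 decimal digits of precision
```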
Google has released seven generations of data center TPUs, each with significant improvements over its predecessor. The table below summarizes the key specifications:
| Generation | Year | Process | Clock | Peak Compute (per chip) | HBM Capacity | HBM Bandwidth | TDP | Max Pod Size | Interconnect |
|---|---|---|---|---|---|---|---|---|---|
| TPU v1 | 2015 | 28 nm | 700 MHz | 92 TOPS (int8) | 8 GB DDR3 | 34 GB/s | 75 W | 1 chip | N/A |
| TPU v2 | 2017 | 16 nm | 700 MHz | 45 TFLOPS | 16 GB HBM | 600 GB/s | 280 W | 256 chips | 2D torus ICI |
| TPU v3 | 2018 | 16 nm | 940 MHz | 123 TFLOPS | 32 GB HBM | 900 GB/s | 220 W | 1,024 chips | 2D torus ICI |
| TPU v4 | 2021 | 7 nm | 1,050 MHz | 275 TFLOPS | 32 GB HBM2e | 1,200 GB/s | ~200 W | 4,096 chips | 3D torus ICI + OCS |
| TPU v5e | 2023 | N/A | N/A | 197 TFLOPS | 16 GB HBM | 819 GB/s | N/A | 256 chips | 2D torus ICI |
| TPU v5p | 2023 | N/A | 1,750 MHz | 459 TFLOPS | 95 GB HBM | 2,765 GB/s | N/A | 8,960 chips | 3D torus ICI |
| TPU v6e (Trillium) | 2024 | N/A | N/A | 918 TFLOPS | 32 GB HBM | 1,640 GB/s | ~300 W | 256 chips | ICI |
| TPU v7 (Ironwood) | 2025 | N/A | N/A | 4,614 TFLOPS (FP8) | 192 GB HBM | 7,370 GB/s | N/A | 9,216 chips | ICI (1.2 TB/s bidir.) |
The first-generation TPU was designed exclusively for inference. It featured a 256 x 256 systolic array capable of 92 trillion 8-bit integer operations per second (92 TOPS). The chip used 28 nm process technology, ran at 700 MHz, and consumed just 75 watts. It was packaged to fit into existing hard drive bays in Google's servers, requiring no modifications to the data center infrastructure.
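The headline figure follows directly from the array size and clock rate: counting a multiply and an add as two operations, the quick check below reproduces it.

```python
# Back-of-the-envelope check of the 92 TOPS figure for TPU v1.
macs = 256 * 256                 # multiply-accumulators in the v1 systolic array
ops_per_sec = macs * 2 * 700e6   # 2 ops (multiply + add) per MAC per cycle at 700 MHz
print(ops_per_sec / 1e12)        # ~91.8 trillion int8 ops/s, i.e. ~92 TOPS
```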
TPU v1 used 8 GB of DDR3 memory with 34 GB/s of bandwidth, making it memory-bandwidth-limited for many workloads. Despite this constraint, the chip achieved 15 to 30 times better performance than contemporary Intel Haswell CPUs and Nvidia K80 GPUs on neural network inference benchmarks, as documented in the ISCA 2017 paper.
Announced at Google I/O in May 2017, TPU v2 was the first generation to support both training and inference. The shift to training required floating-point arithmetic, and TPU v2 introduced support for bfloat16 and fp32 computation, delivering 45 TFLOPS of peak bf16 performance. Memory was upgraded to 16 GB of HBM with 600 GB/s bandwidth.
Critically, TPU v2 also introduced the Inter-Chip Interconnect (ICI), a custom high-speed link that connected TPU chips directly to their neighbors in a 2D torus topology. This enabled the creation of TPU Pods, clusters of up to 256 chips that functioned as a single logical accelerator. A full TPU v2 Pod delivered approximately 11.5 petaFLOPS of peak throughput. TPU v2 was the first generation offered through the Cloud TPU service.
Announced at Google I/O in May 2018, TPU v3 doubled per-chip performance to 123 TFLOPS of bf16 compute and doubled HBM capacity to 32 GB per chip with 900 GB/s bandwidth. The increased power density required Google to introduce liquid cooling for the first time in its TPU hardware, replacing the air cooling used in previous generations.
TPU v3 Pods scaled to 1,024 chips using the same 2D torus ICI topology as v2 but with higher per-link bandwidth, delivering over 100 petaFLOPS per pod. Notable models trained on TPU v3 include BERT, which was trained on a TPU v3 Pod in just four days.
TPU v4, announced at Google I/O in May 2021, represented a major architectural leap. It moved to a 7 nm process node, delivered 275 TFLOPS of bf16 performance, and maintained 32 GB of HBM2e with 1,200 GB/s bandwidth. Mean chip power consumption was approximately 200 watts.
The most significant innovation in TPU v4 was the introduction of Optical Circuit Switches (OCS) in the interconnect fabric. While previous generations used fixed 2D torus topologies, TPU v4 adopted a 3D torus and added reconfigurable optical switches that could dynamically reroute interconnect links. This made the network topology programmable: if a chip or link failed, the OCS could reconfigure around the fault, improving availability and utilization. The OCS components accounted for less than 5% of total system cost and power.
TPU v4 Pods connected up to 4,096 chips, delivering exascale-class ML performance. A published study showed that PaLM 540B was trained across two TPU v4 Pods (6,144 chips total) and that training ran at approximately 60% of peak FLOPS utilization, a high figure for large-scale distributed training. TPU v4 was also the subject of a detailed paper published at ISCA 2023, "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings."
TPU v5e was designed as the cost-efficient variant in the fifth generation, optimized for the best price-performance ratio rather than raw peak performance. Each v5e chip contains a single TensorCore with four MXUs, delivering 197 TFLOPS (bf16) or 393 TOPS (int8). HBM capacity was 16 GB per chip with 819 GB/s bandwidth.
Google deliberately reduced the core count, memory, and clock speed of v5e compared to v5p to hit aggressive power and cost targets. The chip uses a 2D torus ICI topology and scales to 256-chip Pods. TPU v5e was positioned as 2.3 times better in price-performance than TPU v4, making it particularly attractive for inference workloads and training of models with up to 200 billion parameters.
Announced in December 2023 alongside the AI Hypercomputer initiative, TPU v5p was the performance-focused variant. Each chip delivers 459 TFLOPS (bf16) or 918 TOPS (int8), more than double the FLOPS of TPU v4, with 95 GB of HBM and 2,765 GB/s bandwidth (triple the HBM of v4).
TPU v5p Pods scale to 8,960 chips connected via a 3D torus ICI at 4,800 Gbps per chip, making them the largest TPU Pods at that time. Google reported that TPU v5p trains large language models 2.8 times faster than TPU v4, and its second-generation SparseCores train embedding-dense models 1.9 times faster. At 459 TFLOPS per chip, a full 8,960-chip TPU v5p Pod delivers roughly 4.1 exaFLOPS of aggregate bf16 compute.
The sixth-generation TPU, codenamed Trillium, reached general availability in late 2024. Trillium marked a significant architectural shift by expanding the MXU from 128 x 128 to 256 x 256 multiply-accumulators and increasing the clock speed. This combination delivers 918 TFLOPS of bf16 performance per chip, a 4.7x improvement over TPU v5e.
HBM capacity doubled to 32 GB per chip with 1,640 GB/s bandwidth, and ICI bandwidth also doubled compared to v5e. Trillium introduced the third-generation SparseCore and is over 67% more energy-efficient than TPU v5e. Pods scale to 256 chips, and Google reported that a single Trillium cluster can deliver 91 exaFLOPS of aggregate compute.
Announced at Google Cloud Next '25, Ironwood is Google's seventh-generation and most powerful TPU to date. It is described as the first TPU designed specifically for the "age of inference," reflecting the growing importance of serving large models at scale.
Each Ironwood chip is composed of two chiplets, with each chiplet containing one TensorCore, two SparseCores, and 96 GB of HBM, for a total of 192 GB per chip (a 6x increase over Trillium). Per-chip performance reaches 4,614 FP8 TFLOPS, more than 4 times Trillium and 10 times TPU v5p. HBM bandwidth is approximately 7.37 TB/s per chip, and ICI bandwidth reaches 1.2 TB/s bidirectional.
Ironwood scales to 9,216-chip clusters delivering 42.5 exaFLOPS of aggregate compute, which Google noted exceeds the performance of the world's largest publicly benchmarked supercomputer. Power efficiency is 2 times better than Trillium and nearly 30 times better than the original Cloud TPU v2 from 2018. Early adopters include Anthropic, which announced plans to use up to one million TPUs for scaling its Claude models.
A TPU Pod is a cluster of TPU chips connected by Google's proprietary Inter-Chip Interconnect (ICI), a custom high-speed network that allows the chips to communicate directly without going through a host CPU or external network switch. Pods function as a single, large accelerator for distributed training and inference workloads.
The interconnect topology has evolved over generations:
| Topology | Generations | Description |
|---|---|---|
| 2D torus | TPU v2, v3, v5e | Each chip connects to four neighbors (up, down, left, right) in a wraparound grid |
| 3D torus | TPU v4, v5p, v7 | Each chip connects to six neighbors along three axes, reducing network diameter |
| 3D torus + OCS | TPU v4 | Optical circuit switches enable dynamic reconfiguration of links |
In a torus topology, the wraparound connections reduce the maximum number of hops between any two chips. For a 3D torus, the maximum distance scales as roughly N/2 per dimension rather than N, which substantially lowers worst-case communication latency for collective operations such as all-reduce.
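For intuition, the small helper below computes the minimum hop count between two chips in a wraparound grid; the slice shape in the example is hypothetical.

```python
def torus_hops(src, dst, dims):
    """Minimum hop count between two chips in a wraparound (torus) grid.

    `src` and `dst` are coordinate tuples and `dims` is the torus size per axis.
    With wraparound links, the per-dimension distance is at most dims[i] // 2,
    versus up to dims[i] - 1 in a plain mesh.
    """
    return sum(min(abs(s - d), n - abs(s - d)) for s, d, n in zip(src, dst, dims))

# Example: opposite corners of a hypothetical 8 x 8 x 8 (512-chip) 3D torus slice.
print(torus_hops((0, 0, 0), (7, 7, 7), (8, 8, 8)))   # 3 hops, versus 21 in a mesh
```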
The Optical Circuit Switch (OCS) technology introduced in TPU v4 was a major innovation. OCSes use optical fiber and small mirrors to physically reconfigure which TPU chips are connected, without converting signals to electrical form. This enables:

- Routing around failed chips or links, improving availability and overall utilization
- Carving out slices of different shapes and sizes from the same physical hardware on demand
- Adjusting the logical torus topology to better match a given workload's communication pattern
For workloads that require more chips than a single Pod can provide, Google connects multiple Pods via its data center network (DCN). While DCN bandwidth is lower than ICI, careful placement and communication scheduling can still enable efficient multi-Pod training. The PaLM 540B model, for example, was trained across two TPU v4 Pods connected via DCN.
XLA (Accelerated Linear Algebra) is the open-source compiler that translates high-level ML framework operations into optimized machine code for TPUs. XLA takes a computation graph (a directed acyclic graph of tensor operations), fuses operations to reduce memory traffic, tiles computations to fit in on-chip VMEM, and schedules data movement to keep the MXUs maximally utilized.
XLA is the primary compilation path for both JAX and TensorFlow on TPUs. It is also available for PyTorch through the PyTorch/XLA project. In 2023, Google open-sourced XLA as part of the OpenXLA initiative, making the compiler available as a standalone project that supports multiple hardware backends including TPUs, GPUs, and CPUs.
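As a minimal sketch of this compilation path, the JAX snippet below defines a small layer and lets XLA fuse and tile it under jit; the function name and shapes are arbitrary, and the same code runs unmodified on CPU, GPU, or TPU backends.

```python
import jax
import jax.numpy as jnp

# A small computation graph: matmul, bias add, GELU activation.
# Under jax.jit, XLA fuses the element-wise ops with the matmul epilogue and
# tiles the computation so intermediates stay in on-chip memory where possible.
@jax.jit
def fused_layer(x, w, b):
    return jax.nn.gelu(x @ w + b)

x = jnp.ones((128, 512), dtype=jnp.bfloat16)
w = jnp.ones((512, 256), dtype=jnp.bfloat16)
b = jnp.zeros((256,), dtype=jnp.bfloat16)
print(fused_layer(x, w, b).shape)   # (128, 256)
```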
PJRT (commonly expanded as "Pretty much Just another RunTime") is a hardware-agnostic and framework-agnostic runtime interface that sits between ML frameworks and the XLA compiler. PJRT provides a uniform API for dispatching computations to different accelerators, abstracting away the details of each hardware platform. It is the primary runtime interface for TensorFlow and JAX on TPUs, and is fully supported for PyTorch as well.
TPUs are supported by the three major ML frameworks:
| Framework | TPU Support Mechanism | Notes |
|---|---|---|
| JAX | Native (XLA-based from inception) | The primary framework for TPU development at Google; designed around XLA's functional programming model |
| TensorFlow | Native (XLA compilation) | Long-standing TPU support; TensorFlow was originally the main framework for TPU usage |
| PyTorch | PyTorch/XLA library | Translates PyTorch's eager-mode operations into XLA graphs; actively maintained by Google |
JAX, developed by Google, has become the preferred framework for TPU workloads because its functional, pure-function design aligns naturally with XLA's compilation model. JAX's jit, vmap, pmap, and shmap transformations map cleanly to TPU Pod topologies, making it straightforward to write programs that scale across thousands of chips.
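A tiny, hedged example of this programming model: the sketch below uses pmap and an all-reduce (psum) to average a per-device value across whatever devices are attached. On a TPU slice the reduction travels over the ICI; the axis name and input values are illustrative.

```python
from functools import partial
import jax
import jax.numpy as jnp

n = jax.local_device_count()   # TPU cores on this host (falls back to 1 on CPU)

@partial(jax.pmap, axis_name="devices")
def parallel_mean(x):
    # Each device holds one shard of x; psum is an all-reduce across the axis.
    return jax.lax.psum(x, axis_name="devices") / n

shards = jnp.arange(n, dtype=jnp.float32)   # one scalar per device
print(parallel_mean(shards))                # every device ends up holding the mean
```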
In 2025, the popular open-source LLM serving framework vLLM added a unified TPU backend supporting both PyTorch and JAX. This allows users to serve large language models on TPUs using the same vLLM APIs they use on GPUs, lowering the barrier for organizations migrating inference workloads to TPU hardware.
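Serving through vLLM looks the same as it does on GPUs. The sketch below assumes a TPU VM with the vLLM TPU backend installed; the model name and sampling settings are illustrative, not recommendations.

```python
# Hedged serving sketch: same vLLM API as on GPUs; assumes the TPU backend is installed.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")   # example model, chosen only for illustration
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what a TPU is in one sentence."], params)
print(outputs[0].outputs[0].text)
```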
TPUs have been used to train many of the most influential AI models of the past decade. The following table highlights key examples:
| Model | Year | TPU Generation | Scale | Significance |
|---|---|---|---|---|
| AlphaGo | 2016 | TPU v1 | Inference only | Defeated world Go champion Lee Sedol; first major public demonstration of TPU capabilities |
| Transformer | 2017 | N/A (trained on GPUs) | Research scale | The "Attention Is All You Need" architecture originated at Google; the original models were trained on GPUs, and the design became the template for nearly every later TPU-trained model |
| BERT | 2018 | TPU v3 Pod | 16 TPU chips | Revolutionized NLP; trained in 4 days on a TPU v3 Pod |
| T5 | 2019 | TPU v3 | 1,024 chips | Text-to-Text Transfer Transformer; explored scaling laws for language models |
| AlphaFold 2 | 2020 | TPU v3 | 128 chips | Solved the protein structure prediction problem; won CASP14 |
| LaMDA | 2021 | TPU v3 | 1,024 chips | Conversational language model that powered early Google Bard |
| PaLM | 2022 | TPU v4 | 6,144 chips (2 Pods) | 540B parameter model; demonstrated scaling to thousands of TPU chips |
| Gemini | 2023 | TPU v4/v5p | Large-scale Pods | Google's flagship multimodal model family |
| Gemma | 2024 | TPU v5e | N/A | Open-weights model family released for the community |
Beyond Google's own models, external researchers and companies have used Cloud TPUs to train large models, facilitated by the TPU Research Cloud program and Cloud TPU's pay-as-you-go pricing.
TPUs and GPUs take fundamentally different approaches to accelerating computation:
| Aspect | TPU | GPU (Nvidia) |
|---|---|---|
| Design philosophy | Domain-specific (ML only) | General-purpose parallel compute |
| Core compute unit | Systolic array (MXU) | CUDA cores + Tensor Cores |
| Programming model | XLA graph compilation | CUDA / cuDNN / cuBLAS |
| Precision support | bf16, fp32, int8, fp8 (v7) | fp16, bf16, fp32, fp8, int8 |
| Memory | HBM (on-package) | HBM (on-package) |
| Interconnect | ICI (proprietary torus) | NVLink + NVSwitch + InfiniBand |
| Availability | Google Cloud only | Purchasable; all major clouds |
| Software ecosystem | JAX, TensorFlow, PyTorch/XLA | CUDA ecosystem (broad support) |
Benchmarks and real-world deployments have shown that TPUs and GPUs trade advantages depending on the workload:
TPU strengths:

- Very high throughput and performance per watt on the dense matrix math that dominates neural network training and large-batch inference
- Strong price-performance, particularly on the cost-optimized "e" variants (Google positioned v5e at 2.3 times the price-performance of v4)
- Pod-scale ICI interconnect that lets thousands of chips behave as a single accelerator without external network switches
- Tight integration with XLA and JAX, which handle compilation and sharding across large slices
TPU limitations:

- Available only through Google Cloud; the chips cannot be purchased and deployed on-premises
- A smaller software ecosystem than Nvidia's CUDA stack, with the best results requiring XLA-friendly, graph-compiled code
- Less flexibility than GPUs for custom kernels, highly dynamic workloads, and non-ML computation
Cloud TPU pricing follows a per-chip-hour model, with rates varying by TPU generation, region, and commitment level:
| TPU Generation | On-Demand (approx.) | Committed Use (3-year) |
|---|---|---|
| TPU v5e | ~$1.20/chip/hour | Discounted (varies) |
| TPU v6e (Trillium) | ~$1.38/chip/hour | As low as ~$0.39/chip/hour |
Google also offers spot (preemptible) pricing at significant discounts for workloads that can tolerate interruptions, such as research experiments and non-time-critical training runs.
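As a back-of-the-envelope illustration of the per-chip-hour model using the approximate on-demand rate above (slice size, duration, and rate are hypothetical and exclude committed-use or spot discounts):

```python
# Rough cost estimate for a hypothetical week-long run on a v6e slice.
chips = 64                     # hypothetical slice size
hours = 24 * 7                 # one week
rate_per_chip_hour = 1.38      # approximate v6e on-demand price in USD (see table above)
print(f"${chips * hours * rate_per_chip_hour:,.0f}")   # ≈ $14,838 before discounts
```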
The TPU Research Cloud is a program that provides free Cloud TPU access to academic researchers and open-source developers. Accepted participants receive temporary quota for Cloud TPUs at no charge, with access to TPU v4 and newer generations. In exchange, researchers are expected to share their work publicly through publications, code, or blog posts. The TRC has supported research in areas ranging from natural language processing to protein structure prediction and climate modeling.
Cloud TPUs are available in select Google Cloud regions, with availability varying by generation. TPU v4 and v5e are available in the broadest set of regions, while newer generations like v6e and v7 are initially offered in a smaller number of locations before expanding over time.
In addition to its data center TPUs, Google developed the Edge TPU, a small ASIC designed for running ML inference on edge devices with tight power and size constraints. The Edge TPU is marketed under the Google Coral brand.
The Edge TPU delivers 4 trillion operations per second (4 TOPS) of int8 inference performance while consuming only 2 watts of power, yielding an efficiency of 2 TOPS per watt. It can execute mobile computer vision models such as MobileNet V2 at nearly 400 frames per second. The chip supports convolutional neural networks, specifically deep feed-forward architectures compiled with the Edge TPU compiler.
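A hedged sketch of what on-device inference looks like, assuming the tflite_runtime package and the Edge TPU runtime (libedgetpu) are installed and the model has already been compiled with the Edge TPU compiler; the model filename is hypothetical.

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Load an Edge TPU-compiled model and attach the Edge TPU delegate.
interpreter = Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",                  # hypothetical file
    experimental_delegates=[load_delegate("libedgetpu.so.1")],
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed a dummy input of the expected shape/dtype and run one inference.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```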
Google Coral offers the Edge TPU in several form factors:
| Product | Description |
|---|---|
| Coral USB Accelerator | USB dongle that adds Edge TPU inference to any Linux computer (including Raspberry Pi) |
| Coral Dev Board | Single-board computer with an on-board Edge TPU for prototyping |
| Coral M.2 / Mini PCIe Module | M.2 or mini PCIe cards for integration into custom hardware designs |
| Coral System-on-Module (SoM) | Production-ready module for embedded and IoT products |
The Edge TPU and Coral platform target applications that require real-time, on-device ML inference without cloud connectivity.
By processing data locally, the Edge TPU eliminates network latency, reduces bandwidth usage, and keeps sensitive data on the device for improved privacy.
The TPU played a pivotal role in demonstrating that purpose-built hardware for machine learning could deliver order-of-magnitude improvements over general-purpose processors. Before the TPU, the ML hardware landscape was dominated by Nvidia GPUs repurposed from their original graphics rendering role. Google's success with TPUs inspired a wave of custom AI chip development across the industry, including efforts from Apple (Neural Engine), Amazon (Inferentia, Trainium), Microsoft (Maia), Meta (MTIA), Tesla (Dojo), and numerous startups such as Cerebras, Graphcore, SambaNova, and Groq.
Because TPUs are tightly integrated with Google's research infrastructure, they have directly enabled many of the field's most important breakthroughs. The Transformer architecture, BERT, the T5 framework, AlphaFold, PaLM, and Gemini were all developed and trained on TPU hardware. The availability of large-scale TPU Pods has allowed Google researchers to explore scaling laws and train models at sizes that would be prohibitively expensive on commercially available hardware.
Cloud TPU has also influenced how organizations think about ML infrastructure. By offering TPUs as a cloud service with per-hour pricing, Google created a model where companies can access specialized AI hardware without capital expenditure on physical chips. This approach, combined with competitive pricing, has positioned Google Cloud as a credible alternative to Nvidia-centric infrastructure for large-scale ML workloads.
A Tensor Processing Unit, or TPU, is a special kind of computer chip made by Google. It is designed to help computers learn faster and be better at understanding things like pictures, sounds, and words. Regular computer chips (CPUs) are good at doing lots of different kinds of tasks, but they are slow at the specific math that AI needs. GPUs are faster at that math, but TPUs are built to do only that math, so they are even faster and use less electricity.
Think of it like kitchen tools. A CPU is like a Swiss Army knife: it can do many things, but none of them perfectly. A GPU is like a good chef's knife: great for chopping, decent for other tasks. A TPU is like a specialized pasta machine: it does one thing (make pasta) really, really well, and much faster than trying to do it by hand with a knife.
Google has made several versions of TPUs, each one faster and more capable than the one before. People use these chips to make computers do amazing things, like understand different languages, recognize pictures, predict how proteins fold, and even play games like a human.