TPU resource
Last reviewed: May 11, 2026
Sources: 15 citations
Review status: Source-backed
Revision: v2 · 2,462 words
See also: Machine learning terms
A TPU, or Tensor Processing Unit, is a family of custom application-specific integrated circuits (ASICs) developed by Google to accelerate machine learning workloads. TPUs target the dense linear algebra at the heart of deep learning, especially the matrix multiplications used in training and serving neural networks. The chips trade general-purpose flexibility for high throughput on tensor math at much better performance-per-watt than the GPUs or CPUs they replace.
Google first deployed TPUs in its own data centers in 2015, then opened them to outside developers through Google Cloud in early 2018. As of 2026, eight generations have been announced. TPUs power most of Google's own large model work, including Gemini and earlier projects such as AlphaGo, and they serve as the training substrate for external labs including Anthropic and Midjourney.
The defining structural choice in a TPU is the systolic array. Instead of a general SIMD or SIMT design, each chip is built around large grids of multiply-accumulate units, called Matrix Multiply Units (MXUs), that pump operands through the grid in lockstep. The original TPU v1 used a single 256 by 256 systolic array of 8-bit multiply-accumulate units. Later generations switched to multiple smaller MXUs per TensorCore (typically four 128 by 128 MXUs in v4 and v5p), trading a small amount of theoretical density for better utilization on real workloads. The MXUs sit beside a vector unit and a scalar unit, together forming a TensorCore. Most modern TPU chips contain two TensorCores per package, although TPU v6e ships with one TensorCore per chip and TPU 8i uses a chiplet design.
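The lockstep flow described above can be sketched at the schedule level in plain Python. This is an illustrative model, not Google's hardware design: it assumes a weight-stationary layout in which cell (k, j) holds one fixed weight, activations enter skewed in time, and partial sums move down each column. The function names and the tiny example matrices are hypothetical.

```python
def systolic_matmul(x, w):
    """Multiply x (M x K) by w (K x N) following a weight-stationary schedule.

    Cell (k, j) of the grid holds the fixed weight w[k][j]. Activation rows
    enter skewed in time, so cell (k, j) sees x[m][k] at cycle t = m + k + j
    and adds its product to the partial sum for output (m, j), which moves
    one row down column j on every cycle.
    """
    M, K, N = len(x), len(w), len(w[0])
    out = [[0] * N for _ in range(M)]
    # The last multiply-accumulate happens at t = (M-1) + (K-1) + (N-1).
    cycles = M + K + N - 2
    for t in range(cycles):
        for k in range(K):
            for j in range(N):
                m = t - k - j  # activation row reaching cell (k, j) this cycle
                if 0 <= m < M:
                    out[m][j] += x[m][k] * w[k][j]
    return out, cycles


def naive_matmul(x, w):
    """Reference result computed the ordinary way."""
    return [[sum(x[m][k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))]
            for m in range(len(x))]


x = [[1, 2, 3], [4, 5, 6]]      # 2 x 3 activations
w = [[1, 0], [0, 1], [2, 2]]    # 3 x 2 weights, one per grid cell
result, cycles = systolic_matmul(x, w)
assert result == naive_matmul(x, w)
print(result, "in", cycles, "cycles")
```

Every cell does at most one multiply-accumulate per cycle, and no cell ever fetches an operand from memory: operands arrive from its neighbors. The pipeline fill and drain cost (K + N - 2 extra cycles here) is amortized when many activation rows stream through the same weights.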
Every generation from v2 onward uses High Bandwidth Memory (HBM) stacked directly on the package. HBM provides far more bandwidth than the DDR3 used on v1, which matters because matrix multiplication is bandwidth bound once the model no longer fits in on-chip SRAM. HBM capacity has grown from 16 GiB in v2 to 192 GB in TPU v7 Ironwood and 288 GB in TPU 8i. On-chip SRAM has grown in parallel, with TPU 8i carrying 384 MB to keep key-value caches for transformer inference resident on silicon.
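A rough roofline calculation shows when a matrix multiplication becomes bandwidth bound. The sketch below uses the v5e per-chip figures quoted in this article (197 bfloat16 teraFLOPS, 819 GB/s of HBM bandwidth) and assumes a simplified cost model in which each operand is read from HBM once and the output is written once, with no reuse from on-chip SRAM. The function names and matrix shapes are illustrative.

```python
def matmul_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumes both inputs are read once and the output is written once,
    with 2-byte (bfloat16) elements by default.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def ridge_point(peak_flops, hbm_bytes_per_second):
    """Arithmetic intensity above which the chip is compute bound."""
    return peak_flops / hbm_bytes_per_second


V5E_PEAK_FLOPS = 197e12      # bfloat16, per chip
V5E_HBM_BANDWIDTH = 819e9    # bytes per second, per chip

ridge = ridge_point(V5E_PEAK_FLOPS, V5E_HBM_BANDWIDTH)
single_row = matmul_intensity(1, 8192, 8192)      # one token against a weight matrix
large_batch = matmul_intensity(4096, 8192, 8192)  # many rows sharing the same weights

for name, value in [("single row", single_row), ("large batch", large_batch)]:
    bound = "compute" if value > ridge else "bandwidth"
    print(f"{name}: {value:.1f} FLOPs/byte, {bound} bound (ridge {ridge:.0f})")
```

Under these assumptions the single-row case performs about one FLOP per byte and sits far below the ridge point of roughly 240, while the large batch reuses each weight thousands of times and clears it comfortably.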
Starting with TPU v4, Google added a SparseCore: a dedicated accelerator for embedding lookups and other gather-scatter workloads that run poorly on dense matrix engines. SparseCores improved through v5p, v6e, and v7, and are useful for ranking, recommendation systems, and mixture-of-experts routing.
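To see why embedding lookups are a poor fit for a dense matrix engine, compare a direct gather with the same lookup expressed as a one-hot matrix multiplication. This is only an illustration of the workload shape, not a description of how SparseCore works internally; the table and helper names are hypothetical.

```python
def embedding_gather(table, ids):
    """Look up rows directly: touches len(ids) * dim values."""
    return [list(table[i]) for i in ids]


def embedding_one_hot_matmul(table, ids):
    """Same lookup as a dense matmul: len(ids) * vocab * dim multiply-adds."""
    vocab, dim = len(table), len(table[0])
    out = []
    for i in ids:
        one_hot = [1 if v == i else 0 for v in range(vocab)]
        out.append([sum(one_hot[v] * table[v][d] for v in range(vocab))
                    for d in range(dim)])
    return out


# Hypothetical 5-row embedding table with 3 features per row.
table = [[10 * v + d for d in range(3)] for v in range(5)]
ids = [4, 0, 4]

gathered = embedding_gather(table, ids)
dense = embedding_one_hot_matmul(table, ids)
assert gathered == dense

values_read = len(ids) * len(table[0])
dense_macs = len(ids) * len(table) * len(table[0])
print(f"gather reads {values_read} values; dense matmul does {dense_macs} multiply-adds")
```

The dense version multiplies by zero for every row that was not requested, and that waste grows linearly with vocabulary size. Real embedding tables have millions of rows, which is the kind of gap a dedicated gather-scatter unit is meant to close.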
TPUs also pioneered the bfloat16 numeric format. Introduced with TPU v2, bfloat16 keeps the same 8-bit exponent as IEEE float32 but truncates the mantissa to 7 bits, so it has roughly the same dynamic range as float32 with half the storage. That property lets neural network training proceed at fp32-like stability while halving memory and compute cost, and bfloat16 is now standard across the industry. Recent generations also support INT8 for quantized inference and FP8 from Ironwood onward.
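The relationship between float32 and bfloat16 can be demonstrated with the standard library alone. The sketch below converts by simply dropping the low 16 bits of the float32 encoding; that truncation is an assumption made for brevity, since real hardware and libraries typically round to nearest even. The helper names are illustrative.

```python
import struct


def float32_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x):
    """Keep sign, 8 exponent bits, and the top 7 mantissa bits (truncation)."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(bits):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


def round_trip(x):
    return from_bfloat16_bits(to_bfloat16_bits(x))


# The exponent field is untouched, so very large values survive.
big = 3.0e38
assert (to_bfloat16_bits(big) >> 7) & 0xFF == (float32_bits(big) >> 23) & 0xFF
print(big, "->", round_trip(big))

# Only 7 mantissa bits remain, so values close to 1.0 collapse onto it.
print(1.001, "->", round_trip(1.001))
```

The first example would overflow IEEE float16, whose largest finite value is 65504, but it round-trips through bfloat16 with under one percent error. The second shows the price: precision near 1.0 is about 2^-7, so 1.001 becomes 1.0.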
The first TPU was an inference-only accelerator built on a 28 nm process at 700 MHz with a 75 watt TDP. It paired a 256 by 256 8-bit systolic array with 8 GiB of DDR3 memory and 34 GB/s of bandwidth, hitting 92 trillion 8-bit operations per second. Google built more than 100,000 of them and used them for Search ranking, translation, Street View, voice recognition, and the AlphaGo match against Lee Sedol in 2016. v1 was announced publicly at Google I/O 2016, more than a year after deployment.
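The 92 trillion operations per second figure follows from the array size and clock rate. The calculation below assumes that every cell completes one multiply-accumulate per cycle and that a multiply-accumulate counts as two operations, which is the usual convention for peak figures.

```python
# TPU v1 figures quoted in the text above.
ARRAY_ROWS = 256
ARRAY_COLS = 256
CLOCK_HZ = 700e6  # 700 MHz

# Assumption: one multiply-accumulate per cell per cycle,
# counted as two operations (one multiply, one add).
OPS_PER_MAC = 2

mac_units = ARRAY_ROWS * ARRAY_COLS
peak_ops_per_second = mac_units * OPS_PER_MAC * CLOCK_HZ
peak_tops = peak_ops_per_second / 1e12

print(f"{mac_units:,} MAC units, about {peak_tops:.2f} TOPS peak")
```

Under those assumptions the 65,536 multiply-accumulate units yield about 91.75 TOPS, which rounds to the quoted 92.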
TPU v2 was the first generation built for training as well as inference. It used a 16 nm process, ran at 700 MHz, drew 280 watts, and carried 16 GiB of HBM with 600 GB/s of bandwidth, delivering about 45 bfloat16 teraFLOPS per chip. Four chips were packaged on a board, and 64 boards (256 chips) formed a pod with roughly 11.5 petaFLOPS of compute. This was the first generation made available externally on Google Cloud, in beta in early 2018.
Announced at Google I/O 2018, TPU v3 stayed on 16 nm but doubled the HBM to 32 GiB at 900 GB/s and added liquid cooling to push performance to 123 bfloat16 teraFLOPS per chip. Pods scaled up to 1,024 chips, delivering on the order of 100 petaFLOPS per pod. v3 was a workhorse for Google's mid-period research, including early BERT and T5 training runs.
TPU v4 moved to a 7 nm process, doubled the MXU count, raised the clock to 1,050 MHz, and reached 275 bfloat16 teraFLOPS per chip in roughly 170 watts. HBM stayed at 32 GiB but bandwidth jumped to 1,200 GB/s. The pod grew to 4,096 chips with a 3D torus topology that reaches 1.1 exaFLOPS per pod. v4 introduced reconfigurable optical circuit switches (OCS) that let Google rewire pod topology in software to route around faults and to fit jobs of varying shapes onto the fabric. A separate inference variant called v4i shipped without liquid cooling. v4 trained many of Google's PaLM family models.
v5e was positioned as the cost-efficient member of the fifth generation. Each chip delivers 197 bfloat16 teraFLOPS and 393 INT8 TOPS with 16 GiB of HBM at 819 GB/s. A v5e pod contains 256 chips arranged in a 2D torus. v5e ran the largest publicly disclosed distributed training run at the time of its announcement: a 50,944-chip job spanning 199 pods.
v5p is the high-performance counterpart to v5e. Each chip provides 459 bfloat16 teraFLOPS and 918 INT8 TOPS, paired with 95 GiB of HBM at 2,765 GB/s. A v5p pod has 8,960 chips connected in a 3D torus with 4,800 Gbps of ICI bandwidth per chip, hitting roughly 4 bfloat16 exaFLOPS per pod. Google trained the original Gemini family on v5p, and the chip was marketed as competitive with Nvidia's H100 on a per-chip basis.
Trillium (TPU v6e) was announced at Google I/O 2024 and became generally available later that year. Each chip delivers 918 bfloat16 teraFLOPS and 1,836 INT8 TOPS, a 4.7 times improvement over v5e. Relative to v5e, HBM doubled to 32 GB at 1,640 GB/s and ICI bandwidth doubled to 800 GB/s bidirectional. The pod is 256 chips in a 2D torus, smaller than v5p's pod, but Multislice over the Jupiter fabric lets users stitch together hundreds of pods. Trillium is roughly 67 percent more energy efficient than v5e and trained Gemini 1.5 Flash, Imagen 3, and Gemma 2.
Ironwood (TPU v7) was unveiled at Google Cloud Next 25 on April 9, 2025 as the first TPU explicitly built for inference of reasoning models. Each chip delivers 4,614 FP8 TFLOPS, carries 192 GB of HBM3E at 7.37 TB/s, and has 9.6 Tb/s of ICI bandwidth. Pods come in 256-chip and 9,216-chip configurations; the full 9,216-chip superpod reaches 42.5 FP8 exaFLOPS. Ironwood doubles performance-per-watt versus Trillium and is nearly 30 times more efficient than the original 2018 Cloud TPU. Each chip contains two TensorCores and four SparseCores. Anthropic announced plans to scale to over one gigawatt of Ironwood capacity for training and serving future Claude models.
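The pod-level figures quoted for each generation are, to a first approximation, the per-chip peak multiplied by the chip count. The short check below uses only numbers stated in this article; it treats peak throughput as perfectly additive, which ignores interconnect and utilization losses.

```python
# (per-chip peak in teraFLOPS, chips per pod), as quoted in the text above.
PODS = {
    "v2": (45, 256),
    "v3": (123, 1024),
    "v4": (275, 4096),
    "v5p": (459, 8960),
    "v7 Ironwood (FP8)": (4614, 9216),
}


def pod_peak_petaflops(tflops_per_chip, chips):
    """Aggregate peak, assuming per-chip peaks simply add up."""
    return tflops_per_chip * chips / 1000


pod_peaks = {name: pod_peak_petaflops(*spec) for name, spec in PODS.items()}

for name, petaflops in pod_peaks.items():
    print(f"{name}: {petaflops:,.1f} petaFLOPS")
```

The products land on the quoted values: about 11.5 petaFLOPS for v2, 1.1 exaFLOPS for v4, roughly 4 exaFLOPS for v5p, and 42.5 exaFLOPS for Ironwood. The v3 product is about 126 petaFLOPS, consistent with the looser "on the order of 100" wording. Note that the Ironwood figure is FP8 while the others are bfloat16, so the generations are not directly comparable on this number alone.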
At Google Cloud Next 26 in April 2026, Google split the eighth generation into two purpose-built chips. TPU 8t (codename Sunfish), designed with Broadcom, targets training. A single 8t superpod scales to 9,600 chips with 2 petabytes of shared HBM and 121 exaFLOPS, roughly three times the per-pod throughput of Ironwood with double the inter-chip bandwidth. TPU 8i (codename Zebrafish), designed with MediaTek, targets inference for agentic AI workloads. It carries 288 GB of HBM, 384 MB of on-chip SRAM (about three times v7), and 19.2 Tb/s of interconnect bandwidth, with up to 1,152 chips per pod connected by a new Boardfly topology. Google quotes 80 percent better performance-per-dollar for 8i over Ironwood. Both chips are slated for general availability later in 2026 and target TSMC's 2 nm process.
A single TPU chip is rarely useful on its own at modern model sizes. Google ties chips together with two layers of network. The first layer, Inter-Chip Interconnect (ICI), runs over short copper links or short-reach optics inside a pod and provides low-latency links arranged as a 2D torus (v5e, v6e, TPU 8t) or 3D torus (v4, v5p, v7). In the generations that use optical circuit switches, first introduced with v4, pod topology can be rewired in software without re-cabling racks. The second layer is the Jupiter data center fabric, a 13 petabit-per-second optical network that connects pods to other pods.
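A torus is a grid whose edges wrap around, so every chip has the same number of direct neighbors and the worst-case hop count is roughly halved compared with an open mesh. The sketch below assumes a 16 by 16 layout for a 256-chip pod and a 16 by 16 by 16 layout for a 4,096-chip pod; those shapes are illustrative assumptions, as are the function names.

```python
def torus_neighbors(coord, shape):
    """Directly linked chips for one chip in a torus of the given shape."""
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            other = list(coord)
            other[axis] = (other[axis] + step) % size  # wraparound link
            neighbors.append(tuple(other))
    return neighbors


def torus_diameter(shape):
    """Worst-case hop count between two chips with wraparound links."""
    return sum(size // 2 for size in shape)


def mesh_diameter(shape):
    """Worst-case hop count on the same grid without wraparound."""
    return sum(size - 1 for size in shape)


SHAPE_2D = (16, 16)        # assumed layout for a 256-chip pod
SHAPE_3D = (16, 16, 16)    # assumed layout for a 4,096-chip pod

print(torus_neighbors((0, 0), SHAPE_2D))
print(len(torus_neighbors((0, 0, 0), SHAPE_3D)), "links per chip in 3D")
print("2D diameter:", torus_diameter(SHAPE_2D), "vs mesh", mesh_diameter(SHAPE_2D))
print("3D diameter:", torus_diameter(SHAPE_3D), "vs mesh", mesh_diameter(SHAPE_3D))
```

Under these assumed shapes, moving from 2D to 3D multiplies the chip count by sixteen while the worst-case distance grows only from 16 to 24 hops, which is one reason the larger pods use the extra dimension.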
Multislice is the software layer for jobs that span pod boundaries. A Multislice job runs on several slices stitched together over Jupiter, with ICI handling intra-slice collectives and the data center network (DCN) handling inter-slice collectives. Google reports near-linear scaling up to tens of thousands of chips; in November 2023 it demonstrated a 50,944-chip distributed training job on v5e using this approach.
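The split between intra-slice and inter-slice collectives can be pictured as a two-level all-reduce. The sketch below is a conceptual model only: the source does not describe Multislice's actual algorithm, and the function name and data layout are hypothetical. Each chip holds one number, and the goal is for every chip to end up with the global sum.

```python
def hierarchical_all_reduce(slices):
    """Sum one value per chip across all slices in two stages.

    slices is a list of slices, each a list of per-chip values.
    Returns the per-chip results and the number of values that had to
    cross slice boundaries.
    """
    # Stage 1: reduce inside each slice (traffic that would stay on ICI).
    slice_sums = [sum(chips) for chips in slices]

    # Stage 2: reduce across slices (traffic that would cross the DCN).
    # Only one partial sum per slice leaves the slice.
    total = sum(slice_sums)
    values_crossing_slices = len(slice_sums)

    # Stage 3: every chip receives the global result.
    result = [[total for _ in chips] for chips in slices]
    return result, values_crossing_slices


slices = [[1, 2], [3, 4], [5, 6]]   # three slices of two chips each
reduced, crossing = hierarchical_all_reduce(slices)
print(reduced, "with", crossing, "values crossing slice boundaries")
```

The point of the hierarchy is that the slower inter-slice network carries one partial result per slice rather than one value per chip, so most of the communication volume stays on the faster in-pod links.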
Pathways is the orchestration runtime that lets one JAX client drive thousands of TPUs as if they were one machine. It runs Google's own model factories internally, and the Pathways-on-Cloud service exposes the same system to external customers. With Pathways and JAX, Google has stated that single training jobs can address more than one million TPU chips across a regional cluster.
XLA (Accelerated Linear Algebra) is the compiler at the bottom of the TPU stack. XLA takes a graph of tensor operations and produces optimized machine code for the target accelerator, fusing operations, picking tiling strategies, and laying out memory. The project was donated to a multi-vendor foundation called OpenXLA in 2023, and contributors now include Google, AMD, Apple, Arm, AWS, Intel, Meta, and Nvidia. XLA is the only path to TPU execution: every framework eventually lowers to XLA HLO before the compiler emits TPU code.
JAX is the framework Google itself uses to train models on TPUs, and it is the most natural way to program a TPU from Python. It exposes a NumPy-style API with composable transformations (jit, vmap, pmap, grad, shard_map) and lowers to XLA under the hood. JAX is the lingua franca of Google research code, and it is what powers Gemini, Pathways, and the public Multislice tooling.
TensorFlow was the original TPU framework. It still supports TPUs through the tf.distribute.TPUStrategy API and remains popular in production serving pipelines, particularly in Vertex AI. Google has gradually shifted internal research from TensorFlow to JAX, but TensorFlow continues to be supported on all TPU generations.
PyTorch runs on TPU through PyTorch/XLA, a project that traces PyTorch operations and lowers them to XLA. It works, but it has historically given up some performance relative to JAX on the same hardware. In 2025 Google released torchax, a lighter shim that takes existing Hugging Face PyTorch models and lowers them to JAX for execution; the vLLM project adopted JAX as its TPU lowering path that same year. Hugging Face's Optimum-TPU library packages a similar workflow for training and inference.
Cloud TPU is sold by the chip-hour and by the slice. Customers request a slice of a specific generation (for example v5e-8 for eight v5e chips or v6e-256 for a full Trillium pod) and pay either on demand, on a one- or three-year committed-use discount, or through Dynamic Workload Scheduler for bursty workloads. Approximate on-demand rates in 2026 are about $1.20 per chip-hour for v5e, $2.70 for Trillium, and $4.20 for v5p, with deep discounts at the three-year tier (Trillium drops to roughly $1.22 per chip-hour). Ironwood and the eighth-generation chips are priced through the AI Hypercomputer bundle, which packages TPU, storage, and Pathways into a single SKU.
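Because billing is per chip-hour, slice cost is a straightforward product. The sketch below uses the approximate on-demand rates quoted above; the rates are the article's estimates rather than a price list, and the function and variable names are illustrative.

```python
# Approximate on-demand rates in USD per chip-hour, as quoted above.
ON_DEMAND_RATE = {"v5e": 1.20, "v6e": 2.70, "v5p": 4.20}
TRILLIUM_THREE_YEAR_RATE = 1.22  # approximate committed-use rate for v6e


def slice_cost(generation, chips, hours):
    """On-demand cost of running a slice for a number of hours."""
    return ON_DEMAND_RATE[generation] * chips * hours


v5e_8_per_day = slice_cost("v5e", 8, 24)
v6e_256_per_hour = slice_cost("v6e", 256, 1)
three_year_discount = 1 - TRILLIUM_THREE_YEAR_RATE / ON_DEMAND_RATE["v6e"]

print(f"v5e-8 for one day:    ${v5e_8_per_day:,.2f}")
print(f"v6e-256 for one hour: ${v6e_256_per_hour:,.2f}")
print(f"Trillium three-year discount: {three_year_discount:.0%}")
```

At these rates a v5e-8 slice costs about $230 per day on demand, a full 256-chip Trillium pod about $691 per hour, and the three-year commitment cuts the Trillium rate by roughly 55 percent.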
TPU slices integrate with Cloud Storage and managed Lustre for data, Vertex AI for managed training and serving, and Google Kubernetes Engine for self-managed orchestration. Vertex AI hides most of the infrastructure behind a managed API, while GKE TPU slices give direct pod access for custom schedulers like Kueue or Slurm.
GPUs from Nvidia and AMD are the main alternative for large-scale training. GPUs have a deeper third-party software ecosystem (CUDA, cuDNN, Triton, and broad support in everyday research tooling), more flexible programming models, and a larger pool of engineers experienced in operating them. TPUs offer very high HBM bandwidth per dollar, large pod-scale ICI bandwidth that does not need NVLink switches or InfiniBand, and a compiler-first model that rewards static graphs. For dense transformer training and inference the two perform comparably in chip-level benchmarks, but TPU pods often win on aggregate cost when a workload maps cleanly to JAX or XLA.
Google uses TPUs internally for nearly all of its production AI: Search ranking, ads, Translate, Photos, YouTube recommendations, and the Gemini family. External customers include Anthropic (Claude training and serving on TPU v5p, v6e, and v7, with a publicly announced expansion to over one gigawatt of Ironwood capacity in October 2025), Midjourney, Salesforce, and a long tail of generative AI startups. Google has stated that more than 60 percent of funded generative AI startups use its AI infrastructure and that 90 percent of generative AI unicorns are on Google Cloud, often on Cloud TPU.
A regular computer chip can do lots of different jobs, like a Swiss Army knife. A TPU is more like a giant industrial drill press: it is not very flexible, but if your job is drilling holes, it can drill a huge number of them very quickly. Machine learning is mostly one specific kind of math (multiplying big grids of numbers together), and TPUs are built just to do that one job. Google plugs lots of them together with very fast cables, so when you ask a chatbot like Gemini a question, thousands of TPUs do the math in parallel and the answer comes back almost instantly.