TPU resource

Updated Jun 27, 2026

A TPU resource is an allocation of compute on Tensor Processing Units (TPUs), Google's custom machine learning accelerator chips, that you reserve and run as a unit. Most often it is a slice of a specific generation rented through Google Cloud.^[15] In practice "TPU resource" covers both the hardware (the chip generations from v2 through v5e, v5p, Trillium v6e, and Ironwood v7, plus the eighth-generation 8t and 8i) and the way that hardware is provisioned (on-demand, reserved, Spot, queued, or free through the TPU Research Cloud).^[3]^[15] The smallest practical unit is a slice, which Google Cloud defines as "a collection of chips all located inside the same TPU Pod connected by high-speed inter-chip interconnects (ICI)."^[16]

What is a TPU?

A TPU, or Tensor Processing Unit, is a family of custom application-specific integrated circuits (ASICs) developed by Google to accelerate machine learning workloads.^[8] TPUs target the dense linear algebra at the heart of deep learning, especially the matrix multiplications used in training and serving neural networks. The chips trade general-purpose flexibility for high throughput on tensor math at much better performance-per-watt than the GPUs or CPUs they replace.

Google first deployed TPUs in its own data centers in 2015, then opened them to outside developers through Google Cloud in early 2018.^[9] As of 2026, eight generations have been announced.^[3] TPUs power most of Google's own large model work, including Gemini and earlier projects such as AlphaGo, and they serve as the training substrate for external labs including Anthropic and Midjourney.^[14] This page focuses on TPU resources as you provision and use them; for the deep hardware design history of the chips themselves, see the dedicated Tensor Processing Unit (TPU) page.

Architecture

Systolic arrays and matrix units

The defining structural choice in a TPU is the systolic array. Instead of a general SIMD or SIMT design, each chip is built around large grids of multiply-accumulate units, called Matrix Multiply Units (MXUs), that pump operands through the grid in lockstep. The original TPU v1 used a single 256 by 256 systolic array of 8-bit multiply-accumulate units.^[8] Later generations switched to multiple smaller MXUs per TensorCore (typically four 128 by 128 MXUs in v4 and v5p), trading a small amount of theoretical density for better utilization on real workloads.^[6] The MXUs sit beside a vector unit and a scalar unit, together forming a TensorCore. Most modern TPU chips contain two TensorCores per package, although TPU v6e ships with one TensorCore per chip^[5] and TPU 8i uses a chiplet design.^[3]
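The lockstep data flow of a systolic array can be illustrated with a small simulation. The sketch below is a toy output-stationary model, chosen because it is easy to follow; it is an assumption for illustration and not a description of how an MXU is actually wired. The function name `systolic_matmul` is hypothetical. Each grid cell (i, j) owns one accumulator, and operand pairs reach it on a skewed schedule, so cells further from the top-left corner start later.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of MAC cells.

    Returns the product and the number of clock cycles the toy grid needed.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]  # one accumulator per grid cell
    # The last operand pair (index k-1) reaches the far corner cell (n-1, m-1)
    # on cycle (k-1) + (n-1) + (m-1), so n + m + k - 2 cycles are needed.
    cycles = n + m + k - 2
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # index of the operand pair arriving at (i, j) now
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]  # one multiply-accumulate
    return acc, cycles


if __name__ == "__main__":
    product, cycles = systolic_matmul([[1, 2, 3], [4, 5, 6]],
                                      [[7, 8], [9, 10], [11, 12]])
    print(product, cycles)
```

Every cell performs at most one multiply-accumulate per cycle and no cell fetches operands from memory on its own, which is the property that lets a real array keep a very large number of arithmetic units busy from a small number of memory reads.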

Memory hierarchy and HBM

Every generation from v2 onward uses High Bandwidth Memory (HBM) stacked directly on the package. HBM provides far more bandwidth than the DDR3 used on v1, which matters because matrix multiplication becomes memory-bandwidth bound once its operands no longer fit in on-chip SRAM and have to be streamed from off-chip memory. HBM capacity has grown from 16 GiB in v2 to 192 GB in TPU v7 Ironwood^[2] and 288 GB in TPU 8i. On-chip SRAM has grown in parallel, with TPU 8i carrying 384 MB to keep key-value caches for transformer inference resident on silicon.^[3]
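One way to see when a kernel is limited by bandwidth and when by compute is the roofline model, a standard simplification that is not taken from this article's sources. The sketch below applies it to the per-chip peak and HBM bandwidth figures quoted in the generation sections of this article. The function names are hypothetical, and real kernels will fall below these ceilings.

```python
def ridge_point(peak_tflops, hbm_gb_per_s):
    """Arithmetic intensity (FLOP per byte) above which a kernel is compute bound."""
    return peak_tflops * 1000.0 / hbm_gb_per_s  # TFLOP/s -> GFLOP/s, divided by GB/s


def attainable_tflops(peak_tflops, hbm_gb_per_s, flops_per_byte):
    """Roofline ceiling: the lower of the compute peak and the bandwidth limit."""
    bandwidth_limit = hbm_gb_per_s * flops_per_byte / 1000.0  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, bandwidth_limit)


if __name__ == "__main__":
    # (peak bf16 TFLOPS, HBM GB/s) as quoted in this article
    for name, peak, bw in [("v5e", 197, 819), ("v6e", 918, 1640)]:
        print(name, "ridge point:", round(ridge_point(peak, bw), 1), "FLOP/byte")
```

Under this model a kernel that performs 100 floating-point operations per byte read from HBM would be capped at about 82 TFLOPS on a v5e chip, well below the 197 TFLOPS peak, which is why keeping operands in on-chip SRAM matters.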

SparseCore and number formats

Starting with TPU v4, Google added a SparseCore: a dedicated accelerator for embedding lookups and other gather-scatter workloads that map poorly onto dense matrix engines.^[6] SparseCores were improved in v5p, v6e, and v7, and are useful for ranking, recommendation systems, and mixture-of-experts routing.^[4]

TPUs also pioneered the bfloat16 numeric format. Introduced with TPU v2, bfloat16 keeps the same 8-bit exponent as IEEE float32 but truncates the mantissa to 7 bits, so it has roughly the same dynamic range as float32 with half the storage. That property lets neural network training proceed at fp32-like stability while halving memory and compute cost, and bfloat16 is now standard across the industry.^[8] Recent generations also support INT8 for quantized inference and FP8 from Ironwood onward.^[2]
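The relationship between float32 and bfloat16 can be shown with a few lines of bit manipulation. The sketch below is illustrative only: it converts by truncation (keeping the top 16 bits of the float32 encoding), whereas hardware and numerical libraries typically round to nearest, and the helper names are hypothetical.

```python
import math
import struct


def float32_bits(x):
    """Return the 32-bit IEEE 754 single-precision encoding of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x):
    """Keep the sign bit, the 8 exponent bits and the top 7 mantissa bits."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(bits):
    """Widen a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


def bfloat16_roundtrip(x):
    """Value of x after being stored as (truncated) bfloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))


if __name__ == "__main__":
    print(hex(to_bfloat16_bits(math.pi)), bfloat16_roundtrip(math.pi))
    print(bfloat16_roundtrip(1e38))  # still finite: same exponent range as float32
```

Because the exponent field is untouched, a value such as 1e38 survives the conversion with only a small relative error, while the seven-bit mantissa limits precision to roughly two or three significant decimal digits.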

What TPU generations are available?

Google has introduced eight TPU generations since 2015, and a Cloud TPU resource is always tied to one of them.^[3]^[8] The table below summarizes the key specifications; the sections that follow give the detail.

Generation	Year	Peak per chip	Memory	Pod size	Primary use
v1	2015	92 TOPS (INT8)	8 GiB DDR3	n/a	Inference only
v2	2017	~45 bf16 TFLOPS	16 GiB	256 chips	Training + inference
v3	2018	123 bf16 TFLOPS	32 GiB	1,024 chips	Training
v4	2021	275 bf16 TFLOPS	32 GiB	4,096 chips	Training
v5e	2023	197 bf16 TFLOPS	16 GiB	256 chips	Cost-efficient
v5p	2023	459 bf16 TFLOPS	95 GiB	8,960 chips	High performance
v6e (Trillium)	2024	918 bf16 TFLOPS	32 GB	256 chips	Training + inference
v7 (Ironwood)	2025	4,614 FP8 TFLOPS	192 GB	9,216 chips	Inference of reasoning models
8t / 8i	2026	training / inference split	up to 288 GB (8i)	up to 9,600 (8t)	Agentic era
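The per-pod throughput figures quoted in the sections below follow from multiplying the per-chip peak by the pod size. The short sketch below reproduces that arithmetic from the table values; the dictionary and function names are hypothetical, and the results are theoretical peaks rather than measured performance.

```python
# generation: (peak TFLOPS per chip, chips per pod), copied from the table above.
# All figures are bf16 except v7, which is quoted in FP8.
POD_SPECS = {
    "v2": (45, 256),
    "v3": (123, 1024),
    "v4": (275, 4096),
    "v5e": (197, 256),
    "v5p": (459, 8960),
    "v6e": (918, 256),
    "v7": (4614, 9216),
}


def pod_petaflops(generation):
    """Theoretical peak of a full pod in petaFLOPS (per-chip peak times chip count)."""
    per_chip_tflops, chips = POD_SPECS[generation]
    return per_chip_tflops * chips / 1000.0


if __name__ == "__main__":
    for generation in POD_SPECS:
        print(f"{generation}: {pod_petaflops(generation):,.1f} PFLOPS per pod")
```

For example, 275 TFLOPS across 4,096 chips gives about 1,126 petaFLOPS (1.1 exaFLOPS) for a v4 pod, and 4,614 TFLOPS across 9,216 chips gives about 42.5 exaFLOPS for an Ironwood superpod.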

TPU v1 (2015)

The first TPU was an inference-only accelerator built on a 28 nm process at 700 MHz with a 75 watt TDP. It paired a 256 by 256 8-bit systolic array with 8 GiB of DDR3 memory and 34 GB/s of bandwidth, hitting 92 trillion 8-bit operations per second.^[8] Google built more than 100,000 of them and used them for Search ranking, translation, Street View, voice recognition, and the AlphaGo match against Lee Sedol in 2016.^[9] v1 was announced publicly at Google I/O 2016, more than a year after deployment.

TPU v2 (2017)

TPU v2 was the first generation built for training as well as inference. It used a 16 nm process, ran at 700 MHz, drew 280 watts, and carried 16 GiB of HBM with 600 GB/s of bandwidth, delivering about 45 bfloat16 teraFLOPS per chip.^[8] Four chips were packaged on a board, and 64 boards (256 chips) formed a pod with roughly 11.5 petaFLOPS of compute. This was the first generation made available externally on Google Cloud, in beta in early 2018.^[9]

TPU v3 (2018)

Announced at Google I/O 2018, TPU v3 stayed on 16 nm but doubled the HBM to 32 GiB at 900 GB/s and added liquid cooling to push performance to 123 bfloat16 teraFLOPS per chip. Pods scaled up to 1,024 chips, delivering on the order of 100 petaFLOPS per pod.^[8] v3 was a workhorse for Google's mid-period research, including early BERT and T5 training runs.

TPU v4 (2021)

TPU v4 moved to a 7 nm process, doubled the MXU count, raised the clock to 1,050 MHz, and reached 275 bfloat16 teraFLOPS per chip in roughly 170 watts. HBM stayed at 32 GiB but bandwidth jumped to 1,200 GB/s.^[6] The pod grew to 4,096 chips connected in a 3D torus, for about 1.1 exaFLOPS of peak compute per pod. v4 introduced reconfigurable optical circuit switches (OCS) that let Google rewire pod topology in software to route around faults and to fit jobs of varying shapes onto the fabric.^[6] A separate inference variant called v4i shipped without liquid cooling. v4 trained many of Google's PaLM family models.

TPU v5e (2023)

v5e was positioned as the cost-efficient member of the fifth generation. Each chip delivers 197 bfloat16 teraFLOPS and 393 INT8 TOPS with 16 GiB of HBM at 819 GB/s.^[8] A v5e pod contains 256 chips arranged in a 2D torus. v5e ran the largest publicly disclosed distributed training run at the time of its announcement: a 50,944-chip job spanning 199 pods.^[12]

TPU v5p (December 2023)

v5p is the high-performance counterpart to v5e. Each chip provides 459 bfloat16 teraFLOPS and 918 INT8 TOPS, paired with 95 GiB of HBM at 2,765 GB/s.^[4] A v5p pod has 8,960 chips connected in a 3D torus with 4,800 Gbps of ICI bandwidth per chip, for roughly 4 exaFLOPS of bfloat16 compute per pod.^[4] Google trained the original Gemini family on v5p, and the chip was marketed as competitive with Nvidia's H100 on a per-chip basis.^[10]

TPU v6e "Trillium" (2024)

Announced at Google I/O 2024 and generally available on December 12, 2024, Trillium delivers 918 bfloat16 teraFLOPS and 1,836 INT8 TOPS per chip, a 4.7 times improvement over v5e per chip.^[1]^[17] HBM doubled to 32 GB at 1,640 GB/s and ICI bandwidth doubled to 800 GB/s bidirectional.^[5] The pod is 256 chips in a 2D torus, smaller than v5p's pod, but Multislice over the Jupiter fabric lets users stitch together hundreds of pods. Trillium is roughly 67 percent more energy efficient than v5e and trained Gemini 1.5 Flash, Imagen 3, and Gemma 2.^[1] On Google Cloud's technical surfaces (the API, logs, and resource names) Trillium is referred to as v6e.^[5]

TPU v7 "Ironwood" (2025)

Ironwood was unveiled at Google Cloud Next 25 on April 9, 2025 as the first TPU explicitly built for inference of reasoning models.^[2] Each chip delivers 4,614 FP8 TFLOPS, carries 192 GB of HBM3E at 7.37 TB/s, and has 9.6 Tb/s of ICI bandwidth.^[7] Pods come in 256-chip and 9,216-chip configurations; the full 9,216-chip superpod reaches 42.5 FP8 exaFLOPS. Ironwood doubles performance-per-watt versus Trillium and is nearly 30 times more efficient than the original 2018 Cloud TPU.^[2] Each chip contains two TensorCores and four SparseCores.^[7] Anthropic announced plans to scale to over one gigawatt of Ironwood capacity for training and serving future Claude models.^[14]

TPU 8t and TPU 8i (2026)

At Google Cloud Next 26 in April 2026, Google split the eighth generation into two purpose-built chips.^[3] TPU 8t (codename Sunfish), designed with Broadcom, targets training. A single 8t superpod scales to 9,600 chips with 2 petabytes of shared HBM and 121 exaFLOPS, roughly three times the per-pod throughput of Ironwood with double the inter-chip bandwidth.^[3] TPU 8i (codename Zebrafish), designed with MediaTek, targets inference for agentic AI workloads. It carries 288 GB of HBM, 384 MB of on-chip SRAM (about three times v7), and 19.2 Tb/s of interconnect bandwidth, with up to 1,152 chips per pod connected by a new Boardfly topology. Google quotes 80 percent better performance-per-dollar for 8i over Ironwood.^[3] Both chips are slated for general availability later in 2026 and target TSMC's 2 nm process.

What is a TPU slice?

When you provision a TPU resource you do not rent a whole pod by default; you rent a slice. Google Cloud defines a slice as "a collection of chips all located inside the same TPU Pod connected by high-speed inter-chip interconnects (ICI)."^[16] A slice can be a single host (for example v5e-8, one VM with 8 v5e chips) or span many hosts (for example v6e-256, a full 256-chip Trillium pod across 64 four-chip host VMs).^[16] Slice shape is described by a topology tuple: a 2D tuple such as 2x4 for cost-efficient chips (v5e, v6e), or a 3D tuple such as 4x4x4 for performance chips (v4, v5p, v7), where the product of the dimensions equals the chip count.^[16]
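The relationship between a topology tuple, the chip count, and the number of host VMs is simple arithmetic, sketched below. The helper names are hypothetical and the code only models the rule stated above (the product of the dimensions equals the chip count); it does not call any Google Cloud API or check which topologies a given generation actually offers.

```python
from math import prod


def chips_in_topology(topology):
    """Return the chip count for a topology string such as '2x4' or '4x4x4'."""
    dims = [int(d) for d in topology.lower().split("x")]
    if len(dims) not in (2, 3) or any(d < 1 for d in dims):
        raise ValueError(f"expected a 2D or 3D topology, got {topology!r}")
    return prod(dims)


def hosts_in_slice(chip_count, chips_per_host):
    """Number of host VMs needed when every host exposes chips_per_host chips."""
    if chip_count % chips_per_host:
        raise ValueError("chip count must be a multiple of chips per host")
    return chip_count // chips_per_host


if __name__ == "__main__":
    print(chips_in_topology("2x4"))      # single-host v5e example
    print(chips_in_topology("4x4x4"))    # 3D example
    print(hosts_in_slice(chips_in_topology("16x16"), 4))  # 256 chips on 4-chip hosts
```

With four chips per host, the 256-chip example works out to the 64 host VMs mentioned above.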

The accelerator type name <generation>-<count> (for example v5p-128) is how you ask for a slice on the command line and in Vertex AI or Google Kubernetes Engine. For v5e and v6e the count is the number of chips; for v4 and v5p it is the number of TensorCores, so with two TensorCores per chip a v5p-128 slice contains 64 chips. Each TPU VM in a slice exposes a fixed number of chips to that host: v5e VMs carry 1, 4, or 8 chips, while v6e uses half-host VMs of 4 chips with v6e-1 and v6e-8 single-host exceptions for testing and inference.^[16]

Pods and large-scale interconnect

A single TPU chip is rarely useful on its own at modern model sizes. Google ties chips together with two layers of network. The first layer, Inter-Chip Interconnect (ICI), runs over short copper or short-reach optics inside a pod and provides low-latency links arranged as a 2D torus (v5e, v6e, TPU 8t) or 3D torus (v4, v5p, v7).^[8] v4 introduced reconfigurable optical circuit switches so pod topology can be re-wired in software without re-cabling racks.^[6] The second layer is the Jupiter data center fabric, a 13 petabit-per-second optical network that connects pods to other pods.^[11]

Multislice is the software layer for jobs that span pod boundaries. A Multislice job runs on several slices stitched together over Jupiter, with ICI handling collectives inside each slice and the data center network (DCN) handling collectives between slices.^[11] Google reports near-linear scaling up to tens of thousands of chips; in November 2023 it demonstrated a 50,944-chip distributed training job on v5e using this approach.^[12]
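The two-level split of collectives can be pictured as a hierarchical all-reduce. The sketch below is a plain-Python toy, not the Multislice implementation: it assumes each chip holds a single number, sums within each slice first (the step that would run over the fast in-slice interconnect), then combines one partial result per slice (the step that would cross the data center network), and finally hands the total back to every chip. The function name is hypothetical.

```python
def hierarchical_all_reduce(slices):
    """Sum one value per chip across all slices in two stages.

    slices is a list of slices, each a list of per-chip values.
    Returns the same nested shape with every entry replaced by the global sum.
    """
    # Stage 1: reduce inside each slice (fast, short-range links).
    partial_sums = [sum(chip_values) for chip_values in slices]
    # Stage 2: combine one number per slice (slower, long-range network).
    total = sum(partial_sums)
    # Stage 3: broadcast the result back to every chip.
    return [[total] * len(chip_values) for chip_values in slices]


if __name__ == "__main__":
    print(hierarchical_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
```

The point of the split is that only one value per slice crosses the slower network, regardless of how many chips each slice contains.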

Pathways is the orchestration runtime that lets one JAX client drive thousands of TPUs as if they were one machine. It runs Google's own model factories internally, and the Pathways-on-Cloud service exposes the same system to external customers. With Pathways and JAX, Google has stated that single training jobs can address more than one million TPU chips across a regional cluster.

How do you get access to TPUs?

You obtain a TPU resource by requesting a slice through Google Cloud, and the way you request it determines the price and the availability guarantee.^[15] Google Cloud documentation states that "queued resources enable you to request Cloud TPU resources in a queued manner": you submit a request, it enters a queue maintained by the Cloud TPU service, and the chips are assigned to your project for exclusive use once capacity frees up.^[16] Creating TPUs as queued resources is the documented best practice because it lets you wait for capacity instead of failing immediately. The main provisioning models are:

Provisioning model	What it is	Best for
On-demand	Default; no advance arrangement, no preemption, but no availability guarantee	Short or interactive jobs
Reserved (committed use)	Capacity purchased in advance through an account team or a 1-3 year commitment	Steady, large-scale training
Spot VMs	Lower-cost, preemptible-at-any-time capacity on separate quota	Fault-tolerant, cost-sensitive batch
Flex-start	Short reservations of capacity (up to 7 days) without long-term commitment	Bursty experimentation
Queued resources	A request placed in a queue that is fulfilled when capacity is available	Hard-to-get large slices

Spot VMs replace the older preemptible TPUs. Google can reclaim them at any time but, unlike the legacy preemptible tier, they have no fixed runtime limit. Preemptible quota is separate from (and usually larger than) standard quota.^[15] Reserved capacity can be consumed directly or fed through the queued-resource system for more reliable allocation.
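The difference between failing immediately and waiting in a queue can be made concrete with a toy model. Everything in the sketch below is hypothetical: the class, the method names, the "waiting" and "active" labels, and the strict first-in-first-out policy are assumptions for illustration and do not reflect the Cloud TPU API, its state names, or its actual scheduling rules.

```python
from collections import deque


class ToyCapacityQueue:
    """Toy model of requests that wait for chips instead of failing outright."""

    def __init__(self, free_chips):
        self.free_chips = free_chips
        self.waiting = deque()  # (name, chips) pairs in arrival order
        self.active = {}        # name -> chips currently assigned

    def submit(self, name, chips):
        """Queue a request and return its state after trying to fulfil it."""
        self.waiting.append((name, chips))
        self._drain()
        return self.state(name)

    def release(self, name):
        """Return a finished request's chips and fulfil whoever is next."""
        self.free_chips += self.active.pop(name)
        self._drain()

    def state(self, name):
        if name in self.active:
            return "active"
        if any(queued == name for queued, _ in self.waiting):
            return "waiting"
        return "unknown"

    def _drain(self):
        # Strict FIFO: the request at the head blocks those behind it.
        while self.waiting and self.waiting[0][1] <= self.free_chips:
            name, chips = self.waiting.popleft()
            self.free_chips -= chips
            self.active[name] = chips


if __name__ == "__main__":
    queue = ToyCapacityQueue(free_chips=8)
    print(queue.submit("job-a", 8))  # capacity is available
    print(queue.submit("job-b", 4))  # no capacity left, so it waits
    queue.release("job-a")
    print(queue.state("job-b"))      # fulfilled once chips are freed
```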

What is the TPU Research Cloud?

The TPU Research Cloud (TRC) is Google's program that gives accepted researchers free access to a cluster of more than 1,000 Cloud TPU devices, supporting frameworks including TensorFlow, PyTorch, JAX, and Julia.^[18] Anyone can apply at sites.research.google/trc, invitations go out on a rolling basis, and accepted users get temporary free TPU quota that is ready within minutes. Participants are expected to share results through peer-reviewed publications, open-source code, or blog posts; the TPUs are free, although associated services such as small VMs and Cloud Storage buckets still incur (typically small) charges.^[18]

Software stack

XLA and OpenXLA

XLA (Accelerated Linear Algebra) is the compiler at the bottom of the TPU stack. XLA takes a graph of tensor operations and produces optimized machine code for the target accelerator, fusing operations, picking tiling strategies, and laying out memory. The project moved into OpenXLA, a multi-vendor open-source project, in 2023, and contributors now include Google, AMD, Apple, Arm, AWS, Intel, Meta, and Nvidia.^[13] XLA is the only path to TPU execution: every framework eventually lowers to XLA HLO before the compiler emits TPU code.

JAX

JAX is the framework Google itself uses to train models on TPUs, and it is the most natural way to program a TPU from Python. It exposes a NumPy-style API with composable transformations (jit, vmap, pmap, grad, shard_map) and lowers to XLA under the hood. JAX is the lingua franca of Google research code, and it is what powers Gemini, Pathways, and the public Multislice tooling.

TensorFlow

TensorFlow was the original TPU framework. It still supports TPUs through the tf.distribute.TPUStrategy API and remains popular in production serving pipelines, particularly in Vertex AI. Google has gradually shifted internal research from TensorFlow to JAX, but TensorFlow continues to be supported on all TPU generations.

PyTorch/XLA and torchax

PyTorch runs on TPU through PyTorch/XLA, a project that traces PyTorch operations and lowers them to XLA. It works, but historically gave up some performance versus JAX on the same hardware. In 2025 Google released torchax, a lighter shim that takes existing Hugging Face PyTorch models and lowers them to JAX for execution; the vLLM project adopted JAX as its TPU lowering path that same year. Hugging Face's Optimum-TPU library packages a similar workflow for training and inference.

Cloud TPU offerings

Cloud TPU is sold by the chip-hour and by the slice. Customers request a slice of a specific generation (for example v5e-8 for eight v5e chips or v6e-256 for a full Trillium pod) and pay on demand, through a one- or three-year committed-use discount, or through Dynamic Workload Scheduler for bursty workloads.^[15] Approximate on-demand rates in 2026 are about $1.20 per chip-hour for v5e, $2.70 for Trillium, and $4.20 for v5p, with deep discounts at the three-year tier (Trillium drops to roughly $1.22 per chip-hour).^[15] Ironwood and the eighth-generation chips are priced through the AI Hypercomputer bundle, which packages TPU, storage, and Pathways into a single SKU.^[10]
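Because pricing is per chip-hour, the cost of a slice is the rate multiplied by the chip count and the duration. The sketch below works through that arithmetic using the approximate rates quoted above; the constant and function names are hypothetical, the rates are this article's approximations and not a price list, and real bills also include storage, networking, and host costs.

```python
# Approximate on-demand rates in USD per chip-hour, as quoted in this article.
ON_DEMAND_RATE = {"v5e": 1.20, "v6e": 2.70, "v5p": 4.20}
V6E_THREE_YEAR_RATE = 1.22  # approximate three-year committed rate for Trillium


def slice_cost(rate_per_chip_hour, chips, hours):
    """Cost in USD of running a slice of the given size for the given time."""
    return rate_per_chip_hour * chips * hours


if __name__ == "__main__":
    on_demand = slice_cost(ON_DEMAND_RATE["v6e"], chips=256, hours=24)
    committed = slice_cost(V6E_THREE_YEAR_RATE, chips=256, hours=24)
    print(f"v6e-256 for one day, on demand: ${on_demand:,.2f}")
    print(f"v6e-256 for one day, 3-year rate: ${committed:,.2f}")
    print(f"discount: {1 - committed / on_demand:.0%}")
```

At these rates a full 256-chip Trillium pod costs roughly $16,600 per day on demand and roughly $7,500 per day at the three-year rate, a discount of about 55 percent.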

TPU slices integrate with Cloud Storage and managed Lustre for data, Vertex AI for managed training and serving, and Google Kubernetes Engine for self-managed orchestration. Vertex AI hides most of the infrastructure behind a managed API, while GKE TPU slices give direct pod access for custom schedulers like Kueue or Slurm.

How do TPUs compare to GPUs?

GPUs from Nvidia and AMD are the main alternative for large-scale training. GPUs have a deeper third-party software ecosystem (CUDA, cuDNN, Triton, plus every researcher's favorite Python notebook), more flexible programming models, and a larger pool of operators. TPUs offer very high HBM bandwidth per dollar, large pod-scale ICI bandwidth that does not need NVLink switches or InfiniBand, and a compiler-first model that rewards static graphs. For dense transformer training and inference the two benchmark within striking distance at the chip level, but TPU pods often win on aggregate cost when a workload maps cleanly to JAX or XLA. A practical rule of thumb: choose GPUs when you need the broadest framework and operator support or off-the-shelf CUDA kernels, and choose TPU slices when the workload is a large, well-structured transformer that lowers cleanly to XLA and benefits from pod-scale interconnect.

Who uses TPUs?

Google uses TPUs internally for nearly all of its production AI: Search ranking, ads, Translate, Photos, YouTube recommendations, and the Gemini family. External customers include Anthropic (Claude training and serving on TPU v5p, v6e, and v7, with a publicly announced expansion to over one gigawatt of Ironwood capacity in October 2025), Midjourney, Salesforce, and a long tail of generative AI startups.^[14] Google has stated that more than 60 percent of funded generative AI startups use its AI infrastructure and that 90 percent of generative AI unicorns are on Google Cloud, often on Cloud TPU.^[9]

Explain like I'm 5

A regular computer chip can do lots of different jobs, like a Swiss Army knife. A TPU is more like a giant industrial drill press: it is not very flexible, but if your job is drilling holes, it can drill a huge number of them very quickly. Machine learning is mostly one specific kind of math (multiplying big grids of numbers together), and TPUs are built just to do that one job. Google plugs lots of them together with very fast cables, so when you ask a chatbot like Gemini a question, thousands of TPUs do the math in parallel and the answer comes back almost instantly. To use TPUs yourself, you rent a "slice" of them from Google Cloud for a while, a bit like renting a few lanes at a giant bowling alley instead of buying the whole building.

References

1. Google Cloud, Introducing Trillium, sixth-generation TPUs, May 2024.
2. Google Cloud, Ironwood: The first Google TPU for the age of inference, April 2025.
3. Google Cloud, Our eighth generation TPUs: two chips for the agentic era, April 2026.
4. Google Cloud Documentation, TPU v5p.
5. Google Cloud Documentation, TPU v6e.
6. Google Cloud Documentation, TPU v4.
7. Google Cloud Documentation, TPU7x (Ironwood).
8. Wikipedia, Tensor Processing Unit.
9. Google Cloud, TPU transformation: A look back at 10 years of our AI-specialized chips.
10. Google Cloud, Introducing Cloud TPU v5p and AI Hypercomputer, December 2023.
11. Google Cloud, Using Cloud TPU Multislice to scale AI workloads.
12. Google Cloud, The world's largest distributed LLM training job on TPU v5e, November 2023.
13. OpenXLA Project, openxla.org/xla.
14. Anthropic, Expanding our use of Google Cloud TPUs and Services, October 2025.
15. Google Cloud, TPU Pricing.
16. Google Cloud Documentation, TPU architecture (slices, topology, queued resources).
17. Google Cloud, Trillium TPU is GA, December 2024.
18. Google Research, TPU Research Cloud.
