TPU Node

A TPU node refers to a configuration in Google Cloud where one or more Tensor Processing Unit (TPU) chips are provisioned as a network-attached accelerator resource. In the Cloud TPU context, a TPU node specifically describes the legacy architecture in which a user's virtual machine (VM) communicates with a separate, inaccessible TPU host machine over gRPC. This architecture has been superseded by the TPU VM model, where users connect directly to the VM that is physically attached to the TPU hardware. More broadly, the term "TPU node" is also used to describe any individual unit of TPU compute within a larger TPU pod or cluster, encompassing one or more TPU chips connected to a host machine.

ELI5 (explain like I'm 5)

Imagine you have a really fast calculator that is great at doing one specific kind of math problem. A TPU node is like having that calculator sitting in a room at a big computer center. In the old setup (called "TPU node architecture"), you had to talk to the calculator through a walkie-talkie from another room, and you could never actually go into the room where the calculator was. In the new setup (called "TPU VM"), you get to sit right next to the calculator and use it directly. Either way, the calculator itself is the same super-fast math helper that lets computers learn from huge amounts of data much more quickly than a regular computer could on its own.

Background and history

Google began developing TPUs in 2013 and first deployed them internally in 2015. The initial motivation was to handle the projected demand for neural network inference workloads across Google's data centers. At the time, Google estimated that if every user spoke to their Android phone for just three minutes per day using voice search, existing CPU-based infrastructure would need to double in capacity. Rather than doubling their data center footprint, Google opted to build a custom ASIC tailored to neural network computation.

The first public description of the TPU appeared in a 2017 paper by Norman Jouppi and colleagues at the 44th International Symposium on Computer Architecture (ISCA). This paper reported that the TPU v1 achieved 15 to 30 times higher performance and 30 to 80 times higher performance-per-watt compared to contemporary CPUs and GPUs on production neural network inference workloads. The paper became the second most cited publication in ISCA's 50-year history.

Google made TPUs available to external users through Google Cloud Platform starting with TPU v2 in 2018. The initial Cloud TPU offering used what is now called the "TPU node" architecture, where users provisioned a standard Compute Engine VM (typically an n1 instance) and connected to a separate TPU host over a gRPC network interface.

Timeline of TPU generations

Generation	Year	Process node	Peak compute per chip (TFLOPS; TOPS for INT8)	HBM per chip	Memory bandwidth	Max chips per pod	Notable feature
TPU v1	2015	28 nm	92 (INT8)	None (8 GiB DDR3)	34 GB/s	N/A (single chip)	First deployment; inference only
TPU v2	2017	16 nm	45 (bf16)	16 GiB HBM	600 GB/s	256	Added training; introduced bfloat16
TPU v3	2018	16 nm	123 (bf16)	32 GiB HBM	900 GB/s	1,024	Water cooling; roughly 2.7x v2 per-chip peak
TPU v4	2021	7 nm	275 (bf16)	32 GiB HBM	1,200 GB/s	4,096	Optical circuit switches (OCS); 3D torus
TPU v5e	2023	N/A	197 (bf16)	16 GiB HBM2e	819 GB/s	256	Cost-efficient variant
TPU v5p	2023	N/A	459 (bf16)	95 GiB HBM	2,765 GB/s	8,960	2.8x faster LLM training than v4
TPU v6e (Trillium)	2024	N/A	918 (bf16)	32 GiB HBM	1,640 GB/s	256	4.7x v5e performance; 3rd-gen SparseCore
TPU v7 (Ironwood)	2025	N/A	4,614 (FP8)	192 GiB HBM	7,370 GB/s	9,216	Designed for large-scale inference; 42.5 ExaFLOPS (FP8) per pod
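
The per-chip and per-pod columns can be cross-checked with simple arithmetic. The sketch below multiplies per-chip peak throughput by the maximum pod size, using only figures from the table. The function name is illustrative, and the result is a theoretical peak that ignores real-world utilization. The Ironwood figure is FP8 while the others are bf16, so the rows are not directly comparable.

```python
# Figures copied from the table: (peak TFLOPS per chip, max chips per pod).
GENERATIONS = {
    "TPU v4": (275, 4096),             # bf16
    "TPU v5p": (459, 8960),            # bf16
    "TPU v7 (Ironwood)": (4614, 9216), # FP8
}


def pod_peak_exaflops(tflops_per_chip, chips_per_pod):
    """Theoretical pod peak in ExaFLOPS (1 ExaFLOPS = 1,000,000 TFLOPS)."""
    return tflops_per_chip * chips_per_pod / 1_000_000


for name, (tflops, chips) in GENERATIONS.items():
    print(f"{name}: {pod_peak_exaflops(tflops, chips):.2f} ExaFLOPS peak")
```

For Ironwood this gives about 42.52 ExaFLOPS, which matches the 42.5 ExaFLOPS per pod quoted in the table.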

Cloud TPU node architecture (legacy)

The original Cloud TPU deployment model, known as the TPU node architecture, uses a two-machine design. In this configuration, the user provisions a standard Google Compute Engine VM (the "user VM") that runs application code. The user VM communicates with a separate "TPU host" VM over a gRPC connection. The TPU host is the machine physically connected to the TPU chips via PCIe, but in the TPU node model, the user has no direct access to this host.

How TPU node works

The user creates a Compute Engine VM and a Cloud TPU resource in the same Google Cloud zone.
Application code (written using TensorFlow, JAX, or PyTorch via PyTorch/XLA) runs on the user VM.
The framework generates a computation graph, which is serialized and transmitted to the TPU host over gRPC.
The TPU host's XLA compiler compiles the computation graph into TPU machine code.
The compiled program is dispatched to the TPU chips for execution.
Results are returned to the user VM over the same gRPC channel.

Training data in the TPU node architecture must be loaded from Google Cloud Storage (GCS) because the user VM and the TPU host are separate machines. The user cannot store training data on local disk and have the TPU access it directly.

Limitations of the TPU node model

Limitation	Description
No direct host access	Users cannot SSH into the TPU host, making it difficult to debug training failures, inspect logs, or profile TPU utilization
gRPC overhead	All communication between user code and TPU hardware passes through a network interface, adding latency compared to direct PCIe access
GCS data dependency	Training data must reside in Google Cloud Storage; local storage on the user VM is not directly accessible to the TPU
Separate provisioning	The user VM and TPU resource must be created and managed independently, adding operational complexity
Limited framework support	Newer tools and APIs (such as `prepare_tf_dataset()` in Hugging Face Transformers) only support the TPU VM architecture

Deprecation

As of April 2025, the TPU node architecture is officially deprecated by Google Cloud. Google recommends migrating all workloads to the TPU VM architecture. The deprecation was driven by the architectural advantages of TPU VMs, including direct SSH access, simpler data pipelines, and better debugging capabilities.

TPU VM architecture (current)

The TPU VM architecture replaced the TPU node model and is now the recommended way to use Cloud TPUs. In this design, the user connects directly via SSH to a Linux VM that is physically attached to the TPU hardware. There is no separate user VM or gRPC intermediary.

Key differences from TPU node

Feature	TPU node (legacy)	TPU VM (current)
Host access	No direct access to TPU host	Direct SSH to TPU host VM
Data loading	Must use GCS buckets	Can use local storage, GCS, or network file systems
Debugging	Limited; no access to host logs	Full root access; can inspect logs and profiles
VM provisioning	Separate user VM + TPU resource	Single TPU VM resource
Framework support	TensorFlow primarily	TensorFlow, JAX, PyTorch/XLA
gRPC overhead	Yes	No (direct PCIe connection)

In the TPU VM model, TPU chips are connected to a CPU host machine over PCIe, commonly in groups of four chips per host. A single TPU VM may expose one or more of these chips depending on the accelerator type. For multi-host workloads, multiple TPU VMs coordinate over the data center network (DCN), while the TPU chips themselves communicate over the inter-chip interconnect (ICI).

TPU chip architecture

Each TPU chip is an application-specific integrated circuit (ASIC) designed by Google specifically for machine learning computation. Unlike general-purpose processors, TPUs are optimized for the dense matrix arithmetic that dominates neural network training and inference.

TensorCore

The primary compute unit inside a TPU chip is the TensorCore. Each TensorCore contains several functional blocks:

Matrix multiply unit (MXU): A systolic array of multiply-accumulate (MAC) units arranged in either a 128x128 grid (TPU v2 through v5p) or a 256x256 grid (TPU v6e and v7). Each MXU performs 16,384 (128x128) or 65,536 (256x256) multiply-accumulate operations per clock cycle. Multiplications use bfloat16 inputs while accumulations use FP32, preserving numerical range without sacrificing throughput.
Vector processing unit (VPU): Handles element-wise operations such as activations (ReLU, GELU, softmax), normalization, and residual additions.
Scalar unit: Manages control flow, address computation, and memory access scheduling.
High-bandwidth memory (HBM): On-package DRAM that gives the MXU and VPU high-throughput access to data. Strictly speaking, HBM sits beside the TensorCore on the chip package rather than inside it.

The number of TensorCores per chip varies by generation. TPU v7 (Ironwood) chips contain two TensorCores, with each chiplet packaging one TensorCore, two SparseCores, and 96 GiB of HBM.
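
The per-cycle figures quoted for the MXU follow directly from the array dimensions. The sketch below reproduces them and shows how a peak throughput estimate could be derived. The `num_mxus` and `clock_hz` values in the last line are hypothetical placeholders, not published specifications for any TPU generation.

```python
def macs_per_cycle(array_dim):
    """A square systolic array has one MAC unit per grid cell."""
    return array_dim * array_dim


def peak_tflops(array_dim, num_mxus, clock_hz):
    """Peak throughput, counting each multiply-accumulate as 2 FLOPs.

    num_mxus and clock_hz are supplied by the caller; the values used
    below are hypothetical, not real chip specifications.
    """
    return 2 * macs_per_cycle(array_dim) * num_mxus * clock_hz / 1e12


print(macs_per_cycle(128))  # 16384
print(macs_per_cycle(256))  # 65536
print(macs_per_cycle(256) // macs_per_cycle(128))  # 4

# Hypothetical chip: four 128x128 MXUs clocked at 1 GHz.
print(round(peak_tflops(128, num_mxus=4, clock_hz=1.0e9), 1))  # 131.1
```

Doubling the array dimension quadruples the number of MAC units, which is why the move from 128x128 to 256x256 arrays raises per-MXU throughput by 4x at the same clock.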

Systolic array operation

The MXU uses a weight-stationary systolic array design. In this approach:

Weights are pre-loaded into the MAC array before computation begins.
Activation values flow horizontally from left to right through the array.
Partial sums propagate vertically from top to bottom.
All intermediate results pass directly between adjacent ALUs without requiring memory access.

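This design removes most memory traffic from the inner loop of matrix multiplication, which is the bandwidth bottleneck on conventional processors: each weight and activation is fetched once and then reused as it passes from ALU to ALU. Because wires connect only spatially adjacent ALUs, they can be kept short, which reduces both power consumption and signal propagation delay. When fully utilized, a 128x128 MXU can perform one bf16[8,128] x bf16[128,128] matrix multiplication, producing an f32[8,128] result, every 8 clock cycles: the tile requires 8 x 128 x 128 = 131,072 multiply-accumulates, and the array completes 16,384 per cycle.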
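
The data movement described above can be illustrated with a small cycle-by-cycle simulation. This is a teaching sketch under simplifying assumptions (one register per cell, inputs skewed by one cycle per row, no double-buffered weight loading), not a model of the actual MXU microarchitecture. All function names are illustrative.

```python
def systolic_matmul(x, w):
    """Multiply x [M][K] by w [K][N] on a simulated K x N weight-stationary array.

    Returns (y, cycles), where cycles includes pipeline fill and drain.
    """
    m_rows, k_dim, n_cols = len(x), len(w), len(w[0])
    # Cell (k, n) permanently holds weight w[k][n] (weight-stationary).
    a_reg = [[0] * n_cols for _ in range(k_dim)]  # activation leaving each cell
    p_reg = [[0] * n_cols for _ in range(k_dim)]  # partial sum leaving each cell
    y = [[0] * n_cols for _ in range(m_rows)]
    cycles = m_rows + k_dim + n_cols - 2

    for t in range(cycles):
        new_a = [[0] * n_cols for _ in range(k_dim)]
        new_p = [[0] * n_cols for _ in range(k_dim)]
        for k in range(k_dim):
            for n in range(n_cols):
                if n == 0:
                    # Left edge: row k is fed x[m][k], skewed by k cycles.
                    m = t - k
                    a_in = x[m][k] if 0 <= m < m_rows else 0
                else:
                    a_in = a_reg[k][n - 1]  # from the left neighbor
                p_in = p_reg[k - 1][n] if k > 0 else 0  # from the cell above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * w[k][n]
        a_reg, p_reg = new_a, new_p
        # Finished sums leave the bottom row; column n lags by n cycles.
        for n in range(n_cols):
            m = t - n - (k_dim - 1)
            if 0 <= m < m_rows:
                y[m][n] = p_reg[k_dim - 1][n]
    return y, cycles


def steady_state_cycles(rows, k_dim, n_cols, array_dim=128):
    """Cycles per tile at full utilization: total MACs / MACs per cycle."""
    return rows * k_dim * n_cols // (array_dim * array_dim)


y, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(y)       # [[19, 22], [43, 50]]
print(cycles)  # 4
print(steady_state_cycles(8, 128, 128))  # 8
```

The simulated cycle count (M + K + N - 2) includes the time to fill and drain the pipeline, whereas the 8-cycle figure is a steady-state throughput number that applies when tiles stream through the array back to back.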

SparseCore

Starting with TPU v4, Google introduced SparseCores as additional dataflow processors designed to accelerate sparse operations common in recommendation and ranking models. These processors handle large embedding table lookups that are memory-bound rather than compute-bound. TPU v6e includes two SparseCores per chip, while TPU v5p and v7 include four per chip. The third-generation SparseCore in TPU v6e introduced variable SIMD widths (8 elements for FP32, 16 for bfloat16) and memory access patterns that waste less bandwidth.

bfloat16 number format

TPUs use the bfloat16 (Brain Floating Point) number format, developed by Google Brain. bfloat16 is a 16-bit floating point format with one sign bit, eight exponent bits, and seven mantissa bits. This differs from the IEEE 754 half-precision (FP16) format, which allocates five exponent bits and ten mantissa bits.

The design rationale prioritizes dynamic range over precision. Neural networks are generally more sensitive to overflow and underflow (which depend on exponent range) than to rounding errors (which depend on mantissa precision). By matching the exponent range of FP32 while halving the storage size, bfloat16 lets roughly twice as many model parameters and activations fit in the same HBM as FP32 would. The MXU performs multiplications in bfloat16 and accumulates results in FP32, which limits the rounding error that would otherwise build up over long chains of multiply-accumulate operations.
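
Because bfloat16 shares its sign and exponent layout with FP32, a bfloat16 value is simply the upper 16 bits of the corresponding FP32 bit pattern. The sketch below converts with round-to-nearest-even and splits the result into its three fields. It is a minimal illustration for finite inputs that fit in FP32 (NaN and overflow handling are omitted), and the function names are illustrative rather than taken from any library.

```python
import math
import struct


def float_to_bfloat16_bits(value):
    """Return the 16-bit bfloat16 pattern for a finite float32-range value."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    # Round to nearest even on the 16 bits that are about to be dropped.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(bits16):
    """Widen a bfloat16 pattern back to a Python float by zero-padding."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def split_fields(bits16):
    """Return (sign, exponent, mantissa): 1, 8, and 7 bits respectively."""
    return bits16 >> 15, (bits16 >> 7) & 0xFF, bits16 & 0x7F


print(hex(float_to_bfloat16_bits(1.0)))              # 0x3f80
print(split_fields(float_to_bfloat16_bits(1.0)))     # (0, 127, 0)
print(bfloat16_bits_to_float(float_to_bfloat16_bits(math.pi)))  # 3.140625

# 1e38 is far beyond the FP16 maximum of 65504 but stays finite in bfloat16.
big = bfloat16_bits_to_float(float_to_bfloat16_bits(1e38))
print(math.isfinite(big))
```

The pi example shows the cost of the 7-bit mantissa (only two or three significant decimal digits survive), while the 1e38 example shows the benefit of the 8-bit exponent.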

Network topology and interconnects

TPU systems use a hierarchical networking architecture with three distinct layers, each operating at a different scale and bandwidth.

Inter-chip interconnect (ICI)

ICI is the high-speed, low-latency link that connects TPU chips within a single slice. In TPU v4 and later generations that use a 3D torus, each chip has six ICI links (two per axis, one in each direction along X, Y, and Z). For TPU v5p, each ICI axis provides 90 GB/s of bandwidth per chip.

The 3D torus topology wraps around in all three dimensions so that chips on opposite edges of the mesh are directly connected. This provides higher bisection bandwidth compared to a simple mesh. Google also supports "twisted" torus configurations, where the wrap-around connections are offset. A 4x4x8 twisted topology provides approximately 70% higher bisection bandwidth than a non-twisted 4x4x8 topology.
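
The wrap-around behavior can be made concrete with modular arithmetic on chip coordinates. The sketch below covers only the plain (non-twisted) torus; the coordinate scheme and function names are assumptions made for illustration and do not correspond to any Cloud TPU API.

```python
def torus_neighbors(coord, shape):
    """Return the six neighbors of a chip in a non-twisted 3D torus."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # Modulo arithmetic implements the wrap-around link.
            n[axis] = (n[axis] + step) % shape[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, shape):
    """Minimum hop count between two chips, using wrap-around links."""
    return sum(min(abs(p - q), size - abs(p - q))
               for p, q, size in zip(a, b, shape))


def mesh_hops(a, b):
    """Minimum hop count in a plain mesh with no wrap-around links."""
    return sum(abs(p - q) for p, q in zip(a, b))


shape = (4, 4, 4)
print(torus_neighbors((0, 0, 0), shape))
# [(3, 0, 0), (1, 0, 0), (0, 3, 0), (0, 1, 0), (0, 0, 3), (0, 0, 1)]
print(torus_hops((0, 0, 0), (3, 3, 3), shape))  # 3
print(mesh_hops((0, 0, 0), (3, 3, 3)))          # 9
```

In a 4x4x4 arrangement the corner-to-corner distance drops from 9 hops in a mesh to 3 in a torus. Note that on an axis of size 2 both directions lead to the same chip, so the six links reach fewer than six distinct neighbors.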

ICI resiliency is enabled by default for slices at the cube scale or larger, automatically routing around optical link faults.

Data center network (DCN)

DCN connects TPU VMs to each other and to the broader Google Cloud network. It operates at significantly lower bandwidth than ICI but enables multi-slice configurations where more TPU chips are needed than a single slice can provide. In multi-slice setups, ICI handles intra-slice communication while DCN handles inter-slice data transfers.

Optical circuit switch (OCS)

Introduced with TPU v4, optical circuit switches allow the physical interconnect topology to be dynamically reconfigured. A TPU v4 pod uses OCS to connect "cubes" (groups of 64 chips in a 4x4x4 arrangement) into larger configurations. This reconfigurability supports different topology choices (such as twisted vs. non-twisted torus) and improves fault tolerance by routing around failed optical links.

TPU v7 (Ironwood) extends this approach, with each rack housing 64 chips in a cube connected by ICI in a 3D torus. Multiple cubes are linked through OCS to form pods (256 chips) and superpods (up to 9,216 chips, requiring 144 cubes).

Organizational hierarchy

TPU compute resources are organized in a hierarchical structure:

Level	Definition	Example
Chip	A single TPU ASIC die	One TPU v4 chip with 32 GiB HBM
Host	A CPU-based VM connected to one or more TPU chips via PCIe	A machine with 4 TPU v4 chips
Slice	A collection of chips within one pod connected by ICI	A v4 slice with 2x2x4 topology (16 chips)
Pod	The maximum set of chips connected by ICI within one physical installation	A TPU v4 pod with 4,096 chips
Multislice	Multiple slices coordinated over DCN for a single training job	Three v5e-256 slices (768 chips total)

A single-host configuration uses one TPU VM with its directly attached chips. A multi-host configuration distributes computation across multiple TPU VMs, requiring coordination over both ICI (for chip-to-chip transfers) and DCN (for host-to-host transfers).
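
The chip counts used in this article can be derived from topology strings and group sizes. The sketch below is plain arithmetic with illustrative function names. The default of four chips per host follows the v4 example in the table and is an assumption that does not hold for every accelerator type.

```python
from math import prod


def chips_in_topology(topology):
    """Number of chips in a topology string such as '2x2x4'."""
    return prod(int(dim) for dim in topology.lower().split("x"))


def groups_needed(chips, group_size):
    """Ceiling division: how many hosts or cubes are needed to hold the chips."""
    return -(-chips // group_size)


def multislice_chips(chips_per_slice, num_slices):
    """Total chips when several equal slices are joined over DCN."""
    return chips_per_slice * num_slices


chips = chips_in_topology("2x2x4")
print(chips)                        # 16
print(groups_needed(chips, 4))      # 4 hosts, assuming 4 chips per host
print(groups_needed(9216, 64))      # 144 cubes of 64 chips
print(multislice_chips(256, 3))     # 768
```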

Software stack

TPU nodes and TPU VMs use the same software compilation pipeline, centered on the XLA (Accelerated Linear Algebra) compiler.

XLA compiler

XLA is an open-source compiler that translates high-level operations from ML frameworks into optimized TPU machine code. It takes computation graphs expressed in the HLO (High-Level Operations) intermediate representation and performs optimizations including:

Operator fusion (combining multiple operations into a single kernel)
Memory layout optimization
Automatic parallelization across TPU cores
Communication scheduling for multi-chip and multi-host configurations

XLA is developed as part of the OpenXLA project, with contributions from Google, AMD, Apple, ARM, Intel, Meta, and NVIDIA, among others.

Supported frameworks

Framework	TPU integration method	Notes
TensorFlow	Native XLA support	Original TPU framework; tf.distribute for multi-device
JAX	Native XLA backend	Functional API with composable transforms; preferred for research
PyTorch	PyTorch/XLA bridge	Lazy evaluation model; records operations as IR graph, then compiles via XLA

JAX has become the preferred framework for large-scale TPU training at Google and in research settings. It provides composable function transformations (jit, vmap, pmap, pjit) that map naturally to TPU parallelism strategies. PyTorch/XLA enables PyTorch users to run on TPUs with minimal code changes by intercepting PyTorch operations and compiling them through XLA.

GSPMD and parallelism strategies

GSPMD (General and Scalable Parallelization for ML Computation Graphs) is the XLA partitioning system that automatically distributes computation across TPU chips. Users annotate tensors with sharding specifications, and GSPMD generates the necessary communication operations (all-reduce, all-gather, reduce-scatter) to maintain correctness.

Supported parallelism strategies include:

Data parallelism: Each chip processes a different batch of data with the same model weights. Gradients are synchronized via all-reduce.
Model parallelism: Model parameters are split across chips. Each chip holds a subset of the weights and processes the corresponding portion of each layer.
Fully sharded data parallelism (FSDP): Combines data parallelism with weight sharding. Weights are all-gathered before each operation and gradients are reduce-scattered afterward.
Pipeline parallelism: Different layers of the model are assigned to different groups of chips, with micro-batches flowing through the pipeline.
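
Data parallelism, the first strategy listed, can be sketched without any accelerator. In the toy example below, each simulated chip computes a gradient on its own shard of the batch, and an all-reduce averages the results so that every chip applies the same update. The one-parameter model, the data, and the function names are made up for illustration; on real hardware the collective is generated by XLA and runs over ICI or DCN.

```python
def local_gradient(w, shard):
    """dL/dw for L = mean((w * x - y)^2) over one chip's shard of the batch."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Simulated all-reduce: every participant receives the global mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Hypothetical (x, y) training pairs, split evenly across two simulated chips.
data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
num_chips = 2
shard_size = len(data) // num_chips
shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_chips)]

w = 1.0  # every chip holds the same copy of the weight
local = [local_gradient(w, shard) for shard in shards]
synced = all_reduce_mean(local)
full_batch = local_gradient(w, data)

print(local)       # [-6.0, -25.5]
print(synced)      # [-15.75, -15.75]
print(full_batch)  # -15.75
```

With equal-sized shards, the averaged gradient equals the gradient computed over the whole batch on one device, which is why data parallelism preserves the training result while dividing the work.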

Multislice training uses GSPMD within each slice (over ICI) and data parallelism across slices (over DCN). The XLA compiler automatically generates the inter-slice DCN communication code and overlaps it with computation.

Applications and models trained on TPUs

TPUs have been used to train and serve many of Google's largest AI systems:

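AlphaGo and AlphaZero: AlphaGo used TPU v1 chips for the neural network evaluation component of Monte Carlo tree search during gameplay. Because TPU v1 supports inference only, network training ran on other hardware; AlphaZero, for example, generated self-play games on first-generation TPUs and trained its networks on second-generation TPUs.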
BERT: Pre-trained on TPU v3 pod slices; the BERT paper reports that BERT-Base training completed in approximately four days on 16 TPU chips, and that BERT-Large took the same time on 64 chips.
PaLM: The 540 billion parameter model was trained on 6,144 TPU v4 chips using the Pathways system, distributing computation across two TPU v4 pods.
Gemini: Google's multimodal model family was trained on a mixture of TPU v4 and TPU v5e hardware.
Google Translate: Uses TPUs for inference in the neural machine translation pipeline.

Comparison with GPUs

Aspect	TPU	GPU
Design philosophy	Fixed-function ASIC for matrix math	General-purpose parallel processor
Precision formats	bfloat16, INT8, FP8 (v7); MXU accumulates in FP32	FP16, FP32, FP64, INT8, FP8; TF32 (Ampere+)
Interconnect	ICI (3D torus, up to 9,216 chips)	NVLink, NVSwitch, InfiniBand
Programming model	XLA compiler (TensorFlow, JAX, PyTorch/XLA)	CUDA, ROCm, Triton
Availability	Google Cloud only	Multiple cloud providers and on-premises
Optimal batch size	128 to 1,024	8 to 128
Software ecosystem	Narrower (XLA-based frameworks)	Broader (CUDA ecosystem, extensive library support)
Power efficiency (v1 era)	Up to 83x better incremental perf/watt vs CPU; up to 29x vs GPU (inference, per Jouppi et al. 2017)	Baseline for comparison

TPUs typically outperform GPUs on workloads that are dominated by large matrix multiplications with regular data access patterns. GPUs maintain advantages in workloads requiring flexible memory access, custom CUDA kernels, or support across multiple cloud providers and on-premises deployments.

Cloud TPU pricing

Cloud TPU pricing is measured in chip-hours. Google Cloud offers several pricing tiers:

Pricing tier	Description	Typical discount
On-demand	Pay-as-you-go with no commitment	Baseline price
1-year commitment	Reserved capacity for 12 months	Moderate discount
3-year commitment	Reserved capacity for 36 months	Largest discount (up to 60% off on-demand)
Preemptible / Spot	May be interrupted at any time	Up to 70% off on-demand

Representative on-demand pricing (subject to change):

TPU type	On-demand price (per chip-hour)	Preemptible price (per chip-hour)
TPU v2	$4.50	$1.35
TPU v3	$8.00	$2.40
TPU v5e	~$1.20	Varies by region
TPU v5p	~$4.20	Varies by region
TPU v6e	~$1.38	Varies by region

Billing accrues while a TPU node or TPU VM is in the READY state. In the Google Cloud console, prices are displayed per VM-hour rather than per chip-hour. For example, a single TPU v4 host with four chips shows as $12.88 per hour.
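
Converting between the per-VM price shown in the console and the per-chip-hour price is a matter of dividing by the chips per host. The sketch below uses only figures quoted in this section, which are representative and subject to change; the function names and the 10-hour job are illustrative.

```python
def per_chip_hour(vm_hour_price, chips_per_vm):
    """Convert a per-VM-hour console price into a per-chip-hour price."""
    return vm_hour_price / chips_per_vm


def job_cost(price_per_chip_hour, num_chips, hours):
    """On-demand cost of holding num_chips for the given number of hours."""
    return price_per_chip_hour * num_chips * hours


# A TPU v4 host with four chips shown at $12.88 per VM-hour.
print(round(per_chip_hour(12.88, 4), 2))  # 3.22

# Hypothetical job: 256 TPU v5e chips for 10 hours at about $1.20 per chip-hour.
print(round(job_cost(1.20, 256, 10), 2))  # 3072.0
```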

Edge TPU

In addition to the cloud-based TPU line, Google produces the Edge TPU, a compact ASIC designed for on-device inference at the network edge. The Edge TPU performs 4 trillion operations per second (4 TOPS) at only 2 watts of power, yielding 2 TOPS per watt. It supports only 8-bit integer (INT8) quantized models compiled through TensorFlow Lite.
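
To make the INT8 requirement concrete, the sketch below shows generic affine quantization, in which a real value is approximated as scale * (q - zero_point) with q stored as an 8-bit integer. This is a simplified illustration, not the TensorFlow Lite converter's actual procedure, and the function names and the example range are assumptions.

```python
def choose_qparams(min_val, max_val, qmin=-128, qmax=127):
    """Pick a scale and zero point that map [min_val, max_val] onto INT8."""
    # The representable range must include 0.0 so that zero is exact.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    if max_val == min_val:
        raise ValueError("range must have non-zero width")
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = round(qmin - min_val / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to the nearest INT8 code, clamping out-of-range inputs."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate real value from an INT8 code."""
    return (q - zero_point) * scale


# Hypothetical activation range of [0.0, 2.55], giving a step of about 0.01.
scale, zero_point = choose_qparams(0.0, 2.55)
q = quantize(1.0, scale, zero_point)
print(zero_point, q)  # -128 -28
print(abs(dequantize(q, scale, zero_point) - 1.0) <= scale / 2 + 1e-12)  # True
```

Values outside the chosen range are clamped, so the range selected during conversion determines how much accuracy the quantized model retains.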

Edge TPU hardware is sold under the Coral brand in several form factors: USB accelerator, M.2 module, mini PCIe card, system-on-module (SoM), and single-board computer (SBC). These devices are used for applications such as real-time object detection, audio classification, and pose estimation in environments where cloud connectivity is unavailable or latency requirements preclude round-trip network calls.

References

Jouppi, N.P., Young, C., Patil, N., et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017. https://arxiv.org/abs/1704.04760
Jouppi, N.P., Kurian, G., Li, S., et al. "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings." Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), 2023. https://arxiv.org/abs/2304.01433
Google Cloud. "TPU architecture." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm
Google Cloud. "About TPUs in GKE." GKE AI/ML Documentation. https://docs.cloud.google.com/kubernetes-engine/docs/concepts/tpus
Google Cloud. "Cloud TPU Multislice Overview." https://docs.cloud.google.com/tpu/docs/multislice-introduction
Google Cloud. "TPU v4." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/v4
Google Cloud. "TPU v5p." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/v5p
Google Cloud. "TPU v6e." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/v6e
Google Cloud. "TPU7x (Ironwood)." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/tpu7x
Google Cloud. "BFloat16: The Secret to High Performance on Cloud TPUs." Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
Chowdhery, A., Narang, S., Devlin, J., et al. "PaLM: Scaling Language Modeling with Pathways." Journal of Machine Learning Research, Vol. 24, 2023. https://dl.acm.org/doi/10.5555/3648699.3648939
Google. "Ironwood: The first Google TPU for the age of inference." Google Blog, April 2025. https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/ironwood-tpu-age-of-inference/
OpenXLA Project. "XLA: Accelerated Linear Algebra." https://openxla.org/xla
PyTorch/XLA. "XLA Overview." https://docs.pytorch.org/xla/master/learn/xla-overview.html
Google Cloud. "TPU Pricing." https://cloud.google.com/tpu/pricing
