A TPU node refers to a configuration in Google Cloud where one or more Tensor Processing Unit (TPU) chips are provisioned as a network-attached accelerator resource. In the Cloud TPU context, a TPU node specifically describes the legacy architecture in which a user's virtual machine (VM) communicates with a separate, inaccessible TPU host machine over gRPC. This architecture has been superseded by the TPU VM model, where users connect directly to the VM that is physically attached to the TPU hardware. More broadly, the term "TPU node" is also used to describe any individual unit of TPU compute within a larger TPU pod or cluster, encompassing one or more TPU chips connected to a host machine.
Imagine you have a really fast calculator that is great at doing one specific kind of math problem. A TPU node is like having that calculator sitting in a room at a big computer center. In the old setup (called "TPU node architecture"), you had to talk to the calculator through a walkie-talkie from another room, and you could never actually go into the room where the calculator was. In the new setup (called "TPU VM"), you get to sit right next to the calculator and use it directly. Either way, the calculator itself is the same super-fast math helper that lets computers learn from huge amounts of data much more quickly than a regular computer could on its own.
Google began developing TPUs in 2013 and first deployed them internally in 2015. The initial motivation was to handle the projected demand for neural network inference workloads across Google's data centers. At the time, Google estimated that if every user spoke to their Android phone for just three minutes per day using voice search, existing CPU-based infrastructure would need to double in capacity. Rather than doubling their data center footprint, Google opted to build a custom ASIC tailored to neural network computation.
The first public description of the TPU appeared in a 2017 paper by Norman Jouppi and colleagues at the 44th International Symposium on Computer Architecture (ISCA). This paper reported that the TPU v1 achieved 15 to 30 times higher performance and 30 to 80 times higher performance-per-watt compared to contemporary CPUs and GPUs on production neural network inference workloads. The paper became the second most cited publication in ISCA's 50-year history.
Google made TPUs available to external users through Google Cloud Platform starting with TPU v2 in 2018. The initial Cloud TPU offering used what is now called the "TPU node" architecture, where users provisioned a standard Compute Engine VM (typically an n1 instance) and connected to a separate TPU host over a gRPC network interface.
| Generation | Year | Process node | Peak compute per chip | HBM per chip | Memory bandwidth | Max chips per pod | Notable feature |
|---|---|---|---|---|---|---|---|
| TPU v1 | 2015 | 28 nm | 23 TOPS (INT8) | None (8 GiB DDR3) | 34 GB/s | N/A (single chip) | First deployment; inference only |
| TPU v2 | 2017 | 16 nm | 45 TFLOPS (bf16) | 16 GiB HBM | 600 GB/s | 256 | Added training; introduced bfloat16 |
| TPU v3 | 2018 | 16 nm | 123 TFLOPS (bf16) | 32 GiB HBM | 900 GB/s | 1,024 | Water cooling; ~2.7x v2 peak compute |
| TPU v4 | 2021 | 7 nm | 275 TFLOPS (bf16) | 32 GiB HBM | 1,200 GB/s | 4,096 | Optical circuit switches (OCS); 3D torus |
| TPU v5e | 2023 | Not disclosed | 197 TFLOPS (bf16) | 16 GiB HBM2e | 819 GB/s | 256 | Cost-efficient variant |
| TPU v5p | 2023 | Not disclosed | 459 TFLOPS (bf16) | 95 GiB HBM | 2,765 GB/s | 8,960 | 2.8x faster LLM training than v4 |
| TPU v6e (Trillium) | 2024 | Not disclosed | 918 TFLOPS (bf16) | 32 GiB HBM | 1,640 GB/s | 256 | 4.7x v5e performance; 3rd-gen SparseCore |
| TPU v7 (Ironwood) | 2025 | Not disclosed | 4,614 TFLOPS (FP8) | 192 GiB HBM | 7,370 GB/s | 9,216 | First inference-focused TPU; 42.5 ExaFLOPS per pod |
The original Cloud TPU deployment model, known as the TPU node architecture, uses a two-machine design. In this configuration, the user provisions a standard Google Compute Engine VM (the "user VM") that runs application code. The user VM communicates with a separate "TPU host" VM over a gRPC connection. The TPU host is the machine physically connected to the TPU chips via PCIe, but in the TPU node model, the user has no direct access to this host.
Training data in the TPU node architecture must be loaded from Google Cloud Storage (GCS) because the user VM and the TPU host are separate machines. The user cannot store training data on local disk and have the TPU access it directly.
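As a rough illustration, a minimal TensorFlow sketch of the legacy flow might look like the following. The TPU name and the GCS bucket path are placeholders; the key points are that the user VM resolves the remote TPU host by name and connects over gRPC, and that input files must live in GCS because the TPU host, not the user VM, reads them.

```python
import tensorflow as tf

# Legacy TPU node flow (sketch): resolve the remote TPU host and connect over gRPC.
# "my-tpu-node" and the gs:// path below are placeholders.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-node")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Data must be readable by the TPU host, so it is staged in GCS rather than on
# the user VM's local disk.
files = tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord")
dataset = tf.data.TFRecordDataset(files).batch(128).prefetch(tf.data.AUTOTUNE)
```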
| Limitation | Description |
|---|---|
| No direct host access | Users cannot SSH into the TPU host, making it difficult to debug training failures, inspect logs, or profile TPU utilization |
| gRPC overhead | All communication between user code and TPU hardware passes through a network interface, adding latency compared to direct PCIe access |
| GCS data dependency | Training data must reside in Google Cloud Storage; local storage on the user VM is not directly accessible to the TPU |
| Separate provisioning | The user VM and TPU resource must be created and managed independently, adding operational complexity |
| Limited framework support | Newer tools and APIs (such as prepare_tf_dataset() in Hugging Face Transformers) only support the TPU VM architecture |
As of April 2025, the TPU node architecture is officially deprecated by Google Cloud. Google recommends migrating all workloads to the TPU VM architecture. The deprecation was driven by the architectural advantages of TPU VMs, including direct SSH access, simpler data pipelines, and better debugging capabilities.
The TPU VM architecture replaced the TPU node model and is now the recommended way to use Cloud TPUs. In this design, the user connects directly via SSH to a Linux VM that is physically attached to the TPU hardware. There is no separate user VM or gRPC intermediary.
| Feature | TPU node (legacy) | TPU VM (current) |
|---|---|---|
| Host access | No direct access to TPU host | Direct SSH to TPU host VM |
| Data loading | Must use GCS buckets | Can use local storage, GCS, or network file systems |
| Debugging | Limited; no access to host logs | Full root access; can inspect logs and profiles |
| VM provisioning | Separate user VM + TPU resource | Single TPU VM resource |
| Framework support | TensorFlow primarily | TensorFlow, JAX, PyTorch/XLA |
| gRPC overhead | Yes | No (direct PCIe connection) |
In the TPU VM model, each set of four TPU chips is connected to a CPU host machine using a PCIe link. A single TPU VM may host one or more TPU chips depending on the accelerator type. For multi-host workloads, multiple TPU VMs coordinate over the data center network (DCN), while the TPU chips within each host communicate over the inter-chip interconnect (ICI).
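A small JAX snippet, run on each host of a TPU VM slice, makes this hierarchy visible; the exact counts depend on the accelerator type provisioned.

```python
import jax

# Device topology as seen from one TPU VM host in a slice.
print(jax.local_device_count())  # chips attached to this host over PCIe (e.g. 4)
print(jax.device_count())        # all chips in the slice, reachable over ICI
print(jax.process_count())       # number of TPU VM hosts coordinating over DCN
```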
Each TPU chip is an application-specific integrated circuit (ASIC) designed by Google specifically for machine learning computation. Unlike general-purpose processors, TPUs are optimized for the dense matrix arithmetic that dominates neural network training and inference.
The primary compute unit inside a TPU chip is the TensorCore. Each TensorCore contains several functional blocks:

- A matrix multiply unit (MXU), a systolic array of multiply-accumulators (128x128 in most generations) that performs the bulk of matrix arithmetic
- A vector unit for elementwise operations such as activation functions, normalization, and softmax
- A scalar unit that handles control flow, address generation, and other scalar bookkeeping
The number of TensorCores per chip varies by generation. TPU v7 (Ironwood) chips contain two TensorCores, with each chiplet packaging one TensorCore, two SparseCores, and 96 GiB of HBM.
The MXU uses a weight-stationary systolic array design. In this approach:

- Weight values are loaded into the grid of multiply-accumulate units and remain stationary for the duration of the computation, so each weight is fetched from memory once rather than once per multiplication
- Input activations stream into the array from one edge, advancing one step per clock cycle
- Partial sums propagate through adjacent ALUs and accumulate as they move, with completed results emerging from the far edge of the array
This architecture eliminates the memory bandwidth bottleneck that limits conventional processors during matrix multiplication. Because wires connect only spatially adjacent ALUs, they can be kept short, which reduces both power consumption and signal propagation delay. When fully utilized, a 128x128 MXU can perform one bf16[8,128] x bf16[128,128] matrix multiplication producing an f32[8,128] result every 8 clock cycles.
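A functional sketch of the weight-stationary idea is shown below. It computes the same result as the hardware but does not model the cycle-by-cycle pipelining; the point is only that each position (i, j) holds one weight while activations stream past and partial sums accumulate.

```python
import numpy as np

def weight_stationary_matmul(x, w):
    """Sketch of weight-stationary accumulation: ALU (i, j) holds w[i, j];
    activations x[:, i] flow past it and partial sums accumulate into column j,
    producing x @ w without re-reading w from memory."""
    n, k = x.shape
    k2, m = w.shape
    assert k == k2
    out = np.zeros((n, m), dtype=np.float32)
    for j in range(m):          # column of stationary weights
        for i in range(k):      # ALU (i, j) holds w[i, j] for the whole pass
            out[:, j] += x[:, i].astype(np.float32) * np.float32(w[i, j])
    return out

x = np.random.rand(8, 128).astype(np.float32)
w = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(weight_stationary_matmul(x, w), x @ w, atol=1e-3)
```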
Starting with TPU v4, Google introduced SparseCores as additional dataflow processors designed to accelerate sparse operations common in recommendation and ranking models. These processors handle large embedding table lookups that are memory-bound rather than compute-bound. TPU v6e includes two SparseCores per chip, while TPU v5p and v7 include four per chip. The third-generation SparseCore in TPU v6e introduced variable SIMD widths (8 elements for FP32, 16 for bfloat16) and improved memory access patterns that reduce wasted bandwidth.
TPUs use the bfloat16 (Brain Floating Point) number format, developed by Google Brain. bfloat16 is a 16-bit floating point format with one sign bit, eight exponent bits, and seven mantissa bits. This differs from the IEEE 754 half-precision (FP16) format, which allocates five exponent bits and ten mantissa bits.
The design rationale prioritizes dynamic range over precision. Neural networks are generally more sensitive to overflow and underflow (which depend on exponent range) than to rounding errors (which depend on mantissa precision). By matching the exponent range of FP32 while halving the storage size, bfloat16 effectively doubles the usable HBM capacity for model parameters and activations. The MXU performs multiplications in bfloat16 and accumulates results in FP32, preventing numerical drift during long chains of multiply-accumulate operations.
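The following JAX sketch illustrates both properties: casting to bfloat16 preserves the value's magnitude while dropping mantissa precision, and `preferred_element_type` requests FP32 accumulation of bf16 products, mirroring the MXU's behavior.

```python
import jax
import jax.numpy as jnp

# Range is preserved, precision is reduced: 3.14159265 rounds to ~3.140625 in bf16.
x = jnp.float32(3.14159265)
print(x.astype(jnp.bfloat16))

# bf16 inputs, FP32 accumulation (the MXU pattern).
a = jnp.ones((8, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)
out = jax.lax.dot_general(a, b, (((1,), (0,)), ((), ())),
                          preferred_element_type=jnp.float32)
print(out.dtype)   # float32
```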
TPU systems use a hierarchical networking architecture with three distinct layers, each operating at a different scale and bandwidth.
ICI is the high-speed, low-latency link that connects TPU chips within a single slice. Starting with TPU v4, each chip has six ICI links (one in each direction along the X, Y, and Z axes), forming a 3D torus topology. For TPU v5p, each ICI axis provides 90 GB/s of bandwidth per chip.
The 3D torus topology wraps around in all three dimensions so that chips on opposite edges of the mesh are directly connected. This provides higher bisection bandwidth compared to a simple mesh. Google also supports "twisted" torus configurations, where the wrap-around connections are offset. A 4x4x8 twisted topology provides approximately 70% higher bisection bandwidth than a non-twisted 4x4x8 topology.
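A small sketch (not tied to any TPU API) shows what the wrap-around links mean: each chip at coordinate (x, y, z) has six neighbors, and chips on opposite faces of the mesh are adjacent because coordinates wrap modulo the topology shape.

```python
# Six ICI neighbors of a chip at (x, y, z) in a wrap-around (torus) topology of
# shape (X, Y, Z). The modulo implements the wrap-around links.
def torus_neighbors(coord, shape):
    x, y, z = coord
    X, Y, Z = shape
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [((x + dx) % X, (y + dy) % Y, (z + dz) % Z) for dx, dy, dz in steps]

print(torus_neighbors((0, 0, 0), (4, 4, 8)))  # includes (3, 0, 0) and (0, 0, 7)
```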
ICI resiliency is enabled by default for slices at the cube scale or larger, automatically routing around optical link faults.
DCN connects TPU VMs to each other and to the broader Google Cloud network. It operates at significantly lower bandwidth than ICI but enables multi-slice configurations where more TPU chips are needed than a single slice can provide. In multi-slice setups, ICI handles intra-slice communication while DCN handles inter-slice data transfers.
Introduced with TPU v4, optical circuit switches allow the physical interconnect topology to be dynamically reconfigured. A TPU v4 pod uses OCS to connect "cubes" (groups of 64 chips in a 4x4x4 arrangement) into larger configurations. This reconfigurability supports different topology choices (such as twisted vs. non-twisted torus) and improves fault tolerance by routing around failed optical links.
TPU v7 (Ironwood) extends this approach, with each rack housing 64 chips in a cube connected by ICI in a 3D torus. Multiple cubes are linked through OCS to form pods (256 chips) and superpods (up to 9,216 chips, requiring 144 cubes).
TPU compute resources are organized in a hierarchical structure:
| Level | Definition | Example |
|---|---|---|
| Chip | A single TPU ASIC die | One TPU v4 chip with 32 GiB HBM |
| Host | A CPU-based VM connected to one or more TPU chips via PCIe | A machine with 4 TPU v4 chips |
| Slice | A collection of chips within one pod connected by ICI | A v4 slice with 2x2x4 topology (16 chips) |
| Pod | The maximum set of chips connected by ICI within one physical installation | A TPU v4 pod with 4,096 chips |
| Multislice | Multiple slices coordinated over DCN for a single training job | Three v5e-256 slices (768 chips total) |
A single-host configuration uses one TPU VM with its directly attached chips. A multi-host configuration distributes computation across multiple TPU VMs, requiring coordination over both ICI (for chip-to-chip transfers) and DCN (for host-to-host transfers).
TPU nodes and TPU VMs use the same software compilation pipeline, centered on the XLA (Accelerated Linear Algebra) compiler.
XLA is an open-source compiler that translates high-level operations from ML frameworks into optimized TPU machine code. It takes computation graphs expressed in the HLO (High-Level Operations) intermediate representation and performs optimizations including:

- Fusion of adjacent operations into single kernels, avoiding round trips to HBM between elementwise steps
- Memory layout assignment matched to the tile shapes of the MXU and vector units
- Buffer allocation and scheduling to keep on-chip memory usage within limits
- Algebraic simplifications such as constant folding and common-subexpression elimination
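The effect is easiest to see from JAX, which exposes the lowered IR directly. A minimal sketch, assuming a recent JAX version:

```python
import jax
import jax.numpy as jnp

# The elementwise ops and the reduction below are typically fused by XLA into a
# small number of kernels instead of separate passes over memory.
def fused(x):
    return jnp.sum(jnp.tanh(x) * 2.0)

lowered = jax.jit(fused).lower(jnp.ones((128, 128)))
print(lowered.as_text())        # StableHLO/HLO text handed to the XLA backend
compiled = lowered.compile()    # backend-specific executable (TPU code on a TPU VM)
```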
XLA is developed as part of the OpenXLA project, with contributions from Google, AMD, Apple, ARM, Intel, Meta, and NVIDIA, among others.
| Framework | TPU integration method | Notes |
|---|---|---|
| TensorFlow | Native XLA support | Original TPU framework; tf.distribute for multi-device |
| JAX | Native XLA backend | Functional API with composable transforms; preferred for research |
| PyTorch | PyTorch/XLA bridge | Lazy evaluation model; records operations as IR graph, then compiles via XLA |
JAX has become the preferred framework for large-scale TPU training at Google and in research settings. It provides composable function transformations (jit, vmap, pmap, pjit) that map naturally to TPU parallelism strategies. PyTorch/XLA enables PyTorch users to run on TPUs with minimal code changes by intercepting PyTorch operations and compiling them through XLA.
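As a brief illustration of how these transformations compose, the sketch below jit-compiles a per-example gradient computation; the function and shapes are arbitrary examples, not taken from any particular model.

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.mean((x @ w) ** 2)

# grad differentiates, vmap maps over the batch axis, jit compiles via XLA.
batched_grad = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0)))

w = jnp.ones((4, 2))
x = jnp.ones((8, 3, 4))            # 8 examples, each a (3, 4) input
print(batched_grad(w, x).shape)    # (8, 4, 2): one gradient per example
```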
GSPMD (General and Scalable Parallelization for ML Computation Graphs) is the XLA partitioning system that automatically distributes computation across TPU chips. Users annotate tensors with sharding specifications, and GSPMD generates the necessary communication operations (all-reduce, all-gather, reduce-scatter) to maintain correctness.
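In JAX, these annotations are expressed with a device mesh and partition specs, as in the sketch below; it assumes a host (or slice) with 8 TPU chips, so the mesh shape would need adjusting for other configurations.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Assumes 8 devices; arrange them into a 2x4 mesh with named axes.
devices = np.array(jax.devices()).reshape(2, 4)
mesh = Mesh(devices, axis_names=("data", "model"))

x = jnp.ones((1024, 512))
# Shard rows across "data" and columns across "model"; GSPMD/XLA inserts the
# collectives (all-gather, reduce-scatter, ...) needed to keep later ops correct.
x_sharded = jax.device_put(x, NamedSharding(mesh, PartitionSpec("data", "model")))
print(x_sharded.sharding)
```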
Supported parallelism strategies include:

- Data parallelism, where each chip holds a full model replica and processes a different shard of the global batch
- Tensor (model) parallelism, where individual weight matrices are split across chips
- Pipeline parallelism, where different layers or stages are placed on different chips
- Fully sharded data parallelism (FSDP/ZeRO-style), where parameters, gradients, and optimizer state are sharded across the data-parallel axis
Multislice training uses GSPMD within each slice (over ICI) and data parallelism across slices (over DCN). The XLA compiler automatically generates the inter-slice DCN communication code and overlaps it with computation.
TPU nodes and TPU VMs have been used to train many of Google's largest AI systems, including:

- PaLM (540 billion parameters), trained across two TPU v4 pods
- The Gemini model family, trained on TPU v4 and v5e accelerators
- AlphaGo and AlphaZero, which used TPUs for training and self-play inference
- AlphaFold for protein structure prediction
| Aspect | TPU | GPU |
|---|---|---|
| Design philosophy | Fixed-function ASIC for matrix math | General-purpose parallel processor |
| Precision formats | bfloat16, INT8, FP8 (v7); MXU accumulates in FP32 | FP16, FP32, FP64, INT8, FP8; TF32 (Ampere+) |
| Interconnect | ICI (3D torus, up to 9,216 chips) | NVLink, NVSwitch, InfiniBand |
| Programming model | XLA compiler (TensorFlow, JAX, PyTorch/XLA) | CUDA, ROCm, Triton |
| Availability | Google Cloud only | Multiple cloud providers and on-premises |
| Optimal batch size | 128 to 1,024 | 8 to 128 |
| Software ecosystem | Narrower (XLA-based frameworks) | Broader (CUDA ecosystem, extensive library support) |
| Power efficiency (v1 era) | 83x better perf/watt vs CPU; 29x vs GPU (inference) | Baseline for comparison |
TPUs typically outperform GPUs on workloads that are dominated by large matrix multiplications with regular data access patterns. GPUs maintain advantages in workloads requiring flexible memory access, custom CUDA kernels, or support across multiple cloud providers and on-premises deployments.
Cloud TPU pricing is measured in chip-hours. Google Cloud offers several pricing tiers:
| Pricing tier | Description | Typical discount |
|---|---|---|
| On-demand | Pay-as-you-go with no commitment | Baseline price |
| 1-year commitment | Reserved capacity for 12 months | Moderate discount |
| 3-year commitment | Reserved capacity for 36 months | Largest discount (up to 60% off on-demand) |
| Preemptible / Spot | May be interrupted at any time | Up to 70% off on-demand |
Representative on-demand pricing (subject to change):
| TPU type | On-demand price (per chip-hour) | Preemptible price (per chip-hour) |
|---|---|---|
| TPU v2 | $4.50 | $1.35 |
| TPU v3 | $8.00 | $2.40 |
| TPU v5e | ~$1.20 | Varies by region |
| TPU v5p | ~$4.20 | Varies by region |
| TPU v6e | ~$1.38 | Varies by region |
Billing accrues while a TPU node or TPU VM is in the READY state. In the Google Cloud console, prices are displayed per VM-hour rather than per chip-hour. For example, a single TPU v4 host with four chips shows as $12.88 per hour.
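A back-of-envelope conversion between the two views (the $3.22 per-chip rate is simply inferred from the $12.88 console figure above, not an authoritative price):

```python
# Per-VM-hour price from an assumed per-chip-hour rate.
chip_hour_usd = 3.22
chips_per_host = 4                        # one TPU v4 host exposes 4 chips
vm_hour_usd = chip_hour_usd * chips_per_host
print(f"${vm_hour_usd:.2f} per VM-hour")  # $12.88, matching the console display
```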
In addition to the cloud-based TPU line, Google produces the Edge TPU, a compact ASIC designed for on-device inference at the network edge. The Edge TPU performs 4 trillion operations per second (4 TOPS) at only 2 watts of power, yielding 2 TOPS per watt. It supports only 8-bit integer (INT8) quantized models compiled through TensorFlow Lite.
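A minimal sketch of producing the fully INT8-quantized TFLite model that the Edge TPU toolchain expects is shown below; the saved-model path, input shape, and calibration data are placeholders, and the resulting .tflite file would still be passed through the Edge TPU compiler afterwards.

```python
import tensorflow as tf

# Calibration data for full-integer quantization (placeholder shape/values).
def representative_dataset():
    for _ in range(100):
        yield [tf.random.uniform((1, 224, 224, 3))]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()  # INT8 model ready for the Edge TPU compiler
```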
Edge TPU hardware is sold under the Coral brand in several form factors: USB accelerator, M.2 module, mini PCIe card, system-on-module (SoM), and single-board computer (SBC). These devices are used for applications such as real-time object detection, audio classification, and pose estimation in environments where cloud connectivity is unavailable or latency requirements preclude round-trip network calls.