TPU Node
Last reviewed
Sources
17 citations
Review status
Source-backed
Revision
v6 · 4,244 words
A TPU node is the legacy Google Cloud architecture for accessing Tensor Processing Unit (TPU) hardware, in which a user's virtual machine (VM) runs application code and communicates with a separate, inaccessible TPU host machine over a gRPC network connection.[4] It is the predecessor to the TPU VM architecture, where users instead connect directly via SSH to the machine that is physically attached to the TPU chips.[3] As Google Cloud's documentation describes the older model, "you could only access Cloud TPUs remotely. You would typically create one or more VMs that would then communicate with Cloud TPU host machines over the network using gRPC."[16] The TPU node (also written "TPU Node") architecture has been deprecated since the introduction of Cloud TPU VMs on June 2, 2021, and TPU v3 is the only TPU generation that still supports it.[16][17] More broadly, the term "TPU node" is sometimes also used to describe any individual unit of TPU compute within a larger TPU pod or cluster, encompassing one or more TPU chips connected to a host machine.
Imagine you have a really fast calculator that is great at doing one specific kind of math problem. A TPU node is like having that calculator sitting in a room at a big computer center. In the old setup (called "TPU node architecture"), you had to talk to the calculator through a walkie-talkie from another room, and you could never actually go into the room where the calculator was. In the new setup (called "TPU VM"), you get to sit right next to the calculator and use it directly. Either way, the calculator itself is the same super-fast math helper that lets computers learn from huge amounts of data much more quickly than a regular computer could on its own.
A TPU node is the original Cloud TPU access model, sometimes called the "network-attached" or "remote" model. In this design, the TPU chips are not directly reachable by the user. Instead, the user provisions a standard Google Compute Engine VM (typically an n1 instance) that talks to a separate TPU host over gRPC.[4] The user owns and logs into the Compute Engine VM, but the TPU host machine that is physically wired to the accelerators stays hidden behind the network interface.
This is the defining characteristic that distinguishes a TPU node from a TPU VM: with a TPU node you reach the accelerator across the data center network, whereas with a TPU VM you run your code on the host that is directly attached to the silicon. Google Cloud documentation states that "the TPU VM architecture lets you directly connect to the VM physically connected to the TPU device using SSH," which is precisely the capability the older TPU node model lacked.[3]
Google began developing TPUs in 2013 and first deployed them internally in 2015. The initial motivation was to handle the projected demand for neural network inference workloads across Google's data centers. At the time, Google estimated that if every user spoke to their Android phone for just three minutes per day using voice search, existing CPU-based infrastructure would need to double in capacity. Rather than doubling its data center footprint, Google opted to build a custom ASIC tailored to neural network computation.
The first detailed public description of the TPU appeared in a 2017 paper by Norman Jouppi and colleagues at the 44th International Symposium on Computer Architecture (ISCA).[1] This paper reported that the TPU v1 achieved 15 to 30 times higher performance and 30 to 80 times higher performance-per-watt compared to contemporary CPUs and GPUs on production neural network inference workloads.[1] The paper became the second most cited publication in ISCA's 50-year history.
Google made TPUs available to external users through Google Cloud Platform starting with TPU v2 in 2018. The initial Cloud TPU offering used what is now called the "TPU node" architecture, where users provisioned a standard Compute Engine VM (typically an n1 instance) and connected to a separate TPU host over a gRPC network interface.[4] The TPU VM architecture that replaced it was announced on June 2, 2021, after a private preview that began in October 2020.[16]
| Generation | Year | Process node | Peak TFLOPS/TOPS (per chip) | Memory per chip | Memory bandwidth | Max chips per pod | Notable feature |
|---|---|---|---|---|---|---|---|
| TPU v1 | 2015 | 28 nm | 92 TOPS (INT8) | 8 GiB DDR3 (no HBM) | 34 GB/s | N/A (single chip) | First deployment; inference only |
| TPU v2 | 2017 | 16 nm | 45 (bf16) | 16 GiB HBM | 600 GB/s | 256 | Added training; introduced bfloat16 |
| TPU v3 | 2018 | 16 nm | 123 (bf16) | 32 GiB HBM | 900 GB/s | 1,024 | Water cooling; 2x v2 performance |
| TPU v4 | 2021 | 7 nm | 275 (bf16) | 32 GiB HBM | 1,200 GB/s | 4,096 | Optical circuit switches (OCS); 3D torus [6] |
| TPU v5e | 2023 | N/A | 197 (bf16) | 16 GiB HBM2e | 819 GB/s | 256 | Cost-efficient variant |
| TPU v5p | 2023 | N/A | 459 (bf16) | 95 GiB HBM | 2,765 GB/s | 8,960 | 2.8x faster LLM training than v4 [7] |
| TPU v6e (Trillium) | 2024 | N/A | 918 (bf16) | 32 GiB HBM | 1,640 GB/s | 256 | 4.7x v5e performance; 3rd-gen SparseCore [8] |
| TPU v7 (Ironwood) | 2025 | N/A | 4,614 (FP8) | 192 GiB HBM | 7,370 GB/s | 9,216 | Designed primarily for large-scale inference; 42.5 ExaFLOPS (FP8) per pod [9][12] |
The original Cloud TPU deployment model, known as the TPU node architecture, uses a two-machine design. In this configuration, the user provisions a standard Google Compute Engine VM (the "user VM") that runs application code. The user VM communicates with a separate "TPU host" VM over a gRPC connection. The TPU host is the machine physically connected to the TPU chips via PCIe, but in the TPU node model, the user has no direct access to this host.[4]
Training data in the TPU node architecture must be loaded from Google Cloud Storage (GCS) because the user VM and the TPU host are separate machines.[3] The user cannot store training data on local disk and have the TPU access it directly.
| Limitation | Description |
|---|---|
| No direct host access | Users cannot SSH into the TPU host, making it difficult to debug training failures, inspect logs, or profile TPU utilization |
| gRPC overhead | All communication between user code and TPU hardware passes through a network interface, adding latency compared to direct PCIe access |
| GCS data dependency | Training data must reside in Google Cloud Storage; local storage on the user VM is not directly accessible to the TPU |
| Separate provisioning | The user VM and TPU resource must be created and managed independently, adding operational complexity |
| Limited framework support | Newer tools and APIs (such as prepare_tf_dataset() in Hugging Face Transformers) only support the TPU VM architecture |
The TPU node architecture is deprecated, and Google Cloud recommends migrating all workloads to the TPU VM architecture.[3] In Google Kubernetes Engine (GKE), TPU v3 is the only TPU generation that still supports the TPU node architecture; every newer generation requires TPU VMs.[17] Between April and June 2024, Google migrated existing legacy TPU notebooks (such as those in Colab) onto modern TPU VM machines.[17] According to Google's release notes, the legacy TPU node accelerators were deprecated because "TPU VM accelerators improve usability, reliability, and debuggability, as well as enable support for modern JAX on TPU."[17] The Cloud TPU API used to manage TPU nodes is likewise no longer under active development; Google recommends managing TPU resources through GKE or Compute Engine instead.[17]
The TPU VM architecture replaced the TPU node model and is now the recommended way to use Cloud TPUs.[3] Announced on June 2, 2021, it lets the user connect directly via SSH to a Linux VM that is physically attached to the TPU hardware, with no separate user VM or gRPC intermediary.[16] Google Cloud documentation defines this VM plainly: "A TPU VM, also known as a worker, is a virtual machine running Linux that has access to the underlying TPUs."[3]
The motivation for the change was both usability and performance. As the launch announcement put it, "Cloud TPU VMs run on the TPU host machines that are directly attached to TPU accelerators," so with TPU VMs "you may also achieve performance gains because your code no longer needs to make round trips across the datacenter network to reach the TPUs."[16] In the older TPU node model, by contrast, every interaction with the accelerator crossed that network.
The core difference is where your code runs relative to the TPU hardware. With a TPU node, your code runs on a separate Compute Engine VM and reaches the TPU over the data center network using gRPC; with a TPU VM, your code runs on the host that is directly attached to the TPU chips, which you reach by SSH.[3][16] The practical consequences (host access, data loading, debugging, and overhead) follow from that single architectural choice:
| Feature | TPU node (legacy) | TPU VM (current) |
|---|---|---|
| Host access | No direct access to TPU host | Direct SSH to TPU host VM |
| Data loading | Must use GCS buckets | Can use local storage, GCS, or network file systems |
| Debugging | Limited; no access to host logs | Full root access; can inspect logs and profiles |
| VM provisioning | Separate user VM + TPU resource | Single TPU VM resource |
| Framework support | TensorFlow primarily | TensorFlow, JAX, PyTorch/XLA |
| gRPC overhead | Yes | No (direct PCIe connection) |
| Status | Deprecated (GKE: TPU v3 only) | Recommended |
In the TPU VM model, each set of four TPU chips is connected to a CPU host machine using a PCIe link.[3] A single TPU VM may host one or more TPU chips depending on the accelerator type. For multi-host workloads, multiple TPU VMs coordinate over the data center network (DCN), while the TPU chips within each host communicate over the inter-chip interconnect (ICI).[3]
Each TPU chip is an application-specific integrated circuit (ASIC) designed by Google specifically for machine learning computation. Unlike general-purpose processors, TPUs are optimized for the dense matrix arithmetic that dominates neural network training and inference.
The primary compute unit inside a TPU chip is the TensorCore. Each TensorCore contains several functional blocks:
The number of TensorCores per chip varies by generation. TPU v7 (Ironwood) chips contain two TensorCores, with each chiplet packaging one TensorCore, two SparseCores, and 96 GiB of HBM.[9]
The MXU uses a weight-stationary systolic array design. In this approach:
This architecture eliminates the memory bandwidth bottleneck that limits conventional processors during matrix multiplication.[1] Because wires connect only spatially adjacent ALUs, they can be kept short, which reduces both power consumption and signal propagation delay. When fully utilized, a 128x128 MXU can perform one bf16[8,128] x bf16[128,128] matrix multiplication producing an f32[8,128] result every 8 clock cycles.[3]
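The throughput figure quoted above can be turned into a back-of-the-envelope estimate. The sketch below only restates that arithmetic; the clock frequency and the number of MXUs passed to `peak_tflops` are hypothetical inputs chosen for illustration, not published specifications for any TPU generation.

```python
# Shapes and cycle count taken from the sentence above:
# bf16[8,128] x bf16[128,128] -> f32[8,128] every 8 clock cycles.
ROWS, DEPTH, COLS = 8, 128, 128
CYCLES_PER_MATMUL = 8

# One multiply-accumulate (MAC) per (row, depth, column) triple.
macs_per_matmul = ROWS * DEPTH * COLS
macs_per_cycle = macs_per_matmul // CYCLES_PER_MATMUL

# By convention a MAC counts as two floating-point operations.
flops_per_cycle = 2 * macs_per_cycle


def peak_tflops(clock_hz, mxus=1):
    """Peak TFLOPS for `mxus` fully utilized MXUs at `clock_hz`.

    Both arguments are illustrative assumptions supplied by the caller.
    """
    return flops_per_cycle * clock_hz * mxus / 1e12


print(macs_per_cycle)    # 16384, i.e. one MAC per cell of the 128x128 array
print(flops_per_cycle)   # 32768
print(peak_tflops(1e9))  # one MXU at a hypothetical 1 GHz clock
```

The result of 16,384 MACs per cycle equals 128 x 128, which matches the picture of every cell in the systolic array performing one multiply-accumulate per clock when the array is kept full.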
Starting with TPU v4, Google introduced SparseCores as additional dataflow processors designed to accelerate sparse operations common in recommendation and ranking models.[2] These processors handle large embedding table lookups that are memory-bound rather than compute-bound. TPU v6e includes two SparseCores per chip, while TPU v5p and v7 include four per chip.[7] The third-generation SparseCore in TPU v6e introduced variable SIMD widths (8 elements for FP32, 16 for bfloat16) and improved memory access patterns that reduce wasted bandwidth.[8]
TPUs use the bfloat16 (Brain Floating Point) number format, developed by Google Brain.[10] bfloat16 is a 16-bit floating point format with one sign bit, eight exponent bits, and seven mantissa bits. This differs from the IEEE 754 half-precision (FP16) format, which allocates five exponent bits and ten mantissa bits.
The design rationale prioritizes dynamic range over precision. Neural networks are generally more sensitive to overflow and underflow (which depend on exponent range) than to rounding errors (which depend on mantissa precision). By matching the exponent range of FP32 while halving the storage size, bfloat16 effectively doubles the usable HBM capacity for model parameters and activations.[10] The MXU performs multiplications in bfloat16 and accumulates results in FP32, preventing numerical drift during long chains of multiply-accumulate operations.
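The bit layout and the reason for wide accumulation can be illustrated without any TPU hardware. The following is a minimal sketch using only the Python standard library: it emulates bfloat16 by rounding a float32 bit pattern to its top 16 bits. The helper names are hypothetical, NaN handling is omitted, and Python's built-in float (which is wider than FP32) stands in for the FP32 accumulator.

```python
import math
import struct


def to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x, rounding to nearest even.

    bfloat16 is the upper half of the float32 encoding, so conversion
    only needs to round away the low 16 mantissa bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even neighbour
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(b):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def round_bf16(x):
    """Round x to the nearest value representable in bfloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))


def fields(b):
    """Split a bfloat16 pattern into (sign, exponent, mantissa)."""
    return b >> 15, (b >> 7) & 0xFF, b & 0x7F


print(fields(to_bfloat16_bits(1.0)))    # (0, 127, 0): 1 + 8 + 7 bits
print(math.isfinite(round_bf16(1e38)))  # True: FP32-like range is kept
print(round_bf16(1.0 + 2 ** -9))        # 1.0: only 7 mantissa bits survive

# Accumulate 1,000 small bfloat16 terms in two ways.
term = round_bf16(0.01)
narrow = 0.0  # accumulator rounded to bfloat16 after every add
wide = 0.0    # wider accumulator, standing in for FP32
for _ in range(1000):
    narrow = round_bf16(narrow + term)
    wide += term
print(narrow, wide)
```

In this emulation the narrow accumulator stops growing once each new term is smaller than half the spacing between adjacent bfloat16 values, while the wide accumulator stays close to the expected total. That is the drift the FP32 accumulation described above is meant to avoid.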
TPU systems use a hierarchical networking architecture with three distinct layers, each operating at a different scale and bandwidth.
ICI is the high-speed, low-latency link that connects TPU chips within a single slice.[3] Starting with TPU v4, each chip has six ICI links (one in each direction along the X, Y, and Z axes), forming a 3D torus topology.[2] For TPU v5p, each ICI axis provides 90 GB/s of bandwidth per chip.[7]
The 3D torus topology wraps around in all three dimensions so that chips on opposite edges of the mesh are directly connected. This provides higher bisection bandwidth compared to a simple mesh. Google also supports "twisted" torus configurations, where the wrap-around connections are offset. A 4x4x8 twisted topology provides approximately 70% higher bisection bandwidth than a non-twisted 4x4x8 topology.[7]
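The wrap-around behaviour can be made concrete with a small model. The sketch below treats a non-twisted 3D torus as a grid of integer coordinates; the function names are hypothetical, and the twisted variant (with its offset wrap-around links) is deliberately not modelled.

```python
def torus_neighbors(coord, shape):
    """Return the six directly linked neighbours of `coord` in a 3D torus."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo implements the wrap-around link at the edges.
            n[axis] = (n[axis] + step) % shape[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, shape):
    """Minimum number of links between chips a and b in the torus."""
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, shape))


def mesh_hops(a, b):
    """Minimum number of links in a plain mesh without wrap-around."""
    return sum(abs(x - y) for x, y in zip(a, b))


shape = (4, 4, 8)
print(torus_neighbors((0, 0, 0), shape))
# [(3, 0, 0), (1, 0, 0), (0, 3, 0), (0, 1, 0), (0, 0, 7), (0, 0, 1)]
print(torus_hops((0, 0, 0), (3, 3, 7), shape))  # 3
print(mesh_hops((0, 0, 0), (3, 3, 7)))          # 13
```

Opposite corners of a 4x4x8 grid are 13 hops apart in a plain mesh but only 3 hops apart once each axis wraps around, which is the intuition behind the torus's higher bisection bandwidth.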
ICI resiliency is enabled by default for slices at the cube scale or larger, automatically routing around optical link faults.
DCN connects TPU VMs to each other and to the broader Google Cloud network. It operates at significantly lower bandwidth than ICI but enables multi-slice configurations where more TPU chips are needed than a single slice can provide.[5] In multi-slice setups, ICI handles intra-slice communication while DCN handles inter-slice data transfers.[5]
Introduced with TPU v4, optical circuit switches allow the physical interconnect topology to be dynamically reconfigured.[2] A TPU v4 pod uses OCS to connect "cubes" (groups of 64 chips in a 4x4x4 arrangement) into larger configurations.[6] This reconfigurability supports different topology choices (such as twisted vs. non-twisted torus) and improves fault tolerance by routing around failed optical links.[2]
TPU v7 (Ironwood) extends this approach, with each rack housing 64 chips in a cube connected by ICI in a 3D torus. Multiple cubes are linked through OCS to form pods (256 chips) and superpods (up to 9,216 chips, requiring 144 cubes).[9]
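The cube counts quoted above follow from simple division, as this short check shows. The function name is hypothetical; the chip counts are the ones given in this article.

```python
CHIPS_PER_CUBE = 4 * 4 * 4  # one 4x4x4 cube


def cubes_needed(chips):
    """Number of 64-chip cubes needed to supply `chips` chips (rounded up)."""
    return -(-chips // CHIPS_PER_CUBE)  # ceiling division on integers


print(cubes_needed(9216))  # 144, the Ironwood superpod figure
print(cubes_needed(4096))  # 64, a full TPU v4 pod
print(cubes_needed(256))   # 4
```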
TPU compute resources are organized in a hierarchical structure that runs from a single chip up to a multislice training job. A TPU node (or, in current terminology, a TPU host) sits near the bottom of this hierarchy: it is the host plus its directly attached chips, while a slice and a pod are progressively larger groupings of those hosts.
| Level | Definition | Example |
|---|---|---|
| Chip | A single TPU ASIC die | One TPU v4 chip with 32 GiB HBM |
| Host | A CPU-based VM connected to one or more TPU chips via PCIe | A machine with 4 TPU v4 chips |
| Slice | A collection of chips within one pod connected by ICI | A v4 slice with 2x2x4 topology (16 chips) |
| Pod | The maximum set of chips connected by ICI within one physical installation | A TPU v4 pod with 4,096 chips |
| Multislice | Multiple slices coordinated over DCN for a single training job | Three v5e-256 slices (768 chips total) |
Google Cloud defines the larger units precisely: "A slice is a collection of chips all located inside the same TPU Pod connected by high-speed inter-chip interconnects (ICI)," and "A TPU Pod is a contiguous set of TPUs grouped together over a specialized network."[3] A single-host configuration uses one TPU VM with its directly attached chips.[3] A multi-host configuration distributes computation across multiple TPU VMs, requiring coordination over both ICI (for chip-to-chip transfers) and DCN (for host-to-host transfers).[5]
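The examples in the hierarchy table can be reproduced with a few lines of arithmetic. This is an illustrative sketch: the helper names are hypothetical, and the figure of four chips per host is an assumption taken from the TPU v4 example above, since the number of chips per TPU VM varies by accelerator type.

```python
from math import prod

# Assumption: four chips per host, as in the TPU v4 example in the table.
CHIPS_PER_HOST = 4


def chips_in_topology(topology):
    """Chip count for a topology string such as '2x2x4'."""
    return prod(int(dim) for dim in topology.split("x"))


def hosts_for(chips, chips_per_host=CHIPS_PER_HOST):
    """Number of hosts (TPU VMs) backing a slice, rounded up."""
    return -(-chips // chips_per_host)


def multislice_chips(num_slices, chips_per_slice):
    """Total chips in a multislice job built from equal-sized slices."""
    return num_slices * chips_per_slice


slice_chips = chips_in_topology("2x2x4")
print(slice_chips)               # 16
print(hosts_for(slice_chips))    # 4, so this slice is a multi-host configuration
print(multislice_chips(3, 256))  # 768
```

A slice whose chips fit on one host is a single-host configuration; anything larger spans several TPU VMs and needs both ICI and DCN coordination, as described above.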
TPU nodes and TPU VMs use the same software compilation pipeline, centered on the XLA (Accelerated Linear Algebra) compiler.
XLA is an open-source compiler that translates high-level operations from ML frameworks into optimized TPU machine code.[13] It takes computation graphs expressed in the HLO (High-Level Operations) intermediate representation and performs optimizations including:
XLA is developed as part of the OpenXLA project, with contributions from Google, AMD, Apple, ARM, Intel, Meta, and NVIDIA, among others.[13]
| Framework | TPU integration method | Notes |
|---|---|---|
| TensorFlow | Native XLA support | Original TPU framework; tf.distribute for multi-device |
| JAX | Native XLA backend | Functional API with composable transforms; preferred for research |
| PyTorch | PyTorch/XLA bridge | Lazy evaluation model; records operations as IR graph, then compiles via XLA |
JAX has become the preferred framework for large-scale TPU training at Google and in research settings. It provides composable function transformations (jit, vmap, pmap, pjit) that map naturally to TPU parallelism strategies. PyTorch/XLA enables PyTorch users to run on TPUs with minimal code changes by intercepting PyTorch operations and compiling them through XLA.[14]
GSPMD (General and Scalable Parallelization for ML Computation Graphs) is the XLA partitioning system that automatically distributes computation across TPU chips. Users annotate tensors with sharding specifications, and GSPMD generates the necessary communication operations (all-reduce, all-gather, reduce-scatter) to maintain correctness.
Supported parallelism strategies include:
Multislice training uses GSPMD within each slice (over ICI) and data parallelism across slices (over DCN).[5] The XLA compiler automatically generates the inter-slice DCN communication code and overlaps it with computation.
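The two-level communication pattern can be pictured with a toy simulation. The sketch below is not GSPMD or XLA code and does not reflect how the compiler actually schedules collectives; it only shows, with plain Python lists and hypothetical function names, that reducing inside each slice first and then across slices yields the same result as one global all-reduce.

```python
def all_reduce(vectors):
    """Element-wise sum across participants; each participant gets the total."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


def multislice_all_reduce(slices):
    """Sum gradients across every chip in every slice.

    `slices` is a list of slices; each slice is a list of per-chip
    gradient vectors.
    """
    # Step 1: reduce inside each slice (this traffic would use ICI).
    within = [all_reduce(chips) for chips in slices]
    # Step 2: exchange one per-slice total across slices (this would use DCN).
    across = all_reduce([reduced[0] for reduced in within])
    # Step 3: every chip in slice i holds the global sum.
    return [[list(across[i]) for _ in chips] for i, chips in enumerate(slices)]


slices = [
    [[1, 2], [3, 4]],          # slice 0: two chips
    [[10, 20], [30, 40]],      # slice 1: two chips
]
result = multislice_all_reduce(slices)
print(result[0][0])  # [44, 66]
print(result[1][1])  # [44, 66]
```

Only one vector per slice crosses the slower inter-slice network in step 2, which is why this kind of hierarchy suits a fast ICI layer combined with a lower-bandwidth DCN layer.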
TPU nodes and TPU VMs have been used to train many of Google's largest AI systems:
| Aspect | TPU | GPU |
|---|---|---|
| Design philosophy | Fixed-function ASIC for matrix math | General-purpose parallel processor |
| Precision formats | bfloat16, INT8, FP8 (v7); MXU accumulates in FP32 | FP16, FP32, FP64, INT8, FP8; TF32 (Ampere+) |
| Interconnect | ICI (3D torus, up to 9,216 chips) | NVLink, NVSwitch, InfiniBand |
| Programming model | XLA compiler (TensorFlow, JAX, PyTorch/XLA) | CUDA, ROCm, Triton |
| Availability | Google Cloud only | Multiple cloud providers and on-premises |
| Optimal batch size | 128 to 1,024 | 8 to 128 |
| Software ecosystem | Narrower (XLA-based frameworks) | Broader (CUDA ecosystem, extensive library support) |
| Power efficiency (v1 era) | 83x better perf/watt vs CPU; 29x vs GPU (inference) [1] | Baseline for comparison |
TPUs typically outperform GPUs on workloads that are dominated by large matrix multiplications with regular data access patterns. GPUs maintain advantages in workloads requiring flexible memory access, custom CUDA kernels, or support across multiple cloud providers and on-premises deployments.
Cloud TPU pricing is measured in chip-hours.[15] Google Cloud offers several pricing tiers:
| Pricing tier | Description | Typical discount |
|---|---|---|
| On-demand | Pay-as-you-go with no commitment | Baseline price |
| 1-year commitment | Reserved capacity for 12 months | Moderate discount |
| 3-year commitment | Reserved capacity for 36 months | Largest discount (up to 60% off on-demand) |
| Preemptible / Spot | May be interrupted at any time | Up to 70% off on-demand |
Representative on-demand pricing (subject to change):
| TPU type | On-demand price (per chip-hour) | Preemptible price (per chip-hour) |
|---|---|---|
| TPU v2 | $4.50 | $1.35 |
| TPU v3 | $8.00 | $2.40 |
| TPU v5e | ~$1.20 | Varies by region |
| TPU v5p | ~$4.20 | Varies by region |
| TPU v6e | ~$1.38 | Varies by region |
Billing accrues while a TPU node or TPU VM is in the READY state. In the Google Cloud console, prices are displayed per VM-hour rather than per chip-hour.[15] For example, a single TPU v4 host with four chips shows as $12.88 per hour.
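The relationship between per-chip and per-VM prices, and the cost of a job, is straightforward arithmetic. The sketch below uses the representative figures from this article, which are subject to change; the function name is hypothetical, and `Decimal` is used to avoid binary floating-point rounding in currency values.

```python
from decimal import Decimal


def job_cost(chips, hours, price_per_chip_hour):
    """On-demand cost of holding `chips` chips for `hours` hours."""
    return chips * hours * price_per_chip_hour


# A four-chip TPU v4 host listed at $12.88 per VM-hour.
v4_per_chip = Decimal("12.88") / 4
print(v4_per_chip)  # 3.22

# Implied preemptible discount for TPU v3 from the table above.
v3_discount = 1 - Decimal("2.40") / Decimal("8.00")
print(v3_discount)  # 0.7

# Three v5e-256 slices (768 chips) for one hour at roughly $1.20 per chip-hour.
print(job_cost(768, Decimal("1"), Decimal("1.20")))  # 921.60
```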
In addition to the cloud-based TPU line, Google produces the Edge TPU, a compact ASIC designed for on-device inference at the network edge. The Edge TPU performs 4 trillion operations per second (4 TOPS) at only 2 watts of power, yielding 2 TOPS per watt. It supports only 8-bit integer (INT8) quantized models compiled through TensorFlow Lite.
Edge TPU hardware is sold under the Coral brand in several form factors: USB accelerator, M.2 module, mini PCIe card, system-on-module (SoM), and single-board computer (SBC). These devices are used for applications such as real-time object detection, audio classification, and pose estimation in environments where cloud connectivity is unavailable or latency requirements preclude round-trip network calls.