See also: Tensor Processing Unit, distributed training, TensorFlow, JAX, GPU, systolic array
The TPU master refers to the primary control and coordination mechanism within Google's Tensor Processing Unit (TPU) system architecture. Depending on context, the term can describe two related but distinct concepts: (1) the on-chip scalar control unit inside each TPU that manages instruction dispatch, data flow, and computation scheduling across the chip's functional units; and (2) the host-level coordinator in a distributed TPU training setup that orchestrates communication between a user's program and one or more remote TPU workers. Both roles serve the same fundamental purpose: directing how computation and data move through the TPU hardware so that machine learning workloads execute efficiently.
Google first described the TPU architecture in detail in a 2017 paper presented at the International Symposium on Computer Architecture (ISCA), where Norman Jouppi and colleagues outlined how the chip's control logic manages a massive systolic array of multiply-accumulate units. Since then, the TPU family has grown through seven generations, from the original TPU v1 (deployed internally at Google in 2015) to the Ironwood TPU (announced in 2025), and the role of the master controller has evolved alongside increasingly complex multi-chip, multi-host, and multi-pod configurations.
Imagine you and your friends are building a really big LEGO castle together. Someone needs to be in charge: they hand out the right LEGO bricks to each person, tell everyone which part to build, and make sure all the pieces fit together at the end. The TPU master is like that person in charge. Inside a special computer chip made by Google, the TPU master tells all the tiny math workers which numbers to multiply, moves data to the right places, and makes sure everything finishes on time. When lots of these chips work together, there is also a bigger boss (a host computer) that coordinates all the chips, kind of like a teacher organizing several groups of students working on different parts of the same project.
Each TPU chip contains one or more TensorCores, and each TensorCore includes a scalar control unit that functions as the on-chip master. This unit is responsible for fetching instructions, managing control flow, and dispatching operations to the chip's specialized execution units.
The scalar unit handles instruction decode, address computation, and orchestration of the data pipeline. It issues one instruction per cycle to the downstream functional units. Because a single scalar unit drives thousands of arithmetic logic units (ALUs) within the matrix and vector units, the TPU follows a control model where one lightweight controller manages a large amount of parallel compute. This design keeps die area and power consumption low while maximizing throughput for the regular, predictable workloads typical of neural network inference and training.
The MXU is the computational core of the TPU, built around a systolic array of multiply-accumulate cells. In the TPU v2 through v5 generations the array measures 128 x 128, giving 16,384 ALUs; starting with TPU v6e (Trillium) and continuing with TPU v7 (Ironwood), the array was enlarged to 256 x 256, or 65,536 ALUs. Each ALU performs one multiply-accumulate per clock cycle, so a 128 x 128 MXU delivers 16,384 multiply-accumulate operations per cycle.
The systolic array processes data in a pipelined, wave-like fashion. Weight values (the right-hand-side operand of the matrix multiplication, or RHS) are loaded from the top of the array, while input activations (the left-hand-side operand, or LHS) enter from the left. At each clock tick, partial sums propagate through the array without returning to main memory, which avoids the memory-bandwidth bottleneck that limits conventional processors. Multiplications take bfloat16 inputs, and accumulations are performed in FP32 to preserve numerical precision.
The master's scalar unit controls when weights are loaded into the MXU, when input data is streamed in, and when results are written back to memory. This coordination is critical because the systolic array must be kept continuously fed to avoid pipeline bubbles that waste compute cycles.
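As a rough check on these figures, peak matrix throughput follows directly from the array size, the clock speed, and the number of MXUs per chip. The sketch below is plain arithmetic; the assumption of eight MXUs per chip (four per TensorCore, two TensorCores) is based on published TPU v4 descriptions and reproduces the 275 bf16 TFLOPS peak listed for v4 in the generation table further down.

```python
# Rough peak-throughput arithmetic for a TPU MXU (illustrative only).
array_dim = 128                          # systolic array is 128 x 128 on v2-v5
macs_per_cycle = array_dim * array_dim   # 16,384 multiply-accumulates per clock
flops_per_cycle = 2 * macs_per_cycle     # each MAC counts as a multiply plus an add

clock_hz = 1.05e9          # TPU v4 clock (~1,050 MHz)
mxus_per_chip = 8          # assumption: 4 MXUs per TensorCore x 2 TensorCores on v4

peak_tflops = flops_per_cycle * clock_hz * mxus_per_chip / 1e12
print(f"peak bf16 throughput ~= {peak_tflops:.0f} TFLOPS")   # ~275 TFLOPS
```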
The VPU handles element-wise operations such as activation functions (ReLU, sigmoid, softmax), additions, and reductions. On TPU v5p, the VPU comprises an 8 x 128 grid of SIMD lanes with four independent ALUs per lane, issuing one operation per ALU per cycle with a latency of roughly two cycles for standard operations. The VPU's throughput is roughly one-tenth that of the MXU, reflecting the fact that matrix multiplications dominate the arithmetic in most deep learning models.
Starting with TPU v4, Google added SparseCores to accelerate embedding lookups, a common operation in recommendation models. TPU v5p and Ironwood include four SparseCores per chip, while TPU v6e includes two. These dataflow processors speed up models that rely on sparse embeddings by 5x to 7x while consuming only about 5% of the chip's die area and power budget.
The TPU master coordinates data movement across multiple levels of memory. Efficient memory management is one of the most performance-sensitive responsibilities of the control logic.
| Memory level | Typical capacity | Bandwidth | Role |
|---|---|---|---|
| Registers / accumulators | Small (per-ALU) | Highest (internal to array) | Store partial sums during systolic computation |
| VMEM (vector memory) | ~128 MiB (v5e) | ~22x higher than HBM | Programmer-controlled scratchpad for active data |
| HBM (high-bandwidth memory) | 16 to 192 GB | 600 to 7,370 GB/s | Main off-chip memory for model weights and data |
| Weight FIFO / unified buffer | 24 to 28 MiB (v1) | Internal bus, 256 bytes wide | Staging area for weights before they enter the MXU |
VMEM acts like a large, software-managed L1/L2 cache. Unlike CPU caches, VMEM requires explicit data movement instructions, and the scalar unit is responsible for scheduling DMA (direct memory access) transfers between HBM and VMEM so that data arrives just in time for computation. In practice, HBM-to-VMEM transfers are overlapped with MXU computation, and VMEM-to-MXU loads are overlapped with HBM stores, creating a deeply pipelined execution model.
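One way to see why this staging matters is to compare the time a matrix multiplication spends on arithmetic with the time spent moving its operands through HBM. The sketch below is a rough roofline estimate using illustrative per-chip numbers loosely taken from the v5p row of the generation table; it is not a model of the actual scheduler.

```python
# Rough roofline estimate: is a matmul compute-bound or HBM-bandwidth-bound?
# Numbers are illustrative, loosely based on the v5p row of the generation table.
peak_flops = 459e12          # ~459 bf16 TFLOPS per chip
hbm_bandwidth = 2.765e12     # ~2,765 GB/s

def matmul_roofline(m, k, n, bytes_per_elem=2):              # bfloat16 = 2 bytes
    flops = 2 * m * k * n                                    # multiply-adds
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)   # read A, B; write C
    t_compute = flops / peak_flops
    t_memory = bytes_moved / hbm_bandwidth
    bound = "compute-bound" if t_compute > t_memory else "HBM-bandwidth-bound"
    return t_compute, t_memory, bound

print(matmul_roofline(1024, 8192, 8192))   # large matmul: compute-bound
print(matmul_roofline(8, 8192, 8192))      # tiny batch: HBM-bandwidth-bound
```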
The table below summarizes the progression of TPU hardware across seven generations. As the chips have grown more powerful, the master's coordination responsibilities have expanded to manage larger arrays, more memory, and faster interconnects.
| Generation | Year | Process node | Clock speed | Peak TFLOPS | HBM capacity | HBM bandwidth | Topology |
|---|---|---|---|---|---|---|---|
| v1 | 2015 | 28 nm | 700 MHz | 23 (INT8) | 8 GB DDR3 | 34 GB/s | Standalone (PCIe) |
| v2 | 2017 | 16 nm | 700 MHz | 45 (bf16) | 16 GB HBM | ~600 GB/s | 2D torus |
| v3 | 2018 | 16 nm | 940 MHz | 123 (bf16) | 32 GB HBM | ~900 GB/s | 2D torus |
| v4 | 2021 | 7 nm | 1,050 MHz | 275 (bf16) | 32 GB HBM | ~1,200 GB/s | 3D torus |
| v5e | 2023 | N/A | ~1.5 GHz | 197 (bf16) / 394 (INT8) | 16 GB HBM | ~820 GB/s | 2D torus |
| v5p | 2023 | N/A | ~1.5 GHz | 459 (bf16) | 95 GB HBM | ~2,765 GB/s | 3D torus |
| v6e (Trillium) | 2024 | N/A | N/A | ~920 (bf16) | 32 GB HBM | N/A | 2D torus |
| v7 (Ironwood) | 2025 | N/A | N/A | 4,614 (FP8) | 192 GB HBM3e | 7,370 GB/s | 3D torus |
Beyond the on-chip scalar control unit, the term "TPU master" also applies to the host-level process that coordinates distributed training or inference across multiple TPU chips, hosts, and pods. This section covers that broader orchestration role.
A TPU VM (also called a worker) is a Linux virtual machine running on a physical host computer that is directly connected to TPU hardware via a high-bandwidth PCIe interface. Users can SSH into the TPU VM to run code, inspect logs, and debug. This direct-access model, introduced as a replacement for the older TPU Node architecture, removed the extra network hop that previously existed between the user's VM and the TPU host.
In the older TPU Node architecture, a user provisioned a separate Google Compute Engine VM (an n1 instance) that communicated with an inaccessible TPU host machine over gRPC. This added latency and complexity. The TPU VM model eliminated that intermediary, giving users root access to the machine physically attached to the TPU chips.
| Configuration | Description | Use case |
|---|---|---|
| Single-host | One TPU VM connected to a small number of TPU chips (typically 4 or 8) | Prototyping, small-to-medium models |
| Multi-host | Multiple TPU VMs, each connected to its own set of chips, linked by ICI | Large models that exceed single-host memory or compute |
| Sub-host | A TPU VM uses only a fraction of the chips on a physical host | Cost-efficient inference for smaller models |
In multi-host configurations, one process typically acts as the coordinator (the master), distributing work, synchronizing gradient updates, and managing checkpointing across all workers.
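To make the coordinator role concrete, here is a hedged sketch of how a multi-host JAX program (JAX is covered in the frameworks section below) discovers its identity at startup; by convention, process 0 handles host-side coordination tasks such as checkpoint writing.

```python
import jax

# Every TPU VM in the multi-host configuration runs this same script.
# On Cloud TPU, JAX can usually discover the coordinator and peer hosts itself;
# elsewhere, the coordinator address and process count must be passed explicitly.
jax.distributed.initialize()

print(f"process {jax.process_index()} of {jax.process_count()}")
print(f"local devices: {jax.local_device_count()}, global devices: {jax.device_count()}")

# Conventionally, process 0 acts as the coordinator for host-side work.
if jax.process_index() == 0:
    print("this process coordinates checkpointing and logging")
```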
A TPU pod is a contiguous set of TPU chips grouped together through a dedicated high-speed network called the inter-chip interconnect (ICI). A slice is a subset of chips within a pod that are allocated to a single workload.
TPU v2, v3, v5e, and v6e use a 2D torus ICI topology, where each chip connects to four nearest neighbors. TPU v4, v5p, and Ironwood use a 3D torus topology, connecting each chip to six neighbors. The 3D torus reduces the network diameter from roughly 2 times the square root of N to 3 times the cube root of N, which substantially lowers worst-case communication latency. For a 4,096-chip pod, maximum hops drop from approximately 128 (2D) to 48 (3D).
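The quoted hop counts can be reproduced from the two approximations in the text (a trivial check):

```python
# Worst-case hop counts for a 4,096-chip pod, using the approximations above.
n = 4096
print(2 * n ** 0.5)        # 2D torus: ~128 hops
print(3 * n ** (1 / 3))    # 3D torus: ~48 hops
```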
TPU v4 introduced optical circuit switches (OCS) to connect cubes (4 x 4 x 4 groups of 64 chips) within a pod. The OCS fabric can dynamically reconfigure the pod topology, allowing operators to provision a single large pod, multiple smaller pods, or reconfigure mid-job for fault tolerance.
TPU v4 topologies are specified as a three-tuple A x B x C, where A, B, and C represent chip counts in each dimension. Certain configurations (where 2A = B = C, or 2A = 2B = C) support twisted tori, a topology variant that routes some ICI links across the torus in a shifted pattern to increase bisection bandwidth. For example, a 4x4x8_twisted slice provides a 70% theoretical increase in bisection bandwidth compared to a standard 4x4x8 configuration, improving performance for collective operations like all-reduce.
For workloads that exceed the capacity of a single slice, Google supports multislice configurations where multiple slices communicate over the datacenter network (DCN) rather than ICI. Within each slice, chips communicate via ICI at high bandwidth (45 to 90 GB/s per link). Between slices, communication travels over DCN at much lower bandwidth (around 6 GB/s on v5e). This bandwidth hierarchy shapes which parallelism strategies are used within a slice versus across slices, as described in the parallelism section below.
The host-level master coordinates the synchronization of gradients and model state across slices, typically using collective operations that are aware of the two-tier network topology.
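To get a feel for why the two-tier network matters, the hedged sketch below estimates how long a bandwidth-limited ring all-reduce of a set of gradients would take over ICI versus DCN, using the per-link figures quoted above. The ring-all-reduce cost model (each device sends roughly 2·(N−1)/N times the gradient size) is a standard approximation, not a description of the actual collective implementation.

```python
# Rough bandwidth-only estimate of a ring all-reduce over ICI vs. DCN.
# Ignores latency, overlap, and topology-aware collectives; illustrative only.
def allreduce_seconds(grad_bytes, n_devices, bandwidth_bytes_per_s):
    traffic = 2 * (n_devices - 1) / n_devices * grad_bytes   # per-device traffic
    return traffic / bandwidth_bytes_per_s

grad_bytes = 10e9   # e.g. ~5 billion parameters stored in bfloat16
print(allreduce_seconds(grad_bytes, 256, 45e9))  # within a slice over ICI: ~0.44 s
print(allreduce_seconds(grad_bytes, 4, 6e9))     # across 4 slices over DCN: ~2.5 s
```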
The TPU master (at both the chip and host levels) interacts with several machine learning frameworks. The framework runtime translates high-level model code into low-level operations that the TPU hardware executes.
TensorFlow was the first framework to support TPUs. In TensorFlow's distributed runtime, there are three roles: a client (the user's program), a master (the coordinator), and one or more workers. The client constructs a computation graph and sends it to the master as a tf.GraphDef protocol buffer. The master partitions the graph into subgraphs, assigns each subgraph to a worker, applies optimizations such as constant folding, and coordinates execution.
For TPU-specific workloads, tf.distribute.TPUStrategy provides synchronous distributed training across all TPU cores in a pod. The user connects to the TPU cluster using a TPUClusterResolver, which locates the TPU workers and initializes the TPU system:
```python
import tensorflow as tf

# Locate the TPU workers and initialize the TPU system before creating the strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
```
The strategy handles device placement, gradient aggregation via all-reduce, and synchronization automatically. Models must be created within a strategy.scope() block so that variables are mirrored across TPU devices.
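Continuing the snippet above, a minimal sketch of model creation and training under the strategy scope (the model architecture is arbitrary, and train_dataset is assumed to be a tf.data.Dataset batched with the global batch size):

```python
with strategy.scope():
    # Variables created here are mirrored across all TPU devices.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# train_dataset: a tf.data.Dataset batched with the global batch size (assumed).
model.fit(train_dataset, epochs=3)
```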
JAX is Google's array computing library that compiles Python functions to optimized XLA (Accelerated Linear Algebra) code. JAX treats TPU chips as a device mesh and uses sharding annotations to partition data and computation across the mesh.
Key JAX primitives for TPU orchestration include:
- jax.jit: compiles a function for execution on TPU, with sharding decisions made automatically
- jax.sharding.NamedSharding: specifies how arrays are distributed across a named device mesh
- jax.experimental.shard_map: enables fully manual per-device code with explicit collective communication

JAX's SPMD (single program, multiple data) model means that the same program runs on every TPU chip, with the framework and XLA compiler inserting the necessary communication operations (all-reduce, all-gather, reduce-scatter) based on the sharding specification.
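A minimal sketch of these primitives in use, assuming a TPU runtime is available (the mesh layout, array shapes, and function are arbitrary illustrations):

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange all visible TPU devices into a one-dimensional mesh with a named axis.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard a batch of activations along the "data" axis; the weight stays replicated.
x = jnp.ones((8 * len(jax.devices()), 1024))
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jnp.ones((1024, 1024))

@jax.jit
def layer(x, w):
    return jax.nn.relu(x @ w)

# jit compiles the function with XLA; any communication implied by the input
# shardings is inserted automatically.
y = layer(x, w)
print(y.sharding)
```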
PyTorch supports TPUs through the PyTorch/XLA library, which translates PyTorch operations into XLA computations. PyTorch/XLA uses a lazy tensor approach: operations are recorded into a graph and compiled/executed on the TPU when results are needed. Multi-host TPU training in PyTorch uses torch.distributed with an XLA-compatible backend.
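A hedged sketch of the lazy-tensor flow on a single TPU device (assuming the torch_xla package is installed; API details vary between releases):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()          # acquire an XLA (TPU) device

x = torch.randn(128, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

# These operations are recorded into an XLA graph rather than executed eagerly.
y = torch.relu(x @ w)

# mark_step() cuts the recorded graph, compiles it with XLA, and runs it on the TPU.
xm.mark_step()
print(y[0, :4])
```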
Pathways is Google's internal distributed ML runtime, developed by Google DeepMind, that adds a virtualization layer on top of the standard TPU runtime. Instead of allocating TPU chips directly to users, Pathways manages chips through long-lived server processes. A single user can connect to an arbitrary number of Pathways-controlled devices and write their program as if all devices were attached to a single process, even when the devices span multiple data centers.
Pathways served as the runtime for training PaLM, a 540-billion-parameter language model, across two TPU v4 pods containing 6,144 chips. The system achieved 57.8% hardware FLOPS utilization, a record at the time for language models at that scale. Benefits of the Pathways approach include reduced job startup time, built-in fault tolerance, and the ability to run multiple jobs on the same hardware (multitenancy).
The TPU master, whether on-chip or at the host level, enables several forms of parallelism that are commonly combined in large-scale training.
| Strategy | Description | Communication pattern | Typical scope |
|---|---|---|---|
| Data parallelism | Each device holds a full model copy and processes a different data batch; gradients are averaged | All-reduce | Across hosts or slices |
| Tensor parallelism | Individual layers are split across devices; each device computes part of every layer | All-reduce, all-gather within layers | Within a host or ICI-connected group |
| Pipeline parallelism | The model is divided into sequential stages; each device holds one or more stages | Point-to-point between stages | Across hosts |
| FSDP (fully sharded data parallelism) | Model parameters, gradients, and optimizer states are sharded across devices; gathered on demand | All-gather before forward, reduce-scatter after backward | Within a slice |
| Expert parallelism | In mixture-of-experts models, different experts reside on different devices | All-to-all | Within a pod |
The choice of strategy depends on model size, the number of available chips, and the bandwidth hierarchy (MXU to VMEM to HBM to ICI to DCN). The master's job is to execute the chosen strategy efficiently by scheduling data transfers and synchronization points to overlap with computation.
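As a simplified, concrete instance of the data-parallelism row above, the following sketch uses JAX's pmap with an all-reduce (jax.lax.pmean) to average per-device gradients; production training loops layer optimizer state, sharding, and checkpointing on top of this pattern.

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

# One step of synchronous data-parallel SGD: each device computes gradients on
# its own shard of the batch, then gradients are averaged with an all-reduce.
@partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.01 * grads

n = jax.local_device_count()
w = jnp.zeros((16, 1))
# Replicate the weights and split the global batch across local devices.
w_repl = jax.device_put_replicated(w, jax.local_devices())
x = jnp.ones((n, 8, 16))   # per-device batch of 8 examples
y = jnp.ones((n, 8, 1))
w_repl = train_step(w_repl, x, y)
```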
The following table compares TPU and GPU hardware on several axes relevant to understanding the TPU master's role and context.
| Aspect | Google TPU | NVIDIA GPU |
|---|---|---|
| Architecture type | Application-specific (ASIC) with systolic array | General-purpose with CUDA cores and Tensor Cores |
| Primary compute unit | MXU (128x128 or 256x256 systolic array) | Streaming multiprocessors with Tensor Cores |
| Control model | Scalar unit drives deterministic execution | Warp scheduler with dynamic instruction scheduling |
| Interconnect | ICI (dedicated torus network between chips) | NVLink / NVSwitch (point-to-point between GPUs) |
| Software ecosystem | TensorFlow, JAX, PyTorch/XLA | CUDA, cuDNN; broad framework support |
| Memory type | HBM (16 to 192 GB per chip) | HBM (40 to 192 GB per GPU) |
| Strengths | Large-scale distributed training, energy efficiency, tight integration with Google Cloud | Broad software ecosystem, flexibility, wide vendor support |
| Example: BERT training | ~2.8x faster than A100 (reported benchmarks) | Baseline comparison |
| Energy efficiency (v4 vs. A100) | 1.2x to 1.7x better performance per watt | Baseline comparison |
Google also produces the Edge TPU, a small ASIC for on-device inference at the edge. The Edge TPU delivers 4 TOPS (tera-operations per second) of INT8 performance while consuming only 2 watts (2 TOPS per watt). It is available through the Coral product line as a USB accelerator, a PCIe module, and a system-on-module. The Edge TPU runs TensorFlow Lite models and can execute models like MobileNet V2 at nearly 400 frames per second.
While the Edge TPU does not use the same multi-chip pod architecture as Cloud TPUs, its on-chip control logic serves the same master function: scheduling instructions, managing data flow through the inference pipeline, and coordinating output.
The TPU host CPU (the master machine) runs the data input pipeline, including data loading, augmentation, and preprocessing. Because the TPU chips themselves are optimized for matrix arithmetic and cannot efficiently run general-purpose preprocessing code, the host CPU must prepare batches of data and feed them to the TPU fast enough to keep the MXUs saturated. Google recommends using the TFRecord format and tf.data pipelines with prefetching and parallel reads to avoid input bottlenecks.
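A minimal sketch of such an input pipeline (the bucket path, feature names, and preprocessing are placeholders):

```python
import tensorflow as tf

feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    return image, example["label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord"),
                            num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(1024, drop_remainder=True)   # static batch shape keeps the MXU full
    .prefetch(tf.data.AUTOTUNE)         # overlap host preprocessing with TPU compute
)
```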
TPU systolic arrays operate most efficiently when matrix dimensions are multiples of 128 (or 256 on v6e and later). The master must ensure that batch sizes and model dimensions are padded appropriately; otherwise, portions of the array sit idle. The global batch size in distributed training is divided by the number of replicas (strategy.num_replicas_in_sync in TensorFlow), so the per-replica batch size must still be large enough to utilize the MXU effectively.
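The padding and batch-size arithmetic is simple enough to sketch directly (illustrative numbers only):

```python
# Round a dimension up to the nearest multiple of the MXU tile size.
def pad_to_multiple(dim, multiple=128):
    return ((dim + multiple - 1) // multiple) * multiple

print(pad_to_multiple(1000))    # 1024: the 24 padded columns sit idle in the array

# Per-replica batch size in synchronous data-parallel training.
global_batch_size = 4096
num_replicas = 32               # e.g. strategy.num_replicas_in_sync
per_replica_batch = global_batch_size // num_replicas
print(per_replica_batch)        # 128: still a full tile per step
```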
In large pod configurations with thousands of chips, hardware failures are expected. The ICI network supports resiliency features that route traffic around faulty optical links or OCS components, at the cost of temporarily reduced ICI bandwidth. The Pathways runtime adds application-level fault tolerance by managing checkpoint-and-restart workflows and rerouting computation to healthy chips. The master process coordinates these recovery operations.
Both TensorFlow and JAX compile computation graphs to XLA (Accelerated Linear Algebra) intermediate representation before executing on TPU hardware. The XLA compiler performs operator fusion, memory layout optimization, and communication scheduling. The master process sends the compiled HLO (High-Level Optimizer) programs to TPU workers for execution. Profiling tools like TensorBoard's TPU profiler and JAX's jax.profiler help identify when the master is not keeping the TPU fed (an "infeed stall") or when collective communication is creating bottlenecks.
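As a hedged example of capturing a profile with JAX (the trace directory is arbitrary; the resulting trace is viewed with TensorBoard's profiler plugin):

```python
import jax
import jax.numpy as jnp

x = jnp.ones((8192, 8192))

# Record a trace of TPU activity; infeed stalls and collective-communication time
# show up in TensorBoard's profiler views.
with jax.profiler.trace("/tmp/tpu-profile"):
    for _ in range(10):
        x = (x @ x).block_until_ready()
```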
The timeline below traces the development of TPU hardware and the evolution of the master's role.