See also: Tensor Processing Unit, distributed training, TensorFlow, JAX, GPU, systolic array
The TPU master refers to the primary control and coordination mechanism within Google's Tensor Processing Unit (TPU) system architecture. Depending on context, the term can describe two related but distinct concepts: (1) the on-chip scalar control unit inside each TPU that manages instruction dispatch, data flow, and computation scheduling across the chip's functional units; and (2) the host-level coordinator in a distributed TPU training setup that orchestrates communication between a user's program and one or more remote TPU workers. Both roles serve the same fundamental purpose: directing how computation and data move through the TPU hardware so that machine learning workloads execute efficiently.
Google first described the TPU architecture in detail in a 2017 paper presented at the International Symposium on Computer Architecture (ISCA), where Norman Jouppi and colleagues outlined how the chip's control logic manages a massive systolic array of multiply-accumulate units. Since then, the TPU family has grown through seven generations, from the original TPU v1 (deployed internally at Google in 2015) to the Ironwood TPU (announced in 2025), and the role of the master controller has evolved alongside increasingly complex multi-chip, multi-host, and multi-pod configurations.
Imagine you and your friends are building a really big LEGO castle together. Someone needs to be in charge: they hand out the right LEGO bricks to each person, tell everyone which part to build, and make sure all the pieces fit together at the end. The TPU master is like that person in charge. Inside a special computer chip made by Google, the TPU master tells all the tiny math workers which numbers to multiply, moves data to the right places, and makes sure everything finishes on time. When lots of these chips work together, there is also a bigger boss (a host computer) that coordinates all the chips, kind of like a teacher organizing several groups of students working on different parts of the same project.
Each TPU chip contains one or more TensorCores, and each TensorCore includes a scalar control unit that functions as the on-chip master. This unit is responsible for fetching instructions, managing control flow, and dispatching operations to the chip's specialized execution units.
The scalar unit handles instruction decode, address computation, and orchestration of the data pipeline. It issues one instruction per cycle to the downstream functional units. Because a single scalar unit drives thousands of arithmetic logic units (ALUs) within the matrix and vector units, the TPU follows a control model where one lightweight controller manages a large amount of parallel compute. This design keeps die area and power consumption low while maximizing throughput for the regular, predictable workloads typical of neural network inference and training.
The MXU is the computational core of the TPU, built around a systolic array of multiply-accumulate cells. In the TPU v2 through v5 generations the array measures 128 x 128, giving 16,384 ALUs; starting with TPU v6e (Trillium) and continuing with TPU v7 (Ironwood), the array was enlarged to 256 x 256, or 65,536 ALUs. Each ALU performs one multiply-accumulate per clock cycle, so a 128 x 128 MXU delivers 16,384 multiply-accumulate operations per cycle.
The systolic array processes data in a pipelined, wave-like fashion. Weight values (the right-hand-side operand of the matrix multiplication, or RHS) are loaded from the top of the array, while input activations (the left-hand-side operand, or LHS) enter from the left. At each clock tick, partial sums propagate through the array without returning to main memory, which avoids the memory-bandwidth bottleneck that limits conventional processors. Multiplications take bfloat16 inputs, and accumulations are performed in FP32 to preserve numerical precision.
The master's scalar unit controls when weights are loaded into the MXU, when input data is streamed in, and when results are written back to memory. This coordination is critical because the systolic array must be kept continuously fed to avoid pipeline bubbles that waste compute cycles.
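As a rough check on these figures, peak matrix throughput follows directly from the array size, the clock speed, and the number of MXUs per chip. The sketch below is plain arithmetic; the assumption of eight MXUs per chip (four per TensorCore, two TensorCores) is based on published TPU v4 descriptions and reproduces the 275 bf16 TFLOPS peak listed for v4 in the generation table further down.

```python
# Rough peak-throughput arithmetic for a TPU MXU (illustrative only).
array_dim = 128                          # systolic array is 128 x 128 on v2-v5
macs_per_cycle = array_dim * array_dim   # 16,384 multiply-accumulates per clock
flops_per_cycle = 2 * macs_per_cycle     # each MAC counts as a multiply plus an add

clock_hz = 1.05e9          # TPU v4 clock (~1,050 MHz)
mxus_per_chip = 8          # assumption: 4 MXUs per TensorCore x 2 TensorCores on v4

peak_tflops = flops_per_cycle * clock_hz * mxus_per_chip / 1e12
print(f"peak bf16 throughput ~= {peak_tflops:.0f} TFLOPS")   # ~275 TFLOPS
```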
The VPU handles element-wise operations such as activation functions (ReLU, sigmoid, softmax), additions, and reductions. On TPU v5p, the VPU comprises an 8 x 128 grid of SIMD lanes with four independent ALUs per lane, issuing one operation per ALU per cycle with a latency of roughly two cycles for standard operations. The VPU's throughput is roughly one-tenth that of the MXU, reflecting the fact that matrix multiplications dominate the arithmetic in most deep learning models.
Starting with TPU v4, Google added SparseCores to accelerate embedding lookups, a common operation in recommendation models. TPU v5p and Ironwood include four SparseCores per chip, while TPU v6e includes two. These dataflow processors speed up models that rely on sparse embeddings by 5x to 7x while consuming only about 5% of the chip's die area and power budget.
The TPU master coordinates data movement across multiple levels of memory. Efficient memory management is one of the most performance-sensitive responsibilities of the control logic.
| Memory level | Typical capacity | Bandwidth | Role |
|---|---|---|---|
| Registers / accumulators | Small (per-ALU) | Highest (internal to array) | Store partial sums during systolic computation |
| VMEM (vector memory) | ~128 MiB (v5e) | ~22x higher than HBM | Programmer-controlled scratchpad for active data |
| HBM (high-bandwidth memory) | 16 to 192 GB | 600 to 7,370 GB/s | Main off-chip memory for model weights and data |
| Weight FIFO / unified buffer | 24 to 28 MiB (v1) | Internal bus, 256 bytes wide | Staging area for weights before they enter the MXU |
VMEM acts like a large, software-managed L1/L2 cache. Unlike CPU caches, VMEM requires explicit data movement instructions, and the scalar unit is responsible for scheduling DMA (direct memory access) transfers between HBM and VMEM so that data arrives just in time for computation. In practice, HBM-to-VMEM transfers are overlapped with MXU computation, and VMEM-to-MXU loads are overlapped with HBM stores, creating a deeply pipelined execution model.
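One way to see why this staging matters is to compare the time a matrix multiplication spends on arithmetic with the time spent moving its operands through HBM. The sketch below is a rough roofline estimate using illustrative per-chip numbers loosely taken from the v5p row of the generation table; it is not a model of the actual scheduler.

```python
# Rough roofline estimate: is a matmul compute-bound or HBM-bandwidth-bound?
# Numbers are illustrative, loosely based on the v5p row of the generation table.
peak_flops = 459e12          # ~459 bf16 TFLOPS per chip
hbm_bandwidth = 2.765e12     # ~2,765 GB/s

def matmul_roofline(m, k, n, bytes_per_elem=2):              # bfloat16 = 2 bytes
    flops = 2 * m * k * n                                    # multiply-adds
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)   # read A, B; write C
    t_compute = flops / peak_flops
    t_memory = bytes_moved / hbm_bandwidth
    bound = "compute-bound" if t_compute > t_memory else "HBM-bandwidth-bound"
    return t_compute, t_memory, bound

print(matmul_roofline(1024, 8192, 8192))   # large matmul: compute-bound
print(matmul_roofline(8, 8192, 8192))      # tiny batch: HBM-bandwidth-bound
```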
The table below summarizes the progression of TPU hardware across seven generations. As the chips have grown more powerful, the master's coordination responsibilities have expanded to manage larger arrays, more memory, and faster interconnects.
| Generation | Year | Process node | Clock speed | Peak TFLOPS | HBM capacity | HBM bandwidth | Topology |
|---|---|---|---|---|---|---|---|
| v1 | 2015 | 28 nm | 700 MHz | 23 (INT8) | 8 GB DDR3 | 34 GB/s | Standalone (PCIe) |
| v2 | 2017 | 16 nm | 700 MHz | 45 (bf16) | 16 GB HBM | ~600 GB/s | 2D torus |
| v3 | 2018 | 16 nm | 940 MHz | 123 (bf16) | 32 GB HBM | ~900 GB/s | 2D torus |
| v4 | 2021 | 7 nm | 1,050 MHz | 275 (bf16) | 32 GB HBM | ~1,200 GB/s | 3D torus |
| v5e | 2023 | N/A | ~1.5 GHz | 197 (bf16) / 394 (INT8) | 16 GB HBM | ~820 GB/s | 2D torus |
| v5p | 2023 | N/A | ~1.5 GHz | 459 (bf16) | 95 GB HBM | ~2,765 GB/s | 3D torus |
| v6e (Trillium) | 2024 | N/A | N/A | ~920 (bf16) | 32 GB HBM | N/A | 2D torus |
| v7 (Ironwood) | 2025 | N/A | N/A | 4,614 (FP8) | 192 GB HBM3e | 7,370 GB/s | 3D torus |
Beyond the on-chip scalar control unit, the term "TPU master" also applies to the host-level process that coordinates distributed training or inference across multiple TPU chips, hosts, and pods. This section covers that broader orchestration role.
A TPU VM (also called a worker) is a Linux virtual machine running on a physical host computer that is directly connected to TPU hardware via a high-bandwidth PCIe interface. Users can SSH into the TPU VM to run code, inspect logs, and debug. This direct-access model, introduced as a replacement for the older TPU Node architecture, removed the extra network hop that previously existed between the user's VM and the TPU host.
In the older TPU Node architecture, a user provisioned a separate Google Compute Engine VM (an n1 instance) that communicated with an inaccessible TPU host machine over gRPC. This added latency and complexity. The TPU VM model eliminated that intermediary, giving users root access to the machine physically attached to the TPU chips.
| Configuration | Description | Use case |
|---|---|---|
| Single-host | One TPU VM connected to a small number of TPU chips (typically 4 or 8) | Prototyping, small-to-medium models |
| Multi-host | Multiple TPU VMs, each connected to its own set of chips, linked by ICI | Large models that exceed single-host memory or compute |
| Sub-host | A TPU VM uses only a fraction of the chips on a physical host | Cost-efficient inference for smaller models |
In multi-host configurations, one process typically acts as the coordinator (the master), distributing work, synchronizing gradient updates, and managing checkpointing across all workers.
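To make the coordinator role concrete, here is a hedged sketch of how a multi-host JAX program (JAX is covered in the frameworks section below) discovers its identity at startup; by convention, process 0 handles host-side coordination tasks such as checkpoint writing.

```python
import jax

# Every TPU VM in the multi-host configuration runs this same script.
# On Cloud TPU, JAX can usually discover the coordinator and peer hosts itself;
# elsewhere, the coordinator address and process count must be passed explicitly.
jax.distributed.initialize()

print(f"process {jax.process_index()} of {jax.process_count()}")
print(f"local devices: {jax.local_device_count()}, global devices: {jax.device_count()}")

# Conventionally, process 0 acts as the coordinator for host-side work.
if jax.process_index() == 0:
    print("this process coordinates checkpointing and logging")
```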
A TPU pod is a contiguous set of TPU chips grouped together through a dedicated high-speed network called the inter-chip interconnect (ICI). A slice is a subset of chips within a pod that are allocated to a single workload.
TPU v2, v3, v5e, and v6e use a 2D torus ICI topology, where each chip connects to four nearest neighbors. TPU v4, v5p, and Ironwood use a 3D torus topology, connecting each chip to six neighbors. The 3D torus reduces the network diameter from roughly 2 times the square root of N to 3 times the cube root of N, which substantially lowers worst-case communication latency. For a 4,096-chip pod, maximum hops drop from approximately 128 (2D) to 48 (3D).
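The quoted hop counts can be reproduced from the two approximations in the text (a trivial check):

```python
# Worst-case hop counts for a 4,096-chip pod, using the approximations above.
n = 4096
print(2 * n ** 0.5)        # 2D torus: ~128 hops
print(3 * n ** (1 / 3))    # 3D torus: ~48 hops
```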
TPU v4 introduced optical circuit switches (OCS) to connect cubes (4 x 4 x 4 groups of 64 chips) within a pod. The OCS fabric can dynamically reconfigure the pod topology, allowing operators to provision a single large pod, multiple smaller pods, or reconfigure mid-job for fault tolerance.
TPU v4 topologies are specified as a three-tuple A x B x C, where A, B, and C represent chip counts in each dimension. Certain configurations (where 2A = B = C, or 2A = 2B = C) support twisted tori, a topology variant that routes some ICI links across the torus in a shifted pattern to increase bisection bandwidth. For example, a 4x4x8_twisted slice provides a 70% theoretical increase in bisection bandwidth compared to a standard 4x4x8 configuration, improving performance for collective operations like all-reduce.
For workloads that exceed the capacity of a single slice, Google supports multislice configurations where multiple slices communicate over the datacenter network (DCN) rather than ICI. Within each slice, chips communicate via ICI at high bandwidth (45 to 90 GB/s per link). Between slices, communication travels over DCN at much lower bandwidth (around 6 GB/s on v5e). This bandwidth hierarchy shapes which parallelism strategies are used within a slice versus across slices, as described in the parallelism section below.
The host-level master coordinates the synchronization of gradients and model state across slices, typically using collective operations that are aware of the two-tier network topology.
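To get a feel for why the two-tier network matters, the hedged sketch below estimates how long a bandwidth-limited ring all-reduce of a set of gradients would take over ICI versus DCN, using the per-link figures quoted above. The ring-all-reduce cost model (each device sends roughly 2·(N−1)/N times the gradient size) is a standard approximation, not a description of the actual collective implementation.

```python
# Rough bandwidth-only estimate of a ring all-reduce over ICI vs. DCN.
# Ignores latency, overlap, and topology-aware collectives; illustrative only.
def allreduce_seconds(grad_bytes, n_devices, bandwidth_bytes_per_s):
    traffic = 2 * (n_devices - 1) / n_devices * grad_bytes   # per-device traffic
    return traffic / bandwidth_bytes_per_s

grad_bytes = 10e9   # e.g. ~5 billion parameters stored in bfloat16
print(allreduce_seconds(grad_bytes, 256, 45e9))  # within a slice over ICI: ~0.44 s
print(allreduce_seconds(grad_bytes, 4, 6e9))     # across 4 slices over DCN: ~2.5 s
```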
The TPU master (at both the chip and host levels) interacts with several machine learning frameworks. The framework runtime translates high-level model code into low-level operations that the TPU hardware executes.
TensorFlow was the first framework to support TPUs. In TensorFlow's distributed runtime, there are three roles: a client (the user's program), a master (the coordinator), and one or more workers. The client constructs a computation graph and sends it to the master as a tf.GraphDef protocol buffer. The master partitions the graph into subgraphs, assigns each subgraph to a worker, applies optimizations such as constant folding, and coordinates execution.
For TPU-specific workloads, tf.distribute.TPUStrategy provides synchronous distributed training across all TPU cores in a pod. The user connects to the TPU cluster using a TPUClusterResolver, which locates the TPU workers and initializes the TPU system:
```python
import tensorflow as tf

# Locate the TPU workers and initialize the TPU system before creating the strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
```
The strategy handles device placement, gradient aggregation via all-reduce, and synchronization automatically. Models must be created within a strategy.scope() block so that variables are mirrored across TPU devices.
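Continuing the snippet above, a minimal sketch of model creation and training under the strategy scope (the model architecture is arbitrary, and train_dataset is assumed to be a tf.data.Dataset batched with the global batch size):

```python
with strategy.scope():
    # Variables created here are mirrored across all TPU devices.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# train_dataset: a tf.data.Dataset batched with the global batch size (assumed).
model.fit(train_dataset, epochs=3)
```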
JAX is Google's array computing library that compiles Python functions to optimized XLA (Accelerated Linear Algebra) code. JAX treats TPU chips as a device mesh and uses sharding annotations to partition data and computation across the mesh.
Key JAX primitives for TPU orchestration include:
- jax.jit: compiles a function for execution on TPU, with sharding decisions made automatically
- jax.sharding.NamedSharding: specifies how arrays are distributed across a named device mesh
- jax.experimental.shard_map: enables fully manual per-device code with explicit collective communication

JAX's SPMD (single program, multiple data) model means that the same program runs on every TPU chip, with the framework and XLA compiler inserting the necessary communication operations (all-reduce, all-gather, reduce-scatter) based on the sharding specification.
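A minimal sketch of these primitives in use, assuming a TPU runtime is available (the mesh layout, array shapes, and function are arbitrary illustrations):

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange all visible TPU devices into a one-dimensional mesh with a named axis.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard a batch of activations along the "data" axis; the weight stays replicated.
x = jnp.ones((8 * len(jax.devices()), 1024))
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jnp.ones((1024, 1024))

@jax.jit
def layer(x, w):
    return jax.nn.relu(x @ w)

# jit compiles the function with XLA; any communication implied by the input
# shardings is inserted automatically.
y = layer(x, w)
print(y.sharding)
```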
PyTorch supports TPUs through the PyTorch/XLA library, which translates PyTorch operations into XLA computations. PyTorch/XLA uses a lazy tensor approach: operations are recorded into a graph and compiled/executed on the TPU when results are needed. Multi-host TPU training in PyTorch uses torch.distributed with an XLA-compatible backend.
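A hedged sketch of the lazy-tensor flow on a single TPU device (assuming the torch_xla package is installed; API details vary between releases):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()          # acquire an XLA (TPU) device

x = torch.randn(128, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

# These operations are recorded into an XLA graph rather than executed eagerly.
y = torch.relu(x @ w)

# mark_step() cuts the recorded graph, compiles it with XLA, and runs it on the TPU.
xm.mark_step()
print(y[0, :4])
```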
Pathways is Google's internal distributed ML runtime, developed by Google DeepMind, that adds a virtualization layer on top of the standard TPU runtime. Instead of allocating TPU chips directly to users, Pathways manages chips through long-lived server processes. A single user can connect to an arbitrary number of Pathways-controlled devices and write their program as if all devices were attached to a single process, even when the devices span multiple data centers.
Pathways served as the runtime for training PaLM, a 540-billion-parameter language model, across two TPU v4 pods containing 6,144 chips. The system achieved 57.8% hardware FLOPS utilization, a record at the time for language models at that scale. Benefits of the Pathways approach include reduced job startup time, built-in fault tolerance, and the ability to run multiple jobs on the same hardware (multitenancy).
The TPU master, whether on-chip or at the host level, enables several forms of parallelism that are commonly combined in large-scale training.
| Strategy | Description | Communication pattern | Typical scope |
|---|---|---|---|
| Data parallelism | Each device holds a full model copy and processes a different data batch; gradients are averaged | All-reduce | Across hosts or slices |
| Tensor parallelism | Individual layers are split across devices; each device computes part of every layer | All-reduce, all-gather within layers | Within a host or ICI-connected group |
| Pipeline parallelism | The model is divided into sequential stages; each device holds one or more stages | Point-to-point between stages | Across hosts |
| FSDP (fully sharded data parallelism) | Model parameters, gradients, and optimizer states are sharded across devices; gathered on demand | All-gather before forward, reduce-scatter after backward | Within a slice |
| Expert parallelism | In mixture-of-experts models, different experts reside on different devices | All-to-all | Within a pod |
The choice of strategy depends on model size, the number of available chips, and the bandwidth hierarchy (MXU to VMEM to HBM to ICI to DCN). The master's job is to execute the chosen strategy efficiently by scheduling data transfers and synchronization points to overlap with computation.
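As a simplified, concrete instance of the data-parallelism row above, the following sketch uses JAX's pmap with an all-reduce (jax.lax.pmean) to average per-device gradients; production training loops layer optimizer state, sharding, and checkpointing on top of this pattern.

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

# One step of synchronous data-parallel SGD: each device computes gradients on
# its own shard of the batch, then gradients are averaged with an all-reduce.
@partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.01 * grads

n = jax.local_device_count()
w = jnp.zeros((16, 1))
# Replicate the weights and split the global batch across local devices.
w_repl = jax.device_put_replicated(w, jax.local_devices())
x = jnp.ones((n, 8, 16))   # per-device batch of 8 examples
y = jnp.ones((n, 8, 1))
w_repl = train_step(w_repl, x, y)
```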
The following table compares TPU and GPU hardware on several axes relevant to understanding the TPU master's role and context.
| Aspect | Google TPU | NVIDIA GPU |
|---|---|---|
| Architecture type | Application-specific (ASIC) with systolic array | General-purpose with CUDA cores and Tensor Cores |
| Primary compute unit | MXU (128x128 or 256x256 systolic array) | Streaming multiprocessors with Tensor Cores |
| Control model | Scalar unit drives deterministic execution | Warp scheduler with dynamic instruction scheduling |
| Interconnect | ICI (dedicated torus network between chips) | NVLink / NVSwitch (point-to-point between GPUs) |
| Software ecosystem | TensorFlow, JAX, PyTorch/XLA | CUDA, cuDNN; broad framework support |
| Memory type | HBM (16 to 192 GB per chip) | HBM (40 to 192 GB per GPU) |
| Strengths | Large-scale distributed training, energy efficiency, tight integration with Google Cloud | Broad software ecosystem, flexibility, wide vendor support |
| Example: BERT training | ~2.8x faster than A100 (reported benchmarks) | Baseline comparison |
| Energy efficiency (v4 vs. A100) | 1.2x to 1.7x better performance per watt | Baseline comparison |
Google also produces the Edge TPU, a small ASIC for on-device inference at the edge. The Edge TPU delivers 4 TOPS (tera-operations per second) of INT8 performance while consuming only 2 watts (2 TOPS per watt). It is available through the Coral product line as a USB accelerator, a PCIe module, and a system-on-module. The Edge TPU runs TensorFlow Lite models and can execute models like MobileNet V2 at nearly 400 frames per second.
While the Edge TPU does not use the same multi-chip pod architecture as Cloud TPUs, its on-chip control logic serves the same master function: scheduling instructions, managing data flow through the inference pipeline, and coordinating output.
The TPU host CPU (the master machine) runs the data input pipeline, including data loading, augmentation, and preprocessing. Because the TPU chips themselves are optimized for matrix arithmetic and cannot efficiently run general-purpose preprocessing code, the host CPU must prepare batches of data and feed them to the TPU fast enough to keep the MXUs saturated. Google recommends using the TFRecord format and tf.data pipelines with prefetching and parallel reads to avoid input bottlenecks.
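A minimal sketch of such an input pipeline (the bucket path, feature names, and preprocessing are placeholders):

```python
import tensorflow as tf

feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    return image, example["label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/train-*.tfrecord"),
                            num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(1024, drop_remainder=True)   # static batch shape keeps the MXU full
    .prefetch(tf.data.AUTOTUNE)         # overlap host preprocessing with TPU compute
)
```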
TPU systolic arrays operate most efficiently when matrix dimensions are multiples of 128 (or 256 on v6e and later). The master must ensure that batch sizes and model dimensions are padded appropriately; otherwise, portions of the array sit idle. The global batch size in distributed training is divided by the number of replicas (strategy.num_replicas_in_sync in TensorFlow), so the per-replica batch size must still be large enough to utilize the MXU effectively.
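The padding and batch-size arithmetic is simple enough to sketch directly (illustrative numbers only):

```python
# Round a dimension up to the nearest multiple of the MXU tile size.
def pad_to_multiple(dim, multiple=128):
    return ((dim + multiple - 1) // multiple) * multiple

print(pad_to_multiple(1000))    # 1024: the 24 padded columns sit idle in the array

# Per-replica batch size in synchronous data-parallel training.
global_batch_size = 4096
num_replicas = 32               # e.g. strategy.num_replicas_in_sync
per_replica_batch = global_batch_size // num_replicas
print(per_replica_batch)        # 128: still a full tile per step
```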
In large pod configurations with thousands of chips, hardware failures are expected. The ICI network supports resiliency features that route traffic around faulty optical links or OCS components, at the cost of temporarily reduced ICI bandwidth. The Pathways runtime adds application-level fault tolerance by managing checkpoint-and-restart workflows and rerouting computation to healthy chips. The master process coordinates these recovery operations.
Both TensorFlow and JAX compile computation graphs to XLA (Accelerated Linear Algebra) intermediate representation before executing on TPU hardware. The XLA compiler performs operator fusion, memory layout optimization, and communication scheduling. The master process sends the compiled HLO (High-Level Optimizer) programs to TPU workers for execution. Profiling tools like TensorBoard's TPU profiler and JAX's jax.profiler help identify when the master is not keeping the TPU fed (an "infeed stall") or when collective communication is creating bottlenecks.
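As a hedged example of capturing a profile with JAX (the trace directory is arbitrary; the resulting trace is viewed with TensorBoard's profiler plugin):

```python
import jax
import jax.numpy as jnp

x = jnp.ones((8192, 8192))

# Record a trace of TPU activity; infeed stalls and collective-communication time
# show up in TensorBoard's profiler views.
with jax.profiler.trace("/tmp/tpu-profile"):
    for _ in range(10):
        x = (x @ x).block_until_ready()
```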
The timeline below traces the development of TPU hardware and the evolution of the master's role.