A TPU worker is a virtual machine (VM) running Linux that has direct access to one or more Tensor Processing Unit (TPU) chips. In Google Cloud's TPU architecture, each worker serves as the computational host responsible for executing machine learning workloads on the attached TPU hardware. Workers load and preprocess data, dispatch programs compiled by the XLA compiler to the attached TPU chips, and coordinate with other workers during distributed training. The term "TPU worker" is used interchangeably with "TPU VM" in Google Cloud documentation, and it represents the fundamental unit of execution in both single-host and multi-host TPU configurations.
Imagine you have a special calculator that is really fast at doing math homework (that is the TPU chip). But the calculator cannot read the homework by itself. It needs a helper to read the problems, write them down for the calculator, and then collect the answers. That helper is the TPU worker. Sometimes the homework is so big that you need many helpers, each with their own calculator, all working together on different parts of the homework at the same time. The helpers pass notes to each other so they all stay on the same page.
A TPU worker is a CPU-based virtual machine physically connected to TPU hardware via PCIe. The worker handles tasks that the TPU chips themselves cannot perform: reading data from storage, running preprocessing pipelines, managing control flow, and communicating with other workers over the data center network. The actual tensor computations (matrix multiplications, convolutions, and other deep learning operations) are offloaded to the TPU chips.
The relationship between a TPU worker and its TPU chips follows a host-device model. The worker (host) prepares computation graphs using a framework such as JAX, PyTorch, or TensorFlow. These graphs are compiled by the XLA (Accelerated Linear Algebra) compiler into optimized TPU machine code. The compiled program is then dispatched to the TPU chips (devices) for execution. Results flow back to the worker through an outfeed queue, while input data enters through an infeed queue.
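As a minimal sketch of this host-device flow in JAX (assuming a TPU VM, where `jax.devices()` reports the attached chips; the shapes here are illustrative):

```python
import jax
import jax.numpy as jnp

# On a TPU worker, jax.devices() lists the attached TPU devices.
print(jax.devices())

@jax.jit  # traced on the host CPU, compiled by XLA to TPU machine code
def predict(w, x):
    return jnp.dot(x, w)

w = jnp.ones((1024, 1024))
x = jnp.ones((8, 1024))

# The compiled program is dispatched to the TPU; the result is copied
# back to the host only when it is actually read (e.g., printed).
y = predict(w, x)
print(y.shape)
```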
Each TPU worker can access up to 8 TPU chips, depending on the machine type and TPU generation. For example, a TPU v5e worker using the ct5lp-hightpu-8t machine type has access to 8 chips, while a ct5lp-hightpu-1t configuration provides only 1 chip per worker.
Google Cloud has offered two distinct architectures for accessing TPU hardware. Understanding the difference is important because the term "TPU worker" carries slightly different meaning in each.
| Feature | TPU node (deprecated) | TPU VM (current) |
|---|---|---|
| User access | Separate user VM (n1 instance) communicates with TPU host over gRPC | Direct SSH access to the TPU host VM |
| Data pipeline | Runs on a separate VM; data transferred over network to TPU host | Runs directly on the TPU host, eliminating extra network hop |
| Debugging | Limited access to TPU runtime logs | Full root access to compiler and runtime debug logs |
| Framework support | Primarily TensorFlow | JAX, PyTorch, and TensorFlow |
| Status | Deprecated as of April 2025 | Current recommended architecture |
In the older TPU node architecture, the "worker" referred to the remote TPU host that the user could not directly access. In the current TPU VM architecture, the worker is the VM that the user directly logs into and runs code on. The TPU VM approach eliminates the overhead of a separate user VM and enables training setups (such as distributed reinforcement learning) that were not feasible with the TPU node model.
Each TPU worker is connected to TPU chips that contain specialized processing elements.
A TensorCore is the primary compute unit within a TPU chip. Each TensorCore contains:

- One or more matrix multiply units (MXUs), systolic arrays that perform the bulk of the matrix multiplication work
- A vector processing unit (VPU) for elementwise operations such as activations and reductions
- A scalar unit that handles control flow, scalar arithmetic, and memory address generation
The number of TensorCores per chip varies by generation. TPU v4 has two TensorCores per chip (each with four MXUs), while TPU v5e has one TensorCore per chip (with four MXUs).
TPU chips use high-bandwidth memory (HBM) for storing model parameters and activations. Capacity and bandwidth vary by generation:
| TPU generation | HBM per chip | HBM bandwidth per chip | TensorCores per chip | MXUs per TensorCore |
|---|---|---|---|---|
| v2 | 16 GB | 600 GB/s | 2 | 2 |
| v3 | 32 GB | 900 GB/s | 2 | 2 |
| v4 | 32 GB | 1,200 GB/s | 2 | 4 |
| v5e | 16 GB | 819 GB/s | 1 | 4 |
| v5p | 95 GB | 2,765 GB/s | 2 | 4 |
| v6e (Trillium) | 32 GB | 1,640 GB/s | 1 | 4 |
| Ironwood (v7) | 192 GB | 7,370 GB/s | 1 | 4 |
In addition to HBM, each TensorCore has on-chip vector memory (VMEM) that serves as a software-controlled scratchpad. VMEM bandwidth is roughly 22 times higher than HBM bandwidth, making it valuable for operations on smaller tensors that fit in local storage.
Starting with TPU v4, Google introduced SparseCores: specialized dataflow processors designed to accelerate embedding operations common in recommendation and ranking models. TPU v4 includes four SparseCores per chip, each with 2.5 MB of scratchpad memory. SparseCores accelerate embedding-heavy models by 5x to 7x while using only about 5% of die area and power. TPU v5p and Ironwood (v7) also include SparseCores.
TPU workers participate in a layered communication system. The bandwidth at each level determines how training workloads should be partitioned across devices.
| Communication layer | Description | Typical bandwidth | Direction |
|---|---|---|---|
| HBM | Between TensorCore and on-chip memory | 600 GB/s to 7,370 GB/s (varies by generation) | On-chip |
| VMEM | Between TensorCore and scratchpad | ~22x HBM bandwidth | On-chip |
| ICI (inter-chip interconnect) | Between neighboring TPU chips in a slice | 45 to 200 GB/s per axis (varies by generation) | Within a slice |
| PCIe | Between CPU host and TPU chips | ~16 GB/s | Within a worker |
| DCN (data center network) | Between CPU hosts across workers | 3.125 to 12.5 GB/s per TPU (varies by generation) | Across workers |
The steep drop in bandwidth from ICI to DCN (roughly 10x or more) has major implications for how parallelism strategies are chosen. Operations that require frequent, low-latency communication (such as tensor parallelism) work well over ICI within a single slice, while strategies that tolerate higher latency and lower bandwidth (such as data parallelism) are better suited for DCN communication across slices.
TPU workers are deployed in one of two modes, depending on the scale of the workload.
A single-host configuration uses one TPU VM with its attached chips. This is appropriate for smaller models that fit within the memory and compute capacity of a single worker. For TPU v5e, a single host can access up to 8 chips, providing up to 1,576 TFLOPS of bf16 compute and 128 GB of combined HBM.
A multi-host configuration distributes training across multiple TPU VMs. Each worker runs the same training program (following the SPMD, or Single Program Multiple Data, paradigm), but operates on different portions of data or different shards of the model. Workers within a multi-host slice are connected by ICI, enabling high-bandwidth collective operations such as all-reduce.
Multi-host configurations are required when the model or batch size exceeds what a single worker can handle. For example, a TPU v4 slice with a 4x4x4 topology consists of 64 chips spread across 16 workers (4 chips per worker in TPU v4).
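In JAX, each worker can inspect its place in the slice. The comments below show what a hypothetical 4x4x4 TPU v4 slice would report, assuming one JAX process per worker:

```python
import jax

# Every worker in the slice runs this same program (SPMD).
# Expected values for a hypothetical 4x4x4 TPU v4 slice:
print(jax.process_count())       # 16 -- one process per worker
print(jax.local_device_count())  # 4  -- chips attached to this worker
print(jax.device_count())        # 64 -- all chips in the slice
print(jax.process_index())       # this worker's rank, 0..15
```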
TPU workers are organized into increasingly large groupings.
A slice is a collection of TPU chips within the same pod, connected by high-speed ICI links. All workers in a slice can communicate directly through ICI without going through the data center network. Slice topology is specified as a tuple describing the chip layout: a two-element tuple such as 4x8 for generations with a 2D torus (v2, v3, v5e, v6e), or a three-element tuple such as 4x4x8 for generations with a 3D torus (v4, v5p).
Some topologies support a twisted torus configuration that increases bisection bandwidth. For example, a 4x4x8_twisted topology provides approximately 70% higher bisection bandwidth compared to a non-twisted 4x4x8 layout.
A TPU pod is the largest contiguous grouping of TPU chips connected by ICI. Pod sizes vary by generation:
| TPU generation | Maximum pod size (chips) | Peak pod compute (bf16) |
|---|---|---|
| v3 | 1,024 | 126 PFLOPS |
| v4 | 4,096 | 1.1 EFLOPS |
| v5e | 256 | 50.6 PFLOPS |
| v5p | 8,960 | 4.1 EFLOPS |
| Ironwood (v7) | 9,216 | ~42.5 EFLOPS (fp8) |
Within a pod, reconfigurable optical circuit switches (OCS) can dynamically rearrange inter-cube connections, improving fault tolerance and scheduling flexibility. This feature was introduced with TPU v4.
Multislice extends TPU capacity beyond a single pod by connecting multiple slices over the data center network (DCN). Within each slice, chips continue to communicate over ICI. Between slices, data is transferred from TPU chips to the CPU host over PCIe, then across the DCN to other hosts.
Developers do not need to write explicit inter-slice communication code. The XLA compiler detects the hybrid ICI/DCN topology and automatically generates hierarchical collective operations, overlapping DCN communication with computation to hide latency.
Multislice has been used to run training jobs on over 50,000 TPU v5e chips simultaneously, representing the largest distributed LLM training job on TPUs as of the announcement.
For Multislice training to be efficient, the ratio of computation to communication must be high enough to keep TPU chips busy while gradients are synchronized over DCN. For TPU v4 chips (275 TFLOPS each) with a per-host DCN bandwidth of 50 Gbps, the required arithmetic intensity is approximately 22,000 FLOPS per bit. In practice, this means transformer models trained across two slices need a minimum batch size of roughly 350,000 tokens, with 700,000 or more tokens recommended when using many slices.
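Spelled out, using the four chips per v4 host noted earlier:

$$
\text{required intensity} = \frac{\text{per-host compute}}{\text{per-host DCN bandwidth}} = \frac{4 \times 275 \times 10^{12}\ \text{FLOPs/s}}{50 \times 10^{9}\ \text{bits/s}} = 22{,}000\ \text{FLOPs/bit}
$$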
Distributed training across TPU workers uses several parallelism approaches, often combined.
Each worker holds a complete copy of the model and processes a different subset of the training batch. After computing gradients locally, workers synchronize through an all-reduce operation. Data parallelism is the simplest strategy and works well when the model fits in a single worker's memory.
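A classic data-parallel training step in JAX might look like the following sketch, using `jax.pmap` (the model, shapes, and learning rate are illustrative; newer JAX code often expresses the same thing with `jax.jit` and sharding annotations):

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    return jnp.mean((x @ params - y) ** 2)

@partial(jax.pmap, axis_name='batch')
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    # All-reduce gradients across devices (over ICI within a slice).
    grads = jax.lax.pmean(grads, axis_name='batch')
    return params - 0.01 * grads

n = jax.local_device_count()
# Replicate the parameters on every device; shard the batch across them.
params = jnp.stack([jnp.zeros((4, 1))] * n)
x = jnp.ones((n, 32, 4))   # per-device batch of 32 examples
y = jnp.ones((n, 32, 1))
params = train_step(params, x, y)
```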
FSDP partitions model parameters, gradients, and optimizer states across workers. Each worker stores only a shard of the full model state, reducing per-worker memory requirements. Parameters are gathered on demand for forward and backward passes, then re-sharded afterward. FSDP is commonly used within a slice over ICI.
Large matrix operations are split across multiple chips, with each chip computing a portion of the result. This requires frequent inter-chip communication and works best over ICI within a slice. Tensor parallelism is not recommended over DCN due to the high communication overhead.
Different layers of the model are assigned to different workers. Data flows through the pipeline in micro-batches. Pipeline parallelism reduces per-worker memory requirements but introduces pipeline bubbles (idle time between micro-batches).
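As a rough rule of thumb (the standard GPipe-style estimate, not a TPU-specific figure): with $p$ pipeline stages and $m$ micro-batches per step, the idle fraction is

$$
\text{bubble fraction} = \frac{p - 1}{m + p - 1}
$$

so increasing the micro-batch count shrinks the bubble, at the cost of smaller per-stage batches.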
Large-scale training typically combines multiple parallelism approaches. A common pattern for LLM training on TPUs uses data parallelism across slices over DCN, where each slice stores a model replica. Within each slice, the model is sharded across chips using FSDP or a combination of FSDP and tensor parallelism over ICI.
The ICI and DCN parallelism dimensions are configured independently. Within a slice, the product of ici_data_parallelism, ici_fsdp_parallelism, and ici_tensor_parallelism must equal the number of chips per slice. Across slices, the corresponding DCN values must multiply to equal the number of slices. Google's documentation recommends always setting dcn_tensor_parallelism to 1, since DCN bandwidth is too low for the frequent communication that tensor parallelism requires.
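These configuration names follow Google's MaxText reference trainer. A hypothetical Multislice layout of 4 slices with 256 v5e chips each might be set up along these lines:

```python
# Hypothetical Multislice layout: 4 slices of 256 TPU v5e chips each.
num_slices = 4
chips_per_slice = 256

# Within a slice (over ICI): the product must equal chips_per_slice.
ici_data_parallelism = 1
ici_fsdp_parallelism = 64
ici_tensor_parallelism = 4
assert (ici_data_parallelism * ici_fsdp_parallelism
        * ici_tensor_parallelism) == chips_per_slice

# Across slices (over DCN): the product must equal num_slices.
dcn_data_parallelism = 4    # pure data parallelism between slices
dcn_fsdp_parallelism = 1
dcn_tensor_parallelism = 1  # keep at 1: DCN is too slow for tensor parallelism
assert (dcn_data_parallelism * dcn_fsdp_parallelism
        * dcn_tensor_parallelism) == num_slices
```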
The XLA (Accelerated Linear Algebra) compiler is central to how TPU workers execute computation.
For distributed execution, XLA uses the GSPMD (General-purpose SPMD) model. Developers annotate how tensors should be sharded across devices using high-level APIs such as JAX's jax.sharding or shard_map. The XLA compiler then automatically inserts the necessary collective communication operations (all-reduce, all-gather, reduce-scatter) and maps the single program onto all TPU chips in the configuration.
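A minimal GSPMD sketch in JAX, assuming a single worker with 8 chips viewed as a hypothetical 2x4 logical mesh (the array shapes are illustrative):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical single worker with 8 chips, arranged as a 2x4 logical mesh.
mesh = Mesh(np.array(jax.devices()).reshape(2, 4),
            axis_names=('data', 'model'))

# Shard the batch over 'data' and the contracting dimension over 'model'.
x = jax.device_put(jnp.ones((128, 1024)),
                   NamedSharding(mesh, P('data', 'model')))
w = jax.device_put(jnp.ones((1024, 4096)),
                   NamedSharding(mesh, P('model', None)))

@jax.jit
def forward(x, w):
    # Each chip computes a partial product; GSPMD inserts the all-reduce
    # over the 'model' axis automatically -- no explicit collectives here.
    return x @ w

y = forward(x, w)  # sharded result; never fully materialized on one chip
```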
Google Kubernetes Engine (GKE) provides managed orchestration of TPU workers through TPU slice node pools.
Each node in a single-host TPU node pool is an independent TPU VM. The TPUs attached to different VMs are not interconnected via ICI. Nodes can be added or removed individually, and standard Kubernetes scaling behavior applies.
Multi-host TPU slice node pools contain two or more interconnected TPU VMs that form a single slice. GKE treats these as atomic units:

- The node pool is created, scaled, and deleted as a single unit; individual nodes cannot be added or removed
- If any node fails, GKE recreates the entire node pool rather than replacing the failed node
- Workloads must be scheduled onto all nodes in the slice simultaneously
A container requesting TPU resources in GKE must consume all TPU chips on the node; partial consumption is not allowed. TPU slice nodes carry a google.com/tpu taint that prevents non-TPU workloads from being scheduled on them.
At the scale of thousands of TPU chips, hardware failures are expected. TPU workers and their infrastructure include several resilience mechanisms.
For TPU v4, v5p, and Ironwood (v7), ICI resiliency is enabled by default for slices of one cube (4x4x4, or 64 chips) or larger. When an optical ICI link or optical circuit switch fails, the system routes traffic around the fault. This improves scheduling availability at the cost of a temporary performance reduction.
In a Multislice configuration, if one slice experiences a failure, Cloud TPU automatically creates a replacement slice. All other slices in the environment are restarted to re-establish the distributed training job. Proper checkpoint management is required to resume training from the last saved state.
Because multi-host node pools are recreated atomically on failure, TPU training jobs should save checkpoints frequently to durable storage such as Google Cloud Storage. Frameworks like Orbax (for JAX) and standard PyTorch checkpointing utilities handle distributed checkpoint saving across workers.
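A minimal sketch with Orbax's `CheckpointManager`, assuming a recent Orbax release, a hypothetical GCS bucket, and a stand-in training state:

```python
import jax.numpy as jnp
import orbax.checkpoint as ocp

# Stand-in training state; in practice this would be the full model/optimizer state.
state = {'params': jnp.ones((1024, 1024))}

# Hypothetical bucket path; keep the 3 most recent checkpoints,
# saving every 500 steps.
options = ocp.CheckpointManagerOptions(save_interval_steps=500, max_to_keep=3)
mngr = ocp.CheckpointManager('gs://my-bucket/ckpts', options=options)

for step in range(2000):
    # ... run one training step, updating `state` ...
    mngr.save(step, args=ocp.args.StandardSave(state))  # no-op except every 500 steps

mngr.wait_until_finished()  # block until async writes to GCS complete
```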
The following table summarizes peak per-chip performance for each TPU generation:
| Generation | Year | Process node | bf16 TFLOPS | Int8 TOPS | HBM (GB) | ICI topology | Pod chips |
|---|---|---|---|---|---|---|---|
| v1 | 2015 | 28 nm | N/A (int8 only) | 92 | 8 (DDR3) | N/A | N/A |
| v2 | 2017 | 16 nm | 45 | 45 | 16 | 2D torus | 512 |
| v3 | 2018 | 16 nm | 123 | 123 | 32 | 2D torus | 1,024 |
| v4 | 2021 | 7 nm | 275 | 275 | 32 | 3D torus | 4,096 |
| v5e | 2023 | N/A | 197 | 393 | 16 | 2D torus | 256 |
| v5p | 2023 | N/A | 459 | 918 | 95 | 3D torus | 8,960 |
| v6e (Trillium) | 2024 | N/A | 918 | 1,840 | 32 | 2D torus | 256 |
| Ironwood (v7) | 2025 | N/A | 4,614 (fp8) | N/A | 192 | 3D torus | 9,216 |
TPU v1 was designed for inference only. All subsequent generations (v2 onward) support both training and inference.
TPU workers differ from GPU-based workers in several fundamental ways.
| Aspect | TPU worker | GPU worker |
|---|---|---|
| Chip design | Application-specific integrated circuit (ASIC) built for ML | General-purpose processor adapted for ML |
| Interconnect | ICI torus connecting nearest neighbors; constant per-device link bandwidth | Hierarchical switching (NVLink, NVSwitch) approximating point-to-point |
| Programming model | XLA compilation with SPMD; framework compiles full graph before execution | CUDA kernels; supports both eager and graph-based execution |
| Memory model | HBM per chip, distributed across torus | HBM per GPU, with unified memory in some architectures |
| Scaling unit | Slice/pod with ICI; Multislice with DCN | Multi-GPU nodes with NVLink; multi-node with InfiniBand or Ethernet |
| Framework support | JAX (native), PyTorch/XLA, TensorFlow | PyTorch (native), TensorFlow, JAX (via GPU backend) |
| Availability | Google Cloud only (Cloud TPU, GKE, Vertex AI) | Multiple cloud providers and on-premises |
| Energy efficiency | 2 to 3x more efficient per ML operation | Higher absolute power draw per chip |
The architectural difference in interconnect topology is notable. GPUs use hierarchical switching that approximates point-to-point connections, offering high bandwidth between any two GPUs in the same node but requiring network switches for cross-node communication. TPUs use a torus topology with constant per-device link bandwidth, which provides more predictable communication patterns and lower cost per link, but limits each chip to communicating directly only with its nearest neighbors.
TPU workers support three major ML frameworks:

- JAX, which targets TPUs natively through XLA
- PyTorch, through the PyTorch/XLA integration
- TensorFlow, the original framework supported on TPUs
TPU workers are well-suited for workloads with specific characteristics:

- Models dominated by dense matrix computations, such as transformers and convolutional networks
- Large models with large effective batch sizes that can keep the MXUs saturated
- Long training runs (weeks or months), where cost and energy efficiency compound
- Workloads without custom operations that must execute outside the XLA-compiled graph
Google began internal TPU development in 2013 under Amir Salek, who was recruited to establish a custom silicon team. Norman P. Jouppi served as the technical lead and principal architect, directing the design and deployment of the first TPU in approximately 15 months. The first TPU (v1) was publicly unveiled at Google I/O in 2016 and had been used internally since 2015 to power services including Google Search, Google Photos, and Google Translate. The TPU v1 was also used in the AlphaGo system that defeated world Go champion Lee Sedol in March 2016.
Google made TPUs commercially available through Google Cloud Platform in 2018. Broadcom has served as a co-developer across all TPU generations, providing technologies such as SerDes high-speed interfaces and managing fabrication through TSMC.
Jonathan Ross, one of the original TPU engineers, later founded Groq, a company building its own ML accelerator chips.