A TPU worker is a virtual machine (VM) running Linux that has direct access to one or more Tensor Processing Unit (TPU) chips. In Google Cloud's TPU architecture, each worker serves as the computational host responsible for executing machine learning workloads on the attached TPU hardware. Workers load and preprocess data, dispatch programs compiled by the XLA compiler to the attached TPU chips, and coordinate with other workers during distributed training. The term "TPU worker" is used interchangeably with "TPU VM" in Google Cloud documentation, and it represents the fundamental unit of execution in both single-host and multi-host TPU configurations.
Imagine you have a special calculator that is really fast at doing math homework (that is the TPU chip). But the calculator cannot read the homework by itself. It needs a helper to read the problems, write them down for the calculator, and then collect the answers. That helper is the TPU worker. Sometimes the homework is so big that you need many helpers, each with their own calculator, all working together on different parts of the homework at the same time. The helpers pass notes to each other so they all stay on the same page.
A TPU worker is a CPU-based virtual machine physically connected to TPU hardware via PCIe. The worker handles tasks that the TPU chips themselves cannot perform: reading data from storage, running preprocessing pipelines, managing control flow, and communicating with other workers over the data center network. The actual tensor computations (matrix multiplications, convolutions, and other deep learning operations) are offloaded to the TPU chips.
The relationship between a TPU worker and its TPU chips follows a host-device model. The worker (host) prepares computation graphs using a framework such as JAX, PyTorch, or TensorFlow. These graphs are compiled by the XLA (Accelerated Linear Algebra) compiler into optimized TPU machine code. The compiled program is then dispatched to the TPU chips (devices) for execution. Results flow back to the worker through an outfeed queue, while input data enters through an infeed queue.
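As a minimal sketch of this host-device flow in JAX (assuming a TPU VM, where `jax.devices()` reports the attached chips; the shapes here are illustrative):

```python
import jax
import jax.numpy as jnp

# On a TPU worker, jax.devices() lists the attached TPU devices.
print(jax.devices())

@jax.jit  # traced on the host CPU, compiled by XLA to TPU machine code
def predict(w, x):
    return jnp.dot(x, w)

w = jnp.ones((1024, 1024))
x = jnp.ones((8, 1024))

# The compiled program is dispatched to the TPU; the result is copied
# back to the host only when it is actually read (e.g., printed).
y = predict(w, x)
print(y.shape)
```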
Each TPU worker can access up to 8 TPU chips, depending on the machine type and TPU generation. For example, a TPU v5e worker using the ct5lp-hightpu-8t machine type has access to 8 chips, while a ct5lp-hightpu-1t configuration provides only 1 chip per worker.
Google Cloud has offered two distinct architectures for accessing TPU hardware. Understanding the difference is important because the term "TPU worker" carries slightly different meaning in each.
| Feature | TPU node (deprecated) | TPU VM (current) |
|---|---|---|
| User access | Separate user VM (n1 instance) communicates with TPU host over gRPC | Direct SSH access to the TPU host VM |
| Data pipeline | Runs on a separate VM; data transferred over network to TPU host | Runs directly on the TPU host, eliminating extra network hop |
| Debugging | Limited access to TPU runtime logs | Full root access to compiler and runtime debug logs |
| Framework support | Primarily TensorFlow | JAX, PyTorch, and TensorFlow |
| Status | Deprecated as of April 2025 | Current recommended architecture |
In the older TPU node architecture, the "worker" referred to the remote TPU host that the user could not directly access. In the current TPU VM architecture, the worker is the VM that the user directly logs into and runs code on. The TPU VM approach eliminates the overhead of a separate user VM and enables training setups (such as distributed reinforcement learning) that were not feasible with the TPU node model.
Each TPU worker is connected to TPU chips that contain specialized processing elements.
A TensorCore is the primary compute unit within a TPU chip. Each TensorCore contains:

- One or more matrix multiply units (MXUs), systolic arrays that perform the bulk of the matrix multiplication work
- A vector processing unit (VPU) for elementwise operations such as activations and reductions
- A scalar unit that handles control flow, scalar arithmetic, and memory address generation
The number of TensorCores per chip varies by generation. TPU v4 has two TensorCores per chip (each with four MXUs), while TPU v5e has one TensorCore per chip (with four MXUs).
TPU chips use high-bandwidth memory (HBM) for storing model parameters and activations. Capacity and bandwidth vary by generation:
| TPU generation | HBM per chip | HBM bandwidth per chip | TensorCores per chip | MXUs per TensorCore |
|---|---|---|---|---|
| v2 | 16 GB | 600 GB/s | 2 | 2 |
| v3 | 32 GB | 900 GB/s | 2 | 2 |
| v4 | 32 GB | 1,200 GB/s | 2 | 4 |
| v5e | 16 GB | 819 GB/s | 1 | 4 |
| v5p | 95 GB | 2,765 GB/s | 2 | 4 |
| v6e (Trillium) | 32 GB | 1,640 GB/s | 1 | 4 |
| Ironwood (v7) | 192 GB | 7,370 GB/s | 1 | 4 |
In addition to HBM, each TensorCore has on-chip vector memory (VMEM) that serves as a software-controlled scratchpad. VMEM bandwidth is roughly 22 times higher than HBM bandwidth, making it valuable for operations on smaller tensors that fit in local storage.
Starting with TPU v4, Google introduced SparseCores: specialized dataflow processors designed to accelerate embedding operations common in recommendation and ranking models. TPU v4 includes four SparseCores per chip, each with 2.5 MB of scratchpad memory. SparseCores accelerate embedding-heavy models by 5x to 7x while using only about 5% of die area and power. TPU v5p and Ironwood (v7) also include SparseCores.
TPU workers participate in a layered communication system. The bandwidth at each level determines how training workloads should be partitioned across devices.
| Communication layer | Description | Typical bandwidth | Direction |
|---|---|---|---|
| HBM | Between TensorCore and on-chip memory | 600 GB/s to 7,370 GB/s (varies by generation) | On-chip |
| VMEM | Between TensorCore and scratchpad | ~22x HBM bandwidth | On-chip |
| ICI (inter-chip interconnect) | Between neighboring TPU chips in a slice | 45 to 200 GB/s per axis (varies by generation) | Within a slice |
| PCIe | Between CPU host and TPU chips | ~16 GB/s | Within a worker |
| DCN (data center network) | Between CPU hosts across workers | 3.125 to 12.5 GB/s per TPU (varies by generation) | Across workers |
The steep drop in bandwidth from ICI to DCN (roughly 10x or more) has major implications for how parallelism strategies are chosen. Operations that require frequent, low-latency communication (such as tensor parallelism) work well over ICI within a single slice, while strategies that tolerate higher latency and lower bandwidth (such as data parallelism) are better suited for DCN communication across slices.
TPU workers are deployed in one of two modes, depending on the scale of the workload.
A single-host configuration uses one TPU VM with its attached chips. This is appropriate for smaller models that fit within the memory and compute capacity of a single worker. For TPU v5e, a single host can access up to 8 chips, providing up to 1,576 TFLOPS of bf16 compute and 128 GB of combined HBM.
A multi-host configuration distributes training across multiple TPU VMs. Each worker runs the same training program (following the SPMD, or Single Program Multiple Data, paradigm), but operates on different portions of data or different shards of the model. Workers within a multi-host slice are connected by ICI, enabling high-bandwidth collective operations such as all-reduce.
Multi-host configurations are required when the model or batch size exceeds what a single worker can handle. For example, a TPU v4 slice with a 4x4x4 topology consists of 64 chips spread across 16 workers (4 chips per worker in TPU v4).
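In JAX, each worker can inspect its place in the slice. The comments below show what a hypothetical 4x4x4 TPU v4 slice would report, assuming one JAX process per worker:

```python
import jax

# Every worker in the slice runs this same program (SPMD).
# Expected values for a hypothetical 4x4x4 TPU v4 slice:
print(jax.process_count())       # 16 -- one process per worker
print(jax.local_device_count())  # 4  -- chips attached to this worker
print(jax.device_count())        # 64 -- all chips in the slice
print(jax.process_index())       # this worker's rank, 0..15
```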
TPU workers are organized into increasingly large groupings.
A slice is a collection of TPU chips within the same pod, connected by high-speed ICI links. All workers in a slice can communicate directly through ICI without going through the data center network. Slice topology is specified as a tuple describing the chip layout: a two-element tuple such as 4x8 for generations with a 2D torus (v2, v3, v5e, v6e), or a three-element tuple such as 4x4x8 for generations with a 3D torus (v4, v5p).
Some topologies support a twisted torus configuration that increases bisection bandwidth. For example, a 4x4x8_twisted topology provides approximately 70% higher bisection bandwidth compared to a non-twisted 4x4x8 layout.
A TPU pod is the largest contiguous grouping of TPU chips connected by ICI. Pod sizes vary by generation:
| TPU generation | Maximum pod size (chips) | Peak pod compute (bf16) |
|---|---|---|
| v3 | 1,024 | 126 PFLOPS |
| v4 | 4,096 | 1.1 EFLOPS |
| v5e | 256 | 50.6 PFLOPS |
| v5p | 8,960 | 4.1 EFLOPS |
| Ironwood (v7) | 9,216 | ~42.5 EFLOPS (fp8) |
Within a pod, reconfigurable optical circuit switches (OCS) can dynamically rearrange inter-cube connections, improving fault tolerance and scheduling flexibility. This feature was introduced with TPU v4.
Multislice extends TPU capacity beyond a single pod by connecting multiple slices over the data center network (DCN). Within each slice, chips continue to communicate over ICI. Between slices, data is transferred from TPU chips to the CPU host over PCIe, then across the DCN to other hosts.
Developers do not need to write explicit inter-slice communication code. The XLA compiler detects the hybrid ICI/DCN topology and automatically generates hierarchical collective operations, overlapping DCN communication with computation to hide latency.
Multislice has been used to run training jobs on over 50,000 TPU v5e chips simultaneously, representing the largest distributed LLM training job on TPUs as of the announcement.
For Multislice training to be efficient, the ratio of computation to communication must be high enough to keep TPU chips busy while gradients are synchronized over DCN. For TPU v4 chips (275 TFLOPS each) with a per-host DCN bandwidth of 50 Gbps, the required arithmetic intensity is approximately 22,000 FLOPS per bit. In practice, this means transformer models trained across two slices need a minimum batch size of roughly 350,000 tokens, with 700,000 or more tokens recommended when using many slices.
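Spelled out, using the four chips per v4 host noted earlier:

$$
\text{required intensity} = \frac{\text{per-host compute}}{\text{per-host DCN bandwidth}} = \frac{4 \times 275 \times 10^{12}\ \text{FLOPs/s}}{50 \times 10^{9}\ \text{bits/s}} = 22{,}000\ \text{FLOPs/bit}
$$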
Distributed training across TPU workers uses several parallelism approaches, often combined.
Each worker holds a complete copy of the model and processes a different subset of the training batch. After computing gradients locally, workers synchronize through an all-reduce operation. Data parallelism is the simplest strategy and works well when the model fits in a single worker's memory.
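A classic data-parallel training step in JAX might look like the following sketch, using `jax.pmap` (the model, shapes, and learning rate are illustrative; newer JAX code often expresses the same thing with `jax.jit` and sharding annotations):

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    return jnp.mean((x @ params - y) ** 2)

@partial(jax.pmap, axis_name='batch')
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    # All-reduce gradients across devices (over ICI within a slice).
    grads = jax.lax.pmean(grads, axis_name='batch')
    return params - 0.01 * grads

n = jax.local_device_count()
# Replicate the parameters on every device; shard the batch across them.
params = jnp.stack([jnp.zeros((4, 1))] * n)
x = jnp.ones((n, 32, 4))   # per-device batch of 32 examples
y = jnp.ones((n, 32, 1))
params = train_step(params, x, y)
```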
FSDP partitions model parameters, gradients, and optimizer states across workers. Each worker stores only a shard of the full model state, reducing per-worker memory requirements. Parameters are gathered on demand for forward and backward passes, then re-sharded afterward. FSDP is commonly used within a slice over ICI.
Large matrix operations are split across multiple chips, with each chip computing a portion of the result. This requires frequent inter-chip communication and works best over ICI within a slice. Tensor parallelism is not recommended over DCN due to the high communication overhead.
Different layers of the model are assigned to different workers. Data flows through the pipeline in micro-batches. Pipeline parallelism reduces per-worker memory requirements but introduces pipeline bubbles (idle time between micro-batches).
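As a rough rule of thumb (the standard GPipe-style estimate, not a TPU-specific figure): with $p$ pipeline stages and $m$ micro-batches per step, the idle fraction is

$$
\text{bubble fraction} = \frac{p - 1}{m + p - 1}
$$

so increasing the micro-batch count shrinks the bubble, at the cost of smaller per-stage batches.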
Large-scale training typically combines multiple parallelism approaches. A common pattern for LLM training on TPUs uses data parallelism across slices over DCN, where each slice stores a model replica. Within each slice, the model is sharded across chips using FSDP or a combination of FSDP and tensor parallelism over ICI.
The ICI and DCN parallelism dimensions are configured independently. Within a slice, the product of ici_data_parallelism, ici_fsdp_parallelism, and ici_tensor_parallelism must equal the number of chips per slice. Across slices, the corresponding DCN values must multiply to equal the number of slices. Google's documentation recommends always setting dcn_tensor_parallelism to 1, since DCN bandwidth is too low for the frequent communication that tensor parallelism requires.
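These configuration names follow Google's MaxText reference trainer. A hypothetical Multislice layout of 4 slices with 256 v5e chips each might be set up along these lines:

```python
# Hypothetical Multislice layout: 4 slices of 256 TPU v5e chips each.
num_slices = 4
chips_per_slice = 256

# Within a slice (over ICI): the product must equal chips_per_slice.
ici_data_parallelism = 1
ici_fsdp_parallelism = 64
ici_tensor_parallelism = 4
assert (ici_data_parallelism * ici_fsdp_parallelism
        * ici_tensor_parallelism) == chips_per_slice

# Across slices (over DCN): the product must equal num_slices.
dcn_data_parallelism = 4    # pure data parallelism between slices
dcn_fsdp_parallelism = 1
dcn_tensor_parallelism = 1  # keep at 1: DCN is too slow for tensor parallelism
assert (dcn_data_parallelism * dcn_fsdp_parallelism
        * dcn_tensor_parallelism) == num_slices
```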
The XLA (Accelerated Linear Algebra) compiler is central to how TPU workers execute computation.
For distributed execution, XLA uses the GSPMD (General-purpose SPMD) model. Developers annotate how tensors should be sharded across devices using high-level APIs such as JAX's jax.sharding or shard_map. The XLA compiler then automatically inserts the necessary collective communication operations (all-reduce, all-gather, reduce-scatter) and maps the single program onto all TPU chips in the configuration.
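A minimal GSPMD sketch in JAX, assuming a single worker with 8 chips viewed as a hypothetical 2x4 logical mesh (the array shapes are illustrative):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical single worker with 8 chips, arranged as a 2x4 logical mesh.
mesh = Mesh(np.array(jax.devices()).reshape(2, 4),
            axis_names=('data', 'model'))

# Shard the batch over 'data' and the contracting dimension over 'model'.
x = jax.device_put(jnp.ones((128, 1024)),
                   NamedSharding(mesh, P('data', 'model')))
w = jax.device_put(jnp.ones((1024, 4096)),
                   NamedSharding(mesh, P('model', None)))

@jax.jit
def forward(x, w):
    # Each chip computes a partial product; GSPMD inserts the all-reduce
    # over the 'model' axis automatically -- no explicit collectives here.
    return x @ w

y = forward(x, w)  # sharded result; never fully materialized on one chip
```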
Google Kubernetes Engine (GKE) provides managed orchestration of TPU workers through TPU slice node pools.
Each node in a single-host TPU node pool is an independent TPU VM. The TPUs attached to different VMs are not interconnected via ICI. Nodes can be added or removed individually, and standard Kubernetes scaling behavior applies.
Multi-host TPU slice node pools contain two or more interconnected TPU VMs that form a single slice. GKE treats these as atomic units:

- The node pool is created, scaled, and deleted as a single unit; individual nodes cannot be added or removed
- If any node fails, GKE recreates the entire node pool rather than replacing the failed node
- Workloads must be scheduled onto all nodes in the slice simultaneously
A container requesting TPU resources in GKE must consume all TPU chips on the node; partial consumption is not allowed. TPU slice nodes carry a google.com/tpu taint that prevents non-TPU workloads from being scheduled on them.
At the scale of thousands of TPU chips, hardware failures are expected. TPU workers and their infrastructure include several resilience mechanisms.
For TPU v4, v5p, and Ironwood (v7), ICI resiliency is enabled by default for slices of one cube (4x4x4, or 64 chips) or larger. When an optical ICI link or optical circuit switch fails, the system routes traffic around the fault. This improves scheduling availability at the cost of a temporary performance reduction.
In a Multislice configuration, if one slice experiences a failure, Cloud TPU automatically creates a replacement slice. All other slices in the environment are restarted to re-establish the distributed training job. Proper checkpoint management is required to resume training from the last saved state.
Because multi-host node pools are recreated atomically on failure, TPU training jobs should save checkpoints frequently to durable storage such as Google Cloud Storage. Frameworks like Orbax (for JAX) and standard PyTorch checkpointing utilities handle distributed checkpoint saving across workers.
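A minimal sketch with Orbax's `CheckpointManager`, assuming a recent Orbax release, a hypothetical GCS bucket, and a stand-in training state:

```python
import jax.numpy as jnp
import orbax.checkpoint as ocp

# Stand-in training state; in practice this would be the full model/optimizer state.
state = {'params': jnp.ones((1024, 1024))}

# Hypothetical bucket path; keep the 3 most recent checkpoints,
# saving every 500 steps.
options = ocp.CheckpointManagerOptions(save_interval_steps=500, max_to_keep=3)
mngr = ocp.CheckpointManager('gs://my-bucket/ckpts', options=options)

for step in range(2000):
    # ... run one training step, updating `state` ...
    mngr.save(step, args=ocp.args.StandardSave(state))  # no-op except every 500 steps

mngr.wait_until_finished()  # block until async writes to GCS complete
```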
The following table summarizes peak per-chip performance for each TPU generation:
| Generation | Year | Process node | bf16 TFLOPS | Int8 TOPS | HBM (GB) | ICI topology | Pod chips |
|---|---|---|---|---|---|---|---|
| v1 | 2015 | 28 nm | N/A (int8 only) | 92 | 8 (DDR3) | N/A | N/A |
| v2 | 2017 | 16 nm | 45 | 45 | 16 | 2D torus | 512 |
| v3 | 2018 | 16 nm | 123 | 123 | 32 | 2D torus | 1,024 |
| v4 | 2021 | 7 nm | 275 | 275 | 32 | 3D torus | 4,096 |
| v5e | 2023 | N/A | 197 | 393 | 16 | 2D torus | 256 |
| v5p | 2023 | N/A | 459 | 918 | 95 | 3D torus | 8,960 |
| v6e (Trillium) | 2024 | N/A | 918 | 1,840 | 32 | 2D torus | 256 |
| Ironwood (v7) | 2025 | N/A | 4,614 (fp8) | N/A | 192 | 3D torus | 9,216 |
TPU v1 was designed for inference only. All subsequent generations (v2 onward) support both training and inference.
TPU workers differ from GPU-based workers in several fundamental ways.
| Aspect | TPU worker | GPU worker |
|---|---|---|
| Chip design | Application-specific integrated circuit (ASIC) built for ML | General-purpose processor adapted for ML |
| Interconnect | ICI torus connecting nearest neighbors; constant per-device link bandwidth | Hierarchical switching (NVLink, NVSwitch) approximating point-to-point |
| Programming model | XLA compilation with SPMD; framework compiles full graph before execution | CUDA kernels; supports both eager and graph-based execution |
| Memory model | HBM per chip, distributed across torus | HBM per GPU, with unified memory in some architectures |
| Scaling unit | Slice/pod with ICI; Multislice with DCN | Multi-GPU nodes with NVLink; multi-node with InfiniBand or Ethernet |
| Framework support | JAX (native), PyTorch/XLA, TensorFlow | PyTorch (native), TensorFlow, JAX (via GPU backend) |
| Availability | Google Cloud only (Cloud TPU, GKE, Vertex AI) | Multiple cloud providers and on-premises |
| Energy efficiency | 2 to 3x more efficient per ML operation | Higher absolute power draw per chip |
The architectural difference in interconnect topology is notable. GPUs use hierarchical switching that approximates point-to-point connections, offering high bandwidth between any two GPUs in the same node but requiring network switches for cross-node communication. TPUs use a torus topology with constant per-device link bandwidth, which provides more predictable communication patterns and lower cost per link, but limits each chip to communicating directly only with its nearest neighbors.
TPU workers support three major ML frameworks:

- JAX, which targets TPUs natively through XLA
- PyTorch, through the PyTorch/XLA integration
- TensorFlow, the original framework supported on TPUs
TPU workers are well-suited for workloads with specific characteristics:

- Models dominated by dense matrix computations, such as transformers and convolutional networks
- Large models with large effective batch sizes that can keep the MXUs saturated
- Long training runs (weeks or months), where cost and energy efficiency compound
- Workloads without custom operations that must execute outside the XLA-compiled graph
Google began internal TPU development in 2013 under Amir Salek, who was recruited to establish a custom silicon team. Norman P. Jouppi served as the technical lead and principal architect, directing the design and deployment of the first TPU in approximately 15 months. The first TPU (v1) was publicly unveiled at Google I/O in 2016 and had been used internally since 2015 to power services including Google Search, Google Photos, and Google Translate. The TPU v1 was also used in the AlphaGo system that defeated world Go champion Lee Sedol in March 2016.
Google made TPUs commercially available through Google Cloud Platform in 2018. Broadcom has served as a co-developer across all TPU generations, providing technologies such as SerDes high-speed interfaces and managing fabrication through TSMC.
Jonathan Ross, one of the original TPU engineers, later founded Groq, a company building its own ML accelerator chips.