A TPU slice is a configurable grouping of interconnected Tensor Processing Unit (TPU) chips within a Google Cloud TPU Pod. Slices range from a handful of chips to thousands, and they serve as the fundamental allocation unit for machine learning workloads on Google's custom AI accelerator hardware. Every chip inside a single slice communicates over a high-speed inter-chip interconnect (ICI), while chips in separate slices exchange data through the slower data-center network (DCN). By organizing TPU hardware into slices, Google gives users fine-grained control over how many chips a training or inference job receives, enabling workloads to scale from small experiments to runs spanning tens of thousands of processors.
Imagine a big box of building blocks. Each block is a tiny computer chip that is really good at one thing: doing lots of math very fast. A TPU slice is like picking a set of blocks out of the box and snapping them together so they can all talk to each other through special fast tunnels. If you need to solve a bigger puzzle, you snap more blocks together into a bigger slice. If your puzzle is truly enormous, you can use several slices at once; the slices talk to each other through regular hallways that are a bit slower than the tunnels, but still get the job done.
Google introduced the first Tensor Processing Unit (TPU v1) in 2015 as an inference-only accelerator. The TPU v2, announced in 2017, expanded the design to support training and introduced the concept of a TPU Pod: a rack-scale collection of chips wired together with a high-bandwidth 2D torus interconnect. Because a full Pod contained hundreds or thousands of chips, Google needed a way to let multiple users share the same physical Pod. The solution was the TPU slice, a logically partitioned subset of chips within a Pod that could be allocated independently.
As successive TPU generations increased Pod sizes (from 256 chips in v2 to 4,096 in v4 and 8,960 in v5p), slices became even more important for resource management. The introduction of optical circuit switches (OCS) in TPU v4 made slices dynamically reconfigurable: Google could carve out a slice of any supported size from the full Pod by programming the switches, without physically re-cabling hardware.
Understanding slices requires a brief look at the chips they contain. Each TPU chip is an application-specific integrated circuit (ASIC) built around one or more TensorCores. A TensorCore contains one or more matrix multiply units (MXUs), a vector unit for elementwise operations, and a scalar unit that handles control flow and address generation.
Each chip also has high-bandwidth memory (HBM) for storing model parameters and intermediate activations, plus ICI ports that link it to neighboring chips.
| TPU generation | TensorCores per chip | MXU array size | HBM per chip | Peak BF16 FLOPS per chip |
|---|---|---|---|---|
| v2 | 2 | 128x128 | 16 GB | 46 TFLOPS |
| v3 | 2 | 128x128 | 32 GB | 123 TFLOPS |
| v4 | 2 | 128x128 | 32 GB | 275 TFLOPS |
| v5e | 1 | 128x128 | 16 GB | 197 TFLOPS |
| v5p | 2 | 128x128 | 95 GB | 459 TFLOPS |
| v6e (Trillium) | 1 | 256x256 | 32 GB | 918 TFLOPS |
| TPU7x (Ironwood) | 2 | 256x256 | 192 GB | 2,307 TFLOPS |
A TPU slice is defined as a set of chips within a single Pod that are connected by ICI links. The key properties of a slice are its chip count, its topology (how the chips are arranged along each ICI dimension), and whether that topology includes wrap-around (torus) links.
TPU slices are described by their topology, a tuple indicating the number of chips along each network dimension: two dimensions for 2D generations (for example, 4x8 for a 32-chip v5e slice) and three dimensions for 3D generations (for example, 4x4x8 for a 128-chip v4 slice).
Google enforces ordering constraints (A <= B <= C for 3D) and requires that each dimension be either at most 4 or a multiple of 4. This ensures the resulting shape maps cleanly onto the physical wiring of the Pod.
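As an illustrative sketch (a hypothetical helper, not part of any Google tooling), these two rules can be expressed directly:

```python
def is_valid_3d_topology(dims: tuple[int, int, int]) -> bool:
    """Check the ordering and size rules described above for a 3D slice shape."""
    a, b, c = dims
    ordered = a <= b <= c                              # A <= B <= C
    sized = all(d <= 4 or d % 4 == 0 for d in dims)    # each dim <= 4 or a multiple of 4
    return ordered and sized

print(is_valid_3d_topology((4, 4, 8)))   # True  (valid shape)
print(is_valid_3d_topology((2, 4, 2)))   # False (dimensions not in non-decreasing order)
print(is_valid_3d_topology((4, 6, 8)))   # False (6 is neither <= 4 nor a multiple of 4)
```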
Starting with TPU v4, the physical building block of a Pod is the cube, a 4x4x4 arrangement of 64 chips housed in the same rack. Intra-cube ICI links use direct-attach copper (DAC) cables because the physical distances are short. Connections between cubes traverse optical transceivers and, in v4 and later, optical circuit switches.
Slices that are exact multiples of a full cube (for example, 4x4x4, 4x4x8, or 8x8x8) enjoy full 3D torus connectivity, meaning each dimension has wrap-around links that halve the maximum network diameter. Slices smaller than one cube lack wrap-around links, which roughly doubles the latency of collective communication operations compared to torus-connected slices of similar chip count.
ICI is the proprietary high-speed network that links chips within a slice. Its bandwidth has scaled across TPU generations:
| TPU generation | ICI bandwidth per chip (bidirectional) | Topology | Ports per chip |
|---|---|---|---|
| v2 | ~200 GBps | 2D torus | 4 |
| v3 | ~200 GBps | 2D torus | 4 |
| v4 | ~90 GBps | 3D mesh/torus | 6 |
| v5e | ~90 GBps | 2D torus | 4 |
| v5p | ~180 GBps | 3D torus | 6 |
| v6e (Trillium) | ~180 GBps (800 GBps raw) | 2D torus | 4 |
| TPU7x (Ironwood) | 1,200 GBps | 3D torus | 6 |
ICI is fast relative to the data-center network but still slower than HBM bandwidth. This performance gap influences how parallelism strategies are mapped onto a slice: operations that require heavy inter-chip data movement (such as all-reduce during data parallelism) benefit from being placed on chips connected by the fastest ICI links.
A torus topology adds wrap-around links so that the chip at position 0 along a dimension connects directly to the chip at the maximum position. This cuts the worst-case hop count in half and increases bisection bandwidth. A mesh lacks these wrap-around links.
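In terms of hop count, for a slice with $k_i$ chips along each of $d$ dimensions (a standard property of mesh and torus networks, stated here for illustration):

$$
D_{\text{mesh}} = \sum_{i=1}^{d} (k_i - 1),
\qquad
D_{\text{torus}} = \sum_{i=1}^{d} \left\lfloor \frac{k_i}{2} \right\rfloor .
$$

For a 4x4x8 slice, that is 13 hops worst case as a mesh versus 8 as a torus.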
For v4 and v5p, full torus connectivity is available only on slices that contain at least one complete cube (64 chips in a 4x4x4 arrangement). Smaller slices operate as meshes.
TPU v4 and v5p support twisted torus configurations on certain slice shapes. A twist remaps the wrap-around links so that traffic is more evenly distributed across the network. Google reports that a 4x4x8 twisted torus provides roughly 70% higher bisection bandwidth than the standard 4x4x8 torus. Users can request twisted topologies by appending _twisted to the topology string (for example, 4x4x8_twisted).
Each TPU generation defines a maximum Pod size. A slice can be any supported subset of the full Pod.
| TPU generation | Chips per Pod | Maximum slice topology | Interconnect |
|---|---|---|---|
| v2 | 256 | 16x16 (2D) | 2D torus |
| v3 | 1,024 | 32x32 (2D) | 2D torus |
| v4 | 4,096 | 12x16x16 (3D) | 3D torus + OCS |
| v5e | 256 | 16x16 (2D) | 2D torus |
| v5p | 8,960 | 16x16x24 (3D) | 3D torus + OCS |
| v6e (Trillium) | 256 | 16x16 (2D) | 2D torus |
| TPU7x (Ironwood) | 9,216 | 8x16x16 (3D) | 3D torus + OCS |
For 3D generations, the largest schedulable single job is often smaller than the full Pod. On v5p, for example, the maximum single-slice job uses 6,144 chips (96 cubes) in a 16x16x24 topology.
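The arithmetic behind these figures is straightforward. The sketch below (a hypothetical helper; the two-TensorCores-per-chip and 64-chips-per-cube figures come from earlier in this article) reproduces the v5p numbers:

```python
from math import prod

def slice_summary(topology: str) -> dict:
    """Derive chip, TensorCore, and cube counts from a 3D topology string like '16x16x24'."""
    dims = [int(d) for d in topology.split("x")]
    chips = prod(dims)
    return {
        "chips": chips,
        "tensorcores": chips * 2,   # v4/v5p chips have two TensorCores each
        "cubes": chips // 64,       # a cube is a 4x4x4 block of 64 chips
    }

print(slice_summary("16x16x24"))
# {'chips': 6144, 'tensorcores': 12288, 'cubes': 96}
```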
TPU v4 introduced optical circuit switches (OCS) to the TPU interconnect. OCS units sit between cubes and use microelectromechanical (MEMS) mirrors to physically redirect light beams through optical fibers. This allows Google to reconfigure which cubes are connected to which without touching physical cables.
OCS provides several benefits for slice management: a slice can be assembled from any set of healthy cubes rather than only physically adjacent ones, failed cubes can be routed around without taking the rest of the Pod out of service, and wrap-around (torus) links, including twisted variants, can be configured in software for each slice.
Cloud TPU Multislice is a technology that allows a single training job to span multiple slices. Before Multislice, a job was limited to a single slice, capping the chip count at the Pod maximum (for example, 4,096 chips on v4). With Multislice, jobs can use up to 256 slices, potentially spanning multiple Pods connected over the data-center network.
Within each slice, chips communicate over ICI as usual. Between slices, data follows a longer path: from the source chip's HBM to its host CPU over PCIe, across the data-center network to the destination host, and then over PCIe into the destination chip's HBM.
DCN bandwidth per chip ranges from about 3 GBps (v5e) to 12.5 GBps (v6e), which is 10 to 60 times slower than ICI. The XLA compiler automatically generates the inter-slice communication code and overlaps it with computation to hide latency.
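As a rough, illustrative calculation (assuming a 1 GB per-chip payload and the ~90 GBps ICI and ~3 GBps DCN figures quoted above):

$$
t_{\text{ICI}} \approx \frac{1\ \text{GB}}{90\ \text{GBps}} \approx 11\ \text{ms},
\qquad
t_{\text{DCN}} \approx \frac{1\ \text{GB}}{3\ \text{GBps}} \approx 330\ \text{ms}.
$$

The roughly 30x gap is why overlapping DCN transfers with computation matters so much for Multislice efficiency.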
Multislice supports several parallelism schemes:
| Parallelism type | Scope | Description |
|---|---|---|
| Data parallelism | Within or across slices | Each chip (or group of chips) holds a full copy of the model and processes a different batch of data. Gradients are averaged across replicas. |
| Fully sharded data parallelism (FSDP) | Within or across slices | Model parameters, gradients, and optimizer states are sharded across chips. Each chip holds only a fraction of the model, reducing memory per chip. |
| Tensor parallelism | Within slice (recommended) | Individual tensors (such as weight matrices) are split across chips. Requires high-bandwidth ICI and is not recommended across DCN. |
| Pipeline parallelism | Within or across slices | Different layers of the model are assigned to different chips or groups of chips. Data flows through the pipeline in micro-batches. |
Google recommends keeping tensor parallelism within a single slice because it demands the low-latency, high-bandwidth communication that only ICI provides. Data parallelism and FSDP tolerate the higher latency of DCN and are the primary strategies used across slices.
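A minimal JAX sketch of this recommendation, assuming a hypothetical Multislice job with two slices of 16 chips each: `mesh_utils.create_hybrid_device_mesh` builds a device mesh whose outer axis crosses slice boundaries (DCN) and whose inner axis stays within a slice (ICI), so data parallelism maps across slices and tensor parallelism stays inside one.

```python
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical Multislice job: 2 slices x 16 chips each.
devices = mesh_utils.create_hybrid_device_mesh(
    mesh_shape=(1, 16),      # within-slice (ICI) shape: 16-way model parallelism
    dcn_mesh_shape=(2, 1),   # across-slice (DCN) shape: 2-way data parallelism
)
mesh = Mesh(devices, axis_names=("data", "model"))

# Split batches across slices (DCN) and tensor-parallel shards within a slice (ICI).
batch_sharding = NamedSharding(mesh, P("data", None))
weight_sharding = NamedSharding(mesh, P(None, "model"))
```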
TPU slices are programmed through a layered software stack: the XLA compiler, the GSPMD partitioner, the Pathways runtime, and the Orbax checkpointing library, described below.
The XLA (Accelerated Linear Algebra) compiler translates high-level framework operations into optimized TPU machine code. It handles partitioning computations across the chips in a slice, inserting collective communication operations (all-reduce, all-gather, reduce-scatter) as needed. Users typically interact with XLA indirectly through JAX or TensorFlow.
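For illustration, the collectives XLA emits can also be written explicitly in JAX with `shard_map`; the sketch below assumes a hypothetical 8-chip slice and shows an all-reduce (`psum`) running over its ICI links.

```python
from functools import partial

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

# Hypothetical 8-chip slice, treated as a single "data" axis.
mesh = Mesh(mesh_utils.create_device_mesh((8,)), axis_names=("data",))

@jax.jit
@partial(shard_map, mesh=mesh, in_specs=P("data"), out_specs=P("data"))
def all_reduce_sum(local_grad):
    # psum is the all-reduce collective; within a slice it runs over ICI.
    return jax.lax.psum(local_grad, axis_name="data")

grads = jnp.arange(8.0).reshape(8, 1)   # one scalar "gradient" per chip
summed = all_reduce_sum(grads)          # every chip now holds the global sum (28.0)
```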
GSPMD (General and Scalable Parallelization for ML Computation Graphs) is an XLA extension that automates the mapping of a single-device program onto a multi-chip slice. Developers annotate a small number of tensors with sharding specifications; GSPMD propagates these annotations through the computation graph and generates the necessary communication code. In benchmarks, GSPMD has achieved 50% to 62% compute utilization on up to 2,048 TPU v3 cores for models with up to one trillion parameters.
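A minimal sketch of the annotation style GSPMD enables in JAX, assuming a hypothetical 16-chip slice arranged as a 4x4 logical mesh; a single sharding annotation on the weight matrix is propagated through the jitted computation.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical 16-chip slice viewed as a 4x4 logical mesh.
mesh = Mesh(mesh_utils.create_device_mesh((4, 4)), axis_names=("x", "y"))

# One sharding annotation: split the weight matrix across both mesh axes.
w = jax.device_put(jnp.ones((8192, 8192)), NamedSharding(mesh, P("x", "y")))
x = jnp.ones((128, 8192))

@jax.jit
def forward(x, w):
    # No further annotations needed: GSPMD propagates the sharding of w through
    # the graph and inserts the collectives the partitioned matmul requires.
    return x @ w

y = forward(x, w)
```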
Pathways is a distributed runtime that allows a single Python client to orchestrate work across multiple TPU Pods. It extends JAX's execution model so that SPMD computations have access to all provisioned cores, not just those on the local host. Google used Pathways to train PaLM (540 billion parameters) on 6,144 TPU v4 chips, reaching 57.8% hardware FLOPS utilization.
Orbax is a JAX library that provides checkpointing primitives for saving and restoring model state (JAX PyTrees) to local storage or Google Cloud Storage. Reliable checkpointing is essential for Multislice training, where automatic recovery from slice failures depends on being able to reload the latest checkpoint without user intervention.
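A minimal sketch of saving and restoring a JAX PyTree with Orbax's `PyTreeCheckpointer` (one of several Orbax APIs), using a hypothetical local path; a Cloud Storage `gs://` URI works the same way.

```python
import jax.numpy as jnp
import orbax.checkpoint as ocp

# Hypothetical training state: any JAX PyTree works.
state = {"step": 100, "params": {"w": jnp.ones((1024, 1024))}}

checkpointer = ocp.PyTreeCheckpointer()
checkpointer.save("/tmp/ckpt/step_100", state)          # write the PyTree
restored = checkpointer.restore("/tmp/ckpt/step_100")   # reload after a restart or slice failure
```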
TPU v4 slices range from small single-host configurations to large multi-host topologies:
| Accelerator type | Topology | Chips | TensorCores |
|---|---|---|---|
| v4-8 | 2x2x1 | 4 | 8 |
| v4-16 | 2x2x2 | 8 | 16 |
| v4-32 | 2x2x4 | 16 | 32 |
| v4-64 | 2x4x4 | 32 | 64 |
| v4-128 | 4x4x4 | 64 | 128 |
| v4-256 | 4x4x8 | 128 | 256 |
| v4-512 | 4x8x8 | 256 | 512 |
| v4-1024 | 8x8x8 | 512 | 1,024 |
| v4-2048 | 8x8x16 | 1,024 | 2,048 |
| v4-4096 | 8x16x16 | 2,048 | 4,096 |
Slices of v4-128 and above (one full cube or more) have 3D torus connectivity and ICI resiliency enabled by default.
TPU v5e slices use a 2D topology and scale from a single chip to a full 256-chip Pod:

| Accelerator type | Topology | Chips |
|---|---|---|
| v5e-1 | 1x1 | 1 |
| v5e-4 | 2x2 | 4 |
| v5e-8 | 2x4 | 8 |
| v5e-16 | 4x4 | 16 |
| v5e-32 | 4x8 | 32 |
| v5e-64 | 8x8 | 64 |
| v5e-128 | 8x16 | 128 |
| v5e-256 | 16x16 | 256 |
TPU v5p slices return to a 3D topology; as with v4, the accelerator-type number counts TensorCores (two per chip):

| Accelerator type | Topology | Chips | Cubes |
|---|---|---|---|
| v5p-8 | 2x2x1 | 4 | less than 1 |
| v5p-128 | 4x4x4 | 64 | 1 |
| v5p-512 | 4x8x8 | 256 | 4 |
| v5p-1024 | 8x8x8 | 512 | 8 |
| v5p-4096 | 8x16x16 | 2,048 | 32 |
| v5p-12288 | 16x16x24 | 6,144 | 96 |
TPU slices have been used to train some of the largest language models and other AI systems:
| Model | Organization | TPU generation | Slice size | Notes |
|---|---|---|---|---|
| PaLM 540B | Google | v4 | 6,144 chips (two Pods) | Trained using Pathways; 57.8% hardware FLOPS utilization |
| Gemini Ultra | Google | v4 + v5e | Multiple slices across data centers | First Google model trained across multiple data centers |
| Largest disclosed LLM training | Google | v5e | 50,944 chips (199 Pods) | Achieved 10 exaFLOPS (BF16) at full scale |
| LaMDA | Google | v3 | 1,024 chips | Trained on a full TPU v3 Pod |
TPU slices differ from GPU clusters in several important ways:
| Aspect | TPU slice | GPU cluster |
|---|---|---|
| Interconnect topology | Fixed 2D or 3D torus/mesh | Hierarchical switch network (NVLink, NVSwitch, InfiniBand) |
| Links per device | Constant (4 or 6 ICI ports) | Varies by level in the switch hierarchy |
| Scaling model | Add more chips to the torus; bandwidth per device stays constant | Add switches and links; bandwidth may decrease at higher tiers |
| Reconfigurability | OCS allows dynamic repartitioning (v4+) | Typically fixed cabling |
| Programming model | XLA/GSPMD automatic sharding | Manual or semi-automatic (Megatron-LM, DeepSpeed, FSDP) |
| Chip design | Custom ASIC optimized for matrix multiplication | General-purpose GPU with tensor cores |
The torus topology of TPU slices means that the number of links per device and per-device bandwidth remain constant regardless of the total system size. GPU clusters, by contrast, often see effective per-device bandwidth decline as the cluster grows beyond what a single switch fabric can serve.
Selecting the right slice size involves balancing several factors: whether the model's parameters, optimizer state, and activations fit in the slice's aggregate HBM, how much data parallelism the target batch size allows, and the cost and availability of larger configurations.
Even for a fixed number of chips, multiple topology shapes may be available. For example, a 512-chip v4 slice can be configured as 4x4x32, 4x8x16, or 8x8x8. The best choice depends on the parallelism strategy: compact, cube-like shapes such as 8x8x8 maximize bisection bandwidth for communication-heavy strategies, while elongated shapes such as 4x4x32 map naturally to pipeline or data parallelism along the long dimension.
ICI resiliency, enabled by default on v4, v5p, and TPU7x slices of one cube or larger, allows ICI connections to be dynamically rerouted around optical or switch faults. This improves availability but can cause temporary performance degradation while the rerouting takes effect. For latency-sensitive inference workloads, users may choose to disable ICI resiliency.