A TPU slice is a configurable grouping of interconnected Tensor Processing Unit (TPU) chips within a Google Cloud TPU Pod. Slices range from a handful of chips to thousands, and they serve as the fundamental allocation unit for machine learning workloads on Google's custom AI accelerator hardware. Every chip inside a single slice communicates over a high-speed inter-chip interconnect (ICI), while chips in separate slices exchange data through the slower data-center network (DCN). By organizing TPU hardware into slices, Google gives users fine-grained control over how many chips a training or inference job receives, enabling workloads to scale from small experiments to runs spanning tens of thousands of processors.
Imagine a big box of building blocks. Each block is a tiny computer chip that is really good at one thing: doing lots of math very fast. A TPU slice is like picking a set of blocks out of the box and snapping them together so they can all talk to each other through special fast tunnels. If you need to solve a bigger puzzle, you snap more blocks together into a bigger slice. If your puzzle is truly enormous, you can use several slices at once; the slices talk to each other through regular hallways that are a bit slower than the tunnels, but still get the job done.
Google introduced the first Tensor Processing Unit (TPU v1) in 2015 as an inference-only accelerator. The TPU v2, announced in 2017, expanded the design to support training and introduced the concept of a TPU Pod: a rack-scale collection of chips wired together with a high-bandwidth 2D torus interconnect. Because a full Pod contained hundreds or thousands of chips, Google needed a way to let multiple users share the same physical Pod. The solution was the TPU slice, a logically partitioned subset of chips within a Pod that could be allocated independently.
As successive TPU generations increased Pod sizes (from 256 chips in v2 to 4,096 in v4 and 8,960 in v5p), slices became even more important for resource management. The introduction of optical circuit switches (OCS) in TPU v4 made slices dynamically reconfigurable: Google could carve out a slice of any supported size from the full Pod by programming the switches, without physically re-cabling hardware.
Understanding slices requires a brief look at the chips they contain. Each TPU chip is an application-specific integrated circuit (ASIC) built around one or more TensorCores. A TensorCore contains one or more matrix multiply units (MXUs), a vector unit for elementwise operations, and a scalar unit that handles control flow and address generation.
Each chip also has high-bandwidth memory (HBM) for storing model parameters and intermediate activations, plus ICI ports that link it to neighboring chips.
| TPU generation | TensorCores per chip | MXU array size | HBM per chip | Peak BF16 FLOPS per chip |
|---|---|---|---|---|
| v2 | 2 | 128x128 | 16 GB | 46 TFLOPS |
| v3 | 2 | 128x128 | 32 GB | 123 TFLOPS |
| v4 | 2 | 128x128 | 32 GB | 275 TFLOPS |
| v5e | 1 | 128x128 | 16 GB | 197 TFLOPS |
| v5p | 2 | 128x128 | 95 GB | 459 TFLOPS |
| v6e (Trillium) | 1 | 256x256 | 32 GB | 918 TFLOPS |
| TPU7x (Ironwood) | 2 | 256x256 | 192 GB | 2,307 TFLOPS |
A TPU slice is defined as a set of chips within a single Pod that are connected by ICI links. The key properties of a slice are its chip count, its topology (how the chips are arranged along each ICI dimension), and whether that topology includes wrap-around (torus) links.
TPU slices are described by their topology, a tuple indicating the number of chips along each network dimension: two dimensions for 2D generations (for example, 4x8 for a 32-chip v5e slice) and three dimensions for 3D generations (for example, 4x4x8 for a 128-chip v4 slice).
Google enforces ordering constraints (A <= B <= C for 3D) and requires that each dimension be either at most 4 or a multiple of 4. This ensures the resulting shape maps cleanly onto the physical wiring of the Pod.
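As an illustrative sketch (a hypothetical helper, not part of any Google tooling), these two rules can be expressed directly:

```python
def is_valid_3d_topology(dims: tuple[int, int, int]) -> bool:
    """Check the ordering and size rules described above for a 3D slice shape."""
    a, b, c = dims
    ordered = a <= b <= c                              # A <= B <= C
    sized = all(d <= 4 or d % 4 == 0 for d in dims)    # each dim <= 4 or a multiple of 4
    return ordered and sized

print(is_valid_3d_topology((4, 4, 8)))   # True  (valid shape)
print(is_valid_3d_topology((2, 4, 2)))   # False (dimensions not in non-decreasing order)
print(is_valid_3d_topology((4, 6, 8)))   # False (6 is neither <= 4 nor a multiple of 4)
```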
Starting with TPU v4, the physical building block of a Pod is the cube, a 4x4x4 arrangement of 64 chips housed in the same rack. Intra-cube ICI links use direct-attach copper (DAC) cables because the physical distances are short. Connections between cubes traverse optical transceivers and, in v4 and later, optical circuit switches.
Slices that are exact multiples of a full cube (for example, 4x4x4, 4x4x8, or 8x8x8) enjoy full 3D torus connectivity, meaning each dimension has wrap-around links that halve the maximum network diameter. Slices smaller than one cube lack wrap-around links, which roughly doubles the latency of collective communication operations compared to torus-connected slices of similar chip count.
ICI is the proprietary high-speed network that links chips within a slice. Its bandwidth has scaled across TPU generations:
| TPU generation | ICI bandwidth per chip (bidirectional) | Topology | Ports per chip |
|---|---|---|---|
| v2 | ~200 GBps | 2D torus | 4 |
| v3 | ~200 GBps | 2D torus | 4 |
| v4 | ~90 GBps | 3D mesh/torus | 6 |
| v5e | ~90 GBps | 2D torus | 4 |
| v5p | ~180 GBps | 3D torus | 6 |
| v6e (Trillium) | ~180 GBps (800 GBps raw) | 2D torus | 4 |
| TPU7x (Ironwood) | 1,200 GBps | 3D torus | 6 |
ICI is fast relative to the data-center network but still slower than HBM bandwidth. This performance gap influences how parallelism strategies are mapped onto a slice: operations that require heavy inter-chip data movement (such as all-reduce during data parallelism) benefit from being placed on chips connected by the fastest ICI links.
A torus topology adds wrap-around links so that the chip at position 0 along a dimension connects directly to the chip at the maximum position. This cuts the worst-case hop count in half and increases bisection bandwidth. A mesh lacks these wrap-around links.
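In terms of hop count, for a slice with $k_i$ chips along each of $d$ dimensions (a standard property of mesh and torus networks, stated here for illustration):

$$
D_{\text{mesh}} = \sum_{i=1}^{d} (k_i - 1),
\qquad
D_{\text{torus}} = \sum_{i=1}^{d} \left\lfloor \frac{k_i}{2} \right\rfloor .
$$

For a 4x4x8 slice, that is 13 hops worst case as a mesh versus 8 as a torus.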
For v4 and v5p, full torus connectivity is available only on slices that contain at least one complete cube (64 chips in a 4x4x4 arrangement). Smaller slices operate as meshes.
TPU v4 and v5p support twisted torus configurations on certain slice shapes. A twist remaps the wrap-around links so that traffic is more evenly distributed across the network. Google reports that a 4x4x8 twisted torus provides roughly 70% higher bisection bandwidth than the standard 4x4x8 torus. Users can request twisted topologies by appending _twisted to the topology string (for example, 4x4x8_twisted).
Each TPU generation defines a maximum Pod size. A slice can be any supported subset of the full Pod.
| TPU generation | Chips per Pod | Maximum slice topology | Interconnect |
|---|---|---|---|
| v2 | 256 | 16x16 (2D) | 2D torus |
| v3 | 1,024 | 32x32 (2D) | 2D torus |
| v4 | 4,096 | 12x16x16 (3D) | 3D torus + OCS |
| v5e | 256 | 16x16 (2D) | 2D torus |
| v5p | 8,960 | 16x16x24 (3D) | 3D torus + OCS |
| v6e (Trillium) | 256 | 16x16 (2D) | 2D torus |
| TPU7x (Ironwood) | 9,216 | 8x16x16 (3D) | 3D torus + OCS |
For 3D generations, the largest schedulable single job is often smaller than the full Pod. On v5p, for example, the maximum single-slice job uses 6,144 chips (96 cubes) in a 16x16x24 topology.
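The arithmetic behind these figures is straightforward. The sketch below (a hypothetical helper; the two-TensorCores-per-chip and 64-chips-per-cube figures come from earlier in this article) reproduces the v5p numbers:

```python
from math import prod

def slice_summary(topology: str) -> dict:
    """Derive chip, TensorCore, and cube counts from a 3D topology string like '16x16x24'."""
    dims = [int(d) for d in topology.split("x")]
    chips = prod(dims)
    return {
        "chips": chips,
        "tensorcores": chips * 2,   # v4/v5p chips have two TensorCores each
        "cubes": chips // 64,       # a cube is a 4x4x4 block of 64 chips
    }

print(slice_summary("16x16x24"))
# {'chips': 6144, 'tensorcores': 12288, 'cubes': 96}
```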
TPU v4 introduced optical circuit switches (OCS) to the TPU interconnect. OCS units sit between cubes and use microelectromechanical (MEMS) mirrors to physically redirect light beams through optical fibers. This allows Google to reconfigure which cubes are connected to which without touching physical cables.
OCS provides several benefits for slice management: a slice can be assembled from any set of healthy cubes rather than only physically adjacent ones, failed cubes can be routed around without taking the rest of the Pod out of service, and wrap-around (torus) links, including twisted variants, can be configured in software for each slice.
Cloud TPU Multislice is a technology that allows a single training job to span multiple slices. Before Multislice, a job was limited to a single slice, capping the chip count at the Pod maximum (for example, 4,096 chips on v4). With Multislice, jobs can use up to 256 slices, potentially spanning multiple Pods connected over the data-center network.
Within each slice, chips communicate over ICI as usual. Between slices, data follows a longer path: from the source chip's HBM to its host CPU over PCIe, across the data-center network to the destination host, and then over PCIe into the destination chip's HBM.
DCN bandwidth per chip ranges from about 3 GBps (v5e) to 12.5 GBps (v6e), which is 10 to 60 times slower than ICI. The XLA compiler automatically generates the inter-slice communication code and overlaps it with computation to hide latency.
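As a rough, illustrative calculation (assuming a 1 GB per-chip payload and the ~90 GBps ICI and ~3 GBps DCN figures quoted above):

$$
t_{\text{ICI}} \approx \frac{1\ \text{GB}}{90\ \text{GBps}} \approx 11\ \text{ms},
\qquad
t_{\text{DCN}} \approx \frac{1\ \text{GB}}{3\ \text{GBps}} \approx 330\ \text{ms}.
$$

The roughly 30x gap is why overlapping DCN transfers with computation matters so much for Multislice efficiency.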
Multislice supports several parallelism schemes:
| Parallelism type | Scope | Description |
|---|---|---|
| Data parallelism | Within or across slices | Each chip (or group of chips) holds a full copy of the model and processes a different batch of data. Gradients are averaged across replicas. |
| Fully sharded data parallelism (FSDP) | Within or across slices | Model parameters, gradients, and optimizer states are sharded across chips. Each chip holds only a fraction of the model, reducing memory per chip. |
| Tensor parallelism | Within slice (recommended) | Individual tensors (such as weight matrices) are split across chips. Requires high-bandwidth ICI and is not recommended across DCN. |
| Pipeline parallelism | Within or across slices | Different layers of the model are assigned to different chips or groups of chips. Data flows through the pipeline in micro-batches. |
Google recommends keeping tensor parallelism within a single slice because it demands the low-latency, high-bandwidth communication that only ICI provides. Data parallelism and FSDP tolerate the higher latency of DCN and are the primary strategies used across slices.
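A minimal JAX sketch of this recommendation, assuming a hypothetical Multislice job with two slices of 16 chips each: `mesh_utils.create_hybrid_device_mesh` builds a device mesh whose outer axis crosses slice boundaries (DCN) and whose inner axis stays within a slice (ICI), so data parallelism maps across slices and tensor parallelism stays inside one.

```python
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical Multislice job: 2 slices x 16 chips each.
devices = mesh_utils.create_hybrid_device_mesh(
    mesh_shape=(1, 16),      # within-slice (ICI) shape: 16-way model parallelism
    dcn_mesh_shape=(2, 1),   # across-slice (DCN) shape: 2-way data parallelism
)
mesh = Mesh(devices, axis_names=("data", "model"))

# Split batches across slices (DCN) and tensor-parallel shards within a slice (ICI).
batch_sharding = NamedSharding(mesh, P("data", None))
weight_sharding = NamedSharding(mesh, P(None, "model"))
```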
TPU slices are programmed through a layered software stack: the XLA compiler, the GSPMD partitioner, the Pathways runtime, and the Orbax checkpointing library, described below.
The XLA (Accelerated Linear Algebra) compiler translates high-level framework operations into optimized TPU machine code. It handles partitioning computations across the chips in a slice, inserting collective communication operations (all-reduce, all-gather, reduce-scatter) as needed. Users typically interact with XLA indirectly through JAX or TensorFlow.
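For illustration, the collectives XLA emits can also be written explicitly in JAX with `shard_map`; the sketch below assumes a hypothetical 8-chip slice and shows an all-reduce (`psum`) running over its ICI links.

```python
from functools import partial

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

# Hypothetical 8-chip slice, treated as a single "data" axis.
mesh = Mesh(mesh_utils.create_device_mesh((8,)), axis_names=("data",))

@jax.jit
@partial(shard_map, mesh=mesh, in_specs=P("data"), out_specs=P("data"))
def all_reduce_sum(local_grad):
    # psum is the all-reduce collective; within a slice it runs over ICI.
    return jax.lax.psum(local_grad, axis_name="data")

grads = jnp.arange(8.0).reshape(8, 1)   # one scalar "gradient" per chip
summed = all_reduce_sum(grads)          # every chip now holds the global sum (28.0)
```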
GSPMD (General and Scalable Parallelization for ML Computation Graphs) is an XLA extension that automates the mapping of a single-device program onto a multi-chip slice. Developers annotate a small number of tensors with sharding specifications; GSPMD propagates these annotations through the computation graph and generates the necessary communication code. In benchmarks, GSPMD has achieved 50% to 62% compute utilization on up to 2,048 TPU v3 cores for models with up to one trillion parameters.
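A minimal sketch of the annotation style GSPMD enables in JAX, assuming a hypothetical 16-chip slice arranged as a 4x4 logical mesh; a single sharding annotation on the weight matrix is propagated through the jitted computation.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical 16-chip slice viewed as a 4x4 logical mesh.
mesh = Mesh(mesh_utils.create_device_mesh((4, 4)), axis_names=("x", "y"))

# One sharding annotation: split the weight matrix across both mesh axes.
w = jax.device_put(jnp.ones((8192, 8192)), NamedSharding(mesh, P("x", "y")))
x = jnp.ones((128, 8192))

@jax.jit
def forward(x, w):
    # No further annotations needed: GSPMD propagates the sharding of w through
    # the graph and inserts the collectives the partitioned matmul requires.
    return x @ w

y = forward(x, w)
```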
Pathways is a distributed runtime that allows a single Python client to orchestrate work across multiple TPU Pods. It extends JAX's execution model so that SPMD computations have access to all provisioned cores, not just those on the local host. Google used Pathways to train PaLM (540 billion parameters) on 6,144 TPU v4 chips, reaching 57.8% hardware FLOPS utilization.
Orbax is a JAX library that provides checkpointing primitives for saving and restoring model state (JAX PyTrees) to local storage or Google Cloud Storage. Reliable checkpointing is essential for Multislice training, where automatic recovery from slice failures depends on being able to reload the latest checkpoint without user intervention.
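A minimal sketch of saving and restoring a JAX PyTree with Orbax's `PyTreeCheckpointer` (one of several Orbax APIs), using a hypothetical local path; a Cloud Storage `gs://` URI works the same way.

```python
import jax.numpy as jnp
import orbax.checkpoint as ocp

# Hypothetical training state: any JAX PyTree works.
state = {"step": 100, "params": {"w": jnp.ones((1024, 1024))}}

checkpointer = ocp.PyTreeCheckpointer()
checkpointer.save("/tmp/ckpt/step_100", state)          # write the PyTree
restored = checkpointer.restore("/tmp/ckpt/step_100")   # reload after a restart or slice failure
```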
TPU v4 slices range from small single-host configurations to large multi-host topologies:
| Accelerator type | Topology | Chips | TensorCores |
|---|---|---|---|
| v4-8 | 2x2x1 | 4 | 8 |
| v4-16 | 2x2x2 | 8 | 16 |
| v4-32 | 2x2x4 | 16 | 32 |
| v4-64 | 2x4x4 | 32 | 64 |
| v4-128 | 4x4x4 | 64 | 128 |
| v4-256 | 4x4x8 | 128 | 256 |
| v4-512 | 4x8x8 | 256 | 512 |
| v4-1024 | 8x8x8 | 512 | 1,024 |
| v4-2048 | 8x8x16 | 1,024 | 2,048 |
| v4-4096 | 8x16x16 | 2,048 | 4,096 |
Slices of v4-128 and above (one full cube or more) have 3D torus connectivity and ICI resiliency enabled by default.
TPU v5e slices use a 2D topology and scale from a single chip to a full 256-chip Pod:

| Accelerator type | Topology | Chips |
|---|---|---|
| v5e-1 | 1x1 | 1 |
| v5e-4 | 2x2 | 4 |
| v5e-8 | 2x4 | 8 |
| v5e-16 | 4x4 | 16 |
| v5e-32 | 4x8 | 32 |
| v5e-64 | 8x8 | 64 |
| v5e-128 | 8x16 | 128 |
| v5e-256 | 16x16 | 256 |
TPU v5p slices return to a 3D topology; as with v4, the accelerator-type number counts TensorCores (two per chip):

| Accelerator type | Topology | Chips | Cubes |
|---|---|---|---|
| v5p-8 | 2x2x1 | 4 | less than 1 |
| v5p-128 | 4x4x4 | 64 | 1 |
| v5p-512 | 4x8x8 | 256 | 4 |
| v5p-1024 | 8x8x8 | 512 | 8 |
| v5p-4096 | 8x16x16 | 2,048 | 32 |
| v5p-12288 | 16x16x24 | 6,144 | 96 |
TPU slices have been used to train some of the largest language models and other AI systems:
| Model | Organization | TPU generation | Slice size | Notes |
|---|---|---|---|---|
| PaLM 540B | Google | v4 | 6,144 chips (two Pods) | Trained using Pathways; 57.8% hardware FLOPS utilization |
| Gemini Ultra | Google | v4 + v5e | Multiple slices across data centers | First Google model trained across multiple data centers |
| Largest disclosed LLM training | Google | v5e | 50,944 chips (199 Pods) | Achieved 10 exaFLOPS (BF16) at full scale |
| LaMDA | Google | v3 | 1,024 chips | Trained on a full TPU v3 Pod |
TPU slices differ from GPU clusters in several important ways:
| Aspect | TPU slice | GPU cluster |
|---|---|---|
| Interconnect topology | Fixed 2D or 3D torus/mesh | Hierarchical switch network (NVLink, NVSwitch, InfiniBand) |
| Links per device | Constant (4 or 6 ICI ports) | Varies by level in the switch hierarchy |
| Scaling model | Add more chips to the torus; bandwidth per device stays constant | Add switches and links; bandwidth may decrease at higher tiers |
| Reconfigurability | OCS allows dynamic repartitioning (v4+) | Typically fixed cabling |
| Programming model | XLA/GSPMD automatic sharding | Manual or semi-automatic (Megatron-LM, DeepSpeed, FSDP) |
| Chip design | Custom ASIC optimized for matrix multiplication | General-purpose GPU with tensor cores |
The torus topology of TPU slices means that the number of links per device and per-device bandwidth remain constant regardless of the total system size. GPU clusters, by contrast, often see effective per-device bandwidth decline as the cluster grows beyond what a single switch fabric can serve.
Selecting the right slice size involves balancing several factors: whether the model's parameters, optimizer state, and activations fit in the slice's aggregate HBM, how much data parallelism the target batch size allows, and the cost and availability of larger configurations.
Even for a fixed number of chips, multiple topology shapes may be available. For example, a 512-chip v4 slice can be configured as 4x4x32, 4x8x16, or 8x8x8. The best choice depends on the parallelism strategy: compact, cube-like shapes such as 8x8x8 maximize bisection bandwidth for communication-heavy strategies, while elongated shapes such as 4x4x32 map naturally to pipeline or data parallelism along the long dimension.
ICI resiliency, enabled by default on v4, v5p, and TPU7x slices of one cube or larger, allows ICI connections to be dynamically rerouted around optical or switch faults. This improves availability but can cause temporary performance degradation while the rerouting takes effect. For latency-sensitive inference workloads, users may choose to disable ICI resiliency.