# TPU Slice

> Source: https://aiwiki.ai/wiki/tpu_slice
> Updated: 2026-06-25
> Categories: AI Hardware, Google, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **TPU slice** is a collection of [Tensor Processing Unit](/wiki/tpu) (TPU) chips that all sit inside the same [Google Cloud](/wiki/google_cloud) TPU Pod and are connected to one another by a high-speed inter-chip interconnect (ICI).[7] It is the fundamental unit of allocation for machine learning workloads on Google's custom AI accelerators: a user requests a slice of a chosen size and shape, and Google's scheduler carves that many ICI-connected chips out of an available Pod.[7] Slices range from a single chip to thousands of chips. Every chip within a slice communicates over ICI, whereas chips in separate slices exchange data over the slower data-center network (DCN).[8][14] Google Cloud documentation states plainly: "A slice is a collection of chips all located inside the same TPU Pod connected by high-speed inter-chip interconnects (ICI)."[7]

## Explain like I'm 5 (ELI5)

Imagine a big box of building blocks. Each block is a tiny computer chip that is really good at one thing: doing lots of math very fast. A TPU slice is like picking a set of blocks out of the box and snapping them together so they can all talk to each other through special fast tunnels. If you need to solve a bigger puzzle, you snap more blocks together into a bigger slice. If your puzzle is truly enormous, you can use several slices at once; the slices talk to each other through regular hallways that are a bit slower than the tunnels, but still get the job done.

## What is a TPU slice?

A TPU slice is defined as a set of chips within a single Pod that are connected by ICI links.[7] The key properties of a slice are:

1. **Contiguous topology**: the chips in a slice form a contiguous rectangular region in the Pod's mesh or torus network.[14]
2. **ICI connectivity**: all chip-to-chip communication within the slice travels over ICI, which offers low latency and high bandwidth (hundreds to over a thousand gigabytes per second per chip, depending on generation).[9][10][14]
3. **Dedicated hosts**: each slice is served by one or more TPU VM hosts (Linux virtual machines) that have direct physical connections to the chips. A host typically manages 4 or 8 chips.[7]
4. **Independent scheduling**: slices are the unit of resource allocation in Google Cloud. A user requests a slice of a specific accelerator type and topology; Google's scheduler finds space in an available Pod.[7]

Google Cloud describes the larger container as follows: "A TPU Pod is a contiguous set of TPUs grouped together over a specialized network."[7] A slice is therefore any ICI-connected subset of that Pod that the scheduler hands to a single workload.

### How a slice is specified (topology)

TPU slices are described by their topology, a tuple indicating the number of chips along each network dimension:

- **2D topologies** (v2, v3, v5e, v6e): specified as AxB. For example, a v5e slice with topology 4x8 contains 32 chips arranged in a 4-by-8 grid.[11]
- **3D topologies** (v4, v5p, TPU7x): specified as AxBxC. For example, a v4 slice with topology 4x4x8 contains 128 chips.[9]

Google enforces ordering constraints (A <= B <= C for 3D) and requires that each dimension be either at most 4 or a multiple of 4. This ensures the resulting shape maps cleanly onto the physical wiring of the Pod.[7]
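
The topology rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic and of the constraints as this article states them; the function names are hypothetical and are not part of any Google Cloud API, and the real scheduler may apply additional per-generation restrictions.

```python
from math import prod


def parse_topology(topology: str) -> tuple[int, ...]:
    """Parse a topology string such as '4x8' or '4x4x8' into a tuple of ints."""
    dims = tuple(int(part) for part in topology.lower().split("x"))
    if len(dims) not in (2, 3) or any(d < 1 for d in dims):
        raise ValueError(f"expected 2 or 3 positive dimensions, got {topology!r}")
    return dims


def chip_count(topology: str) -> int:
    """The number of chips is the product of the dimensions."""
    return prod(parse_topology(topology))


def is_valid_topology(topology: str) -> bool:
    """Check the ordering and size rules described in this section (illustrative)."""
    dims = parse_topology(topology)
    # Ordering rule, applied here to 3D shapes only: A <= B <= C.
    if len(dims) == 3 and list(dims) != sorted(dims):
        return False
    # Each dimension must be at most 4 or a multiple of 4.
    return all(d <= 4 or d % 4 == 0 for d in dims)
```

For example, `chip_count("4x8")` gives the 32 chips of the v5e example and `chip_count("4x4x8")` gives the 128 chips of the v4 example, while `is_valid_topology("8x4x4")` fails the ordering rule.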

### What is a cube?

Starting with TPU v4, the physical building block of a Pod is the **cube**, a 4x4x4 arrangement of 64 chips housed in the same rack.[3][9] Google Cloud defines it directly as "a 4x4x4 topology of interconnected TPU chips" that is "only applicable to 3D topologies (beginning with TPU v4)."[7] Intra-cube ICI links use direct-attach copper (DAC) cables because the physical distances are short. Connections between cubes traverse optical transceivers and optical circuit switches.[3]

Slices that are exact multiples of a full cube (for example, 4x4x4, 4x4x8, or 8x8x8) enjoy full 3D torus connectivity, meaning each dimension has wrap-around links that halve the maximum network diameter.[9] Slices smaller than one cube lack wrap-around links, which roughly doubles the latency of collective communication operations compared to torus-connected slices of similar chip count.[14]
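
As a rough sketch of the cube arithmetic, the helper below counts how many full 4x4x4 cubes a 3D topology contains and uses that to decide whether wrap-around links would be expected. The names are hypothetical, and the rule "every dimension is a multiple of 4" is this sketch's reading of "exact multiples of a full cube".

```python
CUBE_EDGE = 4  # a cube is a 4x4x4 block of 64 chips


def cube_count(dims: tuple[int, int, int]) -> int:
    """Return the number of full cubes, or 0 if the shape is not built from cubes."""
    if len(dims) != 3 or any(d % CUBE_EDGE != 0 for d in dims):
        return 0
    a, b, c = dims
    return (a // CUBE_EDGE) * (b // CUBE_EDGE) * (c // CUBE_EDGE)


def has_wraparound(dims: tuple[int, int, int]) -> bool:
    """Sub-cube slices are treated as meshes; cube-multiple slices as tori."""
    return cube_count(dims) >= 1
```

Under these assumptions a 4x4x8 slice is two cubes and an 8x8x8 slice is eight, while a 2x2x4 slice contains no full cube and is treated as a mesh.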

## How is a slice different from a Pod and a host?

The three terms describe a hierarchy. A **chip** holds one or more TensorCores. A **host** is a TPU VM, that is, "a VM that runs on a physical computer connected to TPU hardware," and each host is physically wired to a fixed number of chips (4 chips per host on v4, v5p, and TPU7x; up to 8 on some other types).[7] A **slice** is a group of ICI-connected chips, spanning one or many hosts. A **Pod** is the full contiguous set of chips wired together over the specialized ICI/OCS network; a slice is any schedulable subset of a Pod.[7]

| Term | What it is | Typical size |
|---|---|---|
| Chip | One TPU ASIC with 1-2 TensorCores and its own HBM | 1 unit |
| Host (TPU VM) | A virtual machine physically attached to a fixed set of chips | 4 or 8 chips |
| Slice | ICI-connected chips allocated to one workload | 1 chip to thousands |
| Pod | The full contiguous network of chips wired by ICI plus OCS | 256 to 9,216 chips |

### Single-host vs multi-host slices

Google Cloud distinguishes single-host from multi-host slices by how many TPU VMs the slice spans. "A single-host topology refers to a topology with TPU chips from a single compute host," while "a multi-host topology refers to a topology with TPU chips from more than one compute host."[7] A 2x2x1 slice (4 chips) maps to a single host on v4/v5p/TPU7x and is therefore single-host; anything larger spans multiple hosts and runs as a distributed (multi-host) job.[7] In multi-host slices the TPUs are still connected over ICI; only the orchestration crosses VM boundaries.[7]
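
The host arithmetic can be written down directly. The sketch below assumes a fixed number of chips per host (4 by default, as stated above for v4, v5p, and TPU7x); the function names are illustrative only.

```python
def host_count(num_chips: int, chips_per_host: int = 4) -> int:
    """Number of TPU VM hosts needed to cover a slice of num_chips chips."""
    if num_chips < 1 or chips_per_host < 1:
        raise ValueError("chip counts must be positive")
    # Integer ceiling division: a partially used host still counts as one host.
    return -(-num_chips // chips_per_host)


def is_single_host(num_chips: int, chips_per_host: int = 4) -> bool:
    """A slice is single-host when all of its chips hang off one host."""
    return host_count(num_chips, chips_per_host) == 1
```

With these defaults a 2x2x1 slice (4 chips) needs one host, a 2x2x2 slice (8 chips) needs two, and a 3,072-chip slice needs 768.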

## Background and motivation

Google introduced the first [Tensor Processing Unit](/wiki/tpu) (TPU v1) in 2015 as an inference-only accelerator.[1] The TPU v2, announced in 2017, expanded the design to support training and introduced the concept of a TPU Pod: a rack-scale collection of chips wired together with a high-bandwidth 2D torus interconnect.[2] Because a full Pod contained hundreds or thousands of chips, Google needed a way to let multiple users share the same physical Pod. The solution was the TPU slice, a logically partitioned subset of chips within a Pod that could be allocated independently.[7]

As successive TPU generations increased Pod sizes (from 256 chips in v2 to 4,096 in v4 and 8,960 in v5p), slices became even more important for resource management.[9][10] The introduction of optical circuit switches (OCS) in TPU v4 made slices dynamically reconfigurable: Google could carve out a slice of any supported size from the full Pod by programming the switches, without physically re-cabling hardware.[3]

## What is inside each TPU chip?

Understanding slices requires a brief look at the chips they contain. Each TPU chip is an application-specific integrated circuit ([ASIC](/wiki/ai_chip)) built around one or more **TensorCores**.[14] A TensorCore contains:

- **Matrix Multiply Unit (MXU)**: a [systolic array](/wiki/systolic_array) of multiply-accumulate units (128x128 in most generations, 256x256 in v6e and TPU7x).[12][13] The MXU performs the bulk of [matrix multiplication](/wiki/tensor) work in [deep learning](/wiki/deep_learning).
- **Vector unit**: handles element-wise operations such as [activation functions](/wiki/activation_function), [softmax](/wiki/softmax), and [normalization](/wiki/batch_normalization).
- **Scalar unit**: manages control flow, address computation, and memory access patterns.

Each chip also has high-bandwidth memory (HBM) for storing model parameters and intermediate activations, plus ICI ports that link it to neighboring chips.[14]

| TPU generation | TensorCores per chip | MXU array size | HBM per chip | Peak BF16 FLOPS per chip |
|---|---|---|---|---|
| v2 | 2 | 128x128 | 16 GB | 46 TFLOPS |
| v3 | 2 | 128x128 | 32 GB | 123 TFLOPS |
| v4 | 2 | 128x128 | 32 GB | 275 TFLOPS |
| v5e | 1 | 128x128 | 16 GB | 197 TFLOPS |
| v5p | 2 | 128x128 | 95 GB | 459 TFLOPS |
| v6e (Trillium) | 1 | 256x256 | 32 GB | 918 TFLOPS |
| TPU7x (Ironwood) | 2 | 256x256 | 192 GB | 2,307 TFLOPS |

## What is the inter-chip interconnect (ICI)?

ICI is the proprietary high-speed network that links chips within a slice. Google Cloud defines it as "high speed, low latency internal links that connect TPUs within a TPU Pod."[8] Per-chip ICI bandwidth has scaled sharply across generations, reaching 1,200 GBps (bidirectional) per chip on v5p and TPU7x:

| TPU generation | ICI bandwidth per chip (bidirectional) | Topology | Ports (links) per chip |
|---|---|---|---|
| v4 | 300 GBps (six 50 GBps links) | 3D mesh/torus | 6 |
| v5e | 400 GBps | 2D torus | 4 |
| v5p | 1,200 GBps | 3D torus | 6 |
| v6e (Trillium) | 800 GBps | 2D torus | 4 |
| TPU7x (Ironwood) | 1,200 GBps | 3D torus | 6 |

ICI is fast relative to the data-center network but still slower than a chip's access to its own HBM. This performance gap influences how parallelism strategies are mapped onto a slice: operations that require heavy inter-chip data movement (such as all-reduce during [data parallelism](/wiki/data_parallelism)) benefit from being placed on chips connected by the fastest ICI links.[14]

### Torus vs mesh connectivity

A **torus** topology adds wrap-around links so that the chip at position 0 along a dimension connects directly to the chip at the maximum position. This cuts the worst-case hop count in half and increases bisection bandwidth. A **mesh** lacks these wrap-around links.[14]

For v4 and v5p, full torus connectivity is available only on slices that contain at least one complete cube (64 chips in a 4x4x4 arrangement). Smaller slices operate as meshes.[9]
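
The effect of wrap-around links on worst-case distance follows from standard graph arithmetic: along a mesh dimension of size n the farthest chip is n - 1 hops away, while on a ring it is floor(n / 2) hops away. The sketch below (hypothetical function name) sums this over all dimensions and ignores real routing details.

```python
def max_hops(dims: tuple[int, ...], torus: bool) -> int:
    """Worst-case hop count between two chips, moving one dimension at a time."""
    if torus:
        # With wrap-around links, the farthest point on a ring of size d
        # is d // 2 hops away.
        return sum(d // 2 for d in dims)
    # Without wrap-around links, the farthest point on a line is d - 1 hops away.
    return sum(d - 1 for d in dims)
```

For an 8x8x8 slice this gives 21 hops as a mesh and 12 as a torus; for a 16x16 slice, 30 versus 16. The reduction approaches one half as the dimensions grow.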

### Twisted torus topologies

TPU v4 and v5p support **twisted torus** configurations on certain slice shapes. A twist remaps the wrap-around links so that traffic is more evenly distributed across the network.[3] Google Cloud reports that "a 4x4x8 twisted topology provides a 70% theoretical increase in bisection bandwidth compared to a non-twisted 4x4x8 slice."[9] Users can request twisted topologies by appending `_twisted` to the topology string (for example, `4x4x8_twisted`).[9]

## What size is a TPU Pod by generation?

Each TPU generation defines a maximum Pod size. A slice can be any supported subset of the full Pod.

| TPU generation | Chips per Pod | Maximum slice topology | Interconnect |
|---|---|---|---|
| v2 | 256 | 8x16 (2D) | 2D torus |
| v3 | 1,024 | 16x32 (2D) | 2D torus |
| v4 | 4,096 | 12x16x16 (3D) | 3D torus + OCS |
| v5e | 256 | 16x16 (2D) | 2D torus |
| v5p | 8,960 | 16x16x24 (3D) | 3D torus + OCS |
| v6e (Trillium) | 256 | 16x16 (2D) | 2D torus |
| TPU7x (Ironwood) | 9,216 | 8x16x16 (3D) | 3D torus + OCS |

For 3D generations, the largest schedulable single job can be smaller than the full Pod. On v5p, the largest single-slice accelerator type is v5p-12288, a 16x16x24 topology of 6,144 chips (96 cubes), even though a full v5p Pod has 8,960 chips.[10] To go beyond a single slice, jobs use Multislice.

## How are slices allocated with optical circuit switches?

TPU v4 introduced optical circuit switches (OCS) to the TPU interconnect.[3] OCS units sit between cubes and use microelectromechanical (MEMS) mirrors to physically redirect light beams through optical fibers. This allows Google to reconfigure which cubes are connected to which without touching physical cables.[3]

OCS provides several benefits for slice management:

- **Flexible sizing**: Google can provision a slice of any supported size, from 4 chips to the full Pod, by programming the switches.[3]
- **Fault tolerance**: if a chip or link fails, the OCS can route traffic around the faulty component, preserving the rest of the slice. This capability is called **ICI resiliency** and is enabled by default for slices of one cube or larger on v4, v5p, and TPU7x.[9]
- **Improved utilization**: because slices are not fixed to particular physical positions, the scheduler has more freedom to pack multiple slices into a single Pod.[3]
- **Energy efficiency**: OCS mirrors consume power only during reconfiguration events. Once a path is established, light passes through with minimal loss.[3]

## What is Multislice training?

Cloud TPU Multislice is a technology that allows a single training job to span multiple slices.[8] Google Cloud defines a Multislice configuration as "two or more TPU chip slices that can communicate over DCN."[8] Before Multislice, a job was limited to a single slice, capping the chip count at the Pod maximum (for example, 4,096 chips on v4). With Multislice, a run can use up to 256 slices, potentially spanning multiple Pods in different racks.[8] Google states that Multislice "enables near-linear scaling up to tens of thousands of TPU chips" by communicating between slices over the data-center network.[15]

### How does Multislice communication work?

Within each slice, chips communicate over ICI as usual. Between slices, data takes a longer path over the DCN, which Google Cloud describes as "a higher latency, lower-throughput network (when compared with ICI) that connects TPU slices in a Multislice configuration."[8] That path has three steps:

1. The source chip writes data to its host's memory via PCIe.
2. The host transmits the data over the data-center network (DCN) to the destination host.
3. The destination host writes the data into its chip's HBM via PCIe.[8]

DCN bandwidth per chip ranges from about 3 GBps (v5e) to 12.5 GBps (v6e), far below the 400 GBps (v5e) and 800 GBps (v6e) that each chip gets over ICI.[8] The [XLA](/wiki/jax) compiler automatically generates the inter-slice communication code and overlaps it with computation to hide latency.[15]
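
To make the gap concrete, the following back-of-the-envelope sketch compares the per-chip ICI figures from the ICI table with the approximate per-chip DCN figures quoted above. The dictionary and function names are hypothetical, the ICI numbers are bidirectional totals, and the transfer-time estimate ignores latency and protocol overhead.

```python
# Per-chip bandwidth in GBps, taken from the figures quoted in this article.
ICI_GBPS = {"v5e": 400.0, "v6e": 800.0}
DCN_GBPS = {"v5e": 3.0, "v6e": 12.5}


def ici_to_dcn_ratio(generation: str) -> float:
    """How many times more per-chip bandwidth ICI offers than DCN."""
    return ICI_GBPS[generation] / DCN_GBPS[generation]


def transfer_seconds(num_bytes: float, gbps: float) -> float:
    """Idealized time to move num_bytes at gbps gigabytes per second."""
    return num_bytes / (gbps * 1e9)
```

On these numbers ICI offers roughly 133 times the per-chip bandwidth of DCN on v5e and 64 times on v6e, which is why traffic between slices is kept to a minimum.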

### Which parallelism strategies work across slices?

Multislice supports several parallelism schemes:

| Parallelism type | Scope | Description |
|---|---|---|
| [Data parallelism](/wiki/data_parallelism) | Within or across slices | Each chip (or group of chips) holds a full copy of the model and processes a different batch of data. Gradients are averaged across replicas. |
| Fully sharded data parallelism (FSDP) | Within or across slices | Model parameters, gradients, and optimizer states are sharded across chips. Each chip holds only a fraction of the model, reducing memory per chip. |
| [Tensor parallelism](/wiki/model_parallelism) | Within slice (recommended) | Individual tensors (such as weight matrices) are split across chips. Requires high-bandwidth ICI and is not recommended across DCN. |
| Pipeline parallelism | Within or across slices | Different layers of the model are assigned to different chips or groups of chips. Data flows through the pipeline in micro-batches. |

Google recommends keeping tensor parallelism within a single slice because it demands the low-latency, high-bandwidth communication that only ICI provides.[15] Data parallelism and FSDP tolerate the higher latency of DCN and are the primary strategies used across slices.[15]
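
One way to picture this recommendation is as a three-level layout: data parallelism across slices (over DCN), with FSDP and tensor parallelism inside each slice (over ICI). The sketch below only computes the sizes of those three axes; the class and function are hypothetical and do not correspond to any JAX or Cloud TPU API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelismPlan:
    dcn_data: int    # data-parallel replicas, one per slice, synchronized over DCN
    ici_fsdp: int    # FSDP shards inside each slice, communicating over ICI
    ici_tensor: int  # tensor-parallel degree inside each slice, over ICI

    @property
    def total_chips(self) -> int:
        return self.dcn_data * self.ici_fsdp * self.ici_tensor


def plan_parallelism(num_slices: int, chips_per_slice: int,
                     tensor_parallel: int) -> ParallelismPlan:
    """Keep tensor parallelism inside a slice; replicate across slices."""
    if tensor_parallel < 1 or chips_per_slice % tensor_parallel != 0:
        raise ValueError("tensor_parallel must evenly divide chips_per_slice")
    return ParallelismPlan(
        dcn_data=num_slices,
        ici_fsdp=chips_per_slice // tensor_parallel,
        ici_tensor=tensor_parallel,
    )
```

For instance, `plan_parallelism(2, 256, 8)` describes two 256-chip slices, each split into 32 FSDP shards of 8-way tensor parallelism, for 512 chips in total.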

### What are the Multislice constraints?

- All slices in a Multislice configuration must have the same shape. Google Cloud states: "All slices must be of the same shape (for example, the same AcceleratorType). Heterogeneous slice shapes are not supported."[8]
- A maximum of 256 slices can participate in a single Multislice job.[8]
- Multislice is supported with [JAX](/wiki/jax) and [PyTorch](/wiki/pytorch) frameworks.[8]
- If any slice fails, Cloud TPU provisions a replacement and resets all remaining slices. Training resumes from the latest checkpoint.[8]
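
The first two constraints are simple enough to check before submitting a job. The sketch below is a hypothetical pre-flight check, not a Google Cloud API: it takes one accelerator type string per requested slice and reports any violations of the rules listed above.

```python
MAX_SLICES = 256  # upper bound on slices per Multislice job, as stated above


def check_multislice(accelerator_types: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if len(accelerator_types) < 2:
        problems.append("Multislice needs two or more slices")
    if len(accelerator_types) > MAX_SLICES:
        problems.append(f"at most {MAX_SLICES} slices are allowed")
    # Heterogeneous slice shapes are not supported.
    if len(set(accelerator_types)) > 1:
        problems.append("all slices must have the same accelerator type")
    return problems
```

A request such as `check_multislice(["v5e-256"] * 4)` returns an empty list, while mixing `v5e-256` with `v5e-128` is flagged.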

## How are slices programmed (software stack)?

TPU slices are programmed through a layered software stack:

### XLA compiler

The XLA (Accelerated Linear Algebra) compiler translates high-level framework operations into optimized TPU machine code. It handles partitioning computations across the chips in a slice, inserting collective communication operations (all-reduce, all-gather, reduce-scatter) as needed. Users typically interact with XLA indirectly through [JAX](/wiki/jax) or [TensorFlow](/wiki/tensorflow).

### GSPMD

GSPMD (General and Scalable Parallelization for ML Computation Graphs) is an XLA extension that automates the mapping of a single-device program onto a multi-chip slice.[4] Developers annotate a small number of tensors with sharding specifications; GSPMD propagates these annotations through the computation graph and generates the necessary communication code.[4] In benchmarks, GSPMD achieved 50% to 62% compute utilization on 128 to 2,048 TPU v3 cores for models with up to one trillion parameters.[4]

### Pathways

Pathways is a distributed runtime that allows a single Python client to orchestrate work across multiple TPU Pods.[5] It extends JAX's execution model so that SPMD computations have access to all provisioned cores, not just those on the local host.[5] Google used Pathways to train [PaLM](/wiki/palm) (540 billion parameters) on 6,144 TPU v4 chips spread across two Pods (3,072 chips each, attached to 768 hosts per Pod), reaching 57.8% hardware FLOPS utilization and 46.2% model FLOPS utilization.[6]

### Orbax checkpointing

Orbax is a JAX library that provides checkpointing primitives for saving and restoring model state (JAX PyTrees) to local storage or Google Cloud Storage. Reliable checkpointing is essential for Multislice training, where automatic recovery from slice failures depends on being able to reload the latest checkpoint without user intervention.

## Slice configurations by generation

### TPU v4 slice configurations

TPU v4 slices range from small single-host configurations to large multi-host topologies:

| Accelerator type | Topology | Chips | TensorCores |
|---|---|---|---|
| v4-8 | 2x2x1 | 4 | 8 |
| v4-16 | 2x2x2 | 8 | 16 |
| v4-32 | 2x2x4 | 16 | 32 |
| v4-64 | 2x4x4 | 32 | 64 |
| v4-128 | 4x4x4 | 64 | 128 |
| v4-256 | 4x4x8 | 128 | 256 |
| v4-512 | 4x8x8 | 256 | 512 |
| v4-1024 | 8x8x8 | 512 | 1,024 |
| v4-2048 | 8x8x16 | 1,024 | 2,048 |
| v4-4096 | 8x16x16 | 2,048 | 4,096 |

The v4-8 configuration (2x2x1, 4 chips) is single-host. Slices of v4-128 and above (one full cube or more) have 3D torus connectivity and ICI resiliency enabled by default.[9]

### TPU v5e slice configurations

| Accelerator type | Topology | Chips |
|---|---|---|
| v5e-1 | 1x1 | 1 |
| v5e-4 | 2x2 | 4 |
| v5e-8 | 2x4 | 8 |
| v5e-16 | 4x4 | 16 |
| v5e-32 | 4x8 | 32 |
| v5e-64 | 8x8 | 64 |
| v5e-128 | 8x16 | 128 |
| v5e-256 | 16x16 | 256 |

### TPU v5p slice configurations (selected)

| Accelerator type | Topology | Chips | Cubes |
|---|---|---|---|
| v5p-8 | 2x2x1 | 4 | less than 1 |
| v5p-128 | 4x4x4 | 64 | 1 |
| v5p-512 | 4x8x8 | 256 | 4 |
| v5p-1024 | 8x8x8 | 512 | 8 |
| v5p-4096 | 8x16x16 | 2,048 | 32 |
| v5p-12288 | 16x16x24 | 6,144 | 96 |
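
In the tables above, the number in an accelerator type tracks TensorCores rather than chips: v4 and v5p names are twice the chip count (two TensorCores per chip), while v5e names equal the chip count (one TensorCore per chip). The sketch below encodes that reading of the tables; the helper is hypothetical and covers only the three generations tabulated in this section.

```python
# TensorCores per chip, as listed in the chip table of this article.
TENSORCORES_PER_CHIP = {"v4": 2, "v5e": 1, "v5p": 2}


def chips_for(accelerator_type: str) -> int:
    """Convert a name such as 'v5p-12288' into a chip count."""
    generation, _, suffix = accelerator_type.partition("-")
    cores_per_chip = TENSORCORES_PER_CHIP[generation]
    tensorcores = int(suffix)
    if tensorcores % cores_per_chip != 0:
        raise ValueError(f"{accelerator_type!r} is not a whole number of chips")
    return tensorcores // cores_per_chip
```

This reproduces the table rows, for example 4 chips for `v4-8`, 256 chips for `v5e-256`, and 6,144 chips for `v5p-12288`.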

## Real-world training examples

TPU slices have been used to train some of the largest [large language models](/wiki/large_language_model) and other AI systems:

| Model | Organization | TPU generation | Slice size | Notes |
|---|---|---|---|---|
| [PaLM](/wiki/palm) 540B | Google | v4 | 6,144 chips (two Pods) | Trained using Pathways; 57.8% hardware FLOPS utilization [6] |
| [Gemini](/wiki/gemini) Ultra | Google | v4 + v5e | Multiple slices across data centers | First Google model trained across multiple data centers |
| Largest disclosed LLM training | Google | v5e | 50,944 chips (199 Pods) | Multislice run reaching 10 exa-FLOPS (BF16) at full scale, Nov 2023 [16] |
| [LaMDA](/wiki/lamda_language_model_for_dialogue_applications) | Google | v3 | 1,024 chips | Trained on a full TPU v3 Pod |

The 50,944-chip v5e run, spanning 199 Pods of 256 chips each, is the largest publicly disclosed distributed LLM training job Google has reported. At 197 BF16 TFLOPS per chip, the full configuration amounts to roughly 10 exa-FLOPS of aggregate peak BF16 compute; the same source cites around 5.32 exa-OPS with INT8 quantization.[16]
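
The 10 exa-FLOPS figure can be cross-checked against the per-chip numbers elsewhere in this article. The sketch below multiplies chip count by peak per-chip throughput; it gives an aggregate peak only and says nothing about achieved utilization. The function name is illustrative.

```python
def aggregate_peak_exaflops(num_chips: int, tflops_per_chip: float) -> float:
    """Aggregate peak compute in exa-FLOPS (1 exa-FLOPS = 1e6 TFLOPS)."""
    return num_chips * tflops_per_chip / 1e6


V5E_CHIPS_PER_POD = 256   # from the Pod size table
V5E_BF16_TFLOPS = 197.0   # from the chip table

pods = 50_944 // V5E_CHIPS_PER_POD
peak = aggregate_peak_exaflops(50_944, V5E_BF16_TFLOPS)
```

Here `pods` comes out to 199 and `peak` to about 10.04, matching the reported scale of the run.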

## How do TPU slices compare with GPU clusters?

TPU slices differ from [GPU](/wiki/gpu_computing) clusters in several important ways:

| Aspect | TPU slice | GPU cluster |
|---|---|---|
| Interconnect topology | Fixed 2D or 3D torus/mesh | Hierarchical switch network (NVLink, NVSwitch, InfiniBand) |
| Links per device | Constant (4 or 6 ICI ports) | Varies by level in the switch hierarchy |
| Scaling model | Add more chips to the torus; bandwidth per device stays constant | Add switches and links; bandwidth may decrease at higher tiers |
| Reconfigurability | OCS allows dynamic repartitioning (v4+) | Typically fixed cabling |
| Programming model | XLA/GSPMD automatic sharding | Manual or semi-automatic (Megatron-LM, DeepSpeed, FSDP) |
| Chip design | Custom ASIC optimized for [matrix multiplication](/wiki/tensor) | General-purpose GPU with tensor cores |

The torus topology of TPU slices means that the number of links per device and per-device bandwidth remain constant regardless of the total system size. GPU clusters, by contrast, often see effective per-device bandwidth decline as the cluster grows beyond what a single switch fabric can serve.

## Practical considerations

### How do you choose a slice size?

Selecting the right slice size involves balancing several factors:

- **Model size**: models that exceed the HBM of a single chip require [model parallelism](/wiki/model_parallelism) or FSDP, which in turn requires a slice large enough to hold the sharded model.
- **Batch size**: [data parallelism](/wiki/data_parallelism) scales the effective batch size with the number of chips. The per-chip batch size must remain large enough to keep the MXUs busy (high arithmetic intensity).
- **Communication overhead**: larger slices add compute but also increase the aggregate communication volume during collectives. Torus-connected slices (one cube or larger) handle this more efficiently than mesh-connected sub-cube slices.
- **Cost**: Cloud TPU pricing is per chip-hour. A 512-chip slice costs 64 times as much per hour as an 8-chip slice.
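
Two of these factors lend themselves to quick estimates. The sketch below computes a lower bound on slice size from model weights alone and the relative hourly cost of two slices. It deliberately ignores gradients, optimizer state, and activations, which usually dominate memory during training, and it treats 1 GB as 10^9 bytes; both are simplifying assumptions, and the function names are hypothetical.

```python
from math import ceil


def min_chips_for_weights(num_params: int, bytes_per_param: int, hbm_gb: float) -> int:
    """Fewest chips whose combined HBM can hold the weights alone."""
    weight_gb = num_params * bytes_per_param / 1e9
    return ceil(weight_gb / hbm_gb)


def relative_cost(chips_a: int, chips_b: int) -> float:
    """Hourly cost of slice A relative to slice B under per-chip-hour pricing."""
    return chips_a / chips_b
```

For a 540-billion-parameter model stored in 2-byte BF16, the weights occupy 1,080 GB, so at 32 GB of HBM per chip at least 34 chips are needed before any training state is counted. `relative_cost(512, 8)` returns the factor of 64 mentioned above.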

### How do you choose a topology shape?

Even for a fixed number of chips, multiple topology shapes may be available. For example, a 512-chip v4 slice can be configured as 4x4x32, 4x8x16, or 8x8x8. The best choice depends on the parallelism strategy:

- **Cube shapes** (all dimensions equal or close to equal, like 8x8x8) maximize bisection bandwidth and are generally best for workloads dominated by all-reduce operations.
- **Elongated shapes** (like 4x4x32) may be preferable for pipeline parallelism, where the long dimension maps naturally to the pipeline stages.
- **Twisted torus** variants should be considered when bisection bandwidth is the bottleneck, as they can increase it by up to 70%.[3][9]
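
The candidate shapes for a given chip count can be enumerated mechanically. The sketch below lists every AxBxC shape with A <= B <= C in which each dimension is a multiple of 4, then picks the most cube-like one. Restricting all dimensions to multiples of 4 (whole cubes only) is an assumption of this sketch; which shapes are actually offered depends on the TPU generation.

```python
def full_cube_shapes(num_chips: int) -> list[tuple[int, int, int]]:
    """All (A, B, C) with A <= B <= C, each a multiple of 4, and A*B*C == num_chips."""
    shapes = []
    for a in range(4, num_chips + 1, 4):
        if a * a * a > num_chips:
            break
        if num_chips % a != 0:
            continue
        for b in range(a, num_chips + 1, 4):
            if a * b * b > num_chips:
                break
            if (num_chips // a) % b != 0:
                continue
            c = num_chips // (a * b)  # c >= b because a * b * b <= num_chips
            if c % 4 == 0:
                shapes.append((a, b, c))
    return shapes


def most_cubic(shapes: list[tuple[int, int, int]]) -> tuple[int, int, int]:
    """Pick the shape whose longest and shortest sides differ least."""
    return min(shapes, key=lambda s: s[2] - s[0])
```

For 512 chips this yields 4x4x32, 4x8x16, and 8x8x8, with 8x8x8 as the most cube-like option.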

### What is ICI resiliency?

ICI resiliency, enabled by default on v4, v5p, and TPU7x slices of one cube or larger, allows ICI connections to be dynamically rerouted around optical or switch faults.[9] This improves availability but can cause temporary performance degradation while the rerouting takes effect. For latency-sensitive inference workloads, users may choose to disable ICI resiliency.[10]

## See also

- [Tensor Processing Unit](/wiki/tpu)
- [Cloud TPU](/wiki/cloud_tpu)
- [TPU Pod](/wiki/tpu_pod)
- [TPU chip](/wiki/tpu_chip)
- [Systolic array](/wiki/systolic_array)
- [Data parallelism](/wiki/data_parallelism)
- [Model parallelism](/wiki/model_parallelism)
- [GPU computing](/wiki/gpu_computing)

## References

1. Jouppi, N.P., Young, C., Patil, N., Patterson, D., et al. (2017). "In-Datacenter Performance Analysis of a Tensor Processing Unit." *Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA)*. https://arxiv.org/abs/1704.04760
2. Jouppi, N.P., Yoon, D.H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C., and Patterson, D.A. (2020). "A Domain-Specific Supercomputer for Training Deep Neural Networks." *Communications of the ACM*, 63(7), 67-78. https://dl.acm.org/doi/10.1145/3360307
3. Jouppi, N.P., Kurian, G., Li, S., Ma, P., et al. (2023). "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings." *Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA)*. https://arxiv.org/abs/2304.01433
4. Xu, Y., Lee, H., Chen, D., et al. (2021). "GSPMD: General and Scalable Parallelization for ML Computation Graphs." *arXiv preprint*. https://arxiv.org/abs/2105.04663
5. Barham, P., Chowdhery, A., Dean, J., Ghemawat, S., et al. (2022). "Pathways: Asynchronous Distributed Dataflow for ML." *Proceedings of MLSys*. https://arxiv.org/abs/2203.12533
6. Chowdhery, A., Narang, S., Devlin, J., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." *arXiv preprint*. https://arxiv.org/abs/2204.02311
7. Google Cloud. "TPU architecture (system architecture)." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm
8. Google Cloud. "Cloud TPU Multislice overview." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/multislice-introduction
9. Google Cloud. "TPU v4." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/v4
10. Google Cloud. "TPU v5p." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/v5p
11. Google Cloud. "TPU v5e." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/v5e
12. Google Cloud. "TPU v6e (Trillium)." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/v6e
13. Google Cloud. "TPU7x (Ironwood)." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/tpu7x
14. JAX Scaling Book. "How to Think About TPUs." https://jax-ml.github.io/scaling-book/tpus/
15. Google Cloud Blog. "Using Cloud TPU Multislice to scale AI workloads." (August 31, 2023). https://cloud.google.com/blog/products/compute/using-cloud-tpu-multislice-to-scale-ai-workloads
16. Google Cloud Blog. "The world's largest distributed LLM training job on TPU v5e." (November 2023). https://cloud.google.com/blog/products/compute/the-worlds-largest-distributed-llm-training-job-on-tpu-v5e