# TPU Device

> Source: https://aiwiki.ai/wiki/tpu_device
> Updated: 2026-06-01
> Categories: AI Hardware, Google, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **Tensor Processing Unit (TPU)** is an [application-specific integrated circuit](/wiki/ai_chip) (ASIC) designed by [Google](/wiki/google) to accelerate [machine learning](/wiki/machine_learning) workloads. Unlike general-purpose processors such as [CPUs](/wiki/cpu) or [GPUs](/wiki/gpu_computing), TPUs are purpose-built for the mathematical operations that dominate [neural network](/wiki/neural_network) training and [inference](/wiki/inference). Google first deployed TPUs internally in 2015 and publicly announced them in 2016.[10] Since then, seven generations of TPUs have been released, each delivering significant improvements in compute performance, memory capacity, and energy efficiency. TPUs power many of Google's largest AI systems, including [Gemini](/wiki/gemini), [AlphaFold](/wiki/alphafold), and Google Search.[4]

## Explain like I'm 5 (ELI5)

Imagine you have a regular toolbox that can fix lots of different things around the house. That is like a normal computer chip (a CPU). Now imagine you have a special tool that is really, really good at one specific job, like tightening screws super fast. A TPU is like that special tool, but instead of screws, it is really good at doing the math that helps computers learn. Google built TPUs so that programs that recognize pictures, understand speech, and answer questions can do their math much faster and use less electricity than if they used regular chips.

## History and motivation

Google's motivation for building custom silicon came from a projection made in 2013. Engineers estimated that if every user spoke to their Android phone for just three minutes per day using voice search, Google would need to double the number of data centers worldwide to handle the [deep learning](/wiki/deep_learning) inference load.[1] This projected cost was unacceptable, so Google began developing a domain-specific accelerator that could run neural network inference far more efficiently than CPUs or GPUs.[10]

The first TPU (v1) entered Google's data centers in 2015 and was formally announced at Google I/O in May 2016.[10] A landmark paper by Norman Jouppi and colleagues, presented at the International Symposium on Computer Architecture (ISCA) in June 2017, described the TPU v1 architecture and showed that it was 15 to 30 times faster than contemporary CPUs and GPUs on inference tasks, with 30 to 80 times better performance per watt.[1]

TPU v2, announced in 2017, expanded the scope from inference only to both training and inference, and introduced the [bfloat16](/wiki/bfloat16) number format, which later became an industry standard adopted by other hardware vendors including Intel, AMD, and NVIDIA.[8] Each subsequent generation has pushed performance and scale further, culminating in the seventh-generation Ironwood chip announced in April 2025.[9]

## Architecture and design

### TensorCore

The fundamental compute unit inside a TPU chip is the **TensorCore**. Each TPU chip contains one or more TensorCores (Ironwood uses a chiplet design with two TensorCores per chip).[3] A TensorCore consists of three main processing elements:[3]

| Component | Function | Details |
|---|---|---|
| Matrix multiply unit (MXU) | Performs dense matrix multiplications | 128x128 systolic array (v2 through v5p) or 256x256 systolic array (v6e and Ironwood). Inputs in [bfloat16](/wiki/bfloat16); accumulations in FP32. A 128x128 MXU performs 16,384 multiply-accumulate operations per cycle; a 256x256 MXU performs 65,536. |
| Vector unit | General-purpose computation | Handles activations, [softmax](/wiki/softmax), normalization, and element-wise operations. |
| Scalar unit | Control and addressing | Manages control flow, memory address calculations, and other maintenance operations. |

### Systolic array

The MXU uses a **systolic array** architecture, named after the Greek word for heartbeat because data pulses rhythmically through the chip. In a systolic array, each multiply-accumulator passes its result directly to the next one in the grid without writing intermediate values back to memory.[3] Because intermediate values never leave the array, this removes most of the memory traffic that limits matrix performance on conventional processors.
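The data flow described above can be sketched as a small cycle-by-cycle simulation. This is a toy model, not Google's hardware design: it assumes a weight-stationary grid in which each cell holds one weight, activations move one cell to the right per cycle, and partial sums move one cell down per cycle. The function name `systolic_matmul` and the skewed input schedule are illustrative assumptions.

```python
def systolic_matmul(a, w):
    """Multiply a (n x k) by w (k x m) on a simulated k x m grid of cells.

    Toy weight-stationary model: cell (r, c) permanently holds w[r][c].
    """
    n, k, m = len(a), len(w), len(w[0])
    act = [[0] * m for _ in range(k)]   # activation register in each cell
    psum = [[0] * m for _ in range(k)]  # partial-sum register in each cell
    out = [[0] * m for _ in range(n)]

    for t in range(n + k + m - 2):      # cycles until the last result drains
        new_act = [[0] * m for _ in range(k)]
        new_psum = [[0] * m for _ in range(k)]
        for r in range(k):
            for c in range(m):
                if c == 0:
                    # Skewed feed: grid row r starts r cycles late.
                    row = t - r
                    a_in = a[row][r] if 0 <= row < n else 0
                else:
                    a_in = act[r][c - 1]          # from the left neighbour
                p_in = psum[r - 1][c] if r > 0 else 0  # from the cell above
                new_act[r][c] = a_in
                new_psum[r][c] = p_in + a_in * w[r][c]
        act, psum = new_act, new_psum

        # Finished sums leave the bottom row, one output row per column slot.
        for c in range(m):
            row = t - (k - 1) - c
            if 0 <= row < n:
                out[row][c] = psum[k - 1][c]
    return out


print(systolic_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
# [[58, 64], [139, 154]]
```

Note that no intermediate product is ever written to a shared memory: each value is consumed by a neighbouring cell on the next cycle.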

In the TPU v1, the systolic array was 256x256, performing 65,536 multiply-accumulate operations per clock cycle. Running at 700 MHz, this delivered 92 trillion 8-bit operations per second (92 TOPS) while consuming only 40 watts.[1] More than 90% of the silicon area was devoted to useful computation, compared to roughly 30% in a typical GPU.[1]
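The 92 TOPS figure follows from the array size and clock rate quoted above. A minimal check, assuming the usual convention that one multiply-accumulate counts as two operations (the helper name is illustrative):

```python
def peak_ops_per_second(rows, cols, clock_hz):
    """Peak throughput of a rows x cols MAC array at the given clock."""
    macs_per_cycle = rows * cols
    return macs_per_cycle * 2 * clock_hz  # multiply + add = 2 operations


tpu_v1 = peak_ops_per_second(256, 256, 700_000_000)
print(f"{tpu_v1 / 1e12:.1f} TOPS")  # 91.8 TOPS, quoted as 92 TOPS
```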

The engineering tradeoff behind the systolic array is deliberate: it sacrifices the general-purpose flexibility of a GPU's thousands of programmable CUDA cores in exchange for much higher operation density and energy efficiency on matrix workloads.

### Memory hierarchy

TPUs use **high-bandwidth memory (HBM)** as their primary off-chip memory. Data flows through a pipeline: the host streams data into an infeed queue, the TPU loads it from the infeed queue into HBM, computations are performed, and results are placed into an outfeed queue.[3]

On-chip, TPUs have vector memory (VMEM) that feeds data to the MXU and vector unit. VMEM bandwidth is approximately 22 times higher than HBM bandwidth, meaning operations reading from VMEM require an arithmetic intensity of only about 10 to 20 FLOPs per byte to achieve peak FLOPS utilization.[3] This layered memory system is designed to keep the MXU fed with data as continuously as possible.
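The arithmetic-intensity threshold can be estimated with a simple roofline calculation: a kernel reaches peak compute only if it performs at least peak FLOPS divided by memory bandwidth operations per byte moved. The sketch below plugs in the TPU v5e figures from the generations table in this article (197 TFLOPS, 819 GB/s) together with the 22x VMEM ratio quoted above; applying that ratio to v5e specifically is an assumption made for illustration.

```python
def critical_intensity(peak_flops, bytes_per_second):
    """FLOPs per byte needed before compute, not memory, is the limit."""
    return peak_flops / bytes_per_second


PEAK_FLOPS = 197e12        # TPU v5e peak compute (from the table)
HBM_BANDWIDTH = 819e9      # TPU v5e HBM bandwidth in bytes/s (from the table)
VMEM_RATIO = 22            # assumed VMEM-to-HBM bandwidth ratio

hbm = critical_intensity(PEAK_FLOPS, HBM_BANDWIDTH)
vmem = critical_intensity(PEAK_FLOPS, VMEM_RATIO * HBM_BANDWIDTH)
print(f"HBM:  {hbm:.0f} FLOPs/byte")   # about 241
print(f"VMEM: {vmem:.0f} FLOPs/byte")  # about 11
```

Under these assumptions the VMEM threshold lands inside the 10 to 20 range stated above, while data read straight from HBM needs roughly 240 FLOPs per byte.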

### SparseCore

Starting with TPU v4, Google added a specialized processor called **SparseCore** to handle embedding operations.[2] While the MXU excels at dense matrix multiplication, embedding lookups in recommendation systems, ranking models, and [large language models](/wiki/large_language_model) involve irregular, data-dependent memory access patterns where the MXU provides no advantage.[13]

TPU v4 featured four SparseCores per chip, each containing 16 compute tiles that operate in parallel on disjoint subsets of embedding operations.[13] The SparseCores speed up embedding-heavy models by 5 to 7 times while using only 5% of the chip's die area and power budget.[2] Overall, TPU v4 is 3 times faster than TPU v3 on recommendation models and 5 to 30 times faster than CPU-based systems.[13]

### Inter-chip interconnect (ICI)

TPU chips within a single slice communicate over a high-speed **inter-chip interconnect (ICI)**. The ICI connects chips directly to their neighbors without any external switches, a design Google calls "glueless" networking. The networking logic is integrated directly into the chip itself.[3]

Different TPU generations use different ICI topologies:

| TPU generation | ICI topology | Notes |
|---|---|---|
| TPU v2, v3 | 2D torus | Each chip connects to 4 neighbors |
| TPU v4 | 3D torus | Each chip connects to 6 neighbors; uses optical circuit switches for reconfigurability |
| TPU v5e, v6e | 2D torus | Each chip connects to 4 neighbors |
| TPU v5p | 3D torus | 4,800 Gbps per chip; 8,960 chips per pod |
| Ironwood (v7) | 3D torus | 1.2 TB/s bidirectional bandwidth per chip |
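The neighbor counts in the table follow from the torus wiring: every chip links to the chip one step away in each direction along each axis, and the edges wrap around. A minimal sketch of that addressing, with a hypothetical `torus_neighbors` helper and coordinate scheme that are not part of any Google API:

```python
def torus_neighbors(coord, shape):
    """Return the chips directly linked to `coord` in a torus of `shape`.

    Works for any number of dimensions; assumes every axis has at least
    3 chips so that the two neighbours along an axis are distinct.
    """
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            other = list(coord)
            other[axis] = (coord[axis] + step) % size  # wrap-around link
            neighbors.append(tuple(other))
    return neighbors


print(torus_neighbors((0, 0), (4, 4)))
# [(3, 0), (1, 0), (0, 3), (0, 1)]
print(len(torus_neighbors((0, 0, 0), (4, 4, 4))))
# 6
```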

## TPU generations

Google has released seven generations of TPU hardware, each with different design targets and capabilities.[10]

| Generation | Year | Compute (per chip) | HBM capacity | HBM bandwidth | Key features |
|---|---|---|---|---|---|
| TPU v1 | 2015 (deployed) / 2016 (announced) | 92 TOPS (INT8) | N/A (28 MiB on-chip) | N/A | Inference only; 256x256 systolic array; 40W TDP |
| TPU v2 | 2017 | 180 TFLOPS (per 4-chip board) | 64 GB (per 4-chip board) | N/A | First training-capable TPU; introduced [bfloat16](/wiki/bfloat16) |
| TPU v3 | 2018 | 420 TFLOPS (per 4-chip board) | 128 GB (per 4-chip board) | N/A | Liquid cooling; 2x performance over v2 |
| TPU v4 | 2021 | 275 TFLOPS (bf16) | 32 GB HBM2e | N/A | [SparseCore](/wiki/sparsecore); optical circuit switches; 3D torus ICI |
| TPU v5e | 2023 | ~197 TFLOPS | 16 GB | 819 GB/s | Cost-optimized; 2.7x perf/dollar over v4; single core per chip |
| TPU v5p | 2023 | ~459 TFLOPS | 95 GB | 2.8 TB/s | Performance-optimized; 8,960-chip pods (~4.45 EFLOPS) |
| TPU v6e (Trillium) | 2024 | ~918 TFLOPS | 32 GB | N/A | 256x256 MXU; 4.7x peak compute over v5e; 2x memory and ICI bandwidth over v5e |
| TPU v7 (Ironwood) | 2025 | 4,614 TFLOPS | 192 GB HBM3e | 7.4 TB/s | Inference-focused; chiplet design (two TensorCores); FP8 support; 5nm process; ~100B transistors; 600W TDP |

### TPU v1

TPU v1 was designed exclusively for inference. Its 256x256 systolic array and 28 MiB of on-chip memory were sufficient for running trained models but not for the larger memory and bidirectional data flow requirements of training.[1] Google deployed TPU v1 across its data centers to accelerate services like Google Search, Google Photos, Google Translate, and [Gmail](/wiki/gmail).[10]

### TPU v2

TPU v2 was the first generation capable of both training and inference. It introduced the bfloat16 floating-point format, a 16-bit format with the same exponent range as 32-bit IEEE float but reduced mantissa precision (7 bits instead of 23).[8] This design maintains numerical stability during training while halving memory usage compared to float32.[8] TPU v2 was made available to external researchers through the TensorFlow Research Cloud program.[10]

### TPU v3

TPU v3 more than doubled compute performance over v2, to 420 TFLOPS per four-chip board, and doubled HBM capacity to 128 GB per board. It was the first TPU generation to use liquid cooling, which allowed higher clock speeds and denser chip packaging.[10]

### TPU v4

Announced at Google I/O 2021, TPU v4 introduced two major architectural changes. First, it added SparseCore for embedding-heavy workloads. Second, it replaced electrical inter-pod connections with optical circuit switches (OCS), enabling dynamic reconfiguration of the 3D torus topology. A 2023 paper in the proceedings of ISCA described the TPU v4 supercomputer as an "optically reconfigurable supercomputer" with 4,096 chips per pod.[2] TPU v4 achieved more than 2x the performance of v3.[2]

### TPU v5e and v5p

The fifth generation was split into two variants. TPU v5e, announced in August 2023, was designed for cost efficiency, reducing core count and clock speed to hit aggressive power and cost targets. It delivers 2.7 times higher performance per dollar than TPU v4.[4] TPU v5p, announced in December 2023, was designed for maximum training performance, scaling to 8,960-chip pods delivering approximately 4.45 exaFLOPS.[4] Google positioned TPU v5p as competitive with the [NVIDIA](/wiki/nvidia) H100.[10]

### TPU v6e (Trillium)

Announced at Google I/O in May 2024 and available in preview from October 2024, Trillium expanded the MXU from 128x128 to 256x256 multiply-accumulators. This quadrupled peak FLOPS per cycle at the same clock speed, delivering 4.7 times the peak compute performance of TPU v5e. HBM capacity and bandwidth also doubled compared to v5e.[5]

### TPU v7 (Ironwood)

Unveiled at Google Cloud Next in April 2025, Ironwood is presented by Google as its first TPU built for the "age of inference".[9] Each chip delivers 4,614 TFLOPS of peak compute and includes 192 GB of HBM3e memory with 7.4 TB/s bandwidth.[6] Ironwood uses a chiplet design where two TensorCores, each with its own [SparseCore](/wiki/sparsecore) pair and 96 GB of HBM, are connected by a die-to-die (D2D) interface that is six times faster than a single ICI link.[6] The chip is fabricated on a 5nm process with approximately 100 billion transistors.[6]

Ironwood is offered in two pod configurations: 256 chips and 9,216 chips. The larger configuration delivers 42.5 exaFLOPS, which Google says is more than 24 times the compute of the El Capitan supercomputer.[9] It is also the first TPU to support FP8 calculations in its matrix math units.[6]
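The pod-level figure is the per-chip peak multiplied by the chip count. A short check using only the numbers quoted above (the helper name is illustrative, and real workloads will not sustain peak throughput):

```python
PEAK_TFLOPS_PER_CHIP = 4_614  # Ironwood per-chip peak quoted above


def pod_exaflops(chips, tflops_per_chip=PEAK_TFLOPS_PER_CHIP):
    """Aggregate peak compute of a pod, in exaFLOPS."""
    return chips * tflops_per_chip / 1_000_000  # 1 EFLOPS = 1,000,000 TFLOPS


print(f"{pod_exaflops(9_216):.1f} EFLOPS")  # 42.5 EFLOPS
print(f"{pod_exaflops(256):.2f} EFLOPS")    # 1.18 EFLOPS
```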

## Software stack and programming model

### XLA compiler

All code that runs on TPUs must be compiled by the **XLA (Accelerated Linear Algebra)** compiler. XLA is a just-in-time compiler that takes the computational graph emitted by a machine learning framework and compiles it into TPU machine code.[4] Its most important optimization is **operator fusion**, which merges multiple operations into a single kernel to reduce memory transfers.[4] Since memory bandwidth is typically the scarcest resource on hardware accelerators, eliminating unnecessary memory operations is one of the most effective ways to improve performance.
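Why fusion matters can be seen in a toy model that counts memory traffic. The sketch below is not XLA's algorithm; it simply compares three separate element-wise passes (scale, add bias, ReLU), each reading and writing a full array, with one fused pass that produces the same result. The function names and the traffic accounting (one unit per element read or written) are assumptions for illustration.

```python
def unfused(x, scale, bias):
    """Three separate kernels; each reads one array and writes another."""
    traffic = 0
    scaled = [v * scale for v in x]
    traffic += 2 * len(x)              # read x, write scaled
    shifted = [v + bias for v in scaled]
    traffic += 2 * len(x)              # read scaled, write shifted
    out = [max(v, 0.0) for v in shifted]
    traffic += 2 * len(x)              # read shifted, write out
    return out, traffic


def fused(x, scale, bias):
    """One kernel; intermediates stay in registers, never in memory."""
    out = [max(v * scale + bias, 0.0) for v in x]
    return out, 2 * len(x)             # read x, write out


data = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(unfused(data, 2.0, 1.0))  # ([0.0, 0.0, 1.0, 3.0, 5.0], 30)
print(fused(data, 2.0, 1.0))    # ([0.0, 0.0, 1.0, 3.0, 5.0], 10)
```

In this model the fused version moves one third of the data for identical output, which is the effect the compiler is after.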

XLA is now part of the **OpenXLA** project, an open-source initiative that provides a common compilation stack for [JAX](/wiki/jax), [PyTorch](/wiki/pytorch), and [TensorFlow](/wiki/tensorflow).[13] Google's MLPerf submissions demonstrated a seven-fold performance gain in training throughput for [BERT](/wiki/bert) using XLA-optimized compilation.

### Supported frameworks

TPUs support three major machine learning frameworks:

| Framework | TPU integration | Notes |
|---|---|---|
| [JAX](/wiki/jax) | Native support | JAX is designed around XLA from the ground up; the recommended framework for TPU development |
| [TensorFlow](/wiki/tensorflow) | Native support | Historically the primary TPU framework; supports TPUStrategy for distributed training |
| [PyTorch](/wiki/pytorch) | Via PyTorch/XLA | Open-source package that enables PyTorch to run on XLA devices; uses the PJRT runtime |

To run PyTorch on TPUs, users install the `torch_xla` package and obtain a TPU device handle via `xm.xla_device()`, where `xm` is the conventional alias for the `torch_xla.core.xla_model` module.[14] The PyTorch/XLA project has migrated from the older XRT runtime to the PJRT runtime used by JAX, improving compatibility and performance.[14]

### Cloud TPU VMs

Google Cloud provides TPU access through **Cloud TPU VMs**, which give users direct SSH access to a Linux virtual machine with root privileges and access to the underlying TPU hardware.[4] This architecture replaced the earlier "TPU node" model, which required a separate host VM communicating with TPU workers over gRPC. Cloud TPU VMs simplify debugging by providing direct access to compiler and runtime logs.[4]

### Numeric formats

TPUs natively support the **bfloat16** floating-point format. Bfloat16 uses one sign bit, eight exponent bits, and seven mantissa bits. By retaining the same exponent range as float32, bfloat16 avoids the overflow and underflow issues that plague the IEEE float16 format during training.[8] Unlike float16, bfloat16 does not require loss scaling, making it nearly a drop-in replacement for float32.[8]
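Because bfloat16 shares float32's sign and exponent layout, a float32 value can be converted by keeping only its upper 16 bits. The sketch below uses plain truncation for simplicity; real hardware typically rounds to nearest, so the low-order bit may differ. The helper names are illustrative.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x (truncating, not rounding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16  # keep sign (1), exponent (8), top 7 mantissa bits


def bfloat16_bits_to_float32(b):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]


print(hex(float32_to_bfloat16_bits(1.0)))                            # 0x3f80
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159)))   # 3.140625
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(1e38)) > 6e4)  # True
```

The last line shows the range advantage: 1e38 survives the conversion, whereas IEEE float16 cannot represent values above 65,504.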

By default, TPUs perform matrix multiplications with bfloat16 inputs and accumulate results in float32. This mixed-precision approach delivers performance gains ranging from 4% to 47% (geometric mean of 13.9%) while using half the memory of full float32 training.[8] Ironwood is the first TPU to also support **FP8** calculations.[6]

## TPU topology and scaling

### Pods and slices

A **TPU pod** is a contiguous set of TPU chips grouped together within a specialized network. A **slice** is a subset of chips within a pod, all connected by ICI. Users provision TPU resources in slices of various sizes (for example, v5e-8 refers to a slice of 8 TPU v5e chips).[4]

### Multislice

**Multislice** is a scaling technology that extends TPU connectivity beyond the ICI network of a single slice. In a multislice configuration, chips within each slice communicate over ICI, while chips in different slices communicate through host CPUs over the data-center network (DCN).[7] The XLA compiler automatically inserts hierarchical collective operations and optimizes compute-communication overlap across the hybrid DCN/ICI topology.[7]
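The idea of a hierarchical collective can be illustrated with a toy all-reduce over per-chip gradient values: reduce inside each slice first (the fast ICI step), combine one value per slice across slices (the slower DCN step), then hand the total back to every chip. This is only a conceptual sketch; the actual collectives XLA emits are not described here, and the function name is hypothetical.

```python
def hierarchical_all_reduce(slices):
    """Sum values across all chips using a two-level reduction.

    `slices` is a list of slices; each slice is a list of per-chip values.
    """
    # Step 1: reduce within each slice (would travel over ICI).
    slice_sums = [sum(chips) for chips in slices]
    # Step 2: reduce one value per slice across slices (would travel over DCN).
    total = sum(slice_sums)
    # Step 3: broadcast the result back to every chip in every slice.
    return [[total] * len(chips) for chips in slices]


print(hierarchical_all_reduce([[1, 2], [3, 4]]))
# [[10, 10], [10, 10]]
```

Only one value per slice crosses the data-center network in this model, which is why the two-level scheme suits a topology where DCN links are much slower than ICI.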

Multislice enables training jobs to use more than 4,096 chips in a single run with TPU v4, and even larger configurations with later generations. This technology uses standard [data parallelism](/wiki/data_parallelism) and requires minimal code changes from the user.[7]

## Comparison to CPUs and GPUs

TPUs differ from CPUs and GPUs in fundamental ways. CPUs are general-purpose processors optimized for sequential tasks with complex control flow. GPUs contain thousands of smaller programmable cores designed for parallel workloads, originally graphics rendering but now widely used for machine learning via the CUDA programming model. TPUs trade away most of that general-purpose flexibility, dedicating almost all silicon area to matrix arithmetic.

| Attribute | CPU | [GPU](/wiki/gpu_computing) | TPU |
|---|---|---|---|
| Architecture | Few powerful cores with large caches | Thousands of small programmable CUDA cores | Systolic array (MXU) plus vector and scalar units |
| Design target | General-purpose computing | Parallel workloads (graphics, ML, HPC) | [Neural network](/wiki/neural_network) training and inference |
| Programmability | Fully programmable | Programmable via CUDA, OpenCL, etc. | Programmable via XLA compiler only |
| Memory | System DRAM (DDR) | HBM (up to 80 GB on H100) | HBM (up to 192 GB on Ironwood) |
| Power per chip | 65 to 350W (typical) | 300 to 1,000W (high-end AI GPUs) | 40W (v1) to 600W (Ironwood) |
| Availability | Universal | Multi-vendor (NVIDIA, AMD, Intel) | Google Cloud only |
| Software ecosystem | All languages and frameworks | CUDA (dominant), ROCm, OpenCL | XLA, JAX, TensorFlow, PyTorch/XLA |

### Performance comparisons

TPU v3 trained [BERT](/wiki/bert) models 8 times faster than NVIDIA V100 GPUs and delivered 1.7 to 2.4 times faster training for ResNet-50 and large language models. BERT training completes 2.8 times faster on TPUs than on A100 GPUs, and batch inference delivers 4 times higher throughput for [transformer](/wiki/transformer) models. Single-query latency is 30% lower for models exceeding 10 billion parameters.

Google's Cloud TPU v6e (Trillium) delivers approximately 4 times better performance per dollar than NVIDIA H100 GPUs for large language model inference, according to Google's published benchmarks.[5]

### Energy efficiency

TPUs consume significantly less power than comparable GPU setups on supported workloads. Modern TPUs deliver 2 to 3 times better performance per watt than contemporary GPUs. Individual TPU chips typically consume 175 to 250 watts (prior to Ironwood), while high-end AI GPUs may use 700 to 1,000 watts. TPU-based systems can reduce overall power consumption by 60 to 65% compared to equivalent GPU deployments.

## Applications

TPUs are used across a wide range of AI applications both within Google and by external Cloud customers.

### Google internal use

All phases of [Gemini](/wiki/gemini) model training run on TPU v5e and v6e pods without fallback to NVIDIA GPUs. Google Search, Google Translate, [Gmail](/wiki/gmail), Google Photos, and YouTube all use TPUs for inference workloads. The Nobel Prize-winning [AlphaFold](/wiki/alphafold) protein structure prediction system runs on TPUs.[10]

### External customers

AssemblyAI reports that Cloud TPU v5e delivers up to 4 times greater performance per dollar for speech-recognition inference compared to other solutions. Gridspace achieved 5 times training speedups and 6 times larger inference scale on TPUs for conversational AI models. AI21 Labs uses Trillium TPUs for its Mamba/Jamba language models.[5]

## Edge TPU

In addition to the data-center TPU line, Google produces the **Edge TPU**, a small ASIC designed for machine learning inference on low-power edge devices. The Edge TPU delivers 4 trillion operations per second (4 TOPS) while consuming only 2 watts, achieving 2 TOPS per watt.

The Edge TPU uses an estimated 64x64 systolic array running at 480 MHz. It can execute MobileNet V2 at nearly 400 frames per second and runs inference 70 to 100 times faster than a CPU on supported models. It supports only [TensorFlow Lite](/wiki/tensorflow) models that are fully 8-bit quantized and compiled for the Edge TPU.
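"Fully 8-bit quantized" means that weights and activations are stored as 8-bit integers plus a scale and zero point. A minimal sketch of the common affine int8 scheme follows; the specific scale and zero-point values are made up for illustration, and the actual conversion is performed by the TensorFlow Lite tooling rather than by code like this.

```python
def quantize(values, scale, zero_point):
    """Map floats to int8 codes: q = round(v / scale) + zero_point, clamped."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(q - zero_point) * scale for q in codes]


codes = quantize([0.0, 1.0, -1.5, 100.0], scale=0.5, zero_point=0)
print(codes)                        # [0, 2, -3, 127]
print(dequantize(codes, 0.5, 0))    # [0.0, 1.0, -1.5, 63.5]
```

The last value shows the cost of the format: anything outside the representable range is clamped, so the scale must be chosen to cover the values a model actually produces.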

Google's **Coral** product line offers several hardware form factors containing the Edge TPU, including the Coral Dev Board (a single-board Linux computer), the Coral USB Accelerator (a USB-C dongle), and system-on-module variants for custom designs.

## Limitations and challenges

Despite their performance advantages on supported workloads, TPUs have several limitations:

- **Ecosystem constraints.** TPUs perform best with [TensorFlow](/wiki/tensorflow), [JAX](/wiki/jax), and frameworks compiled through XLA. Porting complex GPU-based workloads with custom CUDA kernels requires significant engineering effort. [PyTorch](/wiki/pytorch) support through PyTorch/XLA, while improving, is less mature than native CUDA support on GPUs.
- **Availability.** TPUs are available exclusively through Google Cloud. Organizations that require hardware portability across AWS, Azure, or on-premises environments cannot use TPUs.
- **Architectural specialization.** TPUs are not suitable for non-ML workloads such as graphics rendering, scientific simulations with irregular computation patterns, or general-purpose computing. The systolic array design provides no advantage for tasks that do not involve dense matrix multiplication or embedding lookups.
- **Vendor lock-in.** Adopting TPUs ties an organization to Google Cloud, creating switching costs if the organization later needs to migrate to another cloud provider.
- **Community and talent.** The GPU ecosystem, particularly [NVIDIA](/wiki/nvidia) CUDA, has a much larger developer community, more extensive documentation, and more learning resources. Fewer engineers have experience developing for TPUs compared to GPUs.

## Environmental impact

Google has published lifecycle analyses of TPU carbon efficiency. Over two generations (TPU v4 to Trillium), TPU hardware design improvements have led to a 3 times improvement in the carbon efficiency of AI workloads.[11] Ironwood demonstrates an approximately 3.7 times improvement in Compute Carbon Intensity compared to TPU v5p and is 30 times more power-efficient than the first Cloud TPU released in 2018.[9]

Operational electricity emissions account for more than 70% of a TPU's lifetime carbon footprint.[12] Google's data centers operate at a fleet-wide average Power Usage Effectiveness (PUE) of 1.09, meaning that about 92% of the energy drawn by a facility reaches the computing equipment, with the remainder going to cooling and other overhead.[12]

## See also

- [GPU computing](/wiki/gpu_computing)
- [Deep learning](/wiki/deep_learning)
- [Neural network](/wiki/neural_network)
- [TensorFlow](/wiki/tensorflow)
- [PyTorch](/wiki/pytorch)
- [Large language model](/wiki/large_language_model)
- [Quantization](/wiki/quantization)
- [Data parallelism](/wiki/data_parallelism)
- [Model parallelism](/wiki/model_parallelism)

## References

1. Jouppi, N.P., Young, C., Patil, N., Patterson, D., et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12, June 2017. arXiv:1704.04760.
2. Jouppi, N.P., Yoon, D.H., Ashcraft, M., et al. "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings." Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA), 2023. arXiv:2304.01433.
3. Google Cloud. "TPU Architecture." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm
4. Google Cloud. "Introduction to Cloud TPU." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/intro-to-tpu
5. Google Cloud. "TPU v6e (Trillium)." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/v6e
6. Google Cloud. "TPU7x (Ironwood)." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/tpu7x
7. Google Cloud. "Cloud TPU Multislice Overview." Cloud TPU Documentation. https://docs.cloud.google.com/tpu/docs/multislice-introduction
8. Google Cloud Blog. "BFloat16: The Secret to High Performance on Cloud TPUs." https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
9. Google Cloud Blog. "Ironwood: The First Google TPU for the Age of Inference." April 2025. https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/
10. Google Cloud Blog. "TPU Transformation: A Look Back at 10 Years of Our AI-Specialized Chips." https://cloud.google.com/transform/ai-specialized-chips-tpu-history-gen-ai
11. Google Cloud Blog. "TPUs Improved Carbon-Efficiency of AI Workloads by 3x." https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x
12. Patterson, D., et al. "The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink." arXiv:2204.05149, 2022.
13. OpenXLA Project. "A Deep Dive into SparseCore for Large Embedding Models." https://openxla.org/xla/sparsecore
14. PyTorch/XLA. "Learn About TPUs." PyTorch Documentation. https://docs.pytorch.org/xla/master/accelerators/tpu.html
