TPU v4
Last reviewed
Jun 3, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 1,461 words
TPU v4 is Google's fourth-generation Tensor Processing Unit, a custom application-specific integrated circuit (ASIC) built to accelerate machine learning workloads in Google's data centers and on Google Cloud. Sundar Pichai previewed it during the Google I/O keynote on May 18, 2021, describing the supporting infrastructure as the fastest system Google had ever deployed [1][2]. The chip is most often discussed for two features: a peak throughput of about 275 bfloat16 TFLOPS per chip, and a data-center-scale interconnect that uses optical circuit switches to wire 4,096 chips into a reconfigurable three-dimensional torus [3][4]. Google detailed the design in a peer-reviewed paper presented at the 50th International Symposium on Computer Architecture (ISCA) in 2023 [4][5].
Google began designing in-house accelerators because the cost of serving neural networks on general-purpose hardware threatened to outpace its data-center capacity. The first TPU, deployed internally around 2015, was an inference-only chip. The second and third generations (TPU v2 and v3) added training support, bfloat16 arithmetic, and water cooling, and they introduced the "pod," a tightly coupled cluster of chips joined by a dedicated inter-chip interconnect. A TPU v3 pod scaled to 1,024 chips arranged in a fixed two-dimensional torus.
TPU v4 continued that lineage but changed the network. In Google's own accounting it is the company's fifth domain-specific architecture for machine learning and its third generation of ML supercomputer [4]. The chip program is also tied to Google's research arm: the physical layout (floorplanning) of several TPU generations, including v4, drew on reinforcement-learning methods developed under Google Brain and later branded AlphaChip [6].
Each TPU v4 chip is fabricated on a 7 nm process, the same node as the contemporary Nvidia A100 GPU [5][7]. A chip contains two TensorCores; each TensorCore holds four matrix-multiply units (MXUs) plus a vector unit and a scalar unit, and runs at roughly 1,050 MHz [3][8]. Peak compute is 275 TFLOPS in bfloat16, and the INT8 peak is the same [3]. Each chip carries 32 GiB of high-bandwidth memory (HBM) delivering about 1,200 GB/s, and draws on the order of 170 W [3][8].
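The peak figure can be roughly reproduced from the unit counts and clock rate given above. The sketch below is a back-of-the-envelope check, not Google's published derivation. It assumes each MXU is a 128 x 128 systolic array completing one multiply-accumulate per cell per cycle, a detail this article does not state.

```python
# Back-of-the-envelope peak throughput for one TPU v4 chip.
# ASSUMPTION (not stated in the article): each MXU is a 128 x 128
# systolic array doing one multiply-accumulate per cell per cycle.
MXU_ROWS = 128
MXU_COLS = 128
FLOPS_PER_MAC = 2          # one multiply plus one add
TENSORCORES_PER_CHIP = 2   # from the article
MXUS_PER_TENSORCORE = 4    # from the article
CLOCK_HZ = 1.05e9          # ~1,050 MHz, from the article


def peak_tflops() -> float:
    """Return the estimated peak dense throughput of one chip in TFLOPS."""
    macs_per_cycle = (
        MXU_ROWS * MXU_COLS * MXUS_PER_TENSORCORE * TENSORCORES_PER_CHIP
    )
    return macs_per_cycle * FLOPS_PER_MAC * CLOCK_HZ / 1e12


if __name__ == "__main__":
    print(f"Estimated peak: {peak_tflops():.1f} TFLOPS")
```

Under that assumption the estimate lands within about 1 TFLOPS of the quoted 275 TFLOPS, which suggests the headline number is a straightforward product of array size, unit count, and clock.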
A distinctive addition is the SparseCore, a dataflow unit that accelerates the large embedding lookups common in recommendation and ranking models. Google reports that SparseCores speed up embedding-heavy models by roughly 5x to 7x while occupying only about 5 percent of die area and power [4][5].
| Specification | TPU v4 (per chip) |
|---|---|
| Process node | 7 nm |
| Peak compute | 275 TFLOPS (bf16 / INT8) |
| HBM capacity | 32 GiB |
| HBM bandwidth | ~1,200 GB/s |
| TensorCores | 2 (4 MXUs each) |
| Clock | ~1,050 MHz |
| Typical power | ~170 W |
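One way to read the compute and bandwidth rows together is through a simple roofline estimate, an analysis technique the cited sources do not necessarily apply in this form. The sketch below uses only the 275 TFLOPS and ~1,200 GB/s figures from the table. The function names are hypothetical.

```python
# Simple roofline estimate from the per-chip figures in the table.
PEAK_FLOPS = 275e12         # bf16 peak, FLOP/s
HBM_BYTES_PER_S = 1200e9    # ~1,200 GB/s HBM bandwidth


def ridge_point() -> float:
    """Arithmetic intensity (FLOPs per HBM byte) where compute becomes the limit."""
    return PEAK_FLOPS / HBM_BYTES_PER_S


def attainable_tflops(flops_per_byte: float) -> float:
    """Upper bound on throughput for a kernel with the given intensity."""
    return min(PEAK_FLOPS, flops_per_byte * HBM_BYTES_PER_S) / 1e12


if __name__ == "__main__":
    print(f"Ridge point: {ridge_point():.0f} FLOPs/byte")
    for intensity in (10, 100, 1000):
        print(intensity, "FLOPs/byte ->", attainable_tflops(intensity), "TFLOPS")
```

By this estimate a kernel needs on the order of 230 FLOPs per byte of HBM traffic before the MXUs, rather than memory bandwidth, become the bottleneck. The model ignores on-chip memory and interconnect effects.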
Google states that TPU v4 outperforms TPU v3 by about 2.1x and offers roughly 2.7x better performance per watt [5][7].
The interconnect is what gives the ISCA paper its title, "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" [4]. Instead of a fixed wiring pattern, TPU v4 routes the links between groups of chips through optical circuit switches (OCSes). The switches let the system dynamically reconfigure its topology, which Google says improves scale, availability, utilization, modularity, deployment, security, power, and performance [4].
The building block is a cube of 64 chips arranged 4x4x4. The OCS layer can splice these cubes together into arbitrary three-dimensional torus topologies, including a "twisted" torus, and can do so in roughly ten seconds [4][9]. That reconfigurability matters operationally: if a chip or cube fails, the switch can route around it without taking down a whole training job, and the same hardware can be carved into many differently shaped slices for different workloads.
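The torus structure described above can be illustrated with a minimal sketch. It models a plain (untwisted) three-dimensional torus, does not represent the optical switches themselves, and uses hypothetical helper names. It is meant only to show how wraparound links and cube counts follow from a slice shape.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(coord: Coord, shape: Coord) -> List[Coord]:
    """Return the chip reached over each of the six links of a plain 3D torus."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(coord)
            # Modulo arithmetic supplies the wraparound link on each axis.
            nxt[axis] = (nxt[axis] + step) % shape[axis]
            neighbors.append(tuple(nxt))
    return neighbors


def cubes_needed(shape: Coord, cube_edge: int = 4) -> int:
    """Count the 4x4x4 cubes required to tile a slice of the given shape."""
    if any(dim % cube_edge != 0 for dim in shape):
        raise ValueError("each dimension must be a multiple of the cube edge")
    cubes = 1
    for dim in shape:
        cubes *= dim // cube_edge
    return cubes


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
    print(cubes_needed((16, 16, 16)))  # a shape holding 4,096 chips
```

In this model a corner chip at (0, 0, 0) still has six neighbors because each axis wraps around, and a 16x16x16 slice is tiled by 64 cubes.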
A full TPU v4 pod, which Google also calls a SuperPod, connects 4,096 chips and reaches about 1.1 exaflops of peak bf16/INT8 compute, with all-reduce bandwidth around 1.1 PB/s and bisection bandwidth near 24 TB/s [3]. Four chips share each CPU host, so 64 chips and their 16 hosts fit in a single rack [4]. Google argues the optical approach is cheaper and lower power than alternatives such as InfiniBand: the OCSes and their optical components account for less than 5 percent of system cost and less than 3 percent of system power [4][5].
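The pod-level figures follow from the per-chip numbers by simple multiplication. The short sketch below reproduces that arithmetic using only values quoted in this article. The dictionary keys are illustrative names, not Google terminology.

```python
# Pod-level arithmetic from the per-chip and per-rack figures in the article.
CHIPS_PER_POD = 4096
CHIPS_PER_CUBE = 64          # one 4x4x4 cube per rack
CHIPS_PER_HOST = 4
PEAK_TFLOPS_PER_CHIP = 275


def pod_summary() -> dict:
    """Derive rack, host, and peak-compute totals for one full pod."""
    return {
        "cubes": CHIPS_PER_POD // CHIPS_PER_CUBE,
        "hosts": CHIPS_PER_POD // CHIPS_PER_HOST,
        "hosts_per_cube": CHIPS_PER_CUBE // CHIPS_PER_HOST,
        # 1 exaflop = 1e6 TFLOPS
        "peak_exaflops": CHIPS_PER_POD * PEAK_TFLOPS_PER_CHIP / 1e6,
    }


if __name__ == "__main__":
    for key, value in pod_summary().items():
        print(f"{key}: {value}")
```

The result is 64 cubes, 1,024 hosts, 16 hosts per cube, and a little over 1.1 exaflops of peak compute, matching the figures quoted above.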
The technology became available to customers gradually. After the I/O 2021 preview, Google announced a public-preview machine-learning hub built on Cloud TPU v4 pods in May 2022, run from an Oklahoma data center holding eight pods at up to about 9 exaflops of aggregate peak performance and operating at roughly 90 percent carbon-free energy [10][11]. Broad Google Cloud availability followed later that year.
TPU v4 was the workhorse behind several of Google's flagship models. The most cited example is the Pathways Language Model (PaLM), a 540-billion-parameter dense model trained in 2022. PaLM ran on 6,144 TPU v4 chips spread across two pods, with 3,072 chips and 768 hosts per pod, joined over the data-center network using a mix of data and model parallelism [12][13]. Google reported this as the largest TPU configuration used for training up to that point, made practical by the Pathways orchestration system [13].
The PaLM run is also a frequently quoted efficiency datapoint. Pipeline-free training across the 6,144 chips reached 46.2 percent model FLOPs utilization and 57.8 percent hardware FLOPs utilization, unusually high figures for a model of that size [12][13]. TPU v4 went on to underpin later Google work, including the PaLM 2 family.
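To make the utilization figure concrete, the sketch below relates model FLOPs utilization (MFU) to training throughput. It assumes the common approximation of 6 FLOPs per parameter per token for a dense transformer, which ignores attention FLOPs and is not necessarily the exact accounting used in the PaLM paper. The resulting throughput is therefore an estimate, not a reported measurement.

```python
# Rough relation between MFU and training throughput for the PaLM run.
PODS = 2
CHIPS_PER_POD_USED = 3072
CHIPS = PODS * CHIPS_PER_POD_USED   # 6,144 chips, from the article
PEAK_FLOPS_PER_CHIP = 275e12        # bf16 peak, from the article
PARAMS = 540e9                      # PaLM parameter count
REPORTED_MFU = 0.462


def train_flops_per_token(params: float) -> float:
    """ASSUMPTION: ~6 FLOPs per parameter per token (forward plus backward)."""
    return 6.0 * params


def mfu(tokens_per_s: float, params: float, chips: int, peak: float) -> float:
    """Model FLOPs utilization: useful model FLOP/s over aggregate peak FLOP/s."""
    return tokens_per_s * train_flops_per_token(params) / (chips * peak)


def implied_tokens_per_s(mfu_value: float, params: float, chips: int, peak: float) -> float:
    """Invert the MFU definition to estimate throughput."""
    return mfu_value * chips * peak / train_flops_per_token(params)


if __name__ == "__main__":
    rate = implied_tokens_per_s(REPORTED_MFU, PARAMS, CHIPS, PEAK_FLOPS_PER_CHIP)
    print(f"Implied throughput: {rate:,.0f} tokens/s")
```

Under these assumptions, 46.2 percent MFU on 6,144 chips corresponds to a throughput on the order of a few hundred thousand tokens per second.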
Google's headline comparison places TPU v4 against the Nvidia A100, the dominant training accelerator of the same generation and node. In the ISCA paper, a TPU v4 is reported as 1.2x to 1.7x faster than an A100 while using 1.3x to 1.9x less power on comparable workloads [4][7]. At the system level, Google states that TPU v4 running in its own energy-optimized data centers uses roughly 2x to 6x less energy and produces about 20x less carbon dioxide than contemporary domain-specific accelerators in typical on-premises data centers, a figure that depends heavily on the cleanliness of the local grid [5][7]. Independent and press summaries of the paper put TPU v4 about 5 to 87 percent faster than the A100, depending on the benchmark [8].
In the MLPerf Training 2.0 round published in mid-2022, Google's TPU v4 submissions set the fastest training times on five of the benchmarks, averaging about 1.42x faster than the next-fastest non-Google submission and roughly 1.5x faster than Google's own MLPerf 1.0 results [14][15]. Some of those runs used up to 3,456 chips, and two were scaled to full pods [14][15].
These claims come with caveats. The A100 comparison is between two 7 nm parts of the same era rather than against later GPUs such as the H100, and the carbon and energy figures reflect Google's data-center conditions; the company itself frames them as workload- and site-dependent [5][7].
TPU v4 was followed by the fifth generation, split into two products: TPU v5e, tuned for cost-efficient training and inference, and TPU v5p, the high-performance variant aimed at the largest models and positioned as competitive with the Nvidia H100. The sixth generation, Trillium (TPU v6e), followed, and Google later introduced Ironwood, an inference-focused generation that scales OCS-connected pods well beyond the 4,096 chips of the v4 era. Across these successors, the optical-circuit-switch fabric introduced with TPU v4 remained a defining element of Google's machine-learning supercomputer design [9].