TPU v4

Updated Jun 28, 2026

TPU v4 is Google's fourth-generation Tensor Processing Unit, a custom application-specific integrated circuit (ASIC) that accelerates machine learning workloads in Google's data centers and on Google Cloud. Each chip delivers about 275 bfloat16 TFLOPS of peak compute, and Google wires 4,096 of them into a single supercomputer, called a pod, that reaches roughly 1.1 exaflops of peak performance ^[3]^[4]. Its defining innovation is a data-center-scale interconnect built on optical circuit switches (OCSes), which let the system reconfigure its wiring into arbitrary three-dimensional torus topologies on the fly ^[4]. Sundar Pichai previewed the chip during the Google I/O keynote on May 18, 2021, describing the supporting infrastructure as the fastest system Google had ever deployed, and Cloud TPU v4 reached broad Google Cloud availability in 2022 ^[1]^[2]^[10]. Google detailed the design in a peer-reviewed paper presented at the 50th International Symposium on Computer Architecture (ISCA) in 2023 ^[4]^[5].

What is Google TPU v4?

Google began designing in-house accelerators because the cost of serving neural networks on general-purpose hardware threatened to outpace its data-center capacity. The first TPU, deployed internally around 2015, was an inference-only chip. The second and third generations (TPU v2 and v3) added training support, bfloat16 arithmetic, and liquid cooling, and they introduced the "pod," a tightly coupled cluster of chips joined by a dedicated inter-chip interconnect. A TPU v3 pod scaled to 1,024 chips arranged in a fixed two-dimensional torus, a quarter of the 4,096 chips in a TPU v4 pod.

TPU v4 continued that lineage but changed the network. In Google's own accounting it is the company's fifth domain-specific architecture for machine learning and its third generation of ML supercomputer ^[4]. The chip program is also tied to Google's research arm: the physical layout (floorplanning) of several TPU generations, including v4, drew on reinforcement-learning placement methods developed under Google Brain and later branded AlphaChip ^[6].

What are the specifications of TPU v4?

Each TPU v4 chip is fabricated on a 7 nm process, the same node as the contemporary Nvidia A100 GPU ^[5]^[7]. A chip contains two TensorCores; each TensorCore holds four matrix-multiply units (MXUs) plus a vector unit and a scalar unit, and runs at roughly 1,050 MHz ^[3]^[8]. Peak compute is 275 TFLOPS in bfloat16, and the same peak figure applies to INT8 ^[3]. Each chip carries 32 GiB of high-bandwidth memory (HBM) delivering about 1,200 GB/s, and draws on the order of 170 W ^[3]^[8].

A distinctive addition is the SparseCore, a dataflow unit that accelerates the large embedding lookups common in recommendation and ranking models. As the ISCA paper puts it, "Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x-7x yet use only 5% of die area and power" ^[4]^[5].

Specification	TPU v4 (per chip)
Process node	7 nm
Peak compute	275 TFLOPS (bf16 / INT8)
HBM capacity	32 GiB
HBM bandwidth	~1,200 GB/s
TensorCores	2 (4 MXUs each)
Clock	~1,050 MHz
TDP	~170 W

Google states that TPU v4 outperforms TPU v3 by about 2.1x and offers roughly 2.7x better performance per watt ^[5]^[7].
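
The per-chip figures above can be combined into a simple roofline estimate. The sketch below is illustrative only: it uses the peak compute and HBM bandwidth quoted in this article, and the function names and the example intensity of 100 FLOPs per byte are assumptions introduced here, not values from Google's documentation.

```python
# Per-chip peak figures as quoted in this article (peak, not measured).
PEAK_FLOPS = 275e12        # 275 TFLOPS in bfloat16
HBM_BYTES_PER_S = 1200e9   # ~1,200 GB/s of HBM bandwidth


def ridge_point(peak_flops: float, mem_bytes_per_s: float) -> float:
    """FLOPs per byte of HBM traffic at which a kernel stops being memory-bound."""
    return peak_flops / mem_bytes_per_s


def attainable_flops(intensity: float) -> float:
    """Basic roofline: the lower of the compute roof and bandwidth * intensity."""
    return min(PEAK_FLOPS, HBM_BYTES_PER_S * intensity)


ridge = ridge_point(PEAK_FLOPS, HBM_BYTES_PER_S)
print(f"ridge point: about {ridge:.0f} FLOPs per byte")

# Hypothetical kernel doing 100 FLOPs per byte moved: bandwidth-limited.
print(f"attainable at 100 FLOPs/byte: {attainable_flops(100.0) / 1e12:.0f} TFLOPS")
```

Under these assumptions a kernel needs roughly 229 FLOPs per byte of HBM traffic before the 275 TFLOPS peak becomes reachable. Real kernels also depend on on-chip memory and interconnect traffic, which this model ignores.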

What are optical circuit switches (OCS) on TPU v4?

The interconnect is what gives the ISCA paper its title, "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" ^[4]. Instead of a fixed wiring pattern, TPU v4 routes the links between groups of chips through optical circuit switches (OCSes). The switches let the system dynamically reconfigure its topology, which the paper says serves "to improve scale, availability, utilization, modularity, deployment, security, power, and performance," and they let "users pick a twisted 3D torus topology if desired" ^[4].

The building block is a cube of 64 chips arranged 4x4x4. The OCS layer can splice these cubes together into arbitrary three-dimensional torus topologies, including a "twisted" torus, and can do so in roughly ten seconds ^[4]^[9]. That reconfigurability matters operationally: if a chip or cube fails, the switch can route around it without taking down a whole training job, and the same hardware can be carved into many differently shaped slices for different workloads.
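
The wraparound structure of a torus is easy to see in a few lines of code. The following is a minimal sketch of a regular (untwisted) 3D torus using the 4x4x4 cube described above; the function name is hypothetical, and the sketch models neither the twisted variant nor how individual links map onto the optical switches.

```python
CUBE_SHAPE = (4, 4, 4)   # the 64-chip building block described above
POD_CHIPS = 4096         # chips in a full pod


def torus_neighbors(coord, shape):
    """Return the six neighbors of `coord` in a 3D torus of the given shape.

    Each axis wraps around, so a chip on a face of the block is linked
    to the chip on the opposite face.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + step) % shape[axis]
            neighbors.append(tuple(nxt))
    return neighbors


chips_per_cube = CUBE_SHAPE[0] * CUBE_SHAPE[1] * CUBE_SHAPE[2]
cubes_per_pod = POD_CHIPS // chips_per_cube

# A corner chip still has six neighbors because every axis wraps.
print(torus_neighbors((0, 0, 0), CUBE_SHAPE))
print(chips_per_cube, "chips per cube;", cubes_per_pod, "cubes per pod")
```

The same function works for any three-dimensional shape, which is the point of the reconfigurable fabric: only the `shape` argument changes when cubes are spliced into a larger slice.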

Google argues the optical approach is cheaper and lower power than alternatives such as InfiniBand. The paper states plainly that the switches are "much cheaper, lower power, and faster than Infiniband," and that "OCSes and underlying optical components are <5% of system cost and <3% of system power" ^[4]^[5].

How fast is a TPU v4 pod?

A full TPU v4 pod, which Google also calls a SuperPod, connects 4,096 chips and reaches about 1.1 exaflops of peak bf16/INT8 compute, with all-reduce bandwidth around 1.1 PB/s and bisection bandwidth near 24 TB/s ^[3]. Four chips share each CPU host, so 64 chips and their 16 hosts fit in a single rack ^[4].
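
The pod-level numbers follow from the per-chip ones by straightforward arithmetic. The short sketch below reproduces them from the figures quoted in this article. The rack count is a derived value that assumes every rack is fully populated with 64 chips, and the variable names are illustrative.

```python
CHIPS_PER_POD = 4096
PEAK_TFLOPS_PER_CHIP = 275   # bf16 / INT8 peak per chip
CHIPS_PER_HOST = 4
CHIPS_PER_RACK = 64

# 1 exaflop = 1e6 teraflops
pod_exaflops = CHIPS_PER_POD * PEAK_TFLOPS_PER_CHIP / 1e6
hosts_per_pod = CHIPS_PER_POD // CHIPS_PER_HOST
hosts_per_rack = CHIPS_PER_RACK // CHIPS_PER_HOST
racks_per_pod = CHIPS_PER_POD // CHIPS_PER_RACK  # assumes full racks

print(f"peak per pod: {pod_exaflops:.2f} exaflops")
print(f"hosts per pod: {hosts_per_pod}, hosts per rack: {hosts_per_rack}")
print(f"racks per pod: {racks_per_pod}")
```

The product comes to about 1.13 exaflops, which matches the rounded 1.1 exaflops figure.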

The technology became available to customers gradually. After the I/O 2021 preview, Google announced a public-preview machine-learning hub built on Cloud TPU v4 pods in May 2022, run from an Oklahoma data center holding eight pods at up to about 9 exaflops of aggregate peak performance and operating at roughly 90 percent carbon-free energy ^[10]^[11]. Broad Google Cloud availability followed later that year, with early access granted to research teams including Cohere, LG AI Research, Meta AI, and Salesforce Research ^[10].

What models were trained on TPU v4?

TPU v4 was the workhorse behind several of Google's flagship models. The most cited example is the Pathways Language Model (PaLM), a 540-billion-parameter dense model trained in 2022. PaLM ran on 6,144 TPU v4 chips spread across two pods, with 3,072 chips and 768 hosts per pod, joined over the data-center network using a mix of data and model parallelism ^[12]^[13]. Google reported this as the largest TPU configuration used for training up to that point, made practical by the Pathways orchestration system ^[13].

The PaLM run is also a frequently quoted efficiency datapoint. Pipeline-free training across the 6,144 chips reached 46.2 percent model FLOPs utilization and 57.8 percent hardware FLOPs utilization, unusually high figures for a model of that size ^[12]^[13]. TPU v4 went on to underpin later Google work, including the PaLM 2 family.
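
To make the two utilization figures concrete, the sketch below converts them into sustained throughput using the chip count and per-chip peak quoted in this article. It assumes the usual reading of the terms: model FLOPs utilization counts only the arithmetic the model mathematically requires, while hardware FLOPs utilization also counts recomputed work such as rematerialized activations. The variable names are illustrative.

```python
CHIPS = 6144
PEAK_FLOPS_PER_CHIP = 275e12   # bf16 peak per chip, as quoted in this article
MFU = 0.462                    # model FLOPs utilization
HFU = 0.578                    # hardware FLOPs utilization

peak_flops = CHIPS * PEAK_FLOPS_PER_CHIP

# Sustained rates implied by the reported utilization figures.
useful_flops = MFU * peak_flops      # work the model itself requires
executed_flops = HFU * peak_flops    # includes recomputed operations

# Extra arithmetic executed per unit of useful model arithmetic.
overhead_ratio = HFU / MFU

print(f"aggregate peak: {peak_flops / 1e18:.2f} exaflops")
print(f"useful model FLOP/s: {useful_flops / 1e18:.2f} exaflops")
print(f"executed FLOP/s: {executed_flops / 1e18:.2f} exaflops")
print(f"executed / useful: {overhead_ratio:.2f}")
```

On these assumptions the run sustained roughly 0.78 exaflops of useful work against an aggregate peak of about 1.69 exaflops, with the hardware executing about 25 percent more arithmetic than the model strictly needs.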

How does TPU v4 compare to the Nvidia A100?

Google's headline comparison places TPU v4 against the Nvidia A100, the dominant training accelerator of the same generation and node. In the ISCA paper, a TPU v4 is reported as 1.2x to 1.7x faster than an A100 while using 1.3x to 1.9x less power on comparable workloads ^[4]^[7]. At the system level, Google states that a 4,096-chip TPU v4 supercomputer uses roughly 2x to 6x less energy and produces about 20x less carbon dioxide than contemporary rival systems running in typical data centers, a figure that depends heavily on the cleanliness of the local grid ^[5]^[7]. Independent and press summaries of the paper note a range of about 5 to 87 percent faster than the A100 depending on the benchmark ^[8].

In the MLPerf Training 2.0 round published in mid-2022, Google's TPU v4 submissions set the fastest training times on five of the benchmarks, averaging about 1.42x faster than the next-fastest non-Google submission and roughly 1.5x faster than Google's own MLPerf 1.0 results ^[14]^[15]. Some of those runs used up to 3,456 chips, and two were scaled to full pods ^[14]^[15].

These claims are not without caveats. The A100 comparison is between two 7 nm parts of the same era rather than against later GPUs such as the H100, and the carbon and energy figures reflect Google's data-center conditions; the company itself frames them as workload- and site-dependent ^[5]^[7].

What came after TPU v4?

TPU v4 was followed by the fifth generation, split into two products: TPU v5e, tuned for cost-efficient training and inference, and TPU v5p, the high-performance variant aimed at the largest models and positioned as competitive with the Nvidia H100. The sixth generation, Trillium (TPU v6e), followed, and Google later introduced Ironwood, an inference-focused generation that scales OCS-connected pods well beyond the 4,096 chips of the v4 era. Across these successors, the optical-circuit-switch fabric introduced with TPU v4 remained a defining element of Google's machine-learning supercomputer design ^[9].

References

1. HPCwire. "Google Launches TPU v4 AI Chips." 20 May 2021. https://www.hpcwire.com/2021/05/20/google-launches-tpu-v4-ai-chips/
2. TechCrunch. "Google launches the next generation of its custom AI chips." 18 May 2021. https://techcrunch.com/2021/05/18/google-launches-the-next-generation-of-its-custom-ai-chips/
3. Google Cloud. "TPU v4 (documentation)." https://docs.cloud.google.com/tpu/docs/v4
4. Jouppi, N., Kurian, G., Li, S., et al. "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings." Proceedings of ISCA 2023. https://dl.acm.org/doi/10.1145/3579371.3589350
5. Jouppi, N., et al. "TPU v4: An Optically Reconfigurable Supercomputer.." arXiv:2304.01433. https://arxiv.org/abs/2304.01433
6. Google DeepMind. "How AlphaChip transformed computer chip design." 26 September 2024. https://deepmind.google/blog/how-alphachip-transformed-computer-chip-design/
7. The Register. "Google boffins reveal tech details of TPU v4 datacenter rigs." 6 April 2023. https://www.theregister.com/2023/04/06/google_tpuv4_hardware_nvidia/
8. "Tensor Processing Unit." Wikipedia. https://en.wikipedia.org/wiki/Tensor_Processing_Unit
9. FiberMall. "Unveiling Google's TPU Architecture: OCS Optical Circuit Switching." https://www.fibermall.com/blog/unveiling-google-tpu-architecture.htm
10. TechCrunch. "Google launches a 9 exaflop cluster of Cloud TPU v4 pods into public preview." 11 May 2022. https://techcrunch.com/2022/05/11/google-launches-a-9-exaflop-cluster-of-cloud-tpu-v4-pods-into-public-preview/
11. HPCwire. "Google Cloud's New TPU v4 ML Hub Packs 9 Exaflops of AI." 16 May 2022. https://www.hpcwire.com/2022/05/16/google-clouds-new-tpu-v4-ml-hub-packs-9-exaflops-of-ai/
12. Google. "Benchmarking FLOPs utilization on TPU v4." https://services.google.com/fh/files/blogs/tpu_v4_benchmarking.pdf
13. Chowdhery, A., Narang, S., Devlin, J., et al. "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. https://arxiv.org/abs/2204.02311
14. Google Cloud Blog. "Cloud TPU v4 MLPerf 2.0 results." https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpu-v4-mlperf-2-0-results
15. HPCwire. "The Mainstreaming of MLPerf? Nvidia Dominates Training v2.0 but Challengers Are Rising." 29 June 2022. https://www.hpcwire.com/2022/06/29/the-mainstreaming-of-mlperf-nvidia-dominates-training-v2-0-but-challengers-are-rising/
