TPU v5p
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,634 words
TPU v5p (Cloud TPU v5p) is the fifth-generation, performance-tier Tensor Processing Unit custom AI accelerator developed by Google and offered through Google Cloud. Announced on December 6, 2023, alongside the launch of the Gemini family of foundation models, the v5p was positioned as Google's "most powerful, scalable, and flexible AI accelerator to date," targeting the training of frontier-class large language models, mixture-of-experts systems, and dense embedding workloads.[^1][^2] The "p" suffix denotes the performance variant of the fifth generation, complementing the earlier efficiency-tier TPU v5e introduced in August 2023.
Each TPU v5p chip delivers a peak throughput of 459 TFLOPS in bfloat16 and 918 TOPS at INT8 precision (sometimes reported as TFLOPS), paired with 95 GB of HBM3 high-bandwidth memory at roughly 2.76 TB/s.[^1][^3][^4] A full v5p "pod" connects 8,960 chips through a reconfigurable 3D-torus inter-chip interconnect at 4,800 Gbps per chip. Compared with its predecessor, the TPU v4, Google described this as more than double the FLOPS per chip, triple the HBM capacity, and roughly four times the total FLOPS available per pod.[^1][^2] Google has stated that the chip can train large LLMs up to 2.8 times faster than v4 and, owing to its second-generation SparseCores, embedding-dense models up to 1.9 times faster.[^1][^2]
Cloud TPU v5p entered limited "early access" in December 2023 and reached general availability at Google Cloud Next on April 9, 2024.[^5][^6] The chip became the workhorse for training Google's Gemini models and was prominently adopted by external customers including Anthropic, Apple (for the on-device tier of Apple Foundation Models), Salesforce, Lightricks, and Hugging Face.[^7][^8][^9] In May 2024, Google previewed its successor, the sixth-generation TPU code-named Trillium (TPU v6e), and in April 2025 it unveiled the seventh generation, Ironwood, which reached general availability in November 2025. These launches left v5p as a transitional but enormously influential platform in the modern AI-accelerator landscape.[^10][^11]
The TPU program was started inside Google around 2013 to address the explosive compute demand of deep-learning workloads such as Google Brain-derived voice and translation systems. The first-generation TPU, deployed in 2015 and publicly disclosed in 2016, was an inference-only ASIC; subsequent generations added training capability and high-bandwidth memory.[^12] The architecture is anchored by a systolic-array matrix multiply unit (MXU), sized at 128×128 multiply-accumulators in the v2 through v5 generations, surrounded by vector and scalar units within "TensorCores."[^4][^12]
By the mid-2020s the lineage looked roughly as follows:
| Generation | Year | Role |
|---|---|---|
| TPU v1 | 2015–16 | Inference (8-bit) |
| TPU v2 | 2017 | First-gen training |
| TPU v3 | 2018 | Liquid-cooled training |
| TPU v4 | 2021 (public 2022) | First optical-switched pods, 4,096 chips |
| TPU v5e | Aug 2023 | Cost-efficient inference/training |
| TPU v5p | Dec 2023 | Performance-tier training |
| Trillium (TPU v6e) | 2024 | 4.7× per-chip compute over v5e |
| Ironwood (TPU v7) | 2025 | Inference-era flagship |
TPU v4 introduced Google's now-signature use of optical circuit switches (OCS) to dynamically reconfigure 3D-torus inter-chip topologies, an innovation that v5p inherited and scaled.[^13][^14]
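Google's fifth generation diverged into two SKUs serving different market segments: the efficiency-tier TPU v5e, introduced in August 2023 for cost-efficient inference and training, and the performance-tier TPU v5p, introduced in December 2023 for large-scale training.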
This "two-tier" strategy, established with v5, has continued in subsequent generations and is reflected in the naming conventions: Trillium is the v6e efficiency variant, and no performance-tier v6p sibling was separately marketed before Ironwood.[^11]
The v5p was unveiled in a joint announcement on December 6, 2023, alongside the launch of Gemini 1.0 (Ultra, Pro, and Nano) and the broader AI Hypercomputer architecture from Google Cloud — a phrase Google used to describe its integrated stack of accelerators, networking, software, and consumption models.[^1][^2][^16] In the post announcing the chip, Amin Vahdat, VP/GM of Machine Learning, Systems, and Cloud AI at Google, wrote that v5p "is our most powerful, scalable, and flexible AI accelerator to date" and that "Gemini, Google's most capable and general AI model, was trained on, and is served, using TPUs."[^1] Sundar Pichai framed the v5p as the hardware foundation for Google DeepMind's Gemini ambitions and for Google Cloud customers' "next wave" of model training.[^16]
A TPU v5p chip is a custom Google-designed ASIC built around two TensorCores, each of which contains four 128×128 Matrix Multiply Units (MXUs) implemented as systolic arrays, a vector unit, and a scalar unit.[^4][^17] Across the two TensorCores, the chip therefore carries eight MXUs in total — twice the per-chip MXU count of TPU v5e, which uses a single TensorCore. The v5p preserves the 128×128 MXU dimension established with v2 (Trillium would later expand the MXU to 256×256).[^17]
The reported peak per-chip throughput is 459 TFLOPS in bfloat16 and 918 TFLOPS/TOPS at INT8.[^1][^3][^4] The clock speed has been documented at approximately 1,750 MHz.[^4] Google has not publicly disclosed the process node, transistor count, or die area for the v5p chip; Wikipedia's TPU table lists these fields as undisclosed.[^4]
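The quoted bfloat16 peak can be roughly reproduced from the MXU layout and clock speed given above. The sketch below assumes one multiply-accumulate per MXU cell per cycle and counts each multiply-accumulate as two floating-point operations; both are common conventions rather than figures Google has published for v5p, and the function name is illustrative.

```python
# Back-of-envelope check of the quoted peak bf16 throughput.
# Assumptions (not from Google's documentation): every MXU cell completes one
# multiply-accumulate (MAC) per clock cycle, and one MAC counts as 2 FLOPs.

TENSORCORES_PER_CHIP = 2
MXUS_PER_TENSORCORE = 4
MXU_DIM = 128                 # 128x128 systolic array
CLOCK_HZ = 1_750_000_000      # approximately 1,750 MHz
FLOPS_PER_MAC = 2             # one multiply plus one add


def peak_tflops(tensorcores, mxus_per_core, mxu_dim, clock_hz):
    """Estimate peak dense matrix throughput in TFLOPS."""
    macs_per_cycle = tensorcores * mxus_per_core * mxu_dim * mxu_dim
    return macs_per_cycle * FLOPS_PER_MAC * clock_hz / 1e12


estimate = peak_tflops(TENSORCORES_PER_CHIP, MXUS_PER_TENSORCORE, MXU_DIM, CLOCK_HZ)
print(f"Estimated peak: {estimate:.2f} TFLOPS")
```

Under these assumptions the estimate comes to about 458.75 TFLOPS, in line with the quoted 459 TFLOPS.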
Each v5p chip integrates four second-generation SparseCores — small dataflow processors specialized for embedding and sparse-tensor operations, particularly important for recommendation systems and embedding-dense workloads.[^17] Google has described SparseCores as occupying roughly 5% of chip die area and power while delivering 5–7× speedups on the operations they target versus equivalent code on the MXU or vector unit.[^17] The second-generation SparseCores in v5p underpin Google's claim that v5p trains embedding-dense models 1.9× faster than v4.[^1][^2]
A v5p TPU node exposes four chips per host (machine type ct5p-hightpu-4t), pairing them with 208 vCPUs (104 usable per NUMA node), 448 GB of host RAM, two NUMA nodes, and a 200 Gbps host NIC for data-center networking.[^3] Each chip is connected to the host through PCIe and to its neighbors over the dedicated inter-chip interconnect, with 50 Gbps per chip reserved for data-center-network (DCN) traffic that crosses pod boundaries.[^3]
Each v5p chip carries 95 GB of HBM3 memory at approximately 2,765 GB/s of bandwidth (2.76 TB/s).[^3][^4] This is three times the HBM capacity of the 32 GB present on TPU v4 chips and represents a nearly 2.3× increase in HBM bandwidth.[^1][^4] In real-world workloads, Lightricks reported that the "ample memory capacity" of v5p allowed it to fit an entire generative text-to-video model on a single chip without splitting across processes, materially accelerating each training cycle.[^1]
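One way to read the compute and memory figures together is the roofline model, a generic performance model that is not part of Google's v5p material. The sketch below uses only the per-chip peak and HBM bandwidth quoted in this article; the function names are illustrative, and real kernels will usually fall below these bounds.

```python
# Roofline-style estimate using the per-chip figures quoted in this article.
# The roofline model is a generic analysis tool; applying it here is an
# illustration, not a Google-published methodology.

PEAK_BF16_FLOPS = 459e12      # FLOP/s per chip
HBM_BANDWIDTH_BPS = 2.765e12  # bytes/s per chip (about 2,765 GB/s)


def ridge_point(peak_flops, bandwidth):
    """FLOPs per byte above which a kernel is no longer bandwidth-bound."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops=PEAK_BF16_FLOPS,
                     bandwidth=HBM_BANDWIDTH_BPS):
    """Upper bound on FLOP/s for a kernel with the given FLOPs-per-byte intensity."""
    return min(peak_flops, intensity * bandwidth)


ridge = ridge_point(PEAK_BF16_FLOPS, HBM_BANDWIDTH_BPS)
print(f"Ridge point: {ridge:.0f} FLOPs/byte")
print(f"Bound at 10 FLOPs/byte: {attainable_flops(10) / 1e12:.2f} TFLOPS")
```

On these numbers, a kernel needs on the order of 166 FLOPs per byte of HBM traffic before the matrix units, rather than memory bandwidth, become the limiting factor.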
Inter-chip interconnect (ICI) is Google's high-bandwidth, low-latency proprietary fabric that wires TPU chips into 3D-torus configurations. On v5p, the ICI delivers 4,800 Gbps per chip, which is 600 GB/s; Google's documentation cites 1,200 GBps bidirectional, which appears to count both directions. This is Google's highest per-chip ICI bandwidth in the v4/v5 era.[^1][^3] Each chip is connected to its six nearest neighbors, forming a 3D torus.[^3][^4]
Following the precedent set by TPU v4, the v5p pod uses Optical Circuit Switches (OCS) to dynamically wire chips into reconfigurable 3D-torus topologies. A v5p superpod is reported to require 48 OCS units controlling 13,824 optical ports to interconnect 8,960 chips in a 16×20×28 3D-torus configuration, with millisecond-scale switch reconfiguration and ICI-resiliency features that route around faulty optical links or switches.[^14][^18][^19]
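The "six nearest neighbors" property of a 3D torus can be illustrated with modular arithmetic on chip coordinates. This is a minimal sketch with a hypothetical helper name and coordinate scheme; actual ICI routing and OCS wiring are handled by Google's hardware and runtime and are not modeled here.

```python
# Minimal model of 3D-torus addressing. Coordinates and the helper name are
# illustrative; real ICI routing is handled by Google's hardware and runtime.

POD_SHAPE = (16, 20, 28)  # full-pod torus dimensions cited above


def torus_neighbors(coord, shape):
    """Return the six neighbors of `coord`, wrapping around each axis."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            nxt = list(coord)
            # The modulo implements the wraparound link that closes each ring.
            nxt[axis] = (nxt[axis] + step) % shape[axis]
            neighbors.append(tuple(nxt))
    return neighbors


total_chips = POD_SHAPE[0] * POD_SHAPE[1] * POD_SHAPE[2]
print(total_chips)
print(torus_neighbors((0, 0, 0), POD_SHAPE))
```

The chip at the corner coordinate (0, 0, 0) still has six neighbors because each axis wraps around, for example to (15, 0, 0) along the first axis. On an axis of length 2 the two directions would reach the same chip, so the six entries are distinct only when every dimension is at least 3.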
A full TPU v5p pod comprises 8,960 chips, making it 2.19× the size of a 4,096-chip TPU v4 pod and providing roughly 4× the per-pod aggregate FLOPS of v4 when accounting for the doubled per-chip throughput.[^1][^2] At pod scale, the quoted 459 TFLOPS per chip works out to approximately 4.1 exaFLOPS of peak BF16 compute per superpod, although some sources cite 4.45 exaFLOPS.[^17][^20]
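The pod-level ratios follow from simple multiplication. The sketch below uses only figures quoted in this article and takes Google's "doubled" per-chip claim at face value. The straight product of chip count and per-chip peak comes to about 4.11 exaFLOPS, so a cited figure of 4.45 exaFLOPS presumably rests on a different accounting.

```python
# Pod-level arithmetic from the figures quoted in this article.

V5P_CHIPS_PER_POD = 8_960
V4_CHIPS_PER_POD = 4_096
V5P_PEAK_BF16_TFLOPS = 459
CLAIMED_PER_CHIP_SPEEDUP = 2.0  # Google's "doubled" per-chip claim versus v4

pod_size_ratio = V5P_CHIPS_PER_POD / V4_CHIPS_PER_POD
pod_flops_ratio = pod_size_ratio * CLAIMED_PER_CHIP_SPEEDUP
pod_exaflops = V5P_CHIPS_PER_POD * V5P_PEAK_BF16_TFLOPS / 1e6  # TFLOPS -> exaFLOPS

print(f"Pod size ratio vs v4:  {pod_size_ratio:.2f}x")
print(f"Pod FLOPS ratio vs v4: {pod_flops_ratio:.1f}x")
print(f"Peak BF16 per pod:     {pod_exaflops:.2f} exaFLOPS")
```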
The maximum single-slice schedulable job (i.e., a topology connected fully via ICI without DCN hops) is 6,144 chips in a 16×16×24 configuration — 96 "cubes," where a cube is a 4×4×4 sub-volume.[^3][^4][^20] Larger workloads can be assembled via Google's Multislice software, which links multiple ICI-coupled slices over DCN; Multislice scales v5p workloads to 18,432 chips.[^21] OCS-based topology wrapping means even sub-pod slices receive full toroidal connectivity once they reach cube size.[^3]
Single-slice topology examples documented by Google include sizes from 2×2×1 (4 chips) up to 16×16×24 (6,144 chips), with intermediate shapes such as 8×8×8 (512 chips), 16×16×16 (4,096 chips), and 16×16×20 (5,120 chips) being commonly used for production training jobs.[^3]
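The chip counts for these shapes, and the corresponding cube and host counts, can be checked with a few lines of arithmetic. The sketch assumes 4×4×4 cubes and four chips per host, as described in this article; the helper name is illustrative and is not part of any Google tooling.

```python
# Chip, cube, and host counts for slice shapes mentioned in this article.
# Assumes 4x4x4 cubes (64 chips) and four chips per host, as described above.
from math import prod

CUBE_CHIPS = 4 * 4 * 4
CHIPS_PER_HOST = 4


def describe_slice(shape):
    """Return (chips, cubes, hosts) for a slice shape such as (16, 16, 24)."""
    chips = prod(shape)
    # Shapes that are not built from whole cubes are reported as 0 cubes.
    cubes = chips // CUBE_CHIPS if all(d % 4 == 0 for d in shape) else 0
    hosts = chips // CHIPS_PER_HOST
    return chips, cubes, hosts


for shape in [(2, 2, 1), (8, 8, 8), (16, 16, 16), (16, 16, 20), (16, 16, 24)]:
    chips, cubes, hosts = describe_slice(shape)
    label = "x".join(map(str, shape))
    print(f"{label:>9}: {chips:>5} chips, {cubes:>2} cubes, {hosts:>4} hosts")
```

For the largest single slice, 16×16×24, this gives 6,144 chips in 96 cubes spread across 1,536 hosts.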
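The v5p is supported across Google's open-source and proprietary AI software stack, including the JAX, PyTorch, and TensorFlow frameworks, the Multislice scaling software, and Google Kubernetes Engine (GKE).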
The compiler underpinning the stack is the open-source XLA compiler, which lowers JAX, PyTorch, and TensorFlow programs to TPU kernels.[^16]
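Google's launch positioning compared v5p to v4 with the following claims: more than 2× the FLOPS per chip, 3× the HBM capacity, up to 2.8× faster LLM training, up to 1.9× faster training of embedding-dense models, and roughly 4× the total FLOPS per pod.[^1][^2]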
Jeff Dean, then chief scientist of Google DeepMind and Google Research, said at launch that "Google DeepMind and Google Research have observed 2× speedups for LLM training workloads using TPU v5p chips compared to the performance on our TPU v4 generation. With the 2nd generation of SparseCores we also see significant improvement in the performance of embeddings-heavy workloads."[^1]
In MLCommons' MLPerf Training v4.0 (June 2024), Google submitted v5p results showing near-linear scaling (≈99.9% efficiency) on the GPT-3 175B pre-training task across slices ranging from 512 to 6,144 chips, the result of hardware/runtime/compiler/framework co-design.[^25]
In MLPerf Training v4.1 (November 2024), Google reported a 5.7% speedup on the GPT-3 175B benchmark at 6,144 accelerators versus its earlier v5p submission and disclosed Stable Diffusion training results. The same round disclosed that Trillium reduced training cost by up to 1.8× (45% lower) versus v5p while achieving the same validation accuracy. Convergence-scaling efficiency at the largest cluster sizes was reported to be comparable between Trillium and v5p.[^22]
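The "1.8×" and "45% lower" figures describe the same cost change in two ways. A minimal conversion, using a hypothetical helper name, shows how an "N times lower" ratio maps to a percentage reduction:

```python
def percent_reduction(times_lower):
    """Convert an 'N times lower' ratio into a percentage reduction."""
    return (1 - 1 / times_lower) * 100


print(f"{percent_reduction(1.8):.1f}% lower")
```

A 1.8× reduction corresponds to about 44.4%, close to the quoted 45%.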
Although Google has not published the exact mix of chips used for each Gemini variant, the v5p was developed in tight co-design with the Gemini program and was the most powerful TPU available at the time Gemini 1.0 was trained.[^1][^16][^26] Sundar Pichai's public framing of the Gemini 1.0 launch credited TPUs (including the just-announced v5p) for "training and serving" the new generation of models.[^16]
When v5p reached general availability in April 2024, Google published list pricing of $4.20 per chip-hour for on-demand v5p capacity in supported zones, compared with $3.22 per chip-hour for v4 and $1.20 per chip-hour for v5e.[^28] One- and three-year committed-use discounts brought effective list rates down to approximately $2.94 and $1.89 per chip-hour respectively.[^28][^24]
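The committed-use rates imply specific discount levels, and the per-chip-hour rate scales linearly to slice-level cost. The sketch below uses only the list rates quoted in this article; the 512-chip, 24-hour job is a hypothetical example, and actual billing may differ by zone, date, and contract.

```python
# List-price arithmetic from the per-chip-hour rates quoted above.
# Rates are the article's figures; actual billing may differ by zone and date.

ON_DEMAND = 4.20
ONE_YEAR_CUD = 2.94
THREE_YEAR_CUD = 1.89


def implied_discount_pct(rate, baseline=ON_DEMAND):
    """Percentage discount of `rate` relative to the on-demand baseline."""
    return (1 - rate / baseline) * 100


def slice_cost(chips, hours, rate):
    """Total list cost in dollars for `chips` chips running for `hours` hours."""
    return chips * hours * rate


print(f"1-year discount: {implied_discount_pct(ONE_YEAR_CUD):.0f}%")
print(f"3-year discount: {implied_discount_pct(THREE_YEAR_CUD):.0f}%")
print(f"512 chips for 24 h on demand: ${slice_cost(512, 24, ON_DEMAND):,.2f}")
```

The quoted rates correspond to discounts of about 30% and 55% off the on-demand price.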
v5p was first made available on December 6, 2023 through a private "request via your Google Cloud account manager" allocation process timed to the Gemini 1.0 launch.[^2] General availability was announced at Google Cloud Next 2024 in early April 2024, when Google also disclosed GKE-native v5p support and multi-host inferencing.[^6][^29]
At GA, v5p was offered through North-American zones (notably us-east5 in Columbus, Ohio); regional expansion proceeded throughout 2024.[^3]
The earliest and largest tenant of v5p capacity was Google itself, which used the chip to train its Gemini family of frontier models. Google has stated that "Gemini, Google's most capable and general AI model, was trained on, and is served, using TPUs," with v5p being the most powerful TPU available during the Gemini 1.0 development cycle and into the Gemini 1.5 generation.[^1][^16]
In July 2024, Apple published its first technical report on the Apple Foundation Models (AFM) underlying Apple Intelligence. The paper disclosed that the AFM-on-device model, a roughly 3-billion-parameter model targeting on-device inference on Apple Silicon, was pre-trained on a single slice of 2,048 TPU v5p chips, while the larger AFM-server model was pre-trained on 8,192 TPU v4 chips provisioned as eight 1,024-chip slices.[^23][^7] Training used the JAX-based AXLearn library and a core pre-training corpus of 6.3 trillion tokens, with a reported sustained model-FLOP utilization (MFU) of 52% on v5p; sequence lengths were extended to 8,192 and then 32,768 tokens in later pre-training stages. Apple's deliberate decision to use Google TPUs for on-device foundation-model training (rather than NVIDIA GPUs) was widely covered as one of the most significant external endorsements of the TPU platform.[^7][^23]
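Model-FLOP utilization is the ratio of the FLOP/s a training run usefully performs to the hardware's peak FLOP/s. The sketch below shows one common way to estimate it, using the rule of thumb of roughly 6 FLOPs per parameter per training token. That rule, the function names, and the example inputs are assumptions of this sketch; they are not the method or the throughput reported by Apple.

```python
# Generic MFU estimate. The 6 * parameters * tokens rule of thumb for training
# FLOPs is an assumption of this sketch, not a method stated in Apple's report.

PEAK_BF16_FLOPS_PER_CHIP = 459e12


def mfu(params, tokens_per_second, chips,
        peak_flops_per_chip=PEAK_BF16_FLOPS_PER_CHIP):
    """Model-FLOP utilization: useful training FLOP/s divided by peak FLOP/s."""
    achieved = 6 * params * tokens_per_second
    return achieved / (chips * peak_flops_per_chip)


def tokens_per_second_at(target_mfu, params, chips,
                         peak_flops_per_chip=PEAK_BF16_FLOPS_PER_CHIP):
    """Invert mfu(): the throughput that would correspond to a target utilization."""
    return target_mfu * chips * peak_flops_per_chip / (6 * params)


# Hypothetical inputs chosen to mirror the scale described above.
needed = tokens_per_second_at(0.52, params=3e9, chips=2048)
print(f"{needed:.3e} tokens/s")
print(f"{mfu(3e9, needed, 2048):.2f}")
```

The throughput printed here is an illustrative consequence of the assumed formula, not a measured or reported number.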
Anthropic has been a major Google Cloud TPU customer and trained Claude models using a diversified compute strategy spanning Google TPUs, AWS Trainium, and NVIDIA GPUs.[^30] Anthropic's TPU usage included v5p chips, and the company announced in late 2025 a landmark expansion of its TPU footprint to "well over a gigawatt" of capacity coming online in 2026, alongside a partnership with Broadcom for next-generation accelerator co-design.[^30][^31]
Salesforce announced at the v5p launch that it would use the chip and the AI Hypercomputer architecture to pre-train its proprietary foundation models for specialized production use cases. The company's quote on the launch blog highlighted both the 2× v4-to-v5p speedup and the seamless migration via JAX.[^1]
The Israeli generative-AI app developer Lightricks used v5p on GKE to train its LTXV text-to-video model, reporting a roughly 2.5× training speedup on v5p over v4 and citing the memory capacity as critical to fitting longer video sequences on a single chip.[^1][^27]
Hugging Face partnered with Google Cloud to make TPUs (including the v5 family) accessible to its user base via its Inference Endpoints and Spaces products, and developed the open-source optimum-tpu library to ease deployment of Hugging Face Transformers on Cloud TPUs.[^32]
Additional reported users of TPU v5p capacity through Google Cloud included generative-AI startups training image and video models, large-scale RL workloads, and academic groups (e.g., the Marin open-model project under Google's Open Source program).[^33]
Industry coverage at launch positioned v5p as Google's most credible response to NVIDIA's H100 GPU, particularly for large-scale training where the v5p's pod-level scale of 8,960 chips and 4,800 Gbps ICI bandwidth competed with NVIDIA's NVLink/InfiniBand-based clusters.[^4][^33] Analysts at SemiAnalysis, ServeTheHome, and others described the v5p as a major step in Google's vertically-integrated AI-infrastructure strategy, with the AI Hypercomputer framing emphasizing co-design across hardware, networking (optical circuit switching, Jupiter data-center network), and software (XLA, JAX, Pathways, Multislice).[^1][^14]
A noteworthy point in third-party analysis was the constraint that v5p was a Google-Cloud-exclusive device: unlike NVIDIA GPUs, TPUs are not sold for on-premises or multi-cloud deployments, which limited their addressable market but reinforced Google Cloud's strategic moat around the chip.[^33] By late 2024, the v5p's market position was being eclipsed by Trillium (TPU v6e), which delivered roughly 2× the per-chip performance and was advertised as 2× more power-efficient than v5p, with a 4.7× peak compute per chip uplift over v5e.[^10][^34]
Google previewed Trillium at Google I/O on May 14, 2024, and the chip reached general availability on December 11, 2024. Trillium is the v6e (efficiency-tier) successor in Google's TPU naming convention. It delivers approximately 4.7× the peak compute per chip vs. TPU v5e, doubles HBM and ICI bandwidth over v5e, and per Google delivers ~2× the per-chip performance of v5p while being ~2× more power-efficient. MLPerf Training v4.1 results showed Trillium training the GPT-3 175B benchmark at 1.8× lower cost than v5p for the same validation accuracy.[^10][^22][^34]
Trillium expanded the MXU from 128×128 to 256×256, increasing matrix-multiply throughput per cycle, while the per-pod scale dropped to 256 chips per ICI domain — relying more heavily on Multislice for very large jobs.[^17][^34]
Google unveiled Ironwood in April 2025 as the first TPU "for the age of inference" and brought it to GA on November 6, 2025. Ironwood delivers 4,614 FP8 TFLOPS per chip, 192 GB of HBM3e at 7.37 TB/s, and "superpod" scale of 9,216 chips producing 42.5 FP8 exaFLOPS per superpod. Google has cited a 10× peak performance improvement over TPU v5p and a 3.7× improvement in compute-carbon-intensity vs. v5p.[^11][^35][^36] Anthropic's late-2025 commitment to scale TPU usage to one million chips and "well over a gigawatt" was anchored on Ironwood capacity.[^30][^31]
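The superpod and per-chip comparisons above can be checked from the quoted figures. Note that the per-chip ratio below compares Ironwood's FP8 peak against v5p's bfloat16 peak, so it mixes precisions; the constants are the article's figures and the variable names are illustrative.

```python
# Ironwood versus v5p, using the peak figures quoted in this article.
# The per-chip ratio compares FP8 (Ironwood) with bf16 (v5p) peaks.

IRONWOOD_FP8_TFLOPS = 4_614
IRONWOOD_CHIPS_PER_SUPERPOD = 9_216
V5P_BF16_TFLOPS = 459

superpod_exaflops = IRONWOOD_CHIPS_PER_SUPERPOD * IRONWOOD_FP8_TFLOPS / 1e6
per_chip_ratio = IRONWOOD_FP8_TFLOPS / V5P_BF16_TFLOPS

print(f"Ironwood superpod: {superpod_exaflops:.1f} FP8 exaFLOPS")
print(f"Per-chip peak ratio vs v5p: {per_chip_ratio:.1f}x")
```

Both results line up with the quoted 42.5 exaFLOPS per superpod and the roughly 10× per-chip peak improvement.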
In Google's overall AI-accelerator roadmap as of the mid-2020s, v5p occupies the role of the first production-grade frontier-training TPU whose performance and pod-scale enabled the Gemini program and external customers' largest models. Trillium and Ironwood subsequently surpassed it in absolute performance, but the v5p remains in service through Google Cloud and continues to host significant training workloads, particularly where 95 GB HBM3 and 8,960-chip single-pod scale remain advantageous.[^11][^17]