TPU v5e
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,638 words
Cloud TPU v5e (sometimes rendered TPU v5e or, internally at Google, TPUv5e) is the fifth-generation, efficiency-optimized member of Google's Tensor Processing Unit (TPU) family. It was announced on August 29, 2023 at the Google Cloud Next '23 conference in San Francisco as part of a broader set of generative-AI infrastructure updates, then reached general availability on November 9, 2023.[^1][^2][^3] The chip was positioned by Google as "the most cost-efficient, versatile, and scalable Cloud TPU to date," targeting large-language-model inference and cost-sensitive training of models up to roughly 200 billion parameters rather than the absolute peak performance of its larger sibling, the TPU v5p, which was announced a little over three months later, in December 2023.[^1][^2][^4]
The "e" suffix in the product name is widely understood as standing for "efficient" (or, in some Google materials, "efficiency"), distinguishing the chip from the performance-tier "p" variant that completed the v5 family.[^1][^5] In SemiAnalysis's analysis, the same silicon is referred to internally as a "lite" variant of the full TPUv5 ("Viperfish") design, with deliberately halved high-bandwidth memory stacks, lower clock speeds, and a less aggressive networking topology in exchange for far better dollars-per-FLOP economics and easier deployment.[^6]
A single v5e chip delivers approximately 197 BF16 TFLOPS and 393 INT8 TOPS, paired with 16 GB of HBM2e at roughly 819 GB/s of bandwidth and four 400 Gbps inter-chip interconnect (ICI) ports.[^7][^6] A maximum v5e pod consists of 256 chips arranged in a flat 2D torus, providing more than 400 Tb/s of aggregate ICI bandwidth and roughly 100 PetaOps of INT8 throughput in a single tightly coupled slice.[^1][^7] Larger jobs are supported through Google's Multislice software, which uses the data-center network to stitch many v5e pods into training runs spanning tens of thousands of chips; in November 2023 Google ran a 50,944-chip Multislice job on v5e — at the time the world's largest publicly documented Cloud TPU training job.[^8][^9]
Google began designing in-house tensor accelerators in 2013 and first deployed TPU v1 internally in 2015 to accelerate inference for products such as Google Search and Google Translate.[^10] Subsequent generations (v2, v3, v4) added training support, bfloat16 arithmetic, high-bandwidth memory, and progressively larger pod topologies, culminating in TPU v4 — a liquid-cooled chip introduced in 2021 and made generally available in 2022 with up to 4,096 chips per 3D-torus pod connected by optical circuit switches.[^10] TPU v4 trained Google's PaLM and helped lay the infrastructure groundwork for the company's foundation-model program.[^1][^10]
By mid-2023, however, the generative-AI boom had created two distinct workload pressures inside Google Cloud. First, training of frontier dense language models was scaling well beyond a single v4 pod, demanding software that could federate many pods over the data-center network. Second, inference for products such as Bard and for third-party LLM serving needed accelerators with much better cost per query, lower system complexity, and air-cooled installability in a wider variety of data centers. Google's response was a two-chip "v5" family that explicitly split these workloads: a smaller, cheaper, air-cooled inference-and-cost-training chip (v5e), and a larger, liquid-cooled flagship for the largest frontier models (v5p).[^1][^4]
TPU v5e was first publicly previewed on August 29, 2023, at Google Cloud Next '23 in a joint blog post by Amin Vahdat (VP/GM ML, Systems, and Cloud AI) and Mark Lohmeyer (VP/GM Compute and ML Infrastructure), alongside the general-availability launch of the A3 NVIDIA H100 GPU VMs and a set of generative-AI software updates.[^1][^11][^5] HPCwire later reported that the v5-class silicon had a "controversial origin": elements of the design had been the subject of disputes inside Google's chip-design organization in the years before launch, although v5e itself shipped as a conventional Google-designed accelerator built in collaboration with Broadcom and manufactured at TSMC.[^12]
In Google's own marketing materials, TPU v5e is described as "the most cost-efficient" Cloud TPU at launch, with Mark Lohmeyer telling TechCrunch that v5e was "the most cost-efficient and accessible cloud TPU to date."[^3] Subsequent Google blog posts and third-party guides standardised on reading the "e" as "efficient" (or "efficiency"), to be contrasted with the "p" in TPU v5p, introduced as the performance-tier counterpart in December 2023.[^4][^5][^13]
That positioning translates into deliberate architectural trade-offs:
The result, in SemiAnalysis's framing, is a chip whose design optimizes for total cost of ownership — power, networking, deployment flexibility — rather than for the peak FLOPS-per-chip metric that dominates NVIDIA's product marketing.[^6]
Each TPU v5e chip contains a single TensorCore. The TensorCore houses four MXUs (matrix-multiply units), one vector unit, and one scalar unit. Each MXU is a 128×128 systolic array of multiply-accumulate units, capable of 16,384 multiply-accumulate operations per cycle — the same MXU dimensions used throughout the pre-v6 TPU line.[^7][^14][^6] Native supported numeric formats include bfloat16 (BF16) for training and INT8 for inference; Google later extended INT8 training support to v5e through its Accurate Quantized Training (AQT) software.[^7][^8]
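The MXU figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the usual convention that one multiply-accumulate counts as two floating-point operations, and the clock frequency it prints is merely what the quoted ~197 BF16 TFLOPS would imply under that assumption, not a figure published in the text.

```python
# Figures quoted in the article.
MXU_ROWS = 128
MXU_COLS = 128
MXUS_PER_TENSORCORE = 4
PEAK_BF16_FLOPS = 197e12  # ~197 TFLOPS per chip

# Multiply-accumulates one systolic array performs per cycle.
macs_per_mxu_per_cycle = MXU_ROWS * MXU_COLS

# All four MXUs in the single TensorCore working together.
macs_per_chip_per_cycle = macs_per_mxu_per_cycle * MXUS_PER_TENSORCORE

# Assumption: one multiply-accumulate is counted as two FLOPs.
flops_per_cycle = 2 * macs_per_chip_per_cycle

# Clock rate implied by the peak figure (a derived estimate, not a spec).
implied_clock_hz = PEAK_BF16_FLOPS / flops_per_cycle

print(macs_per_mxu_per_cycle)
print(flops_per_cycle)
print(f"{implied_clock_hz / 1e9:.2f} GHz")
```

Under these assumptions the peak figure corresponds to a clock of about 1.5 GHz. The INT8 figure (393 TOPS) is almost exactly twice the BF16 one, which is consistent with the same arrays processing two 8-bit operations per 16-bit slot, although the text does not state the mechanism.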
Per-chip peak throughput, as published in Google's Cloud TPU v5e documentation, is approximately 197 TFLOPS in BF16 and 393 TOPS in INT8.[^7]
SemiAnalysis estimates the v5e die at roughly 325 mm² and notes that, compared with the full ("Viperfish") TPUv5 silicon, v5e ships with reduced HBM stack count, lower clock speed, and a less aggressive networking topology to hit aggressive power and unit-cost targets.[^6]
Each v5e chip pairs the TensorCore with a single HBM2e stack providing 16 GB of high-bandwidth memory at 819 GB/s (often reported as 800 GiB/s in Google documentation).[^7][^6] Hosts in v5e systems each provide an additional 512 GiB of DRAM and are paired with eight chips per host in the largest configuration.[^7] Unlike TPU v5p, v5e does not include the SparseCore acceleration unit used for embedding-dense workloads such as recommendation systems.[^14][^4]
The 16 GB-per-chip memory ceiling is important to the chip's positioning: a single v5e chip can comfortably serve LLMs up to about 13 billion parameters in 8-bit precision; multi-chip configurations using the eight VM shapes scale up to 2-trillion-parameter inference workloads via tensor parallelism over ICI.[^15]
Each v5e chip has four ICI ports, each running at 400 Gbps per direction for an aggregate 1.6 Tb/s of bidirectional ICI bandwidth per chip.[^7][^6] Inside a maximum pod the chips form a flat 2D torus that scales to 256 chips, yielding more than 400 Tb/s of aggregate ICI bandwidth, 51.2 TB/s of all-reduce bandwidth, and 1.6 TB/s of bisection bandwidth across the pod.[^7][^1] Google deliberately omitted the optical circuit switches (OCS) used in TPU v4 pods, because at the v5e pod scale a static 2D torus is sufficient and cheaper.[^6]
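The pod-level bandwidth figures can be related to the per-chip numbers with a small model. The sketch below assumes the 256-chip pod is laid out as a 16×16 torus (the text only says "flat 2D torus") and that each of the four ICI ports connects to one wraparound neighbor. The function and variable names are hypothetical. Whether Google derives its published figures this way is not stated, but the arithmetic reproduces them.

```python
def torus_neighbors(x, y, width=16, height=16):
    """Return the four wraparound neighbors of chip (x, y) in a 2D torus."""
    return [
        ((x - 1) % width, y),   # left (wraps around at the edge)
        ((x + 1) % width, y),   # right
        (x, (y - 1) % height),  # up
        (x, (y + 1) % height),  # down
    ]


WIDTH = HEIGHT = 16           # assumed layout of the 256-chip pod
CHIPS = WIDTH * HEIGHT
PORTS_PER_CHIP = 4
GBPS_PER_PORT = 400           # gigabits per second, per direction

# Sum of all port bandwidth in the pod, in terabits per second.
aggregate_tbits_per_s = CHIPS * PORTS_PER_CHIP * GBPS_PER_PORT / 1000

# The same quantity expressed in terabytes per second (8 bits per byte).
aggregate_tbytes_per_s = aggregate_tbits_per_s / 8

# Splitting the torus into two 8x16 halves cuts two links per row:
# one at the cut itself and one at the wraparound edge.
links_cut = 2 * HEIGHT
bisection_tbytes_per_s = links_cut * GBPS_PER_PORT / 8 / 1000

print(torus_neighbors(0, 0))
print(aggregate_tbits_per_s, aggregate_tbytes_per_s, bisection_tbytes_per_s)
```

Under these assumptions the "more than 400 Tb/s" aggregate figure and the 51.2 TB/s all-reduce figure are the same quantity in different units, and the 1.6 TB/s bisection figure corresponds to the 32 links severed when the torus is cut in half.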
A v5e pod hosts up to 256 chips distributed across racks with eight chips per host server; the standard Google Cloud documentation describes three primary VM shapes:[^7]
| Machine type | Chips | vCPUs | RAM |
|---|---|---|---|
| ct5lp-hightpu-1t | 1 | 24 | 48 GB |
| ct5lp-hightpu-4t | 4 | 112 | 192 GB |
| ct5lp-hightpu-8t | 8 | 224 | 384 GB |
In addition to these single-host shapes, Google supports multi-host slice topologies ranging from 16 chips up through the full 256-chip pod, exposed to users as one of "eight different VM configurations" — terminology Google used in its initial v5e announcement to emphasise that customers could pick a slice size from one chip to more than 250 chips inside a single pod, with multi-host serving handled by frameworks such as Sax.[^1][^7]
SemiAnalysis reported that each v5e sled houses four TPUs sharing one host with a 64-core AMD x86 chip and a 100G NIC, with four dual-sided rack units of eight sleds each making up a full 256-chip pod.[^6] These low-level packaging details have not been confirmed by Google in primary documentation, but are consistent with the publicly disclosed ct5lp-hightpu-4t machine type and ICI topology.[^7]
The Cloud TPU v5e documentation distinguishes carefully between training and serving:[^7]
The 13-billion-parameter ceiling for single-chip serving is a direct consequence of the 16 GB HBM2e capacity: in INT8 a 13B-parameter dense model fits roughly within 13 GB of weight memory, leaving enough headroom for KV cache, activations, and serving runtime overhead. Larger models (Llama 2 70B, GPT-3 175B and so on) require tensor-parallel sharding across 4-, 8-, or 16-chip configurations.[^15]
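The sizing logic in this paragraph can be sketched as a weights-only estimate. In the example below, `SLICE_SIZES`, `weight_gb`, `smallest_slice`, and the 85% usable-memory fraction are all illustrative assumptions rather than values from Google's documentation. Real deployments also need to budget for KV cache and activations, which grow with batch size and context length.

```python
HBM_GB_PER_CHIP = 16
SLICE_SIZES = [1, 4, 8, 16]  # chip counts mentioned in the text


def weight_gb(params_billion, bytes_per_param=1):
    """Weight memory in GB: 1e9 parameters at N bytes each is N GB."""
    return params_billion * bytes_per_param


def smallest_slice(params_billion, bytes_per_param=1, usable_fraction=0.85):
    """Return the smallest chip count whose pooled HBM holds the weights.

    usable_fraction is a hypothetical allowance that reserves the rest of
    HBM for KV cache, activations, and runtime overhead.
    """
    needed = weight_gb(params_billion, bytes_per_param)
    for chips in SLICE_SIZES:
        if needed <= chips * HBM_GB_PER_CHIP * usable_fraction:
            return chips
    return None  # does not fit in the slice sizes considered here


print(smallest_slice(13))                      # 13B in INT8
print(smallest_slice(70))                      # 70B in INT8
print(smallest_slice(175))                     # 175B in INT8
print(smallest_slice(13, bytes_per_param=2))   # 13B in BF16
```

With these assumptions a 13B INT8 model fits on one chip, a 70B model needs the 8-chip shape, and a 175B model needs 16 chips, which is in line with the configurations named above.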
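At pod scale, a single 256-chip TPU v5e slice provides roughly 100 PetaOps of peak INT8 compute, more than 400 Tb/s of aggregate ICI bandwidth, 51.2 TB/s of all-reduce bandwidth, and 1.6 TB/s of bisection bandwidth.[^7][^1]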
Because a v5e pod (256 chips) is much smaller than a v5p pod (8,960 chips) or a v4 pod (4,096 chips), large training jobs frequently span many pods. Google addresses this through Cloud TPU Multislice, a stack first introduced in preview alongside v5e and now central to the v5/Trillium training story.[^16][^1]
Multislice is described by Google as a "full-stack performance-scaling technology" that enables a single training job to use multiple TPU slices within one pod or across many pods.[^16] Communication inside a slice runs over the high-speed ICI fabric, while gradients are reduced across slices over Google's Jupiter data-center network using a hierarchical collective scheme. The XLA compiler automatically generates the inter-slice DCN code, overlapping communication with computation and decomposing collectives such as all-reduce into within-slice reduction, cross-slice reduction over DCN, and a broadcast step.[^16]
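The three-step decomposition of all-reduce described above can be mimicked in plain Python. This is a toy model, not the code XLA generates: gradients are single numbers rather than tensors, the function name is hypothetical, and a production implementation would typically use bandwidth-optimal variants such as reduce-scatter followed by all-gather.

```python
def hierarchical_all_reduce(slices):
    """Sum one value per chip across all slices in three stages.

    `slices` is a list of slices; each slice is a list holding one
    gradient value per chip.
    """
    # Stage 1: reduce within each slice (would run over ICI).
    slice_sums = [sum(chip_values) for chip_values in slices]

    # Stage 2: reduce the per-slice partial sums across slices
    # (would run over the data-center network).
    global_sum = sum(slice_sums)

    # Stage 3: broadcast the result back to every chip in every slice.
    return [[global_sum for _ in chip_values] for chip_values in slices]


# Three slices of two chips each.
gradients = [[1, 2], [3, 4], [5, 6]]
reduced = hierarchical_all_reduce(gradients)
print(reduced)
```

The point of the hierarchy is that only one partial sum per slice has to cross the slower DCN links, while the bulk of the traffic stays on ICI.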
For TPU v5e specifically, Multislice exposes a 256-chip ICI domain per slice with ~25 Gbps per chip of DCN bandwidth, supporting data parallelism, fully sharded data parallelism (FSDP), tensor parallelism, and pipeline parallelism on top of the same JAX/XLA stack.[^16] In a November 2023 demonstration, Google ran a single distributed training job across 50,944 TPU v5e chips spanning 199 Cloud TPU v5e pods — claimed at the time as the world's largest publicly disclosed LLM training run on Cloud TPU. The run trained MaxText dense models from 16B to 128B parameters and reached 66.86 % BF16 Model FLOPs Utilization (MFU) on a single pod and approximately 53 % MFU end-to-end across the cluster using INT8 AQT quantized training.[^8] Peak observed throughput was reported as 5.32 exa-OP/s out of a theoretical 20 exa-OPs of INT8 compute.[^8]
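The scale figures in this paragraph can be cross-checked against the per-chip peaks quoted earlier. The short calculation below uses only numbers from the text. The closing observation about which peak the ~53% figure is measured against is an inference from the arithmetic, not something the source states.

```python
CHIPS_PER_POD = 256
PODS = 199
BF16_TFLOPS_PER_CHIP = 197
INT8_TOPS_PER_CHIP = 393
OBSERVED_EXA_OPS = 5.32

chips = CHIPS_PER_POD * PODS

# Tera (1e12) to exa (1e18) is a factor of 1e6.
peak_bf16_exaflops = chips * BF16_TFLOPS_PER_CHIP / 1e6
peak_int8_exaops = chips * INT8_TOPS_PER_CHIP / 1e6

# Observed throughput as a fraction of each theoretical peak.
fraction_of_bf16_peak = OBSERVED_EXA_OPS / peak_bf16_exaflops
fraction_of_int8_peak = OBSERVED_EXA_OPS / peak_int8_exaops

print(chips)
print(round(peak_bf16_exaflops, 2), round(peak_int8_exaops, 2))
print(round(fraction_of_bf16_peak, 3), round(fraction_of_int8_peak, 3))
```

The chip count and the 20 exa-OPs INT8 peak both check out. The observed 5.32 exa-OP/s is about 53% of the BF16 peak and about 27% of the INT8 peak, so the reported utilization appears to be expressed relative to the BF16 peak.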
This run also illustrated several Multislice-specific engineering optimizations: checkpoint broadcasting (where one slice loads a checkpoint and broadcasts it to the others, giving Google a reported 150× speedup over conventional loading), distributed data loading to avoid Cloud Storage bottlenecks at 64+ pod scale, and XLA-runtime optimizations to eliminate host/device transfer overhead.[^8]
Like other Cloud TPUs, v5e is targeted by Google's XLA/OpenXLA compiler and is accessible from the same set of high-level ML frameworks supported on prior generations:[^1][^7]
Cloud TPU v5e is also exposed through higher-level Google Cloud services. At the August 2023 launch, Google announced general availability of Cloud TPU in Google Cloud Kubernetes Engine (GKE) for v4 and v5e, training integration in Vertex AI, and support for v5e through partner tooling such as Ray and Slurm.[^1][^15] The "what you train is what you serve" pattern — using the same TPU shape and framework for both training and inference — is emphasised in Google's v5e inference documentation.[^15]
A characteristic example of the training-side stack used on v5e is the November 2023 50,944-chip run: it combined the XPK (Accelerated Processing Kit) ML cluster orchestrator on top of GKE with Jobset and Kueue capacity management, the MaxText JAX-based dense-LLM reference implementation, the AQT quantization library, Flax/Orbax/Optax for model authoring and checkpointing, and the OpenXLA compiler for code generation.[^8] On the parallelism side, the same job used pure data parallelism across pods over DCN combined with fully sharded data parallelism (FSDP) for the 16B, 32B, and 64B configurations, and a mix of FSDP plus tensor parallelism for the 128B configuration.[^8] The resulting code structure is small enough that adapting a training job from a single 256-chip slice to a 199-pod, 50,944-chip Multislice run does not require fundamentally rewriting model code — only changing the sharding spec passed to the JAX mesh_utils.create_hybrid_device_mesh helper.[^16]
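To make the "only the sharding spec changes" point concrete, the sketch below computes the two shape tuples that a hybrid mesh is built from: one describing parallelism inside a slice (over ICI) and one describing parallelism across slices (over DCN). The axis names, the helper `hybrid_mesh_shapes`, and the tensor-parallel degree of 4 are hypothetical choices for illustration and are not taken from the MaxText configuration. On real Multislice hardware, tuples like these would be passed as the per-slice and DCN mesh shapes to `mesh_utils.create_hybrid_device_mesh`. That call is omitted here so the example runs without TPUs.

```python
import math

# Hypothetical logical axis names for the mesh.
MESH_AXES = ("data", "fsdp", "tensor")


def hybrid_mesh_shapes(num_slices, chips_per_slice, tensor_parallel=1):
    """Return (ici_shape, dcn_shape) for data parallelism across slices
    and FSDP (optionally with tensor parallelism) inside each slice."""
    if chips_per_slice % tensor_parallel != 0:
        raise ValueError("tensor_parallel must divide chips_per_slice")
    # Inside a slice: no data axis, FSDP over the remaining chips.
    ici_shape = (1, chips_per_slice // tensor_parallel, tensor_parallel)
    # Across slices: pure data parallelism over the data-center network.
    dcn_shape = (num_slices, 1, 1)
    return ici_shape, dcn_shape


def total_devices(ici_shape, dcn_shape):
    """Number of chips covered by the combined mesh."""
    return math.prod(ici_shape) * math.prod(dcn_shape)


single_slice = hybrid_mesh_shapes(num_slices=1, chips_per_slice=256)
multislice = hybrid_mesh_shapes(num_slices=199, chips_per_slice=256)
with_tensor = hybrid_mesh_shapes(199, 256, tensor_parallel=4)

print(single_slice, total_devices(*single_slice))
print(multislice, total_devices(*multislice))
print(with_tensor, total_devices(*with_tensor))
```

Going from one slice to 199 changes only the DCN tuple, while the model code that refers to the named axes stays the same.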
Google's headline claims for TPU v5e relative to TPU v4, taken from its August 29, 2023 announcement and the November 9, 2023 general-availability post, are:[^1][^2]
A lower on-demand list price ($1.20 per chip-hour in us-west4 at launch versus $3.22/chip-hour for TPU v4).[^2]
In MLPerf-class benchmarks Google reported:
These claims are consistent with the broader analyst framing of v5e as a workload-specialized chip whose advantage is realised on a per-dollar rather than per-chip basis.[^17][^6]
Beyond MLPerf, Google and its partners published a number of LLM-serving benchmarks on v5e:
The Llama benchmark is a useful illustration of the v5e design philosophy: because dense LLM decoding is memory-bandwidth bound, the roughly 819 GB/s of HBM2e bandwidth per chip, together with the 16 GB capacity, becomes the limiting factor at small batch sizes, and quantization plus tensor parallelism over the 1.6 Tb/s ICI fabric are key to extracting throughput from the comparatively small per-chip compute envelope.[^20][^15]
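A rough roofline-style estimate shows why decoding is bandwidth bound on this chip. The sketch below assumes that generating one token at batch size 1 requires reading every weight from HBM once, and it ignores KV-cache reads, ICI communication, and any overlap inefficiency, so the results are upper bounds under those assumptions rather than benchmark predictions. The function name is hypothetical.

```python
HBM_BANDWIDTH_GB_PER_S = 819   # per chip, from the text
INT8_PEAK_TOPS = 393           # per chip, from the text


def decode_tokens_per_second_bound(params_billion, bytes_per_param=1, chips=1):
    """Upper bound on batch-1 decode rate if weight reads are the only cost.

    Assumes weights are sharded evenly so each chip streams its share
    of the model from its own HBM in parallel.
    """
    weight_gb = params_billion * bytes_per_param
    return chips * HBM_BANDWIDTH_GB_PER_S / weight_gb


# Arithmetic intensity (operations per byte) at which the chip moves
# from bandwidth bound to compute bound.
ridge_ops_per_byte = INT8_PEAK_TOPS * 1e12 / (HBM_BANDWIDTH_GB_PER_S * 1e9)

print(decode_tokens_per_second_bound(13))           # 13B INT8, one chip
print(decode_tokens_per_second_bound(70, chips=8))  # 70B INT8, eight chips
print(round(ridge_ops_per_byte))
```

Batch-1 decoding performs only about two operations per weight byte read, far below the ridge point of several hundred, which is why batching and quantization matter so much for throughput.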
Google's launch and follow-up materials cite a number of named v5e users:
Outside of Google's own published list, broader industry coverage names additional generative-AI tenants using TPUs more generally — including Anthropic, Salesforce, Midjourney, and Snap — although how much of each customer's workload runs specifically on v5e (vs. v4 or v5p) is not always disclosed.[^18][^4]
At launch, Cloud TPU v5e was offered in preview through Google account managers and a registration form, with general availability declared on November 9, 2023.[^2][^1] Public on-demand list pricing at GA was $1.20 per chip-hour in the us-west4 region, with 1- and 3-year committed-use discounts dropping the price substantially (third-party analyses noted 3-year commit pricing as low as roughly $0.55/chip-hour for v5e).[^2][^4][^18]
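The list prices quoted here translate into pod-level costs with simple arithmetic. The calculation below uses only figures from the text. The roughly $0.55 three-year commitment price is the third-party estimate cited above, not an official list price, and actual pricing varies by region and over time.

```python
V5E_ON_DEMAND = 1.20      # USD per chip-hour, us-west4 at GA
V4_ON_DEMAND = 3.22       # USD per chip-hour, as quoted for TPU v4
V5E_3YR_COMMIT = 0.55     # USD per chip-hour, approximate third-party figure
CHIPS_PER_POD = 256

# Cost of renting a full 256-chip pod for one hour at on-demand rates.
pod_hour_on_demand = CHIPS_PER_POD * V5E_ON_DEMAND

# How much more a v4 chip-hour costs than a v5e chip-hour.
v4_to_v5e_price_ratio = V4_ON_DEMAND / V5E_ON_DEMAND

# Fractional saving of the 3-year commitment relative to on-demand.
commit_discount = 1 - V5E_3YR_COMMIT / V5E_ON_DEMAND

print(f"${pod_hour_on_demand:.2f} per pod-hour")
print(f"{v4_to_v5e_price_ratio:.2f}x")
print(f"{commit_discount:.0%}")
```

A per-chip price ratio is not a performance-per-dollar ratio on its own, since the two chips differ in throughput and memory. It only frames the cost side of the comparison.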
Google also emphasised that Cloud TPU v5e was its first AI chip available outside the United States: at announcement, deployments were planned for the Netherlands (europe-west4 region) for EMEA customers and Singapore (asia-southeast1) for Asia-Pacific customers, complementing existing US installations.[^1] By GA, public list pricing for us-west4 was published on the Cloud TPU pricing page, while detailed per-region pricing for other locations was published on Google Cloud's standard SKU pricing pages.[^2]
TPU v5e and TPU v5p were the two halves of Google's v5 generation, addressing complementary workloads:[^4]
Both v5e and v5p shared the Multislice training and multi-host inference software stacks, allowing customers to mix-and-match efficiency and performance accelerators behind the same JAX/XLA programming model.[^4][^16]
The v5 family was succeeded by:
In retrospect, the launch of TPU v5e represented an important inflection in Google's accelerator strategy: rather than chasing a single, ever-larger flagship, the company committed to a multi-SKU TPU portfolio in which an "e"-class chip explicitly optimized for cost per query coexists with a "p"-class flagship for the largest frontier-training workloads. That bifurcation — and the Multislice software that made it economically viable — is the architectural pattern that has continued through the Trillium and Ironwood generations and that defines Google's ongoing alternative to the NVIDIA H100 / Blackwell hyperscale-GPU stack.[^1][^4][^6][^19]