TPU v5e


Cloud TPU v5e (sometimes rendered TPU v5e or, internally at Google, TPUv5e) is the fifth-generation, efficiency-optimized member of Google's Tensor Processing Unit (TPU) family. It was announced on August 29, 2023 at the Google Cloud Next '23 conference in San Francisco as part of a broader set of generative-AI infrastructure updates, and reached general availability on November 9, 2023.[1][2][3] Google positioned the chip as "the most cost-efficient, versatile, and scalable Cloud TPU to date," targeting large-language-model inference and cost-sensitive training of models up to roughly 200 billion parameters rather than the absolute peak performance of its larger sibling, the TPU v5p, which was announced about three months later, in December 2023.[1][2][4]

The "e" suffix in the product name is widely understood as standing for "efficient" (or, in some Google materials, "efficiency"), distinguishing the chip from the performance-tier "p" variant that completed the v5 family.[1][5] In SemiAnalysis's analysis, the same silicon is referred to internally as a "lite" variant of the full TPUv5 ("Viperfish") design, with deliberately halved high-bandwidth memory stacks, lower clock speeds, and a less aggressive networking topology in exchange for far better dollars-per-FLOP economics and easier deployment.[6]

A single v5e chip delivers approximately 197 BF16 TFLOPS and 393 INT8 TOPS, paired with 16 GB of HBM2e at roughly 819 GB/s of bandwidth and four 400 Gbps inter-chip interconnect (ICI) ports.[7][6] A maximum v5e pod consists of 256 chips arranged in a flat 2D torus, providing more than 400 Tb/s of aggregate ICI bandwidth and roughly 100 PetaOps of INT8 throughput in a single tightly coupled slice.[1][7] Larger jobs are supported through Google's Multislice software, which uses the data-center network to stitch many v5e pods into training runs spanning tens of thousands of chips; in November 2023 Google ran a 50,944-chip Multislice job on v5e, at the time the world's largest publicly documented Cloud TPU training job.[8][9]

Background: from TPU v4 to the v5 generation

Google began designing in-house tensor accelerators in 2013 and first deployed TPU v1 internally in 2015 to accelerate inference for products such as Google Search and Google Translate.[10] Subsequent generations (v2, v3, v4) added training support, bfloat16 arithmetic, high-bandwidth memory, and progressively larger pod topologies, culminating in TPU v4, a liquid-cooled chip introduced in 2021 and made generally available in 2022 with up to 4,096 chips per 3D-torus pod connected by optical circuit switches.[10] TPU v4 trained Google's PaLM and helped lay the infrastructure groundwork for the company's foundation-model program.[1][10]

By mid-2023, however, the generative-AI boom had created two distinct workload pressures inside Google Cloud. First, training of frontier dense language models was scaling well beyond a single v4 pod, demanding software that could federate many pods over the data-center network. Second, inference for products such as Bard and for third-party LLM serving needed accelerators with much better cost per query, lower system complexity, and air-cooled installability in a wider variety of data centers. Google's response was a two-chip "v5" family that explicitly split these workloads: a smaller, cheaper, air-cooled inference-and-cost-training chip (v5e), and a larger, liquid-cooled flagship for the largest frontier models (v5p).[1][4]

TPU v5e was first publicly previewed on August 29, 2023, at Google Cloud Next '23 in a joint blog post by Amin Vahdat (VP/GM ML, Systems, and Cloud AI) and Mark Lohmeyer (VP/GM Compute and ML Infrastructure), alongside the general-availability launch of the A3 NVIDIA H100 GPU VMs and a set of generative-AI software updates.[1][11][5] HPCwire later reported that the v5-class silicon had a "controversial origin": elements of the design had been the subject of disputes inside Google's chip-design organization in the years before launch, although v5e itself shipped as a conventional Google-designed accelerator built in collaboration with Broadcom and manufactured at TSMC.[12]

Naming and positioning: "e" for efficient

In Google's own marketing materials, TPU v5e is described as "the most cost-efficient" Cloud TPU at launch, with Mark Lohmeyer telling TechCrunch that v5e was "the most cost-efficient and accessible cloud TPU to date."[3] Subsequent Google blog posts and third-party guides standardised on reading the "e" as "efficient" (or "efficiency"), to be contrasted with the "p" in TPU v5p, introduced as the performance-tier counterpart in December 2023.[4][5][13]

That positioning translates into deliberate architectural trade-offs:

  • One TensorCore per chip instead of two on v5p, with four matrix-multiply units (MXUs), a vector unit, and a scalar unit per TensorCore.[7][14]
  • Half the high-bandwidth memory stacks and slower HBM than v5p, giving 16 GB at ~819 GB/s versus v5p's 95 GB at much higher bandwidth.[6][4]
  • A flat 2D-torus ICI topology with no optical circuit switches inside a pod, capped at 256 chips per slice, far below v5p's 8,960-chip 3D-torus pod.[6][4]
  • Air cooling rather than the liquid cooling required by v4 and v5p, enabling installation in a wider set of Google Cloud data centers and the first deployment of a Google AI chip outside the United States.[1][2]

The result, in SemiAnalysis's framing, is a chip whose design optimizes for total cost of ownership (power, networking, deployment flexibility) rather than for the peak FLOPS-per-chip metric that dominates NVIDIA's product marketing.[6]

Architecture and chip-level specifications

TensorCore and matrix units

Each TPU v5e chip contains a single TensorCore. The TensorCore houses four MXUs (matrix-multiply units), one vector unit, and one scalar unit. Each MXU is a 128×128 systolic array of multiply-accumulate units, capable of 16,384 multiply-accumulate operations per cycle, the same MXU dimensions used from TPU v2 through the v5 generation.[7][14][6] Natively supported numeric formats include bfloat16 (BF16) for training and INT8 for inference; Google later extended INT8 training support to v5e through its Accurate Quantized Training (AQT) software.[7][8]

Per-chip peak throughput, as published in Google's Cloud TPU v5e documentation, is:[7]

  • 197 BF16 TFLOPS
  • 393 INT8 TOPS
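
As a rough cross-check, the published peak figures can be related to the MXU layout described above. The sketch below assumes the common accounting in which one multiply-accumulate counts as two floating-point operations and in which all four MXUs are busy every cycle; the clock rate it derives is an implied value under those assumptions, not a figure taken from the sources cited here.

```python
# Back-of-envelope relation between MXU layout and peak throughput.
# Assumption: peak = MXUs * (dim * dim MACs per cycle) * 2 ops per MAC * clock.

MXUS_PER_CHIP = 4      # from the text
MXU_DIM = 128          # 128 x 128 systolic array, from the text
OPS_PER_MAC = 2        # one multiply plus one add (accounting assumption)


def macs_per_cycle(mxus: int = MXUS_PER_CHIP, dim: int = MXU_DIM) -> int:
    """Multiply-accumulates the whole chip can issue in one cycle."""
    return mxus * dim * dim


def implied_clock_ghz(peak_tera_ops: float) -> float:
    """Clock rate that would be needed to reach the given peak (in tera-ops/s)."""
    ops_per_cycle = macs_per_cycle() * OPS_PER_MAC
    return peak_tera_ops * 1e12 / ops_per_cycle / 1e9


if __name__ == "__main__":
    print("MACs per cycle per chip:", macs_per_cycle())
    print("Implied clock for 197 BF16 TFLOPS (GHz):", round(implied_clock_ghz(197), 3))
    print("INT8 / BF16 peak ratio:", round(393 / 197, 3))
```

Under these assumptions the BF16 figure implies a clock of about 1.5 GHz, and the INT8 figure is almost exactly twice the BF16 figure, which is what one would expect if INT8 operations issue at double rate on the same arrays.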

SemiAnalysis estimates the v5e die at roughly 325 mm² and notes that, compared with the full ("Viperfish") TPUv5 silicon, v5e ships with reduced HBM stack count, lower clock speed, and a less aggressive networking topology to hit aggressive power and unit-cost targets.[6]

Memory hierarchy

Each v5e chip pairs the TensorCore with a single HBM2e stack providing 16 GB of high-bandwidth memory at 819 GB/s (often reported as 800 GiB/s in Google documentation).[7][6] Hosts in v5e systems each provide an additional 512 GiB of DRAM and are paired with eight chips per host in the largest configuration.[7] Unlike TPU v5p, v5e does not include the SparseCore acceleration unit used for embedding-dense workloads such as recommendation systems.[14][4]

The 16 GB-per-chip memory ceiling is important to the chip's positioning: a single v5e chip can comfortably serve LLMs up to about 13 billion parameters in 8-bit precision; multi-chip configurations using the eight VM shapes scale up to 2-trillion-parameter inference workloads via tensor parallelism over ICI.[15]

Inter-chip interconnect (ICI)

Each v5e chip has four ICI ports, each running at 400 Gbps per direction for an aggregate 1.6 Tb/s of bidirectional ICI bandwidth per chip.[7][6] Inside a maximum pod the chips form a flat 2D torus that scales to 256 chips, yielding more than 400 Tb/s of aggregate ICI bandwidth, 51.2 TB/s of all-reduce bandwidth, and 1.6 TB/s of bisection bandwidth across the pod.[7][1] Google deliberately omitted the optical circuit switches (OCS) used in TPU v4 pods, because at the v5e pod scale a static 2D torus is sufficient and cheaper.[6]
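
The bandwidth figures in this paragraph follow from simple arithmetic on the port count and pod size. The sketch below reproduces them; the 16×16 layout of the 256-chip torus and the convention of counting each severed link once at 400 Gbps for the bisection figure are assumptions made here because they reproduce the published numbers, not details confirmed by the cited sources.

```python
# Reproducing the ICI bandwidth figures from per-port numbers.

PORTS_PER_CHIP = 4
GBPS_PER_PORT = 400
POD_CHIPS = 256
TORUS_SIDE = 16  # assumed 16 x 16 arrangement of the 256-chip 2D torus


def per_chip_ici_gbps() -> int:
    return PORTS_PER_CHIP * GBPS_PER_PORT


def pod_aggregate_tbps() -> float:
    """Sum of per-chip ICI bandwidth over the whole pod, in terabits/s."""
    return POD_CHIPS * per_chip_ici_gbps() / 1000


def tbps_to_terabytes_per_s(tbps: float) -> float:
    return tbps / 8


def bisection_links(side: int = TORUS_SIDE) -> int:
    """Links cut when a side x side torus is split into two halves.

    Each of the `side` rings crossing the cut is severed twice:
    once at the cut itself and once at the wraparound link.
    """
    return 2 * side


def bisection_terabytes_per_s() -> float:
    return tbps_to_terabytes_per_s(bisection_links() * GBPS_PER_PORT / 1000)


if __name__ == "__main__":
    print("Per-chip ICI (Gbps):", per_chip_ici_gbps())
    print("Pod aggregate ICI (Tb/s):", pod_aggregate_tbps())
    print("Pod aggregate ICI (TB/s):", tbps_to_terabytes_per_s(pod_aggregate_tbps()))
    print("Bisection (TB/s):", bisection_terabytes_per_s())
```

Read this way, the 51.2 TB/s all-reduce figure is the 409.6 Tb/s aggregate expressed in bytes, and the 1.6 TB/s bisection figure corresponds to 32 severed links.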

Hosts, VM shapes, and pod topology

A v5e pod hosts up to 256 chips distributed across racks with eight chips per host server; the standard Google Cloud documentation describes three primary VM shapes:[7]

Machine type         Chips   vCPUs   RAM
ct5lp-hightpu-1t     1       24      48 GB
ct5lp-hightpu-4t     4       112     192 GB
ct5lp-hightpu-8t     8       224     384 GB

In addition to these single-host shapes, Google supports multi-host slice topologies ranging from 16 chips up to the full 256-chip pod. Google's initial v5e announcement described the full range as "eight different VM configurations," emphasising that customers could pick a slice size from one chip to more than 250 chips inside a single pod. Multi-host serving is handled by frameworks such as Sax.[1][7]

SemiAnalysis-disclosed packaging details

SemiAnalysis reported that each v5e sled houses four TPUs sharing one host with a 64-core AMD x86 chip and a 100G NIC, and that a full 256-chip pod is made up of four dual-sided racks with eight sleds per side (4 racks × 2 sides × 8 sleds × 4 chips = 256).[6] These low-level packaging details have not been confirmed by Google in primary documentation, but are consistent with the publicly disclosed ct5lp-hightpu-4t machine type and ICI topology.[7]

Workload coverage and chip-count ceilings

The Cloud TPU v5e documentation distinguishes carefully between training and serving:[7]

  • Single-host training is supported on slices of 1, 4, or 8 chips, matching the three published VM shapes.
  • Multi-host training is supported on slices ranging from 16 chips up to the full 256-chip pod.
  • Single-host serving is supported on slices of 1, 4, or 8 chips with the chip directly accessible from a JAX, PyTorch/XLA, or TensorFlow process.
  • Multi-host serving of LLMs that do not fit on a single host is handled through the Sax framework, which transparently shards a single model across multiple v5e hosts within a pod.

The 13-billion-parameter ceiling for single-chip serving is a direct consequence of the 16 GB HBM2e capacity: in INT8 a 13B-parameter dense model needs roughly 13 GB of weight memory, leaving some headroom for KV cache, activations, and serving runtime overhead. Larger models (Llama 2 70B, GPT-3 175B, and so on) require tensor-parallel sharding across 4-, 8-, or 16-chip configurations.[15]
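
The capacity argument can be sketched as a small sizing helper. Everything beyond the 16 GB per-chip figure is an assumption for illustration: the 2 GB reserved for KV cache and runtime overhead is a hypothetical placeholder, and the intermediate multi-host slice sizes between 16 and 256 chips are assumed to be powers of two.

```python
import math

HBM_GB_PER_CHIP = 16   # from the text
RESERVED_GB = 2        # hypothetical headroom for KV cache, activations, runtime
# 1, 4, 8 are the single-host shapes; sizes above 16 are assumed powers of two.
SLICE_SIZES = (1, 4, 8, 16, 32, 64, 128, 256)


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1e9 parameters * bytes per parameter / 1e9)."""
    return params_billion * bytes_per_param


def min_chips(params_billion: float, bytes_per_param: float) -> int:
    """Fewest chips whose usable HBM holds the weights."""
    usable = HBM_GB_PER_CHIP - RESERVED_GB
    return math.ceil(weight_gb(params_billion, bytes_per_param) / usable)


def smallest_slice(params_billion: float, bytes_per_param: float) -> int:
    """Smallest slice size that provides at least min_chips chips."""
    need = min_chips(params_billion, bytes_per_param)
    for size in SLICE_SIZES:
        if size >= need:
            return size
    raise ValueError("model does not fit in a single 256-chip pod")


if __name__ == "__main__":
    for name, params in [("13B", 13), ("70B", 70), ("175B", 175)]:
        print(name, "INT8 ->", smallest_slice(params, 1), "chip(s);",
              "BF16 ->", smallest_slice(params, 2), "chip(s)")
```

With these placeholder numbers a 13B model in INT8 fits on one chip, a 70B model lands on an 8-chip host, and a 175B model needs a 16-chip slice; a real deployment would size the reserve from measured KV-cache and batch requirements.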

Pod-level performance and Multislice scaling

At pod scale, a single 256-chip TPU v5e slice provides:[7][1]

  • About 50.4 BF16 PFLOPS (256 × 197 TFLOPS) and 100 INT8 PetaOps of aggregate compute
  • >400 Tb/s of aggregate ICI bandwidth
  • 51.2 TB/s all-reduce bandwidth, 1.6 TB/s bisection bandwidth
  • 6.4 Tbps of data-center-network bandwidth per pod
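
The pod-level compute figures are per-chip peaks multiplied by 256, and dividing the pod's data-center-network bandwidth by the chip count shows how much thinner the DCN link is than ICI. The sketch below is plain arithmetic on the numbers quoted in this article; the even per-chip split of DCN bandwidth is a simplifying assumption.

```python
POD_CHIPS = 256
BF16_TFLOPS_PER_CHIP = 197
INT8_TOPS_PER_CHIP = 393
ICI_GBPS_PER_CHIP = 1600   # four ports at 400 Gbps
POD_DCN_GBPS = 6400        # 6.4 Tbps per pod


def pod_peak_peta(per_chip_tera: float, chips: int = POD_CHIPS) -> float:
    """Aggregate peak in peta-ops/s for a slice of `chips` chips."""
    return per_chip_tera * chips / 1000


def dcn_gbps_per_chip() -> float:
    """DCN bandwidth per chip, assuming it is shared evenly across the pod."""
    return POD_DCN_GBPS / POD_CHIPS


def ici_to_dcn_ratio() -> float:
    return ICI_GBPS_PER_CHIP / dcn_gbps_per_chip()


if __name__ == "__main__":
    print("BF16 PFLOPS per pod:", pod_peak_peta(BF16_TFLOPS_PER_CHIP))
    print("INT8 PetaOps per pod:", pod_peak_peta(INT8_TOPS_PER_CHIP))
    print("DCN Gbps per chip:", dcn_gbps_per_chip())
    print("ICI : DCN per chip:", ici_to_dcn_ratio())
```

The 64-to-1 gap between ICI and DCN bandwidth per chip is one way to see why bandwidth-hungry collectives are kept inside a slice and only lighter traffic crosses between slices.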

Because a v5e pod (256 chips) is much smaller than a v5p pod (8,960 chips) or a v4 pod (4,096 chips), large training jobs frequently span many pods. Google addresses this through Cloud TPU Multislice, a stack first introduced in preview alongside v5e and now central to the v5/Trillium training story.[16][1]

Multislice technology

Multislice is described by Google as a "full-stack performance-scaling technology" that enables a single training job to use multiple TPU slices within one pod or across many pods.[16] Communication inside a slice runs over the high-speed ICI fabric, while gradients are reduced across slices over Google's Jupiter data-center network using a hierarchical collective scheme. The XLA compiler automatically generates the inter-slice DCN code, overlapping communication with computation and decomposing collectives such as all-reduce into within-slice reduction, cross-slice reduction over DCN, and a broadcast step.[16]
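
The three-step decomposition of all-reduce described above can be illustrated with a toy simulation. This is a sketch of the idea only: it uses plain Python lists in place of device buffers, and it does not model how XLA actually schedules, fuses, or overlaps these steps.

```python
from typing import List

Vector = List[float]


def hierarchical_all_reduce(slices: List[List[Vector]]) -> List[List[Vector]]:
    """Toy all-reduce (sum) over chips grouped into slices.

    `slices[s][c]` is the gradient vector held by chip c of slice s.
    Returns the same nesting with every chip holding the global sum.
    """
    # Step 1: within-slice reduction (would run over ICI).
    slice_sums = [[sum(vals) for vals in zip(*chips)] for chips in slices]

    # Step 2: cross-slice reduction (would run over the data-center network).
    global_sum = [sum(vals) for vals in zip(*slice_sums)]

    # Step 3: broadcast the result back to every chip in every slice.
    return [[list(global_sum) for _ in chips] for chips in slices]


if __name__ == "__main__":
    grads = [
        [[1.0, 2.0], [3.0, 4.0]],       # slice 0, two chips
        [[10.0, 20.0], [30.0, 40.0]],   # slice 1, two chips
    ]
    print(hierarchical_all_reduce(grads))
```

Only one vector per slice crosses the slower network in step 2, which is the point of the hierarchy: the volume of inter-slice traffic depends on the number of slices rather than the number of chips.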

For TPU v5e specifically, Multislice exposes a 256-chip ICI domain per slice with ~25 Gbps per chip of DCN bandwidth, supporting data parallelism, fully sharded data parallelism (FSDP), tensor parallelism, and pipeline parallelism on top of the same JAX/XLA stack.[16] In a November 2023 demonstration, Google ran a single distributed training job across 50,944 TPU v5e chips spanning 199 Cloud TPU v5e pods, which it described at the time as the world's largest publicly disclosed LLM training run on Cloud TPU. The run trained MaxText dense models from 16B to 128B parameters and reached 66.86 % BF16 Model FLOPs Utilization (MFU) on a single pod and approximately 53 % MFU end-to-end across the cluster using INT8 AQT quantized training.[8] Peak observed throughput was reported as 5.32 exa-OP/s. For scale, 50,944 chips have a theoretical peak of roughly 10 exa-FLOPs in BF16 and 20 exa-OPs in INT8, so the reported throughput is about 53 % of the BF16 peak, in line with the MFU figure.[8]
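
The cluster-scale numbers can be checked with a few lines of arithmetic. The sketch below uses the per-chip peaks quoted earlier in this article; measuring the reported 5.32 exa-OP/s against the BF16 peak rather than the INT8 peak is a reading adopted here because it reproduces the roughly 53 % utilization figure, not something the cited source is quoted as stating.

```python
CHIPS_PER_POD = 256
PODS = 199
BF16_TFLOPS_PER_CHIP = 197
INT8_TOPS_PER_CHIP = 393


def cluster_chips(pods: int = PODS) -> int:
    return pods * CHIPS_PER_POD


def peak_exa(per_chip_tera: float, chips: int) -> float:
    """Aggregate peak in exa-ops/s (1 exa = 1e6 tera)."""
    return per_chip_tera * chips / 1e6


def utilization(achieved_exa: float, peak_exa_value: float) -> float:
    """Fraction of theoretical peak actually achieved."""
    return achieved_exa / peak_exa_value


if __name__ == "__main__":
    chips = cluster_chips()
    bf16_peak = peak_exa(BF16_TFLOPS_PER_CHIP, chips)
    int8_peak = peak_exa(INT8_TOPS_PER_CHIP, chips)
    print("Chips:", chips)  # 50944
    print("BF16 peak (exa-FLOP/s):", round(bf16_peak, 2))
    print("INT8 peak (exa-OP/s):", round(int8_peak, 2))
    print("5.32 vs BF16 peak:", round(utilization(5.32, bf16_peak), 3))
    print("5.32 vs INT8 peak:", round(utilization(5.32, int8_peak), 3))
```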

This run also illustrated several Multislice-specific engineering optimizations: checkpoint broadcasting (where one slice loads a checkpoint and broadcasts it to the others, giving Google a reported 150× speedup over conventional loading), distributed data loading to avoid Cloud Storage bottlenecks at 64+ pod scale, and XLA-runtime optimizations to eliminate host/device transfer overhead.[8]

Software stack

Like other Cloud TPUs, v5e is targeted by Google's XLA/OpenXLA compiler and is accessible from the same set of high-level ML frameworks supported on prior generations:[1][7]

  • JAX with GSPMD-based sharding, used internally for PaLM, Gemini, and other frontier Google models.
  • PyTorch via PyTorch/XLA, with PyTorch/XLA 2.1 explicitly adding model- and data-parallel support for v5e.[1]
  • TensorFlow through the existing Cloud TPU TF backend.
  • Hugging Face Transformers, PyTorch Lightning, and Ray for higher-level orchestration and training.[1]
  • MaxText (a JAX dense-LLM reference stack), Pax (Google's industrial-scale training framework), and MaxDiffusion for image-generation workloads.[16][15]
  • Sax for multi-host serving of large LLMs that span more than one host.[7]

Cloud TPU v5e is also exposed through higher-level Google Cloud services. At the August 2023 launch, Google announced general availability of Cloud TPU in Google Kubernetes Engine (GKE) for v4 and v5e, training integration in Vertex AI, and support for v5e through partner tooling such as Ray and Slurm.[1][15] The "what you train is what you serve" pattern, using the same TPU shape and framework for both training and inference, is emphasised in Google's v5e inference documentation.[15]

A characteristic example of the training-side stack used on v5e is the November 2023 50,944-chip run: it combined the XPK (Accelerated Processing Kit) ML cluster orchestrator on top of GKE with Jobset and Kueue capacity management, the MaxText JAX-based dense-LLM reference implementation, the AQT quantization library, Flax/Orbax/Optax for model authoring and checkpointing, and the OpenXLA compiler for code generation.[8] On the parallelism side, the same job used pure data parallelism across pods over DCN combined with fully sharded data parallelism (FSDP) for the 16B, 32B, and 64B configurations, and a mix of FSDP plus tensor parallelism for the 128B configuration.[8] The resulting code structure is small enough that adapting a training job from a single 256-chip slice to a 199-pod, 50,944-chip Multislice run does not require fundamentally rewriting model code; only the sharding spec passed to the JAX mesh_utils.create_hybrid_device_mesh helper needs to change.[16]
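
To make the "only the sharding spec changes" point concrete, the sketch below computes the pair of mesh shapes that a hybrid mesh needs: one describing parallelism inside a slice (over ICI) and one describing parallelism across slices (over DCN). It is pure Python so it runs without TPU hardware. The function name `hybrid_mesh_shapes`, the three-axis ordering (data, fsdp, tensor), and the tensor-parallel degree of 4 are hypothetical choices for illustration and are not taken from the MaxText configuration of the run described above.

```python
from math import prod
from typing import Tuple

Shape = Tuple[int, int, int]  # assumed axis order: (data, fsdp, tensor)


def hybrid_mesh_shapes(num_slices: int,
                       chips_per_slice: int,
                       tensor_parallel: int = 1) -> Tuple[Shape, Shape]:
    """Return (ici_shape, dcn_shape) for data parallelism across slices
    and FSDP (optionally with tensor parallelism) inside each slice."""
    if chips_per_slice % tensor_parallel != 0:
        raise ValueError("tensor_parallel must divide chips_per_slice")
    fsdp = chips_per_slice // tensor_parallel
    ici_shape = (1, fsdp, tensor_parallel)  # laid out within one slice
    dcn_shape = (num_slices, 1, 1)          # replicated across slices
    return ici_shape, dcn_shape


if __name__ == "__main__":
    # Single 256-chip slice, pure FSDP.
    print(hybrid_mesh_shapes(1, 256))
    # 199 slices: same model code, different shapes.
    ici, dcn = hybrid_mesh_shapes(199, 256, tensor_parallel=4)
    print(ici, dcn, "total devices:", prod(ici) * prod(dcn))
    # On real hardware these two tuples would be the first two arguments to
    # jax.experimental.mesh_utils.create_hybrid_device_mesh(ici, dcn).
```

The design choice illustrated is that the slower DCN axis carries only data parallelism, while the axes that need frequent, high-volume communication stay inside the ICI domain.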

Performance and cost claims

Google's headline claims for TPU v5e relative to TPU v4, taken from its August 29, 2023 announcement and the November 9, 2023 general-availability post, are:[1][2]

  • Up to 2× higher training performance per dollar vs. Cloud TPU v4.
  • Up to 2.5× higher inference performance per dollar vs. Cloud TPU v4 for LLMs and generative-AI models.
  • Up to 1.7× lower inference latency vs. Cloud TPU v4.[15]
  • Less than half the cost of TPU v4 per chip-hour (with a $1.20/chip-hour on-demand list price in us-west4 at launch versus $3.22/chip-hour for TPU v4).[2]

In MLPerf-class benchmarks Google reported:

  • MLPerf Training 3.1: a 2.3× price-performance improvement versus TPU v4 on the GPT-3 175B benchmark, with the converged GPT-3 175B run executed on 4,096 TPU v5e chips through Multislice.[2]
  • MLPerf Inference 3.1: a 2.7× price-performance improvement versus TPU v4 on the GPT-J 6B language-model benchmark, using four TPU v5e chips.[2]

These claims are consistent with the broader analyst framing of v5e as a workload-specialized chip whose advantage is realised on a per-dollar rather than per-chip basis.[17][6]
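
The distinction between per-dollar and per-chip advantage can be made explicit with the list prices quoted above. The sketch below assumes that price-performance is simply throughput divided by on-demand list price; Google's MLPerf price-performance figures may have been computed on a different pricing basis, so the implied per-chip ratios are illustrative rather than reported results.

```python
V5E_PRICE_PER_CHIP_HOUR = 1.20  # us-west4 on-demand list price at launch
V4_PRICE_PER_CHIP_HOUR = 3.22


def price_ratio() -> float:
    """v5e chip-hour price as a fraction of the v4 chip-hour price."""
    return V5E_PRICE_PER_CHIP_HOUR / V4_PRICE_PER_CHIP_HOUR


def implied_per_chip_performance(perf_per_dollar_gain: float) -> float:
    """Per-chip throughput of v5e relative to v4 implied by a perf/$ gain.

    perf_per_dollar = throughput / price, so
    relative throughput = perf/$ gain * relative price.
    """
    return perf_per_dollar_gain * price_ratio()


if __name__ == "__main__":
    print("Price ratio v5e / v4:", round(price_ratio(), 3))
    print("Implied per-chip perf at 2.3x perf/$:",
          round(implied_per_chip_performance(2.3), 3))
    print("Implied per-chip perf at 2.7x perf/$:",
          round(implied_per_chip_performance(2.7), 3))
```

Under this assumption, a 2.3× price-performance gain corresponds to a v5e chip that is somewhat slower than a v4 chip, and a 2.7× gain to rough per-chip parity, which is the sense in which the advantage is per dollar rather than per chip.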

LLM-serving benchmarks

Beyond MLPerf, Google and its partners published a number of LLM-serving benchmarks on v5e:

  • On a single TPU v5e-8 host running PyTorch/XLA, Google reported 17 ms/token inference latency for Llama 2 70B and approximately 42 tokens/s/chip of throughput when using weight-only quantization to fit a larger batch size.[20]
  • On the same eight-chip v5e configuration, Google's JetStream serving stack later achieved up to 4,783 tokens/second total throughput on open LLMs including Llama 2.[15]
  • For diffusion workloads, Google's MaxDiffusion stack demonstrated that 1,000 1024×1024 images can be generated for roughly $0.10 on a TPU v5e-8 node using an optimized Stable Diffusion XL model with 4 diffusion decoder steps.[15]

The Llama benchmark is a useful illustration of the v5e design philosophy: because dense LLM decoding is memory-bandwidth bound, the roughly 819 GB/s of HBM bandwidth per chip becomes the limiting factor at small batch sizes, while the 16 GB capacity limits how large a batch can be held. Quantization plus tensor parallelism over the 1.6 Tb/s ICI fabric are therefore key to extracting throughput from the comparatively small per-chip compute envelope.[20][15]
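
A simple bandwidth bound shows why decoding latency is tied to memory traffic. The sketch below assumes that generating one token requires reading every weight once, that weights are stored in INT8 and sharded evenly across the chips, and that KV-cache reads and inter-chip communication are ignored; the precision used for the reported 17 ms/token figure is not stated in this article, so INT8 is an assumption.

```python
HBM_GB_PER_S = 819  # per-chip HBM bandwidth from the text


def min_ms_per_token(params_billion: float,
                     bytes_per_param: float,
                     chips: int) -> float:
    """Lower bound on per-token decode latency from weight reads alone."""
    weight_gb_per_chip = params_billion * bytes_per_param / chips
    return weight_gb_per_chip / HBM_GB_PER_S * 1000


if __name__ == "__main__":
    bound = min_ms_per_token(70, 1, 8)  # Llama 2 70B, INT8, v5e-8
    print("Bandwidth-bound floor (ms/token):", round(bound, 2))
    print("Reported latency (ms/token):", 17)
```

Under these assumptions the floor is a little under 11 ms per token, so the reported 17 ms sits within a factor of two of what memory bandwidth alone allows.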

Customers and early adopters

Google's launch and follow-up materials cite a number of named v5e users:

  • Anthropic: co-founder Tom Brown said at launch that A3 and TPU v5e with Multislice would "bring price-performance benefits as we continue to build the next wave of AI"; by GA, Anthropic was using TPU v5e to "efficiently scale serving" for its Claude LLM.[1][2]
  • AssemblyAI: reported up to 4× greater performance per dollar versus comparable accelerated instances when running production automatic-speech-recognition (ASR) inference on TPU v5e, with v5e handling 25 million-plus inference calls per day.[1][15][18]
  • Gridspace: claimed a 5× speed increase in AI models and a 6× improvement in inference scale (1,000 seconds of audio processed in one real-time second) running its full-stack conversational-AI platform on v5e.[1][15]
  • Lightricks: Core Generative AI lead Yoav HaCohen described v5e's "unified toolset" as enabling rapid transition from idea to training to deployment.[1]
  • Hugging Face: used TPU v5e to serve Stable Diffusion XL 1.0 image generation at GA.[2]
  • Google Bard: at GA, Google highlighted that Bard, then serving more than 200 countries in over 40 languages, was running large-scale training and inference on v5e.[2]

Outside of Google's own published list, broader industry coverage names additional generative-AI tenants using TPUs more generally, including Anthropic, Salesforce, Midjourney, and Snap, although how much of each customer's workload runs specifically on v5e (vs. v4 or v5p) is not always disclosed.[18][4]

Pricing and availability

At launch, Cloud TPU v5e was offered in preview through Google account managers and a registration form, with general availability declared on November 9, 2023.[2][1] Public on-demand list pricing at GA was $1.20 per chip-hour in the us-west4 region, with 1- and 3-year committed-use discounts dropping the price substantially (third-party analyses noted 3-year commit pricing as low as roughly $0.55/chip-hour for v5e).[2][4][18]

Google also emphasised that Cloud TPU v5e was its first AI chip available outside the United States: at announcement, deployments were planned for the Netherlands (europe-west4 region) for EMEA customers and Singapore (asia-southeast1) for Asia-Pacific customers, complementing existing US installations.[1] By GA, public list pricing for us-west4 was published on the Cloud TPU pricing page, while detailed per-region pricing for other locations was published on Google Cloud's standard SKU pricing pages.[2]

Relationship to TPU v5p, Trillium, and Ironwood

TPU v5e and TPU v5p were the two halves of Google's v5 generation, addressing complementary workloads:[4]

  • TPU v5e: efficiency tier, BF16/INT8-tuned, 1 TensorCore, 16 GB HBM2e, 2D torus, 256 chips per pod, air-cooled, on-demand list price $1.20/chip-hour at GA.[1][2][7]
  • TPU v5p: performance tier, ~2× the FLOPS and 3× the HBM of TPU v4, 3D torus with up to 8,960 chips per pod, 2.8× faster LLM training than TPU v4, liquid-cooled, on-demand list price $4.20/chip-hour.[4]

Both v5e and v5p shared the Multislice training and multi-host inference software stacks, allowing customers to mix-and-match efficiency and performance accelerators behind the same JAX/XLA programming model.[4][16]

The v5 family was succeeded by:

  • Trillium (TPU v6e): announced at Google I/O 2024 and made generally available in late 2024 as Google's sixth-generation TPU. Google reported that Trillium delivers up to 2.1× higher performance per dollar than Cloud TPU v5e and 2.5× higher performance per dollar than Cloud TPU v5p in dense LLM training, with substantially higher peak compute per chip.[19]
  • TPU Ironwood: Google's seventh-generation, inference-focused TPU, which continued the efficiency lineage that v5e established within the v5/v6e/v7 progression.[19]

In retrospect, the launch of TPU v5e represented an important inflection in Google's accelerator strategy: rather than chasing a single, ever-larger flagship, the company committed to a multi-SKU TPU portfolio in which an "e"-class chip explicitly optimized for cost per query coexists with a "p"-class flagship for the largest frontier-training workloads. That bifurcation, with the Multislice software that made it economically viable, is the architectural pattern that has continued through the Trillium and Ironwood generations and that defines Google's ongoing alternative to the NVIDIA H100 / Blackwell hyperscale-GPU stack.[1][4][6][19]

References

  1. Vahdat, Amin and Lohmeyer, Mark. "Announcing Cloud TPU v5e and A3 GPUs in GA." Google Cloud Blog, August 29, 2023.
  2. "Cloud TPU v5e is generally available." Google Cloud Blog, November 9, 2023.
  3. Lardinois, Frederic. "Google Cloud announces the 5th generation of its custom TPUs." TechCrunch, August 29, 2023.
  4. Lohmeyer, Mark and Vahdat, Amin. "Introducing Cloud TPU v5p and AI Hypercomputer." Google Cloud Blog, December 6, 2023.
  5. "Google's TPU v5e: A New Chapter in Physical AI Chips and Cloud AI Infrastructure." Vcom, 2023.
  6. Patel, Dylan and colleagues. "TPUv5e: The New Benchmark in Cost-Efficient Inference and Training for <200B Parameter Models." SemiAnalysis, 2023.
  7. "TPU v5e." Google Cloud Documentation.
  8. "The world's largest distributed LLM training job on TPU v5e." Google Cloud Blog, November 2023.
  9. "Cloud TPU Multislice Overview." Google Cloud Documentation.
  10. "Google TPU Architecture: 7 Generations Explained." Introl Blog.
  11. Vahdat, Amin. "LinkedIn post: Announcing Cloud TPU v5e and A3 GPUs in GA." LinkedIn, August 2023.
  12. "Google TPU v5e AI Chip Debuts after Controversial Origins." HPCwire, August 30, 2023.
  13. "TPU v5p." Google Cloud Documentation.
  14. "TPU architecture." Google Cloud Documentation.
  15. "How Cloud TPU v5e accelerates large-scale AI inference." Google Cloud Blog.
  16. "Using Cloud TPU Multislice to scale AI workloads." Google Cloud Blog.
  17. "Google Cloud's TPU v5e Accelerates the AI Compute War." Futurum Group, 2023.
  18. "AssemblyAI on Cloud TPU v5e price performance." Google Cloud Blog, 2023.
  19. "Introducing Trillium, sixth-generation TPUs." Google Cloud Blog, 2024.
  20. "High-Performance Llama 2 Training and Inference with PyTorch/XLA on Cloud TPUs." PyTorch Blog, 2023.
