Trillium (TPU v6e)
Trillium, also designated TPU v6e, is the sixth-generation Tensor Processing Unit developed by Google for machine learning training and inference workloads on Cloud TPU infrastructure.[1][2] Google announced Trillium at the Google I/O developer conference in May 2024 and made it generally available to Google Cloud customers on December 12, 2024.[5][8][11] According to Google, each Trillium chip delivers a 4.7x increase in peak compute per chip over the prior-generation TPU v5e, doubles both High Bandwidth Memory (HBM) capacity and bandwidth, doubles Inter-Chip Interconnect (ICI) bandwidth, adds a third-generation SparseCore, and is over 67% more energy-efficient than v5e, scaling up to 256 chips in a single high-bandwidth pod.[1][11] Google has stated that it used Trillium TPUs to train Gemini 2.0, its most capable model at the time of the December 2024 launch.[8][11]
The "e" suffix denotes the chip's positioning as an efficiency-optimized member of the TPU family, a lineage convention Google introduced with TPU v5e.[3] Unlike the fifth generation, which Google split into two distinct silicon designs (v5e and v5p), the sixth generation was released only as a single efficiency-class part; there is no publicly announced "TPU v6p" variant, and Trillium has been the sole v6 product line through 2025.[3][4]
Trillium was unveiled by Google CEO Sundar Pichai on May 14, 2024 at the Google I/O developer conference in Mountain View, California, where it was announced as the silicon that would underlie the next generation of Gemini models.[5][6] The chip entered preview availability on Google Cloud on October 31, 2024 and reached general availability on December 12, 2024, coinciding with the public launch of Gemini 2.0 Flash.[7][8] At general availability Google stated, "We used Trillium TPUs to train the new Gemini 2.0, Google's most capable AI model yet," and Pichai separately stated that "TPUs powered 100% of Gemini 2.0 training and inference."[9][11]
Each Trillium chip delivers a peak of approximately 918 BF16 TFLOPS and 1,836 INT8 TOPS, is equipped with 32 GB of high-bandwidth memory at 1,640 GB/s, and provides 800 GB/s of inter-chip interconnect (ICI) bandwidth distributed across four ports.[10] Compared to its TPU v5e predecessor, Google reports a 4.7x increase in peak compute per chip, doubled HBM capacity, doubled HBM bandwidth, doubled ICI bandwidth, and a greater-than-67% improvement in energy efficiency.[1][11] Trillium chips are organized into 256-chip pods linked by a 2D torus interconnect, with multiple pods aggregated through Google's data-center-scale Jupiter network fabric, a topology Google has scaled to over 100,000 Trillium chips operating as a single training substrate.[11][12]
Trillium's successor, TPU Ironwood (TPU v7), was announced at Google Cloud Next 2025 on April 9, 2025 and is positioned by Google as an inference-optimized counterpart that complements rather than replaces Trillium for training workloads.[13][14]
Trillium is Google's sixth-generation Tensor Processing Unit, a custom application-specific integrated circuit (ASIC) designed in-house by Google to accelerate the large matrix-multiplication operations at the core of deep learning. It is offered exclusively through Cloud TPU on Google Cloud rather than sold as standalone silicon, and it serves both training and inference for large models. Google describes the chip as "the most performant and most energy-efficient TPU to date," citing a 4.7x increase in peak compute per chip over TPU v5e.[1][11] In Google's framing, Trillium is the chip that "powered the training of" the Gemini 2.0 generation of models.[8][11]
The TPU program at Google began in 2013 as an internal effort to relieve the company's data centers of the computational load imposed by the rapid growth of deep learning inference, especially for products such as Google Search, Google Photos, and Google Translate.[15][38] According to accounts published by Google engineers, then-Chief Scientist Jeff Dean had calculated that if 100 million Android users used voice-to-text dictation for three minutes per day, the back-end inference load would more than double the total computational capacity of all Google data centers at the time. The internal response was to commission a custom silicon design specifically tuned to the matrix-multiplication-heavy inner loops of deep neural network inference.[38] Norm Jouppi, recruited from HP Labs, served as tech lead and principal architect, and the team brought TPU v1 from concept to production deployment in approximately 15 months.[38][15]
The first-generation TPU, designed for inference only, reached internal deployment in 2015 and was first publicly disclosed in 2016; it implemented a 256x256 systolic array of 8-bit multiply-accumulate units delivering 92 TOPS of INT8 compute and famously powered the AlphaGo match against Lee Sedol.[15][16] The second-generation TPU was announced at Google I/O in May 2017 and introduced support for bfloat16 training as well as the inter-chip interconnect that enabled the first TPU Pod deployments, the architectural step that opened TPUs to training workloads and not just serving.[15][16] TPU v3 followed in 2018 with liquid cooling and roughly double the throughput of v2.[15] TPU v4, presented in 2021, doubled v3's performance again, introduced 3D-torus topology and optical-circuit-switched reconfiguration of pod slices, and brought the first generation of the SparseCore accelerator for embedding-heavy workloads such as recommendation models.[15][16]
In 2023 Google split the fifth generation into two parts with different positioning: v5e, optimized for cost-efficient training and serving in mid-size jobs, launched in August 2023; and v5p, optimized for the highest-throughput large-model training, launched in December 2023.[17][18] The two chips differ in topology and pod scale, with v5e using a 2D-torus 256-chip pod and v5p using a 3D-torus pod that scales up to 8,960 chips.[17] The v5e/v5p split established the "e" (efficient) and "p" (performance) suffix convention that would define future TPU generations and shape how customers reasoned about chip choice for a given workload.[17][18]
Apple's foundation models for Apple Intelligence were trained on TPU v4 (the server-side AFM-server, on 8,192 v4 chips) and TPU v5p (the on-device AFM-on-device, on 2,048 v5p chips), as disclosed in Apple's foundation-models paper in July 2024. Arriving two months after the Trillium announcement, the disclosure was a notable third-party validation of Google's TPU stack for frontier-scale model training.[19][20]
Trillium (v6e) followed v5e and v5p as a generational refresh of the efficiency-class part; Google did not release a separate v6p chip, and the sole v6 silicon design has been used for both training and serving across the v6e generation.[4][11] This represented a small but consequential change of strategy: in v5, the "e" and "p" parts had differentiated form factors and topologies, whereas v6 leaned harder into a single converged design with the efficiency profile. In April 2025 Google announced its seventh-generation TPU under the codename Ironwood, positioned by the company as the first TPU "for the age of inference," though Ironwood pods also offer the largest training cluster Google has ever shipped.[13][14]
Google CEO Sundar Pichai announced Trillium at the Google I/O 2024 keynote on May 14, 2024.[5][6] The chip then progressed through Google Cloud's standard rollout: it entered preview on October 31, 2024, and reached general availability on December 12, 2024, the same day Google launched Gemini 2.0 Flash.[7][8][11] The table below summarizes the milestones.
| Milestone | Date | Source |
|---|---|---|
| Announced at Google I/O | May 14, 2024 | Google I/O keynote[5][6] |
| Preview on Google Cloud | October 31, 2024 | Google Cloud Blog[7] |
| General availability | December 12, 2024 | Google Cloud Blog[11] |
| Used to train Gemini 2.0 | Announced December 12, 2024 | Google Cloud Blog[8][11] |
| Successor (Ironwood / TPU v7) announced | April 9, 2025 | Google Blog[13] |
Each Trillium chip contains TensorCores, with each TensorCore housing two matrix-multiply units (MXUs), a vector unit, and a scalar unit.[16] A core architectural change in v6e is the expansion of the systolic-array MXU from 128x128 (used in TPU v2 through v5p) to 256x256 multiply-accumulators.[16][21] The 256x256 array quadruples the number of multiply-accumulate operations executed per MXU cycle relative to v5p. Combined with a higher clock speed and other microarchitectural improvements, the change delivers approximately 4.7x the BF16 peak throughput of v5e at the chip level.[21]
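The per-MXU factor of four follows directly from the array dimensions. A minimal sketch of that arithmetic, assuming an idealized systolic array in which every cell performs one multiply-accumulate per cycle (`macs_per_cycle` is a hypothetical helper for illustration, not part of any Google API):

```python
def macs_per_cycle(rows: int, cols: int) -> int:
    """Multiply-accumulates per cycle for a rows x cols systolic array,
    assuming every cell performs exactly one MAC each cycle (idealized)."""
    return rows * cols


old_mxu = macs_per_cycle(128, 128)  # MXU size cited for TPU v2 through v5p
new_mxu = macs_per_cycle(256, 256)  # MXU size cited for Trillium
ratio = new_mxu / old_mxu

print(old_mxu, new_mxu, ratio)  # 16384 65536 4.0
```

The per-MXU ratio alone does not determine the chip-level figure, which also depends on the number of MXUs per chip and on clock speed.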
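The MXUs support bfloat16 and INT8 natively, with BF16 inputs producing BF16 or FP32 accumulator outputs.[10] Some Google documentation is also cited as reporting an FP8 peak of 4,614 TFLOPS for Trillium, which would imply that the same MXU lanes can run FP8 operations under a compiler-driven mode. That value is identical to the per-chip FP8 peak Google publishes for Ironwood and is roughly five times Trillium's BF16 peak, so its attribution to Trillium should be treated as unconfirmed.[16]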
Trillium ships with Google's third-generation SparseCore, a dataflow processor distinct from the MXU and designed to accelerate the irregular memory-access patterns associated with large embedding tables found in ranking, recommendation, and retrieval-augmented models.[1][22] Google describes the SparseCore in Trillium as "a specialized accelerator for processing ultra-large embeddings common in advanced ranking and recommendation workloads."[1] Where v5p included four SparseCores per chip, v6e includes two SparseCores per chip but with redesigned bandwidth and SIMD-width handling; variable-width SIMD lanes (8 elements for FP32, 16 elements for bfloat16) reduce wasted bandwidth from misaligned reads of embedding tables.[22] Google reports a 2x improvement in embedding performance and a 5x improvement on the DLRM DCNv2 recommendation benchmark relative to v5e.[11]
Trillium's HBM stack provides 32 GB of capacity per chip, exactly double the 16 GB found in TPU v5e, though well below the 95 GB of the performance-class v5p, which targets a different chip class.[10][11] Google states that "doubling the HBM capacity and bandwidth allows Trillium to work with larger models with more weights and larger key-value caches."[1] HBM bandwidth is 1,640 GB/s per chip (some sources round to 1,600 GB/s), again roughly doubling the v5e figure.[10] Each Trillium host VM provides 1,536 GiB of DRAM, a 3x increase over v5e, which Google leverages for "host-offloading" of optimizer state and activations during training of large models such as Llama-3.1-405B.[11][23]
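One way to read the compute and bandwidth figures together is the roofline "ridge point": the arithmetic intensity (FLOPs per byte of HBM traffic) above which a kernel stops being memory-bound. The sketch below applies that standard model to the peak figures quoted in this article, taking bandwidth in decimal GB/s; `ridge_point` is a hypothetical helper, and real kernels rarely reach either peak.

```python
def ridge_point(peak_flops: float, hbm_bytes_per_s: float) -> float:
    """FLOPs per HBM byte at which peak compute and peak bandwidth balance."""
    return peak_flops / hbm_bytes_per_s


# Peak BF16 compute and HBM bandwidth as quoted in this article.
v6e = ridge_point(918e12, 1640e9)
v5e = ridge_point(197e12, 820e9)

print(round(v6e), round(v5e))  # 560 240
```

Because compute grew by about 4.7x while bandwidth only doubled, a kernel needs more than twice the arithmetic intensity on Trillium to stay compute-bound, which is one reason larger batch sizes matter for utilization.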
Each Trillium chip exposes four ICI ports for a combined 800 GB/s of inter-chip interconnect bandwidth, twice the per-chip ICI capacity of v5e.[10] Google notes that "doubling the ICI bandwidth enables training and inference jobs to scale to tens of thousands of chips."[1] Within a single pod the chips are arranged in a 2D torus, the same topology family used by v2, v3, and v5e (in contrast to the 3D torus used by v4 and v5p).[16][22] Total bisection bandwidth within a Trillium pod is 3.2 TB/s and the pod-wide all-reduce bandwidth is 102.4 TB/s.[10]
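In a 2D torus every chip has exactly four neighbors, one per ICI port, because links wrap around at the edges of the grid. The following sketch illustrates that wraparound for a 16x16 pod. The coordinate scheme and the `torus_neighbors` helper are illustrative assumptions, as is the even split of the 800 GB/s across the four ports.

```python
def torus_neighbors(x: int, y: int, dim_x: int = 16, dim_y: int = 16):
    """Return the four neighbors of chip (x, y) in a dim_x x dim_y 2D torus.

    Modular arithmetic provides the wraparound links at the grid edges.
    """
    return [
        ((x - 1) % dim_x, y),
        ((x + 1) % dim_x, y),
        (x, (y - 1) % dim_y),
        (x, (y + 1) % dim_y),
    ]


# A corner chip still has four neighbors thanks to the wraparound links.
print(torus_neighbors(0, 0))  # [(15, 0), (1, 0), (0, 15), (0, 1)]

# Assumption: the combined ICI bandwidth is divided evenly across ports.
per_port_gb_s = 800 / 4
print(per_port_gb_s)  # 200.0
```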
The table below collects Trillium's per-chip and per-pod specifications alongside the corresponding TPU v5e figures, which Google uses as the baseline for its headline comparisons.[1][10][11]
| Specification | Trillium (TPU v6e) | TPU v5e | Change |
|---|---|---|---|
| Peak BF16 compute per chip | ~918 TFLOPS | ~197 TFLOPS | ~4.7x[1][11] |
| Peak INT8 per chip | 1,836 TOPS | ~394 TOPS | ~4.7x[10] |
| HBM capacity per chip | 32 GB | 16 GB | 2x[1][11] |
| HBM bandwidth per chip | 1,640 GB/s | 820 GB/s | 2x[1][10] |
| ICI bandwidth per chip | 800 GB/s | 400 GB/s | 2x[1][10] |
| SparseCore generation | 3rd gen | 2nd gen | new gen[1][22] |
| Chips per pod | 256 (16x16 2D torus) | 256 (2D torus) | same scale[1][10] |
| Pod all-reduce bandwidth | 102.4 TB/s | (lower) | higher[10] |
| Energy efficiency vs v5e | over 67% better | baseline | >67%[1][11] |
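The "Change" column can be reproduced from the per-chip figures in the table. The sketch below recomputes each ratio and also derives an aggregate peak for a full 256-chip pod; the pod figure is simple multiplication of per-chip peaks, not a number Google publishes in this form.

```python
# (Trillium, v5e) pairs copied from the table above.
specs = {
    "bf16_tflops": (918, 197),
    "int8_tops": (1836, 394),
    "hbm_gb": (32, 16),
    "hbm_gb_s": (1640, 820),
    "ici_gb_s": (800, 400),
}

ratios = {name: round(v6e / v5e, 2) for name, (v6e, v5e) in specs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")

# Aggregate peak BF16 compute of one 256-chip pod, in PFLOPS.
pod_bf16_pflops = 256 * 918 / 1000
print(pod_bf16_pflops)  # 235.008
```

The compute ratios come out at about 4.66x, which Google rounds to 4.7x.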
A standard Trillium server board hosts four v6e chips per host CPU socket; the boards publicly displayed at SC24 showed four chips alongside their HBM stacks, with each accelerator served by a dedicated host network interface.[24] A full host machine pairs eight chips with 1,536 GiB of DRAM and 4x200 Gbps host network connectivity, identified in Google Cloud as the ct6e-standard-8t machine type.[10] Smaller machine types, including ct6e-standard-1t for a single chip and ct6e-standard-4t for a 2x2 slice, are also available for development and small-scale serving.[10]
The 256-chip pod ("Trillium-256") is interconnected as a 16x16 2D torus and serves as the elementary "slice" of Trillium capacity.[10] Google states that "Trillium can scale up to 256 TPUs in a single high-bandwidth, low-latency pod."[1] Supported user-visible slice topologies range from 1x1 (a single chip) up to 16x16 (256 chips); the 2x4 (eight-chip) configuration is highlighted by Google as the recommended sweet spot for inference on a single VM.[10]
Beyond the 256-chip pod, multiple pods can be stitched together with Google's "multislice" software stack and Titanium infrastructure offloads. Titanium uses host adapters, the data-center-wide rail-aligned network, and other accelerator-side offloads to extend Trillium training jobs across pod boundaries.[11][12] At the cluster level, Google has connected more than 100,000 Trillium chips through its Jupiter network fabric, which provides 13 petabits per second of bisection bandwidth, large enough to serve a single distributed training job at hundreds of thousands of accelerators.[11][12]
Trillium uses the same XLA-based software stack (JAX and PyTorch/XLA) as previous TPU generations, an intentional design decision that minimizes migration cost for customers moving from v5e or v5p.[25] Workloads are typically expressed in JAX or PyTorch (via PyTorch/XLA) and lowered to the XLA compiler, which targets the TPU instruction set and is maintained by Google as the canonical TPU code generator.[25] Google also supports TensorFlow on TPU and ships reference distributed-training stacks including MaxText, an open-source LLM pre-training and post-training framework written in Python and JAX, and MaxDiffusion for diffusion models.[25][7]
The XLA stack received meaningful improvements alongside Trillium's launch. The compiler exposes new scheduling and host-offloading passes that exploit Trillium's 3x larger host DRAM by moving optimizer state, activations, and infrequently-accessed weights off-chip to host memory, returning them just in time for the next computational step. This is the mechanism behind Google's reported >50% MFU improvement on Llama-3.1-405B training.[11][39] A second XLA improvement, "collections scheduling," coordinates multiple TPU slices serving inference requests so that batched prefill and decode phases can run on different slice subsets without losing throughput.[11]
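A rough memory budget shows why offloading optimizer state to host DRAM is attractive at this model size. The sketch below is back-of-the-envelope arithmetic only: the single-pod chip count and the per-parameter byte counts (BF16 weights, plus an FP32 master copy and two FP32 Adam moments) are assumptions chosen for illustration, not Google's published training configuration, and activations are ignored entirely.

```python
GIB = 2**30

# Figures taken from this article.
PARAMS = 405e9             # Llama-3.1-405B parameter count
HBM_PER_CHIP = 32e9        # bytes, treating 32 GB as decimal
HOST_DRAM = 1536 * GIB     # bytes per eight-chip host
CHIPS_PER_HOST = 8

# Assumptions for illustration only.
CHIPS = 256                # a single pod
WEIGHT_BYTES = 2           # BF16 weights
OPTIMIZER_BYTES = 12       # FP32 master copy + two FP32 Adam moments

weights = PARAMS * WEIGHT_BYTES
optimizer_state = PARAMS * OPTIMIZER_BYTES
pod_hbm = CHIPS * HBM_PER_CHIP
pod_host_dram = (CHIPS // CHIPS_PER_HOST) * HOST_DRAM

# Share of pod HBM used when everything stays on-chip.
hbm_share_without_offload = (weights + optimizer_state) / pod_hbm
# Share of pod HBM used when optimizer state lives in host DRAM.
hbm_share_with_offload = weights / pod_hbm
# Share of pod host DRAM consumed by the offloaded optimizer state.
host_share_with_offload = optimizer_state / pod_host_dram

print(f"HBM used without offload: {hbm_share_without_offload:.1%}")
print(f"HBM used with offload:    {hbm_share_with_offload:.1%}")
print(f"Host DRAM used:           {host_share_with_offload:.1%}")
```

Under these assumptions, moving the optimizer state off-chip frees most of the HBM it would otherwise occupy, leaving room for activations and larger batches, at the cost of host-to-device transfers that the compiler must schedule around.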
Higher-level orchestration on Trillium is provided through Google's "AI Hypercomputer" stack, which combines the chip and pod with Pathways (Google's runtime for orchestrating large-scale TPU jobs across hundreds of pods), Google Kubernetes Engine (GKE) integration, and Dynamic Workload Scheduler (DWS) for queueing and reservations.[11][26] DWS's Flex-start mode allows bursty workloads to consume Trillium capacity without long-term commitments, which Google explicitly highlights as a use case for fine-tuning and short-horizon experimentation.[1] For third-party developers, Hugging Face and Google jointly developed the open-source Optimum-TPU library, which exposes the Hugging Face Transformers and TGI ecosystems on Trillium with TPU-specific optimizations.[1][27]
For inference, Trillium is supported by Google's JetStream serving engine and by community projects including vLLM (via a unified TPU backend that lowers PyTorch and JAX to XLA). Google reports throughput above 3,500 tokens per second per Trillium node for long-sequence inference on 70-billion-parameter-class models using these stacks; per-chip throughput approaches 1,000 tokens/sec on Gemma 7B at batch size 64 (BF16) and 300-400 tokens/sec on 70B-class models at small batch sizes, scaling further as batch size grows and MXU utilization increases.[11][28] Time-to-first-token (TTFT) for single-user queries on 7B-class models is reported in the 5-20 ms range on Trillium hardware with JetStream.[28]
Google's headline claim, in its own words, is that Trillium achieves "an impressive 4.7X increase in peak compute performance per chip compared to TPU v5e."[1] At general availability the company also reported "over 4x improvement in training performance" and "up to 3x increase in inference throughput" on representative real-world workloads.[1][11] The deeper per-workload figures are broken out in the subsections that follow.
In Google's own benchmarks comparing Trillium to TPU v5e under matched software stacks, Trillium achieves up to 4x faster training on dense large language models including Llama 2-70B and GPT-3-175B, and up to 3.8x faster training on mixture-of-experts (MoE) models.[11] Across smaller dense models such as Gemma 2-9B and Llama-2-7B, training speedups of 3x or more over v5e are reported.[7] Scaling efficiency across pods is reported at 99% on a 12-pod (3,072-chip) configuration training GPT-3-175B and 94% on a 24-pod (6,144-chip) configuration, with near-99% efficiency for Llama-2-70B across 4-36 pods.[11] Across Gemma 2-27B and MaxText default-32B benchmarks, training speedups above 4x over v5e are reported.[7]
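Scaling efficiency is commonly defined as measured speedup divided by ideal linear speedup; the source does not spell out Google's exact definition, so the sketch below assumes that convention. It also converts the reported percentages into an "effective" chip count, meaning the number of perfectly scaling chips that would deliver the same throughput. Both helper names are hypothetical.

```python
def scaling_efficiency(throughput_n: float, throughput_base: float,
                       n_chips: int, base_chips: int) -> float:
    """Measured speedup divided by ideal linear speedup."""
    speedup = throughput_n / throughput_base
    ideal = n_chips / base_chips
    return speedup / ideal


def effective_chips(n_chips: int, efficiency: float) -> float:
    """Chip count that would give the same throughput at 100% efficiency."""
    return round(n_chips * efficiency, 2)


# Reported figures: 99% at 12 pods, 94% at 24 pods.
print(effective_chips(3072, 0.99))  # 3041.28
print(effective_chips(6144, 0.94))  # 5775.36
```

Even at 94% efficiency, doubling from 12 to 24 pods still raises effective capacity by roughly 1.9x under this definition.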
When training the 405-billion-parameter Llama 3.1 model with host-offloading enabled, Trillium demonstrates greater than 50% improvement in model FLOPs utilization (MFU) versus v5e at equivalent settings, illustrating the benefit of the chip's tripled host DRAM and improved compiler scheduling.[11] These figures are gathered on Google's own clusters and use the MaxText reference implementation as a baseline, allowing direct apples-to-apples comparison between v5e and v6e on identical kernels.[7][25]
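Model FLOPs utilization is the ratio of the FLOPs a training run actually sustains to the hardware's theoretical peak. The sketch below uses the common approximation of 6 FLOPs per parameter per token for dense transformer training (ignoring attention FLOPs). The throughput value and the baseline MFU in the example are hypothetical placeholders rather than measured Trillium results, and `mfu` and `relative_improvement` are illustrative helper names.

```python
def mfu(tokens_per_s: float, n_params: float, n_chips: int,
        peak_flops_per_chip: float = 918e12) -> float:
    """Model FLOPs utilization using the 6 * params * tokens approximation."""
    achieved_flops_per_s = 6 * n_params * tokens_per_s
    return achieved_flops_per_s / (n_chips * peak_flops_per_chip)


def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, e.g. 0.5 means 50% better."""
    return new / old - 1


# Hypothetical example: 40,000 tokens/s for a 405B model on 256 chips.
example = mfu(tokens_per_s=40_000, n_params=405e9, n_chips=256)
print(f"{example:.1%}")

# A ">50% MFU improvement" is a relative gain, e.g. from 0.30 to 0.45.
print(round(relative_improvement(0.45, 0.30), 2))  # 0.5
```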
Google trained Gemini 2.0 Flash and the larger frontier models in the Gemini 2 family on Trillium pods at GA, with Sundar Pichai stating that "TPUs powered 100% of Gemini 2.0 training and inference."[9][29] Although Google does not publish exact chip counts for individual Gemini training runs, the company stated at the Trillium GA event that it has deployed more than 100,000 Trillium chips into a single Jupiter-fabric cluster, and DigiTimes and other outlets have reported this as the primary substrate for the Gemini 2 generation.[12][40]
For inference on Stable Diffusion XL (SDXL), Trillium offers 3.1x higher throughput in offline configuration and 2.9x higher throughput in server configuration relative to v5e.[11] The cost per 1,000 generated images is reduced by 22% in server mode and by 27% in offline mode relative to v5e at GA pricing, with Google citing a cost of roughly 22 cents per 1,000 images at offline-mode rates.[11] For Llama-2-70B inference, Trillium delivers nearly 2x the tokens-per-second of v5e.[11]
Trillium was submitted to MLPerf Inference v5.0 in early 2025, where the chip recorded a 3.5x throughput improvement over the prior-generation submission on Stable Diffusion XL image generation.[41] Google reported "industry leading" tokens-per-second and tokens-per-dollar on the 70-billion-parameter inference task using a v6e node and the vLLM/JetStream serving stack.[11][41]
Google states that Trillium chips deliver a greater than 67% improvement in energy efficiency over TPU v5e at matched workloads, which it identifies as the largest single-generation efficiency jump in the TPU family up to that point.[1][11] In the company's words, "our sixth-generation TPUs are also our most sustainable: Trillium TPUs are over 67% more energy-efficient than TPU v5e," and it describes the chip as "the most performant and most energy-efficient TPU to date."[1] At the I/O 2024 keynote, Pichai cited the energy advance as a response to a roughly 1,000,000x increase in industry demand for machine-learning compute over the preceding six years.[6][42]
At general availability in December 2024, Trillium was made available across multiple Google Cloud regions, including locations in North America, Europe, and Asia-Pacific.[30] On-demand pricing for v6e begins around US$1.375 per chip-hour for short-term consumption, with 3-year committed-use discounts reducing the effective price to approximately US$0.55 per chip-hour and aggressive multi-region commitments going as low as US$0.39 per chip-hour.[31]
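The quoted price points translate into discounts and whole-pod costs as follows. This sketch only does arithmetic on the per-chip-hour figures cited above; actual billing depends on region, commitment terms, and current price lists, and `discount` is a hypothetical helper.

```python
ON_DEMAND = 1.375      # US$ per chip-hour, as quoted above
COMMITTED_3Y = 0.55    # US$ per chip-hour with a 3-year commitment
LOWEST_QUOTED = 0.39   # US$ per chip-hour, lowest figure quoted above


def discount(list_price: float, discounted_price: float) -> float:
    """Fractional saving relative to the list price."""
    return 1 - discounted_price / list_price


print(f"3-year commitment: {discount(ON_DEMAND, COMMITTED_3Y):.1%} off")
print(f"Lowest quoted:     {discount(ON_DEMAND, LOWEST_QUOTED):.1%} off")

# Hourly cost of a full 256-chip pod at the on-demand rate.
pod_hour_on_demand = 256 * ON_DEMAND
print(pod_hour_on_demand)  # 352.0
```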
Google's published price-performance comparisons against prior generations reflect its internal benchmark normalization and on-demand pricing at GA. Independent third-party benchmarks have reached different conclusions in some categories; for example, Artificial Analysis's hardware benchmarking reported in 2025 that, on its composite inference-cost metric, the NVIDIA Blackwell B200 achieved roughly a 5x tokens-per-dollar advantage over Trillium under its test conditions.[32]
Google states that it "used Trillium TPUs to train the new Gemini 2.0, Google's most capable AI model yet," and Pichai has said that "TPUs powered 100% of Gemini 2.0 training and inference."[9][11] Beyond Gemini, Google has disclosed that earlier and contemporaneous TPU generations have trained and served models such as Gemini 1.5 Flash, Imagen 3, and Gemma 2.[1] External customers train other model families on Trillium pods, and the full set of publicly disclosed deployments is detailed in the next section.
Although the majority of Trillium chips are consumed internally by Google for Gemini training, Search, YouTube, and other workloads, several external customers have publicly disclosed adoption:
Anthropic: Anthropic has been a Google Cloud TPU customer since the early 2020s and trains and serves its Claude family of models across several accelerator platforms, including Google TPU pods and AWS Trainium.[33][34] In October 2025, Anthropic announced an expansion of its TPU use with Google Cloud and disclosed access to up to one million TPU chips, with multi-gigawatt capacity expected to come online in 2026 and 2027 across Trillium and Ironwood.[33][35]
Apple: Apple's foundation models that power Apple Intelligence were pretrained on TPU v4 and v5p (not on Trillium), as disclosed in Apple's July 2024 foundation-models paper. Specifically, the server-side AFM-server was trained on 8,192 TPU v4 chips arranged as 8x1,024 slices, and the on-device AFM-on-device was trained on 2,048 TPU v5p chips.[19][20] While Apple has not publicly disclosed Trillium-specific deployments, the relationship is one of the most-cited demonstrations that frontier-scale foundation models can be trained without NVIDIA H100 GPUs.[36]
AI21 Labs: A long-standing TPU customer since v4, AI21 trains its Mamba- and Jamba-architecture language models on Trillium pods.[11] CTO Barak Lenz stated at GA that "Trillium will be essential in accelerating the development of our next generation of sophisticated language models."[11]
Hugging Face: Hugging Face and Google collaboratively built the Optimum-TPU library, bringing the Hugging Face model and training stack to Trillium for use by community developers.[1][27]
Deep Genomics: Uses Trillium for inference on its BigRNA model, scanning tens of millions of RNA variants for drug-discovery applications.[7]
HubX: Reported 35% latency reduction and 45% cost-per-image reduction running its MaxDiffusion-based FLUX.1 text-to-image pipeline on Trillium.[7]
Lightricks: Plans to scale its text-to-video models on Trillium after achieving 2.5x speedups on v5p; named the AI Hypercomputer stack as a key factor in adoption.[1][7]
Essential AI, Nuro, and Deloitte were named at the I/O 2024 announcement as additional Trillium customers.[1]
Industry analysis at the time of Trillium's launch positioned the chip as one of the most credible non-NVIDIA training substrates for frontier models. Trillium's 918 BF16 TFLOPS per chip is below the NVIDIA H100 SXM's 989 dense BF16 TFLOPS per device, its 1,836 INT8 TOPS is likewise below the H100 SXM's roughly 1,979 dense INT8 TOPS, and its 32 GB of HBM at 1.64 TB/s is smaller than the 80 GB at 3.35 TB/s of an H100 SXM. However, the chip's interconnect-aware system design and lower per-chip cost change the comparison at pod scale, where Trillium's 256-chip pod with 102 TB/s all-reduce bandwidth competes favorably with NVIDIA NVLink-connected H100 nodes for many distributed training topologies.[10][37] The launch of NVIDIA's Blackwell generation, including the B200, in 2024-2025 reset the per-chip comparison, with B200 offering substantially more compute and HBM3e per device; Google's response, particularly for inference, has been the Trillium-succeeding TPU Ironwood rather than a v6 die shrink.[14][32]
A 2025 independent benchmarking study by Artificial Analysis compared Trillium against B200 and AMD MI300X for inference economics on a representative LLM workload. The study reported that B200 achieved approximately 5x the tokens-per-dollar of Trillium and approximately 2x the tokens-per-dollar of MI300X under its test conditions and pricing assumptions.[32] These per-chip cost figures depend strongly on serving framework maturity and batch size, areas where Google has continued to invest in the JetStream and vLLM-TPU stacks, and the gap narrows considerably on workloads that exploit Trillium's full pod-scale bandwidth for very large model parallelism.[11][28]
Among hyperscaler-designed accelerators, Trillium is most directly compared with Amazon's AWS Trainium2 (generally available from December 2024) and Microsoft Azure's Maia 100. Compared with Trainium2, Trillium offers a larger tightly coupled domain (256 chips per 2D-torus pod versus the 16-chip Trn2 instance and 64-chip Trn2 UltraServer, which scale further via EFA networking), with much higher single-pod bisection bandwidth and a more mature compiler stack rooted in XLA and JAX.[11][17] Anthropic runs Claude training and serving across both stacks and has publicly emphasized the resulting heterogeneity as a strategic advantage rather than as a single-vendor dependency.[33][34]
TPU Ironwood, the seventh-generation TPU, was announced by Google at Cloud Next 2025 in Las Vegas on April 9, 2025 and reached general availability in late 2025.[13][14] Google explicitly positions Ironwood as "the first TPU for the age of inference," with Trillium continuing as the company's preferred training silicon for many workload types and Ironwood adding both a much larger 9,216-chip liquid-cooled pod and significantly larger per-chip resources.[13][14]
Per Google's published specifications, Ironwood delivers 192 GB of HBM per chip (6x Trillium), 7.37 TB/s of HBM bandwidth per chip (4.5x Trillium), 1.2 TB/s of bidirectional ICI bandwidth (1.5x Trillium), and approximately 4,614 FP8 TFLOPS per chip; Google describes Ironwood as offering "more than 4x better performance per chip" than Trillium for both training and inference on representative workloads, and "2x perf/watt" relative to Trillium at the system level.[13][14] A full 9,216-chip Ironwood pod delivers approximately 42.5 exaflops of compute, exceeding the publicly disclosed throughput of the fastest non-AI supercomputers at the time of launch.[14]
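The multipliers and the pod-level figure in the paragraph above can be checked with simple arithmetic on the quoted per-chip numbers. The sketch below assumes decimal units throughout and treats the pod total as the plain product of chip count and per-chip FP8 peak.

```python
# (Ironwood, Trillium) per-chip figures as quoted in this article.
hbm_gb = (192, 32)
hbm_tb_s = (7.37, 1.64)
ici_tb_s = (1.2, 0.8)

hbm_ratio = round(hbm_gb[0] / hbm_gb[1], 1)
bandwidth_ratio = round(hbm_tb_s[0] / hbm_tb_s[1], 1)
ici_ratio = round(ici_tb_s[0] / ici_tb_s[1], 1)
print(hbm_ratio, bandwidth_ratio, ici_ratio)  # 6.0 4.5 1.5

# Pod-level FP8 peak: 9,216 chips x 4,614 TFLOPS, converted to exaflops.
pod_exaflops = round(9216 * 4614 / 1e6, 1)
print(pod_exaflops)  # 42.5
```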
Despite Ironwood's larger pod size, Trillium is not retired with the new generation. Google's stated infrastructure strategy as of 2025 treats Trillium and Ironwood as complementary: Trillium remains the workhorse for training of mid- and large-size models where the 256-chip pod and lower per-chip cost are optimal, while Ironwood targets very-large-scale training jobs and high-throughput inference of frontier models such as the Gemini 2.5 Pro and Gemini 3 Pro families.[13][29] The continued availability of v6e pods alongside Ironwood reflects a more general TPU lifecycle pattern in which earlier generations remain in production long after the launch of newer parts; for example, v4 and v5p pods remained generally available throughout Trillium's first year on the market.[3][4]