Trillium (TPU v6e)
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,111 words
Trillium, also designated TPU v6e, is the sixth-generation Tensor Processing Unit (TPU) developed by Google for machine learning training and inference workloads on Cloud TPU infrastructure.[^1][^2] The "e" suffix denotes the chip's positioning as an efficiency-optimized member of the TPU family, a naming convention Google introduced with TPU v5e.[^3] Unlike the fifth generation, which Google split into two distinct silicon designs (v5e and v5p), the sixth generation was released only as a single efficiency-class part; there is no publicly announced "TPU v6p" variant, and Trillium has been the sole v6 product line through 2025.[^3][^4]
Trillium was unveiled by Google CEO Sundar Pichai on May 14, 2024 at the Google I/O developer conference in Mountain View, California, where it was announced as the silicon that would underlie the next generation of Gemini models.[^5][^6] The chip entered preview availability on Google Cloud on October 31, 2024 and reached general availability on December 12, 2024, coinciding with the public launch of Gemini 2.0 Flash.[^7][^8] Google stated at GA that "TPUs powered 100% of Gemini 2.0 training and inference."[^9]
Each Trillium chip delivers a peak of approximately 918 BF16 TFLOPS and 1,836 INT8 TOPS, is equipped with 32 GB of high-bandwidth memory (HBM) at 1,640 GB/s, and provides 800 GB/s of inter-chip interconnect (ICI) bandwidth distributed across four ports.[^10] Compared to its TPU v5e predecessor, Google reports a 4.7× increase in peak compute per chip, doubled HBM capacity, doubled HBM bandwidth, doubled ICI bandwidth, and a greater-than-67% improvement in energy efficiency.[^1][^11] Trillium chips are organized into 256-chip pods linked by a 2D torus interconnect, with multiple pods aggregated through Google's data-center-scale Jupiter network fabric, a topology Google has scaled to over 100,000 Trillium chips operating as a single training substrate.[^12]
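The generational multipliers quoted above can be turned into an implied v5e baseline by simple division. The sketch below uses only the per-chip figures and ratios stated in this article; the dictionary keys and the helper `implied_baseline` are illustrative names, and the results are back-of-envelope values rather than published v5e specifications.

```python
# Per-chip Trillium figures as quoted in this article.
TRILLIUM = {
    "bf16_tflops": 918.0,
    "int8_tops": 1836.0,
    "hbm_gb": 32.0,
    "hbm_gb_per_s": 1640.0,
    "ici_gb_per_s": 800.0,
}

# Multipliers Google reports relative to TPU v5e (also from this article).
RATIO_VS_V5E = {
    "bf16_tflops": 4.7,
    "hbm_gb": 2.0,
    "hbm_gb_per_s": 2.0,
    "ici_gb_per_s": 2.0,
}


def implied_baseline(current, ratios):
    """Divide each current figure by its reported generational multiplier."""
    return {key: current[key] / ratios[key] for key in ratios}


v5e_implied = implied_baseline(TRILLIUM, RATIO_VS_V5E)
int8_to_bf16 = TRILLIUM["int8_tops"] / TRILLIUM["bf16_tflops"]

for key, value in v5e_implied.items():
    print(f"implied v5e {key}: {value:.1f}")
print(f"INT8 / BF16 peak ratio: {int8_to_bf16:.1f}")
```

The INT8 figure is exactly twice the BF16 figure, which is the usual relationship when the same multiplier array processes operands of half the width.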
Trillium's successor, Ironwood (TPU v7), was announced at Google Cloud Next 2025 on April 9, 2025 and is positioned by Google as an inference-optimized counterpart that complements rather than replaces Trillium for training workloads.[^13][^14]
The TPU program at Google began in 2013 as an internal effort to relieve the company's data centers of the computational load imposed by the rapid growth of deep learning inference, especially for products such as Google Search, Google Photos, and Google Translate.[^15][^38] According to accounts published by Google engineers, Jeff Dean, then a Google Senior Fellow, had calculated that if 100 million Android users used voice-to-text dictation for three minutes per day, the resulting back-end inference load would require roughly doubling the computational capacity of Google's data centers at the time. The internal response was to commission a custom silicon design specifically tuned to the matrix-multiplication-heavy inner loops of deep neural network inference.[^38] Norm Jouppi, recruited from HP Labs, served as tech lead and principal architect, and the team brought TPU v1 from concept to production deployment in approximately 15 months.[^38][^15]
The first-generation TPU, designed for inference only, reached internal deployment in 2015 and was first publicly disclosed in 2016; it implemented a 256×256 systolic array of 8-bit multiply-accumulate units delivering 92 TOPS of INT8 compute and famously powered the AlphaGo match against Lee Sedol.[^15][^16] The second-generation TPU was announced at Google I/O in May 2017 and introduced support for bfloat16 training as well as the inter-chip interconnect that enabled the first TPU Pod deployments, the architectural step that opened TPUs to training workloads and not just serving.[^15][^16] TPU v3 followed in 2018 with liquid cooling and roughly double the throughput of v2.[^15] TPU v4, presented in 2021, doubled v3's performance again, introduced 3D-torus topology and optical-circuit-switched reconfiguration of pod slices, and brought the first generation of the SparseCore accelerator for embedding-heavy workloads such as recommendation models.[^15][^16]
In 2023 Google split the fifth generation into two parts with different positioning: v5e, optimized for cost-efficient training and serving in mid-size jobs, launched in August 2023; and v5p, optimized for the highest-throughput large-model training, launched in December 2023.[^17][^18] The two chips differ in topology and pod scale, with v5e using a 2D-torus 256-chip pod and v5p using a 3D-torus pod that scales up to 8,960 chips.[^17] The v5e/v5p split established the "e" (efficient) and "p" (performance) suffix convention that would define future TPU generations and shape how customers reasoned about chip choice for a given workload.[^17][^18]
Apple's foundation models for Apple Intelligence were trained on TPU v4 (the server-side AFM-server, on 8,192 v4 chips) and TPU v5p (the on-device AFM-on-device, on 2,048 v5p chips), as disclosed in Apple's foundation-models paper in July 2024. The disclosure was a notable third-party validation of Google's TPU stack for frontier-scale model training, arriving just two months after the Trillium announcement.[^19][^20]
Trillium (v6e) followed v5e and v5p as a generational refresh of the efficiency-class part; Google did not release a separate v6p chip, and the sole v6 silicon design has been used for both training and serving across the v6e generation.[^4][^11] This represented a small but consequential change of strategy: in v5, the "e" and "p" parts had differentiated form factors and topologies, whereas v6 leaned harder into a single converged design with the efficiency profile. In April 2025 Google announced its seventh-generation TPU under the codename Ironwood, positioned by the company as the first TPU "for the age of inference" — though Ironwood pods also offer the largest training cluster Google has ever shipped.[^13][^14]
Each Trillium chip contains one TensorCore, which houses two matrix-multiply units (MXUs), a vector unit, and a scalar unit.[^16] A core architectural change in v6e is the expansion of the systolic-array MXU from 128×128 (used in TPU v2 through v5p) to 256×256 multiply-accumulators.[^16][^21] The 256×256 array quadruples the number of multiply-accumulate operations executed per MXU cycle relative to v5p. Combined with a higher clock speed and other microarchitectural improvements, the change delivers approximately 4.7× the BF16 peak throughput of v5e at the chip level.[^21]
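The fourfold figure follows directly from the array dimensions. The sketch below is a simplified tiling model, not a description of how the XLA compiler actually maps work onto the MXU; the function names are illustrative. It counts multiply-accumulates per cycle for each array size and shows how many array-sized tiles a weight matrix would occupy.

```python
import math


def macs_per_cycle(side):
    """A side x side systolic array performs side * side MACs each cycle."""
    return side * side


def tiles_needed(rows, cols, side):
    """Number of side x side tiles required to cover a rows x cols matrix."""
    return math.ceil(rows / side) * math.ceil(cols / side)


def padding_fraction(rows, cols, side):
    """Fraction of the tiled area that holds padding rather than data."""
    covered = tiles_needed(rows, cols, side) * side * side
    return 1.0 - (rows * cols) / covered


OLD_SIDE, NEW_SIDE = 128, 256

print(macs_per_cycle(NEW_SIDE) / macs_per_cycle(OLD_SIDE))  # ratio of MACs per cycle
print(tiles_needed(4096, 4096, OLD_SIDE), tiles_needed(4096, 4096, NEW_SIDE))
print(padding_fraction(300, 300, OLD_SIDE), padding_fraction(300, 300, NEW_SIDE))
```

Under this model a large, well-aligned matrix needs a quarter as many tiles on the wider array, while a small or oddly shaped matrix leaves more of the wider array idle, which is one reason batch size and layer dimensions matter for MXU utilization.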
The MXUs support bfloat16 and INT8 natively, with BF16 inputs producing BF16 or FP32 accumulator outputs.[^10] An FP8 peak figure of 4,614 TFLOPS is sometimes attributed to Trillium, but that number matches the per-chip FP8 peak Google publishes for Ironwood and is roughly five times Trillium's own BF16 peak, so it is best read as an Ironwood specification rather than a Trillium one.[^16]
Trillium ships with Google's third-generation SparseCore, a dataflow processor distinct from the MXU and designed to accelerate the irregular memory-access patterns associated with large embedding tables found in ranking, recommendation, and retrieval-augmented models.[^1][^22] Where v5p included four SparseCores per chip, v6e includes two SparseCores per chip but with redesigned bandwidth and SIMD-width handling — variable-width SIMD lanes (8 elements for FP32, 16 elements for bfloat16) reduce wasted bandwidth from misaligned reads of embedding tables.[^22] Google reports a 2× improvement in embedding performance and a 5× improvement on the DLRM DCNv2 recommendation benchmark relative to v5e.[^11]
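The two lane counts quoted above describe the same vector width in bytes, since 8 four-byte FP32 elements and 16 two-byte bfloat16 elements both occupy 32 bytes. The sketch below makes that arithmetic explicit and estimates how many vector operations one embedding row would take. The fixed 32-byte width is an inference from the lane counts, not a published SparseCore specification, and the function names are illustrative.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2}
LANES = {"fp32": 8, "bf16": 16}  # lane counts quoted in the text


def vector_bytes(dtype):
    """Width of one vector operation in bytes for the given element type."""
    return LANES[dtype] * BYTES_PER_ELEMENT[dtype]


def vector_ops_per_row(embedding_dim, dtype):
    """Vector operations needed to touch one embedding row (ceiling division)."""
    lanes = LANES[dtype]
    return -(-embedding_dim // lanes)


def lane_utilization(embedding_dim, dtype):
    """Share of lanes carrying real data when reading one embedding row."""
    ops = vector_ops_per_row(embedding_dim, dtype)
    return embedding_dim / (ops * LANES[dtype])


for dtype in ("fp32", "bf16"):
    print(dtype, vector_bytes(dtype), vector_ops_per_row(128, dtype),
          round(lane_utilization(100, dtype), 3))
```

A 128-wide row takes half as many vector operations in bfloat16 as in FP32, while a row whose width is not a multiple of the lane count leaves part of the last vector unused.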
Trillium's HBM stack provides 32 GB of capacity per chip, exactly double the 16 GB found in TPU v5e, though well below the 95 GB of the performance-class v5p.[^10][^11] HBM bandwidth is 1,640 GB/s per chip (some sources round to 1,600 GB/s), again roughly doubling the v5e figure.[^10] Each Trillium host VM provides 1,536 GiB of DRAM, a 3× increase over v5e, which Google leverages for "host-offloading" of optimizer state and activations during training of large models such as Llama-3.1-405B.[^11][^23]
Each Trillium chip exposes four ICI ports for a combined 800 GB/s of inter-chip interconnect bandwidth, twice the per-chip ICI capacity of v5e.[^10] Within a single pod the chips are arranged in a 2D torus, the same topology family used by v2, v3, and v5e (in contrast to the 3D torus used by v4 and v5p).[^16][^22] Total bisection bandwidth within a Trillium pod is 3.2 TB/s and the pod-wide all-reduce bandwidth is 102.4 TB/s.[^10]
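The per-chip and per-pod bandwidth figures above can be related with a few divisions. The sketch below only rearranges the numbers quoted in this section; how Google defines the aggregate all-reduce and bisection figures is not stated here, so the derived ratios are observations about the published numbers rather than an explanation of the measurement method.

```python
CHIPS_PER_POD = 256
ICI_PER_CHIP_GB_S = 800.0
ICI_PORTS = 4
POD_ALL_REDUCE_GB_S = 102_400.0  # 102.4 TB/s
POD_BISECTION_GB_S = 3_200.0     # 3.2 TB/s

per_port_gb_s = ICI_PER_CHIP_GB_S / ICI_PORTS
all_reduce_per_chip_gb_s = POD_ALL_REDUCE_GB_S / CHIPS_PER_POD
all_reduce_share_of_ici = all_reduce_per_chip_gb_s / ICI_PER_CHIP_GB_S
bisection_in_port_equivalents = POD_BISECTION_GB_S / per_port_gb_s

print(f"bandwidth per ICI port: {per_port_gb_s:.0f} GB/s")
print(f"pod all-reduce figure per chip: {all_reduce_per_chip_gb_s:.0f} GB/s")
print(f"share of per-chip ICI: {all_reduce_share_of_ici:.2f}")
print(f"bisection in port-equivalents: {bisection_in_port_equivalents:.0f}")
```

The pod-wide all-reduce figure works out to half of the per-chip ICI figure multiplied by the chip count, and the bisection figure equals sixteen ports' worth of bandwidth, one per row or column of a 16×16 arrangement.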
A standard Trillium server board hosts four v6e chips per host CPU socket; the boards publicly displayed at SC24 showed four chips alongside their HBM stacks, with each accelerator served by a dedicated host network interface.[^24] A full host machine pairs eight chips with 1,536 GiB of DRAM and 4×200 Gbps host network connectivity, identified in Google Cloud as the ct6e-standard-8t machine type.[^10] Smaller machine types, including ct6e-standard-1t for a single chip and ct6e-standard-4t for a 2×2 slice, are also available for development and small-scale serving.[^10]
The 256-chip pod ("Trillium-256") is interconnected as a 16×16 2D torus and serves as the elementary "slice" of Trillium capacity.[^10] Supported user-visible slice topologies range from 1×1 (a single chip) up to 16×16 (256 chips); the 2×4 (eight-chip) configuration is highlighted by Google as the recommended sweet spot for inference on a single VM.[^10]
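A 2D torus gives every chip four neighbours, with links wrapping around at the edges, which lines up with the four ICI ports per chip. The sketch below models a full 16×16 pod; the coordinate scheme and function names are illustrative, and whether smaller slices retain wraparound links is not stated in this article.

```python
def torus_neighbors(x, y, width=16, height=16):
    """Return the wraparound neighbours of chip (x, y) in a 2D torus."""
    return {
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    }


def torus_hops(a, b, width=16, height=16):
    """Minimum hop count between two chips, using wraparound links."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


print(sorted(torus_neighbors(0, 0)))   # corner chip still has four neighbours
print(torus_hops((0, 0), (15, 15)))    # adjacent via wraparound in both axes
print(torus_hops((0, 0), (8, 8)))      # farthest chip from (0, 0)
```

Because of the wraparound links, the chip at the opposite corner is only two hops away, and the worst-case distance in a 16×16 torus is 16 hops rather than the 30 a plain mesh would need.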
Beyond the 256-chip pod, multiple pods can be stitched together with Google's "multislice" software stack and Titanium infrastructure offloads. Titanium uses host adapters, the data-center-wide rail-aligned network, and other accelerator-side offloads to extend Trillium training jobs across pod boundaries.[^11][^12] At the cluster level, Google has connected more than 100,000 Trillium chips through its Jupiter network fabric, which provides 13 petabits per second of bisection bandwidth — large enough to serve a single distributed training job at hundreds of thousands of accelerators.[^11][^12]
Trillium uses the same XLA, JAX, and PyTorch/XLA software stack as previous TPU generations, an intentional design decision that minimizes migration cost for customers moving from v5e or v5p.[^25] Workloads are typically expressed in JAX or PyTorch (via PyTorch/XLA) and lowered to the XLA compiler, which targets the TPU instruction set and is maintained by Google as the canonical TPU code generator.[^25] Google also supports TensorFlow on TPU and ships reference distributed-training stacks including MaxText, an open-source LLM pre-training and post-training framework written in Python and JAX, and MaxDiffusion for diffusion models.[^25][^7]
The XLA stack received meaningful improvements alongside Trillium's launch. The compiler exposes new scheduling and host-offloading passes that exploit Trillium's 3× larger host DRAM by moving optimizer state, activations, and infrequently-accessed weights off-chip to host memory, returning them just in time for the next computational step. This is the mechanism behind Google's reported >50% MFU improvement on Llama-3.1-405B training.[^11][^39] A second XLA improvement, "collections scheduling," coordinates multiple TPU slices serving inference requests so that batched prefill and decode phases can run on different slice subsets without losing throughput.[^11]
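A rough memory budget shows why host offloading matters at this model size. The sketch below assumes a common mixed-precision Adam layout (bfloat16 weights, FP32 master weights, two FP32 optimizer moments) and a single 256-chip pod with eight chips per host; the byte counts and the choice of what to offload are illustrative assumptions, not Google's published training recipe, and activations are ignored entirely.

```python
GB = 10**9    # decimal gigabyte, used here for HBM
GIB = 2**30   # binary gibibyte, used here for host DRAM

PARAMS = 405e9  # parameter count of Llama-3.1-405B

# ASSUMPTION: a typical mixed-precision Adam layout, in bytes per parameter.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "fp32_master_weights": 4,
    "adam_moments": 8,
}
OFFLOADED = ("fp32_master_weights", "adam_moments")  # hypothetical choice

CHIPS = 256
CHIPS_PER_HOST = 8
HBM_PER_CHIP = 32 * GB
DRAM_PER_HOST = 1536 * GIB

hosts = CHIPS // CHIPS_PER_HOST
pod_hbm = CHIPS * HBM_PER_CHIP
pod_dram = hosts * DRAM_PER_HOST

total_state = PARAMS * sum(BYTES_PER_PARAM.values())
offloaded_state = PARAMS * sum(BYTES_PER_PARAM[name] for name in OFFLOADED)
resident_state = total_state - offloaded_state

print(f"state without offloading: {total_state / pod_hbm:.1%} of pod HBM")
print(f"state with offloading:    {resident_state / pod_hbm:.1%} of pod HBM")
print(f"offloaded state:          {offloaded_state / pod_dram:.1%} of pod host DRAM")
```

Under these assumptions, weights and optimizer state alone would occupy most of a pod's HBM, whereas moving the optimizer state to host DRAM leaves the bulk of HBM free for activations and larger batches.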
Higher-level orchestration on Trillium is provided through Google's "AI Hypercomputer" stack, which combines the chip and pod with Pathways (Google's runtime for orchestrating large-scale TPU jobs across hundreds of pods), Google Kubernetes Engine (GKE) integration, and Dynamic Workload Scheduler (DWS) for queueing and reservations.[^11][^26] DWS's Flex-start mode allows bursty workloads to consume Trillium capacity without long-term commitments, which Google explicitly highlights as a use case for fine-tuning and short-horizon experimentation.[^1] For third-party developers, Hugging Face and Google jointly developed the open-source Optimum-TPU library, which exposes the Hugging Face Transformers and TGI ecosystems on Trillium with TPU-specific optimizations.[^1][^27]
For inference, Trillium is supported by Google's JetStream serving engine and by community projects including vLLM (via a unified TPU backend that lowers PyTorch and JAX to XLA). Google reports throughput above 3,500 tokens per second per Trillium node for long-sequence inference on 70-billion-parameter-class models using these stacks; per-chip throughput approaches 1,000 tokens/sec on Gemma 7B at batch size 64 (BF16) and 300–400 tokens/sec on 70B-class models at small batch sizes, scaling further as batch size grows and MXU utilization increases.[^11][^28] Time-to-first-token (TTFT) for single-user queries on 7B-class models is reported in the 5–20 ms range on Trillium hardware with JetStream.[^28]
In Google's own benchmarks comparing Trillium to TPU v5e under matched software stacks, Trillium achieves up to 4× faster training on dense large language models including Llama-2-70B and GPT-3-175B, and up to 3.8× faster training on mixture-of-experts (MoE) models.[^11] Across smaller dense models such as Gemma 2-9B and Llama-2-7B, training speedups of 3× or more over v5e are reported.[^7] Scaling efficiency across pods is reported at 99% on a 12-pod (3,072-chip) configuration training GPT-3-175B and 94% on a 24-pod (6,144-chip) configuration, with near-99% efficiency for Llama-2-70B across 4–36 pods.[^11] Across Gemma 2-27B and MaxText default-32B benchmarks, training speedups above 4× over v5e are reported.[^7]
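Scaling efficiency is conventionally the measured throughput at scale divided by what perfect linear scaling from a baseline would give. The sketch below applies that usual definition to the pod counts quoted above; the baseline configuration Google used is not stated in this article, so the function signatures and the "effective chips" interpretation are illustrative.

```python
CHIPS_PER_POD = 256


def chips(pods):
    """Total chips in a multi-pod configuration."""
    return pods * CHIPS_PER_POD


def scaling_efficiency(throughput_at_scale, baseline_throughput, scale_factor):
    """Measured throughput relative to perfect linear scaling from a baseline."""
    return throughput_at_scale / (baseline_throughput * scale_factor)


def effective_chips(pods, efficiency):
    """Chips' worth of useful work delivered at a given scaling efficiency."""
    return chips(pods) * efficiency


print(chips(12), chips(24))
print(round(effective_chips(12, 0.99)), round(effective_chips(24, 0.94)))
print(scaling_efficiency(throughput_at_scale=188.0, baseline_throughput=100.0, scale_factor=2))
```

At 94% efficiency, a 24-pod job delivers roughly the work of 5,775 ideally scaled chips, so the drop from 99% to 94% costs on the order of one and a half pods of capacity.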
When training the 405-billion-parameter Llama 3.1 model with host-offloading enabled, Trillium demonstrates greater than 50% improvement in model FLOPs utilization (MFU) versus v5e at equivalent settings, illustrating the benefit of the chip's tripled host DRAM and improved compiler scheduling.[^11] These figures are gathered on Google's own clusters and use the MaxText reference implementation as a baseline, allowing direct apples-to-apples comparison between v5e and v6e on identical kernels.[^7][^25]
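MFU is the ratio of the FLOPs a model actually needs per second to the hardware's peak. The sketch below uses the widely used approximation of about 6 FLOPs per parameter per training token for dense transformers; the throughput value and the before/after MFU pair are hypothetical inputs chosen for illustration, not measured results, and the 50% figure is read here as a relative improvement.

```python
PEAK_BF16_FLOPS = 918e12  # Trillium per-chip BF16 peak quoted in this article


def mfu(tokens_per_sec_per_chip, n_params, peak_flops=PEAK_BF16_FLOPS):
    """Model FLOPs utilization using the ~6 * N FLOPs-per-token approximation."""
    model_flops_per_sec = 6.0 * n_params * tokens_per_sec_per_chip
    return model_flops_per_sec / peak_flops


def relative_gain(new_value, old_value):
    """Relative improvement of new_value over old_value (0.5 means +50%)."""
    return new_value / old_value - 1.0


# Hypothetical throughput for a 405B-parameter model on one chip.
example_mfu = mfu(tokens_per_sec_per_chip=150.0, n_params=405e9)
print(f"MFU at 150 tokens/s/chip: {example_mfu:.1%}")

# Hypothetical MFU pair illustrating a 50% relative improvement.
print(f"relative gain: {relative_gain(0.45, 0.30):.0%}")
```

Because MFU is normalized by each chip's own peak, a higher MFU on Trillium compounds with its higher peak throughput when comparing wall-clock training time against v5e.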
Google trained Gemini 2.0 Flash and the larger frontier models in the Gemini 2 family on Trillium pods at GA, with Sundar Pichai stating that "TPUs powered 100% of Gemini 2.0 training and inference."[^9][^29] Although Google does not publish exact chip counts for individual Gemini training runs, the company stated at the Trillium GA event that it had deployed more than 100,000 Trillium chips into a single Jupiter-fabric cluster, and DigiTimes and other outlets have reported this as the primary substrate for the Gemini 2 generation.[^12][^40]
For inference on Stable Diffusion XL, Trillium offers 3.1× higher throughput in offline configuration and 2.9× higher throughput in server configuration relative to v5e.[^11] The cost per 1,000 generated images is reduced by 22% in server mode and by 27% in offline mode relative to v5e at GA pricing, corresponding to roughly 22 cents per 1,000 images at offline-mode rates.[^11] For Llama-2-70B inference, Trillium delivers nearly 2× the tokens-per-second of v5e.[^11]
Trillium was submitted to MLPerf Inference v5.0 in early 2025, where the chip recorded a 3.5× throughput improvement over the prior-generation submission on Stable Diffusion XL image generation.[^41] Google reported "industry leading" tokens-per-second and tokens-per-dollar on the 70-billion-parameter inference task using a v6e node and the vLLM/JetStream serving stack.[^11][^41]
Trillium chips deliver greater than 67% improvement in energy efficiency over TPU v5e at matched workloads, which Google identifies as the largest single-generation efficiency jump in the TPU family up to that point.[^1][^11] Pichai cited the energy advance as a response to a "1,000,000×" increase in industry demand for machine-learning compute over the preceding six years, and described Trillium as "the most efficient and best-performing TPU to date" at the I/O 2024 keynote.[^6][^42]
At general availability in December 2024, Trillium was made available across multiple Google Cloud regions, including locations in North America, Europe, and Asia-Pacific.[^30] On-demand pricing for v6e begins around US$1.375 per chip-hour for short-term consumption, with 3-year committed-use discounts reducing the effective price to approximately US$0.55 per chip-hour and aggressive multi-region commitments going as low as US$0.39 per chip-hour.[^31]
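The price points above translate into discount levels and slice costs with simple arithmetic. The sketch below uses only the three per-chip-hour prices quoted in this section; the function names are illustrative, and real bills depend on region, commitment terms, and any additional charges not covered here.

```python
ON_DEMAND_USD = 1.375   # per chip-hour, as quoted above
CUD_3YR_USD = 0.55      # 3-year committed-use price, as quoted above
LOWEST_USD = 0.39       # lowest quoted committed price


def discount(price, base=ON_DEMAND_USD):
    """Fractional discount of a price relative to the on-demand rate."""
    return 1.0 - price / base


def slice_cost(chips, hours, price_per_chip_hour):
    """Cost of running a slice of the given size for the given duration."""
    return chips * hours * price_per_chip_hour


print(f"3-year CUD discount: {discount(CUD_3YR_USD):.0%}")
print(f"lowest quoted discount: {discount(LOWEST_USD):.1%}")
print(f"256-chip pod, 24 h on demand: ${slice_cost(256, 24, ON_DEMAND_USD):,.0f}")
print(f"256-chip pod, 24 h at 3-year CUD: ${slice_cost(256, 24, CUD_3YR_USD):,.0f}")
```

At the quoted rates, a full pod costs several thousand dollars per day on demand, and the three-year commitment cuts that by about 60%.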
In published price-performance comparisons against the prior generations, Google reports:
These figures reflect Google's internal benchmark normalization and on-demand pricing at GA. Independent third-party benchmarks have reached different conclusions in some categories; for example, Artificial Analysis's hardware benchmarking reported in 2025 that, on their composite inference-cost metric, the NVIDIA Blackwell B200 achieved roughly a 5× tokens-per-dollar advantage over Trillium under their test conditions.[^32]
Although the majority of Trillium chips are consumed internally by Google for Gemini training, Search, YouTube, and other workloads, several external customers have publicly disclosed adoption:
Anthropic — Anthropic has been a Google Cloud TPU customer since the early 2020s and trains its Claude family of models predominantly on TPU pods, with secondary capacity on AWS Trainium.[^33][^34] In October 2025, Anthropic announced an expansion of its TPU use with Google Cloud and disclosed access to up to one million TPU chips, with multi-gigawatt capacity expected to come online in 2026 and 2027 across Trillium and Ironwood.[^33][^35]
Apple — Apple's foundation models that power Apple Intelligence were pretrained on TPU v4 and v5p, not on Trillium, as disclosed in Apple's July 2024 foundation-models paper. Specifically, the server-side AFM-server was trained on 8,192 TPU v4 chips arranged as 8×1,024 slices, and the on-device AFM-on-device was trained on 2,048 TPU v5p chips.[^19][^20] While Apple has not publicly disclosed Trillium-specific deployments, the relationship is one of the most-cited demonstrations that frontier-scale foundation models can be trained without NVIDIA H100 GPUs.[^36]
AI21 Labs — A long-standing TPU customer since v4, AI21 trains its Mamba- and Jamba-architecture language models on Trillium pods.[^11] CTO Barak Lenz stated at GA that "Trillium will be essential in accelerating the development of our next generation of sophisticated language models."[^11]
Hugging Face — Hugging Face and Google collaboratively built the Optimum-TPU library, bringing the Hugging Face model and training stack to Trillium for use by community developers.[^1][^27]
Deep Genomics — Uses Trillium for inference on its BigRNA model, scanning tens of millions of RNA variants for drug-discovery applications.[^7]
HubX — Reported 35% latency reduction and 45% cost-per-image reduction running its MaxDiffusion-based FLUX.1 text-to-image pipeline on Trillium.[^7]
Lightricks — Plans to scale its text-to-video models on Trillium after achieving 2.5× speedups on v5p; named the AI Hypercomputer stack as a key factor in adoption.[^1][^7]
Essential AI, Nuro, and Deloitte were named at the I/O 2024 announcement as additional Trillium customers.[^1]
Industry analysis at the time of Trillium's launch positioned the chip as one of the most credible non-NVIDIA training substrates for frontier models. Trillium's 918 BF16 TFLOPS per chip is below the NVIDIA H100's 989 BF16 TFLOPS per device, its 1,836 INT8 TOPS is likewise slightly below the H100 SXM's dense INT8 peak of roughly 1,979 TOPS, and its 32 GB of HBM at 1.64 TB/s is smaller than the 80 GB at 3.35 TB/s of an H100 SXM. However, the chip's interconnect-aware system design and lower per-chip cost change the comparison at pod scale, where Trillium's 256-chip pod with 102 TB/s all-reduce bandwidth competes favorably with NVIDIA NVLink-connected H100 nodes for many distributed training topologies.[^10][^37] The launch of NVIDIA's Blackwell architecture and the B200 in 2024–2025 reset the per-chip comparison, with B200 offering substantially more compute and HBM3e per device; Google's response, particularly for inference, has been Trillium's successor Ironwood rather than a v6 die shrink.[^14][^32]
A 2025 independent benchmarking study by Artificial Analysis compared Trillium against B200 and AMD MI300X for inference economics on a representative LLM workload. The study reported that B200 achieved approximately 5× the tokens-per-dollar of Trillium and approximately 2× the tokens-per-dollar of MI300X under their test conditions and pricing assumptions.[^32] These per-chip cost figures depend strongly on serving framework maturity and batch size, areas where Google has continued to invest in the JetStream and vLLM-TPU stacks, and the gap narrows considerably on workloads that exploit Trillium's full pod-scale bandwidth for very large model parallelism.[^11][^28]
Among hyperscaler-designed accelerators, Trillium is most directly compared with AWS Trainium2 from Amazon (generally available from December 2024) and Microsoft Azure's Maia 100. Compared with Trainium2's UltraServer configuration, Trillium is positioned at a larger pod scale (256 chips per 2D-torus pod versus Trainium2's 64-chip UltraServer scaled via EFA networking), with much higher single-pod bisection bandwidth and a more mature compiler stack rooted in XLA and JAX.[^11][^17] Anthropic runs Claude training and serving across both stacks and has publicly emphasized the resulting heterogeneity as a strategic advantage rather than as a single-vendor dependency.[^33][^34]
Ironwood, the seventh-generation TPU, was announced by Google at Cloud Next 2025 in Las Vegas on April 9, 2025 and reached general availability in late 2025.[^13][^14] Google explicitly positions Ironwood as "the first TPU for the age of inference," with Trillium continuing as the company's preferred training silicon for many workload types and Ironwood adding both a much larger 9,216-chip liquid-cooled pod and significantly larger per-chip resources.[^13][^14]
Per Google's published specifications, Ironwood delivers 192 GB of HBM per chip (6× Trillium), 7.37 TB/s of HBM bandwidth per chip (4.5× Trillium), 1.2 TB/s of bidirectional ICI bandwidth (1.5× Trillium), and approximately 4,614 FP8 TFLOPS per chip. Google describes Ironwood as offering "more than 4× better performance per chip" than Trillium for both training and inference on representative workloads, and "2× perf/watt" relative to Trillium at the system level.[^13][^14] A full 9,216-chip Ironwood pod delivers approximately 42.5 exaflops of FP8 compute, exceeding the publicly disclosed throughput of the fastest non-AI supercomputers at the time of launch.[^14]
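The multipliers and the pod-level figure above can be checked against the per-chip numbers quoted in this article. The sketch below performs only that arithmetic; the dictionary keys are illustrative names, and the pod total is a peak FP8 figure, which is not directly comparable with the double-precision results usually quoted for conventional supercomputers.

```python
TRILLIUM = {"hbm_gb": 32.0, "hbm_tb_per_s": 1.64, "ici_tb_per_s": 0.8}
IRONWOOD = {
    "hbm_gb": 192.0,
    "hbm_tb_per_s": 7.37,
    "ici_tb_per_s": 1.2,
    "fp8_tflops": 4614.0,
}
IRONWOOD_POD_CHIPS = 9216

# Ratio of each shared per-chip figure, Ironwood over Trillium.
ratios = {key: IRONWOOD[key] / TRILLIUM[key] for key in TRILLIUM}

# Peak pod compute: chips * TFLOPS per chip, converted to exaflops.
pod_exaflops = IRONWOOD_POD_CHIPS * IRONWOOD["fp8_tflops"] / 1e6

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
print(f"Ironwood pod peak: {pod_exaflops:.1f} FP8 exaflops")
```

The quoted 6×, 4.5×, and 1.5× multipliers and the 42.5-exaflop pod figure are all consistent with the per-chip numbers.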
Despite Ironwood's larger pod size, Trillium is not retired with the new generation. Google's stated infrastructure strategy as of 2025 treats Trillium and Ironwood as complementary: Trillium remains the workhorse for training of mid- and large-size models where the 256-chip pod and lower per-chip cost are optimal, while Ironwood targets very-large-scale training jobs and high-throughput inference of frontier models such as the Gemini 2.5 Pro and Gemini 3 Pro families.[^13][^29] The continued availability of v6e pods alongside Ironwood reflects a more general TPU lifecycle pattern in which earlier generations remain in production long after the launch of newer parts; for example, v4 and v5p pods remained generally available throughout Trillium's first year on the market.[^3][^4]