AWS Trainium 3
Last reviewed
Sources
16 citations
Review status
Source-backed
Revision
v2 · 3,587 words
AWS Trainium 3 (also written as Trainium3 and abbreviated Trn3) is the third-generation custom AI training and inference accelerator from Amazon Web Services, designed by Amazon's in-house chip team Annapurna Labs and the first AWS-designed AI chip built on a 3 nanometer process. AWS made the chip generally available in Amazon EC2 Trn3 UltraServers on December 1, 2025, stating that Trn3 UltraServers deliver up to 4.4 times the compute performance, 3.9 times the memory bandwidth, and 4 times the energy efficiency of the prior-generation Trn2 UltraServer.[1][2] Each chip delivers 2.52 petaflops of dense FP8 compute, carries 144 GB of HBM3e memory, and provides 4.9 TB/s of memory bandwidth, roughly doubling the per-chip FP8 throughput of AWS Trainium 2.[1][3]
Trainium 3 was first previewed at AWS re:Invent in December 2024 and formally introduced with full specifications at re:Invent in December 2025.[2][4] At the system level, Trn3 UltraServers scale from 64 chips in the Gen1 configuration to 144 chips in the Gen2 configuration, with the larger system reaching 362 petaflops of FP8 compute, 20.7 TB of HBM3e capacity, and 706 TB/s of aggregate memory bandwidth.[1][3] Customers including Anthropic, Karakuri, Metagenomi, NetoAI, Ricoh, and Splash Music have reported "reducing training and inference costs by up to 50%" relative to alternative accelerators, according to AWS.[1]
AWS designed its first Trainium accelerator in 2020 and shipped it in EC2 Trn1 instances in 2022 as a deep learning training chip aimed at lowering the cost of large model training relative to general-purpose GPUs. The original chip used two NeuronCore-v2 cores with 32 GiB of HBM2e on a 7 nanometer process. The second-generation AWS Trainium 2, announced at re:Invent 2023 and made generally available in late 2024, was a wholesale redesign for the generative AI era: it raised the core count from two to eight with NeuronCore-v3, tripled on-package memory to 96 GiB of HBM3e, and introduced the Trn2 UltraServer, a rack-scale shared-memory domain of 64 chips connected by NeuronLink-v3. Trainium 2 became the basis of Project Rainier, a cluster of nearly 500,000 chips that AWS brought online in less than twelve months in partnership with Anthropic for training the Claude family of models; AWS says the cluster provides more than five times the compute Anthropic used to train its previous models.[5]
Amazon previewed Trainium 3 during the Trainium 2 launch at re:Invent 2024, telling attendees that the next chip would use a 3 nanometer process, would offer roughly four times the performance of Trainium 2, and would be available in late 2025 with fuller volumes in early 2026.[4] Annapurna Labs worked closely with Anthropic on the silicon design in a collaboration that ran for more than two years, with the AI safety company providing direct input into the chip's instruction set architecture and its interconnect fabric.[6] The program reflected a broader strategic shift inside AWS toward owning more of the AI compute stack, with a design philosophy that explicitly optimizes performance per total cost of ownership rather than peak performance.[7]
Trainium 3 is fabricated on TSMC's N3P process, the company's performance-enhanced 3 nanometer node, a step up from the N5 node used for Trainium 2.[7] SemiAnalysis notes that Trainium 3 is one of the first adopters of N3P, alongside Nvidia's Vera Rubin and AMD's MI450X active interposer die.[7] The accelerator is packaged using TSMC's CoWoS-R variant, which employs an organic thin-film interposer (six copper redistribution layers on a polymer substrate) rather than the silicon interposer used in CoWoS-S packaging on high-end GPUs.[7] The package is composed of two CoWoS-R assemblies rather than one large interposer, a topology AWS chose to prioritize yield and supply continuity given industry-wide competition for advanced packaging supply through 2025 and 2026.[7]
The package integrates four stacks of HBM3e in a 12-high configuration, each providing 36 GiB for a total of 144 GiB per chip, versus the 8-high stacks on Trainium 2.[7] Pin speeds reach 9.6 gigabits per second, which SemiAnalysis described as "the highest HBM3E pin speeds we've seen yet" in 2025 and a substantial step up from the 5.7 gigabits per second used on Trainium 2.[7]
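The pin speeds quoted above can be tied back to the per-chip bandwidth figures with a short calculation. The sketch below is illustrative only: it assumes a 1024-bit data interface per HBM3e stack (the standard HBM3 width, not stated in this article) and assumes Trainium 2 also uses four stacks, which is consistent with 96 GiB spread over 24 GiB 8-high stacks but is likewise not stated here.

```python
# Peak HBM bandwidth = stacks x interface width (bits) x pin speed (Gb/s) / 8.
# ASSUMPTION: each HBM3e stack exposes a 1024-bit data interface; the article
# does not give the bus width.
HBM_INTERFACE_BITS = 1024


def hbm_bandwidth_tbps(stacks: int, pin_speed_gbps: float) -> float:
    """Return peak bandwidth in decimal terabytes per second."""
    gigabits_per_second = stacks * HBM_INTERFACE_BITS * pin_speed_gbps
    return gigabits_per_second / 8 / 1000  # bits -> bytes, GB/s -> TB/s


trn3_tbps = hbm_bandwidth_tbps(stacks=4, pin_speed_gbps=9.6)
# ASSUMPTION: four stacks on Trainium 2 as well (not stated in the article).
trn2_tbps = hbm_bandwidth_tbps(stacks=4, pin_speed_gbps=5.7)

print(f"Trainium 3: {trn3_tbps:.2f} TB/s")
print(f"Trainium 2: {trn2_tbps:.2f} TB/s")
```

Under those assumptions the results round to the 4.9 TB/s and 2.9 TB/s per-chip bandwidths cited for the two generations.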
Each Trainium 3 accelerator contains eight NeuronCore-v4 cores, the same count as Trainium 2 but with substantially upgraded internal organization.[8] A NeuronCore-v4 consists of four execution engines: a Tensor Engine for matrix and convolution operations, a Vector Engine for elementwise operations and reductions, a Scalar Engine for control flow, and a GPSIMD Engine for general-purpose work such as sorts and gathers.[8] The core includes 32 MiB of SBUF software-managed scratchpad memory (up from 28 MiB on NeuronCore-v3) and a 2 MiB PSUM buffer for accumulating partial sums during matrix multiplication.[8]
The Tensor Engine contains two separate systolic arrays optimized for different numeric formats. A 128 by 128 BF16 array handles bfloat16, float16, TF32, and FP32 computations, while a larger 512 by 128 array is dedicated to the OCP-compliant MXFP8 and MXFP4 microscaling formats, doubling the MXFP8 capability of Trainium 2.[7] Per Tensor Engine, the core delivers 315 teraflops of MXFP8 or MXFP4 throughput, 79 teraflops of BF16/FP16/TF32, and 20 teraflops of FP32. With eight cores per chip, a single Trainium 3 device reaches 2.52 petaflops of dense MXFP8 or MXFP4 performance.[3][7] Notably, BF16 throughput per chip is roughly unchanged from Trainium 2 at around 0.65 petaflops. AWS put the majority of the additional silicon area into MXFP8 and MXFP4 capability rather than scaling all numeric formats uniformly, on the basis that production-scale generative AI workloads in 2025 and 2026 are increasingly dominated by FP8 training, FP8 inference, and FP4 inference.[7]
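The per-chip throughput figures are the per-core numbers multiplied by the eight cores on each chip. The sketch below reproduces that arithmetic and also compares the two systolic array sizes. The observation that the array-size ratio matches the throughput ratio is arithmetic on the figures above, not a claim taken from the cited sources.

```python
CORES_PER_CHIP = 8

# Per-core Tensor Engine throughput in teraflops, as quoted in the text.
PER_CORE_TFLOPS = {
    "MXFP8/MXFP4": 315,
    "BF16/FP16/TF32": 79,
    "FP32": 20,
}

# Scale to a whole chip and convert teraflops to petaflops.
per_chip_pflops = {
    fmt: tflops * CORES_PER_CHIP / 1000 for fmt, tflops in PER_CORE_TFLOPS.items()
}

# Compare the sizes of the two systolic arrays (cells = rows x columns).
array_ratio = (512 * 128) / (128 * 128)
throughput_ratio = PER_CORE_TFLOPS["MXFP8/MXFP4"] / PER_CORE_TFLOPS["BF16/FP16/TF32"]

for fmt, pflops in per_chip_pflops.items():
    print(f"{fmt}: {pflops:.3f} PF per chip")
print(f"array size ratio: {array_ratio:.1f}, throughput ratio: {throughput_ratio:.2f}")
```

The MXFP8 result is exactly the 2.52 PF headline figure. The BF16 product comes out near 0.63 PF, slightly under the rounded "around 0.65 petaflops" quoted above.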
Trainium 3 integrates 144 GiB of HBM3e per chip across four stacks, providing 4.9 TB/s of memory bandwidth.[3] The capacity is 1.5 times that of Trainium 2 and the bandwidth is roughly 70 percent higher. The larger memory footprint allows Trainium 3 to hold significantly larger model weights, optimizer states, and KV caches per chip, which reduces the number of chips that must be enlisted simply to fit a given model and improves utilization on inference workloads with long context windows.
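As a rough illustration of the capacity point, the sketch below estimates the minimum number of chips needed just to hold a model's weights. The model size and the one-byte-per-parameter figure are hypothetical inputs chosen for the example, and the estimate deliberately ignores optimizer states, activations, KV caches, and any per-block scale overhead, so real deployments would need more.

```python
import math

GIB = 2**30  # bytes per gibibyte


def min_chips_for_weights(num_params: int, bytes_per_param: float,
                          hbm_gib_per_chip: int) -> int:
    """Lower bound on chips needed to fit the weights alone in HBM."""
    weight_bytes = num_params * bytes_per_param
    return math.ceil(weight_bytes / (hbm_gib_per_chip * GIB))


# HYPOTHETICAL workload: one trillion parameters stored at one byte each.
PARAMS = 10**12
BYTES_PER_PARAM = 1

trn2_chips = min_chips_for_weights(PARAMS, BYTES_PER_PARAM, hbm_gib_per_chip=96)
trn3_chips = min_chips_for_weights(PARAMS, BYTES_PER_PARAM, hbm_gib_per_chip=144)

print(f"Trainium 2 (96 GiB):  at least {trn2_chips} chips")
print(f"Trainium 3 (144 GiB): at least {trn3_chips} chips")
```

For this hypothetical model the weights-only floor drops from 10 chips to 7, in line with the 1.5 times capacity increase.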
The table below compares the key per-chip specifications of Trainium 3 against its predecessors.
| Specification | AWS Trainium | AWS Trainium 2 | AWS Trainium 3 |
|---|---|---|---|
| Launch year | 2022 | 2024 | 2025 |
| Process node | TSMC N7 | TSMC N5 | TSMC N3P |
| NeuronCore version | v2 (2 cores) | v3 (8 cores) | v4 (8 cores) |
| FP8 / MXFP8 compute (per chip) | ~0.19 PF (FP16) | 1.3 PF | 2.52 PF |
| BF16 compute (per chip) | 0.19 PF | 0.65 PF | 0.65 PF |
| HBM capacity | 32 GiB HBM2e | 96 GiB HBM3e | 144 GiB HBM3e |
| HBM bandwidth | 0.82 TB/s | 2.9 TB/s | 4.9 TB/s |
| HBM pin speed | 3.2 Gb/s | 5.7 Gb/s | 9.6 Gb/s |
| Packaging | Standard | CoWoS-R | CoWoS-R |
| Scale-up domain | 16 chips (Trn1) | 64 chips | 64 or 144 chips |
A Trn3 UltraServer is a rack-scale system that ties multiple Trainium 3 chips into a single shared-memory scale-up domain. AWS ships Trn3 UltraServers in two configurations: a Gen1 with 64 chips and a Gen2 with 144 chips.[1][3] Both expose a single logical memory and compute pool, which lets large model parallelism strategies place tensor parallel groups and pipeline parallel stages across the entire scale-up domain without crossing network protocol boundaries.
Each Trn3 server, internally referred to as a sled, contains four Trainium 3 chips connected through an intra-server PCIe switch. Every chip exposes four PCIe Gen6 x8 links to this switch, providing 256 GB/s of bidirectional bandwidth within a sled.[7] The Gen1 UltraServer uses 16 sleds for 64 chips, while the Gen2 uses 36 sleds for 144 chips.
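The sled and link counts above can be checked with a few lines of arithmetic. The 64 GT/s per-lane rate is an assumption taken from the PCIe 6.0 raw signaling rate rather than from this article, and the calculation treats it as 64 Gb/s per lane with no encoding or protocol overhead. It shows only that the raw lane count lines up with the 256 GB/s figure, not how that figure is divided between directions.

```python
CHIPS_PER_SLED = 4
SLEDS_PER_ULTRASERVER = {"Gen1": 16, "Gen2": 36}

# Chips per UltraServer follow directly from the sled counts.
chips_per_ultraserver = {
    gen: sleds * CHIPS_PER_SLED for gen, sleds in SLEDS_PER_ULTRASERVER.items()
}

LINKS_PER_CHIP = 4
LANES_PER_LINK = 8  # each link is x8
# ASSUMPTION: PCIe Gen6 raw rate of 64 GT/s per lane, counted as 64 Gb/s.
PCIE_GEN6_GBITS_PER_LANE = 64

lanes_per_chip = LINKS_PER_CHIP * LANES_PER_LINK
raw_gbytes_per_second = lanes_per_chip * PCIE_GEN6_GBITS_PER_LANE / 8

print(chips_per_ultraserver)
print(f"{lanes_per_chip} lanes per chip, {raw_gbytes_per_second:.0f} GB/s raw")
```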
The defining architectural feature of the Trn3 UltraServer is its NeuronSwitch-v1 fabric, the first switched all-to-all interconnect that AWS has shipped on a Trainium generation.[1][7] Prior Trn2 UltraServers used a directly-cabled 2D torus topology over NeuronLink-v3, which kept costs low but produced bottlenecks for all-to-all collective patterns such as the exchanges that mixture-of-experts routing requires. NeuronSwitch-v1 replaces the torus with an all-to-all switched fabric over NeuronLink-v4, which AWS says doubles interchip interconnect bandwidth over the Trn2 UltraServer.[1][7] In the Gen2 UltraServer, any chip can communicate with any other chip at NeuronLink speeds without multi-hop routing.[7]
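To see why a directly cabled torus strains all-to-all traffic, the sketch below counts shortest-path hops between chips in a 2D torus. The 8 by 8 arrangement of 64 chips is a hypothetical layout chosen purely for illustration (the text says only that the topology was a 2D torus), and treating the switched fabric as a single logical hop is a simplification of "without multi-hop routing".

```python
from itertools import product


def ring_distance(a: int, b: int, size: int) -> int:
    """Shortest distance between two positions on a ring with wraparound."""
    d = abs(a - b)
    return min(d, size - d)


def torus_hop_stats(rows: int, cols: int) -> tuple[float, int]:
    """Average and maximum shortest-path hops from one chip to all others."""
    # Every node of a torus sees the same distances, so measure from (0, 0).
    hops = [
        ring_distance(0, r, rows) + ring_distance(0, c, cols)
        for r, c in product(range(rows), range(cols))
        if (r, c) != (0, 0)
    ]
    return sum(hops) / len(hops), max(hops)


# HYPOTHETICAL layout: 64 chips arranged as an 8 x 8 torus.
avg_hops, max_hops = torus_hop_stats(8, 8)
# ASSUMPTION: a switched all-to-all fabric is modeled as one logical hop.
SWITCHED_HOPS = 1

print(f"8x8 torus: average {avg_hops:.2f} hops, worst case {max_hops} hops")
print(f"switched fabric (modeled): {SWITCHED_HOPS} hop")
```

In an all-to-all exchange every chip sends to every other chip, so each extra hop consumes link bandwidth on intermediate chips. That is the bottleneck the switched design is described as removing.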
Across all 144 chips, the Gen2 UltraServer provides 706 TB/s of aggregate HBM bandwidth, and its switched fabric supports any-to-any communication without oversubscription.[3] AWS plans three successive switch generations during Trainium 3's commercial lifetime: a first-generation 160-lane, 20-port PCIe Gen6 switch used at launch, a 320-lane higher-radix PCIe switch, and a switch that will move the fabric onto the open UALink protocol.[7]
Beyond the UltraServer, Trn3 systems connect to larger training clusters through Elastic Fabric Adapter version 4 (EFAv4), AWS's custom RDMA-over-Ethernet stack. Each chip is provisioned with 200 Gb/s of EFAv4 bandwidth by default, with a 400 Gb/s option, delivered through Nitro-v6 400G SmartNICs.[7] Multiple UltraServers are aggregated into EC2 UltraClusters, AWS's training cluster architecture, which can connect hundreds of thousands of Trainium chips into a single training workload.
AWS publicly compares Trn3 UltraServers to Trn2 UltraServers across several axes. The headline figures are 4.4 times higher peak compute, 3.9 times higher aggregate memory bandwidth, and 4 times better energy efficiency at the UltraServer level.[1][2] On the Amazon Bedrock managed inference platform, AWS reports that Trainium 3 is its fastest accelerator, delivering up to 3 times faster performance than Trainium 2 with over 5 times higher output tokens per megawatt at similar latency per user.[1]
The table below summarizes the published UltraServer-level specifications.
| Metric | Trn2 UltraServer | Trn3 Gen1 UltraServer | Trn3 Gen2 UltraServer |
|---|---|---|---|
| Chips per UltraServer | 64 | 64 | 144 |
| Peak MXFP8 compute | ~83 PF | 161 PF | 362 PF |
| Aggregate HBM capacity | 6 TiB | 9 TiB | 20.25 TiB (20.7 TB) |
| Aggregate HBM bandwidth | ~185 TB/s | 314 TB/s | 706 TB/s |
| Interchip fabric | NeuronLink-v3 torus | NeuronSwitch-v1 | NeuronSwitch-v1 |
| Per-chip NeuronLink bandwidth | ~1 TB/s | 2 TB/s | 2 TB/s |
| Per-chip EFA bandwidth | 200 Gb/s | 200 or 400 Gb/s | 200 or 400 Gb/s |
| Relative performance | 1.0x | ~2.0x | up to 4.4x |
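Most rows of this table can be derived by multiplying the per-chip specifications by the chip count. The sketch below does that as a consistency check; the `UltraServer` class and its field names are illustrative helpers, not part of any AWS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UltraServer:
    name: str
    chips: int
    fp8_pf_per_chip: float    # dense FP8/MXFP8 petaflops per chip
    hbm_gib_per_chip: int     # HBM capacity per chip in GiB
    hbm_tbps_per_chip: float  # HBM bandwidth per chip in TB/s

    @property
    def fp8_pf(self) -> float:
        return self.chips * self.fp8_pf_per_chip

    @property
    def hbm_tib(self) -> float:
        return self.chips * self.hbm_gib_per_chip / 1024

    @property
    def hbm_tbps(self) -> float:
        return self.chips * self.hbm_tbps_per_chip


trn2 = UltraServer("Trn2", 64, 1.3, 96, 2.9)
gen1 = UltraServer("Trn3 Gen1", 64, 2.52, 144, 4.9)
gen2 = UltraServer("Trn3 Gen2", 144, 2.52, 144, 4.9)

for server in (trn2, gen1, gen2):
    speedup = server.fp8_pf / trn2.fp8_pf
    print(f"{server.name}: {server.fp8_pf:.1f} PF, {server.hbm_tib:.2f} TiB, "
          f"{server.hbm_tbps:.1f} TB/s, {speedup:.2f}x vs Trn2")
```

The compute and bandwidth products agree with the table to within rounding, and the Gen2 compute ratio of about 4.36 matches the "up to 4.4x" headline. On capacity, 144 chips at 144 GiB each come to 20.25 TiB; the 20.7 figure corresponds to reading the per-chip capacity as 144 decimal gigabytes, which gives 20.7 TB.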
Mark Carroll, an AWS director of engineering on the Trainium team, attributed the gains to the combination of the new chip and the new Neuron switches, telling TechCrunch in 2026 that "that's why Trainium3 is breaking all kinds of records."[6] Anthropic has continued to scale Claude training and serving on Trainium hardware beyond the footprint already deployed for Project Rainier.[9]
Trainium 3 is supported by the AWS Neuron SDK, the same compiler and runtime toolchain used by Trainium 1 and Trainium 2. The SDK exposes Trainium hardware to higher-level frameworks through native PyTorch and JAX integrations, with the Neuron compiler responsible for partitioning models and scheduling NeuronCore execution.[8] AWS says moving a supported PyTorch model to Trainium is "basically a one-line change, and then recompile, and then run on Trainium."[6] For workloads that require lower-level control, the Neuron Kernel Interface (NKI) allows engineers to write custom kernels using a Python-based programming model that compiles directly to NeuronCore instructions.[8] A Trainium 3 era addition is a profiling and debugging toolset that visualizes how a model is executing across the engines of each NeuronCore, the SBUF and PSUM memory hierarchy, and the NeuronLink and EFA networking fabrics.
The supported numeric formats include FP32, BF16, FP16, TF32, MXFP8, and MXFP4, with the compiler automatically managing mixed-precision policies.[8] Additional framework integrations include vLLM for high-throughput inference, Hugging Face's Optimum Neuron for transformer model deployment, PyTorch Lightning for training orchestration, and TorchTitan for very large language model training. Trn3 UltraServers are exposed through the standard set of AWS managed services: Amazon SageMaker offers managed training jobs and inference endpoints; Amazon EKS and ECS provide container orchestration; AWS Batch and AWS ParallelCluster support HPC-style scheduling; and Amazon Bedrock uses Trn3 hardware as an inference backend for hosted foundation models.[1]
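Of the numeric formats listed above, MXFP8 and MXFP4 are block-scaled: in the OCP Microscaling scheme, a block of 32 elements shares one power-of-two scale, and each element is stored in a narrow floating-point format. The sketch below illustrates the idea for MXFP4 in plain Python. It is a simplified illustration, not how the Neuron compiler or the hardware implements it: the function names are hypothetical, ties round toward the smaller magnitude instead of round-to-nearest-even, and range clamping of the shared exponent and NaN handling are omitted.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_EMAX = 2     # exponent of the largest power of two in E2M1 (4.0 = 2**2)
BLOCK_SIZE = 32  # elements that share one 8-bit scale


def quantize_block_mxfp4(values):
    """Return (shared_exponent, elements) for one block of floats."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    shared_exp = (math.frexp(max_abs)[1] - 1) - FP4_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        scaled = min(abs(v) / scale, FP4_E2M1_MAGNITUDES[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - scaled))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate floats from a quantized block."""
    return [e * 2.0 ** shared_exp for e in elements]


block = [0.1 * i for i in range(-16, 16)]  # 32 sample values
exp, elems = quantize_block_mxfp4(block)
restored = dequantize_block(exp, elems)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# Storage cost: 32 four-bit elements plus one 8-bit shared scale.
bits_per_element = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE
print(f"shared exponent {exp}, max error {max_error:.3f}, "
      f"{bits_per_element} bits per element")
```

The shared scale is what lets such a narrow element format cover a useful dynamic range, at a storage cost of 4.25 bits per value instead of 16 for BF16.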
At launch, AWS announced a range of customers using or planning to use Trainium 3. The headline adopter is Anthropic, which uses more than one million Trainium 2 chips to train and serve Claude and has committed to expanding its Trainium capacity through 2026.[5][6] In late 2025, Anthropic and Amazon expanded their collaboration to secure up to 5 gigawatts of new compute for training and deploying Claude, including new Trainium capacity coming online through 2026.[10]
OpenAI signed a multi-year agreement with AWS that includes approximately 2 gigawatts of Trainium capacity spanning the Trainium 3 and Trainium 4 generations, part of a $38 billion AWS compute agreement that the two companies subsequently expanded.[11] The deal was widely read as a notable diversification of OpenAI's supplier base away from a near-exclusive reliance on Nvidia GPUs.
Apple is an AWS AI-silicon customer that has evaluated Trainium for pre-training future Apple Intelligence models. At re:Invent 2024, Apple's senior director for AI and machine learning Benoit Dupin said Apple expected up to 50 percent efficiency improvement in pre-training with AWS; the same statements made clear that the Trainium chips "would not be used to actually run Apple Intelligence features for customers," which run on Apple's own silicon.[12] Databricks has used Trainium for parts of its training services. Other named Trainium 3 customers include Decart (real-time generative video), poolside, Ricoh, Karakuri, Metagenomi, NetoAI, and Splash Music.[1]
The table below shows major announced Trainium 3 customers and use cases as of mid-2026.
| Customer | Use case | Notes |
|---|---|---|
| Anthropic | Training and serving Claude | Primary Trainium customer, up to 5 GW expanded partnership |
| OpenAI | Frontier model training and inference | Multi-year deal, ~2 GW of Trainium capacity |
| Apple | Pre-training Apple Intelligence models | Expected up to 50% pre-training efficiency improvement; not used for customer-facing inference |
| Databricks | AI model training | Foundation and customer model training |
| Decart | Real-time generative video | Reports 4x faster inference at half the cost of GPUs |
| poolside | AI coding assistant | Training and inference |
| Ricoh | Enterprise document AI | Up to 50% cost reduction |
| Karakuri | Japanese-language LLMs | Up to 50% cost reduction |
| Metagenomi | Bioinformatics models | Up to 50% cost reduction |
| Splash Music | Generative music | Up to 50% cost reduction |
A central design objective of Trainium 3 was to reduce the energy intensity of large-scale model training and serving. AWS reports that Trn3 UltraServers deliver 4 times the energy efficiency of Trn2 UltraServers, driven by the move from TSMC N5 to N3P, the higher utilization of the MXFP8 and MXFP4 systolic arrays, and the switched NeuronSwitch-v1 fabric that reduces wasted communication time.[1][7] In real-world serving workloads, the combined effect is over 5 times higher output tokens per megawatt.[1]
AWS deploys Trainium 3 in two rack-level configurations that differ primarily in chip density and cooling: an air-cooled NL32x2 configuration and a liquid-cooled NL72x2 configuration.[7] AWS has not published precise per-chip TDP numbers, though industry analysis indicates Trainium 3 falls in the high-hundreds-of-watts range per accelerator, on par with contemporary Nvidia Blackwell and Blackwell Ultra accelerators.[7] Press reports describe Trainium 3 as reducing data center power consumption for equivalent AI workloads by around 40 percent compared to Trainium 2 deployments.[13]
AWS does not publish per-chip-hour list pricing for Trn3 UltraServers, instead offering them primarily through capacity reservations, savings plans, and multi-year contracts.[3] AWS says its Trn3 UltraServers "cost up to 50% less to run for comparable performance than using classic cloud servers," with customer-reported cost reductions of up to 50 percent attributed to price-performance, energy savings, and operational density.[1][6] For workloads that map well to Trainium's strengths (large transformer training and inference with MXFP8 or MXFP4, mixture-of-experts models, long-context inference), the headline economics are competitive with Nvidia GPUs; for workloads that depend on the CUDA ecosystem or specialized GPU libraries that have no Neuron equivalent, Trainium 3 has been less attractive.
Trainium 3 entered the market in late 2025 against a maturing set of competing AI accelerators. The closest direct competitor is Nvidia's Blackwell architecture, particularly the B200 and Blackwell Ultra (GB300) parts, which dominated frontier AI training shipments through 2025.[14] On peak FP8 throughput, Trainium 3 is broadly competitive with B200 on a per-chip basis, though Nvidia retains advantages on aggregate per-system bandwidth, software ecosystem maturity, and inference-optimized FP4 throughput on Blackwell Ultra.[14] SemiAnalysis observed that AWS's liquid-cooled NL72x2 switched topology, which spans 144 chips across two racks, is "a potential challenger approaching" Nvidia's 72-package NVL72 reference rack on aggregate bandwidth.[7]
Other competitors in this generation include Google's Tensor Processing Unit line, specifically TPU v6e (Trillium) and TPU v7 (Ironwood), which target similar workloads on Google Cloud, and AMD's MI355X and MI400 series accelerators.[7] Trainium 3 differs from all of these by being available only through AWS, which constrains its market reach but allows AWS to closely co-design the chip with its broader cloud infrastructure and pricing model.
Alongside the Trn3 UltraServer launch at re:Invent 2025, AWS previewed the fourth-generation Trainium 4 chip, scheduled to begin delivery in 2027.[15] AWS describes Trainium 4 as bringing significant performance improvements across all dimensions, including at least 3 times the FP8 processing power and 4 times the memory bandwidth of Trainium 3, with higher FP4 performance and support for Nvidia's NVLink Fusion technology as part of a broader interoperability strategy.[15] AWS has indicated that intermediate networking and software improvements will continue to arrive on Trainium 3 systems through 2026, and has signaled interest in adopting the UALink open standard for chip-to-chip interconnect in a future Trn3 switch revision and in Trainium 4.[7]
Industry analyst response to Trainium 3 was generally positive. The semiconductor analysis firm SemiAnalysis titled its December 2025 deep dive "AWS Trainium3 Deep Dive: A Potential Challenger Approaching," framing the chip as opening "yet another front" for Nvidia's Jensen Huang alongside Google's TPU v7 and AMD's MI450X, while cautioning that "Nvidia will stay King of the Jungle" if its development pace accelerates further.[7] Coverage in HPCwire, Tom's Hardware, and The Next Platform highlighted the move to a switched scale-up topology with NeuronSwitch-v1 as the most consequential architectural change, since it brings AWS's training systems closer to the topological flexibility of Nvidia's NVLink switched domains.[16][14][15]
Some observers cautioned that Trainium 3's per-chip BF16 throughput, unchanged from Trainium 2, would limit its appeal for workloads that have not yet migrated to FP8 or FP4 numerics, and that the Neuron software stack still trails CUDA in library breadth and third-party tool support.[7] The broad customer adoption at launch, particularly the OpenAI and continued Anthropic commitments, was widely interpreted as evidence that frontier AI labs see Trainium 3 economics as competitive enough to justify the engineering investment required to port major training pipelines onto a non-Nvidia stack.