AWS Trainium 3
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,542 words
AWS Trainium 3 (also written as Trainium3 and abbreviated Trn3) is the third generation of Amazon Web Services' custom machine learning training accelerator, developed by Amazon's in-house chip design team Annapurna Labs. The chip was first previewed at AWS re:Invent in December 2024, formally introduced with full specifications at re:Invent in December 2025, and reached general availability through Amazon EC2 Trn3 UltraServers on December 1, 2025. Trainium 3 is the first AWS-designed AI accelerator manufactured on a 3 nanometer process node, fabricated by TSMC using its N3P technology. Each chip contains eight NeuronCore-v4 compute cores, 144 GiB of HBM3e memory, 4.9 TB/s of memory bandwidth, and delivers 2.52 petaflops of dense FP8 compute, roughly doubling the per-chip throughput of the prior generation AWS Trainium 2.
At the system level, Trn3 UltraServers scale from 64 chips in the Gen1 configuration to 144 chips in the Gen2 configuration, with the larger system reaching 362 petaflops of FP8 compute, 20.7 TB of HBM3e capacity, and 706 TB/s of aggregate memory bandwidth. AWS markets Trainium 3 as offering up to 4.4 times the performance, 3.9 times the memory bandwidth, and four times the performance per watt of Trn2 UltraServers. Customers including Anthropic, OpenAI, Apple, Databricks, Decart, poolside, Ricoh, and Splash Music have reported cost reductions of up to 50 percent on training and inference workloads relative to alternative accelerators.
AWS designed its first Trainium accelerator in 2020 and shipped it in EC2 Trn1 instances in 2022 as a deep learning training chip aimed at lowering the cost of large model training relative to general-purpose GPUs. The original chip used two NeuronCore-v2 cores with 32 GiB of HBM2e on a 7 nanometer process. The second-generation AWS Trainium 2, announced at re:Invent 2023 and made generally available in late 2024, was a wholesale redesign for the generative AI era: it raised the core count from two to eight with the new NeuronCore-v3, tripled on-package memory to 96 GiB of HBM3e, and introduced the Trn2 UltraServer, a rack-scale shared-memory domain of 64 chips connected by NeuronLink-v3. Trainium 2 became the basis of Project Rainier, a 500,000-chip cluster that AWS built in less than twelve months in partnership with Anthropic for training the Claude family of models.
Amazon previewed Trainium 3 during the Trainium 2 launch at re:Invent 2024, telling attendees that the next chip would arrive in late 2025, would use a 3 nanometer process, and would offer roughly four times the training performance of Trainium 2 with materially better energy efficiency. Over the following twelve months Annapurna Labs worked closely with Anthropic on the silicon design, with the AI safety company providing direct feedback on training pipeline requirements such as collective communication patterns, mixture-of-experts routing, and reinforcement learning rollouts. Anthropic engineers later reported that porting their existing training stack from Trainium 2 to Trainium 3 took roughly three weeks, far less than the months-long effort earlier custom AI chip transitions had typically demanded. The program reflected a broader strategic shift inside AWS toward owning more of the AI compute stack, with the design philosophy explicitly optimizing performance per total cost of ownership rather than peak performance.
Trainium 3 is fabricated on TSMC's N3P process, the company's performance-enhanced 3 nanometer node, which delivers roughly five percent higher speed at constant leakage, five to ten percent lower power at constant frequency, and around four percent more effective transistor density relative to baseline N3. The accelerator uses a monolithic two-die compute design connected through the package substrate rather than a chiplet approach with separate I/O dies. AWS rejected a competing chiplet proposal from Marvell during the design phase in favor of this simpler topology, prioritizing yield and supply continuity.
The accelerator is packaged using TSMC's CoWoS-R variant, which employs an organic thin-film interposer (six copper redistribution layers on a polymer substrate) rather than the silicon interposer used in CoWoS-S packaging on high-end GPUs. CoWoS-R is significantly less constrained by TSMC's silicon interposer capacity, an important consideration given industry-wide competition for advanced packaging supply through 2025 and 2026. The package integrates four stacks of HBM3e in a 12-high configuration, each providing 36 GiB for a total of 144 GiB per chip. Pin speeds reach 9.6 gigabits per second, among the fastest HBM3e implementations shipping in 2025 and a substantial step up from the 5.7 gigabits per second used on Trainium 2.
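The per-chip bandwidth figures can be roughly reproduced from the stack count and pin speed given above. The sketch below assumes the standard 1024-bit interface per HBM stack and assumes Trainium 2 also uses four stacks; neither value is stated in this article, and the function names are illustrative only.

```python
# Peak HBM bandwidth = stacks * interface width (bits) * pin rate (Gb/s) / 8 bits per byte.
HBM_BUS_WIDTH_BITS = 1024  # assumption: standard per-stack HBM interface width


def hbm_bandwidth_tbps(stacks: int, pin_rate_gbps: float) -> float:
    """Return peak bandwidth in TB/s, using decimal units (1 TB = 1000 GB)."""
    gigabytes_per_second = stacks * HBM_BUS_WIDTH_BITS * pin_rate_gbps / 8
    return gigabytes_per_second / 1000


def hbm_capacity_gib(stacks: int, gib_per_stack: int) -> int:
    """Return total on-package capacity in GiB."""
    return stacks * gib_per_stack


trn3_bw = hbm_bandwidth_tbps(stacks=4, pin_rate_gbps=9.6)
trn2_bw = hbm_bandwidth_tbps(stacks=4, pin_rate_gbps=5.7)  # assumption: four stacks on Trainium 2

print(f"Trainium 3: {trn3_bw:.2f} TB/s, {hbm_capacity_gib(4, 36)} GiB")
print(f"Trainium 2: {trn2_bw:.2f} TB/s")
```

Under these assumptions the results land at about 4.9 TB/s and 2.9 TB/s, in line with the quoted per-chip bandwidths.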
Each Trainium 3 accelerator contains eight NeuronCore-v4 cores, the same count as Trainium 2 but with substantially upgraded internal organization. A NeuronCore-v4 consists of four execution engines: a Tensor Engine for matrix and convolution operations, a Vector Engine for elementwise operations and reductions, a Scalar Engine for control flow, and a GPSIMD Engine for general-purpose work such as sorts and gathers. The core includes 32 MiB of SBUF software-managed scratchpad memory (up from 28 MiB on NeuronCore-v3) and a 2 MiB PSUM buffer for accumulating partial sums during matrix multiplication.
The Tensor Engine contains two separate systolic arrays optimized for different numeric formats. A 128 by 128 BF16 array handles bfloat16, float16, TF32, and FP32 computations, while a larger 512 by 128 array is dedicated to the OCP-compliant MXFP8 and MXFP4 microscaling formats. The wider MXFP8/MXFP4 array reflects an explicit design bet that the next several years of frontier model work will rely heavily on these block-scaled low-precision formats. Per Tensor Engine, the core delivers 315 teraflops of MXFP8 or MXFP4 throughput, 79 teraflops of BF16/FP16/TF32, and 20 teraflops of FP32. With eight cores per chip, a single Trainium 3 device reaches 2.52 petaflops of dense MXFP8 or MXFP4 performance. Notably, BF16 throughput per chip is roughly unchanged from Trainium 2 at around 0.65 petaflops. AWS put the majority of the additional silicon area into MXFP8 and MXFP4 capability rather than scaling all numeric formats uniformly, on the basis that production-scale generative AI workloads in 2025 and 2026 are increasingly dominated by FP8 training, FP8 inference, and FP4 inference.
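To illustrate what "block-scaled" means for the microscaling formats, here is a simplified pure-Python sketch of MXFP4-style quantization. It follows the general scheme of the OCP Microscaling specification (blocks of 32 elements sharing one power-of-two scale, with FP4 E2M1 elements), but it is not the Neuron compiler's or the hardware's implementation. Tie-breaking and clamping of the shared exponent to its 8-bit range are simplified or omitted, and all function names are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32    # elements that share one scale in the MX formats


def quantize_block(block):
    """Return (shared_exponent, quantized element values) for one block."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Choose a power-of-two scale so the largest element lands in E2M1's top binade.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        magnitude = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - magnitude))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Reconstruct approximate values from the shared scale and elements."""
    scale = 2.0 ** shared_exp
    return [v * scale for v in quantized]


block = [0.25] * (BLOCK_SIZE - 1) + [-0.75]
shared_exp, q = quantize_block(block)
restored = dequantize_block(shared_exp, q)
print(shared_exp, restored == block)
```

Because the scale is stored once per block, each element needs only four bits, which is why a systolic array built for these formats can fit more multipliers in the same area than a BF16 array.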
Trainium 3 integrates 144 GiB of HBM3e per chip across four stacks, providing 4.9 TB/s of memory bandwidth. The capacity is 1.5 times that of Trainium 2 and the bandwidth is roughly 70 percent higher. The larger memory footprint allows Trainium 3 to hold significantly larger model weights, optimizer states, and KV caches per chip, which reduces the number of chips that must be enlisted simply to fit a given model and improves utilization on inference workloads with long context windows.
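The effect of capacity on chip count can be sketched with a weights-only estimate. The one-trillion-parameter model below is a hypothetical example, not a figure from AWS, and the calculation ignores optimizer state, activations, and KV cache, all of which raise the real requirement.

```python
import math

GIB = 2 ** 30  # bytes per GiB


def min_chips_for_weights(num_params: float, bytes_per_param: float,
                          hbm_gib_per_chip: int) -> int:
    """Smallest number of chips whose combined HBM holds the weights alone."""
    total_bytes = num_params * bytes_per_param
    return math.ceil(total_bytes / (hbm_gib_per_chip * GIB))


PARAMS = 1e12  # hypothetical model size, chosen only for illustration

for name, hbm_gib in (("Trainium 2", 96), ("Trainium 3", 144)):
    fp8 = min_chips_for_weights(PARAMS, 1, hbm_gib)   # 1 byte per parameter
    bf16 = min_chips_for_weights(PARAMS, 2, hbm_gib)  # 2 bytes per parameter
    print(f"{name}: {fp8} chips at FP8, {bf16} chips at BF16")
```

For this hypothetical model the weights alone fit on 7 Trainium 3 chips at one byte per parameter, compared with 10 Trainium 2 chips.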
The table below compares the key per-chip specifications of Trainium 3 against its predecessors.
| Specification | AWS Trainium | AWS Trainium 2 | AWS Trainium 3 |
|---|---|---|---|
| Launch year | 2022 | 2024 | 2025 |
| Process node | TSMC N7 | TSMC N5 | TSMC N3P |
| NeuronCore version | v2 (2 cores) | v3 (8 cores) | v4 (8 cores) |
| FP8 / MXFP8 compute (per chip) | ~0.19 PF (FP16) | 1.3 PF | 2.52 PF |
| BF16 compute (per chip) | 0.19 PF | 0.65 PF | 0.65 PF |
| HBM capacity | 32 GiB HBM2e | 96 GiB HBM3e | 144 GiB HBM3e |
| HBM bandwidth | 0.82 TB/s | 2.9 TB/s | 4.9 TB/s |
| HBM pin speed | 3.2 Gb/s | 5.7 Gb/s | 9.6 Gb/s |
| Packaging | Standard | CoWoS-R | CoWoS-R |
| Scale-up domain | 16 chips (Trn1) | 64 chips | 64 or 144 chips |
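The per-chip figures in the table can be cross-checked against the per-core numbers and ratios quoted in the text. The short script below uses only values stated in this article. Note that eight cores at 79 teraflops give about 0.63 petaflops of BF16, which the article rounds to roughly 0.65.

```python
CORES_PER_CHIP = 8

# Per-core Tensor Engine throughput in teraflops, as quoted in the text.
PER_CORE_TFLOPS = {"MXFP8/MXFP4": 315, "BF16/FP16/TF32": 79, "FP32": 20}

# Per-chip throughput in petaflops.
per_chip_pflops = {
    fmt: CORES_PER_CHIP * tflops / 1000 for fmt, tflops in PER_CORE_TFLOPS.items()
}

# Ratio of multiply-accumulate cells in the two systolic arrays.
array_ratio = (512 * 128) / (128 * 128)
throughput_ratio = PER_CORE_TFLOPS["MXFP8/MXFP4"] / PER_CORE_TFLOPS["BF16/FP16/TF32"]

# Generation-over-generation ratios, Trainium 3 versus Trainium 2.
capacity_ratio = 144 / 96
bandwidth_ratio = 4.9 / 2.9
fp8_ratio = 2.52 / 1.3

for fmt, pf in per_chip_pflops.items():
    print(f"{fmt}: {pf:.3f} PF per chip")
print(f"array size ratio {array_ratio:.1f}x, throughput ratio {throughput_ratio:.2f}x")
print(f"capacity {capacity_ratio:.2f}x, bandwidth {bandwidth_ratio:.2f}x, FP8 {fp8_ratio:.2f}x")
```

The fourfold difference in array size lines up with the roughly fourfold gap between MXFP8 and BF16 throughput per core.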
A Trn3 UltraServer is a rack-scale system that ties multiple Trainium 3 chips into a single shared-memory scale-up domain. AWS ships Trn3 UltraServers in two configurations: a Gen1 with 64 chips and a Gen2 with 144 chips. Both expose a single logical memory and compute pool, which lets large model parallelism strategies place tensor parallel groups and pipeline parallel stages across the entire scale-up domain without crossing network protocol boundaries.
Each Trn3 server, internally referred to as a sled, contains four Trainium 3 chips connected through an intra-server PCIe switch. Every chip exposes four PCIe Gen6 x8 links to this switch, providing 256 GB/s of bidirectional bandwidth within a sled. The Gen1 UltraServer uses 16 sleds for 64 chips, while the Gen2 uses 36 sleds for 144 chips.
The defining architectural feature of the Trn3 UltraServer is its NeuronSwitch-v1 fabric, the first switched all-to-all interconnect that AWS has shipped on a Trainium generation. Prior Trn2 UltraServers used a directly-cabled 2D torus topology over NeuronLink-v3, which kept costs low but produced bottlenecks for all-to-all collective patterns such as the exchanges that mixture-of-experts routing requires. NeuronSwitch-v1 replaces the torus with an all-to-all switched fabric over NeuronLink-v4, doubling intra-UltraServer bandwidth and reducing collective operation tail latency to under 10 microseconds.
Each Trainium 3 chip provides 2 TB/s of NeuronLink-v4 bandwidth into the fabric, or 288 TB/s summed across the 144 chips of a Gen2 UltraServer, supporting any-to-any communication without oversubscription. The 706 TB/s figure quoted for the Gen2 system is aggregate HBM bandwidth (144 chips at 4.9 TB/s each), not fabric bandwidth. AWS plans three successive switch generations during Trainium 3's commercial lifetime: a first-generation 160-lane, 20-port PCIe Gen6 switch used at launch, a 320-lane higher-radix PCIe switch, and a UALink switch that will move the fabric onto the open UALink protocol.
Beyond the UltraServer, Trn3 systems connect to larger training clusters through Elastic Fabric Adapter version 4 (EFAv4), AWS's custom RDMA-over-Ethernet stack. Each chip is provisioned with 200 Gb/s of EFAv4 bandwidth by default, with a 400 Gb/s option, delivered through Nitro-v6 400G SmartNICs. Multiple UltraServers are aggregated into EC2 UltraClusters 3.0, AWS's third-generation training cluster architecture, which can connect hundreds of thousands of Trainium chips into a single training workload, with the largest planned clusters targeting up to one million chips across multiple data centers.
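To put the scale-up and scale-out figures on the same footing, the sketch below converts the quoted 2 TB/s of per-chip NeuronLink-v4 bandwidth into gigabits per second and compares it with the per-chip EFAv4 options. It assumes decimal units throughout and ignores protocol overhead, so the ratios are rough indications rather than measured values.

```python
NEURONLINK_TBYTES_PER_S = 2.0  # per-chip scale-up bandwidth quoted in the text


def scale_up_to_scale_out_ratio(efa_gbps: float) -> float:
    """How many times more bandwidth a chip has inside the UltraServer than outside it."""
    neuronlink_gbps = NEURONLINK_TBYTES_PER_S * 8 * 1000  # TB/s -> Gb/s
    return neuronlink_gbps / efa_gbps


def ultraserver_efa_tbits_per_s(chips: int, efa_gbps: float) -> float:
    """Total EFA bandwidth leaving one UltraServer, in Tb/s."""
    return chips * efa_gbps / 1000


for efa in (200, 400):
    ratio = scale_up_to_scale_out_ratio(efa)
    total = ultraserver_efa_tbits_per_s(144, efa)
    print(f"EFA {efa} Gb/s: scale-up is {ratio:.0f}x scale-out; "
          f"Gen2 UltraServer egress {total:.1f} Tb/s")
```

The large gap is one reason bandwidth-hungry tensor-parallel and expert-parallel traffic is usually kept inside the scale-up domain, with the Ethernet fabric carrying less frequent collectives such as data-parallel gradient exchange.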
AWS publicly compares Trn3 UltraServers to Trn2 UltraServers across several axes. The headline figures are 4.4 times higher peak compute, 3.9 times higher aggregate memory bandwidth, and four times better performance per watt at the UltraServer level. In real-world serving workloads on OpenAI's open-weight GPT-OSS model, AWS measured 3 times higher per-chip throughput and 4 times faster inference response times relative to Trn2 UltraServers, along with over 5 times more output tokens per megawatt of power consumed.
The table below summarizes the published UltraServer-level specifications.
| Metric | Trn2 UltraServer | Trn3 Gen1 UltraServer | Trn3 Gen2 UltraServer |
|---|---|---|---|
| Chips per UltraServer | 64 | 64 | 144 |
| Peak MXFP8 compute | ~83 PF | 161 PF | 362 PF |
| Aggregate HBM capacity | 6 TiB | 9 TiB | 20.25 TiB (quoted as 20.7 TB) |
| Aggregate HBM bandwidth | ~185 TB/s | 314 TB/s | 706 TB/s |
| Interchip fabric | NeuronLink-v3 torus | NeuronSwitch-v1 | NeuronSwitch-v1 |
| Per-chip NeuronLink bandwidth | ~1 TB/s | 2 TB/s | 2 TB/s |
| Per-chip EFA bandwidth | 200 Gb/s | 200 or 400 Gb/s | 200 or 400 Gb/s |
| Relative performance | 1.0x | ~2.0x | up to 4.4x |
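The UltraServer-level figures follow from multiplying the per-chip figures by the chip count. The sketch below recomputes them from values quoted in this article (four chips per sled, 16 or 36 sleds per Trn3 UltraServer). Two details are worth noting: 144 chips at 144 GiB each is 20,736 GiB, or 20.25 TiB, which matches the quoted 20.7 figure only when GiB is read as GB; and the rounded per-chip bandwidths give a ratio near 3.8 rather than the marketed 3.9.

```python
CHIPS_PER_SLED = 4

# Per-chip figures as quoted in the article.
PER_CHIP = {
    "Trainium 2": {"pflops": 1.3, "hbm_gib": 96, "hbm_tbps": 2.9, "link_tbps": 1.0},
    "Trainium 3": {"pflops": 2.52, "hbm_gib": 144, "hbm_tbps": 4.9, "link_tbps": 2.0},
}

# System name -> (chip generation, chip count).
SYSTEMS = {
    "Trn2 UltraServer": ("Trainium 2", 64),
    "Trn3 Gen1 UltraServer": ("Trainium 3", 16 * CHIPS_PER_SLED),
    "Trn3 Gen2 UltraServer": ("Trainium 3", 36 * CHIPS_PER_SLED),
}


def aggregate(chip: str, chips: int) -> dict:
    """Scale per-chip figures up to a whole UltraServer."""
    spec = PER_CHIP[chip]
    return {
        "chips": chips,
        "pflops": chips * spec["pflops"],
        "hbm_tib": chips * spec["hbm_gib"] / 1024,
        "hbm_tbps": chips * spec["hbm_tbps"],
        "link_tbps": chips * spec["link_tbps"],
    }


totals = {name: aggregate(*cfg) for name, cfg in SYSTEMS.items()}
for name, t in totals.items():
    print(f"{name}: {t['chips']} chips, {t['pflops']:.1f} PF, {t['hbm_tib']:.2f} TiB, "
          f"{t['hbm_tbps']:.1f} TB/s HBM, {t['link_tbps']:.0f} TB/s NeuronLink")

base, gen2 = totals["Trn2 UltraServer"], totals["Trn3 Gen2 UltraServer"]
compute_ratio = gen2["pflops"] / base["pflops"]
bandwidth_ratio = gen2["hbm_tbps"] / base["hbm_tbps"]
print(f"Gen2 vs Trn2: compute {compute_ratio:.2f}x, HBM bandwidth {bandwidth_ratio:.2f}x")
```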
On the Amazon Bedrock managed inference platform, AWS reports 3 times faster end-to-end performance than the equivalent Trainium 2 backend for production workloads. Anthropic has stated that it expects to continue scaling Claude training and serving on Trainium 3 well beyond the footprint already deployed for Project Rainier.
Trainium 3 is supported by the AWS Neuron SDK, the same compiler and runtime toolchain used by Trainium 1 and Trainium 2. The SDK exposes Trainium hardware to higher-level frameworks through native PyTorch and JAX integrations, with the Neuron compiler responsible for partitioning models and scheduling NeuronCore execution. For workloads that require lower-level control, the Neuron Kernel Interface (NKI) allows engineers to write custom kernels using a Python-based programming model that compiles directly to NeuronCore instructions. A Trainium 3 specific addition is Neuron Explorer, a profiling and debugging tool that visualizes how a model is executing across the engines of each NeuronCore, the SBUF and PSUM memory hierarchy, and the NeuronLink and EFA networking fabrics.
The supported numeric formats include FP32, BF16, FP16, TF32, MXFP8, and MXFP4, with the compiler automatically managing mixed-precision policies. Additional framework integrations include vLLM for high-throughput inference, Hugging Face's Optimum Neuron for transformer model deployment, PyTorch Lightning for training orchestration, and TorchTitan for very large language model training. Trn3 UltraServers are exposed through the standard set of AWS managed services: Amazon SageMaker offers managed training jobs and inference endpoints; Amazon EKS and ECS provide container orchestration; AWS Batch and AWS ParallelCluster support HPC-style scheduling; and Amazon Bedrock uses Trn3 hardware as the primary inference backend for several of its hosted foundation models.
At launch, AWS announced a wide range of customers using or planning to use Trainium 3. The headline adopter is Anthropic, which operates close to one million Trainium 2 chips for Claude and has publicly committed to expanding its Trainium 3 footprint through 2026. In April 2026, Anthropic and Amazon announced a $100 billion expansion of their compute partnership, with a significant portion of the new capacity earmarked for Trainium 3 and the planned Trainium 4.
OpenAI signed a multi-year agreement to use Trainium 3 capacity for a portion of its training and inference infrastructure, a notable expansion of its supplier diversification away from a near-exclusive reliance on Nvidia GPUs. Press reporting described the OpenAI commitment as approximately 2 gigawatts of Trainium capacity across Trainium 3 and Trainium 4 generations, embedded in a broader $38 billion AWS compute agreement subsequently expanded over eight years, with Amazon also taking a major equity-and-compute stake in OpenAI.
Apple became a Trainium customer in 2025 and uses Trainium hardware in its Private Cloud Compute infrastructure backing server-side Apple Intelligence features. Databricks began using Trainium in 2025 for parts of its Mosaic AI training service. Other named Trainium 3 customers include Decart (real-time generative video, reporting 4 times faster inference than competing GPU systems at half the cost), poolside, Ricoh, Karakuri, Metagenomi, NetoAI, and Splash Music.
The table below shows the major announced Trainium 3 customers and use cases as of mid-2026.
| Customer | Use case | Notes |
|---|---|---|
| Anthropic | Training and serving Claude | Primary Trainium customer, $100B expanded partnership |
| OpenAI | Frontier model training and inference | Multi-year deal, ~2 GW of Trainium capacity |
| Apple | Private Cloud Compute for Apple Intelligence | Server-side AI features for iOS and macOS |
| Databricks | Mosaic AI training | Foundation and customer model training |
| Decart | Real-time generative video | Reports 4x faster inference vs GPUs |
| poolside | AI coding assistant | Training and inference |
| Ricoh | Enterprise document AI | Japanese-language model training |
| Karakuri | Japanese-language LLMs | Up to 50% cost reduction |
| Metagenomi | Bioinformatics models | Up to 50% cost reduction |
| Splash Music | Generative music | Up to 50% cost reduction |
A central design objective of Trainium 3 was to reduce the energy intensity of large-scale model training and serving. AWS reports that Trn3 UltraServers deliver four times the performance per watt of Trn2 UltraServers, driven by the move from TSMC N5 to N3P, the higher utilization of the MXFP8 and MXFP4 systolic arrays, and the switched NeuronSwitch-v1 fabric that reduces wasted communication time. In real-world serving workloads, the combined effect is over 5 times higher token throughput per megawatt.
AWS deploys Trainium 3 in two rack-level SKUs that differ primarily in chip density and cooling: an air-cooled NL32x2 configuration with 32 chips per rack across two racks (64 chips, matching the Gen1 UltraServer) and a liquid-cooled NL72x2 configuration with 72 chips per rack across two racks (144 chips, matching Gen2). AWS has not published precise per-chip TDP numbers, though industry analysis indicates Trainium 3 falls in the high-hundreds-of-watts range per accelerator, on par with contemporary Nvidia Blackwell and Blackwell Ultra accelerators. Press reports describe Trainium 3 as reducing data center power consumption for equivalent AI workloads by around 40 percent compared to Trainium 2 deployments.
AWS does not publish per-chip-hour list pricing for Trn3 UltraServers, instead offering them primarily through capacity reservations, savings plans, and multi-year contracts. AWS positions the chip as offering a roughly 40 percent better price-performance ratio than the closest competing Nvidia GPU systems, with customer-reported cost reductions of up to 50 percent attributed to price-performance, energy savings, and operational density. For workloads that map well to Trainium's strengths (large transformer training and inference with MXFP8 or MXFP4, mixture-of-experts models, long-context inference), the headline economics are competitive with Nvidia GPUs; for workloads that depend on the CUDA ecosystem or specialized GPU libraries that have no Neuron equivalent, Trainium 3 has been less attractive.
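A price-performance ratio and a cost reduction are different quantities, and the conversion between them is easy to get wrong. The sketch below shows it, assuming "40 percent better price-performance" means 1.4 times as much work per dollar; that reading is an interpretation, not a definition published by AWS, and the function names are illustrative.

```python
def saving_for_equal_work(improvement_ratio: float) -> float:
    """Fraction of cost saved on a fixed workload when work-per-dollar improves by the given ratio."""
    return 1.0 - 1.0 / improvement_ratio


def ratio_for_saving(saving: float) -> float:
    """Work-per-dollar ratio needed to achieve a given fractional cost saving."""
    return 1.0 / (1.0 - saving)


print(f"1.4x price-performance -> {saving_for_equal_work(1.4):.1%} lower cost")
print(f"50% lower cost requires -> {ratio_for_saving(0.5):.1f}x price-performance")
```

Under this reading, a 1.4x price-performance advantage corresponds to a bill about 29 percent lower for the same work, while a 50 percent reduction implies a 2x advantage. The two headline figures therefore describe different workloads or comparisons rather than the same measurement.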
Trainium 3 entered the market in late 2025 against a maturing set of competing AI accelerators. The closest direct competitor is Nvidia's Blackwell architecture, particularly the B200 and Blackwell Ultra (GB300) parts, which dominated frontier AI training shipments through 2025. On peak FP8 throughput, Trainium 3 is broadly competitive with B200 on a per-chip basis, though Nvidia retains advantages on aggregate per-system bandwidth, software ecosystem maturity, and inference-optimized FP4 throughput on Blackwell Ultra. Trainium 3 systems offer higher per-chip HBM3e capacity than B200, equivalent to the larger memory variants of Blackwell Ultra.
Other competitors in this generation include Google's TPU v6e (Trillium) and TPU v7 (Ironwood), which target similar workloads on Google Cloud, and AMD's MI355X and MI400 series accelerators. Trainium 3 differs from all of these by being available only through AWS, which constrains its market reach but allows AWS to closely co-design the chip with its broader cloud infrastructure and pricing model.
Alongside the Trn3 UltraServer launch at re:Invent 2025, AWS previewed the fourth-generation Trainium 4 chip, scheduled for general availability in 2027. The headline figures for Trainium 4 are roughly 6 times the FP4 processing performance of Trainium 3, 3 times the FP8 performance, and 4 times the memory bandwidth, with support for Nvidia's NVLink Fusion technology as part of a broader interoperability strategy. AWS has indicated that Trainium 4 will continue the company's two-year cadence for Trainium hardware refreshes, with intermediate networking and software improvements arriving on Trainium 3 systems through 2026. AWS has also signaled interest in adopting the UALink open standard for chip-to-chip interconnect in a future Trn3 switch revision and in Trainium 4.
Industry analyst response to Trainium 3 was generally positive. The semiconductor analysis firm SemiAnalysis described the chip in a December 2025 deep dive as "a potential challenger approaching" and the first Trainium generation to compete head-on with Nvidia on training economics rather than only on niche inference. Coverage in HPCwire, Tom's Hardware, and The Next Platform highlighted the move to a switched scale-up topology with NeuronSwitch-v1 as the most consequential architectural change, since it brings AWS's training systems closer to the topological flexibility of Nvidia's NVLink switched domains.
Some observers cautioned that Trainium 3's per-chip BF16 throughput, unchanged from Trainium 2, would limit its appeal for workloads that have not yet migrated to FP8 or FP4 numerics, and that the Neuron software stack still trails CUDA in library breadth and third-party tool support. The broad customer adoption at launch, particularly the OpenAI and continued Anthropic commitments, was widely interpreted as evidence that frontier AI labs see Trainium 3 economics as competitive enough to justify the engineering investment required to port major training pipelines onto a non-Nvidia stack.