AWS Trainium 2 (also written as Trainium2 and abbreviated Trn2) is the second generation of Amazon Web Services' custom machine learning training accelerator, developed by Amazon's in-house chip design team Annapurna Labs. Amazon first previewed Trainium 2 at AWS re:Invent 2023 and announced general availability at AWS re:Invent 2024 on December 3, 2024. Compared to the original AWS Trainium, the second generation delivers four times the compute performance, four times the memory bandwidth, and three times the memory capacity per chip. Each chip integrates eight NeuronCore-v3 compute cores, 96 GiB of HBM3e memory, and dedicated collective communication hardware for scale-out training. The chip powers Amazon EC2 Trn2 instances and the Trn2 UltraServer, a rack-scale system that connects 64 chips into a single shared-memory domain. AWS deployed Trainium 2 at massive scale through Project Rainier, a cluster of nearly 500,000 Trainium 2 chips built in partnership with Anthropic for training and serving the Claude family of models.
Amazon introduced the original AWS Trainium accelerator in late 2021 as a purpose-built chip for deep learning training on AWS. The first generation, which powered the EC2 Trn1 and Trn1n instances, used two large NeuronCore-v2 compute cores per chip and shipped with 32 GiB of HBM2e memory. While Trn1 established a cost-per-training-step advantage for certain workloads, the architecture was designed before large language models dominated the training landscape, and its memory capacity and interconnect bandwidth proved limiting for models with hundreds of billions of parameters.
Amazon announced Trainium 2 at AWS re:Invent in November 2023, positioning it as a ground-up redesign for the generative AI era. The revised architecture increased the NeuronCore count from two to eight per chip, tripled device memory to 96 GiB, moved to HBM3e for higher bandwidth, and introduced a new generation of NeuronLink interconnect. The chip also added 16 dedicated Collective Communication Cores (CC-Cores), up from six in the first generation, to handle the all-reduce and all-gather operations that dominate large-scale distributed training without competing with the tensor cores for execution resources.
AWS made Trn2 instances generally available in the US East (Ohio) region on December 3, 2024, at the same AWS re:Invent event where the company also previewed Trainium 3 and disclosed Project Rainier details. The simultaneous GA announcement and next-generation preview reflected how quickly Amazon had accelerated its custom silicon roadmap in response to demand from Anthropic and other large-scale training customers.
Each Trainium 2 chip contains eight NeuronCore-v3 compute cores. The third generation of NeuronCore extends its predecessors with four functional engines per core: a tensor engine for matrix multiplication, a vector engine for elementwise and reduction operations, a scalar engine, and a general-purpose SIMD (GPSIMD) engine for custom operators.
NeuronCore-v3 also introduces native support for dynamic shapes and control flow through ISA extensions, which earlier generations lacked. This matters for workloads like reinforcement learning from human feedback (RLHF) and sequence-to-sequence tasks where batch dimensions vary at runtime. Trainium 2 additionally supports user-programmable rounding modes, giving researchers control over numerical behavior in reduced-precision formats.
The chip provides the following aggregate compute across all eight cores:
| Precision | Dense TFLOPS | Sparse TFLOPS |
|---|---|---|
| FP8 | 1,299 | 2,563 |
| BF16 / FP16 / TF32 | 667 | 2,563 |
| FP32 | 181 | -- |
Note: sparse figures reflect structured sparsity (2:4 pattern). Per-core FP8 dense throughput is approximately 162 TFLOPS and per-core BF16/FP16/TF32 dense throughput is approximately 83 TFLOPS.
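The per-core and system-level figures quoted throughout this article follow directly from the per-chip numbers. A short arithmetic sketch, using only values from the tables in this article, makes the scaling explicit:

```python
# Sanity-check how the published Trainium 2 compute figures scale.
# All inputs are per-chip values quoted in this article.
FP8_DENSE_TFLOPS_PER_CHIP = 1299
BF16_DENSE_TFLOPS_PER_CHIP = 667
NEURONCORES_PER_CHIP = 8
CHIPS_PER_TRN2_INSTANCE = 16
CHIPS_PER_ULTRASERVER = 64

# Per-core dense throughput (the ~162 / ~83 TFLOPS figures above).
print(FP8_DENSE_TFLOPS_PER_CHIP / NEURONCORES_PER_CHIP)    # ~162.4 TFLOPS FP8 per core
print(BF16_DENSE_TFLOPS_PER_CHIP / NEURONCORES_PER_CHIP)   # ~83.4 TFLOPS BF16 per core

# Instance- and UltraServer-level aggregates, in PFLOPS.
print(FP8_DENSE_TFLOPS_PER_CHIP * CHIPS_PER_TRN2_INSTANCE / 1000)  # ~20.8 PFLOPS per trn2.48xlarge
print(FP8_DENSE_TFLOPS_PER_CHIP * CHIPS_PER_ULTRASERVER / 1000)    # ~83.1 PFLOPS per UltraServer
```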
Starting with Trainium 2, AWS Neuron supports Logical NeuronCore Configuration (LNC), which lets software combine the resources of two or four adjacent physical NeuronCores into a single logical NeuronCore. This feature is useful for workloads that benefit from a larger scratchpad or higher sustained memory bandwidth per logical compute unit, at the cost of reduced parallelism. LNC gives operators a tuning knob that did not exist on Trainium 1.
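In practice the logical-core grouping is selected through the Neuron runtime's environment configuration. The sketch below is illustrative only; it assumes the NEURON_LOGICAL_NC_CONFIG variable name described in the Neuron documentation, which should be verified against the installed SDK release:

```python
import os

# Illustrative only: NEURON_LOGICAL_NC_CONFIG is assumed here to be the Neuron
# runtime variable that controls Logical NeuronCore Configuration; verify the
# exact name and accepted values against your Neuron SDK release.
#
# LNC=1 exposes every physical NeuronCore individually (maximum parallelism);
# LNC=2 fuses pairs of adjacent cores into one logical core with a larger
# scratchpad share and more memory bandwidth per logical compute unit.
os.environ.setdefault("NEURON_LOGICAL_NC_CONFIG", "2")

# Frameworks launched after this point see the fused logical cores, so a
# trn2.48xlarge (128 physical NeuronCores) would report 64 logical cores.
```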
Trainium 2 includes 16 dedicated CC-Cores per chip, compared to six in Trainium 1. These cores handle collective operations (AllReduce, AllGather, ReduceScatter, AllToAll) independently of the tensor cores, so gradient synchronization across chips proceeds in parallel with the forward and backward passes. On NVIDIA GPU clusters, collective operations share streaming multiprocessors with compute, which can create resource contention at large batch sizes. The dedicated CC-Core design is one of the architectural choices AWS made to optimize the chip for distributed training throughput rather than single-chip peak compute.
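The benefit of dedicated collective hardware is easiest to see at the framework level, where gradient synchronization can be launched asynchronously and overlapped with the rest of the backward pass. The sketch below uses plain torch.distributed as a generic illustration of that overlap; it is not Neuron-specific, and it assumes an already-initialized process group:

```python
import torch
import torch.distributed as dist

def overlapped_grad_allreduce(model: torch.nn.Module):
    """Launch one async all-reduce per gradient, then wait once at the end.

    Generic torch.distributed illustration of compute/communication overlap
    (assumes dist.init_process_group(...) was called at startup). On
    Trainium 2 the collectives themselves execute on the CC-Cores, so the
    tensor engines are free to keep working while they are in flight.
    """
    handles = []
    for p in model.parameters():
        if p.grad is not None:
            # async_op=True returns immediately with a work handle.
            handles.append(dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, async_op=True))

    # Block only when the synchronized gradients are actually needed.
    for h in handles:
        h.wait()
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(world)  # convert the sum into an average
```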
| Specification | Value |
|---|---|
| Architecture | NeuronCore-v3 |
| NeuronCores per chip | 8 |
| FP8 dense compute | 1,299 TFLOPS |
| BF16 dense compute | 667 TFLOPS |
| FP32 compute | 181 TFLOPS |
| HBM type | HBM3e |
| HBM capacity | 96 GiB |
| HBM bandwidth | 2.9 TB/s |
| DMA bandwidth | 3.5 TB/s |
| Scratchpad (SBUF) | 224 MiB |
| CC-Cores | 16 |
| NeuronLink-v3 bandwidth | 1.28 TB/s per chip |
| Thermal design power | ~500 W |
| Specification | trn2.3xlarge | trn2.48xlarge |
|---|---|---|
| Trainium 2 chips | 1 | 16 |
| NeuronCores | 8 | 128 |
| Accelerator memory (HBM) | 96 GiB | 1.5 TiB |
| HBM bandwidth | 2.9 TB/s | 46 TB/s |
| FP8 dense compute | 1.3 PFLOPs | 20.8 PFLOPs |
| System vCPUs | 12 | 192 |
| System memory | 128 GB | 2 TB |
| Instance storage | 1x 470 GB NVMe | 4x 1.92 TB NVMe |
| Network bandwidth (EFA v3) | 0.2 Tbps | 3.2 Tbps |
| EBS bandwidth | 5 Gbps | 80 Gbps |
| Specification | Trn2 UltraServer |
|---|---|
| Trainium 2 chips | 64 |
| NeuronCores | 512 |
| Accelerator memory | 6 TiB |
| HBM bandwidth | 185 TB/s |
| FP8 dense compute | 83.2 PFLOPs |
| FP8 sparse compute | 332 PFLOPs |
| Constituent instances | 4 x trn2u.48xlarge |
Each Trainium 2 chip carries 96 GiB of HBM3e across four stacks, with an aggregate bandwidth of 2.9 TB/s. Relative to the first-generation Trn1 chip (32 GiB HBM2e at 820 GB/s), this represents three times the capacity and roughly 3.5 times the bandwidth.
Physically, the chip is a multi-chiplet design in which two compute chiplets share access to all four HBM stacks. The layout creates non-uniform memory access (NUMA) characteristics: a compute chiplet accesses adjacent HBM stacks at full bandwidth but incurs a penalty when accessing stacks attached to the other chiplet. This is similar to the NUMA behavior observed in AMD's Instinct MI300X. Applications that require maximum HBM throughput benefit from NUMA-aware memory placement, and the Neuron SDK exposes controls for this at the operator level.
Trainium 2 also provides 224 MiB of on-chip scratchpad memory (SBUF), 4.7 times larger than Trainium 1's scratchpad. The SBUF is partitioned among NeuronCores and acts as a programmer-controlled tile buffer for the tensor engine, avoiding redundant HBM loads during matrix operations.
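To make the scratchpad sizing concrete, the following sketch estimates how many operand tiles fit in one core's share of the SBUF; the 128 x 512 BF16 tile shape is an illustrative assumption rather than a documented constraint:

```python
# Rough SBUF budgeting for one NeuronCore (tile shape is illustrative).
SBUF_TOTAL_MIB = 224
NEURONCORES = 8
BYTES_PER_BF16 = 2

sbuf_per_core_bytes = SBUF_TOTAL_MIB * 1024 * 1024 // NEURONCORES   # 28 MiB per core

# Example operand tile for the tensor engine: 128 x 512 BF16 values.
tile_bytes = 128 * 512 * BYTES_PER_BF16                              # 128 KiB
print(sbuf_per_core_bytes // tile_bytes)                             # ~224 such tiles fit

# Keeping operand tiles resident in SBUF across the inner loops of a tiled
# matrix multiplication is what avoids the redundant HBM loads noted above.
```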
DMA bandwidth reaches 3.5 TB/s with inline memory compression and decompression. The inline compression capability reduces effective memory footprint for certain weight formats, which is useful during mixed-precision training where intermediate activations can be stored in a compressed representation before being expanded for the backward pass.
The 16-chip Trn2 instance aggregates to 1.5 TiB of HBM at 46 TB/s combined bandwidth. Memory pooling across up to 64 chips is supported in the UltraServer configuration, enabling trillion-parameter models to distribute their weight matrices across a shared 6 TiB address space.
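The motivation for a pooled multi-terabyte domain reduces to simple arithmetic. The sketch below uses standard mixed-precision accounting (BF16 weights with FP32 master weights and Adam moments) as an assumed training recipe:

```python
# Memory footprint of a 1-trillion-parameter model under an assumed recipe:
# BF16 weights plus FP32 master weights and two FP32 Adam moments.
PARAMS = 1.0e12
TIB = 1024 ** 4

bf16_weights = PARAMS * 2                  # weights only
training_state = PARAMS * (2 + 4 + 4 + 4)  # + master weights + Adam moments

print(bf16_weights / TIB)      # ~1.8 TiB: the weights alone fit in a 6 TiB pool
print(training_state / TIB)    # ~12.7 TiB: full optimizer state still has to be
                               # sharded (ZeRO-style) across data-parallel
                               # replicas, with NeuronLink carrying the
                               # model-parallel traffic inside the domain
```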
Amazon Web Services offers two standard Trn2 instance sizes in the EC2 family, plus an UltraServer-capable variant:
The trn2.3xlarge is the entry-level instance with a single Trainium 2 chip, 12 vCPUs, 128 GB of system memory, and 96 GB of accelerator memory. It targets inference workloads, fine-tuning smaller models, and workloads that do not require multi-chip connectivity. Network bandwidth is 200 Gbps.
The trn2.48xlarge is the full-instance configuration with all 16 Trainium 2 chips enabled, 192 vCPUs, 2 TiB of system memory, 1.5 TiB of HBM, and 3.2 Tbps of Elastic Fabric Adapter (EFA) v3 network bandwidth. The 16 chips are connected in a 4x4 two-dimensional torus via NeuronLink-v3 at 1 TB/s aggregate chip-to-chip bandwidth. This instance is the building block for both standalone large-model training and for composition into UltraServer systems.
A third variant, the trn2u.48xlarge, has the same compute and memory specs as the trn2.48xlarge but includes UltraServer-capable hardware to support NeuronLink connections across server boundaries. Four trn2u.48xlarge instances combine to form one Trn2 UltraServer.
Trainium 2 instances can be reserved via Amazon EC2 Capacity Blocks for ML, which allows customers to book blocks of up to 64 instances for periods of up to six months. Capacity Blocks can be reserved up to eight weeks in advance, giving training teams predictable access without a long-term commitment.
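Programmatically, Capacity Block reservations go through the standard EC2 APIs. The sketch below assumes the describe_capacity_block_offerings and purchase_capacity_block calls behave as documented for boto3; the dates, counts, and region are placeholders:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Illustrative sketch: find and purchase a Capacity Block for Trn2 capacity.
# Assumes the EC2 Capacity Blocks for ML calls exposed by boto3
# (describe_capacity_block_offerings / purchase_capacity_block); check the
# current API reference before relying on exact parameter names.
ec2 = boto3.client("ec2", region_name="us-east-2")  # US East (Ohio)

start = datetime.now(timezone.utc) + timedelta(days=7)
offerings = ec2.describe_capacity_block_offerings(
    InstanceType="trn2.48xlarge",
    InstanceCount=4,                      # e.g. one UltraServer's worth
    StartDateRange=start,
    EndDateRange=start + timedelta(days=14),
    CapacityDurationHours=14 * 24,        # a two-week block
)

cheapest = min(offerings["CapacityBlockOfferings"],
               key=lambda o: float(o["UpfrontFee"]))
ec2.purchase_capacity_block(
    CapacityBlockOfferingId=cheapest["CapacityBlockOfferingId"],
    InstancePlatform="Linux/UNIX",
)
```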
AWS Deep Learning AMIs (DLAMIs) are available with PyTorch, JAX, and Hugging Face Transformers pre-installed and configured for Trn2, so the common training frameworks work out of the box.
The Trn2 UltraServer is a rack-scale system that connects four trn2u.48xlarge instances using NeuronLink-v3 to form a single scale-up domain with 64 Trainium 2 chips. Unlike traditional multi-server GPU clusters that rely on network-layer collectives for weight synchronization, the UltraServer exposes all 6 TiB of HBM as a shared accelerator memory pool that any chip can reach over the NeuronLink fabric, at far higher bandwidth and lower latency than network-attached memory.
This design removes the latency and bandwidth penalties of inter-server networking during the portions of training where all-reduce operations dominate. For large language models where gradient synchronization represents a significant fraction of step time, moving that communication onto the NeuronLink fabric rather than over EFA can reduce effective training time.
The UltraServer extends the chip topology from a 4x4 two-dimensional torus (within a single instance) to a 4x4x4 three-dimensional torus (across the four instances). The Z-axis connections between instances use OSFP-XD active electrical cables and provide 64 GB/s of point-to-point bandwidth per chip pair, while X- and Y-axis connections within a server run at 128 GB/s per connection. AWS deliberately avoided optical transceivers for these links, using passive and active copper cables instead. Copper cabling offers roughly 100 times better mean time to failure and far lower link-flap rates than optical transceivers, a meaningful reliability advantage in a cluster with thousands of interconnects.
Multiple UltraServers can be connected into an UltraCluster. AWS supports UltraClusters of up to 2,048 Trainium 2 chips at present, interconnected via a petabit-scale non-blocking Ethernet fabric. UltraClusters integrate with Amazon FSx for Lustre for high-throughput checkpoint storage, enabling fast save and restore of model states during long training runs.
The UltraServer was available in preview at the time of general availability launch in December 2024.
NeuronLink-v3 is the scale-up interconnect that connects Trainium 2 chips within and across servers. Unlike NVIDIA NVLink, which uses a switch topology enabling all-to-all communication between GPUs in a node, NeuronLink uses direct point-to-point connections arranged in torus configurations. Each chip has six physical NeuronLink ports: four for intra-server connections (forming the 4x4 2D torus) and two additional ports activated in UltraServer mode (extending to the 4x4x4 3D torus).
The torus topology determines which collective communication algorithms are most efficient. Ring AllReduce, which travels around the torus, achieves near-peak bandwidth utilization on a 2D or 3D torus. AWS's NeuronX Collective Communication Library (NXCCL) implements AllReduce, AllGather, ReduceScatter, and AllToAll collectives optimized for the specific torus dimensions. The library is conceptually similar to NVIDIA's NCCL but tuned for Trainium 2's topology rather than NVLink's all-to-all connectivity.
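The ring algorithm itself is compact enough to sketch in plain Python. The version below is an algorithmic illustration of the reduce-scatter and all-gather phases that a collective library schedules around a ring of point-to-point links; it is not NXCCL's implementation:

```python
def ring_allreduce(data):
    """Simulate ring AllReduce over n ranks, each holding n scalar chunks.

    data[r][c] is rank r's value for chunk c. Every step moves exactly one
    chunk across each link of the ring, so all point-to-point links carry
    traffic on every step -- the property that makes the ring algorithm a
    natural fit for a torus of direct NeuronLink connections.
    """
    n = len(data)

    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the fully
    # reduced value of chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - 1 - step) % n
            data[r][c] += data[(r - 1) % n][c]

    # Phase 2: all-gather. The reduced chunks circulate once more around the
    # ring so that every rank ends up with the complete reduced vector.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            data[r][c] = data[(r - 1) % n][c]

    return data

# Four "chips", each contributing the vector [r, r, r, r]; every chip should
# finish with [6, 6, 6, 6] (the elementwise sum 0 + 1 + 2 + 3).
print(ring_allreduce([[float(r)] * 4 for r in range(4)]))
```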
The aggregate NeuronLink-v3 bandwidth per chip is 1.28 TB/s, the sum across all active ports counted in both directions: four intra-server ports at 128 GB/s plus two inter-server ports at 64 GB/s gives 640 GB/s per direction, or 1.28 TB/s bidirectional. Within a single trn2.48xlarge, where only the four intra-server ports are active, the corresponding aggregate is roughly 1 TB/s. These figures capture total chip-to-chip capacity rather than a single-port measure.
For scale-out communication between UltraServer systems, Trn2 instances use Elastic Fabric Adapter v3, which provides 3.2 Tbps of network bandwidth per instance. EFA v3 supports AWS-designed RDMA and is used for data-parallel gradient synchronization across racks, while NeuronLink handles tensor-parallel and pipeline-parallel communication within a scale-up domain.
All Trainium 2 instances run the AWS Neuron SDK, a comprehensive software stack that includes a compiler, runtime, training libraries, inference libraries, and developer tools.
Neuron supports PyTorch and JAX as primary training frameworks. PyTorch support uses TorchDynamo bytecode capture to extract computation graphs for compilation, a significant improvement over the earlier PyTorch/XLA path used for Trainium 1, which relied on lazy tensor tracing and was prone to graph fragmentation. JAX support, which reached beta status in 2024, benefits from JAX's functional execution model and static compilation assumptions, which map well to Trainium 2's torus topology and the Neuron compiler's optimization pipeline.
For inference, Neuron supports vLLM, Hugging Face Transformers, and TensorRT-style optimizations through its inference libraries. PyTorch Lightning and Amazon SageMaker AI are supported as training orchestration layers.
The Neuron SDK integrates with OpenXLA through StableHLO and GSPMD interfaces, which means models written for TPU with JAX/XLA can be ported to Trainium 2 with relatively minor changes. GSPMD (General and Scalable Parallelization for ML Computation Graphs) enables automatic partitioning of models across chips.
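Because Neuron plugs in through OpenXLA, GSPMD-style sharding looks the same as on other XLA backends. The sketch below is generic JAX that runs on whatever devices JAX exposes; on Trn2 the Neuron plugin would supply the devices, but the partitioning pattern is unchanged:

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Generic GSPMD-style sharding through OpenXLA. This runs on whatever devices
# JAX exposes (including CPU); on Trn2 the Neuron plugin would supply the
# devices, but the partitioning pattern is the same.
devices = mesh_utils.create_device_mesh((len(jax.devices()),))
mesh = Mesh(devices, axis_names=("model",))

# Shard a weight matrix column-wise across the "model" axis; keep inputs replicated.
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))
x = jax.device_put(jnp.ones((8, 1024)), NamedSharding(mesh, P()))

@jax.jit
def forward(x, w):
    # GSPMD inserts the necessary collectives based on the operand shardings;
    # no manual communication code is written here.
    return x @ w

print(forward(x, w).shape)  # (8, 4096), sharded along the "model" axis
```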
Neuron Kernel Interface (NKI), introduced in 2024, gives performance engineers direct access to the NeuronCore-v3 instruction set architecture, memory allocation, and execution scheduling. NKI uses a tile-based programming model similar to OpenAI's Triton language for GPU kernels and allows developers to write custom operators in Python that compile to native NeuronCore instructions. This is useful for fused attention kernels, custom normalization layers, and other operators that would otherwise be decomposed into multiple passes through HBM. AWS ships a Neuron Kernel Library with production-ready, open-source NKI kernels covering common transformer operations, along with benchmarks and documentation.
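The flavor of NKI is easiest to convey with the kind of elementwise kernel used in its introductory material. The sketch below follows that style; the module paths and helper names (nki.jit, nl.load, nl.store, nl.shared_hbm) are assumed from the NKI documentation and should be checked against the installed SDK version:

```python
# Sketch of an NKI elementwise kernel, in the style of the SDK's introductory
# examples. Module paths and helpers (nki.jit, nl.load/store, nl.shared_hbm)
# are assumed from the NKI documentation; verify against your Neuron release.
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def tensor_add_kernel(a, b):
    # Allocate the kernel output in device HBM.
    out = nl.ndarray(a.shape, dtype=a.dtype, buffer=nl.shared_hbm)

    # Load the operands from HBM into the on-chip SBUF scratchpad
    # (assumes the inputs are small enough to fit in a single tile).
    a_tile = nl.load(a)
    b_tile = nl.load(b)

    # The elementwise add runs on the vector engine; because both operands
    # are resident in SBUF, no extra round trip through HBM is needed.
    nl.store(out, value=a_tile + b_tile)
    return out
```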
The Neuron compiler is built on MLIR and is open source. It accepts models in HLO, TorchScript, or traced JAX form and produces optimized binaries for NeuronCore-v3 execution. Key optimizations include operator fusion to minimize HBM traffic, tiling selection to maximize scratchpad utilization, and NUMA-aware weight placement.
Neuron Explorer provides profiling and debugging tools within the SDK, giving developers visibility into NeuronCore utilization, memory bandwidth consumption, and collective communication efficiency.
AWS's published performance figures position Trn2 against prior-generation Trainium and against GPU-based EC2 instances:
| Benchmark | Trn2 vs Trn1 | Trn2 vs P5/P5e (H100) |
|---|---|---|
| Training throughput | 4x faster | 30-40% better price-performance |
| Memory bandwidth | 4x higher | -- |
| Memory capacity | 3x larger | -- |
| Energy efficiency | ~2x better | -- |
For inference, AWS reported that Trainium 2 delivers three times higher token-generation throughput for Meta's Llama 3.1 405B model on Amazon Bedrock compared to other available cloud provider offerings at the time of launch in December 2024.
Claude 3.5 Haiku running on Trainium 2 with latency optimization delivers up to 60 percent faster inference compared to the non-latency-optimized version of the same model, according to Anthropic's November 2024 announcement.
In the SemiAnalysis analysis of the Trainium 2 architecture, the chip's arithmetic intensity is approximately 225.9 BF16 FLOP per byte, which is lower than competing accelerators such as the NVIDIA H100 (approximately 300 FLOP/byte) or Google TPU v6e (approximately 500 FLOP/byte). Lower arithmetic intensity is not inherently a disadvantage: it suits workloads such as Mixture of Experts architectures and memory-bandwidth-bound inference tasks where the balance between compute and memory access favors higher bandwidth over raw TFLOPS.
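The arithmetic-intensity comparison is a ratio of figures already quoted above, and it feeds directly into a roofline-style test for whether a kernel is compute- or bandwidth-bound. The sketch below recomputes the ratio from this article's table values (the ~226 FLOP/byte figure cited above uses slightly different base numbers); the example workloads are illustrative:

```python
# Machine balance (roofline "ridge point") from the figures quoted above.
BF16_TFLOPS = 667
HBM_TB_PER_S = 2.9

machine_balance = (BF16_TFLOPS * 1e12) / (HBM_TB_PER_S * 1e12)
print(machine_balance)   # ~230 BF16 FLOP per byte of HBM traffic

def bound_by(flops, bytes_moved):
    """A kernel whose operational intensity is below the machine balance is
    limited by HBM bandwidth; above it, by the compute engines."""
    return "memory-bound" if flops / bytes_moved < machine_balance else "compute-bound"

# Decode-time attention and MoE layers move many bytes per FLOP, which is why
# a lower ridge point (more bandwidth per unit of compute) suits them.
print(bound_by(flops=2e9, bytes_moved=5e7))    # intensity 40   -> memory-bound
print(bound_by(flops=2e12, bytes_moved=1e9))   # intensity 2000 -> compute-bound
```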
The following table compares Trainium 2 specifications with NVIDIA's H100 SXM5, H200 SXM, and Blackwell B200 accelerators:
| Specification | Trainium 2 (chip) | NVIDIA H100 SXM5 | NVIDIA H200 SXM | NVIDIA B200 |
|---|---|---|---|---|
| FP8 dense compute | 1,299 TFLOPS | 1,979 TFLOPS | 1,979 TFLOPS | 4,500 TFLOPS |
| BF16 dense compute | 667 TFLOPS | 989 TFLOPS | 989 TFLOPS | 2,250 TFLOPS |
| FP32 compute | 181 TFLOPS | 67 TFLOPS | 67 TFLOPS | 160 TFLOPS |
| HBM capacity | 96 GiB | 80 GiB | 141 GiB | 192 GiB |
| HBM bandwidth | 2.9 TB/s | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s |
| Scale-up interconnect | NeuronLink-v3 (1.28 TB/s) | NVLink 4 (0.9 TB/s) | NVLink 4 (0.9 TB/s) | NVLink 5 (1.8 TB/s) |
| TDP | ~500 W | 700 W | 700 W | 1,000 W |
Note: NVIDIA figures are dense Tensor Core throughput with structured sparsity disabled. Trainium 2 figures are from AWS Neuron documentation; H100 and H200 values are for SXM variants; B200 values are from NVIDIA's published product brief.
Trainium 2's raw TFLOPS at FP8 and BF16 are lower than the H100, H200, and B200. AWS argues that the relevant metric for training cost is throughput per dollar rather than peak TFLOPS, and that Trainium 2's combination of lower on-demand pricing, HBM capacity, and the scale-up topology of the UltraServer produces a better cost-per-trained-token outcome for large models.
From a software ecosystem standpoint, NVIDIA's CUDA platform has a much larger library of hand-optimized kernels, profiling tools, and community examples than the Neuron SDK. AWS has invested in closing this gap through NKI, open-source Neuron compiler development, and partnerships with framework teams at PyTorch and Hugging Face. NVIDIA's larger scale-up domain in GB200 NVL72 systems (72 GPUs sharing NVLink) gives Blackwell-generation clusters more flexibility in tensor-parallel degree, which can reduce inference costs at large model sizes.
The most consequential deployment of Trainium 2 is Project Rainier, an AI compute cluster built by AWS for Anthropic to train and serve the Claude model family.
AWS detailed Project Rainier at AWS re:Invent in December 2024, describing a cluster with hundreds of thousands of Trainium 2 chips spread across multiple U.S. data centers. Project Rainier became fully operational within approximately one year of its initial announcement, a deployment pace AWS described as historically fast for infrastructure at this scale. The cluster activated with nearly 500,000 Trainium 2 chips and AWS expected Anthropic to scale to more than one million Trainium 2 chips across training and inference workloads by end of 2025.
Project Rainier is named after Mount Rainier, the 4,392-meter stratovolcano visible from Amazon's Seattle headquarters on clear days. The cluster includes data center facilities in St. Joseph County, Indiana and other U.S. sites totaling over 400 MW of initial power capacity, with expansion plans to over 1,000 MW. The Indiana facility alone represents an investment of approximately $11 billion.
The Rainier cluster delivers more than five times the compute Anthropic used to train its previous generation of Claude models. This scale reflects a significant bet by both companies: AWS is building and operating the infrastructure, while Anthropic has named AWS its primary training partner and committed major Claude training workloads to Trainium 2. Amazon's total investment in Anthropic reached $8 billion by late 2024, and Anthropic committed to spending more than $100 billion on AWS compute over the following decade.
Anthropic's engineering teams work closely with Annapurna Labs on hardware co-design, providing feedback from Claude training runs to influence the design of future Trainium generations. This collaboration means that the characteristics of frontier LLM training at Anthropic's scale directly inform Trainium architecture decisions.
Project Rainier was not publicly disclosed before the re:Invent 2024 announcements. At that time, AWS CEO Andy Jassy said Trainium was "fully subscribed" and described it as a "multibillion-dollar business that grew 150% quarter-over-quarter."
Amazon Bedrock is the fully managed service through which AWS customers access foundation models from Anthropic, Meta, Mistral, and other providers via API. Trainium 2 powers two distinct capabilities in Bedrock: latency-optimized inference for specific models and background training infrastructure.
Bedrock's latency-optimized inference tier uses Trainium 2 to accelerate models that benefit from the chip's high memory bandwidth and capacity. At launch in December 2024, latency-optimized Trainium 2 inference was available for Anthropic's Claude 3.5 Haiku and Meta's Llama 3.1 70B and 405B models.
Latency-optimized Trainium 2 inference is available in the US East (Ohio) region via cross-region inference, and more regions are planned.
Beyond the publicly visible inference tier, Trainium 2 runs workloads for Bedrock model providers, including training for Anthropic's Claude 3 and 3.5 model series. More than 50 percent of token throughput on Amazon Bedrock ran on Trainium hardware by 2025, the majority of it on Trainium 2.
Claude 3.5 Haiku accessed through Bedrock with latency optimization on Trainium 2 is priced at $1.00 per million input tokens and $5.00 per million output tokens in the US East (Ohio) region as of the launch in late 2024. Standard (non-latency-optimized) Claude 3.5 Haiku is priced at $0.80 per million input tokens and $4.00 per million output tokens across all regions.
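The premium for the latency-optimized tier works out to straightforward per-request arithmetic. The sketch below uses the prices quoted above; the request shape is an illustrative assumption:

```python
# Per-request cost for Claude 3.5 Haiku on Bedrock, using the prices above
# (USD per million tokens); the request shape is an illustrative assumption.
STANDARD = {"in": 0.80, "out": 4.00}
LATENCY_OPTIMIZED = {"in": 1.00, "out": 5.00}   # Trainium 2-backed tier

def request_cost(prices, input_tokens, output_tokens):
    return (input_tokens * prices["in"] + output_tokens * prices["out"]) / 1e6

# Example: 10,000 input tokens and 2,000 output tokens per request.
std = request_cost(STANDARD, 10_000, 2_000)
fast = request_cost(LATENCY_OPTIMIZED, 10_000, 2_000)
print(std, fast, fast / std)   # ~$0.016 vs ~$0.020: a 25% price premium for
                               # up to 60% faster token generation
```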
EC2 Trn2 pricing as of launch in December 2024:
| Instance | On-demand price |
|---|---|
| trn2.48xlarge | $0.125 per hour |
AWS did not publish trn2.3xlarge on-demand pricing at launch; the Capacity Blocks for ML reservation mechanism applies to blocks of trn2.48xlarge instances.
Spot pricing for trn2.48xlarge has been observed at approximately $8.72 per hour, a reflection of how scarce Trainium 2 capacity was around launch. Reserved Instance and Savings Plans pricing for Trn2 were not available at GA launch, which is typical for newly released instance families before demand patterns are established.
AWS claims Trn2 instances deliver 30 to 40 percent better price-performance than P5e and P5en instances (which use NVIDIA H100 SXM5 GPUs), citing internal benchmark results for large language model training workloads.
AWS announced Trainium 3 at AWS re:Invent 2024, one year after announcing Trainium 2, establishing an annual chip cadence. Trainium 3 reached general availability at AWS re:Invent 2025 in December 2025.
Trainium 3 is built on a 3 nm process, compared with Trainium 2's 5 nm node. AWS quoted its headline Trainium 3 specifications at UltraServer scale rather than per chip:
The Trn3 UltraServer scales to 144 chips, delivering 362 FP8 PFLOPs, 20.7 TiB of HBM, and 706 TB/s of aggregate memory bandwidth. AWS claims Trn3 delivers 4.4 times higher performance and 3.9 times higher memory bandwidth than Trn2 UltraServers, with four times better performance per watt.
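Dividing those UltraServer totals by the chip count gives the implied per-chip figures, and taking ratios against the Trn2 UltraServer recovers the claimed generational gains. This is a derivation from the numbers above, not an official per-chip specification:

```python
# Implied per-chip Trainium 3 figures, derived from the UltraServer totals
# quoted above (a derivation, not an official per-chip spec sheet).
TRN3_ULTRA = {"chips": 144, "fp8_pflops": 362, "hbm_tib": 20.7, "bw_tb_s": 706}
TRN2_ULTRA = {"chips": 64, "fp8_pflops": 83.2, "hbm_tib": 6.0, "bw_tb_s": 185}

print(TRN3_ULTRA["fp8_pflops"] / TRN3_ULTRA["chips"] * 1000)  # ~2,514 TFLOPS FP8 per chip
print(TRN3_ULTRA["hbm_tib"] / TRN3_ULTRA["chips"] * 1024)     # ~147 GiB HBM per chip
print(TRN3_ULTRA["bw_tb_s"] / TRN3_ULTRA["chips"])            # ~4.9 TB/s per chip

# Generation-over-generation ratios at UltraServer scale.
print(TRN3_ULTRA["fp8_pflops"] / TRN2_ULTRA["fp8_pflops"])    # ~4.4x performance
print(TRN3_ULTRA["hbm_tib"] / TRN2_ULTRA["hbm_tib"])          # ~3.5x HBM capacity
print(TRN3_ULTRA["bw_tb_s"] / TRN2_ULTRA["bw_tb_s"])          # ~3.8x memory bandwidth
```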
At the same re:Invent 2025 event, AWS previewed Trainium 4, targeting late 2026 or early 2027 availability. Trainium 4 is expected to deliver three times the FP8 performance and four times the memory bandwidth of Trainium 3, and will support NVIDIA NVLink Fusion, enabling heterogeneous clusters in which Trainium 4 chips and NVIDIA Blackwell-generation GPUs share an NVLink fabric.
Anthropic is by far the largest Trainium 2 customer, using roughly 500,000 chips under Project Rainier at the cluster's activation and scaling toward one million chips by end of 2025.
Other early Trainium 2 customers named around launch include Databricks and poolside, and Hugging Face has supported Trn2 through its Optimum Neuron integration.
AWS CEO Andy Jassy acknowledged in late 2024 that Trainium 2 was "fully subscribed" at launch, meaning demand exceeded available capacity. AWS indicated it expected to be able to accommodate a broader set of enterprise customers with Trainium 3 in 2025 and 2026.