AWS Trainium
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,475 words
AWS Trainium is a family of custom machine learning accelerator chips designed by Annapurna Labs for Amazon Web Services. The chips are purpose-built for training and, increasingly, for serving large neural networks. Trainium sits alongside Inferentia (focused on inference) inside the broader AWS silicon stack that also includes the Graviton CPU line and the Nitro virtualization system. AWS unveiled the first Trainium at the December 2020 re:Invent keynote, made the corresponding Trn1 EC2 instances generally available in October 2022, announced Trainium2 at re:Invent 2023, and rolled Trn2 instances out broadly during 2024. Trainium3, fabricated on a TSMC 3 nm class process, was previewed at re:Invent 2024 and reached general availability through Trn3 UltraServers announced at re:Invent 2025 on December 2, 2025.[1][2][3]
The chips are best known publicly for their role in Anthropic's Claude development, where they power Project Rainier, an EC2 UltraCluster centered on Indiana that runs nearly half a million Trainium2 chips and is being expanded to more than a million chips during 2026. The chips are also used inside Amazon for Search, Amazon Bedrock latency-optimized inference, and the Amazon Nova family of foundation models.[4][5]
Trainium is the work of Annapurna Labs, an Israeli chip design firm founded in 2011 by Hrvoje Bilic, Nafea Bshara, and Ronen Boneh. AWS bought Annapurna Labs in January 2015 for a reported $350 to $370 million, and the team has since produced the chips that quietly underpin most of AWS's modern infrastructure. The same group built the Nitro hypervisor and offload cards (launched November 2017), the Graviton ARM CPU line, and the Inferentia inference accelerator that shipped on Inf1 instances in December 2019. Trainium is the company's first custom training silicon, and the Inferentia2 inference chip released in 2023 shares the same NeuronCore-v2 design.[6]
The pitch from the start was straightforward. GPUs from Nvidia were the de facto training hardware, but they were expensive and supply constrained, and a hyperscaler that owns its own datacenters can save a lot of money by designing the silicon, the rack, the network, and the firmware as one system. Google had already proven this thesis with the Cloud TPU. AWS wanted its own version, and Annapurna gave it the team to build one.
AWS now operates four custom chip families designed by Annapurna Labs:
| Chip family | Purpose | First shipped |
|---|---|---|
| Nitro | Hypervisor, networking, storage offload | November 2017 |
| Graviton | ARM-based general purpose CPUs | November 2018 |
| Inferentia | Low-cost inference accelerator | December 2019 (Inf1) |
| Trainium | Training accelerator | October 2022 (Trn1) |
In practice, an EC2 host running a Trn2 instance has a Graviton control processor, Nitro cards handling network and storage offload, and a tray of Trainium chips connected by NeuronLink for the actual model math. Inferentia2 and Trainium are close cousins. They share the same NeuronCore-v2 core design and software stack, with Inferentia2 tuned for cheaper inference and Trainium tuned for the higher memory bandwidth and interconnect demands of model training.
Andy Jassy announced the original Trainium during the AWS re:Invent 2020 keynote on December 1, 2020. The first generation is built on a 7 nm process, contains roughly 55 billion transistors, and packs two NeuronCore-v2 cores per chip. Each NeuronCore-v2 contains four engines that handle different parts of a typical neural network workload: a tensor engine built on a power-optimized systolic array for matrix multiplication and convolution, a vector engine for normalization and pooling, a scalar engine for elementwise operations like ReLU, and a general-purpose SIMD (GPSIMD) engine of eight programmable 512-bit cores that lets developers write custom kernels in C++.[7]
The chip supports a wide range of numeric formats: cFP8 (a configurable 8-bit float introduced with NeuronCore-v2), FP16, BF16, TF32, FP32, INT8, INT16, and INT32. Each NeuronCore-v2 tensor engine delivers more than 90 TFLOPS of FP16 or BF16 compute, and a single Trainium chip is rated at 210 FP16/BF16/cFP8/TF32 TFLOPS, 52.5 FP32 TFLOPS, and 420 INT8 TOPS. Each chip carries 32 GB of high bandwidth memory.[7]
AWS made Trn1 generally available on October 10, 2022, in the US East (N. Virginia) and US West (Oregon) regions. The lineup launched with three sizes:[8]
| Instance | Trainium chips | NeuronCores | vCPUs | Instance memory | Accelerator memory | Local NVMe | Network bandwidth |
|---|---|---|---|---|---|---|---|
| trn1.2xlarge | 1 | 2 | 8 | 32 GiB | 32 GB | 0.5 TB | up to 12.5 Gbps |
| trn1.32xlarge | 16 | 32 | 128 | 512 GiB | 512 GB | 8 TB | 800 Gbps EFAv2 |
| trn1n.32xlarge | 16 | 32 | 128 | 512 GiB | 512 GB | 8 TB | 1,600 Gbps EFAv2 |
The full size, trn1.32xlarge, delivers up to 3 PFLOPS of FP16 / BF16 compute and 9.8 TB/s of aggregate HBM bandwidth across its 16 chips, which are stitched together with NeuronLink-v2 in a 2D torus. The trn1n variant doubles the external networking to 1.6 Tbps for customers who need to spread training across many instances. AWS exposes scale through its EC2 UltraCluster fabric, where Trn1 and Trn1n instances can be combined into pods of more than 100,000 chips connected by petabit-scale Elastic Fabric Adapter networking. List on-demand pricing at launch was $1.34 per hour for trn1.2xlarge, $21.50 per hour for trn1.32xlarge, and $24.78 per hour for trn1n.32xlarge.[8]
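As a rough illustration of what a 2D torus means for 16 chips, the sketch below lists each chip's direct neighbours and the hop count between two chips. The 4 x 4 arrangement, the row-major chip numbering, and the function names are assumptions made for this example; the actual NeuronLink-v2 wiring order is not described here.

```python
def torus_neighbors(chip, rows=4, cols=4):
    """Return the chips directly linked to `chip` in a rows x cols 2D torus.

    Chips are numbered row-major (a hypothetical numbering, not AWS's).
    """
    r, c = divmod(chip, cols)
    return sorted({
        ((r - 1) % rows) * cols + c,  # up, wrapping around the top edge
        ((r + 1) % rows) * cols + c,  # down, wrapping around the bottom edge
        r * cols + (c - 1) % cols,    # left, wrapping around the left edge
        r * cols + (c + 1) % cols,    # right, wrapping around the right edge
    })


def torus_hops(a, b, rows=4, cols=4):
    """Minimum number of links between chips a and b, using wrap-around."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    dr = abs(ra - rb)
    dc = abs(ca - cb)
    return min(dr, rows - dr) + min(dc, cols - dc)
```

Under these assumptions every chip has exactly four neighbours, and no two chips are more than four hops apart, which is why a torus keeps collective traffic short without a central switch.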
Launch customers and partners highlighted by AWS included PyTorch (with native support added to the framework), Amazon Search, the protein structure prediction firm HeliXon, the Japanese fintech Money Forward, and the AI productivity startup Magic.
Trainium2 was announced at re:Invent 2023 and previewed throughout 2024 before reaching general availability on December 3, 2024 in the US East (Ohio) region. The chip is fabricated on a 5 nm process. Each Trainium2 chip contains eight NeuronCores and 96 GB of HBM with 2.9 TB/s of bandwidth per chip. AWS rates a single Trainium2 chip at up to 1.3 PFLOPS of dense FP8 and up to 5.2 PFLOPS of sparse FP8 compute. Sparsity in this generation is hardware accelerated using a 16:4 pattern (four nonzero values per group of sixteen), giving roughly a 4x throughput uplift on suitable models.[9][10]
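The 16:4 pattern can be sketched in a few lines of plain Python. This example keeps the four largest-magnitude values in every group of sixteen and zeroes the rest; choosing survivors by magnitude is an assumption (a common pruning heuristic), and `prune_4_of_16` is a hypothetical helper, not part of the Neuron SDK.

```python
def prune_4_of_16(weights, group=16, keep=4):
    """Zero all but the `keep` largest-magnitude values in each group.

    Magnitude-based selection is an illustrative assumption; the text above
    only fixes the pattern (4 nonzeros per 16), not how they are chosen.
    """
    if len(weights) % group != 0:
        raise ValueError("length must be a multiple of the group size")
    pruned = []
    for start in range(0, len(weights), group):
        block = weights[start:start + group]
        # Indices of the `keep` largest magnitudes in this block.
        order = sorted(range(group), key=lambda i: abs(block[i]), reverse=True)
        keep_idx = set(order[:keep])
        pruned.extend(v if i in keep_idx else 0.0 for i, v in enumerate(block))
    return pruned
```

With only 4 of every 16 values left to multiply, the ideal work reduction is 16 / 4 = 4x, which lines up with the ratio between the 1.3 PFLOPS dense and 5.2 PFLOPS sparse ratings.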
The Trn2 family is sold in two main shapes plus the UltraServer configuration:
| Instance | Trainium2 chips | NeuronCores | vCPUs | Instance memory | Accelerator memory | HBM bandwidth | Network bandwidth |
|---|---|---|---|---|---|---|---|
| trn2.48xlarge | 16 | 128 | 192 | 2 TiB | 1.5 TB | 46.4 TB/s | 3.2 Tbps EFAv3 |
| trn2u.48xlarge (UltraServer node) | 16 | 128 | 192 | 2 TiB | 1.5 TB | 46.4 TB/s | 3.2 Tbps EFAv3 |
A full trn2.48xlarge is rated at 20.8 PFLOPS of dense FP8 and around 83.2 PFLOPS of sparse FP8.[9][10]
The Trn2 UltraServer is a physical product introduced alongside Trn2: four trn2u.48xlarge nodes (64 Trainium2 chips total) are wired together with NeuronLink in a high-bandwidth ring, exposing them to the operating system as a single logical machine with 512 NeuronCores, 6 TB of HBM, 185 TB/s of aggregate HBM bandwidth, and 12.8 Tbps of EFAv3 networking. AWS uses the same UltraServer building block to construct EC2 UltraClusters, the largest of which is Project Rainier (described below). For shorter bookings, customers reserve UltraServers and instances through Amazon EC2 Capacity Blocks for ML.[9][10]
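The instance and UltraServer totals follow directly from the per-chip figures. The short sketch below multiplies them out; `ChipSpec` and `aggregate` are names invented for this example, and the only inputs are the per-chip numbers quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipSpec:
    dense_fp8_pflops: float
    hbm_gb: int
    hbm_tbps: float
    neuron_cores: int


# Per-chip Trainium2 figures as quoted in the text above.
TRAINIUM2 = ChipSpec(dense_fp8_pflops=1.3, hbm_gb=96, hbm_tbps=2.9, neuron_cores=8)


def aggregate(spec, chips):
    """Scale per-chip figures linearly to a group of `chips` accelerators."""
    return {
        "dense_fp8_pflops": round(spec.dense_fp8_pflops * chips, 1),
        "hbm_tb": round(spec.hbm_gb * chips / 1024, 2),
        "hbm_tbps": round(spec.hbm_tbps * chips, 1),
        "neuron_cores": spec.neuron_cores * chips,
    }


trn2_instance = aggregate(TRAINIUM2, 16)     # one trn2.48xlarge
trn2_ultraserver = aggregate(TRAINIUM2, 64)  # four trn2u.48xlarge nodes
```

For 16 chips this gives 20.8 PFLOPS, 1.5 TB, 46.4 TB/s, and 128 NeuronCores; for 64 chips it gives 6 TB of HBM, about 185.6 TB/s, and 512 NeuronCores, matching the published totals after rounding.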
AWS publishes two performance comparisons that recur in marketing materials. Trn2 delivers about 4x the performance, 4x the memory bandwidth, and 3x the memory capacity of Trn1, and AWS claims it offers 30 to 40 percent better price performance than NVIDIA H100 based P5e and P5en instances on the workloads AWS measured. Both numbers are AWS internal benchmarks rather than third-party MLPerf submissions; as of MLPerf Training v5.1 results published in November 2025, AWS did not appear among the 20 submitting organizations.[10][11]
AWS first showed Trainium3 silicon at re:Invent 2024 and then announced general availability of Trn3 UltraServers at re:Invent 2025 on December 2, 2025, when CEO Matt Garman delivered the keynote. Trainium3 is fabricated on TSMC's 3 nm class process (N3P) and is implemented as a dual-chiplet accelerator. Each chip integrates eight NeuronCore-v4 cores (four per chiplet) with 32 MiB of local SRAM per core, supports systolic arrays sized 128x128 for BF16 and 512x128 for MXFP8, and delivers 2.52 PFLOPS of MXFP8/MXFP4 compute, 671 dense (2,517 sparse) BF16/FP16/TF32 TFLOPS, and 183 TFLOPS of FP32 per chip. Each chip ships with 144 GB of HBM3e across four stacks at 4.9 TB/s of bandwidth, and NeuronLink-v4 contributes 2.56 TB/s of inter-device bandwidth.[1][2][3][12]
Trainium3 adds support for microscaling formats, including MXFP8 and MXFP4, alongside the existing BF16, FP16, FP8, and FP32 paths, and AWS exposes a user-programmable rounding mode. The architecture also adds 16 dedicated collective communication cores (CC-Cores) so that all-reduce and all-gather steps do not contend with the tensor engines, and it supports Logical NeuronCore Configuration (LNC) for pooling resources across cores.[12]
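The core idea behind microscaling formats is that a small block of values shares one power-of-two scale, so each element needs only a few bits. The sketch below shows that block-scaling step under loud simplifications: elements are stored as signed integers rather than true FP8 or FP4 values, the block size of 32 follows the OCP microscaling specification, and `mx_quantize` / `mx_dequantize` are hypothetical helpers, not Neuron APIs.

```python
import math

BLOCK = 32  # elements that share one scale in the OCP microscaling formats


def mx_quantize(block, elem_bits=8):
    """Quantize one block to (shared_exponent, integer elements).

    Simplification: elements are signed integers, not FP8/FP4 encodings.
    """
    qmax = 2 ** (elem_bits - 1) - 1
    amax = max(abs(v) for v in block)
    if amax == 0:
        return 0, [0] * len(block)
    # Smallest power-of-two scale such that amax / scale fits in [-qmax, qmax].
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    # Clamp guards against floating-point edge cases in the log2 above.
    return exp, [max(-qmax, min(qmax, round(v / scale))) for v in block]


def mx_dequantize(exp, elems):
    """Recover approximate real values from the shared exponent and elements."""
    scale = 2.0 ** exp
    return [e * scale for e in elems]
```

Because the scale is a power of two, applying it is an exponent adjustment rather than a multiply, and the per-element rounding error in this sketch is bounded by half the shared scale.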
Trn3 UltraServers ship in two configurations:[1][3]
| Configuration | Chips per UltraServer | Cooling | Peak FP8 / MXFP8 | Aggregate HBM3e | Aggregate HBM bandwidth |
|---|---|---|---|---|---|
| Trn3 UltraServer (Gen1) | 64 | Air-cooled | ~161 PFLOPS | ~9.2 TB | ~314 TB/s |
| Trn3 UltraServer (Gen2) | 144 | Liquid-cooled | ~362 PFLOPS | ~20.7 TB | ~706 TB/s |
The Gen2 UltraServer extends scale-up beyond the 64-chip topology that Trn2 used, putting AWS's rack-scale system roughly on par with NVIDIA's GB300 NVL72 rack on aggregate FP8 throughput. Its chips are connected by NeuronSwitch-v1, an all-to-all fabric that AWS says roughly doubles the inter-chip interconnect bandwidth of the previous generation, and the UltraServers are exposed through EC2 UltraClusters 3.0.[1][3]
AWS markets the Trn3 UltraServer at up to 4.4x higher peak compute, 3.9x higher memory bandwidth, and over 4x better energy efficiency than a Trn2 UltraServer. At re:Invent 2025 Matt Garman also previewed Trainium4 as the next generation, with AWS stating it will deliver at least 3x the FP8 processing power and 4x more memory bandwidth than Trainium3, and describing it publicly as a redesign that scales out across many smaller dies rather than building a larger monolithic chip.[1][13]
Trainium3 was co-designed with Anthropic. Trade press has reported that Anthropic engineers communicate with Annapurna Labs daily and contribute feedback from Claude training runs that shapes future chip design.
Every generation of Trainium and Inferentia uses the same basic building block: the NeuronCore. The design has gone through four major revisions in production. NeuronCore-v1 powered the original Inferentia. NeuronCore-v2, introduced with Trainium1 and Inferentia2, added the four-engine layout that has shaped every subsequent generation. NeuronCore-v3 ships in Trainium2 and reorganizes the same engines for higher density and lower precision throughput. NeuronCore-v4 ships in Trainium3 and adds support for microscaling MX formats.[7][10][12]
A single NeuronCore-v2 contains:
| Engine | Role | Notes |
|---|---|---|
| Tensor engine | GEMM, convolution, transpose | Power-optimized systolic array, accepts cFP8/FP16/BF16/TF32/FP32/INT8 inputs and accumulates in FP32 or INT32 |
| Vector engine | Normalization, pooling, softmax-style ops | About 10x faster than NeuronCore-v1 |
| Scalar engine | Elementwise ops like ReLU and biases | Around 2.9 TFLOPS of FP32, 3x NeuronCore-v1 |
| GPSIMD engine | Custom kernels in C++ | Eight 512-bit programmable cores per NeuronCore |
The GPSIMD engine is the part developers can program directly, through the Neuron Kernel Interface (NKI). NKI is roughly analogous to writing a CUDA kernel for an Nvidia GPU. It is how Anthropic and other heavy users push the chip past what the compiler can generate on its own.
NeuronCore-v3, introduced in Trainium2, splits each chip into eight cores instead of two and reorganizes the engines for the FP8 / sparse-FP8 numerics that dominate modern transformer training. A trn2.48xlarge, with its 16 chips, therefore exposes 128 NeuronCores. NeuronCore-v4, introduced with Trainium3, continues this trajectory and adds the new microscaling MX formats, larger systolic arrays, dedicated SRAM, and the dedicated collective communication cores.[12]
Trainium chips are connected at three different levels.
Inside an instance, NeuronLink (v2 in Trn1, v3 in Trn2, v4 in Trn3) wires the chips into a 2D torus or ring at hundreds of gigabytes per second per chip; NeuronLink-v4 in Trainium3 reaches 2.56 TB/s per device. Inside an UltraServer, the same NeuronLink fabric extends across multiple physical instances so that 64 or 144 chips look to software like a single machine. Trn3 Gen2 UltraServers add NeuronSwitch-v1, an all-to-all fabric that AWS describes as roughly doubling inter-chip bandwidth versus Trn2. Above that, instances are stitched into EC2 UltraClusters with Elastic Fabric Adapter (EFAv2 on Trn1, EFAv3 on Trn2 onward, exposed as UltraClusters 3.0 with Trn3), an AWS-specific RDMA transport that runs over the regular Nitro network and bypasses the kernel for low-latency, lossless collective communication. UltraClusters scale to tens or hundreds of thousands of chips. Project Rainier is currently the largest AWS has ever deployed, at 70 percent larger than any prior AWS AI cluster.[4][10][12]
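To show the kind of collective these links carry, here is a small simulation of ring all-reduce, a standard algorithm in which each device only ever talks to its next neighbour. Nothing above states which algorithm the Neuron collectives library actually uses, so treat this as a generic illustration; `ring_all_reduce` is a hypothetical helper, and for brevity each device holds exactly one element per chunk.

```python
def ring_all_reduce(values):
    """Simulate a sum all-reduce over a ring of len(values) devices.

    values[d] is device d's vector; this sketch requires its length to equal
    the number of devices so that every chunk is a single element.
    """
    n = len(values)
    data = [list(v) for v in values]
    if any(len(v) != n for v in data):
        raise ValueError("this sketch needs one element per device per vector")
    # Reduce-scatter: after n-1 steps device d holds the full sum of chunk (d+1) % n.
    for step in range(n - 1):
        sends = [(d, (d - step) % n, data[d][(d - step) % n]) for d in range(n)]
        for d, chunk, payload in sends:
            data[(d + 1) % n][chunk] += payload
    # All-gather: pass each finished chunk around the ring until everyone has it.
    for step in range(n - 1):
        sends = [(d, (d + 1 - step) % n, data[d][(d + 1 - step) % n]) for d in range(n)]
        for d, chunk, payload in sends:
            data[(d + 1) % n][chunk] = payload
    return data
```

Each device sends and receives 2 * (n - 1) chunks in total regardless of ring size, which is why per-link bandwidth, rather than device count, tends to dominate all-reduce time.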
This multi-tier structure mirrors the way Google describes its TPU pods (chip, board, slice, pod) and is the source of most of Trainium's headline numbers. A 64-chip or 144-chip UltraServer is the natural unit for a single training job that fits inside one tensor parallel domain, while UltraClusters give pipeline and data parallel scale.
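One way to picture that split is to map every chip's global rank to a (data, pipeline, tensor) coordinate with the tensor-parallel index varying fastest, so that each tensor-parallel group lands inside one UltraServer. The ordering convention and both function names are assumptions for illustration, not something AWS prescribes.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global chip rank to (dp_index, pp_index, tp_index).

    The tensor-parallel index varies fastest (an assumed, commonly used layout).
    """
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    rest, tp_idx = divmod(rank, tp)
    dp_idx, pp_idx = divmod(rest, pp)
    return dp_idx, pp_idx, tp_idx


def ultraserver_of(rank, chips_per_ultraserver=64):
    """Index of the UltraServer that hosts a given chip rank."""
    return rank // chips_per_ultraserver
```

With `tp=64`, ranks 64 through 127 form one tensor-parallel group and all sit in UltraServer 1, so their frequent collectives stay on NeuronLink while pipeline and data-parallel traffic crosses EFA.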
The Neuron SDK is what makes the hardware usable. It is split into a runtime, a compiler, and a set of framework integrations.
The Neuron Compiler is an XLA-based graph compiler that ingests models from PyTorch (via XLA), TensorFlow, JAX, and MXNet, lowers them to a Neuron-specific intermediate representation, and emits binaries the chip can run. It handles tiling for the SRAM hierarchy, scheduling across the engines inside each NeuronCore, and collective placement. Most users never touch it directly.
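Tiling is easiest to see on a matrix multiply. The plain-Python sketch below computes the product one small block at a time, which is the general shape of what a compiler does so that each working set fits in fast on-chip memory. It is a conceptual illustration only: the tile size is arbitrary and `tiled_matmul` has no connection to the Neuron compiler's actual passes.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply matrices (lists of rows) block by block.

    `tile` is an arbitrary illustrative size, not a hardware parameter.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Only a tile x tile slice of a, b, and out is touched here;
                # on an accelerator that slice would live in on-chip memory.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = out[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        out[i][j] = acc
    return out
```

The result is identical to an untiled multiply; only the order of memory accesses changes, which is what lets partial sums accumulate while operands stay resident in fast memory.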
PyTorch on Neuron has shifted over the lifetime of the SDK. The original torch-neuronx package wraps PyTorch/XLA and is the most battle-tested path for distributed training. The newer TorchNeuron Native backend, added in 2025, provides eager execution, torch.compile, and the standard distributed APIs (FSDP, DDP, DTensor, tensor parallel) directly on Trainium. AWS now positions TorchNeuron Native as the recommended starting point for new workloads.
JAX on Neuron uses the same XLA compiler path. JAX programs lower to HLO, which the Neuron compiler then targets to the chip. AWS has shipped reference implementations of large language model pretraining in both PyTorch and JAX.
Hugging Face integration comes via the optimum-neuron library, an open source Hugging Face project that lets Transformers users fine-tune and serve models on Trainium and Inferentia without rewriting their training code. AWS and Hugging Face also publish an aws-neuron model namespace with precompiled artifacts.
The Neuron Kernel Interface (NKI) is the lowest-level developer surface. It exposes the GPSIMD engine and the tensor engine in a Python-embedded DSL that compiles down to chip instructions. NKI is the equivalent of writing a CUDA kernel and is the path most heavy users take when the compiler is leaving performance on the table.
Amazon SageMaker has first-class Trainium support: SageMaker training jobs and HyperPod clusters can be launched directly on Trn1, Trn2, and Trn3 instances, and SageMaker JumpStart hosts ready-to-train recipes for popular open models on Neuron.
The Neuron stack also integrates with Kubernetes (EKS), ECS, Ray, and Slurm-based HPC schedulers, and supports container images via the AWS Deep Learning Containers.
Project Rainier is the largest Trainium deployment in the world. It is a multi-site EC2 UltraCluster built primarily for Anthropic and used to train and serve Claude. AWS announced the project in 2024 and brought the first phase online in less than a year. The flagship site is an $11 billion campus in St. Joseph County, Indiana, near New Carlisle, that broke ground in October 2024 and was activated in October 2025. The campus spans roughly 1,200 acres and is engineered for outside-air cooling, requiring zero water for cooling October through March and minimal water April through September. Other Project Rainier capacity is spread across additional US sites.[4][14]
AWS describes the active Project Rainier deployment as nearly 500,000 Trainium2 chips, with plans to scale beyond one million Trainium2 chips by year end 2026 across both training and inference workloads. The cluster gives Anthropic more than five times the compute it used to train its previous generation of Claude models. AWS calls Project Rainier 70 percent larger than any other AI compute platform in the company's history.[4]
The physical building block at every site is the Trn2 UltraServer (four trn2u.48xlarge instances, 64 chips, NeuronLink) connected upward by EFAv3. The custom NeuronLink topology and the Nitro-managed network are the reason AWS can stand up clusters of this size on its own silicon.
Project Rainier should not be confused with Project Ceiba, the joint AWS / Nvidia supercomputer announced at re:Invent 2023, which is built from Nvidia GH200 Grace Hopper Superchips and Nvidia Blackwell systems and runs on AWS Nitro and EFA. Ceiba is Nvidia silicon hosted on AWS infrastructure. Rainier is AWS silicon hosting a single anchor customer.
Anthropic is the headline customer. Amazon and Anthropic announced a deeper strategic alliance on November 22, 2024 that took Amazon's total investment in Anthropic to $8 billion (an additional $4 billion on top of an earlier $4 billion commitment) and named AWS as Anthropic's primary cloud and training partner.[15]
On April 20, 2026, the two companies expanded the deal further. Anthropic committed to spending more than $100 billion over the next ten years on AWS technologies (Trainium chips plus tens of millions of Graviton cores) to secure up to 5 GW of capacity for training and serving Claude. The contract spans Trainium2, Trainium3, Trainium4, and future generations. Anthropic said meaningful new Trainium2 capacity would come online in Q2 2026 and that roughly 1 GW of combined Trainium2 and Trainium3 capacity would be online by the end of 2026. Amazon, in turn, committed an immediate $5 billion investment with up to $20 billion more tied to commercial milestones, on top of the $8 billion it had previously invested, taking the potential total Amazon stake to roughly $33 billion. The deal also expanded international inference capacity for Claude in Asia and Europe and announced that the full Anthropic-native Claude Platform console would be accessible from inside the AWS console.[16][17][18]
The customer list AWS highlighted at re:Invent 2025 included, in addition to Anthropic, Karakuri (the Japanese LLM company, which has reported reducing training costs by more than 50 percent on Trainium), Metagenomi, NetoAI, Ricoh (which trained a Japanese language LLM on a 256-node Trn1 cluster, reportedly cutting training cost by 50 percent and training time by 25 percent versus GPUs), Splash Music, and Decart (which has reported 4x faster inference for real-time generative video at half the cost of GPUs). Other publicly named users include the AI coding company poolside, Databricks (which has integrated Trainium support into Mosaic AI), and HeliXon (protein structure prediction). Inside Amazon, Trainium is used by Amazon Search, by Amazon Bedrock for latency-optimized inference of models including Claude 3.5 Haiku and Llama 3.1 405B, and to train the Amazon Nova family.[1][5]
AWS does not publish chip-level revenue, but Andy Jassy disclosed at re:Invent 2025 that the Trainium2 business had reached a multi-billion dollar annualized revenue run rate.
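AWS publishes a few headline numbers that come up in almost every Trainium discussion: roughly 4x the performance of Trn1 for Trn2, 30 to 40 percent better price performance than H100-based P5e and P5en instances, and up to 4.4x the peak compute of a Trn2 UltraServer for the Trn3 UltraServer.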
These are vendor figures. AWS has historically not submitted Trainium results to MLPerf Training, so direct head-to-head comparisons against NVIDIA H100, NVIDIA Blackwell, and Google TPU v5p / Trillium happen through customer case studies and analyst pieces rather than a shared benchmark suite.[11]
In the broader market for AI accelerators, Trainium occupies an in-between position. Google's TPU is the closest analogue: a hyperscaler-owned ASIC, only available through the corresponding cloud, with a software stack (XLA / JAX) that is shared with the rest of the company's ML stack. Nvidia GPUs, especially the H100 and A100, define the rest of the market: available everywhere, well documented, and backed by the CUDA ecosystem.
| Dimension | AWS Trainium | Google TPU | Nvidia H100 / Blackwell |
|---|---|---|---|
| Owner | AWS (Annapurna Labs) | Google | Nvidia |
| Availability | EC2 (Trn1, Trn2, Trn3) | Google Cloud only | All major clouds + on-prem |
| Native software | Neuron SDK, XLA | XLA, JAX, TF | CUDA, cuDNN, NCCL |
| Top-end interconnect | NeuronLink-v4 + NeuronSwitch-v1 + EFAv3 | ICI + OCS optical pod fabric | NVLink + InfiniBand / Ethernet |
| Rack-scale system | Trn3 Gen2 UltraServer (144 chips) | TPU v5p / Trillium pods | NVL72 / NVL576 (GB200, GB300) |
| Public benchmark presence | Customer case studies | MLPerf submissions | MLPerf submissions |
| Anchor customer | Anthropic | Google internal, Anthropic, others | Almost everyone |
The trade-off for buyers is well known. Trainium and TPU give better price performance for customers willing to live inside one cloud and absorb the cost of porting. CUDA and Nvidia GPUs give portability, a larger pool of pretrained models, and a much deeper third-party software ecosystem. For most teams running a few GPU-hours a week, the math favors renting H100s or A100s. For teams running thousands of nodes for months, even a 20 percent saving is enough to justify the porting work, which is why most of the public Trainium customers are companies training their own foundation models.
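The break-even reasoning can be written down as a one-line formula: the one-time porting cost divided by the money saved per hour. In the sketch below every dollar figure is a made-up placeholder, and `break_even_hours` is a hypothetical helper used only to show the shape of the calculation.

```python
def break_even_hours(porting_cost, gpu_hourly_rate, saving_fraction):
    """Instance-hours needed before hourly savings repay a one-time porting cost."""
    if not 0 < saving_fraction < 1:
        raise ValueError("saving_fraction must be between 0 and 1")
    return porting_cost / (gpu_hourly_rate * saving_fraction)


# Hypothetical inputs: $500,000 of engineering time, $40 per instance-hour, 20% saving.
hours = break_even_hours(500_000, 40.0, 0.20)
days_on_1000_nodes = hours / 1000 / 24   # large training fleet
weeks_at_10_hours = hours / 10           # small team, 10 instance-hours a week
```

Under these placeholder numbers the porting effort pays for itself after 62,500 instance-hours: under three days for a 1,000-node fleet, but more than a century for a team using ten instance-hours a week.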
The most common complaint about Trainium is the same one leveled at every non-CUDA accelerator: the software ecosystem is younger and shallower. CUDA has had two decades to grow a library of tuned kernels, third-party frameworks, and trained engineers. Neuron has had only a few years. In practice this shows up as occasional rough edges.
AWS has invested heavily to close these gaps. The TorchNeuron Native backend, the NKI kernel interface, the JAX and Hugging Face integrations, and the steady cadence of compiler releases all aim at making Trainium feel more like "just another PyTorch backend." The fact that Anthropic, the company training one of the most capable language models in the world, has bet most of its future training capacity on Trainium suggests the trade-off has crossed the line into acceptable for at least one very demanding user.
AWS publicly disclosed the broad cadence of Trainium generations at re:Invent 2025:
| Generation | Process | First shipped | Status (May 2026) |
|---|---|---|---|
| Trainium1 | TSMC 7 nm | October 2022 (Trn1) | Generally available |
| Trainium2 | TSMC 5 nm | December 2024 (Trn2 + Trn2 UltraServer) | Generally available; primary deployment in Project Rainier |
| Trainium3 | TSMC 3 nm class (N3P) | December 2, 2025 (Trn3 UltraServer Gen1 and Gen2) | Generally available |
| Trainium4 | Not disclosed | Future | Previewed at re:Invent 2025; AWS targets >=3x FP8 of Trainium3 and 4x memory bandwidth |
Trainium4, previewed by Matt Garman in the December 2025 keynote, is described by AWS as a chiplet-heavy design that scales out across many smaller dies rather than building a larger monolithic accelerator. The deal with Anthropic announced on April 20, 2026 includes the right to purchase Trainium4 and future generations.[13][16]