AWS Trainium
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,867 words
AWS Trainium is a family of custom machine learning accelerator chips designed by Annapurna Labs for Amazon Web Services. The chips are purpose-built for training and, increasingly, for serving large neural networks. Trainium sits alongside Inferentia (focused on inference) inside the broader AWS silicon stack that also includes the Graviton CPU line and the Nitro virtualization system. AWS unveiled the first Trainium at the December 2020 re:Invent keynote, made the corresponding Trn1 EC2 instances generally available in October 2022, announced Trainium2 at re:Invent 2023, and rolled Trn2 instances out broadly during 2024. Trainium3, fabricated on a 3 nm process, was previewed at re:Invent 2024 and entered general availability through Trn3 UltraServers in late 2025.
The chips are best known publicly for their role in Anthropic's Claude development, where they power Project Rainier, an EC2 UltraCluster in Indiana that runs nearly half a million Trainium2 chips and is being expanded to more than a million chips. The chips are also used inside Amazon for Search, Amazon Bedrock latency-optimized inference, and the Amazon Nova family of foundation models.
Trainium is the work of Annapurna Labs, a small Israeli chip design firm founded in 2011 by Hrvoje Bilic, Nafea Bshara, and Ronen Boneh. AWS bought Annapurna Labs in January 2015 for a reported $350 to $370 million, and the team has since produced the chips that quietly underpin most of AWS's modern infrastructure. The same group built the Nitro hypervisor and offload cards (launched November 2017), the Graviton ARM CPU line, and the Inferentia inference accelerator that shipped on Inf1 instances in December 2019. Trainium is the company's first custom training silicon, and the Inferentia2 inference chip released in 2023 shares the same NeuronCore-v2 design.
The pitch from the start was straightforward. GPUs from Nvidia were the de facto training hardware, but they were expensive and supply constrained, and a hyperscaler that owns its own datacenters can save a lot of money by designing the silicon, the rack, the network, and the firmware as one system. Google had already proven this thesis with the Cloud TPU. AWS wanted its own version, and Annapurna gave it the team to build one.
AWS now operates four custom chip families designed by Annapurna Labs:
| Chip family | Purpose | First shipped |
|---|---|---|
| Nitro | Hypervisor, networking, storage offload | November 2017 |
| Graviton | ARM-based general purpose CPUs | November 2018 |
| Inferentia | Low-cost inference accelerator | December 2019 (Inf1) |
| Trainium | Training accelerator | October 2022 (Trn1) |
In practice, an EC2 host running a Trn2 instance has a Graviton control processor, Nitro cards handling network and storage offload, and a tray of Trainium chips connected by NeuronLink for the actual model math. Inferentia2 and Trainium are close cousins. They share the same NeuronCore-v2 core design and software stack, with Inferentia2 tuned for cheaper inference and Trainium tuned for the higher memory bandwidth and interconnect demands of model training.
Andy Jassy announced the original Trainium during the AWS re:Invent 2020 keynote on December 1, 2020. The first generation is built on a 7 nm process, contains roughly 55 billion transistors, and packs two NeuronCore-v2 cores per chip. Each NeuronCore-v2 contains four engines that handle different parts of a typical neural network workload: a tensor engine built on a power-optimized systolic array for matrix multiplication and convolution, a vector engine for normalization and pooling, a scalar engine for elementwise operations like ReLU, and a general-purpose SIMD (GPSIMD) engine of eight programmable 512-bit cores that lets developers write custom kernels in C++.
The chip supports a wide range of numeric formats: cFP8 (a configurable 8-bit float introduced with NeuronCore-v2), FP16, BF16, TF32, FP32, INT8, INT16, and INT32. Each NeuronCore-v2 tensor engine delivers more than 90 TFLOPS of FP16 or BF16 compute, and a single Trainium chip is rated at 210 FP16/BF16/cFP8/TF32 TFLOPS, 52.5 FP32 TFLOPS, and 420 INT8 TOPS. Each chip carries 32 GB of high bandwidth memory.
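As an illustration of how two of these formats relate, BF16 keeps the sign bit and the eight exponent bits of an FP32 value and shortens the mantissa to seven bits, so a BF16 value corresponds to the upper 16 bits of the FP32 encoding. The sketch below uses only the Python standard library and plain truncation. Hardware typically rounds rather than truncates, and the helper names are illustrative rather than part of any Neuron API.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern obtained by truncating x's FP32 encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits

def bf16_bits_to_float(pattern: int) -> float:
    """Expand a 16-bit BF16 pattern back into a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", (pattern & 0xFFFF) << 16))
    return value

def bf16_truncate(x: float) -> float:
    """Round-trip x through the (truncated) BF16 representation."""
    return bf16_bits_to_float(fp32_to_bf16_bits(x))

print(hex(fp32_to_bf16_bits(1.0)))
print(bf16_truncate(3.14159))
```

The round trip shows the trade-off: BF16 covers the same dynamic range as FP32 but keeps only two to three significant decimal digits.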
AWS made Trn1 generally available on October 10, 2022, in the US East (N. Virginia) and US West (Oregon) regions. The lineup launched with three sizes:
| Instance | Trainium chips | NeuronCores | vCPUs | Instance memory | Accelerator memory | Local NVMe | Network bandwidth |
|---|---|---|---|---|---|---|---|
| trn1.2xlarge | 1 | 2 | 8 | 32 GiB | 32 GB | 0.5 TB | up to 12.5 Gbps |
| trn1.32xlarge | 16 | 32 | 128 | 512 GiB | 512 GB | 8 TB | 800 Gbps EFAv2 |
| trn1n.32xlarge | 16 | 32 | 128 | 512 GiB | 512 GB | 8 TB | 1,600 Gbps EFAv2 |
The full size, trn1.32xlarge, delivers up to 3 PFLOPS of FP16 / BF16 compute and 9.8 TB/s of aggregate HBM bandwidth across its 16 chips, which are stitched together with NeuronLink-v2 in a 2D torus. The trn1n variant doubles the external networking to 1.6 Tbps for customers who need to spread training across many instances. AWS exposes scale through its EC2 UltraCluster fabric, where Trn1 and Trn1n instances can be combined into pods of more than 100,000 chips connected by petabit-scale Elastic Fabric Adapter networking. List on-demand pricing at launch was $1.34 per hour for trn1.2xlarge, $21.50 per hour for trn1.32xlarge, and $24.78 per hour for trn1n.32xlarge.
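The launch prices are easier to compare once normalized per chip. The following sketch divides the list prices quoted above by the chip counts from the instance table. The dictionary and function names are made up for the example, and current pricing may differ from the launch figures.

```python
# Launch list prices (USD per instance-hour) and chip counts quoted in the text.
LAUNCH_PRICING = {
    "trn1.2xlarge": (1.34, 1),
    "trn1.32xlarge": (21.50, 16),
    "trn1n.32xlarge": (24.78, 16),
}

def price_per_chip_hour(instance: str) -> float:
    """List price of one Trainium chip for one hour on the given instance size."""
    price, chips = LAUNCH_PRICING[instance]
    return price / chips

for name in LAUNCH_PRICING:
    print(f"{name}: ${price_per_chip_hour(name):.4f} per chip-hour")
```

On these figures the single-chip and 16-chip Trn1 sizes cost almost the same per chip, while trn1n carries a premium for its doubled network bandwidth.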
Launch customers and partners highlighted by AWS included PyTorch (with native support added to the framework), Amazon Search, the protein structure prediction firm HeliXon, the Japanese fintech Money Forward, and the AI productivity startup Magic.
Trainium2 was announced at re:Invent 2023 and previewed throughout 2024 before reaching general availability on December 3, 2024 in the US East (Ohio) region. The chip is fabricated on a 5 nm process. Each Trainium2 chip contains eight NeuronCore-v3 cores and 96 GB of HBM with 2.9 TB/s of bandwidth. AWS rates a single Trainium2 chip at up to 1.3 PFLOPS of dense FP8 and up to 5.2 PFLOPS of sparse FP8 compute. Sparsity in this generation is hardware accelerated using a 16:4 pattern (four nonzero values per group of sixteen), giving roughly a 4x throughput uplift on suitable models.
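To make the 16:4 pattern concrete, the sketch below prunes a flat list of weights so that each group of sixteen keeps only four nonzero values. Keeping the four largest-magnitude entries is a common pruning heuristic and is an assumption here. The text only describes the pattern itself, not how AWS tooling selects which values survive.

```python
def prune_16_4(values, group=16, keep=4):
    """Zero all but the `keep` largest-magnitude entries in each block of `group`."""
    if len(values) % group != 0:
        raise ValueError("length must be a multiple of the group size")
    out = []
    for start in range(0, len(values), group):
        block = values[start:start + group]
        # Indices of the `keep` largest magnitudes within this block.
        ranked = sorted(range(group), key=lambda i: abs(block[i]), reverse=True)
        survivors = set(ranked[:keep])
        out.extend(v if i in survivors else 0.0 for i, v in enumerate(block))
    return out

weights = [float(i) for i in range(32)]
pruned = prune_16_4(weights)
print(sum(1 for v in pruned if v != 0.0), "nonzero values out of", len(pruned))
```

Because only a quarter of the values in each group remain, hardware that skips the zeros has a quarter of the multiply-accumulate work to do, which is where the roughly 4x figure comes from.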
The Trn2 family is sold in three shapes, one of which is the node type used inside UltraServers:
| Instance | Trainium2 chips | NeuronCores | vCPUs | Instance memory | Accelerator memory | HBM bandwidth | Network bandwidth |
|---|---|---|---|---|---|---|---|
| trn2.3xlarge | 1 | 8 | 12 | 128 GiB | 96 GB | 2.9 TB/s | 0.2 Tbps |
| trn2.48xlarge | 16 | 128 | 192 | 2 TiB | 1.5 TB | 46.4 TB/s | 3.2 Tbps EFAv3 |
| trn2u.48xlarge (UltraServer node) | 16 | 128 | 192 | 2 TiB | 1.5 TB | 46.4 TB/s | 3.2 Tbps EFAv3 |
A full trn2.48xlarge is rated at 20.8 PFLOPS of dense FP8 and around 83 PFLOPS of sparse FP8.
The Trn2 UltraServer is a new physical product. Four trn2u.48xlarge nodes (64 Trainium2 chips total) are wired together with NeuronLink-v3 in a high-bandwidth ring, exposing them to the operating system as a single logical machine with 512 NeuronCores, 6 TB of HBM, 185 TB/s of aggregate HBM bandwidth, and 12.8 Tbps of EFAv3 networking. The intra-node NeuronLink fabric runs at 1,024 GB/s per chip, and the inter-node ring at 256 GB/s per chip. AWS uses the same UltraServer building block to construct EC2 UltraClusters, the largest of which is Project Rainier (described below). For shorter bookings, customers reserve UltraServers and instances through Amazon EC2 Capacity Blocks for ML.
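The instance and UltraServer totals follow directly from the per-chip figures. The sketch below multiplies the Trainium2 per-chip ratings quoted earlier by the chip count. The class and function names are illustrative, and any derived number that the text does not state (such as sparse FP8 for a full UltraServer) is simple arithmetic rather than a published AWS rating.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trainium2Chip:
    neuron_cores: int = 8
    hbm_gb: int = 96
    hbm_tb_per_s: float = 2.9
    dense_fp8_pflops: float = 1.3
    sparse_uplift: int = 4  # from the 16:4 structured sparsity pattern

def aggregate(chips: int, chip: Trainium2Chip = Trainium2Chip()) -> dict:
    """Scale per-chip ratings linearly to a group of `chips` chips."""
    dense = chips * chip.dense_fp8_pflops
    return {
        "neuron_cores": chips * chip.neuron_cores,
        "hbm_gb": chips * chip.hbm_gb,
        "hbm_tb_per_s": round(chips * chip.hbm_tb_per_s, 1),
        "dense_fp8_pflops": round(dense, 1),
        "sparse_fp8_pflops": round(dense * chip.sparse_uplift, 1),
    }

instance = aggregate(16)     # one trn2.48xlarge
ultraserver = aggregate(64)  # four trn2u.48xlarge nodes
print(instance)
print(ultraserver)
```

Linear scaling gives peak figures only. Sustained throughput depends on how well a workload keeps the NeuronLink and EFA fabrics fed.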
AWS publishes two performance comparisons that recur in marketing materials. Trn2 delivers about 4x the performance, 4x the memory bandwidth, and 3x the memory capacity of Trn1, and it offers 30 to 40 percent better price performance than NVIDIA H100 based P5e and P5en instances on the workloads AWS measured. Both numbers are AWS internal benchmarks rather than third-party MLPerf submissions. As of MLPerf Training v5.0 and v5.1 results published in 2025, AWS Trainium had not appeared in the public MLPerf Training tables, where Nvidia, Google, and a handful of partners dominate submissions.
AWS first showed Trainium3 silicon at re:Invent 2024 and announced general availability of Trn3 UltraServers at re:Invent in late 2025. Trainium3 moves to a 3 nm process. AWS publishes a per-chip rating of about 2.52 PFLOPS of dense FP8 compute (roughly 2x Trainium2 per chip), 144 GB of HBM3e, and 4.9 TB/s of HBM bandwidth. The chip adds support for microscaling formats, including MXFP8 and MXFP4, alongside the existing BF16, FP16, FP8, and FP32 paths. AWS markets a Trn3 UltraServer at up to 4.4x higher peak performance, 3.9x higher memory bandwidth, and more than 4x better energy efficiency than a Trn2 UltraServer, with an aggregate of 362 MXFP8 PFLOPS and 20.7 TB of HBM3e at 706 TB/s. Those aggregates correspond to 144 Trainium3 chips at the per-chip ratings above, compared with 64 chips in a Trn2 UltraServer.
Trainium3 was co-designed with Anthropic. CNBC and others have reported that Anthropic engineers communicate with Annapurna Labs daily and contribute direct feedback from Claude training runs that shapes future chip design. AWS has previewed Trainium4 for a future generation, with the company describing it as a redesign that scales out across many small dies rather than building a larger monolithic chip.
Every generation of Trainium and Inferentia uses the same basic building block: the NeuronCore. The design has gone through four major revisions in production. NeuronCore-v1 powered the original Inferentia. NeuronCore-v2, introduced with Trainium1 and Inferentia2, added the four-engine layout that has shaped every subsequent generation. NeuronCore-v3 ships in Trainium2 and reorganizes the same engines for higher density and lower precision throughput. NeuronCore-v4 ships in Trainium3.
A single NeuronCore-v2 contains:
| Engine | Role | Notes |
|---|---|---|
| Tensor engine | GEMM, convolution, transpose | Power-optimized systolic array, accepts cFP8/FP16/BF16/TF32/FP32/INT8 inputs and accumulates in FP32 or INT32 |
| Vector engine | Normalization, pooling, softmax-style ops | About 10x faster than NeuronCore-v1 |
| Scalar engine | Elementwise ops like ReLU and biases | Around 2.9 TFLOPS of FP32, 3x NeuronCore-v1 |
| GPSIMD engine | Custom kernels in C++ | Eight 512-bit programmable cores per NeuronCore |
The GPSIMD engine is the part developers can program directly in C++. The Neuron Kernel Interface (NKI) is the complementary Python-level route to hand-written kernels and is roughly analogous to writing a CUDA kernel for an Nvidia GPU. It is how Anthropic and other heavy users push the chip past what the compiler can generate on its own.
NeuronCore-v3, introduced in Trainium2, splits each chip into eight cores instead of two and reorganizes the engines for the FP8 / sparse-FP8 numerics that dominate modern transformer training. A trn2.48xlarge with 16 chips therefore exposes 128 NeuronCores. NeuronCore-v3 also adds dedicated collective communication cores so that the all-reduce and all-gather steps that dominate distributed training do not have to share the tensor engine with the actual math. NeuronCore-v4, introduced with Trainium3, continues this trajectory and adds the new microscaling MX formats.
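For readers unfamiliar with the collectives mentioned here, the sketch below simulates a textbook ring all-reduce in plain Python: a reduce-scatter phase followed by an all-gather phase. It illustrates the communication pattern in general and is not a description of how the Neuron collective library or the dedicated communication cores implement it. Each simulated worker holds as many chunks as there are workers.

```python
def ring_all_reduce(data):
    """Simulate a ring all-reduce (sum) over n workers, each holding n chunks."""
    n = len(data)
    if any(len(row) != n for row in data):
        raise ValueError("each worker must hold exactly n chunks")
    buf = [list(row) for row in data]  # leave the caller's input untouched

    # Phase 1: reduce-scatter. After n-1 steps worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all messages first so the sends happen "simultaneously".
        sends = [((i + 1) % n, (i - step) % n, buf[i][(i - step) % n])
                 for i in range(n)]
        for dst, idx, val in sends:
            buf[dst][idx] += val

    # Phase 2: all-gather. Completed chunks circulate once around the ring.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n, buf[i][(i + 1 - step) % n])
                 for i in range(n)]
        for dst, idx, val in sends:
            buf[dst][idx] = val
    return buf

print(ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Every worker ends up with the element-wise sum of all inputs while only ever talking to its ring neighbor, which is why this pattern maps well onto ring and torus interconnects.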
Trainium chips are connected at three different levels.
Inside an instance, NeuronLink (v2 in Trn1, v3 in Trn2 and beyond) wires the chips into a 2D torus or ring at hundreds of gigabytes per second per chip. Inside an UltraServer, the same NeuronLink fabric extends across four physical instances so that 64 chips look to software like a single machine. Above that, instances are stitched into EC2 UltraClusters with Elastic Fabric Adapter (EFAv2 on Trn1, EFAv3 on Trn2 onward), an AWS-specific RDMA transport that runs over the regular Nitro network and bypasses the kernel for low-latency, lossless collective communication. UltraClusters scale to tens or hundreds of thousands of chips. Project Rainier is currently the largest UltraCluster AWS has deployed, 70 percent larger than any prior AWS AI cluster.
This multi-tier structure mirrors the way Google describes its TPU pods (chip, board, slice, pod) and is the source of most of Trainium's headline numbers. A 64-chip UltraServer is the natural unit for a single training job that fits inside one tensor parallel domain, while UltraClusters give pipeline and data parallel scale.
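One way to picture the tiers is to map a flat chip rank onto them. The sketch below assumes the Trn2 UltraServer layout described above (16 chips per node, four nodes per UltraServer) and a simple contiguous rank ordering. The function, its field names, and the choice to treat each UltraServer as one tensor-parallel group are illustrative assumptions, not a Neuron API or a required configuration.

```python
CHIPS_PER_NODE = 16         # one trn2u.48xlarge
NODES_PER_ULTRASERVER = 4   # 64 chips share one NeuronLink domain
CHIPS_PER_ULTRASERVER = CHIPS_PER_NODE * NODES_PER_ULTRASERVER

def locate(chip_rank: int) -> dict:
    """Map a flat chip rank to a (hypothetical) position in the hierarchy."""
    if chip_rank < 0:
        raise ValueError("rank must be non-negative")
    ultraserver, local = divmod(chip_rank, CHIPS_PER_ULTRASERVER)
    node, chip = divmod(local, CHIPS_PER_NODE)
    return {
        "ultraserver": ultraserver,      # replica index, reached over EFA
        "node": node,                    # instance within the UltraServer
        "chip": chip,                    # chip within the instance
        "tensor_parallel_rank": local,   # position inside the NeuronLink domain
    }

for rank in (0, 63, 64, 100):
    print(rank, locate(rank))
```

Ranks that share an `ultraserver` value communicate over NeuronLink, while ranks in different UltraServers communicate over EFA, which is why bandwidth-hungry tensor parallelism is usually kept inside the smaller domain.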
The Neuron SDK is what makes the hardware usable. It is split into a runtime, a compiler, and a set of framework integrations.
The Neuron Compiler is an XLA-based graph compiler that ingests models from PyTorch (via XLA), TensorFlow, JAX, and MXNet, lowers them to a Neuron-specific intermediate representation, and emits binaries the chip can run. It handles tiling for the SRAM hierarchy, scheduling across the four engines inside each NeuronCore, and collective placement. Most users never touch it directly.
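Tiling is the most mechanical of these steps and is easy to show in miniature. The sketch below multiplies two matrices one small block at a time, the way a compiler splits a large matrix multiplication so that each block of operands fits in fast on-chip memory. It is plain Python for illustration only. The tile size is arbitrary, and the loop order is an assumption that says nothing about the schedules the Neuron compiler actually emits.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n), visiting the operands tile by tile."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Only one tile of a, b, and c is touched in this inner region.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

print(tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

The result is identical to an untiled multiplication. What changes is the order of memory accesses, which is the property a compiler tunes for a given SRAM size.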
PyTorch on Neuron has shifted over the lifetime of the SDK. The original torch-neuronx package wraps PyTorch/XLA and is the most battle-tested path for distributed training. The newer TorchNeuron Native backend, added in 2025, provides eager execution, torch.compile, and the standard distributed APIs (FSDP, DDP, DTensor, tensor parallel) directly on Trainium. AWS now positions TorchNeuron Native as the recommended starting point for new workloads.
JAX on Neuron uses the same XLA compiler path. JAX programs lower to HLO, which the Neuron compiler then targets to the chip. AWS has shipped reference implementations of large language model pretraining in both PyTorch and JAX.
Hugging Face integration comes via the optimum-neuron library, an open source Hugging Face project that lets Transformers users fine-tune and serve models on Trainium and Inferentia without rewriting their training code. AWS and Hugging Face also publish an aws-neuron model namespace with precompiled artifacts.
The Neuron Kernel Interface (NKI) is the lowest-level developer surface. It exposes the GPSIMD engine and the tensor engine in a Python-embedded DSL that compiles down to chip instructions. NKI is the equivalent of writing a CUDA kernel and is the path most heavy users take when the compiler is leaving performance on the table.
Amazon SageMaker has first-class Trainium support: SageMaker training jobs and HyperPod clusters can be launched directly on Trn1, Trn2, and Trn3 instances, and SageMaker JumpStart hosts ready-to-train recipes for popular open models on Neuron.
The Neuron stack also integrates with Kubernetes (EKS), ECS, Ray, and Slurm-based HPC schedulers, and supports container images via the AWS Deep Learning Containers.
Project Rainier is the largest Trainium deployment in the world. It is a multi-site EC2 UltraCluster built primarily for Anthropic and used to train and serve Claude. AWS announced the project in 2024 and brought the first phase online in less than a year. The flagship site is an $11 billion campus in St. Joseph County, Indiana, near New Carlisle, that broke ground in October 2024 and reached full operation in October 2025. The campus spans roughly 1,200 acres and will eventually draw about 2.2 GW of power. Other Project Rainier capacity is spread across additional US sites.
AWS describes the active Project Rainier deployment as nearly 500,000 Trainium2 chips, with plans to scale beyond one million Trainium2 chips by the end of 2025 across both training and inference workloads. The cluster gives Anthropic more than five times the compute it used to train its previous generation of Claude models. AWS calls Project Rainier 70 percent larger than any other AI compute platform in the company's history.
The physical building block at every site is the Trn2 UltraServer (four trn2u.48xlarge instances, 64 chips, NeuronLink-v3) connected upward by EFAv3. The custom NeuronLink topology and the Nitro-managed network are the reason AWS can stand up clusters of this size on its own silicon.
Project Rainier should not be confused with Project Ceiba, the joint AWS / Nvidia supercomputer announced at re:Invent 2023, which is built from Nvidia GH200 Grace Hopper Superchips and Nvidia Blackwell systems and runs on AWS Nitro and EFA. Ceiba is Nvidia silicon hosted on AWS infrastructure. Rainier is AWS silicon hosting a single anchor customer.
Anthropic is the headline customer. Amazon and Anthropic announced a deeper strategic alliance on November 22, 2024 that took Amazon's total investment in Anthropic to $8 billion (an additional $4 billion on top of an earlier $4 billion commitment) and named AWS as Anthropic's primary cloud and training partner. In April 2026, the two companies expanded the deal further: Anthropic committed to spending more than $100 billion on AWS over ten years to secure up to 5 GW of capacity, including new Trainium2 deployments in the first half of the year and roughly 1 GW of combined Trainium2 and Trainium3 capacity online by the end of 2026. Amazon, in turn, committed up to $33 billion in additional Anthropic investment alongside the spend pledge.
The customer page on aws.amazon.com lists, in addition to Anthropic, the AI coding company poolside, the real-time video model startup Decart, and the Japanese LLM company Karakuri (which reported reducing its training costs by more than 50 percent on Trainium). Other publicly named users include the Japanese imaging company Ricoh (which trained a Japanese language LLM on a 256-node Trn1 cluster, reportedly cutting training cost by 50 percent and training time by 25 percent versus GPUs), Databricks (which has integrated Trainium support into Mosaic AI), and HeliXon (protein structure prediction). Inside Amazon, Trainium is used by Amazon Search, by Amazon Bedrock for latency-optimized inference of Claude 3.5 Haiku and Llama 3.1 405B, and to train the Amazon Nova family.
AWS does not publish chip-level revenue, but Andy Jassy disclosed at re:Invent 2025 that the Trainium2 business had reached a multi-billion dollar annualized revenue run rate.
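AWS publishes a few headline numbers that come up in almost every Trainium discussion, chiefly the generation-over-generation multiples and the 30 to 40 percent price-performance advantage over H100-based instances quoted above.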
These are vendor figures. AWS has historically not submitted Trainium results to MLPerf Training, so direct head-to-head comparisons against NVIDIA H100, NVIDIA Blackwell, and Google TPU v5p / Trillium happen through customer case studies and analyst pieces rather than a shared benchmark suite.
In the broader market for AI accelerators, Trainium occupies an in-between position. Google's TPU is the closest analogue: a hyperscaler-owned ASIC, only available through the corresponding cloud, with a software stack (XLA / JAX) that is shared with the rest of the company's ML stack. Nvidia GPUs, especially the H100 and A100, define the rest of the market: available everywhere, well documented, and backed by the CUDA ecosystem.
| Dimension | AWS Trainium | Google TPU | Nvidia H100 / Blackwell |
|---|---|---|---|
| Owner | AWS (Annapurna Labs) | Google | Nvidia |
| Availability | EC2 (Trn1, Trn2, Trn3) | Google Cloud only | All major clouds + on-prem |
| Native software | Neuron SDK, XLA | XLA, JAX, TF | CUDA, cuDNN, NCCL |
| Top-end interconnect | NeuronLink-v3 + EFAv3 | ICI + OCS optical pod fabric | NVLink + InfiniBand / Ethernet |
| Public benchmark presence | Customer case studies | MLPerf submissions | MLPerf submissions |
| Anchor customer | Anthropic | Google internal, Anthropic, others | Almost everyone |
The trade-off for buyers is well known. Trainium and TPU give better price performance for customers willing to live inside one cloud and absorb the cost of porting. CUDA and Nvidia GPUs give portability, a larger pool of pretrained models, and a much deeper third-party software ecosystem. For most teams running a few GPU-hours a week, the math favors renting H100s or A100s. For teams running thousands of nodes for months, even a 20 percent saving is enough to justify the porting work, which is why most of the public Trainium customers are companies training their own foundation models.
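The break-even reasoning can be written down as a short calculation. In the sketch below, porting is worthwhile once the fractional saving on the compute bill exceeds the one-off engineering cost of the port. The dollar amounts are hypothetical and chosen only to show the shape of the trade-off. Only the 20 percent saving comes from the text.

```python
def break_even_spend(saving_fraction: float, porting_cost: float) -> float:
    """Compute spend above which a one-off port starts to pay for itself."""
    if not 0 < saving_fraction <= 1:
        raise ValueError("saving_fraction must be in (0, 1]")
    return porting_cost / saving_fraction

def porting_pays_off(baseline_spend: float, saving_fraction: float,
                     porting_cost: float) -> bool:
    """True if the saving on the compute bill exceeds the porting cost."""
    return baseline_spend * saving_fraction > porting_cost

PORTING_COST = 500_000  # hypothetical engineering cost in USD
SAVING = 0.20           # the 20 percent saving mentioned in the text

print(break_even_spend(SAVING, PORTING_COST))
print(porting_pays_off(50_000, SAVING, PORTING_COST))      # small team
print(porting_pays_off(50_000_000, SAVING, PORTING_COST))  # foundation-model scale
```

Under these assumed numbers the port only pays for itself at compute budgets in the millions of dollars, which matches the observation that public Trainium customers skew toward companies training their own foundation models.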
The most common complaint about Trainium is the same one leveled at every non-CUDA accelerator: the software ecosystem is younger and shallower. CUDA has had two decades to grow a library of tuned kernels, third-party frameworks, and trained engineers. Neuron has had only a few years. In practice this shows up as occasional rough edges.
AWS has invested heavily to close these gaps. The TorchNeuron Native backend, the NKI kernel interface, the JAX and Hugging Face integrations, and the steady cadence of compiler releases all aim at making Trainium feel more like "just another PyTorch backend." The fact that Anthropic, the company training one of the most capable language models in the world, has bet most of its future training capacity on Trainium suggests the trade-off has crossed the line into acceptable for at least one very demanding user.