AWS Trainium

Updated Jun 21, 2026

AWS Trainium is a family of custom machine learning accelerator chips designed by Annapurna Labs for Amazon Web Services, purpose-built for training and, increasingly, for serving large neural networks. Trainium sits alongside Inferentia (focused on inference) inside the broader AWS silicon stack that also includes the Graviton CPU line and the Nitro virtualization system. AWS unveiled the first Trainium at the December 2020 re:Invent keynote, made the corresponding Trn1 EC2 instances generally available in October 2022, announced Trainium2 at re:Invent 2023, and made Trn2 instances generally available in December 2024 after a preview period. Trainium3, fabricated on a TSMC 3 nm class process, was previewed at re:Invent 2024 and reached general availability with the Trn3 UltraServers announced at re:Invent 2025 on December 2, 2025.^[1]^[2]^[3]

The chips are best known publicly for their role in Anthropic's Claude development, where they power Project Rainier, an EC2 UltraCluster centered in Indiana that runs nearly half a million Trainium2 chips and is being expanded to more than a million chips during 2026. By re:Invent 2025 the Trainium2 business had reached a multi-billion dollar annualized revenue run rate, with more than 1 million chips in production and over 100,000 companies using Trainium, which accounted for the majority of Amazon Bedrock usage at the time. Inside Amazon, the chips are also used by Amazon Search, by Amazon Bedrock for latency-optimized inference, and to train the Amazon Nova family of foundation models.^[4]^[5]^[13]

What is AWS Trainium?

AWS Trainium is the first custom training silicon built by Annapurna Labs, an Israeli chip design firm founded in 2011 by Hrvoje Bilic, Nafea Bshara, and Ronen Boneh. AWS bought Annapurna Labs in January 2015 for a reported $350 to $370 million, and the team has since produced the chips that quietly underpin most of AWS's modern infrastructure. The same group built the Nitro hypervisor and offload cards (launched November 2017), the Graviton ARM CPU line, and the Inferentia inference accelerator that shipped on Inf1 instances in December 2019. The Inferentia2 inference chip released in 2023 shares the same NeuronCore-v2 design as Trainium1.^[6]

The pitch from the start was straightforward. GPUs from Nvidia were the de facto training hardware, but they were expensive and supply constrained, and a hyperscaler that owns its own datacenters can save a lot of money by designing the silicon, the rack, the network, and the firmware as one system. Google had already proven this thesis with the Cloud TPU. AWS wanted its own version, and Annapurna gave it the team to build one.

Where does Trainium fit in the AWS silicon family?

AWS now operates four custom chip families designed by Annapurna Labs:

Chip family	Purpose	First shipped
Nitro	Hypervisor, networking, storage offload	November 2017
Graviton	ARM-based general purpose CPUs	November 2018
Inferentia	Low-cost inference accelerator	December 2019 (Inf1)
Trainium	Training accelerator	October 2022 (Trn1)

In practice, an EC2 host running a Trn2 instance has a Graviton control processor, Nitro cards handling network and storage offload, and a tray of Trainium chips connected by NeuronLink for the actual model math. Inferentia2 and Trainium are close cousins. They share the same NeuronCore-v2 core design and software stack, with Inferentia2 tuned for cheaper inference and Trainium tuned for the higher memory bandwidth and interconnect demands of model training.

Chip generations

Trainium1

Andy Jassy announced the original Trainium during the AWS re:Invent 2020 keynote on December 1, 2020. The first generation is built on a 7 nm process, contains roughly 55 billion transistors, and packs two NeuronCore-v2 cores per chip. Each NeuronCore-v2 contains four engines that handle different parts of a typical neural network workload: a tensor engine built on a power-optimized systolic array for matrix multiplication and convolution, a vector engine for normalization and pooling, a scalar engine for elementwise operations like ReLU, and a general-purpose SIMD (GPSIMD) engine of eight programmable 512-bit cores that lets developers write custom kernels in C++.^[7]

The chip supports a wide range of numeric formats: cFP8 (a configurable 8-bit float introduced with NeuronCore-v2), FP16, BF16, TF32, FP32, INT8, INT16, and INT32. Each NeuronCore-v2 tensor engine delivers more than 90 TFLOPS of FP16 or BF16 compute, and a single Trainium chip is rated at 210 FP16/BF16/cFP8/TF32 TFLOPS, 52.5 FP32 TFLOPS, and 420 INT8 TOPS. Each chip carries 32 GB of high bandwidth memory.^[7]
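
As a rough illustration of one of the formats listed above: BF16 keeps FP32's sign bit and 8-bit exponent and shortens the mantissa to 7 bits, so a BF16 value corresponds to the top 16 bits of an FP32 value. The sketch below emulates that in plain Python with round-to-nearest-even. The helper name `to_bf16` is made up for illustration, and nothing here is meant to describe how the Trainium tensor engine rounds internally.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest BF16 value, returned as a float.

    Illustrative helper, not part of any AWS SDK. NaN inputs are not handled.
    """
    # Reinterpret the value as its 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that BF16 discards.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    # Stay within 32 bits, then reinterpret the pattern as a float again.
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1.0 + 2 ** -8):
        print(value, "->", to_bf16(value))
```

The range of BF16 matches FP32 because the exponent width is unchanged; only about two to three decimal digits of precision survive, which is why accumulation is typically kept in a wider format.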

AWS made Trn1 generally available on October 10, 2022, in the US East (N. Virginia) and US West (Oregon) regions. The lineup launched with three sizes:^[8]

Instance	Trainium chips	NeuronCores	vCPUs	Instance memory	Accelerator memory	Local NVMe	Network bandwidth
trn1.2xlarge	1	2	8	32 GiB	32 GB	0.5 TB	up to 12.5 Gbps
trn1.32xlarge	16	32	128	512 GiB	512 GB	8 TB	800 Gbps EFAv2
trn1n.32xlarge	16	32	128	512 GiB	512 GB	8 TB	1,600 Gbps EFAv2

The full size, trn1.32xlarge, delivers up to 3 PFLOPS of FP16 / BF16 compute and 9.8 TB/s of aggregate HBM bandwidth across its 16 chips, which are stitched together with NeuronLink-v2 in a 2D torus. The trn1n variant doubles the external networking to 1.6 Tbps for customers who need to spread training across many instances. AWS exposes scale through its EC2 UltraCluster fabric, where Trn1 and Trn1n instances can be combined into pods of more than 100,000 chips connected by petabit-scale Elastic Fabric Adapter networking. List on-demand pricing at launch was $1.34 per hour for trn1.2xlarge, $21.50 per hour for trn1.32xlarge, and $24.78 per hour for trn1n.32xlarge.^[8]
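
The launch prices above can be normalized per chip to see what the extra networking on trn1n costs. This is a small arithmetic sketch using only the figures quoted in this section; the function names are illustrative, and current AWS pricing may differ from these launch list prices.

```python
# Launch list prices quoted above: (USD per instance-hour, Trainium chips).
LAUNCH_PRICES = {
    "trn1.2xlarge": (1.34, 1),
    "trn1.32xlarge": (21.50, 16),
    "trn1n.32xlarge": (24.78, 16),
}


def price_per_chip_hour(instance: str) -> float:
    """On-demand launch price divided by the number of Trainium chips."""
    price, chips = LAUNCH_PRICES[instance]
    return price / chips


def networking_premium(base: str = "trn1.32xlarge",
                       upgraded: str = "trn1n.32xlarge") -> float:
    """Fractional price increase of `upgraded` over `base`."""
    return LAUNCH_PRICES[upgraded][0] / LAUNCH_PRICES[base][0] - 1.0


if __name__ == "__main__":
    for name in LAUNCH_PRICES:
        print(f"{name}: ${price_per_chip_hour(name):.3f} per chip-hour")
    print(f"trn1n premium over trn1: {networking_premium():.1%}")
```

On these numbers, the per-chip price of the 16-chip instance is essentially the same as the single-chip instance, and doubling the external network bandwidth adds roughly 15 percent to the hourly price.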

Launch customers and partners highlighted by AWS included PyTorch (with native support added to the framework), Amazon Search, the protein structure prediction firm HeliXon, the Japanese fintech Money Forward, and the AI productivity startup Magic.

Trainium2

Trainium2 was announced at re:Invent 2023 and previewed throughout 2024 before reaching general availability on December 3, 2024 in the US East (Ohio) region. The chip is fabricated on a 5 nm process. Each Trainium2 chip contains eight NeuronCores and 96 GB of HBM with 2.9 TB/s of bandwidth per chip. AWS rates a single Trainium2 chip at up to 1.3 PFLOPS of dense FP8 and up to 5.2 PFLOPS of sparse FP8 compute. Sparsity in this generation is hardware accelerated using a 16:4 pattern (four nonzero values per group of sixteen), giving roughly a 4x throughput uplift on suitable models.^[9]^[10]
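
To make the 16:4 pattern concrete, the sketch below prunes a flat list of weights so that each group of sixteen keeps only four nonzero values. Choosing the four largest magnitudes is an assumption made here for illustration (a common pruning heuristic); the article does not describe how AWS tooling selects which values to keep, and `prune_16_4` is a hypothetical helper rather than a Neuron API.

```python
def prune_16_4(values, group=16, keep=4):
    """Zero all but the `keep` largest-magnitude entries in each block of `group`."""
    if len(values) % group != 0:
        raise ValueError("length must be a multiple of the group size")
    pruned = []
    for start in range(0, len(values), group):
        block = values[start:start + group]
        # Indices of the `keep` largest magnitudes; ties go to the earlier index.
        top = sorted(range(group), key=lambda i: abs(block[i]), reverse=True)[:keep]
        keep_set = set(top)
        pruned.extend(v if i in keep_set else 0 for i, v in enumerate(block))
    return pruned


def density(values):
    """Fraction of entries that are nonzero."""
    return sum(1 for v in values if v != 0) / len(values)


if __name__ == "__main__":
    weights = list(range(1, 17))
    sparse = prune_16_4(weights)
    print(sparse)
    print("density:", density(sparse))
```

A density of one quarter means at most a quarter of the multiplications involve a nonzero weight, which lines up with the roughly 4x gap between the dense and sparse FP8 ratings quoted above.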

The Trn2 family is sold in two shapes, a standalone instance and an UltraServer node:

Instance	Trainium2 chips	NeuronCores	vCPUs	Instance memory	Accelerator memory	HBM bandwidth	Network bandwidth
trn2.48xlarge	16	128	192	2 TiB	1.5 TB	46.4 TB/s	3.2 Tbps EFAv3
trn2u.48xlarge (UltraServer node)	16	128	192	2 TiB	1.5 TB	46.4 TB/s	3.2 Tbps EFAv3

A full trn2.48xlarge is rated at 20.8 PFLOPS of dense FP8 and around 83.2 PFLOPS of sparse FP8.^[9]^[10]

The Trn2 UltraServer is a physical product introduced alongside Trn2: four trn2u.48xlarge nodes (64 Trainium2 chips total) are wired together with NeuronLink in a high-bandwidth ring, exposing them to the operating system as a single logical machine with 512 NeuronCores, 6 TB of HBM, 185 TB/s of aggregate HBM bandwidth, and 12.8 Tbps of EFAv3 networking. AWS uses the same UltraServer building block to construct EC2 UltraClusters, the largest of which is Project Rainier (described below). For shorter bookings, customers reserve UltraServers and instances through Amazon EC2 Capacity Blocks for ML.^[9]^[10]
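
The instance and UltraServer totals in this section follow from multiplying the per-chip Trainium2 figures by the chip count. The sketch below reproduces that arithmetic; the `ChipSpec` class and `aggregate` function are illustrative names, and linear scaling is a simplification that describes peak ratings rather than delivered performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipSpec:
    neuron_cores: int
    hbm_gb: float
    hbm_tb_per_s: float
    dense_fp8_pflops: float


# Per-chip Trainium2 figures quoted in this section.
TRAINIUM2 = ChipSpec(neuron_cores=8, hbm_gb=96, hbm_tb_per_s=2.9, dense_fp8_pflops=1.3)


def aggregate(chip: ChipSpec, chips: int) -> dict:
    """Naive linear scaling of per-chip peak specs to a multi-chip system."""
    return {
        "neuron_cores": chip.neuron_cores * chips,
        # Dividing by 1024 reproduces the 1.5 TB and 6 TB totals quoted above.
        "hbm_tb": round(chip.hbm_gb * chips / 1024, 2),
        "hbm_tb_per_s": round(chip.hbm_tb_per_s * chips, 1),
        "dense_fp8_pflops": round(chip.dense_fp8_pflops * chips, 1),
    }


if __name__ == "__main__":
    print("trn2.48xlarge  :", aggregate(TRAINIUM2, 16))
    print("Trn2 UltraServer:", aggregate(TRAINIUM2, 64))
```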

AWS publishes two performance comparisons that recur in marketing materials. Trn2 delivers about 4x the performance, 4x the memory bandwidth, and 3x the memory capacity of Trn1, and AWS claims it offers 30 to 40 percent better price performance than NVIDIA H100 based P5e and P5en instances on the workloads AWS measured. Both numbers are AWS internal benchmarks rather than third-party MLPerf submissions; as of MLPerf Training v5.1 results published in November 2025, AWS did not appear among the 20 submitting organizations.^[10]^[11]

Trainium3

AWS first showed Trainium3 silicon at re:Invent 2024 and then announced general availability of Trn3 UltraServers at re:Invent 2025 on December 2, 2025, when CEO Matt Garman delivered the keynote. Trainium3 is fabricated on TSMC's 3 nm class process (N3P) and is implemented as a dual-chiplet accelerator. Each chip integrates eight NeuronCore-v4 cores (four per chiplet) with 32 MiB of local SRAM per core, supports systolic arrays sized 128x128 for BF16 and 512x128 for MXFP8, and delivers 2.52 PFLOPS of MXFP8/MXFP4 compute, 671 dense (2,517 sparse) BF16/FP16/TF32 TFLOPS, and 183 TFLOPS of FP32 per chip. Each chip ships with 144 GB of HBM3e across four stacks at 4.9 TB/s of bandwidth, and NeuronLink-v4 contributes 2.56 TB/s of inter-device bandwidth.^[1]^[2]^[3]^[12]

Trainium3 adds support for microscaling formats, including MXFP8 and MXFP4, alongside the existing BF16, FP16, FP8, and FP32 paths, and AWS exposes a user-programmable rounding mode. The architecture also adds 16 dedicated collective communication cores (CC-Cores) so that all-reduce and all-gather steps do not contend with the tensor engines, and it supports Logical NeuronCore Configuration (LNC) for pooling resources across cores.^[12]
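
For readers unfamiliar with the collectives mentioned above, the following is a textbook ring all-reduce simulated in plain Python: a reduce-scatter phase followed by an all-gather phase, each taking one fewer step than the number of ranks. It is a generic sketch of the operation, not a description of how the CC-Cores or the Neuron collective library implement it, and it assumes one value per chunk to keep the code short.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over `n` ranks with `n` values each."""
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("this sketch expects n ranks holding n values each")
    chunks = [list(b) for b in buffers]

    # Phase 1: reduce-scatter. At each step, rank r sends chunk (r - step) to
    # its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r - step) % n, chunks[r][(r - step) % n])
                 for r in range(n)]
        for dst, idx, val in sends:
            chunks[dst][idx] += val

    # Phase 2: all-gather. Rank r now owns the full sum of chunk (r + 1) and
    # passes completed chunks around the ring, overwriting stale values.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n])
                 for r in range(n)]
        for dst, idx, val in sends:
            chunks[dst][idx] = val

    return chunks


if __name__ == "__main__":
    print(ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Every rank ends with the same summed vector. Because these steps are pure data movement and addition, running them on dedicated cores leaves the tensor engines free for matrix work, which is the motivation the paragraph above describes.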

Trn3 UltraServers ship in two configurations:^[1]^[3]

Configuration	Chips per UltraServer	Cooling	Peak FP8 / MXFP8	Aggregate HBM3e	Aggregate HBM bandwidth
Trn3 UltraServer (Gen1)	64	Air-cooled	~161 PFLOPS	~9.2 TB	~314 TB/s
Trn3 UltraServer (Gen2)	144	Liquid-cooled	~362 PFLOPS	~20.7 TB	~706 TB/s

The Gen2 UltraServer extends scale-up beyond the 64-chip topology that Trn2 used, putting AWS's rack-scale system roughly on par with NVIDIA's GB300 NVL72 rack on aggregate FP8 throughput. Its chips are connected by NeuronSwitch-v1, an all-to-all fabric that AWS says roughly doubles the inter-chip interconnect bandwidth of the previous generation, and the UltraServers are exposed through EC2 UltraClusters 3.0.^[1]^[3]

AWS markets the Trn3 UltraServer at up to 4.4x higher peak compute, 3.9x higher memory bandwidth, and over 4x better energy efficiency than a Trn2 UltraServer. At re:Invent 2025 Matt Garman also previewed Trainium4 as the next generation, with AWS stating it will deliver at least 3x the FP8 processing power and 4x more memory bandwidth than Trainium3, and describing it publicly as a redesign that scales out across many smaller dies rather than building a larger monolithic chip.^[1]^[13]
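
The generational multiples can be sanity-checked against the per-chip figures quoted in this article. The sketch below assumes the comparison is dense FP8 on Trn2 against MXFP8 on Trn3 and that peak numbers scale linearly with chip count; AWS's exact methodology is not stated here, so small differences from the marketed multiples are expected.

```python
def ultraserver_totals(chips, pflops_per_chip, hbm_tb_per_s_per_chip):
    """Return (peak PFLOPS, aggregate HBM TB/s) under naive linear scaling."""
    return chips * pflops_per_chip, chips * hbm_tb_per_s_per_chip


# Per-chip figures quoted in this article.
TRN2_ULTRASERVER = ultraserver_totals(64, 1.3, 2.9)    # dense FP8
TRN3_GEN1 = ultraserver_totals(64, 2.52, 4.9)          # MXFP8
TRN3_GEN2 = ultraserver_totals(144, 2.52, 4.9)         # MXFP8


def ratios(new, old):
    """Element-wise (compute, bandwidth) ratios, rounded to two decimals."""
    return tuple(round(n / o, 2) for n, o in zip(new, old))


if __name__ == "__main__":
    print("Gen2 vs Trn2 UltraServer:", ratios(TRN3_GEN2, TRN2_ULTRASERVER))
    print("Gen1 vs Trn2 UltraServer:", ratios(TRN3_GEN1, TRN2_ULTRASERVER))
```

Under these assumptions the 144-chip Gen2 configuration comes out at about 4.4x the compute and 3.8x the memory bandwidth of a Trn2 UltraServer, close to the marketed figures, while the 64-chip Gen1 configuration is closer to 1.9x and 1.7x. The headline multiples therefore appear to depend on the larger chip count as much as on the per-chip gains.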

Trainium3 was co-designed with Anthropic. Trade press has reported that Anthropic engineers communicate with Annapurna Labs daily and contribute feedback from Claude training runs that shapes future chip design.

How does the NeuronCore architecture work?

Every generation of Trainium and Inferentia uses the same basic building block: the NeuronCore. The design has gone through four major revisions in production. NeuronCore-v1 powered the original Inferentia. NeuronCore-v2, introduced with Trainium1 and Inferentia2, added the four-engine layout that has shaped every subsequent generation. NeuronCore-v3 ships in Trainium2 and reorganizes the same engines for higher density and lower precision throughput. NeuronCore-v4 ships in Trainium3 and adds support for microscaling MX formats.^[7]^[10]^[12]

A single NeuronCore-v2 contains:

Engine	Role	Notes
Tensor engine	GEMM, convolution, transpose	Power-optimized systolic array, accepts cFP8/FP16/BF16/TF32/FP32/INT8 inputs and accumulates in FP32 or INT32
Vector engine	Normalization, pooling, softmax-style ops	About 10x faster than NeuronCore-v1
Scalar engine	Elementwise ops like ReLU and biases	Around 2.9 TFLOPS of FP32, 3x NeuronCore-v1
GPSIMD engine	Custom kernels in C++	Eight 512-bit programmable cores per NeuronCore

The GPSIMD engine is the part developers can program directly, with custom kernels written in C++. The Neuron Kernel Interface (NKI) goes a step further and exposes the engines through a Python-embedded DSL, which is roughly analogous to writing a CUDA kernel for an Nvidia GPU. It is how Anthropic and other heavy users push the chip past what the compiler can generate on its own.

NeuronCore-v3, introduced in Trainium2, splits each chip into eight cores instead of two and reorganizes the engines for the FP8 / sparse-FP8 numerics that dominate modern transformer training. A trn2.48xlarge with 16 chips therefore exposes 128 NeuronCores. NeuronCore-v4, introduced with Trainium3, continues this trajectory and adds the new microscaling MX formats, larger systolic arrays, dedicated SRAM, and the dedicated collective communication cores.^[12]

How are Trainium chips networked together?

Trainium chips are connected at three different levels.

Inside an instance, NeuronLink (v2 in Trn1, v3 in Trn2, v4 in Trn3) wires the chips into a 2D torus or ring at hundreds of gigabytes per second per chip; NeuronLink-v4 in Trainium3 reaches 2.56 TB/s per device. Inside an UltraServer, the same NeuronLink fabric extends across multiple physical instances so that 64 or 144 chips look to software like a single machine. Trn3 Gen2 UltraServers add NeuronSwitch-v1, an all-to-all fabric that AWS describes as roughly doubling inter-chip bandwidth versus Trn2. Above that, instances are stitched into EC2 UltraClusters with Elastic Fabric Adapter (EFAv2 on Trn1, EFAv3 on Trn2 onward, exposed as UltraClusters 3.0 with Trn3), an AWS-specific RDMA transport that runs over the regular Nitro network and bypasses the kernel for low-latency, lossless collective communication. UltraClusters scale to tens or hundreds of thousands of chips. Project Rainier is currently the largest UltraCluster AWS has deployed, which AWS describes as 70 percent larger than any prior AWS AI cluster.^[4]^[10]^[12]
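
A 2D torus is a grid whose rows and columns wrap around, so every chip has exactly four direct neighbours and no chip sits at an edge. The sketch below models 16 chips as a 4x4 torus; the 4x4 arrangement and the row-major chip numbering are assumptions made for illustration, not a statement of how AWS numbers devices.

```python
def torus_neighbors(chip, rows=4, cols=4):
    """Direct neighbours of `chip` in a rows x cols torus: [up, down, left, right]."""
    r, c = divmod(chip, cols)
    return [
        ((r - 1) % rows) * cols + c,  # up, wrapping from the top row to the bottom
        ((r + 1) % rows) * cols + c,  # down
        r * cols + (c - 1) % cols,    # left, wrapping from the first column to the last
        r * cols + (c + 1) % cols,    # right
    ]


def torus_hops(a, b, rows=4, cols=4):
    """Minimum number of links between chips `a` and `b`."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    dr, dc = abs(ra - rb), abs(ca - cb)
    # In each dimension, go whichever way around the ring is shorter.
    return min(dr, rows - dr) + min(dc, cols - dc)


if __name__ == "__main__":
    print("neighbours of chip 0:", torus_neighbors(0))
    worst = max(torus_hops(a, b) for a in range(16) for b in range(16))
    print("worst-case hops between any two chips:", worst)
```

In this model any two of the 16 chips are at most four links apart, compared with six for a 4x4 grid without wraparound.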

This multi-tier structure mirrors the way Google describes its TPU pods (chip, board, slice, pod) and is the source of most of Trainium's headline numbers. A 64-chip or 144-chip UltraServer is the natural unit for a single training job that fits inside one tensor parallel domain, while UltraClusters give pipeline and data parallel scale.
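
One way to picture this split is to map each chip's global rank to a (data, pipeline, tensor) coordinate, with the tensor dimension varying fastest so that each tensor-parallel group lands on one UltraServer. The layout, the group sizes, and the function names below are hypothetical choices for illustration; real frameworks expose their own mesh or process-group abstractions.

```python
def rank_to_coords(rank, tensor=64, pipeline=4, data=2):
    """Map a global chip rank to (data, pipeline, tensor) coordinates."""
    world = tensor * pipeline * data
    if not 0 <= rank < world:
        raise ValueError(f"rank must be in [0, {world})")
    t = rank % tensor                  # position inside the tensor-parallel group
    p = (rank // tensor) % pipeline    # pipeline stage
    d = rank // (tensor * pipeline)    # data-parallel replica
    return d, p, t


def ultraserver_index(rank, chips_per_ultraserver=64):
    """Which UltraServer a rank lives on, assuming ranks are assigned contiguously."""
    return rank // chips_per_ultraserver


if __name__ == "__main__":
    for rank in (0, 63, 64, 256, 511):
        print(rank, rank_to_coords(rank), "UltraServer", ultraserver_index(rank))
```

With a tensor-parallel size equal to the 64 chips of an UltraServer, the bandwidth-hungry tensor-parallel traffic stays on NeuronLink, and only pipeline and data-parallel traffic crosses the EFA network between UltraServers.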

How do developers program Trainium? The Neuron SDK

The Neuron SDK is what makes the hardware usable. It is split into a runtime, a compiler, and a set of framework integrations.

The Neuron Compiler is an XLA-based graph compiler that ingests models from PyTorch (via XLA), TensorFlow, JAX, and MXNet, lowers them to a Neuron-specific intermediate representation, and emits binaries the chip can run. It handles tiling for the SRAM hierarchy, scheduling across the engines inside each NeuronCore, and collective placement. Most users never touch it directly.

PyTorch on Neuron has shifted over the lifetime of the SDK. The original torch-neuronx package wraps PyTorch/XLA and is the most battle-tested path for distributed training. The newer TorchNeuron Native backend, added in 2025, provides eager execution, torch.compile, and the standard distributed APIs (FSDP, DDP, DTensor, tensor parallel) directly on Trainium. AWS now positions TorchNeuron Native as the recommended starting point for new workloads.

JAX on Neuron uses the same XLA compiler path. JAX programs lower to HLO, which the Neuron compiler then targets to the chip. AWS has shipped reference implementations of large language model pretraining in both PyTorch and JAX.

Hugging Face integration comes via the optimum-neuron library, an open source Hugging Face project that lets Transformers users fine-tune and serve models on Trainium and Inferentia without rewriting their training code. AWS and Hugging Face also publish an aws-neuron model namespace with precompiled artifacts.

The Neuron Kernel Interface (NKI) is the lowest-level developer surface. It exposes the GPSIMD engine and the tensor engine in a Python-embedded DSL that compiles down to chip instructions. NKI is the equivalent of writing a CUDA kernel and is the path most heavy users take when the compiler is leaving performance on the table.

Amazon SageMaker has first-class Trainium support: SageMaker training jobs and HyperPod clusters can be launched directly on Trn1, Trn2, and Trn3 instances, and SageMaker JumpStart hosts ready-to-train recipes for popular open models on Neuron.

The Neuron stack also integrates with Kubernetes (EKS), ECS, Ray, and Slurm-based HPC schedulers, and supports container images via the AWS Deep Learning Containers.

What is Project Rainier?

Project Rainier is the largest Trainium deployment in the world. It is a multi-site EC2 UltraCluster built primarily for Anthropic and used to train and serve Claude. AWS announced the project in 2024 and brought the first phase online in less than a year. The flagship site is an $11 billion campus in St. Joseph County, Indiana, near New Carlisle, that broke ground in October 2024 and was activated in October 2025. The campus spans roughly 1,200 acres and is engineered for outside-air cooling, requiring zero water for cooling October through March and minimal water April through September. Other Project Rainier capacity is spread across additional US sites.^[4]^[14]

AWS describes the active Project Rainier deployment as nearly 500,000 Trainium2 chips, with plans to scale beyond one million Trainium2 chips by year end 2026 across both training and inference workloads. The cluster gives Anthropic more than five times the compute it used to train its previous generation of Claude models. AWS calls Project Rainier 70 percent larger than any other AI compute platform in the company's history.^[4]

The physical building block at every site is the Trn2 UltraServer (four trn2u.48xlarge instances, 64 chips, NeuronLink) connected upward by EFAv3. The custom NeuronLink topology and the Nitro-managed network are the reason AWS can stand up clusters of this size on its own silicon.

Project Rainier should not be confused with Project Ceiba, the joint AWS / Nvidia supercomputer announced at re:Invent 2023, which is built from Nvidia GH200 Grace Hopper Superchips and Nvidia Blackwell systems and runs on AWS Nitro and EFA. Ceiba is Nvidia silicon hosted on AWS infrastructure. Rainier is AWS silicon hosting a single anchor customer.

Who uses AWS Trainium?

Anthropic is the headline customer. Amazon and Anthropic announced a deeper strategic alliance on November 22, 2024 that took Amazon's total investment in Anthropic to $8 billion (an additional $4 billion on top of an earlier $4 billion commitment) and named AWS as Anthropic's primary cloud and training partner.^[15]

On April 20, 2026, the two companies expanded the deal further. Anthropic committed to spending more than $100 billion over the next ten years on AWS technologies (Trainium chips plus tens of millions of Graviton cores) to secure up to 5 GW of capacity for training and serving Claude. The contract spans Trainium2, Trainium3, Trainium4, and future generations. Anthropic said meaningful new Trainium2 capacity would come online in Q2 2026 and that roughly 1 GW of combined Trainium2 and Trainium3 capacity would be online by the end of 2026. Amazon, in turn, committed an immediate $5 billion investment with up to $20 billion more tied to commercial milestones, on top of the $8 billion it had previously invested, taking the potential total Amazon stake to roughly $33 billion. The deal also expanded international inference capacity for Claude in Asia and Europe and announced that the full Anthropic-native Claude Platform console would be accessible from inside the AWS console.^[16]^[17]^[18]

Amazon CEO Andy Jassy framed the silicon, not just the partnership, as the reason for the demand. "Our custom AI silicon offers high performance at significantly lower cost for customers, which is why it's in such hot demand," Jassy said. "Anthropic's commitment to run its large language models on AWS Trainium for the next decade reflects the progress we've made together on custom silicon."^[16] Anthropic CEO Dario Amodei tied the deal to demand for Claude: "Our users tell us Claude is increasingly essential to how they work, and we need to build the infrastructure to keep pace with rapidly growing demand," he said, adding that the collaboration would let Anthropic "continue advancing AI research while delivering Claude to our customers, including the more than 100,000 building on AWS."^[17]

The customer list AWS highlighted at re:Invent 2025 included, in addition to Anthropic, Karakuri (the Japanese LLM company, which has reported reducing training costs by more than 50 percent on Trainium), Metagenomi, NetoAI, Ricoh (which trained a Japanese language LLM on a 256-node Trn1 cluster, reportedly cutting training cost by 50 percent and training time by 25 percent versus GPUs), Splash Music, and Decart (which has reported 4x faster inference for real-time generative video at half the cost of GPUs). Other publicly named users include the AI coding company poolside, Databricks (which has integrated Trainium support into Mosaic AI), and HeliXon (protein structure prediction). Inside Amazon, Trainium is used by Amazon Search, by Amazon Bedrock for latency-optimized inference of models including Claude 3.5 Haiku and Llama 3.1 405B, and to train the Amazon Nova family.^[1]^[5]

AWS does not publish chip-level revenue, but Andy Jassy disclosed at re:Invent 2025 that the Trainium2 business had reached a multi-billion dollar annualized revenue run rate, with more than 1 million chips in production and over 100,000 companies using Trainium, accounting for the majority of Bedrock usage at the time.^[13]

How much cheaper is Trainium than GPUs?

AWS publishes a few headline numbers that come up in almost every Trainium discussion:

Trn1 instances offer up to 50 percent lower training cost than comparable GPU EC2 instances, per AWS's launch materials and several customer case studies (Ricoh, Karakuri).
Trn2 instances offer 30 to 40 percent better price performance than Nvidia H100 based EC2 P5e and P5en instances on AWS-measured workloads.
Trn2 instances deliver about 4x the performance, 4x the memory bandwidth, and 3x the memory capacity of Trn1 instances, per AWS.
Trn3 UltraServers are advertised at 4.4x higher peak performance, 3.9x higher memory bandwidth, and over 4x better energy efficiency than Trn2 UltraServers; the Gen2 UltraServer hits 362 PFLOPS of FP8 / MXFP8 per rack.^[1]
AWS announced at re:Invent 2025 that customers including Anthropic, Karakuri, Metagenomi, NetoAI, Ricoh, and Splash Music had reduced training and inference costs by up to 50 percent on Trainium.^[1]

These are vendor figures. AWS has historically not submitted Trainium results to MLPerf Training, so direct head-to-head comparisons against NVIDIA H100, NVIDIA Blackwell, and Google TPU v5p / Trillium happen through customer case studies and analyst pieces rather than a shared benchmark suite.^[11]
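
"Better price performance" and "lower cost" are not the same percentage, which is easy to miss when comparing the claims above. The sketch below converts between the two, assuming price performance means useful work per dollar; that definition is an assumption, since the article does not spell out how AWS computes the figure.

```python
def relative_cost(price_perf_gain: float) -> float:
    """Cost of the same work relative to baseline, for a fractional gain in work per dollar."""
    return 1.0 / (1.0 + price_perf_gain)


def price_perf_gain_from_cost_cut(cost_cut: float) -> float:
    """Fractional gain in work per dollar implied by a fractional cost reduction."""
    if not 0.0 <= cost_cut < 1.0:
        raise ValueError("cost_cut must be in [0, 1)")
    return 1.0 / (1.0 - cost_cut) - 1.0


if __name__ == "__main__":
    for gain in (0.30, 0.40):
        saving = 1.0 - relative_cost(gain)
        print(f"{gain:.0%} better price performance -> {saving:.1%} lower cost")
    print(f"50% lower cost -> {price_perf_gain_from_cost_cut(0.5):.0%} better price performance")
```

Under that definition, 30 to 40 percent better price performance corresponds to roughly 23 to 29 percent lower cost for the same work, while a 50 percent cost reduction corresponds to a 100 percent gain in price performance.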

How does Trainium compare with TPU and Nvidia GPUs?

In the broader market for AI accelerators, Trainium occupies an in-between position. Google's TPU is the closest analogue: a hyperscaler-owned ASIC, only available through the corresponding cloud, with a software stack (XLA / JAX) that is shared with the rest of the company's ML tooling. Nvidia GPUs, especially the H100 and A100, define the rest of the market: available everywhere, well documented, and backed by the CUDA ecosystem.

Dimension	AWS Trainium	Google TPU	Nvidia H100 / Blackwell
Owner	AWS (Annapurna Labs)	Google	Nvidia
Availability	EC2 (Trn1, Trn2, Trn3)	Google Cloud only	All major clouds + on-prem
Native software	Neuron SDK, XLA	XLA, JAX, TF	CUDA, cuDNN, NCCL
Top-end interconnect	NeuronLink-v4 + NeuronSwitch-v1 + EFAv3	ICI + OCS optical pod fabric	NVLink + InfiniBand / Ethernet
Rack-scale system	Trn3 Gen2 UltraServer (144 chips)	TPU v5p / Trillium pods	NVL72 / NVL576 (GB200, GB300)
Public benchmark presence	Customer case studies	MLPerf submissions	MLPerf submissions
Anchor customer	Anthropic	Google internal, Anthropic, others	Almost everyone

The trade-off for buyers is well known. Trainium and TPU give better price performance for customers willing to live inside one cloud and absorb the cost of porting. CUDA and Nvidia GPUs give portability, a larger pool of pretrained models, and a much deeper third-party software ecosystem. For most teams running a few GPU-hours a week, the math favors renting H100s or A100s. For teams running thousands of nodes for months, even a 20 percent saving is enough to justify the porting work, which is why most of the public Trainium customers are companies training their own foundation models.
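
The scale argument can be put into a simple break-even calculation: a one-time porting cost is recovered once the accumulated hourly savings exceed it. All numbers in the example are hypothetical placeholders (the porting cost and the per-node GPU rate in particular are invented for illustration), and the model ignores factors such as engineering time after the port or differences in delivered throughput.

```python
def break_even_hours(porting_cost_usd: float, nodes: int,
                     gpu_rate_per_node_hour: float, saving_fraction: float) -> float:
    """Wall-clock hours of continuous use needed to recover a one-time porting cost."""
    if min(porting_cost_usd, nodes, gpu_rate_per_node_hour, saving_fraction) <= 0:
        raise ValueError("all inputs must be positive")
    hourly_saving = nodes * gpu_rate_per_node_hour * saving_fraction
    return porting_cost_usd / hourly_saving


if __name__ == "__main__":
    # Hypothetical inputs: $500k port, $50 per GPU node-hour, 20 percent saving.
    for nodes in (1, 1000):
        hours = break_even_hours(500_000, nodes, 50.0, 0.20)
        print(f"{nodes} node(s): break even after {hours:,.0f} hours")
```

With these placeholder inputs a single node would need about 50,000 hours (more than five years) to pay back the port, while a 1,000-node fleet recovers it in about 50 hours, which is the asymmetry the paragraph above describes.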

What are the limitations of Trainium?

The most common complaint about Trainium is the same one leveled at every non-CUDA accelerator: the software ecosystem is younger and shallower. CUDA has had two decades to grow a library of tuned kernels, third-party frameworks, and trained engineers. Neuron has had only a few years. In practice this shows up as occasional rough edges:

Some operations that are trivial on a GPU need a custom NKI kernel on Trainium because the compiler does not yet generate optimal code for them.
New research models (especially novel attention variants and state space models) often appear on CUDA first and take time to run efficiently on Neuron.
Debugging tools, profilers, and the broader open-source kernel ecosystem are smaller, even with Hugging Face, SageMaker, and TorchTitan integrations to fill in the gaps.
The chips are only available through AWS, so customers who want geographic redundancy across cloud providers cannot multi-source on identical hardware.

AWS has invested heavily to close these gaps. The TorchNeuron Native backend, the NKI kernel interface, the JAX and Hugging Face integrations, and the steady cadence of compiler releases all aim at making Trainium feel more like "just another PyTorch backend." The fact that Anthropic, the company training one of the most capable language models in the world, has bet most of its future training capacity on Trainium suggests the trade-off has crossed the line into acceptable for at least one very demanding user.

Roadmap

AWS publicly disclosed the broad cadence of Trainium generations at re:Invent 2025:

Generation	Process	First shipped	Status (May 2026)
Trainium1	TSMC 7 nm	October 2022 (Trn1)	Generally available
Trainium2	TSMC 5 nm	December 2024 (Trn2 + Trn2 UltraServer)	Generally available; primary deployment in Project Rainier
Trainium3	TSMC 3 nm class (N3P)	December 2, 2025 (Trn3 UltraServer Gen1 and Gen2)	Generally available
Trainium4	Not disclosed	Future	Previewed at re:Invent 2025; AWS targets >=3x FP8 of Trainium3 and 4x memory bandwidth

Trainium4, previewed by Matt Garman in the December 2025 keynote, is described by AWS as a chiplet-heavy design that scales out across many smaller dies rather than building a larger monolithic accelerator. The deal with Anthropic announced on April 20, 2026 includes the right to purchase Trainium4 and future generations.^[13]^[16]

References

1. AWS, Announcing Amazon EC2 Trn3 UltraServers for faster, lower-cost generative AI training, December 2, 2025.
2. AWS Neuron Documentation, Trainium3 Architecture.
3. HPCwire, AWS Brings the Trainium3 Chip to Market With New EC2 UltraServers, December 2, 2025.
4. About Amazon, AWS activates Project Rainier: One of the world's largest AI clusters.
5. About Amazon, AWS re:Invent 2025: Amazon announces Nova 2 and other AI news and updates.
6. Wikipedia, Annapurna Labs.
7. AWS Neuron Documentation, NeuronCore-v2 Architecture.
8. AWS Press Center, AWS Announces General Availability of Amazon EC2 Trn1 Instances Powered by AWS-Designed Trainium Chips, October 10, 2022.
9. AWS News Blog, Amazon EC2 Trn2 Instances and Trn2 UltraServers for AI/ML training and inference are now available, December 3, 2024.
10. AWS Neuron Documentation, Amazon EC2 Trn2 Architecture.
11. MLCommons, MLCommons Releases MLPerf Training v5.1 Results, November 2025.
12. AWS Neuron Documentation, Trainium3 Architecture.
13. TechCrunch, Andy Jassy says Amazon's Nvidia competitor chip is already a multi-billion-dollar business, December 3, 2025.
14. CNBC, Amazon opens $11 billion AI data center Project Rainier in Indiana, October 29, 2025.
15. About Amazon, Amazon and Anthropic expand strategic collaboration, November 22, 2024.
16. About Amazon, Amazon and Anthropic deepen strategic collaboration, April 20, 2026.
17. Anthropic, Anthropic and Amazon expand collaboration for up to 5 GW of new compute, April 20, 2026.
18. TechCrunch, Anthropic takes $5B from Amazon and pledges $100B in cloud spending in return, April 20, 2026.
19. AWS, AI Accelerator: AWS Trainium.
20. AWS, Amazon EC2 Trn2 Instances.
21. AWS, Amazon EC2 Trn1 Instances.
22. Anthropic, Powering the next generation of AI development with AWS.
23. AWS Neuron SDK GitHub repository, aws-neuron/aws-neuron-sdk.
24. Hugging Face, optimum-neuron.
