AWS Inferentia

Updated Jun 23, 2026

AWS Inferentia is a family of custom application-specific integrated circuits (ASICs) designed by Amazon Web Services for machine learning inference in the cloud, built to deliver, in AWS's words, "high performance at the lowest cost in Amazon EC2 for deep learning and generative AI inference."^[3] Designed by Amazon's Annapurna Labs subsidiary, Inferentia powers the Amazon Elastic Compute Cloud (EC2) Inf1 and Inf2 instance families and was one of the first hyperscaler-designed AI inference chips to reach wide production deployment. The first-generation chip was announced at AWS re:Invent in November 2018 and reached general availability in December 2019, while the second-generation Inferentia2 was announced at re:Invent 2022 and launched in April 2023 with four times the memory and roughly 16 times the memory bandwidth, aimed at generative AI workloads such as large language models and latent diffusion models. AWS currently markets the Inferentia family as delivering "up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances."^[1]^[2]^[3]

Inferentia sits within a broader AWS silicon roadmap that includes the AWS Trainium training accelerators, the second-generation Trainium2 chip launched in 2024, and the Trainium3 chip introduced at re:Invent 2025. While Inferentia chips are dedicated to inference, the Trainium line supports both training and, increasingly, large-scale inference, leading industry observers to describe a gradual convergence of the two product families. As of May 2026, AWS has not announced an Inferentia3 chip, and the company appears to be channelling its next-generation efforts into Trainium silicon that can serve both workloads.^[4]^[5]^[6]

What is AWS Inferentia?

AWS Inferentia is Amazon's purpose built accelerator for the inference stage of machine learning, the phase in which an already trained model is run to generate predictions, as distinct from the compute intensive training phase. Where general purpose GPUs are designed to serve both training and inference, Inferentia is optimised specifically for high throughput, low cost prediction serving inside AWS, and it is sold only as part of EC2 instances and managed AWS services rather than as standalone hardware. AWS positions the chip on price-performance rather than peak per chip speed: the pitch is the lowest cost per inference for a given throughput and latency target, achieved through vertical integration with the AWS cloud and the mature Neuron software stack.^[3]^[13]

The chip family complements AWS Trainium, Amazon's training-focused accelerator. Both lines are designed by the same Annapurna Labs engineering team, share the NeuronCore architectural lineage, and are programmed through the same AWS Neuron SDK, which lets developers move a model between training and inference accelerators without rewriting deployment code.^[4]^[14]

AWS Annapurna Labs background

Annapurna Labs was founded in 2011 in Israel by Bilic "Billy" Hrvoje, Nafea Bshara and Ronen Boneh, and was named after the Annapurna Massif in the Himalayas. The startup operated in stealth for several years, designing networking and SmartNIC silicon for data center customers, with early backers including Avigdor Willenz, Walden International, Arm Holdings and TSMC. In January 2015 Amazon acquired Annapurna Labs for a reported price of US$350 to US$370 million and folded the team into the AWS organisation, where it became the engineering core for AWS designed silicon.^[7]

Following the acquisition, Annapurna Labs developed multiple chip lines for AWS. The first major commercial product was the AWS Nitro System, a combination of hardware offload cards and a lightweight hypervisor first deployed at scale starting in 2017. The team then produced the Arm-based Graviton general-purpose CPU family, and in November 2018 AWS announced the first of its machine learning ASICs, branded Inferentia (for inference), followed later by Trainium (for model training). Annapurna Labs maintains a design centre in Austin, Texas alongside its Israeli engineering base, and the operation has been described by AWS leadership and outside observers as the "secret sauce" behind AWS's vertical integration strategy.^[7]^[8]

When was AWS Inferentia announced and released?

AWS Inferentia was first publicly announced by then AWS CEO Andy Jassy during his keynote at AWS re:Invent 2018 in Las Vegas on 28 November 2018. Jassy positioned the chip as a low cost, low latency alternative to graphics processing units (GPUs) for machine learning inference, the production phase of deploying trained models, describing Inferentia as "a very high-throughput, low-latency, sustained-performance very cost-effective processor" and indicating that it would become available the following year.^[9]^[10] At launch AWS stated the chip would support FP16, BF16 and INT8 numerical formats and would integrate with the TensorFlow, Apache MXNet and PyTorch frameworks, as well as models exported to the ONNX exchange format.^[9]^[10]

The chip itself became generally available roughly one year later when AWS announced the Amazon EC2 Inf1 instance family on 3 December 2019 at re:Invent 2019. Each first generation Inferentia chip contained four NeuronCore-v1 cores and delivered up to 128 INT8 tera operations per second (TOPS) and 64 FP16/BF16 tera floating point operations per second (TFLOPS). The chip carried 8 GiB of off chip DDR4 DRAM with approximately 50 GiB/sec of bandwidth, plus a large on chip cache to reduce external memory traffic. The Inf1 family scaled from a single chip in the inf1.xlarge size up to 16 Inferentia chips in the largest inf1.24xlarge instance, which AWS described as offering over two peta operations per second of aggregate inference throughput.^[2]^[11]

The launch lineup included four instance sizes: inf1.xlarge (1 chip, 4 vCPUs, 8 GiB system memory), inf1.2xlarge (1 chip, 8 vCPUs, 16 GiB), inf1.6xlarge (4 chips, 24 vCPUs, 48 GiB) and inf1.24xlarge (16 chips, 96 vCPUs, 192 GiB). Networking ranged from up to 25 Gbps to 100 Gbps on the largest size. The Inf1 instances paired the Inferentia chips with second generation Intel Xeon Scalable host CPUs and were available initially in the US East (N. Virginia) and US West (Oregon) regions through On Demand, Spot, Reserved Instance and Savings Plan purchasing options.^[2]

At launch AWS claimed Inf1 instances delivered up to 3x higher inference throughput and up to 40% lower cost per inference than the company's GPU based EC2 G4 instances, which used NVIDIA T4 GPUs and had previously been the lowest cost inference option on AWS.^[2]

NeuronCore architecture

The NeuronCore is the basic compute building block of every Inferentia and Trainium chip. The first generation NeuronCore-v1 used in Inferentia 1 implements a high performance systolic array based matrix multiplication engine targeted at the dense linear algebra operations dominant in deep learning workloads, particularly convolutions and the attention and feed forward layers of transformer models. Each NeuronCore is paired with a sizeable on chip cache that reduces the frequency of external DRAM accesses, which helps maintain throughput on large models.^[11]^[12]
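The systolic-array idea mentioned above can be illustrated with a toy simulation. The sketch below is a textbook output-stationary schedule, not a description of the actual NeuronCore microarchitecture, which the sources do not detail; the function name `systolic_matmul` and the cycle model are assumptions made for illustration.

```python
def systolic_matmul(a, b):
    """Multiply a (M x K) by b (K x N) on a simulated M x N grid of
    processing elements (PEs).

    Output-stationary schedule (an illustrative assumption): PE (i, j)
    keeps a running sum for c[i][j]. Row i of `a` enters from the left
    delayed by i cycles and column j of `b` enters from the top delayed
    by j cycles, so at cycle t the PE sees a[i][k] and b[k][j] with
    k = t - i - j.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    # The last operand pair reaches PE (m-1, n-1) at cycle k_dim + m + n - 3.
    total_cycles = k_dim + m + n - 2
    for t in range(total_cycles):
        for i in range(m):
            for j in range(n):
                k = t - i - j
                if 0 <= k < k_dim:  # operands have arrived at this PE
                    c[i][j] += a[i][k] * b[k][j]
    return c, total_cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)
```

Each PE performs one multiply-accumulate per cycle using operands handed on by its neighbours, which is why such arrays keep arithmetic units busy while touching external memory only at the edges of the grid.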

With Inferentia2, AWS introduced NeuronCore-v2, a substantially redesigned core. Inferentia2 has fewer cores per chip (two NeuronCore-v2 cores versus four NeuronCore-v1 cores on Inferentia1), but each core offers far higher compute and richer functionality. NeuronCore-v2 is organised around four independent execution engines:

TensorEngine: a matrix multiplication engine optimised for dense linear algebra. AWS reported the TensorEngine in NeuronCore-v2 as approximately 6x faster than its NeuronCore-v1 predecessor.^[13]
VectorEngine: a unit specialised for non element wise vector operations such as batch normalisation, layer normalisation, pooling and reductions, reported as approximately 10x faster than NeuronCore-v1.^[13]
ScalarEngine: a unit optimised for element wise operations such as rectified linear unit (ReLU) activations, sigmoid and other point wise non linearities, reported as approximately 3x faster than NeuronCore-v1.^[13]
GPSIMD-Engine: a new general purpose single instruction multiple data programmable engine that lets developers write custom C++ operators executed directly on the NeuronCore for control flow, fused custom kernels and operations not natively supported by the compiler.^[3]^[13]

NeuronCore-v2 also introduced ISA-level support for dynamic input shapes, which is important for transformer and large language model inference where sequence lengths vary between requests, along with stochastic rounding and configurable FP8 (cFP8) formats. The improved core architecture, combined with the higher-capacity HBM memory subsystem, accounts for most of the headline improvements that AWS attributes to Inferentia2 relative to Inferentia1: up to 4x higher throughput and up to 10x lower latency.^[13]
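For readers unfamiliar with stochastic rounding, the general technique can be sketched in a few lines. This is a generic software illustration under stated assumptions, not the hardware implementation in NeuronCore-v2; the function name `stochastic_round` and the `step` grid parameter are hypothetical.

```python
import math
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of `step`, rounding up with a probability
    equal to the fraction of the way x sits toward the next grid point."""
    lower = math.floor(x / step) * step
    frac = (x - lower) / step
    return lower + step if rng.random() < frac else lower


rng = random.Random(0)  # seeded so repeated runs draw the same samples
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(20_000)]
# The average lands near 0.3, whereas round-to-nearest would always give 0.0.
print(sum(samples) / len(samples))
```

Because the rounding error averages out to zero, small values are not systematically lost when many low-precision results are accumulated, which is the usual motivation for offering the mode alongside reduced-precision formats.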

Neuron SDK

The AWS Neuron Software Development Kit (Neuron SDK) is the open source software stack used to compile, profile, debug and deploy models on Inferentia and Trainium hardware. Neuron sits between high level ML frameworks and the underlying NeuronCore hardware, providing a compiler that lowers framework graphs into Neuron specific intermediate representations and ultimately into NeuronCore instructions, plus a runtime that schedules execution across one or more chips.^[14]

Framework integration covers PyTorch (via the torch-neuron and torch-neuronx packages, the latter built on PyTorch XLA), TensorFlow, JAX and Apache MXNet. The Neuron SDK also exposes higher-level integrations with libraries widely used in production inference deployments, including the Hugging Face Optimum Neuron project, which bridges the Transformers and Diffusers libraries to Neuron hardware, and the vLLM inference server, which supports continuous batching, expert parallelism, disaggregated inference and speculative decoding on Inferentia and Trainium. AWS additionally provides a lower-level kernel programming interface, the Neuron Kernel Interface (NKI), built on MLIR, for advanced users who want to write custom kernels directly against the NeuronCore.^[14]^[15]

Neuron is the same software stack used by both AWS Trainium and Inferentia, which lets developers move models between training and inference accelerators without rewriting deployment code. AWS publishes Neuron releases on GitHub and through public package repositories, and provides Deep Learning Amazon Machine Images (DLAMIs) preconfigured for Inf1, Inf2, Trn1 and Trn2 instances. The SDK also integrates with Amazon SageMaker, Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS) and AWS ParallelCluster for managed and orchestrated deployments.^[3]^[14]

Inferentia 2 (announced re:Invent 2022)

At AWS re:Invent 2022, held in late November and early December 2022 in Las Vegas, AWS announced the preview of Amazon EC2 Inf2 instances powered by a new Inferentia2 chip designed specifically for large generative AI models. The announcement positioned Inferentia2 as a response to the rapid growth in the size of transformer models, where many state-of-the-art systems by 2022 had outgrown the 8 GiB per-chip memory budget of the original Inferentia and could no longer be served efficiently on Inf1.^[16]

Inferentia2 retained the basic NeuronCore organisation of its predecessor but rebalanced the chip toward larger model capacity. Each Inferentia2 chip integrates two NeuronCore-v2 cores delivering 380 INT8 TOPS and 190 FP16/BF16/cFP8/TF32 TFLOPS, plus 47.5 FP32 TFLOPS. The chip is paired with 32 GiB of on-package high bandwidth memory (HBM), a fourfold capacity increase over the 8 GiB of DDR4 on Inferentia1, with bandwidth rising approximately 16x to 820 GiB/sec per chip. The chip also exposes 1 TB/sec of direct memory access (DMA) bandwidth with inline compression and decompression for the model parameter pipeline.^[13]^[17]

Supported data types on Inferentia2 expanded to cover FP32, TF32, BF16, FP16, INT8, UINT8 and the new configurable FP8 (cFP8) format, the latter of which AWS positioned as particularly useful for large language model inference because of its reduced memory footprint and I/O cost without the accuracy loss often seen with INT8 post training quantisation. As AWS describes it, "Inferentia2 adds additional support for FP32, TF32, and the new configurable FP8 (cFP8) data type to provide developers more flexibility to optimize performance and accuracy."^[3]^[13]

The chips communicate with one another through NeuronLink-v2, a dedicated chip-to-chip interconnect that provides 192 GB/sec of bidirectional bandwidth on Inf2 instances. NeuronLink-v2 supports collective communication operators such as all-reduce, which enables distributed inference where a single large model is sharded across multiple Inferentia2 chips in the same instance. AWS demonstrated that a 175 billion parameter transformer model can be served for inference across multiple Inferentia2 chips within a single inf2.48xlarge instance, a milestone aimed squarely at deployment of GPT-class large language models.^[13]^[16]
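A rough capacity check shows why sharding across the whole instance is needed for a model of that size. The sketch below counts model weights only and ignores activations, KV cache and runtime overhead, which is a simplifying assumption; the bytes-per-parameter table and function names are illustrative, and the chip count of 12 is the inf2.48xlarge figure given elsewhere in this article.

```python
GIB = 2**30
HBM_PER_CHIP_GIB = 32        # Inferentia2 HBM capacity per chip
CHIPS_INF2_48XLARGE = 12     # chips in the largest Inf2 size

# Bytes per parameter for a few formats (weights only, an assumption).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "cfp8": 1}


def weight_footprint_gib(params: float, dtype: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return params * BYTES_PER_PARAM[dtype] / GIB


def fits(params: float, dtype: str, chips: int = CHIPS_INF2_48XLARGE) -> bool:
    """True if the weights fit in the pooled HBM of `chips` chips."""
    return weight_footprint_gib(params, dtype) <= chips * HBM_PER_CHIP_GIB


for dtype in ("fp32", "bf16", "cfp8"):
    size = weight_footprint_gib(175e9, dtype)
    print(f"{dtype}: {size:.0f} GiB, fits in pooled HBM: {fits(175e9, dtype)}")
```

Under these assumptions a 175 billion parameter model needs about 326 GiB in BF16, far more than one chip's 32 GiB but within the 384 GiB pooled across twelve chips, while an FP32 copy would not fit at all.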

How does Inferentia2 differ from Inferentia1?

Inferentia2 keeps the same NeuronCore lineage as the first generation chip but rebalances it heavily toward the memory capacity and bandwidth that large generative models need, which is why AWS reports up to 4x higher throughput and up to 10x lower latency on Inf2 versus Inf1. The largest single difference is memory: Inferentia2 carries 32 GiB of on package HBM, a four times increase over the 8 GiB of DDR4 on Inferentia1, with per chip bandwidth rising roughly 16x to 820 GiB/sec, which is what allows multi billion parameter models to stay resident in accelerator memory. The table below summarises the per chip and per instance differences.^[11]^[13]^[17]

Attribute	Inferentia1 (Inf1, 2019)	Inferentia2 (Inf2, 2023)
NeuronCores per chip	4 x NeuronCore-v1	2 x NeuronCore-v2
Accelerator memory	8 GiB DDR4 (off chip)	32 GiB HBM (on package)
Memory bandwidth per chip	~50 GiB/sec	~820 GiB/sec
Peak INT8 throughput	128 TOPS	380 TOPS
Peak FP16/BF16 throughput	64 TFLOPS	190 TFLOPS
Chip to chip interconnect	NeuronLink-v1	NeuronLink-v2, 192 GB/sec
Max chips per instance	16 (inf1.24xlarge)	12 (inf2.48xlarge)
Supported data types	FP16, BF16, INT8	FP32, TF32, BF16, FP16, INT8, UINT8, cFP8
Primary target	classic ML inference (CV, NLP, ranking)	generative AI, LLMs up to ~175B params

Beyond raw numbers, NeuronCore-v2 added ISA level support for dynamic input shapes (important when sequence lengths vary between requests), stochastic rounding, the configurable cFP8 format for memory efficient LLM inference, and the NeuronLink-v2 interconnect that lets a single large model be sharded across all the chips in an instance. AWS also reports up to 50% better performance per watt on Inf2 relative to Inf1.^[1]^[13]
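The generation-over-generation ratios can be recomputed from the per-chip figures in the table. The snippet is a minimal sketch; the dictionary keys are illustrative names, and the values are the peak figures quoted above.

```python
# Per-chip peak figures from the comparison table.
inferentia1 = {"memory_gib": 8, "bandwidth_gib_s": 50, "int8_tops": 128, "fp16_tflops": 64}
inferentia2 = {"memory_gib": 32, "bandwidth_gib_s": 820, "int8_tops": 380, "fp16_tflops": 190}

# Ratio of Inferentia2 to Inferentia1 for each attribute.
ratios = {key: inferentia2[key] / inferentia1[key] for key in inferentia1}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

Peak per-chip compute rises by a little under 3x while memory bandwidth rises by more than 16x, which is the rebalancing toward memory that the section describes.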

Inf2 instances

The Inf2 instance family reached general availability on 13 April 2023, initially in the US East (Ohio) and US East (N. Virginia) regions, with subsequent expansion to additional regions. AWS describes Inf2 as instances that "are designed to run high-performance DL inference applications at scale globally" and as the first inference optimised EC2 instances to support scale out distributed inference across accelerators. AWS offers four Inf2 sizes:^[1]^[17]

Instance	Inferentia2 chips	NeuronCore-v2 cores	Accelerator HBM	vCPUs	System memory	Network bandwidth
inf2.xlarge	1	2	32 GiB	4	16 GiB	up to 15 Gbps
inf2.8xlarge	1	2	32 GiB	32	128 GiB	up to 25 Gbps
inf2.24xlarge	6	12	192 GiB	96	384 GiB	50 Gbps
inf2.48xlarge	12	24	384 GiB	192	768 GiB	100 Gbps

The largest inf2.48xlarge size delivers an aggregate 2.3 petaFLOPS of BF16/FP16 dense compute, 384 GiB of pooled HBM accelerator memory and 9.8 TB/sec of total memory bandwidth across the twelve chips, with the chips wired together through 192 GB/sec NeuronLink-v2 for distributed inference of very large models.^[1]^[17]
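These aggregate figures are simply the per-chip numbers multiplied by the chip count. A minimal sketch of that arithmetic, using the per-chip Inferentia2 figures and the chip counts from the table above (the function and constant names are illustrative, not part of any AWS API):

```python
# Per-chip Inferentia2 peak figures quoted in this article.
TFLOPS_BF16_PER_CHIP = 190
HBM_GIB_PER_CHIP = 32
BANDWIDTH_GIB_S_PER_CHIP = 820

# Chip counts from the Inf2 instance table.
INF2_CHIPS = {
    "inf2.xlarge": 1,
    "inf2.8xlarge": 1,
    "inf2.24xlarge": 6,
    "inf2.48xlarge": 12,
}


def instance_totals(name: str) -> dict:
    """Theoretical peak totals for an Inf2 size (chips x per-chip figure)."""
    chips = INF2_CHIPS[name]
    return {
        "bf16_pflops": chips * TFLOPS_BF16_PER_CHIP / 1000,
        "hbm_gib": chips * HBM_GIB_PER_CHIP,
        "bandwidth_gib_s": chips * BANDWIDTH_GIB_S_PER_CHIP,
    }


print(instance_totals("inf2.48xlarge"))
```

For inf2.48xlarge this gives 2.28 petaFLOPS, 384 GiB and 9,840 GiB/sec, matching the rounded 2.3 petaFLOPS and 9.8 TB/sec quoted above. These are theoretical peaks; delivered throughput depends on the model and compiler.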

Compared to the prior generation Inf1 instances, AWS reports Inf2 delivers up to 4x higher throughput, up to 10x lower latency and up to 50% better performance per watt on representative deep learning inference workloads. Inf2 is available through On Demand, Reserved Instance, Spot and Savings Plan purchasing models, and is supported as a managed endpoint in Amazon SageMaker as well as in EKS and ECS.^[1]

What is AWS Inferentia used for?

Amazon Alexa

The largest publicly disclosed Inferentia 1 deployment is the Amazon Alexa voice assistant. On 12 November 2020 AWS announced that the Alexa team had migrated the majority of its GPU based machine learning inference workloads, with a focus on the text to speech (TTS) component of the Alexa pipeline, to Inf1 instances. AWS reported that the migration produced approximately 25% lower end to end latency and approximately 30% lower cost compared with the previous GPU based instances for Alexa TTS workloads. The text to speech improvements directly affect the perceived responsiveness of the voice assistant on more than 100 million Alexa enabled devices worldwide.^[18]

Amazon Search and other internal Amazon workloads

Amazon Search has been an early adopter of both Inferentia generations, with AWS reporting up to 85% cost reduction on certain ranking and retrieval models when migrated from GPU based instances to Inferentia. Other internal Amazon teams have used Inferentia for personalisation, fraud detection and recommendation workloads where high throughput at low cost per inference is more important than absolute peak per query latency.^[3]

Anthropic and Claude inference

Anthropic, the developer of the Claude family of large language models, has progressively deepened its partnership with AWS, which by late 2024 had invested a cumulative US$8 billion in Anthropic and signed a strategic compute agreement under which Anthropic adopted AWS as its primary training partner. In November 2024 the companies announced that Anthropic would co-develop future generations of AWS Trainium silicon, contributing low-level kernels and improvements to the Neuron SDK. Project Rainier, the resulting compute cluster, was activated in 2025 and ultimately contains hundreds of thousands of Trainium2 chips spread across multiple US data centres including a large site in Indiana, with capacity scaling toward multiple gigawatts of power draw under the broader Amazon-Anthropic agreement.^[19]^[20]

It is important to distinguish how the Inferentia and Trainium lines are used in this partnership. Training of Anthropic's Claude models, including Claude Opus 4.7 and earlier Claude generations, runs primarily on AWS Trainium2 (and successor Trainium silicon), not on Inferentia. Inferentia is designed for inference workloads only and lacks the cluster-scale interconnect and per-chip FLOP budget needed for frontier-scale pretraining. Inference for Claude on Amazon Bedrock and direct Anthropic API endpoints draws from a mix of AWS hardware including Trainium and Inferentia generations, with the precise mix not publicly disaggregated by AWS or Anthropic. AWS positions Bedrock as the managed inference layer for Claude on AWS, abstracting the underlying accelerator choice from customers.^[21]^[22]

Customer workloads

AWS has publicly cited a range of external customers using Inferentia for production inference. On Inf2 specifically, AWS has reported case studies including Leonardo.ai claiming an 80% cost reduction with no performance loss for diffusion model inference; Runway reporting 2x higher throughput than comparable GPU instances; Money Forward reporting 10x latency reduction over their previous Inf1 deployment; Dataminr reporting up to 9x better throughput per dollar; and Fileread.ai reporting 33% latency reduction with 50% throughput increase. AWS has also cited Metagenomi reducing large scale protein design costs by up to 56% on Inferentia2. On the original Inf1 platform AWS cited Finch Computing, Autodesk, ByteDance, Anthem (now Elevance Health), Dataminr and Screening Eagle Technologies among production users.^[1]^[3]

How does Inferentia compare to NVIDIA inference GPUs?

Inferentia competes primarily with NVIDIA's GPU based inference offerings on AWS. When Inf1 launched in 2019, the principal point of comparison was the NVIDIA T4 GPU available in EC2 G4 instances, and AWS positioned Inferentia 1 as offering up to 3x higher throughput and up to 40% lower cost per inference for matched workloads. By the time Inf2 launched in 2023 the competitive landscape on AWS included G5 instances with NVIDIA A10G GPUs and P4d/P5 instances with NVIDIA A100 and H100 GPUs. AWS positions Inf2 against these GPUs primarily on cost per inference for generative AI workloads with model sizes between approximately one billion and several hundred billion parameters, where the 32 GiB HBM per chip and the 192 GB/sec NeuronLink interconnect allow large models to be sharded across an instance without leaving NeuronCore memory.^[1]^[2]

Inferentia is also frequently compared, particularly in industry press, with dedicated inference accelerators from other vendors such as the Groq LPU (an inference-focused architecture optimised for low-latency token generation), the Cerebras WSE-3 wafer-scale engine, and modern NVIDIA Blackwell parts such as the NVIDIA B200 (which is positioned for both training and inference at the high end). Each of these architectures targets a somewhat different operating point. Inferentia's distinguishing characteristics are its vertical integration with the AWS cloud, the maturity of the Neuron SDK, and the price/performance profile delivered through reserved capacity inside EC2 rather than as standalone hardware.^[4]

Relationship to AWS Trainium, Trainium2 and Trainium3

Inferentia and Trainium are closely related product lines designed by the same Annapurna Labs engineering organisation and sharing the NeuronCore architectural lineage. The first AWS Trainium chip was announced at re:Invent 2020 and reached general availability in 2022, deployed in Trn1 instances. Like Inferentia2, Trainium uses NeuronCore-v2-derived cores, but with higher per-chip compute, larger HBM, and a significantly more capable interconnect (NeuronLink and the second-generation EFA fabric) for tightly coupled distributed training across hundreds or thousands of chips.^[4]

AWS Trainium2 was announced at re:Invent 2023 and reached general availability at re:Invent 2024, offering an approximately 4x performance improvement over the original Trainium and 96 GB of HBM per chip. Trn2 UltraServers combine 64 Trainium2 chips into a single tightly coupled domain, with up to 83 PFLOPS of dense FP8 compute and 332 PFLOPS of sparse FP8 compute. AWS Trainium3 was announced at re:Invent 2025 as AWS's first 3 nanometer AI chip, with 2.52 PFLOPS of FP8 compute per chip, 144 GB of HBM3e memory, 4.9 TB/sec of bandwidth, and approximately 40% better energy efficiency than Trainium2. Trn3 UltraServers can pack up to 144 Trainium3 chips into a single integrated system with up to 4.4x higher compute than Trainium2 UltraServers.^[4]^[6]
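The UltraServer figures above can be cross-checked against one another. The sketch below multiplies the quoted per-chip Trainium3 figure by the maximum chip count and compares the result with the quoted Trainium2 UltraServer total; it assumes the "4.4x" claim refers to dense FP8 compute, which the sources do not state explicitly, and the constant names are illustrative.

```python
# Figures quoted in the paragraph above.
TRN3_FP8_PFLOPS_PER_CHIP = 2.52
TRN3_CHIPS_PER_ULTRASERVER = 144
TRN2_ULTRASERVER_DENSE_FP8_PFLOPS = 83

# Peak FP8 compute of a fully populated Trn3 UltraServer.
trn3_ultraserver_pflops = TRN3_FP8_PFLOPS_PER_CHIP * TRN3_CHIPS_PER_ULTRASERVER

# Implied generation-over-generation ratio at the UltraServer level.
speedup = trn3_ultraserver_pflops / TRN2_ULTRASERVER_DENSE_FP8_PFLOPS

print(f"{trn3_ultraserver_pflops:.2f} PFLOPS, {speedup:.2f}x")
```

The product comes to roughly 363 PFLOPS, about 4.4 times the 83 PFLOPS quoted for Trn2 UltraServers, so the per-chip and per-system claims are mutually consistent under that assumption.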

A notable trend visible across the 2024 and 2025 re:Invent announcements is the increasing positioning of Trainium silicon as a dual purpose training and inference accelerator. Trainium2 and Trainium3 are explicitly marketed for high volume inference of frontier scale models, in addition to their original training role. Industry commentators have described this as a gradual convergence of the Trainium and Inferentia lines, with Inferentia 2 remaining the workhorse for classic high volume inference workloads while Trainium handles frontier scale inference where the larger memory and more capable interconnect are required.^[5]^[6]

Is Inferentia cheaper than GPUs? Pricing and economics

Inferentia is sold exclusively as part of EC2 instances and managed AWS services rather than as discrete hardware, so its pricing is bundled into hourly instance rates and into Bedrock and SageMaker per token or per hour charges. AWS publishes On Demand instance prices on its EC2 pricing pages; representative Inf2 On Demand list prices in early 2026 include approximately US$0.76 per hour for inf2.xlarge, US$1.97 per hour for inf2.8xlarge, US$6.49 per hour for inf2.24xlarge and US$12.98 per hour for inf2.48xlarge. Substantial discounts are available through one and three year Reserved Instances, Savings Plans and Spot purchasing. Inf1 instances remain available at lower nominal hourly prices, with inf1.xlarge listed at approximately US$0.23 per hour and inf1.24xlarge at approximately US$4.72 per hour in the US East region.^[1]^[2]

The economic argument for Inferentia is built around cost per inference rather than headline hourly rate. AWS summarises the family as delivering "up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances," and AWS and its customers have repeatedly reported per inference cost reductions in the range of 30% to 90% versus equivalent GPU instances, depending on workload, batch size and model architecture. Reported figures include 30% lower cost on Alexa text to speech versus the previous GPU based instances, 40% lower cost per inference on Inf1 versus the EC2 G4 baseline at launch, 80% cost reduction on Leonardo.ai and Finch Computing inference workloads, 85% cost reduction on certain Amazon Search ranking workloads, and 90% lower cost reported by NTT PC Communications. The magnitude of the savings is highly workload dependent, and AWS publishes specific case studies rather than a universal headline number.^[1]^[3]^[18]
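The difference between hourly rate and cost per inference is easy to see with a small calculation. In the sketch below only the inf2.xlarge list price comes from this article; the throughput numbers and the GPU instance price are hypothetical placeholders, not benchmarks, and the function name is illustrative.

```python
def cost_per_million(hourly_usd: float, inferences_per_second: float) -> float:
    """On-Demand cost in USD to serve one million inferences at full load."""
    inferences_per_hour = inferences_per_second * 3600
    return hourly_usd / inferences_per_hour * 1_000_000


# inf2.xlarge list price quoted above; the throughput is a made-up example.
inf2_cost = cost_per_million(0.76, inferences_per_second=500)

# Hypothetical GPU instance: both price and throughput are placeholders.
gpu_cost = cost_per_million(1.00, inferences_per_second=300)

saving = 1 - inf2_cost / gpu_cost
print(f"Inf2: ${inf2_cost:.3f}/M, GPU: ${gpu_cost:.3f}/M, saving: {saving:.0%}")
```

Because cost per inference divides price by sustained throughput, an instance can win on this metric through higher throughput, a lower hourly rate, or both, which is why reported savings vary so widely between workloads.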

For deployments at very large scale, Inferentia's economic advantage compounds with capacity planning. Because Inferentia chips are produced for AWS internal demand rather than allocated through third party distribution channels, AWS has been able to absorb relatively predictable inference capacity into its long term silicon planning, which in turn supports the price commitments offered to customers under reserved capacity and Savings Plans.^[8]

Will there be an Inferentia 3? Future roadmap and Inferentia 3 status

As of May 2026, AWS has not publicly announced a third-generation Inferentia chip or an Inf3 instance family. The current public roadmap from re:Invent 2024 and re:Invent 2025 focuses on Trainium evolution, with Trainium2 generally available, AWS Trainium3 announced at re:Invent 2025 with availability scaling through 2026, and Trainium4 disclosed as in development with promised 6x FP4 throughput, 3x FP8 performance and 4x memory bandwidth relative to Trainium3, plus integration with NVIDIA NVLink Fusion for hybrid Trainium and NVIDIA GPU clusters.^[5]^[6]

Industry analysts have interpreted the absence of an Inferentia3 announcement, combined with the deliberate marketing of Trainium2 and Trainium3 for inference workloads, as a sign that AWS is consolidating its inference roadmap onto Trainium silicon rather than maintaining two parallel chip lines. AWS executives have not publicly committed to retiring the Inferentia brand, and Inf2 instances continue to be expanded into new regions and integrated more deeply with managed services such as Amazon Bedrock and Amazon SageMaker. Whether AWS eventually rebrands future inference focused parts under a successor name, ships an Inferentia3 as a discrete product, or fully converges the lines under Trainium remains an open question as of May 2026.^[5]^[6]

Inferentia's longer term role is therefore likely to evolve from a standalone product line into one of several layers within AWS's broader inference stack. The Neuron SDK already abstracts most differences between the two chip families, and AWS has positioned managed inference services on Bedrock and SageMaker as the customer facing entry point. From a customer perspective, the question of whether a particular workload runs on Inferentia or Trainium is increasingly a backend implementation detail managed by AWS rather than an explicit deployment choice.^[14]^[21]

References

1. Amazon Web Services, "Amazon EC2 Inf2 Instances", https://aws.amazon.com/ec2/instance-types/inf2/. Accessed 2026-05-19.
2. Jeff Barr, "Amazon EC2 Update: Inf1 Instances with AWS Inferentia Chips for High Performance Cost Effective Inferencing", AWS News Blog, 3 December 2019, https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/. Accessed 2026-05-19.
3. Amazon Web Services, "AI Chip: Amazon Inferentia", https://aws.amazon.com/ai/machine-learning/inferentia/. Accessed 2026-05-19.
4. Introl, "Amazon Trainium and Inferentia: Silicon Ecosystem Guide 2025", https://introl.com/blog/aws-trainium-inferentia-silicon-ecosystem-guide-2025. Accessed 2026-05-19.
5. Futurum Group, "AWS re:Invent 2025: Wrestling Back AI Leadership", https://futurumgroup.com/insights/aws-reinvent-2025-wrestling-back-ai-leadership/. Accessed 2026-05-19.
6. Amazon Web Services Builder, "Get the latest on AWS AI Chips from re:Invent 2025", https://builder.aws.com/content/37FzTD9aYF6SIHE421joJaN2Hgv/get-the-latest-on-aws-ai-chips-from-reinvent-2025. Accessed 2026-05-19.
7. Wikipedia, "Annapurna Labs", https://en.wikipedia.org/wiki/Annapurna_Labs. Accessed 2026-05-19.
8. SiliconANGLE, "Amazon's secretive AI weapon: An exclusive look inside AWS' Annapurna Labs chip operation", 27 November 2024, https://siliconangle.com/2024/11/27/amazons-secretive-ai-weapon-exclusive-look-inside-aws-annapurna-labs-chip-operation/. Accessed 2026-05-19.
9. CNBC, "AWS launches Inferentia AI chip", 28 November 2018, https://www.cnbc.com/2018/11/28/aws-launches-inferentia-ai-chip.html. Accessed 2026-05-19.
10. TechCrunch, "AWS announces new Inferentia machine learning chip", 28 November 2018, https://techcrunch.com/2018/11/28/aws-announces-new-inferentia-machine-learning-chip/. Accessed 2026-05-24.
11. AWS Neuron Documentation, "Inferentia Architecture", https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/inferentia.html. Accessed 2026-05-19.
12. Amazon Web Services, "Amazon EC2 Inf1 Instances", https://aws.amazon.com/ec2/instance-types/inf1/. Accessed 2026-05-19.
13. AWS Machine Learning Blog, "AWS Inferentia2 builds on AWS Inferentia1 by delivering 4x higher throughput and 10x lower latency", https://aws.amazon.com/blogs/machine-learning/aws-inferentia2-builds-on-aws-inferentia1-by-delivering-4x-higher-throughput-and-10x-lower-latency/. Accessed 2026-05-19.
14. Amazon Web Services, "SDK for Gen AI and Deep Learning: AWS Neuron", https://aws.amazon.com/ai/machine-learning/neuron/. Accessed 2026-05-19.
15. AWS Neuron Documentation, "What is AWS Neuron?", https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/what-is-neuron.html. Accessed 2026-05-19.
16. Amazon Web Services, "AWS announces Amazon EC2 Inf2 instances (Preview)", November 2022, https://aws.amazon.com/about-aws/whats-new/2022/11/aws-announces-amazon-ec2-inf2-instances-preview/. Accessed 2026-05-19.
17. Antje Barth, "Amazon EC2 Inf2 Instances for Low Cost, High Performance Generative AI Inference are Now Generally Available", AWS News Blog, 13 April 2023, https://aws.amazon.com/blogs/aws/amazon-ec2-inf2-instances-for-low-cost-high-performance-generative-ai-inference-are-now-generally-available/. Accessed 2026-05-19.
18. AWS News Blog, "Majority of Alexa Now Running on Faster, More Cost Effective Amazon EC2 Inf1 Instances", 12 November 2020, https://aws.amazon.com/blogs/aws/majority-of-alexa-now-running-on-faster-more-cost-effective-amazon-ec2-inf1-instances/. Accessed 2026-05-19.
19. Anthropic, "Powering the next generation of AI development with AWS", https://www.anthropic.com/news/anthropic-amazon-trainium. Accessed 2026-05-19.
20. About Amazon, "AWS activates Project Rainier: One of the world's largest AI clusters", https://www.aboutamazon.com/news/aws/aws-project-rainier-ai-trainium-chips-compute-cluster. Accessed 2026-05-19.
21. Amazon Web Services, "Claude by Anthropic: Models in Amazon Bedrock", https://aws.amazon.com/bedrock/anthropic/. Accessed 2026-05-19.
22. TechCrunch, "Anthropic raises another $4B from Amazon, makes AWS its 'primary' training partner", 22 November 2024, https://techcrunch.com/2024/11/22/anthropic-raises-an-additional-4b-from-amazon-makes-aws-its-primary-cloud-partner/. Accessed 2026-05-19.
