AWS Inferentia
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,924 words
AWS Inferentia is a family of custom application specific integrated circuits (ASICs) designed by Amazon Web Services (AWS) for machine learning inference workloads in the cloud. The chips are produced through Annapurna Labs, the Israeli semiconductor design subsidiary that Amazon acquired in January 2015, and they power the Amazon Elastic Compute Cloud (EC2) Inf1 and Inf2 instance families. First announced at AWS re:Invent in November 2018 and brought to general availability in December 2019, Inferentia became one of the earliest hyperscaler designed inference accelerators to enter wide production deployment. A second generation chip, Inferentia2, was announced at re:Invent 2022 and launched commercially in April 2023 with substantially higher memory capacity and bandwidth aimed at generative AI workloads including large language models and latent diffusion models.[1][2][3]
Inferentia sits within a broader AWS silicon roadmap that includes the AWS Trainium training accelerators, the second generation AWS Trainium2 training chip launched in 2024, and the AWS Trainium3 chip introduced at re:Invent 2025. While Inferentia chips are dedicated to inference, the Trainium line supports both training and increasingly large scale inference, leading industry observers to describe a gradual convergence of the two product families. As of May 2026, AWS has not announced an Inferentia3 chip, and the company appears to be channelling its next generation efforts into Trainium silicon that can serve both workloads.[4][5][6]
Annapurna Labs was founded in 2011 in Israel by Bilic "Billy" Hrvoje, Nafea Bshara and Ronen Boneh, and was named after the Annapurna Massif in the Himalayas. The startup operated in stealth for several years, designing networking and SmartNIC silicon for data center customers, with early backers including Avigdor Willenz, Walden International, Arm Holdings and TSMC. In January 2015 Amazon acquired Annapurna Labs for a reported price of US$350 to US$370 million and folded the team into the AWS organisation, where it became the engineering core for AWS designed silicon.[7]
Following the acquisition, Annapurna Labs developed multiple chip lines for AWS. The first major commercial product was the AWS Nitro System, a combination of hardware offload cards and a lightweight hypervisor first deployed at scale starting in 2017. The team then produced the Arm based Graviton general purpose CPU family, and in 2018 it began work on machine learning ASICs branded as Inferentia (for inference) and later Trainium (for model training). Annapurna Labs maintains a design centre in Austin, Texas alongside its Israeli engineering base, and the operation has been described by AWS leadership and outside observers as the "secret sauce" behind AWS's vertical integration strategy.[7][8]
AWS Inferentia was first publicly announced by then AWS CEO Andy Jassy during his keynote at AWS re:Invent 2018 in Las Vegas on 28 November 2018. Jassy positioned the chip as a low cost, low latency alternative to graphics processing units (GPUs) for machine learning inference, the production phase of deploying trained models. At launch AWS stated the chip would support FP16, BF16 and INT8 numerical formats and would integrate with the TensorFlow, Apache MXNet and PyTorch frameworks, as well as models exported to the ONNX exchange format.[9][10]
The chip itself became generally available roughly one year later when AWS announced the Amazon EC2 Inf1 instance family on 3 December 2019 at re:Invent 2019. Each first generation Inferentia chip contained four NeuronCore-v1 cores and delivered up to 128 INT8 tera operations per second (TOPS) and 64 FP16/BF16 tera floating point operations per second (TFLOPS). The chip carried 8 GiB of off chip DDR4 DRAM with approximately 50 GiB/sec of bandwidth, plus a large on chip cache to reduce external memory traffic. The Inf1 family scaled from a single chip in the inf1.xlarge size up to 16 Inferentia chips in the largest inf1.24xlarge instance, which AWS described as offering over two peta operations per second of aggregate inference throughput.[2][11]
The launch lineup included four instance sizes: inf1.xlarge (1 chip, 4 vCPUs, 8 GiB system memory), inf1.2xlarge (1 chip, 8 vCPUs, 16 GiB), inf1.6xlarge (4 chips, 24 vCPUs, 48 GiB) and inf1.24xlarge (16 chips, 96 vCPUs, 192 GiB). Networking ranged from up to 25 Gbps to 100 Gbps on the largest size. The Inf1 instances paired the Inferentia chips with second generation Intel Xeon Scalable host CPUs and were available initially in the US East (N. Virginia) and US West (Oregon) regions through On Demand, Spot, Reserved Instance and Savings Plan purchasing options.[2]
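The "over two peta operations per second" figure follows directly from the per chip numbers. The sketch below multiplies the quoted per chip peaks by the chip count of each Inf1 size; it assumes perfect linear scaling across chips, which real workloads will not reach, and the function and constant names are illustrative only.

```python
# Per chip peak figures quoted in the text for first generation Inferentia.
INT8_TOPS_PER_CHIP = 128
FP16_TFLOPS_PER_CHIP = 64

# Chip counts per Inf1 size, as listed in the launch lineup.
INF1_CHIPS = {
    "inf1.xlarge": 1,
    "inf1.2xlarge": 1,
    "inf1.6xlarge": 4,
    "inf1.24xlarge": 16,
}


def aggregate_peak(chips: int) -> tuple:
    """Return (INT8 TOPS, FP16/BF16 TFLOPS), assuming ideal linear scaling."""
    return chips * INT8_TOPS_PER_CHIP, chips * FP16_TFLOPS_PER_CHIP


for size, chips in INF1_CHIPS.items():
    tops, tflops = aggregate_peak(chips)
    print(f"{size}: {tops} INT8 TOPS, {tflops} FP16/BF16 TFLOPS")
```

For inf1.24xlarge this gives 2,048 INT8 TOPS, which is the roughly two peta operations per second that AWS cited.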
At launch AWS claimed Inf1 instances delivered up to 3x higher inference throughput and up to 40% lower cost per inference than the company's GPU based EC2 G4 instances, which used NVIDIA T4 GPUs and had previously been the lowest cost inference option on AWS.[2]
The NeuronCore is the basic compute building block of every Inferentia and Trainium chip. The first generation NeuronCore-v1 used in Inferentia1 implements a high performance systolic array based matrix multiplication engine targeted at the dense linear algebra operations dominant in deep learning workloads, particularly convolutions and the attention and feed forward layers of transformer models. Each NeuronCore is paired with a sizeable on chip cache that reduces the frequency of external DRAM accesses, which helps maintain throughput on large models.[11][12]
With Inferentia2, AWS introduced NeuronCore-v2, a substantially redesigned core. Each Inferentia2 chip carries fewer cores (two NeuronCore-v2 cores versus four NeuronCore-v1 cores per Inferentia1 chip), but each core offers far higher compute and richer functionality. NeuronCore-v2 is organised around four independent execution engines: a Tensor engine, a Vector engine, a Scalar engine and a GPSIMD engine.
NeuronCore-v2 also introduced ISA level support for dynamic input shapes, which is important for transformer and large language model inference where sequence lengths vary between requests, along with stochastic rounding and configurable FP8 (cFP8) formats. The improved core architecture, combined with the higher capacity HBM memory subsystem, accounts for most of the headline 4x throughput and 10x latency improvements that AWS attributes to Inferentia2 relative to Inferentia1.[13]
The AWS Neuron Software Development Kit (Neuron SDK) is the open source software stack used to compile, profile, debug and deploy models on Inferentia and Trainium hardware. Neuron sits between high level ML frameworks and the underlying NeuronCore hardware, providing a compiler that lowers framework graphs into Neuron specific intermediate representations and ultimately into NeuronCore instructions, plus a runtime that schedules execution across one or more chips.[14]
Framework integration covers PyTorch (via the torch-neuron and torch-neuronx packages, the latter built on PyTorch XLA), TensorFlow, JAX and Apache MXNet. The Neuron SDK also exposes higher level integrations with libraries widely used in production inference deployments, including the Hugging Face Optimum Neuron project, which bridges Transformers and the Diffusers libraries to Neuron hardware, and the vLLM inference server, which supports continuous batching, expert parallelism, disaggregated inference and speculative decoding on Inferentia and Trainium. AWS additionally provides a lower level kernel programming interface, the Neuron Kernel Interface (NKI), built on MLIR, for advanced users who want to write custom kernels directly against the NeuronCore.[14][15]
Neuron is the same software stack used by both AWS Trainium and Inferentia, which lets developers move models between training and inference accelerators without rewriting deployment code. AWS publishes Neuron releases on GitHub and through public package repositories, and provides Deep Learning Amazon Machine Images (DLAMIs) preconfigured for Inf1, Inf2, Trn1 and Trn2 instances. The SDK also integrates with Amazon SageMaker, Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS) and AWS ParallelCluster for managed and orchestrated deployments.[3][14]
At AWS re:Invent 2022, held in late November and early December 2022 in Las Vegas, AWS announced the preview of Amazon EC2 Inf2 instances powered by a new Inferentia2 chip designed specifically for large generative AI models. The announcement positioned Inferentia2 as a response to the rapid growth in the size of transformer models, where many state of the art systems by 2022 had outgrown the 8 GiB per chip memory budget of the original Inferentia and could no longer be served efficiently on Inf1.[16]
Inferentia2 retained the basic NeuronCore organisation of its predecessor but rebalanced the chip toward larger model capacity. Each Inferentia2 chip integrates two NeuronCore-v2 cores delivering 380 INT8 TOPS and 190 FP16/BF16/cFP8/TF32 TFLOPS, plus 47.5 FP32 TFLOPS. The chip is paired with 32 GiB of on package high bandwidth memory (HBM), four times the capacity of the 8 GiB of DDR4 on Inferentia1, with bandwidth rising approximately 16x to 820 GiB/sec per chip. The chip also exposes 1 TB/sec of direct memory access (DMA) bandwidth with inline compression and decompression for the model parameter pipeline.[13][17]
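The generational ratios can be checked from the per chip figures quoted in this article. The sketch below is a plain arithmetic illustration rather than a benchmark; the `ChipSpec` type and helper names are hypothetical, and the Inferentia1 values (8 GiB, about 50 GiB/sec, 64 FP16/BF16 TFLOPS) are those given earlier for the first generation chip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipSpec:
    name: str
    memory_gib: float        # accelerator memory per chip
    bandwidth_gib_s: float   # memory bandwidth per chip
    fp16_tflops: float       # peak FP16/BF16 compute per chip


INFERENTIA1 = ChipSpec("Inferentia1", memory_gib=8, bandwidth_gib_s=50, fp16_tflops=64)
INFERENTIA2 = ChipSpec("Inferentia2", memory_gib=32, bandwidth_gib_s=820, fp16_tflops=190)


def generation_ratios(old: ChipSpec, new: ChipSpec) -> dict:
    """Ratio of each headline figure between two chip generations."""
    return {
        "memory": new.memory_gib / old.memory_gib,
        "bandwidth": new.bandwidth_gib_s / old.bandwidth_gib_s,
        "compute": new.fp16_tflops / old.fp16_tflops,
    }


def bandwidth_per_tflop(chip: ChipSpec) -> float:
    """GiB/sec of memory bandwidth available per peak TFLOP of compute."""
    return chip.bandwidth_gib_s / chip.fp16_tflops


ratios = generation_ratios(INFERENTIA1, INFERENTIA2)
for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
for chip in (INFERENTIA1, INFERENTIA2):
    print(f"{chip.name}: {bandwidth_per_tflop(chip):.2f} GiB/sec per TFLOP")
```

Memory bandwidth grows by a larger factor than peak compute, so the bandwidth available per unit of compute rises between generations, which is one way to read the "rebalancing" described above.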
Supported data types on Inferentia2 expanded to cover FP32, TF32, BF16, FP16, INT8, UINT8 and the new configurable FP8 (cFP8) format, the latter of which AWS positioned as particularly useful for large language model inference because of its reduced memory footprint and I/O cost without the accuracy loss often seen with INT8 post training quantisation.[13]
The chips communicate with one another through NeuronLink-v2, a dedicated chip to chip interconnect that provides 192 GB/sec of bidirectional bandwidth on Inf2 instances. NeuronLink-v2 supports collective communications operators such as all reduce, which enables distributed inference where a single large model is sharded across multiple Inferentia2 chips in the same instance. AWS demonstrated that a 175 billion parameter transformer model can be served for inference across multiple Inferentia2 chips within a single inf2.48xlarge instance, a milestone aimed squarely at deployment of GPT class large language models.[13][16]
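The role of all reduce in sharded inference can be illustrated with a toy example. The sketch below is ordinary Python, not Neuron SDK code: it splits the inner dimension of a matrix-vector product across hypothetical "chips", lets each compute a partial result, and then sums the partials, which is the operation an all reduce collective performs over the interconnect. All function names are illustrative.

```python
def matvec(weights, x):
    """Dense matrix-vector product; weights is rows x cols, x has cols entries."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def shard_columns(weights, x, num_shards):
    """Split the inner (column) dimension into contiguous shards, one per chip."""
    cols = len(x)
    step = -(-cols // num_shards)  # ceiling division
    shards = []
    for start in range(0, cols, step):
        w_part = [row[start:start + step] for row in weights]
        x_part = x[start:start + step]
        shards.append((w_part, x_part))
    return shards


def all_reduce_sum(partials):
    """Element-wise sum of every shard's partial output vector."""
    return [sum(values) for values in zip(*partials)]


weights = [[1, 2, 3, 4],
           [5, 6, 7, 8]]
x = [1, 0, 2, 1]

# Each shard stands in for one chip holding a slice of the weights.
partials = [matvec(w_part, x_part)
            for w_part, x_part in shard_columns(weights, x, num_shards=2)]
sharded = all_reduce_sum(partials)
reference = matvec(weights, x)
print(partials, sharded, reference)
```

Each shard holds only part of the weight matrix, so no single device needs the whole layer in memory; the cost is the communication step, which is why interconnect bandwidth matters for models sharded this way.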
The Inf2 instance family reached general availability on 13 April 2023, initially in the US East (Ohio) and US East (N. Virginia) regions, with subsequent expansion to additional regions. AWS offers four Inf2 sizes:
| Instance | Inferentia2 chips | NeuronCore-v2 cores | Accelerator HBM | vCPUs | System memory | Network bandwidth |
|---|---|---|---|---|---|---|
| inf2.xlarge | 1 | 2 | 32 GiB | 4 | 16 GiB | up to 15 Gbps |
| inf2.8xlarge | 1 | 2 | 32 GiB | 32 | 128 GiB | up to 25 Gbps |
| inf2.24xlarge | 6 | 12 | 192 GiB | 96 | 384 GiB | 50 Gbps |
| inf2.48xlarge | 12 | 24 | 384 GiB | 192 | 768 GiB | 100 Gbps |
The largest inf2.48xlarge size delivers an aggregate 2.3 petaFLOPS of BF16/FP16 dense compute, 384 GiB of pooled HBM accelerator memory and 9.8 TB/sec of total memory bandwidth across the twelve chips, with the chips wired together through NeuronLink-v2 for distributed inference of very large models.[1][17]
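The aggregate figures, and the claim that a 175 billion parameter model fits in one instance, can be sanity checked with a short calculation. The sketch below counts weight storage only and ignores activations, KV cache and runtime overhead, so it gives a lower bound on memory need rather than a deployment guide; the helper names are hypothetical.

```python
import math

# Per chip Inferentia2 figures quoted in the text.
HBM_GIB_PER_CHIP = 32
BANDWIDTH_GIB_S_PER_CHIP = 820
FP16_TFLOPS_PER_CHIP = 190

# Chip counts per Inf2 size, from the table above.
INF2_CHIPS = {
    "inf2.xlarge": 1,
    "inf2.8xlarge": 1,
    "inf2.24xlarge": 6,
    "inf2.48xlarge": 12,
}


def aggregate(chips: int) -> dict:
    """Pooled figures for an instance, assuming ideal linear scaling."""
    return {
        "hbm_gib": chips * HBM_GIB_PER_CHIP,
        "bandwidth_gib_s": chips * BANDWIDTH_GIB_S_PER_CHIP,
        "tflops": chips * FP16_TFLOPS_PER_CHIP,
    }


def weight_footprint_gib(params: float, bytes_per_param: int) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return params * bytes_per_param / 2**30


def min_chips_for_weights(params: float, bytes_per_param: int) -> int:
    """Smallest chip count whose pooled HBM holds the weights."""
    return math.ceil(weight_footprint_gib(params, bytes_per_param) / HBM_GIB_PER_CHIP)


largest = aggregate(INF2_CHIPS["inf2.48xlarge"])
print(largest)
print(f"175B params at 2 bytes each: {weight_footprint_gib(175e9, 2):.0f} GiB, "
      f"at least {min_chips_for_weights(175e9, 2)} chips")
```

At two bytes per parameter (BF16 or FP16), 175 billion parameters occupy about 326 GiB, which fits within the 384 GiB pooled across the twelve chips of inf2.48xlarge but leaves limited headroom for everything else.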
Compared to the prior generation Inf1 instances, AWS reports Inf2 delivers up to 4x higher throughput, up to 10x lower latency and up to 50% better performance per watt on representative deep learning inference workloads. Inf2 is available through On Demand, Reserved Instance, Spot and Savings Plan purchasing models, and is supported as a managed endpoint in Amazon SageMaker as well as in EKS and ECS.[1]
The largest publicly disclosed Inferentia 1 deployment is the Amazon Alexa voice assistant. On 12 November 2020 AWS announced that the Alexa team had migrated the majority of its GPU based machine learning inference workloads, with a focus on the text to speech (TTS) component of the Alexa pipeline, to Inf1 instances. AWS reported that the migration produced approximately 25% lower end to end latency and approximately 30% lower cost compared with the previous GPU based instances for Alexa TTS workloads. The text to speech improvements directly affect the perceived responsiveness of the voice assistant on more than 100 million Alexa enabled devices worldwide.[18]
Amazon Search has been an early adopter of both Inferentia generations, with AWS reporting up to 85% cost reduction on certain ranking and retrieval models when migrated from GPU based instances to Inferentia. Other internal Amazon teams have used Inferentia for personalisation, fraud detection and recommendation workloads where high throughput at low cost per inference is more important than absolute peak per query latency.[3]
Anthropic, the developer of the Claude family of large language models, has progressively deepened its partnership with AWS, which by late 2024 had invested a cumulative US$8 billion into Anthropic and signed a strategic compute agreement under which Anthropic adopted AWS as its primary training partner. In November 2024 the companies announced that Anthropic would co-develop future generations of AWS Trainium silicon, contributing low level kernels and improvements to the Neuron SDK. Project Rainier, the resulting compute cluster, was activated in 2025 and ultimately contains hundreds of thousands of Trainium2 chips spread across multiple US data centres including a large site in Indiana, with capacity scaling toward multiple gigawatts of power draw under the broader Amazon Anthropic agreement.[19][20]
It is important to distinguish how the Inferentia and Trainium lines are used in this partnership. Training of Anthropic's Claude models, including Claude Opus 4.7 and earlier Claude generations, runs primarily on AWS Trainium2 (and successor Trainium silicon), not on Inferentia. Inferentia is designed for inference workloads only and lacks the cluster scale interconnect and per chip FLOP budget needed for frontier scale pretraining. Inference for Claude on Amazon Bedrock and direct Anthropic API endpoints draws from a mix of AWS hardware including Trainium and Inferentia generations, with the precise mix not publicly disaggregated by AWS or Anthropic. AWS positions Bedrock as the managed inference layer for Claude on AWS, abstracting the underlying accelerator choice from customers.[21][22]
AWS has publicly cited a range of external customers using Inferentia for production inference. On Inf2 specifically, AWS has reported case studies including Leonardo.ai claiming an 80% cost reduction with no performance loss for diffusion model inference; Runway reporting 2x higher throughput than comparable GPU instances; Money Forward reporting 10x latency reduction over their previous Inf1 deployment; and Fileread.ai reporting 33% latency reduction with 50% throughput increase. On the original Inf1 platform AWS cited Finch Computing, Autodesk, ByteDance, Anthem (now Elevance Health), Dataminr and Screening Eagle Technologies among production users.[1][3]
Inferentia competes primarily with NVIDIA's GPU based inference offerings on AWS. When Inf1 launched in 2019, the principal point of comparison was the NVIDIA T4 GPU available in EC2 G4 instances, and AWS positioned Inferentia 1 as offering up to 3x higher throughput and up to 40% lower cost per inference for matched workloads. By the time Inf2 launched in 2023 the competitive landscape on AWS included G5 instances with NVIDIA A10G GPUs and P4d/P5 instances with NVIDIA A100 and H100 GPUs. AWS positions Inf2 against these GPUs primarily on cost per inference for generative AI workloads with model sizes between approximately one billion and several hundred billion parameters, where the 32 GiB HBM per chip and the 192 GB/sec NeuronLink interconnect allow large models to be sharded across an instance without leaving NeuronCore memory.[1][2]
Inferentia is also frequently compared, particularly in industry press, with dedicated inference accelerators from other vendors such as the Groq LPU (an inference focused architecture optimised for low latency token generation), the Cerebras WSE-3 wafer scale engine, and modern NVIDIA Blackwell parts such as the NVIDIA B200 (which is positioned for both training and inference at the high end). Each of these architectures targets a somewhat different operating point. Inferentia's distinguishing characteristics are its vertical integration with the AWS cloud, the maturity of the Neuron SDK, and the price/performance profile delivered through reserved capacity inside EC2 rather than as standalone hardware.[4]
Inferentia and Trainium are closely related product lines designed by the same Annapurna Labs engineering organisation and sharing the NeuronCore architectural lineage. The first AWS Trainium chip was announced at re:Invent 2020 and reached general availability in 2022, deployed in Trn1 instances. Like Inferentia, Trainium uses NeuronCore-v2 derived cores, but with higher per chip compute, larger HBM, and significantly more capable interconnect (NeuronLink and the second generation EFA fabric) for tightly coupled distributed training across hundreds or thousands of chips.[4]
AWS Trainium2 was announced at re:Invent 2023 and reached general availability at re:Invent 2024, offering approximately 4x performance improvement over the original Trainium and 96 GB of HBM per chip. Trn2 UltraServers combine 64 Trainium2 chips into a single tightly coupled domain, with up to 83 PFLOPS of dense FP8 compute and 332 PFLOPS of sparse FP8 compute. AWS Trainium3 was announced at re:Invent 2025 as AWS's first 3 nanometer AI chip, with 2.52 PFLOPS of FP8 compute per chip, 144 GB of HBM3e memory, 4.9 TB/sec of bandwidth, and approximately 40% better energy efficiency than Trainium2. Trn3 UltraServers can pack up to 144 Trainium3 chips into a single integrated system with up to 4.4x higher compute than Trainium2 UltraServers.[4][6]
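The "up to 4.4x" UltraServer comparison is consistent with the per chip figures quoted above, as the following arithmetic sketch shows. It uses only numbers stated in this paragraph, assumes ideal linear scaling across chips, and uses illustrative variable names.

```python
# Figures quoted in the text.
TRN3_FP8_PFLOPS_PER_CHIP = 2.52
TRN3_HBM_GB_PER_CHIP = 144
TRN3_ULTRASERVER_CHIPS = 144
TRN2_ULTRASERVER_DENSE_FP8_PFLOPS = 83

# Aggregate Trn3 UltraServer peak, assuming ideal linear scaling.
trn3_ultraserver_pflops = TRN3_FP8_PFLOPS_PER_CHIP * TRN3_ULTRASERVER_CHIPS
trn3_ultraserver_hbm_tb = TRN3_HBM_GB_PER_CHIP * TRN3_ULTRASERVER_CHIPS / 1000

# Generation over generation compute ratio at the UltraServer level.
compute_ratio = trn3_ultraserver_pflops / TRN2_ULTRASERVER_DENSE_FP8_PFLOPS

print(f"Trn3 UltraServer: {trn3_ultraserver_pflops:.1f} PFLOPS FP8, "
      f"{trn3_ultraserver_hbm_tb:.1f} TB HBM3e")
print(f"Ratio to Trn2 UltraServer dense FP8: {compute_ratio:.1f}x")
```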
A notable trend visible across the 2024 and 2025 re:Invent announcements is the increasing positioning of Trainium silicon as a dual purpose training and inference accelerator. Trainium2 and Trainium3 are explicitly marketed for high volume inference of frontier scale models, in addition to their original training role. Industry commentators have described this as a gradual convergence of the Trainium and Inferentia lines, with Inferentia 2 remaining the workhorse for classic high volume inference workloads while Trainium handles frontier scale inference where the larger memory and more capable interconnect are required.[5][6]
Inferentia is sold exclusively as part of EC2 instances and managed AWS services rather than as discrete hardware, so its pricing is bundled into hourly instance rates and into Bedrock and SageMaker per token or per hour charges. AWS publishes On Demand instance prices on its EC2 pricing pages; representative Inf2 On Demand list prices in early 2026 include approximately US$0.76 per hour for inf2.xlarge, US$1.97 per hour for inf2.8xlarge, US$6.49 per hour for inf2.24xlarge and US$12.98 per hour for inf2.48xlarge. Substantial discounts are available through one and three year Reserved Instances, Savings Plans and Spot purchasing. Inf1 instances remain available at lower nominal hourly prices, with inf1.xlarge listed at approximately US$0.23 per hour and inf1.24xlarge at approximately US$4.72 per hour in the US East region.[1][2]
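Because the instance sizes bundle different amounts of host CPU and memory with each accelerator, the hourly price per Inferentia2 chip varies by size. The sketch below divides the approximate list prices quoted above by the chip counts from the Inf2 table; the prices are the rounded figures in this article, not live values from the AWS pricing pages.

```python
# Approximate On Demand list prices quoted in the text (US$ per hour).
INF2_USD_PER_HOUR = {
    "inf2.xlarge": 0.76,
    "inf2.8xlarge": 1.97,
    "inf2.24xlarge": 6.49,
    "inf2.48xlarge": 12.98,
}

# Inferentia2 chips per instance size, from the Inf2 table.
INF2_CHIPS = {
    "inf2.xlarge": 1,
    "inf2.8xlarge": 1,
    "inf2.24xlarge": 6,
    "inf2.48xlarge": 12,
}

price_per_chip_hour = {
    size: INF2_USD_PER_HOUR[size] / INF2_CHIPS[size]
    for size in INF2_USD_PER_HOUR
}

for size, price in price_per_chip_hour.items():
    print(f"{size}: US${price:.2f} per chip-hour")
```

The single chip inf2.8xlarge costs far more per chip than inf2.xlarge because it carries eight times the vCPUs and system memory, so the right size depends on how much host side pre- and post-processing a workload needs.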
The economic argument for Inferentia is built around cost per inference rather than headline hourly rate. AWS and its customers have repeatedly reported per inference cost reductions in the range of 30% to 90% versus equivalent GPU instances, depending on workload, batch size and model architecture. Reported figures include 30% lower cost on Alexa text to speech versus the previous GPU based instances, 40% lower cost per inference on Inf1 versus the EC2 G4 baseline at launch, 80% cost reduction on Leonardo.ai and Finch Computing inference workloads, 85% cost reduction on certain Amazon Search ranking workloads, and 90% cost reduction reported by NTT PC Communications. The magnitude of the savings is highly workload dependent, and AWS publishes specific case studies rather than a universal headline number.[1][3][18]
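Cost per inference combines the hourly price with sustained throughput, which is why an instance with a higher or lower hourly rate can still come out ahead or behind. The sketch below shows the calculation; the baseline price and both throughput figures are hypothetical placeholders chosen for illustration, not measured results, and only the US$0.23 hourly rate comes from this article.

```python
def cost_per_million(hourly_price_usd: float, throughput_per_sec: float) -> float:
    """US$ per one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000


def relative_saving(baseline_cost: float, candidate_cost: float) -> float:
    """Fractional cost reduction of the candidate versus the baseline."""
    return 1 - candidate_cost / baseline_cost


# Hypothetical baseline GPU instance: US$0.50 per hour at 1,000 inferences/sec.
baseline = cost_per_million(0.50, 1000)

# inf1.xlarge list price from the text, with a hypothetical 800 inferences/sec.
candidate = cost_per_million(0.23, 800)

print(f"baseline:  US${baseline:.4f} per million inferences")
print(f"candidate: US${candidate:.4f} per million inferences")
print(f"saving:    {relative_saving(baseline, candidate):.1%}")
```

The same function makes the workload dependence explicit: halving the achieved throughput doubles the cost per inference, regardless of the hourly rate.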
For deployments at very large scale, Inferentia's economic advantage compounds with capacity planning. Because Inferentia chips are produced for AWS internal demand rather than allocated through third party distribution channels, AWS has been able to absorb relatively predictable inference capacity into its long term silicon planning, which in turn supports the price commitments offered to customers under reserved capacity and Savings Plans.[8]
As of May 2026, AWS has not publicly announced a third generation Inferentia chip or an Inf3 instance family. The current public roadmap from re:Invent 2024 and re:Invent 2025 focuses on Trainium evolution, with Trainium2 generally available, AWS Trainium3 announced at re:Invent 2025 with availability scaling through 2026, and Trainium4 disclosed as in development with promised 6x FP4 throughput, 3x FP8 performance and 4x memory bandwidth relative to Trainium3, plus integration with NVIDIA NVLink Fusion for hybrid Trainium and NVIDIA GPU clusters.[5][6]
Industry analysts have interpreted the absence of an Inferentia3 announcement, combined with the deliberate marketing of Trainium2 and Trainium3 for inference workloads, as a sign that AWS is consolidating its inference roadmap onto Trainium silicon rather than maintaining two parallel chip lines. AWS executives have not publicly committed to retiring the Inferentia brand, and Inf2 instances continue to be expanded into new regions and integrated more deeply with managed services such as Amazon Bedrock and Amazon SageMaker. Whether AWS eventually rebrands future inference focused parts under a successor name, ships an Inferentia3 as a discrete product, or fully converges the lines under Trainium remains an open question as of May 2026.[5][6]
Inferentia's longer term role is therefore likely to evolve from a standalone product line into one of several layers within AWS's broader inference stack. The Neuron SDK already abstracts most differences between the two chip families, and AWS has positioned managed inference services on Bedrock and SageMaker as the customer facing entry point. From a customer perspective, the question of whether a particular workload runs on Inferentia or Trainium is increasingly a backend implementation detail managed by AWS rather than an explicit deployment choice.[14][21]