Intel Gaudi 3
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,570 words
Intel Gaudi 3 is a deep-learning accelerator designed by Habana Labs, a subsidiary of Intel, and the third major generation of the Gaudi product line that began at the Israeli startup acquired by Intel in 2019.[^1][^2] The chip was publicly introduced on April 9, 2024 during the Intel Vision 2024 customer event in Phoenix, Arizona, and was positioned by Intel as a cost-competitive alternative to Nvidia's H100 data-center GPU for training and inference of large language models.[^1][^3] Manufactured on TSMC's N5 (5 nm) process and built as a dual-die package, Gaudi 3 combines 8 matrix multiplication engines, 64 programmable tensor processor cores, 128 GB of HBM2e memory and 24 on-package 200 Gb Ethernet links for scale-out, delivering an Intel-claimed 1,835 TFLOPS of dense compute in either FP8 or BF16.[^4][^5][^6] General availability of the air-cooled OAM 2.0 form factor (HL-325L) began in the third quarter of 2024, and the product was formally launched alongside Xeon 6 P-core CPUs on September 24, 2024.[^7][^8]
Despite favourable price-performance positioning against H100, Gaudi 3 fell well short of Intel's commercial targets: then-CEO Pat Gelsinger conceded in October 2024 that the company would miss its $500 million 2024 revenue goal for the Gaudi portfolio, blaming software maturity and the product transition from Gaudi 2.[^9][^10] After Gelsinger's forced departure on December 1, 2024, Intel announced on January 30, 2025, under its interim co-CEOs, that the planned XPU successor Falcon Shores would not be brought to market, and confirmed that Gaudi 3 would be the last standalone Gaudi-branded accelerator before a pivot to the rack-scale Jaguar Shores system planned for 2026; Lip-Bu Tan was appointed CEO in March 2025.[^11][^12][^13] As of 2025, Gaudi 3 is deployed commercially on IBM Cloud, in Dell PowerEdge XE9680 servers, in Inflection AI's enterprise systems, and through Intel's Tiber AI Cloud, with reduced 2025 shipment targets and an archived open-source SynapseAI driver indicating end-of-line status.[^14][^15][^16][^17][^18]
Habana Labs was founded in 2016 in Tel Aviv, Israel, and unveiled its first deep-learning training processor, "Gaudi," in mid-2019.[^19] Intel acquired Habana for approximately $2 billion in cash, with the deal announced on December 16, 2019, in what then-CEO Bob Swan described as a move to accelerate Intel's AI strategy and provide its data-center customers with a discrete AI accelerator complementary to Xeon CPUs.[^19][^20] Following the acquisition, Habana continued to operate as an independent Intel business unit reporting through Intel's Data Center Group, and Intel formally wound down its earlier Nervana-branded training (NNP-T1000) and inference (NNP-I1000) accelerators in early 2020 in favour of the Habana roadmap.[^20]
The first-generation Gaudi (HL-2000), built on a 16 nm process, targeted training workloads and shipped as the HLS-1 server with eight accelerators, but it gained only limited cloud and hyperscaler traction; its most visible public deployment was as the back-end for Amazon Web Services' DL1 instances launched in 2021.[^21] Gaudi 2, announced in May 2022, moved to TSMC's 7 nm process, tripled HBM capacity to 96 GB of HBM2e, tripled the tensor processor core count to 24 TPCs alongside two matrix multiplication engines (MMEs), and expanded the integrated RoCE networking to 24 on-package ports of 100 GbE.[^21][^22] Gaudi 2 became Intel's only benchmarked alternative to Nvidia's H100 in MLPerf Training v3.1 and v4.0 submissions, where Intel reported a Llama 2 70B LoRA fine-tuning time-to-train of 78.1 minutes on eight Gaudi 2 accelerators.[^23][^24]
By early 2024, Intel was attempting to break Nvidia's dominance in AI accelerators at the same time as it was pouring capital into its IDM 2.0 foundry strategy. CEO Pat Gelsinger publicly told investors that the addressable market for AI accelerators would reach roughly $24 billion by 2024, and Gaudi 3 was central to Intel's strategy to capture a meaningful share of that spend ahead of the more ambitious "Falcon Shores" XPU originally roadmapped for 2025.[^25][^26] Internally, Habana had already begun a multi-year die-shrink and architectural-rebalance program to keep pace with the H100, with a target of 4x BF16 throughput over Gaudi 2 and parity in process node with Nvidia (TSMC N5 versus Nvidia's TSMC 4N).[^5][^6]
Gaudi 3 is fabricated on TSMC's N5 process and is the first Gaudi part to use a dual-die package: two identical compute dies, each described by Intel as a mirror image of the other, are bonded together via a high-bandwidth on-package interconnect so that the package presents as a single logical accelerator to system software.[^5][^6] Each die contains a central 48 MB cache region, four MMEs and 32 TPCs; combined, the package therefore exposes 64 TPCs, 8 MMEs and 96 MB of on-die SRAM cache.[^4][^5][^6] Intel's white-paper terminology describes the design as a "heterogeneous compute engine," consistent with Habana's prior architectural philosophy of pairing fixed-function matrix engines with general-purpose tensor cores rather than building one large vector array.[^4]
The MMEs are wide, fixed-function dense-matrix units designed to execute the dot-product cores of transformer attention and feed-forward layers at high efficiency. According to Intel's white paper and subsequent IEEE Spectrum coverage, each MME is 512 bits wide and supports FP8, BF16, FP16, FP32 and TF32 precisions, and the eight engines together provide the bulk of Gaudi 3's 1,835 TFLOPS of dense FP8 / BF16 throughput.[^4][^6] Compared with Gaudi 2's two MMEs, Gaudi 3 quadruples MME count, which underpins Intel's claim of 4x BF16 compute over its predecessor.[^4][^22]
The TPCs are fully programmable VLIW-SIMD cores that handle activation functions, normalisation, element-wise operations, custom kernels and any computation that does not fit cleanly into the matrix-multiply pattern.[^4] Gaudi 3 ships with 64 TPCs (32 per die), up from 24 in Gaudi 2, and the TPC ISA is exposed to developers through Intel's TPC-C language and the SynapseAI runtime, allowing user-written kernels for new operators that have not yet been added to high-level frameworks.[^4][^27]
The package mounts eight 16 GB HBM2e stacks alongside the two compute dies, yielding 128 GB of unified high-bandwidth memory addressable across the whole package; Intel rates aggregate HBM bandwidth at 3.7 TB/s, up from 2.4 TB/s on Gaudi 2.[^4][^5][^6] Together with the 96 MB of on-die SRAM split across the dies, Gaudi 3 carries 33 % more HBM capacity than Gaudi 2 and 60 % more than the H100 SXM5 (80 GB).[^4][^6] Intel deliberately chose HBM2e rather than HBM3 or HBM3e — a decision analyst Sally Ward-Foxton and others attributed to bill-of-materials price and supply considerations, since HBM2e is markedly cheaper than the HBM3e used in H200 and Nvidia's Blackwell family.[^6][^28]
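The capacity and bandwidth comparisons above follow from simple ratios. The Python sketch below reproduces them from the figures quoted in this section; the variable and function names are illustrative only and do not come from any Intel tool.

```python
# Figures as quoted in the text; names are illustrative only.
HBM_STACKS = 8
GB_PER_STACK = 16
GAUDI2_HBM_GB = 96
H100_SXM5_HBM_GB = 80
GAUDI3_BW_TBPS = 3.7
GAUDI2_BW_TBPS = 2.4


def percent_more(new: float, old: float) -> float:
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


gaudi3_hbm_gb = HBM_STACKS * GB_PER_STACK
vs_gaudi2 = percent_more(gaudi3_hbm_gb, GAUDI2_HBM_GB)
vs_h100 = percent_more(gaudi3_hbm_gb, H100_SXM5_HBM_GB)
bw_gain = percent_more(GAUDI3_BW_TBPS, GAUDI2_BW_TBPS)

print(f"Gaudi 3 HBM capacity: {gaudi3_hbm_gb} GB")
print(f"vs Gaudi 2 capacity:  +{vs_gaudi2:.1f} %")
print(f"vs H100 SXM5:         +{vs_h100:.1f} %")
print(f"vs Gaudi 2 bandwidth: +{bw_gain:.1f} %")
```

With these inputs the capacity ratios round to the 33 % and 60 % quoted above, and the two bandwidth figures imply a gain of roughly 54 % over Gaudi 2.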
The 128 GB HBM2e pool is sized so that a 70-billion-parameter Llama 2 model in FP8 weights fits within a single accelerator with substantial room left over for KV-cache, an explicit Gaudi 3 positioning point: as Intel and several reviewers noted, an H100's 80 GB cannot serve a 70B FP8 model comfortably with meaningful KV cache, whereas Gaudi 3's larger pool can.[^28][^29] Within the package, the 48 MB last-level cache per die is software-managed and shared between MMEs and TPCs.[^4][^6]
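The memory-fit argument can be made concrete with a rough estimate. The sketch below is an illustration under stated assumptions, not Intel's sizing method: it takes the publicly documented Llama 2 70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128; these values are not given in this article), stores weights at one byte per parameter (FP8), keeps the KV cache in 16-bit values, and ignores activations, workspace buffers and runtime overhead. All function names are hypothetical.

```python
GB = 10**9  # decimal gigabytes, matching how HBM capacity is marketed


def weight_bytes(params: int, bytes_per_param: int) -> int:
    """Memory needed to hold the model weights."""
    return params * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """KV-cache bytes per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_budget(hbm_gb: int, weights: int, per_token: int) -> int:
    """Upper bound on cached tokens once the weights are resident."""
    free = hbm_gb * GB - weights
    return max(free, 0) // per_token


# Assumed Llama 2 70B shape (from the public model config, not this article).
weights = weight_bytes(params=70 * 10**9, bytes_per_param=1)   # FP8 weights
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)              # 16-bit cache

gaudi3_tokens = kv_token_budget(128, weights, per_token)
h100_tokens = kv_token_budget(80, weights, per_token)

print(f"Weights:            {weights / GB:.0f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Gaudi 3 (128 GB):   up to {gaudi3_tokens:,} cached tokens")
print(f"H100 (80 GB):       up to {h100_tokens:,} cached tokens")
```

Under these assumptions the 128 GB pool leaves about 58 GB for the cache against about 10 GB on an 80 GB part, a difference of nearly six times in the number of tokens that can be held across all concurrent requests. Real deployments reserve additional memory, so both figures are upper bounds.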
A defining feature of every Gaudi generation has been the integration of RDMA-over-converged-Ethernet (RoCE v2) NICs directly on the accelerator die, eliminating the need for external InfiniBand or Nvidia-style proprietary scale-out fabric. Gaudi 3 integrates 24 ports of 200 Gb Ethernet per accelerator, for a peak 4.8 Tbps of egress bandwidth per chip (9.6 Tbps bi-directional), twice the per-chip networking bandwidth of Gaudi 2 (which had 24 ports of 100 GbE).[^1][^4][^14] On a standard eight-accelerator OAM baseboard, 21 ports per chip are used as an all-to-all internal mesh, while the remaining three ports per chip are exposed externally through QSFP-DD cages for scale-out beyond the chassis.[^14] Intel claims that this architecture allows clusters of up to 8,192 accelerators (1,024 nodes) using standard Ethernet switches and no proprietary interconnect, a marketing point repeatedly emphasised relative to Nvidia's NVLink/NVSwitch ecosystem.[^14][^4]
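The port budget described above can be checked with a short calculation. The sketch below uses only the figures quoted in this section; the even split of the 21 internal ports across the seven peer accelerators is an inference from those numbers rather than a statement from Intel's documentation, and all names are illustrative.

```python
# Figures as quoted in the text; names are illustrative only.
PORTS_PER_CHIP = 24
PORT_GBPS = 200
INTERNAL_PORTS = 21        # ports used for the on-baseboard mesh
CHIPS_PER_NODE = 8
MAX_CLUSTER_CHIPS = 8192

# Per-chip Ethernet bandwidth.
egress_tbps = PORTS_PER_CHIP * PORT_GBPS / 1000
bidir_tbps = 2 * egress_tbps

# All-to-all mesh: each chip connects to every other chip on the baseboard.
peers = CHIPS_PER_NODE - 1
links_per_peer, leftover = divmod(INTERNAL_PORTS, peers)  # assumes an even split

# Ports left over for scale-out, per chip and per eight-chip node.
external_ports = PORTS_PER_CHIP - INTERNAL_PORTS
node_scaleout_tbps = CHIPS_PER_NODE * external_ports * PORT_GBPS / 1000

nodes = MAX_CLUSTER_CHIPS // CHIPS_PER_NODE

print(f"Per-chip egress: {egress_tbps} Tbps ({bidir_tbps} Tbps bi-directional)")
print(f"Mesh: {links_per_peer} links to each of {peers} peers, {leftover} unused")
print(f"Scale-out: {external_ports} ports per chip, "
      f"{node_scaleout_tbps} Tbps egress per node")
print(f"Largest quoted cluster: {nodes} nodes")
```

If the split is indeed even, each pair of accelerators on a baseboard is joined by three 200 GbE links (600 Gb/s in each direction), and each eight-chip node has 24 external ports for scale-out.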
The primary deployment form factor is the HL-325L, an OCP Open Accelerator Module v2.0 (OAM 2.0) compliant mezzanine card. The HL-325L carries the full 128 GB of HBM2e, 96 MB of SRAM, all 8 MMEs, all 64 TPCs and all 24 ports of 200 GbE, and is rated at a card-level TDP of up to 900 W for the air-cooled variant.[^30] Eight HL-325L modules are typically integrated onto an HLB-325 universal baseboard (UBB) to form a standard Gaudi 3 server building block; the all-to-all 200 GbE mesh between the eight modules is hosted on the UBB.[^30][^4]
A second form factor, the HL-338, is a full-height, dual-slot PCIe Gen 5 x16 CEM card with the same compute and memory specifications but a reduced 600 W TDP, targeted at inference and fine-tuning in standard servers rather than training in dedicated AI chassis.[^31] Intel positioned the HL-338 for general availability in late 2024 / first half of 2025.[^14][^31]
In response to U.S. export controls on advanced AI accelerators sold into China, Intel disclosed in April 2024 two China-only variants — the HL-328 (OAM) and HL-388 (PCIe) — both retaining the 128 GB HBM2e capacity and 3.7 TB/s bandwidth but with substantially reduced compute throughput to remain below the U.S. Department of Commerce's total-processing-performance thresholds.[^32][^33] Reporting placed the performance reduction at roughly 92 % versus the unconstrained part, and Intel set their availability for September 2024.[^32][^33]
Gaudi 3 is supported by Intel's Gaudi software stack, the rebranded successor to Habana's SynapseAI suite. The stack provides a graph compiler, runtime, communication libraries, the TPC-C kernel language and integrations with PyTorch (via the habana_frameworks bridge), Hugging Face Transformers (via the Optimum-Habana library), and DeepSpeed (via Habana DeepSpeed).[^27][^34][^35]
The Optimum-Habana open-source library, developed jointly by Hugging Face and Habana, allows Transformers and Diffusers models to run on Gaudi 2 and Gaudi 3 with minimal code changes; Hugging Face's documentation lists official support for BERT, ALBERT, DistilBERT, RoBERTa, T5, GPT-2, Llama 2, Vision Transformer, Swin, wav2vec2 and Stable Diffusion among others.[^34][^35] Intel and Hugging Face have claimed that the over 400,000 Transformers-compatible model checkpoints on the Hub can in principle be enabled on Gaudi with the library.[^34]
Despite this ecosystem, software was the most cited barrier to Gaudi 3 adoption in 2024. In Intel's Q3 2024 earnings call Pat Gelsinger said Gaudi 3 uptake had been "slower than we anticipated" and explicitly attributed the shortfall to software ease-of-use issues, particularly the friction of porting CUDA-optimised code paths to the Gaudi runtime.[^9][^10] Independent analysts including SemiAnalysis and The Register echoed that the Gaudi software stack was less mature than Nvidia's CUDA ecosystem and not yet equivalent in transformer-kernel coverage to rival stacks such as AMD's ROCm, which weighed on customer adoption.[^36][^12] In late 2025, Phoronix reported that Intel had archived the upstream open-source user-space SynapseAI Core driver code and was no longer maintaining it, a step consistent with Gaudi 3's end-of-line status.[^17]
At the April 2024 launch, Intel published projected comparisons (not MLPerf-audited at that time) versus the Nvidia H100 showing, on average, 50 % faster inference throughput across Llama 2 7B/70B and Falcon 180B, 40 % better inference power efficiency, and 50 % faster time-to-train for Llama 2 7B/13B and GPT-3 175B.[^1][^3] For training on a cluster of 8,192 accelerators on GPT-3 175B, Intel projected 40 % shorter training time than an equivalent H100 cluster.[^6][^28]
In peak dense throughput, Gaudi 3 is rated at 1,835 TFLOPS in both FP8 and BF16, slightly below the H100 SXM5 in FP8 (roughly 1,979 TFLOPS dense) and ahead of Gaudi 2.[^4][^28] The H200 shares the H100's compute engine and therefore the same dense FP8 peak (the ≈3,958 TFLOPS figure often quoted for Hopper assumes structured sparsity), so its advantage lies in its HBM3e memory rather than in compute, while the B200 (≈4,500 TFLOPS FP8 dense) is far ahead of Gaudi 3 on raw throughput; once those parts shipped during 2024, Intel's principal price-performance comparison remained anchored to H100 rather than to Nvidia's newer Hopper or Blackwell parts.[^28][^29]
Third-party analyses such as Spheron's 2026 LLM Inference comparison reported that at batch-1 token generation Gaudi 3 reaches roughly 76 % of H200 throughput and 45 % of B200 throughput on Llama 2 70B, with the gap widening at larger batches where the workload becomes compute-bound rather than memory-bound; conversely, under high-concurrency serving Gaudi 3 reached approximately 95 % of H100 raw throughput at substantially lower cost per token (cited at roughly $0.31 per million tokens on Gaudi 3 versus $0.48 on H100 for Llama 2 70B serving).[^28]
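The cost-per-token figures cited above can be turned into a relative saving with one line of arithmetic. The sketch below uses only the two prices quoted in the text and treats them as given; the names are illustrative.

```python
# Cost per million tokens for Llama 2 70B serving, as cited in the text (USD).
GAUDI3_USD_PER_MTOK = 0.31
H100_USD_PER_MTOK = 0.48

# Fractional saving of Gaudi 3 relative to H100, expressed in percent.
saving_pct = (1.0 - GAUDI3_USD_PER_MTOK / H100_USD_PER_MTOK) * 100.0

# Equivalent view: tokens generated per dollar, relative to H100.
tokens_per_dollar_ratio = H100_USD_PER_MTOK / GAUDI3_USD_PER_MTOK

print(f"Cost saving vs H100:       {saving_pct:.1f} %")
print(f"Tokens per dollar vs H100: {tokens_per_dollar_ratio:.2f}x")
```

On these figures Gaudi 3 serves tokens roughly 35 % more cheaply than the H100, or about 1.55 times as many tokens per dollar, despite delivering slightly lower raw throughput.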
Intel committed at the April 2024 announcement to submit Gaudi 3 MLPerf results, and the company did contribute Gaudi 3 inference results to subsequent MLPerf rounds; Gaudi 2 had previously been the only non-Nvidia accelerator with audited MLPerf submissions for GPT-3 training, and Gaudi 3 continued that participation pattern.[^23][^24][^7] Intel's published economic-analysis white paper documents per-server price-performance numbers but, as several reviewers including The Next Platform noted, the comparisons rely on Intel-internal performance projections and customer mileage will vary substantially by workload.[^28][^29]
At Computex in June 2024 Intel publicly disclosed pricing: a Gaudi 3 OAM module was list-priced at roughly $15,650, and a fully populated eight-accelerator HLB-325 baseboard at $125,000, including the integrated networking.[^15][^37] Intel's positioning emphasised that the baseboard's 14.68 BF16 PFLOPS of aggregate throughput, at $125,000, translated to roughly $8,515 per BF16 PFLOPS — which Intel and The Next Platform calculated as about 2.9x better price-per-petaflop than an estimated $200,000 H100 HGX baseboard at $25,000 per PFLOPS.[^29][^15] TechRadar and other outlets observed that the disclosed price represented an unusually transparent move for an AI accelerator vendor in 2024.[^15][^37]
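The price-per-petaflop comparison above can be reproduced from the quoted figures. In the sketch below, the H100 baseboard price and its per-PFLOPS cost are the estimates cited in the text rather than Nvidia list prices, and the variable names are illustrative.

```python
# Figures as quoted in the text; the H100 numbers are estimates, not list prices.
GAUDI3_BF16_TFLOPS = 1835
ACCELERATORS_PER_BASEBOARD = 8
GAUDI3_BASEBOARD_USD = 125_000
H100_BASEBOARD_USD = 200_000
H100_USD_PER_PFLOPS = 25_000

# Aggregate BF16 throughput of one eight-accelerator baseboard.
baseboard_pflops = GAUDI3_BF16_TFLOPS * ACCELERATORS_PER_BASEBOARD / 1000

# Cost per BF16 PFLOPS and the resulting advantage over the H100 estimate.
gaudi3_usd_per_pflops = GAUDI3_BASEBOARD_USD / baseboard_pflops
advantage = H100_USD_PER_PFLOPS / gaudi3_usd_per_pflops

# Throughput implied by the H100 estimate, and an even per-module price split.
h100_implied_pflops = H100_BASEBOARD_USD / H100_USD_PER_PFLOPS
usd_per_module = GAUDI3_BASEBOARD_USD / ACCELERATORS_PER_BASEBOARD

print(f"Gaudi 3 baseboard:  {baseboard_pflops:.2f} BF16 PFLOPS")
print(f"Gaudi 3 cost:       ${gaudi3_usd_per_pflops:,.0f} per PFLOPS")
print(f"Advantage vs H100:  {advantage:.1f}x")
print(f"H100 implied:       {h100_implied_pflops:.1f} PFLOPS per baseboard")
print(f"Even split:         ${usd_per_module:,.0f} per module")
```

The calculation recovers the 14.68 PFLOPS, roughly $8,515 per PFLOPS and 2.9x figures. It also shows that the H100 estimate assumes about 8 PFLOPS of BF16 per baseboard, and that dividing the baseboard price evenly gives $15,625 per module, close to the per-module figure quoted above.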
Intel's announced rollout was: sampling to partners from April 9, 2024; air-cooled OAM volume production in Q3 2024; liquid-cooled OAM and HL-338 PCIe in Q4 2024; broader OEM general availability through Dell Technologies, Hewlett Packard Enterprise, Lenovo and Supermicro across Q3–Q4 2024.[^3][^7] The formal product launch with Xeon 6 P-cores took place at an Intel virtual event on September 24, 2024.[^7][^8] Gaudi 3 was also made available on Intel's own Tiber Developer (later "Tiber AI") Cloud during the second half of 2024.[^14][^7]
Major OEM systems shipping Gaudi 3 include the Dell PowerEdge XE9680 in an eight-accelerator OAM configuration (announced and later showcased with rack-integration documentation at SC24), Supermicro's X14 8U Gaudi 3 platform, HPE's ProLiant DL384 Gen11 and Gen12 systems, Lenovo's ThinkSystem SR685a V3, GIGABYTE's G894 series and Wiwynn baseboards.[^14][^7][^38]
In April 2024 IBM was announced as the first cloud service provider to commit to making Gaudi 3 available as a public-cloud service, and at Intel Vision 2025 IBM confirmed general availability on IBM Cloud in the Frankfurt (eu-de) and Washington D.C. (us-east) regions, with Dallas (us-south) following in Q2 2025.[^16][^39] IBM positioned Gaudi 3 as the compute backbone for its watsonx.ai development studio, with watsonx.ai software deployable on Gaudi 3 virtual servers from Q2 2025 and Red Hat OpenShift AI worker-node support in the same timeframe.[^16][^39] At the time of the cancellation of Falcon Shores in January 2025, IBM remained the only large public-cloud provider to have deployed Gaudi 3 in production — a key data point cited by analysts in assessing Gaudi 3's commercial reach.[^11][^12]
In October 2024 Inflection AI, by then led by Sean White following Microsoft's acqui-hire of co-founder Mustafa Suleyman, announced that Inflection 3.0, its enterprise platform, would run on Gaudi 3 accelerators rather than Nvidia GPUs. Inflection cited up to 2x improved price-performance versus competing offerings and an intention to ship a physical Inflection-branded appliance based on Gaudi 3 from Q1 2025.[^18][^40] Inflection engineers later published a technical retrospective describing the porting effort from CUDA to the Gaudi stack.[^40]
At Intel Vision 2024 Intel also announced commitments from Indian customers and partners: Bharti Airtel intended to use Gaudi for telecom-data AI applications, Infosys committed to building Gaudi-based services, Ola Krutrim was already using Gaudi 2 to pre-train Indic-language foundation models with plans to migrate to Gaudi 3, and Mumbai-based data-centre operator CtrlS announced a Gaudi 3-based AI supercomputer for Indian enterprise customers.[^41][^3]
Intel listed Bosch, NAVER, NielsenIQ, Seekr (which integrated Gaudi 3 into its SeekrFlow AI platform), Landing AI, IFF, Roboflow and Dell Technologies as further Gaudi customers and partners at the April and September 2024 announcements.[^3][^7] South Korean cloud and search operator NAVER's involvement was specifically emphasised as a sovereign-AI play in the Korean market.[^3]
On Intel's Q3 2024 earnings call (October 31, 2024) Pat Gelsinger told investors the company would not meet its previously stated $500 million 2024 Gaudi revenue target, citing slower than expected uptake, the Gaudi 2 → Gaudi 3 transition, and "software ease of use" issues; Intel took a $300 million write-down on accelerator inventory in the same quarter.[^9][^10] The $500 million target had itself been a step down from the $1–2 billion range Gelsinger had floated earlier in 2024.[^9][^10] Reports from TrendForce in October 2024 said Intel had also revised down its 2025 Gaudi 3 unit-shipment target from 300,000–350,000 to roughly 200,000–250,000 — a more-than-30 % reduction.[^42]
SemiAnalysis published a December 2024 assessment of Intel's culture and AI strategy that was sharply critical of Gaudi 3's competitive position, and The Register's February 2025 commentary argued that Gaudi 3 "lacks an upgrade path" because Falcon Shores would not ship and Jaguar Shores was still two years out, leaving customers without a multi-generation Gaudi roadmap to plan against.[^12][^36] HPCwire characterised Falcon Shores as "another swing, and a miss" in February 2025.[^43] Tom's Hardware, IEEE Spectrum and The Next Platform consistently described Gaudi 3 as technically competitive but commercially out-positioned by the timing of Nvidia's H200 and Blackwell launches.[^6][^28][^29][^44]
Gelsinger, who had returned to Intel as CEO in February 2021 with a turnaround mandate centred on the IDM 2.0 foundry strategy and an AI accelerator pivot, was given the choice of retirement or removal by Intel's board after a contentious meeting and stepped down effective December 1, 2024.[^45][^46] CFO David Zinsner and Intel Products CEO Michelle "MJ" Johnston Holthaus were named interim co-CEOs.[^45][^46]
On March 12, 2025 Intel named Lip-Bu Tan, the former Cadence Design Systems CEO and a former Intel board member, as the company's new CEO, effective March 18, 2025.[^47] Tan inherited the strategic decisions already announced in Holthaus's late-January earnings-call comments, including the Falcon Shores cancellation, and in subsequent communications signalled an even sharper focus on rack-scale system delivery and on Intel Foundry as a customer business.[^47][^12]
On Intel's Q4 2024 earnings call on January 30, 2025, co-CEO Michelle Johnston Holthaus announced that Intel would "leverage Falcon Shores as an internal test chip, without bringing it to market."[^11][^48] Falcon Shores had been Intel's intended XPU successor to Gaudi 3, originally pitched as a Xe-architecture-plus-Gaudi-fabric hybrid for late-2025 launch and at one stage positioned by Intel as a re-entry into the HPC GPU market after Ponte Vecchio.[^11][^29] Holthaus said Intel was "not yet participating in the cloud-based AI data center market in a meaningful way" and that the company would refocus on a "system-level solution at rack scale" rather than ship a third-generation discrete data-centre GPU.[^11][^48]
Together with the Falcon Shores cancellation, the announcement implicitly designated Gaudi 3 as the last standalone Gaudi-branded accelerator. The end-of-line nature of Gaudi 3 was reinforced through 2025 by reduced unit-shipment targets, the archiving of the open-source SynapseAI Core user-space driver code reported by Phoronix in late 2025, and the absence of a Gaudi-4-branded follow-on.[^17][^42][^12]
Intel has confirmed that the successor, codenamed Jaguar Shores, will inherit the Gaudi brand and be delivered as a full rack-scale system rather than a single accelerator, will use SK hynix HBM4 memory and the Intel 18A process, and is targeted for 2026.[^49][^50] Public images from TrendForce and Tom's Hardware coverage of Jaguar Shores reveal a 92.5 × 92.5 mm package integrating four compute tiles and eight HBM4 stacks, paired with Intel's Diamond Rapids Xeon CPU in rack-scale configurations.[^49][^50]
In a further twist, several outlets reported in late 2025 on a hybrid rack-scale system that combined Gaudi 3 accelerators with Nvidia Blackwell B200 GPUs, an arrangement read by SemiAnalysis and others as Intel implicitly conceding that, for the most demanding inference workloads in 2025–2026, its own silicon could not stand alone against Blackwell.[^36][^44]
Gaudi 3 was, on balance, well-reviewed on its technical merits at launch: IEEE Spectrum called the design's dual-die, on-package Ethernet-fabric approach a credible alternative philosophy to Nvidia's NVLink-centric platform; The Next Platform calculated a roughly 2.5–2.9x price-per-petaflop advantage over H100 baseboards; and Tom's Hardware noted that Intel had been unusually transparent about pricing.[^6][^29][^15] At the same time, virtually every independent technical write-up flagged software-stack maturity and Intel's lack of a multi-generation Gaudi roadmap as headwinds.[^6][^29][^12]
In commercial terms the chip's reception was clearly disappointing relative to internal targets. The missed $500 million 2024 revenue figure, the more-than-30 % cut to 2025 unit-shipment targets, the limited public-cloud footprint outside IBM and Intel Tiber, the eventual cancellation of the Falcon Shores successor, and the archival of the open-source driver code together placed Gaudi 3 as the final chapter in Intel's 2019-vintage Habana acquisition arc.[^9][^11][^17][^42] Industry observers writing in early 2025 generally framed Gaudi 3 as a credible engineering effort that nevertheless arrived too late to dent the data-center AI accelerator market materially before the H200 and Blackwell ramps, leaving Intel to refocus its AI hardware ambitions on the Jaguar Shores rack-scale system due in 2026.[^12][^43][^44][^49]