AMD Instinct MI400
Last reviewed
May 17, 2026
Sources
24 citations
Review status
Source-backed
Revision
v1 · 3,497 words
AMD Instinct MI400 is a family of data center GPU accelerators developed by AMD for artificial intelligence training, inference, and high-performance computing workloads. Announced at AMD's Advancing AI 2025 event and detailed in full at the company's CES 2026 keynote on January 5, 2026, the MI400 series succeeds the AMD Instinct MI350 series and is built on AMD's CDNA 5 architecture. The lineup includes the flagship MI455X for hyperscale AI training, the MI440X for enterprise AI servers, and the MI430X for sovereign AI and HPC. Each MI400 accelerator packs 432 GB of HBM4 memory, 19.6 TB/s of memory bandwidth, and up to 40 PFLOPS of FP4 compute, roughly doubling the on-paper performance of the MI350 generation.[1][2]
The MI400 series anchors AMD's first complete rack-scale AI system, codenamed Helios, which couples 72 MI455X accelerators with next-generation EPYC "Venice" Zen 6 CPUs and Pensando "Vulcano" 800 Gb/s NICs in a double-wide Open Compute Project (OCP) rack. Helios is positioned as AMD's direct competitor to NVIDIA's Vera Rubin platform, and it is the hardware foundation for a multi-billion dollar, 6 gigawatt strategic partnership announced between AMD and OpenAI in October 2025. Production shipments of MI400 silicon are expected to begin in mid-2026, with the first gigawatt of OpenAI deployments arriving in the second half of 2026.[3][4]
AMD first publicly committed to the MI400 series at its Advancing AI 2025 event in San Jose, California on June 12, 2025. During the keynote, AMD CEO Lisa Su outlined a multi-year accelerator roadmap that included the just-launched MI350 and MI355X parts, the MI400 series for 2026, and a follow-on MI500 generation expected in 2027. AMD framed the MI400 launch as the moment when its data center GPU business would transition from selling individual accelerators to selling complete, integrated rack-scale systems on equal footing with NVIDIA's GB200 NVL72 and the forthcoming Vera Rubin rack.[1][5]
The full MI400 product family was unveiled at CES 2026 on January 5, 2026, when Lisa Su delivered the opening keynote in Las Vegas. The CES presentation introduced three named variants (MI430X, MI440X, MI455X), confirmed the Helios reference design, and showed working silicon mounted in a Helios compute tray for the first time.[6][7]
The MI400 series is the first product family built on AMD's CDNA 5 (Compute DNA, fifth generation) GPU architecture. CDNA 5 is a clean-sheet redesign of the matrix and vector compute engines, dropping support for graphics rasterization entirely in favor of denser AI and HPC math units. AMD describes CDNA 5 as its most ambitious chiplet design to date, combining multiple compute and memory dies across two TSMC process nodes inside a single accelerator package.[8][9]
The MI455X uses 12 compute chiplets on TSMC's N2 (2 nm class) process node, paired with 3 additional I/O chiplets on TSMC 3 nm. The package integrates these dies on top of a base interposer hosting the cache and memory fabric, then connects to 12 stacks of HBM4 memory. The full MI455X reaches approximately 320 billion transistors, compared with 208 billion in NVIDIA's Blackwell B200, and roughly 1.8 times the transistor count of the MI355X.[8][10]
Each MI400 series accelerator delivers up to 40 PFLOPS of dense FP4 throughput and 20 PFLOPS of dense FP8 throughput, double the matrix throughput of the MI355X at the same precisions. The new Matrix Cores in CDNA 5 add native support for the OCP MX (microscaling) FP4 and FP6 formats that are also used by Blackwell, and retain support for FP8, BF16, FP16, TF32, FP32, and FP64.
| Precision format | MI400 dense throughput | Primary use case |
|---|---|---|
| FP4 (MXFP4) | 40 PFLOPS | Quantized inference, ultra-low-precision training |
| FP6 (MXFP6) | 30 PFLOPS | Inference with accuracy headroom |
| FP8 | 20 PFLOPS | Mixed-precision training, high-quality inference |
| BF16 / FP16 | 10 PFLOPS | Standard training, scientific AI |
| TF32 | 5 PFLOPS | Drop-in training replacement for FP32 |
| FP32 | 2.5 PFLOPS | Traditional dense compute |
| FP64 (MI430X) | 1.3 PFLOPS | HPC, scientific simulation |
The MI430X variant emphasizes FP64 throughput, restoring the full-rate double-precision pipeline that was reduced in some MI350 SKUs. AMD positions the MI430X as a successor to the MI300A and to the MI250X GPUs in the Frontier supercomputer for traditional HPC customers, while the MI440X and MI455X are tuned more aggressively toward low-precision AI throughput.[2][9]
Each MI400 GPU is paired with 432 GB of HBM4 memory across 12 stacks. The aggregate 19.6 TB/s of memory bandwidth works out to about 1.63 TB/s per stack, or roughly 6.4 Gb/s per pin over HBM4's 2048-bit per-stack interface. This represents a 1.5 times capacity improvement and a 2.45 times bandwidth improvement compared to the MI355X (288 GB of HBM3e at 8 TB/s). The choice of HBM4 puts the MI400 family ahead of NVIDIA's first Rubin parts on raw capacity, even though Rubin runs its HBM4 stacks at a higher per-pin data rate.[2][11]
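The capacity and bandwidth comparison with the MI355X can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in this section and assumes decimal units (1 TB = 1000 GB); the dictionary layout and function name are illustrative.

```python
# Figures quoted in the text; GB and TB are treated as decimal units.
MI400 = {"hbm_gb": 432, "hbm_tbps": 19.6, "stacks": 12}
MI355X = {"hbm_gb": 288, "hbm_tbps": 8.0}


def memory_comparison(new: dict, old: dict) -> dict:
    """Return generation-over-generation ratios and per-stack figures."""
    return {
        "capacity_ratio": new["hbm_gb"] / old["hbm_gb"],
        "bandwidth_ratio": new["hbm_tbps"] / old["hbm_tbps"],
        "gb_per_stack": new["hbm_gb"] / new["stacks"],
        "tbps_per_stack": new["hbm_tbps"] / new["stacks"],
    }


if __name__ == "__main__":
    for name, value in memory_comparison(MI400, MI355X).items():
        print(f"{name}: {value:.2f}")
```

The per-stack values (36 GB and about 1.63 TB/s) are derived here rather than quoted by AMD, so treat them as a consistency check on the headline numbers.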
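The larger HBM4 footprint matters for very large mixture-of-experts and trillion-parameter dense models, which can fit more experts and KV cache per accelerator, and for agentic AI workloads that hold long context windows in memory. AMD claims a single MI455X can serve a 1 trillion parameter dense model entirely in its own HBM4 at FP4, eliminating the need to shard model weights across multiple accelerators. Taken at 4 bits per parameter, however, 1 trillion parameters occupy about 500 GB, more than the 432 GB on the package; the capacity corresponds to roughly 860 billion FP4 parameters before KV cache and activations are counted.[6]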
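A rough way to check capacity claims of this kind is to convert a parameter count and a bit width into bytes. The sketch below is a back-of-the-envelope estimate only: it assumes decimal gigabytes and ignores KV cache, activations, and the per-block scale factors that MX formats add. All function names are hypothetical.

```python
def weight_footprint_gb(params: float, bits_per_param: float) -> float:
    """Size of the raw weights in decimal GB."""
    return params * bits_per_param / 8 / 1e9


def max_params_for_capacity(capacity_gb: float, bits_per_param: float) -> float:
    """Largest parameter count whose raw weights fit in capacity_gb."""
    return capacity_gb * 1e9 * 8 / bits_per_param


def fits_in_hbm(params: float, bits_per_param: float, capacity_gb: float) -> bool:
    """True if the raw weights alone fit; runtime state is not counted."""
    return weight_footprint_gb(params, bits_per_param) <= capacity_gb


if __name__ == "__main__":
    hbm_gb = 432  # per-accelerator HBM4 capacity quoted in the text
    print(weight_footprint_gb(1e12, 4))       # weights of a 1T model at 4 bits
    print(max_params_for_capacity(hbm_gb, 4)) # parameters that fit at 4 bits
    print(fits_in_hbm(1e12, 4, hbm_gb))
```

Under these assumptions, 1 trillion parameters at 4 bits need about 500 GB for weights alone, while 432 GB holds about 864 billion parameters.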
The MI400 series introduces two new interconnect technologies. For scale-up communication within a Helios rack, MI400 accelerators support the Ultra Accelerator Link (UALink) standard alongside AMD's existing Infinity Fabric, making the MI400 family the first commercial silicon shipped with UALink support. The internal scale-up fabric provides 3.6 TB/s of GPU-to-GPU bandwidth per accelerator, with up to 260 TB/s of aggregate scale-up bandwidth across all 72 GPUs in a Helios rack.[3][12]
For scale-out communication between racks, each MI400 GPU exposes 300 GB/s of dedicated scale-out bandwidth through paired Pensando "Vulcano" 800 Gb/s NICs. The Vulcano NIC implements the Ultra Ethernet Consortium (UEC) specification and supports both standard Ethernet and AMD's proprietary congestion control extensions. A single Helios rack provides up to 43 TB/s of aggregate Ethernet-based scale-out bandwidth, roughly an order of magnitude more than typical NVIDIA InfiniBand deployments of the same era.[3][13]
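The rack-level fabric figures can be related to the per-GPU numbers with simple multiplication. The sketch below uses only values quoted above. The constant names are illustrative, and the split between one-way and two-way bandwidth is an assumption, since the text does not say how the aggregate is counted.

```python
GPUS_PER_RACK = 72
SCALE_UP_TB_PER_S_PER_GPU = 3.6     # from the text
SCALE_OUT_GB_PER_S_PER_GPU = 300.0  # from the text
NIC_LINE_RATE_GBIT_PER_S = 800.0    # Vulcano line rate, from the text


def rack_fabric_totals() -> dict:
    """Aggregate per-GPU fabric bandwidth over one 72-GPU rack."""
    nic_gb_per_s = NIC_LINE_RATE_GBIT_PER_S / 8  # bits to bytes
    return {
        "scale_up_tb_per_s": GPUS_PER_RACK * SCALE_UP_TB_PER_S_PER_GPU,
        "scale_out_tb_per_s": GPUS_PER_RACK * SCALE_OUT_GB_PER_S_PER_GPU / 1000,
        "nic_gb_per_s": nic_gb_per_s,
        # How this maps to physical NICs per GPU is not stated in the text.
        "nic_equivalents_per_gpu": SCALE_OUT_GB_PER_S_PER_GPU / nic_gb_per_s,
    }


if __name__ == "__main__":
    for name, value in rack_fabric_totals().items():
        print(f"{name}: {value:.1f}")
```

The scale-up total comes to 259.2 TB/s, matching the quoted 260 TB/s. The scale-out total comes to 21.6 TB/s, half of the quoted 43 TB/s, which would be consistent with the rack figure counting both directions; that reading is an assumption, not something the sources here confirm.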
The MI455X is the flagship of the MI400 family and the only variant qualified for the Helios rack-scale platform. It delivers 40 PFLOPS of FP4 compute, 20 PFLOPS of FP8, 432 GB of HBM4 at 19.6 TB/s, and 3.6 TB/s of scale-up bandwidth per accelerator. The MI455X targets hyperscale AI training clusters, the largest distributed inference deployments, and frontier model training runs. It is sold only in rack-scale configurations.[2][6]
| Specification | MI455X |
|---|---|
| Architecture | CDNA 5 |
| Transistors | ~320 billion |
| Process node | TSMC N2 (compute) + TSMC N3 (I/O) |
| Chiplets | 12 compute + 3 I/O + base interposer |
| HBM memory | 432 GB HBM4 (12 stacks) |
| Memory bandwidth | 19.6 TB/s |
| FP4 dense | 40 PFLOPS |
| FP8 dense | 20 PFLOPS |
| Scale-up bandwidth | 3.6 TB/s per GPU |
| Scale-out bandwidth | 300 GB/s per GPU |
| Interconnect | UALink, Infinity Fabric, Ultra Ethernet |
| TDP | Liquid-cooled, rack-scale (~1.4 to 1.8 kW range) |
The MI440X is the enterprise variant, designed to slot into standard 8-GPU OAM servers next to a single EPYC Venice CPU. It carries the same 432 GB of HBM4 and 19.6 TB/s memory bandwidth as the MI455X, but is optimized for lower-precision workloads (FP4, FP8, BF16) and ships in air-cooled or liquid-cooled form factors. The MI440X powers AMD's new Enterprise AI Platform reference design, which targets corporate AI training and inference at single-server to multi-rack scale.[2][14]
The MI430X is the HPC and sovereign AI variant. It uses a different chiplet mix that emphasizes FP64 and FP32 throughput alongside the same FP4 and FP8 matrix units. The MI430X targets national laboratories, government cloud customers, defense contractors, and traditional supercomputing centers that need both high-precision scientific compute and modern AI capability in the same accelerator.[2][7]
| Variant | FP4 / FP8 | FP64 | Workload emphasis |
|---|---|---|---|
| MI455X | 40 / 20 PFLOPS | Reduced rate | Hyperscale AI training, frontier models |
| MI440X | 40 / 20 PFLOPS | Reduced rate | Enterprise AI, mainstream inference |
| MI430X | 40 / 20 PFLOPS | ~1.3 PFLOPS (full rate) | HPC, sovereign AI, scientific computing |
Helios is AMD's first complete rack-scale AI system, and the primary delivery vehicle for the MI455X. AMD describes Helios as a reference design rather than a single SKU, meaning OEM partners (Dell, HPE, Lenovo, Supermicro) and ODMs will sell their own Helios-compatible racks that meet the AMD specification. The system is built on the Open Compute Project (OCP) 2025 double-wide rack standard, which Meta originally contributed to OCP.[12][15]
A single Helios rack integrates 72 MI455X accelerators, 18 EPYC "Venice" Zen 6 CPUs (paired with four GPUs each in a 1:4 ratio), and 72 Pensando Vulcano 800 Gb/s NICs (one per GPU for scale-out). The system uses direct-to-chip liquid cooling across all major components, with power consumption estimated at 130 to 170 kW per rack depending on configuration.[12][13]
At the rack level, Helios delivers approximately 2.9 EFLOPS of FP4 inference compute and 1.4 EFLOPS of FP8 training compute, with 31 TB of aggregate HBM4 memory and 1.4 PB/s of total memory bandwidth. The rack-internal scale-up fabric provides 260 TB/s of bandwidth, and the scale-out network adds another 43 TB/s of Ethernet capacity for connecting Helios racks into larger superpods.[3][12]
| Helios rack metric | Value |
|---|---|
| MI455X accelerators per rack | 72 |
| EPYC Venice CPUs per rack | 18 (Zen 6, 2 nm) |
| Pensando Vulcano NICs | 72 (800 Gb/s each) |
| Total HBM4 memory | 31 TB |
| Aggregate memory bandwidth | 1.4 PB/s |
| Scale-up bandwidth | 260 TB/s |
| Scale-out Ethernet bandwidth | 43 TB/s |
| FP4 compute | ~2.9 EFLOPS |
| FP8 compute | ~1.4 EFLOPS |
| Form factor | OCP 2025 double-wide rack, direct-to-chip liquid cooling |
| Rack power | ~130 to 170 kW |
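The rack-level compute and memory figures follow from multiplying the per-accelerator specifications by the 72 GPUs in a rack. A minimal sketch of that arithmetic, assuming decimal unit prefixes and using hypothetical class and function names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    fp4_pflops: float
    fp8_pflops: float
    hbm_gb: float
    hbm_tb_per_s: float


# Per-accelerator MI455X figures as quoted in this article.
MI455X = Accelerator(fp4_pflops=40, fp8_pflops=20, hbm_gb=432, hbm_tb_per_s=19.6)


def rack_totals(gpu: Accelerator, gpus: int = 72, cpus: int = 18) -> dict:
    """Scale per-GPU peak figures to one rack (peak numbers, not measured)."""
    return {
        "fp4_eflops": gpus * gpu.fp4_pflops / 1000,
        "fp8_eflops": gpus * gpu.fp8_pflops / 1000,
        "hbm_tb": gpus * gpu.hbm_gb / 1000,
        "hbm_pb_per_s": gpus * gpu.hbm_tb_per_s / 1000,
        "gpus_per_cpu": gpus / cpus,
    }


if __name__ == "__main__":
    for name, value in rack_totals(MI455X).items():
        print(f"{name}: {value:.2f}")
```

The results (2.88 EFLOPS FP4, 1.44 EFLOPS FP8, 31.1 TB, 1.41 PB/s, and 4 GPUs per CPU) agree with the rounded values in the table.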
AMD claims roughly a 10 times AI performance increase per rack compared to the MI300X generation, driven by the FP4 matrix path, higher GPU count per rack (72 vs 8 in a standard MI300X server), and the new UALink scale-up fabric.[6][16]
| Specification | MI300X (CDNA 3, 2023) | MI325X (CDNA 3, 2024) | MI355X (CDNA 4, 2025) | MI455X (CDNA 5, 2026) |
|---|---|---|---|---|
| HBM capacity | 192 GB HBM3 | 256 GB HBM3e | 288 GB HBM3e | 432 GB HBM4 |
| HBM bandwidth | 5.3 TB/s | 6 TB/s | 8 TB/s | 19.6 TB/s |
| FP8 dense | 2.6 PFLOPS | 2.6 PFLOPS | 10 PFLOPS | 20 PFLOPS |
| FP4 dense | not supported | not supported | 20 PFLOPS | 40 PFLOPS |
| Scale-up bandwidth | 896 GB/s (Infinity Fabric) | 896 GB/s | 1.075 TB/s | 3.6 TB/s (UALink + IF) |
| Process node | TSMC 5 nm | TSMC 5 nm | TSMC N3 | TSMC N2 (compute) + N3 (I/O) |
| Rack-scale platform | None | None | None | Helios (72-GPU rack) |
The jump from CDNA 4 to CDNA 5 is the largest generational step AMD has taken in the Instinct line. Memory capacity grew 1.5 times, bandwidth grew 2.45 times, dense FP4 throughput doubled, and scale-up bandwidth more than tripled (about 3.3 times). AMD compares the MI455X-based Helios rack to the MI300X-based 8-GPU server it replaces, claiming approximately 35 times the AI compute per node and 18 times the HBM capacity per node, before accounting for software efficiency gains.[6][9]
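The step-over-step growth across the four generations can be computed directly from the comparison table. The sketch below copies the table values as given; the data layout and function names are illustrative.

```python
GENERATIONS = ["MI300X", "MI325X", "MI355X", "MI455X"]

# Values taken from the comparison table, in table order.
SPECS = {
    "hbm_gb": [192, 256, 288, 432],
    "hbm_tb_per_s": [5.3, 6.0, 8.0, 19.6],
    "fp8_pflops": [2.6, 2.6, 10.0, 20.0],
    "scale_up_tb_per_s": [0.896, 0.896, 1.075, 3.6],
}


def step_ratios(values: list) -> list:
    """Ratio of each generation to the one before it."""
    return [new / old for old, new in zip(values, values[1:])]


def latest_step() -> dict:
    """Growth factors for the most recent step (MI355X to MI455X)."""
    return {metric: step_ratios(values)[-1] for metric, values in SPECS.items()}


if __name__ == "__main__":
    for metric, values in SPECS.items():
        steps = ", ".join(f"{r:.2f}x" for r in step_ratios(values))
        print(f"{metric}: {steps}")
```

For the MI355X to MI455X step this gives 1.5 times the capacity, 2.45 times the memory bandwidth, 2 times the FP8 throughput, and about 3.35 times the scale-up bandwidth.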
The MI455X is positioned as a direct competitor to NVIDIA's Vera Rubin platform, which features the Rubin R100 GPU paired with NVIDIA's new Vera CPU. The two platforms entered customer evaluation roughly simultaneously in 2026, with NVIDIA targeting partner availability in the second half of 2026.[17][18]
| Metric | AMD MI455X | NVIDIA Rubin R100 (announced) |
|---|---|---|
| Transistors | ~320 billion | ~336 billion |
| Process node | TSMC N2 (compute) | TSMC N3 |
| HBM capacity | 432 GB HBM4 | 288 GB HBM4 |
| HBM bandwidth | 19.6 TB/s | ~22 TB/s |
| FP4 dense compute | 40 PFLOPS | ~50 PFLOPS (NVFP4) |
| FP8 dense compute | 20 PFLOPS | ~35 PFLOPS |
| Scale-up bandwidth | 3.6 TB/s (UALink) | 3.6 TB/s (NVLink 6) |
| Rack scale unit | Helios, 72 GPUs | Rubin NVL144, 144 GPU dies |
| Launch window | H2 2026 | H2 2026 |
AMD leads on memory capacity (1.5 times as much HBM4 per accelerator), while the two platforms are level on per-GPU scale-up bandwidth at 3.6 TB/s. NVIDIA leads on peak FP4 and FP8 throughput, per-pin HBM4 data rate, and the maturity of its software ecosystem. The competitive positioning has settled around the idea that AMD's MI400 family offers better price-performance for inference-heavy workloads and large context windows, while NVIDIA's Rubin offers higher peak training throughput per accelerator and benefits from a more mature CUDA software stack.[18][19]
The MI400 series uses the AMD ROCm open-source software stack, which provides functionality equivalent to NVIDIA's CUDA platform across compilers, libraries, runtimes, and AI frameworks. AMD released ROCm 7.0 in mid-2025 alongside the MI350 launch, and the version of ROCm shipping with MI400 hardware extends that stack with native support for the new CDNA 5 instructions, FP4 and FP6 matrix paths, UALink-aware collective operations, and integration with the Vulcano scale-out NICs.[9][20]
AMD's headline performance claim for ROCm 7.0 is up to 4 times faster inference and up to 3 times faster training compared to ROCm 6.0 on equivalent hardware. The improvements come from kernel-level GEMM optimizations, new attention implementations tuned for long context, expanded support for vLLM and SGLang, and a redesigned distributed inference runtime that takes advantage of the MI400's UALink fabric for tensor-parallel and expert-parallel sharding.[20]
The ROCm stack for MI400 supports the standard set of AI frameworks: PyTorch, JAX, TensorFlow, vLLM, SGLang, and the Hugging Face Transformers library, with optimized container images for popular open models including Llama 4, DeepSeek, and Qwen.
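In practice, ROCm builds of PyTorch reuse the `torch.cuda` namespace, so existing device-selection code usually runs unchanged on Instinct hardware. The sketch below shows one way to check which backend a PyTorch install was built for. Nothing in it is specific to the MI400, the function name is hypothetical, and the device names that MI400 parts will report are not known here.

```python
def describe_accelerator_backend() -> dict:
    """Report whether PyTorch is present and whether it is a ROCm (HIP) build."""
    info = {"torch_installed": False, "hip_version": None, "devices": []}
    try:
        import torch
    except ImportError:
        return info  # PyTorch not installed; nothing more to report

    info["torch_installed"] = True
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    info["hip_version"] = getattr(torch.version, "hip", None)
    if torch.cuda.is_available():
        # On ROCm builds the torch.cuda API enumerates AMD GPUs.
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    print(describe_accelerator_backend())
```

On a machine without PyTorch or without a GPU, the function returns an empty device list rather than raising, which keeps it safe to call in setup scripts.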
On October 6, 2025, AMD and OpenAI announced a multi-year strategic partnership covering 6 gigawatts of AMD GPU capacity, beginning with MI400 series deployments in H2 2026. The deal is the largest single non-NVIDIA AI customer commitment ever signed, with industry estimates valuing the GPU portion of the agreement between $15 billion and $25 billion over its multi-year term.[4][21]
As part of the agreement, AMD granted OpenAI a warrant to purchase up to 160 million shares of AMD common stock at a nominal exercise price, with vesting tied to deployment milestones. At full vesting, the warrant represents roughly a 10 percent stake in AMD, aligning OpenAI's commercial interest with AMD's data center GPU success. The first gigawatt of capacity is scheduled to come online in H2 2026, with the remaining 5 gigawatts deploying across subsequent MI400 and MI500 generations through 2029.[4][22]
In October 2025, Oracle and AMD expanded their existing partnership to include MI400 series deployments on Oracle Cloud Infrastructure (OCI), following Oracle's June 2025 commitment to deploy 130,000 MI355X accelerators on OCI. Oracle intends to take delivery of approximately 50,000 MI400 series GPUs in 2026, with the first Helios racks coming online in OCI regions in H2 2026. OCI will use MI400 hardware to back its OCI Compute and OCI Supercluster products. Oracle's MI400 deployment is also tied indirectly to the OpenAI partnership, since OpenAI has separately contracted Oracle for tens of billions of dollars of compute capacity that will run on a mix of NVIDIA and AMD silicon.[23][24]
AMD has named Meta, Microsoft, Oracle, Tesla, and HUMAIN as customers committed to MI400 deployments. Meta is a particularly important partner because the OCP 2025 double-wide rack standard that Helios uses was originally contributed by Meta. Several national supercomputing centers in the United States, Europe, and Asia have committed to MI430X procurements for upcoming exascale and post-exascale systems.[6][14]
The MI400 series compute chiplets are manufactured on TSMC's N2 (2 nm class) process node, making the MI455X one of the first high-volume products on N2. The I/O dies, base interposer, and memory controller chiplets use TSMC's N3 process. Packages are assembled using TSMC's chip-on-wafer-on-substrate (CoWoS) advanced packaging, with HBM4 stacks supplied by SK hynix and Samsung. Production shipments are scheduled to begin in Q2 2026, with broad availability in H2 2026. Analysts estimate AMD could ship between 200,000 and 300,000 MI400 series units in 2026, ramping in 2027 as N2 capacity and HBM4 supply expand. The largest capacity constraint on the MI400 ramp is HBM4 production, since the MI455X uses 12 HBM4 stacks per accelerator and the early HBM4 supply is shared across AMD, NVIDIA, and other AI accelerator makers.[8][16][19]
Industry reception of the MI400 announcement has been broadly positive, particularly around the HBM4 capacity lead over Rubin, the OpenAI partnership as a multi-year demand anchor independent of NVIDIA's customer base, and the Helios rack-scale system, which moves AMD into direct competition with NVIDIA on integrated rack products rather than just individual accelerators.[19][22]
Areas of skepticism include the maturity of the ROCm software stack relative to CUDA, the limited public benchmark data available before silicon ships in volume, and the operational complexity of running 72-GPU UALink fabrics at scale. Financial analysts have raised their data center GPU revenue forecasts for AMD into 2026 and 2027 on the strength of the MI400 ramp and the OpenAI partnership, with AMD's forward guidance citing the MI400 series as a major driver of a $10 billion or larger annual AI accelerator revenue run-rate.[16][22]