The AMD Instinct MI325X is a data center GPU accelerator designed for AI training and inference workloads. Announced by Advanced Micro Devices (AMD) on October 10, 2024, the MI325X is a mid-cycle refresh of the AMD Instinct MI300X, retaining the same CDNA 3 compute architecture but replacing its HBM3 memory subsystem with 256 GB of higher-density, higher-bandwidth HBM3E delivering 6 TB/s. At launch the MI325X held the largest memory capacity of any production GPU accelerator, surpassing the NVIDIA H200's 141 GB of HBM3E by a wide margin. It was announced at AMD's Advancing AI 2024 event alongside the 5th-generation EPYC server processors.
The MI325X targets large language model (LLM) inference and training workloads where memory capacity constrains which models can be served on a given number of GPUs. AMD positioned it as a direct answer to the NVIDIA H200 while production shipments of NVIDIA's next-generation Blackwell-based GPUs were still ramping. Production shipments of the MI325X began in Q4 2024, with broad server platform availability from OEM partners beginning in Q1 2025.
AMD introduced the AMD Instinct MI300X in December 2023 as its flagship AI accelerator, packing 192 GB of HBM3 memory across eight stacks onto a single OAM module. The MI300X marked a significant leap from the prior generation MI250X and attracted major cloud customers including Microsoft Azure, which deployed the accelerator to support OpenAI's GPT model serving workloads.
By early 2024, the memory capacity advantage the MI300X held over NVIDIA's H100 SXM5 (80 GB HBM3) had become a genuine selling point: large models that required multiple H100s to serve could sometimes run on a single MI300X or a smaller cluster of MI300Xs. However, NVIDIA's own H200 arrived with 141 GB of HBM3E and narrowed the gap. At its Computex 2024 appearance in June, AMD revealed a roadmap that included the MI325X as a Q4 2024 refresh, followed by the CDNA 4-based MI350 series in 2025.
The MI325X does not introduce new compute dies. AMD made a deliberate architectural decision to keep the same Aqua Vanjaram chip with its eight Compute Dies (XCDs) and four I/O Dies (IODs), fabricated on TSMC's 5nm and 6nm nodes respectively, and redirect engineering effort toward swapping the memory type. Each HBM3 stack was replaced by a denser HBM3E stack, raising per-GPU capacity from 192 GB to 256 GB and per-GPU bandwidth from approximately 5.3 TB/s to 6.0 TB/s. The result is a chip that an existing MI300X server can adopt with only firmware and cooling updates, since the physical OAM socket and server infrastructure carry over unchanged.
The power envelope, however, did increase. The MI325X has a rated TDP of 1,000 W per OAM module, compared to the MI300X's 750 W. This 33% rise in power reflects the higher energy draw of HBM3E at elevated bandwidth, and it meaningfully changes the cooling and power delivery requirements for data center deployment.
The MI325X shares its full compute architecture with the MI300X under AMD's CDNA 3 (Compute DNA 3) generation. The chip contains 153 billion transistors across a multi-chip module (MCM) design.
The MI325X uses a 3D-stacked chiplet design with the following die configuration:

- Eight accelerator compute dies (XCDs) on TSMC's 5nm node, each with 38 active compute units (304 in total)
- Four I/O dies (IODs) on TSMC's 6nm node, carrying the Infinity Cache and memory controllers
- Eight HBM3E stacks of 32 GB each, for 256 GB of total capacity
This arrangement, sometimes called the Aqua Vanjaram design, places eight compute tiles in a ring around four I/O tiles, with the memory stacks integrated vertically. The 2D footprint fits within the standard Open Accelerator Module (OAM) form factor defined by the Open Compute Project (OCP).
Each compute unit in CDNA 3 contains a Matrix Core Engine capable of processing FP8, FP16, BF16, and FP64 matrix multiplications. AMD refers to these as second-generation Matrix Core engines relative to the CDNA 2 generation. AMD does not use NVIDIA's "tensor core" terminology; its compute-unit hierarchy organizes scalar, vector, and matrix operations differently.
With all 304 compute units active, the MI325X achieves:

- 81.7 TFLOPS of FP64 performance
- 163.4 TFLOPS of FP32 performance
- 1,307.4 TFLOPS of FP16/BF16 dense matrix performance
- 2,614.9 TFLOPS of FP8 dense matrix performance (roughly double with structured sparsity)
These figures match the MI300X's compute performance, since the compute dies are identical. The bandwidth increase from HBM3E improves throughput for memory-bound operations without changing peak theoretical compute.
The HBM3E upgrade is the sole hardware change distinguishing the MI325X from the MI300X. High Bandwidth Memory 3E (HBM3E) is a denser and faster variant of the HBM3 standard, with per-stack density increasing from 24 GB (12-high stacks in the MI300X) to 32 GB (12-high stacks in the MI325X). The memory bus width remains 1,024 bits per stack, but HBM3E operates at higher transfer rates.
The resulting system-level numbers:

- 256 GB of HBM3E capacity (eight 32 GB stacks)
- 6.0 TB/s of peak memory bandwidth
- 8,192-bit aggregate memory bus (eight 1,024-bit stack interfaces)
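As a sanity check, these headline figures follow from the stack configuration. A minimal sketch, assuming a per-pin rate of roughly 5.9 Gbps (inferred from the published totals; AMD has not disclosed the exact pin speed):

```python
# Cross-check of the MI325X memory headline figures from its stack
# configuration. The ~5.9 Gbps per-pin rate is an inference from the
# published 6.0 TB/s total, not an AMD-published specification.
STACKS = 8
GB_PER_STACK = 32             # HBM3E capacity per stack
BITS_PER_STACK = 1024         # interface width per HBM stack
PIN_RATE_GBPS = 5.9           # assumed per-pin transfer rate

capacity_gb = STACKS * GB_PER_STACK                         # 256 GB
bus_width_bits = STACKS * BITS_PER_STACK                    # 8,192 bits
bandwidth_tb_s = bus_width_bits * PIN_RATE_GBPS / 8 / 1000  # ~6.0 TB/s

print(f"capacity:  {capacity_gb} GB")
print(f"bus width: {bus_width_bits} bits")
print(f"bandwidth: {bandwidth_tb_s:.2f} TB/s")
```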
The Infinity Cache sits between the compute dies and the memory stacks and serves as a last-level cache to reduce latency for frequently reused data. Its capacity is unchanged from the MI300X.
The MI325X uses AMD's Infinity Fabric for GPU-to-GPU communication within a server node. Each OAM module carries seven Infinity Fabric links, each running at 128 GB/s, for an aggregate inter-GPU bandwidth of 896 GB/s per device. In an eight-GPU node this creates a fully connected topology within a single compute partition, enabling the eight GPUs to share 2.048 TB of pooled HBM3E.
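The node-level aggregates reduce to simple multiplication; the short sketch below restates the per-link and per-GPU figures quoted above:

```python
# Node-level aggregates for an eight-GPU MI325X baseboard, derived from
# the per-link and per-GPU figures quoted in the text.
LINKS_PER_GPU = 7
LINK_BW_GB_S = 128            # GB/s per Infinity Fabric link
GPUS_PER_NODE = 8
HBM_PER_GPU_GB = 256

fabric_bw_per_gpu = LINKS_PER_GPU * LINK_BW_GB_S        # 896 GB/s
pooled_hbm_tb = GPUS_PER_NODE * HBM_PER_GPU_GB / 1000   # 2.048 TB

print(f"Infinity Fabric bandwidth per GPU: {fabric_bw_per_gpu} GB/s")
print(f"pooled HBM3E per node: {pooled_hbm_tb:.3f} TB")
```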
Infinity Fabric's per-link bandwidth trails NVIDIA's NVLink 4 interconnect, which the H200 uses in the HGX form factor. NVIDIA's NVSwitch 3 fabric in an HGX H200 node provides an all-to-all bisection bandwidth that exceeds what Infinity Fabric delivers at eight GPU scale. AMD has acknowledged this tradeoff: single-GPU and small-cluster advantages from greater memory capacity do not always translate to proportional multi-GPU training speedups because of the interconnect bottleneck.
The MI325X supports PCIe 5.0 x16 for host CPU communication alongside the Infinity Fabric peer links.
HBM3E is an extended, higher-speed variant of High Bandwidth Memory 3 (HBM3), whose JEDEC standard was published in 2022. It raises per-pin data rates above the 6.4 Gbps achievable with HBM3 while keeping the same physical interface dimensions. SK Hynix and Micron were among the first manufacturers to ship HBM3E in volume, with Samsung following. AMD's choice of supplier for the MI325X HBM3E stacks has not been publicly disclosed.
The practical advantage of 256 GB per accelerator is most visible in inference workloads for large models. A fully dense GPT-4-scale model (estimated around 1.8 trillion parameters in some analyses, though the exact count is not public) would require multiple accelerators regardless of memory size. More concretely, models in the roughly 70-to-200-billion-parameter range fit in one or two MI325X GPUs at FP16 precision, whereas they require more H200 GPUs to serve comparable batches. Llama 3.1 405B at BF16 occupies roughly 810 GB of weights; four MI325X GPUs (1,024 GB) hold the weights with headroom for KV cache, whereas six H200 GPUs (846 GB) barely cover the weights alone, with a seventh needed once KV cache is accounted for.
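A rough sizing helper makes the comparison concrete. The 20% KV-cache allowance below is an illustrative assumption; actual KV demand depends on batch size and context length:

```python
import math

def gpus_needed(params_b: float, bytes_per_param: float,
                gpu_mem_gb: int, kv_overhead: float = 0.2) -> int:
    """Estimate accelerators needed to hold a model's weights plus a
    KV-cache allowance. The 20% overhead is an illustrative assumption."""
    weights_gb = params_b * bytes_per_param   # params in billions * bytes -> GB
    total_gb = weights_gb * (1 + kv_overhead)
    return math.ceil(total_gb / gpu_mem_gb)

# Llama 3.1 405B at BF16 (2 bytes/param): ~810 GB of weights.
print(gpus_needed(405, 2.0, 256))  # MI325X (256 GB) -> 4
print(gpus_needed(405, 2.0, 141))  # H200 (141 GB)   -> 7
print(gpus_needed(70, 2.0, 256))   # a 70B model fits on one MI325X -> 1
```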
The 6.0 TB/s bandwidth directly accelerates decode throughput in autoregressive inference. Because each output token requires loading the full model's weight matrices and key/value cache from memory, bandwidth determines tokens-per-second at a given batch size more than compute does in low-to-moderate batch regimes. AMD claimed 1.3x the peak theoretical FP16 and FP8 throughput of the H200, a ratio that follows directly from the dense TFLOPS ratings (1,307.4 vs. 989.4 TFLOPS at FP16) rather than from the bandwidth increase.
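A first-order roofline sketch shows why bandwidth sets the decode ceiling: each decode step must stream the full weights (plus any KV cache) from HBM, so the token rate cannot exceed bandwidth divided by bytes moved per step. The figures below are illustrative assumptions, not measurements:

```python
def decode_tokens_per_sec(bandwidth_tb_s: float, params_b: float,
                          bytes_per_param: float,
                          kv_gb_per_seq: float = 0.0, batch: int = 1) -> float:
    """Bandwidth-bound ceiling on decode throughput: each decode step
    streams the full weights once (shared across the batch) plus each
    sequence's KV cache."""
    weights_gb = params_b * bytes_per_param
    gb_per_step = weights_gb + batch * kv_gb_per_seq
    steps_per_sec = bandwidth_tb_s * 1000 / gb_per_step
    return steps_per_sec * batch

# 70B model at FP8 (1 byte/param), batch 1, ignoring KV traffic:
print(f"MI325X: {decode_tokens_per_sec(6.0, 70, 1.0):.0f} tok/s ceiling")
print(f"H200:   {decode_tokens_per_sec(4.8, 70, 1.0):.0f} tok/s ceiling")
```

The 1.25x gap between the two ceilings mirrors the raw bandwidth ratio, which is the point: in this regime, bandwidth, not TFLOPS, is the binding constraint.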
At 1,000 W TDP, the MI325X consumes approximately 33% more power than the MI300X, with maximum transient power specified at 1.1 kW. This places it well above NVIDIA's H200 SXM5 (700 W), though still within the envelope that some air-cooled OAM platforms support; the absolute figure nonetheless demands significant power delivery and cooling infrastructure.
AMD lists two board power figures:

- 1,000 W rated TDP per OAM module
- 1,100 W maximum (transient) board power
The dominant deployment scenario for the MI325X in high-performance data centers is direct liquid cooling. The OAM modules plug into an OCP-compliant Universal Baseboard, with cold plates mounted on each module in liquid-cooled configurations. HPE's ProLiant Compute XD685, one of the first server platforms designed around the MI325X, offered both air and direct liquid cooling configurations in its eight-GPU, 5U chassis. Supermicro's H14 series similarly provided a liquid-cooled eight-GPU OAM board.
The power increase from 750 W to 1,000 W per GPU means an eight-GPU node draws roughly 8,000 W from the accelerators alone. Combined with host CPU, networking, and storage power, a dense MI325X node can approach 12-14 kW in total, requiring power distribution units and facility infrastructure rated accordingly.
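A back-of-the-envelope budget, with assumed (not vendor-specified) host-side figures, illustrates the arithmetic:

```python
# Back-of-the-envelope node power budget. The host-side numbers are
# illustrative assumptions, not vendor specifications.
GPU_TDP_W = 1_000
GPUS_PER_NODE = 8
HOST_CPUS_W = 2 * 500          # assumed: two EPYC-class host CPUs
NICS_STORAGE_FANS_W = 3_000    # assumed: networking, storage, fans, PSU losses

accelerators_w = GPUS_PER_NODE * GPU_TDP_W      # 8,000 W
node_w = accelerators_w + HOST_CPUS_W + NICS_STORAGE_FANS_W

print(f"accelerators: {accelerators_w / 1000:.1f} kW")
print(f"node total:   {node_w / 1000:.1f} kW")  # ~12 kW with these assumptions
```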
AMD and its OEM partners addressed this by retaining physical compatibility with MI300X server chassis wherever possible. The OAM socket and Universal Baseboard design are unchanged; operators upgrading from MI300X to MI325X primarily need to verify that existing liquid cooling loops can handle the higher thermal load and that power delivery circuits meet the 1,000 W per module spec.
Because the compute dies are identical, the MI325X's performance gains over the MI300X arise entirely from the memory subsystem upgrade. Operations that are compute-bound (i.e., where the GPU's matrix engines are the bottleneck) see little change. Operations that are memory-bandwidth-bound see gains proportional to the bandwidth increase, approximately 13% improvement in raw bandwidth.
The capacity increase of 33% (192 GB to 256 GB) yields larger practical gains for workloads that previously required model sharding across GPUs. A model of roughly 100 billion parameters (~200 GB of weights at FP16) can run on a single MI325X without tensor parallelism, whereas on an MI300X it would need FP8 quantization or sharding to fit within 192 GB alongside typical KV cache overhead. This consolidation reduces inter-GPU communication and simplifies deployment.
AMD's own benchmarking at launch cited a 1.3x improvement in inference throughput for large model serving compared to the MI300X on bandwidth-bound workloads.
AMD submitted its first MI325X results in the MLPerf Inference v5.0 round, with partner submissions from Supermicro, ASUS, and Gigabyte. The workloads tested included Llama 2 70B (server and offline scenarios) and Stable Diffusion XL (text-to-image generation).
On Llama 2 70B, MI325X partner results were within 3% of AMD's reference submission performance, and the accelerator traded blows with the H200 across offline and server scenarios. The MLPerf submission used FP8 quantization via the OCP FP8-e4m3 format, multi-step vLLM scheduling to reduce CPU overhead, and GEMM tuning targeting critical matrix operations.
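For illustration only, a vLLM configuration in the spirit of that submission might look like the sketch below; the checkpoint name and flag values are assumptions, and vLLM option names have shifted across releases:

```python
# Hypothetical vLLM setup echoing the MLPerf submission's techniques:
# FP8 weight quantization and multi-step scheduling. The model path,
# parallelism degree, and flag values are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed checkpoint
    quantization="fp8",           # OCP FP8-E4M3 weight quantization
    tensor_parallel_size=8,       # one eight-GPU MI325X node
    num_scheduler_steps=8,        # multi-step scheduling to cut CPU overhead
)

out = llm.generate(
    ["Explain high-bandwidth memory in one paragraph."],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(out[0].outputs[0].text)
```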
In MLPerf Inference v5.1 (September 2025), AMD submitted the MI355X in addition to the MI325X. The MI355X delivered 2.7x the tokens per second of the MI325X on Llama 2 70B server inference in FP8, reflecting the architectural improvements of CDNA 4 rather than a memory-only change.
The H200 carries 141 GB of HBM3E with 4.8 TB/s bandwidth. At a single accelerator level, the MI325X's advantages in memory capacity and bandwidth are real:
| Specification | AMD Instinct MI325X | NVIDIA H200 SXM5 |
|---|---|---|
| Architecture | CDNA 3 | Hopper |
| Process node | TSMC 5nm/6nm | TSMC 4N |
| Compute units / SMs | 304 CUs | 132 SMs |
| FP64 (TFLOPS) | 81.7 | 67.0 |
| FP32 (TFLOPS) | 163.4 | 67.0 |
| FP16 (TFLOPS) | 1,307.4 | 989.4 |
| FP8 (TFLOPS) | 2,614.9 | 1,978.8 |
| FP8 with sparsity (TFLOPS) | ~5,229.8 | 3,957.6 |
| HBM memory | 256 GB HBM3E | 141 GB HBM3E |
| Memory bandwidth | 6.0 TB/s | 4.8 TB/s |
| TDP | 1,000 W | 700 W |
| Inter-GPU interconnect | Infinity Fabric | NVLink 4 |
| Memory capacity advantage | 1.81x vs H200 | baseline |
| Memory bandwidth advantage | 1.25x vs H200 | baseline |
In terms of peak theoretical TFLOPS, the MI325X's numbers exceed the H200's because CDNA 3's Matrix Cores are rated for higher dense throughput at each precision. However, the H200 uses NVLink 4 and NVSwitch 3 for multi-GPU communication, which provides substantially higher all-to-all bisection bandwidth in eight-GPU nodes. This means the MI325X's single-GPU advantage partially erodes in multi-GPU training workloads.
AMD claimed the MI325X delivered an eight-GPU Llama 2 70B inference throughput within 3 to 7 percent of an equivalent eight-GPU H200 system, and image generation on SDXL within 10 percent of H200. Independent analysis from The Next Platform noted that single-device advantages in bandwidth mostly held, but scaling efficiency at eight GPUs was limited by Infinity Fabric compared to NVSwitch.
NVIDIA's Blackwell-based NVIDIA B200 was announced in March 2024 and entered production in late 2024. The B200 belongs to a different performance class: it carries 192 GB of HBM3E with 8.0 TB/s bandwidth and substantially higher FP8 compute (4,500+ TFLOPS), making direct comparisons with the MI325X less favorable for AMD. AMD positioned the MI325X against the H200 explicitly, not against Blackwell.
| Specification | AMD Instinct MI325X | NVIDIA B200 |
|---|---|---|
| Architecture | CDNA 3 | Blackwell |
| FP16 (TFLOPS) | 1,307.4 | ~2,250 |
| FP8 (TFLOPS) | 2,614.9 | ~4,500 |
| HBM memory | 256 GB HBM3E | 192 GB HBM3E |
| Memory bandwidth | 6.0 TB/s | 8.0 TB/s |
| TDP | 1,000 W | 1,000 W |
The MI325X holds a memory capacity edge over the B200, but the B200 exceeds it in bandwidth and compute. AMD's response to Blackwell was the CDNA 4-based MI350 series, not the MI325X.
The MI325X was announced with commercial support from a broad set of OEM server builders and cloud providers.
At the Advancing AI 2024 event, AMD confirmed system support from Dell Technologies, Eviden, Gigabyte, Hewlett Packard Enterprise, Lenovo, and Supermicro. Production shipments from AMD started in Q4 2024, with OEM system availability broadening in Q1 2025.
HPE's ProLiant Compute XD685 was among the first validated platforms. It houses eight MI325X OAM modules alongside two AMD EPYC 9005-series CPUs in a 5U chassis supporting both air and direct liquid cooling. Supermicro's H14 8U 8-GPU MI325X platform offered a similarly dense configuration with optional liquid cooling.
Vultr was the first cloud provider to make MI325X instances commercially available, launching in early 2025 with configurations pairing eight MI325X GPUs and 2.048 TB of pooled HBM3E. Vultr marketed the offering to enterprises that needed large-memory GPU instances for LLM serving, RAG pipelines, and fine-tuning.
Microsoft Azure had already deployed AMD Instinct MI300X accelerators in its ND MI300X V5 virtual machine series to power Azure OpenAI Service workloads. At the Advancing AI 2024 event, Microsoft was named among the cloud and AI ecosystem partners supporting the MI325X roadmap, though Azure's public MI325X VM SKUs were not announced at that time.
Oracle Cloud Infrastructure (OCI) has been a consistent AMD Instinct partner for both EPYC CPUs and Instinct accelerators. OCI announced expanded AMD EPYC compute instances at the same event, maintaining its status as a key AMD cloud partner.
Google Cloud participated in the Advancing AI 2024 ecosystem announcements. Its primary AMD deployments have been EPYC-based rather than Instinct GPU-based in public disclosures.
Meta was named among customers participating in AMD's launch announcements. Meta has deployed AMD Instinct accelerators as part of its diversified AI infrastructure strategy.
By mid-2025, the cloud GPU rental market showed MI325X pricing typically ranging from $2.00 to $2.25 per GPU per hour, compared to H200 pricing of approximately $3.72 to $10.60 per GPU per hour depending on provider and contract terms. The MI325X's lower rental price, combined with its larger memory, made it attractive for inference workloads that were not compute-bound.
The MI325X runs on AMD's ROCm (Radeon Open Compute) open-source software platform. ROCm provides the GPU runtime, HIP (Heterogeneous-compute Interface for Portability), math libraries (rocBLAS, MIOpen, rocFFT), and communication libraries (RCCL, AMD's equivalent of NCCL).
Framework support covers PyTorch, TensorFlow, and JAX via ROCm backends. vLLM and SGLang, two widely used LLM inference frameworks, added MI300X and MI325X support throughout 2024, making inference deployment accessible without custom kernel development. ROCm 6.x releases during 2024 and into 2025 substantially improved PyTorch and JAX compatibility.
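Because ROCm builds of PyTorch expose HIP devices through the familiar torch.cuda namespace, most CUDA-targeted scripts run unchanged; a quick sanity check looks like this:

```python
# Quick sanity check on a ROCm build of PyTorch: HIP devices surface
# through the torch.cuda namespace, so CUDA-targeted code runs unchanged.
import torch

print(torch.cuda.is_available())      # True on a working ROCm install
print(torch.version.hip)              # ROCm/HIP version (None on CUDA builds)
print(torch.cuda.get_device_name(0))  # e.g. an Instinct accelerator name

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
y = x @ x  # dispatched to ROCm BLAS kernels under the hood
print(y.shape)
```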
ROCm's software maturity still trails CUDA's in several areas. Most AI frameworks have been developed and optimized for NVIDIA's CUDA ecosystem first. Operators running training workloads at scale reported that achieving performance parity with H200 on MI300X or MI325X required more effort: custom Docker builds, AMD engineering engagement, and manual GEMM tuning. RCCL, used for collective communication in distributed training, has shown lower efficiency than NCCL for some topology configurations. NVIDIA's tight vertical integration between its networking (InfiniBand, NVLink), communication libraries, and CUDA gave H200 clusters an advantage in multi-node training throughput that was not fully offset by AMD's memory capacity lead.
For inference workloads with vLLM or SGLang, the ROCm experience in 2024-2025 was substantially more accessible than earlier generations, with most popular models running without modification.
AMD has not published a suggested retail price for the MI325X. Data center GPU accelerators at this tier are sold through OEM channel deals and direct enterprise agreements rather than consumer retail. List prices for the MI300X in the $10,000-$15,000 USD per unit range have been cited by third-party market researchers, and the MI325X is expected to be priced comparably given the incremental nature of the upgrade.
Cloud rental pricing (as of mid-2025) places MI325X instances at approximately $2.00-$2.25 per GPU per hour. For reference, H200 SXM5 cloud instances on major hyperscalers ranged from roughly $3.72 to higher depending on provider and reservation type. The more favorable rental economics for the MI325X made it competitive for cost-sensitive inference deployments, particularly when memory capacity requirements ruled out H100.
Several limitations have shaped MI325X adoption relative to NVIDIA's offerings.
No architectural compute improvement over MI300X. Because the compute dies are unchanged, the MI325X handles compute-bound training workloads no better than its predecessor. Customers who needed more raw TFLOPS rather than more memory saw no benefit from upgrading.
Higher power draw. The jump from 750 W to 1,000 W TDP affects total cost of ownership. Data centers optimized for MI300X power budgets need reassessment before deploying MI325X. Liquid cooling infrastructure becomes more strongly recommended rather than optional.
Infinity Fabric scaling ceiling. As noted in independent benchmarks, the Infinity Fabric interconnect limits scaling efficiency at eight-GPU node configurations relative to NVLink 4/NVSwitch 3. Training on large models that require tight GPU-to-GPU communication sees less benefit from the MI325X's memory advantages.
ROCm ecosystem maturity. Despite improvements in 2024-2025, the ROCm software ecosystem continues to require more operator expertise than CUDA for production deployments. Sparse library support, less mature debugging tooling, and lower community availability of ROCm-specific optimizations create real friction in training pipelines.
Rapid succession by MI350 series. AMD's MI350X and AMD Instinct MI355X launched in 2025 with CDNA 4 architecture, 288 GB of HBM3E, and 8.0 TB/s bandwidth. The short interval between MI325X availability and MI355X availability compressed the MI325X's window as AMD's flagship accelerator, and some cloud providers moved to offering MI355X rather than building out MI325X capacity.
No FP4 support. The CDNA 3 architecture does not natively support FP4 or FP6 data types that NVIDIA's Blackwell GPUs and AMD's own CDNA 4 chips added. For workloads using aggressive quantization to maximize throughput per watt, MI325X is at a disadvantage compared to B200 and MI355X.
AMD's accelerator roadmap places the MI325X as the last CDNA 3 product before the CDNA 4-based MI350 series.
The MI350X (air-cooled) and AMD Instinct MI355X (liquid-cooled) are built on the CDNA 4 architecture, manufactured on TSMC's 3nm process node. AMD production shipments of MI350 platforms began in May 2025. Both share:

- the CDNA 4 compute architecture on TSMC 3nm
- 288 GB of HBM3E memory capacity
- 8.0 TB/s of memory bandwidth
- native support for FP4 and FP6 data types
In MLPerf Inference v5.1, the MI355X achieved 2.7x the Llama 2 70B tokens per second of the MI325X. AMD also claims up to 35x faster inference performance versus the MI300X on certain configurations, though these figures represent favorable workload selection.
AMD's MI400 series, powered by CDNA "Next" architecture, is planned for 2026 with the "Helios" rack architecture. AMD has previewed UALink over Ethernet connectivity in the MI400 platform, targeting a 72-accelerator scale-up domain analogous to NVIDIA's NVL72 configuration. AMD projects up to 10x performance improvement over the MI350 series for AI frontier model workloads.
| Specification | Value |
|---|---|
| Launch date | October 10, 2024 |
| Architecture | CDNA 3 (Aqua Vanjaram) |
| Process node | TSMC N5 (compute), TSMC N6 (I/O) |
| Transistor count | 153 billion |
| Compute units | 304 |
| FP64 performance | 81.7 TFLOPS |
| FP32 performance | 163.4 TFLOPS |
| FP16/BF16 performance | 1,307.4 TFLOPS |
| FP8 performance | 2,614.9 TFLOPS |
| FP8 with sparsity | ~5,229.8 TFLOPS |
| Memory type | HBM3E |
| Memory capacity | 256 GB |
| Memory bandwidth | 6.0 TB/s |
| Memory bus width | 8,192 bits |
| L2 / Infinity Cache | 256 MB |
| Inter-GPU interconnect | Infinity Fabric (7x links, 128 GB/s each) |
| Host interface | PCIe 5.0 x16 |
| Form factor | OAM (Open Accelerator Module) |
| TDP | 1,000 W |
| Max power | 1,100 W |
| Predecessor | AMD Instinct MI300X (750 W, 192 GB HBM3) |
| Successor | AMD Instinct MI355X (CDNA 4, 288 GB HBM3E) |