The AMD Instinct MI300X is a data center GPU accelerator released by Advanced Micro Devices on December 6, 2023. Built on the CDNA 3 architecture, it uses a multi-chiplet design combining eight GPU compute dies and four I/O dies on a single package, paired with 192 GB of HBM3 memory and 5.3 TB/s of memory bandwidth. At launch, AMD positioned the MI300X as a direct competitor to NVIDIA's H100, emphasizing its memory capacity advantage as particularly relevant for large language model inference.
The MI300X represents AMD's most significant attempt to challenge NVIDIA's dominance in the AI accelerator market. It launched alongside the MI300A, a related APU variant that combines CPU and GPU cores on the same package, and came with AMD's stated goal of generating $400 million in MI300-series revenue in 2023, a figure that AMD subsequently revised upward multiple times as demand exceeded initial projections. By the end of 2024 AMD reported more than $5 billion in data center GPU revenue, the bulk of it from MI300X sales, and the company crossed $1 billion in MI300X revenue in a single quarter (Q2 2024) within nine months of launch.
The MI300X was the first AMD accelerator that customers and analysts treated as a credible alternative to the NVIDIA H100 rather than as a niche HPC part. It became the GPU that hyperscalers used to diversify their AI fleets in 2024, and the foundation for AMD's Q4 2024 MI325X refresh and the CDNA 4 MI350 series in 2025. Independent benchmarks from SemiAnalysis, ChipsAndCheese, and academic groups have repeatedly confirmed two themes: the hardware is real and competitive, and the software ecosystem still trails NVIDIA's CUDA stack despite rapid improvement.
AMD entered the data center GPU accelerator market in 2016 with the original Radeon Instinct product line, which targeted scientific computing and machine learning workloads. The Instinct branding distinguished these professional accelerators from AMD's consumer Radeon GPUs. The early Instinct cards used graphics-derived architectures with limited software maturity, and AMD struggled to gain traction against NVIDIA's established CUDA ecosystem.
The introduction of the CDNA (Compute DNA) architecture in late 2020 marked a deliberate shift away from graphics-derived designs. AMD released the Instinct MI100 in November 2020, AMD's first GPU built from the ground up for compute rather than graphics. The MI100 used the "Arcturus" die with 7,680 stream processors, 32 GB of HBM2 memory, and support for the then-new BF16 data type. While performance was competitive in certain workloads, software maturity remained a persistent challenge.
In 2021, AMD launched the MI250 and MI250X using the CDNA 2 architecture. The MI250X paired two "Aldebaran" dies in a multi-chip module, delivering 128 GB of HBM2e memory and up to 383 TFLOPS of FP16 performance. The Department of Energy's Frontier supercomputer at Oak Ridge National Laboratory adopted the MI250X, and Frontier became the world's first verified exascale supercomputer in June 2022. That deployment validated AMD's architecture for scientific HPC workloads but did not translate immediately into broad commercial AI adoption.
With MI300X, AMD aimed to convert its HPC credibility into AI infrastructure wins during the generative AI boom of 2023. The product was conceived as a CPU plus GPU unified package (the MI300A) and re-spun as a GPU-only design (the MI300X) when AMD recognized that the AI training and inference market wanted maximum GPU memory and compute, not CPU integration.
The MI300 program began as an HPC APU, originally targeted at the El Capitan supercomputer at Lawrence Livermore National Laboratory. AMD added the GPU-only MI300X variant in 2022 after demand from cloud customers for an AI accelerator with more compute and memory than any then-shipping competitor. AMD swapped three of the MI300A's CPU chiplets for two additional XCDs in the MI300X, increasing GPU compute and memory at the cost of removing the on-package CPU. The two products share the same socket, IOD, and HBM3 stacks, which let AMD amortize design and packaging investment across both SKUs.
AMD unveiled the MI300X at its "Advancing AI" event on December 6, 2023, held in San Jose, California. AMD CEO Lisa Su delivered the keynote, describing the MI300X as "the most advanced AI accelerator in the industry." AMD's initial performance claims focused on inference workloads, where the company said the MI300X delivered 1.6x faster throughput than the H100 on Bloom 176B and 1.4x faster throughput on Llama 2 70B.
Microsoft was announced as the launch cloud partner. Microsoft Azure planned to deploy MI300X accelerators for its Azure OpenAI Service, which at the time powered ChatGPT and GPT-4 inference at scale. AMD also announced commitments from Meta and Oracle at or shortly after the event. SemiAnalysis, a semiconductor research firm, published an analysis the same day noting that the MI300X's memory capacity and bandwidth gave it a concrete hardware advantage over the H100 for large model inference, while raising questions about software maturity.
The launch came during a period of acute GPU scarcity. NVIDIA's H100 had months-long lead times, and hyperscalers were actively seeking alternative supply. AMD's timing was deliberate: the company had been preparing MI300X production through 2023 and aligned the announcement with cloud provider commitments that were already in progress. The keynote also previewed customer endorsements from OpenAI (in the form of planned MI300X support in the Triton compiler), Meta, Microsoft, Oracle, Dell, HPE, Lenovo, and Supermicro.
In the week after the event, NVIDIA published a blog post disputing AMD's H100 comparison numbers. NVIDIA argued that AMD had compared an unoptimized H100 configuration against a tuned MI300X stack, and published H100 results under TensorRT-LLM that exceeded AMD's quoted figures. AMD responded the next day with updated benchmarks claiming the MI300X retained an inference lead even after applying NVIDIA's recommended optimizations. The exchange foreshadowed a recurring pattern: both vendors' marketing benchmarks favor their own hardware, and independent third-party measurements tend to land somewhere between the two sets of claims.
The MI300X is the flagship implementation of AMD's CDNA 3 compute architecture. CDNA 3 is the successor to CDNA 2 and was designed exclusively for data center workloads, sharing no design lineage with AMD's consumer RDNA graphics architecture. The architecture targets HPC and AI in a single package and is the first AMD compute architecture to support FP8 numerics, 2:4 structured sparsity, and TF32 matrix operations.
CDNA 3 introduced several compute enhancements over CDNA 2. Each CDNA 3 compute unit (CU) contains 64 stream processors, unchanged from CDNA 2, but the architecture adds a new generation of matrix math units capable of executing native FP8 operations alongside BF16 and FP16 matrix operations. FP8 support was significant because many AI training and inference workflows were beginning to adopt FP8 quantization to reduce memory footprint and increase throughput per GPU. CDNA 3 also increased the shared L2 cache to 4 MB per XCD and introduced a shared last-level cache called the Infinity Cache (also referred to as MALL, for Memory Attached Last Level).
The MI300X has 304 active compute units, out of a maximum of 320 in a fully enabled eight-XCD configuration. The 16 disabled compute units represent a yield-driven decision common in high-transistor-count designs. With 64 stream processors per CU, the MI300X exposes 19,456 stream processors. Each compute unit also has four matrix cores, giving the chip 1,216 matrix cores in total.
New matrix math features in CDNA 3 include 2:4 structured sparsity, which doubles effective throughput on supported precisions when two of the weights in each four-element group are zero, and an asynchronous matrix multiply pipeline that overlaps load and compute stages. The MI300X also added support for TF32, which pairs an FP32-range exponent with a reduced, roughly FP16-precision mantissa, trading precision for substantially higher matrix throughput than full FP32.
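A minimal NumPy sketch (illustrative only, not AMD's pruning implementation) of what the 2:4 constraint means at the data level: in every contiguous group of four weights, two are zeroed, which is the pattern the matrix cores exploit to double effective throughput.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Illustrative 2:4 pruning: keep the two largest-magnitude values
    in every contiguous group of four, zero the other two."""
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest-magnitude entries in each group of four.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(8, 16).astype(np.float32)
w_sparse = prune_2_4(w)
# Exactly half of the weights are now zero, the structured pattern the
# matrix cores can skip over on supported precisions.
assert np.count_nonzero(w_sparse) == w.size // 2
```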
The MI300X is built from 12 individual silicon dies: eight Accelerator Complex Dies (XCDs) and four I/O Dies (IODs). At the time of its launch it was one of the most complex chiplet assemblies in any commercial processor.
Each XCD is manufactured on TSMC's N5 (5nm) process node and contains 38 active compute units (40 physically present, with two disabled for yield purposes). The XCDs are stacked vertically in pairs on top of the IODs using TSMC's SoIC (System on Integrated Chip) hybrid bonding technology. This 3D stacking places two XCDs directly atop a single IOD, with the connection made through dense arrays of fine-pitch copper bonds carrying tens of terabytes per second of aggregate bandwidth.
The four IODs serve as the system infrastructure layer. Each IOD is manufactured on TSMC's N6 (6nm) process node and manages memory access, I/O connectivity, and inter-die communication for the two XCDs sitting above it. The IODs connect to the HBM3 stacks and expose the PCIe 5.0 host interface and the xGMI links that connect MI300X GPUs to each other.
The Infinity Fabric interconnect runs between IODs at high bandwidth, and the IODs use a mesh topology internally to maintain coherency across the full 304-CU compute space. AMD's official total transistor count for the MI300X package is 153 billion, which it cites in datasheets and at Hot Chips 2024. Tom's Hardware reported a measured count of 146 billion based on delidded photography, and both figures appear in third-party coverage. The discrepancy reflects different counting methodologies. AMD's 153 billion is the figure used in this article.
The full die area exceeds 1,000 mm² of active silicon across the package, far larger than what could fit on a single reticle. The chiplet approach was a necessity rather than a preference: a monolithic implementation of MI300X would not be manufacturable on N5 with current photolithography reticle limits.
AMD also offers GPU partitioning on the MI300X. In the NPS4 memory mode (four NUMA partitions per socket), the 192 GB of HBM3 is divided into four 48 GB NUMA domains, and the compute-partitioned modes present the GPU to the operating system as multiple smaller logical devices. This allows operators to run multiple independent workloads on a single physical accelerator or to improve memory locality for NUMA-sensitive applications. The default SPX (single partition) mode keeps all 304 CUs under a single device.
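A minimal sketch, assuming a ROCm build of PyTorch on an MI300X node, of how an operator might confirm which partitioning mode is active by enumerating the logical devices the driver exposes (device counts and sizes depend on the configured mode):

```python
import torch  # ROCm builds of PyTorch expose devices through torch.cuda

# In SPX mode an 8-GPU MI300X node reports 8 devices of ~192 GB each;
# in partitioned modes each physical GPU appears as several smaller
# logical devices (e.g. ~48 GB apiece under a four-way memory split).
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"device {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
```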
The MI300X carries 192 GB of HBM3 memory across eight stacks, each providing 24 GB. The aggregate theoretical bandwidth is 5.3 TB/s. This was the largest memory capacity on any commercially available GPU accelerator at launch and 1.5 times the MI250X's 128 GB of HBM2e.
The memory advantage over NVIDIA's H100 SXM5 was substantial at launch. The H100 SXM5 carries 80 GB of HBM3 at 3.35 TB/s, so the MI300X provides 2.4 times the memory and 1.6 times the bandwidth per accelerator. For large model inference, this difference has practical consequences: models too large to fit on a single H100 fit on a single MI300X, reducing the need for tensor parallelism across multiple GPUs and the associated inter-GPU communication overhead. A 70-billion-parameter Llama model in FP16 occupies approximately 140 GB; it fits on one MI300X but requires sharding across two H100s.
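The sizing argument reduces to simple arithmetic over weight bytes (ignoring KV cache and activations, which add further memory pressure); a rough sketch:

```python
import math

def weight_gb(params_billion: float, bytes_per_param: int) -> float:
    """Memory needed for model weights alone, in GB (1e9 params x bytes / 1e9)."""
    return params_billion * bytes_per_param

def gpus_needed(params_billion: float, bytes_per_param: int, hbm_gb: int) -> int:
    return math.ceil(weight_gb(params_billion, bytes_per_param) / hbm_gb)

print(weight_gb(70, 2))          # 140.0 GB of FP16 weights for a 70B model
print(gpus_needed(70, 2, 192))   # 1 -> fits on a single MI300X
print(gpus_needed(70, 2, 80))    # 2 -> must be sharded across two H100s
```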
In addition to the HBM3 main memory, the MI300X includes 256 MB of Infinity Cache (L3 cache), distributed across the four IODs. Measured internal bandwidth to the Infinity Cache reaches approximately 11.9 TB/s, substantially higher than the HBM3 bandwidth, making cache-resident data accesses extremely fast. ChipsAndCheese measured roughly the same number in their independent testing and observed that the Infinity Cache hides much of the HBM access latency for kernels with reasonable spatial locality.
The cache hierarchy has multiple levels:
| Level | Size | Approximate bandwidth |
|---|---|---|
| L1 cache | 32 KB per CU | Tens of TB/s |
| L2 cache | 4 MB per XCD shared by 38 CUs | High |
| Infinity Cache (MALL) | 256 MB shared across all four IODs | ~11.9 TB/s |
| HBM3 | 192 GB total | 5.3 TB/s |
Memory latency is one area where the MI300X does not lead. ChipsAndCheese measured H100 access latency at roughly 57 percent of the MI300X's, partly attributable to TLB miss handling on the MI300X and partly to the larger physical extent of the MI300X package. Latency is comparable when work is split across multiple workgroups, which reduces TLB pressure.
The MI300X exposes the following key hardware parameters:
| Parameter | Value |
|---|---|
| Architecture | CDNA 3 |
| Process | TSMC N5 (XCDs), TSMC N6 (IODs) |
| Chiplets | 8 XCDs + 4 IODs + 8 HBM3 stacks |
| Transistors | 153 billion (AMD), 146 billion (third-party measurement) |
| Compute units | 304 active (320 total) |
| Stream processors | 19,456 |
| Matrix cores | 1,216 |
| Peak engine clock | 2,100 MHz |
| HBM3 capacity | 192 GB across 8 stacks |
| HBM3 bandwidth | 5.3 TB/s |
| Memory bus width | 8,192-bit |
| Infinity Cache | 256 MB |
| L2 per XCD | 4 MB |
| Host interface | PCIe 5.0 x16 (128 GB/s) |
| Inter-GPU | 7 xGMI links, 128 GB/s each (64 GB/s per direction) |
| Form factor | OAM (OCP Accelerator Module) |
| Rated TDP | 750 W |
AMD published the following theoretical peak performance figures for the MI300X at the 2,100 MHz peak boost engine clock:
| Precision | Peak performance (dense) | With 2:4 sparsity |
|---|---|---|
| FP64 vector | 81.7 TFLOPS | n/a |
| FP64 matrix | 163.4 TFLOPS | n/a |
| FP32 matrix | 163.4 TFLOPS | n/a |
| TF32 matrix | 653.7 TFLOPS | 1,307.4 TFLOPS |
| BF16 matrix | 1,307.4 TFLOPS | 2,614.9 TFLOPS |
| FP16 matrix | 1,307.4 TFLOPS | 2,614.9 TFLOPS |
| FP8 matrix | 2,614.9 TFLOPS | 5,229.8 TFLOPS |
| INT8 | 2,614.9 TOPS | 5,229.8 TOPS |
These figures assume dense arithmetic without sparsity unless otherwise noted. AMD followed a convention NVIDIA introduced with the A100 by also publishing sparsity-enabled peak numbers at 2x the dense figures for supported precisions.
In an eight-GPU MI300X Platform, the aggregate peak figures multiply by eight: 10.5 PFLOPS BF16 dense, 20.9 PFLOPS FP8 dense, and 41.8 PFLOPS FP8 with structured sparsity, on top of 1.5 TB of total HBM3 capacity.
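The table's figures are mutually consistent with the 304-CU count and the 2,100 MHz clock; the per-CU operations-per-clock values below are inferred by dividing the quoted peaks by (CUs × clock) rather than taken from a separate AMD disclosure:

```python
CUS = 304            # active compute units
CLOCK_HZ = 2.1e9     # 2,100 MHz peak engine clock

def peak_tflops(flops_per_cu_per_clock: int) -> float:
    """Theoretical peak = CUs x clock x operations per CU per clock."""
    return CUS * CLOCK_HZ * flops_per_cu_per_clock / 1e12

print(peak_tflops(128))    # ~81.7 TFLOPS  (FP64 vector)
print(peak_tflops(2048))   # ~1,307 TFLOPS (FP16/BF16 matrix, dense)
print(peak_tflops(4096))   # ~2,615 TFLOPS (FP8 matrix, dense)

# Eight-GPU MI300X Platform aggregates quoted above:
print(8 * peak_tflops(2048) / 1000)   # ~10.5 PFLOPS BF16 dense
print(8 * peak_tflops(4096) / 1000)   # ~20.9 PFLOPS FP8 dense
print(8 * 192)                        # 1,536 GB = 1.5 TB of HBM3
```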
Benchmarks from independent researchers and cloud operators showed a significant gap between theoretical and sustained throughput. SemiAnalysis, which conducted extensive training benchmarks published in late 2024, found that the MI300X achieved roughly 620 TFLOPS in sustained BF16 matrix operations, compared to approximately 720 TFLOPS for the H100. For FP8, the H100 achieved around 1,280 TFLOPS while the MI300X reached approximately 990 TFLOPS.
The utilization efficiency difference was large: the H100 achieved around 73 percent of its rated BF16 peak, while the MI300X achieved around 47 percent. AMD's hardware delivers more raw compute capacity on paper, but the software stack, particularly kernel-level optimizations and tuned libraries, had not yet closed the gap with NVIDIA's mature cuBLAS and cuDNN implementations. SemiAnalysis titled the report "CUDA Moat Still Alive" to summarize the gap, while noting that AMD's MFU was actively improving over the report's five-month observation window.
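The utilization percentages follow directly from the sustained and dense peak figures quoted above:

```python
def utilization(sustained_tflops: float, dense_peak_tflops: float) -> float:
    """Fraction of the vendor's dense peak actually sustained."""
    return sustained_tflops / dense_peak_tflops

print(f"MI300X BF16: {utilization(620, 1307):.0%}")   # ~47%
print(f"H100  BF16: {utilization(720, 990):.0%}")     # ~73%
```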
A notable detail from the SemiAnalysis investigation was that the H100 numbers reflected out-of-the-box performance with no special tuning, while the MI300X numbers required custom Docker images, environment variables, and direct AMD engineering involvement to reach the levels measured. The same reviewers found that many MI300X workloads stalled below 150 TFLOPS in default configurations because of bugs in attention backward passes and torch.compile.
For inference workloads, particularly large language model inference where memory capacity is the binding constraint, the MI300X performed more competitively. Tests running 70B-parameter models showed the MI300X delivering better throughput than the H100 in single-GPU configurations, largely because the H100 requires model sharding across multiple GPUs at that scale while the MI300X can hold the full model in memory. By late 2024, several inference benchmarks (Moreh, AMD ROCm Blogs, Oracle Cloud) were showing MI300X matching or exceeding H100 throughput on Llama 70B inference, particularly at batch sizes where H100 systems are limited by memory.
In early 2025, after the release of DeepSeek-R1 and DeepSeek V3, several groups demonstrated that MI300X could serve large mixture-of-experts models at competitive throughput. AMD published results in which a single 8-GPU MI300X node served DeepSeek R1 671B at sub-50 ms inter-token latency for up to 128 concurrent requests using SGLang with AITER MoE kernels. Moreh, a Korean software vendor, reported reaching 21,224 tokens per second on the same model and node configuration, close to the 22,282 tokens per second SGLang reported on an 8x H100 node. SemiAnalysis's InferenceMAX benchmark found that for very large models such as Llama 3.1 405B and DeepSeek V3 670B, MI300X beat H100 in absolute performance and in performance per dollar, where the larger memory advantage matters most.
The MI300X has a rated Thermal Design Power (TDP) of 750 W. This is higher than the H100 SXM5's 700 W TDP and substantially higher than previous AMD Instinct accelerators. The MI300X uses the OAM (OCP Accelerator Module) form factor, which is an Open Compute Project standard for high-power AI accelerators. OAM-based systems use direct liquid cooling or high-airflow forced-air cooling to manage the thermal load.
At 750 W per accelerator, deploying MI300X at scale requires power infrastructure rated for roughly 6 kW per server in an 8-GPU configuration before accounting for CPU, memory, networking, and power supply losses. Real 8-GPU MI300X chassis from Supermicro, Dell, HPE, and Lenovo land in the 8 to 10 kW range with full system overhead. This places the MI300X at the edge of what air-cooled data centers can reliably operate and drives interest in direct liquid cooling, which the next-generation 1,400 W MI355X effectively requires.
The primary product is a single-GPU OAM module that mounts directly onto a Universal Baseboard (UBB). The OAM module exposes a 16-lane PCIe 5.0 host link (128 GB/s bidirectional) and seven xGMI links for GPU-to-GPU communication. Each xGMI link provides 64 GB/s of raw bandwidth per direction (128 GB/s bidirectional), with effective per-direction bandwidth of approximately 48 to 50 GB/s after CRC and protocol overhead.
AMD's reference design is the Instinct MI300X Platform, an 8-GPU UBB 2.0 baseboard measuring 417 mm by 553 mm. The Platform exposes 1.5 TB of HBM3 (eight times 192 GB), 5.3 TB/s of bandwidth per GPU, and 896 GB/s of aggregate peer-to-peer GPU bandwidth. Each GPU's seven xGMI links connect it to the other seven GPUs in a fully meshed topology. The Platform occupies a similar physical and power envelope to competing 8-GPU baseboards from NVIDIA (HGX H100) and Intel (Gaudi), which simplifies thermal and power infrastructure for OEM partners.
OEM 8-GPU MI300X systems shipped from Dell (PowerEdge XE9680), HPE (Cray and ProLiant variants), Lenovo (ThinkSystem SR685a V3), and Supermicro (AS-8125GS-TNMR2 and others). All share the UBB 2.0 baseboard design, with vendor-specific differences in host CPU (typically AMD EPYC Genoa or Intel Xeon Sapphire Rapids), system memory (typically 2 TB), networking (often eight 400 GbE NICs), and storage. Cluster-scale builds rely on RoCEv2 over Ethernet or InfiniBand for inter-node networking, which is where the MI300X's weaker collective-operation performance is most visible.
AMD did not ship a PCIe add-in-card MI300X variant for general retail, in contrast to the MI210 (CDNA 2) which had a PCIe form factor. The MI300X is OAM-only at the module level. Some OEMs and cloud providers package single MI300X modules into purpose-built carrier boards, but the supported reference platform is the 8-GPU UBB.
ROCm (Radeon Open Compute) is AMD's open-source GPU computing platform, serving as the primary software stack for MI300X workloads. ROCm provides the runtime, compilers, math libraries, and machine learning framework integrations needed to run AI workloads on AMD hardware.
ROCm includes HIP (Heterogeneous-computing Interface for Portability), a C++ API that mirrors CUDA's programming model. Existing CUDA code can often be translated to HIP using AMD's hipify tool, which handles the mechanical renaming of CUDA APIs to their HIP equivalents. AMD claims that a large fraction of CUDA code translates automatically, but workloads using hand-written CUDA kernels, intrinsics, or architecture-specific optimizations require manual porting work.
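At the framework level the porting story is simpler than at the kernel level: ROCm builds of PyTorch reuse the torch.cuda namespace, so device-agnostic PyTorch code generally runs unchanged on MI300X. A minimal check of which backend is actually in use (illustrative; torch.version.hip is populated only on ROCm builds):

```python
import torch

if torch.cuda.is_available():
    # On a ROCm build torch.version.hip is a version string and
    # torch.version.cuda is None; on a CUDA build the reverse holds.
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    print(f"{torch.cuda.get_device_name(0)} via {backend}")

    # The same high-level code path (device="cuda") targets either stack;
    # hand-written CUDA kernels are what still need hipify or a rewrite.
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    y = x @ x  # dispatched to hipBLASLt/rocBLAS on ROCm, cuBLAS on CUDA
    print(y.sum().item())
```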
Key ROCm library components relevant to MI300X workloads include:

- rocBLAS and hipBLASLt for dense linear algebra and GEMM
- MIOpen for deep learning primitives
- RCCL for multi-GPU collective communication, API-compatible with NVIDIA's NCCL
- Composable Kernel, a template library used for fused attention and GEMM kernels
- hipFFT/rocFFT and rocSPARSE for FFT and sparse linear algebra
ROCm 6, released in early 2024, improved PyTorch and TensorFlow integration and lifted performance on several LLM inference kernels. AMD also introduced FlashAttention support for CDNA 3 in ROCm 6, though it arrived several months after the CUDA implementation became available.
ROCm 6.2, released in mid-2024, brought further improvements: Composable Kernel-based FlashAttention-2, OpenAI Triton-based FlashAttention-2, native FP8 support paths, and tuned vLLM and PyTorch images. ROCm 6.3 added FP8 inference quality of life improvements and the AITER kernel library that lifted MoE inference performance by a measured 3x for DeepSeek-class workloads. ROCm 6.4 introduced the QuickReduce all-reduce kernel that delivered up to 3x speedups on certain collective patterns versus stock RCCL.
The software maturity gap with CUDA remained a consistent criticism. Third-party benchmarks showed that achieving competitive training throughput on MI300X required non-trivial configuration: custom Docker images built from source, environment variable tuning flags not set by default, and in some cases direct AMD engineering support. By contrast, NVIDIA's pre-built containers and optimized libraries generally worked without customization. AMD's response was to ship an increasing number of prebuilt vLLM, SGLang, and PyTorch Docker images aligned to specific MI300X tuning targets, and to expand the ROCm catalog of validated models on the Hugging Face Hub.
The most active integration work happened around inference. By late 2024 the MI300X was a first-class citizen in:

- vLLM, with an upstream ROCm backend and AMD-published tuned Docker images
- SGLang, used for the DeepSeek serving results described above
- Hugging Face libraries and Text Generation Inference, validated against MI300X
- OpenAI Triton, which gained a CDNA 3 backend used for custom kernels
- Upstream PyTorch, with official ROCm wheels and nightly builds
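As an illustration of the inference path, a minimal vLLM sketch (the model name and settings are illustrative, not a documented AMD configuration); vLLM's ROCm backend exposes the same Python API it uses on CUDA:

```python
from vllm import LLM, SamplingParams

# A 70B-parameter model in BF16 (~140 GB of weights) fits in one MI300X's
# 192 GB of HBM3, so no tensor parallelism is needed for this sketch.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model choice
          tensor_parallel_size=1,
          dtype="bfloat16")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain chiplet packaging in one paragraph."], params)
print(outputs[0].outputs[0].text)
```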
Hugging Face announced an expanded collaboration with AMD in mid-2024, formally validating its model libraries on MI300X and Azure ND MI300X v5. The partnership extended to OpenAI Triton kernel work and the publication of optimized images for Llama 3 inference. Hugging Face's MI300 launch blog noted 2x to 3x improvements in time-to-first-token compared to MI250 on Llama 3 70B and observed that the larger 192 GB HBM3 footprint allowed full single-device fine-tuning of 70B models in BF16 without offload.
Alongside the MI300X, AMD released the Instinct MI300A, a heterogeneous APU (Accelerated Processing Unit) that integrates CPU cores, GPU compute, and HBM3 memory on a single package. The MI300A combines 24 Zen 4 CPU cores with a CDNA 3 GPU containing 228 active compute units and 128 GB of HBM3 memory shared between the CPU and GPU.
The shared memory architecture eliminates the PCIe transfer bottleneck for CPU-GPU data movement, which is particularly valuable in HPC workloads that involve frequent data exchange between simulation code running on CPU cores and numerical acceleration on GPU compute units.
The MI300A's primary application was the El Capitan supercomputer at Lawrence Livermore National Laboratory, built by Hewlett Packard Enterprise. El Capitan deployed 43,808 MI300A accelerators and was verified as the world's fastest supercomputer on the November 2024 TOP500 list with a High-Performance Linpack score of approximately 1.742 exaflops (FP64), displacing Frontier from the top position. As of the November 2025 list it remained the top-ranked system. The TOP500 entry lists each MI300A's CPU portion as a 24-core, 1.8 GHz 4th-generation EPYC, integrated in the same package as the CDNA 3 GPU and 128 GB of unified HBM3.
Microsoft was the first major cloud provider to deploy the MI300X at scale. The ND MI300X v5 VM series, announced as generally available on May 21, 2024, uses eight MI300X accelerators per instance, attached to the host over PCIe 5.0 and interconnected with AMD Infinity Fabric (xGMI). The instance pairs the eight GPUs (1.5 TB of aggregate HBM3, 5.3 TB/s of HBM bandwidth per GPU) with two 4th-generation Intel Xeon Scalable (Sapphire Rapids) processors providing 96 physical cores.
Microsoft deployed MI300X to power portions of its Azure OpenAI Service workloads, including inference for GPT-3.5 and GPT-4 variants. Hugging Face was the launch validation customer on ND MI300X v5 and reported 2x to 3x improvements in time-to-first-token on Meta Llama 3 70B versus the prior-generation MI250-based ND VMs. The Azure deployment represented a meaningful diversification away from exclusive NVIDIA GPU dependence for cloud AI inference.
Meta began integrating MI300X accelerators in 2024 to run inference workloads on its Llama family of models. Meta's adoption was partly enabled by ROCm 6 optimizations specifically tuned for Llama 2 models. Meta ran ranking, recommendation, and content generation workloads on AMD Instinct hardware, making it one of the largest real-world validations of ROCm's inference capability. By 2025 Meta was using MI300X for a meaningful share of its Llama 3 and Llama 4 inference fleet, with AMD calling out Meta as one of the largest single-customer commitments to MI300X.
Oracle Cloud Infrastructure announced general availability of the BM.GPU.MI300X.8 bare-metal instance on September 26, 2024. The instance ships eight MI300X accelerators, 2 TB of system memory, eight 3.84 TB NVMe drives, and a non-blocking RDMA network fabric supporting up to 16,384 MI300X GPUs in a single OCI Supercluster. Oracle priced the bare-metal MI300X instance at $6.00 per GPU per hour, undercutting comparable H100 bare-metal prices by a meaningful margin. Fireworks AI was an early reference customer for the OCI MI300X cluster, citing the larger memory pool as decisive for serving 70B-class models at low latency.
In October 2025, Oracle and AMD announced a follow-on commitment of 50,000 MI450-series GPUs for OCI Superclusters with deployment beginning in Q3 2026, building on the success of the MI300X deployment.
A wide group of neoclouds and AI specialty providers built businesses around MI300X capacity:
| Provider | Notes |
|---|---|
| Crusoe | MI300X cloud GPU instances on dedicated infrastructure with on-demand and reserved pricing |
| Vultr | Bare-metal and on-demand MI300X instances; opened MI300X price competition with sub-$2/hr rates |
| TensorWave | MI300X-only neocloud, prices as low as approximately $1.50 per GPU per hour for bare-metal |
| RunPod | Self-service MI300X access with hourly pricing |
| Hot Aisle | MI300X bare-metal specialist used by independent researchers and ChipsAndCheese |
| Lamini | Model fine-tuning platform built on MI300X |
| Fireworks AI | Inference provider, adopted MI300X on OCI |
| Moreh | Korean software/inference provider, optimized SGLang for MI300X DeepSeek inference |
| TensorOpera (Together AI sub-tier) | Mixed AMD/NVIDIA fleets with MI300X for memory-bound inference |
Databricks, Lamini, and Hugging Face were publicly named partners in the MI300X ecosystem development effort. Samsung reportedly purchased approximately $20 million worth of MI300X GPUs for internal AI development work. Numerous smaller cloud providers, including the neoclouds listed above, deployed MI300X to serve customers seeking alternatives to NVIDIA hardware amid H100 scarcity.
The MI300X was designed to compete directly with NVIDIA's H100 and, later, H200 accelerators. The B200, the flagship of NVIDIA's Blackwell generation announced in March 2024, is its closest next-generation contemporary. The following table compares key specifications:
| Specification | AMD MI300X | NVIDIA H100 SXM5 | NVIDIA H200 SXM | NVIDIA B200 |
|---|---|---|---|---|
| Architecture | CDNA 3 | Hopper | Hopper | Blackwell |
| Process node | TSMC N5 / N6 | TSMC 4N | TSMC 4N | TSMC 4NP |
| Transistors | 153 billion | 80 billion | 80 billion | 208 billion (2 dies) |
| Compute units | 304 CUs | 132 SMs | 132 SMs | 160 SMs (2 dies) |
| FP8 peak (dense) | 2,615 TFLOPS | 1,979 TFLOPS | 1,979 TFLOPS | ~4,500 TFLOPS |
| BF16 peak (dense) | 1,307 TFLOPS | 990 TFLOPS | 990 TFLOPS | ~2,250 TFLOPS |
| FP4 peak | not supported | not supported | not supported | ~9,000 TFLOPS |
| Memory | 192 GB HBM3 | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e |
| Memory bandwidth | 5.3 TB/s | 3.35 TB/s | 4.8 TB/s | 8 TB/s |
| GPU-to-GPU interconnect | 7 xGMI links @ 128 GB/s each | NVLink 4 @ 900 GB/s | NVLink 4 @ 900 GB/s | NVLink 5 @ 1.8 TB/s |
| TDP | 750 W | 700 W | 700 W | 1,000 W |
| Form factor | OAM | SXM5 | SXM5 | SXM6 |
| Launch | December 2023 | March 2022 | November 2023 | March 2024 |
On paper, the MI300X holds substantial advantages in memory capacity and memory bandwidth over the H100 and H200, and matches the B200 on memory capacity while ceding bandwidth and FP4 throughput. In sustained training throughput, independent benchmarks place the H100 and H200 ahead of the MI300X because of better software utilization. The interconnect topologies also differ in an architecturally important way: the MI300X's xGMI fabric is a fully meshed point-to-point topology in which each GPU pair communicates over a single 128 GB/s link, whereas NVIDIA's switched NVLink topology provides the full 900 GB/s between any pair of GPUs through the NVSwitch fabric. This difference hurts all-reduce collective performance and becomes more pronounced as cluster size grows beyond a single node.
For multi-node training NVIDIA's H100 and H200 clusters typically use NVLink within the node and InfiniBand between nodes, with NCCL-aware topology routing. AMD MI300X clusters use xGMI within the node and RoCEv2 (or InfiniBand) between nodes, with RCCL providing the collective implementation. SemiAnalysis measured 32-GPU all-reduce collective performance running 2x to 4x slower on MI300X than on H100 in their training benchmark, with the gap growing at larger scale.
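Collective performance of this kind is typically measured with an all-reduce microbenchmark. A hedged sketch using torch.distributed, which maps its "nccl" backend to RCCL on ROCm builds (launched with torchrun across the GPUs under test):

```python
import os
import time
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=8 allreduce_bench.py
dist.init_process_group(backend="nccl")        # maps to RCCL on ROCm
local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
torch.cuda.set_device(local_rank)

# 256 MB payload of BF16 values (2 bytes per element).
x = torch.randn(128 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")

for _ in range(5):                             # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

# nccl-tests convention: bus bandwidth = 2 * (N - 1) / N * bytes / time.
world = dist.get_world_size()
bus_bw = 2 * (world - 1) / world * x.numel() * x.element_size() / elapsed / 1e9
if dist.get_rank() == 0:
    print(f"{world}-GPU all-reduce bus bandwidth: {bus_bw:.1f} GB/s")
dist.destroy_process_group()
```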
SemiAnalysis's "MI300X vs H100 vs H200 Benchmark Part 1: Training" report, published in December 2024 after a five-month investigation, was the most influential independent benchmark of the MI300X. Key findings:
The report concluded that the "CUDA moat" remains real for training, while acknowledging meaningful AMD progress over the test period.
ChipsAndCheese published "Testing AMD's Giant MI300X" in 2024 with low-level microbenchmarks. Their findings:

- Infinity Cache bandwidth of roughly 11.9 TB/s, more than double the HBM3 bandwidth
- Higher memory access latency than the H100 (H100 latency measured at roughly 57 percent of the MI300X's), attributed partly to TLB miss handling and partly to the package's physical size
- The latency gap largely closes when work is spread across multiple workgroups, which reduces TLB pressure
SemiAnalysis launched InferenceMAX in 2025 as a continuously updated open inference benchmark that runs across NVIDIA and AMD hardware. Its findings position MI300X as competitive on memory-heavy workloads:

- For very large models such as Llama 3.1 405B and DeepSeek V3 670B, the MI300X beat the H100 in both absolute throughput and performance per dollar
- The advantage was largest where the 192 GB of HBM3 avoided multi-GPU sharding or allowed larger batch sizes than the H100 could hold
AMD's own benchmarks often show favorable comparisons that NVIDIA disputes. Examples include:

- The launch claims of 1.6x H100 throughput on Bloom 176B and 1.4x on Llama 2 70B inference, which NVIDIA countered with TensorRT-LLM-optimized H100 results
- The MI325X launch claim of a 40 percent inference lead over the H200 on selected workloads, for which independent verification was mixed
These figures should be read with the standard caveat that vendor-published comparisons usually pick favorable configurations. The independent SemiAnalysis and ChipsAndCheese numbers above are a more reliable guide to typical performance.
AMD did not publicly announce list pricing for the MI300X at launch. Third-party reports from late 2023 and 2024 cited customer acquisition costs of approximately $10,000 to $20,000 per GPU for direct purchasers, broadly comparable to or modestly below NVIDIA H100 pricing of $25,000 to $30,000+. Reports of $15,000 single-unit pricing at the OAM level appeared in trade press coverage, with hyperscale-volume pricing presumed lower.
Fully populated 8-GPU MI300X server systems from OEM vendors typically priced in the $200,000 to $300,000 range depending on configuration. Oracle priced its OCI BM.GPU.MI300X.8 bare-metal instance at $6.00 per GPU per hour at launch, with reserved capacity rates lower.
Cloud rental pricing for the MI300X moved aggressively downward during 2024 and 2025. By mid-2025, single-GPU on-demand rates from neoclouds settled in the $1.50 to $3.00 per hour range, with TensorWave at approximately $1.50 per GPU per hour for bare-metal and Vultr opening pricing at $1.85 per hour. RunPod self-service MI300X started at $2.30 per hour. Reserved one-month or annual pricing typically ran 20 to 40 percent below on-demand rates. SemiAnalysis noted that for MI300X to remain economically competitive against H200 on like-for-like inference workloads, pricing needed to land near or below $2 per hour.
AMD's financial disclosures showed that MI300-series products generated revenue that exceeded its initial $400 million forecast for 2023, with the company raising full-year guidance multiple times through 2024 as hyperscaler demand materialized.
The most consistent criticism of the MI300X across independent analyses was software quality. CUDA's 15-plus-year head start had produced an ecosystem of optimized kernels, profiling tools, and community knowledge that ROCm could not replicate quickly. Specific pain points documented during the 2024 SemiAnalysis investigation included:

- Default, untuned training configurations stalling below 150 TFLOPS because of bugs in attention backward passes and torch.compile
- Reliance on custom Docker images built from source and environment-variable tuning flags not enabled by default
- The need for direct AMD engineering involvement to reach competitive throughput
- RCCL collective operations trailing NVIDIA's NCCL/InfiniBand configurations by 2x to 4x at 32-GPU scale
AMD's response was a sustained investment in shipping prebuilt validated images, expanding the ROCm Compatibility Matrix, and contributing upstream to PyTorch, Triton, vLLM, and Hugging Face. By mid-2025 the gap on inference-only workloads had narrowed substantially, while training remained the area where CUDA's lead was most visible.
The point-to-point xGMI topology limits the efficiency of collective operations across multiple GPUs. In multi-node configurations (clusters larger than 8 GPUs), AMD used RoCEv2 over standard Ethernet or InfiniBand for inter-node networking. Independent benchmarks found AMD's collective operations running 2 to 4 times slower than NVIDIA's InfiniBand-based configurations at 32-GPU scale, with the gap growing at larger node counts. NVIDIA's NVSwitch-based NVLink topology, in its third generation with the H100, provides switched full-bandwidth connectivity among all eight GPUs in a node and supports in-network reductions for collectives; AMD's 8-GPU platform offered no equivalent switch. AMD signaled an intent to address this with the Helios rack-scale design and the UALink interconnect in the 2026 MI400 generation.
While the MI300X's HBM3 capacity and bandwidth led the market at launch, single-access latency is higher than the H100's. Workloads sensitive to small-message access latency rather than streaming bandwidth saw a measurable performance impact. In practice this affected only a narrow class of irregular-memory-access kernels, and splitting work across multiple workgroups eliminates most of the gap.
Most third-party AI software, from optimized inference runtimes like TensorRT-LLM to profiling tools like Nsight, targeted NVIDIA hardware. Many model serving frameworks added ROCm support incrementally and often without the same level of testing as their CUDA paths. This created integration friction for organizations that had built production pipelines around NVIDIA-specific tooling. The most popular third-party kernel libraries (xFormers, FlashAttention, Cutlass) added ROCm support over time, but typically months after the CUDA implementation.
Industry reaction at launch was cautiously positive. Press coverage from AnandTech, Tom's Hardware, ServeTheHome, and HPCwire treated the MI300X as the first AMD AI accelerator to credibly threaten NVIDIA's data center GPU monopoly. SemiAnalysis's launch-day analysis acknowledged the hardware advantage in memory capacity and bandwidth while flagging software risk. NVIDIA's response was confrontational: a rebuttal blog post the following week disputed AMD's H100 comparison numbers, and TensorRT-LLM optimizations released within weeks pushed H100 inference throughput up.
Customer adoption exceeded AMD's initial guidance. AMD raised its 2024 data center GPU revenue target multiple times through the year, ending 2024 with more than $5 billion in shipped MI300X-class revenue and Q4 data center segment revenue setting a new company record. By Q1 2026 AMD's data center segment ran at a $5.8 billion quarterly run rate, up 57 percent year over year, with Lisa Su forecasting tens of billions of dollars in AI accelerator revenue annually by 2027 and roughly 35 percent annual top-line growth across the company over the following three to five years.
The MI300X also reset expectations of what is competitively possible for non-NVIDIA accelerators in the AI training and inference market. Until 2023, hyperscalers treated AMD's data center GPUs as HPC-only and kept AI workloads on NVIDIA. The MI300X changed that calculus, and the resulting MI325X, MI350 series, and MI400 roadmap commitments reflect AMD's view that AI accelerators can carry the company's data center segment at multi-tens-of-billions-of-dollars scale.
AMD's AI accelerator revenue trajectory tracks the MI300X's customer ramp:
| Period | Reported metric |
|---|---|
| 2023 (full year) | Initial MI300-series guidance of $400M, raised multiple times |
| Q2 2024 | First quarter exceeding $1 billion in MI300X revenue |
| 2024 (full year) | More than $5 billion in data center GPU revenue, primarily MI300X |
| 2025 (full year) | Approximately $16 billion data center segment revenue, with MI300X, MI325X, and MI350 series ramping concurrently |
| Q4 2025 | Data center GPU quarterly revenue of $5.4 billion |
| Q1 2026 | Data center segment revenue $5.8 billion, up 57 percent year over year |
| 2027 forecast | AMD AI data center business projected to reach tens of billions of dollars annually with 80 percent annual growth |
These figures cover the MI300-class ramp plus its successors. AMD has not separated MI300X from MI325X or MI350 series in its segment reporting, but the bulk of the 2024 number is MI300X and the 2025 number is a mix dominated by MI325X in the first half and MI350 series in the second half.
AMD announced the Instinct MI325X on October 10, 2024. The MI325X uses the same CDNA 3 architecture and chiplet configuration as the MI300X but upgrades the memory to 256 GB of HBM3E with 6.0 TB/s of bandwidth. The memory upgrade was the primary change; compute specifications remained largely identical to the MI300X. The MI325X carried a 1,000 W TDP, 250 W higher than the MI300X, requiring updated thermal infrastructure. AMD had originally planned 288 GB of HBM3E for the MI325X but scaled back to 256 GB due to supply constraints on 36 GB HBM stacks. AMD claimed a 40 percent inference performance lead over the H200 on selected workloads, though independent verification of those numbers was mixed.
AMD introduced the MI350X and MI355X in 2025, based on the CDNA 4 architecture. Both models carry 288 GB of HBM3E with 8 TB/s bandwidth and add support for FP4 and FP6 data types. AMD claimed up to 4x improvement in AI compute performance compared to the MI300X and a 35x improvement in inference throughput for certain workloads. The MI355X, designed for liquid-cooled systems, operates at 1,400 W TDP and delivers 9.2 PFLOPS of FP4 and 4.6 PFLOPS of FP8 dense throughput per GPU. The MI350X targets air-cooled configurations at lower power and lower clocks.
In an 8-GPU MI355X platform configuration, AMD quotes 2.3 TB of HBM3E aggregate, 64 TB/s aggregate bandwidth, 18.5 PFLOPS of FP16, 37 PFLOPS of FP8, and 74 PFLOPS of FP6 or FP4 throughput. AMD reported during 2025 that the MI350 series became the company's fastest-ramping product in history.
AMD detailed the Instinct MI400 series at Advancing AI 2025 and CES 2026. The series includes the MI430X (HPC plus AI with full FP32 and FP64), MI440X, and MI455X (AI-focused with low-precision emphasis). The lineup is built on CDNA 5 using TSMC's N2 (2nm) class process and is the first AMD accelerator family to support the UALink scale-up interconnect alongside Infinity Fabric. AMD's published preliminary specifications include up to 40 petaflops FP4, 20 petaflops FP8, 432 GB of HBM4 memory, and 19.6 TB/s bandwidth per GPU.
The corresponding rack-scale platform is Helios, a 72-GPU MI455X rack with EPYC Venice (Zen 6) CPU hosts. Helios delivers 31 TB of HBM4, 1.4 PB/s aggregate memory bandwidth, 2.9 FP4 exaflops of inference, and 1.4 FP8 exaflops of training in a single rack. Lisa Su called Helios "the world's best AI rack" at CES 2026. Oracle committed to deploy 50,000 MI450-series GPUs in OCI Superclusters beginning Q3 2026, building on the MI300X relationship.