The AMD Instinct MI355X is a high-end data center GPU accelerator built on AMD's CDNA 4 architecture, announced at the AMD Advancing AI 2025 event on June 12, 2025, and reaching general availability in October 2025. It is the flagship GPU in the MI350 series, positioned as AMD's primary response to NVIDIA's Blackwell generation of accelerators. The MI355X features 288 GB of HBM3E memory, 8 TB/s of memory bandwidth, 185 billion transistors, and hardware support for FP4 and FP6 inference formats. At 1,400 W total board power, it is designed for liquid-cooled data center deployments and is the higher-power sibling of the 1,000 W MI350X. AMD claims the MI355X delivers up to 3.9x more AI compute than the MI300X generation and up to 35x better inference throughput on certain workloads.
Oracle Cloud Infrastructure (OCI) became the first hyperscaler to offer the MI355X publicly, deploying it in superclusters scaling up to 131,072 GPUs. The chip competed directly against the NVIDIA B200 on single-node inference benchmarks, matching its Offline throughput on Llama 2 70B in the MLPerf Inference 6.0 submission. AMD cited a 40% better tokens-per-dollar figure compared to the B200 at typical cloud pricing.
AMD's Instinct GPU line is a family of data center accelerators designed separately from the consumer Radeon graphics products. The Instinct series uses AMD's Compute DNA (CDNA) architecture, which is optimized for matrix multiplication and memory-bandwidth-intensive workloads rather than rendering. CDNA 1 debuted with the MI100 in late 2020, followed by CDNA 2 in the MI200 series in 2021.
The AMD Instinct MI300X launched in late 2023 on CDNA 3 architecture and became a commercially significant product for AMD, attracting major cloud deployments for inference at a time when GPU supply from NVIDIA was constrained. The MI300X carried 192 GB of HBM3 memory with 5.3 TB/s of bandwidth and a 750 W TDP. Microsoft deployed it on Azure for both proprietary and open-source model serving. Meta confirmed deployments for the MI300 series as well. The MI300X was important because its memory capacity allowed single-GPU inference of models that would require two NVIDIA H100 cards.
The MI325X followed in late 2024 as an incremental update on CDNA 3, increasing memory to 256 GB of HBM3E and memory bandwidth to 6.0 TB/s while keeping the 1,000 W power envelope. It did not introduce a new architecture; the compute fabric remained CDNA 3. The MI325X served as a bridge product while AMD completed the architectural redesign for CDNA 4.
Work on CDNA 4 involved moving the compute chiplets from TSMC's 5nm node (used for CDNA 3 XCDs) to TSMC's N3P process. AMD also redesigned the I/O die arrangement, reducing from four separate I/O dies in the MI300X to two larger dies in the MI350 series, which simplified the inter-die fabric. The 3D chiplet packaging moved to TSMC's CoWoS-S advanced packaging technology.
The CDNA 4 architecture is the foundation shared by all MI350-series GPUs. AMD introduced several significant microarchitectural changes compared to CDNA 3.
Each Accelerator Complex Die (XCD) in CDNA 4 contains 32 active compute units, down from 38 in the CDNA 3 XCDs used in the MI300X. AMD reduced the CU count deliberately as part of a broader redesign of the matrix execution hardware. The matrix cores within each CU were rebuilt to deliver twice the per-CU throughput for FP8 workloads: from 4,096 FLOPS per clock in CDNA 3 to 8,192 FLOPS per clock in CDNA 4. At the ISSCC 2026 conference in February 2026, AMD fellow design engineer Ramasamy Adaikkalavan explained the approach. The team concluded that the previous generation's per-CU density was underutilized and that adding more matrix computation per CU was more efficient than simply adding more CUs.
The full MI355X package contains 8 XCDs, giving a total of 256 compute units and 16,384 stream processors (64 per CU). The maximum clock speed is 2.4 GHz with liquid cooling.
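These headline figures can be cross-checked with simple arithmetic. A minimal sketch, assuming the 10.1 PFLOPS headline FP8 figure includes the 2x credit for 2:4 structured sparsity (an assumption; the spec table below does not label it):

```python
# Peak FP8 throughput from the per-CU figures quoted above.
CUS = 256                    # 8 XCDs x 32 CUs
CLOCK_HZ = 2.4e9             # maximum boost clock with liquid cooling
FP8_PER_CU_PER_CLOCK = 8192  # dense FP8 FLOPS/clock after the CDNA 4 redesign

dense = CUS * FP8_PER_CU_PER_CLOCK * CLOCK_HZ
print(f"dense FP8 peak:    {dense / 1e15:.2f} PFLOPS")       # ~5.03
# The 10.1 PFLOPS headline is consistent with a further 2x from
# 2:4 structured sparsity (assumption, see lead-in).
print(f"with 2:4 sparsity: {2 * dense / 1e15:.2f} PFLOPS")   # ~10.07, i.e. 10.1
```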
CDNA 4 introduces hardware-native support for FP4 and FP6 data types, including the OCP Microscaling (MX) format variants MXFP4 and MXFP6. This is the first AMD data center architecture to support sub-FP8 precision at the hardware level; NVIDIA added native FP4 in Blackwell. The MX formats use block exponent scaling: a small group of values shares a single scaling factor stored separately, reducing overhead compared to per-element scaling. This lets quantized models retain accuracy at 4-bit precision where naive integer quantization would degrade it.
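The block-scaling idea is simple enough to sketch. The following is a simplified NumPy illustration, not the exact OCP codec: the real MXFP4 format stores an 8-bit power-of-two scale (E8M0) per 32-element block and packs elements as 4-bit E2M1 values, while this sketch keeps everything in float for readability and uses a ceiling rule for the scale where the spec saturates instead:

```python
import numpy as np

BLOCK = 32  # the MX spec fixes the block size at 32 elements

# Magnitudes representable by FP4 E2M1 (sign handled separately)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x):
    """Block-scaled FP4 quantization in the spirit of MXFP4.

    Each block of 32 values shares one power-of-two scale; elements
    snap to the nearest FP4 E2M1 magnitude. Simplified sketch only.
    """
    x = x.reshape(-1, BLOCK)
    absmax = np.abs(x).max(axis=1, keepdims=True)
    # Shared power-of-two scale so the block maximum fits FP4's max (6.0)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(absmax, 1e-30) / FP4_GRID[-1]))
    scaled = x / scale
    # Snap each element's magnitude to the nearest representable value
    dist = np.abs(np.abs(scaled)[..., None] - FP4_GRID)
    codes = FP4_GRID[dist.argmin(axis=-1)] * np.sign(scaled)
    return codes, scale

def dequantize_mxfp4(codes, scale):
    return (codes * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4 * BLOCK)
codes, scale = quantize_mxfp4(w)
print("mean abs error:", np.abs(w - dequantize_mxfp4(codes, scale)).mean())
```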
The CDNA 4 matrix cores support a full range of formats: FP64, TF32, FP32, BF16, FP16, MXFP8, OCP-FP8, MXFP6, FP6, MXFP4, and FP4. The architecture provides 1,024 matrix cores in total across the 8 XCDs.
The MI300X used four separate I/O dies. AMD consolidated these into two larger dies in the MI350/MI355X. Each pair of XCDs connects to one I/O die. The I/O dies are fabricated on TSMC N6, a more mature 6nm process. The consolidation reduced the number of die-to-die crossings, which previously required extra protocol translation logic. AMD repurposed the freed area to widen the Infinity Fabric data pipeline.
The result is the fourth-generation Infinity Fabric (IF4), which carries 2 TB/s more aggregate bandwidth than the IF3 implementation in the MI300X and provides 7 Infinity Fabric links per socket. Inter-die bandwidth within the package is 5.5 TB/s bidirectional.
The L2 cache per XCD is 4 MB, coherent across XCDs. The Local Data Share (LDS) capacity per CU was doubled compared to the MI300X, and the Infinity Cache is 256 MB.
For memory bandwidth, AMD redesigned the HBM controller to run more efficiently. Adaikkalavan noted at ISSCC 2026 that the raw bandwidth rose 1.5x (from 5.3 to 8.0 TB/s) and AMD achieved a 1.3x improvement in HBM read bandwidth per watt compared to the MI300X. Custom interconnect wire engineering inside the package reduced interconnect power consumption by approximately 20%.
| Specification | Value |
|---|---|
| Architecture | AMD CDNA 4 |
| Process node (XCDs) | TSMC N3P (3nm) |
| Process node (IOD) | TSMC N6 (6nm) |
| Chiplet configuration | 8 XCDs + 2 IODs |
| Transistor count | 185 billion |
| Compute units | 256 (32 per XCD) |
| Stream processors | 16,384 |
| Matrix cores | 1,024 |
| Max clock speed | 2.4 GHz |
| FP64 performance | 78.6 TFLOPS |
| FP32 performance | 157.2 TFLOPS |
| FP16 performance | 5.0 PFLOPS |
| FP8 performance | 10.1 PFLOPS |
| FP6 performance | 20.1 PFLOPS |
| FP4 performance | 20.1 PFLOPS |
| Memory | 288 GB HBM3E |
| Memory bandwidth | 8 TB/s |
| HBM stacks | 8 stacks, 12-Hi, 36 GB each |
| L2 cache | 32 MB (4 MB per XCD) |
| Infinity Cache | 256 MB |
| Total board power (TBP) | 1,400 W |
| Cooling | Liquid cooling required |
| Form factor | OAM (OCP Accelerator Module) |
| Interconnect | Infinity Fabric 4, 7 links |
| PCIe | PCIe 5.0 |
| Scale-out networking | 400 Gbps Ethernet (Pollara NIC) |
| GA date | October 2025 |
The MI355X carries 288 GB of HBM3E memory across 8 physical stacks. Each stack is a 12-Hi (12-layer) configuration with 36 GB of capacity, running at 8 Gbps per pin across a 1,024-bit stack interface, or 1 TB/s per stack. AMD sourced HBM3E from both Samsung Electronics and Micron Technology, making the MI355X one of the first major AI accelerators to use dual suppliers for its HBM. Samsung confirmed its supply of 12-Hi HBM3E to the MI350 family at the Advancing AI 2025 event; Micron separately confirmed its supply.
The 288 GB figure, matched only by NVIDIA's B300, is the largest memory capacity available on any single GPU accelerator as of late 2025. The NVIDIA B200 carries 180 GB of HBM3E, and the H200 carried 141 GB, giving the MI355X 1.6x the memory of the B200. The practical implication is model size: the MI355X can run models up to approximately 520 billion parameters in FP4 precision without any model parallelism across multiple cards, according to AMD. In FP8, the effective capacity narrows but still exceeds 250 billion parameters on a single GPU.
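A weight-only estimate makes the arithmetic behind those limits visible; this sketch ignores KV cache, activations, and runtime overhead, all of which reduce the usable headroom:

```python
HBM_GB = 288  # MI355X capacity

def weights_gb(params_b, bits):
    """Weight-only footprint in GB for a params_b-billion-parameter model."""
    return params_b * 1e9 * bits / 8 / 1e9

for params_b, bits in [(520, 4), (405, 8), (250, 8), (70, 16)]:
    gb = weights_gb(params_b, bits)
    verdict = "fits" if gb <= HBM_GB else "needs >1 GPU"
    print(f"{params_b:>4}B @ FP{bits}: {gb:6.0f} GB weights -> {verdict}")
# 520B @ FP4 = 260 GB and 250B @ FP8 = 250 GB both fit in 288 GB,
# matching the approximate single-GPU limits cited above.
```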
Memory bandwidth is 8 TB/s, compared to roughly 7.7 TB/s for the B200 and 4.8 TB/s for the H200. Higher bandwidth directly benefits the memory-bound decode phase of LLM inference, where each generated token requires streaming the model weights from memory. For a fixed batch size, more bandwidth translates to more tokens per second.
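A simple bandwidth roofline illustrates the point, under the idealized assumption that each generated token streams the full weight set once (batch size 1, no KV-cache traffic); the bandwidth figures are the ones cited above:

```python
def decode_tokens_per_s(params_b, bits, bw_tb_s):
    """Bandwidth roofline: every generated token streams all weights once."""
    bytes_per_token = params_b * 1e9 * bits / 8
    return bw_tb_s * 1e12 / bytes_per_token

for name, bw in [("MI355X", 8.0), ("B200", 7.7), ("H200", 4.8)]:
    print(f"{name}: {decode_tokens_per_s(70, 8, bw):5.0f} tok/s "
          f"(70B model, FP8 weights, batch 1)")
```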
The MI355X's most significant performance numbers come from its low-precision matrix compute.
FP8 performance is 10.1 PFLOPS. At MLPerf Inference 5.1 (results published in 2025), the MI355X achieved 93,045 tokens per second on Llama 2 70B with FP8 weights, a 2.7x improvement over the MI325X. The improvement reflects both the CDNA 4 matrix core redesign and optimizations in the AMD ROCm software stack.
At MLPerf Inference 6.0 (results published in early 2026), the single-node MI355X reached 100,282 tokens per second on Llama 2 70B Server, which AMD characterized as a 3.1x improvement over the prior MI325X submission. Against NVIDIA B200 at MLPerf 6.0, the MI355X matched it in Offline throughput and delivered 97% of B200's Server performance. Against the B300, the MI355X delivered 92% in Offline, 93% in Server, and exceeded it with 104% in Interactive mode.
FP6 and FP4 performance are both rated at 20.1 PFLOPS. Doubling FP8 at FP4 is consistent with halving the bit width; running FP6 at the full FP4 rate rather than an intermediate rate is a deliberate CDNA 4 design choice. AMD compared CDNA 4 CUs in FP6 to NVIDIA B200 streaming multiprocessors and stated that per-CU throughput is roughly equivalent.
The B300 showed a 1.3x advantage over the MI355X on FP4 in SemiAnalysis testing, which SemiAnalysis attributed to NVIDIA's larger FP4 compute array and more mature FP4 kernel implementations in TensorRT-LLM at the time.
SemiAnalysis InferenceX benchmarks published in 2025 tested DeepSeek R1 and Llama 3.1 405B across multiple serving configurations. For single-node FP8 aggregated serving using SGLang, the MI355X delivered better performance per TCO than the B200 on most tested scenarios. AMD claimed 20% higher throughput than B200 on an 8-GPU DeepSeek R1 configuration and 30% higher throughput on Llama 3.1 405B at 8 GPUs.
ROCm (Radeon Open Compute) is AMD's open-source software platform for GPU computing. ROCm provides the HIP runtime and programming model (a close analogue of CUDA that eases porting), the compiler toolchain, math libraries, and deep learning primitives.
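In practice, Python code written against the torch.cuda API generally runs unchanged on ROCm builds of PyTorch, which map that namespace onto HIP devices. A minimal check (the printed device name and HIP version are illustrative):

```python
import torch

# On a ROCm build of PyTorch, torch.cuda.* maps to HIP/ROCm devices,
# so CUDA-targeted scripts usually run without source changes.
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))  # e.g. an Instinct GPU
    print("HIP runtime:", torch.version.hip)         # None on CUDA builds
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    y = x @ x  # dispatched to ROCm BLAS libraries under the hood
    print("matmul ok:", y.shape)
```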
ROCm 7.0, released alongside the MI355X's general availability in October 2025, added native support for the MI355X (gfx950 GPU target) and the MI350X.
The MI355X supports all major deep learning frameworks through ROCm 7.0, including PyTorch, TensorFlow, and JAX.
ROCm became a first-class platform in the vLLM ecosystem in late 2025. AMD's approach was to contribute upstream to vLLM's core rather than maintaining a separate fork, which reduced divergence and allowed ROCm users to track the main vLLM release schedule.
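Because the support is upstream, the standard vLLM entry points are identical on ROCm and CUDA installs; a minimal sketch (model name illustrative, assuming a ROCm build of vLLM is installed):

```python
from vllm import LLM, SamplingParams

# Identical code path on ROCm and CUDA builds of vLLM; the backend
# (HIP kernels vs CUDA kernels) is selected at install/build time.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain HBM3E in one sentence."], params)
print(outputs[0].outputs[0].text)
```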
AMD's software ecosystem has historically lagged NVIDIA's CUDA in maturity and coverage. CUDA's installed base of libraries, tools, and developer knowledge represents a durable competitive advantage for NVIDIA that extends beyond raw hardware metrics. As of 2025, many custom CUDA kernels written by frontier labs do not have ROCm equivalents, requiring either porting effort or reliance on less-optimized fallbacks. AMD has invested heavily to close this gap, but independent analyses found that training workloads with extensive CUDA-specific dependencies still favor NVIDIA hardware due to software optimization rather than hardware limitations.
The MI355X's primary competitive target is the NVIDIA Blackwell B200 and GB200 platform.
| Specification | AMD MI355X | NVIDIA B200 | NVIDIA B300 |
|---|---|---|---|
| Architecture | CDNA 4 | Blackwell | Blackwell Ultra |
| Process node | TSMC N3P (XCDs) | TSMC 4NP | TSMC 4NP |
| Transistors | 185B | 208B | ~230B |
| Memory | 288 GB HBM3E | 180 GB HBM3E | 288 GB HBM3E |
| Memory bandwidth | 8 TB/s | 7.7 TB/s | 8 TB/s |
| FP8 performance | 10.1 PFLOPS | 9 PFLOPS | 15 PFLOPS |
| FP4 performance | 20.1 PFLOPS | 18 PFLOPS | 30 PFLOPS |
| FP64 performance | 78.6 TFLOPS | 40 TFLOPS | ~80 TFLOPS |
| TDP | 1,400 W | 1,000 W | 1,400 W |
| Scale-up interconnect | 76.8 GB/s per link (IF4) | 900 GB/s (NVLink 5) | 900 GB/s (NVLink 5) |
| Scale-out networking | 400 Gbps | 400 Gbps (B200) / 800 Gbps (GB200) | 800 Gbps |
| GA date | October 2025 | Q1 2025 | Q3 2025 |
The MI355X has 108 GB more memory than the B200 and modestly higher memory bandwidth (8 TB/s versus roughly 7.7 TB/s). On memory-bound workloads such as LLM decode at low batch sizes, the capacity and bandwidth advantages translate directly to throughput. For large model inference (300 billion+ parameters), the MI355X can serve models that require two or more B200 cards, reducing node count and network communication overhead.
The most significant structural disadvantage of the MI355X versus the GB200 NVL72 system is scale-up interconnect bandwidth. Fifth-generation NVLink provides 1.8 TB/s of bidirectional bandwidth per GPU (900 GB/s in each direction), enabling tight GPU-to-GPU communication across all 72 GPUs in a rack. The MI355X's Infinity Fabric 4 provides 76.8 GB/s per link in each direction, or roughly 538 GB/s of aggregate scale-up bandwidth across all 7 links, well under NVLink's per-GPU figure.
For frontier model training across hundreds of GPUs, low-latency scale-up interconnect matters more than memory bandwidth, because gradient all-reduce operations are bottlenecked by communication throughput. The-decoder.com noted that "AMD's MI350 chips deliver big on memory but lag in networking against Nvidia." Hume Consulting described the MI355X as "strong node-level inference, but not yet rack-scale."
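The sensitivity is easy to see with the standard ring all-reduce cost model, in which each GPU moves roughly 2(N-1)/N times the gradient size per step. The sketch below is a lower bound that ignores latency, overlap, and hierarchical collectives, with an illustrative gradient size:

```python
def ring_allreduce_seconds(grad_gb, n_gpus, per_gpu_gb_s):
    """Lower bound for a ring all-reduce: each GPU moves ~2(N-1)/N x size."""
    bytes_moved = 2 * (n_gpus - 1) / n_gpus * grad_gb * 1e9
    return bytes_moved / (per_gpu_gb_s * 1e9)

GRAD_GB = 140  # e.g. BF16 gradients for a 70B-parameter model (illustrative)
for name, bw in [("MI355X, 7x IF4 (~538 GB/s)", 538), ("NVLink (900 GB/s)", 900)]:
    t = ring_allreduce_seconds(GRAD_GB, 8, bw)
    print(f"{name}: {t * 1e3:.0f} ms per all-reduce of {GRAD_GB} GB")
```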
AMD acknowledged this limitation and addressed it in the MI400 series roadmap with a new scale-up fabric targeting 72-GPU interconnect at competitive bandwidth to NVLink.
The B300 is approximately 1.3x faster than the MI355X on FP4, according to SemiAnalysis InferenceX v2 benchmarks. At the time of writing, NVIDIA's TensorRT-LLM had more mature FP4 kernel implementations than ROCm's FP4 stack, so part of the B300's advantage was attributed to software rather than pure hardware; the remainder reflects the B300's higher peak FP4 rating.
For FP64 workloads (double precision), the MI355X delivers 78.6 TFLOPS compared to approximately 40 TFLOPS for the B200. This makes the MI355X competitive for HPC workloads such as molecular dynamics, climate simulation, and computational fluid dynamics.
Total cost of ownership (TCO) is the metric AMD emphasized most in positioning the MI355X. AMD claimed up to 40% more tokens per dollar on Llama 3.1 405B inference compared to B200 at cloud pricing.
The TCO advantage comes from several factors:
First, higher memory capacity reduces node count for large model inference. A cluster serving 400B-parameter models needs fewer MI355X nodes than B200 nodes because each GPU holds a larger share of the model, which also cuts inter-node communication hops. Fewer nodes means lower server costs, lower networking costs, and lower power infrastructure investment.
Second, AMD's pricing strategy has historically targeted a discount relative to NVIDIA's top-tier offerings. While exact list prices for the MI355X were not publicly disclosed at launch, cloud pricing on OCI at GA was competitive with B200 pricing on comparable platforms.
Third, at high concurrency (batch sizes above 64), the MI355X's memory capacity and bandwidth headroom allow larger batches and KV caches, delivering more throughput per dollar as servers are fully utilized.
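The tokens-per-dollar claim reduces to straightforward arithmetic once throughput and hourly price are fixed. The figures below are placeholders chosen only to show the shape of the calculation, not quoted prices or measured throughput:

```python
# Tokens per dollar = sustained tokens/s * 3600 / hourly instance price.
def tokens_per_dollar(tok_per_s, usd_per_hour):
    return tok_per_s * 3600 / usd_per_hour

# Placeholder throughput and pricing, for illustration only.
mi355x = tokens_per_dollar(tok_per_s=12_000, usd_per_hour=10.0)
b200   = tokens_per_dollar(tok_per_s=11_000, usd_per_hour=13.0)
print(f"MI355X: {mi355x:,.0f} tok/$   B200: {b200:,.0f} tok/$")
print(f"advantage: {mi355x / b200 - 1:.0%}")  # ~42% with these placeholders
```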
Clarifai benchmarks published in 2025 showed that at batch sizes of 1 to 4, NVIDIA H100 with TensorRT-LLM held a 20-30% throughput advantage. At batch sizes of 64 to 128, the gap narrowed to 5-10%. At sustained serving loads, the MI355X's bandwidth and memory capacity made it more efficient per dollar.
SemiAnalysis's InferenceX analysis of FP8 workloads found that the MI355X (SGLang) beat or matched B200 (TRT and SGLang) in performance-per-TCO for most single-node scenarios. In multi-node FP4 scenarios, NVIDIA retained an advantage.
The MI350 series contains two GPU variants: the MI350X and the MI355X. Both share the same CDNA 4 die configuration, memory capacity, and memory bandwidth. The differences are in power budget and target cooling environment.
| Feature | MI350X | MI355X |
|---|---|---|
| TDP | 1,000 W | 1,400 W |
| Cooling requirement | Air cooling | Liquid cooling (DLC) |
| Max clock speed | Lower (thermally limited) | 2.4 GHz |
| FP8 performance | ~9 PFLOPS | 10.1 PFLOPS |
| FP4 performance | ~18 PFLOPS | 20.1 PFLOPS |
| GPUs per rack (air) | 64 | N/A |
| GPUs per rack (DLC) | N/A | 128 |
| Total rack HBM3E | 18 TB (air) | 36 TB (DLC) |
The MI350X is designed for existing data centers with air-cooled infrastructure where retrofitting liquid cooling is not feasible. The MI355X targets new builds or facilities already equipped for direct liquid cooling (DLC). Liquid cooling removes the thermal constraints that would otherwise force the GPU to reduce its clock speed under sustained load. The MI355X operates at 2.4 GHz continuously, while the MI350X runs at a lower sustained clock to stay within its air-cooling thermal budget.
The 40% higher power draw of the MI355X over the MI350X yields roughly a 7-10% improvement in raw TFLOPS figures at the same precision. The real advantage is sustained throughput: fewer thermal throttle events and consistent performance over hours of continuous inference or training.
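Using the rounded figures from the comparison table above, the efficiency trade is visible directly:

```python
# FP4 peak PFLOPS and board power from the table above (MI350X values
# are the approximate "~" figures).
variants = {"MI350X": (18.0, 1000), "MI355X": (20.1, 1400)}
for name, (pflops, watts) in variants.items():
    print(f"{name}: {pflops / (watts / 1000):.1f} peak PFLOPS per kW")
# ~18.0 vs ~14.4 PFLOPS/kW: with these rounded figures the MI355X buys
# about 12% more peak throughput for 40% more power; its real value is
# sustained clocks under load, not peak efficiency.
```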
DLC racks with 128 MI355X GPUs deliver 36 TB of HBM3E per rack, which AMD positioned as an advantage for very large model training where keeping model weights and activations in GPU memory reduces host memory spill.
Oracle Cloud Infrastructure (OCI) was the first hyperscaler to deploy the MI355X publicly and announced general availability of OCI Compute with AMD Instinct MI355X GPUs on October 14, 2025. Oracle and AMD announced a deployment target exceeding 131,072 MI355X GPUs in a single zettascale supercluster. Oracle separately committed to deploying 50,000 MI450 GPUs (next generation) from AMD beginning in Q3 2026, indicating a long-term partnership agreement.
OCI's MI355X implementation uses the OCI Supercluster architecture with AMD Pensando Pollara Ultra-Ethernet NICs providing 400 Gbps scale-out networking. AMD published detailed performance and technical documentation for the OCI MI355X deployment.
Microsoft deployed the MI300X on Azure for both proprietary and open-source AI model inference, and had publicly confirmed MI300X deployments powering production serving workloads on Azure before the MI355X launch. AMD's press materials for the MI350 series noted Meta's MI300 deployments and Meta's stated commitment to future products, including the MI350 series. Microsoft's specific MI355X deployment timeline had not been publicly confirmed at GA.
Meta discussed its MI300 series deployments at AMD's Advancing AI 2025 event and indicated alignment with future AMD products. Meta was identified among AMD's largest inference customers.
In AMD's MLPerf Inference 6.0 submission round, nine companies submitted results on AMD Instinct hardware: Cisco, Dell Technologies, Giga Computing, HPE, MangoBoost, MiTAC, Oracle, Red Hat, and Supermicro. Partner results typically matched AMD's internal figures within 4%, with some within 1%, indicating a production-ready supply chain and consistent software performance across vendors.
Dell and MangoBoost demonstrated a heterogeneous cluster configuration combining MI300X, MI325X, and MI355X GPUs across geographic locations, achieving 141,521 tokens per second on Llama 2 70B Server. This showed the ROCm stack's ability to manage mixed-generation hardware for distributed inference.
The MI355X's primary commercial use case is large language model inference serving. The combination of 288 GB memory and 8 TB/s bandwidth allows single-node inference for models up to approximately 520 billion parameters in FP4. This covers nearly all publicly available open-weight models as of 2025, including Llama 3.1 405B, Mistral models, and DeepSeek V3 (671B MoE, which has a smaller active parameter count at runtime).
For inference serving, memory bandwidth is the most important factor at typical batch sizes: more bandwidth means more tokens per second for a given model. The MI355X's 8 TB/s edges out the B200's roughly 7.7 TB/s, and combined with the larger memory capacity it is particularly well suited to latency-sensitive serving at low-to-medium concurrency.
For training, the MI355X's FP8 compute (10.1 PFLOPS) and FP16 compute (5.0 PFLOPS) are competitive with B200. AMD submitted MLPerf Training v5.1 results using the MI350 series and showed competitive performance on GPT-3 and ResNet workloads. The scale-up networking limitation affects very large training runs that require frequent all-reduce operations across many nodes. For training runs of models up to a few hundred billion parameters that fit within a single 8-GPU node, the MI355X is fully competitive.
AMD's ROCm blogs published training performance comparisons in late 2025 showing MI355X within 5-10% of B200 on Llama 3 fine-tuning using DeepSpeed and within 3% on BERT pretraining.
The MI355X's FP64 performance (78.6 TFLOPS) is approximately double the B200's FP64 (roughly 40 TFLOPS). This matters for traditional HPC workloads that require full double-precision accuracy: computational fluid dynamics, finite element analysis, molecular dynamics, seismic imaging, and climate modeling. The AMD CDNA line has historically retained higher FP64 compute than NVIDIA's AI-focused accelerators, which deprioritize FP64 to allocate die area toward AI formats.
For graph neural networks used in drug discovery, the MI355X's large memory allows holding molecular graphs with millions of nodes in GPU memory, avoiding expensive data-fetching from host. Multimodal training with image, text, and video modalities simultaneously benefits from the combined memory and bandwidth.
AMD announced the MI400 series at its Advancing AI 2025 event with a targeted launch in the second half of 2026. The MI400 series uses the CDNA 5 architecture on TSMC's 2nm (N2) process; announced targets include 432 GB of HBM4 memory, 19.6 TB/s of memory bandwidth, and 40 PFLOPS of FP4 compute, roughly double the MI355X's headline figures.
The MI400 series addresses the MI355X's main competitive weakness on networking. The 72-GPU scale-up fabric directly targets the GB200 NVL72 system's topology. AMD confirmed MI400 variants at CES 2026: MI455X for cloud and training, MI430X for HPC and government, and MI440X as a rack-mounted server combining 8 GPUs with an EPYC Venice CPU.
At the CES 2026 event, AMD claimed the MI400 series would match or exceed GB200 NVL72 on most common frontier model training workloads.
AMD announced the MI500 series with a 2027 target, based on the CDNA 6 architecture, a TSMC 2nm-class process, and HBM4E memory. AMD claimed the MI500 series would deliver up to 1,000x the AI performance of the MI300X generation, a figure that compounds the generational gains: roughly 4x from the MI300X to the MI355X, with the remainder projected across the MI400 and MI500 generations.
The MI355X has several practical limitations relative to the NVIDIA Blackwell platform:
Scale-up networking is the most significant. Infinity Fabric 4's 76.8 GB/s per-link bandwidth, totaling roughly 538 GB/s per GPU, falls well short of NVLink 5's 900 GB/s per direction. This limits the MI355X's efficiency in training runs that require frequent collective operations across many GPUs.
The CUDA software ecosystem remains wider. Many research implementations are written in CUDA-native code that requires porting to HIP or ROCm equivalents. Not all custom kernels have been ported, and some optimization libraries (FlashAttention 3, for example) had more complete and performant CUDA implementations than ROCm equivalents as of late 2025.
FP4 software maturity lags. NVIDIA's TensorRT-LLM had more mature FP4 kernel implementations than AMD's ROCm FP4 stack at the MI355X's GA date. AMD narrowed this gap with ROCm 7.0 updates but the B300 retained a 1.3x advantage in sustained FP4 throughput in independent testing.
The 1,400 W TDP requires liquid cooling infrastructure. Data centers without existing DLC investment must upgrade their cooling plant to deploy the MI355X, adding upfront capital costs that the lower-power MI350X avoids.
Scale-out networking is 400 Gbps per GPU, the same as the B200, but NVIDIA moved to 800 Gbps with the B300/GB300 platform while AMD's 800G "Vulcano" NIC was not expected to ship until late 2026.