The AMD Instinct MI355X is a high-end data center GPU accelerator built on AMD's CDNA 4 architecture, announced at the AMD Advancing AI 2025 event on June 12, 2025, and reaching general availability in October 2025. It is the flagship GPU in the MI350 series, positioned as AMD's primary response to NVIDIA's Blackwell generation of accelerators. The MI355X features 288 GB of HBM3E memory, 8 TB/s of memory bandwidth, 185 billion transistors, and hardware support for FP4 and FP6 inference formats. At 1,400 W total board power, it is designed for liquid-cooled data center deployments and is the higher-power sibling of the 1,000 W MI350X. AMD claims the MI355X delivers up to 3.9x more AI compute than the MI300X generation and up to 35x better inference throughput on certain workloads.
Oracle Cloud Infrastructure (OCI) became the first hyperscaler to offer the MI355X publicly, deploying it in superclusters scaling up to 131,072 GPUs. The chip competed directly against the NVIDIA B200 on single-node inference benchmarks, matching its Offline throughput on Llama 2 70B in the MLPerf Inference 6.0 submission. AMD cited a 40% better tokens-per-dollar figure compared to the B200 at typical cloud pricing.
AMD's Instinct GPU line is a family of data center accelerators designed separately from the consumer Radeon graphics products. The Instinct series uses AMD's Compute DNA (CDNA) architecture, which is optimized for matrix multiplication and memory-bandwidth-intensive workloads rather than rendering. CDNA 1 debuted with the MI100 in late 2020, followed by CDNA 2 in the MI200 series in 2021.
The AMD Instinct MI300X launched in late 2023 on CDNA 3 architecture and became a commercially significant product for AMD, attracting major cloud deployments for inference at a time when GPU supply from NVIDIA was constrained. The MI300X carried 192 GB of HBM3 memory with 5.3 TB/s of bandwidth and a 750 W TDP. Microsoft deployed it on Azure for both proprietary and open-source model serving. Meta confirmed deployments for the MI300 series as well. The MI300X was important because its memory capacity allowed single-GPU inference of models that would require two NVIDIA H100 cards.
The MI325X followed in late 2024 as an incremental update on CDNA 3, increasing memory to 256 GB of HBM3E and memory bandwidth to 6.0 TB/s while keeping the 1,000 W power envelope. It did not introduce a new architecture; the compute fabric remained CDNA 3. The MI325X served as a bridge product while AMD completed the architectural redesign for CDNA 4.
Work on CDNA 4 involved moving the compute chiplets from TSMC's 5nm node (used for CDNA 3 XCDs) to TSMC's N3P process. AMD also redesigned the I/O die arrangement, reducing from four separate I/O dies in the MI300X to two larger dies in the MI350 series, which simplified the inter-die fabric. The 3D chiplet packaging moved to TSMC's CoWoS-S advanced packaging technology.
The CDNA 4 architecture is the foundation shared by all MI350-series GPUs. AMD introduced several significant microarchitectural changes compared to CDNA 3.
Each Accelerator Complex Die (XCD) in CDNA 4 contains 32 active compute units, down from 38 in the CDNA 3 XCDs used in the MI300X. AMD reduced the CU count deliberately as part of a broader redesign of the matrix execution hardware. The matrix cores within each CU were rebuilt to deliver twice the per-CU throughput for FP8 workloads: from 4,096 FLOPS per clock in CDNA 3 to 8,192 FLOPS per clock in CDNA 4. At the ISSCC 2026 conference in February 2026, AMD fellow design engineer Ramasamy Adaikkalavan explained the approach. The team concluded that the previous generation's per-CU density was underutilized and that adding more matrix computation per CU was more efficient than simply adding more CUs.
The full MI355X package contains 8 XCDs, giving a total of 256 compute units and 16,384 stream processors (64 per CU). The maximum clock speed is 2.4 GHz with liquid cooling.
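These headline figures can be cross-checked with simple arithmetic. A minimal sketch, assuming the 10.1 PFLOPS headline FP8 figure includes the 2x credit for 2:4 structured sparsity (an assumption; the spec table below does not label it):

```python
# Peak FP8 throughput from the per-CU figures quoted above.
CUS = 256                    # 8 XCDs x 32 CUs
CLOCK_HZ = 2.4e9             # maximum boost clock with liquid cooling
FP8_PER_CU_PER_CLOCK = 8192  # dense FP8 FLOPS/clock after the CDNA 4 redesign

dense = CUS * FP8_PER_CU_PER_CLOCK * CLOCK_HZ
print(f"dense FP8 peak:    {dense / 1e15:.2f} PFLOPS")       # ~5.03
# The 10.1 PFLOPS headline is consistent with a further 2x from
# 2:4 structured sparsity (assumption, see lead-in).
print(f"with 2:4 sparsity: {2 * dense / 1e15:.2f} PFLOPS")   # ~10.07, i.e. 10.1
```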
CDNA 4 introduces hardware-native support for FP4 and FP6 data types, including the OCP Microscaling (MX) format variants MXFP4 and MXFP6. This is the first AMD data center architecture to support sub-FP8 precision at the hardware level; NVIDIA added native FP4 in Blackwell. The MX formats use block exponent scaling: a small group of values shares a single scaling factor stored separately, reducing overhead compared to per-element scaling. This lets quantized models retain accuracy at 4-bit precision where naive integer quantization would degrade it.
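The block-scaling idea is simple enough to sketch. The following is a simplified NumPy illustration, not the exact OCP codec: the real MXFP4 format stores an 8-bit power-of-two scale (E8M0) per 32-element block and packs elements as 4-bit E2M1 values, while this sketch keeps everything in float for readability and uses a ceiling rule for the scale where the spec saturates instead:

```python
import numpy as np

BLOCK = 32  # the MX spec fixes the block size at 32 elements

# Magnitudes representable by FP4 E2M1 (sign handled separately)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x):
    """Block-scaled FP4 quantization in the spirit of MXFP4.

    Each block of 32 values shares one power-of-two scale; elements
    snap to the nearest FP4 E2M1 magnitude. Simplified sketch only.
    """
    x = x.reshape(-1, BLOCK)
    absmax = np.abs(x).max(axis=1, keepdims=True)
    # Shared power-of-two scale so the block maximum fits FP4's max (6.0)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(absmax, 1e-30) / FP4_GRID[-1]))
    scaled = x / scale
    # Snap each element's magnitude to the nearest representable value
    dist = np.abs(np.abs(scaled)[..., None] - FP4_GRID)
    codes = FP4_GRID[dist.argmin(axis=-1)] * np.sign(scaled)
    return codes, scale

def dequantize_mxfp4(codes, scale):
    return (codes * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4 * BLOCK)
codes, scale = quantize_mxfp4(w)
print("mean abs error:", np.abs(w - dequantize_mxfp4(codes, scale)).mean())
```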
The CDNA 4 matrix cores support a full range of formats: FP64, TF32, FP32, BF16, FP16, MXFP8, OCP-FP8, MXFP6, FP6, MXFP4, and FP4. The architecture provides 1,024 matrix cores in total across the 8 XCDs.
The MI300X used four separate I/O dies. AMD consolidated these into two larger dies in the MI350/MI355X. Each pair of XCDs connects to one I/O die. The I/O dies are fabricated on TSMC N6, a more mature 6nm process. The consolidation reduced the number of die-to-die crossings, which previously required extra protocol translation logic. AMD repurposed the freed area to widen the Infinity Fabric data pipeline.
The result is the fourth-generation Infinity Fabric (IF4), which carries 2 TB/s more aggregate bandwidth than the IF3 implementation in the MI300X and provides 7 Infinity Fabric links per socket. Inter-die bandwidth within the package is 5.5 TB/s bidirectional.
The L2 cache per XCD is 4 MB, coherent across XCDs. The Local Data Share (LDS) capacity per CU was doubled compared to the MI300X, and the Infinity Cache is 256 MB.
For memory bandwidth, AMD redesigned the HBM controller to run more efficiently. Adaikkalavan noted at ISSCC 2026 that the raw bandwidth rose 1.5x (from 5.3 to 8.0 TB/s) and AMD achieved a 1.3x improvement in HBM read bandwidth per watt compared to the MI300X. Custom interconnect wire engineering inside the package reduced interconnect power consumption by approximately 20%.
| Specification | Value |
|---|---|
| Architecture | AMD CDNA 4 |
| Process node (XCDs) | TSMC N3P (3nm) |
| Process node (IOD) | TSMC N6 (6nm) |
| Chiplet configuration | 8 XCDs + 2 IODs |
| Transistor count | 185 billion |
| Compute units | 256 (32 per XCD) |
| Stream processors | 16,384 |
| Matrix cores | 1,024 |
| Max clock speed | 2.4 GHz |
| FP64 performance | 78.6 TFLOPS |
| FP32 performance | 157.2 TFLOPS |
| FP16 performance | 5.0 PFLOPS |
| FP8 performance | 10.1 PFLOPS |
| FP6 performance | 20.1 PFLOPS |
| FP4 performance | 20.1 PFLOPS |
| Memory | 288 GB HBM3E |
| Memory bandwidth | 8 TB/s |
| HBM stacks | 8 stacks, 12-Hi, 36 GB each |
| L2 cache | 32 MB (4 MB per XCD) |
| Infinity Cache | 256 MB |
| Total board power (TBP) | 1,400 W |
| Cooling | Liquid cooling required |
| Form factor | OAM (OCP Accelerator Module) |
| Interconnect | Infinity Fabric 4, 7 links |
| PCIe | PCIe 5.0 |
| Scale-out networking | 400 Gbps Ethernet (Pollara NIC) |
| GA date | October 2025 |
The MI355X carries 288 GB of HBM3E memory across 8 physical stacks. Each stack is a 12-Hi (12-layer) configuration with 36 GB of capacity, running at 8 Gbps per pin across a 1,024-bit stack interface, or 1 TB/s per stack. AMD sourced HBM3E from both Samsung Electronics and Micron Technology, making the MI355X one of the first major AI accelerators to use dual suppliers for its HBM. Samsung confirmed its supply of 12-Hi HBM3E to the MI350 family at the Advancing AI 2025 event; Micron separately confirmed its supply.
The 288 GB figure, matched only by NVIDIA's B300, is the largest memory capacity available on any single GPU accelerator as of late 2025. The NVIDIA B200 carries 180 GB of HBM3E, and the H200 carried 141 GB, giving the MI355X 1.6x the memory of the B200. The practical implication is model size: the MI355X can run models up to approximately 520 billion parameters in FP4 precision without any model parallelism across multiple cards, according to AMD. In FP8, the effective capacity narrows but still exceeds 250 billion parameters on a single GPU.
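A weight-only estimate makes the arithmetic behind those limits visible; this sketch ignores KV cache, activations, and runtime overhead, all of which reduce the usable headroom:

```python
HBM_GB = 288  # MI355X capacity

def weights_gb(params_b, bits):
    """Weight-only footprint in GB for a params_b-billion-parameter model."""
    return params_b * 1e9 * bits / 8 / 1e9

for params_b, bits in [(520, 4), (405, 8), (250, 8), (70, 16)]:
    gb = weights_gb(params_b, bits)
    verdict = "fits" if gb <= HBM_GB else "needs >1 GPU"
    print(f"{params_b:>4}B @ FP{bits}: {gb:6.0f} GB weights -> {verdict}")
# 520B @ FP4 = 260 GB and 250B @ FP8 = 250 GB both fit in 288 GB,
# matching the approximate single-GPU limits cited above.
```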
Memory bandwidth is 8 TB/s, compared to roughly 7.7 TB/s for the B200 and 4.8 TB/s for the H200. Higher bandwidth directly benefits the memory-bound decode phase of LLM inference, where each generated token requires streaming the model weights from memory. For a fixed batch size, more bandwidth translates to more tokens per second.
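A simple bandwidth roofline illustrates the point, under the idealized assumption that each generated token streams the full weight set once (batch size 1, no KV-cache traffic); the bandwidth figures are the ones cited above:

```python
def decode_tokens_per_s(params_b, bits, bw_tb_s):
    """Bandwidth roofline: every generated token streams all weights once."""
    bytes_per_token = params_b * 1e9 * bits / 8
    return bw_tb_s * 1e12 / bytes_per_token

for name, bw in [("MI355X", 8.0), ("B200", 7.7), ("H200", 4.8)]:
    print(f"{name}: {decode_tokens_per_s(70, 8, bw):5.0f} tok/s "
          f"(70B model, FP8 weights, batch 1)")
```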
The MI355X's most significant performance numbers come from its low-precision matrix compute.
FP8 performance is 10.1 PFLOPS. At MLPerf Inference 5.1 (results published in 2025), the MI355X achieved 93,045 tokens per second on Llama 2 70B with FP8 weights, a 2.7x improvement over the MI325X. The improvement reflects both the CDNA 4 matrix core redesign and optimizations in the AMD ROCm software stack.
At MLPerf Inference 6.0 (results published in early 2026), the single-node MI355X reached 100,282 tokens per second on Llama 2 70B Server, which AMD characterized as a 3.1x improvement over the prior MI325X submission. Against NVIDIA B200 at MLPerf 6.0, the MI355X matched it in Offline throughput and delivered 97% of B200's Server performance. Against the B300, the MI355X delivered 92% in Offline, 93% in Server, and exceeded it with 104% in Interactive mode.
FP6 and FP4 performance are both rated at 20.1 PFLOPS. Doubling FP8 at FP4 is consistent with halving the bit width; running FP6 at the full FP4 rate rather than an intermediate rate is a deliberate CDNA 4 design choice. AMD compared CDNA 4 CUs in FP6 to NVIDIA B200 streaming multiprocessors and stated that per-CU throughput is roughly equivalent.
The B300 showed a 1.3x advantage over the MI355X on FP4 in SemiAnalysis testing, which SemiAnalysis attributed to NVIDIA's larger FP4 compute array and more mature FP4 kernel implementations in TensorRT-LLM at the time.
SemiAnalysis InferenceX benchmarks published in 2025 tested DeepSeek R1 and Llama 3.1 405B across multiple serving configurations. For single-node FP8 aggregated serving using SGLang, the MI355X delivered better performance per TCO than the B200 on most tested scenarios. AMD claimed 20% higher throughput than B200 on an 8-GPU DeepSeek R1 configuration and 30% higher throughput on Llama 3.1 405B at 8 GPUs.
ROCm (Radeon Open Compute) is AMD's open-source software platform for GPU computing. ROCm provides the HIP runtime and programming model (a close analogue of CUDA that eases porting), the compiler toolchain, math libraries, and deep learning primitives.
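In practice, Python code written against the torch.cuda API generally runs unchanged on ROCm builds of PyTorch, which map that namespace onto HIP devices. A minimal check (the printed device name and HIP version are illustrative):

```python
import torch

# On a ROCm build of PyTorch, torch.cuda.* maps to HIP/ROCm devices,
# so CUDA-targeted scripts usually run without source changes.
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))  # e.g. an Instinct GPU
    print("HIP runtime:", torch.version.hip)         # None on CUDA builds
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    y = x @ x  # dispatched to ROCm BLAS libraries under the hood
    print("matmul ok:", y.shape)
```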
ROCm 7.0, released alongside the MI355X's general availability in October 2025, added native support for the MI355X (gfx950 GPU target) and the MI350X.
The MI355X supports all major deep learning frameworks through ROCm 7.0, including PyTorch, TensorFlow, and JAX.
ROCm became a first-class platform in the vLLM ecosystem in late 2025. AMD's approach was to contribute upstream to vLLM's core rather than maintaining a separate fork, which reduced divergence and allowed ROCm users to track the main vLLM release schedule.
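Because the support is upstream, the standard vLLM entry points are identical on ROCm and CUDA installs; a minimal sketch (model name illustrative, assuming a ROCm build of vLLM is installed):

```python
from vllm import LLM, SamplingParams

# Identical code path on ROCm and CUDA builds of vLLM; the backend
# (HIP kernels vs CUDA kernels) is selected at install/build time.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain HBM3E in one sentence."], params)
print(outputs[0].outputs[0].text)
```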
AMD's software ecosystem has historically lagged NVIDIA's CUDA in maturity and coverage. CUDA's installed base of libraries, tools, and developer knowledge represents a durable competitive advantage for NVIDIA that extends beyond raw hardware metrics. As of 2025, many custom CUDA kernels written by frontier labs do not have ROCm equivalents, requiring either porting effort or reliance on less-optimized fallbacks. AMD has invested heavily to close this gap, but independent analyses found that training workloads with extensive CUDA-specific dependencies still favor NVIDIA hardware due to software optimization rather than hardware limitations.
The MI355X's primary competitive target is the NVIDIA Blackwell B200 and GB200 platform.
| Specification | AMD MI355X | NVIDIA B200 | NVIDIA B300 |
|---|---|---|---|
| Architecture | CDNA 4 | Blackwell | Blackwell Ultra |
| Process node | TSMC N3P (XCDs) | TSMC 4NP | TSMC 4NP |
| Transistors | 185B | 208B | ~230B |
| Memory | 288 GB HBM3E | 180 GB HBM3E | 288 GB HBM3E |
| Memory bandwidth | 8 TB/s | 7.7 TB/s | 8 TB/s |
| FP8 performance | 10.1 PFLOPS | 9 PFLOPS | 15 PFLOPS |
| FP4 performance | 20.1 PFLOPS | 18 PFLOPS | 30 PFLOPS |
| FP64 performance | 78.6 TFLOPS | 40 TFLOPS | ~80 TFLOPS |
| TDP | 1,400 W | 1,000 W | 1,400 W |
| Scale-up interconnect | 76.8 GB/s per link (IF4) | 900 GB/s (NVLink 5) | 900 GB/s (NVLink 5) |
| Scale-out networking | 400 Gbps | 400 Gbps (B200) / 800 Gbps (GB200) | 800 Gbps |
| GA date | October 2025 | Q1 2025 | Q3 2025 |
The MI355X has 108 GB more memory than the B200 and modestly higher memory bandwidth (8 TB/s versus roughly 7.7 TB/s). On memory-bound workloads such as LLM decode at low batch sizes, the capacity and bandwidth advantages translate directly to throughput. For large model inference (300 billion+ parameters), the MI355X can serve models that require two or more B200 cards, reducing node count and network communication overhead.
The most significant structural disadvantage of the MI355X versus the GB200 NVL72 system is scale-up interconnect bandwidth. Fifth-generation NVLink provides 1.8 TB/s of bidirectional bandwidth per GPU (900 GB/s in each direction), enabling tight GPU-to-GPU communication across all 72 GPUs in a rack. The MI355X's Infinity Fabric 4 provides 76.8 GB/s per link in each direction, or roughly 538 GB/s of aggregate scale-up bandwidth across all 7 links, well under NVLink's per-GPU figure.
For frontier model training across hundreds of GPUs, low-latency scale-up interconnect matters more than memory bandwidth, because gradient all-reduce operations are bottlenecked by communication throughput. The-decoder.com noted that "AMD's MI350 chips deliver big on memory but lag in networking against Nvidia." Hume Consulting described the MI355X as "strong node-level inference, but not yet rack-scale."
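The sensitivity is easy to see with the standard ring all-reduce cost model, in which each GPU moves roughly 2(N-1)/N times the gradient size per step. The sketch below is a lower bound that ignores latency, overlap, and hierarchical collectives, with an illustrative gradient size:

```python
def ring_allreduce_seconds(grad_gb, n_gpus, per_gpu_gb_s):
    """Lower bound for a ring all-reduce: each GPU moves ~2(N-1)/N x size."""
    bytes_moved = 2 * (n_gpus - 1) / n_gpus * grad_gb * 1e9
    return bytes_moved / (per_gpu_gb_s * 1e9)

GRAD_GB = 140  # e.g. BF16 gradients for a 70B-parameter model (illustrative)
for name, bw in [("MI355X, 7x IF4 (~538 GB/s)", 538), ("NVLink (900 GB/s)", 900)]:
    t = ring_allreduce_seconds(GRAD_GB, 8, bw)
    print(f"{name}: {t * 1e3:.0f} ms per all-reduce of {GRAD_GB} GB")
```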
AMD acknowledged this limitation and addressed it in the MI400 series roadmap with a new scale-up fabric targeting 72-GPU interconnect at competitive bandwidth to NVLink.
The B300 is approximately 1.3x faster than the MI355X on FP4, according to SemiAnalysis InferenceX v2 benchmarks. At the time of writing, NVIDIA's TensorRT-LLM had more mature FP4 kernel implementations than ROCm's FP4 stack, so part of the B300's advantage was attributed to software rather than pure hardware; the remainder reflects the B300's higher peak FP4 rating.
For FP64 workloads (double precision), the MI355X delivers 78.6 TFLOPS compared to approximately 40 TFLOPS for the B200. This makes the MI355X competitive for HPC workloads such as molecular dynamics, climate simulation, and computational fluid dynamics.
Total cost of ownership (TCO) is the metric AMD emphasized most in positioning the MI355X. AMD claimed up to 40% more tokens per dollar on Llama 3.1 405B inference compared to B200 at cloud pricing.
The TCO advantage comes from several factors:
First, higher memory capacity reduces node count for large model inference. A cluster serving 400B-parameter models needs fewer MI355X nodes than B200 nodes because each GPU holds a larger share of the model, which also cuts inter-node communication hops. Fewer nodes means lower server costs, lower networking costs, and lower power infrastructure investment.
Second, AMD's pricing strategy has historically targeted a discount relative to NVIDIA's top-tier offerings. While exact list prices for the MI355X were not publicly disclosed at launch, cloud pricing on OCI at GA was competitive with B200 pricing on comparable platforms.
Third, at high concurrency (batch sizes above 64), the MI355X's memory capacity and bandwidth headroom allow larger batches and KV caches, delivering more throughput per dollar as servers are fully utilized.
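The tokens-per-dollar claim reduces to straightforward arithmetic once throughput and hourly price are fixed. The figures below are placeholders chosen only to show the shape of the calculation, not quoted prices or measured throughput:

```python
# Tokens per dollar = sustained tokens/s * 3600 / hourly instance price.
def tokens_per_dollar(tok_per_s, usd_per_hour):
    return tok_per_s * 3600 / usd_per_hour

# Placeholder throughput and pricing, for illustration only.
mi355x = tokens_per_dollar(tok_per_s=12_000, usd_per_hour=10.0)
b200   = tokens_per_dollar(tok_per_s=11_000, usd_per_hour=13.0)
print(f"MI355X: {mi355x:,.0f} tok/$   B200: {b200:,.0f} tok/$")
print(f"advantage: {mi355x / b200 - 1:.0%}")  # ~42% with these placeholders
```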
Clarifai benchmarks published in 2025 showed that at batch sizes of 1 to 4, NVIDIA H100 with TensorRT-LLM held a 20-30% throughput advantage. At batch sizes of 64 to 128, the gap narrowed to 5-10%. At sustained serving loads, the MI355X's bandwidth and memory capacity made it more efficient per dollar.
SemiAnalysis's InferenceX analysis of FP8 workloads found that the MI355X (SGLang) beat or matched B200 (TRT and SGLang) in performance-per-TCO for most single-node scenarios. In multi-node FP4 scenarios, NVIDIA retained an advantage.
The MI350 series contains two GPU variants: the MI350X and the MI355X. Both share the same CDNA 4 die configuration, memory capacity, and memory bandwidth. The differences are in power budget and target cooling environment.
| Feature | MI350X | MI355X |
|---|---|---|
| TDP | 1,000 W | 1,400 W |
| Cooling requirement | Air cooling | Liquid cooling (DLC) |
| Max clock speed | Lower (thermally limited) | 2.4 GHz |
| FP8 performance | ~9 PFLOPS | 10.1 PFLOPS |
| FP4 performance | ~18 PFLOPS | 20.1 PFLOPS |
| GPUs per rack (air) | 64 | N/A |
| GPUs per rack (DLC) | N/A | 128 |
| Total rack HBM3E | 18 TB (air) | 36 TB (DLC) |
The MI350X is designed for existing data centers with air-cooled infrastructure where retrofitting liquid cooling is not feasible. The MI355X targets new builds or facilities already equipped for direct liquid cooling (DLC). Liquid cooling removes the thermal constraints that would otherwise force the GPU to reduce its clock speed under sustained load. The MI355X operates at 2.4 GHz continuously, while the MI350X runs at a lower sustained clock to stay within its air-cooling thermal budget.
The 40% higher power draw of the MI355X over the MI350X yields roughly a 7-10% improvement in raw TFLOPS figures at the same precision. The real advantage is sustained throughput: fewer thermal throttle events and consistent performance over hours of continuous inference or training.
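Using the rounded figures from the comparison table above, the efficiency trade is visible directly:

```python
# FP4 peak PFLOPS and board power from the table above (MI350X values
# are the approximate "~" figures).
variants = {"MI350X": (18.0, 1000), "MI355X": (20.1, 1400)}
for name, (pflops, watts) in variants.items():
    print(f"{name}: {pflops / (watts / 1000):.1f} peak PFLOPS per kW")
# ~18.0 vs ~14.4 PFLOPS/kW: with these rounded figures the MI355X buys
# about 12% more peak throughput for 40% more power; its real value is
# sustained clocks under load, not peak efficiency.
```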
DLC racks with 128 MI355X GPUs deliver 36 TB of HBM3E per rack, which AMD positioned as an advantage for very large model training where keeping model weights and activations in GPU memory reduces host memory spill.
Oracle Cloud Infrastructure (OCI) was the first hyperscaler to deploy the MI355X publicly and announced general availability of OCI Compute with AMD Instinct MI355X GPUs on October 14, 2025. Oracle and AMD announced a deployment target exceeding 131,072 MI355X GPUs in a single zettascale supercluster. Oracle separately committed to deploying 50,000 MI450 GPUs (next generation) from AMD beginning in Q3 2026, indicating a long-term partnership agreement.
OCI's MI355X implementation uses the OCI Supercluster architecture with AMD Pensando Pollara Ultra-Ethernet NICs providing 400 Gbps scale-out networking. AMD published detailed performance and technical documentation for the OCI MI355X deployment.
Microsoft deployed the MI300X on Azure for both proprietary and open-source AI model inference, and had publicly confirmed MI300X deployments powering production serving workloads on Azure before the MI355X launch. AMD's press materials for the MI350 series noted Meta's MI300 deployments and Meta's stated commitment to future products, including the MI350 series. Microsoft's specific MI355X deployment timeline had not been publicly confirmed at GA.
Meta discussed its MI300 series deployments at AMD's Advancing AI 2025 event and indicated alignment with future AMD products. Meta was identified among AMD's largest inference customers.
In AMD's MLPerf Inference 6.0 submission round, nine companies submitted results on AMD Instinct hardware: Cisco, Dell Technologies, Giga Computing, HPE, MangoBoost, MiTAC, Oracle, Red Hat, and Supermicro. Partner results typically matched AMD's internal figures within 4%, with some within 1%, indicating a production-ready supply chain and consistent software performance across vendors.
Dell and MangoBoost demonstrated a heterogeneous cluster configuration combining MI300X, MI325X, and MI355X GPUs across geographic locations, achieving 141,521 tokens per second on Llama 2 70B Server. This showed the ROCm stack's ability to manage mixed-generation hardware for distributed inference.
The MI355X's primary commercial use case is large language model inference serving. The combination of 288 GB memory and 8 TB/s bandwidth allows single-node inference for models up to approximately 520 billion parameters in FP4. This covers nearly all publicly available open-weight models as of 2025, including Llama 3.1 405B, Mistral models, and DeepSeek V3 (671B MoE, which has a smaller active parameter count at runtime).
For inference serving, memory bandwidth is the most important factor at typical batch sizes: more bandwidth means more tokens per second for a given model. The MI355X's 8 TB/s edges out the B200's roughly 7.7 TB/s, and combined with the larger memory capacity it is particularly well suited to latency-sensitive serving at low-to-medium concurrency.
For training, the MI355X's FP8 compute (10.1 PFLOPS) and FP16 compute (5.0 PFLOPS) are competitive with B200. AMD submitted MLPerf Training v5.1 results using the MI350 series and showed competitive performance on GPT-3 and ResNet workloads. The scale-up networking limitation affects very large training runs that require frequent all-reduce operations across many nodes. For training runs of models up to a few hundred billion parameters that fit within a single 8-GPU node, the MI355X is fully competitive.
AMD's ROCm blogs published training performance comparisons in late 2025 showing MI355X within 5-10% of B200 on Llama 3 fine-tuning using DeepSpeed and within 3% on BERT pretraining.
The MI355X's FP64 performance (78.6 TFLOPS) is approximately double the B200's FP64 (roughly 40 TFLOPS). This matters for traditional HPC workloads that require full double-precision accuracy: computational fluid dynamics, finite element analysis, molecular dynamics, seismic imaging, and climate modeling. The AMD CDNA line has historically retained higher FP64 compute than NVIDIA's AI-focused accelerators, which deprioritize FP64 to allocate die area toward AI formats.
For graph neural networks used in drug discovery, the MI355X's large memory allows holding molecular graphs with millions of nodes in GPU memory, avoiding expensive data-fetching from host. Multimodal training with image, text, and video modalities simultaneously benefits from the combined memory and bandwidth.
AMD announced the MI400 series at its Advancing AI 2025 event with a targeted launch in the second half of 2026. The MI400 series uses the CDNA 5 architecture on TSMC's 2nm (N2) process; announced targets include 432 GB of HBM4 memory, 19.6 TB/s of memory bandwidth, and 40 PFLOPS of FP4 compute, roughly double the MI355X's headline figures.
The MI400 series addresses the MI355X's main competitive weakness on networking. The 72-GPU scale-up fabric directly targets the GB200 NVL72 system's topology. AMD confirmed MI400 variants at CES 2026: MI455X for cloud and training, MI430X for HPC and government, and MI440X as a rack-mounted server combining 8 GPUs with an EPYC Venice CPU.
At the CES 2026 event, AMD claimed the MI400 series would match or exceed GB200 NVL72 on most common frontier model training workloads.
AMD announced the MI500 series with a 2027 target, based on the CDNA 6 architecture, a TSMC 2nm-class process, and HBM4E memory. AMD claimed the MI500 series would deliver up to 1,000x the AI performance of the MI300X generation, a figure that compounds the generational gains: roughly 4x from the MI300X to the MI355X, with the remainder projected across the MI400 and MI500 generations.
The MI355X has several practical limitations relative to the NVIDIA Blackwell platform:
Scale-up networking is the most significant. Infinity Fabric 4's 76.8 GB/s per-link bandwidth, totaling roughly 538 GB/s per GPU, falls well short of NVLink 5's 900 GB/s per direction. This limits the MI355X's efficiency in training runs that require frequent collective operations across many GPUs.
The CUDA software ecosystem remains wider. Many research implementations are written in CUDA-native code that requires porting to HIP or ROCm equivalents. Not all custom kernels have been ported, and some optimization libraries (FlashAttention 3, for example) had more complete and performant CUDA implementations than ROCm equivalents as of late 2025.
FP4 software maturity lags. NVIDIA's TensorRT-LLM had more mature FP4 kernel implementations than AMD's ROCm FP4 stack at the MI355X's GA date. AMD narrowed this gap with ROCm 7.0 updates but the B300 retained a 1.3x advantage in sustained FP4 throughput in independent testing.
The 1,400 W TDP requires liquid cooling infrastructure. Data centers without existing DLC investment must upgrade their cooling plant to deploy the MI355X, adding upfront capital costs that the lower-power MI350X avoids.
Scale-out networking is 400 Gbps per GPU, the same as the B200, but NVIDIA moved to 800 Gbps with the B300/GB300 platform while AMD's 800G "Vulcano" NIC was not expected to ship until late 2026.