ROCm


Updated Jul 23, 2026

ROCm is AMD's open software stack for GPU computing and the main alternative to NVIDIA's CUDA platform. AMD's own documentation defines it as "a software stack, composed primarily of open-source software, that provides the tools for programming AMD Graphics Processing Units (GPUs), from low-level kernels to high-level end-user applications." ^[1] The name began as shorthand for Radeon Open Compute, though AMD no longer treats it as a formal acronym because Open Compute is a registered trademark. ROCm bundles the drivers, compilers, programming models, math and machine-learning libraries, and developer tools needed to run high-performance computing and AI workloads on AMD GPUs. Most of the stack is published as open-source software under various licenses, with the main repository released under the MIT License. That sets it apart from CUDA, whose core toolkit remains proprietary. AMD positions ROCm as the software half of its data-center AI strategy, paired with the Instinct line of accelerators. ^[1]^[2]

For years ROCm was treated as a curiosity outside a handful of national labs. That has changed. As AMD's Instinct MI300X and later parts started landing in large clusters, ROCm became the layer that decides whether all that silicon can actually be used. The September 16, 2025 general release of ROCm 7 was AMD's clearest signal that it wants the software judged on the same terms as the hardware. ^[12]

What is ROCm?

ROCm sits between AMD GPU hardware and the frameworks that data scientists and HPC programmers use. At the bottom are a kernel-mode driver and a user-space runtime. On top of those sit compilers, a set of accelerated libraries, and tooling for debugging and profiling. Above all of that, frameworks like PyTorch and JAX call into ROCm without most users ever touching it directly. AMD describes the stack as covering "compilers, libraries for high-level functions, debuggers, profilers and runtimes," and exposing three programming interfaces: HIP, OpenCL, and OpenMP. ^[1]

The stack runs mostly on Linux, with official packages for Ubuntu, Red Hat Enterprise Linux, and SUSE Linux Enterprise Server, plus support for several other distributions such as Debian and Oracle Linux. AMD has been extending coverage to Windows as well, where the HIP SDK and parts of the toolchain now ship for select consumer and workstation GPUs. ^[3]^[4]

The open-source posture is the point. CUDA has a long head start and a deep ecosystem, but it is controlled entirely by one vendor. ROCm gives cloud providers, labs, and large AI shops a stack they can inspect, patch, and in some cases fork. That openness is a big part of why hyperscalers have been willing to bet on AMD as a second source for GPU compute. ^[2]

How does the HIP programming model work?

The centerpiece of ROCm's developer story is HIP, short for Heterogeneous-computing Interface for Portability. HIP is a C++ runtime API and kernel language for writing GPU code that can run on AMD hardware, and, with a thin compatibility layer, on NVIDIA hardware too. Its syntax was deliberately kept close to CUDA. A developer who knows how to write a CUDA kernel can read and write HIP almost immediately, because the function names, the launch syntax, and the memory model all mirror CUDA's design. ^[5]

That closeness is strategic. AMD's bet is that most GPU code already exists in CUDA, so the fastest way to grow ROCm is to make porting cheap. HIP is meant to be a thin layer that adds little or no overhead. On AMD GPUs it compiles down through the ROCm compiler. On NVIDIA GPUs it compiles to CUDA underneath, so the same source can target both vendors from one codebase. AMD is careful to note that HIP is not a drop-in replacement: a real port usually still needs manual fixes and performance tuning. ^[5]

To move existing code over, ROCm ships HIPIFY, a set of porting tools. One variant, hipify-perl, does fast pattern-based text substitution and swaps CUDA API calls for their HIP equivalents. The other, hipify-clang, parses the source with the Clang front end and rewrites it from the abstract syntax tree, which catches more cases and handles tricky code more reliably. Neither tool promises a perfect one-shot conversion, but both cut the manual effort of a port substantially. ^[5]^[6]
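The pattern-based approach can be illustrated with a toy rewriter. This is a minimal sketch of the idea behind text substitution, not the real hipify-perl: the `toy_hipify` function and its six-entry rename table are hypothetical and cover only a sliver of the API, although the HIP names in the table are the actual counterparts of the CUDA calls shown.

```python
import re

# A tiny, hand-picked subset of CUDA-to-HIP renames. The real HIPIFY tools
# cover far more of the API; this table is illustrative only.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpy": "hipMemcpy",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Try longer names first so "cudaMemcpyHostToDevice" is not cut short
# by the shorter "cudaMemcpy" alternative.
_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
)


def toy_hipify(source: str) -> str:
    """Replace every known CUDA identifier in `source` with its HIP name."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


# A small CUDA program held as text; the kernel body and the <<<...>>> launch
# need no renaming, which is why a port can start from plain substitution.
CUDA_SNIPPET = """\
#include <cuda_runtime.h>
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
void run(float *host, int n) {
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(dev);
}
"""

if __name__ == "__main__":
    print(toy_hipify(CUDA_SNIPPET))
```

Plain substitution like this has no notion of scope, macros, or types, which is the gap an AST-based rewriter such as hipify-clang is meant to close.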

Beyond HIP, ROCm also supports OpenCL and OpenMP offload for code that prefers a vendor-neutral API or compiler directives over an explicit GPU runtime, which matters in traditional HPC where Fortran and directive-based C++ are common. ^[1]

Compilers and the toolchain

The ROCm compiler is built on LLVM and Clang. The usual entry point is hipcc, a compiler driver that figures out the right flags and hands work to the underlying amdclang++ compiler, which produces code for AMD's GPU instruction set. Building on LLVM lets AMD share infrastructure with the wider compiler community and keep pace with new language standards. A Flang-based Fortran compiler covers legacy scientific code. ^[1]^[5]

Around the compiler sits the rest of the developer kit. ROCgdb is the debugger, derived from GDB, for stepping through GPU code. The ROCm profilers collect performance counters and traces so engineers can find bottlenecks. Utilities like rocminfo report what the system sees, and the management interfaces, first ROCm SMI and now the newer AMD SMI, expose temperatures, clocks, memory use, and other telemetry for fleet operators. ^[1]
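A script can check what the system sees by calling rocminfo. The sketch below assumes only that a `rocminfo` executable may or may not be on the PATH; the `run_rocminfo` helper is hypothetical, passes no flags, and makes no assumptions about the layout of the tool's output.

```python
import shutil
import subprocess
from typing import Optional


def run_rocminfo(timeout_s: float = 30.0) -> Optional[str]:
    """Return rocminfo's text output, or None if the tool is missing or fails."""
    exe = shutil.which("rocminfo")
    if exe is None:
        # ROCm is not installed, or its bin directory is not on PATH.
        return None
    try:
        result = subprocess.run(
            [exe], capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    report = run_rocminfo()
    print(report if report is not None else "rocminfo not available on this machine")
```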

The library set

Most of ROCm's real-world performance comes from its libraries. They fall into a few groups, and AMD generally ships both a vendor-specific version and a portable HIP-prefixed version that can also run on NVIDIA hardware.

For dense and sparse linear algebra there is rocBLAS, the basic linear algebra subprograms library, along with hipBLAS as its portable wrapper and hipBLASLt for lighter-weight, extended matrix operations that matter a lot in transformer workloads. rocSPARSE and rocSOLVER handle sparse matrices and linear solvers. rocFFT covers fast Fourier transforms, with hipFFT as the portable interface. rocRAND generates random numbers on the GPU. A separate family of primitive libraries, rocPRIM, hipCUB, and rocThrust, provides the parallel building blocks that higher-level code is assembled from. ^[7]
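The core operation a BLAS library accelerates is the general matrix multiply, conventionally written C = alpha * A * B + beta * C. The pure-Python reference below is only a sketch of that arithmetic for small row-major lists; the `gemm` function is hypothetical and is not the rocBLAS or hipBLAS API.

```python
from typing import List

Matrix = List[List[float]]


def gemm(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix) -> Matrix:
    """Return alpha * (a @ b) + beta * c for row-major lists of lists."""
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of a and b must match")
    n = len(b[0])
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(gemm(2.0, a, b, 3.0, c))
```

A tuned library computes the same result but tiles the loops to fit the GPU's memory hierarchy and matrix units, which is where the performance comes from.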

The AI-specific libraries are where ROCm has invested the hardest. MIOpen is AMD's deep-learning primitives library, the rough counterpart to NVIDIA's cuDNN, supplying optimized convolution, pooling, normalization, and attention operations. MIGraphX is a graph optimization and inference engine. Composable Kernel is a performance-portable library for fusing operations into single high-throughput kernels, which is central to getting good utilization on matrix-heavy AI math. For multi-GPU and multi-node training there is RCCL, the ROCm Communication Collectives Library, AMD's answer to NCCL. RCCL handles the all-reduce and broadcast collectives that synchronize gradients across large clusters, and it is the piece that lets ROCm scale past a single server. ^[7]^[8]
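What an all-reduce and a broadcast guarantee can be shown without any GPU. The sketch below simulates ranks as plain Python lists; `all_reduce_sum` and `broadcast` are hypothetical helpers that model only the result each rank ends up with, not RCCL's API or the communication pattern it uses on the wire.

```python
from typing import List


def all_reduce_sum(per_rank: List[List[float]]) -> List[List[float]]:
    """Give every rank the element-wise sum of all ranks' buffers."""
    if not per_rank:
        return []
    length = len(per_rank[0])
    if any(len(buf) != length for buf in per_rank):
        raise ValueError("every rank must contribute a buffer of the same length")
    total = [sum(buf[i] for buf in per_rank) for i in range(length)]
    # Each rank receives its own copy of the reduced buffer.
    return [list(total) for _ in per_rank]


def broadcast(per_rank: List[List[float]], root: int = 0) -> List[List[float]]:
    """Give every rank a copy of the root rank's buffer."""
    return [list(per_rank[root]) for _ in per_rank]


if __name__ == "__main__":
    # Three simulated GPUs, each holding gradients from its own mini-batch.
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    summed = all_reduce_sum(grads)
    averaged = [[g / len(grads) for g in buf] for buf in summed]
    print(averaged)
```

Dividing the summed buffer by the number of ranks yields the averaged gradient that data-parallel training applies identically on every GPU.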

Which GPUs does ROCm support?

ROCm's primary target is the AMD Instinct line of data-center accelerators. Coverage spans several generations, from the older MI100 and the MI200 series (the MI210, MI250, and MI250X) through the MI300 generation. The MI300X is a discrete GPU built for large-model training and inference, while the MI300A is an APU that fuses CPU and GPU on one package for HPC. The MI325X extended the MI300 line with more memory, and the MI350 series, the MI350X and MI355X, moved to AMD's CDNA 4 architecture and added low-precision FP4 and FP6 number formats. ^[3]^[9]
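To give a feel for how coarse a 4-bit float is, the sketch below rounds values onto an FP4 grid. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) described in the OCP Microscaling formats specification; the article does not state which encoding the hardware uses. The helper names are hypothetical, and the free-form per-block scale is a simplification.

```python
from typing import List, Sequence, Tuple

# Non-negative magnitudes representable in E2M1 (assumed encoding).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Ties go to the smaller magnitude here; real hardware rounding may differ.
    """
    sign = -1.0 if x < 0 else 1.0
    magnitude = min(abs(x), FP4_E2M1_MAGNITUDES[-1])
    nearest = min(FP4_E2M1_MAGNITUDES, key=lambda v: abs(v - magnitude))
    return sign * nearest


def quantize_block(values: Sequence[float]) -> Tuple[float, List[float]]:
    """Scale a block so its largest magnitude maps to 6, then quantize it.

    Returns (scale, codes); multiplying each code by scale dequantizes it.
    """
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / FP4_E2M1_MAGNITUDES[-1] if peak > 0 else 1.0
    return scale, [quantize_fp4(v / scale) for v in values]


if __name__ == "__main__":
    scale, codes = quantize_block([0.0, 1.0, -3.0, 12.0])
    print(scale, codes, [c * scale for c in codes])
```

With only eight magnitudes per sign, accuracy depends heavily on the shared scale, which is why low-precision formats of this kind are normally paired with per-block scaling.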

On the consumer and workstation side, ROCm supports a selected set of Radeon parts. That list has grown to include RDNA 3 cards such as the Radeon RX 7900 XTX, RX 7900 XT, and RX 7900 GRE, newer RDNA 4 cards in the RX 9000 family, and Radeon PRO workstation cards such as the W7900 and W7800. The coverage is narrower than CUDA's near-universal support of NVIDIA GPUs, and support tiers differ between full data-center validation and lighter developer support, but the trend has been toward wider hardware reach with each release. AMD has also previewed the next-generation MI400 series for 2026, built for rack-scale AI as part of a system AMD calls Helios. ^[3]^[10]

Which deep-learning frameworks run on ROCm?

For most AI practitioners, ROCm is invisible because it lives underneath a framework. PyTorch is the headline case. There are ROCm builds of PyTorch that let the same training and inference scripts run on AMD GPUs with little or no code change, since the framework maps its operations onto ROCm libraries behind the scenes. TensorFlow has ROCm support as well, and JAX runs on AMD hardware through the same compiler and library stack. With the MI350 series, PyTorch, TensorFlow, and JAX all expose the new FP4 data type natively. ^[1]^[4]
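The "little or no code change" point can be seen in how a script picks its device. ROCm builds of PyTorch reuse the `torch.cuda` interface and the "cuda" device string, and they report a HIP version in `torch.version.hip`. The sketch below assumes PyTorch may or may not be installed; `describe_backend` and `pick_device` are hypothetical helper names.

```python
def describe_backend() -> str:
    """Return 'rocm', 'cuda', 'cpu', or 'no-torch' for the current environment."""
    try:
        import torch
    except ImportError:
        return "no-torch"
    if not torch.cuda.is_available():
        return "cpu"
    # ROCm builds set torch.version.hip to a version string;
    # CUDA builds leave it as None.
    return "rocm" if getattr(torch.version, "hip", None) else "cuda"


def pick_device() -> str:
    """Device string to pass to tensor.to(...); 'cuda' also addresses AMD GPUs."""
    return "cuda" if describe_backend() in ("rocm", "cuda") else "cpu"


if __name__ == "__main__":
    print(describe_backend(), pick_device())
```

Because the device string stays "cuda" on both vendors, most training and inference scripts need no AMD-specific branch at all.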

Inference serving is the other front. vLLM, the high-throughput LLM serving engine, runs on ROCm and AMD Instinct GPUs, and AMD has worked to keep day-zero or near day-zero support for popular open models. SGLang, another serving framework, is supported too, and AMD ships prebuilt ROCm containers for both. The strategy here is to meet developers where they already are: rather than asking people to learn a new framework, AMD makes the frameworks they use work on AMD silicon. ^[4]^[11]

What is new in ROCm 6 and ROCm 7?

ROCm 6.0 arrived on December 6, 2023, timed to the launch of the Instinct MI300 series. It was a turning point for the project. The release added support for the new MI300X and MI300A, introduced FP8 numeric formats in PyTorch and hipBLASLt for more efficient AI math, shipped optimized attention algorithms, and added the hipSPARSELt sparse-matrix library. AMD reported that ROCm 6 on an MI300X delivered roughly an eightfold improvement in text-generation latency on Llama 2 compared with ROCm 5 on an MI250. The release reframed the stack around generative AI rather than treating that as an afterthought to traditional HPC. ^[9]

ROCm 7 followed in 2025. AMD previewed it at its Advancing AI event on June 12, 2025, and shipped the 7.0 general release on September 16, 2025, with point updates after that. The headline addition was full enablement of the MI350 series on CDNA 4, including the FP4 and FP6 formats. AMD published large performance claims relative to ROCm 6: up to 3.5 times the inference throughput and roughly 3 times the training performance on comparable workloads. The inference figure broke down by model, with AMD citing around 3.2 times on Llama 3.1 70B, 3.4 times on Qwen2-72B, and 3.8 times on DeepSeek R1. ^[4]^[10]^[12]

AMD also put ROCm 7 up against NVIDIA's newest part. On the DeepSeek R1 model, AMD reported that an MI355X platform of eight GPUs running FP4 precision delivered up to 1.3 times the inference throughput of an eight-GPU NVIDIA B200 platform. It also reported that the same MI355X FP4 configuration reached up to 35 times the throughput of the previous-generation MI300X running FP8 on Llama 3.1 405B inference. Those numbers are AMD's own and depend on the specific models and configurations, so they read as vendor benchmarks rather than neutral results, but the direction matches independent reports of steady gains. ROCm 7 also leaned hard into distributed inference, which splits the prefill and decode phases of serving across GPUs to cut the cost of token generation for reasoning models. The release broadened framework coverage, including vLLM and SGLang, launched an AMD Developer Cloud for trying the stack without local hardware, and continued the push toward better Windows and consumer support. ^[4]^[10]^[14]
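The prefill and decode split can be pictured as two workers that hand a cache between them. The sketch below is a toy under stated assumptions: the "model" is a made-up arithmetic rule, the cache is a list of integers, and every name (`KVCache`, `prefill_worker`, `decode_worker`, `serve`) is hypothetical. It shows the shape of the hand-off, not how ROCm, vLLM, or SGLang implement it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    """Stand-in for the per-token key/value tensors a real engine would keep."""
    entries: List[int] = field(default_factory=list)


def toy_next_token(cache: KVCache) -> int:
    """Hypothetical 'model': a deterministic function of the cache contents."""
    return (sum(cache.entries) + len(cache.entries)) % 50


def prefill_worker(prompt: List[int]) -> KVCache:
    """Prefill pool: consume the whole prompt in one pass and build the cache."""
    return KVCache(entries=list(prompt))


def decode_worker(cache: KVCache, max_new_tokens: int) -> List[int]:
    """Decode pool: emit one token per step, extending the cache it received."""
    generated = []
    for _ in range(max_new_tokens):
        token = toy_next_token(cache)
        cache.entries.append(token)
        generated.append(token)
    return generated


def serve(prompt: List[int], max_new_tokens: int) -> List[int]:
    cache = prefill_worker(prompt)               # would run on the prefill GPUs
    return decode_worker(cache, max_new_tokens)  # cache handed to the decode GPUs


if __name__ == "__main__":
    print(serve([1, 2, 3], 3))
```

Prefill touches the whole prompt in one pass while decode advances one token at a time, so the two phases stress the hardware differently and can be provisioned separately.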

The table below sketches the recent version history.

Release	Timing	Notable additions
ROCm 5.x	2022 to 2023	MI200 series support, broader PyTorch and TensorFlow coverage
ROCm 6.0	December 6, 2023	MI300X and MI300A support, FP8 formats, hipSPARSELt, generative-AI focus
ROCm 6.x	2024 to 2025	MI325X support, library and framework refinements, wider Radeon and Windows reach
ROCm 7.0	September 16, 2025	MI350 series (CDNA 4), FP4 and FP6, up to 3.5x inference and 3x training gains vs ROCm 6, distributed inference, SGLang, AMD Developer Cloud

The next table groups the main components of the stack.

Component	Type	Role
HIP	Programming model	C++ runtime API and kernel language, CUDA-like, portable across AMD and NVIDIA
HIPIFY	Porting tool	Converts CUDA source to HIP via text or Clang-based rewriting
hipcc / amdclang++	Compiler	LLVM and Clang based toolchain for AMD GPU code
rocBLAS / hipBLAS / hipBLASLt	Math library	Dense linear algebra and matrix operations
rocFFT / rocSPARSE / rocSOLVER / rocRAND	Math library	FFT, sparse algebra, solvers, random numbers
MIOpen	AI library	Deep-learning primitives, comparable to cuDNN
MIGraphX	AI library	Graph optimization and inference engine
Composable Kernel	AI library	Operator fusion for high-throughput kernels
RCCL	Communication	Multi-GPU and multi-node collectives, comparable to NCCL
ROCgdb / ROCm profilers / AMD SMI	Tooling	Debugging, profiling, and system management

How does ROCm differ from CUDA, and what is the CUDA moat?

The gap between ROCm and CUDA is usually described as the CUDA moat. NVIDIA publicly unveiled CUDA in November 2006 and shipped the CUDA 1.0 toolkit in June 2007, then spent close to two decades building libraries, documentation, tutorials, and a community of millions of developers. Almost every AI framework, optimization paper, and tutorial assumes CUDA first. That accumulated software advantage, not raw chip speed, is the hardest thing for a competitor to overcome. A buyer choosing a GPU is also choosing whether their existing code, tools, and hiring pool will work on day one. ^[2]^[13]^[15]

ROCm came much later. AMD's open-compute effort grew out of the Boltzmann Initiative announced at SC15 in November 2015, with the first ROCm 1.0 release following in April 2016. HIP's whole design, mirroring CUDA so that porting is cheap, is an attempt to lower the moat rather than dig a parallel one. AMD is betting that an open stack that runs the same frameworks, at a lower cost per unit of memory and compute, is good enough to pull serious workloads even if it is not yet as polished. ^[13]

What are the main criticisms of ROCm, and how has it improved?

For most of its life, ROCm earned a reputation for being hard to use. Installation could be fragile, with driver and kernel version mismatches that broke setups. Hardware support lagged, and consumer GPUs were often left out, which kept hobbyists and students, the people who seed an ecosystem, on NVIDIA. Documentation trailed CUDA's, and many libraries were less complete or less tuned. The practical result was that AMD GPUs frequently delivered well below their theoretical numbers because the software could not keep the hardware fed. ^[13]

Those problems were laid out in detail on December 22, 2024, when the research firm SemiAnalysis published a long benchmark study comparing the MI300X against NVIDIA's H100 and H200 under the title "CUDA Moat Still Alive." The report found that the MI300X had strong specifications, more memory and high peak throughput, but that the software stood in the way. "AMD's software experience is riddled with bugs rendering out of the box training with AMD is impossible," the analysts wrote, adding that "AMD's out of the box experience is very difficult to work with and can require considerable patience and elbow grease." ^[13] They concluded that "the CUDA moat has yet to be crossed by AMD due to AMD's weaker-than-expected software Quality Assurance culture." ^[13] The study leaned heavily on AMD engineers to fix issues as they came up. "If we weren't supported by multiple teams of AMD engineers triaging and fixing bugs in AMD software that we ran into, AMD's results would have been much lower," the report noted, and it observed that the cloud provider TensorWave "has given GPU time for free to a team at AMD to fix software issues, which is insane given they paid for the GPUs." AMD shipped fixes and better builds in response, which is itself a sign of how seriously the company now takes the gap. ^[13]

Progress since then has been real. AMD has expanded framework support so that PyTorch, JAX, vLLM, and SGLang work more reliably on AMD hardware, often with day-zero model coverage. It has widened the list of supported GPUs, including more Radeon parts and a growing Windows story, and stood up the AMD Developer Cloud so people can try the stack without buying hardware. Documentation and developer resources have improved, and the ROCm 7 performance gains, even read conservatively, point to a stack that is maturing quickly. Closer engagement with the open-source community suggests AMD understands that hardware alone will not close the gap. ^[4]^[13]

Limitations

ROCm still trails CUDA in coverage and polish. The set of fully supported GPUs is narrower, and support tiers vary, so a card that technically runs ROCm may not get the same validation as a flagship Instinct part. Some niche libraries and cutting-edge research code remain CUDA-only or need extra porting effort. The Windows and consumer experience, while improving, is younger than CUDA's, which has run everywhere for years. And because so much of the AI world was built CUDA-first, even a clean HIP port can hit functions or kernels that have no tuned AMD equivalent yet. The momentum is clearly toward parity, but ROCm is closing a long lead rather than starting even. ^[2]^[13]

References

^[1] AMD. "What is ROCm?" ROCm Documentation.
^[2] AMD. "AMD ROCm Software." amd.com.
^[3] AMD. "System requirements (Linux)." ROCm Documentation.
^[4] AMD. "ROCm release notes." ROCm Documentation.
^[5] AMD. "HIP Documentation." ROCm Documentation.
^[6] AMD. "HIPIFY." GitHub.
^[7] AMD. "ROCm libraries." ROCm Documentation.
^[8] AMD. "MIOpen Documentation." ROCm Documentation.
^[9] AMD. "Release notes for AMD ROCm 6.0." ROCm Documentation, December 2023.
^[10] AMD. "Enabling the Future of AI: Introducing AMD ROCm 7 and the AMD Developer Cloud." amd.com, 2025.
^[11] Phoronix. "AMD ROCm 7.0 Officially Released With Many Significant Improvements." Phoronix, September 2025.
^[12] AMD. "AMD ROCm 7.0 Software: Supercharging AI and HPC Infrastructure with AMD Instinct Series GPUs and Open Innovation." amd.com, September 2025.
^[13] SemiAnalysis. "MI300X vs H100 vs H200 Benchmark Part 1: Training, CUDA Moat Still Alive." SemiAnalysis, December 22, 2024.
^[14] Tom's Hardware. "AMD unveils ROCm 7, new platform boosts AI performance up to 3.5x, adds Radeon GPU support." Tom's Hardware, June 2025.
^[15] NVIDIA. "CUDA Toolkit." developer.nvidia.com.
