ROCm
Last reviewed
May 31, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 2,794 words
ROCm is AMD's open software stack for GPU computing, and it is the main alternative to NVIDIA's CUDA platform. The name started as shorthand for Radeon Open Compute, though AMD no longer treats it as a formal acronym since Open Compute is a registered trademark. ROCm bundles the drivers, compilers, programming models, math and machine-learning libraries, and developer tools needed to run high-performance computing and AI workloads on AMD GPUs. Most of the stack is published as open-source software under various licenses, with the main repository released under the MIT License. That sets it apart from CUDA, whose core toolkit remains proprietary. AMD positions ROCm as the software half of its data-center AI strategy, paired with the Instinct line of accelerators. [1][2]
For years ROCm was treated as a curiosity outside a handful of national labs. That has changed. As AMD's Instinct MI300X and later parts started landing in large clusters, ROCm became the layer that decides whether all that silicon can actually be used. The September 2025 release of ROCm 7 was AMD's clearest signal that it wants the software judged on the same terms as the hardware.
ROCm sits between AMD GPU hardware and the frameworks that data scientists and HPC programmers use. At the bottom are a kernel-mode driver and a user-space runtime. On top of those sit compilers, a set of accelerated libraries, and tooling for debugging and profiling. Above all of that, frameworks like PyTorch and JAX call into ROCm without most users ever touching it directly. [1]
The stack runs mostly on Linux, with official packages for Ubuntu, Red Hat Enterprise Linux, and SUSE Linux Enterprise Server, plus support for several other distributions such as Debian and Oracle Linux. AMD has been extending coverage to Windows as well, where the HIP SDK and parts of the toolchain now ship for select consumer and workstation GPUs. [3][4]
The open-source posture is the point. CUDA has a long head start and a deep ecosystem, but it is controlled entirely by one vendor. ROCm gives cloud providers, labs, and large AI shops a stack they can inspect, patch, and in some cases fork. That openness is a big part of why hyperscalers have been willing to bet on AMD as a second source for GPU compute. [2]
The centerpiece of ROCm's developer story is HIP, short for Heterogeneous-computing Interface for Portability. HIP is a C++ runtime API and kernel language for writing GPU code that can run on AMD hardware, and, with a thin compatibility layer, on NVIDIA hardware too. Its syntax was deliberately kept close to CUDA. A developer who knows how to write a CUDA kernel can read and write HIP almost immediately, because the function names, the launch syntax, and the memory model all mirror CUDA's design. [5]
That closeness is strategic. AMD's bet is that most GPU code already exists in CUDA, so the fastest way to grow ROCm is to make porting cheap. HIP is meant to be a thin layer that adds little or no overhead. On AMD GPUs it compiles down through the ROCm compiler. On NVIDIA GPUs it compiles to CUDA underneath, so the same source can target both vendors from one codebase. AMD is careful to note that HIP is not a drop-in replacement: a real port usually still needs manual fixes and performance tuning. [5]
To move existing code over, ROCm ships HIPIFY, a set of porting tools. One variant, hipify-perl, does fast pattern-based text substitution and swaps CUDA API calls for their HIP equivalents. The other, hipify-clang, parses the source with the Clang front end and rewrites it from the abstract syntax tree, which catches more cases and handles tricky code more reliably. Neither tool promises a perfect one-shot conversion, but both cut the manual effort of a port substantially. [5][6]
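The pattern-based approach can be illustrated with a toy sketch. This is not the hipify-perl implementation: the rename table below is a tiny hand-picked subset of real CUDA and HIP API names, and `toy_hipify` is a hypothetical helper written only for this example.

```python
import re

# Toy subset of CUDA-to-HIP renames; the real HIPIFY tools cover far more APIs.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Match whole identifiers only, trying longer names first.
_PATTERN = re.compile(
    "|".join(
        r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])"
        for name in sorted(CUDA_TO_HIP, key=len, reverse=True)
    )
)


def toy_hipify(source: str) -> str:
    """Replace the known CUDA names in `source` with their HIP equivalents."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpyAsync(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice, stream);
cudaFree(d_x);
"""
print(toy_hipify(cuda_src))
```

The `cudaMemcpyAsync` call is left untouched because it is missing from the toy table, which hints at why text substitution alone cannot guarantee a complete port and why a parser-based tool catches more cases.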
Beyond HIP, ROCm also supports OpenCL and OpenMP offload for code that prefers a vendor-neutral API or compiler directives over an explicit GPU runtime, which matters in traditional HPC where Fortran and directive-based C++ are common. [1]
The ROCm compiler is built on LLVM and Clang. The usual entry point is hipcc, a compiler driver that figures out the right flags and hands work to the underlying amdclang++ compiler, which produces code for AMD's GPU instruction set. Building on LLVM lets AMD share infrastructure with the wider compiler community and keep pace with new language standards. A Flang-based Fortran compiler covers legacy scientific code. [1][5]
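As a rough illustration of the compile step, the sketch below assembles a `hipcc` command line without running it. The file names `saxpy.hip` and `saxpy` are hypothetical, and the default target `gfx942` (the MI300 family's architecture name) is an assumption that should be swapped for whatever the installed GPU reports.

```python
def build_hipcc_command(source, output, offload_arch="gfx942"):
    """Assemble a hipcc invocation as an argument list; nothing is executed here."""
    return ["hipcc", f"--offload-arch={offload_arch}", source, "-o", output]


# Hypothetical source and output names, for illustration only.
command = build_hipcc_command("saxpy.hip", "saxpy")
print(" ".join(command))
```

Returning a list rather than a shell string keeps the command safe to pass to `subprocess.run` on a machine where ROCm is actually installed.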
Around the compiler sits the rest of the developer kit. ROCgdb is the debugger, derived from GDB, for stepping through GPU code. The ROCm profilers collect performance counters and traces so engineers can find bottlenecks. Utilities like rocminfo report what the system sees, and the management interfaces, first ROCm SMI and now the newer AMD SMI, expose temperatures, clocks, memory use, and other telemetry for fleet operators. [1]
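A script can use `rocminfo` to discover which GPU architectures are present. The sketch below is one hedged way to do that: it assumes the tool's output mentions architecture names of the form `gfx` followed by hexadecimal digits, and the `sample` string is a hypothetical excerpt used only to exercise the parser.

```python
import re
import shutil
import subprocess


def parse_gfx_targets(rocminfo_text):
    """Return the sorted, de-duplicated gfx architecture names found in the text."""
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text)))


def detect_gfx_targets():
    """Run rocminfo if it is on PATH; return an empty list when it is not available."""
    if shutil.which("rocminfo") is None:
        return []
    result = subprocess.run(["rocminfo"], capture_output=True, text=True, check=False)
    return parse_gfx_targets(result.stdout)


# Hypothetical excerpt; real rocminfo output is much longer.
sample = (
    "  Name:                    gfx942\n"
    "  Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-\n"
)
print(parse_gfx_targets(sample))
print(detect_gfx_targets())
```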
Most of ROCm's real-world performance comes from its libraries. They fall into a few groups, and AMD generally ships both a vendor-specific version and a portable HIP-prefixed version that can also run on NVIDIA hardware.
For dense and sparse linear algebra there is rocBLAS, the basic linear algebra subprograms library, along with hipBLAS as its portable wrapper and hipBLASLt, a lightweight library whose more flexible matrix-multiply interface goes beyond classic BLAS and matters a lot in transformer workloads. rocSPARSE and rocSOLVER handle sparse matrices and linear solvers. rocFFT covers fast Fourier transforms, with hipFFT as the portable interface. rocRAND generates random numbers on the GPU. A separate family of primitive libraries, rocPRIM, hipCUB, and rocThrust, provides the parallel building blocks that higher-level code is assembled from. [7]
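The operation at the heart of a BLAS library is the general matrix multiply, conventionally defined as C = alpha * A * B + beta * C. The pure-Python reference below shows only that arithmetic. It is not the rocBLAS API, and the function name `gemm` and its argument order are choices made for this sketch.

```python
def gemm(alpha, a, b, beta, c):
    """Reference GEMM: return alpha * (a @ b) + beta * c for row-major nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape m x n")
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j] for j in range(n)]
        for i in range(m)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(gemm(2.0, a, b, 0.5, c))  # [[38.5, 44.5], [86.5, 100.5]]
```

A tuned GPU library computes the same result but tiles the loops across thousands of threads and picks kernels according to matrix shape and data type.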
The AI-specific libraries are where ROCm has invested the hardest. MIOpen is AMD's deep-learning primitives library, the rough counterpart to NVIDIA's cuDNN, supplying optimized convolution, pooling, normalization, and attention operations. MIGraphX is a graph optimization and inference engine. Composable Kernel is a performance-portable library for fusing operations into single high-throughput kernels, which is central to getting good utilization on matrix-heavy AI math. For multi-GPU and multi-node training there is RCCL, the ROCm Communication Collectives Library, AMD's answer to NCCL. RCCL handles the all-reduce and broadcast collectives that synchronize gradients across large clusters, and it is the piece that lets ROCm scale past a single server. [7][8]
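To make the all-reduce collective concrete, the sketch below simulates a sum all-reduce over a ring of ranks in plain Python. It is a conceptual model only: it does not use the RCCL API, and the ring algorithm shown is one common way to implement the collective, assumed here for illustration rather than taken from RCCL's internals.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring; buffers[r] is rank r's list of numbers."""
    n = len(buffers)
    length = len(buffers[0])
    if any(len(buf) != length for buf in buffers):
        raise ValueError("all ranks must hold buffers of the same length")
    if length % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    size = length // n
    data = [list(buf) for buf in buffers]  # work on copies, leave the inputs alone

    def chunk(index):
        return slice(index * size, (index + 1) * size)

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot every outgoing chunk first so the step behaves as if simultaneous.
        sent = [data[r][chunk((r - step) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            target = chunk((r - step) % n)
            data[dst][target] = [x + y for x, y in zip(data[dst][target], sent[r])]

    # Phase 2, all-gather: pass the finished chunks around until every rank has them all.
    for step in range(n - 1):
        sent = [data[r][chunk((r + 1 - step) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            data[dst][chunk((r + 1 - step) % n)] = sent[r]

    return data


gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(gradients))  # every rank ends with [111, 222, 333]
```

Each rank only ever talks to its neighbor, so the traffic per link stays roughly constant as the ring grows, which is why this pattern is popular for gradient synchronization.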
ROCm's primary target is the AMD Instinct line of data-center accelerators. Coverage spans several generations, from the older MI100 and the MI200 series (the MI210, MI250, and MI250X) through the MI300 generation. The MI300X is a discrete GPU built for large-model training and inference, while the MI300A is an APU that fuses CPU and GPU on one package for HPC. The MI325X extended the MI300 line with more memory, and the MI350 series, the MI350X and MI355X, moved to AMD's CDNA 4 architecture and added low-precision FP4 and FP6 number formats. [3][9]
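To give a feel for how coarse a 4-bit float is, the sketch below rounds numbers to the E2M1 value set (1 sign bit, 2 exponent bits, 1 mantissa bit). That layout is an assumption taken from the OCP Microscaling formats specification, since the text above does not say which FP4 encoding is used. The sketch also omits the shared per-block scale factor that such formats pair with each group of values, and its tie handling is a simplification.

```python
# Non-negative magnitudes representable in E2M1 (assumed layout, see the note above).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4_e2m1(x):
    """Round x to the nearest E2M1 value, saturating at +/-6; ties go to the smaller magnitude."""
    magnitude = min(E2M1_MAGNITUDES, key=lambda v: abs(v - abs(x)))
    return -magnitude if x < 0 else magnitude


values = [0.3, 1.2, 2.4, 5.1, 100.0, -0.7]
print([round_to_fp4_e2m1(v) for v in values])  # [0.5, 1.0, 2.0, 6.0, 6.0, -0.5]
```

With only eight magnitudes available, most of the dynamic range has to come from the scale factor, which is why these formats are used for inference and carefully managed training rather than general arithmetic.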
On the consumer and workstation side, ROCm supports a selected set of Radeon parts. That list has grown to include RDNA 3 cards such as the Radeon RX 7900 XTX, RX 7900 XT, and RX 7900 GRE, newer RDNA 4 cards in the RX 9000 family, and Radeon PRO workstation cards such as the W7900 and W7800. The coverage is narrower than CUDA's near-universal support of NVIDIA GPUs, and support tiers differ between full data-center validation and lighter developer support, but the trend has been toward wider hardware reach with each release. AMD has also previewed the next-generation MI400 series for 2026, built for rack-scale AI as part of a system AMD calls Helios. [3][10]
For most AI practitioners, ROCm is invisible because it lives underneath a framework. PyTorch is the headline case. There are ROCm builds of PyTorch that let the same training and inference scripts run on AMD GPUs with little or no code change, since the framework maps its operations onto ROCm libraries behind the scenes. TensorFlow has ROCm support as well, and JAX runs on AMD hardware through the same compiler and library stack. With the MI350 series, PyTorch, TensorFlow, and JAX all expose the new FP4 data type natively. [1][4]
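The sketch below shows one way a script might check which backend its PyTorch build targets. It relies on two behaviors of ROCm builds of PyTorch: `torch.version.hip` holds a version string (it is `None` on other builds), and AMD GPUs are addressed through the same `torch.cuda` API and `"cuda"` device string as NVIDIA GPUs. The helper name `describe_backend` and its return labels are hypothetical.

```python
def describe_backend():
    """Report which GPU backend the installed PyTorch build targets, if any."""
    try:
        import torch
    except ImportError:
        return "pytorch-not-installed"
    if not torch.cuda.is_available():
        return "cpu-only"
    # ROCm builds set torch.version.hip to a version string; other builds leave it None.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


backend = describe_backend()
print(backend)
if backend in ("rocm", "cuda"):
    import torch

    # The device string stays "cuda" on ROCm builds, so existing scripts run unchanged.
    x = torch.ones(4, device="cuda")
    print(x.sum().item())
```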
Inference serving is the other front. vLLM, the high-throughput LLM serving engine, runs on ROCm and AMD Instinct GPUs, and AMD has worked to keep day-zero or near day-zero support for popular open models. SGLang, another serving framework, is supported too, and AMD ships prebuilt ROCm containers for both. The strategy here is to meet developers where they already are: rather than asking people to learn a new framework, AMD makes the frameworks they use work on AMD silicon. [4][11]
ROCm 6.0 arrived on December 6, 2023, timed to the launch of the Instinct MI300 series. It was a turning point for the project. The release added support for the new MI300X and MI300A, introduced FP8 numeric formats in PyTorch and hipBLASLt for more efficient AI math, shipped optimized attention algorithms, and added the hipSPARSELt sparse-matrix library. AMD reported that ROCm 6 on an MI300X improved text-generation latency on Llama 2 roughly eightfold compared with ROCm 5 on an MI250. The release reframed the stack around generative AI rather than treating that as an afterthought to traditional HPC. [9]
ROCm 7 followed in 2025. AMD previewed it at its Advancing AI event on June 12, 2025, and shipped the 7.0 release in September, with point updates after that. The headline addition was full support for the MI350 series on CDNA 4, including the FP4 and FP6 formats. AMD published large performance claims relative to ROCm 6: roughly 3.5 times the inference throughput, averaged across models, and roughly 3 times the training performance on comparable workloads. The inference figure broke down by model, with AMD citing around 3.2 times on Llama 3.1 70B, 3.4 times on Qwen2-72B, and 3.8 times on DeepSeek R1. Those numbers are AMD's own and depend on the specific models and configurations, so they read as vendor benchmarks rather than neutral results, but the direction matches independent reports of steady gains. ROCm 7 also leaned hard into distributed inference, which splits the prefill and decode phases of serving across GPUs to cut the cost of token generation for reasoning models. Alongside it, AMD broadened framework coverage, including vLLM and SGLang, launched an AMD Developer Cloud for trying the stack without local hardware, and continued the push toward better Windows and consumer support. [4][10][12]
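The per-model figures quoted above can be averaged as a quick sanity check on the headline inference number. The values are AMD's own as cited in the text, and a simple unweighted mean is an assumption about how they might be combined.

```python
from statistics import mean

# Per-model inference speedups quoted above (AMD's figures, ROCm 7 versus ROCm 6).
speedups = {"Llama 3.1 70B": 3.2, "Qwen2-72B": 3.4, "DeepSeek R1": 3.8}

average = mean(speedups.values())
print(round(average, 2))  # 3.47
```

The mean lands just under 3.5, so the headline figure reads more like an average across these models than a ceiling, given that one model exceeds it.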
The table below sketches the recent version history.
| Release | Timing | Notable additions |
|---|---|---|
| ROCm 5.x | 2022 to 2023 | MI200 series support, broader PyTorch and TensorFlow coverage |
| ROCm 6.0 | December 2023 | MI300X and MI300A support, FP8 formats, hipSPARSELt, generative-AI focus |
| ROCm 6.x | 2024 to 2025 | MI325X support, library and framework refinements, wider Radeon and Windows reach |
| ROCm 7.0 | September 2025 | MI350 series (CDNA 4), FP4 and FP6, roughly 3.5x inference and 3x training gains (AMD figures), distributed inference, SGLang |
The next table groups the main components of the stack.
| Component | Type | Role |
|---|---|---|
| HIP | Programming model | C++ runtime API and kernel language, CUDA-like, portable across AMD and NVIDIA |
| HIPIFY | Porting tool | Converts CUDA source to HIP via text or Clang-based rewriting |
| hipcc / amdclang++ | Compiler | LLVM and Clang based toolchain for AMD GPU code |
| rocBLAS / hipBLAS / hipBLASLt | Math library | Dense linear algebra and matrix operations |
| rocFFT / rocSPARSE / rocSOLVER / rocRAND | Math library | FFT, sparse algebra, solvers, random numbers |
| MIOpen | AI library | Deep-learning primitives, comparable to cuDNN |
| MIGraphX | AI library | Graph optimization and inference engine |
| Composable Kernel | AI library | Operator fusion for high-throughput kernels |
| RCCL | Communication | Multi-GPU and multi-node collectives, comparable to NCCL |
| ROCgdb / ROCm profilers / AMD SMI | Tooling | Debugging, profiling, and system management |
The gap between ROCm and CUDA is usually described as the CUDA moat. NVIDIA shipped CUDA in 2007 and spent close to two decades building libraries, documentation, tutorials, and a community of millions of developers. Almost every AI framework, optimization paper, and tutorial assumes CUDA first. That accumulated software advantage, not raw chip speed, is the hardest thing for a competitor to overcome. A buyer choosing a GPU is also choosing whether their existing code, tools, and hiring pool will work on day one. [2][13]
ROCm came much later. AMD's open-compute effort grew out of the Boltzmann Initiative announced in 2015, with the first ROCm releases following in 2016. HIP's whole design, mirroring CUDA so that porting is cheap, is an attempt to lower the moat rather than dig a parallel one. AMD is betting that an open stack that runs the same frameworks, at a lower cost per unit of memory and compute, is good enough to pull serious workloads even if it is not yet as polished. [13]
For most of its life, ROCm earned a reputation for being hard to use. Installation could be fragile, with driver and kernel version mismatches that broke setups. Hardware support lagged, and consumer GPUs were often left out, which kept hobbyists and students, the people who seed an ecosystem, on NVIDIA. Documentation trailed CUDA's, and many libraries were less complete or less tuned. The practical result was that AMD GPUs frequently delivered well below their theoretical numbers because the software could not keep the hardware fed. [13]
Those problems were laid out in detail in December 2024, when the research firm SemiAnalysis published a long benchmark study comparing the MI300X against NVIDIA's H100 and H200 under the title CUDA Moat Still Alive. The report found that the MI300X had strong specifications, more memory and high peak throughput, but that out-of-the-box ROCm was, in its words, plagued with bugs that made training models nearly impossible without heavy debugging. The MI300X was not usable out of the box, the analysts wrote, and AMD's real-world performance was nowhere close to its marketed peak. The study leaned heavily on AMD engineers to fix issues as they came up, and one cloud provider, TensorWave, handed AMD engineers free GPU time just so the software could be improved. AMD shipped fixes and better builds in response, which is itself a sign of how seriously the company now takes the gap. [13]
Progress since then has been real. AMD has expanded framework support so that PyTorch, JAX, vLLM, and SGLang work more reliably on AMD hardware, often with day-zero model coverage. It has widened the list of supported GPUs, including more Radeon parts and a growing Windows story, and stood up the AMD Developer Cloud so people can try the stack without buying hardware. Documentation and developer resources have improved, and the ROCm 7 performance gains, even read conservatively, point to a stack that is maturing quickly. Closer engagement with the open-source community suggests AMD understands that hardware alone will not close the gap. [4][13]
ROCm still trails CUDA in coverage and polish. The set of fully supported GPUs is narrower, and support tiers vary, so a card that technically runs ROCm may not get the same validation as a flagship Instinct part. Some niche libraries and cutting-edge research code remain CUDA-only or need extra porting effort. The Windows and consumer experience, while improving, is younger than CUDA's, which has run everywhere for years. And because so much of the AI world was built CUDA-first, even a clean HIP port can hit functions or kernels that have no tuned AMD equivalent yet. The momentum is clearly toward parity, but ROCm is closing a long lead rather than starting even. [2][13]