CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) created by NVIDIA. First released in June 2007, CUDA allows software developers to use NVIDIA GPUs for general-purpose processing, a technique known as GPGPU (General-Purpose computing on Graphics Processing Units). It provides a C/C++-like programming interface that lets developers write code for GPU execution without needing to express computations as graphics operations. Over nearly two decades, CUDA has become the dominant software platform for deep learning, scientific computing, and high-performance computing, giving NVIDIA a software ecosystem advantage that competitors have struggled to replicate.
Before CUDA, researchers who wanted to run general-purpose computations on GPUs had to disguise their math as graphics shaders, a cumbersome and error-prone process that limited GPU computing to a small community of specialists. NVIDIA recognized that the many parallel cores inside its GPUs could be useful for workloads far beyond rendering triangles, and began developing a general-purpose programming model in the mid-2000s.
The initial CUDA SDK was made public on February 15, 2007, for Microsoft Windows and Linux [1]. The first GPU to support CUDA natively, the GeForce 8800 GTX (G80 architecture), had launched a few months earlier, in November 2006. Mac OS X support was added in CUDA 2.0, released in 2008. The key insight behind CUDA was that GPU hardware, with its thousands of lightweight cores designed for throughput rather than latency, mapped naturally onto the data-parallel workloads found in scientific simulations, image processing, and (eventually) neural network training.
The early years of CUDA adoption were concentrated in the scientific computing community. Researchers in molecular dynamics, computational fluid dynamics, and financial modeling were among the first to exploit GPU parallelism through CUDA. But the event that transformed CUDA from a niche scientific tool into the backbone of the AI industry came in 2012.
On September 30, 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet into the ImageNet Large Scale Visual Recognition Challenge. AlexNet was a convolutional neural network with 60 million parameters that achieved a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points [2]. Krizhevsky trained the model on two NVIDIA GTX 580 consumer GPUs using his custom cuda-convnet library, written in CUDA. The victory demonstrated that deep neural networks, combined with large datasets and GPU-accelerated training, could dramatically outperform hand-engineered computer vision methods.
AlexNet's success rested on the convergence of three developments: the availability of large labeled datasets (ImageNet), general-purpose GPU computing (CUDA), and improved training techniques for deep networks (such as ReLU activations and dropout regularization). For several years afterward, Krizhevsky's cuda-convnet code was the industry standard and powered the first wave of the deep learning boom. Every major deep learning framework that followed, from Caffe to TensorFlow to PyTorch, was built on top of CUDA.
By 2015, CUDA's development had shifted its focus increasingly toward accelerating machine learning and artificial neural network workloads, a direction that has only intensified since.
CUDA provides an abstraction layer over the GPU's hardware architecture that allows developers to write parallel programs without needing to understand every detail of the underlying silicon. The programming model centers on three core concepts: kernels, the thread hierarchy, and the memory hierarchy.
A CUDA kernel is a function that executes on the GPU. When a program launches a kernel, it runs simultaneously across many threads. Unlike a CPU function that executes once on a single core, a kernel is invoked by potentially millions of GPU threads, each executing the same code on different data. This is the SIMT (Single Instruction, Multiple Threads) execution model.
A kernel is defined using the __global__ keyword in CUDA C++ and is launched with a special syntax that specifies how many threads should execute it:
```cuda
__global__ void vectorAdd(float *A, float *B, float *C, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
```
In this example, each thread computes a single element of the output vector. The threadIdx, blockIdx, and blockDim built-in variables let each thread determine which element it is responsible for.
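Concretely, the host launches the kernel with the triple-angle-bracket syntax, choosing a block size and rounding the grid size up so it covers all n elements. A minimal sketch (the device pointers d_A, d_B, d_C are assumed to have been allocated already; error handling omitted):

```cuda
int n = 1 << 20;                 // one million elements
int threadsPerBlock = 256;
// Round up so the grid covers all n elements even when n % 256 != 0.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
cudaDeviceSynchronize();         // kernel launches are asynchronous
```

The `if (i < n)` guard in the kernel is what makes the rounding safe: threads in the final block whose index falls past the end of the vector simply do nothing.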
CUDA organizes threads into a three-level hierarchy:
| Level | Description | Typical size |
|---|---|---|
| Thread | The smallest unit of execution. Each thread has its own registers and local memory. | 1 thread |
| Thread block | A group of threads that execute on a single streaming multiprocessor (SM). Threads within a block can share data through shared memory and synchronize with each other. | Up to 1,024 threads |
| Grid | A collection of thread blocks that together execute a single kernel. | Millions of threads across thousands of blocks |
Thread blocks and grids can be one-, two-, or three-dimensional, which is convenient for mapping onto data structures like vectors (1D), images (2D), or volumes (3D). All threads within a thread block are guaranteed to execute on the same SM, which enables efficient communication and synchronization within a block. Threads in different blocks cannot directly synchronize with each other during kernel execution.
The hardware further groups threads into warps of 32. All threads in a warp execute the same instruction at the same time. When threads in a warp take different branches (called warp divergence), the GPU must serialize the divergent paths, which reduces efficiency. Writing CUDA code that minimizes warp divergence is a key optimization technique.
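The effect can be seen in any kernel whose branch depends on per-thread data. In this illustrative sketch, whenever a warp's 32 elements contain both positive and non-positive values, the hardware executes both paths back to back, masking off the inactive threads each time:

```cuda
__global__ void absOrDouble(const float *in, float *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)
        out[i] = in[i] * 2.0f;   // path A
    else
        out[i] = -in[i];         // path B, serialized after path A on divergence
}
```

Where the data layout allows, partitioning inputs so that warps see uniform branches, or replacing short branches with predicated arithmetic, recovers the lost throughput.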
Starting with NVIDIA Compute Capability 9.0 (Hopper architecture), CUDA introduces an optional additional level of hierarchy called Thread Block Clusters. A cluster is a group of thread blocks that are guaranteed to be co-scheduled on the same GPU Processing Cluster (GPC). This enables efficient communication between thread blocks within a cluster using distributed shared memory, which was not possible in earlier architectures where thread blocks could be scheduled on any SM. Thread block clusters are particularly useful for workloads like matrix multiplication and attention computation, where data sharing between adjacent blocks can reduce redundant global memory accesses.
CUDA exposes several levels of memory, each with different capacity, latency, and visibility:
| Memory type | Scope | Latency | Capacity | Typical use |
|---|---|---|---|---|
| Registers | Per thread | ~1 cycle | Limited (thousands per SM) | Thread-local variables |
| Shared memory | Per thread block | ~20-30 cycles | 48-228 KB per SM (configurable) | Inter-thread communication within a block |
| L1 cache | Per SM | ~30 cycles | 128-256 KB | Automatic caching of global memory accesses |
| L2 cache | Per GPU | ~200 cycles | 4-60 MB | Automatic caching |
| Global memory (HBM) | All threads | ~400 cycles | 16-288 GB | Main GPU memory (model weights, data) |
| Constant memory | All threads (read-only) | ~5 cycles (cached) | 64 KB | Kernel parameters, lookup tables |
| Texture memory | All threads (read-only) | ~5 cycles (cached) | Size of global memory | Spatially local read patterns |
Shared memory is the most important tool for optimizing CUDA kernels. Because it resides on-chip and is much faster than global memory, programmers use it to stage data that multiple threads in a block need to access repeatedly. For example, in a matrix multiplication kernel, tiles of the input matrices are loaded into shared memory so that threads can read them many times without incurring the cost of global memory accesses each time.
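The classic tiled matrix multiplication illustrates the pattern. In this sketch (square n x n matrices, n assumed to be a multiple of the tile width, no further tuning), each block stages one tile of A and one tile of B in shared memory, and every element loaded from global memory is then reused TILE times:

```cuda
#define TILE 16

__global__ void matmulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                   // tile fully loaded

        for (int k = 0; k < TILE; ++k)     // TILE reuses per loaded element
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                   // safe to overwrite the tiles
    }
    C[row * n + col] = acc;
}
```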
On Hopper and later architectures, CUDA supports distributed shared memory, which allows thread blocks within a cluster to directly access each other's shared memory. This effectively creates a larger, faster memory pool than what is available to a single block, enabling more efficient implementations of algorithms that require inter-block data sharing.
A critical optimization concept in CUDA is memory coalescing. When threads in a warp access consecutive addresses in global memory, the hardware combines (coalesces) these accesses into a single memory transaction. Coalesced accesses achieve full memory bandwidth utilization, while uncoalesced accesses can reduce effective bandwidth by 8x or more. Designing data layouts that enable coalesced access patterns is one of the most impactful CUDA optimizations.
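The difference is visible even in a trivial copy kernel. In the sketch below, the first version lets each warp touch one contiguous block of addresses per load, while the strided version scatters a warp's accesses across many memory segments:

```cuda
// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// consecutive addresses and combine into a minimal number of transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read addresses `stride` elements apart, so one
// warp's loads can require many separate memory transactions.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```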
The CUDA Toolkit is the full software distribution that NVIDIA provides for developing GPU-accelerated applications. It includes a compiler, runtime libraries, debugging and profiling tools, and a suite of GPU-optimized math libraries.
NVCC (NVIDIA CUDA Compiler) is the compiler driver for CUDA programs. It accepts CUDA C++ source files (typically with a .cu extension), separates the host (CPU) code from the device (GPU) code, compiles the device code into PTX (Parallel Thread Execution) intermediate representation or directly into GPU machine code (SASS), and passes the host code to a standard C++ compiler like GCC or MSVC. NVCC supports standard C++ compiler options for defining macros, specifying include and library paths, and controlling optimization levels.
The CUDA Toolkit ships with a rich set of GPU-accelerated libraries that provide optimized building blocks for common computational tasks:
| Library | Purpose | Typical use in AI |
|---|---|---|
| cuBLAS | GPU-accelerated Basic Linear Algebra Subprograms (BLAS). Optimized matrix multiplication, vector operations, and dot products. | Foundation for all matrix math in neural networks |
| cuDNN | GPU-accelerated deep neural network primitives. Provides optimized implementations of convolutions, recurrent layers, normalization, activation functions, and attention operations. | Used by PyTorch, TensorFlow, and every major DL framework |
| cuFFT | GPU-accelerated Fast Fourier Transform library. | Signal processing, audio models |
| cuSPARSE | Sparse matrix operations. | Sparse neural networks, graph neural networks |
| cuSOLVER | Dense and sparse linear system solvers. | Scientific computing, optimization |
| cuRAND | Random number generation on GPU. | Dropout, data augmentation, stochastic processes |
| Thrust | C++ parallel programming library resembling the C++ Standard Template Library (STL). Provides efficient sort, reduce, scan, and other parallel primitives. | Data preprocessing, custom parallel algorithms |
| NCCL | NVIDIA Collective Communications Library for multi-GPU and multi-node communication. | Distributed training across GPU clusters |
cuDNN (CUDA Deep Neural Network library) deserves special attention because it is the library that most directly enables deep learning on NVIDIA GPUs. When PyTorch or TensorFlow execute a convolution or a matrix multiplication during neural network training, they typically call into cuDNN or cuBLAS under the hood. cuDNN is hand-tuned for each NVIDIA GPU architecture and data type, providing performance that would be extremely difficult for framework developers to achieve on their own.
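To give a flavor of what frameworks do under the hood, the sketch below calls cuBLAS directly for a single-precision GEMM on device-resident n x n matrices (error checking omitted; note that cuBLAS follows the Fortran BLAS column-major convention, which is why frameworks often swap operand order when bridging from row-major tensors):

```cuda
#include <cublas_v2.h>

void gemm(const float *dA, const float *dB, float *dC, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, all n x n, column-major, leading dimension n.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
}
```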
cuDNN 9.0, released in 2024, introduced extensive enhancements for Scaled Dot-Product Attention (SDPA), the core operation in transformer models. Key advances include:
| Feature | Details |
|---|---|
| FlashAttention-style kernels | Highly optimized attention kernels that minimize memory bandwidth usage by computing attention in tiles without materializing the full attention matrix |
| FP8 attention | Native support for FP8 data type on Hopper and Blackwell GPUs, achieving up to 3x throughput improvement over BF16 |
| BF16 attention | Up to 2x faster throughput compared to cuDNN 8.x implementations |
| H200 peak throughput | Up to 1.2 PFLOPS in FP8 on a single H200 GPU |
| Framework support | PyTorch and JAX (vs. flash-attn which only supports PyTorch) |
| Stream-K attention | 200% average speedup for LLM decoding phase (sequence length 1 queries) |
A notable finding from benchmarks: flash-attention outperforms cuDNN attention on Ampere GPUs, while cuDNN attention has a 20-50% advantage on Hopper GPUs. This is because cuDNN 9's attention kernels are specifically optimized for the Hopper Tensor Core architecture and its FP8 support.
The Transformer Engine, built on top of cuDNN, enables automatic mixed-precision training that dynamically selects between FP8 and higher precisions on a per-layer basis. Benchmarks show a 1.15x speedup for Llama 2 70B LoRA fine-tuning when using cuDNN FP8 SDPA via the Transformer Engine on an 8-GPU H200 node [13].
NCCL (NVIDIA Collective Communications Library, pronounced "nickel") is the communication backbone for distributed AI training. It provides topology-aware, hardware-accelerated implementations of collective operations that are essential for synchronizing gradients and distributing data across multiple GPUs.
| Operation | Purpose | Use in training |
|---|---|---|
| AllReduce | Reduces data across all GPUs and distributes result to all | Gradient synchronization in data parallelism |
| AllGather | Gathers data from all GPUs and distributes full result to all | Weight gathering in ZeRO optimization |
| ReduceScatter | Reduces data and scatters result across GPUs | Gradient reduction in ZeRO Stage 2+ |
| Broadcast | Sends data from one GPU to all others | Model weight initialization |
| Send/Recv | Point-to-point data transfer | Pipeline parallelism stage boundaries |
NCCL automatically detects the communication topology (NVLink, PCIe, InfiniBand, Ethernet) and selects optimal algorithms for each. Within a node, NCCL leverages NVLink's high bandwidth (up to 1.8 TB/s on Blackwell NVLink 5). Across nodes, it uses InfiniBand RDMA or RoCE for low-latency, high-throughput communication. NCCL achieves near-linear scaling across GPUs by minimizing CPU involvement in the communication path.
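The gradient-synchronization pattern from the table looks roughly like the following single-process, multi-GPU sketch (communicators are assumed to have been created elsewhere with ncclCommInitAll; error handling omitted):

```cuda
#include <nccl.h>

void syncGradients(float **grads, size_t count, int nGPUs,
                   ncclComm_t *comms, cudaStream_t *streams) {
    // Group the calls so NCCL can launch all per-GPU operations together.
    ncclGroupStart();
    for (int g = 0; g < nGPUs; ++g)
        ncclAllReduce(grads[g], grads[g], count, ncclFloat, ncclSum,
                      comms[g], streams[g]);
    ncclGroupEnd();

    // After synchronization, every GPU holds the summed gradients.
    for (int g = 0; g < nGPUs; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
}
```

In practice, frameworks divide the summed gradients by the number of workers (or scale the loss) to obtain the average.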
NCCL remains under active development, and recent releases such as NCCL 2.27 have continued to add performance and scalability features.
Meta developed NCCLX, an extended collective communication framework built on NCCL, specifically designed for clusters exceeding 100,000 GPUs. NCCLX operates beneath the PyTorch layer and manages all communications for both training and inference. It provides three execution modes: host-initiated APIs, host-initiated APIs with GPU-resident metadata, and device-initiated APIs. This level of scale is necessary for training the largest frontier models.
In CUDA Toolkit 13.0, NVIDIA introduced the CUDA Core Compute Library (CCCL) version 3.0, which unifies Thrust, CUB, and libcudacxx into a single parallel programming foundation [3].
The toolkit also includes Nsight Developer Tools for debugging and profiling GPU code. Nsight Compute provides detailed kernel-level performance analysis, showing metrics like achieved occupancy, memory throughput, and instruction mix. Nsight Systems provides system-level profiling that shows how GPU kernels, CPU code, memory transfers, and communication operations interact over time. These tools are essential for identifying and resolving performance bottlenecks in CUDA applications.
Modern NVIDIA GPUs contain two distinct types of processing units that are both important for AI workloads but serve different purposes.
CUDA cores are the general-purpose parallel processors within an NVIDIA GPU. Each CUDA core can execute one floating-point or integer operation per clock cycle. They handle a wide range of computations: element-wise operations, data preprocessing, activation functions, custom operations, and any other general parallel workload. CUDA cores primarily operate on single-precision (FP32) and double-precision (FP64) floating-point numbers.
The number of CUDA cores has grown substantially with each GPU generation: the V100 (Volta, 2017) has 5,120 CUDA cores; the A100 (Ampere, 2020) has 6,912; the H100 (Hopper, 2022) has 16,896; and the B200 (Blackwell, 2024) has over 18,000 [4].
Tensor Cores are specialized hardware units designed specifically for the matrix-multiply-accumulate operations that dominate deep learning. First introduced in the Volta architecture (V100) in 2017, Tensor Cores perform small matrix multiplications (for example, 4x4 or 16x16 tiles) in a single operation, achieving much higher throughput than CUDA cores for these specific workloads.
Tensor Cores support mixed-precision arithmetic, computing in lower-precision formats (FP16, BF16, FP8, INT8, FP4) while accumulating results in higher precision (FP32). This approach reduces memory bandwidth requirements and increases throughput while maintaining acceptable numerical accuracy for neural network training and inference.
| Feature | CUDA Cores | Tensor Cores |
|---|---|---|
| Purpose | General-purpose parallel processing | Matrix multiply-accumulate for deep learning |
| Precision | Primarily FP32, FP64 | FP16, BF16, FP8, FP4, INT8 (with FP32 accumulation) |
| Best for | Preprocessing, simulations, rendering, general compute | Neural network training and inference |
| Speedup for AI | Baseline | 2-5x faster than CUDA cores for matrix operations |
| First introduced | 2007 (G80) | 2017 (Volta V100) |
Each GPU generation has expanded Tensor Core capabilities significantly:
| Generation | Architecture | Year | Key capabilities | Supported precisions |
|---|---|---|---|---|
| 1st gen | Volta | 2017 | First Tensor Cores, 4x4 FP16 matrix ops | FP16 input, FP32 accumulate |
| 2nd gen | Turing | 2018 | Added INT8, INT4 support | FP16, INT8, INT4 |
| 3rd gen | Ampere | 2020 | BF16 + TF32 support, 2:4 structured sparsity (2x speedup), doubled throughput | FP16, BF16, TF32, INT8, INT4 |
| 4th gen | Hopper | 2022 | FP8 support, Transformer Engine (dynamic per-layer precision), 16x16 warp-level operations | FP16, BF16, TF32, FP8, INT8 |
| 5th gen | Blackwell | 2024 | FP4 support (2nd-gen Transformer Engine), micro tensor scaling, roughly doubled throughput again | FP16, BF16, TF32, FP8, FP4, INT8 |
The progression from Volta to Blackwell represents roughly a 30x increase in effective Tensor Core throughput for AI workloads, driven by both more Tensor Cores per chip and support for lower-precision data types that pack more operations per cycle.
The Transformer Engine, introduced with Hopper, is particularly significant. It automatically manages precision selection on a per-layer basis during training. For each layer, the Transformer Engine analyzes the distribution of activations and dynamically chooses between FP8 and FP16/BF16 computation. This eliminates the need for manual mixed-precision tuning and allows models to train in FP8 with minimal accuracy loss. The second-generation Transformer Engine on Blackwell extends this to FP4 with micro tensor scaling [5].
In practice, CUDA cores and Tensor Cores work together. A typical training step involves Tensor Cores handling the heavy matrix multiplications in the forward and backward passes, while CUDA cores handle activation functions, loss computation, gradient scaling, and other operations that do not map onto matrix-multiply patterns.
CUDA has been under continuous development since 2007, with major versions typically aligned to new GPU architectures.
| Version | Year | Key features | GPU architecture |
|---|---|---|---|
| CUDA 1.0 | 2007 | Initial release. C-like programming for GPUs. | G80 (GeForce 8800) |
| CUDA 2.0 | 2008 | Mac OS X support, double-precision (FP64). | GT200 |
| CUDA 3.0 | 2010 | Fermi architecture support, improved C++ language support. | Fermi |
| CUDA 4.0 | 2011 | GPU Direct, unified virtual addressing across multiple GPUs. | Fermi |
| CUDA 5.0 | 2012 | Dynamic parallelism (kernels launching kernels), GPUDirect RDMA. | Kepler |
| CUDA 6.0 | 2014 | Unified Memory (automatic data migration between CPU and GPU). | Kepler |
| CUDA 7.0 | 2015 | C++11 support, cuSOLVER library. | Maxwell |
| CUDA 8.0 | 2016 | FP16 support, Pascal architecture optimizations. | Pascal |
| CUDA 9.0 | 2017 | Tensor Core support, cooperative groups, Volta optimizations. | Volta |
| CUDA 10.0 | 2018 | Turing Tensor Core support, graph APIs. | Turing |
| CUDA 11.0 | 2020 | Ampere Tensor Core support, BF16/TF32, structured sparsity. | Ampere |
| CUDA 12.0 | 2022 | Hopper Tensor Core support, FP8, Transformer Engine, lazy module loading. | Hopper |
| CUDA 13.0 | 2025 | CCCL 3.0, Blackwell optimizations, updated math libraries. | Blackwell |
| CUDA 13.2 | 2026 | Latest stable release (March 2026). | Blackwell |
NVIDIA typically releases point updates (e.g., 12.1, 12.2) between major versions, adding bug fixes, performance improvements, and support for new GPU SKUs. The CUDA Toolkit Archive lists dozens of point releases across the platform's history [6].
Writing efficient CUDA code requires understanding several concepts beyond the basic kernel launch.
Data must be transferred between host (CPU) memory and device (GPU) memory before and after kernel execution. CUDA provides explicit memory management functions (cudaMalloc, cudaMemcpy, cudaFree) as well as Unified Memory, introduced in CUDA 6.0, which creates a single address space accessible from both CPU and GPU. With Unified Memory, the CUDA runtime automatically migrates data pages between host and device memory as needed, simplifying code at the cost of some performance overhead.
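The explicit workflow and its Unified Memory counterpart look like this minimal sketch (error checking omitted):

```cuda
size_t bytes = n * sizeof(float);

// Explicit management: separate host and device allocations.
float *h_A = (float *)malloc(bytes);
float *d_A;
cudaMalloc(&d_A, bytes);

cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
// ... launch kernels that operate on d_A ...
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_A);
free(h_A);

// Unified Memory: one pointer valid on both host and device; the
// runtime migrates pages between them on demand.
float *u_A;
cudaMallocManaged(&u_A, bytes);
// ... use u_A from CPU code and from kernels alike ...
cudaFree(u_A);
```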
CUDA streams allow multiple operations (kernel launches, memory copies) to execute concurrently. Operations within the same stream execute in order, but operations in different streams can overlap. This is essential for hiding memory transfer latency: while one stream is copying data to the GPU, another stream can be executing a kernel on previously transferred data.
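A common pattern splits the input into chunks and pipelines them across two streams, so the copy of one chunk overlaps the kernel working on the other. A sketch (`process`, the buffers, and the launch configuration are placeholders; the host buffers must be pinned with cudaMallocHost for copies to overlap with compute):

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// The two chunks proceed independently; within each stream the copy
// is guaranteed to finish before its kernel runs.
cudaMemcpyAsync(d_in0, h_in0, bytes, cudaMemcpyHostToDevice, s0);
process<<<grid, block, 0, s0>>>(d_in0, d_out0);

cudaMemcpyAsync(d_in1, h_in1, bytes, cudaMemcpyHostToDevice, s1);
process<<<grid, block, 0, s1>>>(d_in1, d_out1);

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
```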
Introduced in CUDA 10, CUDA Graphs allow developers to capture a sequence of GPU operations (kernel launches, memory copies) as a graph data structure, then replay the entire graph with a single API call. This eliminates per-operation CPU overhead and is particularly effective for workloads with many small kernels, such as the iterative token generation loop in LLM inference. CUDA Graphs can reduce CPU overhead by 10-100x for launch-bound workloads.
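The simplest way to build a graph is stream capture: run the sequence once under capture, instantiate the result, then replay it. A sketch of that pattern (stepA and stepB are placeholder kernels):

```cuda
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Record the operation sequence once instead of re-issuing it every step.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
stepA<<<grid, block, 0, stream>>>(d_x);
stepB<<<grid, block, 0, stream>>>(d_x, d_y);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Replay: one API call per iteration, however many kernels the
// captured sequence contains.
for (int step = 0; step < numSteps; ++step)
    cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```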
Occupancy refers to the ratio of active warps to the maximum number of warps an SM can support. Higher occupancy generally helps hide memory latency by giving the scheduler more warps to switch between when one stalls. However, maximum occupancy does not always yield maximum performance; sometimes reducing occupancy to allow each thread to use more registers or shared memory produces better results. NVIDIA provides an occupancy calculator tool to help developers find the optimal balance.
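The occupancy calculation is also exposed programmatically. This sketch queries how many blocks of the earlier vectorAdd kernel can be resident per SM at a given block size, then derives the resulting occupancy:

```cuda
int blockSize = 256;
int maxBlocksPerSM = 0;
// Accounts for vectorAdd's register and shared-memory usage on this device.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, vectorAdd,
                                              blockSize, /*dynamicSMem=*/0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (float)(maxBlocksPerSM * blockSize)
                / prop.maxThreadsPerMultiProcessor;
```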
NVCC compiles CUDA code into PTX, an intermediate representation that is similar to assembly language but is forward-compatible across GPU architectures. At runtime, the CUDA driver can JIT-compile PTX into the native machine code (SASS) for the specific GPU in the system. This means a CUDA application compiled on one generation of hardware can run on future GPU architectures without recompilation, although pre-compiled SASS code for a specific architecture will typically run faster.
Introduced in CUDA 9, cooperative groups provide a flexible programming model for expressing thread synchronization at various granularities. Beyond the traditional block-level synchronization (__syncthreads), cooperative groups allow synchronization across thread blocks within a grid, across thread blocks within a cluster (Hopper+), or within arbitrary subgroups of threads within a warp. This flexibility is essential for implementing complex algorithms like multi-block reductions and graph algorithms.
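A small example of the warp-level granularity: in this sketch, each 32-thread tile reduces its values with register-to-register shuffles, needing no shared memory and no block-wide barrier (assumes the input length is a multiple of the block size):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void blockSum(const float *in, float *out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    float v = in[blockIdx.x * blockDim.x + threadIdx.x];

    // Butterfly reduction across the tile's 32 lanes.
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        v += warp.shfl_down(v, offset);

    // Lane 0 of each warp contributes its partial sum.
    if (warp.thread_rank() == 0)
        atomicAdd(out, v);
}
```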
CUDA's dominance in the AI and HPC markets goes far beyond the quality of its compiler or runtime. It is the result of nearly two decades of accumulated ecosystem development that creates enormous switching costs for users.
NVIDIA's software stack extends from low-level driver APIs through libraries (cuBLAS, cuDNN, NCCL), frameworks (PyTorch, TensorFlow, JAX), and up to application-level tools (TensorRT for inference optimization, Triton Inference Server for model serving). Every layer of this stack is optimized for NVIDIA hardware, and the layers are tightly integrated with each other. When a researcher calls torch.matmul() in PyTorch, that call flows through multiple NVIDIA-optimized layers before reaching the GPU hardware.
As of 2025, NVIDIA controls approximately 86% of data center GPU revenue and maintains around 80% of the AI accelerator market share [7]. The switching costs for moving off CUDA exceed the performance advantages offered by any competitor for virtually every customer.
Millions of developers have been trained on CUDA through university courses, online tutorials, and industry experience. This creates a self-reinforcing cycle: because CUDA developers are abundant, companies build on CUDA; because companies build on CUDA, new developers learn CUDA. The CUDA developer community produces a continuous stream of open-source libraries, research code, and educational materials that further entrenches the platform.
Every major deep learning framework has deep CUDA integration. PyTorch's GPU backend is built almost entirely on CUDA, cuDNN, and cuBLAS. TensorFlow's GPU support relies on the same libraries. JAX uses XLA, which generates CUDA code for NVIDIA GPUs. Even newer frameworks and compilers like Triton (the language by OpenAI, not to be confused with NVIDIA's inference server) compile to PTX and run on CUDA-capable GPUs.
The CUDA ecosystem's depth creates a chicken-and-egg problem for competitors. Developers will not invest time in learning a new GPU programming platform unless it has strong library and framework support. Library and framework developers will not invest time in supporting a new platform unless it has a large user base. Breaking out of this cycle requires simultaneous investment on multiple fronts, which is why even well-resourced competitors like AMD and Intel have found it difficult to challenge CUDA's position.
Several alternative GPU computing platforms have emerged as challengers to CUDA, though none has achieved comparable ecosystem breadth.
ROCm (Radeon Open Compute) is AMD's open-source GPU computing platform, analogous to CUDA. ROCm provides a compiler (hipcc), runtime, and libraries (rocBLAS, MIOpen, RCCL) that roughly mirror CUDA's offerings. AMD also provides HIP (Heterogeneous-computing Interface for Portability), a CUDA-like C++ dialect and runtime API that can be compiled for either AMD or NVIDIA GPUs.
ROCm has made significant progress since its early days. As of 2025, ROCm supports AI deployments at companies including Meta, OpenAI, Fireworks AI, and Cohere, as well as cloud-scale systems like Oracle's MI300X superclusters and AMD-based virtual machines on Microsoft Azure [8]. Performance benchmarks in 2025 show that ROCm has dramatically narrowed the gap with CUDA, though CUDA still maintains a lead in library maturity, documentation quality, and breadth of framework support.
The AMD Instinct MI300X (2023) and MI325X (2024) GPUs, running ROCm, have proven competitive with NVIDIA's H100 on many AI workloads. The upcoming MI350 (CDNA 4, expected 2025) and MI400 (2026) generations are expected to further close the hardware gap.
Intel's oneAPI platform is built around SYCL, an open-standard C++ programming model for heterogeneous computing. oneAPI targets CPUs, GPUs, FPGAs, and other accelerators with a single codebase. While oneAPI promotes standardization and vendor neutrality, it has struggled to gain traction in the AI market. Intel's Gaudi accelerators used a separate software stack, and the company's decision to discontinue the Gaudi line in favor of future GPU products has created uncertainty about Intel's AI accelerator roadmap [9].
| Feature | CUDA | ROCm | oneAPI |
|---|---|---|---|
| Vendor | NVIDIA | AMD | Intel |
| Languages | CUDA C/C++, PTX | HIP (CUDA-like), OpenCL | SYCL, DPC++ |
| Key libraries | cuBLAS, cuDNN, NCCL, Thrust | rocBLAS, MIOpen, RCCL | oneMKL, oneDNN |
| GPU support | NVIDIA only | AMD (and NVIDIA via HIP) | Intel, NVIDIA, AMD (via plugins) |
| Maturity | Very high (19 years) | Moderate (improving rapidly) | Low-moderate |
| Open source | Partially (libraries closed) | Fully open source | Partially open source |
| Performance gap vs. CUDA | Baseline | 10-30% slower on compute-bound; competitive on memory-bound | Significant gap |
| Framework support | Universal | PyTorch, TensorFlow, JAX | Limited |
| Hardware breadth | Budget GTX to datacenter H100/B200 | Datacenter Instinct MI series primarily | Datacenter and consumer GPUs |
| Windows support | Full | Limited (ROCm 7.2 added Windows) | Full |
ROCm 7.0 expanded hardware support greatly, and ROCm integrates directly with PyTorch, TensorFlow, and JAX, allowing teams to move models from NVIDIA to AMD hardware by swapping containers and drivers rather than rewriting code. AMD also funded ZLUDA, a drop-in CUDA implementation built on ROCm that is now open-source, enabling some CUDA applications to run on AMD hardware without modification [8].
The Khronos Group's OpenCL provides a vendor-neutral alternative to CUDA, but its lower-level API and less polished tooling have limited its adoption for AI workloads. OpenAI's Triton programming language offers a Python-based alternative for writing GPU kernels that compiles to NVIDIA, AMD, and (experimentally) Intel GPUs, potentially reducing CUDA lock-in at the kernel level. Projects like ZLUDA have attempted to run CUDA binaries on non-NVIDIA hardware through translation layers, though these approaches typically incur performance overhead and compatibility limitations.
CUDA's role in the modern AI stack can be understood by tracing a typical training or inference operation through the software layers.
When a user calls a PyTorch operation such as model(input), the call descends through this stack: PyTorch's dispatcher routes each operator to its CUDA backend, which invokes cuDNN, cuBLAS, or custom CUDA kernels; the CUDA runtime launches those kernels; and the driver schedules them onto the GPU's streaming multiprocessors.
For large language model inference, the CUDA ecosystem includes additional specialized tools. TensorRT optimizes trained models for inference by performing layer fusion, quantization, and kernel auto-tuning. TensorRT-LLM extends this with LLM-specific optimizations like in-flight batching and KV cache management. Serving frameworks like vLLM, SGLang, and Triton Inference Server all run on CUDA.
CUDA remains the undisputed standard for GPU computing in AI as of early 2026. The latest release, CUDA Toolkit 13.2 (March 2026), includes updates to the NVCC compiler, math libraries, and Nsight developer tools, along with full support for the Blackwell architecture [10].
NVIDIA continues to invest heavily in CUDA's evolution, with recent work spanning the compiler, the core math libraries, and tooling for increasingly large-scale training and inference.
The competitive landscape is evolving. AMD's ROCm has become a credible alternative for organizations willing to invest in porting and optimization. Google's TPU ecosystem operates independently of CUDA using JAX and XLA. But for the vast majority of AI practitioners, CUDA remains the path of least resistance: it works, it is fast, and nearly every AI tool and library supports it out of the box.
The depth of CUDA's ecosystem, built over 19 years of continuous development and adopted by millions of developers worldwide, represents one of the most significant software moats in the technology industry. While the long-term trend may be toward greater hardware abstraction and portability (driven by efforts like Triton, SYCL, and MLIR), CUDA's dominance is likely to persist for years to come.