CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) created by NVIDIA. First released in June 2007, CUDA allows software developers to use NVIDIA GPUs for general-purpose processing, a technique known as GPGPU (General-Purpose computing on Graphics Processing Units). It provides a C/C++-like programming interface that lets developers write code for GPU execution without needing to express computations as graphics operations. Over nearly two decades, CUDA has become the dominant software platform for deep learning, scientific computing, and high-performance computing, giving NVIDIA a software ecosystem advantage that competitors have struggled to replicate.
Before CUDA, researchers who wanted to run general-purpose computations on GPUs had to disguise their math as graphics shaders, a cumbersome and error-prone process that limited GPU computing to a small community of specialists. NVIDIA recognized that the many parallel cores inside its GPUs could be useful for workloads far beyond rendering triangles, and began developing a general-purpose programming model in the mid-2000s.
The initial CUDA SDK was made public on February 15, 2007, for Microsoft Windows and Linux [1]. The first GPU to support CUDA natively, the GeForce 8800 GTX (G80 architecture), had launched a few months earlier, in November 2006. Mac OS X support was added in CUDA 2.0, released in 2008. The key insight behind CUDA was that GPU hardware, with its thousands of lightweight cores designed for throughput rather than latency, mapped naturally onto the data-parallel workloads found in scientific simulations, image processing, and (eventually) neural network training.
The early years of CUDA adoption were concentrated in the scientific computing community. Researchers in molecular dynamics, computational fluid dynamics, and financial modeling were among the first to exploit GPU parallelism through CUDA. But the event that transformed CUDA from a niche scientific tool into the backbone of the AI industry came in 2012.
On September 30, 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet into the ImageNet Large Scale Visual Recognition Challenge. AlexNet was a convolutional neural network with 60 million parameters that achieved a top-5 error rate of 15.3%, beating the runner-up by more than 10 percentage points [2]. Krizhevsky trained the model on two NVIDIA GTX 580 consumer GPUs using his custom cuda-convnet library, written in CUDA. The victory demonstrated that deep neural networks, combined with large datasets and GPU-accelerated training, could dramatically outperform hand-engineered computer vision methods.
AlexNet's success rested on the convergence of three developments: the availability of large labeled datasets (ImageNet), general-purpose GPU computing (CUDA), and improved training techniques for deep networks (such as ReLU activations and dropout regularization). For several years afterward, Krizhevsky's cuda-convnet code was the industry standard and powered the first wave of the deep learning boom. Every major deep learning framework that followed, from Caffe to TensorFlow to PyTorch, was built on top of CUDA.
By 2015, CUDA's development had shifted its focus increasingly toward accelerating machine learning and artificial neural network workloads, a direction that has only intensified since.
CUDA provides an abstraction layer over the GPU's hardware architecture that allows developers to write parallel programs without needing to understand every detail of the underlying silicon. The programming model centers on three core concepts: kernels, the thread hierarchy, and the memory hierarchy.
A CUDA kernel is a function that executes on the GPU. When a program launches a kernel, it runs simultaneously across many threads. Unlike a CPU function that executes once on a single core, a kernel is invoked by potentially millions of GPU threads, each executing the same code on different data. This is the SIMT (Single Instruction, Multiple Threads) execution model.
A kernel is defined using the __global__ keyword in CUDA C++ and is launched with a special syntax that specifies how many threads should execute it:
```cuda
__global__ void vectorAdd(float *A, float *B, float *C, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
```
In this example, each thread computes a single element of the output vector. The threadIdx, blockIdx, and blockDim built-in variables let each thread determine which element it is responsible for.
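Concretely, the host launches the kernel with the triple-angle-bracket syntax, choosing a block size and rounding the grid size up so it covers all n elements. A minimal sketch (the device pointers d_A, d_B, d_C are assumed to have been allocated already; error handling omitted):

```cuda
int n = 1 << 20;                 // one million elements
int threadsPerBlock = 256;
// Round up so the grid covers all n elements even when n % 256 != 0.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);
cudaDeviceSynchronize();         // kernel launches are asynchronous
```

The `if (i < n)` guard in the kernel is what makes the rounding safe: threads in the final block whose index falls past the end of the vector simply do nothing.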
CUDA organizes threads into a three-level hierarchy:
| Level | Description | Typical size |
|---|---|---|
| Thread | The smallest unit of execution. Each thread has its own registers and local memory. | 1 thread |
| Thread block | A group of threads that execute on a single streaming multiprocessor (SM). Threads within a block can share data through shared memory and synchronize with each other. | Up to 1,024 threads |
| Grid | A collection of thread blocks that together execute a single kernel. | Millions of threads across thousands of blocks |
Thread blocks and grids can be one-, two-, or three-dimensional, which is convenient for mapping onto data structures like vectors (1D), images (2D), or volumes (3D). All threads within a thread block are guaranteed to execute on the same SM, which enables efficient communication and synchronization within a block. Threads in different blocks cannot directly synchronize with each other during kernel execution.
The hardware further groups threads into warps of 32. All threads in a warp execute the same instruction at the same time. When threads in a warp take different branches (called warp divergence), the GPU must serialize the divergent paths, which reduces efficiency. Writing CUDA code that minimizes warp divergence is a key optimization technique.
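The effect can be seen in any kernel whose branch depends on per-thread data. In this illustrative sketch, whenever a warp's 32 elements contain both positive and non-positive values, the hardware executes both paths back to back, masking off the inactive threads each time:

```cuda
__global__ void absOrDouble(const float *in, float *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)
        out[i] = in[i] * 2.0f;   // path A
    else
        out[i] = -in[i];         // path B, serialized after path A on divergence
}
```

Where the data layout allows, partitioning inputs so that warps see uniform branches, or replacing short branches with predicated arithmetic, recovers the lost throughput.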
Starting with NVIDIA Compute Capability 9.0 (Hopper architecture), CUDA introduces an optional additional level of hierarchy called Thread Block Clusters. A cluster is a group of thread blocks that are guaranteed to be co-scheduled on the same GPU Processing Cluster (GPC). This enables efficient communication between thread blocks within a cluster using distributed shared memory, which was not possible in earlier architectures where thread blocks could be scheduled on any SM. Thread block clusters are particularly useful for workloads like matrix multiplication and attention computation, where data sharing between adjacent blocks can reduce redundant global memory accesses.
CUDA exposes several levels of memory, each with different capacity, latency, and visibility:
| Memory type | Scope | Latency | Capacity | Typical use |
|---|---|---|---|---|
| Registers | Per thread | ~1 cycle | Limited (thousands per SM) | Thread-local variables |
| Shared memory | Per thread block | ~20-30 cycles | 48-228 KB per SM (configurable) | Inter-thread communication within a block |
| L1 cache | Per SM | ~30 cycles | 128-256 KB | Automatic caching of global memory accesses |
| L2 cache | Per GPU | ~200 cycles | 4-60 MB | Automatic caching |
| Global memory (HBM) | All threads | ~400 cycles | 16-288 GB | Main GPU memory (model weights, data) |
| Constant memory | All threads (read-only) | ~5 cycles (cached) | 64 KB | Kernel parameters, lookup tables |
| Texture memory | All threads (read-only) | ~5 cycles (cached) | Size of global memory | Spatially local read patterns |
Shared memory is the most important tool for optimizing CUDA kernels. Because it resides on-chip and is much faster than global memory, programmers use it to stage data that multiple threads in a block need to access repeatedly. For example, in a matrix multiplication kernel, tiles of the input matrices are loaded into shared memory so that threads can read them many times without incurring the cost of global memory accesses each time.
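The classic tiled matrix multiplication illustrates the pattern. In this sketch (square n x n matrices, n assumed to be a multiple of the tile width, no further tuning), each block stages one tile of A and one tile of B in shared memory, and every element loaded from global memory is then reused TILE times:

```cuda
#define TILE 16

__global__ void matmulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                   // tile fully loaded

        for (int k = 0; k < TILE; ++k)     // TILE reuses per loaded element
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                   // safe to overwrite the tiles
    }
    C[row * n + col] = acc;
}
```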
On Hopper and later architectures, CUDA supports distributed shared memory, which allows thread blocks within a cluster to directly access each other's shared memory. This effectively creates a larger, faster memory pool than what is available to a single block, enabling more efficient implementations of algorithms that require inter-block data sharing.
A critical optimization concept in CUDA is memory coalescing. When threads in a warp access consecutive addresses in global memory, the hardware combines (coalesces) these accesses into a single memory transaction. Coalesced accesses achieve full memory bandwidth utilization, while uncoalesced accesses can reduce effective bandwidth by 8x or more. Designing data layouts that enable coalesced access patterns is one of the most impactful CUDA optimizations.
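The difference is visible even in a trivial copy kernel. In the sketch below, the first version lets each warp touch one contiguous block of addresses per load, while the strided version scatters a warp's accesses across many memory segments:

```cuda
// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// consecutive addresses and combine into a minimal number of transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read addresses `stride` elements apart, so one
// warp's loads can require many separate memory transactions.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```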
The CUDA Toolkit is the full software distribution that NVIDIA provides for developing GPU-accelerated applications. It includes a compiler, runtime libraries, debugging and profiling tools, and a suite of GPU-optimized math libraries.
NVCC (NVIDIA CUDA Compiler) is the compiler driver for CUDA programs. It accepts CUDA C++ source files (typically with a .cu extension), separates the host (CPU) code from the device (GPU) code, compiles the device code into PTX (Parallel Thread Execution) intermediate representation or directly into GPU machine code (SASS), and passes the host code to a standard C++ compiler like GCC or MSVC. NVCC supports standard C++ compiler options for defining macros, specifying include and library paths, and controlling optimization levels.
The CUDA Toolkit ships with a rich set of GPU-accelerated libraries that provide optimized building blocks for common computational tasks:
| Library | Purpose | Typical use in AI |
|---|---|---|
| cuBLAS | GPU-accelerated Basic Linear Algebra Subprograms (BLAS). Optimized matrix multiplication, vector operations, and dot products. | Foundation for all matrix math in neural networks |
| cuDNN | GPU-accelerated deep neural network primitives. Provides optimized implementations of convolutions, recurrent layers, normalization, activation functions, and attention operations. | Used by PyTorch, TensorFlow, and every major DL framework |
| cuFFT | GPU-accelerated Fast Fourier Transform library. | Signal processing, audio models |
| cuSPARSE | Sparse matrix operations. | Sparse neural networks, graph neural networks |
| cuSOLVER | Dense and sparse linear system solvers. | Scientific computing, optimization |
| cuRAND | Random number generation on GPU. | Dropout, data augmentation, stochastic processes |
| Thrust | C++ parallel programming library resembling the C++ Standard Template Library (STL). Provides efficient sort, reduce, scan, and other parallel primitives. | Data preprocessing, custom parallel algorithms |
| NCCL | NVIDIA Collective Communications Library for multi-GPU and multi-node communication. | Distributed training across GPU clusters |
cuDNN (CUDA Deep Neural Network library) deserves special attention because it is the library that most directly enables deep learning on NVIDIA GPUs. When PyTorch or TensorFlow execute a convolution or a matrix multiplication during neural network training, they typically call into cuDNN or cuBLAS under the hood. cuDNN is hand-tuned for each NVIDIA GPU architecture and data type, providing performance that would be extremely difficult for framework developers to achieve on their own.
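To give a flavor of what frameworks do under the hood, the sketch below calls cuBLAS directly for a single-precision GEMM on device-resident n x n matrices (error checking omitted; note that cuBLAS follows the Fortran BLAS column-major convention, which is why frameworks often swap operand order when bridging from row-major tensors):

```cuda
#include <cublas_v2.h>

void gemm(const float *dA, const float *dB, float *dC, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, all n x n, column-major, leading dimension n.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
}
```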
cuDNN 9.0, released in 2024, introduced extensive enhancements for Scaled Dot-Product Attention (SDPA), the core operation in transformer models. Key advances include:
| Feature | Details |
|---|---|
| FlashAttention-style kernels | Highly optimized attention kernels that minimize memory bandwidth usage by computing attention in tiles without materializing the full attention matrix |
| FP8 attention | Native support for FP8 data type on Hopper and Blackwell GPUs, achieving up to 3x throughput improvement over BF16 |
| BF16 attention | Up to 2x faster throughput compared to cuDNN 8.x implementations |
| H200 peak throughput | Up to 1.2 PFLOPS in FP8 on a single H200 GPU |
| Framework support | PyTorch and JAX (vs. flash-attn which only supports PyTorch) |
| Stream-K attention | 200% average speedup for LLM decoding phase (sequence length 1 queries) |
A notable finding from benchmarks: flash-attention outperforms cuDNN attention on Ampere GPUs, while cuDNN attention has a 20-50% advantage on Hopper GPUs. This is because cuDNN 9's attention kernels are specifically optimized for the Hopper Tensor Core architecture and its FP8 support.
The Transformer Engine, built on top of cuDNN, enables automatic mixed-precision training that dynamically selects between FP8 and higher precisions on a per-layer basis. Benchmarks show a 1.15x speedup for Llama 2 70B LoRA fine-tuning when using cuDNN FP8 SDPA via the Transformer Engine on an 8-GPU H200 node [13].
NCCL (NVIDIA Collective Communications Library, pronounced "nickel") is the communication backbone for distributed AI training. It provides topology-aware, hardware-accelerated implementations of collective operations that are essential for synchronizing gradients and distributing data across multiple GPUs.
| Operation | Purpose | Use in training |
|---|---|---|
| AllReduce | Reduces data across all GPUs and distributes result to all | Gradient synchronization in data parallelism |
| AllGather | Gathers data from all GPUs and distributes full result to all | Weight gathering in ZeRO optimization |
| ReduceScatter | Reduces data and scatters result across GPUs | Gradient reduction in ZeRO Stage 2+ |
| Broadcast | Sends data from one GPU to all others | Model weight initialization |
| Send/Recv | Point-to-point data transfer | Pipeline parallelism stage boundaries |
NCCL automatically detects the communication topology (NVLink, PCIe, InfiniBand, Ethernet) and selects optimal algorithms for each. Within a node, NCCL leverages NVLink's high bandwidth (up to 1.8 TB/s on Blackwell NVLink 5). Across nodes, it uses InfiniBand RDMA or RoCE for low-latency, high-throughput communication. NCCL achieves near-linear scaling across GPUs by minimizing CPU involvement in the communication path.
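The gradient-synchronization pattern from the table looks roughly like the following single-process, multi-GPU sketch (communicators are assumed to have been created elsewhere with ncclCommInitAll; error handling omitted):

```cuda
#include <nccl.h>

void syncGradients(float **grads, size_t count, int nGPUs,
                   ncclComm_t *comms, cudaStream_t *streams) {
    // Group the calls so NCCL can launch all per-GPU operations together.
    ncclGroupStart();
    for (int g = 0; g < nGPUs; ++g)
        ncclAllReduce(grads[g], grads[g], count, ncclFloat, ncclSum,
                      comms[g], streams[g]);
    ncclGroupEnd();

    // After synchronization, every GPU holds the summed gradients.
    for (int g = 0; g < nGPUs; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
}
```

In practice, frameworks divide the summed gradients by the number of workers (or scale the loss) to obtain the average.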
NCCL remains under active development, and recent releases such as NCCL 2.27 have continued to add performance and scalability features.
Meta developed NCCLX, an extended collective communication framework built on NCCL, specifically designed for clusters exceeding 100,000 GPUs. NCCLX operates beneath the PyTorch layer and manages all communications for both training and inference. It provides three execution modes: host-initiated APIs, host-initiated APIs with GPU-resident metadata, and device-initiated APIs. This level of scale is necessary for training the largest frontier models.
In CUDA Toolkit 13.0, NVIDIA introduced the CUDA Core Compute Library (CCCL) version 3.0, which unifies Thrust, CUB, and libcudacxx into a single parallel programming foundation [3].
The toolkit also includes Nsight Developer Tools for debugging and profiling GPU code. Nsight Compute provides detailed kernel-level performance analysis, showing metrics like achieved occupancy, memory throughput, and instruction mix. Nsight Systems provides system-level profiling that shows how GPU kernels, CPU code, memory transfers, and communication operations interact over time. These tools are essential for identifying and resolving performance bottlenecks in CUDA applications.
Modern NVIDIA GPUs contain two distinct types of processing units that are both important for AI workloads but serve different purposes.
CUDA cores are the general-purpose parallel processors within an NVIDIA GPU. Each CUDA core can execute one floating-point or integer operation per clock cycle. They handle a wide range of computations: element-wise operations, data preprocessing, activation functions, custom operations, and any other general parallel workload. CUDA cores primarily operate on single-precision (FP32) and double-precision (FP64) floating-point numbers.
The number of CUDA cores has grown substantially with each GPU generation: the V100 (Volta, 2017) has 5,120 CUDA cores; the A100 (Ampere, 2020) has 6,912; the H100 (Hopper, 2022) has 16,896; and the B200 (Blackwell, 2024) has over 18,000 [4].
Tensor Cores are specialized hardware units designed specifically for the matrix-multiply-accumulate operations that dominate deep learning. First introduced in the Volta architecture (V100) in 2017, Tensor Cores perform small matrix multiplications (for example, 4x4 or 16x16 tiles) in a single operation, achieving much higher throughput than CUDA cores for these specific workloads.
Tensor Cores support mixed-precision arithmetic, computing in lower-precision formats (FP16, BF16, FP8, INT8, FP4) while accumulating results in higher precision (FP32). This approach reduces memory bandwidth requirements and increases throughput while maintaining acceptable numerical accuracy for neural network training and inference.
| Feature | CUDA Cores | Tensor Cores |
|---|---|---|
| Purpose | General-purpose parallel processing | Matrix multiply-accumulate for deep learning |
| Precision | Primarily FP32, FP64 | FP16, BF16, FP8, FP4, INT8 (with FP32 accumulation) |
| Best for | Preprocessing, simulations, rendering, general compute | Neural network training and inference |
| Speedup for AI | Baseline | 2-5x faster than CUDA cores for matrix operations |
| First introduced | 2007 (G80) | 2017 (Volta V100) |
Each GPU generation has expanded Tensor Core capabilities significantly:
| Generation | Architecture | Year | Key capabilities | Supported precisions |
|---|---|---|---|---|
| 1st gen | Volta | 2017 | First Tensor Cores, 4x4 FP16 matrix ops | FP16 input, FP32 accumulate |
| 2nd gen | Turing | 2018 | Added INT8, INT4 support | FP16, INT8, INT4 |
| 3rd gen | Ampere | 2020 | BF16 + TF32 support, 2:4 structured sparsity (2x speedup), doubled throughput | FP16, BF16, TF32, INT8, INT4 |
| 4th gen | Hopper | 2022 | FP8 support, Transformer Engine (dynamic per-layer precision), 16x16 warp-level operations | FP16, BF16, TF32, FP8, INT8 |
| 5th gen | Blackwell | 2024 | FP4 support (2nd-gen Transformer Engine), micro tensor scaling, roughly doubled throughput again | FP16, BF16, TF32, FP8, FP4, INT8 |
The progression from Volta to Blackwell represents roughly a 30x increase in effective Tensor Core throughput for AI workloads, driven by both more Tensor Cores per chip and support for lower-precision data types that pack more operations per cycle.
The Transformer Engine, introduced with Hopper, is particularly significant. It automatically manages precision selection on a per-layer basis during training. For each layer, the Transformer Engine analyzes the distribution of activations and dynamically chooses between FP8 and FP16/BF16 computation. This eliminates the need for manual mixed-precision tuning and allows models to train in FP8 with minimal accuracy loss. The second-generation Transformer Engine on Blackwell extends this to FP4 with micro tensor scaling [5].
In practice, CUDA cores and Tensor Cores work together. A typical training step involves Tensor Cores handling the heavy matrix multiplications in the forward and backward passes, while CUDA cores handle activation functions, loss computation, gradient scaling, and other operations that do not map onto matrix-multiply patterns.
CUDA has been under continuous development since 2007, with major versions typically aligned to new GPU architectures.
| Version | Year | Key features | GPU architecture |
|---|---|---|---|
| CUDA 1.0 | 2007 | Initial release. C-like programming for GPUs. | G80 (GeForce 8800) |
| CUDA 2.0 | 2008 | Mac OS X support, double-precision (FP64). | GT200 |
| CUDA 3.0 | 2010 | Fermi architecture support, improved C++ language support. | Fermi |
| CUDA 4.0 | 2011 | GPU Direct, unified virtual addressing across multiple GPUs. | Fermi |
| CUDA 5.0 | 2012 | Dynamic parallelism (kernels launching kernels), GPUDirect RDMA. | Kepler |
| CUDA 6.0 | 2014 | Unified Memory (automatic data migration between CPU and GPU). | Kepler |
| CUDA 7.0 | 2015 | C++11 support, cuSOLVER library. | Maxwell |
| CUDA 8.0 | 2016 | FP16 support, Pascal architecture optimizations. | Pascal |
| CUDA 9.0 | 2017 | Tensor Core support, cooperative groups, Volta optimizations. | Volta |
| CUDA 10.0 | 2018 | Turing Tensor Core support, graph APIs. | Turing |
| CUDA 11.0 | 2020 | Ampere Tensor Core support, BF16/TF32, structured sparsity. | Ampere |
| CUDA 12.0 | 2022 | Hopper Tensor Core support, FP8, Transformer Engine, lazy module loading. | Hopper |
| CUDA 13.0 | 2025 | CCCL 3.0, Blackwell optimizations, updated math libraries. | Blackwell |
| CUDA 13.2 | 2026 | Latest stable release (March 2026). | Blackwell |
NVIDIA typically releases point updates (e.g., 12.1, 12.2) between major versions, adding bug fixes, performance improvements, and support for new GPU SKUs. The CUDA Toolkit Archive lists dozens of point releases across the platform's history [6].
Writing efficient CUDA code requires understanding several concepts beyond the basic kernel launch.
Data must be transferred between host (CPU) memory and device (GPU) memory before and after kernel execution. CUDA provides explicit memory management functions (cudaMalloc, cudaMemcpy, cudaFree) as well as Unified Memory, introduced in CUDA 6.0, which creates a single address space accessible from both CPU and GPU. With Unified Memory, the CUDA runtime automatically migrates data pages between host and device memory as needed, simplifying code at the cost of some performance overhead.
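The explicit workflow and its Unified Memory counterpart look like this minimal sketch (error checking omitted):

```cuda
size_t bytes = n * sizeof(float);

// Explicit management: separate host and device allocations.
float *h_A = (float *)malloc(bytes);
float *d_A;
cudaMalloc(&d_A, bytes);

cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
// ... launch kernels that operate on d_A ...
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_A);
free(h_A);

// Unified Memory: one pointer valid on both host and device; the
// runtime migrates pages between them on demand.
float *u_A;
cudaMallocManaged(&u_A, bytes);
// ... use u_A from CPU code and from kernels alike ...
cudaFree(u_A);
```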
CUDA streams allow multiple operations (kernel launches, memory copies) to execute concurrently. Operations within the same stream execute in order, but operations in different streams can overlap. This is essential for hiding memory transfer latency: while one stream is copying data to the GPU, another stream can be executing a kernel on previously transferred data.
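A common pattern splits the input into chunks and pipelines them across two streams, so the copy of one chunk overlaps the kernel working on the other. A sketch (`process`, the buffers, and the launch configuration are placeholders; the host buffers must be pinned with cudaMallocHost for copies to overlap with compute):

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// The two chunks proceed independently; within each stream the copy
// is guaranteed to finish before its kernel runs.
cudaMemcpyAsync(d_in0, h_in0, bytes, cudaMemcpyHostToDevice, s0);
process<<<grid, block, 0, s0>>>(d_in0, d_out0);

cudaMemcpyAsync(d_in1, h_in1, bytes, cudaMemcpyHostToDevice, s1);
process<<<grid, block, 0, s1>>>(d_in1, d_out1);

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
```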
Introduced in CUDA 10, CUDA Graphs allow developers to capture a sequence of GPU operations (kernel launches, memory copies) as a graph data structure, then replay the entire graph with a single API call. This eliminates per-operation CPU overhead and is particularly effective for workloads with many small kernels, such as the iterative token generation loop in LLM inference. CUDA Graphs can reduce CPU overhead by 10-100x for launch-bound workloads.
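The simplest way to build a graph is stream capture: run the sequence once under capture, instantiate the result, then replay it. A sketch of that pattern (stepA and stepB are placeholder kernels):

```cuda
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Record the operation sequence once instead of re-issuing it every step.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
stepA<<<grid, block, 0, stream>>>(d_x);
stepB<<<grid, block, 0, stream>>>(d_x, d_y);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Replay: one API call per iteration, however many kernels the
// captured sequence contains.
for (int step = 0; step < numSteps; ++step)
    cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```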
Occupancy refers to the ratio of active warps to the maximum number of warps an SM can support. Higher occupancy generally helps hide memory latency by giving the scheduler more warps to switch between when one stalls. However, maximum occupancy does not always yield maximum performance; sometimes reducing occupancy to allow each thread to use more registers or shared memory produces better results. NVIDIA provides an occupancy calculator tool to help developers find the optimal balance.
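The occupancy calculation is also exposed programmatically. This sketch queries how many blocks of the earlier vectorAdd kernel can be resident per SM at a given block size, then derives the resulting occupancy:

```cuda
int blockSize = 256;
int maxBlocksPerSM = 0;
// Accounts for vectorAdd's register and shared-memory usage on this device.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, vectorAdd,
                                              blockSize, /*dynamicSMem=*/0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (float)(maxBlocksPerSM * blockSize)
                / prop.maxThreadsPerMultiProcessor;
```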
NVCC compiles CUDA code into PTX, an intermediate representation that is similar to assembly language but is forward-compatible across GPU architectures. At runtime, the CUDA driver can JIT-compile PTX into the native machine code (SASS) for the specific GPU in the system. This means a CUDA application compiled on one generation of hardware can run on future GPU architectures without recompilation, although pre-compiled SASS code for a specific architecture will typically run faster.
Introduced in CUDA 9, cooperative groups provide a flexible programming model for expressing thread synchronization at various granularities. Beyond the traditional block-level synchronization (__syncthreads), cooperative groups allow synchronization across thread blocks within a grid, across thread blocks within a cluster (Hopper+), or within arbitrary subgroups of threads within a warp. This flexibility is essential for implementing complex algorithms like multi-block reductions and graph algorithms.
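A small example of the warp-level granularity: in this sketch, each 32-thread tile reduces its values with register-to-register shuffles, needing no shared memory and no block-wide barrier (assumes the input length is a multiple of the block size):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void blockSum(const float *in, float *out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    float v = in[blockIdx.x * blockDim.x + threadIdx.x];

    // Butterfly reduction across the tile's 32 lanes.
    for (int offset = warp.size() / 2; offset > 0; offset /= 2)
        v += warp.shfl_down(v, offset);

    // Lane 0 of each warp contributes its partial sum.
    if (warp.thread_rank() == 0)
        atomicAdd(out, v);
}
```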
CUDA's dominance in the AI and HPC markets goes far beyond the quality of its compiler or runtime. It is the result of nearly two decades of accumulated ecosystem development that creates enormous switching costs for users.
NVIDIA's software stack extends from low-level driver APIs through libraries (cuBLAS, cuDNN, NCCL), frameworks (PyTorch, TensorFlow, JAX), and up to application-level tools (TensorRT for inference optimization, Triton Inference Server for model serving). Every layer of this stack is optimized for NVIDIA hardware, and the layers are tightly integrated with each other. When a researcher calls torch.matmul() in PyTorch, that call flows through multiple NVIDIA-optimized layers before reaching the GPU hardware.
As of 2025, NVIDIA controls approximately 86% of data center GPU revenue and maintains around 80% of the AI accelerator market share [7]. The switching costs for moving off CUDA exceed the performance advantages offered by any competitor for virtually every customer.
Millions of developers have been trained on CUDA through university courses, online tutorials, and industry experience. This creates a self-reinforcing cycle: because CUDA developers are abundant, companies build on CUDA; because companies build on CUDA, new developers learn CUDA. The CUDA developer community produces a continuous stream of open-source libraries, research code, and educational materials that further entrenches the platform.
Every major deep learning framework has deep CUDA integration. PyTorch's GPU backend is built almost entirely on CUDA, cuDNN, and cuBLAS. TensorFlow's GPU support relies on the same libraries. JAX uses XLA, which generates CUDA code for NVIDIA GPUs. Even newer frameworks and compilers like Triton (the language by OpenAI, not to be confused with NVIDIA's inference server) compile to PTX and run on CUDA-capable GPUs.
The CUDA ecosystem's depth creates a chicken-and-egg problem for competitors. Developers will not invest time in learning a new GPU programming platform unless it has strong library and framework support. Library and framework developers will not invest time in supporting a new platform unless it has a large user base. Breaking out of this cycle requires simultaneous investment on multiple fronts, which is why even well-resourced competitors like AMD and Intel have found it difficult to challenge CUDA's position.
Several alternative GPU computing platforms have emerged as challengers to CUDA, though none has achieved comparable ecosystem breadth.
ROCm (Radeon Open Compute) is AMD's open-source GPU computing platform, analogous to CUDA. ROCm provides a compiler (hipcc), runtime, and libraries (rocBLAS, MIOpen, RCCL) that roughly mirror CUDA's offerings. AMD also provides HIP (Heterogeneous-computing Interface for Portability), a CUDA-like C++ dialect and runtime API that can be compiled for either AMD or NVIDIA GPUs.
ROCm has made significant progress since its early days. As of 2025, ROCm supports AI deployments at companies including Meta, OpenAI, Fireworks AI, and Cohere, as well as cloud-scale systems like Oracle's MI300X superclusters and AMD-based virtual machines on Microsoft Azure [8]. Performance benchmarks in 2025 show that ROCm has dramatically narrowed the gap with CUDA, though CUDA still maintains a lead in library maturity, documentation quality, and breadth of framework support.
The AMD Instinct MI300X (2023) and MI325X (2024) GPUs, running ROCm, have proven competitive with NVIDIA's H100 on many AI workloads. The upcoming MI350 (CDNA 4, expected 2025) and MI400 (2026) generations are expected to further close the hardware gap.
Intel's oneAPI platform is built around SYCL, an open-standard C++ programming model for heterogeneous computing. oneAPI targets CPUs, GPUs, FPGAs, and other accelerators with a single codebase. While oneAPI promotes standardization and vendor neutrality, it has struggled to gain traction in the AI market. Intel's Gaudi accelerators used a separate software stack, and the company's decision to discontinue the Gaudi line in favor of future GPU products has created uncertainty about Intel's AI accelerator roadmap [9].
| Feature | CUDA | ROCm | oneAPI |
|---|---|---|---|
| Vendor | NVIDIA | AMD | Intel |
| Languages | CUDA C/C++, PTX | HIP (CUDA-like), OpenCL | SYCL, DPC++ |
| Key libraries | cuBLAS, cuDNN, NCCL, Thrust | rocBLAS, MIOpen, RCCL | oneMKL, oneDNN |
| GPU support | NVIDIA only | AMD (and NVIDIA via HIP) | Intel, NVIDIA, AMD (via plugins) |
| Maturity | Very high (19 years) | Moderate (improving rapidly) | Low-moderate |
| Open source | Partially (libraries closed) | Fully open source | Partially open source |
| Performance gap vs. CUDA | Baseline | 10-30% slower on compute-bound; competitive on memory-bound | Significant gap |
| Framework support | Universal | PyTorch, TensorFlow, JAX | Limited |
| Hardware breadth | Budget GTX to datacenter H100/B200 | Datacenter Instinct MI series primarily | Datacenter and consumer GPUs |
| Windows support | Full | Limited (ROCm 7.2 added Windows) | Full |
ROCm 7.0 expanded hardware support greatly, and ROCm integrates directly with PyTorch, TensorFlow, and JAX, allowing teams to move models from NVIDIA to AMD hardware by swapping containers and drivers rather than rewriting code. AMD also funded ZLUDA, a drop-in CUDA implementation built on ROCm that is now open-source, enabling some CUDA applications to run on AMD hardware without modification [8].
The Khronos Group's OpenCL provides a vendor-neutral alternative to CUDA, but its lower-level API and less polished tooling have limited its adoption for AI workloads. OpenAI's Triton programming language offers a Python-based alternative for writing GPU kernels that compiles to NVIDIA, AMD, and (experimentally) Intel GPUs, potentially reducing CUDA lock-in at the kernel level. Projects like ZLUDA have attempted to run CUDA binaries on non-NVIDIA hardware through translation layers, though these approaches typically incur performance overhead and compatibility limitations.
CUDA's role in the modern AI stack can be understood by tracing a typical training or inference operation through the software layers.
When a user calls a PyTorch operation such as model(input), the call descends through this stack: PyTorch's dispatcher routes each operator to its CUDA backend, which invokes cuDNN, cuBLAS, or custom CUDA kernels; the CUDA runtime launches those kernels; and the driver schedules them onto the GPU's streaming multiprocessors.
For large language model inference, the CUDA ecosystem includes additional specialized tools. TensorRT optimizes trained models for inference by performing layer fusion, quantization, and kernel auto-tuning. TensorRT-LLM extends this with LLM-specific optimizations like in-flight batching and KV cache management. Serving frameworks like vLLM, SGLang, and Triton Inference Server all run on CUDA.
CUDA remains the undisputed standard for GPU computing in AI as of early 2026. The latest release, CUDA Toolkit 13.2 (March 2026), includes updates to the NVCC compiler, math libraries, and Nsight developer tools, along with full support for the Blackwell architecture [10].
NVIDIA continues to invest heavily in CUDA's evolution, with recent work spanning the compiler, the core math libraries, and tooling for increasingly large-scale training and inference.
The competitive landscape is evolving. AMD's ROCm has become a credible alternative for organizations willing to invest in porting and optimization. Google's TPU ecosystem operates independently of CUDA using JAX and XLA. But for the vast majority of AI practitioners, CUDA remains the path of least resistance: it works, it is fast, and nearly every AI tool and library supports it out of the box.
The depth of CUDA's ecosystem, built over 19 years of continuous development and adopted by millions of developers worldwide, represents one of the most significant software moats in the technology industry. While the long-term trend may be toward greater hardware abstraction and portability (driven by efforts like Triton, SYCL, and MLIR), CUDA's dominance is likely to persist for years to come.