cuDNN
NVIDIA cuDNN (CUDA Deep Neural Network library) is a proprietary GPU-accelerated software library of primitives for deep learning developed by NVIDIA. It provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.[1][2]
Built on top of the CUDA parallel computing platform, cuDNN is not a standalone deep learning framework but rather a foundational performance library.[3] It serves as a critical abstraction layer that allows high-level frameworks like PyTorch and TensorFlow to leverage the computational power of NVIDIA GPUs without requiring framework developers to write low-level, hardware-specific CUDA code.[4]
The library's existence represents a strategic separation of concerns within the AI ecosystem. By centralizing the complex and time-consuming task of optimizing deep learning kernels for each new GPU architecture, NVIDIA enables framework developers to concentrate on API design, automatic differentiation, and scientific innovation.[5] This approach significantly accelerates the development cycle of the entire AI community on NVIDIA hardware. Furthermore, cuDNN provides a layer of performance portability across GPU generations. As NVIDIA releases new hardware, updated versions of cuDNN incorporate optimized kernels that exploit the new architectural features. Applications and frameworks built against the cuDNN API can often achieve substantial performance gains on new hardware simply by updating the library, without requiring changes to their own source code.[6]
History
Initial Release and Early Development (2014-2017)
NVIDIA released cuDNN in September 2014, amid the rise of deep learning research following breakthroughs in ImageNet competitions where GPU-accelerated convolutional neural networks achieved dramatic improvements in accuracy.[7] NVIDIA introduced it as a set of low-level GPU primitives to boost the performance of deep neural networks on CUDA-compatible GPUs. Although cuDNN can be used directly via its C/C++ API, NVIDIA anticipated it would mostly be used indirectly through higher-level machine learning frameworks that incorporate it.[7]
From early on, frameworks such as Caffe and Torch integrated cuDNN as a backend, allowing developers to get GPU acceleration with minimal code changes. NVIDIA reported up to a 10× speedup in training throughput when using Caffe with cuDNN compared to Caffe's CPU-only mode, demonstrating the significant performance gains of GPU-accelerated deep learning.[7] The initial cuDNN 1.0 release focused on primitives for convolutional neural networks, including convolution, pooling, softmax, and neuron activations such as sigmoid, ReLU, and tanh.[7]
Key milestones in early development included:
- cuDNN 2.0 (March 2015): Performance improvements and expanded support for different network configurations.
- cuDNN 3.0 (July 2015): Support for 16-bit floating point (FP16) data storage, enabling training of larger models.[8]
- cuDNN 4.0 (November 2015): Optimizations for Maxwell architecture and initial support for recurrent neural networks (RNNs).
- cuDNN 5.0 (May 2016): Major update announced at GTC 2016. Added comprehensive support for recurrent neural networks including LSTM and GRU layers for the first time, greatly accelerating sequence learning tasks.[9] Introduced new convolution algorithms including the Winograd algorithm for faster convolutions, 3D convolution support, and improved half-precision (FP16) performance on NVIDIA's Pascal architecture. NVIDIA reported up to 6× speedup in LSTM training when using cuDNN v5's RNN support.[9]
Tensor Core Era (2017-2020)
cuDNN 6.0 (April 2017) focused on performance tuning and robustness, with improved support for dilated (atrous) convolutions.
cuDNN 7.x (2017–2019) marked a major milestone with support for Tensor Cores on Volta architecture GPUs. This series of releases introduced support for Tensor Cores in deep learning operations, allowing FP16 compute with FP32 accumulation to leverage Volta's Tensor Core units for significant speedups.[10] Throughout the 7.x releases, NVIDIA added support for new network layers and optimized existing ones, including improved batch normalization. The library also expanded device support including NVIDIA's embedded Jetson platforms via JetPack.[1]
Graph API Revolution (2020-2023)
cuDNN 8.0 (June 2020) represented a major redesign of cuDNN coinciding with the NVIDIA Ampere architecture launch. cuDNN 8 was optimized for the NVIDIA A100 GPU, with NVIDIA reporting up to 5× higher performance on A100 versus V100 out of the box, thanks to new optimizations and use of hardware features like TensorFloat-32 (TF32).[11]
The flagship feature was the introduction of the declarative Graph API and a runtime fusion engine.[12] This allowed users to express complex, multi-operation computations that cuDNN could then analyze and optimize holistically. The API was overhauled: v8 introduced a new low-level backend API for more flexibility and performance tuning, while providing a compatibility layer for the previous v7 API to ease transition.[11] New capabilities included improved support for conversational AI, computer vision networks, and the ability to fuse multiple operations through the new graph API. Additionally, cuDNN 8 was modularized into smaller component libraries, so applications could include only the needed portions, making integration more lightweight.[11]
Subsequent cuDNN 8.x releases (2020–2023) continuously improved performance and added features. These included support for the NVIDIA Hopper architecture (H100 GPUs), expanded graph API functionalities, and initial support for new data types such as FP8 in late 8.x versions for Hopper. For example, cuDNN 8.9 introduced fused flash attention for training and inference.[13]
Modern Era: Transformer Focus (2024-2025)
cuDNN 9.0 (February 2024) brought the first major version jump in four years, with a primary focus on accelerating Transformer-based models for the era of generative AI and large language models.[14] This version introduced extensive enhancements for Scaled Dot-Product Attention (SDPA), including highly optimized kernels inspired by FlashAttention and robust support for the FP8 data type on Hopper and Blackwell architecture GPUs, offering up to 2× faster throughput in BF16 and up to 3× in FP8 for attention operations compared to earlier implementations.[14]
Key features introduced in cuDNN 9 included:
- Hardware forward compatibility: Ensuring that applications compiled against cuDNN 9.0 or later can run on future, unreleased GPU architectures without modification, automatically benefiting from available performance improvements.[15]
- Mixed input precision support for matrix multiplications and convolutions (allowing inputs in different precisions, e.g., FP16 and FP32, to be used together in one operation)
- Improved error reporting for developers
- More streamlined installation process
- Library reorganization with dependency change from cuBLAS to cuBLASLt[15]
Subsequent 9.x releases have continued to refine features. As of October 2025, cuDNN 9.14.0 includes automatic runtime configuration, complex datatype support for matrix multiplication, enhanced Blackwell architecture optimizations, and 5-10% SDPA performance improvements on Blackwell GPUs.[16]
Core Functionality and Primitives
cuDNN provides a comprehensive suite of optimized primitives that form the building blocks of modern deep neural networks. The set of accelerated routines has evolved over time, reflecting the major research trends and computational demands of the deep learning field.[1]
Convolution Operations
As the cornerstone of Convolutional Neural Networks (CNNs), convolution was one of the original and most critical functions of cuDNN. The library offers highly optimized implementations for forward (inference) and backward (training, for both data and filter gradients) passes of 2D and 3D convolutions.[4][17] It supports essential features like striding, padding, dilation, and grouped convolutions, along with flexible tensor data layouts such as NCHW and NHWC to minimize data transposition overhead.[18]
For convolution operations, cuDNN provides multiple algorithmic implementations including GEMM-based, FFT-based, and Winograd-based methods.[5] The library uses heuristics to automatically select the optimal algorithm for a given input size and GPU architecture.
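A minimal PyTorch sketch (assuming a CUDA-capable GPU and a cuDNN-enabled PyTorch build) of how a framework-level convolution exercises these primitives; the channels_last memory format corresponds to the NHWC layout discussed above.

```python
# Minimal sketch: a 2D convolution that PyTorch dispatches to cuDNN when run
# on an NVIDIA GPU. The channels_last memory format corresponds to NHWC; the
# default layout is NCHW.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3,
                 stride=1, padding=1, dilation=1, groups=1).to(device)
x = torch.randn(8, 64, 56, 56, device=device)            # N, C, H, W

# NHWC (channels_last) often avoids internal transposes on Tensor Core GPUs.
conv = conv.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

y = conv(x)               # forward convolution
y.sum().backward()        # backward pass (filter gradients; data gradients if the input requires grad)
print(y.shape, torch.backends.cudnn.version())
```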
Attention Mechanisms
With the rise of the Transformer architecture, attention has become a primary focus of optimization in recent cuDNN versions. The library includes state-of-the-art implementations of Scaled Dot-Product Attention (SDPA), incorporating techniques from algorithms like FlashAttention to reduce memory consumption and accelerate sequence processing.[19][14] This support extends to various use cases, including:
- Fused multi-head attention
- Sliding window attention
- Grouped Query Attention (GQA)
- Relative positional encoding
- Paged attention for efficient KV cache management (v9.4.0+)
These features are vital for training and inference of large language models.[20]
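As an illustration, the sketch below uses PyTorch's scaled dot-product attention API, which recent PyTorch releases can route to cuDNN's fused attention kernels on supported NVIDIA GPUs; whether cuDNN is actually selected depends on the PyTorch version, GPU, and problem shape.

```python
# Sketch: scaled dot-product attention as exposed by PyTorch. On recent
# PyTorch/cuDNN combinations this call can be served by fused,
# FlashAttention-style SDPA kernels on supported NVIDIA GPUs.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
batch, heads, seq, head_dim = 2, 16, 1024, 64

q = torch.randn(batch, heads, seq, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal (decoder-style) attention; the fused kernel computes
# softmax(QK^T / sqrt(d)) V without materializing the full attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (batch, heads, seq, head_dim)
```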
Matrix Multiplication and Tensor Operations
Matrix multiplication is a fundamental operation for fully connected (dense) layers and for many components of Transformer models, and cuDNN provides highly optimized kernels for general matrix multiplication (GEMM).[1] Beyond neural-network-specific layers, cuDNN also provides fundamental tensor operations such as tensor transforms (reordering data layouts) and reductions, all optimized for GPUs. Many of these leverage NVIDIA's other libraries (e.g., cuBLAS/cuBLASLt) or direct CUDA kernels and are tuned for deep learning workloads.
Pooling and Activation Functions
The library accelerates standard pooling operations like max pooling and average pooling for 2D and 3D spatial dimensions.[18] It also provides fast implementations for common non-linear activation functions including ReLU, Sigmoid, Tanh, GELU, Swish, and ELU, which can be computed standalone or fused with other operations for efficiency.[1]
Normalization Operations
cuDNN offers efficient implementations of various normalization techniques crucial for training stability and performance, including:
- Batch normalization
- Layer normalization
- Instance normalization
- RMS normalization (Root Mean Square)
- Group normalization
These routines support both the training phase (forward and backward passes) and the inference phase of deep networks.[21]
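For illustration, the following PyTorch sketch shows how a framework exposes these normalizations and their training and inference code paths; whether a given layer is routed to cuDNN depends on the framework version and configuration.

```python
# Sketch: normalization layers as exposed by PyTorch; train() and eval()
# select the training (batch statistics, backward supported) and inference
# (running statistics) code paths.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
bn = nn.BatchNorm2d(64).to(device)
ln = nn.LayerNorm(256).to(device)

feature_map = torch.randn(8, 64, 32, 32, device=device)   # N, C, H, W
tokens = torch.randn(8, 128, 256, device=device)          # batch, seq, hidden

bn.train()
y_train = bn(feature_map)   # per-batch statistics, gradients supported
bn.eval()
y_infer = bn(feature_map)   # running statistics, inference path
z = ln(tokens)              # layer normalization over the hidden dimension
```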
Recurrent Neural Networks
Since version 5, cuDNN includes support for recurrent neural network layers. It implements optimized routines for popular RNN architectures including LSTM and GRU networks, as well as simple RNNs with ReLU or tanh activations.[9] These optimized RNN kernels dramatically improve performance for sequence modeling tasks.[22]
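For example, PyTorch's `nn.LSTM` dispatches to cuDNN's fused RNN kernels when run on an NVIDIA GPU with cuDNN available (a minimal sketch, assuming such a setup):

```python
# Sketch: PyTorch's nn.LSTM dispatches to cuDNN's fused RNN kernels on an
# NVIDIA GPU instead of unrolling the recurrence step by step.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
lstm = nn.LSTM(input_size=256, hidden_size=512, num_layers=2,
               batch_first=True).to(device)

x = torch.randn(8, 100, 256, device=device)   # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)
print(output.shape)                            # (8, 100, 512)
```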
Technical Architecture
Library Components
Starting with version 8, cuDNN underwent a significant architectural transformation. Modern cuDNN consists of multiple sub-libraries organized by functionality:[2]
| Library | Function | Description |
|---|---|---|
| libcudnn_graph.so | Graph API | Main graph API for declarative operation composition |
| libcudnn_engines_precompiled.so | Engines | Pre-compiled kernel implementations |
| libcudnn_engines_runtime_compiled.so | JIT Engines | Runtime kernel generation for fusion patterns |
| libcudnn_heuristic.so | Heuristics | Automatic algorithm selection |
| libcudnn.so | Legacy API | Backward compatibility shim layer |
| libcudnn_cnn.so | CNN Operations | Convolution and pooling operations |
| libcudnn_ops.so | Tensor Operations | Basic tensor operations |
| libcudnn_adv.so | Advanced Operations | RNN and batch normalization |
The Graph API
Introduced in cuDNN v8, the Graph API allows a developer to define an entire computation, or a segment of it, as a directed acyclic graph (DAG). In this graph, operations (like convolution or activation) are represented as nodes, and tensors are represented as edges connecting them.[1][12] This declarative approach is a fundamental shift from the legacy model of calling individual functions one by one.
By providing the library with a "global view" of the intended computation, the cuDNN runtime can perform sophisticated, graph-level optimizations that are impossible with a myopic, operation-by-operation perspective. The most significant of these optimizations is operation fusion.[6] This architectural change effectively transforms cuDNN from a library of fast math routines into a domain-specific graph compiler for deep learning.
API Layers
cuDNN provides two API entry points at different abstraction levels, catering to different use cases and programming environments:[2]
Frontend API (C++/Python)
The Frontend API is the recommended entry point for most users.[1] It is an open-source, header-only C++ library that provides a more concise and user-friendly abstraction over the powerful backend.[23] It also includes Python bindings (available via `nvidia-cudnn-frontend` package), making it directly accessible from popular frameworks.[23] The frontend adds convenience features on top of the backend, such as helpers for autotuning and filters for known hardware or software errata, simplifying the development process.[23]
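The schematic sketch below illustrates the declarative graph-building style of the frontend Python bindings. The call names follow the samples in the NVIDIA/cudnn-frontend repository, but exact signatures vary between releases, so this should be read as an approximation rather than an API reference.

```python
# Approximate sketch of the cuDNN frontend Python bindings (pip package
# nvidia-cudnn-frontend); names follow the repository samples and may differ
# between releases.
import cudnn
import torch

x = torch.randn(8, 64, 56, 56, device="cuda", dtype=torch.float16)
w = torch.randn(128, 64, 3, 3, device="cuda", dtype=torch.float16)

graph = cudnn.pygraph(io_data_type=cudnn.data_type.HALF,
                      compute_data_type=cudnn.data_type.FLOAT)

# Tensors become edges of the DAG; conv_fprop and relu become its nodes.
X = graph.tensor_like(x)
W = graph.tensor_like(w)
conv_out = graph.conv_fprop(X, W, padding=[1, 1], stride=[1, 1], dilation=[1, 1])
Y = graph.relu(conv_out)            # candidate for fusion with the convolution
Y.set_output(True)

graph.build([cudnn.heur_mode.A])    # heuristics choose an execution plan

y = torch.empty(8, 128, 56, 56, device="cuda", dtype=torch.float16)
workspace = torch.empty(graph.get_workspace_size(), device="cuda", dtype=torch.uint8)
graph.execute({X: x, W: w, Y: y}, workspace)
```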
Backend API (C)
The Backend API is the lower-level, closed-source C interface to the cuDNN engine.[6] It exposes the full capabilities of the graph API and is intended for legacy use cases, integration into environments where C++ or Python are not suitable, or for developers who require maximum control.[1][22] The open-source Frontend API also serves as a valuable reference implementation for developers working directly with the C backend.[23]
Key Features and Optimization Techniques
Operation Fusion
Operation fusion is the process of combining a sequence of distinct neural network operations, such as a convolution followed by a bias addition and a ReLU activation, into a single, monolithic GPU kernel.[6] The primary benefit of this technique is the significant reduction in memory bandwidth requirements. Without fusion, the intermediate result of each operation must be written to the GPU's main memory (global memory) and then read back by the next operation. This round-trip to global memory is a major performance bottleneck.
By fusing operations, intermediate data can be kept in much faster on-chip memory, such as registers or shared memory, throughout the fused sequence.[6] cuDNN can generate kernels for common fusion patterns at runtime or use specialized, pre-written kernels for high-value patterns like fused attention. Common fusion patterns include:
- ConvBiasAct: Convolution + bias + activation
- BnAddRelu: Batch normalization + residual add + ReLU
- Generic fusion patterns with DAG support
The runtime fusion engine with JIT kernel generation can provide up to 2.5× speedup for common patterns.[1]
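A back-of-envelope illustration (with assumed tensor sizes, counting only the traffic for the intermediate activation) of why fusion reduces memory-bandwidth pressure:

```python
# Back-of-envelope estimate (assumed sizes; counts only traffic for the
# intermediate/output activation, not inputs or weights) of the global-memory
# traffic saved by fusing conv + bias + ReLU into one kernel.
n, c, h, w = 32, 256, 56, 56
bytes_per_elem = 2                              # FP16
activation_bytes = n * c * h * w * bytes_per_elem

# Three separate kernels: conv writes the activation, bias-add reads and
# rewrites it, ReLU reads and rewrites it -> 5 tensor-sized transfers.
unfused_traffic = activation_bytes * 5
# Fused ConvBiasAct kernel: the activation is written to global memory once.
fused_traffic = activation_bytes * 1

print(f"unfused ~ {unfused_traffic / 1e6:.0f} MB, fused ~ {fused_traffic / 1e6:.0f} MB")
```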
Heuristics and Automatic Tuning
For many deep learning primitives, especially convolution, multiple algorithms exist (e.g., GEMM-based, FFT-based, Winograd-based).[18][5] The optimal choice depends on numerous factors, including the tensor dimensions, data types, filter sizes, and the specific GPU architecture. cuDNN incorporates a sophisticated heuristics engine that analyzes these parameters and automatically selects the predicted best-performing algorithm for a given workload.[1][6] This eliminates the need for developers to perform tedious manual benchmarking.[19]
For users seeking the absolute best performance, cuDNN also offers an autotuning feature. When enabled, the library can empirically benchmark a small set of promising algorithms on the target hardware at runtime and select the fastest one for subsequent executions.[24] This combination of heuristics and autotuning functions as a form of JIT compilation, creating a highly optimized execution plan tailored to the specific model and hardware.
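At the framework level, this autotuning is typically exposed as a single switch; for example, PyTorch's cuDNN benchmark mode (a minimal sketch):

```python
# Sketch: enabling cuDNN autotuning from PyTorch. Benchmark mode times
# candidate algorithms for each new input shape and caches the fastest one,
# which pays off when shapes stay fixed across iterations.
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True         # let cuDNN autotune algorithms
# torch.backends.cudnn.deterministic = True   # alternative: reproducibility over speed

model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3).cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")
for _ in range(3):                            # the first iteration pays the tuning cost
    y = model(x)
```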
Mixed-Precision and Tensor Core Acceleration
cuDNN is designed to fully exploit specialized hardware units within NVIDIA GPUs known as Tensor Cores.[1] Tensor Cores provide dramatic speedups for matrix multiplication and convolution operations, which are the computational backbone of deep learning. They achieve maximum throughput using lower-precision numerical formats.
cuDNN supports multiple precision modes:[2]
- FP64: Double precision
- FP32: Single precision (standard floating point)
- TF32: TensorFloat-32 (Ampere architecture and later)
- FP16: Half precision
- BF16: Brain floating point
- FP8: 8-bit floating point (Hopper and Blackwell architectures)
- INT8: 8-bit integer
- Complex FP32/FP64: Complex datatypes (v9.14.0+)
cuDNN provides optimized kernels that utilize Tensor Cores for mixed-precision data types, enabling significantly faster model training and inference while also reducing the memory footprint.[19] The library internally chooses appropriate kernels based on hardware capabilities and user-specified math precision. This allows training in mixed precision (for example, using FP16/BF16 for computations with FP32 accumulation) to improve performance while maintaining accuracy.
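A typical mixed-precision training step as seen from PyTorch, which lets the backend (including cuDNN) pick Tensor Core kernels with FP16 compute and FP32 accumulation (a minimal sketch, assuming a CUDA-capable GPU):

```python
# Sketch: a mixed-precision training step with PyTorch autocast and gradient
# scaling. The backend (including cuDNN) selects Tensor Core kernels that
# compute in FP16 and accumulate in FP32.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Flatten(),
                      nn.Linear(64 * 32 * 32, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()           # loss scaling for FP16 gradients

x = torch.randn(16, 3, 32, 32, device="cuda")
target = torch.randint(0, 10, (16,), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = nn.functional.cross_entropy(model(x), target)

scaler.scale(loss).backward()                  # scaled FP16 gradients
scaler.step(optimizer)                         # unscale, then FP32 weight update
scaler.update()
```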
Memory Management
Efficient memory management is critical for handling large models and datasets. Many cuDNN operations require temporary storage buffers, known as "workspace" memory, for their intermediate calculations.[25] The library provides APIs that allow applications to query the required workspace size ahead of time. This enables developers to manage a memory pool efficiently, pre-allocating a single large buffer and sub-allocating from it, which avoids the high overhead of repeated `cudaMalloc` calls.[25]
Multi-GPU Scaling with NCCL
While cuDNN's API is context-based and facilitates multi-threaded applications where each thread controls a separate GPU, the library itself is focused on optimizing computations on a single GPU.[26] For scaling deep learning tasks across multiple GPUs or multiple nodes, cuDNN works in concert with the NVIDIA Collective Communications Library (NCCL).[1][27]
In a typical data-parallel training scenario, the workflow is as follows:
- Each GPU uses cuDNN to independently compute the forward and backward passes for its assigned mini-batch of data.
- After gradients are computed on each GPU, NCCL is used to perform a highly efficient All-Reduce collective operation. This operation sums the gradients from all GPUs and distributes the result back to every GPU.
- Each GPU then uses the averaged gradients to update its local copy of the model weights.
This modular design, which separates intra-GPU computation (cuDNN) from inter-GPU communication (NCCL), is a key strength of the NVIDIA platform. It allows each library to be optimized independently—cuDNN for new on-chip compute features and NCCL for new interconnect technologies like NVLink.[27]
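A hedged PyTorch sketch of this data-parallel pattern, with cuDNN executing each GPU's forward and backward passes and the NCCL backend all-reducing gradients (assumes a multi-GPU node and launch via torchrun):

```python
# Sketch of the data-parallel pattern above with PyTorch DistributedDataParallel:
# cuDNN executes each GPU's forward/backward pass, NCCL all-reduces gradients.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")                # NCCL for inter-GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(nn.Conv2d(3, 64, 3, padding=1).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 3, 224, 224, device="cuda")         # this rank's mini-batch
loss = model(x).square().mean()                        # per-GPU compute (cuDNN)
loss.backward()                                        # gradients all-reduced (NCCL)
optimizer.step()                                       # identical update on every rank
dist.destroy_process_group()
```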
Integration with Deep Learning Frameworks
The vast majority of developers interact with cuDNN indirectly through high-level deep learning frameworks.[19][4] cuDNN serves as the default, high-performance execution engine for nearly all major frameworks when running on NVIDIA GPUs.[28]
This integration is designed to be seamless. When a framework like PyTorch or TensorFlow is installed with GPU support, it automatically detects and utilizes the installed cuDNN library.[29] As a result, developers can write high-level code (e.g., defining a convolutional layer in Python) and the framework's backend will automatically translate that operation into a call to the corresponding optimized cuDNN primitive, unlocking GPU acceleration without requiring any low-level programming.[3][4]
| Framework | Primary Maintainer | Integration Status |
|---|---|---|
| PyTorch | Meta AI | Native Integration[1][30] |
| TensorFlow | Google | Native Integration[1][28] |
| JAX | Google | Integrated via XLA[1][19] |
| Apache MXNet | Apache Software Foundation | Native Integration[1][4] |
| Caffe / Caffe2 | Berkeley / Meta AI | Native Integration[1][4] |
| Chainer | Preferred Networks | Native Integration[1] |
| Microsoft Cognitive Toolkit | Microsoft | Native Integration[1][28] |
| PaddlePaddle | Baidu | Native Integration[1][28] |
| Keras | Various | Via TensorFlow/PyTorch backends[1] |
| MATLAB | MathWorks | Deep Learning Toolbox Integration[1] |
PyTorch
PyTorch provides deep cuDNN integration with automatic detection and utilization of the library for operations like Conv2d, LSTM, and BatchNorm2d.[30] PyTorch users can enable benchmark mode (`torch.backends.cudnn.benchmark = True`) for automatic algorithm selection, which can provide 10-25% performance improvements.[30]
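For example, the cuDNN build that PyTorch detected, and the per-process toggles it exposes, can be inspected as follows (a minimal sketch):

```python
# Sketch: inspecting the cuDNN build PyTorch detected and its per-process toggles.
import torch

print(torch.backends.cudnn.is_available())   # cuDNN usable on this system?
print(torch.backends.cudnn.version())        # e.g. an integer such as 90100 for 9.1.0
torch.backends.cudnn.enabled = True          # default; set False to bypass cuDNN
```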
TensorFlow
TensorFlow automatically detects and uses cuDNN when available.[31] Recent versions like TensorFlow 2.18.0 support CUDA 12.3 with cuDNN 8.9, providing 40-50% speedup for LSTM/GRU operations with automatic Tensor Core utilization.[31]
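A minimal sketch of checking the CUDA/cuDNN versions a TensorFlow build was compiled against and whether a GPU is visible (the build-info keys assume a GPU-enabled TensorFlow 2.x wheel):

```python
# Sketch: checking the CUDA/cuDNN versions a TensorFlow build was compiled
# against and whether a GPU is visible at runtime.
import tensorflow as tf

info = tf.sysconfig.get_build_info()
print(info.get("cuda_version"), info.get("cudnn_version"))
print(tf.config.list_physical_devices("GPU"))
```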
JAX
JAX leverages cuDNN through XLA compiler integration.[32] The XLA compiler automatically selects cuDNN operations when beneficial, with automatic backend selection between cuDNN, cuBLAS, and custom kernels.
Performance
The performance impact of cuDNN is substantial, often providing speedups of one to two orders of magnitude over CPU-only implementations and significant acceleration compared to unoptimized GPU code.[33]
Benchmarks
cuDNN demonstrates significant performance advantages across various workloads:
- CPU vs GPU: 10-100× speedup for deep learning workloads
- Flash Attention: Up to 75% speedup over FlashAttention v2 implementations on H100
- FP8 precision: Up to 1.2 PFLOPS on H200 for SDPA operations
- Transformer models: 1.15× speedup for Llama2 70B with FP8 on H200[14]
| GPU | Model | Precision | Performance Metric | Reference |
|---|---|---|---|---|
| NVIDIA A100 | ResNet-50 | FP32 | 7,850 images/sec | [34] |
| NVIDIA A100 | ResNet-50 | FP16 | 20,500 images/sec | [34] |
| NVIDIA H100 (SXM5) | Aggregate Models | Mixed | ~5.7× vs V100 | [35] |
| NVIDIA RTX 4090 | Aggregate Models | Mixed | ~2.1× vs V100 | [35] |
MLPerf Results
The NVIDIA AI platform, with cuDNN as a core software component, consistently sets performance records in the industry-standard MLPerf benchmarks for both training and inference across a wide variety of AI workloads.[36] NVIDIA platforms using cuDNN lead MLPerf benchmarks with results including:
- ResNet-50 training: 0.4 minutes at scale
- BERT training: 0.32 minutes at scale
- 3D U-Net: 9.7× speedup with 800 GPUs
- Only platform submitting across all benchmark categories[36]
Hardware Requirements and Compatibility
GPU Architecture Support
cuDNN version requirements vary by GPU architecture:[37]
| Architecture | Compute Capability | Example GPUs | cuDNN Support |
|---|---|---|---|
| Kepler | 3.0, 3.5, 3.7 | Tesla K80, GTX 780 Ti | Supported in cuDNN 7.x and earlier |
| Maxwell | 5.0, 5.2 | GTX 980, GTX 750 Ti | Supported in cuDNN 8.x (CUDA 11.x branch) |
| Pascal | 6.0, 6.1 | Tesla P100, GTX 1080 | Supported in cuDNN 8.x and 9.x (CUDA 11.x branch) |
| Volta | 7.0 | Tesla V100, Titan V | Full support in cuDNN 8.x and 9.x + Tensor Cores |
| Turing | 7.5 | RTX 2080, Tesla T4 | Full support in cuDNN 8.x and 9.x + Tensor Cores |
| Ampere | 8.0, 8.6 | A100, RTX 3090 | Full support + Enhanced Tensor Cores |
| Ada Lovelace | 8.9 | RTX 4090, L4, L40 | Full support (requires cuDNN 8.9+) |
| Hopper | 9.0 | H100, H200 | Full support + FP8 (requires cuDNN 8.9+) |
| Blackwell | 10.0 | B100, B200 | Full support + Enhanced FP8 (requires cuDNN 9.7+) |
Note: cuDNN 9.x with CUDA 12.x requires Turing architecture or later (compute capability 7.5+). For Maxwell and Pascal support, use cuDNN 8.x or 9.x with CUDA 11.x.[37]
CUDA Toolkit Requirements
Each version of cuDNN is built against and requires specific version(s) of the CUDA Toolkit:[37]
- cuDNN 9.x (modern branch):
- CUDA 12.6+ for Volta and later architectures
- CUDA 11.8 for Pascal and Maxwell architectures
- cuDNN 8.x: CUDA 11.0 through CUDA 12.x
- cuDNN 7.x: CUDA 8.0 through CUDA 10.2 (legacy)
Operating System Support
Supported operating systems include:[37]
- Various distributions of Linux (such as Ubuntu, RHEL, and CentOS)
- 64-bit versions of Microsoft Windows
- Android for embedded platforms (Jetson)
Installation
cuDNN can be installed through various methods, each suited to different workflows and requirements.[38]
Package Managers
Recommended installation methods include:
Conda:
conda install nvidia::cudnn cuda-version=12
pip:
pip install nvidia-cudnn-cu12 # For CUDA 12.x
pip install nvidia-cudnn-cu11 # For CUDA 11.x
Docker:
docker pull nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
Manual Installation
For Linux systems requiring manual installation:[38]
- Download from NVIDIA Developer portal (requires NVIDIA Developer Program registration)
- Extract archive: `tar -xvf cudnn-*.tar.xz`
- Copy files to CUDA directory
- Set appropriate permissions and update `LD_LIBRARY_PATH` environment variable
Users must join the NVIDIA Developer Program to download cuDNN packages.[38]
Applications
cuDNN powers numerous AI applications across industries:[39]
- Computer Vision: Image classification, object detection, image segmentation
- Natural Language Processing: Large language models, machine translation, sentiment analysis
- Autonomous vehicles: Perception pipelines for self-driving cars
- Healthcare: Medical imaging analysis, drug discovery
- Recommender systems: Deep learning-based recommendation engines
- Speech Recognition: Automatic speech recognition, voice assistants
- Generative AI: Text-to-image models, text generation, video synthesis
Comparison with Alternatives
While cuDNN is the de facto standard for deep learning acceleration on NVIDIA hardware, other hardware vendors provide their own libraries for their respective ecosystems.
AMD MIOpen
MIOpen is an open-source deep learning primitives library developed by AMD for its ROCm GPGPU platform. It is designed to be the AMD equivalent of cuDNN.[40] While MIOpen provides implementations for many core primitives like convolutions and pooling, it has historically lagged cuDNN in feature completeness and performance.[40][34] Benchmarks comparing high-end NVIDIA GPUs with cuDNN against high-end AMD GPUs with MIOpen typically show a significant performance advantage for the NVIDIA ecosystem, particularly on models like ResNet-50 and Transformer-based workloads.[34]
Intel oneDNN
Intel's oneAPI Deep Neural Network Library (oneDNN), formerly known as MKL-DNN and DNNL, is an open-source performance library for accelerating deep learning applications on Intel architectures, including CPUs and GPUs.[41] It is a core component of Intel's oneAPI initiative. While oneDNN provides excellent performance on Intel hardware, its primary focus differs from cuDNN's. On NVIDIA hardware, oneDNN has experimental support that often works by calling cuDNN as a backend rather than acting as a direct competitor.[42] Published comparisons indicate that for GPU-accelerated deep learning workloads, the combination of NVIDIA hardware and cuDNN significantly outperforms oneDNN running on Intel CPUs or GPUs.[43]
The competitive landscape highlights that cuDNN's strength comes not just from the library itself, but from its position within a deeply integrated, vertically co-designed ecosystem that spans from silicon architecture (e.g., Tensor Cores) and drivers to the CUDA platform and finally to the high-level frameworks. This tight integration creates a powerful performance advantage that is difficult for competitors to replicate.
| Library | Primary Developer | Primary Hardware Target | Open Source? | Key Differentiator |
|---|---|---|---|---|
| cuDNN | NVIDIA | NVIDIA GPUs | No (Binary distribution) | Deep integration with CUDA ecosystem, Tensor Cores, and hardware co-design |
| MIOpen | AMD | AMD GPUs | Yes | Core component of the open-source AMD ROCm ecosystem |
| oneDNN | Intel / UXL Foundation | Intel CPUs & GPUs | Yes | Optimized for Intel Architecture; part of the cross-platform oneAPI standard |
Licensing
cuDNN is distributed under a proprietary NVIDIA SDK License:[44]
- Free for development, research, and commercial use
- Runtime redistribution allowed with applications
- Requires NVIDIA Developer Program membership for download
- Restrictions on reverse engineering and use in critical applications (e.g., medical devices, nuclear facilities)
cuDNN is distributed free of charge to registered developers as part of NVIDIA's software development kit, but it is proprietary software and its use is governed by NVIDIA's license agreement.[45]
See Also
- CUDA
- Tensor Core
- Deep learning
- Graphics processing unit
- PyTorch
- TensorFlow
- NVIDIA
- MLPerf
- TensorRT
- NVIDIA Collective Communications Library (NCCL)
- Convolutional neural network
- Transformer (machine learning model)
References
- ↑ 1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.20 1.21 "NVIDIA cuDNN". NVIDIA Developer. https://developer.nvidia.com/cudnn.
- ↑ 2.0 2.1 2.2 2.3 "NVIDIA cuDNN Documentation". NVIDIA. https://docs.nvidia.com/cudnn/index.html.
- ↑ 3.0 3.1 "What is cuDNN?". Roboflow. https://blog.roboflow.com/what-is-cudnn/.
- ↑ 4.0 4.1 4.2 4.3 4.4 4.5 "CUDA Deep Neural Network (cuDNN)". GeeksforGeeks. https://www.geeksforgeeks.org/deep-learning/cuda-deep-neural-network-cudnn/.
- ↑ 5.0 5.1 5.2 Chetlur, Sharan; Woolley, Cliff; Vandermersch, Philippe; Cohen, Jonathan; Tran, John; Catanzaro, Bryan; Shelhamer, Evan (2014). "cuDNN: Efficient Primitives for Deep Learning". arXiv:1410.0759.
- ↑ 6.0 6.1 6.2 6.3 6.4 6.5 "What is cuDNN?". Modal Labs. https://modal.com/gpu-glossary/host-software/cudnn.
- ↑ 7.0 7.1 7.2 7.3 Jérôme Serrano (2014-09-29). "Nvidia Introduces cuDNN, a CUDA-based library for Deep Neural Networks". InfoQ. https://www.infoq.com/news/2014/09/cudnn/.
- ↑ "NVIDIA Doubles Performance for Deep Learning Training". NVIDIA News. https://nvidianews.nvidia.com/news/nvidia-doubles-performance-for-deep-learning-training.
- ↑ 9.0 9.1 9.2 Jeremy Appleyard (2016-04-06). "Optimizing Recurrent Neural Networks in cuDNN 5". NVIDIA Technical Blog. https://developer.nvidia.com/blog/optimizing-recurrent-neural-networks-cudnn-5/.
- ↑ "Tensor Ops Made Easier in cuDNN". NVIDIA Developer Blog. https://developer.nvidia.com/blog/tensor-ops-made-easier-in-cudnn/.
- ↑ 11.0 11.1 11.2 "cuDNN Release 8.0.3 – Highlights". NVIDIA via sdpaninf blog. 2020-09-27. https://sdpaninf.hatenablog.com/entry/2020/09/27/215443.
- ↑ 12.0 12.1 Bill CX. "cuDNN v8 (2020.4.8 GTC)". Medium. https://medium.com/@billchenxi/cudnn-v8-2020-4-8-gtc-5a86365d33c3.
- ↑ "NVIDIA cuDNN Documentation v8.9.0". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-890/index.html.
- ↑ 14.0 14.1 14.2 14.3 Matthew Nicely (2024-05-24). "Accelerating Transformers with NVIDIA cuDNN 9". NVIDIA Technical Blog. https://developer.nvidia.com/blog/accelerating-transformers-with-nvidia-cudnn-9/.
- ↑ 15.0 15.1 "cuDNN 9.1.1 Release Notes". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/backend/v9.1.1/release-notes.html.
- ↑ "cuDNN Release Notes". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html.
- ↑ "NVIDIA cuDNN Developer Guide v8.7.0". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-870/pdf/cuDNN-Developer-Guide.pdf.
- ↑ 18.0 18.1 18.2 "Leveraging NVIDIA CUDA & cuDNN for Accelerated Computer Vision Inference". Rapid Innovation. https://www.rapidinnovation.io/post/leveraging-nvidia-cuda-cudnn-accelerated-computer-vision-inference.
- ↑ 19.0 19.1 19.2 19.3 19.4 "What is cuDNN and Install Guide". AceCloud. https://acecloud.ai/blog/what-is-cudnn-and-install-guide/.
- ↑ "cuDNN Frontend Releases". NVIDIA via GitHub. https://github.com/NVIDIA/cudnn-frontend/releases.
- ↑ "cuDNN 9.3.0 Release Notes". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/backend/v9.3.0/release-notes.html.
- ↑ 22.0 22.1 "NVIDIA cuDNN Backend Documentation". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/backend/v9.11.1/index.html.
- ↑ 23.0 23.1 23.2 23.3 "NVIDIA/cudnn-frontend". GitHub. https://github.com/NVIDIA/cudnn-frontend.
- ↑ "cuDNN Frontend Developer Guide". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/frontend/latest/developer/overview.html.
- ↑ 25.0 25.1 "How does cuDNN handle memory management in deep learning models?". Massed Compute. https://massedcompute.com/faq-answers/?question=How%20does%20cuDNN%20handle%20memory%20management%20in%20deep%20learning%20models.
- ↑ "Can cuDNN handle multi-GPU communication in a cloud-based environment?". Massed Compute. https://massedcompute.com/faq-answers/?question=Can%20cuDNN%20handle%20multi-GPU%20communication%20in%20a%20cloud-based%20environment.
- ↑ 27.0 27.1 "What are the key differences between NCCL and cuDNN for distributed deep learning?". Massed Compute. https://massedcompute.com/faq-answers/?question=What%20are%20the%20key%20differences%20between%20NCCL%20and%20cuDNN%20for%20distributed%20deep%20learning.
- ↑ 28.0 28.1 28.2 28.3 "NVIDIA Deep Learning Frameworks". NVIDIA Developer. https://developer.nvidia.com/deep-learning-frameworks.
- ↑ "Enhancing Deep Learning Efficiency with CUDA and cuDNN". GOML. https://www.goml.io/blog/enhancing-deep-learning-efficiency-with-cuda-and-cudnn-a-comprehensive-guide-for-pytorch-and-tensorflow.
- ↑ 30.0 30.1 30.2 "PyTorch CUDA Semantics". PyTorch. https://pytorch.org/docs/stable/notes/cuda.html.
- ↑ 31.0 31.1 "GPU support - TensorFlow". TensorFlow. https://www.tensorflow.org/install/gpu.
- ↑ "JAX GPU Support". Google JAX. https://github.com/google/jax#installation.
- ↑ "CUDA vs. cuDNN: The Dynamic Duo That Powers Your AI Dreams". Towards AI. https://towardsai.net/p/l/cuda-vs-cudnn-the-dynamic-duo-that-powers-your-ai-dreams.
- ↑ 34.0 34.1 34.2 34.3 "CuDNN vs Other Deep Learning Libraries: Which One Should You Choose for Optimal Performance?". MoldStud. https://moldstud.com/articles/p-cudnn-vs-other-deep-learning-libraries-which-one-should-you-choose-for-optimal-performance.
- ↑ 35.0 35.1 "GPU Benchmarks for Deep Learning". Lambda. https://lambda.ai/gpu-benchmarks.
- ↑ 36.0 36.1 "NVIDIA AI Platform Sets Records in MLPerf Benchmarks". NVIDIA. https://www.nvidia.com/en-us/data-center/resources/mlperf-benchmarks/.
- ↑ 37.0 37.1 37.2 37.3 "cuDNN Support Matrix". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html.
- ↑ 38.0 38.1 38.2 "cuDNN Installation Guide". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html.
- ↑ "Deep Learning Applications". NVIDIA. https://developer.nvidia.com/deep-learning.
- ↑ 40.0 40.1 "MIOpen Porting Guide". AMD. https://rocm.docs.amd.com/projects/MIOpen/en/docs-6.4.1/conceptual/porting-guide.html.
- ↑ "oneAPI Deep Neural Network Library (oneDNN)". GitHub. https://github.com/uxlfoundation/oneDNN.
- ↑ "Intel's oneDNN 2.1 Released With NVIDIA GPU Support". Phoronix. https://www.phoronix.com/news/Intel-oneDNN-2.1-Released.
- ↑ "What are the differences between cuDNN and other deep learning acceleration libraries?". Massed Compute. https://massedcompute.com/faq-answers/?question=What%20are%20the%20differences%20between%20cuDNN%20and%20other%20deep%20learning%20acceleration%20libraries.
- ↑ "NVIDIA cuDNN License". NVIDIA. https://docs.nvidia.com/deeplearning/cudnn/latest/reference/eula.html.
- ↑ "cudnn 9.14.0 – NVIDIA CUDA Deep Neural Network library". Arch Linux Package Repository. https://archlinux.org/packages/extra/x86_64/cudnn/.
