NVIDIA cuDNN (CUDA Deep Neural Network library) is a proprietary GPU-accelerated software library of primitives for deep learning developed by NVIDIA. It provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.[1][2]
Built on top of the CUDA parallel computing platform, cuDNN is not a standalone deep learning framework but rather a foundational performance library.[3] It serves as a critical abstraction layer that allows high-level frameworks like PyTorch and TensorFlow to leverage the computational power of NVIDIA GPUs without requiring framework developers to write low-level, hardware-specific CUDA code.[4]
The library's existence represents a strategic separation of concerns within the AI ecosystem. By centralizing the complex and time-consuming task of optimizing deep learning kernels for each new GPU architecture, NVIDIA enables framework developers to concentrate on API design, automatic differentiation, and scientific innovation.[5] This approach significantly accelerates the development cycle of the entire AI community on NVIDIA hardware. Furthermore, cuDNN provides a layer of performance portability across GPU generations. As NVIDIA releases new hardware, updated versions of cuDNN incorporate optimized kernels that exploit the new architectural features. Applications and frameworks built against the cuDNN API can often achieve substantial performance gains on new hardware simply by updating the library, without requiring changes to their own source code.[6]
NVIDIA released cuDNN in September 2014, amid the rise of deep learning research following breakthroughs in ImageNet competitions where GPU-accelerated convolutional neural networks achieved dramatic improvements in accuracy.[7] NVIDIA introduced it as a set of low-level GPU primitives to boost the performance of deep neural networks on CUDA-compatible GPUs. Although cuDNN can be used directly via its C/C++ API, NVIDIA anticipated it would mostly be used indirectly through higher-level machine learning frameworks that incorporate it.[7]
From early on, frameworks such as Caffe and Torch integrated cuDNN as a backend, allowing developers to get GPU acceleration with minimal code changes. NVIDIA reported up to a 10× speedup in training throughput when using Caffe with cuDNN compared to Caffe's CPU-only mode, demonstrating the significant performance gains of GPU-accelerated deep learning.[7] The initial cuDNN 1.0 release focused on primitives for convolutional neural networks, including convolution, pooling, softmax, and neuron activations such as sigmoid, ReLU, and tanh.[7]
Key milestones in early development included:
cuDNN 5 (2016) added optimized routines for recurrent neural networks, including LSTM and GRU layers.[9]
cuDNN 6.0 (April 2017) focused on performance tuning and robustness, with improved support for dilated (atrous) convolutions.
cuDNN 7.x (2017–2019) marked a major milestone by introducing support for Tensor Cores on Volta architecture GPUs, allowing FP16 computation with FP32 accumulation to leverage Volta's Tensor Core units for significant speedups.[10] Throughout the 7.x releases, NVIDIA added support for new network layers and optimized existing ones, including improved batch normalization, and expanded device support to NVIDIA's embedded Jetson platforms via JetPack.[1]
cuDNN 8.0 (June 2020) represented a major redesign of cuDNN coinciding with the NVIDIA Ampere architecture launch. cuDNN 8 was optimized for NVIDIA A100 GPUs, with NVIDIA reporting up to 5× higher performance on A100 versus V100 out of the box, thanks to new optimizations and use of hardware features like TensorFloat-32 (TF32).[11]
The flagship feature was the introduction of the declarative Graph API and a runtime fusion engine.[12] This allowed users to express complex, multi-operation computations that cuDNN could then analyze and optimize holistically. The API was overhauled: v8 introduced a new low-level backend API for more flexibility and performance tuning, while providing a compatibility layer for the previous v7 API to ease transition.[11] New capabilities included improved support for conversational AI, computer vision networks, and the ability to fuse multiple operations through the new graph API. Additionally, cuDNN 8 was modularized into smaller component libraries, so applications could include only the needed portions, making integration more lightweight.[11]
Subsequent cuDNN 8.x releases (2020–2023) continuously improved performance and added features. These included support for the NVIDIA Hopper architecture (H100 GPUs), expanded graph API functionalities, and initial support for new data types such as FP8 in late 8.x versions for Hopper. For example, cuDNN 8.9 introduced fused flash attention for training and inference.[13]
cuDNN 9.0 (February 2024) brought the first major version jump in four years, with a primary focus on accelerating Transformer-based models for the era of generative AI and large language models.[14] This version introduced extensive enhancements for Scaled Dot-Product Attention (SDPA), including highly optimized kernels inspired by FlashAttention and robust support for the FP8 data type on Hopper and Blackwell architecture GPUs, offering up to 2× faster throughput in BF16 and up to 3× in FP8 for attention operations compared to earlier implementations.[14]
Subsequent 9.x releases have continued to refine the capabilities introduced in cuDNN 9. As of October 2025, cuDNN 9.14.0 includes automatic runtime configuration, complex data type support for matrix multiplication, enhanced Blackwell architecture optimizations, and 5–10% SDPA performance improvements on Blackwell GPUs.[16]
cuDNN provides a comprehensive suite of optimized primitives that form the building blocks of modern deep neural networks. The set of accelerated routines has evolved over time, reflecting the major research trends and computational demands of the deep learning field.[1]
As the cornerstone of Convolutional Neural Networks (CNNs), convolution was one of the original and most critical functions of cuDNN. The library offers highly optimized implementations for forward (inference) and backward (training, for both data and filter gradients) passes of 2D and 3D convolutions.[4][17] It supports essential features like striding, padding, dilation, and grouped convolutions, along with flexible tensor data layouts such as NCHW and NHWC to minimize data transposition overhead.[18]
For convolution operations, cuDNN provides multiple algorithmic implementations including GEMM-based, FFT-based, and Winograd-based methods.[5] The library uses heuristics to automatically select the optimal algorithm for a given input size and GPU architecture.
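For illustration, the following minimal C++ sketch runs a single forward convolution through the legacy cuDNN C API. The tensor sizes, the fixed algorithm choice, and the `CHECK_CUDNN` error-checking macro are assumptions made for the example; real code would pick the algorithm via the heuristics or autotuning described later.

```cpp
// Minimal sketch: one forward convolution through the legacy cuDNN C API.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDNN(call)                                    \
    do {                                                     \
        cudnnStatus_t s = (call);                            \
        if (s != CUDNN_STATUS_SUCCESS) {                     \
            fprintf(stderr, "%s\n", cudnnGetErrorString(s)); \
            exit(1);                                         \
        }                                                    \
    } while (0)

int main() {
    cudnnHandle_t handle;
    CHECK_CUDNN(cudnnCreate(&handle));

    // Input: N=1, C=3, H=224, W=224 in NCHW layout; filter: K=64, 3x3.
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    CHECK_CUDNN(cudnnCreateTensorDescriptor(&xDesc));
    CHECK_CUDNN(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW,
                                           CUDNN_DATA_FLOAT, 1, 3, 224, 224));
    CHECK_CUDNN(cudnnCreateFilterDescriptor(&wDesc));
    CHECK_CUDNN(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT,
                                           CUDNN_TENSOR_NCHW, 64, 3, 3, 3));
    CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&convDesc));
    // Padding 1, stride 1, dilation 1 in both dimensions.
    CHECK_CUDNN(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                                CUDNN_CROSS_CORRELATION,
                                                CUDNN_DATA_FLOAT));

    // Let cuDNN compute the output shape, then describe the output tensor.
    int n, c, h, w;
    CHECK_CUDNN(cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                                      &n, &c, &h, &w));
    CHECK_CUDNN(cudnnCreateTensorDescriptor(&yDesc));
    CHECK_CUDNN(cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW,
                                           CUDNN_DATA_FLOAT, n, c, h, w));

    // Query, then allocate, the scratch ("workspace") memory the algorithm needs.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;
    size_t wsBytes = 0;
    CHECK_CUDNN(cudnnGetConvolutionForwardWorkspaceSize(
        handle, xDesc, wDesc, convDesc, yDesc, algo, &wsBytes));

    float *x, *wts, *y;
    void *ws;
    cudaMalloc(&x, 1 * 3 * 224 * 224 * sizeof(float));
    cudaMalloc(&wts, 64 * 3 * 3 * 3 * sizeof(float));
    cudaMalloc(&y, (size_t)n * c * h * w * sizeof(float));
    cudaMalloc(&ws, wsBytes);

    const float alpha = 1.0f, beta = 0.0f; // y = alpha*conv(x,w) + beta*y
    CHECK_CUDNN(cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, wts,
                                        convDesc, algo, ws, wsBytes,
                                        &beta, yDesc, y));
    // ... destroy descriptors and free device memory in real code ...
    cudnnDestroy(handle);
    return 0;
}
```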
With the rise of the Transformer architecture, attention has become a primary focus of optimization in recent cuDNN versions. The library includes state-of-the-art implementations of Scaled Dot-Product Attention (SDPA), incorporating techniques from algorithms like FlashAttention to reduce memory consumption and accelerate sequence processing.[19][14] This support extends to a variety of attention use cases in both training and inference.
These features are vital for training and inferencing large language models.[20]
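At the core of these kernels is the standard scaled dot-product attention operation from the Transformer literature, shown here for reference:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. Fused implementations in the style of FlashAttention compute this expression without materializing the full QKᵀ matrix in global memory, which is the source of the memory savings noted above.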
General matrix multiplication (GEMM) is fundamental to fully connected (dense) layers and to numerous components within Transformer models, and cuDNN provides highly optimized kernels for it.[1] In addition to neural-network-specific layers, cuDNN provides fundamental tensor operations such as matrix multiplication, tensor transforms (reordering data layouts), and reductions optimized for GPUs. Many of these leverage NVIDIA's other libraries (e.g., cuBLAS/cuBLASLt) or direct CUDA kernels and are tuned for deep learning workloads.
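As a small illustration of these utility routines, the sketch below uses the legacy `cudnnOpTensor` call to compute an element-wise C = A + B. It assumes the handle, the `CHECK_CUDNN` helper, tensor descriptors `aDesc`/`bDesc`/`cDesc`, and device buffers `a`/`b`/`c` are already set up in the same way as in the convolution example above.

```cpp
// Element-wise tensor addition via the legacy cuDNN "op tensor" routine.
cudnnOpTensorDescriptor_t opDesc;
CHECK_CUDNN(cudnnCreateOpTensorDescriptor(&opDesc));
CHECK_CUDNN(cudnnSetOpTensorDescriptor(opDesc, CUDNN_OP_TENSOR_ADD,
                                       CUDNN_DATA_FLOAT, CUDNN_PROPAGATE_NAN));

const float alpha1 = 1.0f, alpha2 = 1.0f, beta = 0.0f;
// Computes C = alpha1*A + alpha2*B (+ beta*C, ignored here since beta = 0).
CHECK_CUDNN(cudnnOpTensor(handle, opDesc, &alpha1, aDesc, a,
                          &alpha2, bDesc, b, &beta, cDesc, c));
CHECK_CUDNN(cudnnDestroyOpTensorDescriptor(opDesc));
```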
The library accelerates standard pooling operations like max pooling and average pooling for 2D and 3D spatial dimensions.[18] It also provides fast implementations for common non-linear activation functions including ReLU, Sigmoid, Tanh, GELU, Swish, and ELU, which can be computed standalone or fused with other operations for efficiency.[1]
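A sketch of these routines through the legacy API follows, reusing the handle, the `CHECK_CUDNN` helper, and the tensor descriptors and device buffers from the convolution example above; the pooling geometry is an arbitrary choice for the example.

```cpp
// Sketch: 2x2 max pooling followed by a ReLU, using the legacy cuDNN API.
cudnnPoolingDescriptor_t poolDesc;
CHECK_CUDNN(cudnnCreatePoolingDescriptor(&poolDesc));
CHECK_CUDNN(cudnnSetPooling2dDescriptor(poolDesc, CUDNN_POOLING_MAX,
                                        CUDNN_PROPAGATE_NAN,
                                        2, 2,   // window height, width
                                        0, 0,   // vertical, horizontal padding
                                        2, 2)); // vertical, horizontal stride

const float alpha = 1.0f, beta = 0.0f;
CHECK_CUDNN(cudnnPoolingForward(handle, poolDesc, &alpha, xDesc, x,
                                &beta, yDesc, y));

// ReLU applied in place on the pooled output.
cudnnActivationDescriptor_t actDesc;
CHECK_CUDNN(cudnnCreateActivationDescriptor(&actDesc));
CHECK_CUDNN(cudnnSetActivationDescriptor(actDesc, CUDNN_ACTIVATION_RELU,
                                         CUDNN_PROPAGATE_NAN, 0.0));
CHECK_CUDNN(cudnnActivationForward(handle, actDesc, &alpha, yDesc, y,
                                   &beta, yDesc, y));
```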
cuDNN offers efficient implementations of normalization techniques crucial for training stability and performance, including batch normalization, layer normalization, instance normalization, and RMS normalization.
These support both the training (forward and backward normalization) and inference phases of deep networks.[21]
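For example, inference-time batch normalization can be invoked through the legacy API as below, assuming the handle, the `CHECK_CUDNN` helper, the `xDesc`/`x` and `yDesc`/`y` descriptor-buffer pairs from the earlier sketches, and per-channel device buffers `scale`, `bias`, `running_mean`, and `running_var` learned during training.

```cpp
// Sketch: batch normalization at inference time with the legacy cuDNN API.
cudnnTensorDescriptor_t bnDesc;
CHECK_CUDNN(cudnnCreateTensorDescriptor(&bnDesc));
// Derives the 1xCx1x1 descriptor used for the per-channel parameters.
CHECK_CUDNN(cudnnDeriveBNTensorDescriptor(bnDesc, xDesc,
                                          CUDNN_BATCHNORM_SPATIAL));

const float alpha = 1.0f, beta = 0.0f;
const double epsilon = 1e-5; // must match the epsilon used during training
CHECK_CUDNN(cudnnBatchNormalizationForwardInference(
    handle, CUDNN_BATCHNORM_SPATIAL, &alpha, &beta,
    xDesc, x, yDesc, y,
    bnDesc, scale, bias, running_mean, running_var, epsilon));
```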
Since version 5, cuDNN has included support for recurrent neural network layers. It implements optimized routines for popular RNN architectures, including LSTM and GRU networks, as well as simple RNNs with ReLU or tanh activations.[9] These optimized RNN kernels dramatically improve performance for sequence-modeling tasks.[22]
Starting with version 8, cuDNN underwent a significant architectural transformation. Modern cuDNN consists of multiple sub-libraries organized by functionality:[2]
| Library | Function | Description |
|---|---|---|
| libcudnn_graph.so | Graph API | Main graph API for declarative operation composition |
| libcudnn_engines_precompiled.so | Engines | Pre-compiled kernel implementations |
| libcudnn_engines_runtime_compiled.so | JIT Engines | Runtime kernel generation for fusion patterns |
| libcudnn_heuristic.so | Heuristics | Automatic algorithm selection |
| libcudnn.so | Legacy API | Backward compatibility shim layer |
| libcudnn_cnn.so | CNN Operations | Convolution and pooling operations |
| libcudnn_ops.so | Tensor Operations | Basic tensor operations |
| libcudnn_adv.so | Advanced Operations | RNN and batch normalization |
Introduced in cuDNN v8, the Graph API allows a developer to define an entire computation, or a segment of it, as a directed acyclic graph (DAG). In this graph, operations (like convolution or activation) are represented as nodes, and tensors are represented as edges connecting them.[1][12] This declarative approach is a fundamental shift from the legacy model of calling individual functions one by one.
By providing the library with a "global view" of the intended computation, the cuDNN runtime can perform sophisticated, graph-level optimizations that are impossible with a myopic, operation-by-operation perspective. The most significant of these optimizations is operation fusion.[6] This architectural change effectively transforms cuDNN from a library of fast math routines into a domain-specific graph compiler for deep learning.
cuDNN provides three API entry points with different abstraction levels, catering to different use cases and programming environments:[2]
The Frontend API is the recommended entry point for most users.[1] It is an open-source, header-only C++ library that provides a more concise and user-friendly abstraction over the powerful backend.[23] It also includes Python bindings (available via `nvidia-cudnn-frontend` package), making it directly accessible from popular frameworks.[23] The frontend adds convenience features on top of the backend, such as helpers for autotuning and filters for known hardware or software errata, simplifying the development process.[23]
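As a rough sketch of the frontend style, building and executing a small graph looks approximately as follows. This is reconstructed from the open-source cudnn-frontend v1.x C++ examples; class and method names have evolved between releases and should be verified against the repository, and dimensions, strides, and buffer names here are illustrative assumptions. Real code would also check the status object each call returns.

```cpp
// Approximate shape of a cudnn-frontend v1.x graph build (names from the
// open-source repository; details may differ between frontend releases).
#include <cudnn_frontend.h>
#include <memory>
#include <unordered_map>
namespace fe = cudnn_frontend;

void build_and_run(cudnnHandle_t handle, void* x_dev, void* w_dev,
                   void* y_dev, void* workspace) {
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::HALF)
         .set_compute_data_type(fe::DataType_t::FLOAT);

    // Declare graph inputs as tensor nodes (dims/strides are illustrative).
    auto X = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("X")
                              .set_dim({8, 64, 56, 56})
                              .set_stride({64 * 56 * 56, 56 * 56, 56, 1}));
    auto W = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("W")
                              .set_dim({32, 64, 3, 3})
                              .set_stride({64 * 3 * 3, 3 * 3, 3, 1}));

    // Add a forward-convolution node; Y is the tensor edge it produces.
    auto conv = fe::graph::Conv_fprop_attributes()
                    .set_padding({1, 1}).set_stride({1, 1}).set_dilation({1, 1});
    auto Y = graph.conv_fprop(X, W, conv);
    Y->set_output(true);

    // Validate, lower via heuristics to engine configs, and build a plan.
    graph.validate();
    graph.build_operation_graph(handle);
    graph.create_execution_plans({fe::HeurMode_t::A});
    graph.check_support(handle);
    graph.build_plans(handle);

    // Bind device pointers to graph tensors and execute.
    std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void*>
        pack = {{X, x_dev}, {W, w_dev}, {Y, y_dev}};
    graph.execute(handle, pack, workspace);
}
```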
The Backend API is the lower-level, closed-source C interface to the cuDNN engine.[6] It exposes the full capabilities of the graph API and is intended for legacy use cases, integration into environments where C++ or Python are not suitable, or for developers who require maximum control.[1][22] The open-source Frontend API also serves as a valuable reference implementation for developers working directly with the C backend.[23]
Operation fusion is the process of combining a sequence of distinct neural network operations, such as a convolution followed by a bias addition and a ReLU activation, into a single, monolithic GPU kernel.[6] The primary benefit of this technique is the significant reduction in memory bandwidth requirements. Without fusion, the intermediate result of each operation must be written to the GPU's main memory (global memory) and then read back by the next operation. This round-trip to global memory is a major performance bottleneck.
By fusing operations, intermediate data can be kept in much faster on-chip memory, such as registers or shared memory, throughout the fused sequence.[6] cuDNN can generate kernels for common fusion patterns at runtime or use specialized, pre-written kernels for high-value patterns like fused attention. Typical fusion patterns combine a convolution or matrix multiplication with pointwise epilogues such as bias addition and an activation function, as in the sketch below.
The runtime fusion engine with JIT kernel generation can provide up to 2.5× speedup for common patterns.[1]
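The legacy API already exposed one such fused pattern directly, which illustrates the idea. The sketch below reuses the handle, descriptors, algorithm, workspace, and `CHECK_CUDNN` helper from the convolution example, and assumes a 1xCx1x1 per-channel bias descriptor `bDesc` with device buffer `bias` set up like the other descriptors; note that not every convolution algorithm supports this fused entry point.

```cpp
// Sketch: convolution + bias + ReLU fused into a single kernel launch.
cudnnActivationDescriptor_t actDesc;
CHECK_CUDNN(cudnnCreateActivationDescriptor(&actDesc));
CHECK_CUDNN(cudnnSetActivationDescriptor(actDesc, CUDNN_ACTIVATION_RELU,
                                         CUDNN_PROPAGATE_NAN, 0.0));

// Computes y = ReLU(alpha1 * conv(x, wts) + alpha2 * z + bias). Passing y as
// z with alpha2 = 0 ignores y's prior contents, so no intermediate tensor
// ever makes the round trip to global memory.
const float alpha1 = 1.0f, alpha2 = 0.0f;
CHECK_CUDNN(cudnnConvolutionBiasActivationForward(
    handle, &alpha1, xDesc, x, wDesc, wts, convDesc, algo, ws, wsBytes,
    &alpha2, yDesc, y, bDesc, bias, actDesc, yDesc, y));
```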
For many deep learning primitives, especially convolution, multiple algorithms exist (e.g., GEMM-based, FFT-based, Winograd-based).[18][5] The optimal choice depends on numerous factors, including the tensor dimensions, data types, filter sizes, and the specific GPU architecture. cuDNN incorporates a sophisticated heuristics engine that analyzes these parameters and automatically selects the predicted best-performing algorithm for a given workload.[1][6] This eliminates the need for developers to perform tedious manual benchmarking.[19]
For users seeking the absolute best performance, cuDNN also offers an autotuning feature. When enabled, the library can empirically benchmark a small set of promising algorithms on the target hardware at runtime and select the fastest one for subsequent executions.[24] This combination of heuristics and autotuning functions as a form of JIT compilation, creating a highly optimized execution plan tailored to the specific model and hardware.
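A sketch of both approaches through the legacy convolution API is shown below, reusing the handle, descriptors, and `CHECK_CUDNN` helper from the earlier convolution example.

```cpp
// Sketch: empirical autotuning vs. heuristic selection of the forward
// convolution algorithm.
const int kRequested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
int returned = 0;

// Autotuning: benchmarks each applicable algorithm on the actual shapes and
// returns the results sorted fastest-first.
CHECK_CUDNN(cudnnFindConvolutionForwardAlgorithm(
    handle, xDesc, wDesc, convDesc, yDesc, kRequested, &returned, perf));
cudnnConvolutionFwdAlgo_t best = perf[0].algo; // fastest measured algorithm

// Heuristics: the no-benchmarking counterpart has the same call shape and
// returns the algorithms the heuristics engine predicts will perform best.
CHECK_CUDNN(cudnnGetConvolutionForwardAlgorithm_v7(
    handle, xDesc, wDesc, convDesc, yDesc, kRequested, &returned, perf));
```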
cuDNN is designed to fully exploit specialized hardware units within NVIDIA GPUs known as Tensor Cores.[1] Tensor Cores provide dramatic speedups for matrix multiplication and convolution operations, which are the computational backbone of deep learning. They achieve maximum throughput using lower-precision numerical formats.
cuDNN supports multiple precision modes, including FP32, TF32, FP16, BF16, INT8, and FP8 (the last on Hopper and later architectures).[2]
cuDNN provides optimized kernels that utilize Tensor Cores for mixed-precision data types, enabling significantly faster model training and inference while also reducing the memory footprint.[19] The library internally chooses appropriate kernels based on hardware capabilities and user-specified math precision. This allows training in mixed precision (for example, using FP16/BF16 for computations with FP32 accumulation) to improve performance while maintaining accuracy.
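As a sketch, a convolution can be steered onto Tensor Cores through the legacy API by choosing half-precision tensors and an appropriate math type; the descriptors and `CHECK_CUDNN` helper are reused from the earlier example, and the NHWC layout is chosen because it is generally the friendlier layout for Tensor Core kernels.

```cpp
// Sketch: opting a convolution into Tensor Core execution. With HALF (FP16)
// I/O tensors, setting the math type lets cuDNN select Tensor Core kernels.
CHECK_CUDNN(cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC,
                                       CUDNN_DATA_HALF, 1, 3, 224, 224));
CHECK_CUDNN(cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_HALF,
                                       CUDNN_TENSOR_NHWC, 64, 3, 3, 3));
// The convolution still computes and accumulates in FP32.
CHECK_CUDNN(cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                            CUDNN_CROSS_CORRELATION,
                                            CUDNN_DATA_FLOAT));
CHECK_CUDNN(cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH));
// On Ampere and later, CUDNN_DEFAULT_MATH already permits TF32 Tensor Core
// use for FP32 data; CUDNN_FMA_MATH forces classic FP32 arithmetic instead.
```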
Efficient memory management is critical for handling large models and datasets. Many cuDNN operations require temporary storage buffers, known as "workspace" memory, for their intermediate calculations.[25] The library provides APIs that allow applications to query the required workspace size ahead of time. This enables developers to manage a memory pool efficiently, pre-allocating a single large buffer and sub-allocating from it, which avoids the high overhead of repeated `cudaMalloc` calls.[25]
While cuDNN's API is context-based and facilitates multi-threaded applications where each thread controls a separate GPU, the library itself is focused on optimizing computations on a single GPU.[26] For scaling deep learning tasks across multiple GPUs or multiple nodes, cuDNN works in concert with the NVIDIA Collective Communications Library (NCCL).[1][27]
In a typical data-parallel training scenario, the workflow is as follows (a minimal code sketch follows the list):
1. The model is replicated on each GPU, and each GPU receives a distinct shard of the global training batch.
2. Each GPU independently executes the forward and backward passes of the network using cuDNN kernels.
3. The resulting gradients are summed or averaged across all GPUs using an NCCL all-reduce operation.
4. Each GPU applies the synchronized gradients to its local copy of the model weights, keeping the replicas identical.
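The synchronization step (step 3) can be sketched as below; it assumes one host thread per GPU has already run its cuDNN backward pass and holds its local gradients contiguously in `grad_dev`, and that an NCCL communicator has been initialized. Buffer and function names are illustrative.

```cpp
// Minimal sketch of data-parallel gradient synchronization with NCCL.
#include <cuda_runtime.h>
#include <nccl.h>

void sync_gradients(ncclComm_t comm, float* grad_dev, size_t count,
                    cudaStream_t stream) {
    // Sums gradients across all GPUs in place; every rank receives the total.
    ncclAllReduce(grad_dev, grad_dev, count, ncclFloat, ncclSum, comm, stream);
    // Scaling by 1/world_size (to average) and the optimizer update would
    // follow, typically as small CUDA kernels launched on the same stream.
    cudaStreamSynchronize(stream);
}
```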
This modular design, which separates intra-GPU computation (cuDNN) from inter-GPU communication (NCCL), is a key strength of the NVIDIA platform. It allows each library to be optimized independently—cuDNN for new on-chip compute features and NCCL for new interconnect technologies like NVLink.[27]
The vast majority of developers interact with cuDNN indirectly through high-level deep learning frameworks.[19][4] cuDNN serves as the default, high-performance execution engine for nearly all major frameworks when running on NVIDIA GPUs.[28]
This integration is designed to be seamless. When a framework like PyTorch or TensorFlow is installed with GPU support, it automatically detects and utilizes the installed cuDNN library.[29] As a result, developers can write high-level code (e.g., defining a convolutional layer in Python) and the framework's backend will automatically translate that operation into a call to the corresponding optimized cuDNN primitive, unlocking GPU acceleration without requiring any low-level programming.[3][4]
| Framework | Primary Maintainer | Integration Status |
|---|---|---|
| PyTorch | Meta AI | Native Integration[1][30] |
| TensorFlow | Google | Native Integration[1][28] |
| JAX | Google | Integrated via XLA[1][19] |
| Apache MXNet | Apache Software Foundation | Native Integration[1][4] |
| Caffe / Caffe2 | Berkeley / Meta AI | Native Integration[1][4] |
| Chainer | Preferred Networks | Native Integration[1] |
| Microsoft Cognitive Toolkit | Microsoft | Native Integration[1][28] |
| PaddlePaddle | Baidu | Native Integration[1][28] |
| Keras | Various | Via TensorFlow/PyTorch backends[1] |
| MATLAB | MathWorks | Deep Learning Toolbox Integration[1] |
PyTorch provides deep cuDNN integration with automatic detection and utilization of the library for operations like Conv2d, LSTM, and BatchNorm2d.[30] PyTorch users can enable benchmark mode (`torch.backends.cudnn.benchmark = True`) for automatic algorithm selection, which can provide 10-25% performance improvements.[30]
TensorFlow automatically detects and uses cuDNN when available.[31] Recent versions like TensorFlow 2.18.0 support CUDA 12.3 with cuDNN 8.9, providing 40-50% speedup for LSTM/GRU operations with automatic Tensor Core utilization.[31]
JAX leverages cuDNN through XLA compiler integration.[32] The XLA compiler automatically selects cuDNN operations when beneficial, with automatic backend selection between cuDNN, cuBLAS, and custom kernels.
The performance impact of cuDNN is substantial, often providing speedups of one to two orders of magnitude over CPU-only implementations and significant acceleration compared to unoptimized GPU code.[33]
cuDNN demonstrates significant performance advantages across various workloads:
| GPU | Model | Precision | Performance Metric | Reference |
|---|---|---|---|---|
| NVIDIA A100 | ResNet-50 | FP32 | 7,850 images/sec | [34] |
| NVIDIA A100 | ResNet-50 | FP16 | 20,500 images/sec | [34] |
| NVIDIA H100 (SXM5) | Aggregate Models | Mixed | ~5.7× vs V100 | [35] |
| NVIDIA RTX 4090 | Aggregate Models | Mixed | ~2.1× vs V100 | [35] |
The NVIDIA AI platform, with cuDNN as a core software component, consistently sets performance records in the industry-standard MLPerf benchmarks for both training and inference across a wide variety of AI workloads.[36]
cuDNN version requirements vary by GPU architecture:[37]
| Architecture | Compute Capability | Example GPUs | cuDNN Support |
|---|---|---|---|
| Kepler | 3.0, 3.5, 3.7 | Tesla K80, GTX 780 Ti | Supported in cuDNN 7.x and earlier |
| Maxwell | 5.0, 5.2 | GTX 980, GTX 750 Ti | Supported in cuDNN 8.x (CUDA 11.x branch) |
| Pascal | 6.0, 6.1 | Tesla P100, GTX 1080 | Supported in cuDNN 8.x and 9.x (CUDA 11.x branch) |
| Volta | 7.0 | Tesla V100, Titan V | Full support in cuDNN 8.x and 9.x + Tensor Cores |
| Turing | 7.5 | RTX 2080, Tesla T4 | Full support in cuDNN 8.x and 9.x + Tensor Cores |
| Ampere | 8.0, 8.6 | A100, RTX 3090 | Full support + Enhanced Tensor Cores |
| Ada Lovelace | 8.9 | RTX 4090, L4, L40 | Full support (requires cuDNN 8.9+) |
| Hopper | 9.0 | H100, H200 | Full support + FP8 (requires cuDNN 8.9+) |
| Blackwell | 10.0 | B100, B200 | Full support + Enhanced FP8 (requires cuDNN 9.7+) |
Note: cuDNN 9.x with CUDA 12.x requires Turing architecture or later (compute capability 7.5+). For Maxwell and Pascal support, use cuDNN 8.x or 9.x with CUDA 11.x.[37]
Each version of cuDNN is built against and requires specific versions of the CUDA Toolkit.[37]
Supported operating systems include Linux (x86_64 and Arm SBSA) and Microsoft Windows.[37]
cuDNN can be installed through various methods, each suited to different workflows and requirements.[38]
Recommended installation methods include:
Conda:
conda install nvidia::cudnn cuda-version=12
pip:
pip install nvidia-cudnn-cu12 # For CUDA 12.x
pip install nvidia-cudnn-cu11 # For CUDA 11.x
Docker:
docker pull nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04
cuDNN can also be installed manually on Linux systems using NVIDIA's distribution-specific packages or tarball archives.[38]
Users must join the NVIDIA Developer Program to download cuDNN packages.[38]
cuDNN powers numerous AI applications across industries.[39]
While cuDNN is the de facto standard for deep learning acceleration on NVIDIA hardware, other hardware vendors provide their own libraries for their respective ecosystems.
MIOpen is an open-source deep learning primitives library developed by AMD for its ROCm GPGPU platform. It is designed to be the AMD equivalent of cuDNN.[40] While MIOpen provides implementations for many core primitives like convolutions and pooling, it has historically lagged cuDNN in feature completeness and performance.[40][34] Benchmarks comparing high-end NVIDIA GPUs with cuDNN against high-end AMD GPUs with MIOpen typically show a significant performance advantage for the NVIDIA ecosystem, particularly on models like ResNet-50 and Transformer-based workloads.[34]
Intel's oneAPI Deep Neural Network Library (oneDNN), formerly known as MKL-DNN and DNNL, is an open-source performance library for accelerating deep learning applications on Intel architectures, including CPUs and GPUs.[41] It is a core component of Intel's oneAPI initiative. While oneDNN provides excellent performance on Intel hardware, its primary focus is different from cuDNN's. On NVIDIA hardware, oneDNN has experimental support that often functions by calling cuDNN as a backend, rather than acting as a direct competitor.[42] Performance comparisons show that for GPU-accelerated deep learning workloads, the combination of NVIDIA hardware and cuDNN significantly outperforms oneDNN running on Intel CPUs or GPUs.[43]
The competitive landscape highlights that cuDNN's strength comes not just from the library itself, but from its position within a deeply integrated, vertically co-designed ecosystem that spans from silicon architecture (e.g., Tensor Cores) and drivers to the CUDA platform and finally to the high-level frameworks. This tight integration creates a powerful performance advantage that is difficult for competitors to replicate.
| Library | Primary Developer | Primary Hardware Target | Open Source? | Key Differentiator |
|---|---|---|---|---|
| cuDNN | NVIDIA | NVIDIA GPUs | No (Binary distribution) | Deep integration with CUDA ecosystem, Tensor Cores, and hardware co-design |
| MIOpen | AMD | AMD GPUs | Yes | Core component of the open-source AMD ROCm ecosystem |
| oneDNN | Intel / UXL Foundation | Intel CPUs & GPUs | Yes | Optimized for Intel Architecture; part of the cross-platform oneAPI standard |
cuDNN is distributed under a proprietary NVIDIA SDK license.[44] It is available free of charge to registered developers as part of NVIDIA's software development kit, but it is proprietary software and its use is governed by NVIDIA's license agreement.[45]