Grad-CAM (Gradient-weighted Class Activation Mapping) is a technique for producing visual explanations from convolutional neural network (CNN) models. It generates coarse localization heatmaps that highlight regions in an input image that are most relevant to a given prediction. Proposed by Ramprasaath R. Selvaraju and colleagues in 2016 and published at ICCV 2017, Grad-CAM generalizes the earlier Class Activation Mapping (CAM) method by removing its architectural constraints, making it applicable to virtually any differentiable CNN-based architecture without requiring model retraining or structural modifications.
Grad-CAM works by computing the gradient of a target class score with respect to the feature maps of a chosen convolutional layer (typically the last one), then averaging those gradients to obtain importance weights for each feature map. The weighted combination of feature maps, passed through a ReLU activation, produces the final heatmap. This approach has become one of the most widely used methods in explainable AI for computer vision tasks.
Imagine you show a picture of a cat to a robot and ask it, "Is this a cat?" The robot says yes, but you want to know how it decided. Grad-CAM is like putting special glasses on the robot that let you see which part of the picture the robot looked at most. If the robot says "cat," Grad-CAM might show you that the robot was looking at the cat's face and ears, not at the tree in the background. The important parts light up in warm colors (red and yellow), and the parts the robot ignored stay cool (blue). This helps people check whether the robot is making decisions for the right reasons.
Grad-CAM builds directly on Class Activation Mapping (CAM), introduced by Bolei Zhou and colleagues in 2016. Understanding CAM is necessary for understanding why Grad-CAM was developed and how it improves on the original approach.
CAM was proposed in the paper "Learning Deep Features for Discriminative Localization" (Zhou et al., 2016). It showed that convolutional neural networks trained for image classification, when using global average pooling (GAP) before the final classification layer, retain spatial information that can be used to generate class-specific localization maps.
The CAM procedure works as follows:

1. Use a network in which the last convolutional layer is followed by global average pooling (GAP) and a single fully connected classification layer.
2. For a target class c, take the learned weights w_k^c of the fully connected layer that connect the k-th pooled feature to the class score.
3. Compute the class activation map as the weighted sum of the last convolutional layer's feature maps: M_CAM^c = sum_k (w_k^c * A^k).
CAM achieved 37.1% top-5 error for object localization on ILSVRC 2014 without any bounding box supervision, which was close to the 34.2% achieved by fully supervised approaches.
The main limitation of CAM is its strict architectural requirement. It only works with networks that have a specific structure: the last convolutional layer must be followed directly by global average pooling and a single fully connected (linear) layer. This excludes many popular architectures, including VGG, which uses multiple fully connected layers. To apply CAM to a network that does not meet this requirement, the architecture must be modified and the model retrained, which is impractical in many settings.
These limitations directly motivated the development of Grad-CAM.
Grad-CAM replaces CAM's reliance on specific architectural weights with gradient information, making it applicable to any differentiable CNN architecture. The method was introduced by Selvaraju et al. in their 2016 arXiv preprint (later published at ICCV 2017 and in the International Journal of Computer Vision in 2019).
The Grad-CAM algorithm consists of three steps:
Step 1: Compute the gradient. Perform a forward pass of the input image through the network to obtain the class score y^c for a target class c. Then perform backpropagation to compute the gradient of y^c with respect to the feature maps A^k of the chosen convolutional layer (usually the last convolutional layer). This produces a gradient tensor of the same dimensions as the feature maps.
Step 2: Compute importance weights (alpha values). Global average pool the gradients over the spatial dimensions (width u and height v) to obtain a single scalar weight for each feature map k:
alpha_k^c = (1 / Z) * sum_i sum_j (partial y^c / partial A^k(i, j))
where Z = u * v is the number of spatial locations. Each alpha_k^c represents the importance of feature map k for predicting class c.
Step 3: Compute the weighted combination and apply ReLU. The Grad-CAM heatmap is computed as a weighted combination of the feature maps, followed by a ReLU operation:
L_Grad-CAM^c = ReLU(sum_k(alpha_k^c * A^k))
The ReLU is applied because only features with a positive influence on the class of interest are relevant for visualization. Negative values correspond to features that belong to other classes and are suppressed.
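The three steps above can be sketched in a few lines of NumPy, assuming the feature maps and their gradients (both of shape (K, u, v)) have already been captured, for example via framework hooks; the function name here is illustrative, not from any library:

```python
import numpy as np

def grad_cam_heatmap(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations: feature maps A^k of the chosen layer, shape (K, u, v).
    gradients:   d y^c / d A^k from backpropagation, same shape.
    Returns a (u, v) heatmap L_Grad-CAM^c.
    """
    # Step 2: global average pooling of the gradients over the spatial
    # dimensions gives one importance weight alpha_k^c per feature map.
    alphas = gradients.mean(axis=(1, 2))             # shape (K,)
    # Step 3: weighted combination of the feature maps ...
    cam = np.tensordot(alphas, activations, axes=1)  # shape (u, v)
    # ... followed by ReLU, keeping only positive influence on class c.
    return np.maximum(cam, 0.0)
```

With toy inputs where feature map 0 is uniformly 2 and map 1 uniformly 1, and gradients of +1 and -1 respectively, the alphas come out as (1, -1) and the heatmap is uniformly 1.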
The complete mathematical formulation can be summarized as:
| Symbol | Meaning |
|---|---|
| y^c | Score (logit) for class c before softmax |
| A^k | Activation (feature map) of the k-th filter in the target convolutional layer |
| A^k(i, j) | Activation value at spatial location (i, j) in feature map k |
| alpha_k^c | Importance weight for feature map k with respect to class c |
| L_Grad-CAM^c | The resulting Grad-CAM heatmap for class c |
| Z | Number of pixels in the feature map (width times height) |
The key equations are:

alpha_k^c = (1 / Z) * sum_i sum_j (partial y^c / partial A^k(i, j))

L_Grad-CAM^c = ReLU(sum_k(alpha_k^c * A^k))
Selvaraju et al. proved that Grad-CAM is a strict generalization of CAM. When applied to a network with the exact architecture that CAM requires (last convolutional layer followed by GAP and a single fully connected layer), the Grad-CAM weights alpha_k^c reduce to the learned weights w_k^c used in CAM. In other words, CAM is a special case of Grad-CAM.
The output heatmap has the same spatial resolution as the chosen convolutional layer's feature maps, which is typically much smaller than the input image (for example, 7x7 for the last convolutional layer of VGG-16 with a 224x224 input). To overlay the heatmap on the original image, it is upsampled (usually via bilinear interpolation) to match the input resolution.
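This post-processing can be sketched as follows, assuming an integer upscaling factor. A real pipeline would use bilinear interpolation (e.g. cv2.resize or torch.nn.functional.interpolate); np.kron is used here only as a dependency-free nearest-neighbour stand-in:

```python
import numpy as np

def prepare_overlay(cam, factor):
    """Normalize a coarse heatmap to [0, 1] and upsample it.

    cam:    (u, v) heatmap from the target convolutional layer.
    factor: integer upscale factor, e.g. 32 to go from a 7x7 VGG-16
            map to a 224x224 input resolution.
    """
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)                  # scale into [0, 1]
    # Nearest-neighbour upsampling; bilinear is used in practice.
    return np.kron(cam, np.ones((factor, factor)))  # (u*factor, v*factor)
```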
One drawback of Grad-CAM is that its heatmaps are coarse due to the low spatial resolution of the final convolutional layer. Conversely, pixel-space gradient visualization methods such as Guided Backpropagation produce high-resolution, fine-grained visualizations but are not class-discriminative (they highlight edges and textures regardless of the target class).
Guided Grad-CAM combines the strengths of both approaches through element-wise multiplication:
L_Guided-Grad-CAM^c = L_Guided-Backprop * upsample(L_Grad-CAM^c)
The result is a visualization that is simultaneously high-resolution (capturing fine-grained pixel-level detail) and class-discriminative (highlighting only the regions relevant to the target class).
Guided Backpropagation works by modifying standard backpropagation so that only positive gradients are propagated through ReLU layers during the backward pass. When combined with Grad-CAM, it produces sharper, more interpretable saliency maps.
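A minimal sketch of the two ingredients: the guided-ReLU backward rule (gradients pass only where both the forward activation and the incoming gradient are positive) and the element-wise combination with an upsampled Grad-CAM map. Both function names are illustrative:

```python
import numpy as np

def guided_relu_backward(forward_activation, incoming_grad):
    # Standard ReLU backprop masks by (activation > 0); Guided
    # Backpropagation additionally masks by (incoming_grad > 0).
    return incoming_grad * (forward_activation > 0) * (incoming_grad > 0)

def guided_grad_cam(guided_backprop_map, cam_upsampled):
    # Element-wise product: the class-discriminative coarse map gates
    # the high-resolution pixel-space visualization.
    return guided_backprop_map * cam_upsampled
```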
However, research by Adebayo et al. (2018) showed that Guided Backpropagation (and by extension, Guided Grad-CAM) fails certain "sanity checks" for saliency methods. Specifically, the output of Guided Backpropagation does not change meaningfully when model weights are randomized, suggesting it may function more as an edge detector than as a true model explanation. Grad-CAM on its own does pass these sanity checks.
Several methods have extended or modified the Grad-CAM approach to address its limitations or provide alternative formulations.
Grad-CAM++ was proposed by Chattopadhay et al. (2018) and presented at the IEEE Winter Conference on Applications of Computer Vision (WACV). It addresses a limitation of Grad-CAM: when multiple instances of a target class appear in a single image, Grad-CAM tends to highlight only the most prominent one. Grad-CAM++ provides better localization of multiple instances and more complete coverage of objects.
The key difference is that Grad-CAM++ replaces the simple global average of gradients with a pixel-wise weighted combination that uses higher-order (second and third) partial derivatives:
w_k^c = sum_{i,j} alpha_k^c(i, j) * ReLU(partial y^c / partial A^k(i, j))
where the pixel-wise weight alpha_k^c(i, j) is computed using second-order and third-order derivatives of the class score with respect to the activations. This weighted scheme provides a more nuanced importance measure that accounts for the spatial distribution of gradients across the feature map.
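This weighting can be sketched using the closed-form approximation from the Grad-CAM++ paper, in which (with an exponential applied to the class score before differentiation) the second and third derivatives reduce to the squared and cubed first-order gradients; this is a simplified sketch, not a full reimplementation:

```python
import numpy as np

def grad_cam_pp_heatmap(activations, gradients, eps=1e-8):
    """Grad-CAM++ heatmap from feature maps and first-order gradients.

    Assumes d2y/dA2 ~ g^2 and d3y/dA3 ~ g^3 (valid when the class
    score is exponentiated before differentiation).
    """
    g2 = gradients ** 2
    g3 = gradients ** 3
    # Pixel-wise alpha_k^c(i, j) from the closed-form expression.
    global_sum = activations.sum(axis=(1, 2), keepdims=True)
    denom = 2.0 * g2 + global_sum * g3
    alpha = g2 / np.where(np.abs(denom) > eps, denom, eps)
    # w_k^c = sum_ij alpha_k^c(i, j) * ReLU(gradient)
    weights = (alpha * np.maximum(gradients, 0.0)).sum(axis=(1, 2))
    cam = np.tensordot(weights, activations, axes=1)
    return np.maximum(cam, 0.0)
```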
Score-CAM, introduced by Wang et al. (2020) at the IEEE CVPR Workshop, is a gradient-free alternative to Grad-CAM. Instead of relying on gradient information, Score-CAM determines the importance of each feature map by measuring how much it contributes to the model's output when used as a mask on the input image.
The Score-CAM procedure works as follows:

1. Extract the activation maps A^k of the target convolutional layer.
2. Upsample each map to the input resolution and normalize it to the range [0, 1].
3. Multiply the input image element-wise by each normalized map to obtain masked inputs X'_k.
4. Run a forward pass on each masked input and record the target class score y^c(X'_k).
5. Apply a softmax over these scores to obtain the channel weights, then compute the ReLU of the weighted sum of activation maps.
The weight formula is: w_k^c = softmax_k(y^c(X'_k)), where X'_k is the input masked by activation map k.
Score-CAM avoids issues related to noisy or vanishing gradients but is computationally more expensive because it requires one forward pass per activation map (which can number in the hundreds for modern architectures).
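The procedure can be sketched as below, with `score_fn` standing in (hypothetically) for one forward pass of the model returning the target class score, and the activation maps assumed to be already upsampled to the input resolution:

```python
import numpy as np

def score_cam_heatmap(score_fn, image, activations, eps=1e-8):
    """Gradient-free Score-CAM heatmap.

    score_fn:    callable image -> scalar class score y^c (model stand-in).
    image:       input array, shape (H, W).
    activations: K maps already upsampled to shape (K, H, W).
    """
    scores = np.empty(len(activations))
    for k, a in enumerate(activations):
        mask = (a - a.min()) / (a.max() - a.min() + eps)  # normalize to [0, 1]
        scores[k] = score_fn(image * mask)                # one forward pass per map
    exp = np.exp(scores - scores.max())
    weights = exp / exp.sum()                             # softmax over channels
    cam = np.tensordot(weights, activations, axes=1)
    return np.maximum(cam, 0.0)
```

Note the loop: for a layer with hundreds of channels this means hundreds of forward passes, which is the computational cost mentioned above.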
LayerCAM was proposed by Jiang et al. (2021) and published in IEEE Transactions on Image Processing. While Grad-CAM uses globally averaged gradients as weights, LayerCAM uses spatially varying (per-pixel) positive gradients as weights:
w_k^c(x, y) = ReLU(partial y^c / partial A^k(x, y))

L_LayerCAM^c(x, y) = ReLU(sum_k w_k^c(x, y) * A^k(x, y))
This approach enables the generation of class activation maps from any convolutional layer, not just the final one. By aggregating maps from multiple layers (shallow to deep), LayerCAM produces higher-quality localization maps that combine fine-grained detail from early layers with semantic information from deeper layers.
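The per-pixel weighting reduces to a few lines; a sketch assuming activations and gradients of shape (K, u, v), with an illustrative function name:

```python
import numpy as np

def layer_cam_heatmap(activations, gradients):
    """LayerCAM: weight each activation by its own positive gradient."""
    # w_k^c(x, y) = ReLU(d y^c / d A^k(x, y)) -- spatially varying weights
    weights = np.maximum(gradients, 0.0)
    # Weighted activations summed over feature maps k, then ReLU.
    cam = (weights * activations).sum(axis=0)
    return np.maximum(cam, 0.0)
```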
Eigen-CAM was introduced by Muhammad and Yeasin (2020) at the International Joint Conference on Neural Networks (IJCNN). It takes a fundamentally different approach by using principal component analysis (PCA) on the feature maps rather than gradient information.
Eigen-CAM computes the first principal component of the activations from the target convolutional layer and uses it as the class activation map. Because it does not depend on gradients, backpropagation, class scores, or any form of feature weighting, it is computationally simple and robust to errors made by the fully connected layers.
However, since Eigen-CAM does not use class-specific information, the resulting heatmap is the same regardless of which class is being analyzed, which limits its usefulness for class-discriminative analysis.
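A sketch of the idea: flatten each feature map into a row of a K x (H*W) matrix, take the first right singular vector (the first principal component over spatial locations), and reshape it back. The sign of a singular vector is arbitrary, so it is flipped here to keep the map mostly positive; this simplified version omits details of the original formulation:

```python
import numpy as np

def eigen_cam_heatmap(activations):
    """Eigen-CAM: first principal component of the layer activations.

    activations: (K, H, W) feature maps. Class-agnostic by construction:
    no gradients or class scores are involved.
    """
    K, H, W = activations.shape
    flat = activations.reshape(K, H * W)
    # First right singular vector = principal spatial direction of the maps.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    cam = vt[0].reshape(H, W)
    if cam.sum() < 0:  # singular-vector sign is arbitrary; prefer positive
        cam = -cam
    return cam
```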
| Method | Year | Gradient-based | Class-specific | Multi-instance | Resolution | Computational cost |
|---|---|---|---|---|---|---|
| CAM | 2016 | No (uses learned weights) | Yes | Limited | Coarse | Low |
| Grad-CAM | 2017 | Yes | Yes | Limited | Coarse | Low |
| Guided Grad-CAM | 2017 | Yes | Yes | Limited | High | Low |
| Grad-CAM++ | 2018 | Yes (higher-order) | Yes | Better | Coarse | Moderate |
| Score-CAM | 2020 | No (perturbation-based) | Yes | Good | Coarse | High |
| Eigen-CAM | 2020 | No (PCA-based) | No | N/A | Coarse | Low |
| LayerCAM | 2021 | Yes (per-pixel) | Yes | Good | Flexible | Low |
Grad-CAM and its variants have been applied across a wide range of domains and tasks.
The original Grad-CAM paper demonstrated that the method can reveal when a model is making predictions based on spurious correlations rather than meaningful features. For example, a model trained to classify "doctor" vs. "nurse" might focus on gender-associated features (such as hairstyle or clothing) rather than medical equipment, revealing bias in the training data. By visualizing what the model attends to, developers can identify and correct such problems.
Grad-CAM has been widely adopted in medical imaging for interpreting deep learning models used in clinical diagnosis. Applications include:
CAM-based methods are among the most commonly used explainability techniques in medical AI, with Grad-CAM and Grad-CAM++ being the most frequently cited. However, researchers have raised concerns about the reliability of gradient-based saliency maps in clinical settings, noting that no single saliency method has been shown to satisfy all trustworthiness criteria consistently across different imaging modalities and patient populations.
Grad-CAM can be applied to multimodal tasks by computing gradients with respect to the CNN feature maps used in the visual branch of the model. In visual question answering (VQA), Grad-CAM can show which image regions the model focuses on when answering a specific question. For image captioning, it can reveal the spatial attention for each word in the generated caption.
Modern implementations of Grad-CAM (such as the pytorch-grad-cam library) support application to object detection models like YOLO and DETR, as well as semantic segmentation models. In these settings, Grad-CAM generates per-object or per-region heatmaps showing the spatial features driving each prediction.
Grad-CAM visualizations can be used to compare different model architectures or training strategies. By examining which regions different models attend to for the same input, researchers can gain insight into how architectural choices or training procedures affect learned representations.
Grad-CAM belongs to the broader family of post-hoc visual explanation methods. It is useful to compare it with other approaches in this space.
| Method | Type | Model-agnostic | Granularity | Computational cost | Class-discriminative |
|---|---|---|---|---|---|
| Grad-CAM | Gradient-based, activation map | CNN-specific | Coarse | Low | Yes |
| SHAP (DeepSHAP) | Shapley value-based | Yes | Pixel-level | High | Yes |
| LIME | Perturbation-based | Yes | Superpixel-level | Moderate | Yes |
| Saliency Maps (vanilla gradient) | Gradient-based | Differentiable models | Pixel-level | Low | Yes |
| Integrated Gradients | Gradient-based, path method | Differentiable models | Pixel-level | Moderate | Yes |
| SmoothGrad | Gradient-based, noise-averaged | Differentiable models | Pixel-level | Moderate | Yes |
| Guided Backpropagation | Modified backpropagation | CNN-specific | Pixel-level | Low | No |
Grad-CAM is generally preferred when a quick, coarse localization of important regions is sufficient. Methods like LIME and SHAP are model-agnostic and can be applied beyond CNNs, but they are more computationally expensive. Gradient-based pixel-level methods (vanilla gradients, Integrated Gradients, SmoothGrad) provide finer-grained attribution but can be noisier and harder to interpret visually.
Despite its popularity, Grad-CAM has several known limitations:

- Coarse spatial resolution: the heatmap inherits the low resolution of the final convolutional layer (e.g. 7x7 for VGG-16), so fine structures cannot be localized precisely.
- Single-instance bias: when several objects of the target class appear in one image, Grad-CAM tends to highlight only the most prominent instance.
- Gradient pathologies: noisy, saturated, or vanishing gradients can distort the importance weights.
- Layer dependence: the result varies with the choice of target convolutional layer.
- Faithfulness: a visually plausible heatmap does not guarantee that it reflects the model's actual decision process, so saliency outputs should be validated, for example with the sanity checks of Adebayo et al. (2018).
Several open-source libraries provide implementations of Grad-CAM and its variants:
| Library | Language / Framework | Supported methods |
|---|---|---|
| pytorch-grad-cam | PyTorch | Grad-CAM, Grad-CAM++, Score-CAM, LayerCAM, Eigen-CAM, and others |
| tf-explain | TensorFlow / Keras | Grad-CAM, Vanilla Gradients, SmoothGrad |
| keras-vis | Keras | Grad-CAM, saliency maps, class activation maps |
| Captum | PyTorch | Grad-CAM, Integrated Gradients, SHAP, and many others |
The pytorch-grad-cam library by Jacob Gildenblat is the most widely used implementation as of 2024, supporting a broad range of CAM methods and architectures including Vision Transformers, object detection models, and segmentation models.