# Grad-CAM

> Source: https://aiwiki.ai/wiki/grad_cam
> Updated: 2026-07-11
> Categories: Computer Vision, Deep Learning, Interpretability, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Grad-CAM (Gradient-weighted Class Activation Mapping) is a technique for producing visual explanations of [convolutional neural network](/wiki/convolutional_neural_network) (CNN) models. It uses the gradients of a target class score flowing into the final convolutional layer to build a coarse, class-discriminative localization heatmap that highlights the image regions most important for the prediction [1]. It was introduced by Ramprasaath R. Selvaraju and colleagues in a 2016 arXiv preprint and published at the IEEE International Conference on Computer Vision (ICCV) in 2017 [1]. Grad-CAM generalizes the earlier Class Activation Mapping (CAM) method to a wide variety of CNN architectures (including networks with fully connected layers, structured outputs, and multimodal inputs) without any architectural changes or re-training [1].

The authors describe the method's purpose directly: "We propose a technique for producing 'visual explanations' for decisions from a large class of CNN-based models, making them more transparent" [1]. Grad-CAM works by computing the [gradient](/wiki/gradient) of a target class score with respect to the feature maps of a chosen convolutional layer (typically the last one), then averaging those gradients to obtain importance weights for each feature map. The weighted combination of feature maps, passed through a [ReLU](/wiki/rectified_linear_unit_relu) activation, produces the final heatmap. This approach has become one of the most widely used methods in [explainable AI](/wiki/explainable_ai) for [computer vision](/wiki/computer_vision) tasks, and it is closely related to the broader idea of a [saliency map](/wiki/saliency_map) and to model [interpretability](/wiki/interpretability) [1][8].

## ELI5 (explain like I'm 5)

Imagine you show a picture of a cat to a robot and ask it, "Is this a cat?" The robot says yes, but you want to know how it decided. Grad-CAM is like putting special glasses on the robot that let you see which part of the picture the robot looked at most. If the robot says "cat," Grad-CAM might show you that the robot was looking at the cat's face and ears, not at the tree in the background. The important parts light up in warm colors (red and yellow), and the parts the robot ignored stay cool (blue). This helps people check whether the robot is making decisions for the right reasons.

## What is Grad-CAM?

Grad-CAM is a post-hoc, gradient-based explanation method for CNN classifiers. Given a trained model and a target class, it outputs a heatmap over the input image showing which spatial regions most increased the model's score for that class. Two properties define the technique:

- **Class-discriminative.** The heatmap is specific to the chosen target class. For an image containing both a dog and a cat, asking for the "dog" explanation highlights the dog, while asking for "cat" highlights the cat [1].
- **Architecture-agnostic.** Because it relies only on gradients flowing into a convolutional layer, Grad-CAM applies to "a wide variety of CNN model-families" without modifying or retraining the network [1].

The original paper frames the technique around the gradient signal: Grad-CAM "uses the gradients of any target concept, flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept" [1]. This makes it a practical tool for debugging models, building user trust, and performing weakly-supervised localization.

## When was Grad-CAM introduced and who created it?

Grad-CAM was created by Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra (then affiliated with Georgia Tech and Facebook AI Research). The work first appeared as an arXiv preprint in 2016 (arXiv:1610.02391) and was published at ICCV 2017 (pages 618-626). An extended version appeared in the International Journal of Computer Vision (IJCV), published online in 2019 and assigned to volume 128 (2020) [1][2].

| Milestone | Venue | Year |
|---|---|---|
| First preprint (arXiv:1610.02391) | arXiv | 2016 |
| Conference publication | ICCV (pages 618-626) | 2017 |
| Extended journal version | International Journal of Computer Vision (IJCV) | 2019 (online); 2020 (volume 128) |

## Background: Class Activation Mapping (CAM)

Grad-CAM builds directly on Class Activation Mapping (CAM), introduced by Bolei Zhou and colleagues in 2016. Understanding CAM is necessary for understanding why Grad-CAM was developed and how it improves on the original approach.

### Original CAM method

CAM was proposed in the paper "Learning Deep Features for Discriminative Localization" (Zhou et al., 2016) [3]. It showed that [convolutional neural networks](/wiki/convolutional_neural_network) trained for image classification, when using [global average pooling](/wiki/pooling) (GAP) before the final classification layer, retain spatial information that can be used to generate class-specific localization maps.

The CAM procedure works as follows:

1. Take a CNN architecture that uses global average pooling after the last convolutional layer, followed by a single fully connected layer for classification.
2. For each feature map $$A^k$$ from the last convolutional layer, global average pooling produces a scalar value $$F^k$$.
3. The classification score for class c is computed as: $$y^c = \sum_k w_k^c F^k$$, where $$w_k^c$$ are the learned weights connecting pooled feature k to class c, and $$F^k$$ is the global average pooled value of feature map k.
4. The class activation map is then: $$L_{\text{CAM}}^c(x, y) = \sum_k w_k^c A^k(x, y)$$, so that averaging the map over all spatial locations recovers the class score $$y^c$$.
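As a rough illustration of steps 2 to 4, the following NumPy sketch computes a class score and its activation map for a single image. The function names, the `(K, H, W)` array layout, and the optional `apply_relu` flag (which only clips negative values for display) are assumptions made for this example and are not taken from the CAM paper.

```python
import numpy as np

def cam_class_score(feature_maps, fc_weights, class_idx):
    """Class score of a GAP + single-linear-layer head (bias omitted).

    feature_maps: array of shape (K, H, W) from the last convolutional layer.
    fc_weights:   array of shape (num_classes, K), the learned weights w_k^c.
    """
    pooled = feature_maps.mean(axis=(1, 2))        # F^k, shape (K,)
    return float(fc_weights[class_idx] @ pooled)   # y^c = sum_k w_k^c F^k

def class_activation_map(feature_maps, fc_weights, class_idx, apply_relu=False):
    """Weighted sum of feature maps using the class-specific weights."""
    w_c = fc_weights[class_idx]                            # shape (K,)
    cam = np.tensordot(w_c, feature_maps, axes=(0, 0))     # shape (H, W)
    # Optional clipping of negative evidence, for display purposes only.
    return np.maximum(cam, 0.0) if apply_relu else cam
```

With `apply_relu=False`, the spatial mean of the returned map equals the value from `cam_class_score`, which is the link between the map and the classifier output.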

CAM achieved 37.1% top-5 error for object localization on ILSVRC 2014 without any bounding box supervision, which was close to the 34.2% achieved by fully supervised approaches [3].

### Limitations of CAM

The main limitation of CAM is its strict architectural requirement. It only works with networks that have a specific structure: the last convolutional layer must be followed directly by global average pooling and a single fully connected (linear) layer. This excludes many popular architectures, including [VGG](/wiki/vgg), which uses multiple fully connected layers. To apply CAM to a network that does not meet this requirement, the architecture must be modified and the model retrained, which is impractical in many settings.

These limitations directly motivated the development of Grad-CAM.

## How does Grad-CAM work?

Grad-CAM replaces CAM's reliance on the learned weights of one specific architecture with gradient information, which makes it applicable to any CNN whose class score is differentiable with respect to the chosen feature maps [1][2].

### Step-by-step procedure

The Grad-CAM algorithm consists of three steps:

**Step 1: Compute the gradient.** Perform a forward pass of the input image through the network to obtain the class score $$y^c$$ for a target class c. Then perform backpropagation to compute the gradient of $$y^c$$ with respect to the feature maps $$A^k$$ of the chosen convolutional layer (usually the last convolutional layer). This produces a gradient tensor of the same dimensions as the feature maps.

**Step 2: Compute importance weights (alpha values).** Global average pool the gradients over the spatial dimensions (width u and height v) to obtain a single scalar weight for each feature map k:

$$
\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k(i, j)}
$$

where $$Z = u \cdot v$$ is the number of spatial locations. Each $$\alpha_k^c$$ represents the importance of feature map k for predicting class c.

**Step 3: Compute the weighted combination and apply ReLU.** The Grad-CAM heatmap is computed as a weighted combination of the feature maps, followed by a ReLU operation:

$$
L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)
$$

The ReLU is applied because only features with a positive influence on the class of interest are relevant for visualization. Negative values are likely to correspond to features that belong to other classes, so they are suppressed.
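Steps 2 and 3 reduce to a few array operations once the activations and gradients from step 1 are available. The sketch below assumes both have already been extracted from an automatic-differentiation framework as arrays of shape `(K, H, W)`; the function name and array layout are illustrative choices, not part of the paper.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one image and one target class.

    activations: feature maps A^k of the chosen layer, shape (K, H, W).
    gradients:   d y^c / d A^k for the same layer, shape (K, H, W).
    """
    # Step 2: global-average-pool the gradients to get one weight per map.
    alpha = gradients.mean(axis=(1, 2))                         # shape (K,)
    # Step 3: weighted combination of the feature maps, then ReLU.
    weighted = np.tensordot(alpha, activations, axes=(0, 0))    # shape (H, W)
    return np.maximum(weighted, 0.0)

# Tiny worked example with K = 2 feature maps of size 2x2.
activations = np.array([[[1.0, 2.0], [3.0, 4.0]],
                        [[1.0, 1.0], [1.0, 1.0]]])
gradients = np.array([[[0.5, 0.5], [0.5, 0.5]],       # alpha_0 = 0.5
                      [[-1.0, -1.0], [-1.0, -1.0]]])  # alpha_1 = -1.0
heatmap = grad_cam(activations, gradients)
print(heatmap.tolist())  # [[0.0, 0.0], [0.5, 1.0]]
```

In the example, the second feature map has a negative weight, and the ReLU removes the locations where it outweighs the first map.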

### Mathematical formulation

The complete mathematical formulation can be summarized as:

| Symbol | Meaning |
|---|---|
| $$y^c$$ | Score (logit) for class c before softmax |
| $$A^k$$ | Activation (feature map) of the k-th filter in the target convolutional layer |
| $$A^k(i, j)$$ | Activation value at spatial location (i, j) in feature map k |
| $$\alpha_k^c$$ | Importance weight for feature map k with respect to class c |
| $$L_{\text{Grad-CAM}}^c$$ | The resulting Grad-CAM heatmap for class c |
| $$Z$$ | Number of pixels in the feature map (width times height) |

The key equations are:

- Importance weight: $$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k(i, j)}$$
- Heatmap: $$L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$

### Properties of the heatmap

The output heatmap has the same spatial resolution as the chosen convolutional layer's feature maps, which is typically much smaller than the input image (for example, 14x14 for the last convolutional layer of [VGG-16](/wiki/vgg) with a 224x224 input). To overlay the heatmap on the original image, it is upsampled (usually via bilinear interpolation) to match the input resolution.
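The upsampling step can be sketched without any imaging library. The version below is a minimal illustration that assumes an "align corners" convention (the corner values of the small map land exactly on the corners of the large one); library resize functions may use a different convention by default. Both function names are hypothetical.

```python
import numpy as np

def upsample_bilinear(heatmap, out_h, out_w):
    """Bilinear upsampling of a 2-D heatmap (align-corners convention)."""
    in_h, in_w = heatmap.shape
    # Output sample positions expressed in input coordinates.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    # Bilinear interpolation is separable: interpolate each row along the
    # width, then each resulting column along the height.
    rows = np.stack([np.interp(xs, np.arange(in_w), row) for row in heatmap])
    return np.stack(
        [np.interp(ys, np.arange(in_h), col) for col in rows.T], axis=1
    )

def normalize_heatmap(heatmap, eps=1e-8):
    """Rescale a heatmap to the range [0, 1] for display."""
    low, high = heatmap.min(), heatmap.max()
    return (heatmap - low) / (high - low + eps)

small = np.array([[0.0, 1.0], [2.0, 3.0]])
large = upsample_bilinear(small, 3, 3)
display = normalize_heatmap(large)
```

The normalized map can then be passed through a color map and blended with the input image.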

## How is Grad-CAM different from CAM?

Selvaraju et al. proved that Grad-CAM is a strict generalization of CAM. When applied to a network with the exact architecture that CAM requires (last convolutional layer followed by GAP and a single fully connected layer), the Grad-CAM weights $$\alpha_k^c$$ reduce to the learned weights $$w_k^c$$ used in CAM, up to a constant factor that disappears when the heatmap is normalized for display [1]. In other words, CAM is a special case of Grad-CAM.
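The reduction can be checked directly from the definitions used in this article (a short derivation sketch, with $$Z$$ the number of spatial locations). For a GAP-plus-linear network the class score is

$$
y^c = \sum_k w_k^c \, \frac{1}{Z} \sum_i \sum_j A^k(i, j)
$$

so every spatial location of feature map k receives the same gradient:

$$
\frac{\partial y^c}{\partial A^k(i, j)} = \frac{w_k^c}{Z}
$$

Averaging this constant over the $$Z$$ locations leaves it unchanged:

$$
\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{w_k^c}{Z} = \frac{w_k^c}{Z}
$$

Hence $$\sum_k \alpha_k^c A^k = \frac{1}{Z} \sum_k w_k^c A^k$$: before the ReLU, the Grad-CAM map is the CAM weighted sum scaled by the constant $$1/Z$$, which does not change the heatmap once it is normalized for display.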

The practical difference is reach: CAM only works on the narrow GAP-plus-single-linear-layer architecture and otherwise requires retraining, whereas Grad-CAM applies to "a wide variety of CNN model-families ... without any architectural changes or re-training" [1]. This is the property that turned class activation mapping from a method tied to one network design into a general-purpose explanation tool.

| Aspect | CAM (2016) | Grad-CAM (2017) |
|---|---|---|
| Source of weights | Learned GAP-to-class weights | Gradients of class score w.r.t. feature maps |
| Architecture requirement | GAP + single FC layer only | Any differentiable CNN |
| Retraining needed | Yes, if architecture differs | No |
| Relationship | Special case | Strict generalization |

## What is Guided Grad-CAM?

One drawback of Grad-CAM is that its heatmaps are coarse due to the low spatial resolution of the final convolutional layer. Conversely, pixel-space gradient visualization methods such as Guided [Backpropagation](/wiki/backpropagation) produce high-resolution, fine-grained visualizations but are not class-discriminative (they highlight edges and textures regardless of the target class) [9].

Guided Grad-CAM combines the strengths of both approaches through element-wise multiplication:

$$
L_{\text{Guided-Grad-CAM}}^c = L_{\text{Guided-Backprop}} \odot \mathrm{upsample}(L_{\text{Grad-CAM}}^c)
$$

The result is a visualization that is simultaneously high-resolution (capturing fine-grained pixel-level detail) and class-discriminative (highlighting only the regions relevant to the target class) [1]. The Grad-CAM paper describes this combination as creating "high-resolution class-discriminative" visualizations [1].

Guided Backpropagation works by modifying standard backpropagation so that only positive gradients are propagated through ReLU layers during the backward pass [9]. When combined with Grad-CAM, it produces sharper, more interpretable saliency maps.
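The two ingredients described above can be sketched as follows: the modified backward rule for a single ReLU, and the final element-wise product. The function names and the `(H, W, C)` image layout are assumptions for illustration; a full Guided Backpropagation pass would apply the ReLU rule at every ReLU in the network.

```python
import numpy as np

def guided_relu_backward(forward_input, grad_output):
    """Backward pass through one ReLU under Guided Backpropagation.

    A gradient is passed on only where the forward input was positive
    (the standard ReLU rule) AND the incoming gradient is positive.
    """
    positive_forward = forward_input > 0
    positive_gradient = grad_output > 0
    return grad_output * positive_forward * positive_gradient

def guided_grad_cam(guided_backprop, grad_cam_upsampled):
    """Element-wise product of a pixel-space map and an upsampled heatmap.

    guided_backprop:    Guided Backpropagation output, shape (H, W, C).
    grad_cam_upsampled: Grad-CAM heatmap at input resolution, shape (H, W).
    """
    if guided_backprop.shape[:2] != grad_cam_upsampled.shape:
        raise ValueError("heatmap must be upsampled to the image resolution")
    # Broadcast the single-channel heatmap over the image channels.
    return guided_backprop * grad_cam_upsampled[..., np.newaxis]
```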

However, research by Adebayo et al. (2018) showed that Guided Backpropagation (and by extension, Guided Grad-CAM) fails certain "sanity checks" for saliency methods [8]. Specifically, the output of Guided Backpropagation does not change meaningfully when model weights are randomized, suggesting it may function more as an edge detector than as a true model explanation. Grad-CAM on its own does pass these sanity checks [8].
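One way to quantify such a check is to compare the explanation from the trained model with the explanation from a copy whose weights have been randomized, using a rank correlation between the two maps. The helper below is a simplified sketch of that comparison only: the function name is hypothetical, ties are broken by position rather than averaged, and no pass/fail threshold is implied.

```python
import numpy as np

def rank_correlation(map_a, map_b):
    """Spearman-style rank correlation between two heatmaps.

    Values near 1 mean the two maps rank pixels almost identically, which
    is a warning sign if one of them came from a model with randomized
    weights. The result is NaN if either map is constant.
    """
    # argsort of argsort converts values to ranks (ties broken by position).
    ranks_a = np.argsort(np.argsort(map_a.ravel())).astype(float)
    ranks_b = np.argsort(np.argsort(map_b.ravel())).astype(float)
    # Pearson correlation of the ranks.
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])
```

A saliency method that depends on the learned weights should show a clear drop in this similarity after randomization.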

## Variants and extensions

Several methods have extended or modified the Grad-CAM approach to address its limitations or provide alternative formulations.

### Grad-CAM++

Grad-CAM++ was proposed by Chattopadhay et al. (2018) and presented at the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839-847 [4]. It addresses a limitation of Grad-CAM: when multiple instances of a target class appear in a single image, Grad-CAM tends to highlight only the most prominent one. Grad-CAM++ provides better localization of multiple instances and more complete coverage of objects [4].

The key difference is that Grad-CAM++ replaces the simple global average of gradients with a pixel-wise weighted combination of the positive partial derivatives of the last convolutional layer's feature maps, using higher-order (second and third) partial derivatives:

$$
w_k^c = \sum_{i,j} \alpha_k^c(i, j) \cdot \mathrm{ReLU}\left(\frac{\partial y^c}{\partial A^k(i, j)}\right)
$$

where the pixel-wise weight $$\alpha_k^c(i, j)$$ is computed using second-order and third-order derivatives of the class score with respect to the activations. This weighted scheme provides a more nuanced importance measure that accounts for the spatial distribution of gradients across the feature map [4].
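In practice the higher-order derivatives are rarely computed explicitly. The sketch below uses the closed form commonly found in implementations, which assumes the class score is passed through an exponential so that the second and third derivatives become the square and cube of the first-order gradient. That assumption, the function name, the `(K, H, W)` layout, and the division guard are not stated in the text above and should be read as illustrative.

```python
import numpy as np

def grad_cam_plus_plus(activations, gradients):
    """Grad-CAM++ heatmap under the exponential-output closed form.

    activations: feature maps A^k, shape (K, H, W).
    gradients:   first-order gradients of the class score, shape (K, H, W).
    """
    g2 = gradients ** 2
    g3 = g2 * gradients
    # Sum of each feature map over its spatial locations, shape (K, 1, 1).
    sum_act = activations.sum(axis=(1, 2), keepdims=True)
    denom = 2.0 * g2 + sum_act * g3
    # Guard against division by zero; weights are zero where the gradient is.
    safe_denom = np.where(denom != 0.0, denom, 1.0)
    alpha = np.where(gradients != 0.0, g2 / safe_denom, 0.0)   # (K, H, W)
    # Per-map weight: pixel-wise alpha times the positive part of the gradient.
    weights = (alpha * np.maximum(gradients, 0.0)).sum(axis=(1, 2))
    cam = np.tensordot(weights, activations, axes=(0, 0))      # (H, W)
    return np.maximum(cam, 0.0)
```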

### Score-CAM

Score-CAM, introduced by Wang et al. (2020) at the IEEE CVPR Workshop, is a gradient-free alternative to Grad-CAM [5]. Instead of relying on gradient information, Score-CAM determines the importance of each feature map by measuring how much it contributes to the model's output when used as a mask on the input image.

The Score-CAM procedure:

1. Extract all activation maps from the target convolutional layer.
2. Upsample each activation map to the input image size and normalize it to create a mask.
3. Apply each mask to the original input image to produce a set of masked inputs.
4. Forward pass each masked input through the model to obtain class scores.
5. Normalize the resulting target-class scores with a softmax taken across the activation maps, and use them as weights for the corresponding activation maps.
6. Compute the final heatmap as a weighted sum followed by ReLU.

The weight formula is: $$w_k^c = \mathrm{softmax}_k(y^c(X'_k))$$, where $$X'_k$$ is the input masked by activation map k.

Score-CAM avoids issues related to noisy or vanishing gradients but is computationally more expensive because it requires one forward pass per activation map (which can number in the hundreds for modern architectures) [5].
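The procedure can be sketched with the model abstracted as a callable. In the example below, `score_fn` is a hypothetical stand-in for "run the network and return the target-class score", the image is a single-channel 2-D array, and nearest-neighbor upsampling replaces the smoother interpolation a real implementation would use; all three are simplifying assumptions.

```python
import numpy as np

def score_cam(image, activations, score_fn):
    """Score-CAM heatmap at feature-map resolution.

    image:       input of shape (H, W).
    activations: feature maps of shape (K, h, w), with H and W divisible
                 by h and w respectively.
    score_fn:    callable mapping an (H, W) image to a scalar class score.
    """
    num_maps, h, w = activations.shape
    H, W = image.shape
    if H % h or W % w:
        raise ValueError("image size must be a multiple of the feature-map size")
    scores = np.empty(num_maps)
    for k in range(num_maps):
        # Upsample the activation map to the input size (nearest neighbor).
        mask = np.kron(activations[k], np.ones((H // h, W // w)))
        # Normalize the mask to [0, 1]; a constant map becomes all zeros.
        span = mask.max() - mask.min()
        mask = (mask - mask.min()) / span if span > 0 else np.zeros_like(mask)
        # One forward pass per activation map on the masked input.
        scores[k] = score_fn(image * mask)
    # Softmax across the K activation maps (shifted for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    cam = np.tensordot(weights, activations, axes=(0, 0))
    return np.maximum(cam, 0.0)
```

The loop makes the cost visible: a layer with several hundred channels needs several hundred forward passes per explanation.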

### LayerCAM

LayerCAM was proposed by Jiang et al. (2021) and published in IEEE Transactions on Image Processing [6]. While Grad-CAM uses globally averaged gradients as weights, LayerCAM uses spatially varying (per-pixel) positive gradients as weights:

$$
w_k^c(x, y) = \mathrm{ReLU}\left(\frac{\partial y^c}{\partial A^k(x, y)}\right)
$$

$$
L_{\text{LayerCAM}}^c(x, y) = \mathrm{ReLU}\left(\sum_k w_k^c(x, y) A^k(x, y)\right)
$$

This approach enables the generation of class activation maps from any convolutional layer, not just the final one. By aggregating maps from multiple layers (shallow to deep), LayerCAM produces higher-quality localization maps that combine fine-grained detail from early layers with semantic information from deeper layers [6].
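For a single layer, the two LayerCAM equations translate almost literally into array code. This is a minimal sketch assuming activations and gradients are already available as `(K, H, W)` arrays; the function name is illustrative, and the multi-layer aggregation step is not shown.

```python
import numpy as np

def layer_cam(activations, gradients):
    """LayerCAM heatmap for one layer, one image, and one target class.

    Unlike Grad-CAM, the gradients are not averaged: each spatial location
    of each feature map gets its own non-negative weight.
    """
    weights = np.maximum(gradients, 0.0)              # w_k^c(x, y)
    weighted_sum = (weights * activations).sum(axis=0)
    return np.maximum(weighted_sum, 0.0)              # shape (H, W)
```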

### Eigen-CAM

Eigen-CAM was introduced by Muhammad and Yeasin (2020) at the International Joint Conference on Neural Networks (IJCNN) [7]. It takes a fundamentally different approach by using [principal component analysis](/wiki/principal_component_analysis_pca) (PCA) on the feature maps rather than gradient information.

Eigen-CAM computes the first principal component of the activations from the target convolutional layer and uses it as the class activation map. Because it does not depend on gradients, backpropagation, class scores, or any form of feature weighting, it is computationally simple and robust to errors made by the fully connected layers [7].

However, since Eigen-CAM does not use class-specific information, the resulting heatmap is the same regardless of which class is being analyzed, which limits its usefulness for class-discriminative analysis.
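The class independence is easy to see in code, because no class score or gradient appears anywhere. The sketch below obtains the first principal direction from a singular value decomposition of the flattened activations; the function name, the absence of mean-centering, and the sign convention (the SVD leaves the sign undetermined, so it is flipped here to make the map sum non-negative) are assumptions of this example.

```python
import numpy as np

def eigen_cam(activations):
    """Eigen-CAM style map from feature maps of shape (K, H, W)."""
    num_maps, height, width = activations.shape
    # One row per spatial location, one column per feature map.
    flat = activations.reshape(num_maps, height * width).T    # (H*W, K)
    # Rows of vt are the right singular vectors; vt[0] is the first one.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    projection = flat @ vt[0]                                 # (H*W,)
    # Resolve the sign ambiguity of the SVD (illustrative convention).
    if projection.sum() < 0:
        projection = -projection
    return projection.reshape(height, width)
```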

### Comparison of CAM variants

| Method | Year | Gradient-based | Class-specific | Multi-instance | Resolution | Computational cost |
|---|---|---|---|---|---|---|
| CAM | 2016 | No (uses learned weights) | Yes | Limited | Coarse | Low |
| Grad-CAM | 2017 | Yes | Yes | Limited | Coarse | Low |
| Guided Grad-CAM | 2017 | Yes | Yes | Limited | High | Low |
| Grad-CAM++ | 2018 | Yes (higher-order) | Yes | Better | Coarse | Moderate |
| Score-CAM | 2020 | No (perturbation-based) | Yes | Good | Coarse | High |
| Eigen-CAM | 2020 | No (PCA-based) | No | N/A | Coarse | Low |
| LayerCAM | 2021 | Yes (per-pixel) | Yes | Good | Flexible | Low |

## What is Grad-CAM used for?

Grad-CAM and its variants have been applied across a wide range of domains and tasks, principally for model debugging, building trust, and weakly-supervised localization.

### Image classification debugging and dataset bias

The original Grad-CAM paper demonstrated that the method can reveal when a model is making predictions based on spurious correlations rather than meaningful features [1]. In the paper's case study, the authors fine-tuned an ImageNet-pretrained VGG-16 model to classify "doctor" versus "nurse" using images returned by an image search engine. Grad-CAM showed that the model was distinguishing the classes by looking at the person's face and hairstyle rather than at medical context, because the training set was skewed: roughly 78% of the "doctor" images were of men and 93% of the "nurse" images were of women [1]. After the authors rebalanced the dataset to reduce this gender bias and retrained the model, its test accuracy improved from about 82% to about 90%, and the Grad-CAM explanations shifted to focus on clothing and the stethoscope [1]. This example is widely cited as a demonstration of how visual explanations can surface dataset bias.

### Medical imaging

Grad-CAM has been widely adopted in medical imaging for interpreting [deep learning](/wiki/deep_learning) models used in clinical diagnosis. Applications include:

- **Radiology:** Highlighting regions of chest X-rays or CT scans that a model uses to detect pneumonia, lung cancer, or COVID-19.
- **Dermatology:** Visualizing which skin regions contribute to lesion classification decisions, helping verify whether a model focuses on the lesion itself rather than surrounding artifacts.
- **Ophthalmology:** Identifying retinal regions associated with disease predictions in fundus images.
- **Pathology:** Localizing suspicious tissue regions in histopathology slides.

CAM-based methods are among the most commonly used explainability techniques in medical AI, with Grad-CAM and Grad-CAM++ being the most frequently cited. However, researchers have raised concerns about the reliability of gradient-based saliency maps in clinical settings, noting that no single saliency method has been shown to satisfy all trustworthiness criteria consistently across different imaging modalities and patient populations.

### Visual question answering and image captioning

Grad-CAM can be applied to multimodal tasks by computing gradients with respect to the CNN feature maps used in the visual branch of the model. In [visual question answering](/wiki/visual_question_answering_models) (VQA), Grad-CAM can show which image regions the model focuses on when answering a specific question. For image captioning, it can reveal the spatial attention for each word in the generated caption. The original paper notes that "even non-attention based models can localize inputs" in these settings [1].

### Object detection and segmentation

Modern implementations of Grad-CAM (such as the pytorch-grad-cam library) support application to object detection models like [YOLO](/wiki/yolo) and [DETR](/wiki/detr), as well as semantic [segmentation](/wiki/image_segmentation) models [12]. In these settings, Grad-CAM generates per-object or per-region heatmaps showing the spatial features driving each prediction.

### Building trust in models

A central motivation of the paper was helping people calibrate trust in model predictions. Through human studies, the authors showed that Grad-CAM "helps untrained users successfully discern a 'stronger' nodel [sic] from a 'weaker' one even when both make identical predictions" [1]. This finding established visual explanation as a tool for human-AI trust, not only for developer debugging.

### Model comparison and selection

Grad-CAM visualizations can be used to compare different model architectures or training strategies. By examining which regions different models attend to for the same input, researchers can gain insight into how architectural choices or training procedures affect learned representations.

## How does Grad-CAM compare with other explainability methods?

Grad-CAM belongs to the broader family of post-hoc visual explanation methods. It is useful to compare it with other approaches in this space.

| Method | Type | Model-agnostic | Granularity | Computational cost | Class-discriminative |
|---|---|---|---|---|---|
| Grad-CAM | Gradient-based, activation map | CNN-specific | Coarse | Low | Yes |
| [SHAP](/wiki/shap) (KernelSHAP / DeepSHAP) | Shapley value-based | KernelSHAP: yes; DeepSHAP: deep networks only | Pixel-level | High | Yes |
| LIME | Perturbation-based | Yes | Superpixel-level | Moderate | Yes |
| Saliency Maps (vanilla gradient) | Gradient-based | Differentiable models | Pixel-level | Low | Yes |
| Integrated Gradients | Gradient-based, path method | Differentiable models | Pixel-level | Moderate | Yes |
| SmoothGrad | Gradient-based, noise-averaged | Differentiable models | Pixel-level | Moderate | Yes |
| Guided Backpropagation | Modified backpropagation | CNN-specific | Pixel-level | Low | No |

Grad-CAM is generally preferred when a quick, coarse localization of important regions is sufficient. LIME and the kernel-based form of SHAP are model-agnostic and can be applied beyond CNNs, but they are more computationally expensive because they require many forward passes on perturbed inputs [13]. Gradient-based pixel-level methods (vanilla gradients, Integrated Gradients, SmoothGrad) provide finer-grained attribution but can be noisier and harder to interpret visually [10].

## What are the limitations of Grad-CAM?

Despite its popularity, Grad-CAM has several known limitations:

- **Coarse resolution.** The heatmap resolution is determined by the feature map size of the target convolutional layer, which is typically much smaller than the input image. This means Grad-CAM cannot provide fine-grained, pixel-level attribution on its own.
- **Single-instance focus.** When multiple instances of a class appear in an image, Grad-CAM tends to highlight only the most discriminative one, potentially missing others. Grad-CAM++ and LayerCAM partially address this limitation [4][6].
- **Gradient noise and saturation.** Gradients can be noisy, especially in deep networks, and they can vanish where a ReLU unit is inactive or where the output saturates. Either effect can lead to unreliable importance weights.
- **False emphasis.** Large gradient values can co-occur with low activation values, producing misleading heatmaps where regions are highlighted despite having minimal influence on the actual prediction.
- **Layer selection sensitivity.** The choice of which convolutional layer to use for computing Grad-CAM affects the output. While the last convolutional layer is most commonly used (as it captures the highest-level semantic information), this choice is somewhat arbitrary and may not always produce the most informative heatmap.
- **Potential unfaithfulness.** Draelos and Carin (2020) found that Grad-CAM can highlight regions that the model did not actually use for its prediction. They proposed HiResCAM as an alternative that provides provably faithful explanations by multiplying gradients and activations element-wise instead of first averaging the gradients over the spatial dimensions [11].
- **Guided Grad-CAM sanity check failures.** While Grad-CAM itself passes the sanity checks proposed by Adebayo et al. (2018), Guided Grad-CAM does not. This means the high-resolution component of Guided Grad-CAM may not be a reliable model explanation [8].

## Software implementations

Several open-source libraries provide implementations of Grad-CAM and its variants:

| Library | Language / Framework | Supported methods |
|---|---|---|
| pytorch-grad-cam | [PyTorch](/wiki/pytorch) | Grad-CAM, Grad-CAM++, Score-CAM, LayerCAM, Eigen-CAM, and others |
| tf-explain | [TensorFlow](/wiki/tensorflow) / Keras | Grad-CAM, Vanilla Gradients, SmoothGrad |
| keras-vis | Keras | Grad-CAM, saliency maps, class activation maps |
| Captum | PyTorch | Grad-CAM, Integrated Gradients, SHAP, and many others |

The pytorch-grad-cam library by Jacob Gildenblat is one of the most widely used implementations. It supports a broad range of CAM methods and architectures, including [Vision Transformers](/wiki/vision_transformer_vit), object detection models, and segmentation models [12].

## See also

- [Explainable AI](/wiki/explainable_ai)
- [Interpretability](/wiki/interpretability)
- [Saliency map](/wiki/saliency_map)
- [Convolutional neural network](/wiki/convolutional_neural_network)
- [SHAP](/wiki/shap)

## References

1. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 618-626. arXiv:1610.02391.

2. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2020). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization." International Journal of Computer Vision, 128(2), 336-359.

3. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). "Learning Deep Features for Discriminative Localization." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2921-2929.

4. Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V.N. (2018). "Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks." IEEE Winter Conference on Applications of Computer Vision (WACV), 839-847.

5. Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., & Hu, X. (2020). "Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 111-119.

6. Jiang, P.T., Zhang, C.B., Hou, Q., Cheng, M.M., & Wei, Y. (2021). "LayerCAM: Exploring Hierarchical Class Activation Maps for Localization." IEEE Transactions on Image Processing, 30, 5875-5888.

7. Muhammad, M.B. & Yeasin, M. (2020). "Eigen-CAM: Class Activation Map using Principal Components." International Joint Conference on Neural Networks (IJCNN), 1-7.

8. Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., & Kim, B. (2018). "Sanity Checks for Saliency Maps." Advances in Neural Information Processing Systems (NeurIPS), 31.

9. Springenberg, J.T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). "Striving for Simplicity: The All Convolutional Net." ICLR Workshop.

10. Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." ICLR Workshop.

11. Draelos, R.L. & Carin, L. (2020). "Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks." arXiv preprint arXiv:2011.08891.

12. Gildenblat, J. et al. (2021). "PyTorch library for CAM methods." GitHub repository: pytorch-grad-cam. https://github.com/jacobgil/pytorch-grad-cam

13. Ribeiro, M.T., Singh, S., & Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135-1144.