Integrated Gradients
Last reviewed
Apr 28, 2026
Sources
30 citations
Review status
Source-backed
Revision
v1 · 4,201 words
Integrated Gradients (IG) is a model interpretability technique introduced by Mukund Sundararajan, Ankur Taly, and Qiqi Yan in their 2017 ICML paper "Axiomatic Attribution for Deep Networks." [1] The method assigns an attribution score to every input feature of a neural network prediction, indicating how much that feature contributed to the model's output relative to a chosen reference input known as a baseline. Unlike earlier saliency map techniques that rely on raw gradients evaluated at a single point, IG averages gradients along a straight-line path from the baseline to the actual input. This integration step is what gives the method its name.
IG sits within the broader field of explainable AI and attribution methods, and it has become one of the most widely cited gradient-based attribution techniques for deep learning models. The paper's main contribution is an axiomatic framework: the authors defined two desirable properties for any attribution method, Sensitivity and Implementation Invariance, showed that most earlier methods violate at least one of them, and showed that IG satisfies both. They further showed that, within the family of path integral methods, IG is the unique symmetry-preserving choice. This grounding is closely tied to the Aumann-Shapley value from cooperative game theory, which generalizes the discrete Shapley value to continuous settings.
IG has been adopted by major libraries including Captum for PyTorch, the Saliency library for TensorFlow, Alibi, and SHAP. It has been applied to image classification on networks like Inception and ResNet trained on ImageNet, to natural language tasks with BERT and other transformer models, to tabular data in finance and healthcare, and to genomics for identifying regulatory motifs in DNA. [2][3]
Let F: R^n -> R be a differentiable function representing a neural network output (for example, a class probability or logit). Let x be the input of interest and x' be a chosen baseline. The Integrated Gradient for input feature i is defined as the path integral of the partial derivative of F with respect to feature i, taken along the straight-line interpolation between x' and x:
$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha$$
In words: scale each input dimension by the difference between the input and baseline values, and multiply by the average partial derivative of the model along the linear path between them. The factor (x_i - x'_i) ensures the attribution is zero when an input feature equals the baseline, and the integral averages out local fluctuations in the gradient caused by saturated activations or sharp nonlinearities. [1]
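For a linear model the definition collapses to a closed form, which makes a handy sanity check: the gradient equals the weight vector everywhere on the path, so IG_i = w_i * (x_i - x'_i). The sketch below is a minimal pure-Python illustration rather than a library implementation; the helper `integrated_gradients`, the `grad_fn` callback, and the weights are assumptions introduced here for the example.

```python
def integrated_gradients(grad_fn, x, baseline, m=50):
    """Right Riemann-sum approximation of IG for a list-valued input."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(1, m + 1):
        alpha = k / m
        # Point on the straight line from the baseline to the input
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            avg_grad[i] += g[i] / m
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


# Hypothetical linear model F(x) = w . x + c; its gradient is w everywhere.
w = [2.0, -1.0, 0.5]


def grad_linear(point):
    return w


x = [1.0, 3.0, 4.0]
baseline = [0.0, 3.0, 0.0]  # feature 1 equals its baseline value

attributions = integrated_gradients(grad_linear, x, baseline)
print(attributions)
```

Feature 1 matches its baseline value and so receives zero attribution, as the (x_i - x'_i) factor requires; the other two features receive w_i * (x_i - x'_i) up to floating-point rounding.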
The integral is rarely computed analytically. In practice, IG is approximated by a Riemann sum with m discrete steps:
$$\mathrm{IG}_i(x) \approx \frac{x_i - x'_i}{m} \sum_{k=1}^{m} \frac{\partial F\big(x' + \tfrac{k}{m}\,(x - x')\big)}{\partial x_i}$$
Each step requires one forward pass and one backward pass through the network. The original paper recommends 20 to 300 steps depending on how much the model output changes between the baseline and input. [1] For convolutional networks like Inception V3 evaluated on ImageNet, 50 steps is a common default that converges within a few percent of completeness.
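To get a feel for how the step count affects accuracy, the following sketch measures the relative completeness gap at a few values of m. It is only an illustration: the two-feature tanh model, its hand-written gradient, and the helper names are assumptions made for this example, and real networks may converge faster or slower.

```python
import math


def model(x):
    # Hypothetical two-feature model used only for this illustration
    return math.tanh(x[0] + 2.0 * x[1])


def grad(x):
    s = 1.0 - math.tanh(x[0] + 2.0 * x[1]) ** 2  # derivative of tanh
    return [s, 2.0 * s]


def integrated_gradients(grad_fn, x, baseline, m):
    """Right Riemann-sum approximation of IG with m steps."""
    avg_grad = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        avg_grad = [a + gi / m for a, gi in zip(avg_grad, grad_fn(point))]
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


x, baseline = [1.0, 1.0], [0.0, 0.0]
target = model(x) - model(baseline)

gaps = {}
for m in (20, 50, 300):
    attributions = integrated_gradients(grad, x, baseline, m)
    # Completeness gap relative to the output difference F(x) - F(x')
    gaps[m] = abs(sum(attributions) - target) / abs(target)
    print(f"m={m:3d}  relative completeness gap={gaps[m]:.4f}")
```

For this smooth toy function the gap shrinks roughly in proportion to 1/m, which is the expected behaviour of a one-sided Riemann sum.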
Consider a one-dimensional ReLU network F(x) = max(0, x - 0.5) evaluated at x = 1.0 with baseline x' = 0.0. The model outputs are F(1.0) = 0.5 and F(0.0) = 0.0. The gradient dF/dx is 0 for x < 0.5 and 1 for x > 0.5. Along the path from 0 to 1 the gradient equals 1 for half the path and 0 for the other half, so the path-averaged gradient is 0.5. Multiplying by (x - x') = 1 gives an attribution of 0.5, exactly equal to F(x) - F(x'). Vanilla saliency, by contrast, would report a gradient of 1 evaluated at x = 1.0, ignoring the region between 0 and 0.5 where the function is flat.
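The one-dimensional example above can be checked numerically. This is a small sketch in plain Python; the helper `ig_1d` and the convention of taking the derivative to be 0 at the kink x = 0.5 are assumptions made here.

```python
def relu_model(x):
    return max(0.0, x - 0.5)


def relu_grad(x):
    # Derivative of max(0, x - 0.5); taken to be 0 at the kink x = 0.5
    return 1.0 if x > 0.5 else 0.0


def ig_1d(grad_fn, x, baseline, m=1000):
    """Right Riemann-sum IG for a scalar input."""
    total = sum(grad_fn(baseline + (k / m) * (x - baseline))
                for k in range(1, m + 1))
    return (x - baseline) * total / m


x, baseline = 1.0, 0.0
ig = ig_1d(relu_grad, x, baseline)
saliency = relu_grad(x)
output_change = relu_model(x) - relu_model(baseline)

print("IG attribution:", ig)
print("vanilla gradient at x:", saliency)
print("F(x) - F(x'):", output_change)
```

The IG attribution matches the output change of 0.5, while the raw gradient at the input reports 1.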
The central contribution of the IG paper is the introduction of two axioms that any attribution method should satisfy. The authors argue that prior literature had judged methods primarily by visual appeal rather than by formal properties, making it hard to know whether a saliency map captures model behavior or merely produces plausible-looking heatmaps. [1]
The Sensitivity (a) axiom states: if an input x and baseline x' differ in a single feature, and the model output differs between them, then that feature must receive a non-zero attribution. The Sensitivity (b) axiom adds the converse condition that if a function does not mathematically depend on a variable, then attribution to that variable should be zero.
Vanilla saliency map methods, which compute dF/dx at the input point alone, can fail Sensitivity (a) because of saturated activations. In deep networks with many ReLU units, the gradient of a class score with respect to an input pixel can be exactly zero even when changing the pixel from black to its observed value would change the prediction. IG circumvents this by averaging gradients along the interpolation path. Because IG attributions sum to F(x) - F(x') (the Completeness property), a feature that is the only difference between input and baseline receives the entire output difference, which is non-zero whenever the outputs differ. [1]
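A one-variable saturating function makes the Sensitivity (a) failure concrete. The sketch below uses f(x) = 1 - ReLU(1 - x), which is flat for x >= 1; the function names and the step count are assumptions chosen for this illustration.

```python
def f(x):
    return 1.0 - max(0.0, 1.0 - x)


def grad_f(x):
    # Slope 1 while x < 1, then flat (saturated) for x >= 1
    return 1.0 if x < 1.0 else 0.0


def ig_1d(grad_fn, x, baseline, m=1000):
    """Right Riemann-sum IG for a scalar input."""
    total = sum(grad_fn(baseline + (k / m) * (x - baseline))
                for k in range(1, m + 1))
    return (x - baseline) * total / m


x, baseline = 2.0, 0.0
gradient_at_input = grad_f(x)             # what vanilla saliency would see
ig_attribution = ig_1d(grad_f, x, baseline)
output_change = f(x) - f(baseline)

print("gradient at input:", gradient_at_input)
print("IG attribution:", ig_attribution)
print("F(x) - F(x'):", output_change)
```

The gradient at the input is zero even though the output changed by 1, whereas the IG attribution stays close to the full output change.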
The Implementation Invariance axiom requires that two networks computing the exact same input-output function F should produce identical attributions, even if they have different internal architectures. Two networks are functionally equivalent if F_1(x) = F_2(x) for every input x.
This axiom is satisfied automatically by methods that depend only on input gradients, since the gradient of F is a property of the function, not of its implementation. Vanilla gradients and IG both satisfy it. DeepLIFT, proposed by Shrikumar, Greenside, and Kundaje, replaces the partial derivative with a reference difference quotient and propagates modified gradients backward through the network. [4] Because DeepLIFT rules depend on the choice of nonlinearity and on how layers are decomposed, two functionally equivalent networks with different layer structures can yield different attributions. The same critique applies to LRP (Bach et al. 2015). [5]
IG is therefore distinguished by satisfying both axioms simultaneously. Sundararajan and colleagues prove that, among the family of path integration methods that integrate gradients along some path from baseline to input, the linear path is the unique choice that satisfies a stronger set of axioms including Symmetry-Preserving (symmetric inputs receive equal attribution). [1]
IG satisfies a useful accounting property called Completeness: the per-feature attributions sum exactly to the difference in model output between the input and the baseline.
$$\sum_{i} \mathrm{IG}_i(x) = F(x) - F(x')$$
This is a direct consequence of the fundamental theorem of calculus applied to the path integral, and it means the attributions can be interpreted as a decomposition of the prediction change into per-feature contributions. Completeness also makes IG comparable to SHAP and other attribution methods that produce a budgeted distribution of credit. [6]
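The step from the fundamental theorem of calculus to Completeness can be written out explicitly. The derivation below is a sketch that assumes F is differentiable almost everywhere along the path; the symbol \gamma for the straight-line path is notation introduced here, not taken from the source.

```latex
\begin{aligned}
\gamma(\alpha) &= x' + \alpha\,(x - x'), \qquad \alpha \in [0, 1] \\
\frac{d}{d\alpha} F(\gamma(\alpha))
  &= \sum_{i} \frac{\partial F(\gamma(\alpha))}{\partial x_i}\,(x_i - x'_i)
  && \text{(chain rule)} \\
F(x) - F(x')
  &= \int_{0}^{1} \frac{d}{d\alpha} F(\gamma(\alpha))\, d\alpha
  && \text{(fundamental theorem of calculus)} \\
  &= \sum_{i} (x_i - x'_i) \int_{0}^{1}
     \frac{\partial F(\gamma(\alpha))}{\partial x_i}\, d\alpha
   = \sum_{i} \mathrm{IG}_i(x)
\end{aligned}
```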
The connection to game theory runs deeper than the surface analogy. IG is mathematically equivalent to the Aumann-Shapley value for the linear path between baseline and input. Aumann and Shapley developed this in 1974 as a continuous extension of the discrete Shapley value for cost allocation where players contribute fractional amounts. [7] In game-theoretic terms, IG distributes the payout F(x) - F(x') among features in a way that satisfies Efficiency, Linearity, Symmetry, and the Dummy axioms.
Sundararajan and Najmi's 2020 paper "The many Shapley values for model explanation" later clarified the relationship between IG, KernelSHAP, baseline Shapley, and other Shapley-derived methods, showing they correspond to different choices of game formulation. [8] IG's specific choice of a deterministic linear path makes it computationally tractable in continuous settings where a true expectation over feature subsets would require exponential evaluation.
The choice of baseline x' is the most consequential modeling decision in IG and the most discussed practical issue in the literature. The baseline defines what "absent" or "neutral" features mean. Different baselines produce different attributions, sometimes dramatically so.
The original paper recommends that the baseline should produce a near-zero output: F(x') ~ 0. For an image classifier this typically means that a baseline image should not be predicted as the target class. Common baseline choices include:
| Baseline type | Description | Typical use case |
|---|---|---|
| Zero / black image | Input replaced with all zeros (or zero after normalization) | Default for image models, often used with Inception, ResNet |
| Random noise | Pixels sampled from Gaussian or uniform noise | Avoids systematic bias of black baseline; common alternative |
| Gaussian blur | Heavily blurred version of the input | Removes high-frequency content but preserves overall structure |
| Mean image | Average of training set | Represents an "average" example |
| Multiple baselines | Average IG over many baselines | Reduces baseline sensitivity, used in Expected Gradients |
| Padding token | Embedding of a special PAD or [MASK] token | Standard for transformer NLP models |
| Empty string embedding | Zero vector in embedding space | Alternative for text classification |
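The black-image baseline has been criticized because it is itself informative: a black region in a natural image is unusual, and the gradient with respect to a black pixel may emphasize edge detectors that fire on dark regions rather than truly indicating absence of the feature. [9] In addition, because each attribution is scaled by (x_i - x'_i), pixels that are black in the input receive zero attribution under a black baseline regardless of how much the model relies on them. Many practitioners now use noise or multi-baseline approaches.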
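The effect of the baseline can be seen on a toy function. The sketch below compares a zero baseline with a stand-in "mean" baseline; the model F(x) = x0 * x1 + x2, the baseline values, and the helper are all assumptions made for illustration.

```python
def grad(x):
    # Gradient of the hypothetical model F(x) = x0 * x1 + x2
    return [x[1], x[0], 1.0]


def integrated_gradients(grad_fn, x, baseline, m=1000):
    """Right Riemann-sum approximation of IG."""
    avg_grad = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        avg_grad = [a + gi / m for a, gi in zip(avg_grad, grad_fn(point))]
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


x = [0.0, 2.0, 1.0]              # feature 0 is "black" (zero) in the input
zero_baseline = [0.0, 0.0, 0.0]
mean_baseline = [1.0, 1.0, 1.0]  # stand-in for a training-set mean

attr_zero = integrated_gradients(grad, x, zero_baseline)
attr_mean = integrated_gradients(grad, x, mean_baseline)

print("zero baseline:", attr_zero)
print("mean baseline:", attr_mean)
```

Under the zero baseline, feature 0 receives no attribution because x_0 - x'_0 = 0, even though the model depends on it; under the mean baseline the same feature receives a clearly non-zero score.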
Erion and colleagues introduced Expected Gradients in 2021, which marginalizes over a distribution of baselines (typically the training distribution) by averaging IG attributions across sampled baselines. [10] This approximates the Aumann-Shapley value with respect to the data distribution and removes the need to pick a single arbitrary baseline.
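The averaging idea can be sketched in a few lines. For readability and determinism this version enumerates every baseline in a small reference set rather than sampling, which is a simplification of how such estimators are usually run; the model F(x) = x0 * x1, the reference points, and the function names are assumptions made for this example.

```python
def model(x):
    # Hypothetical model used only for this illustration
    return x[0] * x[1]


def grad(x):
    return [x[1], x[0]]


def integrated_gradients(grad_fn, x, baseline, m=500):
    """Right Riemann-sum approximation of IG."""
    avg_grad = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        avg_grad = [a + gi / m for a, gi in zip(avg_grad, grad_fn(point))]
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


def expected_gradients(grad_fn, x, baselines, m=500):
    """Average IG attributions over a collection of baselines."""
    total = [0.0] * len(x)
    for baseline in baselines:
        attr = integrated_gradients(grad_fn, x, baseline, m)
        total = [t + a for t, a in zip(total, attr)]
    return [t / len(baselines) for t in total]


x = [2.0, 3.0]
reference_set = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
attributions = expected_gradients(grad, x, reference_set)

# Completeness now holds against the mean baseline output
mean_baseline_output = sum(model(b) for b in reference_set) / len(reference_set)
gap = sum(attributions) - (model(x) - mean_baseline_output)
print(attributions, gap)
```

The averaged attributions sum, up to discretization error, to the difference between F(x) and the mean model output over the reference set.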
For text models, the baseline is usually the embedding of a PAD or zero token, since the model itself only sees continuous embeddings. Attribution is computed with respect to the embedding vectors and then summed across embedding dimensions to give a scalar score per token.
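The reduction from embedding-level attributions to token scores is a plain sum over the embedding axis. The sketch below uses made-up numbers for a three-dimensional embedding; the tokens, the values, and the helper `token_scores` are illustrative assumptions, not output from a real model.

```python
def token_scores(embedding_attributions):
    """Collapse (num_tokens, emb_dim) attributions to one score per token."""
    return [sum(dims) for dims in embedding_attributions]


tokens = ["the", "movie", "was", "great"]

# Made-up IG attributions, one row per token, one column per embedding dim
emb_attr = [
    [0.01, -0.02, 0.00],
    [0.10, 0.05, -0.03],
    [0.00, 0.01, -0.01],
    [0.40, 0.25, 0.10],
]

scores = token_scores(emb_attr)

# Rank tokens by attribution magnitude for display
ranked = sorted(zip(tokens, scores), key=lambda pair: abs(pair[1]), reverse=True)
for token, score in ranked:
    print(f"{token:>6s}  {score:+.2f}")
```

Summing keeps the sign of the attribution, so positive and negative dimensions within one token can partly cancel.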
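A reference implementation of IG looks like this in PyTorch (a minimal sketch for a single, unbatched input):

    import torch

    def integrated_gradients(model, x, x_prime, target_class, m=50):
        """Riemann-sum approximation of IG for one input tensor x."""
        # Interpolation coefficients alpha_k = k/m for k = 1..m
        alphas = torch.arange(1, m + 1, dtype=x.dtype, device=x.device) / m
        alphas = alphas.view(-1, *([1] * x.dim()))  # broadcast over feature dims
        interpolated = x_prime + alphas * (x - x_prime)  # shape (m, *x.shape)
        grads = []
        for x_alpha in interpolated:
            x_alpha = x_alpha.detach().requires_grad_(True)
            # Assumes the model takes a batch and returns (batch, num_classes)
            output = model(x_alpha.unsqueeze(0))[0, target_class]
            grad = torch.autograd.grad(output, x_alpha)[0]
            grads.append(grad)
        avg_grad = torch.stack(grads).mean(dim=0)  # path-averaged gradient
        attributions = (x - x_prime) * avg_grad
        return attributions
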
In production the interpolated inputs are batched, so a single forward pass computes outputs for all m points at once. Memory becomes a constraint at large m, and implementations often chunk the integration into mini-batches.
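Chunking the integration can be sketched as follows. The code is a framework-free illustration: `batch_grad_fn` stands in for a batched forward and backward pass, and it, the helper names, and the toy function F(x) = x0**2 + x1 are assumptions introduced here.

```python
def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def integrated_gradients_chunked(batch_grad_fn, x, baseline, m=50, chunk_size=16):
    """IG where gradients are requested one mini-batch of path points at a time."""
    alphas = [k / m for k in range(1, m + 1)]
    grad_sum = [0.0] * len(x)
    for alpha_chunk in chunked(alphas, chunk_size):
        batch = [[b + a * (xi - b) for xi, b in zip(x, baseline)]
                 for a in alpha_chunk]
        for g in batch_grad_fn(batch):  # one call per mini-batch
            grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
    return [(xi - b) * s / m for xi, b, s in zip(x, baseline, grad_sum)]


calls = []  # records the size of every mini-batch that was requested


def batch_grad(batch):
    # Hypothetical batched gradient of F(x) = x0**2 + x1
    calls.append(len(batch))
    return [[2.0 * p[0], 1.0] for p in batch]


x, baseline = [3.0, 1.0], [0.0, 0.0]
attributions = integrated_gradients_chunked(batch_grad, x, baseline)
print(calls)         # [16, 16, 16, 2]
print(attributions)
```

Only a running gradient sum is kept between chunks, so peak memory depends on `chunk_size` rather than on m.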
A common diagnostic is to verify completeness numerically: sum the attributions and compare against F(x) - F(x'). Captum exposes this via the return_convergence_delta=True argument of IntegratedGradients.attribute, which returns the gap alongside the attributions. A delta above a few percent of F(x) - F(x') suggests that more steps are required or that the model changes sharply along the path.
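One way to act on this diagnostic is to increase the step count until the gap is small. The sketch below doubles m until the gap falls under a relative tolerance; the stopping rule, the default tolerance, the toy model F(x) = x0**2 * x1, and the function names are assumptions made here rather than behaviour of any particular library.

```python
def model(x):
    # Hypothetical model used only for this illustration
    return x[0] ** 2 * x[1]


def grad(x):
    return [2.0 * x[0] * x[1], x[0] ** 2]


def integrated_gradients(grad_fn, x, baseline, m):
    """Right Riemann-sum approximation of IG."""
    avg_grad = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        avg_grad = [a + gi / m for a, gi in zip(avg_grad, grad_fn(point))]
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


def attribute_until_converged(x, baseline, m=20, rel_tol=0.01, max_m=5000):
    """Double the step count until the completeness gap is small enough."""
    target = model(x) - model(baseline)
    while True:
        attr = integrated_gradients(grad, x, baseline, m)
        delta = sum(attr) - target  # completeness gap
        if abs(delta) <= rel_tol * abs(target) or m >= max_m:
            return attr, delta, m
        m *= 2


attributions, delta, steps = attribute_until_converged([1.0, 2.0], [0.0, 0.0])
print("steps used:", steps)
print("completeness gap:", delta)
```

The `max_m` cap keeps the loop from running indefinitely when the output difference is zero or the model is too irregular for the sum to converge.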
IG requires the model to be differentiable along the path. For piecewise-linear networks (ReLU, max-pooling) the model is differentiable almost everywhere along the path, and the Riemann sum converges. Operations whose gradient is zero or undefined almost everywhere, such as argmax, need a differentiable surrogate or a gradient smoothing trick.
IG has been deployed across many domains. The table below summarizes representative use cases.
| Domain | Model architecture | Use of IG | Reference |
|---|---|---|---|
| Image classification | Inception V3, ResNet, VGG on ImageNet | Pixel-level saliency for class predictions | Sundararajan et al. 2017 [1] |
| NLP / question answering | BERT, T5, RoBERTa | Token-level attribution for answer spans and classifications | Mudrakarta et al. 2018 [11] |
| Tabular finance | MLPs and other differentiable models | Per-feature credit risk explanation | Erion et al. 2021 [10] |
| Healthcare | CNNs for medical imaging, MLPs for EHR | Clinical decision support and bias auditing | Lundberg et al. 2018 [12] |
| Genomics | DeepBind, DeepSEA, BPNet | Identifying regulatory motifs and DNA-binding sites | Avsec et al. 2021 [13] |
| Recommendation systems | Two-tower neural networks | Feature importance for item ranking | Sundararajan et al. 2019 [14] |
| Drug discovery | Graph neural networks | Atom-level attribution for molecular property prediction | Jimenez-Luna et al. 2020 [15] |
| Speech recognition | Wav2Vec, Conformer | Time-frequency attribution on spectrograms | Becker et al. 2018 [16] |
In image work, IG is usually visualized as a heatmap overlay, with positive attributions in red and negative in blue. The maps tend to be smoother than vanilla saliency because integration averages out local gradient noise. Common practice is to take absolute attributions or sum across color channels.
For NLP tasks, IG attributions to token embeddings are summed across embedding dimensions to give a scalar score per token, visualized as heatmap-colored text. Researchers have used IG to identify spurious correlations in BERT-based models, such as overreliance on stopwords. [11]
In genomics, IG and the closely related DeepLIFT are used to extract sequence motifs from convolutional networks trained on regulatory tasks. The BPNet paper, for example, uses DeepLIFT-style attributions to identify transcription factor binding motifs in DNA. [13]
Several extensions of IG have been proposed to address specific limitations.
Expected Gradients (Erion et al. 2021) replaces the single baseline with a distribution sampled from a reference set, often the training data. Averaging IG across these baselines approximates the Aumann-Shapley value with respect to the data distribution. The approach reduces baseline sensitivity and can be used as an attribution-guided regularizer during training. [10]
Blur Integrated Gradients (Xu et al. 2020) replaces the linear path with a path through progressively blurred versions of the input. The intuition is that a Gaussian-blurred image is a more semantically meaningful "absence of features" baseline than a black image. The resulting attributions are smoother and align better with human perceptual judgments on natural images. [9]
Guided Integrated Gradients (Kapishnikov et al. 2021) chooses the integration path adaptively rather than using a straight line. At each step, the path moves only the subset of features with the smallest absolute partial derivatives toward the input, rather than all features uniformly, so that it steers away from regions where gradients are large and noisy. Empirically, Guided IG produces sharper and less noisy attributions on image classifiers, at the cost of breaking the strict Aumann-Shapley interpretation. [17]
SmoothGrad-Integrated Gradients combines IG with SmoothGrad by averaging IG over multiple noisy versions of the input. [18] This reduces visual noise in the saliency map at the cost of additional compute. The combined method is sometimes called Smooth IG or SmoothGrad-IG.
Integrated Hessians (Janizek, Sturmfels, Lee 2021) extends IG to capture pairwise feature interactions by integrating second derivatives along the path. The result is a matrix of pairwise attributions that decomposes the prediction not just into per-feature credit but into per-feature-pair interactions, useful for diagnosing how a model combines features. [19]
XRAI (Kapishnikov et al. 2019) combines IG with image segmentation. After computing pixel-level IG, XRAI aggregates attributions over superpixel regions to produce region-level explanations that are more interpretable on natural images than pixel-level heatmaps. [20]
Despite its theoretical grounding, IG has documented limitations. The most influential critique is the 2018 NeurIPS paper "Sanity Checks for Saliency Maps" by Adebayo and colleagues. [21] They proposed two tests: a parameter randomization test, where weights are randomized layer by layer, and a data randomization test, where labels are shuffled. A reliable method should produce attributions that change substantially when the model is randomized. They found that several popular methods, including some IG configurations depending on the baseline, produced visually similar attributions on randomly initialized and trained networks. The conclusion was that visual similarity to edge maps does not imply faithfulness to the model. This sparked more rigorous evaluation methodology across the saliency literature.
Other limitations include:
Path dependence. The straight-line path is one of many. Path-integrated attributions depend on the chosen path, and although the linear path is the unique symmetry-preserving one, alternatives (Blur IG, Guided IG) yield different attributions. The choice rests on axiomatic arguments and visual quality, not pure empirics.
Computational cost. A single IG attribution requires 20 to 300 forward and backward passes. For a large transformer with billions of parameters, this can be prohibitive. Approximate methods like sampling-based IG reduce cost.
Baseline sensitivity. As discussed above, attributions can change significantly with the choice of baseline. There is no universally correct baseline, and different fields have developed domain-specific conventions.
Discrete inputs. For categorical or token inputs the partial derivative is undefined. The workaround is to compute IG with respect to the continuous embedding and sum across embedding dimensions. The attribution is then over the embedding, not over the original token.
Limited feature interactions. Standard IG produces an additive decomposition into per-feature contributions. Real models combine features non-additively, and this structure is invisible in standard IG. Integrated Hessians and other extensions address this gap.
Adversarial fragility. Small input perturbations that leave the prediction unchanged can produce substantially different IG attributions, so an attribution map is not always a stable reflection of what the model truly responds to. [22]
Faithfulness vs intuitiveness. Visually appealing saliency maps are not always faithful to the model. IG occupies a middle ground that is more faithful than vanilla saliency but less so than deletion-based methods.
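The path-dependence point in the list above can be made concrete with F(x) = x0 * x1, which is also the simplest case of a pure feature interaction. The sketch below integrates gradients along a piecewise-linear path using a midpoint rule; the helper `path_attributions` and the two example paths are assumptions introduced for illustration.

```python
def grad(x):
    # Gradient of the hypothetical model F(x) = x0 * x1
    return [x[1], x[0]]


def path_attributions(grad_fn, waypoints, steps_per_segment=200):
    """Integrate gradients along a piecewise-linear path through waypoints."""
    n = len(waypoints[0])
    attr = [0.0] * n
    for start, end in zip(waypoints, waypoints[1:]):
        for k in range(steps_per_segment):
            alpha = (k + 0.5) / steps_per_segment  # midpoint rule
            point = [s + alpha * (e - s) for s, e in zip(start, end)]
            g = grad_fn(point)
            for i in range(n):
                attr[i] += g[i] * (end[i] - start[i]) / steps_per_segment
    return attr


baseline, x = [0.0, 0.0], [1.0, 1.0]

# Straight line from baseline to input (the IG path)
linear = path_attributions(grad, [baseline, x])

# Axis-aligned path: move x0 to its final value first, then x1
axis_aligned = path_attributions(grad, [baseline, [1.0, 0.0], x])

print("linear path:      ", linear)
print("axis-aligned path:", axis_aligned)
```

Both paths satisfy Completeness, since each set of attributions sums to F(x) - F(x') = 1, yet only the straight line splits the credit equally between the two symmetric features; the axis-aligned path assigns everything to the feature that moves last.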
The following table compares IG with other widely used attribution methods. All of them produce per-feature attributions but differ in how they obtain those scores.
| Method | Year | Approach | Implementation invariant | Completeness | Computational cost |
|---|---|---|---|---|---|
| Vanilla Saliency | 2014 | Gradient at input point | Yes | No | 1 backward pass |
| Guided Backprop | 2014 | Gradient with negative values clipped | No | No | 1 backward pass |
| Class Activation Map (CAM) | 2016 | Weighted sum of last-layer activations | No | No | 1 forward pass |
| Grad-CAM | 2017 | Gradient-weighted activation map | Partially | No | 1 backward pass |
| LRP | 2015 | Layer-wise relevance propagation rules | No | Yes (with constraints) | 1 backward pass |
| DeepLIFT | 2017 | Reference-based modified backprop | No | Yes | 1 backward pass |
| Integrated Gradients | 2017 | Path integral of gradients | Yes | Yes | 20 to 300 passes |
| SmoothGrad | 2017 | Average gradients over noisy inputs | Yes | No | n forward+backward passes |
| LIME | 2016 | Local linear model around input | Model agnostic | Approximately | Many forward passes |
| SHAP (KernelSHAP) | 2017 | Sampled Shapley values | Model agnostic | Yes | Many forward passes |
| GradientSHAP | 2017 | IG averaged over baselines and noise | Yes | Yes | Many passes |
| Expected Gradients | 2021 | IG averaged over baseline distribution | Yes | Yes | Many passes |
| Blur IG | 2020 | IG along Gaussian blur path | Yes | Yes | 20 to 300 passes |
| Guided IG | 2021 | IG along adaptively chosen path | Yes | Yes | 20 to 300 passes |
Saliency map methods (Simonyan et al. 2014) compute the absolute value of dF/dx at the input. They are cheap but suffer from gradient saturation and noise. [23]
DeepLIFT (Shrikumar et al. 2017) computes attributions by propagating reference activations backward through the network using rules that depend on the specific layer types. DeepLIFT is faster than IG (one backward pass vs many), but it sacrifices Implementation Invariance. [4]
LRP (Bach et al. 2015) propagates relevance backward through the network using conservation rules. Like DeepLIFT, it depends on the specific layer types and so is not implementation invariant. [5]
SHAP (Lundberg and Lee 2017) is a unifying framework that estimates the Shapley value of each input feature using sampling or kernel-based techniques. SHAP is model-agnostic, but its sampling cost grows quickly with the number of features. SHAP can also use IG-like techniques internally; the GradientExplainer in the SHAP library implements an IG variant. [6]
LIME (Ribeiro et al. 2016) fits a sparse linear model in the neighborhood of the input and uses its coefficients as attributions. LIME is also model-agnostic and works for any classifier, but is more sensitive to neighborhood definition and sampling. [24]
SmoothGrad (Smilkov et al. 2017) reduces gradient noise by averaging gradients over many noisy versions of the input. It is often combined with other gradient methods including IG to denoise the attribution maps. [18]
IG is supported by all major interpretability libraries.
Captum is the official PyTorch interpretability library from Meta. Its IntegratedGradients class supports batched integration, multiple baselines, and convergence diagnostics. Captum also includes LayerIntegratedGradients for attributing to intermediate layer activations. [25]
TensorFlow Saliency library (Google's PAIR team) provides a NumPy-based implementation that interoperates with TensorFlow models. It also includes XRAI, Blur IG, and Guided IG. [26]
Alibi is a Python library focused on machine learning explanation, with IntegratedGradients built on top of TensorFlow and Keras. [27]
SHAP library includes a GradientExplainer class that approximates IG by averaging gradients along a path between input and a sampled baseline from a background dataset. It is similar to Expected Gradients in spirit. [6]
InterpretML (Microsoft) wraps several interpretability methods including IG-like approaches. [28]
All implementations share the same algorithm: take an input and a baseline, generate m interpolated inputs along the linear path, compute gradients at each, and combine them via a Riemann sum scaled by (x - x'). Differences arise in batching, layer-level attribution, and convergence diagnostics. For large transformers, layer-level IG is often more useful than input-level IG because token IDs are discrete and admit neither a natural baseline nor a meaningful interpolation. Layer IG attributes to embedding outputs or attention activations, which are continuous and admit baselines such as a zero vector or a padding-token embedding.