Influence functions (machine learning)
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,282 words
An influence function is a tool for estimating how a machine learning model's predictions would change if a single training example were removed or perturbed, without retraining the model. In interpretability research, influence functions are used to trace a particular prediction back to the training data most responsible for it, providing a form of training-data attribution. The technique answers a counterfactual question: how would the trained parameters, and therefore the model's outputs, differ if a given example had been weighted slightly more or less during training? [1]
Influence functions come from robust statistics, where they were developed in the 1970s and 1980s to measure how sensitive an estimator is to small changes in the data distribution. The relevant quantity is the derivative of an estimator with respect to an infinitesimal upweighting of one data point, which indicates, among other things, how strongly a single outlier can shift the estimate.
Pang Wei Koh and Percy Liang brought the idea into modern machine learning interpretability in "Understanding Black-box Predictions via Influence Functions," which won the Best Paper Award at the International Conference on Machine Learning (ICML) in 2017. [1][2] Their formulation expresses the influence of upweighting a training point on a test-point loss as a product of three terms: the gradient of the loss at the test point, the inverse of the model's Hessian (the second-derivative matrix of the training loss with respect to the parameters), and the gradient of the loss at the training point. [1] Intuitively, a training example is influential for a given prediction when its gradient aligns, after reshaping by the inverse Hessian, with the gradient that the prediction itself produces.
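The three-term product described above can be written out explicitly. The notation here is a choice made for this article: $\hat\theta$ denotes the trained parameters, $L(z, \theta)$ the loss on example $z$, $N$ the number of training examples, and $H_{\hat\theta}$ the Hessian of the average training loss.

```latex
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}\,
      H_{\hat\theta}^{-1}\,
      \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta^{2} L(z_i, \hat\theta).
```

Removing $z$ from the training set corresponds to upweighting it by $-1/N$, so the estimated change in test loss from removal is approximately $-\tfrac{1}{N}\,\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$.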
The paper showed that this could be computed with only oracle access to gradients and Hessian-vector products, avoiding any explicit construction or inversion of the full Hessian, and that the approximation remained informative even for non-convex and non-differentiable models such as convolutional networks, where the underlying theory does not strictly hold. [1] Koh and Liang demonstrated four uses: understanding why a model makes a prediction, debugging model behavior, detecting mislabeled or corrupted training examples, and constructing training-set attacks in which small, visually imperceptible changes to training images degrade a model's behavior. [1]
The central obstacle to applying influence functions at scale is the Hessian. For a model with n parameters, the Hessian is an n by n matrix, and the method requires its inverse. For a model with billions of parameters, forming or inverting this matrix directly is intractable in both memory and computation. [3]
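The scale of the problem can be illustrated with simple arithmetic. The sketch below assumes a dense matrix stored in 32-bit floats; the helper name `dense_hessian_bytes` and the chosen parameter counts are illustrative only.

```python
def dense_hessian_bytes(num_params: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed to store a dense num_params x num_params matrix."""
    return num_params * num_params * bytes_per_entry


# Illustrative parameter counts; 52 billion matches the largest model discussed later.
for n in (10**6, 10**9, 52 * 10**9):
    terabytes = dense_hessian_bytes(n) / 1e12
    print(f"{n:.1e} parameters -> {terabytes:.3g} TB for a dense Hessian")
```

Even a one-million-parameter model would need about 4 TB under these assumptions, which is why practical methods avoid materializing the matrix at all.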
Several strategies make the computation feasible. Koh and Liang used iterative numerical methods, including conjugate gradients and a stochastic estimator (LiSSA), to compute inverse-Hessian-vector products without materializing the matrix. [1] These approaches scale poorly to the largest models, however, and they can be numerically fragile when the loss surface is far from convex. Later work therefore turned to structured approximations of the curvature matrix.
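As a rough illustration of the matrix-free idea, the following sketch solves for an inverse-Hessian-vector product with conjugate gradients, touching the Hessian only through a Hessian-vector product function. It is not Koh and Liang's code: the logistic-regression setup, the random data, and the names `conjugate_gradient` and `hvp` are assumptions made for the example.

```python
import numpy as np


def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    """Solve H x = b for symmetric positive-definite H, given only v -> H v."""
    x = np.zeros_like(b)
    r = b.copy()              # residual b - H x, with x = 0 initially
    p = r.copy()              # current search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


# Toy L2-regularized logistic regression (hypothetical data and parameters).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta = rng.normal(size=5)
probs = 1.0 / (1.0 + np.exp(-X @ theta))
curvature = probs * (1.0 - probs)   # per-example second derivative of the log loss
lam = 0.1                           # L2 strength; keeps the Hessian positive definite


def hvp(v):
    """Hessian-vector product computed without ever forming the Hessian."""
    return X.T @ (curvature * (X @ v)) / len(X) + lam * v


g = rng.normal(size=5)              # stands in for a test-point gradient
ihvp = conjugate_gradient(hvp, g)   # approximates H^{-1} g

# Only for checking on this tiny problem: build H explicitly and compare.
H = X.T @ (curvature[:, None] * X) / len(X) + lam * np.eye(5)
print(np.allclose(ihvp, np.linalg.solve(H, g), atol=1e-6))
```

The explicit `H` at the end exists only to check the answer on a five-parameter problem; in a large model only `hvp` would be available, typically through automatic differentiation.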
In practice researchers often replace the Hessian with the Gauss-Newton Hessian or the Fisher information matrix, which are positive semi-definite and better behaved, and then approximate that matrix with a factored form that is cheap to invert. [3]
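The positive semi-definiteness follows directly from the form of the Gauss-Newton Hessian. In the notation assumed here, $J_i$ is the Jacobian of the network outputs with respect to the parameters for example $i$, and $H_{\ell,i}$ is the Hessian of the loss with respect to those outputs:

```latex
G = \frac{1}{N}\sum_{i=1}^{N} J_i^{\top} H_{\ell,i}\, J_i,
\qquad
v^{\top} G\, v
  = \frac{1}{N}\sum_{i=1}^{N} (J_i v)^{\top} H_{\ell,i}\,(J_i v) \;\ge\; 0
\quad \text{for every vector } v .
```

The inequality holds whenever each $H_{\ell,i}$ is positive semi-definite, which is the case for losses that are convex in the network outputs, such as squared error and softmax cross-entropy. For softmax cross-entropy, $G$ coincides with the Fisher information matrix.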
In 2023 a team at Anthropic led by Roger Grosse published "Studying Large Language Model Generalization with Influence Functions," which scaled the method to transformer language models with up to 52 billion parameters. [3][4] Co-authors include Juhan Bae, Cem Anil, Nelson Elhage, Ethan Perez, Evan Hubinger, Sam McCandlish, Jared Kaplan, and Samuel R. Bowman. [3]
The work made the inverse-Hessian step tractable with Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC). EK-FAC builds on K-FAC (Kronecker-Factored Approximate Curvature), introduced by James Martens and Roger Grosse in 2015, which approximates each layer's curvature block as a Kronecker product of two smaller matrices, exploiting the structure of neural-network gradients. [3] EK-FAC, introduced by Thomas George and colleagues in 2018, refines K-FAC by computing the eigenbasis of the Kronecker approximation and then refitting the eigenvalues to better match the true curvature, which reduces approximation error at modest extra cost. [5] Applied to a large language model, this yields an inverse-Hessian-vector product that is orders of magnitude faster to evaluate than iterative estimators while matching their accuracy on the cases where both could be run. [3]
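The Kronecker structure and the eigenvalue correction can be sketched for a single linear layer. This is a toy illustration under stated assumptions, not the paper's implementation: random arrays stand in for real layer inputs and backpropagated gradients, the curvature target is the empirical Fisher of that one layer, and all names (`acts`, `grads`, `kron_eigen_solve`, the damping value) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_in, d_out = 500, 4, 3

# Hypothetical per-example layer inputs a_i and output gradients g_i.
acts = rng.normal(size=(n_samples, d_in))
# Scale g_i by a function of a_i so the two are dependent and K-FAC is inexact.
grads = rng.normal(size=(n_samples, d_out)) * (1.0 + np.abs(acts[:, :1]))

# K-FAC: approximate the layer's curvature block by S (x) A.
A = acts.T @ acts / n_samples        # input second-moment factor
S = grads.T @ grads / n_samples      # output-gradient second-moment factor
lam_A, Q_A = np.linalg.eigh(A)
lam_S, Q_S = np.linalg.eigh(S)
kfac_eigvals = np.outer(lam_S, lam_A)            # eigenvalues implied by K-FAC

# EK-FAC: keep the Kronecker eigenbasis, refit eigenvalues as the second
# moment of per-example weight gradients projected onto that basis.
per_example = np.einsum("no,ni->noi", grads, acts)            # G_i = g_i a_i^T
projected = np.einsum("op,noi,ij->npj", Q_S, per_example, Q_A)  # Q_S^T G_i Q_A
ekfac_eigvals = (projected ** 2).mean(axis=0)


def kron_eigen_solve(V, eigvals, damping=1e-3):
    """Apply the damped inverse curvature to a (d_out, d_in) gradient V."""
    V_rot = Q_S.T @ V @ Q_A                      # rotate into the eigenbasis
    return Q_S @ (V_rot / (eigvals + damping)) @ Q_A.T


# Compare both approximations with the exact block (feasible only at toy size).
flat = per_example.reshape(n_samples, -1)
F = flat.T @ flat / n_samples
Q = np.kron(Q_S, Q_A)
err_kfac = np.linalg.norm(F - Q @ np.diag(kfac_eigvals.ravel()) @ Q.T)
err_ekfac = np.linalg.norm(F - Q @ np.diag(ekfac_eigvals.ravel()) @ Q.T)
print(f"K-FAC error: {err_kfac:.4f}, EK-FAC error: {err_ekfac:.4f}")
```

Because both approximations share the same eigenbasis and the refitted eigenvalues are the best diagonal fit in that basis, the EK-FAC Frobenius error cannot exceed the K-FAC error in this setup. The inverse product needs only two small eigendecompositions and a few matrix multiplications per layer.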
Searching for influential examples across a pretraining corpus of many billions of tokens raises a second cost, namely computing a gradient for every candidate training sequence. The Anthropic team reduced this with techniques including TF-IDF filtering to prune obviously irrelevant documents and query batching to amortize work across multiple prompts. [3]
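The filtering idea can be illustrated with a minimal TF-IDF ranker. This is a generic sketch, not the paper's exact scoring formula: whitespace tokenization, the smoothed IDF, and the function name `tfidf_top_k` are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_top_k(query, documents, k):
    """Return indices of the k documents with the highest TF-IDF overlap with the query."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    query_tokens = set(query.lower().split())
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = sum(counts[t] / len(toks) * idf[t] for t in query_tokens if t in counts)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)[:k]


docs = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a cat and a dog played",
]
candidates = tfidf_top_k("cat dog", docs, k=2)
print(candidates)  # only these candidates would go on to the gradient computation
```

A lexical filter of this kind is cheap but biased toward documents that share tokens with the query, so it can miss influential documents that are related only conceptually.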
The paper used these methods to study how the models generalize, with several recurring observations.
| Finding | Description |
|---|---|
| Sparse influence | For a given query, influence scores roughly follow a power law: a small number of training sequences account for a disproportionate share of the total influence, while the median sequence matters little. [3] |
| Abstraction grows with scale | In smaller models the most influential documents tend to share surface tokens with the query. In larger models the top influences are more often conceptually or thematically related rather than lexically overlapping. [3][6] |
| Cross-lingual generalization | Larger models showed influence flowing across languages, so that a query in one language could be influenced by training data in another, an effect that was weak or absent in smaller models. [3][6] |
| Sensitivity to word order | Influence often collapsed to near zero when the order of key phrases in a sequence was reversed, suggesting that the attribution depends on surface form more than purely on meaning. [3] |
The authors also examined math and programming behavior and role-playing responses, and noted that influence for many behaviors is spread diffusely across a large number of training examples rather than concentrated in a few. [3]
Influence functions are used for training-data attribution, identifying likely mislabeled examples, studying memorization, and investigating which data shaped a specific model behavior. Within mechanistic interpretability and the broader interpretability program at companies such as Anthropic, the appeal is that they connect a model's outputs to concrete documents in its training set, which is relevant to questions of provenance, generalization, and safety for systems like Claude. [3]
The method has well-understood caveats. The classical derivation assumes a model trained to a unique loss minimum with an invertible Hessian, assumptions that do not hold for deep networks trained with stochastic gradient descent, so the computed scores are approximations whose reliability varies. Empirical studies have found that influence estimates can correlate only loosely with the effect of actually retraining a model after removing the data, particularly for deep non-convex models. [7] The approximations introduced for scaling, such as EK-FAC, add further error, and computing influence over a full pretraining corpus remains expensive even with the cost-reduction techniques described above. [3] These limitations mean influence functions are generally treated as an investigative tool rather than an exact accounting of causal contribution.