Score matching
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,101 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,101 words
Add missing citations, update stale details, or suggest a clearer explanation.
Score matching is a family of techniques for fitting probabilistic models in which the learning objective targets the gradient of the log-density, the so-called score function $\nabla_x \log p(x)$, rather than the density $p(x)$ itself. The approach was introduced by Aapo Hyvärinen in a 2005 paper in the Journal of Machine Learning Research, where he showed that an unnormalized probability model can be estimated consistently by minimizing the expected squared distance between the model score and the data score, and that this objective admits a tractable form that does not require the normalization constant.[1] Score matching was largely a niche technique in classical statistics and unsupervised learning for more than a decade, before being revived and extended in the late 2010s through denoising score matching, sliced score matching, and noise-conditional score networks, culminating in the score-based stochastic differential equation framework that underlies modern diffusion models.[2][3][4]
The central appeal of score matching is that it sidesteps the partition function. For an energy-based model $p_\theta(x) \propto \exp(-E_\theta(x))$, evaluating $p_\theta(x)$ or its log-likelihood requires the intractable integral $Z_\theta = \int \exp(-E_\theta(x)), dx$, but the score $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ does not depend on $Z_\theta$ at all. Score matching thus provides an alternative to Markov Chain Monte Carlo (MCMC) estimation and noise-contrastive estimation for training Boltzmann machine-style models and other generative models whose densities are known only up to a constant.[5]
Estimating a parametric density from data conventionally proceeds by maximum likelihood estimation (MLE). When the model is normalized, MLE is straightforward; when it is not, the normalization constant becomes an intractable function of the parameters, and direct likelihood evaluation requires either a closed-form integral, a Monte Carlo approximation, or contrastive training.[5] Hyvärinen's 2005 paper was motivated by precisely this difficulty in the context of Boltzmann machine-like energy models and overcomplete independent component analysis, where the partition function depends nonlinearly on every parameter and is generally not computable.[1]
The score function had appeared earlier in the statistics literature in several guises. The Fisher score is a classical object in maximum likelihood theory; the Stein identity and integration-by-parts techniques have a long history in mathematical statistics; and Tweedie's formula, attributed by Robbins (1956) to personal correspondence with M. C. K. Tweedie and revisited by Efron (2011), expresses the posterior mean under Gaussian noise as the noisy observation plus the variance times the score of the marginal density.[6] Hyvärinen's contribution was to package these elements into a generic estimator that bypassed the normalization constant entirely and was tractable for any model whose log-density admitted second derivatives.[1]
Let $p_{\text{data}}(x)$ denote the true data distribution over $x \in \mathbb{R}^d$, and let $p_\theta(x)$ denote the model. Define the Fisher divergence between data and model as $$D_F(p_{\text{data}} ,|, p_\theta) = \frac{1}{2},\mathbb{E}{x \sim p{\text{data}}}!\left[, |\nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x)|2^2 ,\right].$$ Minimizing $D_F$ over $\theta$ is called explicit score matching because it explicitly compares the two scores. The objective in this form is infeasible because the data score $\nabla_x \log p{\text{data}}(x)$ is unknown.[7]
Hyvärinen's key result is that, under mild regularity conditions (smooth densities and decay at infinity), integration by parts allows the unknown data score to be eliminated. Expanding the squared norm and using integration by parts on the cross term yields the implicit score matching objective: $$J(\theta) = \mathbb{E}{x \sim p{\text{data}}}!\left[, \tfrac{1}{2}|\nabla_x \log p_\theta(x)|2^2 + \operatorname{tr}!\left(\nabla_x^2 \log p\theta(x)\right) ,\right] + \text{const},$$ where the constant depends only on $p_{\text{data}}$ and not on $\theta$.[1] The estimator replaces the data expectation by a sample average, so training reduces to computing, for each minibatch element, the squared norm of the model score and the trace of the model Hessian of the log-density. Because $\nabla_x \log p_\theta = -\nabla_x E_\theta$, the partition function never appears.[1][5]
Hyvärinen proved that the resulting estimator is consistent under standard identifiability conditions and that it is locally asymptotically normal when the model is correctly specified.[1] In the original paper the method was validated on multivariate Gaussian models, on independent component analysis, and on an overcomplete ICA model of natural image patches whose features resembled simple cell receptive fields.[1] Later work by Lyu (2009) generalized the framework, connecting it to maximum likelihood under noise and extending it to discrete data,[8] while Kingma and LeCun (2010) introduced a regularized variant that suppressed instabilities arising from finite samples and quantized inputs.[9]
The bottleneck of the original objective is the trace of the Hessian $\operatorname{tr}(\nabla_x^2 \log p_\theta(x))$. For a $d$-dimensional input and a deep neural parameterization, computing this trace exactly costs $O(d)$ backward passes per example, because each diagonal entry of the Hessian requires a separate backpropagation. This made score matching practical for shallow models and low-dimensional data but prohibitive for, say, ImageNet-scale images. Two main strategies emerged to remove this bottleneck: denoising score matching and sliced score matching.[2][7]
Pascal Vincent's 2011 paper in Neural Computation established a connection between score matching and denoising autoencoders, proving that training a denoising autoencoder with a Gaussian noise model is equivalent (up to a constant) to performing score matching on a Parzen-window estimator of the data density.[2] Concretely, given clean data $x$ and a noise level $\sigma$, define the perturbed distribution $$q_\sigma(\tilde x \mid x) = \mathcal{N}(\tilde x; x, \sigma^2 I), \qquad q_\sigma(\tilde x) = \int q_\sigma(\tilde x \mid x), p_{\text{data}}(x), dx.$$ A score model $s_\theta(\tilde x)$ trained with the denoising score matching (DSM) objective $$\mathcal{L}{\text{DSM}}(\theta) = \mathbb{E}{x \sim p_{\text{data}}}, \mathbb{E}{\tilde x \sim q\sigma(\cdot \mid x)}!\left[, \tfrac{1}{2},|s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)|2^2 ,\right]$$ recovers the score of the noisy distribution $q\sigma$ as the optimal solution.[2] Because $q_\sigma(\tilde x \mid x)$ is a Gaussian centered at $x$, its conditional score has the closed form $-(\tilde x - x)/\sigma^2$, so the regression target is simply the scaled noise. This is the form that diffusion models adopt almost universally.[4]
The connection to Tweedie's formula makes the DSM objective particularly transparent. For Gaussian noise, Tweedie's formula states that the posterior mean of the clean signal given the noisy observation is $$\mathbb{E}[x \mid \tilde x] = \tilde x + \sigma^2 \nabla_{\tilde x} \log q_\sigma(\tilde x),$$ which directly links the optimal denoiser to the score of the marginal noisy density.[6] A network trained as a denoiser implicitly learns the score, and a network trained as a score model implicitly learns to denoise; this duality underlies most modern parameterizations of diffusion models.[4][6]
DSM has two key advantages over Hyvärinen's original method. First, it eliminates the Hessian term entirely, because the regression target is a known quantity rather than something requiring a derivative. Second, by averaging over a noisy version of the data, it provides better-conditioned training signals in regions of low data density, which is critical for sampling.[3] Its main caveat is that the learned score is the score of $q_\sigma$, not of $p_{\text{data}}$ itself; in the limit $\sigma \to 0$ the two coincide, but at any positive noise level the model captures a smoothed density.[3]
A complementary route to scaling score matching was proposed by Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon in Sliced Score Matching: A Scalable Approach to Density and Score Estimation (UAI 2019), with the arXiv preprint posted 17 May 2019.[7] The idea is to project both the model and data scores onto random directions before measuring their disagreement. For a unit vector $v$ drawn from some distribution $p_v$, define the projected Fisher divergence $$D_F^v(\theta) = \tfrac{1}{2},\mathbb{E}{v}, \mathbb{E}{x}!\left[, (v^\top \nabla_x \log p_{\text{data}}(x) - v^\top s_\theta(x))^2 ,\right].$$ Integration by parts on the cross term now produces a term $v^\top \nabla_x^2 \log p_\theta(x), v$, which is a Hessian-vector product rather than the full Hessian or its trace.[7] Hessian-vector products can be computed exactly in $O(1)$ backward passes using reverse-mode automatic differentiation, irrespective of the input dimension, so the per-step cost of sliced score matching is comparable to ordinary backpropagation.[7]
Song et al. proved that the estimator is consistent and asymptotically normal under regularity conditions, derived several practical variants (sliced score matching with variance reduction, finite-sample bias control), and demonstrated that the method scales to deep energy-based models, variational inference, and Wasserstein autoencoders that were previously inaccessible to score-based estimation.[7]
| Variant | Objective | Cost per example | Notes |
|---|---|---|---|
| Hyvärinen score matching | $\tfrac{1}{2}|s_\theta|^2 + \operatorname{tr}(\nabla_x s_\theta)$ | $O(d)$ backward passes | Exact; bottleneck is the trace. |
| Denoising score matching | $|s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)|^2$ | $O(1)$ backward passes | Recovers score of $q_\sigma$, not $p_{\text{data}}$. |
| Sliced score matching | $\tfrac{1}{2}(v^\top s_\theta)^2 + v^\top \nabla_x s_\theta v$ | $O(1)$ backward passes | Exact; uses Hutchinson-style projection. |
Through 2018 score matching remained largely a parameter-estimation technique. The decisive shift came in 2019 when Song and Ermon recognized that a learned score function could be used to generate samples by simulating Langevin dynamics that follow the score field.[3] Given $s_\theta(x) \approx \nabla_x \log p_{\text{data}}(x)$, the Langevin update $$x_{t+1} = x_t + \tfrac{\epsilon}{2}, s_\theta(x_t) + \sqrt{\epsilon}, z_t, \qquad z_t \sim \mathcal{N}(0, I)$$ defines a Markov chain whose stationary distribution, in the limit of small step size and many steps, is $p_{\text{data}}$.[3] In practice, naive Langevin sampling from a score model trained with score matching suffered from two failure modes that Song and Ermon explicitly analyzed.[3]
The first failure mode is the manifold problem: real data such as natural images typically lie on a low-dimensional manifold embedded in a high-dimensional ambient space, so the data density is degenerate and the score is undefined off the manifold. The second is low-density regions: even when the manifold issue is mitigated, the Fisher divergence weights pointwise score errors by $p_{\text{data}}(x)$, so errors in regions where $p_{\text{data}}$ is small are largely invisible to the training objective, yet these regions are exactly where Langevin chains spend most of their time when initialized from random noise.[3]
The cure proposed in Generative Modeling by Estimating Gradients of the Data Distribution (Song and Ermon, NeurIPS 2019; arXiv:1907.05600, submitted 12 July 2019) is to perturb the data with a sequence of Gaussian noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$ and to train a single neural network, the Noise-Conditional Score Network (NCSN), to estimate the score of each perturbed distribution $q_{\sigma_i}$ simultaneously.[3] The network takes as input both $x$ and the noise level $\sigma_i$, and is trained with a denoising score matching loss summed over noise levels with appropriate per-level weights.[3]
Sampling then proceeds via annealed Langevin dynamics: the chain is initialized with random noise, run with the score model evaluated at the highest noise level $\sigma_1$ for some number of steps, then continued with the next-highest noise level, and so on down to $\sigma_L$, which is set small enough that $q_{\sigma_L} \approx p_{\text{data}}$.[3] The largest noise level smooths the distribution enough that low-density regions are populated and Langevin chains mix freely; smaller noise levels successively sharpen the samples toward the data manifold.[3]
NCSN with annealed Langevin dynamics achieved an Inception score of 8.87 on CIFAR-10, the state of the art for unconditional generation at publication time, and demonstrated compelling unconditional samples on MNIST and CelebA together with inpainting applications.[3]
Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole then unified NCSN and the contemporaneous denoising diffusion probabilistic model (DDPM, Ho et al. 2020) in Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021; arXiv:2011.13456, submitted 26 November 2020).[4] Rather than working with a discrete sequence of noise levels, the paper takes a continuous-time limit in which the data is gradually corrupted into noise by a forward stochastic differential equation (SDE) $$dx = f(x, t), dt + g(t), dw,$$ where $w$ is a standard Wiener process. By a classical result of Brian Anderson (1982), the reverse-time process is also an SDE, with drift modified by the score of the forward marginals: $$dx = \left[f(x, t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t), d\bar w,$$ where $\bar w$ is a reverse-time Wiener process and $p_t$ is the marginal density at time $t$.[4][10] Sampling reduces to learning $\nabla_x \log p_t(x)$ via denoising score matching applied over a continuum of noise scales, then numerically integrating the reverse SDE backward in time from $t = T$ (pure noise) to $t = 0$ (data).[4]
Song et al. identified three families of forward SDEs.[4]
| SDE | Forward dynamics | Discrete analog |
|---|---|---|
| Variance Exploding (VE) | $dx = \sqrt{d\sigma^2(t)/dt}, dw$ | NCSN. |
| Variance Preserving (VP) | $dx = -\tfrac{1}{2}\beta(t) x, dt + \sqrt{\beta(t)}, dw$ | DDPM. |
| sub-VP | Variant of VP with reduced terminal variance | New; gives improved likelihoods. |
The VE-SDE corresponds in the continuous limit to NCSN's sequence of Gaussian perturbations of increasing variance; the VP-SDE recovers DDPM's linear-Gaussian Markov chain when discretized; and the sub-VP-SDE was introduced in the same paper and yielded the best likelihoods reported there.[4] Critically, the framework also gives a deterministic ODE whose marginals match those of the reverse SDE, the probability-flow ODE: $$\frac{dx}{dt} = f(x, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x).$$ This neural ODE shares all generative behavior of the SDE but allows exact likelihood computation via the instantaneous change-of-variables formula, typically with Hutchinson trace estimation for tractability.[4]
The paper reported CIFAR-10 results of FID 2.20 and Inception score 9.89, both state of the art at the time, together with the first score-based generation of 1024x1024 images via a deep VE-SDE on a high-resolution face dataset, competitive likelihoods of 2.99 bits/dim, and a predictor-corrector sampling algorithm that combines numerical SDE integration with Langevin corrections to reduce discretization error.[4] It also demonstrated controllable generation, class-conditional sampling, inpainting, and colorization as instances of conditional reverse-SDE simulation.[4]
The score-SDE perspective showed that two threads of research, score-based modeling stemming from NCSN and denoising diffusion modeling stemming from Sohl-Dickstein et al. (2015) and DDPM (Ho, Jain, Abbeel, 2020), are different discretizations of the same underlying continuous-time process.[4][11] DDPM trains a network $\epsilon_\theta(x_t, t)$ to predict the Gaussian noise added at each step of a discrete-time Markov chain. Through Tweedie's formula and the equivalence of noise prediction to denoising under Gaussian corruption, the DDPM training objective is, up to a per-step weighting, equivalent to denoising score matching at the corresponding noise scale.[4][11] The simplified $L_{\text{simple}}$ loss of DDPM in particular drops the variational weighting and becomes a uniformly weighted denoising score matching objective; Ho et al. explicitly framed this equivalence in their paper.[11]
This unification had practical consequences. Numerical SDE solvers and probability-flow ODE solvers can be applied to networks trained with either NCSN-style or DDPM-style losses; sampling techniques developed for one (such as predictor-corrector schemes, exponential integrators, or DPM-Solver) transfer to the other; and likelihood evaluation, sample diversity, and conditional generation can be reasoned about within a single mathematical framework rather than as separate phenomena of two unrelated model families.[4]
The score-SDE viewpoint also paved the way for flow matching, introduced by Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le in Flow Matching for Generative Modeling (arXiv:2210.02747, posted 6 October 2022).[12] Flow matching trains a continuous normalizing flow by regressing the velocity field of a prescribed probability path connecting noise to data, without simulating the path during training. When the prescribed path is the Gaussian diffusion path, the velocity field is directly related to the score of the perturbed distribution via Tweedie's formula, and flow matching recovers score-based diffusion training as a special case.[12] Flow matching generalizes beyond Gaussian paths, however, allowing for example optimal-transport displacement interpolation between data and noise, which yields straighter trajectories and faster sampling than diffusion paths while retaining the same training simplicity.[12] See Flow Matching for a fuller treatment.
Score matching and its descendants now anchor a substantial fraction of modern generative modeling research.
Score matching's significance has shifted over time. In the 2005-2018 period it was a tool for fitting unnormalized statistical models, often in computational neuroscience or statistical signal processing, with the partition function as its principal nuisance. From 2019 onward it became a foundation for a new paradigm of high-fidelity generative modeling, one in which sampling is recast as the numerical integration of a stochastic or deterministic differential equation driven by a learned score field.[3][4]
A useful way to see why this matters: classical likelihood-based generative models (autoregressive models, normalizing flows, variational autoencoders) couple model architecture tightly to the demands of the density itself, requiring tractable conditionals, invertibility, or variational bounds. Score-based modeling decouples the learning objective from the density; the network only needs to output a vector field of the same dimensionality as the data, and any neural architecture suitable for image, video, audio, or other inputs can be plugged in.[3][4] This architectural freedom, combined with the stable denoising-style training objective, is largely responsible for the empirical dominance of score-based and diffusion approaches in continuous-data generation since 2021.
A number of estimators relate closely to score matching.
| Estimator | Year | Idea |
|---|---|---|
| Hyvärinen score matching[1] | 2005 | Match model and data scores via integration by parts. |
| Lyu generalization[8] | 2009 | Maximum likelihood under noise; extensions to discrete data. |
| Regularized score matching[9] | 2010 | Kingma and LeCun add explicit regularization for image statistics. |
| Denoising score matching[2] | 2011 | Vincent connects score matching to denoising autoencoders on Parzen kernels. |
| Sliced score matching[7] | 2019 | Song et al. project onto random vectors to avoid Hessian computation. |
| NCSN[3] | 2019 | Multi-scale denoising score matching with annealed Langevin sampling. |
| Score-based SDE[4] | 2020 | Continuous-time formulation unifying NCSN and DDPM. |
| Flow matching[12] | 2022 | Velocity-field regression generalizing score-based training. |
Contrastive Learning and noise-contrastive estimation provide a different family of MCMC-free training schemes that target the density via a classification surrogate, and remain competitive in regimes where the score is hard to learn directly.
| Method | Avoids partition function | Avoids MCMC | Avoids data score | Notes |
|---|---|---|---|---|
| Maximum likelihood + MCMC[5] | No (needs MCMC) | No | Yes | Asymptotically optimal but expensive. |
| Noise-contrastive estimation[5] | Yes | Yes | Yes | Density ratio learned by classifier; converges to MLE. |
| Score matching[1] | Yes | Yes | Yes (via IBP) | Targets gradient of log-density. |
| Denoising score matching[2] | Yes | Yes | Yes | Learns the score of a noised distribution. |