# Score matching

> Source: https://aiwiki.ai/wiki/score_matching
> Updated: 2026-06-25
> Categories: Diffusion Models, Machine Learning, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Score matching** is a method for fitting a probabilistic model by matching the gradient of its log-density, the so-called *score function* $\nabla_x \log p(x)$, to the same gradient of the data distribution, rather than fitting the density $p(x)$ directly. It was introduced by Aapo Hyvärinen in a 2005 paper in the *Journal of Machine Learning Research* (vol. 6, pp. 695-709), which proved that an unnormalized probability model can be estimated consistently by minimizing the expected squared distance between the model score and the data score, and that this objective reduces to a tractable sample average that never requires the normalization constant.[^1] Score matching remained a niche technique in classical statistics and unsupervised learning for more than a decade before being revived and extended in the late 2010s through *denoising score matching*, *sliced score matching*, and *noise-conditional score networks*, culminating in the score-based stochastic differential equation framework that underlies modern [diffusion models](/wiki/diffusion_model).[^2][^3][^4]

The central appeal of score matching is that it sidesteps the partition function. For an energy-based model $p_\theta(x) \propto \exp(-E_\theta(x))$, evaluating $p_\theta(x)$ or its log-likelihood requires the intractable integral $Z_\theta = \int \exp(-E_\theta(x))\, dx$, but the score $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ does not depend on $Z_\theta$ at all. Score matching thus provides an alternative to [Markov Chain Monte Carlo (MCMC)](/wiki/mcmc) estimation and noise-contrastive estimation for training [Boltzmann machine](/wiki/boltzmann_machine)-style models and other [generative models](/wiki/generative_model) whose densities are known only up to a constant.[^5]

## What problem does score matching solve?

Estimating a parametric density from data conventionally proceeds by [maximum likelihood estimation (MLE)](/wiki/maximum_likelihood_estimation). When the model is normalized, MLE is straightforward; when it is not, the normalization constant becomes an intractable function of the parameters, and direct likelihood evaluation requires either a closed-form integral, a Monte Carlo approximation, or contrastive training.[^5] Hyvärinen's 2005 paper was motivated by precisely this difficulty in the context of [Boltzmann machine](/wiki/boltzmann_machine)-like energy models and overcomplete independent component analysis, where the partition function depends nonlinearly on every parameter and is generally not computable.[^1] As the paper's abstract puts it, "One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant," and the usual recourse is "Markov Chain Monte Carlo methods, or approximations of the normalization constant."[^1]

The score function had appeared earlier in the statistics literature in several guises. The Fisher score is a classical object in maximum likelihood theory; the Stein identity and integration-by-parts techniques have a long history in mathematical statistics; and Tweedie's formula, attributed by Robbins (1956) to personal correspondence with M. C. K. Tweedie and revisited by Efron (2011), expresses the posterior mean under Gaussian noise as the noisy observation plus the variance times the score of the marginal density.[^6] Hyvärinen's contribution was to package these elements into a generic estimator that bypassed the normalization constant entirely and was tractable for any model whose log-density admitted second derivatives.[^1]

## How does score matching avoid the normalizing constant?

Let $p_{\text{data}}(x)$ denote the true data distribution over $x \in \mathbb{R}^d$, and let $p_\theta(x)$ denote the model. Define the *Fisher divergence* between data and model as
$$D_F(p_{\text{data}} \,\|\, p_\theta) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\, \|\nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x)\|_2^2 \,\right].$$
Minimizing $D_F$ over $\theta$ is called *explicit score matching* because it explicitly compares the two scores. The objective in this form is infeasible because the data score $\nabla_x \log p_{\text{data}}(x)$ is unknown.[^7]

Hyvärinen's key result is that, under mild regularity conditions (smooth densities and decay at infinity), integration by parts allows the unknown data score to be eliminated. He summarized the surprise in the paper: "we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model."[^1] Expanding the squared norm and using integration by parts on the cross term yields the *implicit score matching* objective:
$$J(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\, \tfrac{1}{2}\|\nabla_x \log p_\theta(x)\|_2^2 + \operatorname{tr}\!\left(\nabla_x^2 \log p_\theta(x)\right) \,\right] + \text{const},$$
where the constant depends only on $p_{\text{data}}$ and not on $\theta$.[^1] The estimator replaces the data expectation by a sample average, so training reduces to computing, for each minibatch element, the squared norm of the model score and the trace of the model Hessian of the log-density. Because $\nabla_x \log p_\theta = -\nabla_x E_\theta$, the partition function never appears.[^1][^5]

Hyvärinen proved that the resulting estimator is consistent under standard identifiability conditions and that it is locally asymptotically normal when the model is correctly specified.[^1] In the original paper the method was validated on multivariate Gaussian models, on independent component analysis, and on an overcomplete ICA model of natural image patches whose features resembled simple cell receptive fields.[^1] Later work by Lyu (2009) generalized the framework, connecting it to maximum likelihood under noise and extending it to discrete data,[^8] while Kingma and LeCun (2010) introduced a regularized variant that suppressed instabilities arising from finite samples and quantized inputs.[^9]

### What is the computational cost of the original objective?

The bottleneck of the original objective is the trace of the Hessian $\operatorname{tr}(\nabla_x^2 \log p_\theta(x))$. For a $d$-dimensional input and a deep neural parameterization, computing this trace exactly costs $O(d)$ backward passes per example, because each diagonal entry of the Hessian requires a separate backpropagation. This made score matching practical for shallow models and low-dimensional data but prohibitive for, say, ImageNet-scale images. Two main strategies emerged to remove this bottleneck: denoising score matching and sliced score matching.[^2][^7]

## What is denoising score matching?

Pascal Vincent's 2011 paper in *Neural Computation* (vol. 23, no. 7, pp. 1661-1674) established a connection between score matching and [denoising](/wiki/denoising) [autoencoders](/wiki/autoencoder), proving that training a denoising autoencoder with a Gaussian noise model is equivalent (up to a constant) to performing score matching on a Parzen-window estimator of the data density.[^2] Concretely, given clean data $x$ and a noise level $\sigma$, define the perturbed distribution
$$q_\sigma(\tilde x \mid x) = \mathcal{N}(\tilde x; x, \sigma^2 I), \qquad q_\sigma(\tilde x) = \int q_\sigma(\tilde x \mid x)\, p_{\text{data}}(x)\, dx.$$
A score model $s_\theta(\tilde x)$ trained with the *denoising score matching* (DSM) objective
$$\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{\tilde x \sim q_\sigma(\cdot \mid x)}\!\left[\, \tfrac{1}{2}\,\|s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)\|_2^2 \,\right]$$
recovers the score of the noisy distribution $q_\sigma$ as the optimal solution.[^2] Because $q_\sigma(\tilde x \mid x)$ is a Gaussian centered at $x$, its conditional score has the closed form $-(\tilde x - x)/\sigma^2$, so the regression target is simply the scaled noise. This is the form that diffusion models adopt almost universally.[^4]

The connection to Tweedie's formula makes the DSM objective particularly transparent. For Gaussian noise, Tweedie's formula states that the posterior mean of the clean signal given the noisy observation is
$$\mathbb{E}[x \mid \tilde x] = \tilde x + \sigma^2 \nabla_{\tilde x} \log q_\sigma(\tilde x),$$
which directly links the optimal denoiser to the score of the marginal noisy density.[^6] A network trained as a denoiser implicitly learns the score, and a network trained as a score model implicitly learns to denoise; this duality underlies most modern parameterizations of diffusion models.[^4][^6]

DSM has two key advantages over Hyvärinen's original method. First, it eliminates the Hessian term entirely, because the regression target is a known quantity rather than something requiring a derivative. Second, by averaging over a noisy version of the data, it provides better-conditioned training signals in regions of low data density, which is critical for sampling.[^3] Its main caveat is that the learned score is the score of $q_\sigma$, not of $p_{\text{data}}$ itself; in the limit $\sigma \to 0$ the two coincide, but at any positive noise level the model captures a smoothed density.[^3]

## What is sliced score matching?

A complementary route to scaling score matching was proposed by Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon in *Sliced Score Matching: A Scalable Approach to Density and Score Estimation* (UAI 2019, Tel Aviv, 22-25 July 2019), with the arXiv preprint posted 17 May 2019.[^7] The idea is to project both the model and data scores onto random directions before measuring their disagreement. For a unit vector $v$ drawn from some distribution $p_v$, define the projected Fisher divergence
$$D_F^v(\theta) = \tfrac{1}{2}\,\mathbb{E}_{v}\, \mathbb{E}_{x}\!\left[\, (v^\top \nabla_x \log p_{\text{data}}(x) - v^\top s_\theta(x))^2 \,\right].$$
Integration by parts on the cross term now produces a term $v^\top \nabla_x^2 \log p_\theta(x)\, v$, which is a *Hessian-vector product* rather than the full Hessian or its trace.[^7] Hessian-vector products can be computed exactly in $O(1)$ backward passes using reverse-mode automatic differentiation, irrespective of the input dimension, so the per-step cost of sliced score matching is comparable to ordinary backpropagation.[^7]

Song et al. proved that the estimator is consistent and asymptotically normal under regularity conditions, derived several practical variants (sliced score matching with variance reduction, finite-sample bias control), and demonstrated that the method scales to deep energy-based models, variational inference, and Wasserstein autoencoders that were previously inaccessible to score-based estimation.[^7]

| Variant | Objective | Cost per example | Notes |
| --- | --- | --- | --- |
| Hyvärinen score matching | $\tfrac{1}{2}\|s_\theta\|^2 + \operatorname{tr}(\nabla_x s_\theta)$ | $O(d)$ backward passes | Exact; bottleneck is the trace. |
| Denoising score matching | $\|s_\theta(\tilde x) - \nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)\|^2$ | $O(1)$ backward passes | Recovers score of $q_\sigma$, not $p_{\text{data}}$. |
| Sliced score matching | $\tfrac{1}{2}(v^\top s_\theta)^2 + v^\top \nabla_x s_\theta v$ | $O(1)$ backward passes | Exact; uses Hutchinson-style projection. |

## How is score matching used in generative modeling?

Through 2018 score matching remained largely a parameter-estimation technique. The decisive shift came in 2019 when Song and Ermon recognized that a learned score function could be used to *generate* samples by simulating Langevin dynamics that follow the score field.[^3] Given $s_\theta(x) \approx \nabla_x \log p_{\text{data}}(x)$, the Langevin update
$$x_{t+1} = x_t + \tfrac{\epsilon}{2}\, s_\theta(x_t) + \sqrt{\epsilon}\, z_t, \qquad z_t \sim \mathcal{N}(0, I)$$
defines a Markov chain whose stationary distribution, in the limit of small step size and many steps, is $p_{\text{data}}$.[^3] In practice, naive Langevin sampling from a score model trained with score matching suffered from two failure modes that Song and Ermon explicitly analyzed.[^3]

The first failure mode is the *manifold problem*: real data such as natural images typically lie on a low-dimensional manifold embedded in a high-dimensional ambient space, so the data density is degenerate and the score is undefined off the manifold. The second is *low-density regions*: even when the manifold issue is mitigated, the Fisher divergence weights pointwise score errors by $p_{\text{data}}(x)$, so errors in regions where $p_{\text{data}}$ is small are largely invisible to the training objective, yet these regions are exactly where Langevin chains spend most of their time when initialized from random noise.[^3]

### What are Noise-Conditional Score Networks (NCSN)?

The cure proposed in *Generative Modeling by Estimating Gradients of the Data Distribution* (Song and Ermon, NeurIPS 2019; arXiv:1907.05600, submitted 12 July 2019) is to perturb the data with a sequence of Gaussian noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$ and to train a single neural network, the *Noise-Conditional Score Network* (NCSN), to estimate the score of each perturbed distribution $q_{\sigma_i}$ simultaneously.[^3] The network takes as input both $x$ and the noise level $\sigma_i$, and is trained with a denoising score matching loss summed over noise levels with appropriate per-level weights.[^3]

Sampling then proceeds via *annealed Langevin dynamics*: the chain is initialized with random noise, run with the score model evaluated at the highest noise level $\sigma_1$ for some number of steps, then continued with the next-highest noise level, and so on down to $\sigma_L$, which is set small enough that $q_{\sigma_L} \approx p_{\text{data}}$.[^3] The largest noise level smooths the distribution enough that low-density regions are populated and Langevin chains mix freely; smaller noise levels successively sharpen the samples toward the data manifold.[^3]

NCSN with annealed Langevin dynamics achieved an Inception score of 8.87 on [CIFAR-10](/wiki/cifar_10), the state of the art for unconditional generation at publication time, and demonstrated compelling unconditional samples on [MNIST](/wiki/mnist) and CelebA together with inpainting applications.[^3]

### How does the score-based SDE framework work?

Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole then unified NCSN and the contemporaneous denoising diffusion probabilistic model (DDPM, Ho et al. 2020) in *Score-Based Generative Modeling through Stochastic Differential Equations* (ICLR 2021; arXiv:2011.13456, submitted 26 November 2020).[^4] The paper opens with a now-famous framing of the goal: "Creating noise from data is easy; creating data from noise is generative modeling."[^4] Rather than working with a discrete sequence of noise levels, the paper takes a continuous-time limit in which the data is gradually corrupted into noise by a forward stochastic differential equation (SDE)
$$dx = f(x, t)\, dt + g(t)\, dw,$$
where $w$ is a standard Wiener process. By a classical result of Brian Anderson (1982), the reverse-time process is also an SDE, with drift modified by the score of the forward marginals:
$$dx = \left[f(x, t) - g(t)^2 \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar w,$$
where $\bar w$ is a reverse-time Wiener process and $p_t$ is the marginal density at time $t$.[^4][^10] Sampling reduces to learning $\nabla_x \log p_t(x)$ via denoising score matching applied over a continuum of noise scales, then numerically integrating the reverse SDE backward in time from $t = T$ (pure noise) to $t = 0$ (data).[^4]

Song et al. identified three families of forward SDEs.[^4]

| SDE | Forward dynamics | Discrete analog |
| --- | --- | --- |
| Variance Exploding (VE) | $dx = \sqrt{d\sigma^2(t)/dt}\, dw$ | NCSN. |
| Variance Preserving (VP) | $dx = -\tfrac{1}{2}\beta(t) x\, dt + \sqrt{\beta(t)}\, dw$ | DDPM. |
| sub-VP | Variant of VP with reduced terminal variance | New; gives improved likelihoods. |

The VE-SDE corresponds in the continuous limit to NCSN's sequence of Gaussian perturbations of increasing variance; the VP-SDE recovers [DDPM](/wiki/ddpm)'s linear-Gaussian Markov chain when discretized; and the sub-VP-SDE was introduced in the same paper and yielded the best likelihoods reported there.[^4] Critically, the framework also gives a deterministic ODE whose marginals match those of the reverse SDE, the *probability-flow ODE*:
$$\frac{dx}{dt} = f(x, t) - \tfrac{1}{2} g(t)^2 \nabla_x \log p_t(x).$$
This neural ODE shares all generative behavior of the SDE but allows exact likelihood computation via the instantaneous change-of-variables formula, typically with Hutchinson trace estimation for tractability.[^4]

The paper reported CIFAR-10 results of FID 2.20 and Inception score 9.89, both state of the art at the time, together with the first score-based generation of 1024x1024 images via a deep VE-SDE on a high-resolution face dataset, competitive likelihoods of 2.99 bits/dim, and a *predictor-corrector* sampling algorithm that combines numerical SDE integration with Langevin corrections to reduce discretization error.[^4] It also demonstrated controllable generation, class-conditional sampling, inpainting, and colorization as instances of conditional reverse-SDE simulation.[^4]

## How does score matching relate to denoising diffusion models?

The score-SDE perspective showed that two threads of research, score-based modeling stemming from NCSN and denoising diffusion modeling stemming from Sohl-Dickstein et al. (2015) and [DDPM](/wiki/ddpm) (Ho, Jain, Abbeel, 2020), are different discretizations of the same underlying continuous-time process.[^4][^11] [DDPM](/wiki/ddpm) trains a network $\epsilon_\theta(x_t, t)$ to predict the Gaussian noise added at each step of a discrete-time Markov chain. Through Tweedie's formula and the equivalence of noise prediction to denoising under Gaussian corruption, the DDPM training objective is, up to a per-step weighting, equivalent to denoising score matching at the corresponding noise scale.[^4][^11] The simplified $L_{\text{simple}}$ loss of [DDPM](/wiki/ddpm) in particular drops the variational weighting and becomes a uniformly weighted denoising score matching objective; Ho et al. explicitly framed this equivalence in their paper.[^11]

This unification had practical consequences. Numerical SDE solvers and probability-flow ODE solvers can be applied to networks trained with either NCSN-style or DDPM-style losses; sampling techniques developed for one (such as predictor-corrector schemes, exponential integrators, or DPM-Solver) transfer to the other; and likelihood evaluation, sample diversity, and conditional generation can be reasoned about within a single mathematical framework rather than as separate phenomena of two unrelated model families.[^4]

## How does score matching relate to flow matching?

The score-SDE viewpoint also paved the way for *flow matching*, introduced by Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le in *Flow Matching for Generative Modeling* (arXiv:2210.02747, posted 6 October 2022).[^12] Flow matching trains a continuous normalizing flow by regressing the velocity field of a prescribed probability path connecting noise to data, without simulating the path during training. When the prescribed path is the Gaussian diffusion path, the velocity field is directly related to the score of the perturbed distribution via Tweedie's formula, and flow matching recovers score-based diffusion training as a special case.[^12] Flow matching generalizes beyond Gaussian paths, however, allowing for example optimal-transport displacement interpolation between data and noise, which yields straighter trajectories and faster sampling than diffusion paths while retaining the same training simplicity.[^12] See [Flow Matching](/wiki/flow_matching) for a fuller treatment.

## What is score matching used for?

Score matching and its descendants now anchor a substantial fraction of modern [generative modeling](/wiki/generative_model) research.

- **Image generation.** Score-based and diffusion models trained with denoising score matching underpin [Stable Diffusion](/wiki/stable_diffusion), [DALL-E](/wiki/dall-e) 2 and 3, [Imagen](/wiki/imagen), [Midjourney](/wiki/midjourney), and most other large open-source and commercial text-to-image systems.[^4] The score-SDE framework directly informs the sampling code of these systems.
- **Video generation.** Score-based diffusion models trained over space-time tensors power [Sora](/wiki/sora) and a broad class of video-diffusion systems, which inherit the same training recipe with appropriate factorizations of the score network.[^4]
- **Audio synthesis.** Diffusion and score-based models have been adopted for raw [audio](/wiki/audio) waveform generation, neural vocoders, and high-fidelity text-to-speech.
- **3D generation.** *Score distillation sampling*, introduced in the DreamFusion paper (Poole et al. 2022), uses a pretrained 2D image diffusion model as a score field to optimize a parameterized 3D representation (a neural radiance field) so that its renderings look like samples from the diffusion model.[^13] Variational score distillation and related variants address the over-saturation and mode-collapse issues of the original SDS objective.[^13]
- **Inverse problems and Bayesian imaging.** The score of a diffusion prior can be combined with a likelihood gradient at inference time to perform posterior sampling for inpainting, super-resolution, deblurring, and medical image reconstruction, framing classical inverse problems as conditional reverse-SDE simulation.[^4]
- **Scientific modeling.** Score-based diffusion models have been applied to molecular conformation generation, protein structure ensembles, materials discovery, and quantum-state generation, leveraging the SDE framework's natural handling of continuous symmetry and equivariance.
- **Energy-based modeling.** Sliced and denoising score matching provide MCMC-free training schemes for [Boltzmann machine](/wiki/boltzmann_machine)-style models and other [generative models](/wiki/generative_model) specified by an unnormalized energy.[^5][^7]

## Why is score matching significant?

Score matching's significance has shifted over time. In the 2005-2018 period it was a tool for fitting unnormalized statistical models, often in computational neuroscience or statistical signal processing, with the partition function as its principal nuisance. From 2019 onward it became a foundation for a new paradigm of high-fidelity generative modeling, one in which sampling is recast as the numerical integration of a stochastic or deterministic differential equation driven by a learned score field.[^3][^4]

A useful way to see why this matters: classical likelihood-based generative models (autoregressive models, normalizing flows, variational autoencoders) couple model architecture tightly to the demands of the density itself, requiring tractable conditionals, invertibility, or variational bounds. Score-based modeling decouples the learning objective from the density; the network only needs to output a vector field of the same dimensionality as the data, and any neural architecture suitable for image, video, audio, or other inputs can be plugged in.[^3][^4] This architectural freedom, combined with the stable denoising-style training objective, is largely responsible for the empirical dominance of score-based and diffusion approaches in continuous-data generation since 2021.

## What are the limitations of score matching?

- **Manifold and low-density-region pathology.** Without noise perturbation, score matching estimates are ill-behaved on data that lie on lower-dimensional manifolds, and Fisher divergence under-weights errors in regions where the data density is low.[^3] Multi-scale and SDE-based formulations mitigate but do not eliminate this issue.
- **Slow sampling.** Reverse-SDE and probability-flow ODE integration with high-fidelity samples typically requires tens to hundreds of network evaluations per sample, although accelerated solvers and distillation techniques have brought this number down substantially.
- **Discretization bias.** Numerical SDE solvers introduce discretization errors; predictor-corrector schemes and exponential integrators reduce but do not remove this bias.[^4]
- **Statistical efficiency.** Hyvärinen score matching can be statistically less efficient than maximum likelihood when MLE is feasible, particularly in unimodal regimes; subsequent work on the statistical efficiency of score matching has formalized this trade-off in terms of isoperimetric constants of the target distribution.
- **Likelihood gap.** While the probability-flow ODE allows exact likelihood evaluation, the SDE itself does not, and reported likelihoods of score-based models, while competitive, sometimes lag behind exact likelihood-based methods such as autoregressive models on natural-image bits-per-dimension benchmarks.[^4]
- **Score distortion at high noise.** At very large noise levels the learned score is essentially the score of a Gaussian, providing little information about the data; the network's modeling capacity is therefore concentrated at intermediate noise scales, an observation that has motivated time-weighting reforms of the training loss.

## What are the main variants and related estimators?

A number of estimators relate closely to score matching.

| Estimator | Year | Idea |
| --- | --- | --- |
| Hyvärinen score matching[^1] | 2005 | Match model and data scores via integration by parts. |
| Lyu generalization[^8] | 2009 | Maximum likelihood under noise; extensions to discrete data. |
| Regularized score matching[^9] | 2010 | Kingma and LeCun add explicit regularization for image statistics. |
| Denoising score matching[^2] | 2011 | Vincent connects score matching to denoising autoencoders on Parzen kernels. |
| Sliced score matching[^7] | 2019 | Song et al. project onto random vectors to avoid Hessian computation. |
| NCSN[^3] | 2019 | Multi-scale denoising score matching with annealed Langevin sampling. |
| Score-based SDE[^4] | 2020 | Continuous-time formulation unifying NCSN and DDPM. |
| Flow matching[^12] | 2022 | Velocity-field regression generalizing score-based training. |

[Contrastive Learning](/wiki/contrastive_learning) and noise-contrastive estimation provide a different family of MCMC-free training schemes that target the density via a classification surrogate, and remain competitive in regimes where the score is hard to learn directly.

## How does score matching compare with other unnormalized-model estimators?

| Method | Avoids partition function | Avoids MCMC | Avoids data score | Notes |
| --- | --- | --- | --- | --- |
| Maximum likelihood + MCMC[^5] | No (needs MCMC) | No | Yes | Asymptotically optimal but expensive. |
| Noise-contrastive estimation[^5] | Yes | Yes | Yes | Density ratio learned by classifier; converges to MLE. |
| Score matching[^1] | Yes | Yes | Yes (via IBP) | Targets gradient of log-density. |
| Denoising score matching[^2] | Yes | Yes | Yes | Learns the score of a noised distribution. |

## ELI5: what is score matching?

Imagine you want to describe the shape of a landscape (where the data is dense and where it is sparse), but you are not allowed to measure the total volume of the hills, which is expensive to compute. Score matching is like learning the *slope* of the landscape at every point instead of its height. The slope tells you which way is "uphill" toward more data, and from the slopes alone you can walk samples toward the realistic regions. The clever trick (Hyvärinen 2005) is a piece of calculus that lets you learn those slopes without ever needing the true slopes of the real data, and without ever computing the total volume. Modern image generators like Stable Diffusion learn a noisy version of this slope field and then roll random noise downhill until it turns into a picture.

## Related Work

- [Diffusion model](/wiki/diffusion_model) is the most prominent downstream application of score-matching theory.
- [DDPM](/wiki/ddpm) is the discrete-time denoising diffusion formulation linked to score matching by Tweedie's formula.
- [Flow Matching](/wiki/flow_matching) generalizes score-based SDE training to arbitrary probability paths.
- [Markov Chain Monte Carlo (MCMC)](/wiki/mcmc) and [Maximum likelihood estimation (MLE)](/wiki/maximum_likelihood_estimation) are the classical alternatives that score matching circumvents.
- [Boltzmann machine](/wiki/boltzmann_machine) models are one of the original motivating applications.
- [Autoencoder](/wiki/autoencoder) and [Denoising](/wiki/denoising) training criteria are tightly connected to denoising score matching via the Vincent equivalence.
- [Variational Autoencoder](/wiki/variational_autoencoder) and [GAN](/wiki/gan) are competing generative-modeling paradigms that score-based methods largely displaced for continuous-data tasks.
- [Stable Diffusion](/wiki/stable_diffusion), [DALL-E](/wiki/dall-e), [Imagen](/wiki/imagen), [Midjourney](/wiki/midjourney), and [Sora](/wiki/sora) are major systems that rely on score-matching or diffusion training in production.

## See also

- [Diffusion model](/wiki/diffusion_model)
- [DDPM](/wiki/ddpm)
- [Flow Matching](/wiki/flow_matching)
- [Latent diffusion model](/wiki/latent_diffusion)
- [Stable Diffusion](/wiki/stable_diffusion)
- [Generative Model](/wiki/generative_model)
- [Variational Autoencoder](/wiki/variational_autoencoder)
- [Boltzmann machine](/wiki/boltzmann_machine)
- [Markov Chain Monte Carlo (MCMC)](/wiki/mcmc)
- [Maximum likelihood estimation (MLE)](/wiki/maximum_likelihood_estimation)
- [Contrastive Learning](/wiki/contrastive_learning)
- [Autoencoder](/wiki/autoencoder)
- [Denoising](/wiki/denoising)
- [Noise](/wiki/noise)
- [NeurIPS](/wiki/neurips)
- [International Conference on Learning Representations](/wiki/iclr)
- [Frechet Inception Distance](/wiki/frechet_inception_distance)
- [CIFAR-10](/wiki/cifar_10)
- [MNIST](/wiki/mnist)
- [AI Image Generation](/wiki/ai_image_generation)

## References

[^1]: Aapo Hyvärinen, "Estimation of Non-Normalized Statistical Models by Score Matching", Journal of Machine Learning Research, vol. 6, pp. 695-709, 2005-04. https://www.jmlr.org/papers/v6/hyvarinen05a.html. Accessed 2026-06-25.
[^2]: Pascal Vincent, "A Connection Between Score Matching and Denoising Autoencoders", Neural Computation, vol. 23 no. 7, pp. 1661-1674, 2011-07. https://direct.mit.edu/neco/article/23/7/1661/7677/A-Connection-Between-Score-Matching-and-Denoising. Accessed 2026-06-25.
[^3]: Yang Song and Stefano Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution", arXiv:1907.05600, 2019-07-12. https://arxiv.org/abs/1907.05600. Accessed 2026-06-25.
[^4]: Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole, "Score-Based Generative Modeling through Stochastic Differential Equations", arXiv:2011.13456, 2020-11-26. https://arxiv.org/abs/2011.13456. Accessed 2026-06-25.
[^5]: Yang Song and Diederik P. Kingma, "How to Train Your Energy-Based Models", arXiv:2101.03288, 2021-01-09. https://arxiv.org/abs/2101.03288. Accessed 2026-06-25.
[^6]: Bradley Efron, "Tweedie's Formula and Selection Bias", Journal of the American Statistical Association, vol. 106 no. 496, 2011-12. https://efron.ckirby.su.domains/papers/2011TweediesFormula.pdf. Accessed 2026-06-25.
[^7]: Yang Song, Sahaj Garg, Jiaxin Shi, Stefano Ermon, "Sliced Score Matching: A Scalable Approach to Density and Score Estimation", arXiv:1905.07088, 2019-05-17. https://arxiv.org/abs/1905.07088. Accessed 2026-06-25.
[^8]: Siwei Lyu, "Interpretation and Generalization of Score Matching", Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009-06. https://arxiv.org/abs/1205.2629. Accessed 2026-06-25.
[^9]: Diederik P. Kingma and Yann LeCun, "Regularized estimation of image statistics by Score Matching", Advances in Neural Information Processing Systems 23 (NeurIPS), 2010-12. https://proceedings.neurips.cc/paper/2010/hash/6f3e29a35278d71c7f65495871231324-Abstract.html. Accessed 2026-06-25.
[^10]: Brian D. O. Anderson, "Reverse-time diffusion equation models", Stochastic Processes and their Applications, vol. 12 no. 3, 1982-05. https://www.sciencedirect.com/science/article/pii/0304414982900515. Accessed 2026-06-25.
[^11]: Jonathan Ho, Ajay Jain, Pieter Abbeel, "Denoising Diffusion Probabilistic Models", arXiv:2006.11239, 2020-06-19. https://arxiv.org/abs/2006.11239. Accessed 2026-06-25.
[^12]: Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matt Le, "Flow Matching for Generative Modeling", arXiv:2210.02747, 2022-10-06. https://arxiv.org/abs/2210.02747. Accessed 2026-06-25.
[^13]: Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall, "DreamFusion: Text-to-3D using 2D Diffusion", arXiv:2209.14988, 2022-09-29. https://arxiv.org/abs/2209.14988. Accessed 2026-06-25.

