# Consistency Models

> Source: https://aiwiki.ai/wiki/consistency_models
> Updated: 2026-06-28
> Categories: Diffusion Models, Generative AI, OpenAI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Consistency models** are a family of [generative models](/wiki/generative_ai), introduced by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever at [OpenAI](/wiki/openai) in March 2023, that learn to map any point along the probability flow ordinary differential equation (PF ODE) trajectory of a [diffusion](/wiki/diffusion_model) process directly to that trajectory's clean origin.[^1] By construction, a single network evaluation can transform pure noise into a sample, while iterative refinement is preserved through an optional multistep sampler that re-injects noise between calls.[^1][^2] The original paper, "Consistency Models," reached arXiv as preprint 2303.01469 on 2 March 2023 and was accepted to [ICML](/wiki/icml) 2023; subsequent work has substantially improved training stability, scaled the formulation to billion-parameter image generators, and adapted the idea to latent, audio, and video domains.[^1][^3][^4][^5][^6]

The authors frame the goal directly in the paper's abstract: "we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality."[^1] On standard benchmarks the original models reached a one-step Fréchet Inception Distance (FID) of 3.55 on [CIFAR-10](/wiki/cifar_10) and 6.20 on [ImageNet](/wiki/imagenet) 64x64, the state of the art for single-step diffusion distillation at the time of publication.[^1]

The motivation is straightforward: diffusion models such as [DDPM](/wiki/ddpm) and the score-based stochastic differential equation family produce samples by solving a learned ODE or SDE backward from noise to data, which typically requires dozens to hundreds of sequential neural network evaluations.[^1][^7] Consistency models retain the iterative trajectory of diffusion at training time but force the network to collapse the whole trajectory into a single mapping, enabling one-step or very-few-step generation. Two training regimes were proposed in the original paper: **consistency distillation (CD)**, which uses a pre-trained diffusion model as a teacher, and **consistency training (CT)**, which trains a consistency model from scratch on data.[^1] Follow-up work by Song and Dhariwal in October 2023 introduced "Improved Techniques for Training Consistency Models" (arXiv:2310.14189), the same month that Kim et al. proposed Consistency Trajectory Models (CTM, arXiv:2310.02279), and the line was further generalised in October 2024 by Cheng Lu and Yang Song in "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models" (arXiv:2410.11081, often abbreviated sCM), which scaled the approach to a 1.5 billion parameter model on ImageNet 512x512.[^2][^4][^19] A closely related sibling line, [Latent Consistency Models](/wiki/latent_consistency_models), transposes the same recipe into the latent space of a pre-trained latent diffusion model.[^5]

| Property | Details |
|---|---|
| First public release | arXiv:2303.01469, 2 March 2023[^1] |
| Authors (original) | Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever[^1] |
| Affiliation | OpenAI[^1] |
| Venue | ICML 2023 (PMLR vol. 202, pp. 32211 to 32252)[^3] |
| Training regimes | Consistency Distillation (CD), Consistency Training (CT)[^1] |
| Sampling | One step by design; optional multistep refinement[^1] |
| Original one-step FID, CIFAR-10 | 3.55 (CD)[^1] |
| Original one-step FID, ImageNet 64x64 | 6.20 (CD)[^1] |
| Improved CT one-step FID, CIFAR-10 (iCT) | 2.51[^2] |
| CTM one-step FID, CIFAR-10 | 1.73[^19] |
| sCM two-step FID, ImageNet 512x512 (1.5B) | 1.88[^4] |
| Best known follow-ups | CTM (ICLR 2024)[^19], sCM (arXiv:2410.11081, ICLR 2025)[^4][^8] |

## What is a consistency model?

A consistency model is a generative network that maps any noised point on a diffusion trajectory straight to the clean data point at the start of that trajectory, so a sample can be produced in a single forward pass instead of the dozens to hundreds of denoising steps a standard diffusion sampler needs.[^1] The model is defined by a **consistency function** that is required to be self-consistent: any two points sampled along the same probability flow ODE trajectory must be mapped to the same output.[^1] Because the mapping targets the trajectory origin directly, generation reduces to evaluating the function once on a pure-noise input, while an optional multistep sampler can spend a few extra evaluations to raise quality.[^1][^2]

## Background

### Diffusion models and the probability flow ODE

Modern continuous-time [diffusion models](/wiki/diffusion_model) view sample generation as the time reversal of a stochastic process that gradually adds Gaussian noise to clean data. In the formulation by Song et al. (ICLR 2021), this forward process is described by an Ito stochastic differential equation (SDE) whose marginal densities can also be sampled deterministically using a corresponding **probability flow ODE** that shares the same time-marginal distributions as the SDE.[^7] In the variance-exploding parameterisation used by the consistency models paper, the forward process can be written as `dx = sqrt(2 t) dW`, so that at time `t` the noisy sample `x_t = x_0 + t z`, with `z` standard Gaussian, follows a tractable family of densities; the associated probability flow ODE is `dx_t/dt = -t s(x_t, t)`, where `s` is the score, i.e. the gradient of the log density.[^1][^7]
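As a worked check of these formulas, the sketch below uses a one-dimensional toy problem that is not taken from the paper: data drawn from `N(0, data_std^2)`, for which the noisy marginal is `N(0, data_std^2 + t^2)` and the score is available in closed form. The function names and the constants `data_std`, `T`, and `eps` are illustrative assumptions. Under those assumptions the probability flow ODE can be integrated numerically and compared against its exact solution.

```python
import math

def pf_ode_velocity(x, t, data_std):
    """dx/dt = -t * score(x, t) for toy data ~ N(0, data_std^2) and x_t = x_0 + t z."""
    score = -x / (data_std**2 + t**2)  # gradient of log N(0, data_std^2 + t^2)
    return -t * score

def integrate_backward(x_T, T, eps, data_std, n_steps=20000):
    """Plain Euler integration of the PF ODE from t = T down to t = eps."""
    x, t = x_T, T
    dt = (eps - T) / n_steps  # negative step: time decreases
    for _ in range(n_steps):
        x += dt * pf_ode_velocity(x, t, data_std)
        t += dt
    return x

def closed_form_origin(x_t, t, eps, data_std):
    """Exact trajectory endpoint at time eps for this Gaussian toy problem."""
    return x_t * math.sqrt((data_std**2 + eps**2) / (data_std**2 + t**2))

x_T, T, eps, data_std = 40.0, 80.0, 0.002, 0.5
numeric = integrate_backward(x_T, T, eps, data_std)
exact = closed_form_origin(x_T, T, eps, data_std)
```

For this toy case the many-step integration and the single closed-form evaluation agree to within the Euler discretisation error, which is the kind of shortcut a consistency model tries to learn when no closed form exists.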

Sampling proceeds by integrating this ODE backward from a large noise level `T` (where the marginal is essentially Gaussian) to a small terminal time `eps > 0`, using a pre-trained score network. With high-order ODE solvers such as Heun's method or DPM-Solver, this still requires tens of evaluations to reach competitive sample quality.[^7][^9] Many earlier acceleration techniques (DDIM-style deterministic samplers, DPM-Solver, [knowledge distillation](/wiki/knowledge_distillation) variants such as progressive distillation) attempt to reduce this evaluation count, but each retains some iterative character or pays a quality penalty in the very-few-step regime.[^9][^10]

### The Karras EDM framework

The consistency models paper adopts the design space and noise schedule introduced by Karras et al. ("Elucidating the Design Space of Diffusion-Based Generative Models," NeurIPS 2022), commonly abbreviated EDM. EDM uses a variance-exploding parameterisation with noise levels discretised between `sigma_min = 0.002` and `sigma_max = 80`, and a parameter `rho = 7` that warps the spacing of timesteps so that they cluster toward low noise levels.[^11] The choice fixes the precise meaning of `t` in the probability flow ODE and supplies a backbone (a U-Net with EDM preconditioning) that the consistency model reuses.[^1][^11]
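A minimal sketch of this discretisation follows, assuming the standard Karras form in which timesteps are linearly spaced in `t^(1/rho)` and then raised back to the power `rho`. The function name `karras_timesteps` and the choice of 18 steps are illustrative, not taken from the source.

```python
def karras_timesteps(n, eps=0.002, T=80.0, rho=7.0):
    """Return t_1 < ... < t_n with t_1 = eps and t_n = T.

    Points are evenly spaced in t^(1/rho); raising back to the power rho
    packs more of them near the low-noise end of the range.
    """
    lo, hi = eps ** (1.0 / rho), T ** (1.0 / rho)
    return [(lo + i / (n - 1) * (hi - lo)) ** rho for i in range(n)]

ts = karras_timesteps(18)
first_gap = ts[1] - ts[0]      # step size next to eps
last_gap = ts[-1] - ts[-2]     # step size next to T
```

With `rho = 7` the gap next to `eps` is far smaller than the gap next to `T`, so most of the grid resolution is spent where the trajectory is closest to the data.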

### Why is one-step generation desirable?

Reducing the number of neural network evaluations (NFEs) per sample directly reduces latency, energy use, and serving cost. For text-to-image systems, a 25 to 50 step diffusion pass on a large model can dominate end-to-end inference time; cutting this to one to four steps without large quality loss enables interactive editing, real-time previews, and viable deployment on mobile or browser hardware.[^5][^6][^12] These pressures, plus the desire for a deeper theoretical understanding of when fast samplers are possible, motivated consistency models and a wave of related distillation methods that appeared in 2023 and 2024.[^4][^5][^6][^12]

## How do consistency models work?

### The consistency function

Given a probability flow ODE trajectory `{x_t}` for `t in [eps, T]`, the **consistency function** `f` is defined by `f(x_t, t) = x_eps`, mapping any point on the trajectory to the same endpoint near `t = 0`.[^1] The defining self-consistency property is that for any two times `t, t'` on the same trajectory, `f(x_t, t) = f(x_t', t')`. A learned parametric approximation `f_theta` is the model.[^1]

To make the function easy to train and to enforce a clean terminal behaviour, the network must satisfy the **boundary condition** `f_theta(x, eps) = x`. The authors implement this via a skip parameterisation: `f_theta(x, t) = c_skip(t) x + c_out(t) F_theta(x, t)`, where `F_theta` is a free neural network (typically a U-Net inherited from the diffusion teacher), and `c_skip, c_out` are differentiable scalar functions satisfying `c_skip(eps) = 1` and `c_out(eps) = 0`, so the boundary is met by construction without an architectural hack.[^1] EDM-style preconditioning provides convenient closed forms for `c_skip(t)` and `c_out(t)` that work well empirically.[^1][^11]
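The skip parameterisation can be sketched as below. The closed forms for `c_skip` and `c_out` are the EDM-style coefficients shifted by `eps`, as I understand them from the consistency models paper; the value `SIGMA_DATA = 0.5` and the stand-in network passed as `F` are assumptions for illustration.

```python
import math

SIGMA_DATA = 0.5  # assumed data standard deviation (EDM-style default)
EPS = 0.002       # smallest time on the trajectory

def c_skip(t):
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)

def c_out(t):
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)

def consistency_fn(F, x, t):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t) for any free network F."""
    return c_skip(t) * x + c_out(t) * F(x, t)

# Any F, however badly behaved, satisfies the boundary condition at t = EPS.
arbitrary_network = lambda x, t: 123.0
at_boundary = consistency_fn(arbitrary_network, 0.7, EPS)
```

Because `c_skip(EPS) = 1` and `c_out(EPS) = 0`, the output at the boundary is the input itself regardless of what `F` returns, so no constraint has to be imposed on the network weights.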

### How are consistency models trained?

The original paper proposed two ways to train the consistency function, summarised here and detailed in the two subsections that follow.[^1]

| Training mode | Needs a diffusion teacher? | Source of the trajectory pair | Original one-step CIFAR-10 FID |
|---|---|---|---|
| Consistency distillation (CD) | Yes | A single ODE-solver step using the teacher's score | 3.55[^1] |
| Consistency training (CT) | No | Two adjacent noise levels added to the same clean sample | 8.70[^1] |

#### Consistency distillation (CD)

In **consistency distillation**, a pre-trained diffusion model supplies the score function that defines the probability flow ODE. Training proceeds by drawing a data sample `x`, a noise level `t_{n+1}` from a fixed discretisation `eps = t_1 < t_2 < ... < t_N = T`, and a Gaussian noise vector to form `x_{t_{n+1}}`. A single step of an ODE solver `phi` (typically Heun's method using the teacher's score) approximates `x_{t_n}` from `x_{t_{n+1}}`. The student is then trained to match itself at these two adjacent points on the trajectory, using the loss

`L_CD(theta, theta^-; phi) = E[ lambda(t_n) d( f_theta(x_{t_{n+1}}, t_{n+1}), f_{theta^-}(x_hat^phi_{t_n}, t_n) ) ]`

where `d` is a distance such as squared L2 or LPIPS, `lambda` a weighting function, and `theta^-` a slow-moving target parameter set updated by an exponential moving average (EMA) of the trained parameters.[^1] Because both arguments lie on (an approximation of) the same trajectory, minimising this loss enforces self-consistency along that trajectory. The teacher only enters through the single ODE step, never through a direct supervision signal on the endpoint.[^1]
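The following sketch estimates this loss on a one-dimensional toy problem, under several assumptions that are not from the paper: the data is `N(0, DATA_STD^2)`, the "teacher" is the exact analytic score of that distribution, `d` is squared L2, `lambda` is 1, and the online and target networks are plain Python functions rather than trained models. It is meant only to show how the pair of points is built and compared.

```python
import math
import random

DATA_STD, EPS = 0.5, 0.002

def teacher_score(x, t):
    # Exact score for toy data ~ N(0, DATA_STD^2); stands in for a trained teacher.
    return -x / (DATA_STD**2 + t**2)

def heun_step(x, t_from, t_to):
    """One Heun (second-order) step of dx/dt = -t * score(x, t)."""
    d1 = -t_from * teacher_score(x, t_from)
    x_euler = x + (t_to - t_from) * d1
    d2 = -t_to * teacher_score(x_euler, t_to)
    return x + (t_to - t_from) * 0.5 * (d1 + d2)

def f_exact(x, t):
    # Closed-form consistency function for this toy problem.
    return x * math.sqrt((DATA_STD**2 + EPS**2) / (DATA_STD**2 + t**2))

def cd_loss(f_online, f_target, t_n, t_np1, rng, batch=256):
    """Monte Carlo estimate of the CD loss with squared-L2 distance, lambda = 1."""
    total = 0.0
    for _ in range(batch):
        x = rng.gauss(0.0, DATA_STD)                 # clean sample
        x_np1 = x + t_np1 * rng.gauss(0.0, 1.0)      # noised to level t_{n+1}
        x_n_hat = heun_step(x_np1, t_np1, t_n)       # teacher moves it to t_n
        total += (f_online(x_np1, t_np1) - f_target(x_n_hat, t_n)) ** 2
    return total / batch

rng = random.Random(0)
loss_exact = cd_loss(f_exact, f_exact, t_n=1.0, t_np1=1.1, rng=rng)
identity = lambda x, t: x
loss_identity = cd_loss(identity, identity, t_n=1.0, t_np1=1.1, rng=rng)
```

The exact consistency function drives the loss to nearly zero (only the solver error remains), while a function that ignores the trajectory, such as the identity, is penalised.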

#### Consistency training (CT)

In **consistency training**, no diffusion teacher is needed. The pair of points on the trajectory is generated by adding scaled Gaussian noise to a clean data sample at two adjacent noise levels:

`L_CT(theta, theta^-) = E[ lambda(t_n) d( f_theta(x + t_{n+1} z, t_{n+1}), f_{theta^-}(x + t_n z, t_n) ) ]`

with the same Gaussian `z` used at both noise levels.[^1] This works because, under the variance-exploding process, `-(x_t - x) / t^2` is an unbiased single-sample estimator of the score at `x_t`; substituting it for the teacher's score in an Euler step from `x + t_{n+1} z` lands exactly on `x + t_n z`, and the paper proves that the resulting CT loss agrees with the CD loss up to terms that vanish as the discretisation is refined.[^1] Consistency training thus yields a self-contained generative model, with no reliance on an external diffusion model. The original paper observed that CD outperforms CT at small to moderate model and compute scale, but that CT closes much of the gap with more iterations and larger networks.[^1]
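The link between the CT pair and an ODE step can be verified with a few lines of arithmetic. The sketch below assumes an Euler step of the variance-exploding probability flow ODE `dx/dt = -t * score` and replaces the score with the single-sample estimate `-(x_t - x) / t^2`; the function name and the specific numbers are illustrative.

```python
import random

def euler_step_with_sample_score(x_clean, z, t_from, t_to):
    """Euler step of dx/dt = -t * score, using -(x_t - x) / t^2 in place of the score."""
    x_t = x_clean + t_from * z
    score_estimate = -(x_t - x_clean) / t_from**2
    return x_t + (t_to - t_from) * (-t_from * score_estimate)

rng = random.Random(0)
x_clean, z = rng.gauss(0.0, 0.5), rng.gauss(0.0, 1.0)
t_n, t_np1 = 1.0, 1.1

via_euler = euler_step_with_sample_score(x_clean, z, t_np1, t_n)
via_ct_pair = x_clean + t_n * z  # the point the CT loss feeds to the target network
```

The two values coincide up to floating-point rounding, which is why reusing the same `z` at both noise levels removes the need for a teacher.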

### How do you sample from a consistency model?

#### One-step sampling

After training, one-step generation is trivial: sample `x_T ~ N(0, T^2 I)` (or the equivalent under the chosen schedule), then output `f_theta(x_T, T)`. This single network call produces a candidate sample whose quality is competitive with one-step [GANs](/wiki/gan) and one-step diffusion distillations on standard benchmarks.[^1]

#### Multistep sampling

For users willing to spend a few additional network evaluations, the paper proposes a multistep sampler that trades compute for quality. Given a decreasing schedule `tau_1 > tau_2 > ... > tau_{N-1}`, the algorithm: (1) initialises `x_T`; (2) outputs an estimate `x_hat = f_theta(x_T, T)`; (3) re-injects fresh Gaussian noise of magnitude `sqrt(tau_n^2 - eps^2)` to obtain `x_{tau_n}`; (4) applies the consistency model again to get a refined `x_hat = f_theta(x_{tau_n}, tau_n)`; and repeats steps 3 to 4 for each intermediate time. Each iteration moves the sample partway back along an ODE trajectory and then re-collapses it to the data manifold, which empirically improves sample fidelity for the same network architecture.[^1] In practice, two to four steps recover most of the quality gap to many-step diffusion teachers on CIFAR-10 and ImageNet 64x64.[^1][^2]
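The sampler can be sketched in a few lines. To keep the example runnable without a trained network, it assumes one-dimensional toy data drawn from `N(0, DATA_STD^2)`, for which the consistency function has a closed form; `f_exact`, the constants, and the schedule `[10.0, 1.0]` are illustrative assumptions rather than values from the paper.

```python
import math
import random

DATA_STD, EPS, T = 0.5, 0.002, 80.0

def f_exact(x, t):
    # Closed-form consistency function for toy data ~ N(0, DATA_STD^2).
    return x * math.sqrt((DATA_STD**2 + EPS**2) / (DATA_STD**2 + t**2))

def multistep_sample(f, taus, rng):
    """taus: decreasing intermediate times in (EPS, T); an empty list gives one-step sampling."""
    x_hat = f(rng.gauss(0.0, T), T)                  # steps 1-2: draw x_T, map to data
    for tau in taus:
        noise = math.sqrt(tau**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x_tau = x_hat + noise                        # step 3: re-inject fresh noise
        x_hat = f(x_tau, tau)                        # step 4: collapse back to data
    return x_hat

rng = random.Random(0)
samples = [multistep_sample(f_exact, [10.0, 1.0], rng) for _ in range(4000)]
sample_std = math.sqrt(sum(s * s for s in samples) / len(samples))
```

With the exact toy function the extra steps cannot improve anything, so the sample standard deviation simply stays close to `DATA_STD`; with a learned, imperfect `f_theta` each re-noise-and-collapse round gives the network another chance to correct its earlier estimate.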

### Boundary, weighting, and metric choices

Three design choices are critical for stable training. First, the boundary condition `f_theta(x, eps) = x` must hold exactly, and the skip parameterisation accomplishes this with the EDM coefficients.[^1][^11] Second, the loss weighting `lambda(t)` and the metric `d` have a strong effect on which timesteps dominate the gradient: the original paper used LPIPS, which provided strong perceptual gradients on natural images but introduced a dependence on an auxiliary VGG feature network.[^1] Third, the discretisation of timesteps and the EMA decay rate for the target network `theta^-` interact non-trivially with the network's tendency to collapse. The follow-up paper arXiv:2310.14189 addresses each of these in turn.[^2]

## What did Improved Techniques for Training Consistency Models (iCT) change?

In October 2023 Song and Dhariwal published "Improved Techniques for Training Consistency Models" (arXiv:2310.14189), focusing on closing the gap between consistency training and consistency distillation so that practitioners do not need a pre-trained diffusion teacher.[^2] The paper identified an overlooked flaw in prior CT theory: applying an exponential moving average to the teacher network distorts the consistency-training objective. The proposed correction simply removes the EMA from `theta^-`, using the current weights directly as the target, which the authors prove is consistent with the underlying ODE.[^2]

Other contributions include:

- **Pseudo-Huber loss.** The LPIPS perceptual metric is replaced with a Pseudo-Huber loss `sqrt(||a - b||^2 + c^2) - c`, where `c` is a small constant. This avoids the dependence on a learned VGG network, sidesteps an evaluation-bias issue when training and evaluation share the same perceptual features, and provides smoother optimisation.[^2]
- **Lognormal noise schedule.** Noise levels for CT are drawn from a lognormal distribution over `sigma`, concentrating training on the regime where the loss surface is most informative.[^2]
- **Discretisation curriculum.** The number of timesteps `N` is doubled at regular intervals during training, providing a coarse-to-fine schedule that begins with a few large jumps and progresses to a denser discretisation.[^2]
- **Architectural cleanups.** Minor changes to the U-Net (such as removing certain attention layers at higher resolutions) and a careful re-tuning of optimisation hyperparameters.[^2]
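Two of the items above lend themselves to a short sketch: the Pseudo-Huber distance, using the formula quoted in the list, and a doubling curriculum for `N`. The curriculum function is hypothetical; its starting value, cap, and doubling interval are placeholder assumptions and not the schedule from the paper.

```python
import math

def pseudo_huber(a, b, c):
    """sqrt(||a - b||^2 + c^2) - c for two equal-length vectors a and b."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.sqrt(sq_dist + c * c) - c

def num_timesteps(step, n_start=10, n_max=1280, double_every=50_000):
    """Hypothetical curriculum: N doubles every `double_every` steps, capped at n_max."""
    return min(n_start * 2 ** (step // double_every), n_max)

small = pseudo_huber([0.001], [0.0], c=1.0)       # behaves like 0.5 * r^2 / c
large = pseudo_huber([3.0, 4.0], [0.0, 0.0], c=0.01)  # behaves like ||r|| - c
```

For residuals much smaller than `c` the distance is approximately quadratic, and for residuals much larger than `c` it grows linearly, which limits the influence of outlier pairs compared with squared L2.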

Together these techniques cut the one-step FID on CIFAR-10 from 8.70 (original CT) to 2.51, and on ImageNet 64x64 from 13.0 to 3.25; two-step sampling further reduced these to 2.24 and 2.77 respectively. In both cases iCT trained from scratch matched or surpassed the original consistency distillation, demonstrating that a diffusion teacher is not strictly necessary.[^2]

## What are Consistency Trajectory Models (CTM)?

In parallel with iCT, in October 2023 Dongjun Kim and collaborators at [Sony](/wiki/sony) AI and Stanford University proposed **Consistency Trajectory Models (CTM)** in "Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion" (arXiv:2310.02279), accepted to ICLR 2024.[^19] CTM generalises both consistency models and score-based diffusion models as special cases. Where the original consistency function only maps a point to the trajectory's origin near `t = 0`, CTM learns a more general decoder `G(x_t, t, s)` that can jump from any starting time `t` to any earlier target time `s` along the same PF ODE trajectory, an "anytime-to-anytime" traversal.[^19] Setting `s = 0` recovers consistency-style one-step generation, while taking the limit of infinitesimal jumps recovers the underlying score, so a single network exposes both the score function and direct trajectory jumps.[^19]
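The "anytime-to-anytime" idea can be illustrated on the same kind of toy problem used for checking ODE formulas: one-dimensional data from `N(0, DATA_STD^2)` under the variance-exploding process, where the exact jump has a closed form. This is an assumption made for illustration and not CTM's learned network; the names `G`, `DATA_STD`, and the step size `h` are placeholders.

```python
import math

DATA_STD = 0.5

def G(x_t, t, s):
    """Exact PF ODE jump from time t to an earlier time s for toy Gaussian data."""
    return x_t * math.sqrt((DATA_STD**2 + s**2) / (DATA_STD**2 + t**2))

x_t, t = 3.0, 10.0

# Jumping straight to s = 0 matches two shorter hops along the same trajectory.
direct = G(x_t, t, 0.0)
two_hops = G(G(x_t, t, 2.0), 2.0, 0.0)

# An infinitesimal jump recovers the ODE velocity, and from it the score:
# dx/dt = -t * score  =>  score = -(dx/dt) / t.
h = 1e-6
velocity = (x_t - G(x_t, t, t - h)) / h
score_from_jump = -velocity / t
score_exact = -x_t / (DATA_STD**2 + t**2)
```

The first comparison shows the composition property that makes long and short jumps interchangeable; the second shows how a model of `G` also exposes the score.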

This design gives CTM two practical advantages over plain consistency models. First, because the network still provides the score, CTM supports a clean quality-versus-compute trade-off: an alternating sampler ("gamma-sampling") interleaves deterministic long jumps with score-based denoising, so adding sampling steps reliably improves quality rather than saturating.[^19] Second, access to the score streamlines likelihood evaluation and the reuse of conditional-generation techniques developed for diffusion models.[^19] CTM is trained with a combination of a trajectory-matching (soft consistency) loss, a denoising score-matching loss, and an optional adversarial (GAN) loss on the decoded samples.[^19] With these ingredients CTM reported one-step (NFE=1) FID scores of 1.73 on CIFAR-10 and 1.92 on ImageNet 64x64, state-of-the-art for single-step generation at the time of publication; an official PyTorch implementation was released by Sony.[^19] The trajectory-mapping idea was later carried into latent text-to-image acceleration by **Trajectory Consistency Distillation (TCD)** (Zheng et al., arXiv:2402.19159, February 2024), which distills a semi-linear trajectory consistency function into an [SDXL](/wiki/sdxl) [LoRA](/wiki/lora) and improves detail at low step counts without the adversarial training used in some competing methods.[^21]

## How do continuous-time consistency models and sCM work?

Although the original consistency models can be formulated in continuous time (with `t` ranging over a real interval and a tangent-based loss), in practice the published results all relied on a discrete grid of timesteps. Continuous-time CT was numerically unstable for the original parameterisation. In October 2024 Cheng Lu and Yang Song addressed this in "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models" (arXiv:2410.11081), introducing the **sCM** family.[^4][^8]

### TrigFlow

The paper proposes **TrigFlow**, a parameterisation that unifies the EDM and [Flow Matching](/wiki/flow_matching) frameworks using purely trigonometric coefficients. The forward process is written as `x_t = cos(t) x_0 + sin(t) z` for `t in [0, pi/2]`, with `z` drawn from a Gaussian of standard deviation `sigma_d`, and the model is parameterised as `f_theta(x_t, t) = cos(t) x_t - sin(t) sigma_d F_theta(x_t / sigma_d, c_noise(t))`, that is, `c_skip(t) = cos(t)`, `c_out(t) = -sigma_d sin(t)`, and `c_in(t) = 1 / sigma_d`.[^4] These clean expressions remove the discontinuities and large-magnitude derivatives that plagued the EDM-style continuous-time consistency objective near the endpoints of the trajectory. Because `cos^2(t) + sin^2(t) = 1`, the marginal variance of `x_t` stays at the data scale `sigma_d^2` for every `t` when the data itself has standard deviation `sigma_d`, which helps keep the ODE well-conditioned.[^4]
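A small sketch of the TrigFlow pieces follows, written under my reading of the sCM paper: the noise `z` has standard deviation `sigma_d`, the consistency function takes the form `cos(t) x_t - sin(t) sigma_d F(x_t / sigma_d, c_noise(t))`, and an EDM noise level `sigma` corresponds to `t = arctan(sigma / sigma_d)`. The value `SIGMA_D = 0.5`, the identity `c_noise(t) = t`, and the stand-in network are assumptions for illustration.

```python
import math
import random

SIGMA_D = 0.5  # assumed data standard deviation

def trigflow_noisy(x0, z, t):
    """Forward process x_t = cos(t) x_0 + sin(t) z, with z ~ N(0, SIGMA_D^2)."""
    return math.cos(t) * x0 + math.sin(t) * z

def trigflow_f(F, x_t, t):
    """c_skip = cos t, c_out = -SIGMA_D sin t, c_in = 1 / SIGMA_D, c_noise(t) = t."""
    return math.cos(t) * x_t - math.sin(t) * SIGMA_D * F(x_t / SIGMA_D, t)

def sigma_to_t(sigma):
    """Map an EDM-style noise level to TrigFlow time in [0, pi/2)."""
    return math.atan(sigma / SIGMA_D)

# Boundary condition: at t = 0 the output equals the input for any network F.
at_zero = trigflow_f(lambda x, t: 99.0, 1.23, 0.0)

# Variance check: with data and noise both at scale SIGMA_D, Var[x_t] stays SIGMA_D^2.
rng = random.Random(0)
xs = [trigflow_noisy(rng.gauss(0, SIGMA_D), rng.gauss(0, SIGMA_D), 0.7) for _ in range(4000)]
var_xt = sum(x * x for x in xs) / len(xs)
```

The boundary condition holds by construction, as in the EDM-style skip parameterisation, but here every coefficient is a bounded, smooth function of `t`.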

### Stability fixes

The authors identify several specific sources of instability in continuous-time CT and patch each:

- An **identity time transformation** `c_noise(t) = t` replaces EDM's `log(sigma_d tan t)` to avoid blow-ups near `t = pi/2`.[^4]
- **Positional embeddings** of time replace high-scale Fourier features, reducing variance in time-derivative gradients.[^4]
- **Adaptive double normalisation** modifies adaptive group normalisation to apply pixel-wise normalisation on scale and bias signals.[^4]
- **Tangent normalisation** rescales the tangent vector that appears in the continuous-time loss, preventing isolated large gradients from destabilising training.[^4]
- An **adaptive weighting** function `w_phi(t)` is learned alongside the model to equalise loss variance across timesteps.[^4]
- A **tangent warmup** linearly ramps the coefficient of the unstable term over the first 10,000 iterations.[^4]
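Two of the fixes in the list above are simple enough to sketch. The tangent normalisation below uses the form `g / (||g|| + c)`, which is my understanding of the paper's choice; the constant `c = 0.1` and both function names are assumptions. The warmup follows the linear ramp over 10,000 iterations described in the list.

```python
import math

def normalise_tangent(g, c=0.1):
    """Rescale a tangent vector to g / (||g|| + c); c is an assumed small constant."""
    norm = math.sqrt(sum(v * v for v in g))
    return [v / (norm + c) for v in g]

def tangent_warmup(iteration, warmup_iters=10_000):
    """Coefficient that ramps linearly from 0 to 1 over the first warmup_iters iterations."""
    return min(1.0, iteration / warmup_iters)

# A very large tangent is squashed to norm just below 1; its direction is unchanged.
huge = normalise_tangent([3e6, 4e6])
huge_norm = math.sqrt(sum(v * v for v in huge))
```

Normalising bounds the magnitude of the tangent term regardless of how large the raw time derivative is, while the warmup keeps that term switched off early in training when the network is least reliable.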

### How well does sCM scale, and what FID does it reach?

With these fixes, the sCM training algorithm reliably scales to large models. The paper reports a 1.5 billion parameter model trained on ImageNet 512x512, at the time of publication the largest continuous-time consistency model reported.[^4] Reported two-step FIDs are 2.06 on CIFAR-10, 1.48 on ImageNet 64x64, and 1.88 on ImageNet 512x512; the ImageNet 512 number is within roughly 10% of the best diffusion-model FIDs at the same resolution, while sCM uses only two function evaluations.[^4] OpenAI's accompanying communications described an approximately 50x wall-clock speedup at inference over standard diffusion sampling, with a single sample from the 1.5B sCM generated in about 0.11 seconds on one NVIDIA A100 GPU.[^13]

The paper distinguishes **sCD** (continuous-time consistency distillation), which uses a teacher diffusion model, from **sCT** (continuous-time consistency training), which trains from scratch. Both variants benefit from the TrigFlow parameterisation and the stability machinery; sCD reaches the best reported FIDs and converges in roughly 20,000 fine-tuning iterations from a strong teacher.[^4] sCM was accepted to ICLR 2025.[^8]

### Easy Consistency Tuning (ECT)

A complementary 2024 line attacks the training cost rather than the asymptotic quality. **Easy Consistency Tuning (ECT)**, introduced by Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J. Zico Kolter in "Consistency Models Made Easy" (arXiv:2406.14548), reframes consistency training as a lightweight fine-tune of an already-trained diffusion model that progressively tightens the consistency condition over a shrinking time gap.[^20] Because it starts from a pre-trained diffusion checkpoint rather than from scratch, ECT reaches a two-step FID of 2.73 on CIFAR-10 in about one hour on a single NVIDIA A100 GPU, matching consistency distillations that previously required hundreds of GPU hours; the paper was accepted to ICLR 2025.[^20]

## What are the main variants and downstream uses?

### Latent Consistency Models (LCM)

In October 2023, Simian Luo and collaborators at Tsinghua University published "Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference" (arXiv:2310.04378), which transposes the consistency-model recipe into the latent space of a pre-trained latent diffusion model such as [Stable Diffusion](/wiki/stable_diffusion).[^5] Rather than running the consistency objective on pixels, the [latent consistency model](/wiki/latent_consistency_models) distills the probability flow ODE of an existing LDM in the LDM's own latent space. The resulting models generate 768x768 images in two to four sampling steps, and the authors report that the distillation requires only about 32 A100 GPU hours, since both the latent autoencoder and the teacher diffusion model are already trained.[^5]

A follow-up paper from the same group, **LCM-LoRA: A Universal Stable-Diffusion Acceleration Module** (arXiv:2311.05556, November 2023), shows that the LCM distillation can be captured in a [LoRA](/wiki/lora) adapter that plugs into any fine-tuned Stable Diffusion or SDXL checkpoint without retraining, effectively acting as a "drop-in PF ODE solver."[^6] LCM-LoRA modules for SD 1.5, SSD-1B, and [SDXL](/wiki/sdxl) were released openly via [Hugging Face](/wiki/hugging_face) and are widely used in real-time generation interfaces.[^6] The dedicated [Latent Consistency Models](/wiki/latent_consistency_models) article covers the augmented PF ODE, the guidance-scale embedding, and the Diffusers `LCMScheduler` in more detail.

### Consistency models for audio and speech

The consistency-distillation principle has been transferred to audio. **ConsistencyTTA** (arXiv:2309.10740, September 2023) applies CFG-aware latent consistency distillation to a diffusion text-to-audio model, reducing inference from hundreds of NFEs to a single network query while maintaining CLAP-score quality.[^14] Follow-on systems such as CoMoSpeech and AudioLCM extend the idea to text-to-speech and singing voice synthesis with one to four step generation.[^14]

### How do consistency models compare to other few-step methods?

The most directly comparable acceleration technique is **progressive distillation** (Salimans and Ho, ICLR 2022), which iteratively halves the number of sampling steps by training each student to reproduce two teacher steps in one.[^10] On CIFAR-10 and ImageNet 64x64, the original consistency distillation results matched or exceeded progressive distillation at one and two sampling steps without the repeated student-teacher distillation cycle.[^1] **DPM-Solver** and its variants (DPM-Solver-2, DPM-Solver++, DPM-Solver-v3) are training-free higher-order ODE solvers that improve sample quality in the 5 to 20 step regime; they remain competitive when more compute is available but cannot reach the one to two step quality of consistency models on most benchmarks.[^9] Consistency models are also related to **[rectified flow](/wiki/rectified_flow)** (Liu et al., ICLR 2023), which straightens the noise-to-data transport so that a few-step (or, after reflow, near-one-step) Euler integration suffices; both pursue short generative paths, but rectified flow straightens the trajectory while consistency models collapse a fixed curved trajectory into a learned jump.[^4] **Adversarial Diffusion Distillation (ADD)**, used in Stability AI's SDXL Turbo (November 2023), combines score distillation with a GAN-style adversarial loss to enable one to four step generation from large foundation diffusion models, and is a popular alternative for text-to-image acceleration.[^15]

| Method | Training cost | Sampling steps | One-step CIFAR-10 FID | Notes |
|---|---|---|---|---|
| Diffusion (EDM) teacher | High | ~35 | n/a | Baseline; many NFEs required[^11] |
| Progressive distillation | Multi-stage | 1 to 4 | 9.12[^1] | Iterative teacher to student halving[^10] |
| Consistency distillation (CD) | Single stage from teacher | 1 to 4 | 3.55[^1] | Original 2023 paper[^1] |
| Consistency training (CT) | From scratch | 1 to 4 | 8.70[^1] | No teacher diffusion model[^1] |
| Improved CT (iCT) | From scratch | 1 to 4 | 2.51[^2] | Lognormal sched + Pseudo-Huber[^2] |
| CTM | From teacher (+ GAN, score) | 1 to a few | 1.73[^19] | Anytime-to-anytime jumps; exposes score[^19] |
| ECT | Fine-tune from diffusion | 1 to 2 | n/a (2-step 2.73)[^20] | ~1 A100 hour on CIFAR-10[^20] |
| sCT (continuous-time) | From scratch | 1 to 2 | n/a (2-step 2.06)[^4] | TrigFlow + stability fixes[^4] |
| sCD (continuous-time) | Single stage from teacher | 1 to 2 | n/a (2-step 1.88 on ImageNet 512)[^4] | Scales to 1.5B parameters[^4] |
| DPM-Solver | Training-free | ~10 | n/a | Higher-order ODE solver[^9] |
| ADD (SDXL Turbo) | Adversarial + distillation | 1 to 4 | n/a | Used in real-time SDXL[^15] |

(FID values are quoted from the original publications cited above and may differ slightly across re-runs. The FID column refers to one-step sampling on CIFAR-10; where a method does not report that figure, the entry is marked n/a and the closest reported result is given in parentheses.)

## What are consistency models used for?

### Real-time and interactive generation

The most prominent practical impact of consistency models has been in **real-time image generation**. LCM and LCM-LoRA enabled the first widely-deployed sub-second 1024x1024 Stable Diffusion inference on commodity GPUs, and they underpin many interactive editors and "live canvas" interfaces released during late 2023 and 2024.[^5][^6][^12] The sCM line shows that the underlying ideas extend to billion-parameter image generators, with two-step generation coming within about 10% of the best diffusion FID scores on ImageNet 512.[^4][^13]

### Edge and on-device deployment

Because consistency models keep most of the diffusion-model machinery (U-Net or transformer backbone, latent VAE if used) intact, they slot into existing inference frameworks. The combination of LCM-LoRA with quantisation has been used to fit Stable Diffusion variants onto laptops and phones; tooling such as the Intel OpenVINO LCM notebook and various Hugging Face Diffusers pipelines document such deployments.[^12]

### Zero-shot editing

The original paper showed that consistency models inherit the **zero-shot editing** capabilities of their diffusion teachers: by initialising the sampler with a partially-noised image or with a noise vector conditioned on a mask, the same trained model can perform inpainting, colorisation, and super-resolution without any task-specific fine-tuning.[^1] These properties are routinely exploited in LCM-based image editors and downstream pipelines.[^1][^5]

### Theoretical interest

Beyond engineering, consistency models contributed a clean theoretical viewpoint on the trade-off between [diffusion](/wiki/diffusion_model) sampling speed and quality. They show that the entire PF ODE trajectory can in principle be summarised by a single function from `(x_t, t)` to `x_eps`, and that this function can be learned by enforcing local self-consistency along the trajectory rather than reproducing the noise sequence exactly. This perspective has informed subsequent work on "shortcut" learning, flow matching, and one-step diffusion variants, and convergence guarantees for both single-step and multistep consistency sampling have been studied (for example, arXiv:2308.11449 and arXiv:2505.03194).[^16][^17]

## What are the limitations of consistency models?

Despite their successes, consistency models have well-documented practical drawbacks. The original paper used the LPIPS perceptual metric, which introduces an external dependency on a VGG feature extractor and may bias the FID evaluation when training and evaluation share perceptual features; iCT replaced LPIPS with Pseudo-Huber to mitigate this concern.[^2] Even after the iCT and sCM fixes, a quality gap to the strongest diffusion teachers persists in the one- and two-step regime on the largest benchmarks: the sCM authors describe their two-step result as narrowing the FID gap on ImageNet 512x512 to within about 10% of the best diffusion FIDs, but not closing it.[^4] One-step sampling also offers no clear knob for guidance strength comparable to classifier-free guidance scaling in the multistep regime, and naive use of strong guidance during distillation tends to bake in artifacts.[^6][^15]

Training instability in the continuous-time setting was the central problem motivating sCM, and even the patched TrigFlow recipe relies on a careful combination of adaptive weighting, time conditioning, normalisation, and a tangent warmup; without these, continuous-time consistency training diverges.[^4] Discrete-time CT remains hyperparameter-sensitive, with the lognormal noise schedule, discretisation curriculum, and EMA settings all having a strong effect on outcomes.[^2] The community has also documented that pure consistency objectives can get trapped in local optima that sacrifice endpoint fidelity to global self-consistency, leading some authors to combine consistency training with auxiliary adversarial or energy-based losses to recover the last bits of FID; CTM's optional GAN loss is one example of this pattern.[^18][^15][^19]

Finally, consistency models inherit the limitations and biases of their diffusion teachers when used in CD mode, since the teacher's score function defines the trajectories along which self-consistency is enforced. A noisy or miscalibrated teacher produces a noisy or miscalibrated student.[^1][^15]

## Related work

- [Diffusion model](/wiki/diffusion_model): the underlying generative paradigm whose ODE trajectories consistency models compress to a single function.[^1][^7]
- [DDPM](/wiki/ddpm) (Denoising Diffusion Probabilistic Models): the discrete-time precursor that established the noise prediction objective reused by consistency model backbones.[^1]
- [Latent diffusion model](/wiki/latent_diffusion) and [Stable Diffusion](/wiki/stable_diffusion): the latent-space diffusion framework that LCM and LCM-LoRA accelerate.[^5][^6]
- [Latent Consistency Models](/wiki/latent_consistency_models): the sibling line that ports the consistency recipe to the latent space of Stable Diffusion for few-step text-to-image generation.[^5][^6]
- [Knowledge distillation](/wiki/knowledge_distillation): the broader teacher-to-student training paradigm; consistency distillation is a specific instance applied to ODE trajectories.[^1][^10]
- [Flow Matching](/wiki/flow_matching) and [rectified flow](/wiki/rectified_flow): related continuous-transport frameworks; TrigFlow unifies the flow matching and EDM formulations, and rectified flow pursues straight paths that can be traversed in few steps.[^4]
- [Variational Autoencoder](/wiki/variational_autoencoder): provides the encoder/decoder pair used by latent diffusion and therefore by latent consistency models.[^5]
- [U-Net](/wiki/unet): the standard architecture for the denoiser network used by both the diffusion teacher and the consistency student.[^1][^11]
- [LoRA (Low-Rank Adaptation)](/wiki/lora): the parameter-efficient fine-tuning method used in LCM-LoRA and TCD to ship consistency distillations as plug-in adapters.[^6][^21]
- [Generative model](/wiki/generative_model) and [GAN](/wiki/gan): alternative one-step generators that consistency models compete with on FID and Inception Score at low NFE budgets.[^1]
- [AI image generation](/wiki/ai_image_generation) and [text-to-video generation](/wiki/text_to_video): application domains where consistency-based accelerators have been deployed in production systems and research.[^5][^6][^12]

## See also

- [Diffusion model](/wiki/diffusion_model)
- [DDPM](/wiki/ddpm)
- [Latent diffusion model](/wiki/latent_diffusion)
- [Latent Consistency Models](/wiki/latent_consistency_models)
- [Stable Diffusion](/wiki/stable_diffusion)
- [SDXL (Stable Diffusion XL)](/wiki/sdxl)
- [Flow Matching](/wiki/flow_matching)
- [Rectified Flow](/wiki/rectified_flow)
- [Knowledge Distillation](/wiki/knowledge_distillation)
- [U-Net](/wiki/unet)
- [Variational Autoencoder](/wiki/variational_autoencoder)
- [LoRA (Low-Rank Adaptation)](/wiki/lora)
- [Generative Model](/wiki/generative_model)
- [Generative AI](/wiki/generative_ai)
- [Generative Adversarial Network (GAN)](/wiki/gan)
- [AI Image Generation](/wiki/ai_image_generation)
- [ImageNet](/wiki/imagenet)
- [CIFAR-10](/wiki/cifar_10)
- [OpenAI](/wiki/openai)
- [Ilya Sutskever](/wiki/ilya_sutskever)
- [ICML](/wiki/icml)
- [AI Video Generation](/wiki/ai_video_generation)
- [Text-to-video generation](/wiki/text_to_video)
- [Hugging Face](/wiki/hugging_face)

## References

[^1]: Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever, "Consistency Models", arXiv preprint 2303.01469, 2023-03-02. https://arxiv.org/abs/2303.01469. Accessed 2026-06-28.
[^2]: Yang Song, Prafulla Dhariwal, "Improved Techniques for Training Consistency Models", arXiv preprint 2310.14189, 2023-10-22. https://arxiv.org/abs/2310.14189. Accessed 2026-06-28.
[^3]: Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever, "Consistency Models", Proceedings of the 40th International Conference on Machine Learning (PMLR vol. 202, pp. 32211-32252), 2023-07-23. https://proceedings.mlr.press/v202/song23a.html. Accessed 2026-06-28.
[^4]: Cheng Lu, Yang Song, "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models", arXiv preprint 2410.11081, 2024-10-14. https://arxiv.org/abs/2410.11081. Accessed 2026-06-28.
[^5]: Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao, "Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference", arXiv preprint 2310.04378, 2023-10-06. https://arxiv.org/abs/2310.04378. Accessed 2026-06-28.
[^6]: Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinario Passos, Longbo Huang, Jian Li, Hang Zhao, "LCM-LoRA: A Universal Stable-Diffusion Acceleration Module", arXiv preprint 2311.05556, 2023-11-09. https://arxiv.org/abs/2311.05556. Accessed 2026-06-28.
[^7]: Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole, "Score-Based Generative Modeling through Stochastic Differential Equations", International Conference on Learning Representations (ICLR) 2021 (Oral), 2021-02-10. https://openreview.net/forum?id=PxTIG12RRHS. Accessed 2026-06-28.
[^8]: OpenReview, "Simplifying, Stabilizing and Scaling Continuous-time Consistency Models", ICLR 2025 paper page, 2025-03-01. https://openreview.net/forum?id=LyJi5ugyJx. Accessed 2026-06-28.
[^9]: Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps", NeurIPS 2022 (arXiv:2206.00927), 2022-06-02. https://arxiv.org/abs/2206.00927. Accessed 2026-06-28.
[^10]: Tim Salimans, Jonathan Ho, "Progressive Distillation for Fast Sampling of Diffusion Models", International Conference on Learning Representations (ICLR) 2022 (arXiv:2202.00512), 2022-02-01. https://arxiv.org/abs/2202.00512. Accessed 2026-06-28.
[^11]: Tero Karras, Miika Aittala, Timo Aila, Samuli Laine, "Elucidating the Design Space of Diffusion-Based Generative Models", NeurIPS 2022 (arXiv:2206.00364), 2022-06-01. https://arxiv.org/abs/2206.00364. Accessed 2026-06-28.
[^12]: Intel OpenVINO Documentation, "Image generation with Latent Consistency Model and OpenVINO", 2024 OpenVINO docs notebook. https://docs.openvino.ai/2024/notebooks/latent-consistency-models-image-generation-with-output.html. Accessed 2026-06-28.
[^13]: VentureBeat (Carl Franzen), "OpenAI researchers develop new model that speeds up media generation by 50X", VentureBeat, 2024-10-23. https://venturebeat.com/ai/openai-researchers-develop-new-model-that-speeds-up-media-generation-by-50x. Accessed 2026-06-28.
[^14]: Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi, "ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation", arXiv preprint 2309.10740, 2023-09-19. https://arxiv.org/abs/2309.10740. Accessed 2026-06-28.
[^15]: Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach, "Adversarial Diffusion Distillation", Stability AI Research preprint (and arXiv:2311.17042), 2023-11-28. https://stability.ai/research/adversarial-diffusion-distillation. Accessed 2026-06-28.
[^16]: Junlong Lyu, Zhitang Chen, Shoubo Feng, "Convergence Guarantee for Consistency Models", arXiv preprint 2308.11449, 2023-08-22. https://arxiv.org/abs/2308.11449. Accessed 2026-06-28.
[^17]: Anonymous authors, "Convergence of Consistency Model with Multistep Sampling", arXiv preprint 2505.03194, 2025-05-06. https://arxiv.org/abs/2505.03194. Accessed 2026-06-28.
[^18]: Shelly Golan, Roy Ganz, Michael Elad, "Enhancing Consistency-Based Image Generation via Adversarially-Trained Classification and Energy-Based Discrimination", arXiv preprint 2405.16260, 2024-05-25. https://arxiv.org/abs/2405.16260. Accessed 2026-06-28.
[^19]: Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon, "Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion", International Conference on Learning Representations (ICLR) 2024 (arXiv:2310.02279), 2023-10-01. https://arxiv.org/abs/2310.02279. Accessed 2026-06-28.
[^20]: Zhengyang Geng, Ashwini Pokle, Weijian Luo, Justin Lin, J. Zico Kolter, "Consistency Models Made Easy", International Conference on Learning Representations (ICLR) 2025 (arXiv:2406.14548), 2024-06-20. https://arxiv.org/abs/2406.14548. Accessed 2026-06-28.
[^21]: Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao, Tat-Jen Cham, "Trajectory Consistency Distillation: Improved Latent Consistency Distillation by Semi-Linear Consistency Function with Trajectory Mapping", arXiv preprint 2402.19159, 2024-02-29. https://arxiv.org/abs/2402.19159. Accessed 2026-06-28.