Consistency Models

Updated Jun 28, 2026

Consistency models are a family of generative models, introduced by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever at OpenAI in March 2023, that learn to map any point along the probability flow ordinary differential equation (PF ODE) trajectory of a diffusion process directly to that trajectory's clean origin.^[1] By construction, a single network evaluation can transform pure noise into a sample, while iterative refinement is preserved through an optional multistep sampler that re-injects noise between calls.^[1]^[2] The original paper, "Consistency Models," reached arXiv as preprint 2303.01469 on 2 March 2023 and was accepted to ICML 2023; subsequent work has substantially improved training stability, scaled the formulation to billion-parameter image generators, and adapted the idea to latent, audio, and video domains.^[1]^[3]^[4]^[5]^[6]

The authors frame the goal directly in the paper's abstract: "we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality."^[1] On standard benchmarks the original models reached a one-step Fréchet Inception Distance (FID) of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64, the state of the art for single-step diffusion distillation at the time of publication.^[1]

The motivation is straightforward: diffusion models such as DDPM and the score-based stochastic differential equation family produce samples by solving a learned ODE or SDE backward from noise to data, which typically requires dozens to hundreds of sequential neural network evaluations.^[1]^[7] Consistency models retain the iterative trajectory of diffusion at training time but force the network to collapse the whole trajectory into a single mapping, enabling one-step or very-few-step generation. Two training regimes were proposed in the original paper: consistency distillation (CD), which uses a pre-trained diffusion model as a teacher, and consistency training (CT), which trains a consistency model from scratch on data.^[1] In October 2023 Song and Dhariwal published "Improved Techniques for Training Consistency Models" (arXiv:2310.14189), and in the same month Kim et al. proposed Consistency Trajectory Models (CTM, arXiv:2310.02279).^[2]^[19] The line was further generalised in October 2024 by Cheng Lu and Yang Song in "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models" (arXiv:2410.11081, often abbreviated sCM), which scaled the approach to a 1.5 billion parameter model on ImageNet 512x512.^[4] A closely related sibling line, Latent Consistency Models, transposes the same recipe into the latent space of a pre-trained latent diffusion model.^[5]

Property	Details
First public release	arXiv:2303.01469, 2 March 2023^[1]
Authors (original)	Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever^[1]
Affiliation	OpenAI^[1]
Venue	ICML 2023 (PMLR vol. 202, pp. 32211 to 32252)^[3]
Training regimes	Consistency Distillation (CD), Consistency Training (CT)^[1]
Sampling	One step by design; optional multistep refinement^[1]
Original one-step FID, CIFAR-10	3.55 (CD)^[1]
Original one-step FID, ImageNet 64x64	6.20 (CD)^[1]
Improved CT one-step FID, CIFAR-10 (iCT)	2.51^[2]
CTM one-step FID, CIFAR-10	1.73^[19]
sCM two-step FID, ImageNet 512x512 (1.5B)	1.88^[4]
Best known follow-ups	CTM (ICLR 2024)^[19], sCM (arXiv:2410.11081, ICLR 2025)^[4]^[8]

What is a consistency model?

A consistency model is a generative network that maps any noised point on a diffusion trajectory straight to the clean data point at the start of that trajectory, so a sample can be produced in a single forward pass instead of the dozens to hundreds of denoising steps a standard diffusion sampler needs.^[1] The model is defined by a consistency function that is required to be self-consistent: any two points sampled along the same probability flow ODE trajectory must be mapped to the same output.^[1] Because the mapping targets the trajectory origin directly, generation reduces to evaluating the function once on a pure-noise input, while an optional multistep sampler can spend a few extra evaluations to raise quality.^[1]^[2]

Background

Diffusion models and the probability flow ODE

Modern continuous-time diffusion models view sample generation as the time reversal of a stochastic process that gradually adds Gaussian noise to clean data. In the formulation by Song et al. (ICLR 2021), this forward process is described by an Itô stochastic differential equation (SDE), and there is a corresponding deterministic probability flow ODE whose solutions share the same time-marginal distributions as the SDE.^[7] In the variance-exploding parameterisation used by the consistency models paper, the forward process can be written as dx = sqrt(2 t) dW, so that at time t the noisy sample is x_t = x_0 + t z with z standard Gaussian, and its density p_t is the data density convolved with N(0, t^2 I); the associated probability flow ODE is dx_t/dt = -t s(x_t, t), where s is the score, i.e. the gradient of the log density of p_t.^[1]^[7]

Sampling proceeds by integrating this ODE backward from a large noise level T (where the marginal is essentially Gaussian) to a small terminal time eps > 0, using a pre-trained score network. With high-order ODE solvers such as Heun's method or DPM-Solver, this still requires tens of evaluations to reach competitive sample quality.^[7]^[9] Many earlier acceleration techniques (DDIM-style deterministic samplers, DPM-Solver, knowledge distillation variants such as progressive distillation) attempt to reduce this evaluation count, but each retains some iterative character or pays a quality penalty in the very-few-step regime.^[9]^[10]
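To make the backward integration concrete, the following is a minimal sketch (not the sampler of any cited paper) that applies Heun's method to the ODE dx/dt = -t s(x, t) on a one-dimensional toy problem. It assumes the data distribution is a zero-mean Gaussian with standard deviation SIGMA_DATA, for which the score is known in closed form, so no trained network is involved. The function names, the log-spaced time grid, and the step count are illustrative choices.

```python
import math

SIGMA_DATA = 0.5       # assumed std of the toy 1-D Gaussian data distribution
EPS, T = 0.002, 80.0   # terminal and initial noise levels

def gaussian_score(x, t):
    """Exact score of p_t = N(0, SIGMA_DATA^2 + t^2) for the toy data."""
    return -x / (SIGMA_DATA**2 + t**2)

def ode_drift(x, t):
    """Right-hand side of the probability flow ODE dx/dt = -t * score(x, t)."""
    return -t * gaussian_score(x, t)

def heun_sample(x_T, times):
    """Integrate the ODE along a decreasing list of times with Heun's method."""
    x = x_T
    for t_cur, t_next in zip(times[:-1], times[1:]):
        h = t_next - t_cur                   # negative: we move toward EPS
        d_cur = ode_drift(x, t_cur)
        x_euler = x + h * d_cur              # Euler predictor
        d_next = ode_drift(x_euler, t_next)
        x = x + 0.5 * h * (d_cur + d_next)   # trapezoidal corrector
    return x

def log_spaced_times(n_steps):
    """Decreasing noise levels from T to EPS, evenly spaced in log(t) (illustrative)."""
    ratio = (EPS / T) ** (1.0 / n_steps)
    return [T * ratio**i for i in range(n_steps + 1)]

x_T = 40.0  # a fixed starting point standing in for a draw from N(0, T^2)
x_eps_numeric = heun_sample(x_T, log_spaced_times(200))
# For this toy problem the trajectory is known exactly: x_t is proportional
# to sqrt(SIGMA_DATA^2 + t^2), so the endpoint can be written down directly.
x_eps_exact = x_T * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + T**2))
print(x_eps_numeric, x_eps_exact)
```

Each Heun step costs two score evaluations, so even this small example spends hundreds of evaluations on a single sample; with a learned score network every one of those would be a forward pass.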

The Karras EDM framework

The consistency models paper adopts the design space and noise schedule introduced by Karras et al. ("Elucidating the Design Space of Diffusion-Based Generative Models," NeurIPS 2022), commonly abbreviated EDM. EDM uses a variance-exploding parameterisation with noise levels discretised between sigma_min = 0.002 and sigma_max = 80, and a rho parameter of 7 that controls the spacing of timesteps along the trajectory: levels are placed uniformly in t^(1/rho), which concentrates them at low noise.^[11] The choice fixes the precise meaning of t in the probability flow ODE and supplies a backbone (a U-Net with EDM preconditioning) that the consistency model reuses.^[1]^[11]
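The discretisation can be sketched in a few lines. The snippet below assumes the usual EDM-style rule, in which levels are spaced uniformly in t^(1/rho); the function name and the choice of 18 levels are illustrative.

```python
def karras_timesteps(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Return n increasing noise levels t_1 = sigma_min < ... < t_n = sigma_max.

    Levels are spaced uniformly in t**(1/rho), which packs more of them
    near sigma_min than near sigma_max.
    """
    if n < 2:
        raise ValueError("need at least two noise levels")
    lo, hi = sigma_min ** (1.0 / rho), sigma_max ** (1.0 / rho)
    return [(lo + i / (n - 1) * (hi - lo)) ** rho for i in range(n)]

levels = karras_timesteps(18)
gaps = [b - a for a, b in zip(levels[:-1], levels[1:])]
print(levels[0], levels[-1])
print(gaps[0], gaps[-1])  # the first gap is far smaller than the last
```

With rho = 1 the grid would be uniform in t; larger rho shifts more of the budget toward small noise levels, where fine image detail is resolved.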

Why is one-step generation desirable?

Reducing the number of neural network evaluations (NFEs) per sample directly reduces latency, energy use, and serving cost. For text-to-image systems, a 25 to 50 step diffusion pass on a large model can dominate end-to-end inference time; cutting this to one to four steps without large quality loss enables interactive editing, real-time previews, and viable deployment on mobile or browser hardware.^[5]^[6]^[12] These pressures, plus the desire for a deeper theoretical understanding of when fast samplers are possible, motivated consistency models and a wave of related distillation methods that appeared in 2023 and 2024.^[4]^[5]^[6]^[12]

How do consistency models work?

The consistency function

Given a probability flow ODE trajectory {x_t} for t in [eps, T], the consistency function f is defined by f(x_t, t) = x_eps, mapping any point on the trajectory to the same endpoint near t = 0.^[1] The defining self-consistency property is that for any two times t, t' on the same trajectory, f(x_t, t) = f(x_t', t'). A learned parametric approximation f_theta is the model.^[1]
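As a worked example (a toy case, not taken from the paper), suppose the data distribution is a one-dimensional zero-mean Gaussian with variance sigma_d^2. The noised marginal, the score, the trajectory, and therefore the consistency function are then all available in closed form:

```latex
\begin{aligned}
p_t(x) &= \mathcal{N}\!\left(x;\, 0,\, \sigma_d^2 + t^2\right), \qquad
s(x, t) = \nabla_x \log p_t(x) = -\frac{x}{\sigma_d^2 + t^2}, \\
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= -t\, s(x_t, t) = \frac{t\, x_t}{\sigma_d^2 + t^2}
\;\;\Longrightarrow\;\;
x_t = x_\epsilon \sqrt{\frac{\sigma_d^2 + t^2}{\sigma_d^2 + \epsilon^2}}, \\
f(x_t, t) &= x_\epsilon = x_t \sqrt{\frac{\sigma_d^2 + \epsilon^2}{\sigma_d^2 + t^2}} .
\end{aligned}
```

Here f(x, eps) = x, and any two points x_t and x_t' on the same trajectory are sent to the same x_eps, which is exactly the self-consistency property. For real data no such closed form exists, which is why f has to be learned.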

To make the function easy to train and to enforce a clean terminal behaviour, the network must satisfy the boundary condition f_theta(x, eps) = x. The authors implement this via a skip parameterisation: f_theta(x, t) = c_skip(t) x + c_out(t) F_theta(x, t), where F_theta is a free neural network (typically a U-Net inherited from the diffusion teacher), and c_skip, c_out are differentiable scalar functions satisfying c_skip(eps) = 1 and c_out(eps) = 0, so the boundary is met by construction without an architectural hack.^[1] EDM-style preconditioning provides convenient closed forms for c_skip(t) and c_out(t) that work well empirically.^[1]^[11]
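The boundary behaviour of the skip parameterisation can be checked numerically. The sketch below uses EDM-style coefficients shifted by eps, with sigma_data = 0.5. These closed forms are stated here as an assumption rather than quoted from the text above, and toy_network is a hypothetical stand-in for the free network F_theta.

```python
import math

SIGMA_DATA = 0.5  # data standard deviation used by EDM-style preconditioning (assumed)
EPS = 0.002       # smallest noise level

def c_skip(t):
    """Weight on the identity path; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)

def c_out(t):
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)

def consistency_fn(network, x, t):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t) for a scalar x."""
    return c_skip(t) * x + c_out(t) * network(x, t)

def toy_network(x, t):
    """Hypothetical stand-in for F_theta; any bounded function works here."""
    return math.tanh(3.0 * x) - 0.1 * t

x = 1.2345
print(consistency_fn(toy_network, x, EPS))   # the network term is multiplied by zero here
print(consistency_fn(toy_network, x, 80.0))  # at large t the network term dominates
```

Because the coefficients alone enforce f_theta(x, eps) = x, the boundary condition holds for any weights, including at initialisation.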

How are consistency models trained?

The original paper proposed two ways to train the consistency function, summarised here and detailed in the two subsections that follow.^[1]

Training mode	Needs a diffusion teacher?	Source of the trajectory pair	Original one-step CIFAR-10 FID
Consistency distillation (CD)	Yes	A single ODE-solver step using the teacher's score	3.55^[1]
Consistency training (CT)	No	Two adjacent noise levels added to the same clean sample	8.70^[1]

Consistency distillation (CD)

In consistency distillation, a pre-trained diffusion model supplies the gradients of the probability flow ODE. Training proceeds by drawing a data sample x, a noise level t_{n+1} from a fixed discretisation eps = t_1 < t_2 < ... < t_N = T, and a Gaussian noise vector to form x_{t_{n+1}}. A single step of an ODE solver phi (typically Heun's method using the teacher's score) approximates x_{t_n} from x_{t_{n+1}}. The student is then trained to match itself at these two adjacent points on the trajectory, using the loss

L_CD(theta, theta^-; phi) = E[ lambda(t_n) d( f_theta(x_{t_{n+1}}, t_{n+1}), f_{theta^-}(x_hat^phi_{t_n}, t_n) ) ]

where d is a distance such as squared L2 or LPIPS, lambda a weighting function, and theta^- a slow-moving target parameter set updated by an exponential moving average (EMA) of the trained parameters.^[1] Because both arguments lie on (an approximation of) the same trajectory, minimising this loss enforces self-consistency along that trajectory. The teacher only enters through the single ODE step, never through a direct supervision signal on the endpoint.^[1]
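One evaluation of this loss can be sketched on a scalar toy problem. The example assumes Gaussian toy data so that the "teacher" score is exact, uses squared L2 for d and lambda = 1, and models the student as a one-parameter function. All names, the uniform time grid, and the EMA rate are illustrative, and the gradient step on theta is omitted.

```python
import math
import random

SIGMA_DATA, EPS, T = 0.5, 0.002, 80.0

def teacher_score(x, t):
    """Exact score of the toy teacher: data ~ N(0, SIGMA_DATA^2), p_t = N(0, SIGMA_DATA^2 + t^2)."""
    return -x / (SIGMA_DATA**2 + t**2)

def heun_step(x, t_from, t_to):
    """One Heun step of dx/dt = -t * score(x, t) from t_from to t_to."""
    d_from = -t_from * teacher_score(x, t_from)
    x_pred = x + (t_to - t_from) * d_from
    d_to = -t_to * teacher_score(x_pred, t_to)
    return x + 0.5 * (t_to - t_from) * (d_from + d_to)

def student(theta, x, t):
    """Toy consistency model: skip parameterisation around a one-parameter 'network' theta * x."""
    c_skip = SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)
    return c_skip * x + c_out * theta * x

def cd_loss_sample(theta, theta_minus, times, rng):
    """Single-sample estimate of L_CD with lambda = 1 and squared L2 distance."""
    n = rng.randrange(len(times) - 1)            # index of the lower noise level t_n
    t_n, t_np1 = times[n], times[n + 1]
    x0 = rng.gauss(0.0, SIGMA_DATA)              # clean data sample
    x_np1 = x0 + t_np1 * rng.gauss(0.0, 1.0)     # noised point x_{t_{n+1}}
    x_n_hat = heun_step(x_np1, t_np1, t_n)       # teacher moves it to t_n
    online = student(theta, x_np1, t_np1)
    target = student(theta_minus, x_n_hat, t_n)  # treated as a constant (no gradient)
    return (online - target) ** 2

def ema_update(theta_minus, theta, mu=0.999):
    """theta^- <- mu * theta^- + (1 - mu) * theta."""
    return mu * theta_minus + (1.0 - mu) * theta

rng = random.Random(0)
times = [EPS + i * (T - EPS) / 17 for i in range(18)]  # uniform grid, for illustration only
theta, theta_minus = 0.3, 0.3
loss = sum(cd_loss_sample(theta, theta_minus, times, rng) for _ in range(1000)) / 1000
theta_minus = ema_update(theta_minus, theta=0.5)
print(loss, theta_minus)
```

In a real training loop the loss would be differentiated with respect to theta only, followed by an optimiser step and then the EMA update of theta^-.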

Consistency training (CT)

In consistency training, no diffusion teacher is needed. The pair of points on the trajectory is generated by adding scaled Gaussian noise to a clean data sample at two adjacent noise levels:

L_CT(theta, theta^-) = E[ lambda(t_n) d( f_theta(x + t_{n+1} z, t_{n+1}), f_{theta^-}(x + t_n z, t_n) ) ]

with the same Gaussian z used at both noise levels.^[1] The construction rests on the identity s(x_t, t) = -E[(x_t - x) / t^2 | x_t], which holds for the variance-exploding process: the quantity -(x_t - x) / t^2 computed from a single clean sample x is an unbiased estimate of the score, and substituting it into an Euler step from x + t_{n+1} z lands exactly on x + t_n z. The paper proves that, under mild conditions, the resulting loss approaches the distillation loss as the discretisation is refined.^[1] Consistency training thus yields a self-contained generative model in its own right, with no reliance on an external diffusion model. The original paper observed that CD outperforms CT at small to moderate model and compute scale, but that CT closes much of the gap with more iterations and larger networks.^[1]
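The reason a shared z suffices can be checked directly. In the sketch below (scalar case, illustrative names), an Euler step of the ODE dx/dt = -t s(x, t) is taken from x + t_{n+1} z with the score replaced by the single-sample estimate -(x_t - x) / t^2; the result coincides with x + t_n z.

```python
import random

def ct_pair(x0, z, t_n, t_np1):
    """Return the two CT inputs (x0 + t_{n+1} z, x0 + t_n z) built from one shared noise z."""
    return x0 + t_np1 * z, x0 + t_n * z

def euler_step_with_single_sample_score(x0, x_np1, t_n, t_np1):
    """Euler step of dx/dt = -t * s, with s replaced by the estimate -(x_t - x0) / t^2."""
    score_estimate = -(x_np1 - x0) / t_np1**2
    return x_np1 + (t_n - t_np1) * (-t_np1 * score_estimate)

rng = random.Random(0)
x0, z = rng.gauss(0.0, 0.5), rng.gauss(0.0, 1.0)
t_n, t_np1 = 1.0, 1.5
x_np1, x_n = ct_pair(x0, z, t_n, t_np1)
x_n_euler = euler_step_with_single_sample_score(x0, x_np1, t_n, t_np1)
print(x_n, x_n_euler)  # the two agree up to floating-point rounding
```

The teacher's ODE step in consistency distillation is therefore replaced by a data-dependent step that needs no trained score network.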

How do you sample from a consistency model?

One-step sampling

After training, one-step generation is trivial: sample x_T ~ N(0, T^2 I) (or the equivalent under the chosen schedule), then output f_theta(x_T, T). This single network call produces a candidate sample whose quality is competitive with one-step GANs and one-step diffusion distillations on standard benchmarks.^[1]

Multistep sampling

For users willing to spend a few additional network evaluations, the paper proposes a multistep sampler that trades compute for quality. Given a decreasing schedule tau_1 > tau_2 > ... > tau_{N-1}, the algorithm: (1) initialises x_T; (2) outputs an estimate x_hat = f_theta(x_T, T); (3) adds fresh Gaussian noise with standard deviation sqrt(tau_n^2 - eps^2) to x_hat to obtain x_{tau_n}; (4) applies the consistency model again to get a refined x_hat = f_theta(x_{tau_n}, tau_n); and repeats steps 3 to 4 for each intermediate time. Each iteration re-noises the current estimate to an intermediate noise level and then collapses it back to the data manifold, which empirically improves sample fidelity for the same network architecture.^[1] In practice, two to four steps recover most of the quality gap to many-step diffusion teachers on CIFAR-10 and ImageNet 64x64.^[1]^[2]
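The sampling loop above can be sketched as follows. To keep the example self-contained, the trained model is replaced by the closed-form consistency function of a toy one-dimensional Gaussian data distribution; that stand-in, the schedule [10.0, 1.0], and all names are assumptions for illustration.

```python
import math
import random

SIGMA_DATA, EPS, T = 0.5, 0.002, 80.0

def exact_consistency_fn(x, t):
    """Closed-form consistency function for toy data ~ N(0, SIGMA_DATA^2); stands in for f_theta."""
    return x * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))

def multistep_sample(f, taus, rng):
    """Multistep consistency sampling for a scalar sample.

    taus is a decreasing list of intermediate noise levels in (EPS, T);
    an empty list gives plain one-step sampling.
    """
    x_T = rng.gauss(0.0, T)        # x_T ~ N(0, T^2)
    x = f(x_T, T)                  # one-step estimate
    for tau in taus:
        z = rng.gauss(0.0, 1.0)    # fresh noise at every refinement step
        x_tau = x + math.sqrt(tau**2 - EPS**2) * z
        x = f(x_tau, tau)          # collapse back to the trajectory origin
    return x

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

rng = random.Random(0)
one_step = [multistep_sample(exact_consistency_fn, [], rng) for _ in range(5000)]
three_step = [multistep_sample(exact_consistency_fn, [10.0, 1.0], rng) for _ in range(5000)]
print(std(one_step), std(three_step))  # both should be close to SIGMA_DATA
```

With an exact consistency function the extra steps change nothing statistically; they pay off only when f_theta is an imperfect approximation, which is the practical case.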

Boundary, weighting, and metric choices

Three design choices are critical for stable training. First, the boundary condition f_theta(x, eps) = x must hold exactly, and the skip-parameterisation accomplishes this with the EDM coefficients.^[1]^[11] Second, the loss weighting lambda(t) and the metric d have a strong effect on which timesteps dominate the gradient: the original paper used LPIPS, which provided strong perceptual gradients on natural images but introduced a dependence on an auxiliary VGG feature network.^[1] Third, the discretisation of timesteps and the EMA decay rate for the target network theta^- interact non-trivially with the network's tendency to collapse. The iCT follow-up (arXiv:2310.14189) addresses each of these in turn.^[2]

What did Improved Techniques for Training Consistency Models (iCT) change?

In October 2023 Song and Dhariwal published "Improved Techniques for Training Consistency Models" (arXiv:2310.14189), focusing on closing the gap between consistency training and consistency distillation so that practitioners do not need a pre-trained diffusion teacher.^[2] The paper identified an overlooked flaw in prior CT theory: applying an exponential moving average to the target network theta^- (which the paper calls the teacher consistency model) distorts the consistency-training objective. The proposed correction simply removes the EMA from theta^-, using the current weights directly as the target, which the authors prove is consistent with the underlying ODE.^[2]

Other contributions include:

Pseudo-Huber loss. The LPIPS perceptual metric is replaced with a Pseudo-Huber loss sqrt(||a - b||^2 + c^2) - c, where c is a small constant. This avoids the dependence on a learned VGG network, sidesteps an evaluation-bias issue when training and evaluation share the same perceptual features, and provides smoother optimisation.^[2]
Lognormal noise schedule. Noise levels for CT are drawn from a lognormal distribution over sigma, concentrating training on the regime where the loss surface is most informative.^[2]
Discretisation curriculum. The number of timesteps N is doubled at regular intervals during training, providing a coarse-to-fine schedule that begins with a few large jumps and progresses to a denser discretisation.^[2]
Architectural cleanups. Minor changes to the U-Net (such as removing certain attention layers at higher resolutions) and a careful re-tuning of optimisation hyperparameters.^[2]
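Three of these ingredients are simple enough to sketch directly. The Pseudo-Huber function follows the formula given above; the curriculum and the lognormal sampler are simplified illustrations, and every default value (n_start, n_max, double_every, mean_log, std_log) is an assumed placeholder rather than a setting taken from this article.

```python
import math
import random

def pseudo_huber(a, b, c):
    """Pseudo-Huber distance sqrt(||a - b||^2 + c^2) - c between two equal-length vectors."""
    sq_norm = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.sqrt(sq_norm + c * c) - c

def doubling_curriculum(step, n_start=10, n_max=1280, double_every=50_000):
    """Discretisation steps at a training step: doubled every `double_every` steps, then capped."""
    return min(n_start * 2 ** (step // double_every), n_max)

def sample_lognormal_sigma(rng, mean_log=-1.1, std_log=2.0):
    """Draw a noise level whose logarithm is Gaussian (continuous simplification)."""
    return math.exp(rng.gauss(mean_log, std_log))

print(pseudo_huber([0.0, 0.0], [3.0, 4.0], c=0.03))
print([doubling_curriculum(s) for s in (0, 50_000, 100_000, 400_000)])
print(sample_lognormal_sigma(random.Random(0)))
```

For large residuals the Pseudo-Huber distance grows like the L2 norm rather than its square, which limits the influence of outliers, while near zero it behaves like a scaled squared error.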

Together these techniques cut the one-step FID on CIFAR-10 from 8.70 (original CT) to 2.51, and on ImageNet 64x64 from 13.0 to 3.25; two-step sampling further reduced these to 2.24 and 2.77 respectively. In both cases iCT trained from scratch matched or surpassed the original consistency distillation, demonstrating that a diffusion teacher is not strictly necessary.^[2]

What are Consistency Trajectory Models (CTM)?

In parallel with iCT, in October 2023 Dongjun Kim and collaborators at Sony AI and Stanford University proposed Consistency Trajectory Models (CTM) in "Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion" (arXiv:2310.02279), accepted to ICLR 2024.^[19] CTM generalises both consistency models and score-based diffusion models as special cases. Where the original consistency function only maps a point to the trajectory's origin near t = 0, CTM learns a more general decoder G(x_t, t, s) that can jump from any starting time t to any earlier target time s along the same PF ODE trajectory, an "anytime-to-anytime" traversal.^[19] Setting s = 0 recovers consistency-style one-step generation, while taking the limit of infinitesimal jumps recovers the underlying score, so a single network exposes both the score function and direct trajectory jumps.^[19]
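The two limiting behaviours can be illustrated on a toy problem where the jump function is known exactly. The sketch assumes one-dimensional Gaussian data with standard deviation SIGMA_DATA under the variance-exploding process, so the trajectory is x_t proportional to sqrt(SIGMA_DATA^2 + t^2); exact_jump is a closed-form stand-in for the learned decoder G, not CTM's network.

```python
import math

SIGMA_DATA = 0.5

def exact_jump(x_t, t, s):
    """Closed-form G(x_t, t, s) for toy data ~ N(0, SIGMA_DATA^2): move from time t to time s <= t."""
    return x_t * math.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))

x_t, t = 12.0, 40.0
direct = exact_jump(x_t, t, 0.0)                         # straight to the origin
via_mid = exact_jump(exact_jump(x_t, t, 5.0), 5.0, 0.0)  # two shorter jumps
print(direct, via_mid)  # equal up to rounding: jumps along one trajectory compose

# A very short jump recovers the ODE drift, and with it the score: dx/dt = -t * score.
h = 1e-6
drift_estimate = (x_t - exact_jump(x_t, t, t - h)) / h
score_estimate = -drift_estimate / t
print(score_estimate, -x_t / (SIGMA_DATA**2 + t**2))
```

The composition check corresponds to the trajectory-matching idea, and the finite-difference check shows why a network that can make arbitrarily short jumps also carries score information.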

This design gives CTM two practical advantages over plain consistency models. First, because the network still provides the score, CTM supports a clean quality-versus-compute trade-off: an alternating sampler ('gamma-sampling') interleaves deterministic long jumps with score-based denoising, so adding sampling steps reliably improves quality rather than saturating.^[19] Second, access to the score streamlines likelihood evaluation and the reuse of conditional-generation techniques developed for diffusion models.^[19] CTM is trained with a combination of a trajectory-matching (soft consistency) loss, a denoising score-matching loss, and an optional adversarial (GAN) loss on the decoded samples.^[19] With these ingredients CTM reported one-step (NFE=1) FID scores of 1.73 on CIFAR-10 and 1.92 on ImageNet 64x64, state-of-the-art for single-step generation at the time of publication; an official PyTorch implementation was released by Sony.^[19] The trajectory-mapping idea was later carried into latent text-to-image acceleration by Trajectory Consistency Distillation (TCD) (Zheng et al., arXiv:2402.19159, February 2024), which distills a semi-linear trajectory consistency function into an SDXL LoRA and improves detail at low step counts without the adversarial training used in some competing methods.^[21]

How do continuous-time consistency models and sCM work?

Although the original consistency models can be formulated in continuous time (with t ranging over a real interval and a tangent-based loss), in practice the published results all relied on a discrete grid of timesteps. Continuous-time CT was numerically unstable for the original parameterisation. In October 2024 Cheng Lu and Yang Song addressed this in "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models" (arXiv:2410.11081), introducing the sCM family.^[4]^[8]

TrigFlow

The paper proposes TrigFlow, a parameterisation that unifies the EDM and Flow Matching frameworks using purely trigonometric coefficients. The forward process is written as x_t = cos(t) x_0 + sin(t) z for t in [0, pi/2], with z drawn from N(0, sigma_d^2 I), and the model is parameterised as f_theta(x_t, t) = c_skip(t) x_t + c_out(t) F_theta(c_in(t) x_t, c_noise(t)) with c_skip(t) = cos(t), c_out(t) = -sigma_d sin(t), and c_in(t) = 1 / sigma_d.^[4] These clean expressions remove the discontinuities and large-magnitude derivatives that plagued the EDM-style continuous-time consistency objective near the endpoints of the trajectory. Because cos^2(t) + sin^2(t) = 1, the marginal variance stays constant along the trajectory instead of growing as in the variance-exploding schedule, which keeps the ODE well-conditioned.^[4]
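A quick numerical check of the interpolation is sketched below. It assumes scalar toy data with standard deviation SIGMA_D and noise drawn with the same standard deviation (an assumption about the TrigFlow setup, not something stated above); under those assumptions the variance of x_t is the same at every t.

```python
import math
import random

SIGMA_D = 0.5  # assumed data standard deviation; noise is drawn as z ~ N(0, SIGMA_D^2)

def trigflow_interpolate(x0, z, t):
    """x_t = cos(t) * x0 + sin(t) * z for t in [0, pi/2]."""
    return math.cos(t) * x0 + math.sin(t) * z

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
pairs = [(rng.gauss(0.0, SIGMA_D), rng.gauss(0.0, SIGMA_D)) for _ in range(20_000)]
for t in (0.0, math.pi / 8, math.pi / 4, math.pi / 2):
    xs = [trigflow_interpolate(x0, z, t) for x0, z in pairs]
    print(round(t, 3), variance(xs))  # stays near SIGMA_D^2 at every t
```

At t = 0 the interpolation returns the clean sample and at t = pi/2 it returns pure noise, so the whole trajectory lives on a bounded time interval with bounded coefficients.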

Stability fixes

The authors identify several specific sources of instability in continuous-time CT and patch each:

An identity time transformation c_noise(t) = t replaces EDM's log(sigma_d tan t) to avoid blow-ups near t = pi/2.^[4]
Positional embeddings of time replace high-scale Fourier features, reducing variance in time-derivative gradients.^[4]
Adaptive double normalisation modifies adaptive group normalisation to apply pixel-wise normalisation on scale and bias signals.^[4]
Tangent normalisation rescales the tangent vector that appears in the continuous-time loss, preventing isolated large gradients from destabilising training.^[4]
An adaptive weighting function w_phi(t) is learned alongside the model to equalise loss variance across timesteps.^[4]
A tangent warmup linearly ramps the coefficient of the unstable term over the first 10,000 iterations.^[4]
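Two of the fixes listed above are easy to illustrate in isolation. The sketch below assumes that tangent normalisation takes the form g / (||g|| + c) with a small constant c, and that the warmup coefficient rises linearly from 0 to 1; the exact form and the value c = 0.1 are assumptions, and the function names are hypothetical.

```python
import math

def normalize_tangent(g, c=0.1):
    """Rescale tangent vector g to g / (||g|| + c); c is an assumed small constant."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / (norm + c) for gi in g]

def tangent_warmup(iteration, warmup_iters=10_000):
    """Coefficient that ramps linearly from 0 to 1 over the first warmup_iters iterations."""
    return min(1.0, iteration / warmup_iters)

huge = normalize_tangent([3.0e6, 4.0e6])
tiny = normalize_tangent([3.0e-6, 4.0e-6])
print(math.hypot(*huge), math.hypot(*tiny))  # norm is capped below 1; small tangents stay small
print([tangent_warmup(k) for k in (0, 2_500, 10_000, 50_000)])
```

Normalisation bounds the size of any single tangent, so one outlier cannot dominate an update, while the warmup lets the network settle before the unstable term reaches full weight.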

How well does sCM scale, and what FID does it reach?

With these fixes, the sCM training algorithm reliably scales to large models. The paper reports a 1.5 billion parameter model trained on ImageNet 512x512, the largest continuous-time consistency model reported at the time of publication.^[4] Reported two-step FIDs are 2.06 on CIFAR-10, 1.48 on ImageNet 64x64, and 1.88 on ImageNet 512x512; the ImageNet 512 number is within roughly 10% of the best diffusion-model FIDs at the same resolution, while sCM uses only two function evaluations.^[4] OpenAI's accompanying communications described an approximately 50x wall-clock speedup at inference, with a single sample from the 1.5B sCM generated in about 0.11 seconds on one NVIDIA A100 GPU.^[13]

The paper distinguishes sCD (continuous-time consistency distillation), which uses a teacher diffusion model, from sCT (continuous-time consistency training), which trains from scratch. Both variants benefit from the TrigFlow parameterisation and the stability machinery; sCD reaches the best reported FIDs and converges in roughly 20,000 fine-tuning iterations from a strong teacher.^[4] sCM was accepted to ICLR 2025.^[8]

Easy Consistency Tuning (ECT)

A complementary 2024 line attacks the training cost rather than the asymptotic quality. Easy Consistency Tuning (ECT), introduced by Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J. Zico Kolter in "Consistency Models Made Easy" (arXiv:2406.14548), reframes consistency training as a lightweight fine-tune of an already-trained diffusion model that progressively tightens the consistency condition over a shrinking time gap.^[20] Because it starts from a pre-trained diffusion checkpoint rather than from scratch, ECT reaches a two-step FID of 2.73 on CIFAR-10 in about one hour on a single NVIDIA A100 GPU, matching consistency distillations that previously required hundreds of GPU hours; the paper was accepted to ICLR 2025.^[20]

What are the main variants and downstream uses?

Latent Consistency Models (LCM)

In October 2023, Simian Luo and collaborators at Tsinghua University published "Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference" (arXiv:2310.04378), which transposes the consistency-model recipe into the latent space of a pre-trained latent diffusion model such as Stable Diffusion.^[5] Rather than running the consistency objective on pixels, the latent consistency model distills the probability flow ODE of an existing LDM in the LDM's own latent space. The resulting models generate 768x768 images in two to four sampling steps, and the authors report that the distillation requires only about 32 A100 GPU hours, since both the latent autoencoder and the teacher diffusion model are already trained.^[5]

A follow-up paper from the same group, LCM-LoRA: A Universal Stable-Diffusion Acceleration Module (arXiv:2311.05556, November 2023), shows that the LCM distillation can be captured in a LoRA adapter that plugs into any fine-tuned Stable Diffusion or SDXL checkpoint without retraining, effectively acting as a "drop-in PF ODE solver."^[6] LCM-LoRA modules for SD 1.5, SSD-1B, and SDXL were released openly via Hugging Face and are widely used in real-time generation interfaces.^[6] The dedicated Latent Consistency Models article covers the augmented PF ODE, the guidance-scale embedding, and the Diffusers LCMScheduler in more detail.

Consistency models for audio and speech

The consistency-distillation principle has been transferred to audio. ConsistencyTTA (arXiv:2309.10740, September 2023) applies CFG-aware latent consistency distillation to a diffusion text-to-audio model, reducing inference from hundreds of NFEs to a single network query while maintaining CLAP-score quality.^[14] Follow-on systems such as CoMoSpeech and AudioLCM extend the idea to text-to-speech and singing voice synthesis with one to four step generation.^[14]

How do consistency models compare to other few-step methods?

The most directly comparable acceleration technique is progressive distillation (Salimans and Ho, ICLR 2022), which iteratively halves the number of sampling steps by training each student to reproduce two teacher steps in one.^[10] On CIFAR-10 and ImageNet 64x64, the original consistency distillation results matched or exceeded progressive distillation at one and two sampling steps without the repeated student-teacher distillation cycle.^[1] DPM-Solver and its variants (DPM-Solver-2, DPM-Solver++, DPM-Solver-v3) are training-free higher-order ODE solvers that improve sample quality in the 5 to 20 step regime; they remain competitive when more compute is available but cannot reach the one to two step quality of consistency models on most benchmarks.^[9] Consistency models are also related to rectified flow (Liu et al., ICLR 2023), which straightens the noise-to-data transport so that a few-step (or, after reflow, near-one-step) Euler integration suffices; both pursue short generative paths, but rectified flow straightens the trajectory while consistency models collapse a fixed curved trajectory into a learned jump.^[4] Adversarial Diffusion Distillation (ADD), used in Stability AI's SDXL Turbo (November 2023), combines score distillation with a GAN-style adversarial loss to enable one to four step generation from large foundation diffusion models, and is a popular alternative for text-to-image acceleration.^[15]

Method	Training cost	Sampling steps	One-step CIFAR-10 FID	Notes
Diffusion (EDM) teacher	High	~35	n/a	Baseline; many NFEs required^[11]
Progressive distillation	Multi-stage	1 to 4	9.12^[10]	Iterative teacher to student halving^[10]
Consistency distillation (CD)	Single stage from teacher	1 to 4	3.55^[1]	Original 2023 paper^[1]
Consistency training (CT)	From scratch	1 to 4	8.70^[1]	No teacher diffusion model^[1]
Improved CT (iCT)	From scratch	1 to 4	2.51^[2]	Lognormal sched + Pseudo-Huber^[2]
CTM	From teacher (+ GAN, score)	1 to a few	1.73^[19]	Anytime-to-anytime jumps; exposes score^[19]
ECT	Fine-tune from diffusion	1 to 2	n/a (2-step 2.73)^[20]	~1 A100 hour on CIFAR-10^[20]
sCT (continuous-time)	From scratch	1 to 2	n/a (2-step 2.06)^[4]	TrigFlow + stability fixes^[4]
sCD (continuous-time)	Single stage from teacher	1 to 2	n/a (2-step 1.88 on ImageNet 512)^[4]	Scales to 1.5B parameters^[4]
DPM-Solver	Training-free	~10	n/a	Higher-order ODE solver^[9]
ADD (SDXL Turbo)	Adversarial + distillation	1 to 4	n/a	Used in real-time SDXL^[15]

(FID values are quoted from the original publications above and may differ slightly across re-runs. Rows marked n/a have no reported one-step CIFAR-10 FID; where a value appears in parentheses, it is the closest reported figure, at the stated step count and dataset.)

What are consistency models used for?

Real-time and interactive generation

The most prominent practical impact of consistency models has been in real-time image generation. LCM and LCM-LoRA enabled the first widely-deployed sub-second 1024x1024 Stable Diffusion inference on commodity GPUs, and they underpin many interactive editors and "live canvas" interfaces released during late 2023 and 2024.^[5]^[6]^[12] The sCM line shows that the underlying ideas extend to billion-parameter image generators, with two-step generation coming within roughly 10% of the best diffusion FID scores on ImageNet 512.^[4]^[13]

Edge and on-device deployment

Because consistency models keep most of the diffusion-model machinery (U-Net or transformer backbone, latent VAE if used) intact, they slot into existing inference frameworks. The combination of LCM-LoRA with quantisation has been used to fit Stable Diffusion variants onto laptops and phones; tooling such as the Intel OpenVINO LCM notebook and various Hugging Face Diffusers pipelines document such deployments.^[12]

Zero-shot editing

The original paper showed that consistency models inherit the zero-shot editing capabilities of their diffusion teachers: by initialising the sampler with a partially-noised image or with a noise vector conditioned on a mask, the same trained model can perform inpainting, colorisation, and super-resolution without any task-specific fine-tuning.^[1] These properties are routinely exploited in LCM-based image editors and downstream pipelines.^[1]^[5]

Theoretical interest

Beyond engineering, consistency models contributed a clean theoretical viewpoint on the trade-off between diffusion sampling speed and quality. They show that the entire PF ODE trajectory can in principle be summarised by a single function from (x_t, t) to x_eps, and that this function can be learned by enforcing local self-consistency along the trajectory rather than reproducing the noise sequence exactly. This perspective has informed subsequent work on "shortcut" learning, flow matching, and one-step diffusion variants, and convergence guarantees for both single-step and multistep consistency sampling have been studied (for example, arXiv:2308.11449 and arXiv:2505.03194).^[16]^[17]

What are the limitations of consistency models?

Despite their successes, consistency models have well-documented practical drawbacks. The original paper used the LPIPS perceptual metric, which introduces an external dependency on a VGG feature extractor and may bias the FID evaluation when training and evaluation share perceptual features; iCT replaced LPIPS with Pseudo-Huber to mitigate this concern.^[2] Even after the iCT and sCM fixes, a quality gap to the strongest diffusion teachers persists at one step on the largest benchmarks: the sCM authors describe their result as narrowing the FID gap on ImageNet 512x512 to within about 10% of the best diffusion FIDs, but not closing it.^[4] One-step sampling also offers no clear knob for guidance strength comparable to classifier-free guidance scaling in the multistep regime, and naive use of strong guidance during distillation tends to bake in artifacts.^[6]^[15]

Training instability in the continuous-time setting was the central problem motivating sCM, and even the patched TrigFlow recipe relies on a careful combination of adaptive weighting, time conditioning, normalisation, and a tangent warmup; without these, continuous-time consistency training diverges.^[4] Discrete-time CT remains hyperparameter-sensitive, with the lognormal noise schedule, discretisation curriculum, and EMA settings all having a strong effect on outcomes.^[2] The community has also documented that pure consistency objectives can get trapped in local optima that sacrifice endpoint fidelity to global self-consistency, leading some authors to combine consistency training with auxiliary adversarial or energy-based losses to recover the last bits of FID; CTM's optional GAN loss is one example of this pattern.^[18]^[15]^[19]

Finally, consistency models inherit the limitations and biases of their diffusion teachers when used in CD mode, since the teacher's score function defines the trajectories along which self-consistency is enforced. A noisy or miscalibrated teacher produces a noisy or miscalibrated student.^[1]^[15]

See also

Diffusion model: the underlying generative paradigm whose ODE trajectories consistency models compress to a single function.^[1]^[7]
DDPM (Denoising Diffusion Probabilistic Models): the discrete-time precursor that established the noise prediction objective reused by consistency model backbones.^[1]
Latent diffusion model and Stable Diffusion: the latent-space diffusion framework that LCM and LCM-LoRA accelerate.^[5]^[6]
Latent Consistency Models: the sibling line that ports the consistency recipe to the latent space of Stable Diffusion for few-step text-to-image generation.^[5]^[6]
Knowledge distillation: the broader teacher to student training paradigm; consistency distillation is a specific instance applied to ODE trajectories.^[1]^[10]
Flow Matching and rectified flow: related continuous-transport frameworks; flow matching's objective is unified with EDM by TrigFlow, and rectified flow pursues straight few-step paths.^[4]
Variational Autoencoder: provides the encoder/decoder pair used by latent diffusion and therefore by latent consistency models.^[5]
U-Net: the standard architecture for the denoiser network used by both the diffusion teacher and the consistency student.^[1]^[11]
LoRA (Low-Rank Adaptation): the parameter-efficient fine-tuning method used in LCM-LoRA and TCD to ship consistency distillations as plug-in adapters.^[6]^[21]
Generative model and GAN: alternative one-step generators that consistency models compete with on FID and Inception Score at low NFE budgets.^[1]
AI image generation and text-to-video generation: application domains where consistency-based accelerators have been deployed in production systems and research.^[5]^[6]^[12]

References

1. Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever, "Consistency Models", arXiv preprint 2303.01469, 2023-03-02. https://arxiv.org/abs/2303.01469. Accessed 2026-06-28.
2. Yang Song, Prafulla Dhariwal, "Improved Techniques for Training Consistency Models", arXiv preprint 2310.14189, 2023-10-22. https://arxiv.org/abs/2310.14189. Accessed 2026-06-28.
3. Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever, "Consistency Models", Proceedings of the 40th International Conference on Machine Learning (PMLR vol. 202, pp. 32211-32252), 2023-07-23. https://proceedings.mlr.press/v202/song23a.html. Accessed 2026-06-28.
4. Cheng Lu, Yang Song, "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models", arXiv preprint 2410.11081, 2024-10-14. https://arxiv.org/abs/2410.11081. Accessed 2026-06-28.
5. Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao, "Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference", arXiv preprint 2310.04378, 2023-10-06. https://arxiv.org/abs/2310.04378. Accessed 2026-06-28.
6. Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinario Passos, Longbo Huang, Jian Li, Hang Zhao, "LCM-LoRA: A Universal Stable-Diffusion Acceleration Module", arXiv preprint 2311.05556, 2023-11-09. https://arxiv.org/abs/2311.05556. Accessed 2026-06-28.
7. Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole, "Score-Based Generative Modeling through Stochastic Differential Equations", International Conference on Learning Representations (ICLR) 2021 (Oral), 2021-02-10. https://openreview.net/forum?id=PxTIG12RRHS. Accessed 2026-06-28.
8. OpenReview, "Simplifying, Stabilizing and Scaling Continuous-time Consistency Models", ICLR 2025 paper page, 2025-03-01. https://openreview.net/forum?id=LyJi5ugyJx. Accessed 2026-06-28.
9. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps", NeurIPS 2022 (arXiv:2206.00927), 2022-06-02. https://arxiv.org/abs/2206.00927. Accessed 2026-06-28.
10. Tim Salimans, Jonathan Ho, "Progressive Distillation for Fast Sampling of Diffusion Models", International Conference on Learning Representations (ICLR) 2022 (arXiv:2202.00512), 2022-02-01. https://arxiv.org/abs/2202.00512. Accessed 2026-06-28.
11. Tero Karras, Miika Aittala, Timo Aila, Samuli Laine, "Elucidating the Design Space of Diffusion-Based Generative Models", NeurIPS 2022 (arXiv:2206.00364), 2022-06-01. https://arxiv.org/abs/2206.00364. Accessed 2026-06-28.
12. Intel OpenVINO Documentation, "Image generation with Latent Consistency Model and OpenVINO", 2024 OpenVINO docs notebook. https://docs.openvino.ai/2024/notebooks/latent-consistency-models-image-generation-with-output.html. Accessed 2026-06-28.
13. VentureBeat (Carl Franzen), "OpenAI researchers develop new model that speeds up media generation by 50X", VentureBeat, 2024-10-23. https://venturebeat.com/ai/openai-researchers-develop-new-model-that-speeds-up-media-generation-by-50x. Accessed 2026-06-28.
14. Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi, "ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation", arXiv preprint 2309.10740, 2023-09-19. https://arxiv.org/abs/2309.10740. Accessed 2026-06-28.
15. Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach, "Adversarial Diffusion Distillation", Stability AI Research preprint (and arXiv:2311.17042), 2023-11-28. https://stability.ai/research/adversarial-diffusion-distillation. Accessed 2026-06-28.
16. Junlong Lyu, Zhitang Chen, Shoubo Feng, "Convergence Guarantee for Consistency Models", arXiv preprint 2308.11449, 2023-08-22. https://arxiv.org/abs/2308.11449. Accessed 2026-06-28.
17. Anonymous authors, "Convergence of Consistency Model with Multistep Sampling", arXiv preprint 2505.03194, 2025-05-06. https://arxiv.org/abs/2505.03194. Accessed 2026-06-28.
18. Shelly Golan, Roy Ganz, Michael Elad, "Enhancing Consistency-Based Image Generation via Adversarially-Trained Classification and Energy-Based Discrimination", arXiv preprint 2405.16260, 2024-05-25. https://arxiv.org/abs/2405.16260. Accessed 2026-06-28.
19. Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon, "Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion", International Conference on Learning Representations (ICLR) 2024 (arXiv:2310.02279), 2023-10-01. https://arxiv.org/abs/2310.02279. Accessed 2026-06-28.
20. Zhengyang Geng, Ashwini Pokle, Weijian Luo, Justin Lin, J. Zico Kolter, "Consistency Models Made Easy", International Conference on Learning Representations (ICLR) 2025 (arXiv:2406.14548), 2024-06-20. https://arxiv.org/abs/2406.14548. Accessed 2026-06-28.
21. Jianbin Zheng, Minghui Hu, Zhongyi Fan, Chaoyue Wang, Changxing Ding, Dacheng Tao, Tat-Jen Cham, "Trajectory Consistency Distillation: Improved Latent Consistency Distillation by Semi-Linear Consistency Function with Trajectory Mapping", arXiv preprint 2402.19159, 2024-02-29. https://arxiv.org/abs/2402.19159. Accessed 2026-06-28.
