Consistency Models
Last reviewed: May 31, 2026
Sources: No citations yet
Review status: Needs citations
Revision: v2 · 4,973 words
Consistency models are a family of generative models, introduced by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever at OpenAI in March 2023, that learn to map any point along the probability flow ordinary differential equation (PF ODE) trajectory of a diffusion process directly to that trajectory's clean origin.[1] By construction, a single network evaluation can transform pure noise into a sample, while iterative refinement is preserved through an optional multistep sampler that re-injects noise between calls.[1][2] The original paper, "Consistency Models," reached arXiv as preprint 2303.01469 on 2 March 2023 and was accepted to ICML 2023; subsequent work has substantially improved training stability, scaled the formulation to billion-parameter image generators, and adapted the idea to latent, audio, and video domains.[1][3][4][5][6]
The motivation is straightforward: diffusion models such as DDPM and the score-based stochastic differential equation family produce samples by solving a learned ODE or SDE backward from noise to data, which typically requires dozens to hundreds of sequential neural network evaluations.[1][7] Consistency models retain the iterative trajectory of diffusion at training time but force the network to collapse the whole trajectory into a single mapping, enabling one-step or very-few-step generation. Two training regimes were proposed in the original paper: consistency distillation (CD), which uses a pre-trained diffusion model as a teacher, and consistency training (CT), which trains a consistency model from scratch on data.[1] Follow-up work by Song and Dhariwal in October 2023 introduced "Improved Techniques for Training Consistency Models" (arXiv:2310.14189), the same month that Kim et al. proposed Consistency Trajectory Models (CTM, arXiv:2310.02279), and the line was further generalised in October 2024 by Cheng Lu and Yang Song in "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models" (arXiv:2410.11081, often abbreviated sCM), which scaled the approach to a 1.5 billion parameter model on ImageNet 512x512.[2][4][19] A closely related sibling line, Latent Consistency Models, transposes the same recipe into the latent space of a pre-trained latent diffusion model.[5]
| Property | Details |
|---|---|
| First public release | arXiv:2303.01469, 2 March 2023[1] |
| Authors (original) | Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever[1] |
| Affiliation | OpenAI[1] |
| Venue | ICML 2023 (PMLR vol. 202, pp. 32211 to 32252)[3] |
| Training regimes | Consistency Distillation (CD), Consistency Training (CT)[1] |
| Sampling | One step by design; optional multistep refinement[1] |
| Original one-step FID, CIFAR-10 | 3.55 (CD)[1] |
| Original one-step FID, ImageNet 64x64 | 6.20 (CD)[1] |
| Improved CT one-step FID, CIFAR-10 (iCT) | 2.51[2] |
| CTM one-step FID, CIFAR-10 | 1.73[19] |
| sCM two-step FID, ImageNet 512x512 (1.5B) | 1.88[4] |
| Best known follow-ups | CTM (ICLR 2024)[19], sCM (arXiv:2410.11081, ICLR 2025)[4][8] |
Modern continuous-time diffusion models view sample generation as the time reversal of a stochastic process that gradually adds Gaussian noise to clean data. In the formulation by Song et al. (ICLR 2021), this forward process is described by an Itô stochastic differential equation (SDE), and there is a corresponding deterministic probability flow ODE whose solutions share the same time-marginal distributions as the SDE.[7] In the variance-exploding parameterisation used by the consistency models paper, the forward process can be written as dx = sqrt(2 t) dW, so that at time t the noisy sample is x_t = x_0 + t z with z standard Gaussian, and its density is the data density convolved with N(0, t^2 I); the associated probability flow ODE is dx_t/dt = -t s(x_t, t), where s is the score, i.e. the gradient of the log density of x_t.[1][7]
Sampling proceeds by integrating this ODE backward from a large noise level T (where the marginal is essentially Gaussian) to a small terminal time eps > 0, using a pre-trained score network. With high-order ODE solvers such as Heun's method or DPM-Solver, this still requires tens of evaluations to reach competitive sample quality.[7][9] Many earlier acceleration techniques (DDIM-style deterministic samplers, DPM-Solver, knowledge distillation variants such as progressive distillation) attempt to reduce this evaluation count, but each retains some iterative character or pays a quality penalty in the very-few-step regime.[9][10]
The consistency models paper adopts the design space and noise schedule introduced by Karras et al. ("Elucidating the Design Space of Diffusion-Based Generative Models," NeurIPS 2022), commonly abbreviated EDM. EDM uses a variance-exploding parameterisation with noise levels discretised between sigma_min = 0.002 and sigma_max = 80, and a parameter rho = 7 that controls how the timesteps are spaced along the trajectory (larger rho places more steps at low noise).[11] The choice fixes the precise meaning of t in the probability flow ODE and supplies a backbone (a U-Net with EDM preconditioning) that the consistency model reuses.[1][11]
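The timestep grid implied by these three constants can be sketched as follows. The formula is the rho-warped interpolation from the EDM paper; the function name `karras_timesteps` and the choice of 18 steps are illustrative assumptions, not values taken from this article.

```python
import numpy as np

def karras_timesteps(n_steps: int, sigma_min: float = 0.002,
                     sigma_max: float = 80.0, rho: float = 7.0) -> np.ndarray:
    """Increasing grid t_1 = sigma_min < ... < t_N = sigma_max (EDM-style spacing)."""
    # Interpolate linearly in sigma^(1/rho), then raise back to the rho-th power.
    ramp = np.linspace(0.0, 1.0, n_steps)
    lo, hi = sigma_min ** (1.0 / rho), sigma_max ** (1.0 / rho)
    return (lo + ramp * (hi - lo)) ** rho

ts = karras_timesteps(18)  # 18 is an arbitrary illustrative grid size
```

With rho = 7 the gaps between neighbouring levels grow with the noise level, so most of the grid sits at low noise.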
Reducing the number of neural network evaluations (NFEs) per sample directly reduces latency, energy use, and serving cost. For text-to-image systems, a 25 to 50 step diffusion pass on a large model can dominate end-to-end inference time; cutting this to one to four steps without large quality loss enables interactive editing, real-time previews, and viable deployment on mobile or browser hardware.[5][6][12] These pressures, plus the desire for a deeper theoretical understanding of when fast samplers are possible, motivated consistency models and a wave of related distillation methods that appeared in 2023 and 2024.[4][5][6][12]
Given a probability flow ODE trajectory {x_t} for t in [eps, T], the consistency function f is defined by f(x_t, t) = x_eps, mapping any point on the trajectory to the same endpoint near t = 0.[1] The defining self-consistency property is that for any two times t, t' on the same trajectory, f(x_t, t) = f(x_t', t'). A learned parametric approximation f_theta is the model.[1]
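Self-consistency can be checked numerically on a toy problem where everything is known in closed form. The sketch below assumes data drawn from N(0, sigma_data^2 I), for which the score is -x / (sigma_data^2 + t^2) and the consistency function is a simple rescaling; the Gaussian data model, the RK4 integrator, and all names are assumptions made for illustration.

```python
import numpy as np

SIGMA_DATA, EPS, T = 0.5, 0.002, 80.0

def pf_ode_rhs(x, t):
    # dx/dt = -t * score(x, t), with score(x, t) = -x / (SIGMA_DATA^2 + t^2)
    return t * x / (SIGMA_DATA**2 + t**2)

def f_exact(x_t, t):
    """Closed-form consistency function for the Gaussian toy problem."""
    return x_t * np.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))

def integrate(x_eps, n_steps=20000):
    """RK4 integration of the PF ODE forward in t from EPS to T."""
    ts = np.linspace(EPS, T, n_steps + 1)
    xs = [x_eps]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h, x = t1 - t0, xs[-1]
        k1 = pf_ode_rhs(x, t0)
        k2 = pf_ode_rhs(x + 0.5 * h * k1, t0 + 0.5 * h)
        k3 = pf_ode_rhs(x + 0.5 * h * k2, t0 + 0.5 * h)
        k4 = pf_ode_rhs(x + h * k3, t1)
        xs.append(x + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6)
    return ts, np.array(xs)

x_eps = np.array([0.3, -1.2])
ts, xs = integrate(x_eps)
# Every point on the trajectory should map back to the same origin x_eps.
outputs = f_exact(xs, ts[:, None])
max_dev = np.abs(outputs - x_eps).max()
```

Here `max_dev` measures how far any point of the numerically integrated trajectory lands from the shared origin; it should be limited only by integration error.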
To make the function easy to train and to enforce a clean terminal behaviour, the network must satisfy the boundary condition f_theta(x, eps) = x. The authors implement this via a skip parameterisation: f_theta(x, t) = c_skip(t) x + c_out(t) F_theta(x, t), where F_theta is a free neural network (typically a U-Net inherited from the diffusion teacher), and c_skip, c_out are differentiable scalar functions satisfying c_skip(eps) = 1 and c_out(eps) = 0, so the boundary is met by construction without an architectural hack.[1] EDM-style preconditioning provides convenient closed forms for c_skip(t) and c_out(t) that work well empirically.[1][11]
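A minimal sketch of the skip parameterisation follows. The closed forms for c_skip and c_out are the EDM-derived coefficients given in the appendix of the consistency models paper, with sigma_data = 0.5; `toy_network` is a hypothetical stand-in for the U-Net F_theta.

```python
import numpy as np

SIGMA_DATA, EPS = 0.5, 0.002

def c_skip(t):
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)

def consistency_fn(F, x, t):
    """f_theta(x, t) = c_skip(t) x + c_out(t) F(x, t) for any free network F."""
    return c_skip(t) * x + c_out(t) * F(x, t)

def toy_network(x, t):
    # Hypothetical stand-in for F_theta; any function of (x, t) works here.
    return np.tanh(x) + t

x = np.random.default_rng(0).normal(size=(4, 3))
boundary_out = consistency_fn(toy_network, x, EPS)  # equals x regardless of F
```

Because c_skip(eps) = 1 and c_out(eps) = 0, the boundary condition holds for every choice of network weights, which is the point of the construction.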
In consistency distillation, a pre-trained diffusion model supplies the score that defines the probability flow ODE. Training proceeds by drawing a data sample x, a noise level t_{n+1} from a fixed discretisation eps = t_1 < t_2 < ... < t_N = T, and a Gaussian noise vector z to form x_{t_{n+1}} = x + t_{n+1} z. A single step of an ODE solver phi (typically Heun's method using the teacher's score) approximates x_{t_n} from x_{t_{n+1}}. The student is then trained to match itself at these two adjacent points on the trajectory, using the loss
L_CD(theta, theta^-; phi) = E[ lambda(t_n) d( f_theta(x_{t_{n+1}}, t_{n+1}), f_{theta^-}(x_hat^phi_{t_n}, t_n) ) ]
where d is a distance such as squared L2 or LPIPS, lambda a weighting function, and theta^- a slow-moving target parameter set updated by an exponential moving average (EMA) of the trained parameters.[1] Because both arguments lie on (an approximation of) the same trajectory, minimising this loss enforces self-consistency along that trajectory. The teacher only enters through the single ODE step, never through a direct supervision signal on the endpoint.[1]
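The loss above can be sketched as a Monte Carlo estimate on a toy problem. The example assumes Gaussian data so that an exact score can stand in for a trained teacher, takes lambda = 1 and d = squared L2, and only evaluates the loss (no gradients or EMA update); all function names are illustrative.

```python
import numpy as np

SIGMA_DATA, EPS, T = 0.5, 0.002, 80.0
rng = np.random.default_rng(0)

def teacher_score(x, t):
    # Exact score of N(0, (SIGMA_DATA^2 + t^2) I); stands in for a trained teacher.
    return -x / (SIGMA_DATA**2 + t**2)

def heun_step(x, t_from, t_to):
    """One Heun (second-order) step of dx/dt = -t * score(x, t)."""
    d1 = -t_from * teacher_score(x, t_from)
    x_euler = x + (t_to - t_from) * d1
    d2 = -t_to * teacher_score(x_euler, t_to)
    return x + (t_to - t_from) * 0.5 * (d1 + d2)

def cd_loss(f_online, f_target, ts, batch=4096, dim=2):
    """Estimate of L_CD; in real training f_target's output is held constant."""
    n = rng.integers(0, len(ts) - 1, size=batch)        # index of t_n
    t_n, t_np1 = ts[n][:, None], ts[n + 1][:, None]
    x = SIGMA_DATA * rng.normal(size=(batch, dim))      # data sample
    x_np1 = x + t_np1 * rng.normal(size=(batch, dim))   # noisy point at t_{n+1}
    x_n_hat = heun_step(x_np1, t_np1, t_n)              # teacher's ODE step
    diff = f_online(x_np1, t_np1) - f_target(x_n_hat, t_n)
    return np.mean(np.sum(diff**2, axis=1))

def f_exact(x, t):
    # True consistency function of this toy problem.
    return x * np.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))

def f_identity(x, t):
    # A deliberately poor "student" that ignores the noise level.
    return x

rho = 7.0
ts = (EPS**(1 / rho) + np.linspace(0, 1, 40) * (T**(1 / rho) - EPS**(1 / rho)))**rho
loss_good = cd_loss(f_exact, f_exact, ts)
loss_bad = cd_loss(f_identity, f_identity, ts)
```

The exact consistency function gives a loss close to zero (limited by the solver's step error), while the identity map is penalised heavily, which is the behaviour the objective is meant to reward.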
In consistency training, no diffusion teacher is needed. The pair of points on the trajectory is generated by adding scaled Gaussian noise to a clean data sample at two adjacent noise levels:
L_CT(theta, theta^-) = E[ lambda(t_n) d( f_theta(x + t_{n+1} z, t_{n+1}), f_{theta^-}(x + t_n z, t_n) ) ]
with the same Gaussian z used at both noise levels.[1] Sharing z is what makes the pair usable: under the variance-exploding process, -(x_t - x) / t^2 is an unbiased single-sample estimate of the score at x_t, and x + t_n z is exactly the Euler step from x + t_{n+1} z computed with that estimate. The paper proves that, under mild conditions, the resulting loss agrees with the distillation loss up to a term that vanishes as the discretisation is refined.[1] Consistency training thus yields a self-contained generative model, with no reliance on an external diffusion model. The original paper observed that CD outperforms CT at small to moderate model and compute scale, but that CT closes much of the gap with more iterations and larger networks.[1]
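The role of the shared noise vector can be made concrete with a short sketch. It builds the training pair, checks that the lower-noise point coincides with an Euler step that uses the single-sample score estimate -(x_t - x) / t^2, and evaluates the loss with lambda = 1 and d = squared L2; the helper names and the two noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ct_pair(x, t_n, t_np1, z):
    """Adjacent noisy points that share one Gaussian draw z."""
    return x + t_np1 * z, x + t_n * z

def euler_with_sample_score(x, x_np1, t_n, t_np1):
    # Single-sample score estimate, then one Euler step of dx/dt = -t * score
    # from t_{n+1} down to t_n.
    score_est = -(x_np1 - x) / t_np1**2
    return x_np1 + (t_n - t_np1) * (-t_np1 * score_est)

def ct_loss(f_online, f_target, x, t_n, t_np1, z):
    """L_CT for one batch with lambda = 1 and d = squared L2."""
    x_np1, x_n = ct_pair(x, t_n, t_np1, z)
    diff = f_online(x_np1, t_np1) - f_target(x_n, t_n)
    return np.mean(np.sum(diff**2, axis=1))

x = 0.5 * rng.normal(size=(8, 2))   # toy "data" batch
z = rng.normal(size=(8, 2))
t_n, t_np1 = 1.0, 1.3               # two adjacent noise levels (arbitrary)

x_np1, x_n = ct_pair(x, t_n, t_np1, z)
x_n_euler = euler_with_sample_score(x, x_np1, t_n, t_np1)
loss_identity = ct_loss(lambda v, t: v, lambda v, t: v, x, t_n, t_np1, z)
```

`x_n` and `x_n_euler` agree to floating-point precision, which is why no teacher network is needed to produce the second point.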
After training, one-step generation is trivial: sample x_T ~ N(0, T^2 I) (or the equivalent under the chosen schedule), then output f_theta(x_T, T). This single network call produces a candidate sample whose quality is competitive with one-step GANs and one-step diffusion distillations on standard benchmarks.[1]
For users willing to spend a few additional network evaluations, the paper proposes a multistep sampler that trades compute for quality. Given a decreasing schedule tau_1 > tau_2 > ... > tau_{N-1}, the algorithm (1) initialises x_T; (2) outputs an estimate x_hat = f_theta(x_T, T); (3) adds fresh Gaussian noise of magnitude sqrt(tau_n^2 - eps^2) to x_hat to obtain x_{tau_n}; and (4) applies the consistency model again to get a refined x_hat = f_theta(x_{tau_n}, tau_n), repeating steps 3 and 4 for each intermediate time. Each iteration perturbs the current estimate back to an intermediate noise level and then re-collapses it to the data manifold, which empirically improves sample fidelity for the same network architecture.[1] In practice, two to four steps recover most of the quality gap to many-step diffusion teachers on CIFAR-10 and ImageNet 64x64.[1][2]
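The one-step and multistep procedures can be sketched in a few lines. The sampler below follows the steps described above; the stand-in model `f_exact` is the closed-form consistency function for Gaussian toy data and is an assumption made so the example runs without a trained network, as are the intermediate times 5.0 and 0.8.

```python
import numpy as np

EPS, T, SIGMA_DATA = 0.002, 80.0, 0.5

def multistep_sample(f, shape, taus, rng):
    """Multistep consistency sampling; taus is decreasing, each in (EPS, T)."""
    x_T = T * rng.normal(size=shape)               # x_T ~ N(0, T^2 I)
    x = f(x_T, T)                                  # one-step sample
    for tau in taus:
        z = rng.normal(size=shape)
        x_tau = x + np.sqrt(tau**2 - EPS**2) * z   # re-noise to level tau
        x = f(x_tau, tau)                          # collapse back to t = EPS
    return x

def f_exact(x, t):
    # Hypothetical stand-in: exact consistency function for N(0, SIGMA_DATA^2 I) data.
    return x * np.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))

rng = np.random.default_rng(0)
one_step = multistep_sample(f_exact, (20000, 2), taus=[], rng=rng)
three_step = multistep_sample(f_exact, (20000, 2), taus=[5.0, 0.8], rng=rng)
```

An empty `taus` list gives one-step generation; each extra entry costs one more network evaluation. With the exact toy model both variants already match the data distribution, so the benefit of extra steps only shows up with an imperfect learned model.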
Three design choices are critical for stable training. First, the boundary condition f_theta(x, eps) = x must hold exactly, and the skip-parameterisation accomplishes this with the EDM coefficients.[1][11] Second, the loss weighting lambda(t) and the metric d have a strong effect on which timesteps dominate the gradient: the original paper used LPIPS, which provided strong perceptual gradients on natural images but introduced a dependence on an auxiliary VGG feature network.[1] Third, the discretisation of timesteps and the EMA decay rate for the target network theta^- interact non-trivially with the network's tendency to collapse. The 2310.14189 follow-up addresses each of these in turn.[2]
In October 2023 Song and Dhariwal published "Improved Techniques for Training Consistency Models" (arXiv:2310.14189), focusing on closing the gap between consistency training and consistency distillation so that practitioners do not need a pre-trained diffusion teacher.[2] The paper identified an overlooked flaw in prior CT theory: applying an exponential moving average to the target network theta^- (which that paper calls the teacher network) distorts the consistency-training objective. The proposed correction simply removes the EMA from theta^-, using the current weights directly as the target, which the authors prove is consistent with the underlying ODE.[2]
Other contributions include:
- Pseudo-Huber metric: LPIPS is replaced by the Pseudo-Huber loss d(a, b) = sqrt(||a - b||^2 + c^2) - c, where c is a small constant. This avoids the dependence on a learned VGG network, sidesteps an evaluation-bias issue when training and evaluation share the same perceptual features, and provides smoother optimisation.[2]
- Lognormal noise schedule: training noise levels are drawn from a lognormal distribution over sigma, concentrating training on the regime where the loss surface is most informative.[2]
- Discretisation curriculum: the number of discretisation steps N is doubled at regular intervals during training, providing a coarse-to-fine schedule that begins with a few large jumps and progresses to a denser discretisation.[2]

Together these techniques cut the one-step FID on CIFAR-10 from 8.70 (original CT) to 2.51, and on ImageNet 64x64 from 13.0 to 3.25; two-step sampling further reduced these to 2.24 and 2.77 respectively. In both cases iCT trained from scratch matched or surpassed the original consistency distillation, demonstrating that a diffusion teacher is not strictly necessary.[2]
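Two of the iCT ingredients, the Pseudo-Huber metric and the doubling curriculum for the number of discretisation steps, are simple enough to sketch directly. The constants below (s0 = 10, s1 = 1280, and c = 0.00054 sqrt(dim)) are taken to be the defaults reported in the iCT paper and should be checked against it; the function names and the 8,000-iteration budget are illustrative assumptions.

```python
import math
import numpy as np

def pseudo_huber(a, b, c):
    """d(a, b) = sqrt(||a - b||^2 + c^2) - c, computed per batch row."""
    sq = np.sum((a - b) ** 2, axis=-1)
    return np.sqrt(sq + c**2) - c

def discretisation_steps(k, total_iters, s0=10, s1=1280):
    """Number of noise levels at iteration k; doubles from s0 up to s1."""
    doublings = int(math.log2(s1 // s0))
    k_prime = total_iters // (doublings + 1)   # iterations between doublings
    return min(s0 * 2 ** (k // k_prime), s1) + 1

dim = 3 * 32 * 32                      # CIFAR-10-sized inputs
c = 0.00054 * math.sqrt(dim)           # assumed iCT default for c
a = np.zeros((2, dim))
b = np.stack([np.zeros(dim), np.ones(dim)])
d = pseudo_huber(a, b, c)              # zero for identical inputs, ~L2 for distant ones
schedule = [discretisation_steps(k, total_iters=8000) for k in (0, 1000, 4000, 7999)]
```

For large differences the metric grows like the L2 norm rather than its square, which is what tempers the influence of outliers compared with squared L2.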
In parallel with iCT, in October 2023 Dongjun Kim and collaborators at Sony AI and Stanford University proposed Consistency Trajectory Models (CTM) in "Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion" (arXiv:2310.02279), accepted to ICLR 2024.[19] CTM generalises both consistency models and score-based diffusion models as special cases. Where the original consistency function only maps a point to the trajectory's origin near t = 0, CTM learns a more general decoder G(x_t, t, s) that can jump from any starting time t to any earlier target time s along the same PF ODE trajectory, an "anytime-to-anytime" traversal.[19] Setting s = 0 recovers consistency-style one-step generation, while taking the limit of infinitesimal jumps recovers the underlying score, so a single network exposes both the score function and direct trajectory jumps.[19]
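The "anytime-to-anytime" property can be illustrated on the same kind of Gaussian toy problem, where the exact jump G is a rescaling. The sketch checks that one long jump equals two shorter ones, and that the auxiliary function g in the decomposition G = (s/t) x_t + (1 - s/t) g, which is how the CTM paper is understood to parameterise the decoder, approaches the denoiser E[x_0 | x_t] as s approaches t. The Gaussian data model and all names are assumptions for illustration.

```python
import numpy as np

SIGMA_DATA = 0.5

def G_exact(x_t, t, s):
    """Exact PF ODE jump from time t to time s <= t for N(0, SIGMA_DATA^2 I) data."""
    return x_t * np.sqrt((SIGMA_DATA**2 + s**2) / (SIGMA_DATA**2 + t**2))

def g_from_G(x_t, t, s):
    # Auxiliary function in G = (s/t) x_t + (1 - s/t) g; requires s < t.
    return (G_exact(x_t, t, s) - (s / t) * x_t) / (1.0 - s / t)

x_t = np.array([3.0, -1.5])
t, u, s = 10.0, 2.0, 0.0

direct = G_exact(x_t, t, s)                        # one long jump to the origin
two_hops = G_exact(G_exact(x_t, t, u), u, s)       # t -> u -> s
denoiser = x_t * SIGMA_DATA**2 / (SIGMA_DATA**2 + t**2)   # E[x_0 | x_t] for this toy
g_limit = g_from_G(x_t, t, t * (1 - 1e-6))         # s just below t
```

The denoiser and the score carry the same information in this setting, since score = (E[x_0 | x_t] - x_t) / t^2, which is the sense in which a single network exposes both jumps and the score.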
This design gives CTM two practical advantages over plain consistency models. First, because the network still provides the score, CTM supports a clean quality-versus-compute trade-off: an alternating sampler ('gamma-sampling') interleaves deterministic long jumps with score-based denoising, so adding sampling steps reliably improves quality rather than saturating.[19] Second, access to the score streamlines likelihood evaluation and the reuse of conditional-generation techniques developed for diffusion models.[19] CTM is trained with a combination of a trajectory-matching (soft consistency) loss, a denoising score-matching loss, and an optional adversarial (GAN) loss on the decoded samples.[19] With these ingredients CTM reported one-step (NFE=1) FID scores of 1.73 on CIFAR-10 and 1.92 on ImageNet 64x64, state-of-the-art for single-step generation at the time of publication; an official PyTorch implementation was released by Sony.[19] The trajectory-mapping idea was later carried into latent text-to-image acceleration by Trajectory Consistency Distillation (TCD) (Zheng et al., arXiv:2402.19159, February 2024), which distills a semi-linear trajectory consistency function into an SDXL LoRA and improves detail at low step counts without the adversarial training used in some competing methods.[21]
Although the original consistency models can be formulated in continuous time (with t ranging over a real interval and a tangent-based loss), in practice the published results all relied on a discrete grid of timesteps. Continuous-time CT was numerically unstable for the original parameterisation. In October 2024 Cheng Lu and Yang Song addressed this in "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models" (arXiv:2410.11081), introducing the sCM family.[4][8]
The paper proposes TrigFlow, a parameterisation that unifies the EDM and Flow Matching frameworks using purely trigonometric coefficients. The forward process is written as x_t = cos(t) x_0 + sin(t) z for t in [0, pi/2], with z drawn from N(0, sigma_d^2 I), and the model is parameterised as f_theta(x_t, t) = cos(t) x_t - sin(t) sigma_d F_theta(x_t / sigma_d, c_noise(t)), that is, c_skip(t) = cos(t), c_out(t) = -sigma_d sin(t), and c_in(t) = 1 / sigma_d.[4] These clean expressions remove the discontinuities and large-magnitude derivatives that plagued the EDM-style continuous-time consistency objective near the endpoints of the trajectory, and because cos^2(t) + sin^2(t) = 1 the noisy sample keeps a bounded scale over the whole interval.[4]
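A minimal sketch of the TrigFlow forward process and consistency parameterisation follows. It uses the form f(x_t, t) = cos(t) x_t - sin(t) sigma_d F(x_t / sigma_d, t) from the sCM paper with the identity time conditioning c_noise(t) = t; `toy_network` is a hypothetical stand-in for F_theta, and the Gaussian toy data is an assumption.

```python
import numpy as np

SIGMA_D = 0.5
rng = np.random.default_rng(0)

def trigflow_forward(x0, z, t):
    """x_t = cos(t) x0 + sin(t) z, with t in [0, pi/2] and z ~ N(0, SIGMA_D^2 I)."""
    return np.cos(t) * x0 + np.sin(t) * z

def trigflow_consistency(F, x_t, t):
    """f(x_t, t) = cos(t) x_t - sin(t) SIGMA_D F(x_t / SIGMA_D, t)."""
    return np.cos(t) * x_t - np.sin(t) * SIGMA_D * F(x_t / SIGMA_D, t)

def toy_network(x, t):
    # Hypothetical stand-in for F_theta.
    return np.tanh(x) * (1.0 + t)

x0 = SIGMA_D * rng.normal(size=(50000, 2))   # toy data with standard deviation SIGMA_D
z = SIGMA_D * rng.normal(size=(50000, 2))    # noise scaled to the data
x_mid = trigflow_forward(x0, z, np.pi / 4)
x_end = trigflow_forward(x0, z, np.pi / 2)
boundary = trigflow_consistency(toy_network, x0, 0.0)
```

At t = 0 the parameterisation returns its input unchanged, at t = pi/2 the sample is pure noise, and in between the standard deviation stays at sigma_d when the data itself has standard deviation sigma_d.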
The authors identify several specific sources of instability in continuous-time CT and patch each:
- Time conditioning: the identity c_noise(t) = t replaces EDM's log(sigma_d tan t) to avoid blow-ups near t = pi/2.[4]
- Adaptive weighting: a weighting function w_phi(t) is learned alongside the model to equalise loss variance across timesteps.[4]

With these fixes, the sCM training algorithm reliably scales to large models. The paper reports a 1.5 billion parameter model trained on ImageNet 512x512, the largest continuous-time consistency model published.[4] Reported two-step FIDs are 2.06 on CIFAR-10, 1.48 on ImageNet 64x64, and 1.88 on ImageNet 512x512; the ImageNet 512 number is within roughly 10% of the best diffusion-model FIDs at the same resolution, while sCM uses only two function evaluations.[4] OpenAI's accompanying communications described an approximately 50x wall-clock speedup at inference, with a single sample from the 1.5B sCM generated in about 0.11 seconds on one NVIDIA A100 GPU.[13]
The paper distinguishes sCD (continuous-time consistency distillation), which uses a teacher diffusion model, from sCT (continuous-time consistency training), which trains from scratch. Both variants benefit from the TrigFlow parameterisation and the stability machinery; sCD reaches the best reported FIDs and converges in roughly 20,000 fine-tuning iterations from a strong teacher.[4] sCM was accepted to ICLR 2025.[8]
A complementary 2024 line attacks the training cost rather than the asymptotic quality. Easy Consistency Tuning (ECT), introduced by Zhengyang Geng, Ashwini Pokle, Weijian Luo, Justin Lin, and J. Zico Kolter in "Consistency Models Made Easy" (arXiv:2406.14548), reframes consistency training as a lightweight fine-tune of an already-trained diffusion model that progressively tightens the consistency condition over a shrinking time gap.[20] Because it starts from a pre-trained diffusion checkpoint rather than from scratch, ECT reaches a two-step FID of 2.73 on CIFAR-10 in about one hour on a single NVIDIA A100 GPU, matching consistency distillations that previously required hundreds of GPU hours; the paper was accepted to ICLR 2025.[20]
In October 2023, Simian Luo and collaborators at Tsinghua University published "Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference" (arXiv:2310.04378), which transposes the consistency-model recipe into the latent space of a pre-trained latent diffusion model such as Stable Diffusion.[5] Rather than running the consistency objective on pixels, the latent consistency model distills the probability flow ODE of an existing LDM in the LDM's own latent space. The resulting models generate 768x768 images in two to four sampling steps, and the authors report that the distillation requires only about 32 A100 GPU hours, since both the latent autoencoder and the teacher diffusion model are already trained.[5]
A follow-up paper from the same group, LCM-LoRA: A Universal Stable-Diffusion Acceleration Module (arXiv:2311.05556, November 2023), shows that the LCM distillation can be captured in a LoRA adapter that plugs into any fine-tuned Stable Diffusion or SDXL checkpoint without retraining, effectively acting as a "drop-in PF ODE solver."[6] LCM-LoRA modules for SD 1.5, SSD-1B, and SDXL were released openly via Hugging Face and are widely used in real-time generation interfaces.[6] The dedicated Latent Consistency Models article covers the augmented PF ODE, the guidance-scale embedding, and the Diffusers LCMScheduler in more detail.
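As a rough sketch of how such an adapter is typically attached with Hugging Face Diffusers, the helper below loads a base pipeline, swaps in `LCMScheduler`, and loads the LoRA weights. The repository identifiers, the float16 setting, and the step and guidance values are assumptions to verify against the current model cards and library documentation; the helper names are hypothetical.

```python
def build_lcm_lora_pipeline(
    base_model: str = "stabilityai/stable-diffusion-xl-base-1.0",  # assumed repo id
    lora_repo: str = "latent-consistency/lcm-lora-sdxl",           # assumed repo id
):
    """Load an SDXL pipeline, swap in LCMScheduler, and attach LCM-LoRA weights."""
    # Imports are local so this module can be imported without diffusers installed.
    import torch
    from diffusers import DiffusionPipeline, LCMScheduler

    pipe = DiffusionPipeline.from_pretrained(base_model, torch_dtype=torch.float16)
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights(lora_repo)
    return pipe

def generate(pipe, prompt: str, steps: int = 4):
    # Few steps and a guidance scale of 1.0 (guidance effectively off) are
    # assumed here as a common LCM-LoRA setting.
    return pipe(prompt=prompt, num_inference_steps=steps, guidance_scale=1.0).images[0]
```

A call such as `generate(build_lcm_lora_pipeline().to("cuda"), "a watercolor fox")` would then produce an image in four network evaluations, assuming a GPU and the weights are available.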
The consistency-distillation principle has been transferred to audio. ConsistencyTTA (arXiv:2309.10740, September 2023) applies CFG-aware latent consistency distillation to a diffusion text-to-audio model, reducing inference from hundreds of NFEs to a single network query while maintaining CLAP-score quality.[14] Follow-on systems such as CoMoSpeech and AudioLCM extend the idea to text-to-speech and singing voice synthesis with one to four step generation.[14]
The most directly comparable acceleration technique is progressive distillation (Salimans and Ho, ICLR 2022), which iteratively halves the number of sampling steps by training each student to reproduce two teacher steps in one.[10] On CIFAR-10 and ImageNet 64x64, the original consistency distillation results matched or exceeded progressive distillation at one and two sampling steps without the repeated student-teacher distillation cycle.[1] DPM-Solver and its variants (DPM-Solver-2, DPM-Solver++, DPM-Solver-v3) are training-free higher-order ODE solvers that improve sample quality in the 5 to 20 step regime; they remain competitive when more compute is available but cannot reach the one to two step quality of consistency models on most benchmarks.[9] Consistency models are also related to rectified flow (Liu et al., ICLR 2023), which straightens the noise-to-data transport so that a few-step (or, after reflow, near-one-step) Euler integration suffices; both pursue short generative paths, but rectified flow straightens the trajectory while consistency models collapse a fixed curved trajectory into a learned jump.[4] Adversarial Diffusion Distillation (ADD), used in Stability AI's SDXL Turbo (November 2023), combines score distillation with a GAN-style adversarial loss to enable one to four step generation from large foundation diffusion models, and is a popular alternative for text-to-image acceleration.[15]
| Method | Training cost | Sampling steps | One-step CIFAR-10 FID | Notes |
|---|---|---|---|---|
| Diffusion (EDM) teacher | High | ~35 | n/a | Baseline; many NFEs required[11] |
| Progressive distillation | Multi-stage | 1 to 4 | 9.12[1] | Iterative teacher to student halving[10] |
| Consistency distillation (CD) | Single stage from teacher | 1 to 4 | 3.55[1] | Original 2023 paper[1] |
| Consistency training (CT) | From scratch | 1 to 4 | 8.70[1] | No teacher diffusion model[1] |
| Improved CT (iCT) | From scratch | 1 to 4 | 2.51[2] | Lognormal sched + Pseudo-Huber[2] |
| CTM | From teacher (+ GAN, score) | 1 to a few | 1.73[19] | Anytime-to-anytime jumps; exposes score[19] |
| ECT | Fine-tune from diffusion | 1 to 2 | n/a (2-step 2.73)[20] | ~1 A100 hour on CIFAR-10[20] |
| sCT (continuous-time) | From scratch | 1 to 2 | n/a (2-step 2.06)[4] | TrigFlow + stability fixes[4] |
| sCD (continuous-time) | Single stage from teacher | 1 to 2 | n/a (2-step 1.88 on ImageNet 512)[4] | Scales to 1.5B parameters[4] |
| DPM-Solver | Training-free | ~10 | n/a | Higher-order ODE solver[9] |
| ADD (SDXL Turbo) | Adversarial + distillation | 1 to 4 | n/a | Used in real-time SDXL[15] |
(FID values are quoted from the original publications cited above and may differ slightly across re-runs. Where a method reports no one-step CIFAR-10 FID, the cell reads n/a, with the nearest reported figure given in parentheses.)
The most prominent practical impact of consistency models has been in real-time image generation. LCM and LCM-LoRA enabled the first widely-deployed sub-second 1024x1024 Stable Diffusion inference on commodity GPUs, and they underpin many interactive editors and "live canvas" interfaces released during late 2023 and 2024.[5][6][12] The sCM line shows that the underlying ideas extend to billion-parameter image generators, with two-step samples whose FID is within about 10% of the best diffusion models on ImageNet 512x512.[4][13]
Because consistency models keep most of the diffusion-model machinery (U-Net or transformer backbone, latent VAE if used) intact, they slot into existing inference frameworks. The combination of LCM-LoRA with quantisation has been used to fit Stable Diffusion variants onto laptops and phones; tooling such as the Intel OpenVINO LCM notebook and various Hugging Face Diffusers pipelines document such deployments.[12]
The original paper showed that consistency models inherit the zero-shot editing capabilities of their diffusion teachers: by initialising the sampler with a partially-noised image or with a noise vector conditioned on a mask, the same trained model can perform inpainting, colorisation, and super-resolution without any task-specific fine-tuning.[1] These properties are routinely exploited in LCM-based image editors and downstream pipelines.[1][5]
Beyond engineering, consistency models contributed a clean theoretical viewpoint on the trade-off between diffusion sampling speed and quality. They show that the entire PF ODE trajectory can in principle be summarised by a single function from (x_t, t) to x_eps, and that this function can be learned by enforcing local self-consistency along the trajectory rather than reproducing the noise sequence exactly. This perspective has informed subsequent work on "shortcut" learning, flow matching, and one-step diffusion variants, and convergence guarantees for both single-step and multistep consistency sampling have been studied (for example, arXiv:2308.11449 and arXiv:2505.03194).[16][17]
Despite their successes, consistency models have well-documented practical drawbacks. The original paper used the LPIPS perceptual metric, which introduces an external dependency on a VGG feature extractor and may bias the FID evaluation when training and evaluation share perceptual features; iCT replaced LPIPS with Pseudo-Huber to mitigate this concern.[2] Even after the iCT and sCM fixes, a quality gap to the strongest diffusion teachers persists at one step on the largest benchmarks: the sCM authors describe their result as narrowing the FID gap on ImageNet 512x512 to within about 10% of the best diffusion FIDs, but not closing it.[4] One-step sampling also offers no clear knob for guidance strength comparable to classifier-free guidance scaling in the multistep regime, and naive use of strong guidance during distillation tends to bake in artifacts.[6][15]
Training instability in the continuous-time setting was the central problem motivating sCM, and even the patched TrigFlow recipe relies on a careful combination of adaptive weighting, time conditioning, normalisation, and a tangent warmup; without these, continuous-time consistency training diverges.[4] Discrete-time CT remains hyperparameter-sensitive, with the lognormal noise schedule, discretisation curriculum, and EMA settings all having a strong effect on outcomes.[2] The community has also documented that pure consistency objectives can get trapped in local optima that sacrifice endpoint fidelity to global self-consistency, leading some authors to combine consistency training with auxiliary adversarial or energy-based losses to recover the last bits of FID; CTM's optional GAN loss is one example of this pattern.[18][15][19]
Finally, consistency models inherit the limitations and biases of their diffusion teachers when used in CD mode, since the teacher's score function defines the trajectories along which self-consistency is enforced. A noisy or miscalibrated teacher produces a noisy or miscalibrated student.[1][15]