Diffusion models are a class of generative models that learn to produce data by reversing a gradual noising process. Starting from pure random noise, these models iteratively denoise a sample until it resembles data from the training distribution. Since 2020, diffusion models have become the dominant approach in image generation, surpassing generative adversarial networks (GANs) in both sample quality and diversity. They now power many of the most widely used AI image, video, and audio generation systems.
The core idea is straightforward: given a data sample, progressively add Gaussian noise over many steps until the original signal is completely destroyed. Then train a neural network to reverse this process step by step. At generation time, start with pure noise and apply the learned reverse process to produce a new sample.
The intellectual roots of diffusion models trace back to non-equilibrium statistical physics. The formal introduction of diffusion probabilistic models to machine learning came in 2015, and the field saw rapid development from 2020 onward.
In 2005, Aapo Hyvarinen introduced score matching, a technique for estimating the gradient of the log probability density (the "score function") of a distribution without needing to compute an intractable normalizing constant. In 2011, Pascal Vincent showed that training a denoising autoencoder is equivalent to performing score matching on a noise-perturbed version of the data distribution. This connection, known as denoising score matching, later became a theoretical pillar of diffusion models.
In 2015, Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli published "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" at ICML. This paper introduced the concept of systematically destroying structure in a data distribution through an iterative forward diffusion process, then learning a reverse diffusion process that restores structure. The approach drew directly from non-equilibrium thermodynamics, modeling the forward process as a Markov chain that gradually adds noise. While the results were promising, the generated image quality did not match GANs at the time, and the paper received relatively limited attention for several years.
In 2019, Yang Song and Stefano Ermon proposed Noise Conditional Score Networks (NCSN), which estimated the score function at multiple noise levels and used Langevin dynamics to generate samples. This approach, sometimes called score-based generative modeling, produced competitive image samples and demonstrated that score matching could serve as an effective foundation for generation.
The modern era of diffusion models began with the 2020 paper "Denoising Diffusion Probabilistic Models" (DDPM) by Jonathan Ho, Ajay Jain, and Pieter Abbeel. DDPM showed that diffusion models could generate images competitive with GANs in terms of quality, while avoiding the training instability and mode collapse problems that plagued adversarial approaches. The key insight was a simplified training objective: instead of predicting the clean data directly, the network predicts the noise that was added at each step. This noise prediction objective proved both stable to train and effective at producing high-quality samples.
DDPM achieved an FID score of 3.17 on the CIFAR-10 dataset, which was state of the art for likelihood-based models at the time. The paper demonstrated that diffusion models were a serious competitor to GANs for image generation.
In late 2020, Jiaming Song, Chenlin Meng, and Stefano Ermon introduced Denoising Diffusion Implicit Models (DDIM), which generalized the DDPM sampling process to non-Markovian diffusion processes. DDIM enabled deterministic sampling and could produce high-quality images 10 to 50 times faster than DDPM by using fewer sampling steps.
In 2021, Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole published "Score-Based Generative Modeling through Stochastic Differential Equations," which unified DDPM and score-based models into a single framework based on stochastic differential equations (SDEs). In this framework, the forward noising process is described by a continuous-time SDE, and generation follows the corresponding reverse-time SDE. This unification showed that DDPM and score-based approaches are two perspectives on the same underlying mathematical structure.
Also in 2021, Prafulla Dhariwal and Alex Nichol at OpenAI published "Diffusion Models Beat GANs on Image Synthesis," introducing classifier guidance and architectural improvements that allowed diffusion models to surpass GANs on ImageNet generation for the first time. This result marked a turning point: diffusion models were no longer just competitive with GANs but could outperform them.
Diffusion models are built on two processes: a forward process that gradually adds noise to data, and a reverse process that learns to remove that noise.
Given a data point x_0 sampled from the real data distribution q(x_0), the forward process defines a Markov chain that adds Gaussian noise over T steps:
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
Here, beta_1, beta_2, ..., beta_T is a noise schedule that controls how much noise is added at each step. As t increases, the data becomes progressively noisier. After sufficiently many steps, x_T is approximately standard Gaussian noise.
A useful property of this formulation is that x_t can be sampled directly at any timestep t without iterating through all previous steps. Define alpha_t = 1 - beta_t and alpha_bar_t = product of alpha_1 through alpha_t. Then:
x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
where epsilon is drawn from N(0, I). This closed-form expression is central to efficient training, since the model can be trained on randomly sampled timesteps rather than full sequences.
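This closed-form jump is easy to sketch in NumPy. The schedule constants below follow the original DDPM linear schedule; the helper name `q_sample` is illustrative, not from any particular library:

```python
import numpy as np

# Linear schedule from the original DDPM paper: beta_1 = 1e-4, beta_T = 0.02, T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Jump directly to timestep t using the closed-form expression."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)    # stand-in for a flattened training image
eps = rng.standard_normal(16)
x_noisy = q_sample(x0, T - 1, eps)
# By the final step almost no signal remains: alpha_bar_T is close to zero.
print(float(alpha_bars[-1]))
```

Because each timestep can be reached in one line, a training batch can mix many different timesteps without ever simulating the full chain.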
The reverse process starts from Gaussian noise x_T and iteratively denoises to recover a data sample. The true reverse conditional q(x_{t-1} | x_t) is intractable, but when conditioned on x_0, the posterior q(x_{t-1} | x_t, x_0) is Gaussian and can be computed in closed form.
A neural network with parameters theta is trained to approximate the reverse transitions:
p_theta(x_{t-1} | x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2 * I)
The network predicts parameters of this Gaussian distribution. In the DDPM formulation, the mean mu_theta is parameterized in terms of a noise prediction network epsilon_theta(x_t, t), which estimates the noise that was added to produce x_t from x_0.
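A minimal sketch of one reverse step, assuming the common variance choice sigma_t^2 = beta_t; `eps_pred` is a placeholder for the network output epsilon_theta(x_t, t):

```python
import numpy as np

def ddpm_step(x_t, t, eps_pred, betas, alpha_bars, rng):
    """One reverse (ancestral) DDPM step with sigma_t^2 = beta_t.
    eps_pred stands in for the network's noise prediction at (x_t, t)."""
    alpha_t = 1.0 - betas[t]
    # Posterior mean expressed via the noise-prediction parameterization.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean                                   # no noise added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x_T = rng.standard_normal(16)                         # start from pure noise
x_prev = ddpm_step(x_T, 999, np.zeros(16), betas, alpha_bars, rng)
```

Full sampling simply applies this step for t = T-1 down to 0, calling the trained network once per step.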
The standard DDPM training objective is a simplified form of the variational lower bound (VLB) on the data log-likelihood. The simplified loss function is:
L_simple = E[|| epsilon - epsilon_theta(x_t, t) ||^2]
where the expectation is over t sampled uniformly from {1, ..., T}, x_0 sampled from the data, and epsilon sampled from N(0, I). In practice, this amounts to: (1) pick a random training image, (2) pick a random timestep, (3) add the corresponding amount of noise, and (4) train the network to predict the noise that was added.
This simple mean squared error loss on noise prediction works remarkably well. Ho et al. found that the simplified objective produced better sample quality than the full variational bound, likely because it down-weights the loss terms at low noise levels (small t) and lets training focus on the harder denoising tasks at higher noise levels.
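The four training steps above amount to one Monte Carlo evaluation of L_simple. A sketch, with `zero_model` as a stand-in for a real denoising network:

```python
import numpy as np

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def simple_loss(model, x0, rng):
    """One Monte Carlo estimate of L_simple: random timestep, random noise,
    then MSE between the true noise and the model's prediction."""
    t = int(rng.integers(0, T))                                   # (2) random timestep
    eps = rng.standard_normal(x0.shape)                           # (3) noise to add
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return float(np.mean((eps - model(x_t, t)) ** 2))             # (4) predict the noise

zero_model = lambda x_t, t: np.zeros_like(x_t)   # stand-in for a trained network
rng = np.random.default_rng(0)
loss = simple_loss(zero_model, rng.standard_normal(64), rng)
# For the zero model the loss is roughly E[eps^2] = 1 per dimension.
```

In a real training loop, `loss` would be backpropagated through the network; everything else stays the same.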
From the score matching perspective, the noise prediction network is closely related to the score function, which is the gradient of the log probability density with respect to the data:
score(x) = gradient of log p(x) with respect to x
Specifically, the noise prediction at timestep t is proportional to the score of the noisy data distribution at that noise level. The score function tells the model which direction to "push" a noisy sample to move it toward higher-probability (cleaner) regions of the data distribution.
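For the conditional Gaussian q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I), this relationship can be checked directly; the value of alpha_bar_t below is arbitrary:

```python
import numpy as np

# For a Gaussian, grad log p(x_t) = -(x_t - mean) / variance, which here
# simplifies to -eps / sqrt(1 - alpha_bar_t).
ab = 0.5                                  # alpha_bar_t at some timestep (illustrative)
rng = np.random.default_rng(2)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

score = -(x_t - np.sqrt(ab) * x0) / (1 - ab)
assert np.allclose(score, -eps / np.sqrt(1 - ab))
```

This is why a network trained to predict the noise implicitly learns the score of each noise-perturbed distribution.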
Yang Song and Stefano Ermon's score-based framework trains a neural network to directly estimate this score function at many noise levels. At generation time, samples are produced by running Langevin dynamics (or annealed Langevin dynamics), which follows the score function to iteratively move from noise toward data.
The noise schedule {beta_t} determines how quickly the forward process destroys the data signal. Common choices include:
| Schedule | Description | Notes |
|---|---|---|
| Linear | beta_t increases linearly from beta_1 to beta_T | Used in the original DDPM paper (beta_1 = 0.0001, beta_T = 0.02, T = 1000) |
| Cosine | alpha_bar_t follows a cosine curve | Proposed by Nichol and Dhariwal (2021); avoids too-rapid noise addition at early steps |
| Scaled linear | Linear schedule scaled for latent space | Common in latent diffusion models |
| Sigmoid | beta_t follows a sigmoid curve | Used in some continuous-time formulations |
The choice of noise schedule affects both training stability and sample quality. The cosine schedule generally produces better results than the linear schedule because it adds noise more gradually at the start of the process, preserving more signal in the intermediate steps.
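A sketch of both schedules, assuming T = 1000 and the standard constants. The cosine formula follows Nichol and Dhariwal; their additional clipping of beta_t at 0.999 is omitted for brevity:

```python
import numpy as np

def linear_alpha_bars(T=1000, beta_1=1e-4, beta_T=0.02):
    return np.cumprod(1.0 - np.linspace(beta_1, beta_T, T))

def cosine_alpha_bars(T=1000, s=0.008):
    """Cosine schedule of Nichol and Dhariwal (2021): alpha_bar follows a
    squared cosine; the small offset s keeps early betas from vanishing."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    return f[1:] / f[0]

lin, cos_ = linear_alpha_bars(), cosine_alpha_bars()
# Midway through the process, the cosine schedule preserves far more signal.
print(float(lin[499]), float(cos_[499]))
```

Plotting the two alpha_bar curves makes the difference obvious: the linear schedule has destroyed most of the signal by the midpoint, while the cosine schedule degrades it more evenly.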
Most early diffusion models use a U-Net architecture as the denoising network. The U-Net was originally designed for biomedical image segmentation and features an encoder-decoder structure with skip connections between corresponding encoder and decoder layers.
In the diffusion model context, the U-Net takes a noisy image x_t and a timestep t as input and predicts the noise epsilon. The architecture typically includes residual blocks, self-attention layers at the lower-resolution levels, and sinusoidal timestep embeddings injected into each block.
The encoder progressively downsamples the spatial resolution while increasing the channel dimension, and the decoder upsamples back to the original resolution. This multi-scale structure allows the network to capture both fine-grained texture details and global image structure.
In 2023, William Peebles and Saining Xie introduced the Diffusion Transformer (DiT), which replaces the U-Net with a transformer that operates on sequences of image patches. The DiT architecture is similar to a Vision Transformer (ViT), modified to accept timestep and class conditioning through adaptive layer normalization (adaLN).
The key finding was that DiT models follow clear scaling laws: models with more compute (measured in GFLOPs) consistently achieve lower FID scores. The largest model, DiT-XL/2, achieved a state-of-the-art FID of 2.27 on class-conditional ImageNet 256x256 generation.
The DiT architecture has since been adopted by several major systems, including Stable Diffusion 3, Sora, and FLUX. The shift from U-Net to transformer reflects a broader trend in deep learning toward transformer-based architectures across modalities.
DDIM, proposed by Jiaming Song, Chenlin Meng, and Stefano Ermon in 2020, generalizes the DDPM framework by constructing non-Markovian forward processes that share the same training objective as DDPM but allow for faster, deterministic sampling.
The reverse step in DDIM is:
x_{t-1} = sqrt(alpha_bar_{t-1}) * predicted_x_0 + sqrt(1 - alpha_bar_{t-1}) * epsilon_theta(x_t, t)

where predicted_x_0 = (x_t - sqrt(1 - alpha_bar_t) * epsilon_theta(x_t, t)) / sqrt(alpha_bar_t) is the model's current estimate of the clean sample, and the second term is the "direction pointing to x_t."
Because DDIM sampling is deterministic (given a fixed initial noise vector), it enables consistent image generation from the same latent code and supports meaningful interpolation in latent space. DDIM can use as few as 10 to 50 steps instead of DDPM's 1000 steps, with relatively minor quality loss.
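A sketch of the deterministic (eta = 0) DDIM update. The sanity check below uses the closed-form forward process in place of a trained network:

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM step (eta = 0). ab_t and ab_prev are alpha_bar at
    the current and the next, less-noisy timestep."""
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)   # predicted x_0
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

# Sanity check: with the *true* noise as the prediction, one step lands exactly
# on the forward-process sample at the earlier timestep.
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(8), rng.standard_normal(8)
ab_t, ab_prev = 0.3, 0.7
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
```

Because the update only needs alpha_bar at the two endpoints, the timesteps can be spaced far apart, which is what allows DDIM to skip most of the 1000-step schedule.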
A major efficiency breakthrough came with Latent Diffusion Models, introduced by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer in their 2022 CVPR paper "High-Resolution Image Synthesis with Latent Diffusion Models."
The core idea is to run the diffusion process not on raw pixels but in the compressed latent space of a pretrained variational autoencoder (VAE). The VAE encoder compresses an image (e.g., 512x512x3 pixels) into a much smaller latent representation (e.g., 64x64x4), and the diffusion model operates on this compact representation. After diffusion generates a latent code, the VAE decoder converts it back to pixel space.
This approach offers several advantages: training and sampling are far cheaper because the diffusion network processes a much smaller tensor, memory requirements drop accordingly, high-resolution synthesis becomes practical on modest hardware, and the VAE decoder handles fine perceptual detail so the diffusion model can focus on semantic composition.
Latent diffusion models became the foundation for Stable Diffusion and influenced the design of many subsequent systems.
Dhariwal and Nichol (2021) introduced classifier guidance to improve the quality of conditional generation. The idea is to train a separate classifier on noisy images and use its gradients to steer the diffusion sampling process toward a desired class. During each reverse step, the noise prediction is shifted along the gradient of the classifier's log probability for the target class, scaled by a guidance weight s:
epsilon_guided = epsilon_theta(x_t, t) - s * sqrt(1 - alpha_bar_t) * gradient of log p(y | x_t)
Higher values of s produce images more strongly associated with the target class but with reduced diversity. This approach requires training a separate classifier on noisy data, which adds complexity.
Jonathan Ho and Tim Salimans proposed classifier-free guidance (CFG) in 2022, which eliminates the need for a separate classifier. During training, the conditioning signal (e.g., text prompt) is randomly dropped for a fraction of examples, so the model learns both conditional and unconditional generation within a single network.
At inference time, the guided prediction is a linear combination:
epsilon_guided = epsilon_unconditional + w * (epsilon_conditional - epsilon_unconditional)
where w is the guidance scale. When w = 1, sampling is standard conditional generation. Values of w > 1 amplify the influence of the conditioning signal, producing outputs more closely aligned with the input text at the cost of reduced diversity.
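The combination itself is a one-liner. Here `eps_uncond` and `eps_cond` stand in for two forward passes of the same network, without and with the conditioning signal:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional prediction toward
    (and, for w > 1, past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
e_u, e_c = rng.standard_normal(4), rng.standard_normal(4)
guided = cfg(e_u, e_c, 7.5)   # a typical text-to-image guidance scale
```

In practice both predictions are usually computed in one batched forward pass by duplicating x_t with and without the text embedding.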
Classifier-free guidance has become the standard conditioning approach in virtually all modern text-to-image diffusion systems, including DALL-E 2, Stable Diffusion, Imagen, and Midjourney. Typical guidance scale values range from 7 to 15 for text-to-image generation.
In text-conditioned diffusion models, the text prompt is first encoded into a sequence of embedding vectors using a text encoder such as CLIP or T5. These embeddings are then injected into the denoising network through cross-attention layers.
In each cross-attention layer, the image features serve as queries (Q), while the text embeddings provide keys (K) and values (V). This allows every spatial position in the image feature map to attend to relevant parts of the text description, enabling fine-grained alignment between the generated image and the input prompt.
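A single-head sketch with random weights; real implementations use learned multi-head projections, but the query/key/value roles are the same:

```python
import numpy as np

def cross_attention(img_feats, txt_embs, Wq, Wk, Wv):
    """Single-head cross-attention: image positions attend over text tokens.
    img_feats: (N, d) flattened spatial positions; txt_embs: (L, d) token embeddings."""
    Q = img_feats @ Wq                      # queries come from the image
    K = txt_embs @ Wk                       # keys come from the text
    V = txt_embs @ Wv                       # values come from the text
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over text tokens
    return weights @ V                      # (N, d): text-informed image features

rng = np.random.default_rng(3)
N, L, d = 16, 7, 8                          # 16 spatial positions, 7 text tokens
out = cross_attention(rng.standard_normal((N, d)), rng.standard_normal((L, d)),
                      rng.standard_normal((d, d)), rng.standard_normal((d, d)),
                      rng.standard_normal((d, d)))
```

Each row of the attention weights is a distribution over text tokens, which is why visualizing these weights reveals which words influence which image regions.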
For Stable Diffusion 1.x and 2.x, the text encoder is a CLIP model. Stable Diffusion XL uses two text encoders (CLIP ViT-L and OpenCLIP ViT-bigG). Stable Diffusion 3 uses three text encoders (two CLIP models and T5-XXL).
The following table summarizes the most prominent diffusion-based generation systems.
| System | Organization | Release | Architecture | Capabilities |
|---|---|---|---|---|
| DALL-E 2 | OpenAI | April 2022 | CLIP prior + cascaded diffusion decoder (unCLIP) | Text-to-image, image variations, inpainting |
| Imagen | Google Brain | May 2022 | T5-XXL text encoder + cascaded U-Net diffusion (64 to 256 to 1024) | Text-to-image at 1024x1024 resolution |
| Stable Diffusion 1.5 | Stability AI / CompVis / Runway | October 2022 | Latent diffusion with U-Net, CLIP text encoder, VAE | Text-to-image, inpainting, img2img; open source |
| Midjourney v4 | Midjourney, Inc. | November 2022 | Diffusion model (proprietary) | Text-to-image via Discord |
| DALL-E 3 | OpenAI | October 2023 | Improved diffusion with better captioning pipeline | Text-to-image with strong text rendering and prompt following |
| Stable Diffusion XL | Stability AI | July 2023 | Latent diffusion with larger U-Net, dual CLIP encoders | Higher resolution (1024x1024), better detail |
| Midjourney v6 | Midjourney, Inc. | December 2023 | Third-generation model trained from scratch | Improved photorealism, text in images |
| Stable Diffusion 3 | Stability AI | June 2024 | Multimodal DiT (MMDiT) with rectified flow | Three text encoders (CLIP x2 + T5-XXL), improved text rendering |
| FLUX.1 | Black Forest Labs | August 2024 | 12B parameter rectified flow transformer | Text-to-image; Pro, Dev, and Schnell variants |
| Sora | OpenAI | December 2024 (public) | Diffusion transformer on spacetime patches | Text-to-video at up to 1080p |
| Midjourney v7 | Midjourney, Inc. | April 2025 | New architecture (proprietary) | Draft mode, improved coherence |
| FLUX.2 | Black Forest Labs | November 2025 | Rectified flow transformer + Mistral-3 24B VLM | Unified generation and editing |
DALL-E 2, released by OpenAI in April 2022, uses an approach called unCLIP. It consists of a CLIP text encoder, a prior model that maps CLIP text embeddings to CLIP image embeddings, and a diffusion decoder that generates images conditioned on the image embedding. The prior can be either a diffusion model or an autoregressive model. Two super-resolution diffusion models upsample the output from 64x64 to 256x256 and then to 1024x1024. The full paper, "Hierarchical Text-Conditional Image Generation with CLIP Latents," was authored by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen.
DALL-E 3, released in October 2023, focused on improving prompt fidelity rather than changing the fundamental diffusion architecture. The key innovation was training on vastly improved image-text pairs generated by a recaptioning pipeline. James Betker and colleagues showed that the quality of text-image pairing in training data is a primary bottleneck for prompt following. By generating highly descriptive captions for training images, DALL-E 3 achieved significantly better alignment between text prompts and generated images, including much-improved text rendering within images.
Imagen, introduced by Chitwan Saharia and colleagues at Google Brain in May 2022, demonstrated that scaling the text encoder (using a frozen T5-XXL language model with 11 billion parameters) improved image quality and text alignment more effectively than scaling the diffusion model itself. Imagen uses a cascade of three diffusion models: a base model generating 64x64 images, a first super-resolution model upsampling to 256x256, and a second upsampling to 1024x1024. All three denoising networks are U-Nets.
Stable Diffusion, first released in August 2022, is the most widely used open-source diffusion model. It is based on the latent diffusion architecture developed by Robin Rombach, Andreas Blattmann, and colleagues. The model operates in the latent space of a VAE, using a U-Net (versions 1.x through XL) or a Diffusion Transformer (version 3 onward) as the denoising backbone.
Stable Diffusion's open-source release had a large impact on the generative AI community, enabling a broad ecosystem of fine-tuned models, LoRA adapters, ControlNet extensions, and custom pipelines. The model has been adapted for tasks including inpainting, outpainting, image-to-image translation, and ControlNet-guided generation.
Stable Diffusion 3, released in June 2024, replaced the U-Net with a Multimodal Diffusion Transformer (MMDiT) and adopted rectified flow for the noise schedule. It uses three text encoders for richer text understanding.
Midjourney is a proprietary text-to-image service created by David Holz and a small team of engineers in San Francisco. The company, founded in August 2021, initially operated its service exclusively through Discord. While Midjourney has not published detailed technical papers about its architecture, it is known to use a diffusion model conditioned on text via a large language model.
Midjourney has released multiple major versions, with each iteration trained from scratch rather than fine-tuned from the previous version. Version 6 (December 2023) was the first to support text rendering in images. Version 7 (April 2025) introduced a completely new architecture with a draft mode for rapid prototyping at reduced cost.
FLUX is a family of text-to-image models developed by Black Forest Labs, a company founded in 2024 by Robin Rombach, Andreas Blattmann, and Patrick Esser (the same researchers behind Latent Diffusion Models and Stable Diffusion). FLUX.1, released in August 2024, uses a 12-billion-parameter rectified flow transformer architecture. It was offered in three variants: Pro (proprietary, highest quality), Dev (open-weight, for non-commercial use), and Schnell (fast, Apache 2.0 licensed).
FLUX.2, released in November 2025, couples the rectified flow transformer with a Mistral-3 24B vision-language model, combining image generation and editing in a single architecture.
Sora is OpenAI's text-to-video model, first previewed in February 2024 and released publicly in December 2024. It is a diffusion transformer that generates video by denoising spacetime patches in a latent space. The research preview demonstrated videos up to one minute long; the public release generates clips of up to 20 seconds at resolutions up to 1080p, with coherent motion and scene consistency. Its architecture combines a video compressor (mapping video to latent space), a Vision Transformer-based denoiser, and a CLIP-like conditioning mechanism for text guidance.
Text-to-image generation is the most prominent application of diffusion models. Given a text description, the model generates a corresponding image. All major systems (DALL-E, Stable Diffusion, Midjourney, Imagen, FLUX) operate in this mode. Modern systems can handle complex multi-object scenes, specific art styles, photorealistic rendering, and even legible text within images.
Diffusion models can transform an existing image based on a text prompt. The process starts with a partially noised version of the input image (rather than pure noise) and denoises it according to a new text prompt. The amount of initial noise controls the balance between preserving the input image and following the new prompt. This technique, sometimes called SDEdit, enables style transfer, content modification, and creative editing.
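The starting point can be sketched as follows; `img2img_start` is an illustrative name, and the reverse denoising process itself (which would run from the chosen timestep down to 0) is elided:

```python
import numpy as np

def img2img_start(x_input, strength, alpha_bars, rng):
    """Pick a starting point for SDEdit-style img2img: noise the input up to an
    intermediate timestep chosen by `strength` (0 = keep input, 1 = pure noise)."""
    t_start = int(round(strength * (len(alpha_bars) - 1)))
    ab = alpha_bars[t_start]
    x_t = np.sqrt(ab) * x_input + np.sqrt(1.0 - ab) * rng.standard_normal(x_input.shape)
    return x_t, t_start        # then denoise from t_start down to 0 with the new prompt

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x_t, t_start = img2img_start(rng.standard_normal(16), 0.6, alpha_bars, rng)
```

A low strength keeps the layout and colors of the input; a high strength retains only a loose resemblance while following the prompt more freely.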
Inpainting uses a diffusion model to fill in masked regions of an image, guided by a text description and the surrounding context. Outpainting extends an image beyond its original boundaries. Both tasks leverage the model's ability to generate content that is contextually consistent with existing pixels.
Diffusion models can upsample low-resolution images to higher resolutions while adding realistic detail. This is used both as a standalone application and as part of cascaded generation pipelines (as in Imagen and DALL-E 2), where a base model generates a small image that is progressively upsampled by specialized super-resolution diffusion models.
Diffusion models have been extended to video generation, where the denoising process operates on sequences of frames. Sora (OpenAI), Runway Gen-2, and Stable Video Diffusion are examples of video diffusion models. The main challenge is maintaining temporal coherence (consistent objects, smooth motion) across frames. Sora addresses this by operating on spacetime patches that jointly model spatial and temporal dimensions.
AudioLDM, introduced in 2023, applies the latent diffusion framework to audio generation. It uses contrastive language-audio pretraining (CLAP) embeddings to condition a latent diffusion model for generating speech, sound effects, and music from text descriptions. AudioLDM 2 extends this framework to handle speech, music, and sound effects within a unified architecture using a self-supervised audio representation called "language of audio" (LOA).
Other notable audio diffusion systems include Riffusion (which generates music through spectrogram diffusion) and various text-to-speech systems that use diffusion for high-quality speech synthesis.
Diffusion models have found applications in computational chemistry and drug design. They can generate three-dimensional molecular structures with specified properties, which is useful for discovering new drug candidates. Unlike traditional molecular generation methods, diffusion models can capture complex nonlinear relationships in molecular geometry and produce novel structures that are difficult to generate through conventional approaches.
In structural biology, diffusion models have achieved strong results on protein 3D structure generation and molecular docking. AlphaFold 3, which can predict the structures of protein complexes, ligand-protein interactions, and nucleic acid structures, incorporates a diffusion-based module for structure prediction. Diffusion models for protein design can generate novel protein sequences and structures with desired functional properties, with applications in drug development, vaccine research, and enzyme engineering.
Several methods extend diffusion models with additional spatial control signals beyond text. ControlNet and T2I-Adapter, for example, condition generation on edge maps, depth maps, segmentation masks, or human poses by injecting features from an auxiliary network into the denoising backbone. The following table compares diffusion models with other families of generative models.
| Feature | Diffusion models | GANs | VAEs | Flow-based models |
|---|---|---|---|---|
| Training stability | Stable; single network trained with MSE loss | Unstable; requires balancing generator and discriminator | Stable; trained with ELBO | Stable; trained with exact log-likelihood |
| Sample quality | State of the art for images and video | High quality but prone to artifacts | Often blurry due to pixel-level reconstruction loss | Good but generally below diffusion and GANs |
| Sample diversity | High; good mode coverage | Susceptible to mode collapse | High diversity by design | High diversity |
| Generation speed | Slow (many iterative steps); can be accelerated with distillation | Fast (single forward pass) | Fast (single decoder pass) | Fast (single pass through invertible layers) |
| Likelihood estimation | Approximate (via variational bound) | Not available | Approximate (ELBO) | Exact (change of variables) |
| Conditioning | Flexible via classifier-free guidance and cross-attention | Requires conditional architectures | Conditional VAE variants | Conditional flow variants |
| Computational cost (training) | High (many timesteps per sample) | Moderate to high | Moderate | High (invertible constraints) |
Diffusion models have largely replaced GANs as the preferred approach for high-quality image generation. GANs remain useful for real-time applications where generation speed is critical, and they are sometimes used as discriminators or for distilling diffusion models into faster single-step generators. VAEs continue to play a supporting role as the encoder-decoder framework in latent diffusion models.
Consistency models, introduced by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever in 2023, learn to map any point along the diffusion trajectory directly to the clean data point in a single step. This allows high-quality generation in one or very few steps, rather than the dozens or hundreds of steps required by standard diffusion sampling. Consistency models can be trained either by distilling a pretrained diffusion model or from scratch.
Enhanced Consistency Models (ECM) further improved this approach, achieving quality close to multi-step diffusion models with only two sampling steps. Research published at ICLR 2025 showed that the Straight Consistent Trajectory (SCoT) framework can unify consistency models and rectified flow, combining the benefits of both approaches.
Rectified flow learns a transport map between two distributions (noise and data) along straight paths, rather than the curved trajectories typical of standard diffusion. This makes the generation process more efficient because straighter paths require fewer discretization steps. Stable Diffusion 3 and FLUX both use rectified flow as their noise formulation.
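The straight-path property can be illustrated directly: with a perfect velocity model, a single Euler step recovers the data point. In practice a learned network approximates the velocity and a handful of steps are used:

```python
import numpy as np

# Rectified flow trains a network v(x_t, t) to predict the velocity x1 - x0
# along the straight path x_t = (1 - t) * x0 + t * x1, where x1 is noise.
rng = np.random.default_rng(4)
x0 = rng.standard_normal(8)       # data sample
x1 = rng.standard_normal(8)       # noise sample

def interpolate(x0, x1, t):
    return (1 - t) * x0 + t * x1

target_velocity = x1 - x0          # constant at every t on a straight path

# With a perfect velocity model, one Euler step from t = 1 back to t = 0
# recovers x0 exactly -- no discretization error on a straight line.
x_gen = x1 - 1.0 * target_velocity
```

Standard diffusion trajectories are curved, so Euler steps of this size would overshoot; the straighter the learned paths, the fewer steps sampling needs.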
Several methods have been developed to compress the multi-step diffusion process into fewer steps, including progressive distillation (which repeatedly trains a student to match two teacher steps in one), adversarial diffusion distillation (used in SDXL Turbo), and consistency distillation (used in latent consistency models).
These techniques have enabled near-real-time image generation. SDXL Turbo can produce 512x512 images in a single step, and SDXL Lightning achieves high quality in 2 to 4 steps.
Following the DiT paper's finding that diffusion transformers exhibit predictable scaling behavior, the field has moved toward larger transformer-based diffusion models. FLUX.1 uses 12 billion parameters, and proprietary systems like Midjourney v7 and DALL-E 3 are believed to be similarly large. The Dynamic Diffusion Transformer (presented at ICLR 2025) introduced mechanisms for adaptive computation during the denoising process, allocating more compute to harder denoising steps.
The original DDPM used T = 1000 timesteps for both training and sampling. This made generation slow, requiring 1000 sequential neural network evaluations per image. Modern approaches have dramatically reduced this: DDIM-style samplers need only 10 to 50 steps, higher-order ODE solvers such as DPM-Solver reach comparable quality in roughly 10 to 20 steps, and distilled models generate in 1 to 4 steps.
Diffusion models can parameterize their output in different ways: epsilon-prediction, where the network predicts the added noise (as in DDPM); x_0-prediction, where it predicts the clean sample directly; and v-prediction, where it predicts a "velocity" that linearly combines the two (introduced by Salimans and Ho in their progressive distillation work).
All three parameterizations are mathematically equivalent and can be converted between each other, but they have different numerical properties that affect training stability and sample quality at different noise levels.
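The conversions can be verified numerically. This sketch assumes the v-parameterization of Salimans and Ho, writing a = sqrt(alpha_bar_t) and s = sqrt(1 - alpha_bar_t):

```python
import numpy as np

# The three parameterizations are related linearly (note a^2 + s^2 = 1):
#   x_t = a * x0 + s * eps     (forward process)
#   v   = a * eps - s * x0     (v-prediction target)
#   x0  = a * x_t - s * v      (recover the clean sample from v)
#   eps = s * x_t + a * v      (recover the noise from v)
rng = np.random.default_rng(5)
x0, eps = rng.standard_normal(6), rng.standard_normal(6)
ab = 0.7                                   # alpha_bar_t at some timestep (illustrative)
a, s = np.sqrt(ab), np.sqrt(1 - ab)
x_t = a * x0 + s * eps
v = a * eps - s * x0
assert np.allclose(a * x_t - s * v, x0)
assert np.allclose(s * x_t + a * v, eps)
```

Because the mapping is an orthogonal rotation of (x0, eps), a model trained with one target can be converted to another at inference time without retraining.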
Early diffusion models were trained at fixed resolutions (typically 256x256 or 512x512). Stable Diffusion XL introduced training on multiple aspect ratios using bucket-based batching, where training images are grouped into aspect ratio buckets and the model learns to generate at various resolutions. Stable Diffusion 3 and FLUX handle arbitrary resolutions and aspect ratios natively.
In classifier-free guidance, the "unconditional" prediction can be replaced with a prediction conditioned on a negative prompt, which describes attributes the user wants to avoid. The guidance formula becomes:
epsilon_guided = epsilon_negative + w * (epsilon_positive - epsilon_negative)
This technique allows users to steer the generation away from undesired features (e.g., "blurry, low quality, distorted hands") while still amplifying the desired attributes.
Diffusion models, despite their strong performance, have several known limitations. Sampling remains slower than single-pass generators even after distillation; training requires substantial compute and data; likelihoods are only available through a variational bound in the standard formulation; and generated images can still fail on precise spatial relationships, exact object counts, and long runs of legible text.