A latent diffusion model (LDM) is a type of diffusion model that performs the denoising diffusion process in a compressed latent space rather than directly in pixel space. The approach was introduced by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer in December 2021; their paper, "High-Resolution Image Synthesis with Latent Diffusion Models", was published at CVPR 2022 and has become one of the most influential works in generative AI [1]. The key insight is that by first compressing images into a lower-dimensional latent representation using a pretrained autoencoder, the computationally expensive diffusion process can operate on a much smaller representation, reducing training and inference costs by roughly two orders of magnitude compared to pixel-space diffusion while maintaining comparable or superior image quality.
Latent diffusion models are the direct architectural foundation of Stable Diffusion, one of the most widely used open-source image generation systems in the world. The approach also influenced the broader field of AI image generation, including systems like DALL-E 2, Imagen, and their successors. The separation of perceptual compression from the generative model proved to be a pivotal design choice, enabling high-resolution image synthesis on consumer-grade hardware and catalyzing the explosion of open-source image generation tools from 2022 onward.
Diffusion models, first introduced by Sohl-Dickstein et al. (2015) and refined by Ho et al. in their landmark 2020 paper on Denoising Diffusion Probabilistic Models (DDPMs) [2], generate images by learning to reverse a gradual noising process. Starting from pure Gaussian noise, the model iteratively removes noise over many steps until a clean image emerges. These models produced remarkably high-quality images, rivaling and eventually surpassing generative adversarial networks (GANs) in terms of both sample quality and diversity.
However, pixel-space diffusion models had a critical practical limitation: they operated directly on the full-resolution image representation. For a 512x512 RGB image, the diffusion model had to work with a tensor of shape 3 x 512 x 512 (786,432 values) at every denoising step. This made training extremely expensive. State-of-the-art pixel-space diffusion models like ADM (Dhariwal and Nichol, 2021) required hundreds of GPU-days for training and consumed substantial memory during inference [3]. Generating a single 256x256 image could take minutes on a high-end GPU. Scaling to higher resolutions (512x512 or 1024x1024) was prohibitively expensive for most researchers and practitioners.
Rombach et al. observed that much of this computation was wasted on modeling imperceptible high-frequency details in pixel space. They argued that image generation could be decomposed into two stages: perceptual compression (learning a compact, semantically meaningful representation of images) and generative modeling (learning the distribution of these representations). The first stage could be handled by a pretrained autoencoder, and the diffusion model would then only need to operate in the resulting compressed latent space.
Before latent diffusion, several approaches had attempted to make diffusion models more efficient. Cascaded diffusion models (like those used in DALL-E 2 and Imagen) generated low-resolution images first and then used separate super-resolution diffusion models to upscale them. While effective, this approach still required substantial computation and introduced complexity. Other approaches used progressive generation or truncated diffusion schedules, but none achieved the dramatic efficiency gains of moving the entire diffusion process to a learned latent space.
The idea of combining autoencoders with generative models was not new. Variational autoencoders (VAEs) had long combined encoding and generation, but their outputs tended to be blurry. VQ-VAE and VQ-VAE-2 used vector-quantized autoencoders with autoregressive priors in the latent space. Latent diffusion models built on these ideas by pairing a high-quality continuous autoencoder with the powerful denoising diffusion framework.
The latent diffusion model architecture consists of three main components that work together in a pipeline: a perceptual compression model (the autoencoder), a diffusion model operating in latent space, and a conditioning mechanism.
The first component is a pretrained autoencoder consisting of an encoder E and a decoder D. The encoder maps a high-resolution input image x (in RGB pixel space) to a lower-dimensional latent representation z = E(x). The decoder reconstructs the image from this latent: x_hat = D(z) = D(E(x)).
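The shape contract between encoder and decoder can be illustrated with a toy stand-in. The real autoencoder is a convolutional VAE that also changes the channel count (3 pixel channels to c latent channels); the sketch below uses average pooling and nearest-neighbor upsampling purely to mimic the spatial downsampling by f = 8, and is not the paper's architecture.

```python
import numpy as np

# Toy stand-in for the pretrained autoencoder pair (E, D). A real LDM uses a
# convolutional VAE that maps 3 pixel channels to c latent channels; here we
# keep 3 channels and only mimic the spatial contract z = E(x), x_hat = D(z).
F = 8  # downsampling factor

def E(x: np.ndarray) -> np.ndarray:
    """Downsample H x W to (H/f) x (W/f) by average pooling each channel."""
    h, w, c = x.shape
    return x.reshape(h // F, F, w // F, F, c).mean(axis=(1, 3))

def D(z: np.ndarray) -> np.ndarray:
    """Upsample back to pixel resolution by nearest-neighbor repetition."""
    return z.repeat(F, axis=0).repeat(F, axis=1)

x = np.ones((512, 512, 3), dtype=np.float32)
z = E(x)
x_hat = D(z)
print(z.shape, x_hat.shape)  # (64, 64, 3) (512, 512, 3)
```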
The autoencoder is trained with a combination of objectives: a pixel-space reconstruction loss, a perceptual (LPIPS) loss, a patch-based adversarial loss, and a slight regularization of the latent space (either a KL penalty toward a standard normal distribution or a vector-quantization layer, giving the KL and VQ variants).
The degree of compression is controlled by the downsampling factor f. For an input image of size H x W x 3, the latent representation has size H/f x W/f x c, where c is the number of latent channels. The original paper experimented with downsampling factors of f = 1, 2, 4, 8, 16, and 32 to study the trade-off between compression and reconstruction quality.
| Downsampling Factor (f) | Latent Size (for 256x256 input) | Compression Ratio | Quality Impact |
|---|---|---|---|
| f = 1 | 256 x 256 x 3 | 1x (no compression) | No quality loss; same as pixel space |
| f = 2 | 128 x 128 x c | ~4x | Negligible quality loss |
| f = 4 | 64 x 64 x c | ~16-48x | Very minor quality loss; good detail |
| f = 8 | 32 x 32 x c | ~64-192x | Slight softening; strong efficiency |
| f = 16 | 16 x 16 x c | ~256x+ | Noticeable quality degradation |
| f = 32 | 8 x 8 x c | ~1024x+ | Significant quality loss; too aggressive |
The authors found that f = 4 and f = 8 provided the best balance between efficiency and quality. At f = 8, a 512 x 512 image is compressed to a 64 x 64 x 4 latent representation: 16,384 values instead of 786,432, a 48-fold reduction. Most practical implementations (including Stable Diffusion) use f = 8.
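The element-count reduction behind the table above follows directly from the shapes: a pixel image has 3 H W values, while the latent has c (H/f)(W/f) values, so the ratio is 3 f^2 / c. A quick check:

```python
# Element-count reduction from pixel space (H x W x 3) to latent space
# (H/f x W/f x c): ratio = 3 * f^2 / c.
def compression_ratio(f: int, c: int) -> float:
    return 3 * f * f / c

print(compression_ratio(8, 4))  # 48.0 (Stable Diffusion: f = 8, c = 4)
print(compression_ratio(8, 1))  # 192.0 (upper end of the f = 8 range)
print(compression_ratio(4, 3))  # 16.0 (lower end of the f = 4 range)
```

This is why the table's ranges depend on the number of latent channels c as well as on f.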
Critically, the autoencoder is trained once and then frozen. The diffusion model never modifies the autoencoder's weights. This separation means the autoencoder can be optimized independently, and different diffusion models can share the same autoencoder.
With the autoencoder providing the compressed latent space, the core generative model is a denoising diffusion probabilistic model that operates entirely on latent representations z rather than pixel images x.
The forward diffusion process adds Gaussian noise to the latent z over T timesteps according to a noise schedule, producing increasingly noisy versions z_1, z_2, ..., z_T. At t = T, the latent is approximately pure Gaussian noise. The model learns the reverse process: given a noisy latent z_t and timestep t, predict the noise that was added (or equivalently, predict the clean latent z_0).
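The forward process has a convenient closed form: z_t can be sampled directly from z_0 as z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps, where abar_t is the cumulative product of (1 - beta_t). A minimal sketch, using an illustrative linear beta schedule in the style of DDPM (the exact schedule values are an assumption, not taken from the LDM paper):

```python
import numpy as np

# Closed-form forward diffusion: z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps.
# Linear beta schedule as in DDPM; the specific values are illustrative.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(z0: np.ndarray, t: int, rng: np.random.Generator):
    """Sample z_t given a clean latent z_0; also return the noise eps,
    which is the training target for the denoiser."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return z_t, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 64, 64))  # a Stable-Diffusion-sized latent
z_T, _ = add_noise(z0, T - 1, rng)
coef = float(np.sqrt(alphas_bar[-1]))  # signal coefficient at t = T; near 0,
# so z_T is approximately pure Gaussian noise
```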
The denoising model is a time-conditional U-Net architecture. The U-Net takes the noisy latent z_t as input, along with the timestep t (encoded through sinusoidal embeddings), and outputs a prediction of the noise. The U-Net has an encoder-decoder structure with skip connections between corresponding resolution levels, incorporating convolutional residual blocks, self-attention layers at the lower-resolution levels, and cross-attention layers for injecting conditioning information.
Because the U-Net operates on 64x64 or 32x32 latent representations rather than 512x512 pixel images, each forward pass through the model is dramatically cheaper in both computation and memory. This is the source of the roughly 100x efficiency gain over pixel-space diffusion.
One of the most important contributions of the latent diffusion paper is its flexible conditioning mechanism, which enables the model to generate images guided by various types of input (text, semantic maps, images, layouts, and more).
Conditioning is implemented through cross-attention layers inserted into the U-Net backbone. Given a conditioning input y (such as a text prompt), a domain-specific encoder tau_theta maps y into an intermediate representation tau_theta(y). This representation is then injected into the U-Net through cross-attention at multiple resolution levels:
Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V
where Q = W_Q * phi(z_t) (queries from the U-Net's intermediate features), K = W_K * tau_theta(y) (keys from the conditioning), and V = W_V * tau_theta(y) (values from the conditioning).
This cross-attention mechanism allows the diffusion model to attend to different parts of the conditioning signal at different spatial locations and denoising stages. For text-to-image generation, the conditioning encoder is typically a pretrained text encoder such as CLIP's text transformer or a T5 encoder.
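The cross-attention computation above can be sketched directly. This is a minimal single-head version in NumPy with illustrative sizes (4096 spatial positions for a 64 x 64 latent, 77 conditioning tokens as with CLIP, and assumed feature dimensions); real implementations use multi-head attention with learned projections.

```python
import numpy as np

def softmax(a: np.ndarray, axis: int = -1) -> np.ndarray:
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, tau_y, W_Q, W_K, W_V):
    """Single-head cross-attention: U-Net features attend to conditioning.

    phi   : (n, d_model) flattened spatial features of the noisy latent
    tau_y : (m, d_cond)  conditioning tokens, e.g. text embeddings
    """
    Q = phi @ W_Q                               # (n, d) queries from the U-Net
    K = tau_y @ W_K                             # (m, d) keys from conditioning
    V = tau_y @ W_V                             # (m, d) values from conditioning
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))     # (n, m) attention map
    return weights @ V                          # (n, d) per-position mixture

rng = np.random.default_rng(0)
n, m, d_model, d_cond, d = 64 * 64, 77, 320, 768, 40  # assumed sizes
phi = rng.standard_normal((n, d_model))
tau_y = rng.standard_normal((m, d_cond))
W_Q = rng.standard_normal((d_model, d)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_cond, d)) / np.sqrt(d_cond)
W_V = rng.standard_normal((d_cond, d)) / np.sqrt(d_cond)
out = cross_attention(phi, tau_y, W_Q, W_K, W_V)  # (4096, 40)
```

Each spatial position of the latent gets its own mixture of the conditioning tokens, which is what lets different image regions attend to different words of a prompt.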
The cross-attention approach is general: the same architecture can be conditioned on text embeddings, class labels, semantic segmentation maps, image embeddings, or any combination of inputs. The conditioning encoder is the only component that changes; the rest of the architecture remains identical.
| Conditioning Type | Encoder | Application |
|---|---|---|
| Text | CLIP text encoder, T5 | Text-to-image generation |
| Semantic map | Spatial encoder (learned) | Semantic image synthesis |
| Image layout | Spatial encoder | Layout-to-image |
| Image (for inpainting) | Concatenation with latent | Image inpainting |
| Class label | Embedding lookup | Class-conditional generation |
| Depth map | Encoder network | Depth-conditioned generation |
Training a latent diffusion model proceeds in two stages.
The autoencoder is trained first on a large image dataset (e.g., ImageNet, OpenImages, or a custom dataset). The training objective combines reconstruction loss, perceptual loss, adversarial loss, and KL or VQ regularization. The autoencoder is trained until it can faithfully reconstruct images from their latent representations with minimal perceptible quality loss. Once trained, the autoencoder weights are frozen.
For KL-regularized autoencoders, the KL penalty weight is kept intentionally small (e.g., 10^-6), applying only a slight push toward a Gaussian latent distribution. This is much weaker than in standard VAEs, allowing the latent space to retain high-fidelity information while providing enough regularity for the diffusion model to work with.
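The tiny KL weight can be made concrete. Below, the closed-form KL divergence of a diagonal Gaussian against N(0, I) is added to a stand-in reconstruction term with weight 10^-6; the reconstruction term here is a plain MSE standing in for the paper's full combination of reconstruction, perceptual, and adversarial losses.

```python
import numpy as np

# KL term of a diagonal-Gaussian latent against N(0, I), weighted by a tiny
# factor as described above. The MSE is a stand-in for the full stage-1
# objective (reconstruction + LPIPS + adversarial losses).
KL_WEIGHT = 1e-6

def kl_to_standard_normal(mean: np.ndarray, logvar: np.ndarray) -> float:
    """KL( N(mean, exp(logvar)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * float(np.sum(np.exp(logvar) + mean**2 - 1.0 - logvar))

def stage1_loss(x, x_hat, mean, logvar) -> float:
    recon = float(np.mean((x - x_hat) ** 2))
    return recon + KL_WEIGHT * kl_to_standard_normal(mean, logvar)
```

Because the KL term is scaled by 10^-6, the reconstruction terms dominate and the latent is only gently pushed toward zero mean and unit variance.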
With the autoencoder frozen, the diffusion model is trained on latent representations. For each training image x, the encoder produces z = E(x). Noise is added to z according to the diffusion schedule, and the U-Net learns to predict the added noise:
L_LDM = E[||epsilon - epsilon_theta(z_t, t, tau_theta(y))||^2]
where epsilon is the actual noise added, epsilon_theta is the model's noise prediction, z_t is the noisy latent at timestep t, and tau_theta(y) is the conditioning signal.
This loss is the standard denoising score matching objective, identical to that used in pixel-space diffusion models, but applied in latent space.
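A single training step of this objective can be sketched as follows. The noise predictor here is a placeholder (a real epsilon_theta is the conditional U-Net), the beta schedule is illustrative, and the mean-squared error stands in for the expected squared norm in the loss above.

```python
import numpy as np

# One step of the LDM training objective: sample t, noise the latent, and
# score the denoiser's noise prediction. eps_theta is a placeholder network.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_theta(z_t, t, cond):
    """Placeholder denoiser; a real model is the conditional U-Net."""
    return np.zeros_like(z_t)

def ldm_loss(z0: np.ndarray, cond, rng: np.random.Generator) -> float:
    t = int(rng.integers(0, T))                       # uniform timestep
    eps = rng.standard_normal(z0.shape)               # noise to predict
    z_t = np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return float(np.mean((eps - eps_theta(z_t, t, cond)) ** 2))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 64, 64))  # z = E(x) from the frozen autoencoder
loss = ldm_loss(z0, cond=None, rng=rng)
```

With the zero-predicting placeholder, the loss is simply the mean of eps squared, close to 1; training drives a real network's prediction toward eps.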
The original paper demonstrated latent diffusion models across multiple image generation tasks.
On standard benchmarks for unconditional image generation (CelebA-HQ 256x256 and LSUN Bedrooms/Churches 256x256), LDMs achieved FID scores competitive with or superior to state-of-the-art pixel-space diffusion models and GANs, while requiring a fraction of the training compute.
| Model | Dataset | FID (lower is better) | Training Compute |
|---|---|---|---|
| ADM (pixel-space) [3] | LSUN Bedrooms 256 | 1.90 | ~1000 V100 GPU-days |
| LDM-4 (f=4) | LSUN Bedrooms 256 | 2.95 | ~8 A100 GPU-days |
| LDM-8 (f=8) | LSUN Bedrooms 256 | 4.02 | ~5 A100 GPU-days |
| StyleGAN2 | CelebA-HQ 256 | 3.87 | - |
| LDM-4 (f=4) | CelebA-HQ 256 | 5.11 | ~5 A100 GPU-days |
While the pixel-space ADM model achieved slightly better FID on some benchmarks, LDMs achieved comparable quality with roughly 100 to 200 times fewer GPU-days of training, making the trade-off overwhelmingly favorable for most practical purposes.
For text-conditional image generation, the authors trained a 1.45-billion-parameter LDM on the LAION dataset and evaluated it on MS-COCO. The model produced diverse, high-quality images conditioned on text prompts and achieved a competitive FID of 12.63 on COCO 256x256 zero-shot generation, using CLIP's text encoder for conditioning.
LDMs achieved a new state-of-the-art for image inpainting at the time of publication. By conditioning the model on both the masked image and a binary mask (concatenated with the noisy latent as additional input channels), the model learned to fill in missing regions coherently. The inpainting LDM outperformed prior methods including CoModGAN and LaMa on perceptual quality metrics.
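The inpainting conditioning described above is just channel-wise concatenation. A sketch with assumed sizes (4 latent channels plus a single-channel downsampled mask, giving a 9-channel U-Net input, as in Stable Diffusion's inpainting variant):

```python
import numpy as np

# Inpainting conditioning: the masked image's latent and a downsampled binary
# mask are concatenated with the noisy latent along the channel axis, so the
# U-Net's first convolution sees 4 + 4 + 1 = 9 input channels (sizes assumed).
rng = np.random.default_rng(0)
z_t = rng.standard_normal((4, 64, 64))        # noisy latent
z_masked = rng.standard_normal((4, 64, 64))   # latent of the masked image
mask = rng.integers(0, 2, (1, 64, 64)).astype(np.float32)  # 1 = region to fill

unet_input = np.concatenate([z_t, z_masked, mask], axis=0)
print(unet_input.shape)  # (9, 64, 64)
```

Because the conditioning is concatenated rather than cross-attended, the spatial alignment between mask and latent is preserved exactly.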
The paper also demonstrated latent diffusion for image super-resolution, conditioning the model on a low-resolution input image to produce a high-resolution output. The approach produced sharper, more detailed results than previous super-resolution methods.
Latent diffusion models are the direct technical foundation of Stable Diffusion, arguably the most impactful open-source AI model released to date.
Stable Diffusion was developed by the CompVis group (the same lab that produced the LDM paper), in collaboration with Runway ML and Stability AI, with compute resources provided by Stability AI. The v1.1 through v1.3 checkpoints were made available to researchers in mid-2022; the first widely used public release, v1.4, followed in August 2022, with v1.5 arriving in October of the same year [4].
Stable Diffusion uses the exact LDM architecture described in the paper, with specific choices: an f = 8 KL-regularized autoencoder with 4 latent channels, an approximately 860-million-parameter U-Net, the CLIP ViT-L/14 text encoder for conditioning, and training on subsets of the LAION dataset at 512 x 512 resolution.
The use of CLIP's text encoder for conditioning was a critical design choice. CLIP's text representations already encode rich semantic understanding of visual concepts (having been trained to align text with images), making them effective conditioning signals for image generation. This connection between CLIP and latent diffusion became one of the defining technical relationships of the modern AI image generation stack.
| Version | Release Date | Key Changes | Resolution | Parameters |
|---|---|---|---|---|
| SD 1.4 | August 2022 | Initial widely-used release; CLIP text encoder | 512 x 512 | ~860M (U-Net) |
| SD 1.5 | October 2022 | More training steps on LAION | 512 x 512 | ~860M |
| SD 2.0 | November 2022 | OpenCLIP ViT-H text encoder; depth conditioning | 512-768 | ~865M |
| SD 2.1 | December 2022 | Fine-tuning improvements | 512-768 | ~865M |
| SDXL | July 2023 | Dual text encoders; refiner model; 3.5B params | 1024 x 1024 | ~3.5B |
| SD 3.0 | June 2024 | Diffusion Transformer (DiT) replaces U-Net; triple text encoders (2x CLIP + T5); MMDiT architecture | 1024 x 1024 | 2B-8B |
| SD 3.5 | October 2024 | Improved quality; multiple size variants | 1024 x 1024 | 2.5B-8B |
Starting with SD 3.0, the architecture shifted from the original U-Net backbone to a Diffusion Transformer (DiT), though the fundamental principle of operating in a compressed latent space (the core LDM contribution) was retained.
DALL-E 2, released by OpenAI in April 2022, took a different architectural approach to text-conditional image generation, though it shared some conceptual similarities with latent diffusion [5].
DALL-E 2 used a two-stage pipeline: a prior that maps a CLIP text embedding to a corresponding CLIP image embedding, followed by a diffusion decoder that generates the image conditioned on that image embedding.
Critically, DALL-E 2's decoder operated in pixel space, not in a compressed latent space. It used a cascaded approach: first generating a 64x64 image, then upsampling to 256x256, and finally to 1024x1024 using separate diffusion models at each resolution. This made DALL-E 2 substantially more expensive to run than latent diffusion approaches.
| Aspect | Latent Diffusion (Stable Diffusion) | DALL-E 2 |
|---|---|---|
| Diffusion space | Compressed latent (64x64x4) | Pixel space (64x64 to 1024x1024 cascaded) |
| Text conditioning | CLIP text encoder via cross-attention | CLIP text-to-image embedding prior |
| Computational cost | Low (runs on consumer GPUs) | High (required large clusters) |
| Open source | Yes (weights, code, training data) | No (API access only) |
| Image quality | High, with community-driven improvements | High, with strong coherence |
| Flexibility | Highly customizable; fine-tuning, LoRA, ControlNet | Limited to API parameters |
The efficiency advantage of latent diffusion was decisive in practical adoption. While DALL-E 2 required significant server infrastructure and was available only through OpenAI's API, Stable Diffusion could run on a consumer GPU with 8 GB of VRAM. This accessibility difference drove the massive adoption of Stable Diffusion and the open-source image generation ecosystem built around it.
The release of Stable Diffusion in August 2022, built on the latent diffusion architecture, triggered an unprecedented explosion of open-source AI image generation tools and techniques.
Before Stable Diffusion, state-of-the-art image generation was accessible primarily through commercial APIs (DALL-E 2, Midjourney). The open-source release of model weights, training code, and the underlying LAION training dataset meant that anyone with a modern GPU could generate high-quality images locally, without API costs or content restrictions. This democratization had immediate effects across art, design, entertainment, education, and research.
The open-source nature of Stable Diffusion enabled a rapid wave of community-driven innovations that built on the latent diffusion framework: lightweight fine-tuning methods such as DreamBooth, textual inversion, and LoRA adapters; spatial-control extensions such as ControlNet; improved samplers and schedulers; and thousands of community fine-tuned checkpoints for specific styles and domains.
These innovations were possible specifically because the latent diffusion architecture cleanly separates the compression stage (autoencoder) from the generative model (U-Net), and because the cross-attention conditioning mechanism is modular and extensible.
Tools like ComfyUI, Automatic1111's WebUI, InvokeAI, and Fooocus emerged as user-friendly interfaces for Stable Diffusion, enabling millions of users without programming experience to generate images. Model sharing platforms like Civitai and Hugging Face hosted thousands of community fine-tuned models. This ecosystem grew organically because the underlying latent diffusion architecture was open, well-documented, and efficient enough to run locally.
The latent diffusion approach offers several fundamental advantages over pixel-space alternatives.
By compressing images by a factor of 48x to 192x (depending on f and the number of latent channels) before applying diffusion, LDMs reduce the computational cost of both training and inference by roughly two orders of magnitude. Training a competitive latent diffusion model requires approximately 5 to 10 A100 GPU-days, compared to hundreds or thousands of GPU-days for comparable pixel-space models.
The reduced latent size translates directly to lower GPU memory requirements. A 512x512 image in pixel space requires 786,432 floating-point values; the same image compressed to a 64x64x4 latent requires only 16,384 values. This 48x reduction in tensor size means the U-Net and its intermediate activations consume proportionally less memory, enabling generation on consumer hardware.
Because the autoencoder and diffusion model are trained separately, each can be optimized independently. The autoencoder can be trained once on a large dataset of images (without any conditioning) and reused across multiple diffusion models with different conditioning modalities. This modularity enables rapid experimentation with new conditioning approaches without retraining the compression model.
The cross-attention conditioning mechanism is agnostic to the type of conditioning input. Any signal that can be encoded into a sequence of vectors can serve as conditioning: text, images, audio features, spatial maps, or combinations thereof. This flexibility has made latent diffusion a versatile backbone for diverse generation tasks beyond text-to-image.
Despite their advantages, latent diffusion models have notable limitations.
The autoencoder introduces a compression bottleneck that inevitably loses some fine-grained detail. While the loss is perceptually minor at f = 4 or f = 8, it can manifest as subtle blurring, loss of fine textures, or artifacts in high-frequency image regions. This is particularly noticeable for images with very fine text, detailed patterns, or sharp edges. Improving autoencoder quality has been an active area of research.
Latent diffusion models (particularly earlier versions) struggled to generate legible text within images. Because the autoencoder compresses character-level details, and because the CLIP text encoder does not explicitly model character shapes, rendered text often appeared garbled or misspelled. Later models (SD 3.0 and FLUX) partially addressed this through improved architectures and text encoders.
Like most diffusion models, LDMs can struggle with compositional prompts involving multiple objects with specific attributes and spatial relationships (e.g., "a red ball on a blue table next to a green lamp"). The cross-attention mechanism does not enforce explicit binding between attributes and objects, leading to attribute leakage (wrong colors or properties assigned to wrong objects).
Despite operating in a compressed space, LDMs still require multiple denoising steps (typically 20 to 50) at inference time, with each step requiring a full forward pass through the U-Net. Various distillation and consistency techniques have been developed to reduce the required number of steps, with some approaches achieving acceptable quality in as few as 1 to 4 steps.
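The per-step cost is visible in a sampling loop: one denoiser call per step, so halving the step count halves inference time. Below is a minimal deterministic (DDIM-style) loop over a strided subsequence of timesteps, with a placeholder noise predictor and an illustrative schedule; it is a sketch of the update rule, not a faithful reproduction of any particular sampler implementation.

```python
import numpy as np

# Deterministic DDIM-style sampling over num_steps << T timesteps.
# eps_theta is a placeholder; a real model is the trained U-Net.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

def eps_theta(z_t: np.ndarray, t: int) -> np.ndarray:
    return np.zeros_like(z_t)  # placeholder noise prediction

def sample(num_steps: int, shape=(4, 64, 64), seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)                      # z_T ~ N(0, I)
    ts = np.linspace(T - 1, 0, num_steps).astype(int)   # strided schedule
    for i, t in enumerate(ts):
        eps = eps_theta(z, t)                           # one U-Net call per step
        z0_pred = (z - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])
        if i + 1 < len(ts):                             # step to the next t
            t_prev = ts[i + 1]
            z = np.sqrt(abar[t_prev]) * z0_pred + np.sqrt(1 - abar[t_prev]) * eps
        else:
            z = z0_pred                                 # final clean estimate
    return z  # the decoder D(z) would then map this back to pixel space

z = sample(num_steps=20)
```

Distillation and consistency methods shrink the loop above to a handful of iterations (or even one) by training a student model to match many teacher steps at once.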
As of early 2026, the latent diffusion paradigm remains the dominant approach in image generation, though the specific architectures have evolved considerably.
The U-Net backbone used in the original LDM has been increasingly replaced by Diffusion Transformers (DiT), introduced by Peebles and Xie in 2023 [9]. DiT replaces the convolutional U-Net with a pure transformer operating on latent patches, similar to how Vision Transformers replaced CNNs for image classification. The latent compression stage (autoencoder) remains identical; only the denoising backbone changes. Stable Diffusion 3.0 and 3.5 use a Multimodal Diffusion Transformer (MMDiT), and FLUX.1 (from Black Forest Labs, co-founded by Robin Rombach himself) also uses a transformer-based architecture in latent space.
In early 2024, Robin Rombach (lead author of the LDM paper) and other key researchers departed Stability AI to found Black Forest Labs. In August 2024, they released FLUX.1, which builds on the latent diffusion paradigm with a transformer backbone and improved text rendering capabilities [10]. FLUX immediately topped community benchmarks and demonstrated that the latent diffusion approach continues to evolve and improve under the guidance of its original creators.
The latent diffusion framework has been extended to video generation (adding temporal dimensions to the latent space) and 3D content generation. Models like Stable Video Diffusion, Sora (OpenAI), and Gen-3 (Runway) apply diffusion in spatiotemporal latent spaces to generate video sequences. The same principle of compressing the signal to a lower-dimensional space before applying diffusion has proven effective across modalities.
Recent work has focused on improving the autoencoder component. Better autoencoders with higher reconstruction quality allow the diffusion model to generate finer details. Some approaches have explored continuous-token autoencoders (as opposed to the fixed-grid latent used in original LDMs), and others have investigated end-to-end training of the autoencoder and diffusion model jointly.
Consistency models, progressive distillation, and adversarial distillation techniques have dramatically reduced the number of required denoising steps. Some modern models can produce high-quality images in 1 to 4 steps (compared to the original 50 to 200), making real-time interactive image generation feasible on consumer hardware.