A latent diffusion model (LDM) is a type of diffusion model that performs the denoising diffusion process in a compressed latent space rather than directly in pixel space. The approach was introduced by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer in December 2021; their paper, "High-Resolution Image Synthesis with Latent Diffusion Models", was published at CVPR 2022 and has become one of the most influential works in generative AI [1]. The key insight is that by first compressing images into a lower-dimensional latent representation using a pretrained autoencoder, the computationally expensive diffusion process can operate on a much smaller representation, reducing training and inference costs by roughly two orders of magnitude compared to pixel-space diffusion while maintaining comparable or superior image quality.
Latent diffusion models are the direct architectural foundation of Stable Diffusion, one of the most widely used open-source image generation systems in the world. The approach also influenced the broader field of AI image generation, including systems like DALL-E 2, Imagen, and their successors. The separation of perceptual compression from the generative model proved to be a pivotal design choice, enabling high-resolution image synthesis on consumer-grade hardware and catalyzing the explosion of open-source image generation tools from 2022 onward.
Diffusion models, first introduced by Sohl-Dickstein et al. (2015) and refined by Ho et al. in their landmark 2020 paper on Denoising Diffusion Probabilistic Models (DDPMs) [2], generate images by learning to reverse a gradual noising process. Starting from pure Gaussian noise, the model iteratively removes noise over many steps until a clean image emerges. These models produced remarkably high-quality images, rivaling and eventually surpassing generative adversarial networks (GANs) in terms of both sample quality and diversity.
However, pixel-space diffusion models had a critical practical limitation: they operated directly on the full-resolution image representation. For a 512x512 RGB image, the diffusion model had to work with a tensor of shape 3 x 512 x 512 (786,432 values) at every denoising step. This made training extremely expensive. State-of-the-art pixel-space diffusion models like ADM (Dhariwal and Nichol, 2021) required hundreds of GPU-days for training and consumed substantial memory during inference [3]. Generating a single 256x256 image could take minutes on a high-end GPU. Scaling to higher resolutions (512x512 or 1024x1024) was prohibitively expensive for most researchers and practitioners.
Rombach et al. observed that much of this computation was wasted on modeling imperceptible high-frequency details in pixel space. They argued that image generation could be decomposed into two stages: perceptual compression (learning a compact, semantically meaningful representation of images) and generative modeling (learning the distribution of these representations). The first stage could be handled by a pretrained autoencoder, and the diffusion model would then only need to operate in the resulting compressed latent space.
Before latent diffusion, several approaches had attempted to make diffusion models more efficient. Cascaded diffusion models (like those used in DALL-E 2 and Imagen) generated low-resolution images first and then used separate super-resolution diffusion models to upscale them. While effective, this approach still required substantial computation and introduced complexity. Other approaches used progressive generation or truncated diffusion schedules, but none achieved the dramatic efficiency gains of moving the entire diffusion process to a learned latent space.
The idea of combining autoencoders with generative models was not new. Variational autoencoders (VAEs) had long combined encoding and generation, but their outputs tended to be blurry. VQ-VAE and VQ-VAE-2 used vector-quantized autoencoders with autoregressive priors in the latent space. Latent diffusion models built on these ideas by pairing a high-quality continuous autoencoder with the powerful denoising diffusion framework.
The latent diffusion model architecture consists of three main components that work together in a pipeline: a perceptual compression model (the autoencoder), a diffusion model operating in latent space, and a conditioning mechanism.
The first component is a pretrained autoencoder consisting of an encoder E and a decoder D. The encoder maps a high-resolution input image x (in RGB pixel space) to a lower-dimensional latent representation z = E(x). The decoder reconstructs the image from this latent: x_hat = D(z) = D(E(x)).
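The shape contract between encoder and decoder can be illustrated with a toy stand-in. The real autoencoder is a convolutional VAE that also changes the channel count (3 pixel channels to c latent channels); the sketch below uses average pooling and nearest-neighbor upsampling purely to mimic the spatial downsampling by f = 8, and is not the paper's architecture.

```python
import numpy as np

# Toy stand-in for the pretrained autoencoder pair (E, D). A real LDM uses a
# convolutional VAE that maps 3 pixel channels to c latent channels; here we
# keep 3 channels and only mimic the spatial contract z = E(x), x_hat = D(z).
F = 8  # downsampling factor

def E(x: np.ndarray) -> np.ndarray:
    """Downsample H x W to (H/f) x (W/f) by average pooling each channel."""
    h, w, c = x.shape
    return x.reshape(h // F, F, w // F, F, c).mean(axis=(1, 3))

def D(z: np.ndarray) -> np.ndarray:
    """Upsample back to pixel resolution by nearest-neighbor repetition."""
    return z.repeat(F, axis=0).repeat(F, axis=1)

x = np.ones((512, 512, 3), dtype=np.float32)
z = E(x)
x_hat = D(z)
print(z.shape, x_hat.shape)  # (64, 64, 3) (512, 512, 3)
```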
The autoencoder is trained with a combination of objectives: a pixel-space reconstruction loss, a perceptual (LPIPS) loss, a patch-based adversarial loss, and a slight regularization of the latent space (either a KL penalty toward a standard normal distribution or a vector-quantization layer, giving the KL and VQ variants).
The degree of compression is controlled by the downsampling factor f. For an input image of size H x W x 3, the latent representation has size H/f x W/f x c, where c is the number of latent channels. The original paper experimented with downsampling factors of f = 1, 2, 4, 8, 16, and 32 to study the trade-off between compression and reconstruction quality.
| Downsampling Factor (f) | Latent Size (for 256x256 input) | Compression Ratio | Quality Impact |
|---|---|---|---|
| f = 1 | 256 x 256 x 3 | 1x (no compression) | No quality loss; same as pixel space |
| f = 2 | 128 x 128 x c | ~4x | Negligible quality loss |
| f = 4 | 64 x 64 x c | ~16-48x | Very minor quality loss; good detail |
| f = 8 | 32 x 32 x c | ~64-192x | Slight softening; strong efficiency |
| f = 16 | 16 x 16 x c | ~256x+ | Noticeable quality degradation |
| f = 32 | 8 x 8 x c | ~1024x+ | Significant quality loss; too aggressive |
The authors found that f = 4 and f = 8 provided the best balance between efficiency and quality. At f = 8, a 512 x 512 image is compressed to a 64 x 64 x 4 latent representation: 16,384 values instead of 786,432, a 48-fold reduction. Most practical implementations (including Stable Diffusion) use f = 8.
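The element-count reduction behind the table above follows directly from the shapes: a pixel image has 3 H W values, while the latent has c (H/f)(W/f) values, so the ratio is 3 f^2 / c. A quick check:

```python
# Element-count reduction from pixel space (H x W x 3) to latent space
# (H/f x W/f x c): ratio = 3 * f^2 / c.
def compression_ratio(f: int, c: int) -> float:
    return 3 * f * f / c

print(compression_ratio(8, 4))  # 48.0 (Stable Diffusion: f = 8, c = 4)
print(compression_ratio(8, 1))  # 192.0 (upper end of the f = 8 range)
print(compression_ratio(4, 3))  # 16.0 (lower end of the f = 4 range)
```

This is why the table's ranges depend on the number of latent channels c as well as on f.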
Critically, the autoencoder is trained once and then frozen. The diffusion model never modifies the autoencoder's weights. This separation means the autoencoder can be optimized independently, and different diffusion models can share the same autoencoder.
With the autoencoder providing the compressed latent space, the core generative model is a denoising diffusion probabilistic model that operates entirely on latent representations z rather than pixel images x.
The forward diffusion process adds Gaussian noise to the latent z over T timesteps according to a noise schedule, producing increasingly noisy versions z_1, z_2, ..., z_T. At t = T, the latent is approximately pure Gaussian noise. The model learns the reverse process: given a noisy latent z_t and timestep t, predict the noise that was added (or equivalently, predict the clean latent z_0).
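The forward process has a convenient closed form: z_t can be sampled directly from z_0 as z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps, where abar_t is the cumulative product of (1 - beta_t). A minimal sketch, using an illustrative linear beta schedule in the style of DDPM (the exact schedule values are an assumption, not taken from the LDM paper):

```python
import numpy as np

# Closed-form forward diffusion: z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps.
# Linear beta schedule as in DDPM; the specific values are illustrative.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(z0: np.ndarray, t: int, rng: np.random.Generator):
    """Sample z_t given a clean latent z_0; also return the noise eps,
    which is the training target for the denoiser."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return z_t, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 64, 64))  # a Stable-Diffusion-sized latent
z_T, _ = add_noise(z0, T - 1, rng)
coef = float(np.sqrt(alphas_bar[-1]))  # signal coefficient at t = T; near 0,
# so z_T is approximately pure Gaussian noise
```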
The denoising model is a time-conditional U-Net architecture. The U-Net takes the noisy latent z_t as input, along with the timestep t (encoded through sinusoidal embeddings), and outputs a prediction of the noise. The U-Net has an encoder-decoder structure with skip connections between corresponding resolution levels, incorporating convolutional residual blocks, self-attention layers at the lower-resolution levels, and cross-attention layers for injecting conditioning information.
Because the U-Net operates on 64x64 or 32x32 latent representations rather than 512x512 pixel images, each forward pass through the model is dramatically cheaper in both computation and memory. This is the source of the roughly 100x efficiency gain over pixel-space diffusion.
One of the most important contributions of the latent diffusion paper is its flexible conditioning mechanism, which enables the model to generate images guided by various types of input (text, semantic maps, images, layouts, and more).
Conditioning is implemented through cross-attention layers inserted into the U-Net backbone. Given a conditioning input y (such as a text prompt), a domain-specific encoder tau_theta maps y into an intermediate representation tau_theta(y). This representation is then injected into the U-Net through cross-attention at multiple resolution levels:
Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V
where Q = W_Q * phi(z_t) (queries from the U-Net's intermediate features), K = W_K * tau_theta(y) (keys from the conditioning), and V = W_V * tau_theta(y) (values from the conditioning).
This cross-attention mechanism allows the diffusion model to attend to different parts of the conditioning signal at different spatial locations and denoising stages. For text-to-image generation, the conditioning encoder is typically a pretrained text encoder such as CLIP's text transformer or a T5 encoder.
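The cross-attention computation above can be sketched directly. This is a minimal single-head version in NumPy with illustrative sizes (4096 spatial positions for a 64 x 64 latent, 77 conditioning tokens as with CLIP, and assumed feature dimensions); real implementations use multi-head attention with learned projections.

```python
import numpy as np

def softmax(a: np.ndarray, axis: int = -1) -> np.ndarray:
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, tau_y, W_Q, W_K, W_V):
    """Single-head cross-attention: U-Net features attend to conditioning.

    phi   : (n, d_model) flattened spatial features of the noisy latent
    tau_y : (m, d_cond)  conditioning tokens, e.g. text embeddings
    """
    Q = phi @ W_Q                               # (n, d) queries from the U-Net
    K = tau_y @ W_K                             # (m, d) keys from conditioning
    V = tau_y @ W_V                             # (m, d) values from conditioning
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))     # (n, m) attention map
    return weights @ V                          # (n, d) per-position mixture

rng = np.random.default_rng(0)
n, m, d_model, d_cond, d = 64 * 64, 77, 320, 768, 40  # assumed sizes
phi = rng.standard_normal((n, d_model))
tau_y = rng.standard_normal((m, d_cond))
W_Q = rng.standard_normal((d_model, d)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_cond, d)) / np.sqrt(d_cond)
W_V = rng.standard_normal((d_cond, d)) / np.sqrt(d_cond)
out = cross_attention(phi, tau_y, W_Q, W_K, W_V)  # (4096, 40)
```

Each spatial position of the latent gets its own mixture of the conditioning tokens, which is what lets different image regions attend to different words of a prompt.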
The cross-attention approach is general: the same architecture can be conditioned on text embeddings, class labels, semantic segmentation maps, image embeddings, or any combination of inputs. The conditioning encoder is the only component that changes; the rest of the architecture remains identical.
| Conditioning Type | Encoder | Application |
|---|---|---|
| Text | CLIP text encoder, T5 | Text-to-image generation |
| Semantic map | Spatial encoder (learned) | Semantic image synthesis |
| Image layout | Spatial encoder | Layout-to-image |
| Image (for inpainting) | Concatenation with latent | Image inpainting |
| Class label | Embedding lookup | Class-conditional generation |
| Depth map | Encoder network | Depth-conditioned generation |
Training a latent diffusion model proceeds in two stages.
The autoencoder is trained first on a large image dataset (e.g., ImageNet, OpenImages, or a custom dataset). The training objective combines reconstruction loss, perceptual loss, adversarial loss, and KL or VQ regularization. The autoencoder is trained until it can faithfully reconstruct images from their latent representations with minimal perceptible quality loss. Once trained, the autoencoder weights are frozen.
For KL-regularized autoencoders, the KL penalty weight is kept intentionally small (e.g., 10^-6), applying only a slight push toward a Gaussian latent distribution. This is much weaker than in standard VAEs, allowing the latent space to retain high-fidelity information while providing enough regularity for the diffusion model to work with.
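The tiny KL weight can be made concrete. Below, the closed-form KL divergence of a diagonal Gaussian against N(0, I) is added to a stand-in reconstruction term with weight 10^-6; the reconstruction term here is a plain MSE standing in for the paper's full combination of reconstruction, perceptual, and adversarial losses.

```python
import numpy as np

# KL term of a diagonal-Gaussian latent against N(0, I), weighted by a tiny
# factor as described above. The MSE is a stand-in for the full stage-1
# objective (reconstruction + LPIPS + adversarial losses).
KL_WEIGHT = 1e-6

def kl_to_standard_normal(mean: np.ndarray, logvar: np.ndarray) -> float:
    """KL( N(mean, exp(logvar)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * float(np.sum(np.exp(logvar) + mean**2 - 1.0 - logvar))

def stage1_loss(x, x_hat, mean, logvar) -> float:
    recon = float(np.mean((x - x_hat) ** 2))
    return recon + KL_WEIGHT * kl_to_standard_normal(mean, logvar)
```

Because the KL term is scaled by 10^-6, the reconstruction terms dominate and the latent is only gently pushed toward zero mean and unit variance.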
With the autoencoder frozen, the diffusion model is trained on latent representations. For each training image x, the encoder produces z = E(x). Noise is added to z according to the diffusion schedule, and the U-Net learns to predict the added noise:
L_LDM = E[||epsilon - epsilon_theta(z_t, t, tau_theta(y))||^2]
where epsilon is the actual noise added, epsilon_theta is the model's noise prediction, z_t is the noisy latent at timestep t, and tau_theta(y) is the conditioning signal.
This loss is the standard denoising score matching objective, identical to that used in pixel-space diffusion models, but applied in latent space.
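A single training step of this objective can be sketched as follows. The noise predictor here is a placeholder (a real epsilon_theta is the conditional U-Net), the beta schedule is illustrative, and the mean-squared error stands in for the expected squared norm in the loss above.

```python
import numpy as np

# One step of the LDM training objective: sample t, noise the latent, and
# score the denoiser's noise prediction. eps_theta is a placeholder network.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_theta(z_t, t, cond):
    """Placeholder denoiser; a real model is the conditional U-Net."""
    return np.zeros_like(z_t)

def ldm_loss(z0: np.ndarray, cond, rng: np.random.Generator) -> float:
    t = int(rng.integers(0, T))                       # uniform timestep
    eps = rng.standard_normal(z0.shape)               # noise to predict
    z_t = np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return float(np.mean((eps - eps_theta(z_t, t, cond)) ** 2))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 64, 64))  # z = E(x) from the frozen autoencoder
loss = ldm_loss(z0, cond=None, rng=rng)
```

With the zero-predicting placeholder, the loss is simply the mean of eps squared, close to 1; training drives a real network's prediction toward eps.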
The original paper demonstrated latent diffusion models across multiple image generation tasks.
On standard benchmarks for unconditional image generation (CelebA-HQ 256x256 and LSUN Bedrooms/Churches 256x256), LDMs achieved FID scores competitive with or superior to state-of-the-art pixel-space diffusion models and GANs, while requiring a fraction of the training compute.
| Model | Dataset | FID (lower is better) | Training Compute |
|---|---|---|---|
| ADM (pixel-space) [3] | LSUN Bedrooms 256 | 1.90 | ~1000 V100 GPU-days |
| LDM-4 (f=4) | LSUN Bedrooms 256 | 2.95 | ~8 A100 GPU-days |
| LDM-8 (f=8) | LSUN Bedrooms 256 | 4.02 | ~5 A100 GPU-days |
| StyleGAN2 | CelebA-HQ 256 | 3.87 | - |
| LDM-4 (f=4) | CelebA-HQ 256 | 5.11 | ~5 A100 GPU-days |
While the pixel-space ADM model achieved slightly better FID on some benchmarks, LDMs achieved comparable quality with roughly 100 to 200 times fewer GPU-days of training, making the trade-off overwhelmingly favorable for most practical purposes.
For text-conditional image generation, the authors trained a 1.45-billion-parameter LDM on the LAION dataset and evaluated it on MS-COCO. The model produced diverse, high-quality images conditioned on text prompts and achieved a competitive FID of 12.63 on COCO 256x256 zero-shot generation, using CLIP's text encoder for conditioning.
LDMs achieved a new state-of-the-art for image inpainting at the time of publication. By conditioning the model on both the masked image and a binary mask (concatenated with the noisy latent as additional input channels), the model learned to fill in missing regions coherently. The inpainting LDM outperformed prior methods including CoModGAN and LaMa on perceptual quality metrics.
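The inpainting conditioning described above is just channel-wise concatenation. A sketch with assumed sizes (4 latent channels plus a single-channel downsampled mask, giving a 9-channel U-Net input, as in Stable Diffusion's inpainting variant):

```python
import numpy as np

# Inpainting conditioning: the masked image's latent and a downsampled binary
# mask are concatenated with the noisy latent along the channel axis, so the
# U-Net's first convolution sees 4 + 4 + 1 = 9 input channels (sizes assumed).
rng = np.random.default_rng(0)
z_t = rng.standard_normal((4, 64, 64))        # noisy latent
z_masked = rng.standard_normal((4, 64, 64))   # latent of the masked image
mask = rng.integers(0, 2, (1, 64, 64)).astype(np.float32)  # 1 = region to fill

unet_input = np.concatenate([z_t, z_masked, mask], axis=0)
print(unet_input.shape)  # (9, 64, 64)
```

Because the conditioning is concatenated rather than cross-attended, the spatial alignment between mask and latent is preserved exactly.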
The paper also demonstrated latent diffusion for image super-resolution, conditioning the model on a low-resolution input image to produce a high-resolution output. The approach produced sharper, more detailed results than previous super-resolution methods.
Latent diffusion models are the direct technical foundation of Stable Diffusion, arguably the most impactful open-source AI model released to date.
Stable Diffusion was developed by the CompVis group (the same lab that produced the LDM paper), in collaboration with Runway ML and Stability AI, with compute resources provided by Stability AI. The v1.1 through v1.3 checkpoints were made available to researchers in mid-2022; the first widely used public release, v1.4, followed in August 2022, with v1.5 arriving in October of the same year [4].
Stable Diffusion uses the exact LDM architecture described in the paper, with specific choices: an f = 8 KL-regularized autoencoder with 4 latent channels, an approximately 860-million-parameter U-Net, the CLIP ViT-L/14 text encoder for conditioning, and training on subsets of the LAION dataset at 512 x 512 resolution.
The use of CLIP's text encoder for conditioning was a critical design choice. CLIP's text representations already encode rich semantic understanding of visual concepts (having been trained to align text with images), making them effective conditioning signals for image generation. This connection between CLIP and latent diffusion became one of the defining technical relationships of the modern AI image generation stack.
| Version | Release Date | Key Changes | Resolution | Parameters |
|---|---|---|---|---|
| SD 1.4 | August 2022 | Initial widely-used release; CLIP text encoder | 512 x 512 | ~860M (U-Net) |
| SD 1.5 | October 2022 | More training steps on LAION | 512 x 512 | ~860M |
| SD 2.0 | November 2022 | OpenCLIP ViT-H text encoder; depth conditioning | 512-768 | ~865M |
| SD 2.1 | December 2022 | Fine-tuning improvements | 512-768 | ~865M |
| SDXL | July 2023 | Dual text encoders; refiner model; 3.5B params | 1024 x 1024 | ~3.5B |
| SD 3.0 | June 2024 | Diffusion Transformer (DiT) replaces U-Net; triple text encoders (2x CLIP + T5); MMDiT architecture | 1024 x 1024 | 2B-8B |
| SD 3.5 | October 2024 | Improved quality; multiple size variants | 1024 x 1024 | 2.5B-8B |
Starting with SD 3.0, the architecture shifted from the original U-Net backbone to a Diffusion Transformer (DiT), though the fundamental principle of operating in a compressed latent space (the core LDM contribution) was retained.
DALL-E 2, released by OpenAI in April 2022, took a different architectural approach to text-conditional image generation, though it shared some conceptual similarities with latent diffusion [5].
DALL-E 2 used a two-stage pipeline: a prior that maps a CLIP text embedding to a corresponding CLIP image embedding, followed by a diffusion decoder that generates the image conditioned on that image embedding.
Critically, DALL-E 2's decoder operated in pixel space, not in a compressed latent space. It used a cascaded approach: first generating a 64x64 image, then upsampling to 256x256, and finally to 1024x1024 using separate diffusion models at each resolution. This made DALL-E 2 substantially more expensive to run than latent diffusion approaches.
| Aspect | Latent Diffusion (Stable Diffusion) | DALL-E 2 |
|---|---|---|
| Diffusion space | Compressed latent (64x64x4) | Pixel space (64x64 to 1024x1024 cascaded) |
| Text conditioning | CLIP text encoder via cross-attention | CLIP text-to-image embedding prior |
| Computational cost | Low (runs on consumer GPUs) | High (required large clusters) |
| Open source | Yes (weights, code, training data) | No (API access only) |
| Image quality | High, with community-driven improvements | High, with strong coherence |
| Flexibility | Highly customizable; fine-tuning, LoRA, ControlNet | Limited to API parameters |
The efficiency advantage of latent diffusion was decisive in practical adoption. While DALL-E 2 required significant server infrastructure and was available only through OpenAI's API, Stable Diffusion could run on a consumer GPU with 8 GB of VRAM. This accessibility difference drove the massive adoption of Stable Diffusion and the open-source image generation ecosystem built around it.
The release of Stable Diffusion in August 2022, built on the latent diffusion architecture, triggered an unprecedented explosion of open-source AI image generation tools and techniques.
Before Stable Diffusion, state-of-the-art image generation was accessible primarily through commercial APIs (DALL-E 2, Midjourney). The open-source release of model weights, training code, and the underlying LAION training dataset meant that anyone with a modern GPU could generate high-quality images locally, without API costs or content restrictions. This democratization had immediate effects across art, design, entertainment, education, and research.
The open-source nature of Stable Diffusion enabled a rapid wave of community-driven innovations that built on the latent diffusion framework: lightweight fine-tuning methods such as DreamBooth, textual inversion, and LoRA adapters; spatial-control extensions such as ControlNet; improved samplers and schedulers; and thousands of community fine-tuned checkpoints for specific styles and domains.
These innovations were possible specifically because the latent diffusion architecture cleanly separates the compression stage (autoencoder) from the generative model (U-Net), and because the cross-attention conditioning mechanism is modular and extensible.
Tools like ComfyUI, Automatic1111's WebUI, InvokeAI, and Fooocus emerged as user-friendly interfaces for Stable Diffusion, enabling millions of users without programming experience to generate images. Model sharing platforms like Civitai and Hugging Face hosted thousands of community fine-tuned models. This ecosystem grew organically because the underlying latent diffusion architecture was open, well-documented, and efficient enough to run locally.
The latent diffusion approach offers several fundamental advantages over pixel-space alternatives.
By compressing images by a factor of 48x to 192x (depending on f and the number of latent channels) before applying diffusion, LDMs reduce the computational cost of both training and inference by roughly two orders of magnitude. Training a competitive latent diffusion model requires approximately 5 to 10 A100 GPU-days, compared to hundreds or thousands of GPU-days for comparable pixel-space models.
The reduced latent size translates directly to lower GPU memory requirements. A 512x512 image in pixel space requires 786,432 floating-point values; the same image compressed to a 64x64x4 latent requires only 16,384 values. This 48x reduction in tensor size means the U-Net and its intermediate activations consume proportionally less memory, enabling generation on consumer hardware.
Because the autoencoder and diffusion model are trained separately, each can be optimized independently. The autoencoder can be trained once on a large dataset of images (without any conditioning) and reused across multiple diffusion models with different conditioning modalities. This modularity enables rapid experimentation with new conditioning approaches without retraining the compression model.
The cross-attention conditioning mechanism is agnostic to the type of conditioning input. Any signal that can be encoded into a sequence of vectors can serve as conditioning: text, images, audio features, spatial maps, or combinations thereof. This flexibility has made latent diffusion a versatile backbone for diverse generation tasks beyond text-to-image.
Despite their advantages, latent diffusion models have notable limitations.
The autoencoder introduces a compression bottleneck that inevitably loses some fine-grained detail. While the loss is perceptually minor at f = 4 or f = 8, it can manifest as subtle blurring, loss of fine textures, or artifacts in high-frequency image regions. This is particularly noticeable for images with very fine text, detailed patterns, or sharp edges. Improving autoencoder quality has been an active area of research.
Latent diffusion models (particularly earlier versions) struggled to generate legible text within images. Because the autoencoder compresses character-level details, and because the CLIP text encoder does not explicitly model character shapes, rendered text often appeared garbled or misspelled. Later models (SD 3.0 and FLUX) partially addressed this through improved architectures and text encoders.
Like most diffusion models, LDMs can struggle with compositional prompts involving multiple objects with specific attributes and spatial relationships (e.g., "a red ball on a blue table next to a green lamp"). The cross-attention mechanism does not enforce explicit binding between attributes and objects, leading to attribute leakage (wrong colors or properties assigned to wrong objects).
Despite operating in a compressed space, LDMs still require multiple denoising steps (typically 20 to 50) at inference time, with each step requiring a full forward pass through the U-Net. Various distillation and consistency techniques have been developed to reduce the required number of steps, with some approaches achieving acceptable quality in as few as 1 to 4 steps.
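The per-step cost is visible in a sampling loop: one denoiser call per step, so halving the step count halves inference time. Below is a minimal deterministic (DDIM-style) loop over a strided subsequence of timesteps, with a placeholder noise predictor and an illustrative schedule; it is a sketch of the update rule, not a faithful reproduction of any particular sampler implementation.

```python
import numpy as np

# Deterministic DDIM-style sampling over num_steps << T timesteps.
# eps_theta is a placeholder; a real model is the trained U-Net.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)

def eps_theta(z_t: np.ndarray, t: int) -> np.ndarray:
    return np.zeros_like(z_t)  # placeholder noise prediction

def sample(num_steps: int, shape=(4, 64, 64), seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)                      # z_T ~ N(0, I)
    ts = np.linspace(T - 1, 0, num_steps).astype(int)   # strided schedule
    for i, t in enumerate(ts):
        eps = eps_theta(z, t)                           # one U-Net call per step
        z0_pred = (z - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])
        if i + 1 < len(ts):                             # step to the next t
            t_prev = ts[i + 1]
            z = np.sqrt(abar[t_prev]) * z0_pred + np.sqrt(1 - abar[t_prev]) * eps
        else:
            z = z0_pred                                 # final clean estimate
    return z  # the decoder D(z) would then map this back to pixel space

z = sample(num_steps=20)
```

Distillation and consistency methods shrink the loop above to a handful of iterations (or even one) by training a student model to match many teacher steps at once.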
As of early 2026, the latent diffusion paradigm remains the dominant approach in image generation, though the specific architectures have evolved considerably.
The U-Net backbone used in the original LDM has been increasingly replaced by Diffusion Transformers (DiT), introduced by Peebles and Xie in 2023 [9]. DiT replaces the convolutional U-Net with a pure transformer operating on latent patches, similar to how Vision Transformers replaced CNNs for image classification. The latent compression stage (autoencoder) remains identical; only the denoising backbone changes. Stable Diffusion 3.0 and 3.5 use a Multimodal Diffusion Transformer (MMDiT), and FLUX.1 (from Black Forest Labs, co-founded by Robin Rombach himself) also uses a transformer-based architecture in latent space.
In early 2024, Robin Rombach (lead author of the LDM paper) and other key researchers departed Stability AI to found Black Forest Labs. In August 2024, they released FLUX.1, which builds on the latent diffusion paradigm with a transformer backbone and improved text rendering capabilities [10]. FLUX immediately topped community benchmarks and demonstrated that the latent diffusion approach continues to evolve and improve under the guidance of its original creators.
The latent diffusion framework has been extended to video generation (adding temporal dimensions to the latent space) and 3D content generation. Models like Stable Video Diffusion, Sora (OpenAI), and Gen-3 (Runway) apply diffusion in spatiotemporal latent spaces to generate video sequences. The same principle of compressing the signal to a lower-dimensional space before applying diffusion has proven effective across modalities.
Recent work has focused on improving the autoencoder component. Better autoencoders with higher reconstruction quality allow the diffusion model to generate finer details. Some approaches have explored continuous-token autoencoders (as opposed to the fixed-grid latent used in original LDMs), and others have investigated end-to-end training of the autoencoder and diffusion model jointly.
Consistency models, progressive distillation, and adversarial distillation techniques have dramatically reduced the number of required denoising steps. Some modern models can produce high-quality images in 1 to 4 steps (compared to the original 50 to 200), making real-time interactive image generation feasible on consumer hardware.