Stable Diffusion

Updated Jul 23, 2026

Stable Diffusion is a family of open-weights text-to-image latent diffusion models first released on August 22, 2022 by Stability AI. It was the first capable text-to-image model with publicly downloadable weights that could run on a single consumer GPU, generating 512x512 images using about 6.9 GB of VRAM. ^[1] ^[55] It was developed in collaboration with the CompVis research group at Ludwig Maximilian University of Munich (LMU Munich), Runway, LAION, and EleutherAI. ^[1] ^[2] The model builds on the "High-Resolution Image Synthesis with Latent Diffusion Models" paper by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, presented at CVPR 2022 and originally posted to arXiv as 2112.10752 on December 20, 2021. ^[3] By performing the diffusion process in a compressed latent space rather than directly in pixel space, Stable Diffusion brought near-state-of-the-art image generation within reach of consumer GPUs. The paper's stated goal was "to enable DM training on limited computational resources while retaining their quality and flexibility," achieved by applying diffusion "in the latent space of powerful pretrained autoencoders." ^[3]

Released under the permissive CreativeML Open RAIL-M license, Stable Diffusion stood in stark contrast to the closed DALL-E 2 from OpenAI and Imagen from Google, neither of which published weights. ^[4] ^[5] Stability AI described the launch as "unlocking the power of open-source generative AI to expand human creativity" and characterized the model as "a single file that compresses the visual information of humanity into a few gigabytes." ^[55] Its release triggered the open-source generative image revolution: community interfaces such as AUTOMATIC1111's WebUI made local generation accessible to non-programmers, and fine-tuning techniques like Dreambooth, Textual Inversion, LoRA, and ControlNet produced thousands of derivative models. The lineage progressed through versions 1.x (2022), 2.x (late 2022), SDXL (2023), SDXL Turbo (late 2023), Stable Diffusion 3 (2024), and Stable Diffusion 3.5 (October 2024). ^[6]

The model's history is inseparable from the corporate trajectory of Stability AI: from a $1 billion unicorn valuation in October 2022 to near-collapse and the departure of founder Emad Mostaque in March 2024, followed by a recapitalization under former Weta Digital CEO Prem Akkaraju, with Sean Parker and James Cameron joining the board in 2024. ^[7] ^[8] The original paper's authors largely departed for Black Forest Labs in 2024, where they released the Flux family, a successor in spirit if not in name. Stable Diffusion's legacy includes a flourishing community ecosystem, lasting controversies over training data (LAION-5B, CSAM, Getty Images), and a permanent shift in expectations about what open-source generative AI can achieve.

What does Stable Diffusion run on?

The defining practical fact about the original Stable Diffusion release is that it ran on hardware most people already owned. Stability AI stated that "the final memory usage on the release of the model should be 6.9 Gb of VRAM," with NVIDIA chips recommended, allowing a 512x512 image to be produced in seconds on a consumer graphics card. ^[55] This was the gap that closed-API competitors left open: DALL-E 2 and Imagen produced comparable or better images but offered no way to run the model locally, inspect it, or fine-tune it. By contrast, SD 1.5 could run on a 4-6 GB consumer GPU, SDXL on an 8-12 GB GPU, and the later SD 3.5 Medium on an 8-10 GB GPU. The ability to run on commodity hardware, combined with downloadable weights, is what made the community ecosystem possible.

Background

Generative image modeling matured along two parallel tracks in the late 2010s and early 2020s. Generative Adversarial Networks (GANs) dominated through 2020 but suffered from training instability and difficulty with text conditioning. The diffusion track became practically competitive after Jonathan Ho, Ajay Jain, and Pieter Abbeel published "Denoising Diffusion Probabilistic Models" (DDPM) in June 2020, demonstrating that a parameterized Markov chain trained with a simple noise-prediction loss could match or exceed GAN image quality. ^[9] Follow-up work in 2021, including classifier-free guidance, DDIM samplers, and score-based formulations, made diffusion practical for high-resolution synthesis, but pixel-space diffusion at 512x512 still required hundreds of GPU-days.

How does Stable Diffusion differ from DALL-E 2 and Imagen?

In April 2022, OpenAI revealed DALL-E 2, a two-stage diffusion system using a CLIP-conditioned prior plus a cascaded diffusion decoder. DALL-E 2 produced startlingly photorealistic imagery but was released only as a waitlisted closed beta with API-only access. The following month, Google announced Imagen, with even higher quality but no public access at all. ^[10] Text-to-image had crossed into practical art generation, but the underlying models remained walled off, with no way to inspect, modify, or fine-tune the system. The opportunity was clear: build a diffusion model with comparable quality but lower compute requirements, and release its weights openly. This was the gap Stable Diffusion filled. The core difference is therefore not architectural novelty but distribution: Stable Diffusion shipped its full weights as a downloadable file under a permissive license, while DALL-E 2 and Imagen remained proprietary services.

Authors and origin

The architecture beneath Stable Diffusion was developed at the Computer Vision & Learning Group (CompVis), led by Professor Björn Ommer. The group was based at Heidelberg University until 2021 and then moved with Ommer to LMU Munich. ^[11] The defining paper, "High-Resolution Image Synthesis with Latent Diffusion Models," was authored by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser (Runway ML, with CompVis affiliation), and Björn Ommer (group leader). The arXiv preprint (2112.10752) appeared on December 20, 2021; the paper was accepted at CVPR 2022 and presented in New Orleans in June 2022. ^[3] The accompanying GitHub repository CompVis/latent-diffusion released checkpoints for unconditional, class-conditional, super-resolution, and inpainting variants.

The transition from "Latent Diffusion Models" to a public-facing "Stable Diffusion" product required compute and packaging that the academic group could not provide alone. Stability AI, founded in 2019 in London by Emad Mostaque and Cyrus Hodes, agreed to fund and donate the compute. Training was performed on approximately 256 NVIDIA A100 GPUs (32 machines with 8 A100s each) on AWS, accumulating around 150,000 GPU-hours at an estimated cost of roughly $600,000. ^[2] ^[12] Mostaque stated on Twitter that "the model alone cost $600,000" to train at market GPU prices, a figure widely cited as evidence that frontier image generation had become unexpectedly cheap. ^[56] Runway ML contributed via Patrick Esser; LAION supplied the training dataset (LAION-5B); EleutherAI provided additional research support.

When was Stable Diffusion released?

On August 10, 2022, a closed-beta release went to researchers and selected members of the AI community. On August 22, 2022, the model was made publicly available under the CreativeML Open RAIL-M license, with the v1.4 checkpoint published on Hugging Face. ^[4] ^[13] There was never a published version 1.0; the first public checkpoint was v1.4. Four of the five paper authors (Rombach, Blattmann, Esser, and Lorenz) joined Stability AI shortly thereafter.

Latent Diffusion Model architecture

A standard pixel-space diffusion model performs every denoising step on the full image; for a 512x512x3 image that is 786,432 dimensions of noise per step. The central insight of Stable Diffusion is that this is wasteful: most perceptual content can be captured in a much smaller latent representation, and the diffusion process can be run entirely within that latent space at a fraction of the cost. The original paper reports roughly 48-fold compute reductions relative to pixel-space diffusion at comparable quality, stating that LDMs "achieve a new state of the art for image inpainting and highly competitive performance on various tasks ... while significantly reducing computational requirements compared to pixel-based DMs." ^[3]

The SD 1.x/2.x architecture comprises three components, combined at inference time by a sampling procedure:

Variational Autoencoder (VAE)

A Variational Autoencoder compresses RGB images into a much smaller latent tensor and decodes latents back into images. For SD 1.x the encoder maps 512x512x3 pixels to a 64x64x4 latent (an 8x downsampling factor in each spatial dimension, with 4 channels), and the decoder performs the inverse mapping. The VAE is trained separately using a combination of pixel reconstruction loss, perceptual loss, and a small KL-divergence regularization. The diffusion U-Net operates entirely on these latents; pixels reappear only at the end when the decoded latent is read out as the final image.
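The size of this compression can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `latent_shape` and `element_count` are invented for this example and do not come from any Stable Diffusion codebase.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return the (height, width, channels) of the VAE latent for an RGB image."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (height // downsample, width // downsample, latent_channels)


def element_count(shape):
    """Multiply out the dimensions of a shape tuple."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (512, 512, 3)
z_shape = latent_shape(512, 512)
ratio = element_count(pixel_shape) / element_count(z_shape)
print(z_shape, element_count(pixel_shape), element_count(z_shape), ratio)
```

For a 512x512 input this gives a 64x64x4 latent holding 16,384 values, 48 times fewer than the 786,432 values of the pixel-space image.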

U-Net denoiser

The core generative model is a conditional U-Net, a convolutional encoder-decoder with skip connections at multiple resolutions. It takes a noisy latent and a denoising timestep as input and predicts the noise present at that timestep, following the DDPM formulation. The SD 1.x U-Net has approximately 860 million parameters; the SD 2.x U-Net inherited the same size. The architecture intersperses ResNet-style convolutional blocks with Transformer-based attention blocks: self-attention mixes spatial latent features, while cross-attention layers attend from the latent feature map to the text embedding sequence produced by the text encoder. This cross-attention is the conditioning channel through which the prompt steers generation. ^[3]
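In symbols, the noise-prediction objective and the cross-attention operation can be written roughly as follows. The notation is intended to follow the latent diffusion paper, but it is reproduced from memory and should be checked against the original: $\mathcal{E}$ is the VAE encoder, $z_t$ the noised latent at timestep $t$, $\tau_\theta(y)$ the text encoder output for prompt $y$, and $\varphi_i(z_t)$ an intermediate U-Net feature map. In Stable Diffusion the text encoder is frozen, so only $\epsilon_\theta$ is trained.

```latex
L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, 1),\, t}
\left[ \left\| \epsilon - \epsilon_\theta\big(z_t, t, \tau_\theta(y)\big) \right\|_2^2 \right]

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q = W_Q\, \varphi_i(z_t),\quad K = W_K\, \tau_\theta(y),\quad V = W_V\, \tau_\theta(y)
```

Queries come from the latent features while keys and values come from the text embeddings, which is why the prompt can influence every spatial position of the latent.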

Text encoder

The model uses a frozen pretrained text encoder to convert a tokenized prompt into a sequence of embedding vectors. SD 1.x used the CLIP ViT-L/14 text encoder from OpenAI (77 tokens of 768-dimensional embeddings). SD 2.x switched to OpenCLIP ViT-H/14 (an open replacement trained by LAION); SDXL concatenated CLIP-L and OpenCLIP-bigG/14 embeddings; SD 3 added Google's T5-XXL as a third text encoder. ^[14] The text encoder remains frozen during diffusion training; only the U-Net learns to align with its representations.

Sampling and classifier-free guidance

At inference, the model starts from a pure-noise latent and runs the U-Net, typically for 20-50 denoising steps (with DDIM, DPM-Solver, or a similar sampler). At each step, two forward passes are performed: one conditioned on the text embedding, one unconditioned. The two predictions are combined using classifier-free guidance, extrapolating away from the unconditioned prediction toward the conditioned one by a guidance scale (typically 5-12) to amplify the influence of the prompt. The final latent is decoded by the VAE to produce the output image. ^[3]
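The guidance step itself is a one-line extrapolation. The following is a minimal sketch on plain Python lists rather than real latent tensors; the function name and the toy values are assumptions made for illustration.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions element-wise.

    A guidance_scale of 1.0 returns the conditioned prediction unchanged;
    larger values extrapolate further away from the unconditioned one.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy 3-element "noise predictions" standing in for full latent tensors.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5)
print(guided)
```

Where the two predictions agree (the middle element) guidance changes nothing; where they differ, the difference is amplified by the guidance scale.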

Training

Dataset: LAION-5B subset

Stable Diffusion was trained on subsets of LAION-5B, a dataset of approximately 5.85 billion image-URL plus alt-text pairs scraped from the public web via Common Crawl by the LAION non-profit. ^[15] ^[2] The training subset for SD 1.x was filtered using a CLIP-based aesthetic scoring model: only images with predicted aesthetic scores above 5.0 (on a 10-point scale), with a minimum resolution of 512x512 and an estimated watermark probability below 0.5, were used. This "LAION-Aesthetics v2 5+" subset contained roughly 600 million image-text pairs. ^[15]
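The three thresholds can be expressed as a simple predicate over per-image metadata. This is a sketch under assumptions: the record field names (`aesthetic_score`, `width`, `height`, `watermark_prob`) are hypothetical and are not LAION's actual column names.

```python
def passes_sd1_filter(record, min_aesthetic=5.0, min_side=512, max_watermark=0.5):
    """Return True if a metadata record meets the three thresholds in the text."""
    return (
        record["aesthetic_score"] > min_aesthetic
        and record["width"] >= min_side
        and record["height"] >= min_side
        and record["watermark_prob"] < max_watermark
    )


# Hypothetical records; only "a" satisfies all three conditions.
samples = [
    {"url": "a", "aesthetic_score": 6.1, "width": 1024, "height": 768, "watermark_prob": 0.1},
    {"url": "b", "aesthetic_score": 4.8, "width": 1024, "height": 768, "watermark_prob": 0.1},
    {"url": "c", "aesthetic_score": 6.1, "width": 400, "height": 768, "watermark_prob": 0.1},
    {"url": "d", "aesthetic_score": 6.1, "width": 1024, "height": 768, "watermark_prob": 0.9},
]
kept = [r["url"] for r in samples if passes_sd1_filter(r)]
print(kept)
```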

Training procedure

Training was performed in multiple stages. The publicly documented sequence for SD 1.x was: SD 1.1 was trained for 237,000 steps at 256x256 on LAION-2B-en, then 194,000 steps at 512x512 on LAION-HD; SD 1.2 was fine-tuned from SD 1.1 for an additional 515,000 steps on LAION-Aesthetics v2 5+ with text drop applied for classifier-free guidance; SD 1.3 was fine-tuned from SD 1.2 for an additional 195,000 steps; and SD 1.4 was fine-tuned from SD 1.2 for 225,000 steps. ^[16] The full training run consumed approximately 150,000 A100-GPU-hours on AWS, with Stability AI estimating the compute cost at around $600,000. ^[12]
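The compute figures quoted in this article can be cross-checked with simple arithmetic. The sketch below assumes, as an idealization, that all 256 GPUs were busy for the whole run; the real schedule was almost certainly less uniform.

```python
GPUS = 32 * 8            # 32 machines with 8 A100s each
GPU_HOURS = 150_000      # approximate total reported for the run
COST_USD = 600_000       # approximate cost estimate

wall_clock_hours = GPU_HOURS / GPUS
wall_clock_days = wall_clock_hours / 24
usd_per_gpu_hour = COST_USD / GPU_HOURS

print(GPUS, round(wall_clock_hours), round(wall_clock_days, 1), usd_per_gpu_hour)
```

Under that assumption the run corresponds to roughly 586 hours (about 24 days) of wall-clock time, at an implied price of $4 per A100-hour.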

SD 1.x lineage

The 1.x family established the practical patterns and ecosystem that would define Stable Diffusion for years.

SD 1.1, 1.2, 1.3, 1.4 (August 2022, CompVis)

Versions 1.1 through 1.4 were released by CompVis on Hugging Face in August 2022. There was never a published version 1.0; the public release on August 22, 2022 was version 1.4. ^[13] ^[16] All 1.x versions share the same architecture (860M-parameter U-Net, CLIP ViT-L/14 text encoder, VAE) and generate at a native 512x512 resolution. They differ only in fine-tuning regime; each subsequent version was further fine-tuned from a prior checkpoint, with version 1.4 chosen as the most widely useful balance for public release.

SD 1.5 (October 2022, RunwayML, then Stability)

Stable Diffusion 1.5 was released on Hugging Face by RunwayML on October 20, 2022 under the existing CreativeML Open RAIL-M license. RunwayML fine-tuned from the SD 1.2 checkpoint for an additional 595,000 steps at 512x512 on the same LAION-Aesthetics subset. ^[17] The release was preceded by friction with Stability AI: Stability had been delaying its own SD 1.5 release for several weeks over reported "legal concerns," and RunwayML proceeded to publish it independently. Stability filed a takedown request with Hugging Face, citing an IP leak. After Runway clarified that Patrick Esser, as a co-author of the original Latent Diffusion paper and a Runway employee, had legitimate rights to release derived weights, Stability withdrew the request, and the release was retroactively recognized as the official SD 1.5. ^[18]

SD 1.5 quickly became the canonical Stable Diffusion checkpoint and remained the dominant base model in the open-source community well into 2024. Its prevalence rested on extensive community fine-tunes (thousands of derivative models), broad tool support, and modest hardware requirements (running on a 4-6 GB consumer GPU). In August 2024, RunwayML deleted its Hugging Face repository, and stewardship migrated to the stable-diffusion-v1-5/stable-diffusion-v1-5 community repository. ^[19]

SD 2.x

Stable Diffusion 2.0 was released by Stability AI on November 24, 2022. ^[20] It introduced multiple changes simultaneously, several of which proved controversial:

Text encoder: CLIP ViT-L/14 was replaced with OpenCLIP ViT-H/14, an open-source variant trained by LAION with Stability AI compute. This avoided dependence on OpenAI artifacts but produced different embedding distributions, making existing SD 1.x prompts and LoRAs largely incompatible.
Resolution: Default supported resolutions were 512x512 and 768x768.
Auxiliary models: A depth-to-image model (depth2img) used MiDaS depth prediction; a 4x upscaler diffusion model enlarged 128x128 inputs to 512x512; a new inpainting model was released.
NSFW filtering: More aggressive NSFW filtering using LAION's safety classifier removed many images that SD 1.x had been trained on. ^[20]

Community reception was mixed. SD 2.0 lost much of the prompt-style vocabulary and celebrity-recognition capability that users had developed for SD 1.5, and many SD 1.5 prompts simply did not work in SD 2.0. Much of the community continued using SD 1.5. ^[21] Stable Diffusion 2.1 was released on December 7, 2022 with a relaxed NSFW filter that restored some artistic vocabulary. ^[22] Despite these fixes, SD 2.1 never overtook SD 1.5 in adoption, and the 2.x line effectively became a footnote.

SDXL

Stable Diffusion XL (SDXL) 1.0 was released by Stability AI on July 26, 2023. ^[23] ^[24] SDXL was the first version to substantially scale the architecture relative to the original Latent Diffusion design while retaining the U-Net-plus-VAE structure. According to its paper, "SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators." ^[23] In the paper's user-preference study, SDXL with its refinement stage was the highest-rated option at a 48.44% win rate and the SDXL base model at 36.93%, both far ahead of Stable Diffusion 1.5 and 2.1. ^[57] Key changes:

Larger U-Net: Approximately 2.6 billion parameters in the base U-Net (described in the paper as a "three times larger UNet backbone"), roughly three times the size of the SD 1.x U-Net; the full base model, including both text encoders, totals about 3.5 billion parameters. ^[23]
Two text encoders: SDXL concatenated embeddings from OpenAI's CLIP ViT-L/14 and the much larger OpenCLIP ViT-bigG/14 into a 2,048-dimensional context tensor. This was a deliberate fusion of the SD 1.x (CLIP-L) and SD 2.x (OpenCLIP) lineages, with bigG providing richer semantic representations and CLIP-L preserving compatibility with the SD 1.5 prompt vocabulary. ^[25]
Native 1024x1024 resolution with multi-aspect-ratio bucket training so the model handled landscape and portrait crops natively.
Refiner model: SDXL shipped as a two-stage pipeline: a base model generated an initial latent, and an optional refiner model (approximately 2.3 billion parameters) added fine detail in a second pass.
Improved VAE and crop/size conditioning as auxiliary U-Net inputs to handle different image scales.
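The dual-encoder context described in the list above can be sketched as a per-token concatenation. The widths used here (768 for CLIP ViT-L/14, 1280 for OpenCLIP ViT-bigG/14) and the 77-token length are stated as assumptions of this example; only the 2,048 total is given in the text, and the zero vectors are placeholders for real encoder outputs.

```python
CLIP_L_DIM = 768           # CLIP ViT-L/14 text embedding width
OPENCLIP_BIGG_DIM = 1280   # OpenCLIP ViT-bigG/14 text embedding width (assumed)
MAX_TOKENS = 77            # assumed prompt length in tokens


def concat_token_embeddings(seq_a, seq_b):
    """Concatenate two per-token embedding sequences along the feature axis."""
    if len(seq_a) != len(seq_b):
        raise ValueError("both encoders must produce the same number of tokens")
    return [a + b for a, b in zip(seq_a, seq_b)]


# Placeholder encoder outputs: one vector per token from each encoder.
clip_l = [[0.0] * CLIP_L_DIM for _ in range(MAX_TOKENS)]
bigg = [[0.0] * OPENCLIP_BIGG_DIM for _ in range(MAX_TOKENS)]
context = concat_token_embeddings(clip_l, bigg)
print(len(context), len(context[0]))
```

Each of the 77 token positions ends up with a 2,048-dimensional vector, which is what the U-Net's cross-attention layers consume.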

SDXL was released under the CreativeML Open RAIL++-M license, with the same permissive commercial-use posture as before. It quickly became the preferred model for users with sufficient GPU memory (8 GB VRAM minimum, 12 GB recommended), though the SD 1.5 ecosystem continued to coexist due to the volume of LoRAs targeting that older base.

SDXL Turbo

SDXL Turbo was released on November 28, 2023 alongside a research paper titled "Adversarial Diffusion Distillation" (ADD) authored by Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. ^[26] ^[27] The technique used a combination of score distillation (using a fixed pretrained SDXL teacher) and an adversarial loss (using a discriminator trained against real images) to distill the multi-step SDXL into a 1-4 step student model.

The result was a model capable of producing 512x512 outputs in a single forward pass, generating in under 100 milliseconds on a high-end consumer GPU, against the several-second-per-image latency of standard SDXL. Quality at one step was visibly lower than full SDXL but matched contemporary state-of-the-art at four steps. SDXL Turbo was initially released under a non-commercial research license, drawing some community criticism about Stability AI moving away from permissive open licensing. Stability subsequently released a more permissive Stable Diffusion 2.1-derived SD-Turbo and adjusted licensing terms for later distilled models.

Stable Diffusion 3 and SD3 Medium

The Stable Diffusion 3 (SD3) research paper, "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis," was posted to arXiv on March 5, 2024 by Patrick Esser, Sumith Kulal, Andreas Blattmann, and 14 co-authors. ^[28] ^[29] An API-only preview launched on February 22, 2024. Open-weight SD3 Medium (2 billion parameters) was released on Hugging Face on June 12, 2024. ^[30] The paper introduced "a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens," and reported that "our largest models outperform state-of-the-art models." ^[28] SD3 represented the most substantial architectural change since the original Latent Diffusion paper:

MMDiT (Multimodal Diffusion Transformer) backbone: The U-Net was replaced entirely with a transformer-based architecture using two separate sets of transformer weights for text tokens and image latent tokens, joining their token sequences inside each attention block so information flows between modalities through bidirectional attention rather than one-way cross-attention. ^[28]
Triple text encoder: CLIP-L, OpenCLIP-bigG, and Google T5-XXL (SD3 uses only the encoder, roughly 4.7B parameters, of a model originally trained as an encoder-decoder transformer for general language tasks). T5-XXL substantially improved long-prompt understanding and text rendering. It could be optionally dropped at inference for reduced VRAM usage.
Rectified Flow formulation: Instead of the DDPM/DDIM noise schedule, SD3 used rectified flow, which trains the model to predict a velocity field connecting noise and data along approximately straight paths. ^[28]
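The rectified flow idea in the list above can be illustrated with a toy example. This is a sketch under assumptions: it uses the convention that t=0 is data and t=1 is noise, and it replaces the trained network with an oracle that returns the exact velocity, so it shows the geometry of the straight path rather than SD3's actual training or sampling code.

```python
def interpolate(x0, noise, t):
    """Point on the straight path: t=0 is data, t=1 is pure noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def target_velocity(x0, noise):
    """Constant velocity along the straight path, i.e. d(z_t)/dt."""
    return [b - a for a, b in zip(x0, noise)]


def euler_sample(noise, velocity_fn, steps):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    z = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


x0 = [0.5, -2.0, 3.0]       # toy "data" sample
noise = [1.5, 0.0, -1.0]    # toy noise sample


def oracle_velocity(z, t):
    """Stand-in for the trained model: returns the true path velocity."""
    return target_velocity(x0, noise)


recovered = euler_sample(noise, oracle_velocity, steps=4)
print(recovered)
```

Because the path is a straight line with constant velocity, even a single Euler step recovers the data point exactly in this idealized setting, which is the intuition behind few-step sampling with rectified flow models.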

The open-weights release of SD3 Medium received significant community criticism. Users reported poor anatomy (especially hands and reclining poses), inferior photorealism compared to community SDXL fine-tunes, and a restrictive license: the Stability AI Community License that capped free commercial use at $1 million in annual revenue. ^[31] Stability AI acknowledged the disappointing reception. The arrival of Flux.1 [dev] from Black Forest Labs in August 2024, also a rectified-flow MMDiT-style architecture but produced by many of the same researchers who had recently left Stability, sharpened the contrast.

Stable Diffusion 3.5

Stable Diffusion 3.5 was announced on October 22, 2024 with three variants: SD 3.5 Large (8.1 billion parameters, the flagship), SD 3.5 Large Turbo (a 4-step distilled version), and SD 3.5 Medium (2.5B parameters, sized for 8-10 GB consumer VRAM, released October 29, 2024). All three are MMDiT models; Large and Large Turbo use the standard MMDiT, while Medium uses an enhanced MMDiT-X that adds self-attention in the first 13 transformer layers to improve multi-resolution coherence (generating images from 0.25 to 2 megapixels). ^[32] ^[33] Improvements over SD3 Medium include Query-Key normalization for training stability, better human anatomy and typography, and an expanded prompt vocabulary.

SD 3.5 was released under the Stability AI Community License, with the same $1 million annual revenue cap for free commercial use, controversial in a community accustomed to the much more permissive Open RAIL-M license. In April 2025, Stability deprecated the SD 3.0 API and migrated paying users to SD 3.5 at no extra cost. SD 3.5 was also released as an NVIDIA NIM microservice and through Microsoft Azure AI Foundry. ^[34]

Ecosystem

The open-weights nature of Stable Diffusion catalyzed an ecosystem of user interfaces, fine-tuning techniques, and adjacent tools that grew far faster than any single company could match.

AUTOMATIC1111 WebUI

The Stable Diffusion WebUI maintained by the pseudonymous developer AUTOMATIC1111 was the first widely adopted local interface, with its initial GitHub release coming within weeks of the SD 1.4 launch. ^[35] Built on the Gradio framework, it presents a tabbed interface for text-to-image, image-to-image, inpainting, outpainting, and many other modes, with a vast extension ecosystem covering ControlNet integration, LoRA management, X/Y/Z parameter sweeps, and dozens of samplers. By 2023 it was the de facto reference interface for the Stable Diffusion community, and by 2024 it had been forked into related projects including Forge (by ControlNet author Lvmin Zhang) and SDNext.

ComfyUI

ComfyUI, released by developer comfyanonymous in early 2023, takes a node-graph approach: the user constructs a directed acyclic graph in which each node represents a step of the pipeline (load model, encode text, sample, decode latent, save image). This makes complex workflows much easier to express than AUTOMATIC1111's flat UI, and the underlying engine is more memory-efficient. ComfyUI became the preferred interface for advanced users and for serving the larger SDXL, SD3, and SD 3.5 models, and is effectively the reference open execution platform for image and video diffusion models more broadly.
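The node-graph idea can be sketched with Python's standard-library topological sorter. This is not ComfyUI's engine or workflow format; the node names below are illustrative labels taken from the pipeline steps listed above.

```python
from graphlib import TopologicalSorter

# Each key is a node; its value is the set of nodes whose outputs it consumes.
# Node names are illustrative and are not ComfyUI's actual node identifiers.
workflow = {
    "load_model": set(),
    "encode_text": {"load_model"},
    "sample": {"load_model", "encode_text"},
    "decode_latent": {"load_model", "sample"},
    "save_image": {"decode_latent"},
}

# A valid execution order runs every node after all of its inputs.
execution_order = list(TopologicalSorter(workflow).static_order())
print(execution_order)
```

Representing the pipeline as a dependency graph means that a node is run only once its inputs exist, and that a change to one node only invalidates the nodes downstream of it.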

InvokeAI, Fooocus, and others

InvokeAI targets creative professionals with a polished canvas-based interface oriented around inpainting and outpainting plus a node workspace. Fooocus, released by Lvmin Zhang in August 2023, hides nearly all technical parameters behind opinionated defaults that approximate the user experience of Midjourney.

Hugging Face Diffusers

The Diffusers library from Hugging Face is the dominant Python library for diffusion model research and application development. Released in mid-2022 around the SD launch, Diffusers provides a clean modular API in which model weights, schedulers, and pipelines are decoupled, with reference implementations for SD 1.x, 2.x, SDXL, SD3, SD 3.5, and most major non-Stability diffusion models. ^[36]

Civitai emerged in late 2022 as the dominant community marketplace for Stable Diffusion checkpoints, LoRAs, textual inversions, and ControlNet conditioners, hosting hundreds of thousands of user-trained derivative models by 2024. Hugging Face has functioned as the canonical model registry for first-party Stability releases.

Adaptations and add-ons

A defining feature of the Stable Diffusion ecosystem is the layer of personalization and control techniques built on top of the base model. These approaches let users customize the model for specific subjects, styles, or compositional constraints without retraining the full 860M-to-8B-parameter base.

LoRA fine-tunes

LoRA (Low-Rank Adaptation), originally introduced by Hu et al. at Microsoft for large language models in 2021, was adapted to Stable Diffusion in late 2022. ^[37] Instead of fine-tuning the entire U-Net, LoRA freezes the base model and inserts pairs of low-rank matrix adapters into the attention layers, training only those small matrices. The resulting adapter files are typically 10-200 MB compared to the multi-GB base, can be applied on top of any compatible checkpoint, and can be combined and weighted. LoRA became the dominant fine-tuning approach: tens of thousands of LoRAs targeting specific characters, art styles, lighting setups, and visual effects are available on Civitai and Hugging Face.
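A small sketch makes the size difference concrete. It uses plain Python lists instead of tensors, and the 768x768 layer size and rank 4 are illustrative assumptions rather than values taken from any particular checkpoint. Real LoRA training usually initializes one of the two matrices to zero so the adapter starts as a no-op; non-zero toy values are used here only to make the arithmetic visible.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def lora_param_counts(d_out, d_in, rank):
    """Return (full_matrix_params, adapter_params) for one weight matrix."""
    return d_out * d_in, rank * (d_out + d_in)


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return W + scale * (B @ A) without modifying the frozen weight W."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


full, adapter = lora_param_counts(d_out=768, d_in=768, rank=4)
print(full, adapter)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy 2x2)
B = [[1.0], [2.0]]             # d_out x rank
A = [[0.5, -0.5]]              # rank x d_in
adapted = apply_lora(W, B, A)
print(adapted)
```

For a 768x768 projection at rank 4, the adapter holds 6,144 values against 589,824 in the full matrix, which is why adapter files are so much smaller than the base checkpoint.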

Textual Inversion

Textual Inversion, introduced by Rinon Gal and colleagues at Tel Aviv University and NVIDIA in August 2022 (paper "An Image is Worth One Word"), takes a different approach: instead of fine-tuning model weights, it learns one or a few new text-embedding vectors (often denoted by placeholder tokens like <concept>) that evoke a specific concept the user has trained on 3-5 images. ^[38] It is even cheaper than LoRA (often only kilobytes per concept), but produces less faithful renderings of complex subjects.
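The "kilobytes per concept" figure follows from the embedding width. The sketch below assumes 32-bit floats, the 768-dimensional CLIP ViT-L/14 embeddings of SD 1.x, and a toy pre-tokenized prompt; the two-dimensional vocabulary vectors are placeholders.

```python
EMBED_DIM = 768          # CLIP ViT-L/14 text embedding width used by SD 1.x
BYTES_PER_FLOAT32 = 4    # assumes embeddings are stored as 32-bit floats


def concept_size_bytes(num_vectors, embed_dim=EMBED_DIM):
    """Storage needed for the learned vectors of one concept."""
    return num_vectors * embed_dim * BYTES_PER_FLOAT32


# Frozen vocabulary (toy 2-dimensional vectors) plus one learned placeholder.
frozen_vocab = {"a": [0.1, 0.2], "photo": [0.3, 0.4], "of": [0.5, 0.6]}
learned = {"<concept>": [0.9, -0.7]}   # the only values training would update


def embed_prompt(tokens):
    """Look up each token, letting learned placeholders extend the vocabulary."""
    table = {**frozen_vocab, **learned}
    return [table[tok] for tok in tokens]


print(concept_size_bytes(1), concept_size_bytes(4))
print(embed_prompt(["a", "photo", "of", "<concept>"]))
```

A single learned vector occupies 3,072 bytes under these assumptions, and the rest of the model, including every other token embedding, stays untouched.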

ControlNet

ControlNet, introduced by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala at Stanford in February 2023, added precise spatial conditioning to Stable Diffusion. ^[39] ^[40] A ControlNet takes the form of a trainable copy of the SD U-Net's encoder branch, connected to the frozen base model through "zero convolutions" (initialized so that ControlNet starts identical to base). Each ControlNet is trained for a particular conditioning signal: Canny edges, depth maps, OpenPose skeletons, segmentation maps, normal maps, scribbles, and more.
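The effect of zero initialization can be shown with a toy, single-channel stand-in. Everything here is hypothetical scaffolding: `base_block` and `control_branch` are arbitrary toy functions, and a 1x1 convolution on one channel is reduced to a scalar multiply-add.

```python
def base_block(x):
    """Stand-in for a frozen U-Net block (hypothetical toy function)."""
    return [2.0 * v + 1.0 for v in x]


def control_branch(x, condition):
    """Stand-in for the trainable copy, which also sees the conditioning signal."""
    return [2.0 * (v + c) + 1.0 for v, c in zip(x, condition)]


def zero_conv(features, weight, bias):
    """A 1x1 convolution on a single channel reduces to weight * value + bias."""
    return [weight * f + bias for f in features]


def controlled_block(x, condition, weight=0.0, bias=0.0):
    """Base output plus the control branch's output passed through the zero conv."""
    residual = zero_conv(control_branch(x, condition), weight, bias)
    return [b + r for b, r in zip(base_block(x), residual)]


x = [0.5, -1.0, 2.0]
edges = [1.0, 0.0, 1.0]   # toy conditioning signal, e.g. an edge map

# With zero weight and bias the control branch contributes nothing.
print(controlled_block(x, edges) == base_block(x))
# Once training moves the weight away from zero, the condition takes effect.
print(controlled_block(x, edges, weight=0.1) != base_block(x))
```

Starting from an exact copy of the base model's behavior means training begins from a working generator, and the conditioning signal is blended in only as the zero-initialized weights are updated.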

ControlNet transformed Stable Diffusion into a tool that could enforce precise compositional control, with applications including character pose transfer, architectural layout preservation, and sketch-to-image illustration. It was presented at ICCV 2023.

Dreambooth

Dreambooth, introduced by Nataniel Ruiz and colleagues at Google Research in August 2022, fine-tunes the full diffusion model (not just adapters) on 3-5 images of a specific subject to bind that subject to a unique identifier token. ^[41] Originally developed against Google's Imagen, it was rapidly adapted to Stable Diffusion within weeks of release. Dreambooth produces the highest fidelity of the early personalization techniques but is computationally expensive (full fine-tuning) and produces full-sized checkpoints. It was eventually superseded by LoRA for most use cases.

Other extensions

The ecosystem includes many more techniques: IP-Adapter for image-prompt conditioning, AnimateDiff for video generation from still SD models, Latent Consistency Models for few-step sampling, regional prompting, and specialized samplers. The pace of innovation in 2023-2024 was rapid enough that any complete catalog dates quickly.

Is Stable Diffusion open source?

Stable Diffusion's licensing history reflects an evolving tension between open release and downstream-use concerns. The early checkpoints were released with open weights under permissive licenses, but the term "open source" is contested because the licenses impose behavioral restrictions and, for the SD 3.x line, a revenue cap.

SD 1.x, SD 2.x, SDXL: The CreativeML Open RAIL-M license (and its close successor RAIL++-M for SDXL). ^[42] Derived from the BigScience BLOOM RAIL license, the OpenRAIL-M license permits royalty-free use, modification, and redistribution for any purpose, including commercial, subject to behavioral restrictions: the licensee must not use the model for a list of harmful applications (illegal activity, discrimination, harassment, generating CSAM, defamation, etc.). The restrictions apply downstream to derivative works. This permissive structure was a central driver of the Stable Diffusion community's growth. Stability AI described it at launch as "a permissive license that allows for commercial and non-commercial usage" focused on "the ethical and legal use of the model." ^[55]
SDXL Turbo: Initially released under a non-commercial research license, which drew community criticism. Stability later released related distilled models under more permissive terms.
SD3, SD 3.5: The Stability AI Community License, which allows free use for research, for personal or hobbyist projects, and by organizations with annual revenue under $1 million; organizations exceeding that threshold must purchase an enterprise license. The license also has revocation provisions for misuse. The Community License sparked considerable community debate about whether the SD 3.x line could legitimately be called "open." ^[31]

Controversies

LAION-5B training data

Stable Diffusion was trained on subsets of LAION-5B, a web-scraped image-URL plus alt-text dataset. The dataset includes copyrighted images, personal photographs, watermarked stock images, and (as later research showed) far more troubling content. Stability AI took the position that scraping public URLs for AI training is fair use; many artists, photographers, and rights-holders disagreed.

CSAM in LAION-5B

In December 2023, the Stanford Internet Observatory published a report identifying more than 3,000 suspected and confirmed instances of child sexual abuse material (CSAM) within LAION-5B. ^[43] The Stanford team recommended that models trained on the dataset "should be deprecated and distribution ceased where feasible." LAION temporarily took down LAION-5B and LAION-400M. In August 2024, LAION released Re-LAION-5B, a cleaned version filtered against known CSAM lists, in cooperation with the Internet Watch Foundation and the Canadian Centre for Child Protection. ^[44]

Getty Images v. Stability AI

In January 2023, Getty Images filed lawsuits against Stability AI in both the United States and the United Kingdom, alleging that Stability scraped more than 12 million Getty images for training without permission. Generated outputs sometimes reproduced recognizable Getty watermarks. Getty initially sought damages of up to $1.7 billion. ^[45] The UK case went to trial; on November 4, 2025, the High Court of England and Wales issued judgment largely in Stability AI's favor. ^[46] Getty had abandoned its primary copyright infringement and database right claims before closing submissions, after accepting there was no evidence training and development took place in the UK. The Court rejected Getty's secondary copyright claim, holding that AI model weights are not a "copy" of training images because they are statistically trained parameters rather than stored copies or reconstructions. The Court found only narrow trademark infringement on a small number of outputs reproducing Getty watermarks. The US litigation remains active.

Andersen v. Stability AI

In January 2023, artists Sarah Andersen, Kelly McKernan, and Karla Ortiz filed a class-action lawsuit in the Northern District of California against Stability AI, Midjourney, and DeviantArt, alleging copyright infringement and right-of-publicity violations. ^[47] In October 2023 most claims were dismissed, but the direct copyright infringement claim survived. In August 2024, Judge William Orrick denied motions to dismiss the surviving claims, allowing discovery to proceed. The trial is scheduled for September 2026.

Artist style mimicry

Stable Diffusion was used extensively to imitate the visual styles of living artists, often by including their names in prompts (for example, "by Greg Rutkowski" was among the most-used artist phrases in SD 1.x prompts). This sparked sustained protest from the digital illustration community and led to artist-protection tools such as Glaze and Nightshade (University of Chicago, 2023-2024), which perturb published images to make them less useful for SD fine-tuning.

Stability AI corporate timeline

The corporate history of Stability AI is unusually consequential because the company controlled releases of every numbered Stable Diffusion checkpoint after the original CompVis launch.

Founding and SD launch (2019-2022)

Stability AI was founded in 2019 in London by Emad Mostaque and Cyrus Hodes. Mostaque, a British-Bangladeshi entrepreneur with a background in hedge fund management, initially self-funded the company. In its early years the company positioned itself as a "community of communities," providing GPU resources to LAION, EleutherAI, and CompVis. The Stable Diffusion 1.4 launch on August 22, 2022 transformed the company's profile overnight, with the model reaching the top of GitHub trending and Hugging Face downloads within days. In October 2022, Stability closed a $101 million seed round at a $1 billion valuation, led by Coatue Management and Lightspeed Venture Partners with participation from O'Shaughnessy Ventures. ^[48] The valuation was extraordinary for an effectively pre-revenue company.

2023: Financial strain

By October 2023, internal accounting reportedly showed Stability burning roughly $8 million per month, largely on AWS GPU costs, against monthly revenue of approximately $5.4 million. ^[7] An attempt to raise additional capital at a $4 billion valuation failed. Investor patience with Mostaque eroded as fundraising stalled, senior staff departed, and reporting in Bloomberg and Forbes scrutinized his biographical claims and management practices. In an October 2023 letter to the board, Lightspeed stated that Mostaque's mismanagement had "severely undermined" their confidence and urged the company to seek a buyer. Coatue separately pushed for his resignation.

March 2024: Mostaque resigns

On March 22, 2024, Emad Mostaque resigned as CEO and stepped down from the board, publicly framing the departure as a move to pursue decentralized AI, saying "you can't beat centralized AI with more centralized AI." ^[49] Reporting indicated the actual driver was sustained investor pressure. Shan Shan Wong (COO) and Christian Laforte (CTO) were appointed interim co-CEOs.

The same month, Robin Rombach, Andreas Blattmann, and Dominik Lorenz, three of the five authors of the original Latent Diffusion paper, also resigned from Stability. They co-founded Black Forest Labs in Freiburg, Germany, which in August 2024 announced $31 million in seed funding led by Andreessen Horowitz and released the Flux family of rectified-flow MMDiT-style image models. ^[50] ^[51] Community observers viewed Flux.1 as the spiritual successor to the Stable Diffusion line, and subsequent funding brought Black Forest Labs to a $3.25 billion valuation by late 2025.

2024 recapitalization and new leadership

In June 2024, Stability AI closed a recapitalization led by a group of new investors including Sean Parker (co-founder of Napster, former president of Facebook). Prem Akkaraju, former CEO of Weta Digital, was appointed CEO, and Parker joined as Executive Chairman. The deal reportedly eliminated approximately $100 million of existing debt and released the company from roughly $300 million in future spending commitments, restoring solvency; approximately $80 million in new equity brought total funding to around $225 million. ^[52] On September 24, 2024, filmmaker James Cameron joined the Board of Directors, signaling a pivot toward film and entertainment industry tools. ^[53]

2025-2026: Recovery

In December 2024, CEO Akkaraju reported triple-digit year-on-year revenue growth and the elimination of the company's debt. In March 2025, WPP announced a strategic investment and partnership integrating Stability's visual-media models into WPP's creative platforms. ^[54] As of early 2026, Stability AI is privately held, has approximately 190 employees, and maintains image, video, audio, and 3D generation models in its portfolio. The company has announced neither a Stable Diffusion 4 nor any other successor in the SD line; model effort is concentrated on SD 3.5 derivatives and on adjacent products in video, audio, and 3D.

Successors and current status

The center of gravity for open-source image generation shifted away from Stable Diffusion during 2024-2025. Black Forest Labs' Flux.1 family (August 2024), comprising Flux.1 [pro] (commercial), [dev] (non-commercial weights), and [schnell] (Apache 2.0, distilled), set a new bar for open-source text-to-image quality, with Flux.1 [dev] displacing SDXL and SD 3.5 in many community workflows. ^[51] Proprietary models (DALL-E 3, Imagen 3, Midjourney 7, GPT-Image-1) continued to lead on out-of-the-box aesthetic quality and text rendering. In this landscape, Stable Diffusion 3.5 occupies a middle position: capable and freely downloadable for non-commercial and small-business use, but no longer the default open-source choice for users prioritizing quality. The SD 1.5 base remains in active community use for its low VRAM requirements and enormous LoRA ecosystem.

Legacy and impact

Stable Diffusion's place in the history of generative AI is secure for several reasons:

The first open-weight, state-of-the-art image generator. Prior to August 2022, capable text-to-image models existed only behind APIs. Stable Diffusion demonstrated that an open-weight release could match closed competitors and reach millions of users within months, a result that influenced subsequent open releases of language models (Meta's LLaMA, Mistral, DeepSeek) and image/video models (Flux, AnimateDiff, HunyuanVideo, and many more).
The catalyst for a new tools ecosystem. Within 18 months of release, the community had produced multiple major user interfaces (AUTOMATIC1111, ComfyUI, InvokeAI, Fooocus), a model-sharing platform with hundreds of thousands of community uploads (Civitai), and standard libraries for working with diffusion models (Hugging Face Diffusers).
The reference architecture for a generation of image and video models. The latent diffusion blueprint (VAE compression plus a U-Net or transformer denoiser plus a frozen text encoder) became the standard pattern across image diffusion, video diffusion, and many other modalities. Even systems that have replaced specific components retain the overall latent-space design.
A focal point for legal and ethical debates in generative AI. The Getty UK ruling, the Andersen US litigation, the LAION-5B CSAM disclosure, and broader debates about artist consent all crystallized around Stable Diffusion as the most visible distributable model.
A test case for the commercial limits of open release. Stability AI's near-collapse in 2023-2024 made clear that open release alone did not guarantee a sustainable business, even when adoption was massive.

Whether or not Stability AI eventually produces a Stable Diffusion 4, the lineage's influence is permanent. The phrase "text-to-image" carries different connotations after August 2022 than before, both technically (assumed availability of open weights and local inference) and culturally (assumed availability of generative image tools to anyone with a consumer GPU).

References

Stability AI. "Stable Diffusion Public Release." Stability AI News, August 22, 2022. https://stability.ai/news/stable-diffusion-public-release
Stability AI. "Stable Diffusion Launch Announcement." Stability AI News, August 10, 2022. https://stability.ai/news/stable-diffusion-announcement
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. "High-Resolution Image Synthesis with Latent Diffusion Models." arXiv:2112.10752, December 2021. CVPR 2022. https://arxiv.org/abs/2112.10752
CompVis. "Stable Diffusion v1.4 Model Card." Hugging Face. https://huggingface.co/CompVis/stable-diffusion-v1-4
"CreativeML Open RAIL-M license text." Hugging Face. https://huggingface.co/spaces/CompVis/stable-diffusion-license/raw/main/license.txt
Wikipedia. "Stable Diffusion." https://en.wikipedia.org/wiki/Stable_Diffusion
Konrad, A. "Inside Stability AI's Bad Breakup with Coatue and Lightspeed Venture." Forbes / Fortune, March 2024. https://fortune.com/2024/03/29/stability-ai-emad-mostaque-resignation-coatue-lightspeed-venture/
Deadline. "Former WETA Digital CEO Prem Akkaraju, Sean Parker Join Stability AI As It Closes Funding Round." June 2024. https://deadline.com/2024/06/weta-digital-prem-akkaraju-napster-sean-parker-join-stability-ai-1235982742/
Ho, J., Jain, A., & Abbeel, P. "Denoising Diffusion Probabilistic Models." arXiv:2006.11239, June 2020. NeurIPS 2020. https://arxiv.org/abs/2006.11239
Ramesh, A., et al. "Hierarchical Text-Conditional Image Generation with CLIP Latents." arXiv:2204.06125, April 2022. https://arxiv.org/abs/2204.06125
CompVis. "Computer Vision & Learning Group." Ommer Lab, LMU Munich. https://ommer-lab.com/
CompVis. "stable-diffusion GitHub repository." https://github.com/CompVis/stable-diffusion
CompVis. "Stable Diffusion v1.4 Original Model Card." Hugging Face. https://huggingface.co/CompVis/stable-diffusion-v-1-4-original
Stability AI. "Stable Diffusion 3: Research Paper." February 2024. https://stability.ai/news-updates/stable-diffusion-3-research-paper
LAION. "LAION-Aesthetics dataset." https://laion.ai/blog/laion-aesthetics/
CompVis. "Stable Diffusion v1-2 Model Card." Hugging Face. https://huggingface.co/CompVis/stable-diffusion-v1-2
Runway. "stable-diffusion-v1-5 Model Card." Hugging Face (archived). https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
Hacker News thread on Stability AI takedown of Runway SD 1.5. October 2022. https://news.ycombinator.com/item?id=33279290
Hugging Face. "stable-diffusion-v1-5/stable-diffusion-v1-5 community repo." https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
Stability AI. "Stable Diffusion 2.0 Release." Stability AI News, November 24, 2022. https://stability.ai/news/stable-diffusion-v2-release
AssemblyAI. "Stable Diffusion 1 vs 2: What You Need to Know." 2022. https://www.assemblyai.com/blog/stable-diffusion-1-vs-2-what-you-need-to-know/
Stability AI. "Stable Diffusion 2.1 Release Notes." Stability AI News, December 2022. https://stability.ai/news/stablediffusion2-1-release7-dec-2022
Podell, D., et al. "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis." arXiv:2307.01952, July 2023. https://arxiv.org/abs/2307.01952
Stability AI. "Announcing SDXL 1.0." July 26, 2023. https://stability.ai/news/stable-diffusion-sdxl-1-announcement
Hugging Face. "stabilityai/stable-diffusion-xl-base-1.0." https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
Sauer, A., Lorenz, D., Blattmann, A., Rombach, R. "Adversarial Diffusion Distillation." arXiv:2311.17042, November 2023. https://arxiv.org/abs/2311.17042
Stability AI. "Introducing SDXL Turbo: A Real-Time Text-to-Image Generation Model." November 28, 2023. https://stability.ai/news/stability-ai-sdxl-turbo
Esser, P., Kulal, S., Blattmann, A., et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." arXiv:2403.03206, March 2024. https://arxiv.org/abs/2403.03206
Stability AI. "Stable Diffusion 3: Research Paper." February 2024. https://stability.ai/news-updates/stable-diffusion-3-research-paper
Stability AI. "Stable Diffusion 3 Medium." Hugging Face, June 12, 2024. https://huggingface.co/stabilityai/stable-diffusion-3-medium
Stability AI. "Stability AI Community License." https://stability.ai/license
Stability AI. "Introducing Stable Diffusion 3.5." Stability AI News, October 22, 2024. https://stability.ai/news/introducing-stable-diffusion-3-5
Hugging Face. "stabilityai/stable-diffusion-3.5-medium." https://huggingface.co/stabilityai/stable-diffusion-3.5-medium
NVIDIA. "Stable Diffusion 3.5 NIM Microservice." NVIDIA Developer Catalog. https://build.nvidia.com/
AUTOMATIC1111. "Stable Diffusion WebUI GitHub repository." https://github.com/AUTOMATIC1111/stable-diffusion-webui
Hugging Face. "Diffusers documentation." https://huggingface.co/docs/diffusers
Hu, E. J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685, June 2021. https://arxiv.org/abs/2106.09685
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion." arXiv:2208.01618, August 2022. https://arxiv.org/abs/2208.01618
Zhang, L., Rao, A., & Agrawala, M. "Adding Conditional Control to Text-to-Image Diffusion Models." arXiv:2302.05543, February 2023. ICCV 2023. https://arxiv.org/abs/2302.05543
ControlNet GitHub repository. https://github.com/lllyasviel/ControlNet
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation." arXiv:2208.12242, August 2022. https://arxiv.org/abs/2208.12242
Hugging Face. "OpenRAIL: Towards open and responsible AI licensing frameworks." https://huggingface.co/blog/open_rail
Thiel, D. "Identifying and Eliminating CSAM in Generative ML Training Data and Models." Stanford Internet Observatory, December 2023. https://purl.stanford.edu/kh752sm9123
LAION. "Releasing Re-LAION 5B." August 2024. https://laion.ai/blog/relaion-5b/
Reuters. "Getty Images sues AI art generator Stable Diffusion in the U.S. for copyright infringement." February 6, 2023. https://www.reuters.com/legal/getty-images-sues-stability-ai-misusing-photos-train-ai-2023-02-06/
Latham & Watkins. "Getty Images v. Stability AI: English High Court Rejects Secondary Copyright Claim." November 2025. https://www.lw.com/en/insights/getty-images-v-stability-ai-english-high-court-rejects-secondary-copyright-claim
Andersen et al. v. Stability AI Ltd. et al. N.D. Cal., Case 3:23-cv-00201. Court order August 12, 2024. https://www.courtlistener.com/docket/66732129/andersen-v-stability-ai-ltd/
TechCrunch. "Stability AI, the startup behind Stable Diffusion, raises $101M." October 17, 2022. https://techcrunch.com/2022/10/17/stability-ai-the-startup-behind-stable-diffusion-raises-101m/
TechCrunch. "Stability AI CEO resigns because you can't beat centralized AI with more centralized AI." March 22, 2024. https://techcrunch.com/2024/03/22/stability-ai-ceo-resigns-because-youre-not-going-to-beat-centralized-ai-with-more-centralized-ai/
VentureBeat. "Stable Diffusion creators launch Black Forest Labs, secure $31M for FLUX.1 AI image generator." August 1, 2024. https://venturebeat.com/ai/stable-diffusion-creators-launch-black-forest-labs-secure-31m-for-flux-1-ai-image-generator/
Black Forest Labs. "Announcing Black Forest Labs." August 1, 2024. https://blackforestlabs.ai/announcing-black-forest-labs/
Stability AI. "Stability AI Secures Significant New Investment." June 2024. https://stability.ai/news/stability-ai-secures-significant-new-investment
Stability AI. "James Cameron, Academy Award-Winning Filmmaker, Joins Stability AI Board of Directors." September 24, 2024. https://stability.ai/news/james-cameron-joins-stability-ai-board-of-directors
WPP. "WPP and Stability AI Announce Partnership." March 2025. https://www.wpp.com/en/news/2025/03/wpp-and-stability-ai-announce-strategic-partnership
Stability AI. "Stable Diffusion Public Release." Stability AI News, August 22, 2022 (VRAM and license language; "a single file that compresses the visual information of humanity"). https://stability.ai/news-updates/stable-diffusion-public-release
The Decoder. "Training cost for Stable Diffusion was just $600,000 (Emad Mostaque, 256 A100 GPUs, ~150,000 GPU-hours)." 2022. https://the-decoder.com/training-cost-for-stable-diffusion-was-just-600000-and-that-is-a-good-sign-for-ai-progress/
Podell, D., et al. "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis" (user-preference win rates: SDXL+refiner 48.44%, SDXL base 36.93%). arXiv:2307.01952, July 2023. https://arxiv.org/abs/2307.01952
