Stable Diffusion
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 6,177 words
Stable Diffusion is a family of open-weights text-to-image latent diffusion models first released on August 22, 2022 by Stability AI in collaboration with the CompVis research group at Ludwig Maximilian University of Munich (LMU Munich), Runway, LAION, and EleutherAI. [1] [2] The model builds on the "High-Resolution Image Synthesis with Latent Diffusion Models" paper by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, presented at CVPR 2022 and originally posted to arXiv as 2112.10752 in December 2021. [3] By performing the diffusion process in a compressed latent space rather than directly in pixel space, Stable Diffusion brought near-state-of-the-art image generation within reach of consumer GPUs.
Released under the permissive CreativeML Open RAIL-M license, Stable Diffusion was the first capable text-to-image model with publicly downloadable weights, in stark contrast to the closed DALL-E 2 from OpenAI and Imagen from Google. [4] [5] Its release triggered the open-source generative image revolution: community interfaces such as AUTOMATIC1111's WebUI made local generation accessible to non-programmers, and fine-tuning techniques like Dreambooth, Textual Inversion, LoRA, and ControlNet produced thousands of derivative models. The lineage progressed through versions 1.x (2022), 2.x (late 2022), SDXL (2023), SDXL Turbo (late 2023), Stable Diffusion 3 (2024), and Stable Diffusion 3.5 (October 2024). [6]
The model's history is inseparable from the corporate trajectory of Stability AI: from a $1 billion unicorn valuation in October 2022 to near-collapse and the departure of founder Emad Mostaque in March 2024, followed by a recapitalization under former Weta Digital CEO Prem Akkaraju, with Sean Parker and James Cameron joining the board in 2024. [7] [8] The original paper's authors largely departed for Black Forest Labs in 2024, where they released the Flux family, a successor in spirit if not in name. Stable Diffusion's legacy includes a flourishing community ecosystem, lasting controversies over training data (LAION-5B, CSAM, Getty Images), and a permanent shift in expectations about what open-source generative AI can achieve.
Generative image modeling matured along two parallel tracks in the late 2010s and early 2020s. Generative Adversarial Networks (GANs) dominated through 2020 but suffered from training instability and difficulty with text conditioning. The diffusion track became practically competitive after Jonathan Ho, Ajay Jain, and Pieter Abbeel published "Denoising Diffusion Probabilistic Models" (DDPM) in June 2020, demonstrating that a parameterized Markov chain trained with a simple noise-prediction loss could match or exceed GAN image quality. [9] Follow-up work in 2021, including classifier-free guidance, DDIM samplers, and score-based formulations, made diffusion practical for high-resolution synthesis, but pixel-space diffusion at 512x512 still required hundreds of GPU-days.
In April 2022, OpenAI revealed DALL-E 2, a two-stage diffusion system using a CLIP-conditioned prior plus a cascaded diffusion decoder. DALL-E 2 produced startlingly photorealistic imagery but was released only as a waitlisted closed beta with API-only access. The following month, Google announced Imagen, with even higher quality but no public access at all. [10] Text-to-image had crossed into practical art generation, but the underlying models remained walled off, with no way to inspect, modify, or fine-tune the system. The opportunity was clear: build a diffusion model with comparable quality but lower compute requirements, and release its weights openly. This was the gap Stable Diffusion filled.
The architecture beneath Stable Diffusion was developed at the Computer Vision & Learning Group (CompVis), led by Professor Björn Ommer. The group was based at Heidelberg University until 2021 and then moved with Ommer to LMU Munich. [11] The defining paper, "High-Resolution Image Synthesis with Latent Diffusion Models," was authored by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser (Runway ML, with CompVis affiliation), and Björn Ommer (group leader). The arXiv preprint (2112.10752) appeared on December 20, 2021; the paper was accepted at CVPR 2022 and presented in New Orleans in June 2022. [3] The accompanying GitHub repository CompVis/latent-diffusion released checkpoints for unconditional, class-conditional, super-resolution, and inpainting variants.
The transition from "Latent Diffusion Models" to a public-facing "Stable Diffusion" product required compute and packaging that the academic group could not provide alone. Stability AI, founded in 2019 in London by Emad Mostaque and Cyrus Hodes, agreed to fund and donate the compute. Training was performed on approximately 256 NVIDIA A100 GPUs on AWS, accumulating around 150,000 GPU-hours at an estimated cost of roughly $600,000. [2] [12] Runway ML contributed via Patrick Esser; LAION supplied the training dataset (LAION-5B); EleutherAI provided additional research support.
On August 10, 2022, a closed-beta release went to researchers and selected members of the AI community. On August 22, 2022, the model was made publicly available under the CreativeML Open RAIL-M license, with the v1.4 checkpoint published on Hugging Face. [4] [13] Four of the five paper authors (Rombach, Blattmann, Esser, and Lorenz) joined Stability AI shortly thereafter.
A standard pixel-space diffusion model performs every denoising step on the full image; for a 512x512x3 image that is 786,432 dimensions of noise per step. The central insight of Stable Diffusion is that this is wasteful: most perceptual content can be captured in a much smaller latent representation, and the diffusion process can be run entirely within that latent space at a fraction of the cost. The original paper reports roughly 48-fold compute reductions relative to pixel-space diffusion at comparable quality. [3]
The SD 1.x/2.x architecture comprises four components:
A Variational Autoencoder compresses RGB images into a much smaller latent tensor and decodes latents back into images. For SD 1.x the encoder maps 512x512x3 pixels to a 64x64x4 latent (an 8x downsampling factor in each spatial dimension, with 4 channels), and the decoder performs the inverse mapping. The VAE is trained separately using a combination of pixel reconstruction loss, perceptual loss, and a small KL-divergence regularization. The diffusion U-Net operates entirely on these latents; pixels reappear only at the end when the decoded latent is read out as the final image.
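The size of that compression can be checked with a few lines of arithmetic. The sketch below counts scalar values per tensor for the SD 1.x shapes quoted above; the helper name `tensor_size` is illustrative, and the ratio describes tensor size only, not measured compute.

```python
# Shapes are the SD 1.x figures quoted in the text; tensor_size is an illustrative helper.
def tensor_size(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

DOWNSAMPLE = 8        # spatial downsampling factor in each dimension
LATENT_CHANNELS = 4   # channels in the VAE latent

pixel_values = tensor_size(512, 512, 3)
latent_values = tensor_size(512 // DOWNSAMPLE, 512 // DOWNSAMPLE, LATENT_CHANNELS)
ratio = pixel_values // latent_values

print(pixel_values)   # 786432
print(latent_values)  # 16384
print(ratio)          # 48
```

Each denoising step therefore touches 48 times fewer values in latent space than it would in pixel space.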
The core generative model is a conditional U-Net, a convolutional encoder-decoder with skip connections at multiple resolutions. It takes a noisy latent and a denoising timestep as input and predicts the noise present at that timestep, following the DDPM formulation. The SD 1.x U-Net has approximately 860 million parameters; the SD 2.x U-Net inherited the same size. The architecture intersperses ResNet-style convolutional blocks with Transformer-based attention blocks: self-attention mixes spatial latent features, while cross-attention layers attend from the latent feature map to the text embedding sequence produced by the text encoder. This cross-attention is the conditioning channel through which the prompt steers generation. [3]
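The cross-attention step can be illustrated with a toy example. This is a minimal sketch of scaled dot-product attention in which latent positions supply the queries and text tokens supply the keys and values. It omits the learned projection matrices and the multi-head structure of a real U-Net block, and all vectors are made-up toy values.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each latent query attends over all text tokens (keys/values)."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the text value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy data: 3 latent positions, 2 text tokens, feature width 2.
latent_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_k = [[1.0, 0.0], [0.0, 1.0]]
text_v = [[10.0, 0.0], [0.0, 10.0]]

out, attn = cross_attention(latent_q, text_k, text_v)
```

The output has one row per latent position, and each row of `attn` sums to 1, so every spatial location receives a convex mixture of text-token values.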
The model uses a frozen pretrained text encoder to convert a tokenized prompt into a sequence of embedding vectors. SD 1.x used the CLIP ViT-L/14 text encoder from OpenAI (77 tokens of 768-dimensional embeddings). SD 2.x switched to OpenCLIP ViT-H/14 (an open replacement trained by LAION); SDXL concatenated CLIP-L and OpenCLIP-bigG/14 embeddings; SD 3 added Google's T5-XXL as a third text encoder. [14] The text encoder remains frozen during diffusion training; only the U-Net learns to align with its representations.
At inference, the model starts from a pure-noise latent and runs the U-Net for typically 20-50 denoising steps (with DDIM, DPM-Solver, or a similar sampler). At each step, two forward passes are performed: one conditioned on the text embedding, one unconditioned. The two predictions are combined using classifier-free guidance, extrapolating away from the unconditioned prediction toward the conditioned one by a guidance scale (typically 5-12) to amplify the influence of the prompt. The final latent is decoded by the VAE to produce the output image. [3]
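The guidance combination can be written in one line. The sketch below uses the standard classifier-free guidance formula on plain lists of floats; the noise values are made up for illustration, and a real sampler would apply this to full latent tensors at every step.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditioned noise prediction toward the conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions (illustrative values, not model output).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
print(guided)  # [7.5, 1.0, 6.5]
```

A scale of 1 reproduces the conditioned prediction and a scale of 0 reproduces the unconditioned one. Values above 1 push the result further along the direction in which the prompt changes the prediction.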
Stable Diffusion was trained on subsets of LAION-5B, a dataset of approximately 5.85 billion image-URL plus alt-text pairs scraped from the public web via Common Crawl by the LAION non-profit. [15] [2] The training subset for SD 1.x was filtered using a CLIP-based aesthetic scoring model: only images with predicted aesthetic scores above 5.0 (on a 10-point scale), with a minimum resolution of 512x512 and an estimated watermark probability below 0.5, were used. This "LAION-Aesthetics v2 5+" subset contained roughly 600 million image-text pairs. [15]
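The three filtering criteria amount to a simple predicate over dataset records. The sketch below is an assumption about how such a filter could be expressed: the `Sample` record layout and its field names are hypothetical, and only the thresholds come from the text.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """Hypothetical record layout; field names are not LAION's actual schema."""
    url: str
    width: int
    height: int
    aesthetic_score: float   # predicted score on a 10-point scale
    watermark_prob: float    # estimated probability of a watermark

def keep(sample, min_side=512, min_aesthetic=5.0, max_watermark=0.5):
    """Apply the three thresholds described in the text."""
    return (min(sample.width, sample.height) >= min_side
            and sample.aesthetic_score > min_aesthetic
            and sample.watermark_prob < max_watermark)

samples = [
    Sample("a.jpg", 1024, 768, 6.2, 0.1),   # passes all three checks
    Sample("b.jpg", 400, 600, 7.0, 0.0),    # too small
    Sample("c.jpg", 800, 800, 4.9, 0.0),    # aesthetic score too low
    Sample("d.jpg", 800, 800, 6.0, 0.9),    # likely watermarked
]
kept = [s.url for s in samples if keep(s)]
print(kept)  # ['a.jpg']
```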
Training was performed in multiple stages. The publicly documented sequence for SD 1.x was: SD 1.1 was trained for 237,000 steps at 256x256 on LAION-2B-en, then 194,000 steps at 512x512 on LAION-HD; SD 1.2 was fine-tuned from SD 1.1 for an additional 515,000 steps on LAION-Aesthetics v2 5+ with text drop applied for classifier-free guidance; SD 1.3 was fine-tuned from SD 1.2 for an additional 195,000 steps; and SD 1.4 was fine-tuned from SD 1.2 for 225,000 steps. [16] The full training run consumed approximately 150,000 A100-GPU-hours on AWS, with Stability AI estimating the compute cost at around $600,000. [12]
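The compute figures imply a per-GPU-hour price and a rough wall-clock duration. The arithmetic below uses the approximate numbers reported in the article (150,000 GPU-hours, about 256 GPUs, about $600,000). The wall-clock estimate assumes, as an idealization, that all GPUs ran continuously and in parallel.

```python
GPU_HOURS = 150_000       # approximate total reported
NUM_GPUS = 256            # approximate cluster size reported
TOTAL_COST_USD = 600_000  # approximate cost estimate reported

cost_per_gpu_hour = TOTAL_COST_USD / GPU_HOURS
# Idealized: every GPU busy for the whole run, no restarts or idle time.
wall_clock_hours = GPU_HOURS / NUM_GPUS
wall_clock_days = wall_clock_hours / 24

print(cost_per_gpu_hour)          # 4.0
print(round(wall_clock_days, 1))  # 24.4
```

Under these assumptions the run corresponds to roughly $4 per A100-hour and a little over three weeks of cluster time.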
The 1.x family established the practical patterns and ecosystem that would define Stable Diffusion for years.
Versions 1.1 through 1.4 were released by CompVis on Hugging Face in August 2022. There was never a published version 1.0; the public release on August 22, 2022 was version 1.4. [13] [16] All 1.x versions share the same architecture (860M-parameter U-Net, CLIP ViT-L/14 text encoder, VAE) and generate at a native 512x512 resolution. They differ only in fine-tuning regime; each subsequent version was further fine-tuned from a prior checkpoint, with version 1.4 chosen as the most widely useful balance for public release.
Stable Diffusion 1.5 was released on Hugging Face by RunwayML on October 20, 2022 under the existing CreativeML Open RAIL-M license. RunwayML fine-tuned from the SD 1.2 checkpoint for an additional 595,000 steps at 512x512 on the same LAION-Aesthetics subset. [17] The release was preceded by friction with Stability AI: Stability had been delaying its own SD 1.5 release for several weeks over reported "legal concerns," and RunwayML proceeded to publish it independently. Stability filed a takedown request with Hugging Face, citing an IP leak. After Runway clarified that Patrick Esser, as a co-author of the original Latent Diffusion paper and a Runway employee, had legitimate rights to release derived weights, Stability withdrew the request, and the release was retroactively recognized as the official SD 1.5. [18]
SD 1.5 quickly became the canonical Stable Diffusion checkpoint and remained the dominant base model in the open-source community well into 2024. Its prevalence rested on extensive community fine-tunes (thousands of derivative models), broad tool support, and modest hardware requirements (running on a 4-6 GB consumer GPU). In August 2024, RunwayML deleted its Hugging Face repository, and stewardship migrated to the stable-diffusion-v1-5/stable-diffusion-v1-5 community repository. [19]
Stable Diffusion 2.0 was released by Stability AI on November 24, 2022. [20] It introduced multiple changes simultaneously, several of which proved controversial:
Community reception was mixed. SD 2.0 lost much of the prompt-style vocabulary and celebrity-recognition capability that users had developed for SD 1.5, and many SD 1.5 prompts simply did not work in SD 2.0. Much of the community continued using SD 1.5. [21] Stable Diffusion 2.1 was released on December 7, 2022 with a relaxed NSFW filter that restored some artistic vocabulary. [22] Despite these fixes, SD 2.1 never overtook SD 1.5 in adoption, and the 2.x line effectively became a footnote.
Stable Diffusion XL (SDXL) 1.0 was released by Stability AI on July 26, 2023. [23] [24] SDXL was the first version to substantially scale the architecture relative to the original Latent Diffusion design while retaining the U-Net-plus-VAE structure. Key changes:
SDXL was released under the CreativeML Open RAIL++-M license, with the same permissive commercial-use posture as before. It quickly became the preferred model for users with sufficient GPU memory (8 GB VRAM minimum, 12 GB recommended), though the SD 1.5 ecosystem continued to coexist due to the volume of LoRAs targeting that older base.
SDXL Turbo was released on November 28, 2023 alongside a research paper titled "Adversarial Diffusion Distillation" (ADD) authored by Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. [26] [27] The technique used a combination of score distillation (using a fixed pretrained SDXL teacher) and an adversarial loss (using a discriminator trained against real images) to distill the multi-step SDXL into a 1-4 step student model.
The result was a model capable of producing 512x512 outputs in a single forward pass, generating in under 100 milliseconds on a high-end consumer GPU, against the several-second-per-image latency of standard SDXL. Quality at one step was visibly lower than full SDXL but matched contemporary state-of-the-art at four steps. SDXL Turbo was initially released under a non-commercial research license, drawing some community criticism about Stability AI moving away from permissive open licensing. Stability subsequently released a more permissive Stable Diffusion 2.1-derived SD-Turbo and adjusted licensing terms for later distilled models.
The Stable Diffusion 3 (SD3) research paper, "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis," was posted to arXiv on March 5, 2024 by Patrick Esser, Sumith Kulal, Andreas Blattmann, and 14 co-authors. [28] [29] An API-only preview launched on February 22, 2024. Open-weight SD3 Medium (2 billion parameters) was released on Hugging Face on June 12, 2024. [30] SD3 represented the most substantial architectural change since the original Latent Diffusion paper:
The open-weights release of SD3 Medium received significant community criticism. Users reported poor anatomy (especially hands and reclining poses), inferior photorealism compared to community SDXL fine-tunes, and a restrictive license: the Stability AI Community License that capped free commercial use at $1 million in annual revenue. [31] Stability AI acknowledged the disappointing reception. The arrival of Flux.1 [dev] from Black Forest Labs in August 2024, also a rectified-flow MMDiT-style architecture but produced by many of the same researchers who had recently left Stability, sharpened the contrast.
Stable Diffusion 3.5 was announced on October 22, 2024 with three variants: SD 3.5 Large (8.1 billion parameters, the flagship), SD 3.5 Large Turbo (a 4-step distilled version), and SD 3.5 Medium (2.5B parameters, sized for 8-10 GB consumer VRAM, released October 29, 2024). All three are MMDiT models; Large and Turbo use the standard MMDiT, while Medium uses an enhanced MMDiT-X that adds self-attention in the first 13 transformer layers to improve multi-resolution coherence (generating images from 0.25 to 2 megapixels). [32] [33] Improvements over SD3 Medium include Query-Key normalization for training stability, better human anatomy and typography, and an expanded prompt vocabulary.
SD 3.5 was released under the Stability AI Community License, with the same $1 million annual revenue cap for free commercial use, controversial in a community accustomed to the much more permissive Open RAIL-M license. In April 2025, Stability deprecated the SD 3.0 API and migrated paying users to SD 3.5 at no extra cost. SD 3.5 was also released as an NVIDIA NIM microservice and through Microsoft Azure AI Foundry. [34]
The open-weights nature of Stable Diffusion catalyzed an ecosystem of user interfaces, fine-tuning techniques, and adjacent tools that grew far faster than any single company could match.
The Stable Diffusion WebUI, maintained by the pseudonymous developer AUTOMATIC1111, was the first widely adopted local interface, with its initial GitHub release coming within weeks of the SD 1.4 launch. [35] Built on the Gradio framework, it presents a tabbed interface for text-to-image, image-to-image, inpainting, outpainting, and many other modes, with a vast extension ecosystem covering ControlNet integration, LoRA management, X/Y/Z parameter sweeps, and dozens of samplers. By 2023 it was the de facto reference interface for the Stable Diffusion community, and by 2024 it had been forked into related projects including Forge (by ControlNet author Lvmin Zhang) and SDNext.
ComfyUI, released by developer comfyanonymous in early 2023, takes a node-graph approach: the user constructs a directed acyclic graph in which each node represents a step of the pipeline (load model, encode text, sample, decode latent, save image). This makes complex workflows much easier to express than AUTOMATIC1111's flat UI, and the underlying engine is more memory-efficient. ComfyUI became the preferred interface for advanced users and for serving the larger SDXL, SD3, and SD 3.5 models, and is effectively the reference open execution platform for image and video diffusion models more broadly.
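The node-graph idea can be sketched with Python's standard-library topological sorter. This is not ComfyUI's actual workflow format or API. The node names are taken from the pipeline steps listed above, and the dependency edges are an illustrative assumption.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of nodes it depends on (illustrative edges).
graph = {
    "load_model": set(),
    "encode_text": {"load_model"},
    "sample": {"load_model", "encode_text"},
    "decode_latent": {"sample"},
    "save_image": {"decode_latent"},
}

# A valid execution order runs every node after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
# ['load_model', 'encode_text', 'sample', 'decode_latent', 'save_image']
```

Because the graph is acyclic, an engine can also cache node outputs and re-run only the nodes downstream of a change, which is one reason graph-based interfaces suit iterative workflows.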
InvokeAI targets creative professionals with a polished canvas-based interface oriented around inpainting and outpainting plus a node workspace. Fooocus, released by Lvmin Zhang in August 2023, hides nearly all technical parameters behind opinionated defaults that approximate the user experience of Midjourney.
The Diffusers library from Hugging Face is the dominant Python library for diffusion model research and application development. Released in mid-2022 around the SD launch, Diffusers provides a clean modular API in which model weights, schedulers, and pipelines are decoupled, with reference implementations for SD 1.x, 2.x, SDXL, SD3, SD 3.5, and most major non-Stability diffusion models. [36]
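A typical text-to-image call with Diffusers looks roughly like the sketch below. It assumes the `torch` and `diffusers` packages are installed, that a CUDA GPU is available, and that the community SD 1.5 repository named in this article is reachable. The wrapper function `generate` and its default values are illustrative choices, not part of the library.

```python
def generate(prompt, steps=30, guidance_scale=7.5,
             model_id="stable-diffusion-v1-5/stable-diffusion-v1-5"):
    """Generate one image; requires torch, diffusers, and a CUDA GPU."""
    # Imported lazily so that defining this function has no heavy dependencies.
    import torch
    from diffusers import StableDiffusionPipeline

    # Load weights in half precision to reduce VRAM use.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    result = pipe(prompt, num_inference_steps=steps, guidance_scale=guidance_scale)
    return result.images[0]  # a PIL image
```

Calling `generate("a watercolor lighthouse at dawn").save("out.png")` would download the weights on first use and write a single image. Swapping the scheduler or model requires changing only the corresponding component, which reflects the library's decoupled design.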
Civitai emerged in late 2022 as the dominant community marketplace for Stable Diffusion checkpoints, LoRAs, textual inversions, and ControlNet conditioners, hosting hundreds of thousands of user-trained derivative models by 2024. Hugging Face has functioned as the canonical model registry for first-party Stability releases.
A defining feature of the Stable Diffusion ecosystem is the layer of personalization and control techniques built on top of the base model. These approaches let users customize the model for specific subjects, styles, or compositional constraints without retraining the full 860M-to-8B-parameter base.
LoRA (Low-Rank Adaptation), originally introduced by Hu et al. at Microsoft for large language models in 2021, was adapted to Stable Diffusion in late 2022. [37] Instead of fine-tuning the entire U-Net, LoRA freezes the base model and inserts pairs of low-rank matrix adapters into the attention layers, training only those small matrices. The resulting adapter files are typically 10-200 MB compared to the multi-GB base, can be applied on top of any compatible checkpoint, and can be combined and weighted. LoRA became the dominant fine-tuning approach: tens of thousands of LoRAs targeting specific characters, art styles, lighting setups, and visual effects are available on Civitai and Hugging Face.
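The low-rank update and its parameter savings can be shown with small matrices. The sketch below follows the formulation in the LoRA paper (effective weight W + (alpha / r) · B · A, with B initialized to zero). The matrix values and the 768x768 layer size are illustrative assumptions rather than the dimensions of any particular SD layer.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(base, lora_b, lora_a, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A; the base matrix stays frozen."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]

def param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a rank-r adapter."""
    return d_out * d_in, rank * (d_out + d_in)

base = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # frozen 2x3 weight
lora_a = [[0.5, -0.5, 1.0]]                 # 1x3, trainable
zero_b = [[0.0], [0.0]]                     # 2x1, zero-initialized
trained_b = [[1.0], [2.0]]                  # pretend values after training

at_init = lora_weight(base, zero_b, lora_a, alpha=1.0, rank=1)
after = lora_weight(base, trained_b, lora_a, alpha=1.0, rank=1)
full, adapter = param_counts(768, 768, 4)

print(at_init == base)  # True: a fresh adapter leaves the model unchanged
print(full, adapter)    # 589824 6144
```

For this hypothetical layer the adapter trains about 1% of the parameters, which is why adapter files are so much smaller than full checkpoints.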
Textual Inversion, introduced by Rinon Gal and colleagues at Tel Aviv University and NVIDIA in August 2022 (paper "An Image is Worth One Word"), takes a different approach: instead of fine-tuning model weights, it learns one or a few new text-embedding vectors (often denoted by placeholder tokens like <concept>) that evoke a specific concept the user has trained on 3-5 images. [38] It is even cheaper than LoRA (often only kilobytes per concept), but produces less faithful renderings of complex subjects.
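The mechanism can be sketched as adding one trainable row to an otherwise frozen embedding table. The toy vocabulary, the zero-valued embeddings, and the `add_concept` helper are hypothetical. The 768-dimensional width is the SD 1.x CLIP text embedding size, and the byte count assumes 32-bit floats.

```python
EMBED_DIM = 768  # SD 1.x CLIP text embedding width

# Hypothetical toy vocabulary with frozen embeddings.
vocab = {"a": 0, "photo": 1, "of": 2}
embeddings = [[0.0] * EMBED_DIM for _ in vocab]
trainable = [False] * len(vocab)

def add_concept(token, init_vector):
    """Append a new placeholder token whose embedding is the only trainable row."""
    vocab[token] = len(embeddings)
    embeddings.append(list(init_vector))
    trainable.append(True)
    return vocab[token]

# Initializing from a related word's embedding is one common choice.
idx = add_concept("<concept>", embeddings[vocab["photo"]])

size_bytes = EMBED_DIM * 4  # one float32 vector
print(idx, trainable.count(True), size_bytes)  # 3 1 3072
```

A single learned vector of this size occupies about 3 KB, consistent with the kilobyte-scale files mentioned above.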
ControlNet, introduced by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala at Stanford in February 2023, added precise spatial conditioning to Stable Diffusion. [39] [40] A ControlNet takes the form of a trainable copy of the SD U-Net's encoder branch, connected to the frozen base model through "zero convolutions" (initialized so that ControlNet starts identical to base). Each ControlNet is trained for a particular conditioning signal: Canny edges, depth maps, OpenPose skeletons, segmentation maps, normal maps, scribbles, and more.
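The effect of zero initialization can be seen in a deliberately tiny model. In the sketch below every function is a hypothetical stand-in: the "convolution" is reduced to a per-element scale and shift, and the control signal is added to a single block's output rather than to the skip connections of a full U-Net.

```python
def base_block(x):
    """Stand-in for a frozen base-model block (toy function)."""
    return [2.0 * v + 1.0 for v in x]

def control_branch(x, condition):
    """Stand-in for the trainable encoder copy, which also sees the condition."""
    return [v + c for v, c in zip(x, condition)]

def zero_conv(features, weight, bias):
    """A 1x1 convolution reduced to one scale and one shift per element."""
    return [weight * f + bias for f in features]

def controlled_block(x, condition, weight=0.0, bias=0.0):
    """Base output plus the control branch passed through the zero convolution."""
    base_out = base_block(x)
    control_out = zero_conv(control_branch(x, condition), weight, bias)
    return [b + c for b, c in zip(base_out, control_out)]

x = [1.0, 2.0, 3.0]
edge_map = [0.0, 1.0, 0.0]   # toy conditioning signal

at_init = controlled_block(x, edge_map)                 # weight = bias = 0
after = controlled_block(x, edge_map, weight=0.5)       # pretend trained weight

print(at_init == base_block(x))  # True
print(after)                     # [3.5, 6.5, 8.5]
```

With zero weights the control branch contributes nothing, so training starts from the unmodified base model and the conditioning influence grows only as the weights move away from zero.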
ControlNet transformed Stable Diffusion into a tool that could enforce precise compositional control, with applications including character pose transfer, architectural layout preservation, and sketch-to-image illustration. It was presented at ICCV 2023.
Dreambooth, introduced by Nataniel Ruiz and colleagues at Google Research in August 2022, fine-tunes the full diffusion model (not just adapters) on 3-5 images of a specific subject to bind that subject to a unique identifier token. [41] Originally developed against Google's Imagen, it was rapidly adapted to Stable Diffusion within weeks of release. Dreambooth produces the highest fidelity of the early personalization techniques but is computationally expensive (full fine-tuning) and produces full-sized checkpoints. It was eventually superseded by LoRA for most use cases.
The ecosystem includes many more techniques: IP-Adapter for image-prompt conditioning, AnimateDiff for video generation from still SD models, Latent Consistency Models for few-step sampling, regional prompting, and specialized samplers. The pace of innovation in 2023-2024 was rapid enough that any complete catalog dates quickly.
Stable Diffusion's licensing history reflects an evolving tension between open release and downstream-use concerns:
Stable Diffusion was trained on subsets of LAION-5B, a web-scraped image-URL plus alt-text dataset. The dataset includes copyrighted images, personal photographs, watermarked stock images, and (as later research showed) far more troubling content. Stability AI took the position that scraping public URLs for AI training is fair use; many artists, photographers, and rights-holders disagreed.
In December 2023, the Stanford Internet Observatory published a report identifying more than 3,000 suspected and confirmed instances of child sexual abuse material (CSAM) within LAION-5B. [43] The Stanford team recommended that models trained on the dataset "should be deprecated and distribution ceased where feasible." LAION temporarily took down LAION-5B and LAION-400M. In August 2024, LAION released Re-LAION-5B, a cleaned version filtered against known CSAM lists, in cooperation with the Internet Watch Foundation and the Canadian Centre for Child Protection. [44]
In January 2023, Getty Images filed lawsuits against Stability AI in both the United States and the United Kingdom, alleging that Stability scraped more than 12 million Getty images for training without permission. Generated outputs sometimes reproduced recognizable Getty watermarks. Getty initially sought damages of up to $1.7 billion. [45] The UK case went to trial; on November 4, 2025, the High Court of England and Wales issued judgment largely in Stability AI's favor. [46] Getty had abandoned its primary copyright infringement and database right claims before closing submissions, after accepting there was no evidence training and development took place in the UK. The Court rejected Getty's secondary copyright claim, holding that AI model weights are not a "copy" of training images because they are statistically trained parameters rather than stored copies or reconstructions. The Court found only narrow trademark infringement on a small number of outputs reproducing Getty watermarks. The US litigation remains active.
In January 2023, artists Sarah Andersen, Kelly McKernan, and Karla Ortiz filed a class-action lawsuit in the Northern District of California against Stability AI, Midjourney, and DeviantArt, alleging copyright infringement and right-of-publicity violations. [47] In October 2023 most claims were dismissed, but the direct copyright infringement claim survived. In August 2024, Judge William Orrick denied motions to dismiss the surviving claims, allowing discovery to proceed. The trial is scheduled for September 2026.
Stable Diffusion was used extensively to imitate the visual styles of living artists, often using their names as prompts (e.g., "by Greg Rutkowski" was one of the most-used artist prompts in SD 1.x). This sparked sustained protest from the digital illustration community and the launch of artist-protection tools like Glaze and Nightshade (University of Chicago, 2023-2024) that perturb published images to make them less useful for SD fine-tuning.
The corporate history of Stability AI is unusually consequential because the company controlled releases of every numbered Stable Diffusion checkpoint after the original CompVis launch.
Stability AI was founded in 2019 in London by Emad Mostaque and Cyrus Hodes. Mostaque, a British-Bangladeshi entrepreneur with a background in hedge fund management, was initially self-funding the company. The early identity was as a community-of-communities, providing GPU resources to LAION, EleutherAI, and CompVis. The Stable Diffusion 1.4 launch on August 22, 2022 transformed the company's profile overnight, with the model reaching the top of GitHub trending and Hugging Face downloads within days. In October 2022, Stability closed a $101 million seed round at a $1 billion valuation, led by Coatue Management and Lightspeed Venture Partners with participation from O'Shaughnessy Ventures. [48] The valuation was extraordinary for an effectively pre-revenue company.
By October 2023, internal accounting reportedly showed Stability burning roughly $8 million per month, largely on AWS GPU costs, against monthly revenue of approximately $5.4 million. [7] An attempt to raise additional capital at a $4 billion valuation failed. Investor patience with Mostaque eroded as fundraising stalled, senior staff departed, and reporting in Bloomberg and Forbes scrutinized his biographical claims and management practices. In an October 2023 letter to the board, Lightspeed stated that Mostaque's mismanagement had "severely undermined" their confidence and urged the company to seek a buyer. Coatue separately pushed for his resignation.
On March 22, 2024, Emad Mostaque resigned as CEO and stepped down from the board, publicly framing the departure as a move to pursue decentralized AI, saying "you can't beat centralized AI with more centralized AI." [49] Reporting indicated the actual driver was sustained investor pressure. Shan Shan Wong (COO) and Christian Laforte (CTO) were appointed interim co-CEOs.
The same month, Robin Rombach, Andreas Blattmann, and Dominik Lorenz, three of the four original Latent Diffusion authors who had joined Stability, also resigned. They co-founded Black Forest Labs in Freiburg, Germany, which launched in August 2024 with $31 million in seed funding led by Andreessen Horowitz and released the Flux family of rectified-flow MMDiT-style image models. [50] [51] Community observers viewed Flux.1 as the spiritual successor to the Stable Diffusion line, and subsequent funding brought Black Forest Labs to a $3.25 billion valuation by late 2025.
In June 2024, Stability AI closed a recapitalization led by a group of new investors including Sean Parker (co-founder of Napster, former president of Facebook). Prem Akkaraju, former CEO of Weta Digital, was appointed CEO, and Parker joined as Executive Chairman. The deal converted approximately $100 million of existing debt and roughly $300 million in future spending commitments, restoring solvency, with approximately $80 million in new equity bringing total funding to around $225 million. [52] On September 24, 2024, filmmaker James Cameron joined the Board of Directors, signaling a pivot toward film and entertainment industry tools. [53]
In December 2024, CEO Akkaraju reported triple-digit revenue growth year-on-year and elimination of the company's debt. In March 2025, WPP announced a strategic investment and partnership integrating Stability's visual-media models into WPP's creative platforms. [54] As of early 2026, Stability AI is privately held with approximately 190 employees, with image, video, audio, and 3D generation models in its portfolio. The company has announced neither a Stable Diffusion 4 nor any other successor in the SD line; its model effort is concentrated on SD 3.5 derivatives and on adjacent products in video, audio, and 3D.
The center of gravity for open-source image generation shifted away from Stable Diffusion during 2024-2025. Black Forest Labs' Flux.1 family (August 2024), comprising Flux.1 [pro] (commercial), [dev] (non-commercial weights), and [schnell] (Apache 2.0, distilled), set a new bar for open-source text-to-image quality, with Flux.1 [dev] largely displacing SDXL and SD 3.5 in many community workflows. [51] Closed proprietary models (DALL-E 3, Imagen 3, Midjourney 7, GPT-Image-1) continued to lead on out-of-the-box aesthetic quality and text rendering. In this landscape, Stable Diffusion 3.5 occupies a middle position: capable and freely downloadable for non-commercial and small-business use, but no longer the default open-source choice for users prioritizing quality. The SD 1.5 base remains in active community use for its low VRAM requirements and enormous LoRA ecosystem.
Stable Diffusion's place in the history of generative AI is secure for several reasons:
Whether or not Stability AI eventually produces a Stable Diffusion 4, the lineage's influence is permanent. The phrase "text-to-image" carries different connotations after August 2022 than before, both technically (assumed availability of open weights and local inference) and culturally (assumed availability of generative image tools to anyone with a consumer GPU).