Stable Diffusion 3 (SD3) is a family of text-to-image diffusion models developed by Stability AI, first announced as an early preview on February 22, 2024. The series introduced a new architecture called the Multimodal Diffusion Transformer (MMDiT), trained with a rectified flow objective, and abandoned the U-Net backbone that had defined every previous Stable Diffusion release. SD3 was meant to be the company's answer to closed competitors such as DALL-E 3 and Midjourney v6, with substantially improved prompt following, multi-subject handling, and in-image text rendering compared to Stable Diffusion XL (SDXL). [1] [2]
The public release did not go to plan. SD3 Medium (2 billion parameters) shipped on June 12, 2024, under a Stability AI Community License whose commercial terms many in the open-source community considered hostile. Within days, the model became the subject of memes about distorted human anatomy, and CivitAI, the largest hub for Stable Diffusion fine-tunes, briefly banned uploads of SD3-derived models entirely. Four months later, after a near-complete management turnover at Stability AI and the departure of most of the original Stable Diffusion research team to found Black Forest Labs, the company released the SD 3.5 family starting on October 22, 2024, with a revised license, retrained weights, and three models. [3] [4] [5]
This article covers the SD3 and SD 3.5 series together, since the two are technically continuous and the SD 3.5 release effectively superseded the original SD3 weights. It also covers the corporate context that shaped the launch, because SD3 is one of the rare cases where a model's reception was driven more by licensing and company drama than by image quality.
| Attribute | Detail |
|---|---|
| Developer | Stability AI |
| Initial preview | February 22, 2024 |
| Open weights release | June 12, 2024 (SD3 Medium) |
| Latest release | SD 3.5 Medium (October 29, 2024) |
| Architecture | Multimodal Diffusion Transformer (MMDiT) |
| Sampler objective | Rectified flow |
| Parameters | 2B (Medium) to 8B (Large) |
| Text encoders | CLIP-L, OpenCLIP-bigG, T5-XXL |
| VAE channels | 16 |
| License | Stability AI Community License |
| Predecessor | SDXL |
| Successor in lineage | Flux (by ex-Stability researchers) |
| Paper | arxiv.org/abs/2403.03206 |
| Hugging Face | huggingface.co/stabilityai/stable-diffusion-3-medium |
The period leading up to SD3 was, by the company's own later admission, chaotic. By October 2023 Stability AI was reportedly burning roughly $8 million per month on cloud compute against monthly revenue near $5.4 million. An attempted fundraise at a $4 billion valuation fell apart. In a letter to the board, Lightspeed Venture Partners said founder Emad Mostaque's management had "severely undermined" their confidence in him; Coatue Management pushed for his resignation and opened an internal review. [6]
Emad Mostaque resigned as CEO and from the board on March 23, 2024, exactly one month after SD3 was first shown to the public. He framed the move publicly as a desire to pursue "decentralized AI," though contemporary reporting described it as effectively forced. Within days, three of the four authors of the original latent diffusion paper, Robin Rombach, Andreas Blattmann, and Dominik Lorenz, also left Stability AI. They went on to co-found Black Forest Labs and ship the Flux image models, which in many respects became the spiritual successor to SD3 within the same year. [7] [8]
The SD3 paper credits a long author list, with Patrick Esser as first author and Rombach as senior author. By the time the open weights came out in June 2024, several names on that paper were no longer Stability employees.
The immediate technical motivation for SD3 was that SDXL, released in July 2023, had hit a ceiling on prompt following. SDXL's 3.5 billion-parameter U-Net produced sharp, well-composed images at 1024x1024 native resolution, but it routinely failed at compositional prompts ("a red cube on top of a blue sphere"), legible in-image text, and complex spatial relationships. Closed competitors had moved further ahead: DALL-E 3, released by OpenAI in October 2023 and integrated into ChatGPT, used a captioning rewrite step that gave it noticeably better prompt adherence; Midjourney v6, released around the same time, also rendered text reasonably well. [9]
Stability AI needed both a quality jump and an architectural story for the next round of fundraising. SD3 was supposed to provide both.
SD3 is described in the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (Esser et al., arxiv:2403.03206). The architecture differs from SDXL in three substantive ways: it uses a transformer rather than a U-Net, it trains with a rectified flow objective rather than the standard diffusion noise-prediction objective, and it conditions on three text encoders rather than two. [1]
The MMDiT block is the centerpiece. Earlier diffusion transformers (DiT, PixArt) treated image and text tokens as a single concatenated sequence fed through ordinary transformer blocks. MMDiT instead keeps two separate parameter streams, one for image tokens and one for text tokens, each with its own weight matrices for query, key, value, and feed-forward projections. The two streams meet only inside the attention operation, where their projected queries, keys, and values are concatenated, attention is computed jointly over the combined sequence, and the resulting context is split back to the two streams.
In practice this means an SD3 transformer block is roughly twice the size of a corresponding DiT block at the same depth, but the two modalities can develop their own representations rather than fighting over a shared embedding space. Esser et al. report that this separation is the single largest source of the gain in prompt adherence over earlier DiT variants. [1]
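The split-parameter, joint-attention design can be sketched in a few lines of NumPy. This is a toy, single-head sketch with made-up dimensions, not the actual SD3 implementation (which adds multi-head attention, timestep modulation, and per-stream feed-forward layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_img, n_txt = 8, 4, 3          # toy width and sequence lengths

# Each stream owns its projection weights (the "two parameter sets")
Wq_i, Wk_i, Wv_i = (rng.standard_normal((d, d)) for _ in range(3))
Wq_t, Wk_t, Wv_t = (rng.standard_normal((d, d)) for _ in range(3))

img = rng.standard_normal((n_img, d))    # image latent tokens
txt = rng.standard_normal((n_txt, d))    # text tokens

# Project each stream with its own weights, then concatenate into
# one joint sequence
q = np.concatenate([img @ Wq_i, txt @ Wq_t])
k = np.concatenate([img @ Wk_i, txt @ Wk_t])
v = np.concatenate([img @ Wv_i, txt @ Wv_t])

# A single attention pass over the combined sequence: every image token
# can attend to every text token and vice versa
ctx = softmax(q @ k.T / np.sqrt(d)) @ v

# Split the context back so each stream's own layers process its half
img_ctx, txt_ctx = ctx[:n_img], ctx[n_img:]
```

The point of the exercise is the shape of the data flow: the only place the two streams exchange information is the joint attention, and everything before and after it is duplicated per modality.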
Traditional diffusion models train on a noise schedule, learning to denoise samples drawn at randomly chosen timesteps along a curved trajectory between data and Gaussian noise. Rectified flow, introduced in 2022 by Liu et al., reformulates this as learning a constant velocity field along a straight line in the data-noise space. SD3 uses a rectified flow loss with a logit-normal weighting on the timestep, which the paper finds outperforms standard diffusion across model sizes. [1]
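The objective is compact enough to write down directly. A toy NumPy sketch of one loss evaluation, illustrative only; in SD3 the samples are VAE latents and `model` is the MMDiT, both assumptions here:

```python
import numpy as np

def rectified_flow_loss(model, x0, rng):
    """One rectified-flow training loss evaluation (toy sketch).

    x0: a batch of data samples, shape (batch, dim).
    model(x_t, t): predicts the velocity at interpolant x_t, time t.
    Convention: t = 0 is data, t = 1 is pure Gaussian noise.
    """
    x1 = rng.standard_normal(x0.shape)         # Gaussian noise endpoint
    u = rng.standard_normal((x0.shape[0], 1))  # logit-normal timesteps:
    t = 1.0 / (1.0 + np.exp(-u))               #   t = sigmoid(u), u ~ N(0, 1)
    x_t = (1.0 - t) * x0 + t * x1              # straight-line interpolant
    v_target = x1 - x0                         # constant velocity on the line
    return np.mean((model(x_t, t) - v_target) ** 2)
```

The logit-normal sampling concentrates training effort on mid-trajectory timesteps, which is the weighting the paper reports as best across model sizes.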
In inference terms this matters because rectified flow models can be sampled with relatively few steps (often 25 to 30 for high quality) and respond well to distillation, which is why Stability could ship SD3 Large Turbo and SD 3.5 Large Turbo as 4-step distilled variants without major quality collapse.
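Sampling is then just numerical integration of the learned velocity field from the noise end (t = 1) down to the data end (t = 0). A toy Euler integrator, ignoring classifier-free guidance and the other details real SD3 pipelines add:

```python
import numpy as np

def euler_sample(model, shape, steps, rng):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) to t = 0 (data)."""
    x = rng.standard_normal(shape)           # start at the noise endpoint
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * model(x, t_cur)  # t_next < t_cur
    return x
```

With the exact velocity field of a straight-line flow, even a single Euler step lands on the data point, which is the intuition behind why rectified flow models tolerate low step counts better than curved-trajectory diffusion models.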
SD 1.x and 2.x used a 4-channel VAE latent, and SDXL kept the 4-channel design. SD3 moves to a 16-channel VAE with higher reconstruction fidelity, which the paper credits for sharper detail and substantially better text rendering inside images. The 16-channel VAE is also part of why SD3 weights are not interchangeable with prior Stable Diffusion checkpoints: the latent space has different dimensionality.
SDXL used CLIP ViT-L/14 plus OpenCLIP ViT-bigG/14. SD3 keeps both of those and adds Google's T5 XXL encoder (about 4.7 billion parameters), giving the model a text encoder stack that is substantially larger than its image transformer at the Medium scale.
The T5 encoder is optional at inference. Users on tight VRAM budgets can drop T5 and run with just the two CLIP encoders. The trade-off is significant: T5 is the source of most of the long-prompt and complex-composition gains over SDXL. Without it, SD3 still beats SDXL on simple prompts but loses much of its advantage on the hard ones.
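In the Hugging Face diffusers library, dropping T5 is a matter of passing `text_encoder_3=None` at load time. A minimal sketch, assuming the `stabilityai/stable-diffusion-3-medium-diffusers` checkpoint and a CUDA GPU:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load SD3 Medium without the T5-XXL encoder, saving roughly 9 GB of
# fp16 weights at the cost of long-prompt adherence
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,   # drop T5-XXL; CLIP-L and OpenCLIP-bigG remain
    tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a red cube on top of a blue sphere",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_no_t5.png")
```

Restoring T5 is just removing the two `None` arguments; on smaller cards, `pipe.enable_model_cpu_offload()` in place of `.to("cuda")` is the usual companion trick.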
| Feature | SD 2.1 | SDXL 1.0 | SD3 Medium | SD 3.5 Large |
|---|---|---|---|---|
| Backbone | U-Net | U-Net (larger) | MMDiT | MMDiT (QK-norm) |
| Diffusion parameters | ~865M | ~3.5B (Base) + ~2.3B (Refiner) | 2B | 8B |
| Training objective | DDPM (v-prediction) | DDPM (ε-prediction) | Rectified flow | Rectified flow |
| VAE channels | 4 | 4 | 16 | 16 |
| Native resolution | 768x768 | 1024x1024 | 1024x1024 | 1024x1024 |
| Text encoder(s) | OpenCLIP ViT-H/14 | CLIP ViT-L/14 + OpenCLIP ViT-bigG/14 | CLIP ViT-L/14 + OpenCLIP ViT-bigG/14 + T5-XXL | CLIP ViT-L/14 + OpenCLIP ViT-bigG/14 + T5-XXL |
| Inference steps (typical) | 30-50 | 25-40 | 28-50 | 28-50 |
| Two-stage pipeline | No | Yes (optional Refiner) | No | No |
The SD 3.5 family does not use exactly the same blocks as SD3, and two changes are worth noting. First, the SD 3.5 models apply query-key normalization before the attention scores are computed, which reduces training instability and makes the SD 3.5 weights easier to fine-tune. Second, SD 3.5 Medium uses a variant Stability AI calls MMDiT-X, which adds self-attention modules in the first 13 transformer layers and which the company says improves multi-resolution generation. The 8B SD 3.5 Large is also wider and deeper than the 8B SD3 Large, which was never publicly released as open weights. [4]
| Variant | Parameters (diffusion model) | Release date | Open weights | Notes |
|---|---|---|---|---|
| SD3 (early preview) | Multiple sizes 0.8B-8B (paper) | February 22, 2024 | No (API only) | Closed alpha; quality previews shown in paper |
| SD3 Medium | 2B | June 12, 2024 | Yes | First open weights release; restrictive license; widely criticized for anatomy issues |
| SD3 Large | 8B | April 17, 2024 (API) | No | Available via Stability API and Fireworks AI; never open-sourced under SD3 branding |
| SD3 Large Turbo | 8B | April 17, 2024 (API) | No | 4-step distilled SD3 Large; API only |
| SD 3.5 Large | 8B | October 22, 2024 | Yes | Flagship open release; MMDiT with QK normalization |
| SD 3.5 Large Turbo | 8B | October 22, 2024 | Yes | 4-step distilled SD 3.5 Large |
| SD 3.5 Medium | 2.5B | October 29, 2024 | Yes | Slightly larger than SD3 Medium; targets consumer GPUs |
A few details are worth pinning down because they often get confused. The SD3 Large and SD3 Large Turbo weights were, as far as the public is aware, never released as open weights. Stability shipped them through the Stability API and through partner inference providers like Fireworks AI, but the open-weights story for the 8B class only began with SD 3.5 Large in October 2024. The SD3 Medium open release in June 2024 was the only SD3-branded checkpoint that ever ended up on Hugging Face for direct download. [3] [10]
SD 3.5 Medium (2.5B) is slightly larger than SD3 Medium (2B). The choice of 2.5B was driven by the desire to fit comfortably in 12 GB of VRAM with the T5 encoder offloaded to system memory.
The paper claims state-of-the-art results across three axes: prompt following, typography, and visual aesthetics, evaluated against SDXL, DALL-E 3, Midjourney v6, Ideogram v1.0, and Pixart-alpha. In Stability's own GenEval and human preference scores at preview, SD3 came out ahead on prompt following and tied or led on typography. [1]
The specific gains over SDXL fall into a few buckets: adherence to multi-subject and spatially complex prompts, legible in-image text and typography, and higher human-preference scores on overall aesthetics.
Where SD3 did not deliver was human anatomy. The June 2024 SD3 Medium release became infamous for malformed hands, elongated limbs, and the now-meme-worthy "woman lying on grass" prompts, where the model produced bodies in physically impossible configurations. The cause has been variously attributed to aggressive NSFW filtering of the training data (which removed enough human bodies that the model under-trained on them), distillation effects, and bugs in the released checkpoint. SD 3.5 substantially closes the gap, though Flux and Midjourney v7 remain stronger on photorealistic human figures as of early 2026. [3] [11]
| Date | Event |
|---|---|
| February 22, 2024 | SD3 announced via Stability AI blog post; closed early preview; waitlist opened |
| March 5, 2024 | Paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" posted on arXiv (2403.03206) |
| March 23, 2024 | Emad Mostaque resigns as CEO of Stability AI |
| March 2024 | Robin Rombach, Andreas Blattmann, Dominik Lorenz leave Stability AI |
| April 17, 2024 | SD3 API access opens through Stability Developer Platform and Fireworks AI |
| June 12, 2024 | SD3 Medium open weights released on Hugging Face under Stability AI Community License |
| June 14-16, 2024 | CivitAI bans SD3 Medium fine-tunes amid license uncertainty |
| June 17, 2024 | Stability AI publishes clarification on commercial license terms |
| June 24, 2024 | Prem Akkaraju named CEO; Sean Parker named Executive Chairman |
| August 1, 2024 | Black Forest Labs (Rombach, Blattmann, Esser) launches FLUX.1 |
| October 22, 2024 | SD 3.5 Large and SD 3.5 Large Turbo released as open weights with revised Community License |
| October 29, 2024 | SD 3.5 Medium released as open weights |
| April 2025 | SD3 (3.0) APIs deprecated; users automatically migrated to SD 3.5 |
A few of these dates are worth elaborating. The original February 22 announcement was a preview only, with API access gated behind a waitlist. The actual open-weights release came almost four months later, by which point the company had a different CEO, several departed authors, and a louder open-source community than it had bargained for. [2] [3]
The license drama around SD3 is the part of the story most people remember, and it is the most tangled to summarize accurately because the terms shifted in real time during June 2024.
When SD3 Medium open weights went live on Hugging Face on June 12, 2024, they were released under the Stability AI Community License. The license, in its initial form, contained provisions that the open-source image generation community read as substantially more restrictive than the OpenRAIL-M license that had governed Stable Diffusion 1.x and 2.x and the SDXL license. The most contested provisions, as written in the original license text and reported across community sources at the time, were:
- a $1 million annual revenue cap, above which a separate paid enterprise license was required;
- derivative-works language that could be read as restricting fine-tuned checkpoints and LoRAs trained from SD3, or as giving Stability rights over them;
- provisions that appeared to require tracking of generated outputs; and
- a termination clause that some readers interpreted as requiring deletion of SD3-derived models if the license ended.
The cumulative effect, as the community read it, was that anyone building a paid product on SD3, even a small Patreon-funded image generation site, faced uncertainty about whether they would owe Stability fees, whether their fine-tuned LoRAs were legal, and whether the license could be revoked. [12] [13]
The largest community hub for Stable Diffusion custom models, CivitAI, responded by banning SD3-based content from its platform within roughly two days of the open weights release. The ban covered uploads of SD3 Medium itself, fine-tuned SD3 checkpoints, and LoRAs targeted at SD3. CivitAI's stated reason was that the license language was unclear enough that hosting derivatives risked exposing both the platform and uploaders to retroactive license claims. The decision was widely covered in tech media and on developer forums; for several days, the consensus among major Stable Diffusion communities was effectively to pretend SD3 did not exist. [13]
Major tool maintainers were more cautious in their public statements. ComfyUI added SD3 support quickly, but several extension authors paused work pending license clarification, and the AUTOMATIC1111 web UI did not gain full SD3 support until weeks later.
Stability AI responded over the following days with a series of blog posts and license clarifications, addressing what it called "misunderstandings" of the terms. The company stated that:
- fine-tuned checkpoints and LoRAs built on SD3 were permitted;
- individual creators and businesses under the $1 million annual revenue threshold owed Stability nothing; and
- the license text would be revised to remove the ambiguous language.
Whether these clarifications matched the original license text or constituted retroactive interpretation is, frankly, a matter of opinion in the community. The text itself was edited multiple times in June 2024, which made any single "what does the license say" answer a moving target. CivitAI partially reinstated SD3 content over the following weeks once specific provisions were softened. [12] [13]
With the SD 3.5 release on October 22, 2024, Stability AI shipped a revised version of the Stability AI Community License. The October 2024 revision, which also retroactively governed SD3 Medium, made several changes that brought it closer to community expectations without making it fully open source by OSI definitions:
- free use for research, non-commercial purposes, and commercial use by individuals and organizations under $1 million in annual revenue, with an enterprise license required above that threshold;
- explicit permission to fine-tune the models and to train and distribute LoRAs; and
- removal of the output-tracking provisions.
This is the form the license has held through 2025 and into 2026. The model weights remain free for the vast majority of users; the constraint applies only to a relatively small number of medium- and large-revenue commercial deployments. Most observers, even those who criticized the original launch, regarded the revised terms as reasonable, though the Open Source Initiative does not consider them open source because of the revenue gating. [4] [12]
| Date | License version | Key terms |
|---|---|---|
| June 12, 2024 | Stability AI Community License (initial) | $1M revenue cap, ambiguous derivative training restrictions, output tracking provisions |
| June 17-25, 2024 | Stability AI Community License (clarifications) | Public statements softening derivative-training language; LoRA explicitly permitted |
| October 22, 2024 | Stability AI Community License (revised) | Free under $1M annual revenue; enterprise license required above; explicit LoRA and fine-tune permission; no output tracking |
SD 3.5, released October 22, 2024, was the company's chance to reset the SD3 story. By that point Stability had a new CEO (Prem Akkaraju), a recapitalized balance sheet, an executive chairman from Napster (Sean Parker), and the looming presence of Flux, released by ex-Stability researchers two months earlier. The SD 3.5 release had to demonstrate that the company could ship competitive open-weights image models without the original Stable Diffusion team. [4] [5]
Several things changed simultaneously, which makes attribution of the quality improvements somewhat murky:
- the architecture gained query-key normalization and related training-stability changes;
- the weights were retrained rather than patched from the SD3 Medium checkpoint;
- the research team was largely new after the spring 2024 departures; and
- the license was revised at the same time, changing how the community engaged with the model.
The SD 3.5 release page emphasized customizability and ease of fine-tuning as primary goals, an explicit acknowledgment that the original SD3 license had alienated the fine-tuning community. [4]
SD 3.5 Large Turbo is a 4-step distilled version of SD 3.5 Large, trained using Adversarial Diffusion Distillation (ADD), the same family of techniques Stability used for SDXL Turbo. At 4 inference steps, SD 3.5 Large Turbo gives roughly 7-10x faster generation than SD 3.5 Large at the cost of some prompt fidelity, especially on long T5-conditioned prompts. The Turbo variant has the same architecture and parameter count as SD 3.5 Large; it differs only in training, with a few-step student distilled adversarially from a multi-step teacher.
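In usage terms the Turbo variant differs mainly in sampler settings. A minimal diffusers sketch, assuming the `stabilityai/stable-diffusion-3.5-large-turbo` checkpoint and a CUDA GPU; because the distilled student was trained against a guided teacher, classifier-free guidance is disabled at inference:

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

# 4 steps instead of ~28, guidance off: guidance is effectively baked
# into the distilled weights
image = pipe(
    "a lighthouse on a cliff at dusk, oil painting",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("sd35_turbo.png")
```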
SD 3.5 Medium (2.5B), released a week after SD 3.5 Large on October 29, 2024, was positioned as the consumer-hardware option. It does not match SD 3.5 Large in absolute quality, but it fits comfortably in 12 GB of VRAM with the T5 encoder offloaded to system memory, and at 16 GB VRAM it can run with full T5 conditioning. QK normalization is present in both Medium and Large; the MMDiT-X changes to the early transformer layers are specific to Medium.
The SD3 Medium reception was, in a word, ugly. "Sloppy anatomy" memes proliferated on Reddit's Stable Diffusion subreddit during the week of June 12-19, 2024, with the most-shared examples being prompts like "a woman lying on grass" or "a person holding hands with another person" producing bodies that looked, charitably, like they had been assembled by someone who had never seen a human. The license backlash and the anatomy memes reinforced each other: the community read the situation as a company that had alienated its developers and shipped a buggy model on the same day. [3] [11]
Senior figures inside Stability AI publicly acknowledged the problems within days. The Decoder, citing Stability sources, reported that the company "apologized" for the disappointing release. Hanno Basse, who held an interim CEO role during the transition before Prem Akkaraju was appointed, was quoted in trade press around mid-2024 acknowledging that the launch had not gone as planned. The official Stability community statement at the time described the SD3 Medium release as a "first step" with improvements promised in subsequent versions. [3]
SD 3.5's reception was warmer but quieter. By October 2024, much of the energy that would have gone into evaluating an open-weights image model from Stability had moved to Flux, released August 2024 by ex-Stability researchers under Black Forest Labs. FLUX.1 schnell shipped under Apache 2.0 (genuinely open source by OSI standards), FLUX.1 dev had a clearer non-commercial license, and FLUX.1 pro was available through API. SD 3.5 Large was widely considered to roughly match FLUX.1 dev in image quality on many tasks, with some advantages on artistic styles and disadvantages on photorealistic humans. The community largely accepted SD 3.5 as a viable alternative without crowning it. [4] [14]
Professional reviews of SD 3.5 Large in the months following release tended to highlight its prompt-following improvements and its better-behaved fine-tuning. The flip side, repeatedly noted, was that running SD 3.5 Large with the T5 encoder requires roughly 24 GB of VRAM, which puts it in workstation-card territory; SDXL fine-tunes were comfortable on 12 GB cards.
| Model | Developer | Released | Architecture | Open weights | Known strengths |
|---|---|---|---|---|---|
| SD 3.5 Large | Stability AI | October 2024 | MMDiT-X (8B) | Yes (Community License) | Style range, prompt following, fine-tuning ecosystem |
| FLUX.1 dev | Black Forest Labs | August 2024 | Hybrid flow transformer (12B) | Yes (non-commercial) | Photorealism, in-image text, anatomy |
| FLUX.1 schnell | Black Forest Labs | August 2024 | Distilled flow transformer (12B) | Yes (Apache 2.0) | Genuinely open source; fast inference |
| FLUX.1 pro | Black Forest Labs | August 2024 | Flow transformer (12B+) | No (API) | Highest BFL quality tier |
| DALL-E 3 | OpenAI | October 2023 | Diffusion + caption rewrite | No (API) | Prompt rewriting, ChatGPT integration |
| Midjourney v6 | Midjourney Inc. | December 2023 | Closed | No (Discord/web) | Aesthetics, stylistic coherence |
| Midjourney v7 | Midjourney Inc. | April 2025 | Closed | No (Discord/web) | Refinement of v6 strengths |
| Imagen 3 | Google | August 2024 | Closed | No (Vertex AI) | Photorealism, text rendering |
| Imagen 4 | Google | 2025 | Closed | No (Vertex AI / consumer) | Sharper detail, better text |
| Ideogram v2 | Ideogram | 2024 | Closed | No (web) | In-image text rendering |
A blunt summary in early 2026: SD 3.5 is a credible open-weights model in the second tier of image quality, behind closed leaders (Midjourney v7, Imagen 4, DALL-E 3 in some categories) and behind FLUX in several open categories. Its main argument over FLUX is its larger fine-tuning ecosystem inherited from earlier Stable Diffusion versions, plus the genuine open-weights status of all SD 3.5 variants under the revised Community License. [14]
| Variant | Minimum VRAM (with T5 offloaded) | Recommended VRAM (with T5 in GPU) | System RAM | Disk |
|---|---|---|---|---|
| SD3 Medium (2B) | ~10 GB | ~12-16 GB | 16 GB+ | ~10 GB |
| SD 3.5 Medium (2.5B) | ~10 GB | ~12-16 GB | 16 GB+ | ~10 GB |
| SD 3.5 Large (8B) | ~16 GB | ~24 GB | 32 GB+ | ~16 GB |
| SD 3.5 Large Turbo (8B, 4-step) | ~16 GB | ~24 GB | 32 GB+ | ~16 GB |
The T5 XXL encoder is the largest single consumer of memory at inference. T5 XXL is roughly 4.7 billion parameters in fp16, or about 9.4 GB if loaded into VRAM directly. ComfyUI and similar tools support offloading the encoder to CPU, which lets users run T5-conditioned generation at the cost of slower text encoding. Quantized variants of the T5 encoder (8-bit and 4-bit) are widely used in community workflows to bring the memory footprint down further.
SD 3.5 Large can be run without T5 by passing only CLIP-L and OpenCLIP-bigG embeddings, but the typical recommendation is that anyone interested in SD 3.5 Large's prompt-following advantages over SDXL should keep T5 in the pipeline. The 8B image transformer in fp16 is itself about 16 GB.
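The headline numbers in this section follow from simple weight-size arithmetic. A quick sanity check, counting raw weights only; activations, text encoder outputs, and framework overhead push real usage higher:

```python
def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for raw model weights in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

print(weight_footprint_gb(4.7e9, 16))  # T5-XXL encoder in fp16 -> 9.4
print(weight_footprint_gb(8e9, 16))    # SD 3.5 Large transformer in fp16 -> 16.0
print(weight_footprint_gb(4.7e9, 8))   # 8-bit quantized T5 -> 4.7
print(weight_footprint_gb(4.7e9, 4))   # 4-bit quantized T5 -> 2.35
```

The same arithmetic explains why 4-bit T5 quantization is popular in community workflows: it turns the largest single memory consumer into roughly a 2.4 GB payload.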
The SD3 launch is one of the cleanest examples in recent AI history of a release whose reception was shaped almost as much by company drama as by model quality. The personnel and ownership changes at Stability AI during 2024 deserve a paragraph in any treatment of SD3.
| Date | Event |
|---|---|
| October 2023 | Lightspeed Venture Partners and Coatue Management push for Mostaque resignation; Stability burns $8M/month against $5.4M monthly revenue |
| February 22, 2024 | SD3 announced; closed alpha; paper not yet posted |
| March 5, 2024 | SD3 paper posted on arXiv |
| March 23, 2024 | Emad Mostaque resigns as CEO and from board |
| March 2024 | Robin Rombach, Andreas Blattmann, Dominik Lorenz leave Stability |
| March-June 2024 | Shan Shan Wong (former COO) and Christian Laforte (CTO) serve as interim co-CEOs |
| April 17, 2024 | SD3 API access opens via Stability and Fireworks AI |
| June 12, 2024 | SD3 Medium open weights released; license backlash |
| June 24, 2024 | Prem Akkaraju named CEO; Sean Parker named Executive Chairman; ~$80M raised; ~$100M debt and ~$300M supplier obligations forgiven |
| August 1, 2024 | FLUX.1 launches from Black Forest Labs (Rombach, Blattmann, Esser) |
| September 2024 | James Cameron joins Stability AI board |
| October 22, 2024 | SD 3.5 Large and Large Turbo released |
| October 29, 2024 | SD 3.5 Medium released |
| December 2024 | Akkaraju reports triple-digit revenue growth; debt eliminated |
Several of these dates are worth re-emphasizing because they line up uncomfortably. SD3 was previewed February 22; the paper went up March 5; Mostaque resigned March 23. The original SD3 research team left in waves through March. The open weights release on June 12 happened in the middle of an interim co-CEO regime, twelve days before Prem Akkaraju and Sean Parker took over and the recapitalization closed. By the time SD 3.5 launched in October, the company was operating with new leadership, James Cameron on the board, and a clear positioning around enterprise customers and entertainment-industry partnerships. [7] [15] [16]
CoreWeave, the GPU cloud company, was Stability's primary compute partner and was named in the June 2024 recapitalization announcements as having forgiven future spending commitments as part of the rescue package. The exact figure for forgiven CoreWeave-specific obligations is not public, but Fortune and Deadline both reported the combined figure of roughly $100 million in debt and $300 million in future supplier spending, with CoreWeave understood to be the largest single component on the supplier side. [15]
The departure of Robin Rombach, Andreas Blattmann, and Patrick Esser to found Black Forest Labs is critical context for SD3's trajectory. Rombach was first author on the original 2021 latent diffusion paper. Esser was first author on the SD3 paper in 2024. By the time SD3 Medium open weights shipped in June, Esser was still credited but had already physically left the company; by August 2024, FLUX.1 had launched. The same lineage that produced Stable Diffusion and SD3 immediately produced FLUX, and FLUX was widely seen as a continuation of where SD3 had been heading without the licensing friction. Black Forest Labs publicly raised $31 million at launch and went on to raise substantially more across 2024 and 2025; by early 2026 it was valued around $3.25 billion. [8] [14]