# SDXL (Stable Diffusion XL)

> Source: https://aiwiki.ai/wiki/sdxl
> Updated: 2026-06-21
> Categories: Diffusion Models, Image Generation, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**SDXL**, short for Stable Diffusion XL, is an open-weights latent text-to-image diffusion model released by [Stability AI](/wiki/stability_ai) on 26 July 2023, built around a 2.6 billion parameter U-Net backbone, two text encoders, native 1024x1024 generation, and an optional refinement model used in a two-stage "ensemble of experts" pipeline.[^1][^2] It is the third major generation of the [Stable Diffusion](/wiki/stable_diffusion) family. It pairs OpenCLIP ViT-bigG/14 with the OpenAI CLIP ViT-L/14 text encoder and uses a U-Net that the SDXL paper describes as "three times larger" than the SD 1.5 and 2.x backbones.[^1] The text encoders add roughly 817 million parameters, and the complete base-plus-refiner pipeline reaches approximately 6.6 billion parameters, making SDXL the largest open-access [latent diffusion](/wiki/latent_diffusion) image model at the time of its release.[^1][^2][^3] SDXL set the dominant baseline for the open-source image-generation community throughout 2023 and 2024 before being succeeded by [Stable Diffusion 3](/wiki/stable_diffusion_3) in 2024 and [Stable Diffusion 3.5](/wiki/stable_diffusion_3_5) later that year.[^4][^5]

| Field | Value |
| --- | --- |
| Developer | [Stability AI](/wiki/stability_ai) (with CompVis and Runway lineage) |
| Initial release | 26 July 2023 (SDXL 1.0)[^2] |
| Paper | "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis", arXiv:2307.01952[^1] |
| Authors | Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach[^1] |
| Architecture | Latent diffusion model with [U-Net](/wiki/unet) backbone and dual text encoders[^1] |
| Base U-Net parameters | ~2.6 billion[^1] |
| Text encoder parameters | ~817 million combined[^1] |
| Total pipeline parameters | ~6.6 billion (base + refiner)[^2] |
| Native resolution | 1024x1024 (with multi-aspect buckets)[^1] |
| Text encoders | OpenCLIP ViT-bigG/14 and OpenAI [CLIP](/wiki/clip) ViT-L/14[^3] |
| Base license | CreativeML Open RAIL++-M[^2][^3] |
| Turbo license | sai-nc-community (non-commercial)[^6] |
| Successors | [Stable Diffusion 3](/wiki/stable_diffusion_3) (2024), [Stable Diffusion 3.5](/wiki/stable_diffusion_3_5) (2024)[^4][^5] |

## What is SDXL?

SDXL is an open-weights latent text-to-image diffusion model that, in the words of its own paper, leverages "a three times larger UNet backbone" than earlier Stable Diffusion versions, with the increase "mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder".[^1] It generates images directly at 1024x1024 (and at non-square resolution buckets), pairs a base model with an optional refiner, and is distributed under a permissive open license, which together made it the default open-source image model of 2023 and 2024. The single most-cited fact about SDXL is its scale: a 2.6 billion parameter U-Net, 817 million parameters of combined text encoders, and a roughly 6.6 billion parameter base-plus-refiner ensemble, all runnable on a single consumer GPU.[^1][^2]

## Background

The Stable Diffusion family originates from the "High-Resolution Image Synthesis with Latent Diffusion Models" paper by Robin Rombach and collaborators at the CompVis group of LMU Munich, released in late 2021 and presented at CVPR 2022.[^7] The latent diffusion approach compresses RGB images into a lower-dimensional latent space using a variational autoencoder (VAE), then trains a U-Net denoiser to invert a Gaussian noising process in that latent space, conditioning each step on a text embedding produced by a frozen [CLIP](/wiki/clip) encoder.[^7] [Stability AI](/wiki/stability_ai) sponsored compute for the original Stable Diffusion checkpoints and released the first public weights in August 2022 under the CreativeML Open RAIL-M license, kicking off an extensive third-party ecosystem.[^8]

Two Stable Diffusion generations preceded SDXL. Stable Diffusion 1.4 and 1.5 used an 860 million parameter U-Net trained at 512x512 with a single OpenAI ViT-L/14 text encoder; the 1.5 checkpoint, released by Runway in October 2022, became the de facto community baseline. Stable Diffusion 2.0 and 2.1, released in late 2022, swapped the text encoder for OpenCLIP ViT-H/14 and trained at higher resolutions but were less broadly adopted because the new conditioning behaved differently from 1.5 prompts and because content filtering in the LAION training subset altered the model's aesthetic.[^8] SDXL was designed to address these gaps by scaling the U-Net more aggressively, combining the two text encoders, training natively at 1024x1024, and addressing several artifacts (cropped subjects, low aesthetic scoring) traced back to data handling in earlier releases.[^1]

## When was SDXL released?

The SDXL paper, "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis", was first posted to arXiv on 4 July 2023 (arXiv:2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna and Robin Rombach.[^1] The 1.0 weights followed three weeks later, on 26 July 2023, distributed through Hugging Face under CreativeML Open RAIL++-M alongside availability on Clipdrop, the Stability AI Platform API, AWS SageMaker, AWS Bedrock, the Stable Foundation Discord and DreamStudio.[^2][^3] The distilled SDXL Turbo followed on 28 November 2023, and ByteDance's SDXL Lightning on 21 February 2024.[^6][^14]

## How does SDXL work?

### Latent diffusion backbone

SDXL inherits the structural template of a [latent diffusion model](/wiki/latent_diffusion): an encoder compresses pixels into a latent representation eight times smaller per spatial dimension, a U-Net denoiser operates iteratively on those latents, and a decoder reconstructs pixels.[^1][^7] What changed in SDXL is the size and arrangement of the U-Net and conditioning signals rather than the overall paradigm. The paper describes the new U-Net as "three times larger" than the SD 1.5 and 2.x backbones, with the increase concentrated in additional transformer blocks and a wider cross-attention context.[^1] Specifically, the SDXL U-Net omits transformer blocks at the highest-resolution feature level (for a 1024x1024 image, the 128x128 latent still receives convolutions there), removes the lowest (8x downsampled) level altogether, and concentrates many more transformer blocks at the lower-resolution levels, where global semantics are formed.[^1] The base model has roughly 2.6 billion parameters in the U-Net alone.[^1]

The VAE that maps between pixels and latents keeps the autoencoder architecture of earlier Stable Diffusion releases but was retrained with a larger batch size and exponential-moving-average weight tracking; it still compresses a 1024x1024 RGB image to a 128x128x4 latent tensor.[^1][^9] Because the original SDXL VAE produces very large internal activations, naive fp16 inference can overflow to NaN values; a widely used community fix released by Olin (madebyollin/sdxl-vae-fp16-fix) fine-tunes the VAE so that its internal weights, biases and activations are scaled down while the decoded output stays nearly unchanged, allowing fp16 decoding with only slight differences from the original VAE.[^9]
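The shape arithmetic in the two paragraphs above can be checked with a few lines. This is a minimal sketch, assuming only the 8x spatial downsampling and 4 latent channels stated in the text; the function names `latent_shape` and `compression_ratio` are hypothetical and not part of any SDXL library.

```python
def latent_shape(height: int, width: int, factor: int = 8, channels: int = 4):
    """Return (channels, latent_height, latent_width) for an RGB image of the given size."""
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return channels, height // factor, width // factor


def compression_ratio(height: int, width: int) -> float:
    """Ratio of RGB values in pixel space to values in the latent tensor."""
    c, h, w = latent_shape(height, width)
    return (3 * height * width) / (c * h * w)


print(latent_shape(1024, 1024))       # (4, 128, 128)
print(compression_ratio(1024, 1024))  # 48.0
```

Under these assumptions the denoiser works on 48 times fewer values than a pixel-space model would at the same output size.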

### Why does SDXL use two text encoders?

A central architectural change in SDXL is the simultaneous use of two pretrained text encoders. The first is OpenAI's [CLIP](/wiki/clip) ViT-L/14, the same encoder used in SD 1.5; the second is the much larger OpenCLIP ViT-bigG/14 model trained by the LAION community.[^1][^3] At inference time both encoders process the prompt independently, and their penultimate-layer token features are concatenated along the channel dimension before being fed into the U-Net via [cross-attention](/wiki/cross_attention). Additionally, the pooled output of OpenCLIP ViT-bigG/14 is added to the time-step embedding as a global conditioning signal.[^1] The two encoders total roughly 817 million parameters and produce a combined 2048-channel cross-attention context (768 channels from CLIP ViT-L/14 plus 1280 from OpenCLIP ViT-bigG/14), wider than the 768-channel and 1024-channel contexts of SD 1.5 and SD 2.x, which the paper credits with improving prompt following and complex-scene handling.[^1]
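The channel-wise concatenation described above is easiest to see as a shape exercise. The sketch below uses placeholder numbers instead of real encoder outputs; the 77-token context length and the 768 and 1280 feature widths are taken from the public CLIP and OpenCLIP model configurations rather than from this article, and the helper names are hypothetical.

```python
SEQ_LEN = 77           # CLIP context length in tokens (assumed from public configs)
DIM_CLIP_L = 768       # feature width of CLIP ViT-L/14 text tokens
DIM_OPENCLIP_G = 1280  # feature width of OpenCLIP ViT-bigG/14 text tokens


def fake_features(seq_len, dim, fill):
    """Stand-in for an encoder's penultimate-layer output: seq_len rows of dim floats."""
    return [[fill] * dim for _ in range(seq_len)]


def concat_channels(a, b):
    """Concatenate two token sequences of equal length along the channel axis."""
    if len(a) != len(b):
        raise ValueError("both encoders must see the same number of tokens")
    return [row_a + row_b for row_a, row_b in zip(a, b)]


clip_l = fake_features(SEQ_LEN, DIM_CLIP_L, 0.1)
openclip_g = fake_features(SEQ_LEN, DIM_OPENCLIP_G, 0.2)
context = concat_channels(clip_l, openclip_g)  # what cross-attention would consume
pooled_g = [0.2] * DIM_OPENCLIP_G              # global vector for the timestep path

print(len(context), len(context[0]))  # 77 2048
print(len(pooled_g))                  # 1280
```

Each token keeps its position; only the per-token feature vector grows, so the U-Net's cross-attention layers see one sequence with wider features rather than two separate sequences.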

### Micro-conditioning on size, crop and aspect ratio

Section 2.2 of the SDXL paper introduces three explicit conditioning signals collectively referred to as "micro-conditioning", which the model receives in addition to the text embedding and timestep.[^1][^10]

* **Original image size.** During training each example is annotated with the pixel resolution of the original source image. At inference the user can specify the value to use; setting the original size below the training crop tends to produce softer, more blurred outputs, while specifying realistic high-resolution values triggers crisper outputs.[^1][^10]
* **Crop coordinates.** Each training image is associated with the top-left coordinates of the crop applied during preprocessing. Earlier Stable Diffusion versions silently discarded the crop offset, which the SDXL authors identified as the cause of subjects whose heads or limbs were chopped off the generated frame. By conditioning on these offsets, SDXL learns to generate cropped imagery only when the user explicitly requests it; the recommended default at inference is (0, 0), corresponding to well-centered framing.[^1][^10]
* **Target size / aspect ratio.** Training images are grouped into "resolution buckets" (e.g., 1024x1024, 1152x896, 896x1152, 1216x832, and so on) so that batches contain images of identical shape. The bucket dimensions are passed to the model as an additional conditioning signal, allowing SDXL to generate non-square images at inference without the distortions that affected earlier Stable Diffusion versions trained primarily on square crops.[^1][^10]
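Choosing a target size for an arbitrary request usually means snapping to the closest bucket. The following is a sketch of one plausible selection rule (closest aspect ratio on a log scale), not the procedure used in training. The first four `(width, height)` pairs come from the list above; the remaining entries and the `nearest_bucket` helper are assumptions made for illustration.

```python
import math

# (width, height) pairs. The first four appear in the text above; the rest are
# assumed here for illustration and may not match the training list exactly.
BUCKETS = [
    (1024, 1024), (1152, 896), (896, 1152), (1216, 832),
    (832, 1216), (1344, 768), (768, 1344), (1536, 640), (640, 1536),
]


def nearest_bucket(width: int, height: int, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest to width/height on a log scale."""
    target = math.log(width / height)
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


print(nearest_bucket(1920, 1080))  # (1344, 768)
print(nearest_bucket(1080, 1350))  # (896, 1152)
```

Comparing ratios in log space treats a 2:1 landscape request and a 1:2 portrait request symmetrically, which a plain difference of ratios would not.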

All three signals are encoded with sinusoidal embeddings, concatenated, and added to the conditioning stream alongside the timestep and pooled OpenCLIP-G feature.[^1]
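A rough picture of that encoding: each of the six integers (two per signal) is turned into a sinusoidal feature vector and the results are concatenated. The sketch below assumes 256 features per value, a maximum period of 10000, and a sine-then-cosine layout; these choices and both function names are illustrative assumptions, not details confirmed by the paper.

```python
import math


def sinusoidal_embedding(value: float, dim: int = 256, max_period: float = 10000.0):
    """Embed one scalar as [sin..., cos...] features of geometrically spaced frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def micro_conditioning(original_size, crop_top_left, target_size, dim_per_value: int = 256):
    """Concatenate the embeddings of all six integers into one conditioning vector."""
    values = [*original_size, *crop_top_left, *target_size]
    vector = []
    for v in values:
        vector.extend(sinusoidal_embedding(v, dim_per_value))
    return vector


cond = micro_conditioning(
    original_size=(1024, 1024), crop_top_left=(0, 0), target_size=(1024, 1024)
)
print(len(cond))  # 1536
```

With these assumed sizes the micro-conditioning vector has 6 x 256 = 1536 entries; joining it with a 1280-dimensional pooled text feature would give 2816 values before projection into the timestep embedding width.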

### What does the SDXL refiner do?

In addition to the base model, the SDXL release includes a separate "refiner" U-Net specialized for the low-noise end of the diffusion schedule.[^1][^2] The refiner operates on latents produced by the base model and is intended to be applied to roughly the final 20% of the denoising schedule (the paper specializes it on the first 200 discrete noise scales, that is, the lowest-noise end) using an SDEdit-style image-to-image pass.[^1][^11] Because both models share a VAE and operate in the same latent space, the handoff is implemented as a continuation rather than an encode-decode round trip. The refiner uses a smaller U-Net than the base model and is conditioned only by the OpenCLIP ViT-bigG/14 encoder, which simplifies its conditioning interface.[^11] Stability AI describes the combined system as an "ensemble of experts" pipeline, contrasting it with a single end-to-end model.[^2] In practice many downstream users skip the refiner because the base model alone produces acceptable output and adding the refiner roughly doubles inference time.
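The handoff can be pictured as cutting one descending noise schedule into two parts. This is a minimal sketch, assuming 1,000 training timesteps, an evenly spaced sampler schedule, and a handoff after 80% of the schedule (in line with the roughly 20% refiner share given in the variants table). The `split_schedule` helper and its `handoff` parameter are hypothetical; the `diffusers` documentation describes comparable `denoising_end` and `denoising_start` arguments.

```python
def split_schedule(num_steps: int, handoff: float = 0.8, train_timesteps: int = 1000):
    """Split an evenly spaced descending timestep schedule between base and refiner.

    `handoff` is the fraction of the noise schedule handled by the base model.
    """
    stride = train_timesteps // num_steps
    timesteps = [i * stride for i in reversed(range(num_steps))]  # high noise -> low noise
    cutoff = round(train_timesteps * (1 - handoff))
    base = [t for t in timesteps if t >= cutoff]
    refiner = [t for t in timesteps if t < cutoff]
    return base, refiner


base_steps, refiner_steps = split_schedule(40, handoff=0.8)
print(len(base_steps), len(refiner_steps))  # 32 8
print(refiner_steps[0])                     # 175
```

With 40 sampler steps the base model would run 32 of them and pass its still slightly noisy latents to the refiner for the last 8, all of which fall below timestep 200.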

## How was SDXL trained?

The SDXL paper describes a multi-stage training procedure for the base model. Pretraining is performed on a large internal dataset at progressively higher resolutions (256x256, 512x512, then 1024x1024), with the resolution-bucket aspect-ratio system applied throughout the 1024x1024 stage.[^1] After pretraining, the authors apply a final fine-tuning pass on a curated subset of higher-aesthetic-quality images, which is described as analogous to the supervised fine-tuning step in instruction-tuned language models.[^1] The size and crop signals are present from the start of training, and the target-size signal is introduced with the multi-aspect stage, so that the model learns to treat all three as bona fide control inputs rather than nuisance metadata.[^1]

The refiner is trained as a separate model on the same latent VAE but specialized to a high-quality, low-noise distribution. The paper notes that the refiner is intended to sharpen local details such as hands and faces rather than to alter scene composition.[^1] Stability AI has not released the full training dataset, citing both legal and operational reasons, although the use of LAION-derived data is widely documented and the company has been a defendant in legal action over training-data sourcing.[^12]

## What SDXL variants were released?

### SDXL Base 1.0 and Refiner 1.0

Both the base and refiner checkpoints, totaling around 6.6 billion parameters, were released together on 26 July 2023 through Stability AI's Hugging Face repositories (`stabilityai/stable-diffusion-xl-base-1.0` and `stabilityai/stable-diffusion-xl-refiner-1.0`).[^2][^3] The base checkpoint can be used standalone, and several inference frameworks expose the refiner as an optional second stage.[^3] Both checkpoints are released under the **CreativeML Open RAIL++-M License**, which permits commercial use subject to a use-based restriction clause prohibiting certain harmful applications.[^3]

### SDXL Turbo

On 28 November 2023, Stability AI released SDXL Turbo, a distilled variant of SDXL 1.0 capable of producing usable 512x512 images in a single sampling step.[^6][^13] Turbo is based on the Adversarial Diffusion Distillation (ADD) technique introduced in "Adversarial Diffusion Distillation" by Axel Sauer, Dominik Lorenz, Andreas Blattmann and Robin Rombach (arXiv:2311.17042, submitted 28 November 2023).[^13] ADD combines score-distillation against the original SDXL teacher with an adversarial loss that uses a [CLIP](/wiki/clip)-initialized discriminator, allowing the student model to retain sample fidelity in the one-to-four-step regime where pure score-distillation methods often produce blurry outputs.[^13][^6] Stability AI reports that, "on an A100, SDXL Turbo generates a 512x512 image in 207ms", of which 67 milliseconds correspond to the single U-Net forward pass.[^6] In blind human evaluation Stability AI reported that 1-step Turbo outperformed 4-step LCM-XL outputs, and 4-step Turbo matched 50-step base SDXL.[^6] SDXL Turbo is distributed under the bespoke **sai-nc-community** license, which permits personal, research, educational and other non-commercial use; commercial deployment requires a paid Stability AI membership.[^6]
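The two reported timings allow a back-of-the-envelope latency estimate for multi-step sampling. The sketch below assumes that everything other than the U-Net forward pass (prompt encoding, VAE decoding and similar overhead) costs the same regardless of step count; that is a simplification, and the results are estimates rather than measured figures.

```python
TOTAL_MS_ONE_STEP = 207  # reported end-to-end time for one 512x512 image on an A100
UNET_MS_PER_STEP = 67    # reported time for a single U-Net forward pass


def estimated_latency_ms(steps: int) -> int:
    """Estimate end-to-end latency, assuming only the U-Net cost grows with step count."""
    fixed_overhead = TOTAL_MS_ONE_STEP - UNET_MS_PER_STEP  # text encoding, VAE decode, etc.
    return fixed_overhead + steps * UNET_MS_PER_STEP


for steps in (1, 2, 4):
    ms = estimated_latency_ms(steps)
    print(steps, ms, round(1000 / ms, 2))
# 1 207 4.83
# 2 274 3.65
# 4 408 2.45
```

Under this assumption the fixed overhead (140 ms) dominates at one step, so moving from one to four steps roughly doubles latency rather than quadrupling it.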

### SDXL Lightning

In February 2024, ByteDance researchers Shanchuan Lin, Anran Wang and Xiao Yang published "SDXL-Lightning: Progressive Adversarial Diffusion Distillation" (arXiv:2402.13929, initial version 21 February 2024) along with checkpoints on Hugging Face under the `ByteDance/SDXL-Lightning` repository.[^14][^15] The technique combines progressive distillation, in which a student model is repeatedly trained to reproduce its teacher's output in fewer sampling steps, with an adversarial objective implemented in latent space using the pretrained SDXL U-Net encoder as the discriminator backbone.[^14] Distilled checkpoints are provided for 1-step, 2-step, 4-step and 8-step generation, in both full U-Net and LoRA formats, and operate at native SDXL 1024x1024 resolution rather than the 512x512 of Turbo.[^15] The SDXL Lightning weights are released under the **openrail++** license, permitting commercial use under the use-based restrictions inherited from CreativeML Open RAIL++.[^15]

### Comparison of SDXL variants

| Variant | Release | Steps for typical output | Resolution | Distillation method | License |
| --- | --- | --- | --- | --- | --- |
| SDXL Base 1.0 | 26 July 2023[^2] | ~25 to 50 | 1024x1024 | None (teacher) | CreativeML Open RAIL++-M[^3] |
| SDXL Refiner 1.0 | 26 July 2023[^2] | Used for final ~20% of steps | 1024x1024 | None (paired with base) | CreativeML Open RAIL++-M[^3] |
| SDXL Turbo | 28 November 2023[^6][^13] | 1 to 4 | 512x512[^6] | Adversarial Diffusion Distillation[^13] | sai-nc-community (non-commercial)[^6] |
| SDXL Lightning | 21 February 2024[^14] | 1, 2, 4, 8 | 1024x1024[^15] | Progressive Adversarial Diffusion Distillation[^14] | openrail++[^15] |

## What is the SDXL ecosystem?

### Fine-tunes, LoRAs and DreamBooth

SDXL became the dominant base for community-trained checkpoints from late 2023 onward. Fine-tunes such as Juggernaut XL, RealVisXL and Pony Diffusion V6 XL achieved millions of downloads on the [Civitai](/wiki/civitai) model-sharing platform, in many cases adapting SDXL toward photorealism, anime aesthetics or other stylistic priorities.[^16] Because the base model is large, full fine-tuning is expensive, so most community variants are trained either through DreamBooth, the subject-personalization technique introduced by Ruiz et al., or through low-rank adaptation using the [LoRA](/wiki/lora) method, which keeps the base weights frozen and trains a small additional set of parameters; the two are frequently combined.[^16][^17] [Hugging Face](/wiki/hugging_face) published an official DreamBooth training script for SDXL in `diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py`, and tools such as Kohya SS GUI and OneTrainer enabled DreamBooth fine-tuning of the full SDXL U-Net on consumer GPUs through gradient checkpointing and 8-bit optimizers.[^17]
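Why low-rank adaptation is so much cheaper than full fine-tuning follows from a parameter count. The sketch below applies the standard LoRA update, W + (alpha / r) * B A, to a toy matrix and counts trainable parameters for a single projection. The 1280 x 1280 layer size and rank 16 are hypothetical values chosen for illustration, and the helper functions are not taken from any training script.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matmul(a, b):
    """Multiply two small matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, alpha: float, rank: int):
    """Return W + (alpha / rank) * B @ A without modifying the frozen W."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Parameter savings for a hypothetical 1280 x 1280 projection at rank 16.
full = 1280 * 1280
low_rank = lora_param_count(1280, 1280, rank=16)
print(full, low_rank, round(low_rank / full, 4))  # 1638400 40960 0.025

# Tiny numeric example: a 2 x 2 frozen weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]  # d_out x rank
A = [[0.5, 0.5]]    # rank x d_in
print(apply_lora(W, B, A, alpha=1.0, rank=1))  # [[1.5, 0.5], [1.0, 2.0]]
```

In this illustration the adapter trains 2.5% of the parameters of the matrix it modifies, and the frozen weight can be shared across many adapters.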

### ControlNet for SDXL

[ControlNet](/wiki/controlnet), a conditioning mechanism originally introduced for Stable Diffusion 1.5 by Lvmin Zhang and Maneesh Agrawala, was extended to SDXL during the second half of 2023. [Hugging Face](/wiki/hugging_face) released first-party SDXL ControlNet weights in canny, depth and Zoe-depth variants under the `diffusers/controlnet-*-sdxl-1.0` repositories, including "mid" and "small" distilled versions that reduce the ControlNet adapter to roughly one-fifth or one-seventh of the original parameter count.[^18] Community contributors such as `thibaud` and `xinsir` released additional SDXL ControlNets for OpenPose, line-art, scribble, and tile conditioning, and `xinsir`'s "ControlNet Union" attempted to consolidate multiple conditioning modalities into a single adapter.[^18]

### IP-Adapter and image prompts

The IP-Adapter family from Tencent AI Lab, introduced in 2023, provides lightweight image-prompt conditioning by adding decoupled cross-attention layers that consume embeddings from a CLIP image encoder.[^19] SDXL versions of IP-Adapter (including the Plus variant for higher fidelity and the FaceID series for identity preservation) were widely adopted in the [ComfyUI](/wiki/comfyui) and [AUTOMATIC1111](/wiki/automatic1111) communities for tasks such as style transfer and reference-image-driven generation. InstantID, a related 2024 technique, combines an IP-Adapter-style face embedder with an SDXL [ControlNet](/wiki/controlnet) to inject facial identity into generated images while preserving prompt-driven composition.[^19]
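Decoupled cross-attention can be sketched as two attention branches that share the same queries and are summed. The toy example below works on precomputed keys and values and omits the learned projection matrices that a real adapter would train; `image_scale` and the function names are assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def decoupled_cross_attention(q, text_kv, image_kv, image_scale=1.0):
    """Sum a text cross-attention branch and a separate image branch."""
    text_out = attention(q, *text_kv)
    image_out = attention(q, *image_kv)
    return [[t + image_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]


q = [[1.0, 0.0]]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])  # (keys, values)
image_kv = ([[0.0, 1.0]], [[2.0, 2.0]])
print(decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.0))
print(decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.5))
```

Setting `image_scale` to zero recovers plain text conditioning, which is why this kind of adapter can be attached to an unchanged base model and dialed up or down at inference time.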

### Inference frameworks

SDXL is supported by every major Stable Diffusion inference frontend. [AUTOMATIC1111](/wiki/automatic1111) added SDXL support in version 1.5.0, released in late July 2023 at about the same time as the official weights, with subsequent versions improving VRAM efficiency.[^20] [ComfyUI](/wiki/comfyui), a node-based interface developed by `comfyanonymous`, was an early SDXL adopter and is widely credited with making the base-plus-refiner pipeline practical to compose; ComfyUI workflows for SDXL with IP-Adapter, multiple LoRAs and ControlNet have been shared widely on [Civitai](/wiki/civitai).[^16][^20] InvokeAI, Fooocus (developed by Lvmin Zhang of [ControlNet](/wiki/controlnet) fame, designed specifically around SDXL with curated default settings), Draw Things (an iOS and macOS client by Liu Liu), and the Hugging Face `diffusers` library all shipped SDXL pipelines within weeks of the model's release.[^20] On the model-hub side, `stabilityai/stable-diffusion-xl-base-1.0` and `stabilityai/sdxl-turbo` together accumulated tens of millions of downloads on Hugging Face during 2023 and 2024, while [Civitai](/wiki/civitai) hosted thousands of SDXL fine-tunes, LoRAs and embeddings.[^3][^6][^16]

## What is SDXL used for?

Across 2023 and 2024, SDXL became the default open-weights base model for a large fraction of professional and hobbyist image-generation work. The model's 1024x1024 native resolution, dual text encoders and improved prompt following make it broadly suitable for editorial illustration, concept art for games and film, product visualization, character design, and stock imagery generation. The micro-conditioning mechanism allows precise control over framing and resolution without resorting to inpainting, which made SDXL particularly useful in workflow tools that target non-square outputs (mobile wallpapers, social-media banners, print).[^1][^10]

In professional pipelines the base model is often paired with [ControlNet](/wiki/controlnet) for layout control, with IP-Adapter for style or identity reference, and with task-specific [LoRA](/wiki/lora) adapters for proprietary characters or branded aesthetics; the existence of dedicated SDXL inpainting checkpoints (such as `diffusers/stable-diffusion-xl-1.0-inpainting-0.1`) extends the same model to localized editing tasks.[^18] On the inference-cost side, the distilled SDXL Turbo and SDXL Lightning variants reduced per-image latency by roughly an order of magnitude relative to the base model, making interactive applications (live drawing tools, real-time previews, prompt-and-iterate UX) feasible on a single consumer GPU.[^6][^14] Drawing applications such as Krita's AI Diffusion plugin and Photoshop's Generative Workspace plugins, browser-based playgrounds, and on-device mobile clients all integrated SDXL or its distilled variants in this period.[^16][^20]

## How was SDXL received?

SDXL was met with broadly positive technical reception. Independent reviewers noted clear improvements over SD 1.5 and 2.x on prompt adherence, anatomy, composition and resolution, and the model became a frequent benchmark target for subsequent academic papers on diffusion distillation, controllable generation and personalization.[^1][^13][^14] Within the open-source community, SDXL's adoption was accelerated by Stability AI's relatively permissive CreativeML Open RAIL++-M license, the maturation of [AUTOMATIC1111](/wiki/automatic1111) and [ComfyUI](/wiki/comfyui) as widely used frontends, and the rapid growth of [Civitai](/wiki/civitai) as a model and workflow hub.[^16][^20]

At the same time, the rollout coincided with broader public concern about generative-image abuse and copyright disputes. Stability AI faced ongoing litigation from Getty Images and from a group of artists alleging that LAION-derived training data infringed copyrighted works; while the legal questions were not specific to SDXL, the model's prominence in the open-source ecosystem placed it at the center of those debates.[^12] Stability AI's own corporate situation became turbulent in early 2024, with founder Emad Mostaque resigning as CEO in March 2024, key research staff (including several SDXL co-authors) departing for other organizations, and the company taking on new investment from a consortium led by Sean Parker later that year. SDXL's continued availability under an open license and its widespread mirroring on Hugging Face and [Civitai](/wiki/civitai) mean that downstream use has continued largely uninterrupted regardless of corporate turbulence.

The dataset used for SDXL pretraining has not been fully disclosed, which has drawn criticism from researchers who view reproducibility as a precondition for open science. Stability AI has cited a combination of legal exposure and contractual constraints when declining to release the dataset in detail.

## What are the limitations of SDXL?

The SDXL paper and the public Hugging Face model card explicitly document several known limitations.[^1][^3]

* **Photorealistic faces and hands.** Although clearly improved over SD 1.5, SDXL still produces incorrect anatomical detail in close-up faces and hands at a non-trivial rate; community fine-tunes and post-processing detailers (such as ADetailer) are routinely used to mitigate these artifacts.[^3]
* **Legible text.** SDXL cannot reliably render legible in-image text, especially for words longer than a few characters; this limitation was a key motivator for the [Stable Diffusion 3](/wiki/stable_diffusion_3) redesign, which added a T5 text encoder specifically to improve text rendering.[^3][^4]
* **Compositional prompts.** Long, complex prompts that require precise spatial relationships among multiple objects sometimes fail in predictable ways, including object-attribute leakage (e.g., the color attribute of one subject being applied to another). The dual-encoder design partially mitigates this relative to SD 2.x but does not eliminate it.[^1][^3]
* **Photorealism vs. base aesthetic.** The base SDXL model has an "illustration-leaning" aesthetic that some users find unsuitable for hyper-realistic outputs; this drove the popularity of photorealism-targeted community fine-tunes such as RealVisXL and Juggernaut XL.[^16]
* **Bias and representational harms.** The Hugging Face model card warns that SDXL inherits demographic and cultural biases from its training data, that it is not designed to generate factual likenesses of real people, and that users must comply with Stability AI's acceptable-use policy.[^3]
* **VRAM and latency.** The full 6.6 billion parameter base-plus-refiner pipeline is significantly heavier than SD 1.5 and requires roughly 8 GB of VRAM at fp16 with offloading, which is one reason distilled variants (Turbo, Lightning) saw rapid uptake on consumer hardware.[^2]
* **Refiner overhead.** Because the refiner roughly doubles inference time without consistently improving output quality, many community workflows omit it entirely; this in turn means the deployed parameter count in practice is closer to the base 2.6 billion U-Net than to the headline 6.6 billion ensemble.
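The VRAM point can be made concrete with a weights-only estimate from the parameter counts quoted in this article. This is a rough sketch, assuming 2 bytes per parameter at fp16 and ignoring activations, the VAE and framework overhead, so real usage is higher; `weight_gib` is a hypothetical helper.

```python
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB (2 bytes per parameter for fp16)."""
    return num_params * bytes_per_param / 2**30


components = {
    "base U-Net (~2.6B)": 2.6e9,
    "text encoders (~817M)": 817e6,
    "base + refiner (~6.6B)": 6.6e9,
}
for name, params in components.items():
    print(f"{name}: {weight_gib(params):.2f} GiB")
# base U-Net (~2.6B): 4.84 GiB
# text encoders (~817M): 1.52 GiB
# base + refiner (~6.6B): 12.29 GiB
```

By this estimate the full ensemble does not fit in 8 GB at fp16 if every component stays resident, which is consistent with the reliance on offloading noted in the list.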

The SDXL Turbo model card adds further explicit caveats: Turbo is restricted to a 512x512 output resolution, cannot render legible text, often produces low-quality faces and people, and inherits lossy VAE autoencoding artifacts.[^6]

## How does SDXL compare with related models?

| Model | Release | Parameters (U-Net or transformer) | Architecture | Native resolution | Text encoder(s) |
| --- | --- | --- | --- | --- | --- |
| SD 1.5 | October 2022[^8] | ~860 million | [U-Net](/wiki/unet) latent diffusion | 512x512 | [CLIP](/wiki/clip) ViT-L/14[^8] |
| SD 2.1 | December 2022[^8] | ~865 million | U-Net latent diffusion | 768x768 | OpenCLIP ViT-H/14[^8] |
| SDXL 1.0 | July 2023[^2] | ~2.6 billion U-Net (6.6 B with refiner) | U-Net latent diffusion + refiner | 1024x1024 | [CLIP](/wiki/clip) ViT-L/14 + OpenCLIP ViT-bigG/14[^1] |
| SDXL Turbo | November 2023[^6] | ~2.6 billion (distilled) | U-Net latent diffusion | 512x512 | Same as SDXL[^6] |
| SDXL Lightning | February 2024[^14] | ~2.6 billion (distilled) | U-Net latent diffusion | 1024x1024 | Same as SDXL[^14] |
| [Stable Diffusion 3](/wiki/stable_diffusion_3) | June 2024 (weights)[^4] | 2 B and 8 B variants | MMDiT [Diffusion Transformer](/wiki/diffusion_transformer) | up to 1024x1024 | Two [CLIP](/wiki/clip) encoders + T5-XXL[^4] |
| [Stable Diffusion 3.5](/wiki/stable_diffusion_3_5) Large | 22 October 2024[^5] | 8.1 billion | MMDiT [Diffusion Transformer](/wiki/diffusion_transformer) | up to ~1 megapixel | Two CLIP encoders + T5-XXL[^5] |
| [Stable Diffusion 3.5](/wiki/stable_diffusion_3_5) Medium | 29 October 2024[^5] | 2.5 billion | MMDiT [Diffusion Transformer](/wiki/diffusion_transformer) | up to ~2 megapixels[^5] | Two CLIP encoders + T5-XXL[^5] |

Successor architectures, including [Stable Diffusion 3](/wiki/stable_diffusion_3) and [Stable Diffusion 3.5](/wiki/stable_diffusion_3_5), replaced the U-Net with a Multimodal Diffusion Transformer (MMDiT) backbone and added a T5-XXL text encoder for improved text-in-image rendering, complex prompt following and typographic fidelity.[^4][^5] The 3.5 release also returned to a more permissive licensing structure (the Stability AI Community License, free for noncommercial use and commercial use up to one million USD in annual revenue), which Stability AI positioned as an evolution of the RAIL++ approach used for SDXL.[^5] Outside the Stable Diffusion lineage, Black Forest Labs' [FLUX.1](/wiki/flux_1) series (released in August 2024 by a team whose members include Robin Rombach and several other SDXL co-authors) became a major non-Stability competitor, also using a transformer-based diffusion backbone with T5 text conditioning.

## Why is SDXL significant?

SDXL is widely regarded as the inflection point at which open-weights text-to-image models reached visual parity with closed commercial systems such as [Midjourney](/wiki/midjourney) v5 and [DALL-E](/wiki/dall-e) 3 for a broad set of prompts, while remaining feasible to run locally on a single consumer GPU.[^1][^16] Its release accelerated several trends that came to define the open generative-image ecosystem in 2023 and 2024: the central role of [Hugging Face](/wiki/hugging_face) as a model hub, the rise of [Civitai](/wiki/civitai) as a community marketplace for fine-tunes and LoRAs, the maturation of node-based interfaces ([ComfyUI](/wiki/comfyui)) for advanced workflows, and the academic interest in diffusion distillation methods such as ADD and progressive adversarial distillation.[^13][^14][^16][^20] The micro-conditioning mechanism introduced in SDXL has been reused or adapted in several follow-on papers and toolchains. Even after the architectural switch to MMDiT in [Stable Diffusion 3](/wiki/stable_diffusion_3), much of the practical tooling (samplers, LoRA training scripts, ControlNet adapters) developed for SDXL remained in active use through 2025 and into 2026.

## See also

* [Stable Diffusion](/wiki/stable_diffusion)
* [Stable Diffusion 3](/wiki/stable_diffusion_3)
* [Stable Diffusion 3.5](/wiki/stable_diffusion_3_5)
* [Latent diffusion model](/wiki/latent_diffusion)
* [Diffusion model](/wiki/diffusion_model)
* [U-Net](/wiki/unet)
* [CLIP (Contrastive Language-Image Pre-training)](/wiki/clip)
* [LoRA (Low-Rank Adaptation)](/wiki/lora)
* [ControlNet](/wiki/controlnet)
* [Civitai](/wiki/civitai)
* [AUTOMATIC1111](/wiki/automatic1111)
* [ComfyUI](/wiki/comfyui)
* [Stability AI](/wiki/stability_ai)
* [Clipdrop](/wiki/clipdrop)
* [FLUX.1](/wiki/flux_1)
* [Midjourney](/wiki/midjourney)
* [DALL-E](/wiki/dall-e)
* [Diffusion Transformer (DiT)](/wiki/diffusion_transformer)
* [Cross-attention](/wiki/cross_attention)
* [Hugging Face](/wiki/hugging_face)

## References

[^1]: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach, "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis", arXiv, 2023-07-04. https://arxiv.org/abs/2307.01952. Accessed 2026-06-22.
[^2]: Stability AI, "Announcing SDXL 1.0", Stability AI News, 2023-07-26. https://stability.ai/news-updates/stable-diffusion-sdxl-1-announcement. Accessed 2026-06-22.
[^3]: Stability AI, "stabilityai/stable-diffusion-xl-base-1.0 (model card)", Hugging Face, 2023-07-26. https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0. Accessed 2026-06-22.
[^4]: Stability AI, "Stable Diffusion 3: Research Paper", Stability AI News, 2024-03-05. https://stability.ai/news/stable-diffusion-3-research-paper. Accessed 2026-06-22.
[^5]: Stability AI, "Introducing Stable Diffusion 3.5", Stability AI News, 2024-10-22. https://stability.ai/news-updates/introducing-stable-diffusion-3-5. Accessed 2026-06-22.
[^6]: Stability AI, "stabilityai/sdxl-turbo (model card)" and "Introducing SDXL Turbo: A Real-Time Text-to-Image Generation Model", Hugging Face / Stability AI News, 2023-11-28. https://huggingface.co/stabilityai/sdxl-turbo. Accessed 2026-06-22.
[^7]: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models", arXiv, 2021-12-20. https://arxiv.org/abs/2112.10752. Accessed 2026-06-22.
[^8]: Wikipedia contributors, "Stable Diffusion", Wikipedia, 2026-04-30. https://en.wikipedia.org/wiki/Stable_Diffusion. Accessed 2026-06-22.
[^9]: Olin (madebyollin), "sdxl-vae-fp16-fix (model card)", Hugging Face, 2023-07-28. https://huggingface.co/madebyollin/sdxl-vae-fp16-fix. Accessed 2026-06-22.
[^10]: Hugging Face, "Stable Diffusion XL (Diffusers documentation)", Hugging Face Docs, 2023-07-26. https://huggingface.co/docs/diffusers/en/api/pipelines/stable_diffusion/stable_diffusion_xl. Accessed 2026-06-22.
[^11]: Stability AI, "stabilityai/stable-diffusion-xl-refiner-1.0 (model card)", Hugging Face, 2023-07-26. https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0. Accessed 2026-06-22.
[^12]: Reuters, "Getty Images lawsuit against Stability AI to go to trial in UK", Reuters, 2023-12-01. https://www.reuters.com/legal/litigation/getty-images-lawsuit-against-stability-ai-go-trial-uk-2023-12-01/. Accessed 2026-06-22.
[^13]: Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach, "Adversarial Diffusion Distillation", arXiv, 2023-11-28. https://arxiv.org/abs/2311.17042. Accessed 2026-06-22.
[^14]: Shanchuan Lin, Anran Wang, Xiao Yang, "SDXL-Lightning: Progressive Adversarial Diffusion Distillation", arXiv, 2024-02-21. https://arxiv.org/abs/2402.13929. Accessed 2026-06-22.
[^15]: ByteDance, "ByteDance/SDXL-Lightning (model card)", Hugging Face, 2024-02-21. https://huggingface.co/ByteDance/SDXL-Lightning. Accessed 2026-06-22.
[^16]: Andrew Z., "Stable Diffusion XL 1.0 model", Stable Diffusion Art, 2023-08-15. https://stable-diffusion-art.com/sdxl-model/. Accessed 2026-06-22.
[^17]: Hugging Face, "DreamBooth training example for Stable Diffusion XL", GitHub (huggingface/diffusers), 2023-09-12. https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sdxl.py. Accessed 2026-06-22.
[^18]: Hugging Face, "ControlNet with Stable Diffusion XL (Diffusers documentation)", Hugging Face Docs, 2023-09-04. https://huggingface.co/docs/diffusers/en/api/pipelines/controlnet_sdxl. Accessed 2026-06-22.
[^19]: Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang, "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models", arXiv, 2023-08-13. https://arxiv.org/abs/2308.06721. Accessed 2026-06-22.
[^20]: AUTOMATIC1111, "AUTOMATIC1111/stable-diffusion-webui v1.5.0 release notes", GitHub, 2023-07-25. https://github.com/AUTOMATIC1111/stable-diffusion-webui/releases/tag/v1.5.0. Accessed 2026-06-22.