SDXL (Stable Diffusion XL)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,236 words
SDXL, short for Stable Diffusion XL, is an open-weights latent text-to-image diffusion model released by Stability AI in July 2023. It represents the third major generation of the Stable Diffusion family and introduced a substantially larger U-Net backbone, a dual text-encoder configuration combining OpenCLIP ViT-bigG/14 with the OpenAI CLIP ViT-L/14, native 1024x1024 training, and a separate refinement model used in a two-stage "ensemble of experts" pipeline.[1][2] The base U-Net contains roughly 2.6 billion parameters, with the complete two-model pipeline reaching approximately 6.6 billion parameters, making SDXL the largest open-access latent diffusion image model at the time of its release.[2][3] SDXL set the dominant baseline for the open-source image-generation community throughout 2023 and 2024 before being succeeded by Stable Diffusion 3 in 2024 and Stable Diffusion 3.5 later that year.[4][5]
| Field | Value |
|---|---|
| Developer | Stability AI (with CompVis and Runway lineage) |
| Initial release | 26 July 2023 (SDXL 1.0)[2] |
| Paper | "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis", arXiv:2307.01952[1] |
| Authors | Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach[1] |
| Architecture | Latent diffusion model with U-Net backbone and dual text encoders[1] |
| Base U-Net parameters | ~2.6 billion[1] |
| Total pipeline parameters | ~6.6 billion (base + refiner)[2] |
| Native resolution | 1024x1024 (with multi-aspect buckets)[1] |
| Text encoders | OpenCLIP ViT-bigG/14 and OpenAI CLIP ViT-L/14[3] |
| Base license | CreativeML Open RAIL++-M[2][3] |
| Turbo license | sai-nc-community (non-commercial)[6] |
| Successors | Stable Diffusion 3 (2024), Stable Diffusion 3.5 (2024)[4][5] |
The Stable Diffusion family originates from the "High-Resolution Image Synthesis with Latent Diffusion Models" paper by Robin Rombach and collaborators at the CompVis group of LMU Munich, released in late 2021 and presented at CVPR 2022.[7] The latent diffusion approach compresses RGB images into a lower-dimensional latent space using a variational autoencoder (VAE), then trains a U-Net denoiser to invert a Gaussian noising process in that latent space, conditioning each step on a text embedding produced by a frozen CLIP encoder.[7] Stability AI sponsored compute for the original Stable Diffusion checkpoints and released the first public weights in August 2022 under the CreativeML Open RAIL-M license, kicking off an extensive third-party ecosystem.[8]
Two earlier Stable Diffusion generations preceded SDXL. Stable Diffusion 1.4 and 1.5 used an 860-million-parameter U-Net trained at 512x512 with a single OpenAI CLIP ViT-L/14 text encoder; the 1.5 checkpoint, released by Runway in October 2022, became the de facto community baseline. Stable Diffusion 2.0 and 2.1, released in late 2022, swapped the text encoder for OpenCLIP ViT-H/14 and trained at higher resolutions but were less broadly adopted because the new conditioning behaved differently from 1.5 prompts and because content filtering in the LAION training subset altered the model's aesthetic.[8] SDXL was designed to address these gaps by scaling the U-Net more aggressively, combining two text encoders, training natively at 1024x1024, and addressing several artifacts (cropped subjects, low aesthetic scoring) traced back to data handling in earlier releases.[1]
The SDXL paper, "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis", was first posted to arXiv on 4 July 2023 (arXiv:2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna and Robin Rombach.[1] The 1.0 weights followed three weeks later, on 26 July 2023, distributed through Hugging Face under CreativeML Open RAIL++-M alongside availability on Clipdrop, the Stability AI Platform API, AWS SageMaker, AWS Bedrock, the Stable Foundation Discord and DreamStudio.[2][3]
SDXL inherits the structural template of a latent diffusion model: an encoder compresses pixels into a latent representation eight times smaller per spatial dimension, a U-Net denoiser operates iteratively on those latents, and a decoder reconstructs pixels.[1][7] What changed in SDXL is the size and arrangement of the U-Net and conditioning signals rather than the overall paradigm. The paper describes the new U-Net as "three times larger" than the SD 1.5 and 2.x backbones, with the increase concentrated in additional transformer blocks and a wider cross-attention context.[1] Specifically, the SDXL U-Net omits transformer blocks at the highest-resolution feature level (that level still receives convolutions), removes the lowest-resolution (8x downsampled) level altogether, and places many more transformer blocks at the remaining lower-resolution levels, where global semantics are formed.[1] The base model has roughly 2.6 billion parameters in the U-Net alone.[1]
The VAE that maps between pixels and latents uses the same architecture as the earlier Stable Diffusion VAE but was retrained with a larger batch size and exponential-moving-average weight tracking; it still compresses a 1024x1024 RGB image to a 128x128x4 latent tensor.[9] Because internal activations in the original SDXL VAE grow too large for half precision, naive fp16 inference can overflow to NaN values; a widely used community fix released by Ollin (madebyollin/sdxl-vae-fp16-fix) fine-tunes the VAE so that its output stays nearly the same while weights and biases are scaled down to keep activations within fp16 range, at the cost of slight differences in the decoded images.[9]
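The shape arithmetic above can be checked with a short sketch. The function names are hypothetical and introduced only for illustration; the 8x spatial factor and 4 latent channels are the figures quoted in the text.

```python
from typing import Tuple


def sdxl_latent_shape(height: int, width: int,
                      downsample: int = 8, channels: int = 4) -> Tuple[int, int, int]:
    """Return (latent_height, latent_width, channels) for an RGB image.

    Assumes the 8x per-side compression and 4 latent channels described above.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsample factor")
    return (height // downsample, width // downsample, channels)


def compression_ratio(height: int, width: int) -> float:
    """Ratio of RGB values in the image to values in the latent tensor."""
    lat_h, lat_w, lat_c = sdxl_latent_shape(height, width)
    return (height * width * 3) / (lat_h * lat_w * lat_c)


print(sdxl_latent_shape(1024, 1024))   # (128, 128, 4)
print(compression_ratio(1024, 1024))   # 48.0
```

Under these assumptions the denoiser works on 48 times fewer values than a pixel-space model would at the same resolution.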
A central architectural change in SDXL is the simultaneous use of two pretrained text encoders. The first is OpenAI's CLIP ViT-L/14, the same encoder used in SD 1.5; the second is the much larger OpenCLIP ViT-bigG/14 model trained by the LAION community.[1][3] At inference time both encoders process the prompt independently, and their penultimate-layer token features are concatenated along the channel dimension before being fed into the U-Net via cross-attention. Additionally, the pooled output of OpenCLIP ViT-bigG/14 is added to the time-step embedding as a global conditioning signal.[1] The combined text embedding therefore has substantially higher dimensionality than the single-encoder embeddings used by earlier Stable Diffusion versions, which the paper credits with improving prompt following and complex-scene handling.[1]
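As a rough illustration of the channel-wise concatenation described above, the sketch below joins two per-token feature lists. The hidden widths (768 for CLIP ViT-L/14, 1280 for OpenCLIP ViT-bigG/14) and the 77-token context length are not stated in this article; they are taken here as assumptions based on the published encoder configurations, and the feature values are placeholders rather than real encoder outputs.

```python
from typing import List

CLIP_L_DIM = 768       # assumed hidden width of the CLIP ViT-L/14 text encoder
OPENCLIP_G_DIM = 1280  # assumed hidden width of the OpenCLIP ViT-bigG/14 text encoder
NUM_TOKENS = 77        # assumed CLIP context length


def concat_token_features(feats_l: List[List[float]],
                          feats_g: List[List[float]]) -> List[List[float]]:
    """Concatenate two per-token feature sequences along the channel axis."""
    if len(feats_l) != len(feats_g):
        raise ValueError("both encoders must produce the same number of tokens")
    return [tok_l + tok_g for tok_l, tok_g in zip(feats_l, feats_g)]


# Placeholder features standing in for penultimate-layer encoder outputs.
feats_l = [[0.0] * CLIP_L_DIM for _ in range(NUM_TOKENS)]
feats_g = [[1.0] * OPENCLIP_G_DIM for _ in range(NUM_TOKENS)]

context = concat_token_features(feats_l, feats_g)
print(len(context), len(context[0]))  # 77 2048
```

With these widths the cross-attention context is 2048 channels per token, compared with 768 for a model that uses CLIP ViT-L/14 alone.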
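Section 2.2 of the SDXL paper introduces three explicit conditioning signals collectively referred to as "micro-conditioning", which the model receives in addition to the text embedding and timestep: the original size of the training image, the crop coordinates applied to it, and the target size of the output.[1][10]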
All three signals are encoded with sinusoidal embeddings, concatenated, and added to the conditioning stream alongside the timestep and pooled OpenCLIP-G feature.[1]
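A minimal sketch of that encoding step, under stated assumptions: each of the six scalars (original height and width, crop top and left, target height and width) is given its own sinusoidal embedding, and the results are concatenated. The 256-wide embedding, the `max_period` constant, and the function names are assumptions for illustration; the learned projection that would map the result into the timestep-embedding space is not shown.

```python
import math
from typing import List, Tuple


def sinusoidal_embedding(value: float, dim: int = 256,
                         max_period: float = 10000.0) -> List[float]:
    """Fourier-feature embedding of one scalar: [sin(v*f_i)..., cos(v*f_i)...]."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def micro_conditioning_vector(original_size: Tuple[int, int],
                              crop_top_left: Tuple[int, int],
                              target_size: Tuple[int, int],
                              dim: int = 256) -> List[float]:
    """Embed the six micro-conditioning scalars and concatenate them."""
    scalars = [*original_size, *crop_top_left, *target_size]
    vector: List[float] = []
    for scalar in scalars:
        vector.extend(sinusoidal_embedding(float(scalar), dim))
    return vector


vec = micro_conditioning_vector((1024, 1024), (0, 0), (1024, 1024))
print(len(vec))  # 1536
```

At inference time, passing crop coordinates of (0, 0) and an original size equal to the target size corresponds to asking for an uncropped, full-resolution image.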
In addition to the base model, the SDXL release includes a separate "refiner" U-Net specialized for the low-noise end of the diffusion schedule.[1][2] The refiner operates on latents produced by the base model and is intended to be applied to roughly the final 20% of the denoising schedule (the 200 or so lowest noise levels out of 1,000), using an SDEdit-style image-to-image pass.[1][11] Because both models share a VAE and operate in the same latent space, the handoff is implemented as a continuation rather than an encode-decode round trip. The refiner uses a smaller U-Net than the base model and is conditioned only by the OpenCLIP ViT-bigG/14 encoder, which simplifies its conditioning interface.[11] Stability AI describes the combined system as an "ensemble of experts" pipeline, contrasting it with a single end-to-end model.[2] In practice many downstream users skip the refiner because the base model alone produces acceptable output and the refiner is a second model to load and run, which increases memory use and inference time.
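One way to picture the handoff is as a split of the sampler's timestep list at a noise-level cutoff. The sketch below is illustrative only: `split_schedule` is a hypothetical helper, the 1,000-level discretization is the usual Stable Diffusion convention assumed here, and the 0.8 handoff fraction mirrors the "final ~20%" figure rather than any fixed requirement.

```python
from typing import List, Tuple


def split_schedule(timesteps: List[int], handoff_fraction: float = 0.8,
                   num_train_timesteps: int = 1000) -> Tuple[List[int], List[int]]:
    """Split descending sampler timesteps between the base model and the refiner.

    The base model denoises every timestep at or above the cutoff; the refiner
    continues on the same latents for the remaining low-noise timesteps.
    """
    if not 0.0 < handoff_fraction < 1.0:
        raise ValueError("handoff_fraction must be strictly between 0 and 1")
    cutoff = round(num_train_timesteps * (1.0 - handoff_fraction))
    base = [t for t in timesteps if t >= cutoff]
    refiner = [t for t in timesteps if t < cutoff]
    return base, refiner


# 40 evenly spaced timesteps from 975 down to 0.
timesteps = list(range(975, -1, -25))
base_steps, refiner_steps = split_schedule(timesteps)
print(len(base_steps), len(refiner_steps))  # 32 8
```

Because the refiner only takes over the tail of the schedule, no decode and re-encode through the VAE is needed between the two stages.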
The SDXL paper describes a multi-stage training procedure for the base model. Pretraining is performed on a large internal dataset at progressively higher resolutions (256x256, 512x512, then 1024x1024), with the resolution-bucket aspect-ratio system applied throughout the 1024x1024 stage.[1] After pretraining, the authors apply a final fine-tuning pass on a curated subset of higher-aesthetic-quality images, which is described as analogous to the supervised fine-tuning step in instruction-tuned language models.[1] The micro-conditioning signals (original size, crop coordinates, target size) are present from the start of training so that the model learns to treat them as bona fide control inputs rather than nuisance metadata.[1]
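The aspect-ratio bucketing idea can be sketched as follows. This is an illustration under assumptions, not the paper's exact procedure: the bucket-generation rule (sides in multiples of 64 between 512 and 2048, area as close as possible to 1024x1024) and both function names are hypothetical, and the paper publishes its own bucket table.

```python
import math
from typing import List, Tuple


def make_buckets(target_area: int = 1024 * 1024, step: int = 64,
                 min_side: int = 512, max_side: int = 2048) -> List[Tuple[int, int]]:
    """Build (height, width) buckets whose pixel count stays near target_area."""
    buckets = []
    for height in range(min_side, max_side + 1, step):
        # Choose the width (a multiple of `step`) that best preserves the area.
        width = round(target_area / height / step) * step
        if min_side <= width <= max_side:
            buckets.append((height, width))
    return buckets


def nearest_bucket(height: int, width: int,
                   buckets: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Pick the bucket whose aspect ratio is closest on a log scale."""
    ratio = math.log(height / width)
    return min(buckets, key=lambda hw: abs(math.log(hw[0] / hw[1]) - ratio))


buckets = make_buckets()
print(nearest_bucket(1080, 1920, buckets))  # (768, 1344)
print(nearest_bucket(1024, 1024, buckets))  # (1024, 1024)
```

During training, each image would be resized and cropped to its bucket, with every batch drawn from a single bucket so that all tensors in the batch share one shape.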
The refiner is trained as a separate model on the same latent VAE but specialized to a high-quality, low-noise distribution. The paper notes that the refiner is intended to sharpen local details such as hands and faces rather than to alter scene composition.[1] Stability AI has not released the full training dataset, citing both legal and operational reasons, although the use of LAION-derived data is widely documented and the company has been a defendant in legal action over training-data sourcing.[12]
Both the base and refiner checkpoints, totaling around 6.6 billion parameters, were released together on 26 July 2023 through Stability AI's Hugging Face repositories (stabilityai/stable-diffusion-xl-base-1.0 and stabilityai/stable-diffusion-xl-refiner-1.0).[2][3] The base checkpoint can be used standalone, and several inference frameworks expose the refiner as an optional second stage.[3] Both checkpoints are released under the CreativeML Open RAIL++-M License, which permits commercial use subject to a use-based restriction clause prohibiting certain harmful applications.[3]
On 28 November 2023, Stability AI released SDXL Turbo, a distilled variant of SDXL 1.0 capable of producing usable 512x512 images in a single sampling step.[6][13] Turbo is trained with Adversarial Diffusion Distillation (ADD), introduced in the paper of the same name by Axel Sauer, Dominik Lorenz, Andreas Blattmann and Robin Rombach (arXiv:2311.17042, submitted 28 November 2023).[13] ADD combines score distillation against the original SDXL teacher with an adversarial loss from a discriminator built on a frozen pretrained vision-transformer feature network, allowing the student model to retain sample fidelity in the one-to-four-step regime where pure score-distillation methods often produce blurry outputs.[13][6] Stability AI reports that, on an NVIDIA A100, Turbo generates a 512x512 image in 207 milliseconds, of which 67 milliseconds correspond to the U-Net forward pass.[6] In blind human evaluation Stability AI reported that 1-step Turbo outperformed 4-step LCM-XL outputs, and 4-step Turbo matched 50-step base SDXL.[6] SDXL Turbo is distributed under the bespoke sai-nc-community license, which permits personal, research, educational and other non-commercial use; commercial deployment requires a paid Stability AI membership.[6]
In February 2024, ByteDance researchers Shanchuan Lin, Anran Wang and Xiao Yang published "SDXL-Lightning: Progressive Adversarial Diffusion Distillation" (arXiv:2402.13929, initial version 21 February 2024) along with checkpoints on Hugging Face under the ByteDance/SDXL-Lightning repository.[14][15] The technique combines progressive distillation, in which the student is trained in stages to match the teacher in successively fewer sampling steps, with an adversarial objective implemented in latent space using the pretrained SDXL U-Net encoder as the discriminator backbone.[14] Distilled checkpoints are provided for 1-step, 2-step, 4-step and 8-step generation, in both full U-Net and LoRA formats, and operate at native SDXL 1024x1024 resolution rather than the 512x512 of Turbo.[15] The SDXL Lightning weights are released under the openrail++ license, permitting commercial use under the use-based restrictions inherited from CreativeML Open RAIL++.[15]
| Variant | Release | Steps for typical output | Resolution | Distillation method | License |
|---|---|---|---|---|---|
| SDXL Base 1.0 | 26 July 2023[2] | ~25 to 50 | 1024x1024 | None (teacher) | CreativeML Open RAIL++-M[3] |
| SDXL Refiner 1.0 | 26 July 2023[2] | Used for final ~20% of steps | 1024x1024 | None (paired with base) | CreativeML Open RAIL++-M[3] |
| SDXL Turbo | 28 November 2023[6][13] | 1 to 4 | 512x512[6] | Adversarial Diffusion Distillation[13] | sai-nc-community (non-commercial)[6] |
| SDXL Lightning | 21 February 2024[14] | 1, 2, 4, 8 | 1024x1024[15] | Progressive Adversarial Diffusion Distillation[14] | openrail++[15] |
SDXL became the dominant base for community-trained checkpoints from late 2023 onward. Fine-tunes such as Juggernaut XL, RealVisXL and Pony Diffusion V6 XL achieved millions of downloads on the Civitai model-sharing platform, in many cases adapting SDXL toward photorealism, anime aesthetics or other stylistic priorities.[16] Because the base model is large, full fine-tuning is expensive, so most community variants are trained through DreamBooth, the subject-personalization technique introduced by Ruiz et al., through low-rank adaptation (LoRA), which keeps the pretrained weights frozen and trains a small additional set of parameters, or through a combination of the two.[16][17] Hugging Face published an official DreamBooth LoRA training script for SDXL in diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py, and tools such as Kohya SS GUI and OneTrainer enabled DreamBooth fine-tuning of the full SDXL U-Net on consumer GPUs through gradient checkpointing and 8-bit optimizers.[17]
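The low-rank idea behind LoRA can be illustrated with a small sketch: the pretrained weight matrix W stays frozen and only two thin matrices B and A are trained, giving an effective weight W + (alpha / rank) * B A. The helper names, the toy matrices and the 1280x1280 layer size are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, lora_b: Matrix, lora_a: Matrix,
                          alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, where W is frozen and B, A are trained."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_count(rows: int, cols: int, rank: int) -> int:
    """Trainable parameters added by one rank-`rank` LoRA pair."""
    return rank * (rows + cols)


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (toy 2x2 example)
lora_b = [[1.0], [2.0]]        # trained, shape 2 x rank
lora_a = [[3.0, 4.0]]          # trained, shape rank x 2
print(lora_effective_weight(w, lora_b, lora_a, alpha=1.0, rank=1))
# [[4.0, 4.0], [6.0, 9.0]]

# Hypothetical 1280x1280 projection: full fine-tuning vs. a rank-8 LoRA.
print(1280 * 1280, lora_param_count(1280, 1280, rank=8))  # 1638400 20480
```

In this hypothetical layer the adapter trains 80 times fewer parameters than full fine-tuning would, which is why LoRA files are small enough to share and stack.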
ControlNet, a conditioning mechanism originally introduced for Stable Diffusion 1.5 by Lvmin Zhang and Maneesh Agrawala, was extended to SDXL during the second half of 2023. Hugging Face released first-party SDXL ControlNet weights in canny, depth and Zoe-depth variants under the diffusers/controlnet-*-sdxl-1.0 repositories, including "mid" and "small" distilled versions that reduce the ControlNet adapter to roughly one-fifth or one-seventh of the original parameter count.[18] Community contributors such as thibaud and xinsir released additional SDXL ControlNets for OpenPose, line-art, scribble, and tile conditioning, and xinsir's "ControlNet Union" attempted to consolidate multiple conditioning modalities into a single adapter.[18]
The IP-Adapter family from Tencent AI Lab, introduced in 2023, provides lightweight image-prompt conditioning by adding decoupled cross-attention layers that consume embeddings from a CLIP image encoder. SDXL versions of IP-Adapter (including the Plus variant for higher fidelity and the FaceID series for identity preservation) were widely adopted in the ComfyUI and AUTOMATIC1111 communities for tasks such as style transfer and reference-image-driven generation. InstantID, a related 2024 technique, combines an IP-Adapter-style face embedder with an SDXL ControlNet to inject facial identity into generated images while preserving prompt-driven composition.[19]
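Decoupled cross-attention can be sketched as two attention passes that share the same queries, one over text features and one over image features, with the results summed. The code below is a simplified illustration: the function names are hypothetical, and keys and values are passed in directly, whereas an actual adapter would derive them from the text and image embeddings through separate learned projections.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention for small nested-list matrices."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def decoupled_cross_attention(queries: Matrix,
                              text_k: Matrix, text_v: Matrix,
                              image_k: Matrix, image_v: Matrix,
                              image_scale: float = 1.0) -> Matrix:
    """Sum text attention and (scaled) image attention over shared queries."""
    text_out = attention(queries, text_k, text_v)
    image_out = attention(queries, image_k, image_v)
    return [[t + image_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]


queries = [[1.0, 0.0]]
result = decoupled_cross_attention(
    queries,
    text_k=[[1.0, 0.0]], text_v=[[2.0, 3.0]],
    image_k=[[0.0, 1.0]], image_v=[[10.0, 20.0]],
    image_scale=0.5,
)
print(result)  # [[7.0, 13.0]]
```

Setting `image_scale` to 0 recovers plain text cross-attention, which is how such an adapter's strength can be dialed down without touching the base model.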
SDXL is supported by every major Stable Diffusion inference frontend. AUTOMATIC1111 added SDXL support in version 1.5.0, released in late July 2023 shortly after the official weights drop, with subsequent versions improving VRAM efficiency.[20] ComfyUI, a node-based interface developed by comfyanonymous, was an early SDXL adopter and is widely credited with making the base-plus-refiner pipeline practical to compose; ComfyUI workflows for SDXL with IP-Adapter, multiple LoRAs and ControlNet have been shared widely on Civitai.[16][20] InvokeAI, Fooocus (developed by Lvmin Zhang of ControlNet fame, designed specifically around SDXL with curated default settings), Draw Things (an iOS and macOS client by Liu Liu), and the Hugging Face diffusers library all shipped SDXL pipelines within weeks of the model's release.[20] On the model-hub side, stabilityai/stable-diffusion-xl-base-1.0 and stabilityai/sdxl-turbo together accumulated tens of millions of downloads on Hugging Face during 2023 and 2024, while Civitai hosted thousands of SDXL fine-tunes, LoRAs and embeddings.[3][6][16]
Across 2023 and 2024, SDXL became the default open-weights base model for a large fraction of professional and hobbyist image-generation work. The model's 1024x1024 native resolution, dual text encoders and improved prompt following make it broadly suitable for editorial illustration, concept art for games and film, product visualization, character design, and stock imagery generation. The micro-conditioning mechanism allows precise control over framing and resolution without resorting to inpainting, which made SDXL particularly useful in workflow tools that target non-square outputs (mobile wallpapers, social-media banners, print).[1][10]
In professional pipelines the base model is often paired with ControlNet for layout control, with IP-Adapter for style or identity reference, and with task-specific LoRA adapters for proprietary characters or branded aesthetics; the existence of dedicated SDXL inpainting checkpoints (such as diffusers/stable-diffusion-xl-1.0-inpainting-0.1) extends the same model to localized editing tasks.[18] On the inference-cost side, the distilled SDXL Turbo and SDXL Lightning variants reduced per-image latency by roughly an order of magnitude relative to the base model, making interactive applications (live drawing tools, real-time previews, prompt-and-iterate UX) feasible on a single consumer GPU.[6][14] Drawing applications such as Krita's AI Diffusion plugin and Photoshop's Generative Workspace plugins, browser-based playgrounds, and on-device mobile clients all integrated SDXL or its distilled variants in this period.[16][20]
SDXL was met with broadly positive technical reception. Independent reviewers noted clear improvements over SD 1.5 and 2.x on prompt adherence, anatomy, composition and resolution, and the model became a frequent benchmark target for subsequent academic papers on diffusion distillation, controllable generation and personalization.[1][13][14] Within the open-source community, SDXL's adoption was accelerated by Stability AI's relatively permissive CreativeML Open RAIL++-M license, the maturation of AUTOMATIC1111 and ComfyUI as widely used frontends, and the rapid growth of Civitai as a model and workflow hub.[16][20]
At the same time, the rollout coincided with broader public concern about generative-image abuse and copyright disputes. Stability AI faced ongoing litigation from Getty Images and from a group of artists alleging that LAION-derived training data infringed copyrighted works; while the legal questions were not specific to SDXL, the model's prominence in the open-source ecosystem placed it at the center of those debates.[12] Stability AI's own corporate situation became turbulent in early 2024, with founder Emad Mostaque resigning as CEO in March 2024, key research staff (including several SDXL co-authors) departing for other organizations, and the company taking on new investment from a consortium led by Sean Parker later that year. SDXL's continued availability under an open license and its widespread mirroring on Hugging Face and Civitai mean that downstream use has continued largely uninterrupted regardless of corporate turbulence.
The dataset used for SDXL pretraining has not been fully disclosed, which has drawn criticism from researchers who view reproducibility as a precondition for open science. Stability AI has cited a combination of legal exposure and contractual constraints when declining to release the dataset in detail.
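The SDXL paper and the public Hugging Face model card explicitly document several known limitations: the model does not achieve perfect photorealism, cannot render legible text, struggles with compositional prompts such as "a red cube on top of a blue sphere", may render faces and people poorly, and relies on a lossy autoencoder.[1][3]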
The SDXL Turbo model card adds further explicit caveats: Turbo is restricted to a 512x512 output resolution, cannot render legible text, often produces low-quality faces and people, and inherits lossy VAE autoencoding artifacts.[6]
| Model | Release | Parameters (U-Net or transformer) | Architecture | Native resolution | Text encoder(s) |
|---|---|---|---|---|---|
| SD 1.5 | October 2022[8] | ~860 million | U-Net latent diffusion | 512x512 | CLIP ViT-L/14[8] |
| SD 2.1 | December 2022[8] | ~865 million | U-Net latent diffusion | 768x768 | OpenCLIP ViT-H/14[8] |
| SDXL 1.0 | July 2023[2] | ~2.6 billion U-Net (6.6 B with refiner) | U-Net latent diffusion + refiner | 1024x1024 | CLIP ViT-L/14 + OpenCLIP ViT-bigG/14[1] |
| SDXL Turbo | November 2023[6] | ~2.6 billion (distilled) | U-Net latent diffusion | 512x512 | Same as SDXL[6] |
| SDXL Lightning | February 2024[14] | ~2.6 billion (distilled) | U-Net latent diffusion | 1024x1024 | Same as SDXL[14] |
| Stable Diffusion 3 | June 2024 (weights)[4] | 2 B and 8 B variants | MMDiT Diffusion Transformer | up to 1024x1024 | Two CLIP encoders + T5-XXL[4] |
| Stable Diffusion 3.5 Large | 22 October 2024[5] | 8.1 billion | MMDiT Diffusion Transformer | up to ~1 megapixel | Two CLIP encoders + T5-XXL[5] |
| Stable Diffusion 3.5 Medium | 29 October 2024[5] | 2.5 billion | MMDiT Diffusion Transformer | up to ~2 megapixels[5] | Two CLIP encoders + T5-XXL[5] |
Successor architectures, including Stable Diffusion 3 and Stable Diffusion 3.5, replaced the U-Net with a Multimodal Diffusion Transformer (MMDiT) backbone and added a T5-XXL text encoder for improved text-in-image rendering, complex prompt following and typographic fidelity.[4][5] The 3.5 release also returned to a more permissive licensing structure (the Stability AI Community License, free for noncommercial use and for commercial use up to one million USD in annual revenue), which Stability AI positioned as an evolution of the RAIL++ approach used for SDXL.[5] Outside the Stable Diffusion lineage, Black Forest Labs' FLUX.1 series (released in August 2024 by a team that includes several SDXL co-authors, among them Robin Rombach) became a major non-Stability competitor, also using a transformer-based diffusion backbone with T5 text conditioning.
SDXL is widely regarded as the inflection point at which open-weights text-to-image models reached visual parity with closed commercial systems such as Midjourney v5 and DALL-E 3 for a broad set of prompts, while remaining feasible to run locally on a single consumer GPU.[1][16] Its release accelerated several trends that came to define the open generative-image ecosystem in 2023 and 2024: the central role of Hugging Face as a model hub, the rise of Civitai as a community marketplace for fine-tunes and LoRAs, the maturation of node-based interfaces (ComfyUI) for advanced workflows, and the academic interest in diffusion distillation methods such as ADD and progressive adversarial distillation.[13][14][16][20] The micro-conditioning mechanism introduced in SDXL has been reused or adapted in several follow-on papers and toolchains. Even after the architectural switch to MMDiT in Stable Diffusion 3, much of the practical tooling (samplers, LoRA training scripts, ControlNet adapters) developed for SDXL remained in active use through 2025 and into 2026.