Würstchen
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,002 words
Würstchen is an efficient three-stage cascaded latent diffusion architecture for text-to-image synthesis introduced by Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, and Marc Aubreville in the paper "Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models", first posted to arXiv as preprint 2306.00637 on 1 June 2023 and accepted as an oral presentation at the Twelfth International Conference on Learning Representations (ICLR 2024).[^1][^2] The model trains its text-conditional diffusion component in a deeply compressed semantic latent space at a 42:1 spatial compression ratio, in contrast to the roughly 8:1 ratio used by prior latent diffusion models such as Stable Diffusion 1.x and 2.x. As a result the 1 billion-parameter Stage C prior was trained in only 24,602 A100 GPU hours, against the 200,000 GPU hours reported for Stable Diffusion 2.1 at comparable model capacity.[^1][^3] Würstchen's architecture was subsequently productized by Stability AI as Stable Cascade, released on 13 February 2024 under a non-commercial research license.[^4][^5]
The name "Würstchen" is German for "small sausage", a reference to the sequence of three nested compression and generation stages that make up the model.
| Field | Value |
|---|---|
| Paper | "Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models" |
| arXiv | 2306.00637 (v1: 1 June 2023; v2: 29 September 2023) |
| Conference | ICLR 2024 (oral) |
| Authors | Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, Marc Aubreville |
| Affiliations | Independent (Spain); TH Ingolstadt; Wand Technologies; Université de Montréal; Mila; Polytechnique Montréal |
| Compression ratio | 42:1 (Stage C) over a 4:1 VQGAN (Stage A) |
| Stage A | 18M-parameter f4 VQGAN |
| Stage B | ~1B-parameter latent diffusion decoder |
| Stage C | ~0.99B-parameter text-conditional diffusion prior |
| Text encoder | CLIP-H (v1); CLIP-bigG (v2) |
| Training compute | 24,602 A100 GPU hours (Stage C) |
| Training samples | 1.42B image-text pairs seen |
| Training dataset | Deduplicated improved-aesthetic LAION-5B subsets |
| Code license | MIT (dome272/Wuerstchen) |
| Reference implementation | dome272/Wuerstchen; warp-ai/wuerstchen on Hugging Face |
| Successor | Stable Cascade (Stability AI, 13 February 2024) |
Latent diffusion models (LDMs), exemplified by Stable Diffusion, train their denoising network in a compressed latent space rather than in pixel space. In the original LDM formulation the autoencoder operates at a spatial compression factor between f4 and f16, with f8 chosen for Stable Diffusion 1.x and 2.x; higher compression in a single-stage autoencoder degraded reconstruction beyond acceptable bounds.[^1] Würstchen's authors observed that the dominant cost in text-to-image training is the denoising network operating in this latent space, and that pushing compression substantially further while still preserving image quality should yield large training and inference savings.[^1]
The paper notes that Stable Diffusion 1.4 used roughly 150,000 GPU hours and Stable Diffusion 2.1 roughly 200,000 GPU hours, while smaller, cheaper text-to-image models traded image quality for cost.[^1] The authors set out to break this Pareto frontier by decoupling text-conditional generation from high-resolution pixel projection, training the text-conditional prior in an aggressively compressed semantic latent space and leaving the projection back to pixels to dedicated, smaller decoders.[^1]
A second observation motivated the design. Stable Diffusion's f8 latent encodes a 512x512 image into a 64x64 latent (a factor of 8 per side, or 64 in pixel count); Würstchen pushes this to a 24x24 semantic latent for a 1024x1024 image, a factor of approximately 42.7 per side, or (1024/24)^2 ≈ 1820 in pixel count. That is roughly 5.3x more compression per side, and about 28x fewer spatial positions, than an f8 autoencoder produces at the same input resolution.[^1] The authors argue that this additional compression is feasible only because the semantic latents are computed by a network trained jointly with a dedicated diffusion-based decoder (Stage B), rather than by a fully self-supervised autoencoder that must invert pixels exactly.[^1] Standard one-stage VAEs cannot reach this compression ratio without unacceptable reconstruction degradation, and so Würstchen's contribution is procedural as well as architectural: it shows that two cascaded decoders can recover from the kind of lossy semantic compression that a single VAE cannot.[^1]
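The compression arithmetic in this paragraph can be checked with a few lines of Python. This is a minimal sketch: the helper name `compression_factors` is illustrative and not taken from the paper or its code, and it assumes square images and square latents.

```python
def compression_factors(image_side: int, latent_side: int) -> tuple[float, float]:
    """Return (per-side factor, per-position factor) for square inputs."""
    per_side = image_side / latent_side
    return per_side, per_side ** 2


# Stable Diffusion f8 autoencoder: 512x512 image -> 64x64 latent.
sd_side, sd_area = compression_factors(512, 64)
# Würstchen semantic latent: 1024x1024 image -> 24x24 latent.
w_side, w_area = compression_factors(1024, 24)

print(f"SD f8:     {sd_side:.1f}x per side, {sd_area:.0f}x in positions")
print(f"Würstchen: {w_side:.1f}x per side, {w_area:.0f}x in positions")
print(f"Ratio:     {w_side / sd_side:.1f}x per side, {w_area / sd_area:.1f}x in positions")
```

The per-side figure (about 42.7) is the one the article rounds to "42:1"; the per-position figure (about 1820) is its square.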
The Würstchen line of work originated outside the major industrial labs. Dominic Rampas (TH Ingolstadt and Wand Technologies) and Pablo Pernias, an independent researcher, had previously explored convolution-based, token-quantized generative image models including the Paella line, providing the technical foundation for working in deeply compressed latent spaces.[^1] Mats L. Richter (Mila and Université de Montréal) and Christopher J. Pal (Polytechnique Montréal, Mila, Canada CIFAR AI Chair) contributed work on intrinsic feature dimensionality of vision backbones that motivated the choice of the EfficientNetV2-S Semantic Compressor used in Stage B.[^1] Marc Aubreville (TH Ingolstadt) supervised the project from the German side.[^1]
The initial version of the paper, then titled "Wuerstchen: Efficient Pretraining of Text-to-Image Models", was posted to arXiv on 1 June 2023 as arXiv:2306.00637v1.[^2] A first publicly distributed checkpoint (Würstchen v1) was released around that period: it operates at 512x512 resolution with CLIP-H text conditioning and a Stage C trained for 800,000 steps.[^6] On 13 September 2023 Hugging Face published an integration blog post for an upgraded checkpoint (Würstchen v2) trained up to 1024x1024 (and adaptable to 1536x1536) and conditioned on CLIP-bigG-Text, hosted under warp-ai/wuerstchen on the Hugging Face Hub.[^7][^8] The paper was revised on 29 September 2023 with a new title, "Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models", and was accepted to ICLR 2024 as an oral presentation, with the OpenReview record finalized 16 January 2024.[^2][^3]
In February 2024 Stability AI released Stable Cascade, an industrial-scale productization of the Würstchen v3 architecture, with new Stage A, B, and C checkpoints; this is described in the Variants section.[^4][^5]
Würstchen factorizes text-to-image synthesis into three networks operating at progressively coarser representations of the image. During inference, sampling runs from the most compressed representation outwards, but during training the stages are learned in reverse order (Stage A first, then Stage B, then Stage C).[^1]
Stage A is a Vector-Quantized Generative Adversarial Network (VQGAN-style autoencoder) operating at a spatial compression factor of 4. Given an input image of shape 3x1024x1024 it encodes to 256x256 discrete tokens drawn from a learned codebook of size 8,192.[^1] The encoder and decoder each consist of two stages built from ConvNeXt blocks with 384 input channels and 1,536 embedding channels, joined by pixel-shuffle resampling at scale factor 2.[^1] Stage A has approximately 18 million parameters in the v1 reference implementation (19M in v2) and is the only part of the cascade that performs final pixel-space decoding at inference time.[^1][^6]
After Stage A is trained, the quantization step is dropped and Stage B is trained against Stage A's unquantized continuous latent space.[^1]
Stage B is a U-Net-shaped latent diffusion model conditioned on (1) CLIP text embeddings and (2) the output of a Semantic Compressor: an EfficientNetV2-S backbone whose final pooling and classifier head are replaced by a 1x1 convolution that compresses channel depth to 16.[^1] The Semantic Compressor ingests 384x384 or 768x768 inputs; for a 768x768 input the backbone produces features of shape 1280x24x24, which the projection reduces to a 16x24x24 latent (a 384x384 input correspondingly yields a 12x12 latent). These deeply compressed latents serve as the conditioning interface between the cascade's stages.[^1]
Stage B is tasked with reconstructing Stage A's continuous latents from these semantic latents plus text. The U-Net has four encoder/decoder stages with channel widths 320, 640, 1280, and 1280, blocks combining ConvNeXt layers, time-step conditioning, and cross-attention over CLIP-H text embeddings (1024-d) and Semantic Compressor embeddings.[^1] Stage B is approximately 1 billion parameters in the reference implementation.[^1]
End to end, encoding a 1024x1024 image into a 16x24x24 semantic latent and decoding it back through Stages B and A yields a total spatial compression ratio of 42:1.[^1]
Stage C is the text-conditional diffusion model that produces semantic latents from noise and text. Because the inputs are already 42x compressed, Stage C does not need a U-Net topology with multiple resolution levels: the authors instead use a flat stack of 16 ConvNeXt blocks (no downsampling), with cross-attention to un-pooled CLIP-H embeddings and time-step injection after each block; channel width is 1,280 and each cross-attention block has 16 heads.[^1] Stage C in the published model has approximately 0.99 billion parameters and is trained using a standard DDPM forward/reverse scheme on the EfficientNetV2-S latents, using AdamW (learning rate 1e-4) with p2 loss weighting.[^1]
Stage C is trained four consecutive times on deduplicated subsets of the improved-aesthetic LAION-5B dataset: 500,000 iterations at 12x12 latents (batch size 1,536), 364,000 iterations at 24x24 latents (batch size 1,536), 4,000 steps of aspect-ratio mixing, and a final 50,000-iteration aesthetic-quality fine-tuning at batch size 384.[^1] In total, the model sees approximately 1.42 billion image-text training samples.[^1] During Stage C training the EfficientNetV2-S Semantic Compressor remains fixed (it was already trained jointly with Stage B); at inference time the Semantic Compressor is discarded entirely, and Stage C's output replaces it as the conditioning signal to Stage B.[^1]
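The schedule above can be tallied as a quick sanity check. The sketch below uses only the iteration counts and batch sizes stated in this paragraph; the batch size of the aspect-ratio phase is not given here, so that phase is counted toward the step total but left out of the sample total. The variable names are illustrative.

```python
# (phase, iterations, batch size); None marks a value the text does not state.
STAGE_C_SCHEDULE = [
    ("12x12 latents", 500_000, 1_536),
    ("24x24 latents", 364_000, 1_536),
    ("aspect-ratio mixing", 4_000, None),
    ("aesthetic fine-tuning", 50_000, 384),
]

# Total optimizer steps across all four runs.
total_steps = sum(steps for _, steps, _ in STAGE_C_SCHEDULE)

# Samples seen in the phases whose batch size is known (a lower bound).
samples_known = sum(
    steps * batch for _, steps, batch in STAGE_C_SCHEDULE if batch is not None
)

print(f"total steps:            {total_steps:,}")
print(f"samples (known phases): {samples_known:,}")
```

The step total comes to 918,000, and the three phases with a stated batch size account for about 1.35 billion samples, somewhat below the reported figure of approximately 1.42 billion; the figures given here do not explain the remainder.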
At inference, the user prompt is encoded with CLIP-H (CLIP-bigG in v2) and passed to Stage C, which uses DDPM with classifier-free guidance (default guidance scale 4.0, default 60 steps in v1) to sample a 16x24x24 semantic latent from random noise.[^1] This latent is flattened to 576x16 and passed as conditioning to Stage B, which runs a standard LDM sampling for 12 steps with guidance scale 4 to produce a Stage A latent of shape 4x256x256. Stage B is initialized with random tokens drawn from the VQGAN codebook.[^1] Finally Stage A's VQGAN decoder projects the latent back to a 3x1024x1024 image.[^1]
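The tensor shapes named in this paragraph can be traced stage by stage. The following is a shape-only sketch, not a model: the function `cascade_shapes` and its parameter names are hypothetical, and the defaults are simply the values quoted above (16x24x24 semantic latent, f4 Stage A with 4 latent channels, 1024x1024 output).

```python
def cascade_shapes(image_side=1024, semantic_side=24, semantic_channels=16,
                   stage_a_factor=4, stage_a_channels=4):
    """Trace tensor shapes through inference: Stage C -> Stage B -> Stage A."""
    # Stage C samples the semantic latent from noise and text.
    stage_c_out = (semantic_channels, semantic_side, semantic_side)
    # Stage B receives that latent flattened to (positions, channels).
    stage_b_cond = (semantic_side * semantic_side, semantic_channels)
    # Stage B produces a latent in Stage A's f4 latent space.
    latent_side = image_side // stage_a_factor
    stage_b_out = (stage_a_channels, latent_side, latent_side)
    # Stage A's decoder projects that latent back to RGB pixels.
    image = (3, image_side, image_side)
    return {
        "stage_c_output": stage_c_out,
        "stage_b_conditioning": stage_b_cond,
        "stage_b_output": stage_b_out,
        "stage_a_output": image,
    }


shapes = cascade_shapes()
for name, shape in shapes.items():
    print(f"{name:>22}: {shape}")
```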
Training proceeds in the opposite direction from inference: Stage A is trained first as a standard VQGAN with reconstruction and adversarial losses; Stage B is then trained as an LDM in Stage A's continuous latent space, conditioned on Semantic Compressor outputs and text; finally Stage C is trained as a text-conditional LDM on the Semantic Compressor latents themselves.[^1] The reported total training cost is the sum of these three trainings, of which Stage C accounts for the bulk.[^1]
Stage B is trained for 457,000 iterations at 512x512 input (batch size 512) followed by 300,000 iterations at 1024x1024 (batch size 128), with the Semantic Compressor fed corresponding 384x384 and 768x768 inputs; image normalization uses standard ImageNet statistics (mean 0.485/0.456/0.406, std 0.229/0.224/0.225).[^1] During Stage B training the Semantic Compressor weights are updated (initialized from ImageNet pretraining), and the Stage B loss intermittently noises the Semantic Compressor's embeddings so that the model is robust to imperfect conditioning when Stage C produces them at inference.[^1] Stage C's four consecutive training runs use AdamW (learning rate 1e-4) with a linear warm-up of 10,000 steps; together the runs total 918,000 training steps (500,000 + 364,000 + 4,000 + 50,000) at batch sizes between 384 and 1,536.[^1]
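The ImageNet normalization quoted above is a per-channel shift and scale. As a minimal illustration, assuming RGB channel values already scaled to the range [0, 1], it can be written as follows; the function name `normalize_pixel` is illustrative, and a real pipeline would apply the same operation to whole image tensors.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channel values lie in [0, 1]."""
    return tuple(
        (value - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A pixel equal to the dataset mean maps to zero in every channel.
mean_pixel = normalize_pixel(IMAGENET_MEAN)
# A pure white pixel maps to a few standard deviations above zero.
white_pixel = normalize_pixel((1.0, 1.0, 1.0))
print(mean_pixel, white_pixel)
```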
The reverse-order training procedure also matters for inference fidelity. Because Stage B is trained to reconstruct Stage A's continuous latents from arbitrary Semantic Compressor outputs (including noised ones), Stage C can produce semantic latents that are close in distribution to, but not identical with, the Semantic Compressor's natural outputs, and Stage B still decodes them robustly. Since Stage C's output stands in for the EfficientNetV2-S Semantic Compressor at inference time, the deployed model consists of only Stage C, Stage B, and Stage A.[^1]
The Würstchen paper's central efficiency claims, drawn from Table 2 of the camera-ready version, are summarized below.
| Model | Params | Sampling steps | GPU hours (A100) | Train samples | CO2 emissions |
|---|---|---|---|---|---|
| Würstchen (Stage C) | 0.99B | 60 | 24,602 | 1.42B | 2,276 kg |
| Baseline LDM (paper, same compute) | 0.99B | 60 | ~25,000 | n/a | ~2,300 kg |
| Stable Diffusion 1.4 | 0.8B | 50 | 150,000 | 4.8B | 11,250 kg |
| Stable Diffusion 2.1 | 0.8B | 50 | 200,000 | 3.9B | 15,000 kg |
| SDXL | 2.6B | 50 | (not reported) | (not reported) | (not reported) |
Source: [^1].
The Stage C training cost of 24,602 GPU hours corresponds roughly to an 8x reduction over Stable Diffusion 2.1's 200,000 GPU hours at comparable model capacity.[^1] When the additional ~11,000 GPU hours and ~318M training samples for Stage B are accounted for, Würstchen still trains with substantially less compute than the Stable Diffusion baselines.[^1] The reported carbon footprint of 2,275.68 kg CO2-equivalent reflects training on A100 PCIe 40GB GPUs on AWS US-east, as documented in the warp-ai/wuerstchen model card.[^8]
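The reduction factors follow directly from the GPU-hour figures quoted above. The short sketch below reproduces them; the constant names are illustrative, and the Stage B figure is the approximate value given in this paragraph.

```python
STAGE_C_HOURS = 24_602   # Würstchen Stage C
STAGE_B_HOURS = 11_000   # approximate additional cost of Stage B
SD_14_HOURS = 150_000    # Stable Diffusion 1.4
SD_21_HOURS = 200_000    # Stable Diffusion 2.1


def reduction(baseline_hours: float, model_hours: float) -> float:
    """How many times fewer GPU hours the model needs than the baseline."""
    return baseline_hours / model_hours


stage_c_vs_sd21 = reduction(SD_21_HOURS, STAGE_C_HOURS)
stage_c_vs_sd14 = reduction(SD_14_HOURS, STAGE_C_HOURS)
with_stage_b_vs_sd21 = reduction(SD_21_HOURS, STAGE_C_HOURS + STAGE_B_HOURS)

print(f"Stage C vs SD 2.1:           {stage_c_vs_sd21:.2f}x")
print(f"Stage C vs SD 1.4:           {stage_c_vs_sd14:.2f}x")
print(f"Stage C + Stage B vs SD 2.1: {with_stage_b_vs_sd21:.2f}x")
```

Including Stage B lowers the advantage over Stable Diffusion 2.1 from about 8x to about 5.6x, which is still a substantial saving.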
On image-quality metrics computed on a 30,000-sample subset of MS-COCO validation prompts, Würstchen achieves a CLIP-driven Inception Score of 40.9 (similar to SD 1.4's 40.6 and SD 2.1's 40.1) and a FID of 23.6.[^1] PickScore preference comparisons show Würstchen preferred over SD 1.4, SD 2.1, and the authors' compute-matched baseline LDM, while SDXL is favored on raw image quality at substantially higher (and undisclosed) compute.[^1] Inference time on a single A100 at 1024x1024 is reported as significantly faster than Stable Diffusion 2.1 and SDXL across batch sizes, with and without torch.compile optimization.[^1] The Hugging Face integration blog summarizes the result as Würstchen outperforming SDXL across batch sizes in both wall-clock generation time and peak GPU memory.[^7]
The first published checkpoint, released alongside arXiv:2306.00637v1 in mid-2023, was trained at 512x512 resolution with CLIP-H-Text conditioning for 800,000 Stage C steps. Parameter breakdown is 19M (Stage A) + 600M (Stage B) + 1B (Stage C). It was distributed via the dome272/wuerstchen Hugging Face repository under the MIT license.[^6]
Released alongside the Hugging Face integration on 13 September 2023, v2 retains the same parameter budget but trains Stage C to 918,000 steps with CLIP-bigG-Text conditioning and supports variable aspect ratios up to 1536x1536 pixels per side.[^6][^7] The warp-ai/wuerstchen model card reports 24,602 A100 hours and 2,275.68 kg CO2 for v2 and uses CLIP ViT-bigG/14 as the (frozen) text encoder.[^8] v2 is the checkpoint cited in Table 2 of the camera-ready ICLR 2024 paper.[^1] It ships under the MIT license through Hugging Face's diffusers library, exposed via AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen").[^7][^8]
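A minimal usage sketch for the diffusers entry point named above might look like the following. It assumes a CUDA-capable GPU and installed `torch` and `diffusers` packages, and it downloads the checkpoint on first use; the wrapper function `generate_image` and its defaults are illustrative rather than part of the library.

```python
def generate_image(prompt: str, height: int = 1024, width: int = 1024):
    """Generate one PIL image with the Würstchen v2 combined pipeline."""
    # Imported lazily so this module can be imported without a GPU stack.
    import torch
    from diffusers import AutoPipelineForText2Image

    # Loads Stage C (prior) and Stages B/A (decoder) as one combined pipeline.
    pipeline = AutoPipelineForText2Image.from_pretrained(
        "warp-ai/wuerstchen", torch_dtype=torch.float16
    ).to("cuda")

    result = pipeline(
        prompt,
        height=height,
        width=width,
        prior_guidance_scale=4.0,  # classifier-free guidance for Stage C
    )
    return result.images[0]
```

On a suitable machine, `generate_image("An anthropomorphic cat dressed as a firefighter").save("sample.png")` would write a single 1024x1024 image to disk. Half precision is used here only to reduce memory; it is a choice of this sketch, not a requirement stated in the sources above.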
In February 2024 Stability AI released Stable Cascade, described in their announcement as "a new text to image model building upon the Würstchen architecture".[^4] Stable Cascade preserves the 42:1 compression ratio (encoding a 1024x1024 image into a 24x24 latent) and the Stage A / Stage B / Stage C cascade structure, with substantially scaled-up parameter counts.[^4][^9] The architecture corresponds to what the Würstchen authors internally referred to as v3, although the public release uses the Stable Cascade name.
| Stable Cascade component | Parameters | Variants released |
|---|---|---|
| Stage A (VAE-style decoder) | 20M | single |
| Stage B (latent decoder) | 700M or 1.5B | two |
| Stage C (text-conditional prior) | 1B or 3.6B | two |
Source: [^9][^10].
Stable Cascade was released on 13 February 2024 in research preview under the Stability AI Non-Commercial Research Community License, with code under MIT in the Stability-AI/StableCascade GitHub repository.[^4][^10] Stability AI's announcement reports that the architecture achieves a "16x cost reduction compared to training a similar-sized Stable Diffusion model", notes that ControlNets and LoRAs can be trained exclusively on the small Stage C latent space, and states that, despite Stable Cascade's larger total parameter count (the largest configuration has roughly 1.4B more parameters than SDXL), inference is faster than SDXL.[^4]
Würstchen v2 and Stable Cascade are both integrated into Hugging Face's diffusers Python library through dedicated WuerstchenPipeline / WuerstchenCombinedPipeline and StableCascadeCombinedPipeline classes, supporting torch.compile and PyTorch 2 scaled-dot-product attention out of the box.[^7][^10]
Würstchen's primary contribution is empirical: it demonstrates that very high spatial compression (here 42:1) of the latent space used for the text-conditional diffusion step is feasible without catastrophic loss of image quality, provided that (a) the highly compressed latents come from a Semantic Compressor pretrained on natural images and fine-tuned during Stage B, and (b) a dedicated mid-stage decoder bridges back to a milder compression where a VQGAN can faithfully reconstruct pixels.[^1] In doing so, the paper established a design pattern (small, semantic latent + multi-stage decoder) that has influenced subsequent open-source text-to-image work, most directly via Stable Cascade.[^4]
The model also reduces the practical floor for training a competitive text-to-image system. The reported 24,602 A100-hour Stage C training is roughly one-eighth of the 200,000 GPU hours reported for Stable Diffusion 2.1, so at any given hourly rental price the training bill shrinks by about the same factor; at an illustrative 1 USD per A100-hour, that is about 25,000 USD instead of 200,000 USD.[^1] Stability AI's marketing of Stable Cascade explicitly cites this property: because ControlNets and LoRAs train on the small 24x24 Stage C latent rather than on full-resolution Stage B/A, customization compute drops further.[^4]
Würstchen's compact latent representation also enables faster inference. The Würstchen v2 Hugging Face integration reports up to several-times faster generation at 1024x1024 against SDXL on a single consumer-class GPU at large batch sizes, with reduced peak memory.[^7] In Appendix D of the paper, the authors investigate how Stage B and Stage C share generative workload by training a small 3.9M-parameter probe decoder on Stage C's outputs alone; the reconstructions show that Stage C is responsible for the semantic content of the image while Stage B mainly adds high-frequency detail and sharpness.[^1] This factorization is itself useful: experiments such as ControlNet-style guidance or LoRA adaptation can be applied to Stage C alone, while Stage B remains a fixed renderer, drastically reducing the compute needed for customization.[^4]
The architecture's influence on downstream open-source releases extends beyond Stability AI's direct productization in Stable Cascade. Stable Cascade itself ships not only the weights but training code for ControlNets, LoRAs, and image-to-image reconstruction in the Stage C latent space, accelerating community fine-tuning.[^10] Würstchen v2's MIT-licensed weights and the dome272/Wuerstchen MIT codebase additionally enabled academic reuse without the OpenRAIL-M license restrictions attached to Stable Diffusion family weights.[^6][^8]
The published Würstchen and the derivative Stable Cascade ship with several documented caveats. The Stable Cascade model card explicitly lists that the model "is not able to generate legible text" reliably, that "faces and people in general may not be generated properly", that the autoencoding step is lossy, and that the model "was not trained to be factual or true representations of people or events".[^9] These limitations are inherited from training-distribution biases in the LAION-5B aesthetic subsets rather than from the cascade structure itself.[^1][^9]
The paper concedes that SDXL produces superior image quality on aesthetic measures, and notes that direct comparison is not fully fair because SDXL's data and compute budget are not disclosed.[^1] Würstchen's reported FID of 23.6 on COCO-30K is higher, and therefore worse, than SD 2.1's 15.1 and SD 1.4's 16.2. The authors argue that COCO FID under-rewards aesthetic improvements and over-rewards proximity to COCO's photographic distribution, and they point to PickScore comparisons, which favor Würstchen on user preference even where FID does not.[^1]
The cascade introduces a structural cost: inference requires running three networks (Stage C, then Stage B, then Stage A), and Stage B at 1B parameters is not itself small. The paper notes that even with this overhead, total sampling time at 1024x1024 is significantly faster than SD 2.1 and SDXL on the same A100; nevertheless, peak memory across all three stages must be managed, particularly on consumer GPUs.[^1][^7]
Würstchen's license situation differs across variants. The original code base (dome272/Wuerstchen) and the warp-ai/wuerstchen weights ship under the MIT license, allowing commercial use with attribution.[^6][^8] Stable Cascade's weights are released under the Stability AI Non-Commercial Research Community License, restricting commercial deployment, although Stability AI's code under Stability-AI/StableCascade is MIT-licensed.[^4][^10]
A further methodological caveat concerns the comparison baselines. The authors explicitly trained their own 1B-parameter "Baseline LDM" on the SD 2.1 first-stage autoencoder and text encoder for an equivalent ~25,000 GPU hours to isolate the architectural contribution from confounding factors; under this controlled comparison Würstchen achieves FID 23.6 vs. the Baseline LDM's 43.5 and Inception Score 40.9 vs. 20.1 on COCO-30K.[^1] This is the cleanest like-for-like demonstration that the cascade design, rather than incidental engineering improvements, drives the efficiency gain.[^1] Comparisons to the published Stable Diffusion 1.4 and 2.1 weights, by contrast, mix in differences in dataset filtering, text encoder choice, and training duration, and should be interpreted with caution.[^1]
After the September 2023 Hugging Face integration, the warp-ai/wuerstchen checkpoint accumulated community demos, fine-tunes, and integrations. The Hugging Face Hub model card lists dozens of community Spaces using the model and an active discussion thread, and Würstchen pipelines (WuerstchenCombinedPipeline, WuerstchenPriorPipeline, WuerstchenDecoderPipeline) became first-class entries in the diffusers library API surface.[^7][^8] Stable Cascade's February 2024 release in turn ranks among Stability AI's most downloaded text-to-image checkpoints on Hugging Face, with the model card reporting roughly 13,000 downloads in a single recent month as of the time of writing.[^9]
The combination of MIT-licensed Würstchen v2 weights and an explicit reduced-compute recipe also made the architecture attractive for academic experimentation. The dome272/Wuerstchen repository ships not only inference scripts but training scripts for Stage B and Stage C, lowering the barrier to research on alternative semantic backbones, alternative compressed-latent objectives, and alternative sampling schedules.[^6] The Stability-AI/StableCascade repository adds production training code for ControlNet, LoRA, image-to-image, and inpainting workflows operating in the small Stage C latent, which has reduced the GPU footprint of customization workflows reported by community contributors.[^10]
| Model | Compression | Stage C / prior params | Training compute | License |
|---|---|---|---|---|
| Würstchen v2 | 42:1 (over a 4:1 VQGAN) | ~1B | 24,602 A100-h | MIT (weights) [^8] |
| Stable Cascade | 42:1 | 1B or 3.6B | not publicly broken out | NC research [^4][^10] |
| Stable Diffusion 1.4 | 8:1 single-stage VAE | 0.8B | 150,000 GPU-h | OpenRAIL-M [^1] |
| Stable Diffusion 2.1 | 8:1 single-stage VAE | 0.8B | 200,000 GPU-h | OpenRAIL-M [^1] |
| SDXL | 8:1 single-stage VAE | 2.6B (U-Net) | not disclosed | OpenRAIL++-M [^1] |
| Imagen | pixel + upsamplers (no LDM) | n/a | not disclosed | proprietary |
| DALL-E 2 (unCLIP) | pixel + upsamplers | n/a | not disclosed | proprietary |
Source: [^1][^4][^8][^10].
Within the open-source text-to-image landscape Würstchen sits between two design poles: the single-stage VAE plus U-Net diffusion architecture used by Stable Diffusion 1.x/2.x/SDXL, and the pixel-space-plus-upsamplers approach used by Imagen and unCLIP/DALL-E 2.[^1] By placing the heavy text-conditional diffusion work in a 42:1 semantic latent and using two smaller models to project back to pixels, Würstchen prefigures the design intuitions also visible in later open-weights systems such as FLUX.1, which similarly attempts to separate semantic generation from pixel decoding for efficiency.[^7]