# Classifier-Free Guidance (CFG)

> Source: https://aiwiki.ai/wiki/classifier_free_guidance
> Updated: 2026-06-23
> Categories: Deep Learning, Generative AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Classifier-Free Guidance (CFG)** is an inference-time technique that steers conditional generative models, especially [diffusion models](/wiki/diffusion_model), by taking a single model's conditional and unconditional predictions and extrapolating past the conditional one, with a tunable guidance scale that trades sample diversity for prompt fidelity. It was introduced by Jonathan Ho and Tim Salimans in the paper "Classifier-Free Diffusion Guidance" [1]. CFG is the standard sampling method in nearly every modern [text-to-image](/wiki/text_to_image) diffusion model, including [Stable Diffusion](/wiki/stable_diffusion), [DALL-E 2](/wiki/dall-e_2), [Imagen](/wiki/imagen), [SDXL](/wiki/sdxl), and [GLIDE](/wiki/glide), and it has since been adapted to autoregressive image and language models. Its core innovation is that the generative model "guides itself" using its own learned distribution, so no separate classifier is required.

A conditional diffusion model can in principle generate samples from a prompt by simply conditioning its denoiser on that prompt. In practice, samples drawn this way are often only loosely aligned with the condition and can be of mediocre quality. Guidance methods trade sample diversity for sample fidelity and prompt adherence, analogous to low-temperature sampling or truncation in other generative models [1][2].

The predecessor method, classifier guidance, achieved this by adding the gradient of a separately trained image classifier to the diffusion model's score estimate during sampling [2]. Classifier-Free Guidance removes the external classifier entirely. Instead, a single model is trained to operate both with and without conditioning, and at sampling time its two outputs are linearly combined and extrapolated. As Ho and Salimans put it, "in what we call classifier-free guidance, we jointly train a conditional and an unconditional diffusion model, and we combine the resulting conditional and unconditional score estimates to attain a trade-off between sample quality and diversity similar to that obtained using classifier guidance" [1]. Because the model guides itself rather than relying on the gradients of a separate network, CFG is simpler to deploy, requires no extra classifier or labeled noisy-image dataset, and works well even when the conditioning (such as free-form text) would be hard for a classifier to predict [3].

## When was classifier-free guidance introduced?

The technique first appeared as a short paper at the NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, with the expanded version posted to arXiv on July 26, 2022 [1]. It rapidly became ubiquitous: by 2022 essentially all leading text-to-image systems relied on it.

## How does classifier guidance work?

Classifier guidance was introduced by Prafulla Dhariwal and Alex Nichol in "Diffusion Models Beat GANs on Image Synthesis" (2021), the work that first showed diffusion models surpassing GANs on [ImageNet](/wiki/imagenet) image synthesis [2]. The method trains an auxiliary classifier on noisy images (the same noised inputs the diffusion model sees at each timestep) and uses the gradient of that classifier's log-probability for the target class to nudge each denoising step toward producing an image of that class.

In score-based terms, sampling from a diffusion model approximates following the score (the gradient of the log-density). Classifier guidance modifies this score by adding the classifier gradient scaled by a factor s:

    score_guided = score(x_t) + s * grad_x log p_classifier(c | x_t)

Larger s sharpens the conditional distribution, improving fidelity and class consistency while reducing diversity. The approach is compute-efficient at sampling time but has notable drawbacks: it requires training and maintaining a separate classifier specifically on noisy images, and the classifier's gradients can be adversarial-like signals that do not correspond to genuinely improved images [1][2]. These costs motivated a method that needed no classifier at all.
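
The guided score above can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not part of the cited work: the data density is a standard normal (so its score is `-x`), and the "classifier" is a logistic model p(c | x) = sigmoid(a.x + b) whose gradient has a closed form. The function names are hypothetical.

```python
import numpy as np

def toy_score(x):
    """Score of a standard normal density: grad_x log N(x; 0, I) = -x."""
    return -x

def classifier_log_prob_grad(x, a, b):
    """grad_x log sigmoid(a.x + b) = (1 - sigmoid(a.x + b)) * a for a logistic classifier."""
    p = 1.0 / (1.0 + np.exp(-(x @ a + b)))
    return (1.0 - p)[..., None] * a

def guided_score(x, a, b, s):
    """score_guided = score(x_t) + s * grad_x log p_classifier(c | x_t)."""
    return toy_score(x) + s * classifier_log_prob_grad(x, a, b)

x = np.zeros((2, 3))            # two toy "noisy samples" with three dimensions each
a = np.array([1.0, 0.0, 0.0])   # the toy classifier only looks at the first coordinate
print(guided_score(x, a, b=0.0, s=0.0))  # s = 0: plain score, no guidance
print(guided_score(x, a, b=0.0, s=4.0))  # s = 4: score pushed along +a
```

In a real system the classifier is a neural network trained on noisy images, and its gradient would come from automatic differentiation rather than a closed form.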

## How does classifier-free guidance work?

CFG trains one conditional model that doubles as an unconditional one. During training, the conditioning input c is randomly replaced with a special null token (for example, an empty text string) with some probability p_uncond, typically in the range of about 10 to 20 percent [1][3]. The model thus learns both the conditional score eps_cond = eps(x_t, c) and the unconditional score eps_uncond = eps(x_t, null) within a single set of weights, where eps denotes the network's noise (or score) prediction at noise level t.
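
The condition-dropout step can be sketched as follows, assuming integer class labels and a hypothetical `NULL_TOKEN` id for the null condition (for text, this would typically be the empty string instead). The helper name `drop_conditions` is illustrative only.

```python
import numpy as np

NULL_TOKEN = -1  # hypothetical id standing in for the "null" condition

def drop_conditions(cond_ids, p_uncond, rng):
    """Replace each condition with NULL_TOKEN independently with probability p_uncond."""
    cond_ids = np.asarray(cond_ids)
    drop = rng.random(cond_ids.shape) < p_uncond
    return np.where(drop, NULL_TOKEN, cond_ids)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10_000)   # toy class labels 0..9
train_cond = drop_conditions(labels, p_uncond=0.1, rng=rng)

# Fraction of examples that will train the unconditional branch; should be near p_uncond.
print((train_cond == NULL_TOKEN).mean())
```

The rest of the training loop is unchanged: the denoiser simply receives `train_cond` in place of the original labels, so the same weights learn both branches.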

At sampling time, the two predictions are combined and extrapolated using a guidance scale w (also called the guidance weight or CFG scale):

    eps_guided = eps_uncond + w * (eps_cond - eps_uncond)

Equivalently, this can be written as eps_guided = (1 + g) * eps_cond - g * eps_uncond with g = w - 1, making explicit that the result moves in the direction of the conditional prediction and away from the unconditional one. Special cases clarify the behavior:

- w = 0 yields purely unconditional generation (the prompt is ignored).
- w = 1 reproduces ordinary conditional sampling with no guidance.
- w > 1 amplifies the difference between conditional and unconditional predictions, sharpening prompt adherence at the expense of diversity.
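
The combination rule and its special cases can be checked with a short sketch. The two arrays are random stand-ins for the network outputs eps(x_t, c) and eps(x_t, null); `cfg_combine` is a hypothetical helper name.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """eps_guided = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((4, 4))    # stand-in for eps(x_t, c)
eps_uncond = rng.standard_normal((4, 4))  # stand-in for eps(x_t, null)

# w = 0 ignores the prompt; w = 1 is ordinary conditional sampling.
assert np.allclose(cfg_combine(eps_cond, eps_uncond, 0.0), eps_uncond)
assert np.allclose(cfg_combine(eps_cond, eps_uncond, 1.0), eps_cond)

# w > 1 extrapolates past eps_cond; the (1 + g), -g form with g = w - 1 agrees.
w = 7.5
g = w - 1.0
alt = (1.0 + g) * eps_cond - g * eps_uncond
assert np.allclose(cfg_combine(eps_cond, eps_uncond, w), alt)
```

In practice both predictions are usually obtained in one batched forward pass, with the conditional and null inputs stacked along the batch axis, so guidance roughly doubles the per-step compute.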

(Note on conventions: the formula above follows the convention used in the Imagen paper and the Stable Diffusion reference implementation, where the parameter is called the "guidance scale" and a value of 1 means no guidance. Ho and Salimans instead write the guided prediction as (1 + w) * eps_cond - w * eps_uncond, so their w corresponds to g above and a value of 0 means no guidance [1]. The two conventions differ by an offset of 1, so the same qualitative behavior may be reported with shifted numeric values.)

Intuitively, the difference vector (eps_cond - eps_uncond) isolates the component of the prediction attributable specifically to the condition. Pushing along it emphasizes prompt-relevant features. Because the model supplies both terms, CFG produces a sharper implicit conditional distribution proportional to p(x | c) * (p(c | x) ^ g) without ever evaluating an explicit classifier [1].
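
The proportionality can be verified numerically on a small discrete example. This is a sketch under stated assumptions: the joint table is made up, and combining log-densities stands in for combining scores, which ignores the per-noise-level subtleties of actual diffusion sampling. Here g = w - 1, as in the rewritten formula earlier in this section.

```python
import numpy as np

# Hypothetical joint p(x, c) over 4 values of x (rows) and 2 conditions (columns).
joint = np.array([[0.30, 0.05],
                  [0.20, 0.05],
                  [0.10, 0.10],
                  [0.05, 0.15]])
p_x = joint.sum(axis=1)                 # p(x)
p_c = joint.sum(axis=0)                 # p(c)
c = 1
p_x_given_c = joint[:, c] / p_c[c]      # p(x | c)
p_c_given_x = joint[:, c] / p_x         # p(c | x)

def normalize(weights):
    return weights / weights.sum()

def guided_distribution(w):
    """Density implied by combining log-densities the way CFG combines scores."""
    log_p = (1.0 - w) * np.log(p_x) + w * np.log(p_x_given_c)
    return normalize(np.exp(log_p))

w = 3.0
g = w - 1.0
implicit = normalize(p_x_given_c * p_c_given_x ** g)   # p(x | c) * p(c | x)^g
assert np.allclose(guided_distribution(w), implicit)

print(p_x_given_c.round(3))             # plain conditional distribution
print(guided_distribution(w).round(3))  # sharper: more mass on the most c-typical x
```

The two forms agree because p(c | x) = p(x | c) p(c) / p(x) by Bayes' rule, and the constant factor p(c)^g disappears on normalization.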

## What is the guidance scale and what does it trade off?

The guidance scale is the single most important knob a user turns when sampling from a CFG model. It governs a tradeoff between two competing goals: fidelity and prompt adherence on one side, and diversity and naturalness on the other [1][2].

| Guidance scale | Effect on output |
|---|---|
| Very low (about 1 to 3) | Weak prompt adherence; more diverse, sometimes off-prompt or washed-out results |
| Moderate (about 5 to 9) | Common operating range; good balance of prompt fidelity and image quality |
| High (about 10 to 20+) | Strong prompt adherence but reduced diversity, oversaturated colors, blown-out highlights, and unrealistic artifacts |

Default and recommended values vary by model. Stable Diffusion's reference implementation uses a default guidance scale of 7.5 [4]. SDXL, with stronger text understanding, tends to oversaturate at high scales more readily, so practitioners often use roughly 5 to 9 and sometimes lower [5]. Imagen and DALL-E 2 likewise depend critically on guidance for effective text conditioning, with Imagen reporting sweeps over guidance weights from 1 up to 10 [6][8].

The principal failure mode at high scales is oversaturation: as the guided prediction is pushed hard, pixel values drift outside the model's training range (commonly normalized to [-1, 1]), producing overexposed, garishly colored, or overly contrasty images [6]. Analyses have linked the effect to an accumulation of redundant low-frequency signal as the guidance scale grows [7]. Several remedies, discussed below, address this without abandoning high guidance.

## Where is classifier-free guidance used?

CFG is near-universal in conditional diffusion generation. GLIDE (Nichol et al., 2021) compared classifier-free guidance against CLIP-based guidance and found that human raters preferred classifier-free guidance for both photorealism and caption similarity [3]. In that human evaluation, classifier-free guidance reached an Elo score of 82.7 for photorealism and 110.9 for caption similarity, versus -73.2 and 29.3 respectively for CLIP guidance [3]. The authors attributed this to a single model guiding synthesis with its own knowledge rather than relying on a separate model's interpretation, and to the difficulty of building a classifier for free-form text [3].

Imagen (Saharia et al., 2022) relies on CFG together with dynamic thresholding for its text conditioning [6]. DALL-E 2 (unCLIP) enabled classifier-free guidance by intermittently zeroing the CLIP image embedding and randomly dropping the text caption during training [8]. Stable Diffusion and SDXL expose the guidance scale as a primary user-facing parameter, the familiar "CFG scale" slider in many interfaces [4][5].

Latent and pixel-space diffusion systems alike adopt the same mechanism; see [latent diffusion model](/wiki/latent_diffusion) and [DDPM](/wiki/ddpm) for the underlying generative framework, and [Midjourney](/wiki/midjourney) among the broader family of guided text-to-image systems.

Beyond diffusion, CFG has been applied to autoregressive models. "Stay on Topic with Classifier-Free Guidance" (Sanchez et al., 2023) showed that the same conditional-minus-unconditional extrapolation, applied to next-token logits, improves pure language models. For logits the rule is logits_guided = logits_uncond + w * (logits_cond - logits_uncond), where the unconditional branch drops the prompt or system instruction. The authors reported gains across question answering, reasoning, code generation, and machine translation for Pythia, GPT-2, and LLaMA-family models, with improvements they characterized as roughly equivalent to doubling parameter count, and the method stacks with chain-of-thought and self-consistency [9].
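
The logit rule can be sketched on a toy vocabulary. The logit values are invented for illustration, and `cfg_logits` is a hypothetical helper name; a real setup would obtain the two logit vectors from two forward passes of the same language model, with and without the prompt in context.

```python
import numpy as np

def cfg_logits(logits_cond, logits_uncond, w):
    """logits_guided = logits_uncond + w * (logits_cond - logits_uncond)."""
    return logits_uncond + w * (logits_cond - logits_uncond)

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical next-token logits over a 4-token vocabulary.
logits_cond = np.array([2.0, 1.0, 0.5, 0.0])    # with the prompt in context
logits_uncond = np.array([1.0, 1.0, 1.0, 0.0])  # with the prompt dropped

for w in (1.0, 1.5, 3.0):
    probs = softmax(cfg_logits(logits_cond, logits_uncond, w))
    print(w, probs.round(3))
```

Because softmax is invariant to adding a constant to all logits, applying the same rule to log-probabilities instead of raw logits yields the same guided distribution.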

## Extensions and variants

Because high guidance scales are useful but introduce artifacts, much follow-up work refines how and when guidance is applied.

### Negative prompts

A [negative prompt](/wiki/negative_prompt) generalizes CFG by replacing the unconditional (null) branch with a second, non-empty condition describing content to avoid. Instead of eps_uncond, the model evaluates eps_neg = eps(x_t, c_negative), and sampling extrapolates away from that negative condition: eps_guided = eps_neg + w * (eps_cond - eps_neg). This steers the output toward the positive prompt and explicitly away from the described undesirable attributes (for example "blurry, extra fingers, watermark"). Negative prompting is widely exposed in Stable Diffusion and SDXL interfaces and is a direct repurposing of the CFG machinery [4][5].
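
A minimal sketch of the negative-prompt rule, using random arrays as stand-ins for the two network outputs (the helper name is hypothetical):

```python
import numpy as np

def cfg_with_negative(eps_cond, eps_neg, w):
    """eps_guided = eps_neg + w * (eps_cond - eps_neg)."""
    return eps_neg + w * (eps_cond - eps_neg)

rng = np.random.default_rng(0)
eps_cond = rng.standard_normal(8)   # stand-in for eps(x_t, c)
eps_neg = rng.standard_normal(8)    # stand-in for eps(x_t, c_negative)
w = 7.5
eps_guided = cfg_with_negative(eps_cond, eps_neg, w)

# The guided prediction lies on the line from eps_neg through eps_cond,
# w times as far from eps_neg as eps_cond is.
ratio = np.linalg.norm(eps_guided - eps_neg) / np.linalg.norm(eps_cond - eps_neg)
print(ratio)
```

The only change from ordinary CFG is which prediction plays the role of the baseline, so the sampling cost is the same two forward passes per step.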

### CFG rescale (guidance rescale)

To counter oversaturation, Lin et al. (2023), in "Common Diffusion Noise Schedules and Sample Steps are Flawed," proposed rescaling the guided prediction so its statistics do not blow up [10]. The idea is to compute the guided prediction as usual, then rescale it so that its standard deviation matches that of the plain conditional prediction, optionally blending the rescaled and original outputs by a factor (often denoted phi) to avoid over-correction. This "guidance rescale" lets users keep high guidance scales while preventing the over-exposed look [10].
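
The rescaling step described above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the standard deviation is taken per sample over all non-batch axes, and the default `phi=0.7` is an assumed value chosen only for the example.

```python
import numpy as np

def rescale_guidance(eps_guided, eps_cond, phi=0.7):
    """Match the per-sample std of eps_guided to that of eps_cond, then blend by phi."""
    axes = tuple(range(1, eps_guided.ndim))            # all non-batch axes
    std_cond = eps_cond.std(axis=axes, keepdims=True)
    std_guided = eps_guided.std(axis=axes, keepdims=True)
    rescaled = eps_guided * (std_cond / std_guided)
    return phi * rescaled + (1.0 - phi) * eps_guided   # phi = 0 leaves eps_guided unchanged

rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((2, 4, 8, 8))           # stand-in network outputs
eps_uncond = rng.standard_normal((2, 4, 8, 8))
w = 7.5
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)

out = rescale_guidance(eps_guided, eps_cond, phi=1.0)
print(eps_guided.std(), out.std(), eps_cond.std())     # guided std is inflated; rescaled is not
```

With `phi=1.0` the output has exactly the per-sample standard deviation of the conditional prediction; intermediate values of `phi` trade off the correction against the original guided output.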

### Guidance intervals

Kynkäänniemi et al. (2024), in "Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models," found that guidance is harmful at the highest noise levels (early in sampling), largely unnecessary at the lowest noise levels (late in sampling), and beneficial mainly in the middle [11]. Restricting CFG to a limited interval of noise levels improved sample quality, setting a new FID record on ImageNet-512 (improving from 1.81 to 1.40 over the EDM2 baseline) and helping on large models including SDXL. The authors recommended exposing the guidance interval as a standard hyperparameter [11].
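
A guidance interval amounts to making the guidance scale a function of the noise level. The sketch below assumes the noise level is given as a scalar `sigma`; the schedule and the interval bounds `0.3` and `5.0` are made-up values for illustration, and the function names are hypothetical.

```python
import numpy as np

def interval_guidance_scale(sigma, w, sigma_lo, sigma_hi):
    """Return w inside (sigma_lo, sigma_hi], and 1.0 (no guidance) outside it."""
    return w if sigma_lo < sigma <= sigma_hi else 1.0

def guided_eps(eps_cond, eps_uncond, sigma, w, sigma_lo, sigma_hi):
    """Standard CFG combination with a noise-level-dependent scale."""
    w_t = interval_guidance_scale(sigma, w, sigma_lo, sigma_hi)
    return eps_uncond + w_t * (eps_cond - eps_uncond)

# Hypothetical noise schedule running from high noise to low noise.
sigmas = np.geomspace(80.0, 0.002, num=8)
for sigma in sigmas:
    print(f"{sigma:9.4f} -> w = {interval_guidance_scale(sigma, 7.5, 0.3, 5.0)}")
```

Outside the interval the formula reduces to `eps_cond`, so an implementation can skip the unconditional forward pass on those steps.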

### Dynamic thresholding

Introduced with Imagen, dynamic thresholding is a sampling-time fix specifically for high guidance weights [6]. At each denoising step it sets a threshold s to a chosen percentile of the absolute pixel values in the predicted clean image (such as the 99.5th percentile) and, when s exceeds 1, clips the prediction to [-s, s] and divides by s, rather than statically clipping to [-1, 1]. This keeps pixels within range and prevents the saturation that high guidance would otherwise induce, enabling more detailed and photorealistic results [6].
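
A minimal NumPy sketch of this step, assuming the predicted clean image is a batch array scaled so that valid pixels lie in [-1, 1]. The function name and the toy input are illustrative, and the percentile is computed per sample over all non-batch axes.

```python
import numpy as np

def dynamic_threshold(x0_pred, percentile=99.5):
    """Clip each sample to [-s, s], with s from its own |pixel| percentile, then divide by s."""
    axes = tuple(range(1, x0_pred.ndim))               # all non-batch axes
    s = np.percentile(np.abs(x0_pred), percentile, axis=axes, keepdims=True)
    s = np.maximum(s, 1.0)                             # leave in-range predictions untouched
    return np.clip(x0_pred, -s, s) / s

rng = np.random.default_rng(0)
x0_pred = 3.0 * rng.standard_normal((2, 3, 16, 16))    # toy over-range prediction
out = dynamic_threshold(x0_pred)
print(np.abs(x0_pred).max(), np.abs(out).max())        # output magnitude never exceeds 1
```

Unlike static clipping, which would flatten every over-range pixel to exactly -1 or 1, dividing by s compresses the whole image and keeps relative contrast among the bright and dark regions.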

## How does CFG relate to other methods?

CFG is best understood against its predecessor and its descendants. Compared with classifier guidance, it achieves the same fidelity-versus-diversity control but folds the guiding signal into the generative model itself, eliminating the separate noisy-image classifier and its training cost [1][2]. Compared with negative prompting, ordinary CFG is the special case in which the negative condition is the empty/null prompt [4].

In language modeling, CFG is closely related to [contrastive decoding](/wiki/contrastive_decoding) and other contrastive inference methods. All of these steer generation by contrasting two distributions and amplifying their difference. Contrastive decoding typically contrasts a strong "expert" model against a weaker "amateur" model, whereas CFG contrasts the same model conditioned versus unconditioned; both amplify a desirable signal while suppressing a baseline, and the two families are sometimes combined [9]. More broadly, CFG sits within the larger class of inference-time steering techniques that improve generation quality without retraining, alongside truncation, temperature control, and reward-guided decoding.

## References

1. Ho, J., and Salimans, T. "Classifier-Free Diffusion Guidance." arXiv:2207.12598, July 2022 (short version at the NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications). https://arxiv.org/abs/2207.12598
2. Dhariwal, P., and Nichol, A. "Diffusion Models Beat GANs on Image Synthesis." arXiv:2105.05233, May 2021. https://arxiv.org/abs/2105.05233
3. Nichol, A., et al. "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models." arXiv:2112.10741, December 2021. https://arxiv.org/abs/2112.10741
4. "Stable Diffusion with Diffusers." Hugging Face Blog. https://huggingface.co/blog/stable_diffusion
5. "What Is CFG Scale in Stable Diffusion? Complete Guide." AI Photo Generator, 2026. https://www.aiphotogenerator.net/blog/2026/02/what-is-cfg-scale-stable-diffusion
6. Saharia, C., et al. "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" (Imagen). NeurIPS 2022. https://papers.neurips.cc/paper_files/paper/2022/file/ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf
7. Yang, et al. "Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency." arXiv:2506.21452, 2025. https://arxiv.org/abs/2506.21452
8. Ramesh, A., et al. "Hierarchical Text-Conditional Image Generation with CLIP Latents" (DALL-E 2 / unCLIP). arXiv:2204.06125, April 2022. https://arxiv.org/abs/2204.06125
9. Sanchez, G., et al. "Stay on Topic with Classifier-Free Guidance." arXiv:2306.17806, June 2023 (ICML 2024). https://arxiv.org/abs/2306.17806
10. Lin, S., Liu, B., Li, J., and Yang, X. "Common Diffusion Noise Schedules and Sample Steps are Flawed." arXiv:2305.08891, May 2023. https://arxiv.org/abs/2305.08891
11. Kynkäänniemi, T., et al. "Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models." arXiv:2404.07724, April 2024. https://arxiv.org/abs/2404.07724

