DreamBooth

Updated Jun 25, 2026

DreamBooth is a subject-driven fine-tuning method for text-to-image diffusion models. It personalizes a pretrained model to a specific subject (for example a particular dog, toy, or person) from just 3 to 5 casual photographs, and then renders that subject in new scenes, poses, lighting, and styles. It works by fine-tuning the model's weights to bind the subject to a rare unique identifier token paired with a class noun (the prompt form "a [V] dog"), while a class-specific prior-preservation loss keeps the model from forgetting the broader class. DreamBooth was introduced by Google Research in August 2022 and presented at CVPR 2023; together with its LoRA variant, it became a dominant practical recipe for training custom image models ^[1]^[2]^[4].

What is DreamBooth?

DreamBooth is a method for subject-driven generation. Given typically 3 to 5 casual photographs of a subject, it fine-tunes the weights of the entire generative model to bind the subject to a unique textual identifier, after which prompts such as "a photo of a [V] dog on the beach" synthesize novel images of that exact instance in new scenes, poses, lighting, and styles ^[1]^[2]. The Hugging Face Diffusers documentation summarizes it succinctly: "DreamBooth is a training technique that updates the entire diffusion model by training on just a few images of a subject or style" ^[2].

The technique was introduced in the paper "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation" by Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. It was first posted to arXiv on 25 August 2022, revised on 15 March 2023, and published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023. The authors were at Google Research; lead author Ruiz was also affiliated with Boston University ^[1]^[3]. The name evokes a photo booth: the project tagline describes it as "like a photo booth, but once the subject is captured, it can be synthesized wherever your dreams take you" ^[3].

DreamBooth became one of the most widely adopted personalization techniques after the public release of Stable Diffusion in 2022, even though the original paper used Google's Imagen as its primary model. Together with textual inversion, it defined the first generation of consumer-facing fine-tuning workflows for diffusion models, and its combination with LoRA remains a dominant practical recipe for training custom image models ^[2]^[4].

Why was DreamBooth created? (motivation: subject-driven generation)

Large text-to-image models trained on web-scale data can synthesize highly varied images from natural language, but they cannot depict a specific subject that they have never seen. As the paper states, "these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts" ^[1]. A user can prompt for "a dog," but the model has no way to render that user's particular dog with its unique markings and proportions. Detailed text descriptions are insufficient, because language cannot precisely specify an individual instance, and the models lack the ability to reconstruct an exact appearance from a caption alone ^[1].

DreamBooth frames this as a personalization problem: teach a pretrained model a new subject from a small reference set, then exploit the model's existing semantic knowledge, its "prior," to place that subject in contexts that never appeared in the references. The paper demonstrates several capabilities once a subject is learned: recontextualization (placing the subject in new environments), text-guided view synthesis (generating unseen viewpoints), artistic rendition in the style of various painters, and property modification such as changing color or accessorizing the subject ^[1]^[3]. The key requirement is high subject fidelity: the synthesized instance must preserve the distinctive identifying details of the reference subject, while still responding flexibly to the prompt.

How does DreamBooth work?

Unique identifier

DreamBooth associates the subject with a rare token used as a unique identifier, written in the literature as [V], and pairs the subject images with a structured prompt of the form "a [V] [class noun]," for example "a [V] dog." Including the coarse class noun (dog) lets the model reuse its existing prior about dogs, which both speeds learning and improves quality, while the identifier [V] carries the specific instance ^[1].

Choosing a good identifier matters. The authors warn that common English words such as "unique" or "special" are suboptimal, because the model must first unlearn their existing meaning. Their approach is to find rare tokens in the tokenizer vocabulary and invert them into text space, minimizing the chance that the identifier already carries a strong prior. In community implementations, including the Hugging Face Diffusers training script, the short string "sks" became the conventional default identifier (for example the demo prompt "a photo of sks dog") ^[1]^[2].
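As a small illustration of the prompt convention, the sketch below builds the instance and class prompts and assembles arguments for a prior-preservation run of the Diffusers `train_dreambooth.py` example script. The helper names (`build_prompts`, `dreambooth_args`), the model name, and all paths are hypothetical placeholders. The flags are the ones this sketch assumes the example script accepts, so they should be checked against the installed Diffusers version.

```python
import shlex


def build_prompts(identifier: str, class_noun: str) -> tuple[str, str]:
    """Return (instance_prompt, class_prompt) for a subject and its class."""
    return f"a photo of {identifier} {class_noun}", f"a photo of {class_noun}"


def dreambooth_args(model: str, instance_dir: str, class_dir: str,
                    output_dir: str, identifier: str = "sks",
                    class_noun: str = "dog") -> list[str]:
    """Assemble CLI arguments for a prior-preservation training run."""
    instance_prompt, class_prompt = build_prompts(identifier, class_noun)
    return [
        "--pretrained_model_name_or_path", model,
        "--instance_data_dir", instance_dir,
        "--class_data_dir", class_dir,
        "--output_dir", output_dir,
        "--instance_prompt", instance_prompt,   # carries the rare identifier
        "--class_prompt", class_prompt,         # plain class, no identifier
        "--with_prior_preservation",
        "--prior_loss_weight", "1.0",
        "--learning_rate", "5e-6",              # Stable Diffusion value quoted in this article
        "--max_train_steps", "1000",            # iteration count quoted in this article
    ]


if __name__ == "__main__":
    # Every path and the model name here are placeholders, not real locations.
    args = dreambooth_args("path/to/base-model", "./subject_photos",
                           "./class_images", "./dreambooth_out")
    print("accelerate launch train_dreambooth.py " + shlex.join(args))
```

Keeping the identifier out of the class prompt is the point of the split: the class prompt must describe the generic class so that the prior-preservation images are not tied to the subject.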

Class-specific prior preservation loss

Naively fine-tuning a model on a few images of one subject causes two characteristic failures that the paper names explicitly. The first is language drift: the fine-tuned model gradually forgets how to generate other members of the subject's class, so prompts for any dog start producing the specific subject dog. The second is reduced output diversity, a form of overfitting in which the model collapses onto the few training viewpoints and can no longer pose or vary the subject ^[1].

DreamBooth's central contribution is an autogenous (self-generated) class-specific prior preservation loss that counteracts both problems. As the Diffusers documentation puts it, "Prior preservation loss is a method that uses a model's own generated samples to help it learn how to generate more diverse images" ^[2]. Before or during training, the frozen pretrained model generates its own samples of the broad class from the simple prompt "a [class noun]," for example roughly 1,000 images of generic dogs drawn with an ancestral sampler. The fine-tuning objective then combines two terms:

A reconstruction term that fits the model to the user's subject images under the prompt "a [V] [class noun]."
A prior preservation term that supervises the model on the self-generated class images under the prompt "a [class noun]," weighted by a coefficient lambda (the paper uses lambda = 1, the same default exposed in Diffusers as prior_loss_weight = 1.0).
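
Written out, the two terms can be sketched as a single objective. The notation is an assumption of this sketch, following the usual denoising form: x̂_θ is the model being fine-tuned, x is a subject image with prompt conditioning c, x_pr is a class image generated by the frozen model with conditioning c_pr, α_t and σ_t define the noise schedule, w_t is a per-timestep weight, and ε, ε' are independent Gaussian noise samples.

```latex
\mathcal{L}(\theta) =
\mathbb{E}_{x,\, c,\, \epsilon,\, \epsilon',\, t,\, t'} \Big[
  w_t \,\big\| \hat{x}_\theta(\alpha_t x + \sigma_t \epsilon,\; c) - x \big\|_2^2
  \;+\; \lambda\, w_{t'} \,\big\| \hat{x}_\theta(\alpha_{t'} x_{\mathrm{pr}} + \sigma_{t'} \epsilon',\; c_{\mathrm{pr}}) - x_{\mathrm{pr}} \big\|_2^2
\Big]
```

The first norm is the reconstruction term on the subject photos and the second is the prior preservation term on the self-generated class images, scaled by λ.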

The second term anchors the model to its original knowledge of the class while the first term injects the new subject, preserving class diversity and preventing the identifier from contaminating the whole class. The following table summarizes the two components.

Loss term	Conditioning prompt	Target images	Purpose
Reconstruction	"a [V] [class noun]"	3 to 5 user subject photos	Learn the specific subject
Prior preservation	"a [class noun]"	~1,000 class images from the frozen model	Prevent language drift and overfitting
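
The weighting of the two terms can be made concrete with a toy calculation. This is a minimal sketch rather than the training code of any library: `dreambooth_loss` is a hypothetical helper, and the arrays stand in for model predictions and regression targets on a subject batch and a class (prior) batch.

```python
import numpy as np


def dreambooth_loss(pred_subject, target_subject, pred_prior, target_prior,
                    prior_loss_weight=1.0):
    """Return (total, reconstruction, prior) mean-squared-error losses.

    The subject pair is conditioned on "a [V] [class noun]"; the prior pair
    is conditioned on "a [class noun]" and comes from self-generated images.
    """
    reconstruction = float(np.mean((pred_subject - target_subject) ** 2))
    prior = float(np.mean((pred_prior - target_prior) ** 2))
    total = reconstruction + prior_loss_weight * prior
    return total, reconstruction, prior


# Placeholder tensors: every subject error is 1, every prior error is 2.
subject_pred, subject_target = np.ones((2, 3)), np.zeros((2, 3))
prior_pred, prior_target = 2 * np.ones((2, 3)), np.zeros((2, 3))

total, rec, pri = dreambooth_loss(subject_pred, subject_target,
                                  prior_pred, prior_target)
print(total, rec, pri)
```

Setting `prior_loss_weight` to 0 recovers naive fine-tuning on the subject images alone, which is the setting in which language drift and reduced diversity appear.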

Fine-tuning scope and training cost

For maximum subject fidelity, DreamBooth fine-tunes all layers of the model, including the layers conditioned on the text embeddings, rather than only a subset. When applied to Imagen, which is a cascaded diffusion model, this means fine-tuning both the base text-to-image module and the super-resolution modules so that fine details remain faithful at high resolution ^[1].

Training is fast and data-light. The paper reports roughly 1,000 training iterations at a learning rate of about 1e-5 for Imagen and about 5e-6 for Stable Diffusion, with the Imagen run taking on the order of 5 minutes on one TPUv4. For evaluation, the authors also introduce DreamBench, a set of 30 subjects (21 objects and 9 live subjects) with 25 prompts each, scored with DINO and CLIP-I for subject fidelity and CLIP-T for prompt fidelity ^[1].
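
The image-similarity scores can be read as an average of pairwise cosine similarities between embeddings of generated and real images. The sketch below assumes the embeddings have already been extracted by some image encoder (DINO or CLIP in the paper). `mean_pairwise_cosine` is a hypothetical helper, and the small arrays are placeholders rather than real features.

```python
import numpy as np


def mean_pairwise_cosine(generated, real):
    """Average cosine similarity over every (generated, real) embedding pair.

    generated: array of shape (n_generated, dim)
    real:      array of shape (n_real, dim)
    """
    g = generated / np.linalg.norm(generated, axis=1, keepdims=True)
    r = real / np.linalg.norm(real, axis=1, keepdims=True)
    return float(np.mean(g @ r.T))  # (n_generated, n_real) similarity matrix


# Placeholder embeddings: one generated image matches the reference, one is orthogonal.
generated = np.array([[1.0, 0.0], [0.0, 1.0]])
real = np.array([[2.0, 0.0]])
print(mean_pairwise_cosine(generated, real))
```

A prompt-fidelity score in the style of CLIP-T follows the same pattern, with one side of the comparison replaced by the text embedding of the prompt.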

How does DreamBooth compare with textual inversion and LoRA?

DreamBooth is frequently contrasted with textual inversion, a concurrent 2022 personalization method, and with LoRA, which is now most often layered on top of DreamBooth. The three approaches differ fundamentally in what they modify: textual inversion learns only a new word embedding, DreamBooth updates the full model weights, and DreamBooth-LoRA updates only small injected low-rank matrices ^[2]^[4].

How does DreamBooth differ from textual inversion?

Textual inversion freezes the entire diffusion model and learns only a single new word embedding vector that points to the subject in the model's existing text-embedding space. DreamBooth instead leaves the vocabulary largely fixed and updates the model weights themselves ^[2]^[4].

This difference drives their trade-offs. Textual inversion is extremely lightweight: the learned artifact is a small embedding of a few kilobytes that can be shared and composed easily, but because the underlying model is never changed, it often achieves lower subject fidelity and can struggle to capture fine details. DreamBooth changes the weights and so reaches markedly higher subject and prompt fidelity, but at the cost of storing full model weights per subject, which is far heavier ^[2]^[4].

Property	DreamBooth	Textual inversion
What is trained	Full model weights	A single token embedding
Subject fidelity	High	Lower
Artifact size	Full checkpoint (gigabytes)	A few kilobytes
Risk	Overfitting, language drift, storage	Limited expressiveness
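
The gap in artifact size follows from simple arithmetic on parameter counts. The figures in this sketch are assumptions for illustration: a 768-dimensional token embedding (the text-embedding width of Stable Diffusion v1) and roughly 860 million U-Net parameters, both stored as 32-bit floats. A full checkpoint also holds the text encoder and VAE, so the U-Net figure is a lower bound.

```python
def size_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Storage needed for num_params values at the given precision."""
    return num_params * bytes_per_param


EMBED_DIM = 768             # assumed embedding width (Stable Diffusion v1)
UNET_PARAMS = 860_000_000   # assumed, approximate U-Net parameter count

embedding_kb = size_bytes(EMBED_DIM) / 1024      # textual inversion artifact
unet_gb = size_bytes(UNET_PARAMS) / 1024 ** 3    # U-Net weights alone

print(f"textual inversion embedding: {embedding_kb:.1f} KiB")
print(f"fully fine-tuned U-Net: {unet_gb:.2f} GiB")
```

Under these assumptions the two artifacts differ by about six orders of magnitude, which is why per-subject storage is listed as a risk for full DreamBooth but not for textual inversion.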

How is DreamBooth used with LoRA?

The main practical drawback of full DreamBooth fine-tuning is storage: each Stable Diffusion DreamBooth checkpoint contains the entire model, typically several gigabytes, so every new subject costs another multi-gigabyte file ^[2]. The standard remedy is to combine DreamBooth with LoRA (Low-Rank Adaptation). Instead of updating all weights, DreamBooth-LoRA freezes the base model and injects small trainable low-rank matrices into the attention layers of the diffusion U-Net, training only those added parameters with the same DreamBooth objective, including prior preservation ^[4].

DreamBooth-LoRA preserves most of the subject fidelity of full fine-tuning while shrinking the trainable parameters and the resulting artifact by orders of magnitude. The Diffusers documentation notes that with LoRA "training is faster and it is easier to store the resulting weights because they are a lot smaller (~100MBs)" ^[2]. Low-rank adapters can be reduced further, to a few megabytes, and the memory savings are enough to train on a single consumer GPU. Because LoRA adapters are small and modular, they can be distributed, swapped, and stacked, which is why DreamBooth-LoRA is the form most users encounter today. The Hugging Face Diffusers library supports it as a first-class training path alongside full DreamBooth, including SDXL and DeepFloyd IF variants ^[2]^[4].
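
The low-rank update itself is small enough to sketch in a few lines. This is an illustrative NumPy version of the general LoRA idea, not the code of any particular library: `lora_forward` and `param_counts` are hypothetical names, and the 768-wide layer with rank 4 is an assumed example size.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha=1.0):
    """Apply a frozen linear layer W plus a trainable low-rank update.

    Shapes: x (batch, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r).
    Equivalent to using the merged weight W + (alpha / r) * B @ A.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


def param_counts(d_out, d_in, r):
    """Return (full layer parameters, LoRA adapter parameters)."""
    return d_out * d_in, r * (d_in + d_out)


rng = np.random.default_rng(0)
d_in, d_out, rank = 768, 768, 4          # assumed example sizes
x = rng.normal(size=(2, d_in))
W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(rank, d_in))        # trainable
B = np.zeros((d_out, rank))              # trainable, zero at initialization

# With B initialized to zero, the adapted layer starts out identical to the base layer.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)

full, adapter = param_counts(d_out, d_in, rank)
print(full, adapter, full // adapter)
```

Only `A` and `B` receive gradients, so the saved artifact scales with the rank rather than with the size of the layer, which is what makes adapters cheap to store, swap, and stack.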

What are DreamBooth's limitations and impact?

DreamBooth's limitations follow directly from its mechanism. Overfitting remains a risk when the reference set is small or training runs too long, manifesting as reduced pose and context diversity or as the subject's environment leaking into outputs. The Diffusers maintainers warn that "DreamBooth is very sensitive to training hyperparameters, and it is easy to overfit" ^[2]. Even with prior preservation, some language drift and degradation of the broader class can occur. The original full-fine-tuning formulation also incurs heavy storage, which the LoRA variant largely addresses. Additional reported failure modes include difficulty with rare or complex subjects, occasional inability to faithfully render fine details, and a tendency to blend subject and context for uncommon prompt combinations ^[1].

Despite these constraints, DreamBooth had an outsized impact. It demonstrated that personalizing a powerful text-to-image model to a new subject required only a few images and a few minutes of training, and it provided a principled objective (the prior preservation loss) for doing so without destroying the model's general knowledge. After the open release of Stable Diffusion, DreamBooth and its LoRA variant drove an enormous wave of community fine-tuning, custom subject and style models, and avatar and product-imagery applications. The authors themselves followed up in 2023 with HyperDreamBooth, which uses a hypernetwork to personalize faces in roughly 20 seconds from as little as one image. It is reported as about 25 times faster than DreamBooth and 125 times faster than textual inversion, and its adapter (Lightweight DreamBooth, roughly 100KB) is many thousands of times smaller than a full DreamBooth model ^[4]^[5]. DreamBooth's identifier-plus-class prompting and prior-preservation strategy remain reference points for subsequent personalization and customization research.

References

1. N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation," arXiv:2208.12242, 2022. https://arxiv.org/abs/2208.12242
2. Hugging Face, "DreamBooth," Diffusers documentation. https://huggingface.co/docs/diffusers/training/dreambooth
3. N. Ruiz et al., "DreamBooth" project page, Google Research. https://dreambooth.github.io/
4. N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. https://openaccess.thecvf.com/content/CVPR2023/html/Ruiz_DreamBooth_Fine_Tuning_Text-to-Image_Diffusion_Models_for_Subject-Driven_Generation_CVPR_2023_paper.html
5. N. Ruiz et al., "HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models," arXiv:2307.06949, 2023. https://arxiv.org/abs/2307.06949
