Textual Inversion

Updated Jun 27, 2026

Textual Inversion is a technique for personalizing text-to-image diffusion models that teaches a frozen model a new visual concept from only three to five example images by learning a single new "pseudo-word" embedding vector in the text encoder's embedding space, without changing any of the model's weights ^[1]^[2]. The learned embedding can then be referenced by its keyword in ordinary prompts, letting a user generate, restyle, or recompose that specific concept (an object, a person, or an artistic style) just as if it were a real word in the model's vocabulary ^[1]. It was introduced in the August 2022 paper "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion" by Rinon Gal and colleagues at Tel Aviv University and NVIDIA, and it became one of the first widely shared personalization formats in the open Stable Diffusion community ^[1]^[4].

What is Textual Inversion?

Textual Inversion is a personalization method for text-to-image generation that teaches a frozen diffusion model a new, user-supplied visual concept, such as a specific object or an artistic style, from only three to five example images. Instead of changing any of the model's weights, the method introduces a single new "pseudo-word" into the vocabulary of the model's text encoder and optimizes only that word's embedding vector so that it reconstructs the example images. The learned embedding can then be dropped into ordinary text prompts like any other word, letting a user generate new images of the concept, restyle it, or compose it into novel scenes ^[1]^[2]. As the paper states, "Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new 'words' in the embedding space of a frozen text-to-image model" ^[1].

The technique was introduced in the 2022 paper "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion" by Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or, a collaboration between Tel Aviv University and NVIDIA. It was posted to arXiv on August 2, 2022, and presented as an oral paper at the International Conference on Learning Representations (ICLR) in 2023 ^[1]^[3]. Because it learns nothing but an embedding vector, a Textual Inversion concept is extremely small (on the order of kilobytes) and portable, which made it one of the first widely shared personalization formats in the open Stable Diffusion community ^[4]^[5].

The name "Textual Inversion" reflects the core idea: rather than inverting an image back into a model's latent space (as in classical GAN inversion), the method inverts a small image set into the text-embedding space of a frozen generator, finding the embedding that, when used as a prompt token, causes the model to produce the target concept ^[1].

Why was Textual Inversion created? (personalization)

Large pretrained text-to-image models such as Stable Diffusion and Latent Diffusion can render an enormous range of subjects described in natural language, but they cannot, on their own, depict a specific personal concept that was not in their training distribution: a particular pet, a unique handmade toy, a friend's face, or a niche artistic style. Plain text is often too coarse to pin down such an instance, and simply describing it in words rarely recovers the exact appearance ^[1].

The goal of Textual Inversion is to bridge this gap with minimal cost. Given just a handful of casual photographs of one concept, the method should produce a compact handle, a new word, that the frozen model already "understands" how to combine with the rest of language. The authors framed the design around two requirements: the new representation must capture fine visual detail of the concept, and it must remain composable, so that prompts like "a photo of S* on the beach" or "an oil painting in the style of S*" behave as expected. They found, somewhat surprisingly, that a single learned word embedding is often sufficient to capture varied and detailed concepts. As the paper reports, "we find evidence that a single word embedding is sufficient for capturing unique and varied concepts" ^[1].

How does Textual Inversion work?

Textual Inversion operates on the text-conditioning pathway of a latent diffusion model. In a typical model, a text prompt is tokenized, each token is mapped to a continuous embedding vector through an embedding lookup table, and these embeddings are passed through a text encoder to produce the conditioning that guides the denoising diffusion process. In Stable Diffusion v1, the text encoder is a CLIP text encoder whose token embeddings and output conditioning vectors both have dimension 768 ^[1]^[2].
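As a rough illustration of this pathway, the sketch below registers a pseudo-word in a toy vocabulary and looks up the embeddings for a prompt. Everything in it is an assumption made for the example: the seven-word vocabulary, the 8-dimensional table, the whitespace tokenizer, and the helper names `add_placeholder` and `embed_prompt`. A real pipeline uses a subword tokenizer and then passes these vectors through the text encoder, which is omitted here.

```python
import random

rng = random.Random(0)
EMBED_DIM = 8  # toy width; the article cites 768 for Stable Diffusion's CLIP encoder

# Hypothetical toy vocabulary and embedding table (one row per token id).
vocab = {"a": 0, "photo": 1, "of": 2, "toy": 3, "on": 4, "the": 5, "beach": 6}
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab]


def add_placeholder(token, init_word):
    """Register a new pseudo-word and initialize its row from a coarse descriptor."""
    vocab[token] = len(embedding_table)
    # Copy the descriptor's row so later training would not alter the original word.
    embedding_table.append(list(embedding_table[vocab[init_word]]))
    return vocab[token]


def embed_prompt(prompt):
    """Whitespace tokenization followed by an embedding-table lookup."""
    return [embedding_table[vocab[word]] for word in prompt.split()]


placeholder_id = add_placeholder("S*", init_word="toy")
token_vectors = embed_prompt("a photo of S* on the beach")
print(f"{len(token_vectors)} tokens, placeholder stored at row {placeholder_id}")
```

The placeholder starts as a copy of the "toy" row rather than a reference to it, so the two can diverge once only the new row is trained.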

The method proceeds as follows:

1. A new placeholder token, written in the paper as "S*", is added to the tokenizer's vocabulary, along with a fresh row in the embedding table. This new embedding is the only parameter that will be trained. It is usually initialized from the embedding of a coarse descriptor word (for example, "toy" or "sculpture") to give optimization a reasonable starting point ^[1]^[6].
2. Every other component is frozen: the embedding table for all existing tokens, the rest of the text encoder, the diffusion U-Net (the noise predictor), and the variational autoencoder. No model weights change during training ^[1]^[2].
3. The few concept images are paired with simple templated prompts containing the placeholder, such as "a photo of S*". For each training step, an image is encoded to the latent space, noise is added at a randomly sampled timestep, and the frozen U-Net is asked to predict that noise while conditioned on the prompt embedding (which now includes the trainable vector). The objective is exactly the same latent diffusion reconstruction (denoising) loss used to train the original model, and gradients flow back only into the single placeholder embedding ^[1]^[2].
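The objective in the last step above can be written compactly. The notation here is a summary and partly an assumption about symbol choice: $v$ is the placeholder embedding, $\mathcal{E}$ the frozen VAE encoder, $x$ an example image, $y$ a templated prompt containing the placeholder, $z_t$ the latent noised to timestep $t$, $c_\theta$ the frozen text encoder, and $\epsilon_\theta$ the frozen noise predictor.

```latex
v_* = \arg\min_{v} \;
      \mathbb{E}_{z \sim \mathcal{E}(x),\; y,\; \epsilon \sim \mathcal{N}(0, 1),\; t}
      \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, c_\theta(y)\right) \right\rVert_2^2 \right]
```

The only difference from ordinary latent diffusion training is the variable being optimized: the minimization runs over $v$ alone, while $c_\theta$ and $\epsilon_\theta$ stay fixed.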

By minimizing this loss, the optimizer searches the text-embedding space for the vector that best explains the example images. The original paper used a batch size of 8 and ran roughly 5,000 optimization steps with a base learning rate of about 0.005, taking on the order of a couple of GPU-hours per concept on the hardware of the time; community implementations report similar runs of about an hour on a single modern GPU ^[1]^[2]^[6]. The end product is one embedding vector, or a small number of them, saved as a file of only a few kilobytes that can be loaded into the model and referenced by the chosen keyword in any prompt ^[2]^[4].
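The optimization can be sketched end to end on a toy problem. This is not the paper's implementation: a hand-written linear map stands in for the text encoder, U-Net, and VAE, the latents are four numbers each, and the step count and learning rate are picked for the toy rather than taken from the paper. All names and values are hypothetical. What the sketch does preserve is the structure of the method: noise is added to an example latent, a frozen model predicts that noise from the prompt, and the gradient of the noise-prediction loss updates the placeholder vector and nothing else.

```python
import random

rng = random.Random(0)

# Frozen toy "model" (hypothetical values): embeddings for the context words of
# "a photo of S*" and a linear map W from the pooled prompt embedding to a latent.
context_embeddings = [[0.1, -0.3, 0.2, 0.0], [0.4, 0.1, -0.2, 0.3], [-0.2, 0.2, 0.1, -0.1]]
W = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
N_TOKENS = len(context_embeddings) + 1

# Three latents of one concept (stand-ins for VAE-encoded example images).
example_latents = [[3.0, -2.0, 1.0, 4.0], [3.1, -1.9, 0.9, 4.1], [2.9, -2.1, 1.1, 3.9]]

# The only trainable parameter, initialized as if from a coarse descriptor word.
placeholder = [0.5, 0.5, -0.5, 0.5]


def clean_guess(v):
    """Pool the prompt's token embeddings, then map them to a clean-latent guess."""
    tokens = context_embeddings + [v]
    pooled = [sum(column) / N_TOKENS for column in zip(*tokens)]
    return [sum(w * c for w, c in zip(row, pooled)) for row in W]


def loss_and_grad(latent, v):
    """Noise-prediction loss for one noised sample and its gradient w.r.t. v only."""
    sigma = rng.uniform(0.5, 1.0)  # random noise level, standing in for a timestep
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [z + sigma * e for z, e in zip(latent, noise)]
    predicted = [(n - g) / sigma for n, g in zip(noisy, clean_guess(v))]
    residual = [p - e for p, e in zip(predicted, noise)]
    loss = sum(r * r for r in residual)
    # Chain rule by hand: d(loss)/dv = -(2 / (sigma * N_TOKENS)) * W^T residual.
    scale = -2.0 / (sigma * N_TOKENS)
    grad = [scale * sum(W[i][j] * residual[i] for i in range(len(W))) for j in range(len(v))]
    return loss, grad


def concept_error(v):
    """Mean squared distance between the clean-latent guess and the examples."""
    guess = clean_guess(v)
    total = sum(sum((g - z) ** 2 for g, z in zip(guess, lat)) for lat in example_latents)
    return total / len(example_latents)


initial_error = concept_error(placeholder)
frozen_snapshot = ([list(e) for e in context_embeddings], [list(r) for r in W])

LEARNING_RATE = 0.5
for _ in range(200):
    latent = rng.choice(example_latents)
    _, grad = loss_and_grad(latent, placeholder)
    placeholder = [p - LEARNING_RATE * g for p, g in zip(placeholder, grad)]  # only v moves

final_error = concept_error(placeholder)
print(f"concept error: {initial_error:.2f} -> {final_error:.4f}")
```

In this toy the error cannot fall to zero, because the three example latents differ slightly and one vector has to explain all of them. The same tension appears in practice when a single embedding must account for every example image.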

Textual Inversion vs DreamBooth: what is the difference?

Textual Inversion is frequently compared with DreamBooth, a concurrent 2022 personalization method by Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman of Google Research and Boston University (arXiv 2208.12242, posted August 25, 2022). Both methods learn a concept from only three to five images, but they make opposite trade-offs: Textual Inversion keeps the diffusion model frozen and trains only a new word embedding, while DreamBooth fine-tunes the model's weights ^[7].

Aspect	Textual Inversion	DreamBooth
What is trained	A single new word embedding (768-dim)	The diffusion model weights (often the full U-Net)
Model weights changed	None; model fully frozen	Yes; model is fine-tuned
Concept identifier	A new optimized pseudo-word ("S*")	A rare-token identifier bound to a class noun
Output artifact size	A few kilobytes	A full or partial model checkpoint (often gigabytes)
Prior preservation	Not needed	Uses a class-specific prior-preservation loss to limit drift
Typical fidelity	Lower subject and detail fidelity	Higher subject and prompt fidelity
Portability	Very high; embeddings stack and share easily	Lower; large, model-specific checkpoints

Because Textual Inversion only moves a point in embedding space and never touches the generator, it cannot add genuinely new visual capacity to the model; it can only express the concept in terms the frozen model can already render. DreamBooth, by fine-tuning weights, can memorize subject detail more faithfully and is generally more expressive, at the cost of much larger artifacts, greater compute, and the risk of overfitting or "language drift," which it counters with a prior-preservation regularization term over the subject's class ^[7]. Measured comparisons report sizeable gaps favoring DreamBooth: the DreamBooth authors found that DreamBooth scores higher on the DINO and CLIP-T similarity metrics and that users prefer it for both subject fidelity and prompt fidelity over Textual Inversion ^[7]. The two are nonetheless complementary: a Textual Inversion embedding can be used alongside a fine-tuned model to add linguistic nuance ^[7]. A later alternative, LoRA adapters, became popular as a middle ground that approaches DreamBooth quality while keeping files small and shareable: far lighter than a full checkpoint, though larger than a single embedding.
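The size gap between the two artifacts follows from simple arithmetic. The sketch below assumes a 768-dimensional embedding stored in 32-bit floats and a U-Net of roughly 860 million parameters stored in 16-bit floats. The parameter count is an approximate outside figure for Stable Diffusion v1, not a number from the sources cited here, and real files carry some extra metadata.

```python
EMBED_DIM = 768            # embedding width cited in the article for Stable Diffusion
BYTES_FP32 = 4
BYTES_FP16 = 2
UNET_PARAMS = 860_000_000  # assumption: approximate Stable Diffusion v1 U-Net size

embedding_bytes = EMBED_DIM * BYTES_FP32   # one learned vector
unet_bytes = UNET_PARAMS * BYTES_FP16      # a fully fine-tuned U-Net

print(f"Textual Inversion vector: {embedding_bytes / 1024:.1f} KiB")
print(f"Fine-tuned U-Net (fp16):  {unet_bytes / 1e9:.2f} GB")
print(f"Trainable parameters:     {UNET_PARAMS // EMBED_DIM:,} times more for the U-Net")
```

Under these assumptions the embedding is about 3 KiB against well over a gigabyte for the U-Net, which matches the "few kilobytes" versus "often gigabytes" contrast in the table.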

What are the limitations of Textual Inversion?

The authors and subsequent users identified several limitations ^[1]:

Per-concept optimization is comparatively slow, since each new concept requires its own training run of thousands of steps rather than a single forward pass.
The frozen-model constraint caps fidelity. A single embedding can struggle to capture very fine or intricate detail, and the reconstructed concept may drift from the reference, especially for complex subjects or precise shapes.
Results are sensitive to initialization (the coarse descriptor word) and to the optimization hyperparameters.
Compositional and relational prompts can fail. The project authors explicitly noted that relational prompts do not reliably work, writing: "Unfortunately, this doesn't yet work for relational prompts, so we can't show you our cat on a fishing trip with our clock" ^[6].
As with all such personalization methods, learning a concept from a person's photographs raises identity, consent, and copyright concerns.

How widely is Textual Inversion used? (impact and adoption)

Textual Inversion arrived just as Stable Diffusion was released publicly in 2022, and its tiny, model-friendly output made it an immediate fit for that ecosystem. The method was implemented in widely used tooling, including the Hugging Face Diffusers library and the AUTOMATIC1111 Stable Diffusion web UI, where learned vectors are commonly called "embeddings" ^[2]^[5].

A large community of shared embeddings grew quickly. The Stable Diffusion Concepts Library (the sd-concepts-library on Hugging Face) accumulated on the order of a thousand community-contributed Textual Inversion files covering styles, objects, and characters, which any user can load by keyword ^[4]^[5]. A distinctive and heavily used variant is the "negative embedding," a Textual Inversion vector trained to represent undesirable image traits (artifacts, bad anatomy) that is then placed in the negative prompt to steer generations away from those traits; the widely circulated EasyNegative embedding is a prominent example ^[5]^[8].
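How a negative embedding steers generation can be illustrated with the classifier-free guidance step. In common Stable Diffusion implementations the negative prompt takes the place of the empty "unconditional" prompt in that step. This is a description of typical tooling rather than something stated in the sources above, and the function name and numbers below are made up for the example.

```python
def guided_noise(eps_positive, eps_negative, guidance_scale):
    """Classifier-free guidance with the negative prompt as the baseline branch."""
    return [n + guidance_scale * (p - n) for p, n in zip(eps_positive, eps_negative)]


# Hypothetical noise predictions from the same frozen denoiser under two prompts.
eps_positive = [0.2, -0.1, 0.4]  # conditioned on the user's prompt
eps_negative = [0.5, -0.1, 0.0]  # conditioned on a prompt containing the negative embedding

eps = guided_noise(eps_positive, eps_negative, guidance_scale=7.5)
print(eps)
```

Each component is pushed away from the negative prediction and toward the positive one, scaled by the guidance strength. Where the two predictions agree, as in the second component, guidance leaves the value unchanged.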

Beyond practical use, Textual Inversion influenced a line of research on text-to-image personalization and inversion. Follow-up work extended it with richer per-layer or per-timestep conditioning, faster or gradient-free optimization, and combinations with weight fine-tuning, while the underlying idea, optimizing in a frozen model's conditioning space to capture a concept, became a standard baseline in the personalization literature ^[1]^[7].

References

1. Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion." arXiv:2208.01618, August 2, 2022. https://arxiv.org/abs/2208.01618
2. Hugging Face. "Textual Inversion" (Diffusers training guide). https://huggingface.co/docs/diffusers/training/text_inversion
3. ICLR 2023. "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion" (Oral). https://iclr.cc/virtual/2023/oral/12700
4. Hugging Face. "Stable Diffusion Concepts Library" (sd-concepts-library). https://huggingface.co/sd-concepts-library
5. Stable Diffusion Art. "How to use embeddings in Stable Diffusion." https://stable-diffusion-art.com/embedding/
6. Textual Inversion project page. "An Image is Worth One Word." https://textual-inversion.github.io/
7. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation." arXiv:2208.12242, August 25, 2022. https://arxiv.org/abs/2208.12242
8. gsdf. "EasyNegative" (negative Textual Inversion embedding). https://huggingface.co/datasets/gsdf/EasyNegative
