# Textual Inversion

> Source: https://aiwiki.ai/wiki/textual_inversion
> Updated: 2026-06-08
> Categories: Deep Learning, Generative AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

## Overview

Textual Inversion is a personalization method for [text-to-image](/wiki/text-to-image) generation that teaches a frozen [diffusion model](/wiki/diffusion_model) a new, user-supplied visual concept, such as a specific object or an artistic style, from only three to five example images. Instead of changing any of the model's weights, the method introduces a single new "pseudo-word" into the vocabulary of the model's text encoder and optimizes only that word's embedding vector so that it reconstructs the example images. The learned embedding can then be dropped into ordinary text prompts like any other word, letting a user generate new images of the concept, restyle it, or compose it into novel scenes [1][2].

The technique was introduced in the 2022 paper "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion" by Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or, a collaboration between [Tel Aviv University](/wiki/tel_aviv_university) and [NVIDIA](/wiki/nvidia). It was posted to arXiv on August 2, 2022, and presented as an oral paper at the International Conference on Learning Representations (ICLR) in 2023 [1][3]. Because it learns nothing but an embedding vector, a Textual Inversion concept is extremely small (on the order of kilobytes) and portable, which made it one of the first widely shared personalization formats in the open [Stable Diffusion](/wiki/stable_diffusion) community [4][5].

The name "Textual Inversion" reflects the core idea: rather than inverting an image back into a model's latent space (as in classical GAN inversion), the method inverts a small image set into the text-embedding space of a frozen generator, finding the embedding that, when used as a prompt token, causes the model to produce the target concept [1].

## Motivation: personalization

Large pretrained text-to-image models such as Stable Diffusion and Latent Diffusion can render an enormous range of subjects described in natural language, but they cannot, on their own, depict a specific personal concept that was not in their training distribution: a particular pet, a unique handmade toy, a friend's face, or a niche artistic style. Plain text is often too coarse to pin down such an instance, and simply describing it in words rarely recovers the exact appearance [1].

The goal of Textual Inversion is to bridge this gap with minimal cost. Given just a handful of casual photographs of one concept, the method should produce a compact handle, a new word, that the frozen model already "understands" how to combine with the rest of language. The authors framed the design around two requirements: the new representation must capture fine visual detail of the concept, and it must remain composable, so that prompts like "a photo of S* on the beach" or "an oil painting in the style of S*" behave as expected. They found, somewhat surprisingly, that a single learned word embedding is often sufficient to capture varied and detailed concepts [1].

## How Textual Inversion works

Textual Inversion operates on the text-conditioning pathway of a latent diffusion model. In a typical model, a text prompt is tokenized, each token is mapped to a continuous embedding vector through an embedding lookup table, and these embeddings are passed through a text encoder to produce the conditioning that guides the denoising diffusion process. In Stable Diffusion 1.x the text encoder is a [CLIP](/wiki/clip) model whose token embeddings and output conditioning vectors both have dimension 768 [1][2].

The method proceeds as follows:

1. A new placeholder token, written in the paper as "S*", is added to the tokenizer's vocabulary, along with a fresh row in the embedding table. This new embedding is the only parameter that will be trained. It is usually initialized from the embedding of a coarse descriptor word (for example, "toy" or "sculpture") to give optimization a reasonable starting point [1][6].
2. Every other component is frozen: the embedding table for all existing tokens, the rest of the text encoder, the diffusion U-Net (the noise predictor), and the variational autoencoder. No model weights change during training [1][2].
3. The few concept images are paired with simple templated prompts containing the placeholder, such as "a photo of S*". For each training step, an image is encoded to the latent space, noise is added at a randomly sampled timestep, and the frozen U-Net is asked to predict that noise while conditioned on the prompt embedding (which now includes the trainable vector). The objective is exactly the same [latent diffusion](/wiki/latent_diffusion) reconstruction (denoising) loss used to train the original model, and gradients flow back only into the single placeholder embedding [1][2].
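
The objective in step 3 can be written compactly. The following is a sketch in common latent-diffusion notation, mirroring the form given in the paper [1]; the symbols are introduced here for illustration and are not defined elsewhere in this article. $\mathcal{E}$ is the frozen image encoder, $z_t$ is the latent $z$ noised to timestep $t$, $\epsilon_\theta$ is the frozen noise predictor, $c_\theta$ is the frozen text encoder, $y$ is a templated prompt containing S*, and $v$ is the placeholder embedding:

$$
v_* = \arg\min_{v} \; \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, 1),\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta\big(z_t, t, c_\theta(y)\big) \right\rVert_2^2 \right]
$$

The loss has the same form as in ordinary latent diffusion training. The difference is that the minimization runs over $v$ alone, while the weights $\theta$ stay fixed.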

By minimizing this loss, the optimizer searches the text-embedding space for the vector that best explains the example images. The original paper used a batch size of 8 and ran roughly 5,000 optimization steps with a base learning rate of about 0.005, taking on the order of a couple of GPU-hours per concept on the hardware of the time [1][6]. Community implementations report runs of about an hour on a single modern GPU [2]. The end product is one embedding vector (or a small number of them), saved as a file of only a few kilobytes, that can be loaded into the model and referenced by the chosen keyword in any prompt [2][4].
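
To make the optimization concrete, here is a deliberately tiny stand-in in plain Python. It is not a diffusion model: the text encoder is replaced by a mean over token embeddings, the U-Net and autoencoder by one fixed linear map, and the denoising loss by a squared error against synthetic targets. All names (`table`, `W`, `encode`, `generate`, `targets`) and all numbers are hypothetical. The only point is to show a new row being added, initialized from a coarse word, and trained while everything else stays frozen.

```python
import random

random.seed(0)
DIM = 4  # toy embedding width (Stable Diffusion 1.x uses 768)
PROMPT = "a photo of S*"
N_TOKENS = len(PROMPT.split())

# Frozen pieces: a tiny embedding table and a fixed linear "generator".
vocab = {"a": 0, "photo": 1, "of": 2, "toy": 3}
table = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in vocab]
W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]

def encode(prompt):
    """Toy text encoder: mean of the prompt's token embeddings."""
    ids = [vocab[word] for word in prompt.split()]
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(DIM)]

def generate(cond):
    """Toy frozen generator: a fixed linear map of the conditioning."""
    return [sum(W[r][d] * cond[d] for d in range(DIM)) for r in range(DIM)]

# Step 1: add the placeholder token and a new row in the embedding table.
vocab["S*"] = len(table)
table.append([random.uniform(-1, 1) for _ in range(DIM)])  # hidden "true" concept

# Stand-in "example images": generator output for the hidden concept, plus jitter.
clean = generate(encode(PROMPT))
targets = [[x + random.gauss(0, 0.01) for x in clean] for _ in range(3)]

# Discard the hidden vector and initialize the row from the coarse word "toy".
table[vocab["S*"]] = list(table[vocab["toy"]])
frozen_snapshot = [row[:] for row in table[:-1]]  # step 2: everything else is frozen

def loss_and_grad():
    """Mean squared error over targets and its gradient w.r.t. the S* row."""
    out = generate(encode(PROMPT))
    scale = 2.0 / (N_TOKENS * len(targets))  # S* is one of N_TOKENS averaged tokens
    loss, grad = 0.0, [0.0] * DIM
    for target in targets:
        err = [o - t for o, t in zip(out, target)]
        loss += sum(e * e for e in err) / len(targets)
        for d in range(DIM):
            grad[d] += scale * sum(err[r] * W[r][d] for r in range(DIM))
    return loss, grad

# Step 3: gradient descent that updates the placeholder row and nothing else.
initial_loss, _ = loss_and_grad()
for _ in range(500):
    _, grad = loss_and_grad()
    row = table[vocab["S*"]]
    for d in range(DIM):
        row[d] -= 0.5 * grad[d]
final_loss, _ = loss_and_grad()

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}")
print("other rows unchanged:", table[:-1] == frozen_snapshot)
```

A real implementation would replace the hand-written gradient with automatic differentiation and would sample a fresh image, noise, and timestep at every step. The structure is otherwise the same: one trainable vector inside an otherwise fixed pipeline.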

## Comparison to DreamBooth

Textual Inversion is frequently compared with [DreamBooth](/wiki/dreambooth), a concurrent 2022 personalization method by Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman of Google Research and Boston University (arXiv 2208.12242, posted August 25, 2022). The two approaches target the same problem but make opposite trade-offs [7].

| Aspect | Textual Inversion | DreamBooth |
| --- | --- | --- |
| What is trained | A single new word embedding (768-dim in Stable Diffusion 1.x) | The diffusion model weights (often the full U-Net) |
| Model weights changed | None; model fully frozen | Yes; model is fine-tuned |
| Concept identifier | A new optimized pseudo-word ("S*") | A rare-token identifier bound to a class noun |
| Output artifact size | A few kilobytes | A full or partial model checkpoint (often gigabytes) |
| Prior preservation | Not needed | Uses a class-specific prior-preservation loss to limit drift |
| Typical fidelity | Lower subject and detail fidelity | Higher subject and prompt fidelity |
| Portability | Very high; embeddings stack and share easily | Lower; large, model-specific checkpoints |

Because Textual Inversion only moves a point in embedding space and never touches the generator, it cannot add genuinely new visual capacity to the model; it can only express the concept in terms the frozen model can already render. DreamBooth, by fine-tuning weights, can memorize subject detail more faithfully and is generally more expressive, at the cost of much larger artifacts, greater compute, and the risk of overfitting or "language drift," which it counters with a prior-preservation regularization term over the subject's class [7]. Measured comparisons in the DreamBooth paper and later studies report sizeable gaps in subject and prompt fidelity favoring DreamBooth, though the two are complementary: a Textual Inversion embedding can be used alongside a fine-tuned model to add linguistic nuance [7]. [LoRA](/wiki/lora) adapters later became popular as a middle ground: their files are far smaller than a full DreamBooth checkpoint, though larger than a Textual Inversion embedding, and they approach DreamBooth quality while remaining easy to share.
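
The artifact-size gap in the table can be checked with back-of-the-envelope arithmetic. The sketch below counts raw tensor bytes only and ignores file-format overhead. The U-Net parameter count of roughly 860 million is an assumed figure for Stable Diffusion 1.x and does not come from the sources cited here; `artifact_size_bytes` is a hypothetical helper.

```python
def artifact_size_bytes(n_values: int, bytes_per_value: int) -> int:
    """Raw payload size of a stored tensor, ignoring file-format overhead."""
    return n_values * bytes_per_value

# One 768-dimensional embedding stored as 32-bit floats.
embedding_bytes = artifact_size_bytes(768, 4)

# Assumption: about 860 million U-Net parameters, stored as 16-bit floats.
unet_bytes = artifact_size_bytes(860_000_000, 2)

print(embedding_bytes)                # 3072 bytes, i.e. 3 KiB
print(unet_bytes / 1e9)               # 1.72 (gigabytes)
print(unet_bytes // embedding_bytes)  # size ratio, in the hundreds of thousands
```

Under these assumptions a single embedding is about 3 KiB while a fine-tuned U-Net is well over a gigabyte, which is consistent with the "few kilobytes" versus "often gigabytes" rows above.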

## Limitations

The authors and subsequent users identified several limitations [1]:

- Per-concept optimization is comparatively slow, since each new concept requires its own training run of thousands of steps rather than a single forward pass.
- The frozen-model constraint caps fidelity. A single embedding can struggle to capture very fine or intricate detail, and the reconstructed concept may drift from the reference, especially for complex subjects or precise shapes.
- Results are sensitive to initialization (the coarse descriptor word) and to the optimization hyperparameters.
- Compositional and relational prompts can fail; the project authors explicitly noted that relational prompts ("two of S* next to each other," precise spatial relations) do not reliably work.
- As with all such personalization methods, learning a concept from a person's photographs raises identity, consent, and copyright concerns.

## Impact and adoption

Textual Inversion arrived just as Stable Diffusion was released publicly in 2022, and its tiny output files, which load into an unmodified model, made it an immediate fit for that ecosystem. The method was implemented in widely used tooling, including the Hugging Face [Diffusers](/wiki/diffusers) library and the AUTOMATIC1111 Stable Diffusion web UI, where learned vectors are commonly called "embeddings" [2][5].

A large community of shared embeddings grew quickly. The Stable Diffusion Concepts Library (the sd-concepts-library on Hugging Face) accumulated on the order of a thousand community-contributed Textual Inversion files covering styles, objects, and characters, which any user can load by keyword [4][5]. A distinctive and heavily used variant is the "negative embedding," a Textual Inversion vector trained to represent undesirable image traits (artifacts, bad anatomy) that is then placed in the negative prompt to steer generations away from those traits; the widely circulated EasyNegative embedding is a prominent example [5][8].

Beyond practical use, Textual Inversion influenced a line of research on text-to-image personalization and inversion. Follow-up work extended it with richer per-layer or per-timestep conditioning, faster or gradient-free optimization, and combinations with weight fine-tuning, while the underlying idea, optimizing in a frozen model's conditioning space to capture a concept, became a standard baseline in the personalization literature [1][7].

## References

[1] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion." arXiv:2208.01618, August 2, 2022. https://arxiv.org/abs/2208.01618

[2] Hugging Face. "Textual Inversion" (Diffusers training guide). https://huggingface.co/docs/diffusers/training/text_inversion

[3] ICLR 2023. "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion" (Oral). https://iclr.cc/virtual/2023/oral/12700

[4] Hugging Face. "Stable Diffusion Concepts Library" (sd-concepts-library). https://huggingface.co/sd-concepts-library

[5] Stable Diffusion Art. "How to use embeddings in Stable Diffusion." https://stable-diffusion-art.com/embedding/

[6] Textual Inversion project page. "An Image is Worth One Word." https://textual-inversion.github.io/

[7] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation." arXiv:2208.12242, August 25, 2022. https://arxiv.org/abs/2208.12242

[8] gsdf. "EasyNegative" (negative Textual Inversion embedding). https://huggingface.co/datasets/gsdf/EasyNegative

