IP-Adapter
Last reviewed
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,744 words
IP-Adapter (short for Image Prompt Adapter) is a lightweight neural network module that adds image-prompt conditioning to a pretrained text-to-image diffusion model, allowing a reference image to guide generation of subject or style without fine-tuning the underlying model [1]. It was introduced in the August 2023 paper "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models" by Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang of Tencent AI Lab [1][2].
The central idea is a decoupled cross-attention mechanism: the reference image is encoded with a CLIP image encoder, and a new, separate set of cross-attention layers dedicated to those image features is inserted into the diffusion U-Net in parallel with the existing text cross-attention. Only this small adapter, about 22 million parameters in the original design, is trained; the base model stays frozen [1][2]. Because the base weights are untouched, a single trained IP-Adapter generalizes to custom checkpoints fine-tuned from the same base model and composes with structural-control tools such as ControlNet [1][3].
IP-Adapter was released as open source alongside the paper and has become one of the most widely used conditioning methods in the open diffusion ecosystem, with support in ComfyUI, AUTOMATIC1111, and the Hugging Face Diffusers library [3][4][5].
Text prompts are a powerful but limited interface. Many visual concepts, such as the precise appearance of a particular person, an exact object, or a specific artistic style, are difficult or impossible to specify in words. As the IP-Adapter authors put it, "an image is worth a thousand words," and crafting a text prompt to reproduce a desired reference is often impractical [1].
Before IP-Adapter, the common solutions conditioned a model on a reference by learning new parameters per concept. DreamBooth fine-tunes (some or all of) the diffusion model on a handful of images of a subject, binding it to a rare token; textual inversion instead freezes the model and optimizes a new embedding vector to represent the concept. Both require a separate optimization run for every new subject or style, which is slow and storage-heavy, and DreamBooth-style fine-tuning risks degrading the base model [1].
An alternative is to train a diffusion model from scratch, or fine-tune it heavily, so that it accepts an image as a direct conditioning input, as in some image-variation models. This avoids per-concept optimization but is expensive, and it produces a model that is no longer easily compatible with text prompts and existing community checkpoints [1]. IP-Adapter targets the gap between these approaches: a single, small adapter trained once that accepts an arbitrary reference image at inference with no further optimization, while remaining fully compatible with text conditioning and with the broad library of community fine-tunes.
A latent text-to-image diffusion model such as Stable Diffusion (built on latent diffusion) injects the text prompt through cross-attention layers in its U-Net denoiser: the noisy latent provides the queries, and CLIP text embeddings provide the keys and values [1]. A naive way to add an image prompt would be to concatenate image and text features and feed them to the same cross-attention layers, but the authors found this insufficient, because the projection weights are tuned for text features and the two modalities interfere [1].
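The text cross-attention described above can be written compactly. The symbols are assumptions chosen for illustration: $Z$ is the U-Net's intermediate latent features, $c_t$ the CLIP text embeddings, $W_q, W_k, W_v$ the layer's projection matrices, and $d$ the key dimension.

$$
Z' = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q = Z W_q,\quad K = c_t W_k,\quad V = c_t W_v
$$

In the naive approach, image features would simply be appended to $c_t$, so the same $W_k$ and $W_v$, which were learned for text, would also have to project the image features.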
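IP-Adapter instead uses a decoupled (separated) cross-attention design: alongside each existing text cross-attention layer, it adds a parallel cross-attention layer with its own key and value projections for the image features. Both layers share the query computed from the noisy latent, and their outputs are added together before continuing through the U-Net [1][2].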
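The decoupled design can be illustrated with a minimal NumPy sketch of a single attention head. This is not the authors' implementation: it omits batching, multiple heads, and the output projection, and every name in it (`decoupled_cross_attention`, `w_k_img`, `w_v_img`, `image_scale`) is hypothetical. The `image_scale` weight is an assumed inference-time knob for blending the image branch; setting it to zero recovers plain text cross-attention.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for 2-D arrays shaped (tokens, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def decoupled_cross_attention(z, text_feats, image_feats,
                              w_q, w_k, w_v, w_k_img, w_v_img,
                              image_scale=1.0):
    """Sum a text branch and an image branch that share one query."""
    q = z @ w_q  # queries come from the noisy latent and are shared
    # Existing text branch: in IP-Adapter these projections stay frozen.
    text_out = attention(q, text_feats @ w_k, text_feats @ w_v)
    # Added image branch: separate key/value projections (the trained part).
    image_out = attention(q, image_feats @ w_k_img, image_feats @ w_v_img)
    return text_out + image_scale * image_out


# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
d_latent, d_cond, d_attn = 8, 6, 4
z = rng.normal(size=(5, d_latent))          # 5 latent positions
text_feats = rng.normal(size=(7, d_cond))   # 7 text tokens
image_feats = rng.normal(size=(4, d_cond))  # 4 image tokens
w_q = rng.normal(size=(d_latent, d_attn))
w_k, w_v, w_k_img, w_v_img = (rng.normal(size=(d_cond, d_attn)) for _ in range(4))

out = decoupled_cross_attention(z, text_feats, image_feats,
                                w_q, w_k, w_v, w_k_img, w_v_img)
text_only = decoupled_cross_attention(z, text_feats, image_feats,
                                      w_q, w_k, w_v, w_k_img, w_v_img,
                                      image_scale=0.0)
```

Because the text branch is left untouched, the image branch can be scaled down or switched off without changing how the base model responds to the text prompt.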
During training, only the new image-projection network and the added key/value matrices are optimized; the original U-Net, the text encoder, and the CLIP image encoder are all frozen. The trainable adapter is therefore small, roughly 22M parameters, and is trained with the standard diffusion denoising objective on image-text pairs (the paper used about 10 million pairs from datasets such as LAION-2B and COYO-700M) [1][2]. At inference no per-image optimization is needed: a user supplies any reference image and, optionally, a text prompt, and the frozen base model plus the adapter generate the result in a single sampling run, that is, the usual reverse (denoising) diffusion process [1].
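A rough tally shows where a figure of about 22M parameters could come from. The layout below is an assumption about Stable Diffusion 1.5 and the base adapter, not something stated in this article's sources: 16 cross-attention layers with hidden widths 320, 640, and 1280, a 768-dimensional conditioning space, a 1024-dimensional CLIP image embedding, and a projection network (one linear layer plus a LayerNorm) that turns the embedding into 4 image tokens.

```python
# All constants are assumptions about the SD 1.5 layout, for illustration only.
CROSS_ATTN_LAYERS = {320: 5, 640: 5, 1280: 6}  # hidden width -> number of layers
COND_DIM = 768          # width of the text/image conditioning tokens
CLIP_IMAGE_DIM = 1024   # width of the global CLIP image embedding
NUM_IMAGE_TOKENS = 4    # image tokens produced by the projection network


def adapter_parameter_count():
    """Count the trainable parameters under the assumptions above."""
    # One new key matrix and one new value matrix (no bias) per layer.
    kv = sum(2 * COND_DIM * width * count
             for width, count in CROSS_ATTN_LAYERS.items())
    # Linear layer (weights + bias) mapping the embedding to the image tokens.
    out_features = NUM_IMAGE_TOKENS * COND_DIM
    linear = CLIP_IMAGE_DIM * out_features + out_features
    # LayerNorm scale and shift over the token width.
    norm = 2 * COND_DIM
    return {"key_value": kv, "projection": linear + norm,
            "total": kv + linear + norm}


counts = adapter_parameter_count()
print(counts)
# {'key_value': 19169280, 'projection': 3150336, 'total': 22319616}
```

Under these assumptions most of the adapter's size sits in the added key/value matrices, with the projection network contributing only about 3M parameters.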
The original global-embedding adapter captures overall content and style but can lose fine detail, because a single CLIP image vector is a coarse summary. A family of variants addresses different needs [3][6]:
| Variant | Image conditioning | Notes |
|---|---|---|
| IP-Adapter (base) | Global CLIP image embedding | About 22M parameters; general subject/style transfer for SD 1.5 [1][6] |
| IP-Adapter Plus | Fine-grained CLIP patch embeddings via a perceiver-resampler (Flamingo-style) | Stronger resemblance to the reference; the paper notes finer features can also copy spatial structure, reducing diversity [1][6] |
| IP-Adapter Plus Face | Patch embeddings on a cropped face | Tuned for facial appearance and portraits [3][6] |
| IP-Adapter FaceID | InsightFace (ArcFace) face-recognition embedding plus a LoRA | Conditions on a face-ID vector instead of CLIP; the LoRA improves identity consistency [3][7] |
| IP-Adapter FaceID Plus / PlusV2 | Face-ID embedding combined with a CLIP face embedding | PlusV2 exposes a controllable weight on the CLIP "face structure" term [6][7] |
| IP-Adapter FaceID Portrait | Multiple face-ID embeddings, no LoRA or ControlNet required | Accepts several reference photos to strengthen likeness [7] |
| SDXL versions | Either OpenCLIP ViT-bigG/14 global or ViT-H/14 patch features | Adapters trained for SDXL (and later SD 2 and other bases) [6] |
The FaceID line is notable for departing from CLIP: it derives identity from a face-recognition model (InsightFace's ArcFace), because a normalized face-ID embedding captures who a person is more reliably than a general CLIP image vector. The FaceID embedding is harder to learn, so a companion LoRA is added to help the U-Net use it [7]. The base IP-Adapter FaceID was published in late 2023, with FaceID Plus and PlusV2 following in December 2023 and SDXL and Portrait variants in January 2024 [7].
IP-Adapter saw rapid uptake in the open-source image-generation community. It is integrated into Hugging Face Diffusers (loadable with a single call and stackable with multiple reference images), into AUTOMATIC1111 and Forge through the ControlNet web UI extension, and into ComfyUI, where the community node pack ComfyUI_IPAdapter_plus by Matteo Spinelli (cubiq) is a de facto standard [3][4][5]. Pretrained weights for the SD 1.5 and SDXL variants are distributed from the official tencent-ailab/IP-Adapter repository and the h94 model collections on Hugging Face [6][7].
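As a hedged illustration of the Diffusers integration, the sketch below loads a base pipeline, attaches an IP-Adapter, and generates from a reference image plus a text prompt. The function name `generate_with_reference` and its parameters are hypothetical, the base checkpoint is left for the caller to choose, and the repository, subfolder, and weight file names are the ones commonly shown for the SD 1.5 adapter; they are assumptions that may change, so check the current model listings before relying on them.

```python
def generate_with_reference(reference_path, prompt, base_model_id,
                            scale=0.6, device="cuda"):
    """Generate one image guided by a reference image and a text prompt.

    Heavy dependencies are imported inside the function so that this
    module can be imported without torch or diffusers installed.
    """
    import torch
    from diffusers import AutoPipelineForText2Image
    from diffusers.utils import load_image

    # Load an SD 1.5-family base model; the caller supplies the checkpoint id.
    pipe = AutoPipelineForText2Image.from_pretrained(
        base_model_id, torch_dtype=torch.float16
    ).to(device)

    # Attach the adapter weights (names assumed; see the lead-in).
    pipe.load_ip_adapter(
        "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
    )
    # Lower values favor the text prompt, higher values favor the reference.
    pipe.set_ip_adapter_scale(scale)

    reference = load_image(reference_path)
    result = pipe(prompt=prompt, ip_adapter_image=reference,
                  num_inference_steps=30)
    return result.images[0]
```

A call might look like `generate_with_reference("reference.png", "a watercolor landscape", base_model_id="<an SD 1.5 checkpoint>")`, where both the file name and the checkpoint id are placeholders.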
Typical uses include style transfer (apply the look of a reference image to new content), subject or character consistency across generations, face swapping and identity-preserving portraits (via the FaceID variants), and image variation. A key practical strength is composability: because the base model is frozen, IP-Adapter can be combined with ControlNet so that one input fixes structure (pose, depth, line art) while the IP-Adapter reference supplies appearance or style, and it works directly with the many community-fine-tuned checkpoints built on the same base [3][4]. IP-Adapter also influenced later identity-preservation systems; for example, InstantID combines an IP-Adapter-style image branch with face-recognition embeddings and a ControlNet-like spatial module [8].
IP-Adapter occupies a distinct point in the design space of conditioning methods:
Versus DreamBooth and Textual Inversion. DreamBooth and textual inversion are personalization methods that run a fresh optimization for each new concept: DreamBooth fine-tunes model weights, and textual inversion learns a new embedding. IP-Adapter trains once and then accepts any reference image at inference with no per-concept optimization, making it far faster to apply to a new subject [1]. The trade-off is fidelity: methods that explicitly fit a concept, especially DreamBooth, can reproduce a specific subject's identity more tightly than a feed-forward adapter, whereas IP-Adapter favors speed, reusability, and zero-shot flexibility. The FaceID variants narrow this gap for the specific case of faces [1][7].
Versus ControlNet. ControlNet adds a trainable copy of the U-Net encoder to inject spatial/structural conditions such as edges, pose, or depth maps; it controls layout and geometry. IP-Adapter, by contrast, injects appearance and semantic content from a reference image through cross-attention, not spatial structure. The two are complementary rather than competing and are routinely used together: ControlNet for "where things go" and IP-Adapter for "what they look like" [1][3].
Versus reference-only methods. Some training-free techniques (such as the "reference-only" preprocessor in the web UI ecosystem) bias generation toward a reference by manipulating the model's own self-attention at inference, with no added parameters. IP-Adapter differs in that it introduces and trains dedicated cross-attention parameters once, which the authors report yields stronger and more controllable image-prompt adherence while still requiring no optimization at generation time [1][3].
In summary, IP-Adapter packages image prompting into a small, reusable module that preserves text controllability and slots into existing diffusion pipelines, which explains both its technical interest and its broad practical adoption.