IP-Adapter

Updated Jun 8, 2026

Overview

IP-Adapter (short for Image Prompt Adapter) is a lightweight neural network module that adds image-prompt conditioning to a pretrained text-to-image diffusion model, allowing a reference image to guide generation of subject or style without fine-tuning the underlying model ^[1]. It was introduced in the August 2023 paper "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models" by Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang of Tencent AI Lab ^[1]^[2].

The central idea is a decoupled cross-attention mechanism: the reference image is encoded with a CLIP image encoder, and a new, separate set of cross-attention layers dedicated to those image features is inserted into the diffusion U-Net in parallel with the existing text cross-attention. Only this small adapter, about 22 million parameters in the original design, is trained; the base model stays frozen ^[1]^[2]. Because the base weights are untouched, a single trained IP-Adapter generalizes to custom checkpoints fine-tuned from the same base model and composes with structural-control tools such as ControlNet ^[1]^[3].

IP-Adapter was released as open source alongside the paper and has become one of the most widely used conditioning methods in the open diffusion ecosystem, with built-in support in the Hugging Face Diffusers library and support in ComfyUI and AUTOMATIC1111 through extensions ^[3]^[4]^[5].

Motivation: image prompting

Text prompts are a powerful but limited interface. Many visual concepts, such as the precise appearance of a particular person, an exact object, or a specific artistic style, are difficult or impossible to specify in words. As the IP-Adapter authors put it, "an image is worth a thousand words," and crafting a text prompt to reproduce a desired reference is often impractical ^[1].

Before IP-Adapter, the common solutions conditioned a model on a reference by learning new parameters per concept. DreamBooth fine-tunes (some or all of) the diffusion model on a handful of images of a subject, binding it to a rare token; textual inversion instead freezes the model and optimizes a new embedding vector to represent the concept. Both require a separate optimization run for every new subject or style, which is slow and storage-heavy, and DreamBooth-style fine-tuning risks degrading the base model ^[1].

An alternative is to train a diffusion model from scratch, or fine-tune it heavily, to accept an image as a direct conditioning input, as in some image-variation models. This avoids per-concept optimization but is expensive and produces a model that is no longer easily compatible with text prompts and existing community checkpoints ^[1]. IP-Adapter targets the gap between these approaches: a single, small adapter trained once that accepts an arbitrary reference image at inference with no further optimization, while remaining fully compatible with text conditioning and with the broad library of community fine-tunes.

How IP-Adapter works: decoupled cross-attention

A latent text-to-image diffusion model such as Stable Diffusion (built on latent diffusion) injects the text prompt through cross-attention layers in its U-Net denoiser: the noisy latent provides the queries, and CLIP text embeddings provide the keys and values ^[1]. A naive way to add an image prompt would be to concatenate image and text features and feed them to the same cross-attention layers, but the authors found this insufficient, because the projection weights are tuned for text features and the two modalities interfere ^[1].
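For reference, the text cross-attention described above can be written compactly. The notation follows the IP-Adapter paper rather than this article: $Z$ is the query feature map from the noisy latent, $c_t$ the text features, $W_q$, $W_k$, $W_v$ the layer's projection matrices, and $d$ the key dimension.

```latex
\[
Z' = \operatorname{Attention}(Q, K, V)
   = \operatorname{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad
Q = Z W_q,\quad K = c_t W_k,\quad V = c_t W_v .
\]
```

Here $W_k$ and $W_v$ were learned for text features only, which is the reason given for why pushing concatenated image features through the same projections works poorly.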

IP-Adapter instead uses a decoupled (separated) cross-attention design ^[1]^[2]:

Image encoding. The reference image is passed through a frozen CLIP image encoder. The original model uses the global image embedding from OpenCLIP ViT-H/14 (for the Stable Diffusion 1.5 adapter), projected through a small trainable network into a short sequence of tokens that matches the dimension of the text features ^[1]^[6].
A parallel attention branch. For each existing text cross-attention layer in the U-Net, a new cross-attention layer is added exclusively for the image tokens. Crucially, this image branch shares the query projection with the text branch but has its own, newly initialized key and value projection matrices ^[1].
Adding the outputs. The query attends separately to the text keys/values and to the image keys/values, and the two attention outputs are summed (often with an adjustable scale on the image term that lets users dial image influence up or down) before continuing through the network ^[1]^[3]. Because the branches are separate, text and image conditioning do not compete inside a single softmax, which the paper credits for both stronger image fidelity and preserved text controllability ^[1].
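The three steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with toy sizes and random weights, not the authors' implementation: the names `decoupled_cross_attention`, `w_k_ip`, `w_v_ip`, and `w_proj` are hypothetical, the random CLIP embedding stands in for a real encoder output, and the normalization that a real projection network would apply is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v


def decoupled_cross_attention(latent, text_tokens, image_tokens, params, scale=1.0):
    """Run text and image attention separately with a shared query, then sum."""
    q = latent @ params["w_q"]  # query projection shared by both branches
    text_out = attention(q, text_tokens @ params["w_k"], text_tokens @ params["w_v"])
    # The image branch has its own key/value projections (the new parameters).
    image_out = attention(q, image_tokens @ params["w_k_ip"], image_tokens @ params["w_v_ip"])
    return text_out + scale * image_out


rng = np.random.default_rng(0)
hidden, ctx, clip_dim, n_img_tokens = 64, 32, 48, 4  # toy sizes, chosen for illustration

params = {
    "w_q": rng.normal(size=(hidden, hidden)),   # frozen in the real model
    "w_k": rng.normal(size=(ctx, hidden)),      # frozen (text keys)
    "w_v": rng.normal(size=(ctx, hidden)),      # frozen (text values)
    "w_k_ip": rng.normal(size=(ctx, hidden)),   # new image keys
    "w_v_ip": rng.normal(size=(ctx, hidden)),   # new image values
}
w_proj = rng.normal(size=(clip_dim, n_img_tokens * ctx))  # image projection

latent = rng.normal(size=(16, hidden))       # 16 spatial positions
text_tokens = rng.normal(size=(77, ctx))     # stand-in for text features
clip_embedding = rng.normal(size=(clip_dim,))  # stand-in for the global image embedding

# Step 1: project the global embedding into a short sequence of image tokens.
image_tokens = (clip_embedding @ w_proj).reshape(n_img_tokens, ctx)

# Steps 2 and 3: parallel image branch, outputs summed with an adjustable scale.
out = decoupled_cross_attention(latent, text_tokens, image_tokens, params, scale=0.6)
text_only = decoupled_cross_attention(latent, text_tokens, image_tokens, params, scale=0.0)
```

Setting `scale=0.0` reduces the layer to the original text cross-attention, which is one way to see why the frozen base model's behavior is preserved when the image term is turned off.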

During training, only the new image-projection network and the added key/value matrices are optimized; the original U-Net, the text encoder, and the CLIP image encoder are all frozen. The trainable adapter is therefore small, roughly 22M parameters, and is trained with the standard diffusion denoising objective on image-text pairs (the paper used about 10 million pairs from datasets such as LAION-2B and COYO-700M) ^[1]^[2]. At inference no per-image optimization is needed: a user supplies any reference image and, optionally, a text prompt, and the frozen base model plus the adapter generate the result in a single ordinary sampling run (the reverse, denoising process) ^[1].
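The figure of roughly 22M parameters can be sanity-checked with a little arithmetic. The sketch below rests on assumptions that are not stated in this article: a Stable Diffusion 1.5 U-Net with 16 text cross-attention layers (5 at width 320, 5 at width 640, 6 at width 1280), 768-dimensional text features, a 1024-dimensional global CLIP image embedding, bias-free key/value projections, and an image projection made of one linear layer plus a LayerNorm that outputs 4 tokens. The function name is hypothetical.

```python
# Assumed SD 1.5 layout: hidden width -> number of text cross-attention layers.
CROSS_ATTN_LAYERS = {320: 5, 640: 5, 1280: 6}
TEXT_DIM = 768          # assumed width of the text (and projected image) tokens
CLIP_IMAGE_DIM = 1024   # assumed width of the global CLIP image embedding
NUM_IMAGE_TOKENS = 4    # assumed length of the projected image-token sequence


def adapter_parameter_count():
    """Return (key/value params, projection params, total) under the assumptions above."""
    # One new key matrix and one new value matrix (no bias) per cross-attention layer.
    kv = sum(2 * TEXT_DIM * width * count for width, count in CROSS_ATTN_LAYERS.items())
    # Linear layer with bias, followed by a LayerNorm (scale and shift) over TEXT_DIM.
    proj_out = NUM_IMAGE_TOKENS * TEXT_DIM
    projection = CLIP_IMAGE_DIM * proj_out + proj_out + 2 * TEXT_DIM
    return kv, projection, kv + projection


kv_params, projection_params, total_params = adapter_parameter_count()
print(f"key/value: {kv_params:,}  projection: {projection_params:,}  total: {total_params:,}")
```

Under these assumptions the total comes to about 22.3 million, with the new key/value matrices accounting for most of it, which is consistent with the approximate size quoted above.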

Variants (Plus, FaceID)

The original global-embedding adapter captures overall content and style but can lose fine detail, because a single CLIP image vector is a coarse summary. A family of variants addresses different needs ^[3]^[6]:

Variant	Image conditioning	Notes
IP-Adapter (base)	Global CLIP image embedding	About 22M parameters; general subject/style transfer for SD 1.5 ^[1]^[6]
IP-Adapter Plus	Fine-grained CLIP patch embeddings via a perceiver-resampler (Flamingo-style)	Stronger resemblance to the reference; the paper notes finer features can also copy spatial structure, reducing diversity ^[1]^[6]
IP-Adapter Plus Face	Patch embeddings on a cropped face	Tuned for facial appearance and portraits ^[3]^[6]
IP-Adapter FaceID	InsightFace (ArcFace) face-recognition embedding plus a LoRA	Conditions on a face-ID vector instead of CLIP; the LoRA improves identity consistency ^[3]^[7]
IP-Adapter FaceID Plus / PlusV2	Face-ID embedding combined with a CLIP face embedding	PlusV2 exposes a controllable weight on the CLIP "face structure" term ^[6]^[7]
IP-Adapter FaceID Portrait	Multiple face-ID embeddings, no LoRA or ControlNet required	Accepts several reference photos to strengthen likeness ^[7]
SDXL versions	Either OpenCLIP ViT-bigG/14 global or ViT-H/14 patch features	Adapters trained for SDXL (and later SD 2 and other bases) ^[6]

The FaceID line is notable for departing from CLIP: it derives identity from a face-recognition model (InsightFace's ArcFace), because a normalized face-ID embedding captures who a person is more reliably than a general CLIP image vector. The FaceID embedding is harder to learn, so a companion LoRA is added to help the U-Net use it ^[7]. The base IP-Adapter FaceID was published in December 2023, with FaceID Plus and PlusV2 following later the same month and SDXL and Portrait variants in January 2024 ^[7].

Adoption and use

IP-Adapter saw rapid uptake in the open-source image-generation community. It is integrated into Hugging Face Diffusers (loadable with a single call and stackable with multiple reference images), into AUTOMATIC1111 and Forge through the ControlNet web UI extension, and into ComfyUI, where the community node pack ComfyUI_IPAdapter_plus by Matteo Spinelli (cubiq) is a de facto standard ^[3]^[4]^[5]. Pretrained weights for the SD 1.5 and SDXL variants are distributed from the official tencent-ailab/IP-Adapter repository and the h94 model collections on Hugging Face ^[6]^[7].
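As an illustration of the Diffusers integration mentioned above, the following sketch wraps the loading and generation calls in one function. It is not taken from the cited sources verbatim: the function name `generate_with_image_prompt`, the default base-model id, the default scale of 0.6, and the step count are assumptions chosen for the example, and the weight file name should be checked against the h94/IP-Adapter model card. The heavy imports sit inside the function so that defining it does not require a GPU or a model download.

```python
def generate_with_image_prompt(
    reference_path,
    prompt="",
    base_model="stable-diffusion-v1-5/stable-diffusion-v1-5",  # assumed SD 1.5 checkpoint id
    scale=0.6,        # weight of the image term relative to the text term
    device="cuda",
):
    """Generate one image guided by a reference image and an optional text prompt."""
    import torch
    from diffusers import StableDiffusionPipeline
    from diffusers.utils import load_image

    # Half precision is only a sensible default on a GPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(base_model, torch_dtype=dtype).to(device)

    # Attach the adapter weights to the frozen base pipeline.
    pipe.load_ip_adapter(
        "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
    )
    pipe.set_ip_adapter_scale(scale)

    reference = load_image(reference_path)  # accepts a local path or a URL
    result = pipe(prompt=prompt, ip_adapter_image=reference, num_inference_steps=30)
    return result.images[0]
```

A call such as `generate_with_image_prompt("reference.png", prompt="a watercolor landscape")` would return a PIL image; lowering `scale` lets the text prompt dominate, while raising it pulls the output closer to the reference. Because the base weights are untouched, `base_model` can in principle point at a community checkpoint fine-tuned from the same base.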

Typical uses include style transfer (apply the look of a reference image to new content), subject or character consistency across generations, face swapping and identity-preserving portraits (via the FaceID variants), and image variation. A key practical strength is composability: because the base model is frozen, IP-Adapter can be combined with ControlNet so that one input fixes structure (pose, depth, line art) while the IP-Adapter reference supplies appearance or style, and it works directly with the many community-fine-tuned checkpoints built on the same base ^[3]^[4]. IP-Adapter also influenced later identity-preservation systems; for example, InstantID combines an IP-Adapter-style image branch with face-recognition embeddings and a ControlNet-like spatial module ^[8].

Relationship to DreamBooth, Textual Inversion, and ControlNet

IP-Adapter occupies a distinct point in the design space of conditioning methods:

Versus DreamBooth and Textual Inversion. DreamBooth and textual inversion are personalization methods that run a fresh optimization for each new concept: DreamBooth fine-tunes model weights, and textual inversion learns a new embedding. IP-Adapter trains once and then accepts any reference image at inference with no per-concept optimization, making it far faster to apply to a new subject ^[1]. The trade-off is fidelity: methods that explicitly fit a concept, especially DreamBooth, can reproduce a specific subject's identity more tightly than a feed-forward adapter, whereas IP-Adapter favors speed, reusability, and zero-shot flexibility. The FaceID variants narrow this gap for the specific case of faces ^[1]^[7].
Versus ControlNet. ControlNet adds a trainable copy of the U-Net encoder to inject spatial/structural conditions such as edges, pose, or depth maps; it controls layout and geometry. IP-Adapter, by contrast, injects appearance and semantic content from a reference image through cross-attention, not spatial structure. The two are complementary rather than competing and are routinely used together: ControlNet for "where things go" and IP-Adapter for "what they look like" ^[1]^[3].
Versus reference-only methods. Some training-free techniques (such as the "reference-only" preprocessor in the web UI ecosystem) bias generation toward a reference by manipulating the model's own self-attention at inference, with no added parameters. IP-Adapter differs in that it introduces and trains dedicated cross-attention parameters once, which the authors report yields stronger and more controllable image-prompt adherence while still requiring no optimization at generation time ^[1]^[3].

In summary, IP-Adapter packages image prompting as a small, reusable module that preserves text controllability and slots into existing diffusion pipelines, which explains both its technical interest and its broad practical adoption.

References

1. Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang. "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models." arXiv:2308.06721, August 2023. https://arxiv.org/abs/2308.06721
2. IP-Adapter project page. https://ip-adapter.github.io/
3. "IP-Adapters: All you need to know." Stable Diffusion Art. https://stable-diffusion-art.com/ip-adapter/
4. cubiq (Matteo Spinelli). "ComfyUI_IPAdapter_plus" (GitHub repository). https://github.com/cubiq/ComfyUI_IPAdapter_plus
5. "IP-Adapter." Hugging Face Diffusers documentation. https://huggingface.co/docs/diffusers/en/using-diffusers/ip_adapter
6. "h94/IP-Adapter" (model card listing variants and image encoders). Hugging Face. https://huggingface.co/h94/IP-Adapter
7. "h94/IP-Adapter-FaceID" (model card for the FaceID variants). Hugging Face. https://huggingface.co/h94/IP-Adapter-FaceID
8. Qixun Wang et al. "InstantID: Zero-shot Identity-Preserving Generation in Seconds." arXiv:2401.07519, January 2024. https://arxiv.org/abs/2401.07519
