
4.3. IPAdapter


Learning Goals

a. What is an IP-Adapter and what is it used for?
b. How do I connect and use an IP-Adapter?
c. How does an IP-Adapter adjust the coefficient (weight) value?
d. Why do there seem to be so many different types of IP-Adapters?
e. What is the difference between an IP-Adapter with FaceID and one without FaceID?
f. What is the difference between loading 1 image and 4 images?
g. Can I drag and drop the IP-Adapter I need to suit my situation?
h. What is an embedding? How should I approach difficult concepts like embedding and latent?

Workflow

Don't be intimidated by the number of workflows. They are all essentially the same, and each one is used in much the same straightforward way.
Index | Workflow Name | Summary
4.3.1 | IP-Adapter(Linear) | Basic connection of the IP-Adapter.
4.3.2 | IP-Adapter(StyleTransfer) | Style transfer that extracts only the style is possible.
4.3.3 | IP-Adapter(StyleTransfer Precise) | A more precise style transfer; various weight adjustments can be made.
4.3.4 | IP-Adapter(Composition) | Extracts the composition; various weight adjustments can be made.
4.3.5 | IP-Adapter(Style and Composition) | Applies style and composition together; various weight adjustments can be made.
4.3.6 | IP-Adapter(Multi, StyleTransfer) | You can input not just one but up to four reference images.
4.3.7 | IP-Adapter(FaceID) | Focuses on specific features or styles of the face, referring to supplementary data like FaceID.
4.3.8 | IP-Adapter(Multi, FaceID) | Similarly, up to four face images can be input.
4.3.9 | IP-Adapter(1Style+1Face) | Face and style can be used together, layered on top of each other.
4.3.10 | IP-Adapter(4Style+4Face) | Not just one face and one style, but up to four faces and four styles can be used.
4.3.11 | IP-Adapter(Templates) | Open this snippet to copy the connection settings when using the IP-Adapter.
4.3.12 | IP-Adapter(Legacy, SDXL) | The older version of the IP-Adapter (SDXL).
4.3.13 | IP-Adapter(Legacy, SD1.5) | The older version of the IP-Adapter (SD1.5).

a. What is IP-Adapter? When is it used?

For creators, reference images are crucial. It is often more effective to show an "image" than to describe in words what kind of image you want. Just as LoRA modifies a checkpoint's weights, IP-Adapter modifies the checkpoint in a similar way. (However, note that there is a fine line between referencing and imitation; be careful to avoid plagiarism when using reference images.) Where ControlNet connected to the yellow conditioning line, IP-Adapter connects in a similar manner to the purple model line. (Hence, IP-Adapter nodes are standardized in purple.)
For those who are more curious, take a look at the diagrams on the StyleDrop page, a project released by Google in 2023. Previously, creating a LoRA required anywhere from 20 to 1000+ images, but with IP-Adapter you can extract the desired style sharply from just one image. Although it still falls short of LoRA in performance, its value lies in being able to reference a style with minimal data.

b. How to Use IP-Adapter

Let's start using it. Here, IPAdapter Advanced plays the same central role that ControlNet Apply (Advanced) did. To use it, load a reference image, and load the necessary IP-Adapter model via the Unified Loader. Also load the CLIP Vision model, which encodes the reference image so the model can understand it. (While ControlNet required separate handling for XL and 1.5 models, IP-Adapter smooths over this inconvenience through the Unified Loader.)
Now chain the models: link the checkpoint's 'model' output to the Unified Loader's model input, pass it through IPAdapter Advanced, and feed the result in as the KSampler's model.
I will apply Style Transfer. I set the weight value to 0.8 and entered the prompt: “cat on the car, chibi character, minimalism, white background.”
"Cat on the car" created using one reference image
There is also a method for using four images instead of one. Various methods will be explored in detail later.
"Cat on the car" created using one reference image

c. How does IP-Adapter adjust coefficient values?

Start with a weight value of 0.8, as recommended by the creators in the official documentation. However, you can adjust this value up or down as needed. For example, you can set it to 0.7, 0.6, 0.55, or even increase it to 1.
start_at and end_at work the same way as described for ControlNet in section 4.2: they control over which portion of the KSampler steps the adapter is applied. They are not used very often.
combine_embeds and embeds_scaling will be explained later in the section about embeddings. They are not very important for now.
Now, let's explain weight_type. It is important and should be understood.

d. Why does IP-Adapter seem complex with so many types?

In summary: You can use one image or up to four images. You can apply IP-Adapter directly or enhance face similarity using FaceID.
The complexity arises from the various combinations and parameters (coefficients) for each setting. Focus on understanding the core functionality.

weight_type

Here, we will explain the weight_type used for style transfer.
(Primarily use style transfer; other options are more experimental and can be adjusted based on experience, similar to various ControlNet preprocessors.)
For a subtle style transfer, use style transfer precise. For a stronger style transfer, use strong style transfer.
However, it's not only about extracting the ‘style’ from the ‘reference image’. You can also extract ‘composition’ or provide both ‘style and composition’.
From left to right: reference image, style transfer, composition, style and composition
Of course, the most basic is linear. It simply makes the output resemble the reference image and applies the IP-Adapter without additional style transfer. It was used quite a bit even before style transfer became available.
(For a deeper explanation, IP-Adapter assigns weights to specific blocks in the UNet, and depending on how these weights are incorporated into the graph structure, there are various variations such as ease in, ease out, weak input, etc. It is a usable element but does not follow a strict pattern.)
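To make the parenthetical above a little more concrete, here is a purely conceptual sketch of how a base weight might be spread across the UNet blocks for a few weight_type curves. The block count and the exact curve shapes are illustrative assumptions, not the extension's real tables.

```python
# Conceptual illustration only: distribute a base weight across UNet blocks
# depending on the chosen weight_type curve.
def block_weights(base: float, num_blocks: int = 12, mode: str = "linear") -> list:
    weights = []
    for i in range(num_blocks):
        t = i / (num_blocks - 1)                  # 0.0 at the first block, 1.0 at the last
        if mode == "linear":
            w = base                              # same weight on every block
        elif mode == "ease in":
            w = base * t                          # weak at the input blocks, strong later
        elif mode == "ease out":
            w = base * (1.0 - t)                  # strong at the input blocks, weak later
        elif mode == "weak input":
            w = base * (0.2 if t < 0.5 else 1.0)  # damp only the input half
        else:
            raise ValueError(f"mode not covered in this sketch: {mode}")
        weights.append(round(w, 3))
    return weights

for mode in ("linear", "ease in", "ease out", "weak input"):
    print(f"{mode:>10}: {block_weights(0.8, mode=mode)}")
```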

e. What is the difference between using IP-Adapter with FaceID and without it?

When you want to make an image resemble a specific face, you can use FaceID with IP-Adapter. However, FaceID does not make the face look exactly the same but rather achieves a subtle resemblance.
The difference is that you use Unified Loader (FaceID) instead of the plain Unified Loader. To use FaceID you also need to load InsightFace, which means using IPAdapter FaceID instead of IPAdapter Advanced. Pay attention to the additional weight value it introduces and set it to around 0.8. If you are running on a GPU, set the provider to CUDA (CUDA runs on NVIDIA GPUs). Otherwise, the usage is the same: make sure to chain the models in the same way.
You might want to use both ‘face’ and ‘style’.
When using ‘face’, you can connect it independently as previously described.
Alternatively, you can also use a combination of ‘face’ and ‘style’ by layering them.
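For the curious: the face information that FaceID relies on comes from InsightFace. Below is a minimal, standalone sketch of that detection-and-embedding step, assuming the insightface and opencv-python packages and the bundled "buffalo_l" model pack; the IPAdapter FaceID node performs an equivalent step for you, so this only shows where the face vector comes from.

```python
import cv2
from insightface.app import FaceAnalysis

# GPU first (CUDA), with CPU as a fallback.
app = FaceAnalysis(
    name="buffalo_l",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("face_reference.png")          # path is a placeholder; loads as a BGR array
faces = app.get(img)                            # detect faces and compute their embeddings
if faces:
    face_embedding = faces[0].normed_embedding  # a 512-dimensional identity vector
    print(face_embedding.shape)
```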

f. How does using 1 image differ from using 4 images?

From the user's perspective, when using 4 images, you don't need to know all the detailed configurations; you simply swap out the images as needed. Using 2 or 3 images is also possible. Adding 4 images doesn’t always guarantee better results. Often, a single good reference image is sufficient.
(However, understanding detailed components is useful for more precise adjustments if needed. For those interested, refer to the embedding explanation at the bottom of this document.)
When using 4 style images
When using 4 face images
When using 4 face images and 4 style images

g. Can I use the IP-Adapter I need according to the situation?

In the end, it’s all about connecting the 'models', properly linking them, adding suitable reference images, and setting appropriate coefficient values.
When needed, enter this workflow, copy the necessary modules, and use them. I hope you create more good use cases.

h. What is embedding?

(From here, I’ll provide additional explanations, but you’ve learned all you need to know for practical use, so go ahead and use it. If you’re curious, feel free to ask more.)
When comparing Chapter 1 and Chapter 4 in detail, they are somewhat similar, but new elements like IPAdapter Encoder and IPAdapterCombineEmbeds have appeared.
The term 'embedding' ... it’s a complex term, but don’t be intimidated; try to understand it lightly. It’s okay if you don’t fully grasp it.
Artificial intelligence doesn’t understand the pixels we see very well. That’s why it’s explained as being converted into Latent. It’s similar to that.
It’s akin to that. To help AI understand pixels, it’s converted into embeddings and then sent over.
(I won’t explain the concept of vectors, but you can think of them as mathematical concepts with direction and magnitude.)
The words "king" and "queen" are like points that can be placed on a coordinate plane.
The difference between the words "man" and "woman" can be represented as a vector.
And that vector will be similar to the vector between the words "king" and "queen."
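To make the analogy concrete, here is a toy sketch with made-up three-dimensional vectors; the numbers mean nothing beyond showing the arithmetic.

```python
import numpy as np

# Invented toy vectors: one axis for "royalty", one for "male", one for "female".
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.1, 0.8])
man   = np.array([0.5, 0.8, 0.1])
woman = np.array([0.5, 0.1, 0.8])

gender_shift = woman - man       # the "man -> woman" direction
print(king + gender_shift)       # lands on the same point as...
print(queen)                     # ...the "queen" vector
```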
Thus, the four reference images above can also be represented as follows.
Image 1 -> encode with CLIP Vision (image -> embedding) -> embedding vector 1
Image 2 -> encode with CLIP Vision (image -> embedding) -> embedding vector 2
Image 3 -> encode with CLIP Vision (image -> embedding) -> embedding vector 3
Image 4 -> encode with CLIP Vision (image -> embedding) -> embedding vector 4
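If you would like to see this step as code, here is a sketch of encoding reference images into embedding vectors with a CLIP vision encoder via the Hugging Face transformers library. The checkpoint name and file paths are assumptions for illustration; in ComfyUI, the Unified Loader picks a compatible CLIP Vision model for you.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

model_id = "openai/clip-vit-large-patch14"        # an assumed checkpoint for illustration
processor = CLIPImageProcessor.from_pretrained(model_id)
encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)

embeddings = []
for path in ["ref1.png", "ref2.png", "ref3.png", "ref4.png"]:  # placeholder file names
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings.append(encoder(**inputs).image_embeds)      # one embedding vector per image

print(embeddings[0].shape)   # (1, 768) for this particular checkpoint
```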
With a single image these settings hardly matter, but once multiple images are involved, two coefficient values become meaningful:
combine_embeds
embeds_scaling
combine_embeds determines how the embeddings obtained above are combined or merged.
concat: This method concatenates or joins the embeddings.
average: This method averages the embeddings.
The pseudo-code below is not precise; it only conveys the idea.
concat(1, 2, 3, 4)  -> 1 + 2 + 3 + 4         (everything is kept side by side)
average(1, 2, 3, 4) -> (1 + 2 + 3 + 4) / 4   (everything is blended into one)

if embed1 = cat, cute
   embed2 = cat, angry
   embed3 = cat, white
   embed4 = cat, flower

concat(embed1, embed2, embed3, embed4)  -> cat, cute, cat, angry, cat, white, cat, flower
average(embed1, embed2, embed3, embed4) -> cat, something in between
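In tensor terms, the two methods look roughly like this (the shapes are illustrative, not the encoder's real ones).

```python
import torch

# Pretend each of the 4 reference images was encoded into 4 tokens of dimension 768.
embeds = [torch.randn(1, 4, 768) for _ in range(4)]

concatenated = torch.cat(embeds, dim=1)          # keep every token: shape (1, 16, 768)
averaged     = torch.stack(embeds).mean(dim=0)   # blend them into one: shape (1, 4, 768)

print(concatenated.shape, averaged.shape)
```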
And the embedding is applied inside attention as keys and values, loosely like a dictionary-style data structure of keys and values.
embeds_scaling decides how the weight is applied there: to the values only, to both keys and values, with an additional channel penalty, and so on.
If these options produced meaningful, practically significant differences in the results, I would explain the underlying principles, but since they don't, I'll skip them.
As a result, the following experimental options remain. (Note: when using only one image, this doesn't affect the results either. Try fixing the seed and testing coefficient values.)
Start by trying concat.
There are cases where averaging (average) might be better.
Try using k+v as a basic approach.
There are cases where using v only or applying channel penalties (c penalty) might be better.
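As a closing illustration, here is a conceptual sketch of where embeds_scaling acts: the image embedding enters the UNet's cross-attention as extra keys (K) and values (V), and the weight can be applied to the values only or to both keys and values. The shapes and the simplified attention math are assumptions for illustration, not the extension's actual implementation.

```python
import torch
import torch.nn.functional as F

def ip_cross_attention(q, ip_k, ip_v, weight: float, mode: str = "K+V"):
    if mode == "K+V":
        ip_k, ip_v = ip_k * weight, ip_v * weight   # scale both keys and values
    elif mode == "V only":
        ip_v = ip_v * weight                        # scale only the values
    else:
        raise ValueError(f"mode not covered in this sketch: {mode}")
    attn = F.softmax(q @ ip_k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ ip_v

q    = torch.randn(1, 77, 64)   # queries from the image being generated
ip_k = torch.randn(1, 4, 64)    # keys derived from the reference-image embedding
ip_v = torch.randn(1, 4, 64)    # values derived from the reference-image embedding

print(ip_cross_attention(q, ip_k, ip_v, weight=0.8, mode="V only").shape)
```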