
4.3. IPAdapter


Learning Goals

a. What is an IP-Adapter and what is it used for?
b. How do I connect and use an IP-Adapter?
c. How does an IP-Adapter adjust the coefficient (weight) value?
d. Why do there seem to be so many different types of IP-Adapters?
e. What is the difference between an IP-Adapter with FaceID and one without FaceID?
f. What is the difference between loading 1 image and 4 images?
g. Can I drag and drop the IP-Adapter I need to suit my situation?
h. What is an embedding? How should I approach difficult concepts like embedding and latent?

Workflow

Don't be intimidated by the number of workflows. They are all essentially the same, and each one is used in much the same straightforward way.
Index | Workflow Name | Summary
4.3.1 | IP-Adapter(Linear) | Basic connection of the IP-Adapter.
4.3.2 | IP-Adapter(StyleTransfer) | Style transfer that extracts only the style is possible.
4.3.3 | IP-Adapter(StyleTransfer Precise) | A more precise style transfer; various weight adjustments can be made.
4.3.4 | IP-Adapter(Composition) | Extracts the composition; various weight adjustments can be made.
4.3.5 | IP-Adapter(Style and Composition) | Applies style and composition together; various weight adjustments can be made.
4.3.6 | IP-Adapter(Multi, StyleTransfer) | You can input not just one but up to four reference images.
4.3.7 | IP-Adapter(FaceID) | Focuses on specific features or styles of the face, referring to supplementary data like FaceID.
4.3.8 | IP-Adapter(Multi, FaceID) | Similarly, up to four face images can be input.
4.3.9 | IP-Adapter(1Style+1Face) | Face and style can be used together, layered on top of each other.
4.3.10 | IP-Adapter(4Style+4Face) | Not just one face and one style, but up to four faces and four styles can be used.
4.3.11 | IP-Adapter(Templates) | Open this snippet to copy the connection settings when using the IP-Adapter.
4.3.12 | IP-Adapter(Legacy, SDXL) | The older version of the IP-Adapter (SDXL).
4.3.13 | IP-Adapter(Legacy, SD1.5) | The older version of the IP-Adapter (SD1.5).

a. What is IP-Adapter? When is it used?

For creators, reference images are crucial. It is often more effective to show an "image" than to describe in words what kind of image you want. Just as LoRA modifies a checkpoint's weights, IP-Adapter modifies the checkpoint in a similar way. (However, note that there is a fine line between referencing and imitation; be careful to avoid plagiarism when using reference images.) Where ControlNet connected to the yellow conditioning line, IP-Adapter connects in a similar manner to the purple model line. (Hence, IP-Adapter nodes are standardized in purple.)
For those who are more curious, take a look at the diagrams on the StyleDrop page, a project released by Google in 2023. Previously, creating a LoRA required anywhere from 20 to 1000+ images, but with IP-Adapter you can extract the desired style sharply from just one image. Although it still falls short of LoRA in performance, its value lies in being able to reference a style with minimal data.

b. How to Use IP-Adapter

Let's start using it. Here, IPAdapter Advanced plays the same central role that ControlNet Apply (Advanced) did. To use it, load a reference image, and load the necessary IP-Adapter model via the Unified Loader. Also load the CLIP Vision model, which encodes the reference image so the model can understand it. (While ControlNet required separate handling for XL and 1.5 models, IP-Adapter smooths over this inconvenience through the Unified Loader.)
Now chain the models: link the checkpoint's 'model' output to the Unified Loader's model input, pass it through IPAdapter Advanced, and feed the result in as the KSampler's model.
I will apply Style Transfer. I set the weight value to 0.8 and entered the prompt: “cat on the car, chibi character, minimalism, white background.”
"Cat on the car" created using one reference image
There is also a method for using four images instead of one. Various methods will be explored in detail later.
"Cat on the car" created using one reference image

c. How does IP-Adapter adjust coefficient values?

Start with a weight value of 0.8, as recommended by the creators in the official documentation. However, you can adjust this value up or down as needed. For example, you can set it to 0.7, 0.6, 0.55, or even increase it to 1.
start_at and end_at work the same way as described for ControlNet in section 4.2: they control over which portion of the KSampler steps the adapter is applied. They are not used very often.
combine_embeds and embeds_scaling will be explained later in the section about embeddings. They are not very important for now.
Now, let's explain weight_type. It is important and should be understood.

d. Why does IP-Adapter seem complex with so many types?

In summary: You can use one image or up to four images. You can apply IP-Adapter directly or enhance face similarity using FaceID.
The complexity arises from the various combinations and parameters (coefficients) for each setting. Focus on understanding the core functionality.

weight_type

Here, we will explain the weight_type used for style transfer.
(Primarily use style transfer; other options are more experimental and can be adjusted based on experience, similar to various ControlNet preprocessors.)
For a subtle style transfer, use style transfer precise. For a stronger style transfer, use strong style transfer.
However, it's not only about extracting the ‘style’ from the ‘reference image’. You can also extract ‘composition’ or provide both ‘style and composition’.
From left to right: reference image, style transfer, composition, style and composition
Of course, the most basic is linear. It simply makes the output resemble the reference image and applies the IP-Adapter without additional style transfer. It was used quite a bit even before style transfer became available.
(For a deeper explanation, IP-Adapter assigns weights to specific blocks in the UNet, and depending on how these weights are incorporated into the graph structure, there are various variations such as ease in, ease out, weak input, etc. It is a usable element but does not follow a strict pattern.)
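To make the parenthetical above a little more concrete, here is a purely conceptual sketch of how a base weight might be spread across the UNet blocks for a few weight_type curves. The block count and the exact curve shapes are illustrative assumptions, not the extension's real tables.

```python
# Conceptual illustration only: distribute a base weight across UNet blocks
# depending on the chosen weight_type curve.
def block_weights(base: float, num_blocks: int = 12, mode: str = "linear") -> list:
    weights = []
    for i in range(num_blocks):
        t = i / (num_blocks - 1)                  # 0.0 at the first block, 1.0 at the last
        if mode == "linear":
            w = base                              # same weight on every block
        elif mode == "ease in":
            w = base * t                          # weak at the input blocks, strong later
        elif mode == "ease out":
            w = base * (1.0 - t)                  # strong at the input blocks, weak later
        elif mode == "weak input":
            w = base * (0.2 if t < 0.5 else 1.0)  # damp only the input half
        else:
            raise ValueError(f"mode not covered in this sketch: {mode}")
        weights.append(round(w, 3))
    return weights

for mode in ("linear", "ease in", "ease out", "weak input"):
    print(f"{mode:>10}: {block_weights(0.8, mode=mode)}")
```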

e. What is the difference between using IP-Adapter with FaceID and without it?

When you want to make an image resemble a specific face, you can use FaceID with IP-Adapter. However, FaceID does not make the face look exactly the same but rather achieves a subtle resemblance.
The difference is that you use Unified Loader (FaceID) instead of the plain Unified Loader. To use FaceID you also need to load InsightFace, which means using IPAdapter FaceID instead of IPAdapter Advanced. Pay attention to the additional weight value it introduces and set it to around 0.8. If you are running on a GPU, set the provider to CUDA (CUDA runs on NVIDIA GPUs). Otherwise, the usage is the same: make sure to chain the models in the same way.
You might want to use both ‘face’ and ‘style’.
When using ‘face’, you can connect it independently as previously described.
Alternatively, you can also use a combination of ‘face’ and ‘style’ by layering them.
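For the curious: the face information that FaceID relies on comes from InsightFace. Below is a minimal, standalone sketch of that detection-and-embedding step, assuming the insightface and opencv-python packages and the bundled "buffalo_l" model pack; the IPAdapter FaceID node performs an equivalent step for you, so this only shows where the face vector comes from.

```python
import cv2
from insightface.app import FaceAnalysis

# GPU first (CUDA), with CPU as a fallback.
app = FaceAnalysis(
    name="buffalo_l",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("face_reference.png")          # path is a placeholder; loads as a BGR array
faces = app.get(img)                            # detect faces and compute their embeddings
if faces:
    face_embedding = faces[0].normed_embedding  # a 512-dimensional identity vector
    print(face_embedding.shape)
```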

f. How does using 1 image differ from using 4 images?

From the user's perspective, when using 4 images, you don't need to know all the detailed configurations; you simply swap out the images as needed. Using 2 or 3 images is also possible. Adding 4 images doesn’t always guarantee better results. Often, a single good reference image is sufficient.
(However, understanding detailed components is useful for more precise adjustments if needed. For those interested, refer to the embedding explanation at the bottom of this document.)
When using 4 style images
When using 4 face images
When using 4 face images and 4 style images

g. Can I use the IP-Adapter I need according to the situation?

In the end, it’s all about connecting the 'models', properly linking them, adding suitable reference images, and setting appropriate coefficient values.
When needed, enter this workflow, copy the necessary modules, and use them. I hope you create more good use cases.

h. What is embedding?

(From here, I’ll provide additional explanations, but you’ve learned all you need to know for practical use, so go ahead and use it. If you’re curious, feel free to ask more.)
When comparing Chapter 1 and Chapter 4 in detail, they are somewhat similar, but new elements like IPAdapter Encoder and IPAdapterCombineEmbeds have appeared.
The term 'embedding' ... it’s a complex term, but don’t be intimidated; try to understand it lightly. It’s okay if you don’t fully grasp it.
Artificial intelligence doesn’t understand the pixels we see very well. That’s why it’s explained as being converted into Latent. It’s similar to that.
It’s akin to that. To help AI understand pixels, it’s converted into embeddings and then sent over.
(I won’t explain the concept of vectors, but you can think of them as mathematical concepts with direction and magnitude.)
The words "king" and "queen" are like points that can be placed on a coordinate plane.
The difference between the words "man" and "woman" can be represented as a vector.
And that vector will be similar to the vector between the words "king" and "queen."
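To make the analogy concrete, here is a toy sketch with made-up three-dimensional vectors; the numbers mean nothing beyond showing the arithmetic.

```python
import numpy as np

# Invented toy vectors: one axis for "royalty", one for "male", one for "female".
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.1, 0.8])
man   = np.array([0.5, 0.8, 0.1])
woman = np.array([0.5, 0.1, 0.8])

gender_shift = woman - man       # the "man -> woman" direction
print(king + gender_shift)       # lands on the same point as...
print(queen)                     # ...the "queen" vector
```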
Thus, the four reference images above can also be represented as follows.
Image 1 -> encode with CLIP Vision (image -> embedding) -> embedding vector 1
Image 2 -> encode with CLIP Vision (image -> embedding) -> embedding vector 2
Image 3 -> encode with CLIP Vision (image -> embedding) -> embedding vector 3
Image 4 -> encode with CLIP Vision (image -> embedding) -> embedding vector 4
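If you would like to see this step as code, here is a sketch of encoding reference images into embedding vectors with a CLIP vision encoder via the Hugging Face transformers library. The checkpoint name and file paths are assumptions for illustration; in ComfyUI, the Unified Loader picks a compatible CLIP Vision model for you.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

model_id = "openai/clip-vit-large-patch14"        # an assumed checkpoint for illustration
processor = CLIPImageProcessor.from_pretrained(model_id)
encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)

embeddings = []
for path in ["ref1.png", "ref2.png", "ref3.png", "ref4.png"]:  # placeholder file names
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings.append(encoder(**inputs).image_embeds)      # one embedding vector per image

print(embeddings[0].shape)   # (1, 768) for this particular checkpoint
```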
With a single image these settings hardly matter, but once multiple images are involved, two coefficient values become meaningful:
combine_embeds
embeds_scaling
combine_embeds determines how the embeddings obtained above are combined or merged.
concat: This method concatenates or joins the embeddings.
average: This method averages the embeddings.
The pseudo-code below is not precise; it only conveys the idea.
concat(1, 2, 3, 4)  -> 1 + 2 + 3 + 4         (everything is kept side by side)
average(1, 2, 3, 4) -> (1 + 2 + 3 + 4) / 4   (everything is blended into one)

if embed1 = cat, cute
   embed2 = cat, angry
   embed3 = cat, white
   embed4 = cat, flower

concat(embed1, embed2, embed3, embed4)  -> cat, cute, cat, angry, cat, white, cat, flower
average(embed1, embed2, embed3, embed4) -> cat, something in between
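In tensor terms, the two methods look roughly like this (the shapes are illustrative, not the encoder's real ones).

```python
import torch

# Pretend each of the 4 reference images was encoded into 4 tokens of dimension 768.
embeds = [torch.randn(1, 4, 768) for _ in range(4)]

concatenated = torch.cat(embeds, dim=1)          # keep every token: shape (1, 16, 768)
averaged     = torch.stack(embeds).mean(dim=0)   # blend them into one: shape (1, 4, 768)

print(concatenated.shape, averaged.shape)
```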
And the embedding is applied inside attention as keys and values, loosely like a dictionary-style data structure of keys and values.
embeds_scaling decides how the weight is applied there: to the values only, to both keys and values, with an additional channel penalty, and so on.
If these options produced meaningful, practically significant differences in the results, I would explain the underlying principles, but since they don't, I'll skip them.
As a result, the following experimental options remain. (Note: when using only one image, this doesn't affect the results either. Try fixing the seed and testing coefficient values.)
Start by trying concat.
There are cases where averaging (average) might be better.
Try using k+v as a basic approach.
There are cases where using v only or applying channel penalties (c penalty) might be better.
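As a closing illustration, here is a conceptual sketch of where embeds_scaling acts: the image embedding enters the UNet's cross-attention as extra keys (K) and values (V), and the weight can be applied to the values only or to both keys and values. The shapes and the simplified attention math are assumptions for illustration, not the extension's actual implementation.

```python
import torch
import torch.nn.functional as F

def ip_cross_attention(q, ip_k, ip_v, weight: float, mode: str = "K+V"):
    if mode == "K+V":
        ip_k, ip_v = ip_k * weight, ip_v * weight   # scale both keys and values
    elif mode == "V only":
        ip_v = ip_v * weight                        # scale only the values
    else:
        raise ValueError(f"mode not covered in this sketch: {mode}")
    attn = F.softmax(q @ ip_k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ ip_v

q    = torch.randn(1, 77, 64)   # queries from the image being generated
ip_k = torch.randn(1, 4, 64)    # keys derived from the reference-image embedding
ip_v = torch.randn(1, 4, 64)    # values derived from the reference-image embedding

print(ip_cross_attention(q, ip_k, ip_v, weight=0.8, mode="V only").shape)
```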