4.1. Text2Image

Learning Goals

a. What is Text2Image? What is KSampler?
b. What is a model? What is a checkpoint? What is LoRA?

Workflow

| Index | Workflow Name | Actionable Links | Summary |
| --- | --- | --- | --- |
| 4.1.1 | Text2Image(SDXL) | | Basic Generation with SDXL |
| 4.1.2 | Text2Image(SD1.5) | | Basic Generation with SD1.5 |
| 4.1.3 | Text2Image + LoRA | | How to use LoRA with SDXL |
| 4.1.4 | Text2Image + Multi LoRA | | How to stack two LoRAs with SDXL |
| 4.1.5 | Text2Image + Checkpoint Merge | | How to merge and use two SDXL checkpoints |
| 4.1.6 | Image2Image(SDXL) | | I2I using the SDXL T2I workflow with a Load Latent Image added |

a. What is Text2Image?

First, it helps to study this image for a while, connecting the picture to the concept.
Below, I conceptualize it to give you a deeper understanding. The colors do not match exactly, but they follow ComfyUI's basic color scheme.

Seed & Latent

In Text2Image, you feed in an Empty Latent Image, here at a resolution of 768x1024. Then, in KSampler, you set a seed value. The seed specifies which noise will fill that latent and serve as the input. (The noise doesn't actually look like the illustration.)
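To make this concrete, here is a minimal sketch of how a seed deterministically produces the initial noise, assuming PyTorch and Stable Diffusion's 8x latent downscaling. The shapes and names are illustrative, not ComfyUI's actual code.

```python
import torch

seed = 42                      # the seed value you set in KSampler
width, height = 768, 1024      # the Empty Latent Image resolution

generator = torch.Generator().manual_seed(seed)
# SD works in latent space: 4 channels, spatial size divided by 8
latent_noise = torch.randn(
    (1, 4, height // 8, width // 8), generator=generator
)
# The same seed always yields the same noise, hence reproducible images.
```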

Conditioning

It’s an intimidating term, but don’t be put off: conditioning simply means imposing conditions. Instead of generating at random, you provide conditions so the output follows the prompt you entered. In other words, the prompt is itself a form of conditioning. Later on, ControlNet applies conditioning as well.
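Under the hood, the prompt is turned into arrays of numbers (embeddings) that the sampler can condition on. Here is a minimal sketch assuming the SD1.5 text encoder from Hugging Face; this is not ComfyUI's internal code.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize the prompt and encode it into embedding vectors
tokens = tokenizer("1girl", padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    conditioning = text_encoder(tokens.input_ids).last_hidden_state

print(conditioning.shape)  # torch.Size([1, 77, 768]), the "condition" fed to the sampler
```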

KSampler

If you input the prompt "1girl" (that is, apply conditioning), set the steps to 20 as explained earlier, and configure the remaining settings, the image is created by performing 20 denoising/sampling steps starting from the noise.
In other words, Stable Diffusion is the process in which KSampler denoises the initial seed noise.
So if you use a very low step count, you will get poor-quality images. However, raising the step count beyond what is needed (e.g., 20) does not necessarily produce better images. (Some specialized models may require up to 60 steps, but this is rare, while lightweight models like LCM need far fewer.)
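Conceptually, the sampling loop looks like the sketch below, written against a diffusers-style scheduler/UNet interface as an assumption; ComfyUI's actual KSampler implementation differs.

```python
import torch

def ksampler_sketch(unet, scheduler, latent, conditioning, steps=20):
    """Denoise seed noise into an image latent over `steps` iterations."""
    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:  # e.g. 20 denoising steps
        with torch.no_grad():
            # Predict the noise present in the latent, guided by the conditioning
            noise_pred = unet(latent, t, encoder_hidden_states=conditioning).sample
        # Remove a portion of the predicted noise
        latent = scheduler.step(noise_pred, t, latent).prev_sample
    return latent  # handed to the VAE Decode node to become pixels
```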

b. What is a model? What is a checkpoint? What is LoRA?

Artificial intelligence encompasses many kinds of AI models.
Among these, the saved weights of diffusion models like Stable Diffusion are what we call checkpoints.

Checkpoints

In gaming, a checkpoint saves your progress at a certain point: for example, saving when you have collected 7 logs at level 35 lets you return to that point later. Similarly, in training, a checkpoint captures and saves the state of the model at a specific moment. The file extension is typically .ckpt, though .safetensors is also common because it is safer to load.
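In code, saving and resuming from a checkpoint is just serializing the model's numbers. A minimal PyTorch sketch, with a hypothetical filename:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # stand-in for a real diffusion model

# Save the model's state (its numbers) at this moment in training
torch.save(model.state_dict(), "ckpt_step_35000.ckpt")  # hypothetical filename

# Later: restore the model to exactly that point
model.load_state_dict(torch.load("ckpt_step_35000.ckpt"))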
To understand models, it helps to grasp the concept of training.
The models we use, such as SDXL and SD1.5, were created by researchers and engineers with deep mathematical backgrounds who spent a long time on the problem. They train the base models on millions of images using hundreds of GPUs over weeks; this is known as Train From Scratch. The community models we use are trained further from these base models on thousands or tens of thousands of images; this is known as Train From Base.
So, what happens when we train a model? It's not extraordinarily complex. It's simple. Understand this:
It's just an update of numerical values. The checkpoints you know are nothing magical, simply arrays of numbers like [0.3, 0.25, 0.87, 0.9, 0.5, 0.23, …]. Further training updates these numbers, e.g. [0.3 → 0.25, 0.25 → 0.27, …], leaving just a modified array. In essence, a model is a collection of numbers optimized for specific data. (We are aiming for a practical understanding here rather than a mathematically exact one.)
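As a toy illustration (the values and gradient below are made up), one training step is literally just nudging the array:

```python
import numpy as np

weights = np.array([0.3, 0.25, 0.87, 0.9, 0.5, 0.23])  # a "checkpoint": just numbers
gradient = np.array([0.5, -0.2, 0.1, 0.3, -0.1, 0.2])  # made-up gradient from one batch
learning_rate = 0.1

weights -= learning_rate * gradient  # training = repeatedly updating the numbers
print(weights)  # the "checkpoint" after one step: a slightly different array
```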
Thus, a checkpoint trained mostly on animation images will perform well on animation, while one trained on many photos excels at realistic images. Likewise, a checkpoint trained on 3D Unity/Unreal renders will produce styles similar to 3D textures.

LoRA

So, what is LoRA?
Checkpoints are quite large, around 6-7GB, because they contain an enormous number of values. Training often produces many candidate models, and spending 60-70GB of storage just to teach a model a few images is impractical. To address this, LoRA (Low-Rank Adaptation), a technique heavily researched in the LLM (Large Language Model) field, was adopted.
Researchers found that adjusting only about 1% of a checkpoint's weights can produce effects similar to retraining the full checkpoint. So LoRA, which lightly trains on a small number of images (30 to 100) and is combined with a checkpoint at load time, became popular. You can think of it as a lightweight add-on roughly 1% the size of the checkpoint. Despite being far smaller, a LoRA can steer the output as strongly as a full fine-tune.
Additionally, while a checkpoint might be represented as [0.3, 0.25, 0.87, 0.9, 0.5, 0.23, …], a LoRA stores only the changes to be applied to the checkpoint, such as [0.003, -0.0025, 0.008, -0.009, 0.005, -0.0023, …]. This is why a LoRA cannot be used alone and must be combined with a checkpoint.
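The "low-rank" part is where the size savings come from: instead of storing a full delta matrix, a LoRA stores two small matrices whose product is the delta. A toy sketch, with illustrative dimensions and values:

```python
import numpy as np

d = 768    # width of one weight matrix inside the checkpoint
rank = 4   # the low rank; the reason the file is tiny

W = np.random.randn(d, d)            # checkpoint weight: d*d numbers
A = np.random.randn(rank, d) * 0.01  # the LoRA stores only A and B:
B = np.random.randn(d, rank) * 0.01  # 2*rank*d numbers, a small fraction of d*d

strength_model = 1.0
W_patched = W + strength_model * (B @ A)  # apply the LoRA delta to the checkpoint
```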
The image above shows how a checkpoint and LoRA are connected. Pay attention to the model, lora_name, and strength_model. Let’s assume these values are as follows.
```python
model: [0.3, 0.25, 0.87, 0.9, 0.5, 0.23, ...]
LoRA: [0.003, 0.0025, 0.0087, 0.009, 0.005, 0.0023, ...]  # LoRA data read from lora_name
strength_model: 1
```
In this case, the model produced by Load LoRA is calculated as [0.303, 0.2525, 0.8787, 0.909, 0.505, 0.2323, ...].
```python
MODEL = inputModel + strength_model * lora
# MODEL: [0.303, 0.2525, 0.8787, 0.909, 0.505, 0.2323, ...]
#      = [0.3, 0.25, 0.87, 0.9, 0.5, 0.23, ...] + 1 * [0.003, 0.0025, 0.0087, 0.009, 0.005, 0.0023, ...]
```

Checkpoint Merging

Suppose you have Checkpoint A, trained extensively on animation, and Checkpoint B, trained on photos. To build a model that generates semi-realistic images, you could retrain on a combined dataset of A's and B's images, or gather a large set of semi-realistic images and retrain on those. With the right images, you get a model that produces the desired outputs.
However, there's an interesting phenomenon: if you simply merge the two models, the styles of both checkpoints combine! Why? Even researchers are not entirely sure. It's a bit of a "wow, this works?" situation.
How is it done? By simple arithmetic averaging: add the values of both models and divide by 2.
```python
A_checkpoint = [1, 0.8, 0.6, ..., 0.8]
B_checkpoint = [0, 0.4, 0.6, ..., 0]
AB_checkpoint = [0.5, 0.6, 0.6, ..., 0.4]  # (A + B) / 2
# The first element of A is 1 and the first element of B is 0,
# so the merged first element is (1 + 0) / 2 = 0.5.
```
It really works like that? And the styles merge too? That's why it's used. (This is just to aid understanding; it doesn't mean Checkpoint Merging is especially useful in practice. As for the CLIP and VAE that come along with each checkpoint, you can use whichever you prefer unless you specifically need to match them correctly.)
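A hedged sketch of what a merge amounts to in code, assuming both checkpoints share the same architecture. The filenames are illustrative, and real SD checkpoints may nest their weights under a "state_dict" key.

```python
import torch

def merge_checkpoints(state_a, state_b, ratio=0.5):
    # ratio=0.5 is a plain arithmetic average of every weight tensor
    return {k: ratio * state_a[k] + (1 - ratio) * state_b[k] for k in state_a}

state_a = torch.load("checkpoint_A.ckpt")  # hypothetical filenames
state_b = torch.load("checkpoint_B.ckpt")
torch.save(merge_checkpoints(state_a, state_b), "merged_AB.ckpt")
```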

Multi LoRA Use

Just as you can average two checkpoints, you can also stack multiple LoRAs on a single checkpoint. For example, using the checkpoint shown in the image (dreamshaperXL_alpha2Xl10), LoRA1 (HKStyle_V3-00019), and LoRA2 (DreamARTSDXL), you can combine them as follows.
```python
Checkpoint: [1, 0.8, 0.6, ..., 0.8]
LoRA1: [0.001, 0.008, 0.006, ..., 0.008]
LoRA2: [0.002, 0.016, 0.012, ..., 0.016]
```
In this case, the model created by using Load LoRA twice is calculated as follows, resulting in an effect that blends both styles equally.
```python
MODEL = Checkpoint + 0.5 * LoRA1 + 0.5 * LoRA2
# The 0.5 factors are the strength_model values set in each Load LoRA node.
# MODEL: [1.0015, 0.812, 0.609, ..., 0.812]
#      = [1, 0.8, 0.6, ..., 0.8] + 0.5 * [0.001, 0.008, 0.006, ..., 0.008]
#                                + 0.5 * [0.002, 0.016, 0.012, ..., 0.016]
```
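You can verify the arithmetic of the worked example with a few lines of NumPy, truncated here to the four concrete values shown above:

```python
import numpy as np

checkpoint = np.array([1, 0.8, 0.6, 0.8])
lora1 = np.array([0.001, 0.008, 0.006, 0.008])
lora2 = np.array([0.002, 0.016, 0.012, 0.016])

model = checkpoint + 0.5 * lora1 + 0.5 * lora2
print(model)  # [1.0015 0.812  0.609  0.812] (matches the example above)
```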
Again, checkpoint merging and LoRA stacking are explained here more for understanding than for their practical utility. With this updated understanding of Stable Diffusion, we can proceed to the next parts. What we previously called Text2Image should now be understood as basic generation with text conditioning.