Introduction to Stable Diffusion - Code

generative
stable diffusion
diffusers
computer vision
Author

Lucas van Walstijn

Published

July 3, 2023

In the previous blog post, the main components and some intuition behind Stable Diffusion were introduced. Now, let’s see how we can use the HuggingFace diffusers library to generate images. The content of this blog post is based on Lesson 9 and Lesson 10 of Deep Learning for Coders. The end-to-end pipeline is very practical and easy to use; it’s basically a one-liner. We create a diffusion pipeline by downloading pre-trained models from a repo in the HuggingFace hub. Then, we can call this pipe object with a certain prompt:

# pip install diffusers==0.12.1
# pip install accelerate
# pip install transformers==4.25.1

from torchvision import transforms as tfms

import matplotlib.pyplot as plt
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

num_inference_steps = 50
batch_size = 1

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

prompt = "Homer from the Simpsons on his roadbike climbing a mountain in the Pyrenees"
torch.manual_seed(114)
pipe(prompt, num_inference_steps=num_inference_steps, guidance_scale=7.5).images[0]

Not bad, but not great either. Let’s dive one layer deeper and create the components described in the previous post: the UNet, the autoencoder, the text encoder and the noise scheduler:

from diffusers import AutoencoderKL, LMSDiscreteScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer, logging

logging.set_verbosity_error()

# Autoencoder, to go from image -> latents (encoder) and back (decoder)
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae").to("cuda")

# UNet, to predict the noise (latents) from noisy image (latents)
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet").to("cuda")

# Tokenizer and Text encoder to create prompt embeddings
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda")

# The noise scheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps)

To use these components, we first have to tokenize the prompt. Tokenization is nothing more than transforming each word of the prompt into its associated integer according to a “vocabulary”. The “vocabulary” is the mapping of words to integers and is thus generally quite large.
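
Before tokenizing the full prompt, a minimal example (using the tokenizer we loaded above) makes this mapping concrete; the exact integers are specific to this CLIP tokenizer:

# a single word maps to its id from the vocabulary,
# wrapped in the special start- and end-of-text tokens
print(tokenizer.vocab_size)          # size of the vocabulary
print(tokenizer("homer").input_ids)  # e.g. [49406, 16931, 49407]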

text_input = tokenizer(prompt,               # the prompt we want to tokenize
                       padding="max_length", # pad the tokenized input to the max length
                       return_tensors="pt")  # return PyTorch tensors
text_input.input_ids
tensor([[49406, 16931,   633,   518, 21092,   525,   787,  4370,  3701,  9877,
           320,  3965,   530,   518, 39744, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]])

Above we see the integers that are associated with each word in our prompt. We can decode the integers back into words and see whether they match our prompt. Let’s have a look at the first 5 tokens:

[tokenizer.decode(token) for token in text_input.input_ids[0]][:5]
['<|startoftext|>', 'homer', 'from', 'the', 'simpsons']

We see that the tokenizer has lowercased all words, and that a special token is inserted at the beginning of the prompt. Also, we see that the integer 49407 is used to pad our input to the maximum length:

tokenizer.decode(49407)
'<|endoftext|>'
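
As a small aside, the tokenizer output also contains an attention mask, which marks real tokens with 1 and padding tokens with 0, so we can count how many of the 77 positions our prompt actually occupies:

# sum of the attention mask = number of non-padding tokens
# (including the start- and end-of-text tokens)
text_input.attention_mask[0].sum()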

Next, we will pass these tokens through the text encoder to turn each token into an embedding vector. Since we have 77 tokens and the embeddings are of size 768, this will be a tensor of shape [1, 77, 768] (including the batch dimension).

text_embeddings = text_encoder(text_input.input_ids.to("cuda"))[0]
text_embeddings.shape
torch.Size([1, 77, 768])

When generating a completely new image, we start with a fully random noisy latent, so let’s create one:

torch.manual_seed(1024)
latents = torch.randn((batch_size,              # batch size: 1
                       unet.config.in_channels, # input channels of the unet: 4
                       unet.config.sample_size, # height dimension of the unet: 64
                       unet.config.sample_size) # width dimension of the unet: 64
                     ).to("cuda")               # put the tensor on the GPU

latents = latents * scheduler.init_noise_sigma  # scale the noise

latents.shape
torch.Size([1, 4, 64, 64])

The latents thus carry 4 channels and are of size 64 by 64. Let’s pass this latent iteratively through the UNet, each time removing part of the predicted noise (the output of the UNet):

for i, t in enumerate(scheduler.timesteps):
    inputs = latents
    inputs = scheduler.scale_model_input(inputs, t)

    # predict the noise 
    with torch.no_grad(): 
        pred = unet(inputs, t, encoder_hidden_states=text_embeddings).sample

    # update the latents by removing the predicted noise according to the noise schedule
    latents = scheduler.step(pred, t, latents).prev_sample

Let’s visualize the four channels of this latent representation in grey-scale:

fig, axs = plt.subplots(1, 4, figsize=(16, 4))
for c in range(4):
    axs[c].imshow(latents[0][c].cpu(), cmap='Greys')

To transform the latent representation into full-size images, we can use the decoder of the VAE. Note that when we do that, we move from a tensor of shape [1, 4, 64, 64] to [1, 3, 512, 512]:

print(latents.shape, vae.decode(latents).sample.shape)
torch.Size([1, 4, 64, 64]) torch.Size([1, 3, 512, 512])

And let’s visualize the result:

# undo the latent scaling factor (0.18215) before decoding
with torch.no_grad(): image = vae.decode(1 / 0.18215 * latents).sample

# move the tensor to the CPU and convert it to a (H, W, C) numpy array
image = image[0].detach().cpu().permute(1, 2, 0).numpy()
# rescale the values from [-1, 1] to integers in [0, 255]
image = ((image / 2 + 0.5).clip(0, 1) * 255).round().astype("uint8")
Image.fromarray(image)

Unfortunately, the result looks very bad, and especially much worse than our one-liner. The main reason for this is that the StableDiffusionPipeline uses something called Classifier Free Diffusion Guidance. So let’s have a look at that. But before we do, let’s add two code snippets to transform between the latent representation and the full-size image representation. We will do this a couple of times, so it helps to keep the code a bit cleaner:

def latents_to_image(latent):
    with torch.no_grad(): 
        image = vae.decode(1 / 0.18215 * latent).sample

    image = image[0].detach().cpu().permute(1, 2, 0).numpy()
    image = ((image / 2 + 0.5).clip(0, 1) * 255).round().astype("uint8")
    return Image.fromarray(image)
    
def image_to_latent(input_im):
    # scale the image to [0, 1], reorder to (C, H, W) and add a batch dimension
    im_tensor = torch.tensor(np.array(input_im) / 255., dtype=torch.float32).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        # map to [-1, 1] before encoding, then sample from the latent distribution
        latent = vae.encode(im_tensor.to("cuda") * 2 - 1)
    return 0.18215 * latent.latent_dist.sample()

Classifier Free Diffusion Guidance

Classifier Free Guidance refers to a technique in which two images are constructed at the same time from the same latent. One image is generated based on the specified prompt (conditional generation), while the other is generated from an empty prompt (unconditional generation). By mixing the two predictions at each step according to a parameter (the guidance scale), the generated image for the prompt is going to look much better:

torch.manual_seed(3016)

cond_input = tokenizer("Homer from the Simpsons on his roadbike climbing a mountain in the Pyrenees", padding="max_length", return_tensors="pt") 
cond_embeddings = text_encoder(cond_input.input_ids.to("cuda"))[0]

# Create embeddings for the unconditioned process
uncond_input = tokenizer("", padding="max_length", return_tensors="pt") 
uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]

# Concatenate the embeddings
embeddings = torch.cat([cond_embeddings, uncond_embeddings])

guidance_scale = 7.5

# Create a "fresh" random latent to start with
latents = torch.randn((batch_size, unet.config.in_channels, unet.config.sample_size, unet.config.sample_size)).to("cuda")               
latents = latents * scheduler.init_noise_sigma

for i, t in enumerate(scheduler.timesteps):
    inputs = torch.cat([latents, latents]) # duplicate the latents: one copy for the conditional, one for the unconditional prediction
    inputs = scheduler.scale_model_input(inputs, t)

    # predict the noise 
    with torch.no_grad(): 
        pred = unet(inputs, t, encoder_hidden_states=embeddings).sample
    
    # pull both images apart again
    pred_cond, pred_uncond = pred.chunk(2)
    
    # mix the results according to the guidance scale parameter
    pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond)

    # update the latents by removing the predicted noise according to the noise schedule
    latents = scheduler.step(pred, t, latents).prev_sample
    
latents_to_image(latents)

Much better! As you can see, Classifier Free Guidance is a simple technique, but it works very well. This morning (03-07-2023) I saw a tweet that introduced the same concept to the world of Large Language Models (LLMs).

Negative Prompt

As mentioned, the unconditional image in Classifier Free Guidance is created from an empty prompt. It turns out that we can use the prompt of this second image as a so-called negative prompt: if there are certain elements we don’t want to see in our image, we can specify them in this prompt.

We can see this by rewriting the Classifier Free Guidance equation:

\[\begin{aligned} p &= p_{uc} + g (p_{c} - p_{uc}) \\ &= g\, p_{c} + (1 - g)\, p_{uc} \end{aligned}\]

So with a guidance scale larger than 1, the coefficient \((1 - g)\) is negative: the unconditional prediction \(p_{uc}\) is effectively subtracted from the conditional prediction \(p_c\), which has the effect of removing its concepts from the generated image.
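
As a quick sanity check, a few lines of PyTorch (with arbitrary random tensors standing in for the predictions) confirm that the two forms of the equation are equivalent:

g = 7.5
pred_c, pred_uc = torch.randn(2, 4, 64, 64).chunk(2)

form_1 = pred_uc + g * (pred_c - pred_uc)   # the form used in the loop above
form_2 = g * pred_c + (1 - g) * pred_uc     # the rewritten form
print(torch.allclose(form_1, form_2, atol=1e-5))  # True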

An example of Homer Simpson eating lunch:

cond_input = tokenizer("Homer Simpson eating lunch", padding="max_length", return_tensors="pt") 
cond_embeddings = text_encoder(cond_input.input_ids.to("cuda"))[0]

uncond_input = tokenizer("", padding="max_length", return_tensors="pt") 
uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]
torch.manual_seed(105)

embeddings = torch.cat([cond_embeddings, uncond_embeddings])

guidance_scale = 7.5

latents = torch.randn((batch_size, unet.config.in_channels, unet.config.sample_size, unet.config.sample_size)).to("cuda")               
latents = latents * scheduler.init_noise_sigma

for i, t in enumerate(scheduler.timesteps):
    inputs = torch.cat([latents, latents]) # concatenate the latents
    inputs = scheduler.scale_model_input(inputs, t)

    with torch.no_grad(): 
        pred = unet(inputs, t, encoder_hidden_states=embeddings).sample
    
    pred_cond, pred_uncond = pred.chunk(2)
    
    pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond)

    latents = scheduler.step(pred, t, latents).prev_sample

latents_to_image(latents)

And removing the blue chair by using a negative prompt:

uncond_input = tokenizer("blue chair", padding="max_length", return_tensors="pt") 
uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]
torch.manual_seed(105)

embeddings = torch.cat([cond_embeddings, uncond_embeddings])

guidance_scale = 7.5

latents = torch.randn((batch_size, unet.config.in_channels, unet.config.sample_size, unet.config.sample_size)).to("cuda")               
latents = latents * scheduler.init_noise_sigma

for i, t in enumerate(scheduler.timesteps):
    inputs = torch.cat([latents, latents]) # concatenate the latents
    inputs = scheduler.scale_model_input(inputs, t)

    with torch.no_grad(): 
        pred = unet(inputs, t, encoder_hidden_states=embeddings).sample
    
    pred_cond, pred_uncond = pred.chunk(2)
    
    pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond)

    latents = scheduler.step(pred, t, latents).prev_sample

image = latents_to_image(latents)
image

And gone is the blue chair! I must admit that this doesn’t always work as well as in this example; in fact, I had to try quite a lot of prompts in combination with negative prompts to find a good example for this post.

Image-to-image generation

Image-to-image generation is another super interesting process, in which we use both a prompt and an image to guide the generation. This comes in handy if, for example, we want to create a variant of an image we already have. Let’s say we have an awesome image of Homer eating a burger, and we want a similar image in which Marge eats the burger, or in which Homer eats a slice of pizza instead. We can then feed both the appropriate prompt and the already existing image to guide the image generation process even more.

The way this works is that instead of starting from a completely random latent, we build a noisy latent from our existing image.

Let’s start with the image above and add some noise to it, for example by adding the noise for level 15 (we have 50 noise levels in total, so level 15 means that we still have 35 denoising steps to go):

latent = image_to_latent(image)
noise_latent = torch.randn_like(latent)

sampling_step = 15
noised_latent = scheduler.add_noise(latent, noise_latent, timesteps=torch.tensor([scheduler.timesteps[sampling_step]]))

latents_to_image(noised_latent)

If you squint, you can already see some structure: in the middle there is a yellow blob sitting around (eating lunch...). Let’s take this noisy latent and do the remaining 35 denoising steps with a different prompt:

torch.manual_seed(105)

cond_input = tokenizer("Homer Simpson eating Hot Pot", padding="max_length", return_tensors="pt") 
cond_embeddings = text_encoder(cond_input.input_ids.to("cuda"))[0]

uncond_input = tokenizer("", padding="max_length", return_tensors="pt") 
uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]

embeddings = torch.cat([cond_embeddings, uncond_embeddings])

guidance_scale = 7.5

# We start with the noised_latent coming from the image, defined above
latents = noised_latent

for i, t in enumerate(scheduler.timesteps):
    # we only do the steps starting from the specified level
    if i >= sampling_step:
        
        inputs = torch.cat([latents, latents])
        inputs = scheduler.scale_model_input(inputs, t)

        with torch.no_grad(): 
            pred = unet(inputs, t, encoder_hidden_states=embeddings).sample

        pred_cond, pred_uncond = pred.chunk(2)
        pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond)

        latents = scheduler.step(pred, t, latents).prev_sample
    
latents_to_image(latents)

This image is very similar to what we started with. Color scheme, composition and camera angle are all the same. At the same time, the prompt is also reflected by a change of dishes on the table.

And that’s it, that’s image-to-image generation. As you can see, it’s nothing deeply complicated; it’s just a smart way to reuse the components we have already seen.
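
To tie everything together, here is a rough sketch of how the loop above could be wrapped into a reusable helper. The function name img2img and its signature are my own invention (not part of diffusers), and it assumes the vae, unet, tokenizer, text_encoder and scheduler objects as well as the image_to_latent and latents_to_image helpers defined earlier:

def img2img(prompt, init_image, start_step=15, guidance_scale=7.5, negative_prompt="", seed=105):
    torch.manual_seed(seed)

    # embed the conditional prompt and the unconditional (or negative) prompt
    cond_input = tokenizer(prompt, padding="max_length", return_tensors="pt")
    uncond_input = tokenizer(negative_prompt, padding="max_length", return_tensors="pt")
    cond_embeddings = text_encoder(cond_input.input_ids.to("cuda"))[0]
    uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]
    embeddings = torch.cat([cond_embeddings, uncond_embeddings])

    # encode the input image and noise it up to the requested level
    latent = image_to_latent(init_image)
    noise = torch.randn_like(latent)
    latents = scheduler.add_noise(latent, noise, timesteps=torch.tensor([scheduler.timesteps[start_step]]))

    # run the remaining denoising steps with Classifier Free Guidance
    for i, t in enumerate(scheduler.timesteps):
        if i >= start_step:
            inputs = scheduler.scale_model_input(torch.cat([latents, latents]), t)
            with torch.no_grad():
                pred = unet(inputs, t, encoder_hidden_states=embeddings).sample
            pred_cond, pred_uncond = pred.chunk(2)
            pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond)
            latents = scheduler.step(pred, t, latents).prev_sample

    return latents_to_image(latents)

With such a helper in place, the hot-pot example above reduces to a single call like img2img("Homer Simpson eating Hot Pot", image).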

I hope this blog post shows how the components introduced in the previous post translate to code. The examples shown here only touch upon what can be achieved. In fact, the lessons this post is based on cover many more interesting concepts, such as textual inversion. If you are interested, have a look here.