In the previous blog post, the main components and some intuition behind Stable Diffusion were introduced. Now, let’s see how we can use the HuggingFace diffusers library to generate images. The content of this blog post is based on Lesson 9 and Lesson 10 of Deep Learning for Coders. The end-to-end pipeline is very practical and easy to use: it’s basically a one-liner. We create a diffusion pipeline by downloading pre-trained models from a repo in the HuggingFace hub. Then, we can call this pipe object with a certain prompt:
# pip install diffusers==0.12.1
# pip install accelerate
# pip install transformers==4.25.1
from torchvision import transforms as tfms
import matplotlib.pyplot as plt
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

num_inference_steps = 50
batch_size = 1

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
prompt = "Homer from the Simpsons on his roadbike climbing a mountain in the Pyrenees"
# run the full pipeline on the prompt and get the generated image
pipe(prompt, num_inference_steps=num_inference_steps).images[0]
Not bad, but not great either. Let’s dive one layer deeper and create the components described in the previous post: the UNet, the autoencoder, the text encoder and the noise scheduler:
from diffusers import AutoencoderKL, LMSDiscreteScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer, logging

logging.set_verbosity_error()

# Autoencoder, to go from image -> latents (encoder) and back (decoder)
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae").to("cuda")
# UNet, to predict the noise (latents) from noisy image (latents)
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet").to("cuda")
# Tokenizer and text encoder to create prompt embeddings
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda")
# The noise scheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps)
To use these components, we first have to tokenize the prompt. Tokenization is nothing more than transforming each word of the prompt into its associated integer according to a “vocabulary”. The “vocabulary” is the mapping of words to integers and is thus generally quite large.
text_input = tokenizer(prompt,                # the prompt we want to tokenize
                       padding="max_length",  # pad the tokenized input to the max length
                       return_tensors="pt")   # return PyTorch tensors
text_input.input_ids
Above we see the integers that are associated with each word in our prompt. We can decode the integers back into words and see if it matches our prompt. Let’s have a look at the first 5 tokens:
[tokenizer.decode(token) for token in text_input.input_ids[0]][:5]
We see that the tokenizer has lowercased the text and inserted a special token at the beginning of the prompt. Also, we see that the integer 49407 is used to pad our input to the maximum length:
tokenizer.decode(49407)
'<|endoftext|>'
Next, we will pass these tokens through the text-encoder to turn each token into an embedding vector. Since we have 77 tokens and the embeddings are of size 768, this will be a tensor of shape [77, 768].
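Encoding the tokens is a single call to the text encoder. Below is a minimal sketch of this step; the resulting text_embeddings tensor is the one used in the denoising loop further down:

# encode the tokenized prompt into embeddings of shape [1, 77, 768]
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to("cuda"))[0]
text_embeddings.shape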
When generating a completely new image, we start with a fully random noisy latent, so let’s create one:
torch.manual_seed(1024)
latents = torch.randn((batch_size,              # batch size: 1
                       unet.config.in_channels, # input channels of the unet: 4
                       unet.config.sample_size, # height dimension of the unet: 64
                       unet.config.sample_size) # width dimension of the unet: 64
                      ).to("cuda")              # put the tensor on the GPU
latents = latents * scheduler.init_noise_sigma  # scale the noise
latents.shape
torch.Size([1, 4, 64, 64])
The latents thus carry 4 channels and are of size 64 by 64. Let’s pass this latent iteratively through the UNet, each time subtracting part of the predicted noise (the output of the UNet):
for i, t in enumerate(scheduler.timesteps):
    inputs = latents
    inputs = scheduler.scale_model_input(inputs, t)
    # predict the noise
    with torch.no_grad():
        pred = unet(inputs, t, encoder_hidden_states=text_embeddings).sample
    # update the latents by removing the predicted noise according to the noise schedule
    latents = scheduler.step(pred, t, latents).prev_sample
Let’s visualize the four channels of this latent representation in grey-scale:
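The plotting code itself is not essential, but a small sketch with matplotlib could look like this (one subplot per latent channel, shown in grey-scale):

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for c in range(4):
    # show each of the 4 latent channels of the (single) image in the batch
    axes[c].imshow(latents[0][c].detach().cpu(), cmap="gray")
    axes[c].axis("off")
plt.show()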
To transform the latent representation to full-size images, we can use the decoder of the VAE. Note that when we do that, we move from a tensor of shape [4, 64, 64] to [3, 512, 512]:
# scale back according to the VAE paper
with torch.no_grad():
    image = vae.decode(1 / 0.18215 * latents).sample
# move tensor to numpy
image = image[0].detach().cpu().permute(1, 2, 0).numpy()
# scale the values to 0-255
image = ((image / 2 + 0.5).clip(0, 1) * 255).round().astype("uint8")
Image.fromarray(image)
Unfortunately, the result looks very bad, and especially much worse than our one-liner. The main reason for this is that the StableDiffusionPipeline uses something called Classifier Free Diffusion Guidance. So let’s have a look at that. But before we do, let’s add two code snippets to transform from the latent representation to the full-size image representation and back. We will do this a couple of times, so it helps to keep the code a bit cleaner:
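Below is a minimal sketch of these two helpers. The name latents_to_image is the one used in the code further down; image_to_latents (and the exact pre-processing in it) is my own naming for the reverse direction:

def latents_to_image(latents):
    # decode the latents with the VAE decoder, scaling back as above
    with torch.no_grad():
        image = vae.decode(1 / 0.18215 * latents).sample
    image = image[0].detach().cpu().permute(1, 2, 0).numpy()
    image = ((image / 2 + 0.5).clip(0, 1) * 255).round().astype("uint8")
    return Image.fromarray(image)

def image_to_latents(image):
    # encode a PIL image into a (scaled) latent with the VAE encoder
    image = tfms.ToTensor()(image).unsqueeze(0).to("cuda") * 2 - 1  # scale to [-1, 1]
    with torch.no_grad():
        latent = vae.encode(image).latent_dist.sample()
    return 0.18215 * latent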
Classifier Free Guidance refers to a technique in which two images are constructed at the same time from the same latent. One of the images is generated based on the specified prompt (conditional generation); the other is generated from an empty prompt (unconditional generation). By mixing the two predictions during the process according to a parameter (called the guidance scale), the generated image for the prompt is going to look much better:
torch.manual_seed(3016)
cond_input = tokenizer("Homer from the Simpsons on his roadbike climbing a mountain in the Pyrenees", padding="max_length", return_tensors="pt")
cond_embeddings = text_encoder(cond_input.input_ids.to("cuda"))[0]
# Create embeddings for the unconditioned process
uncond_input = tokenizer("", padding="max_length", return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]
# Concatenate the embeddings
embeddings = torch.cat([cond_embeddings, uncond_embeddings])

guidance_scale = 7.5

# Create a "fresh" random latent to start with
latents = torch.randn((batch_size, unet.config.in_channels, unet.config.sample_size, unet.config.sample_size)).to("cuda")
latents = latents * scheduler.init_noise_sigma

for i, t in enumerate(scheduler.timesteps):
    inputs = torch.cat([latents, latents])  # concatenate the latents
    inputs = scheduler.scale_model_input(inputs, t)
    # predict the noise
    with torch.no_grad():
        pred = unet(inputs, t, encoder_hidden_states=embeddings).sample
    # pull both images apart again
    pred_cond, pred_uncond = pred.chunk(2)
    # mix the results according to the guidance scale parameter
    pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond)
    # update the latents by removing the predicted noise according to the noise schedule
    latents = scheduler.step(pred, t, latents).prev_sample

latents_to_image(latents)
Much better! As you can see, Classifier Free Guidance is a simple technique, but it works very well. This morning (03-07-2023) I saw a tweet that introduced the same concept to the world of Large Language Models (LLMs).
Negative Prompt
As mentioned, the unconditional image with Classifier Free Guidance is created from an empty prompt. It turns out that we can use the prompt of this second image as a so-called negative prompt. If there are certain elements we don’t want to see in our image, we can specify them in this prompt.
We can see this by rewriting the Classifier Free Guidance equation:
\[\begin{align}
p &= p_{uc} + g (p_{c} - p_{uc}) \\
p &= g p_{c} + (1 - g) p_{uc} \\
\end{align}\]
So with a guidance scale larger than 1, the weight \((1 - g)\) of the unconditional prediction \(p_{uc}\) becomes negative: \(p_{uc}\) is effectively subtracted from the conditional prediction \(p_c\). If we replace the empty prompt with a negative prompt, this has the effect of removing the concepts in that prompt from the generated image, as sketched below.
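To make this concrete, here is a sketch of the Classifier Free Guidance loop with a negative prompt. It is the same loop as before; the only change is that the second set of embeddings now comes from a negative prompt instead of an empty one. The prompts are illustrative and not necessarily the exact ones used for the images in this post:

torch.manual_seed(3016)
# conditional prompt (illustrative)
cond_input = tokenizer("Homer from the Simpsons sitting in his living room", padding="max_length", return_tensors="pt")
cond_embeddings = text_encoder(cond_input.input_ids.to("cuda"))[0]
# instead of an empty prompt, we now pass a negative prompt
neg_input = tokenizer("a blue chair", padding="max_length", return_tensors="pt")
neg_embeddings = text_encoder(neg_input.input_ids.to("cuda"))[0]
embeddings = torch.cat([cond_embeddings, neg_embeddings])

guidance_scale = 7.5
latents = torch.randn((batch_size, unet.config.in_channels, unet.config.sample_size, unet.config.sample_size)).to("cuda")
latents = latents * scheduler.init_noise_sigma

for i, t in enumerate(scheduler.timesteps):
    inputs = scheduler.scale_model_input(torch.cat([latents, latents]), t)
    with torch.no_grad():
        pred = unet(inputs, t, encoder_hidden_states=embeddings).sample
    pred_cond, pred_neg = pred.chunk(2)
    # the negative-prompt prediction now takes the place of the unconditional one
    pred = pred_neg + guidance_scale * (pred_cond - pred_neg)
    latents = scheduler.step(pred, t, latents).prev_sample

latents_to_image(latents)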
And gone is the blue chair! I must admit that this doesn’t always work as well as in this example; in fact, I had to try quite a lot of prompts in combination with negative prompts to find a good example for this post.
Image-to-image generation
Image-to-image generation is another super interesting process, in which we use both a prompt and an image to guide the generation. This comes in handy if, for example, we want to create a variant of an image we already have. Let’s say we have an awesome image of Homer eating a burger, and we want a similar image in which Marge eats the burger instead, or Homer eats a slice of pizza. We can then feed both the new prompt and the existing image to guide the image generation process even more.
The way this works is by not starting from a completely random latent, but instead building a noisy latent from our existing image.
Let’s start with the image above and add some noise to it, for example by adding the noise for level 15 (we have 50 noise levels in total, so level 15 means that we still have 35 denoising steps to go):
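The noising itself can be done with the scheduler. Below is a sketch, assuming the existing image is stored in a local file (the file name is a placeholder; sampling_step and noised_latent are the names used in the loop further down, and image_to_latents is the helper sketched earlier):

# load the existing image and encode it into a latent
init_image = Image.open("homer_eating_a_burger.png").convert("RGB").resize((512, 512))  # placeholder file name
init_latent = image_to_latents(init_image)

# noise level 15 out of the 50 inference steps
sampling_step = 15
noise = torch.randn_like(init_latent)
noised_latent = scheduler.add_noise(init_latent, noise, scheduler.timesteps[sampling_step:sampling_step + 1])
latents_to_image(noised_latent)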
If you squint, you can already see some structure: in the middle there is a yellow blob sitting around (eating lunch). Let’s take this noisy latent and do the remaining 35 denoising steps with a different prompt:
torch.manual_seed(105)
cond_input = tokenizer("Homer Simpson eating Hot Pot", padding="max_length", return_tensors="pt")
cond_embeddings = text_encoder(cond_input.input_ids.to("cuda"))[0]
uncond_input = tokenizer("", padding="max_length", return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]
embeddings = torch.cat([cond_embeddings, uncond_embeddings])

guidance_scale = 7.5

# We start with the noised_latent coming from the image, defined above
latents = noised_latent

for i, t in enumerate(scheduler.timesteps):
    # we only do the steps starting from the specified level
    if i >= sampling_step:
        inputs = torch.cat([latents, latents])
        inputs = scheduler.scale_model_input(inputs, t)
        with torch.no_grad():
            pred = unet(inputs, t, encoder_hidden_states=embeddings).sample
        pred_cond, pred_uncond = pred.chunk(2)
        pred = pred_uncond + guidance_scale * (pred_cond - pred_uncond)
        latents = scheduler.step(pred, t, latents).prev_sample

latents_to_image(latents)
This image is very similar to what we started with. Color scheme, composition and camera angle are all the same. At the same time, the prompt is clearly reflected in the change of dishes on the table.
And that’s it: that’s image-to-image generation. As you can see, it’s nothing deeply complicated; it’s just a smart way to reuse the components we have already seen.
I hope this blog post shows how the components introduced in the previous post translate to code. The examples shown here only touch upon what can be achieved. In fact, the lessons upon which this post is based cover many more interesting concepts, such as textual inversion. If you are interested, have a look here.