Image Generation with DeepFloyd and Implementation of a Diffusion Denoiser

Disclaimer for the TAs:

I finished the website after I submitted everything to Gradescope. However, all images except those for part 1.10 (which I forgot to upload earlier) were committed to the repository before I submitted to Gradescope. As I am taking the class P/NP, please consider the website for grading: although the PDF I submitted to Gradescope contains all the necessary imagery, it is not meant to be a report, since I was very time constrained when I submitted it. Feel free to regenerate everything, as the code I submitted to Gradescope was final.

Part 0: Setup

In the first part of this project, we generate images with a pre-trained denoising model built from two UNets: the stage_1 UNet generates 64x64 pixel images, and the stage_2 UNet upsamples them to 256x256 pixels. We access the model through Hugging Face. The pipeline also takes a num_inference_steps argument, which controls the number of steps in the denoising process, so we expect images generated with more inference steps to be denoised further and to be somewhat higher quality and more detailed. Below are two sets of three images generated with different numbers of inference steps: 20 for the first set and 80 for the second.

Images with num_inference_steps = 20

Images with num_inference_steps = 80
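
For reference, here is a minimal sketch of how the two stages can be loaded and run through Hugging Face diffusers; the checkpoint names are the public DeepFloyd IF models, while the prompt and step counts are illustrative, not my exact notebook code:

```python
import torch
from diffusers import DiffusionPipeline

# Stage 1 generates 64x64 images; stage 2 upsamples them to 256x256.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# Prompts are embedded once and shared by both stages.
prompt_embeds, negative_embeds = stage_1.encode_prompt("an oil painting of a snowy mountain village")

# More inference steps -> more denoising iterations.
image_64 = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
    num_inference_steps=20, output_type="pt",
).images
image_256 = stage_2(
    image=image_64, prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds, num_inference_steps=20,
).images[0]
```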

Part 1: Sampling Loops

1.1 Implementing the forward process

We have a list of precomputed betas, alphas = 1 - betas, and alpha_bars (the cumulative products of the alphas), in accordance with the DDPM paper equations. In diffusion, we assume a noisy image can be recovered by removing its noise, so the task becomes learning the noise added at each timestep, so that we can subtract it and recover the original image. If we train a model for this, we can recover any image resembling those the model was trained on; that is the task of the second part of this project. For this part, we simply noise an image according to the equation below. Notice that at t = 0 the noise term should vanish, since we should recover the original image at the first timestep; accordingly, as t approaches 0, alpha_bar_t approaches 1, and as t grows, alpha_bar_t approaches 0.
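
Written out, the forward process is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I). A minimal sketch, assuming alpha_bars is the precomputed 1-D tensor of cumulative products:

```python
import torch

def forward(im, t, alpha_bars):
    """Noise a clean image im to timestep t (DDPM forward process)."""
    eps = torch.randn_like(im)        # eps ~ N(0, I)
    ab = alpha_bars[t]                # alpha_bar_t
    return ab.sqrt() * im + (1 - ab).sqrt() * eps
```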

Associated images

1.2 Classical Denoising

In this part we show that Gaussian blurring fails to recover the images noised to timesteps [250, 500, 750], or any noisy image for that matter!
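
For reference, the blur baseline can be done with torchvision; the kernel size and sigma here are illustrative, not the exact values I used:

```python
from torchvision.transforms.functional import gaussian_blur

# Blurring only averages the noise into the signal; it cannot remove it.
denoise_attempt = gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```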

Associated Images

1.3 Implementing One Step Denoising

Here, we implement one-step denoising, which simply rearranges the forward-process equation to solve for x_0, the original image. We feed noisy images at different timesteps ([250, 500, 750]) to the diffusion model to get a noise estimate, then plug this estimate into the equation, with x_t being our noisy image, to recover an estimate of the original image.
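
Rearranging the forward process gives x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t). A sketch, where unet(x, t) stands in for the model's noise estimator:

```python
def one_step_denoise(x_t, t, alpha_bars, unet):
    """Estimate the clean image x_0 from x_t in a single step."""
    eps = unet(x_t, t)                # model's noise estimate at timestep t
    ab = alpha_bars[t]
    return (x_t - (1 - ab).sqrt() * eps) / ab.sqrt()
```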

Associated Images

1.4 Implementing Iterative Denoising

Although the images above look decent, we can do better with iterative denoising, adhering to the equation below. At each step, we feed the current noisy image x_t to the model to get a noise estimate, compute a clean-image estimate x_0 as in one-step denoising, and then use the equation to move to a slightly less noisy image x_t'. We follow this procedure with strided timesteps for efficiency until we reach a clean image. Here we start the strided timesteps at i_start = 10, so we begin not from pure noise but from a noised version of the image we are trying to regenerate. This can be viewed as interpolation on the image manifold: we are cruising along the path from x_t to x_0 according to our model.
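
A sketch of the loop, with unet(x, t) again standing in for the noise estimator; the per-step blend is the DDPM posterior-mean update, and the added variance term v_sigma is omitted here for brevity:

```python
def iterative_denoise(x, i_start, strided_timesteps, alpha_bars, unet):
    """Denoise from strided_timesteps[i_start] down to a clean image."""
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_next = strided_timesteps[i], strided_timesteps[i + 1]  # t > t_next
        ab_t, ab_next = alpha_bars[t], alpha_bars[t_next]
        alpha = ab_t / ab_next        # effective alpha over this stride
        beta = 1 - alpha

        eps = unet(x, t)                                    # noise estimate
        x0 = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()    # one-step clean estimate

        # Blend the clean estimate with the current image to reach t_next.
        x = (ab_next.sqrt() * beta / (1 - ab_t)) * x0 \
            + (alpha.sqrt() * (1 - ab_next) / (1 - ab_t)) * x
    return x
```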

Associated Images

1.5 Diffusion Model Sampling

Notice that if we set i_start = 0, we start from pure noise, which lets us generate an image from scratch. Since DeepFloyd is also a text-conditioned model, we generate images from scratch with i_start = 0 and the generic prompt "a high quality photo".

Associated Images

Part 1.6: Classifier-Free Guidance

It turns out that if we compute noise estimates for both a conditional prompt and an unconditional prompt and combine them according to the equation below with guidance strength gamma, we get higher quality images, even when starting from scratch. In this case, the conditional prompt is "a high quality photo", the unconditional prompt is the empty string, and gamma = 7.
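
The combined estimate is eps = eps_uncond + gamma * (eps_cond - eps_uncond); with gamma > 1 we extrapolate past the conditional estimate, pushing the sample harder toward the prompt. A sketch, with the conditioned-UNet call signature assumed:

```python
def cfg_noise(x, t, unet, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: amplify the conditional direction."""
    eps_cond = unet(x, t, cond_emb)      # estimate for the prompt
    eps_uncond = unet(x, t, uncond_emb)  # estimate for the empty prompt
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```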

Associated Images

Part 1.7: Image-to-Image Translation

Here, we show the results of iterative denoising with classifier-free guidance for i_start values in [1, 3, 5, 7, 10, 20] and conditional prompt "a high quality photo". For this part, I use the Berkeley Campanile and the album covers of Animals and High Hopes by Pink Floyd.

Associated Images

Part 1.7.1: Editing Hand-Drawn and Web Images

Here, I show the results of iterative classifier-free guidance on one web image and two hand-drawn images of my own. As input images, I use a web image of an avocado and hand-drawn images of a shark and a sailing yacht.

Associated Images

Part 1.7.2: Inpainting

We can also do inpainting: keeping a certain part of the image unchanged and filling the rest with generated content. We accomplish this by creating a mask and, at every denoising step, combining the denoised image (multiplied by the mask) with the appropriately re-noised original image (multiplied by (1 - mask)), adhering to the equation below. I replace the tops of the Berkeley Campanile, Izmir Saat Kulesi, and Istanbul Kiz Kulesi.
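
At each step this amounts to x <- mask * x + (1 - mask) * forward(x_orig, t): the region we want to keep is re-noised from the original at every timestep. A sketch, reusing forward from part 1.1:

```python
def inpaint_step(x, t, mask, x_orig, alpha_bars):
    """Force the region outside the mask back to the (re-noised) original."""
    return mask * x + (1 - mask) * forward(x_orig, t, alpha_bars)
```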

Associated Images

Part 1.7.3: Text-Conditioned Image-to-image Translation

Here, we run the same image-to-image translation but condition on a text prompt: we simply change the conditional prompt from "a high quality photo" to any prompt we like and denoise from the same noise levels as above. For this part I use the prompt "a rocket ship" for four of my images.

Associated Images

Part 1.8: Visual Anagrams

With this model, we get two noise estimates for two different prompts, adhering to the equation below: one on the image as-is for the first prompt, and one on the flipped image for the second prompt (flipped back afterward); their average is our final noise estimate. This way, we denoise an image that looks like prompt 1 upright and like prompt 2 rotated. For this part, I use the prompt pairs ("an oil painting of people around a campfire", "an oil painting of an old man"), ("a coffee cup with a handle", "a classic golden bell with a curved handle"), and ("an oil painting of a snowy mountain village", "a rocket ship"), respectively, for the images below.
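
A sketch of the anagram noise estimate, with the conditioned-UNet call signature assumed; flipping is done along the height axis:

```python
import torch

def anagram_noise(x, t, unet, emb1, emb2):
    """Average the upright and flipped noise estimates."""
    eps1 = unet(x, t, emb1)                       # prompt 1 on the upright image
    x_flipped = torch.flip(x, dims=[-2])          # flip vertically
    eps2 = torch.flip(unet(x_flipped, t, emb2), dims=[-2])  # prompt 2, flipped back
    return (eps1 + eps2) / 2
```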

Associated Images

Part 1.10: Hybrid Images

Just like in project 3, we can generate hybrid images with this model, adhering to the equations below. We get two separate noise estimates for two different prompts, lowpass one with a Gaussian blur, highpass the other (the estimate minus its Gaussian-blurred version), and add them to get a hybrid noise estimate, then denoise from there. I use the prompt pairs ("a lithograph of a skull", "a lithograph of waterfalls"), ("a rocket ship", "a pencil"), and ("a detailed photograph of a full moon with craters", "a simple round clock on black background"), respectively; the results are below.
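
A sketch of the hybrid noise estimate; the kernel size and sigma are illustrative choices:

```python
from torchvision.transforms.functional import gaussian_blur

def hybrid_noise(x, t, unet, emb1, emb2, kernel_size=33, sigma=2.0):
    """Low frequencies from prompt 1, high frequencies from prompt 2."""
    eps1 = unet(x, t, emb1)
    eps2 = unet(x, t, emb2)
    low = gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)
    high = eps2 - gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```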

Associated Images

Part 2.1: Training a Single-Step Denoising UNet

To implement the UNet, we follow the architecture below, using the necessary tools from PyTorch and GPUs from Colab. The loss function L is used later for backpropagation to train the UNet. This architecture lets us estimate the noise in an image. The convolutional blocks process the image at a given resolution and learn patterns and structures in the noisy input. The down blocks downsample the image, reducing resolution while increasing the number of channels, essentially extracting features at lower resolution at lower cost. The up blocks increase resolution while reducing the number of channels, essentially recombining global features. Finally, the concatenation blocks concatenate features at the same resolution, helping the model combine local details with global features.
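
A much-simplified PyTorch sketch of this block structure (channel counts and block contents are condensed from the spec's diagram, so treat it as illustrative):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolutions at constant resolution: learn local structure."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.GELU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class UNet(nn.Module):
    def __init__(self, c=1, d=64):
        super().__init__()
        self.enc1 = ConvBlock(c, d)
        self.down = nn.Conv2d(d, 2 * d, 3, stride=2, padding=1)         # halve resolution, widen
        self.enc2 = ConvBlock(2 * d, 2 * d)
        self.up = nn.ConvTranspose2d(2 * d, d, 4, stride=2, padding=1)  # restore resolution, narrow
        self.dec = ConvBlock(2 * d, d)   # takes the skip concatenation
        self.out = nn.Conv2d(d, c, 3, padding=1)

    def forward(self, x):
        h1 = self.enc1(x)
        h2 = self.enc2(self.down(h1))
        u = self.up(h2)
        # Skip connection: concatenate same-resolution features.
        return self.out(self.dec(torch.cat([u, h1], dim=1)))
```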

Associated Images

A sample set of images from the dataset noised with noise levels [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

Part 2.2.1: Using the UNet to train a Denoiser

After implementing the model above, we can train the UNet to predict the clean image from a noisy input by defining the loss as the distance between the actual image and the predicted image. During training, for each epoch we draw batches from the dataset, noise the images in each batch, feed them to the model, and run backpropagation with the given loss function. The loss curve is provided below, along with predicted images at different epochs.

The equation for the noisy image, where epsilon is noise and sigma is the noise level.
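
A sketch of this training loop (the optimizer and learning rate are illustrative; sigma = 0.5 matches the training setup described in the next part):

```python
import torch
import torch.nn as nn

def train_denoiser(model, loader, epochs=5, sigma=0.5, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        for x, _ in loader:                      # class labels unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)  # z = x + sigma * eps
            loss = loss_fn(model(z), x)          # predict the clean image
            opt.zero_grad()
            loss.backward()
            opt.step()
```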

Associated Images

Model Training Loss

Regenerated images from the model at 1st and 5th epochs respectively:

Part 2.2.2: Out of Distribution Testing

We trained the above UNet with noise level sigma = 0.5. Here we test the model on images noised with different standard deviations, specifically [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].

Part 2.3: Training a Diffusion Model

Instead of training the model to predict the clean image at different noise levels, we can train it to predict the noise at different timesteps and iteratively denoise a noisy image. We do this by incorporating time conditioning into the model and subtracting the predicted noise iteratively, reaching a clean image after T timesteps, adhering to the equations below.

Loss function for the noise

Equation for the noisy image.
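
The schedule itself can be precomputed once, DDPM-style; the linspace endpoints below are assumptions in this sketch:

```python
import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)        # linear variance schedule (assumed endpoints)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod of alphas up to t

def noise_to(x0, t):
    """x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps, batched over t."""
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps, eps
```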

Associated Images

Regenerated images from the model with different noise levels at [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] respectively:

Part 2.3.1: Adding Time Conditioning to UNet

We add time conditioning with the fully-connected block (FCBlock) architecture below: the timestep is embedded by an FCBlock and added to the first unflattened block and the first upsampled block, as described below.

Time conditioned Loss function
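
A sketch of the FCBlock and of where its output is injected; the layer names are mine:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Embed a scalar (e.g. the normalized timestep t/T) into a feature vector."""
    def __init__(self, d_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, d_out), nn.GELU(),
            nn.Linear(d_out, d_out),
        )

    def forward(self, t):
        # t: (batch, 1) -> (batch, d_out, 1, 1), so it broadcasts over H and W
        return self.net(t).unsqueeze(-1).unsqueeze(-1)

# Inside the UNet forward pass (sketch):
#   unflat = unflat + self.t_fc1(t)   # first unflattened block
#   up1    = up1    + self.t_fc2(t)   # first upsampled block
```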

Part 2.3.2: Training the UNet

We train the time-conditioned UNet with backpropagation again. The only difference is that we now predict noise, and the loss is defined as the distance between the predicted noise and the actual noise, adhering to the algorithm below.

Time conditioned UNet architecture

Fully connected block architecture

Training algorithm
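
A sketch of that training algorithm, reusing T and noise_to from the schedule above (hyperparameters illustrative):

```python
import torch
import torch.nn as nn

def train_time_conditioned(model, loader, epochs=20, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        for x0, _ in loader:
            x0 = x0.to(device)
            t = torch.randint(0, T, (x0.shape[0],), device=device)  # random timesteps
            x_t, eps = noise_to(x0, t)              # forward process
            t_in = (t.float() / T).unsqueeze(-1)    # normalized timestep for the FCBlock
            loss = loss_fn(model(x_t, t_in), eps)   # predict the noise, not the image
            opt.zero_grad()
            loss.backward()
            opt.step()
```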

Associated Images

Time conditioned loss at epoch 5:

Time conditioned loss at epoch 20:

Part 2.3.3: Sampling from the UNet

To sample from this UNet, we start from a pure-noise image, then iteratively predict the noise at each timestep with our trained model and remove it step by step until we reach x_0, adhering to the equations below.

Sampling algorithm with T = 300
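
A sketch of the sampling loop; the per-step update is the standard DDPM posterior mean plus noise, and the schedule tensors are the ones defined above:

```python
import torch

@torch.no_grad()
def sample(model, shape=(1, 1, 28, 28), device="cuda"):
    x = torch.randn(shape, device=device)            # start from pure noise
    for t in range(T - 1, -1, -1):
        t_in = torch.full((shape[0], 1), t / T, device=device)
        eps = model(x, t_in)                         # predicted noise
        a, ab, b = alphas[t].item(), alpha_bars[t].item(), betas[t].item()
        # Remove the predicted noise and rescale (DDPM mean).
        x = (x - (1 - a) / (1 - ab) ** 0.5 * eps) / a ** 0.5
        if t > 0:
            x = x + b ** 0.5 * torch.randn_like(x)   # add variance except at the last step
    return x
```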

Associated Images

Sample from the model trained with 5 epochs:

Sample from the model trained with 20 epochs:

Part 2.4: Adding Class Conditioning to UNet

We can generate better outputs by making the model class-conditional as well; we have 10 classes, since we are working with the MNIST dataset. So that the model can still generate without a class (which classifier-free guidance at sampling time requires), we add a parameter p_uncond: with this probability the class-conditioning vector is zeroed out, turning the class-conditional block into the identity map and letting the model also train on unlabeled inputs, which improves performance. To add the class conditioning, we add another fully-connected block whose output multiplies the first unflattened block and the first upsampled block, after which the time conditioning is added. We train with the same backpropagation algorithm as in the time-conditional training.
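
A sketch of the class-vector construction and the modulation (p_uncond = 0.1 is an assumed value; layer names are mine):

```python
import torch
import torch.nn.functional as F

def make_class_vector(labels, p_uncond=0.1, num_classes=10):
    """One-hot class vector, zeroed out with probability p_uncond per sample."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], device=labels.device) > p_uncond).float()
    return c * keep.unsqueeze(-1)     # the zero vector means "no class"

# Inside the UNet forward pass (sketch):
#   unflat = self.c_fc1(c) * unflat + self.t_fc1(t)   # modulate by class, shift by time
#   up1    = self.c_fc2(c) * up1    + self.t_fc2(t)
```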

Part 2.5: Sampling from the Class and Time Conditional UNet

The sampling process is the same as before; the only difference is that we also apply classifier-free guidance when computing the noise estimate. The results are below.

Time and class conditioned sampling algorithm
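
Inside the sampling loop, the noise estimate is replaced by the classifier-free-guidance combination; the guidance scale gamma here is an assumed value:

```python
import torch

def cfg_class_noise(model, x, t_in, c, gamma=5.0):
    """Classifier-free guidance with the zero class vector as 'unconditional'."""
    eps_cond = model(x, t_in, c)                       # conditioned on one-hot class c
    eps_uncond = model(x, t_in, torch.zeros_like(c))   # zero vector = unconditional
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```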

Associated Images

Training Loss of the model trained with 5 epochs:

Training Loss of the model trained with 20 epochs:

Sample from the model trained with 5 epochs:

Sample from the model trained with 20 epochs: