Stable Diffusion: A State-of-the-Art Text-to-Image Model

Stable Diffusion

Stable Diffusion is a text-to-image machine learning model, similar to DALL-E, released by stability.ai (live example here). Stability.ai has not only open-sourced the code for academic purposes but has also released the model weights, so anyone can build customized applications. The model and its variants generate images from a text prompt and/or an input image. Unlike other text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services, the publicly released Stable Diffusion code and weights can run on most consumer-grade hardware equipped with a modest GPU.

In the past, AI models such as Generative Adversarial Networks (GANs) were the standard way to generate images from scratch. A GAN learns to generate new data with the same statistics as its training set through an adversarial game between two networks, a generator and a discriminator (learn more). However, training GANs is difficult precisely because the two networks must be kept in balance: a very well-trained generator degrades the discriminator, and vice versa, since the fundamental objectives of the two networks are opposite in nature (a minimal training-step sketch follows the list below). As mentioned in this article, other common and related issues with GANs are as follows:

  1. Failure to Converge
  2. Mode Collapse
  3. Vanishing Gradients
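
To make these opposed objectives concrete, below is a minimal PyTorch sketch of a single GAN training step. The tiny fully-connected G and D networks, their sizes, and the learning rates are illustrative assumptions, not any published model:

```python
import torch
import torch.nn as nn

# Hypothetical toy networks: G maps 64-d noise to a flat 28x28 "image",
# D maps an image to a real/fake probability.
G = nn.Sequential(nn.Linear(64, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_train_step(real_images):  # real_images: (batch, 784)
    batch = real_images.size(0)
    noise = torch.randn(batch, 64)

    # Discriminator step: push real images toward 1, generated ones toward 0.
    opt_d.zero_grad()
    d_loss = (bce(D(real_images), torch.ones(batch, 1)) +
              bce(D(G(noise).detach()), torch.zeros(batch, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator step: the exact opposite goal -- make D label fakes as real.
    opt_g.zero_grad()
    g_loss = bce(D(G(noise)), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```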

How Stable Diffusion Works

Stable Diffusion produces high-quality images using a denoising approach closely related to super-resolution. A deep learning model is trained to remove noise from an input image; applied repeatedly, this denoising process turns random noise into a detailed, high-resolution output image.
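
To give a feel for what training a model to denoise means, here is a toy, hypothetical PyTorch sketch of the core idea: corrupt a clean image with Gaussian noise and train the network to predict that noise so it can be subtracted back out. The single convolution is a stand-in for the much larger U-Net used in practice:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real denoising U-Net (assumption: any image-to-image net).
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

def denoise_training_step(clean_images):      # (batch, 3, H, W)
    noise = torch.randn_like(clean_images)    # sample Gaussian noise
    noisy = clean_images + noise              # corrupt the clean images
    pred_noise = model(noisy)                 # network predicts the added noise
    loss = mse(pred_noise, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```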

 

Fig 1: Overview of the Stable Diffusion model (figure source)

As shown in Figure 1, the Stable Diffusion model comprises the following components.

  1. Text Encoder: Converts the user's text prompt into vector form. The text encoder is responsible for text processing, transforming the prompt into an embedding space. Similar to Google's Imagen, Stable Diffusion uses a frozen CLIP ViT-L/14 text encoder.[1]
  2. Random Noise Generator: Produces the random (Gaussian) noise that seeds generation. This noise is the starting point that the diffusion model progressively denoises under the guidance of the text embedding, which is why different seeds produce different images for the same prompt.
  3. Diffusion Model: Performs the denoising of the N×N latent image matrix in a loop (loop count set to 50).
  4. Decoder: Converts the final denoised latent representation back into the full-resolution output image; its role is loosely analogous to the generator part of a GAN.[2]
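
In practice, all four components ship together in libraries such as Hugging Face diffusers. The following is a minimal sketch, assuming the diffusers package, the CompVis/stable-diffusion-v1-4 checkpoint, and a CUDA GPU:

```python
import torch
from diffusers import StableDiffusionPipeline

# Loads the text encoder, U-Net (diffusion model), scheduler, and decoder together.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# 50 denoising steps, matching the loop count mentioned above.
image = pipe(
    "a portrait in the style of van gogh",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("portrait.png")
```
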
Our Experiment

We performed image-to-image Stable Diffusion on images of our lab members for style transfer, using this Hugging Face repository. The text prompt used was "van gogh". The following parameter values were used (a sketch of the equivalent diffusers call appears after the list).

  1. Guidance Scale – 9.4
  2. Number of Iterations – 25
  3. Seed – 1006602893
  4. Strength – 0.15
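
For reproducibility, the run above can be approximated with the diffusers image-to-image pipeline. This is a sketch under assumptions: the input filename is hypothetical, and the image= argument follows recent diffusers versions (older releases named it init_image=):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Hypothetical input photo, resized to the model's native 512x512 resolution.
init_image = Image.open("lab_member.jpg").convert("RGB").resize((512, 512))

# Fixing the seed makes the run reproducible.
generator = torch.Generator("cuda").manual_seed(1006602893)

result = pipe(
    prompt="van gogh",
    image=init_image,
    strength=0.15,             # low strength: stay close to the input photo
    guidance_scale=9.4,
    num_inference_steps=25,
    generator=generator,
)
result.images[0].save("lab_member_van_gogh.png")
```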

 

Fig 2: Input and output images

 

References / More Resources on Stable Diffusion

Blogs on Stable Diffusion

  1. Stable Diffusion: From Description to Visualization
  2. Stable Diffusion. The most impressive neural network. How to use this new AI tool?
Stability.ai: Stable Diffusion

Videos on Stable Diffusion

  1. How AI Image Generators Work (Stable Diffusion / Dall-E)
  2. Stable Diffusion – What, Why, How?

Code on Stable Diffusion

  1. CompVis: Stable Diffusion
  2. Google Colab File 
  3. Stable Diffusion with diffusers
  4. High-performance image generation using Stable Diffusion in KerasCV

Stable Diffusion Models on Hugging Face

Stability.AI: Stable Diffusion

References / Research Papers on Stable Diffusion

  1. High-Resolution Image Synthesis with Latent Diffusion Models
  2. Stable Diffusion (Wikipedia)
  3. The Generator | Machine Learning | Google Developers

Tags: Stable Diffusion, Turing Test, Generative Adversarial Network, Dall-E