Stable Diffusion: A state-of-the-art text-to-image Model


Stable Diffusion is a text-to-image machine learning model, similar to DALL-E, released by Stability AI (live example here). Stability AI has not only open-sourced the code for academic purposes but has also released the model weights, so that others can build customized applications. The model and its variants generate images from a text prompt and/or an input image. Unlike other text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services, the publicly released Stable Diffusion code and model weights can run on most consumer-grade hardware equipped with a modest GPU.

Generative Adversarial Networks (GANs) have been the standard way of generating images from scratch in the past. A GAN learns to generate new data with the same statistics as the training set through adversarial training, in which a generator and a discriminator compete against each other (learn more). However, training a GAN becomes complicated because both networks have to be trained well at the same time while their objectives are opposed: as the generator improves, the discriminator gets worse, since it can no longer easily tell real images from generated ones. As mentioned in this article, other common and related issues with GANs are as follows:

  1. Failure to Converge
  2. Mode Collapse
  3. Vanishing Gradients
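
The opposing goals of the two networks can be made concrete with the standard GAN minimax objective, given here only as background. The symbols are the usual ones and do not come from this article: G is the generator, D the discriminator, x a real sample and z a random noise vector.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to maximize this value while the generator tries to minimize it, which is why improving one network makes the task harder for the other.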


Working of Stable Diffusion

Stable Diffusion produces high-quality images using a latent diffusion model. A deep learning model is trained to remove noise from images. At generation time, it starts from random noise and denoises it step by step in a compressed latent space, and the result is then decoded into a full-resolution image.
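
To illustrate what training a model to denoise involves, here is a minimal NumPy sketch of the forward noising step used in diffusion models: a clean image is mixed with Gaussian noise, and the model is later trained to predict that noise. The function names and the linear schedule values (taken from the original DDPM setup) are assumptions for illustration and are not necessarily the settings used by Stable Diffusion.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed schedule: noise variance added at each step grows linearly.
    return np.linspace(beta_start, beta_end, num_steps)

def add_noise(x0, t, alphas_cumprod, rng):
    # x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise, where a_t is the
    # cumulative product of (1 - beta) up to step t.
    noise = rng.standard_normal(x0.shape)
    a_t = alphas_cumprod[t]
    x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * noise
    return x_t, noise

betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy stand-in for an image
x_early, _ = add_noise(x0, 10, alphas_cumprod, rng)    # still close to x0
x_late, _ = add_noise(x0, 999, alphas_cumprod, rng)    # almost pure noise
```

During training, the network sees `x_t` and the step `t` and learns to predict `noise`. Generation runs the process in reverse, starting from pure noise.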


Fig 1: Overview of Stable Diffusion Model (fig source link)

As shown in Figure 1, the Stable Diffusion model comprises the following components.

  1. Text Encoder: Takes the user's prompt and converts it to vector form.
    The text encoder is responsible for text processing, transforming the prompt into an embedding space. Like Google's Imagen, Stable Diffusion uses a frozen text encoder; in Stable Diffusion this is the CLIP ViT-L/14 text encoder.[1]
  2. Random Noise Generator: Samples the random noise from which generation starts, determined by the seed. Unlike the generator of a GAN, which learns to create fake images by incorporating feedback from a discriminator [2], this component has no learned parameters and no discriminator is involved.
  3. Diffusion Model: Performs the denoising of the N*N latent matrix in a loop, conditioned on the text embedding (loop count set to 50).
  4. Decoder: Converts the denoised latent into the final image. It plays a role similar to the generator of a GAN in that it outputs the image, but it is the decoder half of an autoencoder rather than an adversarially trained generator.
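
The way these four components fit together can be sketched with toy stand-ins. Everything below is hypothetical and only mirrors the data flow: `encode_text`, `sample_latent`, `denoise_step` and `decode` are placeholder functions written for this illustration, not the real CLIP encoder, U-Net or decoder.

```python
import numpy as np

def encode_text(prompt, dim=8):
    # Toy stand-in for the frozen text encoder: maps a prompt to a
    # deterministic pseudo-embedding (the real model uses CLIP).
    rng = np.random.default_rng(sum(prompt.encode("utf-8")))
    return rng.standard_normal(dim)

def sample_latent(seed, size=4):
    # Random noise generator: Gaussian noise fixed by the seed.
    return np.random.default_rng(seed).standard_normal((size, size))

def denoise_step(latent, text_embedding, rate=0.1):
    # Toy stand-in for the diffusion model: treats the offset from a
    # text-dependent target as "noise" and removes a fraction of it.
    target = np.full(latent.shape, text_embedding.mean())
    predicted_noise = latent - target
    return latent - rate * predicted_noise

def decode(latent, scale=8):
    # Toy stand-in for the decoder: upsamples the latent to image size.
    return np.kron(latent, np.ones((scale, scale)))

def generate(prompt, seed, num_steps=50):
    embedding = encode_text(prompt)
    latent = sample_latent(seed)
    for _ in range(num_steps):          # the denoising loop
        latent = denoise_step(latent, embedding)
    return decode(latent)

image = generate("van gogh", seed=1006602893)
print(image.shape)  # (32, 32)
```

The point of the sketch is the ordering: the prompt is encoded once, the noise is sampled once from the seed, the denoiser runs repeatedly on the small latent, and the decoder runs once at the end.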


Our Experiment

We performed image-to-image Stable Diffusion on images of our lab members for style transfer using this Hugging Face repository. The text prompt used was "van gogh". The following parameter values were used.

  1. Guidance Scale – 9.4
  2. Number of Iterations – 25
  3. Seed – 1006602893
  4. Strength – 0.15
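
To give a feel for what these values mean, here is a small sketch. It assumes the repository follows the conventions commonly used by image-to-image diffusion pipelines (for example in the diffusers library): strength scales how many of the requested iterations are actually run on the input image, and the guidance scale extrapolates from the unconditional noise prediction towards the text-conditioned one. Both helper functions are hypothetical names written for this illustration.

```python
import numpy as np

def effective_denoising_steps(num_iterations, strength):
    # Assumed convention: only a `strength` fraction of the schedule is
    # applied, so a low strength keeps the output close to the input.
    return min(int(num_iterations * strength), num_iterations)

def apply_guidance(noise_uncond, noise_text, guidance_scale):
    # Classifier-free guidance: move from the unconditional prediction
    # towards (and beyond) the text-conditioned prediction.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

steps = effective_denoising_steps(num_iterations=25, strength=0.15)
print(steps)  # 3
```

Under this assumption, a strength of 0.15 with 25 iterations runs only about 3 denoising steps, which would explain why the outputs stay close to the input photographs while picking up the requested style.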


Fig 2: Input and output images


References/More Resources on Stable Diffusion

Blogs on Stable Diffusion

  1. Stable Diffusion: From Description to Visualization
  2. Stable Diffusion. The most impressive neural network. How to use this new AI tool?
  3. : Stable Diffusion

Videos on Stable Diffusion

  1. How AI Image Generators Work (Stable Diffusion / Dall-E)
  2. Stable Diffusion – What, Why, How?

Code on Stable Diffusion

  1. CompVis: Stable Diffusion
  2. Google Colab File 
  3. Stable Diffusion with diffusers
  4. High-performance image generation using Stable Diffusion in KerasCV

Stable Diffusion Models on Hugging Face

  1. Stability.AI : Stable Diffusion

References/ Research Papers on Stable Diffusion

  1. High-Resolution Image Synthesis with Latent Diffusion Models
  2. Stable Diffusion (Wikipedia)
  3. The Generator | Machine Learning | Google Developers
