Everything you need to get started with Diffusion Models!
This blog is part of the “Road-maps for Generative AI” series. We will start from the absolute basics of neural networks, guide you through the prerequisites and fundamentals of diffusion models, and work our way up to SOTA diffusion models like GLIDE, DALL-E 2, Stable Diffusion, and Imagen. Lastly, we will cover fine-tuning techniques that take the personalization of diffusion models to the next level. Blogs and YouTube videos from various credible sources have been included at each step.
Before we begin our exciting adventure into the world of diffusion models, let’s make sure we’ve got all the pre-requisites in order. Just like a hiker wouldn’t set out without a map, compass, and water bottle, we need to ensure we have a solid grasp of the fundamentals before diving into the more advanced concepts.
Diffusion models are a class of deep generative models, which means that they can generate completely new images and videos, ones that do not exist in reality, from the data they have been extensively trained on. A more precise explanation is that they make use of a parameterized Markov chain. The basic idea is to train a Markov chain whose states are progressively noisier versions of the data: a fixed forward chain gradually corrupts an image with noise, and the model learns the reverse transitions. New samples are then generated by starting from pure noise and following the learned reverse transitions step by step.
The hype around these generative models is mainly due to their ability to generate realistic images and videos. They are capable of generating images of monuments and structures that do not exist in reality, as well as animations of fictional characters and scenarios. Another feat these models have achieved is the ability to generate deepfakes. Recently, deepfakes have been used in the entertainment industry to create viral videos, such as lip-syncing celebrities or superimposing actors onto different movie scenes. They have also been used in educational settings to create interactive lessons and in healthcare to create personalized medical simulations, allowing doctors and nurses to practice surgeries without putting patients at risk.
Some of the top diffusion models that have contributed to the excitement around this technology are GLIDE, DALL-E 2, Stable Diffusion, Imagen, and many more.
Let’s start with Neural Networks (NNs), the most fundamental building blocks of deep learning. These networks are composed of neurons, each of which performs two tasks: 1) calculating a weighted sum of its inputs and 2) applying an activation function. They are loosely modeled on the neurons in our brains and perform a similar job to some extent: they take inputs, do some calculations, and then provide information as output.
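To make those two tasks concrete, here is a minimal sketch of a single neuron in NumPy. The ReLU activation and the toy input values are illustrative choices, not anything prescribed:

```python
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(weights, inputs) + bias  # task 1: weighted sum of the inputs
    return max(0.0, z)                  # task 2: activation function (ReLU here)

# Toy values, purely illustrative
print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]), 0.2))  # 0.45
```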
Next, we have Convolutional Neural Networks (CNNs). Convolutional Neural Networks (CNNs) are a special type of Neural Networks (NN) that are designed to work with data that has a grid-like structure, such as an image. Instead of treating an image as a single input, CNNs use a technique called convolution (using filters) to break it down into smaller, manageable parts. Think of it like a game of connecting the dots — instead of looking at the entire image at once, CNNs look at small sections of the image and use those sections to build a deeper understanding of the image as a whole.
In addition to convolution, CNNs also use pooling to reduce the dimensionality of the data and fully connected layers to make final predictions. CNNs are particularly well-suited for image classification, object detection, and other computer vision tasks.
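Putting convolution, pooling, and a fully connected layer together, a tiny illustrative CNN in PyTorch might look like this sketch (the layer sizes are arbitrary, assuming 32×32 RGB inputs):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution over small sections
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: halves height and width
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # fully connected head

    def forward(self, x):                 # x: (batch, 3, 32, 32)
        x = self.features(x)              # -> (batch, 16, 16, 16)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # (1, 10) class scores
```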
Now moving towards fancy names, we have VGG (Visual Geometry Group), a type of neural network architecture that changed the field of computer vision. Unlike earlier CNNs, which mixed filters of various fixed sizes, VGG stacks many small 3×3 convolutional filters across multiple convolutional layers; stacking enlarges the effective receptive field layer by layer, so the network detects features of varying sizes. This allowed VGG to capture more contextual information in images, leading to improved accuracy in image classification tasks. VGG networks are also termed “Very Deep Convolutional Networks” since they use many convolutional layers, hence the names VGG-16 (16 weight layers) and VGG-19 (19 weight layers).
Between its blocks of convolutions, VGG uses max pooling layers, which reduce the spatial dimensions of the feature maps produced by the convolutional layers. This keeps the amount of computation manageable and significantly improves efficiency as the network grows deeper.
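A sketch of one VGG-style block, assuming 64 input channels: two stacked 3×3 convolutions together cover a 5×5 region of the input, but with fewer parameters and an extra non-linearity compared to a single 5×5 filter, and max pooling closes the block:

```python
import torch.nn as nn

vgg_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),   # 3x3 conv
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),  # 3x3 conv, stacked
    nn.MaxPool2d(kernel_size=2, stride=2),  # halve the feature-map dimensions
)
```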
The Inception module is regarded as a revolutionary building block in deep learning architecture, designed to overcome the limitations of traditional neural networks. The main challenge faced by traditional CNNs is the trade-off between the number of filters used (width) and the number of convolutional layers (depth). Deeper networks can capture complex features but require a large number of parameters, while wider networks have fewer parameters but struggle to capture nuanced patterns. The Inception module resolves this dilemma by introducing multiple parallel branches with differing filter sizes, thereby combining the strengths of both approaches. This modular design allows the network to process input images at multiple scales simultaneously within a single layer.
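Here is a simplified sketch of that parallel-branch idea (the real Inception module also uses 1×1 convolutions to shrink channels before the larger filters, omitted here for brevity; the branch widths are illustrative):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel branches with different filter sizes, concatenated along channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, kernel_size=1)             # 1x1 branch
        self.b2 = nn.Conv2d(in_ch, 32, kernel_size=3, padding=1)  # 3x3 branch
        self.b3 = nn.Conv2d(in_ch, 32, kernel_size=5, padding=2)  # 5x5 branch
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, kernel_size=1))  # pooling branch

    def forward(self, x):  # every branch preserves height/width, so we can concatenate
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```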
The phrase “Attention is all you need” captures the core idea behind the Transformer model in deep learning. Imagine the Transformer as a highly versatile and adaptive AI. Unlike traditional models that process input in a fixed sequential order, the Transformer has the remarkable ability to focus its attention on different parts of the input data, whether it’s processing text or images or performing translation tasks, thanks to its multi-headed self-attention layers. Now, you may be wondering: what’s so special about multi-headed self-attention layers? Well, my friend, these layers take attention to the next level by allowing the network to jointly attend to information from different positions: each head evaluates scaled dot-product attention between Query, Key, and Value vectors in parallel, and the concatenated outputs are passed through a linear layer.
Imagine you’re trying to solve a puzzle, but instead of just looking at one piece at a time, you can suddenly see how all the pieces fit together simultaneously. That’s kind of like what multi-headed self-attention does — it gives the network a way to consider multiple perspectives at once, leading to richer understanding and decision-making.
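A minimal sketch of the scaled dot-product attention at the heart of each head, assuming the inputs have already been projected into Query, Key, and Value tensors:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_head)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # scaled dot product
    return F.softmax(scores, dim=-1) @ v                    # weighted sum of values

x = torch.randn(1, 8, 10, 64)                # 8 heads attending in parallel
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q, K, V from the same input
```

In a full Transformer layer, the outputs of all heads are concatenated and passed through the linear layer mentioned above.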
Another attention mechanism that you should know about is cross-attention. Cross-attention is a type of attention mechanism used in deep learning models that enables the model to attend to different modalities of data simultaneously. It is a variant of self-attention that is specifically designed to handle multi-modal data, where each modality may have its own unique characteristics. It has been used in different generative models such as Imagen, Stable Diffusion, and Muse.
In traditional self-attention, the model attends to different parts of the input data, computing attention scores based on the similarity between each part and every other part. However, in multi-modal data, each modality may have its own distinct features, making it challenging for the model to capture the relationships using traditional self-attention.
The key difference between self-attention and cross-attention lies in the way the query (Q), key (K), and value (V) matrices are constructed. In self-attention, all three matrices are derived from the same input data, whereas in cross-attention, the Q matrix is derived from one modality (e.g., visual features), while the K and V matrices are derived from another modality (e.g., textual features).
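A sketch of that difference, using hypothetical image latents as one modality and text embeddings as the other (all shapes and projections here are illustrative):

```python
import torch
import torch.nn as nn

d = 64
to_q, to_k, to_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

image_latents = torch.randn(1, 256, d)  # modality 1: queries come from here
text_embeds = torch.randn(1, 77, d)     # modality 2: keys and values come from here

q, k, v = to_q(image_latents), to_k(text_embeds), to_v(text_embeds)
attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
out = attn @ v  # image features re-weighted by their relevance to the text
```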
Vision Transformer (ViT) is a type of neural network architecture that was specifically designed for image classification tasks. The key innovation of ViT is the use of a transformer architecture to process visual data. Unlike traditional CNNs, which use convolutional layers to extract features from images, ViT uses a self-attention mechanism to process the image as a whole. This allows the model to capture long-range dependencies between different parts of the image, which can be useful for tasks such as object detection and image segmentation. ViT divides an image into fixed-size patches and then linearly embeds these patches into a sequence of vectors. This sequence is fed into a transformer encoder, which processes it using self-attention to learn the dependencies between the different patches. The output of the transformer encoder is then passed through an MLP head to produce the final image classification.
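A sketch of the patch-embedding step, using the common trick of a strided convolution to split and embed the patches in one operation (the 16×16 patch size and 768-dimensional embedding are the ViT-Base settings):

```python
import torch
import torch.nn as nn

patch, d = 16, 768
img = torch.randn(1, 3, 224, 224)
embed = nn.Conv2d(3, d, kernel_size=patch, stride=patch)  # one filter step per patch
tokens = embed(img).flatten(2).transpose(1, 2)  # (1, 196, 768): 14x14 patch tokens
# tokens (plus a class token and position embeddings) then go into the encoder
```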
CLIP (Contrastive Language-Image Pre-training) is a revolutionary model that has achieved state-of-the-art results in zero-shot Computer Vision (CV) classification tasks. One of the applications of CLIP in generative CV is its use as a guidance mechanism, steering a diffusion model toward images that match a text prompt. Given an image, the model can predict the most accurate description for that image. CLIP is based on the principle of contrastive learning: during training, the model minimizes a contrastive loss. Now the question arises: how does CLIP evaluate the similarities between labels and images? Simply put, it uses cosine similarity between image embeddings and text embeddings. In the resulting image-text similarity matrix, CLIP tries to maximize the similarity scores along the diagonal (the matching image-caption pairs) while pushing down everything else. CLIP is used in the GLIDE and DALL-E 2 models, so it is necessary to study how it works.
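A minimal sketch of that contrastive objective, assuming a batch of four image-caption pairs whose embeddings are random placeholders here (real CLIP also learns a temperature that scales the logits, omitted for brevity):

```python
import torch
import torch.nn.functional as F

img_emb = torch.randn(4, 512)  # placeholder image embeddings
txt_emb = torch.randn(4, 512)  # placeholder text embeddings for their captions

img_emb = F.normalize(img_emb, dim=-1)
txt_emb = F.normalize(txt_emb, dim=-1)
logits = img_emb @ txt_emb.t()  # (4, 4) cosine similarities for every image-text pair

# The contrastive loss pulls the diagonal (matching pairs) up, everything else down.
labels = torch.arange(4)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```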
FID (Frechet Inception Distance) and IS (Inception Score) are both evaluation metrics used to assess the performance of generative models like GANs and Diffusion models.
The FID measures the distance between real and generated images in the feature space of a pre-trained Inception Network. Specifically, it fits a multivariate Gaussian to the features of each set and computes the Fréchet distance (2-Wasserstein distance) between the two Gaussians. The FID is generally considered more robust than the Inception Score, as it takes the statistics of real data in the network’s feature space into account. The FID is non-negative and unbounded, and a lower value indicates a better match between the generated and real images; a perfect score of 0 means the two feature distributions are identical.
One advantage of the FID is that it can be easily computed and is relatively fast to evaluate, especially when compared to other metrics such as the Structural Similarity Index Measure (SSIM).
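A sketch of the computation, assuming you already have Inception features for the real and generated image sets as NumPy arrays:

```python
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    # Fit a Gaussian (mean, covariance) to each set of Inception features
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    # Fréchet distance between the two Gaussians
    covmean = linalg.sqrtm(sigma_r @ sigma_g).real  # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))
```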
The general idea behind the Inception Score (IS) is to evaluate generated images using a pre-trained deep neural network called the Inception Network, a CNN architecture designed to extract high-level features and class predictions from images. Each generated image is passed through the network to obtain a predicted label distribution; the score then combines how confident those per-image predictions are with how diverse the predictions are across the whole set of generated images.
The Inception Score (IS) measures two things simultaneously (a sketch of the computation follows this list):
- The images have variety (e.g. each image is a different breed of dog)
- Each image distinctly looks like something (e.g. one image is clearly a Poodle, the next a great example of a French Bulldog)
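A sketch of the score itself, assuming `probs` holds the Inception softmax outputs for the generated images; in practice the score is usually averaged over several splits of the data:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, num_classes) softmax outputs for N generated images
    p_y = probs.mean(axis=0)  # marginal label distribution (should be diverse)
    # per-image KL(p(y|x) || p(y)): high when each prediction is confident
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```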
Some of the blogs and YouTube videos that can help you learn the working and architectures of these concepts are:
- Jay Alammar’s The Illustrated Transformer
- Ketan Doshi’s Transformers Explained Visually Part 1 & Part 2
- Vaclav Kosar’s Cross-Attention in Transformer’s Architecture
- Alfredo Canziani’s YouTube video
- ViT: Vision transformer
- The AI-Epiphany’s ViT
- A beginner’s guide to CLIP model
- OpenAI CLIP: Connecting Text and Images (Paper Explanation)
- A simple explanation of the Inception Score
- What is Fréchet Inception Distance
✅ Prerequisites for learning diffusion models
You’re now ready to dive into the world 🌍 of Diffusion Probabilistic models.
To cover the foundational workings of diffusion models, there are four main papers. The first paper, which introduced the concept of noising and denoising images, is “Denoising Diffusion Probabilistic Models” (DDPM). This paper 📄 explains how adding noise to an image (forward process ⏩) and removing noise from an image (reverse process ⏪) step by step can yield an incredible image-generating model. Don’t be scared by the maths 🧮 involved in diffusion models. Just hang in there.
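As a sketch, here is the closed-form forward (noising) process, assuming the linear noise schedule used in the DDPM paper:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear schedule from the paper
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative products of (1 - beta)

def q_sample(x0, t, noise):
    """Forward process: sample x_t ~ q(x_t | x_0) in one shot."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise

x0 = torch.randn(2, 3, 32, 32)  # stand-in for a batch of images
xt = q_sample(x0, torch.tensor([10, 500]), torch.randn_like(x0))
```

The reverse process is the learned part: a network is trained to predict the added noise so the chain can be run backwards.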
The next paper that should be covered is “Improved Denoising Diffusion Probabilistic Models”. This paper improves DDPM’s poor log-likelihood by using a cosine schedule during the forward process and by learning the variance of the predicted normal distribution instead of keeping it fixed. The DDPM model was able to generate high-quality images but didn’t fit the distribution of the real image data very well; Improved DDPM solved this problem.
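A sketch of the cosine schedule from that paper, which keeps the cumulative signal level from decaying too quickly at the start of the forward process:

```python
import math
import torch

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule: alpha_bar(t) = f(t) / f(0), f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2)."""
    t = torch.linspace(0, T, T + 1)
    f = torch.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
    return f / f[0]

alpha_bar = cosine_alpha_bar(1000)
betas = (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)  # per-step betas
```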
The third paper to cover foundations is “Denoising Diffusion Implicit Models”. One problem with the DDPM process is the speed of generating an image after training. DDPM does produce awesome images, but it takes 1,000 iterations to generate a single image. Passing an image through the model 1,000 times takes considerable time. To speed this process up, DDIM redefines the diffusion process as a non-Markovian process. However, there is a bit of image quality trade-off in this case.
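A sketch of one deterministic DDIM update (the η = 0 case), assuming you have the model’s noise prediction and the cumulative ᾱ values for the current and previous timesteps, which may be many DDPM steps apart:

```python
import torch

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """Jump from x_t to an earlier timestep deterministically (eta = 0)."""
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()  # predicted x_0
    return ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps_pred
```

Because each update is deterministic and consistent across step sizes, around 50 such steps can replace DDPM’s 1,000, at some cost in image quality.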
Another important paper is “Diffusion Models Beat GANs on Image Synthesis”, which introduces the concept of classifier guidance. A pre-trained classification model is used for conditional generation (guiding the diffusion model to generate images of a desired class): we feed the diffusion model the gradient of the log-probability of the desired class given the noisy image. A guidance scale is used to control the strength of the conditioning; increasing it improves sample quality at the cost of diversity.
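A sketch of the guided noise prediction, assuming a hypothetical classifier that accepts the noisy image (in the paper the classifier is also conditioned on the timestep, omitted here):

```python
import torch

def guided_eps(eps, x_t, y, classifier, alpha_bar_t, scale=1.0):
    """Shift the predicted noise with the gradient of log p(y | x_t)."""
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x), dim=-1)
        selected = log_probs[torch.arange(len(y)), y].sum()  # log-prob of target class
        grad = torch.autograd.grad(selected, x)[0]
    return eps - scale * (1.0 - alpha_bar_t).sqrt() * grad  # larger scale = stronger guidance
```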
Some of the blogs and YouTube videos that have helped me learn the foundations are:
- Lilian Weng’s “What are Diffusion Models?”
- Outlier’s video on Diffusion models
- HuggingFace 🤗 blog: The Annotated Diffusion Model
✅ Foundation of Diffusion models
All right! Now you’re ready to get your hands on the architectures of SOTA diffusion models!
Some of the major Diffusion models that must be covered are (sorted chronologically ⏱️):
1. DALL-E: It is a text-to-image model from OpenAI (2021). DALL-E uses a transformer to model text and image tokens as a single stream of data. It uses a 2-stage training process: (1) learning the visual codebook and (2) learning the prior distribution. It uses a discrete variational autoencoder (dVAE) for the first stage and a transformer for the second stage. During training, the transformer takes the caption of the image as input and learns to output codebook vectors in an autoregressive fashion. These codebook vectors are then used by the dVAE decoder to generate a new image.
2. GLIDE: In 2022, OpenAI released a new text-to-image model, GLIDE. It explores the use of CLIP guidance (classifier guidance) and classifier-free guidance on generated images. Classifier-free guidance is preferred by human evaluators for both photorealism and caption similarity (a minimal sketch of it appears after this list). Increasing the guidance scale leads to a trade-off between the diversity and fidelity of generated images. GLIDE also supports inpainting to generate complex images. This paper will help in learning the workings of classifier-free guidance.
3. unCLIP/DALL-E 2: To learn robust representations of images that capture both semantics and style, DALL-E 2 makes more effective use of the CLIP contrastive learning model. What this paper proposes is a way to combine CLIP’s text embeddings with GLIDE’s decoder by learning a prior model that converts text embeddings into image embeddings.
4. LDM/Stable Diffusion: Standard diffusion models calculate probabilities for fine-grained details that may not be perceptible to humans. They operate in pixel space, which is very high-dimensional. Training a standard diffusion model can take 150–1000 V100-days, which is a lot of computation. Instead, we need a way to reduce dimensionality while preserving as much information as possible. What this paper proposes is to run the diffusion process in a low-dimensional latent space, which significantly improves both the training and sampling efficiency of denoising diffusion models. These models are referred to as Latent Diffusion Models (LDMs). LDMs work well for unconditional image synthesis, super-resolution, conditional image synthesis, and inpainting.
5. Imagen: Previous generative models like DALL-E 2 and GLIDE have a few issues: DALL-E 2 must learn a latent prior, which is complex, and GLIDE generates images with insufficient fidelity. The main contributions of this paper are a simple and efficient architecture, Efficient U-Net, and dynamic thresholding for photorealistic images. It outperforms all other models on DrawBench and set a new SOTA on the COCO dataset.
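As referenced in the GLIDE entry above, classifier-free guidance needs no separate classifier: the model produces both a conditional and an unconditional noise prediction, and the final prediction extrapolates between them. A minimal sketch:

```python
import torch

def cfg_eps(eps_cond, eps_uncond, guidance_scale):
    """Push the prediction away from the unconditional output, toward the prompt."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# guidance_scale = 1.0 recovers the plain conditional prediction;
# larger values trade diversity for fidelity to the prompt.
```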
Some of the blogs and YouTube videos that can help you learn the workings and architectures of these models are:
✅ SOTA Diffusion models
All right! Now you’re ready to learn about fine-tuning these large diffusion models!
While pre-trained models provide a valuable foundation for various tasks, they often require further adaptation to perform optimally in specific domains or scenarios. This is where fine-tuning comes into play. Fine-tuning tailors the model’s pre-learned knowledge to suit the nuances and intricacies of a particular task or dataset. Some of the important fine-tuning techniques are:
1. Textual Inversion: It is a technique that allows us to add new styles or objects to text-to-image models without modifying the underlying model. It involves defining a new keyword representing the desired concept and finding the corresponding embedding vector within the language model. Textual inversion (embedding) files are typically 10–100 KB in size and use the *.pt or *.safetensors file extension. There is a gradient-free textual inversion variant as well.
2. Low-Rank Adaptation (LoRA) with Diffusion models: This technique was first presented for efficient fine-tuning of LLMs and was later applied to diffusion models. To reduce the number of trainable parameters, LoRA learns a pair of rank-decomposition matrices while freezing the original weights. In effect, LoRA fine-tunes a low-rank “residual” of the model instead of the entire model (a minimal sketch appears after this list).
3. DreamBooth: It is an efficient few-shot fine-tuning technique that preserves semantic class knowledge. The tuning process requires only 3–5 example images. It tackles the problem of subject-driven generation, i.e., generating new images of a particular subject with high fidelity in new contexts, given a few input images. To deal with overfitting (the model generating the exact same image with only a few changes) and language drift (the model losing the prior meaning of the specific word used as the prompt during fine-tuning), DreamBooth uses an autogenous class-specific prior-preservation loss. It uses a rare token identifier to reference the fine-tuned subject.
4. ControlNet: It is a technique to add conditional information like edges, depth, segmentation, and human pose to text-to-image models. ControlNet uses zero-convolutions and trainable copies of the encoder blocks to add conditional information. The benefit of using ControlNet is that the production-ready model trained on billions of images remains unchanged, while the trainable copy reuses that large-scale pre-trained model as a deep, robust backbone for handling diverse input conditions.
5. HyperDreamBooth: Two major drawbacks of DreamBooth are the large number of parameters that have to be fine-tuned (the weights of the UNet model and the text encoder) and the long training time, with many iterations required (around 1,000 iterations for Stable Diffusion). This paper tackles the problems of speed and size with Lightweight DreamBooth (LiDB), a new HyperNetwork architecture, and rank-relaxed fine-tuning. HyperDreamBooth is 25x faster than DreamBooth and 125x faster than Textual Inversion, while preserving sample quality, style diversity, and subject fidelity. It achieves personalization in roughly 20 seconds, using as few as one reference image.
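As referenced in the LoRA entry above, here is a minimal sketch of wrapping a single linear layer with a low-rank adapter (the rank and scaling values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank residual (B @ A)."""
    def __init__(self, base: nn.Linear, rank=4, alpha=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Because B is initialized to zero, the adapter initially leaves the model’s behaviour unchanged, and only the tiny A and B matrices are trained.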
Some of the blogs and YouTube videos that explain these techniques are:
- OpenCV’s blog on ControlNet
- Koiboi on LoRA vs DreamBooth vs Textual inversion vs Hypernetwork
- Gabriel Mongaras on HyperDreamBooth, LoRA
✅ Fine-tuning Diffusion models
Now, it’s time to get your hands dirty with code implementations and play with diffusion models. For that, I suggest you take a look at Stable Diffusion UI and Hugging Face’s Diffusers library. Set up the models on your local system, fine-tune them with the above-mentioned techniques, and generate awesome images 😊.