Midjourney, Stable Diffusion, DALL-E, and others can generate an image, sometimes a beautiful one, from only a text prompt. You may have heard vague descriptions of these algorithms learning to subtract noise to generate an image. In this article, we will walk through a concrete explanation of the diffusion model on which all of these recent models are based.
By the end of this article, you will understand the technical details of exactly how it works. We will start with the intuition behind it and then understand the sampling process, starting with pure noise and progressively refining it to obtain a final nice-looking image.
You will learn how to build a neural network that can predict the noise in an image. You'll add context to the model so that you can control what it generates. And finally, by implementing advanced algorithms, you'll learn how to accelerate the sampling process by a factor of 10.
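To make the sampling idea concrete before we dive in, here is a minimal sketch of the iterative refinement loop: start from pure Gaussian noise and repeatedly subtract the noise the model predicts. The `predict_noise` function below is a hypothetical stand-in for a trained network, and the step sizes are illustrative, not the actual DDPM schedule covered later.

```python
import numpy as np

def predict_noise(x, t):
    # Hypothetical stand-in for a trained noise-prediction network.
    # A real model would be a neural network that takes the noisy
    # image and the timestep t as input.
    return 0.1 * x

def sample(shape=(16, 16), timesteps=50, seed=0):
    """Start from pure Gaussian noise and iteratively remove the
    predicted noise to refine it into a final sample."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # pure noise
    for t in reversed(range(timesteps)):
        eps = predict_noise(x, t)       # model's noise estimate
        x = x - eps                     # subtract predicted noise
        if t > 0:
            # inject a little fresh noise at each step except the
            # last, to keep samples from collapsing (illustrative)
            x = x + 0.01 * rng.standard_normal(shape)
    return x

img = sample()
print(img.shape)  # (16, 16)
```

The real sampling algorithm uses carefully derived coefficients at each step; we will build up to those details in the sections below.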
Table of Contents:
- The Intuition Behind Diffusion Models
- Sampling Technique
- Neural Network
- Diffusion Model Training
- Controlling the Diffusion Model Output
- Speeding Up the Sampling Process
Suppose you have a lot of training data, such as the game character images shown below: this is your training dataset. You want even more of these game characters, ones that are not represented in your training set. Following the diffusion model process, you can train a neural network to generate more of these game characters for you.
But the important question we need to answer first is: how do we make these images useful to the neural network? We want the network to learn the general concept of a game character…