In this article, I'll tell you:
– How I got interested in GANs
– What a GAN is: the core concept
– The power of GANs
– How I implemented a DCGAN to make faces
– My thoughts after doing this project
I find generative AI to be the most important form AI has taken so far. Creative work has always been done by humans. This is changing with AI, but you already know that…
The level of actual creativity in AI systems is still growing. Innovation is creative by nature, and if AI can innovate, then we will be in for quite the ride.
In a Slack channel, I saw a post asking where game theory could be used in AI development. Looking online out of curiosity, I quickly found generative adversarial networks (GANs).
Yann LeCun, Meta’s VP and Chief AI Scientist, said this about GANs:
“the most exciting idea in machine learning in the last ten years.”
I saw many examples of GANs’ impressive feats of image generation, upscaling, style transfer, text-to-image, and more. I went on a little YT binge to understand the concept behind them.
I immediately knew I wanted to start experimenting with GANs after seeing their results. With the help of some fantastic resources online (I’ll put them at the bottom), I built and trained a GAN to generate low-quality faces from the popular celebA dataset.
In 2014, Ian Goodfellow and his colleagues proposed the idea of GANs in a paper. Since then, different, more powerful types of GANs have been developed. You might have seen something from a GAN without realizing it. Let's see some examples.
Researchers at Nvidia proposed StyleGAN in 2018. This GAN could "combine" the styles or features of one image with those of another. More powerful StyleGAN versions have since been developed. Here, you can see StyleGAN combining features of faces.
Generative diffusion models like DALL·E require more computational power to generate images, which limits the volume and quality of images or data they can produce. GANs can be used in combination with other model types for increased performance.
GigaGAN can upscale any image, like those from diffusion-based models. Here are images upscaled through different GigaGAN models, with some other models for comparison. The time it took researchers to compute each one is shown on each image.
It's good to recognize that GANs are no longer the standard for image generation, nor are they limited to just generating images. Different types of models are better at certain things. GigaGAN is an example of how different models will be used together to generate higher-quality, more creative work faster.
They can train highly effectively with unlabeled data and efficiently produce high-dimensional data. They are less computationally intensive, more resistant to overfitting, and capable of capturing hierarchical features of data. More specific strengths depend on the GAN.
If you don't already know the rough idea behind GANs, I hope you're excited to learn it now. Let's go through this diagram with an analogy.
In a GAN, there is a discriminator and a generator model. Imagine them as a detective who examines cash to see if it's fake and a fraudster who makes counterfeit cash.
Imagine this fraudster trying to take raw materials and make fake cash. In the actual case of the generator, the raw material is the input z, taken from latent space. The fake cash is the sample generated from z.
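To make that concrete, here is one illustrative way to sample z (my own sketch, not code from my script; the latent dimension of 100 is a common convention, not a requirement):

```python
import torch

# z is a batch of vectors drawn from a standard normal distribution
# over latent space; the trailing 1x1 dims suit a convolutional generator.
z = torch.randn(16, 100, 1, 1)  # 16 latent vectors
```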
The detective is given a mix of fake and real money and tries to guess which is which. In the actual case of the discriminator, it reduces each input sample through deep neural nets to a value between 0 and 1.
Greater than 0.5 means it 'thinks' it is real money, and less means it 'thinks' it's fake. The proximity to 1 or 0 represents the discriminator's confidence in its guess.
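As a toy illustration of that final step (my own sketch, not from my project): a sigmoid squashes the discriminator's raw score into (0, 1), and 0.5 is the real/fake decision boundary.

```python
import math

def sigmoid(score):
    # squashes any real-valued score into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

def verdict(score):
    # above 0.5 -> the 'detective' thinks the sample is real
    return "real" if sigmoid(score) > 0.5 else "fake"
```

The further the probability sits from 0.5, the more confident the guess.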
The training is done by calculating a loss for the discriminator and the generator. The loss is proportional to how correct the discriminator is. Typically, through backpropagation, this loss is used to tune the weights and biases of the generator and discriminator models.
This is simply the process of the detective and the fraudster altering their 'methods' to perform better against each other. This process is repeated and is essentially a game. The whole concept can be written as a function.
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
The discriminator D wants to maximize the value of this expression, while the generator G wants to minimize its value. It’s a competition to continually improve.
Note that there are different functions that effectively calculate loss; this is just the original one, from the 2014 proposal of GANs. It captures the whole concept in one line. Everything I have gone over made me want to build my own GAN.
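You can sanity-check the value function numerically. This is a purely illustrative toy (my own, single-sample version), where d_real stands for D(x) and d_fake for D(G(z)):

```python
import math

def value(d_real, d_fake):
    # single-sample version of V(D, G): log D(x) + log(1 - D(G(z)))
    return math.log(d_real) + math.log(1.0 - d_fake)

# a sharp detective (real scored high, fake scored low) pushes V up;
# a fooled detective (both near 0.5) leaves V lower
confident = value(0.9, 0.1)
fooled = value(0.5, 0.5)
```

The discriminator's moves raise V; the generator's moves lower it, which is exactly the min-max competition.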
I started experimenting with generating handwritten numbers in a basic GAN that uses multi-layer perceptrons. I had some success, but I decided to implement a DCGAN because of its greater performance with images. A DCGAN uses convolutional layers in the generator and discriminator to achieve this enhanced performance.
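To give a feel for those convolutional layers, here is a hypothetical, simplified DCGAN-style generator. The layer sizes are illustrative, not the exact ones from the paper or from my script: it upsamples a latent vector into a 64×64 RGB image with transposed convolutions.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            # each ConvTranspose2d doubles spatial size (after the first): 1 -> 4 -> 8 -> 16 -> 32 -> 64
            nn.ConvTranspose2d(z_dim, feat * 8, 4, 1, 0), nn.BatchNorm2d(feat * 8), nn.ReLU(True),
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1), nn.BatchNorm2d(feat * 4), nn.ReLU(True),
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1), nn.BatchNorm2d(feat * 2), nn.ReLU(True),
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1), nn.BatchNorm2d(feat), nn.ReLU(True),
            nn.ConvTranspose2d(feat, 3, 4, 2, 1), nn.Tanh(),  # Tanh bounds pixels to (-1, 1)
        )

    def forward(self, z):
        return self.net(z)

z = torch.randn(2, 100, 1, 1)   # batch of 2 latent vectors
imgs = Generator()(z)           # -> batch of 2 fake 3x64x64 images
```

The discriminator mirrors this with ordinary strided convolutions going the other way, 64×64 down to a single value.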
I found Aladdin Persson on YT with a series on different GAN implementations. I only got a functioning DCGAN because of his videos, GPT-4, and docs. Here is the specific video I followed on DCGANs. If you want to see the code I was running, it is linked here.
To build this model, I first had to set up an appropriate environment on my computer where I could import all the needed libraries (PyTorch, TensorFlow, CUDA Toolkit, etc.) and run the code. I used Anaconda to create the environment and PyCharm as my IDE.
I downloaded the CelebA dataset from here. In training, the generator first takes noise and upscales it through convolutional layers. The discriminator takes a real image and downsamples it to a guess between 0 and 1, then does the same for the fake image.
In this code, we use the BCELoss function from PyTorch. Like the originally proposed loss function, BCELoss sums up the concept behind GANs. It is calculated once for each model in the GAN, and both models try to minimize their own loss.
BCE(y, ŷ) = −(y · log(ŷ) + (1 − y) · log(1 − ŷ))
The legitimacy of the input is y (either 0 or 1), and the discriminator's prediction is ŷ (between 0 and 1). The sum of the losses calculated on the real and fake samples is used to tune the generator and discriminator.
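The formula is easy to check numerically. This is my own minimal hand-rolled version for a single sample (the actual script uses PyTorch's nn.BCELoss, which averages this over a batch):

```python
import math

def bce(y, y_hat):
    # binary cross-entropy for one sample: y is the true label (0 or 1),
    # y_hat is the discriminator's prediction in (0, 1)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```

A confident correct prediction gives a small loss; a confident wrong one gives a large loss, which is what drives the tuning.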
This loss is then backpropagated through the CNNs that form the generator and discriminator.
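Putting the pieces together, here is a hypothetical, stripped-down training step. This is a sketch under assumptions, not my actual script: a real DCGAN uses the convolutional networks described above, while the tiny linear stand-ins here exist only so the step runs end to end.

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()

def train_step(gen, disc, opt_g, opt_d, real, z_dim=100):
    batch = real.size(0)
    fake = gen(torch.randn(batch, z_dim, 1, 1))

    # Discriminator: push D(real) toward 1 and D(fake) toward 0
    loss_real = criterion(disc(real).view(-1), torch.ones(batch))
    loss_fake = criterion(disc(fake.detach()).view(-1), torch.zeros(batch))
    loss_d = loss_real + loss_fake
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: push D(fake) toward 1 (fool the detective)
    loss_g = criterion(disc(fake).view(-1), torch.ones(batch))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# tiny stand-in networks and a dummy "real" batch, just to show the step runs
gen = nn.Sequential(nn.Flatten(), nn.Linear(100, 8), nn.Tanh())
disc = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
loss_d, loss_g = train_step(gen, disc, opt_g, opt_d, torch.randn(4, 8))
```

Note the `detach()` when scoring the fake for the discriminator's loss, so that update doesn't reach back into the generator.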
After repeating this over thousands of images, this is what was produced on my computer. You can see the progression through multiple epochs. Note that each epoch goes through thousands of images.
GANs are part of the amalgam of algorithms that is generative AI. Though what I generated isn't revolutionary, it would have been less than a decade ago. The performance of these systems is rapidly growing. I'm excited to see how the field of generative AI will advance with GANs.
I was talking to a mentor of mine, who isn't into AI but has a background in mathematics. I was showing him the video and the DCGAN script on my computer. He noted that using convolutions in programming when he was in university was a third-year comp sci project. The code libraries, the compute, and the research weren't available to casually build a face image generator.
Compared to what has been developed in all of human history, my project shows how we stand on the shoulders of giants. Never have resources for developing technology been so available; even people ten years ago didn't have all the resources I have. There really is so much opportunity to start doing cool things.
That said, the resources available make it easy to skip over understanding. There is much going on that I'm still trying to grasp. To be able to build really cool and useful things, I need to keep seeking understanding.
If you enjoyed this article, clap and share. If you have feedback, feel free to leave a note for me to read 🙂
GPT-4 to edit code, give feedback, help debug, and explain concepts