OpenAI’s diffusion models beat GANs at what they do best

Generative Adversarial Networks (GANs) are a class of deep learning models that learn to produce new, realistic-looking data. Since their introduction in 2014, successive refinements have let them dominate image generation for the past few years and laid the foundations of a new phenomenon: deepfakes. Their ability to mimic training data and produce new samples similar to it has gone more or less unmatched. As such, they hold the state-of-the-art (SOTA) in most image generation tasks today.

Despite these advantages, GANs are notoriously hard to train and are prone to issues like mode collapse and opaque training behavior. Moreover, researchers have realized that GANs favor fidelity over diversity: they produce sharp samples but capture only part of the training data’s distribution. As such, researchers have been trying either to improve GANs on this front or to find other architectures that do better.

Two researchers, Prafulla Dhariwal and Alex Nichol from OpenAI, one of the leading AI research labs, took up the question and looked to other architectures. In their latest work, “Diffusion Models Beat GANs on Image Synthesis”, published on the preprint repository arXiv this week, they show that a different class of deep learning models, called diffusion models, addresses the aforementioned shortcomings of GANs. They show that diffusion models not only capture a greater breadth of the training data’s variance than GANs, but also beat the SOTA GANs in image generation tasks.

“We show that models with our improved architecture achieve state-of-the-art on unconditional image synthesis tasks, and with classifier guidance achieve state-of-the-art on conditional image synthesis. When using classifier guidance, we find that we can sample with as few as 25 forward passes while maintaining FIDs comparable to BigGAN. We also compare our improved models to upsampling stacks, finding that the two approaches give complementary improvements and that combining them gives the best results on ImageNet 512×512.”

Before moving further, it is important to understand the crux of diffusion models. Diffusion models are another class of deep learning models (specifically, likelihood-based models) that do well in image generation tasks. Unlike GANs, which learn to map random noise directly to a point in the training distribution, diffusion models start from a noisy image and perform a series of de-noising steps that progressively remove the noise and reveal an image that belongs to the training data’s distribution.
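To make the idea of progressive de-noising concrete, here is a minimal Python sketch. It is not taken from the paper's code: it assumes the commonly used DDPM-style formulation with a linear noise schedule, and the names `add_noise`, `oracle_eps`, and `denoise` are made up for illustration. In a real diffusion model a trained neural network predicts the noise in the image; here a hypothetical "oracle" that already knows the clean image stands in for it, so the loop recovers that image trivially. The point is only to show the shape of the sampling loop.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                                  # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # share of the signal variance left at step t

def add_noise(x0, t, eps):
    """Forward process: blend the clean image with Gaussian noise at step t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def oracle_eps(x_t, t, x0):
    """Hypothetical stand-in for a trained network that predicts the noise in x_t."""
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

def denoise(x_T, x0):
    """Reverse process: step from pure noise back to an image, one step at a time."""
    x = x_T
    for t in reversed(range(T)):
        eps_hat = oracle_eps(x, t, x0)
        # Remove the predicted noise for this step, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise is injected at every step except the last one.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

x0 = rng.uniform(-1.0, 1.0, size=(8, 8))                        # toy "image"
x_noisy = add_noise(x0, T - 1, rng.standard_normal(x0.shape))   # almost pure noise
sample = denoise(rng.standard_normal(x0.shape), x0)             # many small de-noising steps
```

The loop runs one model evaluation per step, which is why sampling from a diffusion model takes many forward passes where a GAN needs only one.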

Dhariwal and Nichol hypothesized that a series of upgrades to the architecture of contemporary diffusion models would improve their performance. They also built into their diffusion models the ability to trade off fidelity against diversity, a choice characteristic of GANs. Taking inspiration from the attention layers of the Transformer architecture, improving the UNet architecture, using Adaptive Group Normalization, and conditioning on class labels, the two researchers trained a fleet of diffusion models and then pitted them against the SOTA GANs in image generation tasks.

Both BigGAN and OpenAI’s models were trained on the LSUN and ImageNet datasets for unconditional and conditional image generation tasks. The output images were compared using several metrics that measure precision, recall, and fidelity. Most notably, the venerable Fréchet Inception Distance (FID) and sFID metrics, which quantify the difference between two image distributions, were used.

OpenAI’s diffusion models obtain the best FID on each task and the best sFID on all but one task. The table below shows the results. Note that as stated earlier, FID measures the distance between two image distributions so a perfect score is 0.0, meaning that the two distributions are identical. Thus, in the table below, the lower the score, the better.
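For readers curious about what such a score computes, the sketch below shows the Fréchet distance between two sets of feature vectors, each summarized by its mean and covariance. This is an illustration, not the evaluation code used in the paper: real FID first passes the images through a pretrained Inception network to obtain the features, whereas here random vectors stand in for them, and the function name `frechet_distance` is made up for this example.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory;
    # the clip guards against tiny negative values from rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 16))    # stand-in for features of real images
same = frechet_distance(real, real)      # identical sets: distance is essentially 0
shifted = frechet_distance(real, real + 3.0)  # same spread, shifted mean: 16 * 3**2 = 144
```

Identical feature sets score essentially zero, and the score grows as the means or covariances of the two sets drift apart, which is why lower is better.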

Metrics comparing the diffusion model with other architectures

Qualitatively, this leads to the following image outputs. The left column houses results from the SOTA BigGAN-deep model, the middle column has outputs from OpenAI’s diffusion models, and the right column has images from the original training dataset.

Images from BigGAN-deep, OpenAI’s diffusion models, and the training dataset

More samples from the experiment are attached at the end of this article. The astute reader will notice that, perceptually, the images above look quite similar, but the authors point out that the diffusion models capture more of the breadth of the training set:

“While the samples are of similar perceptual quality, the diffusion model contains more modes than the GAN, such as zoomed ostrich heads, single flamingos, different orientations of cheeseburgers, and a tinca fish with no human holding it.”

With these results out in the open now, the researchers believe that diffusion models are an “extremely promising direction” for generative modeling, a domain that has largely been dominated by GANs.

More samples from the diffusion model

Despite the promising results, the researchers noted that diffusion models are not without their own set of limitations. Currently, training diffusion models requires more computational resources than GANs. Image synthesis is slower as well due to the multiple de-noising steps that progressively remove noise from the image. They have pointed to existing approaches tackling these issues in their paper, which might be explored in the future.

OpenAI has published the code for this paper on GitHub. You may check out the repository here. Further details can be found in the research paper on arXiv.
