Deep Learning Weekly: Generative Modeling Deep Dive

A thorough exploration of state-of-the-art models that generate realistic images, text and music.

Hey folks,

Today, we’re sending you the first issue in a new series of deep dives! With this new issue type, our aim is to examine a single topic in depth. This time we dive into generative modeling, covering what you need to deepen your understanding of the field.

You’ll learn about state-of-the-art models that generate realistic images, text and music. We also think you’ll be amazed by the new applications these models enable: image interpolation, drug discovery, video conferencing bandwidth optimization, and more.

Generative Modeling: State-of-the-Art and Applications

In recent years, generative modeling has become one of the hottest topics in deep learning, helped by the impressive results of models like Generative Adversarial Networks (GANs) for image generation and GPT-3 for text generation. We present an overview of state-of-the-art generative models, and then describe the main applications they enable.

Introduction to Generative Modeling

“A generative model is a model able to generate data points following the same distribution as the dataset it has been trained on.”

This fairly simple definition gave birth to a field of research devoted to solving this unsupervised learning task. Early attempts using models like Gaussian mixtures, Hidden Markov Models, Boltzmann Machines or Variational AutoEncoders were quite successful on very specific use cases but remained confined to the world of research.
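To make the definition concrete, here is a minimal sketch of about the simplest generative model one can write: a single Gaussian fitted to a one-dimensional dataset, then sampled to produce new points. The toy dataset, the function names (`fit_gaussian`, `sample`) and the choice of a single Gaussian rather than a mixture are assumptions made purely for illustration.

```python
import random
import statistics


def fit_gaussian(data):
    """Estimate the mean and standard deviation of a 1-D dataset."""
    return statistics.fmean(data), statistics.stdev(data)


def sample(mean, std, n, rng):
    """Draw n new points from the fitted Gaussian."""
    return [rng.gauss(mean, std) for _ in range(n)]


rng = random.Random(0)

# Toy "dataset" (an assumption): 5,000 points from a Gaussian with mean 5, std 2.
dataset = [rng.gauss(5.0, 2.0) for _ in range(5000)]

# "Training" here is just estimating two parameters; no labels are involved.
mean, std = fit_gaussian(dataset)

# "Generation": new points that follow the same distribution as the dataset.
generated = sample(mean, std, 5000, rng)

print(f"dataset:   mean={statistics.fmean(dataset):.2f}, std={statistics.stdev(dataset):.2f}")
print(f"generated: mean={statistics.fmean(generated):.2f}, std={statistics.stdev(generated):.2f}")
```

The two printed lines should show closely matching statistics. Deep generative models follow the same fit-then-sample pattern, only with far richer distributions than a single Gaussian.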

Applying deep learning methods to generative modeling has truly revolutionized the field, and made it possible to use generative models on complex datasets: images, videos, text and music.

This revolution has been fueled by the ever-growing quantity of data and computing power available. A notable advantage of these methods is that they do not need labeled data: an unlabeled dataset is enough to train a generative model.

Generative Modeling: State-of-the-Art

Quite interestingly, the deep learning architectures developed for image, text and music generation are different, but they all began to produce human-level results in the last few years.


Image generation became a truly hot topic with the invention of the Generative Adversarial Network (GAN) framework in 2014. The results obtained with this first iteration were still far from human-level, but later architectures based on this framework brought impressive improvements:

  • The DCGAN architecture released in 2016 introduced the use of convolutional neural networks in the GAN framework

  • BigGAN (2019) trains GANs at a much larger scale and brings improvements to stabilize the training process

  • StyleGAN (2019) was developed by combining GANs with ideas from the style transfer literature, and marked a big leap in the perceived quality of generated images

  • StyleGAN2 (2020) is as of today the state-of-the-art model for image generation in terms of human perception. Models able to generate very realistic human faces, horses, churches, living rooms and many others have been open-sourced for large-scale use. StyleGAN2-ADA (2020) is an enhancement of StyleGAN2 to improve the model's quality when the training dataset is small. The researchers behind this model made a big effort to release a clean and user-friendly codebase

It is worth noting that very recent approaches inspired by the Transformer architecture, like GANsformer or TransGAN, suggest noticeable improvements in the performance of those models and hence in the quality of the generated images.


Traditional text generation models were based on Markov Chains. The rough idea is quite simple: if you want to generate a word after “The dog is”, the model looks in the training dataset for all instances of “The dog is”, and samples the next word according to how often each continuation appears. This gives surprisingly good results given the simplicity of the method. However, the generated text lacks long-term structure and is not good enough for practical applications.
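The lookup-and-sample idea described above can be sketched in a few lines of Python. This is a minimal illustration, not a production text generator: the tiny corpus, the whitespace tokenization, the context length of three words and the function names (`build_chain`, `generate`) are all assumptions chosen for the example.

```python
import random
from collections import defaultdict


def build_chain(text, order=3):
    """Map each run of `order` consecutive words to the words that followed it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        chain[context].append(words[i + order])
    return chain


def generate(chain, seed, length=10, rng=None):
    """Extend `seed` (a tuple of `order` words) by sampling one word at a time."""
    rng = rng or random.Random()
    order = len(seed)
    output = list(seed)
    for _ in range(length):
        followers = chain.get(tuple(output[-order:]))
        if not followers:  # context never seen in the corpus: stop early
            break
        # Repeated entries in `followers` make frequent continuations more likely.
        output.append(rng.choice(followers))
    return " ".join(output)


# Toy corpus (an assumption for illustration).
corpus = (
    "The dog is barking at the cat . "
    "The dog is sleeping on the sofa . "
    "The dog is barking at the door ."
)

chain = build_chain(corpus, order=3)
print(chain[("The", "dog", "is")])  # → ['barking', 'sleeping', 'barking']
print(generate(chain, ("The", "dog", "is"), length=6, rng=random.Random(0)))
```

Because the model only ever looks at the last three words, it has no memory of what came earlier in the sentence, which is exactly why the generated text lacks long-term structure.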

The use of deep learning for text generation has truly revolutionized the field:

  • Text generation architectures based on fairly simple LSTM neural networks significantly improved on the results of the simple models introduced above

  • The release of the GPT-2 model by OpenAI in 2019 brought another significant improvement for text generation tasks

  • In the summer of 2020, OpenAI released GPT-3, the successor to GPT-2, with about 175 billion parameters, 10 times more than any previous non-sparse language model. This model gives impressive results and enables many practical applications

GPT-2 and GPT-3 are general-purpose language models, and it is important to note that they can also be used for tasks other than pure text generation, including text summarization, translation and question answering.


Compared to image or text generation, music generation is still in its infancy, and the results of state-of-the-art models are a bit less impressive. Still, two models are worth introducing:

  • OpenAI’s Jukebox (2020) is able to produce a wide range of music and singing styles. It generates the raw audio waveform, which in theory makes it possible to generate any sound one can think of

  • PerformanceRNN (2017) is an LSTM-based RNN model released by Magenta, an open source research project exploring the role of ML as a tool in the creative process. It has been applied to piano composition. The approach is quite different from Jukebox’s, as it is based on generating discrete events (each possible note corresponds to a given event)
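To give a feel for what generating discrete events means, here is a minimal sketch that turns a few notes into a flat sequence of event tokens a sequence model could be trained on. The token names (`NOTE_ON_*`, `NOTE_OFF_*`, `TIME_SHIFT_*`), the integer time steps and the function `notes_to_events` are hypothetical simplifications for illustration, not PerformanceRNN’s actual vocabulary or code.

```python
from typing import List, Tuple

# A note is (MIDI pitch, start step, end step); steps are abstract time units.
Note = Tuple[int, int, int]


def notes_to_events(notes: List[Note]) -> List[str]:
    """Flatten notes into a time-ordered sequence of discrete event tokens."""
    # Collect (time, kind, pitch). kind 0 = note off, kind 1 = note on, so at
    # equal times a note that ends is emitted before a note that starts.
    timeline = []
    for pitch, start, end in notes:
        timeline.append((start, 1, pitch))
        timeline.append((end, 0, pitch))
    timeline.sort()

    events, now = [], 0
    for time, kind, pitch in timeline:
        if time > now:
            # Advance the clock with an explicit time-shift token.
            events.append(f"TIME_SHIFT_{time - now}")
            now = time
        events.append(f"{'NOTE_ON' if kind else 'NOTE_OFF'}_{pitch}")
    return events


# Toy melody (an assumption): a C, followed by E and G played together.
melody = [(60, 0, 2), (64, 2, 4), (67, 2, 4)]
print(notes_to_events(melody))
```

Once music is written as a sequence of tokens like this, generating it becomes a next-token prediction problem, much like text generation. Jukebox, by contrast, works on the raw audio waveform directly.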


The revolution in generative modeling gave birth to a wide range of applications, among them image interpolation, drug discovery and video conferencing bandwidth optimization.

Given how quickly academic research is moving forward in the generative modeling field, we should expect many more breakthroughs and applications in the next few years. However, we must also be aware of the more controversial applications these models enable, such as deepfakes or fake news generation.