Deep Learning Weekly: Self-Supervised Learning Deep Dive

An introduction to self-supervised learning, a recent technique which has the potential to radically change how we build ML models

Hey folks,

Today, we are sending you our third deep dive! Each deep dive aims to examine a single topic closely. This time we look at self-supervised learning, a recent technique that applies supervised learning methods to unlabeled datasets.

We give a technical description of the field, an overview of its main applications as of today, and we show how it could significantly impact how we build ML models in the future.


Self-Supervised Learning: What is it About?

Using supervised learning, we are able to build ML models which perform exceptionally well on certain complex tasks, such as language translation or image classification. Training those models requires large-scale labeled datasets, containing millions or even billions of data points. Building datasets at this scale can be time-consuming, expensive and error-prone. It can also be the main obstacle in areas where high-quality data is scarce, like healthcare.

Over the last few years, transfer learning has emerged as a reference technique to solve part of this issue. The idea is to start from a model pre-trained on a problem similar to the target problem. Transfer learning has deeply impacted how we build ML models in many fields, and today, very few people train computer vision or natural language processing models from scratch. However, transfer learning is only part of the solution, in particular when the target problem is too specific for any existing pre-trained model.

Self-supervised learning is an emerging solution to these limitations, removing the need for manual data labeling. The technique started to gain interest around 2018, and that interest has kept growing since then. It is often compared to unsupervised learning in that it doesn't need a labeled dataset. We think that this technique is not as popular as it deserves to be, hence we’ve decided to write a deep dive on it.

Self-Supervised Learning: a Technical Description

The goal of self-supervised learning is to remove the need to build a labeled dataset to solve a learning task. Instead, the model trains itself using labels that are naturally part of the data. For example, if one wants to train a model able to colorize black and white images, one can automatically build a training dataset made of color images and their black and white counterparts, without any need for human labeling. This is a use case where self-supervised learning is an obvious fit.
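As a rough illustration of the colorization example, here is a minimal sketch of how such a dataset could be built. The function name make_colorization_pair is hypothetical, the random array stands in for a real collection of color photos, and the grayscale weights are one common convention rather than a requirement.

```python
import numpy as np

def make_colorization_pair(color_image):
    """Return (grayscale_input, color_target) for one RGB image.

    color_image: array of shape (H, W, 3) with values in [0, 1].
    """
    # Assumption: ITU-R BT.601 luma weights, a common grayscale convention.
    weights = np.array([0.299, 0.587, 0.114])
    gray = color_image @ weights  # weighted sum over the channel axis -> (H, W)
    return gray, color_image

# Stand-in for a set of unlabeled color photos (random pixels, for illustration only).
rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 3))

pairs = [make_colorization_pair(img) for img in images]
inputs = np.stack([gray for gray, _ in pairs])     # what the model sees
targets = np.stack([color for _, color in pairs])  # what the model must predict
```

The "label" for each example is simply the original color image, so no human annotation is involved at any point.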

In other situations, self-supervised learning is not used to solve the target problem directly. It is used instead to learn useful features by making predictions on a task which does not need human labeling. Such tasks are called pretext tasks, while the target problem is called the downstream task. For example, in NLP applications, a common pretext task is to learn word embeddings by filling in the blanks in sentences, and the embeddings are then used to solve a downstream task (for example language translation).
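To make the fill-in-the-blank pretext task concrete, here is a small sketch of how training examples could be generated from raw text. The function name and the "[MASK]" placeholder are illustrative assumptions, and real systems use proper tokenizers rather than whitespace splitting.

```python
def fill_in_the_blank_examples(sentence, mask="[MASK]"):
    """Build one (masked_sentence, hidden_word) pair per word position."""
    tokens = sentence.split()  # naive whitespace tokenization, for illustration
    examples = []
    for i, word in enumerate(tokens):
        # Hide the word at position i; it becomes the label to predict.
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        examples.append((" ".join(masked), word))
    return examples

examples = fill_in_the_blank_examples("the cat sat on the mat")
for masked_sentence, label in examples:
    print(f"{masked_sentence!r} -> {label!r}")
```

A model trained on these pairs has to learn something about how words relate to their context, and the features it learns are what gets reused for the downstream task.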

At this stage, it is important to keep in mind the main challenge self-supervised learning faces: compared to a traditional supervised learning framework, defining and solving pretext tasks is an additional step, which requires skilled engineering time and extra computing resources.

The Main Applications

Whatever the field, the first question to answer when doing self-supervised learning is: “What pretext task should I use to learn useful features from my unlabeled dataset?”

The first field where this technique has been used extensively, sometimes without being called by its name, is Natural Language Processing. A common pretext task, introduced in the Word2Vec paper to learn word embeddings, is to predict a word in a sentence from the surrounding words. This post on applications of self-supervised learning for NLP presents other commonly used pretext tasks, like sentence permutation or sentence order prediction.
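Sentence order prediction can be sketched in a few lines as well. The version below is one plausible formulation, not necessarily the one used in the post mentioned above: adjacent sentences are kept in order or swapped at random, and the label records which happened. The function name and the toy document are made up for the example.

```python
import random

def sentence_order_examples(sentences, seed=0):
    """Turn consecutive sentences into ((first, second), label) pairs.

    label is 1 when the pair is in its original order, 0 when swapped.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    examples = []
    for first, second in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            examples.append(((first, second), 1))
        else:
            examples.append(((second, first), 0))
    return examples

document = [
    "We collected raw text.",
    "Then we split it into sentences.",
    "Next we built training pairs.",
    "Finally we trained a model.",
]
examples = sentence_order_examples(document)
```

As with the other pretext tasks, the labels come for free from the structure of the text itself.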

Applications of self-supervised learning for computer vision tasks have been introduced more recently and show very promising early results. This interesting course, this paper or this shorter post give an overview of the pretext tasks usually defined in computer vision and of the early results at this point. The pretext tasks usually defined include the following:

  • Colorization of black and white images

  • Splitting an image into smaller patches, then predicting a patch from its neighbors or putting shuffled patches back in the right order

  • Placing frames in the right order in a video
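The patch reordering task from the list above can be sketched as follows. This is a simplified illustration under our own assumptions (a square grid, a hypothetical make_jigsaw_example function, a synthetic image), not the setup of any specific paper.

```python
import numpy as np

def make_jigsaw_example(image, grid=3, rng=None):
    """Cut an image into grid x grid patches and shuffle them.

    Returns (shuffled_patches, permutation), where shuffled_patches[k]
    is the original patch number permutation[k] (row-major order).
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    ph, pw = height // grid, width // grid  # patch size; leftover pixels are dropped
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid)
        for c in range(grid)
    ]
    permutation = rng.permutation(grid * grid)
    shuffled = np.stack([patches[i] for i in permutation])
    return shuffled, permutation

# Synthetic 36x36 RGB image standing in for an unlabeled photo.
image = np.arange(36 * 36 * 3, dtype=np.float32).reshape(36, 36, 3)
shuffled, permutation = make_jigsaw_example(
    image, grid=3, rng=np.random.default_rng(0)
)
```

The model receives the shuffled patches and is trained to recover the permutation, which pushes it to learn how the parts of an object fit together.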

FAIR (Facebook AI Research) is particularly active on this topic and has released a library for state-of-the-art self-supervised learning, whose underlying concepts are wonderfully introduced in this lecture. The application of self-supervised learning to train transformer architectures for computer vision has also been the focus of recent work, as those architectures need huge training datasets. Finally, self-supervised learning has been applied successfully in healthcare, where data scarcity is the main limiting factor.

Conclusion and Perspectives

We introduced this deep dive by stating that supervised learning enables us to build accurate systems solving diverse tasks, mainly in NLP and computer vision. As of today, the bottleneck of this approach is the need for ever-larger datasets, which often require manual data labeling. This approach also tends to produce very specialized AI systems that solve narrow tasks.

Self-supervised learning steps in at this point: it automatically generates labels from an unlabeled dataset and lets the machine come up with a solution without human intervention. It may be a step towards building more generalist AI models that work more like human intelligence does.

Finally, self-supervised learning is only one of several methods that aim to make the most of unlabeled data, alongside semi-supervised learning, active learning and weakly supervised learning.
