Deep Learning Weekly: Issue #220

State of AI Report 2021, Perceive, TinyML, Hailo, IC-GANs, Yann LeCun's new deep learning course, and more

Hey folks,

This week in deep learning, we bring you the State of AI Report 2021, Hailo’s fundraising round, the future of TinyML, and the “Patches are All You Need” paper.

You may also enjoy Yann LeCun’s deep learning course, a library to label text documents, DeepMind’s latest AlphaFold-Multimer, and more!

As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.

Until next week!


NVIDIA Invites Healthcare Startup Submissions to Access UK’s Most Powerful Supercomputer

Healthcare startups can now apply for free access to Cambridge-1, the UK’s most powerful supercomputer. This will help the companies bring their healthcare innovations to market faster, accelerating the evolution of drug discovery, genome sequencing, and disease research.

Perceive: Building the World’s AI Community

Perceive is Clarifai’s AI conference. This year’s session is named “Accelerate the progress of humanity with continuously improving AI”.

UK’s National AI Strategy

The UK released its AI strategy recently, recognizing the power of AI to increase resilience, productivity, growth, and innovation across the private and public sectors.

Facebook Loves Self-Supervised Learning. Period.

A nice summary of how Facebook is investing in self-supervised learning, an approach that lets models take advantage of entirely unlabelled datasets.

State of AI Report 2021

The State of AI Report 2021 is out and analyzes the field’s trends. Key themes in this year’s report: AI applications are being deployed more and more widely (from electric grid optimization to drug discovery), AI funding is still increasing, and China’s ascension is notable.

A New Link to an Old Model Could Crack the Mystery of Deep Learning

Researchers think that there are deep analogies between deep learning and kernel machines, a well-known ML technique. Using those analogies could help in explaining why and how deep learning works.

Mobile & Edge

Israeli AI chip maker Hailo becomes newest ‘unicorn’ after $136m investment

Hailo makes chips that allow edge devices such as smart cameras and smart cars to run deep learning applications for industries such as automotive, drones, and home appliances.

AI and Machine Learning for On-Device Development

This insightful book explores how to create and run ML models on popular mobile platforms such as iOS and Android, using the appropriate libraries such as TensorFlow Lite or Core ML / Create ML.

Revolutionizing the Edge with TinyML

This post details the challenges posed by traditional cloud-based ML models and how ‘tiny’ machine learning (tinyML) can help resolve them.


Building Scalable, Explainable, and Adaptive NLP Models with Retrieval

Stanford AI Lab introduces ColBERT-QA and Baleen, question-answering systems based on retrieval-based NLP methods, an emerging alternative in which models directly “search” for information in a text corpus.

Image Encoders: BigTransfer vs CLIP

This post compares two image encoders (i.e. neural networks that turn an image into a vector embedding) in terms of accuracy and complexity: Big Transfer from Google and CLIP from OpenAI.

Finding Complex Metal Oxides for Technology Advancement

This post presents an ML model able to find a shortlist of materials that may be exceptional for any given property, with important applications such as developing the technologies needed for hydrogen production.

Yann LeCun’s Deep Learning Course at CDS

Yann LeCun’s course at NYU’s Center for Data Science teaches the latest techniques in deep learning and representation learning, focusing on supervised and unsupervised deep learning, embedding methods, metric learning, and convolutional and recurrent networks.

Building AI that can generate images of things it has never seen before

Facebook introduces Instance-Conditioned GANs, a model able to generate realistic, unforeseen image combinations, such as camels surrounded by snow or zebras in a city. This approach exhibits exceptional transfer capabilities across different types of objects.

Libraries & Code

DataQA: Unstructured Text Documents Labeling

DataQA is a library to label unstructured text documents. With DataQA, you can, for example, extract and classify named entities to implement simple heuristics to automatically label your documents.

Data-Efficient Deep Learning Benchmark (DEIC)

This repository gives access to train/validation/test splits for six datasets used as benchmarks for data-efficient image classification. It covers multiple image domains (natural images, medical images, remote sensing, handwriting recognition, etc.).

Colab Notebook: Logging and Visualizing StyleGAN3 training runs with Comet*

An implementation of the new StyleGAN3 model architecture from NVIDIA that allows you to log and visualize model training runs using Comet.

*Deep Learning Weekly is sponsored by Comet

Papers & Publications

Protein complex prediction with AlphaFold-Multimer


While the vast majority of well-structured single protein chains can now be predicted to high accuracy due to the recent AlphaFold [1] model, the prediction of multi-chain protein complexes remains a challenge in many cases. In this work, we demonstrate that an AlphaFold model trained specifically for multimeric inputs of known stoichiometry, which we call AlphaFold-Multimer, significantly increases accuracy of predicted multimeric interfaces over input-adapted single-chain AlphaFold while maintaining high intra-chain accuracy. On a benchmark dataset of 17 heterodimer proteins without templates (introduced in [2]) we achieve at least medium accuracy (DockQ [3] ≥ 0.49) on 14 targets and high accuracy (DockQ ≥ 0.8) on 6 targets, compared to 9 targets of at least medium accuracy and 4 of high accuracy for the previous state-of-the-art system (an AlphaFold-based system from [2]). We also predict structures for a large dataset of 4,433 recent protein complexes, from which we score all non-redundant interfaces with low template identity. For heteromeric interfaces we successfully predict the interface (DockQ ≥ 0.23) in 67% of cases, and produce high accuracy predictions (DockQ ≥ 0.8) in 23% of cases, an improvement of +25 and +11 percentage points over the flexible linker modification of AlphaFold [4] respectively. For homomeric interfaces we successfully predict the interface in 69% of cases, and produce high accuracy predictions in 34% of cases, an improvement of +5 percentage points in both instances.
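
For readers who want to see how the DockQ cut-offs quoted in this abstract (0.23, 0.49, 0.8) turn into the reported percentages, here is a minimal sketch. It is not the authors' evaluation code: the function names, the band labels, and the example scores are all hypothetical, and only the three thresholds come from the abstract.

```python
# Thresholds quoted in the abstract; the band labels are illustrative names.
DOCKQ_BANDS = [
    (0.80, "high"),
    (0.49, "medium"),
    (0.23, "acceptable"),
]


def dockq_band(score: float) -> str:
    """Map a DockQ score in [0, 1] to a quality band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("DockQ scores lie between 0 and 1")
    # Bands are listed from strictest to loosest, so the first match wins.
    for threshold, name in DOCKQ_BANDS:
        if score >= threshold:
            return name
    return "incorrect"


def fraction_at_least(scores: list[float], threshold: float) -> float:
    """Share of interfaces whose DockQ score meets the threshold."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Made-up scores, purely to show the bookkeeping.
    example_scores = [0.10, 0.30, 0.55, 0.85]
    for s in example_scores:
        print(s, dockq_band(s))
    print("predicted interface:", fraction_at_least(example_scores, 0.23))
    print("high accuracy:", fraction_at_least(example_scores, 0.80))
```

A statement such as "67% of cases" then corresponds to `fraction_at_least(scores, 0.23)` computed over the scored interfaces.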

Exploring the Limits of Large Scale Pre-training


Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomenon and captures the nonlinear relationship in performance of upstream and downstream tasks. Delving deeper to understand the reasons that give rise to these phenomena, we show that the saturation behavior we observe is closely related to the way that representations evolve through the layers of the models. We showcase an even more extreme scenario where performance on upstream and downstream are at odds with each other. That is, to have a better downstream performance, we need to hurt upstream accuracy.

Patches Are All You Need


Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at
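
To make the abstract's point about quadratic self-attention cost concrete, here is a small back-of-the-envelope sketch. It is not from the paper: the function names are hypothetical, the 224-pixel image side and the patch sizes are assumed example values, and "token pairs" is used as a rough stand-in for attention cost.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size evenly")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(num_tokens: int) -> int:
    """Token pairs compared by one self-attention layer (quadratic in tokens)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    image_size = 224  # assumed example resolution
    for patch_size in (1, 7, 16):  # patch_size=1 means one token per pixel
        tokens = num_patches(image_size, patch_size)
        print(f"patch {patch_size:>2}: {tokens:>6} tokens, "
              f"{attention_pairs(tokens):>13,} token pairs")
```

Under these assumptions, moving from one token per pixel to 16x16 patches cuts the token count from 50,176 to 196, which is why patch embeddings make attention tractable on larger images.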
