Deep Learning Weekly: Issue #220
State of AI Report 2021, Perceive, TinyML, Hailo, IC-GANs, Yann LeCun's new deep learning course, and more
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Healthcare startups can now apply for free access to Cambridge-1, the UK’s most powerful supercomputer. This will help these companies bring their healthcare innovations to market faster, accelerating progress in drug discovery, genome sequencing, and disease research.
Perceive is Clarifai’s AI conference. This year’s edition is titled “Accelerate the progress of humanity with continuously improving AI”.
The UK released its AI strategy recently, recognizing the power of AI to increase resilience, productivity, growth, and innovation across the private and public sectors.
A nice summary of how Facebook is investing in self-supervised learning, a family of methods that makes it possible to learn from entirely unlabelled datasets.
The State of AI Report 2021 is out and analyzes the field’s trends. Key themes in this year’s report: more and more applications of AI are being deployed—from electric grid optimization to drug discovery—AI funding is still increasing, and China’s ascension is notable.
Researchers think that there are deep analogies between deep learning and kernel machines, a well-known ML technique. Using those analogies could help in explaining why and how deep learning works.
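One way to make the analogy concrete (following the neural-tangent-kernel line of work; the notation below is ours, not necessarily the researchers'): a sufficiently wide network trained by gradient descent behaves approximately like a kernel predictor,

```latex
f(x) \;\approx\; \sum_{i=1}^{n} \alpha_i \, K(x, x_i) + b,
\qquad
K(x, x') \;=\; \big\langle \nabla_\theta f(x;\theta),\; \nabla_\theta f(x';\theta) \big\rangle,
```

where K, the neural tangent kernel, is the inner product of the network's parameter gradients at the two inputs. In this regime training effectively only adjusts the coefficients α_i over the training points x_i, which is exactly the form of a kernel machine.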
Mobile & Edge
Hailo is a maker of chips that allow edge devices like smart cameras or smart cars to run deep learning applications for industries such as automotive, drones, and home appliances.
This insightful book explores how to create and run ML models on popular mobile platforms such as iOS and Android, using the appropriate libraries such as TensorFlow Lite or Core ML / Create ML.
This post details the challenges posed by traditional cloud-based ML models and how ‘tiny’ machine learning (tinyML) can help resolve them.
Stanford AI Lab introduces ColBERT-QA and Baleen, question-answering systems based on retrieval-based NLP methods, an emerging alternative in which models directly “search” for information in a text corpus.
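ColBERT-style retrieval scores documents with "late interaction": each query token embedding is matched against the most similar document token embedding, and these maxima are summed. The numpy sketch below is our own illustration of that scoring rule — random vectors stand in for real BERT token embeddings:

```python
import numpy as np

def maxsim_score(Q, D):
    """Late-interaction relevance: sum over query tokens of the
    maximum cosine similarity with any document token."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sim = Qn @ Dn.T                # (num_query_tokens, num_doc_tokens)
    return sim.max(axis=1).sum()   # best match per query token, summed

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 128))                          # 4 query token embeddings
docs = [rng.normal(size=(20, 128)) for _ in range(3)]  # unrelated documents
docs.append(np.vstack([Q, rng.normal(size=(16, 128))]))  # doc containing the query tokens

scores = [maxsim_score(Q, D) for D in docs]
best = int(np.argmax(scores))  # the document that contains the query tokens wins
```

Because similarities are computed token-by-token rather than on one pooled vector, the document embeddings can be indexed offline and searched efficiently at query time.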
This post compares two image encoders (i.e. neural networks to turn an image into a vector embedding) in terms of accuracy and complexity: Big Transfer from Google and CLIP from OpenAI.
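Whatever encoder is used, the downstream mechanics are the same: map each image to a fixed-length vector, L2-normalise it, and nearest-neighbour search reduces to a matrix product. A minimal numpy sketch with a toy stand-in encoder (`toy_encoder` is purely illustrative — BiT and CLIP are large pretrained networks):

```python
import numpy as np

def toy_encoder(img):
    """Stand-in for an image encoder: average 8x8 blocks of a 32x32
    grayscale image and flatten to a unit-norm 16-d embedding."""
    h, w = img.shape
    patches = img.reshape(4, h // 4, 4, w // 4).mean(axis=(1, 3))
    v = patches.flatten()
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
gallery = [rng.random((32, 32)) for _ in range(5)]
emb = np.stack([toy_encoder(im) for im in gallery])  # (5, 16) embedding matrix

query = gallery[2] + 0.01 * rng.random((32, 32))     # noisy copy of image 2
sims = emb @ toy_encoder(query)                      # cosine similarities
match = int(np.argmax(sims))                         # retrieves image 2
```

The accuracy side of the comparison comes down to how well the real encoder's embedding space separates semantically different images; the complexity side to the cost of computing the embedding itself.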
This post presents an ML model able to find a shortlist of materials that may be exceptional for any given property, with applications such as developing the technologies needed for hydrogen production.
Yann LeCun’s course at NYU’s Center for Data Science teaches the latest techniques in deep learning and representation learning, focusing on supervised and unsupervised deep learning, embedding methods, metric learning, and convolutional and recurrent networks.
Facebook introduces Instance-Conditioned GANs, a model able to generate realistic, unseen image combinations, such as camels surrounded by snow or zebras in a city. This approach exhibits exceptional transfer capabilities across different types of objects.
Libraries & Code
DataQA is a library for labelling unstructured text documents. With DataQA you can, for example, extract and classify named entities, and implement simple heuristics that label your documents automatically.
This repository gives access to the train/validation/test splits of six datasets used as benchmarks for data-efficient image classification. It covers multiple image domains (natural images, medical images, remote sensing, handwriting recognition, etc.).
An implementation of the new StyleGAN3 model architecture from Nvidia that allows you to log and visualize model training runs using Comet.
*Deep Learning Weekly is sponsored by Comet*
Papers & Publications
While the vast majority of well-structured single protein chains can now be predicted to high accuracy thanks to the recent AlphaFold model, the prediction of multi-chain protein complexes remains a challenge in many cases. In this work, we demonstrate that an AlphaFold model trained specifically for multimeric inputs of known stoichiometry, which we call AlphaFold-Multimer, significantly increases the accuracy of predicted multimeric interfaces over input-adapted single-chain AlphaFold while maintaining high intra-chain accuracy. On a benchmark dataset of 17 heterodimer proteins without templates (introduced in ) we achieve at least medium accuracy (DockQ ≥ 0.49) on 14 targets and high accuracy (DockQ ≥ 0.8) on 6 targets, compared to 9 targets of at least medium accuracy and 4 of high accuracy for the previous state-of-the-art system (an AlphaFold-based system from ). We also predict structures for a large dataset of 4,433 recent protein complexes, from which we score all non-redundant interfaces with low template identity. For heteromeric interfaces we successfully predict the interface (DockQ ≥ 0.23) in 67% of cases, and produce high-accuracy predictions (DockQ ≥ 0.8) in 23% of cases, an improvement of +25 and +11 percentage points over the flexible linker modification of AlphaFold, respectively. For homomeric interfaces we successfully predict the interface in 69% of cases, and produce high-accuracy predictions in 34% of cases, an improvement of +5 percentage points in both instances.
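The DockQ cut-offs in the abstract follow the standard quality bands for protein–protein interface prediction; a small helper makes them explicit (band names follow common usage: "acceptable" ≥ 0.23, "medium" ≥ 0.49, "high" ≥ 0.80):

```python
def dockq_band(score):
    """Map a DockQ score in [0, 1] to its standard quality band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("DockQ is defined on [0, 1]")
    if score >= 0.80:
        return "high"
    if score >= 0.49:
        return "medium"
    if score >= 0.23:
        return "acceptable"
    return "incorrect"
```

Under this mapping, "successfully predict the interface" in the abstract means any band at or above "acceptable", while the headline heterodimer numbers count "medium" and "high" predictions.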
Recent developments in large-scale machine learning suggest that by scaling up data, model size, and training time properly, one might observe that improvements in pre-training transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4,800 experiments on Vision Transformers, MLP-Mixers, and ResNets with parameter counts ranging from ten million to ten billion, trained on the largest available image datasets (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomenon and captures the nonlinear relationship between the performance of upstream and downstream tasks. Delving deeper to understand the reasons that give rise to this behavior, we show that the saturation we observe is closely related to the way representations evolve through the layers of the models. We showcase an even more extreme scenario, where upstream and downstream performance are at odds with each other: that is, to have better downstream performance, we need to hurt upstream accuracy.
Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/tmp-iclr/convmixer.
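The ConvMixer paper is known for fitting its model in a few lines of PyTorch; the sketch below follows that structure — patch embedding, then repeated blocks of a residual depthwise convolution (spatial mixing) and a pointwise convolution (channel mixing), all at a single resolution. Hyperparameters here are toy values, not the paper's:

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wrap a module with a skip connection: x -> f(x) + x."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(dim, depth, kernel_size=9, patch_size=7, n_classes=10):
    return nn.Sequential(
        # Patch embedding: non-overlapping patches via strided convolution.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(), nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Depthwise conv (groups=dim) mixes spatial locations per channel.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(), nn.BatchNorm2d(dim))),
            # Pointwise 1x1 conv mixes channels at each location.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(), nn.BatchNorm2d(dim))
          for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.Linear(dim, n_classes))

model = conv_mixer(dim=64, depth=2)
out = model(torch.randn(1, 3, 224, 224))  # class logits of shape (1, 10)
```

Note how the depthwise/pointwise split mirrors the spatial-vs-channel mixing separation of ViT and MLP-Mixer, while using nothing but standard convolutions.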