Deep Learning Weekly: Issue #195

Deep learning to solve differential equations, unifying the CUDA Python ecosystem, real-time CNN inference for AR on mobile, a GANsformer implementation, going deeper with image transformers, and more

Hey folks,

This week in deep learning, we bring you Nvidia’s CEO keynote, a model able to find an effective combination of drugs, a tutorial to accelerate CNN inference on mobile, and a deep dive on self-attention.

You may also enjoy a codebase implementing the GANsformer architecture, an audio codec using ML, a paper on large-scale language model training, another on how to train GANs with limited data, and more!

As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.

Until next week!


Bias, Variance, Unpredictability: Read iMerit's Latest on Handling Edge Cases in ML Systems

Through training, ML systems learn to recognize patterns by associating inputs like pixel values to categories like object identities. What happens when an input falls beyond the edge of the region assigned to its category? (Sponsored Content)

Google Open-Sources Lyra, A New Audio Codec Using Machine Learning To Produce High-Quality Voice Calls

Lyra uses machine learning to enable high-quality voice calls at low bandwidth. It can compress audio down to as little as 3 kbps while delivering better sound quality than other codecs that require much greater bandwidth.

Nvidia’s CEO GTC Keynote

In this keynote opening GTC 2021, Nvidia’s AI conference, CEO Jensen Huang presents the latest advancements in AI across automotive, robotics, 5G, real-time graphics, collaboration, and data centers.

Latest Neural Nets Solve World’s Hardest Equations Faster Than Ever Before

Two new approaches allow deep neural networks to solve entire families of partial differential equations, making it easier to model complicated systems and to do so orders of magnitude faster.

Unifying the CUDA Python Ecosystem

This post introduces CUDA Python, Nvidia’s soon-to-be-released library aiming to unify the CUDA ecosystem with a single standard set of low-level Python interfaces.

AI predicts effective drug combinations to fight complex diseases faster

This post introduces Facebook’s Compositional Perturbation Autoencoder, an AI model able to find an effective combination of existing drugs to treat a disease.

“Faster than real-time” cybersecurity company raises $100m

Israel-based Deep Instinct uses deep learning to recognize and thwart cyberattacks in milliseconds. It announced a $100m funding round.

Mobile & Edge

GPU Accelerated High-Performance Machine Learning Pipeline

This post details how Adobe’s AI/ML teams are working with Nvidia to build a GPU-based high-performance ML pipeline, achieving significantly faster processing at lower cost.

Getting real-time CNN inference for AR on mobile

This post shares a few tricks for accelerating CNN inference for AR on mobile, in particular how to reach real-time processing.

NVIDIA’s New CPU to ‘Grace’ World’s Most Powerful AI-Capable Supercomputer

Nvidia announced a collaboration with the Swiss National Supercomputing Center to build a supercomputer powered by Nvidia’s new Grace CPU and next-generation GPUs, expected to be the world’s most powerful AI-capable supercomputer.

Intel and MILA join forces to put Artificial Intelligence to Work in Medical Research

Intel and Mila (the world’s largest academic ML research institute) announced a new partnership providing access to large-scale high-performance computing to speed up the search for new drugs.


Discover How Old your Brain is with MRI Data and Artificial Intelligence

A long tutorial on how to estimate a brain’s age from MRI data. It begins with simple linear models and ends with convolutional neural networks.

Why multi-head self attention works

This deep dive explains how multi-head self-attention works, presenting both the intuition behind it and the mathematics.
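The computation the deep dive builds up to can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product multi-head self-attention, with random matrices standing in for the learned projections; it is not the article’s code and omits masking and dropout:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """Multi-head scaled dot-product self-attention (no mask, no dropout)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Random matrices stand in for the learned Q/K/V/output projections.
    wq, wk, wv, wo = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4)]
    # Project, then split into heads: (num_heads, seq_len, d_head).
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    # Each head attends independently over the whole sequence.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v                                  # (num_heads, seq_len, d_head)
    # Concatenate heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ wo

rng = np.random.default_rng(0)
out = multi_head_self_attention(rng.standard_normal((5, 8)), num_heads=2, rng=rng)
```

Each head gets its own slice of the model dimension, which is what lets different heads specialize in different relations between tokens.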

Weight Banding

This post explores weight banding, a large-scale structure that appears in the weights of some convolutional neural networks. Understanding such structures should help in designing more effective neural network architectures.

10 Things You Need to Know About BERT and the Transformer Architecture That Are Reshaping the AI Landscape

This post covers what you need to know to understand BERT and the transformer architecture: where does this technology come from? How was it developed? How does it work? What to expect in the near future?

Libraries & Code

GANsformer: Generative Adversarial Transformers

This codebase implements the GANsformer architecture, a recent method for image generation that achieves state-of-the-art results on a wide range of datasets.

VidSitu: Towards understanding situations in videos

VidSitu is a large-scale dataset containing diverse 10-second videos from movies depicting complex situations (a collection of related events). Events in the video are richly annotated at 2-second intervals with verbs, semantic-roles, entity co-references, and event relations.

Hyperparameter Ensembles for Robustness and Uncertainty Quantification

This notebook illustrates hyper-deep ensembles, a recent method that forms an ensemble over multiple variants of a neural network architecture, where each member uses different hyperparameters.
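As a rough illustration of the idea (not the notebook’s code), one can train copies of the same small model under different hyperparameters and random seeds, then average their predicted probabilities. Here the "model" is a tiny logistic regression and the varied hyperparameter is the L2 strength:

```python
import numpy as np

def train_logreg(X, y, l2, lr=0.5, steps=300, seed=0):
    """Tiny logistic regression; l2 is the member-specific hyperparameter."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1]) * 0.01
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)  # gradient step with L2
    return w

def hyper_ensemble_predict(X_train, y_train, X_test, l2_grid):
    """Average member probabilities; members differ in hyperparameters and seed."""
    probs = [1.0 / (1.0 + np.exp(-X_test @ train_logreg(X_train, y_train, l2, seed=i)))
             for i, l2 in enumerate(l2_grid)]
    return np.mean(probs, axis=0)

# Toy data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
p = hyper_ensemble_predict(X, y, X, l2_grid=[0.0, 0.01, 0.1])
```

Because the members disagree most where the data is ambiguous, the averaged probabilities tend to be better calibrated than any single member’s.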

Papers & Publications

Regularizing Generative Adversarial Networks under Limited Data


Recent years have witnessed the rapid progress of generative adversarial networks (GANs). However, the success of the GAN models hinges on a large amount of training data. This work proposes a regularization approach for training robust GAN models on limited data. We theoretically show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data. Extensive experiments on several benchmark datasets demonstrate that the proposed regularization scheme 1) improves the generalization performance and stabilizes the learning dynamics of GAN models under limited training data, and 2) complements the recent data augmentation methods. These properties facilitate training GAN models to achieve state-of-the-art performance when only limited training data of the ImageNet benchmark is available.

Efficient Large-Scale Language Model Training on GPU Clusters


Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required to train these models can result in unrealistically long training times. New methods of model parallelism such as tensor and pipeline parallelism have been proposed to address these challenges; unfortunately, naive usage leads to fundamental scaling issues at thousands of GPUs due to various reasons, e.g., expensive cross-node communication or idle periods waiting on other devices.

In this work, we show how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models we can efficiently train compared to existing systems. We discuss various implementations of pipeline parallelism and propose a novel schedule that can improve throughput by more than 10% with comparable memory footprint compared to previously-proposed approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. The composition of these techniques allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of peak; previous efforts to train similar-sized models achieve much lower throughput (36% of theoretical peak). Our code has been open-sourced at this https URL.
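To see why pipeline schedules matter, the classic bubble analysis is useful: with p pipeline stages and m microbatches per batch, a simple GPipe-style schedule leaves stages idle for a fraction (p − 1)/(m + p − 1) of the time. A quick sketch of that standard arithmetic (an illustration, not the paper’s code):

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle-time fraction of a simple (GPipe-style) pipeline schedule.

    Each microbatch takes one unit of time per stage; a batch of m
    microbatches through p stages finishes in m + p - 1 units, of which
    p - 1 units are pipeline fill/drain bubble.
    """
    return (stages - 1) / (microbatches + stages - 1)

# More microbatches shrink the bubble: with 8 stages, going from
# 8 to 64 microbatches cuts idle time substantially.
for m in (8, 64):
    print(f"{m} microbatches -> bubble {pipeline_bubble_fraction(8, m):.2%}")
```

This is why the interleaved schedules discussed in the paper, which reduce the effective bubble at a given microbatch count, translate directly into higher throughput.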

Going deeper with Image Transformers


Transformers have recently been adapted for large-scale image classification, achieving high scores that shake up the long supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two transformer architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose performance does not saturate early with more depth: for instance, we obtain 86.5% top-1 accuracy on ImageNet when training with no external data, thus attaining the current SOTA with fewer FLOPs and parameters. Moreover, our best model establishes the new state of the art on ImageNet with Reassessed labels and ImageNet-V2 / match frequency, in the setting with no additional training data. We share our code and models.