Deep Learning Weekly: Issue #195
New deep learning methods to solve differential equations, unifying the CUDA ecosystem, real-time CNNs for AR on mobile, a GANsformer implementation, going deeper with Image Transformers, and more
This week in deep learning, we bring you Nvidia’s CEO keynote, a model able to find an effective combination of drugs, a tutorial to accelerate CNN inference on mobile, and a deep dive on self-attention.
You may also enjoy a codebase implementing the GANsformer architecture, an audio codec using ML, a paper on large-scale language model training, another on how to train GANs under limited data, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Through training, ML systems learn to recognize patterns by associating inputs like pixel values to categories like object identities. What happens when an input falls beyond the edge of the region assigned to its category? (Sponsored Content)
Lyra uses machine learning to enable high-quality voice calls at low bandwidth. It can compress audio down to as little as 3 kbps while delivering better sound quality than other codecs that require much more bandwidth.
In this keynote opening GTC 2021, Nvidia’s AI conference, CEO Jensen Huang presents the latest advancements in AI across automotive, robotics, 5G, real-time graphics, collaboration, and data centers.
Two new approaches allow deep neural networks to solve entire families of partial differential equations, making it easier to model complicated systems and to do so orders of magnitude faster.
This post introduces Python CUDA, Nvidia’s soon-to-be-released library to help unify the CUDA ecosystem with a single standard set of low-level Python interfaces.
This post introduces Facebook’s Compositional Perturbation Autoencoder, an AI model able to find an effective combination of existing drugs to treat a disease.
Israel-based Deep Instinct, which uses deep learning to recognize and thwart cyberattacks in milliseconds, announced a $100M funding round.
Mobile & Edge
This post details how Adobe’s AI/ML teams are working with Nvidia to build a GPU-based high-performance ML pipeline, resulting in significantly faster processing and lower costs.
This post shares a few tricks to accelerate CNN inference for AR on mobile, in particular to reach real-time processing.
Nvidia announced a collaboration with the Swiss National Supercomputing Center to build a supercomputer powered by its new Grace CPU and next-generation GPUs, expected to be the world’s most powerful AI-capable supercomputer.
Intel and MILA (the world’s largest academic ML research institute) announced a new partnership on access to large-scale high-performance computing to speed up the search for new drugs.
A long tutorial on how to estimate a brain’s age from MRI data. It begins with simple linear models and ends with convolutional neural networks.
This deep dive is for people who want to understand how self-attention works, presenting both the intuition behind it and the math.
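As a companion to the deep dive, here is a minimal NumPy sketch of scaled dot-product self-attention; the shapes and projection matrices are illustrative, not taken from the linked post.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n_tokens, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # project tokens to queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token affinities, scaled
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys: rows sum to 1
    return weights @ V                              # each output mixes value vectors

# Illustrative usage with random projections (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                     # 5 tokens, model dimension 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # shape: (5, 8)
```

Each output token is a convex combination of value vectors, weighted by how strongly its query matches every key.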
This post explores Weight Banding, a large-scale structure that appears in the weights of some convolutional neural networks. Understanding such structures should help in the design of more effective neural network architectures.
This post covers what you need to know to understand BERT and the transformer architecture: where does this technology come from, how was it developed, how does it work, and what should we expect in the near future?
Libraries & Code
This codebase implements the GANsformer architecture, the latest method for image generation, giving state-of-the-art results on a large range of datasets.
VidSitu is a large-scale dataset containing diverse 10-second videos from movies depicting complex situations (a collection of related events). Events in the video are richly annotated at 2-second intervals with verbs, semantic-roles, entity co-references, and event relations.
This notebook illustrates Hyper-deep Ensembles, a recent method that forms an ensemble over multiple variants of a neural network architecture, where each member uses different hyperparameters.
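The core idea can be sketched in a few lines of scikit-learn; the dataset, architecture, and hyperparameter grid below are illustrative stand-ins, not the notebook's actual setup.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

# Each ensemble member shares the architecture family but trains with
# different hyperparameters (here: weight decay and learning rate).
hyperparams = [
    {"alpha": 1e-4, "learning_rate_init": 1e-3},
    {"alpha": 1e-2, "learning_rate_init": 1e-3},
    {"alpha": 1e-4, "learning_rate_init": 1e-2},
]
members = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=i, **hp).fit(X, y)
    for i, hp in enumerate(hyperparams)
]

# The ensemble prediction averages member probabilities, combining the
# diversity that different hyperparameters (and seeds) induce.
probs = np.mean([m.predict_proba(X) for m in members], axis=0)
preds = probs.argmax(axis=1)
```

Averaging over hyperparameter variants tends to improve calibration and robustness compared to an ensemble that differs only in random seeds.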
Papers & Publications
Recent years have witnessed the rapid progress of generative adversarial networks (GANs). However, the success of the GAN models hinges on a large amount of training data. This work proposes a regularization approach for training robust GAN models on limited data. We theoretically show a connection between the regularized loss and an f-divergence called LeCam-divergence, which we find is more robust under limited training data. Extensive experiments on several benchmark datasets demonstrate that the proposed regularization scheme 1) improves the generalization performance and stabilizes the learning dynamics of GAN models under limited training data, and 2) complements the recent data augmentation methods. These properties facilitate training GAN models to achieve state-of-the-art performance when only limited training data of the ImageNet benchmark is available.
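To make the regularization idea concrete, here is a hedged NumPy sketch of the anchor-based penalty the paper describes: exponential moving averages of the discriminator's outputs on real and fake batches serve as anchors, and outputs are penalized for drifting from the opposite anchor. The decay and weight values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

class LeCamRegularizer:
    """Sketch of a LeCam-style regularizer for a GAN discriminator (assumed form).

    Tracks EMA anchors of discriminator outputs on real and fake data, then
    penalizes real outputs for deviating from the fake anchor and vice versa,
    which damps the discriminator under limited training data.
    """

    def __init__(self, decay=0.99, weight=0.3):
        self.decay, self.weight = decay, weight   # illustrative hyperparameters
        self.ema_real, self.ema_fake = 0.0, 0.0

    def __call__(self, d_real, d_fake):
        # Update the moving-average anchors from the current batch.
        self.ema_real = self.decay * self.ema_real + (1 - self.decay) * d_real.mean()
        self.ema_fake = self.decay * self.ema_fake + (1 - self.decay) * d_fake.mean()
        # Penalize deviation from the opposite anchor.
        reg = np.mean((d_real - self.ema_fake) ** 2) + np.mean((d_fake - self.ema_real) ** 2)
        return self.weight * reg

# Illustrative usage with fake discriminator logits.
reg = LeCamRegularizer()
penalty = reg(np.array([0.8, 0.9]), np.array([0.1, 0.2]))
```

The penalty would be added to the discriminator loss each step; consult the paper and its released code for the exact formulation.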
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required to train these models can result in unrealistically long training times. New methods of model parallelism such as tensor and pipeline parallelism have been proposed to address these challenges; unfortunately, naive usage leads to fundamental scaling issues at thousands of GPUs due to various reasons, e.g., expensive cross-node communication or idle periods waiting on other devices.
In this work, we show how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models we can efficiently train compared to existing systems. We discuss various implementations of pipeline parallelism and propose a novel schedule that can improve throughput by more than 10% with comparable memory footprint compared to previously-proposed approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. The composition of these techniques allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of peak; previous efforts to train similar-sized models achieve much lower throughput (36% of theoretical peak). Our code has been open-sourced at this https URL.
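To illustrate how the three parallelism degrees compose, here is a small sketch that derives the data-parallel degree from the tensor- and pipeline-parallel degrees and enumerates tensor-parallel groups. The rank layout (tensor parallelism innermost, over consecutive ranks) is an assumption in the spirit of Megatron-style systems, not taken from the paper.

```python
def parallel_layout(world_size, tensor_size, pipeline_size):
    """Decompose a GPU cluster into tensor/pipeline/data parallel degrees.

    world_size must factor as tensor_size * pipeline_size * data_size.
    Tensor-parallel groups are assumed to span consecutive ranks, keeping
    the most communication-heavy parallelism within a single server.
    """
    assert world_size % (tensor_size * pipeline_size) == 0, "degrees must divide world size"
    data_size = world_size // (tensor_size * pipeline_size)
    tensor_groups = [
        list(range(start, start + tensor_size))
        for start in range(0, world_size, tensor_size)
    ]
    return data_size, tensor_groups

# Example at the paper's scale: 3072 GPUs with tensor degree 8 and
# pipeline degree 64 leaves a data-parallel degree of 6.
data_size, tensor_groups = parallel_layout(3072, 8, 64)
```

Total model capacity scales with the tensor and pipeline degrees, while the data-parallel degree scales throughput across replicas.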
Transformers have recently been adapted for large-scale image classification, achieving high scores that shake up the long supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay between the architecture and optimization of such dedicated transformers. We make two transformer architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose performance does not saturate early with more depth: for instance, we obtain 86.5% top-1 accuracy on ImageNet when training with no external data, attaining the current SOTA with fewer FLOPs and parameters. Moreover, our best model establishes a new state of the art on ImageNet with Reassessed labels and ImageNet-V2 / matched frequency, in the setting with no additional training data. We share our code and models.
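One of the architecture changes this paper (CaiT) introduces is LayerScale: each residual branch's output is multiplied channel-wise by a learnable vector initialized near zero, so very deep stacks start close to the identity and train stably. The sketch below is a minimal illustration; the branch itself is a placeholder, not the paper's attention/MLP blocks.

```python
import numpy as np

class LayerScaleBlock:
    """Minimal sketch of LayerScale in a residual block (illustrative).

    gamma is a learnable per-channel vector initialized to a small value,
    so at initialization the block is nearly the identity map.
    """

    def __init__(self, dim, init_value=1e-4, seed=None):
        rng = np.random.default_rng(seed)
        self.gamma = np.full(dim, init_value)                     # learnable in practice
        self.W = rng.standard_normal((dim, dim)) / np.sqrt(dim)   # stand-in branch weights

    def __call__(self, x):
        branch = np.tanh(x @ self.W)          # placeholder for an attention or MLP branch
        return x + self.gamma * branch        # channel-wise scaling keeps it near-identity

# At initialization, the output barely deviates from the input.
block = LayerScaleBlock(16, seed=0)
x = np.ones(16)
y = block(x)
```

Because each branch's contribution starts tiny, gradients flow through the residual path and depth can be increased without early saturation.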