Deep Learning Weekly Issue #166

Visualizing tensor operations, transformers for image recognition, custom AI for Snapchat Lenses, and more

Hey folks,

This week in deep learning, we bring you Nvidia's solutions to some of the biggest problems in video calls, an article on clarifying exceptions and visualizing tensor operations in deep learning code, Google's real-time sign language detection for video conferencing, and Fritz AI’s support for SnapML in Lens Studio.

You may also enjoy two papers on transformers, Rethinking Attention with Performers and An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, and more!

As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.

Until next week!


Anticipating heart failure with machine learning

Many health issues are tied to excess fluid in the lungs. A new algorithm can gauge how severe that fluid buildup is by looking at a single X-ray.

Nvidia says its AI can fix some of the biggest problems in video calls

Face alignment, noise reduction, and AI-powered super-resolution.

How the Police Use AI to Track and Identify You

This article explores the various ways that the police and the government use AI to track and identify people.

Datasaur snags $3.9M investment to build intelligent machine learning labeling platform

Datasaur, a member of the Y Combinator Winter 2020 batch, announced a $3.9 million investment today to help label training data with a platform designed for machine learning labeling teams.

Mobile + Edge

Announcing Fritz AI’s Support for SnapML in Lens Studio

Enhance your Snapchat AR Lenses with machine learning.

Arm unveils new chips for advanced driver assistance systems

Arm today announced a suite of technologies intended to make it easier for autonomous car developers to bring their designs to market.

YouTube Stories on iOS gains AI-powered speech enhancement

Google today launched Looking-to-Listen, a new audiovisual speech enhancement feature in YouTube Stories captured with iOS devices.

3d Scanner App - Free LIDAR 3d scanner for iPad Pro

Quickly capture, edit, and share 3d scans using the iPad Pro.


Clarifying exceptions and visualizing tensor operations in deep learning code

The author of this post introduces TensorSensor, which clarifies exceptions by augmenting messages and visualizing Python code to indicate the shape of tensor variables. It works with TensorFlow, PyTorch, and NumPy, as well as higher-level libraries like Keras and fastai.
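To give a feel for what "augmenting messages" with shapes can mean, here is a minimal plain-Python sketch. It is not TensorSensor's implementation or API: `ClarifyShapes`, `shape_of`, and `matmul` are hypothetical names invented for this illustration, and nested lists stand in for tensors. The context manager simply looks at the local variables of the code that raised and appends their shapes to the exception message.

```python
def shape_of(value):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def matmul(a, b):
    """Plain-Python matrix product that fails with a terse message."""
    if len(a[0]) != len(b):
        raise ValueError("size mismatch in matmul")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class ClarifyShapes:
    """Hypothetical helper: appends operand shapes to exceptions raised inside it."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc is None:
            return False
        # tb.tb_frame is the frame that contains the `with` block.
        local_vars = tb.tb_frame.f_locals
        shapes = ", ".join(
            f"{name}: {shape_of(value)}"
            for name, value in sorted(local_vars.items())
            if isinstance(value, list)
        )
        message = exc.args[0] if exc.args else ""
        exc.args = (f"{message} (shapes -> {shapes})",) + exc.args[1:]
        return False  # let the original exception propagate, now with shapes


def demo():
    W = [[1, 2, 3], [4, 5, 6]]  # shape (2, 3)
    x = [[1], [2]]              # shape (2, 1), incompatible with W
    try:
        with ClarifyShapes():
            matmul(W, x)
    except ValueError as err:
        return str(err)


print(demo())
```

The bare "size mismatch in matmul" message comes back with the shapes of `W` and `x` attached, which is usually the first thing you would check by hand anyway.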

Developing Real-Time, Automatic Sign Language Detection for Video Conferencing

Google AI presents a real-time sign language detection model and demonstrates how it can be used to provide video conferencing systems a mechanism to identify the person signing as the active speaker.

Massively Large-Scale Distributed Reinforcement Learning with Menger

Researchers at Google introduce Menger, a massively large-scale distributed RL infrastructure with localized inference that scales up to several thousand actors across multiple processing clusters (e.g., Borg cells), reducing the overall training time in the task of chip placement.

Libraries & Code

[GitHub] NVlabs/imaginaire

NVIDIA PyTorch GAN library with distributed and mixed precision support.

[GitHub] zhaohengyuan1/PAN

Efficient Image Super-Resolution Using Pixel Attention, in ECCV Workshop, 2020.

[GitHub] lucidrains/vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch.

Papers & Publications

Rethinking Attention with Performers

Abstract: We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
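For readers who want to see why random features give linear cost, here is a rough standard-library sketch. It is not the authors' code. It uses plain Gaussian projections rather than the orthogonal ones FAVOR+ calls for, it targets the kernel exp(q·k) and leaves out the usual scaling by the head dimension, and all function names and sizes are assumptions made for this example. The point it shows is that once queries and keys are mapped to positive features, attention can be computed as phi(Q)(phi(K)^T V) without ever building the L x L score matrix.

```python
import math
import random


def positive_random_features(x, projections):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is positive."""
    half_sq_norm = sum(v * v for v in x) / 2.0
    m = len(projections)
    return [
        math.exp(sum(wi * xi for wi, xi in zip(w, x)) - half_sq_norm) / math.sqrt(m)
        for w in projections
    ]


def quadratic_attention(q_feats, k_feats, values):
    """Builds every row of the L x L score matrix, then normalizes it."""
    out = []
    for q in q_feats:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_feats]
        total = sum(scores)
        out.append([
            sum(s * v[j] for s, v in zip(scores, values)) / total
            for j in range(len(values[0]))
        ])
    return out


def linear_attention(q_feats, k_feats, values):
    """Same output via phi(Q) (phi(K)^T V); no L x L matrix is formed."""
    m, d = len(k_feats[0]), len(values[0])
    kv = [[0.0] * d for _ in range(m)]  # phi(K)^T V, shape m x d
    k_sum = [0.0] * m                   # phi(K)^T 1, used for normalization
    for k, v in zip(k_feats, values):
        for i in range(m):
            k_sum[i] += k[i]
            for j in range(d):
                kv[i][j] += k[i] * v[j]
    out = []
    for q in q_feats:
        total = sum(q[i] * k_sum[i] for i in range(m))
        out.append([sum(q[i] * kv[i][j] for i in range(m)) / total for j in range(d)])
    return out


def random_matrix(rng, rows, cols, scale=1.0):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
L, d, m = 6, 4, 32  # sequence length, head dimension, number of random features
projections = random_matrix(rng, m, d)
Q = random_matrix(rng, L, d, scale=0.5)
K = random_matrix(rng, L, d, scale=0.5)
V = random_matrix(rng, L, d)

q_feats = [positive_random_features(q, projections) for q in Q]
k_feats = [positive_random_features(k, projections) for k in K]
slow = quadratic_attention(q_feats, k_feats, V)
fast = linear_attention(q_feats, k_feats, V)
max_gap = max(abs(a - b) for r1, r2 in zip(slow, fast) for a, b in zip(r1, r2))
print(f"largest difference between the two orderings: {max_gap:.2e}")
```

Both functions return the same numbers up to floating-point rounding; only the order of multiplication changes. The quadratic version does work proportional to L squared, while the linear version scales with L times m times d, which is where the savings on long sequences come from.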

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches. When pre-trained on large amounts of data and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
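The "sequences of image patches" idea is easy to picture in code. Below is a small plain-Python sketch, assuming an image stored as nested lists in height x width x channels order; `patchify` and the toy 32 x 32 image are made up for this illustration and are not taken from the paper's code. Each 16 x 16 patch is flattened into one vector, and the list of those vectors plays the role that word tokens play in NLP. The learned linear projection, class token, and position embeddings that the full model adds on top are left out.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into row-major flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    patch.extend(pixel)  # append this pixel's channel values
            patches.append(patch)
    return patches


# Toy 32 x 32 RGB image whose pixels encode their own (row, column) position.
image = [[[r, c, 0] for c in range(32)] for r in range(32)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 4 patches, each 16 * 16 * 3 = 768 values
```

With the same patch size, a 224 x 224 image would give 14 x 14 = 196 such tokens.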