Deep Learning Weekly: Issue #209
DeepMind’s newest general-purpose architectures, an AutoML tool for embedded machine learning models, OpenAI’s GPU programming language for neural networks, and more
This week in deep learning, we bring you DeepMind's newest general-purpose architectures, a novel training approach for misogyny detection, an AutoML tool for embedded machine learning models, and OpenAI's GPU programming language for neural networks.
You may also enjoy a technical article on TensorFlow Model Optimization Pruning API's updates, dragonfly-based neural network design, an annotated implementation of Graph Attention Networks, a paper on approximating Turing Machines using Transformers, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
DeepMind introduces Perceiver and Perceiver IO, general-purpose architectures that can process a variety of inputs, for real-world applications such as multimodal understanding.
MIT researchers used a new probability-based algorithm, machine learning, natural language processing, and patent network analytics to predict which technologies are rapidly improving and which ones are overhyped.
Researchers from Denmark designed an annotation approach that yielded a misogyny detection model achieving 85% accuracy on posts from popular social media platforms.
A team from DeepMind and Google Research leverages neural networks to automatically construct effective heuristics from a dataset for mixed integer programming (MIP) problems. The approach significantly outperforms classical MIP solver techniques.
Seoul-based education tech startup Mathpresso aims to disrupt the traditional tutoring industry with AI and plans to join the growing list of Korean companies going public.
Mobile & Edge
Edge Impulse launches the EON Tuner, an AutoML tool that helps you select the best embedded machine learning model for your application within the constraints of your target device.
A technical article demonstrating the updates to the TensorFlow Model Optimization (TF MOT) Pruning API that simplify pruning and enable developers to build sparse models for fast on-device inference.
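Pruning schedules of this kind typically ramp sparsity up over training rather than pruning all at once. The sketch below illustrates a polynomial sparsity ramp in plain Python — the scheduling idea behind APIs like TF MOT's PolynomialDecay, not the TF MOT API itself; the function name and defaults are ours:

```python
def sparsity_at(step, begin_step, end_step, initial=0.0, final=0.8, power=3):
    """Polynomial-decay sparsity schedule (a sketch of the scheduling
    idea used by pruning APIs, not the TF MOT API itself).
    Sparsity ramps from `initial` to `final` over [begin_step, end_step],
    changing fastest early in training."""
    step = min(max(step, begin_step), end_step)   # clamp to the ramp window
    frac = (step - begin_step) / (end_step - begin_step)
    return final + (initial - final) * (1.0 - frac) ** power

# Sparsity rises quickly at first, then levels off at `final`.
for s in (0, 500, 1000, 2000):
    print(round(sparsity_at(s, 0, 2000), 3))
```

At each pruning step, the trainer would zero out the smallest-magnitude weights until the layer reaches the scheduled sparsity, so the network adapts gradually instead of losing most of its weights at once.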
A comprehensive tutorial on a "speed trap" that uses ML to identify vehicles, a radar sensor to measure speed, and a cellular module to report data to the cloud.
Light Gestures is a technical prototype that demonstrates natural interactions between emerging gesture-tracking technologies and lighting.
A technical blog on Triton 1.0, an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code.
An article that discusses the potential biomimicry of modern neural networks based on a dragonfly’s interception system.
A guide detailing the usage of the TF-Agents Bandits library with the help of the MovieLens Environment.
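For readers new to bandits, the core loop the library implements can be sketched in a few lines. This is a generic epsilon-greedy agent in plain Python — an illustration of the bandit idea, not the TF-Agents API; all names here are ours:

```python
import random

def run_epsilon_greedy(true_means, steps=5000, eps=0.1, seed=0):
    """Minimal epsilon-greedy bandit loop: with probability `eps`
    explore a random arm, otherwise exploit the arm with the best
    running mean reward so far."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    values = [0.0] * n                        # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n)            # explore
        else:
            arm = max(range(n), key=lambda a: values[a])  # exploit
        reward = true_means[arm] + rng.gauss(0.0, 0.1)    # noisy reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

counts, values = run_epsilon_greedy([0.2, 0.5, 0.8])
print(counts.index(max(counts)))  # the best arm (index 2) is pulled most
```

Libraries like TF-Agents wrap this same interaction loop in environment and agent abstractions, and swap in more sophisticated policies (UCB, Thompson sampling, contextual linear models) for the exploit step.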
An annotated PyTorch implementation of a Graph Attention Network v2 (GATv2).
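The defining change in GATv2 is that the LeakyReLU sits between the weight matrix and the attention vector, making the attention "dynamic" rather than static. A single-head NumPy sketch of that scoring rule (identity-free toy shapes; the weight names and the separate value transform are our illustrative choices, not the paper's exact parameterization):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gatv2_layer(h, adj, W, a, Wv):
    """Single-head GATv2-style attention sketch.
    Scores are a^T LeakyReLU(W [h_i || h_j]) -- the nonlinearity between
    W and a is the key change from GAT v1, where it was
    LeakyReLU(a^T W [h_i || h_j]).
    h: (N, F) node features, adj: (N, N) 0/1 adjacency (with self-loops),
    W: (D, 2F) score transform, a: (D,) attention vector,
    Wv: (D, F) value transform (illustrative, ours)."""
    N = h.shape[0]
    scores = np.full((N, N), -np.inf)         # non-neighbors get -inf
    for i in range(N):
        for j in range(N):
            if adj[i, j]:
                z = np.concatenate([h[i], h[j]])
                scores[i, j] = a @ leaky_relu(W @ z)
    # softmax over each node's neighborhood
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)
    return alpha @ h @ Wv.T, alpha

rng = np.random.default_rng(0)
N, F, D = 4, 3, 5
h = rng.normal(size=(N, F))
adj = np.ones((N, N))                          # fully connected incl. self-loops
W = rng.normal(size=(D, 2 * F))
a = rng.normal(size=D)
Wv = rng.normal(size=(D, F))
out, alpha = gatv2_layer(h, adj, W, a, Wv)
print(np.allclose(alpha.sum(axis=1), 1.0))     # attention rows sum to 1
```

The annotated implementation linked above builds the same computation with learned PyTorch parameters, multiple heads, and efficient batched indexing instead of the explicit double loop used here for clarity.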
Libraries & Code
FaceSwap is a tool that utilizes deep learning to recognize and swap faces in pictures and videos.
A source separation library that makes it easy to train models, and provides pre-trained SOTA models for performing various flavors of separation.
Learn the foundations of ML through intuitive explanations, clean code and visuals.
Papers & Publications
A common lens to theoretically study neural net architectures is to analyze the functions they can approximate. However, the constructions from approximation theory often have unrealistic aspects, for example, reliance on infinite precision to memorize target function values, which make these results potentially less meaningful. To address these issues, this work proposes a formal definition of statistically meaningful approximation which requires the approximating network to exhibit good statistical learnability. We present case studies on statistically meaningful approximation for two classes of functions: boolean circuits and Turing machines. We show that overparameterized feedforward neural nets can statistically meaningfully approximate boolean circuits with sample complexity depending only polynomially on the circuit size, not the size of the approximating network. In addition, we show that transformers can statistically meaningfully approximate Turing machines with computation time bounded by T, requiring sample complexity polynomial in the alphabet size, state space size, and log(T). Our analysis introduces new tools for generalization bounds that provide much tighter sample complexity guarantees than the typical VC-dimension or norm-based bounds, which may be of independent interest.
The computer vision world has been regaining enthusiasm for various pre-trained models, including both classical ImageNet supervised pre-training and recently emerged self-supervised pre-training such as SimCLR and MoCo. Pre-trained weights often boost a wide range of downstream tasks, including classification, detection, and segmentation. The latest studies suggest that pre-training benefits from gigantic model capacity. This prompts us to ask: after pre-training, does a pre-trained model indeed have to stay large for its downstream transferability?
In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly sparse matching subnetworks that can be trained in isolation from (nearly) scratch yet still reach the full models' performance. We extend the scope of LTH and ask whether matching subnetworks that enjoy the same downstream transfer performance still exist in pre-trained computer vision models. Our extensive experiments convey an overall positive message: for all pre-trained weights obtained from ImageNet classification, SimCLR, and MoCo, we are consistently able to locate such matching subnetworks at 59.04% to 96.48% sparsity that transfer universally to multiple downstream tasks, with no performance degradation compared to using the full pre-trained weights. Further analyses reveal that subnetworks found from different pre-training regimes tend to yield diverse mask structures and perturbation sensitivities. We conclude that the core LTH observations remain generally relevant in the pre-training paradigm of computer vision, though more delicate discussion is needed in some cases.
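The pruning step at the heart of locating such matching subnetworks can be sketched in a few lines: keep the largest-magnitude weights after training, then rewind the survivors to their pre-trained values. This toy NumPy illustration shows one-shot magnitude pruning, not the paper's full iterative procedure; all names are ours:

```python
import numpy as np

def find_ticket_mask(trained_w, sparsity):
    """Keep only the largest-magnitude trained weights; the binary mask
    defines a sparse subnetwork ("ticket") in the lottery-ticket sense."""
    k = int(sparsity * trained_w.size)        # number of weights to prune
    thresh = np.partition(np.abs(trained_w).ravel(), k - 1)[k - 1]
    return (np.abs(trained_w) > thresh).astype(float)

rng = np.random.default_rng(0)
w_init = rng.normal(size=(16, 16))            # pre-trained weights (stand-in)
w_trained = w_init + 0.1 * rng.normal(size=w_init.shape)  # after fine-tuning
mask = find_ticket_mask(w_trained, 0.9)       # 90% sparsity, within the paper's range
w_ticket = w_init * mask                      # rewind kept weights to pre-trained values
print(round(float((mask == 0).mean()), 2))    # fraction of weights pruned (~0.9)
```

The paper's question is whether `w_ticket`-style subnetworks of pre-trained vision models, fine-tuned in place of the dense weights, match full downstream transfer performance; their experiments answer largely yes at 59-96% sparsity.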
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel to form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a detailed mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporating these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under a similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.7 mIoU.
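The cross-shaped window idea can be illustrated compactly: restrict attention to stripes of a fixed width, run one channel group over horizontal stripes and the other over vertical stripes, and concatenate. The NumPy sketch below uses identity Q/K/V maps and assumes stripe width divides the feature-map height; the real model learns projections, uses multiple heads, and adds LePE:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stripe_attention(x, sw, horizontal=True):
    """Self-attention restricted to stripes of width `sw` -- a sketch of
    one half of CSWin's cross-shaped window (identity Q/K/V for brevity).
    Assumes `sw` divides the stripe axis length."""
    if not horizontal:                        # vertical stripes: swap H and W
        x = x.transpose(1, 0, 2)
    H, W, C = x.shape
    out = np.empty_like(x)
    for s in range(0, H, sw):                 # each stripe of `sw` rows
        stripe = x[s:s + sw].reshape(-1, C)   # tokens inside this stripe
        attn = softmax(stripe @ stripe.T / np.sqrt(C))
        out[s:s + sw] = (attn @ stripe).reshape(sw, W, C)
    return out if horizontal else out.transpose(1, 0, 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
# Concatenating a horizontal half and a vertical half forms the "cross":
# each output location aggregates over its full row-stripe and column-stripe.
y = np.concatenate([stripe_attention(x[..., :2], 2, True),
                    stripe_attention(x[..., 2:], 2, False)], axis=-1)
print(y.shape)  # (8, 8, 4)
```

Because each token attends only within its stripes, the cost grows with the stripe size rather than with the whole feature map, which is how CSWin keeps computation near local-attention levels while widening each token's field of interaction.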