Deep Learning Weekly: Issue #193
Accelerating CNNs on edge devices, use cases for SOTA speech-to-text models, a new book about the history of AI, a guide to model inference optimization, and more
Sponsored by Ray Summit
Want to learn the best way to scale ML? Find out how Ray is being used for large-scale machine learning. Topics include: ML in production, MLOps, deep learning, reinforcement learning, cloud computing, serverless & Ray libraries. Register for free to join live & on-demand.
This week in deep learning, we bring you a new method to speed up drug development, a self-supervised learning framework for hyperparameter tuning, a few tricks to accelerate convolutional neural network inference on mobile devices, as well as nice use cases built with the latest speech-to-text models.
You may also enjoy learning about the history of AI, Snorkel AI’s fundraising to make data labeling more efficient, and a nice overview of model inference optimization!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Recent studies have found that datasets used in AI research can contain serious flaws, such as racist or incorrect labels. These flaws are very likely distorting our understanding of the field’s progress.
The DeepMind Safety Research team analyzes the harms that can arise when a language AI system is misspecified. The misspecification can come from the training data, from the training process, or from differences between training and deployment environments.
Snorkel AI wants to make it easier for subject matter experts to build labeled datasets, and announced a new tool to build common ML applications.
In a recently published work, researchers developed a new method to generate novel proteins. This offers fantastic potential for a number of future applications, such as faster and more cost-efficient development of drugs.
Facebook Research introduces a new self-supervised learning framework for model selection and hyperparameter tuning, which works much faster than baseline algorithms.
This article details where the algorithmic biases of AI systems come from and how they can be mitigated.
This post is a step-by-step look at using PyTorch Mobile with the Android Neural Networks API (NNAPI) to run state-of-the-art computer vision models on mobile devices.
This article summarizes how ML is used to design accelerator hardware for improving AI inference. The latest significant work in this field is Google Research’s APOLLO, which achieves up to 25% speedup over baseline algorithms.
A very concise post describing a few simple steps to accelerate the edge inference of convolutional neural networks.
This tutorial outlines two experiments using Wav2Vec2, made possible by its addition to Hugging Face’s library: speech-to-text-to-translation and speech-to-text-to-summarization.
Salesforce Research presents the latest techniques they’ve developed to perform a theoretical analysis of wide neural networks.
This book is a comprehensive history of AI, told through the lives of its major players and the companies bringing those technologies to life.
This post covers the optimization of a deep learning model’s inference. It includes engineering topics like model quantization and binarization, more research-oriented topics like knowledge distillation, as well as well-known hacks.
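To make the quantization idea concrete, here is a minimal sketch of affine 8-bit quantization in plain Python. It is not taken from the linked post: the function names, the sample weights, and the choice of a signed int8 range are assumptions made for illustration, and real frameworks add per-channel scales, calibration, and fused integer kernels on top of this.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers with an affine (scale + zero point) scheme."""
    qmin, qmax = -128, 127
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Fall back to a scale of 1.0 if every value is zero (assumption for this sketch).
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical weights, chosen only to exercise the functions.
weights = [-1.2, -0.3, 0.0, 0.4, 2.5]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of roughly half a quantization step (`scale / 2`) per value.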
Libraries & Code
A deep learning-based translation library built on Hugging Face transformers and Facebook's mBART-Large model.
The official implementation of StyleCLIP, a method to manipulate images using a driving text.
LIT is an open-source platform developed by Google Research for visualizing and understanding NLP models. Major improvements have been added recently.
Papers & Publications
Abstract: We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e., shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e., dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g., ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks.
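A quick way to see what a hierarchy of convolutional token embeddings does is to count the tokens after each stage. The sketch below uses the standard convolution output-size formula. The per-stage kernel, stride, and padding values and the 224-pixel input are assumptions chosen for illustration, not settings stated in the abstract.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def token_counts(image_size, stages):
    """Number of tokens (height * width of the feature map) after each stage."""
    counts = []
    size = image_size
    for kernel, stride, padding in stages:
        size = conv_output_size(size, kernel, stride, padding)
        counts.append(size * size)
    return counts


# Assumed (kernel, stride, padding) for three embedding stages; illustrative only.
stages = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]
counts = token_counts(224, stages)
print(counts)  # [3136, 784, 196]
```

Under these assumptions the sequence length shrinks fourfold at each later stage. Progressively shorter sequences of this kind are one way a hierarchical design can reduce attention cost compared with a fixed-length token sequence.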
Abstract: We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of 3.4% errors across the 10 datasets, where for example 2916 label errors comprise 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (54% of the algorithmically-flagged candidates are indeed erroneously labeled). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy - our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%.
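The flagging step can be sketched in a few lines. This is a simplified illustration of the thresholding idea behind confident learning, not the authors' implementation: the function name, the toy labels, and the toy probabilities are all hypothetical. The paper's pipeline also estimates a full joint distribution of given and true labels and sends the candidates to human reviewers.

```python
def find_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a confidently predicted class.

    labels: list of integer class ids (the given, possibly noisy labels).
    pred_probs: list of per-class probability lists, ideally out-of-sample.
    """
    num_classes = len(pred_probs[0])
    # Per-class threshold: the mean predicted probability of class k over the
    # examples labeled k, i.e. the model's average self-confidence for k.
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)

    issues = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose probability clears that class's own threshold.
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class here
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            issues.append(i)
    return issues


# Toy data: example 2 is labeled 0 but the model strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]
issues = find_label_issues(labels, pred_probs)
print(issues)  # [2]
```

Using per-class thresholds rather than a single global cutoff keeps the check from over-flagging classes on which the model is systematically less confident.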
Abstract: Using transformers over large generated datasets, we train models to learn mathematical properties of differential systems, such as local stability, behavior at infinity and controllability. We achieve near perfect prediction of qualitative characteristics, and good approximations of numerical features of the system. This demonstrates that neural networks can learn to perform complex computations, grounded in advanced theory, from examples, without built-in mathematical knowledge.