Deep Learning Weekly: Issue #203
Facebook’s fundamental theory on DNNs beyond the infinite-width limit, an award-winning CNN-based model called GarbageNet, a paper on Decision Transformers, and more
This week in deep learning, we bring you Amazon's Just Walk Out technology, a deep learning model that sorts trash exceptionally well, a TinyML tutorial that lets your edge device interpret your dog's mood, and Facebook's fundamental theory on Deep Neural Networks beyond the infinite-width abstraction.
You may also enjoy a tool that lets you generate verses from your favorite rappers, a comprehensive tutorial on modeling and optimizing ML pipelines, a complete data augmentation library for audio, image, text and video, a paper on Decision Transformers and Sequence Modeling, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Google’s next-generation Tensor Processing Units use a deep reinforcement learning technique for their highly complex floorplanning, highlighting the synergistic strength of human and artificial intelligence.
Amazon opens a grocery store equipped with its Just Walk Out technology, which uses sensors and machine learning to let consumers shop without waiting in a checkout line.
GarbageNet, a CNN-based model, uses a three-pronged approach to categorize new garbage items it has not yet encountered with an overall accuracy of 96.96%.
Uberduck is a new text-to-speech tool that can synthesize verses from Tupac, Jay-Z, Kanye West and other rappers/celebrities in a matter of seconds.
Leaders, partners, and directors from highly regarded corporations and consulting groups discuss the dilemmas of controlling the behavior of large-scale artificial intelligence systems.
Fiddler is a platform for heterogeneous model explainability that can handle a variety of problems from retail to healthcare.
Mobile & Edge
A technical tutorial using a Nano 33 BLE Sense and the Edge Impulse Studio to interpret a dog's mood based on vocal signals.
A comprehensive article describing the use of machine learning, AWS Prediction, and Neosensory Buzz's haptic feedback to help deaf parents connect with their kids.
MLCommons releases MLPerf Tiny Inference which offers benchmark reporting for key tasks (such as anomaly detection) and comparisons of tinyML devices, systems and software.
A blog post showcasing how to leverage the latest offerings (on-device ML learning pathway, EfficientDet-lite, Object Detection model maker and metadata writer API) from TensorFlow Lite to build a state-of-the-art object detector.
A comprehensive article that discusses a qualitative analysis of automated driving systems.
An introductory article to the Principles of Deep Learning Theory book, which lays out an effective theory of DNNs beyond the infinite-width abstraction.
A detailed blog post going through the state-of-the-art results of Facebook’s HuBERT, a new approach for learning self-supervised speech representations.
A step-by-step tutorial for machine learning pipelines and optimization using scikit-learn.
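For readers unfamiliar with the idea, a scikit-learn pipeline chains preprocessing and a model into one estimator whose hyperparameters can be optimized jointly. A minimal sketch (not the tutorial's own code — the scaler, classifier, and search grid here are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Chain scaling and classification so both are fit inside each CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])

# Step-name prefixes ("clf__") route parameters to the right pipeline stage
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Because the scaler lives inside the pipeline, cross-validation never leaks test-fold statistics into preprocessing.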
A technical article that discusses using Interpret, an open-source Python library for performance analysis, to create dashboards for machine learning models.
Libraries & Code
AugLy is a data augmentations library that currently supports four modalities (audio, image, text and video) and over 100 augmentations.
XBNet, built on PyTorch, combines tree-based models with neural networks to create a robust architecture trained using a novel optimization technique.
Gradio is an open-source Python library that lets you create demos of your machine learning code, get feedback on model performance from users and debug your model interactively.
Papers & Publications
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
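The key conditioning signal in Decision Transformer is the return-to-go: at each timestep, the sum of rewards from that step to the end of the trajectory. A minimal sketch of that computation in plain Python (not the paper's code):

```python
def returns_to_go(rewards):
    """Suffix sums of a reward sequence: rtg[t] = sum(rewards[t:])."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

# Conditioning tokens for a 3-step trajectory with rewards 1, 0, 2
print(returns_to_go([1.0, 0.0, 2.0]))  # → [3.0, 2.0, 2.0]
```

At training time these targets come from logged trajectories; at test time the model is instead prompted with the *desired* return and decrements it by each observed reward.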
Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e., words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
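The "transposed" attention above can be sketched in a few lines of NumPy: instead of an n×n token attention map, XCA builds a d×d map from the cross-covariance of (column-normalized) keys and queries, so the cost grows linearly in the number of tokens. This is a simplified single-head sketch, omitting the learned temperature, multiple heads, and the local patch-interaction blocks of the full XCiT:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def xca(x, wq, wk, wv):
    """Single-head cross-covariance attention over an (n_tokens, d) input."""
    q, k, v = x @ wq, x @ wk, x @ wv
    # L2-normalize each feature column across the token axis
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    k = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    attn = softmax(k.T @ q, axis=0)  # (d, d) channel-to-channel attention map
    return v @ attn                  # (n_tokens, d); cost O(n * d^2)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = xca(x, wq, wk, wv)
print(out.shape)  # (16, 8)
```

Because the attention matrix is d×d rather than n×n, doubling the number of tokens only doubles the cost, which is what makes high-resolution images tractable.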