Deep Learning Weekly Issue #183

Snap's latest AI acquisition, 3000+ ML datasets, Spotify's new speech recognition patent, and more

Hey folks,

This week in deep learning we bring you Spotify's new patent that involves monitoring users' speech to recommend music, OkCupid's gradient-descent-based solution to collaborative filtering at scale, this article about how image-generation algorithms are regurgitating the same sexist, racist ideas that exist on the internet, and Snap's acquisition of Ariel AI to boost Snapchat AR features.

You may also enjoy these implementations of architectures from deep learning papers, this article about Google Brain's PyGlove for programming AutoML based on symbolic programming, and more!

As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.

Until next week!


“Liquid” machine-learning system adapts to changing conditions

The new type of neural network could aid decision making in autonomous driving and medical diagnosis.

New Spotify Patent Involves Monitoring Users’ Speech to Recommend Music

The streaming platform is interested in extracting data points like emotional state, gender, age, and accent to hone its recommendations.

An AI saw a cropped photo of AOC. It autocompleted her wearing a bikini.

Image-generation algorithms are regurgitating the same sexist, racist ideas that exist on the internet.

Pinecone exits stealth with a vector database for machine learning

Pinecone says it can dynamically transform and index billions of high-dimensional vectors, answering queries such as nearest-neighbor and max-dot-product search with high accuracy in milliseconds.
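For readers unfamiliar with the two query types named here, the following is a minimal brute-force sketch in plain Python. It is not Pinecone's API or indexing method: the function names and toy vectors are made up for illustration, and a real vector database would use an approximate index rather than scanning every vector.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, vectors):
    """Index of the stored vector with the smallest distance to the query."""
    return min(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))


def max_dot_product(query, vectors):
    """Index of the stored vector with the largest inner product with the query."""
    return max(range(len(vectors)), key=lambda i: dot(query, vectors[i]))


# Toy data (hypothetical): three stored vectors and one query.
vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
query = [0.9, 0.2]

nn_index = nearest_neighbor(query, vectors)
mdp_index = max_dot_product(query, vectors)
```

On this toy data the two queries return different items: the closest vector is the first one, while the long third vector wins on inner product. That difference is why the two search types are usually listed separately.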

These Doctors Are Using AI to Screen for Breast Cancer

During the pandemic, thousands of women have skipped scans and check-ups. So physicians tapped an algorithm to predict those at the highest risk.

Mobile + Edge

SnapML: How to Run Machine Learning in Snapchat

Snapchat, the popular social app, launched SnapML last June: an important update to its development tool (Lens Studio) that allows the use of machine learning algorithms to create Lenses, that is, filters that enrich the user experience.

Snap acquires Ariel AI to boost Snapchat augmented reality features

Ariel AI’s 12 engineers have been tasked with making the Snapchat camera “smarter” and improving the augmented reality experiences that allow Snapchat users to engage with the real world.

Improving Mobile App Accessibility with Icon Detection

IconNet is a vision-based object detection model, included in the latest version of Voice Access, that can automatically detect icons on-screen, enabling improved accessibility across a range of apps for hands-free use.

Applying Embedded Machine Learning to Temperature Monitoring in Cold Chain Applications

In their latest blog post, Zin Thein Kyaw of Edge Impulse explores how to use Edge Impulse for deploying embedded ML on low-power temperature sensors for cold chain monitoring applications.


Teaching AI to manipulate objects using visual demos

Facebook AI created and open-sourced a new technique that teaches robots to learn a model of their environment, observe human behavior, and develop their own reward system.

Large-scale collaborative filtering to predict who on OkCupid will like you*, with JAX**

Since memory-based collaborative filtering is not very scalable, engineers at OkCupid learn vector representations of users with gradient descent based on the hundreds of millions of “votes” each week.
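The general idea of learning user vectors from votes with gradient descent can be sketched as follows. This is not OkCupid's implementation (the article uses JAX at far larger scale); it is a small plain-Python illustration that assumes a logistic loss on like/pass votes and separate "voter" and "votee" embeddings. All names, hyperparameters, and data are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_embeddings(votes, num_users, dim=4, lr=0.1, epochs=500, seed=0):
    """Fit voter/votee vectors so sigmoid(voter[u] . votee[v]) matches the votes.

    votes is a list of (voter_id, votee_id, liked) with liked in {0, 1}.
    """
    rng = random.Random(seed)
    voter = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_users)]
    votee = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_users)]
    for _ in range(epochs):
        for u, v, liked in votes:
            # Gradient of the log loss with respect to the logit is (p - label).
            err = sigmoid(dot(voter[u], votee[v])) - liked
            for k in range(dim):
                grad_u = err * votee[v][k]
                grad_v = err * voter[u][k]
                voter[u][k] -= lr * grad_u
                votee[v][k] -= lr * grad_v
    return voter, votee


def predict(voter, votee, u, v):
    """Estimated probability that user u likes user v."""
    return sigmoid(dot(voter[u], votee[v]))


# Toy votes (hypothetical): (voter, votee, liked).
votes = [(0, 1, 1), (0, 2, 0), (1, 0, 1), (1, 2, 1), (2, 0, 0), (2, 1, 1)]
voter, votee = train_embeddings(votes, num_users=3)
p_like = predict(voter, votee, 0, 1)
p_pass = predict(voter, votee, 0, 2)
```

Unlike memory-based filtering, prediction here needs only two small vectors per pair, so the cost per query does not grow with the number of stored votes.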

Learning to Reason Over Tables from Less Data

Check out a new strategy from Google AI for intermediate pre-training and data filtering that enables table parsing models to learn better, faster & from less data, achieving 4x gains in speed & memory utilization without a significant drop in performance.

NVIDIA, UToronto, McGill & Vector Study Delivers Real-Time SDF Rendering & SOTA Complex Geometry Reconstruction

A new study by NVIDIA, University of Toronto, McGill University and the Vector Institute introduces an efficient neural representation that enables real-time rendering of high-fidelity neural SDFs for the first time while delivering SOTA quality geometric reconstruction.

Google Brain Introduces Symbolic Programming + PyGlove Library to Reformulate AutoML

A recent study by the Google Brain Team proposes a new way of programming automated machine learning (AutoML) based on symbolic programming.


[GitHub] lab-ml/nn

Minimal implementations/tutorials of deep learning papers with side-by-side notes, including transformers (original, xl, switch, feedback), optimizers (adam, radam, adabelief), gans (dcgan, cyclegan), reinforcement learning (ppo, dqn), capsnet, sketch-rnn, etc.

[GitHub] lucidrains/bottleneck-transformer-pytorch

Implementation of Bottleneck Transformer, a SotA visual recognition model with convolution + attention that outperforms EfficientNet and DeiT in terms of performance-compute trade-off, in PyTorch.


Machine Learning Datasets

Papers With Code is now indexing 3000+ research datasets from machine learning. Find datasets by task and modality, compare usage over time, browse benchmarks, and much more.

Papers & Publications

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Abstract: Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformers (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance compared with CNNs when trained from scratch on a midsize dataset (e.g., ImageNet). We find it is because: 1) the simple tokenization of input images fails to model the important local structure (e.g., edges, lines) among neighboring pixels, leading to its low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness in fixed computation budgets and limited training samples.

To overcome such limitations, we propose a new Tokens-To-Token Vision Transformers (T2T-ViT), which introduces 1) a layer-wise Tokens-to-Token (T2T) transformation to progressively structurize the image to tokens by recursively aggregating neighboring Tokens into one Token (Tokens-to-Token), such that local structure presented by surrounding tokens can be modeled and tokens length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers motivated by CNN architecture design after extensive study. Notably, T2T-ViT reduces the parameter counts and MACs of vanilla ViT by 200%, while achieving more than 2.5% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves comparable performance with MobileNets when directly training on ImageNet. For example, T2T-ViT with ResNet50 comparable size can achieve 80.7% top-1 accuracy on ImageNet. (Code: this https URL)
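The "aggregating neighboring tokens into one token" step can be illustrated with a small plain-Python sketch. This is not the authors' code: it assumes tokens sit on a 2D grid, merges each overlapping window by concatenation, uses no padding, and omits the Transformer layer that the paper applies between aggregation steps. The function name and the kernel and stride values are chosen for illustration only.

```python
def tokens_to_token(tokens, height, width, kernel=3, stride=2):
    """Merge each kernel x kernel window of grid tokens into one longer token.

    tokens is a row-major list of height * width feature vectors (lists).
    Returns (merged_tokens, new_height, new_width).
    """
    assert len(tokens) == height * width
    grid = [tokens[r * width:(r + 1) * width] for r in range(height)]
    merged = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            new_token = []
            for r in range(top, top + kernel):
                for c in range(left, left + kernel):
                    # Concatenate the neighbor's features into the new token.
                    new_token.extend(grid[r][c])
            merged.append(new_token)
    new_height = (height - kernel) // stride + 1
    new_width = (width - kernel) // stride + 1
    return merged, new_height, new_width


# Toy input (hypothetical): a 5x5 grid of 1-dimensional tokens numbered 0..24.
tokens = [[float(i)] for i in range(25)]

# First aggregation: 25 tokens of length 1 become 4 tokens of length 9.
stage1, h1, w1 = tokens_to_token(tokens, 5, 5, kernel=3, stride=2)

# Applying it again (recursively) yields 1 token of length 36.
stage2, h2, w2 = tokens_to_token(stage1, h1, w1, kernel=2, stride=2)
```

Each step shortens the token sequence while every new token carries the features of its spatial neighbors, which is the local structure the abstract says plain ViT tokenization misses.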

Muppet: Massive Multi-task Representations with Pre-Finetuning

Abstract: We propose pre-finetuning, an additional large-scale learning stage between language model pre-training and fine-tuning. Pre-finetuning is massively multi-task learning (around 50 datasets, over 4.8 million total labeled examples), and is designed to encourage learning of representations that generalize better to many different tasks. We show that pre-finetuning consistently improves performance for pretrained discriminators (e.g., RoBERTa) and generation models (e.g., BART) on a wide range of tasks (sentence prediction, commonsense reasoning, MRC, etc.), while also significantly improving sample efficiency during fine-tuning. We also show that large-scale multi-tasking is crucial; pre-finetuning can hurt performance when few tasks are used up until a critical point (usually above 15) after which performance improves linearly in the number of tasks.