Deep Learning Weekly: Issue #224

Nvidia GTC, Landing AI, Sleep Sensing, AnimeGANv2, MERLOT and more.

Hey folks,

This week in deep learning, we bring you the Nvidia GTC Keynote, Landing AI’s fundraising round, a nice tool to use AnimeGANv2, and a paper about ethical AI.

You may also enjoy an introduction to deep learning optimization theory, a multilingual translation model, a deep dive into Apple’s on-device image segmentation, and more!

As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.

Until next week!


Nvidia GTC Keynote

In this keynote, Nvidia’s CEO makes several major announcements regarding the company’s strategy and future releases: Nvidia Quantum-2, Nvidia Omniverse, Nvidia NeMo Megatron, and more.

GPT-3 is No Longer the Only Game in Town

GPT-3, OpenAI’s generative text model, is estimated to have cost between 10 and 20 million dollars to train. Nevertheless, over the course of 2021, a half dozen or so models as big as or bigger than GPT-3 have been announced.

Why AI Lags Behind the Human Brain in Computational Power

Scientists have tried to model neurons found in the human brain with artificial neural networks, and found that human neurons are much more complex than previously thought. This means that simulating the brain would require staggeringly large computational resources.

Daniel Ek puts €100M into defense startup, to support democracies

Spotify’s founder will put €100 million into Helsing, a European defense AI company aiming to boost defense and national security among liberal democracies by making them more efficient.

How AI is reinventing what computers are

Computers have not changed much in the last 40 years. This article explores how AI changes that on at least three fronts: how computers are made, how they’re programmed, and how they’re used.

Landing AI brings in $57M for its machine learning operations tools

Just over a year after launching, Landing AI, Andrew Ng’s company, secured a $57 million round of Series A funding to help manufacturers more easily and quickly build and deploy AI systems.

Mobile & Edge

On-device Panoptic Segmentation for Camera Using Transformers

Apple introduces HyperDETR, an image segmentation architecture that is compact and efficient enough to execute on-device without impacting battery life. It enables a wide range of features in the Camera app.

Enhanced Sleep Sensing in Nest Hub

Google AI explains how they enhanced Sleep Sensing, a feature that helps users better understand their sleep patterns and nighttime wellness, based on sleep staging classification and audio source separation models.

Google Tensor is a milestone for machine learning

Google’s team of researchers came together across hardware, software, and ML to build Google Tensor, a chip that can deliver totally new capabilities for Pixel users by keeping pace with the latest advancements in ML.


Deep Learning Optimization Theory — Introduction

This post introduces experimental and theoretical approaches to studying optimization in deep learning. It argues that a theory explaining the convergence of stochastic gradient descent is still needed.
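As a minimal illustration of the kind of optimization these posts analyze, here is a plain gradient descent loop on a one-dimensional quadratic loss (a toy sketch, not code from the post; SGD would replace the exact gradient with a noisy minibatch estimate):

```python
def gradient_descent(w0, lr=0.1, steps=100):
    """Minimize f(w) = 0.5 * w**2 with plain gradient descent.

    The gradient of f is simply w, so each step shrinks w by a
    factor of (1 - lr) and the iterate converges to the minimizer
    w = 0. SGD would add minibatch noise to each gradient.
    """
    w = w0
    for _ in range(steps):
        grad = w            # exact gradient of 0.5 * w**2
        w = w - lr * grad
    return w

w_final = gradient_descent(2.0)  # approaches the minimizer w = 0
```

For this convex toy loss, convergence is easy to prove; the open theoretical question the post discusses is why similar dynamics succeed on highly non-convex deep network losses.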

Rliable: Better Evaluation for Reinforcement Learning - A Visual Explanation

This post gives a visual explanation of the various tools used by the rliable library to better evaluate and compare reinforcement learning algorithms: score normalization, stratified bootstrap, interquartile mean, and more.

Tracking Databricks Notebooks and Experiments with Comet

Data Scientist Matt Blasa explores how tracking your ML experiments in a Databricks environment allows more control over model versioning, as well as the ability to keep track of and log metrics, data visualizations, dataset artifacts, and more.

Explaining Machine Learning Models: A Non-Technical Guide to Interpreting SHAP Analyses

SHAP is a powerful ML interpretation technique. This guide provides an actionable framework for communicating SHAP analyses to non-technical stakeholders.
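At its core, SHAP assigns each feature an additive contribution such that the contributions sum to the prediction minus a baseline prediction. Here is a self-contained sketch computing exact Shapley values for a hypothetical two-feature model (illustrating the idea, not the shap library's API):

```python
from itertools import combinations
from math import factorial

def model(x1, x2):
    # Hypothetical toy model: linear terms plus an interaction.
    return 2 * x1 + 3 * x2 + x1 * x2

def shapley_values(x, baseline):
    """Exact Shapley values: each feature's average marginal
    contribution over all feature subsets, with absent features
    set to their baseline values."""
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi += weight * (model(*with_i) - model(*without_i))
        phis.append(phi)
    return phis

x, base = [1.0, 2.0], [0.0, 0.0]
phi = shapley_values(x, base)
# Additivity: contributions sum to prediction minus baseline prediction.
```

The additivity property is what makes SHAP plots readable for stakeholders: every feature bar is a share of the gap between the model's output and its baseline.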

A First-Principles Theory of Neural Network Generalization

We do not really understand why deep learning works, and in particular how the functions learned by neural networks generalize so well to unseen data. The approach described here gives some answers to this complex question.

The first AI model that translates 100 languages without relying on English data

Facebook AI releases M2M-100, the first multilingual machine translation model that translates between any pair of 100 languages without relying on English data. It reaches unprecedented accuracy for most languages.

Libraries & Code


AnimeGANv2

Try AnimeGANv2 with any image you upload. AnimeGANv2 is the latest model for transforming real portraits into anime-style images, combining neural style transfer and generative adversarial networks.

Laion-400-Million Open Dataset

The Laion-400-Million dataset is the world’s largest openly available image-text-pair dataset, with over 400 million samples. It is non-curated and built for research purposes.

Rliable: a Library for Reliable Evaluation of RL Models

Rliable is an open-source Python library used to comprehensively evaluate reinforcement learning models. It is based on the NeurIPS 2021 paper “Deep Reinforcement Learning at the Edge of the Statistical Precipice”.
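One of the statistics rliable reports is the interquartile mean (IQM) of scores across runs. A pure-Python sketch of the statistic itself (not rliable's own API):

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores: drop the lowest and highest
    25% of runs. More robust to outlier runs than the plain mean,
    and uses more of the data than the median."""
    s = sorted(scores)
    cut = len(s) // 4           # runs to drop at each end
    middle = s[cut:len(s) - cut]
    return sum(middle) / len(middle)

# Normalized scores from eight runs of one algorithm, one outlier run.
runs = [0.2, 0.5, 0.55, 0.6, 0.62, 0.7, 0.75, 3.0]
iqm = interquartile_mean(runs)  # mean of [0.55, 0.6, 0.62, 0.7]
```

With the outlier run included, the plain mean is about 0.87, while the IQM stays at 0.6175, which is why the paper recommends it for comparing RL algorithms across noisy training runs.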

Papers & Publications

Towards a Theory of Justice for Artificial Intelligence


This paper explores the relationship between artificial intelligence and principles of distributive justice. Drawing upon the political philosophy of John Rawls, it holds that the basic structure of society should be understood as a composite of socio-technical systems, and that the operation of these systems is increasingly shaped and influenced by AI. As a consequence, egalitarian norms of justice apply to the technology when it is deployed in these contexts. These norms entail that the relevant AI systems must meet a certain standard of public justification, support citizens’ rights, and promote substantively fair outcomes -- something that requires specific attention be paid to the impact they have on the worst-off members of society.

Parameter Prediction for Unseen Deep Architectures


Deep learning has been successful in automating the design of features in machine learning pipelines. However, the algorithms optimizing neural network parameters remain largely hand-designed and computationally inefficient. We study if we can use deep learning to directly predict these parameters by exploiting the past knowledge of training other networks. We introduce a large-scale dataset of diverse computational graphs of neural architectures - DeepNets-1M - and use it to explore parameter prediction on CIFAR-10 and ImageNet. By leveraging advances in graph neural networks, we propose a hypernetwork that can predict performant parameters in a single forward pass taking a fraction of a second, even on a CPU. The proposed model achieves surprisingly good performance on unseen and diverse networks. For example, it is able to predict all 24 million parameters of a ResNet-50 achieving a 60% accuracy on CIFAR-10. On ImageNet, top-5 accuracy of some of our networks approaches 50%. Our task along with the model and results can potentially lead to a new, more computationally efficient paradigm of training networks. Our model also learns a strong representation of neural architectures enabling their analysis.

MERLOT: Multimodal Neural Script Knowledge Models


As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes).

Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.
