Deep Learning Weekly: Issue #224
Nvidia GTC, Landing AI, Sleep Sensing, AnimeGANv2, MERLOT, and more.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
In this keynote, Nvidia’s CEO makes several major announcements regarding the company’s strategy and future releases: Nvidia Quantum-2, Nvidia Omniverse, Nvidia NeMo Megatron, and more.
GPT-3, OpenAI’s generative text model, is estimated to have cost between 10 and 20 million dollars to train. Nevertheless, a half dozen or so models as big as or bigger than GPT-3 have been announced over the course of 2021.
Scientists have tried to model neurons found in the human brain with artificial neural networks, and found that human neurons are much more complex than we’d previously thought. This means that simulating the brain would require staggeringly large computational resources.
Spotify’s founder will put €100 million into Helsing, a European defense AI company aiming to boost defense and national security among liberal democracies by making them more efficient.
Computers have not changed much in the last 40 years. This article explores how AI changes that on at least 3 fronts: how computers are made, how they’re programmed, and how they’re used.
Just over a year after launching, Landing AI, Andrew Ng’s company, secured a $57 million round of Series A funding to help manufacturers more easily and quickly build and deploy AI systems.
Mobile & Edge
Apple introduces HyperDETR, an image segmentation architecture that is compact and efficient enough to execute on-device without impacting battery life. It enables a wide range of features in the Camera app.
Google AI explains how they enhanced Sleep Sensing, a feature that helps users better understand their sleep patterns and nighttime wellness, based on sleep staging classification and audio source separation models.
Google’s team of researchers came together across hardware, software, and ML to build Google Tensor, a chip that can deliver totally new capabilities for Pixel users by keeping pace with the latest advancements in ML.
This post introduces experimental and theoretical approaches to study the theory of optimization in deep learning. It shows that a theory behind the convergence of stochastic gradient descent is still needed.
This post gives a visual explanation of the various tools used by the rliable library to better evaluate and compare reinforcement learning algorithms: score normalization, stratified bootstrap, interquartile mean, and more.
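To make two of the tools named above concrete, here is a minimal sketch of an interquartile mean combined with a stratified bootstrap confidence interval. It is written from scratch with the standard library and does not use or reproduce rliable's API; the function names, the example scores, and the percentile-interval choice are all assumptions made for illustration.

```python
import random
from statistics import mean


def interquartile_mean(values):
    """Mean of the middle 50% of values (drops the lowest and highest 25%)."""
    ordered = sorted(values)
    cut = int(len(ordered) * 0.25)  # number of values trimmed from each end
    return mean(ordered[cut:len(ordered) - cut])


def stratified_bootstrap_ci(scores_by_task, reps=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the IQM over all runs and tasks.

    scores_by_task maps a task name to a list of normalized scores, one per run.
    Runs are resampled with replacement separately within each task (stratum),
    so every task keeps the same number of runs in each bootstrap replicate.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        resampled = []
        for runs in scores_by_task.values():
            resampled.extend(rng.choices(runs, k=len(runs)))
        estimates.append(interquartile_mean(resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[int((1 - alpha / 2) * reps) - 1]
    return lower, upper


# Hypothetical, already-normalized scores: 3 tasks, 4 runs each.
scores = {
    "task_a": [0.2, 0.4, 0.5, 0.9],
    "task_b": [1.0, 1.1, 1.3, 3.0],
    "task_c": [0.0, 0.1, 0.6, 0.7],
}
all_scores = [s for runs in scores.values() for s in runs]
point_estimate = interquartile_mean(all_scores)
ci_low, ci_high = stratified_bootstrap_ci(scores)
print(f"IQM = {point_estimate:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

The interquartile mean ignores the outlying 3.0 run entirely, which is why it is less sensitive to a single lucky seed than a plain mean, while still using more of the data than a median would.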
Data Scientist Matt Blasa explores how tracking your ML experiments in a Databricks environment allows more control over model versioning, as well as the ability to keep track of and log metrics, data visualizations, dataset artifacts, and more.
SHAP is a powerful ML interpretation technique. This guide provides an actionable framework for communicating its results to non-technical stakeholders.
We do not really understand why deep learning works, and in particular how the functions learned by neural networks generalize so well to unseen data. The approach described here gives some answers to this complex question.
Facebook AI releases M2M-100, the first multilingual machine translation model that translates between any pair of 100 languages without relying on English data. It reaches unprecedented accuracy for most languages.
Libraries & Code
Try AnimeGANv2 with any image you upload. AnimeGANv2 is the latest model to transform real portraits into anime-style images, combining neural style transfer and generative adversarial networks.
The Laion-400-Million dataset is the world’s largest openly available image-text-pair dataset, with over 400 million samples. It is non-curated and built for research purposes.
Rliable is an open-source Python library used to comprehensively evaluate reinforcement learning models. It is based on the NeurIPS 2021 paper “Deep Reinforcement Learning at the Edge of the Statistical Precipice”.
Papers & Publications
This paper explores the relationship between artificial intelligence and principles of distributive justice. Drawing upon the political philosophy of John Rawls, it holds that the basic structure of society should be understood as a composite of socio-technical systems, and that the operation of these systems is increasingly shaped and influenced by AI. As a consequence, egalitarian norms of justice apply to the technology when it is deployed in these contexts. These norms entail that the relevant AI systems must meet a certain standard of public justification, support citizens' rights, and promote substantively fair outcomes -- something that requires specific attention be paid to the impact they have on the worst-off members of society.
Deep learning has been successful in automating the design of features in machine learning pipelines. However, the algorithms optimizing neural network parameters remain largely hand-designed and computationally inefficient. We study if we can use deep learning to directly predict these parameters by exploiting the past knowledge of training other networks. We introduce a large-scale dataset of diverse computational graphs of neural architectures - DeepNets-1M - and use it to explore parameter prediction on CIFAR-10 and ImageNet. By leveraging advances in graph neural networks, we propose a hypernetwork that can predict performant parameters in a single forward pass taking a fraction of a second, even on a CPU. The proposed model achieves surprisingly good performance on unseen and diverse networks. For example, it is able to predict all 24 million parameters of a ResNet-50 achieving a 60% accuracy on CIFAR-10. On ImageNet, top-5 accuracy of some of our networks approaches 50%. Our task along with the model and results can potentially lead to a new, more computationally efficient paradigm of training networks. Our model also learns a strong representation of neural architectures enabling their analysis.
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%, even those that make heavy use of auxiliary supervised data (like object bounding boxes).
Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.