Deep Learning Weekly: Issue #256

Meta's AI-driven acoustic synthesis for AR and VR, orchestrating PyTorch workflows using Vertex AI pipelines, Fast Interpretable Greedy-Tree Sums from BAIR, a paper on video pre-training, and more

Jun 29, 2022

Hey Folks,

This week in deep learning, we bring you Meta's AI-driven acoustic synthesis for AR and VR, orchestrating PyTorch workflows using Vertex AI pipelines, Fast Interpretable Greedy-Tree Sums from Berkeley Artificial Intelligence Research, and a paper on video pre-training.

You may also enjoy deployment with Streamlit + BentoML + DagsHub, How Imagen Actually Works, an open-source PyTorch reproduction of AlphaFold 2, a paper on scaling autoregressive models, and more!

As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.

Until next week!

Industry

Google and France's Engie Team Up to Accelerate Clean Energy

French utility Engie SA will begin using an experimental technology from Google that aims to boost efficiency and power from wind farms.

Announcing the Inverse Scaling Prize ($250k Prize Pool)

A LessWrong post announces the Inverse Scaling Prize: a contest with $250k in prizes for finding zero/few-shot text tasks where larger language models show increasingly undesirable behavior (“inverse scaling”).

Introducing AI-driven acoustic synthesis for AR and VR

Meta AI researchers, in collaboration with an audio specialist from Meta’s Reality Labs and researchers from the University of Texas at Austin, are open-sourcing three new models for audio-visual understanding of human speech and sounds in video.

Measuring AI's Carbon Footprint

Microsoft’s new research offers a more accurate method for calculating CO2 emissions: it records server chips’ energy usage as a time series and aligns that usage with a matching series of data points indicating local emissions per kilowatt-hour.
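To make the idea concrete, here is a minimal sketch of aligning an energy series with an emissions-intensity series. It is not the researchers' implementation: the function name, the hourly granularity, and all readings below are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def co2_grams(energy_kwh, intensity_g_per_kwh):
    """Sum energy (kWh) times grid carbon intensity (gCO2/kWh) per interval.

    Both arguments are dicts keyed by the same timestamps (an assumption
    made for this sketch; real series would need resampling to align).
    """
    if energy_kwh.keys() != intensity_g_per_kwh.keys():
        raise ValueError("both series must cover the same timestamps")
    return sum(energy_kwh[t] * intensity_g_per_kwh[t] for t in energy_kwh)


# Hypothetical hourly readings, not real measurements.
hour_0 = datetime(2022, 6, 29, 0)
hour_1 = datetime(2022, 6, 29, 1)
energy = {hour_0: 1.5, hour_1: 2.0}          # kWh drawn in each hour
intensity = {hour_0: 400.0, hour_1: 300.0}   # gCO2 per kWh in each hour

total = co2_grams(energy, intensity)
print(total)
```

Multiplying total energy by a single average intensity instead (3.5 kWh at 350 g/kWh) would give a different figure for the same readings, which is why aligning the two series interval by interval can be more accurate.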

Mapping Urban Trees Across North America with the Auto Arborist Dataset

Google AI introduces the Auto Arborist Dataset, a multiview urban tree classification dataset that, at ~2.6 million trees and >320 genera, is two orders of magnitude larger than those in prior work.

MLOps

The Easiest Way to Deploy Your Machine Learning Models in 2022: Streamlit + BentoML + DagsHub

A technical article showing a deployment method that lets you serve your model as an API, a Docker container, and a hosted web app, all within a few minutes using a couple of short Python scripts.

Orchestrating PyTorch ML Workflows on Vertex AI Pipelines

Google shows how to automate and monitor a PyTorch-based ML workflow by orchestrating the pipeline in a serverless manner using Vertex AI Pipelines.

Productionize your models with MLEM in a Git-native way

DVC announces a tool that automatically extracts meta information like environment and frameworks from models, and standardizes that information into a human-readable format within Git.

ZenML: Build portable, production-ready MLOps pipelines

ZenML is an extensible, open-source MLOps framework for creating portable, production-ready MLOps pipelines.

Learning

How Imagen Actually Works

A deep dive into Imagen that is intended for Machine Learning researchers, students, and practitioners.

ML with Intel OpenVINO Toolkit — Image Inpainting

Learn to perform image inpainting in the first installment of a four-part series on the Intel OpenVINO toolkit.

How Computational Graphs are Executed in PyTorch

A technical blog describing how the autograd engine of PyTorch works in detail.

Convert Transformers to ONNX with Hugging Face Optimum

An article covering ONNX, Hugging Face Optimum, the supported transformer models, and how to convert a Hugging Face BERT model to ONNX with Optimum.

FIGS: Attaining XGBoost-level performance with the interpretability and speed of CART

A Berkeley Artificial Intelligence Research (BAIR) blog post covering FIGS (Fast Interpretable Greedy-tree Sums), a new method for fitting an interpretable model that takes the form of a sum of trees.
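As a rough illustration of what "a sum of trees" means at prediction time, the sketch below adds up the outputs of two tiny hand-written trees. It shows only the form of the model, not the greedy fitting procedure from the BAIR post; the trees, thresholds, and leaf values are hypothetical.

```python
def tree_a(x):
    # Hypothetical tree with a single split on feature 0.
    return 0.5 if x[0] > 1.0 else 0.0


def tree_b(x):
    # Hypothetical tree with a single split on feature 1.
    return 0.25 if x[1] > 0.5 else 0.0


def predict(x, trees):
    """Score a sample by summing the contribution of every tree."""
    return sum(tree(x) for tree in trees)


score = predict([2.0, 0.7], [tree_a, tree_b])
print(score)
```

Because each tree contributes an additive term, a reader can inspect every tree on its own to see how much it moves the final score.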

Libraries & Code

aqlaboratory/openfold: Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2

A faithful but trainable PyTorch reproduction of DeepMind's AlphaFold 2.

firmai/financial-machine-learning: A curated list of practical financial machine learning tools and applications.

A curated list of practical financial machine learning (FinML) tools and applications.

Papers & Publications

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

Abstract:

Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Abstract:

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and fine-tuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.

The ArtBench Dataset: Benchmarking Generative Models with Artworks

Abstract:

We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced while most previous artwork datasets suffer from the long tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions (32×32, 256×256, and original image size), formatted in a way that is easy to be incorporated by popular machine learning frameworks. We also conduct extensive benchmarking experiments using representative image synthesis models with ArtBench-10 and present in-depth analysis.

A guest post by

Miko Planas
