Deep Learning Weekly: Issue #256
Meta's AI-driven acoustic synthesis for AR and VR, orchestrating PyTorch workflows using Vertex AI pipelines, Fast Interpretable Greedy-Tree Sums from BAIR, a paper on video pre-training, and more
This week in deep learning, we bring you Meta's AI-driven acoustic synthesis for AR and VR, orchestrating PyTorch workflows using Vertex AI pipelines, Fast Interpretable Greedy-Tree Sums from Berkeley Artificial Intelligence Research, and a paper on video pre-training.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
French utility Engie SA will begin using an experimental technology from Google that aims to boost efficiency and power from wind farms.
LessWrong is launching the Inverse Scaling Prize: a contest with $250k in prizes for finding zero/few-shot text tasks where larger language models show increasingly undesirable behavior (“inverse scaling”).
Meta AI researchers, in collaboration with an audio specialist from Meta’s Reality Labs and researchers from the University of Texas at Austin, are open-sourcing three new models for audio-visual understanding of human speech and sounds in video.
Microsoft’s new research offers a more accurate method for calculating CO2 emissions: it records server chips’ energy usage as a series and aligns that usage with a series of data points indicating local emissions per kilowatt-hour.
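To make the alignment idea concrete, here is a minimal Python sketch. It assumes each energy reading is matched to the most recent emissions-per-kWh value. All numbers and names (`emissions_grams`, `energy`, `intensity`) are made up for illustration and are not taken from Microsoft's research.

```python
from bisect import bisect_right

def emissions_grams(energy_kwh, intensity_g_per_kwh):
    """Estimate CO2 (grams) by aligning two series.

    energy_kwh: list of (timestamp, kWh used in the interval starting there).
    intensity_g_per_kwh: list of (timestamp, grams of CO2 per kWh), sorted by
    timestamp. Each energy reading uses the latest intensity at or before it.
    """
    times = [t for t, _ in intensity_g_per_kwh]
    total = 0.0
    for t, kwh in energy_kwh:
        i = bisect_right(times, t) - 1
        if i < 0:
            raise ValueError(f"no intensity reading at or before t={t}")
        total += kwh * intensity_g_per_kwh[i][1]
    return total

# Hypothetical data: timestamps in minutes, energy per 30-minute interval.
energy = [(0, 1.0), (30, 2.0), (60, 1.5), (90, 0.5)]
# Hypothetical grid intensity: it drops after the first hour.
intensity = [(0, 400.0), (60, 250.0)]

aligned = emissions_grams(energy, intensity)

# For comparison: one flat average intensity applied to total energy.
flat = sum(kwh for _, kwh in energy) * (400.0 + 250.0) / 2

print(aligned, flat)  # 1700.0 1625.0
```

In this toy case the flat average underestimates emissions because most of the energy was used while the grid was more carbon-intensive, which is the kind of gap a time-aligned calculation is meant to expose.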
Google AI introduces the Auto Arborist Dataset, a multiview urban tree classification dataset that, at ~2.6 million trees and >320 genera, is two orders of magnitude larger than those in prior work.
A technical article showing a deployment method that lets you serve your model as an API, a Docker container, and a hosted web app, all within a few minutes and with a couple of short Python scripts.
Google shows how to automate and monitor a PyTorch-based ML workflow by orchestrating the pipeline in a serverless manner using Vertex AI Pipelines.
DVC announces a tool that automatically extracts meta information like environment and frameworks from models, and standardizes that information into a human-readable format within Git.
ZenML is an extensible, open-source MLOps framework for creating portable, production-ready MLOps pipelines.
A deep dive into Imagen, intended for machine learning researchers, students, and practitioners.
Learn to perform image inpainting in the first installment of a four-part series on the Intel OpenVINO toolkit.
A technical blog describing how the autograd engine of PyTorch works in detail.
An article on ONNX and Hugging Face Optimum that covers the supported transformer models and the conversion of a Hugging Face BERT model to ONNX using Optimum.
A Berkeley Artificial Intelligence Research (BAIR) blog post covering FIGS (Fast Interpretable Greedy-tree Sums), a new method for fitting an interpretable model that takes the form of a sum of trees.
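As a rough illustration of what "a sum of trees" means at prediction time, the sketch below adds up the outputs of two small hand-written trees. The trees, feature indices, and thresholds are hypothetical, and the greedy fitting procedure that FIGS uses to grow the trees is not shown.

```python
def predict_tree(node, x):
    """Walk one tree. A leaf is a float; an internal node is a tuple
    (feature_index, threshold, left_subtree, right_subtree)."""
    while not isinstance(node, (int, float)):
        feature, threshold, left, right = node
        node = left if x[feature] <= threshold else right
    return node

def predict_sum(trees, x):
    """The model's output is the sum of each tree's contribution."""
    return sum(predict_tree(tree, x) for tree in trees)

# Hypothetical features: x[0] = age, x[1] = smoker flag, x[2] = blood pressure.
tree_a = (0, 50.0, 0.0, 2.0)                   # adds 2.0 when age > 50
tree_b = (1, 0.5, 0.0, (2, 120.0, 1.0, 3.0))   # smokers: adds 1.0 or 3.0

trees = [tree_a, tree_b]
high_risk = predict_sum(trees, [62, 1, 130])
low_risk = predict_sum(trees, [40, 0, 130])
print(high_risk, low_risk)  # 5.0 0.0
```

Because each tree contributes an additive term, a reader can inspect the trees one at a time, which is the interpretability argument the blog post makes for this model form.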
Libraries & Code
A faithful but trainable PyTorch reproduction of DeepMind's AlphaFold 2.
A curated list of practical financial machine learning (FinML) tools and applications.
Papers & Publications
Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.
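The three-stage recipe in this abstract (fit an inverse dynamics model on a little labeled data, use it to label unlabeled videos, then train a policy on those pseudo-labels) can be sketched on a toy problem. Everything below is an illustrative assumption: the "world" is a position on a number line, and both models are lookup tables rather than the neural networks over video frames used in the paper.

```python
from collections import Counter, defaultdict

def train_idm(labeled):
    """Inverse dynamics model: infer the action from (obs, next_obs).
    Here it is a majority-vote table keyed on the change in position."""
    votes = defaultdict(Counter)
    for obs, nxt, action in labeled:
        votes[nxt - obs][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}

def pseudo_label(idm, video):
    """Attach an inferred action to every consecutive pair of frames."""
    return [(obs, idm[nxt - obs]) for obs, nxt in zip(video, video[1:])]

def behavioral_cloning(pairs):
    """Behavioral prior: the most common action seen in each state."""
    votes = defaultdict(Counter)
    for obs, action in pairs:
        votes[obs][action] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in votes.items()}

# Small labeled set: (position, next position, action taken).
labeled = [(0, 1, "right"), (3, 2, "left"), (4, 4, "stay")]

# Larger unlabeled set: "videos" of players walking to position 5.
videos = [[0, 1, 2, 3, 4, 5, 5], [9, 8, 7, 6, 5, 5], [2, 3, 4, 5, 5]]

idm = train_idm(labeled)
pairs = [pair for video in videos for pair in pseudo_label(idm, video)]
policy = behavioral_cloning(pairs)

print(policy[0], policy[9], policy[5])  # right left stay
```

The point of the toy is the data flow: three labeled transitions are enough to label fifteen unlabeled ones, and the policy is trained only on the pseudo-labeled data.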
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and fine-tuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced while most previous artwork datasets suffer from long-tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions (32×32, 256×256, and original image size), formatted in a way that is easy to incorporate into popular machine learning frameworks. We also conduct extensive benchmarking experiments using representative image synthesis models with ArtBench-10 and present in-depth analysis.