Deep Learning Weekly: Issue #272
State of AI Report 2022, DeepMind's Perception Test, a technical tutorial on creating music videos with Stable Diffusion, a paper on unifying language learning paradigms, and many more.
This week in deep learning, we bring you the State of AI Report 2022, DeepMind's Perception Test, a technical tutorial on creating music videos with Stable Diffusion, and a paper on unifying language learning paradigms.
You may also enjoy the OpenAI Hackathon for Climate Change, the design of Spotify Radio's recommendation system, a command-line utility for provisioning ML infrastructure, a paper on exploring long-sequence masked autoencoders, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
An AI report that comprehensively analyzes developments across research, industry, politics, and AI safety.
OpenAI announces a virtual hackathon to explore how current AI models can accelerate solutions to climate change.
DeepMind introduces the Perception Test, a multimodal benchmark using real-world videos to help evaluate the perception capabilities of a model.
Meta announces its latest contribution to the Open Compute Project: Grand Teton, a next-generation hardware platform for large-scale artificial intelligence.
Stability AI, the company funding the development of open-source music- and image-generating systems like Dance Diffusion and Stable Diffusion, announces that it has raised $101 million in a funding round led by Coatue and Lightspeed Venture Partners.
A post that explains how Instacart uses transfer learning to improve the calibration of its deep predicted click-through-rate (pCTR) models.
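Post-hoc calibration of a pCTR model is often illustrated with temperature scaling: fit a single scalar on held-out data so that rescaled logits minimize log loss. A minimal NumPy sketch of the idea (a generic illustration, not Instacart's actual system; the function names are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_temperature(logits, labels, lr=0.5, steps=500):
    """Fit a single temperature T so that sigmoid(logits / T)
    minimizes binary negative log-likelihood on a held-out set."""
    T = 1.0
    for _ in range(steps):
        p = sigmoid(logits / T)
        # dNLL/dT: chain rule through z = logits / T gives (p - y) * (-logits / T^2)
        grad = np.mean((p - labels) * (-logits / T**2))
        T -= lr * grad
    return T

# Toy example: an overconfident model whose logits are 3x too large.
rng = np.random.default_rng(0)
true_logits = rng.normal(0, 1, 5000)
labels = (rng.random(5000) < sigmoid(true_logits)).astype(float)
overconfident = 3.0 * true_logits
T = fit_temperature(overconfident, labels)   # recovers a temperature near 3
```

Dividing test-time logits by the fitted T leaves the model's ranking unchanged while making its probabilities better calibrated.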
An article that explains how Spotify Radio’s recommendation system is designed from framing the problem all the way down to the metrics used.
An article that covers how to track JAX and Flax models with Comet.
A Sentiment Classification tutorial that covers data preparation all the way to inference logging using HuggingFace and Arize.
A technical article on how to leverage Stable Diffusion to generate captivating music videos that move to the beat of a song.
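A common building block in such videos is interpolating between diffusion latents, with a fresh keyframe latent on every beat. A minimal NumPy sketch of the scheduling logic (an illustration of the general technique, not the article's exact pipeline; in practice each latent would be decoded by Stable Diffusion into a video frame):

```python
import numpy as np

def slerp(t, v0, v1):
    """Spherical interpolation between two flattened latent vectors."""
    v0n = v0 / np.linalg.norm(v0)
    v1n = v1 / np.linalg.norm(v1)
    theta = np.arccos(np.clip(np.dot(v0n, v1n), -1.0, 1.0))
    if theta < 1e-6:                      # nearly parallel: plain lerp is fine
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

def frame_latents(beat_times, fps, duration, latent_dim, seed=0):
    """One latent per video frame: a new random keyframe latent on each
    beat, with slerp between consecutive keyframes."""
    rng = np.random.default_rng(seed)
    bounds = [0.0] + list(beat_times) + [duration]
    keys = [rng.normal(size=latent_dim) for _ in range(len(bounds))]
    frames = []
    for i in range(int(duration * fps)):
        t = i / fps
        seg = max(j for j in range(len(bounds) - 1) if bounds[j] <= t)
        u = (t - bounds[seg]) / (bounds[seg + 1] - bounds[seg])
        frames.append(slerp(u, keys[seg], keys[seg + 1]))
    return np.stack(frames)

# 3-second clip at 10 fps with beats at 1s and 2s -> 30 frame latents
lat = frame_latents(beat_times=[1.0, 2.0], fps=10, duration=3.0, latent_dim=16)
```

Slerp is preferred over linear interpolation here because diffusion latents are roughly Gaussian, and linear blends shrink their norm toward zero mid-transition.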
A blog about generating model-specific minimal runtimes using PyTorch’s Tracing Based Selective Build.
A tutorial on how to deploy models on different types of Inference Endpoints.
A new Deep Learning community sponsored by Deci AI.
Libraries & Code
A lightweight command-line utility to provision infrastructure for ML workflows.
A library that aims to simplify feature extractions from mono audio files.
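Feature extraction from mono audio usually means slicing the signal into overlapping frames and computing per-frame statistics. A minimal NumPy sketch of two classic features (a generic illustration; this is not the library's actual API):

```python
import numpy as np

def frame_features(signal, frame_len=1024, hop=512):
    """Frame-wise RMS energy and zero-crossing rate for a mono signal."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))                    # loudness proxy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)    # crossings per sample
        feats.append((rms, zcr))
    return np.array(feats)

# 1 second of a 440 Hz sine at 16 kHz: RMS ~ 0.707, ZCR ~ 2*440/16000
sr = 16000
t = np.arange(sr) / sr
sine = np.sin(2 * np.pi * 440 * t)
feats = frame_features(sine)
```

Stacking such per-frame statistics (often alongside spectral features like MFCCs) yields the fixed-size feature matrices downstream models expect.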
Papers & Publications
Existing pre-trained models are generally geared towards a particular class of problems. To date, there is still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks spanning language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding, and information retrieval. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. Finally, we show that UL2 20B works well with chain-of-thought prompting and reasoning.
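The Mixture-of-Denoisers idea mixes several span-corruption objectives that differ in span length and corruption rate. A toy sketch of one such denoiser (T5-style span corruption with sentinel tokens; this is an illustration of the objective family, not the paper's implementation):

```python
import random

def span_corrupt(tokens, span_len=3, corruption_rate=0.15, seed=0):
    """Toy span corruption: mask contiguous spans with sentinel tokens and
    emit a (corrupted input, target) pair for denoising pre-training."""
    rng = random.Random(seed)
    n_spans = max(1, int(len(tokens) * corruption_rate / span_len))
    starts = sorted(rng.sample(range(0, len(tokens) - span_len), n_spans))
    corrupted, target, i, sid = [], [], 0, 0
    for s in starts:
        if s < i:                       # skip spans overlapping a previous one
            continue
        corrupted += tokens[i:s] + [f"<X{sid}>"]          # sentinel in the input
        target += [f"<X{sid}>"] + tokens[s:s + span_len]  # span moves to target
        i, sid = s + span_len, sid + 1
    corrupted += tokens[i:]
    return corrupted, target

toks = [f"w{i}" for i in range(20)]
inp, tgt = span_corrupt(toks)  # one 3-token span masked out of 20 tokens
```

Varying `span_len` and `corruption_rate` per batch moves between the "R", "S", and "X" denoiser regimes the paper describes; the sketch above only covers the short-span regime.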
Denoising diffusion models (DDMs) have shown promising results in 3D point cloud synthesis. To advance 3D DDMs and make them useful for digital artists, we require (i) high generation quality, (ii) flexibility for manipulation and applications such as conditional synthesis and shape interpolation, and (iii) the ability to output smooth surfaces or meshes. To this end, we introduce the hierarchical Latent Point Diffusion Model (LION) for 3D shape generation. LION is set up as a variational autoencoder (VAE) with a hierarchical latent space that combines a global shape latent representation with a point-structured latent space. For generation, we train two hierarchical DDMs in these latent spaces. The hierarchical VAE approach boosts performance compared to DDMs that operate on point clouds directly, while the point-structured latents are still ideally suited for DDM-based modeling. Experimentally, LION achieves state-of-the-art generation performance on multiple ShapeNet benchmarks. Furthermore, our VAE framework allows us to easily use LION for different relevant tasks: LION excels at multimodal shape denoising and voxel-conditioned synthesis, and it can be adapted for text- and image-driven 3D generation. We also demonstrate shape autoencoding and latent shape interpolation, and we augment LION with modern surface reconstruction techniques to generate smooth 3D meshes. We hope that LION provides a powerful tool for artists working with 3D shapes due to its high-quality generation, flexibility, and surface reconstruction capabilities.
Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains. In contrast to discrete tokens in natural languages, the input for image MAE is continuous and subject to additional specifications. We systematically study each input specification during the pre-training stage, and find sequence length is a key axis that further scales MAE. Our study leads to a long-sequence version of MAE with minimal changes to the original recipe, by simply decoupling the mask size from the patch size. For object detection and semantic segmentation, our long-sequence MAE shows consistent gains across all the experimental setups without extra computation cost during transfer. While long-sequence pre-training proves most beneficial for detection and segmentation, we also achieve strong results on ImageNet-1K classification by keeping a standard image size and only increasing the sequence length. We hope our findings can provide new insights and avenues for scaling in computer vision.