Deep Learning Weekly: Issue 286
Microsoft & UCLA’s climate foundation model, tips on scaling storage for inference and training, the Transformer Family Version 2.0, a paper on Text-To-Audio Generation with Prompt-Enhanced Diffusion
Hey Folks,
This week in deep learning, we bring you Microsoft and UCLA's climate and weather foundation model, tips on scaling storage for inference and training, the Transformer Family Version 2.0, and a paper on Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models.
You may also enjoy Ray 2.2, simplifying and accelerating ML predictions in Apache Beam with NVIDIA TensorRT, end-to-end deep learning for autonomous driving: imitation learning, a paper on Cut and Learn for Unsupervised Object Detection and Instance Segmentation, and more.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Will an AI be the first to discover alien life?
SETI, the search for extraterrestrial intelligence, is deploying machine-learning algorithms that filter out Earthly interference and spot signals humans might miss.
Microsoft & UCLA Introduce ClimaX: A Foundation Model for Climate and Weather Modeling
Microsoft and UCLA present ClimaX, a foundation model for weather and climate that can be efficiently adapted for general-purpose tasks related to the Earth’s atmosphere.
China’s Baidu plans to develop its own ChatGPT-like AI bot for search in March
Baidu intends to launch its own ChatGPT-like bot that will be able to understand and respond in conversational language to users, according to a report from Bloomberg.
4chan users embrace AI voice clone tool to generate celebrity hatespeech
Free AI voice cloning technology from startup ElevenLabs has been used by trolls to imitate the voices of celebrities. The generated audio ranges in content from memes and erotica to virulent hatespeech.
OpenAI has hired an army of contractors to make basic coding obsolete
OpenAI, the company behind the chatbot ChatGPT, has ramped up its hiring around the world, bringing on roughly 1,000 remote contractors over the past six months.
Ray 2.2 boosts machine learning observability and scalability performance
Ray has released its 2.2 version with improved performance and observability capabilities, as well as features that can help to enable reproducibility.
MLOps
Simplifying and Accelerating Machine Learning Predictions in Apache Beam with NVIDIA TensorRT
An article that walks through the integration of NVIDIA TensorRT with the Apache Beam SDK and shows how complex inference scenarios can be fully encapsulated within a data processing pipeline.
An overview of the process of MLOps and the structure of an MLOps team.
Tips on Scaling Storage for AI Training and Inferencing
This NVIDIA blog explains how to plan in advance and scale data storage for training and inferencing.
How to Train Time Series Forecasting Faster using Ray, part 3 of 3
A blog that covers the steps on how to train and tune many models in parallel using distributed computing with open-source Ray.
Secure and enable Vertex AI as your end-to-end ML/AI platform for production workloads
A blog post on setting up Cloud foundations tailored to the Vertex AI platform, laying the groundwork for future MLOps and ML/AI use cases.
Learning
The Transformer Family Version 2.0
Lilian Weng’s comprehensive blog post that covers the math behind the components of the latest Transformer models (as well as the fundamentals).
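The building block at the heart of every model in the post is scaled dot-product attention. As a refresher, here is a minimal framework-free sketch (the helper names `softmax` and `attention` are our own, and the matrices are plain lists of lists):

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Each output row is a weight-averaged mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

The query is more similar to the first key, so the output lands closer to the first value vector than the second.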
Accelerating and scaling Temporal Graph Networks on the Graphcore IPU
An article that explores the implementation of Temporal GNNs on hardware architecture developed by Graphcore that is tailored to graph-structured workloads.
This article discusses the rapid development in ML and DL foundational models, citing a need for practitioners to focus on safety and security.
Sentiment Analysis with Python and Streamlit
A technical tutorial about building and deploying your own sentiment classification app using Python and Streamlit.
End-to-End Deep Learning Approach for Autonomous Driving: Imitation Learning
An article that highlights an End-to-End Deep Learning approach for Autonomous Lane Navigation, implemented using the Duckietown simulator and an architecture based on NVIDIA’s DAVE-2.
Using TensorFlow for Deep Learning on Video Data
A tutorial that covers loading video data, video classification, streaming action recognition, and transfer learning for video, with an emphasis on building models in a memory-efficient manner.
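The tutorial itself is TensorFlow-based; the memory-efficiency idea, though, is framework-agnostic: decode only a fixed number of evenly spaced frames per clip, yielding them lazily instead of materializing the whole video. A dependency-free sketch of that pattern (all names here are our own, and `read_frame` stands in for a real decoder such as OpenCV's):

```python
def sample_frame_indices(n_total, n_sample):
    """Pick n_sample evenly spaced frame indices from a clip of n_total frames."""
    if n_sample >= n_total:
        return list(range(n_total))
    step = n_total / n_sample
    return [int(i * step) for i in range(n_sample)]

def frame_generator(read_frame, n_total, n_sample):
    """Yield sampled frames one at a time instead of loading the full clip.

    `read_frame(i)` is a caller-supplied function that decodes frame i on
    demand; injecting it keeps this sketch free of video-IO dependencies.
    """
    for i in sample_frame_indices(n_total, n_sample):
        yield read_frame(i)

# Toy usage: stand-in "frames" are just the frame index squared.
frames = list(frame_generator(lambda i: i * i, n_total=100, n_sample=5))
```

Because the generator yields frames one at a time, a training loop only ever holds the frames for the current batch in memory.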
Libraries & Code
A collection of audio-focused loss functions in PyTorch.
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
The general purpose micro-framework for creating dataflows from Python functions.
Papers & Publications
MusicLM: Generating Music From Text
Abstract:
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Abstract:
Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind due to two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach which alleviates the data scarcity by using weakly-supervised data with language-free audios; 2) leveraging spectrogram autoencoder to predict the self-supervised audio representation instead of waveforms. Together with robust contrastive language-audio pre-training (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective evaluation. Moreover, we present its controllability with classifier-free guidance and generalization for X-to-Audio with "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audios given a user-defined modality input.
Cut and Learn for Unsupervised Object Detection and Instance Segmentation
Abstract:
We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks using our robust loss function. We further improve the performance by self-training the model on its predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance AP50 by over 2.7 times on 11 benchmarks across domains like video frames, paintings, sketches, etc. With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% APbox and 6.6% APmask on COCO when training with 5% labels.