Deep Learning Weekly: Issue #274
MIT's neural acoustic field, an in-depth guide to two-phase learning, large-scale training with FAIR's Vision Library for Self-Supervised Learning, and more.
This week in deep learning, we bring you MIT's neural acoustic field, an in-depth guide to two-phase learning, large-scale training with FAIR's Vision Library for Self-Supervised Learning, and a paper on a large prompt gallery dataset for text-to-image models.
You may also enjoy a fashion sketch pad that utilizes DALL-E, your first high quality MLOps system, distributed forecasting using Fugue and Nixtla, a paper on Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Google has reportedly acquired artificial intelligence (AI) avatar startup Alter — which was formerly known as Facemoji — for $100 million.
Team PyTorch announced the release of PyTorch 1.13. This includes stable versions of BetterTransformer along with other improvements.
MIT researchers have developed a machine-learning technique that accurately captures and models the underlying acoustics of a scene from only a limited number of sound recordings.
Madrona, Goldman Sachs, Microsoft, Amazon Web Services, and PitchBook announced the 2022 Intelligent Applications 40.
CALA reimagines DALL-E as a clothing designer’s ultimate smart sketch pad.
Bumble shares their methodology for handling unbalanced classes in production data.
A full code tutorial on how to manually log charts and graphs to Comet.
An in-depth article about design challenges, good practices, and methodologies for ML application design.
A podcast that covers the characteristics of your first high quality MLOps system.
An in-depth introduction to two-phase learning, an approach to unbalanced classes in real-world problems.
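To make the idea concrete, here is a minimal, hypothetical sketch of the two-phase pattern (not the article's code): phase 1 fits a toy model on a class-balanced subsample so the minority class is not drowned out, and phase 2 goes back to the full, imbalanced data to estimate class priors that can recalibrate the decision rule. The centroid "model" and all function names are illustrative stand-ins for a real classifier.

```python
import math
import random
from collections import Counter

def balanced_subsample(X, y, seed=0):
    """Phase 1 data: downsample every class to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n = min(len(items) for items in by_class.values())
    Xb, yb = [], []
    for label, items in by_class.items():
        for xi in rng.sample(items, n):
            Xb.append(xi)
            yb.append(label)
    return Xb, yb

def train_centroids(X, y):
    """Toy 1-D 'model': per-class feature means, trained on balanced data."""
    sums, counts = Counter(), Counter()
    for xi, yi in zip(X, y):
        sums[yi] += xi
        counts[yi] += 1
    return {c: sums[c] / counts[c] for c in counts}

def full_data_log_priors(y):
    """Phase 2: estimate class priors from the full, imbalanced data."""
    counts = Counter(y)
    total = sum(counts.values())
    return {c: math.log(n / total) for c, n in counts.items()}

def predict(centroids, xi, log_priors=None):
    """Score by closeness to each centroid; optionally shift scores by the
    phase-2 log-priors to match the real class distribution."""
    best, best_score = None, float("-inf")
    for c, mu in centroids.items():
        score = -abs(xi - mu)
        if log_priors is not None:
            score += log_priors[c]
        if score > best_score:
            best, best_score = c, score
    return best
```

A real pipeline would replace the centroid model with a neural network and fine-tune (rather than just reweight) in phase 2, but the split between balanced pre-training and full-distribution calibration is the same.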
A blog post that shows how you can leverage the distributed power of Spark and the highly efficient code from StatsForecast to fit millions of models in a couple of minutes.
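The core pattern behind that post is embarrassingly parallel per-series fitting: each time series gets its own independent model, so the work can be fanned out across workers. The stdlib sketch below illustrates that pattern with a deliberately naive forecaster and a thread pool; it is not the Fugue or StatsForecast API, which the blog post covers, and `naive_forecast`/`forecast_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def naive_forecast(history, horizon):
    """Toy per-series model: repeat the last observed value."""
    last = history[-1]
    return [last] * horizon

def forecast_all(series_by_id, horizon, max_workers=4):
    """Fit and forecast each series independently -- the same
    embarrassingly parallel shape that lets Fugue schedule millions of
    StatsForecast models across a Spark cluster."""
    def job(item):
        sid, history = item
        return sid, naive_forecast(history, horizon)
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        return dict(ex.map(job, series_by_id.items()))
```

Swapping the thread pool for a distributed engine (and the naive model for an AutoARIMA-style one) changes the scale, not the structure.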
An article discussing recent research into boosting performance of deep learning models on tabular data.
A Colab tutorial that guides you through the configurations for large scale training with FAIR’s VISSL.
An article that explains how Pinterest uses multi-task learning, calibration, and Bayesian optimization to build a flexible, interpretable, and scalable candidate ranking solution for Related Products recommendations.
An evaluation of the robustness of state-of-the-art computer vision models that assesses their generalization ability.
An article that covers how a physics-informed neural network (PINN) works, and the trade-offs and differences between PINNs, purely data-driven neural networks, and pure physics functions.
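The defining ingredient of a PINN is its composite loss: a data-fit term plus a physics-residual term penalizing violations of the governing equation at collocation points. The sketch below (my illustration, not the article's code) writes that loss for the simple ODE u'(t) = -k·u(t), using a model family u(t) = c·exp(r·t) whose derivative is known in closed form; a real PINN would use a neural network and obtain u' via automatic differentiation.

```python
import math

def pinn_loss(params, data, colloc, k, lam=1.0):
    """Toy PINN objective for the ODE u'(t) = -k*u(t).

    data   -- (t, y) observation pairs for the data-fit term
    colloc -- collocation times where the physics residual is penalized
    lam    -- weight balancing the two terms
    """
    c, r = params
    u = lambda t: c * math.exp(r * t)        # candidate solution
    du = lambda t: c * r * math.exp(r * t)   # its exact derivative
    data_term = sum((u(t) - y) ** 2 for t, y in data) / len(data)
    phys_term = sum((du(t) + k * u(t)) ** 2 for t in colloc) / len(colloc)
    return data_term + lam * phys_term
```

When the parameters satisfy the ODE (here r = -k), the physics term vanishes, which is exactly the inductive bias that lets PINNs get by with less data than a purely data-driven network.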
Libraries & Code
A curated list of applied machine learning and data science notebooks and libraries across different industries.
LaTeX code for drawing neural networks in reports and presentations.
Papers & Publications
With recent advancements in diffusion models, users can generate high-quality images by writing text prompts in natural language. However, generating images with desired details requires proper prompts, and it is often unclear how a model reacts to different prompts and what the best prompts are. To help researchers tackle these critical challenges, we introduce DiffusionDB, the first large-scale text-to-image prompt dataset. DiffusionDB contains 2 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. We analyze prompts in the dataset and discuss key properties of these prompts. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.
We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed.
Building a scalable and real-time recommendation system is vital for many businesses driven by time-sensitive customer feedback, such as short-video ranking or online ads. Despite the ubiquitous adoption of production-scale deep learning frameworks like TensorFlow or PyTorch, these general-purpose frameworks fall short of business demands in recommendation scenarios for various reasons: on one hand, tweaking systems based on static parameters and dense computations for recommendation with dynamic and sparse features is detrimental to model quality; on the other hand, such frameworks are designed with batch-training stage and serving stage completely separated, preventing the model from interacting with customer feedback in real-time. These issues led us to reexamine traditional approaches and explore radically different design choices. In this paper, we present Monolith, a system tailored for online training. Our design has been driven by observations of our application workloads and production environment that reflects a marked departure from other recommendation systems. Our contributions are manifold: first, we crafted a collisionless embedding table with optimizations such as expirable embeddings and frequency filtering to reduce its memory footprint; second, we provide a production-ready online training architecture with high fault-tolerance; finally, we proved that system reliability could be traded-off for real-time learning. Monolith has successfully landed in the BytePlus Recommend product.
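The collisionless embedding table in the abstract above can be sketched in a few lines: key the table by the raw feature ID (no modulo hashing, so distinct IDs never share a slot), refuse to allocate an embedding until an ID has been seen often enough (frequency filtering), and evict embeddings that have not been touched within a TTL (expirable embeddings). This is a conceptual stand-in for Monolith's Cuckoo-hash-based implementation; the class and method names are mine.

```python
import random
import time

class EmbeddingTable:
    """Conceptual sketch of a Monolith-style collisionless embedding table
    with frequency filtering on admission and time-based expiry."""

    def __init__(self, dim=4, admit_after=2, ttl=3600.0, seed=0):
        self.dim, self.admit_after, self.ttl = dim, admit_after, ttl
        self.rng = random.Random(seed)
        self.freq = {}    # id -> times seen before admission
        self.table = {}   # id -> (embedding, last_access_time)

    def lookup(self, fid, now=None):
        now = time.time() if now is None else now
        if fid in self.table:
            emb, _ = self.table[fid]
            self.table[fid] = (emb, now)  # refresh last-access time
            return emb
        # Frequency filtering: don't spend memory on IDs seen too rarely.
        self.freq[fid] = self.freq.get(fid, 0) + 1
        if self.freq[fid] >= self.admit_after:
            emb = [self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)]
            self.table[fid] = (emb, now)
            del self.freq[fid]
            return emb
        return None  # caller falls back to a default/zero embedding

    def expire(self, now=None):
        """Drop embeddings untouched for longer than the TTL; return count."""
        now = time.time() if now is None else now
        stale = [f for f, (_, t) in self.table.items() if now - t > self.ttl]
        for f in stale:
            del self.table[f]
        return len(stale)
```

Both knobs trade memory for quality: a higher admission threshold and a shorter TTL shrink the table at the cost of dropping rare or stale IDs.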