Deep Learning Weekly: Issue #274
MIT's neural acoustic field, an in-depth guide to two-phase learning, large-scale training with FAIR's Vision Library for Self-Supervised Learning, and more.
This week in deep learning, we bring you MIT's neural acoustic field, an in-depth guide to two-phase learning, large-scale training with FAIR's Vision Library for Self-Supervised Learning, and a paper on a large prompt gallery dataset for text-to-image models.
You may also enjoy a fashion sketch pad that utilizes DALL-E, your first high quality MLOps system, distributed forecasting using Fugue and Nixtla, a paper on Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Google Pays $100M for AI Avatar Startup Alter
Google has reportedly acquired Alter, an artificial intelligence (AI) avatar startup formerly known as Facemoji, for $100 million.
PyTorch 1.13 release, including beta versions of functorch and improved support for Apple’s new M1 chips.
The PyTorch team announced the release of PyTorch 1.13, which includes a stable version of BetterTransformer along with other improvements.
Using sound to model the world
MIT researchers have developed a machine-learning technique that accurately captures and models the underlying acoustics of a scene from only a limited number of sound recordings.
2022 Intelligent Applications 40 Revealed
Madrona, Goldman Sachs, Microsoft, Amazon Web Services, and PitchBook announce the 2022 Intelligent Applications 40.
AI-Generated Fashion Is Next Wave of DIY Design
CALA reimagines DALL-E as a clothing designer’s ultimate smart sketch pad.
Evaluation of classification models on unbalanced production data
Bumble shares their methodology for handling unbalanced classes in production data.
Logging Recommendation System Visualizations in Comet
A full code tutorial on how to manually log charts and graphs to Comet.
Good Design in ML Applications With Konrad Piercey
An in-depth article about design challenges, good practices, and methodologies for ML application design.
Your First MLOps System: What Does Good Look Like?
A podcast that covers the characteristics of your first high-quality MLOps system.
A Beginner’s Guide to Two-Phase Learning
An in-depth introduction to two-phase learning, an approach to unbalanced classes in real-world problems.
Distributed Forecast of 1M Time Series in Under 15 Minutes with Spark, Nixtla, and Fugue
A blog post that shows how you can leverage the distributed power of Spark and the highly efficient code from StatsForecast to fit millions of models in a couple of minutes.
Why Deep Learning Underperforms with Tabular Data
An article discussing recent research into boosting performance of deep learning models on tabular data.
Large Scale Training with VISSL Training (mixed precision, LARC, ZeRO etc)
A Colab tutorial that guides you through the configurations for large scale training with FAIR’s VISSL.
Multi-task Learning for Related Products Recommendations at Pinterest
An article that explains how Pinterest uses multi-task learning, calibration, and Bayesian optimization to build a flexible, interpretable, and scalable candidate ranking solution for Related Products recommendations.
How robust are pre-trained object detection ML models like YOLO or DETR?
A test of the robustness of state-of-the-art computer vision models to assess their generalization ability.
Physics Informed Neural Networks (PINNs): An Intuitive Guide
An article that covers how a PINN works, along with the trade-offs and differences between PINNs, pure data-driven neural networks, and pure physics functions.
Libraries & Code
Machine Learning and Data Science Applications in Industry
A curated list of applied machine learning and data science notebooks and libraries across different industries.
LaTeX code for drawing neural networks for reports and presentations.
Papers & Publications
DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models
With recent advancements in diffusion models, users can generate high-quality images by writing text prompts in natural language. However, generating images with desired details requires proper prompts, and it is often unclear how a model reacts to different prompts and what the best prompts are. To help researchers tackle these critical challenges, we introduce DiffusionDB, the first large-scale text-to-image prompt dataset. DiffusionDB contains 2 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. We analyze prompts in the dataset and discuss key properties of these prompts. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models.
Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform
We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed.
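For readers unfamiliar with the real-time factor (RTF) quoted in the abstract, here is a small Python sketch of the arithmetic. It assumes the common definition of RTF as synthesis time divided by the duration of the generated audio, and it assumes "4.1 times faster" is a ratio of RTFs. The implied VITS figure is derived under those assumptions and is not a number stated in the abstract.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF under the usual definition: compute time per second of audio."""
    return synthesis_seconds / audio_seconds


# Figures quoted in the abstract.
reported_rtf = 0.066   # proposed model on an Intel Core i7 CPU
reported_speedup = 4.1  # relative to VITS

# Assumption: the speedup is a ratio of RTFs, so the VITS RTF is implied.
implied_vits_rtf = reported_rtf * reported_speedup

# Time needed to synthesize a 10-second utterance at the reported RTF.
seconds_for_10s_clip = reported_rtf * 10.0

print(f"implied VITS RTF: {implied_vits_rtf:.3f}")
print(f"time for a 10 s clip: {seconds_for_10s_clip:.2f} s")
```

An RTF below 1.0 means speech is generated faster than it plays back.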
Monolith: Real Time Recommendation System With Collisionless Embedding Table
Building a scalable and real-time recommendation system is vital for many businesses driven by time-sensitive customer feedback, such as short-video ranking or online ads. Despite the ubiquitous adoption of production-scale deep learning frameworks like TensorFlow or PyTorch, these general-purpose frameworks fall short of business demands in recommendation scenarios for various reasons: on one hand, tweaking systems based on static parameters and dense computations for recommendation with dynamic and sparse features is detrimental to model quality; on the other hand, such frameworks are designed with batch-training stage and serving stage completely separated, preventing the model from interacting with customer feedback in real-time. These issues led us to reexamine traditional approaches and explore radically different design choices. In this paper, we present Monolith, a system tailored for online training. Our design has been driven by observations of our application workloads and production environment that reflect a marked departure from other recommendation systems. Our contributions are manifold: first, we crafted a collisionless embedding table with optimizations such as expirable embeddings and frequency filtering to reduce its memory footprint; second, we provide a production-ready online training architecture with high fault-tolerance; finally, we proved that system reliability could be traded off for real-time learning. Monolith has successfully landed in the BytePlus Recommend product.
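To make the three embedding-table ideas in this abstract concrete (no collisions, frequency filtering, expirable embeddings), here is a minimal Python sketch. It is not Monolith's implementation: a plain `dict` keyed by the raw feature ID stands in for the collisionless table, and the class name, `min_count`, `ttl`, and the initialization range are all hypothetical choices made for illustration.

```python
import random


class CollisionlessEmbeddingTable:
    """Toy table: one row per raw feature ID, so two IDs never share a row."""

    def __init__(self, dim, min_count=2, ttl=100, seed=0):
        self.dim = dim
        self.min_count = min_count  # occurrences required before admission
        self.ttl = ttl              # steps an unused row may stay in memory
        self._rng = random.Random(seed)
        self._rows = {}        # feature ID -> embedding vector
        self._last_seen = {}   # feature ID -> step of the last lookup
        self._counts = {}      # feature ID -> occurrences before admission

    def lookup(self, feature_id, step):
        """Return the row for feature_id, or None if it is still filtered."""
        row = self._rows.get(feature_id)
        if row is None:
            # Frequency filtering: rare IDs do not get a row yet.
            count = self._counts.get(feature_id, 0) + 1
            if count < self.min_count:
                self._counts[feature_id] = count
                return None
            self._counts.pop(feature_id, None)
            row = [self._rng.uniform(-0.01, 0.01) for _ in range(self.dim)]
            self._rows[feature_id] = row
        self._last_seen[feature_id] = step
        return row

    def expire(self, step):
        """Expirable embeddings: drop rows unused for more than ttl steps."""
        stale = [fid for fid, seen in self._last_seen.items()
                 if step - seen > self.ttl]
        for fid in stale:
            del self._rows[fid]
            del self._last_seen[fid]
        return len(stale)


table = CollisionlessEmbeddingTable(dim=4, min_count=2, ttl=10)
first = table.lookup("user_42", step=0)    # filtered: seen only once
second = table.lookup("user_42", step=1)   # admitted on the second sighting
removed = table.expire(step=20)            # row unused for 19 steps is dropped
```

Both filters serve the goal the abstract names, a smaller memory footprint: rare IDs never allocate a row, and inactive IDs eventually release theirs.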