Deep Learning Weekly: Issue #275
Meta AI's neural theorem prover that has solved 10 IMO problems, partial blockout experiments at Booking.com, fine-tuning Whisper for Multilingual ASR with Hugging Face Transformers, and more.
This week in deep learning, we bring you Meta AI's neural theorem prover that has solved 10 IMO problems, partial blockout experiments at Booking.com, fine-tuning Whisper for Multilingual ASR with Hugging Face Transformers, and a paper on Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models.
You may also enjoy the public API of DALL·E, an end-to-end active learning pipeline using DVC, MLflow, Label Studio, and DagsHub, a data lake for deep learning, a paper on Strong-TransCenter: Improved Multi-Object Tracking based on Transformers with Dense Representations, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Google announces a number of internal research projects focused on exploring new AI-powered applications for inclusive language initiatives, empowerment of artists, and climate solutions.
Skycatch, a San Francisco-based vision startup that helps companies mine data and minerals, is now digging into the creation of digital twins.
Meta AI has built a neural theorem prover that has solved 10 International Math Olympiad (IMO) problems — 5x more than any previous AI system.
TorchVision extends its Transforms API to Object Detection, Segmentation, and Video Tasks.
Developers can now integrate DALL·E directly into their apps and products through a public API.
An in-depth article that compares the three task orchestrators: Argo, Airflow, and Prefect.
An article that introduces partial blockout experiments for a two-sided marketplace such as Booking.com.
How to use GitLab, Heroku, and Ruby to quickly create a seamless CI/CD pipeline.
This post walks you through the process of downloading, optimizing, and deploying a 1.3 billion parameter GPT-3 model using NeMo Megatron.
This 2-part tutorial will teach you how to implement an active learning pipeline using open source tools, such as MLflow, Label Studio, and DVC.
An article that explores how to prepare data for time series models in Keras and use it to train a time series model to predict the price of items.
In this blog, we present a step-by-step guide on fine-tuning Whisper for any multilingual ASR dataset using Hugging Face Transformers.
An article that describes how IBM uses synthetic data to speed up the training of AI models, protect sensitive data, improve accuracy, or find and mitigate bias and security weaknesses.
A comprehensive dive into the world of sketch-based computer vision.
Domino shares their experience from the BMS molecular translation challenge and shows that CPU-based distributed learning can significantly shorten model training times.
Libraries & Code
Deep Lake (formerly known as Activeloop Hub) is a data lake for deep learning applications.
Ax is an accessible, general-purpose platform for understanding, managing, deploying, and automating adaptive experiments.
dstack is a lightweight command-line utility that lets you run ML workflows in the cloud, while keeping them highly reproducible.
Papers & Publications
During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image. This motivates us to cache and reuse the feature maps of the original image. Given an edited image, we sparsely apply the convolutional filters to the edited regions while reusing the cached features for the unedited regions. Based on our algorithm, we further propose Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on off-the-shelf hardware. With 1.2%-area edited regions, our method reduces the computation of DDIM by 7.5× and GauGAN by 18× while preserving the visual fidelity. With SIGE, we accelerate the speed of DDIM by 3.0× on RTX 3090 and 6.6× on Apple M1 Pro CPU, and GauGAN by 4.2× on RTX 3090 and 14× on Apple M1 Pro CPU.
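The caching idea in this abstract can be illustrated with a toy example. The sketch below is not the authors' SIGE implementation: it assumes a single-channel image, one odd-sized convolution kernel, and plain NumPy loops, and the function names `conv2d_same` and `sparse_update` are hypothetical. It only shows that recomputing the outputs whose receptive field touches an edited pixel, and reusing the cached output everywhere else, reproduces the dense result.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Dense 'same' 2D cross-correlation with zero padding (odd kernel sizes)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def sparse_update(original, edited, cached_out, kernel):
    """Reuse cached outputs; recompute only where the receptive field changed."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    changed = np.pad(original != edited, ((ph, ph), (pw, pw)))
    padded = np.pad(edited, ((ph, ph), (pw, pw)))
    out = cached_out.copy()
    recomputed = 0
    for i in range(edited.shape[0]):
        for j in range(edited.shape[1]):
            # An output pixel is stale only if some input it reads was edited.
            if changed[i:i + kh, j:j + kw].any():
                out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
                recomputed += 1
    return out, recomputed

# Usage: cache the original output once, then apply a small local edit.
rng = np.random.default_rng(0)
original = rng.random((16, 16))
kernel = rng.random((3, 3))
cached = conv2d_same(original, kernel)
edited = original.copy()
edited[5:7, 5:7] += 1.0  # a 2x2 edit in the interior of the image
sparse_out, n_recomputed = sparse_update(original, edited, cached, kernel)
```

In a real network the same bookkeeping would have to be carried through every layer, and the abstract notes that turning the saved computation into lower latency needs a dedicated engine, which this toy loop does not attempt.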
Diffusion probabilistic models (DPMs) have achieved impressive success in high-resolution image synthesis, especially in recent large-scale text-to-image generation applications. An essential technique for improving the sample quality of DPMs is guided sampling, which usually needs a large guidance scale to obtain the best sample quality. The commonly-used fast sampler for guided sampling is DDIM, a first-order diffusion ODE solver that generally needs 100 to 250 steps for high-quality samples. Although recent works propose dedicated high-order solvers and achieve a further speedup for sampling without guidance, their effectiveness for guided sampling has not been well-tested before. In this work, we demonstrate that previous high-order fast samplers suffer from instability issues, and they even become slower than DDIM when the guidance scale grows large. To further speed up guided sampling, we propose DPM-Solver++, a high-order solver for the guided sampling of DPMs. DPM-Solver++ solves the diffusion ODE with the data prediction model and adopts thresholding methods to keep the solution matched to the training data distribution. We further propose a multistep variant of DPM-Solver++ to address the instability issue by reducing the effective step size. Experiments show that DPM-Solver++ can generate high-quality samples within only 15 to 20 steps for guided sampling by pixel-space and latent-space DPMs.
Transformer networks have been a focus of research in many fields in recent years, being able to surpass the state-of-the-art performance in different computer vision tasks. A few attempts have been made to apply this method to the task of Multiple Object Tracking (MOT). Among those, the state of the art was TransCenter, a transformer-based MOT architecture with dense object queries for accurately tracking all the objects while keeping reasonable runtime. TransCenter is the first center-based transformer framework for MOT, and is also among the first to show the benefits of using transformer-based architectures for MOT. In this paper we show an improvement to this tracker using a post-processing mechanism based on the Track-by-Detection paradigm: motion model estimation using a Kalman filter and target re-identification using an embedding network. Our new tracker shows significant improvements in the IDF1 and HOTA metrics and comparable results on the MOTA metric (70.9%, 59.8% and 75.8% respectively) on the MOTChallenge MOT17 test dataset and improvement on all 3 metrics (67.5%, 56.3% and 73.0%) on the MOT20 test dataset. Our tracker is currently ranked first among transformer-based trackers in these datasets.
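For readers unfamiliar with the motion model mentioned in this abstract, here is a minimal sketch of a Kalman filter tracking one object center. The abstract does not describe the paper's state layout or noise settings, so everything here is an assumption for illustration: a generic constant-velocity model with state (x, y, vx, vy), hand-picked noise variances, and a hypothetical class name `ConstantVelocityKalman`.

```python
import numpy as np

class ConstantVelocityKalman:
    """Generic constant-velocity Kalman filter for a 2D object center."""

    def __init__(self, x, y, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.state = np.array([x, y, 0.0, 0.0])   # [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                 # initial state uncertainty
        self.F = np.eye(4)                        # state transition matrix
        self.F[0, 2] = self.F[1, 3] = dt          # position += velocity * dt
        self.H = np.zeros((2, 4))                 # only position is observed
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * process_var          # process noise (assumed)
        self.R = np.eye(2) * meas_var             # measurement noise (assumed)

    def predict(self):
        """Propagate the state one step ahead; returns the predicted center."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, measurement):
        """Correct the prediction with a detected center (x, y)."""
        z = np.asarray(measurement, dtype=float)
        innovation = z - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P

# Usage: follow a detection that moves 2 px right and 1 px down per frame.
kf = ConstantVelocityKalman(0.0, 0.0)
for t in range(1, 31):
    kf.predict()
    kf.update((2.0 * t, 1.0 * t))
next_center = kf.predict()  # predicted center for the next frame
```

In a track-by-detection pipeline, a predicted center like `next_center` is typically what gets matched against the next frame's detections, with an appearance embedding used to re-identify targets when motion alone is ambiguous.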