Deep Learning Weekly: Issue #245
OpenAI's new and improved DALL-E 2, understanding your model's predictions with Grad-CAM, Vertex AI model registry and how it works with BigQuery ML, a paper on flow-guided video inpainting, and more.
This week in deep learning, we bring you OpenAI's new and improved DALL-E 2, understanding your model's predictions with Grad-CAM, Vertex AI model registry and how it works with BigQuery ML, and a paper on flow-guided video inpainting.
You may also enjoy Google AI's visually-driven text-to-speech model, Meta AI's first ever external demo of its self-supervised learning work called DINO, Kedro pipelines with Optuna, a paper on large-scale matrix factorization on TPUs, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
OpenAI releases an updated version of its text-to-image generation model DALL-E, with higher resolution and lower latency than the original system.
Meta AI releases its first-ever external demo of DINO, its self-supervised learning work.
In a new study, scientists from the University of California, Irvine have developed a deep learning-based method that may allow infrared vision to recover the visible colors of a scene in the absence of light.
Researchers at MIT and IBM Research have created a method called Shared Interest that enables a user to aggregate, sort, and rank manual evaluations to rapidly analyze a machine learning model’s behavior.
Google AI presents a proof-of-concept visually-driven text-to-speech model, called VDTTS, that automates the dialog replacement process.
An AI algorithm, developed by MIT researchers, can drastically trim back the time needed to iterate designs of a promising new material called the topological insulator.
This article shows how to robustly fulfill dataset level, feature level, and model level experimental requirements using Kedro pipelines and Optuna.
This blog dives into how Model Registry works with BigQuery ML, showcasing the features that allow you to register, version, and easily deploy your BigQuery ML Models to Vertex AI.
An article describing how to deploy a machine learning algorithm in production from scratch using the end-to-end MLOps Vertex AI platform.
A comprehensive article discussing what data lineage is, how to implement it, and which corresponding tools to use when doing so.
A blog on how to choose an enterprise server for deep learning training.
A technical introduction to TensorFlow Privacy, an open-source package for providing privacy while training deep learning models, and how to implement it.
A technical blog on how to use Grad-CAM to interpret a model's decisions.
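For readers who want the gist of Grad-CAM before clicking through, here is a minimal sketch of its core arithmetic. It assumes you have already pulled out the last convolutional layer's activation maps and the gradients of the class score with respect to them (your framework's autograd would supply those); the function name `grad_cam` and the nested-list layout are ours for illustration and are not taken from the linked blog. The final rescaling to [0, 1] is a common display convenience, not part of the method itself.

```python
def grad_cam(activations, gradients):
    """Combine activation maps into one class heatmap.

    activations, gradients: lists of K maps, each a height x width nested list.
    Both are assumed to come from the same convolutional layer.
    """
    n_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # One weight per map: the gradient averaged over all spatial positions.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum of the maps, then ReLU to keep only positive evidence.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Rescale to [0, 1] for display (skipped when the map is all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

A map whose averaged gradient is negative pushes the heatmap down, and the ReLU clips those regions to zero, which is why the result highlights only the areas that support the chosen class.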
Libraries & Code
DGL is an easy-to-use, high performance and scalable Python package for deep learning on graphs.
Apache TVM is a compiler stack for deep learning systems. It is designed to close the gap between the productivity-focused deep learning frameworks and the performance- and efficiency-focused hardware backends.
Papers & Publications
Optical flow, which captures motion information across frames, is exploited in recent video inpainting methods through propagating pixels along its trajectories. However, the hand-crafted flow-based processes in these methods are applied separately to form the whole inpainting pipeline. Thus, these methods are less efficient and rely heavily on the intermediate results from earlier stages. In this paper, we propose an End-to-End framework for Flow-Guided Video Inpainting (E2FGVI) through three elaborately designed trainable modules, namely, flow completion, feature propagation, and content hallucination modules. The three modules correspond with the three stages of previous flow-based methods but can be jointly optimized, leading to a more efficient and effective inpainting process. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods both qualitatively and quantitatively and shows promising efficiency.
In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens." With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show our DaViT achieves state-of-the-art performance on four different tasks with efficient computations. Without extra data, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K with 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K.
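To make the "spatial tokens" versus "channel tokens" distinction concrete, here is a toy sketch in plain Python. It is not the DaViT implementation: it assumes identity query/key/value projections, a single head, and no token grouping (the part the abstract credits for linear complexity), and all function names are ours. The only point is that the same attention routine, applied to the feature matrix or to its transpose, yields an N x N or a C x C attention map.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (an assumption)."""
    dim = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))
    weights = softmax_rows([[s / math.sqrt(dim) for s in row] for row in scores])
    return matmul(weights, tokens), weights


def spatial_attention(x):
    # x has N rows (positions) and C columns (channels); each position is a token.
    return self_attention(x)  # attention weights are N x N


def channel_attention(x):
    # Transposing makes each channel a token whose features span all N positions.
    out_t, weights = self_attention(transpose(x))
    return transpose(out_t), weights  # attention weights are C x C
```

Both functions return an output with the same N x C shape as the input, so they can be alternated in a stack. In this ungrouped toy version the spatial map costs memory quadratic in the number of positions while the channel map is quadratic in the number of channels.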
We present ALX, an open-source library for distributed matrix factorization using Alternating Least Squares, written in JAX. Our design allows for efficient use of the TPU architecture and scales well to matrix factorization problems of O(B) rows/columns by scaling the number of available TPU cores. In order to spur future research on large-scale matrix factorization methods and to illustrate the scalability properties of our own implementation, we also built a real-world web link prediction dataset called WebGraph. This dataset can be easily modeled as a matrix factorization problem. We created several variants of this dataset based on locality and sparsity properties of sub-graphs. The largest variant of WebGraph has around 365M nodes and training a single epoch finishes in about 20 minutes with 256 TPU cores. We include speed and performance numbers of ALX on all variants of WebGraph. Both the framework code and the dataset are open-sourced.
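If Alternating Least Squares is new to you, the following single-machine sketch shows the basic idea on a handful of observed entries. It is not ALX's API and has none of its JAX or TPU sharding: the function names, the squared-error objective with L2 regularization, and the tiny Gaussian-elimination solver are all our assumptions for illustration. Each half-step holds one side's embeddings fixed and solves a small regularized least-squares problem for every row of the other side.

```python
import random


def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def update_side(observed, fixed, n_rows, rank, reg):
    """Re-solve every row embedding while the other side (fixed) stays put."""
    by_row = [[] for _ in range(n_rows)]
    for r, c, v in observed:
        by_row[r].append((c, v))
    new = []
    for entries in by_row:
        # Normal equations: (sum_j f_j f_j^T + reg * I) u = sum_j v_j f_j
        a = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for c, v in entries:
            f = fixed[c]
            for i in range(rank):
                b[i] += v * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        new.append(solve(a, b))
    return new


def objective(observed, rows, cols, reg):
    fit = sum(
        (v - sum(p * q for p, q in zip(rows[r], cols[c]))) ** 2
        for r, c, v in observed
    )
    penalty = sum(x * x for emb in rows + cols for x in emb)
    return fit + reg * penalty


def als(observed, n_rows, n_cols, rank=2, reg=0.1, epochs=10, seed=0):
    """observed is a list of (row, col, value) triples."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    transposed = [(c, r, v) for r, c, v in observed]
    history = []
    for _ in range(epochs):
        rows = update_side(observed, cols, n_rows, rank, reg)
        cols = update_side(transposed, rows, n_cols, rank, reg)
        history.append(objective(observed, rows, cols, reg))
    return rows, cols, history
```

Because each half-step minimizes the regularized objective exactly for one side, the values recorded in `history` should never increase from one epoch to the next. The per-row solves are also independent of one another, which is the property that makes the method a natural fit for spreading work across many cores.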