Deep Learning Weekly: Issue #263
Google's new AI advancements for its search engine, GNN-based real time fraud detection using Deep Graph Library, hitchhiker's guide to score-based generative modeling, and more
This week in deep learning, we bring you Google's new AI advancements for its search engine, GNN-based real time fraud detection using Deep Graph Library, hitchhiker's guide to score-based generative modeling, and a paper on neural human radiance fields from a single video.
You may also enjoy Meta's modular framework for neural implicit representations, deploying Hugging Face ViT on Kubernetes with TF Serving, a mathematical deep dive on semi-supervised learning, a paper on speech enhancement and dereverberation with diffusion-based generative models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Google has updated its search engine with the new Multitask Unified Model (MUM) and other advancements that will help improve the accuracy of search results.
Stability AI announces the first stage of the release of Stable Diffusion to researchers.
Meta AI is releasing Implicitron, a modular framework within its popular open-source PyTorch3D library, created to advance research on implicit neural representations.
OpenAI introduces a new-and-improved content moderation tool: The Moderation endpoint improves upon the previous content filter, and is available for free to OpenAI API developers.
The University of Washington is making more space for AI. A new $90 million Interdisciplinary Engineering Building will be the new home for its AI Education Institute.
A podcast with Hamza Tahir and Adam Probst, co-creators of ZenML, on the state of MLOps and tools for productionizing machine learning pipelines.
An article that describes some of the current experiment tracking tools, including TensorBoard, MLflow, and Neptune.ai, with a focus on using them with Kubeflow Pipelines, a popular framework that runs on Kubernetes.
A post showing how to use Amazon Neptune, Amazon SageMaker, and the Deep Graph Library (DGL), among other AWS services, to construct an end-to-end solution for real-time fraud detection using GNN models.
A technical blog on how to scale your local deployment of a Hugging Face ViT with Docker and Kubernetes.
The easiest way to deploy from MLflow to SageMaker.
An introductory article on Hugging Face’s skops, a new library that allows you to host your scikit-learn models on the Hugging Face Hub, create model cards for model documentation, and collaborate with others.
A hitchhiker's guide to score-based generative models, a family of approaches based on estimating gradients of the data distribution.
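For readers new to the idea, here is a minimal sketch of what "gradients of the data distribution" means in practice. It is not taken from the linked guide. It assumes a toy 1-D Gaussian whose score (the derivative of the log-density) is known in closed form, and draws samples with unadjusted Langevin dynamics. The names `gaussian_score` and `langevin_sample`, the step size, and the step count are all illustrative choices. In a real score-based model, a neural network trained by score matching would stand in for the analytic score.

```python
import random


def gaussian_score(x, mean, std):
    """Score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / (std ** 2)


def langevin_sample(score_fn, n_steps=1000, step_size=0.01, x0=0.0, rng=None):
    """Draw one approximate sample using unadjusted Langevin dynamics.

    Each step moves along the score (towards higher density) and adds
    Gaussian noise scaled by sqrt(step_size).
    """
    rng = rng or random.Random()
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * noise
    return x


# Toy target (an assumption for illustration): a Gaussian with mean 3, std 1.
rng = random.Random(0)
samples = [
    langevin_sample(lambda x: gaussian_score(x, 3.0, 1.0), rng=rng)
    for _ in range(500)
]
sample_mean = sum(samples) / len(samples)
print(f"mean of Langevin samples: {sample_mean:.2f}")
```

The sampler never evaluates the density itself, only its gradient, which is why estimating the score is enough to generate data.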
A mathematical deep dive on semi-supervised learning.
Libraries & Code
A repository that aims to map the ecosystem of artificial intelligence guidelines, principles, codes of ethics, standards, regulation and beyond.
An open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the General Language Model (GLM) algorithm.
NVIDIA Kaolin Wisp is a PyTorch library powered by NVIDIA Kaolin Core to work with neural fields (including NeRFs, NGLOD, instant-ngp and VQAD).
Papers & Publications
Photorealistic rendering and reposing of humans is important for enabling augmented reality experiences. We propose a novel framework to reconstruct the human and the scene that can be rendered with novel human poses and views from just a single in-the-wild video. Given a video captured by a moving camera, we train two NeRF models: a human NeRF model and a scene NeRF model. To train these models, we rely on existing methods to estimate the rough geometry of the human and the scene. Those rough geometry estimates allow us to create a warping field from the observation space to the canonical pose-independent space, where we train the human model in. Our method is able to learn subject specific details, including cloth wrinkles and accessories, from just a 10 second video clip, and to provide high quality renderings of the human under novel poses, from novel views, together with the background.
Self-attention based transformer models have been dominating many computer vision tasks in the past few years. Their superb model qualities heavily depend on the excessively large labeled image datasets. In order to reduce the reliance on large labeled datasets, reconstruction based masked autoencoders are gaining popularity, which learn high quality transferable representations from unlabeled images. For the same purpose, recent weakly supervised image pre-training methods explore language supervision from text captions accompanying the images. In this work, we propose masked image pre-training on language assisted representation, dubbed as MILAN. Instead of predicting raw pixels or low level features, our pre-training objective is to reconstruct the image features with substantial semantic signals that are obtained using caption supervision. Moreover, to accommodate our reconstruction target, we propose a more efficient prompting decoder architecture and a semantic aware mask sampling mechanism, which further advance the transfer performance of the pre-trained model. Experimental results demonstrate that MILAN delivers higher accuracy than the previous works. When the masked autoencoder is pre-trained and fine-tuned on ImageNet-1K dataset with an input resolution of 224x224, MILAN achieves a top-1 accuracy of 85.4% on ViT-B/16, surpassing previous state-of-the-arts by 1%. In the downstream semantic segmentation task, MILAN achieves 52.7 mIoU using ViT-B/16 backbone on ADE20K dataset, outperforming previous masked pre-training results by 4 points.
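To make the masking step in masked image pre-training concrete, here is a rough sketch of the simplest variant: uniform random patch masking. This is not MILAN's semantic-aware sampling, which the abstract does not specify in detail. The 224x224 resolution and 16-pixel patches follow the ViT-B/16 setup quoted above; the 75% mask ratio and the function name `random_patch_mask` are assumptions made for illustration.

```python
import random


def random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75, seed=None):
    """Return a list of booleans, one per patch; True means the patch is masked.

    Patches are chosen uniformly at random, without replacement.
    """
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    num_masked = int(num_patches * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


mask = random_patch_mask(seed=0)
print(f"{sum(mask)} of {len(mask)} patches masked")
```

In this setup the encoder would see only the unmasked patches, and the decoder would be trained to reconstruct a target for the masked ones (raw pixels in a plain masked autoencoder, caption-supervised image features in MILAN).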
Recently, diffusion-based generative models have been introduced to the task of speech enhancement. The corruption of clean speech is modeled as a fixed forward process in which increasing amounts of noise are gradually added. By learning to reverse this process in an iterative fashion conditioned on the noisy input, clean speech is generated. We build upon our previous work and derive the training task within the formalism of stochastic differential equations. We present a detailed theoretical review of the underlying score matching objective and explore different sampler configurations for solving the reverse process at test time. By using a sophisticated network architecture from natural image generation literature, we significantly improve performance compared to our previous publication. We also show that we can compete with recent discriminative models and achieve better generalization when evaluating on a different corpus than used for training. We complement the evaluation results with a subjective listening test, in which our proposed method is rated best. Furthermore, we show that the proposed method achieves remarkable state-of-the-art performance in single-channel speech dereverberation.