Deep Learning Weekly: Issue #252
Google's Imagen, deep learning for human action recognition, fast path from a notebook to a deployed model, AI toolkit for healthcare imaging, a paper on contrastive captioners, and many more
You may also enjoy Future of Life Institute's world-building competition for superintelligent AI, common mistakes when using Tensorflow Serving with Docker, deep learning for human action recognition, a paper on inception transformers, and more.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Google presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
Projects from power generation to smart meters are embracing machine learning, moving toward a green, resilient and smart grid.
The Future of Life Institute, a nonprofit that focuses on existential threats to humanity, organized a world-building competition for positive futures with superintelligent AI.
Graphcore and Hugging Face have significantly expanded the range of Machine Learning modalities and tasks available in Hugging Face Optimum, an open-source library for Transformers performance optimization.
A tutorial on wrapping a Comet experiment in a Docker image for best practices.
A technical article highlighting the common mistakes that data scientists make when they serve machine learning models using Tensorflow Serving and Docker.
A tutorial discussing the steps on how to go from an experimental notebook to a deployed model on Google Cloud Platform.
A demonstration on how to use Amazon Neptune ML to detect fake news based on the content and social context of the news on social media.
Google discusses the advantages (and results) of a relaxed label differential privacy algorithm over full differential privacy.
In this tutorial, you will discover how to develop a Multilayer Perceptron neural network model for the cancer survival binary classification datasets.
A comprehensive blog on the challenges and applications of Human Action Recognition with Deep Learning.
Libraries & Code
MONAI is a PyTorch-based, open-source framework for deep learning in healthcare imaging.
PaddleSpeech is an open-source toolkit on PaddlePaddle platform for a variety of critical tasks in speech and audio, with the state-of-art and influential models.
SensorsCalibration is a simple calibration toolbox and open source project, mainly used for sensor calibration in autonomous driving.
Papers & Publications
We propose a new method named OnePose for object pose estimation. Unlike existing instance-level or category-level methods, OnePose does not rely on CAD models and can handle objects in arbitrary categories without instance- or category-specific network training. OnePose draws the idea from visual localization and only requires a simple RGB video scan of the object to build a sparse SfM model of the object. Then, this model is registered to new query images with a generic feature matching network. To mitigate the slow runtime of existing visual localization methods, we propose a new graph attention network that directly matches 2D interest points in the query image with the 3D points in the SfM model, resulting in efficient and robust pose estimation. Combined with a feature-based pose tracker, OnePose is able to stably detect and track 6D poses of everyday household objects in real-time. We also collected a large-scale dataset that consists of 450 sequences of 150 objects.
Recent studies show that Transformer has strong capability of building long-range dependencies, yet is incompetent in capturing high frequencies that predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency information to Transformers. Different from recent hybrid frameworks, the Inception mixer brings greater efficiency through a channel splitting mechanism to adopt parallel convolution/max-pooling path and self-attention path as high- and low-frequency mixers, while having the flexibility to model discriminative information scattered within a wide frequency range. Considering that bottom layers play more roles in capturing high-frequency details while top layers more in modeling low-frequency global information, we further introduce a frequency ramp structure, i.e. gradually decreasing the dimensions fed to the high-frequency mixer and increasing those to the low-frequency mixer, which can effectively trade-off high- and low-frequency components across different layers. We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation. For example, our iFormer-S hits the top-1 accuracy of 83.4% on ImageNet-1K, much higher than DeiT-S by 3.6%, and even slightly better than much bigger model Swin-B (83.3%) with only 1/4 parameters and 1/3 FLOPs.
Exploring large-scale pre-trained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pre-train an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pre-trained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a fine-tuned encoder.