Deep Learning Weekly: Issue #259
Meta's Make-A-Scene, AI Infrastructure Ecosystem Report of 2022, dynamic adversarial data collection, a paper on efficient representation learning via adaptive context pooling, and many more
This week in deep learning, we bring you Meta's multimodal generative method with higher creative control called Make-A-Scene, AI Infrastructure Ecosystem Report of 2022, dynamic adversarial data collection, and a paper on efficient representation learning via adaptive context pooling.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Meta AI showcases Make-A-Scene, a multimodal generative method with higher creative control that demonstrates AI’s potential for empowering anyone to bring their imagination to life.
Google has a strong presence at this year’s conference with over 100 accepted publications and active involvement in a number of workshops and tutorials.
OpenAI implemented a new technique so that DALL·E generates images of people that more accurately reflect the diversity of the world’s population.
Microsoft has launched a platform to train the artificial intelligence systems of autonomous aircraft.
A comprehensive tutorial on a Binance time-series-based machine learning project and its MLOps pipeline.
A post showing how your MLOps team can improve productivity and reduce time to detect issues for your SageMaker models by integrating with the Fiddler Model Performance Management Platform.
AI Infrastructure Alliance’s first annual AI Infrastructure Ecosystem report highlighting stack maturity questions and other relevant information.
An article describing feature stores and the top frameworks for deploying them.
A case study on a leading provider of AI-powered technology tools and services for pathology, and how they leverage image segmentation, graph neural networks, and multiple instance learning.
A list of various machine learning libraries and how they’ve changed the machine learning landscape.
A technical tutorial showing how to use Better Transformer for production inference with torchtext.
A blog describing dynamic adversarial data collection, along with a basic code example.
BAIR presents a concrete analysis showing that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, value decomposition can be problematic. By contrast, policy gradient methods with individual policies can converge to an optimal policy.
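The failure mode can be seen in a toy cooperative matrix game (a hypothetical illustration, not an example taken from the BAIR post): when the payoff landscape is multi-modal, even the best additive decomposition q1(a1) + q2(a2) of the joint value steers both agents toward a suboptimal joint action.

```python
import numpy as np

# Toy two-agent cooperative matrix game with a multi-modal payoff:
# one high-reward mode at (0, 0) surrounded by heavy penalties, and a
# broad mediocre plateau elsewhere. (Hypothetical example for illustration.)
payoff = np.array([
    [  8.0, -12.0, -12.0],
    [-12.0,   0.0,   0.0],
    [-12.0,   0.0,   0.0],
])

# Value decomposition (VDN-style): approximate payoff[a1, a2] by
# q1[a1] + q2[a2]. Fit the best additive approximation by least squares.
n = payoff.shape[0]
A = np.zeros((n * n, 2 * n))
b = payoff.flatten()
for a1 in range(n):
    for a2 in range(n):
        A[a1 * n + a2, a1] = 1.0      # indicator for agent 1's action
        A[a1 * n + a2, n + a2] = 1.0  # indicator for agent 2's action
q, *_ = np.linalg.lstsq(A, b, rcond=None)
q1, q2 = q[:n], q[n:]

# Each agent acts greedily on its own factored value.
greedy = (int(np.argmax(q1)), int(np.argmax(q2)))
optimal = np.unravel_index(int(np.argmax(payoff)), payoff.shape)

print("greedy under decomposition payoff:", payoff[greedy])
print("optimal joint payoff:", payoff[optimal])
```

The averaging inherent in the additive fit drags the factored values for action 0 down (its row and column are dominated by the -12 penalties), so the greedy joint action lands on the 0-payoff plateau rather than the 8-payoff mode, matching the post's point that value decomposition can miss optima in multi-modal reward landscapes.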
An article that sheds some light on the technology and engineering behind the training, both in terms of hardware and software, of the 176B parameter language model called BLOOM.
Libraries & Code
An open source no-code system for text annotation and building text classifiers.
TensorFlow Lattice is a library that implements constrained and interpretable lattice-based models. It is an implementation of Monotonic Calibrated Interpolated Look-Up Tables in TensorFlow.
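The core constraint behind such models can be sketched in plain numpy (an illustrative simplification, not TensorFlow Lattice's actual API, which learns these parameters with trainable layers): a piecewise-linear calibrator stays monotonic if every segment increment is forced to be non-negative.

```python
import numpy as np

def monotonic_pwl_calibrate(x, keypoints, raw_heights):
    """Map x through a monotonic piecewise-linear calibrator.

    keypoints: sorted input locations of the calibrator.
    raw_heights: unconstrained parameters; squaring them makes each
    segment's increment non-negative, which guarantees monotonicity.
    """
    increments = np.square(raw_heights)              # non-negative steps
    outputs = np.concatenate([[0.0], np.cumsum(increments)])
    return np.interp(x, keypoints, outputs)          # linear interpolation

keypoints = np.array([0.0, 1.0, 2.0, 3.0])
raw = np.array([1.0, 0.5, 2.0])                      # hypothetical learned params
xs = np.linspace(0.0, 3.0, 50)
ys = monotonic_pwl_calibrate(xs, keypoints, raw)
assert np.all(np.diff(ys) >= 0)                      # output is monotonic
```

Reparameterizing the increments rather than clipping after training is what makes the monotonicity constraint hold by construction, which is the property that gives calibrated lattice models their interpretability guarantees.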
Papers & Publications
We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters. Because of the fragmented definitions of the object tracking problem itself, most existing trackers are developed to address a single task or a subset of tasks and overspecialize on the characteristics of specific tasks. By contrast, Unicorn provides a unified solution, adopting the same input, backbone, embedding, and head across all tracking tasks. For the first time, we unify the tracking network architecture and learning paradigm. Unicorn performs on par with or better than its task-specific counterparts on eight tracking datasets, including LaSOT, TrackingNet, MOT17, BDD100K, DAVIS16-17, MOTS20, and BDD100K MOTS. We believe that Unicorn will serve as a solid step towards a general vision model.
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically only uses one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets.
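The consolidation idea can be sketched with a toy routine (a hypothetical simplification for intuition, not the paper's actual potentiation algorithm): when the working memory outgrows its budget, the most actively used entries are promoted into the compact long-term store instead of being discarded.

```python
import numpy as np

def consolidate(working, usage, long_term, budget, top_k):
    """Promote the top_k most-used working-memory keys to long-term
    memory once the working memory exceeds its budget.

    working: (N, d) working-memory feature keys.
    usage: (N,) how often each entry was read by memory lookups.
    Simplified: real consolidation would also compress the promoted
    entries rather than copy them verbatim.
    """
    if working.shape[0] <= budget:
        return working, usage, long_term
    order = np.argsort(usage)[::-1]                  # most-used first
    promoted = working[order[:top_k]]                # heavy hitters
    long_term = np.vstack([long_term, promoted])     # grow compact store
    remain = np.sort(order[top_k:])                  # keep original frame order
    return working[remain], usage[remain], long_term

# Hypothetical 6-entry working memory with 2-dim keys.
working = np.arange(12, dtype=float).reshape(6, 2)
usage = np.array([5, 1, 9, 2, 7, 0])
long_term = np.empty((0, 2))
w, u, lt = consolidate(working, usage, long_term, budget=4, top_k=2)
```

Bounding the working memory this way is what decouples memory consumption from video length: the long-term store grows slowly with only the consolidated entries, avoiding the memory explosion the abstract describes.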
Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets that are combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context with varying scale. We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.
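The mechanism can be sketched in a few lines of numpy (an illustrative simplification, not the paper's implementation: here the per-token support sizes and pooling weights are given directly, whereas ContextPool predicts them with small learned heads): each token's feature is replaced by a weighted average over an adaptive local window before attention is computed.

```python
import numpy as np

def context_pool(x, support, weights):
    """Pool neighboring features for each token before attention.

    x: (T, d) token features.
    support: (T,) half-window size per token (adaptive granularity).
    weights: (T, T) raw per-token pooling logits; positions outside
    each token's window are masked out before softmax normalization.
    """
    T = x.shape[0]
    idx = np.arange(T)
    mask = np.abs(idx[None, :] - idx[:, None]) <= support[:, None]
    logits = np.where(mask, weights, -np.inf)        # restrict to window
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # normalized pooling weights
    return attn @ x   # pooled features, fed to standard attention afterwards

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
support = np.array([0, 1, 1, 2, 2, 3])   # per-token context size
weights = np.zeros((6, 6))               # uniform pooling inside each window
pooled = context_pool(x, support, weights)
```

A token with support 0 keeps its own feature, while larger supports blend in more context, so the attention layer that follows operates on features whose granularity varies per token rather than being fixed at the individual-token level.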