Deep Learning Weekly: Issue #233
Google's new Task-level Mixture-of-Experts method, a containerized inference solution for transformers, post-training quantization with TensorFlow Lite, and more
This week in deep learning, we bring you Google's new method called Task-level Mixture-of-Experts (TaskMoE), a case study on a containerized inference solution for transformers, a TinyML article on post-training quantization and quantization-aware training with TensorFlow Lite, and a paper on scaling vision with a sparse mixture of experts.
You may also enjoy the newest release of the CUDA 11.6 Toolkit, an article on how machine learning teams use CI/CD, a collection of notebooks for StyleGAN3 and CLIP, a paper on unified transformers for efficient spatiotemporal representation learning, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Google introduces a new method called Task-level Mixture-of-Experts (TaskMoE), which takes advantage of the quality gains of model scaling while still being efficient to serve, by extracting subnetworks from a large multi-task model.
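The core idea of TaskMoE (route all inputs of a task to the same experts, so a per-task subnetwork can be extracted and served alone) can be sketched as below. This is an illustrative toy, not Google's implementation: the experts and the task-to-expert table are hypothetical stand-ins for what the real method learns during multi-task training.

```python
# Toy sketch of task-level routing: every token of a task goes through that
# task's expert, so the expert (a subnetwork) can be extracted for serving.
from typing import Callable, Dict, List

Expert = Callable[[float], float]

# Hypothetical experts; in the real model these are large learned subnetworks.
experts: List[Expert] = [
    lambda x: x * 2.0,   # expert 0
    lambda x: x + 1.0,   # expert 1
]

# In TaskMoE this assignment is learned; hard-coded here for illustration.
task_to_expert: Dict[str, int] = {"translation": 0, "summarization": 1}

def task_moe_forward(task: str, tokens: List[float]) -> List[float]:
    """Route all tokens of a task through that task's expert."""
    expert = experts[task_to_expert[task]]
    return [expert(t) for t in tokens]

def extract_subnetwork(task: str) -> Expert:
    """Serving-time step: keep only the expert this task uses, so inference
    never touches the rest of the large multi-task model."""
    return experts[task_to_expert[task]]
```

Because routing depends only on the task, not on each token, the extracted subnetwork is all that is needed at inference time, which is where the serving-efficiency gain comes from.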
Hume AI claims to have developed datasets and models that “respond beneficially to cues of [human] emotions,” enabling customers to identify emotions from a person’s facial, vocal, and verbal expressions.
NVIDIA announces the newest release of the CUDA development environment, CUDA 11.6. With this, CUDA continues to push the boundaries of GPU acceleration for deep learning and other use cases.
Search API platform Algolia announced that Walgreens has joined its customer base and will deploy the solution alongside the Microsoft Azure Cloud platform to help improve the search experience of its customers.
Distributed machine learning startup Boosted.ai revealed $35 million in new funding to scale up its web-based platform that brings explainable machine learning tools to investment managers.
A blog post presenting detailed performance results for Infinity, a containerized inference solution for Transformers aimed at optimal cost, efficiency, and latency.
An article laying out how 5 different teams are using CI/CD concepts, tools, and techniques to build and deploy their machine learning applications.
A step-by-step guide (with code) on Comet Artifacts, a tool that provides Machine Learning teams with a convenient way to log, version, browse, and access data from all parts of their experimentation pipelines.
A technical blog (with code) showing how to construct an end-to-end pipeline and train a complete model using Neo4j and Vertex AI.
As teams ramp up their AI/ML capabilities, the biggest challenges they face center on deploying their models into production. The teams from Pachyderm and Comet will unpack these challenges in this upcoming webinar.
A technical TinyML article on post-training quantization, quantization-aware training, weight pruning, weight clustering, and collaborative optimization using TensorFlow Lite and the TensorFlow Model Optimization Toolkit.
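Post-training quantization maps float values to int8 using a scale and a zero point. The TensorFlow Lite converter performs this automatically; the minimal sketch below only shows the underlying affine (asymmetric) quantization arithmetic, with hand-picked ranges for illustration.

```python
# Affine int8 quantization: q = round(x / scale) + zero_point, clamped to
# [-128, 127]. This is the arithmetic TFLite's converter applies per tensor.
def quant_params(xmin: float, xmax: float, qmin: int = -128, qmax: int = 127):
    """Derive scale and zero point so [xmin, xmax] maps onto [qmin, qmax]."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)  # range must contain zero
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp into the int8 range

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale
```

Round-tripping a value through `quantize`/`dequantize` introduces an error of at most one scale step, which is the precision/size trade-off post-training quantization makes.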
A technical guide on how to easily deploy GPT-J using Amazon SageMaker and the Hugging Face Inference Toolkit with a few lines of code.
A tutorial on how to set up and use Azure Percept and Edge Impulse to create a highly capable and secure edge object detection solution.
A science journalist’s comprehensive article on what she has learned about AI and cognitive science, including why people fear AI for the wrong reasons, why deep learning can be hard to explain, and other topics.
Libraries & Code
A collection of Jupyter notebooks made to easily play with StyleGAN3 and CLIP for text-guided image generation.
An interactive visualization demo using Meta’s pre-trained PyTorch implementation of MAE models.
Papers & Publications
Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
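The sparsity in MoE layers like V-MoE comes from a gate that activates only the top-k experts per input, so compute scales with k rather than with the total expert count. A toy top-k gate in pure Python (not the paper's batch-priority router, whose cross-batch prioritization is the extension the abstract describes) looks like:

```python
import math
from typing import Dict, List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_gate(logits: List[float], k: int = 2) -> Dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their weights.
    Only the selected experts run on this input; the rest are skipped,
    which is what makes the layer 'sparse'."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}
```

The layer's output is then the weighted sum of the chosen experts' outputs; with k fixed, adding more experts grows capacity without growing per-input compute.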
It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having the limitation on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy. Different from traditional transformers, our relation aggregator can tackle both spatiotemporal redundancy and dependency, by learning local and global token affinity respectively in shallow and deep layers. We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods. For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60.9% and 71.2% top-1 accuracy respectively.
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.