Deep Learning Weekly: Issue #267
CSAIL's composable diffusion, advanced MLOps pipelines using Seldon Core and Kubeflow, the math behind neural tangent kernels, a paper on multi-axis vision transformers, and many more
This week in deep learning, we bring you CSAIL's composable diffusion, advanced MLOps pipelines using Seldon Core and Kubeflow, the math behind neural tangent kernels, and a paper on multi-axis vision transformers.
You may also enjoy news of the new PyTorch Foundation, a detailed comparison of workflow orchestration tools, an end-to-end framework for search use cases, a paper on extremely simple graph contrastive learning for recommendations, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
U.S. officials ordered NVIDIA to stop exporting two top AI chips to China, a move that could cripple Chinese firms' ability to carry out advanced work like image recognition and hamper NVIDIA's business in the country.
To generate more complex images with better understanding, scientists from CSAIL developed composable diffusion.
To accelerate progress in AI, PyTorch is moving to a new, independent PyTorch Foundation, under the Linux Foundation umbrella.
MLCommons released the latest results of its MLPerf Inference benchmark test, which compares the speed of artificial intelligence systems from different hardware makers.
Canonical Ltd. has announced a key update to Charmed Kubeflow, an end-to-end machine learning operations platform that provides complex model training capabilities.
A post that demonstrates how to drop RAPIDS into a Kubeflow environment. It starts with RAPIDS in the interactive notebook environment, then scales beyond a single container to multiple GPUs across multiple nodes with Dask.
A detailed comparison of task and workflow orchestration tools.
A technical article on how to build reproducible machine learning pipelines using DVC, beyond using it as a version control system.
An article on choosing the best MLOps architecture for your project.
A comprehensive blog post on how to build your own advanced MLOps pipeline using Kubeflow Pipelines, MLflow, and Seldon Core.
A deep dive into the motivation and definition of NTK, as well as the proof of a deterministic convergence at different initializations of neural networks with infinite width by characterizing NTK in such a setting.
In this blog post, you will learn how to train a language model on NVIDIA GPUs in Megatron-LM, and use it with transformers.
In this tutorial, you will discover the Luong attention mechanism for neural machine translation.
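For readers who want a feel for the mechanism before opening the tutorial, here is a minimal sketch of the dot-product variant of Luong-style global attention. It is not taken from the linked tutorial: the function names, the toy vectors, and the choice of the dot score are assumptions made for illustration, and it uses plain Python lists instead of a tensor library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def luong_dot_attention(decoder_state, encoder_states):
    """Dot-score attention: score each encoder state against the current
    decoder state, normalize the scores, and build a context vector as the
    weighted sum of encoder states."""
    scores = [
        sum(d * e for d, e in zip(decoder_state, h_s))
        for h_s in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * h_s[i] for w, h_s in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example (hypothetical values): three source positions, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = luong_dot_attention(decoder_state, encoder_states)
```

The first and third source positions receive equal weight here because their dot products with the decoder state are identical; the second, which is orthogonal to the decoder state, gets the least.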
An article that teaches you how to train your first Offline Decision Transformer model from scratch to make a half-cheetah run.
Libraries & Code
Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases.
MaartenGr/BERTopic
BERTopic is a topic modeling technique that leverages Hugging Face transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
Jumanji is a suite of Reinforcement Learning (RL) environments written in JAX providing clean, hardware-accelerated environments for industry-driven research.
Papers & Publications
Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module.
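The two aspects named in the abstract, blocked local and dilated global attention, can be pictured as two ways of grouping positions on a feature map. The sketch below is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the map is assumed to divide evenly, and only the grouping is shown (no attention is computed).

```python
def block_groups(height, width, block):
    """Blocked local grouping: positions in the same block x block window
    attend to each other. Assumes height and width are divisible by block."""
    groups = {}
    for i in range(height):
        for j in range(width):
            groups.setdefault((i // block, j // block), []).append((i, j))
    return list(groups.values())


def grid_groups(height, width, grid):
    """Dilated global grouping: each group holds grid x grid positions spread
    across the whole map at a fixed stride. Assumes divisibility by grid."""
    stride_h, stride_w = height // grid, width // grid
    groups = {}
    for i in range(height):
        for j in range(width):
            groups.setdefault((i % stride_h, j % stride_w), []).append((i, j))
    return list(groups.values())


# Toy 4x4 feature map with group size 2 along each axis (hypothetical sizes).
blocks = block_groups(4, 4, 2)
grids = grid_groups(4, 4, 2)
```

On the 4x4 toy map, the first block group is the top-left 2x2 window, while the first grid group contains positions (0, 0), (0, 2), (2, 0), and (2, 2), which span the whole map. Because each group has a fixed size, every position attends to a constant number of others, which is one way to read the abstract's linear-complexity claim.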
Contrastive learning (CL) has recently been demonstrated critical in improving recommendation performance. The fundamental idea of CL-based recommendation models is to maximize the consistency between representations learned from different graph augmentations of the user-item bipartite graph. In such a self-supervised manner, CL-based recommendation models are expected to extract general features from the raw data to tackle the data sparsity issue. Despite the effectiveness of this paradigm, we still have no clue what underlies the performance gains. In this paper, we first reveal that CL enhances recommendation through endowing the model with the ability to learn more evenly distributed user/item representations, which can implicitly alleviate the pervasive popularity bias and promote long-tail items. Meanwhile, we find that the graph augmentations, which were considered a necessity in prior studies, are relatively unreliable and less significant in CL-based recommendation. On top of these findings, we put forward an eXtremely Simple Graph Contrastive Learning method (XSimGCL) for recommendation, which discards the ineffective graph augmentations and instead employs a simple yet effective noise-based embedding augmentation to create views for CL. A comprehensive experimental study on three large and highly sparse benchmark datasets demonstrates that, though the proposed method is extremely simple, it can smoothly adjust the uniformity of learned representations and outperforms its graph augmentation-based counterparts by a large margin in both recommendation accuracy and training efficiency.
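To make "noise-based embedding augmentation" concrete, here is a rough sketch of one plausible form: adding a small random perturbation of fixed length to an embedding to create a view for contrastive learning. The abstract does not spell out the noise distribution or scaling, so the Gaussian direction, the `eps` magnitude, and the function name `perturb` are all assumptions for illustration.

```python
import math
import random


def perturb(embedding, eps, rng):
    """Return a noisy view of `embedding`: the original vector plus a random
    direction scaled to L2 norm `eps` (assumed form, not from the paper)."""
    noise = [rng.gauss(0.0, 1.0) for _ in embedding]
    norm = math.sqrt(sum(n * n for n in noise)) or 1.0
    return [e + eps * n / norm for e, n in zip(embedding, noise)]


rng = random.Random(0)
user_embedding = [0.5, -0.2, 0.1, 0.7]  # hypothetical learned embedding
view_a = perturb(user_embedding, eps=0.1, rng=rng)
view_b = perturb(user_embedding, eps=0.1, rng=rng)

# Each view sits exactly eps away from the original embedding.
shift = math.sqrt(sum((a - e) ** 2 for a, e in zip(view_a, user_embedding)))
```

No graph edges or nodes are dropped here; the two views differ only by the injected noise, which matches the abstract's description of replacing graph augmentations with an embedding-level one.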
With the remarkable progress of deep neural networks in computer vision, data mixing augmentation techniques are widely studied to alleviate problems of degraded generalization when the amount of training data is limited. However, mixup strategies have not been well assembled in current vision toolboxes. In this paper, we propose OpenMixup, an open-source all-in-one toolbox for supervised, semi-, and self-supervised visual representation learning with mixup. It offers an integrated model design and training platform, comprising a rich set of prevailing network architectures and modules, a collection of data mixing augmentation methods as well as practical model analysis tools. In addition, we also provide standard mixup image classification benchmarks on various datasets, which expedites practitioners to make fair comparisons among state-of-the-art methods under the same settings.
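As background for the data mixing methods the toolbox collects, the basic mixup operation can be sketched in a few lines. This does not use OpenMixup's API: the function name, the flat-list "images", and the `alpha` value are assumptions, and only the standard convex-combination form of mixup is shown.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha, rng):
    """Blend two samples and their one-hot labels with a coefficient
    drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


rng = random.Random(0)
image_a, label_a = [0.0, 0.0, 1.0, 1.0], [1.0, 0.0]  # hypothetical sample, class 0
image_b, label_b = [1.0, 1.0, 0.0, 0.0], [0.0, 1.0]  # hypothetical sample, class 1
mixed_x, mixed_y, lam = mixup(image_a, label_a, image_b, label_b, alpha=0.2, rng=rng)
```

The mixed label stays a valid distribution over classes, with the weight on class 0 equal to the sampled coefficient `lam`.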