Deep Learning Weekly: Issue #246
A U-Net and Fourier Neural Operator-based model for improving carbon sequestration, offline access to Amazon SageMaker Feature Store, Mobile Video Networks Tutorial, and more
This week in deep learning, we bring you a U-Net and Fourier Neural Operator-based model for improving carbon sequestration, offline access to Amazon SageMaker Feature Store using AWS Lake Formation, Mobile Video Networks Tutorial, and a paper on understanding dimensional collapse in contrastive self-supervised learning.
You may also enjoy Google's new approaches toward zero-shot transfer for dialogue modeling, common pitfalls with machine learning pipelines, a unified model serving framework, a paper on zero-shot transfer with locked-image text tuning, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
A new neural operator architecture named U-FNO simulates pressure levels during carbon storage in a fraction of a second while doubling accuracy on certain tasks, helping scientists find optimal injection rates and sites.
Google AI introduces two different sequence-to-sequence approaches toward zero-shot transfer for dialogue modeling.
Meta AI highlights a promising AI project that is helping the world decarbonize energy production, boost the efficiency of industrial systems, and reduce the carbon footprint of everyday life.
Meta AI announced an entirely new data set focused on oxide catalysts for the Oxygen Evolution Reaction (OER), a critical chemical reaction used in green hydrogen fuel production via wind and solar energy.
LinkedIn has open-sourced Performance-Adaptive Sampling Strategy (PASS), which uses AI to select the most relevant neighbors in a graph.
A blog post announcing Prevision.io, a first-of-its-kind dedicated AI management platform built on Google Cloud and now available exclusively on Google Cloud Marketplace.
A comprehensive article that walks through a typical ML pipeline architecture, the steps involved, and the common pitfalls to avoid at each step.
This post provides an overview of how to implement granular access control to feature groups and features stored in an offline feature store using Amazon SageMaker Feature Store and AWS Lake Formation.
An MLOps article that focuses on the use of DagsHub, DVC, and EC2 instances for the smooth industrialization of your machine learning application.
A complete set of time series tools, packages, and libraries gathered in one place.
This notebook provides basic example code to build, run, and fine-tune MoViNets (Mobile Video Networks).
A technical blog post describing some basic code you can use to implement an article recommendation system using Pandas, TfidfVectorizer, and cosine_similarity.
This article discusses how to build custom object detection models using Detecto.
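The TF-IDF plus cosine-similarity idea behind that recommender can be sketched without any dependencies. This is not the code from the linked post: it is a minimal standard-library stand-in for TfidfVectorizer and cosine_similarity, and the function names, the whitespace tokenizer, the smoothed IDF formula, and the sample titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(titles, index, top_k=2):
    """Return the top_k titles most similar to titles[index], best first."""
    vectors = tfidf_vectors(titles)
    scores = [
        (cosine(vectors[index], v), i) for i, v in enumerate(vectors) if i != index
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest score first, stable on ties
    return [titles[i] for _, i in scores[:top_k]]


# Hypothetical article titles, used only to exercise the sketch.
articles = [
    "deep learning for image classification",
    "image classification with convolutional networks",
    "cooking pasta at home",
    "deep learning for speech recognition",
]
recommendations = recommend(articles, 0, top_k=2)
```

With these toy titles, the first article shares three terms with the speech recognition title and two with the convolutional networks title, so those two rank first and second, while the cooking title scores zero.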
Libraries & Code
BentoML is an open platform that simplifies ML model deployment and enables you to serve your models at production scale in minutes.
Modelkit is a minimalist yet powerful MLOps library for Python, built for people who want to deploy ML models to production.
Papers & Publications
Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. The joint embedding approach is based on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem, where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that lead to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.
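To make "spanning a lower-dimensional subspace" concrete, the sketch below counts how many linearly independent directions a batch of embeddings actually uses. It is not the paper's method or code: it is a crude illustrative check, assuming centered embeddings and a numerical rank computed by Gaussian elimination, and every function name and toy vector here is hypothetical.

```python
def numerical_rank(rows, tol=1e-9):
    """Rank of a matrix (list of rows) via Gaussian elimination with partial pivoting."""
    m = [[float(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot: this column adds no new direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def center(embeddings):
    """Subtract the per-dimension mean so only the variation between samples remains."""
    n, dim = len(embeddings), len(embeddings[0])
    means = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    return [[e[j] - means[j] for j in range(dim)] for e in embeddings]


def used_dimensions(embeddings, tol=1e-9):
    """Number of independent directions spanned by the centered embeddings."""
    return numerical_rank(center(embeddings), tol)


# Toy 3-dimensional embeddings (made up for illustration).
healthy = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
dimensional_collapse = [[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]]
complete_collapse = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

healthy_dims = used_dimensions(healthy)
collapsed_dims = used_dimensions(dimensional_collapse)
constant_dims = used_dimensions(complete_collapse)
```

In this toy setup the healthy batch varies along all three axes, the dimensionally collapsed batch varies along a single line, and the completely collapsed batch does not vary at all. Real embeddings are noisy, so a hard rank with a fixed tolerance is a blunt instrument; inspecting how quickly the singular values of the embedding covariance decay would be a more informative diagnostic.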
Transformers have been widely used in numerous vision problems especially for visual recognition and detection. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. In addition, we extend it to ViDT+ to support joint-task learning for object detection and instance segmentation. Specifically, we attach an efficient multi-scale feature fusion layer and utilize two more auxiliary training losses, IoU-aware loss and token labeling loss. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and its extended ViDT+ achieves 53.2AP owing to its high scalability for large models.
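The AP metric and the IoU-aware loss mentioned in the abstract both build on intersection over union (IoU) between two boxes. The sketch below shows only that underlying quantity, not ViDT's loss or code, and it assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; the function name is hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    if union <= 0:
        return 0.0  # degenerate boxes with no area
    return intersection / union


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
overlap_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1.0 means the predicted and ground-truth boxes coincide, and 0.0 means they do not overlap at all.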
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.
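Zero-shot classification with an aligned image and text model can be sketched as a nearest-neighbor lookup: embed each candidate class name with the text model, embed the image with the image model, and pick the class whose text embedding is most similar. The sketch below is not LiT's code. The embeddings are made-up three-dimensional vectors standing in for the outputs of a locked image tower and a tuned text tower, and the function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return (best_label, scores) by cosine similarity between image and class text."""
    image = l2_normalize(image_embedding)
    scores = {}
    for label, text_embedding in class_text_embeddings.items():
        text = l2_normalize(text_embedding)
        scores[label] = sum(i * t for i, t in zip(image, text))
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical text embeddings for three class names (stand-ins for text-tower outputs).
class_text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
# Hypothetical image embedding (stand-in for the locked image tower's output).
image_embedding = [0.8, 0.3, 0.1]

predicted_label, similarity_scores = zero_shot_classify(
    image_embedding, class_text_embeddings
)
```

Because new classes only require new text embeddings, the label set can change without retraining either model, which is what makes the transfer zero-shot.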