Deep Learning Weekly: Issue #270
MIT's new on-device training technique that requires less than a quarter of a megabyte of memory, Apple's 3D Parametric Room Representation, and more.
This week in deep learning, we bring you MIT's new on-device training technique that requires less than a quarter of a megabyte of memory, Apple's 3D Parametric Room Representation, zero-shot evaluation of very large language models, and a paper on text-to-video generation.
You may also enjoy Colab's Pay-As-You-Go offering, five tools for ML model testing, a gentle introduction to geometric deep learning, a paper on federated learning annotated image repository, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory.
Imagimob and FotaHub, Inc. announced a partnership where Imagimob’s tinyML platform will be seamlessly integrated with FotaHub’s cloud-based firmware over-the-air update and provisioning service.
Google Cloud is bringing its expertise in vision-based artificial intelligence to the healthcare industry with the launch of its new Medical Imaging Suite.
Google Colab is launching a new paid tier, Pay-As-You-Go, giving anyone the option to purchase additional compute time in Colab with or without a paid subscription.
A comprehensive article that explores five open-source and subscription-based model testing tools that will come in handy in your MLOps pipeline.
An article that highlights the main challenges and relevant lessons learned from Instacart’s real-time machine learning journey.
Learn how to share your apps and services with colleagues without worrying about operating system differences, software version mismatches, or the other common file-sharing problems data scientists and ML engineers face every day.
A post that shows a holistic view of an end-to-end ML production system using AWS SageMaker Studio.
A post that compares Dagster and Airflow, and digs into the main differences in data-passing, event-driven execution, and backfills.
An article that introduces deep learning for non-Euclidean data.
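The core operation behind most geometric deep learning models on graphs is message passing: each node updates its feature by aggregating over its neighbors. A minimal sketch of one such step in plain Python, with toy scalar features (illustration only; libraries like PyTorch Geometric vectorize this over tensors):

```python
# One mean-aggregation message-passing step on a toy undirected graph.

def message_passing_step(features, edges):
    """Update each node's feature to the mean of itself and its neighbors."""
    # Build an adjacency list from the undirected edge list.
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors[node]]
        updated[node] = sum(group) / len(group)
    return updated

# Toy path graph 0 - 1 - 2 with scalar features.
features = {0: 1.0, 1: 0.0, 2: 1.0}
edges = [(0, 1), (1, 2)]
print(message_passing_step(features, edges))
# node 1 averages over {0.0, 1.0, 1.0} -> 2/3
```

Stacking several such steps (with learned transforms between them) is what lets information propagate across the graph, which is the setting the article generalizes beyond Euclidean grids.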
A deep dive into the two main components of RoomPlan: 3D room layout estimation and a 3D object-detection pipeline.
A blog post that uses the zero-shot text classification task to evaluate various OPT models on WinoBias, a coreference task measuring gender bias related to occupations.
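The shape of such an evaluation can be sketched as a small harness: for each WinoBias-style sentence pair, compare the model's score for the pro-stereotypical resolution against the anti-stereotypical one and report how often the stereotype wins. The `score` function below is a hypothetical stand-in (it just penalizes length); in the post, the scores come from OPT models used as zero-shot classifiers:

```python
# Toy sketch of a WinoBias-style bias evaluation loop.

def score(sentence):
    # Hypothetical stand-in for a language-model score; a real harness
    # would use the model's (log-)likelihood of the sentence.
    return -len(sentence)

def stereotype_preference_rate(pairs):
    """pairs: (pro_stereotypical, anti_stereotypical) sentence tuples."""
    wins = sum(1 for pro, anti in pairs if score(pro) > score(anti))
    return wins / len(pairs)

# Toy strings stand in for real WinoBias sentence pairs.
toy_pairs = [("short pro sentence", "a longer anti sentence")]
print(stereotype_preference_rate(toy_pairs))  # -> 1.0
```

A rate near 0.5 would indicate no systematic preference between the two resolutions; the post measures how far the OPT models deviate from that.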
A technical article that addresses the unique problems of credit card fraud detection, and walks through an end-to-end workflow for fraud detection using GNNs.
Libraries & Code
A repository containing code for behavioral testing of NLP models with CheckList.
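One of CheckList's test types is the invariance (INV) test: a perturbation that should not matter, such as swapping a person's name, must not change the model's prediction. A minimal plain-Python sketch of the idea, using a toy stand-in model rather than the CheckList library (which adds templates, expectation functions, and reporting on top):

```python
# Minimal invariance (INV) behavioral test, in the spirit of CheckList.

def toy_sentiment(text):
    # Hypothetical stand-in classifier for illustration.
    return "positive" if "great" in text else "negative"

def invariance_test(model, template, fillers):
    """Pass iff the prediction is identical for every filler of the template."""
    predictions = {model(template.format(name=f)) for f in fillers}
    return len(predictions) == 1

passed = invariance_test(
    toy_sentiment,
    "{name} thought the movie was great.",
    ["Alice", "Bob", "Priya"],
)
print("INV test passed:", passed)  # -> True
```

CheckList generalizes this pattern with minimum functionality tests (MFT) and directional expectation tests (DIR) over templated test suites.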
A library to enable Bayesian active learning in your research or labeling work.
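The acquisition step at the heart of Bayesian active learning can be sketched simply: run several stochastic forward passes (e.g. MC dropout), estimate predictive entropy per unlabeled sample, and request labels for the most uncertain ones. A toy pure-Python version with hand-written probabilities (libraries like the one above wrap this logic around real PyTorch models):

```python
# Entropy-based acquisition from Monte Carlo predictions, in miniature.
import math

def predictive_entropy(mc_probs):
    """mc_probs: list of per-pass class-probability vectors for one sample."""
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / len(mc_probs) for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)

def select_for_labeling(pool, k):
    """pool: {sample_id: mc_probs}. Return the k most uncertain sample ids."""
    ranked = sorted(pool, key=lambda s: predictive_entropy(pool[s]), reverse=True)
    return ranked[:k]

pool = {
    "a": [[0.9, 0.1], [0.95, 0.05]],  # confident across passes
    "b": [[0.6, 0.4], [0.4, 0.6]],    # model disagrees with itself
}
print(select_for_labeling(pool, 1))   # -> ['b']
```

Sample "b" wins because its averaged prediction is near uniform, which is exactly the kind of example worth sending to an annotator first.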
AITemplate (AIT) is a Python framework that transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving.
Papers & Publications
We propose Make-A-Video, an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
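The abstract's space-time decomposition can be appreciated through attention cost alone: joint attention over a video of T frames with S tokens per frame scales with (T·S)² token pairs, while factorizing into a spatial pass (within each frame) plus a temporal pass (across frames at each position) scales with T·S² + S·T². A rough back-of-the-envelope comparison with assumed toy sizes (not figures from the paper):

```python
# Rough attention-cost comparison for factorized vs. joint space-time
# attention; T and S are assumed toy values for illustration.
T, S = 16, 256                       # frames, tokens per frame

full = (T * S) ** 2                  # joint space-time attention pairs
factorized = T * S**2 + S * T**2     # spatial pass + temporal pass

print(full, factorized, full / factorized)
```

At these sizes the factorized form touches roughly 15x fewer token pairs, which is one intuition for why such decompositions make video-scale training tractable.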
Cross-device federated learning is an emerging machine learning (ML) paradigm where a large population of devices collectively train an ML model while the data remains on the devices. This research field has a unique set of practical challenges, and to systematically make advances, new datasets curated to be compatible with this paradigm are needed. Existing federated learning benchmarks in the image domain do not accurately capture the scale and heterogeneity of many real-world use cases. We introduce FLAIR, a challenging large-scale annotated image dataset for multi-label classification suitable for federated learning. FLAIR has 429,078 images from 51,414 Flickr users and captures many of the intricacies typically encountered in federated learning, such as heterogeneous user data and a long-tailed label distribution. We implement multiple baselines in different learning setups for different tasks on this dataset. We believe FLAIR can serve as a challenging benchmark for advancing the state of the art in federated learning.
Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over attention-based baselines such as NAT and Swin, as well as modern convolutional baseline ConvNeXt. Our Large model is ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation, and faster in throughput. We believe combinations of NA and DiNA have the potential to empower various tasks beyond those presented in this paper.
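The neighborhood-selection difference the abstract describes can be illustrated in one dimension: Neighborhood Attention (NA) lets a query attend to its k nearest positions, while Dilated NA (DiNA) spaces those same k positions `dilation` steps apart, widening the receptive field at the same cost. A toy index computation (the border clamping is a simplification of the papers' edge handling, not their exact scheme):

```python
# Toy 1-D illustration of NA vs. DiNA attended positions.

def neighborhood(center, k, dilation, length):
    """Return the k attended positions for a query at `center`."""
    half = k // 2
    offsets = range(-half, half + 1)
    positions = [center + o * dilation for o in offsets]
    # Clamp to valid indices (simplified edge handling for illustration).
    return [min(max(p, 0), length - 1) for p in positions]

print(neighborhood(8, 5, 1, 16))  # NA:   [6, 7, 8, 9, 10]
print(neighborhood(8, 5, 3, 16))  # DiNA: [2, 5, 8, 11, 14]
```

Both calls attend to exactly five positions, but the dilated window spans 13 positions instead of 5, which is the "more global context at no additional cost" the abstract claims; DiNAT interleaves both forms across layers.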