Deep Learning Weekly: Issue #194
A tutorial for smart mask detection, a toolkit for conversational AI, IBM uses a quantum computer to improve ML, Microsoft acquires an AI company, hierarchical vision transformers, and more
This week in deep learning, we bring you Scale AI's $325M round, IBM's new quantum machine learning modules, Microsoft’s $19.7B medical AI acquisition, and the iGibson 2021 challenge and its accompanying dataset.
You may also enjoy Edge Impulse's latest addition to its cloud-based environment, a tutorial for a smart mask detection camera, a toolkit for conversational AI, a paper on hierarchical vision transformers using shifted windows, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
IBM releases Qiskit Machine Learning, a set of new application modules that is part of its open source quantum software.
Microsoft makes the second-largest deal in its history, acquiring an AI company that helps physicians create medical notes faster, in hopes of doubling its addressable healthcare provider market to nearly $500 billion.
Intel’s new audio-based AI project for detecting hate speech and slurs gets criticized for its toxicity sliding scale.
The interdisciplinary research community of NASA’s SpaceML is building image classifiers from satellite imagery of Earth to spot signs of natural disasters.
Srikant and the MIT-IBM Watson AI Lab unveil an automated method for finding weaknesses in code-processing models and retraining them to be more resilient against attacks.
An AI platform company for CG characters and digital worlds, co-founded by actor Tye Sheridan (Wade Watts in Ready Player One), raises a $2.5M seed round led by Founders Fund, Cyan Banister, the Realize Tech Fund, Capital Factory, MaC Venture Capital, and Robert Schwab.
Scale AI, now valued at $7.3 billion, plans to expand beyond data annotation following a $325M round led by Dragoneer, Greenoaks Capital, and Tiger Global. The former head of Amazon’s worldwide consumer business is also joining the team.
Mobile & Edge
An introduction to federated learning applied to healthcare along with brief descriptions of algorithms such as FedNova, FedAvg and FedProx.
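For readers new to the idea, the aggregation step behind FedAvg can be sketched in a few lines. This is a minimal illustration, not code from the linked article: the function name `fedavg`, the three hypothetical clients, and their dataset sizes are all assumptions made for the example. Each client trains locally, and the server averages the resulting parameters weighted by how much data each client holds.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average per-layer parameters across clients, weighted by local dataset size.

    client_weights: list (one entry per client) of lists of NumPy arrays (one per layer).
    client_sizes:   list of local training-set sizes, one per client.
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        acc = np.zeros_like(client_weights[0][layer], dtype=np.float64)
        for weights, size in zip(client_weights, client_sizes):
            # Clients with more data contribute proportionally more.
            acc += (size / total) * weights[layer]
        averaged.append(acc)
    return averaged

# Hypothetical example: three hospitals, each holding a tiny two-layer model.
clients = [
    [np.array([1.0, 2.0]), np.array([0.5])],
    [np.array([3.0, 4.0]), np.array([1.5])],
    [np.array([5.0, 6.0]), np.array([2.5])],
]
sizes = [100, 100, 200]

global_weights = fedavg(clients, sizes)
print(global_weights)
```

Only model parameters leave each client in this scheme; the raw patient data stays local. Variants such as FedProx and FedNova change the local objective or the normalization of updates, which this sketch does not cover.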
A tutorial for a smart mask detection camera based around the Jetson Nano that features 30FPS video processing and web server connectivity.
An inspirational overview of the current TinyML landscape and its contributions to the modern age of industrial design.
Raspberry Pi 4 owners can now train their own custom models using Edge Impulse's cloud-based development platform for machine learning on edge devices.
A gesture recognition and embedded intelligence demonstration using an Arduino Nano 33 BLE Sense.
Learning
Facebook AI open-sourced a unique data set called Casual Conversations, consisting of 45,186 videos of natural conversations, to help AI researchers evaluate fairness in computer vision and audio models.
A comprehensive tutorial on using the new Hugging Face DLCs and Amazon SageMaker extension to train a distributed Seq2Seq-transformer model on the summarization task.
An introduction to the iGibson 2021 Dataset, released for the upcoming challenge of the same name, which explores issues such as simulation, sim-to-real transfer, visual navigation, semantic mapping, and change detection.
A brief walkthrough of the model development process using NVIDIA’s RAPIDS, a suite of libraries for speeding up training workflows, and Determined AI, a platform which tracks these deep learning workflows.
Libraries & Code
A toolkit with extendable collections of pre-built modules and ready-to-use models for Automatic Speech Recognition, Natural Language Processing and Text-to-Speech.
A unified toolkit for Deep Learning Based Document Image Analysis.
A unified framework that provides specialized time series algorithms and scikit-learn compatible tools to build, tune and validate time series models.
Papers & Publications
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
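The windowing idea in the abstract above can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the function names, the 8x8 single-channel feature map, and the window size of 4 are assumptions chosen for the example, and the shift is realized here with a cyclic roll. The attention computation itself, and the masking needed for windows that wrap around the border, are left out.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    """
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-index axes together
    return x.reshape(-1, window_size * window_size, C)

def shifted_window_partition(x, window_size):
    """Cyclically shift the map by half a window, then partition as usual."""
    shift = window_size // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window_size)

# Hypothetical 8x8 feature map whose values are simply the flat pixel indices.
H = W = 8
M = 4  # window size
feature_map = np.arange(H * W, dtype=np.float32).reshape(H, W, 1)

regular = window_partition(feature_map, M)
shifted = shifted_window_partition(feature_map, M)

# Token pairs attended to: global attention vs. attention restricted to windows.
global_pairs = (H * W) ** 2
window_pairs = regular.shape[0] * (M * M) ** 2

print(regular.shape, shifted.shape)
print(global_pairs, window_pairs)
```

In the regular partition the first window covers rows and columns 0 to 3. After the shift, the first window covers rows and columns 2 to 5, so it mixes tokens from all four regular windows, which is the cross-window connection the abstract mentions. Because each window holds a fixed number of tokens, the number of attended pairs grows linearly with the number of windows, and therefore with image size.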
We present an efficient high-resolution network, Lite-HRNet, for human pose estimation. We start by simply applying the efficient shuffle block in ShuffleNet to HRNet (high-resolution network), yielding stronger performance over popular lightweight networks, such as MobileNet, ShuffleNet, and Small HRNet.
We find that the heavily-used pointwise (1x1) convolutions in shuffle blocks become the computational bottleneck. We introduce a lightweight unit, conditional channel weighting, to replace costly pointwise (1x1) convolutions in shuffle blocks. The complexity of channel weighting is linear w.r.t. the number of channels and lower than the quadratic time complexity for pointwise convolutions. Our solution learns the weights from all the channels and over multiple resolutions that are readily available in the parallel branches in HRNet. It uses the weights as the bridge to exchange information across channels and resolutions, compensating for the role played by the pointwise (1x1) convolution. Lite-HRNet demonstrates superior results on human pose estimation over popular lightweight networks. Moreover, Lite-HRNet can be easily applied to the semantic segmentation task in the same lightweight manner.
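The complexity claim in the abstract above, linear versus quadratic in the number of channels, can be made concrete with a small sketch. This is a simplification and not the Lite-HRNet code: the function names and the tensor sizes are assumptions, and the weights are passed in directly, whereas the paper computes them from the input across channels and resolutions.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in)."""
    # Every output channel mixes every input channel at each position.
    return np.einsum("oc,chw->ohw", weight, x)

def channel_weighting(x, weights):
    """Elementwise weighting. weights must broadcast against x of shape (C, H, W)."""
    # Each channel is only rescaled; no channel-to-channel mixing happens here.
    return x * weights

def pointwise_macs(c, h, w):
    """Multiply-accumulates for a C-to-C 1x1 convolution: quadratic in C."""
    return c * c * h * w

def weighting_macs(c, h, w):
    """Multiplications for elementwise weighting: linear in C."""
    return c * h * w

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
conv_out = pointwise_conv(x, rng.standard_normal((8, 8)))
weighted_out = channel_weighting(x, rng.standard_normal((8, 1, 1)))

# Hypothetical branch size: 40 channels on a 64x48 feature map.
ratio = pointwise_macs(40, 64, 48) // weighting_macs(40, 64, 48)
print(conv_out.shape, weighted_out.shape, ratio)
```

With equal input and output channel counts, the cost ratio between the two operations is exactly the number of channels, which is why replacing the 1x1 convolutions pays off more as the network gets wider.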
In recent years, the use of Generative Adversarial Networks (GANs) has become very popular in generative image modeling. While style-based GAN architectures yield state-of-the-art results in high-fidelity image synthesis, they are computationally highly complex. In our work, we focus on the performance optimization of style-based generative models. We analyze the most computationally hard parts of StyleGAN2, and propose changes in the generator network to make it possible to deploy style-based generative networks on edge devices. We introduce the MobileStyleGAN architecture, which has 3.5x fewer parameters and is 9.5x less computationally complex than StyleGAN2, while providing comparable quality.
Sponsored by Springboard
Get a machine learning engineering job, guaranteed. With Springboard’s online machine learning engineering bootcamp, you’ll work 1:1 with an expert mentor to learn the skills and gain the experience needed to get hired in an AI/ML role. Learn more today.