Deep Learning Weekly: Issue #204
NVIDIA’s new deep learning model for video conferencing, a paper on Vision Outlooker for Visual Recognition, and more
This week in deep learning, we bring you NVIDIA's new deep learning model for video conferencing, a TinyML tutorial on a COVID health condition classifier, a bird call classifier on a Nano 33 BLE Sense, and a paper on the Vision Outlooker for Visual Recognition.
You may also enjoy a new AI-based mapping service that offers routing advice to drivers and delivery fleets, a pose detection tutorial on Android using Google's on-device ML Kit, a comprehensive blog on Multi-Task Learning, Facebook AI Research's library for state-of-the-art detection and segmentation algorithms, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Vid2Vid Cameo, one of the deep learning models behind the NVIDIA Maxine software development kit for video conferencing, uses generative adversarial networks to synthesize realistic talking-head videos from a single 2D image of a person.
A DeepMind paper that proposes how AGI can be achieved in a shorter time frame given the advancements in reinforcement learning and language models.
A team at QCRI partnered with a Doha-based taxi company called Karwa to build a new mapping service called QARTA that offers routing advice to drivers and delivery fleets.
NVIDIA is upgrading its HGX artificial intelligence supercomputing platform with major enhancements to its compute, networking, and storage performance.
Google’s Knowledge Graph falsely identifies a software engineer as a serial killer in Google search results.
Leaders from Pepsi discuss numerous ways in which machine learning is utilized in their enterprise operations.
Mobile & Edge
Qualcomm announces the latest iteration of the Snapdragon 888 mobile processor, which can perform 32 trillion operations per second for AI tasks.
A brief tutorial on a TinyML medical device that uses Edge Impulse to classify and analyze COVID patients' health conditions.
An article testing out pose detection on Android with the help of Google ML Kit’s Pose Detection API.
A simple walkthrough showcasing a four-bird classifier using a Nano 33 BLE Sense.
A comprehensive blog that discusses the motivation for Multi-Task Learning (MTL) as well as some use cases, difficulties, and recent algorithmic advances.
An introduction to a paper on designing a weakly supervised deep neural network whose operation resembles the diagnostic procedure of radiologists.
Google releases the Translated Wikipedia Biographies dataset, which can be used to evaluate the gender bias of translation models.
An article highlighting machine learning models that can be used to quickly phenotype large cohorts and increase statistical power for genome-wide association studies.
A step-by-step guide to fine-tuning Microsoft’s recently released LayoutLM model on an annotated custom dataset that includes French and English invoices.
Libraries & Code
Facebook AI Research's next-generation library that provides state-of-the-art detection and segmentation algorithms.
An open-source project that enables games and simulations to serve as environments for training intelligent agents.
A TensorFlow library for deep labeling that aims to provide a unified, state-of-the-art TensorFlow codebase for dense pixel labeling tasks, including several kinds of segmentation.
Papers & Publications
Visual recognition has been dominated by convolutional neural networks (CNNs) for years. Though recently the prevailing vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification, their performance is still inferior to latest SOTA CNNs if no extra data are provided. In this work, we aim to close the performance gap and demonstrate that attention-based models are indeed able to outperform CNNs. We found that the main factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations. To resolve this, we introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO). Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention aims to efficiently encode finer-level features and contexts into tokens, which are shown to be critical for recognition performance but largely ignored by the self-attention. Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, being the first model exceeding 87% accuracy on this competitive benchmark, without using any extra training data. In addition, the pre-trained VOLO transfers well to downstream tasks, such as semantic segmentation. We achieve 84.3% mIoU score on the Cityscapes validation set and 54.3% on the ADE20K validation set.
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).