Deep Learning Weekly: Issue #192
Hugging Face teams up with SageMaker, body pose estimation on smart TVs, Arm's new architectural upgrade, Microsoft Teams intros live meeting transcriptions, and more
This week in deep learning, we bring you Arm’s first major architectural upgrade, Facebook’s Fairness Flow, Microsoft’s AI-powered live meeting transcriptions, and the partnership of Amazon SageMaker and Hugging Face.
You may also enjoy a paper on Mobile Video Networks for Efficient Video Recognition, an in-depth blog on BigBird’s Block Sparse Attention, a library for graph deep learning research, a paper on PlenOctrees for Real-time Rendering of Neural Radiance Fields, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Microsoft Teams can now identify each speaker, capture audio in “near real-time,” and generate a live transcript on the right-hand side of the meeting window.
The world’s first crisis intervention and suicide prevention lifeline for LGBTQPIA+ youth now has a Crisis Contact Simulator and an ML-powered assessment tool that flags high-risk users, both built with Google’s help.
An AI-powered fraud detection startup lands a $200M funding round led by KKR, pushing its valuation above the $1 billion mark.
A Python-based technical toolkit that surfaces measurements of statistical bias early, using a methodology based on Signal Detection Theory.
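The toolkit's internals aren't shown here, but the core Signal Detection Theory measurement it builds on is the sensitivity index d′. A minimal, hypothetical sketch (the group rates below are made up for illustration):

```python
from statistics import NormalDist

def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index from Signal Detection Theory:
    d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Comparing model sensitivity across two demographic groups: a large
# gap in d' between groups is one early signal of statistical bias.
group_a = d_prime(hit_rate=0.90, false_alarm_rate=0.10)
group_b = d_prime(hit_rate=0.70, false_alarm_rate=0.20)
print(f"d' gap between groups: {group_a - group_b:.3f}")
```

Because d′ separates sensitivity from the decision threshold, it lets a reviewer ask whether a model is genuinely less accurate for one group or merely calibrated differently.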
A paper that proposes a novel approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs via a neural architecture search, the Stream Buffer technique, and a simple ensembling method.
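The stream buffer idea in that paper can be illustrated with a toy, purely hypothetical sketch: process a long video as short clips while caching the last few frames of features, so a causal temporal operation sees the same context it would over the whole video, without holding the whole video in memory. The averaging "temporal op" below is a stand-in, not the paper's 3D convolution:

```python
from collections import deque

class StreamBuffer:
    """Toy sketch of MoViNet-style stream buffering: a causal temporal
    op keeps the last (kernel_size - 1) frames so consecutive clips see
    the same context as one long clip would."""
    def __init__(self, kernel_size: int):
        self.kernel_size = kernel_size
        self.buffer = deque([0.0] * (kernel_size - 1), maxlen=kernel_size - 1)

    def process_clip(self, clip):
        frames = list(self.buffer) + list(clip)
        out = []
        for i in range(len(clip)):
            window = frames[i:i + self.kernel_size]
            out.append(sum(window) / self.kernel_size)  # stand-in temporal op
        self.buffer.extend(clip)  # cache tail frames for the next clip
        return out

# Streaming two 4-frame clips matches processing all 8 frames at once.
sb = StreamBuffer(kernel_size=3)
streamed = sb.process_clip([1, 2, 3, 4]) + sb.process_clip([5, 6, 7, 8])
full = StreamBuffer(kernel_size=3).process_clip([1, 2, 3, 4, 5, 6, 7, 8])
assert streamed == full
```

The peak-memory win comes from the buffer scaling with the kernel size rather than the video length.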
Arm unveils a ground-breaking CPU design geared towards AI, IoT, and 5G applications. This new architecture comes with confidential computing and enhanced machine learning capabilities through Scalable Vector Extension (SVE2) and other technologies.
A high-level walkthrough of how to implement a lightweight body pose estimation model on a smart TV using BlazePose, ArmNN TFLite Delegate, and Streamline Performance Analyzer.
An implementation of an image classification system that runs inference over a stream of data on an NVIDIA Jetson Nano device using RedisAI.
50% of the global workforce needs reskilling. Meet KIMO, a Dutch start-up that uses GNNs to represent knowledge domains. According to its founders, the work will form the foundation of personalized learning at scale. (Sponsored Content)
The why, what, and how of an engineering discipline that aims to unify ML systems development and deployment.
An introduction to Hugging Face Deep Learning Containers and SageMaker Extensions, along with a comprehensive list of notebooks, examples, and other resources.
A short tutorial on using a pretrained Mask R-CNN model with PixelLib: a library created for easy integration of image and video segmentation in real-world applications.
An in-depth blog on BigBird’s inner workings and how it can handle sequence lengths of up to 4096 through Block Sparse Attention.
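The key structure the blog explains is BigBird's block-level sparsity pattern: each query block attends only to global blocks, a sliding window of neighbours, and a few random blocks. A minimal sketch of building that mask (a toy illustration, not Hugging Face's implementation; block counts and parameters are made up):

```python
import random

def block_sparse_mask(n_blocks, n_global=1, window=1, n_random=1, seed=0):
    """Block-level attention mask in the spirit of BigBird's Block
    Sparse Attention. mask[i][j] is True if query block i attends to
    key block j."""
    rng = random.Random(seed)
    mask = [[False] * n_blocks for _ in range(n_blocks)]
    for i in range(n_blocks):
        for j in range(n_blocks):
            if i < n_global or j < n_global:   # global blocks (rows and cols)
                mask[i][j] = True
            elif abs(i - j) <= window:         # sliding window of neighbours
                mask[i][j] = True
        for j in rng.sample(range(n_blocks), n_random):  # random blocks
            mask[i][j] = True
    return mask

mask = block_sparse_mask(n_blocks=8)
density = sum(map(sum, mask)) / 64
print(f"attention density: {density:.2f}")  # well below full attention's 1.0
```

Because the window and random components contribute O(1) blocks per row, the attention cost grows linearly rather than quadratically with sequence length, which is what makes 4096-token inputs tractable.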
A quick tutorial on fuzz testing TensorFlow APIs via OSS-Fuzz.
Libraries & Code
A turnkey library for graph deep learning research, with a unified testbed for higher level, research-oriented graph deep learning tasks, such as graph generation, self-supervised learning, explainability, and 3D graphs.
A simple command line tool for text-to-image generation using OpenAI's CLIP and Siren.
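The Siren half of that pairing is an MLP with sinusoidal activations, which lets a coordinate network represent high-frequency image detail. A minimal forward-pass sketch of one layer (dimensions, seed, and the sample input are illustrative; the init rule shown is the one commonly used for hidden layers):

```python
import math, random

class SirenLayer:
    """Minimal sketch of a SIREN layer: y = sin(w0 * (W x + b))."""
    def __init__(self, in_dim, out_dim, w0=30.0, seed=0):
        rng = random.Random(seed)
        bound = math.sqrt(6 / in_dim) / w0  # SIREN-style hidden-layer init
        self.w0 = w0
        self.W = [[rng.uniform(-bound, bound) for _ in range(in_dim)]
                  for _ in range(out_dim)]
        self.b = [rng.uniform(-bound, bound) for _ in range(out_dim)]

    def forward(self, x):
        return [math.sin(self.w0 * (sum(w * xi for w, xi in zip(row, x)) + bi))
                for row, bi in zip(self.W, self.b)]

layer = SirenLayer(in_dim=2, out_dim=4)
out = layer.forward([0.3, -0.7])  # e.g. a 2-D pixel coordinate
```

In the CLIP+Siren loop, CLIP scores the rendered image against the text prompt and the gradient flows back into the Siren weights.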
A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization.
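To unpack that name, MADGRAD combines dual averaging (accumulating weighted gradients against the starting point) with a cube-root adaptive denominator and momentum applied as iterate averaging. A scalar sketch of the update under those assumptions (hyperparameters here are illustrative, not the library defaults):

```python
import math

def madgrad(grad, x0, lr=0.1, momentum=0.9, eps=1e-6, steps=300):
    """Scalar sketch of a MADGRAD-style step: s accumulates weighted
    gradients, v weighted squared gradients; the iterate is an
    exponential average of dual-averaged points z."""
    x, s, v = x0, 0.0, 0.0
    for k in range(steps):
        g = grad(x)
        lam = lr * math.sqrt(k + 1)            # growing dual-averaging weight
        s += lam * g
        v += lam * g * g
        z = x0 - s / (v ** (1 / 3) + eps)      # dual-averaged point, cube root
        x = momentum * x + (1 - momentum) * z  # momentum as iterate averaging
    return x

# Minimise f(x) = x^2 from x0 = 5; the gradient is 2x.
x_min = madgrad(grad=lambda x: 2 * x, x0=5.0)
```

The cube root (rather than Adam's square root) and the anchoring to x0 are the two features that distinguish the method from familiar adaptive optimizers.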
Papers & Publications
Abstract: Significant progress has been achieved in automating the design of various components in deep networks. However, the automatic design of loss functions for generic tasks with various evaluation metrics remains under-investigated. Previous works on handcrafting loss functions heavily rely on human expertise, which limits their extendibility. Meanwhile, existing efforts on searching loss functions mainly focus on specific tasks and particular metrics, with task-specific heuristics. Whether such works can be extended to generic tasks is not verified and questionable. In this paper, we propose AutoLoss-Zero, the first general framework for searching loss functions from scratch for generic tasks. Specifically, we design an elementary search space composed only of primitive mathematical operators to accommodate the heterogeneous tasks and evaluation metrics. A variant of the evolutionary algorithm is employed to discover loss functions in the elementary search space. A loss-rejection protocol and a gradient-equivalence-check strategy are developed so as to improve the search efficiency, which are applicable to generic tasks. Extensive experiments on various computer vision tasks demonstrate that our searched loss functions are on par with or superior to existing loss functions, which generalize well to different datasets and networks.
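The two central ideas — an elementary search space of primitive operators and a loss-rejection protocol that cheaply discards unusable candidates — can be caricatured in a few lines. This is a toy sketch, not the paper's method: the operator set, probes, and random (rather than evolutionary) search are all simplifications.

```python
import random

UNARY = {"neg": lambda a: -a, "sqr": lambda a: a * a, "abs": abs}
BINARY = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
          "mul": lambda a, b: a * b}

def sample_loss(rng, depth=2):
    """Sample a random candidate loss f(y_true, y_pred) built from
    primitive operators -- a toy elementary search space."""
    if depth == 0:
        return rng.choice([lambda y, p: y, lambda y, p: p])
    if rng.random() < 0.5:
        op = UNARY[rng.choice(sorted(UNARY))]
        child = sample_loss(rng, depth - 1)
        return lambda y, p: op(child(y, p))
    op = BINARY[rng.choice(sorted(BINARY))]
    left, right = sample_loss(rng, depth - 1), sample_loss(rng, depth - 1)
    return lambda y, p: op(left(y, p), right(y, p))

def passes_rejection(loss):
    """Toy loss-rejection check: a usable loss must score a correct
    prediction strictly lower than a wrong one on a few probes."""
    probes = [(1.0, 1.0, 0.0), (0.5, 0.5, 2.0)]  # (y_true, good_pred, bad_pred)
    return all(loss(y, good) < loss(y, bad) for y, good, bad in probes)

rng = random.Random(0)
candidate = None
for _ in range(10_000):
    f = sample_loss(rng)
    if passes_rejection(f):
        candidate = f  # e.g. something behaving like (y - p)^2
        break
```

The rejection step is the efficiency lever: most random expressions are discarded on a handful of probes before any expensive training run.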
Abstract: Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet.
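Re-attention mixes the softmaxed attention maps across heads with a learnable head-to-head matrix, restoring diversity in deep layers. A minimal sketch of that mixing step (shapes, the theta values, and the row re-normalisation choice are illustrative assumptions, not the paper's exact Norm):

```python
import math, random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def re_attention(attn, theta):
    """Sketch of Re-attention: mix per-head attention maps with a
    learnable head-to-head matrix theta, then re-normalise each row.
    attn: [H][N][N] softmaxed maps, theta: [H][H]."""
    H, N = len(attn), len(attn[0])
    mixed = [[[sum(theta[h][g] * attn[g][i][j] for g in range(H))
               for j in range(N)] for i in range(N)] for h in range(H)]
    # keep each row a valid attention distribution
    return [[[v / sum(row) for v in row] for row in head] for head in mixed]

rng = random.Random(0)
H, N = 2, 3
attn = [[softmax([rng.random() for _ in range(N)]) for _ in range(N)]
        for _ in range(H)]
theta = [[0.8, 0.2], [0.3, 0.7]]  # hypothetical learned mixing weights
new_attn = re_attention(attn, theta)
```

Because heads collapse to different degrees, cross-head mixing can re-inject variety even when each individual map has gone flat.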
Abstract: We introduce a method to render Neural Radiance Fields (NeRFs) in real time using PlenOctrees, an octree-based 3D representation which supports view-dependent effects. Our method can render 800x800 images at more than 150 FPS, which is over 3000 times faster than conventional NeRFs. We do so without sacrificing quality while preserving the ability of NeRFs to perform free-viewpoint rendering of scenes with arbitrary geometry and view-dependent effects. Real-time performance is achieved by pre-tabulating the NeRF into a PlenOctree. In order to preserve view-dependent effects such as specularities, we factorize the appearance via closed-form spherical basis functions. Specifically, we show that it is possible to train NeRFs to predict a spherical harmonic representation of radiance, removing the viewing direction as an input to the neural network. Furthermore, we show that PlenOctrees can be directly optimized to further minimize the reconstruction loss, which leads to equal or better quality compared to competing methods. Moreover, this octree optimization step can be used to reduce the training time, as we no longer need to wait for the NeRF training to converge fully. Our real-time neural rendering approach may potentially enable new applications such as 6-DOF industrial and product visualizations, as well as next generation AR/VR systems.
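The speed comes from replacing the per-ray network query with a table lookup: each octree leaf stores spherical harmonic (SH) coefficients, and view-dependent colour is just a dot product with the SH basis at the viewing direction. A sketch up to SH degree 1 (the leaf coefficients below are made-up illustrative values):

```python
# Real spherical harmonic basis up to degree 1
# (constants from the standard real-SH normalisation).
def sh_basis(d):
    x, y, z = d
    return [0.282095, 0.488603 * y, 0.488603 * z, 0.488603 * x]

def radiance(sh_coeffs, direction):
    """Sketch of PlenOctree-style view-dependent colour: per-channel SH
    coefficients stored at a leaf, dotted with the basis evaluated at
    the viewing direction -- no neural network at render time."""
    basis = sh_basis(direction)
    return [sum(k * b for k, b in zip(channel, basis))
            for channel in sh_coeffs]

# Hypothetical leaf: reddish base colour with a lobe along +z.
leaf = [[0.9, 0.0, 0.4, 0.0],   # R coefficients
        [0.2, 0.0, 0.1, 0.0],   # G
        [0.1, 0.0, 0.1, 0.0]]   # B
front = radiance(leaf, (0.0, 0.0, 1.0))
back = radiance(leaf, (0.0, 0.0, -1.0))
```

Since the direction enters only through the closed-form basis, specular-like effects survive even though the viewing direction is no longer a network input.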