Deep Learning Weekly: Issue #212
Facebook’s new open-world object segmentation, an algorithm that adapts to complexity, a game theoretic approach to explain ML model outputs, Google’s synthetic dataset for grammar and more
This week in deep learning, we bring you a new benchmark for open-world object segmentation, a virtual dressing room that uses computer vision, MIT's AI-powered material sensing platform for laser cutters and a paper on an algorithm that adapts to complexity: PonderNet.
You may also enjoy Google's synthetic dataset for grammatical error correction, a hands-on guide to effective image captioning using attention mechanism, a paper on task-aligned one-stage object detection, a paper on diverse point cloud completion with geometry-aware transformers, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Facebook shares Unidentified Video Objects (UVO), a new benchmark to facilitate research on open-world segmentation, an important computer vision task that aims to detect, segment, and track all objects exhaustively in a video.
A team of Ph.D. students is creating what they consider to be the first fashion tool using existing catalog images to process at a scale of over a million garments weekly.
Google introduces a synthetic dataset, generated via tagged corruption models, which significantly improves on Grammatical Error Correction (GEC) baselines.
Italian machinery company Breton, announced the development of the first 3D printer that uses advanced machine learning algorithms.
Tel Aviv-based artificial intelligence startup D-ID created a personalized digital campaign that allows viewers to upload a photograph and then see it come to life in a personalized trailer.
Mobile & Edge
AeroFarms and Nokia Bell Labs unveiled a multi-year partnership to combine their expertise and expand their joint capabilities in cutting-edge networking, autonomous systems, AI-enable sensor technologies, and others to identify and track plant interactions at the most advanced levels.
A new tutorial notebook that teaches you how to train a pose classification model using MoveNet and TensorFlow Lite from preprocessing down to TFLite conversion.
A team from MIT’s Computer Science and Artificial Intelligence Laboratory has come up with a technology called SensiCut, a material sensing platform for laser cutters that warns you of potentially harmful materials.
An article that extends Pete Warden’s blog on neat TinyML tricks for shrinking convolutional networks.
A visual blog highlighting the capabilities of language models such as writing stories, programming websites, and turning captions into images.
A comprehensive tutorial on neural image generation with visual attention.
An introduction to the One-Click Triton Inference Server in Google Kubernetes Engine (GKE), and how the solution scales these ML models, meet stringent latency budgets, and optimize operational costs.
A comprehensive primer on the historical trajectory of AI story generation.
Libraries & Code
A library that uses a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values
LabelImg is a graphical image annotation tool which is written in Python and uses Qt for its graphical interface.
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.
Papers & Publications
In standard neural networks the amount of computation used grows with the size of the inputs, but not with the complexity of the problem being learnt. To overcome this limitation we introduce PonderNet, a new algorithm that learns to adapt the amount of computation based on the complexity of the problem at hand. PonderNet learns end-to-end the number of computational steps to achieve an effective compromise between training prediction accuracy, computational cost and generalization. On a complex synthetic problem, PonderNet dramatically improves performance over previous adaptive computation methods and additionally succeeds at extrapolation tests where traditional neural networks fail. Also, our method matched the current state of the art results on a real world question and answering dataset, but using less compute. Finally, PonderNet reached state of the art results on a complex task designed to test the reasoning capabilities of neural networks.
One-stage object detection is commonly implemented by optimizing two sub-tasks: object classification and localization, using heads with two parallel branches, which might lead to a certain level of spatial misalignment in predictions between the two tasks. In this work, we propose a Task-aligned One-stage Object Detection (TOOD) that explicitly aligns the two tasks in a learning-based manner. First, we design a novel Task-aligned Head (T-Head) which offers a better balance between learning task-interactive and task-specific features, as well as a greater flexibility to learn the alignment via a task-aligned predictor. Second, we propose Task Alignment Learning (TAL) to explicitly pull closer (or even unify) the optimal anchors for the two tasks during training via a designed sample assignment scheme and a task-aligned loss. Extensive experiments are conducted on MS-COCO, where TOOD achieves a 51.1 AP at single-model single-scale testing. This surpasses the recent one-stage detectors by a large margin, such as ATSS (47.7 AP), GFL (48.2 AP), and PAA (49.0 AP), with fewer parameters and FLOPs. Qualitative results also demonstrate the effectiveness of TOOD for better aligning the tasks of object classification and localization.
Point clouds captured in real-world applications are often incomplete due to the limited sensor resolution, single viewpoint, and occlusion. Therefore, recovering the complete point clouds from partial ones becomes an indispensable task in many practical applications. In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr that adopts a transformer encoder-decoder architecture for point cloud completion. By representing the point cloud as a set of unordered groups of points with position embeddings, we convert the point cloud to a sequence of point proxies and employ the transformers for point cloud generation. To facilitate transformers to better leverage the inductive bias about 3D geometric structures of point clouds, we further devise a geometry-aware block that models the local geometric relationships explicitly. The migration of transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Furthermore, we propose two more challenging benchmarks with more diverse incomplete point clouds that can better reflect the real-world scenarios to promote future research. Experimental results show that our method outperforms state-of-the-art methods by a large margin on both the new benchmarks and the existing ones.