Deep Learning Weekly: Issue #276
AI with the right dose of curiosity, notebooks to DVC pipelines for reproducible experiments, generating human-level text with contrastive search, Kangas, a new open-source tool, and more.
This week in deep learning, we bring you AI with the right dose of curiosity, notebooks to DVC pipelines for reproducible experiments, generating human-level text with contrastive search, the new release of Kangas, and a paper on the emergent abilities of large language models.
You may also enjoy a unified benchmark for mathematical reasoning, deploying YOLOv5 using the OctoML CLI, a quickstart guide to using OneFormer, a paper on continuous soft pseudo-labeling in ASR, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Researchers make headway in solving a longstanding problem of balancing curious “exploration” versus “exploitation” of known pathways in reinforcement learning.
Microsoft and d-Matrix announced that the Microsoft Project Bonsai reinforcement learning platform will be supported on d-Matrix's DIMC technology, which the two vendors hope will significantly accelerate AI inference.
Researchers from Arizona State University and the Allen Institute for AI proposed Līla, a unified benchmark for mathematical reasoning, to assess and enhance AI systems in this field.
Intel Corp. is looking to compete with rivals Advanced Micro Devices Inc. and Nvidia Corp. in the high-performance computing and artificial intelligence markets with the launch of its latest product family, the Intel Max.
In this post, we discuss how SageMaker and NVIDIA Triton Inference Server can serve multiple models.
A tutorial on visualizing time series predictions from Prophet with Matplotlib in Comet.
A guide that explores the use of Papermill to build a one-stage DVC pipeline that executes an entire notebook.
A guide for quickly deploying YOLOv5 using the OctoML CLI.
A blog that introduces the current state-of-the-art decoding method, Contrastive Search, for neural text generation.
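As a concrete illustration, the core idea of contrastive search can be sketched in plain Python: each candidate token is scored by its model confidence minus a degeneration penalty, the maximum cosine similarity between the candidate's representation and those of previously generated tokens. The function names and the tiny two-dimensional "representations" below are illustrative toys, not taken from the blog.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_score(prob, cand_repr, prev_reprs, alpha=0.6):
    """(1 - alpha) * model confidence - alpha * degeneration penalty."""
    penalty = max(cosine(cand_repr, h) for h in prev_reprs)
    return (1 - alpha) * prob - alpha * penalty

def select_next(candidates, prev_reprs, alpha=0.6):
    """Pick the top-k candidate with the highest contrastive score.

    candidates: list of (token, probability, representation) tuples.
    """
    return max(
        candidates,
        key=lambda c: contrastive_score(c[1], c[2], prev_reprs, alpha),
    )[0]

# Toy example: "repeat" is more probable, but its representation is
# nearly identical to a previous hidden state, so it gets penalized.
prev = [[1.0, 0.0], [0.8, 0.6]]
cands = [
    ("repeat", 0.70, [1.0, 0.05]),   # high confidence, high similarity
    ("novel",  0.30, [-0.2, 1.0]),   # lower confidence, low similarity
]
print(select_next(cands, prev, alpha=0.6))  # → novel
```

With `alpha=0` the penalty vanishes and the rule reduces to greedy decoding, which would pick the repetitive token instead.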
This blog explains how to analyze a neural network layer by layer, building on an earlier post about using the Vela Compiler.
This notebook provides a quickstart guide to using OneFormer, the first multi-task universal image segmentation framework based on transformers, for inference on images.
An article that covers the purpose of feature scaling, common scaling techniques, and which technique to employ in different scenarios.
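The two most common techniques such articles cover, min-max scaling and standardization, can be sketched in plain Python. Helper names and sample data below are illustrative, not from the article.

```python
import math

def min_max_scale(xs):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to zero mean and unit (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

data = [10.0, 20.0, 30.0, 40.0]
print(min_max_scale(data))  # [0.0, 0.333..., 0.666..., 1.0]
print(standardize(data))    # zero mean, unit variance
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers; standardization is the usual choice for models that assume roughly Gaussian inputs.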
Libraries & Code
Kangas is a tool for exploring, analyzing, and visualizing large-scale multimedia data. It provides a straightforward Python API for logging large tables of data, along with an intuitive visual interface for performing complex queries against your dataset.
A package for symbolically tokenizing MIDI music files for neural networks, presented at the ISMIR 2021 LBD.
cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
A package that makes prompt programming with foundation models easier.
Papers & Publications
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence raises the question of whether additional scaling could potentially further expand the range of capabilities of language models.
We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation, in that a teacher model generates targets that need to be mimicked by the student model being trained. Interestingly, however, PL strategies generally use hard labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution (i.e., soft labels) over sequences as the target for unlabeled data, instead of a single best pseudo-labeled transcript (hard labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that this does not happen with hard labels because the training loss on hard labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we present several experiments that support this hypothesis and evaluate several regularization approaches that can ameliorate the degenerate collapse when using soft labels. These approaches can bring the accuracy of soft labels closer to that of hard labels, and while they are not yet able to outperform hard labels, they serve as a useful framework for further improvements.
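The hard- versus soft-label distinction the abstract draws can be made concrete with a toy per-frame cross-entropy computation. This is an illustrative sketch, not the paper's training setup; the distributions below are made up.

```python
import math

def cross_entropy(target, pred):
    """CE(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

# Teacher distribution over 3 tokens for a single frame.
teacher = [0.6, 0.3, 0.1]
# Hard pseudo-label: a one-hot vector on the teacher's argmax token.
hard = [1.0, 0.0, 0.0]
# Student's current prediction for the same frame.
student = [0.5, 0.4, 0.1]

# Hard labels penalize only the argmax token ...
print(cross_entropy(hard, student))
# ... while soft labels ask the student to match the full distribution.
print(cross_entropy(teacher, student))
```

The hard target gives the student a single "correct" token per frame; the soft target spreads probability mass, which (per the paper's hypothesis) removes the sequence-level consistency constraint and can let training collapse without extra regularization.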