Deep Learning Weekly: Issue #254
Iterative's VS Code Extension for Experiment Tracking, MLOps at a reasonable scale, Meta's new direct speech-to-speech model that does not rely on text generation, and more
This week in deep learning, we bring you Iterative's VS Code Extension for Experiment Tracking, MLOps at a reasonable scale, Meta's new direct speech-to-speech model that does not rely on text generation, and a paper on end-to-end generative pre-training for multimodal video captioning.
You may also enjoy Arkestro's predictive procurement orchestration, infrastructure for parallel training of models, OpenAI's techniques for training large neural networks, a paper on machine learning sensors, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Sara Hooker will serve as head of Cohere For AI, a nonprofit research lab and community, which just announced its official launch.
Iterative, the MLOps company dedicated to streamlining the workflow of data scientists and machine learning engineers, announced a free extension to Visual Studio Code for experiment tracking and machine learning model development.
Comet, a leading development platform for machine learning teams, announced several integrations including Ray, Kubeflow, and Google Vertex AI.
Thanks to a new partnership, Graphcore and Aleph Alpha will work together on the research and deployment of Aleph Alpha's advanced multi-modal models on current IPU systems and on the next-generation Good Computer.
Arkestro, a company that offers predictive procurement orchestration, announced a $26 million Series A to fund continued growth of its platform.
A high-level blog post on parallel training infrastructure using Docker, Kubernetes, Google Cloud File Store, and more.
In this guide, you’ll learn more about MLOps at a reasonable scale, and you’ll get to know the best practices, templates, and examples that will help you understand how to implement them in your work.
This document introduces best practices for implementing machine learning (ML) on Google Cloud, with a focus on custom-trained models based on your data and code.
A guide to the differences between translational, rotational, and scale invariance techniques.
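To make the distinction concrete, here is a minimal sketch of translation invariance using global max pooling on a made-up 1-D activation map (the data and function names are illustrative, not from the guide):

```python
# Sketch of translation invariance: global max pooling produces the same
# output when the input pattern is shifted, whereas the raw feature
# vector does not. The activation values below are made up.

def global_max_pool(features):
    # Collapse the spatial dimension to a single value.
    return max(features)

original = [0.0, 0.9, 0.1, 0.0, 0.0]
shifted  = [0.0, 0.0, 0.0, 0.9, 0.1]  # same pattern, translated right

print(original == shifted)                                    # False
print(global_max_pool(original) == global_max_pool(shifted))  # True
```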
A post examining metrics that measure interest in distributed systems and distributed computing with an eye towards their implications for machine learning.
To enable faster inference and support translation between unwritten languages, Meta AI is sharing new work on a direct speech-to-speech translation (S2ST) approach, which does not rely on text generation as an intermediate step.
An article describing the technical challenges encountered in applying quantization-aware training (QAT) and pruning to subclassed models and custom layers, along with results that demonstrate the benefits of these optimization techniques.
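As a rough illustration of the two techniques the article discusses, the sketch below applies fake quantization and magnitude pruning to a plain list of weights rather than real model layers; the bit width, sparsity level, and weight values are arbitrary choices, not taken from the article:

```python
# Toy versions of the two optimizations: fake quantization (as simulated
# during quantization-aware training) and magnitude-based weight pruning.

def fake_quantize(weights, num_bits=8):
    # Map weights onto a uniform grid of 2**num_bits levels and back to
    # floats, so training "sees" the quantization error.
    lo, hi = min(weights), max(weights)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels or 1.0  # avoid division by zero
    return [lo + round((w - lo) / scale) * scale for w in weights]

def magnitude_prune(weights, sparsity=0.5):
    # Zero out the k smallest-magnitude weights, where k is the
    # requested fraction of the total.
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

print(magnitude_prune([0.9, -0.05, 0.4, -0.8]))  # [0.9, 0.0, 0.0, -0.8]
```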
OpenAI shares techniques for memory-efficient training of large neural networks, including data parallelism, tensor parallelism, pipeline parallelism, and Mixture-of-Experts.
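The first of those techniques, data parallelism, can be sketched in a few lines: each worker computes a gradient on its own shard of the batch, and an all-reduce averages the gradients so every replica applies the same update. The model, worker count, and data below are made up for illustration, not drawn from OpenAI's post:

```python
# Data parallelism on a 1-D linear model y = w * x with squared-error
# loss: shard the batch, compute per-worker gradients, average them.

def local_gradient(w, shard):
    # d/dw of mean((w*x - y)^2) over this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.1):
    # In a real system the per-shard gradients are computed in parallel
    # and combined with an all-reduce; here we just average a list.
    grads = [local_gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Batch sampled from the target function y = 3x, split across 2 workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # → 3.0
```

Tensor and pipeline parallelism instead split the model itself (across its weight matrices, or across its layers) rather than the batch.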
Libraries & Code
An open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.
A Colab notebook on loading BIG-bench JSON tasks, a collaborative benchmark intended to probe large language models and extrapolate their future capabilities, and inspecting examples.
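A minimal sketch of what inspecting such a task looks like: BIG-bench JSON tasks contain an "examples" list of input/target pairs, but the toy task content below is made up, not an actual benchmark task:

```python
import json

# A made-up task in the BIG-bench JSON style: metadata plus a list of
# examples, each pairing an input prompt with a target answer.
task_json = """
{
  "description": "Toy arithmetic task",
  "keywords": ["arithmetic"],
  "examples": [
    {"input": "2 + 2 =", "target": "4"},
    {"input": "3 + 5 =", "target": "8"}
  ]
}
"""

task = json.loads(task_json)
print(task["description"], "-", len(task["examples"]), "examples")
for ex in task["examples"]:
    print(repr(ex["input"]), "->", repr(ex["target"]))
```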
A seamless, high-performing and accessible library for OCR-related tasks powered by Deep Learning.
Papers & Publications
Machine learning sensors represent a paradigm shift for the future of embedded machine learning applications. Current instantiations of embedded machine learning (ML) suffer from complex integration, lack of modularity, and privacy and security concerns from data movement. This article proposes a more data-centric paradigm for embedding sensor intelligence on edge devices to combat these challenges. Our vision for "sensor 2.0" entails segregating sensor input data and ML processing from the wider system at the hardware level and providing a thin interface that mimics traditional sensors in functionality. This separation leads to a modular and easy-to-use ML sensor device. We discuss the challenges presented by the standard approach of building ML processing into the software stack of the controlling microprocessor on an embedded system, and how the modularity of ML sensors alleviates these problems. ML sensors increase privacy and accuracy while making it easier for system builders to integrate ML into their products as a simple component. We provide examples of prospective ML sensors and an illustrative datasheet as a demonstration, and we hope that this will open a dialogue that moves us toward sensor 2.0.
Hyperparameter optimization (HPO) is crucial for machine learning algorithms to achieve satisfactory performance, and its progress has been boosted by related benchmarks. Nonetheless, existing benchmarking efforts all focus on HPO for traditional centralized learning while ignoring federated learning (FL), a promising paradigm for collaboratively learning models from dispersed data. In this paper, we first identify several unique characteristics of HPO for FL algorithms. Because of these characteristics, existing HPO benchmarks no longer satisfy the need to compare HPO methods in the FL setting. To facilitate research on HPO in the FL setting, we propose and implement FedHPO-B, a benchmark suite that incorporates comprehensive FL tasks, enables efficient function evaluations, and eases continuing extensions. We also conduct extensive experiments based on FedHPO-B to benchmark a few HPO methods.
Recent video and language pre-training frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pre-training (MV-GPT), a new pre-training framework for learning from unlabelled videos which can be effectively used for generative tasks such as multimodal video captioning. Unlike recent video-language pre-training frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabelled videos, we leverage the future utterance as an additional text source and propose a bidirectional generation objective -- we generate future utterances given the present multimodal context, and also the present utterance given future observations. With this objective, we train an encoder-decoder model end-to-end to generate a caption from raw pixels and transcribed speech directly. Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks, as well as for other video understanding tasks such as VideoQA, video retrieval and action classification.