Deep Learning Weekly: Issue #217
Google’s Wikipedia-based Image-Text Dataset, How Waze uses TFX and Vertex for production-scale ML, Translatotron 2, the environmental impacts of AI systems, and more
This week in deep learning, we bring you Google's Wikipedia-based Image-Text Dataset, a library that delivers a unified low-precision inference interface, how Waze uses TFX and Vertex for production ML, and a paper on Google's Translatotron 2.
You may also enjoy big tech and their favorite deep learning techniques, the environmental impact of different deep learning models, 3D reconstruction of endangered species, a paper on recursively summarizing books, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Google introduces the Wikipedia-Based Image-Text (WIT) Dataset, a large multimodal dataset created by extracting multiple text selections associated with an image from Wikipedia articles and Wikimedia image links.
A brief and timely article on the deep learning techniques and research efforts of companies such as Facebook, Google, and others.
An AI tool developed by researchers at the University of Liverpool was recently used to discover four new materials, including a new family of solid-state materials that conduct lithium.
The latest MLPerf benchmarks show NVIDIA has extended its high-water marks in performance and energy efficiency for AI inference to Arm as well as x86 computers.
Dynatask is Dynabench’s new feature that makes it easy for researchers to leverage human annotators to actively fool NLP models and identify weaknesses through natural interactions.
Mobile & Edge
You can now develop solutions with TensorFlow for the Spresense microcontroller board from Sony. The short blog includes an introductory tutorial on how to do micro speech and person detection.
A repository that provides code for machine learning algorithms for edge devices developed at Microsoft Research India.
Arduino released the Nicla Sense ME: a tiny but mighty board, co-developed with Bosch Sensortec, designed to enable better sensing and intelligence on the edge.
A new open-source deep learning training/inference framework that could be used for mobile, edge, and cloud scenarios.
A comprehensive analysis highlighting the environmental impacts of AI systems.
A technical case study of how the world's largest community-based traffic and navigation app uses TFX and Vertex pipelines.
An NVIDIA AI podcast about Liu’s project called Online Adaptation for Consistent Mesh Reconstruction in the Wild.
Seven real-world examples of AI failures and what current weaknesses they reveal about the state of deep learning.
Libraries & Code
An open-source Python library that delivers a unified low-precision inference interface across multiple Intel-optimized Deep Learning (DL) frameworks on both CPUs and GPUs.
A Python library for online machine learning and streaming data.
A differentiable computer vision library for PyTorch. It consists of a set of routines and differentiable modules to solve generic computer vision problems.
Papers & Publications
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the previous three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retain the source speaker's voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.
A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task. We collect a large volume of demonstrations and comparisons from human labelers, and fine-tune GPT-3 using behavioral cloning and reward modeling to do summarization recursively. At inference time, the model first summarizes small sections of the book and then recursively summarizes these summaries to produce a summary of the entire book. Our human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves. Our resulting model generates sensible summaries of entire books, even matching the quality of human-written summaries in a few cases (∼5% of books). We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization. A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark for answering questions about books and movie scripts. We release datasets of samples from our model.
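The inference-time procedure described in this abstract (summarize small sections, then summarize the summaries until one summary remains) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `chunk_text`, `toy_summarizer`, and `recursive_summarize` are hypothetical names, the word-count chunking is an assumption, and `toy_summarizer` is a placeholder that simply truncates text where the paper uses a fine-tuned GPT-3 model.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def toy_summarizer(text: str, ratio: int = 4) -> str:
    """Placeholder for a learned summarizer (assumption, not the paper's model).

    Keeps roughly the first 1/ratio of the words, and at least one word.
    """
    words = text.split()
    keep = max(1, len(words) // ratio)
    return " ".join(words[:keep])


def recursive_summarize(text: str,
                        summarize: Callable[[str], str],
                        max_words: int = 200) -> str:
    """Summarize text of any length with a summarizer limited to max_words."""
    # Base case: the text fits in one call to the summarizer.
    if len(text.split()) <= max_words:
        return summarize(text)

    # Summarize each section, then join the section summaries.
    section_summaries = [summarize(chunk)
                         for chunk in chunk_text(text, max_words)]
    combined = " ".join(section_summaries)

    # Guard against a summarizer that does not shrink its input,
    # which would otherwise recurse forever.
    if len(combined.split()) >= len(text.split()):
        raise ValueError("summarizer must shorten its input")

    # Recurse on the concatenated summaries.
    return recursive_summarize(combined, summarize, max_words)


if __name__ == "__main__":
    book = " ".join(f"w{i}" for i in range(1000))
    print(recursive_summarize(book, toy_summarizer, max_words=200))
```

Swapping `toy_summarizer` for a call to a real model leaves the recursion unchanged; the only requirement in this sketch is that each call returns text shorter than its input.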