Deep Learning Weekly: Issue #217

Google’s Wikipedia-based Image-Text Dataset, How Waze uses TFX and Vertex for production-scale ML, Translatotron 2, the environmental impacts of AI systems, and more

Hey folks,

This week in deep learning, we bring you Google's Wikipedia-based Image-Text Dataset, a library that delivers a unified low-precision inference interface, how Waze uses TFX and Vertex for production ML, and a paper on Google's Translatotron 2.

You may also enjoy big tech and their favorite deep learning techniques, the environmental impact of different deep learning models, 3D reconstruction of endangered species, a paper on recursively summarizing books, and more!

As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.

Until next week!


Announcing WIT: A Wikipedia-Based Image-Text Dataset

Google introduces the Wikipedia-Based Image-Text (WIT) Dataset, a large multimodal dataset created by extracting multiple text selections associated with an image from Wikipedia articles and Wikimedia image links.

Big Tech & Their Favourite Deep Learning Techniques

A brief and timely article on the deep learning techniques and research efforts of large technology companies, including Facebook and Google.

Researchers tap AI in search of new wonder materials

An AI tool developed by researchers at the University of Liverpool was recently used to discover four new materials, including a new family of solid-state materials that conduct lithium.

ARM Debuts in Latest MLPerf AI Inference Benchmarks

The latest MLPerf benchmarks show NVIDIA has extended its high-water marks in performance and energy efficiency for AI inference to Arm as well as x86 computers.

Dynatask, a new paradigm of AI benchmarking, is now available for the AI community

Dynatask is Dynabench’s new feature that makes it easy for researchers to leverage human annotators to actively fool NLP models and identify weaknesses through natural interactions.

Mobile & Edge

TensorFlow is available on Sony Spresense

You can now develop solutions with TensorFlow for the Spresense microcontroller board from Sony. The short blog includes an introductory tutorial on how to do micro speech and person detection.


A repository that provides code for machine learning algorithms for edge devices developed at Microsoft Research India.

Arduino Nicla Sense ME makes sense of the world

Arduino released the Nicla Sense ME: a tiny but mighty board, co-developed with Bosch Sensortec, designed to enable better sensing and intelligence on the edge.


A new open-source deep learning training/inference framework that could be used for mobile, edge, and cloud scenarios.


The Imperative for Sustainable AI Systems

A comprehensive analysis highlighting the environmental impacts of AI systems.

How Waze Uses TFX to Scale Production-Ready ML

A technical case study of how the world's largest community-based traffic and navigation app uses TFX and Vertex pipelines.

3D Reconstruction of Endangered Species with Sifei Liu

An NVIDIA AI podcast about Liu’s project called Online Adaptation for Consistent Mesh Reconstruction in the Wild.

7 Revealing Ways AIs Fail

Seven real-world examples of AI failures and what current weaknesses they reveal about the state of deep learning. 

Libraries & Code


An open-source Python library that delivers a unified low-precision inference interface across multiple Intel-optimized Deep Learning (DL) frameworks on both CPUs and GPUs.

online-ml/river: Online machine learning in Python

A Python library for online machine learning and streaming data.

kornia/kornia: Open Source Differentiable Computer Vision Library

A differentiable computer vision library for PyTorch. It consists of a set of routines and differentiable modules to solve generic computer vision problems.

Papers & Publications

Translatotron 2: Robust direct speech-to-speech translation


We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects all the previous three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pause. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retain the source speaker's voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker's voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model is able to retain each speaker's voice for input with speaker turns.

Recursively Summarizing Books with Human Feedback


A major challenge for scaling machine learning is training models to perform tasks that are very difficult or time-consuming for humans to evaluate. We present progress on this problem on the task of abstractive summarization of entire fiction novels. Our method combines learning from human feedback with recursive task decomposition: we use models trained on smaller parts of the task to assist humans in giving feedback on the broader task. We collect a large volume of demonstrations and comparisons from human labelers, and fine-tune GPT-3 using behavioral cloning and reward modeling to do summarization recursively. At inference time, the model first summarizes small sections of the book and then recursively summarizes these summaries to produce a summary of the entire book. Our human labelers are able to supervise and evaluate the models quickly, despite not having read the entire books themselves. Our resulting model generates sensible summaries of entire books, even matching the quality of human-written summaries in a few cases (∼5% of books). We achieve state-of-the-art results on the recent BookSum dataset for book-length summarization. A zero-shot question-answering model using these summaries achieves state-of-the-art results on the challenging NarrativeQA benchmark for answering questions about books and movie scripts. We release datasets of samples from our model.
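
For readers who want a feel for the inference-time loop described in the abstract (summarize small sections, then summarize the summaries), here is a minimal sketch. It is not the paper's implementation: the function names, the word-based chunking, and the toy `first_words_summarizer` (which just keeps the first few words) are all assumptions made for illustration. In practice the `summarize` callable would be a learned model.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def recursive_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 200,
) -> str:
    """Summarize chunks, then recursively summarize the joined summaries."""
    words = text.split()
    # Base case: the text fits in one chunk, so summarize it directly.
    if len(words) <= chunk_size:
        return summarize(text)

    # Recursive case: summarize each chunk, then treat the joined
    # summaries as a new, shorter text.
    summaries = [summarize(chunk) for chunk in chunk_text(text, chunk_size)]
    combined = " ".join(summaries)

    # Safety guard (an assumption of this sketch): if the summarizer did
    # not shrink the text, truncate and stop instead of recursing forever.
    if len(combined.split()) >= len(words):
        return summarize(" ".join(combined.split()[:chunk_size]))

    return recursive_summarize(combined, summarize, chunk_size)


def first_words_summarizer(text: str, max_words: int = 20) -> str:
    """Toy stand-in for a learned model: keep the first max_words words."""
    return " ".join(text.split()[:max_words])


if __name__ == "__main__":
    book = " ".join(f"w{i}" for i in range(1000))
    print(recursive_summarize(book, first_words_summarizer, chunk_size=200))
```

Passing the summarizer in as a callable keeps the recursion independent of any particular model, so the toy function can be swapped for a real one without touching the loop.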
