Deep Learning Weekly: Issue #215
TensorFlow Similarity, faster quantized inference with XNNPACK, an introduction to generative spoken language models, lettuce-picking robots, transformer-based 3D dance generation, and more
This week in deep learning, we bring you TensorFlow Similarity, faster quantized inference with XNNPACK, the world's first 5G- and AI-enabled drone platform, and a paper on transformer-based 3D dance generation.
You may also enjoy Intel's advancements in multi-agent evolutionary reinforcement learning, on-device image recognition resources for ESP32, a technical introduction to generative spoken language models, a paper on Fastformer, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
TensorFlow releases the first version of a Python package designed to make it easy and fast to train similarity models.
Intel’s advances in multi-agent evolutionary reinforcement learning (MERL) are a step towards what one may call a non-sentient hive mind.
Akkio's platform was able to build a money-making model with 700 rows of training data consisting of the history of horses scheduled to run at Saratoga Race Course.
A new technique developed by researchers at Carnegie Mellon University and the KAIST Cybersecurity Research Center employs unsupervised learning to address some of the challenges of current methods used to detect adversarial attacks.
Cloning your voice using artificial intelligence is simultaneously tedious and simple: hallmarks of a technology that’s just about mature and ready to go public.
In a new study, a team led by researchers from TU Dresden created a system made from networks of tiny polymer fibers that, when submerged in a solution meant to replicate the inside of the human body, function as organic transistors.
Mobile & Edge
TensorFlow extends the XNNPACK backend to quantized models, delivering, on average across computer vision models, a 30% speedup on ARM64 mobile phones, a 5X speedup on x86-64 laptop and desktop systems, and a 20X speedup for in-browser inference.
Designed for small, unmanned aircraft systems, the Qualcomm Flight RB5 Platform combines low-power computing and camera systems with AI and connectivity features to bring together advanced imaging capabilities and drone-to-drone communication.
A compact and intelligent VOC detection system based on a trained tinyML model deployed to a low-power embedded device.
A handful of resources for using the ESP32 with multiple different camera modules to run image recognition models on-device.
In Australia, innovators have just presented the country's first fully automated AI-enabled farm spanning 1,900 hectares.
A technical introduction to the Generative Spoken Language Model (GSLM), the first high-performance NLP model that incorporates the full range of expressivity of oral language without application restrictions.
An in-depth review of Stanford’s recent paper on foundation models.
Researchers are turning to the mathematics of higher-order interactions to better model the complex connections within their data.
A comprehensive introduction to Optimum, an optimization toolkit that provides performance optimization tools targeting efficient AI hardware and built-in collaboration with hardware partners.
Libraries & Code
A Python library to benchmark counterfactual explanation and recourse models.
A visual and statistical debugger specifically designed for deep learning.
A set of example templates to accelerate the delivery of custom ML solutions to production so you can get started quickly without having to make too many design choices.
Papers & Publications
We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion conditioned on music. The proposed AIST++ dataset contains 5.2 hours of 3D dance motion in 1408 sequences, covering 10 dance genres with multi-view videos with known camera poses -- the largest dataset of this kind to our knowledge. We show that naively applying sequence models such as transformers to this dataset for the task of music-conditioned 3D motion generation does not produce satisfactory 3D motion that is well correlated with the input music. We overcome these shortcomings by introducing key changes in architecture design and supervision: the FACT model involves a deep cross-modal transformer block with full-attention that is trained to predict N future motions. We empirically show that these changes are key factors in generating long sequences of realistic dance motion that are well-attuned to the input music. We conduct extensive experiments on AIST++ with user studies, where our method outperforms recent state-of-the-art methods both qualitatively and quantitatively.
Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity with respect to input sequence length. Although there are many methods for Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, which is an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models and can meanwhile achieve comparable or even better long text modeling performance.
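For readers who want a feel for the additive-attention idea in the Fastformer abstract, here is a minimal single-head NumPy sketch. It is not the authors' reference implementation: the function names (`additive_pool`, `fastformer_style_attention`, `init_params`), the parameter names, the scaling by the square root of the hidden size, and the final residual connection to the query are assumptions made for illustration. The point it shows is that each token interacts only with a pooled global vector, so no n-by-n attention matrix is ever built and the cost grows linearly with sequence length.

```python
import numpy as np


def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def additive_pool(x, w):
    """Summarize n token vectors (n, d) into one global vector (d,).

    Each token gets a scalar score from the learned vector w (d,);
    the result is the softmax-weighted sum of the tokens. Cost is O(n * d).
    """
    scores = x @ w / np.sqrt(x.shape[1])  # (n,)
    weights = softmax(scores)             # (n,), sums to 1
    return weights @ x                    # (d,)


def init_params(d, seed=0):
    # Hypothetical random parameters, for demonstration only.
    rng = np.random.default_rng(seed)
    mats = {name: rng.normal(size=(d, d)) / np.sqrt(d)
            for name in ("Wq", "Wk", "Wv", "Wr")}
    vecs = {name: rng.normal(size=d) for name in ("wq", "wk")}
    return {**mats, **vecs}


def fastformer_style_attention(x, params):
    """x: (n, d) token representations -> (n, d) contextualized tokens."""
    q = x @ params["Wq"]
    k = x @ params["Wk"]
    v = x @ params["Wv"]

    # 1. Pool all queries into one global query vector.
    global_q = additive_pool(q, params["wq"])
    # 2. Let every key interact with the global query (element-wise product).
    p = k * global_q
    # 3. Pool those interactions into one global key vector.
    global_k = additive_pool(p, params["wk"])
    # 4. Let every value interact with the global key, then transform.
    u = v * global_k
    # Assumed residual connection to the query representation.
    return u @ params["Wr"] + q


if __name__ == "__main__":
    tokens = np.random.default_rng(1).normal(size=(8, 16))  # 8 tokens, d = 16
    out = fastformer_style_attention(tokens, init_params(16))
    print(out.shape)
```

Doubling the number of tokens roughly doubles the work in every step above, whereas standard self-attention would quadruple the size of its score matrix.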