Deep Learning Weekly: Issue #232
Meta's AV-HuBERT framework for understanding speech by both seeing and hearing, an article on the data-centric approach to AI, an overview of TF-GAN, and more.
This week in deep learning, we bring you Meta's AV-HuBERT framework for understanding speech by both seeing and hearing, an article on the data-centric approach to AI, an overview of TF-GAN, and a paper on contrastive fine-grained clustering via generative adversarial networks.
You may also enjoy MIT's novel ML technique for modeling plasma phenomena, a comparison matrix of different MLOps platforms, an essential guide to six types of autoencoders, a paper on a detector that uses image-level classification labels to expand its vocabulary to tens of thousands of classes, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
AI that understands speech by looking as well as hearing
Meta announces Audio-Visual Hidden Unit BERT (AV-HuBERT), a state-of-the-art self-supervised framework for understanding speech that learns by both seeing and hearing people speak. It is the first system to jointly model speech and lip movements from unlabeled data.
Seeing the plasma edge of fusion experiments in new ways with artificial intelligence
MIT researchers are testing a simplified turbulence theory’s ability to model complex plasma phenomena using a novel machine learning technique.
Sensory debuts new voice and vision AI services in the cloud
Sensory Inc. announces the launch in beta test mode of its new SensoryCloud.ai service, providing a full “AI as a service” platform for companies that want to process voice and vision AI in the cloud.
FarmSense uses sensors and machine learning to bug-proof crops
FarmSense, a Riverside, California-based agtech startup, is attempting to solve insect pest problems using optical sensors and novel classification systems.
AI Startup Speeds Up Derivative Models for Bank of Montreal
Toronto-based Riskfuel uses NVIDIA DGX systems to train its neural network-based accelerator, which provides ‘rocket’ power to financial institutions.
Data-Centric Approach vs Model-Centric Approach in Machine Learning
A comprehensive article highlighting the differences between the data-centric approach and the model-centric approach to ML. The article also explains how to adopt a data-centric infrastructure.
Compare MLOps Platforms. Breakdowns of SageMaker, VertexAI, AzureML, Dataiku, Databricks, h2o, kubeflow, mlflow...
A comprehensive repository that comes with a comparison matrix of different MLOps platforms.
An article discussing the relevant metrics to help you monitor your models and the practical tools that allow you to apply model monitoring in your current workflows.
Avalanche debuts hAIsten AI to speed up AI model deployment
Taiwan-based Avalanche Computing Inc. announces a new low-code artificial intelligence tool called hAIsten AI that it claims can train AI models using multiple GPUs without any coding.
Our Summer of Code Project on TF-GAN
A technical article providing an overview of TF-GAN, along with the highlighted outcomes of the Google Summer of Code 2021 project.
Essential Guide to Auto Encoders
A comprehensive blog discussing six types of autoencoders at a high level: undercomplete autoencoders, sparse autoencoders, contractive autoencoders, denoising autoencoders, variational autoencoders (for generative modeling), and convolutional autoencoders.
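The first of these, the undercomplete autoencoder, is the simplest to see in code: force the data through a bottleneck narrower than the input and train the network to reconstruct its own input. Below is a minimal pure-Python sketch (no deep learning library, linear layers only, all names and hyperparameters illustrative) that compresses 4-dimensional points lying on a 2-dimensional subspace down to a 2-dimensional code:

```python
import random

random.seed(0)

IN, HID = 4, 2          # 4-dim inputs, 2-dim bottleneck (undercomplete)
LR, EPOCHS = 0.01, 300

# Encoder W1 (HID x IN) and decoder W2 (IN x HID), small random init.
W1 = [[random.uniform(-0.5, 0.5) for _ in range(IN)] for _ in range(HID)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(HID)] for _ in range(IN)]

def encode(x):
    return [sum(W1[h][i] * x[i] for i in range(IN)) for h in range(HID)]

def decode(z):
    return [sum(W2[i][h] * z[h] for h in range(HID)) for i in range(IN)]

# Toy data confined to a 2-dim subspace of R^4, so a 2-dim code suffices.
data = []
for _ in range(200):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append([a, b, a + b, a - b])

for _ in range(EPOCHS):
    for x in data:
        z = encode(x)
        xhat = decode(z)
        err = [xhat[i] - x[i] for i in range(IN)]  # grad of 0.5*SSE w.r.t. xhat
        for i in range(IN):                        # decoder SGD step
            for h in range(HID):
                W2[i][h] -= LR * err[i] * z[h]
        for h in range(HID):                       # encoder SGD step (chain rule)
            g = sum(err[i] * W2[i][h] for i in range(IN))
            for i in range(IN):
                W1[h][i] -= LR * g * x[i]

mse = sum(sum((u - v) ** 2 for u, v in zip(x, decode(encode(x))))
          for x in data) / len(data)
print(f"reconstruction MSE: {mse:.4f}")
```

The other variants in the guide change the recipe rather than the structure: sparse and contractive autoencoders add a penalty term to the loss, denoising autoencoders corrupt the input before encoding, and variational autoencoders make the code probabilistic.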
Graph ML in 2022: Where Are We Now?
An article showcasing the trends and major advancements in Graph ML.
Get Started on NVIDIA Triton with an Introductory Course from NVIDIA DLI
To get hands-on practice with a live server, the NVIDIA Deep Learning Institute (DLI) is offering a 4-hour, self-paced course titled Deploying a Model for Inference at Production Scale.
Libraries & Code
coqui-ai/STT: The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
Coqui STT is a fast, open-source, multi-platform, deep learning toolkit for training and deploying speech-to-text models.
Deep Learning Interviews (the Amazon softcover is printed in B&W) is home to hundreds of fully solved problems from a wide range of key topics in AI.
The general-purpose GPU compute framework for cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends).
Papers & Publications
Detecting Twenty-thousand Classes using Image-level Supervision
Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic reaches 41.7 mAP for all classes and 41.7 mAP for rare classes. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning.
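The core trick described in the abstract is simple: when an image carries only an image-level label and no boxes, apply the classification loss to the proposal with the largest area, rather than trying to assign the label to boxes via model predictions. A hedged, pure-Python sketch of that loss (function names and shapes are illustrative, not the paper's implementation):

```python
import math

def box_area(box):
    """Area of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def image_label_loss(proposals, logits_per_proposal, image_label):
    """Cross-entropy on the classifier logits of the max-size proposal.

    proposals: list of (x1, y1, x2, y2) region proposals for one image.
    logits_per_proposal: per-proposal classification logits.
    image_label: the single image-level class index (no box annotation).
    """
    # Supervise only the largest proposal -- no label-to-box assignment needed.
    idx = max(range(len(proposals)), key=lambda i: box_area(proposals[i]))
    probs = softmax(logits_per_proposal[idx])
    return -math.log(probs[image_label])

# Example: the larger second proposal (area 2000 vs 100) receives the loss.
loss = image_label_loss([(0, 0, 10, 10), (0, 0, 50, 40)],
                        [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
                        image_label=1)
print(f"{loss:.3f}")
```

Because the supervised proposal is chosen by geometry alone, this loss drops into any detection architecture without changing how box-annotated images are trained.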
Contrastive Fine-grained Class Clustering via Generative Adversarial Networks
Unsupervised fine-grained class clustering is a practical yet challenging task due to the difficulty of learning feature representations that capture subtle object details. We introduce C3-GAN, a method that leverages the categorical inference power of InfoGAN by applying contrastive learning. We aim to learn feature representations that encourage the data to form distinct cluster boundaries in the embedding space, while also maximizing the mutual information between the latent code and its observation. Our approach is to train the discriminator, which is used for inferring clusters, to optimize the contrastive loss, where the image-latent pairs that maximize the mutual information are treated as positive pairs and the rest as negative pairs. Specifically, we map the input of the generator, which is sampled from the categorical distribution, to the embedding space of the discriminator and let it act as a cluster centroid. In this way, C3-GAN learns a clustering-friendly embedding space in which each cluster is distinctly separable. Experimental results show that C3-GAN achieves state-of-the-art clustering performance on four fine-grained benchmark datasets, while also alleviating the mode collapse phenomenon.
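The contrastive objective the abstract describes is InfoNCE-shaped: each image embedding should score highly against its own latent code (the positive pair) and low against every other latent in the batch (the negatives). A minimal sketch under those assumptions (function and variable names are illustrative, not C3-GAN's actual code; the paper's embeddings come from the discriminator):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(image_embeddings, latent_embeddings, temperature=0.1):
    """Mean cross-entropy of matching image i to latent i among all latents.

    Positive pair: (image i, latent i). Negatives: (image i, latent j != i).
    Lower loss means each image is closest to its own latent code.
    """
    n = len(image_embeddings)
    loss = 0.0
    for i, img in enumerate(image_embeddings):
        sims = [dot(img, z) / temperature for z in latent_embeddings]
        m = max(sims)  # stabilized log-sum-exp over all candidates
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += -(sims[i] - log_denom)
    return loss / n

# Aligned pairs give a much smaller loss than mismatched ones.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing this loss pulls each image toward its latent code in the embedding space, which is how the latents come to behave as cluster centroids.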