Deep Learning Weekly: Issue #262
Meta's 175B parameter chatbot now publicly available, Chip Huyen's introduction to streaming for data scientists, neural networks for keyword spotting using nnAudio and PyTorch, and more
Hey Folks,
This week in deep learning, we bring you Meta's 175B parameter chatbot now publicly available, Chip Huyen's introduction to streaming for data scientists, neural networks for keyword spotting using nnAudio and PyTorch, and a paper on a neural architect for immersive 3D scene generation.
You may also enjoy the Google Universal Image Embedding Challenge, Microsoft's MLOps maturity model, a tutorial on using depthwise separable convolutions in TensorFlow, a paper on personalizing text-to-image generation using textual inversion, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Meta AI has built and released BlenderBot 3, the first 175B-parameter, publicly available chatbot complete with model weights, code, datasets, and model cards.
Artificial Synapses 10,000x Faster Than Real Thing
MIT researchers developed new protonic programmable resistors that may help speed learning in deep neural networks.
Introducing the Google Universal Image Embedding Challenge
Google AI announces the Google Universal Image Embedding Challenge, where participants are asked to build a single universal image embedding model capable of representing objects from multiple domains at the instance level.
Introducing the Private Hub: A New Way to Build With Machine Learning
Hugging Face launches the Private Hub, a unified set of tools to accelerate each step of the machine learning lifecycle in a secure and compliant way.
MLOps
Machine Learning operations maturity model
Microsoft's MLOps maturity model, which maps the stages of continuous improvement in building and operating a production-level machine learning application environment.
Introduction to streaming for data scientists
A comprehensive introductory blog on streaming data by Chip Huyen.
Eight Considerations When Choosing a Data Store for Data Science
An article discussing the considerations when identifying the data store that will support data science at scale across your enterprise.
Organizing machine learning projects: project management guidelines
A document that provides a common framework for approaching machine learning projects that can be referenced by practitioners.
Learning
Advanced Natural Language Processing in Google Sheets
A post that shows you how to supercharge your spreadsheet with advanced text analysis and natural language processing by integrating Cohere's Language Models.
Feature Selection using Wrapper Method
A comprehensive guide to Feature Selection using Wrapper methods in Python.
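For illustration, here is a minimal sketch of one wrapper method, greedy forward selection, using scikit-learn's SequentialFeatureSelector; the guide may use a different library, and the estimator, dataset, and feature count below are arbitrary choices.

```python
# A minimal sketch of wrapper-style feature selection in scikit-learn.
# Forward selection wraps an estimator and greedily adds the feature that
# most improves cross-validated performance at each step.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# The wrapped estimator; any classifier or regressor with fit/predict works.
estimator = LogisticRegression(max_iter=5000)

# Greedy forward selection of 10 features, scored by 5-fold cross-validation.
selector = SequentialFeatureSelector(
    estimator, n_features_to_select=10, direction="forward", cv=5
)
selector.fit(X, y)

print("Selected feature indices:", selector.get_support(indices=True))
X_reduced = selector.transform(X)  # keep only the selected columns
```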
Build a deep neural network for the keyword spotting (KWS) task with nnAudio GPU audio processing
A tutorial on how to build a neural network in PyTorch by feeding it audio files directly, converting them on the fly into fine-tunable spectrograms.
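As a rough sketch of the idea (not the tutorial's exact code), the model below uses an nnAudio Mel spectrogram as its first layer so the spectrogram is computed on the GPU and can be fine-tuned; recent nnAudio releases expose the spectrogram classes under nnAudio.features, and the hyperparameters and layer sizes here are illustrative.

```python
# A minimal keyword-spotting model whose front end is an nnAudio Mel
# spectrogram; trainable_mel/trainable_STFT make the filter banks
# fine-tunable along with the rest of the network.
import torch
import torch.nn as nn
from nnAudio import features

class KeywordSpotter(nn.Module):
    def __init__(self, n_classes=35, sample_rate=16000, n_mels=40):
        super().__init__()
        self.mel = features.MelSpectrogram(
            sr=sample_rate, n_fft=512, hop_length=160, n_mels=n_mels,
            trainable_mel=True, trainable_STFT=True,
        )
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, waveform):       # waveform: (batch, num_samples)
        spec = self.mel(waveform)      # (batch, n_mels, time)
        spec = spec.unsqueeze(1)       # add a channel dim for Conv2d
        feats = self.conv(spec).flatten(1)
        return self.classifier(feats)

model = KeywordSpotter()
logits = model(torch.randn(8, 16000))  # a batch of 1-second clips
```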
Using Depthwise Separable Convolutions in TensorFlow
A tutorial on what depthwise separable convolutions are and how we can use them to speed up our convolutional neural network image models.
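A minimal sketch of the contrast the tutorial draws, using tf.keras layers; the input shape and filter counts are arbitrary.

```python
# A standard convolution versus its depthwise separable counterpart in
# tf.keras; the separable version factorizes spatial filtering
# (DepthwiseConv2D) and channel mixing (1x1 Conv2D), cutting parameters.
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 32))

# Standard convolution: 3*3*32*64 = 18,432 weights (plus biases).
standard = tf.keras.layers.Conv2D(64, 3, padding="same")(inputs)

# Depthwise separable convolution: 3*3*32 + 32*64 = 2,336 weights (plus
# biases), done in one layer...
separable = tf.keras.layers.SeparableConv2D(64, 3, padding="same")(inputs)

# ...or explicitly as a depthwise step followed by a pointwise (1x1) step.
x = tf.keras.layers.DepthwiseConv2D(3, padding="same")(inputs)
pointwise = tf.keras.layers.Conv2D(64, 1, padding="same")(x)

model = tf.keras.Model(inputs, [standard, separable, pointwise])
model.summary()  # compare the per-layer parameter counts
```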
Active Learning: Strategies, Tools, and Real-World Use Cases
A comprehensive article on the what, why, and how of active learning.
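To make one common strategy concrete, here is a small uncertainty-sampling loop with scikit-learn; the dataset, model, and query size are illustrative, not drawn from the article.

```python
# A minimal sketch of uncertainty sampling: train on the labeled pool, then
# query the unlabeled examples the model is least confident about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

labeled = np.arange(50)              # small initial labeled pool
unlabeled = np.arange(50, len(X))    # everything else is "unlabeled"

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

    # Uncertainty = 1 - max predicted class probability.
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1 - proba.max(axis=1)

    # Query the 10 most uncertain points and "ask the oracle" for labels.
    query = unlabeled[np.argsort(uncertainty)[-10:]]
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)

    print(f"round {round_}: {len(labeled)} labeled examples")
```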
Libraries & Code
NVIDIA-Merlin/NVTabular
A feature engineering and preprocessing library for tabular data that is designed to easily manipulate terabyte-scale datasets and train deep learning (DL) based recommender systems.
MATLAB Deep Learning Model Hub
A collection of pre-trained models for deep learning in MATLAB.
mosaicml/composer: Train neural networks up to 7x faster
Composer is a library written in PyTorch that enables you to train neural networks faster, at lower cost, and to higher accuracy.
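A minimal sketch of what a Composer training run can look like, assuming a recent Composer release; the network, dataloader, and choice of speed-up algorithm below are illustrative, not taken from the repository's examples.

```python
# Composer wraps an ordinary PyTorch model and applies speed-up algorithms
# during training via its Trainer.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from composer import Trainer
from composer.models import ComposerClassifier
from composer.algorithms import LabelSmoothing

train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=128, shuffle=True,
)

# Any torch.nn.Module can be wrapped into a ComposerClassifier.
net = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(28 * 28, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

trainer = Trainer(
    model=ComposerClassifier(net, num_classes=10),
    train_dataloader=train_loader,
    max_duration="2ep",                          # train for two epochs
    algorithms=[LabelSmoothing(smoothing=0.1)],  # one of Composer's methods
)
trainer.fit()
```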
Papers & Publications
Video Question Answering with Iterative Video-Text Co-Tokenization
Abstract:
Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
GAUDI: A Neural Architect for Immersive 3D Scene Generation
Abstract:
We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables like sparse image observations or text that describes the scene.
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Abstract:
Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.
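A conceptual sketch of the core optimization (not the authors' code): add a placeholder token to a frozen text encoder and optimize only its embedding row. The loss below is a stand-in so the loop runs; in the actual method it is the frozen diffusion model's denoising loss on the 3-5 concept images.

```python
# Textual-inversion-style setup: learn one new "word" embedding while the
# rest of the text encoder stays frozen.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the placeholder token and give it a new embedding row.
tokenizer.add_tokens(["<my-concept>"])
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_id = tokenizer.convert_tokens_to_ids("<my-concept>")

# Freeze everything; only the embedding table will receive gradients.
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-3, weight_decay=0.0)

prompt = "a photo of <my-concept>"
inputs = tokenizer(prompt, return_tensors="pt")

for step in range(50):
    text_features = text_encoder(**inputs).last_hidden_state
    # Stand-in loss; the real method uses the frozen text-to-image model's
    # reconstruction (denoising) loss on the user-provided images.
    loss = text_features.pow(2).mean()
    loss.backward()
    # Zero the gradients of every row except the new token's, so only
    # <my-concept>'s embedding actually changes.
    grad = embeddings.weight.grad
    mask = torch.zeros_like(grad)
    mask[new_token_id] = 1.0
    grad.mul_(mask)
    optimizer.step()
    optimizer.zero_grad()
```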