Deep Learning Weekly: Issue #292
Meta AI's Multilingual Audio-Visual Corpus, GPT-4, Jensen-Shannon Divergence in Drift Monitoring, CMU's Tutorial on MultiModal Machine Learning, a paper on Visual ChatGPT, and many more.
Hey Folks,
This week in deep learning, we bring you Meta AI's Multilingual Audio-Visual Corpus, Jensen-Shannon Divergence in Drift Monitoring, CMU's Tutorial on MultiModal Machine Learning, and a paper on Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models.
You may also enjoy Generative AI in the Fashion Industry, Data Labeling & Annotation Guidelines, Multivariate Probabilistic Time Series Forecasting with Informer, a paper on Prismer: A Vision-Language Model with An Ensemble of Experts, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
OpenAI releases GPT-4, a multimodal AI that it claims is state-of-the-art
GPT-4 was just released to paying users via ChatGPT Plus, and developers can sign up on a waitlist to access the API.
New method accelerates data retrieval in huge databases
Researchers used machine learning to build faster and more efficient hash functions, which are a key component of databases.
Generative AI in fashion | McKinsey
An article that outlines some of the most promising use cases of generative AI in the fashion industry.
Microsoft’s Bing reaches 100 million daily active users
Microsoft’s Bing search engine has reached 100 million daily active users, thanks to its AI-powered features such as Bing Chat.
Microsoft lays off an ethical AI team as it doubles down on OpenAI
Microsoft laid off an entire team dedicated to guiding AI innovation that leads to ethical, responsible, and sustainable outcomes.
Optical Algorithm Simplifies Analog AI Training
The article is about how a new optical algorithm can speed up and simplify the training of analog AI devices that use light instead of electricity to perform computations.
MuAViC: The first audio-video speech translation benchmark
Meta AI is releasing MuAViC (Multilingual Audio-Visual Corpus), the first benchmark that makes it possible to use audio-visual learning for highly accurate speech translation.
MLOps
Dataset Tracking with Comet ML Artifacts
An article that shows how to version datasets and keep track of dataset changes with Comet ML’s Artifacts.
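For a concrete picture of the workflow, here is a minimal, hypothetical sketch of logging a dataset version with Comet's Artifact API; the project name and file path are placeholders, and a configured COMET_API_KEY is assumed.

```python
# Minimal sketch of logging a dataset version with Comet Artifacts.
# Assumes COMET_API_KEY is configured and "train.csv" exists locally;
# the project name is a placeholder.
from comet_ml import Experiment, Artifact

experiment = Experiment(project_name="dataset-tracking-demo")

# Create an artifact and attach the file(s) that make up this dataset version
artifact = Artifact(name="training-data", artifact_type="dataset")
artifact.add("train.csv")

# Logging the artifact records a new version that later experiments can consume
experiment.log_artifact(artifact)
experiment.end()
```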
How to Understand and Use the Jensen-Shannon Divergence
A primer on the math, logic, and pragmatic application of JS Divergence — including how it is best used in drift monitoring.
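As a quick illustration of the idea (not taken from the article), the sketch below computes the JS divergence between a reference feature distribution and incoming production data using NumPy and SciPy; the bin count, simulated drift, and function name are illustrative choices.

```python
# Minimal sketch: Jensen-Shannon divergence between a reference and a
# production feature distribution, used as a drift signal. The binning
# scheme and simulated data are illustrative, not prescriptions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(reference, production, bins=20):
    # Shared bin edges so both histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([reference, production]), bins=bins)
    p, _ = np.histogram(reference, bins=edges, density=True)
    q, _ = np.histogram(production, bins=edges, density=True)
    # SciPy returns the JS *distance* (the square root of the divergence)
    return jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)
production = rng.normal(0.5, 1.2, 10_000)   # simulated drifted traffic

score = js_divergence(reference, production)
print(f"JS divergence: {score:.3f}")        # bounded in [0, 1] with base 2
```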
How to Write Data Labeling & Annotation Guidelines
Eugene Yan’s article about writing data labeling and annotation guidelines, covering aspects such as motivation, definition, examples, edge cases and quality assurance.
Training large language models on Amazon SageMaker: Best practices
The article is about how to train large language models (LLMs) on Amazon SageMaker using best practices such as data parallelism, model parallelism, mixed precision training, gradient accumulation, and checkpointing.
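A rough, hypothetical sketch of launching such a job with the SageMaker Python SDK is below; the training script, instance type, framework versions, hyperparameters, and S3 path are placeholders rather than the article's exact setup.

```python
# Hypothetical sketch of a distributed Hugging Face training job on SageMaker.
# Entry script, instance choice, versions, and S3 paths are placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # assumes a SageMaker notebook/Studio session

estimator = HuggingFace(
    entry_point="train.py",              # your training script (placeholder)
    source_dir="./scripts",
    role=role,
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    transformers_version="4.26",         # example framework versions
    pytorch_version="1.13",
    py_version="py39",
    # SageMaker's data-parallel library shards batches across GPUs and nodes
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters={
        "epochs": 1,
        "per_device_train_batch_size": 4,
        "gradient_accumulation_steps": 8,  # larger effective batch without extra memory
        "fp16": True,                      # mixed-precision training
    },
)

estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 path
```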
Learning
Tutorial on MultiModal Machine Learning
A tutorial on Multimodal Machine Learning organized by the CMU MultiComp Lab that covers topics such as multimodal representation learning, multimodal fusion, multimodal alignment, and multimodal applications.
Multivariate Probabilistic Time Series Forecasting with Informer
An article about how to use Informer, a state-of-the-art Transformer model for long sequence time-series forecasting, with Hugging Face Transformers.
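As a minimal, hedged sketch of the Hugging Face entry points involved (the sizes below are illustrative placeholders, not the article's configuration):

```python
# Minimal sketch: instantiating Informer for multivariate forecasting with
# Hugging Face Transformers. The dimensions below are illustrative only.
from transformers import InformerConfig, InformerForPrediction

config = InformerConfig(
    input_size=7,          # number of variates in the multivariate series
    prediction_length=24,  # forecast horizon
    context_length=48,     # history window fed to the encoder
)
model = InformerForPrediction(config)

# Training batches supply past_values, past_time_features, past_observed_mask,
# and future_values; the forward pass returns a negative log-likelihood loss,
# and model.generate(...) draws probabilistic forecast samples.
```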
Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU
A post on how to fine-tune a 20B language model with reinforcement learning using Hugging Face Transformers (TRL) and parameter-efficient fine-tuning (PEFT).
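To make the recipe more tangible, here is a rough sketch under assumed defaults: the checkpoint name, LoRA settings, and constant reward are stand-ins, and running it requires a suitable GPU with bitsandbytes installed.

```python
# Rough sketch of the TRL + PEFT recipe: load a causal LM in 8-bit, attach
# LoRA adapters and a value head, then optimize with PPO. Model name, LoRA
# settings, and the reward value are illustrative stand-ins.
import torch
from transformers import AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "EleutherAI/gpt-neox-20b"   # example 20B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Low-rank adapters: only a small fraction of the weights is trained
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# 8-bit base weights + LoRA adapters + a value head for PPO
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto", peft_config=lora_config
)

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, tokenizer=tokenizer)

# One illustrative PPO step: generate a response and reinforce it with a reward
query = tokenizer("The movie was", return_tensors="pt").input_ids[0].to(model.pretrained_model.device)
response = ppo_trainer.generate(query, return_prompt=False, max_new_tokens=16)[0]
reward = torch.tensor(1.0)               # stand-in for a reward model score
ppo_trainer.step([query], [response], [reward])
```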
Detecting Obstacles and Drivable Free Space with RadarNet
An introduction to RadarNet, a novel method for detecting obstacles and drivable free space using radar data for advanced driver assistance systems and autonomous vehicles.
Libraries & Code
A Python library that allows you to version, export, save and download machine learning models in your choice of storage.
A Python library designed to make data analysis, monitoring, and sensitive data detection easy.
Python implementations of machine learning helper functions for quantitative finance, based on the books Advances in Financial Machine Learning and Machine Learning for Asset Managers.
Papers & Publications
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Abstract:
ChatGPT is attracting cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, although showing great visual understanding and generation capabilities, are only experts on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only language but also images; 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps; and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models.
Prismer: A Vision-Language Model with An Ensemble of Experts
Abstract:
Recent vision-language models have shown impressive multi-modal generation capabilities. However, they typically require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from readily-available, pre-trained domain experts, and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-art models, whilst requiring up to two orders of magnitude less training data.
PaLM-E: An Embodied Multimodal Language Model
Abstract:
Large language models have been demonstrated to perform complex tasks. However, enabling general inference in the real world, e.g. for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks, including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.