Deep Learning Weekly: Issue #287
Google's LaMDA-based conversational AI called Bard, training models on streaming data, a Dive into Vision-Language Models, a paper on Multimodal Chain-of-Thought Reasoning in Language Models, and more
This week in deep learning, we bring you Google's LaMDA-based conversational AI called Bard, training models on streaming data, a Dive into Vision-Language Models, and a paper on Multimodal Chain-of-Thought Reasoning in Language Models.
You may also enjoy Quora's AI chatbot app called Poe, ML Model Deployment Checklist, Solving the Cold-Start Problem using Two-Tower Neural Networks, a paper on Dual PatchNorm, and more.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Quora opens its new AI chatbot app Poe to the general public
Quora has opened up public access to its new AI chatbot app, Poe, which lets users ask questions and get answers from a range of AI chatbots, including those from OpenAI and Anthropic.
Google’s Bard and new AI features in Search
Google unveils Bard, an experimental conversational AI service powered by Language Model for Dialogue Applications (LaMDA).
Machines Learn Better if We Teach Them the Basics
A wave of research improves reinforcement learning algorithms by pre-training them as if they were human.
Vietnam’s VinBrain Deploys Healthcare AI Models to 100+ Hospitals
VinBrain’s DrAid, an AI-powered software for automated X-ray diagnostics, is deployed in more than 100 hospitals in Vietnam, Myanmar, New Zealand, and the U.S.
AI Joins Hunt for ET: Study Finds 8 Potential Alien Signals
Researchers have developed an AI system that outperforms traditional methods in the search for alien signals.
Automating the math for decision-making under uncertainty
MIT researchers developed ADEV, which extends automatic differentiation to handle models that make random choices.
Detecting Data Drift with Machine Learning
An article that covers the three types of data drift and how to detect them.
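As a flavor of what such detection can look like in practice, here is a minimal sketch that flags drift in a single numeric feature using a hand-rolled two-sample Kolmogorov-Smirnov statistic. The `drift_detected` helper, its threshold, and the simulated data are illustrative assumptions, not taken from the article.

```python
import numpy as np

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of the two samples."""
    grid = np.sort(np.concatenate([reference, current]))
    cdf_ref = np.searchsorted(np.sort(reference), grid, side="right") / len(reference)
    cdf_cur = np.searchsorted(np.sort(current), grid, side="right") / len(current)
    return np.max(np.abs(cdf_ref - cdf_cur))

def drift_detected(reference, current, threshold=0.1):
    """Flag drift in one feature when the KS statistic exceeds a
    (hypothetical) threshold."""
    return ks_statistic(np.asarray(reference), np.asarray(current)) > threshold

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=5000)   # training-time distribution
same = rng.normal(0.0, 1.0, size=5000)       # same distribution: no drift
shifted = rng.normal(1.0, 1.0, size=5000)    # mean shifted: drift

print(drift_detected(baseline, same))
print(drift_detected(baseline, shifted))
```

In production one would typically run a test like this per feature on a sliding window of live data, with thresholds calibrated to an acceptable false-alarm rate.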
Training Models on Streaming Data [Practical Guide]
A practical guide that covers the basics of streaming data, its relevance, and a hands-on training example.
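As a minimal illustration of the idea, the sketch below fits a linear model incrementally on simulated mini-batches as they arrive, discarding each batch after its update. The stream generator, learning rate, and true weights are invented for the example and are not from the guide.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])

def stream_batches(n_batches, batch_size=32):
    # Simulated data stream: each mini-batch arrives once, then is gone.
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        y = X @ true_w + rng.normal(scale=0.1, size=batch_size)
        yield X, y

w = np.zeros(3)
lr = 0.05
for X, y in stream_batches(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # mini-batch gradient of the MSE
    w -= lr * grad                          # incremental update; no data stored

print(np.round(w, 2))  # close to true_w after consuming the stream
```

The key property is that memory stays constant no matter how long the stream runs, which is what distinguishes streaming training from batch retraining.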
ML Model Deployment Checklist
An efficient, to-the-point, and easy-to-use checklist to follow when deploying an ML model into production.
First Step to Object Detection Algorithms
An informative blog post that explains the workings of different object detection algorithms and compares them with one another.
Solving the Cold-Start Problem using Two-Tower Neural Networks for NVIDIA’s E-Mail Recommender Systems
A blog post that covers the training of a two-tower neural network to solve the cold start problem of NVIDIA’s Email recommender systems.
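The core idea, separate user and item encoders whose embeddings meet only at a dot product, is what lets brand-new items be scored from content features alone. The sketch below shows that structure with tiny NumPy towers; the dimensions and architecture are hypothetical, not NVIDIA's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower_init(in_dim, hidden, out_dim):
    # Hypothetical tower: one ReLU hidden layer, then a linear projection.
    return {
        "w1": rng.normal(0, 0.1, (in_dim, hidden)),
        "w2": rng.normal(0, 0.1, (hidden, out_dim)),
    }

def tower_forward(params, x):
    h = np.maximum(x @ params["w1"], 0.0)  # ReLU hidden layer
    z = h @ params["w2"]
    # L2-normalize so scores are cosine similarities in [-1, 1].
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)

user_tower = tower_init(16, 32, 8)  # user features -> 8-d embedding
item_tower = tower_init(24, 32, 8)  # item *content* features -> 8-d embedding

users = rng.normal(size=(4, 16))
items = rng.normal(size=(5, 24))  # can include items with zero interaction history

scores = tower_forward(user_tower, users) @ tower_forward(item_tower, items).T
print(scores.shape)  # one affinity score per user-item pair
```

Because the item tower consumes only content features, a cold-start item gets an embedding, and therefore scores, the moment it exists, with no interaction data required.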
The Illustrated Stable Diffusion – Jay Alammar
An article that visually introduces Stable Diffusion and its inner workings.
A Dive into Vision-Language Models
An article that discusses the learning strategies, datasets, emerging areas of research, and other things related to Vision-Language Models.
Life vs. ImageNet: What I wish I had known before deploying computer vision to the real world
A webinar discussing the latest innovations in Computer Vision projects.
Libraries & Code
OSS Vizier is a Python-based service for black-box optimization and research, based on Google Vizier, one of the first hyperparameter tuning services designed to work at scale.
An open-source Python library for scalable Bayesian optimisation.
A Python library for audio data augmentation.
Papers & Publications
Multimodal Chain-of-Thought Reasoning in Language Models
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies are mostly isolated in the language modality with LLMs, where LLMs are hard to deploy. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16% (75.17% → 91.68%) on the ScienceQA benchmark and even surpasses human performance.
Dual PatchNorm
We propose Dual PatchNorm: two Layer Normalization layers (LayerNorms), placed before and after the patch embedding layer in Vision Transformers. We demonstrate that Dual PatchNorm outperforms the result of an exhaustive search over alternative LayerNorm placement strategies in the Transformer block itself. In our experiments, incorporating this trivial modification often leads to improved accuracy over well-tuned Vision Transformers and never hurts.
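The modification is small enough to sketch directly. Assuming a standard non-overlapping patchify step, a minimal NumPy version (without the learned LayerNorm scale and shift) might look like this:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last axis; learned scale/shift omitted in this sketch.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dual_patchnorm_embed(images, patch, w_embed):
    """Patchify, then apply LayerNorm before AND after the linear
    patch-embedding projection (the paper's proposed change)."""
    b, h, w, c = images.shape
    p = patch
    # (B, H, W, C) -> (B, num_patches, p*p*C)
    x = images.reshape(b, h // p, p, w // p, p, c).transpose(0, 1, 3, 2, 4, 5)
    x = x.reshape(b, (h // p) * (w // p), p * p * c)
    x = layer_norm(x)       # LN before the embedding layer
    x = x @ w_embed         # linear patch embedding
    return layer_norm(x)    # LN after the embedding layer

rng = np.random.default_rng(0)
imgs = rng.normal(size=(2, 32, 32, 3))           # 32x32 RGB, batch of 2
w = rng.normal(0, 0.02, (8 * 8 * 3, 64))          # 8x8 patches -> 64-d tokens
tokens = dual_patchnorm_embed(imgs, 8, w)
print(tokens.shape)  # (2, 16, 64)
```

The rest of the Vision Transformer is unchanged, which is what makes the modification cheap to try.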
RangeAugment: Efficient Online Augmentation with Range Learning
State-of-the-art automatic augmentation methods (e.g., AutoAugment and RandAugment) for visual recognition tasks diversify training data using a large set of augmentation operations. The range of magnitudes of many augmentation operations (e.g., brightness and contrast) is continuous. Therefore, to make search computationally tractable, these methods use fixed and manually-defined magnitude ranges for each operation, which may lead to sub-optimal policies. To answer the open question on the importance of magnitude ranges for each augmentation operation, we introduce RangeAugment, which allows us to efficiently learn the range of magnitudes for individual as well as composite augmentation operations. RangeAugment uses an auxiliary loss based on image similarity as a measure to control the range of magnitudes of augmentation operations. As a result, RangeAugment has a single scalar parameter for search, image similarity, which we simply optimize via linear search. RangeAugment integrates seamlessly with any model and learns model- and task-specific augmentation policies. With extensive experiments on the ImageNet dataset across different networks, we show that RangeAugment achieves competitive performance to state-of-the-art automatic augmentation methods with 4-5 times fewer augmentation operations. Experimental results on semantic segmentation, object detection, foundation models, and knowledge distillation further show RangeAugment's effectiveness.
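To make the mechanism concrete, here is a heavily simplified sketch: a single brightness operation whose upper magnitude bound is nudged until augmented images reach a target similarity to the originals. Negative MSE stands in for the paper's PSNR-based similarity, and the constants and update rule are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def brightness(img, m):
    # Brightness augmentation: scale pixel values by magnitude m.
    return np.clip(img * m, 0.0, 1.0)

def similarity(a, b):
    # Negative MSE as a stand-in image-similarity measure (the paper uses PSNR).
    return -np.mean((a - b) ** 2)

target_sim = -0.01  # the single scalar search parameter: desired similarity
hi = 1.05           # learnable upper bound of the magnitude range (lower bound fixed at 1.0)
lr = 1.0

for _ in range(300):
    img = rng.random((8, 8, 3))
    m = rng.uniform(1.0, hi)  # sample a magnitude from the current range
    err = similarity(img, brightness(img, m)) - target_sim
    # Auxiliary update: widen the range while augmented images are still too
    # similar to the originals, shrink it once they become too distorted.
    hi = max(1.0, hi + lr * err)

print(round(hi, 2))
```

The point of the sketch is the feedback loop: the magnitude range is not hand-fixed but driven by a similarity target, which is the single quantity one searches over.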