Deep Learning Weekly: Issue #287
Google's LaMDA-based conversational AI called Bard, training models on streaming data, a Dive into Vision-Language Models, a paper on Multimodal Chain-of-Thought Reasoning in Language Models, and more
This week in deep learning, we bring you Google's LaMDA-based conversational AI called Bard, training models on streaming data, a Dive into Vision-Language Models, and a paper on Multimodal Chain-of-Thought Reasoning in Language Models.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Quora has opened up public access to its new AI chatbot app, Poe, which lets users ask questions and get answers from a range of AI chatbots, including those from OpenAI and Anthropic.
Google unveils Bard, an experimental conversational AI service powered by Language Model for Dialogue Applications (LaMDA).
A wave of research improves reinforcement learning algorithms by pre-training them as if they were human.
VinBrain’s DrAid, an AI-powered software for automated X-ray diagnostics, is deployed in more than 100 hospitals in Vietnam, Myanmar, New Zealand, and the U.S.
Researchers have developed an AI system that outperforms traditional methods in the search for alien signals.
MIT researchers developed ADEV, which extends automatic differentiation to handle models that make random choices.
An article that covers the three types of data drift and how to detect them.
A practical guide that covers the basics of streaming data, its relevance, and a hands-on training example.
An efficient, to-the-point, and easy-to-use checklist to follow when deploying an ML model into production.
An informative blog post that explains how different object detection algorithms work and compares them with similar algorithms.
A blog post that covers the training of a two-tower neural network to solve the cold start problem of NVIDIA’s email recommender systems.
An article that visually introduces Stable Diffusion and its inner workings.
An article that discusses the learning strategies, datasets, and emerging areas of research related to Vision-Language Models.
A webinar discussing the latest innovations in Computer Vision projects.
Libraries & Code
OSS Vizier is a Python-based service for black-box optimization and research, based on Google Vizier, one of the first hyperparameter tuning services designed to work at scale.
An open-source Python library for scalable Bayesian optimisation.
A Python library for audio data augmentation.
Papers & Publications
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies are mostly isolated in the language modality with LLMs, where LLMs are hard to deploy. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17% to 91.68%) on the ScienceQA benchmark and even surpasses human performance.
We propose Dual PatchNorm: two Layer Normalization layers (LayerNorms), before and after the patch embedding layer in Vision Transformers. We demonstrate that Dual PatchNorm outperforms the result of exhaustive search for alternative LayerNorm placement strategies in the Transformer block itself. In our experiments, incorporating this trivial modification often leads to improved accuracy over well-tuned Vision Transformers and never hurts.
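For readers who want to see where the two LayerNorms in Dual PatchNorm sit, here is a minimal NumPy sketch of the idea as the abstract describes it. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`layer_norm`, `patchify`, `dual_patchnorm_embed`), the image size, patch size, and embedding width are all hypothetical, and the learnable scale and shift that LayerNorm normally carries are left out for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance.

    Assumption: the learnable scale/shift of a full LayerNorm is omitted.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    x = image[: rows * patch, : cols * patch]
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    x = x.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(rows * cols, patch * patch * c)


def dual_patchnorm_embed(image, weight, bias, patch):
    """LayerNorm -> linear patch embedding -> LayerNorm."""
    patches = patchify(image, patch)
    patches = layer_norm(patches)      # first LayerNorm: before the embedding
    tokens = patches @ weight + bias   # linear patch embedding
    return layer_norm(tokens)          # second LayerNorm: after the embedding


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
patch, dim = 8, 24
image = rng.normal(size=(32, 32, 3))
weight = rng.normal(scale=0.02, size=(patch * patch * 3, dim))
bias = np.zeros(dim)

tokens = dual_patchnorm_embed(image, weight, bias, patch)
print(tokens.shape)  # (16, 24): 4 x 4 patches, each embedded into 24 dims
```

The resulting tokens would then go into the usual Transformer blocks unchanged, which is why the abstract can describe the change as a small modification to the patch embedding stage only.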
State-of-the-art automatic augmentation methods (e.g., AutoAugment and RandAugment) for visual recognition tasks diversify training data using a large set of augmentation operations. The range of magnitudes of many augmentation operations (e.g., brightness and contrast) is continuous. Therefore, to make search computationally tractable, these methods use fixed and manually-defined magnitude ranges for each operation, which may lead to sub-optimal policies. To answer the open question on the importance of magnitude ranges for each augmentation operation, we introduce RangeAugment that allows us to efficiently learn the range of magnitudes for individual as well as composite augmentation operations. RangeAugment uses an auxiliary loss based on image similarity as a measure to control the range of magnitudes of augmentation operations. As a result, RangeAugment has a single scalar parameter for search, image similarity, which we simply optimize via linear search. RangeAugment integrates seamlessly with any model and learns model- and task-specific augmentation policies. With extensive experiments on the ImageNet dataset across different networks, we show that RangeAugment achieves competitive performance to state-of-the-art automatic augmentation methods with 4-5 times fewer augmentation operations. Experimental results on semantic segmentation, object detection, foundation models, and knowledge distillation further show RangeAugment's effectiveness.