Deep Learning Weekly: Issue #320
DALL-E 3, SAM + Stable Diffusion for Text-to-Image Inpainting, Inside the Matrix: Visualizing Matrix Multiplication, a paper on NExT-GPT: Any-to-Any Multimodal LLM, and many more!
This week in deep learning, we bring you DALL-E 3, SAM + Stable Diffusion for Text-to-Image Inpainting, Inside the Matrix: Visualizing Matrix Multiplication, Attention and Beyond, and a paper on NExT-GPT: Any-to-Any Multimodal LLM.
You may also enjoy Amazon steps up AI race with up to $4 billion deal to invest in Anthropic, 10 Ways to Improve the Performance of Retrieval Augmented Generation Systems, LLM Training: RLHF and Its Alternatives, a paper on LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
DALL-E 3: Art Generator Powered by ChatGPT
OpenAI officially announced DALL·E 3, a modern text-to-image system built natively on ChatGPT to generate artwork by simply talking to the chatbot.
A catalogue of genetic mutations to help pinpoint the cause of diseases
DeepMind releases the AlphaMissense catalogue, built with AlphaMissense, a new AI model that classifies 71 million missense genetic mutations.
Amazon steps up AI race with up to $4 billion deal to invest in Anthropic
Amazon will invest up to $4 billion in the high-profile startup Anthropic, in its effort to compete with growing cloud rivals on artificial intelligence.
Third-party AI tools pose increasing risks for organizations
A new report by MIT Sloan and Boston Consulting Group about the increasing risks that come with using third-party AI tools.
Effective Small Language Models: Microsoft’s 1.3 Billion Parameter phi-1.5
Microsoft introduces a small language model called phi-1.5, which outperformed models such as Llama 2 7B on several benchmarks.
How an archeological approach can help leverage biased data in AI to improve medicine
In a new paper, professors call for an alternative approach to understanding biased data used in medical machine learning — one that views biased clinical data as akin to archaeological artifacts.
Pulitzer Prize-winning author Michael Chabon and others sue OpenAI
Pulitzer Prize-winning US novelist Michael Chabon and several other writers are the latest to file a proposed class action accusing OpenAI of copyright infringement, alleging it pulled their work into the datasets used to train the models behind ChatGPT.
MLOps & LLMOps
SAM + Stable Diffusion for Text-to-Image Inpainting
A guide for creating a text-to-image inpainting pipeline using Grounding DINO, SAM, Stable Diffusion, and Comet.
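The guide walks through the full Grounding DINO → SAM → Stable Diffusion pipeline; below is a minimal, hedged sketch of just the final inpainting step with Hugging Face diffusers, assuming a SAM-generated mask is already in hand. The checkpoint name, placeholder image, mask region, and prompt are illustrative, not taken from the guide.

```python
# Minimal sketch of the text-to-image inpainting step, assuming a SAM mask
# already exists. Checkpoint, image, mask, and prompt are placeholders.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load a pretrained inpainting pipeline (checkpoint name is an assumption).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder for a real photo; in the guide this would be the input image.
image = Image.new("RGB", (512, 512), "white")

# Placeholder mask: in the full pipeline, Grounding DINO localizes the object
# from a text prompt and SAM converts the detection into this binary mask.
mask_array = np.zeros((512, 512), dtype=np.uint8)
mask_array[128:384, 128:384] = 255  # region to be repainted
mask = Image.fromarray(mask_array)

result = pipe(
    prompt="a golden retriever sitting on the grass",
    image=image,
    mask_image=mask,
).images[0]
result.save("inpainted.png")
```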
10 Ways to Improve the Performance of Retrieval Augmented Generation Systems
A post that highlights strategies for improving the quality of Retrieval Augmented Generation (RAG) systems.
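As a concrete illustration of one strategy commonly discussed for RAG systems, the hedged sketch below over-retrieves candidate chunks with a bi-encoder and keeps only the top-scoring ones for the prompt; the model name, corpus, and top_k are assumptions, not taken from the post.

```python
# A minimal sketch of the retrieval half of a RAG system: embed the corpus,
# embed the query, and keep only the best-matching chunks as context.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption

chunks = [
    "RAG systems retrieve documents and pass them to an LLM as context.",
    "Chunk size and overlap strongly affect retrieval quality.",
    "Reranking retrieved chunks with a cross-encoder can improve precision.",
]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

query = "How can I improve retrieval quality in a RAG pipeline?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve more candidates than needed, then keep the top-scoring chunks.
hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=2)[0]
context = "\n".join(chunks[hit["corpus_id"]] for hit in hits)
print(context)
```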
The Olympics of AI: Benchmarking Machine Learning Systems
An exploration of the crucial role of benchmarking in advancing computer science and machine learning by journeying through its history.
Learning
Inside the Matrix: Visualizing Matrix Multiplication, Attention and Beyond
A comprehensive article on how (and why) to use 3D to visualize matrix multiplication expressions, attention heads with real weights, and more.
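For readers who want to reproduce the core idea in code, here is a small numpy sketch (not code from the article): the product A @ B can be viewed as an i × j × k cube of pairwise products that gets collapsed along the k axis.

```python
# Every entry of A @ B is a sum over a "depth" axis k of pairwise products
# A[i, k] * B[k, j], so the whole product is an i x j x k cube summed over k.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # shape (i, k)
B = rng.standard_normal((3, 5))   # shape (k, j)

# The full cube of pairwise products, one slice per k.
cube = A[:, None, :] * B.T[None, :, :]   # shape (i, j, k)

# Summing the cube along k recovers the ordinary matrix product.
assert np.allclose(cube.sum(axis=-1), A @ B)
```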
LLM Training: RLHF and Its Alternatives
An article that compares how ChatGPT and Llama 2 implement RLHF and surveys alternative approaches.
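A core RLHF step is training a reward model on human preference pairs; the minimal PyTorch sketch below illustrates the pairwise (Bradley-Terry style) loss typically used for that step, with a stand-in linear head in place of a real LLM-based reward model.

```python
# Hedged sketch of the pairwise preference loss used to train an RLHF reward
# model; the tiny reward "model" and random features are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(768, 1)  # stand-in for a reward head on top of an LLM

# Pretend these are pooled hidden states for a chosen and a rejected response.
chosen_features = torch.randn(8, 768)
rejected_features = torch.randn(8, 768)

r_chosen = reward_model(chosen_features)
r_rejected = reward_model(rejected_features)

# Train the model so that chosen responses score higher than rejected ones.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(float(loss))
```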
An article that provides a brief overview of the GraphSAGE neural network architecture, complete with code examples in PyTorch Geometric.
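As a quick illustration (not code from the article), here is a minimal two-layer GraphSAGE model in PyTorch Geometric; the feature sizes and the toy graph are placeholders.

```python
# A two-layer GraphSAGE model in PyTorch Geometric on a toy graph.
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv

class GraphSAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        # Each SAGEConv layer aggregates neighbor features and combines them
        # with the node's own representation.
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

# Toy graph: 4 nodes, 3 undirected edges, 16-dimensional node features.
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]])
data = Data(x=torch.randn(4, 16), edge_index=edge_index)

model = GraphSAGE(16, 32, 2)
out = model(data.x, data.edge_index)  # per-node class logits
print(out.shape)  # torch.Size([4, 2])
```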
Libraries & Code
A lightweight and highly hackable framework for chat-based language models with tool usage/function calling.
A new Bayesian neural network library for PyTorch for large-scale deep networks.
An open-source, extensible, high-performance chatbot framework. It supports one-click free deployment of your private ChatGPT/LLM web application.
Papers & Publications
NExT-GPT: Any-to-Any Multimodal LLM
Abstract:
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%) in certain projection layers, which not only enables low-cost training but also facilitates convenient expansion to more potential modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
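To make the training recipe in the abstract concrete, the hedged PyTorch sketch below freezes stand-in encoder/LLM/decoder modules and optimizes only small projection layers between them, mirroring the "1% of parameters" idea; none of the modules, dimensions, or the loss correspond to NExT-GPT's actual components.

```python
# Hedged sketch: freeze pretrained encoder, LLM, and decoder; train only the
# projection layers that connect them. All modules here are placeholders.
import torch
import torch.nn as nn

encoder = nn.Identity()   # stand-in for a frozen multimodal encoder
llm = nn.Identity()       # stand-in for a frozen LLM backbone
decoder = nn.Identity()   # stand-in for a frozen diffusion decoder

input_projection = nn.Linear(1024, 4096)   # encoder features -> LLM space
output_projection = nn.Linear(4096, 768)   # LLM states -> decoder conditioning

# Freeze everything except the projections (roughly the "1% of parameters").
for module in (encoder, llm, decoder):
    for p in module.parameters():
        p.requires_grad = False

trainable = list(input_projection.parameters()) + list(output_projection.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

features = encoder(torch.randn(2, 1024))       # fake image/audio/video features
llm_states = llm(input_projection(features))   # pass through the frozen LLM
conditioning = output_projection(llm_states)   # condition the frozen decoder
loss = conditioning.pow(2).mean()              # placeholder training loss
loss.backward()
optimizer.step()
```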
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Abstract:
Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In response, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so while requiring less training data than finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning of the same T5 model struggles to match it even when using 100% of the dataset.
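The multi-task setup in the abstract can be sketched in a few lines of Hugging Face code: a small T5 is trained both to predict the label and to generate the LLM's rationale, and the two losses are combined. The prompts, rationale text, task prefixes, and loss weight below are assumptions, not the paper's.

```python
# Hedged sketch of the multi-task objective: one loss for label prediction,
# one for rationale generation, combined with an assumed weighting.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "Is 17 a prime number?"
label = "yes"
rationale = "17 has no divisors other than 1 and itself, so it is prime."

def seq2seq_loss(prompt, target):
    inputs = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    return model(**inputs, labels=labels).loss

# Task prefixes distinguish label prediction from rationale generation.
label_loss = seq2seq_loss("[label] " + question, label)
rationale_loss = seq2seq_loss("[rationale] " + question, rationale)

loss = label_loss + 0.5 * rationale_loss   # rationale weight is an assumption
loss.backward()
print(float(loss))
```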
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Abstract:
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs), with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on a context length of 8192 requires 16x the computational cost in self-attention layers compared to a length of 2048. In this paper, we speed up the context extension of LLMs in two ways. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be effectively and efficiently done by sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation savings with similar performance to fine-tuning with vanilla attention. Particularly, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from a 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
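The abstract notes that the attention shift can be implemented in roughly two lines during training; the hedged sketch below shows one simplified reading of that shift at the tensor level (roll half of the attention heads by half a group along the sequence before grouped local attention), with illustrative shapes rather than the paper's actual implementation.

```python
# Hedged sketch of the training-time shift: half of the heads attend within
# local groups as-is, while the other half see the sequence rolled by half a
# group so information flows across group boundaries. Shapes are illustrative.
import torch

batch, seq_len, num_heads, head_dim = 1, 8192, 8, 64
group_size = 2048
qkv = torch.randn(batch, seq_len, 3, num_heads, head_dim)

# The core shift: roll the second half of the heads by half a group along the
# sequence dimension before grouped (local) attention is applied.
qkv[:, :, :, num_heads // 2:] = qkv[:, :, :, num_heads // 2:].roll(
    -group_size // 2, dims=1
)

# Grouped attention then runs on chunks of length `group_size`; afterwards the
# shifted heads would be rolled back by +group_size // 2.
```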