Deep Learning Weekly: Issue 344
The State of Competitive Machine Learning, Optimizing Retrieval with HyDE, Evaluate LLMs with Hugging Face Lighteval, a paper on Measuring and Reducing Malicious Use With Unlearning, and more!
This week in deep learning, we bring you The State of Competitive Machine Learning, Optimizing Retrieval with HyDE, Evaluate LLMs with Hugging Face Lighteval on Amazon SageMaker, and a paper on The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning.
You may also enjoy Cohere releases powerful 'Command-R' language model for enterprise use, Portable Evaluation Tasks via the METR Task Standard, PIXART-α: A Diffusion Transformer Model for Text-to-Image Generation, a paper on GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
The State of Competitive Machine Learning
ML Contests provides a summary of the competitive landscape, an analysis of 300+ competitions, and a deep dive into the winning solutions of 2023.
You can now train a 70b language model at home
Answer.AI releases its first project: an open source system, based on FSDP and QLoRA, that can train a 70b model on two 24GB GPUs.
Cohere releases powerful 'Command-R' language model for enterprise use
Cohere announced the release of a new language model called Command-R, which is designed for Retrieval Augmented Generation at scale.
Elon Musk’s xAI to open-source its Grok language model
Elon Musk announced that xAI plans to open-source its flagship large language model.
UK startup launches AI satellite to provide near real-time images of Earth
UK-based space tech startup Open Cosmos has successfully launched a new AI-powered satellite that can provide near real-time views of Earth.
MLOps & LLMOps
Optimizing Retrieval with HyDE
An article on how to implement Hypothetical Document Embeddings (HyDE) and incorporate them into a Haystack retrieval pipeline.
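The core idea behind HyDE is simple: ask an LLM to draft a hypothetical answer to the query, embed that draft, and retrieve real documents closest to it. Below is a minimal sketch of that idea, not the article's Haystack pipeline; generate_hypothetical_doc is a placeholder for whatever LLM call you prefer, and the embedding model is an assumption.

```python
# Minimal HyDE sketch (not the article's Haystack implementation).
# Assumes sentence-transformers is installed; generate_hypothetical_doc is a
# placeholder for any LLM call that drafts a plausible answer to the query.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "HyDE generates a hypothetical answer and embeds it for retrieval.",
    "Dense retrieval compares query and document embeddings by cosine similarity.",
    "Haystack is a framework for building search and RAG pipelines.",
]
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)

def generate_hypothetical_doc(query: str) -> str:
    # Placeholder: in practice, prompt an LLM with something like
    # "Write a short passage that answers: {query}"
    return f"A short passage that plausibly answers the question: {query}"

def hyde_retrieve(query: str, top_k: int = 2):
    hypothetical = generate_hypothetical_doc(query)
    q_emb = embedder.encode([hypothetical], normalize_embeddings=True)[0]
    scores = corpus_emb @ q_emb          # cosine similarity (embeddings are normalized)
    best = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(scores[i])) for i in best]

print(hyde_retrieve("How does HyDE improve retrieval?"))
```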
Portable Evaluation Tasks via the METR Task Standard
METR’s article introducing a task format for evaluating the capabilities of AI agents.
Augmenting Gemini-1.0-Pro with Knowledge Graphs via LangChain
An article that explores how to create a Knowledge Graph from Wikipedia articles and use it with LangChain to build a chatbot with memory.
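To make the knowledge-graph step concrete, here is an illustrative sketch of the idea, not the article's LangChain/Gemini pipeline: extract (subject, relation, object) triples, store them in a graph, and look up facts to inject into the chatbot's prompt. extract_triples stands in for the LLM-based extractor used in the article.

```python
# Illustrative knowledge-graph sketch (not the article's LangChain/Gemini code).
# extract_triples is a stand-in for an LLM-based triple extractor.
import networkx as nx

def extract_triples(text: str):
    # Placeholder: in the article this extraction is done by an LLM.
    return [
        ("Marie Curie", "won", "Nobel Prize in Physics"),
        ("Marie Curie", "born_in", "Warsaw"),
    ]

graph = nx.MultiDiGraph()
for subj, rel, obj in extract_triples("...Wikipedia article text..."):
    graph.add_edge(subj, obj, relation=rel)

def facts_about(entity: str) -> list[str]:
    """Collect outgoing facts to inject into the chatbot's prompt as context."""
    return [
        f"{entity} {data['relation']} {obj}"
        for _, obj, data in graph.out_edges(entity, data=True)
    ]

print(facts_about("Marie Curie"))
```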
Learning
Evaluate LLMs with Hugging Face Lighteval on Amazon SageMaker
A tutorial on how to evaluate LLMs with Hugging Face’s Lighteval on Amazon SageMaker.
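For intuition, multiple-choice benchmarks are typically scored by comparing the log-likelihood the model assigns to each candidate answer. The sketch below shows that underlying idea with plain transformers; it is not the Lighteval API, which wraps this kind of logic behind task configurations.

```python
# Generic log-likelihood multiple-choice scoring (illustrative only; this is
# not the Lighteval API). Uses a small model so it runs anywhere.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens."""
    prompt_ids = tok(prompt, return_tensors="pt")["input_ids"]
    choice_ids = tok(choice, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([prompt_ids, choice_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_lps = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_lps[prompt_ids.shape[1] - 1:].sum().item()  # choice tokens only

prompt = "Question: What gas do plants absorb?\nAnswer:"
choices = [" Carbon dioxide", " Oxygen", " Helium"]
scores = [choice_logprob(prompt, c) for c in choices]
print(choices[int(torch.tensor(scores).argmax())])
```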
PIXART-α: A Diffusion Transformer Model for Text-to-Image Generation
A short tutorial on how to run experiments with PixArt-α, the transformer-based diffusion model for generating photorealistic images from text.
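For reference, a minimal way to try the model with the diffusers library; this is a sketch assuming your diffusers version ships PixArtAlphaPipeline and a CUDA GPU is available, and the tutorial's exact setup may differ.

```python
# Minimal PixArt-α text-to-image sketch (assumes a diffusers release that
# includes PixArtAlphaPipeline and a CUDA GPU; the tutorial's setup may differ).
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="A photorealistic photo of a red fox in a snowy forest at dawn",
    num_inference_steps=20,
).images[0]
image.save("fox.png")
```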
An Overview of the LoRA Family: LoRA, DoRA, AdaLoRA, Delta-LoRA
An overview of several LoRA variants that promise to improve LoRA’s capabilities in different ways.
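As a refresher on the baseline these variants build on: plain LoRA freezes the pretrained weight W and learns a low-rank update BA scaled by alpha/r. A minimal PyTorch sketch (illustrative, not any particular library's implementation):

```python
# Minimal LoRA linear layer sketch (illustrative, not a specific library's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # pretrained weight stays frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # y = W x + (alpha / r) * B A x ; only A and B are trained
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(512, 512), rank=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 512
```

The variants covered in the article change different pieces of this recipe, for example how the rank is allocated or how the update is decomposed, while keeping the same frozen-weight-plus-adapter structure.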
Tokens-to-Token Vision Transformers, Explained
An article that walks through the Tokens-to-Token Vision Transformer (T2T-ViT) and explains its underlying concepts and components.
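The distinctive step is the "soft split": overlapping patches are unfolded into tokens so that neighbors share pixels, mixed by a lightweight transformer layer, and re-folded for the next stage. A toy sketch of that step, not the authors' implementation (which also projects tokens to a smaller dimension):

```python
# Sketch of the Tokens-to-Token "soft split" step (illustrative only).
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                  # (B, C, H, W) input image

soft_split = nn.Unfold(kernel_size=7, stride=4, padding=2)
tokens = soft_split(x)                           # (B, C*7*7, num_patches)
tokens = tokens.transpose(1, 2)                  # (B, num_patches, C*7*7) token sequence

# A lightweight transformer layer mixes information across overlapping tokens.
layer = nn.TransformerEncoderLayer(d_model=tokens.shape[-1], nhead=1, batch_first=True)
tokens = layer(tokens)

# Fold tokens back into an image-like grid so the split can be applied again.
side = int(tokens.shape[1] ** 0.5)               # 56 for a 224x224 input with stride 4
feature_map = tokens.transpose(1, 2).reshape(1, -1, side, side)
print(tokens.shape, feature_map.shape)
```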
Libraries & Code
A library for generative social simulation.
Seamlessly integrate powerful language models like ChatGPT into scikit-learn for enhanced text analysis tasks.
LangGraph is a library for building stateful, multi-actor applications with LLMs, built on top of LangChain.
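A minimal sketch of the LangGraph programming model: node functions read and update a shared state and are wired into a graph. This assumes a recent langgraph release; the node bodies are stubs where LLM or tool calls would normally go.

```python
# Minimal LangGraph sketch (assumes a recent langgraph release; node bodies
# are stubs standing in for LLM/tool calls).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    answer: str

def retrieve(state: State) -> dict:
    # Stub: a real node would fetch context for state["question"].
    return {"answer": f"context for: {state['question']}"}

def respond(state: State) -> dict:
    # Stub: a real node would call an LLM with the retrieved context.
    return {"answer": f"final answer based on ({state['answer']})"}

builder = StateGraph(State)
builder.add_node("retrieve", retrieve)
builder.add_node("respond", respond)
builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "respond")
builder.add_edge("respond", END)

graph = builder.compile()
print(graph.invoke({"question": "What is LangGraph?", "answer": ""}))
```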
Papers & Publications
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Abstract:
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 4,157 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop CUT, a state-of-the-art unlearning method based on controlling model representations. CUT reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs.
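To make "controlling model representations" concrete, here is a toy sketch of the general idea behind such unlearning losses: push hidden states on forget-set (hazardous) data toward a fixed control direction while keeping hidden states on retain-set data close to those of a frozen copy of the original model. This is an illustrative sketch, not the authors' released implementation or hyperparameters.

```python
# Sketch of the representation-control idea behind unlearning methods like CUT
# (illustrative; not the paper's released code or hyperparameters).
import torch
import torch.nn.functional as F

def unlearning_loss(hidden_forget, hidden_retain, frozen_hidden_retain,
                    control_vec, alpha=100.0):
    """
    hidden_forget / hidden_retain: hidden states from the model being unlearned
    frozen_hidden_retain: hidden states from a frozen copy of the original model
    control_vec: fixed random direction that forget-set activations are pushed toward
    """
    # Destroy features encoding hazardous knowledge on the forget set.
    forget_term = F.mse_loss(hidden_forget, control_vec.expand_as(hidden_forget))
    # Preserve general capabilities by anchoring retain-set activations.
    retain_term = F.mse_loss(hidden_retain, frozen_hidden_retain)
    return forget_term + alpha * retain_term

# Toy shapes: (batch, seq_len, hidden_dim)
h_f = torch.randn(2, 16, 64, requires_grad=True)
h_r = torch.randn(2, 16, 64, requires_grad=True)
h_r_frozen = h_r.detach() + 0.01 * torch.randn(2, 16, 64)
c = torch.randn(64) * 10.0   # scaled random control vector
print(unlearning_loss(h_f, h_r, h_r_frozen, c).item())
```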
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Abstract:
Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
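The building block Hawk and Griffin share is a gated linear recurrence: per-channel gates decide how much of the previous state to keep versus how much new input to write, with no nonlinearity applied to the state itself. A toy per-timestep sketch follows; it is not the paper's exact parameterization, initialization, or hardware-efficient scan.

```python
# Toy gated linear recurrence (illustrative; not the paper's exact
# parameterization or its parallel scan implementation).
import torch
import torch.nn as nn

class GatedLinearRecurrence(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.forget_gate = nn.Linear(dim, dim)   # per-channel decay a_t
        self.input_gate = nn.Linear(dim, dim)    # per-channel input gate i_t

    def forward(self, x):                        # x: (batch, seq_len, dim)
        b, t, d = x.shape
        h = x.new_zeros(b, d)
        outputs = []
        for step in range(t):
            a = torch.sigmoid(self.forget_gate(x[:, step]))   # decay in (0, 1)
            i = torch.sigmoid(self.input_gate(x[:, step]))
            # Element-wise *linear* recurrence over h: this is what makes the
            # state cheap at inference and amenable to parallel scans in training.
            h = a * h + (1 - a) * (i * x[:, step])
            outputs.append(h)
        return torch.stack(outputs, dim=1)

y = GatedLinearRecurrence(32)(torch.randn(2, 10, 32))
print(y.shape)  # torch.Size([2, 10, 32])
```

Griffin interleaves blocks like this with local (sliding-window) attention, which is how it keeps Transformer-level quality while gaining the recurrence's inference efficiency.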
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Abstract:
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
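The memory saving comes from where the optimizer statistics live: the gradient is projected into a low-rank subspace, momentum and variance are tracked there, and the update is projected back to full rank, so the weights themselves stay full-rank. A toy single-step sketch of this idea (illustrative; not the authors' implementation, projection schedule, or 8-bit variant):

```python
# Toy sketch of gradient low-rank projection (GaLore-style), illustrative only.
import torch

def galore_step(weight, grad, state, rank=4, lr=1e-3, beta=0.9):
    """One update: project the gradient to a low-rank subspace, keep optimizer
    statistics there, then project the update back to full rank."""
    # Periodically (here: once) build the projector from the gradient's top
    # left singular vectors; storing rank-sized statistics saves optimizer memory.
    if "P" not in state:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]                       # (out_dim, rank)
        state["m"] = torch.zeros(rank, grad.shape[1])  # momentum lives in low rank
    P = state["P"]
    low_rank_grad = P.T @ grad                         # (rank, in_dim)
    state["m"] = beta * state["m"] + (1 - beta) * low_rank_grad
    update = P @ state["m"]                            # back to (out_dim, in_dim)
    weight -= lr * update
    return weight

W = torch.randn(64, 32)
G = torch.randn(64, 32)
state = {}
galore_step(W, G, state)
print(state["P"].shape, state["m"].shape)  # torch.Size([64, 4]) torch.Size([4, 32])
```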