Deep Learning Weekly: Issue 362
Gen AI: too much spend, too little benefit?, Workflow for LLM Regression Testing, Prompt Engineering for Cognitive Flexibility, a paper on GaLore: Memory-Efficient LLM Training, and many more!
This week in deep learning, we bring you Gen AI: too much spend, too little benefit?, Step-by-step workflow for LLM Regression Testing, Prompt Engineering for Cognitive Flexibility, and a paper on GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
You may also enjoy Mistral's Codestral Mamba, Building a Natively Multimodal RAG Pipeline (over a Slide Deck), a paper on CodeAct: Your LLM Agent Acts Better when Generating Code, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gen AI: too much spend, too little benefit?
Goldman Sachs Research examines the economic implications of significant investments in AI.
Mistral AI releases Codestral Mamba, a Mamba2 language model specialized in code generation.
Anthropic releases the new Claude Android app, which includes their most powerful model, Claude 3.5 Sonnet.
Amazon hires founders from well-funded enterprise AI startup Adept to boost tech giant's 'AGI' team
Amazon is amping up its AI efforts by hiring executives from Adept, a San Francisco-based startup building agents that automate enterprise workflows.
MLOps & LLMOps
Building Pinterest Canvas, a text-to-image foundation model
The Pinterest team shares their latest explorations and progress on Pinterest Canvas, a text-to-image foundation model for enhancing existing images and products on the platform.
Tutorial: Best Practices for Evaluating Fine-Tuned LLMs
An in-depth technical tutorial with code examples covering common evaluation methods for the content-generation tasks LLMs perform, with a focus on human-in-the-loop evaluation and on using a larger model to assess coherence and quantify other metrics.
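For a feel of the larger-model-as-judge pattern the tutorial covers, here is a minimal sketch; `call_judge_model` is a hypothetical stand-in for whatever LLM client you use, and the rubric wording is illustrative rather than taken from the tutorial.

```python
# Minimal sketch of the "larger model as judge" pattern described above.
# `call_judge_model` is a hypothetical stand-in for your LLM client of choice.

JUDGE_PROMPT = """You are grading the output of a fine-tuned model.

Task: {task}
Model output: {output}

Rate the coherence of the output on a scale of 1-5 and explain briefly.
Respond as: SCORE: <1-5> | REASON: <one sentence>"""

def call_judge_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API client here")

def judge_coherence(task: str, output: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(task=task, output=output))
    # Parse "SCORE: 4 | REASON: ..." defensively; default to the lowest score.
    try:
        return int(reply.split("SCORE:")[1].split("|")[0].strip())
    except (IndexError, ValueError):
        return 1
```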
Introducing Micro Agent: An (Actually Reliable) AI Coding Agent
An article that introduces an open-source AI coding tool that aims to deliver the benefits of AI-assisted coding while mitigating the problems of unreliable code generation.
Building a Natively Multimodal RAG Pipeline (over a Slide Deck)
A cookbook that shows you how to build a multimodal RAG pipeline over a slide deck, with text, tables, images, diagrams, and complex layouts.
LLM Evaluation doesn't need to be complicated
A blog post on how to set up a simplified evaluation workflow for LLM applications using an additive score, chain-of-thought prompting, and form-filling prompt templates.
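The post's core ingredients (an additive score, chain-of-thought reasoning, and a form-filling template) compose naturally; here is a hedged sketch in which the criteria and prompt wording are illustrative, not taken from the post.

```python
# Sketch of the additive-score idea: the judge awards one point per criterion
# met, and a form-filling template with chain-of-thought keeps the reply easy
# to parse. Criteria and wording are illustrative.

ADDITIVE_TEMPLATE = """Evaluate the answer below against each criterion.
Think step by step, then fill in the form exactly.

Question: {question}
Answer: {answer}

Criteria (1 point each):
1. The answer is factually consistent with the question's context.
2. The answer directly addresses the question.
3. The answer is concise and free of repetition.

Reasoning: <your step-by-step reasoning>
Points: <integer 0-3>"""

def parse_points(judge_reply: str) -> int:
    # Read the integer from the "Points:" field of the filled form.
    for line in judge_reply.splitlines():
        if line.strip().startswith("Points:"):
            try:
                return int(line.split(":", 1)[1].strip())
            except ValueError:
                break
    return 0  # unparseable replies score zero
```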
Learning
Prompt Engineering for Cognitive Flexibility
An article that explores prompt engineering through the lens of a recent mini-experiment that leverages the latest MMLU-Pro benchmark, leading to insights around cognitive flexibility.
Perception-Inspired Graph Convolution for Music Understanding Tasks
An article that discusses MusGConv, a perception-inspired graph convolution block for symbolic musical applications.
Dealing with cognitive dissonance, the AI way
An article that highlights behavioral inconsistencies of LLMs when faced with contradictory instructions in their prompts.
The Ultimate Handbook for LLM Quantization
A deep dive into LLM quantization and its most common techniques.
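As a baseline for what such techniques build on, here is a minimal sketch of symmetric per-tensor int8 quantization; it illustrates the general idea, not any specific method from the handbook.

```python
import numpy as np

# Symmetric per-tensor int8 quantization: map the largest-magnitude weight
# to 127 and round everything else onto the resulting integer grid.

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale  # approximate reconstruction

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```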
Libraries & Code
RouteLLM, a framework for serving and evaluating LLM routers.
Agent Evaluation is a generative AI-powered framework for testing virtual agents.
Papers & Publications
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Abstract:
Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages, since they limit the parameter search to a low-rank subspace and alter the training dynamics, and may further require a full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
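To make the core idea concrete, here is a toy sketch of gradient low-rank projection for a single weight matrix, with SGD momentum standing in for the Adam states the paper targets; it is a simplification of the method, not the authors' implementation.

```python
import numpy as np

# Toy sketch of the GaLore idea: the optimizer state lives in a low-rank
# subspace of the gradient rather than at full weight size.

rank, lr, beta = 4, 1e-2, 0.9
W = np.random.randn(64, 64)
M = np.zeros((rank, 64))                  # momentum kept in the small subspace

def train_step(W, grad, P, M):
    g_low = P.T @ grad                    # project gradient: (rank, 64)
    M[:] = beta * M + g_low               # optimizer state is rank x 64, not 64 x 64
    W -= lr * (P @ M)                     # project the update back to full size
    return W

for step in range(100):
    grad = np.random.randn(64, 64)        # placeholder for a real backward pass
    if step % 50 == 0:                    # periodically refresh the subspace
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]                   # top-rank left singular vectors
        # (GaLore also carries optimizer state across refreshes; omitted here.)
    W = train_step(W, grad, P, M)
```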
CodeAct: Your LLM Agent Acts Better when Generating Code
Abstract:
Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by a constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with a Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and to autonomously self-debug.
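The core loop is easy to picture. Here is a minimal sketch of the code-as-action pattern the abstract describes; `query_llm` is a hypothetical stand-in for the agent model, and a production version would sandbox `exec` rather than run it directly.

```python
import io, contextlib

# Sketch of a CodeAct-style loop: the agent emits Python instead of JSON tool
# calls, the interpreter executes it, and the captured output (or error) is
# fed back as the next observation. `query_llm` is a hypothetical stand-in.

def query_llm(history: list[str]) -> str:
    raise NotImplementedError("plug in your LLM client here")

def execute(code: str, env: dict) -> str:
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)               # run the action; env persists across turns
        return buf.getvalue() or "(no output)"
    except Exception as e:
        return f"Error: {e}"              # errors become observations to self-debug on

def codeact_loop(task: str, max_turns: int = 5) -> None:
    history, env = [f"Task: {task}"], {}
    for _ in range(max_turns):
        code = query_llm(history)         # the model responds with executable code
        history += [f"Action:\n{code}", f"Observation:\n{execute(code, env)}"]
```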
Interactive Continual Learning: Fast and Slow Thinking
Abstract:
Advanced life forms, sustained by the synergistic interaction of neural cognitive mechanisms, continually acquire and transfer knowledge throughout their lifespan. In contrast, contemporary machine learning paradigms exhibit limitations in emulating the facets of continual learning (CL). Nonetheless, the emergence of large language models (LLMs) presents promising avenues for realizing CL via interactions with these models. Drawing on Complementary Learning System theory, this paper presents a novel Interactive Continual Learning (ICL) framework, enabled by collaborative interactions among models of various sizes. Specifically, we assign the ViT model as System1 and the multimodal LLM as System2. To enable the memory module to deduce tasks from class information and enhance Set2Set retrieval, we propose the Class-Knowledge-Task Multi-Head Attention (CKT-MHA). Additionally, to improve memory retrieval in System1 through enhanced geometric representation, we introduce the CL-vMF mechanism, based on the von Mises-Fisher (vMF) distribution. Meanwhile, we introduce the von Mises-Fisher Outlier Detection and Interaction (vMF-ODI) strategy to identify hard examples, thus enhancing collaboration between System1 and System2 for complex reasoning. Comprehensive evaluation of our proposed ICL demonstrates significant resistance to forgetting and superior performance relative to existing methods.
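For intuition on why a von Mises-Fisher distribution suits memory retrieval: on the unit sphere its log-density is a concentration-weighted cosine similarity, so scoring reduces to the sketch below. This illustrates the distribution itself, not the paper's CL-vMF module; all names and values are illustrative.

```python
import numpy as np

# On the unit sphere, log p(x | mu, kappa) = kappa * (mu . x) + const for a
# von Mises-Fisher distribution, so retrieval is kappa-weighted cosine scoring.

def vmf_log_score(query: np.ndarray, mean_dirs: np.ndarray, kappa: float) -> np.ndarray:
    q = query / np.linalg.norm(query)                          # unit-norm query
    mu = mean_dirs / np.linalg.norm(mean_dirs, axis=1, keepdims=True)
    return kappa * (mu @ q)                                    # log-density up to a constant

memory = np.random.randn(10, 32)      # 10 stored class directions (illustrative)
query = np.random.randn(32)
print("retrieved class:", vmf_log_score(query, memory, kappa=20.0).argmax())
```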