Deep Learning Weekly: Issue 327
DeepMind's GraphCast, Ingesting Data for Semantic Searches in a Production-Ready Way, AI Timelines, a paper on S-LoRA: Serving Thousands of Concurrent LoRA Adapters, and many more!
This week in deep learning, we bring you DeepMind's GraphCast, Ingesting Data for Semantic Searches in a Production-Ready Way, AI Timelines, and a paper on S-LoRA: Serving Thousands of Concurrent LoRA Adapters.
You may also enjoy Scale AI's Safety, Evaluations and Analysis Lab, Testing Large Language Models with Giskard, SDXL in 4 steps with Latent Consistency LoRAs, a paper on Alternating Updates for Efficient Transformers, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Google DeepMind’s AI Weather Forecaster Handily Beats a Global Standard
DeepMind’s latest weather model surpasses the European Centre for Medium-Range Weather Forecasts, a global leader in weather prediction, on 90% of 1,300+ atmospheric variables.
SEAL: Scale’s Safety, Evaluations and Analysis Lab
Scale AI unveils a new frontier research effort dedicated to building robust evaluation products and tackling the challenging research problems in evaluation and red teaming.
OpenAI recruiters are trying to lure Google AI employees with $10 million pay packets, report says
OpenAI is attempting to lure top Google researchers with lucrative compensation packages and advanced AI hardware to support their research endeavors.
Open-source ML observability course
Evidently AI launched a free open-source ML observability course for data scientists and ML engineers.
AI-enhanced security operations solutions startup Radiant Security raises $15M
AI-enhanced security operations solutions startup Radiant Security announced that it has raised $15 million in new funding for additional research and development.
OpenAI launches partner initiative focused on creating AI training datasets
OpenAI announced a new initiative, OpenAI Data Partnerships, through which it will collect records from other organizations to create training datasets.
MLOps & LLMOps
Introducing LCEL: A Guide to LangChain Expression Language
An in-depth overview of the capabilities of LangChain Expression Language (LCEL), from its initial setup to its advanced functionalities.
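For a flavor of what the guide covers, here is a minimal sketch of LCEL's pipe-style composition. The prompt text and model are placeholders, and the exact import paths vary by LangChain version:

```python
# Minimal LCEL sketch: compose a prompt, a chat model, and an output parser
# into one runnable chain with the | operator.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI  # import path depends on your LangChain version

prompt = ChatPromptTemplate.from_template("Summarize this in one sentence: {text}")
model = ChatOpenAI(model="gpt-3.5-turbo")  # placeholder model choice
parser = StrOutputParser()

chain = prompt | model | parser  # the LCEL chain

print(chain.invoke({"text": "LCEL is a declarative way to compose LangChain components."}))
```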
Ingesting Data for Semantic Searches in a Production-Ready Way
A tutorial on how to embed a large volume of data, upload it to a vector database, run top K similarity searches against it, and monitor it in production.
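As a rough illustration of the embed-then-search flow, here is a minimal sketch using an assumed sentence-transformers model and an in-memory index; the tutorial's actual embedding model, vector database, and monitoring stack may differ:

```python
# Embed a small corpus, then run a top-K cosine-similarity search against it.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["How to reset a password", "Billing and invoices", "Shipping times"]

# L2-normalized embeddings so a dot product equals cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)

def top_k(query: str, k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    idx = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in idx]

print(top_k("when will my order arrive?"))
```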
Learning
AI Timelines
A written dialogue that highlights disagreements between researchers on when transformative AI will be built.
SDXL in 4 steps with Latent Consistency LoRAs
An article about using Latent Consistency (LCM) LoRAs to speed up image generation with Stable Diffusion and SDXL. It covers how to train and use LCM LoRAs and the benefits they bring.
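A condensed sketch of the 4-step generation the article describes, using Diffusers with the published LCM LoRA for SDXL (model IDs and settings may differ from the post):

```python
# 4-step SDXL inference with the LCM LoRA: swap in the LCM scheduler,
# load the distilled LoRA weights, and sample with very few steps.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# LCM LoRAs work with few steps and low (or no) classifier-free guidance.
image = pipe(
    "a photo of a lighthouse at dawn", num_inference_steps=4, guidance_scale=1.0
).images[0]
image.save("lighthouse.png")
```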
4-Bit Quantization with Lightning Fabric
An article about the basics of Lightning Fabric’s plugin for 4-bit quantization.
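A minimal sketch of enabling the bitsandbytes-backed 4-bit plugin in Fabric; the plugin name and arguments follow Lightning 2.1+, and the model here is only a placeholder:

```python
# Run a model through Fabric with 4-bit (NF4) quantized linear layers.
import torch
import torch.nn as nn
from lightning.fabric import Fabric
from lightning.fabric.plugins import BitsandbytesPrecision

# "nf4" selects 4-bit NormalFloat quantization; compute is done in bfloat16.
precision = BitsandbytesPrecision(mode="nf4", dtype=torch.bfloat16)
fabric = Fabric(accelerator="cuda", devices=1, plugins=precision)
fabric.launch()

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
model = fabric.setup_module(model)  # Linear layers are replaced with 4-bit variants
```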
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
An introductory article about using sparse, discrete hidden states to make neural networks more interpretable.
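To make the idea concrete, here is a toy codebook bottleneck: each hidden activation is replaced by the sum of its top-k most similar learned code vectors, so the active code indices become a sparse, discrete, inspectable feature set. The sizes and k are arbitrary and this is not the paper's exact setup:

```python
import torch
import torch.nn as nn

class CodebookBottleneck(nn.Module):
    def __init__(self, dim: int, num_codes: int = 512, k: int = 8):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))
        self.k = k

    def forward(self, h: torch.Tensor):
        # Cosine similarity between activations and every code vector.
        sims = nn.functional.normalize(h, dim=-1) @ nn.functional.normalize(self.codes, dim=-1).T
        topk = sims.topk(self.k, dim=-1)
        # Replace the activation with the sum of its top-k codes;
        # the code indices are the discrete "features" to inspect.
        out = self.codes[topk.indices].sum(dim=-2)
        return out, topk.indices

layer = CodebookBottleneck(dim=64)
h = torch.randn(2, 10, 64)      # (batch, tokens, hidden)
out, active_codes = layer(h)    # active_codes: which codes fired for each token
```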
Defining Marketing Strategy Using Comet
An article about using Comet to perform market segmentation and learn about your customers to launch marketing campaigns.
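A minimal sketch of that workflow: cluster customers with scikit-learn and log the segmentation run to Comet. The project name and toy features are placeholders, and the article's actual pipeline may differ:

```python
# Segment customers with k-means and track the run in Comet.
import numpy as np
from sklearn.cluster import KMeans
from comet_ml import Experiment

experiment = Experiment(project_name="market-segmentation")  # reads COMET_API_KEY from env

# Toy customer features: [annual spend, visits per month]
X = np.array([[120, 2], [950, 8], [60, 1], [1100, 10], [300, 4]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
experiment.log_parameters({"n_clusters": 2})
experiment.log_metric("inertia", kmeans.inertia_)
experiment.log_table("segments.csv", [[i, int(c)] for i, c in enumerate(kmeans.labels_)])
experiment.end()
```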
Libraries & Code
Build, train, and fine-tune production-ready, state-of-the-art deep learning vision models
A Multi-Voice and Prompt-Controlled TTS Engine
An open-source, high-performance chatbot framework. It supports one-click free deployment of private LLM web applications.
Papers & Publications
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Abstract:
The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.
Alternating Updates for Efficient Transformers
Abstract:
It has been well established that increasing scale in deep transformer networks leads to improved quality and performance. However, this increase in scale often comes with prohibitive increases in compute cost and inference latency. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation, i.e., the token embedding, while only incurring a negligible increase in latency. AltUp achieves this by working on a subblock of the widened representation at each layer and using a predict-and-correct mechanism to update the inactivated blocks. We present extensions of AltUp, such as its applicability to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity. Our experiments on benchmark transformer models and language tasks demonstrate the consistent effectiveness of AltUp on a diverse set of scenarios. Notably, on SuperGLUE and SQuAD benchmarks, AltUp enables up to 87% speedup relative to the dense baselines at the same accuracy.
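A heavily simplified sketch of the predict-and-correct step on a 2x-widened token embedding: only one sub-block passes through the expensive layer, the rest are predicted as learned mixtures of the old blocks and then corrected toward the computed update. This is an illustration of the mechanism described in the abstract, not the reference implementation:

```python
import torch
import torch.nn as nn

class AltUpBlock(nn.Module):
    def __init__(self, d_model: int, layer: nn.Module, k: int = 2):
        super().__init__()
        self.layer = layer                    # an ordinary layer of width d_model
        self.k = k
        self.predict = nn.Parameter(torch.eye(k) + 0.01 * torch.randn(k, k))
        self.correct = nn.Parameter(torch.ones(k))

    def forward(self, x_wide: torch.Tensor):
        # x_wide: (batch, seq, k * d_model) split into k sub-blocks of width d_model
        blocks = x_wide.chunk(self.k, dim=-1)
        stacked = torch.stack(blocks, dim=0)                      # (k, batch, seq, d_model)
        # Predict every block as a learned linear combination of the old blocks.
        predicted = torch.einsum("ij,jbsd->ibsd", self.predict, stacked)
        # Run the real layer on only the activated (first) block.
        activated_out = self.layer(blocks[0])
        # Correct all predictions toward the activated block's true update.
        correction = activated_out - predicted[0]
        updated = predicted + self.correct.view(-1, 1, 1, 1) * correction
        return torch.cat(list(updated.unbind(0)), dim=-1)

block = AltUpBlock(d_model=64, layer=nn.Linear(64, 64))
y = block(torch.randn(2, 10, 128))  # widened input: 2 * 64
```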
CogVLM: Visual Expert for Pretrained Language Models
Abstract:
We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30K captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B.
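A toy sketch of the "visual expert" idea: image and text tokens share the same layer but are routed through separate weight sets, shown here for an FFN only. This is a simplification; CogVLM also adds a visual expert in attention and keeps the original language-model weights frozen:

```python
import torch
import torch.nn as nn

class DualFFN(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 256):
        super().__init__()
        # Text FFN mirrors the (frozen) pretrained LM; the vision FFN is the trainable expert.
        self.text_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.vision_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, h: torch.Tensor, is_image: torch.Tensor):
        # is_image: (batch, seq) boolean mask marking image-patch tokens.
        text_out = self.text_ffn(h)
        vision_out = self.vision_ffn(h)
        return torch.where(is_image.unsqueeze(-1), vision_out, text_out)

ffn = DualFFN()
h = torch.randn(1, 12, 64)
is_image = torch.tensor([[True] * 4 + [False] * 8])  # first 4 tokens are image patches
y = ffn(h, is_image)
```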