Deep Learning Weekly: Issue 333
Apple's Ferret, Recipe for Serving Thousands of Concurrent LoRA Adapters, Challenges with Unsupervised LLM Knowledge Discovery, a paper on StreamDiffusion, and many more!
This week in deep learning, we bring you Apple's Ferret, Recipe for Serving Thousands of Concurrent LoRA Adapters, Challenges with Unsupervised LLM Knowledge Discovery, and a paper on StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation.
You may also enjoy The Biggest Discoveries in Computer Science in 2023, Push Notifications - What to Push, What Not to Push, and How Often, Develop Your First AI Agent: Deep Q-Learning, a paper on DriveLM: Driving with Graph Visual Question Answering, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Last October, AI researchers from Apple and Cornell University quietly unveiled Ferret, an open-source multimodal large language model.
In 2023, computer scientists made progress on a new vector-driven approach to AI, fundamentally improved Shor’s algorithm for factoring large numbers, and examined the surprising and powerful behaviors that can emerge from large language models.
Anthropic is in discussions to raise $750 million in a funding round led by Menlo Ventures.
Using deep learning, MIT researchers have discovered a class of compounds that can kill a drug-resistant bacterium that causes more than 10,000 deaths in the United States every year.
OpenAI, the creator of ChatGPT, is reportedly holding discussions with investors over a fresh round of funding that would value it at or above $100 billion.
Researchers have found child sexual abuse material in LAION-5B, an open-source artificial intelligence training dataset used to build image generation models.
MLOps & LLMOps
A blog post that introduces S-LoRA, a system designed for the scalable serving of many LoRA adapters.
A blog post that discusses the challenges and best practices of designing push notifications, a form of recommender system that proactively sends suggestions via email or mobile alerts.
An article on how Instacart deployed a single Deep Learning pCTR model for multiple surfaces with improved operations and performance along the way.
An article that walks through how to use Comet at different stages of an ML project.
An article on the challenges of contrast-consistent search (CCS) for knowledge discovery and alignment strategies.
A tutorial on how to fine-tune open LLMs like Llama 2 on AWS Trainium.
A comprehensive tutorial on building a Deep Reinforcement Learning gym from the ground up, including the environment, agent, and training protocol.
Libraries & Code
Practical guidelines for distilling large language models.
PowerInfer is a CPU/GPU LLM inference engine leveraging activation locality for your device.
Papers & Publications
We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as the Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present a novel approach, Stream Batch, that transforms the original sequential denoising into a batched denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid, high-throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG), which requires additional U-Net computation. To mitigate the redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. In addition, we introduce a stochastic similarity filter (SSF) to optimize power consumption. Our Stream Batch achieves around a 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG leads to speeds up to 2.05x higher than conventional CFG. Combining the proposed strategies with existing mature acceleration tools lets image-to-image generation reach up to 91.07 fps on one RTX 4090, improving on the throughput of the AutoPipeline developed by Diffusers by over 59.56x. Furthermore, StreamDiffusion also significantly reduces energy consumption, by 2.39x on one RTX 3060 and 1.99x on one RTX 4090.
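For readers wondering why classifier-free guidance adds U-Net cost, here is a minimal sketch of conventional CFG only; it is not StreamDiffusion's RCFG, whose details are in the paper. The names `cfg_noise` and `toy_model` are hypothetical, and `toy_model` is a stand-in that merely counts calls rather than a real U-Net.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Assumed signature for illustration: (noisy latent, prompt or None) -> predicted noise
NoiseModel = Callable[[Vector, Optional[str]], Vector]


def cfg_noise(model: NoiseModel, latent: Vector, prompt: str, scale: float) -> Vector:
    """Conventional CFG: two model evaluations per denoising step."""
    eps_cond = model(latent, prompt)  # conditional pass
    eps_uncond = model(latent, None)  # extra unconditional (negative) pass
    # Guided estimate: move from the unconditional prediction toward the
    # conditional one, scaled by the guidance weight.
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


calls = {"n": 0}  # counts how often the stand-in model is evaluated


def toy_model(latent: Vector, prompt: Optional[str]) -> Vector:
    """Hypothetical stand-in for a U-Net; shifts the latent when a prompt is given."""
    calls["n"] += 1
    shift = 0.0 if prompt is None else 1.0
    return [x + shift for x in latent]


guided = cfg_noise(toy_model, [0.0, 2.0], "a cat", scale=1.5)
print(guided, calls["n"])
```

Each guided step here evaluates the model twice, which is the redundant computation the abstract says RCFG cuts down to one or zero negative conditional passes.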
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. While recent approaches adapt VLMs to driving via single-round visual question answering (VQA), human drivers reason about decisions in multiple steps. Starting from the localization of key objects, humans estimate object interactions before taking actions. The key insight is that with our proposed task, Graph VQA, where we model graph-structured reasoning through perception, prediction and planning question-answer pairs, we obtain a suitable proxy task to mimic the human reasoning process. We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent baseline performs end-to-end autonomous driving competitively in comparison to state-of-the-art driving-specific architectures. Notably, its benefits are pronounced when it is evaluated zero-shot on unseen objects or sensor configurations. We hope this work can be the starting point to shed new light on how to apply VLMs for autonomous driving. To facilitate future research, all code, data, and models are available to the public.
In this paper, we start by training End-to-End Automatic Speech Recognition (ASR) models using Federated Learning (FL) and examining the fundamental considerations that can be pivotal in minimizing the performance gap, in terms of word error rate, between models trained using FL and their centralized counterparts. Specifically, we study the effect of (i) adaptive optimizers, (ii) loss characteristics via altering Connectionist Temporal Classification (CTC) weight, (iii) model initialization through seed start, (iv) carrying over modeling setup from experiences in centralized training to FL, e.g., pre-layer or post-layer normalization, and (v) FL-specific hyperparameters, such as number of local epochs, client sampling size, and learning rate scheduler, specifically for ASR under heterogeneous data distribution. We shed light on how some optimizers work better than others by inducing smoothness. We also summarize the applicability of algorithms and trends, and propose best practices from prior works in FL (in general) toward End-to-End ASR models.