Deep Learning Weekly: Issue #299
StableVicuna, Google AI's Visual Blocks for Accelerating ML Prototyping, Hyperbolic Deep RL, a paper on Neural Architecture Search in Polynomial Complexity, and many more!
This week in deep learning, we bring you Stability AI's StableVicuna, Google AI's Visual Blocks for Accelerating ML Prototyping, Hyperbolic Deep Reinforcement Learning, and a paper on LayerNAS: Neural Architecture Search in Polynomial Complexity.
You may also enjoy Hinton Resigns From Google Over Ethical Fears, A Notebook-First ML Observability Library, Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA), a paper on AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Stability AI releases StableVicuna, the AI World’s First Open Source RLHF LLM Chatbot
Stability AI unveils StableVicuna, the first large-scale open source chatbot trained via reinforcement learning from human feedback (RLHF).
The Andy Warhol Copyright Case That Could Transform Generative AI
The US Supreme Court is expected to make a decision on whether Andy Warhol’s portraits of the musician Prince were transformative or copyright infringement, which could impact how copyright law is applied to AI-generated works.
Researchers use AI combined with MRI to decode human thoughts
Researchers at the University of Texas at Austin have been using artificial intelligence in combination with fMRI scans to translate brain activity into continuous text.
Geoffrey Hinton, a pioneer in artificial intelligence, resigns from Google over ethical fears
Geoffrey Hinton, a pioneer in AI and a longtime leader of Google’s AI research division, has resigned from his position, citing growing concerns about the ethical risks of the technology he helped create.
Now Shipping: DGX H100 Systems Bring Advanced AI Capabilities to Industries Worldwide
Customers from Tokyo to Stockholm will plug into NVIDIA’s latest AI supercomputers to advance workloads that include generative AI across manufacturing, healthcare, robotics and more.
PricewaterhouseCoopers announces multiyear $1B investment into generative AI
PricewaterhouseCoopers announced plans to invest $1 billion in generative AI, expanding its offerings so that its U.S. clients can use the technology to power their businesses.
MLOps
Scaling deep retrieval with TensorFlow Recommenders and Vertex AI Matching Engine
A deep dive into custom deep retrieval techniques, and a demonstration of how to build a playlist recommendation system via a candidate retrieval workflow with Vertex AI.
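To make the candidate retrieval idea concrete, here is a minimal two-tower sketch with TensorFlow Recommenders, loosely following the library's standard retrieval pattern rather than the post's exact code; the playlist/track vocabularies, embedding size, and feature names are illustrative assumptions.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

playlist_ids = ["p1", "p2", "p3"]       # hypothetical query vocabulary
track_ids = ["t1", "t2", "t3", "t4"]    # hypothetical candidate vocabulary

def make_tower(vocab, dim=32):
    # String id -> integer index -> dense embedding
    return tf.keras.Sequential([
        tf.keras.layers.StringLookup(vocabulary=vocab, mask_token=None),
        tf.keras.layers.Embedding(len(vocab) + 1, dim),
    ])

class PlaylistRetrievalModel(tfrs.Model):
    def __init__(self):
        super().__init__()
        self.query_tower = make_tower(playlist_ids)
        self.candidate_tower = make_tower(track_ids)
        candidates = (tf.data.Dataset.from_tensor_slices(track_ids)
                      .batch(128)
                      .map(self.candidate_tower))
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(candidates=candidates)
        )

    def compute_loss(self, features, training=False):
        query_emb = self.query_tower(features["playlist_id"])
        cand_emb = self.candidate_tower(features["track_id"])
        return self.task(query_emb, cand_emb)
```

After training, the candidate embeddings from a model like this are what would be indexed in Vertex AI Matching Engine for approximate nearest-neighbor lookup.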
Reduce your model training spending
Model training is expensive, and its cost is directly proportional to training time. This blog shows you how to make data-informed decisions and cut costs.
Visual Blocks for ML: Accelerating machine learning prototyping with interactive tools
An article on Google AI’s new visual programming platform for rapid and iterative development of end-to-end ML-based multimedia applications.
Kangas: The Pandas of Computer Vision
Similar to how Pandas revolutionized the way data analysts work with tabular data, Kangas is doing the same for computer vision tasks.
Serving With TF and GKE: Stable Diffusion
A post that discusses how TensorFlow Serving (TF Serving) and Google Kubernetes Engine (GKE) can serve Stable Diffusion with online deployment.
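For context, once a Stable Diffusion SavedModel is behind TF Serving, clients hit the standard REST predict endpoint. The sketch below uses TF Serving's documented `/v1/models/<name>:predict` route; the host, model name ("stable-diffusion"), and input signature are assumptions that depend on how the model is exported and deployed on GKE in the post.

```python
import requests

SERVING_URL = "http://<gke-load-balancer-ip>:8501/v1/models/stable-diffusion:predict"

payload = {"instances": [{"prompt": "a photograph of an astronaut riding a horse"}]}
resp = requests.post(SERVING_URL, json=payload)
resp.raise_for_status()
images = resp.json()["predictions"]  # decoded according to the exported signature
```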
Learning
Hyperbolic Deep Reinforcement Learning
A post that goes through the basics of hyperbolic geometry, shows empirically that it provides a good inductive bias for many RL problems, and describes a practical regularization procedure.
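A common way to make a deep RL representation "hyperbolic" is to map the encoder's Euclidean output onto the Poincaré ball with the exponential map at the origin before the policy or value head. The snippet below is a minimal sketch of that map, not the post's code; the curvature and clipping constant are assumptions.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Map Euclidean vectors v into the Poincare ball of curvature c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

features = torch.randn(8, 64)           # e.g. encoder output for 8 states
hyperbolic_latents = expmap0(features)  # points strictly inside the unit ball
```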
Stanford AI Lab Papers and Talks at ICLR 2023
A compilation of Stanford AI Lab Papers and Talks at ICLR 2023.
Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA)
A technical article on how to tune an LLM with Low-Rank Adaptation (LoRA) in a computationally efficient manner.
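The core idea of LoRA is to freeze the pretrained weight and learn only a low-rank update B·A, scaled by alpha/r. Here is a minimal PyTorch sketch of that idea (not the article's exact code); the rank, alpha, and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                     # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus trainable low-rank correction
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Only `lora_A` and `lora_B` receive gradients, which is what makes the finetuning parameter-efficient.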
An Introduction to Multimodal Models
A post that provides an overview of diverse applications and state-of-the-art techniques for training and evaluating multimodal models.
In-Context Learning, In Context
A comprehensive article on the new developments in in-context learning, as well as opinions about its future.
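As a quick toy illustration of in-context learning: the model is conditioned on a few labeled examples in the prompt and asked to continue the pattern, with no gradient updates. The examples below are made up.

```python
few_shot_prompt = """Review: The plot dragged and the acting was flat.
Sentiment: negative

Review: A beautiful, moving film with a stellar cast.
Sentiment: positive

Review: I lost track of time, it was that good.
Sentiment:"""
# A sufficiently large LLM should continue with "positive",
# purely from the in-context examples above.
```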
Libraries & Code
langchain-serve helps you deploy your LangChain apps on Jina AI Cloud in just a matter of seconds.
An open-source efficient deep learning framework/compiler, written in Python.
A notebook-first library that provides MLOps insights at lightning speed with zero-config observability for model drift, performance, and data quality.
Papers & Publications
LayerNAS: Neural Architecture Search in Polynomial Complexity
Abstract:
Neural Architecture Search (NAS) has become a popular method for discovering effective model architectures, especially for target hardware. As such, NAS methods that find optimal architectures under constraints are essential. In our paper, we propose LayerNAS to address the challenge of multi-objective NAS by transforming it into a combinatorial optimization problem, which effectively constrains the search complexity to be polynomial.
For a model architecture with L layers, we perform layerwise-search for each layer, selecting from a set of search options S. LayerNAS groups model candidates based on one objective, such as model size or latency, and searches for the optimal model based on another objective, thereby splitting the cost and reward elements of the search. This approach limits the search complexity to O(H⋅|S|⋅L), where H is a constant set in LayerNAS.
Our experiments show that LayerNAS is able to consistently discover superior models across a variety of search spaces in comparison to strong baselines, including search spaces derived from NATS-Bench, MobileNetV2 and MobileNetV3.
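To illustrate the layerwise idea described in the abstract (this is an unofficial sketch, not the authors' implementation): candidates are bucketed by one objective (e.g., model size) into at most H groups, and within each bucket only the best model on the other objective is kept, so each layer adds at most H·|S| evaluations. The `bucket_of` and `score_of` callables here are hypothetical stand-ins for cost and quality estimators.

```python
def layerwise_search(layers, options, bucket_of, score_of, num_buckets):
    # buckets: cost bucket -> best (architecture, score) seen so far
    buckets = {0: ([], 0.0)}
    for _ in layers:
        new_buckets = {}
        for arch, _ in buckets.values():
            for opt in options:                      # |S| options per layer
                cand = arch + [opt]
                b = min(bucket_of(cand), num_buckets - 1)
                s = score_of(cand)
                if b not in new_buckets or s > new_buckets[b][1]:
                    new_buckets[b] = (cand, s)       # keep the best per bucket
        buckets = new_buckets                        # at most H entries survive
    return max(buckets.values(), key=lambda t: t[1])
```

Because at most H candidates survive each layer, the total search cost stays O(H·|S|·L), matching the complexity stated above.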
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Abstract:
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease.
DataComp: In search of the next generation of multimodal datasets
Abstract:
Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets. We provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources.
Our baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. We introduce DataComp-1B, a dataset created by applying a simple filtering algorithm to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9x less training compute. We also outperform OpenAI's CLIP ViT-L/14 by 3.7 percentage points, which is trained with the same compute budget as our model. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets.
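As a rough sketch of the kind of "simple filtering" a DataComp participant might submit: keep image-text pairs whose precomputed image-text similarity exceeds a threshold. The field name and threshold below are assumptions for illustration, not the DataComp-1B recipe.

```python
def filter_pool(candidates, threshold=0.3):
    """candidates: iterable of dicts with a precomputed 'clip_score' field."""
    return [ex for ex in candidates if ex["clip_score"] >= threshold]
```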