Deep Learning Weekly: Issue #304
NVIDIA's Neuralangelo, LLM Inference Container for SageMaker, Testing Language Models (and Prompts) Like We Test Software, Augmenting Verbal Communication with On-the-fly Visuals.
This week in deep learning, we bring you NVIDIA's Neuralangelo for Reconstructing 3D Scenes, Hugging Face LLM Inference Container for SageMaker, Testing Language Models (and Prompts) Like We Test Software, and a paper on Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals.
You may also enjoy Introducing Aya: An Open Science Initiative to Accelerate Multilingual AI Progress, Train a Text-to-Image Model Using Kubeflow, Graph Analysis with PyG Explainability, a paper on Fine-Tuning Language Models with Just Forward Passes, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Neuralangelo Research Reconstructs 3D Scenes
Neuralangelo, a new AI model by NVIDIA Research for 3D reconstruction using neural networks, turns 2D video clips into detailed 3D structures.
Introducing Aya: An Open Science Initiative to Accelerate Multilingual AI Progress
Cohere AI introduces Aya—an open-science endeavor aimed at building a multilingual language model via instruction tuning that harnesses the collective wisdom and contributions of people from all over the world.
Google reportedly invests in generative AI startup Runway at $1.5B valuation
Google has reportedly invested in Runway AI, a New York-based startup that develops generative artificial intelligence models.
Introducing InkyMM, the first open source, commercializable multi-modal model
OctoML announced InkyMM, the first open-source, fully commercializable Image + Text LLM.
MLOps
Introducing the Hugging Face LLM Inference Container for Amazon SageMaker
An article on how to deploy open-source LLMs, such as BLOOM, to Amazon SageMaker for inference using the new Hugging Face LLM Inference Container.
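For orientation, a minimal deployment sketch with the SageMaker Python SDK, assuming a configured execution role; the model ID, number of GPUs, and instance type below are illustrative rather than taken from the article:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role is configured

# Retrieve the Hugging Face LLM inference container image
llm_image = get_huggingface_llm_image_uri("huggingface")

# Point the container at an open-source LLM such as BLOOM via environment variables
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "bigscience/bloom-560m",  # illustrative model ID
        "SM_NUM_GPUS": "1",
    },
)

# Deploy a real-time endpoint and run a quick test generation
predictor = llm_model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
print(predictor.predict({"inputs": "Deep learning is"}))
```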
A blog post that highlights how Comet's solutions can be used together to perform model CI/CD.
Train a Text-to-Image Model Using Kubeflow
A technical guide on how to train a text-to-image model using Kubeflow.
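As a rough sketch of how such a job can be wired up with the Kubeflow Pipelines SDK (kfp v2); the component body and dataset URI are placeholders, not the guide's actual text-to-image training code:

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10")
def train_text_to_image(epochs: int, dataset_uri: str) -> str:
    # Placeholder body; a real pipeline would pull data and launch the trainer here
    return f"trained for {epochs} epochs on {dataset_uri}"

@dsl.pipeline(name="text-to-image-training")
def training_pipeline(epochs: int = 10, dataset_uri: str = "s3://my-bucket/data"):
    train_text_to_image(epochs=epochs, dataset_uri=dataset_uri)

# Compile to a pipeline spec that can be uploaded to a Kubeflow Pipelines deployment
compiler.Compiler().compile(training_pipeline, "text_to_image_pipeline.yaml")
```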
Scale your machine learning workloads on Amazon ECS powered by AWS Trainium instances
A tutorial on how to run containerized ML training jobs on Amazon ECS, backed by AWS Trainium instances, to deploy, manage, and scale your workloads.
Optimizing Large Language Model Performance with ONNX on DataRobot MLOps
A tutorial on how to significantly increase inference speed by converting your language model to the ONNX format.
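A minimal export sketch using torch.onnx and a Hugging Face checkpoint (the model ID is illustrative, and the DataRobot MLOps serving step from the tutorial is omitted):

```python
import torch
import onnxruntime as ort
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
model.config.return_dict = False  # export a plain tuple instead of a ModelOutput

# Trace the model with a dummy input and export it to the ONNX format
dummy = tokenizer("Export me to ONNX", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)

# Sanity-check the exported graph with ONNX Runtime
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
print(session.run(None, {k: v.numpy() for k, v in dummy.items()}))
```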
Learning
Graph Convolutional Networks for NLP Using Comet
This article provides a brief overview of GCNs for NLP tasks and how to implement them using PyTorch and Comet.
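For a sense of the building blocks, a compact two-layer GCN sketch with PyTorch Geometric; the random node features and edge index stand in for a document/word graph, and the article's Comet experiment logging is omitted:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TextGCN(torch.nn.Module):
    """Two-layer GCN for node-level text classification."""
    def __init__(self, num_features: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)

# Illustrative shapes: 1,000 nodes (documents + words), 300-dim features, 4 classes
x = torch.randn(1000, 300)
edge_index = torch.randint(0, 1000, (2, 5000))
model = TextGCN(300, 64, 4)
logits = model(x, edge_index)
```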
Fine Tuning vs. Prompt Engineering Large Language Models
An article that provides practical descriptions of the differences between fine-tuning and prompt engineering.
Graph Analysis Made Easy with PyG Explainability
A technical guide that shows how to use PyG’s explainability module to apply the GNNExplainer algorithm to explain a GNN’s property predictions.
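A rough usage sketch of PyG's explain API (torch_geometric >= 2.2) on a toy node-classification model; the guide's property-prediction setup may differ:

```python
import torch
from torch_geometric.nn import GCNConv
from torch_geometric.explain import Explainer, GNNExplainer

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(16, 32)
        self.conv2 = GCNConv(32, 4)
    def forward(self, x, edge_index):
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)

x = torch.randn(100, 16)                      # toy node features
edge_index = torch.randint(0, 100, (2, 400))  # toy graph connectivity

explainer = Explainer(
    model=GCN(),
    algorithm=GNNExplainer(epochs=200),
    explanation_type="model",
    node_mask_type="attributes",
    edge_mask_type="object",
    model_config=dict(mode="multiclass_classification", task_level="node", return_type="raw"),
)

# Explain the model's prediction for node 10 via learned feature and edge masks
explanation = explainer(x, edge_index, index=10)
print(explanation.edge_mask.shape, explanation.node_mask.shape)
```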
Testing Language Models (and Prompts) Like We Test Software
A post that focuses on the concept of testing applications (or prompts) built with language models, in order to better understand their capabilities and limitations.
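In that spirit, a minimal sketch of behavioral prompt tests with pytest; generate() is a hypothetical stand-in for whatever model or API call an application wraps:

```python
import pytest

def generate(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM call (e.g., an API client or local model)."""
    raise NotImplementedError

@pytest.mark.parametrize("city,country", [("Paris", "France"), ("Tokyo", "Japan")])
def test_capital_lookup(city, country):
    # Behavioral check: the answer should mention the expected country
    answer = generate(f"Which country is {city} the capital of? Answer briefly.")
    assert country.lower() in answer.lower()

def test_handles_empty_input():
    # Robustness check: degenerate input should not crash the application
    answer = generate("")
    assert isinstance(answer, str)
```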
The Falcon has landed in the Hugging Face ecosystem
A deep dive into the available Falcon models, a family of SOTA language models created by the Technology Innovation Institute in Abu Dhabi.
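A quick-start sketch for loading a Falcon checkpoint with transformers; the 7B instruct variant is used for illustration, and a GPU with bfloat16 support plus the accelerate package are assumed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "tiiuae/falcon-7b-instruct"  # smaller sibling of the 40B flagship
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",       # requires accelerate
    trust_remote_code=True,  # Falcon shipped with custom modeling code at release
)

inputs = tokenizer("Write a haiku about falcons.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```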
Libraries & Code
An API store for LLMs.
A Python library and trace visualizer for language model programs.
Featureform is a virtual feature store. It enables data scientists to define, manage, and serve their ML model's features.
Papers & Publications
Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals
Abstract:
Computer-mediated platforms are increasingly facilitating verbal communication, and capabilities such as live captioning and noise cancellation enable people to understand each other better. We envision that visual augmentations that leverage semantics in the spoken language could also be helpful to illustrate complex or unfamiliar concepts. To advance our understanding of the interest in such capabilities, we conducted formative research through remote interviews (N=10) and crowdsourced a dataset of 1500 sentence-visual pairs across a wide range of contexts.
These insights informed Visual Captions, a real-time system that we integrated into a videoconferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest relevant visuals in open-vocabulary conversations. We report on our findings from a lab study (N=26) and a two-week deployment study (N=10), which demonstrate how Visual Captions has the potential to help people improve their communication through visual augmentation in various scenarios.
Let's Verify Step by Step
Abstract:
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
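To make the contrast concrete, a toy sketch of reranking sampled solutions with step-level scores from a process reward model, aggregated here as a product over steps in the spirit of the paper; the scores themselves are made up:

```python
import math

def solution_score(step_probs):
    """Aggregate per-step correctness probabilities from a process reward model
    into one solution-level score (product over steps)."""
    return math.prod(step_probs)

# Two sampled solutions to the same problem, with made-up step-level scores
candidates = {
    "solution_a": [0.98, 0.95, 0.40, 0.97],  # one dubious intermediate step
    "solution_b": [0.90, 0.92, 0.91, 0.93],  # uniformly plausible steps
}

# Process supervision rewards every step, so the dubious step drags solution_a down;
# outcome supervision would instead score each solution only on its final answer
best = max(candidates, key=lambda k: solution_score(candidates[k]))
print(best, {k: round(solution_score(v), 3) for k, v in candidates.items()})
```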
Fine-Tuning Language Models with Just Forward Passes
Abstract:
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
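A simplified single-step sketch of the MeZO idea, estimating the gradient SPSA-style from two forward passes with in-place, seed-reproducible perturbations; loss_fn and batch are user-supplied placeholders, and prompts, scheduling, and other details from the paper are omitted:

```python
import torch

def mezo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6):
    """One zeroth-order update: perturb in place, measure the loss twice, then update."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Regenerate the same random direction z from the seed instead of storing it
        gen = torch.Generator(device="cpu").manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen).to(p.device, p.dtype)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                       # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)                       # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                       # restore original parameters

        projected_grad = (loss_plus - loss_minus) / (2 * eps)

        # Update along the same direction z, regenerated from the seed once more
        gen = torch.Generator(device="cpu").manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen).to(p.device, p.dtype)
            p.data.add_(-lr * projected_grad * z)

    return loss_plus
```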