Deep Learning Weekly: Issue 328
Anthropic’s Claude 2.1, Architectural Patterns for Text-to-SQL, Accelerating Gen AI with PyTorch: Segment Anything Fast, Comparing Humans and GPT-4V On Abstraction and Reasoning Tasks, and many more!
This week in deep learning, we bring you Anthropic’s Claude 2.1, Architectural Patterns for Text-to-SQL: Leveraging LLMs for Enhanced BigQuery Interactions, Accelerating Generative AI with PyTorch: Segment Anything, Fast, and a paper on Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks.
You may also enjoy Microsoft's Orca 2, Mastering LLM Techniques: LLMOps, LangChain Evaluators for Language Model Validation, a paper on Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Anthropic released Claude 2.1 which includes a 200K token context window, a significant reduction in model hallucination, the ability to use tools, and more.
Microsoft released Orca 2, a pair of small language models that match or outperform models 5-10 times their size on complex reasoning tasks in zero-shot settings.
Hundreds of OpenAI employees have signed a letter demanding that the remaining board members resign, or else the signatories will join Sam Altman’s new venture at Microsoft.
US venture funding going to companies in the San Francisco Bay Area hit a multiyear high, boosted largely by the AI boom.
Google DeepMind announced Lyria, an advanced AI music generation model, along with two AI experiments designed to open a new playground for creativity.
AI21 Labs, a generative AI startup, has secured additional investment to close a $208 million funding round that raises its valuation to $1.4 billion.
MIT CSAIL researchers innovate with synthetic imagery to train AI, paving the way for more efficient and bias-reduced machine learning.
MLOps & LLMOps
An article that delves into architectural patterns for Text-to-SQL, demonstrating the growing reliance on Large Language Models for this complex task.
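The core pattern these architectures share can be sketched in a few lines: put the table schema into the prompt, have the LLM produce SQL, apply a safety check, then execute against the warehouse. Here is a minimal, stdlib-only sketch — the `generate_sql` stub stands in for a real LLM call, and the `orders` table is invented for illustration (a real deployment would target BigQuery rather than SQLite):

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)"

def generate_sql(question: str, schema: str) -> str:
    # Stand-in for an LLM call: a real system would send the schema and
    # the natural-language question to a model and receive SQL back.
    return "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"

def is_safe(sql: str) -> bool:
    # Guardrail: only allow read-only queries before execution.
    return sql.strip().lower().startswith("select")

def answer(question: str) -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, "EU", 10.0), (2, "US", 5.0), (3, "EU", 2.5)])
    sql = generate_sql(question, SCHEMA)
    if not is_safe(sql):
        raise ValueError("refusing to run non-SELECT SQL")
    return conn.execute(sql).fetchall()

print(answer("Total sales per region?"))  # [('EU', 12.5), ('US', 5.0)]
```

The validation step matters in practice: model-generated SQL should never be executed unchecked against production data.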
NVIDIA outlines the generative AI app development journey, defines the concepts of GenAIOps and LLMOps, and compares them with MLOps.
An end-to-end tutorial on how to deploy and speed up Embeddings Model inference using AWS Inferentia2 and optimum-neuron on Amazon SageMaker.
An article that shares a journey of hosting production-grade LLMs in a Whole-of-Government environment.
The first part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch.
A guide that explores various advanced string evaluation methods for AI-powered applications.
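The basic string evaluators such guides cover — exact match after normalization, and edit-distance-style similarity — can be approximated with the standard library alone. A hedged sketch (function names are illustrative, not from any particular framework):

```python
import difflib
import string

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so "Paris." matches "paris".
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).strip()

def exact_match(prediction: str, reference: str) -> bool:
    # Strict evaluator: pass/fail on the normalized strings.
    return normalize(prediction) == normalize(reference)

def similarity(prediction: str, reference: str) -> float:
    # Softer evaluator: edit-distance-style score in [0, 1], 1.0 = identical.
    return difflib.SequenceMatcher(
        None, normalize(prediction), normalize(reference)
    ).ratio()

print(exact_match("Paris.", "paris"))  # True
print(similarity("The capital is Paris", "Paris is the capital"))
```

Production setups typically layer stricter checks (exact match, JSON validity) under softer ones (similarity, LLM-as-judge), failing fast on the cheap checks first.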
The first in a series of upcoming blogs covering efficient memory usage with ONNX Runtime quantization updates, as well as cross-platform usage scenarios.
Libraries & Code
An app that uses tldraw and the gpt-4-vision API to generate HTML based on a wireframe you draw.
A fast inference library for running LLMs locally on modern consumer-class GPUs.
A repository that provides examples for quickly getting started with fine-tuning for domain adaptation and running inference with the fine-tuned models.
Papers & Publications
We explore the abstract reasoning abilities of text-only and multimodal versions of GPT-4, using the ConceptARC benchmark, which is designed to evaluate robust understanding and reasoning with core-knowledge concepts. We extend the work of Moskvichev et al. by evaluating GPT-4 on more detailed, one-shot prompting (rather than simple, zero-shot prompts) with text versions of ConceptARC tasks, and by evaluating GPT-4V, the multimodal version of GPT-4, on zero- and one-shot prompts using image versions of the simplest tasks. Our experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
Language models have shown promise in various tasks but can be affected by undesired data during training, fine-tuning, or alignment. For example, if some unsafe conversations are wrongly annotated as safe ones, the model fine-tuned on these samples may be harmful. Therefore, the correctness of annotations, i.e., the credibility of the dataset, is important. This study focuses on the credibility of real-world datasets, including the popular benchmarks Jigsaw Civil Comments, Anthropic Harmless & Red Team, PKU BeaverTails & SafeRLHF, that can be used for training a harmless language model. Given the cost and difficulty of cleaning these datasets by humans, we introduce a systematic framework for evaluating the credibility of datasets, identifying label errors, and evaluating the influence of noisy labels in the curated language data, specifically focusing on unsafe comments and conversation classification. With the framework, we find and fix an average of 6.16% label errors in 11 datasets constructed from the above benchmarks. The data credibility and downstream learning performance can be remarkably improved by directly fixing label errors, indicating the significance of cleaning existing real-world datasets.
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder.
We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences into consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features, producing compact but expressive representations per snippet.
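The snippet-partitioning idea from the abstract can be illustrated in a few lines: split a long per-timestep feature sequence into consecutive snippets, then collapse each snippet to one compact vector. This is a toy stdlib sketch only — mean pooling stands in for the learned Combiner, and the sequence lengths and dimensions are invented:

```python
def partition(features, snippet_len):
    # Split a long sequence of per-timestep feature vectors into
    # consecutive, non-overlapping snippets.
    return [features[i:i + snippet_len]
            for i in range(0, len(features), snippet_len)]

def combine(snippet):
    # Stand-in for the learned Combiner: reduce each snippet to a single
    # compact vector (elementwise mean over timesteps).
    dim = len(snippet[0])
    return [sum(vec[d] for vec in snippet) / len(snippet) for d in range(dim)]

# 8 timesteps of 2-d features -> 2 snippets of length 4 -> 2 compact vectors,
# shortening the sequence the autoregressive model must consume.
seq = [[float(t), float(t) * 2] for t in range(8)]
compact = [combine(s) for s in partition(seq, 4)]
print(compact)  # [[1.5, 3.0], [5.5, 11.0]]
```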
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by both learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
Thanks for reading Deep Learning Weekly! Subscribe for free to receive new posts and support my work.