Deep Learning Weekly: Issue 330
Introducing IBM and Meta's AI Alliance, LMQL — SQL for Language Models, Growth and Form in a Toy Model of Superposition, Continual Learning for Instruction Following from Real-time Feedback, and more!
This week in deep learning, we bring you Introducing IBM and Meta's AI Alliance, LMQL — SQL for Language Models, Growth and Form in a Toy Model of Superposition, and a paper on Continual Learning for Instruction Following from Real-time Feedback.
You may also enjoy Millions of new materials discovered with deep learning, Evaluating Multi-Modal Retrieval-Augmented Generation, Enable faster training with Amazon SageMaker data parallel library, a paper on Translatotron 3: Speech to Speech Translation with Monolingual Data, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
IBM and Meta launched the AI Alliance — a group of leading organizations coming together to support open innovation and open science in AI.
Researchers from MIT and ETH Zurich have developed a new machine learning technique that could be applied to many complex logistical challenges, such as package routing, vaccine distribution, and power grid management.
Documents show that OpenAI signed a letter of intent to spend $51 million on brain-inspired chips developed by startup Rain.
Dell and Imbue, an independent AI research company, have entered into a $150 million agreement to build a new high-performance computing cluster for training foundation models optimized for reasoning.
DeepMind’s AI tool GNoME finds 2.2 million new crystals, including 380,000 stable materials that could power future technologies.
MLOps & LLMOps
A blog post that highlights the differences between evaluating multi-modal RAG systems and text-only RAG systems.
A deep dive into the practical use cases of LMQL – an open-source programming language for language models.
A blog post that covers how to use serverless compute for LLMs, and how to deploy the Mistral 7B model to AWS Lambda using llama.cpp with OpenBLAS support.
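The serverless pattern the post describes boils down to a thin Lambda handler around a llama.cpp binding. Below is a hedged sketch, not the post's code: the `make_handler` factory and event shape are illustrative assumptions, and in a real deployment `llm` would be built from `llama_cpp.Llama(model_path=...)` with an OpenBLAS-enabled build packaged into the function image.

```python
import json

def make_handler(llm):
    """Wrap any prompt -> completion callable as an AWS Lambda handler."""
    def handler(event, context):
        prompt = json.loads(event["body"])["prompt"]
        completion = llm(prompt)  # on Lambda: a llama.cpp inference call
        return {"statusCode": 200,
                "body": json.dumps({"completion": completion})}
    return handler

# Local smoke test with a stand-in model; on Lambda, llm would be e.g.
# llama_cpp.Llama(model_path="/opt/model.gguf") (hypothetical path).
echo = make_handler(lambda p: p.upper())
resp = echo({"body": json.dumps({"prompt": "hello"})}, None)
print(resp["statusCode"])  # → 200
```

Injecting the model as a callable keeps the handler testable without loading multi-gigabyte weights locally.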
An article that focuses on the technical details of constructing the ML Training Platform (MLTP) within Instacart’s Griffin 2.0.
An article that outlines the steps for configuring VS Code and getting the most out of it as a data scientist or machine learning engineer.
A post that distills Dynamical and Bayesian Phase Transitions in a Toy Model of Superposition, which studies the developmental stages of the Toy Model of Superposition from the perspective of singular learning theory.
A post that gives a high-level overview of how the SageMaker Distributed Data Parallel library works, how to enable it in SageMaker training scripts, and the performance improvements to expect.
A comprehensive guide that aims to demystify CNNs, providing insights into their structure, functionality, and why they are so effective for image-related tasks.
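The core operation such guides build up to can be shown in a few lines. This is a minimal valid-mode cross-correlation sketch (the "convolution" as most deep learning frameworks implement it), not tied to any particular library:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image
    and take the elementwise product-sum at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

edge = np.array([[1., -1.]])          # simple horizontal edge detector
img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.]])
print(conv2d(img, edge))              # fires (-1) exactly at the 0→1 edge
```

A CNN layer learns many such kernels and applies them with shared weights across every image position, which is what makes convolutions so parameter-efficient for images.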
Libraries & Code
Simple and efficient PyTorch-native transformer text generation in under 1,000 lines of Python.
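The essence of such a minimal generator is a short decoding loop. Here is a framework-agnostic greedy-decoding sketch in NumPy; the toy `logits_fn` model is an illustrative assumption standing in for a real transformer:

```python
import numpy as np

def generate(logits_fn, ids, max_new_tokens):
    """Greedy autoregressive decoding: repeatedly append the argmax token.
    logits_fn maps a token-id prefix to next-token logits."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)
        ids.append(int(np.argmax(logits)))
    return ids

# Toy "model" over a 10-token vocab: always prefers (last token + 1) mod 10.
toy = lambda ids: np.eye(10)[(ids[-1] + 1) % 10]
print(generate(toy, [0], 3))  # → [0, 1, 2, 3]
```

Real implementations add a KV cache so each step reuses past attention states instead of re-encoding the whole prefix, plus sampling strategies such as temperature and top-k.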
A code-first agent framework for seamlessly planning and executing data analytics tasks.
A new state space model architecture showing promising performance on information-dense data such as language modeling.
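At its core, a state space model applies a linear recurrence along the sequence. The sketch below shows the generic (non-selective) discrete SSM recurrence only; it omits the input-dependent parameterization that the new architecture adds to handle information-dense data:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Generic discrete state space recurrence:
        x_t = A x_{t-1} + B u_t,    y_t = C x_t
    applied sequentially over a scalar input sequence u."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(float(C @ x))
    return ys

A = np.array([[0.5]])   # state decay: how much history is retained
B = np.array([1.0])     # input projection
C = np.array([1.0])     # output projection
print(ssm_scan(A, B, C, [1.0, 1.0]))  # → [1.0, 1.5]
```

Because the recurrence is linear, it can also be computed as a convolution or a parallel scan, which is what makes these models fast to train on long sequences.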
Papers & Publications
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, lexical symbol meanings can also be determined and even redefined by their structural role in a long context. In this paper, we ask: is it possible for a language model to be performant without any fixed token embeddings? Such a language model would have to rely entirely on the co-occurrence and repetition of tokens in the context rather than the a priori identity of any token. To answer this, we study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. First, we prove that we can construct a lexinvariant LM to converge to the true language model at a uniform rate that is polynomial in terms of the context length, with a constant factor that is sublinear in the vocabulary size. Second, to build a lexinvariant LM, we simply encode tokens using random Gaussian vectors, such that each token maps to the same representation within each sequence but different representations across sequences. Empirically, we demonstrate that it can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We further explore two properties of the lexinvariant language models: First, given text generated from a substitution cipher of English, it implicitly implements Bayesian in-context deciphering and infers the mapping to the underlying real tokens with high accuracy. Second, it has on average 4X better accuracy on synthetic in-context reasoning tasks. Finally, we discuss regularizing standard language models towards lexinvariance and potential practical applications.
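The paper's embedding construction (random Gaussian vectors per token, consistent within a sequence but resampled across sequences) can be sketched in a few lines. This is a minimal illustration, not the authors' code:

```python
import numpy as np

def lexinvariant_embed(tokens, dim=16, rng=None):
    """Embed token ids with per-sequence random Gaussian vectors: a token
    maps to the same vector within this sequence, but gets a fresh draw
    for every new sequence, so only repetition/co-occurrence is informative."""
    if rng is None:
        rng = np.random.default_rng()
    table = {}
    rows = []
    for t in tokens:
        if t not in table:
            table[t] = rng.standard_normal(dim)
        rows.append(table[t])
    return np.stack(rows)

seq = [5, 9, 5, 2]                                   # note the repeated 5
a = lexinvariant_embed(seq, rng=np.random.default_rng(0))
b = lexinvariant_embed(seq, rng=np.random.default_rng(1))
# Within a: positions 0 and 2 share a vector; across a and b they differ.
```

Since no token has a stable identity across sequences, a model trained on such embeddings must infer meaning purely from structure in the context, which is exactly the lexinvariance property the paper studies.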
We propose and deploy an approach to continually train an instruction-following agent from feedback provided by users during collaborative interactions. During interaction, human users instruct an agent using natural language, and provide real-time binary feedback as they observe the agent following their instructions. We design a contextual bandit learning approach, converting user feedback to immediate reward. We evaluate through thousands of human-agent interactions, demonstrating 15.4% absolute improvement in instruction execution accuracy over time. We also show our approach is robust to several design variations, and that the feedback signal is roughly equivalent to the learning signal of supervised demonstration data.
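The contextual-bandit recipe (binary user feedback converted into an immediate reward driving a policy-gradient update) can be sketched on a toy instruction-following task. The environment, reward mapping, and hyperparameters below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, n_act, lr = 4, 4, 0.5
W = np.zeros((n_ctx, n_act))               # linear softmax policy

def policy(ctx):
    logits = W[ctx]
    p = np.exp(logits - logits.max())
    return p / p.sum()

for _ in range(2000):
    ctx = int(rng.integers(n_ctx))         # simulated instruction context
    p = policy(ctx)
    a = int(rng.choice(n_act, p=p))        # agent executes an action
    reward = 1.0 if a == ctx else -1.0     # binary feedback -> immediate reward
    grad = -p
    grad[a] += 1.0                         # gradient of log pi(a | ctx)
    W[ctx] += lr * reward * grad           # single-step bandit update

greedy_correct = sum(int(np.argmax(W[c])) == c for c in range(n_ctx))
```

Treating each instruction as an independent bandit round sidesteps credit assignment over long horizons: the thumbs-up/down arrives right after the action it judges, so a one-step update suffices.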
This paper presents Translatotron 3, a novel approach to train a direct speech-to-speech translation model from monolingual speech-text datasets only, in a fully unsupervised manner. Translatotron 3 combines a masked autoencoder, unsupervised embedding mapping, and back-translation to achieve this goal. Experimental results in speech-to-speech translation tasks between Spanish and English show that Translatotron 3 outperforms a baseline cascade system, reporting an 18.14 BLEU point improvement on the synthesized Unpaired-Conversational dataset. In contrast to supervised approaches that necessitate real paired data, which is unavailable, or specialized modeling to replicate para-/non-linguistic information, Translatotron 3 showcases its capability to retain para-/non-linguistic information such as pauses, speaking rates, and speaker identity.
Thanks for reading Deep Learning Weekly! Subscribe for free to receive new posts and support my work.