Deep Learning Weekly: Issue #306
Meta AI's Voicebox, On-device acceleration of Large Diffusion Models via GPU-aware optimizations, Private LLMs, and a paper on Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation.
This week in deep learning, we bring you Meta AI's Voicebox, On-device acceleration of Large Diffusion Models via GPU-aware optimizations, Private LLMs, and a paper on Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation.
You may also enjoy Function calling capability in the new OpenAI GPT update, NeMo Guardrails, Fine-Tune MMS Adapter Models for low-resource ASR, a paper on Augmenting Language Models with Long-Term Memory, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Meta AI achieves a breakthrough by developing Voicebox, the first model that can generalize to speech-generation tasks with state-of-the-art performance.
MLOps platform Comet announced a strategic partnership with Snowflake aimed at empowering data scientists to build superior machine learning (ML) models at an accelerated pace.
OpenAI has announced several updates including a new function calling capability in the Chat Completions API, more steerable versions of gpt-4 and gpt-3.5-turbo, and more.
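With function calling, you describe functions to the model as JSON Schema, and the model can respond with a structured `function_call` instead of plain text. As a rough sketch (the `get_current_weather` function and its schema are invented here purely for illustration, and the mocked response only mimics the shape of what the API returns):

```python
import json

# An illustrative function schema in the JSON Schema format the
# Chat Completions API expects; you would pass it to the API as
# functions=[weather_fn] alongside function_call="auto".
weather_fn = {
    "name": "get_current_weather",
    "description": "Get the current weather for a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name, e.g. Paris"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}

def extract_call(message: dict):
    """Parse the function_call from an assistant message, if present.

    The model returns the arguments as a JSON-encoded string, so they
    must be decoded before dispatching to your actual function.
    """
    call = message.get("function_call")
    if call is None:
        return None
    return call["name"], json.loads(call["arguments"])

# A mocked assistant message shaped like the API's response when the
# model decides to call the function (content is None in that case):
mock_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "get_current_weather",
        "arguments": '{"location": "Paris", "unit": "celsius"}',
    },
}

name, args = extract_call(mock_message)
```

In a real round trip, you would then run the named function with the parsed arguments and send its result back to the model in a follow-up message so it can produce the final answer.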
A four-week-old startup raised a $113 million funding round to build, train, and apply LLMs and generative AI.
Google AI released a conceptual framework to help collaboratively secure AI technology.
A team of researchers from the University of Kansas has developed a tool that distinguishes AI-generated academic writing from human-written text with over 99 percent accuracy.
Galileo, a San Francisco-based artificial intelligence startup, announced today the launch of Galileo LLM Studio, a platform to diagnose and fix issues with large language models.
Opera announced that its new generative AI-enabled browser Opera One is out of testing and available with numerous improvements.
An article about the init_module() feature in the upcoming Lightning Fabric release that keeps peak GPU memory usage under control and enables fast loading times.
A blogpost that presents Google AI’s new inference optimization method to reduce on-device latency for large diffusion models.
An article that highlights the new Core ML optimizations, as well as how to enable faster stable diffusion models with these.
NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
A post that showcases how to build a repeatable process with low-code tools like Amazon SageMaker Autopilot so that it can be seamlessly integrated into an MLOps environment.
A comprehensive blogpost about cleaning up survey responses using OpenAI’s GPT Model.
This article shows how to use GPT4All, LangChain, and Cerebrium to build and deploy a local chatbot.
An article that introduces the concept of calibration in deep neural networks — how well a model's predicted confidence matches its actual accuracy.
A technical article that shows how to fine-tune MMS Adapter Models for low-resource ASR.
Libraries & Code
The Vercel AI SDK is a library for building edge-ready AI-powered streaming text and chat UIs.
A Python library to benchmark machine learning systems' vulnerability to adversarial examples.
Specify what you want it to build; the AI asks for clarification and then builds it.
Papers & Publications
Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, allowing us to synthesize diverse images that convey highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation tasks is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation -- given a guidance image and a target text prompt, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text, while preserving the semantic layout of the source image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the target image, requiring no training or fine-tuning and applicable to both real and generated guidance images. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global qualities such as lighting and color.
Existing large language models (LLMs) can only afford fixed-size inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models Augmented with Long-Term Memory (LongMem), which enables LLMs to memorize long histories. We design a novel decoupled network architecture with the original backbone LLM frozen as a memory encoder and an adaptive residual side-network as a memory retriever and reader. Such a decoupled memory design can easily cache and update long-term past contexts for memory retrieval without suffering from memory staleness. Enhanced with memory-augmented adaptation training, LongMem can thus memorize long past context and use long-term memory for language modeling. The proposed memory retrieval module can handle unlimited-length context in its memory bank to benefit various downstream tasks. Typically, LongMem can enlarge the long-form memory to 65k tokens and thus cache many-shot extra demonstration examples as long-form memory for in-context learning. Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves remarkable improvements on memory-augmented in-context learning over LLMs. The results demonstrate that the proposed method is effective in helping language models to memorize and utilize long-form content.
We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations.
The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the caption and tagging tasks, supervised by the original texts and parsed tags, respectively. Thirdly, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned using a smaller but higher-quality dataset.
We evaluate the tagging capabilities of RAM on numerous benchmarks and observe impressive zero-shot performance, significantly outperforming CLIP and BLIP. Remarkably, RAM even surpasses fully supervised approaches and exhibits competitive performance with the Google tagging API.
Thanks for reading Deep Learning Weekly! Subscribe for free to receive new posts and support my work.