Deep Learning Weekly: Issue #315
Meta AI's SeamlessM4T, ML Pipeline Architecture Design Pattern, Do Machine Learning Models Memorize or Generalize?, a paper on Autonomous Visual Information Seeking with Large Language Models and more
This week in deep learning, we bring you Meta AI's SeamlessM4T, ML Pipeline Architecture Design Patterns, Do Machine Learning Models Memorize or Generalize?, and a paper on AVIS: Autonomous Visual Information Seeking with Large Language Models.
You may also enjoy Nature-inspired foundational model, How to Build a Fully Automated Data Drift Detection Pipeline, The NeurIPS 2023 LLM Efficiency Challenge Starter Guide, a paper on GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Meta AI introduces SeamlessM4T, a foundational multilingual and multitask model that seamlessly translates and transcribes across speech and text, supporting nearly 100 languages.
Two prominent AI researchers launched a startup, Sakana AI, that aims to build a new kind of foundation model based on nature-inspired intelligence.
Stability AI released its first Japanese language model, Japanese StableLM Alpha, the best-performing openly available LM created for Japanese speakers.
Arthur introduces Arthur Bench, an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models.
Hugging Face releases IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS), an open-access visual language model.
OpenAI has acquired Global Illumination, a small company that builds digital products and creative tools.
The UK will reportedly spend £100 million of taxpayer money to buy AI chip technology from AMD, Intel, and Nvidia.
MLOps & LLMOps
A blog that shows you how to use Infrastructure as Code with AWS Cloud Development Kit (AWS CDK) to deploy and manage Llama 2.
A blog that presents a comprehensive comparison between the outputs of SDXL 1.0, a model poised to revolutionize text-to-image synthesis, and Stable Diffusion 2.0, with a complete code tutorial and a public project for a hands-on experience.
A blog post that explores some common patterns and practices in ML pipeline stages, such as DAGs, foreach, embeddings, and data parallelism, and how they are used in prominent tech companies.
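To make the DAG and foreach patterns concrete, here is a minimal, framework-free sketch of a pipeline with a foreach fan-out followed by a join stage; the stage names and data are illustrative, not taken from the post.

```python
# Minimal DAG-style pipeline: three stages, where the middle stage
# fans out over each item ("foreach") and the last stage joins results.

def load_data():
    """Stage 1: produce the input items."""
    return [1, 2, 3]

def transform(item):
    """Stage 2 (foreach): per-item work, run once per input."""
    return item * 10

def aggregate(results):
    """Stage 3 (join): combine the fan-out results."""
    return sum(results)

def run_pipeline():
    data = load_data()
    transformed = [transform(x) for x in data]  # foreach fan-out
    return aggregate(transformed)               # join

print(run_pipeline())  # 60
```

Pipeline frameworks express the same structure declaratively, so the scheduler can run the foreach branches in parallel and retry failed stages independently.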
A blog that attempts to provide guidance on how to choose the right Generative AI approach for different use cases.
An article that explains how to design a workflow that detects data drift, notifies the data team, and triggers model retraining using Kestra, an open-source library.
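Independent of Kestra's orchestration, the drift check at the heart of such a workflow can be as simple as comparing the empirical distributions of a reference window and a recent window; the statistic and the 0.3 threshold below are illustrative assumptions, not the article's exact implementation.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a + b))

    def ecdf(xs, v):
        return sum(x <= v for x in xs) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in points)

def drift_detected(reference, current, threshold=0.3):
    """Flag drift when the distributions differ by more than the threshold."""
    return ks_statistic(reference, current) > threshold

reference = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted = [1.1, 1.2, 1.3, 1.4, 1.5]
print(drift_detected(reference, reference))  # False
print(drift_detected(reference, shifted))    # True
```

In a production workflow, a check like this would run on a schedule, and a positive result would trigger the notification and retraining tasks downstream.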
A comprehensive article describing a new architecture for LLMs that achieves training parallelism, good performance, and low inference cost simultaneously.
A short walkthrough explaining how to participate in the NeurIPS 2023 LLM Efficiency Challenge, which focuses on efficient LLM finetuning.
An interactive article that introduces the concept of grokking, and provides an illustration of the emerging field of mechanistic interpretability.
Libraries & Code
An MIT-licensed, deployable starter kit for building and customizing your own version of AI town.
A self-hosted, offline, ChatGPT-like chatbot powered by Llama 2: 100% private, with no data leaving your device.
Reusable computer vision tools.
Papers & Publications
In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. This task presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior serves as a guide for our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog.
Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases.
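As a rough illustration of the cipher-prompt idea, a Caesar cipher (one of the classic human ciphers in this family) shifts each letter by a fixed offset; the specific prompt and shift below are illustrative, not the paper's exact protocol.

```python
def caesar(text, shift=3):
    """Encipher text by shifting each letter `shift` positions,
    leaving non-letters (spaces, punctuation) intact."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

prompt = "how are you"
enciphered = caesar(prompt)          # "krz duh brx"
recovered = caesar(enciphered, -3)   # negative shift decodes
print(enciphered, "|", recovered)
```

The paper's finding is that prompts delivered in such encodings, together with a system role description and few-shot enciphered demonstrations, can slip past safety alignment that was trained only on natural language.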