Deep Learning Weekly: Issue #302
Claude's 100K Context Window, Performance Bottlenecks in Deploying LLMs, YOLO-NAS, a paper on how larger language models do in-context learning differently, and many more!
This week in deep learning, we bring you Meta's publicly released Massively Multilingual Speech project, The A to Z of LLMOps, a Spotify track neural recommender system using GNNs, and a paper on Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold.
You may also enjoy ChatGPT for iOS, Efficiently Scale LLM Training Across a Large GPU Cluster with Alpa and Ray, Instruction-tuning Stable Diffusion with InstructPix2Pix, a paper on Tree of Thoughts: Deliberate Problem Solving with Large Language Models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Meta publicly shares the Massively Multilingual Speech (MMS) project, which supports speech-to-text and text-to-speech for 1,107 languages and language identification for over 4,000 languages.
A machine learning model – developed by scientists from MIT and Adobe Research – can identify all the pixels in an image that represent a given material.
OpenAI launched the free-to-use, Whisper-enabled ChatGPT app for iOS.
A study published last month suggests that natural and artificial networks learn in similar ways, at least when it comes to language.
Stability AI announced StableStudio, the open-source release of their premiere text-to-image consumer application DreamStudio.
Anthropic, a San Francisco-based AI startup and rival to OpenAI, announced that it has raised $450 million in Series C funding led by Spark Capital.
An article that explores how to use Comet to visually compare and evaluate object detection models from TorchVision.
This post presents how two open-source frameworks, Alpa.ai and Ray.io, work together to achieve the scale required to train a 175 billion-parameter JAX transformer model with pipeline parallelism.
An article that explores the key aspects of LLMOps, illustrating the importance of each component in driving LLM success.
A comprehensive post that shows you how to train GNNs and build a cutting-edge playlist-track recommender system.
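To give a flavor of the message passing at the heart of such recommenders, here is a minimal sketch of one graph-convolution step in NumPy. The tiny bipartite playlist-track graph and the function name `gcn_layer` are illustrative assumptions, not code from the post:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy bipartite graph: nodes 0-1 are playlists, nodes 2-3 are tracks
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
H = np.eye(4)                               # one-hot input features
W = np.random.default_rng(0).normal(size=(4, 2))
H1 = gcn_layer(A, H, W)                     # 2-dim embedding per node
```

Each node's new embedding mixes its own features with its neighbors', so playlists and the tracks they contain end up close in embedding space after a few such layers.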
Eugene Yan addresses some questions and tries to provide intuition on Attention and other parts of the Transformer architecture.
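For readers who want the core mechanic in code, here is a minimal NumPy sketch of scaled dot-product attention — the standard formula, not Eugene Yan's own implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # query-key similarity
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights              # weighted mix of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 queries, dim 4
K = rng.normal(size=(5, 4))  # 5 keys
V = rng.normal(size=(5, 4))  # 5 values
out, w = scaled_dot_product_attention(Q, K, V)
```

The `1/sqrt(d_k)` scaling keeps the dot products from saturating the softmax as the head dimension grows — one of the points of intuition the post walks through.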
This post explores instruction-tuning to teach Stable Diffusion to follow instructions to translate or process input images.
A guide on how to fine-tune large language models (LLMs) on a custom dataset, using a nanoGPT-based implementation of GPT-NeoX.
This article presents an in-depth solution and code sample for language identification using Intel® Extension for PyTorch.
Libraries & Code
A guidance language for controlling large language models.
A library that makes it very easy for ML engineers to run dev environments, pipelines and apps cost-effectively on any cloud.
Dump all your files and thoughts into your GenerativeAI Second Brain and chat with it.
Papers & Publications
Synthesizing visual content that meets users' needs often requires flexible and precise controllability of the pose, shape, expression, and layout of the generated objects. Existing approaches gain controllability of generative adversarial networks (GANs) via manually annotated training data or a prior 3D model, which often lack flexibility, precision, and generality. In this work, we study a powerful yet much less explored way of controlling GANs, that is, to "drag" any points of the image to precisely reach target points in a user-interactive manner. To achieve this, we propose DragGAN, which consists of two main components: 1) a feature-based motion supervision that drives the handle point to move towards the target position, and 2) a new point tracking approach that leverages the discriminative GAN features to keep localizing the position of the handle points. Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc. As these manipulations are performed on the learned generative image manifold of a GAN, they tend to produce realistic outputs even for challenging scenarios such as hallucinating occluded content and deforming shapes that consistently follow the object's rigidity. Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking. We also showcase the manipulation of real images through GAN inversion.
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.
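The abstract's core loop — propose candidate "thoughts", self-evaluate them, and prune to the most promising before expanding further — can be illustrated with a toy breadth-limited search. This is a deliberately simplified sketch of the idea with a hypothetical digit-sum task standing in for Game of 24, not the paper's implementation (which uses an LM for both proposing and evaluating):

```python
def tree_of_thoughts(initial, propose, evaluate, is_solution,
                     beam_width=3, max_depth=4):
    """Breadth-limited search over partial solutions ("thoughts")."""
    frontier = [initial]
    for _ in range(max_depth):
        # expand: generate candidate next thoughts from every frontier state
        candidates = [t for s in frontier for t in propose(s)]
        if not candidates:
            break
        # self-evaluate and prune: keep only the most promising thoughts
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
        for s in frontier:
            if is_solution(s):
                return s
    return None

# Toy task: pick digits 1-9 whose sum is exactly 24.
target = 24
propose = lambda s: [s + [d] for d in range(1, 10)]
evaluate = lambda s: -abs(target - sum(s))   # closer to target scores higher
is_solution = lambda s: sum(s) == target
solution = tree_of_thoughts([], propose, evaluate, is_solution,
                            beam_width=5, max_depth=6)
```

The key contrast with chain-of-thought is that many partial paths are kept alive and compared at each step, rather than committing to a single left-to-right continuation.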
Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, such as object parts. In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation. This ability comes from two designs. First, we train the detector on a combination of part-level, object-level and image-level data to build the multi-granularity alignment between language and image. Second, we parse the novel object into its parts by its dense semantic correspondence with the base object. These two designs enable the detector to largely benefit from various data sources and foundation models. In open-vocabulary part segmentation experiments, our method outperforms the baseline by 3.3∼7.3 mAP in cross-dataset generalization on PartImageNet, and improves the baseline by 7.3 novel AP50 in cross-category generalization on Pascal Part. Finally, we train a detector that generalizes to a wide range of part segmentation datasets while achieving better performance than dataset-specific training.
Thanks for reading Deep Learning Weekly! Subscribe for free to receive new posts and support my work.