Deep Learning Weekly: Issue 334
GitHub Copilot Chat, How to Build a Knowledge Assistant at Scale, Understanding GPU Memory, a paper on Fast Inference of Mixture-of-Experts Language Models with Offloading, and many more!
This week in deep learning, we bring you GitHub Copilot Chat, How to Build a Knowledge Assistant at Scale, Understanding GPU Memory 2: Finding and Removing Reference Cycles, and a paper on Fast Inference of Mixture-of-Experts Language Models with Offloading.
You may also enjoy DeepMind's finding that subtle adversarial image manipulations can influence human perception, Evaluating Prompts: A Developer’s Guide, a paper on TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
GitHub Copilot Chat now generally available for organizations and individuals
GitHub Copilot Chat is now generally available for both Visual Studio Code and Visual Studio, and is included in all GitHub Copilot plans.
OpenAI’s annualized revenue reportedly tops $1.6B
OpenAI’s annualized revenue has reportedly topped $1.6 billion a mere two months after reaching the $1.3 billion mark.
Images altered to trick machine vision can influence humans too - Google DeepMind
New research shows that even subtle changes to digital images, designed to confuse computer vision systems, can also affect human perception.
2023: A year of groundbreaking advances in AI and computing
A year-in-review post that covers some of Google Research's and Google DeepMind’s efforts in 2023.
AI predictions for 2024: What top VCs think
VCs from top firms, including Bain Capital Ventures (BCV) and General Catalyst, offer their outlook on topics such as the future of generative AI, GPU shortages, and AI regulation.
MLOps & LLMOps
Evaluating Prompts: A Developer’s Guide
A comprehensive article that delves into the nuances of prompt engineering, the iterative processes essential for refining prompts, and the challenges that come with them.
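As a rough illustration of the evaluation loop such guides describe, the sketch below scores candidate prompts against a small test set and keeps the best one. Everything here is hypothetical: `call_model` stands in for whatever LLM client you use, and exact-substring matching is only the simplest possible scoring rule.

```python
def call_model(prompt_template: str, question: str) -> str:
    """Hypothetical stand-in: replace with a real LLM API call."""
    return "The answer is 4."


def evaluate_prompt(prompt_template: str, test_cases: list[dict]) -> float:
    """Score a prompt template against (question, expected) pairs."""
    hits = 0
    for case in test_cases:
        answer = call_model(prompt_template, case["question"])
        # Simplest possible rubric: does the expected string appear?
        hits += int(case["expected"].lower() in answer.lower())
    return hits / len(test_cases)


candidates = [
    "Answer concisely: {question}",
    "Think step by step, then answer: {question}",
]
cases = [{"question": "What is 2 + 2?", "expected": "4"}]

scores = {p: evaluate_prompt(p, cases) for p in candidates}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In practice the test set, the scoring rubric (exact match, model-graded, regex), and the candidate prompts all evolve together, which is exactly the iterative process the guide walks through.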
Speculative Decoding for 2x Faster Whisper Inference
A blog post that demonstrates how Speculative Decoding can reduce the inference time of Whisper by a factor of two.
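For context, speculative (assisted) decoding pairs a small draft model with the full model, which only has to verify the draft's proposed tokens. Below is a minimal sketch, assuming a recent Hugging Face transformers release with assisted generation for Whisper and a Distil-Whisper checkpoint as the draft model; the one-second silent clip merely stands in for real audio.

```python
import numpy as np
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

main_id = "openai/whisper-large-v2"
draft_id = "distil-whisper/distil-large-v2"

processor = AutoProcessor.from_pretrained(main_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(main_id)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained(draft_id)

# One second of silence stands in for a real recording.
audio = np.zeros(16_000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# Passing assistant_model enables speculative decoding: the small draft
# model proposes tokens and the large model verifies them in parallel.
ids = model.generate(inputs.input_features, assistant_model=assistant)
print(processor.batch_decode(ids, skip_special_tokens=True))
```

Because the large model accepts or rejects draft tokens in batches, the output is identical to standard greedy decoding, only faster.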
How to Build a Knowledge Assistant at Scale
An article that describes some of the considerations necessary when developing an enterprise-level knowledge assistant (KA) and introduces a scalable architecture.
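At its core, such an assistant is retrieval-augmented generation: embed the documents, retrieve the most similar ones at query time, and assemble a grounded prompt. The sketch below is a toy illustration of that retrieval core, not the article's architecture; the `embed` stub is a hypothetical placeholder for a real embedding model.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Hypothetical embedding stub: swap in a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity


docs = ["Refund policy: ...", "Shipping times: ...", "Warranty terms: ..."]
index = np.stack([embed(d) for d in docs])  # a real system uses a vector DB


def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)  # dot product of unit vectors = cosine
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]


context = "\n".join(retrieve("How long does shipping take?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```

Scaling this to the enterprise level is precisely where the article's considerations come in: document parsing, index freshness, access control, and routing across many knowledge sources.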
Efficient Vector Similarity Search in Recommender Workflows Using Milvus with NVIDIA Merlin
An introduction to integrating NVIDIA Merlin with Milvus for building recommender systems, along with performance benchmarks across various scenarios.
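As a taste of the Milvus side of the workflow, here is a minimal sketch assuming pymilvus with the embedded Milvus Lite backend; in the Merlin integration, the item vectors would come from the trained retrieval model rather than random data.

```python
import numpy as np
from pymilvus import MilvusClient  # assumes pymilvus with Milvus Lite

client = MilvusClient("items.db")  # local embedded instance for testing
client.create_collection(collection_name="items", dimension=64)

# Toy item embeddings; in the Merlin workflow these come from the
# trained two-tower retrieval model.
vectors = np.random.rand(100, 64).tolist()
client.insert(
    collection_name="items",
    data=[{"id": i, "vector": v} for i, v in enumerate(vectors)],
)

# Query with a user embedding and fetch the top-5 nearest items.
hits = client.search(collection_name="items", data=[vectors[0]], limit=5)
print(hits[0])
```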
A data engineer/consultant shares his experience building a human-in-the-loop computer vision system for counting fish at large hydroelectric dams, and the challenges involved.
Learning
A case for AI alignment being difficult
A blog post by Jessica Taylor that explains her model of AI alignment, why it is difficult, and what paths forward there might be.
LLMs: Exploring Data with YOLOPandas 🐼 and Comet
An article that dives into YOLOPandas, its abilities, and how to incorporate it with Comet.
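For a sense of the workflow, here is a minimal sketch assuming YOLOPandas's documented `.llm` accessor and an LLM API key in the environment:

```python
from yolopandas import pd  # wraps pandas and registers a `.llm` accessor

df = pd.DataFrame(
    {"product": ["notebook", "pen", "desk"], "price": [4.5, 1.2, 120.0]}
)

# YOLOPandas turns the natural-language question into pandas code via an
# LLM and executes it; by default it asks before running generated code.
df.llm.query("Which product is the most expensive?")
```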
Understanding GPU Memory 2: Finding and Removing Reference Cycles
A blog post that explains how to use the Memory Snapshot and the Reference Cycle Detector tools to identify and fix GPU memory leaks caused by reference cycles in PyTorch code.
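The failure mode the post targets is easy to reproduce: an object that holds a CUDA tensor and participates in a reference cycle is not freed when it goes out of scope, only when the cycle collector runs. Below is a minimal sketch, assuming a CUDA device and a recent PyTorch with the Memory Snapshot APIs.

```python
import gc
import torch

assert torch.cuda.is_available()  # this sketch assumes a CUDA device
gc.disable()  # make the demonstration deterministic

torch.cuda.memory._record_memory_history()  # enable the Memory Snapshot tool


class Node:
    """An object caught in a reference cycle while holding a CUDA tensor."""

    def __init__(self):
        self.big = torch.ones(1024, 1024, device="cuda")  # ~4 MB
        self.self_ref = self  # the cycle: refcount never reaches zero


Node()  # no live reference remains, but the cycle keeps the tensor alive
print(torch.cuda.memory_allocated())  # memory is still held

gc.collect()  # only the cycle collector can break the cycle
print(torch.cuda.memory_allocated())  # now freed

# Dump a snapshot to inspect at pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("snapshot.pickle")
```

Because cycles are freed at the collector's whim rather than deterministically, they inflate peak GPU memory; the post's Reference Cycle Detector helps find and remove them at the source.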
Libraries & Code
OML is a PyTorch-based framework for training and validating models that produce high-quality embeddings.
A tiny library for coding with large language models.
A generalized information-seeking agent system with Large Language Models (LLMs).
Papers & Publications
Fast Inference of Mixture-of-Experts Language Models with Offloading
Abstract:
With the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies for running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE) - a type of model architecture where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their dense counterparts, but it also increases model size due to having multiple experts. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.
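One of the innate MoE properties the paper exploits is that only the routed experts are needed at each step, so most experts can live off-device. The sketch below is a toy illustration of that caching idea only (an LRU cache of experts shuttled between CPU RAM and a CUDA device), not the paper's algorithm, which also adds speculative expert prefetching and quantization.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


class ExpertCache:
    """Toy LRU cache: a few experts live on the GPU, the rest in CPU RAM."""

    def __init__(self, experts: list[nn.Module], gpu_slots: int = 2):
        self.experts = experts      # all experts start offloaded on CPU
        self.gpu = OrderedDict()    # expert index -> module resident on GPU
        self.gpu_slots = gpu_slots

    def fetch(self, idx: int) -> nn.Module:
        if idx in self.gpu:                  # cache hit: mark recently used
            self.gpu.move_to_end(idx)
            return self.gpu[idx]
        if len(self.gpu) >= self.gpu_slots:  # evict least-recently-used
            _, evicted = self.gpu.popitem(last=False)
            evicted.to("cpu")
        self.gpu[idx] = self.experts[idx].to("cuda")
        return self.gpu[idx]


# Assumes a CUDA device is available.
experts = [nn.Linear(16, 16) for _ in range(8)]
cache = ExpertCache(experts)
x = torch.randn(1, 16, device="cuda")
y = cache.fetch(3)(x)  # only the routed expert is resident on the GPU
```

Since consecutive tokens often route to overlapping experts, even a small GPU-resident cache absorbs much of the transfer cost.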
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
Abstract:
In the era of advanced multimodal learning, multimodal large language models (MLLMs) such as GPT-4V have made remarkable strides towards bridging language and visual elements. However, their closed-source nature and considerable computational demands present notable challenges for universal usage and modification. This is where open-source MLLMs like LLaVA and MiniGPT-4 come in, presenting groundbreaking achievements across tasks. Despite these accomplishments, computational efficiency remains an unresolved issue, as these models, like LLaVA-v1.5-13B, require substantial resources. Addressing these issues, we introduce TinyGPT-V, a new-wave model marrying impressive performance with commonplace computational capacity. It stands out by requiring merely a 24G GPU for training and an 8G GPU or CPU for inference. Built upon Phi-2, TinyGPT-V couples an effective language backbone with pre-trained vision modules from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a unique quantisation process, making it suitable for local deployment and inference on various 8G devices. Our work fosters further developments for designing cost-effective, efficient, and high-performing MLLMs, expanding their applicability in a broad array of real-world scenarios. Furthermore, this paper proposes a new paradigm of multimodal large language models built on small backbones.
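To make the architecture concrete, the sketch below shows the generic pattern of coupling a frozen vision encoder to a small language backbone through a trainable projection. It illustrates the pattern, not TinyGPT-V's actual code; the 2560 hidden size matches Phi-2, while the other dimensions are arbitrary.

```python
import torch
import torch.nn as nn


class TinyVLM(nn.Module):
    """Generic sketch: frozen vision features projected into an LM's
    embedding space. Illustrates the coupling, not TinyGPT-V's code."""

    def __init__(self, vision_dim: int = 768, lm_dim: int = 2560):
        super().__init__()
        # The projection is the main trainable piece; the vision encoder
        # and language backbone stay frozen.
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor):
        # image_feats: (B, N_patches, vision_dim) from a frozen BLIP-2/CLIP encoder
        # text_embeds: (B, T, lm_dim) from the frozen language backbone
        visual_tokens = self.proj(image_feats)
        return torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the LM


model = TinyVLM()
fused = model(torch.randn(1, 32, 768), torch.randn(1, 8, 2560))
print(fused.shape)  # torch.Size([1, 40, 2560])
```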
AnyText: Multilingual Visual Text Generation And Editing
Abstract:
Diffusion-based text-to-image generation has achieved impressive results recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it can still give the show away when one focuses on the text areas of a generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyphs, position, and the masked image to generate latent features for text generation or editing. The latter employs an OCR model to encode stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text-image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on the AnyWord-3M dataset, we propose AnyText-benchmark for evaluating the accuracy and quality of visual text generation.
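The training objective the abstract describes can be written schematically as a weighted sum of a standard denoising loss and a text perceptual term over OCR features. The sketch below is only that schematic, not the paper's implementation; all inputs and the weight `lam` are hypothetical placeholders.

```python
import torch


def total_loss(
    noise_pred: torch.Tensor,     # model's noise prediction
    noise: torch.Tensor,          # ground-truth diffusion noise
    ocr_feats_pred: torch.Tensor, # OCR features of the rendered text region
    ocr_feats_gt: torch.Tensor,   # OCR features of the target text
    lam: float = 0.01,            # hypothetical weighting, not from the paper
) -> torch.Tensor:
    # Standard epsilon-prediction MSE (the text-control diffusion loss
    # additionally conditions on glyph/position latents).
    diffusion_loss = torch.mean((noise_pred - noise) ** 2)
    # Perceptual term: penalize mismatch in OCR feature space.
    perceptual_loss = torch.mean((ocr_feats_pred - ocr_feats_gt) ** 2)
    return diffusion_loss + lam * perceptual_loss
```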