Deep Learning Weekly: Issue 374
Geoffrey Hinton and John Hopfield win Nobel Prize in Physics, Building a Legal AI Agent using Azure AI Search and CrewAI, A Visual Guide to Mixture of Experts (MoE), and many more!
This week in deep learning, we bring you Geoffrey Hinton and John Hopfield win Nobel Prize in Physics, Building a Legal AI Agent using Azure AI Search, Azure OpenAI, LlamaIndex, and CrewAI, A Visual Guide to Mixture of Experts (MoE), and a paper on Depth Pro: Sharp Monocular Metric Depth in Less Than a Second.
You may also enjoy Meta Movie Gen, Supervised Fine Tuning for Gemini, a paper on "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Geoffrey Hinton and John Hopfield win Nobel Prize in Physics for their work in foundational AI
The Royal Swedish Academy of Sciences has announced that Geoffrey Hinton and John Hopfield are jointly sharing the Nobel Prize in Physics for their work on artificial neural networks.
How Meta Movie Gen could usher in a new AI-enabled era for content creators
Meta released Meta Movie Gen, its breakthrough generative AI research for media, spanning image, video, and audio modalities.
Canvas is a new way to write and code with ChatGPT
OpenAI introduced canvas, a new interface for working with ChatGPT on writing and coding projects that go beyond simple chat.
AI coding startup Poolside raises $500M from eBay, Nvidia, and others
Poolside, the AI-powered software development platform, has raised half a billion dollars in new capital.
Introducing Inflection for Enterprise
Inflection AI introduced Inflection for Enterprise, powered by their industry-first, enterprise-grade AI system, Inflection 3.0.
Foxconn to build Taiwan's fastest AI supercomputer with Nvidia Blackwell
Nvidia and Foxconn are building Taiwan’s largest supercomputer using Nvidia Blackwell chips.
Microsoft announced the public preview of GPT-4o-Realtime-Preview for audio and speech, a major enhancement that adds advanced voice capabilities and expands GPT-4o’s multimodal offerings.
MLOps & LLMOps
Building a Legal AI Agent using Azure AI Search, Azure OpenAI, LlamaIndex, and CrewAI
A detailed blog post showcasing the development of a Legal AI Agent using Azure AI Search, Azure OpenAI, LlamaIndex, and CrewAI to streamline legal compliance, comparing a single-agent RAG chat approach with a multi-agent approach.
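For readers unfamiliar with CrewAI, here is a minimal, self-contained sketch of the multi-agent pattern the post builds on. The agent roles, goals, and task descriptions are illustrative assumptions, and the Azure AI Search / LlamaIndex retrieval tools are omitted for brevity; this is not the post's exact setup.

```python
# Minimal CrewAI sketch of a two-agent legal workflow (roles and tasks are illustrative).
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Legal Researcher",
    goal="Retrieve the clauses relevant to a compliance question",
    backstory="Searches an indexed corpus of contracts and regulations.",
)
reviewer = Agent(
    role="Compliance Reviewer",
    goal="Assess whether the retrieved clauses satisfy the regulation",
    backstory="Writes a short compliance assessment with citations.",
)

research = Task(
    description="Find clauses related to data-retention obligations.",
    expected_output="A list of relevant clauses with document references.",
    agent=researcher,
)
review = Task(
    description="Assess compliance based on the retrieved clauses.",
    expected_output="A brief compliance summary.",
    agent=reviewer,
)

crew = Crew(agents=[researcher, reviewer], tasks=[research, review])
print(crew.kickoff())  # runs the tasks in order and returns the final output
```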
AI-Enabled eCommerce in TypeScript
An instructive blog post about how to build an AI-enabled eCommerce search web application using Nuxt.js, Weaviate, and Cohere to improve the search experience.
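The post builds its search in TypeScript with Nuxt.js; as a rough analogue, this is what the underlying semantic-search call looks like with Weaviate's Python client. The `Product` class, its fields, and the local endpoint are assumptions for illustration, and the schema is assumed to use the text2vec-cohere vectorizer.

```python
# Rough Python-client analogue of the semantic product search described in the post.
import weaviate

client = weaviate.Client("http://localhost:8080")  # assumes a local Weaviate instance

results = (
    client.query
    .get("Product", ["name", "description", "price"])            # hypothetical class/fields
    .with_near_text({"concepts": ["waterproof hiking boots"]})   # Cohere embeds the query text
    .with_limit(5)
    .do()
)

for item in results["data"]["Get"]["Product"]:
    print(item["name"], item["price"])
```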
Learning
A Visual Guide to Mixture of Experts (MoE)
A visual guide exploring the concept of Mixture of Experts (MoE) and its application in large language models.
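As a quick refresher on the idea the guide illustrates, here is a toy PyTorch sketch of an MoE layer with top-k routing: a small gating network scores the experts, each token is sent to its top-k experts, and their outputs are combined with the renormalized gate weights. The dimensions and expert design are arbitrary and not taken from the guide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy Mixture of Experts layer: a router picks the top-k experts per token."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)      # top-k experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(4, 64)
print(MoELayer(64, 256)(x).shape)  # torch.Size([4, 64])
```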
Supervised Fine Tuning for Gemini
A blog post on when to use supervised fine-tuning for Gemini, explaining its benefits, applications, and comparison to other methods like prompt engineering and RAG.
Comparing Open-source and Proprietary LLMs in Medical AI
A comprehensive article comparing the performance of open-source and proprietary LLMs in medical AI, using popular benchmark datasets like MedQA, NEJM-QA, MMLU, and MMLU-Pro.
An explainer about machine unlearning for LLMs, highlighting the challenges, methods, and benchmarks for removing the influence of unwanted data from trained LLMs.
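One baseline that the unlearning literature commonly discusses is gradient ascent on the "forget" set, which pushes the model away from data it should no longer reflect. The sketch below illustrates that idea only; the Hugging Face-style causal LM interface and batch contents are placeholders, not the explainer's specific recipe.

```python
# Minimal sketch of gradient-ascent unlearning on a forget-set batch.
# Assumes a Hugging Face-style causal LM where model(..., labels=...) returns a .loss.
import torch

def unlearn_step(model, batch, optimizer):
    """One gradient-ascent step: negate the LM loss so the optimizer increases it."""
    outputs = model(**batch, labels=batch["input_ids"])
    loss = -outputs.loss            # ascent on the forget set
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return outputs.loss.item()      # report the (positive) LM loss on the forget set
```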
Multimodal and reasoning LLMs supersize training data for dexterous robotic tasks
A blog post describing GenSim2, a framework developed by MIT researchers that uses multimodal and reasoning LLMs to generate simulated training data for robots.
Libraries & Code
PyTorch library for custom data types & optimizations. Quantize and sparsify weights, gradients, optimizers, & activations for inference and training.
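The description matches PyTorch's torchao library; assuming that is the project linked here, a minimal sketch of weight-only int8 quantization with its quantize_ API would look roughly like this (the model is a placeholder).

```python
# Sketch of weight-only int8 quantization, assuming the library is PyTorch's torchao.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).eval()

quantize_(model, int8_weight_only())   # swap Linear weights for int8 quantized tensors in place

with torch.inference_mode():
    out = model(torch.randn(1, 1024))
print(out.shape)
```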
A multimodal agent framework for solving complex tasks.
Papers & Publications
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Abstract:
We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions.
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Abstract:
The misuse of large language models (LLMs) has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Abstract:
Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, 11-22% on LLaMA-3 on QA tasks on average. We only utilize 10.4-18.6% of the quantization algorithm execution time, resulting in a 1.6-1.8× increase in inference throughput compared to SOTA.
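For intuition about the lookup-table idea the abstract mentions (this is not the VPTQ algorithm itself), here is a toy vector-quantization sketch: groups of weights are replaced by indices into a small codebook, and the matrix is reconstructed by table lookup. The codebook here is just sampled rather than optimized.

```python
# Toy illustration of vector quantization with a codebook; not the VPTQ algorithm.
import torch

def vector_quantize(weight: torch.Tensor, codebook_size: int = 256, vec_dim: int = 8):
    """Map each length-`vec_dim` slice of the weight matrix to its nearest codebook vector."""
    vectors = weight.reshape(-1, vec_dim)                       # group weights into vectors
    # Toy codebook: sample existing vectors as centroids (real methods optimize the codebook).
    codebook = vectors[torch.randperm(vectors.shape[0])[:codebook_size]].clone()
    dists = torch.cdist(vectors, codebook)                      # distance to every centroid
    indices = dists.argmin(dim=1)                               # only these indices are stored
    dequantized = codebook[indices].reshape(weight.shape)       # lookup-table reconstruction
    return indices, codebook, dequantized

w = torch.randn(512, 512)
idx, cb, w_hat = vector_quantize(w)
print(idx.shape, cb.shape, (w - w_hat).abs().mean())
```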