Deep Learning Weekly: Issue 378
Introducing ChatGPT search, 39 Lessons on Building ML Systems, Scaling, Execution, and More, a paper on Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws, and many more!
This week in deep learning, we bring you Introducing ChatGPT search, 39 Lessons on Building ML Systems, Scaling, Execution, and More, and a paper on Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws.
You may also enjoy Advancing embodied AI through progress in touch perception, dexterity, and human-robot interaction, Understanding Multimodal LLMs, a paper on MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
OpenAI introduced ChatGPT search, which allows users to get fast, timely answers with links to relevant web sources.
Advancing embodied AI through progress in touch perception, dexterity, and human-robot interaction
Meta FAIR publicly released several new research artifacts that advance robotics and support the goal of reaching advanced machine intelligence (AMI).
Bret Taylor's customer service AI startup just raised $175M
Sierra, the AI startup co-founded by OpenAI chairman Bret Taylor, has raised $175 million in a funding round that values the startup at $4.5 billion.
Patronus AI debuts API for equipping AI workloads with reliability guardrails
Patronus AI introduced a new tool designed to help developers ensure that their AI applications generate accurate output.
Israeli AI security startup Noma launches with $32M to secure the 'Data and AI Lifecycle'
Noma Security launched and announced that it has raised $32 million in funding to enhance its end-to-end AI security platform and grow its customer base.
MLOps & LLMOps
An article that introduces you to the concept of agentic RAG, its implementation, and its benefits and limitations.
39 Lessons on Building ML Systems, Scaling, Execution, and More
An article containing lessons from a series of ML conferences, offering insights into building effective ML systems and navigating the industry landscape.
RFP Response Generation Workflow (with Human-in-the-Loop)
A notebook that shows you how to build a LlamaCloud workflow for generating responses to RFPs.
Deploying LLMs with TorchServe + vLLM
A PyTorch blog post about deploying LLMs using TorchServe and vLLM.
Design Haystack AI Applications Visually with deepset Studio & NVIDIA NIMs
A practical blog post about how to visually design Haystack AI applications using deepset Studio and NVIDIA NeMo.
Learning
LLM Evaluations: A Complete Course
A course on mastering LLM evaluation for real-world applications using state-of-the-art tools and metrics like LLM-as-a-judge and production LLM monitoring.
Gen-AI Safety Landscape: A Guide to the Mitigation Stack for Text-to-Image Models
A detailed article about safety measures to reduce risks of AI models generating harmful or biased outputs, especially focusing on text-to-image models.
Text to Knowledge Graph Made Easy with Graph Maker
A comprehensive article explaining how to build knowledge graphs from text using open-source LLMs.
Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet \ Anthropic
A blog post about how Anthropic achieved state-of-the-art results on the SWE-bench coding benchmark using Claude 3.5 Sonnet.
Understanding Multimodal LLMs - by Sebastian Raschka, PhD
An in-depth article discussing the capabilities and development of Multimodal LLMs along with an overview of recent models.
Jailbreaking LLM-Controlled Robots
A blog post about jailbreaking LLM-controlled robots, discussing the potential for physical harm caused by these.
Libraries & Code
Educational framework exploring ergonomic, lightweight multi-agent orchestration. Managed by OpenAI Solution team.
A tool that parses UI screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the interface.
A modern, lightweight, and effective open-source application security testing framework—engineered by humans and primed for AI.
Papers & Publications
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Abstract:
The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems. The autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack of duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g. phone), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility.
Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws
Abstract:
LLMs produce harmful and undesirable behavior when trained on poisoned datasets that contain a small fraction of corrupted or harmful data. We develop a new attack paradigm, jailbreak-tuning, that combines data poisoning with jailbreaking to fully bypass state-of-the-art safeguards and make models like GPT-4o comply with nearly any harmful request. Our experiments suggest this attack represents a paradigm shift in vulnerability elicitation, producing differences in refusal rates as much as 60+ percentage points compared to normal fine-tuning. Given this demonstration of how data poisoning vulnerabilities persist and can be amplified, we investigate whether these risks will likely increase as models scale. We evaluate three threat models - malicious fine-tuning, imperfect data curation, and intentional data contamination - across 23 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability..
OmniGen: Unified Image Generation
Abstract:
In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGenis characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks by transforming them into image generation tasks, such as edge detection and human pose recognition. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly compared to existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of chain-of-thought mechanisms. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues.