Deep Learning Weekly: Issue 347
Apple's ReALM, Visualize your RAG Data, A Builder's Guide to Evals for LLM-based Applications, a paper on VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild, and many more!
This week in deep learning, we bring you Apple's ReALM, Visualize your RAG Data — Evaluate your Retrieval-Augmented Generation System with Ragas, A Builder's Guide to Evals for LLM-based Applications, and a paper on VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild.
You may also enjoy AI21's Groundbreaking SSM-Transformer Model, Deep Dive into Vector Databases by Hand, Understanding the Sparse Mixture of Experts (SMoE) Layer in Mixtral, a paper on Long-form factuality in large language models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Apple researchers develop AI that can 'see' and understand screen context
Apple researchers have developed ReALM, a new AI system that can understand ambiguous references to on-screen entities using LLMs.
Introducing Jamba: AI21's Groundbreaking SSM-Transformer Model
AI21 Labs announced Jamba, the world’s first production-grade Mamba model with a 256k context window.
AI startups Scale AI and Cohere reportedly in talks to raise hundreds of millions
Scale AI and Cohere are seeking to raise hundreds of millions of dollars from investors, according to two reports.
Microsoft’s new safety system can catch hallucinations in its customers’ AI apps
The Azure AI Studio tools can screen for malicious prompt attacks as well as ‘unsupported’ responses, aka hallucinations.
Hailo takes on Nvidia with energy-efficient gen AI accelerator for edge devices and $120M in funding
Hailo introduces a new energy-efficient generative AI accelerator for edge devices and announces an additional $120 million in funding.
Read AI raises $21M to unify communications across meetings, emails and chats
Read AI, a company focused on content summarization, announced a $21 million early-stage round of funding.
MLOps & LLMOps
How to Handle a Million Vector Embeddings in RAG Applications
An article that explores using PGVector to manage large embedding collections in RAG applications.
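To make the setup concrete, here is a minimal sketch of storing and querying embeddings with PGVector from Python. The table name, embedding dimension, and psycopg2 usage are illustrative assumptions, not the article's exact code:

```python
# Minimal PGVector sketch. Assumptions: psycopg2 installed, Postgres has the
# pgvector extension available; table/column names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=rag user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)  -- dimension must match your embedding model
    );
""")

# Insert one document chunk with its embedding (placeholder vector here).
embedding = [0.0] * 1536
vec_literal = "[" + ",".join(map(str, embedding)) + "]"
cur.execute(
    "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
    ("example chunk", vec_literal),
)

# Retrieve the 5 nearest chunks by L2 distance (`<->` is pgvector's operator).
cur.execute(
    "SELECT content FROM chunks ORDER BY embedding <-> %s::vector LIMIT 5",
    (vec_literal,),
)
print([row[0] for row in cur.fetchall()])
conn.commit()
```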
Visualize your RAG Data — Evaluate your Retrieval-Augmented Generation System with Ragas
A technical post on using UMAP dimensionality reduction to visualize multiple evaluation questions and their relationships to source documents, built with Ragas, OpenAI, LangChain, and ChromaDB.
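As a rough illustration of the technique, the following sketch embeds a handful of documents and questions, projects them to 2D with UMAP, and plots them. The embedding model name and toy texts are assumptions, not the post's exact setup:

```python
# Sketch: project question and document embeddings into 2D with UMAP.
# Assumptions: `openai` and `umap-learn` installed; model name and toy
# texts are illustrative.
import numpy as np
import umap
import matplotlib.pyplot as plt
from openai import OpenAI

client = OpenAI()
docs = ["Doc about vector search.", "Doc about evaluation metrics."]
questions = ["How do I evaluate retrieval?", "What is a vector index?"]

resp = client.embeddings.create(model="text-embedding-3-small",
                                input=docs + questions)
X = np.array([d.embedding for d in resp.data])

coords = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

n = len(docs)
plt.scatter(coords[:n, 0], coords[:n, 1], label="documents")
plt.scatter(coords[n:, 0], coords[n:, 1], label="questions", marker="x")
plt.legend()
plt.show()
```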
Deep Dive into Vector Databases by Hand
A visual blog post highlighting the inner workings of vector databases.
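The operation every vector database ultimately accelerates is nearest-neighbor search over embeddings; a by-hand, brute-force version (purely illustrative) fits in a few lines:

```python
# Brute-force cosine-similarity search: the core operation a vector database
# speeds up with indexes such as HNSW or IVF. Purely illustrative.
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    # Normalize so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = V @ q
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar rows

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 64))  # 1000 stored embeddings
query = rng.normal(size=64)
print(top_k(query, database))
```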
Level up Your RAG Application with Speaker Diarization
An article on leveraging speaker diarization in RAG applications using Haystack and AssemblyAI.
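A minimal sketch of the diarization step with AssemblyAI's Python SDK is below; the audio URL is a placeholder, and the Haystack indexing step is omitted. Prepending speaker labels to each utterance is one simple way to make the transcript RAG-ready:

```python
# Sketch: speaker-labeled transcription with AssemblyAI's Python SDK.
# Assumptions: `assemblyai` installed and an API key configured; the audio
# URL is a placeholder. Indexing into Haystack is left out for brevity.
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(speaker_labels=True)  # enable diarization
transcript = aai.Transcriber().transcribe(
    "https://example.com/meeting.mp3", config=config)

# One document per utterance, tagged with its speaker, ready for indexing.
documents = [f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances]
print(documents[:3])
```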
Learning
Mastering Customer Segmentation with LLM
An article on improving clustering methods with LLMs and other advanced techniques.
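A common starting point for this kind of approach is clustering text embeddings of customer records; the sketch below assumes sentence-transformers and scikit-learn, with an illustrative model name and toy data rather than the article's exact pipeline:

```python
# Sketch: segment customers by clustering text embeddings.
# Assumptions: `sentence-transformers` and `scikit-learn` installed; the
# model name and toy records are illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

records = [
    "Buys premium plans, contacts support rarely.",
    "Monthly churn risk, many refund requests.",
    "New signup, heavy feature usage on free tier.",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(records)
labels = KMeans(n_clusters=2, n_init="auto",
                random_state=0).fit_predict(embeddings)
print(labels)  # cluster id per customer record
```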
A Builder's Guide to Evals for LLM-based Applications
Eugene Yan discusses useful classification, summarization, translation, copyright, and toxicity evaluations for LLM-based applications.
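For the classification case, an eval can be as small as running the model over labeled examples and scoring the predictions; in this sketch, `classify` is a hypothetical stand-in for any LLM call that returns a label:

```python
# Sketch: a tiny classification eval loop. `classify` stands in for any
# LLM call that maps text to a label; the examples are illustrative.
from sklearn.metrics import precision_recall_fscore_support

def classify(text: str) -> str:
    # Placeholder for an LLM call returning "positive" or "negative".
    return "positive" if "love" in text else "negative"

examples = [("I love this product", "positive"),
            ("Terrible experience", "negative"),
            ("Would buy again, love it", "positive")]

preds = [classify(text) for text, _ in examples]
golds = [label for _, label in examples]
p, r, f1, _ = precision_recall_fscore_support(
    golds, preds, average="binary", pos_label="positive")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```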
EU AI Act Regulation Compliance with Comet
On March 13, 2024, the European Parliament passed the EU AI Act to establish a common regulatory and legal framework for AI.
Understanding the Sparse Mixture of Experts (SMoE) Layer in Mixtral
A blog post that explores the findings of the “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” paper and its implementation in Mixtral.
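The core mechanism is compact: a router scores all experts per token, and only the top-k experts actually run. A minimal PyTorch sketch of top-2 gating follows; the dimensions and naming are illustrative, not Mixtral's exact implementation:

```python
# Sketch: a sparse MoE layer with top-2 gating, in the spirit of the
# sparsely-gated MoE paper and Mixtral. Shapes and naming are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                     # x: (tokens, d_model)
        logits = self.router(x)               # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):            # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = SparseMoE()(torch.randn(10, 64))
print(y.shape)  # torch.Size([10, 64])
```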
Evaluating LLM Responses to Moral Scenarios
A blog post on evaluating LLM responses to understand the moral beliefs encoded in them, especially in ambiguous scenarios.
Libraries & Code
A Python library that allows formulating many tensor operations as concise expressions using Einstein notation.
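The description matches einops-style notation; assuming the einops package, expressions like these replace chains of reshape, transpose, and pooling calls:

```python
# Sketch of Einstein-notation tensor expressions, assuming the einops
# package (which matches this description). Axes are named in the patterns.
import numpy as np
from einops import rearrange, reduce

images = np.random.rand(8, 3, 32, 32)  # (batch, channels, height, width)

# Flatten spatial dims and move channels last: (8, 1024, 3)
tokens = rearrange(images, "b c h w -> b (h w) c")

# Global average pool over height and width: (8, 3)
pooled = reduce(images, "b c h w -> b c", "mean")

print(tokens.shape, pooled.shape)
```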
An LLM Agent Operating System.
Mini-Gemini supports a series of dense and MoE LLMs from 2B to 34B, handling image understanding, reasoning, and generation simultaneously.
Papers & Publications
Long-form factuality in large language models
Abstract:
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall).
Empirically, we demonstrate that LLM agents can achieve superhuman rating performance - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality.
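The aggregated metric the abstract describes can be written down directly: precision is the supported fraction of facts, and recall compares the number of supported facts to a preferred response length K. The sketch below follows that description; the function name and the cap at 1 are our reading of it, not code from the paper:

```python
# Sketch of the extended F1 described above: precision = supported fraction
# of facts, recall = supported facts relative to a preferred length K.
# Naming and the cap at 1 are our interpretation of the abstract.
def f1_at_k(num_supported: int, num_facts: int, k: int) -> float:
    if num_facts == 0 or num_supported == 0:
        return 0.0
    precision = num_supported / num_facts
    recall = min(num_supported / k, 1.0)  # saturates once K facts are supported
    return 2 * precision * recall / (precision + recall)

print(f1_at_k(num_supported=45, num_facts=50, k=64))  # precise but short answer
```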
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation
Abstract:
In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment.
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
Abstract:
We introduce VoiceCraft, a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS, our model outperforms prior SotA models including VALL-E and the popular commercial model XTTS-v2. Crucially, the models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music, and our model performs consistently well compared to other models and real recordings. In particular, for speech editing evaluation, we introduce a high-quality, challenging, and realistic dataset named RealEdit.
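Delayed stacking in general refers to offsetting each codebook's token stream so that all codebooks can be predicted causally within a single flattened sequence; the sketch below illustrates that generic delay pattern, not VoiceCraft's exact rearrangement procedure:

```python
# Illustrative sketch of a delay pattern for multi-codebook codec tokens:
# codebook k is shifted right by k steps so codebooks can be predicted
# causally in one sequence. Generic idea only, not VoiceCraft's exact
# token rearrangement.
import numpy as np

PAD = -1

def delay_stack(codes: np.ndarray) -> np.ndarray:
    # codes: (n_codebooks, n_frames) -> (n_codebooks, n_frames + n_codebooks - 1)
    K, T = codes.shape
    out = np.full((K, T + K - 1), PAD)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # shift codebook k right by k steps
    return out

codes = np.arange(8).reshape(2, 4)  # 2 codebooks, 4 frames
print(delay_stack(codes))
# [[ 0  1  2  3 -1]
#  [-1  4  5  6  7]]
```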