Deep Learning Weekly: Issue 363
Llama 3.1, The LLM Triangle Principles to Architect Reliable AI Apps, How to Make Your RAG Less Distracted?, a paper on SEED-Story: Multimodal Long Story Generation with LLM, and many more!
This week in deep learning, we bring you Introducing Llama 3.1, The LLM Triangle Principles to Architect Reliable AI Apps, How to Make Your RAG Less Distracted?, and a paper on SEED-Story: Multimodal Long Story Generation with Large Language Model.
You may also enjoy GPT-4o mini: advancing cost-efficient intelligence, Building a multi-agent concierge system, a paper on Improving GFlowNets for Text-to-Image Diffusion Alignment, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing Llama 3.1: Our most capable models to date
Meta releases Llama 3.1 405B, the first open-source model to rival top AI models in state-of-the-art capabilities across general knowledge, steerability, math, tool use, and multilingual translation.
Creating and verifying stable AI-controlled systems in a rigorous and flexible way
MIT CSAIL researchers helped design a new technique that can guarantee the stability of robots controlled by neural networks.
OpenAI is reportedly exploring the development of its own AI chips in a strategic move to reduce dependence on scarce and expensive GPUs.
AI method radically speeds predictions of materials’ thermal properties
Researchers create a novel graph neural network approach that could help engineers design more efficient energy-conversion systems and faster microelectronic devices.
GPT-4o mini: advancing cost-efficient intelligence
OpenAI announces GPT-4o mini, their most cost-efficient small model.
The Mistral team releases Mistral NeMo, their new best small model: a state-of-the-art 12B model with a 128k context length, built in collaboration with NVIDIA.
MLOps & LLMOps
The LLM Triangle Principles to Architect Reliable AI Apps
A comprehensive article on software design principles for thoughtfully designing reliable, high-performing LLM applications.
An article that explains the concept and a low-abstraction implementation of using an LLM judge to evaluate another LLM judge.
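For intuition, here is a minimal sketch of that meta-evaluation loop, assuming a generic chat-completion client; the prompts, the `call_llm` placeholder, and the pass/fail protocol are illustrative, not the article's actual code.

```python
# Minimal sketch of "an LLM judge judging an LLM judge" (illustrative only).
# `call_llm` is a placeholder for whatever chat-completion client you use.

JUDGE_PROMPT = (
    "Question: {question}\nAnswer: {answer}\n"
    "Is the answer factually correct? Reply with exactly PASS or FAIL."
)

META_JUDGE_PROMPT = (
    "Question: {question}\nAnswer: {answer}\nVerdict given by another judge: {verdict}\n"
    "Is that verdict reasonable? Reply with exactly AGREE or DISAGREE."
)

def call_llm(prompt: str, model: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text response."""
    raise NotImplementedError

def evaluate_judge(examples: list[dict], judge_model: str, meta_judge_model: str) -> float:
    """Return the fraction of the judge's verdicts that the meta-judge agrees with."""
    agreements = 0
    for ex in examples:
        verdict = call_llm(
            JUDGE_PROMPT.format(question=ex["question"], answer=ex["answer"]),
            model=judge_model,
        )
        meta_verdict = call_llm(
            META_JUDGE_PROMPT.format(
                question=ex["question"], answer=ex["answer"], verdict=verdict
            ),
            model=meta_judge_model,
        )
        agreements += meta_verdict.strip().upper().startswith("AGREE")
    return agreements / len(examples)
```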
Building a multi-agent concierge system
An article that focuses on creating a multi-agent system through a combination of specialized task agents and a “concierge” agent that directs users to the appropriate task-specific agents.
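As a rough sketch of the pattern (not the article's implementation), a concierge agent can be a simple LLM router that dispatches each message to one of several task agents; the agent names, prompts, and `call_llm` placeholder below are hypothetical.

```python
# Illustrative sketch of a "concierge" router plus task-specific agents.
# Agent names, prompts, and the `call_llm` placeholder are hypothetical.
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call."""
    raise NotImplementedError

TASK_AGENTS: dict[str, Callable[[str], str]] = {
    "billing": lambda msg: call_llm(f"You handle billing questions.\nUser: {msg}"),
    "tech_support": lambda msg: call_llm(f"You handle technical issues.\nUser: {msg}"),
    "account": lambda msg: call_llm(f"You handle account changes.\nUser: {msg}"),
}

ROUTER_PROMPT = (
    "You are a concierge. Given the user message, reply with exactly one of: "
    + ", ".join(TASK_AGENTS) + ".\nUser message: {message}"
)

def concierge(message: str) -> str:
    """Route the user to the right task agent, falling back to the concierge itself."""
    choice = call_llm(ROUTER_PROMPT.format(message=message)).strip().lower()
    agent = TASK_AGENTS.get(choice)
    if agent is None:
        return call_llm(f"Politely ask the user to clarify their request: {message}")
    return agent(message)
```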
Learning
Forecasting in the Age of Foundation Models
A blog post that compares the performance of a fine-tuned Lag-Llama against XGBoost.
Reasoning through arguments against taking AI safety seriously - Yoshua Bengio
Yoshua Bengio reflects on the potential catastrophic risks associated with future AI systems, emphasizing the need for vigilance and attention to safety.
How to Make Your RAG Less Distracted?
An article that investigates the optimal number of Q&A pairs (from HotPotQA) needed to fine-tune Mistral-7B-Instruct so that a Retrieval-Augmented Generation (RAG) system becomes less distracted by irrelevant retrieved context.
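For a concrete picture of what such fine-tuning data can look like, here is a minimal sketch that pairs each HotPotQA-style question with its supporting passages plus a few sampled distractors; the field names and prompt format are assumptions, not the article's exact setup.

```python
# Illustrative construction of fine-tuning examples that mix supporting and
# distractor passages (field names and prompt format are assumptions).
import json
import random

def build_example(question: str, answer: str,
                  supporting: list[str], distractors: list[str],
                  n_distractors: int = 3) -> dict:
    """Create one instruction-tuning record with shuffled relevant + irrelevant context."""
    passages = supporting + random.sample(distractors, k=min(n_distractors, len(distractors)))
    random.shuffle(passages)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the passages that are actually relevant.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return {"prompt": prompt, "completion": " " + answer}

if __name__ == "__main__":
    record = build_example(
        question="Which country hosted the 1998 FIFA World Cup final?",
        answer="France",
        supporting=["The 1998 FIFA World Cup final was played at the Stade de France in Saint-Denis."],
        distractors=["The 2006 final was held in Berlin.", "Brazil has won five World Cups."],
    )
    print(json.dumps(record, indent=2))
```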
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
A blog post that highlights some of the FlashAttention-3 optimizations available on Hopper GPUs, such as asynchrony and low-precision computation.
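For context, FlashAttention is typically invoked through the `flash-attn` package roughly as below (FlashAttention-2 style API; the FlashAttention-3 Hopper kernels ship under a separate interface). This is a generic usage sketch, not code from the blog post.

```python
# Generic FlashAttention usage sketch; requires an NVIDIA GPU and the flash-attn wheel.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 4096, 16, 64

# q, k, v must be fp16/bf16 and live on the GPU, shaped (batch, seqlen, nheads, headdim).
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")
v = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")

# Fused attention kernel: never materializes the (seqlen x seqlen) score matrix.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```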
Libraries & Code
An open-source Python library for processing and curating unstructured data at scale.
A modular graph-based Retrieval-Augmented Generation (RAG) system.
Papers & Publications
SEED-Story: Multimodal Long Story Generation with Large Language Model
Abstract:
With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, this task poses significant challenges, as it necessitates the comprehension of the complex interplay between texts and images, and the ability to generate long sequences of coherent, contextually relevant texts and visuals. In this work, we propose SEED-Story, a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories. Our model, built upon the powerful comprehension capability of MLLM, predicts text tokens as well as visual tokens, which are subsequently processed with an adapted visual de-tokenizer to produce images with consistent characters and styles. We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner. Additionally, we present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
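A highly simplified sketch of the interleaved generation loop the abstract describes: the MLLM alternates between emitting text tokens and visual tokens, and the visual tokens are decoded into images by the adapted visual de-tokenizer. All method names are placeholders, the multimodal attention sink that makes long rollouts efficient is omitted, and this is not the authors' implementation.

```python
# Simplified sketch of interleaved text/image story generation (not the authors' code).
# `mllm.*` and `detokenizer.decode` are placeholder interfaces.

def generate_story(mllm, detokenizer, prompt_tokens, max_segments=25):
    """Alternate between emitting a text segment and the visual tokens for its image."""
    story, context = [], list(prompt_tokens)
    for _ in range(max_segments):
        # 1) Autoregressively generate the next narrative text segment.
        text_tokens = mllm.generate_text_segment(context)
        context += text_tokens
        # 2) Generate visual tokens, then decode them into an image with
        #    consistent characters/style via the adapted visual de-tokenizer.
        visual_tokens = mllm.generate_visual_tokens(context)
        context += visual_tokens
        story.append((mllm.detokenize_text(text_tokens), detokenizer.decode(visual_tokens)))
    return story
```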
LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data
Abstract:
The semantic capabilities of language models (LMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems lack high-level abstractions to perform semantic queries at scale. We introduce semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for semantic queries over datasets (e.g., sorting or aggregating records using natural language criteria). Each operator can be implemented and optimized in multiple ways, opening a rich space for execution plans similar to relational operators. We implement our operators and several optimizations for them in LOTUS, an open-source query engine with a Pandas-like API.
We demonstrate LOTUS' effectiveness across a series of real applications, including fact-checking, extreme multi-label classification, and search. We find that LOTUS' programming model is highly expressive, capturing state-of-the-art query pipelines with low development overhead. Specifically, on the FEVER dataset, LOTUS' programs can reproduce FacTool, a recent state-of-the-art fact-checking pipeline, in a few lines of code, and implement a new pipeline that improves accuracy by 9.5%, while offering 7−34× lower execution time. In the extreme multi-label classification task on the BioDEX dataset, LOTUS reproduces state-of-the-art result quality with its join operator, while providing an efficient algorithm that runs 800× faster than a naive join. In the search and ranking application, LOTUS allows a simple composition of operators to achieve 5.9−49.4% higher nDCG@10 than the vanilla retriever and re-ranker, while also providing query efficiency, with 1.67−10× lower execution time than LM-based ranking methods used by prior works.
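To make the programming model concrete, here is a sketch of what semantic operators over a Pandas-like dataframe look like. The operator names (`sem_filter`, `sem_topk`) follow the paper, but the configuration calls and exact signatures are assumptions; consult the LOTUS repository before running this.

```python
# Sketch of LOTUS-style semantic operators over a Pandas-like dataframe.
# The model setup and exact call signatures below are assumptions.
import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o-mini"))  # assumed configuration API

papers = pd.DataFrame({
    "title": ["FlashAttention-3", "SEED-Story", "DAG for diffusion alignment"],
    "abstract": ["...", "...", "..."],
})

# Declarative, natural-language predicates instead of hand-written prompt plumbing:
relevant = papers.sem_filter("{abstract} is about efficient attention on GPUs")
ranked = papers.sem_topk("Which {title} is most relevant to text-to-image generation?", K=2)
```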
Improving GFlowNets for Text-to-Image Diffusion Alignment
Abstract:
Diffusion models, which are trained to match the distribution of the training dataset, have become the de facto approach for generating visual data. In addition, we also want to control generation to fulfill desired properties such as alignment to a text description, which can be specified with a black-box reward function. Prior works fine-tune pretrained diffusion models to achieve this goal through reinforcement learning-based algorithms. Nonetheless, they suffer from issues including slow credit assignment as well as low quality in their generated samples. In this work, we explore techniques that do not directly maximize the reward but rather generate high-reward images with relatively high probability, a natural scenario for the framework of generative flow networks (GFlowNets). To this end, we propose the Diffusion Alignment with GFlowNet (DAG) algorithm to post-train diffusion models with black-box property functions. Extensive experiments on Stable Diffusion and various reward specifications corroborate that our method could effectively align large-scale text-to-image diffusion models with given reward information.
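As a loose intuition for reward-guided post-training (not the paper's DAG objective), the loop below samples images, scores them with a black-box reward, and applies a naive reward-weighted likelihood update; DAG replaces that update with GFlowNet-style training. The model API shown is a placeholder.

```python
# Toy sketch of reward-guided post-training for a diffusion model (illustrative only;
# the paper's DAG algorithm uses GFlowNet objectives, not this naive weighted loss).
import torch

def posttrain_step(model, reward_fn, prompts, optimizer):
    """Sample, score with a black-box reward, and nudge the model toward high-reward images."""
    images, log_probs = model.sample_with_log_probs(prompts)   # placeholder model API
    with torch.no_grad():
        rewards = reward_fn(images, prompts)                   # black-box property function
        weights = torch.softmax(rewards, dim=0)                # emphasize high-reward samples
    loss = -(weights * log_probs).sum()                        # reward-weighted likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```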