Deep Learning Weekly: Issue 353
OpenAI's GPT-4o, Extract Metadata from Queries to Improve Retrieval, Machine Unlearning in 2024, StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, and many more!
This week in deep learning, we bring you OpenAI's GPT-4o, Advanced Retrieval: Extract Metadata from Queries to Improve Retrieval, Machine Unlearning in 2024, and a paper on StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation.
You may also enjoy AlphaFold 3 predicts the structure and interactions of all of life's molecules, The 4 Advanced RAG Algorithms You Must Know to Implement, How to Convert Any Text Into a Graph of Concepts, a paper on DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
OpenAI releases GPT-4o, a faster model that’s free for all ChatGPT users
OpenAI releases GPT-4o, a faster and more capable iteration of GPT-4.
AlphaFold 3 predicts the structure and interactions of all of life's molecules
Google DeepMind introduces AlphaFold 3, a new AI model for drug discovery, which was co-developed with Isomorphic Labs.
ElevenLabs previews music-generating AI model
Voice AI startup ElevenLabs is offering an early look at a new model that turns prompts into songs.
Using ideas from game theory to improve the reliability of language models
A new “consensus game,” developed by MIT CSAIL researchers, elevates AI’s text comprehension and generation skills.
MLOps & LLMOps
Advanced Retrieval: Extract Metadata from Queries to Improve Retrieval
A tutorial on how to use LLMs to extract metadata from queries to use as filters that improve retrieval in RAG applications.
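As a rough sketch of the idea (not the tutorial's exact code), the snippet below uses the OpenAI Python client to turn a free-text query into structured filters; the commented-out `vector_store.search` call is a hypothetical stand-in for whatever retriever you use:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_filters(query: str) -> dict:
    """Ask an LLM to turn a free-text query into structured metadata filters."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract metadata filters from the user's question as JSON "
                    'with optional keys "year" (int) and "topic" (str). '
                    "Omit any key you cannot infer."
                ),
            },
            {"role": "user", "content": query},
        ],
    )
    return json.loads(response.choices[0].message.content)

filters = extract_filters("What did the 2023 papers say about RLHF?")
print(filters)  # e.g. {"year": 2023, "topic": "RLHF"}

# `vector_store.search` is a hypothetical stand-in; most vector DBs accept
# a metadata filter alongside the query for pre-filtered similarity search.
# results = vector_store.search("What did the 2023 papers say about RLHF?",
#                               filter=filters, k=5)
```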
The 4 Advanced RAG Algorithms You Must Know to Implement
An article that highlights the details and architectures of four advanced RAG methods for optimizing the retrieval and post-retrieval stages.
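For a taste of the post-retrieval side, here is a minimal cross-encoder reranking sketch (a commonly used post-retrieval technique, not necessarily one of the article's exact four) using the sentence-transformers library:

```python
from sentence_transformers import CrossEncoder

# Cross-encoder reranking: score each (query, candidate) pair jointly.
# Slower than bi-encoder retrieval, but much more precise as a second pass.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reduce hallucinations in RAG pipelines?"
candidates = [
    "Grounding answers in retrieved passages reduces hallucinations.",
    "The history of expert systems in the 1980s.",
    "Citing sources lets users verify each generated claim.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # highest-scoring passage for the query
```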
Accelerating Llama3 FP8 Inference with Triton Kernels
A blog post that covers how to design an optimized FP8 inference kernel in Triton and tune it for Llama3-70B.
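For a flavor of what Triton code looks like, here is a minimal elementwise FP8-dequantization kernel, far simpler than the tuned matmul kernels in the post; it assumes a recent PyTorch/Triton build with FP8 (e4m3) support:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def dequant_kernel(x_ptr, out_ptr, scale, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized chunk of the tensor.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)           # FP8 (e4m3) values
    y = (x.to(tl.float32) * scale).to(tl.float16)  # upcast, apply per-tensor scale
    tl.store(out_ptr + offs, y, mask=mask)

# FP8 weights with a per-tensor scale: the basic ingredient of FP8 inference.
n = 4096
w_fp8 = torch.randn(n, device="cuda").to(torch.float8_e4m3fn)
out = torch.empty(n, device="cuda", dtype=torch.float16)
grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
dequant_kernel[grid](w_fp8, out, 0.05, n, BLOCK=1024)
```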
Learning
Hallucinations, Errors, and Dreams
An article on why modern AI systems produce false outputs and what can be done about it.
Machine Unlearning in 2024
A gentle introduction to machine unlearning, covering topics such as copyright protection, the NeurIPS machine unlearning challenge, and retrieval-based AI systems.
Phi-3 and the Beginning of Highly Performant iPhone LLMs
A blog post that delves into the findings of the Phi-3 paper and highlights some of the implications of releasing models similar to Phi-3.
How to Convert Any Text Into a Graph of Concepts
A comprehensive article that highlights a method to convert any text corpus into a Knowledge Graph using Mistral 7B.
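A minimal sketch of the general recipe (chunk the text, ask an LLM for related concept pairs as JSON, accumulate edges in a graph); the `llm` callable here is a hypothetical stand-in for a Mistral 7B endpoint:

```python
import json
import networkx as nx

def extract_concept_pairs(chunk: str, llm) -> list[dict]:
    """Ask the LLM for related concept pairs in a text chunk, as JSON."""
    prompt = (
        "List pairs of related concepts in the text below as a JSON array of "
        'objects: {"node_1": ..., "node_2": ..., "edge": <relation>}.\n\n'
        + chunk
    )
    return json.loads(llm(prompt))  # `llm` stands in for a Mistral 7B call

def build_graph(chunks: list[str], llm) -> nx.Graph:
    graph = nx.Graph()
    for chunk in chunks:
        for pair in extract_concept_pairs(chunk, llm):
            a, b = pair["node_1"], pair["node_2"]
            if graph.has_edge(a, b):
                # Pairs that recur across chunks strengthen the edge.
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, relation=pair["edge"], weight=1)
    return graph
```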
Financial Market Applications of LLMs
An article that explores the potential application of LLMs in financial markets, discussing their use in predicting price sequences, multimodal learning, synthetic data creation, and fundamental analysis.
An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin
An article that walks through how to architect and build a real-world LLM system end to end, from data collection to deployment.
Prompt Engineering for Vision Models
A course on prompting different vision models, including Meta's Segment Anything Model (SAM), a universal image segmentation model; OWL-ViT, a zero-shot object detection model; and Stable Diffusion 2.0, a widely used diffusion model.
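Note that "prompting" SAM means supplying spatial hints like points or boxes rather than text. A minimal sketch using the segment_anything package (the checkpoint path assumes you have downloaded the official ViT-B weights; the zeros array stands in for a real image):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (assumes the ViT-B weights are downloaded locally).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)

# The "prompt" for SAM is spatial: a foreground point (label 1) at (x, y).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks
)
print(masks.shape, scores)  # (3, 512, 512) candidate masks with confidences
```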
Libraries & Code
csinva/imodels
Python package for concise, transparent, and accurate predictive modeling.
EdinburghNLP/awesome-hallucination-detection
List of papers on hallucination detection in LLMs.
Papers & Publications
MOMENT: A Family of Open Time-series Foundation Models
Abstract:
We introduce MOMENT, a family of open-source foundation models for general-purpose time-series analysis. Pre-training large models on time-series data is challenging due to (1) the absence of a large and cohesive public time-series repository, and (2) diverse time-series characteristics which make multi-dataset training onerous. Additionally, (3) experimental benchmarks to evaluate these models, especially in scenarios with limited resources, time, and supervision, are still in their nascent stages. To address these challenges, we compile a large and diverse collection of public time-series, called the Time-series Pile, and systematically tackle time-series-specific challenges to unlock large-scale multi-dataset pre-training. Finally, we build on recent work to design a benchmark to evaluate time-series foundation models on diverse tasks and datasets in limited supervision settings. Experiments on this benchmark demonstrate the effectiveness of our pre-trained models with minimal data and task-specific fine-tuning. We also present several interesting empirical observations about large pre-trained time-series models.
StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation
Abstract:
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of content. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope will inspire more research into architectural modifications.
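As a toy illustration of the core mechanism (not the authors' implementation), the PyTorch sketch below appends tokens randomly sampled from the other images in a batch to each image's keys and values before attention, which is what nudges a batch of generations toward a consistent subject:

```python
import torch
import torch.nn.functional as F

def consistent_self_attention(q, k, v, sample_ratio=0.5):
    """Toy version of the idea: each image attends to its own tokens plus
    tokens sampled from the other images in the batch."""
    B, N, d = k.shape
    n_sample = int(N * sample_ratio)
    extra_k, extra_v = [], []
    for i in range(B):
        # Pool tokens from all *other* images, then subsample a fixed count.
        others_k = torch.cat([k[j] for j in range(B) if j != i])
        others_v = torch.cat([v[j] for j in range(B) if j != i])
        idx = torch.randperm(others_k.shape[0])[:n_sample]
        extra_k.append(others_k[idx])
        extra_v.append(others_v[idx])
    k_aug = torch.cat([k, torch.stack(extra_k)], dim=1)  # (B, N + n_sample, d)
    v_aug = torch.cat([v, torch.stack(extra_v)], dim=1)
    return F.scaled_dot_product_attention(q, k_aug, v_aug)

q = k = v = torch.randn(4, 64, 32)  # 4 images, 64 tokens each, dim 32
out = consistent_self_attention(q, k, v)
print(out.shape)  # torch.Size([4, 64, 32])
```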
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
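The KV-cache saving is the easiest piece to picture: instead of caching full keys and values per token, MLA-style compression caches one small latent per token and re-expands it at attention time. Below is a toy sketch of that idea, not DeepSeek's exact architecture (which also handles rotary embeddings and multi-head structure):

```python
import torch
import torch.nn as nn

class ToyLatentKVCache(nn.Module):
    """Toy illustration of latent KV compression: cache one d_latent vector
    per token instead of separate d_model keys and values."""
    def __init__(self, d_model=4096, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)   # compress hidden state
        self.up_k = nn.Linear(d_latent, d_model)   # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_model)   # reconstruct values

    def forward(self, h, cache):
        latent = self.down(h)                      # (B, 1, d_latent) per new token
        cache = torch.cat([cache, latent], dim=1)  # only the latent is stored
        return self.up_k(cache), self.up_v(cache), cache

layer = ToyLatentKVCache()
cache = torch.empty(1, 0, 512)
for _ in range(3):  # decode three tokens
    h = torch.randn(1, 1, 4096)
    k, v, cache = layer(h, cache)
print(cache.shape)  # torch.Size([1, 3, 512]): 16x smaller than full K+V
```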