Deep Learning Weekly: Issue 342
Mistral AI’s three new LLMs, LLM-Powered API Agent for Task Execution, How to Unit Test Machine Learning Code & Models, Unified Training of Universal Time Series Forecasting Transformers, and more!
This week in deep learning, we bring you Mistral AI challenges OpenAI with three new LLMs, Build an LLM-Powered API Agent for Task Execution, How to Unit Test Machine Learning Code & Models, and a paper on Unified Training of Universal Time Series Forecasting Transformers.
You may also enjoy AI model Poro sets new milestones for multilingual LLMs in Europe, Evaluation Framework for RAG Pipelines, RLHF in 2024 with DPO & Hugging Face, a paper on LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Mistral AI challenges OpenAI with three new LLMs
Mistral AI introduced three large language models and a chatbot service designed to rival OpenAI’s ChatGPT.
Jasper acquires Stability AI’s Clipdrop to strengthen marketing copilot
Jasper, the San Francisco-based startup known for its generative AI-driven marketing copilot, announced its acquisition of Stability AI’s Clipdrop.
AI model Poro sets new milestones for multilingual LLMs in Europe
Helsinki-based Silo AI has completed the training of the Poro model — a new milestone in its mission to create large language models (LLMs) for low-resource languages.
New AI model could streamline operations in a robotic warehouse
A group of MIT researchers who use AI to mitigate traffic congestion applied ideas from that domain to the problem of coordinating multiple robots in a warehouse setting.
Intenseye, which uses AI computer vision to enhance workplace safety, raises $61M
Intenseye has closed the largest-ever funding round in its category, raising $61 million in an investment led by Lightspeed Venture Partners.
MLOps & LLMOps
Build an LLM-Powered API Agent for Task Execution
A post that covers the basics of how to build an LLM-powered API execution agent.
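If you just want the shape of the pattern, here is a minimal, hypothetical sketch: the model is shown a catalog of callable endpoints, asked to emit a structured call as JSON, and the chosen call is executed. The `call_llm` helper and the catalog are illustrative placeholders, not code from the post.

```python
import json
import requests

# Hypothetical catalog of callable APIs the agent may choose from.
API_CATALOG = {
    "get_weather": {
        "url": "https://api.example.com/weather",
        "params": ["city"],
        "description": "Return current weather for a city.",
    },
}

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; swap in your provider's SDK."""
    raise NotImplementedError

def run_api_agent(user_request: str) -> dict:
    # Ask the model to pick an endpoint and its arguments as JSON.
    prompt = (
        "You can call these APIs:\n"
        f"{json.dumps(API_CATALOG, indent=2)}\n"
        f"User request: {user_request}\n"
        'Reply with JSON: {"api": <name>, "args": {<param>: <value>}}'
    )
    plan = json.loads(call_llm(prompt))

    # Execute the chosen call against the catalog entry.
    endpoint = API_CATALOG[plan["api"]]
    response = requests.get(endpoint["url"], params=plan["args"], timeout=10)
    return response.json()
```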
How to Build an Advanced AI-Powered Enterprise Content Pipeline Using Mixtral 8x7B and Qdrant
An article that explores the essential components and strategies for building an advanced AI-powered enterprise content pipeline using Mixtral 8x7B and Qdrant.
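As a rough sketch of the retrieval side of such a pipeline (not the article's actual code), the snippet below embeds document chunks with an assumed sentence-transformers encoder and stores them in a Qdrant collection; the retrieved chunks would then be passed to Mixtral 8x7B as context.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Assumed embedding model (384-dim); the article may use a different encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(":memory:")  # swap for a hosted Qdrant URL in production

client.recreate_collection(
    collection_name="enterprise_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def ingest(chunks: list[str]) -> None:
    """Embed text chunks and upsert them as points with the raw text as payload."""
    vectors = encoder.encode(chunks).tolist()
    points = [
        PointStruct(id=i, vector=vec, payload={"text": chunk})
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]
    client.upsert(collection_name="enterprise_docs", points=points)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k most similar chunks; feed these to Mixtral 8x7B as context."""
    hits = client.search(
        collection_name="enterprise_docs",
        query_vector=encoder.encode(query).tolist(),
        limit=k,
    )
    return [hit.payload["text"] for hit in hits]
```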
Top 5 Web Scraping Methods: Including Using LLMs
A blog post comparing five methods for automating the extraction of data from websites, ranging from conventional programming tools to LLM-assisted approaches.
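For orientation, here is a hedged sketch contrasting the classic selector-based approach with an LLM-assisted one; the `call_llm` helper is a placeholder for whichever completion API you use.

```python
import requests
from bs4 import BeautifulSoup

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; swap in your provider's SDK."""
    raise NotImplementedError

def scrape_titles(url: str) -> list[str]:
    """Classic approach: fetch the page and extract elements with CSS selectors."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.select("h2")]

def scrape_with_llm(url: str, question: str) -> str:
    """LLM-assisted approach: reduce the page to plain text and let a model
    pull out the requested fields instead of hand-writing selectors."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    prompt = f"Extract {question} from the following page text:\n{text[:8000]}"
    return call_llm(prompt)
```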
Learning
How to Unit Test Machine Learning Code & Models
Eugene Yan discusses the challenges of unit testing machine learning code and models, how it differs from testing conventional software, and practical guidelines for doing it well.
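To make the idea concrete, below is a small pytest-style sketch of two checks commonly recommended for ML code: an output-shape test and a single-batch overfitting test. The model here is a stand-in, not code from the article.

```python
import torch
from torch import nn

def make_model() -> nn.Module:
    # Stand-in model; replace with the model under test.
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

def test_output_shape():
    """The model should map a batch of inputs to one logit vector per example."""
    model = make_model()
    x = torch.randn(8, 10)
    assert model(x).shape == (8, 2)

def test_can_overfit_one_batch():
    """A healthy model and training loop should drive loss near zero on a tiny batch."""
    torch.manual_seed(0)
    model = make_model()
    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(300):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    assert loss.item() < 0.1
```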
RLHF in 2024 with DPO & Hugging Face
A blog post that walks you through how to use DPO to improve open LLMs using Hugging Face TRL, Transformers & datasets.
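The sketch below shows the rough shape of a DPO run with TRL's `DPOTrainer` on a tiny preference dataset. The base model is a placeholder, and exact argument names (for example, whether `beta` is passed to the trainer or to a `DPOConfig`, and `tokenizer` vs. `processing_class`) vary across TRL versions, so treat it as a template rather than a drop-in script.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed base model for illustration
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# DPO expects preference pairs: a prompt, a preferred answer, and a rejected one.
train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["The capital of France is Paris."],
    "rejected": ["France does not have a capital."],
})

trainer = DPOTrainer(
    model,
    ref_model=None,  # TRL builds a frozen reference copy of the policy when omitted
    args=TrainingArguments(
        output_dir="dpo-out",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=5e-6,
    ),
    beta=0.1,  # strength of the implicit KL penalty; newer TRL moves this into DPOConfig
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```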
An article that explains some of the common pitfalls that can cause machine learning models to fail in real-world scenarios, and offers suggestions on how to avoid them.
Audio Generation with Mamba using Determined AI
A tutorial on how to use Mamba, a fast and scalable sequence model, to generate audio using Determined AI, a deep learning platform.
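As a rough illustration (not the tutorial's code), the sketch below stacks Mamba blocks from the `mamba-ssm` package into a small next-token model over 8-bit audio tokens; the Determined AI experiment wrapper is omitted and all sizes are placeholder choices.

```python
import torch
from torch import nn
from mamba_ssm import Mamba  # assumed: the official mamba-ssm package (requires CUDA)

class MambaAudioLM(nn.Module):
    """Tiny next-token model over 8-bit (e.g. mu-law) audio tokens. A real run
    would wrap training in a Determined PyTorchTrial; that boilerplate is omitted."""

    def __init__(self, vocab_size: int = 256, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2) for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)      # (batch, length, d_model)
        for layer in self.layers:
            x = x + layer(x)        # residual connection around each Mamba block
        return self.head(x)         # logits over the next audio token

# Teacher-forced loss on a dummy batch of audio tokens.
model = MambaAudioLM().cuda()
tokens = torch.randint(0, 256, (2, 1024), device="cuda")
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), tokens[:, 1:].reshape(-1))
```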
Libraries & Code
A guidance language for controlling large language models.
Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines.
Auto Prompt is a prompt optimization framework designed to enhance and perfect your prompts for real-world use cases.
Papers & Publications
Unified Training of Universal Time Series Forecasting Transformers
Abstract:
Deep learning for time series forecasting has traditionally operated within a one-model-per-dataset framework, limiting its potential to leverage the game-changing impact of large pre-trained models. The concept of universal forecasting, emerging from pre-training on a vast collection of time series datasets, envisions a single Large Time Series Model capable of addressing diverse downstream forecasting tasks. However, constructing such a model poses unique challenges specific to time series data: i) cross-frequency learning, ii) accommodating an arbitrary number of variates for multivariate time series, and iii) addressing the varying distributional properties inherent in large-scale data. To address these challenges, we present novel enhancements to the conventional time series Transformer architecture, resulting in our proposed Masked Encoder-based Universal Time Series Forecasting Transformer (Moirai). Trained on our newly introduced Large-scale Open Time Series Archive (LOTSA) featuring over 27B observations across nine domains, Moirai achieves competitive or superior performance as a zero-shot forecaster when compared to full-shot models. Code, model weights, and data will be released.
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Abstract:
A large context window is a desirable feature in large language models (LLMs). However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE, which, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with at most 1k fine-tuning steps at training lengths within 256k, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformities in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k-length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k lengths to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.
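To see what "non-uniform positional interpolation" means mechanically, here is a conceptual sketch of rotary embeddings with per-dimension rescale factors; the factors below are illustrative placeholders, not the values LongRoPE's search would find.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0,
                rescale: torch.Tensor | None = None) -> torch.Tensor:
    """Rotary-embedding angles with optional per-dimension rescale factors.
    Standard RoPE uses no rescaling; positional-interpolation methods shrink the
    effective position per frequency, and LongRoPE searches for non-uniform
    factors rather than applying a single global scale."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    if rescale is not None:
        inv_freq = inv_freq / rescale                   # per-dimension interpolation
    return torch.outer(positions.float(), inv_freq)     # (num_positions, dim / 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x with shape (seq_len, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: stretch toward an 8x longer context by scaling low-frequency dimensions
# more aggressively than high-frequency ones (illustrative non-uniform factors).
dim, seq_len = 64, 16
rescale = torch.linspace(1.0, 8.0, dim // 2)
q = torch.randn(seq_len, dim)
q_rot = apply_rope(q, rope_angles(torch.arange(seq_len), dim, rescale=rescale))
```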
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
Abstract:
Today's deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. This paper delves into the important issues of data loss when data is transmitted through deep networks, namely the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, we design a new lightweight network architecture, the Generalized Efficient Layer Aggregation Network (GELAN), based on gradient path planning. GELAN's architecture confirms that PGI achieves superior results on lightweight models. We verified the proposed GELAN and PGI on MS COCO object detection. The results show that GELAN uses only conventional convolution operators yet achieves better parameter utilization than state-of-the-art methods based on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large, to obtain complete information, so that models trained from scratch can achieve better results than state-of-the-art models pre-trained on large datasets; the comparison results are shown in Figure 1 of the paper.