Deep Learning Weekly: Issue 339
Google's foundation model for time-series forecasting, Top Evaluation Metrics for RAG Failures, Patch Time Series Transformer, a paper on Specialized Language Models with Cheap Inference, and more!
This week in deep learning, we bring you Google's decoder-only foundation model for time-series forecasting, Top Evaluation Metrics for RAG Failures, Patch Time Series Transformer in Hugging Face, and a paper on Specialized Language Models with Cheap Inference from Limited Domain Data.
You may also enjoy Adept Fuyu-Heavy: A new multimodal model, AI2's Open Language Model: OLMo, Code LoRA from Scratch, a paper on WARM: On the Benefits of Weight Averaged Reward Models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
A decoder-only foundation model for time-series forecasting
Google introduced TimesFM, a single forecasting model pre-trained on a large time-series corpus of 100 billion real-world time points.
Adept Fuyu-Heavy: A new multimodal model
Adept introduced Adept Fuyu-Heavy, a new multimodal model designed specifically for digital agents.
UK invests £100M in AI research and regulation, £45M in quantum
The UK government is investing over £100M in AI R&D and regulation, of which £90M will support the launch of nine research hubs across the country.
LLaVA-1.6: Improved reasoning, OCR, and world knowledge
Liu et al. released LLaVA-1.6, which improves the reasoning, OCR, and world knowledge of an end-to-end trained large multimodal model that connects a vision encoder and an LLM.
A new way to discover places with generative AI in Maps
Google introduced a new way to discover places by adding generative AI to Maps.
MLOps & LLMOps
Top Evaluation Metrics for RAG Failures
An article that goes through the best workflows for troubleshooting poor retrieval and response metrics.
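As a concrete example of a retrieval-side metric (a minimal sketch, independent of any particular evaluation library; the function name and inputs are illustrative), precision@k and recall@k over labeled relevant chunks can be computed as:

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Compute precision@k and recall@k for a single query.

    retrieved_ids: ranked list of retrieved chunk IDs.
    relevant_ids:  set of chunk IDs labeled relevant for the query.
    """
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: 2 of the top-3 retrieved chunks are labeled relevant.
print(precision_recall_at_k(["c1", "c7", "c3", "c9"], {"c1", "c3", "c4"}, k=3))
# -> (0.667, 0.667)
```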
Decoding the Significance of LLM Chains in LLMOps
An article that explains the basics of LLMs and LLM chains, and specifically what chaining is.
Building LLM Platforms for Your Organisation
An article that highlights several considerations that are essential for any enterprise-level deployment of LLMs – particularly Knowledge Assistants (KAs).
Learning
Code LoRA from Scratch
A comprehensive article that explains how LoRA works by coding it from scratch.
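To illustrate the core idea (a minimal sketch, not the article's code), a LoRA-wrapped linear layer in PyTorch freezes the pretrained weight and trains only two small low-rank matrices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights

        self.scale = alpha / r
        # A projects down to rank r, B projects back up; B starts at zero so
        # the wrapped layer initially behaves exactly like the base layer.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Only the LoRA parameters receive gradients.
layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(4, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 2 * 8 * 768
```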
Accelerating Triton Dequantization Kernels for GPTQ
The PyTorch team showcases a step-by-step process undertaken to accelerate the current Triton GPTQ kernels by 3x (core GPTQ) and 6x (AutoGPTQ).
Patch Time Series Transformer in Hugging Face
An overview, along with a technical demonstration of how to get started with Patch Time Series Transformers using Hugging Face.
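As a rough starting point (a sketch based on the PatchTST classes in the transformers library; the configuration values below are illustrative assumptions, not the post's exact demo), a forecasting model can be instantiated and called like this:

```python
import torch
from transformers import PatchTSTConfig, PatchTSTForPrediction

# Assumed setup: 7 input channels, 512-step context, 96-step forecast horizon.
config = PatchTSTConfig(
    num_input_channels=7,
    context_length=512,
    prediction_length=96,
    patch_length=16,
    patch_stride=16,
)
model = PatchTSTForPrediction(config)

# past_values: (batch, context_length, num_input_channels)
past_values = torch.randn(2, 512, 7)
with torch.no_grad():
    outputs = model(past_values=past_values)

print(outputs.prediction_outputs.shape)  # expected: (2, 96, 7)
```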
Libraries & Code
OLMo
Modeling, training, eval, and inference code for OLMo.
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
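For a sense of the frontend DSL, a small sketch (assuming an SGLang runtime is already serving a model locally; the API shown follows the project's early releases and may differ by version):

```python
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question):
    # The state object `s` accumulates the prompt; gen() marks what the model fills in.
    s += sgl.system("You are a concise assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Assumes an SGLang server launched separately, e.g. on localhost:30000.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_qa.run(question="What does structured generation mean?")
print(state["answer"])
```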
Hawkeye is a unified deep learning based fine-grained image recognition toolbox built on PyTorch, which is designed for researchers and engineers.
Papers & Publications
WARM: On the Benefits of Weight Averaged Reward Models
Abstract:
Aligning large language models (LLMs) with human preferences through reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then averaging them in the weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
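The mechanism itself is straightforward to sketch: given several reward models fine-tuned from the same pre-trained checkpoint, WARM-style averaging takes a uniform mean of their parameters (a minimal PyTorch sketch, not the paper's code):

```python
import copy
import torch

def average_weights(models):
    """Return a model whose floating-point parameters are the uniform average of the inputs.

    Assumes all models share the same architecture and were fine-tuned from the
    same pre-trained initialization (the linear mode connectivity condition the
    paper relies on).
    """
    averaged = copy.deepcopy(models[0])
    avg_state = averaged.state_dict()
    for key, value in avg_state.items():
        if value.is_floating_point():
            avg_state[key] = torch.stack(
                [m.state_dict()[key] for m in models], dim=0
            ).mean(dim=0)
    averaged.load_state_dict(avg_state)
    return averaged
```

A best-of-N or RLHF loop would then score candidates with this single averaged reward model rather than running a full ensemble of forward passes.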
Specialized Language Models with Cheap Inference from Limited Domain Data
Abstract:
Large language models have emerged as a versatile tool but are challenging to apply to tasks lacking large inference budgets and large in-domain training sets. This work formalizes these constraints and distinguishes four important variables: the pretraining budget (for training before the target domain is known), the specialization budget (for training after the target domain is known), the inference budget, and the in-domain training set size. Across these settings, we compare different approaches from the machine learning literature. Limited by inference cost, we find better alternatives to the standard practice of training very large vanilla transformer models. In particular, we show that hyper-networks and mixture of experts have better perplexity for large pretraining budgets, while small models trained on importance sampled datasets are attractive for large specialization budgets.
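As a rough illustration of the importance-sampling idea (a hedged sketch, not the paper's method; domain_logprob and general_logprob are hypothetical scoring functions), documents from a generic corpus can be resampled toward the target domain by the log-likelihood ratio between a small domain model and a general model:

```python
import math
import random

def importance_sample(corpus, domain_logprob, general_logprob, k, temperature=1.0):
    """Pick k documents from a generic corpus, weighted toward the target domain.

    domain_logprob / general_logprob are assumed callables returning a
    per-document log-likelihood under a small domain-specific and a general
    proxy model, respectively.
    """
    weights = [
        math.exp((domain_logprob(doc) - general_logprob(doc)) / temperature)
        for doc in corpus
    ]
    return random.choices(corpus, weights=weights, k=k)
```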
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Abstract:
In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility.
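To make the three retrieval modes concrete, a rough usage sketch with the FlagEmbedding package follows (argument names and output keys are assumptions about that package and may differ across versions):

```python
from FlagEmbedding import BGEM3FlagModel

# Assumed package and model names; the public model card is BAAI/bge-m3.
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = ["What is BGE M3?", "BM25 is a sparse lexical retrieval baseline."]
output = model.encode(
    sentences,
    return_dense=True,         # one vector per text (dense retrieval)
    return_sparse=True,        # per-token lexical weights (sparse retrieval)
    return_colbert_vecs=True,  # per-token vectors (multi-vector retrieval)
)

print(output["dense_vecs"].shape)                      # e.g. (2, 1024)
print(list(output["lexical_weights"][0].items())[:3])  # top lexical weights of the first text
```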